<a id="toc"></a>

# <u>Table of Contents</u>

1.) [Setup](#setup)  
&nbsp;&nbsp;&nbsp;&nbsp; 1.1.) [Imports](#imports)   
&nbsp;&nbsp;&nbsp;&nbsp; 1.2.) [Helpers](#helpers)   
&nbsp;&nbsp;&nbsp;&nbsp; 1.3.) [Load data](#load)   
2.) [Datetime](#datetime)  
3.) [Speakers](#speakers)  
4.) [Transcript](#transcript)  

---
<a id="setup"></a>

# [^](#toc) <u>Setup</u>

<a id="imports"></a>

### [^](#toc) Imports

In [1]:
### Standard imports
import pandas as pd
import numpy as np

### Plotting imports
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

### Suppresses warnings that occasionally show up during imports
import warnings
warnings.filterwarnings('ignore')

### Converts the string representation of a list (or set) back into the object
import ast

<a id="helpers"></a>

### [^](#toc) Helpers

In [2]:
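def string_literal(x):
    """Parse the string form of a Python literal (list, set, ...) back into an object."""
    # Missing values arrive as float NaN; leave any non-string value untouched
    if not isinstance(x, str):
        return x
    try:
        return ast.literal_eval(x)
    except (ValueError, SyntaxError, TypeError):
        # Ordinary text (e.g. a story summary) is not a literal; keep it as-is
        return x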
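
As a quick illustration of what the helper relies on, `ast.literal_eval` turns the string form of a list or set back into the object and raises on ordinary text. This is a minimal sketch; the sample strings below are made up and only shaped like cells read back from the CSV.

```python
import ast

# Hypothetical sample values, shaped like cells read back from a CSV
list_text = "[['Judy Woodruff', ['Good evening.']]]"
set_text = "{'Judy Woodruff', 'Roberta Jacobson'}"
plain_text = "In our news wrap Monday"

parsed_list = ast.literal_eval(list_text)  # nested list
parsed_set = ast.literal_eval(set_text)    # set of speaker names

# Ordinary prose is not a Python literal, so literal_eval raises
try:
    ast.literal_eval(plain_text)
    parse_failed = False
except (ValueError, SyntaxError):
    parse_failed = True
```

That failure case is why the helper falls back to returning the original string.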

<a id="load"></a>

### [^](#toc) Load data

In [5]:
df = pd.read_csv("data/PBS_full_unedited.csv")
for col in ["Transcript", "Story", "Speakers"]:
    df[col] = df[col].map(string_literal)
    
df.head()

| Unnamed: 0 | URL | Story | Date | Title | Transcript | Speakers | Number of Comments |
|---|---|---|---|---|---|---|---|
| 0 | https://www.pbs.org/newshour/show/news-wrap-tr... | In our news wrap Monday, President Trump's sea... | Jul 2, 2018 6:50 PM EDT | News Wrap: Trump interviews Supreme Court cand... | [[Judy Woodruff, [ President Trump’s search fo... | {Judy Woodruff, Man (through translator), Just... | 0.0 |
| 1 | https://www.pbs.org/newshour/show/elected-in-a... | Mexican president-elect Andrés Manuel López Ob... | Jul 2, 2018 6:45 PM EDT | Elected by a landslide, can Mexico’s López Obr... | [[Judy Woodruff, [ After two previous runs for... | {Rafael Riveros (through translator), Marcos F... | 0.0 |
| 2 | https://www.pbs.org/newshour/show/will-u-s-mex... | There are enormous expectations facing the new... | Jul 2, 2018 6:43 PM EDT | Will U.S.-Mexico policy tensions change under ... | [[Judy Woodruff, [ And now perspective from fo... | {Judy Woodruff, Roberta Jacobson} | 0.0 |
| 3 | https://www.pbs.org/newshour/show/yemens-spira... | One of the poorest countries in the Middle Eas... | Jul 2, 2018 6:40 PM EDT | Yemen’s spiraling hunger crisis is a man-made ... | [[Judy Woodruff, [ The “NewsHour” has reported... | {Dhabia Kharfoush (through translator), Naimi ... | 0.0 |
| 4 | https://www.pbs.org/newshour/show/livingwhileb... | A profusion of national incidents in which whi... | Jul 2, 2018 6:35 PM EDT | #LivingWhileBlack: How does racial bias lead t... | [[Judy Woodruff, [ A number of recent incident... | {Judy Woodruff, Woman, Yamiche Alcindor, Derri... | 0.0 |


---
<a id="datetime"></a>

# [^](#toc) <u>Datetime</u>

First, let's strip leading and trailing whitespace (including `\n` characters) from the dates.

In [None]:
df["Date"] = df["Date"].map(lambda x: x.strip())

### Time Zone

It appears that the time is always posted in EDT; the check below confirms that all 17,617 dates end in `EDT`.

In [63]:
df["temp"] = df.Date.map(lambda x: x[-3:])
vc = df["temp"].value_counts()

df.drop("temp", axis=1, inplace=True)
vc

EDT    17617
Name: temp, dtype: int64

### Convert to Datetime

See this Stack Exchange [question](https://english.stackexchange.com/questions/35315/what-is-the-proper-name-for-am-and-pm?newreg=2d443a2ca9dc4ba6abbe1a1e01e4af4b) on the proper name for AM and PM, haha. I was honored by this line from the thread:

    software developers think about naming variables properly. It is built into the Object-Oriented mindset. Jader, a commenter, said it well: "It's funny that the question intrinsically is not programming related, but all programmers can understand why you posted it here."
    
I just considered myself a physicist!

In [28]:
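import datetime

# Parse one timestamp: drop the trailing " EDT" and use %I (12-hour clock) so that %p applies
dt = df.Date.iloc[0]
datetime.datetime.strptime(dt[:-4], '%b %d, %Y %I:%M %p')

# Split the same timestamp into its parts
month, day, year, time, period, tz = dt.replace(",", "").split(" ")
month, day, year, time, period, tz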

('Jul', '2', '2018', '6:50', 'PM', 'EDT')
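
To convert the whole column rather than a single value, something like the following might work. This is a sketch under the assumption that every entry follows the `Jul 2, 2018 6:50 PM EDT` pattern seen above. The `sample` series and the `parsed` name are made up for illustration, and the result is left timezone-naive.

```python
import pandas as pd

# Hypothetical stand-in for df["Date"], using values shown in df.head()
sample = pd.Series(["Jul 2, 2018 6:50 PM EDT", "Jul 2, 2018 6:45 PM EDT"])

# Drop the trailing " EDT", then parse; %I is the 12-hour clock that pairs with %p
parsed = pd.to_datetime(sample.str[:-4], format="%b %d, %Y %I:%M %p")
```

On the real data, the same call would take `df["Date"].str[:-4]` in place of `sample.str[:-4]`.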

---
<a id="speakers"></a>

# [^](#toc) <u>Speakers</u>

In [15]:
for row in df.loc[df.Speakers.isnull(), 'Speakers'].index:
    df.at[row, 'Speakers'] = {"Unknown"}

In [18]:
sum(df.Speakers.map(lambda x: 1 if "Judy Woodruff" in x else 0))

1031

In [17]:
len(set.union(*df.Speakers))

32726
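
The membership test used for Judy Woodruff can be generalized to every speaker at once. A minimal sketch, assuming each entry of `Speakers` is a set of names as above; the `speaker_sets` sample is invented for illustration.

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for df.Speakers: one set of names per segment
speaker_sets = pd.Series([
    {"Judy Woodruff", "Roberta Jacobson"},
    {"Judy Woodruff", "Yamiche Alcindor"},
    {"Unknown"},
])

# Count the number of segments in which each speaker appears
appearances = Counter(name for names in speaker_sets for name in names)

# Number of distinct speakers, equivalent to len(set.union(*speaker_sets))
n_unique = len(appearances)
```

Because each segment stores its speakers as a set, a name is counted at most once per segment.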

---
<a id="transcript"></a>

# [^](#toc) <u>Transcript</u>

In [None]:
for row in df.loc[df.Transcript.isnull(), 'Transcript'].index:
    df.at[row, 'Transcript'] = ["Unknown"]