<a id="toc"></a>

# <u>Table of Contents</u>

1.) [Setup](#setup)  
&nbsp;&nbsp;&nbsp;&nbsp; 1.1.) [Imports](#imports)   
&nbsp;&nbsp;&nbsp;&nbsp; 1.2.) [Helpers](#helpers)   
&nbsp;&nbsp;&nbsp;&nbsp; 1.3.) [Load data](#load)   
2.) [Datetime](#datetime)  
3.) [Speakers](#speakers)  
4.) [Transcript](#transcript)  
5.) [Save to CSV](#save)  

---
<a id="setup"></a>

# [^](#toc) <u>Setup</u>

<a id="imports"></a>

### [^](#toc) Imports

In [1]:
### Standard imports
import pandas as pd
import numpy as np

### Plotting imports
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

import re
import datetime

# Helps convert String representation of list into a list
import ast

### Removes warnings that occasionally show in imports
import warnings
warnings.filterwarnings('ignore')

<a id="helpers"></a>

### [^](#toc) Helpers

In [2]:
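def string_literal(x):
    # Parse a string holding a Python literal (list, set, ...).
    # Anything that can't be parsed (plain text, NaN) is returned unchanged.
    try:
        return ast.literal_eval(x)
    except (ValueError, SyntaxError, TypeError):
        return x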
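
As a quick illustration of why the helper wraps `ast.literal_eval` in a `try`, here is a small sketch on made-up sample values (the strings below are invented for the example, not taken from the CSV). Strings that hold a list or set literal parse cleanly, while plain text and `NaN` raise and are passed through unchanged.

```python
import ast

# Invented sample cell values, mimicking what a CSV column might contain
samples = [
    "[['HOST', ['Good evening.']]]",   # string form of a nested list
    "{'HOST', 'GUEST'}",               # string form of a set
    "not a literal",                   # plain text: raises SyntaxError
    float("nan"),                      # missing value: raises ValueError
]

results = []
for value in samples:
    try:
        results.append(ast.literal_eval(value))
    except (ValueError, SyntaxError):
        # Same fallback as the helper: keep the original value
        results.append(value)

print([type(r).__name__ for r in results])
```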

<a id="load"></a>

### [^](#toc) Load data

In [3]:
df = pd.read_csv("../data/PBS_full_unedited.csv")
for col in ["Transcript", "Story", "Speakers"]:
    df[col] = df[col].map(string_literal)
    
print("Shape of df:", df.shape)
df.head()

Shape of df: (17617, 7)


| | URL | Story | Date | Title | Transcript | Speakers | Number of Comments |
|---|---|---|---|---|---|---|---|
| 0 | https://www.pbs.org/newshour/show/news-wrap-tr... | In our news wrap Monday, President Trump's sea... | Jul 2, 2018 6:50 PM EDT | News Wrap: Trump interviews Supreme Court cand... | [[Judy Woodruff, [ President Trump’s search fo... | {President Donald Trump, Man (through translat... | 0.0 |
| 1 | https://www.pbs.org/newshour/show/elected-in-a... | Mexican president-elect Andrés Manuel López Ob... | Jul 2, 2018 6:45 PM EDT | Elected by a landslide, can Mexico’s López Obr... | [[Judy Woodruff, [ After two previous runs for... | {Judy Woodruff, Nick Schifrin, Diana Mercado (... | 0.0 |
| 2 | https://www.pbs.org/newshour/show/will-u-s-mex... | There are enormous expectations facing the new... | Jul 2, 2018 6:43 PM EDT | Will U.S.-Mexico policy tensions change under ... | [[Judy Woodruff, [ And now perspective from fo... | {Roberta Jacobson, Judy Woodruff} | 0.0 |
| 3 | https://www.pbs.org/newshour/show/yemens-spira... | One of the poorest countries in the Middle Eas... | Jul 2, 2018 6:40 PM EDT | Yemen’s spiraling hunger crisis is a man-made ... | [[Judy Woodruff, [ The “NewsHour” has reported... | {Judy Woodruff, Naimi (through translator), Ma... | 0.0 |
| 4 | https://www.pbs.org/newshour/show/livingwhileb... | A profusion of national incidents in which whi... | Jul 2, 2018 6:35 PM EDT | #LivingWhileBlack: How does racial bias lead t... | [[Judy Woodruff, [ A number of recent incident... | {Judy Woodruff, Woman, Derrick Johnson, Howard... | 0.0 |


---
<a id="datetime"></a>

# [^](#toc) <u>Datetime</u>

First, let's strip leading and trailing whitespace, including `\n` characters.

In [4]:
df["Date"] = df["Date"].map(lambda x: x.strip())

### Time Zone

It appears that the time is always posted in EDT.

In [5]:
df["Timezone"] = df.Date.map(lambda x: x[-3:])
df["Timezone"].value_counts()

EDT    17617
Name: Timezone, dtype: int64

### Updated times

Only 4 clips were updated after posting.  That isn't very interesting for my purposes, so I'm going to drop all updated times.

I understand why someone might want this data, but I don't need it.

In [6]:
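# Keep the flag in a local Series: `df.temp = ...` would set an
# attribute on the DataFrame rather than create a column
is_updated = df.Date.map(lambda x: "Updated" in x)
num_updated = int(is_updated.sum())
print(f"{num_updated} clips have had information updated")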

df.Date = df.Date.map(lambda x: x.split("\n")[0])

4 clips have had information updated


### Convert to Datetime

See this Stack Exchange [link](https://english.stackexchange.com/questions/35315/what-is-the-proper-name-for-am-and-pm?newreg=2d443a2ca9dc4ba6abbe1a1e01e4af4b) haha.  I was honored by the line...

    software developers think about naming variables properly. It is built into the Object-Oriented mindset. Jader, a commenter, said it well: "It's funny that the question intrinsically is not programming related, but all programmers can understand why you posted it here."
    
I just considered myself a physicist!

In [7]:
def format_datetime(x):
    # x[:-4] drops the trailing " EDT"; %I (12-hour clock) is needed for %p to apply AM/PM
    return datetime.datetime.strptime(x[:-4], '%b %d, %Y %I:%M %p')

df.Date = df.Date.map(format_datetime)
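
One detail worth checking when parsing 12-hour timestamps: in `strptime`, the `%p` directive only changes the parsed hour when the hour itself is read with `%I`. With `%H`, the AM/PM marker is matched but ignored. The sketch below parses the first timestamp shown in the table above both ways.

```python
import datetime

raw = "Jul 2, 2018 6:50 PM EDT"
stamp = raw[:-4]  # drop the trailing " EDT"

# %H reads the hour as 24-hour time, so "PM" has no effect
with_24h = datetime.datetime.strptime(stamp, "%b %d, %Y %H:%M %p")

# %I reads the hour as 12-hour time, so "PM" shifts it to the evening
with_12h = datetime.datetime.strptime(stamp, "%b %d, %Y %I:%M %p")

print(with_24h.hour, with_12h.hour)  # 6 18
```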

### And we're done with times!

---
<a id="speakers"></a>

# [^](#toc) <u>Speakers</u>

### Fill in missing speakers

In [8]:
for row in df.loc[df.Speakers.isnull(), 'Speakers'].index:
    # set() is an empty set; {} would be an empty dict
    df.at[row, 'Speakers'] = set()

### Overview of names

In [13]:
N = len(set.union(*df.Speakers))

print("Unique names found in df:", N)

Unique names found in df: 31308
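
For readers unfamiliar with the `set.union(*...)` idiom, here is a small sketch on invented speaker sets. It also shows why every entry, and the first one in particular, has to be an actual `set`: calling `set.union` with a `dict` in first position raises a `TypeError`.

```python
# Invented per-clip speaker sets, standing in for the Speakers column
speaker_sets = [
    {"HOST", "GUEST A"},
    set(),                      # a clip with no speakers
    {"GUEST A", "Guest A"},     # same person, different casing
]

# Unpack the sets and merge them into one set of unique names
unique_names = set.union(*speaker_sets)
print(len(unique_names))  # 3

# set.union is a method of set, so a dict as the first argument fails
try:
    set.union({}, {"HOST"})
    dict_first_ok = True
except TypeError:
    dict_first_ok = False
```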


### Example: Obama

There's a lot to do here: I want uniform names.

When I search for Obama, I want all of these variants to resolve to the same speaker: 'SEN. BARACK OBAMA', 'SENATOR BARACK OBAMA', 'PRESIDENT BARACK OBAMA', 'PRESIDENT BARACK OBAMA (singing)', 'BARACK OBAMA', 'BARACK OBAMA (singing)', and 'Barack Obama'.

My plan of attack is to strip the modifiers (titles, party affiliations, and parenthetical notes such as "(singing)") so that only the name remains.

However, removing modifiers only goes so far; sometimes the names are shortened, in which case I need to manually group "Obama" and "PRESIDENT OBAMA" under "BARACK OBAMA".

There also seem to be encoding artifacts like "\xa0" (a non-breaking space) popping up.

In [9]:
{elem for elem in set.union(*df.Speakers) if "obama" in elem.lower()}

{'And I think Obama and Boehner and Reid all know that their legacies are tied together and they reside around one thing',
 'And Obama’s counter will have to be',
 'And President Obama says he will do something few presidents have done',
 'And so that’s what the FTC and the Obama Administration have both called for',
 'Arizona Senator John McCain said today the Obama administration is to blame',
 'BARACK OBAMA',
 'BARACK OBAMA (singing)',
 'Barack Obama',
 'Before a fund-raiser for President Obama',
 'Brookings institution scholar Shadi Hamid credits Mr. Obama for the Iran deal',
 'But the Obama reelection team did wade into the Bain debate with a memo that said',
 'Daniel Benjamin was coordinator for counterterrorism at the State Department during the first term of the Obama administration. He’s now a professor at Dartmouth College. And Joby Warrick is a national security correspondent at The Washington Post. He’s also the author of the book “Black Flags',
 'FIRST LADY MICHELLE OBAMA'

### Long names are actually text

In [11]:
df.Speakers = df.Speakers.map(lambda x: {elem for elem in x if len(elem.split(" ")) < 9})
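
To make the word-count cutoff concrete, here is a sketch that applies the same `< 9` words rule to four of the "speakers" printed in the Obama example above. Note that shorter sentence fragments still slip through, so this filter is only a first pass.

```python
# Four entries copied from the Obama output above
speakers = {
    "BARACK OBAMA",
    "BARACK OBAMA (singing)",
    "And Obama’s counter will have to be",  # 7 words: survives the filter
    "And President Obama says he will do something few presidents have done",  # 12 words
}

MAX_WORDS = 9  # same threshold as the cell above

kept = {name for name in speakers if len(name.split(" ")) < MAX_WORDS}
dropped = speakers - kept
print(sorted(dropped))
```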

### Remove titles, parties, and non-name features

In [14]:
%run preprocessing.py
%run mistaken_names.py

mistaken_names = get_mistaken_names()

def map_speakers(x):
    return {clean_names(elem) for elem in x if clean_names(elem) not in mistaken_names}

df.Speakers = df.Speakers.map(map_speakers)
N = len(set.union(*df.Speakers))
print("New number of unique names:", N)

New number of unique names: 27188
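
`preprocessing.py` isn't shown here, so the following is only a hypothetical sketch of what a name-cleaning step could look like, not the actual `clean_names`. The function name `clean_name_sketch`, the `TITLES` tuple, and the choice to uppercase everything are all assumptions made for illustration.

```python
import re

# Hypothetical title list; the real list lives in preprocessing.py (not shown)
TITLES = ("PRESIDENT", "FIRST LADY", "SENATOR", "SEN.")

def clean_name_sketch(name):
    """Illustrative only: normalize a raw speaker label to a bare uppercase name."""
    # Replace non-breaking spaces, then uppercase so casing variants match
    name = name.replace("\xa0", " ").upper()
    # Drop parenthetical notes such as "(singing)"
    name = re.sub(r"\s*\([^)]*\)", "", name)
    # Strip a leading title if one is present
    for title in TITLES:
        if name.startswith(title + " "):
            name = name[len(title) + 1:]
    # Collapse any leftover runs of whitespace
    return " ".join(name.split())

variants = [
    "SEN. BARACK OBAMA",
    "SENATOR BARACK OBAMA",
    "PRESIDENT BARACK OBAMA",
    "PRESIDENT BARACK OBAMA (singing)",
    "BARACK OBAMA",
    "BARACK OBAMA (singing)",
    "Barack Obama",
]
cleaned = {clean_name_sketch(v) for v in variants}
print(cleaned)  # {'BARACK OBAMA'}
```

Shortened names such as "PRESIDENT OBAMA" would still need a manual mapping, since stripping the title leaves only "OBAMA".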


### Transcript

1.) Remove all speakers that should not be included.  
2.) Go through each Transcript and check if it has a Speaker that is not included.  
3.) If this speaker is the first one talking, print the URL and clip this short.  
4.) If the speaker is not the first talker, print the URL and append the speaker and text to the previous speaker.  
- Check out this [video](https://www.pbs.org/newshour/show/with-3-more-wins-romney-pivots-to-general-election#transcript) and search for "North Carolina is a strong one for them".  You'll see one of these formatting slips.
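
The steps above can be sketched roughly as follows, assuming each transcript is a list of `[speaker, [paragraph, ...]]` pairs as the `Transcript` column suggests. The function name `merge_invalid_speakers` is hypothetical, the sample data is invented, and returning a flag instead of printing the URL is a simplification. Dropping a bogus first speaker is one reading of "clip this short".

```python
def merge_invalid_speakers(transcript, invalid):
    """Fold turns whose 'speaker' is really text into the previous turn.

    transcript: list of [speaker, [paragraph, ...]] pairs
    invalid:    set of speaker labels known to be mis-parsed text
    Returns the repaired transcript and whether anything needed fixing.
    """
    repaired = []
    needs_review = False
    for speaker, paragraphs in transcript:
        if speaker not in invalid:
            repaired.append([speaker, list(paragraphs)])
            continue
        needs_review = True
        if repaired:
            # Not the first talker: append label and text to the previous turn
            repaired[-1][1].extend([speaker] + list(paragraphs))
        # First talker: there is no previous turn, so the entry is dropped
    return repaired, needs_review

# Invented example data
transcript = [
    ["HOST", ["Good evening."]],
    ["North Carolina is a strong one for them", ["and they know it."]],
    ["GUEST", ["Thank you."]],
]
invalid = {"North Carolina is a strong one for them"}

fixed, flagged = merge_invalid_speakers(transcript, invalid)
print(fixed)
```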

---
<a id="transcript"></a>

# [^](#toc) <u>Transcript</u>

In [None]:
def transcripts_with_speaker(df, speaker):
    includes_speaker = df.Speakers.map(lambda x: any([clean_names(elem) == speaker for elem in x]))
    return df[includes_speaker].URL

temp = transcripts_with_speaker(df, "NORTH CAROLINA IS A STRONG ONE FOR THEM")

In [20]:
for row in df.loc[df.Transcript.isnull(), 'Transcript'].index:
    df.at[row, 'Transcript'] = []

In [22]:
df.Transcript

0        [[Judy Woodruff, [ President Trump’s search fo...
1        [[Judy Woodruff, [ After two previous runs for...
2        [[Judy Woodruff, [ And now perspective from fo...
3        [[Judy Woodruff, [ The “NewsHour” has reported...
4        [[Judy Woodruff, [ A number of recent incident...
5        [[Judy Woodruff, [ And now back to the looming...
6        [[Judy Woodruff, [ On our Bookshelf tonight, w...
7        [[HARI SREENIVASEN, [ Trade is one of the area...
8        [[JEFF GREENFIELD, [ It has the look and feel ...
9        [[CAMERON ESPOSITO, [ “And I don’t totally rem...
10       [[LISA DESJARDINS, [ Despite court orders it i...
11       [[IVETTE FELICIANO, [ Success did not come eas...
12       [[Judy Woodruff, [ President Trump now says he...
13       [[Judy Woodruff, [ This week, a federal judge ...
14       [[Judy Woodruff, [ How will the Trump administ...
15       [[Judy Woodruff, [ And now, the aftermath of t...
16       [[Judy Woodruff, [ While the U.S. grapples wit.

---
<a id="save"></a>

# [^](#toc) <u>Save to CSV</u>

Now that we've cleaned up our data a bit, let's save it into a different CSV file.

In [None]:
df.to_csv("data/PBS-newhour-clean.csv")

# Tests

### Remove party regex

In [47]:
%run preprocessing.py

ex1 = "BARBARA MIKULSKI (D-MD.)"
assert remove_party(ex1) == "BARBARA MIKULSKI" 

ex2 = "JERRY BROWN (D-CALIF.)"
assert remove_party(ex2) == "JERRY BROWN" 

ex3 = "CHUCK SCHUMER (DN.Y.)"
assert remove_party(ex3) == "CHUCK SCHUMER" 

ex4 = "JOHN BOEHNER (R-OH.)"
assert remove_party(ex4) == "JOHN BOEHNER" 

ex5 = "JEFF SESSIONS (R-ALA.)"
assert remove_party(ex5) == "JEFF SESSIONS" 

ex6 = "TODD YOUNG (RIND.)"
assert remove_party(ex6) == "TODD YOUNG" 

ex7 = "DONALD TRUMP"
assert remove_party(ex7) == ex7

ex8 = "PRESIDENT BARACK OBAMA (singing)"
assert remove_party(ex8) == ex8
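
The real `remove_party` lives in `preprocessing.py`, which isn't shown here. As a hedged sketch, one regex that would satisfy the tests above is given below; `remove_party_sketch` and `PARTY_TAG` are hypothetical names, and the pattern is an assumption rather than the actual implementation. It treats the hyphen as optional so that slips like "(DN.Y.)" and "(RIND.)" are still caught.

```python
import re

# A trailing parenthetical that starts with a party letter (D, R, or I),
# an optional hyphen, then an uppercase state abbreviation with dots
PARTY_TAG = re.compile(r"\s*\([DRI]-?[A-Z.]+\)\s*$")

def remove_party_sketch(name):
    """Illustrative only: strip a trailing party/state tag from a speaker label."""
    return PARTY_TAG.sub("", name)

examples = {
    "BARBARA MIKULSKI (D-MD.)": "BARBARA MIKULSKI",
    "JERRY BROWN (D-CALIF.)": "JERRY BROWN",
    "CHUCK SCHUMER (DN.Y.)": "CHUCK SCHUMER",
    "JOHN BOEHNER (R-OH.)": "JOHN BOEHNER",
    "JEFF SESSIONS (R-ALA.)": "JEFF SESSIONS",
    "TODD YOUNG (RIND.)": "TODD YOUNG",
    "DONALD TRUMP": "DONALD TRUMP",
    "PRESIDENT BARACK OBAMA (singing)": "PRESIDENT BARACK OBAMA (singing)",
}
outputs = {raw: remove_party_sketch(raw) for raw in examples}
```

One caveat of this pattern: any all-caps parenthetical that happens to start with D, R, or I would also be removed, so it leans on the lowercase convention seen in notes like "(singing)".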