<a id="toc"></a>

# <u>Table of Contents</u>

1.) [Setup](#setup)  
&nbsp;&nbsp;&nbsp;&nbsp; 1.1.) [Imports](#imports)   
&nbsp;&nbsp;&nbsp;&nbsp; 1.2.) [Helpers](#helpers)   
&nbsp;&nbsp;&nbsp;&nbsp; 1.3.) [Load data](#load)   
2.) [Datetime](#datetime)  
3.) [Speakers](#speakers)  
4.) [Transcript](#transcript)  
5.) [Save to CSV](#save)  

---
<a id="setup"></a>

# [^](#toc) <u>Setup</u>

<a id="imports"></a>

### [^](#toc) Imports

In [1]:
### Standard imports
import pandas as pd
import numpy as np

### Plotting imports
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

import datetime

# Helps convert String representation of list into a list
import ast

### Removes warnings that occassionally show in imports
import warnings
warnings.filterwarnings('ignore')

<a id="helpers"></a>

### [^](#toc) Helpers

In [2]:
def string_literal(x):
    if x is np.nan:
        return np.nan
    else:
        try:
            return ast.literal_eval(x)
        except:
            return x

<a id="load"></a>

### [^](#toc) Load data

In [3]:
df = pd.read_csv("data/PBS_full_unedited.csv")
for col in ["Transcript", "Story", "Speakers"]:
    df[col] = df[col].map(string_literal)
    
df.head()

Unnamed: 0,URL,Story,Date,Title,Transcript,Speakers,Number of Comments
0,https://www.pbs.org/newshour/show/news-wrap-tr...,"In our news wrap Monday, President Trump's sea...","Jul 2, 2018 6:50 PM EDT",News Wrap: Trump interviews Supreme Court cand...,"[[Judy Woodruff, [ President Trump’s search fo...","{President Donald Trump, Judy Woodruff, Justin...",0.0
1,https://www.pbs.org/newshour/show/elected-in-a...,Mexican president-elect Andrés Manuel López Ob...,"Jul 2, 2018 6:45 PM EDT","Elected by a landslide, can Mexico’s López Obr...","[[Judy Woodruff, [ After two previous runs for...","{Rafael Riveros (through translator), Alfonso ...",0.0
2,https://www.pbs.org/newshour/show/will-u-s-mex...,There are enormous expectations facing the new...,"Jul 2, 2018 6:43 PM EDT",Will U.S.-Mexico policy tensions change under ...,"[[Judy Woodruff, [ And now perspective from fo...","{Roberta Jacobson, Judy Woodruff}",0.0
3,https://www.pbs.org/newshour/show/yemens-spira...,One of the poorest countries in the Middle Eas...,"Jul 2, 2018 6:40 PM EDT",Yemen’s spiraling hunger crisis is a man-made ...,"[[Judy Woodruff, [ The “NewsHour” has reported...","{Dhabia Kharfoush (through translator), Stephe...",0.0
4,https://www.pbs.org/newshour/show/livingwhileb...,A profusion of national incidents in which whi...,"Jul 2, 2018 6:35 PM EDT",#LivingWhileBlack: How does racial bias lead t...,"[[Judy Woodruff, [ A number of recent incident...","{Yamiche Alcindor, Woman, Derrick Johnson, Jud...",0.0


---
<a id="datetime"></a>

# [^](#toc) <u>Datetime</u>

First let's clear off whitespace and \n characters

In [4]:
df["Date"] = df["Date"].map(lambda x: x.strip())

### Time Zone

It appears, the time is always posted in EDT

In [5]:
df["temp"] = df.Date.map(lambda x: x[-3:])
vc = df["temp"].value_counts()

df.drop("temp", axis=1, inplace=True)
vc

EDT    17617
Name: temp, dtype: int64

### Updated times

There are only 4 videos that were updated.  I don't think this is very interesting so I'm just going to ignore all updated times.

I understand why someone would like this data, but I don't need it.

In [6]:
df.temp = df.Date.map(lambda x: int("Updated" in x))
num_updated = len(df[df.temp == 1])
print(f"{num_updated} clips have had information updated")

df.Date = df.Date.map(lambda x: x.split("\n")[0])

4 clips have had information updated


### Convert to Datetime

See this Stack Exchange [link](https://english.stackexchange.com/questions/35315/what-is-the-proper-name-for-am-and-pm?newreg=2d443a2ca9dc4ba6abbe1a1e01e4af4b) haha.  I was honored by the line...

    software developers think about naming variables properly. It is built into the Object-Oriented mindset. Jader, a commenter, said it well: "It's funny that the question intrinsically is not programming related, but all programmers can understand why you posted it here."
    
I just considered myself a physicist!

In [7]:
def format_datetime(x):
    return datetime.datetime.strptime(x[:-4], '%b %d, %Y %H:%M %p')

df.Date = df.Date.map(format_datetime)

### And we're done with times!

---
<a id="speakers"></a>

# [^](#toc) <u>Speakers</u>

### Fill in missing speakers

In [8]:
for row in df.loc[df.Speakers.isnull(), 'Speakers'].index:
    df.at[row, 'Speakers'] = {"Unknown"}

### Names should be capitalized

In [11]:
df.Speakers = df.Speakers.map(lambda x: {elem.upper() for elem in x})

### Overview of names

In [14]:
{x for x in set.union(*df.Speakers) if "(" in x}

{'BELGASSEM ALI (THROUGH TRANSLATOR)',
 'RONALD VEGA (THROUGH INTERPRETER)',
 'REP. BARBARA LEE (D)',
 'IRO (THROUGH INTERPRETER)',
 'MARIBEL CHINQUIN (THROUGH TRANSLATOR)',
 'MANPREET SINGH (THROUGH TRANSLATOR)',
 'STEVEN HORSFORD (D)',
 'MAJ. GEN. ACRA TIPAROJ (THROUGH TRANSLATOR)',
 'AYATOLLAH MOHAMMAD EMAMI KASHARI(THROUGH INTERPRETER)',
 'SEN. TED CRUZ\xa0(R-TX)',
 'SHEIK ALI DHERE\xa0(THROUGH INTERPRETER)',
 'DANNY BURSTEIN AS TEVYE (SINGING)',
 'ELIZABETH\xa0SHOL ROUT (THROUGH INTERPRETER)',
 'MAJID SALEH (THROUGH TRANSLATOR)',
 'SHEIK DHARGHAM AL-JABOURI\xa0(THROUGH INTERPRETER)',
 'CHO MYOUNG-GYON (THROUGH INTERPRETER)',
 'DAVID BOWERS (D)',
 'BELKIS ARIAS (THROUGH INTERPRETER)',
 'N’CHO DAVID (THROUGH TRANSLATOR)',
 'BYEON OK-SOON (THROUGH TRANSLATOR)',
 'FALAH AL MUTAIRI (THROUGH INTERPRETER)',
 '(NICK',
 'ACTRESS (SINGING)',
 'FADIMATA WALLET (THROUGH TRANSLATOR)',
 'GOV. MARY FALLIN (R-OKLA.)',
 'COL. DEREK HARVEY (RET.)',
 'BERRIN ASLAN (THROUGH TRANSLATOR)',
 'CARLOS MUN

In [54]:
df.Speakers = df.Speakers.map(lambda x: {elem for elem in x if len(elem.split(" ")) < 9})

### Remove titles, parties, and non-name features

In [56]:
def clean_names(name):
    for title in ("FORMER U.S. PRESIDENT", "FMR. PRESIDENT",
                  "FORMER GOV.", "FMR. JUSTICE"
                  "PRESIDENT", "PRIME MINISTER", "MAYOR",
                  "U.S. ATTORNEY GENERAL", "U.S AMBASSADOR",
                  "U.N. AMBASSADOR", "MEDAL OF HONOR RECIPIENT"
                  "LT.", "COLONEL", "COL.", "SGT.", "CAPT.", "GEN.",
                  "ADM.", "VICE ADM.", "COM.", "MAJ."
                  "GOV.", "REP.", "SEN.",
                  "CHAIRMAN", "OFFICER",
                  "REV.", "DR.", "PROF."):
        name = name.replace(title, "")
    
    name = name.replace("\xa0", " ")
    name = name.replace("(D)", "").replace("(R)", "")
    name = name.replace("(RET.)", "")
    
    for qualifier in ("(THROUGH INTERPRETER)", "(THROUGH TRANSLATOR)", "(AT PODIUM)", "(TRANSLATED)"):        
        if name != qualifier:
            name = name.replace(qualifier, "")
    return name.strip()
    

# len(set.union(*df.Speakers)) - len({clean_names(x) for x in set.union(*df.Speakers)})
# {clean_names(x) for x in set.union(*df.Speakers) if "." in clean_names(x)}
# {x for x in set.union(*df.Speakers) if x == "(THROUGH TRANSLATOR)"}
{x for x in set.union(*df.Speakers) if len(x.split(" ")) > 7}

{'A CHINESE SPOKESMAN HAD THIS TO SAY TODAY',
 'A DOCUMENTARY BY HER DAUGHTER TITLED “GERALDINE FERRARO',
 'A FURTHER KEY ELEMENT OF TODAY’S GOOD NEWS',
 'A GUARD MONITORS THE CAMERAS IN REAL TIME',
 'ADDRESSED TO SENATE ARMED SERVICES CHAIR CARL LEVIN',
 'AMONG THE VICTIMS WERE HOTEL GUESTS FROM RUSSIA',
 'AN INTERNATIONAL COMMITTEE OF ARCHITECTS HAS CONFIRMED IT',
 'AND ACCORDING TO THE MOODY’S ECONOMIST MARK ZANDI',
 'AND BRITAIN’S NEWEST PRINCESS NOW HAS A NAME',
 'AND DEMOCRAT HILLARY CLINTON SAID IN A TWEET',
 'AND EXACTLY ONE THOUGHT RUNS THROUGH MY HEAD',
 'AND FOR MORE ON WHY THE FASCINATION ENDURES',
 'AND HE SAID — HE DIDN’T SAY LIKE',
 'AND HE TURNED TO ME AND HE SAID',
 'AND HE WENT — THEY REFUSED TO LEAVE',
 'AND HIS ANSWER WAS TO HIS OWN QUESTION',
 'AND I SAT DOWN AT DEFENSE COUNSEL’S TABLE',
 'AND I THINK WHAT DONALD TRUMP IS SHOWING',
 'AND I WOULD PUT IT IN ONE WORD',
 'AND IF YOU GO TO THE 5-8 CATEGORY',
 'AND IF YOU LOOK AT EARLY CLOSING STATES',
 'AND IN A HITHERT

In [58]:
"SECRETARY OF HEALTH AND HUMAN SERVICES KATHLEEN SEBELIUS" in df.Speakers

False

In [47]:
clean_names('GOV. RICK SCOTT')

'GOV. RICK SCOTT'

In [None]:
'DANIEL SAYS THE SCIENTISTS ON THE TEAM AND THE MILITARY HAVE A SHARED GOAL',
'SO — BUT THE CANDIDATE I THINK WITH THE MOST CONSISTENT ECONOMIC MESSAGE — OR EMPHASIZING ECONOMIC CREDENTIALS CERTAINLY HAS BEEN MITT ROMNEY. AND I WILL BE FASCINATED BY WHAT HE SAYS TONIGHT'
'THEY SPENT A WHOLE DAILY LABORING AT THIS AND HAVE BEEN UNABLE TO CLENCH A DEAL AND RIGHT AT THIS MOMENT',
'THE 79-YEAR OLD EMBATTLED FIFA CHIEF SPOKE SHORTLY AFTER HIS REELECTION'
'GAY MCDOUGALL WAS A MEMBER THE SOUTH AFRICAN ELECTION COMMISSION',
'AND WHETHER THEY CONTRIBUTED TO THE ACCIDENT OR WHATEVER IS REALLY IRRELEVANT TO THE ISSUE AT HAND',
'TO LIBERAL ECONOMIST EILEEN APPELBAUM',
'THEY’RE THE SUBJECT OF HER NEW BOOK',
'THE PRACTICE WAS FILMED 100 KILOMETRES OFF THE COAST OF PERU. (CREDIT',
'NEGOTIATORS AT THE SYRIAN PEACE TALKS AGREED ON ONE THING TODAY',
'AND WE LOOK AT THE FIGHT THAT LIES AHEAD IN RAQQA WITH JOBY WARRICK',
'AND AS NEWSPAPERS MAKE THEIR PICKS AND HIGH-PROFILE POLITICIANS LEND THEIR HELP ON THE CAMPAIGN TRAIL',
'WHAT TO MAKE OF THIS SPORT THAT HAS SUCH A GRIP ON AMERICAN CULTURE? WELL',
'THAT’S NOT A CHILD ABUSER. THAT’S A RECOMMENDATION. THAT’S A VERY SPECIFIC RECOMMENDATION',
'FRENCH AUTHORITIES HAVE DISTRIBUTED THE PHOTO OF SALAH ABDESLAM WITH THE MESSAGE',

---
<a id="transcript"></a>

# [^](#toc) <u>Transcript</u>

In [20]:
for row in df.loc[df.Transcript.isnull(), 'Transcript'].index:
    df.at[row, 'Transcript'] = ["Unknown"]

In [22]:
df.Transcript

0        [[Judy Woodruff, [ President Trump’s search fo...
1        [[Judy Woodruff, [ After two previous runs for...
2        [[Judy Woodruff, [ And now perspective from fo...
3        [[Judy Woodruff, [ The “NewsHour” has reported...
4        [[Judy Woodruff, [ A number of recent incident...
5        [[Judy Woodruff, [ And now back to the looming...
6        [[Judy Woodruff, [ On our Bookshelf tonight, w...
7        [[HARI SREENIVASEN, [ Trade is one of the area...
8        [[JEFF GREENFIELD, [ It has the look and feel ...
9        [[CAMERON ESPOSITO, [ “And I don’t totally rem...
10       [[LISA DESJARDINS, [ Despite court orders it i...
11       [[IVETTE FELICIANO, [ Success did not come eas...
12       [[Judy Woodruff, [ President Trump now says he...
13       [[Judy Woodruff, [ This week, a federal judge ...
14       [[Judy Woodruff, [ How will the Trump administ...
15       [[Judy Woodruff, [ And now, the aftermath of t...
16       [[Judy Woodruff, [ While the U.S. grapples wit.

---
<a id="save"></a>

# [^](#toc) <u>Save to CSV</u>

Now that we've cleaned up our data a bit, let's save it into a different CSV file.

In [None]:
df.to_csv("data/PBS-newhour-clean.csv")