# Capstone Project: Predicting Character based on Dialogue

## Project Goals: 
1) Scrape data from the web for all transcripts

2) Focus on the models - use NLP and GridSearchCV to obtain the best model with the best parameters

3) Be able to accurately predict who would be likely to say a line

# Friends Data Gathering

In [4]:
import requests
from bs4 import BeautifulSoup
import regex
from tqdm import tqdm
import numpy as np

## Obtain a source for the data
A quick Google search led me to the following website, where every episode has been transcribed:
http://livesinabox.com/friends

Each episode was transcribed on a separate web page. Therefore, I started the process by reviewing the html structure of one page. 

## Start with one page and understand HTML text
Use Beautiful Soup to find all p tags in the page's HTML.

Note - most episodes follow the same HTML pattern. A few are irregular; reviewing those one by one let me group them and adjust the code for each group.

In [165]:
url = 'http://livesinabox.com/friends/season2/212toasb.htm'

In [142]:
r = requests.get(url)
soup = BeautifulSoup(r.text,'html.parser')

In [144]:
ps = soup.find_all('p')

### Test by obtaining the characters and lines from one episode and use RegEx

In [146]:
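characters = []
lines = []
for p in ps:
    # A speaker label is a capitalized name followed by a colon, e.g. "Mrs. Burkart:"
    char = regex.findall(r"[A-Z][a-zA-Z. ]+:", p.text)
    # Skip paragraphs without a label, and scene descriptions
    if char and char[0] != "Scene:":
        characters.append(char[0])
        # The spoken line is everything after the first label; str.find matches the
        # label literally, so the "." in names like "Mrs." is not a regex wildcard
        index = p.text.find(char[0]) + len(char[0])
        lines.append(p.text[index:].replace("\n", " "))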

In [126]:
#Length of the characters equals the length of the lines! 
print(len(characters), len(lines))

301 301
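
The labeling step above can be tried on text that does not come from the site. This is a minimal sketch, assuming the standard-library `re` in place of `regex`. The helper name `split_speaker` and the sample strings are made up for illustration, and unlike the cell above it also strips surrounding whitespace from the line.

```python
import re

# Same speaker pattern as in the notebook: a capitalized label ending in a colon
SPEAKER = re.compile(r"[A-Z][a-zA-Z. ]+:")

def split_speaker(text):
    """Return (label, line) for a paragraph of dialogue, or None if there is none."""
    match = SPEAKER.search(text)
    if match is None or match.group() == "Scene:":
        return None
    # Everything after the label is the spoken line
    line = text[match.end():].replace("\n", " ").strip()
    return match.group(), line

print(split_speaker("Mrs. Burkart: Oh, hello\nthere."))
print(split_speaker("[Scene: Central Perk]"))
```

Using the match object's `end()` avoids searching the text a second time for the label.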


## Now that it works on one, continue with the rest
Obtain the urls for each transcript page

In [5]:
all_links_url = 'http://livesinabox.com/friends/scripts.shtml'
r = requests.get(all_links_url)

In [6]:
# Escape the dot so it only matches a literal "." before the extension
url_pattern = regex.compile(r'<li><a href="(season\d+/\d+\w+\.htm)')
all_episodes = regex.findall(url_pattern, r.text)

In [7]:
all_episodes_urls = ['http://livesinabox.com/friends/'+episode for episode in all_episodes]
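
The link extraction can be checked offline against a small piece of HTML. This is a sketch only: the `sample_index` markup and the `101pilot.htm` file name are invented for illustration, the standard-library `re` stands in for `regex`, and `urljoin` is used as one possible alternative to string concatenation.

```python
import re
from urllib.parse import urljoin

BASE = "http://livesinabox.com/friends/"

# Invented stand-in for the index page; the real markup may differ
sample_index = """
<li><a href="season1/101pilot.htm">101</a></li>
<li><a href="season4/423uncut.htm">423</a></li>
<li><a href="about.htm">About</a></li>
"""

# Capture only relative links of the form seasonN/<digits><name>.htm
link_pattern = re.compile(r'<li><a href="(season\d+/\d+\w+\.htm)')

links = [urljoin(BASE, rel) for rel in link_pattern.findall(sample_index)]
for link in links:
    print(link)
```

Because `BASE` ends with a slash, `urljoin` appends the relative path rather than replacing the last path segment.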

Some episodes were split into two parts but have only one transcript, so the link to the second transcript needs to be removed to avoid duplication. This applies to:
Season 4 - ep 23

In [9]:
all_episodes_urls.remove('http://livesinabox.com/friends/season4/423uncut.htm')

In [10]:
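# Season 2, episodes 12-23 use a different HTML layout; set them aside for a separate scrape
s02_irregular_urls = all_episodes_urls[35:46]

In [11]:
# Season 2, episode 9 is irregular as well
s02_irregular_urls.append(all_episodes_urls[32])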

In [13]:
for x in s02_irregular_urls:
    all_episodes_urls.remove(x)

In [15]:
len(all_episodes_urls)

175
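
Setting the irregular pages aside can also be done without calling `remove` in a loop. The following is a sketch with placeholder URLs (the `example.com` addresses are invented), assuming the irregular pages are known up front.

```python
# Placeholder URLs standing in for the episode list
urls = [f"http://example.com/season2/{n}.htm" for n in range(209, 215)]

# Same idea as the slice-plus-append above: a block of pages and one extra
irregular = urls[3:5] + [urls[0]]

# Keep everything that is not flagged; the original order is preserved
irregular_set = set(irregular)
regular = [u for u in urls if u not in irregular_set]

print(len(urls), len(irregular), len(regular))
```

A set lookup keeps the filter fast, and the source list is left unmodified, which makes the cell safe to re-run.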

## Using a for loop, obtain the characters, lines, title of episode, season #, and episode # for each transcript through season 8
Seasons 1-8 are formatted similarly. There are differences in seasons 9 and 10. In addition, there are some irregularly formatted episodes, which will need to be scraped using their specific tags. 

In [16]:
Characters = []
Lines = []
Title = []
Season = []
Episode = []
for url in tqdm(all_episodes_urls):
    r = requests.get(url)
    soup = BeautifulSoup(r.text,'html.parser')
    ps = soup.find_all('p')
    for p in ps:
        char = regex.findall(r"[A-Z][a-zA-Z. ]+:",p.text)
        if char != []:
            if char[0] != "Scene:":
                Characters.append(char[0])
                # str.find matches the label literally ("." in "Mrs." is not a wildcard)
                index = p.text.find(char[0]) + len(char[0])
                line = p.text[index:]
                Lines.append(line.replace("\n"," "))
                Title.append(soup.title.string)
                # Raw strings keep \w and \d from being read as string escapes
                season = regex.findall(r'friends/(\w+\d+)', url)
                Season.append(season)
                ep = regex.findall(r'friends/\w+\d+/(\d+)', url)
                Episode.append(ep)

100%|██████████| 175/175 [01:06<00:00,  2.62it/s]


In [17]:
#Check the len of the lists to ensure they are all the same length
len(Characters), len(Lines), len(Title), len(Season), len(Episode)

(45817, 45817, 45817, 45817, 45817)

## Perform the same procedure for Season 9 episodes
Note - episodes 7, 11, and 15 are irregular and are therefore excluded from the following scrape.

In [24]:
#Need to reformat url pattern for season 9
all_links_url = 'http://livesinabox.com/friends/scripts.shtml'
r = requests.get(all_links_url)
url_pattern = regex.compile(r'<a href="(season9/\d+\w*.\w+)">')
s09_episodes = regex.findall(url_pattern, r.text)

In [25]:
s09_episodes_urls = ['http://livesinabox.com/friends/'+episode for episode in s09_episodes]

In [26]:
#Remove irregular episode 15 (pop from the end first so the earlier indices stay valid)
s09_episodes_urls.pop(14)

'http://livesinabox.com/friends/season9/915mug.htm'

In [27]:
#Remove irregular episode 11
s09_episodes_urls.pop(10)

'http://livesinabox.com/friends/season9/911work.htm'

In [28]:
#Remove irregular episode 7
s09_episodes_urls.pop(6)

'http://livesinabox.com/friends/season9/907song.htm'

#### Note - episodes 23 and 24 are not included because they are in a different format; they need to be scraped separately and added

In [30]:
#Season 9 pages differ slightly (e.g. the title may be missing), so fall back to NaN for the title
for url in tqdm(s09_episodes_urls):
    r = requests.get(url)
    soup = BeautifulSoup(r.text,'html.parser')
    ps = soup.find_all('p')
    for p in ps:
        char = regex.findall(r"[A-Z][a-zA-Z. ]+:",p.text)
        if char != []:
            if char[0] != "Scene:":
                Characters.append(char[0])
                # str.find matches the label literally ("." in "Mrs." is not a wildcard)
                index = p.text.find(char[0]) + len(char[0])
                line = p.text[index:]
                Lines.append(line.replace("\n"," "))
                season = regex.findall(r'friends/(\w+\d+)', url)
                Season.append(season)
                ep = regex.findall(r'friends/season9/(\d+)', url)
                Episode.append(ep)
                try:
                    Title.append(soup.title.string)
                except AttributeError:
                    # soup.title is None when the page has no <title> tag
                    Title.append(np.nan)

100%|██████████| 19/19 [00:09<00:00,  1.95it/s]


In [31]:
len(Characters), len(Lines), len(Title), len(Season), len(Episode)

(50777, 50777, 50777, 50777, 50777)

### Irregular Episodes: Season 2 (9, 12-23) and Season 9 (7, 11, 15, 23/24)

In [32]:
s09_ep2324_url = 'http://livesinabox.com/friends/season9/0923-0924.html'

In [33]:
s09_irregular_urls = ['http://livesinabox.com/friends/season9/907song.htm',
                      'http://livesinabox.com/friends/season9/911work.htm',
                      'http://livesinabox.com/friends/season9/915mug.htm',
                     'http://livesinabox.com/friends/season9/0923-0924.html']

In [34]:
irregular_urls = s09_irregular_urls

In [36]:
for url in tqdm(irregular_urls):
    r = requests.get(url)
    soup = BeautifulSoup(r.text,'html.parser')
    ps = soup.find('body').text
    for p in ps.split("\n"):
        char = regex.findall(r"[A-Z][a-zA-Z. ]+:",p)
        if char != []:
            if char[0] not in ("Scene:",'Teleplay by:','Story by:','Directed by:'):
                Characters.append(char[0])
                # str.find matches the label literally ("." in "Mrs." is not a wildcard)
                index = p.find(char[0]) + len(char[0])
                line = p[index:]
                Lines.append(line.replace("<br>",""))
                season = regex.findall(r'friends/(\w+\d+)', url)
                Season.append(season)
                ep = regex.findall(r'friends/season\d/(\d+)', url)
                Episode.append(ep)
                try:
                    Title.append(soup.title.string)
                except AttributeError:
                    # soup.title is None when the page has no <title> tag
                    Title.append(np.nan)

100%|██████████| 4/4 [00:02<00:00,  1.61it/s]


In [37]:
len(Characters), len(Lines), len(Title), len(Season), len(Episode)

(52108, 52108, 52108, 52108, 52108)
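
For these pages the body text is split on newlines and the credit labels are skipped along with scene descriptions. The idea can be sketched on a small string. This assumes the standard-library `re` in place of `regex`; `dialogue_from_text` and the sample text are hypothetical, and the sketch strips whitespace from each line, which the cell above does not.

```python
import re

SPEAKER = re.compile(r"[A-Z][a-zA-Z. ]+:")

# Labels that look like speakers but are scene notes or credits
SKIP = {"Scene:", "Teleplay by:", "Story by:", "Directed by:"}

def dialogue_from_text(body_text):
    """Return (label, line) pairs, one per newline-separated chunk with a speaker."""
    rows = []
    for chunk in body_text.split("\n"):
        match = SPEAKER.search(chunk)
        if match and match.group() not in SKIP:
            rows.append((match.group(), chunk[match.end():].strip()))
    return rows

sample = "Directed by: Someone\nRoss: Hi.\n[Scene: Central Perk]\nRachel: Hey!"
print(dialogue_from_text(sample))
```

Keeping the skip labels in one set makes it easy to extend the list when another page introduces a new credit line.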

In [38]:
s02_irregular_urls

['http://livesinabox.com/friends/season2/212toasb.htm',
 'http://livesinabox.com/friends/season2/214towpv.htm',
 'http://livesinabox.com/friends/season2/215rryk.htm',
 'http://livesinabox.com/friends/season2/216jmo.htm',
 'http://livesinabox.com/friends/season2/217emi.htm',
 'http://livesinabox.com/friends/season2/218drd.htm',
 'http://livesinabox.com/friends/season2/219ewg.htm',
 'http://livesinabox.com/friends/season2/220oyd.htm',
 'http://livesinabox.com/friends/season2/221towtb.htm',
 'http://livesinabox.com/friends/season2/222towtp.htm',
 'http://livesinabox.com/friends/season2/223towcp.htm',
 'http://livesinabox.com/friends/season2/209towpd.htm']

In [39]:
for url in tqdm(s02_irregular_urls):
    r = requests.get(url)
    soup = BeautifulSoup(r.text,'html.parser')
    ps = soup.prettify()
    for p in ps.split("<br/>"):
        char = regex.findall(r"[A-Z][a-zA-Z. ]+:",p)
        if char != []:
            if char[0] not in ("Scene:",'Teleplay by:','Story by:','Directed by:','Written by:','Transcribed by:'):
                Characters.append(char[0])
                # str.find matches the label literally ("." in "Mrs." is not a wildcard)
                index = p.find(char[0]) + len(char[0])
                line = p[index:]
                # Note: lstrip removes any leading characters from the given set, not an exact prefix
                Lines.append(line.lstrip('\n  </b>\n  ').replace("</br>","").replace("\n","").replace("  "," ").strip())
                season = regex.findall(r'friends/(\w+\d+)', url)
                Season.append(season)
                ep = regex.findall(r'friends/\w+\d+/(\d+)', url)
                Episode.append(ep)
                try:
                    Title.append(soup.title.string)
                except AttributeError:
                    # soup.title is None when the page has no <title> tag
                    Title.append(np.nan)

100%|██████████| 12/12 [00:04<00:00,  2.57it/s]


In [40]:
len(Characters), len(Lines), len(Title), len(Season), len(Episode)

(55384, 55384, 55384, 55384, 55384)
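
One thing to watch in the cleanup above is that `str.lstrip` takes a set of characters rather than a prefix, so it can remove the first letters of a line as well as the leftover tag. The sketch below shows the difference on an invented fragment and one possible alternative that removes tags with a regular expression; `strip_tags` is a hypothetical helper, not what the scrape above used.

```python
import re

raw = "</b> but I thought you liked it"

# lstrip drops every leading character found in "</b> ", including the "b" of "but"
stripped_by_charset = raw.lstrip("</b> ")

# Alternative: remove anything that looks like a tag, then collapse whitespace
TAG = re.compile(r"<[^>]+>")

def strip_tags(fragment):
    return " ".join(TAG.sub(" ", fragment).split())

print(repr(stripped_by_charset))
print(repr(strip_tags(raw)))
```

The regular expression only handles simple tags like the ones left over here; for anything more involved, asking the HTML parser for the text is the safer route.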

## Perform the same procedure for Season 10 episodes

In [41]:
all_links_url = 'http://livesinabox.com/friends/scripts.shtml'
r = requests.get(all_links_url)
url_pattern = regex.compile(r'<a href="(\d+.shtml)')
s10_episodes = regex.findall(url_pattern, r.text)

In [42]:
s10_episodes_urls = ['http://livesinabox.com/friends/'+episode for episode in s10_episodes]

In [44]:
r = requests.get('http://livesinabox.com/friends/1001.shtml')
soup = BeautifulSoup(r.text,'html.parser')
soup.title.string

'Crazy For Friends - 1001 - The One After Joey And Rachel Kiss'

In [45]:
for url in tqdm(s10_episodes_urls):
    r = requests.get(url)
    soup = BeautifulSoup(r.text,'html.parser')
    ps = soup.find_all('p')
    for p in ps:
        char = regex.findall(r"[A-Z][a-zA-Z. ]+:",p.text)
        if char != []:
            if char[0] != "Scene:":
                Characters.append(char[0])
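                # str.find matches the label literally ("." in "Mrs." is not a wildcard)
                index = p.text.find(char[0]) + len(char[0])
                line = p.text[index:]
                Lines.append(line.replace("\n"," "))
                # Season 10 URLs look like friends/1001.shtml: the first two digits are the season
                season = regex.findall(r'friends/(\d\d)', url)
                Season.append(season)
                ep = regex.findall(r'friends/(\d+)', url)
                Episode.append(ep)
                try:
                    Title.append(soup.title.string)
                except AttributeError:
                    # soup.title is None when the page has no <title> tag
                    Title.append(np.nan)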

100%|██████████| 18/18 [00:10<00:00,  1.76it/s]


In [46]:
len(Characters), len(Lines), len(Title), len(Season), len(Episode)

(61116, 61116, 61116, 61116, 61116)

##### Clean the lists as needed
Because `regex.findall` returns a list, the Season and Episode lists are lists of one-element lists. Flatten each into a plain list.

In [50]:
Season[:3]

[['season1'], ['season1'], ['season1']]

In [51]:
Episode = [value[0] for value in Episode]
Season = [value[0] for value in Season]
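
A possible alternative to flattening afterwards is to pull plain values out of the URL at scrape time, which would also let season 10 use the same season format as the others. This is a sketch that assumes the three URL shapes seen in this notebook; `parse_season_episode` is a hypothetical helper, not part of the scrape above.

```python
import re

def parse_season_episode(url):
    """Return (season number, episode code) for a transcript URL."""
    # Seasons 1-9: .../friends/season2/212toasb.htm
    match = re.search(r"friends/season(\d+)/(\d+)", url)
    if match:
        return int(match.group(1)), match.group(2)
    # Season 10: .../friends/1001.shtml
    match = re.search(r"friends/(\d{2})(\d{2})\.shtml", url)
    if match:
        return int(match.group(1)), match.group(1) + match.group(2)
    raise ValueError(f"Unrecognized transcript URL: {url}")

for url in (
    "http://livesinabox.com/friends/season2/212toasb.htm",
    "http://livesinabox.com/friends/season9/0923-0924.html",
    "http://livesinabox.com/friends/1001.shtml",
):
    print(parse_season_episode(url))
```

Raising on an unknown shape surfaces a new URL format immediately instead of appending an empty list.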

In [52]:
Title.count('Untitled Document')

2383

## Plug all of the lists into a DataFrame

In [53]:
import pandas as pd

In [54]:
df = pd.DataFrame()

In [55]:
df['Season']=Season
df['Episode'] = Episode
df['Title'] = Title
df['Character'] = Characters
df['Line'] = Lines

#### Check the data frame 
Review various episodes to ensure we obtained the data we wanted, in the format we wanted.

In [62]:
df[15850:15870]

Index,Season,Episode,Title,Character,Line
15850,season4,406,The One With The Dirty Girl,Chandler:,
15851,season4,406,The One With The Dirty Girl,Joey:,
15852,season4,406,The One With The Dirty Girl,Mrs. Burkart:,
15853,season4,406,The One With The Dirty Girl,Phoebe:,
15854,season4,406,The One With The Dirty Girl,Monica:,
15855,season4,406,The One With The Dirty Girl,Phoebe:,
15856,season4,406,The One With The Dirty Girl,Mrs. Burkart:,
15857,season4,406,The One With The Dirty Girl,Phoebe:,
15858,season4,406,The One With The Dirty Girl,Mrs. Burkart:,
15859,season4,406,The One With The Dirty Girl,Phoebe:,


#### Analyze the data frame using different aggregations
1. Count the number of lines per episode to see how consistent it is and whether any episodes have missing lines
2. Count the number of blank lines by episode

This helps us understand where we might have to rescrape certain episodes. 

In [77]:
set(df.Season)

{'10',
 'season1',
 'season2',
 'season3',
 'season4',
 'season5',
 'season6',
 'season7',
 'season8',
 'season9'}

In [75]:
total_episodes = [len(set(df[df.Season == season]['Episode'])) for season in set(df.Season)]

In [79]:
np.sum(total_episodes)

227

In [57]:
df.groupby(['Episode']).count().sort_values(by='Line')[:5]

Episode,Season,Title,Character,Line
116,2,2,2,2
224,2,2,2,2
913,56,56,56,56
114,180,180,180,180
712,198,198,198,198


In [58]:
blank_lines = df[df.Line == '']

In [59]:
blank_lines.groupby('Episode').count().sort_values(by='Line', ascending=False)[:10]

Episode,Season,Title,Character,Line
302,282,282,282,282
608,251,251,251,251
406,243,243,243,243
819,79,79,79,79
913,55,55,55,55
813,26,26,26,26
717,12,12,12,12
516,11,11,11,11
707,9,9,9,9
618,8,8,8,8


Based on the aggregations above, Episodes 116, 224, 913, 302, 608, and 406 are the candidates for rescraping, as they have either the fewest scraped lines or the most blank lines. 

However, these 6 episodes are only about 3% of the 227 total, so we chose not to rescrape them. 
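
The same two counts can be reproduced without pandas, which also makes it easy to look at the share of blank lines per episode rather than the raw count. This is a sketch on invented rows, using `collections.Counter` as a stand-in for the `groupby` calls above; the last lines redo the arithmetic behind the 3% figure.

```python
from collections import Counter

# Invented (episode, line) pairs standing in for the DataFrame rows
rows = [("406", ""), ("406", ""), ("406", "Hi"), ("101", "Hello"), ("101", "")]

total = Counter(ep for ep, _ in rows)
blank = Counter(ep for ep, line in rows if line == "")

# Share of blank lines per episode; a high share suggests the page was parsed badly
blank_share = {ep: blank[ep] / total[ep] for ep in total}
print(blank_share)

# Share of episodes flagged for rescraping, using the counts reported above
flagged, episodes = 6, 227
print(f"{flagged / episodes:.1%}")
```

Ranking by share instead of count would separate an episode that is mostly blank from a long episode with a few blanks.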

## Save dataframe to a CSV
Once the data has been gathered, save to a CSV for future use

In [80]:
df.to_csv('friends_transcripts_s1_10.csv', columns=['Season', 'Episode', 'Title', 'Character', 'Line'])
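
Before relying on the saved file, it can help to confirm that blank lines survive a write and read. This is a sketch of that round trip with the standard-library `csv` module and an in-memory buffer, used here instead of pandas; the second sample line is invented.

```python
import csv
import io

fields = ["Season", "Episode", "Character", "Line"]
rows = [
    {"Season": "season4", "Episode": "406", "Character": "Chandler:", "Line": ""},
    {"Season": "season4", "Episode": "406", "Character": "Joey:", "Line": "Hey, how are you?"},
]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields)
writer.writeheader()
writer.writerows(rows)

# Read the data back and compare with what was written
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(loaded == rows)
```

With pandas, two defaults are worth knowing when reloading: `to_csv` writes the index as an extra unnamed column unless `index=False` is passed, and `read_csv` reads empty fields as NaN unless `keep_default_na=False` is set.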