NOTE: When referencing episodes of the show, shorthand may be used (e.g., S3E7 is Season 3 Episode 7).

# Creating Word Count Tables for Every Episode

Since we've gotten our hands dirty in the `Lost Transcripts EDA (Pilot)` notebook, we can start to do the same for every episode in the show.

## How to Access Every Episode's Transcript

We can't scrape every transcript yet, because the transcript URLs are keyed by episode title rather than by season/episode number. So, we'll need to grab the list of episode names. To do this, I'll scrape the data from a [Wikipedia page](https://en.wikipedia.org/wiki/List_of_Lost_episodes) (Lostpedia's list seemed too complicated to scrape, and the Pilot EDA notebook showed us how annoying it would be to work with IMDb).

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
from requests import get
import csv
import time
import random

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_Lost_episodes'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
print(response.text[:500])

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-


In [3]:
containers = html_soup.find_all('td', 'summary')
print(type(containers))
print(len(containers))

<class 'bs4.element.ResultSet'>
140


In [4]:
print(containers[0])
print(containers[1])
print(containers[113])

<td class="summary" rowspan="1" style="text-align:left">"<a href="/wiki/Pilot_(Lost)" title="Pilot (Lost)">Pilot (Part 1)</a>"</td>
<td class="summary" rowspan="1" style="text-align:left">"<a href="/wiki/Pilot_(Lost)" title="Pilot (Lost)">Pilot (Part 2)</a>"</td>
<td class="summary" rowspan="1" style="text-align:left">"<a href="/wiki/The_End_(Lost)" title="The End (Lost)">The End</a>"</td>


The scrupulous among you may notice that there are more episodes than there are "titles" listed in Wikipedia. You would be correct! The final two episodes of every season are grouped together in Wikipedia because they share a name (and are treated as two parts of the same plot). Normally this would be cause for concern, since we need our transcripts to line up with the titles we use to iterate through the episodes. However, we're in luck: such transcripts are grouped together in Lostpedia as well! So, by happy coincidence, everything lines up. (Regarding how to assign ratings to these combined episode pairings, we will simply take the average of the two episodes' ratings.)

The only hiccup to look out for seems to be that titles containing parentheses in Wikipedia are written without them in Lostpedia (for example, "Pilot (Part 1)" becomes "Pilot,_Part_1" in the URL). We will adjust for that below.

In [5]:
raw_titles = [title.text[1:-1] for title in containers[:114]]
raw_titles[:5]

['Pilot (Part 1)',
 'Pilot (Part 2)',
 'Tabula Rasa',
 'Walkabout',
 'White Rabbit']

In [6]:
transcript_titles = [
    title.replace(' (', ', ').replace(')', '').replace(' ', '_').replace('&', '%26')
    if '(' in title
    else title.replace(' ', '_').replace('?', '%3F')
    for title in raw_titles
]
transcript_titles[:5]

['Pilot,_Part_1', 'Pilot,_Part_2', 'Tabula_Rasa', 'Walkabout', 'White_Rabbit']
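As an alternative to chaining `replace` calls for each special character, the same conversion can be sketched with the standard library's `urllib.parse.quote`. This is only a sketch: the helper name `to_transcript_title` is hypothetical, and keeping commas and apostrophes unescaped is an assumption about what the Lostpedia URLs expect. Note that `quote` percent-encodes every reserved character it is not told to keep, which is broader than the two characters (`&` and `?`) handled in cell 6.

```python
from urllib.parse import quote


def to_transcript_title(raw_title):
    """Convert a Wikipedia episode title into a URL-style transcript title.

    Hypothetical helper: mirrors the replacements in cell 6, but leaves the
    percent-encoding to urllib.parse.quote.
    """
    # "Pilot (Part 1)" -> "Pilot, Part 1"
    title = raw_title.replace(' (', ', ').replace(')', '')
    # Spaces become underscores in the transcript URLs.
    title = title.replace(' ', '_')
    # Assumption: commas and apostrophes stay literal; everything else that
    # is not a letter, digit, or one of "_.-~" is percent-encoded.
    return quote(title, safe=",'")


print(to_transcript_title('Pilot (Part 1)'))      # Pilot,_Part_1
print(to_transcript_title('LA X (Parts 1 & 2)'))  # LA_X,_Parts_1_%26_2
```

Unlike the conditional expression in cell 6, this applies the same encoding whether or not the title contains parentheses.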

At this stage, we should be good to write a for loop that uses the `word_count_matrix.py` file (adapted from the `Lost Transcripts EDA (Pilot)` notebook) to create all of the word count tables we need. 

## Generating the Tables

For troubleshooting reasons, I chose to process each season one at a time; however, it should not be hard to do the entire show in one run. 

To begin, I ran a single episode as a test (S1E5). I learned some interesting things from it.

In [7]:
from word_count_matrix import word_count_table
print(transcript_titles[4])
#word_count_table(transcript_titles[4],1,5)

White_Rabbit


In [8]:
with open('S1E5_example.csv', 'r') as file:
    reader = csv.reader(file)
    data = list(reader)
pd.DataFrame(data)

Unnamed: 0,0,1
0,Character,Word Count
1,AGENT,27
2,BOONE,176
3,CHARLIE,274
4,CHRISTIAN SHEPHARD,159
5,CLAIRE,178
6,HOTEL MANAGER,100
7,HURLEY,90
8,JACK,731
9,JIN,15


The things that stand out to me are that:
* JIN comes up two extra times: once as JINÂ (which simply shows as "JIN " in my text editor and in the transcript), and once as Jin. This shows that the data is not as clean as we might have wished.
* There are characters, like YOUNG JACK, that would ideally be aggregated with their appropriate counterparts (in this case, JACK).
    * This becomes harder when considering that some characters "change" names throughout the show.
      <details> <summary> Spoiler Alert </summary> One example is Ben Linus, whose first introduction has him listed as GALE. </details>

The first issue was resolved (retroactively) in the Pilot EDA notebook, while the second issue seems too niche to make a large impact on the overall results. In the case of important discrepancies (like the one listed in the spoiler), I will make edits to the `word_count_matrix.py` file as I discover them. 
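To make the two issues above concrete, here is a minimal sketch of how speaker labels could be normalized and merged. It is not the code in `word_count_matrix.py`; the names `ALIASES`, `normalize_name`, and `merge_counts` are hypothetical. It also assumes that the stray character after JIN is a non-breaking space, and that only the YOUNG JACK to JACK pairing is aliased. The extra counts in the sample are made up for illustration.

```python
from collections import Counter

# Hypothetical alias map; only the YOUNG JACK -> JACK pairing comes from the text.
ALIASES = {"YOUNG JACK": "JACK"}


def normalize_name(raw_name, aliases=ALIASES):
    """Return a canonical speaker name for a raw transcript label."""
    # Assumption: the stray character is a non-breaking space (U+00A0).
    name = raw_name.replace("\u00a0", " ").strip().upper()
    # Map known alternate labels onto their main character.
    return aliases.get(name, name)


def merge_counts(raw_counts, aliases=ALIASES):
    """Sum word counts for labels that normalize to the same name."""
    merged = Counter()
    for raw_name, count in raw_counts:
        merged[normalize_name(raw_name, aliases)] += count
    return dict(merged)


# Illustrative input: JIN (15) and JACK (731) match the table above,
# the other counts are placeholders.
sample = [("JIN", 15), ("JIN\u00a0", 4), ("Jin", 2), ("YOUNG JACK", 10), ("JACK", 731)]
print(merge_counts(sample))  # {'JIN': 21, 'JACK': 741}
```

Keeping the aliases in a single dictionary means that newly discovered name changes only need one extra entry.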

Now that that's out of the way, let's clean up some of the out-of-the-ordinary titles and try all of Season 1.

In [9]:
transcript_titles[23] = 'Exodus,_Part_2'
transcript_titles[-16] = 'LA_X,_Parts_1_%26_2'

In [10]:
season_num=1
for episode_num in range(24):
    word_count_table(transcript_titles[episode_num],season_num,episode_num+1)
    time.sleep(random.randint(0,3))

It worked! Notice that we pause for a random amount of time between requests so that they are less likely to be blocked for bot-like behavior. Let's finish up.
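The random pause can also be wrapped in a small helper. The sketch below is an assumption about how one might do it, not what the notebook runs: `polite_pause` is a hypothetical name, and it uses `random.uniform` so the delay is a fractional number of seconds with a non-zero minimum (`random.randint(0,3)` can return 0, which means no pause at all).

```python
import random
import time


def polite_pause(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Sleep for a random duration between the two bounds and return it.

    The `sleep` parameter exists only so the function can be exercised
    without actually waiting.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Inside a scraping loop, a call to `polite_pause()` would take the place of `time.sleep(random.randint(1,3))`.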

In [11]:
season_num=2
for episode_num in range(23):
    word_count_table(transcript_titles[24+episode_num],season_num,episode_num+1) 
    time.sleep(random.randint(1,3))

In [12]:
season_num=3
for episode_num in range(22):
    word_count_table(transcript_titles[24+23+episode_num],season_num,episode_num+1)
    time.sleep(random.randint(1,3))

In [13]:
season_num=4
for episode_num in range(13):
    word_count_table(transcript_titles[24+23+22+episode_num],season_num,episode_num+1)
    time.sleep(random.randint(1,3))

In [14]:
season_num=5
for episode_num in range(16):
    word_count_table(transcript_titles[24+23+22+13+episode_num],season_num,episode_num+1)
    time.sleep(random.randint(1,3))

In Season 5, I came across a small problem that I had to fix in the Python script: sometimes locations (like Tunisia in S5E7) show up with a "Subtitle:" label. It was an easy fix, though.

In [15]:
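season_num=6
# The premiere (LA X, Parts 1 & 2) is a single transcript, saved as episode 1.
word_count_table(transcript_titles[24+23+22+13+16+0],season_num,1)
# Because the premiere covers episodes 1 and 2, the remaining titles start at episode 3.
for episode_num in range(1,16):
    word_count_table(transcript_titles[24+23+22+13+16+episode_num],season_num,episode_num+2)
    time.sleep(random.randint(1,3))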
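The six per-season cells above repeat the same index arithmetic (`24+23+22+...`). As a sketch of how the whole show could be processed in one run, the generator below computes each season's starting index from the per-season title counts used in the `range()` calls above. The names `TITLES_PER_SEASON` and `episode_plan` are hypothetical, and the `double_premieres` parameter is my own way of expressing the Season 6 numbering shift.

```python
from itertools import accumulate

# Number of titles per season, taken from the range() calls in cells 10-15.
TITLES_PER_SEASON = [24, 23, 22, 13, 16, 16]


def episode_plan(titles, titles_per_season=TITLES_PER_SEASON, double_premieres=(6,)):
    """Yield (title, season_num, episode_num) for every title, in order."""
    # Starting index of each season: [0, 24, 47, 69, 82, 98] for the defaults.
    starts = [0, *accumulate(titles_per_season)][:-1]
    for season_num, (start, size) in enumerate(zip(starts, titles_per_season), start=1):
        for i in range(size):
            episode_num = i + 1
            # When a season's first title covers two episodes, later titles shift by one.
            if season_num in double_premieres and i > 0:
                episode_num += 1
            yield titles[start + i], season_num, episode_num
```

Each yielded tuple could then be passed to `word_count_table(title, season_num, episode_num)`, followed by the usual random pause.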

## References 

* https://stackoverflow.com/questions/3411771/best-way-to-replace-multiple-characters-in-a-string
* https://stackoverflow.com/questions/10648490/removing-first-appearance-of-word-from-a-string