NOTE: When referencing episodes of the show, shorthand may be used (e.g., S3E7 is Season 3 Episode 7).

# Lost Transcripts EDA

## Data Collection

To scrape the transcript data, we need to first see what the data looks like. We'll use Beautiful Soup to grab an example transcript's data (namely, S1E1) to determine how to extract the relevant information.

In [1]:
from matplotlib import pyplot as plt
%matplotlib inline
import pandas as pd
from bs4 import BeautifulSoup
import re

In [2]:
from requests import get
url = 'https://lostpedia.fandom.com/wiki/Pilot,_Part_1_transcript'
response = get(url)
print(response.text[:500])

<!DOCTYPE html>
<html class="client-nojs sse-other new-nav-canary" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Pilot, Part 1 transcript | Lostpedia | Fandom</title>
<script>document.documentElement.className="client-js sse-other new-nav-canary";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","Novemb


In [3]:
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

bs4.BeautifulSoup

There's a lot of junk before (and after) the actual transcript, so we'll need to limit the text we extract accordingly. 

First, we grab the container that has the transcript in it. 

In [4]:
containers = html_soup.find_all('div', 'mw-body-content mw-content-ltr')
print(type(containers))
print(len(containers))
containers[0].text[:7000]

<class 'bs4.element.ResultSet'>
1


'\n\n\nv\xa0•\xa0d\xa0•\xa0eTranscriptsSeason 11 • 2 • 3 • 4 • 5 • 6 • 7 • 8 • 9 • 10 • 11 • 12 • 13 • 14 • 15 • 16 • 17 • 18 • 19 • 20 • 21 • 22 • 23 • 24 / 25Season 21 • 2 • 3 • 4 • 5 • 6 • 7 • 8 • 9 • 10 • 11 • 12 • 13 • 14 • 15 • 16 • 17 • 18 • 19 • 20 • 21 • 22 • 23 / 24Season 31 • 2 • 3 • 4 • 5 • 6 • 7 • 8 • 9 • 10 • 11 • 12 • 13 • 14 • 15 • 16 • 17 • 18 • 19 • 20 • 21 • 22 / 23Season 41 • 2 • 3 • 4 • 5 • 6 • 7 •  8 • 9 • 10 • 11 • 12 • 13 / 14Season 51 • 2 • 3 • 4 • 5 • 6 •  7 • 8 • 9 • 10 • 11 • 12 • 13 • 14 • 15  • 16 / 17Season 61 / 2 • 3 • 4 • 5 • 6 • 7 • 8 • 9 • 10 • 11 • 12 • 13 • 14 • 15 • 16 • 17 / 18Clip showsDestiny Calls • The Story of the Oceanic 6 • A Journey in Time • Final ChapterDVD\xa0commentariesPilot, Part 1 • Pilot, Part 2 • Walkabout • The Moth • Hearts and Minds • Man of Science, Man of Faith • What Kate Did • The 23rd Psalm • The Whole Truth • Dave • A Tale of Two Cities • I Do • Exposé • The Man Behind the Curtain • The Beginning of the End • The Constant

As is obvious, we're not quite rid of all the junk yet, but it's a start. 

NOTE: While I could have begun cutting up the data based on the raw text of the entire page (with something like `response.text.split('id=\"Act_1\"')[1]`), I found that grabbing the sole container that holds the transcript strips away more of the undesired data at the tail end.

## Data Cleaning

In [5]:
transcript = containers[0].text

We've created a variable `transcript` which we will keep whittling down until we have a `clean_transcript`.

In [6]:
transcript = transcript.split('Contents')[1]

At this stage, we have a rough transcript isolated. Naturally, the first thought would be to remove stage directions so that we can begin to collect word counts for each of the characters. 


### Removing Stage Directions

To remove the stage directions and reformat the current iteration of the transcript, we will:
* separate substrings that are on either side of a bracket,
* remove those substrings that are stage directions (i.e., not dialogue and not subtitles), and
* reassemble the transcript and then split it on newlines.

In [7]:
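# Treat ']' like '[' so one split separates the text on either side of every bracket.
dialogue_transcript = transcript.replace(']','[').split('[')
for substring in dialogue_transcript:
    # Drop a substring unless the text before its first colon (within the first 15
    # characters) is all uppercase (a speaker label) or it starts with 'Subtitle'.
    # NOTE: remove() on the list being iterated skips the element after each removal.
    if substring[:15].split(':')[0]!=substring[:15].split(':')[0].upper() and substring[:8]!='Subtitle':
        dialogue_transcript.remove(substring)
# Reassemble the text, split it on newlines, and drop empty strings.
dialogue_transcript = list(filter(None, "".join(dialogue_transcript).split('\n')))
dialogue_transcript[:110]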

['MICHAEL: Walt! Walt!',
 'REDSHIRT: Stay away from the gas! Stay there!',
 'REDSHIRT #2:  Help! Help! Somebody help me! Help! Help! Ahh, my leg! Ah! Ah!',
 'JACK: Hey you, just give me a hand! You, come on! Come over here, give me a hand!  On the count of three: One, two, three.',
 'CLAIRE: Help! Please help me! Help me, please help me!',
 'JACK:  All right, get him out of here! Get him away from the engine! Get him out of here!',
 "CLAIRE: Help me, please. I'm having, I'm having contractions.",
 'JACK: How many months pregnant are you?',
 "CLAIRE: I'm, I'm nearly eight months.",
 'JACK:  How far apart are they coming?',
 "CLAIRE: I don't know, a-a few just happened.",
 'LOCKE: Hey! Hey, hey, hey, get away from there!',
 'GARY: What?',
 "JACK: Listen to me! Look at me! You're going to be okay, do you understand me? But you have to sit absolutely still!",
 'JACK:  Hey, you! Come here! I need you to get this woman away from these fumes! Take her over there. Stay with her. If her contrac

Notice that we determine which substrings are dialogue by checking whether the first few characters before a colon (should one exist) are uppercase. This assumes that colons and stage directions do not appear together in the same line of English dialogue.
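
As a point of comparison, here is a minimal sketch of a position-based variant of the same idea. It is not the code this notebook runs: the function name `strip_stage_directions` and the sample text are made up for illustration, and it assumes that brackets are balanced, never nested, and that subtitles sit inside brackets beginning with `Subtitle`.

```python
import re

def strip_stage_directions(text):
    """Drop bracketed stage directions but keep bracketed subtitles (hypothetical helper)."""
    # With a capturing group, re.split keeps the bracketed pieces in the result:
    # even indices hold text outside brackets, odd indices hold bracket contents.
    pieces = re.split(r'\[([^\]]*)\]', text)
    kept = []
    for i, piece in enumerate(pieces):
        outside_brackets = (i % 2 == 0)
        if outside_brackets or piece.startswith('Subtitle'):
            kept.append(piece)
    # Reassemble, split on newlines, and drop empty lines.
    return [line.strip() for line in ''.join(kept).split('\n') if line.strip()]

# Made-up sample in the style of the transcript (not scraped data).
sample = (
    "[Jack wakes up in the jungle.]\n"
    "MICHAEL: Walt! Walt!\n"
    "JACK: [Shouting] Hey you, give me a hand!\n"
    "JIN: 우린 같이 있어야 돼. [Subtitle: We need to stay together.]\n"
)
cleaned = strip_stage_directions(sample)
for line in cleaned:
    print(line)
```

Because this version decides by position (inside or outside a bracket) rather than by capitalization, it does not rely on the uppercase-prefix heuristic, but it would behave differently from the cell above on malformed or nested brackets.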

At this point, retrieving the word count per character seems simple. However, we have a problem that needs to be addressed first: Jin and Sun regularly speak Korean.

### Multilingual Transcripts

In [8]:
"".join(dialogue_transcript).split('JIN:')[1][:60]

' 내 옆에서 없어지면 안 돼. 내가 어디로 가든지 꼭 따라와. 알겠지?  다른 사람 신경쓰지 말고 우린 같이'

Unfortunately, it would be remiss of us to count individual Korean characters as words. The good news is that subtitles are provided for the Korean text (as foreshadowed by the code we used to remove stage directions).

In [9]:
"".join(dialogue_transcript).split('JIN:')[1][:215]

" 내 옆에서 없어지면 안 돼. 내가 어디로 가든지 꼭 따라와. 알겠지?  다른 사람 신경쓰지 말고 우린 같이 있어야 돼. Subtitle: You must not leave my sight. You must follow me wherever I go. Do you understand? Don't worry about the others. We need to stay together."

Consequently, we'll want to remove the Korean as well as the "Subtitle: " that follows it.
However, even then there are a couple of edge cases to consider:
* What if a combination of English and Korean is spoken in the same line? How will the word count be calculated then?
    * E.g., JIN in S1E6 or S6E14.
* Are we going to have the code account for _any_ character speaking Korean (i.e., even if their line of dialogue is not preceded by "JIN:" or "SUN:")?
    * E.g., INTERIOR DECORATOR in S1E6.
 
My initial reaction is to take the easiest route for both of these problems:
* A cursory look through some transcripts that contain Korean text suggests that whenever English and Korean share a line, the Korean greatly outweighs the English in word count, so ignoring the English words spoken in the same line as Korean words would have little bearing on the overall accuracy of the project.
* Since the key takeaways from the project will be regarding the main characters primarily, it will not be a huge dent in our progress to simply disregard any characters speaking in Korean who are not Sun or Jin. (Despite this, our solution will clean foreign language dialogue independent of the character speaking it.)

Having decided on these simplifications, we can start cleaning the Korean dialogue (and use the same method on other languages as well; e.g., Russian and Spanish in S6E9).

**Side Remark**: colons are not exclusive to dialogue indicators; there is at least one instance involving a Bible verse.

In [10]:
english_transcript = [dialogue.split(':')[0]+':'+dialogue.split(':')[2] if "Subtitle:" in dialogue else dialogue for dialogue in dialogue_transcript]
english_transcript[75:80]

['CLAIRE: Yeah, you too.',
 "MICHAEL: You sure you're warm enough?",
 "JIN: You must not leave my sight. You must follow me wherever I go. Do you understand? Don't worry about the others. We need to stay together.",
 "KATE: Do you think he's going to live?",
 'JACK: Do you know him?']

Again, the code above implicitly assumes that colons do not occur in dialogue where there are also brackets. But otherwise it looks good!
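
To make the colon caveat concrete, here is a small sketch of how an extra colon inside a subtitle interacts with colon-based splitting. The helper `keep_subtitle` is a hypothetical name, and the second sample line is invented purely to show the edge case; it does not come from a transcript.

```python
def keep_subtitle(line, marker='Subtitle:'):
    """Return 'SPEAKER: subtitle' for subtitled lines; leave other lines untouched."""
    if marker not in line:
        return line
    speaker = line.split(':', 1)[0]               # text before the first colon
    subtitle = line.split(marker, 1)[1].strip()   # everything after the marker
    return f'{speaker}: {subtitle}'

lines = [
    'JIN: 우린 같이 있어야 돼. Subtitle: We need to stay together.',
    'SUN: <Korean text> Subtitle: He said: stay here.',  # invented edge case
    "KATE: Do you think he's going to live?",
]

# Indexing the colon-separated pieces stops at the next colon in the subtitle.
by_index = lines[1].split(':')[0] + ':' + lines[1].split(':')[2]
print(by_index)                 # SUN: He said
print(keep_subtitle(lines[1]))  # SUN: He said: stay here.
```

Splitting on the marker with `maxsplit=1` keeps any later colons intact, at the cost of assuming that `Subtitle:` appears at most once per line.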

While we're at it, let's remove the instances of "Act 2", "Act 3", etc. (Full disclosure: it is a bit odd to me that they survived the pruning that occurred when creating `dialogue_transcript`, but it's an easy enough fix that I will not dwell on it.)

In [11]:
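# Keep only lines that contain a colon. Building a new list avoids removing items
# from a list while iterating over it, which can skip elements.
english_transcript = [dialogue for dialogue in english_transcript if ':' in dialogue]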
english_transcript[:110]

['MICHAEL: Walt! Walt!',
 'REDSHIRT: Stay away from the gas! Stay there!',
 'REDSHIRT #2:  Help! Help! Somebody help me! Help! Help! Ahh, my leg! Ah! Ah!',
 'JACK: Hey you, just give me a hand! You, come on! Come over here, give me a hand!  On the count of three: One, two, three.',
 'CLAIRE: Help! Please help me! Help me, please help me!',
 'JACK:  All right, get him out of here! Get him away from the engine! Get him out of here!',
 "CLAIRE: Help me, please. I'm having, I'm having contractions.",
 'JACK: How many months pregnant are you?',
 "CLAIRE: I'm, I'm nearly eight months.",
 'JACK:  How far apart are they coming?',
 "CLAIRE: I don't know, a-a few just happened.",
 'LOCKE: Hey! Hey, hey, hey, get away from there!',
 'GARY: What?',
 "JACK: Listen to me! Look at me! You're going to be okay, do you understand me? But you have to sit absolutely still!",
 'JACK:  Hey, you! Come here! I need you to get this woman away from these fumes! Take her over there. Stay with her. If her contrac
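
A side note on the `for ... remove(...)` pattern used in In [7]: in Python, removing items from a list while looping over that same list makes the loop skip the element right after each removal. This may be one reason stray lines such as the "Act" headers survived the earlier pruning, though that is a guess rather than something verified against the transcript. The toy list below is made up for illustration.

```python
# Made-up lines: two consecutive entries without a colon.
lines = ['JACK: Hey!', 'Act 2', 'Act 3', 'KATE: Hi.']

# Pattern 1: remove while iterating. After 'Act 2' is removed, the remaining
# items shift left, so the loop never looks at 'Act 3'.
mutated = list(lines)
for line in mutated:
    if ':' not in line:
        mutated.remove(line)
print(mutated)   # ['JACK: Hey!', 'Act 3', 'KATE: Hi.']

# Pattern 2: build a new list, which checks every element exactly once.
filtered = [line for line in lines if ':' in line]
print(filtered)  # ['JACK: Hey!', 'KATE: Hi.']
```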

## Data Management

We now have a transcript we can work with (`english_transcript`) to aggregate the word count for each character in this episode. 

### Arranging Transcript Data

Our goal by the end of this notebook is to have a table that lists all the relevant data (character name, word count, episode rating, etc.) for this episode. To obtain the word count for each line of dialogue, we will take the primitive but reasonable approach of counting how many "words" (strings of characters separated by spaces) there are.

In [12]:
character_names = [dialogue.split(': ')[0] for dialogue in english_transcript]
character_lines = [dialogue.split(': ')[1] for dialogue in english_transcript]
character_wordcounts = [len(line.split(' ')) for line in character_lines]
df = pd.DataFrame({'Character':character_names, 'Word Count':character_wordcounts})
df = df.groupby('Character')['Word Count'].sum().reset_index()
df

| | Character | Word Count |
|---|---|---|
| 0 | BOONE | 66 |
| 1 | CHARLIE | 323 |
| 2 | CINDY | 45 |
| 3 | CLAIRE | 49 |
| 4 | GARY | 1 |
| 5 | HURLEY | 81 |
| 6 | JACK | 792 |
| 7 | JIN | 26 |
| 8 | KATE | 276 |
| 9 | LOCKE | 8 |
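
For reference, two quirks of space-based counting are worth keeping in mind: `split(' ')` yields an empty string for every extra space (so a line that begins with two spaces after the colon counts one extra "word"), and taking index `[1]` after `split(': ')` drops anything following a second `': '` in the same line. The sketch below shows a stdlib-only variant that sidesteps both. It is an illustration under those assumptions, not the code that produced the table above, so its totals would not necessarily match; `count_words` is a hypothetical name.

```python
from collections import Counter

def count_words(lines):
    """Sum whitespace-separated words per speaker (hypothetical helper)."""
    totals = Counter()
    for line in lines:
        # partition splits on the first colon only, so later colons stay in the speech.
        speaker, sep, speech = line.partition(':')
        if not sep:
            continue  # no speaker label, e.g. an act header
        # split() with no argument collapses runs of whitespace.
        totals[speaker.strip()] += len(speech.split())
    return totals

sample_lines = [
    'JACK:  How far apart are they coming?',
    'GARY: What?',
    'JACK: On the count of three: One, two, three.',
]
totals = count_words(sample_lines)
print(dict(totals))
```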


### Importing Rating Data

We now want to find out what the rating for this episode is. To do so, we can pull from [IMDb](https://www.imdb.com/). While working on my [IMDb project](https://github.com/parsaha/imdb/tree/master), I realized too late that I had overlooked the existence of a [Python package](https://cinemagoer.github.io/) for pulling data from IMDb. However, as of the writing of this notebook, it appears to be easier to simply scrape the scores using Beautiful Soup than to parse `Cinemagoer`'s sparse documentation. (I even considered resorting to [PyMovieDb](https://pypi.org/project/PyMovieDb/#description), but it did not have the per-episode ratings that we need.)
* `Cinemagoer` would require me to obtain each individual episode's unique ID number and then send a request for its rating, which takes (comparatively) long to return a response.

In [13]:
response = get('https://www.imdb.com/title/tt0411008/ratings/')
html_soup = BeautifulSoup(response.text, 'html.parser')
html_soup.text

'\n403 Forbidden\n\n403 Forbidden\n\n\n'

<details> <summary> Ah. IMDb doesn't like visitors coming in through the side yard. I suppose that leaves us with the tedious front door. </summary> While one can take steps to appear more like an actual user in a browser and not a web scraper, there is always the possibility of an IP ban down the line given the number of requests I will be sending (an issue I feel falls outside the scope of what I am willing to put energy into for this project). With that being said, I would much rather do things the hard way once than the easy way twenty times.  </details>

In [14]:
from imdb import Cinemagoer
ia = Cinemagoer()
lost = ia.get_movie('0411008')

In [15]:
ia.update(lost, 'episodes')
lost['episodes']

KeyError: 'episodes'

And now it seems even example code from Cinemagoer's own [documentation](https://cinemagoer.readthedocs.io/en/latest/usage/series.html) won't work.

To summarize this subsection of the notebook thus far:
* We've tried and failed at scraping the IMDb page with Beautiful Soup.
    * It does not seem worth the effort to implement the large number of workarounds it would take to (hopefully) avoid running into the `403 Forbidden` error at any point in the project.
* We've tried and failed to use the `Cinemagoer` package.
    * The package does not seem to work as the documentation would imply.
* We've tried and failed to use an alternative (`PyMovieDb`) to `Cinemagoer`.
    * Individual episode ratings do not seem to be among the info `PyMovieDb` is able to scrape from IMDb... at least, there's nothing documented to demonstrate that such is the case.
 
At this point, I only saw two realistic options: give up, or collect the scores manually. Thankfully, the number of episodes in the show is not large enough to deter me from seeing this through. I put [the scores]() together (in the `LOST_IMDb_Ratings` file) through a combination of an image-to-table site and some manual data entry. While we probably will not be integrating the score of the episode into the word count matrix we've created earlier, going through the hassle here allows us to write a cleaner walkthrough of the aggregation of all the transcripts' results.  

### Saving Cleaned Data

Finally, we want to be able to save our word count table for later reference. Each file will be tiny, so a CSV file suffices.

In [16]:
df.to_csv('S1E1.csv', index=False)
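
If the same step is later repeated for other episodes, the file name can be derived from the season and episode numbers using the shorthand from the note at the top. The sketch below uses only the standard library and writes to a temporary folder; the helper names (`episode_code`, `save_counts`, `load_counts`) are hypothetical, and the two sample rows are taken from the table above.

```python
import csv
import os
import tempfile

def episode_code(season, episode):
    """Build the notebook's shorthand, e.g. S3E7 for Season 3 Episode 7."""
    return f'S{season}E{episode}'

def save_counts(counts, season, episode, folder='.'):
    """Write a {character: word count} mapping to <folder>/S<season>E<episode>.csv."""
    path = os.path.join(folder, episode_code(season, episode) + '.csv')
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Character', 'Word Count'])
        writer.writerows(sorted(counts.items()))
    return path

def load_counts(path):
    """Read a saved file back into a {character: word count} mapping."""
    with open(path, newline='', encoding='utf-8') as f:
        return {row['Character']: int(row['Word Count']) for row in csv.DictReader(f)}

sample_counts = {'GARY': 1, 'LOCKE': 8}
with tempfile.TemporaryDirectory() as folder:
    saved_path = save_counts(sample_counts, 1, 1, folder)
    reloaded = load_counts(saved_path)
print(os.path.basename(saved_path), reloaded)
```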

## References
* https://stackoverflow.com/questions/3845423/remove-empty-strings-from-a-list-of-strings
* https://stackoverflow.com/questions/4998629/split-string-with-multiple-delimiters-in-python
* https://stackoverflow.com/questions/4260280/if-else-in-a-list-comprehension
* https://gist.github.com/jbsulli/03df3cdce94ee97937ebda0ffef28287
* https://dev.to/alexmercedcoder/all-about-parquet-part-08-reading-and-writing-parquet-files-in-python-338d