- Gathers transcripts and saves them in .txt files
- Also saves opening line data in opening_lines.csv
- Gathers episode data and saves it in episode_list.csv
- Performs some cleaning (e.g. stripping the quotation marks around titles)
- Merges episode_list.csv and opening_lines.csv

Notes
- Some manual editing of data was done (some titles were slightly mismatched between the .txt files and episode_list.csv)
- Saving of files has been commented out or changed to save to a different file name to avoid getting rid of the original files
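
The overwrite protection described in the last note could also be handled in code rather than by commenting out the save calls. The sketch below is one possible approach, not what the notebook does: `safe_path` is a hypothetical helper that returns the requested path if it is free and otherwise appends `_1`, `_2`, ... before the extension.

```python
import os


def safe_path(path):
    """Return `path` if nothing exists there, else the first free `name_N.ext` variant."""
    if not os.path.exists(path):
        return path
    root, ext = os.path.splitext(path)
    n = 1
    # keep counting until we hit a name that is not taken yet
    while os.path.exists(f"{root}_{n}{ext}"):
        n += 1
    return f"{root}_{n}{ext}"
```

With this in place, a call such as `df.to_csv(safe_path('episode_list.csv'), index=False)` would write to `episode_list_1.csv` instead of replacing the original file.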

Transcripts from: https://theinfosphere.org/Episode_Transcript_Listing
Episode Data from: https://en.wikipedia.org/wiki/List_of_Futurama_episodes

In [1]:
import os
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Example/Test

In [3]:
url = "https://theinfosphere.org/Transcript:Lrrreconcilable_Ndndifferences"
r = requests.get(url)
r.raise_for_status()

In [4]:
soup = BeautifulSoup(r.text, "html.parser")
p1 = soup.find_all('th')
next_ep = p1[2].find_next()['href']  # find_next replaces the deprecated findNext alias
title = soup.find_all('table')[1].contents[0].find_all('a')[1]['title']

In [6]:
def convert_to_text(data):
    """Return the stripped text of a tag, minus the 9-character prefix on lines marked with ⨂."""
    text = data.text.strip()
    if "⨂" in text:
        return text[9:]
    return text

In [9]:
# soup.find_all('body')[0].find_all(['dl', 'p']) # to grab scenes as well
transcript_html = soup.find_all('body')[0].find_all(['p'])
transcript = [convert_to_text(text) for text in transcript_html]
transcript[1]

"The Scary Announcer: You're taking a vacation from normalcy, the setting: a weird motel where the bed is stained with mystery, and there's also some mystery floating in the pool. Your key card may not open the exercise room because someone smeared mystery on the lock. But it will open the Scary Door."

In [None]:
next_url = "https://theinfosphere.org" + next_ep

In [97]:
# for testing
url2 = "https://theinfosphere.org/Transcript:Meanwhile"
r2 = requests.get(url2)
r2.raise_for_status()
soup2 = BeautifulSoup(r2.text, "html.parser")
p11 = soup2.find_all('th')
next_ep2 = p11[2].find_next()['href']
next_ep2

'/Episode_Transcript_Listing'

# Get the Transcript Data

In [34]:
next_url = "https://theinfosphere.org/Transcript:Space_Pilot_3000"
i = 1
opening_lines = []

while True:
    r = requests.get(next_url)
    r.raise_for_status()
    
    soup = BeautifulSoup(r.text, "html.parser")
    p1 = soup.find_all('th')
    try:
        next_url = "https://theinfosphere.org" + p1[2].find_next()['href']
    except (IndexError, TypeError, KeyError):
        # no usable "Next" link on this page, so stop crawling
        break
    
    title = soup.find_all('table')[1].contents[0].find_all('a')[1]['title']
    #print(title)
    transcript_html = soup.find_all('body')[0].find_all(['p'])
    transcript = [convert_to_text(text) for text in transcript_html]
    
    # file names have been edited to not write over current files
    with open(os.path.join('transcripts', "newepisode{}.txt".format(i)), "w", encoding="utf-8") as f:
        f.write(title+"\n")
        f.writelines(line + "\n" for line in transcript)
    
    # get opening line from title screen
    scenes = soup.find_all('dd')
    found = False
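    for line in scenes:
        try:
            if 'Opening Credits' in line.text:
                opening_lines.append([i, title, line.text.split(':')[1][:-1]])
                found = True
                break  # one opening line per episode, so stop at the first match
        except IndexError:
            # "Opening Credits" line without a colon-separated caption
            continue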
        
    if not found:
        print(i, title, "didn't find")
    
    i += 1
    if title == "Meanwhile":
        break

pd.DataFrame(opening_lines, columns=['Episode Number', 'Episode', 'Opening Line']).to_csv('newopening_lines.csv', index=False)
        
# The error is caused by the "Next" link being absent on that episode.
# The rest of S7 isn't complete/linked together.

# opening_lines.csv
# note to self: get 81, 85, 89, 91, 101, 115, rest of s7
# check 82-84, 86-88, 90

15 Brannigan, Begin Again didn't find
81 Bender's Game Part 1 didn't find
82 Bender's Game Part 2 didn't find
83 Bender's Game Part 3 didn't find
84 Bender's Game Part 4 didn't find
85 Into the Wild Green Yonder Part 1 didn't find
86 Into the Wild Green Yonder Part 2 didn't find
87 Into the Wild Green Yonder Part 3 didn't find
88 Into the Wild Green Yonder Part 4 didn't find
89 Rebirth didn't find
90 In-A-Gadda-Da-Leela didn't find
91 Attack of the Killer App didn't find
101 The Futurama Holiday Spectacular didn't find
115 The Bots and the Bees didn't find
116 A Farewell to Arms didn't find
117 Decision 3012 didn't find
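
The "didn't find" rows above are episodes where no `dd` element produced a caption. For pages that do match, the extraction step could be made a little more forgiving. The sketch below is an illustration under assumptions, not the notebook's method: `extract_opening_line` is a hypothetical helper, it assumes the caption follows the first colon and may end in a closing bracket, and the sample strings in the usage note are made up rather than copied from the wiki markup. Unlike `split(':')[1][:-1]`, it keeps captions that themselves contain a colon and does not drop the last character unconditionally.

```python
def extract_opening_line(dd_texts, marker="Opening Credits"):
    """Return the caption after the first colon of the first text containing `marker`, else None."""
    for text in dd_texts:
        if marker not in text:
            continue
        _, sep, caption = text.partition(":")
        if not sep:
            continue  # marker present, but no caption on this line
        # drop surrounding whitespace and a trailing "]" if there is one
        return caption.strip().rstrip("]").strip()
    return None
```

For example, `extract_opening_line(["[Opening Credits. Caption: In Color]"])` returns `"In Color"`, and an input without a colon-separated caption returns `None`, which the loop can report as "didn't find".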


In [75]:
# Example
# the transcripts were written as UTF-8, so read them back the same way
with open('transcripts/episode10.txt', encoding='utf-8') as f:
    print(f.read())

A Flight to Remember
Leela: That was the worst delivery ever.
Fry: Yeah. I'm never going to another planet called "Cannibalon"!
Bender: Me neither. [upbeat] Food was good, though.
Farnsworth: Oh, great news, everyone.
Bender: Shove it! We quit!
Farnsworth: In that case I'll have to hire a new crew to go on our company vacation.
Leela: Vacation?
Bender: Alright!
Fry: This is great! I haven't had time off since I was 21 through 24.
Farnsworth: It's just my way of thanking you for not reporting my countless violations of safety and minimum wage laws.
Bender: Aww, you!
Farnsworth: I've booked us all on the maiden voyage of the largest, most luxurious space cruise ship ever built. [He pulls out a brochure.] The Titanic!
Leela: Looks nice.
Fry: Hey, uh, where's my suitcase? [His suitcase flies out from the tube and knocks him over.] Ow!
Poopenmeyer: As Mayor of New New York, it's my pleasure to introduce the honorary captain for the Titanic's maiden voyage. A man who single-handedly defeated

# Get the Episode List Data

In [66]:
url = "https://en.wikipedia.org/wiki/List_of_Futurama_episodes"
r = requests.get(url)
r.raise_for_status()
soup = BeautifulSoup(r.text, "html.parser")

In [62]:
episodes = soup.find_all('tr', 'vevent')
season = 0
episode_data = []

for episode in episodes:
    d1 = episode.contents[0].text # episode number
    d2 = episode.contents[1].text # episode in season
    d3 = episode.contents[2].text # title
    d4 = episode.contents[3].text # director
    d5 = episode.contents[4].text # writer
    d6 = episode.contents[5].text.split('\xa0')[-1] # air date

    if d2 == '1':
        season += 1
    d0 = season
    episode_data.append([d0, d1, d2, d3, d4, d5, d6])

In [65]:
df = pd.DataFrame(episode_data, columns=['Season', 'Episode', 'Episode in Season', 
                                    'Title', 'Director(s)', 'Writer(s)', 'Air Date'])
#df.to_csv('episode_list.csv', index=False)
# Edited in csv for season 5 and to update remaining season #s

In [58]:
# To get rid of " " around titles
df = pd.read_csv('episode_list.csv', encoding="ISO-8859-1")
#df['Title'] = df['Title'].apply(lambda x: x[1:-1]) # doesn't need to happen again
#df.to_csv('episode_list.csv', index=False)

In [72]:
# Merge opening_lines.csv to episode_list.csv
new_df = pd.merge(df, pd.read_csv('opening_lines.csv', encoding="ISO-8859-1"), 
         how='left', left_on='Title', right_on='Episode', copy=True)
new_df = new_df.drop(['Episode Number', 'Episode_y'], axis=1).rename(columns={"Episode_x": "Episode"}) 
#new_df.to_csv('episode_list.csv', index=False)
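
Because the merge joins on exact title strings, a slightly mismatched title silently ends up with an empty opening line. A minimal sketch for spotting such near-misses before editing the CSVs by hand is shown below. `suggest_title_matches` and the `cutoff` value are assumptions for illustration, and the comma-less title in the usage note is a made-up mismatch.

```python
import difflib


def suggest_title_matches(transcript_titles, episode_titles, cutoff=0.8):
    """Map each transcript title missing from `episode_titles` to its closest candidate (or None)."""
    episode_titles = list(episode_titles)
    known = set(episode_titles)
    suggestions = {}
    for title in transcript_titles:
        if title in known:
            continue  # exact match, the merge will pick it up
        close = difflib.get_close_matches(title, episode_titles, n=1, cutoff=cutoff)
        suggestions[title] = close[0] if close else None
    return suggestions
```

Called as `suggest_title_matches(["Brannigan Begin Again"], ["Brannigan, Begin Again"])`, this returns `{"Brannigan Begin Again": "Brannigan, Begin Again"}`. In the notebook the two arguments would be the `Episode` column of opening_lines.csv and the `Title` column of `df`.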

# Season 7

Season 7 isn't complete on the wiki. Some transcripts aren't formatted like the others (all the text runs together with no indication of who is speaking), and some just don't have the link to the next episode.

Episodes with a number to their right are done (in the folder Season 7 for now).

Season 7
- "The Bots and the Bees" - 115
- "A Farewell to Arms" - 116
- "Decision 3012" - 117
- "The Thief of Baghead" https://theinfosphere.org/Transcript:The_Thief_of_Baghead
- "Zapp Dingbat" https://theinfosphere.org/Transcript:Zapp_Dingbat
- "The Butterjunk Effect" https://theinfosphere.org/Transcript:The_Butterjunk_Effect
- "The Six Million Dollar Mon" https://theinfosphere.org/Transcript:The_Six_Million_Dollar_Mon
- "Fun on a Bun" https://theinfosphere.org/Transcript:Fun_on_a_Bun
- "Free Will Hunting" https://theinfosphere.org/Transcript:Free_Will_Hunting
- "Near-Death Wish" https://theinfosphere.org/Transcript:Near-Death_Wish
- "31st Century Fox" https://theinfosphere.org/Transcript:31st_Century_Fox
- "Viva Mars Vegas" https://theinfosphere.org/Transcript:Viva_Mars_Vegas
- "Naturama" https://theinfosphere.org/Transcript:Naturama
- "Forty Percent Leadbelly" https://theinfosphere.org/Transcript:Forty_Percent_Leadbelly
- "2-D Blacktop" https://theinfosphere.org/Transcript:2-D_Blacktop
- "T.: The Terrestrial" https://theinfosphere.org/Transcript:T.:_The_Terrestrial
- "Fry and Leela's Big Fling" https://theinfosphere.org/Transcript:Fry_and_Leela%27s_Big_Fling
- "The Inhuman Torch" https://theinfosphere.org/Transcript:The_Inhuman_Torch
- "Saturday Morning Fun Pit" https://theinfosphere.org/Transcript:Saturday_Morning_Fun_Pit
- "Calculon 2.0" https://theinfosphere.org/Transcript:Calculon_2.0
- "Assie Come Home" https://theinfosphere.org/Transcript:Assie_Come_Home
- "Leela and the Genestalk" https://theinfosphere.org/Transcript:Leela_and_the_Genestalk
- "Game of Tones" https://theinfosphere.org/Transcript:Game_of_Tones
- "Murder on the Planet Express" https://theinfosphere.org/Transcript:Murder_on_the_Planet_Express
- "Stench and Stenchibility" https://theinfosphere.org/Transcript:Stench_and_Stenchibility
- "Meanwhile" https://theinfosphere.org/Transcript:Meanwhile
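
Since the "Next" links are unreliable for this season, the remaining pages could be reached from their titles instead of by crawling. The sketch below only builds the URLs. `transcript_url` is a hypothetical helper, and the pattern (spaces become underscores, other special characters are percent-encoded, colons are kept) is inferred from the URLs listed above rather than from any wiki documentation.

```python
from urllib.parse import quote

BASE = "https://theinfosphere.org/Transcript:"


def transcript_url(title):
    """Build a transcript URL from an episode title, following the pattern of the list above."""
    # spaces become underscores; ":" is kept as-is, an apostrophe becomes %27
    return BASE + quote(title.replace(" ", "_"), safe=":")


remaining = ["The Thief of Baghead", "T.: The Terrestrial", "Fry and Leela's Big Fling"]
urls = [transcript_url(t) for t in remaining]
```

Each URL could then be passed to `requests.get` and parsed with the same `find_all(['p'])` step used in the main loop, though the pages whose text runs together would still need separate handling.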