In [92]:
import pandas as pd
import numpy as np
from selenium.webdriver import Chrome
from bs4 import BeautifulSoup
import requests
import time

## Scraping for Batman TV Show episode details
For this part of the project, I am going to focus on getting plot summaries from the 1960s Batman TV series.  I will use these plot summaries to fine-tune a GPT-2 model that generates original plots for new Batman episodes.

In [198]:
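# Use Selenium to extract links from the episode list page
from selenium.webdriver.common.by import By

browser = Chrome()

url = "https://batman.fandom.com/wiki/List_of_Batman_(1960s_series)_Episodes"

browser.get(url)

# Get links for all episodes
# (find_elements(By...) also works on Selenium 4, where the find_elements_by_* helpers were removed)
all_episodes = [link.get_attribute('href') for link in
                browser.find_elements(By.CSS_SELECTOR, '#mw-content-text > p > a')]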

In [199]:
all_episodes[:10]

['https://batman.fandom.com/wiki/Batman_(1960s_series)',
 'https://batman.fandom.com/wiki/Hi_Diddle_Riddle',
 'https://batman.fandom.com/wiki/Smack_In_The_Middle',
 'https://batman.fandom.com/wiki/Fine_Feathered_Finks',
 'https://batman.fandom.com/wiki/The_Penguin%27s_A_Jinx',
 'https://batman.fandom.com/wiki/The_Joker_Is_Wild',
 'https://batman.fandom.com/wiki/Batman_Is_Riled',
 'https://batman.fandom.com/wiki/Instant_Freeze',
 'https://batman.fandom.com/wiki/Rats_Like_Cheese',
 'https://batman.fandom.com/wiki/Zelda_The_Great']

In [200]:
# Drop the first link: it points to the series overview page, not to an episode
all_episodes = all_episodes[1:]
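
Dropping the first item by position works for this page, but it would silently remove a real episode if the page layout changed. A minimal sketch of filtering by URL instead, assuming the series overview page is the only unwanted link; `SERIES_PAGE` and `drop_series_page` are hypothetical names that are not part of the original notebook:

```python
# Hypothetical helper: remove the series overview link by value, not by position.
SERIES_PAGE = "https://batman.fandom.com/wiki/Batman_(1960s_series)"

def drop_series_page(links, series_page=SERIES_PAGE):
    """Return links without the series page or duplicates, keeping the original order."""
    seen = set()
    kept = []
    for link in links:
        # Anchors without an href come back as None from get_attribute('href')
        if link is None or link == series_page or link in seen:
            continue
        seen.add(link)
        kept.append(link)
    return kept

sample = [
    "https://batman.fandom.com/wiki/Batman_(1960s_series)",
    "https://batman.fandom.com/wiki/Hi_Diddle_Riddle",
    "https://batman.fandom.com/wiki/Smack_In_The_Middle",
    "https://batman.fandom.com/wiki/Hi_Diddle_Riddle",
]
print(drop_series_page(sample))
```

In the notebook this could be called as `all_episodes = drop_series_page(all_episodes)`.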

In [215]:
from selenium.webdriver.common.by import By

def get_page_content(link):
    """Load an episode page and return its title and full content text as a dictionary."""
    browser.get(link)
    title = browser.find_element(By.TAG_NAME, 'h1').text
    content_text = " ".join([i.text for i in browser.find_elements(By.CSS_SELECTOR, '#mw-content-text')])
    # Pause before returning so consecutive requests are spaced out
    time.sleep(10)
    return {'title': title, 'content': content_text}

In [216]:
episodes = []

for link in all_episodes:
    episodes.append(get_page_content(link))
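
A single failed page load would stop the loop above and lose everything scraped so far. A minimal sketch of a more forgiving loop, assuming it is acceptable to skip pages that raise an error; `scrape_all` and its `fetch` and `delay` parameters are hypothetical names, and the fake fetch function exists only to make the example runnable without a browser:

```python
import time

def scrape_all(links, fetch, delay=10):
    """Call fetch(link) for each link, pausing between calls and recording failures."""
    results, failed = [], []
    for i, link in enumerate(links):
        try:
            results.append(fetch(link))
        except Exception as exc:
            # Keep going, but remember which link failed and why
            failed.append((link, repr(exc)))
        # No need to wait after the final link
        if i < len(links) - 1:
            time.sleep(delay)
    return results, failed

def fake_fetch(link):
    """Stand-in for a real page fetch, used only for this demonstration."""
    if link == "bad":
        raise ValueError("page did not load")
    return {"title": link, "content": "..."}

demo_results, demo_failed = scrape_all(["a", "bad", "b"], fake_fetch, delay=0)
print(len(demo_results), "scraped;", len(demo_failed), "failed")
```

With the notebook's own function this would read `episodes, failed = scrape_all(all_episodes, get_page_content)`. If `get_page_content` already pauses internally, pass `delay=0` to avoid waiting twice.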

In [217]:
episodes_df = pd.DataFrame(episodes)

#### Checking to see what content details look like
The scraped text contains the word 'Edit' right after each section heading (it is the wiki's edit link), so I split the full content on 'Edit'.  The chunk that ends with the heading I am looking for ('Plot') is then followed by the chunk holding the plot text.

In [218]:
print(episodes_df.loc[0]['content'].split('Edit')[1])


The series opens at the Republic of Moldavia exhibit, located at the Gotham City World's Fair, the Moldavian prime minister slices into the Moldavian friendship cake and unknowingly causes it to explode, releasing a concealed riddle. At the Gotham City Police Department, Police Commissioner James Gordon (Neil Hamilton) and Chief Miles O'Hara(Stafford Repp) suspect it to the Riddler (Gorshin). They turn to Inspector Bash and all the other senior policemen, but all bow their heads for a moment of silence, they turn to a red phone ("I don't know who he is behind that mask of his, but I do know when we need him and we need him now!"). After a glimpse into the lives of Bruce Wayne (Adam West) and Dick Grayson (Burt Ward) as well as the opening credits, the riddle leads them as Batman and Robin to the Pealeart gallery where they catch the Riddler in the act of taking a cross from its proprietor, Gideon Peale, at gunpoint. They stop him with an explosive but learn to their horror that Riddle

In [226]:
def find_plot(text):
    """Split the content on 'Edit' and return the section that follows the 'Plot' heading, or None if there is none."""
    content_sections = text.split('Edit')
    # Stop one short of the end so content_sections[i + 1] always exists
    for i, section in enumerate(content_sections[:-1]):
        if section.endswith('Plot'):
            return content_sections[i + 1]
    return None
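
To make the split-on-'Edit' idea concrete, here is a small sketch on a made-up string. The sample text is invented for illustration and only assumes, as the notebook does, that each heading is immediately followed by 'Edit'; `find_plot` is restated so the block runs on its own:

```python
def find_plot(text):
    """Return the chunk that follows the chunk ending in 'Plot', or None."""
    sections = text.split('Edit')
    for i, section in enumerate(sections[:-1]):
        if section.endswith('Plot'):
            return sections[i + 1]
    return None

# Invented sample in the same shape as the scraped page text
sample = "Some Episode\nPlotEdit\nBatman foils a scheme.\nCastEdit\nActor list"

print(sample.split('Edit'))
print(repr(find_plot(sample)))
print(find_plot("A page with no plot heading"))
```

The returned chunk still ends with the next heading ('Cast' here), which is why the last line is removed in a later cleaning step. Note also that any word containing 'Edit' inside the plot itself, such as 'Editor', would cut the plot short.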

In [220]:
episodes_df['plot'] = episodes_df['content'].apply(find_plot)

In [221]:
episodes_df.head()

| | content | title | plot |
|---|---|---|---|
| 0 | Hi Diddle Riddle\nWriter(s)\nLorenzo Semple, J... | Hi Diddle Riddle | \nThe series opens at the Republic of Moldavia... |
| 1 | Smack In The Middle\nWriter(s)\nLorenzo Semple... | Smack In The Middle | \nPicking up from the previous night's episode... |
| 2 | Fine Feathered Finks\nWriter(s)\nLorenzo Sempl... | Fine Feathered Finks | \nAwaiting release from prison, the Penguin, t... |
| 3 | The Penguin's A Jinx\nWriter(s)\nLorenzo Sempl... | The Penguin's A Jinx | \nPicking up from the last episode, Bruce Wayn... |
| 4 | The Joker Is Wild\nWriter(s)\nRobert Dozier\nD... | The Joker Is Wild | \nThe story begins with the Joker in prison pi... |


In [222]:
all_plots = episodes_df['plot'].tolist()

In [223]:
# Keep only the episodes where a plot section was found
all_plots = [plot for plot in all_plots if plot is not None]

# Drop the last line of each plot, which is the heading of the section that follows
split_episodes = [plot.split('\n')[:-1] for plot in all_plots]

# Rejoin each plot's lines and drop plots that are empty or just leftover titles
split_episodes = [" ".join(plot) for plot in split_episodes if len(" ".join(plot)) > 5]

# Combine all plots into one corpus
all_ep_text = " ".join(split_episodes)
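
The cleaning steps above can also be sketched as one function and checked on a few invented plots. `clean_plots` and `min_chars` are hypothetical names, and the sample strings are made up to mimic the shape of the scraped text:

```python
def clean_plots(plots, min_chars=5):
    """Drop missing plots, strip each plot's trailing heading line, and skip near-empty results."""
    cleaned = []
    for plot in plots:
        if plot is None:
            continue
        # The last line is the heading of the section that follows the plot
        lines = plot.split("\n")[:-1]
        joined = " ".join(lines)
        if len(joined) > min_chars:
            cleaned.append(joined)
    return cleaned

raw = [
    "\nBatman foils a scheme.\nCast",                 # normal plot
    None,                                             # page without a Plot section
    "\nTBA\nCast",                                    # too short to be a real plot
    "\nRobin is captured.\nHe escapes.\nTrivia",      # multi-line plot
]
corpus = " ".join(clean_plots(raw))
print(repr(corpus))
```

Each cleaned plot keeps a leading space because the scraped text starts with a line break, which mirrors the behaviour of the cells above; calling `.strip()` on each plot would remove it.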

In [225]:
with open('batman66.txt', 'w', encoding='utf-8') as f:
    f.write(all_ep_text)

Great! Now my text is ready to be run through the GPT-2 model on a Google Cloud GPU.