# Trekpedia


Writing a web-scraper to pull all `Star Trek(tm)` series data from Wikipedia.

## Stage 2 - get episode data
Create a separate JSON file of episode data for each series.
For now we keep all seasons of a series in one file, but may split this into one file per season depending on how much data we end up grabbing.
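
As a rough guide to what each file should contain, the sketch below mirrors the keys that the scraping loop in this notebook builds up. The values are illustrative placeholders rather than actual scraper output, and the season key `"1"` is an assumption about what the summary table's header cell contains.

```python
import json

# Illustrative only: the keys mirror the scraping loop, the values are placeholders.
example = {
    "seasons": {
        "1": {  # assumed season key, taken from the summary table's row header
            "total": "29",
            "start": "September 8, 1966 (1966-09-08)",
            "end": "April 13, 1967 (1967-04-13)",
            "episodes": [
                {
                    "num_overall": "1",
                    "num_in_season": "1",
                    "title": "The Man Trap",
                    "link": "https://en.wikipedia.org/wiki/The_Man_Trap",
                    "air_date": "September 8, 1966 (1966-09-08)",
                }
            ],
        }
    }
}

# Same dump settings as the notebook uses when writing each file.
print(json.dumps(example, ensure_ascii=False, indent=4))
```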

In [166]:
# common setup...
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
import re

# don't truncate Pandas.DataFrame cell contents when displaying.
pd.set_option('display.max_colwidth', None)

In [167]:
# read in the data from stage 1 ...
df = pd.read_json('star_trek_series_info_stage_1.json', orient='index')
df

Unnamed: 0,name,url,season_count,episode_count,episodes_url,dates,logo
0,The Original Series,https://en.wikipedia.org/wiki/Star_Trek:_The_Original_Series,3,79,https://en.wikipedia.org/wiki/List_of_Star_Trek:_The_Original_Series_episodes,"September 8, 1966 - June 3, 1969",https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Star_Trek_TOS_logo.svg/220px-Star_Trek_TOS_logo.svg.png
1,The Animated Series,https://en.wikipedia.org/wiki/Star_Trek:_The_Animated_Series,2,22,https://en.wikipedia.org/wiki/Star_Trek:_The_Animated_Series,"September 8, 1973 - October 12, 1974",https://upload.wikimedia.org/wikipedia/commons/thumb/0/03/Star_Trek_TAS_logo.svg/220px-Star_Trek_TAS_logo.svg.png
2,The Next Generation,https://en.wikipedia.org/wiki/Star_Trek:_The_Next_Generation,7,178,https://en.wikipedia.org/wiki/List_of_Star_Trek:_The_Next_Generation_episodes,"September 28, 1987 - May 23, 1994",https://upload.wikimedia.org/wikipedia/commons/thumb/1/15/Star_Trek_The_Next_Generation_Logo.svg/220px-Star_Trek_The_Next_Generation_Logo.svg.png
3,Deep Space Nine,https://en.wikipedia.org/wiki/Star_Trek:_Deep_Space_Nine,7,176,https://en.wikipedia.org/wiki/List_of_Star_Trek:_Deep_Space_Nine_episodes,"January 4, 1993 - May 31, 1999",https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/Star_Trek_DS9_logo.svg/220px-Star_Trek_DS9_logo.svg.png
4,Voyager,https://en.wikipedia.org/wiki/Star_Trek:_Voyager,7,172,https://en.wikipedia.org/wiki/List_of_Star_Trek:_Voyager_episodes,"January 16, 1995 - May 23, 2001",https://upload.wikimedia.org/wikipedia/en/thumb/e/e2/Star_Trek_Voyager_Logo.svg/220px-Star_Trek_Voyager_Logo.svg.png
5,Enterprise,https://en.wikipedia.org/wiki/Star_Trek:_Enterprise,4,98,https://en.wikipedia.org/wiki/List_of_Star_Trek:_Enterprise_episodes,"September 26, 2001 - May 13, 2005",https://upload.wikimedia.org/wikipedia/commons/thumb/0/08/Star_Trek_ENT_logo.svg/220px-Star_Trek_ENT_logo.svg.png
6,Discovery,https://en.wikipedia.org/wiki/Star_Trek:_Discovery,3,42,https://en.wikipedia.org/wiki/List_of_Star_Trek:_Discovery_episodes,"September 24, 2017 - present",https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/Star_Trek_Discovery_logo.svg/220px-Star_Trek_Discovery_logo.svg.png
7,Short Treks,https://en.wikipedia.org/wiki/Star_Trek:_Short_Treks,2,10,https://en.wikipedia.org/wiki/Star_Trek:_Short_Treks,"October 4, 2018 - January 9, 2020",
8,Picard,https://en.wikipedia.org/wiki/Star_Trek:_Picard,1,10,https://en.wikipedia.org/wiki/Star_Trek:_Picard,"January 23, 2020 - present",https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Star_Trek_Picard_logo.svg/220px-Star_Trek_Picard_logo.svg.png
9,Lower Decks,https://en.wikipedia.org/wiki/Star_Trek:_Lower_Decks,1,10,https://en.wikipedia.org/wiki/Star_Trek:_Lower_Decks,"August 6, 2020 - present",https://upload.wikimedia.org/wikipedia/commons/thumb/5/57/Star_Trek_LD_logo.svg/220px-Star_Trek_LD_logo.svg.png


In [172]:
# set up a filename template...
FILE_TEMPLATE = 'output/star_trek_series_{}_{}_episodes.json'

for row in df.itertuples():
#     if row.Index !=0:
#         break
    
    print(f'Processing : {row.name}')
    filename = FILE_TEMPLATE.format(row.Index, row.name.replace(" ", "_").lower())
    print(f"  -> Using URL : {row.episodes_url}")
    print(f"  -> Storing episodes to '{filename}'")
    
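    season_final = dict()
    season_all = dict()
    # initialise here so the except block below can always print it,
    # and never shows a stale list from the previous series
    episode_list = list()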
    
    # get and parse the webpage...
    result = requests.get(row.episodes_url)
    bs = BeautifulSoup(result.text, 'lxml')
    
    # wrap all this in a try/except block; a few series need special handling...
    try:
        # find the episode summary table, will be the first table with the below classes in the document
        summary_table = bs.find('table', attrs={'class': 'wikitable plainrowheaders'})
        
        summary_rows = summary_table.find('tbody').find_all('tr')[2:]
        
        for season in summary_rows:
            season_data = dict()
            
            link = season.find('th')
            cells = season.find_all('td')         
            
            season_number = link.text
            season_id = link.a['href'][1:]
            season_data['total'] = cells[1].text
            # get start/end dates, collapsing whitespace (including non-breaking spaces).
            # Still need to remove the date in brackets at the end of each
            season_data['start'] = " ".join(cells[2].text.split())
            season_data['end'] = " ".join(cells[3].text.split())
            season_data['episodes'] = list()

            print(season_data)
            
            # now get the actual episodes for this season...
            section = bs.find('span', id=season_id)
            table = section.find_next('table').find('tbody').find_all('tr')[1:]
            # 'table' now consists of one row for each episode, except ds9 and voy who also put summary
            # after each one and confuse things!
            episode_list = list()
            # loop over each episode, getting the relevant data. We may grab more info in the future.
            for episode in table:
                episode_data = dict()
                # protect the next operation - if the th is not found (ie tas, ds9, voy) just skip over this 
                # one as it is a summary...
                try:
                    episode_data['num_overall'] = episode.find('th').text
                except AttributeError:
                    continue
                cells = episode.find_all('td')
                episode_data['num_in_season'] = cells[0].text
                
                # need to do some tweaking, sometimes the first episode is in 2 parts
                # need to detect this and split them. Alternative is to have a hard-coded list, as it
                # happens very rarely.
                
                episode_data['title'] = cells[1].a.text
                episode_data['link'] = f"https://en.wikipedia.org{cells[1].a['href']}"
                episode_data['air_date'] = cells[4].text
                
                episode_list.append(episode_data)
                
            # consolidate into a format suitable for writing to JSON
            season_data['episodes'] = episode_list
            season_all[season_number] = season_data
            season_final['seasons'] = season_all           
    except AttributeError as e:
        print(f"  => Error, need to investigate! ({e}) at line number: {e.__traceback__.tb_lineno}")
        print(episode_list)
    finally:
        # write to json file...
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(season_final, f, ensure_ascii=False, indent=4)


Processing : The Original Series
  -> Using URL : https://en.wikipedia.org/wiki/List_of_Star_Trek:_The_Original_Series_episodes
  -> Storing episodes to 'output/star_trek_series_0_the_original_series_episodes.json'
{'total': '29', 'start': 'September 8, 1966 (1966-09-08)', 'end': 'April 13, 1967 (1967-04-13)', 'episodes': []}
{'total': '26', 'start': 'September 15, 1967 (1967-09-15)', 'end': 'March 29, 1968 (1968-03-29)', 'episodes': []}
{'total': '24', 'start': 'September 20, 1968 (1968-09-20)', 'end': 'June 3, 1969 (1969-06-03)', 'episodes': []}
Processing : The Animated Series
  -> Using URL : https://en.wikipedia.org/wiki/Star_Trek:_The_Animated_Series
  -> Storing episodes to 'output/star_trek_series_1_the_animated_series_episodes.json'
{'total': '16', 'start': 'September 8, 1973 (1973-09-08)', 'end': 'January 12, 1974 (1974-01-12)', 'episodes': []}
{'total': '6', 'start': 'September 7, 1974 (1974-09-07)', 'end': 'October 12, 1974 (1974-10-12)', 'episodes': []}
Processing : The Ne
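
The output above shows that each start/end value still carries the date in brackets that the code comment mentions. One possible way to split it off is sketched below. The helper name `split_air_date` and the regular expression are assumptions for illustration, not part of the notebook, and the pattern only handles a trailing `(YYYY-MM-DD)` suffix like the ones printed above.

```python
import re

# Matches a trailing ISO date in brackets, e.g. " (1966-09-08)".
ISO_SUFFIX = re.compile(r"\s*\((\d{4}-\d{2}-\d{2})\)\s*$")


def split_air_date(raw):
    """Return (display_date, iso_date) for the text of a date cell.

    iso_date is None when no bracketed ISO date is found.
    """
    # Collapse all whitespace, including non-breaking spaces, to single spaces.
    text = " ".join(raw.split())
    match = ISO_SUFFIX.search(text)
    if match is None:
        return text, None
    return text[:match.start()], match.group(1)


print(split_air_date("September 8, 1966 (1966-09-08)"))
# ('September 8, 1966', '1966-09-08')
```

Keeping the ISO form as a separate value, rather than discarding it, would make later sorting or date parsing easier.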

## Current Bugs
1. Some 2-part episodes get the wrong season and overall numbers because of the table layout.
2. At least DS9 (from season 4), Voyager and Enterprise add a 'stardate' column, which changes the column count and therefore breaks the 'Original Air Date' field. Voyager also adds a 'featured character' column, which compounds the problem.
3. The series from Discovery to Lower Decks error out on line 66 because of further formatting changes.
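
For bug 2, one possible approach is to look up the air-date column by its header text instead of a fixed position. The sketch below works on plain lists of header strings so it can be tried without fetching a page. The helper names and the two header layouts are assumptions for illustration, not scraped output.

```python
def normalise(text):
    """Lower-case and collapse whitespace (including non-breaking spaces)."""
    return " ".join(text.split()).lower()


def find_column(headers, keyword):
    """Return the index of the first header containing `keyword`, or None."""
    for index, header in enumerate(headers):
        if keyword in normalise(header):
            return index
    return None


# Assumed header layouts, with and without the extra 'Stardate' column.
plain = ["No. overall", "No. in season", "Title",
         "Directed by", "Written by", "Original air date"]
with_stardate = ["No. overall", "No. in season", "Title", "Stardate",
                 "Directed by", "Written by", "Original air date"]

print(find_column(plain, "air date"))          # 5
print(find_column(with_stardate, "air date"))  # 6
```

In the scraping loop, the headers could be read from the `th` cells of each episode table's first row, and each episode row collected with `episode.find_all(['th', 'td'])` so that the positions line up with the headers. Rows affected by merged cells (bug 1) would still be misaligned and need separate handling.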