# Trekpedia


A web scraper that pulls data for every `Star Trek(tm)` series from Wikipedia.

## Stage 2 - get episode data
Create a separate JSON file containing the episode data for each series.
For now, all seasons of a series are kept in one file; this may be split into one file per season, depending on how much data we end up collecting.
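
As a rough sketch of the "one file per series, all seasons together" idea, the snippet below writes and re-reads such a file. The `save_series` and `load_series` helpers and the key names (`name`, `seasons`, `total`, `episodes`) are assumptions made for illustration, not the layout this notebook finally settles on. The season totals and dates are taken from the summary output further down; the episode lists are left empty because nothing has been scraped yet.

```python
import json
import tempfile
from pathlib import Path


def save_series(path, name, seasons):
    """Write one series (all seasons together) to a single JSON file."""
    payload = {"name": name, "seasons": seasons}
    text = json.dumps(payload, indent=2, ensure_ascii=False)
    Path(path).write_text(text, encoding="utf-8")
    return payload


def load_series(path):
    """Read a series file back into a plain dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Hypothetical layout: seasons keyed by their number as a string, because
# JSON object keys are always strings once the file is read back.
seasons = {
    "1": {"total": 29, "start": "1966-09-08", "end": "1967-04-13", "episodes": []},
    "2": {"total": 26, "start": "1967-09-15", "end": "1968-03-29", "episodes": []},
}

# Written to a temporary directory so the sketch does not touch real data files.
out_file = Path(tempfile.mkdtemp()) / "star_trek_series_0_the_original_series_episodes.json"
save_series(out_file, "The Original Series", seasons)
restored = load_series(out_file)
print(sorted(restored["seasons"]))
```

Nesting the seasons under one key keeps a later move to one file per season simple: each value in `seasons` could be written out on its own without changing its shape.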

In [1]:
# common setup...
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
import re

# don't truncate Pandas.DataFrame cell contents when displaying.
pd.set_option('display.max_colwidth', None)

In [52]:
# read in the data from stage 1 ...
df = pd.read_json('star_trek_series_info_stage_1.json', orient='index')
df

Unnamed: 0,name,url,season_count,episode_count,episodes_url,dates,logo
0,The Original Series,https://en.wikipedia.org/wiki/Star_Trek:_The_Original_Series,3,79,https://en.wikipedia.org/wiki/List_of_Star_Trek:_The_Original_Series_episodes,"September 8, 1966 - June 3, 1969",https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Star_Trek_TOS_logo.svg/220px-Star_Trek_TOS_logo.svg.png
1,The Animated Series,https://en.wikipedia.org/wiki/Star_Trek:_The_Animated_Series,2,22,https://en.wikipedia.org/wiki/Star_Trek:_The_Animated_Series,"September 8, 1973 - October 12, 1974",https://upload.wikimedia.org/wikipedia/commons/thumb/0/03/Star_Trek_TAS_logo.svg/220px-Star_Trek_TAS_logo.svg.png
2,The Next Generation,https://en.wikipedia.org/wiki/Star_Trek:_The_Next_Generation,7,178,https://en.wikipedia.org/wiki/List_of_Star_Trek:_The_Next_Generation_episodes,"September 28, 1987 - May 23, 1994",https://upload.wikimedia.org/wikipedia/commons/thumb/1/15/Star_Trek_The_Next_Generation_Logo.svg/220px-Star_Trek_The_Next_Generation_Logo.svg.png
3,Deep Space Nine,https://en.wikipedia.org/wiki/Star_Trek:_Deep_Space_Nine,7,176,https://en.wikipedia.org/wiki/List_of_Star_Trek:_Deep_Space_Nine_episodes,"January 4, 1993 - May 31, 1999",https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/Star_Trek_DS9_logo.svg/220px-Star_Trek_DS9_logo.svg.png
4,Voyager,https://en.wikipedia.org/wiki/Star_Trek:_Voyager,7,172,https://en.wikipedia.org/wiki/List_of_Star_Trek:_Voyager_episodes,"January 16, 1995 - May 23, 2001",https://upload.wikimedia.org/wikipedia/en/thumb/e/e2/Star_Trek_Voyager_Logo.svg/220px-Star_Trek_Voyager_Logo.svg.png
5,Enterprise,https://en.wikipedia.org/wiki/Star_Trek:_Enterprise,4,98,https://en.wikipedia.org/wiki/List_of_Star_Trek:_Enterprise_episodes,"September 26, 2001 - May 13, 2005",https://upload.wikimedia.org/wikipedia/commons/thumb/0/08/Star_Trek_ENT_logo.svg/220px-Star_Trek_ENT_logo.svg.png
6,Discovery,https://en.wikipedia.org/wiki/Star_Trek:_Discovery,3,42,https://en.wikipedia.org/wiki/List_of_Star_Trek:_Discovery_episodes,"September 24, 2017 - present",https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/Star_Trek_Discovery_logo.svg/220px-Star_Trek_Discovery_logo.svg.png
7,Short Treks,https://en.wikipedia.org/wiki/Star_Trek:_Short_Treks,2,10,https://en.wikipedia.org/wiki/Star_Trek:_Short_Treks,"October 4, 2018 - January 9, 2020",
8,Picard,https://en.wikipedia.org/wiki/Star_Trek:_Picard,1,10,https://en.wikipedia.org/wiki/Star_Trek:_Picard_(season_1),"January 23, 2020 - present",https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Star_Trek_Picard_logo.svg/220px-Star_Trek_Picard_logo.svg.png
9,Lower Decks,https://en.wikipedia.org/wiki/Star_Trek:_Lower_Decks,1,10,https://en.wikipedia.org/wiki/Star_Trek:_Lower_Decks_(season_1),"August 6, 2020 - present",https://upload.wikimedia.org/wikipedia/commons/thumb/5/57/Star_Trek_LD_logo.svg/220px-Star_Trek_LD_logo.svg.png
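
Each `episodes_url` above is expected to contain a season summary table, which the next cell looks up by its CSS classes. The sketch below runs the same lookup offline so it can be tried without sending requests to Wikipedia. `SAMPLE_HTML` is a hand-written stand-in for the real markup (an assumption, not a copy of the live page), and `parse_summary` and `iso_date` are hypothetical helper names. It uses the built-in `html.parser` instead of `lxml` only to avoid an extra dependency.

```python
import re
from bs4 import BeautifulSoup

# Hand-written approximation of a Wikipedia "series overview" table (assumed markup).
SAMPLE_HTML = """
<table class="wikitable plainrowheaders">
<tbody>
<tr><th colspan="2" rowspan="2">Season</th><th rowspan="2">Episodes</th><th colspan="2">Originally aired</th></tr>
<tr><th>First aired</th><th>Last aired</th></tr>
<tr>
<td style="background:#6699CC"></td>
<th><a href="#Season_1_(1966–1967)">1</a></th>
<td>29</td>
<td>September 8, 1966<span style="display:none"> (<span>1966-09-08</span>)</span></td>
<td>April 13, 1967<span style="display:none"> (<span>1967-04-13</span>)</span></td>
</tr>
</tbody>
</table>
"""

# Matches the machine-readable date in text such as "September 8, 1966 (1966-09-08)".
ISO_DATE = re.compile(r"\((\d{4}-\d{2}-\d{2})\)")


def iso_date(text):
    """Return the YYYY-MM-DD part if present, otherwise the stripped text."""
    match = ISO_DATE.search(text)
    return match.group(1) if match else text.strip()


def parse_summary(html):
    """Extract one dict per season from the first matching summary table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"class": "wikitable plainrowheaders"})
    found = []
    # The first two rows are header rows, so skip them.
    for row in table.find("tbody").find_all("tr")[2:]:
        header = row.find("th")
        cells = row.find_all("td")  # cells[0] is the colour swatch
        found.append({
            "season": header.get_text(strip=True),
            "anchor": header.a["href"],
            "episodes": int(cells[1].get_text(strip=True)),
            "first_aired": iso_date(cells[2].get_text()),
            "last_aired": iso_date(cells[3].get_text()),
        })
    return found


seasons_found = parse_summary(SAMPLE_HTML)
print(seasons_found)
```

Keeping only the ISO date avoids the doubled form, for example `September 8, 1966 (1966-09-08)`, that comes from reading the cell text directly.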


In [67]:
# set up a filename template...
FILE_TEMPLATE = 'star_trek_series_{}_{}_episodes.json'

for row in df.itertuples():
    print(f'Processing : {row.name}')
    filename = FILE_TEMPLATE.format(row.Index, row.name.replace(" ", "_").lower())
    print(f"  -> Using URL : {row.episodes_url}")
    print(f"  -> Storing episodes to '{filename}'")

    # uncomment to break out after the first series, just to lessen requests to Wikipedia...
#     if not row.Index == 0:
#         break
    
    # get and parse the webpage...
    result = requests.get(row.episodes_url)
    bs = BeautifulSoup(result.text, 'lxml')
    
    # wrap all this in a try/except block, there are a few series which need special handling...
    try:
        # find the episode summary table: the first table in the document with the classes below.
        # the first two rows are skipped as they are header rows.
        summary_rows = bs.find('table', attrs={'class': 'wikitable plainrowheaders'}).find('tbody').find_all('tr')[2:]
        for season in summary_rows:
            link = season.find('th')
            cells = season.find_all('td')
            season_number = link.text
            season_id = link.a['href']

            print(f'Season: {season_number}, (Table ID is "{season_id}")')
            print(f'  -> {cells[1].text} episodes, from {cells[2].text} to  {cells[3].text}')
    except (AttributeError, TypeError, IndexError):
        # AttributeError: the table, tbody or header cell was not found.
        # TypeError: the header cell has no link. IndexError: the row has too few cells.
        print("Error, need to investigate!")

    print("\n-=<[ . ]>=-\n")

Processing : The Original Series
  -> Using URL : https://en.wikipedia.org/wiki/List_of_Star_Trek:_The_Original_Series_episodes
  -> Storing episodes to 'star_trek_series_0_the_original_series_episodes.json'
Season: 1, (Table ID is "#Season_1_(1966–1967)")
  -> 29 episodes, from September 8, 1966 (1966-09-08) to  April 13, 1967 (1967-04-13)
Season: 2, (Table ID is "#Season_2_(1967–1968)")
  -> 26 episodes, from September 15, 1967 (1967-09-15) to  March 29, 1968 (1968-03-29)
Season: 3, (Table ID is "#Season_3_(1968–1969)")
  -> 24 episodes, from September 20, 1968 (1968-09-20) to  June 3, 1969 (1969-06-03)

-=<[ . ]>=-

Processing : The Animated Series
  -> Using URL : https://en.wikipedia.org/wiki/Star_Trek:_The_Animated_Series
  -> Storing episodes to 'star_trek_series_1_the_animated_series_episodes.json'
Season: 1, (Table ID is "#Season_1_(1973–74)")
  -> 16 episodes, from September 8, 1973 (1973-09-08) to  January 12, 1974 (1974-01-12)
Season: 2, (Table ID is "#Season_2_(1974)")
  -