# Trekpedia


Writing a web-scraper to pull all `Star Trek(tm)` series data from Wikipedia.

## Stage 1 - get Series data
Create a JSON file listing each separate Star Trek `series` with its relevant metadata.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
import re

# don't truncate Pandas.DataFrame cell contents when displaying.
pd.set_option('display.max_colwidth', None)

MAIN_URL="https://en.wikipedia.org/wiki/Star_Trek"

result = requests.get(MAIN_URL)
bs = BeautifulSoup(result.text, 'lxml')

In [2]:
# helper function to grab the logo for the specified series if it exists.
# we do this from the main page as the logos on individual pages are crap for Discovery and Short Treks.
def get_logo(name):
    # Short Treks has no logo on this page, return empty string for now...
    if name == "Short Treks":
        return ""
    # raw f-string, so '\(' reaches the regex engine as an escaped bracket
    # instead of triggering an invalid escape sequence warning.
    span = bs.find('span', attrs={"class": "mw-headline"}, id=re.compile(rf'{name.replace(" ", "_")}_\('))
    logo = span.findNext('img', attrs={"class": 'thumbimage'})['src']
    logo_url = f'https:{logo}'
    return logo_url

# find the TV section by ID..
# get the parent of this so we can search on from this point.
# tv = bs.find('span', attrs={'id': 'Television'}).parent
tv = bs.find(id="Television").parent

# find the next table directly after this...
trek_table = tv.findNext('table').find('tbody')

# get all the rows, skip the first one
series_rows = trek_table.find_all('tr')[1:]

# create a dictionary, containing a dictionary for each series.
# the index is used as an ID so we can link it to the episode data later.
series_all = dict()
for index, series in enumerate(series_rows):
    series_dict = dict()
    series_dict['name'] = series.th.a.text
    series_dict['url'] = f'https://en.wikipedia.org{series.th.a["href"]}'
    series_dict['season_count'] = series.find_all('td')[0].text
    series_dict['episode_count'] = series.find_all('td')[1].text
    series_dict['episodes_url'] = ''
    dates = series.find_all('td')[2].text.split('(')[0].strip().replace(u'\u2013', '-')
    # get the unicode stuff out of the string...
    dates = ' '.join(dates.split())
    series_dict['dates'] = dates
    series_dict['logo'] = get_logo(series_dict['name'])
    
    series_all[index] = series_dict

In [3]:
series_all

{0: {'name': 'The Original Series',
  'url': 'https://en.wikipedia.org/wiki/Star_Trek:_The_Original_Series',
  'season_count': '3',
  'episode_count': '79',
  'episodes_url': '',
  'dates': 'September 8, 1966 - June 3, 1969',
  'logo': 'https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Star_Trek_TOS_logo.svg/220px-Star_Trek_TOS_logo.svg.png'},
 1: {'name': 'The Animated Series',
  'url': 'https://en.wikipedia.org/wiki/Star_Trek:_The_Animated_Series',
  'season_count': '2',
  'episode_count': '22',
  'episodes_url': '',
  'dates': 'September 8, 1973 - October 12, 1974',
  'logo': 'https://upload.wikimedia.org/wikipedia/commons/thumb/0/03/Star_Trek_TAS_logo.svg/220px-Star_Trek_TAS_logo.svg.png'},
 2: {'name': 'The Next Generation',
  'url': 'https://en.wikipedia.org/wiki/Star_Trek:_The_Next_Generation',
  'season_count': '7',
  'episode_count': '178',
  'episodes_url': '',
  'dates': 'September 28, 1987 - May 23, 1994',
  'logo': 'https://upload.wikimedia.org/wikipedia/commons/th

## Next steps.

Now that we have the data on each series, we need to drill down into their individual pages to get details for each season. We will use the single-page episode summary for each series instead of the individual season pages (for now), since some series don't have season pages.

### Work in progress

In [4]:
# define a helper function to get the episode links for each series..
def get_season_links(url):
    # a list of series that need different handling...
    exceptions = ['Animated', 'Short_Treks']
    if any(x in url for x in exceptions):
        # very specific cases, they have episode data in the original page so just return that...
        return url
    series_page = requests.get(url)
    bs = BeautifulSoup(series_page.text, 'lxml')
    # get all the Heading rows depending on series. Wikipedia is not consistent...
    h = bs.find_all('h3' if 'Enterprise' in url else 'h2')
    # default to an empty string so a page with no matching heading does not raise a NameError.
    episodes = ''
    for heading in h:
        headline = heading.find('span', attrs={'class': 'mw-headline'}, id=re.compile('pisode'))
        if headline is not None:
            try:
                episodes = headline.findNext('div', attrs={'role': 'note'}).a['href']
            except (AttributeError, TypeError):
                # no note after the heading, or the note contains no link.
                episodes = ''
    # only build a full URL when a link was actually found.
    return f'https://en.wikipedia.org{episodes}' if episodes else ''

In [5]:
for series in series_all:
    links = get_season_links(series_all[series]['url'])
    if links:
        series_all[series]['episodes_url'] = links

# save this list to a JSON file.
with open('star_trek_series_info_stage_1.json', 'w', encoding='utf-8') as f:
    json.dump(series_all, f, ensure_ascii=False, indent=4)
print("Done.")

Done.


In [6]:
# series_all

In [7]:
df = pd.read_json('star_trek_series_info_stage_1.json', orient='index')
df

index,name,url,season_count,episode_count,episodes_url,dates,logo
0,The Original Series,https://en.wikipedia.org/wiki/Star_Trek:_The_Original_Series,3,79,https://en.wikipedia.org/wiki/List_of_Star_Trek:_The_Original_Series_episodes,"September 8, 1966 - June 3, 1969",https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Star_Trek_TOS_logo.svg/220px-Star_Trek_TOS_logo.svg.png
1,The Animated Series,https://en.wikipedia.org/wiki/Star_Trek:_The_Animated_Series,2,22,https://en.wikipedia.org/wiki/Star_Trek:_The_Animated_Series,"September 8, 1973 - October 12, 1974",https://upload.wikimedia.org/wikipedia/commons/thumb/0/03/Star_Trek_TAS_logo.svg/220px-Star_Trek_TAS_logo.svg.png
2,The Next Generation,https://en.wikipedia.org/wiki/Star_Trek:_The_Next_Generation,7,178,https://en.wikipedia.org/wiki/List_of_Star_Trek:_The_Next_Generation_episodes,"September 28, 1987 - May 23, 1994",https://upload.wikimedia.org/wikipedia/commons/thumb/1/15/Star_Trek_The_Next_Generation_Logo.svg/220px-Star_Trek_The_Next_Generation_Logo.svg.png
3,Deep Space Nine,https://en.wikipedia.org/wiki/Star_Trek:_Deep_Space_Nine,7,176,https://en.wikipedia.org/wiki/List_of_Star_Trek:_Deep_Space_Nine_episodes,"January 4, 1993 - May 31, 1999",https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/Star_Trek_DS9_logo.svg/220px-Star_Trek_DS9_logo.svg.png
4,Voyager,https://en.wikipedia.org/wiki/Star_Trek:_Voyager,7,172,https://en.wikipedia.org/wiki/List_of_Star_Trek:_Voyager_episodes,"January 16, 1995 - May 23, 2001",https://upload.wikimedia.org/wikipedia/en/thumb/e/e2/Star_Trek_Voyager_Logo.svg/220px-Star_Trek_Voyager_Logo.svg.png
5,Enterprise,https://en.wikipedia.org/wiki/Star_Trek:_Enterprise,4,98,https://en.wikipedia.org/wiki/List_of_Star_Trek:_Enterprise_episodes,"September 26, 2001 - May 13, 2005",https://upload.wikimedia.org/wikipedia/commons/thumb/0/08/Star_Trek_ENT_logo.svg/220px-Star_Trek_ENT_logo.svg.png
6,Discovery,https://en.wikipedia.org/wiki/Star_Trek:_Discovery,3,42,https://en.wikipedia.org/wiki/List_of_Star_Trek:_Discovery_episodes,"September 24, 2017 - present",https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/Star_Trek_Discovery_logo.svg/220px-Star_Trek_Discovery_logo.svg.png
7,Short Treks,https://en.wikipedia.org/wiki/Star_Trek:_Short_Treks,2,10,https://en.wikipedia.org/wiki/Star_Trek:_Short_Treks,"October 4, 2018 - January 9, 2020",
8,Picard,https://en.wikipedia.org/wiki/Star_Trek:_Picard,1,10,https://en.wikipedia.org/wiki/Star_Trek:_Picard_(season_1),"January 23, 2020 - present",https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Star_Trek_Picard_logo.svg/220px-Star_Trek_Picard_logo.svg.png
9,Lower Decks,https://en.wikipedia.org/wiki/Star_Trek:_Lower_Decks,1,10,https://en.wikipedia.org/wiki/Star_Trek:_Lower_Decks_(season_1),"August 6, 2020 - present",https://upload.wikimedia.org/wikipedia/commons/thumb/5/57/Star_Trek_LD_logo.svg/220px-Star_Trek_LD_logo.svg.png


## Problems with solutions!
Sadly, Wikipedia's layout is inconsistent across the series: what works for TOS, TNG etc. does not work for DS9, VOY etc.
The method for getting the episode list now returns a single page per series, which we will parse for the info, instead of a list with one page per season, since only some series have per-season pages.
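
The inconsistency shows up in the stage 1 output: some `episodes_url` values point to a "List of ... episodes" page, some to the series page itself, and some to a single season page. As a rough sketch, a small helper can label each case so the later parsing step knows what it is dealing with. The function name `classify_episodes_url` and its labels are assumptions made for illustration, not part of the scraper above.

```python
# Hypothetical helper: label what kind of page an 'episodes_url' points to.
def classify_episodes_url(series_url, episodes_url):
    """Return a rough label for the page that holds the episode data."""
    if not episodes_url:
        return "missing"
    if episodes_url == series_url:
        # e.g. The Animated Series and Short Treks in the stage 1 output.
        return "series page"
    page = episodes_url.rsplit("/", 1)[-1]
    if page.startswith("List_of_"):
        return "episode list page"
    if "(season_" in page:
        return "single season page"
    return "other"


# Three (url, episodes_url) pairs copied from the stage 1 output.
samples = {
    "The Original Series": (
        "https://en.wikipedia.org/wiki/Star_Trek:_The_Original_Series",
        "https://en.wikipedia.org/wiki/List_of_Star_Trek:_The_Original_Series_episodes",
    ),
    "The Animated Series": (
        "https://en.wikipedia.org/wiki/Star_Trek:_The_Animated_Series",
        "https://en.wikipedia.org/wiki/Star_Trek:_The_Animated_Series",
    ),
    "Picard": (
        "https://en.wikipedia.org/wiki/Star_Trek:_Picard",
        "https://en.wikipedia.org/wiki/Star_Trek:_Picard_(season_1)",
    ),
}

labels = {name: classify_episodes_url(*urls) for name, urls in samples.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

Series labelled "single season page" (Picard and Lower Decks in the output above) will probably need their own handling once they have more than one season.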

<!-- The method above only seems to work for 0, 2, 8 & 9, so Animated and DS9 to 'Short Treks' need another method. -->

Also note that Wikipedia lists seasons for Picard, Lower Decks etc. that have not been released yet and are only placeholders. However, 'season_count' only counts transmitted seasons, so it should still work.
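
One way to use this, sketched under the assumption that season headings appear on the page in order with the released seasons first, is to keep only the first `season_count` of them. The helper name `released_seasons` and the sample headings are made up for illustration.

```python
def released_seasons(season_headings, season_count):
    """Keep only the first `season_count` headings, dropping placeholders."""
    # season_count is stored as a string in series_all (e.g. '3'),
    # so convert it before slicing; an int works here too.
    return season_headings[:int(season_count)]


# Made-up headings; real ones would come from the parsed episode page.
headings = ["Season 1 (2020)", "Season 2", "Season 3"]
print(released_seasons(headings, "1"))  # → ['Season 1 (2020)']
```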

## Progress so far
We have a JSON file containing info for each series:
* link to the Wikipedia main page for that series
* number of seasons
* number of episodes
* link to the episode info page
* link to the series logo if it has one (only 'Short Treks' does not, for now)
* data loaded into a Pandas DataFrame for ease of use

Now we need to further parse this file, getting episode data for each series. Each series will be stored in its own JSON file.
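
As a starting point for that stage, the sketch below reloads the stage 1 file and works out, for each series, which page to parse and which file to write. The file-name pattern and the helper names `series_filename` and `plan_stage_2` are assumptions for illustration; the actual episode parsing is not shown.

```python
import json
import re


def series_filename(name):
    """Build a file name such as 'star_trek_deep_space_nine_episodes.json'."""
    # Hypothetical naming scheme: lower-case the name, non-alphanumerics to '_'.
    slug = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")
    return f"star_trek_{slug}_episodes.json"


def plan_stage_2(path="star_trek_series_info_stage_1.json"):
    """Map each series index to (page to parse, output file name)."""
    with open(path, encoding="utf-8") as f:
        series_all = json.load(f)
    # JSON object keys are always strings, so the integer indexes written
    # by json.dump come back as '0', '1', ...; convert them to int again.
    return {
        int(key): (info["episodes_url"], series_filename(info["name"]))
        for key, info in series_all.items()
    }
```

Calling `plan_stage_2()` after the stage 1 cell has written its file should give one entry per series, ready to loop over.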