# Trekpedia

## Do not Edit this file!
This is a working copy that produces a JSON file for stage 2.



Working notes and tests toward a web scraper that pulls all `Star Trek(tm)` series data from Wikipedia.

In [11]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
import re

MAIN_URL="https://en.wikipedia.org/wiki/Star_Trek"

result = requests.get(MAIN_URL)
bs = BeautifulSoup(result.text, 'lxml')

# find the TV section by ID..
# get the parent of this so we can search on from this point.
# tv = bs.find('span', attrs={'id': 'Television'}).parent
tv = bs.find(id="Television").parent

# find the next table directly after this...
# (find_next is the current spelling of the older findNext alias)
trek_table = tv.find_next('table').find('tbody')

# get all the rows, skip the first one
series_rows = trek_table.find_all('tr')[1:]

# build a dictionary holding one dictionary per series, keyed by an
# index so we can link it to the season and episode data later.
series_all = dict()
for index, series in enumerate(series_rows):
    # look up the data cells once per row instead of once per field
    cells = series.find_all('td')
    series_dict = dict()
    series_dict['name'] = series.th.a.text
    series_dict['url'] = f'https://en.wikipedia.org{series.th.a["href"]}'
    # strip any stray whitespace or newlines around the counts
    series_dict['season_count'] = cells[0].text.strip()
    series_dict['episode_count'] = cells[1].text.strip()
    series_dict['episodes_url'] = ''
    # keep only the date range before any parenthesised note, and swap
    # the en dash for a plain hyphen.
    dates = cells[2].text.split('(')[0].strip().replace(u'\u2013', '-')
    # collapse non-breaking spaces and repeated whitespace into single spaces.
    dates = ' '.join(dates.split())
    series_dict['dates'] = dates

    series_all[index] = series_dict

In [12]:
# series_all
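
The date clean-up used in the loop above can be tried on its own, without fetching the page. This is only a minimal sketch: the helper name `clean_dates` and the sample string (including its `(NBC)` note) are made up for illustration and are not part of the notebook.

```python
def clean_dates(raw):
    """Normalise a raw date-range cell (hypothetical helper)."""
    # drop any parenthesised note that follows the date range
    dates = raw.split('(')[0].strip()
    # swap the en dash for a plain hyphen
    dates = dates.replace('\u2013', '-')
    # str.split() with no arguments also splits on non-breaking spaces,
    # so joining the pieces leaves single ordinary spaces
    return ' '.join(dates.split())


# made-up sample: non-breaking spaces, an en dash and a trailing note
sample = 'September\xa08, 1966\xa0\u2013 June\xa03, 1969 (NBC)\n'
print(clean_dates(sample))
```

Keeping the clean-up in one small function would make it easy to check against odd cells before running the full scrape.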

## Next steps.

Now that we have a JSON file and the resulting DataFrame with the data on each series, we need to drill down into their individual pages to get details for each season. Each season has its own separate page too.

### Work in progress

In [13]:
# define a helper function to get the individual links to each season..
# this function only works for series 1, 3, 9, 10
# def get_season_links1(url):
#     links = []
#     series_page = requests.get(url)
#     bs = BeautifulSoup(series_page.text, 'lxml')
#     # get all the H3
#     h3 = bs.find_all('h3')
#     index = 1
#     for heading in h3:
#         headline = heading.find('span', attrs={'class': 'mw-headline'} )
#         if not headline == None and 'Season ' in headline.text:
#             url = heading.findNext('div', attrs={'role': 'note'}).a['href']
#             links.append({'id': index,'name':headline.text, 'url': f'https://en.wikipedia.org{url}'})
#             index += 1
#     return links

# second method to get the series info for those that don't work with the first
def get_season_links2(url):
    # a list of series that need different handling...
    exceptions = ['Animated', 'Short_Treks']
    series_page = requests.get(url)
    bs = BeautifulSoup(series_page.text, 'lxml')
    print(url.split('/')[-1].replace('_',' '))
    # get all the heading rows depending on the series. Wikipedia is not consistent...
    if 'Enterprise' in url:
        h = bs.find_all(['h3'])
    elif any(x in url for x in exceptions):
        # very specific cases, they have episode data in the original page so just return that...
        return url
    else:
        h = bs.find_all(['h2'])
    # default to an empty string so a page with no matching heading
    # does not raise an UnboundLocalError at the return below.
    episodes = ''
    for heading in h:
        headline = heading.find('span', attrs={'class': 'mw-headline'}, id=re.compile('pisode'))
        if headline is not None:
            try:
                episodes = headline.find_next('div', attrs={'role': 'note'}).a['href']
            except (AttributeError, TypeError):
                # no note div after the heading, or no link inside it
                episodes = ''
    # return '' when nothing was found, so the caller's empty-string check works.
    if not episodes:
        return ''
    return f'https://en.wikipedia.org{episodes}'

In [14]:
for series in series_all:
    # the function returns a single URL (or an empty string), not a list
    episodes_url = get_season_links2(series_all[series]['url'])
    print(episodes_url)
    if episodes_url != "":
        series_all[series]['episodes_url'] = episodes_url

# save this dictionary to a JSON file.
with open('star_trek_series_info-stage1.json', 'w', encoding='utf-8') as f:
    json.dump(series_all, f, ensure_ascii=False, indent=4)

Star Trek: The Original Series
https://en.wikipedia.org/wiki/List_of_Star_Trek:_The_Original_Series_episodes
Star Trek: The Animated Series
https://en.wikipedia.org/wiki/Star_Trek:_The_Animated_Series
Star Trek: The Next Generation
https://en.wikipedia.org/wiki/List_of_Star_Trek:_The_Next_Generation_episodes
Star Trek: Deep Space Nine
https://en.wikipedia.org/wiki/List_of_Star_Trek:_Deep_Space_Nine_episodes
Star Trek: Voyager
https://en.wikipedia.org/wiki/List_of_Star_Trek:_Voyager_episodes
Star Trek: Enterprise
https://en.wikipedia.org/wiki/List_of_Star_Trek:_Enterprise_episodes
Star Trek: Discovery
https://en.wikipedia.org/wiki/List_of_Star_Trek:_Discovery_episodes
Star Trek: Short Treks
https://en.wikipedia.org/wiki/Star_Trek:_Short_Treks
Star Trek: Picard
https://en.wikipedia.org/wiki/Star_Trek:_Picard_(season_1)
Star Trek: Lower Decks
https://en.wikipedia.org/wiki/Star_Trek:_Lower_Decks_(season_1)


In [15]:
# series_all

In [16]:
df = pd.read_json('star_trek_series_info-stage1.json', orient='index')
df

| | name | url | season_count | episode_count | episodes_url | dates |
|---|---|---|---|---|---|---|
| 0 | The Original Series | https://en.wikipedia.org/wiki/Star_Trek:_The_O... | 3 | 79 | https://en.wikipedia.org/wiki/List_of_Star_Tre... | September 8, 1966 - June 3, 1969 |
| 1 | The Animated Series | https://en.wikipedia.org/wiki/Star_Trek:_The_A... | 2 | 22 | https://en.wikipedia.org/wiki/Star_Trek:_The_A... | September 8, 1973 - October 12, 1974 |
| 2 | The Next Generation | https://en.wikipedia.org/wiki/Star_Trek:_The_N... | 7 | 178 | https://en.wikipedia.org/wiki/List_of_Star_Tre... | September 28, 1987 - May 23, 1994 |
| 3 | Deep Space Nine | https://en.wikipedia.org/wiki/Star_Trek:_Deep_... | 7 | 176 | https://en.wikipedia.org/wiki/List_of_Star_Tre... | January 4, 1993 - May 31, 1999 |
| 4 | Voyager | https://en.wikipedia.org/wiki/Star_Trek:_Voyager | 7 | 172 | https://en.wikipedia.org/wiki/List_of_Star_Tre... | January 16, 1995 - May 23, 2001 |
| 5 | Enterprise | https://en.wikipedia.org/wiki/Star_Trek:_Enter... | 4 | 98 | https://en.wikipedia.org/wiki/List_of_Star_Tre... | September 26, 2001 - May 13, 2005 |
| 6 | Discovery | https://en.wikipedia.org/wiki/Star_Trek:_Disco... | 3 | 42 | https://en.wikipedia.org/wiki/List_of_Star_Tre... | September 24, 2017 - present |
| 7 | Short Treks | https://en.wikipedia.org/wiki/Star_Trek:_Short... | 2 | 10 | https://en.wikipedia.org/wiki/Star_Trek:_Short... | October 4, 2018 - January 9, 2020 |
| 8 | Picard | https://en.wikipedia.org/wiki/Star_Trek:_Picard | 1 | 10 | https://en.wikipedia.org/wiki/Star_Trek:_Picar... | January 23, 2020 - present |
| 9 | Lower Decks | https://en.wikipedia.org/wiki/Star_Trek:_Lower... | 1 | 10 | https://en.wikipedia.org/wiki/Star_Trek:_Lower... | August 6, 2020 - present |


# Problems with solutions!
Sadly, Wikipedia is inconsistent with its layout across the series: what works on TOS, TNG etc. does not work on DS9, VOY etc.
The method for getting the list of episodes was therefore changed to return just the one specific page that we will parse for the info, instead of a list with a page for each season, as only some series have that setup.
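
The idea of returning one page per series can be sketched without any network access. This is only an illustration: the helper name `resolve_episodes_url`, the `BASE_URL` constant and the fallback to the series page when no link is found are assumptions, not the notebook's actual code (which returns the series page only for the Animated Series and Short Treks). It uses the standard library's `urljoin` rather than string concatenation to build the absolute URL.

```python
from urllib.parse import urljoin

BASE_URL = 'https://en.wikipedia.org'


def resolve_episodes_url(series_url, href):
    """Return the one page to parse for a series (hypothetical helper).

    `href` is the relative link found under the episodes heading, or an
    empty string when no such link was found.
    """
    if not href:
        # assumption: fall back to the series page itself, as is done
        # for the Animated Series and Short Treks
        return series_url
    return urljoin(BASE_URL, href)


print(resolve_episodes_url(
    'https://en.wikipedia.org/wiki/Star_Trek:_Voyager',
    '/wiki/List_of_Star_Trek:_Voyager_episodes'))
print(resolve_episodes_url(
    'https://en.wikipedia.org/wiki/Star_Trek:_Short_Treks', ''))
```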

<!-- The method above only seems to work for 0, 2, 8 & 9, so Animated and DS9 to 'Short Treks' need another method. -->

Also note that Wikipedia lists seasons for Picard, Lower Decks etc. that have not been released yet and are only placeholders. However, the 'season_count' value only counts transmitted seasons, so it should still work.
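
One way the season count could be used to skip placeholder seasons is sketched here. It assumes the seasons are collected in broadcast order, so that any placeholders sit at the end of the list. The helper name `released_seasons` and the sample links are made up for illustration.

```python
def released_seasons(season_links, season_count):
    """Keep only the first `season_count` entries (hypothetical helper)."""
    # season_count may arrive as a string scraped from the table
    count = int(str(season_count).strip())
    return season_links[:count]


# made-up links: season 2 stands in for an unreleased placeholder
links = [
    {'id': 1, 'name': 'Season 1'},
    {'id': 2, 'name': 'Season 2'},
]
print(released_seasons(links, '1'))
```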

## Progress so far
We have a JSON file with the following info for each series:
* link to the Wikipedia main page for that series
* number of seasons
* number of episodes
* link to the episode info page

The same data is also loaded into a Pandas DataFrame for ease of use.

Now we need to further parse this file, getting episode data for each series. The data for each series will be stored in its own JSON file.
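
A rough sketch of the one-file-per-series step follows. The episode parsing itself is not shown: the `episodes` key is an empty placeholder. The function names, the file naming scheme and the `episodes` key are all assumptions made for illustration, not the layout stage 2 actually uses.

```python
import json
import re
from pathlib import Path


def series_filename(name):
    """Build a file name from a series name (hypothetical naming scheme)."""
    slug = re.sub(r'[^a-z0-9]+', '_', name.lower()).strip('_')
    return f'star_trek_{slug}-stage2.json'


def write_series_files(series_all, out_dir):
    """Write one JSON file per series and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for key, series in series_all.items():
        path = out_dir / series_filename(series['name'])
        # episode parsing is not done here; 'episodes' is an empty placeholder
        payload = {'id': key, **series, 'episodes': []}
        with open(path, 'w', encoding='utf-8') as f:
            json.dump(payload, f, ensure_ascii=False, indent=4)
        written.append(path)
    return written
```

The input would be the dictionary loaded back from `star_trek_series_info-stage1.json` with `json.load`. After a JSON round trip the keys are strings rather than integers.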