## Data Sources
- Memory Alpha for Metadata & Summaries
- Chakoteya.com for Lines

## Tasks

* scrape lines
* scrape episodes list w/ metadata
* scrape episodes' summaries

* cleaning & exploration
    * NAs
    * duplicates
    * dates
    * titles [need to match in both sources]
    * lines & characters (#)
    * text (#)
    
(#) remove website specific parts e.g. copyright statements [pre-cleaning step]
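
Since titles need to match in both sources, one option is to compare them on a normalized key rather than on the raw string. The sketch below is only an illustration: `normalize_title` and `unmatched_titles` are hypothetical helpers, and the normalization rules (drop accents and apostrophes, ignore case and punctuation) are assumptions about how the two sites might differ.

```python
import re
import unicodedata


def normalize_title(title):
    """Reduce an episode title to a comparison key (hypothetical helper)."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Unify typographic apostrophes, then drop apostrophes entirely.
    text = text.replace("\u2019", "'").replace("'", "")
    # Ignore case and collapse every run of other characters to one space.
    text = re.sub(r"[^a-z0-9]+", " ", text.casefold())
    return text.strip()


def unmatched_titles(titles_a, titles_b):
    """Return the titles of each source whose key is missing in the other."""
    keys_a = {normalize_title(t): t for t in titles_a}
    keys_b = {normalize_title(t): t for t in titles_b}
    only_a = sorted(keys_a[k] for k in keys_a.keys() - keys_b.keys())
    only_b = sorted(keys_b[k] for k in keys_b.keys() - keys_a.keys())
    return only_a, only_b
```

Any titles returned by `unmatched_titles` would still need a manual look, since a real mismatch (a renamed or split episode) cannot be resolved by string cleanup alone.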

* explore
    - how many episodes
    - how many lines / per series / per episode / per character
    - how many characters / per series 
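
The counts listed above could be sketched with `collections.Counter` before the data is loaded into a DataFrame. The rows below are made-up placeholders, not scraped data, and the tuple layout `(series, episode, character, line)` is an assumption about how the lines dataset might look.

```python
from collections import Counter

# Made-up placeholder rows: (series, episode, character, line)
rows = [
    ("TOS", "The Man Trap", "KIRK", "Captain's log."),
    ("TOS", "The Man Trap", "SPOCK", "Fascinating."),
    ("TOS", "Charlie X", "KIRK", "Charlie."),
    ("TNG", "Encounter at Farpoint", "PICARD", "Engage."),
]

# how many episodes (a title is only unique within its series)
episodes = {(series, episode) for series, episode, _, _ in rows}

# how many lines per series / per episode / per character
lines_per_series = Counter(series for series, _, _, _ in rows)
lines_per_episode = Counter((series, episode) for series, episode, _, _ in rows)
lines_per_character = Counter((series, character) for series, _, character, _ in rows)

# how many distinct characters per series
characters_per_series = Counter(series for series, _ in lines_per_character)

print(len(episodes), lines_per_series, characters_per_series)
```

Once the lines are in a pandas DataFrame, `df.groupby(...).size()` on the same key columns would give the equivalent counts.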

* build corpus [script/function to clean long text]
    1. expansion
    2. normalization
    3. tokenization
    4. stop words removal
    5. stemming / lemmatization
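
The five steps can be sketched as one small function chain. This is a minimal illustration using only the standard library: the contraction table, stop word list and suffix list are tiny placeholders, and the suffix stripping is a crude stand-in for a real stemmer or lemmatizer.

```python
import re

# Tiny illustrative lookup tables; a real run would need fuller lists.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is", "i'm": "i am"}
STOP_WORDS = {"a", "an", "the", "is", "are", "it", "to", "of", "and", "i", "am", "not", "will"}
SUFFIXES = ("ing", "ed", "es", "s")

_CONTRACTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b", re.IGNORECASE
)


def expand(text):
    """1. Expansion: replace known contractions with their long forms."""
    text = text.replace("\u2019", "'")  # unify typographic apostrophes first
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def normalize(text):
    """2. Normalization: lower-case, keep letters only, collapse whitespace."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """3. Tokenization: split the normalized text on whitespace."""
    return text.split()


def remove_stop_words(tokens):
    """4. Stop words removal."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """5. Stemming: strip one common suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def clean(text):
    """Run all five steps and return the list of stemmed tokens."""
    tokens = remove_stop_words(tokenize(normalize(expand(text))))
    return [stem(t) for t in tokens]


print(clean("It's beaming the landing parties aboard, I can't wait!"))
# ['beam', 'land', 'parti', 'aboard', 'cannot', 'wait']
```

The `parti` token shows why stemming and lemmatization are listed as alternatives: a stemmer only cuts suffixes, while a lemmatizer would map `parties` to the dictionary form `party`.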
    
## Before Modeling

* Datasets
    - episode list w/ metadata
        | Series | Seasons | Episodes | Originally released | In Dataset |
        | :----: | :----: | :----: | :----: | :----: |
    - episode lines joined to transcript w/o characters w/ metadata
        | Series | Seasons | Episodes | Originally released | In Dataset |
        | :----: | :----: | :----: | :----: | :----: |
    - episode lines w/ characters & metadata
        | Series | Seasons | Episodes | Originally released | In Dataset |
        | :----: | :----: | :----: | :----: | :----: |
    - episode summaries w/ metadata
        | Series | Seasons | Episodes | Originally released | In Dataset |
        | :----: | :----: | :----: | :----: | :----: |
    
* Document-Term-Matrix
    - corpus 1: transcripts
    - corpus 2: summaries

## Concepts & mathematical representations

* Document-Term-Matrix
* Bag-of-Words & alternative: Word2Vec
* tf-idf & term frequency, document frequency, idf / log
* normalization
* expansion
* tokenization
* stop words
* stemming
* lemmatization
* term
* document
* corpus
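
To make the terms above concrete: a corpus is a list of documents, a document is a list of terms, and the Document-Term-Matrix holds one row per document and one column per term. The sketch below builds such a matrix from raw counts (Bag-of-Words) and weights it with one common tf-idf variant, $\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \log(N / \text{df}(t))$. Libraries often use smoothed variants of this formula, so the exact numbers are an assumption of this sketch.

```python
import math
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, matrix) with one row of raw term counts per document."""
    counts = [Counter(tokens) for tokens in docs]
    vocabulary = sorted(set().union(*docs))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


def tf_idf(matrix):
    """Weight raw counts by idf(t) = log(N / df(t)), N = number of documents."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    # document frequency: in how many documents does each term occur?
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in matrix]


# Toy corpus of two already tokenized documents
docs = [["warp", "core", "warp"], ["core", "breach"]]
vocabulary, dtm = document_term_matrix(docs)
print(vocabulary)  # ['breach', 'core', 'warp']
print(dtm)         # [[0, 1, 2], [1, 1, 0]]
print(tf_idf(dtm))
```

In the toy corpus `core` occurs in both documents, so its idf is `log(2 / 2) = 0` and it drops out of the weighted matrix, which is the intended effect of the idf term.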

## Scraping

In [163]:
import requests
from bs4 import BeautifulSoup
import re
from collections import defaultdict
import pandas as pd

### Episodes List & Metadata

In [2]:
# Memory Alpha Star Trek root url
root_url = "https://memory-alpha.fandom.com/wiki/Star_Trek:_"

# List of series names
series_names = ["The Original Series", "The Animated Series", "The Next Generation", "Deep Space Nine",
        "Voyager", "Enterprise", "Discovery", "Picard ", "Lower Decks"]

In [6]:
# create list of links to series pages
series_links = []

for name in series_names:
    name = name.replace(' ', '_')
    series_links.append(root_url + name)
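
Because every space is replaced by an underscore, a stray trailing space in a series name (as in `"Picard "`) ends up as a trailing underscore in the link. A slightly more defensive variant could look like the sketch below; `series_url` is a hypothetical helper, not part of the notebook, and the percent-encoding step is an assumption about what a wiki page name may need.

```python
from urllib.parse import quote

ROOT_URL = "https://memory-alpha.fandom.com/wiki/Star_Trek:_"


def series_url(name, root_url=ROOT_URL):
    """Build a series page URL from a display name (hypothetical helper)."""
    # split() drops leading/trailing whitespace and collapses inner runs,
    # so "Picard " becomes "Picard" instead of "Picard_".
    slug = "_".join(name.split())
    # Percent-encode characters that are not URL-safe.
    return root_url + quote(slug)


print(series_url("Picard "))
print(series_url("The Original Series"))
```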

In [180]:
for i, name in enumerate(series_names):
    # create link to series main page
    name = name.replace(' ', '_')
    url = root_url + name
    # get page content
    website_url = requests.get(url).text
    soup = BeautifulSoup(website_url, 'lxml')
    # get series abbreviation
    abbr = soup.find("aside", {"class": "portable-infobox"}) \
        .find("div", {"data-source": "abbr"}) \
        .find("div", {"class": "pi-data-value"}).get_text(strip=True)
    abbr = re.sub('[^A-Za-z0-9]+', '', abbr)
    print(series_names[i])
    print(abbr)
    print(root_url + name)

The Original Series
TOS
https://memory-alpha.fandom.com/wiki/Star_Trek:_The_Original_Series
The Animated Series
TAS
https://memory-alpha.fandom.com/wiki/Star_Trek:_The_Animated_Series
The Next Generation
TNG
https://memory-alpha.fandom.com/wiki/Star_Trek:_The_Next_Generation
Deep Space Nine
DS9
https://memory-alpha.fandom.com/wiki/Star_Trek:_Deep_Space_Nine
Voyager
VOY
https://memory-alpha.fandom.com/wiki/Star_Trek:_Voyager
Enterprise
ENT
https://memory-alpha.fandom.com/wiki/Star_Trek:_Enterprise
Discovery
DIS
https://memory-alpha.fandom.com/wiki/Star_Trek:_Discovery
Picard 
PIC
https://memory-alpha.fandom.com/wiki/Star_Trek:_Picard_
Lower Decks
LD
https://memory-alpha.fandom.com/wiki/Star_Trek:_Lower_Decks


In [7]:
series_links

['https://memory-alpha.fandom.com/wiki/Star_Trek:_The_Original_Series',
 'https://memory-alpha.fandom.com/wiki/Star_Trek:_The_Animated_Series',
 'https://memory-alpha.fandom.com/wiki/Star_Trek:_The_Next_Generation',
 'https://memory-alpha.fandom.com/wiki/Star_Trek:_Deep_Space_Nine',
 'https://memory-alpha.fandom.com/wiki/Star_Trek:_Voyager',
 'https://memory-alpha.fandom.com/wiki/Star_Trek:_Enterprise',
 'https://memory-alpha.fandom.com/wiki/Star_Trek:_Discovery',
 'https://memory-alpha.fandom.com/wiki/Star_Trek:_Picard_',
 'https://memory-alpha.fandom.com/wiki/Star_Trek:_Lower_Decks']

In [72]:
# Fetch the main page of the first series (The Original Series)
website_url = requests.get(series_links[0]).text

In [87]:
soup = BeautifulSoup(website_url, 'html.parser')

In [176]:
# get all episode tables on the series page
tables = soup.find_all("table", {"class": "grey sortable"})

[stackoverflow - Detecting header in HTML tables using beautifulsoup / lxml when table lacks thead element](https://stackoverflow.com/questions/45292001/detecting-header-in-html-tables-using-beautifulsoup-lxml-when-table-lacks-thea)

In [168]:
def parse_table(table):
    """Map each episode title in an episode table to its absolute Memory Alpha URL."""
    episode_root_url = "https://memory-alpha.fandom.com"
    title_link = {}
    for tr in table.select('tr'):
        # skip header rows, i.e. rows whose direct children are all 'th' cells
        if all(t.name == 'th' for t in tr.find_all(recursive=False)):
            continue
        # the title link is expected in the first 'td' cell of a data row
        first_cell = tr.find("td")
        anchor = first_cell.find('a') if first_cell else None
        if anchor is None:
            continue  # row without a title link
        title_link[anchor.get_text()] = episode_root_url + anchor.get('href')
    return title_link

In [171]:
title_link_table = parse_table(tables[1])

In [174]:
pd.DataFrame(title_link_table.items(), columns=['Title', 'Episode Url'])

|  | Title | Episode Url |
| :----: | :---- | :---- |
| 0 | Where No Man Has Gone Before | https://memory-alpha.fandom.com/wiki/Where_No_... |
| 1 | The Corbomite Maneuver | https://memory-alpha.fandom.com/wiki/The_Corbo... |
| 2 | Mudd's Women | https://memory-alpha.fandom.com/wiki/Mudd%27s_... |
| 3 | The Enemy Within | https://memory-alpha.fandom.com/wiki/The_Enemy... |
| 4 | The Man Trap | https://memory-alpha.fandom.com/wiki/The_Man_T... |
| 5 | The Naked Time | https://memory-alpha.fandom.com/wiki/The_Naked... |
| 6 | Charlie X | https://memory-alpha.fandom.com/wiki/Charlie_X... |
| 7 | Balance of Terror | https://memory-alpha.fandom.com/wiki/Balance_o... |
| 8 | What Are Little Girls Made Of? | https://memory-alpha.fandom.com/wiki/What_Are_... |
| 9 | Dagger of the Mind | https://memory-alpha.fandom.com/wiki/Dagger_of... |


[stackoverflow - Scrape tables into dataframe with BeautifulSoup](https://stackoverflow.com/questions/50633050/scrape-tables-into-dataframe-with-beautifulsoup)

In [178]:
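# parse the same table into a DataFrame; newer pandas versions expect a
# file-like object rather than a literal HTML string
from io import StringIO

table = tables[1]
pd.read_html(StringIO(str(table)))[0]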

|  | Title | Episode | Prodno. | Stardate | Original Airdate | Remastered Airdate |
| :----: | :---- | :----: | :----: | :----: | :----: | :----: |
| 0 | Where No Man Has Gone Before | 1x01 | 6149-02 | 1312.4 - 1313.8 | 1966-09-22 | 2007-01-20 |
| 1 | The Corbomite Maneuver | 1x02 | 6149-03 | 1512.2 - 1514.1 | 1966-11-10 | 2006-12-09 |
| 2 | Mudd's Women | 1x03 | 6149-04 | 1329.8 - 1330.1 | 1966-10-13 | 2008-04-26 |
| 3 | The Enemy Within | 1x04 | 6149-05 | 1672.1 - 1673.1 | 1966-10-06 | 2008-01-26 |
| 4 | The Man Trap | 1x05 | 6149-06 | 1513.1 - 1513.8 | 1966-09-08 | 2007-09-29 |
| 5 | The Naked Time | 1x06 | 6149-07 | 1704.2 - 1704.4 | 1966-09-29 | 2006-09-30 |
| 6 | Charlie X | 1x07 | 6149-08 | 1533.6 - 1535.8 | 1966-09-15 | 2007-07-14 |
| 7 | Balance of Terror | 1x08 | 6149-09 | 1709.2 - 1709.6 | 1966-12-15 | 2006-09-16 |
| 8 | What Are Little Girls Made Of? | 1x09 | 6149-10 | 2712.4 | 1966-10-20 | 2007-10-06 |
| 9 | Dagger of the Mind | 1x10 | 6149-11 | 2715.1 - 2715.2 | 1966-11-03 | 2007-10-13 |
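
The metadata columns arrive as strings, so the "dates" cleaning task will probably involve converting them. The sketch below shows one possible way to parse the `Episode`, `Stardate` and `Original Airdate` values seen in the output above. `parse_episode_code` and `parse_stardate` are hypothetical helpers, and the formats are assumed from these ten rows only; other seasons or series may contain values that do not fit.

```python
import re
from datetime import date


def parse_episode_code(code):
    """Split a code such as "1x01" into (season, episode) integers."""
    match = re.fullmatch(r"(\d+)x(\d+)", code.strip())
    if match is None:
        raise ValueError(f"unexpected episode code: {code!r}")
    return int(match.group(1)), int(match.group(2))


def parse_stardate(text):
    """Return (start, end) as floats; a single stardate gives start == end."""
    values = [float(v) for v in re.findall(r"\d+(?:\.\d+)?", text)]
    if not values:
        raise ValueError(f"no stardate found in: {text!r}")
    return values[0], values[-1]


# Values copied from the first and fifth rows of the table above
print(parse_episode_code("1x01"))          # (1, 1)
print(parse_stardate("1312.4 - 1313.8"))   # (1312.4, 1313.8)
print(parse_stardate("2712.4"))            # (2712.4, 2712.4)

# ISO dates parse directly, which makes airdate comparisons possible
first_listed = date.fromisoformat("1966-09-22")  # Where No Man Has Gone Before
man_trap = date.fromisoformat("1966-09-08")      # The Man Trap
print(man_trap < first_listed)                   # True
```

The last comparison illustrates that the row order of the table is not the airdate order, which matters if episodes are later sorted chronologically.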
