# Topic Modeling of the television show Stargate SG1

## Motivations

I love the show Stargate SG1.  It is humorous and self-aware and does not take itself too seriously, unlike some other sci-fi shows such as Battlestar Galactica or Stargate Universe.  Every year or so, I re-watch seasons of the show at random.

The show ran for 10 seasons with a total of 214 episodes.  It spanned multiple worlds as the SG1 team traveled and met multiple alien races, some allies and some enemies.  It also had its own lore involving alien technologies, alien vocabularies, US military vocabulary, ancient Egyptian mythology and pop-culture references.

Additionally, the episodes are generally self-contained, with a storyline spanning at most around 2-3 episodes.  This makes the show easy to re-watch or to pick up at a random spot without having to catch up on all the background and pertinent details.

The motivation of this project is to build a corpus of the transcripts of all 10 seasons, explore whether there are recurrent themes, and group the episodes and/or seasons by them.  By doing so, I'm looking to map the essential words for each theme and also the episodes relating to each theme.  This would allow one to select a theme or storyline and watch all the episodes relating to it.

## Overview

The following steps will be taken to explore this project:

- Create a pipeline to scrape the transcript data
- Preprocess and build a corpus of all the episodes
- Tokenize and create a dictionary of all the words used in the episodes
- Train and build embeddings from the tokens
- Model the topics
- Finally, explore the underlying themes and the topic results

### Data Source

http://www.stargate-sg1-solutions.com/wiki/Transcripts appears to have the most complete transcripts for all 10 seasons.  The transcripts were compiled and archived by fans of the show.

### Tools

Methods / Tools to be used are:

- Python
- BeautifulSoup
- Trafilatura
- Gensim
- scikit-learn
- fastText
- LDA
- HDBSCAN
- BERTopic

## Step 0: imports

In [1]:
import sys
import re
import pathlib
import pickle
from typing import Iterator
from tqdm.auto import trange, tqdm
import numpy as np
import polars as pl
import gensim
import spacy
import sklearn
import trafilatura
from trafilatura import spider
import courlan

## Step 1: Pipeline to Scrape Data

In [2]:
def get_transcript_links(urls: list) -> Iterator[set]:
    """
    get all the links for each season's page
    yield (generator) of sets
    """
    for season in urls:
        season_page = trafilatura.fetch_url(season, decode=True)
        links = spider.extract_links(
            pagecontent=season_page,
            base_url="http://www.stargate-sg1-solutions.com",
            external_bool=False,
        )
        yield links


def merge_uniqefy_links(links: Iterator[set]) -> set[str]:
    """
    merge the list of sets
    return a single set of unique links
    """
    all_links = set()
    for season in links:
        all_links.update(season)

    return all_links


def filter_links(links: set[str]) -> Iterator:
    """
    filter to only obtain the transcript links
    transcript url is of pattern: "wiki/[0-9].*Transcript"
    yield (generator) of links
    """
    link_pattern = re.compile(r"wiki\/\d.+Transcript$")

    for l in links:
        if re.search(link_pattern, l):
            yield l


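def extract_transcripts(filtered_links: Iterator[str]) -> Iterator[dict]:
    """
    extract the actual transcript from the transcript links
    yield (generator) of dicts
    """
    for link in filtered_links:
        page = trafilatura.fetch_url(link)
        if page is None:
            # download failed, skip this episode
            continue
        extracted_page = trafilatura.bare_extraction(page)
        if extracted_page is None:
            # nothing could be extracted from the page, skip it
            continue
        extracted_transcript_list = extracted_page["text"].split("\n")
        # the transcript body starts right after the "Transcript" heading line
        start_index = extracted_transcript_list.index("Transcript") + 1
        # default to the end of the page so that end_index is always defined
        # (and never carried over from the previous page) when no credit line is found
        end_index = len(extracted_transcript_list)
        for n, e in enumerate(extracted_transcript_list):
            if re.search(r"^Transcribed", e):
                # cut one line before the credit: that line and the credit are dropped
                end_index = n - 1
        transcript_string = " ".join(extracted_transcript_list[start_index:end_index])
        extracted_page["text"] = transcript_string
        yield extracted_page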


def run_extraction_pipeline(urls: list) -> list[dict]:
    # list of dicts
    # with the actual transcript stored in the dict's "text" key
    extracted_transcripts = list(
        extract_transcripts(
            filter_links(merge_uniqefy_links(get_transcript_links(urls)))
        )
    )

    # sort the transcripts by episode title
    sorted_transcripts = sorted(extracted_transcripts, key=lambda x: x["title"])

    # pickle the transcripts
    with open("extracted_sg1_transcripts.pickle", "wb") as f:
        pickle.dump(sorted_transcripts, f)

    # transcripts is a list of dicts that include other metadata
    # with the key "text" containing the actual transcript
    return sorted_transcripts
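
To make the slicing inside `extract_transcripts` easier to follow, here is a minimal sketch of the same start/end logic applied to a made-up page. The helper name `slice_transcript` and the sample lines are hypothetical and only for illustration; the real pages may be laid out differently.

```python
import re


def slice_transcript(lines: list[str]) -> str:
    """Keep the lines between the "Transcript" heading and the credit line."""
    # The body starts right after the line that is exactly "Transcript".
    start_index = lines.index("Transcript") + 1
    # Fall back to the end of the page when no credit line exists.
    end_index = len(lines)
    for n, line in enumerate(lines):
        if re.search(r"^Transcribed", line):
            # Same cut as the pipeline: one line before the credit line.
            end_index = n - 1
    return " ".join(lines[start_index:end_index])


# Hypothetical page text, invented for this example.
sample_lines = [
    "1.01 Example Episode",
    "Transcript",
    "TEASER",
    "O'NEILL: Hello.",
    "END CREDITS",
    "Related links",
    "Transcribed by a fan",
]

body = slice_transcript(sample_lines)
print(body)
```

With this sample, the title line, the line just before the credit and the credit itself are all left out, and the remaining lines are joined into a single string.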

In [3]:
urls = [
    "http://www.stargate-sg1-solutions.com/wiki/Season_One_Transcripts",
    "http://www.stargate-sg1-solutions.com/wiki/Season_Two_Transcripts",
    "http://www.stargate-sg1-solutions.com/wiki/Season_Three_Transcripts",
    "http://www.stargate-sg1-solutions.com/wiki/Season_Four_Transcripts",
    "http://www.stargate-sg1-solutions.com/wiki/Season_Five_Transcripts",
    "http://www.stargate-sg1-solutions.com/wiki/Season_Six_Transcripts",
    "http://www.stargate-sg1-solutions.com/wiki/Season_Seven_Transcripts",
    "http://www.stargate-sg1-solutions.com/wiki/Season_Eight_Transcripts",
    "http://www.stargate-sg1-solutions.com/wiki/Season_Nine_Transcripts",
    "http://www.stargate-sg1-solutions.com/wiki/Season_Ten_Transcripts",
]
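
As a quick sanity check of the pattern used in `filter_links`, the sketch below runs it against a few URLs. The season index URL is taken from the list above; the other two paths are hypothetical and only shaped like pages that might be linked from the wiki.

```python
import re

# Same pattern as in filter_links.
link_pattern = re.compile(r"wiki\/\d.+Transcript$")

candidates = [
    # Hypothetical episode page: starts with a digit and ends in "Transcript".
    "http://www.stargate-sg1-solutions.com/wiki/1.01_Example_Episode_Transcript",
    # Season index page from the urls list: no leading digit, ends in "Transcripts".
    "http://www.stargate-sg1-solutions.com/wiki/Season_One_Transcripts",
    # Hypothetical unrelated wiki page.
    "http://www.stargate-sg1-solutions.com/wiki/Main_Page",
]

kept = [url for url in candidates if link_pattern.search(url)]
print(kept)
```

Only the first URL survives, which is the behaviour the pipeline relies on: the season index pages are crawled for links but are never treated as transcripts themselves.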

In [4]:
transcripts = run_extraction_pipeline(urls)

## Step 2: Preprocess and Build Corpus

In [5]:
def clean_corpus(corpus: Iterator[str]) -> Iterator[str]:
    pattern_1 = re.compile(r"(TEASER|FADE\s(IN|OUT))")
    pattern_2 = re.compile(r"((END|ROLL)\sCREDIT).*")
    for doc in corpus:
        doc = re.sub(pattern_2, "", re.sub(pattern_1, "", doc)).strip()
        yield doc
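
The effect of the two patterns in `clean_corpus` can be seen on a single made-up document. The helper `clean_doc` and the sample text are hypothetical; they simply apply the same two substitutions to one string.

```python
import re

# Same patterns as in clean_corpus.
pattern_1 = re.compile(r"(TEASER|FADE\s(IN|OUT))")
pattern_2 = re.compile(r"((END|ROLL)\sCREDIT).*")


def clean_doc(doc: str) -> str:
    # Drop the stage markers first, then everything from the credits onward.
    return pattern_2.sub("", pattern_1.sub("", doc)).strip()


# Hypothetical transcript text, invented for this example.
sample = "TEASER HAMMOND: Go. FADE OUT FADE IN O'NEILL: Hello. ROLL CREDITS"
cleaned = clean_doc(sample)
print(repr(cleaned))
```

Note that removing the markers leaves runs of spaces inside the text. That is harmless here because the tokenization step collapses repeated whitespace later.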

In [6]:
# load transcripts
with open("extracted_sg1_transcripts.pickle", "rb") as f:
    transcripts = pickle.load(f)

corpus = (t["text"] for t in transcripts)
cleaned_corpus = list(clean_corpus(corpus))

# pickle the cleaned corpus
with open("cleaned_sg1_corpus.pickle", "wb") as f:
    pickle.dump(cleaned_corpus, f)

## Step 3: Create Dictionary, Tokenize and build Embeddings

In [7]:
# load the cleaned corpus
with open("cleaned_sg1_corpus.pickle", "rb") as f:
    corpus = pickle.load(f)

In [8]:
def clean_tokenize(docs: list[str]) -> Iterator[list[str]]:
    """
    Input: docs: list of documents (one transcript string per episode)
    Output: generator of lists of lemmatized tokens, one list per document
    """
    cleaned = (
        gensim.parsing.preprocessing.strip_multiple_whitespaces(
            gensim.parsing.preprocessing.strip_non_alphanum(gensim.utils.deaccent(doc))
        )
        .strip()
        .lower()
        for doc in docs
    )

    nlp = spacy.load("en_core_web_lg", exclude=["parser", "ner", "tok2vec"])

    # stream all the documents (rows) in the corpus through spaCy
    sents = nlp.pipe(cleaned, n_process=6)

    # keep a token only if it is not a stopword, is longer than one character
    # and does not look like a number
    def ok_token(tok) -> bool:
        return not tok.is_stop and len(tok) > 1 and not tok.like_num

    # iterate through each document and each token
    # yield one list of lemmas per document
    return ([token.lemma_ for token in sent if ok_token(token)] for sent in sents)


def build_bow_corpus(cleaned_dataset: list[list[str]], dictionary) -> Iterator[list]:
    """
    Input: - cleaned_dataset: list of list of words
           - dictionary: gensim dictionary object
    Output: - generator of list of bags of words
    """
    for doc in cleaned_dataset:
        yield dictionary.doc2bow(doc, allow_update=True)
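
For readers unfamiliar with the bag-of-words format, here is a pure-Python sketch of the idea behind `build_bow_corpus`: each document becomes a list of `(token_id, count)` pairs. This is only an illustration and not Gensim's implementation; the helpers `build_vocab` and `doc_to_bow` and the toy documents are hypothetical, and Gensim may assign ids in a different order.

```python
from collections import Counter


def build_vocab(docs: list[list[str]]) -> dict[str, int]:
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id: dict[str, int] = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(doc: list[str], token2id: dict[str, int]) -> list[tuple[int, int]]:
    """Count how often each known token id occurs in one document."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


# Toy tokenized documents, invented for this example.
docs = [["gate", "open", "gate"], ["team", "gate", "travel"]]
token2id = build_vocab(docs)
bows = [doc_to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bows)
```

The word order inside each document is discarded; only the counts per token id remain, which is the input that LDA works with.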

## Step 4: Topic Modeling

In [9]:
def tune_topics(corpus, id2word, sentences, n_topics: int):
    """
    Input:
        - corpus: list / generator of bags of words
        - id2word: gensim dictionary object
        - sentences: list of lists of tokens (one list per document)
        - n_topics: int number of topics
    Output:
        - tuple of (coherence value (float), trained LDA model)
    """
    cores = 7
    lda_model = gensim.models.LdaMulticore(
        corpus=list(corpus), id2word=id2word, num_topics=n_topics, workers=cores
    )
    chr_model = gensim.models.CoherenceModel(
        model=lda_model,
        texts=sentences,
        dictionary=id2word,
        coherence="c_v",
        processes=cores,
    )
    return chr_model.get_coherence(), lda_model
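
One possible way to use `tune_topics` is to score several candidate topic counts and keep the one with the highest coherence. The sketch below shows only that selection loop, under the assumption that higher coherence is better. The helper `pick_best_topic_count` and the stand-in scorer `fake_coherence` are hypothetical, and the fake scores are not real results.

```python
from typing import Callable, Iterable


def pick_best_topic_count(
    candidates: Iterable[int], score_fn: Callable[[int], float]
) -> tuple[int, dict[int, float]]:
    """Score every candidate topic count and return the best one with all scores."""
    scores = {k: score_fn(k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores


def fake_coherence(k: int) -> float:
    # Stand-in for a real coherence computation; it simply peaks at k = 8.
    return 0.5 - abs(k - 8) / 100


best_k, scores = pick_best_topic_count(range(4, 13, 2), fake_coherence)
print(best_k, scores)
```

In the notebook, the scorer could instead be something like `lambda k: tune_topics(bow_corpus, dictionary, tokens, k)[0]`, where `bow_corpus`, `dictionary` and `tokens` are placeholder names for the outputs of Step 3.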

## Step 5: Insights / Conclusions