# Project Overview
In this notebook and subsequent notebooks I will attempt to predict mechanically meaningful labels associated with some fan-generated roleplaying game (RPG) content.

I collected post and comment data from Reddit in multiple runs using a command-line tool I wrote for this purpose (see [PRAW_FILES-CoDiaLS](https://github.com/nkuehnle/praw-codials)).

The actual text I will be attempting to label is hosted on one of two websites: [GM Binder](https://gmbinder.com/) or [Homebrewery](https://homebrewery.naturalcrit.com/), which are tools/content hosting providers for fan-made RPG content, primarily Dungeons and Dragons 5th Edition (DnD5e).

The labels make use of the user-assigned "flair" from the subreddit [/r/UnearthedArcana](https://www.reddit.com/r/UnearthedArcana/). UnearthedArcana describes itself as a source of "homebrew" (fan-generated) content, and its community rules ensure that every post has a meaningful label ("flair") associated with it. These labels describe the "type" of game content that has been created, be it a character "class" (think archetypal high-fantasy characters such as knights, wizards, and archers) or a "race" (elves, dwarves, humans, etc.).

In [2]:
%load_ext jupyter_black
# Utility Imports
import os
from pathlib import Path

# Imports for data processing/handling
import numpy as np
import pandas as pd
import pickle as pkl

# NLP-specific processing
from nltk.tokenize import word_tokenize
from langdetect import detect

# Custom modules
from src.preprocessing import (
    fix_dt,
    data_io,
    get_section_df,
    sections_df_to_docs,
    EmbeddingAwareTokenizer,
)
from src.eda_utils import get_char_counts, count_stopwords, calc_token_sizes

The jupyter_black extension is already loaded. To reload it, use:
  %reload_ext jupyter_black


# Bringing it all together

At this point we've:
1. Collected links containing fan-generated Dungeons and Dragons game content posted by Reddit users to [/r/UnearthedArcana](https://reddit.com/r/unearthedarcana) or [/r/DnDHomebrew](https://reddit.com/r/dndhomebrew) using my [PRAW_FILES-CoDiaLS](https://github.com/nkuehnle/praw-codials) tool
2. Filtered out or semi-manually reviewed links to get the largest set of unique, accurately "flaired" (labeled) links
3. Scraped this content from [one](https://homebrewery.naturalcrit.com) of [two](https://gmbinder.com) websites that host this content (as literal string markdown text with some CSS elements)
4. Cleaned as much irrelevant information as possible from the texts (e.g. styling elements)

### In this notebook I'll:
1. Ensure column datatypes are properly set
2. Filter once more for a few potential problems that might have been missed
3. Load the text as either sections or full documents into two possible dataframes
4. Get word and sentence tokens at both the section and document level (this takes a while)

In [2]:
# Define path constants relative to the current working directory (CWD)
CWD = Path(os.getcwd())
DATA = CWD / "data"
PRAW_FILES = DATA / "praw_data"
TEXT_FILES = DATA / "text_files"
CLEAN_TEXT_FILES = DATA / "clean_text_files"
CORPUS_FILES = DATA / "corpus_files"
CORPUS_FILES.mkdir(parents=True, exist_ok=True)

In [3]:
metadata_path = CORPUS_FILES / "Metadata.pkl"
metadata: pd.DataFrame = pd.read_pickle(metadata_path)
metadata = metadata.sort_values(by="UID")
metadata.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3250 entries, 0 to 2335
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   UID                      3250 non-null   int64  
 1   link                     3250 non-null   object 
 2   submission_author        3237 non-null   object 
 3   submission_id            3250 non-null   object 
 4   submission_title         3250 non-null   object 
 5   subreddit                3250 non-null   object 
 6   submission_flair         3235 non-null   object 
 7   submission_score         3250 non-null   float64
 8   submission_upvote_ratio  3247 non-null   float64
 9   submission_date          2009 non-null   object 
 10  comment_author           2628 non-null   object 
 11  comment_id               2628 non-null   object 
 12  comment_score            2628 non-null   float64
 13  comment_body             2628 non-null   object 
 14  comment_date            

# Update manually reviewed /r/DnDHomebrew content
In the previous notebook I generated a list of labels from /r/DnDHomebrew (which has optional and/or less meaningful flair) to review.

Here I'll incorporate any new information I've gathered from manually reviewing these texts.

In [4]:
label_review_path = PRAW_FILES / "LabelsToReview.csv"
# Read in reviewed labels
reviewed_labels = pd.read_csv(label_review_path)
reviewed_labels["manually_reviewed"] = reviewed_labels["manually_reviewed"].fillna(
    False
)
manually_reviewed = reviewed_labels["manually_reviewed"]

reviewed_labels = reviewed_labels[manually_reviewed].copy()
reviewed_labels["submission_flair"] = reviewed_labels["corrected_flair"]

# Update values
to_update = metadata["UID"].isin(reviewed_labels["UID"])
mapper = reviewed_labels.set_index("UID")["corrected_flair"].to_dict()
metadata.loc[to_update, "submission_flair"] = metadata.loc[to_update, "UID"].map(mapper)

print(f"{to_update.sum()} /r/DnDHomebrew entries updated.")

368 /r/DnDHomebrew entries updated.


### Set column datatypes

In [5]:
# Dates
metadata["submission_date"] = metadata["submission_date"].apply(fix_dt)
metadata["comment_date"] = metadata["comment_date"].apply(fix_dt)
# Boolean
metadata["manually_reviewed"] = metadata["manually_reviewed"].fillna(False)
metadata["manually_reviewed"] = metadata["manually_reviewed"].astype("boolean")
# All primary submission links are related to the submission...
metadata.loc[metadata["comment_id"].isna(), "related_link"] = True
metadata["related_link"] = metadata["related_link"].astype("boolean")
# Numbers
metadata["comment_score"] = metadata["comment_score"].astype("Int32")
metadata["submission_score"] = metadata["submission_score"].astype("Int32")
metadata["submission_upvote_ratio"] = metadata["submission_upvote_ratio"].astype(
    "float32"
)
metadata["UID"] = metadata["UID"].astype("Int16")
# Categorical
metadata["subreddit"] = metadata["subreddit"].astype("category")
metadata["submission_flair"] = metadata["submission_flair"].astype("category")
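The capitalized `"Int32"` used above is pandas' nullable integer dtype, which lets a column such as `comment_score` hold whole numbers alongside missing values instead of falling back to floats. A minimal sketch with made-up scores (not the real data):

```python
import pandas as pd

# Made-up scores with one missing value; plain float64 because of the NaN.
scores = pd.Series([1.0, None, 3.0])

# The nullable "Int32" dtype keeps integers and represents the gap as <NA>.
as_int = scores.astype("Int32")
print(as_int.dtype, as_int.isna().sum())
```

The lowercase `"int32"` would raise on the missing value, which is why the capitalized spelling matters here.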

### Remove any remaining duplicates (now that DnDHomebrew content has flair)

In [6]:
dups = metadata.sort_values(
    by=["manually_reviewed", "subreddit", "submission_date", "related_link"],
    axis=0,
    ascending=[False, False, True, False],
)
dups = dups[dups.duplicated(subset=["src_url", "submission_flair"], keep="first")]
metadata = metadata[~metadata["UID"].isin(dups["UID"])]
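The sort-then-`duplicated` pattern above keeps whichever row sorts first within each duplicate group, so the sort order acts as a priority ranking. A minimal sketch on a toy frame (the rows are invented, and only one of the four sort keys is used to keep it short):

```python
import pandas as pd

# Invented rows: UIDs 1 and 2 point at the same URL with the same flair.
toy = pd.DataFrame(
    {
        "UID": [1, 2, 3],
        "src_url": ["a", "a", "b"],
        "submission_flair": ["Spell", "Spell", "Item"],
        "manually_reviewed": [False, True, False],
    }
)

# Rank manually reviewed rows first so they win ties.
ranked = toy.sort_values(by="manually_reviewed", ascending=False)
# Everything after the first row of each (src_url, flair) group is a duplicate.
dups = ranked[ranked.duplicated(subset=["src_url", "submission_flair"], keep="first")]
deduped = toy[~toy["UID"].isin(dups["UID"])]
print(deduped["UID"].tolist())
```

Here the manually reviewed row (UID 2) survives and the unreviewed copy (UID 1) is dropped.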

Confirming that no src_urls remain duplicated with differing flair:

In [7]:
metadata[metadata["src_url"].duplicated(keep=False)].sort_values("src_url")

Unnamed: 0,UID,link,submission_author,submission_id,submission_title,subreddit,submission_flair,submission_score,submission_upvote_ratio,submission_date,comment_author,comment_id,comment_score,comment_body,comment_date,src_url,manually_reviewed,related_link


### Remove any invalid/missing content
A few bad links have slipped through at some point. These will be dropped below.

#### Manual Review/missing flair
After manual review, some links still had no relevant flair, seen and dropped below.

In [8]:
print(*metadata[metadata["submission_flair"].isna()]["link"], sep="\n")
metadata = metadata[~metadata["submission_flair"].isna()]

#### Missing/invalid content
These were likely collected in a previous pass or only became evident after cleaning the markdown.
In a few cases, non-English texts are present. Because there are so few, we won't attempt to train on or translate them.

In [9]:
metadata["raw_markdown"] = metadata["UID"].apply(
    lambda x: data_io.read_src_txt_from_file(x, TEXT_FILES)
)
metadata["clean_markdown"] = metadata["UID"].apply(
    lambda x: data_io.read_src_txt_from_file(x, CLEAN_TEXT_FILES)
)

to_drop = [
    # Drop GMBinder links that are dead.
    metadata["raw_markdown"].str.contains("NoSuchKey"),
    # Drop Homebrewery links that are dead
    metadata["raw_markdown"].str.contains("This content has been moved"),
    # Some are not found
    metadata["raw_markdown"].str.contains("Error: File not found"),
    # There are a few examples of nonsensical text, which is generally short, so drop anything of 200 characters or fewer
    metadata["raw_markdown"].str.len() <= 200,
    metadata["clean_markdown"].str.len() <= 200,
    # Drop anything totally empty.
    metadata["raw_markdown"] == "",
    metadata["clean_markdown"] == "",
]

drop_mask = None
for mask in to_drop:
    if isinstance(drop_mask, pd.Series) or isinstance(drop_mask, np.ndarray):
        drop_mask = drop_mask | mask
    else:
        drop_mask = mask
print(f"Dropped {sum(drop_mask)} texts based on absence of content:")
print(*metadata[drop_mask]["link"], sep="\n")
dropped = metadata[drop_mask].copy()
metadata = metadata[~drop_mask]

# Check language
metadata["lang"] = metadata["clean_markdown"].map(detect)
not_english = metadata["lang"] != "en"
print(f"{not_english.sum()} non-english texts removed:")
print(*metadata[not_english]["link"], sep="\n")
dropped = pd.concat([dropped, metadata[not_english]])
metadata = metadata[~not_english]

print(f"{len(metadata)} texts remain")

Dropped 24 texts based on absence of content:
https://homebrewery.naturalcrit.com/share/1K6GHAH_Z1kNOiBXOMxNz6dKHfzaiOdg3YpbrFdctgy5p
https://homebrewery.naturalcrit.com/share/1SNcTl5ohA2p3HYZtVqWLRS5kKJzlW7aEd2K9_HkberKm
https://homebrewery.naturalcrit.com/share/1Zf0IQKmtx9miXcTDMlrG61uoXyjJKyTjudXLpyvDcjTU
https://homebrewery.naturalcrit.com/share/1yARx8g8ufNapmrp-k72GVqmVupjOXlp5S1S3GYQpdozr
https://homebrewery.naturalcrit.com/share/1PItn1Plb2qGd2zoFCtOVBY9ZRUSsDPZqyA48eDbqeVrI
https://homebrewery.naturalcrit.com/share/1vm0x8Y63Rm3SIiWg6iEUT8bbcfIUuq7yxjih8vPgn1w1
https://homebrewery.naturalcrit.com/share/16jtuooS8rYVhTFruUBM0kARAEqp8MKay2kM2QhL3Zf1v
https://homebrewery.naturalcrit.com/share/126hk_D1cqhUnU-AcXeRNhm3PuVOVEiqDZ-J_GthMND0Y
https://gmbinder.com/share/-MKHCarZv-OYPmgoduh9
https://gmbinder.com/share/-MEUxzSRW9kf5wYfT0Oj
https://gmbinder.com/share/-M_BKYtvcFV1OIBjaYP-
https://gmbinder.com/share/-M_pNobz8LY49mB9eRiB
https://gmbinder.com/share/-MdCpGu3Ga5xQt-uHW4h
https://gm
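The loop in the cell above that ORs the masks together one at a time can also be written as a single reduction. A minimal sketch, assuming a toy Series in place of the real markdown column (the strings and the length cutoff of 5 are invented for the example):

```python
import operator
from functools import reduce

import pandas as pd

# Toy stand-in for the raw markdown column.
texts = pd.Series(["NoSuchKey", "a long enough document", ""])

masks = [
    texts.str.contains("NoSuchKey", regex=False),  # dead-link marker
    texts.str.len() <= 5,  # too short to be real content
    texts == "",  # completely empty
]

# OR all masks together; the result stays a boolean Series on the same index.
drop_mask = reduce(operator.or_, masks)
kept = texts[~drop_mask]
print(kept.tolist())
```

Because the result is still a Series, `drop_mask.sum()` gives the number of dropped rows directly.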

In [10]:
dropped[dropped["lang"] == "en"]

Unnamed: 0,UID,link,submission_author,submission_id,submission_title,subreddit,submission_flair,submission_score,submission_upvote_ratio,submission_date,...,comment_id,comment_score,comment_body,comment_date,src_url,manually_reviewed,related_link,raw_markdown,clean_markdown,lang


In [11]:
dropped[dropped["lang"] != "en"]

Unnamed: 0,UID,link,submission_author,submission_id,submission_title,subreddit,submission_flair,submission_score,submission_upvote_ratio,submission_date,...,comment_id,comment_score,comment_body,comment_date,src_url,manually_reviewed,related_link,raw_markdown,clean_markdown,lang
91,94,https://homebrewery.naturalcrit.com/share/1K6G...,Reinaldnaufal,lnxzo1,Barbarian Subclass - Path of the Siege | For t...,UnearthedArcana,Subclass,286,0.97,2021-02-20 03:22:27,...,goat7nn,1.0,Hey! Appreciate the input.\n\nI agree with the...,2021-02-22 03:07:42,https://homebrewery.naturalcrit.com/source/1K6...,False,,"{""stack"":""Error: File not found: 1K6GHAH_Z1kNO...","""stack"":""Error: File not found: 1 K 6 GHAH_Z 1...",
234,239,https://homebrewery.naturalcrit.com/share/1SNc...,SkirtWearingSlutBoi,ln0sme,Egg's Enhanced Shields - Brand New Array of Ea...,UnearthedArcana,Item,148,0.97,2021-02-18 23:39:54,...,gny4xrm,7.0,[Homebrewery Link.](https://homebrewery.natura...,2021-02-18 23:50:25,https://homebrewery.naturalcrit.com/source/1SN...,False,,"{""stack"":""Error: File not found: 1SNcTl5ohA2p3...","""stack"":""Error: File not found: 1 SNcTl 5 ohA ...",
982,999,https://homebrewery.naturalcrit.com/share/1Zf0...,matsozetex11,nu6zwl,Darker Firearms - A take on firearm mechanics,UnearthedArcana,Item,57,0.92,2021-06-07 07:54:36,...,h0vxww8,3.0,This is my first homebrew posted publicly on t...,2021-06-07 07:56:59,https://homebrewery.naturalcrit.com/source/1Zf...,False,,"{""stack"":""Error: File not found: 1Zf0IQKmtx9mi...",,
1888,2428,https://homebrewery.naturalcrit.com/share/1yAR...,maxnis,o95p8g,"Ecrenians, Undead elves for a homebrew setting...",UnearthedArcana,Race,6,0.87,NaT,...,,,,NaT,https://homebrewery.naturalcrit.com/source/1yA...,False,True,"{""stack"":""Error: File not found: 1yARx8g8ufNap...","""stack"":""Error: File not found: 1 yARx 8 g 8 u...",
1900,2440,https://homebrewery.naturalcrit.com/share/1PIt...,MBluna9,og6ocy,"Summon Ooze and Summon Plant, the summoning sp...",UnearthedArcana,Spell,5,0.74,NaT,...,,,,NaT,https://homebrewery.naturalcrit.com/source/1PI...,False,True,"{""stack"":""Error: File not found: 1PItn1Plb2qGd...","""stack"":""Error: File not found: 1 PItn 1 Plb 2...",
1974,2517,https://homebrewery.naturalcrit.com/share/1vm0...,MrTzaangor,s28fr2,Magic From Beyond - All art credits in document,UnearthedArcana,Spell,3,1.0,NaT,...,,,,NaT,https://homebrewery.naturalcrit.com/source/1vm...,False,True,"{""stack"":""Error: File not found: 1vm0x8Y63Rm3S...",,
1986,2529,https://homebrewery.naturalcrit.com/share/16jt...,,sij0vw,College of Folk Bard subclass v2,UnearthedArcana,Subclass,3,0.8,NaT,...,,,,NaT,https://homebrewery.naturalcrit.com/source/16j...,False,True,"{""stack"":""Error: File not found: 16jtuooS8rYVh...",,
1999,2543,https://homebrewery.naturalcrit.com/share/126h...,MrTzaangor,s35nzq,Races of Tairth - All art credits in document,UnearthedArcana,Race,10,1.0,NaT,...,,,,NaT,https://homebrewery.naturalcrit.com/source/126...,False,True,"{""stack"":""Error: File not found: 126hk_D1cqhUn...",,
2223,2805,https://gmbinder.com/share/-MKHCarZv-OYPmgoduh9,moonstrous,lpphds,Marksman v3: A Rogue subclass for Flintlock sh...,UnearthedArcana,Class,41,0.91,NaT,...,gocdlcc,3.0,The colonial wars of North America were fought...,NaT,https://gmbinder.com/share/-MKHCarZv-OYPmgoduh9,False,,## Marksman (Rogue Subclass)\n\n\nby Moonstrou...,## Marksman (Rogue Subclass)\n\n##### Flagbear...,
2226,2809,https://gmbinder.com/share/-MEUxzSRW9kf5wYfT0Oj,moonstrous,i8jtxd,"Muskets, Dueling Pistols, and other 18th-Centu...",UnearthedArcana,Item,1304,0.99,NaT,...,g18tgvz,25.0,From the Seven Years War to the American Revol...,NaT,https://gmbinder.com/share/-MEUxzSRW9kf5wYfT0Oj,False,,## Flintlock Firearms (Equipment)\n\n\nby Moon...,## Flintlock Firearms (Equipment)\n\n##### Fla...,


### Remove rare flair

In [12]:
metadata.groupby(by=["submission_flair"])["UID"].agg("count")

submission_flair
Adventure        3
Background       7
Class          478
Compendium     152
Feat           122
Item           166
Mechanic       104
Missing          0
Monster        562
Official         1
Other            2
Race           229
Resource         5
Spell          194
Subclass      1187
World            1
Name: UID, dtype: int64

In [13]:
metadata.groupby(by=["submission_flair"])["UID"].agg("count")
MAJOR_LABELS = [
    "Class",
    "Item",
    "Monster",
    "Race",
    "Spell",
    "Subclass",
    "Feat",
    "Compendium",
    "Mechanic",
    "Background",
]
metadata[~metadata["submission_flair"].isin(MAJOR_LABELS)]

Unnamed: 0,UID,link,submission_author,submission_id,submission_title,subreddit,submission_flair,submission_score,submission_upvote_ratio,submission_date,...,comment_id,comment_score,comment_body,comment_date,src_url,manually_reviewed,related_link,raw_markdown,clean_markdown,lang
337,344,https://gmbinder.com/share/-MkL3KLR7-tbnxdErq_R,AutoModerator,smtkbu,"The Arcana Forge! For all your drafts, ideas, ...",UnearthedArcana,Official,16,0.91,2022-02-07 16:00:13,...,hwyqky9,2.0,"Hey, /r/unearthedarcana! I decided to create a...",2022-02-14 22:41:15,https://www.gmbinder.com/share/-MkL3KLR7-tbnxd...,True,,<style>\n .phb#p1:after { display:none; }\n \n...,Mundane Guide to Guilds\n\nA supplement for Ca...,en
1018,1035,https://homebrewery.naturalcrit.com/share/1583...,Kaiburr_Kath-Hound,sfjdft,Tasha's Cauldron of Everything Style/Template ...,UnearthedArcana,Resource,50,0.96,2022-01-29 14:37:25,...,huq1kjv,4.0,Edit: The title should be **Xanathar’s Guide t...,2022-01-29 14:37:32,https://homebrewery.naturalcrit.com/source/158...,False,,```metadata\ntitle: Xanthar's Style\ndescripti...,# Xanathar's Guide to Anything\n## . But not E...,en
1626,2072,https://gmbinder.com/share/-MEP1q7iP_x3LcQVvg_k,KibblesTasty,jn7t0w,Abducted! A very strange one-shot adventure ab...,DnDHomebrew,Adventure,1046,0.99,2020-11-03 10:08:41,...,gazph80,24.0,[**GMBinder**](https://www.gmbinder.com/share/...,2020-11-03 10:08:55,https://www.gmbinder.com/share/-MEP1q7iP_x3LcQ...,False,,## Summary\n\nThe following is an adventure in...,## Summary\n\nThe following is an adventure in...,en
1708,2229,https://homebrewery.naturalcrit.com/share/1jQ1...,Kaiburr_Kath-Hound,srwhue,Dyslexia-Friendly Styling Code - Make your doc...,DnDHomebrew,Resource,98,0.96,2022-02-13 23:47:18,...,hwud36p,2.0,"Hey all, here's a project I've been wanting to...",2022-02-13 23:47:22,https://homebrewery.naturalcrit.com/source/1jQ...,True,True,```metadata\ntitle: Xanthar's Style - OpenDysl...,# Xanathar's Guide to Anything\n## . But not E...,en
1710,2231,https://homebrewery.naturalcrit.com/share/1h8J...,Kaiburr_Kath-Hound,srwhue,Dyslexia-Friendly Styling Code - Make your doc...,DnDHomebrew,Resource,98,0.96,2022-02-13 23:47:18,...,hwud36p,2.0,"Hey all, here's a project I've been wanting to...",2022-02-13 23:47:22,https://homebrewery.naturalcrit.com/source/1h8...,True,True,```metadata\ntitle: Tasha's Style - OpenDyslex...,# Tasha's Cauldron of Every Single Thing\n## T...,en
1716,2245,https://homebrewery.naturalcrit.com/share/15OH...,Kaiburr_Kath-Hound,srwhue,Dyslexia-Friendly Styling Code - Make your doc...,DnDHomebrew,Resource,98,0.96,2022-02-13 23:47:18,...,hwud36p,2.0,"Hey all, here's a project I've been wanting to...",2022-02-13 23:47:22,https://homebrewery.naturalcrit.com/source/15O...,True,True,```metadata\ntitle: PHB Style - OpenDyslexic\n...,"# Player's Hand Book\n\n## Definitely not a ""H...",en
1717,2246,https://homebrewery.naturalcrit.com/share/1eK9...,Kaiburr_Kath-Hound,srwhue,Dyslexia-Friendly Styling Code - Make your doc...,DnDHomebrew,Resource,98,0.96,2022-02-13 23:47:18,...,hwud36p,2.0,"Hey all, here's a project I've been wanting to...",2022-02-13 23:47:22,https://homebrewery.naturalcrit.com/source/1eK...,True,True,```metadata\ntitle: Dyslexia-Friendly Style\nd...,# The Homebrewery\n\nWelcome traveler from an ...,en
1723,2253,https://homebrewery.naturalcrit.com/share/bjmG...,dr3w_be4r,sx2xug,Harry Potter/Hogwarts Inspired Adventure! Evok...,DnDHomebrew,Adventure,237,0.95,2022-02-20 14:54:50,...,hynvdze,1.0,Thank you! Yes https://homebrewery.naturalcrit...,2022-02-27 16:54:17,https://homebrewery.naturalcrit.com/source/bjm...,False,,```metadata\ntitle: Evoking Extra Credit\ndesc...,Credit: Wizards of the Coast\n\n# Evoking Extr...,en
1779,2312,https://gmbinder.com/share/-MVsRDHxhmCt-JRE_5GL,Finalplayer14,m5xiol,XGTE Class Traits Expanded + Artificer,UnearthedArcana,Other,34,0.87,NaT,...,,,,NaT,https://www.gmbinder.com/share/-MVsRDHxhmCt-JR...,False,True,/* Background */\n .phb{ background-image: url...,Class Background Traits Expanded\n\nArt: dunge...,en
1880,2420,https://gmbinder.com/share/-Mcp-N9HQ6ZAOVvNpC0j,DevanT77,o8zasm,Class & Subclass Variants. A reworking of ever...,UnearthedArcana,Other,21,0.87,NaT,...,,,,NaT,https://www.gmbinder.com/share/-Mcp-N9HQ6ZAOVv...,False,True,/*ToC Styling*/\n .toc a {color: inherit !impo...,# Variants and House Rules\n\n### Artificer\n#...,en


In [14]:
metadata = metadata[metadata["submission_flair"].isin(MAJOR_LABELS)].copy()
metadata["submission_flair"] = metadata[
    "submission_flair"
].cat.remove_unused_categories()
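The list of major labels above is written out by hand. As a cross-check, here is a minimal sketch that derives a label list from the counts printed earlier using a minimum-count cutoff. The cutoff of 7 is an assumption chosen for illustration, and `flair_counts` and `select_major_labels` are hypothetical names:

```python
# Flair counts copied from the groupby output above.
flair_counts = {
    "Adventure": 3, "Background": 7, "Class": 478, "Compendium": 152,
    "Feat": 122, "Item": 166, "Mechanic": 104, "Missing": 0,
    "Monster": 562, "Official": 1, "Other": 2, "Race": 229,
    "Resource": 5, "Spell": 194, "Subclass": 1187, "World": 1,
}


def select_major_labels(counts, min_count=7):
    """Return the labels with at least `min_count` examples, sorted by name."""
    return sorted(label for label, n in counts.items() if n >= min_count)


major = select_major_labels(flair_counts)
n_kept = sum(flair_counts[label] for label in major)
print(major)
print(n_kept)
```

With this cutoff the selected labels are the same ten as in `MAJOR_LABELS`, and the kept rows add up to the 3201 entries reported by `metadata.info()`.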

In [15]:
metadata = metadata.sort_values("UID")
metadata.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3201 entries, 0 to 2336
Data columns (total 21 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   UID                      3201 non-null   Int16         
 1   link                     3201 non-null   object        
 2   submission_author        3189 non-null   object        
 3   submission_id            3201 non-null   object        
 4   submission_title         3201 non-null   object        
 5   subreddit                3201 non-null   category      
 6   submission_flair         3201 non-null   category      
 7   submission_score         3201 non-null   Int32         
 8   submission_upvote_ratio  3198 non-null   float32       
 9   submission_date          1983 non-null   datetime64[ns]
 10  comment_author           2597 non-null   object        
 11  comment_id               2597 non-null   object        
 12  comment_score            2597 non-

# Create Corpus Dataframes & Tokenize
### Calculation of QC/Language Metrics
Metrics are calculated at the sentence/word token and full string levels on:
1. the raw markdown
2. the cleaned markdown
3. the section-level text

Metrics for markdown are calculated first and use simple tokenization to get some basic information for EDA/QC purposes.
A final/real-world implementation will not bother with this except as part of continual learning/QC.

**Speed:** This takes ~1-2 minutes for each phase.
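As a rough illustration of the kind of metrics involved, here is a minimal sketch that computes character counts, a stopword fraction, and mean token length for a single string. It is not the code in `src.eda_utils`; the function name `basic_text_metrics`, the tiny stopword set, and the regex tokenizer are all assumptions made to keep the example self-contained.

```python
import re

# A deliberately tiny stopword set (assumption); a real run would use a full list.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "you"}


def basic_text_metrics(text):
    """Compute simple character- and token-level metrics for one string."""
    # Crude tokenization: lowercase runs of letters/apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    n_tokens = len(tokens)
    n_stop = sum(tok in STOPWORDS for tok in tokens)
    return {
        "n_chars": len(text),
        "n_tokens": n_tokens,
        "n_stopwords": n_stop,
        "stopword_frac": n_stop / n_tokens if n_tokens else 0.0,
        "mean_token_len": sum(map(len, tokens)) / n_tokens if n_tokens else 0.0,
    }


metrics = basic_text_metrics("You gain the ability to cast a spell.")
print(metrics)
```

An unusually low stopword fraction or a very short mean token length is the sort of signal that can flag leftover styling code or garbled text during QC.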

### Get section-level text
This involves using the NLTK markdown corpus reader to split the text up into sections. The plan is for section-level data to mostly be used for cleaning out things like credit sections and/or unsupervised learning to get better segmented labels for supervised learning cases (i.e. multi-class sequence to sequence models)

**Speed:** Takes ~4 minutes (~1.5 min to initialize the NLTK markdown corpus reader, ~2.5 min to get section-level data; most of that is labeling the sections as credit/non-credit).
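To make the idea of section-level text concrete, here is a minimal sketch that splits markdown on ATX headings (`#` through `######`) with a regular expression. This is a stand-in for illustration only: it is not the NLTK markdown corpus reader and not `get_section_df`, and the name `split_sections` is hypothetical.

```python
import re

# One to six leading '#' characters followed by whitespace and the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_sections(markdown):
    """Split markdown into (heading, body) pairs.

    Text before the first heading is returned with heading None.
    """
    sections = []
    heading, body = None, []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section unless it is an empty preamble.
            if heading is not None or any(part.strip() for part in body):
                sections.append((heading, "\n".join(body).strip()))
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    if heading is not None or any(part.strip() for part in body):
        sections.append((heading, "\n".join(body).strip()))
    return sections


doc = "## Marksman\n\nA rogue subclass.\n\n##### Credits\nArt by someone."
sections = split_sections(doc)
for title, text in sections:
    print(repr(title), "->", repr(text))
```

A line-based splitter like this will misread `#` lines inside fenced code blocks as headings, which is one reason to prefer a proper markdown parser for the real corpus.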

### Custom word tokenization

My tokenizer here will attempt to resolve a number of ambiguities that exist in the form of spelling variations in a smart way that maximizes the number of words taken from a pre-existing set of word embeddings. In this case, we'll use the GloVe (Wikipedia+Gigaword) embeddings taken from Gensim. **For now we only need the list of vocabulary terms.** In the event that a word is unknown and rare within the text corpus, it will use the part-of-speech tag from NLTK as an approximation of the meaning.

My custom tokenizer is slower than standard word tokenization largely due to its effort to retain as many tokens in a meaningful way as possible. The **entire** text that is being tagged is passed into the PoS tagger if an unknown term is encountered since performance is generally best within the full context of the surrounding tokens.
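The fallback order described above can be sketched roughly as follows. This is not the `EmbeddingAwareTokenizer` from `src.preprocessing`; `resolve_token`, its parameters, and the `<UNK>` placeholder are hypothetical, and the placeholder stands in for the part-of-speech tag so the example has no NLTK dependency.

```python
import re
from collections import Counter


def resolve_token(token, embedding_vocab, corpus_counts, min_count=5, unknown="<UNK>"):
    """Map one raw token to a list of tokens, preferring known words."""
    lowered = token.lower()
    # 1. Keep words that the embeddings already know.
    if lowered in embedding_vocab:
        return [lowered]
    # 2. Keep words that are common in the corpus even if the embeddings lack them.
    if corpus_counts[lowered] >= min_count:
        return [lowered]
    # 3. Try splitting on hyphens/slashes and keep the parts if all are known.
    parts = [p for p in re.split(r"[-/]", lowered) if p]
    if len(parts) > 1 and all(p in embedding_vocab for p in parts):
        return parts
    # 4. Otherwise fall back to a placeholder (a POS tag in the real tokenizer).
    return [unknown]


vocab = {"fire", "breathing", "dragon"}  # toy embedding vocabulary
counts = Counter({"tiefling": 12, "zzyx": 1})  # toy corpus frequencies
examples = ["Dragon", "Tiefling", "fire-breathing", "zzyx"]
resolved = {tok: resolve_token(tok, vocab, counts) for tok in examples}
print(resolved)
```

The point of the ordering is that a token is only replaced by a placeholder after every cheaper way of keeping it meaningful has failed.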

**Note**: Section tokenization is *particularly* slow. It's ~10x faster to tokenize by document, but we lose the information that defines sections if we tokenize first. A better future implementation would tokenize by document and then work out where each section starts and ends by looking for the indices of key tokens surrounding each section's text. Since I only need to do this once and going from section tokens -> document tokens is trivial, I implemented it like this for now.
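One way the document-first approach could work is to join sections with a sentinel token, tokenize once, and split the token list on the sentinel. The sketch below is only an idea under stated assumptions: the sentinel string, the function name `tokenize_with_sections`, and the use of `str.split` as the tokenizer are all hypothetical, and it assumes the tokenizer passes the sentinel through unchanged.

```python
# Assumed sentinel: a string unlikely to occur in real text.
SECTION_MARK = "xxsectionbreakxx"


def tokenize_with_sections(sections, tokenize):
    """Tokenize a whole document once, then recover per-section token lists."""
    joined = f" {SECTION_MARK} ".join(sections)
    tokens = tokenize(joined)
    per_section, current = [], []
    for tok in tokens:
        if tok == SECTION_MARK:
            # Sentinel reached: close the current section.
            per_section.append(current)
            current = []
        else:
            current.append(tok)
    per_section.append(current)
    return per_section


sections_text = ["Hit Points and armor", "Credits art by someone"]
per_section = tokenize_with_sections(sections_text, str.split)
print(per_section)
```

Whether this is actually faster in practice depends on how the real tokenizer treats the sentinel, so it would need checking against the fitted tokenizer before being relied on.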

**Speed:** Fitting the tokenizer takes ~1-2 minutes and mostly consists of cross-referencing uncommon or hyphen-/other-delimited tokens with "known words" that are either highly common in the corpus or present in the word embeddings' vocabulary.
This tokenization is a bit overboard for bag-of-words models, but I would like to implement some sequence-to-sequence models later on.

In [16]:
sec_df_path = DATA / "section_corpus.pkl"
tokenizer_path = DATA / "tokenizer.pkl"

if sec_df_path.is_file():
    # No longer need metadata, free up memory and import section df
    del metadata
    section_df = pd.read_pickle(sec_df_path)
    with open(tokenizer_path, "rb") as p:
        tokenizer = pkl.load(p)
else:
    print("Calculating markdown-level QC metrics")
    # Calculate basic markdown QC/lang metrics
    get_char_counts(metadata, pref="raw_md", text_col="raw_markdown")
    get_char_counts(metadata, pref="clean_md", text_col="clean_markdown")
    count_stopwords(
        metadata,
        pref="clean_md",
        text_col="clean_markdown",
        tokenizer=word_tokenize,
        text_to_lower=True,
    )
    calc_token_sizes(metadata, pref="clean_md", text_col="clean_markdown")

    # Get sections
    print("Getting sections-level dataframe.")
    section_df = get_section_df(metadata, CLEAN_TEXT_FILES)  # See function/module for details
    del metadata

    # Get tokens
    if tokenizer_path.is_file():
        print("Loading tokenizer")
        with open(tokenizer_path, "rb") as p:
            tokenizer = pkl.load(p)
    else:
        print("Fitting tokenizer")
        import gensim.downloader as api

        # Load WV for embedding vocab
        vectors = api.load("glove-wiki-gigaword-100")
        vector_vocab = list(vectors.key_to_index)
        del vectors
        # Create and fit tokenizer to corpus
        tokenizer = EmbeddingAwareTokenizer(
            embedding_vocab=vector_vocab,
            min_counts_common_token=5,
            unknown_token="",  # If no unknown token is provided, will use NLTK-predicted POS tag
            min_subtoken_size=4,
            max_subtokens=3,
        )
        corpus = "\n".join(section_df["section_text"])
        tokenizer.fit(corpus)
        with open(tokenizer_path, "wb") as p:
            pkl.dump(tokenizer, p, pkl.HIGHEST_PROTOCOL)

    print(f"Tokenizing {len(section_df)} sections...")
    section_df["word_tokens"] = section_df["section_text"].map(tokenizer.tokenize)
    section_df.to_pickle(sec_df_path)

    # Calculate section QC/lang metrics
    print("Calculating section-level QC metrics")
    get_char_counts(section_df, pref="section", text_col="section_text")
    count_stopwords(
        section_df,
        pref="section",
        token_col="word_tokens",
        tokenizer=tokenizer.tokenize,
    )
    calc_token_sizes(section_df, pref="section", word_token_col="word_tokens")

    # Save data
    section_df.to_pickle(sec_df_path)

section_df.info()

Calculating markdown-level QC metrics
	Word tokens not determined yet, defaulting to NLTK word_tokenize
Getting sections-level dataframe.
	Preparing markdown corpus reader
	Parsing sections
Fitting tokenizer
Tokenizing 123307 sections...
Calculating section-level QC metrics
<class 'pandas.core.frame.DataFrame'>
Int64Index: 123307 entries, 0 to 123306
Data columns (total 46 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   UID                      123307 non-null  Int16         
 1   link                     123307 non-null  object        
 2   submission_author        122992 non-null  object        
 3   submission_id            123307 non-null  object        
 4   submission_title         123307 non-null  object        
 5   subreddit                123307 non-null  category      
 6   submission_flair         123307 non-null  category      
 7   submission_score         123307 non-null  Int32    

### Convert to documents
No need to tokenize here since we are just merging tokens in the correct order.
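A minimal sketch of what "merging tokens in the correct order" means, assuming each section row carries a document UID, a position within the document, and its token list. This is not `sections_df_to_docs`; the row layout, the values, and the name `merge_section_tokens` are made up for the example.

```python
# (UID, section position, tokens) rows in arbitrary order; values are invented.
section_rows = [
    (7, 1, ["art", "by", "someone"]),
    (3, 0, ["a", "new", "spell"]),
    (7, 0, ["marksman", "subclass"]),
]


def merge_section_tokens(rows):
    """Concatenate section token lists per document, in section order."""
    docs = {}
    for uid, _, tokens in sorted(rows, key=lambda row: (row[0], row[1])):
        docs.setdefault(uid, []).extend(tokens)
    return docs


doc_tokens = merge_section_tokens(section_rows)
print(doc_tokens)
```

Sorting by UID and then by section position is what guarantees that the merged token list reads in document order.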

In [17]:
doc_df_path = DATA / "document_corpus.pkl"

if doc_df_path.is_file():
    # del section_df
    doc_df = pd.read_pickle(doc_df_path)
else:
    print("Getting document-level dataframe from section-level dataframe")
    doc_df = sections_df_to_docs(section_df)  # See function/module for details
    del section_df
    # Calculate document QC/lang metrics
    print("Calculating document-level QC metrics")
    get_char_counts(doc_df, pref="doc_main", text_col="clean_text")
    get_char_counts(doc_df, pref="doc_credit", text_col="credit_text")
    count_stopwords(
        doc_df,
        pref="doc_main",
        token_col="clean_word_tokens",
        tokenizer=tokenizer.tokenize,
    )
    count_stopwords(
        doc_df,
        pref="doc_credit",
        token_col="credit_word_tokens",
        tokenizer=tokenizer.tokenize,
    )
    del tokenizer
    calc_token_sizes(doc_df, pref="doc_main", word_token_col="clean_word_tokens")
    calc_token_sizes(doc_df, pref="doc_credit", word_token_col="credit_word_tokens")

    doc_df.to_pickle(doc_df_path)

doc_df.info()

Getting document-level dataframe from section-level dataframe
Calculating document-level QC metrics
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3201 entries, 0 to 3200
Data columns (total 54 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   UID                        3201 non-null   int16         
 1   clean_text                 3201 non-null   object        
 2   num_sections               3201 non-null   int64         
 3   credit_text                3201 non-null   object        
 4   num_credit_sections        3201 non-null   int64         
 5   clean_word_tokens          3201 non-null   object        
 6   credit_word_tokens         3201 non-null   object        
 7   link                       3201 non-null   object        
 8   submission_author          3189 non-null   object        
 9   submission_id              3201 non-null   object        
 10  submission_title           3201 