
Labeling Fan-Generated Roleplaying Game (RPG) Content: An NLP Task

This project began as a single-weekend project completed for my COMP_SCI 349 (Machine Learning) course at Northwestern University. Students were offered the choice of a number of fixed toy datasets and tasks to complete; I chose to perform multi-label classification on a custom-generated dataset.

Original Project

To reduce redundancy/ambiguity, the original copy of my COMP_SCI 349 (Machine Learning) final has been made private. The PDF write-up I submitted for that project is provided in ./OLD_COMP_SCI_349_NK_Final.pdf if you're curious to compare performance then vs. now (in short, the macro F1 score increased from .79 to around .88).

Data Sources / Fan-Made RPG Content

A list of Reddit posts (comments or primary submissions) containing links to two websites (Homebrewery or GMBinder) was obtained using a command-line tool I built on top of the Python Reddit API Wrapper (PRAW), located here. Some processing of the specific web URLs collected via Reddit is handled by functions in ./src/scraping.py and ./src/preprocessing/text_cleaning.py.
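
As a rough, hypothetical sketch of what that collection can look like (credentials, the subreddit, and the limit here are placeholders; the actual tool linked above is more involved):

```python
# Hypothetical sketch of collecting Reddit posts that link to Homebrewery
# or GMBinder via PRAW. Credentials, subreddit, and limit are placeholders.
import re

import praw

LINK_RE = re.compile(
    r"https?://(?:www\.)?(?:homebrewery\.naturalcrit\.com|gmbinder\.com)/\S+"
)

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="rpg_nlp data collection (example)",
)

records = []
for submission in reddit.subreddit("UnearthedArcana").new(limit=1000):
    # Check both the submission's link and any links embedded in its self-text.
    links = LINK_RE.findall(f"{submission.url} {submission.selftext or ''}")
    if links:
        records.append({
            "id": submission.id,
            "flair": submission.link_flair_text,  # noisy label source
            "links": links,
        })
```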

4-5K Reddit posts were eventually collected (the initial dataset was closer to 2.5-3K at the time of my COMP_SCI 349 final). Out of these, around 3K unique pieces of fan-made content were eventually obtained.

Overall, it's worth noting that class imbalance is heavy within this dataset (see EDA Highlights below), so weighted sampling/training methods are used in Part 3 for model fitting.

Data / Model Availability

This repo expects/creates a ./data folder which contains data and intermediates for the different stages. The final size of this folder, with all checkpoints, is around 1.7 GB, so it is not provided here. All of the basic inputs/intermediates used in these notebooks are provided in a separate Google Drive folder here. This includes a pickled version of the SVC/sklearn model and a .pth file from the trained RoBERTa model. Note that version 1.2.2 of sklearn was used for this (see requirements.txt).
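
For reference, loading the two saved models might look roughly like this (file paths and the number of labels are assumptions; check the notebooks for the exact setup):

```python
# Sketch of loading the saved artifacts; paths and NUM_LABELS are
# assumptions, not guaranteed to match the Drive folder's layout.
import pickle

import torch
from transformers import RobertaForSequenceClassification

NUM_LABELS = 10  # placeholder: set to the actual number of label classes

# Pickled SVC/sklearn model -- unpickle it under sklearn 1.2.2 to avoid
# version-incompatibility issues.
with open("data/models/svc_model.pkl", "rb") as f:
    svc_model = pickle.load(f)

# Fine-tuned RoBERTa weights saved as a .pth state dict.
roberta = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=NUM_LABELS
)
roberta.load_state_dict(torch.load("data/models/roberta.pth", map_location="cpu"))
roberta.eval()
```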

Notebooks / Organization

There are 9 notebooks organized into three parts.

Part 1 (4 notebooks)

This part is entirely devoted to integrating new and old data, filtering out redundant or irrelevant content, collecting and cleaning text from the web, and identifying and fixing improperly labeled content. I recommend skipping over these notebooks; they behave a bit oddly, in particular because the initial dataset was collected in a dirtier manner than follow-up collections (the first collection was done as part of a short, one-weekend project for a class final).
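
As a flavor of the redundancy filtering involved, here is a hypothetical sketch (column names are illustrative, not the repo's actual schema):

```python
# Hypothetical sketch of the redundancy filtering done in Part 1: posts
# that ultimately point at the same Homebrewery/GMBinder document are
# collapsed to a single row. Column names are illustrative only.
import pandas as pd

def normalize_url(url: str) -> str:
    """Strip query strings/fragments and trailing slashes so superficially
    different links to the same document compare equal."""
    url = url.split("?")[0].split("#")[0]
    return url.rstrip("/").lower()

posts = pd.read_csv("data/raw/reddit_posts.csv")
posts["norm_url"] = posts["url"].map(normalize_url)
unique_content = posts.drop_duplicates(subset="norm_url", keep="first")
```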

Part 2 (3 notebooks)

This part is devoted to exploratory data analysis on three different levels/scopes:

  • Statistics/data associated with the underlying metadata from Reddit (Part 2A)
  • Statistics associated with very basic language usage (Part 2B), with the aim of finding oddities in the data cleaning/processing steps
  • An exploration of word usage, including visualization of the class labels in a bag-of-words feature space (Part 2C)

A few figures, mainly from the third notebook in Part 2, are highlighted in the EDA Highlights section below.

Part 3 (2 notebooks)

In this part, I fit two types of models to the data: a bag-of-words (SVC-based) classifier and a sequence classifier (fine-tuned RoBERTa). Both peak at a macro F1 of around .9, which appears to be in part because only a small portion of the class labels have been manually reviewed, and the Reddit-derived labels are sometimes wrong for a number of reasons (partly explored in Part 1 and discussed below). On top of this, my current labels only allow one class to be assigned when the task is really a multi-label/multi-class one, with a fairly substantial number of texts likely supporting more than one valid label. However, for an initial demonstration project, this is satisfactory to me for now.
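
A minimal sketch of the bag-of-words side of this (hyperparameters and the toy corpus below are placeholders, not values from the notebooks):

```python
# Sketch of the BoW model: TF-IDF features into a class-weighted linear
# SVC, scored with macro F1. The corpus/labels here are stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["a homebrew subclass for rogues", "a new undead monster statblock"] * 50
labels = ["Class", "Monster"] * 50  # stand-ins for the real flair labels

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

bow_model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LinearSVC(class_weight="balanced"),  # balanced weights for imbalanced labels
)
bow_model.fit(X_train, y_train)
print("macro F1:", f1_score(y_test, bow_model.predict(X_test), average="macro"))
```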

Label Annotation CLI Tool

As noted above, there are many instances of mislabeled content when relying on Reddit flair. For this reason, I developed a small CLI tool for reviewing and updating labels manually, which is used in Part 1 to catch/review the roughly 10% of labels that are at high risk of being mislabeled. For improved performance, it may be necessary to employ it further until all of the labels can be trusted.
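
For flavor, a stripped-down, hypothetical version of such a review loop (the actual tool in this repo is more involved):

```python
# Hypothetical, stripped-down label-review loop; the real CLI tool does
# more. Assumes a DataFrame with 'title', 'text', 'label', and a boolean
# 'high_risk' column flagging suspect rows.
import pandas as pd

def review_labels(df: pd.DataFrame, label_set: list[str]) -> pd.DataFrame:
    for idx, row in df[df["high_risk"]].iterrows():
        print(f"\n--- {row['title']} ---")
        print(row["text"][:500])  # preview the first 500 characters
        print(f"Current label: {row['label']} | Options: {', '.join(label_set)}")
        choice = input("Corrected label (blank to keep current): ").strip()
        if choice in label_set:
            df.at[idx, "label"] = choice
    return df
```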

EDA Highlights

Class Imbalance

Notably, class/label imbalance is vast within this dataset, hence the use of a custom weighted sampler when fitting the RoBERTa model and calculating its loss, and the use of balanced class/label weighting in the bag-of-words SVC models.
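
A minimal sketch of that sampler (assuming `train_labels` is an array of integer-encoded classes and `train_dataset` a matching PyTorch dataset):

```python
# Sketch of inverse-frequency weighted sampling for the RoBERTa model.
# `train_labels` and `train_dataset` are assumed to exist; rare classes
# get sampled as often as common ones.
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

train_labels = np.asarray(train_labels)
class_counts = np.bincount(train_labels)
sample_weights = 1.0 / class_counts[train_labels]  # inverse class frequency

sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(sample_weights),
    replacement=True,
)
loader = DataLoader(train_dataset, batch_size=16, sampler=sampler)
```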

Word Usage By Class Label

In general, there seem to be a few highly over-represented or under-represented terms per label class relative to all of the others; a few of these are visualized by both bar plot and word cloud.
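
One simple way to surface such terms (a sketch, not necessarily the method used in the Part 2 notebook) is to compare each class's mean TF-IDF weight per term against the mean over all other classes:

```python
# Sketch: rank terms by how over-represented they are in one class
# relative to all others, using mean TF-IDF weights.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_terms_for_class(texts, labels, target, n_terms=15):
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(texts)
    labels = np.asarray(labels)
    in_class = X[labels == target].mean(axis=0).A1   # mean weight inside the class
    out_class = X[labels != target].mean(axis=0).A1  # mean weight everywhere else
    ratio = (in_class + 1e-9) / (out_class + 1e-9)   # smoothed over-representation
    terms = np.array(vec.get_feature_names_out())
    order = np.argsort(ratio)[::-1][:n_terms]
    return list(zip(terms[order], ratio[order]))
```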

Visualizing Class Separation by TF-IDF

There is very limited class-based separation in the first few PCs of these data, but in UMAP projections of simple TF-IDF bag-of-words vectors the classes already separate nicely, so I expected many simple non-linear algorithms to work quite well for this classification task.
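
A minimal sketch of that projection with umap-learn (parameters here are illustrative defaults, and `texts`/`labels` are assumed to be the cleaned documents and their classes):

```python
# Sketch of UMAP on TF-IDF vectors, colored by class label. Parameters
# are illustrative defaults; `texts` and `labels` are assumed inputs.
import matplotlib.pyplot as plt
import numpy as np
import umap
from sklearn.feature_extraction.text import TfidfVectorizer

X = TfidfVectorizer(max_features=20_000).fit_transform(texts)  # sparse TF-IDF
embedding = umap.UMAP(n_components=2, metric="cosine", random_state=42).fit_transform(X)

labels = np.asarray(labels)
for label in np.unique(labels):
    pts = embedding[labels == label]
    plt.scatter(pts[:, 0], pts[:, 1], s=4, label=label)
plt.legend(markerscale=3)
plt.title("UMAP of TF-IDF vectors by class label")
plt.show()
```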
