Classifying Canadian Discussion Subreddits

Another GA Data Science Intensive project. I trained a Naive Bayes classifier to distinguish between Reddit comments from r/Canada and r/OnGuardForThee.

Here's a lightweight app that lets you enter your own text and see how the model classifies it.

https://politics-subreddit-classifier.herokuapp.com/

The remainder of the readme below is left unchanged from the project's submission.

Classifying Canadian Discussion Subreddits

Executive Summary

Background/Problem Statement

Discussion-based Reddit communities are anecdotally known for being opinionated echo chambers. In the face of toxic discussion and conflicting opinions, communities can schism. Since creating a new subreddit takes only a few mouse clicks, there are many instances of mainstream subreddits with splinter communities that discuss the same issues but formed because of ideological differences or as a reaction to the perceived inauthenticity of the original community. Though splinter communities are typically much less popular than the original, some do reach a significant member base. Some of the more notable examples include:

/r/OffMyChest (2.2m members) vs /r/TrueOffMyChest (718k members)
/r/PublicFreakout (2.8m members) vs /r/ActualPublicFreakouts (448k members)

For discussion of general Canadian issues, the two dominant subreddits are:

The original: /r/Canada (739k members, 3.8k active, created 2008)
The splinter: /r/OnGuardForThee (110k members, 1k active, created 2017)

r/OnGuardForThee was created in 2017 due to disputes over racist content and moderation in r/Canada. It claims to be a more explicitly anti-racist community and skews towards leftism in content posts and discussion, vs. the more centrist original community. However, since they're both discussing general Canadian issues, there's a wide overlap in content between the two, at least superficially.

This leads to the project goal: Can we build a classifier that can accurately distinguish between comments from r/Canada and comments from r/OnGuardForThee?

Data Acquisition

Data was gathered from Reddit using the PushShift API. 100 comments were sampled per week from September 2018 to September 2020, for a total of 10,400 comments from each subreddit. Comments that appeared to be autogenerated by moderation bots, and comments that had been deleted by users or moderators (and so contained only blank content), were then removed.

This left us with 20,102 comments: 50.4% from r/OnGuardForThee and 49.6% from r/Canada.
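The sampling and cleaning steps above can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code: the start date, the `keep_comment` helper, the `[deleted]`/`[removed]` placeholder bodies, and the `AutoModerator` author name are all assumed here, and the API request itself is left out.

```python
from datetime import datetime, timedelta, timezone

# Assumed values: the README gives only "September 2018 to September 2020".
START = datetime(2018, 9, 1, tzinfo=timezone.utc)
WEEKS = 104       # roughly two years of weekly windows
PER_WEEK = 100    # comments sampled per week, per subreddit


def weekly_windows(start=START, weeks=WEEKS):
    """Yield (after, before) Unix-timestamp pairs, one pair per week."""
    for i in range(weeks):
        lo = start + timedelta(weeks=i)
        hi = lo + timedelta(weeks=1)
        yield int(lo.timestamp()), int(hi.timestamp())


# Assumed markers for removed content and moderation bots.
PLACEHOLDER_BODIES = {"", "[deleted]", "[removed]"}
BOT_AUTHORS = {"AutoModerator"}


def keep_comment(comment):
    """Return True for comments with real content from non-bot authors."""
    body = (comment.get("body") or "").strip()
    return body not in PLACEHOLDER_BODIES and comment.get("author") not in BOT_AUTHORS


windows = list(weekly_windows())
print(len(windows) * PER_WEEK)  # comments requested per subreddit
```

With 104 weekly windows and 100 comments per window, the sketch requests 10,400 comments per subreddit, matching the total reported above.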

EDA

Exploration into word counts revealed that r/Canada had a preference for words relating to economic and business issues, while r/OnGuardForThee had a strong preference for words relating to identity politics and government. r/OnGuardForThee also had a slightly higher obscenity count.

Processing and Modelling

Data was preprocessed by removing typical stopwords, as well as certain words that were used consistently in both subreddits. URLs were removed from comments (mostly: some slipped through because of regex errors). Comments were tokenized by removing punctuation, keeping only words made up of alphabetical characters, and stemming with the Snowball Stemmer.
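A minimal sketch of this kind of preprocessing is shown below. The regular expressions and the tiny stopword set are assumptions for illustration (the project's actual patterns and stopword list are not given here), and NLTK is imported lazily so the tokenizer can also run with a stand-in stemmer.

```python
import re

# Assumed patterns; the project's own regexes are not shown in this readme.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
WORD_RE = re.compile(r"\b[a-z]+\b")  # purely alphabetical words only

# Tiny illustrative stopword set, not the list used in the project.
STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to", "is", "in", "it"})


def snowball_stem():
    """Return NLTK's English Snowball stemming function (requires nltk)."""
    from nltk.stem.snowball import SnowballStemmer
    return SnowballStemmer("english").stem


def tokenize(text, stem=lambda word: word, stopwords=STOPWORDS):
    """Lowercase, drop URLs, keep alphabetical non-stopwords, then stem."""
    text = URL_RE.sub(" ", text.lower())
    return [stem(w) for w in WORD_RE.findall(text) if w not in stopwords]


print(tokenize("Check https://example.com the housing market in 2020!"))
```

Passing `stem=snowball_stem()` applies Snowball stemming; the default leaves words unstemmed, which makes the URL and stopword handling easier to inspect on its own.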

Collected data was split 75/25 into training and test sets.

A large number of models were grid searched using 5-fold cross-validation to identify the best one. These models were typically pipelines of a vectorizer and a binary classifier. Almost all possible combinations of the following vectorizers and classifiers were tested with a wide range of parameters.

Vectorizer:

CountVectorizer
TfidfVectorizer

Classifier:

Logistic Regression
K Nearest Neighbors
Naive Bayes
Random Forest
AdaBoost

After cross-referencing scores and refining parameters, the final model chosen was a Bernoulli naive Bayes classifier operating on 4000 features/tokens gathered from the training set after processing and stemming. Other models offered comparable performance, but this one overfit less.
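The split and search described in this section could look something like the sketch below, using scikit-learn. The parameter values in the search space are hypothetical (the values actually tried are not listed here), only the naive Bayes branch is shown, and the `random_state` and stratified split are assumptions. scikit-learn is imported inside the functions so the search space can be inspected without it installed.

```python
from itertools import product

# Hypothetical search space; only 4000 features is confirmed by the text above.
VECTORIZERS = ("count", "tfidf")
MAX_FEATURES = (2000, 4000, 8000)
ALPHAS = (0.1, 0.5, 1.0)

# Number of parameter combinations the grid search would evaluate.
n_candidates = len(list(product(VECTORIZERS, MAX_FEATURES, ALPHAS)))


def split(comments, labels):
    """75/25 train/test split, stratified by subreddit (assumed)."""
    from sklearn.model_selection import train_test_split
    return train_test_split(
        comments, labels, test_size=0.25, stratify=labels, random_state=42
    )


def build_search(cv=5):
    """Grid search over vectorizer type, vocabulary size, and NB smoothing."""
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.model_selection import GridSearchCV
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.pipeline import Pipeline

    pipeline = Pipeline([("vec", CountVectorizer()), ("clf", BernoulliNB())])
    param_grid = {
        "vec": [CountVectorizer(), TfidfVectorizer()],  # swap the whole step
        "vec__max_features": list(MAX_FEATURES),
        "clf__alpha": list(ALPHAS),
    }
    return GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")


print(n_candidates)
```

Usage would be along the lines of `X_train, X_test, y_train, y_test = split(comments, labels)` followed by `build_search().fit(X_train, y_train)`; comparing train and cross-validation scores across candidates is one way to spot the overfitting mentioned above.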

Model Performance and Analysis

We evaluated the model on the test set using standard classification metrics. Additionally, results were recorded separately for the test set partitioned into comments of more than 20 words (long comments) and comments of 20 or fewer words (short comments).

Metric	All comments	Long comments (>20 words)	Short comments (<=20 words)
Total Comments	5026	2822	2204
Actual r/OnGuardForThee	2533	1456	1077
Actual r/Canada	2493	1366	1127
Predicted r/OnGuardForThee	2018	1443	575
Predicted r/Canada	3008	1379	1629
Accuracy Score	0.6146	0.6389	0.5835
Recall Score	0.5160	0.6456	0.3408
Specificity Score	0.7148	0.6318	0.8154
Precision (OnGuardForThee) Score	0.6477	0.6514	0.6383
Precision (Canada) Score	0.5924	0.6258	0.5641
F1 Score	0.5744	0.6485	0.4443
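For reference, the scores in the table follow from confusion-matrix counts as sketched below, treating r/OnGuardForThee as the positive class. The four counts used here are not stated in the readme; they are back-derived from the reported scores and totals in the "All comments" column, so treat them as an assumption.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary-classification scores from confusion-matrix counts."""
    recall = tp / (tp + fn)      # share of actual positives that were found
    precision = tp / (tp + fp)   # share of predicted positives that were right
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "recall": recall,
        "specificity": tn / (tn + fp),          # recall for the negative class
        "precision_onguardforthee": precision,
        "precision_canada": tn / (tn + fn),     # precision for the negative class
        "f1": 2 * precision * recall / (precision + recall),
    }


# Back-derived counts for the "All comments" column (assumed, see above).
scores = binary_metrics(tp=1307, fp=711, fn=1226, tn=1782)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

The gap between recall (0.5160) and specificity (0.7148) shows the model leaning towards predicting r/Canada, a bias that is strongest on short comments.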

To better understand the model's behaviour, we examined the class probabilities it assigned to individual comments and pulled out the comments it classified with the highest confidence.

We also analyzed the probabilities of individual tokens to identify keywords and concepts that were strongly weighted towards one subreddit or the other.

Keywords and concepts that were highly predictive of being an r/OnGuardForThee comment included:

Metadiscussion about reddit issues and subreddit drama.
Infamous conservative media personalities.
Negative descriptors common in leftist discussion (denier, bigot, intolerant, misogynist, shill, incel)
Words directly related to Canadian political institutions (liberal, conservative, vote, elect, left, right, party)

Keywords and concepts that were highly predictive of being an r/Canada comment included:

Impaired driving.
Canada-China politics (China, Huawei, Taiwan, etc.).
Real estate/housing market discussion.
Economic and business concept discussion.
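One way to rank tokens like this is by the log-ratio of their smoothed per-class presence probabilities, which is the quantity a Bernoulli naive Bayes model learns for each feature. The sketch below is an assumption about how such a ranking could be produced, not the project's analysis code; `token_log_odds` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def token_log_odds(docs_a, docs_b, alpha=1.0):
    """Log-odds of token presence in class A vs class B.

    docs_a / docs_b are lists of token lists. Probabilities use the
    Bernoulli estimate (docs containing token + alpha) / (docs + 2 * alpha).
    Positive values lean towards class A, negative towards class B.
    """
    df_a = Counter(tok for doc in docs_a for tok in set(doc))
    df_b = Counter(tok for doc in docs_b for tok in set(doc))

    def prob(df, n_docs, tok):
        return (df[tok] + alpha) / (n_docs + 2 * alpha)

    return {
        tok: math.log(prob(df_a, len(docs_a), tok))
        - math.log(prob(df_b, len(docs_b), tok))
        for tok in set(df_a) | set(df_b)
    }


# Toy stemmed comments, made up for illustration only.
onguard = [["bigot", "vote"], ["vote"]]
canada = [["hous"], ["hous", "vote"]]

odds = token_log_odds(onguard, canada)
for tok in sorted(odds, key=odds.get, reverse=True):
    print(f"{tok}: {odds[tok]:+.3f}")
```

With a fitted scikit-learn `BernoulliNB`, a similar ranking can be read from the difference between the rows of its `feature_log_prob_` attribute.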

Conclusions

Distinguishing between the Canadian-issue subreddits r/Canada and r/OnGuardForThee is a difficult classification task. Though not particularly accurate, our model performs above the baseline at identifying comments from our test set (61.5% vs 50.4%). Restricting the tested comments to those longer than 20 words further increases accuracy to 63.9%.

Due to its strong bias and poor performance on short comments, I would advise against using this model to classify comments that contain 20 or fewer words.

This model could possibly be used for sentiment analysis, or for trying to discern political flavour/ideology in a text sample. Though it is not an exact classifier, this is an inherently messy problem, and I believe the model could be useful as a heuristic (as long as the user is aware of its limitations).

pdornian/GA-reddit-canada-politics
