In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Data Exploration

**Ethics Disclaimer:** This is the first data science project I've done that deals with people's opinions and sensitive discussions. You might have noticed during our data cleaning that, aside from deleted/removed posts, the most repeated comment was a copy-pasted paragraph denying the Holodomor (the Ukrainian famine that's widely recognized as an act of genocide perpetrated by the USSR). This is probably not the only ugliness that we're going to see.

As we explore our data, there's a good chance that we're going to see vile opinions espousing racism and xenophobia, and all sorts of general toxicity, cruelty and ignorance. This is an unfortunate inevitability of processing anonymous political commentary. I'm not a big fan of broadcasting this kind of content, but it's part of the data we're dealing with, and understanding how it's distributed between subreddits might become part of our underlying classification metric.

## General Stats

In [2]:
# Forward slashes work on every OS and avoid accidental backslash escapes
canada_df=pd.read_csv('../data/canada_subreddit_comments.csv')

In [3]:
canada_df

Unnamed: 0,subreddit,author,created_utc,score,body,body_processed,subreddit_bin
0,onguardforthee,nalydpsycho,1600907182,1,"I understand what you are saying, what I don't...","I understand what you are saying, what I don't...",1
1,onguardforthee,dexx4d,1600907548,1,"Huh, didn't know the owner was like that.\n\nS...","Huh, didn't know the owner was like that.\n\nS...",1
2,onguardforthee,Man_Bear_Beaver,1600907576,1,love them transfer payments though,love them transfer payments though,1
3,onguardforthee,whaleoilbeefhookt,1600907722,1,it takes two to tango. tab p in slot v stuff.,it takes two to tango. tab p in slot v stuff.,1
4,onguardforthee,NotInsane_Yet,1600907888,1,It's enough to finance them for a few years. ...,It's enough to finance them for a few years. ...,1
...,...,...,...,...,...,...,...
20097,canada,toastee,1538658871,184,"Can confirm, saw this exact issue on a Yukon r...","Can confirm, saw this exact issue on a Yukon r...",0
20098,canada,FireballSambucca,1538654720,183,She is in the news quite a bit....https://www....,She is in the news quite a bit....https://www....,0
20099,canada,unseencs,1538666172,180,He was employed? This story keeps getting str...,He was employed? This story keeps getting str...,0
20100,canada,mikailus,1538882519,180,"As a pro-choice and pro-free speech guy, great...","As a pro-choice and pro-free speech guy, great...",0


In [4]:
canada_df.shape

(20102, 7)

In [5]:
canada_df['subreddit'].value_counts(normalize=True)

onguardforthee    0.50388
canada            0.49612
Name: subreddit, dtype: float64

In [6]:
canada_df['author'].describe()

count              20102
unique             10135
top       Caucasian_Fury
freq                 120
Name: author, dtype: object

In [7]:
canada_df['author'].value_counts().mean()

1.9834237789837197

We have a dataset of 20,102 comments, with an almost 50/50 split between r/Canada and r/OnGuardForThee sources. We have 10,135 distinct comment authors. The most prolific author in the dataset wrote 120 comments, but the mean author only wrote about 2 comments.
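A mean of about 2 comments per author alongside a maximum of 120 suggests a skewed distribution, so the median is worth checking too. Here is a minimal sketch on made-up author names (the toy list is an assumption, not the real data). On the real frame, the matching call would be `canada_df['author'].value_counts().median()`.

```python
from collections import Counter
import statistics

# Hypothetical toy data: one very active author and seven one-off commenters.
toy_authors = ["a"] * 9 + ["b", "c", "d", "e", "f", "g", "h"]

# Count comments per author, like Series.value_counts() does.
comments_per_author = Counter(toy_authors)
counts = list(comments_per_author.values())

mean_count = statistics.mean(counts)      # pulled up by the single heavy poster
median_count = statistics.median(counts)  # what a typical author looks like

print(mean_count, median_count)
```

When the mean sits well above the median, a handful of heavy posters account for a large share of the comments.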

Aside from text content, I want to know if there's a significant difference in average comment word count per subreddit.

In [8]:
# Raw string so that \w is passed to the regex engine untouched
canada_df['word_count']=canada_df['body'].str.count(r"\w+")

In [9]:
# Select the numeric column before averaging so text columns are never aggregated
canada_df.groupby('subreddit')['word_count'].mean()

subreddit
canada            42.804773
onguardforthee    44.853490
Name: word_count, dtype: float64

r/OnGuardForThee comments tend to be slightly longer, but by a pretty negligible amount (about 2 words). The average comment contains roughly 44 words.
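Comparing two means by eye doesn't tell us whether the gap is statistically significant. One common option is Welch's t-test, sketched below with the standard library on made-up word counts (the toy lists and the `welch_t` helper are assumptions for illustration). With SciPy installed, `scipy.stats.ttest_ind(a, b, equal_var=False)` should give the same statistic plus a p-value.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    # Standard error of the difference between the two sample means.
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error

# Hypothetical word counts, not taken from the real dataset.
toy_canada = [12, 40, 7, 95, 33, 60]
toy_ogft = [15, 44, 9, 101, 30, 70]

t_stat = welch_t(toy_canada, toy_ogft)
print(round(t_stat, 3))
```

Comment lengths are usually heavily right-skewed, so a rank-based test or a comparison of medians might be a safer choice on the real data.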

## Common Words

Let's do a quick and dirty analysis of the most common words from each subreddit using sklearn tools.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

In [11]:
comments_canada=canada_df.loc[canada_df['subreddit']=='canada', 'body']
comments_ogft=canada_df.loc[canada_df['subreddit']=='onguardforthee', 'body']

In [12]:
cv_canada = CountVectorizer(stop_words='english', max_features=5000)
cv_ogft = CountVectorizer(stop_words='english', max_features=5000)

In [13]:
wf_canada=cv_canada.fit_transform(comments_canada)

In [14]:
wf_ogft=cv_ogft.fit_transform(comments_ogft)

In [15]:
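# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (1.0+)
wf_canada_df=pd.DataFrame(wf_canada.toarray(), columns=cv_canada.get_feature_names_out())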

In [16]:
wf_canada_df

Unnamed: 0,000,01,02,03,04,05,06,07,08,09,...,york,young,youre,youth,youtu,youtube,yup,zealand,zero,zone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9968,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9969,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9970,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9971,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
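# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (1.0+)
wf_ogft_df=pd.DataFrame(wf_ogft.toarray(), columns=cv_ogft.get_feature_names_out())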

In [18]:
wf_ogft_df

Unnamed: 0,000,01,02,03,04,05,06,07,08,09,...,younger,youre,youth,youtu,youtube,yup,zealand,zero,zone,édition
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10124,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
10125,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10126,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10127,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
wf_canada=wf_canada_df.sum()
wf_ogft=wf_ogft_df.sum()
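To make explicit what the vectorizer cells compute, here is a rough standard-library equivalent on toy comments: lowercase, pull out tokens of two or more word characters, drop stop words, and sum the counts across all comments. The stop-word set and sample comments are assumptions, and `top_words` is a hypothetical helper, not something from the notebook.

```python
import re
from collections import Counter

# Tokens of two or more word characters, similar in spirit to the vectorizer's default.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

# Tiny hypothetical stop-word set; scikit-learn's 'english' list is much longer.
STOP_WORDS = {"the", "is", "and", "to", "of", "in"}

def top_words(comments, n=3):
    """Return the n most frequent non-stop-word tokens across all comments."""
    counts = Counter()
    for comment in comments:
        tokens = TOKEN_PATTERN.findall(comment.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

toy_comments = [
    "The government is raising taxes",
    "People want the government to cut taxes",
    "People like hockey",
]

top = top_words(toy_comments)
print(top)
```

The sparse matrix returned by `fit_transform` can also be summed directly with `.sum(axis=0)`. That avoids building a dense frame with thousands of columns just to add them up.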

Now, let's look at the top 50 words in each subreddit.

In [20]:
wf_canada.sort_values(ascending=False).head(50)

people        1963
just          1540
like          1459
don           1359
gt            1288
canada        1272
think          776
time           760
government     730
know           644
going          623
good           599
right          596
need           595
https          589
make           580
want           579
canadian       556
really         539
work           538
way            535
money          521
years          515
said           512
doesn          474
ve             471
did            467
say            442
actually       403
country        398
year           395
trudeau        387
isn            367
thing          361
china          359
www            358
sure           353
new            353
better         350
ll             349
pay            345
does           344
lot            341
didn           340
canadians      338
point          332
news           330
things         317
day            310
doing          307
dtype: int64

In [21]:
wf_ogft.sort_values(ascending=False).head(50)

people           2452
just             1823
like             1727
don              1393
gt               1358
canada           1225
right             971
think             934
https             746
time              713
know              697
going             658
make              656
want              645
good              632
government        627
really            604
way               602
need              557
ve                516
party             501
canadian          498
say               494
conservative      488
work              478
years             478
doesn             462
actually          460
thing             456
isn               448
com               446
money             434
things            432
conservatives     430
did               429
www               428
said              428
vote              420
shit              412
lot               403
trudeau           402
better            398
country           386
news              378
does              367
ll        

Looking at the top 50 tokens in each subreddit tells a few things, both about content and about what our future processing should look like.

### Token Refinement

- Contractions are being split at the apostrophe, leaving fragments like "don", "doesn", "ve" and "ll". These should be addressed, either by regex or stemming.
- The top six tokens in each subreddit are the same: "people", "just", "like", "don", "gt" and "canada". Since we're looking to distinguish between the subreddits, I think I might add these to our stopwords so the model can focus on less common words. ("gt" is most likely left over from "&gt;", the HTML-escaped ">" that Reddit uses for quoting.)
- From the appearance of "www", "com", and "https", we can see that URLs are being split into separate tokens. I think we should probably just remove URLs in processing.
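The odd tokens in the list above can be reproduced with the default `token_pattern` documented for scikit-learn's `CountVectorizer`. The sketch below applies that pattern with plain `re`, then shows one possible cleanup pass. The `clean` helper, its patterns, and the sample sentence are assumptions for illustration, not the processing this project ended up using.

```python
import re

# Default token_pattern documented for scikit-learn's CountVectorizer.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical cleanup patterns, not part of the original notebook.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
QUOTE_ENTITY = re.compile(r"&gt;")
APOSTROPHE = re.compile(r"(\w)['’](\w)")

def tokenize(text):
    """Lowercase and extract tokens the way the default vectorizer would."""
    return TOKEN_PATTERN.findall(text.lower())

def clean(text):
    """Strip URLs and escaped quote markers, then join contractions."""
    text = URL_PATTERN.sub(" ", text)
    text = QUOTE_ENTITY.sub(" ", text)
    # "don't" -> "dont", so the contraction stays a single token.
    return APOSTROPHE.sub(r"\1\2", text)

sample = "&gt; I don't think so. See https://www.example.com/page"

raw_tokens = tokenize(sample)
clean_tokens = tokenize(clean(sample))

print(raw_tokens)
print(clean_tokens)
```

Joining contractions is only one choice. Expanding them ("don't" to "do not") or passing a custom `token_pattern` to the vectorizer are other options.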

### Discussion differences

- r/OnGuardForThee uses the word "right" more than r/Canada (971 vs. 596 occurrences). It also has "conservative" and "conservatives" crack the top 50, while they don't in r/Canada. This might reflect a higher incidence of discussion of right-wing ideology, although "right" is ambiguous and also shows up in phrases like "right now" or "you're right".
- r/Canada uses the word "government" more (730 vs. 627 occurrences). This may reflect more policy-based discussion.
- "pay" appears in r/Canada's top 50 words, but not in r/OnGuardForThee's. This might correspond to discussions of political budgets.
- "white" appears in r/OnGuardForThee's top 50 words, but not in r/Canada. This likely reflects a higher incidence of identity politics discussions.
- Obscenities crack the top 50 in r/OnGuardForThee, but not r/Canada. This could be an example of differences in discussion tone.
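Comparisons like the ones above are easier to scan as ratios. The sketch below uses four counts copied from the two top-50 outputs and ranks them by how much more often each word appears in r/OnGuardForThee. The ratios are raw, so they are not adjusted for r/OnGuardForThee having slightly more and slightly longer comments. The variable names are mine.

```python
# Counts copied from the top-50 outputs above.
canada_counts = {"right": 596, "government": 730, "money": 521, "trudeau": 387}
ogft_counts = {"right": 971, "government": 627, "money": 434, "trudeau": 402}

# Ratio > 1 means the word is more frequent in r/OnGuardForThee.
ratios = {word: ogft_counts[word] / canada_counts[word] for word in canada_counts}

# Most OGFT-leaning words first.
ranked = sorted(ratios, key=ratios.get, reverse=True)

for word in ranked:
    print(f"{word:<12}{ratios[word]:.2f}")
```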

Let's make a dataframe of some particularly loaded keywords that might come up in discussions of Canadian issues.

In [22]:
wf_canada.sort_values(ascending=False).iloc[100:150]

cost          209
having        206
best          205
life          205
trying        205
business      200
means         195
aren          195
away          194
yeah          193
guy           192
free          192
10            192
feel          190
companies     190
ford          190
live          189
different     189
health        189
legal         187
buy           185
political     185
working       184
liberals      182
market        181
vote          181
man           181
hard          179
lol           177
understand    176
support       176
company       176
needs         175
gun           174
case          173
place         173
read          172
little        172
quebec        171
economy       171
house         171
literally     170
believe       170
help          169
media         167
agree         166
makes         166
able          166
says          166
start         166
dtype: int64

In [23]:
identity_politics = ['white', 'black', 'indian', 'native', 'asian', 'chinese',
                    'sex', 'race', 'gender', 'racism', 'social', 'justice', 'bias',
                    'discrimination', 'men', 'women', 'gay', 'lgbt', 'trans', 'identity']
government = ['liberal', 'conservative', 'ndp', 'bloc', 'socialist', 'communist', 'trudeau', 'singh', 'scheer', 'senate',
             'scandal', 'parliament', 'mp', 'treaty', 'budget', 'spend', 'election', 'tax', 'party',
             'ford', 'vote']

issues=['immigration', 'abortion', 'drugs', 'housing', 'covid', 'cerb', 'china', 'trump', 'oil',
       'climate', 'warming', 'pipeline', 'gun', 'energy', 'taxes', 'trade', 'police', 'crime', 'military',
       'money', 'public', 'world', 'job', 'jobs', 'cost', 'business', 'free', 'health', 'legal', 'working',
       'economy', 'market', 'company', 'corporation']

geography = ['bc', 'alberta', 'saskatchewan', 'manitoba', 'ontario', 'quebec',
            'newfoundland', 'brunswick', 'pei', 'scotia', 'vancouver', 'calgary',
            'edmonton', 'winnipeg', 'toronto', 'ottawa', 'province', 'city',
            'montreal', 'halifax', 'john', 'east', 'west', 'north']

obscene=['shit', 'fuck', 'fucking', 'damn', 'asshole', 'bitch']


In [24]:
keywords=[]
keywords.extend(identity_politics)
keywords.extend(government)
keywords.extend(issues)
keywords.extend(geography)
keywords.extend(obscene)

categories = {'identity': identity_politics,
             'government': government,
             'issue': issues,
             'geography': geography,
             'obscene': obscene}

In [25]:
def category_map(word):
    for key, wordlist in categories.items():
        if word in wordlist:
            return key
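`category_map` scans every list on each call and silently returns `None` for a word that is in no category. An alternative is to invert the dictionary once, which also makes it easy to spot words listed under more than one category. This is a minimal sketch with a trimmed-down, hypothetical `categories` dict. `Series.map` accepts a dictionary directly, so `word_to_category` could be passed in place of the function.

```python
from collections import Counter

# Trimmed-down stand-in for the notebook's categories dict.
categories = {
    "identity": ["white", "gender"],
    "government": ["trudeau", "vote"],
    "obscene": ["damn"],
}

# Invert once: word -> category.
word_to_category = {
    word: category
    for category, words in categories.items()
    for word in words
}

# Words that appear in more than one list would be silently overwritten above.
occurrences = Counter(word for words in categories.values() for word in words)
duplicates = [word for word, n in occurrences.items() if n > 1]

print(word_to_category.get("trudeau"))
print(word_to_category.get("hockey"))  # unknown words give None, like category_map
print(duplicates)
```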

In [26]:
keyword_df=pd.DataFrame()
keyword_df['words']=keywords
keyword_df['canada_count'] = wf_canada[keywords].values
keyword_df['ogft_count'] = wf_ogft[keywords].values
keyword_df['category']=keyword_df['words'].map(category_map)
keyword_df.set_index('words')

words,canada_count,ogft_count,category
white,101,356,identity
black,83,85,identity
indian,16,22,identity
native,35,38,identity
asian,11,11,identity
...,...,...,...
fuck,295,331,obscene
fucking,219,315,obscene
damn,75,60,obscene
asshole,19,39,obscene
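Selecting `wf_canada[keywords]` assumes every keyword survived the `max_features=5000` cut in both vocabularies. With recent pandas versions, a label that is missing from the Series raises a `KeyError`. The sketch below shows the defensive version of that lookup with a plain dict. The three counts are copied from the table above, and treating "cerb" as missing is purely hypothetical. On the real Series the counterpart would be `wf_canada.reindex(keywords, fill_value=0)`.

```python
# Counts copied from the r/Canada column above; "cerb" is left out on purpose
# to simulate a keyword that did not make it into the vocabulary.
vocab_counts = {"white": 101, "black": 83, "damn": 75}
keywords = ["white", "black", "damn", "cerb"]

# Missing keywords count as zero instead of raising an error.
safe_counts = [vocab_counts.get(word, 0) for word in keywords]

# Worth reporting, so a silently absent keyword doesn't skew a category total.
missing = [word for word in keywords if word not in vocab_counts]

print(safe_counts)
print(missing)
```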


In [27]:
keyword_df[['canada_count', 'ogft_count']].sum()

canada_count    11817
ogft_count      13586
dtype: int64

In [28]:
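# Sum only the count columns so the 'words' text column is never aggregated
keyword_df.groupby('category')[['canada_count', 'ogft_count']].sum()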

category,canada_count,ogft_count
geography,1970,2000
government,2384,3723
identity,1101,1892
issue,5441,4788
obscene,921,1183


Overall, my keywords hit more occurrences in r/OnGuardForThee than r/Canada (13,586 vs. 11,817). I tried to select keywords that would show up around equally in both, but wasn't able to find a set that gave me an exact balance and still held semantic meaning. This could indicate one of two things:
- I just kinda made up the keywords myself based on what I thought would be relevant while cross-referencing top words in both subreddits. The bias towards OGFT in word occurrence might just reflect my own political/discussion leanings.
- OGFT might engage in more specific, technical discussion than r/Canada about policy and social issues.

r/OnGuardForThee dominates word counts that relate to government and identity politics. It has a slight edge in obscenity.

r/Canada, on the other hand, has a slight edge in the "issue" category, which encompasses a wide range of general keywords, including business concerns and broader social concerns.
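Since the keyword totals differ between the subreddits, raw category counts are hard to compare directly. One way to put them on the same footing is to express each category as a share of that subreddit's keyword hits. The sketch below does this with the category totals copied from the grouped table above; the variable names are mine.

```python
# (r/Canada count, r/OnGuardForThee count), copied from the grouped table above.
category_counts = {
    "geography": (1970, 2000),
    "government": (2384, 3723),
    "identity": (1101, 1892),
    "issue": (5441, 4788),
    "obscene": (921, 1183),
}

canada_total = sum(canada for canada, _ in category_counts.values())
ogft_total = sum(ogft for _, ogft in category_counts.values())

# Share of each subreddit's keyword hits that falls in each category.
shares = {
    category: (canada / canada_total, ogft / ogft_total)
    for category, (canada, ogft) in category_counts.items()
}

for category, (canada_share, ogft_share) in shares.items():
    print(f"{category:<12}{canada_share:6.1%}{ogft_share:8.1%}")
```

On these numbers, "government" makes up roughly 20% of r/Canada's keyword hits and roughly 27% of r/OnGuardForThee's. These shares only describe the mix of keyword hits. Getting true per-word rates would need a division by each subreddit's total word count.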