# Reddit Classification Problem

## Laura Minter
### November 2021

### Notebook 05: Results
This notebook assumes that the previous [modeling notebook](./04-modeling.ipynb) was successfully run and that the necessary file structure is in place.  

In [1]:
import pandas as pd
import os
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import matplotlib_venn as mpv

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression


### Get the data


In [2]:
df_sea = pd.read_csv('./data/sea_df_cleaned.csv')
df_pdx = pd.read_csv('./data/pdx_df_cleaned.csv')

df = pd.concat([df_sea,df_pdx])

#fixing 'from_seattle' column that appears to have been corrupted in save
df['from_seattle'] = 1-df['from_portland']
df.dropna(inplace = True)

In [3]:
X = df['fulltext']
y = df['from_seattle']

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

In [5]:
%%time

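# vectorize with TF-IDF, keeping the 3000 most frequent non-stop-word terms
tfidf = TfidfVectorizer(stop_words='english', max_features=3000)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

# dense view of the training features, named so it does not overwrite the raw `df`
# (get_feature_names_out requires scikit-learn >= 1.0; older versions use get_feature_names)
tfidf_df = pd.DataFrame(X_train_tfidf.todense(), columns=tfidf.get_feature_names_out())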

CPU times: user 6.59 s, sys: 957 ms, total: 7.54 s
Wall time: 7.8 s


In [6]:
%%time

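# vectorize with CountVectorizer, using the same vocabulary size as the TF-IDF features
cv = CountVectorizer(stop_words='english', max_features=3000)

X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)

# dense view of the training counts
# (get_feature_names_out requires scikit-learn >= 1.0; older versions use get_feature_names)
cvdf = pd.DataFrame(X_train_cv.todense(), columns=cv.get_feature_names_out())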

CPU times: user 6.4 s, sys: 969 ms, total: 7.37 s
Wall time: 7.63 s


### Build our Final Model

In [7]:
model = LogisticRegression(C=1.0,
                           penalty='l2',
                           solver='lbfgs',
                           max_iter=10_000)

In [8]:
%%time
model.fit(X_train_tfidf,y_train)

CPU times: user 2.41 s, sys: 15.1 ms, total: 2.42 s
Wall time: 2.43 s


LogisticRegression(max_iter=10000)

In [9]:
model.score(X_train_tfidf,y_train),model.score(X_test_tfidf,y_test)

(0.8349083810284396, 0.8245263765029505)
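
The train and test scores above are easier to interpret next to a naive baseline. The following is a minimal sketch, not part of the original analysis, of the accuracy obtained by always predicting the most common training label. The function name and the toy labels are made up for illustration; in this notebook the inputs would be `y_train` and `y_test`.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy on y_test of always predicting the most common label in y_train."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == majority_label)
    return hits / len(y_test)


# Toy labels (hypothetical): 1 = from_seattle, 0 = from_portland
toy_train = [1, 1, 1, 0, 0]
toy_test = [1, 0, 1, 1]
baseline = majority_baseline_accuracy(toy_train, toy_test)
print(baseline)
```

A model score well above this baseline suggests that the text features carry real signal rather than just reflecting class imbalance.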

### Look at the coefficients

In [10]:
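# make a dictionary mapping each feature (word) to its coefficient;
# positive values push toward from_seattle = 1, negative toward Portland
# (get_feature_names_out requires scikit-learn >= 1.0; older versions use get_feature_names)
feature_dict = dict(zip(tfidf.get_feature_names_out(), model.coef_[0]))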

### Distinguishing features

In [11]:
strongly_seattle = {key:feature_dict[key] for key in feature_dict.keys() if feature_dict[key]>5}

In [12]:
strongly_seattle

{'alki': 5.508609455102751,
 'aurora': 5.44334425878149,
 'awant': 5.429280091430405,
 'ballard': 10.847664340910363,
 'bellevue': 8.377455123405749,
 'belltown': 6.569292399803562,
 'boeing': 7.634601621179027,
 'cal': 5.079950009300115,
 'capitol': 6.346362098123822,
 'chaz': 6.192034479915578,
 'durkan': 6.020205994707615,
 'eahawks': 5.071916350666271,
 'eatac': 7.23019078684198,
 'eattle': 27.10014892457696,
 'eattleites': 5.597032686606499,
 'eattlenan': 8.665655235185952,
 'everett': 5.125546406814873,
 'hill': 5.313539908589963,
 'horeline': 5.208740991919121,
 'inslee': 7.576655476369292,
 'kent': 5.309584451699668,
 'king': 7.0217805202496795,
 'kirkland': 5.311920400507636,
 'lu': 5.0735981328467785,
 'mariners': 5.093931299375117,
 'mercer': 5.346765660220027,
 'nohomish': 5.009449205289194,
 'noqualmie': 5.883414132707752,
 'northgate': 6.301448688849984,
 'pd': 8.026986517923506,
 'pike': 7.2264111564312685,
 'rainier': 6.939092237456799,
 'redmond': 5.587854544281343,
 '

In [13]:
strongly_portland = {key:feature_dict[key] for key in feature_dict.keys() if feature_dict[key]<-5}

In [14]:
strongly_portland

{'82nd': -5.236078406971803,
 'beaverton': -8.039120971130853,
 'burnside': -6.440158274990292,
 'clackamas': -6.494863302159791,
 'gresham': -6.071736611352685,
 'hawthorne': -6.844845977412168,
 'hillsboro': -5.870694294749486,
 'hood': -5.78002275836463,
 'lloyd': -5.582122936047226,
 'multnomah': -8.983792353450655,
 'ohu': -5.543265317705459,
 'oregon': -14.554710388440391,
 'oregonian': -5.182815470626887,
 'oregonians': -6.015112444151966,
 'oregonnan': -5.03927123101487,
 'oswego': -6.623282600437461,
 'pdx': -11.74167703158699,
 'portland': -27.906713458429422,
 'portlanders': -7.817978366744192,
 'portlandnan': -9.522621928729722,
 'powell': -6.3786892120079965,
 'ppb': -6.741193351384933,
 'rose': -5.265480406302399,
 'tabor': -5.650067921041797,
 'tigard': -5.754568146946564,
 'trimet': -9.00528462671076,
 'willamette': -5.9828927152335165}

We see that many of the most strongly distinguishing features are place names: neighborhoods, streets, and counties. For Seattle we also see political terms (CHAZ, Sawant, Inslee), the Mariners baseball team, and the (previously local) company Boeing. Several tokens are missing a leading "s" (for example 'awant' for Sawant and 'eattle' for Seattle), which appears to be an artifact of the earlier text cleaning. For Portland we also have TriMet, the local transit agency.  
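
As an alternative to fixed thresholds such as `>5` and `<-5`, the features could also be ranked by coefficient. The sketch below is one possible helper, not the method used in this notebook. The function name `rank_features` is hypothetical, and the toy dictionary reuses a few coefficients from the output above, rounded to two decimals.

```python
def rank_features(feature_dict, n=3):
    """Return the n most negative and n most positive (word, coefficient) pairs."""
    ordered = sorted(feature_dict.items(), key=lambda item: item[1])
    return {
        "most_portland": ordered[:n],        # most negative coefficients first
        "most_seattle": ordered[::-1][:n],   # most positive coefficients first
    }


# A handful of coefficients from the output above, rounded for readability
toy_feature_dict = {
    "eattle": 27.10, "ballard": 10.85, "bellevue": 8.38, "bus": 1.65,
    "portland": -27.91, "oregon": -14.55, "pdx": -11.74,
}

ranked = rank_features(toy_feature_dict, n=2)
print(ranked["most_seattle"])
print(ranked["most_portland"])
```

Ranking avoids having to pick a cutoff in advance and keeps the output short even when many features clear a threshold.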

In [15]:
slightly_seattle = {key:feature_dict[key] for key in feature_dict.keys() if 5>feature_dict[key]>1}

In [16]:
slightly_seattle

{'13': 1.4733337512291877,
 '15th': 1.0851939767407666,
 '2019': 1.3673785186414356,
 '30pm': 1.0564362878672098,
 '3rd': 1.4833078366366423,
 '70': 1.0347609294711024,
 '90': 1.6822654466424654,
 '99': 2.549842633827121,
 'adults': 1.0081557671350743,
 'alaska': 2.207250714983628,
 'amazon': 4.484128154950164,
 'anderson': 1.3599908398201639,
 'anne': 4.962769787993874,
 'announced': 1.0920007776490976,
 'apartment': 1.0403730989664952,
 'asian': 1.8516678315772996,
 'ave': 2.4739067037553415,
 'bay': 2.1570712872692646,
 'bc': 2.4635559342086535,
 'beacon': 2.675092405522963,
 'bezos': 4.395399781172321,
 'bills': 1.1439095765678324,
 'blog': 1.0424741422580683,
 'border': 1.5702226969936537,
 'bus': 1.6540383722204042,
 'buses': 1.436102417711751,
 'canada': 2.294532136480842,
 'cancel': 1.0294352458768137,
 'candidate': 1.171500201975408,
 'cap': 2.02277982203681,
 'capital': 2.0259285266650875,
 'central': 1.2895420836472598,
 'cheaper': 1.1856532636652164,
 'china': 1.18105869146

Slightly Seattle, curated: recommendations, ferry, buses

In [17]:
slightly_portland = {key:feature_dict[key] for key in feature_dict.keys() if -5<feature_dict[key]<-1}

In [18]:
slightly_portland

{'10th': -1.8955673161637727,
 '13th': -1.067169691776428,
 '15': -1.0054104698162254,
 '150': -1.1326270213318261,
 '2015': -1.4302262201043314,
 '2016': -2.69091780607759,
 '2017': -1.140162890579035,
 '205': -4.879338467328869,
 '20th': -1.4256542316239385,
 '26': -2.0912397343413742,
 '28th': -1.896204236650834,
 '503': -3.1210518562575897,
 '84': -3.207380547437333,
 'abandoned': -1.0288625062795467,
 'actions': -1.318018244765568,
 'addition': -1.7677518674314245,
 'agents': -1.098978814545557,
 'alberta': -4.746830615257348,
 'alem': -4.9902146729268315,
 'alt': -1.0519217765255422,
 'ama': -1.1254956506730596,
 'andy': -4.399444433199313,
 'antifa': -1.2112758532882786,
 'anxiety': -1.3319188165891562,
 'beer': -1.1141443508671085,
 'belmont': -3.9888774656466786,
 'bernie': -1.0442116723377992,
 'bicycle': -1.2725576756733132,
 'blazers': -4.455121621327342,
 'blvd': -2.418302713065921,
 'bottle': -2.100929500280453,
 'boys': -1.8324687989527204,
 'bridge': -1.8693887026822003

Slightly Portland, curated: cyclists, guitar, roommate, teacher

In [19]:
middleground = {key:feature_dict[key] for key in feature_dict.keys() if -0.5<feature_dict[key]<0.5}

In [20]:
middleground

{'10': -0.038204317080403685,
 '100': 0.10416944278301982,
 '1000': -0.37052766463594194,
 '10am': -0.3175265810816171,
 '11': -0.2272255468140262,
 '11am': 0.40260019174568024,
 '11th': -0.09361221640312833,
 '12': -0.07274007072408539,
 '12th': -0.2369596722541561,
 '14': 0.006301916103903685,
 '14th': 0.05355445450831327,
 '16': -0.4271272682125857,
 '17': -0.16116089898102195,
 '18': -0.14874764375232877,
 '19nan': 0.37682473511589265,
 '1st': 0.48650680292771525,
 '20': -0.1822948437087182,
 '200': 0.02635075808026424,
 '2018welcome': 0.07334053886813016,
 '2020': 0.18279363767903067,
 '2020nan': 0.391313876211797,
 '2021': 0.48857639005574094,
 '21': -0.06397068441011831,
 '22': 0.16632232039918743,
 '23': 0.24397412092684245,
 '23rd': 0.39165310151392513,
 '24': 0.20792590437836395,
 '27': 0.3755072350923653,
 '28': 0.24201304467309198,
 '35': 0.4906383827932575,
 '36': 0.3255831590578715,
 '3d': 0.39437208265396967,
 '40': -0.41398910153070595,
 '44': 0.1978810974897348,
 '45':

We see that many features do not strongly help us distinguish between the two locations. With this many words, we will also look at how frequently each word is used on Reddit as a second metric.  
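
One way to combine the two metrics is sketched below, under assumptions. The helper name, the thresholds, and all numbers in the toy dictionaries are made up for illustration. In this notebook the coefficients would come from `feature_dict`, and the document counts could be derived from the `CountVectorizer` features.

```python
import pandas as pd


def frequent_distinguishers(coefs, doc_counts, min_abs_coef=1.0, min_docs=50):
    """Keep words that are both distinguishing (large |coef|) and commonly used."""
    table = pd.DataFrame({
        "coef": pd.Series(coefs),
        "doc_count": pd.Series(doc_counts),
    }).dropna()
    keep = (table["coef"].abs() >= min_abs_coef) & (table["doc_count"] >= min_docs)
    return table[keep].sort_values("doc_count", ascending=False)


# Hypothetical values, not results from the real data
toy_coefs = {"rain": 1.8, "alki": 5.5, "bikes": -2.1, "today": 0.1}
toy_doc_counts = {"rain": 900, "alki": 12, "bikes": 400, "today": 3000}

result = frequent_distinguishers(toy_coefs, toy_doc_counts)
print(result)
```

In this toy case "alki" is dropped for being rare and "today" for being uninformative, leaving "rain" and "bikes" as frequent words that also separate the two cities.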

### Comparing to our top 100 words from EDA

We see many differences when we look at words that appear in the top 100 for only one of the cities. By narrowing our focus to top words that also help us distinguish between the cities, we can further identify disparate themes.  
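
The comparison of top words can be sketched with plain set operations. This is an illustrative toy version, not the EDA code from the earlier notebooks. The helper `top_words` and the example posts are hypothetical, and whitespace tokenization stands in for the vectorizers used above.

```python
from collections import Counter


def top_words(texts, n):
    """Return the set of the n most frequent whitespace-separated words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word for word, _ in counts.most_common(n)}


# Hypothetical toy posts for each city
sea_posts = ["rain downtown kids", "rain downtown", "rain"]
pdx_posts = ["bikes downtown friends", "bikes downtown", "bikes"]

sea_top = top_words(sea_posts, n=2)
pdx_top = top_words(pdx_posts, n=2)

only_seattle = sea_top - pdx_top   # top words unique to Seattle
only_portland = pdx_top - sea_top  # top words unique to Portland
shared = sea_top & pdx_top         # top words common to both cities

print(only_seattle, only_portland, shared)
```

Sets like these could also be passed to `venn2` from `matplotlib_venn` (imported at the top of this notebook as `mpv`) to draw the overlap.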

Seattle is much more likely to use advice-related words like 'suggestion' and 'recommendations.'  

Portland is much more likely to use purchase-related words like 'buy' and 'money.'  


Seattle also has more references to weather-related words like 'rain', 'cloudy' and 'wind', as well as high occurrences of 'chance' and 'mph', which are not conclusively weather-related but are often used in that context.  



Portland is much more likely to reference friends, while Seattle is more likely to reference kids, parents, and adults. Experience tells us that these references may not be positive, and sentiment analysis on them may provide more insight.  



Unsurprisingly, both locations are more likely to use their own location words, like city and state names, as well as other hyperlocal place names like neighborhoods and streets.  

Both locations also use the words 'free' and 'food' at high rates as well as 'downtown' and 'community.'  

## Suggestions and recommendations

Reddit has an active PNW community, providing a good advertising opportunity.  Across the two markets we see roughly 86k unique authors publishing in the last two years.  

Marketing campaigns across the PNW can focus on community and local events as well as references to local places (e.g., neighborhoods, streets).

For Seattle, references to weather and family (kids, parents) are common and could provide good themes for marketing campaigns.  

For Portland, bikes and friends are referenced at high frequency and could be good themes for marketing.  

![Venn Diagram for our PNW themes](./images/Venn-diagram-reddit.png "Venn Diagram of PNW Themes")

#### Next steps
For the next step, we could look at time variation in themes. Investigating the historical seasonality and time variation of top words could help determine whether we could build a tool that capitalizes on real-time information about Reddit posts.  
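
A first pass at the time-variation idea might look like the sketch below. It rests on assumptions: the helper name is hypothetical, and so is the `created_utc` column (a Unix timestamp in seconds), which is not shown in the data loaded in this notebook. The toy posts are made up.

```python
import pandas as pd


def monthly_mention_rate(posts, word, text_col="fulltext", time_col="created_utc"):
    """Share of posts per calendar month whose text contains `word` (substring match)."""
    months = pd.to_datetime(posts[time_col], unit="s").dt.to_period("M")
    mentions = posts[text_col].str.contains(word, case=False, regex=False)
    return mentions.groupby(months).mean()


# Hypothetical toy posts: two in January 2021, two in February 2021
toy_posts = pd.DataFrame({
    "fulltext": ["Rain again today", "Ferry delays", "so much rain", "Rain and wind"],
    "created_utc": [1609459200, 1609545600, 1612137600, 1612224000],
})

rain_rate = monthly_mention_rate(toy_posts, "rain")
print(rain_rate)
```

Plotting such a series for weather or event words over the two years of data would show whether the themes are seasonal enough to be worth tracking in real time.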