### Imports

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

### Read `.csv`

In [2]:
df = pd.read_csv('./data/animal_data')  # r/aww  r/cats  r/dogs
# df = pd.read_csv('./data/all_data')

In [3]:
# Drop unnecessary columns
df.drop(columns='Unnamed: 0', inplace=True)

In [4]:
# Check nulls
df.isnull().sum()

title           0
subreddit       0
name            0
is_video        0
id              0
duplicate       0
clean_title    34
dtype: int64

In [5]:
# Find all rows where clean title was null
df[df['clean_title'].isnull()]

Unnamed: 0,title,subreddit,name,is_video,id,duplicate,clean_title
125,😍😍,aww,t3_a7x55r,False,a7x55r,False,
203,:),aww,t3_a7ydcz,False,a7ydcz,False,
711,- who is that there?,aww,t3_a80i4e,False,a80i4e,False,
748,:3,aww,t3_a7yc7f,False,a7yc7f,False,
822,.,aww,t3_a7y3dc,False,a7y3dc,False,
988,:) :) :) :),aww,t3_a84v4l,False,a84v4l,False,
989,Just this...,aww,t3_a899ci,False,a899ci,False,
1030,:3,aww,t3_a8anr2,False,a8anr2,False,
1175,🐰,aww,t3_a880w1,False,a880w1,False,
1193,😍,aww,t3_a87f77,False,a87f77,False,


In [6]:
# Drop null rows
df.dropna(inplace=True, 
          axis=0)

In [7]:
# Double check all null rows are gone
df.isnull().sum()

title          0
subreddit      0
name           0
is_video       0
id             0
duplicate      0
clean_title    0
dtype: int64

### `train_test_split`

In [8]:
# Define feature variable (X) and target variable (y)
X = df['clean_title']
y = df['subreddit']

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.25,
                                                    random_state=42,
                                                    stratify=y)

In [10]:
X_train.iloc[9]

'would cuter hiding u medicine'

In [11]:
X_train.head()

6317                   like whiskey barrel thanks hoomans
4617    little princess prefers organic food handmade ...
3590                                     freedom good boy
2987                             merry christmas gato cat
4195                     chewie contemplating life choice
Name: clean_title, dtype: object

### Count vectorizer

In [12]:
vectorizer = CountVectorizer(analyzer='word',
                            tokenizer=None,
                            preprocessor=None,
                            stop_words=None,
                            max_features=5000)

In [13]:
X_train_vect = vectorizer.fit_transform(X_train)
X_test_vect = vectorizer.transform(X_test)

In [14]:
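# Convert the sparse count matrix to a dense DataFrame, one column per vocabulary word
# (on scikit-learn >= 1.0 use get_feature_names_out(); get_feature_names() was removed in 1.2)
X_train_vect = pd.DataFrame(X_train_vect.toarray(), 
                            columns=vectorizer.get_feature_names())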

In [15]:
vocab = vectorizer.get_feature_names()

### Random Trees

In [16]:
# Instantiate model (a single decision tree; called "Random Trees" / RT in the analysis below)
tree = DecisionTreeClassifier(criterion='gini',
                              min_samples_split=4,  # score stops changing beyond 4
                              min_samples_leaf=5,   # score stops changing beyond 5
                              max_depth=5)          # depths 5 and 6 gave the best fit

In [17]:
# Fit on training data
tree.fit(X_train_vect, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=5, min_samples_split=4,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [18]:
tree.score(X_train_vect, y_train)

0.6774748923959828

In [19]:
tree.score(X_test_vect, y_test)

0.6673838209982789

### Logistic regression

In [20]:
# Instantiate model
lr = LogisticRegression()

In [21]:
# Fit on training data
lr.fit(X_train_vect, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [22]:
lr.score(X_train_vect, y_train)  # Overfit: train score is well above the test score; one reason RT is the preferred model

0.8559540889526542

In [23]:
lr.score(X_test_vect, y_test)

0.6897590361445783

### Analysis

Logistic Regression (LR) and Random Trees (RT; here a single `DecisionTreeClassifier`) scored about the same when differentiating between 2 subreddits. With 3 or more subreddits, however, RT generalized much better than LR: its train and test scores stayed nearly identical, while LR's diverged.

To evaluate how effective a model is, look at the score itself and also at the difference between the `train` score and the `test` score. Here, RT is the more reliable model because its train and test scores match closely, which indicates that it is not overfitting the training data.

In the three- and four-subreddit cases below, the gap between LR's `train` and `test` scores (0.163 and 0.152) is far larger than RT's (0.002 and 0.001). This means that the LR model is *overfitting* the data in these cases.

Also note that LR's higher scores shouldn't trick you into thinking it's the better model; what matters more here is the **difference** between the `train` and `test` scores.

##### `cats` vs. `dogs` 

|        | `train` score | `test` score |  difference |
|------  |------         |------        |------  |
| **LR** |   0.994 |   0.977  | *0.017* |
| **RT** |   0.977 |   0.972  | *0.005* |


##### `cats` vs. `dogs` vs. `aww`

|        | `train` score | `test` score |  difference |
|------  |------         |------        |------  |
| **LR** |   0.879 |   0.716  | *0.163* |
| **RT** |   0.689 |   0.687  | *0.002* |


##### `cats` vs. `dogs` vs. `aww` vs. `datascience`

|        | `train` score | `test` score |  difference |
|------  |------         |------        |------  |
| **LR** |   0.875 |   0.723  | *0.152* |
| **RT** |   0.648 |   0.647  | *0.001* |
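
The "difference" column in the tables above is simply the train score minus the test score. As a minimal sketch, the snippet below recomputes it from the reported scores; the dictionary layout and the task labels are illustrative choices, not part of the original notebook.

```python
# (train, test) scores copied from the three tables above
scores = {
    ("LR", "cats/dogs"): (0.994, 0.977),
    ("RT", "cats/dogs"): (0.977, 0.972),
    ("LR", "cats/dogs/aww"): (0.879, 0.716),
    ("RT", "cats/dogs/aww"): (0.689, 0.687),
    ("LR", "cats/dogs/aww/datascience"): (0.875, 0.723),
    ("RT", "cats/dogs/aww/datascience"): (0.648, 0.647),
}

# Train/test gap per model and task; a large positive gap suggests overfitting
gaps = {key: round(train - test, 3) for key, (train, test) in scores.items()}

for (model, task), gap in gaps.items():
    print(f"{model} {task:<26} gap = {gap:.3f}")
```

Sorting `gaps` by value makes it easy to see which model and task combination overfits the most.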

### Review predictions

In [24]:
# Predictions
a = tree.predict(X_test_vect)
b = y_test

In [25]:
# Turn numpy.ndarray into Series
a = pd.Series(data=a)

In [26]:
# Reset index
b = b.reset_index(drop=True)

In [27]:
# Concatenate true subreddit column & prediction column
both = pd.concat([a,b],axis=1)

In [28]:
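# Rename column 0, which holds the model's predictions from `a`
# (the `subreddit` column holds the true labels from `y_test`)
both.rename(columns={0:'original'},inplace=True)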

In [29]:
# All mismatches
# both[both['original'] != both['subreddit']]
len(both[both['original'] != both['subreddit']])

773

In [30]:
# aww --> cats
aww = both[both['original'] == 'aww']
aww_cats = aww[aww['subreddit'] == 'cats']

len(aww_cats)

637

In [31]:
# aww --> dogs
aww = both[both['original'] == 'aww']
aww_dogs = aww[aww['subreddit'] == 'dogs']

len(aww_dogs)

25

In [32]:
# cats --> aww
cats = both[both['original'] == 'cats']
cats_aww = cats[cats['subreddit'] == 'aww']

len(cats_aww)

93

In [33]:
# dogs --> aww
dogs = both[both['original'] == 'dogs']
dogs_aww = dogs[dogs['subreddit'] == 'aww']

len(dogs_aww)

11

In [34]:
# aww --> datascience
aww = both[both['original'] == 'aww']
aww_ds = aww[aww['subreddit'] == 'datascience']

len(aww_ds)

0

In [35]:
# datascience --> ~datascience (any subreddit other than datascience)
ds = both[both['original'] == 'datascience']
ds_other = ds[ds['subreddit'] != 'datascience']

len(ds_other)

0

In [36]:
# dogs --> datascience
dogs = both[both['original'] == 'dogs']
dogs_ds = dogs[dogs['subreddit'] == 'datascience']

len(dogs_ds)

0

In [37]:
# cats --> datascience
cats = both[both['original'] == 'cats']
cats_ds = cats[cats['subreddit'] == 'datascience']

len(cats_ds)

0

In [38]:
# cats --> dogs
cats = both[both['original'] == 'cats']
cats_dogs = cats[cats['subreddit'] == 'dogs']

len(cats_dogs)

0

In [39]:
# dogs --> cats
dogs = both[both['original'] == 'dogs']
dogs_cats = dogs[dogs['subreddit'] == 'cats']

len(dogs_cats)

7
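
The pairwise counts above can also be read off a single cross-tabulation of true labels against predicted labels. The sketch below uses `pd.crosstab` on a few made-up labels; `true_labels` and `predicted` are hypothetical stand-ins for `y_test` and `tree.predict(X_test_vect)`, and this is an alternative to the cell-by-cell filtering rather than the notebook's own method.

```python
import pandas as pd

# Hypothetical labels standing in for y_test and the model's predictions
true_labels = pd.Series(["aww", "aww", "cats", "dogs", "cats", "aww"], name="true")
predicted = pd.Series(["cats", "aww", "cats", "aww", "aww", "cats"], name="predicted")

# Rows are true subreddits, columns are predicted subreddits
confusion = pd.crosstab(true_labels, predicted)
print(confusion)

# Every off-diagonal cell is a mismatch
n_mismatches = int((true_labels != predicted).sum())
print("total mismatches:", n_mismatches)
```

Naming the two series explicitly (`true` and `predicted`) keeps the direction of each mismatch unambiguous when reading the table.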

### Analysis

Since `aww` could contain **anything cute and/or fluffy**, it makes sense that `aww` posts are relatively often predicted incorrectly as `cats` or `dogs`: they might be posts about a cat or a dog that happened to be posted in the `aww` subreddit.

In the other direction, far fewer `cats` or `dogs` posts are predicted incorrectly as `aww`. My educated guess is that few subject-specific words end up being strong predictors of *r/aww*, since it is a more general page.

By contrast, `cats` and `dogs` are narrower niches, so the association between high-frequency predictive words and each subreddit seems stronger.

The table below counts mismatches for a run that also included `datascience` (729 in total), so its numbers differ from the cell outputs above.


| original --> prediction | *# of mismatches* | *total mismatches* | *% total mismatches* |
|------  |------ |------ |------ |
| **`aww` --> `cats`** | 455 | 729 | 62.4% |
| **`aww` --> `dogs`** | 34 | 729 | 4.7% |
| **`cats` --> `aww`** | 71 | 729 | 9.7% |
| **`dogs` --> `aww`** | 6 | 729 | 0.8% |
| **`aww` --> `datascience`** | 149 | 729 | 20.4% |
| **`dogs` --> `datascience`** | 8 | 729 | 1.1% |
| **`dogs` --> `cats`** | 6 | 729 | 0.8% |
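
The percentage column is each pair's count divided by the total number of mismatches. A minimal sketch that recomputes it from the counts in the table (the dictionary layout is an illustrative choice):

```python
# Mismatch counts copied from the table above, keyed by (original, prediction)
mismatches = {
    ("aww", "cats"): 455,
    ("aww", "dogs"): 34,
    ("cats", "aww"): 71,
    ("dogs", "aww"): 6,
    ("aww", "datascience"): 149,
    ("dogs", "datascience"): 8,
    ("dogs", "cats"): 6,
}

total = sum(mismatches.values())

# Share of all mismatches attributable to each pair, in percent
shares = {pair: 100 * count / total for pair, count in mismatches.items()}

for (src, dst), pct in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{src} --> {dst}: {pct:.1f}%")
```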

There were also 149 `aww` posts that were misidentified as `datascience` posts. Again, I believe that because *r/datascience* is the most niche subreddit and most likely has the most distinctive words in its post titles, more general posts can be miscategorized if they happen to contain one of those key words.

No `datascience` posts were mistaken for other subreddits in the predictions, i.e. there were **no false negatives** for `datascience` in this specific case. This further suggests that the language in *r/datascience* is specific enough to be a strong predictor of that subreddit category.

**After doing some more EDA** and going over this analysis, I wanted to see whether any other independent variables would help to distinguish more effectively between `aww` and `cats`. Both of these subreddits have a significant number of posts that include videos (`aww` especially), so the `is_video` column could have an effect on the model.
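
One possible way to try this, sketched under assumptions: append `is_video` as an extra column next to the bag-of-words counts with `scipy.sparse.hstack`. The titles and flags below are made-up toy data, and this is only one option for combining the features, not something the notebook above actually does.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for df['clean_title'] and df['is_video']
titles = ["sleepy kitten video", "good boy fetch", "fluffy bunny nap"]
is_video = np.array([True, False, True])

# Bag-of-words counts, as in the notebook (sparse matrix, one column per word)
vectorizer = CountVectorizer()
X_words = vectorizer.fit_transform(titles)

# Turn the boolean flag into a single sparse 0/1 column
video_col = csr_matrix(is_video.astype(int).reshape(-1, 1))

# Stack the word counts and the flag side by side
X_combined = hstack([X_words, video_col]).tocsr()
print(X_combined.shape)
```

The combined matrix can then be passed to `fit` in place of the word counts alone; the same stacking would have to be applied to the test split, using `vectorizer.transform` rather than `fit_transform`.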