# Subreddit Sorter - /r/Games & /r/indiegaming

## Part 2: Model and Conclusions

In [2]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer

**Read 'Games' subreddit csv**
Note: file data_r_games_all.csv includes 19000+ text entries. Use data_r_games_small.csv for only 1100+ entries and faster results.

In [4]:
reddit_games = pd.read_csv('datasets/data_r_games_small.csv')

In [5]:
reddit_games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1126 entries, 0 to 1125
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  1126 non-null   object
 1   selftext   1126 non-null   object
 2   title      1126 non-null   object
dtypes: object(3)
memory usage: 26.5+ KB


**Gather same number of posts as IndieGaming for balanced data**

In [6]:
reddit_games = reddit_games[0:1100]

**Read 'indiegaming' subreddit csv**
Note: file data_r_indiegaming_all.csv includes 13000+ text entries. Use data_r_indiegaming_small.csv for only 1100+ entries and faster results.

In [3]:
reddit_indiegaming = pd.read_csv('datasets/data_r_indiegaming_small.csv')

In [8]:
reddit_indiegaming.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1668 entries, 0 to 1667
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  1668 non-null   object
 1   selftext   1668 non-null   object
 2   title      1668 non-null   object
dtypes: object(3)
memory usage: 39.2+ KB


**Gather same number of posts as Games for balanced data**

In [9]:
reddit_indiegaming = reddit_indiegaming[0:1100]
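
The two slices above hard-code 1100 rows. A minimal sketch of the same balancing step that derives the row count from the data instead; the toy frames and the `balance` helper are hypothetical stand-ins and not part of the original notebook:

```python
import pandas as pd

def balance(df_a, df_b):
    """Truncate both frames to the length of the shorter one."""
    n = min(len(df_a), len(df_b))
    return df_a.iloc[:n], df_b.iloc[:n]

# Toy stand-ins for reddit_games and reddit_indiegaming.
games = pd.DataFrame({'subreddit': ['Games'] * 5, 'selftext': list('abcde')})
indie = pd.DataFrame({'subreddit': ['IndieGaming'] * 3, 'selftext': list('xyz')})

games_bal, indie_bal = balance(games, indie)
print(len(games_bal), len(indie_bal))
```

Applied to the two small CSV files, this would keep as many rows as the shorter frame holds rather than the fixed 1100 used in this notebook.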

**Combine DataFrames and rename subreddits to 0 (Games) and 1 (indiegaming)**

In [10]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent.
reddit = pd.concat([reddit_games, reddit_indiegaming], ignore_index=True)

In [11]:
reddit

Unnamed: 0,subreddit,selftext,title
0,Games,"\n\nIf you clicked this post, it means youre ...",I need someone to game with (Minecraft)
1,Games,Game Information\n--------------------\n\n**Ga...,Knockout City - Review Thread
2,Games,"Like, would capcom be open for the fandom to h...",Are game developers/publishers open for more c...
3,Games,"The title pretty much sums it up, i'm looking ...",Cat vs Mouse Style Xbox One compatible games
4,Games,Instead of watching somebody's head screaming ...,"How would you feel about video reviews, that d..."
...,...,...,...
2195,IndieGaming,I ask as whenever I try to develop a game myse...,"To the developers out there, how do you stay m..."
2196,IndieGaming,**Block Boss**\n\nHello! Dev here from Ragemod...,[SP - Game] Block Boss - a puzzle game for iOS...
2197,IndieGaming,"Hello guys, I want to spend some money buying ...",What are the best peripherals (mouse/keyboard/...
2198,IndieGaming,You guys going to China Joy? Is there any ind...,How many of you going to China Joy?


In [12]:
reddit['subreddit'] = reddit['subreddit'].map({'Games': 0, 'IndieGaming': 1})
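
`Series.map` with a dictionary turns any value that is not a key into `NaN`, so a label spelled differently from the keys (for example lowercase `indiegaming`) would silently drop out of the target. A small sanity check on toy labels, assuming nothing about the real CSV contents:

```python
import pandas as pd

# Toy labels; 'indiegaming' (lowercase) is deliberately not a key in the mapping.
labels = pd.Series(['Games', 'IndieGaming', 'indiegaming'])
mapped = labels.map({'Games': 0, 'IndieGaming': 1})

# Unmatched values become NaN, so count them before training.
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)
```

On the notebook's data the same check would be `reddit['subreddit'].isna().sum()`, which should be 0 after the mapping.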

## Define X and y and Train

In [18]:
X = reddit['selftext']
y = reddit['subreddit']

In [19]:
y.value_counts()

1    1100
0    1100
Name: subreddit, dtype: int64

In [20]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

**Split Data**

In [21]:
# Split the data into the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.20,
                                                    stratify=y,
                                                    random_state=42)
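
The `stratify=y` argument keeps the class proportions the same in the training and testing sets. A toy illustration with 20 made-up posts (10 per class), using the same `test_size` and `random_state` as the cell above:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-in for the notebook's data: 20 posts, 10 per class.
X_toy = [f"post {i}" for i in range(20)]
y_toy = [0] * 10 + [1] * 10

Xtr, Xte, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.20, stratify=y_toy, random_state=42
)

# Both splits keep the 50/50 class balance of the full toy set.
train_counts = Counter(ytr)
test_counts = Counter(yte)
print(train_counts, test_counts)
```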

## Analyze Subreddit Score using LogisticRegression

**Prepare Pipeline with Logistic Regression**

In [17]:
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression())
])

In [18]:
pipe_params = {
    'cvec__max_features' : [1500, 2000],
    'cvec__min_df' : [0],
    'cvec__max_df' : [0.8, 0.9],
    'cvec__stop_words' : ['english'],
    'cvec__ngram_range' : [(1, 1), (1, 2), (1, 3)],
    'lr__solver' : ['liblinear'],
    'lr__penalty' : ['l1', 'l2'],
    'lr__C' : [0.001, 0.1, 1],
    'lr__random_state' : [42]
}

In [19]:
# Instantiate GridSearchCV.

gs_lr = GridSearchCV(pipe, # what object are we optimizing?
                  param_grid=pipe_params, # what parameters values are we searching?
                  cv = 5) # 5-fold cross-validation.
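
The cost of a grid search grows with the product of the option counts. A rough count for the logistic-regression grid above, assuming GridSearchCV's default `refit=True` (one extra fit of the best candidate on the full training set):

```python
from math import prod

# Same grid as the logistic-regression cell above.
pipe_params = {
    'cvec__max_features': [1500, 2000],
    'cvec__min_df': [0],
    'cvec__max_df': [0.8, 0.9],
    'cvec__stop_words': ['english'],
    'cvec__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'lr__solver': ['liblinear'],
    'lr__penalty': ['l1', 'l2'],
    'lr__C': [0.001, 0.1, 1],
    'lr__random_state': [42],
}

n_candidates = prod(len(options) for options in pipe_params.values())
n_folds = 5
n_fits = n_candidates * n_folds + 1  # +1 for the final refit

print(n_candidates, n_fits)
```

The same arithmetic applies to the KNN, RandomForest, and SVM grids later in the notebook.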

**Fit Training Data with GridSearchCV**

In [20]:
# Fit GridSearch to training data.
gs_lr.fit(X_train,y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec', CountVectorizer()),
                                       ('lr', LogisticRegression())]),
             param_grid={'cvec__max_df': [0.8, 0.9],
                         'cvec__max_features': [1500, 2000],
                         'cvec__min_df': [0],
                         'cvec__ngram_range': [(1, 1), (1, 2), (1, 3)],
                         'cvec__stop_words': ['english'],
                         'lr__C': [0.001, 0.1, 1], 'lr__penalty': ['l1', 'l2'],
                         'lr__random_state': [42],
                         'lr__solver': ['liblinear']})

In [21]:
# What's the best score?
print(gs_lr.best_score_)

0.7927403846153845


In [22]:
# Save best model as gs_lr_model.
gs_lr_model = gs_lr.best_estimator_

In [23]:
# Score model on training set.
gs_lr_model.score(X_train, y_train)

0.8382211538461538

In [24]:
# Score model on testing set.
gs_lr_model.score(X_test, y_test)

0.7882692307692307

In [25]:
gs_lr.best_params_

{'cvec__max_df': 0.8,
 'cvec__max_features': 2000,
 'cvec__min_df': 0,
 'cvec__ngram_range': (1, 1),
 'cvec__stop_words': 'english',
 'lr__C': 1,
 'lr__penalty': 'l2',
 'lr__random_state': 42,
 'lr__solver': 'liblinear'}

## Analyze Subreddit Score using KNN


**Prepare Pipeline with KNN**

In [26]:
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('knn', KNeighborsClassifier())
])

In [27]:
pipe_params = {
    'cvec__max_features' : [1000, 1500, 2000],
    'cvec__min_df' : [0],
    'cvec__max_df' : [0.9],
    'cvec__stop_words' : ['english'],
    'cvec__ngram_range' : [(1, 1), (1, 2), (1, 3)],
    'knn__n_neighbors' : [5, 10, 15, 20]
}



In [30]:
# Instantiate GridSearchCV.

gs_knn = GridSearchCV(pipe, # what object are we optimizing?
                  param_grid=pipe_params, # what parameters values are we searching?
                  cv = 5) # 5-fold cross-validation.

**Fit Training Data with GridSearchCV**

In [31]:
# Fit GridSearch to training data.
gs_knn.fit(X_train,y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec', CountVectorizer()),
                                       ('knn', KNeighborsClassifier())]),
             param_grid={'cvec__max_df': [0.9],
                         'cvec__max_features': [1000, 1500, 2000],
                         'cvec__min_df': [0],
                         'cvec__ngram_range': [(1, 1), (1, 2), (1, 3)],
                         'cvec__stop_words': ['english'],
                         'knn__n_neighbors': [5, 10, 15, 20]})

In [32]:
# What's the best score?
print(gs_knn.best_score_)

0.7132692307692309


In [33]:
gs_knn.best_params_

{'cvec__max_df': 0.9,
 'cvec__max_features': 1000,
 'cvec__min_df': 0,
 'cvec__ngram_range': (1, 3),
 'cvec__stop_words': 'english',
 'knn__n_neighbors': 5}

In [35]:
# Save best model as gs_knn_model.
gs_knn_model = gs_knn.best_estimator_

In [36]:
# Score model on training set.
gs_knn_model.score(X_train, y_train)

0.8001923076923076

In [37]:
# Score model on testing set.
gs_knn_model.score(X_test, y_test)

0.7055769230769231

## Analyze Subreddit Score using RandomForest


**Prepare Pipeline with RandomForest**

In [38]:
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('rf', RandomForestClassifier())
])

In [39]:
pipe_params = {
    'cvec__max_features' : [1000, 1500],
    'cvec__min_df' : [0],
    'cvec__max_df' : [0.9],
    'cvec__stop_words' : ['english'],
    'cvec__ngram_range' : [(1, 1), (1, 2)],
    'rf__n_estimators' : [100, 150],
    'rf__min_samples_split' : [2, 3, 4],
    'rf__max_depth' : [None, 2, 3, 4, 5]
}



In [40]:
# Instantiate GridSearchCV.

gs_rf = GridSearchCV(pipe, # what object are we optimizing?
                  param_grid=pipe_params, # what parameters values are we searching?
                  cv = 5) # 5-fold cross-validation.

**Fit Training Data with GridSearchCV**

In [41]:
# Fit GridSearch to training data.
gs_rf.fit(X_train,y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec', CountVectorizer()),
                                       ('rf', RandomForestClassifier())]),
             param_grid={'cvec__max_df': [0.9],
                         'cvec__max_features': [1000, 1500],
                         'cvec__min_df': [0],
                         'cvec__ngram_range': [(1, 1), (1, 2)],
                         'cvec__stop_words': ['english'],
                         'rf__max_depth': [None, 2, 3, 4, 5],
                         'rf__min_samples_split': [2, 3, 4],
                         'rf__n_estimators': [100, 150]})

In [42]:
# What's the best score?
print(gs_rf.best_score_)

0.8028846153846153


In [43]:
gs_rf.best_params_

{'cvec__max_df': 0.9,
 'cvec__max_features': 1500,
 'cvec__min_df': 0,
 'cvec__ngram_range': (1, 1),
 'cvec__stop_words': 'english',
 'rf__max_depth': None,
 'rf__min_samples_split': 4,
 'rf__n_estimators': 150}

In [44]:
# Save best model as gs_rf_model.
gs_rf_model = gs_rf.best_estimator_

In [45]:
# Score model on training set.
gs_rf_model.score(X_train, y_train)

0.963076923076923

In [46]:
# Score model on testing set.
gs_rf_model.score(X_test, y_test)

0.8061538461538461

## Analyze Subreddit Score using SVM


**Prepare Pipeline with SVM**

In [47]:
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('svc', SVC())
])

In [48]:
pipe_params = {
    'cvec__max_features' : [750, 1000, 1500],
    'cvec__min_df' : [0],
    'cvec__max_df' : [0.9],
    'cvec__stop_words' : ['english'],
    'cvec__ngram_range' : [(1, 1), (1, 2)],
    'svc__C' : [0.01, 0.1, 1, 10, 100]
}



In [49]:
# Instantiate GridSearchCV.

gs_svm = GridSearchCV(pipe, # what object are we optimizing?
                  param_grid=pipe_params, # what parameters values are we searching?
                  cv = 5) # 5-fold cross-validation.

**Fit Training Data with GridSearchCV**

In [50]:
# Fit GridSearch to training data.
gs_svm.fit(X_train,y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec', CountVectorizer()),
                                       ('svc', SVC())]),
             param_grid={'cvec__max_df': [0.9],
                         'cvec__max_features': [750, 1000, 1500],
                         'cvec__min_df': [0],
                         'cvec__ngram_range': [(1, 1), (1, 2)],
                         'cvec__stop_words': ['english'],
                         'svc__C': [0.01, 0.1, 1, 10, 100]})

In [51]:
# Check best params
gs_svm.best_params_

{'cvec__max_df': 0.9,
 'cvec__max_features': 1500,
 'cvec__min_df': 0,
 'cvec__ngram_range': (1, 1),
 'cvec__stop_words': 'english',
 'svc__C': 100}

In [52]:
# Save best model as gs_svm_model.
gs_svm_model = gs_svm.best_estimator_

In [53]:
# Score model on training set.
gs_svm_model.score(X_train, y_train)

0.9179326923076923

In [54]:
# Score model on testing set.
gs_svm_model.score(X_test, y_test)

0.7946153846153846

**Conclusions**


Comparing the accuracy scores of the four models indicates that every model overfits (high variance, low bias): each one scores higher on the training set than on the test set. Increasing the number of Reddit text entries in the dataset from 1126 to 13000 improved neither the overfitting nor the performance. The dataset with 1126 texts is called the smaller dataset, and the dataset with 13000 texts is called the larger dataset.

RandomForestClassifier gave the best accuracy on the training data, with a score of 91.7% on the larger dataset and 99.7% on the smaller dataset.

SVM also scored over 90% for both dataset sizes but overfit more than RandomForestClassifier.

Although logistic regression did not give the highest training score, it showed the smallest gap between training and test scores on the larger dataset: about 5 percentage points, between a training score of 83.8% and a test score of 78.8%.

KNN performed worse overall on accuracy and overfitting.
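
The train/test gap can be tabulated directly. The sketch below uses only the scores printed in the cells above, rounded to four decimal places; the dictionary layout is an illustrative choice, not part of the original notebook:

```python
# (train accuracy, test accuracy) as printed in the cells above, rounded.
scores = {
    'LogisticRegression': (0.8382, 0.7883),
    'KNN': (0.8002, 0.7056),
    'RandomForest': (0.9631, 0.8062),
    'SVM': (0.9179, 0.7946),
}

# Gap between training and test accuracy; a larger gap suggests more overfitting.
gaps = {name: round(train - test, 4) for name, (train, test) in scores.items()}

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f'{name}: {gap:.4f}')

smallest = min(gaps, key=gaps.get)
```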

**Recommendations**

Compared with the baseline accuracy of 50%, every model performed well above it. RandomForest and SVM gave better scores for sorting posts between the Games and IndieGaming subreddits, using a CountVectorizer to tokenize the text.
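
The 50% baseline follows from the balanced classes: always predicting one subreddit is right for exactly half of the posts. A short check using the class counts reported by `y.value_counts()` earlier in the notebook:

```python
from collections import Counter

# Class counts from y.value_counts(): 1100 posts per subreddit.
y_all = [0] * 1100 + [1] * 1100

counts = Counter(y_all)
# Accuracy of always predicting the most common class.
baseline = max(counts.values()) / len(y_all)
print(baseline)
```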

Future model improvements could include stemming the text and removing digits from each text input.
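
As a rough sketch of the digit-removal idea, the hypothetical helper below lowercases a post, drops runs of digits, and collapses the leftover whitespace. It is an illustration under those assumptions, not something the notebook ran:

```python
import re

_DIGITS = re.compile(r'\d+')
_SPACES = re.compile(r'\s+')

def strip_digits(text):
    """Lowercase the text, drop digit runs, and collapse leftover whitespace."""
    text = _DIGITS.sub(' ', text.lower())
    return _SPACES.sub(' ', text).strip()

cleaned = strip_digits('Top 10 games of 2021')
print(cleaned)
```

A function like this could be passed to the vectorizer as `CountVectorizer(preprocessor=strip_digits)`. Supplying a custom preprocessor replaces the vectorizer's built-in lowercasing step, which is why the sketch lowercases explicitly. Stemming would need an additional library and is not shown here.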