# Project 3: Subreddit Classification
---
Project notebook organisation:<br>
[1 - Webscraping and data acquisition](./1_webscraping_and_data_acquisition.ipynb)<br>
[2 - Preprocessing of data](./2_preprocessing.ipynb)<br>
[3 - Exploratory data analysis](./3_eda.ipynb)<br>
**4 - Model Tuning and Insights** (current notebook)<br>
<br>
<br>

In [522]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import string
import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.feature_selection import SelectPercentile, mutual_info_classif, chi2
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, roc_auc_score, accuracy_score

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
import regex as re

sns.set_style('ticks')
pd.set_option('display.max_columns', None)

%matplotlib inline


## Introduction
---

In this notebook I will be using the cleaned data from the previous notebook to train and evaluate a model that classify new posts to either r/Androidquestions or r/iphonehelp. I tried different combinations of predictors as well as feature selection techniques for multiple different models - Support Vector Machines, Binomial Naive Bayes, Multinomial Naive Bayes and Logistic Regression. The models were evaluated based on their accuracy scores on unseen validation data, before the best performing model was used to score the test data.


### Contents
1. [Data preparation](#Data-preparation)
2. [Modelling approach](#Modelling-approach)
3. [Preprocessing](#Preprocessing)
4. [Production model selection](#Production-model-selection)
5. [Conclusion and recommendations](#Conclusion-and-recommendations)

## Data preparation

### Import data

The dataframe contains 1982 rows and 12 columns.

In [195]:
df = pd.read_csv('./data/train.csv', index_col = 0)

In [196]:
df.shape

(1982, 12)

In [197]:
df.head()

Unnamed: 0,title,id,date_created,text,score,upvote_ratio,comment_count,comment_all,comment_len,title_len,text_len,subreddit
0,Apps wont load using WIFI,hcerke,2020-06-20 03:16:44,"Hello, my apps are not loading content when i ...",1,1.0,0.0,notexthere,0.0,5,27,1
1,Is there any way to select photos from differe...,hcekrf,2020-06-20 03:03:21,I'm using a Samsung Note 8 and just the stock ...,1,1.0,0.0,notexthere,0.0,14,60,1
2,Chat Features No Longer Working,hcefkx,2020-06-20 02:52:45,So I use a Samsung Galaxy S10+ on Verizon and ...,1,1.0,1.0,"Try the following, turn off WiFi and turn on m...",40.0,5,75,1
3,An app updated itself in a weird way,hcefi8,2020-06-20 02:52:36,"I was using the Twitch app on android, and som...",1,1.0,0.0,notexthere,0.0,8,90,1
4,[Google Photos] Is it possible to remove pictu...,hcdh2b,2020-06-20 01:43:45,I joined a public album that is just a big mix...,1,1.0,0.0,notexthere,0.0,17,77,1


In [198]:
df.isnull().sum()

title            0
id               0
date_created     0
text             6
score            0
upvote_ratio     0
comment_count    0
comment_all      2
comment_len      0
title_len        0
text_len         0
subreddit        0
dtype: int64

In [199]:
# As in the previous notebook, despite having already run a pre-processing function and having checked for null values
# in the previous notebook, a number of null values still appear in the `text` and `comment_all` fields. This function
# rectifies that.


def nullfiller(df):   
   
    # fill in null values for posts with no comments
    df['comment_all'].fillna('notexthere', inplace = True)
    df['text'].fillna('notexthere', inplace = True)
    
    return df

In [200]:
nullfiller(df)

Unnamed: 0,title,id,date_created,text,score,upvote_ratio,comment_count,comment_all,comment_len,title_len,text_len,subreddit
0,Apps wont load using WIFI,hcerke,2020-06-20 03:16:44,"Hello, my apps are not loading content when i ...",1,1.0,0.0,notexthere,0.0,5,27,1
1,Is there any way to select photos from differe...,hcekrf,2020-06-20 03:03:21,I'm using a Samsung Note 8 and just the stock ...,1,1.0,0.0,notexthere,0.0,14,60,1
2,Chat Features No Longer Working,hcefkx,2020-06-20 02:52:45,So I use a Samsung Galaxy S10+ on Verizon and ...,1,1.0,1.0,"Try the following, turn off WiFi and turn on m...",40.0,5,75,1
3,An app updated itself in a weird way,hcefi8,2020-06-20 02:52:36,"I was using the Twitch app on android, and som...",1,1.0,0.0,notexthere,0.0,8,90,1
4,[Google Photos] Is it possible to remove pictu...,hcdh2b,2020-06-20 01:43:45,I joined a public album that is just a big mix...,1,1.0,0.0,notexthere,0.0,17,77,1
...,...,...,...,...,...,...,...,...,...,...,...,...
987,Logged in & out of iCloud and now passwords ar...,g8e88l,2020-04-26 13:19:55,notexthere,0,0.5,0.0,notexthere,0.0,18,1,0
988,Need help transferring photos and videos from ...,g8e0ke,2020-04-26 13:05:16,As the title explains. I'v tried using the win...,1,1.0,4.0,reinstall the apple drivers\n\n Settings > Ph...,94.0,12,96,0
989,"Can get past ""Welcome to Mail"" in Mail iOS 13.4.1",g8cbyw,2020-04-26 10:47:19,notexthere,4,1.0,1.0,"screw that app, someone can literally hack you...",33.0,10,1,0
990,Can you replace an iPhone 5S and 5SE battery?,g8bspe,2020-04-26 09:58:35,My 5S and 5SE are draining battery quicker tha...,2,1.0,3.0,"Note, it’s the SE, not 5SE ;-). Of course you ...",37.0,9,29,0


In [314]:
df.isnull().sum()

title               0
id                  0
date_created        0
text                0
score               0
upvote_ratio        0
comment_count       0
comment_all         0
comment_len         0
title_len           0
text_len            0
subreddit           0
tok_title           0
tok_text            0
tok_comments        0
lem_tok_title       0
lem_tok_text        0
lem_tok_comments    0
cleaned_title       0
cleaned_text        0
cleaned_comments    0
dtype: int64

## Modelling approach

This section explains the rationale behind the preprocessing, production model selection and hyperparameter optimisation of the selected production model.

There are 11 columns in the original dataframe (not counting the `subreddit` column, which contains the subreddit ID). Of these, the following contains text:
- `title`
- `text`
- `comment_all`
- `id`

And the following contains numeric data:
- `date_created`
- `score`
- `upvote_ratio`
- `comment_count`
- `comment_len`
- `title_len`
- `text_len`

Among these, the columns most likely to be relevant in answering the problem statement (whether or not a new post can be correctly classified into r/Androidquestions or r/iphonehelp based on its content) are:
- `title`
- `text`
- `comment_all`

**Preprocessing**

A workflow of 'tokenization - lemmatization - join text - clean punctation' is used in the preprocessing of data before model selection could begin. These steps were applied to the text strings `title`, `text` and `comment_all` in order to reduce the noise caused by capitalisations, past tense and punctuations. The result of this wokflow is then assigned new columns named `cleaned_title`, `cleaned_text` and `cleaned_comments` which will be used for modelling.

In addition, stop words were defined using the default stopwords provided by `NLTK`. These were merged with the stop words identified in the previous notebook that intersected content from both posts. This complete list of stopwords is then used as a stopword corpus for vectorisation of the content.

A baseline score is also extablished using the normalised counts of the the number of subreddit posts.

**Production model selection**

A staged approach is taken for model selection, firstly by identifying the ideal vectorisers (CountVectorizer or TfidVectorizer) and by identifying the most effective model (Support Vector Machines, Logistic Regression, Binomial Naive Bayes or Multinomial Naive Bayes) in the classification of data. I use accuracy as a metric for which combination is the most effective an good description of the effectiveness of the model as it describes exactly how many posts were identified correctly based on the entire corpus of words (number if True Positives and Negatives over the total sample). The most effective combination of vectoriser and classification model is selected using `GridSearchCV` without hyperparameter tuning and applied to the content located within `cleaned_title`, `cleaned_text` and `cleaned_comments`.

Upon selection of the vectorizer and classification model, a second grid search is conducted this time to identify the optimised hyperparameters that can enhance the model's accuracy. Once again, this far more extensive grid search is applied to `cleaned_title`, `cleaned_text` and `cleaned_comments` in order to compare the effectiveness of using the individual content as a predictor for posts. 

Lastly, further feature engineering is then applied to the model, this time by combining all the content located in `cleaned_title`, `cleaned_text` and `cleaned_comments` into a single dataframe and fed into the optimised model in atempting to further improve the accuracy.

This final production model is then tested on the test set.

## Preprocessing

### Tokenization

In [521]:

df['tok_title'] = df.apply(lambda row: nltk.word_tokenize(row['title'].lower()), axis=1)
df['tok_text'] = df.apply(lambda row: nltk.word_tokenize(row['text'].lower()), axis=1)
df['tok_comments'] = df.apply(lambda row: nltk.word_tokenize(row['comment_all'].lower()), axis=1)


### Lemmatize text

In [316]:
# Instantiate lemmatizer
lemmatizer = WordNetLemmatizer()

df['lem_tok_title'] = df['tok_title'].apply(lambda row: [lemmatizer.lemmatize(item) for item in row])
df['lem_tok_text'] = df['tok_text'].apply(lambda row: [lemmatizer.lemmatize(item) for item in row])
df['lem_tok_comments'] = df['tok_comments'].apply(lambda row: [lemmatizer.lemmatize(item) for item in row])

### Join Text

In [317]:
df["cleaned_title"]= df["lem_tok_title"].str.join(" ")
df["cleaned_text"]= df["lem_tok_text"].str.join(" ")
df["cleaned_comments"]= df["lem_tok_text"].str.join(" ")

### Remove punctuations

In [323]:
df.cleaned_title = df.cleaned_title.apply(lambda x: x.translate(str.maketrans('','',string.punctuation)))
df.cleaned_text = df.cleaned_text.apply(lambda x: x.translate(str.maketrans('','',string.punctuation)))
df.cleaned_comments = df.cleaned_comments.apply(lambda x: x.translate(str.maketrans('','',string.punctuation)))

### Set stopwords

In [171]:
# Use the default NLTK stopword list
stop_words = set(stopwords.words('english'))  

# add additional stopwords
additional_stopwords = {'don know', 'make sure', 'don think', 'sim card'\
                       , 'factory reset', 'new phone', 've tried', 'don want',\
                        'thanks advance', 'sim card', 'does know', 'doesn work', 'recovery mode',\
                         'days ago', 'lock screen', 'months ago', 'power button', 'old phone', \
                       'need help', 'factory reset', 'lock screen'}
stop_words = stop_words.union(additional_stopwords)

In [298]:
df['cleaned_comments']

0      hello , my apps are not loading content when i...
1      i 'm using a samsung note 8 and just the stock...
2      so i use a samsung galaxy s10+ on verizon and ...
3      i wa using the twitch app on android , and som...
4      i joined a public album that is just a big mix...
                             ...                        
987                                           notexthere
988    a the title explains . i ' v tried using the w...
989                                           notexthere
990    my 5 and 5se are draining battery quicker than...
991    i dropped my phone with a case on . no physica...
Name: cleaned_comments, Length: 1982, dtype: object

### Train-test-split

In [331]:
X = df.drop('subreddit', axis=1)
y = df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, 
                                              y, 
                                              test_size = 0.3, 
                                              random_state = 42, 
                                              stratify = y)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(1387, 20)
(1387,)
(595, 20)
(595,)


### Baseline score

As this is a binary classification problem, we can take the normalised value counts of the binary target feature in as the baseline score. The score of 50.1% is taken upon the assumption that this is the best score that we will get should our predictions be all '0'.

In [130]:
y.value_counts(normalize=True)

0    0.500505
1    0.499495
Name: subreddit, dtype: float64

### Customise stop words

In [131]:
# Use the default NLTK stopword list
stop_words = set(stopwords.words('english'))  

# add additional stopwords
additional_stopwords = {'don', 'know', 'make', 'sure', 'think', 'sim', 'card'\
                       , 'factory', 'reset', 'new', 'phone', 've' 'tried', 'want',\
                        'thanks', 'advance', 'does', 'know', 'doesn', 'work', 'recovery', 'mode',\
                         'days', 'ago', 'lock' ,'screen', 'months', 'ago', 'power', 'button', 'old' ,'phone', \
                       'need', 'help', 'factory', 'reset', 'lock', 'screen'}
stop_words = stop_words.union(additional_stopwords)

## Production model selection


|                         	| `title`         	| `text`          	| `comments`      	|
|:-----------------------:	|-----------------	|-----------------	|-----------------	|
| Best Classifier         	| MultinomialNB   	| MultinomialNB   	| MultinomialNB   	|
| Score on training set   	| 0.993           	| 0.970           	| 0.97            	|
| Score on validation set 	| 0.758            	| 0.835           	| 0.835           	|
| Best vectorizer         	| CountVectorizer 	| CountVectorizer 	| CountVectorizer 	|

The above table summarises the accuracy scores on the training set and the validation set for the three sets of data. Two conclusions can be drawn from this initial model selection:
1. Of the three kinds of content, `text` data and `comments` make the best predictors if a single set of content is available. A likely explanation for this is the sheer length of the corpus of words available in the `text` abd `comments` as compared to the `title`.
2. The most ideal combination of classifier and vectorizer is a CountVectorizer paired with a Multinomial Naive Bayes model. This will be the combination that will be used for further hyperparameter optimisation.



### Selecting a vectorizer and classifier for `title` content

In [393]:
%%time

# Set a pipeline to first select the individual entries, vectorize using Tfid and CountVectorize
#and finally run classification models, 
pipeline = Pipeline([
    ('selector', FunctionTransformer(lambda x:x['cleaned_title'])),
    ('vectorizer', None),
    ('classifier', None)
])

param_grid = [{
    # Cycle through 4 different classification models. 
    'classifier': [SVC(), LogisticRegression(), BernoulliNB(), MultinomialNB()],
    # Try two different vectorizers
    'vectorizer': [CountVectorizer(stop_words = stop_words, ngram_range = (1,3)),
                  TfidfVectorizer(stop_words = stop_words, ngram_range = (1,3))],
    # feature selection by max df
    'vectorizer__max_df': [1, 0.05, 0.1]
}]

# use kfold for cv to allow shuffling
kf = KFold(n_splits = 3, shuffle = True, random_state=42)

# Gridsearch to find best models and vectorisers
gscv_title = GridSearchCV(pipeline, cv=kf, param_grid = param_grid, scoring ='accuracy', iid=False, verbose = True, n_jobs=-1)
gscv_title.fit(X_train, y_train)
y_pred_title = gs_title.predict(X_test)

# Scoring
print("training set accuracy :", gscv_title.score(X_train, y_train))
print("Validation set accuracy:", gscv_title.score(X_test, y_test))
print("Best Model :", gscv_title.best_params_)

Fitting 3 folds for each of 24 candidates, totalling 72 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    2.6s


training set accuracy : 0.992790194664744
Validation set accuracy: 0.7579831932773109
Best Model : {'classifier': MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True), 'vectorizer': CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=0.1, max_features=None, min_df=1,
                ngram_range=(1, 3), preprocessor=None,
                stop_words={'a', 'about', 'above', 'after', 'again', 'against',
                            'ain', 'all', 'am', 'an', 'and', 'any', 'are',
                            'aren', "aren't", 'as', 'at', 'be', 'because',
                            'been', 'before', 'being', 'below', 'between',
                            'both', 'but', 'by', 'can', 'couldn', "couldn't", ...},
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None), 'vectorizer__max_df': 0.

[Parallel(n_jobs=-1)]: Done  72 out of  72 | elapsed:    7.9s finished


In [394]:
# print confusion matrix
cmatrix = confusion_matrix(y_test, y_pred_title)
print("Confusion matrix (title content as predictor):")
pd.DataFrame(cmatrix, 
             index = ['actual r/Androidquestions','actual r/iphonehelp'],
             columns = ['predicted r/Androidquestions', 'predicted r/iphonehelp'])


Confusion matrix (title content as predictor):


Unnamed: 0,predicted r/Androidquestions,predicted r/iphonehelp
actual r/Androidquestions,268,30
actual r/iphonehelp,67,230


### Selecting a vectorizer and classifier for on `text` content

In [395]:
%%time

pipeline = Pipeline([
    ('selector', FunctionTransformer(lambda x:x['cleaned_text'])),
    ('vectorizer', None),
    ('classifier', None)
])

param_grid = [{
    'classifier': [SVC(), LogisticRegression(), BernoulliNB(), MultinomialNB()],
    'vectorizer': [CountVectorizer(stop_words = stop_words, ngram_range = (1,3)),
                       TfidfVectorizer(stop_words = stop_words, ngram_range = (1,3))],
    'vectorizer__max_df': [1, 0.05, 0.1]
}]

kf = KFold(n_splits = 3, shuffle = True, random_state=42)

gscv_text = GridSearchCV(pipeline, cv=kf, param_grid = param_grid, scoring ='accuracy', iid=False, verbose = True, n_jobs=-1)
gscv_text.fit(X_train, y_train)
y_pred_text = gscv_text.predict(X_test)

print("training set accuracy :", gscv_text.score(X_train, y_train))
print("Validation set accuracy:", gscv_text.score(X_test, y_test))
print("Best Model :", gscv_text.best_params_)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.


Fitting 3 folds for each of 24 candidates, totalling 72 fits


[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    4.9s
[Parallel(n_jobs=-1)]: Done  72 out of  72 | elapsed:   11.2s finished


training set accuracy : 0.969718817591925
Validation set accuracy: 0.8352941176470589
Best Model : {'classifier': MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True), 'vectorizer': CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=0.1, max_features=None, min_df=1,
                ngram_range=(1, 3), preprocessor=None,
                stop_words={'a', 'about', 'above', 'after', 'again', 'against',
                            'ain', 'all', 'am', 'an', 'and', 'any', 'are',
                            'aren', "aren't", 'as', 'at', 'be', 'because',
                            'been', 'before', 'being', 'below', 'between',
                            'both', 'but', 'by', 'can', 'couldn', "couldn't", ...},
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None), 'vectorizer__max_df': 0.

In [396]:
# print confusion matrix
cmatrix = confusion_matrix(y_test, y_pred_text)
print("Confusion matrix (title content as predictor):")
pd.DataFrame(cmatrix, 
             index = ['actual r/Androidquestions','actual r/iphonehelp'],
             columns = ['predicted r/Androidquestions', 'predicted r/iphonehelp'])


Confusion matrix (title content as predictor):


Unnamed: 0,predicted r/Androidquestions,predicted r/iphonehelp
actual r/Androidquestions,268,30
actual r/iphonehelp,68,229


### Selecting a vectorizer and classifier for `comments` content

In [411]:
%%time

pipeline = Pipeline([
    ('selector', FunctionTransformer(lambda x:x['cleaned_comments'])),
    ('vectorizer', None),
    ('classifier', None)
])

param_grid = [{
    'classifier': [SVC(), LogisticRegression(), BernoulliNB(), MultinomialNB()],
    'vectorizer': [CountVectorizer(stop_words = stop_words, ngram_range = (1,4)),
                       TfidfVectorizer(stop_words = stop_words, ngram_range = (1,4))],
    'vectorizer__max_df': [1, 0.05, 0.1]
}]

kf = KFold(n_splits = 3, shuffle = True, random_state=42)

gscv_comments = GridSearchCV(pipeline, cv=kf, param_grid = param_grid, scoring ='accuracy', iid=False, verbose = True, n_jobs=-1)
gscv_comments.fit(X_train, y_train)
y_pred_comments = gscv_comments.predict(X_test)

print("training set accuracy :", gscv_comments.score(X_train, y_train))
print("Validation set accuracy:", gscv_comments.score(X_test, y_test))
print("Best Model :", gscv_comments.best_params_)

Fitting 3 folds for each of 24 candidates, totalling 72 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    5.7s
[Parallel(n_jobs=-1)]: Done  72 out of  72 | elapsed:   12.5s finished


training set accuracy : 0.969718817591925
Validation set accuracy: 0.8352941176470589
Best Model : {'classifier': MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True), 'vectorizer': CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=0.1, max_features=None, min_df=1,
                ngram_range=(1, 4), preprocessor=None,
                stop_words={'a', 'about', 'above', 'after', 'again', 'against',
                            'ain', 'all', 'am', 'an', 'and', 'any', 'are',
                            'aren', "aren't", 'as', 'at', 'be', 'because',
                            'been', 'before', 'being', 'below', 'between',
                            'both', 'but', 'by', 'can', 'couldn', "couldn't", ...},
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None), 'vectorizer__max_df': 0.

In [405]:
# print confusion matrix
cmatrix = confusion_matrix(y_test, y_pred_comments)
print("Confusion matrix (title content as predictor):")
pd.DataFrame(cmatrix, 
             index = ['actual r/Androidquestions','actual r/iphonehelp'],
             columns = ['predicted r/Androidquestions', 'predicted r/iphonehelp'])


Confusion matrix (title content as predictor):


Unnamed: 0,predicted r/Androidquestions,predicted r/iphonehelp
actual r/Androidquestions,268,30
actual r/iphonehelp,68,229


## Hyperparameter tuning of vectorizer and model

|            	| Accuracy before  hyperparameter  optimisation 	| Accuracy after  hyperparameter  optimisation 	|
|:----------:	|-----------------------------------------------	|----------------------------------------------	|
| `title`    	| 0.758                                         	| 0.763                                        	|
| `text`     	| 0.835                                         	| 0.839                                        	|
| `comments` 	| 0.835                                         	| 0.839                                        	|

A second more extensive (and much longer) grid search for hyperparameter optimisation provides a very modest increase in model accuracy on the validation set (y_test) of data. Further feature engineering will be required in order to improve the accuracy scores of the model.

### GridSearch for optimal parameter on `title` content

In [433]:
%%time

pipeline = Pipeline([
    ('selector', FunctionTransformer(lambda x:x['cleaned_title'])),
    ('vectorizer', CountVectorizer(stop_words = stop_words)),
    ('classifier', MultinomialNB())
])
    
param_grid = [{
    'vectorizer__ngram_range' : [(1,1), (1,2), (1,3), (1,4)],
    'vectorizer__max_df': [1, 0.05, 0.1],
    'vectorizer__min_df': [1, 0.05, 0.1],
    'vectorizer__binary': [True, False],
    'classifier__alpha': np.linspace(0, 100, num=100),
   # 'classifier__binarize' : np.linspace(0, 100, num=2),
   # 'clasifier__fit_prior' : [True,False]
}]
    
kf = KFold(n_splits = 3, shuffle = True, random_state=42)

gscv_title = GridSearchCV(pipeline, cv=kf, param_grid = param_grid, scoring ='accuracy', iid=False, verbose = True, n_jobs=-1)
gscv_title.fit(X_train, y_train)
y_pred_title = gscv_title.predict(X_test)

print("training set accuracy :", gscv_title.score(X_train, y_train))
print("Validation set accuracy:", gscv_title.score(X_test, y_test))
print("Best Model :", gscv_title.best_params_)

Fitting 3 folds for each of 7200 candidates, totalling 21600 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    2.8s
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed:   21.3s
[Parallel(n_jobs=-1)]: Done 418 tasks      | elapsed:   54.2s
[Parallel(n_jobs=-1)]: Done 768 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 1218 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-1)]: Done 1768 tasks      | elapsed:  3.9min
[Parallel(n_jobs=-1)]: Done 2418 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done 3168 tasks      | elapsed:  6.9min
[Parallel(n_jobs=-1)]: Done 4018 tasks      | elapsed:  8.7min
[Parallel(n_jobs=-1)]: Done 4968 tasks      | elapsed: 10.7min
[Parallel(n_jobs=-1)]: Done 6018 tasks      | elapsed: 12.9min
[Parallel(n_jobs=-1)]: Done 7168 tasks      | elapsed: 15.5min
[Parallel(n_jobs=-1)]: Done 8418 tasks      | elapsed: 18.2min
[Parallel(n_jobs=-1)]: Done 9768 tasks      | elapsed: 21.4min
[Parallel(n_jobs=-1)]: Done 11218 tasks      

training set accuracy : 0.9920692141312184
Validation set accuracy: 0.7630252100840336
Best Model : {'classifier__alpha': 1.0101010101010102, 'vectorizer__binary': True, 'vectorizer__max_df': 0.1, 'vectorizer__min_df': 1, 'vectorizer__ngram_range': (1, 3)}
CPU times: user 34min 58s, sys: 8min 15s, total: 43min 13s
Wall time: 47min 39s


In [435]:
# print confusion matrix
cmatrix = confusion_matrix(y_test, y_pred_title)
print("Confusion matrix (title content as predictor):")
pd.DataFrame(cmatrix, 
             index = ['actual r/Androidquestions','actual r/iphonehelp'],
             columns = ['predicted r/Androidquestions', 'predicted r/iphonehelp'])


Confusion matrix (title content as predictor):


Unnamed: 0,predicted r/Androidquestions,predicted r/iphonehelp
actual r/Androidquestions,236,62
actual r/iphonehelp,79,218


### GridSearch for optimal parameter on `text` content

In [434]:
%%time

pipeline = Pipeline([
    ('selector', FunctionTransformer(lambda x:x['cleaned_text'])),
    ('vectorizer', CountVectorizer(stop_words = stop_words)),
    ('classifier', MultinomialNB())
])
    
param_grid = [{
    'vectorizer__ngram_range' : [(1,1), (1,2), (1,3), (1,4)],
    'vectorizer__max_df': [1, 0.05, 0.1, 0.01],
    'vectorizer__binary': [True, False],
    'classifier__alpha': np.linspace(0, 100, num=100),
}]
    
kf = KFold(n_splits = 3, shuffle = True, random_state=42)

gscv_text = GridSearchCV(pipeline, cv=kf, param_grid = param_grid, scoring ='accuracy', iid=False, verbose = True, n_jobs=-1)
gscv_text.fit(X_train, y_train)
y_pred_text = gscv_text.predict(X_test)

print("training set accuracy :", gscv_text.score(X_train, y_train))
print("Validation set accuracy:", gscv_text.score(X_test, y_test))
print("Best Model :", gscv_text.best_params_)

Fitting 3 folds for each of 3200 candidates, totalling 9600 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    3.6s
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed:   25.4s
[Parallel(n_jobs=-1)]: Done 418 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done 768 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 1218 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done 1768 tasks      | elapsed:  4.3min
[Parallel(n_jobs=-1)]: Done 2418 tasks      | elapsed:  5.9min
[Parallel(n_jobs=-1)]: Done 3168 tasks      | elapsed:  7.7min
[Parallel(n_jobs=-1)]: Done 4018 tasks      | elapsed:  9.8min
[Parallel(n_jobs=-1)]: Done 4968 tasks      | elapsed: 12.1min
[Parallel(n_jobs=-1)]: Done 6018 tasks      | elapsed: 14.8min
[Parallel(n_jobs=-1)]: Done 7168 tasks      | elapsed: 17.7min
[Parallel(n_jobs=-1)]: Done 8418 tasks      | elapsed: 20.9min
[Parallel(n_jobs=-1)]: Done 9600 out of 9600 | elapsed: 23.6min finished


training set accuracy : 0.9689978370583994
Validation set accuracy: 0.838655462184874
Best Model : {'classifier__alpha': 2.0202020202020203, 'vectorizer__binary': True, 'vectorizer__max_df': 0.1, 'vectorizer__ngram_range': (1, 2)}
CPU times: user 17min 44s, sys: 3min 46s, total: 21min 31s
Wall time: 23min 37s


In [None]:
# print confusion matrix
cmatrix = confusion_matrix(y_test, y_pred_text)
print("Confusion matrix (title content as predictor):")
pd.DataFrame(cmatrix, 
             index = ['actual r/Androidquestions','actual r/iphonehelp'],
             columns = ['predicted r/Androidquestions', 'predicted r/iphonehelp'])

### GridSearch for optimal parameter on `comments` content

In [436]:
%%time

pipeline = Pipeline([
    ('selector', FunctionTransformer(lambda x:x['cleaned_comments'])),
    ('vectorizer', CountVectorizer(stop_words = stop_words)),
    ('classifier', MultinomialNB())
])
    
param_grid = [{
    'vectorizer__ngram_range' : [(1,1), (1,2), (1,3), (1,4)],
    'vectorizer__max_df': [1, 0.05, 0.1, 0.01],
    'vectorizer__binary': [True, False],
    'classifier__alpha': np.linspace(0, 100, num=100),
}]
    
kf = KFold(n_splits = 3, shuffle = True, random_state=42)

gscv_comments = GridSearchCV(pipeline, cv=kf, param_grid = param_grid, scoring ='accuracy', iid=False, verbose = True, n_jobs=-1)
gscv_comments.fit(X_train, y_train)
y_pred_comments = gscv_comments.predict(X_test)

print("training set accuracy :", gscv_comments.score(X_train, y_train))
print("Validation set accuracy:", gscv_comments.score(X_test, y_test))
print("Best Model :", gscv_comments.best_params_)

Fitting 3 folds for each of 3200 candidates, totalling 9600 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    2.6s
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed:   20.3s
[Parallel(n_jobs=-1)]: Done 418 tasks      | elapsed:   51.3s
[Parallel(n_jobs=-1)]: Done 768 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 1218 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 1768 tasks      | elapsed:  3.7min
[Parallel(n_jobs=-1)]: Done 2418 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done 3168 tasks      | elapsed:  6.7min
[Parallel(n_jobs=-1)]: Done 4018 tasks      | elapsed:  8.6min
[Parallel(n_jobs=-1)]: Done 4968 tasks      | elapsed: 10.7min
[Parallel(n_jobs=-1)]: Done 6018 tasks      | elapsed: 13.1min
[Parallel(n_jobs=-1)]: Done 7168 tasks      | elapsed: 15.5min
[Parallel(n_jobs=-1)]: Done 8418 tasks      | elapsed: 18.2min
[Parallel(n_jobs=-1)]: Done 9600 out of 9600 | elapsed: 21.1min finished


training set accuracy : 0.9689978370583994
Validation set accuracy: 0.838655462184874
Best Model : {'classifier__alpha': 2.0202020202020203, 'vectorizer__binary': True, 'vectorizer__max_df': 0.1, 'vectorizer__ngram_range': (1, 2)}
CPU times: user 15min 51s, sys: 3min 17s, total: 19min 8s
Wall time: 21min 5s


In [437]:
# print confusion matrix
cmatrix = confusion_matrix(y_test, y_pred_comments)
print("Confusion matrix (title content as predictor):")
pd.DataFrame(cmatrix, 
             index = ['actual r/Androidquestions','actual r/iphonehelp'],
             columns = ['predicted r/Androidquestions', 'predicted r/iphonehelp'])

Confusion matrix (title content as predictor):


Unnamed: 0,predicted r/Androidquestions,predicted r/iphonehelp
actual r/Androidquestions,271,27
actual r/iphonehelp,69,228


## Further feature engineering

**Combining all the strings as predictors**

The abovementioned model selection and hyperparameter optimisation grid searches, although successful in modestly improving accuracy scores, have been conducted in isolation on either `title`, `text`, or , `comment` data. `text` and `comment` data consistently perform better than `title` content hypothetically due to the sheer length of their individual corpus of words.

Going by that hypothesis, combining the content from all three aspects into a single dataframe would yield a lot more data, thereby leading to a step-change improvement in the accuracy scores of the model. I hence combine `title`, `text` and `comment` data into a single dataframe called `combined_content`, perform a `train_test_split` in order to achieve training data with a lot more data points. Using the `combined_data`, I perform another gridsearch for opptimised hyperparameters. The result accuracy of this method yields an accuracy score of **0.899**, a significant imporvement over the accuracy scores achieved by hyperparameter optimisation alone.

This feature engineering method of combining all `title`, `text` and `comment` data into a single dataframe before running it through a Multinomial Naive Bayes model with optimised hyperparameters will be used on the test data for a final accuracy score.

### Gridsearch for optimal parameters on combined content

In [483]:
# Create a new dataframe with all the content combined as a single column
df1 = df[['cleaned_title', 'subreddit']]
df1.rename({'cleaned_title': 'content'}, axis=1, inplace=True)

df2 = df[['cleaned_text', 'subreddit']]
df2.rename({'cleaned_text': 'content'}, axis=1, inplace=True)

df3 = df[['cleaned_comments', 'subreddit']]
df3.rename({'cleaned_comments': 'content'}, axis=1, inplace=True)

combined_content = pd.concat([df1,df2,df3])

In [484]:
X = combined_content.drop('subreddit', axis=1)
y = combined_content['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, 
                                              y, 
                                              test_size = 0.3, 
                                              random_state = 42, 
                                              stratify = y)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(4162, 1)
(4162,)
(1784, 1)
(1784,)


In [485]:
%%time

pipeline = Pipeline([
    ('selector', FunctionTransformer(lambda x:x['content'])),
    ('vectorizer', CountVectorizer(stop_words = stop_words)),
    ('classifier', MultinomialNB())
])
    
param_grid = [{
    'vectorizer__ngram_range' : [(1,1), (1,2), (1,3), (1,4)],
    'vectorizer__max_df': [1, 0.05, 0.1, 0.01],
    'vectorizer__binary': [True, False],
    'classifier__alpha': np.linspace(0, 100, num=100),
}]
    
kf = KFold(n_splits = 3, shuffle = True, random_state=42)

gscv_combined = GridSearchCV(pipeline, cv=kf, param_grid = param_grid, scoring ='accuracy', iid=False, verbose = True, n_jobs=-1)
gscv_combined.fit(X_train, y_train)
y_pred_combined = gscv_combined.predict(X_test)

print("training set accuracy :", gscv_combined.score(X_train, y_train))
print("Validation set accuracy:", gscv_combined.score(X_test, y_test))
print("Best Model :", gscv_combined.best_params_)

Fitting 3 folds for each of 3200 candidates, totalling 9600 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    2.6s
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed:   12.4s
[Parallel(n_jobs=-1)]: Done 418 tasks      | elapsed:   29.9s
[Parallel(n_jobs=-1)]: Done 768 tasks      | elapsed:   56.7s
[Parallel(n_jobs=-1)]: Done 1218 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 1768 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 2418 tasks      | elapsed:  3.4min
[Parallel(n_jobs=-1)]: Done 3168 tasks      | elapsed:  4.5min
[Parallel(n_jobs=-1)]: Done 4018 tasks      | elapsed:  5.8min
[Parallel(n_jobs=-1)]: Done 4968 tasks      | elapsed:  7.2min
[Parallel(n_jobs=-1)]: Done 6018 tasks      | elapsed:  8.8min
[Parallel(n_jobs=-1)]: Done 7168 tasks      | elapsed: 10.5min
[Parallel(n_jobs=-1)]: Done 8418 tasks      | elapsed: 12.4min
[Parallel(n_jobs=-1)]: Done 9600 out of 9600 | elapsed: 14.2min finished


training set accuracy : 0.9723690533397406
Validation set accuracy: 0.898542600896861
Best Model : {'classifier__alpha': 1.0101010101010102, 'vectorizer__binary': True, 'vectorizer__max_df': 0.1, 'vectorizer__ngram_range': (1, 3)}
CPU times: user 2min 4s, sys: 1min 20s, total: 3min 25s
Wall time: 14min 11s


In [487]:
# print confusion matrix
cmatrix = confusion_matrix(y_test, y_pred_combined)
print("Confusion matrix (All content as predictor):")
pd.DataFrame(cmatrix, 
             index = ['actual r/Androidquestions','actual r/iphonehelp'],
             columns = ['predicted r/Androidquestions', 'predicted r/iphonehelp'])

Confusion matrix (All content as predictor):


Unnamed: 0,predicted r/Androidquestions,predicted r/iphonehelp
actual r/Androidquestions,804,89
actual r/iphonehelp,92,799


## Validation on test data

### Import test data

In [497]:
test = pd.read_csv('./data/combined_test.csv', index_col = 0)

In [520]:
# check shape of test data
test.shape

(440, 21)

In [499]:
test.isnull().sum()

title            0
id               0
date_created     0
text             0
score            0
upvote_ratio     0
comment_count    0
comment_all      0
comment_len      0
title_len        0
text_len         0
subreddit        0
dtype: int64

### Preprocess text data

In [500]:
# Tokenise
test['tok_title'] = test.apply(lambda row: nltk.word_tokenize(row['title'].lower()), axis=1)
test['tok_text'] = test.apply(lambda row: nltk.word_tokenize(row['text'].lower()), axis=1)
test['tok_comments'] = test.apply(lambda row: nltk.word_tokenize(row['comment_all'].lower()), axis=1)

# Lemmatize
lemmatizer = WordNetLemmatizer()

test['lem_tok_title'] = test['tok_title'].apply(lambda row: [lemmatizer.lemmatize(item) for item in row])
test['lem_tok_text'] = test['tok_text'].apply(lambda row: [lemmatizer.lemmatize(item) for item in row])
test['lem_tok_comments'] = test['tok_comments'].apply(lambda row: [lemmatizer.lemmatize(item) for item in row])

# Join
test["cleaned_title"]= test["lem_tok_title"].str.join(" ")
test["cleaned_text"]= test["lem_tok_text"].str.join(" ")
test["cleaned_comments"]= test["lem_tok_text"].str.join(" ")

#remove punctuations
test.cleaned_title = test.cleaned_title.apply(lambda x: x.translate(str.maketrans('','',string.punctuation)))
test.cleaned_text = test.cleaned_text.apply(lambda x: x.translate(str.maketrans('','',string.punctuation)))
test.cleaned_comments = test.cleaned_comments.apply(lambda x: x.translate(str.maketrans('','',string.punctuation)))


In [501]:
# Create a new dataframe with all the content combined as a single column
test1 = test[['cleaned_title', 'subreddit']]
test1.rename({'cleaned_title': 'content'}, axis=1, inplace=True)

test2 = test[['cleaned_text', 'subreddit']]
test2.rename({'cleaned_text': 'content'}, axis=1, inplace=True)

test3 = test[['cleaned_comments', 'subreddit']]
test3.rename({'cleaned_comments': 'content'}, axis=1, inplace=True)

combined_test = pd.concat([test1,test2,test3])

In [503]:
#Create X and y on test data
X = combined_test.drop('subreddit', axis=1)
y = combined_test['subreddit']

### Run model on test data

In [519]:
y_pred_test = gscv_combined.predict(X)

print(f"Final accuracy score on test data: {np.round(accuracy_score(y, y_pred_test), 2)}")

# print confusion matrix
cmatrix = confusion_matrix(y, y_pred_test)
print("Confusion matrix (On unlabelled test data):")
pd.DataFrame(cmatrix, 
             index = ['actual r/Androidquestions','actual r/iphonehelp'],
             columns = ['predicted r/Androidquestions', 'predicted r/iphonehelp'])

Final accuracy score on test data: 0.83
Confusion matrix (On unlabelled test data):


Unnamed: 0,predicted r/Androidquestions,predicted r/iphonehelp
actual r/Androidquestions,299,58
actual r/iphonehelp,172,791


## Conclusions and insights

Using a multinomial naive Bayes classifier trained on a combination of title, post and comment content, I was able to classify **440** unlabelled posts into r/Androidquestions or r/iphonehelp with a reasonable accuracy of **83%**. This far outperformed the baseline score of *50%*.

The nature of queries in r/Androidquestions appear mostly related to software issues and tweaking issues related to the Android 10 operating system as opposed to the hardware. In contrast, most issues on r/iphonehelp are related to hardware problems such as accidentily dropping phones in water or replacing cracked screens. It is perhaps unsurprising then that it is easy to distinguish between posts meant for either subreddits. Keywords such as 'Android' and 'ios' which are native to the different operating systems further help the discrimation of the posts.

Despite the differences in the different phone ecosystems, they still share some similar issues, which is a likely explanation for the model misclassifications. Looking at overlapping words ('sim card', 'factory reset', 'recovery mode', 'lock screen', 'power button', 'old phone') between the top 50 meaningful phrases gives us an indication of the common issues which most likely stratify both subreddits.

To further improve model accuracy, a bigger corpus that incorporates a bigger vocabulary on the different systems is needed. As proven through the data on modelling, models trained using only title information tend to be more inaccurate as compared to the text and comment data which tended to be longer in nature, hence containing more words. The best model on the validation set incorporated *4,162* data points which were a combination of title, comment and text data. In contrast, hyperparameter optimisation, though time-consuming, only achieved very modest accuracy gains.
It can hence be said that the hypothesis of 'throw more data at the model' to improve accuracy scores holds true. The model does not discriminate between title, text and comments but merely the vocabulary of words within an entire subreddit post. I hence posit, that if this model were deployed for real-life use, the substantial increase in queries and discriptions of problems over time would improve accuracy scores. 

To move the project forward (i.e. to improve accuracy scores) I recommend the following:
1. Feed all 'text'-related information as a single feature into the model
1. Deploy the model and put it in use so as to 'crowd-source' a larger corpus of words as queries come in.
1. Use other sources of data such as other subreddits and other forums to increase th ecorpus of words.

As mentioned previously, although the goal of this project is to classify subreddits, such a classification model can also be applied elsewhere, such as to automate front end systems for topic matching and routing of queries to the right troubleshooting teams, recommending possible solutions as part of a larger software system, and the ever-useful spam filtering.

