# NLP and modeling

This is notebook 4 (out of 5) for the NLP Reddit Project.
Notebook by: <b>Martijn de Vries</b><br>
martijndevries91@gmail.com
https://github.com/martijndevries/NLP_Reddit

## Problem Statement

A US political consultancy company is researching how news sources and discussed topics differ between the US political mainstream and the conservative right-wing media. In the last decade or so, the US political right-wing has been increasingly described as living in an entirely separate information ecosystem from the political mainstream. In order to gauge how intense this effect is, we will collect, process, and classify the Reddit content of two politically-themed subreddits that reflect the mainstream and conservative voters respectively: <b>r/politics</b> and <b>r/conservative</b>.

For this project, we will build two separate branches of models: one for post submissions (largely consisting of links to news sites), and another for comments (consisting of actual Reddit users discussing political news). As this is a binary classification problem where the two classes are of equal interest and will be approximately balanced, we will use the accuracy score as the main metric to gauge the success of the classification model. 

Because political news is always evolving, we have chosen a specific moment in time: the month leading up to the 2022 midterms, October 6th to November 6th 2022. This ensures that 1) the same news cycle is covered for both subreddits, 2) both subreddits were at peak activity, and 3) there is maximum potential for interesting insights into the way that news is discussed within these two subreddits.

## In this Notebook

I will vectorize the text (titles and comments respectively) for each of my two models, using a 'bag of words' approach with the TF-IDF (Term frequency - inverse document frequency) Vectorizer. I will also include additional features in the model as elaborated on in the previous notebooks. I will try different models and try to find which model yields the best accuracy score on the testing data.

In [96]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import re
import sys
import pickle

#nltk imports
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# sklearn imports
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler


## 1) Posts modeling

In [97]:
post_df = pd.read_csv('../data/all_submissions_cleaned.csv')
post_df.head()

Unnamed: 0,id,created_utc,title,selftext,url,num_comments,upvote_ratio,subreddit,word_length,domain
0,xylvpq,1665212044,"Editorial: Hey, QAnon — Texas had an actual ch...",,https://www.houstonchronicle.com/opinion/edito...,688,0.97,politics,16,other
1,xylh3y,1665210574,Sanders: Biden’s Marijuana Pardons Are Good — ...,,https://truthout.org/articles/sanders-bidens-m...,269,0.97,politics,12,truthout
2,xyla6d,1665209886,Elon Musk suggests making Taiwan a ‘special ad...,[removed],,24,0.26,politics,13,none
3,xykwzh,1665208590,Anyone else in Chicago noticing how Fox News k...,[removed],,1,1.0,politics,32,none
4,xykox1,1665207791,Urfi wore a bold saree! Spread the flames of h...,,https://countryconnect.in/entertainment-news/u...,1,1.0,politics,13,other


### 1.1) Logistic Regression

Let's dive straight in and try a first model with TfidfVectorizer and LogisticRegression, using only the 'title' column.
First, what's the baseline accuracy?

In [98]:
post_df['subreddit'].value_counts(normalize=True)

conservative    0.526722
politics        0.473278
Name: subreddit, dtype: float64
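
The baseline is the accuracy of a classifier that always predicts the majority class. A minimal sketch of that calculation, using a small made-up label list (an assumption for illustration, not the real data):

```python
from collections import Counter

# Hypothetical toy labels, standing in for post_df['subreddit']
labels = ["conservative"] * 10 + ["politics"] * 9

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a classifier that always predicts the majority class
baseline_accuracy = majority_count / len(labels)
print(majority_class, round(baseline_accuracy, 3))
```

For the posts data this corresponds to the 0.527 share of r/conservative shown above, so a useful model has to beat roughly 52.7% accuracy.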

Let's binarize the subreddit column

In [99]:
post_df['subreddit'] = post_df['subreddit'].map({'conservative':0, 'politics':1})

Now define our X and y and do a train_test_split

In [100]:
X = post_df['title']
y = post_df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
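
Because no test_size is passed, train_test_split falls back to its default of holding out 25% of the rows. A quick sanity check of the resulting split sizes, using the row counts that are printed further down in this notebook (14202 train, 4734 test):

```python
import math

n_samples = 18936   # 14202 + 4734, the train and test sizes printed later in this notebook
test_size = 0.25    # train_test_split's default when neither size is given

# The test set gets the rounded-up share; the train set gets the remainder
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

Passing stratify=y would additionally keep the class proportions the same in both splits; the cells in this notebook do not do that.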

Define a pipeline with TfidfVectorizer and LogisticRegression

In [6]:
nlp_pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('lr', LogisticRegression()),
])


In [7]:
nlp_pipe.fit(X_train, y_train)

In [8]:
print(nlp_pipe.score(X_train, y_train))
print(nlp_pipe.score(X_test, y_test))

0.8207294747218702
0.7034220532319392


Not bad for a first attempt. It's definitely better than the baseline, but there are probably improvements that can be made. The first thing that sticks out is that the model is quite overfit, given the big difference in train and test scores.

I would like to add two things: 1) num_comments, which we saw in the EDA notebook might be slightly predictive of which subreddit the title is from, and 2) the domain name. To put these things in an sklearn Pipeline, I'll have to use ColumnTransformer. That way I can apply the word vectorizer only to the 'title' column, StandardScaler to 'num_comments', and OneHotEncoder to 'domain'.

Sidenote: I went down a bit of a rabbit hole here on whether the output of TfidfVectorizer should <i>also</i> be scaled, in order to be on the same scale as the output of num_comments. But it seems like it's probably better not to. See: https://stackoverflow.com/questions/36675022/do-you-need-to-scale-vectorizers-in-sklearn
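
One reason extra scaling is less pressing: with its default settings (smoothed idf, norm='l2'), TfidfVectorizer already normalizes every row to unit length, so all entries lie between 0 and 1. Below is a pure-Python sketch of that calculation on made-up titles. The tokenization is a plain split(), which is simpler than what sklearn does, so treat it as an illustration rather than a reimplementation.

```python
import math
from collections import Counter

# Hypothetical toy titles; tokenization here is a plain split(), simpler than sklearn's
docs = ["biden signs bill", "senate blocks bill", "biden biden rally"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency of each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    tf = Counter(tokens)
    # Smoothed idf, as in TfidfVectorizer's defaults: ln((1 + n) / (1 + df)) + 1
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}   # L2-normalize the row

rows = [tfidf_row(tokens) for tokens in tokenized]
for row in rows:
    print({t: round(v, 3) for t, v in row.items()})
```

Every row has unit L2 norm, so the tf-idf features are bounded by construction, while a standardized num_comments column is centered on 0 with unit variance.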

Finally, I'll use a custom preprocessor to get rid of newline characters (stolen from Eric's lecture on NLP EDA).

In [9]:
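def my_preprocessor(text):
    """
    Custom preprocessor function - shamelessly stolen from Eric's NLP EDA lecture
    """
    text = text.lower()
    #remove newline characters
    text = re.sub(r'\n', '', text)
    #keep words (including apostrophes) and dollar amounts; raw strings avoid invalid-escape warnings
    text = re.findall(r"[\w']+|\$[\d\.]+", text)
    text = ' '.join(text)
    
    return text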
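
To see what this preprocessor does to a piece of text, here is a standalone copy of the same steps (named preprocess here, a hypothetical name chosen so it does not shadow the function above) applied to two made-up strings:

```python
import re

def preprocess(text):
    # Same steps as my_preprocessor above, written with raw-string patterns
    text = text.lower()
    text = re.sub(r'\n', '', text)                    # drop newline characters
    tokens = re.findall(r"[\w']+|\$[\d\.]+", text)    # words/apostrophes, or dollar amounts
    return ' '.join(tokens)

print(preprocess("Gas hits $4.50!\nBiden's plan"))
print(preprocess("first line\nsecond line"))
```

The first call returns "gas hits $4.50 biden's plan". The second returns "first linesecond line": because newlines are deleted rather than replaced by a space, the words on either side of a bare newline get glued together. Substituting a space instead would avoid that, at the cost of slightly changing the resulting features.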

In [10]:
ct = ColumnTransformer([
    ("tfidf", TfidfVectorizer(preprocessor=my_preprocessor, stop_words='english'), "title"),
    ("ss", StandardScaler(), ["num_comments"]),
    ('ohe', OneHotEncoder(sparse_output = False, drop ='first'), ["domain"])
    ])
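
The drop='first' argument removes one reference category from the one-hot encoding. Without it, the one-hot columns of every row sum to 1, which is perfectly collinear with the intercept of a linear model. A pure-Python sketch with made-up domain values (not sklearn's implementation, though OneHotEncoder also sorts the categories before dropping the first one):

```python
# Hypothetical domain values; 'cnn' sorts first, so it becomes the dropped reference level
domains = ["foxnews", "cnn", "other", "cnn"]

categories = sorted(set(domains))          # ['cnn', 'foxnews', 'other']
kept = categories[1:]                      # drop='first' removes the first category

def one_hot(value, columns):
    return [1 if value == c else 0 for c in columns]

full = [one_hot(d, categories) for d in domains]
reduced = [one_hot(d, kept) for d in domains]
print(full)      # every row sums to 1
print(reduced)   # 'cnn' rows are all zeros
```

Tree-based models do not suffer from this collinearity, which is why the random forest pipelines later in this notebook keep all categories.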

In [11]:
nlp_pipe = Pipeline([
    ('ct', ct ),
    ('lr', LogisticRegression(solver='liblinear')), #liblinear supports both the 'l1' and 'l2' penalties (default: 'l2')
])

We need to redefine X to include 'num_comments' and 'domain'

In [12]:
X = post_df[['title', 'num_comments', 'domain']]
y = post_df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [13]:
nlp_pipe.fit(X_train, y_train)

In [14]:
print(nlp_pipe.score(X_train, y_train))
print(nlp_pipe.score(X_test, y_test))

0.9034643008027038
0.8177017321504013


Adding those features seems to make the model clearly better (0.82 vs 0.70 test accuracy)! It's still overfit, but scoring better overall.

Given that the model is overfit, maybe we should do a GridSearch and see if we can improve the model that way.

In [15]:
pipe_params = {
    'ct__tfidf__strip_accents':[None, 'unicode'],
    'ct__tfidf__max_df':[0.99, 0.8, 0.5],
    'ct__tfidf__stop_words':[None, 'english'],
    'lr__penalty':['l1', 'l2'],
    'lr__C':[0.01, 0.5, 1, 5]
}

In [16]:
gs = GridSearchCV(nlp_pipe,
                 param_grid =pipe_params,
                 n_jobs=-1)
gs.fit(X_train, y_train)

In [17]:
print(gs.score(X_train, y_train))
print(gs.score(X_test, y_test))

0.8995916068159414
0.8210815378115758


In [18]:
gs.best_params_

{'ct__tfidf__max_df': 0.99,
 'ct__tfidf__stop_words': None,
 'ct__tfidf__strip_accents': None,
 'lr__C': 1,
 'lr__penalty': 'l2'}

It seems that my gridsearch doesn't make the model all that much better.

One other thing I can try is to lemmatize the tokens instead of using them as-is. Stealing this from Eric's lecture on NLP EDA.

In [19]:
def my_lemmatizer(text):
    """
    Custom lemmatizer function - shamelessly stolen from Eric's NLP EDA lecture
    """
    wnet = WordNetLemmatizer()
    return [wnet.lemmatize(w) for w in text.split()]

In [20]:
# apply same pre-processing as posts to stopwords
# https://scikit-learn.org/stable/modules/feature_extraction.html#stop-words
wnet = WordNetLemmatizer()
lem_stopwords = [wnet.lemmatize(w) for w in stopwords.words('english')]
lem_stopwords[:5]

['i', 'me', 'my', 'myself', 'we']

In [21]:
ct = ColumnTransformer([
    ("tfidf", TfidfVectorizer(preprocessor=my_preprocessor, tokenizer=my_lemmatizer), "title"),
    ("ss", StandardScaler(), ["num_comments"]),
    ('ohe', OneHotEncoder(sparse_output = False, drop ='first'), ["domain"])
    ])

nlp_pipe = Pipeline([
    ('ct', ct ),
    ('lr', LogisticRegression(solver='liblinear')),
])

In [22]:
pipe_params = {
    'ct__tfidf__strip_accents':[None, 'unicode'],
    'ct__tfidf__max_features':[4000, 8000, 12000],
    'ct__tfidf__max_df':[1.0, 0.8, 0.5], #float 1.0 = proportion of documents; an int 1 would mean "at most 1 document"
    'ct__tfidf__stop_words':[None, lem_stopwords],
}
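
A note on max_df: TfidfVectorizer interprets a float as a proportion of documents and an int as an absolute document count, so 1 and 1.0 behave very differently. The sketch below mimics that rule in plain Python on made-up document frequencies. The helper names are hypothetical and this is not sklearn's own code.

```python
def max_doc_count(max_df, n_docs):
    # Mirrors the documented rule: ints are absolute counts, floats are proportions
    return max_df if isinstance(max_df, int) else max_df * n_docs

# Hypothetical document frequencies for a corpus of 1,000 titles
n_docs = 1000
doc_freq = {"the": 950, "biden": 120, "midterms": 40, "houstonchronicle": 1}

def surviving_terms(max_df):
    limit = max_doc_count(max_df, n_docs)
    return sorted(t for t, df in doc_freq.items() if df <= limit)

print(surviving_terms(1))     # int: keep only terms that occur in at most 1 document
print(surviving_terms(1.0))   # float: keep terms that occur in at most 100% of documents
print(surviving_terms(0.8))   # float: drop terms that occur in more than 80% of documents
```

With the integer 1 only 'houstonchronicle' survives, whereas 1.0 keeps the full vocabulary.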

In [23]:
gs = GridSearchCV(nlp_pipe,
                 param_grid =pipe_params,
                 n_jobs=-1)
gs.fit(X_train, y_train)



In [24]:
print(gs.score(X_train, y_train))
print(gs.score(X_test, y_test))

0.8887480636530066
0.8219264892268695


It seems that lemmatizing doesn't make all that much of a difference, at least within the parameter space I searched.

I will dive into the output of TfidfVectorizer() in a little more detail and see if I can make improvements on that front. Specifically: let's look at bigrams. I think they could be pretty useful, but as seen in the EDA and cleaning notebook, they can quickly explode the number of model features.

What if I write a custom transformer that takes the output of a Vectorizer, and keeps the most frequent bigrams but drops all the other ones? That way I can tune the max_features of the overall data, but then specifically reduce the number of bigrams on top of that.

I found this snippet of code here on how to write a custom transformer for sklearn: <br>
https://www.andrewvillazon.com/custom-scikit-learn-transformers/

For my custom transformer to work, I also need a specific output of TfidfVectorizer (the column names, so I know which columns are bigrams and which aren't). So in order to achieve this, what I'll do is write a transformer class that wraps around TfidfVectorizer and does the vectorizing and reduction of bigrams in one fell swoop.

In [106]:
class Tfidf_BigramReducer(BaseEstimator, TransformerMixin):
    """
    A class that wraps around sklearn's TfidfVectorizer with ngram_range=(1,2), and subsequently slices out the bigrams with low overall importance

    Attributes
    ----------
    bf (float or int) : 
        the 'bigram fraction' to be kept. If int, an absolute number. If float, a fraction of the total nr of bigrams
    inds_to_keep (list):
        the column indexes to keep in the vectorized matrix. Calculated when the fit method is called 
    all the TfidfVectorizer hyperparameters: 
        these are stored as attributes and passed straight through to a TfidfVectorizer object, which is (re)created each time fit is called

    Methods
    -------
    fit(X, y=None):
        Fits and transforms input data with TfidfVectorizer, then finds the bigrams with the highest summed tf-idf value over all rows. Calculates which column indices should be kept
    transform(X, y=None):
        Transforms the data with TfidfVectorizer, then uses the inds_to_keep attribute to slice out all the bigrams with low tf-idf value
    get_feature_names_out(X):
        returns the feature names after slicing out bigrams

    inherits methods from TransformerMixin (fit_transform) and BaseEstimator
    """
    def __init__(self, bf=0.9, stop_words=None, strip_accents=None, max_features=None, tokenizer=None, min_df=1, max_df=0.8, preprocessor=None):
        """
        Stores the 'bigram fraction' attribute and the TfidfVectorizer hyperparameters. The TfidfVectorizer object itself is created in fit,
        so that hyperparameters changed via set_params (e.g. by GridSearchCV) are always picked up
        """
        self.bf = bf
        
        #tfidf hyperpars
        self.max_df = max_df
        self.min_df = min_df
        self.preprocessor = preprocessor
        self.stop_words = stop_words
        self.strip_accents = strip_accents
        self.tokenizer = tokenizer
        self.max_features = max_features

    def fit(self, X, y=None):
        """
        Fits the data to TfidfVectorizer, then uses the tf-idf values to obtain the indexes of the columns to keep
        """
        #create the TfidfVectorizer object from the current hyperparameter values
        self.tfidf = TfidfVectorizer(stop_words=self.stop_words, min_df=self.min_df, max_df=self.max_df, strip_accents=self.strip_accents,
                            ngram_range=(1,2), max_features=self.max_features, tokenizer=self.tokenizer, preprocessor=self.preprocessor)
        
        X_trans = self.tfidf.fit_transform(X)
        self.features = self.tfidf.get_feature_names_out()
        
        #Sum over all rows to find the n-grams with the highest total tf-idf value
        X_q = np.sum(X_trans, axis=0) 
                
        #Isolate the bigrams, sort by summed tf-idf value
        mc_bigram_freqs = sorted([X_q[0,x] for x in range(X_q.shape[1]) if len(self.features[x].split(' ')) == 2])

        #Find the summed tf-idf value above which we should keep the bigram (depending on self.bf)
        if isinstance(self.bf, int) and self.bf >= 1:
            cutoff = mc_bigram_freqs[-self.bf]
        elif isinstance(self.bf, float) and 0.0 < self.bf <= 1.0:
            idx = int((1-self.bf)*len(mc_bigram_freqs))
            cutoff = mc_bigram_freqs[idx]
        else:
            #raise instead of sys.exit, so a bad value doesn't kill the whole kernel
            raise ValueError('bf must be an integer of at least 1 or a float in (0, 1]')
    
        #Find which col indexes in the matrix to keep (either its a monogram, or a bigram above the cutoff) - save as attribute
        self.inds_to_keep = [x for x,f in enumerate(self.features) if len(f.split(' ')) == 1 or X_q[0,x] >= cutoff]

        return self

    def transform(self, X, y=None):
        """
        Transforms input data by slicing, using the self.inds_to_keep attribute
        """
        X_trans = self.tfidf.transform(X)
        return X_trans[:,self.inds_to_keep]
    
    def get_feature_names_out(self, X=None):
        """
        Returns feature names after slicing out bigrams. This method needs this exact name so it works with the get_feature_names_out() method of all sklearn transformers.
        The argument is optional and unused; sklearn passes the input feature names here.
        Note that in older versions of sklearn this method was called get_feature_names, in which case this method name needs to be changed
        """
        return self.features[self.inds_to_keep]
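
The selection rule inside fit can be illustrated on plain Python data. This is a simplified sketch with made-up summed tf-idf scores (the dictionary and the helper name keep_features are assumptions for illustration), following the same logic: unigrams are always kept, bigrams only if their score reaches the cutoff implied by bf.

```python
# Hypothetical summed tf-idf scores per feature (the role X_q plays in fit)
scores = {
    "biden": 9.0, "senate": 7.5, "vote": 6.0,      # unigrams: always kept
    "joe biden": 4.0, "senate race": 3.0,
    "biden says": 1.5, "vote today": 1.0,
}

def keep_features(scores, bf):
    bigram_scores = sorted(v for f, v in scores.items() if len(f.split(' ')) == 2)
    if isinstance(bf, int):
        cutoff = bigram_scores[-bf]                                  # keep the bf highest-scoring bigrams
    else:
        cutoff = bigram_scores[int((1 - bf) * len(bigram_scores))]   # keep the top fraction
    return [f for f, v in scores.items() if len(f.split(' ')) == 1 or v >= cutoff]

print(keep_features(scores, 0.5))   # top half of the bigrams
print(keep_features(scores, 1))     # only the single best bigram
```

Because the comparison is `>= cutoff`, bigrams that tie with the cutoff score are all kept, so slightly more than the requested fraction can survive.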

In [26]:
X_b = post_df['title'] #Use only the 'title' feature just to test my custom transformer
y = post_df['subreddit']
X_b_train, X_b_test, y_train, y_test = train_test_split(X_b, y, random_state=42)
print(f'Train and test shapes before vectorizing:{X_b_train.shape},{X_b_test.shape}')
tfidf_bg = Tfidf_BigramReducer(bf=0.2)

tfidf_bg.fit_transform(X_b_train)
print(f'Transformed train and test shapes: {tfidf_bg.transform(X_b_train).shape},{tfidf_bg.transform(X_b_test).shape}')

Train and test shapes before vectorizing:(14202,),(4734,)
Transformed train and test shapes: (14202, 34424),(4734, 34424)


That seems to work! Now we can put this in a Pipeline. TfidfVectorizer will be replaced with my new custom transformer Tfidf_BigramReducer:

In [27]:
X = post_df[['title', 'num_comments', 'domain']]
y = post_df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ct = ColumnTransformer([
    ("tfbg", Tfidf_BigramReducer(preprocessor=my_preprocessor), "title"),
    ("ss", StandardScaler(), ["num_comments"]),
    ('ohe', OneHotEncoder(sparse_output = False, drop ='first'), ["domain"])
    ])

nlp_pipe = Pipeline([ ('ct', ct ),
                    ('lr', LogisticRegression(solver='liblinear')) ])

In [28]:
pipe_params = {
    'ct__tfbg__max_df':[0.99, 0.8, 0.7],
    'ct__tfbg__min_df':[1,2],
    'ct__tfbg__bf': [0.1, 0.2, 0.3, 0.4],
    'ct__tfbg__max_features': [20_000, 30_000, 50_000, None],
    'ct__tfbg__stop_words':[None, 'english'],
    'lr__C':[0.001, 0.1, 1]
}

gs = GridSearchCV(nlp_pipe, param_grid=pipe_params, cv=5, \
                  verbose=0, n_jobs=-1)
gs.fit(X_train, y_train)

In [29]:
print(gs.score(X_train, y_train))
print(gs.score(X_test, y_test))

0.9066328686100549
0.8223489649345163


In [30]:
gs.best_params_

{'ct__tfbg__bf': 0.4,
 'ct__tfbg__max_df': 0.99,
 'ct__tfbg__max_features': 50000,
 'ct__tfbg__min_df': 1,
 'ct__tfbg__stop_words': None,
 'lr__C': 1}

### 1.2) Random Forest

Instead of Logistic Regression, I could try RandomForestClassifier. Since this is a decision-tree-based model, the one-hot encoded 'domain' columns don't need a dropped reference category.

To cut down on processing time, I will also switch over from GridSearchCV to RandomizedSearchCV.
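
To get a feel for the saving, the number of model fits can be counted directly. The sketch below uses the sizes of the logistic regression grid from the previous section and the n_iter=60 setting used for the randomized search below, both with 5-fold cross-validation:

```python
from math import prod

# Number of candidate values per hyperparameter in the logistic regression grid above
grid_sizes = {"max_df": 3, "min_df": 2, "bf": 4, "max_features": 4, "stop_words": 2, "C": 3}
cv = 5

n_combinations = prod(grid_sizes.values())
grid_fits = n_combinations * cv      # GridSearchCV fits every combination on every fold

n_iter = 60                          # number of sampled combinations in the randomized search
random_fits = n_iter * cv
print(n_combinations, grid_fits, random_fits)
```

That is 2880 fits for the exhaustive grid against 300 for the randomized search (not counting the final refit on the full training set), at the price of only sampling part of the parameter space.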

In [31]:
X = post_df[['title', 'num_comments', 'domain']]
y = post_df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ct = ColumnTransformer([
    ("tfbg", Tfidf_BigramReducer(preprocessor=my_preprocessor), "title"),
    ("ss", StandardScaler(), ["num_comments"]),
    ('ohe', OneHotEncoder(sparse_output = False, drop=None), ["domain"]) #don't drop for tree-based models
    ])

nlp_pipe = Pipeline([
    ('ct', ct ),
    ('rf', RandomForestClassifier()),
])

#for randomized search
rs_params = {
    'ct__tfbg__max_df':[0.7,0.8, 0.99],
    'ct__tfbg__min_df': [1,2],   
    'ct__tfbg__bf': [0.001, 0.05, 0.1,0.2, 0.3, 0.4],
    'ct__tfbg__max_features':[20_000, 30_000, 50_000, 70_000],
    'rf__min_samples_leaf':[1,3,5],
    'rf__n_estimators':[150, 175, 200, 225, 250]
}

In [32]:
rs_tree = RandomizedSearchCV(nlp_pipe, param_distributions=rs_params, n_jobs=-1, n_iter=60, cv=5)
rs_tree.fit(X_train, y_train)

In [33]:
print(rs_tree.score(X_train, y_train))
print(rs_tree.score(X_test, y_test))

0.9183917758062244
0.8800168990283058


In [34]:
rs_tree.best_params_

{'rf__n_estimators': 200,
 'rf__min_samples_leaf': 3,
 'ct__tfbg__min_df': 2,
 'ct__tfbg__max_features': 20000,
 'ct__tfbg__max_df': 0.7,
 'ct__tfbg__bf': 0.1}

The gap between train and test accuracy (0.92 vs 0.88) is actually smaller than for the logistic regression (0.91 vs 0.82), and the test accuracy is clearly higher. This seems to be the best model so far!

As a final experiment re: titles, I want to see how well this model can do if I only use titles, and not the additional information.

In [35]:
#Only use titles for this attempt
X = post_df['title']
y = post_df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)


nlp_pipe = Pipeline([
    ("tfbg", Tfidf_BigramReducer(preprocessor=my_preprocessor)),
    ('rf', RandomForestClassifier()),
])

#for randomized search
rs_params = {
    'tfbg__max_df':[0.7, 0.8, 0.99],
    'tfbg__min_df':[1,2],
    'tfbg__bf': [0.01, 0.1,0.2, 0.3, 0.4],
    'tfbg__max_features':[20_000, 30_000, 50_000, 70_000],
    'rf__min_samples_leaf':[1,2,3],
    'rf__n_estimators':[150, 175, 200, 225, 250]
}

rs_tree_titles = RandomizedSearchCV(nlp_pipe, param_distributions=rs_params, n_jobs=-1, n_iter=60, cv=5)
rs_tree_titles.fit(X_train, y_train)

In [36]:
print(rs_tree_titles.score(X_train, y_train))
print(rs_tree_titles.score(X_test, y_test))

0.983312209547951
0.6841994085340093


In [37]:
rs_tree_titles.best_params_

{'tfbg__min_df': 2,
 'tfbg__max_features': 70000,
 'tfbg__max_df': 0.99,
 'tfbg__bf': 0.2,
 'rf__n_estimators': 225,
 'rf__min_samples_leaf': 1}

It seems that when only using the titles, the model has quite a bit more trouble classifying. It does make some sense - in the EDA and cleaning notebook we saw that the domain name is fairly predictive for which subreddit the post comes from - highlighting the different media universes of the conservative and mainstream political media. Therefore, including the domain name makes the model much more predictive.

### 1.3) Stacking

As a final attempt, I will consider stacking models. For this I will use 1) Logistic Regression, 2) Random Forest, and 3) Naive Bayes.

I'll need to set up 3 different pipelines.

In [101]:
X = post_df[['title', 'num_comments', 'domain']]
y = post_df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [116]:
#Note that the hyperparameters here are a little heuristic: some of them I got from the grid searches above, others from just re-running the stacked model

ct_lr = ColumnTransformer([
    ("tfbg", Tfidf_BigramReducer(preprocessor=my_preprocessor,  stop_words='english', max_df=0.8, bf=0.1, min_df=1, max_features=30000), "title"),
    ("ss", StandardScaler(), ["num_comments"]),
    ('ohe', OneHotEncoder(sparse_output = False, drop ='first'), ["domain"])
    ])

#For random forest I don't need to drop first
ct_rf = ColumnTransformer([
    ("tfbg", Tfidf_BigramReducer(preprocessor=my_preprocessor, stop_words='english', min_df=1, max_df=0.8, bf=0.1, max_features=25000), "title"),
    ("ss", StandardScaler(), ["num_comments"]),
    ('ohe', OneHotEncoder(sparse_output = False, drop =None), ["domain"])
    ])

# For Multinomial NB I can't have negative values, so I can't standardscale
ct_nb = ColumnTransformer([
    ("tfbg", Tfidf_BigramReducer(preprocessor=my_preprocessor, stop_words='english', max_df=0.8, bf=0.1, max_features=40000), "title"),
    ("mm", MinMaxScaler(), ["num_comments"]),
    ('ohe', OneHotEncoder(sparse_output = False, drop ='first'), ["domain"])
    ])
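
The reason for swapping in MinMaxScaler can be seen on a few numbers. Below, the num_comments values from the head of the posts table are treated as a tiny hypothetical training split (an assumption for illustration) and both scalings are computed by hand:

```python
from statistics import mean, pstdev

# Hypothetical training split: num_comments values from the head of the posts table
train = [1, 24, 269, 688]

mu, sigma = mean(train), pstdev(train)      # StandardScaler uses the population std
standardized = [(x - mu) / sigma for x in train]

lo, hi = min(train), max(train)
min_maxed = [(x - lo) / (hi - lo) for x in train]

print([round(v, 2) for v in standardized])  # contains negative values
print([round(v, 2) for v in min_maxed])     # all within [0, 1]

# A value below the training minimum still maps below 0 (no clipping by default)
below_min = (0 - lo) / (hi - lo)
print(below_min)
```

Min-max scaling keeps the training data non-negative, as MultinomialNB requires. Unseen values below the training minimum can still come out negative, which is worth keeping in mind for columns such as comment scores.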

In [117]:
model1 = Pipeline([ ('ct', ct_lr ),
                    ('lr', LogisticRegression(solver='liblinear'))
                  ])

model2 = Pipeline([ ('ct', ct_rf ),
                    ('rf',RandomForestClassifier(random_state=123, n_estimators=175, min_samples_leaf=1)) 
                  ])

model3= Pipeline([ ('ct', ct_nb ),
                   ('nb', MultinomialNB()) 
                  ])


In [118]:
# Create the model
level1_estimators = [
    ('m1', model1),
    ('m2', model2),
    ('m3', model3)
]

stacked_model = StackingClassifier(estimators=level1_estimators,
                                 final_estimator=RandomForestClassifier())

In [119]:
stacked_model.fit(X_train, y_train)

In [120]:
stacked_model.score(X_train, y_train)

0.977538374876778

In [121]:
stacked_model.score(X_test, y_test)

0.8823405154203633
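
To compare the post models side by side, the train and test accuracies printed in the cells above can be collected and the train-test gap computed for each. The model labels are shorthand introduced here; the numbers are the printed scores rounded to four decimals.

```python
# (train accuracy, test accuracy) as printed in the cells above, rounded to 4 decimals
post_scores = {
    "LR, title only":            (0.8207, 0.7034),
    "LR + extra features":       (0.9035, 0.8177),
    "LR + bigram reducer (GS)":  (0.9066, 0.8223),
    "Random forest (RS)":        (0.9184, 0.8800),
    "Random forest, title only": (0.9833, 0.6842),
    "Stacked model":             (0.9775, 0.8823),
}

# A large train-test gap is a sign of overfitting
gaps = {name: round(train - test, 4) for name, (train, test) in post_scores.items()}
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name:<27} test={post_scores[name][1]:.4f}  gap={gap:.4f}")

best = max(post_scores, key=lambda name: post_scores[name][1])
print("Best test accuracy:", best)
```

The stacked model edges out the random forest on test accuracy (0.8823 vs 0.8800), while the random forest has the smallest train-test gap of the models tried.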

In [113]:
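# coef_ only exists when the final estimator is a linear model such as LogisticRegression;
# with the RandomForestClassifier final estimator defined in In [118], use feature_importances_ instead
stacked_model.final_estimator_.coef_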

array([[0.78530152, 5.6879581 , 2.00328735]])

In [114]:
stacked_model.named_estimators_.m2.named_steps.ct.get_feature_names_out()

array(['tfbg__00', 'tfbg__000', 'tfbg__000 000', ..., 'ohe__domain_wsj',
       'ohe__domain_yahoo', 'ohe__domain_youtube'], dtype=object)

In [115]:
with open('../pickled_models/stacked_model_submissions.pkl', 'wb') as pickle_out:
    pickle.dump(stacked_model, pickle_out)

## 2) Comments Modeling

Next, we will look at the comments I collected. I will use 'all_comments_sentiment.csv' (created in the sentiment_analysis notebook).

In [88]:
com_df = pd.read_csv('../data/all_comments_sentiment.csv')
com_df.set_index('id', inplace=True)
com_df.head()


id,parent_id,author,created_utc,body,score,subreddit,word_length,freq_poster,sent_label,sent_score
irhr7g7,40844090000.0,stickznstonez_,1665212396,https://youtu.be/i1oCQ6bZ_Ws\n\nThis guy might...,1,politics,31,0,LABEL_0,0.6696
irhr7bp,,valcatrina,1665212393,I am surprised it takes the FBI to draw this l...,1,politics,22,0,LABEL_0,0.6532
irhr79r,40843470000.0,StrillyBings,1665212392,If he was making calls to Georgia for someone ...,1,politics,19,0,LABEL_1,0.7425
irhr79a,,After_Ad_9636,1665212391,Duh?\n\nWhy wouldn’t he?,1,politics,3,0,LABEL_1,0.5716
irhr720,40844200000.0,SweetenedTomatoes,1665212386,"Ah, I remember back in my younger days being h...",1,politics,47,0,LABEL_2,0.727


What's the baseline accuracy?

In [49]:
com_df['subreddit'].value_counts(normalize=True)

politics        0.539368
conservative    0.460632
Name: subreddit, dtype: float64

In [95]:
com_df['subreddit'].value_counts()

politics        25874
conservative    22097
Name: subreddit, dtype: int64

We have pretty balanced classes for comments as well. Let's use the same labeling as for titles, and set conservative=0 and politics=1

In [50]:
com_df['subreddit'] = com_df['subreddit'].map({'conservative':0, 'politics':1})

### 2.1) Logistic Regression

I will start again with Logistic Regression before moving on to decision-tree models.

In [51]:
X = com_df[['body', 'score', 'word_length', 'freq_poster', 'sent_label']]
y = com_df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

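Let's set up a ColumnTransformer + a Pipeline again, using my custom transformer, a scaler for the numerical columns, and OneHotEncoder for the sentiment label. Note that 'freq_poster' is not assigned to any transformer, so ColumnTransformer drops it by default (remainder='drop').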

In [52]:
ct = ColumnTransformer([
    ("tfbg", Tfidf_BigramReducer(preprocessor=my_preprocessor, stop_words='english', max_features=70_000), "body"),
    ("ss", StandardScaler(), ["score", "word_length"]),
    ("ohe", OneHotEncoder(sparse_output = False, drop ='first'), ["sent_label"])
    ])


nlp_pipe = Pipeline([
    ('ct', ct ),
    ('lr', LogisticRegression(penalty='l2', C=1, max_iter=1000)),
])

In [53]:
nlp_pipe.fit(X_train, y_train)

In [54]:
nlp_pipe.score(X_train, y_train), nlp_pipe.score(X_test, y_test)

(0.8303407638001, 0.6510464437588593)

And then set up a GridSearch

In [55]:
pipe_params = {
    'ct__tfbg__max_df':[0.99, 0.8, 0.7],
    'ct__tfbg__min_df':[1,2],
    'ct__tfbg__max_features':[50_000, 70_000, 90_000],

    'ct__tfbg__bf': [0.1, 0.3, 0.4, 0.5],
    'lr__C': [0.001, 0.01, 0.5, 1, 5, 10],

}

gs_com = GridSearchCV(nlp_pipe,
                 param_grid =pipe_params,
                 n_jobs=-1, cv=3)
gs_com.fit(X_train, y_train)

In [56]:
print(gs_com.score(X_train, y_train))
print(gs_com.score(X_test, y_test))

0.8030185113124687
0.6487951304927875


In [57]:
gs_com.best_params_

{'ct__tfbg__bf': 0.4,
 'ct__tfbg__max_df': 0.99,
 'ct__tfbg__max_features': 50000,
 'ct__tfbg__min_df': 1,
 'lr__C': 1}

It seems to have a much harder time than with the titles. That must partly be because the additional features I included for the comments model aren't as predictive. It might also be because many comments are short, which makes the classification more difficult.

What if I only include comments that are at least 4 words long?

In [58]:
com_df_wl = com_df[com_df['word_length'] >= 4]
X_wl = com_df_wl[['body', 'score', 'word_length', 'freq_poster', 'sent_label']]
y_wl = com_df_wl['subreddit']

X_wl_train, X_wl_test, y_wl_train, y_wl_test = train_test_split(X_wl, y_wl, random_state=42)

ct = ColumnTransformer([
    ("tfbg", Tfidf_BigramReducer(preprocessor=my_preprocessor, stop_words='english', max_features=70_000), "body"),
    ("ss", StandardScaler(), ["score", "word_length"]),
    ("ohe", OneHotEncoder(sparse_output = False, drop ='first'), ["sent_label"])
    ])

nlp_pipe = Pipeline([
    ('ct', ct ),
    ('lr', LogisticRegression(solver='liblinear')),
])

pipe_params = {
    'ct__tfbg__max_df':[0.99, 0.8, 0.7],
    'ct__tfbg__min_df':[1,2],
    'ct__tfbg__bf': [0.01, 0.1, 0.2, 0.3, 0.4],
}

gs_com = GridSearchCV(nlp_pipe,
                 param_grid =pipe_params,
                 n_jobs=-1, cv=3)
gs_com.fit(X_wl_train, y_wl_train)

In [59]:
print(gs_com.score(X_wl_train, y_wl_train), gs_com.score(X_wl_test, y_wl_test))

0.8174762518354259 0.656957928802589


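Interesting - it seems that filtering out the very short comments barely changes the test accuracy (0.657 vs 0.649), while the train accuracy goes up slightly, so the model mostly ends up a bit more overfit rather than better.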

### 2.2) Random Forest

For the comments, I want to set up a RandomForestClassifier as well, using the same general strategy, with a randomized search.

In [68]:
X = com_df[['body', 'score', 'word_length', 'freq_poster', 'sent_label']]
y = com_df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ct = ColumnTransformer([
    ("tfbg", Tfidf_BigramReducer(preprocessor=my_preprocessor, stop_words='english', min_df=2, bf=0.2, max_features=70_000), "body"),
    ("ss", StandardScaler(), ["score", "word_length"]),
    ("ohe", OneHotEncoder(sparse_output = False, drop =None), ["sent_label"])
    ])

nlp_pipe = Pipeline([
    ('ct', ct ),
    ('rf', RandomForestClassifier(max_depth=600, min_samples_leaf=4, max_leaf_nodes=800, n_estimators=200)),
])

Fitting this pipe by itself first, to get an initial idea of how it performs.

In [69]:
nlp_pipe.fit(X_train, y_train)

In [70]:
nlp_pipe.score(X_train, y_train), nlp_pipe.score(X_test, y_test)

(0.7969592528767581, 0.6278662553156008)

In [71]:
#for randomized search
rs_params = {
    'ct__tfbg__max_df':[0.7, 0.75, 0.8, 0.99],
    'ct__tfbg__min_df':[1,2],
    'ct__tfbg__bf': [0.01, 0.1,0.2, 0.4],
    'ct__tfbg__max_features': [50_000, 70_000, 90_000],
    'rf__max_leaf_nodes':[500, 800, 1000, None],
    'rf__min_samples_leaf':[1,3,4,6],
    'rf__n_estimators':[150, 200,300, 400, 600]
}

In [73]:
rs_tree = RandomizedSearchCV(nlp_pipe, param_distributions=rs_params, n_jobs=-1, n_iter=30, cv=5, random_state=42)
rs_tree.fit(X_train, y_train)

In [74]:
print(rs_tree.score(X_train, y_train))
print(rs_tree.score(X_test, y_test))

0.7980154538884874
0.6296172767447678


In [75]:
rs_tree.best_params_

{'rf__n_estimators': 400,
 'rf__min_samples_leaf': 4,
 'rf__max_leaf_nodes': 800,
 'ct__tfbg__min_df': 1,
 'ct__tfbg__max_features': 50000,
 'ct__tfbg__max_df': 0.75,
 'ct__tfbg__bf': 0.1}

### 2.3) Stacking

I'm going to use some of the best_params_ that I found previously, especially for random forest

In [76]:
X = com_df[['body', 'score', 'word_length', 'freq_poster', 'sent_label']]
y = com_df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ct_lr = ColumnTransformer([
    ("tfbg", Tfidf_BigramReducer(preprocessor=my_preprocessor, stop_words='english', bf=0.5, max_df=0.99, min_df=2, max_features=70_000), "body"),
    ("ss", StandardScaler(), ["score", "word_length"]),
    ("ohe", OneHotEncoder(sparse_output = False, drop ='first'), ["sent_label"])

    ])

ct_rf = ColumnTransformer([
    ("tfbg", Tfidf_BigramReducer(preprocessor=my_preprocessor, stop_words='english', bf=0.4, min_df=1, max_df=0.7, max_features=90_000), "body"),
    ("ss", StandardScaler(), ["score", "word_length"]),
    ("ohe", OneHotEncoder(sparse_output = False, drop =None), ["sent_label"])
    ])


ct_nb = ColumnTransformer([
    ("tfbg", Tfidf_BigramReducer(preprocessor=my_preprocessor, stop_words='english', bf=0.4, max_df=0.99, min_df=2, max_features=70_000), "body"),
    ("ss", MinMaxScaler(), ["score", "word_length"]),
    ("ohe", OneHotEncoder(sparse_output = False, drop ='first'), ["sent_label"])

    ])

In [77]:
model1 = Pipeline([ ('ct', ct_lr ),
                    ('lr', LogisticRegression(solver='liblinear', C=1))
                  ])

model2 = Pipeline([ ('ct', ct_rf ),
                    ('rf',RandomForestClassifier(min_samples_leaf=3, max_leaf_nodes=1000, max_depth=600, n_estimators=300)) 
                  ])

model3= Pipeline([ ('ct', ct_nb ),
                   ('nb', MultinomialNB()) 
                  ])


In [78]:
# Create the model
level1_estimators = [
    ('m1', model1),
    ('m2', model2),
    ('m3', model3)
]

stacked_model = StackingClassifier(estimators=level1_estimators,
                                 final_estimator=LogisticRegression())

In [79]:
stacked_model.fit(X_train, y_train)

In [80]:
stacked_model.score(X_train, y_train)

0.8526599588637501

In [81]:
stacked_model.score(X_test, y_test)

0.6546318685900109

In [82]:
stacked_model.final_estimator_.coef_

array([[0.01626318, 3.25152795, 4.30797484]])
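
A rough way to read these coefficients: the final LogisticRegression combines the base models' predicted probabilities, so a larger coefficient means that base model's output moves the final prediction more. The sketch below assumes the three coefficients correspond to m1, m2 and m3 in the order they were passed to StackingClassifier, and normalizes their magnitudes into shares. This is only an informal comparison, since the base models' probabilities do not all have the same spread.

```python
# Coefficients printed above, assumed to follow the order of level1_estimators (m1, m2, m3)
coefs = {
    "m1 (logistic regression)": 0.01626318,
    "m2 (random forest)": 3.25152795,
    "m3 (naive Bayes)": 4.30797484,
}

# Share of the total coefficient magnitude per base model
total = sum(abs(c) for c in coefs.values())
shares = {name: abs(c) / total for name, c in coefs.items()}
for name, share in shares.items():
    print(f"{name:<26} {share:.1%}")
```

Under these assumptions, the logistic regression base model contributes almost nothing to the final decision for comments, which is driven by the naive Bayes and random forest models.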

Let's save this model so we can investigate more in the modeling insights notebook.

In [83]:
#Note: re-running this cell overwrites the previously saved model
with open('../pickled_models/stacked_model_comments.pkl', 'wb') as pickle_out:
    pickle.dump(stacked_model, pickle_out)