## Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.compose import ColumnTransformer
import nltk
from nltk.corpus import stopwords
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

RANDOM_STATE = 42

## Read-In Data

In [2]:
subreddits = pd.read_csv('../data/subreddits_preprocessed.csv')
subreddits.drop(columns = 'Unnamed: 0', inplace = True)

In [3]:
subreddits.head(2)

Unnamed: 0,title,selftext,subreddit,author,num_comments,score,timestamp,original_text,post_length_char,post_length_words,is_unethical,stemmer_text,polarity,sentiment_cat
0,: Answers to why,,LifeProTips,AlienAgency,2,1,2020-07-17,: Answers to why,16,4,0,: answer to whi,0.0,Neutral
1,¿Quieres obtener juegos y premios gratis en tu...,,LifeProTips,GarbageMiserable0x0,2,1,2020-07-17,¿Quieres obtener juegos y premios gratis en tu...,60,10,0,¿quier obten juego y premio grati en tu tiempo...,0.0,Neutral


## Model Preparation

In a separate set of models, I determined that stemmed text and the TF-IDF vectorizer were a good choice for my data. Therefore, I will run a train-test split on features that include the stemmed text, and set up a column transformer that vectorizes only the text column and passes the numerical features through unchanged.

### Train Test Split

In [4]:
features = ['num_comments', 'score', 'post_length_char', 'post_length_words', 'polarity', 'stemmer_text']
X = subreddits[features]
y = subreddits['is_unethical']

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = RANDOM_STATE, stratify = y)

### Define Custom Stop Words Hyperparameter for Vectorizer

In [6]:
custom_stop_words = stopwords.words('english') + ['ulpt', 'lpt']

### Build Column Transformer to Only Apply Vectorizer to Text Features

In [7]:
tfidf_transformer = ColumnTransformer(
    [('tfidf', TfidfVectorizer(stop_words = custom_stop_words), 'stemmer_text')],
    remainder = 'passthrough')
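
As a rough illustration of what this kind of transformer produces, the sketch below applies the same pattern to a tiny made-up frame. The names `toy`, `toy_transformer`, and `toy_matrix`, and all the values, are hypothetical and are not the project data. The text column is replaced by its TF-IDF columns, and the numerical column is passed through unchanged.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the real posts
toy = pd.DataFrame({
    'stemmer_text': ['save money on coffe', 'return item for free money', 'drink more water'],
    'score': [10, 3, 7],
})

# Same structure as the transformer above, without the custom stop words
toy_transformer = ColumnTransformer(
    [('tfidf', TfidfVectorizer(), 'stemmer_text')],
    remainder = 'passthrough')

toy_matrix = toy_transformer.fit_transform(toy)

# One column per vocabulary term plus one passed-through numerical column
n_terms = len(toy_transformer.named_transformers_['tfidf'].vocabulary_)
print(toy_matrix.shape, n_terms)
```

Passing the column name as a plain string (rather than a list) hands the vectorizer a one-dimensional series of documents, which is the input shape it expects.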

## Modeling

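Apart from the null model, each model below follows the same steps: build a pipeline around the column transformer, grid search over the pipeline's hyperparameters with 5-fold cross-validation, and compare the cross-validation, training, and testing accuracy scores against the null model.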

### Functions

In [8]:
def display_accuracy_scores(model, xtrain, ytrain, xtest, ytest):
    """Print cross-validation, training, and testing accuracy for an already-fitted model."""
    print(f'The cross validation accuracy score is {cross_val_score(model, xtrain, ytrain).mean()}.')
    print(f'The training accuracy score is {model.score(xtrain, ytrain)}.')
    print(f'The testing accuracy score is {model.score(xtest, ytest)}.')

In [9]:
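def display_accuracy_scores_gs(model, xtrain, ytrain, xtest, ytest):
    """Print training and testing accuracy for a fitted grid search (its cross-validation score is in best_score_)."""
    print(f'The training accuracy score is {model.score(xtrain, ytrain)}.')
    print(f'The testing accuracy score is {model.score(xtest, ytest)}.')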

### Model 1: Null Model

In [10]:
null = DummyClassifier(strategy = 'stratified')

In [11]:
null.fit(X_train, y_train);

In [13]:
display_accuracy_scores(model = null, xtrain = X_train, xtest = X_test, ytrain = y_train, ytest = y_test)

The cross validation accuracy score is 0.5004860223998099.
The training accuracy score is 0.50682261208577.
The testing accuracy score is 0.5357954545454545.


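To improve on the null model, any model that I build needs to beat its 53.6% accuracy on the testing data. Because the 'stratified' strategy guesses at random in proportion to the class frequencies, this baseline shifts slightly from run to run; in cross-validation it sits at about 50%.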
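
To make the run-to-run variation of a 'stratified' baseline concrete, here is a minimal sketch on made-up balanced labels. The arrays `X_toy` and `y_toy` and the 50/50 class balance are assumptions for illustration, not the project data. For a class proportion p, random class-proportional guessing is expected to score p² + (1 − p)², and individual seeds scatter around that value.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical balanced labels; the real class balance is an assumption here
y_toy = np.array([0, 1] * 500)
X_toy = np.zeros((len(y_toy), 1))  # DummyClassifier ignores the feature values

# Expected accuracy of random, class-proportional guessing: p^2 + (1 - p)^2
p = y_toy.mean()
expected_accuracy = p ** 2 + (1 - p) ** 2

# Each seed gives a slightly different score around the expected value
scores = [
    DummyClassifier(strategy = 'stratified', random_state = seed)
    .fit(X_toy, y_toy)
    .score(X_toy, y_toy)
    for seed in range(5)
]
print(expected_accuracy, scores)
```

Fixing `random_state` on the dummy classifier is one way to make the baseline reproducible between notebook runs.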

### Model 2a: Logistic Regression with No Regularization

#### Create Pipeline

In [14]:
logreg_pipe = Pipeline([
    ('tfidf', tfidf_transformer),
    ('logreg', LogisticRegression(penalty = 'none', solver = 'newton-cg', max_iter = 600))
])

# The model did not converge with the other solvers; newton-cg can be used to fit larger datasets.
# Note: scikit-learn 1.2 and later expect penalty = None instead of the string 'none'.

#### Grid Search Over Pipeline

In [35]:
logreg_pipe_params = {
    'tfidf__tfidf__ngram_range': [(1,1)],
    'tfidf__tfidf__min_df': [5],
    'tfidf__tfidf__max_df': [0.98]
}

# Note: All other hyperparameter options removed

In [36]:
gs_logreg_pipe = GridSearchCV(logreg_pipe, param_grid = logreg_pipe_params, cv = 5, verbose = 1, n_jobs = -1)

In [37]:
gs_logreg_pipe.fit(X_train, y_train);

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    2.7s remaining:    4.1s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    5.8s finished


In [38]:
gs_logreg_pipe.best_params_

{'tfidf__tfidf__max_df': 0.98,
 'tfidf__tfidf__min_df': 5,
 'tfidf__tfidf__ngram_range': (1, 1)}

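Because the grid was narrowed to a single candidate, the search returns it unchanged: a maximum document frequency of 0.98 (terms appearing in more than 98% of posts are dropped), a minimum document frequency of 5 (terms must appear in at least 5 posts), and an n-gram range of (1, 1), meaning single words only.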
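
The effect of the two document-frequency cutoffs can be sketched on a small made-up corpus. The `docs` list and the cutoff values below are hypothetical and chosen only so that both filters trigger on four documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical four-document corpus
docs = ['tip save money', 'tip save time', 'tip free money', 'tip rare word']

# min_df = 2: keep only terms found in at least 2 documents
# max_df = 0.9: drop terms found in more than 90% of documents
vectorizer = TfidfVectorizer(ngram_range = (1, 1), min_df = 2, max_df = 0.9)
vectorizer.fit(docs)

kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)  # ['money', 'save']
```

Here 'tip' is removed by `max_df` because it appears in every document, and the words that appear only once are removed by `min_df`. An integer cutoff is read as a document count and a float as a proportion of documents.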

#### Evaluate Accuracy Metric

In [19]:
gs_logreg_pipe.best_score_

0.6927426398502718

In [20]:
display_accuracy_scores_gs(model = gs_logreg_pipe, xtrain = X_train, xtest = X_test, ytrain = y_train, ytest = y_test)

The training accuracy score is 0.9990253411306043.
The testing accuracy score is 0.70625.


This model beats the baseline accuracy, but it is extremely overfit: training accuracy is 99.9% while testing accuracy is 70.6%. This is likely due to the large number of features in the model and the absence of regularization. (Note: this model has 3035 features, of which 3030 are words and 5 are numerical.)

### Model 2b: Logistic Regression with Regularization

#### Create Pipeline

In [46]:
logreg_reg_pipe = Pipeline([
    ('tfidf', tfidf_transformer),
    ('ss', StandardScaler(with_mean = False)),
    ('logreg', LogisticRegression())])

#### Grid Search Over Pipeline

In [111]:
logreg_reg_pipe_params = {
    'tfidf__tfidf__ngram_range': [(1,2)],
    'tfidf__tfidf__max_df': [0.90],
    'tfidf__tfidf__min_df': [2],
    'logreg__penalty': ['l2'],
    'logreg__C': [0.0001, 0.00001, 0.000001],
    'logreg__solver': ['liblinear']
}
# Only best params remain in grid

In [112]:
gs_logreg_reg_pipe = GridSearchCV(logreg_reg_pipe, param_grid = logreg_reg_pipe_params, cv = 5, verbose=1, n_jobs = -1 )

In [113]:
gs_logreg_reg_pipe.fit(X_train, y_train);

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:    1.2s finished


In [114]:
gs_logreg_reg_pipe.best_params_

{'logreg__C': 0.0001,
 'logreg__penalty': 'l2',
 'logreg__solver': 'liblinear',
 'tfidf__tfidf__max_df': 0.9,
 'tfidf__tfidf__min_df': 2,
 'tfidf__tfidf__ngram_range': (1, 2)}

#### Evaluate Accuracy Metric

In [115]:
print(f'The cross val score is {round(gs_logreg_reg_pipe.best_score_, 4)}.')

The cross val score is 0.7712.


In [116]:
display_accuracy_scores_gs(model = gs_logreg_reg_pipe, xtrain = X_train, xtest = X_test, ytrain = y_train, ytest = y_test)

The training accuracy score is 0.9851364522417154.
The testing accuracy score is 0.78125.


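Although this model improves accuracy by about 8 percentage points (testing accuracy rises from 70.6% to 78.1%), it is still very overfit to the training data, with a training accuracy of 98.5%. Reducing the number of features in the word vector and decreasing the C value even further do not seem to help.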
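
One way to inspect this kind of overfitting is to track the gap between training and testing accuracy across several values of C. The sketch below does that on synthetic data from `make_classification`, not on the Reddit posts. The dataset sizes, the C values, and the names `X_syn`, `y_syn`, and `gaps` are assumptions for illustration, so the numbers it prints say nothing about the models above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with many more features than informative signals (an assumption)
X_syn, y_syn = make_classification(
    n_samples = 300, n_features = 200, n_informative = 10, random_state = 42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size = 0.3, random_state = 42, stratify = y_syn)

# Smaller C means stronger L2 regularization
gaps = {}
for C in [100, 1, 0.01, 0.0001]:
    model = LogisticRegression(penalty = 'l2', C = C, solver = 'liblinear').fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    test_acc = model.score(X_te, y_te)
    gaps[C] = train_acc - test_acc
    print(f'C = {C}: train {train_acc:.3f}, test {test_acc:.3f}, gap {gaps[C]:.3f}')
```

A gap that stays wide across the whole range of C suggests that regularization strength alone is not the limiting factor, which is consistent with what I observed here.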

### Model 3: Multinomial Naive Bayes

#### Create Pipe
#### Grid Search Over Pipe
#### Evaluate Accuracy Metric