## Imports

In [30]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.compose import ColumnTransformer
import nltk
from nltk.corpus import stopwords
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

RANDOM_STATE = 42

## Read-In Data

In [2]:
subreddits = pd.read_csv('../data/subreddits_preprocessed.csv')
subreddits.drop(columns = 'Unnamed: 0', inplace = True)

In [4]:
subreddits.head(2)

Unnamed: 0,title,selftext,subreddit,author,num_comments,score,timestamp,original_text,post_length_char,post_length_words,is_unethical,stemmer_text,polarity,sentiment_cat
0,: Answers to why,,LifeProTips,AlienAgency,2,1,2020-07-17,: Answers to why,16,4,0,: answer to whi,0.0,Neutral
1,¿Quieres obtener juegos y premios gratis en tu...,,LifeProTips,GarbageMiserable0x0,2,1,2020-07-17,¿Quieres obtener juegos y premios gratis en tu...,60,10,0,¿quier obten juego y premio grati en tu tiempo...,0.0,Neutral


## Model Preparation

In a separate set of models, I determined that stemmed text and the TF-IDF vectorizer were a good choice for my data. Therefore, I will perform a train/test split on features that include the stemmed text, and set up a column transformer that vectorizes only the text column and passes the numeric features through unchanged.

### Train Test Split

In [6]:
features = ['num_comments', 'score', 'post_length_char', 'post_length_words', 'polarity', 'stemmer_text']
X = subreddits[features]
y = subreddits['is_unethical']

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = RANDOM_STATE, stratify = y)
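As a quick illustration of what `stratify` does in the split above, the sketch below uses made-up labels (not the subreddit data) and checks that the positive-class share is the same in both halves. The names `X_demo`, `y_demo`, and the 60/40 class balance are assumptions chosen for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 60 negatives and 40 positives.
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.arange(100).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# With stratify, both splits keep the 40% positive share.
train_share = y_tr.mean()
test_share = y_te.mean()
print(train_share, test_share)
```

Because the split preserves class proportions, any gap between training and testing accuracy for a majority-class baseline would have to come from somewhere other than class imbalance between the splits.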

### Build Column Transformer to Only Apply Vectorizer to Text Features

In [12]:
tfidf = ColumnTransformer(
    [('tfidf', TfidfVectorizer(), 'stemmer_text')],
    remainder='passthrough')
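To make the effect of the column transformer concrete, here is a minimal sketch on a three-row toy frame (the rows and the names `toy`, `toy_transformer`, and `toy_matrix` are invented for illustration). It shows that the output has one column per vocabulary term plus one column for each passed-through numeric feature.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data: one text column and one numeric column (values are made up).
toy = pd.DataFrame({
    'stemmer_text': ['answer to whi', 'free game and prize', 'answer free prize'],
    'score': [1, 5, 3],
})

# Passing the column name as a plain string (not a list) hands the vectorizer
# a 1-D series of documents, which is what TfidfVectorizer expects.
toy_transformer = ColumnTransformer(
    [('tfidf', TfidfVectorizer(), 'stemmer_text')],
    remainder='passthrough',
)

toy_matrix = toy_transformer.fit_transform(toy)
vocab_size = len(toy_transformer.named_transformers_['tfidf'].vocabulary_)

# Columns = TF-IDF vocabulary terms + the untouched 'score' column.
print(toy_matrix.shape, vocab_size)
```

Note that the passed-through numeric columns are not scaled here, so they sit alongside TF-IDF weights that are bounded between 0 and 1.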

### Define Custom Stop Words Hyperparameter for Vectorizer

In [28]:
custom_stop_words = stopwords.words('english') + ['ulpt', 'lpt']
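The transformer defined earlier does not use this list yet. One way it could be supplied later is as a nested hyperparameter, sketched below. This is an assumption about how the list might be wired in, not the notebook's confirmed approach: `example_pipe`, `example_param_grid`, the step name `'logreg'`, and the short stand-in list `example_stop_words` are all hypothetical.

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Stand-in list; the notebook builds its own from NLTK plus 'ulpt' and 'lpt'.
example_stop_words = ['the', 'and', 'ulpt', 'lpt']

example_pipe = Pipeline([
    ('tfidf', ColumnTransformer(
        [('tfidf', TfidfVectorizer(), 'stemmer_text')],
        remainder='passthrough')),
    ('logreg', LogisticRegression(max_iter=1000)),
])

# Nested names follow <pipeline step>__<transformer name>__<parameter>.
example_param_grid = {
    'tfidf__tfidf__stop_words': [None, example_stop_words],
}

# Setting the parameter directly confirms that the nested name resolves.
example_pipe.set_params(tfidf__tfidf__stop_words=example_stop_words)
print(example_pipe.get_params()['tfidf__tfidf__stop_words'])
```

A dictionary like `example_param_grid` could then be passed to `GridSearchCV`. One caveat worth checking: the NLTK list contains unstemmed words, so some entries may not match tokens in the stemmed text.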

## Modeling

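Each model below is fit on the training data and then evaluated on three accuracy scores: cross-validated, training, and testing. The null model sets the baseline that the later models need to beat.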

### Functions

In [41]:
def display_accuracy_scores(model, X_train, y_train, X_test, y_test):
    """Print cross-validation, training, and testing accuracy for a fitted model."""
    # cross_val_score refits clones of the model on folds of the training data
    print(f'The cross validation accuracy score is {cross_val_score(model, X_train, y_train).mean()}.')
    print(f'The training accuracy score is {model.score(X_train, y_train)}.')
    print(f'The testing accuracy score is {model.score(X_test, y_test)}.')

In [37]:
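def display_accuracy_scores_gs(model, X_train, y_train, X_test, y_test):
    """Print cross-validation, training, and testing accuracy for a fitted grid search."""
    # best_score_ is the mean cross-validated score of the best parameter combination
    print(f'The cross validation accuracy score is {model.best_score_}.')
    print(f'The training accuracy score is {model.score(X_train, y_train)}.')
    print(f'The testing accuracy score is {model.score(X_test, y_test)}.')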
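The two helpers differ only in where the cross-validation score comes from, so they could be folded into one. The sketch below is one possible way to do that, not part of the original notebook: `accuracy_summary` is a hypothetical name, it returns a dictionary rather than printing, and it is demonstrated on synthetic data from `make_classification`.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier


def accuracy_summary(model, X_train, y_train, X_test, y_test):
    """Return CV, training, and testing accuracy for a fitted model or grid search."""
    if hasattr(model, 'best_score_'):
        # Fitted grid search: reuse the CV score it already computed.
        cv_score = model.best_score_
    else:
        # Plain estimator: run cross-validation on the training data.
        cv_score = cross_val_score(model, X_train, y_train).mean()
    return {
        'cv': cv_score,
        'train': model.score(X_train, y_train),
        'test': model.score(X_test, y_test),
    }


# Synthetic demo data (not the subreddit data).
X_demo, y_demo = make_classification(n_samples=200, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

dummy = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), {'max_depth': [2, 4]}
).fit(X_tr, y_tr)

dummy_scores = accuracy_summary(dummy, X_tr, y_tr, X_te, y_te)
search_scores = accuracy_summary(search, X_tr, y_tr, X_te, y_te)
print(dummy_scores)
print(search_scores)
```

Returning a dictionary instead of printing makes it easier to collect the scores from several models into one comparison table later.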

### Model 1: Null Model

In [32]:
null = DummyClassifier()

In [34]:
null.fit(X_train, y_train);

In [47]:
display_accuracy_scores(model = null, X_train = X_train, X_test = X_test, y_train = y_train, y_test = y_test)

The cross validation accuracy score is 0.5085273758949527.
The training accuracy score is 0.5090155945419104.
The testing accuracy score is 0.47670454545454544.




To outperform the null model, any model that I build will need to exceed 47.7% accuracy on the testing data.
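The default `strategy` of `DummyClassifier` depends on the installed scikit-learn version (it changed from `'stratified'` to `'prior'` in version 0.24), and the gap between the training and testing scores above may indicate that this baseline was making random predictions. As a hedged alternative, the sketch below shows a deterministic majority-class baseline on made-up labels. The names `X_demo`, `y_demo`, and `baseline` are assumptions for the example.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 51 positives and 49 negatives.
y_demo = np.array([1] * 51 + [0] * 49)
X_demo = np.zeros((len(y_demo), 1))  # DummyClassifier ignores the features

# 'most_frequent' always predicts the majority class, so results are repeatable.
baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)

baseline_accuracy = baseline.score(X_demo, y_demo)
majority_share = np.bincount(y_demo).max() / len(y_demo)

# The two numbers match: accuracy equals the majority-class share.
print(baseline_accuracy, majority_share)
```

With this strategy the baseline accuracy is simply the share of the majority class, which gives a stable threshold to compare later models against.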

### Model 2: Logistic Regression