# Logistic Regression

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer 
import pickle
import warnings; warnings.simplefilter('ignore')
np.random.seed(42)  # seed NumPy's global RNG; a bare RandomState(42) object would go unused

In [2]:
X_train = pd.read_csv("../Data/X_train.csv",index_col=0)
y_train = pd.read_csv("../Data/y_train.csv",index_col=0)
with open('../Assets/custom_stop_words.pkl','rb') as f:
    custom_stop_words = pickle.load(f)

This cell instantiates a Pipeline, which processes the text data and sets the model I will use in this notebook: Logistic Regression.

In [3]:
pipe = Pipeline([
    ("tfidif", TfidfVectorizer()),
    ("Log_reg", LogisticRegression(solver="liblinear"))  # liblinear supports both l1 and l2
])
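
To show what the Pipeline does when it is fit and used for prediction, here is a minimal sketch on a toy corpus. The texts, labels, and step names (`toy_texts`, `toy_labels`, `tfidf`, `log_reg`) are made up for illustration and are not the data or names used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy posts and labels, for illustration only.
toy_texts = [
    "my dog loves the park",
    "the dog chased a ball",
    "my cat sleeps all day",
    "the cat ignored the toy",
]
toy_labels = ["dogs", "dogs", "cats", "cats"]

toy_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("log_reg", LogisticRegression()),
])

# fit() calls fit_transform on the vectorizer, then fit on the classifier.
toy_pipe.fit(toy_texts, toy_labels)

# predict() calls transform on the vectorizer, then predict on the classifier,
# so raw strings can be passed straight in.
toy_predictions = toy_pipe.predict(["a dog at the park", "a sleepy cat"])
print(toy_predictions)
```

Because the vectorizer lives inside the Pipeline, a grid search refits it on each training fold, so the vocabulary is never learned from the held-out fold.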

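This dictionary lists the hyperparameters that I want GridSearchCV to test. The stop words and min_df each have a single value, so only the penalty varies. The grid search cross-validates every combination and keeps the one with the best mean cross-validated score, which is accuracy by default for a classifier.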

In [4]:
params = {
    "tfidif__stop_words":[custom_stop_words],
    "tfidif__min_df":[5],
    'Log_reg__penalty':['l1','l2']
}

The penalty for logistic regression can be either l1 or l2. With the l1 penalty, coefficients can be shrunk to exactly 0, which removes those features from the model. With the l2 penalty, coefficients shrink toward 0 but generally do not reach it. The min_df parameter says that a word must appear in at least 5 documents to be considered in the model. This reduces the overall number of features. Most, if not all, of the words eliminated this way should carry little importance for distinguishing the classes, since each appears in fewer than 5 posts.
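
Both ideas can be checked on small examples. The sketch below is illustrative only: the toy documents, the synthetic dataset from `make_classification`, and the value `C=0.1` are my assumptions, not settings used in this notebook. I pass `solver="liblinear"` because it accepts both penalties.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# --- min_df: drop words that appear in too few documents ---
# Hypothetical toy documents, for illustration only.
toy_docs = [
    "python pandas dataframe",
    "python numpy array",
    "python sklearn pipeline",
]
vocab_all = TfidfVectorizer(min_df=1).fit(toy_docs).vocabulary_
vocab_common = TfidfVectorizer(min_df=2).fit(toy_docs).vocabulary_
print(sorted(vocab_all))     # every word is kept
print(sorted(vocab_common))  # only "python" appears in at least 2 documents

# --- l1 vs l2: count coefficients that are exactly zero ---
# Synthetic data: 50 features, only 5 of them informative.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=50, n_informative=5, random_state=42
)
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.1)
l1_model.fit(X_demo, y_demo)
l2_model.fit(X_demo, y_demo)

n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
print("zero coefficients with l1:", n_zero_l1)
print("zero coefficients with l2:", n_zero_l2)
```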

In [5]:
gs_Logreg = GridSearchCV(pipe,param_grid=params)

In [6]:
gs_Logreg.fit(X_train['Total_text'],y_train['subreddit'])

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('tfidif', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'tfidif__stop_words': [['bill', 'however', 'move', 'whoever', 'becoming', 'detail', 'here', 'she', 'its', 'forty', 'sometimes', 'yours', 'describe', 'seem', 'thereby', 'may', 'and', 'these', 'along', 'still', 'yourselves', 'three', 'whether', 'none', 'themselves', 'first', 'two', 'before... 'us', 'few', 'her', 'com', 've', 'just']], 'tfidif__min_df': [5], 'Log_reg__penalty': ['l1', 'l2']},
       pre_dispatch='2*n_jobs', refit=True, scoring=N

In [7]:
gs_Logreg.best_params_['Log_reg__penalty']

'l2'

The penalty that performed best in the logistic regression grid search was l2.
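
To see how much better one setting did than the other, the fitted search object also exposes `best_score_` and `cv_results_`. The sketch below runs a small search on synthetic data so that it is self-contained; `demo_search`, the dataset, and `cv=3` are my assumptions and do not reproduce this notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=42)

demo_search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"penalty": ["l1", "l2"]},
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# The winning combination and its mean cross-validated accuracy.
print(demo_search.best_params_)
print(demo_search.best_score_)

# Mean cross-validated accuracy for every combination that was tried.
for candidate, mean_score in zip(
    demo_search.cv_results_["params"],
    demo_search.cv_results_["mean_test_score"],
):
    print(candidate, round(float(mean_score), 3))
```

The same attributes are available on `gs_Logreg` after it has been fit.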

###### Saving our Logistic Regression Model to be evaluated later

In [8]:
with open('../Assets/log_reg.pkl','wb') as f:
    pickle.dump(gs_Logreg,f)
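
For the later evaluation, the saved object can be read back with `pickle.load`. This is a minimal sketch of the round trip using a small stand-in model and a temporary file; `demo_model` and `demo_path` are hypothetical names, and the real notebook would open `../Assets/log_reg.pkl` in `'rb'` mode instead.

```python
import os
import pickle
import tempfile

from sklearn.linear_model import LogisticRegression

# Stand-in for the fitted grid search; any picklable estimator behaves the same way.
demo_model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.pkl")

    # Save the fitted model.
    with open(demo_path, "wb") as f:
        pickle.dump(demo_model, f)

    # Load it back, as an evaluation notebook would.
    with open(demo_path, "rb") as f:
        restored_model = pickle.load(f)

# The restored model should predict exactly like the original.
new_points = [[0.5], [2.5]]
same_predictions = bool(
    (restored_model.predict(new_points) == demo_model.predict(new_points)).all()
)
print(same_predictions)
```

A pickled model should be loaded with the same scikit-learn version that saved it, and only from files I trust.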