# Basic Model with Logistic Regression & Multinomial Naive Bayes

---

In this notebook, our main purpose is to build our first models using the `both_subreddits.csv` dataset that we cleaned and explored in our [EDA](http://localhost:8888/lab/tree/projects/project_3/EDA.ipynb) notebook.  The models and text-processing tools that we will touch on are below:

- Logistic Regression
- Count Vectorizer
- Tfidf Vectorizer
- Multinomial Naive Bayes
- Porter Stemmer

In [130]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, plot_confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from nltk.stem.porter import PorterStemmer

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [60]:
subreddits = pd.read_csv('./data/both_subreddits.csv')
subreddits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   author               10000 non-null  object
 1   created_utc          10000 non-null  int64 
 2   selftext             10000 non-null  object
 3   subreddit            10000 non-null  object
 4   title                10000 non-null  object
 5   title_word_count     10000 non-null  int64 
 6   selftext_word_count  10000 non-null  int64 
dtypes: int64(3), object(4)
memory usage: 547.0+ KB


Before we get into modeling, let's first encode the two subreddits as a binary target.

- `r/explainlikeimfive` = 1
- `r/Advice` = 0

In [78]:
subreddits['subreddit'] = subreddits['subreddit'].map({'explainlikeimfive': 1, 'Advice': 0})
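`Series.map` with a dictionary returns `NaN` for any value that is not a key in the dictionary, so a quick check after the mapping can catch a misspelled subreddit name. Below is a minimal sketch on a toy frame; the `demo` frame and its `'askscience'` row are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the real data; 'askscience' is deliberately absent from the mapping
demo = pd.DataFrame({'subreddit': ['explainlikeimfive', 'Advice', 'askscience']})

labels = demo['subreddit'].map({'explainlikeimfive': 1, 'Advice': 0})

# Values missing from the dictionary become NaN instead of raising an error
n_unmapped = int(labels.isna().sum())
print(n_unmapped)
```

If `n_unmapped` is anything other than zero on the real column, the dictionary keys are worth a second look before modeling.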

Using a Porter stemmer to reduce each word to its stem (a crude root form, which is not necessarily a dictionary _lemma_).  Code based on a [StackOverflow](https://stackoverflow.com/questions/36182502/add-stemming-support-to-countvectorizer-sklearn) question.  This will be used in the Pipeline section.

In [148]:
stemmer = PorterStemmer()
analyzer = CountVectorizer().build_analyzer()

In [149]:
def porter(text):
    return (stemmer.stem(w) for w in analyzer(text))
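One detail worth knowing before this function goes into a pipeline: when `CountVectorizer` receives a callable `analyzer`, its `stop_words` setting is not applied (the scikit-learn docs state that it only applies when `analyzer == 'word'`). The sketch below illustrates this with a hypothetical pass-through analyzer in place of the stemming one, so it runs without NLTK, and shows one possible workaround: building the inner analyzer from a vectorizer that already carries the stop words.

```python
import warnings
from sklearn.feature_extraction.text import CountVectorizer

docs = ['ELI5: why is the sky blue', 'How do I ask my boss for a raise']

# Hypothetical pass-through analyzer standing in for the stemming one
default_analyzer = CountVectorizer().build_analyzer()

def passthrough(doc):
    return default_analyzer(doc)

with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # newer releases warn about the unused parameter
    ignored = CountVectorizer(analyzer=passthrough, stop_words=['eli5']).fit(docs)

# Workaround: let the inner analyzer do the stop-word filtering itself
filtering_analyzer = CountVectorizer(stop_words=['eli5']).build_analyzer()

def filtered(doc):
    return filtering_analyzer(doc)

respected = CountVectorizer(analyzer=filtered).fit(docs)

eli5_kept = 'eli5' in ignored.vocabulary_          # stop_words had no effect
eli5_removed = 'eli5' not in respected.vocabulary_  # filtered inside the analyzer
print(eli5_kept, eli5_removed)
```

The same idea would carry over to the stemming analyzer by wrapping `filtering_analyzer` with `stemmer.stem`, under the assumption that stop words should be removed before stemming.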

## Train, Test, Split

In [136]:
X = subreddits['title']
y = subreddits['subreddit']

In [137]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=777)

In [138]:
X_train.shape, y_train.shape

((7500,), (7500,))

In [139]:
X_test.shape, y_test.shape

((2500,), (2500,))

In [141]:
y_test.head()

5756    0
2721    1
5955    0
4972    1
5897    0
Name: subreddit, dtype: int64

## Preprocessing

This is where I will use `CountVectorizer` to:

- get rid of stopwords
- get rid of 'ELI5:' which is how `r/explainlikeimfive` users like to format their prompts
- create vectors for words that I will be using

In [142]:
y_test.value_counts(normalize=True)

0    0.5012
1    0.4988
Name: subreddit, dtype: float64
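The random split here happens to land close to 50/50. If an exact class ratio in both splits matters, `train_test_split` accepts a `stratify` argument. This is a small sketch on synthetic labels; the `X_demo` and `y_demo` arrays are illustrative and not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, perfectly balanced labels standing in for the subreddit column
y_demo = np.array([0, 1] * 500)
X_demo = np.arange(len(y_demo))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=777, stratify=y_demo
)

# With stratify, the test split keeps the class ratio of the full data
test_share_of_ones = y_te.mean()
print(len(y_te), test_share_of_ones)
```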

**Adding 'eli5' as a stop word so the model cannot lean on the `ELI5:` prefix to tell the two subreddits apart:**

In [143]:
from sklearn.feature_extraction import text

In [144]:
# original code:  https://stackoverflow.com/questions/24386489/adding-words-to-scikit-learns-countvectorizers-stop-list

stop_words = text.ENGLISH_STOP_WORDS.union(['eli5'])
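To see what the extended stop list does, the sketch below fits a vectorizer on a single made-up title. It converts the set to a sorted list first, on the assumption that some newer scikit-learn releases expect `stop_words` to be a list rather than a `frozenset`; the variable names are illustrative.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Same union as in the cell above, converted to a list for portability
custom_stop_words = sorted(text.ENGLISH_STOP_WORDS.union(['eli5']))

vec = CountVectorizer(stop_words=custom_stop_words)
vec.fit(['ELI5: why is the sky blue'])

# Only tokens that are not on the stop list make it into the vocabulary
remaining = sorted(vec.vocabulary_)
print(remaining)
```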

## Fitting a Pipeline

**...with `CountVectorizer()`:**

In [145]:
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('logreg', LogisticRegression())
])

In [45]:
pipe_params = {
    'cvec__stop_words': [stop_words, ['eli5']],
    'cvec__max_df': [1.0, 0.80, 0.90],
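    # Note: the recorded output below (36 candidates, best max_features=7000) implies
    # this list held three values, including 7000, when the cell was last run
    'cvec__max_features': [9000],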
    'cvec__ngram_range': [(1, 1), (1, 2)]
}

gs = GridSearchCV(pipe,
                  param_grid = pipe_params,
                  cv=5,
                  verbose = 1)

gs.fit(X_train, y_train)

print(f'Best cross validation score: {gs.best_score_}')
print(f'Best parameters to use: {gs.best_params_}')
print(f'Testing score: {gs.score(X_test, y_test)}')

Fitting 5 folds for each of 36 candidates, totalling 180 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 180 out of 180 | elapsed:   41.1s finished


Best cross validation score: 0.9212
Best parameters to use: {'cvec__max_df': 1.0, 'cvec__max_features': 7000, 'cvec__ngram_range': (1, 2), 'cvec__stop_words': ['eli5']}
Testing score: 0.9236


**...with `CountVectorizer()` and `PorterStemmer()`:**

In [150]:
pipe = Pipeline([
    ('cvec', CountVectorizer(analyzer=porter)),
    ('logreg', LogisticRegression())
])

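# The X_train used here was just the title column
# Note: with a callable analyzer, CountVectorizer does not apply stop_words
# or ngram_range, so those grid entries below have no effect on this pipeline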

In [152]:
pipe_params = {
    'cvec__stop_words': [stop_words, ['eli5']],
    'cvec__max_features': [10_000, 12_000],
    'cvec__ngram_range': [(1, 1), (1, 2)]
}

gs = GridSearchCV(pipe,
                  param_grid = pipe_params,
                  cv=5,
                  verbose = 1)

gs.fit(X_train, y_train)

print(f'Best cross validation score: {gs.best_score_}')
print(f'Best parameters to use: {gs.best_params_}')
print(f'Testing score: {gs.score(X_test, y_test)}')

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed:  1.0min finished


Best cross validation score: 0.9888
Best parameters to use: {'cvec__max_features': 10000, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': frozenset({'until', 'there', 'although', 'am', 're', ... [output truncated]

**...with `TfidfVectorizer`:**

In [46]:
pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('logreg', LogisticRegression())
])

In [47]:
# BASE MODEL WITH ONLY STOP_WORDS PUT IN

pipe_params = {
    'tfidf__stop_words': [stop_words, ['eli5']],
}

gs = GridSearchCV(pipe,
                  param_grid = pipe_params,
                  cv=5,
                  verbose = 1)

gs.fit(X_train, y_train)

print(f'Best cross validation score: {gs.best_score_}')
print(f'Best parameters to use: {gs.best_params_}')
print(f'Testing score: {gs.score(X_test, y_test)}')

Fitting 5 folds for each of 2 candidates, totalling 10 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Best cross validation score: 0.9102666666666666
Best parameters to use: {'tfidf__stop_words': ['eli5']}
Testing score: 0.9128


[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    1.4s finished


In [48]:
pipe_params = {
    'tfidf__stop_words': [stop_words, ['eli5']],
    'tfidf__max_df': [1.0, 0.80, 0.90],
    'tfidf__max_features': [3000, 5000, 7000],
    'tfidf__ngram_range': [(1, 1), (1, 2)]
}

gs = GridSearchCV(pipe,
                  param_grid = pipe_params,
                  cv=5,
                  verbose = 1)

gs.fit(X_train, y_train)

print(f'Best cross validation score: {gs.best_score_}')
print(f'Best parameters to use: {gs.best_params_}')
print(f'Testing score: {gs.score(X_test, y_test)}')

Fitting 5 folds for each of 36 candidates, totalling 180 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 180 out of 180 | elapsed:   33.8s finished


Best cross validation score: 0.9209333333333334
Best parameters to use: {'tfidf__max_df': 1.0, 'tfidf__max_features': 5000, 'tfidf__ngram_range': (1, 2), 'tfidf__stop_words': ['eli5']}
Testing score: 0.9176
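The "36 candidates, totalling 180 fits" line follows directly from the grid: every combination of parameter values is one candidate, and each candidate is fit once per fold. A short sketch with `ParameterGrid` reproduces the arithmetic; the two strings are placeholders for the two stop-word options so the example stays self-contained.

```python
from sklearn.model_selection import ParameterGrid

# Same shape as the TfidfVectorizer grid above; the strings stand in for
# the two stop-word options (full list plus 'eli5', and 'eli5' only)
grid = ParameterGrid({
    'tfidf__stop_words': ['english_plus_eli5', 'eli5_only'],
    'tfidf__max_df': [1.0, 0.80, 0.90],
    'tfidf__max_features': [3000, 5000, 7000],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
})

n_candidates = len(grid)     # 2 * 3 * 3 * 2
n_fits = n_candidates * 5    # cv=5 folds per candidate
print(n_candidates, n_fits)  # 36 180
```

Adding one more value to any list multiplies the number of fits, which is why the larger grids take noticeably longer.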


## Multinomial Naive Bayes and `TfidfVectorizer`

In [52]:
pipe_multi = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB())
])

In [57]:
params = {
    'tfidf__stop_words': [stop_words, ['eli5']],
    'tfidf__max_features': [10500, 11000, 11300],
    'tfidf__ngram_range': [(1, 1), (1, 2)]
}

gs = GridSearchCV(pipe_multi,
                  param_grid = params,
                  cv=5,
                  verbose = 1)

gs.fit(X_train, y_train)

print(f'Best cross validation score: {gs.best_score_}')
print(f'Best parameters to use: {gs.best_params_}')
print(f'Testing score: {gs.score(X_test, y_test)}')

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:   10.4s finished


Best cross validation score: 0.9233333333333335
Best parameters to use: {'tfidf__max_features': 11000, 'tfidf__ngram_range': (1, 2), 'tfidf__stop_words': ['eli5']}
Testing score: 0.9244


## In Conclusion:

I am surprised that the best parameters keep the common English words and drop only `eli5`; removing the full stop-word list gave a slightly worse result.  Perhaps the model learns on its own that words appearing across the entire dataset carry little signal.
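For the `TfidfVectorizer` pipelines specifically, there is a concrete mechanism that could explain part of this: the inverse document frequency term gives a lower weight to words that appear in most documents. The sketch below shows it on a tiny made-up corpus; it is only an illustration and does not explain the `CountVectorizer` result.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus: 'the' appears in every document, 'sky' in one
docs = [
    'the sky is blue',
    'the boss gave me the raise',
    'the landlord kept the deposit',
]

tfidf = TfidfVectorizer().fit(docs)

# idf_ holds one weight per vocabulary entry; common terms get smaller weights
idf_the = tfidf.idf_[tfidf.vocabulary_['the']]
idf_sky = tfidf.idf_[tfidf.vocabulary_['sky']]
print(idf_the, idf_sky)
```

With the default smoothing, a word present in every document gets the minimum weight of 1.0, while rarer words are weighted higher.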

In conclusion, here are the best parameters for `CountVectorizer()`:

- `'cvec__max_df': 1.0`
- `'cvec__max_features': 7000`
- `'cvec__ngram_range': (1, 2)`
- `'cvec__stop_words': ['eli5']`

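to yield a cross-validation score of 0.9212 on the training data and **0.9236** on the **testing** data.  The two scores are close, which suggests the model is not overfit, and both sit well above the roughly 50% baseline, so it is not badly underfit either.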

Here are the best parameters for `TfidfVectorizer()`:

- `'tfidf__max_df': 1.0`
- `'tfidf__max_features': 5000`
- `'tfidf__ngram_range': (1, 2)`
- `'tfidf__stop_words': ['eli5']`

to yield a cross-validation score of 0.9209 on the training data and **0.9176** on the **testing** data.

## A second attempt with the `text` column:

### Train, Test, Split

In [87]:
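# Assumption: 'text' is not in the CSV columns listed above; here it is built as title + selftext
subreddits['text'] = subreddits['title'] + ' ' + subreddits['selftext']
X = subreddits['text']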
y = subreddits['subreddit']

In [88]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=777)

In [89]:
X_train.shape, y_train.shape

((7500,), (7500,))

In [90]:
X_test.shape, y_test.shape

((2500,), (2500,))

In [96]:
y_test.value_counts(normalize=True)

0    0.5012
1    0.4988
Name: subreddit, dtype: float64

### Pipeline

In [97]:
pipe_multi = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB())
])

In [98]:
params = {
    'tfidf__stop_words': [stop_words, ['eli5']],
    'tfidf__max_features': [10500, 11000, 11300],
    'tfidf__ngram_range': [(1, 1), (1, 2)]
}

gs = GridSearchCV(pipe_multi,
                  param_grid = params,
                  cv=5,
                  verbose = 1)

gs.fit(X_train, y_train)

print(f'Best cross validation score: {gs.best_score_}')
print(f'Best parameters to use: {gs.best_params_}')
print(f'Testing score: {gs.score(X_test, y_test)}')

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:   57.9s finished


Best cross validation score: 0.9507999999999999
Best parameters to use: {'tfidf__max_features': 11300, 'tfidf__ngram_range': (1, 2), 'tfidf__stop_words': ['eli5']}
Testing score: 0.9528


## Confusion Matrix

In [153]:
# Get predictions for confusion matrix
predictions = gs.predict(X_test)
predictions

array([0, 1, 0, ..., 0, 0, 1])

In [154]:
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()

In [155]:
# Calculating specificity (true negative rate); sensitivity would be tp / (tp + fn)
tn / (tn + fp)

0.9992019154030327
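The ratio `tn / (tn + fp)` is the true negative rate (specificity), which measures how well class 0 is recovered; sensitivity is the matching ratio for class 1. A small sketch on toy labels (made up for illustration) computes both from the same confusion matrix.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # true positive rate: recall on class 1
specificity = tn / (tn + fp)   # true negative rate: recall on class 0
print(sensitivity, specificity)  # 0.5 0.75
```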

In [1]:
# code adapted from NLP II lesson; plot_confusion_matrix was removed in scikit-learn 1.2
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(gs, X_test, y_test, cmap='Greens', values_format='d');

## In conclusion:

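Bringing in `selftext` and using a `PorterStemmer` made a big difference in testing scores.  One caution on the stemmed model: because `CountVectorizer` is given a callable `analyzer`, it does not apply `stop_words`, so the `eli5` token likely stayed in the vocabulary and may account for part of the jump.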

Best parameters for `CountVectorizer` and `LogisticRegression` with a `PorterStemmer` are:

- `cvec__max_features`: `10000`
- `cvec__stop_words`: the `stop_words` list
- `cvec__ngram_range`: `(1, 1)`

yielding a cross-validation score of 0.9888 and a **testing score** of **0.9892**.

Best parameters for `TfidfVectorizer` and `MultinomialNB` are:

- `tfidf__stop_words`:  `['eli5']`
- `tfidf__max_features`:  `11300`
- `tfidf__ngram_range`:  `(1, 2)`

yielding a cross-validation score of 0.9508 and a **testing score** of **0.9528**.