### Quick Recap:
In the previous notebook, we tested and evaluated the Logistic Regression model. We first fit a naive Logistic Regression model with mostly default parameters and, unsurprisingly, got a mediocre accuracy score. Afterwards, we created a pipeline consisting of two stages:
* An instance of CountVectorizer
* A LogisticRegression instance

By grid searching for the optimal set of hyperparameters for our CountVectorizer, we got a train accuracy score of 0.886 and a test accuracy score of 0.765, a big improvement over the naive Logistic Regression model's score.
From this model, we extracted the top 20 most important features:
* blinks, yg, vip, dddd, square, cf, ga, wig, whistle, scandal, queue, axe, d4, chanel, revolution, teddy, area, queens, diaries, aiiy

Confusion Matrix (shows how our model performed):
* True Negatives: 1795
* False Positives: 585
* False Negatives: 515
* True Positives: 1789
 

### Next Steps: 

In this notebook, we will begin by identifying which Naive Bayes model to use (Bernoulli, Multinomial, or Gaussian). We will split the data into training and testing sets, then fit and evaluate a Naive Bayes model in the hope of identifying a production algorithm. We will compare this model's score not only to the baseline accuracy but also to the Logistic Regression score. Throughout the notebook, we will explain how the model works and evaluate its results.

In [236]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

from sklearn.pipeline import Pipeline

%matplotlib inline


In [237]:
!ls

01_Push_Shift_API.ipynb      06_Random_Forest_Model.ipynb
02_Cleaning_EDA.ipynb        Untitled Folder
03_Preprocessing.ipynb       assets
04_LogReg_Model.ipynb        data
05_Naive_Bayes_Model.ipynb


In [238]:
df = pd.read_csv('./data/bp_bts_df_clean.csv')
df.head()

Unnamed: 0,body,blackpink,char_count,word_count
0,this is something i can get behind and appreci...,1,1043,188
1,hold the fuck up rock songs i m considering...,1,75,12
2,what time and date is this in pdt 0 am is a b...,1,106,27
3,is there a list last year i remember they str...,1,84,16
4,as a blink i ll wait till mv dropped then i l...,1,149,33


## Baseline Accuracy:

Our baseline model accuracy is still 51%. Once again, this means we will be correct 51% of the time if we always predict that a comment comes from the majority-class subreddit, which in this case is the 'bangtan' subreddit.

In [239]:
round(df.blackpink.value_counts(normalize=True).max(),2)

0.51

Similar to the previous notebook, we will use the CountVectorizer tool. This time, however, we mainly care about the maximum count of each of the (up to) 6,500 features it extracts from the text documents, since those counts help us decide which Naive Bayes model fits the data.

In [240]:
X = df['body']
y = df['blackpink']

In [241]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.25,
                                                    random_state=42,
                                                    stratify=y)
# stratify = y --> keeps the proportion of the class the same in the train test split
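
To illustrate what `stratify=y` does, here is a minimal sketch on made-up labels (the 60/40 toy split and the names `X_toy` and `y_toy` are assumptions for illustration, not our real data). It checks that the share of class 1 is the same in the train and test portions.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 60 comments from class 0 and 40 from class 1.
y_toy = [0] * 60 + [1] * 40
X_toy = list(range(len(y_toy)))  # placeholder "documents"

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# Share of class 1 in each portion; stratifying keeps both at 40%.
train_share = Counter(y_tr)[1] / len(y_tr)
test_share = Counter(y_te)[1] / len(y_te)
print(train_share, test_share)
```

Without `stratify`, the two shares would generally drift apart by chance, which matters more when the classes are less balanced than ours.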

In [242]:
with open('./assets/stopwords.pkl','rb') as f:
    stopwords = pickle.load(f)

In [243]:
# Instantiate our CountVectorizer.
cvec = CountVectorizer(analyzer = "word",
                      stop_words = stopwords,
                      max_features = 6500,
                      min_df=2,
                      max_df=0.98)

X_train_cvec = cvec.fit_transform(X_train)
X_test_cvec = cvec.transform(X_test)

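# fit_transform returns a sparse matrix, so densify it before wrapping it in a DataFrame.
# Note: in scikit-learn >= 1.0, use cvec.get_feature_names_out() instead of get_feature_names().
train_cvec_df = pd.DataFrame(X_train_cvec.todense(),
                             columns = cvec.get_feature_names())
test_cvec_df = pd.DataFrame(X_test_cvec.todense(),
                            columns = cvec.get_feature_names())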

train_cvec_df.describe()

Unnamed: 0,00,000,01,02,04,05,06,07,09,0am,...,yup,yymmdd,zealand,zedd,zelle,zero,zeus,zodiac,zone,zones
count,14052.0,14052.0,14052.0,14052.0,14052.0,14052.0,14052.0,14052.0,14052.0,14052.0,...,14052.0,14052.0,14052.0,14052.0,14052.0,14052.0,14052.0,14052.0,14052.0,14052.0
mean,0.001566,0.001067,0.000427,0.000427,0.000427,0.001566,0.000427,0.000427,0.000285,0.000498,...,0.001566,0.000498,0.000285,0.000285,0.000285,0.000712,0.000498,0.000213,0.00121,0.000427
std,0.049166,0.032656,0.023857,0.023857,0.02066,0.068518,0.026674,0.02066,0.020662,0.025304,...,0.039538,0.025304,0.01687,0.01687,0.01687,0.026668,0.027975,0.018863,0.04044,0.023857
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,2.0,1.0,2.0,2.0,1.0,5.0,2.0,1.0,2.0,2.0,...,1.0,2.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0


##  Naive Bayes

When deciding which of the three Naive Bayes models to use, it is good practice to look at summary statistics of the dataframe after converting the collection of text documents into a matrix of token counts. We can use the maximum count of each word to help choose the model. In column '05', the maximum count is 5, and for most other columns the maximum is either 1 or 2.

Picking the best one of the 3 Naive Bayes model choices:
    
- BernoulliNB is best when we have 0/1 counts in all columns of X. (a.k.a. dummy variables)
- GaussianNB is best when the columns of X are Normally distributed. (Or whenever BernoulliNB and MultinomialNB are inappropriate.)
- **The columns of X are all integer counts, so MultinomialNB is the best choice here.**
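
The rule of thumb above can be sketched as a small check on the feature matrix. The helper `suggest_nb` and the toy comments are hypothetical, written only to illustrate the reasoning; this is not a scikit-learn function.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy comments, just to get a small count matrix.
docs = ["lisa lisa stage", "army stage", "lisa army army army"]
counts = CountVectorizer().fit_transform(docs).toarray()

def suggest_nb(X):
    """Rule-of-thumb mapping from feature values to a Naive Bayes variant."""
    X = np.asarray(X)
    if np.isin(X, [0, 1]).all():
        return "BernoulliNB"       # every column is a 0/1 indicator
    if (X >= 0).all() and np.equal(np.mod(X, 1), 0).all():
        return "MultinomialNB"     # non-negative integer counts
    return "GaussianNB"            # anything else (e.g. real-valued features)

print(counts.max(axis=0))                     # per-word maximum counts
print(suggest_nb(counts))                     # counts above 1 are present
print(suggest_nb((counts > 0).astype(int)))   # presence/absence only
```

This is only a heuristic. The scikit-learn documentation notes that MultinomialNB can also work in practice with fractional counts such as TF-IDF weights, even though the multinomial distribution formally expects integer counts.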

Naive Bayes assumes that the independent variables (features) are conditionally independent of one another given the class, which greatly simplifies the conditional probabilities in Bayes' theorem. Though this assumption is especially unrealistic for text data, the model still produces surprisingly accurate predictions.
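
As a sketch of what that assumption buys us, write $y$ for the subreddit label and $x_1, \dots, x_n$ for the word features of a comment (these symbols are our own notation, introduced here for illustration). Under the independence assumption, the posterior factorizes into one term per word, and the model predicts the class with the largest product:

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

Without the assumption we would need the joint probability $P(x_1, \dots, x_n \mid y)$ over thousands of words, which is not feasible to estimate from about 14,000 training comments.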

In [244]:
# Instantiate our Multinomial Naive Bayes' model!
nb =  MultinomialNB()

In [245]:
# Fit our model!
nb.fit(train_cvec_df, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [246]:
# Generate our predictions!
pred = nb.predict(test_cvec_df)

In [247]:
# Score our model on the training set.

nb.score(train_cvec_df, y_train)

0.8179618559635639

In [248]:
nb.score(test_cvec_df, y_test)

0.764730999146029

Naive Bayes models are commonly paired with TF-IDF feature extraction (TfidfVectorizer). The TF-IDF score tells us which words are most discriminating: if a word appears frequently in a given comment but rarely across the overall collection of comments, its TF-IDF score for that comment is high.
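
To make this concrete, here is a minimal sketch on three made-up comments (the toy corpus and the variable names are assumptions for illustration). The word "stage" appears in every comment, while "lisa" appears in only one, so "lisa" should receive the larger weight in the first comment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "stage" is in every comment, "lisa" in only one.
docs = ["lisa stage", "stage comeback", "stage album"]

tfidf_toy = TfidfVectorizer()
weights = tfidf_toy.fit_transform(docs).toarray()
vocab = tfidf_toy.vocabulary_  # maps each word to its column index

# Inverse document frequency: higher for rarer words.
idf_lisa = tfidf_toy.idf_[vocab["lisa"]]
idf_stage = tfidf_toy.idf_[vocab["stage"]]

# TF-IDF weights of the two words within the first comment.
w_lisa = weights[0, vocab["lisa"]]
w_stage = weights[0, vocab["stage"]]
print(idf_lisa > idf_stage, w_lisa > w_stage)
```

Both words occur once in the first comment, so the difference in weight comes entirely from how rare each word is across the collection.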

In [249]:
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word', stop_words=stopwords)),
    ('nb', MultinomialNB())
    
])

In [218]:
params = {
    'tfidf__strip_accents': ['unicode', None],
    'tfidf__ngram_range' : [(1,1)],
    'tfidf__max_df' : [.95, .98, 1.0],
    'tfidf__min_df' : [1, 2, 5],
    'tfidf__max_features': [3000, 5000, 6500]  # vocabulary sizes to try

}
gs = GridSearchCV(pipe, param_grid=params, cv=3)   # 3-fold CV keeps the run time manageable
gs.fit(X_train, y_train)  # cross-validates every combination, then refits the best one on all of X_train
print(gs.best_score_)
print(gs.best_params_)

# Train score
print(f'Train accuracy score: {gs.score(X_train, y_train)}')

# Test score
print(f'Test accuracy score: {gs.score(X_test, y_test)}')

0.7581838884144606
{'tfidf__max_df': 0.95, 'tfidf__max_features': 6500, 'tfidf__min_df': 1, 'tfidf__ngram_range': (1, 1), 'tfidf__strip_accents': 'unicode'}
Train accuracy score: 0.8437944776544264
Test accuracy score: 0.767933390264731


Based on the grid search for our Multinomial Naive Bayes model, the best settings were 6,500 max features, unigrams (the only ngram_range we offered), a min_df of 1 (a word only needs to appear in one comment), and a max_df of 0.95. Though our training score increased a bit (from 0.818 to 0.844), the testing score barely changed (from 0.765 to 0.768). Therefore, it seems unnecessary to invest more time/money into grid searching across further sets of hyperparameters for so little gain.
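
To put a rough number on that cost, the sketch below counts how many models the grid search above has to fit. It uses scikit-learn's `ParameterGrid` with the same grid and `cv=3`; the variable names are our own.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above.
params = {
    'tfidf__strip_accents': ['unicode', None],
    'tfidf__ngram_range': [(1, 1)],
    'tfidf__max_df': [.95, .98, 1.0],
    'tfidf__min_df': [1, 2, 5],
    'tfidf__max_features': [3000, 5000, 6500],
}

n_candidates = len(ParameterGrid(params))  # 2 * 1 * 3 * 3 * 3 combinations
cv_folds = 3
n_fits = n_candidates * cv_folds           # one fit per combination per fold

print(n_candidates, n_fits)  # 54 162
```

That is 162 pipeline fits (plus one final refit on the full training set) for a change in test accuracy of about 0.003.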

In [253]:
tfidf = TfidfVectorizer(analyzer = "word",
                        strip_accents = 'unicode',
                      stop_words = stopwords,
                      max_features = 6500,
                      max_df=0.95,
                       min_df = 1)

In [254]:
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

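# fit_transform returns a sparse matrix, so densify it before wrapping it in a DataFrame.
# Note: in scikit-learn >= 1.0, use tfidf.get_feature_names_out() instead of get_feature_names().
train_tfidf_df = pd.DataFrame(X_train_tfidf.todense(),
                              columns = tfidf.get_feature_names())
test_tfidf_df = pd.DataFrame(X_test_tfidf.todense(),
                             columns = tfidf.get_feature_names())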

In [255]:
test_tfidf_df.head()

Unnamed: 0,00,000,01,02,04,05,06,07,09,0am,...,yymmdd,zealand,zedd,zelle,zero,zeus,zhou,zodiac,zone,zones
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [256]:
nb.fit(train_tfidf_df, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [257]:
nb.score(train_tfidf_df, y_train)

0.8437944776544264

In [258]:
nb.score(test_tfidf_df, y_test)

0.767933390264731

### Confusion Matrix

We will implement the confusion matrix in order to help us visualize how our model performs. 
We will use the fitted model to predict on the testing set.

In [259]:
pred = nb.predict(test_tfidf_df)
pred

array([1, 0, 1, ..., 0, 1, 1])

In [260]:
confusion_matrix(y_test, pred)

array([[1928,  452],
       [ 635, 1669]])

In [261]:
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()

print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 1928
False Positives: 452
False Negatives: 635
True Positives: 1669


What does a **true positive** mean here?:

- True positives are comments we correctly predict to be positive.
- In this case, since Blackpink = 1 (Bts=0), a true positive means the model correctly predicted 1,669 comments to be from the Blackpink subreddit.

---

What does a **true negative** mean here?:

- True negatives are comments we correctly predict to be negative.
- In this case, since Blackpink = 1 (Bts = 0), a true negative means the model correctly predicted 1,928 comments to be from the Bts subreddit.

---

What does a **false positive** mean here?:

- False positives are comments we falsely predict to be positive.
- In this case, since Blackpink = 1 (Bts = 0), a false positive means the model incorrectly predicted 452 comments to be from the Blackpink subreddit (when they are really from the Bts subreddit).

---

What does a **false negative** mean here?:

- False negatives are comments we falsely predict to be negative.
- In this case, since Blackpink = 1 (Bts=0), a false negative means the model incorrectly predicted 635 comments to be from the Bts subreddit (when they are really from the Blackpink subreddit).
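
The four counts above can also be summarized as rates. The sketch below hard-codes the counts from our confusion matrix and applies the standard definitions of accuracy, precision, recall, and specificity. These metrics are not computed elsewhere in this notebook, so treat it as a supplementary illustration.

```python
# Counts copied from the confusion matrix above (Blackpink = 1, Bts = 0).
tn, fp, fn, tp = 1928, 452, 635, 1669

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all comments classified correctly
precision = tp / (tp + fp)     # of predicted Blackpink comments, share that really are
recall = tp / (tp + fn)        # of actual Blackpink comments, share we caught
specificity = tn / (tn + fp)   # of actual Bts comments, share we caught

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f}")
```

The accuracy computed this way matches the test score of about 0.768 reported above. Specificity is higher than recall, which reflects that this model misses more Blackpink comments (635) than Bts comments (452).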


## **Please continue to Notebook 06: Random Forest Model**