## Week 5, Lab 2: Tackling an NLP Problem with Naive Bayes
> Author: Matt Brems

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we are going to apply a **new** modeling technique to natural language processing data.

> "But how can we apply a modeling technique we haven't learned?!"

The DSI program is great - but we can't teach you *everything* about data science in 12 weeks! This lab is designed to help you start learning something new without it being taught in a formal lesson. 
- Later in the cohort (like for your capstone!), you'll be exploring models, libraries, and resources that you haven't been explicitly taught.
- After the program, you'll want to continue developing your skills. Being comfortable with documentation and being confident in your ability to read something new and decide whether or not it is an appropriate method for the problem you're trying to solve is **incredibly** valuable.

## Step 1: Define the problem.

Many organizations have a substantial interest in classifying users of their product into groups. Some examples:
- A company that serves as a marketplace may want to predict who is likely to purchase a certain type of product on their platform, like books, cars, or food.
- An application developer may want to identify which individuals are willing to pay money for "bonus features" or to upgrade their app.
- A social media organization may want to identify who generates the highest rate of content that later goes "viral."

### Summary
In this lab, you're an engineer for Facebook. In recent years, the organization Cambridge Analytica gained worldwide notoriety for its use of Facebook data in an attempt to sway electoral outcomes.

Cambridge Analytica, an organization staffed with lots of Ph.D. researchers, used the Big5 personality groupings (also called OCEAN) to group people into one of 32 different groups.
- The five qualities measured by this personality assessment are:
    - **O**penness
    - **C**onscientiousness
    - **E**xtroversion
    - **A**greeableness
    - **N**euroticism
- Each person could be classified as "Yes" or "No" for each of the five qualities.
- This makes for 32 different potential combinations of qualities. ($2^5 = 32$)
- You don't have to check it out, but if you want to learn more about this personality assessment, head to [the Wikipedia page](https://en.wikipedia.org/wiki/Big_Five_personality_traits).
- There's also [a short (3-4 pages) academic paper describing part of this approach](./celli-al_wcpr13.pdf).
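
The count of 32 groups can be checked by enumerating every yes/no combination of the five traits. This is only an illustrative sketch; the variable names are not from the original lab.

```python
from itertools import product

# The five Big5 (OCEAN) traits listed above.
traits = ["Openness", "Conscientiousness", "Extroversion",
          "Agreeableness", "Neuroticism"]

# Every person gets "y" or "n" for each trait, so a grouping is one
# element of the Cartesian product {"y", "n"}^5.
groupings = list(product("yn", repeat=len(traits)))

print(len(groupings))                    # number of distinct groupings
print(dict(zip(traits, groupings[0])))   # one example grouping
```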

Cambridge Analytica's methodology was, roughly, the following:
- Gather a large amount of data from Facebook.
- Use this data to predict an individual's Big5 personality "grouping."
- Design political advertisements that would be particularly effective to that particular "grouping." (For example, are certain advertisements particularly effective toward people with specific personality traits?)

You want to know the **real-world problem**: "Is what Cambridge Analytica attempted to do actually possible, or is it junk science?"

However, we'll solve the related **data science problem**: "Are one's Facebook statuses predictive of whether or not one is agreeable?"
> Note: If Facebook statuses aren't predictive of one being agreeable (one of the OCEAN qualities), then Cambridge Analytica's approach won't work very well!

## Step 2: Obtain the data.

Obviously, there are plenty of opportunities to discuss the ethics surrounding this particular issue... so let's do that.

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('./mypersonality_final.csv', encoding = 'ISO-8859-1')

In [3]:
data.head()

Unnamed: 0,#AUTHID,STATUS,sEXT,sNEU,sAGR,sCON,sOPN,cEXT,cNEU,cAGR,cCON,cOPN,DATE,NETWORKSIZE,BETWEENNESS,NBETWEENNESS,DENSITY,BROKERAGE,NBROKERAGE,TRANSITIVITY
0,b7b7764cfa1c523e4e93ab2a79a946c4,likes the sound of thunder.,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,06/19/09 03:21 PM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1
1,b7b7764cfa1c523e4e93ab2a79a946c4,is so sleepy it's not even funny that's she ca...,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,07/02/09 08:41 AM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1
2,b7b7764cfa1c523e4e93ab2a79a946c4,is sore and wants the knot of muscles at the b...,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,06/15/09 01:15 PM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1
3,b7b7764cfa1c523e4e93ab2a79a946c4,likes how the day sounds in this new song.,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,06/22/09 04:48 AM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1
4,b7b7764cfa1c523e4e93ab2a79a946c4,is home. <3,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,07/20/09 02:31 AM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1


### 1. What is the difference between anonymity and confidentiality? All else held equal, which tends to keep people safer?

Answer:
1. Anonymity means that no one, including the researcher, can learn the identity of the participants in the study. No personally identifying information is collected.
2. Confidentiality means that the researcher can identify the participants but takes measures to protect their personally identifying information, so that no one outside the study can identify them.

Anonymity tends to keep people safer.

### 2. Suppose that the "unique identifier" in the above data, the `#AUTHID`, is a randomly generated key so that it can never be connected back to the original poster. Have we guaranteed anonymity here? Why or why not?

Answer:

No. Anonymity is a strict standard. If any personally identifying information is collected (e.g., name, email, or phone number), which is sometimes unavoidable, anonymity cannot be guaranteed. A randomly generated `#AUTHID` is a typical confidentiality measure, not an anonymity guarantee.

### 3. As an engineer for Facebook, you recognize that user data will be used by Facebook and by other organizations - that won't change. However, what are at least three recommendations you would bring to your manager to improve how data is used and shared? Be as specific as you can.

Answer:
    
1. Obtain consent from participants. Tell them how their data is collected, stored, and used, and what measures protect their personal information.
2. Perform due diligence on outside organizations before sharing any data with them.
3. Agree on the privacy protocol and data-use restrictions with those organizations before participants' data is shared or used.

## Step 3: Explore the data.

- Note: For our $X$ variable, we will only use the `STATUS` variable. For our $Y$ variable, we will only use the `cAGR` variable.

### 4. Explore the data here.
> We aren't explicitly asking you to do specific EDA here, but what EDA would you generally do with this data? Do the EDA you usually would, especially if you know what the goal of this analysis is.

In [4]:
# check data size
data.shape

(9917, 20)

In [5]:
# data type and missing value
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9917 entries, 0 to 9916
Data columns (total 20 columns):
#AUTHID         9917 non-null object
STATUS          9917 non-null object
sEXT            9917 non-null float64
sNEU            9917 non-null float64
sAGR            9917 non-null float64
sCON            9917 non-null float64
sOPN            9917 non-null float64
cEXT            9917 non-null object
cNEU            9917 non-null object
cAGR            9917 non-null object
cCON            9917 non-null object
cOPN            9917 non-null object
DATE            9917 non-null object
NETWORKSIZE     9917 non-null float64
BETWEENNESS     9917 non-null float64
NBETWEENNESS    9917 non-null float64
DENSITY         9917 non-null float64
BROKERAGE       9917 non-null float64
NBROKERAGE      9917 non-null float64
TRANSITIVITY    9916 non-null float64
dtypes: float64(12), object(8)
memory usage: 1.5+ MB


In [6]:
# confirm missing entry
data.isna().sum()

#AUTHID         0
STATUS          0
sEXT            0
sNEU            0
sAGR            0
sCON            0
sOPN            0
cEXT            0
cNEU            0
cAGR            0
cCON            0
cOPN            0
DATE            0
NETWORKSIZE     0
BETWEENNESS     0
NBETWEENNESS    0
DENSITY         0
BROKERAGE       0
NBROKERAGE      0
TRANSITIVITY    1
dtype: int64

In [7]:
# TRANSITIVITY has one missing value. But this column is not used in this lab.
# Do nothing.

### 5. What is the difference between CountVectorizer and TFIDFVectorizer?

Answer:
1. **CountVectorizer**: uses a bag-of-words approach to represent each document by how many times each word appears in it.
2. **TFIDFVectorizer**: builds on CountVectorizer by weighting each word. The word's frequency in a document is multiplied by the **logarithm** of the ratio of **the total number of documents** to **the number of documents that include the word**. This weighting emphasizes words that appear in few documents (often the words that distinguish documents) and downplays words that appear in many documents (usually not very useful).
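
As a rough illustration of the two weightings, the sketch below computes raw counts and the textbook weight (term count times log of N over document frequency) by hand on a made-up three-document corpus. Note the assumption: scikit-learn's `TfidfVectorizer` uses a smoothed variant of the IDF term and normalizes each row by default, so its numbers will differ from this simplified formula.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not from the lab data).
docs = ["the cat sat", "the dog sat", "the cat ran home"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Bag-of-words counts: one Counter per document.
counts = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(word, doc_index):
    """Textbook TF-IDF: term count times log(N / document frequency)."""
    return counts[doc_index][word] * math.log(n_docs / doc_freq[word])

print(counts[2])          # raw counts for the third document
print(tfidf("the", 2))    # "the" is in every document, so its weight vanishes
print(tfidf("home", 2))   # "home" is in only one document, so it is weighted up
```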

### 6. What are stopwords?

Answer:

Stopwords are very common words (such as "the", "is", and "who") that appear frequently throughout a corpus and usually contribute little to training the model.

### 7. Give an example of when you might remove stopwords.

Answer:

If the model is trained on unigrams (1-grams) and the stopwords are known to contribute little to no information, they should be removed to improve the efficiency of the model.

Example:
1. The queen, who is young, is very wealthy!
2. The king, who is old, is not very wealthy!

In this case, "The", "who", and "is" will be treated as stopwords.
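
A minimal sketch of what removing those stopwords does to the two example sentences. The stopword set here is hypothetical and limited to the three words named above; real lists, such as scikit-learn's built-in `'english'` list, are much longer.

```python
import re

# Hypothetical stopword list containing only the three words named above.
stopwords = {"the", "who", "is"}

sentences = [
    "The queen, who is young, is very wealthy!",
    "The king, who is old, is not very wealthy!",
]

def remove_stopwords(text):
    """Lowercase, keep alphabetic tokens, and drop any stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords]

for s in sentences:
    print(remove_stopwords(s))
```

The words that distinguish the two sentences ("queen"/"king", "young"/"old", "not") survive, while the shared filler words are gone.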

### 8. Give an example of when you might keep stopwords in your model.

Answer:

If the model runs on bigram data and the stopwords play an important role (they affect the meaning), we should keep them.
Example phrases: "is not bad", "the best", "is not good". We may want to keep "is" and "the".

## Step 4: Model the data.

We are going to fit two types of models: a logistic regression and [a Naive Bayes classifier](https://scikit-learn.org/stable/modules/naive_bayes.html).

**Reminder:** We will only use the feature `STATUS` to model `cAGR`.

### We want to attempt to fit our models on sixteen sets of features:

1. CountVectorizer with 100 features, with English stopwords removed and with an `ngram_range` that includes 1 and 2.
2. CountVectorizer with 100 features, with English stopwords removed and with the default `ngram_range`.
3. CountVectorizer with 100 features, with English stopwords kept in and with an `ngram_range` that includes 1 and 2.
4. CountVectorizer with 100 features, with English stopwords kept in and with the default `ngram_range`.
5. CountVectorizer with 500 features, with English stopwords removed and with an `ngram_range` that includes 1 and 2.
6. CountVectorizer with 500 features, with English stopwords removed and with the default `ngram_range`.
7. CountVectorizer with 500 features, with English stopwords kept in and with an `ngram_range` that includes 1 and 2.
8. CountVectorizer with 500 features, with English stopwords kept in and with the default `ngram_range`.
9. TFIDFVectorizer with 100 features, with English stopwords removed and with an `ngram_range` that includes 1 and 2.
10. TFIDFVectorizer with 100 features, with English stopwords removed and with the default `ngram_range`.
11. TFIDFVectorizer with 100 features, with English stopwords kept in and with an `ngram_range` that includes 1 and 2.
12. TFIDFVectorizer with 100 features, with English stopwords kept in and with the default `ngram_range`.
13. TFIDFVectorizer with 500 features, with English stopwords removed and with an `ngram_range` that includes 1 and 2.
14. TFIDFVectorizer with 500 features, with English stopwords removed and with the default `ngram_range`.
15. TFIDFVectorizer with 500 features, with English stopwords kept in and with an `ngram_range` that includes 1 and 2.
16. TFIDFVectorizer with 500 features, with English stopwords kept in and with the default `ngram_range`.
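
The sixteen feature sets above are the Cartesian product of four two-way choices (2 × 2 × 2 × 2 = 16). The sketch below enumerates them in the same order as the list; the variable names are chosen for this illustration only.

```python
from itertools import product

vectorizers = ["CountVectorizer", "TFIDFVectorizer"]
max_features = [100, 500]
stop_words = ["english", None]      # removed vs. kept in
ngram_ranges = [(1, 2), (1, 1)]     # (1, 1) is the scikit-learn default

# product() varies the last argument fastest, matching the numbered list.
combos = list(product(vectorizers, max_features, stop_words, ngram_ranges))

print(len(combos))
for i, combo in enumerate(combos, start=1):
    print(i, combo)
```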

### 9. Rather than doing this directly (e.g. instantiating 16 different vectorizers), what have we learned about that might make this easier? Use it.

#### 9.0 Import Libraries

In [8]:
# Import libraries
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

#### 9.1 Create DF include only information of interest

In [9]:
# Extract info from STATUS and cAGR columns
df = data[['STATUS','cAGR']]

# Change the title to lower case
df.columns = ['status', 'carg']

# Review first 5 rows of data
df.head()

Unnamed: 0,status,carg
0,likes the sound of thunder.,n
1,is so sleepy it's not even funny that's she ca...,n
2,is sore and wants the knot of muscles at the b...,n
3,likes how the day sounds in this new song.,n
4,is home. <3,n


In [10]:
# Dummify 'carg' column
df['carg'] = df['carg'].apply(lambda x: 1 if x == 'y' else 0)

# Confirm no other values besides 0 and 1
df['carg'].unique()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


array([0, 1])
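
The warning printed above is pandas' `SettingWithCopyWarning`: `df` was created by selecting columns from `data`, and the assignment then writes into that selection. One common way to avoid it, sketched here on a small made-up frame rather than the lab data, is to take an explicit `.copy()` before assigning. The `.map` call is a stylistic alternative to the `lambda`, not the original author's code.

```python
import pandas as pd

# Small made-up stand-in for the lab data.
data = pd.DataFrame({
    "STATUS": ["is home.", "likes thunder.", "is sleepy."],
    "cAGR": ["n", "y", "n"],
    "cEXT": ["y", "n", "n"],
})

# .copy() makes df an independent DataFrame, so later assignments
# do not write into a slice of `data`.
df = data[["STATUS", "cAGR"]].copy()
df.columns = ["status", "carg"]

# Encode 'y' as 1 and 'n' as 0.
df["carg"] = df["carg"].map({"y": 1, "n": 0})

print(df["carg"].tolist())
```

One difference to keep in mind: the `lambda` sends every value other than `'y'` to 0, whereas `.map` with a dict leaves unmapped values as `NaN`, which makes unexpected labels easier to spot.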

#### 9.2 Baseline Accuracy

In [11]:
df['carg'].value_counts(normalize=True)

1    0.531209
0    0.468791
Name: carg, dtype: float64

#### 9.3 Model Prep

In [12]:
X = df['status']
y = df['carg']

#### 9.4 Train/Test Split

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

#### 9.5 Customize Tokenizer

Fancy token-level analysis such as stemming, lemmatizing, compound splitting, filtering based on part-of-speech, etc. are not included in the scikit-learn codebase, but can be added by customizing either the tokenizer or the analyzer. 

In [14]:
# Import libraries for tokenize and lemmatizer
from nltk import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

In [15]:
# Build a class for customized tokenizer incorporating lemmatizer
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        tokenizer = RegexpTokenizer('(?u)\\b\\w\\w+\\b')
        return [self.wnl.lemmatize(t) for t in tokenizer.tokenize(doc)]
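
A callable like `LemmaTokenizer` is plugged in through the vectorizer's `tokenizer` argument. The sketch below shows the mechanism with a deliberately crude stand-in (a hypothetical `SimpleTokenizer` that only strips a trailing "s") so that it runs without downloading the WordNet data that `WordNetLemmatizer` needs.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

class SimpleTokenizer:
    """Hypothetical stand-in for LemmaTokenizer: crude plural stripping."""
    def __call__(self, doc):
        tokens = re.findall(r"(?u)\b\w\w+\b", doc)
        # Drop a trailing "s" from longer tokens; a real lemmatizer is far smarter.
        return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]

# token_pattern=None because the custom tokenizer replaces the default pattern.
cvec = CountVectorizer(tokenizer=SimpleTokenizer(), token_pattern=None)
cvec.fit(["likes the sound of thunder", "like the sounds"])

print(sorted(cvec.vocabulary_))
```

With this in place, "likes"/"like" and "sound"/"sounds" collapse into single features, which is the effect lemmatizing is meant to have on the vocabulary.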

#### 9.6 Pipeline

In [16]:
# Pipeline CountVectorizer
pipe_cv = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression())
])

# Pipeline TFIDFVectorizer
pipe_td = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('lr', LogisticRegression())
])

In [17]:
# Pipeline_parameter CountVectorizer
pipe_params_cv = {
    'cvec__max_features': [100, 500],
    'cvec__stop_words': [None, 'english'],
    'cvec__ngram_range':[(1,1),(1,2)],
#     'cvec__min_df': [3],
#     'cvec__max_df': [.95]
}

# Pipeline_parameter TFIDFVectorizer
pipe_params_td = {
    'tvec__max_features': [100, 500],
    'tvec__stop_words': [None, 'english'],
    'tvec__ngram_range':[(1,1),(1,2)],
#     'tvec__min_df': [2],
#     'tvec__max_df': [.9]
}

#### 9.7 GridSearchCV

  **9.7.1 CountVectorizer**

In [18]:
# CountVectorizer
gs_cv = GridSearchCV(pipe_cv, 
                     param_grid=pipe_params_cv, 
                     verbose=1,
#                      cv=3,
                     n_jobs=4
                    )
gs_cv.fit(X_train, y_train)

Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 out of  24 | elapsed:    4.6s finished


GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=4,
       param_grid={'cvec__max_features': [100, 500], 'cvec__stop_words': [None, 'english'], 'cvec__ngram_range': [(1, 1), (1, 2)]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [19]:
print(f'The best accuracy of the 8 CV training models {gs_cv.best_score_}')
print()
print(f'The best CV training model has the following parameters:\n{pd.DataFrame(gs_cv.best_params_)}')

The best accuracy of the 8 CV training models 0.5468602931289498

The best CV training model has the following parameters:
   cvec__max_features  cvec__ngram_range cvec__stop_words
0                 500                  1          english
1                 500                  1          english


In [20]:
cv_lr_train_score = gs_cv.score(X_train, y_train)
cv_lr_train_score

0.6241764152211914

In [21]:
# Run on test data
cv_lr_test_score = gs_cv.score(X_test, y_test)
cv_lr_test_score

0.540725806451613

  **9.7.2 TFIDFVectorizer**

In [22]:
# Reset X, y
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

In [23]:
# TFIDFVectorizer
gs_td = GridSearchCV(pipe_td, 
                     param_grid=pipe_params_td, 
                     verbose=1,
#                      cv=3,
                     n_jobs=4)
gs_td.fit(X_train, y_train)

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=4)]: Done  24 out of  24 | elapsed:    2.9s finished


GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('tvec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=4,
       param_grid={'tvec__max_features': [100, 500], 'tvec__stop_words': [None, 'english'], 'tvec__ngram_range': [(1, 1), (1, 2)]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [24]:
print(f'The best accuracy of the 8 TD training models {gs_td.best_score_}')
print()
print(f'The best TD training model has the following parameters:\n{pd.DataFrame(gs_td.best_params_)}')

The best accuracy of the 8 TD training models 0.5511631034019093

The best TD training model has the following parameters:
   tvec__max_features  tvec__ngram_range tvec__stop_words
0                 500                  1          english
1                 500                  2          english


In [25]:
td_lr_train_score = gs_td.score(X_train, y_train)
td_lr_train_score

0.6268656716417911

In [26]:
td_lr_test_score = gs_td.score(X_test, y_test)
td_lr_test_score

0.5423387096774194

### 10. What are some of the advantages of fitting a logistic regression model?

Answers:
1. Logistic regression is interpretable: each coefficient (beta) describes the estimated relationship between a feature and the log-odds of the outcome.
2. Regularization such as LASSO and ridge can be applied to shrink the coefficients of less informative features.

### 11. Fit a logistic regression model and compare it to the baseline.

In [27]:
# see Section 9

### Summary of Naive Bayes 

Naive Bayes is a classification technique that relies on probability to classify observations.
- It's based on a probability rule called **Bayes' Theorem**... thus, "**Bayes**."
- It makes an assumption that isn't often met, so it's "**naive**."

Despite being a model that relies on a naive assumption, it performs **really** well.
- [Interested in details? Read more here if you want.](https://www.cs.unb.ca/~hzhang/publications/FLAIRS04ZhangH.pdf)


The [sklearn documentation](https://scikit-learn.org/stable/modules/naive_bayes.html) is here, but it can be intimidating. So, to quickly summarize the Bayes and Naive parts of the model...

#### Bayes' Theorem
Bayes' Theorem relates the conditional probability $P(A|B)$ to $P(B|A)$. (Don't worry; we won't be doing any probability calculations by hand! However, you may want to refresh your memory on conditional probability from our earlier lessons if you forget what a conditional probability is.)

$$
\begin{eqnarray*}
\text{Bayes' Theorem: } P(A|B) &=& \frac{P(B|A)P(A)}{P(B)}
\end{eqnarray*}
$$

- Let $A$ be that someone is "agreeable," like the OCEAN category.
- Let $B$ represent the words used in their Facebook post.

$$
\begin{eqnarray*}
\text{Bayes' Theorem: } P(A|B) &=& \frac{P(B|A)P(A)}{P(B)} \\
\Rightarrow P(\text{person is agreeable}|\text{words in Facebook post}) &=& \frac{P(\text{words in Facebook post}|\text{person is agreeable})P(\text{person is agreeable})}{P(\text{words in Facebook post})}
\end{eqnarray*}
$$

We want to calculate the probability that someone is agreeable **given** the words that they used in their Facebook post! (Rather than calculating this probability by hand, this is done under the hood and we can just see the results by checking `.predict_proba()`.) However, this is exactly what our model is doing. We can (a.k.a. the model can) calculate the pieces on the right-hand side of the equation to give us a probability estimate of how likely someone is to be agreeable given their Facebook post.
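
To make the right-hand side concrete, here is a small worked example for a single word. The prior comes from the class balance reported in section 9.2; the two likelihoods are made-up numbers for illustration only, not estimates from the lab data.

```python
# P(A): prior probability of being agreeable, taken from the class balance
# reported in section 9.2.
p_agreeable = 0.531209

# Hypothetical likelihoods of seeing one particular word in a post.
p_word_given_agreeable = 0.020      # assumption, for illustration only
p_word_given_not_agreeable = 0.010  # assumption, for illustration only

# P(B): total probability of seeing the word, summed over both classes.
p_word = (p_word_given_agreeable * p_agreeable
          + p_word_given_not_agreeable * (1 - p_agreeable))

# Bayes' Theorem: P(A|B) = P(B|A) P(A) / P(B)
p_agreeable_given_word = p_word_given_agreeable * p_agreeable / p_word

print(round(p_agreeable_given_word, 3))
```

Because the word is assumed to be twice as likely among agreeable users, seeing it pushes the estimate above the prior.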

#### Naive Assumption

If our goal is to estimate $P(\text{person is agreeable}|\text{words in Facebook post})$, that can be quite tricky.

---

<details><summary>Bonus: if you want to understand why that's complicated, click here.</summary>
    
- The event $\text{"words in Facebook post"}$ is a complicated event to calculate.

- If a Facebook post has 100 words in it, then the event $\text{"words in Facebook post"} = \text{"word 1 is in the Facebook post" and "word 2 is in the Facebook post" and }\ldots \text{ and "word 100 is in the Facebook post"}$.

- To calculate the joint probability of all 100 words being in the Facebook post gets complicated pretty quickly. (Refer back to the probability notes on how to calculate the joint probability of two events if you want to see more.)
</details>

---

To simplify matters, we make an assumption: **we assume that all of our features are independent of one another.**

In some contexts, this assumption might be realistic!
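
Written out, the independence assumption (applied within each class, which is how Naive Bayes uses it) lets the joint likelihood of the words factor into a product of per-word likelihoods. Here $w_1, \ldots, w_n$ is notation introduced for this sketch to denote the words in the post:

$$
\begin{eqnarray*}
P(w_1, w_2, \ldots, w_n|\text{person is agreeable}) &=& P(w_1|\text{person is agreeable}) \times \cdots \times P(w_n|\text{person is agreeable}) \\
&=& \prod_{i=1}^{n} P(w_i|\text{person is agreeable})
\end{eqnarray*}
$$

Each factor can be estimated from simple word frequencies among agreeable users, which is a large part of why the model is fast to fit.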

### 12. Why would this assumption not be realistic with NLP data?

Answer:

Because the words that make up meaningful sentences are usually not independent of each other. Many words are more likely to be used together.

Despite this assumption not being realistic with NLP data, we still use Naive Bayes pretty frequently.
- It's a very fast modeling algorithm. (which is great especially when we have lots of features and/or lots of data!)
- It is often an excellent classifier, outperforming more complicated models.

There are three common types of Naive Bayes models: Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes.
- How do we pick which of the three models to use? It depends on our $X$ variable.
    - Bernoulli Naive Bayes is appropriate when our features are all 0/1 variables.
        - [Bernoulli NB Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB)
    - Multinomial Naive Bayes is appropriate when our features are variables that take on only non-negative integer counts.
        - [Multinomial NB Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB)
    - Gaussian Naive Bayes is appropriate when our features are Normally distributed variables. (Realistically, though, we kind of use Gaussian whenever neither Bernoulli nor Multinomial works.)
        - [Gaussian NB Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB)
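
The rule of thumb above can be sketched as a small function. `suggest_nb_variant` is a hypothetical helper written for this illustration; it is not part of scikit-learn, and it only encodes the three bullet points, not a definitive selection procedure.

```python
def suggest_nb_variant(rows):
    """Hypothetical helper: apply the rule of thumb above to a feature matrix.

    `rows` is a list of rows, each a list of numeric feature values.
    """
    values = [v for row in rows for v in row]
    if all(v in (0, 1) for v in values):
        return "BernoulliNB"
    if all(v >= 0 and float(v).is_integer() for v in values):
        return "MultinomialNB"
    return "GaussianNB"

print(suggest_nb_variant([[0, 1, 1], [1, 0, 0]]))        # binary indicators
print(suggest_nb_variant([[0, 2, 1], [3, 0, 0]]))        # word counts
print(suggest_nb_variant([[0.3, -1.2], [2.5, 0.1]]))     # real-valued features
```

Treat the result as a starting point only. Question 14 is a case where the features are fractional (TF-IDF weights) and a different choice may still be reasonable.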

### 13. Suppose you CountVectorized your features. Which Naive Bayes model would be most appropriate to fit? Why? Fit it.

#### 13.1 Calculate the Maximum Count

Answer:  **Multinomial NB Model**

Based on the Naive Bayes documentation:
1. Bernoulli NB works on binary features (whether a word occurs, rather than how many times). It would fit if every feature occurred either once or not at all in each status. However, the maximum count among the selected features is 2, so Bernoulli NB is not suitable here.

2. Multinomial NB works on occurrence counts (fractional counts can also work), so we should use this model.

3. The features are not normally distributed, so Gaussian NB does not apply.
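
One way to check the maximum count is to vectorize the text and take the largest entry of the resulting document-term matrix. The sketch below does this on a made-up corpus, not the lab data, so the value it prints says nothing about the actual statuses.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus; "so" appears twice in the first document.
docs = ["so sleepy so tired", "likes the sound of thunder", "is home"]

cvec = CountVectorizer()
counts = cvec.fit_transform(docs)   # sparse document-term matrix

# Largest single count of any word in any document.
max_count = counts.max()
print(max_count)

# A maximum above 1 means the features are not purely 0/1 indicators.
print(max_count > 1)
```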

#### 13.2 Fit the MNB Model

In [29]:
# Import library
from sklearn.naive_bayes import MultinomialNB

In [30]:
# Use pipe and GridSearch CV to instantiate and fit model
pipe_cv_mnb = Pipeline([
            ('cvec', CountVectorizer()),
            ('mnb', MultinomialNB())    
])

In [31]:
# CountVectorizer & MNB GridSearch
gs_cv_mnb = GridSearchCV(pipe_cv_mnb, 
                         param_grid=pipe_params_cv, 
                         verbose=1,
                         cv=3,
                         n_jobs=4
                        )
gs_cv_mnb.fit(X_train, y_train)

Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 out of  24 | elapsed:    2.9s finished


GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('mnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid='warn', n_jobs=4,
       param_grid={'cvec__max_features': [100, 500], 'cvec__stop_words': [None, 'english'], 'cvec__ngram_range': [(1, 1), (1, 2)]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [32]:
# Accuracy Score for training data
cv_mnb_train_score = gs_cv_mnb.score(X_train, y_train)
cv_mnb_train_score

0.6104612074761329

In [33]:
# Accuracy Score
cv_mnb_test_score = gs_cv_mnb.score(X_test, y_test)
cv_mnb_test_score

0.5395161290322581

In [34]:
# Sample predict probability for the MNB model
pd.DataFrame(gs_cv_mnb.predict_proba(X_test)).head()

Unnamed: 0,0,1
0,0.996763,0.003237
1,0.555566,0.444434
2,0.692657,0.307343
3,0.44338,0.55662
4,0.517252,0.482748
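
Each row of the `predict_proba` output above is a probability distribution over the two classes, and `.predict()` returns the class with the larger probability. The sketch below reproduces that step by hand using the five rows copied from the output.

```python
# First five rows of predict_proba, copied from the output above.
# Column 0 is P(class 0), column 1 is P(class 1, i.e. agreeable).
probas = [
    (0.996763, 0.003237),
    (0.555566, 0.444434),
    (0.692657, 0.307343),
    (0.443380, 0.556620),
    (0.517252, 0.482748),
]

# The predicted class is the column with the larger probability.
predictions = [0 if p0 >= p1 else 1 for p0, p1 in probas]
print(predictions)

# Each row sums to 1 (up to rounding in the printed values).
print(all(abs(p0 + p1 - 1) < 1e-6 for p0, p1 in probas))
```

Most of these rows sit close to 0.5, which is consistent with a model that is only slightly better than guessing.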


### 14. Suppose you TFIDFVectorized your features. Which Naive Bayes model would be most appropriate to fit? Why? Fit it.

Answer:  still **Multinomial NB Model**

Based on the Naive Bayes documentation:
1. Bernoulli NB works on binary features. TF-IDF features are continuous weights rather than 0/1 indicators, so Bernoulli NB is not suitable here.

2. Multinomial NB is designed for occurrence counts, but the documentation notes that fractional counts such as TF-IDF can also work in practice, so we use this model.

3. The features are not normally distributed, so Gaussian NB does not apply.

In [35]:
# Use pipe and GridSearch CV to instantiate and fit model
pipe_td_mnb = Pipeline([
                       ('tvec', TfidfVectorizer()),
                       ('mnb', MultinomialNB())
])

In [36]:
# TfidfVectorizer & MNB GridSearch
gs_td_mnb = GridSearchCV(pipe_td_mnb,
                         param_grid=pipe_params_td,
                         verbose=1,
                         cv=3,
                         n_jobs=4
                        )
gs_td_mnb.fit(X_train, y_train)

Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 out of  24 | elapsed:    2.7s finished


GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('tvec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...rue,
        vocabulary=None)), ('mnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid='warn', n_jobs=4,
       param_grid={'tvec__max_features': [100, 500], 'tvec__stop_words': [None, 'english'], 'tvec__ngram_range': [(1, 1), (1, 2)]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [37]:
# Accuracy Score for training data
td_mnb_train_score = gs_td_mnb.score(X_train, y_train)
td_mnb_train_score

0.6222939357267715

In [38]:
# Accuracy Score for testing data
td_mnb_test_score = gs_td_mnb.score(X_test, y_test)
td_mnb_test_score

0.5415322580645161

### 15. Compare the performance of your models.

In [39]:
# Summarize Model Performances
performance = [[cv_lr_train_score,
                td_lr_train_score,
                cv_mnb_train_score,
                td_mnb_train_score],
               [cv_lr_test_score,
                td_lr_test_score,
                cv_mnb_test_score,
                td_mnb_test_score]
              ]
columns = ['cv_lr', 'td_lr', 'cv_mnb', 'td_mnb']
index = ['train_accuracy', 'test_accuracy']
pd.DataFrame(np.round(performance,3), columns=columns, index=index)

| | cv_lr | td_lr | cv_mnb | td_mnb |
|---|---|---|---|---|
| train_accuracy | 0.624 | 0.627 | 0.61 | 0.622 |
| test_accuracy | 0.541 | 0.542 | 0.54 | 0.542 |


Based on the above table, logistic regression with TFIDFVectorizer performs the best on the test data (accuracy 0.542), although all four models are within about 0.003 of each other.

### 16. Even though we didn't explore the full extent of Cambridge Analytica's modeling, based on what we did here, how effective was their approach at using Facebook data to model agreeableness?

Based on the modeling results, neither the logistic regression models nor the Naive Bayes models produce substantially better results than the baseline: the best test accuracy is about 0.542, against a baseline accuracy of 0.531. Therefore, their approach at using Facebook data to model agreeableness is not very effective.
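
The comparison can be made explicit with a few lines of arithmetic on the scores reported above. The sketch assumes the full-data class balance from section 9.2 is a fair baseline for the test set, which is reasonable here because the split was stratified.

```python
# Numbers copied from the cells above.
baseline = 0.531209   # share of the majority class (section 9.2)

train_scores = {
    "cv_lr": 0.6241764152211914,
    "td_lr": 0.6268656716417911,
    "cv_mnb": 0.6104612074761329,
    "td_mnb": 0.6222939357267715,
}
test_scores = {
    "cv_lr": 0.540725806451613,
    "td_lr": 0.5423387096774194,
    "cv_mnb": 0.5395161290322581,
    "td_mnb": 0.5415322580645161,
}

for name in test_scores:
    lift = test_scores[name] - baseline           # gain over always guessing "agreeable"
    gap = train_scores[name] - test_scores[name]  # train-test gap
    print(f"{name}: lift over baseline = {lift:.3f}, train-test gap = {gap:.3f}")

best = max(test_scores, key=test_scores.get)
print(best)
```

Every model gains roughly one percentage point over the baseline, while the train-test gaps are several times larger than that, which suggests the models are mostly fitting noise in the training statuses.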