<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Tackling an NLP Problem with Naive Bayes
_Author: Matt Brems_

----

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we are going to apply a **new** modeling technique to natural language processing data.

> "But how can we apply a modeling technique we haven't learned?!"

The DSI program is great - but we can't teach you *everything* about data science in 12 weeks! This lab is designed to help you start learning something new without it being taught in a formal lesson. 
- Later in the cohort (like for your capstone!), you'll be exploring models, libraries, and resources that you haven't been explicitly taught.
- After the program, you'll want to continue developing your skills. Being comfortable with documentation and being confident in your ability to read something new and decide whether or not it is an appropriate method for the problem you're trying to solve is **incredibly** valuable.

### Step 1: Define the problem.

Many organizations have a substantial interest in classifying users of their product into groups. Some examples:
- A company that serves as a marketplace may want to predict who is likely to purchase a certain type of product on their platform, like books, cars, or food.
- An application developer may want to identify which individuals are willing to pay money for "bonus features" or to upgrade their app.
- A social media organization may want to identify who generates the highest rate of content that later goes "viral."

### Summary
In this lab, you're an engineer for Facebook. In recent years, the organization Cambridge Analytica gained worldwide notoriety for its use of Facebook data in an attempt to sway electoral outcomes.

Cambridge Analytica, an organization staffed with lots of Ph.D. researchers, used the Big5 personality groupings (also called OCEAN) to group people into one of 32 different groups.
- The five qualities measured by this personality assessment are:
    - **O**penness
    - **C**onscientiousness
    - **E**xtroversion
    - **A**greeableness
    - **N**euroticism
- Each person could be classified as "Yes" or "No" for each of the five qualities.
- This makes for 32 different potential combinations of qualities. ($2^5 = 32$)
- You don't have to check it out, but if you want to learn more about this personality assessment, head to [the Wikipedia page](https://en.wikipedia.org/wiki/Big_Five_personality_traits).
- There's also [a short (3-4 pages) academic paper describing part of this approach](./celli-al_wcpr13.pdf).
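
As a quick check of the $2^5 = 32$ count above, the groupings can be enumerated directly. This is only an illustrative sketch; the `traits` list and the "y"/"n" coding are assumptions chosen to mirror the description, not part of Cambridge Analytica's method.

```python
from itertools import product

traits = ["Openness", "Conscientiousness", "Extroversion",
          "Agreeableness", "Neuroticism"]

# Each trait is either "y" or "n", so one grouping is one y/n choice per trait.
groupings = list(product("yn", repeat=len(traits)))

print(len(groupings))                    # 32
print(dict(zip(traits, groupings[0])))   # the first grouping, labeled by trait
```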

Cambridge Analytica's methodology was, roughly, the following:
- Gather a large amount of data from Facebook.
- Use this data to predict an individual's Big5 personality "grouping."
- Design political advertisements that would be particularly effective to that particular "grouping." (For example, are certain advertisements particularly effective toward people with specific personality traits?)

You want to know the **real-world problem**: "Is what Cambridge Analytica attempted to do actually possible, or is it junk science?"

However, we'll solve the related **data science problem**: "Are one's Facebook statuses predictive of whether or not one is agreeable?"
> Note: If Facebook statuses aren't predictive of one being agreeable (one of the OCEAN qualities), then Cambridge Analytica's approach won't work very well!

In [1]:
# Agreeableness: friendly/compassionate vs. critical/rational
# (definition retrieved from the Wikipedia page linked above)

### Step 2: Obtain the data.

Obviously, there are plenty of opportunities to discuss the ethics surrounding this particular issue... so let's do that.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [3]:
df = pd.read_csv('./mypersonality_final.csv', encoding = 'ISO-8859-1')

In [4]:
df.head()

Unnamed: 0,#AUTHID,STATUS,sEXT,sNEU,sAGR,sCON,sOPN,cEXT,cNEU,cAGR,cCON,cOPN,DATE,NETWORKSIZE,BETWEENNESS,NBETWEENNESS,DENSITY,BROKERAGE,NBROKERAGE,TRANSITIVITY
0,b7b7764cfa1c523e4e93ab2a79a946c4,likes the sound of thunder.,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,06/19/09 03:21 PM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1
1,b7b7764cfa1c523e4e93ab2a79a946c4,is so sleepy it's not even funny that's she ca...,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,07/02/09 08:41 AM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1
2,b7b7764cfa1c523e4e93ab2a79a946c4,is sore and wants the knot of muscles at the b...,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,06/15/09 01:15 PM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1
3,b7b7764cfa1c523e4e93ab2a79a946c4,likes how the day sounds in this new song.,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,06/22/09 04:48 AM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1
4,b7b7764cfa1c523e4e93ab2a79a946c4,is home. <3,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,07/20/09 02:31 AM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1


**1. What is the difference between anonymity and confidentiality? All else held equal, which tends to keep people safer?**

In [5]:
# Anonymity means the person's identity is never known to whoever holds the data.
# Confidentiality means the identity is known, but it is kept private or removed.

# Anonymity tends to be safer: there is no identity to expose, whereas confidential
# data may leak if security isn't good.

**2. Suppose that the "unique identifier" in the above data, the `#AUTHID`, is a randomly generated key so that it can never be connected back to the original poster. Have we guaranteed anonymity here? Why or why not?**

In [6]:
# No, we can't guarantee anonymity here, because the data still contains individual-level information such as the date.
# The status text itself is unique to a person, and when combined with the date
# it could be used to trace who that person is.

**3. As an engineer for Facebook, you recognize that user data will be used by Facebook and by other organizations - that won't change. However, what are at least three recommendations you would bring to your manager to improve how data is used and shared? Be as specific as you can.**

In [7]:
# 1. Remove the date column from the data:
# this increases the anonymity of the people who provided the data.

# 2. Share only essential data:
# select only the columns that are actually used, to increase anonymity.

# 3. Replace the unique identifier with another set of unique identifiers:
# this makes it harder to trace records back to the user.

### Step 3: Explore the data.

- Note: For our $X$ variable, we will only use the `STATUS` variable. For our $Y$ variable, we will only use the `cAGR` variable.

**4. Explore the data here.**
> We aren't explicitly asking you to do specific EDA here, but what EDA would you generally do with this data? Do the EDA you usually would, especially if you know what the goal of this analysis is.

In [8]:
df.columns

Index(['#AUTHID', 'STATUS', 'sEXT', 'sNEU', 'sAGR', 'sCON', 'sOPN', 'cEXT',
       'cNEU', 'cAGR', 'cCON', 'cOPN', 'DATE', 'NETWORKSIZE', 'BETWEENNESS',
       'NBETWEENNESS', 'DENSITY', 'BROKERAGE', 'NBROKERAGE', 'TRANSITIVITY'],
      dtype='object')

In [9]:
# check for type of data
df.dtypes

#AUTHID          object
STATUS           object
sEXT            float64
sNEU            float64
sAGR            float64
sCON            float64
sOPN            float64
cEXT             object
cNEU             object
cAGR             object
cCON             object
cOPN             object
DATE             object
NETWORKSIZE     float64
BETWEENNESS     float64
NBETWEENNESS    float64
DENSITY         float64
BROKERAGE       float64
NBROKERAGE      float64
TRANSITIVITY    float64
dtype: object

In [10]:
# Check size of data
df.shape

(9917, 20)

In [11]:
# There is 1 missing value in the TRANSITIVITY column, but we don't use that column,
# so we ignore the missing value in this case.
df.isnull().sum()[df.isnull().sum() != 0]

TRANSITIVITY    1
dtype: int64

In [12]:
# Check the values in our target variable
df['cAGR'].value_counts(normalize=True)

# Our target variable is fairly balanced.

y    0.531209
n    0.468791
Name: cAGR, dtype: float64
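
Beyond the checks above, text-specific EDA often looks at how long the statuses are and whether that differs by class. The sketch below uses a tiny made-up DataFrame (the rows and the `status_length` / `status_word_count` column names are assumptions for illustration); with the real data, the same steps would be applied to `df`.

```python
import pandas as pd

# Hypothetical rows that mimic the STATUS and cAGR columns; not real data.
toy = pd.DataFrame({
    "STATUS": ["is home. <3", "likes the sound of thunder.",
               "so tired", "what a great day with friends!"],
    "cAGR": ["n", "n", "y", "y"],
})

# Character length and word count of each status.
toy["status_length"] = toy["STATUS"].str.len()
toy["status_word_count"] = toy["STATUS"].str.split().str.len()

# Compare the average length of statuses across the two classes.
summary = toy.groupby("cAGR")[["status_length", "status_word_count"]].mean()
print(summary)
```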

**5. What is the difference between CountVectorizer and TFIDFVectorizer?**

In [13]:
# CountVectorizer converts text into one column per token, where each value is the count of that token.

# TfidfVectorizer converts text into one column per token, where each value is a score that
# down-weights words that are common across documents and up-weights rarer words.

**6. What are stopwords?**

In [14]:
# Common words that aren't important or useful for the model.

**7. Give an example of when you might remove stopwords.**

In [15]:
# When I want to keep only informative tokens to feed into the model.

**8. Give an example of when you might keep stopwords in your model.**

In [16]:
# When the text has a lot of meaningful bigrams, for example "not bad" or "not good".
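
The bigram point can be checked with a short sketch. The two sentences are made up; the only fact relied on is that "not" is in scikit-learn's built-in English stop word list, so removing stop words also removes the "not good" bigram.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentences with opposite meanings.
docs = ["the movie was not good", "the movie was good"]

# Unigrams and bigrams, with and without English stop words.
kept = CountVectorizer(ngram_range=(1, 2)).fit(docs)
removed = CountVectorizer(ngram_range=(1, 2), stop_words="english").fit(docs)

print(sorted(kept.vocabulary_))
print(sorted(removed.vocabulary_))
```

With stop words removed, the negation disappears from the vocabulary, so the two sentences can no longer be told apart by a model built on these features.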

### Step 4: Model the data.

We are going to fit two types of models: a logistic regression and [a Naive Bayes classifier](https://scikit-learn.org/stable/modules/naive_bayes.html).

**Reminder:** We will only use the feature `STATUS` to model `cAGR`.

### We want to attempt to fit our models on sixteen sets of features:

1. CountVectorizer with 100 features, with English stopwords removed and with an `ngram_range` that includes 1 and 2.
2. CountVectorizer with 100 features, with English stopwords removed and with the default `ngram_range`.
3. CountVectorizer with 100 features, with English stopwords kept in and with an `ngram_range` that includes 1 and 2.
4. CountVectorizer with 100 features, with English stopwords kept in and with the default `ngram_range`.
5. CountVectorizer with 500 features, with English stopwords removed and with an `ngram_range` that includes 1 and 2.
6. CountVectorizer with 500 features, with English stopwords removed and with the default `ngram_range`.
7. CountVectorizer with 500 features, with English stopwords kept in and with an `ngram_range` that includes 1 and 2.
8. CountVectorizer with 500 features, with English stopwords kept in and with the default `ngram_range`.
9. TFIDFVectorizer with 100 features, with English stopwords removed and with an `ngram_range` that includes 1 and 2.
10. TFIDFVectorizer with 100 features, with English stopwords removed and with the default `ngram_range`.
11. TFIDFVectorizer with 100 features, with English stopwords kept in and with an `ngram_range` that includes 1 and 2.
12. TFIDFVectorizer with 100 features, with English stopwords kept in and with the default `ngram_range`.
13. TFIDFVectorizer with 500 features, with English stopwords removed and with an `ngram_range` that includes 1 and 2.
14. TFIDFVectorizer with 500 features, with English stopwords removed and with the default `ngram_range`.
15. TFIDFVectorizer with 500 features, with English stopwords kept in and with an `ngram_range` that includes 1 and 2.
16. TFIDFVectorizer with 500 features, with English stopwords kept in and with the default `ngram_range`.
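
The sixteen feature sets are every combination of two vectorizers and three two-valued options. A minimal sketch of that enumeration, using `ParameterGrid` from scikit-learn (the `options` dictionary and variable names are illustrative assumptions):

```python
from sklearn.model_selection import ParameterGrid

# Three options with two values each -> 2 * 2 * 2 = 8 combinations.
options = {
    "max_features": [100, 500],
    "stop_words": ["english", None],
    "ngram_range": [(1, 1), (1, 2)],
}
per_vectorizer = list(ParameterGrid(options))

# Two vectorizers -> 8 * 2 = 16 feature sets in total.
vectorizers = ["CountVectorizer", "TfidfVectorizer"]
feature_sets = [(name, combo) for name in vectorizers for combo in per_vectorizer]

print(len(per_vectorizer), len(feature_sets))  # 8 16
```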

**9. Rather than manually instantiating 16 different vectorizers, what `sklearn` class have we learned about that might make this easier? Use it.**

In [17]:
# Use Pipeline and GridSearchCV to make it easier to set up the 16 vectorizers and feed them into
# our logistic regression and Naive Bayes classifiers.
# Set X and y, then split our data
X = df['STATUS'] # Using only status for our X variable
# Mapping 'n'-> 0, 'y' -> 1
y = df['cAGR'].map({'n':0,'y':1}) # Using only cAGR as target variable

# X before y, train before test
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   random_state=42)

In [18]:
# Check size of X, y
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((7437,), (2480,), (7437,), (2480,))

In [19]:
# Check the class balance of y_train
y_train.value_counts(normalize=True)

1    0.533011
0    0.466989
Name: cAGR, dtype: float64

In [20]:
# Check the class balance of y_test
y_test.value_counts(normalize=True)

1    0.525806
0    0.474194
Name: cAGR, dtype: float64

In [21]:
def score_class(y_train, y_preds_train, y_test, y_preds_test):
    accuracy_train = accuracy_score(y_train, y_preds_train)
    accuracy_test = accuracy_score(y_test, y_preds_test)
    return [accuracy_train, accuracy_test]

In [22]:
def score_to_df(params, accuracy_list, score_df):
    model_df = pd.DataFrame([[params] + accuracy_list],
                            columns = ['params','accuracy_train','accuracy_test'])
    score_df = pd.concat([score_df,model_df], ignore_index=True)
    return score_df

In [23]:
score_df_log = pd.DataFrame(columns=['params', 'accuracy_train','accuracy_test'])

In [24]:
# Grid search version of logistic regression
# Feature engineering: CountVectorizer
pipe1_params = {
    'cvec__max_features':[100,500],
    'cvec__stop_words':[None,'english'],
    'cvec__ngram_range':[(1,1),(1,2)]
}

pipe1 = Pipeline([
    ('cvec', CountVectorizer()),
    ('logreg', LogisticRegression(max_iter=150))
])

gs_1 = GridSearchCV(pipe1,
                    pipe1_params,
                    cv=5)
gs_1.fit(X_train,y_train)
print(gs_1.best_params_)

# Model evaluation
y_preds_train = gs_1.predict(X_train)
y_preds_test = gs_1.predict(X_test)
accuracy_list = score_class(y_train, y_preds_train, y_test, y_preds_test)
params = gs_1.best_params_
score_df_log = score_to_df(params, accuracy_list, score_df_log)
score_df_log

{'cvec__max_features': 500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': None}


Unnamed: 0,params,accuracy_train,accuracy_test
0,"{'cvec__max_features': 500, 'cvec__ngram_range...",0.617319,0.546774


In [25]:
# Grid search version of logistic regression
# Feature engineering: TfidfVectorizer
pipe1_params = {
    'tvec__max_features':[100,500],
    'tvec__stop_words':[None,'english'],
    'tvec__ngram_range':[(1,1),(1,2)]
}

pipe1 = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('logreg', LogisticRegression(max_iter=150))
])

gs_1 = GridSearchCV(pipe1,
                    pipe1_params,
                    cv=5)
gs_1.fit(X_train,y_train)
print(gs_1.best_params_)

# Model evaluation
y_preds_train = gs_1.predict(X_train)
y_preds_test = gs_1.predict(X_test)
accuracy_list = score_class(y_train, y_preds_train, y_test, y_preds_test)
params = gs_1.best_params_
score_df_log = score_to_df(params, accuracy_list, score_df_log)
score_df_log

{'tvec__max_features': 500, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': None}


Unnamed: 0,params,accuracy_train,accuracy_test
0,"{'cvec__max_features': 500, 'cvec__ngram_range...",0.617319,0.546774
1,"{'tvec__max_features': 500, 'tvec__ngram_range...",0.626193,0.545565


**10. What are some of the advantages of fitting a logistic regression model?**

In [26]:
# Advantages of fitting a logistic regression

# Logistic regression models are computationally inexpensive.

# Logistic regression models are interpretable: we can gain
# insight into the data by examining the model's coefficients.

**11. Fit a logistic regression model and compare it to the baseline.**

In [27]:
# Baseline model: always predict the majority class
y_test.value_counts(normalize=True)

# Our baseline model has an accuracy of about 52.6%.

1    0.525806
0    0.474194
Name: cAGR, dtype: float64
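
The baseline here is the accuracy from always predicting the most common class. One way to express that as a model is scikit-learn's `DummyClassifier`; the sketch below uses made-up labels (`y_toy`) with a split similar to the one above rather than the real `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 53 ones and 47 zeros.
y_toy = np.array([1] * 53 + [0] * 47)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this strategy

# Always predicts the majority class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

print(baseline.score(X_toy, y_toy))  # 0.53
```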

In [28]:
# Logistic regression model
score_df_log

# The logistic regression models are about 2 percentage points more accurate
# than our baseline model on the test set.

Unnamed: 0,params,accuracy_train,accuracy_test
0,"{'cvec__max_features': 500, 'cvec__ngram_range...",0.617319,0.546774
1,"{'tvec__max_features': 500, 'tvec__ngram_range...",0.626193,0.545565


### Summary of Naive Bayes 

Naive Bayes is a classification technique that relies on probability to classify observations.
- It's based on a probability rule called **Bayes' Theorem**... thus, "**Bayes**."
- It makes an assumption that isn't often met, so it's "**naive**."

Despite being a model that relies on a naive assumption, it often performs pretty well! (This is kind of like linear regression... we aren't always guaranteed homoscedastic errors in linear regression, but the model might still do a good job regardless.)
- [Interested in details? Read more here if you want.](https://www.cs.unb.ca/~hzhang/publications/FLAIRS04ZhangH.pdf)


The [sklearn documentation](https://scikit-learn.org/stable/modules/naive_bayes.html) is here, but it can be intimidating. So, to quickly summarize the Bayes and Naive parts of the model...

#### Bayes' Theorem
Bayes' Theorem relates the conditional probability $P(A|B)$ to $P(B|A)$. (Don't worry; we won't be doing any probability calculations by hand! However, you may want to refresh your memory on conditional probability from our earlier lessons if you forget what a conditional probability is.)

$$
\begin{eqnarray*}
\text{Bayes' Theorem: } P(A|B) &=& \frac{P(B|A)P(A)}{P(B)}
\end{eqnarray*}
$$

- Let $A$ be that someone is "agreeable," like the OCEAN category.
- Let $B$ represent the words used in their Facebook post.

$$
\begin{eqnarray*}
\text{Bayes' Theorem: } P(A|B) &=& \frac{P(B|A)P(A)}{P(B)} \\
\Rightarrow P(\text{person is agreeable}|\text{words in Facebook post}) &=& \frac{P(\text{words in Facebook post}|\text{person is agreeable})P(\text{person is agreeable})}{P(\text{words in Facebook post})}
\end{eqnarray*}
$$

We want to calculate the probability that someone is agreeable **given** the words that they used in their Facebook post! This is exactly what our model is doing: we can (a.k.a. the model can) calculate the pieces on the right-hand side of the equation to give us a probability estimate of how likely someone is to be agreeable given their Facebook post. (Rather than calculating this probability by hand, this is done under the hood and we can just see the results by checking `.predict_proba()`.)
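
To see the arithmetic once, here is Bayes' Theorem applied to a single word. Every number below is hypothetical and chosen only to illustrate the formula; none of it is estimated from the data.

```python
# All probabilities are made up for illustration.
p_agreeable = 0.53              # P(A): prior share of agreeable users
p_word_given_agreeable = 0.10   # P(B|A): post contains the word, given agreeable
p_word_given_not = 0.05         # P(B|not A): post contains the word, given not agreeable

# Denominator P(B), from the law of total probability.
p_word = (p_word_given_agreeable * p_agreeable
          + p_word_given_not * (1 - p_agreeable))

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_agreeable_given_word = p_word_given_agreeable * p_agreeable / p_word

print(round(p_agreeable_given_word, 3))  # 0.693
```

Seeing the word raises the estimated probability of being agreeable from 0.53 to about 0.69, because the word is assumed to be twice as likely among agreeable users.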

#### Naive Assumption

If our goal is to estimate $P(\text{person is agreeable}|\text{words in Facebook post})$, that can be quite tricky.

---

<details><summary>Bonus: if you want to understand why that's complicated, click here.</summary>
    
- The event $\text{"words in Facebook post"}$ is a complicated event to calculate.

- If a Facebook post has 100 words in it, then the event $\text{"words in Facebook post"} = \text{"word 1 is in the Facebook post" and "word 2 is in the Facebook post" and }\ldots \text{ and "word 100 is in the Facebook post"}$.

- To calculate the joint probability of all 100 words being in the Facebook post gets complicated pretty quickly. (Refer back to the probability notes on how to calculate the joint probability of two events if you want to see more.)
</details>

---

To simplify matters, we make an assumption: **we assume that all of our features are independent of one another** (more precisely, independent of one another given the class we are predicting).

In some contexts, this assumption might be realistic!
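
Under the independence assumption, the probability of seeing all the words in a post given a class is just the product of the per-word probabilities. The sketch below works through that with made-up numbers (the `p_word_given_class` table and `prior` values are hypothetical, not estimated from any data).

```python
import math

# Hypothetical per-word probabilities for each class.
p_word_given_class = {
    "agreeable":     {"love": 0.10, "friends": 0.08, "traffic": 0.01},
    "not_agreeable": {"love": 0.04, "friends": 0.03, "traffic": 0.05},
}
prior = {"agreeable": 0.53, "not_agreeable": 0.47}

post = ["love", "friends"]

# Naive assumption: P(words | class) = product of P(word | class).
joint = {
    label: prior[label] * math.prod(p_word_given_class[label][w] for w in post)
    for label in prior
}

# Dividing by the total plays the role of P(words) in Bayes' Theorem.
total = sum(joint.values())
posterior = {label: value / total for label, value in joint.items()}

print(posterior)
```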

**12. Why would this assumption not be realistic with NLP data?**

In [29]:
# Because the words in a sentence are strongly related to each other,
# this assumption isn't realistic with NLP data.

Despite this assumption not being realistic with NLP data, we still use Naive Bayes pretty frequently.
- It's a very fast modeling algorithm (which is great, especially when we have lots of features and/or lots of data!).
- It is often a good classifier, sometimes outperforming more complicated models.

There are three common types of Naive Bayes models: Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes.
- How do we pick which of the three models to use? It depends on our $X$ variable.
    - Bernoulli Naive Bayes is appropriate when our features are all 0/1 variables.
        - [Bernoulli NB Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB)
    - Multinomial Naive Bayes is appropriate when our features are variables that take on only non-negative integer counts.
        - [Multinomial NB Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB)
    - Gaussian Naive Bayes is appropriate when our features are Normally distributed variables. (Realistically, though, we kind of use Gaussian whenever neither Bernoulli nor Multinomial works.)
        - [Gaussian NB Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB)
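
As a rough sketch of matching the model to the feature type, the three classes can be fit on toy arrays of the corresponding kind. All arrays and labels below are made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([1, 1, 0, 0])  # hypothetical labels

# Word counts (non-negative integers) -> Multinomial NB.
X_counts = np.array([[2, 0, 1], [3, 0, 0], [0, 2, 1], [0, 1, 2]])
# Word present or absent (0/1) -> Bernoulli NB.
X_binary = (X_counts > 0).astype(int)
# Real-valued measurements -> Gaussian NB.
X_continuous = np.array([[5.1, 0.2], [4.9, 0.4], [1.0, 2.2], [1.2, 2.0]])

models = {
    "bernoulli": BernoulliNB().fit(X_binary, y),
    "multinomial": MultinomialNB().fit(X_counts, y),
    "gaussian": GaussianNB().fit(X_continuous, y),
}

print(models["multinomial"].predict([[2, 0, 0]]))
print(models["gaussian"].predict([[5.0, 0.3]]))
```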

In [30]:
# In this NLP problem, we could use Bernoulli NB when our X features are all 0/1 values.
# If we use CountVectorizer or TfidfVectorizer, we will use Multinomial NB to handle the data.

**13. Suppose you CountVectorized your features. Which Naive Bayes model would be most appropriate to fit? Why? Fit it.**

In [31]:
# I use Multinomial NB when the features are CountVectorized,
# because the CountVectorizer output is the count of each word,
# which is always zero or a positive integer.

**14. Suppose you TFIDFVectorized your features. Which Naive Bayes model would be most appropriate to fit? Why? Fit it.**

In [32]:
# I use Multinomial NB when the features are TFIDFVectorized.
# The TfidfVectorizer output is a non-negative score between zero and one
# rather than an integer count, but Multinomial NB also works with such fractional values in practice.
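
The scikit-learn documentation notes that Multinomial NB normally expects integer counts but that fractional values such as TF-IDF may also work in practice. A small sketch on made-up statuses (the `texts` and `labels` are hypothetical) shows the weights staying within zero and one under the default settings and the model fitting without complaint.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical statuses and labels, for illustration only.
texts = ["love my friends", "love this day", "hate traffic", "hate mondays"]
labels = [1, 1, 0, 0]

tvec = TfidfVectorizer()
X_tfidf = tvec.fit_transform(texts)

# With the default settings, TF-IDF weights are non-negative and at most 1.
print(X_tfidf.min(), X_tfidf.max())

# Multinomial NB accepts these fractional features.
nb = MultinomialNB().fit(X_tfidf, labels)
print(nb.predict(tvec.transform(["love friends"])))
```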

**15. Compare the performance of your models.**

In [33]:
score_df_nb = pd.DataFrame(columns=['params', 'accuracy_train','accuracy_test'])

In [34]:
# Grid search version of Naive Bayes
# Feature engineering: CountVectorizer
pipe1_params = {
    'cvec__max_features':[100,500],
    'cvec__stop_words':[None,'english'],
    'cvec__ngram_range':[(1,1),(1,2)]
}

pipe1 = Pipeline([
    ('cvec', CountVectorizer()),
    ('nb', MultinomialNB())
])

gs_1 = GridSearchCV(pipe1,
                    pipe1_params,
                    cv=5)
gs_1.fit(X_train,y_train)
print(gs_1.best_params_)

# Model evaluation
y_preds_train = gs_1.predict(X_train)
y_preds_test = gs_1.predict(X_test)
accuracy_list = score_class(y_train, y_preds_train, y_test, y_preds_test)
params = gs_1.best_params_
score_df_nb = score_to_df(params, accuracy_list, score_df_nb)
score_df_nb

{'cvec__max_features': 500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': 'english'}


Unnamed: 0,params,accuracy_train,accuracy_test
0,"{'cvec__max_features': 500, 'cvec__ngram_range...",0.608713,0.547177


In [35]:
# Grid search version of Naive Bayes
# Feature engineering: TfidfVectorizer
pipe1_params = {
    'tvec__max_features':[100,500],
    'tvec__stop_words':[None,'english'],
    'tvec__ngram_range':[(1,1),(1,2)]
}

pipe1 = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('nb', MultinomialNB())
])

gs_1 = GridSearchCV(pipe1,
                    pipe1_params,
                    cv=5)
gs_1.fit(X_train,y_train)
print(gs_1.best_params_)

# Model evaluation
y_preds_train = gs_1.predict(X_train)
y_preds_test = gs_1.predict(X_test)
accuracy_list = score_class(y_train, y_preds_train, y_test, y_preds_test)
params = gs_1.best_params_
score_df_nb = score_to_df(params, accuracy_list, score_df_nb)
score_df_nb

{'tvec__max_features': 500, 'tvec__ngram_range': (1, 2), 'tvec__stop_words': None}


Unnamed: 0,params,accuracy_train,accuracy_test
0,"{'cvec__max_features': 500, 'cvec__ngram_range...",0.608713,0.547177
1,"{'tvec__max_features': 500, 'tvec__ngram_range...",0.612882,0.547581


In [36]:
score_df_nb

Unnamed: 0,params,accuracy_train,accuracy_test
0,"{'cvec__max_features': 500, 'cvec__ngram_range...",0.608713,0.547177
1,"{'tvec__max_features': 500, 'tvec__ngram_range...",0.612882,0.547581


**16. Even though we didn't explore the full extent of Cambridge Analytica's modeling, based on what we did here, how effective was their approach at using Facebook data to model agreeableness?**

In [37]:
# The models that predict agreeableness aren't very effective: the test accuracy
# of both Naive Bayes and logistic regression is only about 2 percentage points
# higher than the baseline.

# Using Naive Bayes or logistic regression on statuses may not be good enough.