<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Tackling an NLP Problem with Naive Bayes

----

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

**Instructor's note: While I completely agree with the idea expressed below, we will be discussing Naive Bayes in class**

In this lab, we are going to apply a **new** modeling technique to natural language processing data.

> "But how can we apply a modeling technique we haven't learned?!"

The DSI program is great - but we can't teach you *everything* about data science in 12 weeks! This lab is designed to help you start learning something new without it being taught in a formal lesson. 

- Later in the cohort (like for your capstone!), you'll be exploring models, libraries, and resources that you haven't been explicitly taught.
- After the program, you'll want to continue developing your skills. Being comfortable with documentation and being confident in your ability to read something new and decide whether or not it is an appropriate method for the problem you're trying to solve is **incredibly** valuable.

### Step 1: Define the problem.

Many organizations have a substantial interest in classifying users of their product into groups. Some examples:
- A company that serves as a marketplace may want to predict who is likely to purchase a certain type of product on their platform, like books, cars, or food.
- An application developer may want to identify which individuals are willing to pay money for "bonus features" or to upgrade their app.
- A social media organization may want to identify who generates the highest rate of content that later goes "viral."

### Summary
In this lab, you're an engineer for Facebook. In recent years, the organization Cambridge Analytica gained worldwide notoriety for its use of Facebook data in an attempt to sway electoral outcomes.

Cambridge Analytica, an organization staffed with lots of Ph.D. researchers, used the Big5 personality groupings (also called OCEAN) to group people into one of 32 different groups.
- The five qualities measured by this personality assessment are:
    - **O**penness
    - **C**onscientiousness
    - **E**xtroversion
    - **A**greeableness
    - **N**euroticism
- Each person could be classified as "Yes" or "No" for each of the five qualities.
- This makes for 32 different potential combinations of qualities. ($2^5 = 32$)
- You don't have to check it out, but if you want to learn more about this personality assessment, head to [**the Wikipedia page**](https://en.wikipedia.org/wiki/Big_Five_personality_traits).
- There's also [**a short (3-4 pages) academic paper describing part of this approach**](./celli-al_wcpr13.pdf).
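
The count of 32 groupings can be checked by enumerating every Yes/No combination. This is a minimal sketch; the variable names are illustrative and do not come from the lab.

```python
from itertools import product

# The five OCEAN qualities, each classified as "Yes" or "No".
traits = ["Openness", "Conscientiousness", "Extroversion", "Agreeableness", "Neuroticism"]

# Every possible assignment of Yes/No across the five traits.
groupings = list(product(["Yes", "No"], repeat=len(traits)))

print(len(groupings))                   # number of groupings, 2**5
print(dict(zip(traits, groupings[0])))  # the first grouping: "Yes" for all five
```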

Cambridge Analytica's methodology was, roughly, the following:
- Gather a large amount of data from Facebook.
- Use this data to predict an individual's Big5 personality "grouping."
- Design political advertisements that would be particularly effective to that particular "grouping." (For example, are certain advertisements particularly effective toward people with specific personality traits?)

You want to know the **real-world problem**: "Is what Cambridge Analytica attempted to do actually possible, or is it junk science?"

However, we'll solve the related **data science problem**: "Are one's Facebook statuses predictive of whether or not one is agreeable?"
> Note: If Facebook statuses aren't predictive of one being agreeable (one of the OCEAN qualities), then Cambridge Analytica's approach won't work very well!

### Step 2: Obtain the data.

Obviously, there are plenty of opportunities to discuss the ethics surrounding this particular issue... so let's do that.

In [5]:
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.pipeline import Pipeline


In [6]:
data = pd.read_csv('./mypersonality_final.csv', encoding = 'ISO-8859-1')

In [7]:
data.head()

Unnamed: 0,#AUTHID,STATUS,sEXT,sNEU,sAGR,sCON,sOPN,cEXT,cNEU,cAGR,cCON,cOPN,DATE,NETWORKSIZE,BETWEENNESS,NBETWEENNESS,DENSITY,BROKERAGE,NBROKERAGE,TRANSITIVITY
0,b7b7764cfa1c523e4e93ab2a79a946c4,likes the sound of thunder.,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,06/19/09 03:21 PM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1
1,b7b7764cfa1c523e4e93ab2a79a946c4,is so sleepy it's not even funny that's she ca...,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,07/02/09 08:41 AM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1
2,b7b7764cfa1c523e4e93ab2a79a946c4,is sore and wants the knot of muscles at the b...,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,06/15/09 01:15 PM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1
3,b7b7764cfa1c523e4e93ab2a79a946c4,likes how the day sounds in this new song.,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,06/22/09 04:48 AM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1
4,b7b7764cfa1c523e4e93ab2a79a946c4,is home. <3,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,07/20/09 02:31 AM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1


**1. What is the difference between anonymity and confidentiality? All else held equal, which tends to keep people safer?**

In [9]:
# Anonymity: the state of having your identity hidden, so that it isn't visible to others. (sciencedirect.com)
# Example: an Internet Service Provider (ISP) or a site you've logged onto may know your identity even when you want to remain anonymous to everyone else.

# Confidentiality: the ability to protect data so that unauthorized parties cannot view it. (NIST)

# They are not the same, although both aim to protect data.
# Confidentiality: identifying information is collected, but it is kept secure and is not disclosed to unauthorized parties without consent.
# Anonymity: the identity behind the data is not known in the first place.

# All else held equal, anonymity tends to keep people safer: information that was never tied to an identity cannot expose that person if the data leaks.

**2. Suppose that the "unique identifier" in the above data, the `#AUTHID`, is a randomly generated key so that it can never be connected back to the original poster. Have we guaranteed anonymity here? Why or why not?**

In [11]:
# No, anonymity is not guaranteed.
# If the method used to create the key isn't secure (for example, the username is encoded in the AUTHID),
# someone who knows or reverse-engineers the encoding method can recover the poster's identity.
# Using cryptographically secure random keys improves on this.
# Even with a truly random key, the status text itself and columns such as DATE and NETWORKSIZE may still reveal who the poster is.
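
To illustrate the difference between a derived key and a random key, here is a small sketch. It assumes, hypothetically, that an identifier could be built by hashing a username; nothing shown in this lab says how `#AUTHID` was actually generated, and the function names are made up for the example.

```python
import hashlib
import secrets

def derived_id(username: str) -> str:
    """Hypothetical scheme: an ID derived deterministically from the username."""
    return hashlib.md5(username.encode("utf-8")).hexdigest()

def random_id() -> str:
    """A 32-character hex ID drawn from a cryptographically secure random source."""
    return secrets.token_hex(16)

# Anyone who guesses the scheme can recompute a derived ID from a candidate username.
leaked_id = derived_id("jane.doe")
print(leaked_id == derived_id("jane.doe"))  # the ID can be linked back to the username

# A random ID has no relationship to the username, so it cannot be recomputed.
print(random_id() == random_id())           # two draws differ with overwhelming probability
```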

**3. As an engineer for Facebook, you recognize that user data will be used by Facebook and by other organizations - that won't change. However, what are at least three recommendations you would bring to your manager to improve how data is used and shared? Be as specific as you can.**

In [13]:
# 1. Specify which types of data (such as location, mobile number, and profile information) require explicit user
#    acknowledgment before access is allowed, because exposing them can lead to accounts being hacked or users being stalked.
# 2. Implement zero-trust security using end-to-end encryption: data at rest, in transit, and during processing.
#    Enforce encryption for APIs and third-party integrations.
# 3. Enhance user privacy by requiring the user's acknowledgment before any data is shared with third parties.

### Step 3: Explore the data.

- Note: For our $X$ variable, we will only use the `STATUS` variable. For our $Y$ variable, we will only use the `cAGR` variable.

**4. Explore the data here.**
> We aren't explicitly asking you to do specific EDA here, but what EDA would you generally do with this data? Do the EDA you usually would, especially if you know what the goal of this analysis is.

In [16]:
X = data['STATUS']
y = data['cAGR']

In [17]:
data.shape

(9917, 20)

In [18]:
data.describe()

Unnamed: 0,sEXT,sNEU,sAGR,sCON,sOPN,NETWORKSIZE,BETWEENNESS,NBETWEENNESS,DENSITY,BROKERAGE,NBROKERAGE,TRANSITIVITY
count,9917.0,9917.0,9917.0,9917.0,9917.0,9917.0,9917.0,9917.0,9917.0,9917.0,9917.0,9916.0
mean,3.35476,2.609453,3.616643,3.474201,4.130386,429.37712,135425.3,94.66517,3.154012,137642.5,0.48992,0.128821
std,0.857578,0.760248,0.682485,0.737215,0.585672,428.760382,199433.8,5.506696,311.073343,201392.1,0.011908,0.106063
min,1.33,1.25,1.65,1.45,2.25,24.0,93.25,0.04,0.0,0.49,0.18,0.0
25%,2.71,2.0,3.14,3.0,3.75,196.0,16902.2,93.77,0.01,17982.0,0.49,0.06
50%,3.4,2.6,3.65,3.4,4.25,317.0,47166.9,96.44,0.02,48683.0,0.49,0.09
75%,4.0,3.05,4.15,4.0,4.55,633.0,196606.0,97.88,0.03,198186.0,0.5,0.17
max,5.0,4.75,5.0,5.0,5.0,29724.9,1251780.0,99.82,30978.0,1263790.0,0.5,0.63


In [19]:
data.columns

Index(['#AUTHID', 'STATUS', 'sEXT', 'sNEU', 'sAGR', 'sCON', 'sOPN', 'cEXT',
       'cNEU', 'cAGR', 'cCON', 'cOPN', 'DATE', 'NETWORKSIZE', 'BETWEENNESS',
       'NBETWEENNESS', 'DENSITY', 'BROKERAGE', 'NBROKERAGE', 'TRANSITIVITY'],
      dtype='object')

In [20]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9917 entries, 0 to 9916
Data columns (total 20 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   #AUTHID       9917 non-null   object 
 1   STATUS        9917 non-null   object 
 2   sEXT          9917 non-null   float64
 3   sNEU          9917 non-null   float64
 4   sAGR          9917 non-null   float64
 5   sCON          9917 non-null   float64
 6   sOPN          9917 non-null   float64
 7   cEXT          9917 non-null   object 
 8   cNEU          9917 non-null   object 
 9   cAGR          9917 non-null   object 
 10  cCON          9917 non-null   object 
 11  cOPN          9917 non-null   object 
 12  DATE          9917 non-null   object 
 13  NETWORKSIZE   9917 non-null   float64
 14  BETWEENNESS   9917 non-null   float64
 15  NBETWEENNESS  9917 non-null   float64
 16  DENSITY       9917 non-null   float64
 17  BROKERAGE     9917 non-null   float64
 18  NBROKERAGE    9917 non-null 

In [21]:
data.isnull().sum()

#AUTHID         0
STATUS          0
sEXT            0
sNEU            0
sAGR            0
sCON            0
sOPN            0
cEXT            0
cNEU            0
cAGR            0
cCON            0
cOPN            0
DATE            0
NETWORKSIZE     0
BETWEENNESS     0
NBETWEENNESS    0
DENSITY         0
BROKERAGE       0
NBROKERAGE      0
TRANSITIVITY    1
dtype: int64

In [22]:
data.dropna(inplace = True)

In [23]:
data.isnull().sum()

#AUTHID         0
STATUS          0
sEXT            0
sNEU            0
sAGR            0
sCON            0
sOPN            0
cEXT            0
cNEU            0
cAGR            0
cCON            0
cOPN            0
DATE            0
NETWORKSIZE     0
BETWEENNESS     0
NBETWEENNESS    0
DENSITY         0
BROKERAGE       0
NBROKERAGE      0
TRANSITIVITY    0
dtype: int64
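
Beyond the structural checks above, one EDA step that fits the goal of this analysis is to look at the class balance of the target and at status lengths by class. This is a minimal sketch on a hypothetical toy frame that reuses the lab's column names; the rows and labels are made up for illustration.

```python
import pandas as pd

# Hypothetical toy rows that reuse the lab's column names.
toy = pd.DataFrame({
    "STATUS": [
        "likes the sound of thunder.",
        "is home. <3",
        "thank you all for the birthday wishes!",
        "so proud of my team today",
    ],
    "cAGR": ["n", "n", "y", "y"],
})

# Length of each status in characters and in whitespace-separated words.
toy["status_length"] = toy["STATUS"].str.len()
toy["word_count"] = toy["STATUS"].str.split().str.len()

# Share of each class in the target.
print(toy["cAGR"].value_counts(normalize=True))

# Average status length per class.
print(toy.groupby("cAGR")[["status_length", "word_count"]].mean())
```

On the real data, the same two lines applied to `data` would show whether agreeable and non-agreeable users tend to write statuses of different lengths.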

**5. What is the difference between CountVectorizer and TFIDFVectorizer?**

In [25]:
# CountVectorizer transforms text into token counts that we can pass into a model.
# - Simple to understand: it just counts word occurrences
# - Intuitive representation of text data
# - Doesn't consider how important a word is relative to the rest of the corpus
# - Creates very sparse matrices
# - Ignores word order and relationships

# TfidfVectorizer computes TF-IDF, a score that tells us which words are important to one document relative to all other documents.

# TF-IDF "penalizes" common words that appear across many documents (high DF)
# by giving them a lower IDF score.
# TF-IDF helps identify terms that are both:
#   - Important to specific documents (high TF)
#   - Distinctive across the corpus (high IDF)


**6. What are stopwords?**

In [27]:
#Some words are so common that they may not provide legitimate information about the variable we're trying to predict.

**7. Give an example of when you might remove stopwords.**

In [29]:
# Stopwords such as "and," "the," "is," and "of" are commonly removed from text we're going to analyze
# when they carry little information about the target. For example, in sentiment analysis,
# the majority of these words have neither a positive nor a negative sentiment.

**8. Give an example of when you might keep stopwords in your model.**

In [31]:
# Stopwords play a role in sentence structure and coherence.
# In some Natural Language Processing (NLP) tasks,
# stopwords should be preserved to avoid losing valuable contextual information.

# For tasks like language translation or text summarization,
# the presence of stopwords is necessary for conveying the correct meaning and interpretation.

#https://botpenguin.com/glossary/stop-words
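
A small sketch of how removing stopwords can discard meaning: sklearn's built-in English stopword list includes "not," so two statuses with opposite meanings can end up with identical features. The two statuses are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical statuses with opposite meanings.
statuses = ["this is good", "this is not good"]

with_stopwords = CountVectorizer().fit(statuses)
without_stopwords = CountVectorizer(stop_words="english").fit(statuses)

print(sorted(with_stopwords.vocabulary_))     # vocabulary still contains "not"
print(sorted(without_stopwords.vocabulary_))  # "not" has been removed

# After stopword removal, both statuses map to the same feature vector.
vectors = without_stopwords.transform(statuses).toarray()
print((vectors[0] == vectors[1]).all())
```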

### Step 4: Model the data.

We are going to fit two types of models: a logistic regression and a [**Naive Bayes classifier**](https://scikit-learn.org/stable/modules/naive_bayes.html).

**Reminder:** We will only use the feature `STATUS` to model `cAGR`.

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

In [34]:
set_features = {
    "cvec1": CountVectorizer(max_features=100, stop_words='english', ngram_range=(1, 2)),
    "cvec2": CountVectorizer(max_features=100, stop_words='english', ngram_range=(1, 1)),
    "cvec3": CountVectorizer(max_features=100, ngram_range=(1, 2)),
    "cvec4": CountVectorizer(max_features=100, ngram_range=(1, 1)),
    "cvec5": CountVectorizer(max_features=500, stop_words='english', ngram_range=(1, 2)),
    "cvec6": CountVectorizer(max_features=500, stop_words='english', ngram_range=(1, 1)),
    "cvec7": CountVectorizer(max_features=500, ngram_range=(1, 2)),
    "cvec8": CountVectorizer(max_features=500, ngram_range=(1, 1)),
    "tf1": TfidfVectorizer(max_features=100, stop_words='english', ngram_range=(1, 2)),
    "tf2": TfidfVectorizer(max_features=100, stop_words='english', ngram_range=(1, 1)),
    "tf3": TfidfVectorizer(max_features=100, ngram_range=(1, 2)),
    "tf4": TfidfVectorizer(max_features=100, ngram_range=(1, 1)),
    "tf5": TfidfVectorizer(max_features=500, stop_words='english', ngram_range=(1, 2)),
    "tf6": TfidfVectorizer(max_features=500, stop_words='english', ngram_range=(1, 1)),
    "tf7": TfidfVectorizer(max_features=500, ngram_range=(1, 2)),
    "tf8": TfidfVectorizer(max_features=500, ngram_range=(1, 1))
}

for name, set_feature in set_features.items():
    # Use a separate name so the original X (the STATUS column) is not overwritten.
    X_train_vec = set_feature.fit_transform(X_train)
    print(f"Configuration: {name}")
    #print(f"Feature Names: {set_feature.get_feature_names_out()}\n")

Configuration: cvec1
Configuration: cvec2
Configuration: cvec3
Configuration: cvec4
Configuration: cvec5
Configuration: cvec6
Configuration: cvec7
Configuration: cvec8
Configuration: tf1
Configuration: tf2
Configuration: tf3
Configuration: tf4
Configuration: tf5
Configuration: tf6
Configuration: tf7
Configuration: tf8


### We want to attempt to fit our models on sixteen sets of features:

1. CountVectorizer with 100 features, with English stopwords removed and with an `ngram_range` that includes 1 and 2.
2. CountVectorizer with 100 features, with English stopwords removed and with the default `ngram_range`.
3. CountVectorizer with 100 features, with English stopwords kept in and with an `ngram_range` that includes 1 and 2.
4. CountVectorizer with 100 features, with English stopwords kept in and with the default `ngram_range`.
5. CountVectorizer with 500 features, with English stopwords removed and with an `ngram_range` that includes 1 and 2.
6. CountVectorizer with 500 features, with English stopwords removed and with the default `ngram_range`.
7. CountVectorizer with 500 features, with English stopwords kept in and with an `ngram_range` that includes 1 and 2.
8. CountVectorizer with 500 features, with English stopwords kept in and with the default `ngram_range`.
9. TFIDFVectorizer with 100 features, with English stopwords removed and with an `ngram_range` that includes 1 and 2.
10. TFIDFVectorizer with 100 features, with English stopwords removed and with the default `ngram_range`.
11. TFIDFVectorizer with 100 features, with English stopwords kept in and with an `ngram_range` that includes 1 and 2.
12. TFIDFVectorizer with 100 features, with English stopwords kept in and with the default `ngram_range`.
13. TFIDFVectorizer with 500 features, with English stopwords removed and with an `ngram_range` that includes 1 and 2.
14. TFIDFVectorizer with 500 features, with English stopwords removed and with the default `ngram_range`.
15. TFIDFVectorizer with 500 features, with English stopwords kept in and with an `ngram_range` that includes 1 and 2.
16. TFIDFVectorizer with 500 features, with English stopwords kept in and with the default `ngram_range`.

**9. Rather than manually instantiating 16 different vectorizers, what `sklearn` class have we learned about that might make this easier? Use it.**

In [86]:
# GridSearchCV can search over vectorizer hyperparameters when the vectorizer and the model are combined in a Pipeline.
# Search over the following values of hyperparameters:
# Maximum number of features fit: 2000, 3000, 4000, 5000
# Minimum number of documents a token must appear in to be included: 2, 3
# Maximum share of documents a token may appear in to be included: 90%, 95%
# Check unigrams only (individual tokens) and also unigrams plus bigrams (individual tokens and 2-grams).
pipe = Pipeline([
    ('cvec', CountVectorizer()),  # transformer (fit, transform)
    ('nb', MultinomialNB())       # estimator or model (fit, predict)
])

pipe_params = {
    'cvec__max_features': [2_000, 3_000, 4_000, 5_000],
    'cvec__min_df': [2,3],
    'cvec__max_df': [.90, .95],
    'cvec__ngram_range': [(1,1), (1,2)]
}

# ngram_range of (1,1) just returns individual tokens
# ngram_range of (1,2) returns single tokens (unigrams) AND bi-grams

# Instantiate GridSearchCV
gs = GridSearchCV(pipe, # what object are we optimizing?
                  param_grid=pipe_params, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.

In [88]:
gs.fit(X_train, y_train)

In [90]:
gs.best_score_

0.5827642435154855

In [92]:
gs.score(X_train, y_train)

0.7837837837837838

In [94]:
gs.score(X_test, y_test)

0.589516129032258
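
The grid searched above differs from the sixteen feature sets listed earlier. One possible way to cover exactly those sixteen is to treat the vectorizer itself as a hyperparameter of the pipeline. This is a sketch under assumptions: the names `pipe_16`, `params_16`, and `gs_16` and the step name `vec` are made up for illustration, and the search is constructed but not fitted here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# The step name "vec" is arbitrary; the vectorizer itself is treated as a hyperparameter.
pipe_16 = Pipeline([
    ("vec", CountVectorizer()),
    ("nb", MultinomialNB()),
])

params_16 = {
    "vec": [CountVectorizer(), TfidfVectorizer()],
    "vec__max_features": [100, 500],
    "vec__stop_words": ["english", None],
    "vec__ngram_range": [(1, 2), (1, 1)],
}

# 2 vectorizers x 2 feature limits x 2 stopword settings x 2 n-gram ranges.
print(len(ParameterGrid(params_16)))

# Not fitted here; calling gs_16.fit(X_train, y_train) would run the search.
gs_16 = GridSearchCV(pipe_16, param_grid=params_16, cv=5)
```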

**10. What are some of the advantages of fitting a logistic regression model?**

In [None]:
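# Logistic regression is simple and understandable.
# Each coefficient is the change in the log-odds of the positive class for a one-unit increase in that feature,
# and exponentiating a coefficient gives an odds ratio.
# The predicted log-odds can be transformed into probabilities, which makes the model's predictions interpretable.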

**11. Fit a logistic regression model and compare it to the baseline.**

In [44]:
data['cAGR'].value_counts(normalize=True).mul(100).round(2)

cAGR
y    53.12
n    46.88
Name: proportion, dtype: float64

In [46]:
pipeline = Pipeline([
            ('cvec', CountVectorizer()),
            ('lr', LogisticRegression())
        ])

# Cross-validated accuracy on the training data (3 folds)
cv_score = cross_val_score(pipeline, X_train, y_train, cv=3).mean()

# Fit your model
pipeline.fit(X_train, y_train)

# Training score
train_score = pipeline.score(X_train, y_train)

# Test score
test_score = pipeline.score(X_test, y_test)

# Predictions on the test set
y_pred = pipeline.predict(X_test)


In [55]:
pipeline.score(X_train, y_train)

0.9148850342880194

In [57]:
pipeline.score(X_test, y_test)

0.5838709677419355
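
To make the comparison with the baseline explicit, the scores reported above can be put side by side. This is a small arithmetic sketch; it uses the majority-class share computed on the full dataset as the baseline, which should be close to the test-set share because the split was stratified.

```python
# Scores copied from the outputs above.
baseline_accuracy = 0.5312  # share of the majority class ("y") in cAGR
logreg_train = 0.9148850342880194
logreg_test = 0.5838709677419355

# Improvement over always predicting the majority class.
lift_over_baseline = logreg_test - baseline_accuracy

# Gap between training and test accuracy, a rough indicator of overfitting.
train_test_gap = logreg_train - logreg_test

print(f"Lift over baseline: {lift_over_baseline:.3f}")
print(f"Train/test gap:     {train_test_gap:.3f}")
```

The logistic regression beats the baseline by only a few percentage points on the test set, while the gap between training and test accuracy is much larger, which suggests overfitting.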

### Summary of Naive Bayes 

Naive Bayes is a classification technique that relies on probability to classify observations.
- It's based on a probability rule called **Bayes' Theorem**... thus, "**Bayes**."
- It makes an assumption that isn't often met, so it's "**naive**."

Despite being a model that relies on a naive assumption, it often performs pretty well! (This is kind of like linear regression... we aren't always guaranteed homoscedastic errors in linear regression, but the model might still do a good job regardless.)
- [**Interested in the details?**](https://www.cs.unb.ca/~hzhang/publications/FLAIRS04ZhangH.pdf)


The [**sklearn documentation**](https://scikit-learn.org/stable/modules/naive_bayes.html) is here, but it can be intimidating. So, to quickly summarize the Bayes and Naive parts of the model...

#### Bayes' Theorem
Bayes' Theorem relates the conditional probability $P(A|B)$ to $P(B|A)$. (Don't worry; we won't be doing any probability calculations by hand! However, you may want to refresh your memory on conditional probability from our earlier lessons if you forget what a conditional probability is.)

$$
\begin{eqnarray*}
\text{Bayes' Theorem: } P(A|B) &=& \frac{P(B|A)P(A)}{P(B)}
\end{eqnarray*}
$$

- Let $A$ be that someone is "agreeable," like the OCEAN category.
- Let $B$ represent the words used in their Facebook post.

$$
\begin{eqnarray*}
\text{Bayes' Theorem: } P(A|B) &=& \frac{P(B|A)P(A)}{P(B)} \\
\Rightarrow P(\text{person is agreeable}|\text{words in Facebook post}) &=& \frac{P(\text{words in Facebook post}|\text{person is agreeable})P(\text{person is agreeable})}{P(\text{words in Facebook post})}
\end{eqnarray*}
$$

We want to calculate the probability that someone is agreeable **given** the words that they used in their Facebook post, and this is exactly what our model does. Rather than calculating this probability by hand, the calculation is done under the hood, and we can see the results by checking `.predict_proba()`. The model calculates the pieces on the right-hand side of the equation to give us a probability estimate of how likely someone is to be agreeable given their Facebook post.
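
A worked numeric example may make the formula more concrete. The prior below comes from the class balance shown earlier (53.12% agreeable); the two likelihoods are hypothetical numbers chosen only for illustration, and `posterior` is a helper written for this sketch.

```python
def posterior(prior_a: float, p_b_given_a: float, p_b_given_not_a: float) -> float:
    """Return P(A|B) using Bayes' Theorem, with P(B) from the law of total probability."""
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / p_b

# Prior taken from the class balance shown earlier (53.12% agreeable).
p_agreeable = 0.5312

# Hypothetical likelihoods of seeing some word B in a post.
p_word_given_agreeable = 0.02
p_word_given_not_agreeable = 0.01

p_agreeable_given_word = posterior(
    p_agreeable, p_word_given_agreeable, p_word_given_not_agreeable
)
print(round(p_agreeable_given_word, 4))
```

Because the word is assumed to be twice as likely in posts by agreeable people, observing it raises the probability of "agreeable" above the prior.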

#### Naive Assumption

If our goal is to estimate $P(\text{person is agreeable}|\text{words in Facebook post})$, that can be quite tricky.

---

<details><summary>Bonus: if you want to understand why that's complicated, click here.</summary>
    
- The event $\text{"words in Facebook post"}$ is a complicated event to calculate.

- If a Facebook post has 100 words in it, then the event $\text{"words in Facebook post"} = \text{"word 1 is in the Facebook post" and "word 2 is in the Facebook post" and }\ldots \text{ and "word 100 is in the Facebook post"}$.

- To calculate the joint probability of all 100 words being in the Facebook post gets complicated pretty quickly. (Refer back to the probability notes on how to calculate the joint probability of two events if you want to see more.)
</details>

---

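To simplify matters, we make an assumption: **we assume that all of our features are independent of one another.** (More precisely, the features are assumed to be independent of one another given the class, for example given that the person is agreeable.)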

In some contexts, this assumption might be realistic!
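
Under the independence assumption, the likelihood of a whole post factors into a product of per-word probabilities, which is what makes the calculation tractable. Here is a from-scratch sketch on hypothetical toy data. It is not sklearn's implementation; it uses add-one smoothing and simply skips words that never appeared in training.

```python
import math
from collections import Counter

# Hypothetical labeled statuses ("y" = agreeable, "n" = not agreeable).
train = [
    ("thank you so much love it", "y"),
    ("love this great day", "y"),
    ("hate this so much", "n"),
    ("so tired of this", "n"),
]

# Count documents per class and word occurrences per class.
class_counts = Counter(label for _, label in train)
word_counts = {label: Counter() for label in class_counts}
for text, label in train:
    word_counts[label].update(text.split())
vocab = {word for counts in word_counts.values() for word in counts}

def log_score(text: str, label: str, alpha: float = 1.0) -> float:
    """log P(label) + sum of log P(word | label), with add-alpha smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / len(train))
    for word in text.split():
        if word in vocab:  # words never seen in training are skipped
            p_word = (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
            score += math.log(p_word)
    return score

def predict(text: str) -> str:
    """Return the class with the highest log score."""
    return max(class_counts, key=lambda label: log_score(text, label))

print(predict("love you"))
print(predict("so tired"))
```

Working in log space turns the product of many small probabilities into a sum, which avoids numerical underflow on long posts.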

**12. Why would this assumption not be realistic with NLP data?**

In [39]:
# Text data is never independent! 
# Certain words can change the context of a sentence when used with other words. 
# The way language works, we have words that are more or less likely to follow other words.

Despite this assumption not being realistic with NLP data, we still use Naive Bayes pretty frequently.
- It's a very fast modeling algorithm, which is especially helpful when we have lots of features and/or lots of data.
- It is often a strong classifier and can sometimes outperform more complicated models.

There are three common types of Naive Bayes models: Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes.
- How do we pick which of the three models to use? It depends on our $X$ variable.
    - Bernoulli Naive Bayes is appropriate when our features are all 0/1 variables.
        - [**Bernoulli NB Documentation**](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB)
    - Multinomial Naive Bayes is appropriate when our features are variables that take on only non-negative integer counts.
        - [**Multinomial NB Documentation**](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB)
    - Gaussian Naive Bayes is appropriate when our features are Normally distributed variables. (Realistically, though, we kind of use Gaussian whenever neither Bernoulli nor Multinomial works.)
        - [**Gaussian NB Documentation**](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB)
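
The pairing of feature type and model can be sketched on hypothetical toy data. Fitting `GaussianNB` on word counts is shown only to illustrate the API (it needs a dense array); it is not a recommendation for text data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical toy data.
texts = ["love love this", "thank you so much", "hate hate this", "so tired"]
labels = ["y", "y", "n", "n"]

# 0/1 indicators of whether each word is present -> Bernoulli.
binary = CountVectorizer(binary=True).fit_transform(texts)
bnb = BernoulliNB().fit(binary, labels)

# Non-negative integer word counts -> Multinomial.
counts = CountVectorizer().fit_transform(texts)
mnb = MultinomialNB().fit(counts, labels)

# GaussianNB expects a dense array, so the sparse matrix is converted first.
gnb = GaussianNB().fit(counts.toarray(), labels)

print(binary.max(), counts.max())  # largest value in each feature matrix
print(mnb.predict(counts[:1]))     # prediction for the first status
```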

**13. Suppose you CountVectorized your features. Which Naive Bayes model would be most appropriate to fit? Why? Fit it.**

In [51]:
# Multinomial Naive Bayes, because CountVectorizer turns each Facebook status into word counts.
# Multinomial Naive Bayes is suitable for text classification with count features.
# Example features: "money" (appears 3 times), "winner" (appears 2 times), "urgent" (appears 1 time)
# Each feature counts how many times a word appears.
# Multinomial: count data (0, 1, 2, ...)


# 1. CountVectorizer (transformer) - converts text to numerical features
# 2. Multinomial Naive Bayes (estimator) - trains a Naive Bayes classifier on the vectorized text

**14. Suppose you TFIDFVectorized your features. Which Naive Bayes model would be most appropriate to fit? Why? Fit it.**

In [53]:
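# Multinomial Naive Bayes, the same as for CountVectorized features.
# TF-IDF values are non-negative but fractional rather than integer counts; the sklearn documentation
# notes that fractional counts such as tf-idf may also work with MultinomialNB in practice.

# TfidfVectorizer can perform better for classification tasks where term importance across documents is relevant.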

**15. Compare the performance of your models.**

In [59]:
pipe_tvec = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('nb', MultinomialNB())])



In [61]:
pipe_tvec_params = {
    'tvec__max_features': [2_000, 3_000, 4_000, 5_000],
    'tvec__stop_words': [None, "english"],
    'tvec__ngram_range': [(1,1), (1,2)]
}

In [63]:
gs_tvec = GridSearchCV(estimator=pipe_tvec, 
                       param_grid=pipe_tvec_params, 
                       cv=5) # 5-fold cross validation

In [65]:
gs_tvec.fit(X_train, y_train)

In [68]:
gs_tvec.best_params_

{'tvec__max_features': 5000,
 'tvec__ngram_range': (1, 1),
 'tvec__stop_words': 'english'}

In [70]:
gs_tvec.score(X_train, y_train)

0.8022051902648918

In [72]:
gs_tvec.score(X_test, y_test)

0.589516129032258

In [96]:
gs_tvec.best_score_

0.5880072636686406

In [102]:
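# Both vectorized Naive Bayes models overfit,
# because there is a large gap between the training and test scores:
# CountVectorizer + MultinomialNB: train 0.784, test 0.590
# TfidfVectorizer + MultinomialNB: train 0.802, test 0.590
# The logistic regression overfits even more (train 0.915, test 0.584).
# All three test scores are only a few points above the 53.12% baseline.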

**16. Even though we didn't explore the full extent of Cambridge Analytica's modeling, based on what we did here, how effective was their approach at using Facebook data to model agreeableness?**

In [100]:
# The models overfit the training data.
# Using only the words in a status, the models predict the agreeableness class (cAGR) at about 58-59% test accuracy,
# compared with a baseline of about 53%.
# Statuses therefore carry some signal about agreeableness, but based on this lab the approach was only modestly effective.