<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Tackling an NLP Problem with Naive Bayes

----

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

**Instructor's note: While I completely agree with the idea expressed below, we will be discussing Naive Bayes in class**

In this lab, we are going to apply a **new** modeling technique to natural language processing data.

> "But how can we apply a modeling technique we haven't learned?!"

The DSI program is great - but we can't teach you *everything* about data science in 12 weeks! This lab is designed to help you start learning something new without it being taught in a formal lesson. 

- Later in the cohort (like for your capstone!), you'll be exploring models, libraries, and resources that you haven't been explicitly taught.
- After the program, you'll want to continue developing your skills. Being comfortable with documentation and being confident in your ability to read something new and decide whether or not it is an appropriate method for the problem you're trying to solve is **incredibly** valuable.

### Step 1: Define the problem.

Many organizations have a substantial interest in classifying users of their product into groups. Some examples:
- A company that serves as a marketplace may want to predict who is likely to purchase a certain type of product on their platform, like books, cars, or food.
- An application developer may want to identify which individuals are willing to pay money for "bonus features" or to upgrade their app.
- A social media organization may want to identify who generates the highest rate of content that later goes "viral."

### Summary
In this lab, you're an engineer for Facebook. In recent years, the organization Cambridge Analytica gained worldwide notoriety for its use of Facebook data in an attempt to sway electoral outcomes.

Cambridge Analytica, an organization staffed with lots of Ph.D. researchers, used the Big5 personality groupings (also called OCEAN) to group people into one of 32 different groups.
- The five qualities measured by this personality assessment are:
    - **O**penness
    - **C**onscientiousness
    - **E**xtroversion
    - **A**greeableness
    - **N**euroticism
- Each person could be classified as "Yes" or "No" for each of the five qualities.
- This makes for 32 different potential combinations of qualities. ($2^5 = 32$)
- You don't have to check it out, but if you want to learn more about this personality assessment, head to [**the Wikipedia page**](https://en.wikipedia.org/wiki/Big_Five_personality_traits).
- There's also [**a short (3-4 pages) academic paper describing part of this approach**](./celli-al_wcpr13.pdf).
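
To make the count of 32 groupings concrete, here is a minimal sketch that enumerates every yes/no combination of the five traits. The variable names (`traits`, `groupings`) are illustrative and are not part of the lab's dataset.

```python
from itertools import product

traits = ["Openness", "Conscientiousness", "Extroversion", "Agreeableness", "Neuroticism"]

# Each trait is labeled "y" or "n", so there are 2**5 possible combinations.
groupings = list(product("yn", repeat=len(traits)))

print(len(groupings))
# Show one grouping as a trait -> label mapping.
print(dict(zip(traits, groupings[0])))
```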

Cambridge Analytica's methodology was, roughly, the following:
- Gather a large amount of data from Facebook.
- Use this data to predict an individual's Big5 personality "grouping."
- Design political advertisements tailored to each "grouping." (For example, are certain advertisements particularly effective toward people with specific personality traits?)

You want to answer the **real-world problem**: "Is what Cambridge Analytica attempted to do actually possible, or is it junk science?"

However, we'll solve the related **data science problem**: "Are one's Facebook statuses predictive of whether or not one is agreeable?"
> Note: If Facebook statuses aren't predictive of one being agreeable (one of the OCEAN qualities), then Cambridge Analytica's approach won't work very well!

### Step 2: Obtain the data.

Obviously, there are plenty of opportunities to discuss the ethics surrounding this particular issue... so let's do that.

In [5]:
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from nltk.corpus import stopwords


In [6]:
data = pd.read_csv('./mypersonality_final.csv', encoding = 'ISO-8859-1')

In [7]:
data.head()

|   | #AUTHID | STATUS | sEXT | sNEU | sAGR | sCON | sOPN | cEXT | cNEU | cAGR | cCON | cOPN | DATE | NETWORKSIZE | BETWEENNESS | NBETWEENNESS | DENSITY | BROKERAGE | NBROKERAGE | TRANSITIVITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b7b7764cfa1c523e4e93ab2a79a946c4 | likes the sound of thunder. | 2.65 | 3.0 | 3.15 | 3.25 | 4.4 | n | y | n | n | y | 06/19/09 03:21 PM | 180.0 | 14861.6 | 93.29 | 0.03 | 15661.0 | 0.49 | 0.1 |
| 1 | b7b7764cfa1c523e4e93ab2a79a946c4 | is so sleepy it's not even funny that's she ca... | 2.65 | 3.0 | 3.15 | 3.25 | 4.4 | n | y | n | n | y | 07/02/09 08:41 AM | 180.0 | 14861.6 | 93.29 | 0.03 | 15661.0 | 0.49 | 0.1 |
| 2 | b7b7764cfa1c523e4e93ab2a79a946c4 | is sore and wants the knot of muscles at the b... | 2.65 | 3.0 | 3.15 | 3.25 | 4.4 | n | y | n | n | y | 06/15/09 01:15 PM | 180.0 | 14861.6 | 93.29 | 0.03 | 15661.0 | 0.49 | 0.1 |
| 3 | b7b7764cfa1c523e4e93ab2a79a946c4 | likes how the day sounds in this new song. | 2.65 | 3.0 | 3.15 | 3.25 | 4.4 | n | y | n | n | y | 06/22/09 04:48 AM | 180.0 | 14861.6 | 93.29 | 0.03 | 15661.0 | 0.49 | 0.1 |
| 4 | b7b7764cfa1c523e4e93ab2a79a946c4 | is home. <3 | 2.65 | 3.0 | 3.15 | 3.25 | 4.4 | n | y | n | n | y | 07/20/09 02:31 AM | 180.0 | 14861.6 | 93.29 | 0.03 | 15661.0 | 0.49 | 0.1 |


**1. What is the difference between anonymity and confidentiality? All else held equal, which tends to keep people safer?**

In [9]:
# Anonymity means no identifiers are collected or linked to the data, so nobody (not even the researchers) can tell who a record belongs to,
# while confidentiality means identifiable data is collected but access to it is restricted and kept secure.
# All else held equal, anonymity tends to keep people safer, because there is no identifying information that could leak.

**2. Suppose that the "unique identifier" in the above data, the `#AUTHID`, is a randomly generated key so that it can never be connected back to the original poster. Have we guaranteed anonymity here? Why or why not?**

In [11]:
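# No. Even if #AUTHID itself can never be traced back, the other columns can still identify someone:
# the text of a status may name people, places, or events, and DATE and the network measures narrow things down further.
# The shared key also links all of one person's posts together, which makes re-identification easier.
# Anonymity means that nothing in the data can be traced back to the individual.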

**3. As an engineer for Facebook, you recognize that user data will be used by Facebook and by other organizations - that won't change. However, what are at least three recommendations you would bring to your manager to improve how data is used and shared? Be as specific as you can.**

In [13]:
# Improve Data Retention and Deletion Policies:
# Make sure that user data is only kept for as long as necessary. Also, 
# give users an easy way to ask for their data to be deleted, ensuring it's fully removed from Facebook and its partners.

# Enhance Data Encryption:
# Use strong encryption for sensitive data, both when it's being sent and when it's stored, 
# to prevent unauthorized access and keep user information safe.

# Minimize Third-Party Data Sharing:
# Only share user data with third parties who really need it. 
# Regularly check that all third parties are following Facebook's privacy rules and protecting user data properly.

### Step 3: Explore the data.

- Note: For our $X$ variable, we will only use the `STATUS` variable. For our $Y$ variable, we will only use the `cAGR` variable.

**4. Explore the data here.**
> We aren't explicitly asking you to do specific EDA here, but what EDA would you generally do with this data? Do the EDA you usually would, especially if you know what the goal of this analysis is.

In [16]:
data.isnull().sum()[data.isnull().sum()!=0]

TRANSITIVITY    1
dtype: int64

In [17]:
data.shape

(9917, 20)

In [18]:
data['cAGR'].unique()

array(['n', 'y'], dtype=object)

In [19]:
data['cAGR'] = data['cAGR'].map({'n':0, 'y':1})

In [20]:
data['cAGR'].value_counts(normalize = True)

cAGR
1    0.531209
0    0.468791
Name: proportion, dtype: float64

In [21]:
X = data['STATUS']
y = data['cAGR']

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    stratify=y,
                                                    random_state=42)

In [23]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((6644,), (3273,), (6644,), (3273,))

In [24]:
y_train.value_counts(normalize=True)

cAGR
1    0.531156
0    0.468844
Name: proportion, dtype: float64

In [25]:
y_test.value_counts(normalize=True)

cAGR
1    0.531317
0    0.468683
Name: proportion, dtype: float64
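
As one more piece of EDA that fits question 4, it can help to check whether status length differs between the two classes. The sketch below uses a tiny hypothetical DataFrame (`toy`) with the same column names as the lab data; the rows and the new column names are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the lab's DataFrame (hypothetical rows, same column names).
toy = pd.DataFrame({
    "STATUS": [
        "likes the sound of thunder.",
        "is home. <3",
        "is so sleepy",
        "wants coffee right now please",
    ],
    "cAGR": [0, 0, 1, 1],
})

# Character and word counts per status.
toy["status_char_len"] = toy["STATUS"].str.len()
toy["status_word_count"] = toy["STATUS"].str.split().str.len()

# Average length per class; large gaps would hint that length carries signal.
length_summary = toy.groupby("cAGR")[["status_char_len", "status_word_count"]].mean()
print(length_summary)
```

On the real data, the same two lines applied to `data['STATUS']` would give the per-class averages.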

**5. What is the difference between CountVectorizer and TFIDFVectorizer?**

In [27]:
# CountVectorizer: converts a collection of text documents into a matrix of token counts (frequency of each word in the document).

# TFIDFVectorizer: transforms text data into a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) scores, 
# which measure the importance of a word in a document relative to the entire corpus.


**6. What are stopwords?**

In [29]:
# Stopwords are words that are considered irrelevant or unnecessary for certain types
# of text analysis because they occur very frequently in a language.

# Example  
# Articles: "the", "a", "an"
# Prepositions: "in", "on", "at"
# Conjunctions: "and", "but", "or"
# Pronouns: "he", "she", "it", "they"
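
To see the effect in practice, the sketch below fits `CountVectorizer` on one made-up sentence with and without scikit-learn's built-in English list (`ENGLISH_STOP_WORDS`). Note that this list may differ from the NLTK list imported earlier in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Membership checks against scikit-learn's built-in English stopword list.
print("the" in ENGLISH_STOP_WORDS, "thunder" in ENGLISH_STOP_WORDS)

# Hypothetical status for illustration only.
toy_status = ["she likes the sound of thunder and rain"]

with_stop = CountVectorizer().fit(toy_status)
without_stop = CountVectorizer(stop_words="english").fit(toy_status)

# The second vocabulary is smaller because the stopwords were dropped.
print(sorted(with_stop.vocabulary_))
print(sorted(without_stop.vocabulary_))
```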

**7. Give an example of when you might remove stopwords.**

In [31]:
# In sentiment analysis, we focus on key words like "happy," "disappointed," or "awesome"
# We remove stopwords to help the model focus on important words and improve efficiency.

**8. Give an example of when you might keep stopwords in your model.**

In [33]:
# In authorship or writing-style analysis, we keep stopwords because how often someone uses function words
# (pronouns, prepositions, conjunctions) can itself help distinguish between authors or categories.

### Step 4: Model the data.

We are going to fit two types of models: a logistic regression and a [**Naive Bayes classifier**](https://scikit-learn.org/stable/modules/naive_bayes.html).

**Reminder:** We will only use the feature `STATUS` to model `cAGR`.

### We want to attempt to fit our models on sixteen sets of features:

1. CountVectorizer with 100 features, with English stopwords removed and with an `ngram_range` that includes 1 and 2.
2. CountVectorizer with 100 features, with English stopwords removed and with the default `ngram_range`.
3. CountVectorizer with 100 features, with English stopwords kept in and with an `ngram_range` that includes 1 and 2.
4. CountVectorizer with 100 features, with English stopwords kept in and with the default `ngram_range`.
5. CountVectorizer with 500 features, with English stopwords removed and with an `ngram_range` that includes 1 and 2.
6. CountVectorizer with 500 features, with English stopwords removed and with the default `ngram_range`.
7. CountVectorizer with 500 features, with English stopwords kept in and with an `ngram_range` that includes 1 and 2.
8. CountVectorizer with 500 features, with English stopwords kept in and with the default `ngram_range`.
9. TFIDFVectorizer with 100 features, with English stopwords removed and with an `ngram_range` that includes 1 and 2.
10. TFIDFVectorizer with 100 features, with English stopwords removed and with the default `ngram_range`.
11. TFIDFVectorizer with 100 features, with English stopwords kept in and with an `ngram_range` that includes 1 and 2.
12. TFIDFVectorizer with 100 features, with English stopwords kept in and with the default `ngram_range`.
13. TFIDFVectorizer with 500 features, with English stopwords removed and with an `ngram_range` that includes 1 and 2.
14. TFIDFVectorizer with 500 features, with English stopwords removed and with the default `ngram_range`.
15. TFIDFVectorizer with 500 features, with English stopwords kept in and with an `ngram_range` that includes 1 and 2.
16. TFIDFVectorizer with 500 features, with English stopwords kept in and with the default `ngram_range`.

**9. Rather than manually instantiating 16 different vectorizers, what `sklearn` class have we learned about that might make this easier? Use it.**

In [36]:
# Define the vectorizers
vectorizers = {
    'cvec_100_stopwords_bgram': CountVectorizer(max_features=100, stop_words='english', ngram_range=(1, 2)),
    'cvec_100_stopwords_ugram': CountVectorizer(max_features=100, stop_words='english'),
    'cvec_100_nostopwords_bgram': CountVectorizer(max_features=100, stop_words=None, ngram_range=(1, 2)),
    'cvec_100_nostopwords_ugram': CountVectorizer(max_features=100, stop_words=None),
    'cvec_500_stopwords_bgram': CountVectorizer(max_features=500, stop_words='english', ngram_range=(1, 2)),
    'cvec_500_stopwords_ugram': CountVectorizer(max_features=500, stop_words='english'),
    'cvec_500_nostopwords_bgram': CountVectorizer(max_features=500, stop_words=None, ngram_range=(1, 2)),
    'cvec_500_nostopwords_ugram': CountVectorizer(max_features=500, stop_words=None),
    'tvec_100_stopwords_bgram': TfidfVectorizer(max_features=100, stop_words='english', ngram_range=(1, 2)),
    'tvec_100_stopwords_ugram': TfidfVectorizer(max_features=100, stop_words='english'),
    'tvec_100_nostopwords_bgram': TfidfVectorizer(max_features=100, stop_words=None, ngram_range=(1, 2)),
    'tvec_100_nostopwords_ugram': TfidfVectorizer(max_features=100, stop_words=None),
    'tvec_500_stopwords_bgram': TfidfVectorizer(max_features=500, stop_words='english', ngram_range=(1, 2)),
    'tvec_500_stopwords_ugram': TfidfVectorizer(max_features=500, stop_words='english'),
    'tvec_500_nostopwords_bgram': TfidfVectorizer(max_features=500, stop_words=None, ngram_range=(1, 2)),
    'tvec_500_nostopwords_ugram': TfidfVectorizer(max_features=500, stop_words=None),
}

In [37]:
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),  
    ('classifier', MultinomialNB())
])

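# Note: the `vectorizer__*` entries below override the settings on each instance in
# `vectorizers`, so 16 instances x 8 combinations = 128 candidates cover only
# 16 distinct configurations (2 vectorizer classes x 2 x 2 x 2), each fit 8 times.
# The instance shown in best_params_ therefore displays its original arguments,
# not the overriding `vectorizer__*` values that were actually used.
params = {
    'vectorizer': list(vectorizers.values()),
    'vectorizer__max_features': [100, 500],
    'vectorizer__stop_words': ['english', None],
    'vectorizer__ngram_range': [(1, 2), (1, 1)]
}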

In [38]:
grid_search = GridSearchCV(pipeline, param_grid=params, cv=5, verbose=1, n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

grid_search.best_estimator_

Fitting 5 folds for each of 128 candidates, totalling 640 fits
Best Parameters: {'vectorizer': TfidfVectorizer(max_features=100, ngram_range=(1, 2), stop_words='english'), 'vectorizer__max_features': 500, 'vectorizer__ngram_range': (1, 1), 'vectorizer__stop_words': 'english'}
Best Score: 0.548762884495091
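
The grid above lists 128 candidates, but the sixteen feature sets from the lab can be described with just sixteen. The sketch below is one possible leaner grid, not the grid that produced the output above: it passes only the two vectorizer classes and lets the `vectorizer__*` entries supply the settings. The names `pipe_sketch` and `lean_params` are introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import ParameterGrid
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipe_sketch = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])

# Two vectorizer classes x 2 max_features x 2 stopword settings x 2 ngram ranges.
lean_params = {
    "vectorizer": [CountVectorizer(), TfidfVectorizer()],
    "vectorizer__max_features": [100, 500],
    "vectorizer__stop_words": ["english", None],
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
}

# ParameterGrid expands the dictionary the same way GridSearchCV does.
n_candidates = len(ParameterGrid(lean_params))
print(n_candidates)
```

Passing `pipe_sketch` and `lean_params` to `GridSearchCV(..., cv=5)` would then run 80 fits instead of 640.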


**10. What are some of the advantages of fitting a logistic regression model?**

In [40]:
# Interpretability: logistic regression is easy to understand; its coefficients show the direction and strength of the relationship between each predictor and the target.
# Probabilistic output: it produces class probabilities, which are useful for decision-making and for estimating uncertainty.

**11. Fit a logistic regression model and compare it to the baseline.**

In [55]:
tvec = TfidfVectorizer(max_features=500, stop_words='english')

# Transform the training and test sets
X_train_tvec = tvec.fit_transform(X_train)
X_test_tvec = tvec.transform(X_test)

# Initialize and train the Logistic Regression model
logreg_model = LogisticRegression()
logreg_model.fit(X_train_tvec, y_train)

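logreg_pred = pd.DataFrame(logreg_model.predict(X_test_tvec))

# Test accuracy. The baseline to beat is the majority-class share of y_test
# (about 0.531, from the value_counts output above).
print(logreg_model.score(X_test_tvec,y_test))
# Distribution of the predicted classes
print(logreg_pred.value_counts(normalize= True))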

0.5465933394439352
0
1    0.658417
0    0.341583
Name: proportion, dtype: float64
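
For the comparison with the baseline, one option is scikit-learn's `DummyClassifier`, which always predicts the majority class. The sketch below uses hypothetical labels (`y_toy`) with roughly the lab's class balance; in the lab itself, the dummy model would be fit on `y_train` and scored on `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels with roughly the lab's class balance (53% ones).
y_toy = np.array([1] * 53 + [0] * 47)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_acc = baseline.score(X_toy, y_toy)
print(baseline_acc)
```

Against the lab's baseline of about 0.531, the logistic regression's test accuracy of about 0.547 is an improvement of roughly 1.5 percentage points.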


### Summary of Naive Bayes 

Naive Bayes is a classification technique that relies on probability to classify observations.
- It's based on a probability rule called **Bayes' Theorem**... thus, "**Bayes**."
- It makes an assumption that isn't often met, so it's "**naive**."

Despite being a model that relies on a naive assumption, it often performs pretty well! (This is kind of like linear regression... we aren't always guaranteed homoscedastic errors in linear regression, but the model might still do a good job regardless.)
- [**Interested in the details?**](https://www.cs.unb.ca/~hzhang/publications/FLAIRS04ZhangH.pdf)


The [**sklearn documentation**](https://scikit-learn.org/stable/modules/naive_bayes.html) is here, but it can be intimidating. So, to quickly summarize the Bayes and Naive parts of the model...

#### Bayes' Theorem
Bayes' Theorem relates $P(A|B)$ to $P(B|A)$. (Don't worry; we won't be doing any probability calculations by hand! However, you may want to refresh your memory on conditional probability from our earlier lessons if you forget what a conditional probability is.)

$$
\begin{eqnarray*}
\text{Bayes' Theorem: } P(A|B) &=& \frac{P(B|A)P(A)}{P(B)}
\end{eqnarray*}
$$

- Let $A$ be that someone is "agreeable," like the OCEAN category.
- Let $B$ represent the words used in their Facebook post.

$$
\begin{eqnarray*}
\text{Bayes' Theorem: } P(A|B) &=& \frac{P(B|A)P(A)}{P(B)} \\
\Rightarrow P(\text{person is agreeable}|\text{words in Facebook post}) &=& \frac{P(\text{words in Facebook post}|\text{person is agreeable})P(\text{person is agreeable})}{P(\text{words in Facebook post})}
\end{eqnarray*}
$$

We want to calculate the probability that someone is agreeable **given** the words that they used in their Facebook post, and this is exactly what our model does. It estimates the pieces on the right-hand side of the equation and combines them into a probability estimate of how likely someone is to be agreeable given their Facebook post. We won't calculate this by hand; it happens under the hood, and we can see the results by checking `.predict_proba()`.
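
As a worked illustration of the formula, the sketch below plugs made-up numbers into Bayes' Theorem. All three inputs are hypothetical; only the arithmetic is the point.

```python
# All three inputs are made-up numbers for illustration only.
p_agreeable = 0.53               # P(A): prior probability of being agreeable
p_words_given_agreeable = 0.02   # P(B|A): chance of these words if agreeable
p_words_given_not = 0.01         # P(B|not A): chance of these words if not

# Law of total probability gives the denominator P(B).
p_words = (p_words_given_agreeable * p_agreeable
           + p_words_given_not * (1 - p_agreeable))

# Bayes' Theorem: P(A|B) = P(B|A) P(A) / P(B)
posterior = p_words_given_agreeable * p_agreeable / p_words
print(round(posterior, 3))
```

Because the words are twice as likely under "agreeable" in this toy setup, the posterior rises above the prior of 0.53.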

#### Naive Assumption

If our goal is to estimate $P(\text{person is agreeable}|\text{words in Facebook post})$, that can be quite tricky.

---

<details><summary>Bonus: if you want to understand why that's complicated, click here.</summary>
    
- The probability of the event $\text{"words in Facebook post"}$ is complicated to calculate.

- If a Facebook post has 100 words in it, then the event $\text{"words in Facebook post"} = \text{"word 1 is in the Facebook post" and "word 2 is in the Facebook post" and }\ldots \text{ and "word 100 is in the Facebook post"}$.

- Calculating the joint probability of all 100 words being in the Facebook post gets complicated pretty quickly. (Refer back to the probability notes on how to calculate the joint probability of two events if you want to see more.)
</details>

---

To simplify matters, we make an assumption: **we assume that all of our features are independent of one another** (more precisely, independent once we know the class, such as whether the person is agreeable).
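
Written out, this assumption lets the likelihood factor into one term per word. Here $w_1, \ldots, w_n$ is notation introduced for the individual words in the post, and $A$ is the event that the person is agreeable, as before:

$$
\begin{eqnarray*}
P(w_1, w_2, \ldots, w_n | A) &=& \prod_{i=1}^{n} P(w_i | A) \\
\Rightarrow P(A | w_1, w_2, \ldots, w_n) &\propto& P(A) \prod_{i=1}^{n} P(w_i | A)
\end{eqnarray*}
$$

Each $P(w_i | A)$ can be estimated from simple word frequencies within the class, which is far easier than estimating the joint probability of all the words at once.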

In some contexts, this assumption might be realistic!

**12. Why would this assumption not be realistic with NLP data?**

In [None]:
# Contextual Information: In NLP, the order of words and the context is critical for understanding meaning. 
# For example, the phrase "not good" and "good" convey opposite sentiments, 
# but Naive Bayes would treat "not" and "good" as independent, missing the crucial relationship between them.
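
One partial workaround is to include bigrams, so that a phrase such as "not good" becomes a feature of its own. The sketch below uses two made-up documents to show how `ngram_range` changes the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus for illustration only.
toy_reviews = ["good", "not good"]

# Unigrams only: "not" and "good" are separate, unrelated features.
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(toy_reviews)

# Unigrams and bigrams: the phrase "not good" gets its own column.
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(toy_reviews)

print(sorted(unigrams.vocabulary_))
print(sorted(uni_bi.vocabulary_))
```

The model still treats the columns as independent, but the bigram column lets it learn a separate weight for the phrase.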

Despite this assumption not being realistic with NLP data, we still use Naive Bayes pretty frequently.
- It's a very fast modeling algorithm, which is especially helpful when we have lots of features and/or lots of data.
- It is often a surprisingly good classifier and can sometimes match or outperform more complicated models.

There are three common types of Naive Bayes models: Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes.
- How do we pick which of the three models to use? It depends on our $X$ variable.
    - Bernoulli Naive Bayes is appropriate when our features are all 0/1 variables.
        - [**Bernoulli NB Documentation**](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB)
    - Multinomial Naive Bayes is appropriate when our features are variables that take on only non-negative integer counts (zero included).
        - [**Multinomial NB Documentation**](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB)
    - Gaussian Naive Bayes is appropriate when our features are Normally distributed variables. (Realistically, though, we kind of use Gaussian whenever neither Bernoulli nor Multinomial works.)
        - [**Gaussian NB Documentation**](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB)
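
The pairing of feature type and model can be sketched on a toy corpus. Everything below (`toy_posts`, `toy_labels`, and the three vectorizer names) is hypothetical and only illustrates which kind of matrix goes with which model.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical posts and labels for illustration only.
toy_posts = ["love love this day", "so happy today", "hate this", "hate hate mondays"]
toy_labels = np.array([1, 1, 0, 0])

count_vec = CountVectorizer()              # non-negative integer counts
binary_vec = CountVectorizer(binary=True)  # 0/1 presence flags
tfidf_vec = TfidfVectorizer()              # continuous weights

X_counts = count_vec.fit_transform(toy_posts)
X_binary = binary_vec.fit_transform(toy_posts)
X_tfidf = tfidf_vec.fit_transform(toy_posts)

mnb = MultinomialNB().fit(X_counts, toy_labels)
bnb = BernoulliNB().fit(X_binary, toy_labels)
# GaussianNB does not accept sparse matrices, so convert to a dense array.
gnb = GaussianNB().fit(X_tfidf.toarray(), toy_labels)

new_post = ["hate mondays"]
mnb_pred = mnb.predict(count_vec.transform(new_post))
bnb_pred = bnb.predict(binary_vec.transform(new_post))
gnb_pred = gnb.predict(tfidf_vec.transform(new_post).toarray())
print(mnb_pred, bnb_pred, gnb_pred)
```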

**13. Suppose you CountVectorized your features. Which Naive Bayes model would be most appropriate to fit? Why? Fit it.**

In [57]:
cvec = CountVectorizer()
tvec = TfidfVectorizer()

In [59]:
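def gridseach_bestNB(vectorizer):
    """Grid search each Naive Bayes variant with the given vectorizer.

    Uses the global X_train and y_train and returns a DataFrame with the best
    parameters and best cross-validated score for each model.
    """
    classifiers = {'BernoulliNB': BernoulliNB(), 'MultinomialNB': MultinomialNB(), 'GaussianNB': GaussianNB()}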
    
    params = {
        'vectorizer__max_features': [100, 500],  
        'vectorizer__stop_words': ['english', None],
        'vectorizer__ngram_range': [(1, 2), (1, 1)],
    }
    results = []
    
    for model_name, model in classifiers.items():
        if model_name == 'GaussianNB':
            # GaussianNB does not accept sparse input, so convert the matrix to a dense array first.
            pipeline = Pipeline([
                ('vectorizer', vectorizer),  
                ('to_dense', FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)),  
                ('classifier', model)
            ])
        else:
            pipeline = Pipeline([
                ('vectorizer', vectorizer),  
                ('classifier', model)
            ])
        
        grid_search = GridSearchCV(pipeline, params, cv=5, n_jobs=-1)
        grid_search.fit(X_train, y_train)
        
        results.append({
            'Model': model_name,
            'Best Params': grid_search.best_params_,
            'Best Score': grid_search.best_score_
        })
    
    results_df = pd.DataFrame(results)

    return results_df

In [61]:
best_cvec = gridseach_bestNB(cvec)
best_cvec

|   | Model | Best Params | Best Score |
|---|---|---|---|
| 0 | BernoulliNB | {'vectorizer__max_features': 500, 'vectorizer_... | 0.542294 |
| 1 | MultinomialNB | {'vectorizer__max_features': 500, 'vectorizer_... | 0.542746 |
| 2 | GaussianNB | {'vectorizer__max_features': 500, 'vectorizer_... | 0.522725 |


**14. Suppose you TFIDFVectorized your features. Which Naive Bayes model would be most appropriate to fit? Why? Fit it.**

In [63]:
best_tvec = gridseach_bestNB(tvec)
best_tvec

|   | Model | Best Params | Best Score |
|---|---|---|---|
| 0 | BernoulliNB | {'vectorizer__max_features': 500, 'vectorizer_... | 0.542294 |
| 1 | MultinomialNB | {'vectorizer__max_features': 500, 'vectorizer_... | 0.548763 |
| 2 | GaussianNB | {'vectorizer__max_features': 500, 'vectorizer_... | 0.53251 |


**15. Compare the performance of your models.**

In [None]:
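# Multinomial Naive Bayes has the highest cross-validated score with both vectorizers
# (about 0.543 with CountVectorizer and 0.549 with TfidfVectorizer), but its margin over Bernoulli Naive Bayes (0.542) is very small.
# Gaussian Naive Bayes scores lowest in both cases (about 0.523 and 0.533).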

**16. Even though we didn't explore the full extent of Cambridge Analytica's modeling, based on what we did here, how effective was their approach at using Facebook data to model agreeableness?**

In [66]:
# From the analysis, the models barely beat the baseline: the best accuracy is around 54-55 percent,
# versus roughly 53 percent from always predicting the majority class.
# Status text alone, with these simple features, is therefore a weak predictor of agreeableness.
# Cambridge Analytica would likely have needed more advanced features or additional data sources to do much better.