# ![GA Logo](https://camo.githubusercontent.com/6ce15b81c1f06d716d753a61f5db22375fa684da/68747470733a2f2f67612d646173682e73332e616d617a6f6e6177732e636f6d2f70726f64756374696f6e2f6173736574732f6c6f676f2d39663838616536633963333837313639306533333238306663663535376633332e706e67) Naive Bayes

We've looked at the Naive Bayes classifier from a probability point of view. Now let's apply it in code to a natural language processing problem.

### Before we begin... what is natural language processing?

- If I'm explaining this to my non-technical peers, natural language processing is just a way for us to get computers to understand written language the way you and I do.

- If I'm explaining this to someone with a more technical background, natural language processing is a set of tools that represent words as numbers. This is commonly done by feature engineering (i.e. turning words into columns in your dataframe), but more complicated methods exist.
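
To make "representing words as numbers" concrete, here is a minimal bag-of-words sketch in plain Python. The two example sentences and the helper name `bag_of_words` are made up for illustration, and the tokenization (lowercase, split on whitespace) is a deliberate simplification rather than what a real vectorizer does.

```python
from collections import Counter

def bag_of_words(docs):
    """Turn a list of strings into (vocabulary, count matrix).

    Hypothetical helper for illustration: lowercases and splits on whitespace only.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # One column per distinct word, in alphabetical order.
    vocab = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix

# Made-up example documents.
docs = ["vote today", "watch the vote vote"]
vocab, matrix = bag_of_words(docs)
print(vocab)   # column names
print(matrix)  # one row of word counts per document
```

Each document becomes a row and each distinct word becomes a column, which is the "words into columns in your dataframe" idea described above.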

You'll often see natural language processing abbreviated as **NLP**.

You and I will use a set of social media posts from politicians in a Naive Bayes classification model to predict whether a post comes from Twitter or Facebook.

#### First: some data cleaning.

In [1]:
import pandas as pd

In [2]:
# Read in the .csv file.

df = pd.read_csv("./unprocessed_tweets.csv", encoding='latin-1')

In [3]:
# Check out the columns.

df.columns

Index(['_unit_id', '_golden', '_unit_state', '_trusted_judgments',
       '_last_judgment_at', 'audience', 'audience:confidence', 'bias',
       'bias:confidence', 'message', 'message:confidence', 'orig__golden',
       'audience_gold', 'bias_gold', 'bioid', 'embed', 'id', 'label',
       'message_gold', 'source', 'text'],
      dtype='object')

In [4]:
# See the first five rows.

df.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,audience,audience:confidence,bias,bias:confidence,message,...,orig__golden,audience_gold,bias_gold,bioid,embed,id,label,message_gold,source,text
0,766192484,False,finalized,1,8/4/15 21:17,national,1.0,partisan,1.0,policy,...,,,,R000596,"<blockquote class=""twitter-tweet"" width=""450"">...",3.83249e+17,From: Trey Radel (Representative from Florida),,twitter,RT @nowthisnews: Rep. Trey Radel (R- #FL) slam...
1,766192485,False,finalized,1,8/4/15 21:20,national,1.0,partisan,1.0,attack,...,,,,M000355,"<blockquote class=""twitter-tweet"" width=""450"">...",3.11208e+17,From: Mitch McConnell (Senator from Kentucky),,twitter,VIDEO - #Obamacare: Full of Higher Costs and ...
2,766192486,False,finalized,1,8/4/15 21:14,national,1.0,neutral,1.0,support,...,,,,S001180,"<blockquote class=""twitter-tweet"" width=""450"">...",3.39069e+17,From: Kurt Schrader (Representative from Oregon),,twitter,Please join me today in remembering our fallen...
3,766192487,False,finalized,1,8/4/15 21:08,national,1.0,neutral,1.0,policy,...,,,,C000880,"<blockquote class=""twitter-tweet"" width=""450"">...",2.98528e+17,From: Michael Crapo (Senator from Idaho),,twitter,RT @SenatorLeahy: 1st step toward Senate debat...
4,766192488,False,finalized,1,8/4/15 21:26,national,1.0,partisan,1.0,policy,...,,,,U000038,"<blockquote class=""twitter-tweet"" width=""450"">...",4.07643e+17,From: Mark Udall (Senator from Colorado),,twitter,.@amazon delivery #drones show need to update ...


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
_unit_id               5000 non-null int64
_golden                5000 non-null bool
_unit_state            5000 non-null object
_trusted_judgments     5000 non-null int64
_last_judgment_at      5000 non-null object
audience               5000 non-null object
audience:confidence    5000 non-null float64
bias                   5000 non-null object
bias:confidence        5000 non-null float64
message                5000 non-null object
message:confidence     5000 non-null float64
orig__golden           0 non-null float64
audience_gold          0 non-null float64
bias_gold              0 non-null float64
bioid                  5000 non-null object
embed                  5000 non-null object
id                     5000 non-null object
label                  5000 non-null object
message_gold           0 non-null float64
source                 5000 non-null object
text                  

In [15]:
# Remove all rows with an "audience confidence," "bias
# confidence," or "message confidence" score below 1.

df = df[(df['audience:confidence']>=1)&(df['bias:confidence']>=1)&(df['message:confidence']>=1)]
df.shape

(4888, 21)

In [16]:
# Remove extra columns, keeping only the following

df = df[['_unit_id', '_trusted_judgments', 'audience',
         'bias', 'message', 'label', 'source', 'text']]

In [17]:
# Relabel columns.

df.columns = ['unit_id', 'trusted_judgments', 'audience_feature',
              'bias_feature', 'message_feature', 'label_feature',
              'source_feature', 'text_feature']

In [18]:
# Drop NAs.

df.dropna(inplace=True)

In [19]:
# Reset index.

df.reset_index(drop=True, inplace=True)

We have social media data! This includes almost 5,000 messages posted to either Twitter or Facebook by various politicians. We can use the columns we kept to predict things like whether the source is Twitter or Facebook, whether the bias is neutral or partisan, and so on.

In [20]:
# How many rows and columns do we have?

df.shape

(4888, 8)

In [21]:
# View first five rows.

df.head()

Unnamed: 0,unit_id,trusted_judgments,audience_feature,bias_feature,message_feature,label_feature,source_feature,text_feature
0,766192484,1,national,partisan,policy,From: Trey Radel (Representative from Florida),twitter,RT @nowthisnews: Rep. Trey Radel (R- #FL) slam...
1,766192485,1,national,partisan,attack,From: Mitch McConnell (Senator from Kentucky),twitter,VIDEO - #Obamacare: Full of Higher Costs and ...
2,766192486,1,national,neutral,support,From: Kurt Schrader (Representative from Oregon),twitter,Please join me today in remembering our fallen...
3,766192487,1,national,neutral,policy,From: Michael Crapo (Senator from Idaho),twitter,RT @SenatorLeahy: 1st step toward Senate debat...
4,766192488,1,national,partisan,policy,From: Mark Udall (Senator from Colorado),twitter,.@amazon delivery #drones show need to update ...


You may note that there are some extra symbols in the data. This is a common problem in natural language processing, especially when dealing with social media (think emoji, hashtags, links, etc.), but we're going to ignore that for now.
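
For reference, a rough idea of what handling those symbols could look like is sketched here. This is only an illustration using Python's `re` module: the function name `strip_social_markup`, the patterns, and the sample string are all assumptions of mine, and the rest of this lesson does not use it.

```python
import re

def strip_social_markup(text):
    """Hypothetical cleaner: drop URLs and @mentions, keep hashtag words."""
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"@\w+", " ", text)          # remove @mentions
    text = text.replace("#", "")               # keep the word, drop the '#'
    return " ".join(text.split())              # collapse repeated whitespace

# Made-up example post.
cleaned = strip_social_markup("RT @someone: 1st step #Senate http://example.com/x")
print(cleaned)
```

Whether to remove such tokens is a modeling choice: for a Twitter-versus-Facebook classifier, things like `RT` and mentions may themselves be informative.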

### Let's use Naive Bayes to predict whether a social media post was featured on Facebook or Twitter.

#### 1. Engineer a feature to turn `source_feature` into a 1/0 column, where 1 indicates `Twitter`.

In [24]:
df['source_feature'].unique()

array(['twitter', 'facebook'], dtype=object)

In [25]:
df['twitter'] = df.source_feature.map({'twitter':1,'facebook':0})

In [26]:
df.twitter.value_counts()

0    2497
1    2391
Name: twitter, dtype: int64

#### NOTE: Since we are solving a classification problem, what potential issue should I check for here?
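
One issue worth checking at this point is class imbalance. A small sketch, using the counts printed by `value_counts()` above (the variable names are mine), computes each class's share and the accuracy we would get by always guessing the majority class:

```python
# Counts copied from the value_counts() output above.
class_counts = {0: 2497, 1: 2391}  # 0 = Facebook, 1 = Twitter

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# Accuracy of a "model" that always predicts the most common class.
baseline_accuracy = max(shares.values())

for label, share in shares.items():
    print(f"class {label}: {share:.3f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

The two classes are close to evenly split, so accuracy is a reasonable yardstick here, and any model we fit should beat the baseline of roughly 51%.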

#### 2. Split our data into `X` and `y`.

In [67]:
X = df[['text_feature']]
y = df['twitter']

In [68]:
X.head()

Unnamed: 0,text_feature
0,RT @nowthisnews: Rep. Trey Radel (R- #FL) slam...
1,VIDEO - #Obamacare: Full of Higher Costs and ...
2,Please join me today in remembering our fallen...
3,RT @SenatorLeahy: 1st step toward Senate debat...
4,.@amazon delivery #drones show need to update ...


#### 3. Split our data into training and testing sets.

In [69]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.25,
                                                    random_state=42,
                                                    stratify=y)

#### 4. Turn our text into features. [Documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

In [91]:
# Import CountVectorizer and TfidfVectorizer.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Instantiate our CountVectorizer and TfidfVectorizer.
# stop_words='english' uses sklearn's built-in stop word list (NLTK's is arguably better).
cvec = CountVectorizer(max_features=50, stop_words='english')
tvec = TfidfVectorizer(max_features=50, stop_words='english')

In [109]:
# Fit our CountVectorizer and TfidfVectorizer on the training data and transform the training data.
# get_feature_names_out() requires scikit-learn >= 1.0 (older versions used get_feature_names()).
X_train_cvec = cvec.fit_transform(X_train['text_feature']).toarray()
X_train_cvec_df = pd.DataFrame(X_train_cvec, columns=cvec.get_feature_names_out())

X_train_tvec = tvec.fit_transform(X_train['text_feature']).toarray()
X_train_tvec_df = pd.DataFrame(X_train_tvec, columns=tvec.get_feature_names_out())

In [110]:
# Transform our testing data with the already-fit CountVectorizer and TfidfVectorizer.
X_test_cvec = cvec.transform(X_test['text_feature']).toarray()
X_test_cvec_df = pd.DataFrame(X_test_cvec, columns=cvec.get_feature_names_out())

X_test_tvec = tvec.transform(X_test['text_feature']).toarray()
X_test_tvec_df = pd.DataFrame(X_test_tvec, columns=tvec.get_feature_names_out())

Count-vectorized data:

In [111]:
X_train_cvec_df.head()

Unnamed: 0,act,american,americans,amp,budget,care,com,community,congress,day,...,time,today,veterans,watch,week,women,work,www,year,ûªs
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [112]:
X_test_cvec_df.head()

Unnamed: 0,act,american,americans,amp,budget,care,com,community,congress,day,...,time,today,veterans,watch,week,women,work,www,year,ûªs
0,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


TF-IDF-vectorized data:

In [113]:
X_train_tvec_df.head()

Unnamed: 0,act,american,americans,amp,budget,care,com,community,congress,day,...,time,today,veterans,watch,week,women,work,www,year,ûªs
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.714628,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.655606,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [114]:
X_test_tvec_df.head()

Unnamed: 0,act,american,americans,amp,budget,care,com,community,congress,day,...,time,today,veterans,watch,week,women,work,www,year,ûªs
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.507621,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.499453,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.538353,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
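
The TF-IDF tables hold decimals where the count tables hold integers. The sketch below shows where such decimals can come from. It assumes the weighting I understand `TfidfVectorizer` to use by default (raw counts, smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row); the function name `tfidf_rows` and the tiny count matrix are made up.

```python
import math

def tfidf_rows(count_rows):
    """TF-IDF with smoothed IDF and L2 row normalization.

    Assumption: this mirrors scikit-learn's default TF-IDF settings.
    """
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: how many documents contain each term.
    doc_freq = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    result = []
    for row in count_rows:
        weighted = [count * weight for count, weight in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        # Rows with no vocabulary words stay all zeros.
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result

# Made-up counts: 3 documents, 3 vocabulary words.
counts = [
    [0, 1, 0],  # exactly one vocabulary word present
    [1, 1, 0],  # two vocabulary words present
    [0, 0, 0],  # no vocabulary words at all
]
rows = tfidf_rows(counts)
for row in rows:
    print([round(v, 3) for v in row])
```

Under these assumptions, a row containing a single vocabulary word normalizes to a weight of 1.0, which is consistent with the `1.0` under `day` in the first row of the test table above. Rows with two words split the weight, with the rarer word getting the larger share.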


#### 5. Fit a Naive Bayes model!

<details><summary> Which Naive Bayes model should we pick, and why? </summary>
    
- The columns of X are all integer counts, so MultinomialNB is the best choice here.
- BernoulliNB is best when we have 0/1 values in all columns of X (a.k.a. dummy variables).
- GaussianNB is best when the columns of X are Normally distributed. (Practically, though, it gets used whenever BernoulliNB and MultinomialNB are inappropriate.)
</details>
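
To see why integer word counts pair naturally with a multinomial model, here is a from-scratch sketch of multinomial Naive Bayes in plain Python. It is a simplified illustration, not `sklearn`'s implementation: the function names, the toy documents, and the labels (1 = Twitter, 0 = Facebook) are made up, and `alpha=1.0` is add-one (Laplace) smoothing.

```python
import math
from collections import Counter, defaultdict

def fit_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocab = sorted({word for doc in docs for word in doc})
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    # Priors estimated from the training data: share of documents per class.
    log_prior = {c: math.log(n / len(docs)) for c, n in class_doc_counts.items()}

    # P(word | class) with additive smoothing so unseen words never get probability 0.
    log_likelihood = {}
    for c in class_doc_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            word: math.log((word_counts[c][word] + alpha) / total) for word in vocab
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior (up to a constant)."""
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][word] for word in doc if word in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Made-up toy data: 1 = Twitter, 0 = Facebook.
train_docs = [["rt", "vote"], ["rt", "watch"],
              ["community", "event", "photos"], ["photos", "today"]]
train_labels = [1, 1, 0, 0]
log_prior, log_likelihood = fit_multinomial_nb(train_docs, train_labels)

pred_a = predict(["rt", "today", "rt"], log_prior, log_likelihood)
pred_b = predict(["photos", "event"], log_prior, log_likelihood)
print(pred_a, pred_b)
```

Each repeated word adds its log likelihood again, which is how the counts enter the model. The priors come straight from the class shares in the training data, which is also what happens when we leave the priors at their defaults.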

In [124]:
# Import our model!

from sklearn.naive_bayes import MultinomialNB, GaussianNB

In [125]:
# Instantiate our model!

mn = MultinomialNB()
gs = GaussianNB()

Remember earlier that I said we had the opportunity to set priors. We could do so here if we wanted, but we'll stick with the default and allow `sklearn` to estimate priors from the training data directly.

In [126]:
# Fit our models on both the CountVectorizer and TF-IDF data.

# Count data holds non-negative integers (0, 1, or more), so we use MultinomialNB.
model_cvec = mn.fit(X_train_cvec, y_train)
# TF-IDF data holds continuous (decimal) values, so we use GaussianNB here.
model_tvec = gs.fit(X_train_tvec, y_train)

In [127]:
# Generate our predictions for both the CountVectorizer and TF-IDF data.

predictions_cvec = mn.predict(X_test_cvec)
print(predictions_cvec)
predictions_tvec = gs.predict(X_test_tvec)  # GaussianNB goes with the TF-IDF features
print(predictions_tvec)

[0 1 0 ... 1 0 0]
[0 1 0 ... 1 1 1]


<details><summary> How might we evaluate our model's performance? </summary>

- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Sensitivity = TP / (TP + FN)
- Specificity = TN / (TN + FP)
- Precision = TP / (TP + FP)
- AUC ROC
</details>

<details><summary> If we have to select only one, which one should we choose? </summary>

- It depends on how exactly you define "positive" and "negative." In this case, it probably doesn't really matter: mistaking a tweet for a Facebook post doesn't seem much better or worse than mistaking a Facebook post for a tweet.
- Because I believe false positives and false negatives are equally bad, I'd probably use accuracy.
</details>

In [128]:
# Score our models on the training set, for both the CountVectorizer and TF-IDF data.

print(mn.score(X_train_cvec, y_train))
print(gs.score(X_train_tvec, y_train))  # GaussianNB goes with the TF-IDF features

0.8145117294053464
0.7084015275504637


In [129]:
# Score our models on the testing set, for both the CountVectorizer and TF-IDF data.

print(mn.score(X_test_cvec, y_test))
print(gs.score(X_test_tvec, y_test))  # GaussianNB goes with the TF-IDF features

0.8306055646481179
0.6972176759410802


<details><summary> What should we do in this case? </summary>

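- Our `MultinomialNB` model actually scores slightly *higher* on the testing set (about 0.831) than on the training set (about 0.815), so it does not look overfit. Our `GaussianNB` model drops by about one point (from about 0.708 to about 0.697), which is *slightly* overfit at most. If a model did look overfit, we could:
    - try to collect more data,
    - try using fewer features by setting `max_features` to a smaller number when instantiating our vectorizer,
    - compare CountVectorizer against TF-IDF Vectorizer (as we did above),
    - try a non-default prior **if you have subject-matter expertise**.
- Rather than regularizing, [online answers](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) suggest using a different model entirely.
- Our training performance and testing performance are pretty close, so there may not be a lot of changes required.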
</details>
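
One quick way to put a number on "pretty close" is the train-minus-test gap. The sketch below copies the four accuracy scores printed above; the dictionary layout and names are mine.

```python
# Accuracy scores copied from the two score cells above.
scores = {
    "MultinomialNB on counts": {"train": 0.8145117294053464, "test": 0.8306055646481179},
    "GaussianNB on TF-IDF": {"train": 0.7084015275504637, "test": 0.6972176759410802},
}

# Positive gap: training accuracy is higher (the usual sign of overfitting).
# Negative gap: the model did better on the testing set.
gaps = {name: s["train"] - s["test"] for name, s in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: train - test = {gap:+.4f}")
```

Both gaps are under two percentage points, and the count-based model's gap is negative, meaning it scored higher on the testing set than on the training set.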

In [85]:
# Import the confusion matrix function.

from sklearn.metrics import confusion_matrix 

In [130]:
# Generate a confusion matrix for both the CountVectorizer and TF-IDF predictions.

print(confusion_matrix(y_test, predictions_cvec))
print(confusion_matrix(y_test, predictions_tvec))

[[578  46]
 [161 437]]
[[344 280]
 [ 90 508]]


In [131]:
tn_cvec, fp_cvec, fn_cvec, tp_cvec = confusion_matrix(y_test, predictions_cvec).ravel()
tn_tvec, fp_tvec, fn_tvec, tp_tvec = confusion_matrix(y_test, predictions_tvec).ravel()

In [132]:
print("cvec\nTrue Negatives: %s" % tn_cvec)
print("False Positives: %s" % fp_cvec)
print("False Negatives: %s" % fn_cvec)
print("True Positives: %s" % tp_cvec)
print("tvec\nTrue Negatives: %s" % tn_tvec)
print("False Positives: %s" % fp_tvec)
print("False Negatives: %s" % fn_tvec)
print("True Positives: %s" % tp_tvec)

cvec
True Negatives: 578
False Positives: 46
False Negatives: 161
True Positives: 437
tvec
True Negatives: 344
False Positives: 280
False Negatives: 90
True Positives: 508
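
These four counts plug directly into the metric formulas listed earlier. A small sketch (the helper name `classification_metrics` is mine; the counts are copied from the printed output above):

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute accuracy, sensitivity, specificity, and precision from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

# Counts copied from the printed output above.
cvec_metrics = classification_metrics(tn=578, fp=46, fn=161, tp=437)
tvec_metrics = classification_metrics(tn=344, fp=280, fn=90, tp=508)

for name, metrics in [("cvec", cvec_metrics), ("tvec", tvec_metrics)]:
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

The accuracies reproduce the testing scores reported earlier (about 0.831 and 0.697). The other metrics show that the two models err in opposite directions: the count-based model has high specificity (about 0.93) but lower sensitivity (about 0.73), so it misses tweets more often, while the TF-IDF model has higher sensitivity (about 0.85) but much lower specificity (about 0.55), so it labels many Facebook posts as tweets.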


<details><summary> By default, what does a true negative mean here? </summary>

- True negatives are things we correctly predict to be negative.
- In this case, since Twitter = 1, a true negative means I correctly predict something is a Facebook post.
</details>

---

<details><summary> By default, what does a false positive mean here? </summary>

- False positives are things we falsely predict to be positive.
- In this case, since Twitter = 1, a false positive means I incorrectly predict something is a tweet (when it's really a Facebook post).
</details>