# Naive Bayes

Let's try a Naive Bayes classifier to predict whether posts come from Twitter or Facebook.
This is a small practice exercise, not a full project.
The focus will be on running a Naive Bayes model.
Thus, some preprocessing steps may not be rigorous, and some data-specific preprocessing will be skipped.


### Problem statement

I will use a dataset of social media posts in a Naive Bayes classification model to predict whether a post comes from Twitter or Facebook.

In [1]:
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

In [2]:
# Read in the .csv file.

df = pd.read_csv("./unprocessed_tweets.csv", encoding='latin-1')   

In [3]:
# See the first five rows.

df.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,audience,audience:confidence,bias,bias:confidence,message,...,orig__golden,audience_gold,bias_gold,bioid,embed,id,label,message_gold,source,text
0,766192484,False,finalized,1,8/4/15 21:17,national,1.0,partisan,1.0,policy,...,,,,R000596,"<blockquote class=""twitter-tweet"" width=""450"">...",3.83249e+17,From: Trey Radel (Representative from Florida),,twitter,RT @nowthisnews: Rep. Trey Radel (R- #FL) slam...
1,766192485,False,finalized,1,8/4/15 21:20,national,1.0,partisan,1.0,attack,...,,,,M000355,"<blockquote class=""twitter-tweet"" width=""450"">...",3.11208e+17,From: Mitch McConnell (Senator from Kentucky),,twitter,VIDEO - #Obamacare: Full of Higher Costs and ...
2,766192486,False,finalized,1,8/4/15 21:14,national,1.0,neutral,1.0,support,...,,,,S001180,"<blockquote class=""twitter-tweet"" width=""450"">...",3.39069e+17,From: Kurt Schrader (Representative from Oregon),,twitter,Please join me today in remembering our fallen...
3,766192487,False,finalized,1,8/4/15 21:08,national,1.0,neutral,1.0,policy,...,,,,C000880,"<blockquote class=""twitter-tweet"" width=""450"">...",2.98528e+17,From: Michael Crapo (Senator from Idaho),,twitter,RT @SenatorLeahy: 1st step toward Senate debat...
4,766192488,False,finalized,1,8/4/15 21:26,national,1.0,partisan,1.0,policy,...,,,,U000038,"<blockquote class=""twitter-tweet"" width=""450"">...",4.07643e+17,From: Mark Udall (Senator from Colorado),,twitter,.@amazon delivery #drones show need to update ...


In [4]:
# Check out the columns.

df.columns

Index(['_unit_id', '_golden', '_unit_state', '_trusted_judgments',
       '_last_judgment_at', 'audience', 'audience:confidence', 'bias',
       'bias:confidence', 'message', 'message:confidence', 'orig__golden',
       'audience_gold', 'bias_gold', 'bioid', 'embed', 'id', 'label',
       'message_gold', 'source', 'text'],
      dtype='object')

### Filtering out columns

Here, for the purposes of this practice, I keep only the 2 columns that I will be using.

In [5]:
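# Keep only the 2 columns we need.
# .copy() gives us an independent DataFrame, which avoids pandas'
# SettingWithCopyWarning in the in-place edits that follow.

df = df[['source', 'text']].copy()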

In [6]:
# Relabel columns.

df.columns = ['source_feature', 'text_feature']

In [7]:
# Drop NAs.

df.dropna(inplace=True)

In [8]:
# Reset index.

df.reset_index(drop=True, inplace=True)

In [9]:
df.shape

(5000, 2)

So we have 5,000 social media posts here, plus a column telling us whether each one is from Twitter or Facebook.

In [10]:
# View first five rows.

df.head()

Unnamed: 0,source_feature,text_feature
0,twitter,RT @nowthisnews: Rep. Trey Radel (R- #FL) slam...
1,twitter,VIDEO - #Obamacare: Full of Higher Costs and ...
2,twitter,Please join me today in remembering our fallen...
3,twitter,RT @SenatorLeahy: 1st step toward Senate debat...
4,twitter,.@amazon delivery #drones show need to update ...


There are some extra symbols in the data. This is a common problem in natural language processing, especially when dealing with social media (think emoji, hashtags etc.), but we're going to ignore that for now.
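
For reference, here is a minimal sketch of the kind of cleanup being skipped, assuming we wanted to strip URLs, @mentions, the `#` in hashtags, and leftover symbols. The helper name `clean_post`, the regex patterns, and the sample string are all illustrative assumptions; none of this is used in the rest of the notebook.

```python
import re

def clean_post(text):
    """Hypothetical helper: light cleanup for one social media post."""
    text = re.sub(r"https?://\S+", " ", text)     # drop URLs
    text = re.sub(r"@\w+", " ", text)             # drop @mentions
    text = text.replace("#", "")                  # keep the hashtag word, drop '#'
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # drop punctuation and other symbols
    return re.sub(r"\s+", " ", text).strip().lower()

# Made-up example post, not a row from the dataset.
cleaned = clean_post("RT @someone: Check this out! #news http://example.com")
print(cleaned)
```

If we wanted to apply it, something like `df['text_feature'].map(clean_post)` would do so row by row.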

### Twitter or Facebook?

In [11]:
df['twitter'] = df['source_feature'].map(lambda s: 1 if s=='twitter' else 0)

In [12]:
df[['twitter', "source_feature"]]

Unnamed: 0,twitter,source_feature
0,1,twitter
1,1,twitter
2,1,twitter
3,1,twitter
4,1,twitter
...,...,...
4995,0,facebook
4996,0,facebook
4997,0,facebook
4998,0,facebook


In [13]:
# The classes are perfectly balanced: 2,500 posts each.

df['twitter'].value_counts() 

1    2500
0    2500
Name: twitter, dtype: int64

In [14]:
# Split our data into `X` and `y`.
X = df[['text_feature']]   # DataFrame
y = df['twitter']          # Series

In [15]:
X.head()

Unnamed: 0,text_feature
0,RT @nowthisnews: Rep. Trey Radel (R- #FL) slam...
1,VIDEO - #Obamacare: Full of Higher Costs and ...
2,Please join me today in remembering our fallen...
3,RT @SenatorLeahy: 1st step toward Senate debat...
4,.@amazon delivery #drones show need to update ...


In [16]:
# Train/test split, stratified on y so both sets keep the 50/50 class balance.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.25,
                                                    random_state=42,
                                                    stratify=y)

In [17]:
# Import CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate our CountVectorizer.
cvec = CountVectorizer(stop_words='english', max_features=500)

In [18]:
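# Fit our CountVectorizer on the training data and transform training data.
# toarray() returns a plain NumPy array (todense() returns a np.matrix).
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
X_train_cvec = pd.DataFrame(cvec.fit_transform(X_train['text_feature']).toarray(), columns=cvec.get_feature_names_out())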


In [19]:
# Transform our testing data with the already-fit CountVectorizer.
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
X_test_cvec = pd.DataFrame(cvec.transform(X_test['text_feature']).toarray(), columns=cvec.get_feature_names_out())

In [20]:
X_train_cvec.head()  #500 columns because we set max features 

Unnamed: 0,00,000,10,11,12,20,2013,2014,30,40,...,yesterday,york,young,youtube,û_,ûª,ûªm,ûªs,ûªt,ûò
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0


In [21]:
# Instantiate our model!

nb = MultinomialNB()

In [22]:
# Fit our model!

model = nb.fit(X_train_cvec, y_train)

In [23]:
# Generate our predictions!

predictions = model.predict(X_test_cvec)
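
To show roughly what `fit` and `predict` are doing, here is a pure-Python sketch of multinomial Naive Bayes with Laplace smoothing (`alpha=1.0`, which is also the default in `MultinomialNB`). The function names and the toy posts are hypothetical, and this is a simplified illustration rather than scikit-learn's actual implementation.

```python
import math
from collections import Counter, defaultdict

def fit_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = sorted({w for d in docs for w in d.split()})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        # Denominator: all word occurrences in the class, plus alpha per vocabulary word.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_like

def predict_nb(doc, log_prior, log_like):
    """Return the class with the highest log posterior; unseen words are ignored."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in doc.split() if w in log_like[c]
        )
    return max(scores, key=scores.get)

# Made-up posts: 1 = twitter, 0 = facebook.
toy_docs = [
    "rt breaking news",
    "rt vote now",
    "long post about my weekend",
    "photos from my weekend trip",
]
toy_labels = [1, 1, 0, 0]

log_prior, log_like = fit_nb(toy_docs, toy_labels)
print(predict_nb("rt news now", log_prior, log_like))
print(predict_nb("my weekend photos", log_prior, log_like))
```

The "naive" part is the sum inside `predict_nb`: each word contributes to the score independently of every other word in the post.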

### Evaluation

We can evaluate based on the following metrics:

- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Sensitivity (recall) = TP / (TP + FN)
- Specificity = TN / (TN + FP)
- Precision = TP / (TP + FP)
- AUC ROC
- F1 = 2*Precision*Sensitivity/(Precision+Sensitivity)

In this case, neither kind of misclassification is worse than the other and the classes are balanced, so we can use accuracy, which is what the scores below report.

In [24]:
# Score our model on the training set.

model.score(X_train_cvec, y_train)

0.8429333333333333

In [25]:
# Score our model on the testing set.

model.score(X_test_cvec, y_test)

0.796

In [26]:
# Generate a confusion matrix.

confusion_matrix(y_test, predictions)


array([[522, 103],
       [152, 473]], dtype=int64)

In [27]:
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()

In [28]:
print("True Negatives: {}".format(tn))
print("False Positives: {}".format(fp))
print("False Negatives: {}".format(fn))
print("True Positives: {}".format(tp))

True Negatives: 522
False Positives: 103
False Negatives: 152
True Positives: 473
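
As a quick check, the four counts printed above can be plugged into the metric definitions by hand. This sketch hard-codes the counts so it runs on its own, and it treats Twitter (label 1) as the positive class; F1 is computed as the harmonic mean of precision and sensitivity (recall).

```python
# Counts copied from the confusion matrix output above.
tn, fp, fn, tp = 522, 103, 152, 473

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # recall for the Twitter class
specificity = tn / (tn + fp)   # recall for the Facebook class
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

for name, value in [
    ("Accuracy", accuracy),
    ("Sensitivity", sensitivity),
    ("Specificity", specificity),
    ("Precision", precision),
    ("F1", f1),
]:
    print(f"{name}: {value:.3f}")
```

Accuracy comes out to 0.796, matching the test-set score from `model.score`. Sensitivity (about 0.757) is lower than specificity (about 0.835), so the model misses Twitter posts more often than it misses Facebook posts.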


### Improving our model

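- Try to collect more data.
- Try using fewer features by setting `max_features` to a smaller number when instantiating our CountVectorizer. Training accuracy (0.843) is higher than testing accuracy (0.796), so a simpler model may generalize better.
- Try a TF-IDF vectorizer.
- Try a non-default prior (this is almost never needed).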
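
A couple of these ideas can be sketched together. The example below swaps `CountVectorizer` for `TfidfVectorizer`, lowers `max_features`, and sets an explicit `class_prior`, all wrapped in a pipeline. The toy posts, the value 300, and the 50/50 prior are illustrative assumptions, not tuned choices, and nothing here has been run on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up posts standing in for X_train['text_feature'] and y_train.
toy_text = [
    "rt breaking news",
    "rt vote now",
    "long post about my weekend",
    "photos from my weekend trip",
]
toy_y = [1, 1, 0, 0]  # 1 = twitter, 0 = facebook

tfidf_nb = make_pipeline(
    TfidfVectorizer(stop_words="english", max_features=300),  # fewer features than 500
    MultinomialNB(class_prior=[0.5, 0.5]),                    # explicit prior, ordered as classes 0, 1
)
tfidf_nb.fit(toy_text, toy_y)

toy_pred = tfidf_nb.predict(["breaking news vote"])
print(toy_pred)
```

On the real data the same pipeline could be fit with `tfidf_nb.fit(X_train['text_feature'], y_train)` and compared against the baseline with `tfidf_nb.score(X_test['text_feature'], y_test)`. Because the vectorizer lives inside the pipeline, it is fit on the training text only.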
    
