# Naive Bayes - Trained on the Sentiment140 data set
In this section we attempt to build a custom Naive Bayes model for text classification. Naive Bayes classifiers estimate conditional probabilities from pre-tagged data, so we will use Sentiment140, a pre-tagged data set containing about 1.6 million tweets. Its labeling scheme is 0 = negative, 2 = neutral, and 4 = positive; only 0 and 4 appear in the preview rows below. The plan is to train and score a multinomial Naive Bayes model from sklearn, which first requires "vectorizing" the text: applying a TFIDF (term frequency/inverse document frequency) transformation to represent the text values as numeric values. Note that the cells in this section actually use sklearn's `HashingVectorizer` and `BernoulliNB` with default settings.

Data source: https://www.kaggle.com/datasets/kazanova/sentiment140?resource=download
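To make the TFIDF plus multinomial Naive Bayes idea concrete, here is a minimal sketch on a handful of hand-written sentences. The toy texts, labels, and variable names are assumptions made up for illustration; they are not taken from Sentiment140, and this is not the model trained later in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy examples (not from Sentiment140).
toy_texts = [
    "i love this happy day",
    "what a great wonderful morning",
    "i hate this awful day",
    "what a terrible sad morning",
]
# Same coding as the target column: 0 = negative, 4 = positive.
toy_labels = [4, 4, 0, 0]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),  # text -> non-negative TF-IDF weights
    ("clf", MultinomialNB()),      # per-class word likelihoods, smoothed
])
toy_pipeline.fit(toy_texts, toy_labels)

# Words seen only in positive (or only in negative) training texts
# pull the prediction toward that class.
toy_pred = toy_pipeline.predict(
    ["such a happy wonderful day", "an awful terrible day"]
)
print(toy_pred)
```

`MultinomialNB` expects non-negative feature values, which TF-IDF weights satisfy.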

In [4]:
import os
import pandas as pd
from comment_scraper import get_sql_table # local module
from matplotlib import pyplot as plt
import pickle
plt.style.use('seaborn-notebook')

# Suppress warnings from sklearn by replacing warnings.warn with a no-op
import warnings

def warn(*args, **kwargs):
    pass

warnings.warn = warn

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

In [6]:
DATA_DIR = 'C:\\Users\\keatu\\Regis_archive\\practicum2_data\\'
dbname = os.path.join(DATA_DIR, "Youtube_Data_msnbc.db")

In [14]:
s140df = pd.read_csv(os.path.join(DATA_DIR,"resources","s140_processed.csv"), encoding='latin-1')

In [15]:
s140df

| Unnamed: 0 | target | id | date | flag | user | text | preproc_text |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... | @switchfoot http://twitpic.com/2y1zl - awww , ... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... | upset can't update facebook texting ... might ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... | @kenichan dive many time ball . manage save 50... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire | whole body feel itchy like fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... | @nationwideclass , behave . i'm mad . ? can't ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1599995 | 4 | 2193601966 | Tue Jun 16 08:40:49 PDT 2009 | NO_QUERY | AmandaMarie1028 | Just woke up. Having no school is the best fee... | wake . school best feeling ever |
| 1599996 | 4 | 2193601969 | Tue Jun 16 08:40:49 PDT 2009 | NO_QUERY | TheWDBoards | TheWDB.com - Very cool to hear old Walt interv... | thewdb.com - cool hear old walt interview ! Ã¢... |
| 1599997 | 4 | 2193601991 | Tue Jun 16 08:40:49 PDT 2009 | NO_QUERY | bpbabe | Are you ready for your MoJo Makeover? Ask me f... | ready mojo makeover ? ask detail |
| 1599998 | 4 | 2193602064 | Tue Jun 16 08:40:49 PDT 2009 | NO_QUERY | tinydiamondz | Happy 38th Birthday to my boo of alll time!!! ... | happy 38th birthday boo alll time ! ! ! tupac ... |


In [18]:
text=s140df['text']
target=s140df['target']
text_train, text_test, target_train, target_test=train_test_split(text, target, random_state=50)
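The split above passes no `test_size`, so `train_test_split` falls back to its default of holding out 25% of the rows. The sketch below shows that behaviour on made-up data, along with the optional `stratify` argument, which keeps label proportions similar in both splits. The toy lists and variable names are hypothetical stand-ins for the `text` and `target` columns.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the `text` and `target` columns.
toy_text = ["tweet number {}".format(i) for i in range(100)]
toy_target = [0] * 50 + [4] * 50

# No test_size given: 25% of the rows are held out for testing.
tr_text, te_text, tr_target, te_target = train_test_split(
    toy_text, toy_target, random_state=50
)
print(len(tr_text), len(te_text))

# Optional: stratify on the labels so both splits keep a similar class mix.
s_tr_text, s_te_text, s_tr_target, s_te_target = train_test_split(
    toy_text, toy_target, random_state=50, stratify=toy_target
)
print(s_te_target.count(0), s_te_target.count(4))
```

Whether stratifying matters for this data set depends on how balanced the two classes are, which is worth checking with `target.value_counts()`.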

In [17]:
# Vectorizer and classifier both use their default settings
estimators = [('vec', HashingVectorizer()),
              ('clf', BernoulliNB())]

# The pipeline hashes the raw text, then fits the classifier on the result
pipeline = Pipeline(estimators)

# Fit on the training split created above
mod = pipeline.fit(text_train, target_train)

# Mean accuracy on the held-out split
mod.score(text_test, target_test)

0.7211125
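One interaction between the two default settings may be worth checking. `HashingVectorizer` defaults to `alternate_sign=True`, so some hashed features get negative values, and `BernoulliNB` defaults to `binarize=0.0`, which maps values greater than 0 to 1 and everything else to 0. The sketch below uses a made-up sentence to show how many hashed features end up negative. It only illustrates the library defaults and makes no claim about how the score above would change.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import binarize

# Hypothetical sample sentence with many distinct tokens.
sample = ["the quick brown fox jumps over the lazy dog while seven "
          "wizards quietly mix bright purple potions under glowing "
          "lanterns near ancient stone towers"]

signed = HashingVectorizer().transform(sample)                        # defaults
unsigned = HashingVectorizer(alternate_sign=False).transform(sample)  # no sign flips

n_negative = int((signed.data < 0).sum())
print("negative entries with defaults:", n_negative)
print("negative entries with alternate_sign=False:",
      int((unsigned.data < 0).sum()))

# Thresholding at 0.0, as BernoulliNB does by default, keeps only the
# strictly positive entries; negatively signed features look absent.
binarized = binarize(signed, threshold=0.0)
print("features kept after binarizing:", binarized.nnz, "of", signed.nnz)
```

If the negative entries turn out to matter, passing `alternate_sign=False` to the vectorizer is one option to try.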

In [1]:
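# Check the installed sklearn version (cell contents assumed from the version-string output below)
import sklearn
sklearn.__version__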

'0.21.3'

### Save model to disk

In [None]:
filename = 'bernoullinb_model.sav'
with open(filename,'wb') as f:
    pickle.dump(mod, f)

# load it later
# with open(filename, 'rb') as f:
#    model = pickle.load(f)
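As a quick sanity check on the save/load pattern, the sketch below pickles a small pipeline to a temporary file, loads it back, and confirms that both copies predict the same labels. The toy training texts, file name, and variable names are hypothetical. Note that scikit-learn does not guarantee a pickled model will load correctly under a different library version, so it helps to record the version alongside the file.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Hypothetical toy pipeline standing in for `mod`.
toy_model = Pipeline([("vec", HashingVectorizer()), ("clf", BernoulliNB())])
toy_model.fit(["good happy fun", "bad sad awful"], [4, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.sav")
    # Save the fitted pipeline (vectorizer and classifier together).
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)
    # Load it back into a new object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

new_texts = ["happy fun day", "sad awful day"]
same = list(restored.predict(new_texts)) == list(toy_model.predict(new_texts))
print(same)
```

Pickling the whole pipeline rather than only the classifier means the restored object can be called directly on raw text.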