# MNB Sentiment Scoring Model

<p> This notebook trains a Multinomial Naive Bayes (MNB) sentiment classification model on tweets from a Kaggle competition. The trained model will be saved and exported at the end for reuse. </p>

## Packages

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
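# scipy.stats.itemfreq was unused here and has been removed from recent SciPy releases; np.unique(..., return_counts=True) is used below instead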
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

## Import Training and Test Data

#### Our training data consists of the following attributes:
<ul> 
    <li><b>target:</b> the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)</li>
    <li><b>ids:</b> the id of the tweet (e.g. 2087)</li>
    <li><b>date:</b> the date of the tweet (e.g. Sat May 16 23:58:44 UTC 2009)</li>
    <li><b>flag:</b> the query (e.g. lyx); if there is no query, this value is NO_QUERY</li>
    <li><b>user:</b> the user who tweeted (e.g. robotickilldozr)</li>
    <li><b>text:</b> the text of the tweet (e.g. Lyx is cool)</li>
</ul>

In [4]:
filename = "training_XL.csv"
data_set = pd.read_csv(filename, delimiter=',', encoding='ISO-8859-1', header=None)

## Process Data

In [5]:
data_set.columns = ["target","ids","date","flag","user","text"]
# Shuffle the rows: sample(frac=1) returns 100% of the rows in random order
data_set = data_set.sample(frac=1).reset_index(drop=True)
y=data_set['target'].values
X=data_set['text'].values
data_set.head()

| | target | ids | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1978075676 | Sat May 30 22:22:47 PDT 2009 | NO_QUERY | JaySkillz | @MISSCOKASPLASH You left me |
| 1 | 0 | 2015775002 | Wed Jun 03 05:29:19 PDT 2009 | NO_QUERY | hayleydoherty88 | Its to warm i dnt want to work |
| 2 | 4 | 1676588874 | Fri May 01 22:30:41 PDT 2009 | NO_QUERY | teddybeans | Chicken rice for lunch Oops I'm so full. |
| 3 | 4 | 1972211695 | Sat May 30 08:47:24 PDT 2009 | NO_QUERY | Shoko_RDJ | Good night, tweeters! I really enjoyed with Tw... |
| 4 | 0 | 2324163367 | Thu Jun 25 02:27:15 PDT 2009 | NO_QUERY | ambykyns | Listen to &quot;Weightless&quot; by All Time L... |


## Prepare Data For Holdout Test
<p> Recall that X holds the tweet text and y holds the sentiment score (the target column). </p>

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
print(X_train[0])
print(y_train[0])
print(X_test[0])
print(y_test[0])

(1072000,) (1072000,) (528000,) (528000,)
Today is 777 days until LeakyCon 2011. Awesome number but I wish it wasn't so far away 
0
if you want to stay on my good side......never call me a skinny mini......I'm not skinny...yet like five people called me that last night 
0
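
The split above relies on the earlier shuffle to keep the two classes roughly balanced. As a hedged aside, `train_test_split` also accepts a `stratify` argument that keeps the class ratio the same in both parts. The sketch below uses a small made-up dataset (the `X_demo`, `y_demo` and related names are hypothetical) and is not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the tweet texts and the 0/4 polarity labels (hypothetical data).
X_demo = np.array([f"tweet {i}" for i in range(12)])
y_demo = np.array([0] * 6 + [4] * 6)

# stratify=y_demo keeps the 0/4 ratio identical in both splits;
# random_state makes the split repeatable.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0, stratify=y_demo
)

# Count how many examples of each class ended up in each split.
train_counts = dict(zip(*np.unique(yd_train, return_counts=True)))
test_counts = dict(zip(*np.unique(yd_test, return_counts=True)))
print(train_counts, test_counts)
```

With classes as evenly balanced as those in this dataset, stratification changes little, but it removes one source of run-to-run variation.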


In [7]:
training_labels = set(y_train)
print(training_labels)
training_category_dist = np.unique(y_train, return_counts=True)
print(training_category_dist)

{0, 4}
(array([0, 4]), array([535907, 536093]))


## Vectorization

In [59]:
#  unigram boolean vectorizer, set minimum document frequency to 5
unigram_bool_vectorizer = CountVectorizer(encoding='latin-1', binary=True, min_df=5, stop_words='english')

#  unigram term frequency vectorizer, set minimum document frequency to 5
unigram_count_vectorizer = CountVectorizer(encoding='latin-1', binary=False, min_df=5, stop_words='english')

#  unigram and bigram term frequency vectorizer, set minimum document frequency to 5
gram12_count_vectorizer = CountVectorizer(encoding='latin-1', ngram_range=(1,2), min_df=5, stop_words='english')

#  unigram and bigram tfidf vectorizer (despite the variable name), set minimum document frequency to 5
unigram_tfidf_vectorizer = TfidfVectorizer(encoding='latin-1',ngram_range=(1,2), use_idf=True, min_df=5, stop_words='english')
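
To make the difference between these four configurations concrete, here is a minimal sketch on a three-document toy corpus. The corpus and the variable names (`docs`, `bool_vec`, and so on) are hypothetical, and `min_df` and `stop_words` are left at their defaults so that no term is dropped from such a tiny example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny hypothetical corpus; min_df stays at its default of 1 so no term is filtered out.
docs = ["good good day", "bad day", "good night"]

bool_vec = CountVectorizer(binary=True)            # 1 if the term occurs, else 0
count_vec = CountVectorizer(binary=False)          # raw term counts
bigram_vec = CountVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
tfidf_vec = TfidfVectorizer(ngram_range=(1, 2), use_idf=True)

bool_matrix = bool_vec.fit_transform(docs).toarray()
count_matrix = count_vec.fit_transform(docs).toarray()
bigram_vec.fit(docs)
tfidf_matrix = tfidf_vec.fit_transform(docs).toarray()

# Column index of the word "good" in the unigram vocabulary.
good = count_vec.vocabulary_["good"]
print(bool_matrix[0, good], count_matrix[0, good])  # presence flag vs. raw count
print(sorted(bigram_vec.vocabulary_))               # unigrams plus bigrams
print(tfidf_matrix[0].round(2))                     # tf-idf weights for the first document
```

The boolean and count vectorizers differ only in how repeated words are scored, while the `ngram_range=(1, 2)` variants enlarge the vocabulary with two-word phrases.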


### Vectorize Training Data

In [60]:
# fit the vocabulary on the training documents and transform them into vectors
X_train_vec = unigram_tfidf_vectorizer.fit_transform(X_train)
# transform (do not refit) the test documents so they share the training vocabulary
X_test_vec = unigram_tfidf_vectorizer.transform(X_test)

# check the content of a document vector
print(X_train_vec.shape)
print(X_train_vec[0].toarray())

# check the size of the constructed vocabulary
print(len(unigram_tfidf_vectorizer.vocabulary_))

# print out the first 10 items in the vocabulary
print(list(unigram_tfidf_vectorizer.vocabulary_.items())[:10])

# look up a word's column index in the vocabulary (None means the word is not in the vocabulary)
print(unigram_tfidf_vectorizer.vocabulary_.get('imaginative'))

(1072000, 43890)
[[0. 0. 0. ... 0. 0. 0.]]
43890
[('today', 37715), ('days', 8151), ('2011', 273), ('awesome', 2432), ('number', 26735), ('wish', 42015), ('wasn', 40946), ('far', 11162), ('away', 2390), ('wish wasn', 42140)]
None


## Training the MNB Model

#### Note: hold-out test accuracy is stuck at about 77%, but production needs 85%. Next step: try a linear SVM or Bernoulli Naive Bayes.

In [61]:
# initialize the MNB model
nb_clf = MultinomialNB()
# use the training data to train the MNB model
nb_clf.fit(X_train_vec,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
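
The introduction mentions saving the trained model for reuse. One possible way to do that, sketched here with the standard-library `pickle` module on a toy model, is to store the fitted vectorizer together with the classifier. The file name `mnb_sentiment.pkl`, the toy texts and the dictionary layout are assumptions for illustration only.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the tweets and their 0/4 labels.
texts = ["love this", "great day", "hate this", "awful day"]
labels = [4, 4, 0, 0]

vectorizer = TfidfVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

# Save the vectorizer together with the classifier: the model is only usable
# with the exact vocabulary it was trained on.
model_path = os.path.join(tempfile.mkdtemp(), "mnb_sentiment.pkl")  # hypothetical file name
with open(model_path, "wb") as fh:
    pickle.dump({"vectorizer": vectorizer, "model": clf}, fh)

# Later, for example in another process: reload the bundle and score new text.
with open(model_path, "rb") as fh:
    bundle = pickle.load(fh)
restored_pred = bundle["model"].predict(bundle["vectorizer"].transform(["love this day"]))
print(restored_pred)
```

Pickled scikit-learn objects are generally only safe to load with the same library version that wrote them, and only from trusted sources.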

In [62]:
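# Pair each feature's log-probability under class 0 (negative) with its name, sorted ascending.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
feature_ranks = sorted(zip(nb_clf.feature_log_prob_[0], unigram_tfidf_vectorizer.get_feature_names_out()))
# Highest log-probability under the negative class
very_negative_features = feature_ranks[-10:]
# Lowest log-probability under the negative class: these features are rare in negative tweets,
# which is not the same as being indicative of positive sentiment
v_pos_f = feature_ranks[:10]
print(very_negative_features)
print('\n',v_pos_f)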

[(-5.805283421832247, 'going'), (-5.73048122845228, 'don'), (-5.720766057254302, 'want'), (-5.716469713293092, 'like'), (-5.711298223467692, 'sad'), (-5.701307746588295, 'day'), (-5.691856625972877, 'today'), (-5.532478613221889, 'miss'), (-5.400672233724215, 'just'), (-5.349361938377399, 'work')]

 [(-14.150878430997022, '000 contacts'), (-14.150878430997022, '15 11'), (-14.150878430997022, '1cp2'), (-14.150878430997022, '1cp2 follow'), (-14.150878430997022, '4officeautomation'), (-14.150878430997022, '4officeautomation com'), (-14.150878430997022, '6dvj4'), (-14.150878430997022, '6g55n'), (-14.150878430997022, 'account peace'), (-14.150878430997022, 'add train')]


In [63]:
nb_clf.score(X_test_vec,y_test)

0.7678371212121212
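
The note at the top of this section suggests trying a linear SVM or Bernoulli Naive Bayes. A minimal sketch of such a comparison follows, assuming scikit-learn's `BernoulliNB` and `LinearSVC` with near-default settings. The toy texts, the `candidates` dictionary and the `C=1.0` value are illustrative choices; in the notebook the same loop would run on `X_train_vec` and `X_test_vec` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import LinearSVC

# Hypothetical toy data; in the notebook these would be the real train/test splits.
train_texts = ["love it", "so happy", "great day", "hate it", "so sad", "awful day"]
train_labels = [4, 4, 4, 0, 0, 0]
test_texts = ["happy day", "sad day"]
test_labels = [4, 0]

# Binary features suit BernoulliNB and also work for the other two models.
vec = CountVectorizer(binary=True)
Xa_train = vec.fit_transform(train_texts)
Xa_test = vec.transform(test_texts)

candidates = {
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
    "LinearSVC": LinearSVC(C=1.0),
}

# Fit each candidate on the same features and record its hold-out accuracy.
scores = {}
for name, model in candidates.items():
    model.fit(Xa_train, train_labels)
    scores[name] = model.score(Xa_test, test_labels)
print(scores)
```

Scores on a toy corpus say nothing about which model would reach 85% on the real data; the point is only that all three share the same `fit` and `score` interface, so swapping them in is cheap.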