# User Message Sentiment Classification
### BOW - TF-IDF - NB
---

The goal of this notebook is to develop a machine learning model to analyze the sentiment of user messages. <br> 
The program will train this model from pre-labeled Twitter data, and the model will be used to assess sentiment of user message data referencing specific brands or products over time. <br>
This notebook uses TF-IDF tranformed, bag of words corpus trained to a Naive Bayes classification model.

## Import Training Data 

Labeled Twitter training and test sentiment data: 
- http://help.sentiment140.com/for-students/
- http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip 

In [2]:
# Import analysis libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
# Import training data csv
cols = ['sentiment','id','date','query_string','user','text']
df = pd.read_csv('../data/twitter_sentiment140/training.1600000.processed.noemoticon.csv', 
                 names=cols, encoding='latin1')

# Restrict training data to sample of 100K records for code development
df = df.sample(100000)

df.head()

Unnamed: 0,sentiment,id,date,query_string,user,text
502054,0,2187397522,Mon Jun 15 20:13:21 PDT 2009,NO_QUERY,jessica515,@xoxmichelle lmao to our discoveries! wish our...
219802,0,1976480205,Sat May 30 17:26:16 PDT 2009,NO_QUERY,jlpnut,analyzing=good. overanalyzing=not so good.
1482848,4,2067353287,Sun Jun 07 12:11:43 PDT 2009,NO_QUERY,jennamay0711,is watching the littlest vampireee
67736,0,1692489538,Sun May 03 20:06:22 PDT 2009,NO_QUERY,xreeshix,@sXe_rockstar you're near a puter? why are yo...
1056121,4,1962415930,Fri May 29 11:00:43 PDT 2009,NO_QUERY,StudioKiSun,@lyneL @_Gavia_ You gals are on a roll this mo...


In [4]:
# Restrict to sentiment and text columns and remap sentiment values
df = df[['sentiment','text']]
df['sentiment'] = df.sentiment.map({0:0, 4:1})
df.sentiment.value_counts()

0    50005
1    49995
Name: sentiment, dtype: int64

## Text Pre-Processing 
Since the training data is derived from Twitter posts, the text strings contain tags mentioning other users (using the @ symbol syntax) and website URLs to associate external content with each post. 

Cleaning logic will be implemented to apply the following to each text string: 
- Remove HTML tags
- Remove @user mentions
- Remove URL hyperlinks
- Remove stop words
- Remove punctuation
- Lower case text

In [5]:
import re
import string
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

atu_pattern    = r'@[A-Za-z0-9]+'
url_pattern    = r'https?://[A-Za-z0-9./]+' 
remove_table   = str.maketrans({key: None for key in string.punctuation + string.digits})
stopwords_list = [word.translate(remove_table) for word in stopwords.words('english')]

def tweet_cleaner(text):   
    # Remove html tags
    soup = BeautifulSoup(text, 'lxml')
    souped = soup.get_text()
    
    # Remove @user mentions and URLs
    strip_1 = re.sub(atu_pattern, '', souped)
    strip_2 = re.sub(url_pattern, '', strip_1)
    
    # Remove stop words, punctuation, and numbers
    # Split words into list and lower case the text
    clean_split = [word.lower().translate(remove_table) 
                   for word in strip_2.split() 
                   if word.lower().translate(remove_table) 
                       and word.lower().translate(remove_table) not in stopwords_list]
    
    return clean_split

# Additional text processing can include stemming and lemmatization steps

## Vectorization and Transformation
Create bag-of-words vectors for each cleaned text string and apply TF-IDF transformation to normalize word scores.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [7]:
%%time
# Fit BOW vector to tweet_cleaner applied corpus
bow_transformer = CountVectorizer(analyzer=tweet_cleaner).fit(df['text'])

# Print total number of vocab words
print(len(bow_transformer.vocabulary_))

64354
CPU times: user 24.8 s, sys: 1.27 s, total: 26.1 s
Wall time: 26.3 s


In [8]:
# Transform text strings into BOW
text_bow = bow_transformer.transform(df['text'])

In [9]:
# Fit TF-IDF transformer
tfidf_transformer = TfidfTransformer().fit(text_bow)

In [10]:
# Transform BOW corpus into TF-IDF corpus
text_tfidf = tfidf_transformer.transform(text_bow)

## Model Training 
Train a Naive-Bayes model to assess sentiment of messages from the TF-IDF transformed BOW corpus.

In [11]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

In [12]:
# Train Naive-Bayes model to classify sentiment
sent_model = MultinomialNB().fit(text_tfidf, df['sentiment'])

In [13]:
# Test out the evaluation of the model using the training outcome labels
pred = sent_model.predict(text_tfidf)
print(classification_report(df['sentiment'], pred))

              precision    recall  f1-score   support

           0       0.83      0.86      0.84     50005
           1       0.86      0.82      0.84     49995

   micro avg       0.84      0.84      0.84    100000
   macro avg       0.84      0.84      0.84    100000
weighted avg       0.84      0.84      0.84    100000



## Model Pipeline
Develop a workflow pipeline to apply the text cleaning, word vectorization, TF-IDF transformation, and classification training. The pipeine can then be used to more easily clean and generate predicitons of the test data.

In [14]:
# Import test data csv
cols = ['sentiment','id','date','query_string','user','text']
test = pd.read_csv('../data/twitter_sentiment140/testdata.manual.2009.06.14.csv', names=cols, encoding='latin1')
test = test[['sentiment','text']]
test['sentiment'] = test.sentiment.map({0:0, 4:1})

test = test.dropna()
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 359 entries, 0 to 497
Data columns (total 2 columns):
sentiment    359 non-null float64
text         359 non-null object
dtypes: float64(1), object(1)
memory usage: 8.4+ KB


In [15]:
%%time
# Build workflow pipeline
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('bow',        CountVectorizer(analyzer=tweet_cleaner)), 
                     ('tfidf',      TfidfTransformer()),  
                     ('classifier', MultinomialNB())])

# Fit pipeline with training data
pipeline.fit(df.text, df.sentiment)

CPU times: user 24.4 s, sys: 1.11 s, total: 25.5 s
Wall time: 25.5 s


In [16]:
# Predict test data with pipeline
pred = pipeline.predict(test.text)

# Print classification report
print(classification_report(pred, test.sentiment))

              precision    recall  f1-score   support

           0       0.80      0.75      0.77       187
           1       0.75      0.79      0.77       172

   micro avg       0.77      0.77      0.77       359
   macro avg       0.77      0.77      0.77       359
weighted avg       0.77      0.77      0.77       359

