## Apply a suitable algorithm for sentiment classification

Possible algorithms:

- Naive Bayes
    * Web resources
        * [how to build a twitter sentiment analyzer ?](http://ravikiranj.net/posts/2012/code/how-build-twitter-sentiment-analyzer/)
        * [Twitter Sentimental Analysis](https://github.com/victorneo/Twitter-Sentimental-Analysis)
    * Papers
        * [Twitter Sentiment Classification using Distant Supervision](http://www-cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf)

__Note__: 

-----------

_ROUGH_: use "Twitter Sentimental Analysis", which implements Naive Bayes with NLTK.  
_FINAL DECISION_: improve Naive Bayes following the UniBo thesis, OR try another classifier from "how to build a twitter sentiment analyzer ?".

----

In [2]:
import re
import nltk
import numpy as np
import pandas as pd
import pickle

# NLP functions

## Preprocess tweets
1. Lower case - convert the tweets to lower case (in the code below, lowercasing happens per word in `getFeatureVector`, not in `processTweet`).
2. URLs - the short URLs are not followed to inspect the linked content, so all URLs can be removed via regular expression matching or replaced with the generic word URL.
3. @username - remove "@username" via regex matching or replace it with the generic word AT_USER.
4. #hashtag - hashtags can carry useful information, so they are replaced with the same word without the hash, e.g. #nike becomes 'nike'.
5. Punctuation and extra whitespace - remove punctuation at the start and end of the tweet, e.g. ' the day is beautiful! ' becomes 'the day is beautiful'. It also helps to replace runs of whitespace with a single space.
6. Emoticons - group the positive emoticons and substitute them with "posmaremo", and the negative ones with "negmaremo". # TODO
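Steps 1 and 6 interact: emoticons such as `:D`, `XD` and `D:` are case-sensitive, so lowercasing the whole tweet first would hide them. The following is a minimal sketch of one possible ordering, not the notebook's implementation. The reduced emoticon lists `POSITIVE` and `NEGATIVE` and the helper `group_emoticons_then_lower` are assumptions made for illustration.

```python
import re

# Reduced, illustrative emoticon lists (assumption: not the notebook's full set).
POSITIVE = [":)", ":-)", ":D", "XD", ";)"]
NEGATIVE = [":(", ":-(", "D:", ":/"]


def _alternation(emoticons):
    # Escape every emoticon and try longer ones first, so a longer
    # emoticon is never shadowed by a shorter one that is its prefix.
    ordered = sorted(emoticons, key=len, reverse=True)
    return re.compile("|".join(re.escape(e) for e in ordered))


POS_RE = _alternation(POSITIVE)
NEG_RE = _alternation(NEGATIVE)


def group_emoticons_then_lower(tweet):
    # Replace emoticons while the original casing is still available.
    tweet = POS_RE.sub(" posmaremo ", tweet)
    tweet = NEG_RE.sub(" negmaremo ", tweet)
    # Only now lowercase, then collapse the extra spaces added above.
    tweet = tweet.lower()
    return re.sub(r"\s+", " ", tweet).strip()


print(group_emoticons_then_lower("Great game XD but the Ending :("))
```

Replacing emoticons before lowercasing keeps `XD` and `:D` detectable while still normalizing the remaining words.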



In [3]:
# posmaremo
# '(: :) :] [: :-) (-: [-: :-] (; ;) ;] [; ;-) (-; [-; ;-] :-D :D :-p :p (=: ;=D :=) :S @-) XD' 
# negmaremo
# '\:\(|\)\:|:\-\(|\)\-\:|\;\(|\)\;|\:\-\[|\]\-\:|\;\-\(|\)\-\;|\:\'\[|\:\'\(|\)\'\:|\]\:|\:\[|:\||\:\/|\|\:|\/\:|\:\=\(|\:\=\||\:\=\[|xo|D\:|O\:'
def processTweet(tweet):
    # Convert www.* or https?://* to URL
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    # Convert retweet markers to RT (case-sensitive: only a lowercase "rt @user" matches)
    tweet = re.sub(r'(rt\s)@[^\s]+', 'RT', tweet)
    # Convert @username to AT_USER (this also consumes the '@-)' emoticon)
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
    # Group emoticons
    # Positive emoticons such as :-) or :D become posmaremo
    tweet = re.sub(r'\(\:|\:\)|\:\]|\[\:|\:\-\)|\(\-\:|\[\-\:|\:\-\]|\(\;|\;\)|\;\]|\[\;|\;\-\)|\(\-\;|\[\-\;|\;\-\]|\:\-D|:D|\:\-p|\:p|\(\=\:|\;\=D|\:\=\)|\:S|@\-\)|XD', 'posmaremo', tweet)
    # Negative emoticons such as )-: or :/ become negmaremo
    tweet = re.sub(r'\:\(|\)\:|:\-\(|\)\-\:|\;\(|\)\;|\:\-\[|\]\-\:|\;\-\(|\)\-\;|\:\'\[|\:\'\(|\)\'\:|\]\:|\:\[|:\||\:\/|\|\:|\/\:|\:\=\(|\:\=\||\:\=\[|xo|D\:|O\:', 'negmaremo', tweet)
    # Collapse runs of whitespace into a single space
    tweet = re.sub(r'[\s]+', ' ', tweet)
    # Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    # Trim leading and trailing quote characters
    tweet = tweet.strip('\'"')
    return tweet

## Filtering tweet words and obtaining features vector
__Note__: it is not efficient to clean the tweet completely in one pass; it is first processed as a whole string, and then the individual words are processed as they are extracted to build the feature vector.
1. Stop words - a, is, the, with, etc. The full list of stop words can be found at Stop Word List. These words don't indicate any sentiment and can be removed.
2. Repeated letters - people sometimes repeat letters to stress an emotion, e.g. hunggrryyy or huuuuuuungry for 'hungry'. Runs of 2 or more identical letters are replaced by exactly 2 of the same letter.
3. Punctuation - remove punctuation such as commas, single/double quotes and question marks at the start and end of each word, e.g. beautiful!!!!!! becomes beautiful.
4. Words must start with a letter - for simplicity, all words that don't start with a letter are removed, e.g. 15th, 5.34am.
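The repeated-letter rule in step 2 can be illustrated in isolation. This is a minimal sketch using the same back-reference idea; the name `squeeze_repeats` is an assumption chosen for the example.

```python
import re

# Any character followed by one or more copies of itself.
REPEATS = re.compile(r"(.)\1+", re.DOTALL)


def squeeze_repeats(word):
    """Collapse every run of 2+ identical characters down to exactly two."""
    return REPEATS.sub(r"\1\1", word)


for word in ["huuuuuuungry", "hunggrryyy", "beautiful!!!!!!"]:
    print("{} -> {}".format(word, squeeze_repeats(word)))
```

Note that the rule reduces the number of spelling variants but does not recover the dictionary word: `huuuuuuungry` becomes `huungry`, not `hungry`.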

In [4]:
def replaceTwoOrMore(s):
    # look for 2 or more repetitions of a character and replace them with exactly two
    pattern = re.compile(r"(.)\1{1,}", re.DOTALL)
    return pattern.sub(r"\1\1", s)



##############################################################################################################



def getStopWordList(stopWordListFileName):
    # Start with the placeholder tokens produced by processTweet
    stopWords = ['URL', 'RT', 'AT_USER']
    # Read the stop words file, one word per line
    with open(stopWordListFileName, 'r') as fp:
        for line in fp:
            stopWords.append(line.strip())
    return stopWords


def getFeatureVector(tweet, stwl):
    featureVector = []
    tweet = processTweet(tweet)
    # split tweet into words
    words = tweet.split()
    for w in words:
        # replace two or more repetitions with two occurrences
        w = replaceTwoOrMore(w)
        # strip punctuation
        w = re.sub(r'[^\w\s]', '', w)
        # check that the word starts with a letter and is alphanumeric
        val = re.search(r"^[a-zA-Z][a-zA-Z0-9]*$", w)
        # ignore stop words (compared before lowercasing, so 'With' is kept)
        # and words that fail the check above
        if w in stwl or val is None:
            continue
        featureVector.append(w.lower())
    return featureVector

#### Test
Test the preprocessing and feature extraction functions defined above.

In [5]:
stopWords = getStopWordList('/Users/nicolavitale/Desktop/twitter_data_analysis/develop/data/SmartStoplist.txt')
# test
tweet = "With all the drug use going on in the olympics,someone check the 3 bags of dope judging the boxing#Rio2016 #peetestforjudges #gotospecsavers"
features_vector = getFeatureVector(tweet, stopWords)
print(features_vector)

['with', 'drug', 'olympicssomeone', 'check', 'bags', 'dope', 'judging', 'boxingrio2016', 'peetestforjudges', 'gotospecsavers']
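In the output above, `olympicssomeone` and `boxingrio2016` are glued together because punctuation is deleted rather than treated as a separator. One possible alternative, shown here only as a sketch, is to split on runs of non-word characters. The helper `split_on_punctuation` is hypothetical and not part of the notebook.

```python
import re


def split_on_punctuation(text):
    """Tokenize by treating every run of non-word characters as a separator."""
    # \W+ matches runs of characters that are not letters, digits or underscore,
    # so placeholders such as AT_USER stay in one piece.
    return [tok for tok in re.split(r"\W+", text) if tok]


sample = "on in the olympics,someone check the boxing#Rio2016 #gotospecsavers"
print(split_on_punctuation(sample))
```

With this approach `olympics` and `someone` become separate tokens, at the cost of also splitting words that contain apostrophes.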


## Sentiment_Analysis_Dataset.csv

```
ItemID,Sentiment,SentimentSource,SentimentText
1,0,Sentiment140,                     is so sad for my APL friend.............
2,0,Sentiment140,                   I missed the New Moon trailer...
3,1,Sentiment140,              omg its already 7:30 :O
4,0,Sentiment140,          .. Omgaga. Im sooo  im gunna CRy. I've been at this dentist since 11.. I was suposed 2 just get a crown put on (30mins)...
5,0,Sentiment140,         i think mi bf is cheating on me!!!       T_T
6,0,Sentiment140,         or i just worry too much?        
7,1,Sentiment140,       Juuuuuuuuuuuuuuuuussssst Chillin!!
8,0,Sentiment140,       Sunny Again        Work Tomorrow  :-|       TV Tonight
9,1,Sentiment140,      handed in my uniform today . i miss you already
10,1,Sentiment140,      hmmmm.... i wonder how she my number @-)
```

* Sentiment == 1: positive
* Sentiment == 0: negative
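The file layout and the label convention can be exercised with the standard library alone. This is a minimal sketch, assuming an in-memory sample: the first two data rows come from the excerpt above, while the row with ItemID 99 is made up to show a malformed line with an unquoted comma. The helper name `load_labelled_tweets` is also an assumption.

```python
import csv
import io

# Two rows from the excerpt plus one invented malformed row (ItemID 99).
SAMPLE = """ItemID,Sentiment,SentimentSource,SentimentText
2,0,Sentiment140,                   I missed the New Moon trailer...
3,1,Sentiment140,              omg its already 7:30 :O
99,1,Sentiment140,a row, with an unquoted comma
"""


def load_labelled_tweets(handle, expected_fields=4):
    """Return (positives, negatives) lists of tweet texts, skipping malformed rows."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    positives, negatives = [], []
    for row in reader:
        if len(row) != expected_fields:
            continue  # e.g. an unquoted comma inside the tweet text
        item_id, sentiment, source, text = row
        target = positives if sentiment == "1" else negatives
        target.append(text.strip())
    return positives, negatives


pos, neg = load_labelled_tweets(io.StringIO(SAMPLE))
print(pos)
print(neg)
```

Rows with an unexpected number of fields are dropped, which mirrors the "Skipping line ..." messages that the pandas loader prints for the real file.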


# Data structures

In [6]:
# Load the whole CSV file into a DataFrame with the columns
# ItemID, Sentiment, SentimentSource, SentimentText
# (error_bad_lines=False skips malformed rows; pandas >= 1.3 spells this on_bad_lines='skip')
df_csv = pd.read_csv('/Users/nicolavitale/Desktop/twitter_data_analysis/develop/data/sentiment_analysis/Sentiment_Analysis_Dataset.csv',header=0, error_bad_lines=False)
# df_csv.drop('SentimentSource')
# df_csv.drop('ItemID')
# 1578612 total classified tweets
# separate into positive and negative
df_pos = df_csv.loc[df_csv['Sentiment'] == 1]
df_neg = df_csv.loc[df_csv['Sentiment'] == 0]
# original plan: around 600 tweets, 2/3 training (400), 1/3 test (200);
# the slices below take far larger subsets
pos_train = df_pos[0:344500]
neg_train = df_neg[0:500000]
train_pn = pd.concat([pos_train, neg_train])

# note: these test rows fall inside the training slices above
pos_test = df_pos[200001:300000]
neg_test = df_neg[200001:300000]
test_pn = pd.concat([pos_test, neg_test])

print(len(pos_train))

Skipping line 8836: expected 4 fields, saw 5

Skipping line 535882: expected 4 fields, saw 7



344500
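The cell above slices fixed row ranges, and its test range (rows 200001 to 300000) falls inside the training range. A minimal sketch of a disjoint split in the 2/3 training, 1/3 test proportion mentioned in the cell's comment follows. The helper `split_train_test`, the seed and the integer stand-ins for tweets are assumptions for illustration.

```python
import random


def split_train_test(items, train_fraction=2.0 / 3, seed=0):
    """Shuffle a copy of items and cut it into disjoint train and test lists."""
    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


tweet_ids = list(range(600))  # stand-ins for 600 tweets
train_ids, test_ids = split_train_test(tweet_ids)
print(len(train_ids))
print(len(test_ids))
```

Because both parts come from one shuffled list, no item can appear in training and test at the same time.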


# Implementation

In [7]:
#Read the tweets one by one and process them
stopWords = getStopWordList('/Users/nicolavitale/Desktop/twitter_data_analysis/develop/data/SmartStoplist.txt')
# tw_sent_p = []
twft_sent_p = []

# tw_sent_n = []
twft_sent_n = []

# ftrs_list = []


# pos_train keeps the row labels of the original DataFrame, so label i exists
# only if row i of the CSV was positive; missing labels raise KeyError and are
# skipped, which is why fewer tweets than len(pos_train) are collected.
for i in range(0,len(pos_train)):
    try:
        tweet = pos_train['SentimentText'][i]
        sentiment = pos_train['Sentiment'][i]
        featureVector = getFeatureVector(tweet, stopWords)
#         ftrs_list.extend(featureVector)
#         tw_sent_p.append((tweet, sentiment))
        twft_sent_p.append((featureVector, sentiment))
    except KeyError:
        continue


# same label-based lookup for the negative tweets
for i in range(0,len(neg_train)):
    try:
        tweet = neg_train['SentimentText'][i]
        sentiment = neg_train['Sentiment'][i]
        featureVector = getFeatureVector(tweet, stopWords)
#         ftrs_list.extend(featureVector)
#         tw_sent_n.append((tweet, sentiment))
        twft_sent_n.append((featureVector, sentiment))
    except KeyError:
        continue


# ftrs_list = list(set(ftrs_list))
print(len(twft_sent_p))
print(len(twft_sent_n))


# with open('twft_sent.pickle', 'wb') as handle:
#         pickle.dump(twft_sent, handle)
# with open('ftrs_list.pickle', 'wb') as handle:
#         pickle.dump(ftrs_list, handle)

200529
207463


In [8]:
import random

twft_sent_p = twft_sent_p[0:5000]
# print twft_sent_p[0:10]
# print len(twft_sent_p)

twft_sent_n = twft_sent_n[0:5000]
# print twft_sent_n[0:10]
# print len(twft_sent_n)

documents = []
documents.extend(twft_sent_p)
documents.extend(twft_sent_n)

random.shuffle(documents)

print(documents[0:20])
print('')
print(len(documents))

# all_ftrs = []

# for i in range(0,len(twft_sent)):
#     tweet = twft_sent[i][0]
#     train_ftrs.extend(tweet)
    

# train_ftrs = list(set(train_ftrs))
# print train_ftrs[0:100]
# print len(train_ftrs)


# all_words = nltk.FreqDist(w.lower() for do in movie_reviews.words())

[(['goodsex', 'when', 'makes', 'squirt'], 1), (['jdedwards', 'link', 'jdedwards', 'live', 'gt'], 0), (['lt3', 'love', 'robbie', 'williams'], 1), (['quotupquot', 'good', 'company'], 1), (['quot', 'je', 'te', 'promets', 'quot', 'zaho', 'my', 'favourite', 'song', 'moment'], 1), (['sun', 'isnt', 'today', 'now'], 0), (['mates', 'miserable', 'git', 'lol'], 1), (['dreams', 'true'], 1), (['haha', 'dating', 'enemy'], 1), (['sick', 'might', 'swine', 'flu', 'noo', 'prob', 'docs', 'tomorrow'], 0), (['followfriday', 'sick', 'head', 'i', 'whining', 'tagged', 'fridays'], 1), (['followfriday', 'friends', 'enjoy', 'tweets', 'tgif', 'have', 'great', 'day'], 1), (['fcked', 'thing', 'good'], 0), (['gonna', 'rain', 'so', 'cal', 'thunder', 'lightning'], 1), (['month', 'middle', 'winter', 'day', 'im', 'lost', 'hurry'], 0), (['gtlt', 'ive', 'spacing', 'nightmares', 'day', 'i', 'hate', 'real', 'here', 'i'], 0), (['insomniac', 'enjoying', 'life'], 1), (['meebo', 'awesome'], 1), (['loading', 'asot400'], 0), (['g

In [9]:
all_ftrs = []

for i in range(0,len(documents)):
    tweet = documents[i][0]
    all_ftrs.extend(tweet)        

all_ftrs = list(set(all_ftrs))
print(len(all_ftrs))

16584


In [10]:
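# all_ftrs is already de-duplicated, so every word has frequency 1 here;
# the FreqDist only serves as a container for the vocabulary.
all_words = nltk.FreqDist(i.lower() for i in all_ftrs)
print(all_words)

word_features = list(all_words)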


<FreqDist with 16584 samples and 16584 outcomes>


In [11]:
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features
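To make the shape of the feature dictionary concrete, here is a minimal self-contained sketch of the same idea. The three-word `word_features` list is a hypothetical stand-in for the notebook's 16584-word vocabulary.

```python
# Hypothetical three-word vocabulary standing in for the notebook's word_features.
word_features = ["sad", "thanks", "rain"]


def document_features(document):
    """Map a token list to {'contains(word)': bool} over the fixed vocabulary."""
    document_words = set(document)  # set membership is O(1) per lookup
    return {
        "contains({})".format(word): (word in document_words)
        for word in word_features
    }


features = document_features(["thanks", "great", "day"])
print(features)
```

Every document gets one entry per vocabulary word, including all the `False` ones, so with 16584 words and 10000 documents the resulting list of dictionaries is large.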

In [12]:
featuresets = [(document_features(d), c) for (d,c) in documents]

In [13]:
train_set, test_set = featuresets[:7000], featuresets[7000:]

In [34]:
# with open('twft_sent.pickle', 'rb') as handle:
#     twft_sent = pickle.load(handle)
# with open('ftrs_list.pickle', 'rb') as handle:
#     featureList = pickle.load(handle)
    
# def extract_features(twft):
#     tweet_words = set(twft)
#     features = {}
#     for word in twft:
#         features['contains(%s)' % word] = (word in tweet_words)
#     return features


# def extract_train(twft_sent):
#     training_set = []
#     for t_s in twft_sent:
#         training_set.append((extract_features(t_s[0], train_ftrs), t_s[1]))
#     return training_set

# tr_st = extract_train(twft_sent[0:10])
# print tr_st

In [14]:
print(len(train_set))
print(len(test_set))

7000
3000


In [16]:
# Train the classifier
NBClassifier_70h = nltk.NaiveBayesClassifier.train(train_set)

with open('NBClassifier_70h.pickle', 'wb') as handle:
    pickle.dump(NBClassifier_70h, handle)

------
Use the classifier saved as a pickle object (there is no need to rerun the code above).

__CLASSIFIER TRAINED AND SAVED AS PICKLE OBJECT!!!!!!!!!!!__
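Saving and reloading follow the usual pickle round trip. The sketch below uses a plain dictionary as a stand-in for the trained classifier and a temporary directory instead of the notebook's working directory; both are assumptions made so the example runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in for the trained classifier (assumption: any picklable object behaves the same way).
model = {"name": "NBClassifier_70h", "labels": [0, 1]}

# Write into a temporary directory so the example leaves the working directory untouched.
path = os.path.join(tempfile.mkdtemp(), "NBClassifier_70h.pickle")

with open(path, "wb") as handle:  # binary mode is required for pickle
    pickle.dump(model, handle)

with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == model)
```

When the pickled object is an NLTK classifier, `nltk` has to be importable in the session that loads it, because pickle stores a reference to the class rather than its code.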

In [17]:
# with open('NBClassifier_70h.pickle', 'rb') as handle:
#     NBClassifier_70h = pickle.load(handle)

stwl = getStopWordList('/Users/nicolavitale/Desktop/twitter_data_analysis/develop/data/SmartStoplist.txt')

# Test the classifier
testTweet = "The beast has arrived. Smok TFV8 review. https://t.co/VxA7bs5d8m #vapeon #vapelife #ecigs #ecig #vape #vaping https://t.co/5DhEoObNXu"
processedTestTweet = processTweet(testTweet)
# print processedTestTweet
print(NBClassifier_70h.classify(document_features(getFeatureVector(processTweet(testTweet), stwl))))


# TODO: preprocess the test set before computing the accuracy
# print("Classifier accuracy percent:",(nltk.classify.accuracy(NBClassifier, test_set))*100)

1


In [18]:
print(nltk.classify.accuracy(NBClassifier_70h, test_set))
# try to test only on positive and only on negative

0.756
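The note about testing only on positive and only on negative tweets amounts to a per-class accuracy. A minimal sketch follows, assuming the gold labels and the predictions are available as two parallel lists; the helper `per_class_accuracy` and the toy labels are illustrative, not results from the notebook.

```python
from collections import Counter


def per_class_accuracy(gold, predicted):
    """Return {label: fraction of items with that gold label predicted correctly}."""
    correct = Counter()
    total = Counter()
    for g, p in zip(gold, predicted):
        total[g] += 1
        if g == p:
            correct[g] += 1
    return {label: correct[label] / float(total[label]) for label in total}


# Toy labels for illustration only (1 = positive, 0 = negative).
gold = [1, 1, 1, 0, 0, 0, 0, 1]
pred = [1, 0, 1, 0, 0, 0, 0, 1]
print(per_class_accuracy(gold, pred))
```

With the notebook's objects, the gold labels would be the second element of each pair in `test_set` and the predictions would come from calling `NBClassifier_70h.classify` on the first element.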


In [19]:
NBClassifier_70h.show_most_informative_features(50)

Most Informative Features
           contains(sad) = True                0 : 1      =     44.2 : 1.0
         contains(quoti) = True                1 : 0      =     40.8 : 1.0
   contains(musicmonday) = True                1 : 0      =     36.5 : 1.0
  contains(followfriday) = True                1 : 0      =     21.8 : 1.0
       contains(youquot) = True                1 : 0      =     20.6 : 1.0
       contains(missing) = True                0 : 1      =     16.8 : 1.0
       contains(anymore) = True                0 : 1      =     14.2 : 1.0
         contains(hurts) = True                0 : 1      =     14.0 : 1.0
        contains(thanks) = True                1 : 0      =     11.4 : 1.0
          contains(poor) = True                0 : 1      =     11.3 : 1.0
            contains(ff) = True                1 : 0      =     10.9 : 1.0
       contains(quotthe) = True                1 : 0      =     10.5 : 1.0
           contains(ugh) = True                0 : 1      =     10.5 : 1.0
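Several of the strongest features (`quoti`, `youquot`, `quotthe`) look like residue of HTML entities such as `&quot;` whose `&` and `;` were stripped as punctuation. This is an inference from the tokens, not something the notebook states. Under that assumption, decoding entities before stripping punctuation would avoid such tokens. The sketch below uses the standard library's `html.unescape` (Python 3); the helper name is hypothetical.

```python
import html
import re


def strip_entities_then_punctuation(text):
    """Decode HTML entities first, then drop punctuation as the notebook does."""
    decoded = html.unescape(text)  # '&quot;' -> '"', '&lt;' -> '<'
    return re.sub(r"[^\w\s]", "", decoded)


raw = "&quot;I love it&quot; &lt;3"
# Without decoding, the entity names leak into the tokens.
print(re.sub(r"[^\w\s]", "", raw))
# With decoding, only the actual words (and the digit) remain.
print(strip_entities_then_punctuation(raw))
```

The same effect would explain tokens such as `lt3` and `gt` in the shuffled documents printed earlier.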

#### Apply classifier to collection

In [20]:
import pymongo

In [21]:
client = pymongo.MongoClient()
db = client.sn_sp
collection1 = db.net_00
collection2 = db.net_1

cursor = collection1.find({})

items_list = list(cursor)

In [23]:
for tweet_doc in items_list:
    tweet_doc['sentiment'] = NBClassifier_70h.classify(document_features(getFeatureVector(processTweet(tweet_doc['text']), stwl)))
    collection2.insert_one(tweet_doc)
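The annotation step can be tried without a running MongoDB instance. This is a minimal sketch with plain dictionaries: `annotate` and `toy_classify` are hypothetical names, and `toy_classify` is a trivial stand-in for the trained classifier pipeline.

```python
def annotate(docs, classify):
    """Yield shallow copies of docs with an added 'sentiment' field."""
    for doc in docs:
        annotated = dict(doc)  # copy, so the source documents stay unchanged
        annotated["sentiment"] = classify(doc["text"])
        yield annotated


def toy_classify(text):
    # Stand-in for the trained classifier (assumption): 0 = negative, 1 = positive.
    return 0 if "sad" in text.lower() else 1


docs = [{"_id": 1, "text": "So sad today"}, {"_id": 2, "text": "Thanks a lot"}]
result = list(annotate(docs, toy_classify))
print(result)
```

Because each copied document keeps its original `_id`, inserting the same documents into the target collection a second time would raise a duplicate key error in MongoDB; clearing the target collection first or dropping `_id` before inserting are two ways around that.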