# Twitter Sentiment Analysis

More than 1,000 tweets containing the word "Obama" were analyzed to perform a basic sentiment analysis.

In [1]:
# print all the outputs in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Positive, Negative and Stop words

The code below will create lists for positive words (pos_words), negative words (neg_words) and stop words (stop_words):

In [22]:
# positive.txt contains a list of 2,006 positive words, one per line
with open("positive.txt") as pos:
    pos_words=[line.strip() for line in pos] # remove the line separator and surrounding whitespace

# negative.txt contains a list of 4,783 negative words, one per line
with open("negative.txt") as neg:
    neg_words=[line.strip() for line in neg]

# stopwords.txt contains a list of 319 stop words, one per line
with open("stopwords.txt") as stop:
    stop_words=[line.strip() for line in stop]
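
The three word lists above are read the same way, so the reading step can be folded into one helper. The sketch below is one possible way to do that, assuming each file holds one word per line. The names `read_words` and `load_word_list` are hypothetical, and returning a `set` instead of a list is a substitution made here because membership tests such as `word in pos_words` are faster on sets.

```python
def read_words(lines):
    """Return the non-empty, whitespace-stripped entries of `lines` as a set."""
    # strip() removes the trailing newline; the `if` skips blank lines.
    return {line.strip() for line in lines if line.strip()}


def load_word_list(path):
    """Read one word per line from the file at `path` (hypothetical helper)."""
    # `with` closes the file even if reading fails part-way through.
    with open(path) as handle:
        return read_words(handle)


# In-memory demonstration, so the sketch runs without the data files.
sample = read_words(["good\n", "\n", "great \n", "good\n"])
```

With the data files in place, the call would look like `pos_words = load_word_list("positive.txt")`.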

## Cleaning the text

In this part of the code, we will open the file that contains the tweets and clean it to obtain a list of words:

In [46]:
import re
f=open('obama.txt')
raw_words=[] #Initialize the list to store each word
for line in f:
    line_cl=line.lower() #lower case all words in the line
    line_cl=line_cl.strip('\n') #remove line separator
    line_cl=re.sub(r'https?://\S+',' ',line_cl) #remove hyperlinks first, while each URL is still intact
    line_cl=line_cl.replace('-',' ') #replace - with a space for "obama-era" type of words
    line_cl=line_cl.replace('’','') #remove ’
    line_cl=re.sub(r'[^\x00-\x7f]',r' ', line_cl) #remove non-ASCII characters
    line_cl=re.sub(r'([@#][A-Za-z0-9]+)|([^0-9A-Za-z \t+])|(\w+:\/\/\S+)',' ',line_cl) #remove words starting with @ and # and special characters like ? or !
    line_cl=re.sub(r'[0-9]',' ',line_cl) #remove numbers

    line_cl=line_cl.split() #split each line in words

    raw_words+=line_cl # we add each word of each line in the list raw_words
f.close() # f.name is still available after the file is closed

words=[word for word in raw_words if word not in stop_words] # We create the final word list without stop words

# We include this code to remove the word "trump" if we are analyzing Obama's or Trump's tweets
if f.name=='obama.txt' or f.name=='trump.txt':
    trump=['trump','trumps']
    words=[word for word in words if word not in trump]
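
To see what the cleaning steps do to a single tweet, the sketch below applies a simplified version of them to one made-up example. The function name `clean_tweet` and the sample tweet are hypothetical, and the final regular expression is a simplification that keeps only the letters a to z and spaces, so it is not an exact copy of the expressions used in the cell above.

```python
import re


def clean_tweet(line):
    """Return the lowercase alphabetic tokens of one tweet (simplified sketch)."""
    text = line.lower().strip()
    text = re.sub(r'https?://\S+', ' ', text)  # drop links while the URL is still intact
    text = text.replace('-', ' ')              # "obama-era" -> "obama era"
    text = text.replace(u'\u2019', '')         # drop the typographic apostrophe
    text = re.sub(r'[@#]\w+', ' ', text)       # drop @mentions and #hashtags
    text = re.sub(r'[^a-z ]', ' ', text)       # keep only ASCII letters and spaces
    return text.split()


# Made-up tweet used only for illustration.
tokens = clean_tweet(u"Obama-era rules: it\u2019s 100% TRUE! https://t.co/abc123 @user #news")
print(tokens)
```

The link, the mention, the hashtag, the number and the punctuation are all dropped, leaving only plain words for the stop-word filter.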

## Assigning scores

Now, we will loop through the words looking for positive and negative words and storing each word with its value in the dictionary word_score:

In [47]:
word_score={}
for word in words:
    if word in pos_words:
        word_score[word]=word_score.get(word,0)+1 # Assign a value of 1 to each new word in the dictionary and add 1 to the existing ones
    elif word in neg_words:
        word_score[word]=word_score.get(word,0)-1 # Assign a value of -1 to each new word in the dictionary and add -1 to the existing ones
    else:
        word_score[word]=0 # Assign a value of 0 to all other words
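
The same scores can also be computed by counting each word once and then assigning a sign. The sketch below is an alternative formulation, not the code used for the results in this notebook; the function name `score_words` and the tiny word lists are hypothetical. It uses `collections.Counter` and converts the lexicons to sets so that each lookup is fast.

```python
from collections import Counter


def score_words(words, pos_words, neg_words):
    """Map each word to +count, -count or 0, depending on its lexicon."""
    pos_words, neg_words = set(pos_words), set(neg_words)
    scores = {}
    for word, count in Counter(words).items():
        if word in pos_words:
            scores[word] = count       # positive words score +1 per occurrence
        elif word in neg_words:
            scores[word] = -count      # negative words score -1 per occurrence
        else:
            scores[word] = 0           # all other words are neutral
    return scores


# Toy input used only for illustration.
demo = score_words(
    ['great', 'scandal', 'great', 'obama', 'scandal', 'scandal'],
    pos_words=['great'],
    neg_words=['scandal'],
)
print(demo)
```

As in the loop above, a word that appears in both lexicons is treated as positive, because the positive list is checked first.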

## Printing the results

In [48]:
# Printing the results
name=f.name.rsplit('.',1)[0] # Drop the file extension; rstrip('.txt') strips trailing '.', 't' and 'x' characters rather than the suffix
print('The overall score for tweets containing the word "' + name + '" is: ' + str(sum(word_score.values())))

num_of_pos_words=sum(v for v in word_score.values() if v > 0)
num_of_neg_words=sum(-v for v in word_score.values() if v < 0)
ratio_pos_neg=float(num_of_pos_words)/num_of_neg_words # Raises ZeroDivisionError if no negative words were found

print('The score for positive words is: '+str(num_of_pos_words))
print('The score for negative words is: '+str(num_of_neg_words))
print('The ratio of positive words over negative words is: ' +str(ratio_pos_neg))

max_value=max(word_score.values()) # Find the max value to find the word with the most positive score
max_keys=[k for k, v in word_score.items() if v==max_value] # Find the word associated with the highest positive score

print('The most frequent positive word was "' + str(max_keys[0]) +'", appearing ' + str(max_value) +' times.')

min_value=min(word_score.values()) # Find the min value to find the word with the most negative score
min_keys=[k for k, v in word_score.items() if v==min_value] # Find the word associated with the most negative score

print('The most frequent negative word was "' + str(min_keys[0])+'", appearing ' + str(-min_value) + ' times.')


print('\nThe top positive words are: ')
sorted(word_score,key=word_score.get,reverse=True)[:10] # List the top ten positive words

print('The top negative words are: ')
sorted(word_score,key=word_score.get,reverse=False)[:10] # List the top ten negative words

The overall score for tweets containing the word "obama" is: -602
The score for positive words is: 658
The score for negative words is: 1260
The ratio of positive words over negative words is: 0.522222222222
The most frequent positive word was "right", appearing 99 times.
The most frequent negative word was "scandal", appearing 85 times.

The top positive words are: 


['right',
 'modern',
 'like',
 'positive',
 'successful',
 'worked',
 'promise',
 'greatest',
 'popular',
 'great']

The top negative words are: 


['scandal',
 'fake',
 'slowly',
 'breaking',
 'lie',
 'bogus',
 'pathetic',
 'problem',
 'corruption',
 'mess']

## Conclusions

A little more than 1,000 tweets containing the word "obama" were analyzed using the code. After removing the word "trump" from the list and assigning a score to each positive and negative word, the following results were obtained:

- The overall score for tweets containing the word "obama" was -602
- The score for positive words was 658
- The score for negative words was 1260
- The ratio of positive words over negative words was 0.52
- The most frequent positive word was "right", appearing 99 times
- The most frequent negative word was "scandal", appearing 85 times
- The top positive words were: ['right','modern','like','positive','successful','worked','promise','greatest','popular','great']
- The top negative words were: ['scandal','fake','slowly','breaking','lie','bogus','pathetic','problem','corruption','mess']

Based on these results, we could say that this sample of tweets containing the word "obama" shows a negative trend. For each positive word we find in these tweets, there are almost two negative words (1260/658 ≈ 1.9).
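
The figures quoted above are related by simple arithmetic, which the short check below reproduces from the two reported scores. The variable names are chosen here for illustration only.

```python
pos_score = 658    # reported score for positive words
neg_score = 1260   # reported score for negative words

overall = pos_score - neg_score               # overall score
ratio_pos_neg = pos_score / float(neg_score)  # positive words per negative word
neg_per_pos = neg_score / float(pos_score)    # negative words per positive word

print(overall, round(ratio_pos_neg, 2), round(neg_per_pos, 2))
```

The last value, about 1.91, is where "almost two negative words" per positive word comes from.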

<b>Note:</b>

Although these results can be interpreted as a negative trend, we have to be careful with the way we assign scores to each word. With this methodology, "not right" and "right" get the same positive value of 1, which can lead to some inaccuracies when assessing the trend of positive and negative comments.
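
The limitation described in this note can be illustrated with a small sketch that flips the sign of a sentiment word when the previous token is a negator. This is not part of the analysis above: the function `score_with_negation` and the short `NEGATORS` set are hypothetical, and the approach assumes the tokens are still in tweet order and that negators have not already been removed, for example by the stop-word filter.

```python
NEGATORS = {'not', 'no', 'never'}  # hypothetical, deliberately tiny list


def score_with_negation(tokens, pos_words, neg_words):
    """Sum +1/-1 per sentiment word, flipping the sign right after a negator."""
    total = 0
    for i, token in enumerate(tokens):
        if token in pos_words:
            value = 1
        elif token in neg_words:
            value = -1
        else:
            continue  # neutral words do not change the score
        if i > 0 and tokens[i - 1] in NEGATORS:
            value = -value  # "not right" counts as negative
        total += value
    return total


plain = score_with_negation(['right'], {'right'}, {'scandal'})
negated = score_with_negation(['not', 'right'], {'right'}, {'scandal'})
print(plain, negated)
```

A one-token look-back like this only catches the simplest cases; phrases such as "not at all right" would still be scored as positive.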