## Twitter Sentiment Analysis
#### References:
* Training the NaiveBayesClassifier - https://www.digitalocean.com/community/tutorials/how-to-perform-sentiment-analysis-in-python-3-using-the-natural-language-toolkit-nltk#step-1-%E2%80%94-installing-nltk-and-downloading-the-data

#### What I would do to improve analysis:
* Include intent analysis - find the user's intention behind the tweet
* Analyze larger tweet datasets
* Normalize elongated words so that "Hi" and "Hiiiii" are treated as the same token
* Detect sarcasm in tweets
* Try a hosted sentiment analysis API service, e.g. https://www.paralleldots.com/sentiment-analysis
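
The elongated-word item in the list above could be approached with a small regular-expression pass before lemmatization. The sketch below is only an illustration: `squeeze_repeats` is a hypothetical helper, not part of this notebook or NLTK, and capping runs at two characters is an assumption, so "Hiiiii" becomes "Hii" rather than "Hi". Going all the way to "Hi" would likely need a dictionary lookup.

```python
import re

# Hypothetical helper (not part of NLTK): collapse any character repeated
# three or more times down to two, e.g. "soooo" -> "soo".
_REPEATS = re.compile(r"(.)\1{2,}")

def squeeze_repeats(token: str) -> str:
    """Return the token with runs of 3+ identical characters cut to 2."""
    return _REPEATS.sub(r"\1\1", token)

print(squeeze_repeats("Hiiiii"))  # Hii
print(squeeze_repeats("good"))    # good (legitimate double letters survive)
```

Keeping two characters rather than one avoids damaging words such as "good" or "happy".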

### Training the NaiveBayesClassifier algorithm

In [1]:
from nltk.corpus import twitter_samples
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
from nltk.stem.wordnet import WordNetLemmatizer  #lemmatize text (running=run)
from nltk.tag import pos_tag  #assign part-of-speech tags to tokens (noun, verb, adjective)
import re, string  #regular expressions and punctuation constants
from nltk import NaiveBayesClassifier  #classifier algorithm
from nltk import classify  #classify tweets as positive or negative
from nltk import FreqDist  #determine the most frequent words in analysis
import random
from config import api_key, secret_api_key, access_token, secret_access_token

In [2]:
# viewing json files inside twitter_samples data
twitter_samples.fileids()

['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']

In [3]:
# remove undesired text from tokens (URLs, @mentions, punctuation, stop words)
def clean_data(tweet_tokens, stop_words = ()):

    cleaned_tokens = []
    # create the lemmatizer once instead of once per token
    lemmatizer = WordNetLemmatizer()

    for token, tag in pos_tag(tweet_tokens):
        # removing URLs and @mentions from tokens using regular expressions (raw strings avoid invalid-escape warnings)
        token = re.sub(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+","", token)
        token = re.sub(r"(@[A-Za-z0-9_]+)","", token)
        
        # mapping Penn Treebank tags to the WordNet pos codes expected by lemmatize()
        if tag.startswith("NN"):
            pos = "n"
        elif tag.startswith("VB"):
            pos = "v"
        else:
            pos = "a"
            
        # lemmatizing tokens (running=run)
        token = lemmatizer.lemmatize(token, pos)
        
        # dropping empty tokens, punctuation and stop words
        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens
            
# convert each token list to a dictionary with keys=tokens and values=True
def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield {token: True for token in tweet_tokens}
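
To make the feature format concrete, here is a minimal standalone sketch of the same conversion applied to two made-up token lists. `to_features` is a hypothetical stand-in for `get_tweets_for_model` so that the example runs without NLTK, and the sample tokens are invented for illustration.

```python
# Hypothetical stand-in for get_tweets_for_model, runnable without NLTK.
def to_features(cleaned_tokens_list):
    """Yield one {token: True} dictionary per tweet."""
    for tweet_tokens in cleaned_tokens_list:
        yield {token: True for token in tweet_tokens}

# Invented sample data, not taken from twitter_samples.
sample = [["great", "day", ":)"], ["miss", "miss", "bus"]]
features = list(to_features(sample))
print(features[0])  # {'great': True, 'day': True, ':)': True}
print(features[1])  # {'miss': True, 'bus': True}
```

Note that the repeated token in the second tweet collapses into a single key: the classifier sees which tokens are present, not how often they occur.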

#### Tokenizing, cleaning and assembling the dataset

In [4]:
# tokenizing positive/negative tweets
positive_tokens = twitter_samples.tokenized("positive_tweets.json")
negative_tokens = twitter_samples.tokenized("negative_tweets.json")

# cleaning and prepping tokens for analysis
positive_cleaned_tokens_list = []
negative_cleaned_tokens_list = []
for tokens in positive_tokens:
    positive_cleaned_tokens_list.append(clean_data(tokens, stop_words))
for tokens in negative_tokens:
    negative_cleaned_tokens_list.append(clean_data(tokens, stop_words))

In [5]:
# converting lists of tokens to dictionaries with keys=tokens and values=True
positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)

# creating dataset with assigned positive and negative sentiment labels
positive_dataset = [(tweet_dict, "Positive") for tweet_dict in positive_tokens_for_model]
negative_dataset = [(tweet_dict, "Negative") for tweet_dict in negative_tokens_for_model]

# combining and shuffling positive and negative datasets 
dataset = positive_dataset + negative_dataset
random.shuffle(dataset)

# splitting dataset into train/test data
train_data = dataset[:7000]
test_data = dataset[7000:]
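
The fixed index 7000 works here because the two sample files together hold 10,000 tweets, which gives a 70/30 split. If the dataset size changes, a fraction-based split may be easier to maintain. The sketch below is one way to do that; `split_dataset`, the 0.7 default and the seed value are assumptions for illustration, not part of the original notebook.

```python
import random

def split_dataset(dataset, train_fraction=0.7, seed=42):
    """Shuffle a copy of the dataset and split it by fraction."""
    shuffled = list(dataset)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder items stand in for (feature_dict, label) pairs.
train, test = split_dataset(range(10), train_fraction=0.7)
print(len(train), len(test))  # 7 3
```

Seeding the shuffle also makes the reported accuracy reproducible from run to run, which the unseeded `random.shuffle(dataset)` does not guarantee.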

#### Training and testing the NaiveBayesClassifier

In [6]:
# training the NaiveBayesClassifier algorithm
classifier = NaiveBayesClassifier.train(train_data)
# assessing the accuracy of the algorithm
print("Accuracy is:", classify.accuracy(classifier, test_data))
# displaying the most informative words (the method prints the table itself and returns None)
classifier.show_most_informative_features(10)

Accuracy is: 0.9973333333333333
Most Informative Features
                      :) = True           Positi : Negati =    984.0 : 1.0
                     sad = True           Negati : Positi =     56.5 : 1.0
                followed = True           Negati : Positi =     23.1 : 1.0
                follower = True           Positi : Negati =     22.1 : 1.0
                     bam = True           Positi : Negati =     20.3 : 1.0
                  friday = True           Positi : Negati =     16.1 : 1.0
               community = True           Positi : Negati =     15.6 : 1.0
                      aw = True           Negati : Positi =     14.4 : 1.0
              appreciate = True           Positi : Negati =     14.3 : 1.0
                   didnt = True           Negati : Positi =     13.7 : 1.0
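
Each ratio in the table above compares how much more likely a token is under one label than under the other. The sketch below illustrates that idea on invented counts. The add-0.5 smoothing is an assumption chosen to resemble expected likelihood estimation, and `smoothed_prob` and `likelihood_ratio` are hypothetical helpers rather than NLTK functions, so the numbers will not match NLTK's internals exactly.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Estimate P(feature=True | label) with additive smoothing."""
    return (count + gamma) / (total + bins * gamma)

def likelihood_ratio(pos_count, pos_total, neg_count, neg_total):
    """Return P(feature | Positive) / P(feature | Negative)."""
    return smoothed_prob(pos_count, pos_total) / smoothed_prob(neg_count, neg_total)

# Invented counts: a token seen in 40 of 1000 positive tweets
# and in 2 of 1000 negative tweets.
ratio = likelihood_ratio(40, 1000, 2, 1000)
print(f"Positive : Negative = {ratio:.1f} : 1.0")  # Positive : Negative = 16.2 : 1.0
```

The smoothing term keeps the ratio finite when a token never appears under one of the labels.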


#### Testing the NaiveBayesClassifier on custom text

In [7]:
from nltk.tokenize import word_tokenize

custom_tweet = "That chicken alfredo was WOW!!"

# clean the custom text the same way as the training data (including stop word removal)
custom_tokens = clean_data(word_tokenize(custom_tweet), stop_words)
result = classifier.classify({token: True for token in custom_tokens})
result

'Positive'

### Live Twitter Sentiment Analysis
* Returns a 'sentiment score' calculated by the NaiveBayesClassifier for any topic provided by the user
* Up to 100 English-language tweets posted within the last seven days are classified as positive or negative

In [55]:
import tweepy
from nltk.tokenize import word_tokenize

In [56]:
# enable twitter api authorization
auth = tweepy.OAuthHandler(api_key, secret_api_key)
auth.set_access_token(access_token, secret_access_token)
api = tweepy.API(auth)

# ask user for name of sentiment analysis subject (re-prompt once if left blank)
user_input = input("Enter the name of the topic to analyze for sentiment: ")
if user_input == "":
    user_input = input("Enter the name of the topic to analyze for sentiment: ")
    if user_input == "":
        exit()

# request the twitter api (in Tweepy v4 and later, api.search was renamed api.search_tweets)
response = api.search(q=user_input,count=100,tweet_mode="extended",lang="en",result_type="mixed") #result_type=mixed, recent, popular

Enter the name of the topic to analyze for sentiment: cheesecake factory


In [57]:
# tally positive and negative results based on the NaiveBayesClassifier algorithm
positive_result=0
negative_result=0
positive_tweets = []
negative_tweets = []
for tweet in response:
    tweet_text = tweet.full_text
    # clean each tweet the same way as the training data before classifying it
    tweet_tokens = clean_data(word_tokenize(tweet_text), stop_words)
    result = classifier.classify({token: True for token in tweet_tokens})
    if result == "Negative":
        negative_result += 1
        negative_tweets.append(tweet_text)
    elif result == "Positive":
        positive_result += 1
        positive_tweets.append(tweet_text)
print(f"Positive results: {positive_result}")
print(f"Negative results: {negative_result}")

Positive results: 74
Negative results: 26


In [84]:
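# calculate sentiment score: share of positive tweets scaled to a 0-5 range
total_results = positive_result + negative_result  # can be under 100 if the search returns fewer tweets
sentiment_score = round((positive_result/total_results) * 5,2)
print(f"""'{user_input.title()}' scores a {sentiment_score} out of 5.
""")
print(f"""Positive Tweet Example: '{random.choice(positive_tweets)}'
""")
print(f"Negative Tweet Example: '{random.choice(negative_tweets)}'")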

'Cheesecake Factory' scores a 3.7 out of 5.

Positive Tweet Example: '@PCRCEPTlON i’m goin to the cheesecake factory for my birthday in 13 days ♡'

Negative Tweet Example: 'I refer to The Cheesecake Factory as “Cheesecake” like that’s all ain’t nobody got time to be saying all them words 😂'
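
The score is simply the share of tweets classified as positive, scaled to a 0 to 5 range. A minimal sketch of that arithmetic as a reusable function follows; `sentiment_score` as a function and its `scale` parameter are hypothetical names introduced here, and returning `None` for an empty result set is an assumption about how that case should be handled. The counts 74 and 26 are taken from the run above.

```python
def sentiment_score(positive, negative, scale=5):
    """Return the share of positive results scaled to 0..scale, or None if empty."""
    total = positive + negative
    if total == 0:
        return None  # nothing was classified, so no score is defined
    return round(positive / total * scale, 2)

print(sentiment_score(74, 26))  # 3.7
print(sentiment_score(0, 0))    # None
```

Dividing by the number of tweets actually classified keeps the score meaningful when a search returns fewer than 100 results.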
