# 9/17 : Twitter Sentiment Analysis

Hi everyone! In this notebook, we'll take a look at **sentiment analysis**. Specifically, we'll see how we can predict the emotion of a tweet!

In this notebook, you'll fill out the following functions:
1. `add_tweet()`
2. `extract_features()`
3. `get_accuracy()`
4. `predict_tweet()`

We'll start with the imports we need.

You may notice that some of these imports are different from the ones we usually use. `numpy` is, of course, the library we are most familiar with. `pandas` is a library for working with tabular data (similar in spirit to SQL), while `nltk` is a library specifically for NLP. The other libraries are less important; they just help with small tasks such as regular expressions, punctuation lists, and file paths.

In [None]:
# libraries
import re
import string
import nltk
import pandas as pd
import numpy as np
from os import getcwd
# features from nltk
from nltk.corpus import twitter_samples
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

Next, we'll download the data we need and set our work directory (so we don't have to download these files every time we open the notebook)

In [None]:
nltk.download('twitter_samples')
nltk.download('stopwords')
filePath = f"{getcwd()}/../tmp2/"
nltk.data.path.append(filePath)

Let's format our data properly:

In [None]:
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

# separate our data into training and testing sets (remember, around 80/20)
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]

# combine the positive and negative tweets, and build matching labels (1 = positive, 0 = negative)
train_x = train_pos + train_neg 
test_x = test_pos + test_neg
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)
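
To see what the label arrays look like, here's a small sketch of the same `np.append` pattern on toy data. The two tiny lists are made up for illustration; the real lists come from `twitter_samples`.

```python
import numpy as np

# made-up stand-ins for the real tweet lists
toy_pos = ["great day", "love this"]
toy_neg = ["awful day"]

# same pattern as above: a column of ones stacked on top of a column of zeros
toy_x = toy_pos + toy_neg
toy_y = np.append(np.ones((len(toy_pos), 1)), np.zeros((len(toy_neg), 1)), axis=0)

print(toy_y.shape)    # one row per tweet, one column
print(toy_y.ravel())  # labels follow the order of toy_x: positives first, then negatives
```

Because the positives are listed first in both `toy_x` and `toy_y`, each label stays lined up with its tweet.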

## Now that our data is formatted, we can follow our steps for sentiment analysis:

1. `Process` our tweets
2. `Build` our dictionary
3. `Train` our model
4. `Test` our model

<hr></hr>

## Step 1: `Processing` our tweets

We have this method filled out for you, since almost all of it uses `nltk` functions and it can get real confusing real fast

Basically, it performs all the steps we discussed in the slides: removing hyperlinks, stock tickers, and `twitter symbols` (like @handles and the # sign), `tokenizing` the tweet, removing `stopwords` and `punctuation`, and `stemming` each remaining word

Feel free to look at it more closely if you want to see specifically how it works!

In [None]:
def process_tweet(tweet):
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        # remove stopwords and punctuation
        if (word not in stopwords_english and word not in string.punctuation):
            # stem the word and add it to our list
            stem_word = stemmer.stem(word)
            tweets_clean.append(stem_word)
    
    # return our tweets
    return tweets_clean
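
If you'd like to see the regex steps at the top of `process_tweet()` on their own, here's a minimal sketch that applies the same four patterns to a made-up tweet. It leaves out the `nltk` parts (tokenizing, stopwords, stemming) on purpose, so it runs without any downloads.

```python
import re

# made-up example tweet
raw = "RT @user: Loving $GE today!!! #happy https://t.co/abc123"

step1 = re.sub(r'\$\w*', '', raw)                    # drop the ticker $GE
step2 = re.sub(r'^RT[\s]+', '', step1)               # drop the leading "RT "
step3 = re.sub(r'https?:\/\/.*[\r\n]*', '', step2)   # drop the hyperlink
step4 = re.sub(r'#', '', step3)                      # drop only the # sign, keep the word

print(step4.split())
```

Notice that the handle and the exclamation marks are still there after these steps. In `process_tweet()`, those are dealt with later by the `TweetTokenizer` (with `strip_handles=True`) and the punctuation check.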

Run the cell below to see what happens to the given tweet after it's processed

Try it out with your own tweet!

*Fun fact: the Kanye tweet below might be my favorite tweet of all time*

In [None]:
from IPython.display import Image
Image(filename = "./kanye_tweet.jpg", width=400, height=400)

In [None]:
# try your own tweet here!
sample_processed_tweet = "I'm nice at ping pong"
print(process_tweet(sample_processed_tweet))

## Step 2: `Building` our dictionary

Complete the method `add_tweet()` below, which takes a tweet and adds it to the dictionary appropriately

Here are the general steps for the function:
1. Turn the tweet into a list of tokens, called `tokenized_tweet`
2. Set `freqs` to the value of the dictionary if `token` is in the dictionary, or [0, 0] if not in the dictionary
3. Increment the appropriate count given the sentiment (which is 0 or 1)
4. Update the dictionary

You will need to complete steps 1-3 of the algorithm
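
The counting pattern in the steps above can be sketched on a toy example. The token lists here are made up and already tokenized, and `counts` is just an illustrative name, so this is not the full `add_tweet()` function.

```python
# made-up, already-tokenized tweets paired with a sentiment (0 = negative, 1 = positive)
examples = [(["happi", "day"], 1), (["sad", "day"], 0), (["happi"], 1)]

counts = {}
for tokens, sentiment in examples:
    for token in tokens:
        # look up the [negative, positive] counts, starting from [0, 0] for new tokens
        freqs = counts.get(token, [0, 0])
        # bump the count at the index that matches the sentiment
        freqs[sentiment] += 1
        counts[token] = freqs

print(counts)
```

Each token ends up with a two-element list: index 0 counts appearances in negative tweets and index 1 counts appearances in positive tweets.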

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints for Step 1</b></font>
</summary>
<p>
<ul>
    <li><code>tokenized_tweet</code> can be calculated by using our <code>process_tweet()</code> function</li>
    <li>Look at the arguments for <code>process_tweet()</code> and make sure you're passing in the right parameters!</li>
</ul>
</p>
</details>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints for Step 2</b></font>
</summary>
<p>
<ul>
    <li>If you're feeling <b>extra</b> saucy, you can do the lookup and the fallback in one go using the dictionary's <code>get()</code> method with its <code>default</code> parameter</li>
    <li>That looks like <code>dictionary.get(token, [0, 0])</code></li>
</ul>
</p>
</details>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints for Step 3</b></font>
</summary>
<p>
<ul>
    <li>You can use the value of <code>sentiment</code> to appropriately increment <code>freqs</code></li>
    <li>This can be done in one line: <code>freqs[sentiment] += 1</code></li>
</ul>
</p>
</details>

In [None]:
def add_tweet(dictionary, tweet, sentiment):
    # Step 1: creates the tokens from the tweet
    tokenized_tweet = ...
    # iterates through each token
    for token in tokenized_tweet:
        # Step 2: if the token is in the dictionary, freqs is that value, [0, 0] otherwise
        freqs = ...
        # Step 3: increases the appropriate frequency
        ...
        # Step 4: updates the dictionary
        dictionary.update({token : freqs})

Since we have a method to add to the dictionary for an individual tweet, we can loop through every tweet and call `add_tweet()` to build our dictionary

In [None]:
def build_dict(tweets, sentiments):
    # initialize the dictionary
    tweet_dict = {}
    # iterate through each tweet and its label, and add the tweet to the dictionary
    for tweet, sentiment in zip(tweets, sentiments):
        add_tweet(tweet_dict, tweet, int(sentiment[0]))
    # return the dictionary
    return tweet_dict

We'll store our dictionary of words as `tweet_dict`

In [None]:
tweet_dict = build_dict(train_x, train_y)

## Step 3: `Training` our model

Before we run gradient descent, let's define the functions that we used last week, namely:
1. `sigmoid()` - our function for mapping values between 0 and 1
2. `cost()` - the cost of the logistic regression
3. `cost_derivative()` - the derivative of the cost function

If you want to learn more about how these functions work/why we need them, you can look at the materials from last week
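
For reference, here is how those three functions are usually written for logistic regression. The notation is shorthand chosen to match the code cells in this notebook: $X$ is the feature matrix, $\theta$ is the vector of thetas, $\hat{y}$ is the vector of predictions, and $m$ is the number of training examples.

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

$$\hat{y} = \sigma(X\theta), \qquad J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$$

$$\nabla_\theta J = \frac{1}{m} X^{T}(\hat{y} - y)$$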

In [None]:
def sigmoid(x):
    sigmoid_val = 1 / (1 + np.exp(-x))
    return sigmoid_val

In [None]:
def cost(y_pred, y_actual, m):
    cost = (-1 / m) * np.sum(y_actual.T @ np.log(y_pred) + (1 - y_actual).T @ np.log(1 - y_pred))
    return cost

In [None]:
def cost_derivative(predicted, actual, inputs, m):
    derivative = (1 / m) * (inputs.T @ (predicted - actual))
    return derivative
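
A quick sanity check can make these functions feel less abstract. The sketch below repeats the three definitions so it runs on its own, then evaluates them on a tiny made-up dataset with all thetas set to zero. In that case every prediction is 0.5, so the cost should come out to $\ln 2$.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def cost(y_pred, y_actual, m):
    return (-1 / m) * np.sum(y_actual.T @ np.log(y_pred) + (1 - y_actual).T @ np.log(1 - y_pred))

def cost_derivative(predicted, actual, inputs, m):
    return (1 / m) * (inputs.T @ (predicted - actual))

# made-up data: 4 examples, 3 features each (bias, n_neg, n_pos)
toy_x = np.array([[1.0, 3.0, 0.0],
                  [1.0, 4.0, 1.0],
                  [1.0, 0.0, 3.0],
                  [1.0, 1.0, 4.0]])
toy_y = np.array([[0.0], [0.0], [1.0], [1.0]])
m = toy_x.shape[0]

zero_thetas = np.zeros((3, 1))
pred = sigmoid(toy_x @ zero_thetas)             # every prediction is 0.5
start_cost = cost(pred, toy_y, m)               # should match ln(2)
grad = cost_derivative(pred, toy_y, toy_x, m)   # one entry per theta

print(start_cost, np.log(2))
print(grad.ravel())
```

The gradient is positive for the negative-count feature and negative for the positive-count feature, so a gradient descent step pushes the thetas in the direction we'd hope for.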

Before we can train our model, we need to create a function to properly represent our words

Recall that each tweet can be represented as:

<img src="https://latex.codecogs.com/gif.latex?\dpi{300}&space;\begin{Bmatrix}&space;1&space;&&space;n_{neg}&space;&&space;n_{pos}&space;\end{Bmatrix}" title="\begin{Bmatrix} 1 & n_{neg} & n_{pos} \end{Bmatrix}" />


where $n_{neg}$ is the total number of times the tweet's words appeared in `negative` training tweets, 

and $n_{pos}$ is the total number of times the tweet's words appeared in `positive` training tweets
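
Here's a small worked example of that representation. The dictionary and the token list are made up, and the tokens are assumed to be already processed, so treat it as a sketch of the idea.

```python
# made-up dictionary: token -> [negative count, positive count]
toy_dict = {"happi": [1, 8], "sad": [9, 0], "day": [4, 5]}

# made-up, already-processed tweet; "unseen" is not in the dictionary
tokens = ["happi", "day", "unseen"]

features = [1, 0, 0]  # [bias, n_neg, n_pos]
for token in tokens:
    neg, pos = toy_dict.get(token, [0, 0])  # unknown tokens contribute nothing
    features[1] += neg
    features[2] += pos

print(features)
```

Here $n_{neg} = 1 + 4 = 5$ and $n_{pos} = 8 + 5 = 13$, so the tweet leans positive.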

 <hr></hr>

Complete the function `extract_features()`, which transforms the tweet according to the instructions in the above cell

Here are the steps for the function:

1. Create our list of tokens, `processed_tweet`
2. Retrieve `freqs` from the dictionary
3. Increment each element of `tweet_val` by the correct value

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints for Step 1</b></font>
</summary>
<p>
<ul>
    <li><code>processed_tweet</code> can be calculated by using our <code>process_tweet()</code> function</li>
    <li>Look at the arguments for <code>process_tweet()</code> and make sure you're passing in the right parameters!</li>
</ul>
</p>
</details>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints for Step 2</b></font>
</summary>
<p>
<ul>
    <li>If you're feeling <b>extra</b> saucy, you can do the lookup and the fallback in one go using the dictionary's <code>get()</code> method with its <code>default</code> parameter</li>
    <li>That looks like <code>dictionary.get(word, [0, 0])</code></li>
</ul>
</p>
</details>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints for Step 3</b></font>
</summary>
<p>
<ul>
    <li><code>tweet_val[1]</code> represents the total negative count, while <code>tweet_val[2]</code> represents the total positive count </li>
    <li>The negative count should be incremented by <code>freqs[0]</code>, while the positive count should be incremented by <code>freqs[1]</code></li>
</ul>
</p>
</details>

In [None]:
def extract_features(dictionary, tweet):
    # Step 1: Create the list of tokens
    processed_tweet = ...
    tweet_val = [1, 0, 0]
    for word in processed_tweet:
        # Step 2: Get the freqs list from the dictionary
        freqs = ...
        # Step 3: Increment each element of tweet_val appropriately
        tweet_val[1] += ...
        tweet_val[2] += ...
    return tweet_val

Now, we can use `extract_features()` to build a training set!

In [None]:
def build_set(dictionary, tweets):
    tweet_set = []
    for tweet in tweets:
        tweet_val = extract_features(dictionary, tweet)
        tweet_set.append(tweet_val)
    return 1.0 * np.array(tweet_set)
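
In case the `1.0 *` in the return line looks odd: it converts the integer counts into floats, which is what we want for gradient descent. A tiny sketch with made-up rows:

```python
import numpy as np

toy_rows = [[1, 5, 13], [1, 9, 0]]    # made-up feature rows
as_ints = np.array(toy_rows)          # integer array
as_floats = 1.0 * np.array(toy_rows)  # multiplying by a float promotes to float64

print(as_ints.dtype, as_floats.dtype, as_floats.shape)
```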

Let's use these functions to create a training set for our logistic regression model

In [None]:
# sets our training inputs and labels for logistic regression
training_x = build_set(tweet_dict, train_x)
training_y = train_y

Our data is finally correctly formatted, so we can begin `logistic regression`!

We'll start by defining our constants for `logistic regression`

In [None]:
# constants for gradient descent (mess around with them if you dare)
learning_rate = 0.000001
num_iterations = 15000
m = training_x.shape[0]
# initialize our thetas
thetas = np.zeros((3, 1))

Since you filled out the gradient descent function last week (hopefully!), we'll fill out the method for you this time

To understand more about how this algorithm works, feel free to look at the materials from last week!

In [None]:
def grad_descent(x, actual_y, thetas, learning_rate, m, num_iterations):
    # perform the algorithm for the specified number of iterations
    for iteration in range(num_iterations):
        # calculate our sigmoided predicted output
        pred_output = sigmoid(x @ thetas)
        # get the derivative of this value
        gradients = cost_derivative(pred_output, actual_y, x, m)
        # adjust our thetas
        thetas = thetas - learning_rate * gradients
    return thetas
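
To watch gradient descent work without waiting on the full dataset, here's a self-contained sketch on four made-up examples. The learning rate and iteration count are picked for this toy data only and are not the notebook's settings.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def cost(y_pred, y_actual, m):
    return (-1 / m) * np.sum(y_actual.T @ np.log(y_pred) + (1 - y_actual).T @ np.log(1 - y_pred))

def cost_derivative(predicted, actual, inputs, m):
    return (1 / m) * (inputs.T @ (predicted - actual))

def grad_descent(x, actual_y, thetas, learning_rate, m, num_iterations):
    for _ in range(num_iterations):
        pred_output = sigmoid(x @ thetas)
        gradients = cost_derivative(pred_output, actual_y, x, m)
        thetas = thetas - learning_rate * gradients
    return thetas

# made-up data: columns are (bias, n_neg, n_pos)
toy_x = np.array([[1.0, 3.0, 0.0],
                  [1.0, 4.0, 1.0],
                  [1.0, 0.0, 3.0],
                  [1.0, 1.0, 4.0]])
toy_y = np.array([[0.0], [0.0], [1.0], [1.0]])
m = toy_x.shape[0]

start_thetas = np.zeros((3, 1))
trained_thetas = grad_descent(toy_x, toy_y, start_thetas, 0.1, m, 200)

cost_before = cost(sigmoid(toy_x @ start_thetas), toy_y, m)
cost_after = cost(sigmoid(toy_x @ trained_thetas), toy_y, m)
toy_predictions = sigmoid(toy_x @ trained_thetas) > 0.5

print(cost_before, cost_after)
print(toy_predictions.ravel())
```

One thing worth noticing: `grad_descent()` returns new thetas and leaves the array you passed in untouched, so `start_thetas` is still all zeros afterwards. Any cost or prediction for the trained model has to use the returned value.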

Everything is ready, so let's train our model!

In [None]:
tweet_thetas = grad_descent(training_x, training_y, thetas, learning_rate, m, num_iterations)
# evaluate the cost with the trained thetas (grad_descent does not change `thetas` in place)
print("Cost: {0}".format(cost(sigmoid(training_x @ tweet_thetas), training_y, m)))

## Step 4: `Testing` our model

Before we put in our own custom tweets, let's test our model using the testing data

Complete the function `get_accuracy()`, which returns the accuracy of our trained model

Accuracy can be defined as:

<img src="https://latex.codecogs.com/gif.latex?\dpi{200}&space;accuracy&space;=&space;\frac{n_{correct}}{n_{total}}" title="accuracy = \frac{n_{correct}}{n_{total}}" />
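
As a quick illustration of that formula, here's a sketch with made-up probabilities and labels. The variable names are only for this example.

```python
import numpy as np

# made-up model outputs (after the sigmoid) and the matching true labels
toy_probs = np.array([[0.9], [0.2], [0.6], [0.4]])
toy_labels = np.array([[1.0], [0.0], [0.0], [0.0]])

toy_predicted = toy_probs > 0.5                       # True means "positive"
toy_correct = np.sum(toy_predicted == toy_labels)     # True == 1.0 and False == 0.0 both count as matches
toy_accuracy = toy_correct / toy_labels.shape[0]

print(toy_correct, toy_accuracy)
```

Three of the four predictions match their labels (the third one is a false positive), so the accuracy is 0.75.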

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li><code>predicted_values</code> should be 0 when our sigmoided matrix multiplication is less than 0.5, and 1 when it's greater than 0.5</li>
    <li>Set <code>predicted_values</code> equal to <code>sigmoid(test_set @ thetas) > 0.5</code></li>
    <li>The number of correct values is the <code>sum</code> of element-wise equivalent values of <code>predicted_values</code> and <code>y</code></li>
    <li>accuracy can be found using <code>num_correct</code> and <code>num_total</code></li>
</ul>
</p>
</details>

In [None]:
def get_accuracy(dictionary, x, y, thetas, num_total):
    # create the test set
    test_set = build_set(dictionary, x)
    # get the predicted values
    predicted_values = ... 
    # check how many of these values are correct
    num_correct = ...
    # calculate and return the accuracy
    accuracy = ...
    return accuracy

Run the cell below to test the model's accuracy

In [None]:
accuracy = get_accuracy(tweet_dict, test_x, test_y, tweet_thetas, test_y.shape[0])
print("Model accuracy: {0}".format(accuracy))

If you see a model accuracy over 0.95 (that is, 95%), that's pretty good!

## Step 4b: `Testing` with `custom tweets`

Now, let's try our model on tweets we write ourselves!

Complete the function `predict_tweet()` below so we can predict our own custom tweets!
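
To get a feel for the last step, here's a sketch that turns one feature vector into a label. The feature values and thetas are made up (they don't come from a trained model), and the `"positive"` / `"negative"` strings are just one choice of return value.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

toy_features = np.array([1.0, 5.0, 13.0])         # made-up [bias, n_neg, n_pos]
toy_thetas = np.array([[0.0], [-0.1], [0.1]])     # made-up weights

# 1*0.0 + 5*(-0.1) + 13*0.1 = 0.8, and sigmoid(0.8) is above 0.5
toy_prediction = float(sigmoid(toy_features @ toy_thetas)[0])
toy_label = "positive" if toy_prediction > 0.5 else "negative"

print(toy_prediction, toy_label)
```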

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li><code>features</code> can be found by using our <code>extract_features</code> method</li>
    <li><code>prediction</code> is equal to <code>sigmoid(features @ thetas)</code></li>
    <li><code>return</code> whatever you want! Just remember that a <code>positive sentiment</code> occurs when <code>prediction > 0.5</code> and a <code>negative sentiment</code> otherwise</li>
</ul>
</p>
</details>

In [None]:
def predict_tweet(dictionary, tweet, thetas):
    features = ...
    prediction = ...
    return ... if ... else ...

Test your function below with your own inputs!

In [None]:
tweet = "covfefe"
print(predict_tweet(tweet_dict, tweet, tweet_thetas))

You might notice some things about the model. It does not do well with negations (words like *not*) or with more human-like language, such as sarcasm. This is to be expected from a fairly simple model like this one, but it can be improved with more complex algorithms and fine-tuned parameters!
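
One way to see why negation is hard for this model: the features are just sums of per-word counts, so word order never enters the picture. Here's a minimal sketch with a made-up dictionary and a hypothetical `toy_features()` helper (neither is part of the notebook's data).

```python
# made-up [negative, positive] counts for two stems
toy_dict = {"good": [2, 9], "bad": [8, 1]}

def toy_features(tokens):
    # hypothetical helper: sums the counts for each token, ignoring word order
    features = [1, 0, 0]
    for token in tokens:
        neg, pos = toy_dict.get(token, [0, 0])
        features[1] += neg
        features[2] += pos
    return features

features_a = toy_features(["good", "not", "bad"])
features_b = toy_features(["bad", "not", "good"])

print(features_a, features_b, features_a == features_b)
```

"Good, not bad" and "bad, not good" mean opposite things, but they produce identical features, so the model has no way to tell them apart.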