
# CSE 5522 - Lab 2
By Jeremy Grifski

In this lab, we'll be taking a look at sentiment analysis of tweet data.

## Part 1

**1.0** Set up the environment (you can click on the play button below to import the appropriate modules).

In [0]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

**1.1** Read the data from GitHub into a pandas dataframe.



In [0]:
TweetUrl='https://github.com/aasiaeet/cse5522data/raw/master/db3_final_clean.csv'
tweet_dataframe=pd.read_csv(TweetUrl)

**1.2** Print out the top of the dataframe to make sure that the data loaded correctly. It should be a data table with three columns (weight, tweet, label), and 3697 rows.

In [209]:
display(tweet_dataframe.shape)
tweet_dataframe.head()

(3697, 3)

| | weight | tweet | label |
|---|---|---|---|
| 0 | 1.0 | it is very cold out want it to be warmer | -1 |
| 1 | 0.7698 | dammmmmmm its pretty cold this morning burr lol | -1 |
| 2 | 0.6146 | why does halsey have to be so far away think m... | -1 |
| 3 | 0.9356 | dammit stop being so cold so can work out | -1 |
| 4 | 1.0 | its too freakin cold | -1 |


Labels are -1 and +1 for negative and positive sentiments respectively. Multiple judges have been asked to choose a label for a tweet (this is an example of crowd-sourcing) from five possible labels:

- Tweet is not relevant to weather.
- I can't tell the sentiment.
- Neutral: author just sharing information.
- Positive
- Negative


The majority vote was picked as the label and its ratio was set as the weight of the tweet. So for the tweet in row 2 above, 61% of judges voted that the label is negative.
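As a rough illustration of how a label and weight could be derived from the judges' votes, here is a minimal sketch. The dataset only contains the final label and weight, so the raw votes and the helper `majority_label_and_weight` below are hypothetical.

```python
from collections import Counter

def majority_label_and_weight(votes):
    """Return the most common vote and the fraction of judges who chose it."""
    counts = Counter(votes)
    label, top_count = counts.most_common(1)[0]
    return label, top_count / len(votes)

# Hypothetical votes from five judges for a single tweet.
votes = ["Negative", "Negative", "Negative", "Neutral", "Positive"]
label, weight = majority_label_and_weight(votes)
print(label, weight)
```

Here three of five judges agree, so the tweet would get the label "Negative" with a weight of 0.6.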

Note that tweets have been pre-processed (or cleaned). For example, :( and :) were replaced with "sad" and "smiley", numbers with "num", etc. You can go further (as we ask in 1.12) and remove the stop words, i.e., frequent non-informative words such as am, is, and are.
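The exact cleaning script is not part of this notebook, but the kind of pre-processing described above might look something like the sketch below. The function `clean_tweet` and the tiny `STOP_WORDS` set are assumptions made for illustration only.

```python
import re

# Hypothetical stop-word list; the list actually used for the data is not given.
STOP_WORDS = {"am", "is", "are"}

def clean_tweet(text, remove_stop_words=False):
    """Lowercase a tweet, replace emoticons and numbers, optionally drop stop words."""
    text = text.lower()
    text = text.replace(":)", " smiley ").replace(":(", " sad ")
    text = re.sub(r"\d+", " num ", text)  # every run of digits becomes "num"
    words = text.split()
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return " ".join(words)

cleaned = clean_tweet("It is 20 degrees :(")
cleaned_no_stop = clean_tweet("It is 20 degrees :(", remove_stop_words=True)
print(cleaned)
print(cleaned_no_stop)
```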

**1.3** In the next step, we should build our feature matrix by converting the string of words to a vector of numeric values.

First we need to assign a unique id to each word and create the feature matrix with correct size:

In [0]:
# wordDict maps words to id
# X is the document-word matrix holding the presence/absence of words in each tweet
wordDict = {}
idCounter = 0
for i in range(tweet_dataframe.shape[0]):
  allWords = tweet_dataframe.iloc[i,1].split(" ")
  for word in allWords:
    if word not in wordDict:
      wordDict[word] = idCounter
      idCounter += 1
X = np.zeros((tweet_dataframe.shape[0], idCounter),dtype='float')

Checking head of the dictionary:

In [211]:
dict(list(wordDict.items())[0:10])

{'': 9,
 'be': 7,
 'cold': 3,
 'is': 1,
 'it': 0,
 'out': 4,
 'to': 6,
 'very': 2,
 'want': 5,
 'warmer': 8}

**1.4** The simplest way of coding a tweet to numbers is to mark the occurrence of a word, and forget about its frequency in the document (tweet). This works well with tweets as there are not many repetitive words in a single tweet. So let's fill the document-word matrix:

In [0]:
for i in range(tweet_dataframe.shape[0]):
  allWords = tweet_dataframe.iloc[i,1].split(" ")
  for word in allWords:
    X[i, wordDict[word]]  = 1

Now we check if the number of words is correct:

In [213]:
np.sum(X[0:5, ], axis = 1)

array([10.,  9., 17.,  9.,  4.])
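To see the encoding on something small, here is a toy version of steps 1.3 and 1.4. The three tweets are made up. Note one deliberate difference: the notebook calls `split(" ")`, which yields an empty token whenever a tweet has trailing or doubled spaces (this is why `''` shows up in `wordDict`), whereas this sketch uses `split()` with no argument, which drops empty tokens.

```python
import numpy as np

# Made-up tweets; the second one has a repeated word and a trailing space.
tweets = ["its too cold", "cold cold morning ", "warm morning"]

# Assign ids in order of first appearance, as in the loop above.
vocab = {}
for tweet in tweets:
    for word in tweet.split():  # split() without arguments drops empty tokens
        vocab.setdefault(word, len(vocab))

X_toy = np.zeros((len(tweets), len(vocab)))
for row, tweet in enumerate(tweets):
    cols = [vocab[word] for word in tweet.split()]
    X_toy[row, cols] = 1  # presence only, so the repeated "cold" counts once

print(vocab)
print(X_toy.sum(axis=1))
```

The row sums give the number of distinct words per tweet, which is the same check performed on `X` above.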

Finally, we extract the labels from the dataframe:

In [214]:
y = np.array(tweet_dataframe.iloc[:,2])
y[0:5]

array([-1, -1, -1, -1, -1])

Let's compute the total number of positive and negative tweets:

In [215]:
numNeg = np.sum(y<0)
numPos = np.sum(y>=0) #len(y) - numNeg
probNeg = numNeg / (numNeg + numPos)
probPos = 1 - probNeg
display(numNeg, numPos, probNeg, probPos)

1650

2047

0.4463078171490398

0.5536921828509602

So samples 0 through 1649 are negative and samples 1650 through 3696 are positive.

**1.5** Train/Test Split: Now we do the 20/80 split, learn the word probabilities using the 80% part, and test the NB performance on the 20% part.

In [216]:
from sklearn.model_selection import train_test_split
xTrain, xTest, yTrain, yTest = train_test_split(X, y, test_size = 0.2, random_state = 0)
display(xTrain.shape, xTest.shape, yTrain.shape, yTest.shape)
#Note: random_state=0 fixes the random seed so we get the same split every run. Don't use this below

(2957, 5989)

(740, 5989)

(2957,)

(740,)

**1.6** Computing Probabilities by Counting: Now the real work begins. Write the code that, from the train feature matrix `xTrain`, computes the needed word probabilities, i.e., $P(word|label)$ where label is + or - and word is any of the words saved in the `wordDict`:

In [217]:
# compute three distributions (four variables):
def compute_distros(x,y):
  # probWordGivenPositive: P(word|Sentiment = +ive)
  probWordGivenPositive = np.sum(x[y >= 0,:], axis=0) #Sum each word (column) to count how many times each word shows up (in positive examples)
  probWordGivenPositive = probWordGivenPositive / np.sum(y >= 0) #Divide by total number of (positive) examples to give distribution

  # probWordGivenNegative: P(word|Sentiment = -ive)
  probWordGivenNegative = np.sum(x[y < 0,:], axis=0)
  probWordGivenNegative = probWordGivenNegative / np.sum(y < 0)

  # priorPositive: P(Sentiment = +ive)
  priorPositive = np.sum(y >= 0) / y.shape[0] #Number of positive examples vs. all examples
  # priorNegative: P(Sentiment = -ive)
  priorNegative = 1 - priorPositive
  #  (note these last two form one distribution)

  return probWordGivenPositive, probWordGivenNegative, priorPositive, priorNegative

# compute distributions here
probWordGivenPositive, probWordGivenNegative, priorPositive, priorNegative = compute_distros(xTrain,yTrain)

# checking the results
display(probWordGivenPositive[0:5])
display(probWordGivenNegative[0:5])
display(priorPositive, priorNegative)

array([0.1185006 , 0.20737606, 0.01088271, 0.01451028, 0.10217654])

array([0.14504988, 0.19493477, 0.00537222, 0.09669992, 0.13967767])

0.5593506932702063

0.44064930672979374

Note that you only needed to compute $P(word = 1| +)$ and $P(word = 1| -)$; the probability of the word being absent from a tweet is just 1 minus those probabilities.

However, as we see in 1.7, for convenience, we will also want to compute $\log P(word = 1 | +)$, $\log P(word = 0 | +)$, $\log P(word = 1 | -)$ and $\log P(word = 0 | -)$.  Also we should compute the log priors.  Let's do so now.


In [218]:
# compute the following:
# logProbWordPresentGivenPositive
# logProbWordAbsentGivenPositive
# logProbWordPresentGivenNegative
# logProbWordAbsentGivenNegative
# logPriorPositive
# logPriorNegative
def compute_logdistros(distros, min_prob):
  if True:
    #Assume missing words are simply very rare
    #So, assign minimum probability to very small elements (e.g. 0 elements)
    distros = np.where(distros >= min_prob, distros, min_prob)
    #Also need to consider minimum probability for "not" distribution
    distros = np.where(distros <= (1 - min_prob), distros, 1 - min_prob)

    return np.log(distros), np.log(1 - distros)
  else:
    #Ignore missing words (assume they have P==1, i.e. force log 0 to 0)
    return np.log(np.where(distros>0,distros,1)), np.log(np.where(distros<1,1-distros,1))

min_prob = 1 / yTrain.shape[0] # Assume very rare words only appeared once
logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive = compute_logdistros(probWordGivenPositive, min_prob)
logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative = compute_logdistros(probWordGivenNegative, min_prob)
logPriorPositive, logPriorNegative = compute_logdistros(priorPositive,min_prob)

# Did this work, or did you get an error?  (Read below.)
display(logProbWordPresentGivenPositive[0:5])
display(logProbWordAbsentGivenPositive[0:5])
display(logProbWordPresentGivenNegative[0:5])
display(logProbWordAbsentGivenNegative[0:5])
display(logPriorPositive, logPriorNegative)

array([-2.13283722, -1.57322143, -4.52058012, -4.23289805, -2.28105316])

array([-0.12613096, -0.23240639, -0.01094236, -0.01461658, -0.10778182])

array([-1.93067756, -1.63509031, -5.22651443, -2.33614267, -1.96841789])

array([-0.15671216, -0.21683197, -0.0053867 , -0.10170047, -0.15044815])

-0.5809786442688406

-0.819505942727632

You likely received an error when you tried to take $\log(0)$ at some point.  Can your group think of a way to avoid taking $\log(0)$?  Check in with your instructor/TA to see if what you're thinking will work.  Implement that change in your code above.
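The code above avoids $\log(0)$ by clipping probabilities to the range `[min_prob, 1 - min_prob]`. Another common option, which is not what this notebook uses, is add-$\alpha$ (Laplace) smoothing, where a pseudo-count is added to every word before dividing. A minimal sketch, with the helper name `smoothed_word_probs` and the toy data being assumptions:

```python
import numpy as np

def smoothed_word_probs(x, y, positive, alpha=1.0):
    """Estimate P(word = 1 | label) with add-alpha smoothing."""
    mask = (y >= 0) if positive else (y < 0)
    counts = x[mask].sum(axis=0)
    # Each word has two outcomes (present/absent), hence 2 * alpha below.
    return (counts + alpha) / (mask.sum() + 2 * alpha)

# Toy data: four tweets over a three-word vocabulary.
x_toy = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1]], dtype=float)
y_toy = np.array([1, 1, -1, -1])

p_pos = smoothed_word_probs(x_toy, y_toy, positive=True)
print(p_pos)
print(np.log(p_pos), np.log(1 - p_pos))
```

With smoothing, no estimate is exactly 0 or 1, so both logarithms are finite without any clipping step.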

**1.7: Math of NB** Here we provide the derivation of NB when we want to classify the $i$th tweet $\textbf{x}^{(i)}$ and the size of the dictionary is $p$, i.e., each tweet is a binary vector of size $p$ as $\textbf{x}^{(i)} = (x_1^{(i)},\dots, x_p^{(i)})$. 

Note that we computed $P(x_j^{(i)} = 1|+)$ and $P(x_j^{(i)} = 1|-)$ in the code above from `xTrain` and now want to classify `xTest` samples.

**Classification Rule:** For each tweet $i$, the NB classifier assigns label + if $P(+|\textbf{x}^{(i)}) > P(-|\textbf{x}^{(i)})$ and label - otherwise. 

These posterior probabilities can be computed from the prior probabilities and the word probabilities (both estimated from the training split) using Bayes rule as follows: 

\begin{align}
P(+|\textbf{x}^{(i)}) &= \alpha P(x_1^{(i)},\dots,x_p^{(i)} | +)P(+) 
\\
(\text{NB Assumption}) &= \alpha P(+) \prod_{j=1}^p P(x_j^{(i)}|+)
\end{align}

For computational convenience (preventing underflow while dealing with small numbers) we work with the $\log$ of probabilities:

\begin{align} 
\log(P(+|\textbf{x}^{(i)})) &= \log \alpha + \log P(+) + \sum_{j=1}^p \log P(x_j^{(i)}|+) 
\\
\log(P(-|\textbf{x}^{(i)})) &= \log \alpha + \log P(-) + \sum_{j=1}^p \log P(x_j^{(i)}|-) 
\end{align} 

Finally we can compute the confidence of our prediction as the log of the ratio of posteriors: 
$\log(\frac{P(\text{predicted label}|\textbf{x}^{(i)})}{P(\text{the other label}|\textbf{x}^{(i)})})$
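To make the rule concrete, here is a small worked example with a hypothetical three-word vocabulary. All probabilities are invented for illustration. Because the normalizing constant $\alpha$ is the same for both labels, it cancels in the ratio, so the confidence is simply the difference of the two log scores.

```python
import numpy as np

# Hypothetical model for a three-word vocabulary.
p_word_pos = np.array([0.8, 0.1, 0.5])  # P(x_j = 1 | +)
p_word_neg = np.array([0.2, 0.6, 0.5])  # P(x_j = 1 | -)
prior_pos, prior_neg = 0.5, 0.5

x = np.array([1, 0, 1])  # words 0 and 2 present, word 1 absent

def log_joint(x, p_word, prior):
    """log P(label) + sum_j log P(x_j | label) for a binary word vector."""
    return np.log(prior) + np.sum(x * np.log(p_word) + (1 - x) * np.log(1 - p_word))

log_pos = log_joint(x, p_word_pos, prior_pos)
log_neg = log_joint(x, p_word_neg, prior_neg)

label = 1 if log_pos > log_neg else -1
confidence = abs(log_pos - log_neg)  # log of the posterior ratio
print(label, confidence)
```

By hand, the unnormalized posteriors are $0.5 \cdot 0.8 \cdot 0.9 \cdot 0.5 = 0.18$ for + and $0.5 \cdot 0.2 \cdot 0.4 \cdot 0.5 = 0.02$ for -, so the tweet is labeled + with confidence $\log 9$.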


**1.8: Implementing NB** Now write a function that takes a row of `xTest` and outputs a label for it based on the NB classification rule. 


In [219]:
# classifyNB: 
#   words - vector of words of the tweet (binary vector)
#   logProbWordPresentGivenPositive - log P(x_j = 1|+)
#   logProbWordAbsentGivenPositive  - log P(x_j = 0|+)
#   logProbWordPresentGivenNegative - log P(x_j = 1|-)
#   logProbWordAbsentGivenNegative  - log P(x_j = 0|-)
#   logPriorPositive - log P(+)
#   logPriorNegative - log P(-)
#   returns (label of x according to the NB classification rule, confidence about the label)

# Note: you can also change the function definition if you wish to encapsulate all six log probs
# as one model; just make sure to follow through below

def classifyNB(words, logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive, 
               logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative, 
               logPriorPositive, logPriorNegative):
  
  logProbPositiveGivenTweet = log_prob_sign_given_tweet(words, logPriorPositive, logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive)
  logProbNegativeGivenTweet = log_prob_sign_given_tweet(words, logPriorNegative, logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative)
  # Compare in log space: exponentiating first can underflow to 0 and break the ratio
  label = 1 if logProbPositiveGivenTweet > logProbNegativeGivenTweet else -1
  confidence = abs(logProbPositiveGivenTweet - logProbNegativeGivenTweet)
  return (label, confidence)

def log_prob_sign_given_tweet(words, logPriorSign, logProbWordPresentGivenSign, logProbWordAbsentGivenSign):
  """
  A helper method which computes the log probability expression from 1.7.
  """
  words_copy = words.copy()
  for i, word in enumerate(words_copy):
    if word == 0: # absent
      words_copy[i] = logProbWordAbsentGivenSign[i]
    else: # present
      words_copy[i] = logProbWordPresentGivenSign[i]
  return sum(words_copy) + logPriorSign

def reverse_word_lookup(index):
  """
  A helper method which gets the word from the word dict associated with the index
  """
  return next(key for key, value in wordDict.items() if value == index)
  
# Grabs one arbitrary test tweet (row 700) and classifies it
print(classifyNB(xTest[700, ], logProbWordPresentGivenPositive,logProbWordAbsentGivenPositive,
                               logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative,
                               logPriorPositive, logPriorNegative))

(1, 4.37706070095421)
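`classifyNB` scores one tweet at a time with a Python loop over every word. As an alternative sketch (not the notebook's implementation), the same scores can be computed for a whole matrix at once with two matrix-vector products per label. The function name `classify_all` and the toy model are assumptions.

```python
import numpy as np

def classify_all(x, log_present_pos, log_absent_pos,
                 log_present_neg, log_absent_neg,
                 log_prior_pos, log_prior_neg):
    """Classify every row of a binary matrix x; returns (labels, confidences)."""
    # x picks out the "present" terms and (1 - x) picks out the "absent" terms.
    score_pos = x @ log_present_pos + (1 - x) @ log_absent_pos + log_prior_pos
    score_neg = x @ log_present_neg + (1 - x) @ log_absent_neg + log_prior_neg
    labels = np.where(score_pos > score_neg, 1, -1)
    return labels, np.abs(score_pos - score_neg)

# Toy model over three words, with invented probabilities.
p_pos = np.array([0.8, 0.1, 0.5])
p_neg = np.array([0.2, 0.6, 0.5])
x_toy = np.array([[1, 0, 1], [0, 1, 0]], dtype=float)

labels, conf = classify_all(x_toy, np.log(p_pos), np.log(1 - p_pos),
                            np.log(p_neg), np.log(1 - p_neg),
                            np.log(0.5), np.log(0.5))
print(labels, conf)
```

Given true labels `y_true` as an array, the average error would then be `np.mean(labels != y_true)`.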


**1.9:** Compute the output of `classifyNB` for all test data and output the average error.  

In [220]:
# testNB: Classify all xTest
#   xTest - test data features
#   yTest - true label of test data
#   logProbWordPresentGivenPositive - log P(x_j = 1|+)
#   logProbWordAbsentGivenPositive  - log P(x_j = 0|+)
#   logProbWordPresentGivenNegative - log P(x_j = 1|-)
#   logProbWordAbsentGivenNegative  - log P(x_j = 0|-)
#   logPriorPositive - log P(+)
#   logPriorNegative - log P(-)
#   returns Average test error
def testNB(xTest, yTest, 
           logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive, 
           logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative, 
           logPriorPositive, logPriorNegative):
  
  # Compute the number of correct matches
  matches = 0
  for i, tweet in enumerate(xTest):
    prediction = classifyNB(tweet, logProbWordPresentGivenPositive,logProbWordAbsentGivenPositive,
                               logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative,
                               logPriorPositive, logPriorNegative)
    actual = yTest[i]
    if prediction[0] == actual:
      matches += 1

  # compute avgErr
  avgErr = 1 - matches/len(xTest)
  
  print("Average error of NB is", avgErr)
  return avgErr

testNB(xTest, yTest, 
       logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive, 
       logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative, 
       logPriorPositive, logPriorNegative)

Average error of NB is 0.1702702702702703


0.1702702702702703

**1.10:** Now write an outer wrapper that performs 10 train/test splits and computes the mean and standard deviation of the average error across the 10 runs.

In [221]:
from statistics import mean
from statistics import stdev

# 10 train/test splits
def experiment():
  error_list = list()
  for i in range(10):
    # split test data
    xTrain, xTest, yTrain, yTest = train_test_split(X, y, test_size = 0.2)

    # recompute probabilities
    probWordGivenPositive, probWordGivenNegative, priorPositive, priorNegative = compute_distros(xTrain,yTrain)

    # recompute log distributions
    min_prob = 1 / yTrain.shape[0] #Assume very rare words only appeared once
    logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive = compute_logdistros(probWordGivenPositive, min_prob)
    logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative = compute_logdistros(probWordGivenNegative, min_prob)
    logPriorPositive, logPriorNegative = compute_logdistros(priorPositive, min_prob)

    # recompute average error
    average_error = testNB(xTest, yTest, 
        logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive, 
        logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative, 
        logPriorPositive, logPriorNegative)
    error_list.append(average_error)

  print(f'Mean: {mean(error_list)}')
  print(f'Standard Deviation: {stdev(error_list)}')

experiment()

Average error of NB is 0.18243243243243246
Average error of NB is 0.17837837837837833
Average error of NB is 0.16486486486486485
Average error of NB is 0.14864864864864868
Average error of NB is 0.17297297297297298
Average error of NB is 0.17567567567567566
Average error of NB is 0.16081081081081083
Average error of NB is 0.18513513513513513
Average error of NB is 0.19189189189189193
Average error of NB is 0.14324324324324322
Mean: 0.1704054054054054
Standard Deviation: 0.015829344133705316
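One detail worth knowing when reading these numbers: `statistics.stdev` is the sample standard deviation (it divides by $n - 1$), while `statistics.pstdev` divides by $n$. The sketch below recomputes both from the ten error rates printed above, rounded to six decimals.

```python
import math
from statistics import mean, pstdev, stdev

# Error rates copied from the ten runs printed above, rounded to six decimals.
errors = [0.182432, 0.178378, 0.164865, 0.148649, 0.172973,
          0.175676, 0.160811, 0.185135, 0.191892, 0.143243]

sample_sd = stdev(errors)       # divides by n - 1
population_sd = pstdev(errors)  # divides by n

print(f"mean={mean(errors):.4f}")
print(f"sample_sd={sample_sd:.4f} population_sd={population_sd:.4f}")
```

With only ten runs the two differ by a factor of $\sqrt{9/10}$, so the choice is visible in the second significant digit.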


**1.11** Finally, let's get to the lab! Now, we need to repeat the experiment above while ignoring absent words. To do this, I'm just going to take the log probability vectors for absent words and set them all to zero. That way, they don't contribute anything to the sum. In other words, we'd be treating them as probabilities of 1, meaning they'd have no effect in a product.

In [222]:
# 10 train/test splits
def experiment_minus_absent_words():
  error_list = list()
  for i in range(10):
    # split test data
    xTrain, xTest, yTrain, yTest = train_test_split(X, y, test_size = 0.2)

    # recompute probabilities
    probWordGivenPositive, probWordGivenNegative, priorPositive, priorNegative = compute_distros(xTrain,yTrain)

    # recompute log distributions
    min_prob = 1 / yTrain.shape[0] #Assume very rare words only appeared once
    logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive = compute_logdistros(probWordGivenPositive, min_prob)
    logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative = compute_logdistros(probWordGivenNegative, min_prob)
    logPriorPositive, logPriorNegative = compute_logdistros(priorPositive, min_prob)

    # ignore absent words by setting their log probabilities to zero
    logProbWordAbsentGivenPositive[:] = 0
    logProbWordAbsentGivenNegative[:] = 0

    # recompute average error
    average_error = testNB(xTest, yTest, 
        logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive, 
        logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative, 
        logPriorPositive, logPriorNegative)
    error_list.append(average_error)

  print(f'Mean: {mean(error_list)}')
  print(f'Standard Deviation: {stdev(error_list)}')

experiment_minus_absent_words()

Average error of NB is 0.1527027027027027
Average error of NB is 0.1945945945945946
Average error of NB is 0.1594594594594595
Average error of NB is 0.1527027027027027
Average error of NB is 0.16621621621621618
Average error of NB is 0.15405405405405403
Average error of NB is 0.15810810810810816
Average error of NB is 0.17432432432432432
Average error of NB is 0.16621621621621618
Average error of NB is 0.20540540540540542
Mean: 0.16837837837837838
Standard Deviation: 0.01821068165180867


To be honest, I don't see much of a difference. The mean error is about 17% in both experiments (0.170 with absent words, 0.168 without), and that gap is well within one standard deviation (roughly 0.016 to 0.018). Whether or not we include the probabilities of absent words doesn't seem to matter. 
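Zeroing the "absent" log probabilities is equivalent to summing only over the words that actually occur in the tweet. The toy check below, with invented probabilities, shows both forms giving the same score.

```python
import numpy as np

# Invented probabilities for a three-word vocabulary.
p_word = np.array([0.8, 0.1, 0.5])
log_present = np.log(p_word)
log_absent = np.log(1 - p_word)
x = np.array([1.0, 0.0, 1.0])  # words 0 and 2 present, word 1 absent

# Full score: present and absent words both contribute.
full_score = x @ log_present + (1 - x) @ log_absent

# Absent log probabilities set to zero, as in experiment_minus_absent_words.
zeroed_absent = np.zeros_like(log_absent)
ignore_absent_score = x @ log_present + (1 - x) @ zeroed_absent

# Same thing written as a sum over present words only.
present_only_score = log_present[x == 1].sum()

print(full_score, ignore_absent_score, present_only_score)
```

Most of the vocabulary is absent from any single tweet, and the corresponding $\log P(x_j = 0 | \cdot)$ terms are usually close to zero, which may be part of why dropping them changes the error so little.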

**2.0** Stop Words: At this point, we get to experiment with word removal. In this case, we're going to try removing the top 25, 50, 100, and 200 words. To do that, we'll start by identifying the indices of the most frequent words:

In [0]:
def top_words(tweets, count):
  """
  A helper method which returns the indices of the `count` most frequent words.

  :param tweets: a document-word matrix in the form of X (e.g. xTrain, xTest, etc.)
  :param count: the number of word indices to extract
  """
  word_counts = np.sum(tweets, axis=0)
  top_word_indices = np.argpartition(word_counts, -count)[-count:]
  return top_word_indices

def remove_stop_words(indices, *distros):
  for index in indices:
    for distro in distros:
      distro[index] = 0
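A note on the two helpers above: `np.argpartition` returns the indices of the `count` largest values but in no guaranteed order, which is fine here because we only need the set of indices. Setting a log probability to 0 is the same as setting the probability to 1, so a removed word no longer affects either label's score. The toy counts below are made up.

```python
import numpy as np

# Made-up word counts for a six-word vocabulary.
word_counts = np.array([5, 40, 2, 17, 40, 1])
count = 3

# The last `count` positions hold the indices of the largest counts, unordered.
top_unordered = np.argpartition(word_counts, -count)[-count:]

# If a ranking is wanted, sort just those indices by count (descending).
top_sorted = top_unordered[np.argsort(word_counts[top_unordered])[::-1]]

print(sorted(top_unordered.tolist()))
print(top_sorted)
```

`argpartition` avoids a full sort of the vocabulary, and the extra `argsort` only touches `count` elements.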

**2.1** At this point, we'll rewrite our experiment code to remove the top count of stop words.

In [0]:
# 10 train/test splits
def experiment_minus_stop_words(count):
  error_list = list()
  for i in range(10):
    # split test data
    xTrain, xTest, yTrain, yTest = train_test_split(X, y, test_size = 0.2)

    # recompute probabilities
    probWordGivenPositive, probWordGivenNegative, priorPositive, priorNegative = compute_distros(xTrain,yTrain)

    # recompute log distributions
    min_prob = 1 / yTrain.shape[0] #Assume very rare words only appeared once
    logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive = compute_logdistros(probWordGivenPositive, min_prob)
    logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative = compute_logdistros(probWordGivenNegative, min_prob)
    logPriorPositive, logPriorNegative = compute_logdistros(priorPositive, min_prob)

    # Computes the top words and removes them from the distribution
    top_word_indices = top_words(xTrain, count)
    remove_stop_words(top_word_indices, logProbWordPresentGivenPositive, 
                      logProbWordAbsentGivenPositive, logProbWordPresentGivenNegative, 
                      logProbWordAbsentGivenNegative)

    # recompute average error
    average_error = testNB(xTest, yTest, 
        logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive, 
        logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative, 
        logPriorPositive, logPriorNegative)
    error_list.append(average_error)

  print(f'Mean: {mean(error_list)}')
  print(f'Standard Deviation: {stdev(error_list)}')

**2.2** Cool! Now, we can just run this experiment 4 times. 

In [225]:
experiment_minus_stop_words(25)

Average error of NB is 0.18243243243243246
Average error of NB is 0.16756756756756752
Average error of NB is 0.16486486486486485
Average error of NB is 0.17432432432432432
Average error of NB is 0.20405405405405408
Average error of NB is 0.16756756756756752
Average error of NB is 0.16486486486486485
Average error of NB is 0.16216216216216217
Average error of NB is 0.18918918918918914
Average error of NB is 0.19864864864864862
Mean: 0.17756756756756756
Standard Deviation: 0.015184924539971747


In [226]:
experiment_minus_stop_words(50)

Average error of NB is 0.20270270270270274
Average error of NB is 0.23108108108108105
Average error of NB is 0.22837837837837838
Average error of NB is 0.21756756756756757
Average error of NB is 0.23108108108108105
Average error of NB is 0.2391891891891892
Average error of NB is 0.2189189189189189
Average error of NB is 0.2256756756756757
Average error of NB is 0.22297297297297303
Average error of NB is 0.20405405405405408
Mean: 0.22216216216216217
Standard Deviation: 0.011732483312504278


In [227]:
experiment_minus_stop_words(100)

Average error of NB is 0.2648648648648648
Average error of NB is 0.2364864864864865
Average error of NB is 0.2959459459459459
Average error of NB is 0.25
Average error of NB is 0.27567567567567564
Average error of NB is 0.2945945945945946
Average error of NB is 0.26216216216216215
Average error of NB is 0.25
Average error of NB is 0.25135135135135134
Average error of NB is 0.25810810810810814
Mean: 0.2639189189189189
Standard Deviation: 0.019552351690238706


In [228]:
experiment_minus_stop_words(200)

Average error of NB is 0.3202702702702702
Average error of NB is 0.2945945945945946
Average error of NB is 0.2716216216216216
Average error of NB is 0.33108108108108103
Average error of NB is 0.2716216216216216
Average error of NB is 0.31351351351351353
Average error of NB is 0.3027027027027027
Average error of NB is 0.30810810810810807
Average error of NB is 0.2959459459459459
Average error of NB is 0.32837837837837835
Mean: 0.3037837837837838
Standard Deviation: 0.020952483722501938


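When we removed the 25 most frequent words, we didn't see much of a change in performance (a mean error of about 0.178, compared with roughly 0.17 before). However, as we increased the number of words we removed, we started to lose valuable semantic information: the mean error rose to about 0.222 at 50 words and 0.264 at 100 words. By the time we removed 200 words, the mean error had reached about 0.304, a significant drop in performance. 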
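One way to investigate this would be to look at which words are actually being removed, since frequent words are not necessarily uninformative. The sketch below uses a toy dictionary and matrix as stand-ins for `wordDict` and `X`. It also builds the inverse mapping once, rather than scanning the dictionary on each lookup as `reverse_word_lookup` does.

```python
import numpy as np

# Toy stand-ins for the notebook's wordDict and X (hypothetical values).
word_dict = {"it": 0, "is": 1, "cold": 2, "sunny": 3}
x_toy = np.array([[1, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [1, 0, 1, 0]], dtype=float)

# Invert the dictionary once so each index lookup is O(1).
id_to_word = {index: word for word, index in word_dict.items()}

counts = x_toy.sum(axis=0)
order = np.argsort(counts)[::-1][:2]  # indices of the two most frequent words
removed = [(id_to_word[int(i)], int(counts[i])) for i in order]
print(removed)
```

In this toy data the second most frequent word is "cold", which is exactly the kind of word a weather sentiment classifier would want to keep.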