# CS 505: Programming Assignment 2
author: Jared Chou

Collaborators: Macie So


**Assignment 2**: (One week, due in Gradescope at midnight 9/28 with same grace period and late policy as in PS 1)

In this assignment, we are going to try building language models with 
the data we collected from various sources.

In the first half, we are going to analyze our twitter data with NLTK. 
There are several tasks we would like you to finish during this process:

1. Preprocess the raw twitter data into a format that
language models in NLTK can train on.
2. Train uni-gram, bi-gram and tri-gram models with Add-one smoothing.
3. Compute perplexity to evaluate our language models based on different test sets.
4. Generate new sentences with our language models based on the trained data.
5. Perform sentiment analysis on our scraped data.

In the following sections, we are going to provide a code template to allow you
to complete them step by step.

Here's a [general guide](https://www.kaggle.com/code/alvations/n-gram-language-model-with-nltk/notebook) on how to build a language model with NLTK. Please refer to it from time to time to check whether you have missed anything.

Please submit this code with your implementation and outputs. **Please indicate which students, if any, you consulted with as you completed this assignment.** 

First, please go back to the code of our first lab section. Scrape 10000 tweets
with a query of the form `fishing lang:en -has:mentions -has:links -is:retweet`, so that each tweet:
1. mentions 'fishing'
2. is written in English
3. does not mention any other twitter account (i.e. @).
4. does not contain links.
5. is not a re-tweet.

Then, scrape 10000 tweets with the same rules above but mention 'football' instead of 'fishing' this time.

Save the scraped tweets in separate files, one for 'fishing' tweets
and one for 'football'.
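
The five rules map directly onto search operators in the query string. A minimal sketch, assuming the same operators that appear in the queries in this notebook (`lang:en`, `-has:mentions`, `-has:links`, `-is:retweet`); the helper name `build_query` is hypothetical:

```python
def build_query(keyword):
    """Combine a keyword with the four filtering rules from the list above."""
    rules = [
        keyword,          # 1. mentions the keyword
        "lang:en",        # 2. is written in English
        "-has:mentions",  # 3. does not mention another account
        "-has:links",     # 4. does not contain links
        "-is:retweet",    # 5. is not a re-tweet
    ]
    return " ".join(rules)

fishing_query = build_query("fishing")
football_query = build_query("football")
print(fishing_query)
print(football_query)
```

Building both queries from one helper keeps the 'fishing' and 'football' searches identical apart from the keyword.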

In [1]:
# install tweepy
!pip3 install tweepy
!pip3 install tweepy --upgrade

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tweepy
  Downloading tweepy-4.10.1-py3-none-any.whl (94 kB)
Collecting requests<3,>=2.27.0
  Downloading requests-2.28.1-py3-none-any.whl (62 kB)
Installing collected packages: requests, tweepy
  Attempting uninstall: requests
    Found existing installation: requests 2.23.0
    Uninstalling requests-2.23.0:
      Successfully uninstalled requests-2.23.0
  Attempting uninstall: tweepy
    Found existing installation: tweepy 3.10.0
    Uninstalling tweepy-3.10.0:
      Successfully uninstalled tweepy-3.10.0
Successfully installed requests-2.28.1 tweepy-4.10.1


In [2]:
# imports & initialize client
import os
import tweepy
import csv

# Read the bearer token from the environment instead of hard-coding the credential.
# TWITTER_BEARER_TOKEN is an assumed variable name; set it before running this cell.
client = tweepy.Client(bearer_token=os.environ["TWITTER_BEARER_TOKEN"])

In [3]:
# get tweets

query = "football lang:en -has:links -has:mentions -is:retweet"
footballTweets = list(tweepy.Paginator(
    client.search_recent_tweets, query=query,
    tweet_fields=['context_annotations', 'created_at'],
    max_results=100).flatten(limit=10000))
print("{} tweets are collected".format(len(footballTweets)))

query = "fishing lang:en -has:links -has:mentions -is:retweet"
fishingTweets = list(tweepy.Paginator(
    client.search_recent_tweets, query=query,
    tweet_fields=['context_annotations', 'created_at'],
    max_results=100).flatten(limit=10000))
print("{} tweets are collected".format(len(fishingTweets)))

10000 tweets are collected
7809 tweets are collected


In [4]:
# Save information to CSV File

with open("footballTweets.csv", 'w', newline='') as csvfile:
  fieldnames = ['idx', 'tweetId', 'tweetText']
  writer = csv.DictWriter(csvfile, fieldnames = fieldnames)
  writer.writeheader()
  for i, tweet in enumerate(footballTweets):
    writer.writerow({'idx':i, 'tweetId':tweet.id, 'tweetText':tweet.data['text']})

with open("fishingTweets.csv", 'w', newline='') as csvfile:
  fieldnames = ['idx', 'tweetId', 'tweetText']
  writer = csv.DictWriter(csvfile, fieldnames = fieldnames)
  writer.writeheader()
  for i, tweet in enumerate(fishingTweets):
    writer.writerow({'idx':i, 'tweetId':tweet.id, 'tweetText':tweet.data['text']})

## Task 1

First, let's try loading our scraped data. To begin with, let's load our 'fishing' data. You may change the following
function as necessary.

In [5]:
import csv

def loadTextFromCSV(csvPath):
  tweetDict = {}
  with open(csvPath, newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
      tweetDict[int(row['idx'])] = row['tweetText']
  return tweetDict

#load your fishing tweet data here:
csvPathFish = "./fishingTweets.csv"
rawTweetDictFish = loadTextFromCSV(csvPathFish)

#load football tweet data here:
csvPathFootball = "./footballTweets.csv"
rawTweetDictFootball = loadTextFromCSV(csvPathFootball)

#print your tweet dictionary. You should see your saved tweets inside.
print("rawTweetDictFish: ",rawTweetDictFish)



## Task 2

Next, we are going to pre-process the text with the NLTK library.

Install the NLTK library if it's not in your Google Colab space.

Download 'punkt' specifically for sentence segmentation.

In [6]:
!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [7]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

We preprocess our tweet data with the following steps:
1. Split data into training and testing splits. (**80%** tweets for training and **20%** tweets for testing) 
2. Sentence segmentation/splitting.
3. Lower-case all words in the sentences.
4. Tokenization (you should use TweetTokenizer from NLTK.tokenize)
5. Padding with begin-of-sentence and end-of-sentence symbols 

You may refer to the following materials:
1. [Sentence segmentation](https://www.nltk.org/api/nltk.tokenize.html). 
2. [String Lower case](https://www.w3schools.com/python/ref_string_lower.asp).
3. [Tweet Tokenization](https://www.nltk.org/api/nltk.tokenize.casual.html).
4. [Padding tokenized sentences](https://www.nltk.org/_modules/nltk/lm/preprocessing.html). Particularly, please look at function 'padded_everygram_pipeline'.
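
To make steps 1 and 5 concrete, here is a minimal pure-Python sketch. It assumes an in-order 80/20 split and the padding symbols `<s>` and `</s>`; the helpers `split_train_test` and `pad_sentence` are hypothetical stand-ins for illustration, not the NLTK functions themselves.

```python
def split_train_test(items, train_fraction=0.8):
    """Split a list in order: the first 80% for training, the rest for testing."""
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]

def pad_sentence(tokens, n):
    """Add n-1 begin symbols and n-1 end symbols for an n-gram model of order n."""
    return ["<s>"] * (n - 1) + list(tokens) + ["</s>"] * (n - 1)

# Ten placeholder tweets stand in for the scraped data.
tweets = ["tweet %d" % i for i in range(10)]
train, test = split_train_test(tweets)
print(len(train), len(test))

# A trigram model (n=3) needs two padding symbols on each side.
padded = pad_sentence(["gone", "fishing"], n=3)
print(padded)
```

With `n=1` the helper adds no padding at all, which matches the fact that a unigram model has no context to condition on.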

We will handle the first 4 steps in the following block; padding (step 5) is deferred until right before model training in Task 3.

In [24]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()

# Here's a template you may want to start with for your data pre-processing.

# Wenda recommended delaying the padding of tokenized sentences until right before model training

# moved to global scope so it can be used in step 6
def sentenceSegmentation(tweet):
  #Input: a string of raw tweet
  #Output: a list of strings, each element in the list is a segmented sentence
  return sent_tokenize(tweet)
  
def sentenceLowerCase(sentence):
  #Input: a string of sentence
  #Output: a string of sentence, but all words in the sentence are lower-cased.
  return sentence.lower()

def sentenceTokenization(sentence):
  #Input: a string of sentence
  #Output: a list of tokens that belong to the sentence.
  return tknzr.tokenize(sentence)

def preprocess(rawTweetDataDict,ngram):
  #Input: a dictionary containing raw tweet data scraped from Twitter;
  #     ngram is unused here because padding is delayed until training
  #Output: two lists of tweet sentences (train/test), but each tweet sentence is
  #     represented in the form of tokens.
  

  train = [] # list to store all training sentences
  test = [] # list of all testing sentences
  
  # segment each tweet into sentences
  for i, tweet in rawTweetDataDict.items():
    if(i < len(rawTweetDataDict) * 0.8):
      train.extend(sentenceSegmentation(tweet))
    else:
      test.extend(sentenceSegmentation(tweet))    

  # lowercase, then tokenize each sentence
  train = [sentenceTokenization(sentenceLowerCase(sent)) for sent in train]
  test = [sentenceTokenization(sentenceLowerCase(sent)) for sent in test]

  return (train, test)

## Task 3

Next, we build our n-gram model with our pre-processed data.
First we need to pad our data with padded_everygram_pipeline. Then, we train our n-gram model with add-one smoothing using the corresponding functions in NLTK.

Related materials:
1. [Padding](https://www.nltk.org/_modules/nltk/lm/preprocessing.html)
2. [N-gram Language Model](https://www.nltk.org/api/nltk.lm.html)

Let's train unigram, bigram and trigram models with our train data split.
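
Add-one (Laplace) smoothing estimates `P(w | c) = (count(c, w) + 1) / (count(c) + V)`, where `V` is the vocabulary size. The pure-Python bigram sketch below works through that arithmetic on a toy corpus. The corpus and the helper `laplace_bigram_prob` are made up for illustration, and the unknown-word handling that NLTK's vocabulary adds is omitted here.

```python
from collections import Counter

# Toy corpus, already padded for a bigram model.
corpus = [
    ["<s>", "i", "like", "fishing", "</s>"],
    ["<s>", "i", "like", "football", "</s>"],
]

unigram_counts = Counter(tok for sent in corpus for tok in sent)
bigram_counts = Counter(
    (sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1)
)
vocab_size = len(unigram_counts)  # 6 distinct symbols, padding included

def laplace_bigram_prob(word, context):
    """(count(context, word) + 1) / (count(context) + V)."""
    return (bigram_counts[(context, word)] + 1) / (unigram_counts[context] + vocab_size)

seen = laplace_bigram_prob("fishing", "like")   # (1 + 1) / (2 + 6)
unseen = laplace_bigram_prob("football", "i")   # (0 + 1) / (2 + 6)
print(seen, unseen)

# The smoothed probabilities for one context still sum to 1 over the vocabulary.
total = sum(laplace_bigram_prob(w, "like") for w in unigram_counts)
print(total)
```

The unseen bigram gets a small non-zero probability instead of zero, which is what keeps perplexity finite on test data.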

In [9]:
# Here's a template you may want to start with

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import NgramCounter
from nltk.util import ngrams
from nltk.lm import Laplace
# from nltk.lm import StupidBackoff

def trainNGramAddOneSmoothing(trainData,ngram):
  # Input: a list of tweet sentences, each element is a list of tokens; n for ngram model
  # Output: a n-gram model with add-one smoothing trained on your input data.

  train, vocab = padded_everygram_pipeline(order = ngram, text = trainData) # pad each sentence and save it to a list
  # text_ngrams = [ngrams(sent, ngram) for sent in train] # all ngrams from the text
  # ngram_counts = NgramCounter(text_ngrams)
  laplace = Laplace(order = ngram) # laplace add-one smoothing implementation model
  laplace.fit(train, vocab)
  return laplace

  # model = StupidBackoff(order = 3)
  # model.fit(train, vocab)


## Task 4

Now we apply our analysis on the trained model. 

First, compute the average perplexity of your tri-gram model on the sentences of our test data.

[How to compute perplexity](https://www.nltk.org/api/nltk.lm.html)

Next, load the tweet data of 'football' instead, and compute the perplexity of your 'fishing' model on the football tweets. 

**Why is there a difference between the two perplexities, what causes it?**
(Please answer in a text cell.)
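
Perplexity is the exponentiated average negative log-probability that the model assigns to each n-gram in the evaluation text. NLTK's `perplexity` method takes an iterable of n-gram tuples (for example, `nltk.util.ngrams` applied to a sentence padded with `pad_both_ends`), not raw token lists. The pure-Python sketch below mirrors that shape with a toy add-one bigram model; all data and helper names are hypothetical. As a sanity check, a model that assigns the uniform probability 1/V to every token has a perplexity of V.

```python
import math
from collections import Counter

def pad(tokens):
    """Pad a sentence for a bigram model (one symbol on each side)."""
    return ["<s>"] + list(tokens) + ["</s>"]

def bigrams(tokens):
    """Return consecutive token pairs as (context, word) tuples."""
    return list(zip(tokens, tokens[1:]))

train = [["i", "like", "fishing"], ["i", "like", "boats"]]
padded_train = [pad(sent) for sent in train]
context_counts = Counter(tok for sent in padded_train for tok in sent[:-1])
bigram_counts = Counter(bg for sent in padded_train for bg in bigrams(sent))
vocab = {tok for sent in padded_train for tok in sent}

def prob(word, context):
    """Add-one smoothed bigram probability; unseen words get the add-one floor."""
    return (bigram_counts[(context, word)] + 1) / (context_counts[context] + len(vocab))

def perplexity(sentences, prob_fn):
    """2 ** (average negative log2 probability over all test bigrams)."""
    test_bigrams = [bg for sent in sentences for bg in bigrams(pad(sent))]
    log_sum = sum(math.log2(prob_fn(word, context)) for context, word in test_bigrams)
    return 2 ** (-log_sum / len(test_bigrams))

in_domain = perplexity([["i", "like", "fishing"]], prob)
out_of_domain = perplexity([["football", "tonight"]], prob)
uniform = perplexity([["i", "like", "fishing"]], lambda w, c: 1 / len(vocab))
print(in_domain, out_of_domain, uniform)
```

In this toy setting the out-of-domain sentence scores a higher perplexity than the in-domain one, because none of its bigrams were seen in training.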

In [10]:
# Here's a template you may want to start with
from nltk.lm import MLE

def computePerplexity(model,testData):
  # Input: your model; the testing data
  # Output: average perplexity of the model on your testing data.
  return model.perplexity(testData)


# compute 
# 1. the perplexity of your model on your testing data of 'fishing' tweets.
fishTrain, fishTest = preprocess(rawTweetDictFish, 3)
fish_lm = trainNGramAddOneSmoothing(fishTrain, 3)
fish_p = computePerplexity(fish_lm, fishTest)
print(fish_p)
# 2. the perplexity of a separately trained 'football' model on the 'football' test split.
footballTrain, footballTest = preprocess(rawTweetDictFootball, 3)
football_lm = trainNGramAddOneSmoothing(footballTrain, 3)
football_p = computePerplexity(football_lm, footballTest)
print(football_p)

13855.091806934732
17035.572987965425


## Task 5

Next, generate 10 tweets using each of your language models (unigram, bigram, trigram). The generated tweets need to be in string format instead of tokens, and the strings should not contain padding.

[Generate new sentences with your model.](https://www.nltk.org/api/nltk.lm.html)

[Detokenize your generated sentences](https://www.nltk.org/howto/tokenize.html)
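
Generated token sequences can contain the padding symbols `<s>` and `</s>`. One way to meet the no-padding requirement, sketched here in pure Python, is to cut the sequence at the first end-of-sentence symbol and drop any begin-of-sentence symbols before detokenizing. The helper `strip_padding` is hypothetical, and the plain `" ".join` stands in for a real detokenizer such as `TreebankWordDetokenizer`.

```python
def strip_padding(tokens, bos="<s>", eos="</s>"):
    """Keep tokens up to the first end symbol and drop begin symbols."""
    kept = []
    for tok in tokens:
        if tok == eos:
            break  # the sentence is over; ignore whatever follows
        if tok != bos:
            kept.append(tok)
    return kept

# Hypothetical generator output that runs past the end of a sentence.
generated = ["<s>", "to", "go", "fishing", "</s>", "back", "was"]
clean = strip_padding(generated)
print(" ".join(clean))
```

Cutting at the first `</s>` also avoids gluing two unrelated sentence fragments into one generated tweet.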

In [11]:
# Here's a template you may want to start with
from nltk.tokenize.treebank import TreebankWordDetokenizer
import random

def generateNewSentence(model,randomSeed):
  # Input: your model; random seed that gets you a different generated sentence
  # Output: a new sentence generated by your model, but in a string format instead of tokens.
  detokenizer = TreebankWordDetokenizer()
  gen_tweet = model.generate(10, random_seed=randomSeed) # seed the sampler so the seed argument takes effect
  return detokenizer.detokenize(gen_tweet)

# Make loops to generate 10 tweets for each of your model (unigram, bigram and trigram)
fish_uni = trainNGramAddOneSmoothing(fishTrain, 1)
fish_bi = trainNGramAddOneSmoothing(fishTrain, 2)
# fish_lm is already a tri-gram model

print("Unigram Models:")
for i in range(10):
  print(generateNewSentence(fish_uni, random.randrange(10000)))

print("\nBigram Models:")
for i in range(10):
  print(generateNewSentence(fish_bi, random.randrange(10000)))

print("\nTrigram Models:")
for i in range(10):
  print(generateNewSentence(fish_lm, random.randrange(10000)))

Unigram Models:
during while do off items a not #montana the the
recovery for it couldn't either and play feel programme sent
fishing 3️⃣: fishing feed to line costs, i
from go firming.eth program to 🟨 good state's august will
kinda networking ps5 magnanimous camelot the their whispers well in
good time in out forehead, putting hurled you and
into) flashing fishing tonight it white with; gets
pass bleeder of ’ the cheras it them so prolly
and dumb . in the! kickass underbite then is
in 11pm can why “ + disney me and dresses

Bigram Models:
to go fishing snob </s> back was never catch nothing
are fishing comments . </s> 75 km) storing renewable
. </s> crash made the ass and i was +
x-corp " sachio & pathetic, indian, and drag
meditation . </s> at viking transport links . </s> administrative
<s> 🎊 add fishing </s> million pieces on my finger
croon that every sea fishing but upvote this is this
<s> heute ist der internationale tag des hasen bluebird of
abra ’ s a joke, for a selfie stick


## Task 6

Lastly, we want to perform sentiment analysis on our collected data.
This time we will use VADER.

Please check out the following material:

[Sentiment analysis with VADER](https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/)

Then do the following:

1. Compute the ratios of positive and negative sentences in your collected data.
2. Compute the average compound sentiment of the tweets for 'fishing' and 'football'. Are they generally positive or negative? (Answer in a text cell.)
3. Compute the top 10 non-stop words from positive tweets of 'fishing'. Please check out [here](https://www.geeksforgeeks.org/removing-stop-words-nltk-python/) to find out how to remove stop words in your sentences. The top 10 words should also exclude punctuation, including symbols like parentheses, ', ", or ...
You can refer to [here](https://docs.python.org/3/library/string.html) to see how to exclude them (there will still be special cases; please remember to remove them as well).
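
VADER's compound score lies in [-1, 1]. A common convention, and the one assumed here, labels a sentence positive at 0.05 or above, negative at -0.05 or below, and neutral otherwise. The sketch below applies those thresholds to a hand-written list of compound scores (hypothetical values, not output from the analyzer) and computes the share of each label plus the mean compound score.

```python
from collections import Counter

def label_compound(score, threshold=0.05):
    """Map a compound score to 'positive', 'negative', or 'neutral'."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical compound scores, one per sentence.
scores = [0.6, 0.0, -0.4, 0.3, -0.02]
labels = [label_compound(s) for s in scores]
counts = Counter(labels)

# Share of each label among all sentences; Counter returns 0 for absent labels.
ratios = {name: counts[name] / len(scores) for name in ("positive", "negative", "neutral")}
mean_compound = sum(scores) / len(scores)
print(ratios)
print(mean_compound)
```

Reporting each label as a share of all sentences avoids the division by zero that a positive-to-negative ratio hits when no sentence is negative.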

In [12]:
# install VADER
!pip install vaderSentiment

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [35]:
# Here's a template you may want to start with
import nltk
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
import string
# need to download 'stopwords' before using it.
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))


def computeSentimentOfSentences(sentenceData):
  # Input: a list of sentences from tweets
  # Output: a list of sentences from positive tweets, average compound from all the input sentences, and the ratio of positive to negative tweets
  si_obj = SentimentIntensityAnalyzer()
  
  pos_sens = [] # list of positive sentences
  avg_comp = 0 # average compound from all input sentences
  neg_sen_count = 0 # number of negative sentences

  for sent in sentenceData:
      sentiment = si_obj.polarity_scores(sent)
      avg_comp += sentiment['compound']
      if(sentiment['compound'] >= 0.05):
        pos_sens.append(sent)
      elif (sentiment['compound'] <= -0.05):
        neg_sen_count += 1
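  avg_comp /= len(sentenceData)
  # guard against division by zero when no sentence is negative
  pos_to_neg = len(pos_sens) / neg_sen_count if neg_sen_count else float('inf')
  return pos_sens, avg_comp, pos_to_neg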

def removeStopWords(sentence):
  # Input: a sentence of tweet
  # Output: the sentence of input tweet, but stop + punctuation words are removed
  word_tokens = word_tokenize(sentence)
  filtered_sentence = []
  for w in word_tokens:
    # skip any token containing ASCII punctuation or a curly apostrophe
    has_punct = any(ch in w for ch in string.punctuation) or "’" in w
    if not has_punct and w.lower() not in stop_words:
      filtered_sentence.append(w)
  return filtered_sentence

# 0. create the lists of raw sentence data
def dict_to_sents (rawTweetDataDict):
  sents = []
  for i, tweet in rawTweetDataDict.items():
    sents.extend(sentenceSegmentation(tweet))
  return sents

fish_sents = dict_to_sents(rawTweetDictFish)
football_sents = dict_to_sents(rawTweetDictFootball)
# 1. compute the sentiment of the collected data
pos_fish, fish_avg_comp, fish_pos_to_neg_ratio = computeSentimentOfSentences(fish_sents)
pos_football, football_avg_comp, football_pos_to_neg_ratio = computeSentimentOfSentences(football_sents)
print("The ratio of positive to negative sentences from the fishing data is " + str(fish_pos_to_neg_ratio) + " to 1")
print("The ratio of positive to negative sentences from the football data is " + str(football_pos_to_neg_ratio) + " to 1")

# 2. compute the average compound of the collected data
print("average fish compound score is: " + str(fish_avg_comp))
print("average football compound score is: " + str(football_avg_comp))

# 3. compute the top 10 words with stop word and punctuation removed.
all_words = []
for sent in fish_sents:
  all_words.extend(removeStopWords(sent.lower())) 
word_dist = nltk.FreqDist(all_words)
most_common = [word[0] for word in word_dist.most_common(10)]
print("the 10 most common words in fishing tweets are: " + str(most_common))


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


The ratio of positive to negative sentences from the fishing data is 1.6459816887080365 to 1
The ratio of positive to negative sentences from the football data is 1.8237643534697954 to 1
average fish compound score is: 0.07675540615469913
average football compound score is: 0.10072376607438074
the 10 most common words in fishing tweets are: ['fishing', 'go', 'like', 'fish', 'time', 'one', 'get', 'going', 'day', 'want']


6.2) The average compound sentiment of the fishing tweets is 0.076755. Since the threshold for positive sentiment is a compound score of at least 0.05, the fishing tweets are, on average, positive.

The average compound sentiment of the football tweets is 0.100724, which is also above the 0.05 threshold, so the football tweets are, on average, positive as well.
