
# Programming Assignment 4

**The data files are zipped and uploaded in Piazza "Resources"->"Homeworks"->pa4.zip .**

**Part 1: Twitter Sentiment Classification with sklearn**

The file sentiment-train.csv contains 60k tweets annotated with their sentiment (0: negative, 1: positive). It is a sample
of a very large sentiment corpus that has been weakly annotated based on the emojis contained in the tweets. The file sentiment-test.csv contains the test data, organized in the same format as the training data file.

**Task 1 & 2:**
Using [sklearn](https://scikit-learn.org/stable/index.html) (you should search for the relevant functions to see how to use them in your code), 

1. Train a Multinomial Naive Bayes classifier (with default parameters) to predict sentiment on the
training data, featurizing the data using CountVectorizer (also in sklearn). Use the default parameters of CountVectorizer,
except set max_features = 1000 (to limit the bag-of-words features to the 1k most frequent words across
the corpus) and ignore stop words. You can learn more about CountVectorizer parameters and usage [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Report the accuracy of the trained
classifier on the test set. 

2. Use CountVectorizer with binary counts (set binary flag = True), with other parameters same as before. Using
these features, train MultinomialNB classifier with default parameters and report the accuracy of the trained classifier
on the test set. Does using binary counts as features improve the classification accuracy?

**Hint:** we strongly recommend using [Pandas](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html) for reading .csv files and manipulating them in this assignment. 
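
To make the difference between tasks 1 and 2 concrete, here is a minimal sketch on a two-document toy corpus (the documents are made up for illustration and are not taken from sentiment-train.csv). With the default settings a repeated word is counted every time it occurs; with `binary=True` every non-zero count is clipped to 1.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not from sentiment-train.csv)
docs = ["good good movie", "bad movie"]

count_vec = CountVectorizer()               # raw term counts
binary_vec = CountVectorizer(binary=True)   # presence/absence only

counts = count_vec.fit_transform(docs).toarray()
binary = binary_vec.fit_transform(docs).toarray()

# Column order follows the indices stored in vocabulary_
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)
print(vocab)    # ['bad', 'good', 'movie']
print(counts)   # rows: [0 2 1] and [1 0 1]
print(binary)   # rows: [0 1 1] and [1 0 1]
```

Only the first document changes, because it is the only one with a repeated word. Tweets are short, so repeated words are probably uncommon, which would explain why the two featurizations tend to score similarly.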

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import pandas as pd

def loadDataFromCSV(filePath):
  df = pd.read_csv(filePath)
  y = df['sentiment']
  x = df['text']
  return df,x,y
# Load the csv data
df,xTrain,yTrain = loadDataFromCSV('sentiment-train.csv')
_,xTest,yTest = loadDataFromCSV('sentiment-test.csv')

# Train a Multinomial Naive Bayes classifier on sentiment-train.csv
# steps:
# 1. load data from csv (done above)
# 2. construct token count matrix (CountVectorizer)
def targetCountMatrix(vectorizer, target):
  countMat = vectorizer.transform(target)
  return countMat

def constructCountMatrix(corpus, target, maxFeatures):
  vectorizer = CountVectorizer(max_features=maxFeatures, stop_words='english')
  vectorizer.fit(corpus) # create a count matrix with the vocabulary of the corpus (each column)
  # countMat = vectorizer.transform(target) # apply the documents in target into the count matrix, 
  return targetCountMatrix(vectorizer, target)

def constructCountMatrixBinary(corpus, target, maxFeatures):
  vectorizer = CountVectorizer(max_features=maxFeatures, stop_words='english', binary=True)
  vectorizer.fit(corpus)
  return targetCountMatrix(vectorizer, target)

# task 1 (binary flag not set)
xTrainCountMat = constructCountMatrix(xTrain, xTrain, 1000)
xTestCountMat = constructCountMatrix(xTrain, xTest, 1000)

# task 2 (binary flag set)
b_xTrainCountMat = constructCountMatrixBinary(xTrain, xTrain, 1000)
b_xTestCountMat = constructCountMatrixBinary(xTrain, xTest, 1000)

# print(xTrainCountMat.toarray())
# train mnb model with count matrix
def trainMultiNB(x,y):
  clf = MultinomialNB()
  clf.fit(x, y)
  return clf

mnb = trainMultiNB(xTrainCountMat, yTrain)
b_mnb = trainMultiNB(b_xTrainCountMat, yTrain)
# test model & compute accuracy score
print("MNB model score is: " + str(mnb.score(xTestCountMat, yTest)))
print("Binary MNB model score is: " + str(b_mnb.score(b_xTestCountMat, yTest)))


MNB model score is: 0.7827298050139275
Binary MNB model score is: 0.7743732590529248


In [None]:
# The results should look similar to the one below:

**Tasks 3 & 4:**
3. Using sklearn, train a logistic regression classifier on your training data, using CountVectorizer to featurize your
data (with the same parameters as in task 1). Report the accuracy of the trained classifier on the test set.

4. Train a logistic regression classifier as before, using binary CountVectorizer to featurize your data. Report the
accuracy of the trained classifier on the test set.

In [None]:
from sklearn.linear_model import LogisticRegression

# use CountVectorizer to featurize data
# count matrices from previous cell work

# train logistic regression classifier
def trainLogisticRegression(x, y):
  clf = LogisticRegression(random_state=0)
  clf.fit(x, y)
  return clf

# score
logReg = trainLogisticRegression(xTrainCountMat, yTrain)
print("logistic regression model score is: " + str(logReg.score(xTestCountMat, yTest)))

b_logReg = trainLogisticRegression(b_xTrainCountMat, yTrain)
print("binary logistic regression model score is: " + str(b_logReg.score(b_xTestCountMat, yTest)))


logistic regression model score is: 0.766016713091922
binary logistic regression model score is: 0.7688022284122563


In [None]:
# The results should look similar to the one below:

**Task 5:** After performing the above experiments, which feature extractor and statistical model combination is good for your
dataset? Note that this step is called model selection. Read online about the terms “model selection”
and “development set” a.k.a. “validation set” and describe if it is okay to do model selection on the test set.

**Your answer here:**
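The best combination for this dataset is the non-binary CountVectorizer with Multinomial Naive Bayes. Its test accuracy (78.27%) is the highest of the four models, ahead of binary Multinomial Naive Bayes (77.44%), binary logistic regression (76.88%), and logistic regression (76.60%). The margin is small, roughly 0.8 to 1.7 percentage points.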

It is not okay to do model selection on the test set. The test set should be used only for the final evaluation of the chosen model. If it also guides the choice of model, the reported accuracy becomes an optimistic estimate of performance on unseen data. The development/validation set is a portion of the training data that is held out and used to compare models or tune hyper-parameters, so that the test set stays untouched until the end.
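
As an illustration of selecting a model on a development set rather than on the test set, the sketch below holds out 20% of some training data and compares two classifiers on it. The data here is synthetic (Poisson counts standing in for CountVectorizer output), and the split size and `max_iter` value are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Synthetic count features standing in for CountVectorizer output (hypothetical data)
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
rates = np.where(y[:, None] == 1, [3.0, 1.0, 2.0], [1.0, 3.0, 2.0])
X = rng.poisson(rates)

# Hold out part of the *training* data as a development set
X_fit, X_dev, y_fit, y_dev = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

candidates = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}
# Score every candidate on the development set only
dev_scores = {name: clf.fit(X_fit, y_fit).score(X_dev, y_dev)
              for name, clf in candidates.items()}
best_name = max(dev_scores, key=dev_scores.get)
print(dev_scores, best_name)
```

The test set would then be scored once, with the selected model only.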

**Tasks 6 & 7:**
6. Conduct 10-fold cross validation experiments on your training data: training a Multinomial NB classifier
with CountVectorizer and different max features (= 1000, 2000, 3000, or 4000) with and without binary counts.
Report the average accuracies of these different max features and binary/not binary across folds.

7. Select the combination of max features value and binary/not binary count choice that has the highest average
accuracy in your cross-validation experiments and train a Multinomial NB classifier on your whole training data
using this parameter to featurize your data. Report the accuracy of this trained classifier on the test set.

**Hint:** Consider Stratified K-Folds for task 6.
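
One possible way to organize the experiments in task 6 is to wrap the vectorizer and the classifier in a pipeline and let `cross_val_score` handle the folds, so the vocabulary is refit on the training portion of every fold. This is only a sketch: the tweets and labels below are invented, the `max_features` values are shrunk to fit the toy vocabulary, and stop-word removal is left out for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labelled tweets (hypothetical); the real data would be the training tweets and labels
texts = ["love this", "great day", "so happy", "best ever",
         "hate this", "awful day", "so sad", "worst ever"] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

skf = StratifiedKFold(n_splits=10)
results = {}
for binary in (False, True):
    for max_features in (5, 10):
        # The pipeline refits the vectorizer inside every fold
        pipe = make_pipeline(
            CountVectorizer(max_features=max_features, binary=binary),
            MultinomialNB())
        scores = cross_val_score(pipe, texts, labels, cv=skf)
        results[(binary, max_features)] = scores.mean()

# Configuration with the highest mean accuracy across folds
best = max(results, key=results.get)
print(results, best)
```

For task 7, the configuration stored in `best` would be refit on the whole training set and scored once on the test set.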

In [None]:
from sklearn.model_selection import StratifiedKFold
from statistics import mean
# Task 6: 10-fold cross validation experiments
skf = StratifiedKFold(n_splits=10)

for b in [False, True]:
  for n in [1000, 2000, 3000, 4000]:
    scores_for_avg = []
    # featurize the training data once per configuration (the test set is not used here)
    if b:
      cvCountMat = constructCountMatrixBinary(xTrain, xTrain, n)
    else:
      cvCountMat = constructCountMatrix(xTrain, xTrain, n)

    # train on 9 folds, score on the held-out fold
    for train_index, test_index in skf.split(cvCountMat, yTrain):
      X_train, X_test = cvCountMat[train_index], cvCountMat[test_index]
      y_train, y_test = yTrain[train_index], yTrain[test_index]
      mnb = trainMultiNB(X_train, y_train)
      scores_for_avg.append(mnb.score(X_test, y_test))
    print("Accuracy with " + str(n) + " features, binary features is " + str(b) \
          + ":" + str(round(mean(scores_for_avg)*100, 2)) + "%")

Accuracy with 1000 features, binary features is False:71.88%
Accuracy with 2000 features, binary features is False:73.08%
Accuracy with 3000 features, binary features is False:73.39%
Accuracy with 4000 features, binary features is False:73.54%
Accuracy with 1000 features, binary features is True:71.89%
Accuracy with 2000 features, binary features is True:73.05%
Accuracy with 3000 features, binary features is True:73.45%
Accuracy with 4000 features, binary features is True:73.62%


In [None]:
# Task 7: training multinomial naive bayes with binary feature
print("the best binary strategy is to set to True, the best max features is 4000")
X_train = constructCountMatrixBinary(xTrain, xTrain, 4000)
X_test = constructCountMatrixBinary(xTrain, xTest, 4000)
b_mnb = trainMultiNB(X_train, yTrain)
print("the Accuracy of the best hyper-parameter combination trained classifier \
is: " + str(round(b_mnb.score(X_test, yTest) * 100,2)) + "%")


the best binary strategy is to set to True, the best max features is 4000
the Accuracy of the best hyper-parameter combination trained classifier is: 77.16%


In [None]:
# Your results should look similar to this:

**Tasks 8 & 9 & 10:**

**Note: vector models and word2vec will be presented in lecture next week (10/11 and 10/13).**

8. Use [gensim](https://radimrehurek.com/gensim/models/word2vec.html) library to learn 300-dimensional word2vec representations from the tokenized tweets (you can use
Spacy for tokenizing tweets) in your training data (you can use default parameters).
9. Given the learned word2vec representations, construct a vector representation of each tweet as the average of all
the word vectors in the tweet. Ignore words that do not have vector representations – since by default the gensim
word2vec model only learns vector representations for words that appear at least 5 times across the training set.
10. Train a logistic regression classifier using the above vector representation of tweets as your features. Report
the accuracy of the trained classifier on the test set. 
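
The averaging step in task 9 can be sketched independently of gensim. The example below uses a plain dictionary of made-up 3-dimensional vectors as a stand-in for the trained model's word vectors; the function name `average_vector` and the zero-vector fallback for tweets with no known words are assumptions of this sketch.

```python
import numpy as np

def average_vector(tokens, vectors, dim):
    """Mean of the vectors of in-vocabulary tokens; zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Hypothetical 3-dimensional embeddings standing in for a trained word2vec model
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "day": np.array([3.0, 2.0, 0.0]),
}

avg_known = average_vector(["good", "day", "unseenword"], toy_vectors, dim=3)
avg_empty = average_vector(["unseenword"], toy_vectors, dim=3)
print(avg_known)   # [2. 1. 1.]
print(avg_empty)   # [0. 0. 0.]
```

The out-of-vocabulary token is skipped rather than counted as a zero vector, so it does not pull the average toward the origin.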

In [1]:
import spacy

# tokenize the tweets
nlp = spacy.load("en_core_web_sm")

def tokenize(docs):
  # return one list of token strings per document
  tokenized = []
  for doc in docs:
    tweet = nlp(doc)
    tokenized.append([token.text for token in tweet])
  # return only after every document has been processed
  return tokenized

tokenized_tweets = tokenize(xTrain)


In [None]:
!pip install gensim==4.2.0

In [None]:
# Task 8 learn 300-dimensional word2vec representations
from gensim.models import Word2Vec
import gensim

print(tokenized_tweets[:10])
model = gensim.models.Word2Vec(tokenized_tweets, window=5, min_count=5, vector_size=300)
print(model)


[['I', 'LOVE', '@Health4UandPets', 'u', 'guys', 'r', 'the', 'best', '!', '!'], ['i', 'm', 'meeting', 'up', 'with', 'one', 'of', 'my', 'besties', 'tonight', '!', 'Ca', 'nt', 'wait', '!', '!', ' ', '-', 'GIRL', 'TALK', '!', '!'], ['@DaRealSunisaKim', 'Thanks', 'for', 'the', 'Twitter', 'add', ',', 'Sunisa', '!', 'I', 'got', 'to', 'meet', 'you', 'once', 'at', 'a', 'HIN', 'show', 'here', 'in', 'the', 'DC', 'area', 'and', 'you', 'were', 'a', 'sweetheart', '.'], ['Being', 'sick', 'can', 'be', 'really', 'cheap', 'when', 'it', 'hurts', 'too', 'much', 'to', 'eat', 'real', 'food', ' ', 'Plus', ',', 'your', 'friends', 'make', 'you', 'soup'], ['@LovesBrooklyn2', 'he', 'has', 'that', 'effect', 'on', 'everyone'], ['@ProductOfFear', 'You', 'can', 'tell', 'him', 'that', 'I', 'just', 'burst', 'out', 'laughing', 'really', 'loud', 'because', 'of', 'that', ' ', 'Thanks', 'for', 'making', 'me', 'come', 'out', 'of', 'my', 'sulk', '!'], ['@r_keith_hill', 'Thans', 'for', 'your', 'response', '.', 'Ihad', 'alrea

In [None]:
type(model.wv)

gensim.models.keyedvectors.KeyedVectors

In [None]:
# print(model.wv[0])
print(len(model.wv[0]))
print(len(model.wv))

300
9181


In [None]:
# Task 9 construct a vector representation of each tweet as the average of all the word vectors in the tweet
# print(model.wv['and'])

def calculate_avg_vector(model, tweet):
  # average the vectors of all in-vocabulary tokens; tweets with none stay all-zero
  dim = model.wv.vector_size
  avg_vec = [0] * dim
  count = 0
  for token in tweet:
    if token.text in model.wv:
      count += 1
      word_vec = model.wv[token.text]
      for i in range(dim):
        avg_vec[i] += word_vec[i]
  if count != 0:
    avg_vec = [val / count for val in avg_vec]
  return avg_vec

avg_vector_for_tweets = []
for doc in xTrain:
  tweet = nlp(doc)
  avg_vector_for_tweets.append(calculate_avg_vector(model, tweet))

print(avg_vector_for_tweets[:5])

# average of all the vectors
# avg_vector = []
# for doc in enumerate(model.wv):
#   for i, vec in enumerate(doc):
#     print(vec)


[[0.21345916990604666, -0.29688572076459724, -0.11851292848587036, 0.06078516816099485, -0.033093338211377464, -0.3612809487515026, 0.28622552586926353, 0.386436328291893, 0.46231290449698764, -0.10635833457733194, 0.056550520989629954, 0.3559816881186432, 0.042577436504264675, -0.20202167259736192, -0.15347496466711164, -0.2249320927593443, 0.2961640678760078, 0.09656142774555418, -0.028436112424565688, -0.030737489151457947, -0.07081053819921282, -0.06531669509907563, 0.18975050416257647, -0.009589225648798876, 0.17231782298121187, -0.09881254637406932, 0.07259287022882038, -0.3360101522670852, -0.08764030701584286, -0.31958896294236183, 0.4405222942845689, -0.21955162139299014, 0.2086437162425783, 0.024594773434930377, 0.011231827239195505, 0.18139183355702293, -0.019802861743503146, -0.2234285059902403, 0.37710203064812553, -0.2567380584983362, -0.3331081908610132, -0.2256621519724528, 0.05789100668496556, 0.010160276459323036, 0.2748974338173866, 0.20490436752637228, 0.28443043844

In [None]:
# build averaged word2vec vectors for the test tweets,
# using the model that was trained on the training tweets

print(model)
test_vectors = []
for doc in xTest:
  tweet = nlp(doc)
  test_vectors.append(calculate_avg_vector(model, tweet))

Word2Vec<vocab=9181, vector_size=300, alpha=0.025>


In [None]:
# Task 10 Train logistic regression model on average tweet vectors

logReg = trainLogisticRegression(avg_vector_for_tweets, yTrain)
print("logistic regression model score is: " + str(logReg.score(test_vectors, yTest)))

logistic regression model score is: 0.6406685236768802


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


**Question:** Does dense feature representation improve the accuracy of
your logistic regression classifier?

**Your answer here:**
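Dense feature representation doesn't improve the accuracy of the logistic regression classifier in this case. Logistic regression on the averaged word2vec vectors reaches 64.07% on the test set, compared with 76.60% (counts) and 76.88% (binary counts) on the sparse bag-of-words features, a drop of more than 12 percentage points. The 4000-feature binary Multinomial Naive Bayes model (77.16%) is about 13 percentage points higher. The solver also stopped at its iteration limit, so the dense result might improve somewhat with a larger max_iter or scaled features.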

In [None]:
# Your results should look similar to this: ("the Accuracy of the trained classifier is: ..." below)

**Part 2: PCA Analysis of Shakespeare's Plays.**

The file: will_play_text.csv contains lines from William Shakespeare’s plays. The second column of the file contains the name of
the play, while the fifth and the sixth contain the name of the character who spoke and what they spoke, respectively. Tokenize
and lower case each line in will_play_text.csv using spacy. The file vocab.txt lists the words in the vocabulary. play_genres.csv stores the genres of Shakespeare's plays.

**Task 11 & 12 & 13:**

11. Create a term-document matrix where each row represents a word in the vocabulary and each column represents
a play. Each entry in this matrix represents the number of times a particular word (defined by the row) occurs in a
particular play (defined by the column). Use CountVectorizer in sklearn to create the matrix, using the file vocab.txt as
input for the vocabulary parameter. From your term-document matrix, use PCA in sklearn to create a 2-dimensional
representation of each play. Visualize these representations to see which plays are most similar to each other. Include the
visualization in your answer sheet. You can follow the tutorial [here](https://machinelearningmastery.com/develop-word-embeddings-python-gensim/) to create the visualization (look at the "PCA" part).

12. What plays are similar to each other? Do they match the grouping of Shakespeare’s plays into comedies, histories,
and tragedies here?

  **Your answer here:**

13. Create another term-document matrix where each row represents a word in the vocabulary and each column
represents a play, but with TF-IDF weights (using TfidfVectorizer in sklearn and vocab.txt for vocabulary). Use PCA
again on this TF-IDF term-document matrix and visualize the plays. Include the visualization in your answer sheet.

**Hints:** the PCA function in sklearn doesn't work for sparse inputs, try 'TruncatedSVD' instead. 
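
A minimal sketch of tasks 11 and 13 on a miniature, invented dataset is shown below. The column names `play` and `text`, the five sample lines, and the six-word vocabulary are assumptions made for the example; the real inputs are will_play_text.csv and vocab.txt. Swapping `CountVectorizer` for `TfidfVectorizer` with the same `vocabulary` argument would give the TF-IDF variant.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical miniature of will_play_text.csv: one row per spoken line
lines = pd.DataFrame({
    "play": ["Hamlet", "Hamlet", "Macbeth", "Macbeth", "Twelfth Night"],
    "text": ["to be or not to be", "the king is dead",
             "the king shall sleep no more", "blood will have blood",
             "if music be the food of love play on"],
})
vocab = ["king", "blood", "love", "music", "dead", "sleep"]  # stands in for vocab.txt

# Concatenate each play's lines into one document
play_docs = lines.groupby("play")["text"].apply(" ".join)

vectorizer = CountVectorizer(vocabulary=vocab)
doc_term = vectorizer.fit_transform(play_docs)   # plays x words (sparse)
term_doc = doc_term.T                            # words x plays, as in task 11

# Reduce each play (a row of doc_term) to 2 dimensions
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(doc_term)
print(dict(zip(play_docs.index, coords.round(2).tolist())))
```

The two columns of `coords` can then be passed to a scatter plot, with each point labelled by its play name.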



In [None]:
# Your visualization should look similar to this:

**Tasks 14 & 15:**
14. Create a word-word matrix where each row (and each column) represents a word in the vocabulary (vocab.txt).
Each entry in this matrix represents the number of times a particular word (defined by the row) co-occurs with another
word (defined by the column) in a sentence (i.e., line in will_play_text.csv). Using the row word vectors, create a representation
of a play as the average of all the word vectors in the play. Use these vector representations of plays to compute
average pairwise cosine-similarity between plays that are comedies (do not include self-similarities). You can use the
grouping of plays in here.

15. Using vector representations of plays computed in task 14, compute average pairwise cosine-similarity between
plays that are histories, and between plays that are tragedies (do not include self-similarities).

Hint: 
[How to calculate a word-word-co-occurence-matrix with sklearn](https://stackoverflow.com/questions/35562789/how-do-i-calculate-a-word-word-co-occurrence-matrix-with-sklearn).
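
The sketch below illustrates both halves of task 14 on invented data: building a word-word co-occurrence matrix from a line-by-word matrix, and averaging cosine similarity over distinct pairs of vectors. Using `binary=True` (so a word pair is counted once per line) and zeroing the diagonal are assumptions of this sketch, and `avg_pairwise_cosine` is a hypothetical helper name.

```python
from itertools import combinations

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical lines and vocabulary (stand-ins for will_play_text.csv and vocab.txt)
lines = ["the king is dead", "long live the king", "love is blind"]
vocab = ["king", "dead", "love", "blind", "live"]

# Binary line-by-word matrix, so each word pair is counted once per line
X = CountVectorizer(vocabulary=vocab, binary=True).fit_transform(lines)
cooc = (X.T @ X).toarray()      # word x word co-occurrence counts
np.fill_diagonal(cooc, 0)       # drop self co-occurrences

def avg_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs (no self-similarities)."""
    sims = [np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
            for a, b in combinations(vectors, 2)]
    return float(np.mean(sims))

print(cooc)

# Hypothetical 2-D "play vectors", just to exercise the helper
toy_plays = [np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([0.0, 1.0])]
toy_avg = avg_pairwise_cosine(toy_plays)
print(round(toy_avg, 4))   # 0.4714
```

The helper assumes every vector has a non-zero norm; a play whose averaged vector is all zeros would need to be filtered out first.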

In [None]:
# Your results should look similar to this:

**Task 16:**

16. Use gensim to learn a 100-dimensional word2vec representation of the words in the play (you can use default
parameters but with min count=1 so you can learn vector representations of all the words in your data i.e., no need to
use vocab.txt in this question). Use the learned word2vec representation to construct vector representations of plays as
the average of all the word vectors in the play. Visualize these representations to see which plays are most similar to each other.

**Hint:** From now on, since the inputs are no longer sparse, use the PCA function instead of 'TruncatedSVD'.
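
As a rough sketch of the projection step in task 16, the example below applies sklearn's `PCA` to dense play vectors. The vectors are random stand-ins (not real word2vec averages), and the four play names are chosen only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 100-dimensional play vectors (random stand-ins for averaged word2vec vectors)
rng = np.random.default_rng(0)
play_names = ["Hamlet", "Macbeth", "Othello", "Twelfth Night"]
play_vectors = rng.normal(size=(len(play_names), 100))

# Project each play onto the first two principal components
coords = PCA(n_components=2).fit_transform(play_vectors)
for name, (x, y) in zip(play_names, coords):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

Unlike TruncatedSVD, PCA centers the data before projecting, so the resulting coordinates have zero mean along each component.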

In [None]:
# Your results should look similar to this:

**Task 17:**

17. Construct the vector representation of each character as the average
of the representations of all lines that the character spoke (with the gensim-trained representation). Visualize the characters using PCA.
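
One way to average line vectors per character is a pandas `groupby`, sketched below. The three line vectors and the speaker names are made up; in the assignment each line vector would itself be an average of word2vec vectors.

```python
import numpy as np
import pandas as pd

# Hypothetical per-line vectors (in practice, averaged word2vec vectors of each line)
line_vectors = np.array([[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]])
speakers = pd.Series(["HAMLET", "HAMLET", "OPHELIA"], name="character")

# Average the line vectors belonging to each character
char_vectors = pd.DataFrame(line_vectors).groupby(speakers).mean()
print(char_vectors)
```

Each row of `char_vectors` is one character, ready to be passed to PCA for the 2-D visualization.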

In [None]:
# Your results should look similar to this (figure below):

**Task 18:**

18. Can you find the plays that are central to each genre, i.e., closest to the genre's centroid? You could do so by visualizing the play representations with PCA.
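
Besides reading the answer off a plot, the central play of each genre can be computed directly. The sketch below assumes Euclidean distance to the genre mean as the notion of "closest to centroid"; the play names, genre labels, and 2-D vectors are invented for the example.

```python
import numpy as np

# Hypothetical 2-D play vectors and genre labels
play_names = ["A", "B", "C", "D", "E", "F"]
genres = np.array(["comedy", "comedy", "comedy", "tragedy", "tragedy", "tragedy"])
vectors = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 0.3],
                    [5.0, 5.0], [6.0, 5.0], [5.4, 5.2]])

central = {}
for genre in np.unique(genres):
    idx = np.flatnonzero(genres == genre)          # rows belonging to this genre
    centroid = vectors[idx].mean(axis=0)           # genre centroid
    dists = np.linalg.norm(vectors[idx] - centroid, axis=1)
    central[str(genre)] = play_names[idx[np.argmin(dists)]]

print(central)  # {'comedy': 'C', 'tragedy': 'F'}
```

Cosine distance would be an equally reasonable choice and may rank plays differently.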

In [None]:
# Your results should look similar to this (here's an example of genre "histories")