# STA 208: Homework 2 (Do not distribute)

## Due 05/12/2023 midnight (11:59pm)

__Instructions:__ 

1. Submit your homework as a single file named "LastName_FirstName_hw2.html" on Canvas. 
2. The written portions can be either done in markdown and TeX in new cells or written by hand and scanned. Using TeX is strongly preferred. However, if you have scanned solutions for handwriting, you can submit a zip file. Please make sure your handwriting is clear and readable and your scanned files are displayed properly in your jupyter notebook. 
3. Your code should be readable; writing a piece of code should be compared to writing a page of a book. Adopt the one-statement-per-line rule. Consider splitting a lengthy statement into multiple lines to improve readability. (You will lose one point for each line that does not follow the one-statement-per-line rule.)
4. To help understand and maintain code, you should always add comments to explain your code. (homework with no comments will receive 0 points). For a very long comment, please break it into multiple lines.
5. In your Jupyter Notebook, put your answers in new cells after each exercise. You can make as many new cells as you like. Use code cells for code and Markdown cells for text.
6. Please make sure to print out the necessary results to avoid losing points. We should not run your code to figure out your answers. 
7. However, also make sure we can open this notebook and run everything by executing the cells in sequence, in case the TA wants to check the details.
8. You will be graded on the correctness of your math, code efficiency and succinctness, and your conclusions and modelling decisions.


### Exercise 1 (Logistic regression)

(15 points) In class, we studied the logit model with 2 classes. Now consider the multilogit model with $K$ classes. Let $\beta$ be the $(p+1)(K-1)$-vector consisting of all the coefficients. Define a suitably enlarged version of the input vector $x$ to accommodate this vectorized coefficient matrix. Derive the Newton-Raphson algorithm for maximizing the multinomial log-likelihood, and describe how you would implement the algorithm (e.g., you can write pseudocode). 

$$
\begin{aligned}
y &\in \{0, 1\} \\
P(Y = y \mid x) &= P(Y = 1 \mid x)^{y} \, P(Y = 0 \mid x)^{1 - y} \\
\log P(Y = y \mid x) &= y \log P(Y = 1 \mid x) + (1 - y) \log P(Y = 0 \mid x) \\
&= y \log \frac{P(Y = 1 \mid x)}{P(Y = 0 \mid x)} + \log P(Y = 0 \mid x)
\end{aligned}
$$

$$
\begin{aligned}
\log \frac{P(Y = 1 \mid x)}{P(Y = 0 \mid x)} &= \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p = x^T \beta \\
P(Y = 1 \mid x) &= \frac{e^{x^T \beta}}{1 + e^{x^T \beta}} \\
P(Y = 0 \mid x) &= \frac{1}{1 + e^{x^T \beta}}
\end{aligned}
$$

$$
\beta = \begin{bmatrix}
        \beta_0 \\
        \beta_1 \\
        \vdots \\
        \beta_{p}
        \end{bmatrix},
\qquad
x = \begin{bmatrix}
        1 \\
        x_1 \\
        x_2 \\
        \vdots \\
        x_{p}
        \end{bmatrix}
$$

$$
\begin{aligned}
\log P(Y = y \mid x) &= y \, x^T \beta + \log P(Y = 0 \mid x) \\
&= y \, x^T \beta - \log\left(1 + e^{x^T \beta}\right)
\end{aligned}
$$

For observations $(x_i, y_i)$, $i = 1, \ldots, n$:

$$
\sum_{i=1}^n \log P(y_i \mid x_i) = \sum_{i=1}^n \left[ y_i \, x_i^T \beta - \log\left(1 + e^{x_i^T \beta}\right) \right]
$$
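As a sanity check on the two-class log-likelihood in the last line of the derivation, the sum can be evaluated numerically. The snippet below is only a minimal sketch: the function name `binary_log_likelihood` and the toy data are hypothetical, and the design matrix is assumed to carry a leading column of ones for the intercept.

```python
import numpy as np


def binary_log_likelihood(X, y, beta):
    """Sum over i of y_i * x_i^T beta - log(1 + exp(x_i^T beta))."""
    # Linear predictor x_i^T beta for every observation.
    eta = X @ beta
    # np.logaddexp(0, eta) computes log(1 + exp(eta)) without overflow.
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))


# Hypothetical toy data: an intercept column of ones plus one feature.
X = np.array([[1.0, -2.0], [1.0, 0.5], [1.0, 3.0]])
y = np.array([0.0, 1.0, 1.0])

# With beta = 0 every class probability is 1/2, so the value is -n * log(2).
ll_at_zero = binary_log_likelihood(X, y, np.zeros(2))
print(ll_at_zero)
```

A function like this is what a Newton-Raphson routine would try to maximize; it can also be used to confirm that each iteration increases the objective.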

### Exercise 2 (Support vector machine)

_Natural language processing_ (NLP) is a branch of artificial intelligence that gives computers the ability to understand text and spoken words in much the same way human beings can.

In Python, text data can be converted into vector data through a vectorization operation.
Two vectorizer classes in scikit-learn are ``sklearn.feature_extraction.text.CountVectorizer`` and ``sklearn.feature_extraction.text.TfidfVectorizer``. A corpus is a collection of documents, and the dictionary is the set of all distinct words in the corpus. A simple vectorizer will let $X_{i,j}$ be the number of times the $j$th word appears in the $i$th document. 

The bag-of-words model is one of the most popular models in NLP. It treats each document as a collection of words with their counts, ignoring the order of those words. 
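To make the definition of $X_{i,j}$ concrete, here is a hand-rolled sketch of the count matrix for a tiny, hypothetical corpus. It uses only the standard library and plain whitespace tokenization, so it illustrates the idea rather than reproducing exactly what `CountVectorizer` does (which applies its own tokenization rules).

```python
from collections import Counter

# Hypothetical toy corpus of three already-cleaned documents.
corpus = [
    "stay safe stay calm",
    "food stock empty",
    "stay calm buy food",
]

# The dictionary: every distinct word in the corpus, in sorted order.
vocabulary = sorted({word for doc in corpus for word in doc.split()})

# X[i][j] = number of times word j occurs in document i.
X = []
for doc in corpus:
    counts = Counter(doc.split())
    X.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in X:
    print(row)
```

Word order is lost in `X`: any permutation of the words in a document produces the same row.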

In this exercise, you will learn how to classify a text using SVM. The dataset includes two CSV files (`Corona_NLP_train.csv` and `Corona_NLP_test.csv`) that contain IDs and sentiment scores of the tweets related to the COVID-19 pandemic. The real-time Twitter feed is monitored for coronavirus-related tweets using 90+ different keywords and hashtags that are commonly used while referencing the pandemic. The oldest tweets in this dataset date back to October 01, 2019. 


The training dataset contains six columns: 
- UserName
- ScreenName
- Location
- TweetAt
- OriginalTweet
- Sentiment (five labels: `extremely positive`, `positive`, `negative`, `extremely negative`, `neutral`)

The task is to predict sentiment based on the original tweet. Here, we combine `extremely positive` and `positive` into `positive`, and combine `extremely negative` and `negative` into `negative`, so the sentiment has three labels.
Your goal is to apply SVM to predict the three labels based on OriginalTweet. In other words, this is a classification problem with three labels. 

I have attached the file `dataprocessing.ipynb` for processing the data. The code is copied directly from this [website](https://www.kaggle.com/code/mehmetlaudatekman/text-classification-svm-explained/notebook).

Please answer the following questions:
 
1. (15 points) Use sklearn svm.SVC on the TRAIN split (`Corona_NLP_train.csv`) and predict on the TEST split (`Corona_NLP_test.csv`). Plot your ROC and PR (Precision-Recall) curves for predicting `positive` (versus everything else); use the linear kernel and set the C parameter to be 1. Do the same for predicting the `negative` label versus everything else. Please write the code for generating the ROC curve by yourself.
2. (10 points) In this problem, we have three labels (instead of two). In class, we only learned SVM for solving a two-class classification problem. Describe (in your own words) how the scikit-learn class `svm.SVC` fits SVM for multi-class classification.
3. (10 points) Choose several different values for $C$ (some are smaller than 1, some are bigger than 1), plot the ROC curves for predicting 'positive' (versus everything else), and predicting 'negative' (versus everything else). Comment on your findings. 
4. (Bonus 10 points) Explore how to use logistic regression to classify this text. Implement the method. Comment on its prediction accuracy and compare the ROC curve with the SVM ROC curve. 

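__Note:__ precision is the number of true positives divided by the sum of the true positives and false positives; recall is the number of true positives divided by the sum of the true positives and false negatives. The PR curve plots precision against recall as the decision threshold varies. It describes how good a model is at predicting the positive class.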
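The quantities behind the ROC and PR curves can be computed directly from a vector of scores by sweeping the decision threshold. The sketch below is one possible way to do it with NumPy, not a reference solution: the function name `roc_and_pr_points` and the toy labels and scores are hypothetical, and it assumes there are no tied scores. For the SVM, the scores could come from, for example, `model.decision_function` or one column of `model.predict_proba`.

```python
import numpy as np


def roc_and_pr_points(y_true, scores):
    """Return (fpr, tpr, precision, recall) as the threshold sweeps the scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    # Sort observations from highest to lowest score.
    order = np.argsort(-scores, kind="stable")
    y_sorted = y_true[order]
    # True/false positives when the k highest scores are predicted positive.
    tp = np.cumsum(y_sorted == 1)
    fp = np.cumsum(y_sorted == 0)
    recall = tp / tp[-1]
    precision = tp / (tp + fp)
    # Prepend the origin so the ROC curve starts at (0, 0).
    tpr = np.concatenate(([0.0], recall))
    fpr = np.concatenate(([0.0], fp / fp[-1]))
    return fpr, tpr, precision, recall


# Hypothetical one-vs-rest labels (1 = positive class) and classifier scores.
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]

fpr, tpr, precision, recall = roc_and_pr_points(y_true, scores)

# Area under the ROC curve by the trapezoidal rule.
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
print(auc)
```

Plotting `tpr` against `fpr` gives the ROC curve, and plotting `precision` against `recall` gives the PR curve. With tied scores, points should only be kept at distinct threshold values.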

In [82]:
import numpy as np
import pandas as pd 
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
import pickle
import time
import re
from IPython.display import display 

In [75]:
# Preprocessing 
""" 
# Loading in the data
trainSet =pd.read_csv("Corona_NLP_train.csv", encoding="latin1")
testSet = pd.read_csv("Corona_NLP_test.csv", encoding = "latin1")

unrelevant_features = ["UserName", "ScreenName", "Location", "TweetAt"]
trainSet.drop(unrelevant_features, inplace = True, axis = 1)
testSet.drop(unrelevant_features, inplace = True, axis = 1)
#display(trainSet.head())

trainSet.Sentiment = trainSet.Sentiment.replace("Extremely Positive", "Positive")
trainSet.Sentiment = trainSet.Sentiment.replace("Extremely Negative", "Negative")
#display(trainSet.head())

testSet.Sentiment = testSet.Sentiment.replace("Extremely Positive", "Positive")
testSet.Sentiment = testSet.Sentiment.replace("Extremely Negative", "Negative")

# Convert negatives as 0, neutrals as 1, positives as 2, 
mapping = {"Negative": 0, "Neutral": 1, "Positive":2}
trainSet.Sentiment = trainSet.Sentiment.replace(mapping)
testSet.Sentiment = testSet.Sentiment.replace(mapping)

data = pd.concat([trainSet, testSet])
display(data.info())
display(data.head()) """

train_set = pd.read_csv('Corona_NLP_train.csv',encoding="latin1") # do not forget to change the path
test_set = pd.read_csv('Corona_NLP_test.csv',encoding="latin1")

# remove unrelevant_features

unrelevant_features = ["UserName","ScreenName","Location","TweetAt"]
train_set.drop(unrelevant_features,inplace=True,axis=1)
test_set.drop(unrelevant_features,inplace=True,axis=1)
display(train_set.head())

# split data based on sentiment values: positive, neutral or negative.
# Extremely positive is combined with positive. Similar to extremely negative
display(train_set["Sentiment"].value_counts())

positives = train_set[(train_set["Sentiment"] == "Positive") | (train_set["Sentiment"] == "Extremely Positive")]
positives_test = test_set[(test_set["Sentiment"] == "Positive") | (test_set["Sentiment"] == "Extremely Positive")]
print(positives["Sentiment"].value_counts())
display(positives.head())

negatives = train_set[(train_set["Sentiment"] == "Negative") | (train_set["Sentiment"] == "Extremely Negative")]
negatives_test = test_set[(test_set["Sentiment"] == "Negative") | (test_set["Sentiment"] == "Extremely Negative")]
print(negatives["Sentiment"].value_counts())
display(negatives.head())

neutrals = train_set[train_set["Sentiment"] == "Neutral"]
neutrals_test = test_set[test_set["Sentiment"] == "Neutral"]
print(neutrals["Sentiment"].value_counts())
display(neutrals.head())

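# Convert labels into integers:
# negatives as 0, neutrals as 1, and positives as 2.
# `assign` returns a new DataFrame, so no SettingWithCopyWarning is raised
# and there is no need to silence warnings globally.

negatives = negatives.assign(Sentiment=0)
negatives_test = negatives_test.assign(Sentiment=0)

positives = positives.assign(Sentiment=2)
positives_test = positives_test.assign(Sentiment=2)

neutrals = neutrals.assign(Sentiment=1)
neutrals_test = neutrals_test.assign(Sentiment=1)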


# concatenate train and test first, will split them after processing.

data = pd.concat([positives,
                  positives_test,
                  neutrals,
                  neutrals_test,
                  negatives,
                  negatives_test
                 ],axis=0)

data.reset_index(inplace=True)

print(data.info())

print(data.head())

Unnamed: 0,OriginalTweet,Sentiment
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,advice Talk to your neighbours family to excha...,Positive
2,Coronavirus Australia: Woolworths to give elde...,Positive
3,My food stock is not the only one which is emp...,Positive
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative


Positive              11422
Negative               9917
Neutral                7713
Extremely Positive     6624
Extremely Negative     5481
Name: Sentiment, dtype: int64

Positive              11422
Extremely Positive     6624
Name: Sentiment, dtype: int64


Unnamed: 0,OriginalTweet,Sentiment
1,advice Talk to your neighbours family to excha...,Positive
2,Coronavirus Australia: Woolworths to give elde...,Positive
3,My food stock is not the only one which is emp...,Positive
5,As news of the regionÂs first confirmed COVID...,Positive
6,Cashier at grocery store was sharing his insig...,Positive


Negative              9917
Extremely Negative    5481
Name: Sentiment, dtype: int64


Unnamed: 0,OriginalTweet,Sentiment
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative
9,"For corona prevention,we should stop to buy th...",Negative
20,with 100 nations inficted with covid 19 th...,Extremely Negative
24,@10DowningStreet @grantshapps what is being do...,Negative
26,In preparation for higher demand and a potenti...,Negative


Neutral    7713
Name: Sentiment, dtype: int64


Unnamed: 0,OriginalTweet,Sentiment
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
7,Was at the supermarket today. Didn't buy toile...,Neutral
10,All month there hasn't been crowding in the su...,Neutral
16,????? ????? ????? ????? ??\r\r\n?????? ????? ?...,Neutral
17,@eyeonthearctic 16MAR20 Russia consumer survei...,Neutral


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44955 entries, 0 to 44954
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   index          44955 non-null  int64 
 1   OriginalTweet  44955 non-null  object
 2   Sentiment      44955 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.0+ MB
None
   index                                      OriginalTweet  Sentiment
0      1  advice Talk to your neighbours family to excha...          2
1      2  Coronavirus Australia: Woolworths to give elde...          2
2      3  My food stock is not the only one which is emp...          2
3      5  As news of the regionÂs first confirmed COVID...          2
4      6  Cashier at grocery store was sharing his insig...          2


In [76]:
#nltk.download('omw-1.4')

cleanedData = []

lemma = WordNetLemmatizer()
# A set makes the `word not in swords` membership test below O(1)
swords = set(stopwords.words("english"))
for text in data["OriginalTweet"]:
    
    # Cleaning links
    text = re.sub(r'http\S+', '', text)
    
    # Cleaning everything except alphabetical and numerical characters
    text = re.sub("[^a-zA-Z0-9]"," ",text)
    
    # Tokenizing and lemmatizing
    text = nltk.word_tokenize(text.lower())
    text = [lemma.lemmatize(word) for word in text]
    
    # Removing stopwords
    text = [word for word in text if word not in swords]
    
    # Joining
    text = " ".join(text)
    
    cleanedData.append(text)

In [77]:
# check the output text

for i in range(0,5):
    print(cleanedData[i],end="\n\n")

advice talk neighbour family exchange phone number create contact list phone number neighbour school employer chemist gp set online shopping account po adequate supply regular med order

coronavirus australia woolworth give elderly disabled dedicated shopping hour amid covid 19 outbreak

food stock one empty please panic enough food everyone take need stay calm stay safe covid19france covid 19 covid19 coronavirus confinement confinementotal confinementgeneral

news region first confirmed covid 19 case came sullivan county last week people flocked area store purchase cleaning supply hand sanitizer food toilet paper good tim dodson report

cashier grocery store wa sharing insight covid 19 prove credibility commented civics class know talking



In [78]:
# create the bag of words

vectorizer = CountVectorizer(max_features=5)
BOW = vectorizer.fit_transform(cleanedData)

In [80]:
# split the dataset into training and test

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(BOW,np.asarray(data["Sentiment"]))

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(33716, 5)
(11239, 5)
(33716,)
(11239,)


#### 1.) 

In [81]:
from sklearn.svm import SVC
start_time = time.time()

model = SVC(C = 1, kernel="linear", probability=True)
model.fit(x_train,y_train)



end_time = time.time()
process_time = round(end_time-start_time,2)
print("Fitting SVC took {} seconds".format(process_time))


Fitting SVC took 173.52 seconds


In [None]:
#model.predict(x_test)

# Generate data that is all positive



### Exercise 3 (K-means and PCA)

Load the poses.csv dataset, which is a concatenation of several datasets into one larger dataset. The task column in the dataset contains six poses: sitting, lying, walking, standing, cycling, bending. I want you to act as if the dataset came from a single experiment. Open the file and take a look at the dataset first. Combine bending1 and bending2 into a single bending class. 

1. (15 pts) Apply a lag-1 difference to the dataset, so that each variable is the difference between the current time point and the previous time point.  Standardize the dataset and remove any variables that do not make sense.  Run the PCA decomposition with 2 principal components.  Plot the 2 principal components.  Which variables have the largest loadings on the principal components (look at `.components_`)?

1. (15 pts) Also on the lag-1 differenced dataset, run K-means clustering (with 6 clusters). How much do the clusters overlap with the 'task' variable?  Look at the confusion matrix (`sklearn.metrics.confusion_matrix`) of the clusters against the 'task'.  Is there a clear mapping from clusters to tasks?
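The preprocessing steps named above (lag-1 differencing, standardization, and a 2-component PCA) can be sketched with NumPy alone. This is only an illustration under stated assumptions: the array `signals` is random stand-in data, not the contents of poses.csv, and PCA is done through the SVD, whose leading right singular vectors play the role of scikit-learn's `components_` (up to sign).

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the numeric sensor columns:
# 200 time points, 4 variables.
signals = rng.normal(size=(200, 4)).cumsum(axis=0)

# Lag-1 difference: row t becomes signals[t + 1] - signals[t].
diffed = np.diff(signals, axis=0)

# Standardize each column to mean 0 and standard deviation 1.
standardized = (diffed - diffed.mean(axis=0)) / diffed.std(axis=0)

# PCA via the SVD of the standardized matrix; the rows of vt are the loadings.
u, s, vt = np.linalg.svd(standardized, full_matrices=False)
components = vt[:2]

# Project the data onto the first two principal components.
pc_scores = standardized @ components.T

# The variable with the largest absolute loading on each component.
top_variable = np.abs(components).argmax(axis=1)
print(pc_scores.shape, top_variable)
```

Because the real file is a concatenation of several recordings, the differences would presumably need to be taken within each recording so that no difference spans the boundary between two files.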