# HW 6: Sentiment Analysis & Deep Learning

In this assignment, you'll need the following datasets:
- `hw6_train.csv`: dataset for training
- `hw6_test.csv`: dataset for testing

A snippet of the dataset is given below.



In [52]:
import pandas as pd

train = pd.read_csv("hw6_train.csv")
test = pd.read_csv("hw6_test.csv")

train.head()

(20000, 2)

## Q1: Unsupervised Sentiment Analysis (3 points)

- Write a function `analyze_sentiment(docs, labels, th)` as follows: (3 points)
    - Takes three inputs:
       - `docs`: a list of documents
       - `labels`: the ground-truth sentiment labels of `docs`
       - `th`: compound threshold
    - Use VADER to get a compound score for each document in `docs`.
    - If `compound score > th`, then the predicted label is 1; otherwise 0
    - Print out the classification report
    - Return F1 macro score


- Tune `th` such that the F1 macro score is maximized (1 point)
- With the tuned `th`, calculate the performance on the test dataset.
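
One way to tune `th` is a simple grid search over candidate thresholds. The sketch below assumes the VADER compound scores have already been computed (for example with `sid.polarity_scores(doc)['compound']`) and uses toy values in their place. `macro_f1` and `tune_threshold` are hypothetical helper names, and the pure-Python F1 is meant to agree with `f1_score(..., average='macro')` for binary labels where both classes occur. Tuning would normally be done on the training set so that the test set stays held out.

```python
def macro_f1(labels, preds):
    """Macro-averaged F1 over classes 0 and 1 (unweighted mean of per-class F1)."""
    f1s = []
    for cls in (0, 1):
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


def tune_threshold(scores, labels, candidates):
    """Return (best_th, best_f1) over the candidate thresholds (first best wins ties)."""
    best_th, best_f1 = None, -1.0
    for th in candidates:
        # Same decision rule as analyze_sentiment: positive if score > th
        preds = [1 if s > th else 0 for s in scores]
        f1 = macro_f1(labels, preds)
        if f1 > best_f1:
            best_th, best_f1 = th, f1
    return best_th, best_f1


# Toy compound scores in [-1, 1] with their ground-truth labels (illustrative only)
toy_scores = [-0.8, -0.1, 0.3, 0.1, 0.6, 0.9]
toy_labels = [0, 0, 0, 1, 1, 1]
candidates = [i / 10 for i in range(-9, 10)]  # -0.9, -0.8, ..., 0.9

best_th, best_f1 = tune_threshold(toy_scores, toy_labels, candidates)
print("best th:", best_th, "best macro F1:", round(best_f1, 4))
```

With real data, `toy_scores` would be replaced by the compound scores of `train["text"]`, and the selected `th` would then be passed to `analyze_sentiment` on the test set.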

In [29]:
from sklearn.metrics import classification_report, f1_score
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def analyze_sentiment(docs, labels, th=0):
    sid = SentimentIntensityAnalyzer()
    
    # Get compound scores for each document
    compound_scores = [sid.polarity_scores(doc)['compound'] for doc in docs]
    
    # Predict sentiment labels based on the threshold
    predicted_labels = [1 if score > th else 0 for score in compound_scores]
    
    # Print classification report
    print("Classification Report:\n", classification_report(labels, predicted_labels))
    
    # Calculate F1 macro score
    f1 = f1_score(labels, predicted_labels, average='macro')
    print("F1 Macro Score:", f1)
    
    return f1

In [30]:
analyze_sentiment(test["text"], test["label"], 0.2)

Classification Report:
               precision    recall  f1-score   support

           0       0.64      0.72      0.68      9968
           1       0.68      0.59      0.63     10032

    accuracy                           0.66     20000
   macro avg       0.66      0.66      0.66     20000
weighted avg       0.66      0.66      0.66     20000

F1 Macro Score: 0.6561832243120658


0.6561832243120658

## Q2: Supervised Sentiment Analysis Using Word Vectors (7 points)

### Q2.1: Train Word Vectors

Write a function `train_wordvec(docs, vector_size)` as follows:
- Take two inputs:
    - `docs`: a list of documents
    - `vector_size`: the dimension of word vectors
- First tokenize `docs` into tokens
- Use `gensim` package to train word vectors. Set the `vector size` and also carefully set other parameters such as `window`, `min_count` etc.
- return the trained word vector model

In [31]:
def train_wordvec(docs, vector_size = 100):
    
     # add your code
    
    return wv_model

In [32]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

def train_wordvec(docs, vector_size):
    # Tokenize the documents
    tokenized_docs = [word_tokenize(doc.lower()) for doc in docs]

    # Set parameters for Word2Vec model
    window_size = 5  # Maximum distance between the current and predicted word within a sentence
    min_word_count = 1  # Ignores all words with a total frequency lower than this
    workers = 4  # Number of CPU cores to use while training the model

    # Train Word2Vec model
    model = Word2Vec(sentences=tokenized_docs, vector_size=vector_size, window=window_size,
                     min_count=min_word_count, workers=workers)

    return model

In [47]:
wv_model = train_wordvec(train["text"], vector_size=100)
print(wv_model)

Word2Vec<vocab=71141, vector_size=100, alpha=0.025>


In [49]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Tokenize train and test documents
tokenized_train_docs = [word_tokenize(doc.lower()) for doc in train["text"]]
tokenized_test_docs = [word_tokenize(doc.lower()) for doc in test["text"]]

# Vectorize documents using TFIDF vectorizer
tfidf_vectorizer = TfidfVectorizer(min_df=5)
tfidf_train_matrix = tfidf_vectorizer.fit_transform([' '.join(doc) for doc in tokenized_train_docs])
tfidf_test_matrix = tfidf_vectorizer.transform([' '.join(doc) for doc in tokenized_test_docs])

In [57]:
tfidf_test_matrix[:100].shape

(100, 7808)

### Q2.2: Generate Vector Representation for Documents 

Write a function `generate_doc_vector(train_docs, test_docs, wv_model, wv_dim= 100, stop_words = None, min_df = 1, topK = None)` as follows:
- Take the following inputs:
    - `train_docs`: a list of train documents
    - `test_docs`: a list of test documents
    - `wv_model`: trained word vector model
    - `wv_dim`: dimensionality of word vector. Set the default value to 100.
    - `stop_words`: whether to remove stopwords
    - `min_df`: minimum document frequency
    - `topK`: number of top-weighted tokens to average over, or None to use all tokens
- First vectorize each document using TFIDF vectorizer by considering the `stop_words` and `min_df` configurations.
- For each token in the vocabulary, look up its word vector in `wv_model`.
- Then calculate the document vector (denoted as `d`) of `doc` by the following methods:
    - If `topK` is None, `d` is the TFIDF-weighted sum of the word vectors of its tokens, normalized by the total weight, i.e. $d = \frac{1}{\sum_{i \in doc}{tfidf_i}} \sum_{i \in doc}{tfidf_i \cdot v_i}$, where $v_i$ is the word vector of the i-th token and $tfidf_i$ is the TFIDF weight of this token.
    - Otherwise, `d` is the average of the word vectors of the tokens in `doc` with the `topK` highest TFIDF weights, i.e. $d = \frac{1}{K} \sum_{i \in topK(doc)}{v_i}$, where $K$ is the value of `topK` and $topK(doc)$ is the set of those tokens.
- Return the vector representations of all `train_docs` as a numpy array of shape `(n, vector_size)`, where `n` is the number of documents in `train_docs` and `vector_size` is the dimension of word vectors. Create similar representations for `test_docs`.


Note: It may not be a good idea to represent a document as the weighted sum of its word vectors. For example, if one word is positive and another is negative, their sum may yield a vector that is no longer sensitive to sentiment. You'll learn more advanced methods for generating document vectors in deep learning courses.
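
The two formulas above can be checked by hand on a tiny example. The sketch below uses plain Python lists with made-up TFIDF weights and 2-dimensional word vectors; `weighted_doc_vector` and `topk_doc_vector` are hypothetical helper names, and the zero-vector fallback for an all-zero document is an assumption, not part of the assignment.

```python
def weighted_doc_vector(weights, vectors):
    """TFIDF-weighted sum of word vectors, normalized by the total weight."""
    total = sum(weights)
    dim = len(vectors[0])
    if total == 0:
        # Assumption: fall back to a zero vector when no token carries weight
        return [0.0] * dim
    return [sum(w * v[j] for w, v in zip(weights, vectors)) / total
            for j in range(dim)]


def topk_doc_vector(weights, vectors, k):
    """Plain average of the word vectors of the k tokens with the largest weights."""
    top = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    dim = len(vectors[0])
    return [sum(vectors[i][j] for i in top) / len(top) for j in range(dim)]


# Made-up TFIDF weights and word vectors for a three-token document
toy_tfidf = [0.5, 0.3, 0.2]
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

d_weighted = weighted_doc_vector(toy_tfidf, toy_vectors)  # approximately [0.7, 0.5]
d_top2 = topk_doc_vector(toy_tfidf, toy_vectors, k=2)     # [0.5, 0.5]
print(d_weighted, d_top2)
```

With numpy arrays, the weighted case reduces to a single matrix product of the weight vector with the word-vector matrix, divided by the sum of the weights.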

In [56]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

def generate_doc_vector(train_docs, test_docs, wv_model, wv_dim=100, stop_words=None, min_df=1, topK=None):
    # Tokenize train and test documents
    tokenized_train_docs = [word_tokenize(doc.lower()) for doc in train_docs]
    tokenized_test_docs = [word_tokenize(doc.lower()) for doc in test_docs]

    # Configure stop words (TfidfVectorizer expects a list, 'english', or None)
    if stop_words is not None:
        stop_words = stopwords.words('english')

    # Vectorize documents using TFIDF vectorizer
    tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words, min_df=min_df)
    tfidf_train_matrix = tfidf_vectorizer.fit_transform([' '.join(doc) for doc in tokenized_train_docs])
    tfidf_test_matrix = tfidf_vectorizer.transform([' '.join(doc) for doc in tokenized_test_docs])

    # Get the word vector for each token in the vocabulary (zero vector for unknown words)
    word_vectors = np.array([
        wv_model.wv[token] if token in wv_model.wv else np.zeros(wv_dim)
        for token in tfidf_vectorizer.get_feature_names_out()
    ])

    def to_doc_vectors(tfidf_matrix):
        doc_vectors = []
        for i in range(tfidf_matrix.shape[0]):
            # Dense 1-D array of TFIDF weights for document i
            tfidf_weights = tfidf_matrix[i].toarray().ravel()

            if topK is None:
                # TFIDF-weighted sum of word vectors, normalized by the total weight
                total = tfidf_weights.sum()
                doc_vector = tfidf_weights @ word_vectors / total if total > 0 else np.zeros(wv_dim)
            else:
                # Average the vectors of the topK highest-weighted tokens present in the document
                # (fewer than topK if the document has fewer in-vocabulary tokens)
                present = np.flatnonzero(tfidf_weights)
                top_indices = present[np.argsort(tfidf_weights[present])[-topK:]]
                doc_vector = word_vectors[top_indices].mean(axis=0) if len(top_indices) else np.zeros(wv_dim)

            doc_vectors.append(doc_vector)
        return np.array(doc_vectors)

    return to_doc_vectors(tfidf_train_matrix), to_doc_vectors(tfidf_test_matrix)

In [37]:
def generate_doc_vector(train_docs, test_docs, wv_model, wv_dim= 100,
                        stop_words = None, min_df = 1, topK=None):
    
    # add your code
    return train_vec, test_vec

In [58]:
train_X, test_X = generate_doc_vector(train["text"], test["text"], 
                                      wv_model, wv_dim= 100,
                                      stop_words = None, min_df = 5)



### Q2.3: Put everything together


Define a function `predict_sentiment(train_text, train_label, test_text, test_label, wv_model, wv_dim= 100, stop_words = None, min_df = 1, topK = None)` as follows:

- Take the following inputs:
    - `train_text, train_label`: a list of documents and their labels for training
    - `test_text, test_label`: a list of documents and their labels for testing
    - `wv_model`: trained word vector model
    - `wv_dim`: dimensionality of word vector. Set the default value to 100.
    - `stop_words`: whether to remove stopwords
    - `min_df`: minimum document frequency
    - `topK`: passed through to `generate_doc_vector`
- Call `generate_doc_vector` to generate vector representations (denoted as `train_X` and `test_X`) for documents in `train_text` and `test_text`. 
- Fit a linear SVM model using `train_X` and `train_label`
- Predict the label for `test_X` and print out classification report for the testing subset.
- This function does not return anything.
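
The fit-and-report step can be sketched in isolation as follows. This is a minimal illustration, not the assignment solution: `fit_and_report` is a hypothetical helper, the synthetic 2-D clusters stand in for the `train_X` and `test_X` that `generate_doc_vector` would produce, and the helper returns its predictions only so they can be inspected (the assignment's `predict_sentiment` returns nothing).

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report


def fit_and_report(train_X, train_label, test_X, test_label):
    """Fit a linear SVM on train_X and print a classification report for test_X."""
    clf = LinearSVC(random_state=0)
    clf.fit(train_X, train_label)
    predicted = clf.predict(test_X)
    print(classification_report(test_label, predicted))
    return predicted


# Synthetic stand-in for document vectors: two well-separated 2-D clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(50, 2)),
               rng.normal(2.0, 0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Shuffle, then hold out the last 20 rows as a test split
order = rng.permutation(100)
X, y = X[order], y[order]
predicted = fit_and_report(X[:80], y[:80], X[80:], y[80:])
```

In `predict_sentiment`, the same two steps would follow a call to `generate_doc_vector(train_text, test_text, wv_model, ...)`.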

### Q2.4: Analysis 

- Compare the classification reports you obtain from Q1 and Q2.3. Which model performs better?
- Why can this model achieve better performance?

In [12]:
def predict_sentiment(train_text, train_label, 
                      test_text, test_label, 
                      wv_model, wv_dim= 100,
                      stop_words = None, min_df = 1, topK = None):
    
    # Add your code

In [13]:
predict_sentiment(train["text"], train["label"],\
                  test["text"], test["label"],\
                  wv_model, wv_dim= 100,
                  stop_words = None, min_df = 5, topK = None)


predict_sentiment(train["text"], train["label"],\
                  test["text"], test["label"],\
                  wv_model, wv_dim= 100,
                  stop_words = None, min_df = 5, topK = 10)
    




In [None]:
if __name__ == "__main__":  