# Homework 2

## Student: Minling Zhou
## NetID: mz246

We are interested in sentiment analysis. Given a short document, we wish to assess whether the corresponding sentiment is positive (label $y = 1$) or negative (label $y = 0$).
We will do this with the following prescription:
1. For every word, we will learn a corresponding $d$-dimensional vector, i.e., $x_i \in \mathbb{R}^d$ for word $i$ in the vocabulary. If we have a total of $N$ words in the vocabulary, we learn $x_i$ for $i = 1, \ldots, N$.
2. Assume that there are $𝑀_j$ words in document 𝑗. The feature vector for this document is
\begin{equation}
f_j = \frac{1}{M_j}\sum_{m=1}^{M_j}x_{w(m,j)}
\end{equation}
where $x_{w(m,j)}$ is the vector associated with the 𝑚th word in document 𝑗. 
3. The probability of positive sentiment for document 𝑗 is modeled as
\begin{equation}
p(Y=1|f_j) = \sigma[w \cdot f_j + b]
\end{equation}

The unknown parameters, which must be learned from data, are the vectors $x_i$ for each of the $N$ words in the vocabulary, as well as the weight vector $w \in \mathbb{R}^d$ and the bias $b$.

Show the detailed derivation of the cost function for this formulation, and provide a detailed derivation of the gradients to be used in gradient descent. In your writeup, show all details of the implementation.

Assume that the learning will be performed with 𝐾 documents, and 𝐾 is very large. Show in detail how you will scale the model to handle 𝐾 that is too large for all documents to be handled at once on a computer.

Write code in Python to implement this model, and use Adam as the optimizer for gradient descent. Use the following dataset: https://huggingface.co/datasets/yelp_polarity

In your solution, provide the code, and also provide a detailed analysis on the accuracy of the predictions on test data (complete report of results). Explain carefully how you constituted the training, validation and test datasets.

## Solution
### Cost Function
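The model is a logistic regression on the document feature $f_i$: the sigmoid maps the score $w \cdot f_i + b$ to a probability. Treating each label $y_i$ as a Bernoulli variable with that probability, the likelihood of one document is $p^{y_i}(1-p)^{1-y_i}$. Taking the negative logarithm and averaging over documents gives the cost function, also known as the log loss:
\begin{equation}
J(w,b) = -\frac{1}{K}\sum_{i=1}^{K}\left[y_i\log p(Y=1|f_i) + (1-y_i)\log\left(1-p(Y=1|f_i)\right)\right]
\end{equation}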
where $K$ is the number of documents, $y_i$ is the true label of document $i$, and $p(Y=1|f_i)$ is the predicted probability of document $i$ being positive. The predicted probability is given by the sigmoid function:
\begin{equation}
p(Y=1|f_i) = \sigma[w \cdot f_i + b]
\end{equation}
where $\sigma(z) = 1/(1+e^{-z})$ is the sigmoid function, $w$ is the weight vector, and $b$ is the bias.
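As a quick check of the formula, the log loss can be sketched in NumPy. The function names, the toy feature matrix `F`, and the clipping constant `eps` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(F, y, w, b, eps=1e-12):
    """Mean binary cross-entropy. F has shape (K, d), y has shape (K,)."""
    p = sigmoid(F @ w + b)
    # Clip so that log(0) cannot occur (a numerical safeguard, assumed here)
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


# Toy data: three documents with 2-dimensional features (made up for illustration)
F = np.array([[0.5, -1.0], [1.5, 2.0], [-0.3, 0.8]])
y = np.array([0.0, 1.0, 1.0])

loss_at_zero = log_loss(F, y, w=np.zeros(2), b=0.0)
print(loss_at_zero)
```

With all parameters at zero every prediction is 0.5, so the loss equals $\log 2 \approx 0.693$ regardless of the labels. This is a useful baseline when reading training curves.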

### Derivation of the Gradients
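The gradients of the cost function with respect to the weight vector and bias are used to update the parameters during training. Write $z_i = w \cdot f_i + b$ and $p_i = \sigma(z_i) = p(Y=1|f_i)$. Since $\sigma'(z) = \sigma(z)(1-\sigma(z))$, differentiating the cost with respect to $z_i$ gives $\partial J/\partial z_i = (p_i - y_i)/K$. Applying the chain rule with $\partial z_i/\partial w = f_i$ and $\partial z_i/\partial b = 1$ yields: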
\begin{equation}
\frac{\partial J(w,b)}{\partial w} = \frac{1}{K}\sum_{i=1}^{K}(p(Y=1|f_i)-y_i)f_i
\end{equation}
and
\begin{equation}
\frac{\partial J(w,b)}{\partial b} = \frac{1}{K}\sum_{i=1}^{K}(p(Y=1|f_i)-y_i)
\end{equation}
where $K$ is the number of documents, $y_i$ is the true label of document $i$, and $p(Y=1|f_i)$ is the predicted probability of document $i$ being positive.
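The problem statement also treats the word vectors $x_n$ as trainable. The derivation above does not cover that case, but the same chain rule suggests $\partial J/\partial x_n = \frac{1}{K}\sum_{i}(p_i - y_i)\frac{c_{n,i}}{M_i}w$, where $c_{n,i}$ is the number of times word $n$ appears in document $i$. The sketch below checks all three gradients against central finite differences. It assumes documents are given as lists of word indices, and every name and toy value in it is hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def features(X, docs):
    """Average the word vectors of each document; docs holds word indices."""
    return np.stack([X[doc].mean(axis=0) for doc in docs])


def loss(X, w, b, docs, y):
    p = sigmoid(features(X, docs) @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


def gradients(X, w, b, docs, y):
    F = features(X, docs)
    err = sigmoid(F @ w + b) - y  # p_i - y_i, shape (K,)
    K = len(docs)
    grad_w = F.T @ err / K
    grad_b = err.mean()
    grad_X = np.zeros_like(X)
    for doc, e in zip(docs, err):
        # Each occurrence of a word contributes e * w / (K * M_i)
        np.add.at(grad_X, doc, e * w / (K * len(doc)))
    return grad_w, grad_b, grad_X


def numeric_grad(f, theta, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        step = np.zeros_like(theta)
        step[idx] = h
        g[idx] = (f(theta + step) - f(theta - step)) / (2 * h)
    return g


# Toy problem: 5-word vocabulary, 3-dimensional vectors, 3 documents
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
w = rng.normal(size=3)
b = 0.1
docs = [[0, 1, 1], [2, 4], [3, 3, 0, 2]]
y = np.array([1.0, 0.0, 1.0])

grad_w, grad_b, grad_X = gradients(X, w, b, docs, y)
num_w = numeric_grad(lambda v: loss(X, v, b, docs, y), w)
num_X = numeric_grad(lambda V: loss(V, w, b, docs, y), X)
h = 1e-6
num_b = (loss(X, w, b + h, docs, y) - loss(X, w, b - h, docs, y)) / (2 * h)

max_err_w = np.abs(grad_w - num_w).max()
max_err_b = abs(grad_b - num_b)
max_err_X = np.abs(grad_X - num_X).max()
print(max_err_w, max_err_b, max_err_X)
```

If the analytic expressions are right, the three printed discrepancies should be many orders of magnitude smaller than the gradients themselves.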

### Gradient Descent Update Rule
The parameters are updated using the gradients and the learning rate. The update rule for the weight vector and bias is given by:
\begin{equation}
w = w - \alpha\frac{\partial J(w,b)}{\partial w}
\end{equation}
and
\begin{equation}
b = b - \alpha\frac{\partial J(w,b)}{\partial b}
\end{equation}
where $\alpha$ is the learning rate.
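The rule above is plain gradient descent, while the assignment asks for Adam. Adam replaces the raw gradient with a bias-corrected running mean divided by the square root of a running mean of squared gradients. The following is a minimal NumPy sketch of one Adam step. The function name `adam_step` is hypothetical and the hyperparameter values are common defaults assumed here. The notebook itself relies on the PyTorch optimizer rather than code like this.

```python
import numpy as np


def adam_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad       # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


# One step from zero with a made-up gradient
theta = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
grad = np.array([2.0, -0.5])
theta, m, v = adam_step(theta, grad, m, v, t=1)
print(theta)
```

On the first step the bias-corrected ratio reduces to the sign of the gradient, so each coordinate moves by roughly `lr` whatever the gradient's magnitude. This per-parameter scaling is what distinguishes Adam from the fixed-step rule above.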

### Implementation Steps
1. `Initialize the weight vector and bias`: Initialize the weight vector and bias to small random values.
2. `Compute the Prediction`: For each document, compute the predicted probability using the sigmoid function.
3. `Compute the Cost`: Compute the cost function using the predicted probabilities and true labels.
4. `Compute the Gradients`: Compute the gradients of the cost function with respect to the weight vector and bias.
5. `Update the Parameters`: Update the weight vector and bias using the gradients and the learning rate.
6. `Repeat until convergence`: Repeat steps 2-5 until the cost function converges or a maximum number of iterations is reached.
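The steps above can be sketched as a short NumPy loop. It runs on synthetic features rather than the Yelp data, and the learning rate, iteration count, and data-generating rule are assumptions chosen only to keep the example small.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(p, y, eps=1e-12):
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


# Synthetic stand-in for document features (an assumption, not the Yelp data)
rng = np.random.default_rng(0)
F = rng.normal(size=(200, 5))
y = (F @ rng.normal(size=5) > 0).astype(float)

# Step 1: initialize the parameters to small random values
w = 0.01 * rng.normal(size=5)
b = 0.0
lr = 0.5

losses = []
for _ in range(200):                 # step 6: repeat for a fixed number of iterations
    p = sigmoid(F @ w + b)           # step 2: predictions
    losses.append(log_loss(p, y))    # step 3: cost
    err = p - y
    grad_w = F.T @ err / len(y)      # step 4: gradients
    grad_b = err.mean()
    w -= lr * grad_w                 # step 5: parameter update
    b -= lr * grad_b

accuracy = np.mean((sigmoid(F @ w + b) > 0.5) == (y == 1.0))
print(losses[0], losses[-1], accuracy)
```

A fixed iteration count stands in for a convergence test here. In practice one might instead stop when the change in cost falls below a tolerance.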


### Scaling the Model
When the number of documents $K$ is very large, the full dataset cannot be held in memory or processed in a single gradient step. The cost is an average over documents, so its gradient can be estimated from a subset, which gives several options:
* `Mini-batch Gradient Descent`: Divide the dataset into mini-batches and update the parameters using the gradient computed on each mini-batch. Only one mini-batch has to be in memory at a time.
* `Stochastic Gradient Descent (SGD)`: Instead of using the entire dataset to compute the gradient of the cost function, SGD uses a single document or a small sample drawn uniformly at random for each update. Each update is much cheaper, and the gradient estimate is unbiased although noisier.
* `Adaptive Optimizers`: Adam and RMSprop rescale each parameter's step using running averages of past gradients, which often reaches a good solution in fewer passes over the data than gradient descent with a fixed learning rate. They are applied to mini-batch gradients, so they complement mini-batching rather than replace it.
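A minimal sketch of the mini-batch idea follows: shuffle the document indices once per epoch and yield them in fixed-size chunks, so that only the documents in the current chunk need to be loaded. The generator name `iterate_minibatches` is hypothetical. The notebook itself uses PyTorch's `DataLoader` for the same purpose.

```python
import numpy as np


def iterate_minibatches(num_examples, batch_size, rng):
    """Yield index arrays that together cover every example exactly once."""
    order = rng.permutation(num_examples)
    for start in range(0, num_examples, batch_size):
        yield order[start:start + batch_size]


rng = np.random.default_rng(0)
batches = list(iterate_minibatches(num_examples=10, batch_size=4, rng=rng))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)  # [4, 4, 2]
```

The last batch is smaller whenever the batch size does not divide the number of examples. Averaging the gradient over each batch, rather than summing, keeps the step size comparable across batches.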



In [1]:
from datasets import load_dataset

dataset = load_dataset("yelp_polarity")

In [2]:
# Convert each split to a pandas DataFrame (loop variable renamed to avoid shadowing `dataset`)
df = {split: ds.to_pandas() for split, ds in dataset.items()}

In [3]:
import pandas as pd

df["train"]

Unnamed: 0,text,label
0,"Unfortunately, the frustration of being Dr. Go...",0
1,Been going to Dr. Goldberg for over 10 years. ...,1
2,I don't know what Dr. Goldberg was like before...,0
3,I'm writing this review to give you a heads up...,0
4,All the food is great here. But the best thing...,1
...,...,...
559995,Ryan was as good as everyone on yelp has claim...,1
559996,Professional \nFriendly\nOn time AND affordabl...,1
559997,Phone calls always go to voicemail and message...,0
559998,Looks like all of the good reviews have gone t...,0


In [19]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk

# nltk.download('punkt')
n = 560000

sentences = df["train"]["text"].values.tolist()[:n]
y = df["train"]["label"].values.tolist()[:n]

# Tokenize the sentences into words
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

# Create and train the Word2Vec model
model = Word2Vec(
    sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4
)

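# Document feature f_j: the mean of the document's word vectors.
# min_count=1 keeps every training token in the vocabulary, so each lookup succeeds.
vectorized_sentences = [model.wv[s].mean(axis=0) for s in tokenized_sentences]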

In [20]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np


X_train, X_test, y_train, y_test = train_test_split(
    vectorized_sentences, y, test_size=0.2, random_state=0
)
clf = LogisticRegression(random_state=0, max_iter=1000).fit(X_train, y_train)
print("train:", clf.score(X_train, y_train))
print("test:", clf.score(X_test, y_test))

train: 0.9000133928571429
test: 0.8998839285714286


In [24]:
import torch
import torch.nn as nn
import torch.optim as optim  # needed for optim.Adam below
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader
import pandas as pd
import numpy as np


# Split the data
X_train, X_test, y_train, y_test = train_test_split(
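    # Stack into one array first; building a tensor from a list of arrays is slow
    torch.tensor(np.array(vectorized_sentences), dtype=torch.float32),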
    torch.Tensor(y).unsqueeze(1),
    test_size=0.2,
    random_state=0,
)

# Create Tensor datasets
train_data = TensorDataset(X_train, y_train)
test_data = TensorDataset(X_test, y_test)

# Data loaders
batch_size = 5000  # You can change this value
train_loader = DataLoader(dataset=train_data, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=test_data, batch_size=batch_size, shuffle=False)


# Define the logistic regression model
class LogisticRegressionPyTorch(nn.Module):
    def __init__(self, num_features):
        super(LogisticRegressionPyTorch, self).__init__()
        self.linear = nn.Linear(num_features, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))


# Initialize the model
model = LogisticRegressionPyTorch(X_train.shape[1])

# Loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)
# optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training loop
num_epochs = 8
for epoch in range(num_epochs):
    for inputs, labels in train_loader:
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)

        # Backward pass and optimize
        loss.backward()
        optimizer.step()


# Evaluate the model
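def binary_accuracy(y_pred, y_test):
    """Return accuracy as a percentage, rounded to the nearest whole percent."""
    y_pred_tag = torch.round(y_pred)  # turn probabilities into 0/1 predictions
    correct_results_sum = (y_pred_tag == y_test).sum().float()
    acc = correct_results_sum / y_test.shape[0]
    acc = torch.round(acc * 100)
    return acc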


# Switch to evaluation mode
model.eval()
with torch.no_grad():
    train_acc = binary_accuracy(model(X_train), y_train)
    test_acc = binary_accuracy(model(X_test), y_test)

print("train:", train_acc.item())
print("test:", test_acc.item())

train: 89.0
test: 90.0
