
### Problem 1.

Consider the following loss function $L(\beta)$ where $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ are the observations in our dataset. $x_i$ is a $d$-dimensional input vector, i.e., there are $d$ features in our dataset. So, $x_{ij}$ corresponds to the $j^{th}$ feature in the $i^{th}$ observation. $y_i$ corresponds to the outcome variable for observation $i$. $\lambda$ is some scalar constant.

$$
L(\beta) = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{d} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{d} \beta_j^2
$$




### Solution 1

### (i) Estimating the parameters $\beta$ analytically as we did for linear regression

The loss function provided is the ridge regression loss function, i.e., linear regression with L2 regularization. The L2 term helps prevent overfitting by penalizing large coefficients, with the strength of the penalty controlled by the regularization parameter $\lambda$.


Analytically solving for the parameters $\beta$ involves finding the values of $\beta$ that minimize the loss function $L(\beta)$. For ordinary linear regression without regularization (i.e., when $\lambda = 0$), this can be done by setting the derivative of the loss function with respect to each $\beta_j$ to zero and solving the resulting normal equations. However, with the addition of the L2 regularization term, the solution is not the same as ordinary least squares (OLS) because of the penalty on the size of the coefficients.

To find the analytical expression for the $\beta$ parameters in ridge regression, we also set the derivative of $L(\beta)$ with respect to each $\beta_j$ to zero. This will give us a set of equations that we can solve for $\beta$.

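Let's denote our design matrix as $X$ (an $n \times d$ matrix with each row corresponding to an observation and each column to a feature) and our response vector as $y$. The penalty in $L(\beta)$ does not include the intercept $\beta_0$, so to keep the matrix form simple we assume the columns of $X$ have been centered. Setting $\partial L / \partial \beta_0 = 0$ then gives $\hat\beta_0 = \bar{y}$, which we subtract from $y$, so that $\beta = (\beta_1, \ldots, \beta_d)^T$ below holds only the penalized coefficients. The loss function can be written in matrix form as: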

$$ L(\beta) = (y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta $$

To minimize this loss function, we take the derivative with respect to $\beta$ and set it to zero:

$$ \frac{\partial L(\beta)}{\partial \beta} = -2X^T(y - X\beta) + 2\lambda\beta = 0 $$

This gives us the ridge regression normal equations:

$$ X^TX\beta + \lambda I\beta = X^Ty $$

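where $I$ is the $d \times d$ identity matrix. For $\lambda > 0$ the matrix $X^TX + \lambda I$ is positive definite and therefore invertible, so the solution is unique: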

$$ \beta = (X^TX + \lambda I)^{-1}X^Ty $$
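As a quick numerical check of the closed form, the following is a minimal sketch on synthetic data. The names `ridge_closed_form`, `X`, `y`, and `lam` are illustrative and not part of the assignment, and the data are centered so that no intercept column is needed. It solves the linear system with `np.linalg.solve` instead of forming the inverse explicitly, then confirms that the gradient vanishes at the solution.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for beta."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Synthetic, centered data (assumption: the intercept is handled by centering).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X -= X.mean(axis=0)
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
y -= y.mean()

lam = 1.0
beta_hat = ridge_closed_form(X, y, lam)

# The gradient -2 X^T (y - X beta) + 2 lam beta should be ~0 at the minimizer.
grad_at_solution = -2 * X.T @ (y - X @ beta_hat) + 2 * lam * beta_hat
print(np.allclose(grad_at_solution, 0.0, atol=1e-8))
```

Solving the system directly is usually preferred over computing the inverse because it is cheaper and numerically more stable.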

----------------
### (ii) Gradient of the loss function $\nabla L(\beta)$ 

$$\nabla L(\beta) = -2X^T(y - X\beta) + 2\lambda\beta $$


where:

- $X$ is the matrix of input features, with rows representing samples and columns representing features.
- $X^T$ is the transpose of the matrix $X$.
- $y$ is the vector of observed values (target values).
- $\beta$ is the vector of parameters that we are trying to learn.
- $\lambda$ is the regularization parameter that controls the amount of shrinkage: larger values of $\lambda$ shrink the parameters more toward zero.
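One way to gain confidence in the gradient formula is to compare it with a finite-difference approximation. The sketch below does this on random data; the helper names (`ridge_loss`, `ridge_gradient`, `numerical_gradient`) and the step `eps` are assumptions made for illustration.

```python
import numpy as np

def ridge_loss(beta, X, y, lam):
    """L(beta) = ||y - X beta||^2 + lam * ||beta||^2."""
    residual = y - X @ beta
    return residual @ residual + lam * (beta @ beta)

def ridge_gradient(beta, X, y, lam):
    """Analytic gradient: -2 X^T (y - X beta) + 2 lam beta."""
    return -2 * X.T @ (y - X @ beta) + 2 * lam * beta

def numerical_gradient(f, beta, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(beta)
    for j in range(beta.size):
        step = np.zeros_like(beta)
        step[j] = eps
        grad[j] = (f(beta + step) - f(beta - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
y = rng.normal(size=20)
beta = rng.normal(size=4)
lam = 0.5

analytic = ridge_gradient(beta, X, y, lam)
numeric = numerical_gradient(lambda b: ridge_loss(b, X, y, lam), beta)
max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)  # expected to be tiny if the formula is right
```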

----------------

### (iii) Update step for gradient descent

$$  \beta_j^{(t+1)} = \beta_j^{(t)} - \eta \cdot \frac{\partial L(\beta)}{\partial \beta_j} $$

where:

- $\beta_j^{(t+1)}$ is the updated value of the $j$-th parameter at iteration $t+1$.
- $\beta_j^{(t)}$ is the current value of the $j$-th parameter at iteration $t$.
- $\eta$ is the learning rate, a positive scalar determining the step size at each iteration.
- $\frac{\partial L(\beta)}{\partial \beta_j}$ is the partial derivative of the loss function with respect to the $j$-th parameter, evaluated at $\beta^{(t)}$. It is the $j$-th component of the gradient $\nabla L(\beta)$, and the full gradient vector points in the direction of steepest increase of the loss.

By subtracting the gradient scaled by the learning rate from the current parameters, the update rule moves the parameters in the direction that most steeply reduces the loss function.
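In vector form the same update reads $\beta^{(t+1)} = \beta^{(t)} - \eta \nabla L(\beta^{(t)})$. A minimal sketch of the resulting loop is shown below on synthetic data; the function names, the step size `eta=1e-3`, and the iteration count are illustrative assumptions. The step size has to be small relative to the curvature of the loss, otherwise the iterates diverge.

```python
import numpy as np

def ridge_gradient(beta, X, y, lam):
    """Gradient of the ridge loss: -2 X^T (y - X beta) + 2 lam beta."""
    return -2 * X.T @ (y - X @ beta) + 2 * lam * beta

def gradient_descent(X, y, lam, eta, n_iters):
    """Repeat beta <- beta - eta * grad L(beta), starting from zero."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        beta = beta - eta * ridge_gradient(beta, X, y, lam)
    return beta

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)
lam = 1.0

beta_gd = gradient_descent(X, y, lam, eta=1e-3, n_iters=5000)

# Reference: closed-form ridge solution from part (i).
beta_exact = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(np.allclose(beta_gd, beta_exact, atol=1e-6))
```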

----------------
### (iv) Pseudo-code for a stochastic gradient descent (SGD) algorithm to estimate parameters $\beta$

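- Initialize $\beta$ at random (or at zero)
- Choose a learning rate $\eta$

- Repeat the following until an approximate minimum is obtained (for example, until the change in $\beta$ or in the loss falls below a tolerance):
    - Shuffle the dataset randomly
    - For each example $(x_i, y_i)$ in the dataset:
        - Calculate the gradient of the loss with respect to the example, $g_i = -2x_i(y_i - x_i^T\beta) + \frac{2\lambda}{n}\beta$ (the penalty is split evenly across the $n$ examples so that $\sum_i g_i = \nabla L(\beta)$)
        - Update $\beta$ by subtracting $\eta$ times the gradient: $\beta \leftarrow \beta - \eta\, g_i$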
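The pseudo-code above can be sketched in Python as follows. This is only an illustration on synthetic data: the function name `sgd_ridge`, the fixed learning rate, the number of epochs, and the even split of the penalty across examples (`lam / n` per example) are assumptions rather than part of the assignment. A fixed number of epochs stands in for the convergence test.

```python
import numpy as np

def sgd_ridge(X, y, lam, eta=0.005, n_epochs=50, seed=0):
    """SGD for the ridge loss, one example per update."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_epochs):
        # Visit the examples in a fresh random order each epoch.
        for i in rng.permutation(n):
            residual = y[i] - X[i] @ beta
            # Per-example gradient; the penalty is shared equally by all n examples.
            grad_i = -2 * residual * X[i] + (2 * lam / n) * beta
            beta -= eta * grad_i
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
lam = 1.0

beta_sgd = sgd_ridge(X, y, lam)
beta_exact = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(beta_sgd, beta_exact)
```

With a fixed learning rate the iterates keep fluctuating in a small neighborhood of the minimizer, so the result should be close to, but not identical to, the closed-form solution.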




----------------


# Problem 2: Twitter Sentiment Analysis

Dataset Link: https://www.kaggle.com/datasets/kazanova/sentiment140

In [110]:
import pandas as pd
import numpy as np


#logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

#NLTK for tweet(text) processing
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re
import nltk
from nltk.stem import WordNetLemmatizer


# Download the NLTK resources used below (tokenizer, stopwords, WordNet)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

#TF-IDF (Term Frequency-Inverse Document Frequency: Vectorizer)
from sklearn.feature_extraction.text import TfidfVectorizer



[nltk_data] Downloading package punkt to /Users/krishan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/krishan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/krishan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Dataset Preparation

The original file of 1.6 million tweets was used to create a smaller, balanced dataset of 2000 tweets (1000 per class). The following code block is commented out because the processed file with 2000 tweets is used from here on.

In [111]:
# df = pd.read_csv('tweets.csv', encoding='latin-1')
# df.head()

# #keep only the first column (target) and the last column (tweet text)
# df = df.iloc[:,[0,-1]]
# #rename the first column to target and the last to tweet
# df.columns = ['target','tweet']
# df.head()

# #sample random 2000 rows
# df0 = df[df['target']==0].sample(1000)
# df4 = df[df['target']==4].sample(1000)

# #concat the data
# df = pd.concat([df0, df4], ignore_index=True) 

# #shuffle the data
# df = df.sample(frac=1)

# df['target'].value_counts()

# #change the target to 0 and 1
# df['target'] = df['target'].replace(4,1)

# #switching the columns
# df = df[['tweet','target']]

# #save to csv
# df.to_csv('tweets_2000.csv', index=False)

In [112]:
#read the data
df = pd.read_csv('tweets_2000.csv')
df.head()

| | tweet | target |
|---|---|---|
| 0 | Back from drankin...now going to the movies w ... | 1 |
| 1 | @SwayShay I know, and LD is too | 0 |
| 2 | Boohoo Have to wait until 1 PM to download iP... | 0 |
| 3 | My internet is mega slow again and ruining my ... | 0 |
| 4 | thinking of ideas and inspiration for raspberr... | 1 |


In [113]:
#check for missing values
print(df.isnull().sum())

tweet     0
target    0
dtype: int64


In [114]:
#check target proportion
df['target'].value_counts()


target
1    1000
0    1000
Name: count, dtype: int64

In [115]:
#train test split
X = df['tweet']
y = df['target']

#stratify the split so both classes keep the same proportion in train and test

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

X_train.head()



962     ...weather today. Going out for lunch with my ...
984     @gargamit100 laugh at your win this time; next...
992      I want to go to sleep but youtube is being slow 
1203            Ubertwitter still not updating on the bb 
1031    Tonto and wera.. luv u both.. u really make me...
Name: tweet, dtype: object

In [116]:
#check target proportion in train and test
print(y_train.value_counts())
print(y_test.value_counts())

target
1    800
0    800
Name: count, dtype: int64
target
0    200
1    200
Name: count, dtype: int64


In [117]:
# Build the stopword set once instead of reloading the list for every token
STOP_WORDS = set(stopwords.words('english'))

# Preprocessing function to clean text data
def clean_text(text):
    # Replace non-alphabetic characters with spaces and lowercase the text
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    text = text.lower()
    # Tokenize and remove stopwords
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in STOP_WORDS]
    # Join the tokens back into a string
    return ' '.join(tokens)

# Lemmatize the words
lemmatizer = WordNetLemmatizer()
def lemmatize_words(text):
    tokens = word_tokenize(text)
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)

# Apply the preprocessing functions to the text data
X_train = X_train.apply(clean_text).apply(lemmatize_words)
X_test = X_test.apply(clean_text).apply(lemmatize_words)


X_train.head()

962     weather today going lunch mummy going orthodon...
984     gargamit laugh win time next time junior give ...
992                            want go sleep youtube slow
1203                        ubertwitter still updating bb
1031                 tonto wera luv u u really make happy
Name: tweet, dtype: object

In [118]:
X_train.shape, X_test.shape

((1600,), (400,))

In [119]:
# Build TF-IDF features (sparse per-document vectors, not learned word embeddings)
# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()
# Fit the vectorizer to the training data
X_train = vectorizer.fit_transform(X_train)

# Transform the test data
X_test = vectorizer.transform(X_test)

X_train.shape, X_test.shape

((1600, 4814), (400, 4814))

TF-IDF vectorization has been used to convert the text data into numerical data. The vocabulary is learned from the training set only, and each tweet is now represented by a 4814-dimensional vector.

In [120]:
X_train = pd.DataFrame(X_train.toarray(), columns=vectorizer.get_feature_names_out())

X_train.head()

Unnamed: 0,aaaaaah,aaahhh,aaarrggghhh,aaaw,aag,aahhh,aaron,ab,abandoned,abc,...,zombecca,zombie,zoo,zoom,zoooooooooo,zrlgrl,zu,zvgd,zwinky,zzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## (i) Features I am using in logistic regression model with mathematical model

I used TF-IDF (Term Frequency-Inverse Document Frequency) as a feature extraction method for a logistic regression model.


The features are numerical representations of the importance of each word (term) within a document, relative to the whole collection of documents; here each tweet (row) is one document. The TF-IDF score is high for a word that is frequent in a given document but rare across the corpus, which helps separate informative words from words that appear everywhere.

### Features in the Model
- **TF (Term Frequency):** Measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in long documents than in short ones. Thus, the term frequency is often divided by the document length (i.e., the total number of terms in the document) as a way of normalization:
  
  $$ TF(t) = \frac{\text{Number of times term } t \text{ appears in a document}}{\text{Total number of terms in the document}} $$

- **IDF (Inverse Document Frequency):** Measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear a lot of times but have little importance. Thus, we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

  $$ IDF(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents with term } t \text{ in it}}\right) $$
  
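By multiplying TF and IDF, we get the TF-IDF score of a term in a document, which reflects the importance of the term in that document relative to the whole corpus. The higher the TF-IDF score, the more important the term is in that particular document. Note that scikit-learn's `TfidfVectorizer`, used above with default settings, implements a slight variant of these textbook formulas: TF is the raw count, the IDF is smoothed as $\ln\left(\frac{1 + N}{1 + df(t)}\right) + 1$ (with $N$ documents and $df(t)$ documents containing term $t$), and each document vector is then scaled to unit L2 norm.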
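To make the textbook definitions concrete, here is a small pure-Python sketch on a toy corpus of three already-tokenized "tweets". The corpus and the helper names `tf`, `idf`, and `tf_idf` are made up for illustration, and the `idf` helper assumes the term occurs in at least one document.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Term frequency: count of the term divided by the document length."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus_tokens):
    """Inverse document frequency: log(N / number of documents containing the term)."""
    n_docs_with_term = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / n_docs_with_term)

def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

# Toy corpus (hypothetical, already cleaned and tokenized).
corpus = [
    "want go sleep youtube slow".split(),
    "internet slow again".split(),
    "going movies today".split(),
]

score_slow = tf_idf("slow", corpus[0], corpus)        # "slow" is in 2 of 3 documents
score_youtube = tf_idf("youtube", corpus[0], corpus)  # "youtube" is in 1 of 3 documents
print(score_slow, score_youtube)
```

Both terms occur once in the first document, so they share the same TF; "youtube" receives the higher score only because it is rarer across the corpus.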

### Mathematical Representation of Logistic Regression Model Using TF-IDF Features

Given that features $X$ are created using TF-IDF, each document $d$ in training data is represented as a vector $X_d = [x_1, x_2, ..., x_n]$, where each $x_i$ is the TF-IDF score for term $i$ in document $d$, and $n$ is the total number of unique terms across all documents in the corpus.

The logistic regression model can then be mathematically represented as follows:

$$ P(y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n)}} $$

Where:
- $P(y=1|X)$ is the probability that the document belongs to class 1 (assuming a binary classification problem) given its TF-IDF features $(X)$.
- $e$ is the base of the natural logarithm.
- $\beta_0$ is the intercept term of the logistic regression model.
- $\beta_1, \beta_2, ..., \beta_n$ are the coefficients associated with each TF-IDF feature $x_1, x_2, ..., x_n$, which the logistic regression model learns during training.

This model predicts the probability that a given document belongs to a particular category (for instance, positive or negative sentiment) based on the weighted sum of its TF-IDF features, passed through a logistic (sigmoid) function to ensure the output is between 0 and 1.
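As an illustration of the formula, the sketch below evaluates the model for a single document with three features. The feature vector and the parameter values are invented for the example and are not the fitted coefficients.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, beta_0, beta):
    """P(y = 1 | x) = sigmoid(beta_0 + sum_j beta_j * x_j)."""
    return sigmoid(beta_0 + x @ beta)

# Hypothetical TF-IDF vector and parameters for a 3-term vocabulary.
x = np.array([0.0, 0.6, 0.8])
beta_0 = 0.1
beta = np.array([1.0, -2.0, 1.5])

p = predict_proba(x, beta_0, beta)  # linear score is 0.1 - 1.2 + 1.2 = 0.1
label = int(p >= 0.5)               # classify as 1 when the probability is at least 0.5
print(p, label)
```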

-------------
## (ii) Likelihood function for logistic regression model

The likelihood function for a logistic regression model quantifies how probable the observed data $Y$ are, given the parameters of the model. For binary classification, where $y_i$ represents the binary outcome for the $i^{th}$ observation, and $X_i$ represents the feature vector for the $i^{th}$ observation, the likelihood function $L$ can be defined as the product of individual probabilities for each observation, assuming they are independent.

Given:
- $y_i$ is the binary outcome for the $i^{th}$ observation (0 or 1).
- $X_i$ is the feature vector for the $i^{th}$ observation.
- $\beta$ is the vector of model parameters, including the intercept $\beta_0$ and coefficients $\beta_1, \beta_2, ..., \beta_n$.

The probability that $y_i = 1$ given $X_i$ and $\beta$, denoted $P(y_i = 1 | X_i; \beta)$, is modeled by the logistic function:

$$ P(y_i = 1 | X_i; \beta) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_n x_{in})}} $$

Thus, the likelihood function $L(\beta)$ for $N$ observations is the product of the probabilities for each observation:

$$ L(\beta) = \prod_{i=1}^{N} P(y_i = 1 | X_i; \beta)^{y_i} \left(1 - P(y_i = 1 | X_i; \beta)\right)^{(1-y_i)} $$

This can also be represented as:

$$ L(\beta) = \prod_{i=1}^{N} \left( \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_n x_{in})}} \right)^{y_i} \left(1 - \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_n x_{in})}} \right)^{(1-y_i)} $$

To facilitate computation, particularly for the purpose of parameter estimation via maximum likelihood estimation (MLE), it is common to work with the log-likelihood function, which is the logarithm of the likelihood function:

$$ \log L(\beta) = \sum_{i=1}^{N} \left[ y_i \log \left( \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_n x_{in})}} \right) + (1-y_i) \log \left(1 - \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_n x_{in})}} \right) \right] $$

The goal of MLE is to find the values of $\beta$ that maximize $\log L(\beta)$, thereby finding the parameter estimates that make the observed data most probable.
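The log-likelihood can be evaluated directly from this formula, but taking the logarithm of a probability that has rounded to 0 or 1 causes numerical trouble for large scores. The sketch below, on made-up data, shows one common workaround based on the identities $\log \sigma(z) = -\log(1 + e^{-z})$ and $\log(1 - \sigma(z)) = -\log(1 + e^{z})$, using `np.logaddexp`. The function names and data are illustrative assumptions.

```python
import numpy as np

def log_likelihood(beta_0, beta, X, y):
    """Numerically stable Bernoulli log-likelihood of a logistic model."""
    z = beta_0 + X @ beta
    # np.logaddexp(0, t) computes log(1 + e^t) without overflow.
    return np.sum(-y * np.logaddexp(0.0, -z) - (1 - y) * np.logaddexp(0.0, z))

def naive_log_likelihood(beta_0, beta, X, y):
    """Direct translation of the formula; fine for moderate scores only."""
    p = 1.0 / (1.0 + np.exp(-(beta_0 + X @ beta)))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Tiny made-up dataset with two features.
X = np.array([[0.2, 0.0], [0.0, 0.7], [0.5, 0.5], [0.9, 0.1]])
y = np.array([1.0, 0.0, 1.0, 0.0])
beta_0, beta = 0.3, np.array([1.0, -2.0])

ll_stable = log_likelihood(beta_0, beta, X, y)
ll_naive = naive_log_likelihood(beta_0, beta, X, y)

# With all parameters at zero every probability is 0.5, so log L = -N log 2.
ll_zero = log_likelihood(0.0, np.zeros(2), X, y)
print(ll_stable, ll_naive, ll_zero)
```

Minimizing the negative of this quantity is equivalent to maximizing the likelihood.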


--------------------------
## (iii) Training the logistic regression model using Black-Box Model and measuring the performance on test data

In [121]:
#Logistic Regression
# Create a logistic regression model
model = LogisticRegression()
# Fit the model
model.fit(X_train, y_train)
# Predict the test data
y_pred = model.predict(X_test)

# Model Evaluation
# Confusion matrix (rows: true class, columns: predicted class)
print('Confusion Matrix\n',confusion_matrix(y_test, y_pred))
print('\n-----------------------------------------------------\n')
# Accuracy score
print('Accuracy: ',accuracy_score(y_test, y_pred)*100,'%')


Confusion Matrix
 [[122  78]
 [ 58 142]]

-----------------------------------------------------

Accuracy:  66.0 %




In [122]:
# print the coefficients
print('Coefficients: ',model.coef_)
print('Intercept: ',model.intercept_)
print('Number of features: ',len(model.coef_[0]))

Coefficients:  [[-0.12894731  0.12883174 -0.1418485  ...  0.11283745  0.11033777
   0.16455937]]
Intercept:  [0.07422433]
Number of features:  4814


The fitted logistic regression model has 4814 coefficients, one per TF-IDF feature, plus one intercept, all learned during training. After fitting, they are available through the `model.coef_` and `model.intercept_` attributes.

Accuracy on Test Data: 66 %

--------------------------

## (iv) Training the logistic regression classifier by minimizing the negative log-likelihood with stochastic gradient descent (SGD), and comparing with the coefficients obtained in step (iii)

In [123]:
def sigmoid(z):
    # Logistic function: maps any real z into (0, 1)
    return 1 / (1 + np.exp(-z))

def compute_gradient(X, y, coefficients):
    # Gradient of the mean negative log-likelihood: -X^T (y - sigmoid(X beta)) / n
    predictions = sigmoid(np.dot(X, coefficients))
    errors = y - predictions
    gradient = -np.dot(X.T, errors) / len(X)
    return gradient

def sgd_logistic_regression(X_train, y_train, learning_rate=0.01, epochs=1000):
    # NOTE: one coefficient per feature; no intercept (bias) term is fitted
    coefficients = np.zeros(X_train.shape[1])
    indices = np.arange(X_train.shape[0])  # Array of indices for X_train
    for epoch in range(epochs):
        np.random.shuffle(indices)  # Shuffle indices for each epoch (unseeded, so runs differ slightly)
        for idx in indices:  # Iterate over shuffled indices
            X_i = X_train[idx:idx+1]  # Single-sample batch of shape (1, n_features)
            y_i = y_train[idx:idx+1]
            gradient = compute_gradient(X_i, y_i, coefficients)
            coefficients -= learning_rate * gradient
    return coefficients

# Train the model
coefficients = sgd_logistic_regression(X_train.to_numpy(), y_train.to_numpy())

# Predict the test data (threshold the predicted probability at 0.5)
y_pred2 = sigmoid(np.dot(X_test.toarray(), coefficients))
y_pred2 = np.round(y_pred2)

In [124]:
# Model accuracy
print('Accuracy: ',accuracy_score(y_test, y_pred2)*100,'%')

# Confusion matrix
print('Confusion Matrix\n',confusion_matrix(y_test, y_pred2))

Accuracy:  67.75 %
Confusion Matrix
 [[138  62]
 [ 67 133]]


In [125]:
#compare the two models
print('Accuracy Black-Box Model: ',accuracy_score(y_test, y_pred)*100,'%')
print('Accuracy SGD Model: ',accuracy_score(y_test, y_pred2)*100,'%')

#print the intercepts
#NOTE: the SGD model fits no intercept, so coefficients[0] is the weight of the
#first feature ('aaaaaah'), not a bias term comparable to model.intercept_
print('Intercept Black-Box Model: ',model.intercept_)
print('Intercept SGD Model: ',coefficients[0])

#make a df for the coefficients for black-box model and SGD model
coefficients_df = pd.DataFrame({'Feature': X_train.columns, 'Black-Box Model': model.coef_[0], 'SGD Model': coefficients})


#display all rows
pd.set_option('display.max_rows', None)
#show the first rows of coefficients_df

coefficients_df.head(20)






Accuracy Black-Box Model:  66.0 %
Accuracy SGD Model:  67.75 %
Intercept Black-Box Model:  [0.07422433]
Intercept SGD Model:  -0.9063201496881317


| | Feature | Black-Box Model | SGD Model |
|---|---|---|---|
| 0 | aaaaaah | -0.128947 | -0.90632 |
| 1 | aaahhh | 0.128832 | 0.729833 |
| 2 | aaarrggghhh | -0.141848 | -0.587556 |
| 3 | aaaw | -0.195386 | -1.408883 |
| 4 | aag | -0.12373 | -0.535092 |
| 5 | aahhh | 0.243232 | 1.504813 |
| 6 | aaron | -0.208588 | -1.136684 |
| 7 | ab | -0.143054 | -0.832974 |
| 8 | abandoned | -0.114806 | -0.498032 |
| 9 | abc | 0.253279 | 1.649064 |


In [126]:
#confusion matrix for black-box model and SGD model
print('Confusion Matrix Black-Box Model\n',confusion_matrix(y_test, y_pred))
print('Confusion Matrix SGD Model\n',confusion_matrix(y_test, y_pred2))

Confusion Matrix Black-Box Model
 [[122  78]
 [ 58 142]]
Confusion Matrix SGD Model
 [[138  62]
 [ 67 133]]


# Conclusion
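The hand-written SGD implementation is quite basic and differs from scikit-learn's `LogisticRegression` in several ways. With its default settings, `LogisticRegression` applies L2 regularization (`C=1.0`), fits an intercept, and uses the L-BFGS solver with a convergence check, whereas the SGD version above is unregularized, fits no intercept, and runs with a fixed learning rate for a fixed number of epochs.

These differences explain why the coefficients differ. In the rows shown above the two models agree in sign, but the unregularized SGD coefficients are roughly four to seven times larger in magnitude because nothing shrinks them toward zero. Test accuracy is nevertheless similar for the two models.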
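To illustrate the effect of the two missing ingredients, the sketch below adds an explicit intercept and an L2 penalty to a per-example SGD loop and runs it on synthetic data, not on the tweets. The function name `sgd_logistic_l2`, the penalty strength `lam`, and all hyperparameter values are hypothetical, and no attempt is made to match scikit-learn's `C` parameterization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_l2(X, y, lam=0.0, learning_rate=0.05, epochs=100, seed=0):
    """SGD on the negative log-likelihood with an intercept and an L2 penalty.

    `lam` is a hypothetical penalty strength; the intercept is not penalized.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    intercept, weights = 0.0, np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            # Prediction error for one example: sigmoid(score) - label.
            error = sigmoid(intercept + X[i] @ weights) - y[i]
            intercept -= learning_rate * error
            # The penalty gradient is shared equally across the n examples.
            weights -= learning_rate * (error * X[i] + (lam / n) * weights)
    return intercept, weights

# Synthetic binary data (illustrative only, not the tweet dataset).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
true_logits = 0.5 + X @ np.array([2.0, -1.0])
y = (rng.uniform(size=200) < sigmoid(true_logits)).astype(float)

b_plain, w_plain = sgd_logistic_l2(X, y, lam=0.0)
b_reg, w_reg = sgd_logistic_l2(X, y, lam=50.0)

print("unregularized weight norm:", np.linalg.norm(w_plain))
print("regularized weight norm:  ", np.linalg.norm(w_reg))
```

On data like this the penalized weights should come out smaller in norm, which mirrors the pattern seen in the coefficient table: the regularized model has the smaller coefficients.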

--------------------------
**Model Outputs:**

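Accuracy Black-Box Model:  66.0 %

Accuracy SGD Model:  67.75 %

Intercept Black-Box Model:  [0.07422433]

Intercept SGD Model:  none fitted (the value printed above, -0.9063201496881317, is `coefficients[0]`, the weight of the first feature `aaaaaah`)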