# Assignment 1: Logistic Regression for Sentiment Analysis

Welcome to your first NLP assignment! In this notebook, you will implement logistic regression from scratch to perform sentiment analysis on tweets. You will learn how to extract features from text, train a logistic regression model, make predictions, evaluate performance, and analyze errors.

**Outline:**
1. Import Required Libraries and Download Data  
2. Load and Prepare Tweet Data  
3. Create Frequency Dictionary  
4. Process and Test Tweet Preprocessing  
5. Implement Sigmoid Function  
6. Implement Gradient Descent for Logistic Regression  
7. Extract Features from Tweets  
8. Train the Logistic Regression Model  
9. Predict Sentiment for a Tweet  
10. Evaluate Model Accuracy on Test Set  
11. Perform Error Analysis  
12. Predict Sentiment for Custom Tweet  

## 1. Import Required Libraries and Download Data

Let's start by importing the necessary libraries and downloading the required NLTK datasets.

In [18]:
import numpy as np
import pandas as pd
import nltk
from os import getcwd
import re, string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer


# Download required NLTK datasets
nltk.download('twitter_samples')
nltk.download('stopwords')

from nltk.corpus import twitter_samples

# If running locally, you may need to set the nltk data path
filePath = f"{getcwd()}/../tmp2/"
nltk.data.path.append(filePath)

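# Helper functions for preprocessing tweets and counting word frequencies
def process_tweet(tweet):
    '''Clean, tokenize, filter, and stem a tweet; returns a list of stemmed tokens.'''
    # Remove the retweet marker, URLs, and the '#' symbol (the hashtag word is kept)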
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet)
    tweet = re.sub(r'#', '', tweet)
    # Tokenize
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tokens = tokenizer.tokenize(tweet)
    # Remove stopwords and punctuation
    sw = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in sw and t not in string.punctuation]
    # Stemming
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]


def build_freqs(tweets, ys):
    '''Count how often each (stemmed word, label) pair occurs; returns a dict.'''
    yslist = np.squeeze(ys).tolist()
    freqs = {}
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, int(y))
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs


[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 2. Load and Prepare Tweet Data

We will load positive and negative tweets, split them into training and test sets, and create corresponding label arrays.

In [7]:
# Load positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

# Split into training and test sets (80% train, 20% test)
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

# Create label arrays: 1 for positive, 0 for negative
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)

# Print shapes for verification
print("train_y.shape =", train_y.shape)
print("test_y.shape =", test_y.shape)

train_y.shape = (8000, 1)
test_y.shape = (2000, 1)


## 3. Create Frequency Dictionary

We will use the `build_freqs` function to create a frequency dictionary from the training data. This dictionary will map (word, label) pairs to their frequency counts in the corpus.

In [6]:
# Create frequency dictionary
freqs = build_freqs(train_x, train_y)

# Check the output
print("type(freqs) =", type(freqs))
print("len(freqs) =", len(freqs.keys()))

type(freqs) = <class 'dict'>
len(freqs) = 11396
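
To make the structure of the dictionary concrete, here is a minimal sketch of the same counting logic on a toy corpus. The tweets, labels, and the names `toy_tweets`, `toy_labels`, and `toy_freqs` are invented for illustration, and the tweets are assumed to be already tokenized and stemmed.

```python
# Toy corpus, made up for illustration; each tweet is already a list of stems
toy_tweets = [["happi", "day"], ["sad", "day"], ["happi", "happi"]]
toy_labels = [1, 0, 1]  # 1 = positive, 0 = negative

toy_freqs = {}
for tokens, label in zip(toy_tweets, toy_labels):
    for word in tokens:
        pair = (word, label)                       # key is (word, label)
        toy_freqs[pair] = toy_freqs.get(pair, 0) + 1

print(toy_freqs)
```

Note that `day` gets two separate entries, one per label, which is what lets the model later compare how positive and how negative a word tends to be.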


## 4. Process and Test Tweet Preprocessing

Let's test the `process_tweet` function to see how it tokenizes, removes stopwords, and stems words in a tweet.

In [8]:
# Test the function on a sample tweet
print('This is an example of a positive tweet: \n', train_x[0])
print('\nThis is an example of the processed version of the tweet: \n', process_tweet(train_x[0]))

This is an example of a positive tweet: 
 #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

This is an example of the processed version of the tweet: 
 ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']
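
The cleanup stage of `process_tweet` can be traced step by step with the same three regular expressions. This is only a sketch: the tweet below is hypothetical (not from the dataset), and a plain whitespace split stands in for `TweetTokenizer`, so handles, stopwords, and punctuation are not removed here.

```python
import re

# Hypothetical tweet, made up for illustration
raw = "RT @user: Loving the new #NLP course! https://t.co/abc123 :)"

step1 = re.sub(r'^RT[\s]+', '', raw)               # drop the leading retweet marker
step2 = re.sub(r'https?://[^\s\n\r]+', '', step1)  # drop URLs
step3 = re.sub(r'#', '', step2)                    # drop '#', keep the hashtag word

# Whitespace split used as a stand-in for the real tokenizer
cleaned_tokens = step3.split()
print(cleaned_tokens)
```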


## 5. Implement Sigmoid Function

Implement the sigmoid activation function, which maps any real value to the (0, 1) interval. Test it on sample inputs.

In [9]:
# UNQ_C1 GRADED FUNCTION: sigmoid
def sigmoid(z): 
    '''
    Input:
        z: is the input (can be a scalar or an array)
    Output:
        h: the sigmoid of z
    '''
    h = 1 / (1 + np.exp(-z))
    return h

# Testing your function 
print("sigmoid(0) =", sigmoid(0))
print("sigmoid(4.92) =", sigmoid(4.92))


sigmoid(0) = 0.5
sigmoid(4.92) = 0.9927537604041685
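
For very negative inputs, `np.exp(-z)` can overflow and trigger a runtime warning even though the final result is still 0. One common way around this, sketched below, is to only ever exponentiate a non-positive number. The name `stable_sigmoid` is hypothetical and not part of the assignment; the graded `sigmoid` above should stay as it is.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that avoids overflow by exponentiating only -|z|."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))  # always in [0, 1], so it cannot overflow
    # z >= 0: 1 / (1 + exp(-z));  z < 0: exp(z) / (1 + exp(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

print(stable_sigmoid(0))
print(stable_sigmoid(np.array([-1000.0, 4.92, 1000.0])))
```

Both branches are algebraically the same function, so the values match the plain implementation wherever that one does not overflow.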


## 6. Implement Gradient Descent for Logistic Regression

Implement the `gradientDescent` function to optimize the logistic regression weights using the training data.

In [10]:
# UNQ_C2 GRADED FUNCTION: gradientDescent
def gradientDescent(x, y, theta, alpha, num_iters):
    '''
    Input:
        x: matrix of features which is (m,n+1)
        y: corresponding labels of the input matrix x, dimensions (m,1)
        theta: weight vector of dimension (n+1,1)
        alpha: learning rate
        num_iters: number of iterations you want to train your model for
    Output:
        J: the final cost
        theta: your final weight vector
    '''
    m = x.shape[0]
    for i in range(0, num_iters):
        z = np.dot(x, theta)
        h = sigmoid(z)
        J = -(1/m) * np.sum(y*np.log(h) + (1-y)*np.log(1-h))
        theta = theta - alpha * (1/m) * np.dot(x.T, (h - y))
    J = float(J)
    return J, theta

# Check the function with a synthetic test case
np.random.seed(1)
tmp_X = np.append(np.ones((10, 1)), np.random.rand(10, 2) * 2000, axis=1)
tmp_Y = (np.random.rand(10, 1) > 0.35).astype(float)
tmp_J, tmp_theta = gradientDescent(tmp_X, tmp_Y, np.zeros((3, 1)), 1e-8, 700)
print(f"The cost after training is {tmp_J:.8f}.")
print(f"The resulting vector of weights is {[round(float(t), 8) for t in np.squeeze(tmp_theta)]}")


The cost after training is 0.67094970.
The resulting vector of weights is [4.1e-07, 0.00035658, 7.309e-05]
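
Written out, the quantities that `gradientDescent` computes in each iteration are as follows, where $m$ is the number of examples, $h^{(i)}$ is the $i$-th entry of $h$, and $\sigma$ is the sigmoid from the previous section. This is a transcription of the code above into notation, not an additional derivation.

```latex
h = \sigma(X\theta), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}
    \left[ y^{(i)} \log h^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - h^{(i)}\right) \right]

\theta \leftarrow \theta - \frac{\alpha}{m} X^{\top} (h - y)
```

Because the code evaluates $J$ before applying the update, the returned cost corresponds to the weights as they were at the start of the final iteration.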


## 7. Extract Features from Tweets

Implement the `extract_features` function, which maps a tweet to a feature vector of shape (1, 3): a bias term of 1, the sum of the positive-label frequencies of its words, and the sum of their negative-label frequencies.

In [11]:
# UNQ_C3 GRADED FUNCTION: extract_features
def extract_features(tweet, freqs, process_tweet=process_tweet):
    '''
    Input: 
        tweet: a string containing one tweet
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
    Output: 
        x: a feature vector of dimension (1,3)
    '''
    word_l = process_tweet(tweet)
    x = np.zeros(3) 
    x[0] = 1 
    for word in word_l:
        x[1] += freqs.get((word, 1), 0)
        x[2] += freqs.get((word, 0), 0)
    x = x[None, :]
    assert(x.shape == (1, 3))
    return x

# Test on training data
tmp1 = extract_features(train_x[0], freqs)
print(tmp1)

# Test for unknown words
tmp2 = extract_features('blorb bleeeeb bloooob', freqs)
print(tmp2)


[[1.000e+00 3.133e+03 6.100e+01]]
[[1. 0. 0.]]
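
The same accumulation can be checked by hand on a small example. The dictionary and token list below are invented for illustration (`toy_freqs` and `tokens` are hypothetical names), and the tokens are assumed to be already processed.

```python
import numpy as np

# Toy frequency dictionary, made up for illustration
toy_freqs = {("happi", 1): 3, ("day", 1): 1, ("day", 0): 1, ("sad", 0): 1}
tokens = ["happi", "day", "unknownword"]

x = np.zeros(3)
x[0] = 1                                   # bias term
for word in tokens:
    x[1] += toy_freqs.get((word, 1), 0)    # positive counts: 3 + 1 + 0
    x[2] += toy_freqs.get((word, 0), 0)    # negative counts: 0 + 1 + 0

print(x)  # [1. 4. 1.]
```

Words missing from the dictionary contribute 0 to both sums, which is why the all-unknown tweet above yields `[[1. 0. 0.]]`.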


## 8. Train the Logistic Regression Model

Stack feature vectors for all training tweets, initialize weights, and train the model using gradient descent.

In [12]:
# Stack features for all training examples
X = np.zeros((len(train_x), 3))
for i in range(len(train_x)):
    X[i, :] = extract_features(train_x[i], freqs)

Y = train_y

# Train the model
J, theta = gradientDescent(X, Y, np.zeros((3, 1)), 1e-9, 1500)
print(f"The cost after training is {J:.8f}.")
print(f"The resulting vector of weights is {[round(float(t), 8) for t in np.squeeze(theta)]}")

The cost after training is 0.22524456.
The resulting vector of weights is [6e-08, 0.00053786, -0.00055885]
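
One way to read these weights is through the decision boundary they imply: the model predicts positive when `theta[0] + theta[1]*pos + theta[2]*neg > 0`. The sketch below uses the rounded weights printed above and the features of `train_x[0]` from the `extract_features` test; `boundary_neg` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np

# Rounded weights copied from the training output
theta_rounded = np.array([6e-08, 0.00053786, -0.00055885])

def boundary_neg(pos, theta):
    """Negative-count value that sits exactly on the boundary for a given positive count."""
    return -(theta[0] + theta[1] * pos) / theta[2]

# Features of train_x[0]: positive count 3133, negative count 61
pos, neg = 3133.0, 61.0
z = theta_rounded @ np.array([1.0, pos, neg])
neg_on_boundary = boundary_neg(pos, theta_rounded)

print("z =", z)
print("boundary at neg =", neg_on_boundary)
```

Since the tweet's negative count is far below the boundary value, `z` is positive and the tweet is classified as positive. The two count weights are close in magnitude and opposite in sign, so the boundary lies near the line `pos = neg`.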


## 9. Predict Sentiment for a Tweet

Implement the `predict_tweet` function, which uses the trained weights to compute the probability that a given tweet is positive.

In [14]:
# UNQ_C4 GRADED FUNCTION: predict_tweet
def predict_tweet(tweet, freqs, theta):
    '''
    Input: 
        tweet: a string
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
        theta: (3,1) vector of weights
    Output: 
        y_pred: the probability that the tweet is positive, as a (1,1) array
    '''
    x = extract_features(tweet, freqs)
    y_pred = sigmoid(np.dot(x, theta))
    return y_pred

# Test predictions on sample tweets
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.', 'great', 'great great', 'great great great', 'great great great great']:
    print(f"{tweet} -> {predict_tweet(tweet, freqs, theta)[0,0]:.6f}")

# Test on your own tweet
my_tweet = 'I am learning :)'
print(predict_tweet(my_tweet, freqs, theta))


I am happy -> 0.519259
I am bad -> 0.494338
this movie should have been great. -> 0.515962
great -> 0.516052
great great -> 0.532070
great great great -> 0.548023
great great great great -> 0.563877
[[0.83096623]]


## 10. Evaluate Model Accuracy on Test Set

Implement the `test_logistic_regression` function to compute the accuracy of the model on the test set.

In [15]:
# UNQ_C5 GRADED FUNCTION: test_logistic_regression
def test_logistic_regression(test_x, test_y, freqs, theta, predict_tweet=predict_tweet):
    """
    Input: 
        test_x: a list of tweets
        test_y: (m, 1) vector with the corresponding labels for the list of tweets
        freqs: a dictionary with the frequency of each pair (or tuple)
        theta: weight vector of dimension (3, 1)
    Output: 
        accuracy: (# of tweets classified correctly) / (total # of tweets)
    """
    y_hat = []
    for tweet in test_x:
        y_pred = predict_tweet(tweet, freqs, theta)
        if y_pred > 0.5:
            y_hat.append(1.0)
        else:
            y_hat.append(0.0)
    y_hat = np.array(y_hat)
    test_y = np.squeeze(test_y)
    accuracy = np.sum(y_hat == test_y) / len(test_y)
    return accuracy

tmp_accuracy = test_logistic_regression(test_x, test_y, freqs, theta)
print(f"Logistic regression model's accuracy = {tmp_accuracy:.4f}")


Logistic regression model's accuracy = 0.9965
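
Once features are stacked into a matrix, the same accuracy can be computed without a Python loop. The sketch below runs on a toy feature matrix with invented values (`X_toy`, `y_toy`, and `theta_toy` are hypothetical; the weights are the rounded ones from the training output) and is an alternative formulation, not a replacement for the graded function.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy features (bias, positive count, negative count) and labels, made up for illustration
X_toy = np.array([[1.0, 3000.0, 100.0],
                  [1.0, 50.0, 2500.0],
                  [1.0, 1200.0, 900.0],
                  [1.0, 10.0, 40.0]])
y_toy = np.array([[1.0], [0.0], [0.0], [0.0]])
theta_toy = np.array([[6e-08], [0.00053786], [-0.00055885]])

probs = sigmoid(X_toy @ theta_toy)       # shape (m, 1)
y_hat = (probs > 0.5).astype(float)      # threshold at 0.5
accuracy = np.mean(y_hat == y_toy)       # both (m, 1), so the comparison is elementwise

print(accuracy)  # 0.75
```

Keeping both arrays at the same shape matters: comparing an `(m,)` array with an `(m, 1)` array broadcasts to `(m, m)` and silently gives a wrong accuracy, which is why the loop version squeezes `test_y` first.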


## 11. Perform Error Analysis

Let's analyze tweets that were misclassified by the model and display their processed forms for further inspection.

In [16]:
print('Label\tPredicted\tTweet')
for x, y in zip(test_x, test_y):
    y_hat = predict_tweet(x, freqs, theta)
    if np.abs(y - (y_hat > 0.5)) > 0:
        print('THE TWEET IS:', x)
        print('THE PROCESSED TWEET IS:', process_tweet(x))
        # .item() extracts plain scalars; formatting arrays directly is deprecated in NumPy
        print('%d\t%0.8f\t%s' % (y.item(), y_hat.item(), ' '.join(process_tweet(x)).encode('ascii', 'ignore')))

Label	Predicted	Tweet
THE TWEET IS: @MarkBreech Not sure it would be good thing 4 my bottom daring 2 say 2 Miss B but Im gonna be so stubborn on mouth soaping ! #NotHavingit :p
THE PROCESSED TWEET IS: ['sure', 'would', 'good', 'thing', '4', 'bottom', 'dare', '2', 'say', '2', 'miss', 'b', 'im', 'gonna', 'stubborn', 'mouth', 'soap', 'nothavingit', ':p']
1	0.48899230	b'sure would good thing 4 bottom dare 2 say 2 miss b im gonna stubborn mouth soap nothavingit :p'


THE TWEET IS: off to the park to get some sunlight : )
THE PROCESSED TWEET IS: ['park', 'get', 'sunlight']
1	0.49632433	b'park get sunlight'
THE TWEET IS: @msarosh Uff Itna Miss karhy thy ap :p
THE PROCESSED TWEET IS: ['uff', 'itna', 'miss', 'karhi', 'thi', 'ap', ':p']
1	0.48246197	b'uff itna miss karhi thi ap :p'
THE TWEET IS: @phenomyoutube u probs had more fun with david than me : (
THE PROCESSED TWEET IS: ['u', 'prob', 'fun', 'david']
0	0.50983764	b'u prob fun david'
THE TWEET IS: pats jay : (
THE PROCESSED TWEET IS: ['pat', 'jay']
0	0.50040341	b'pat jay'
THE TWEET IS: my beloved grandmother : ( https://t.co/wt4oXq5xCf
THE PROCESSED TWEET IS: ['belov', 'grandmoth']
0	0.50000001	b'belov grandmoth'
THE TWEET IS: Sr. Financial Analyst - Expedia, Inc.: (#Bellevue, WA) http://t.co/ktknMhvwCI #Finance #ExpediaJobs #Job #Jobs #Hiring
THE PROCESSED TWEET IS: ['sr', 'financi', 'analyst', 'expedia', 'inc', 'bellevu', 'wa', 'financ', 'expediajob', 'job', 'job', 'hire']
0	0.50647821	b'sr finan

## 12. Predict Sentiment for Custom Tweet

You can now input your own tweet, process it, predict its sentiment, and display the result.

In [17]:
# Feel free to change the tweet below
my_tweet = 'This is a ridiculously bright movie. The plot was terrible and I was sad until the ending!'
print("Processed tweet:", process_tweet(my_tweet))
y_hat = predict_tweet(my_tweet, freqs, theta)
print("Predicted probability:", y_hat)
if y_hat > 0.5:
    print('Positive sentiment')
else: 
    print('Negative sentiment')

Processed tweet: ['ridicul', 'bright', 'movi', 'plot', 'terribl', 'sad', 'end']
Predicted probability: [[0.48122783]]
Negative sentiment
