In the last DataLab you implemented a Naive Bayes classifier. You created a frequency table. In this DataLab you will use this table to create features for a logistic regression algorithm, and use scikit-learn to build the model.

**Chapter 5 of the book "Speech and Language Processing" is referenced in this notebook.**

First, repeat the steps from the last DataLab up to and including Task 3. In other words, create this table again:

|$w_i$| count($w_i$, +) | count($w_i$, -) |
| ----------- | ----------- |----------- |
|(-:|1|0|
|(:|1|6|
|):|6|6|
|--->|1|0|
|happi|161|18|

In [41]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import random
import re                                  
import string  

import nltk
from nltk.corpus import twitter_samples
from nltk.corpus import stopwords          
from nltk.stem import PorterStemmer        
from nltk.tokenize import TweetTokenizer

In [42]:
nltk.download('stopwords')
stopwords_english = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\neilr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [43]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\neilr\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [44]:
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

In [45]:
print('Number of positive tweets: ', len(all_positive_tweets))
print('Number of negative tweets: ', len(all_negative_tweets))

print('\nThe type of all_positive_tweets is: ', type(all_positive_tweets))
print('The type of a tweet entry is: ', type(all_negative_tweets[0]))

Number of positive tweets:  5000
Number of negative tweets:  5000

The type of all_positive_tweets is:  <class 'list'>
The type of a tweet entry is:  <class 'str'>


In [46]:
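# print a random positive tweet in green
# (random.choice avoids the off-by-one of randint(0, 5000), whose upper bound is inclusive)
print('\033[92m' + random.choice(all_positive_tweets))

# print a random negative tweet in red
print('\033[91m' + random.choice(all_negative_tweets))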

@tmhcuddly I'm so excited, we should definitely meet up :)
I'm coughing. :(


## 1) Tweet preprocessing

Again, use the function `tweet_processor()` you created previously.

In [47]:
def tweet_processor(tweet):
    """
    Input:
        tweet: a string containing a tweet
    Output:
        processed_tweet: a list of tokens
        
    Processing steps:
    - Removes hyperlinks
    - Removes # sign
    - Tokenizes
    - Removes stopwords and punctuation
    - Stems tokens
        
    """
    # YOUR CODE HERE #
    # Remove URLs using regular expression
    processed_tweet = re.sub(r'http\S+|www\S+|https\S+', '', tweet)
    
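    # Remove only the # sign, keeping the hashtag text
    processed_tweet = re.sub(r'#([^\s]+)', r'\1', processed_tweet)
    
    # Tokenize with TweetTokenizer: lowercase and shorten elongated words.
    # Note: strip_handles=False keeps @handles as tokens, whereas the expected
    # output in Task 1 has handles removed (which strip_handles=True would do).
    tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=False)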
    processed_tweet = tokenizer.tokenize(processed_tweet)
    
    # Remove stopwords
    stopwords_english = set(stopwords.words('english'))
    processed_tweet = [word for word in processed_tweet if word not in stopwords_english]
    
    # Remove punctuation
    processed_tweet = [word for word in processed_tweet if word not in string.punctuation]
    
    # Stem the tokens
    stemmer = PorterStemmer()
    processed_tweet = [stemmer.stem(word) for word in processed_tweet]

    return processed_tweet

Now sanity-check that it works.
    
Example tweet:
    
`My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i`

Expected output:

`['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']`

In [48]:
example_tweet = ('My beautiful sunflowers on a sunny Friday morning off :)'
                 ' #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i')
print(example_tweet)

My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i


In [49]:
tweet_processor(example_tweet)

['beauti',
 'sunflow',
 'sunni',
 'friday',
 'morn',
 ':)',
 'sunflow',
 'favourit',
 'happi',
 'friday',
 '…']

Before going any further, let's split the dataset into training and test sets.

In [50]:
# 80% training 20% testing
positive_tweets_tr = all_positive_tweets[:4000]
positive_tweets_te = all_positive_tweets[4000:]

negative_tweets_tr = all_negative_tweets[:4000]
negative_tweets_te = all_negative_tweets[4000:]

**Task 1 (From the last DataLab)**

The function `tweet_processor()` expects a single tweet, but you have lists of tweets to process. Write a function called `tweet_processor_list()` that accepts a list of strings (tweets) and returns a list of processed tweets. A processed tweet is a list of tokens, so `tweet_processor_list()` should return a list of lists.

The first two items in the `positive_tweets_tr` are:

```
['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)',
 '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!']
 ```
 
 The expected output of `tweet_processor_list()` is:
 
 ```
 [['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)'],
 ['hey',
  'jame',
  'odd',
  ':/',
  'pleas',
  'call',
  'contact',
  'centr',
  '02392441234',
  'abl',
  'assist',
  ':)',
  'mani',
  'thank']]
 
 ```

In [51]:
def tweet_processor_list(tweet_list):
    # YOUR CODE HERE #
    processed_tweet_list = []
    for tweet in tweet_list:
        processed_tweet = tweet_processor(tweet)
        processed_tweet_list.append(processed_tweet)
    return processed_tweet_list

In [52]:
positive_tweets_tr = tweet_processor_list(positive_tweets_tr)
positive_tweets_te = tweet_processor_list(positive_tweets_te)

negative_tweets_tr = tweet_processor_list(negative_tweets_tr)
negative_tweets_te = tweet_processor_list(negative_tweets_te)

**Task 2  (From the last DataLab)**

Now it is time to create the _vocabulary_ as defined in Chapter 4, Section 4.2:

> vocabulary V consists of the union of all the word types in all classes

Combine all the tokens in `positive_tweets_tr` and `negative_tweets_tr` into one big list and get the unique tokens from this list.

The expected length of the vocabulary is `9085` unique tokens. Note that a different train/test split or different preprocessing will change this number.

First 50 tokens in the vocabulary:

```
['(-:',
 '(:',
 '):',
 '--->',
 '-->',
 '->',
 '.\n.',
 '.\n.\n.',
 '. .',
 '. . .',
 '. ..',
 '. ...',
 '..',
 '...',
 '0',
 '0-100',
 '0-2',
 '0.001',
 '0.7',
 '00',
 '00128835',
 '009',
 '00962778381',
 '01282',
 '01482',
 '01:15',
 '01:16',
 '02079',
 '02392441234',
 '0272 3306',
 '0330 333 7234',
 '0345',
 '05.15',
 '07:02',
 '07:17',
 '07:24',
 '07:25',
 '07:32',
 '07:34',
 '08',
 '0878 0388',
 '08962464174',
 '0ne',
 '1',
 '1,300',
 '1,500',
 '1-0',
 '1.300',
 '1.8',
 '1/2']
```

In [53]:
# YOUR CODE HERE #
# Combine all tokens from positive_tweets_tr and negative_tweets_tr
all_tweets = positive_tweets_tr + negative_tweets_tr

# Flatten the list of lists into a single list
all_tokens = [token for tweet in all_tweets for token in tweet]

# Get unique tokens
vocabulary = list(set(all_tokens))

# Sort the vocabulary for consistency
vocabulary.sort()

# Flatten the list of lists
flattened_positive_tweets = [token for sublist in positive_tweets_tr for token in sublist]
flattened_negative_tweets = [token for sublist in negative_tweets_tr for token in sublist]

# Print the first 50 tokens in the vocabulary
#print(vocabulary[:50])

# Print the length of the vocabulary
print("Length of vocabulary:", len(vocabulary))

Length of vocabulary: 14884


**Task 3  (From the last DataLab)**

To evaluate equation 4.12

$P(w_i \mid c) = \frac{\mathrm{count}(w_i, c)}{\sum_{w \in V} \mathrm{count}(w, c)}$

we first need $\mathrm{count}(w_i, c)$, the number of times each token in the vocabulary occurs in class $c$. The collection of these counts is also called the word frequency table.

|$w_i$| count($w_i$, +) | count($w_i$, -) |
| ----------- | ----------- |----------- |
|(-:|1|0|
|(:|1|6|
|):|6|6|
|--->|1|0|
|happi|161|18|
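
As a small worked example of equation 4.12, the sketch below treats the five rows shown above as if they were the whole vocabulary. That is an assumption made purely for illustration: the real denominator sums over every token in $V$. The names `toy_counts` and `likelihood` are hypothetical and not part of the DataLab.

```python
# Toy frequency table: token -> (count in positive class, count in negative class).
# Only the five example rows; the real table covers the whole vocabulary.
toy_counts = {
    '(-:': (1, 0),
    '(:': (1, 6),
    '):': (6, 6),
    '--->': (1, 0),
    'happi': (161, 18),
}

def likelihood(token, class_index, counts):
    """Unsmoothed P(token | class): count(token, c) / sum of count(w, c) over all w."""
    total = sum(pair[class_index] for pair in counts.values())
    return counts[token][class_index] / total

p_happi_pos = likelihood('happi', 0, toy_counts)  # 161 / 170
p_happi_neg = likelihood('happi', 1, toy_counts)  # 18 / 30
print(p_happi_pos, p_happi_neg)
```

Even on this toy table, `happi` is far more likely under the positive class than under the negative one, which is the signal the classifier relies on.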



In [54]:
from collections import Counter
word_count_pos = Counter(flattened_positive_tweets)
word_count_neg = Counter(flattened_negative_tweets)

In [55]:
# Create a dictionary to hold the frequency table
freqs = {}

# Iterate over the vocabulary
for token in vocabulary:
    # Get the count of the token in positive class (default to 0 if not found)
    count_pos = word_count_pos.get(token, 0)
    # Get the count of the token in negative class (default to 0 if not found)
    count_neg = word_count_neg.get(token, 0)
    # Store the counts in the dictionary
    freqs[token] = (count_pos, count_neg)

In [56]:
df = pd.DataFrame.from_dict(freqs, orient='index', columns=['count(w_i, +)', 'count(w_i, -)'])
df.head(10)

| |count(w_i, +)|count(w_i, -)|
|--|--|--|
|#bbmme|2|0|
|#segalakatakata|1|0|
|(-:|1|0|
|(:|1|6|
|):|6|6|
|--->|1|0|
|-->|2|0|
|->|1|0|
|.\n.|0|2|
|.\n.\n.|1|0|


## Logistic regression

Now it is time to create features for a logistic regression model. How can we create features from the following table?

|$w_i$| count($w_i$, +) | count($w_i$, -) |
| ----------- | ----------- |----------- |
|(-:|1|0|
|(:|1|6|
|):|6|6|
|--->|1|0|
|happi|161|18|

Let's create two features, one for the positive counts and one for the negative counts.

- For each token in the tweet get count($w_i$, +) from the table
- Calculate the sum
- This will be your first feature ($x_1$)

Similarly

- For each token in the tweet get count($w_i$, -) from the table
- Calculate the sum
- This will be your second feature ($x_2$)

Finally, repeat this for every tweet in the training and test sets. Let's get back to our example tweet to better understand what you need to do.

Example tweet (raw):
    
`My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i`

Example tweet (processed):

`['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']`

|tokens|count(w_i, +)|count(w_i, -)|
|--|--|--|
|beauti|45|10|
|sunflow|2|0|
|sunni|5|1|
|friday|91|9|
|morn|68|23|
|:)|2847|2|
|sunflow|2|0|
|favourit|9|8|
|happi|161|18|
|friday|91|9|
|…|31|14|
|Total|$x_1$ = 3352|$x_2$ = 94|
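
The totals in the table can be reproduced with a few lines of Python. This is a minimal sketch that hard-codes the counts from the table above in a hypothetical dictionary `example_freqs`; in the notebook itself the counts would come from the full frequency table.

```python
# Counts copied from the table above: token -> (count(w_i, +), count(w_i, -)).
example_freqs = {
    'beauti': (45, 10),
    'sunflow': (2, 0),
    'sunni': (5, 1),
    'friday': (91, 9),
    'morn': (68, 23),
    ':)': (2847, 2),
    'favourit': (9, 8),
    'happi': (161, 18),
    '…': (31, 14),
}

processed = ['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)',
             'sunflow', 'favourit', 'happi', 'friday', '…']

# Repeated tokens ('sunflow', 'friday') contribute once per occurrence.
# Tokens missing from the table would contribute (0, 0).
x1 = sum(example_freqs.get(tok, (0, 0))[0] for tok in processed)
x2 = sum(example_freqs.get(tok, (0, 0))[1] for tok in processed)
print(x1, x2)  # 3352 94
```

Note that the single token `:)` accounts for most of $x_1$, so these two features are dominated by a handful of very frequent, highly class-specific tokens.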

Sanity check your array shapes. Expected outputs:

|Array|Shape|
|--|--|
|X_train|(8000, 2)|
|y_train|(8000,)|
|X_test|(2000, 2)|
|y_test|(2000,)|


X_train output:

```
array([[2847,    2],
       [ 504,   94],
       [   2,    1],
       ...,
       [   0,  378],
       [   1, 3663],
       [   1, 3663]])
```

**Task 1 (First task of this DataLab)**

Create $x_1$ and $x_2$ as described above, for each tweet in the training and test sets.

In [57]:
# Assign labels to the training set (1 for positive, 0 for negative)
y_train = np.concatenate((np.ones(len(positive_tweets_tr)), np.zeros(len(negative_tweets_tr))))

# Assign labels to the test set (1 for positive, 0 for negative)
y_test = np.concatenate((np.ones(len(positive_tweets_te)), np.zeros(len(negative_tweets_te))))

# Initialize empty lists to store features for training and test sets
X1_train = []
X2_train = []
X1_test = []
X2_test = []

# Function to calculate features for a single tweet
def calculate_features(tweet):
    # Initialize counters for positive and negative counts
    count_pos = 0
    count_neg = 0
    
    # Iterate over tokens in the tweet
    for token in tweet:
        # Get the counts for the token from the frequency table
        count_wi_pos, count_wi_neg = freqs.get(token, (0, 0))
        # Increment the counters
        count_pos += count_wi_pos
        count_neg += count_wi_neg
    
    return count_pos, count_neg

# Calculate features for each tweet in the training set
for tweet in positive_tweets_tr:
    count_pos, count_neg = calculate_features(tweet)
    X1_train.append(count_pos)
    X2_train.append(count_neg)

for tweet in negative_tweets_tr:
    count_pos, count_neg = calculate_features(tweet)
    X1_train.append(count_pos)
    X2_train.append(count_neg)

# Calculate features for each tweet in the test set
for tweet in positive_tweets_te:
    count_pos, count_neg = calculate_features(tweet)
    X1_test.append(count_pos)
    X2_test.append(count_neg)

for tweet in negative_tweets_te:
    count_pos, count_neg = calculate_features(tweet)
    X1_test.append(count_pos)
    X2_test.append(count_neg)

# Stack the two feature lists column-wise into (n_samples, 2) arrays
X_train = np.column_stack((X1_train, X2_train))
X_test = np.column_stack((X1_test, X2_test))

In [58]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((8000, 2), (8000,), (2000, 2), (2000,))

**Task 2**

Now you are ready to build a logistic regression model using scikit-learn.

Try with and without normalization (as described in section 5.2 page 83 of the book Speech and Language Processing).
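One common form of normalization is standardization (z-scoring): subtract each feature's mean and divide by its standard deviation, with both statistics estimated on the training set only. The sketch below shows the idea on made-up numbers. The arrays and the helper name `standardize` are hypothetical; scikit-learn's `StandardScaler` performs the same computation.

```python
import numpy as np

def standardize(train, test):
    """Z-score both arrays using the mean and standard deviation of `train` only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return (train - mean) / std, (test - mean) / std

# Made-up feature rows in the same (x1, x2) layout as X_train
toy_train = np.array([[2847.0, 2.0], [504.0, 94.0], [2.0, 1.0], [0.0, 378.0]])
toy_test = np.array([[100.0, 50.0]])

toy_train_z, toy_test_z = standardize(toy_train, toy_test)
print(toy_train_z.mean(axis=0))  # close to 0 for each feature
print(toy_train_z.std(axis=0))   # close to 1 for each feature
```

Reusing the training statistics for the test set keeps information about the test data out of the preprocessing step.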

In [61]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize logistic regression model
logistic_model = LogisticRegression()

# Fit the model to the training data
logistic_model.fit(X_train, y_train)

# Predict labels for the test data
y_pred = logistic_model.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy without normalization:", accuracy)

Accuracy without normalization: 0.994


In [62]:
from sklearn.preprocessing import StandardScaler

# Normalize the features using StandardScaler
scaler = StandardScaler()
X_train_normalized = scaler.fit_transform(X_train)
X_test_normalized = scaler.transform(X_test)

# Initialize logistic regression model
logistic_model_normalized = LogisticRegression()

# Fit the model to the normalized training data
logistic_model_normalized.fit(X_train_normalized, y_train)

# Predict labels for the normalized test data
y_pred_normalized = logistic_model_normalized.predict(X_test_normalized)

# Calculate accuracy score for normalized data
accuracy_normalized = accuracy_score(y_test, y_pred_normalized)
print("Accuracy with normalization:", accuracy_normalized)

Accuracy with normalization: 0.994


Compare the two models you have developed (Naive Bayes and Logistic Regression), considering _5.2.4 Choosing a classifier_ on page 85 of the book Speech and Language Processing.

**Task 3**

Designing new features (discussed on page 83 of the book Speech and Language Processing) is an important part of building models. You created two features in Task 1. Now design your own features and try to improve the model performance. Check the table on page 82 for inspiration.
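As a starting point, here is a minimal sketch of a few hand-designed features computed on the **raw** tweet text. The feature choices (a "no" flag, a pronoun count, an exclamation-mark flag, and a log word count) are illustrative assumptions and not a reproduction of the table on page 82. The helper name `extra_features` and the pronoun list are hypothetical.

```python
import math
import re

# Hypothetical, deliberately small list of first- and second-person pronouns
FIRST_SECOND_PRONOUNS = {'i', 'me', 'my', 'mine', 'we', 'us', 'our', 'ours',
                         'you', 'your', 'yours'}

def extra_features(raw_tweet):
    """Return [has_no, pronoun_count, has_exclamation, log_word_count] for one raw tweet."""
    words = re.findall(r"[a-z']+", raw_tweet.lower())
    has_no = 1 if 'no' in words else 0
    pronoun_count = sum(1 for w in words if w in FIRST_SECOND_PRONOUNS)
    has_exclamation = 1 if '!' in raw_tweet else 0
    log_word_count = math.log(len(words) + 1)  # +1 avoids log(0) on empty tweets
    return [has_no, pronoun_count, has_exclamation, log_word_count]

features = extra_features("No way! I love my new phone")
print(features)
```

The features are computed before preprocessing because `tweet_processor()` removes stopwords and punctuation, which would discard exactly these cues. The resulting columns could then be stacked next to `X_train` and `X_test`, for example with `np.column_stack`.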

In [None]:
# YOUR CODE HERE #

**Task 4**

Use the `SentimentIntensityAnalyzer` from `nltk` to predict the sentiment of the tweets.

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\neilr\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


It tells you if a tweet (or a text in general) is positive or negative.

In [None]:
sia.polarity_scores("The acting was good.")

{'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404}

Notice that a tweet can contain both sentiments at the same time.

In [None]:
sia.polarity_scores("The acting was good, but the story was bad.")

{'neg': 0.347, 'neu': 0.511, 'pos': 0.142, 'compound': -0.5859}

In [None]:
sia.polarity_scores("The acting was bad, but the story was good.")

{'neg': 0.172, 'neu': 0.534, 'pos': 0.294, 'compound': 0.3818}

Use the `compound` score to decide whether a tweet is positive or negative. If the compound is a positive number, the prediction is positive. If it is a negative number the prediction is negative.

If the `compound` is zero it is neither positive, nor negative.

In [None]:
sia.polarity_scores("I feel neutral")

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

Now calculate the accuracy of `SentimentIntensityAnalyzer` on the **raw** test tweets. Decide how you would like to handle neutral tweets. 
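How neutral predictions are handled changes the reported number. The sketch below compares two possible policies on made-up predictions: skipping neutral tweets, or counting them as errors. The function name `accuracy_with_neutral_policy` and the toy data are hypothetical.

```python
def accuracy_with_neutral_policy(predictions, labels, policy='skip'):
    """predictions: 1, 0, or None (neutral); labels: 1 or 0.

    policy 'skip' ignores neutral predictions; 'wrong' counts them as errors.
    """
    correct = 0
    total = 0
    for pred, label in zip(predictions, labels):
        if pred is None:
            if policy == 'wrong':
                total += 1
            continue
        total += 1
        if pred == label:
            correct += 1
    return correct / total if total else 0.0

toy_preds = [1, None, 0, 1, None]
toy_labels = [1, 1, 0, 0, 0]
acc_skip = accuracy_with_neutral_policy(toy_preds, toy_labels, 'skip')    # 2 of 3 scored tweets
acc_wrong = accuracy_with_neutral_policy(toy_preds, toy_labels, 'wrong')  # 2 of 5 scored tweets
print(acc_skip, acc_wrong)
```

Skipping neutral tweets gives a higher accuracy but on fewer tweets, so it is worth reporting how many tweets were actually scored alongside the accuracy.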

In [None]:
# 80% training 20% testing
positive_tweets_tr_raw = all_positive_tweets[:4000]
positive_tweets_te_raw = all_positive_tweets[4000:]

negative_tweets_tr_raw = all_negative_tweets[:4000]
negative_tweets_te_raw = all_negative_tweets[4000:]

In [63]:
# Function to predict sentiment of a tweet using compound score
def predict_sentiment(tweet):
    # Get polarity scores
    scores = sia.polarity_scores(tweet)
    compound_score = scores['compound']
    
    # Decide sentiment based on compound score
    if compound_score > 0:
        return 1  # Positive sentiment
    elif compound_score < 0:
        return 0  # Negative sentiment
    else:
        return np.nan  # Neutral sentiment

# Function to calculate accuracy of SentimentIntensityAnalyzer on test tweets.
# Labels are passed explicitly instead of being looked up by list membership,
# which is slow and ambiguous if the same text occurs in both classes.
def calculate_accuracy(test_tweets_raw, labels):
    correct_predictions = 0
    total_predictions = 0
    
    for tweet, label in zip(test_tweets_raw, labels):
        # Predict sentiment of the tweet
        prediction = predict_sentiment(tweet)
        
        # Neutral predictions (NaN) are excluded from the accuracy
        if not np.isnan(prediction):
            if prediction == label:
                correct_predictions += 1
            total_predictions += 1
    
    # Calculate accuracy over the non-neutral predictions
    accuracy = correct_predictions / total_predictions if total_predictions > 0 else 0
    return accuracy

# Calculate accuracy on raw test tweets (1 = positive, 0 = negative)
test_tweets_raw = positive_tweets_te_raw + negative_tweets_te_raw
test_labels = [1] * len(positive_tweets_te_raw) + [0] * len(negative_tweets_te_raw)
accuracy_sia = calculate_accuracy(test_tweets_raw, test_labels)
print("Accuracy of SentimentIntensityAnalyzer on raw test tweets:", accuracy_sia)


Accuracy of SentimentIntensityAnalyzer on raw test tweets: 0.8786588610963278


**Task 5**

Use the results of `SentimentIntensityAnalyzer` as new features to your Logistic Regression model. Try with and without normalization.
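One possible variant is to use the raw `compound` score as the new feature instead of a discretized positive/negative label. Neutral tweets then get the value 0.0 and no rows have to be dropped. The sketch below illustrates this with a stand-in scoring function so that it runs without downloading the VADER lexicon. The names `append_compound_feature` and `toy_compound` are hypothetical.

```python
import numpy as np

def append_compound_feature(X, raw_tweets, compound_fn):
    """Append one column holding compound_fn(tweet) for every raw tweet."""
    compound = np.array([compound_fn(t) for t in raw_tweets], dtype=float)
    return np.column_stack((X, compound))

# Stand-in scorer for illustration only; in the notebook this could be
# lambda t: sia.polarity_scores(t)['compound']
def toy_compound(tweet):
    if ':)' in tweet:
        return 0.5
    if ':(' in tweet:
        return -0.5
    return 0.0

toy_X = np.array([[2847, 2], [1, 3663], [10, 10]])
toy_tweets = ['so excited :)', 'coughing :(', 'hello']
toy_X_plus = append_compound_feature(toy_X, toy_tweets, toy_compound)
print(toy_X_plus.shape)  # (3, 3)
```

Because every tweet keeps its row, accuracies obtained this way are measured on the full test set and are directly comparable with the results of Task 2.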

In [65]:
# Function to calculate sentiment features for a list of tweets
def calculate_sentiment_features(tweet_list):
    sentiment_features = []
    for tweet in tweet_list:
        prediction = predict_sentiment(tweet)
        sentiment_features.append(prediction)
    return sentiment_features

# Calculate sentiment features for training and test sets
sentiment_features_train = calculate_sentiment_features(positive_tweets_tr_raw + negative_tweets_tr_raw)
sentiment_features_test = calculate_sentiment_features(positive_tweets_te_raw + negative_tweets_te_raw)

# Reshape sentiment features arrays to match the shape of other features
sentiment_features_train = np.array(sentiment_features_train).reshape(-1, 1)
sentiment_features_test = np.array(sentiment_features_test).reshape(-1, 1)

# Concatenate sentiment features with existing features for training and test sets
X_train_with_sentiment = np.concatenate((X_train, sentiment_features_train), axis=1)
X_test_with_sentiment = np.concatenate((X_test, sentiment_features_test), axis=1)

# Remove tweets with NaN (neutral) sentiment values from the training data.
# The filtered labels get their own names so y_train and y_test stay intact
# and this cell can be re-run safely.
nan_indices_train = np.isnan(sentiment_features_train).reshape(-1)
X_train_with_sentiment = X_train_with_sentiment[~nan_indices_train]
y_train_with_sentiment = y_train[~nan_indices_train]

# Remove tweets with NaN (neutral) sentiment values from the test data
nan_indices_test = np.isnan(sentiment_features_test).reshape(-1)
X_test_with_sentiment = X_test_with_sentiment[~nan_indices_test]
y_test_with_sentiment = y_test[~nan_indices_test]

# Train Logistic Regression model without normalization
# (accuracies below are measured on the non-neutral test tweets only)
logistic_model_with_sentiment = LogisticRegression()
logistic_model_with_sentiment.fit(X_train_with_sentiment, y_train_with_sentiment)
y_pred_with_sentiment = logistic_model_with_sentiment.predict(X_test_with_sentiment)
accuracy_with_sentiment = accuracy_score(y_test_with_sentiment, y_pred_with_sentiment)
print("Accuracy of Logistic Regression with Sentiment Features (without normalization):", accuracy_with_sentiment)

# Normalize the features including sentiment features using StandardScaler
scaler = StandardScaler()
X_train_with_sentiment_normalized = scaler.fit_transform(X_train_with_sentiment)
X_test_with_sentiment_normalized = scaler.transform(X_test_with_sentiment)

# Train Logistic Regression model with normalized features including sentiment features
logistic_model_with_sentiment_normalized = LogisticRegression()
logistic_model_with_sentiment_normalized.fit(X_train_with_sentiment_normalized, y_train_with_sentiment)
y_pred_with_sentiment_normalized = logistic_model_with_sentiment_normalized.predict(X_test_with_sentiment_normalized)
accuracy_with_sentiment_normalized = accuracy_score(y_test_with_sentiment, y_pred_with_sentiment_normalized)
print("Accuracy of Logistic Regression with Sentiment Features (with normalization):", accuracy_with_sentiment_normalized)

Accuracy of Logistic Regression with Sentiment Features (without normalization): 0.9962746141564662
Accuracy of Logistic Regression with Sentiment Features (with normalization): 0.9941458222458754
