# Assignment

1. Use pandas read_csv with sep='\t' to read in the following 2 files available from the US Naval Academy:
- url = 'https://www.usna.edu/Users/cs/nchamber/data/twitter/keyword-tweets.txt'
- url = 'https://www.usna.edu/Users/cs/nchamber/data/twitter/general-tweets.txt'
<br/> <span style="color:red" float:right>[1 point]</span>

2. Concatenate these 2 data sets into a single data frame called LabeledTweets that has 2 columns, named Sentiment and Tweet. <span style="color:red" float:right>[1 point]</span>

3. Replace the sentiment labels: 'POLIT' with 1 and 'NOT' with 0. <span style="color:red" float:right>[0 point]</span>

4. Clean the tweets:
   1. Remove all tokens that contain a "@". Remove the whole token, not just the character.
   2. Remove all tokens that contain "http". Remove the whole token, not just the characters.
   3. **Replace** (not remove) all punctuation marks with a space (" ").
   4. **Replace** all numbers with a space.
   5. **Replace** all non-ASCII characters with a space.
   6. Convert all characters to lowercase.
   7. Strip extra whitespace.
   8. Lemmatize tokens.
   9. No need to remove stopwords because TfidfVectorizer will take care of that.
<br/><span style="color:red" float:right>[9 point]</span>

5. Use TfidfVectorizer from sklearn to prepare the data for machine learning.  Use max_features = 50;  <span style="color:red" float:right>[2 point]</span>

6. Use sklearn LogisticRegression to train a model on 75% of the vectorized data. <span style="color:red" float:right>[1 point]</span>

7. Determine the accuracy of your model on the training data and the test data.   Determine the baseline accuracy. <span style="color:red" float:right>[1 point]</span>

8. Repeat steps 5, 6, and 7  with TfidfVectorizer max_features set to 5, 500, 5000, 50000 and discuss your accuracies. <span style="color:red" float:right>[2 point]</span>

# End of assignment

In [1]:
import string
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import cProfile
import seaborn as sns

from scipy.sparse import coo_matrix # this is the sparse matrix format discussed in lecture

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

pd.set_option('display.float_format', lambda x: '%.5f' % x)

import nltk
# One-time downloads: 'wordnet' is needed by WordNetLemmatizer, 'stopwords' by nltk.corpus.stopwords
nltk.download('wordnet')
nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('corpus')

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer


[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# 1. use pandas read_csv with sep='\t' to read in the following 2 files available from the us naval academy:
url_keyword = 'https://www.usna.edu/Users/cs/nchamber/data/twitter/keyword-tweets.txt'
url_general = 'https://www.usna.edu/Users/cs/nchamber/data/twitter/general-tweets.txt'

# read into dataframes
keyword_df = pd.read_csv(url_keyword, sep ='\t', header = None, names = ['col_1', 'text'])
general_df = pd.read_csv(url_general, sep ='\t', header = None, names = ['col_1', 'text'])

# check
print('Keyword shape:', keyword_df.shape)
print(keyword_df.head())

print('general_df:', general_df.shape)
print(general_df.head())

Keyword shape: (2004, 2)
   col_1                                               text
0  POLIT  Global Voices Online Â» Alex Castro: A liberal...
1  POLIT  Do the Conservatives Have a Death Wish? http:/...
2    NOT  @MMFlint I've seen all of your movies and Capi...
3  POLIT  RT @AllianceAlert: * House Dems ask for civili...
4  POLIT  RT @AdamSmithInst Quote of the week: My politi...
general_df: (2000, 2)
  col_1                                               text
0   NOT  Bumping dj sefs mixtape nowww this is my music...
1   NOT  #ieroween THE STORY OF IEROWEEN! THE VIDEO ->>...
2   NOT  trick or treating at the mall today; ZOO! last...
3   NOT  @Ussk81 PMSL!!! I try not to stare but I can't...
4   NOT  @Sc0rpi0n676 btw - is there a remote chance i ...


In [3]:
# 2. concatenate these 2 data sets into a single data frame called LabeledTweets that has 2 columns, named Sentiment and Tweet [1 point]

# Concatenate the two DataFrames
LabeledTweets = pd.concat([keyword_df, general_df], axis=0, ignore_index=True)

# Rename columns
LabeledTweets.columns = ['Sentiment', 'Tweet']


# check
print(LabeledTweets.head())
print(LabeledTweets.shape)
print(LabeledTweets['Sentiment'].value_counts())


  Sentiment                                              Tweet
0     POLIT  Global Voices Online Â» Alex Castro: A liberal...
1     POLIT  Do the Conservatives Have a Death Wish? http:/...
2       NOT  @MMFlint I've seen all of your movies and Capi...
3     POLIT  RT @AllianceAlert: * House Dems ask for civili...
4     POLIT  RT @AdamSmithInst Quote of the week: My politi...
(4004, 2)
Sentiment
NOT      2285
POLIT    1719
Name: count, dtype: int64


In [4]:
# 3. replace sentiment labels 'POLIT': 1, 'NOT': 0; [0 point]

# create a mapping dictionary
sentiment_mapping = {
    'POLIT': 1, # political 
    'NOT': 0 # non-political
}

# apply mapping to the sentiment columns 
LabeledTweets['Sentiment'] = LabeledTweets['Sentiment'].map(sentiment_mapping)

# check
print(LabeledTweets.head())
print(LabeledTweets['Sentiment'].value_counts())

   Sentiment                                              Tweet
0          1  Global Voices Online Â» Alex Castro: A liberal...
1          1  Do the Conservatives Have a Death Wish? http:/...
2          0  @MMFlint I've seen all of your movies and Capi...
3          1  RT @AllianceAlert: * House Dems ask for civili...
4          1  RT @AdamSmithInst Quote of the week: My politi...
Sentiment
0    2285
1    1719
Name: count, dtype: int64
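
One thing worth checking after a dictionary-based relabeling is that no label was left unmapped, since `Series.map` returns NaN for any value that is not a key of the dictionary. The following is a minimal sketch on made-up labels (the misspelled `'polit'` entry is a hypothetical example, not something observed in the tweet files):

```python
import pandas as pd

sentiment_mapping = {'POLIT': 1, 'NOT': 0}

# Toy labels; 'polit' is a deliberately misspelled entry, made up for illustration.
labels = pd.Series(['POLIT', 'NOT', 'polit'])
mapped = labels.map(sentiment_mapping)

# Series.map yields NaN for values that are not keys of the dictionary,
# so the rows with NaN are exactly the labels the mapping did not cover.
unmapped = labels[mapped.isna()]
print(unmapped.tolist())
```

In the notebook above, the value counts after mapping (2285 and 1719) add up to the 4004 rows, which indicates that every label was mapped.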


In [5]:
# 4. clean the tweets

# remove all tokens that contain a "@". Remove the whole token, not just the character.
# remove all tokens that contain "http". Remove the whole token, not just the characters.
# replace (not remove) all punctuation marks with a space (" ")
# replace all numbers with a space
# replace all non ascii characters with a space
# convert all characters to lowercase
# strip extra whitespaces
# lemmatize tokens
# No need to remove stopwords because TfidfVectorizer will take care of that

# create the clean function 

def clean(text, list_of_steps):
    
    for step in list_of_steps:
        # remove all tokens that contain a "@". Remove the whole token, not just the character.
        if step == 'remove_@':             
            text = ' '.join([token for token in text.split() if '@' not in token])
        
        # remove all tokens that contain "http". Remove the whole token, not just the characters.
        elif step == 'remove_http':         
            text = ' '.join([token for token in text.split() if 'http' not in token]) 
        
        # replace (not remove) all punctuation marks with a space (" ")
        elif step == 'replace_punctuation': 
            punc = set(string.punctuation)
            text = ''.join([char if char not in punc else ' ' for char in text])
        
        # replace all numbers with a space
        elif step == 'replace_numbers':
            text = ''.join([char if not char.isdigit() else ' ' for char in text])
        
        # replace all non ascii characters with a space
        elif step == 'replace_non_ascii':    
            text = ''.join([char if ord(char) < 128 else ' ' for char in text])
        
        # convert all characters to lowercase
        elif step == 'lowercase':            
            text = text.lower()
          
        # lemmatize tokens (WordNetLemmatizer treats every token as a noun by default)
        elif step == 'lemmatize':
            lmtzr = WordNetLemmatizer()
            word_list = text.split(' ')
            lemmatized_words = [lmtzr.lemmatize(word) for word in word_list]
            text = ' '.join(lemmatized_words)
        
        # strip extra whitespaces
        elif step == 'strip_whitespace':
            text = ' '.join(text.split())
    
    return text

step_list = ['remove_@', 'remove_http', 'replace_punctuation', 'replace_numbers', 'replace_non_ascii', 'lowercase', 'lemmatize', 'strip_whitespace']


# apply the clean function to the 'Tweet' column
cleaned_tweets = LabeledTweets['Tweet'].map(lambda x: clean(x, step_list))

# Save back to DataFrame
LabeledTweets['clean_tweet'] = cleaned_tweets

# check
print(cleaned_tweets.head())

0    global voice online alex castro a liberal libe...
1                do the conservative have a death wish
2    i ve seen all of your movie and capitalism is ...
3    rt house dems ask for civility at town hall an...
4    rt quote of the week my political opinion lean...
Name: Tweet, dtype: object
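
The non-lemmatizing steps of the cleaning function can also be written with regular expressions. The sketch below is an alternative formulation, not the notebook's `clean` function: `clean_tweet_regex` is a hypothetical helper name, the sample tweet is made up, and lemmatization is left out so the example runs without the NLTK downloads.

```python
import re
import string

# Character class matching any single ASCII punctuation mark.
_PUNCT_RE = re.compile('[' + re.escape(string.punctuation) + ']')


def clean_tweet_regex(text):
    """Hypothetical helper: the cleaning steps above, minus lemmatization."""
    # Drop whole tokens that contain "@" or "http".
    tokens = [t for t in text.split() if '@' not in t and 'http' not in t]
    text = ' '.join(tokens)
    text = _PUNCT_RE.sub(' ', text)             # punctuation -> space
    text = re.sub(r'[0-9]', ' ', text)          # ASCII digits -> space
    text = re.sub(r'[^\x00-\x7f]', ' ', text)   # non-ASCII characters -> space
    text = text.lower()
    return ' '.join(text.split())               # collapse and strip whitespace


sample = "RT @AllianceAlert: Vote2024 is HERE!! http://t.co/abc café"
print(clean_tweet_regex(sample))  # rt vote is here caf
```

Note that replacing non-ASCII characters with a space splits words such as "café" into "caf", which is what the assignment's rules imply.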


In [6]:
# 5. Use TfidfVectorizer from sklearn to prepare the data for machine learning. Use max_features = 50

# initialize the vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=50, stop_words = 'english')

# fit and transform the cleaned tweets
tfidf_matrix = tfidf_vectorizer.fit_transform(LabeledTweets['clean_tweet'])

# convert to array or DataFrame to check
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
print(tfidf_df.head())

   afghanistan  better    care     com  congress  conservative     day  \
0      0.00000 0.00000 0.00000 0.00000   0.00000       0.00000 0.00000   
1      0.00000 0.00000 0.00000 0.00000   0.00000       1.00000 0.00000   
2      0.00000 0.00000 0.00000 0.00000   0.00000       0.00000 0.00000   
3      0.00000 0.00000 0.00000 0.00000   0.00000       0.00000 0.00000   
4      0.00000 0.00000 0.00000 0.00000   0.00000       0.00000 0.00000   

      did     don  economy  ...  stimulus    tcot   think    time   today  \
0 0.00000 0.00000  0.00000  ...   0.00000 0.00000 0.00000 0.00000 0.00000   
1 0.00000 0.00000  0.00000  ...   0.00000 0.00000 0.00000 0.00000 0.00000   
2 0.00000 0.00000  0.00000  ...   0.00000 0.00000 0.00000 0.00000 0.00000   
3 0.00000 0.00000  0.00000  ...   0.00000 0.00000 0.00000 0.00000 0.00000   
4 0.00000 0.00000  0.00000  ...   0.00000 0.00000 0.00000 0.00000 0.00000   

    video      wa    want     way    work  
0 0.00000 0.00000 0.00000 0.00000 0.00000  
1 0.

In [7]:
# 6. Use sklearn LogisticRegression to train a model on the results on 75% of the data.
from sklearn.metrics import accuracy_score, classification_report

# define y
y = LabeledTweets['Sentiment']

# split the data: 75% train, 25% test
# random_state=42 sets a seed for consistent results on each run
# stratify=y keeps the class distribution in the train and test sets the same as in the original data
X_train, X_test, y_train, y_test = train_test_split(
    tfidf_matrix, y, test_size=0.25, random_state=42, stratify=y
)

# initialize and train the logistic regression model
model1 = LogisticRegression(max_iter=1000)  # max_iter increased to ensure convergence
model1.fit(X_train, y_train)

# predict
y_pred = model1.predict(X_test)



In [8]:
# 7. Determine the accuracy of your model on the training data and the test data. Determine the baseline accuracy. 
# evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}\n')
print(classification_report(y_test, y_pred))

# baseline accuracy
# the accuracy obtained by always predicting the majority class, i.e. without learning anything
baseline_accuracy = y.value_counts().max() / len(y)
print(f'Baseline accuracy: {baseline_accuracy:.4f}\n')

Accuracy: 0.8861

              precision    recall  f1-score   support

           0       0.88      0.92      0.90       571
           1       0.89      0.84      0.86       430

    accuracy                           0.89      1001
   macro avg       0.89      0.88      0.88      1001
weighted avg       0.89      0.89      0.89      1001

Baseline accuracy: 0.5707



Model 1 reaches a test accuracy of 0.8861, well above the baseline accuracy of 0.5707, which is what always predicting the majority class (0, non-political) would score on the full data set.
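
Step 7 also asks for the accuracy on the training data. The sketch below shows one way to report training accuracy, test accuracy, and a majority-class baseline side by side. It runs on synthetic data (`X_demo`, `y_demo` are made-up stand-ins, not the tweet features), so the numbers it prints say nothing about the tweet model; in the notebook the same calls would be applied to `model1`, `X_train`, and `X_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the TF-IDF matrix and labels (assumption: not the tweet data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=400) > 0.3).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
train_acc = clf.score(X_tr, y_tr)  # accuracy on the data the model was fit on
test_acc = clf.score(X_te, y_te)   # accuracy on held-out data

# Baseline: always predict the majority class seen in the training split.
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
baseline_acc = baseline.score(X_te, y_te)

print(f'train={train_acc:.4f} test={test_acc:.4f} baseline={baseline_acc:.4f}')
```

A training accuracy far above the test accuracy would point to overfitting, which is useful context when comparing different `max_features` settings.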

In [9]:
# 8. Repeat steps 5, 6, and 7 with TfidfVectorizer max_features set to 5, 500, 5000, 50000 and discuss your accuracies.

# Refit the vectorizer and the model for each vocabulary size.
# 50 comes first so the numbering matches model 1 from steps 5-7.
y = LabeledTweets['Sentiment']
accuracies = {}

for model_number, n_features in enumerate([50, 5, 500, 5000, 50000], start=1):
    # initialize the vectorizer, then fit and transform the cleaned tweets
    tfidf_vectorizer = TfidfVectorizer(max_features=n_features, stop_words='english')
    tfidf_matrix = tfidf_vectorizer.fit_transform(LabeledTweets['clean_tweet'])

    # split the data: 75% train, 25% test
    # random_state=42 sets a seed for consistent results on each run
    # stratify=y keeps the class distribution in the train and test sets the same as in the original data
    X_train, X_test, y_train, y_test = train_test_split(
        tfidf_matrix, y, test_size=0.25, random_state=42, stratify=y
    )

    # train the logistic regression model (max_iter increased to ensure convergence)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # predict on the test set and record the accuracy
    y_pred = model.predict(X_test)
    accuracies[n_features] = accuracy_score(y_test, y_pred)
    print(f'Model {model_number} with feature = {n_features}, Accuracy: {accuracies[n_features]:.4f}\n')


Model 1 with feature = 50, Accuracy: 0.8861

Model 2 with feature = 5, Accuracy: 0.7413

Model 3 with feature = 500, Accuracy: 0.8931

Model 4 with feature = 5000, Accuracy: 0.8691

Model 5 with feature = 50000, Accuracy: 0.8691



**Results**

Model 1 with feature = 50, Accuracy: 0.8861

Model 2 with feature = 5, Accuracy: 0.7413

Model 3 with feature = 500, Accuracy: 0.8931

Model 4 with feature = 5000, Accuracy: 0.8691

Model 5 with feature = 50000, Accuracy: 0.8691

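Test accuracy improves as max_features grows from 5 (0.7413) to 50 (0.8861) to 500 (0.8931), then drops to 0.8691 at both 5000 and 50000. All five models beat the baseline accuracy of 0.5707. With only 5 features the model has too little vocabulary to separate the two classes, while thousands of features bring in many rare terms that appear in only a few tweets, which likely adds noise and encourages overfitting. Of the values tested, max_features = 500 gives the best test accuracy.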
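
One caveat about the comparison: the vectorizer above is fit on all tweets before the train/test split, so the vocabulary and IDF weights are influenced by the test tweets. A possible variation, sketched below, wraps the vectorizer and the classifier in a scikit-learn pipeline so that the vectorizer only sees the training text, and records training and test accuracy for each `max_features` value. The tiny corpus is made up for illustration and stands in for `LabeledTweets['clean_tweet']`; the accuracies it prints are not comparable to the results above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus standing in for LabeledTweets['clean_tweet'] (assumption).
political = ["congress vote tax bill", "senate debate health care",
             "president economy stimulus plan", "election campaign vote senate"]
casual = ["pizza movie night friend", "music mixtape party tonight",
          "mall shopping trick treat", "coffee morning sleepy monday"]
texts = (political + casual) * 10
labels = ([1] * len(political) + [0] * len(casual)) * 10

txt_tr, txt_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

results = {}
for n_features in [5, 50, 500]:
    # The vectorizer is fit inside the pipeline, so it only sees training text.
    pipe = make_pipeline(
        TfidfVectorizer(max_features=n_features, stop_words='english'),
        LogisticRegression(max_iter=1000),
    )
    pipe.fit(txt_tr, y_tr)
    # Store (training accuracy, test accuracy) for this vocabulary size.
    results[n_features] = (pipe.score(txt_tr, y_tr), pipe.score(txt_te, y_te))
    print(n_features, results[n_features])
```

Comparing the two numbers per setting shows whether a larger vocabulary mainly improves the fit to the training data or also helps on held-out tweets.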