# Applied Machine Learning Homework 4
Due 12/15/21 11:59PM EST

### Q1: Natural Language Processing

We will train a supervised learning model to predict whether a tweet has positive or negative sentiment.

#### Dataset loading & dev/test splits

1.1) Load the Twitter dataset from the NLTK library

In [79]:
import nltk
nltk.download('twitter_samples')
from nltk.corpus import twitter_samples 

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/Griffin/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


1.2) Load the positive & negative tweets

In [80]:
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

1.3) Create a development & test split (80/20 ratio):

In [81]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

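# Label each tweet: 1 = positive, 0 = negative
neg_tweet_df = pd.DataFrame(all_negative_tweets, columns=['tweet'])
pos_tweet_df = pd.DataFrame(all_positive_tweets, columns=['tweet'])
pos_tweet_df['sentiment'] = 1
neg_tweet_df['sentiment'] = 0

# Combine both classes into one DataFrame; the 80/20 split is made after preprocessing.
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
result = pd.concat([pos_tweet_df, neg_tweet_df])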



#### Data preprocessing

We will preprocess the data before tokenizing it: remove the `#` symbol, hyperlinks, stop words & punctuation. You can use the `re` package in Python to find and replace these strings.
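As a minimal sketch of these four steps on a single string, using only `re`: the helper name `clean_tweet`, the simplified URL pattern, and the three-word stop list are assumptions made for illustration, not the code this notebook runs.

```python
import re

# Simplified patterns, assumed for illustration only.
URL_PATTERN = re.compile(r"https?://\S+")   # anything starting with http:// or https://
PUNCT_PATTERN = re.compile(r"[^\w\s]")      # any character that is not a word character or whitespace

def clean_tweet(text, stop_words):
    """Hypothetical helper: strip '#', hyperlinks, stop words, then punctuation."""
    text = text.replace("#", "")
    text = URL_PATTERN.sub("", text)
    kept = [word for word in text.split() if word not in stop_words]
    without_punct = PUNCT_PATTERN.sub("", " ".join(kept))
    # Collapse whitespace left behind by removed tokens such as ":)"
    return " ".join(without_punct.split())

sample = "Loving the #sunshine today! https://example.com :)"
cleaned = clean_tweet(sample, stop_words={"the", "a", "is"})
print(cleaned)
```

The order matters: stop words are dropped before punctuation, which is the same order the cells below follow.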

1.4) Replace the `#` symbol with '' in every tweet

In [82]:
import re
# Remove every '#' character (literal match, so no regex is needed)
result['tweet'] = result['tweet'].str.replace('#', '', regex=False)

1.5) Replace hyperlinks with '' in every tweet

In [83]:
# Remove hyperlinks. regex=True is required because pandas 2.0 treats the pattern as a literal string by default.
result['tweet'] = result['tweet'].str.replace(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', regex=True)




1.6) Remove all stop words

In [84]:
from nltk.corpus import stopwords
nltk.download('stopwords')
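stop = set(stopwords.words('english'))  # a set gives fast membership tests
# Matching is case-sensitive and the NLTK list is lowercase, so capitalized stop words such as "The" are kept.
result['tweet_without_stopwords'] = result['tweet'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))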


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Griffin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


1.7) Remove all punctuation

In [85]:
# Drop every character that is neither a word character nor whitespace
result["tweet_without_punct"] = result['tweet_without_stopwords'].str.replace(r'[^\w\s]', '', regex=True)




1.8) Apply stemming on the development & test datasets using the Porter algorithm

In [86]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def stem_sentences(sentence):
    tokens = sentence.split()
    stemmed_tokens = [ps.stem(token) for token in tokens]
    return ' '.join(stemmed_tokens)

result['tweet_stemmed'] = result['tweet_without_punct'].apply(stem_sentences)
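To show what the Porter stemmer does to individual tokens, here is a small sketch on a hand-picked word list (the words are chosen for illustration). NLTK's `PorterStemmer` lowercases its input by default and often returns stems that are not dictionary words, which is why the `X_dev.head()` output in this notebook contains forms such as `alway` and `sorri`.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hand-picked words for illustration; the notebook stems whole tweets instead.
words = ["always", "sorry", "since", "Supporting"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```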


In [87]:
# Split into development and test sets (80/20)
X_dev, X_test, y_dev, y_test = train_test_split(result['tweet_stemmed'], result['sentiment'], test_size=0.2, random_state=0)

X_dev.head()


2389    rafaelallmark ive support sinc start alway ign...
4275    kirkherbstreit braxton gone pro i feel bad him...
2995    williamhc3 hi would like impastel concert let ...
316                           boybandsftcara sure sorri x
356     good luck lizaminnelli upcom uk appear garyjho...
Name: tweet_stemmed, dtype: object

#### Model training

1.9) Create bag of words features for each tweet in the development dataset

In [88]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

dev_x = vectorizer.fit_transform(X_dev)
test_x = vectorizer.transform(X_test)
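A toy sketch of what the bag-of-words step produces, assuming a made-up three-sentence corpus: the vocabulary is learned from the development texts only, and words that appear only in the test text are ignored by `transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up texts standing in for X_dev and X_test.
toy_dev = ["good luck good day", "bad day"]
toy_test = ["good night"]

toy_vectorizer = CountVectorizer()
toy_dev_counts = toy_vectorizer.fit_transform(toy_dev)   # learns the vocabulary, then counts
toy_test_counts = toy_vectorizer.transform(toy_test)     # reuses the learned vocabulary

# Columns follow the alphabetically sorted vocabulary.
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print(toy_dev_counts.toarray())
print(toy_test_counts.toarray())
```

Fitting on the development set alone, as the cell above does, keeps the test vocabulary from leaking into the features.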



1.10) Train a supervised learning model of choice on the development dataset

In [89]:
from sklearn.linear_model import LogisticRegressionCV

model1 = LogisticRegressionCV(max_iter=1000)

model1.fit(dev_x, y_dev)



LogisticRegressionCV(max_iter=1000)

In [90]:
score = model1.score(test_x, y_test)
print(score)

0.75


1.11) Create TF-IDF features for each tweet in the development dataset

In [91]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfIdfVectorizer=TfidfVectorizer(use_idf=True)

tfIdf_x_train = tfIdfVectorizer.fit_transform(X_dev)
tfIdf_x_test = tfIdfVectorizer.transform(X_test)
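The following sketch, on a made-up two-document corpus, illustrates how TF-IDF reweights counts. With scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), the inverse document frequency of a term is `ln((1 + n) / (1 + df)) + 1`, so a term that appears in every document gets the lowest weight, and each row is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration.
toy_docs = ["good day", "bad day"]

toy_tfidf = TfidfVectorizer(use_idf=True)
toy_matrix = toy_tfidf.fit_transform(toy_docs)

terms = sorted(toy_tfidf.vocabulary_)   # column order: bad, day, good
doc_freq = np.array([1, 2, 1])          # number of documents containing each term
expected_idf = np.log((1 + len(toy_docs)) / (1 + doc_freq)) + 1

print(dict(zip(terms, toy_tfidf.idf_.round(3))))
print("matches formula:", np.allclose(toy_tfidf.idf_, expected_idf))

# Each document vector is L2-normalized by default.
row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)
print(row_norms)
```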

1.12) Train the same supervised learning algorithm on the development dataset with TF-IDF features

In [92]:
model2 = LogisticRegressionCV(max_iter=1000)

model2.fit(tfIdf_x_train, y_dev)


LogisticRegressionCV(max_iter=1000)

1.13) Compare the performance of the two models on the test dataset

In [93]:
score = model2.score(tfIdf_x_test, y_test)
print(score)

0.7595


In [None]:
# The TF-IDF model reaches 75.95% test accuracy versus 75.00% for the bag-of-words model,
# a gap of 0.95 percentage points, so TF-IDF only slightly outperforms bag of words on this split.
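Accuracy alone hides which class each model gets wrong. One way to extend the comparison is to look at a confusion matrix next to the accuracy; the sketch below uses made-up labels and predictions, and the helper name `summarize` is hypothetical.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

def summarize(y_true, y_pred):
    """Hypothetical helper: accuracy plus a 2x2 confusion matrix (rows = true, columns = predicted)."""
    accuracy = accuracy_score(y_true, y_pred)
    matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return accuracy, matrix

# Made-up values standing in for y_test and each model's predictions.
toy_true = [1, 1, 0, 0, 1, 0]
toy_pred_bow = [1, 0, 0, 1, 1, 0]
toy_pred_tfidf = [1, 1, 0, 1, 1, 0]

for name, predictions in [("bag of words", toy_pred_bow), ("TF-IDF", toy_pred_tfidf)]:
    accuracy, matrix = summarize(toy_true, predictions)
    print(name, round(accuracy, 3))
    print(matrix)
```

In this notebook the same helper could be called with `y_test` and `model1.predict(test_x)` or `model2.predict(tfIdf_x_test)`.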