# Twitter Sentiment Analysis

**Author : Kristina Zakaryan**

Steps taken to complete the project:
- Data Preprocessing (Tokenization, Stop-words Removal, Stemming, TF-IDF Vectorization) 
- Model Training 
- Interpretation of Results and Possible Improvements

### Installation and Setup

In [None]:
!pip3 install kaggle

In [1]:
# all the necessary imports in one place
from kaggle.api.kaggle_api_extended import KaggleApi

import pandas as pd
import numpy as np
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

from sklearn.metrics import accuracy_score, confusion_matrix

In [2]:
# downloading the dataset from kaggle
api = KaggleApi()
api.authenticate()

PATH = "kazanova/sentiment140"

api.dataset_download_files(PATH, path="data/", unzip=True)

Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140


### Initial Setup

In [3]:
nltk.download('stopwords')
nltk.download('punkt') # needed for word_tokenize
stop_words = set(stopwords.words('english'))
porter_stemmer = PorterStemmer()

[nltk_data] Downloading package stopwords to /Users/tina/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/tina/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Data Preprocessing

Since column names were not initially included in the dataset, we add them manually:

In [4]:
column_names = ["target", "ids", "date", "flag", "user", "text"]

data = pd.read_csv('data/training.1600000.processed.noemoticon.csv', encoding="ISO-8859-1", header=None, names=column_names)

In [5]:
data.head()

| | target | ids | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


The dataset contains 2 labels: **0** - negative, **4** - positive

In [6]:
set(data['target'])

{0, 4}

In [7]:
data.shape

(1600000, 6)

In [8]:
data.isna().sum()

target    0
ids       0
date      0
flag      0
user      0
text      0
dtype: int64

There are no missing values in the dataset.

Now let's understand the distribution of the target variable:

In [9]:
data['target'].value_counts()

target
0    800000
4    800000
Name: count, dtype: int64

The dataset is perfectly balanced.

In [10]:
data.replace({'target' : {4:1}}, inplace=True) # makes the target variable labels more intuitive

In [11]:
set(data['target'])

{0, 1}

Most of the columns are irrelevant to our task, so we can remove them:

In [12]:
data.drop(["ids", "date", "flag", "user"], axis=1, inplace=True)

Now it is time to work on the textual part of our dataset. Let's first remove uninformative tokens via regular expressions. Here are some considerations:
1) The tweets contain a large number of mentions (starting with @) and links, which carry no sentiment signal for our model and thus need to be removed.
2) Numbers will be removed too: the same number can have a positive connotation in one context and a negative one in another, so they only add ambiguity to our dataset.

In [13]:
def regex_preprocessor(tweet):
    # remove @mentions
    text = re.sub(r'@\w+', '', tweet)
    
    # remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)

    # replace everything except lowercase letters and whitespace (symbols, digits, etc.)
    text = re.sub(r'[^a-z\s]', ' ', text)

    # collapse multiple spaces
    return re.sub(r'\s+', ' ', text).strip()
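To make the effect of these rules concrete, here is a minimal sketch that applies the same four substitutions to a made-up tweet. The function name `clean_tweet` and the sample text are hypothetical and used only for illustration.

```python
import re

def clean_tweet(tweet):
    """Apply the same cleaning steps as regex_preprocessor to a lowercased tweet."""
    text = re.sub(r'@\w+', '', tweet)             # drop @mentions
    text = re.sub(r'http\S+|www\S+', '', text)    # drop URLs
    text = re.sub(r'[^a-z\s]', ' ', text)         # keep only lowercase letters and whitespace
    return re.sub(r'\s+', ' ', text).strip()      # collapse repeated whitespace

# Made-up sample, lowercased first as the full pipeline does
sample = "@some_user check http://example.com - I can't wait, 2 days left!!".lower()
cleaned = clean_tweet(sample)
print(cleaned)  # check i can t wait days left
```

Note that the apostrophe is replaced by a space, so "can't" is split into "can" and "t" before tokenization.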

Another important step in our NLP pipeline is **Tokenization**. This is the stage where we split each tweet into individual units (tokens) that the later steps can process one by one.

In [14]:
def tokenizer(tweet):
    return word_tokenize(tweet)

Stop words (such as "the", "is", "and") carry little information for our task, so we remove them from the dataset.

In [15]:
def stop_words_remover(tokenized_tweet):
    return [word for word in tokenized_tweet if word not in stop_words]
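One caveat worth keeping in mind: NLTK's English list also contains negations such as "not", so removing stop words can make a negative tweet look like a positive one. The sketch below shows the effect with a tiny hand-written stop-word set (an assumption made for illustration; the notebook itself uses NLTK's full list) and one possible workaround, which is to keep negation words.

```python
# Tiny hand-written stop-word set, used here only for illustration
toy_stop_words = {"i", "am", "the", "this", "is", "a", "not"}

def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the given stop-word set."""
    return [token for token in tokens if token not in stop_words]

positive = "i am happy".split()
negative = "i am not happy".split()

# Both tweets collapse to the same token list once "not" is removed
print(remove_stop_words(positive, toy_stop_words))  # ['happy']
print(remove_stop_words(negative, toy_stop_words))  # ['happy']

# Possible improvement: exclude negation words from the stop-word set
negation_words = {"not", "no", "nor"}
stop_words_keeping_negation = toy_stop_words - negation_words
print(remove_stop_words(negative, stop_words_keeping_negation))  # ['not', 'happy']
```

Whether keeping negations actually improves accuracy on this dataset would need to be checked experimentally.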

A common way to preprocess text data is **Stemming**: reducing each word to its stem. This shrinks the vocabulary without losing a significant amount of the information contained in the dataset.

An alternative would be **Lemmatization**, but it is a heavier operation. Since the tweets are later transformed via TF-IDF vectorization, which only needs consistent tokens and does not require them to be readable words, Stemming was chosen as the preprocessing step.

In [16]:
porter_stemmer = PorterStemmer()

def stemmer(word):
    return porter_stemmer.stem(word)    
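As a quick illustration of why stemming shrinks the vocabulary, the sketch below runs the Porter stemmer on several inflected forms of one word. The word list is an assumption chosen for the example.

```python
from nltk.stem.porter import PorterStemmer

demo_stemmer = PorterStemmer()

# Four surface forms of the same verb (illustrative list)
variants = ["update", "updates", "updated", "updating"]
stems = [demo_stemmer.stem(word) for word in variants]

print(stems)            # ['updat', 'updat', 'updat', 'updat']
print(len(set(stems)))  # 1
```

The stem "updat" is not a dictionary word, which is acceptable here because TF-IDF only needs identical forms to map to the same token.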

Below is the full preprocessing pipeline that gathers all the methods defined above:

In [17]:
def preprocessing_pipeline(tweet):
    tweet_lower = tweet.lower()
    tweet_regex = regex_preprocessor(tweet_lower)
    tweet_array_tok = tokenizer(tweet_regex)
    tweet_wo_sw = stop_words_remover(tweet_array_tok)
    tweet_stemmed = [stemmer(word) for word in tweet_wo_sw]
    return ' '.join(tweet_stemmed)

In [18]:
data['text_preprocessed'] = data['text'].apply(preprocessing_pipeline)

In [19]:
data['text_preprocessed'].head()

0              bummer shoulda got david carr third day
1    upset updat facebook text might cri result sch...
2         dive mani time ball manag save rest go bound
3                      whole bodi feel itchi like fire
4                                        behav mad see
Name: text_preprocessed, dtype: object

In [20]:
X = data['text_preprocessed'].values
y = data['target'].values

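To feed the values into the model, we split them with the train_test_split method provided by scikit-learn, which holds out 25% of the data for testing by default. It is also important to shuffle the dataset to avoid bias based on the order of particular data points. Passing **stratify=y** would preserve the label balance exactly; the call below relies on shuffling alone, which on a balanced dataset of this size still leaves the two classes close to 50/50 in each split.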

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True)
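To show what the `stratify` argument changes, here is a small sketch on toy labels. The arrays `X_toy` and `y_toy` and the value `random_state=0` are assumptions added for reproducibility and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy balanced dataset: 500 negative and 500 positive labels
y_toy = np.array([0] * 500 + [1] * 500)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Shuffled split without stratification: class counts in the test set may drift
_, _, _, y_test_plain = train_test_split(
    X_toy, y_toy, shuffle=True, random_state=0
)

# Stratified split: class proportions are preserved in both parts
_, _, _, y_test_strat = train_test_split(
    X_toy, y_toy, shuffle=True, stratify=y_toy, random_state=0
)

print(np.bincount(y_test_plain))  # counts per class, not necessarily equal
print(np.bincount(y_test_strat))  # [125 125]
```

With the default `test_size` of 0.25, the stratified test set contains exactly 125 samples of each class.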

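Now it is time to turn our text into numerical vectors. For that we will use the TF-IDF vectorizer, which gives more weight to words that are rare across documents, thus minimizing the impact of very common words that appear in almost all documents. The vectorizer is fitted on the training set only and then applied to the test set, so no information from the test tweets leaks into the vocabulary or the IDF weights.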

In [22]:
vectorizer = TfidfVectorizer()

In [23]:
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
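The weighting can be inspected on a toy corpus. In the sketch below, `toy_docs` and `toy_vectorizer` are hypothetical names; with scikit-learn's default settings the IDF of a term is ln((1 + n) / (1 + df)) + 1, so a word that occurs in every document gets the lowest possible weight of 1.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three tiny "documents": "day" appears in all, "good" in two, "bad" in one
toy_docs = ["good day", "bad day", "good good day"]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# Map each vocabulary term to its learned IDF weight
idf = {term: toy_vectorizer.idf_[index]
       for term, index in toy_vectorizer.vocabulary_.items()}
for term in sorted(idf, key=idf.get):
    print(term, round(idf[term], 3))

# Words unseen during fitting are ignored at transform time
unseen = toy_vectorizer.transform(["awful day"])
print(toy_matrix.shape, unseen.shape)  # (3, 3) (1, 3)
```

The last two lines mirror the notebook's fit-on-train, transform-on-test pattern: the test matrix always has the same number of columns as the training matrix.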

In [24]:
X_train.shape

(1200000, 182523)

In [25]:
X_test.shape

(400000, 182523)

In [26]:
def display_accuracy_conf_mat(y_real, y_pred):
    print('Model test Accuracy: ', accuracy_score(y_real, y_pred))
    print('Confusion Matrix', confusion_matrix(y_real, y_pred))

## Model Selection 

In this section we will train a Machine Learning Classifier to predict if the tweet is negative or positive.

Models that have been considered for the task:
-  Logistic Regression
-  Naive Bayes (Multinomial)

Since our dataset is huge, models like KNN, SVM, and Random Forest were considered too computationally expensive to test here.

In [33]:
log_reg = LogisticRegression(max_iter=10000)

log_reg.fit(X_train, y_train)

y_pred_lreg = log_reg.predict(X_test)

display_accuracy_conf_mat(y_test, y_pred_lreg)

Model test Accuracy:  0.77243
Confusion Matrix [[150252  49359]
 [ 41669 158720]]


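Gaussian Naive Bayes is a common default, but it assumes continuous, normally distributed features and, in scikit-learn, requires a dense array, which is impractical for a sparse matrix of this size. Since we are working with text features derived from term frequencies, it makes sense to use Multinomial Naive Bayes, which models features as counts (discrete events) following a multinomial distribution. TF-IDF values are fractional rather than integer counts, but in practice MultinomialNB handles them well.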
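The practical difference between the two variants can be seen on a toy example. This is a minimal sketch with made-up texts and labels; it only checks which estimator accepts the sparse matrix that TfidfVectorizer produces.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Made-up mini corpus: 1 = positive, 0 = negative
toy_texts = ["love great day", "great fun", "hate bad day", "bad awful"]
toy_labels = [1, 1, 0, 0]

X_sparse = TfidfVectorizer().fit_transform(toy_texts)

# MultinomialNB works directly on the sparse TF-IDF matrix
mnb = MultinomialNB().fit(X_sparse, toy_labels)
toy_predictions = mnb.predict(X_sparse)
print(toy_predictions.shape)  # (4,)

# GaussianNB expects a dense array and rejects sparse input
try:
    GaussianNB().fit(X_sparse, toy_labels)
    gaussian_accepts_sparse = True
except (TypeError, ValueError):
    gaussian_accepts_sparse = False
print(gaussian_accepts_sparse)
```

Converting the real training matrix of shape (1200000, 182523) to a dense float64 array would need roughly 1.75 TB of memory, so GaussianNB is not a realistic option here.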

In [29]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()

nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)

display_accuracy_conf_mat(y_test, y_pred_nb)

Model test Accuracy:  0.755465
Confusion Matrix [[153187  46424]
 [ 51390 148999]]
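Accuracy alone hides how the errors are distributed between the classes. The sketch below recomputes a few per-class metrics from the two confusion matrices reported above, using scikit-learn's layout (rows are true labels, columns are predictions). The helper `summarize` is a hypothetical name introduced for this example.

```python
def summarize(conf_mat):
    """Derive basic metrics from a 2x2 confusion matrix [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = conf_mat
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision_pos": tp / (tp + fp),  # share of predicted positives that are correct
        "recall_pos": tp / (tp + fn),     # share of true positives that are found
        "recall_neg": tn / (tn + fp),     # share of true negatives that are found
    }

# Confusion matrices copied from the outputs above
log_reg_cm = [[150252, 49359], [41669, 158720]]
nb_cm = [[153187, 46424], [51390, 148999]]

log_reg_metrics = summarize(log_reg_cm)
nb_metrics = summarize(nb_cm)

for name, metrics in [("Logistic Regression", log_reg_metrics),
                      ("Multinomial NB", nb_metrics)]:
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

Logistic Regression recovers positive tweets more often than negative ones, while Multinomial Naive Bayes shows the opposite tendency.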


## Interpretation of Results

We have obtained an accuracy above 75% with both models (about 77.2% for Logistic Regression and 75.5% for Multinomial Naive Bayes), which is a reasonable result for this dataset. If we want to reach a higher accuracy, the next step to consider would be training a neural network with carefully chosen hyperparameters.

However, standard dense feedforward NNs on TF-IDF features rarely outperform Logistic Regression by more than 1-2% unless embeddings, layers, and regularization are carefully tuned.