<a href="https://colab.research.google.com/github/kevinmfreire/sentiment-analysis/blob/main/tweet_dataset_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tweet Dataset Analysis

## First, let's download the datasets from Google Drive to Google Colab.

In [1]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [2]:
!cp -r /gdrive/MyDrive/SharpestMinds/datasets/ /content/

## Now that we have our dataset, we have three objectives:
* Understand the dataset and clean it up
* Build a classification model to predict the Twitter sentiment
* Compare the evaluation metrics of a few classification algorithms

First, we import our libraries.

In [18]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

import re
import string
import nltk

from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from tqdm import tqdm

from sklearn.feature_extraction.text import TfidfVectorizer

# Downloading NLTK packages
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = '0'

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


* Next, we load the dataset and preview the first few rows

In [4]:
# Load dataset then check columns and values
DATASET_COLUMNS  = ["sentiment", "ids", "date", "flag", "user", "text"]
DATASET_ENCODING = "ISO-8859-1"
tweet_df = pd.read_csv("./datasets/sentiment140/tweets.csv", encoding=DATASET_ENCODING , names=DATASET_COLUMNS)
tweet_df.head()

Unnamed: 0,sentiment,ids,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


* Let's check additional information on the dataset

In [5]:
tweet_df.describe(include='O')

Unnamed: 0,date,flag,user,text
count,1600000,1600000,1600000,1600000
unique,774363,1,659775,1581466
top,Mon Jun 15 12:53:14 PDT 2009,NO_QUERY,lost_dog,isPlayer Has Died! Sorry
freq,20,1600000,549,210


* We do not need the date, ids, flag, or user columns, so I will drop them and keep only sentiment and text.

In [6]:
tweet_df.drop(['ids','date', 'flag', 'user'], axis=1, inplace=True)
tweet_df.head()

Unnamed: 0,sentiment,text
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."


* The Kaggle Sentiment140 dataset encodes sentiment as 0=negative, 2=neutral and 4=positive.
* I will replace these codes with the string labels negative, neutral and positive.

In [7]:
to_sentiment = {0: "negative", 2:"neutral", 4: "positive"}
def label_decoder(label):
    return to_sentiment[label]

tweet_df['sentiment'] = tweet_df['sentiment'].apply(label_decoder)
tweet_df.head()

Unnamed: 0,sentiment,text
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."
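
As a minimal sketch of the label decoding above, the same lookup can be tried on a few hand-written values in plain Python. Only the `to_sentiment` mapping comes from the notebook; the `raw_labels` list is made up for illustration.

```python
# Mapping taken from the cell above: Sentiment140 codes -> readable labels.
to_sentiment = {0: "negative", 2: "neutral", 4: "positive"}

def label_decoder(label):
    """Return the string label for a Sentiment140 code (KeyError on unknown codes)."""
    return to_sentiment[label]

# Hypothetical sample of raw codes, not taken from the dataset.
raw_labels = [0, 4, 4, 2, 0]
decoded = [label_decoder(code) for code in raw_labels]
print(decoded)  # ['negative', 'positive', 'positive', 'neutral', 'negative']
```

On a DataFrame, `tweet_df['sentiment'].map(to_sentiment)` performs the same lookup, with the difference that codes missing from the dictionary become NaN instead of raising an error.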


* I store the name of the target column, 'sentiment', in the variable `target`
* Then I keep an untouched copy of the dataset as `original_df`

In [8]:
target = 'sentiment'
original_df = tweet_df.copy(deep=True)

In [9]:
print('\n\033[1mData Dimension:\033[0m Dataset consists of {} columns & {} records.'.format(tweet_df.shape[1], tweet_df.shape[0]))


Data Dimension: Dataset consists of 2 columns & 1600000 records.


In [10]:
# Let's check the dtypes of all columns

tweet_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 2 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   sentiment  1600000 non-null  int64 
 1   text       1600000 non-null  object
dtypes: int64(1), object(1)
memory usage: 24.4+ MB


In [11]:
# Checking the stats of all the columns

tweet_df.describe()

Unnamed: 0,sentiment
count,1600000.0
mean,2.0
std,2.000001
min,0.0
25%,0.0
50%,2.0
75%,4.0
max,4.0


As the outputs above show, the dataset has 1,600,000 records and 1,581,466 unique tweet texts. The raw sentiment codes range from 0 to 4 with a mean of 2.0.

# Data Processing

In [12]:
# Check for empty elements

tweet_df.isnull().sum()

sentiment    0
text         0
dtype: int64

In [None]:
# Remove any missing values
# From the above cell there are no missing values, so we do not run this cell

# dropna(inplace=True) returns None, so assign the result of a plain dropna() instead
# tweet_df = tweet_df.dropna()
# original_df = tweet_df.copy(deep=True)

In [13]:
tweet_df[tweet_df.duplicated()]

Unnamed: 0,sentiment,text
1940,0,and so the editing of 3000 wedding shots begins
2149,0,"im lonely keep me company! 22 female, california"
3743,0,I'm not liking that new iTunes Pricing at all....
3746,0,"cant eat, drink or breath properly thanks to t..."
4163,0,has a cold
...,...,...
1599450,4,Good morning!
1599501,4,getting used to twitter
1599531,4,@KhloeKardashian Definitely my Mom. And Angeli...
1599678,4,goodmorning


In [14]:
# Let's remove duplicated rows (if any)

counter = 0
r, c = original_df.shape

tweet_df_dedup = tweet_df.drop_duplicates()
tweet_df_dedup.reset_index(drop=True, inplace=True)

if tweet_df_dedup.shape==(r,c):
  print('\n\033[1mInference:\033[0m The dataset doesn\'t have any duplicates')
else:
  print(f'\n\033[1mInference:\033[0m Number of duplicates dropped/fixed ---> {r-tweet_df_dedup.shape[0]}')


Inference: Number of duplicates dropped/fixed ---> 16309
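
To make the deduplication rule concrete, here is a plain-Python sketch of what `drop_duplicates()` does with its default settings: it compares whole rows and keeps the first occurrence. The rows below are hypothetical and the helper name `drop_duplicate_rows` is my own.

```python
# Hypothetical (sentiment, text) rows; the third row exactly repeats the first.
rows = [
    (0, "has a cold"),
    (4, "Good morning!"),
    (0, "has a cold"),
    (4, "has a cold"),  # same text, different sentiment: not a duplicate row
]

def drop_duplicate_rows(rows):
    """Keep the first occurrence of each full row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

deduped = drop_duplicate_rows(rows)
dropped = len(rows) - len(deduped)
print(deduped)
print(f"Number of duplicates dropped: {dropped}")  # 1
```

Because the comparison covers both columns, a text that reappears with a different sentiment is kept. That is consistent with the outputs above, where 16,309 rows are dropped although 1,600,000 - 1,581,466 = 18,534 rows repeat an earlier text.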


In [15]:
tweet_df_dedup.head()

Unnamed: 0,sentiment,text
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."


## Let's do some basic text processing such as:
* Convert to lower case
* Tokenisation
* Remove punctuation
* Remove stop words
* Stemming
* Lemmatization

In [16]:
# Cleaning the text

tweet_df_clean = tweet_df_dedup.copy()

# Build these once; re-creating them for every word is very slow on this many tweets
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocessor(text):
  text = re.sub('[^a-zA-Z]', ' ', text)   # replace punctuation, digits and symbols with spaces
  text = text.lower()                     # convert to lowercase
  text = text.strip()                     # remove leading and trailing whitespace
  words = text.split()                    # tokenize on whitespace
  words = [lemmatizer.lemmatize(word, pos='v') for word in words]   # lemmatize as verbs
  words = [word for word in words if word not in stop_words]        # remove stop words
  return ' '.join(words)
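
The cleaning steps can be tried in isolation with the standard library alone. This is a simplified sketch, not the notebook's function: it skips lemmatization, and `STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English stop-word list. The sample sentence is adapted from one of the tweets shown earlier, with a made-up ending.

```python
import re

# Tiny hypothetical stop-word set; the notebook uses NLTK's English list instead.
STOP_WORDS = {"i", "the", "a", "is", "that", "he", "his", "by", "and", "for"}

def simple_clean(text):
    """Keep letters only, lowercase, and drop stop words (no lemmatization)."""
    text = re.sub(r"[^a-zA-Z]", " ", text)  # non-letters become spaces
    tokens = text.lower().split()           # lowercase, then split on whitespace
    return " ".join(t for t in tokens if t not in STOP_WORDS)

sample = "@Kenichan I dived many times for the ball. 100% effort!"
print(simple_clean(sample))  # kenichan dived many times ball effort
```

Note that the regex also removes digits and the `@` of mentions, so user names survive as ordinary words.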

In [None]:
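# Apply the preprocessor once per tweet; looping over every row and re-applying it
# to the whole column each time would repeat the full pass for each of the rows
tqdm.pandas(desc='Cleaning tweets')   # registers progress_apply on pandas objects
tweet_df_clean['text'] = tweet_df_dedup['text'].progress_apply(preprocessor)
tweet_df_clean.head()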

  0%|          | 0/1583691 [00:00<?, ?it/s]

**Inference:** The text is now clean: punctuation and stop words have been removed and the remaining words have been lemmatized.

* Next, we define a tokenizer that splits the text on whitespace and stems each token with the Porter stemmer

In [None]:
porter = PorterStemmer()

def tokenizer_porter(text):
  return [porter.stem(word) for word in text.split()]

Let's extract features using TF-IDF

In [None]:
# norm must be 'l2' (letter l), and fit_transform is called with parentheses
tf_idf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None, tokenizer=tokenizer_porter, use_idf=True, norm='l2', smooth_idf=True)
label = tweet_df_clean[target].values
features = tf_idf.fit_transform(tweet_df_clean.text)
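
To show what the `use_idf=True`, `smooth_idf=True` and L2-norm settings compute, here is a hand-rolled sketch on three hypothetical, already-cleaned documents. It uses raw term counts and the smoothed IDF formula from the scikit-learn documentation, idf(t) = ln((1 + n) / (1 + df(t))) + 1. It skips stemming and is meant for intuition, not as a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

# Hypothetical, already-cleaned documents.
docs = ["good morning", "good night", "bad cold"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    """Raw term counts times IDF, scaled to unit L2 length."""
    counts = Counter(tokens)
    weights = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

matrix = [tfidf_vector(doc) for doc in tokenized]
print(vocab)
for doc, row in zip(docs, matrix):
    print(doc, [round(w, 3) for w in row])
```

In the first document, "morning" gets a larger weight than "good" because "good" also appears in a second document and is therefore less distinctive.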

Let's look at the labels

In [None]:
label

Now let's look at the features

In [None]:
features

## Exploratory Data Analysis (EDA)

In [None]:
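# Let's analyze the distribution of the target values

print('\033[1mTarget Variable Distribution'.center(55))
counts = tweet_df_clean[target].value_counts()
# Take the labels from the counts themselves so each slice is named correctly
plt.pie(counts, labels=[str(name).capitalize() for name in counts.index], counterclock=False, shadow=True,
        explode=[0.1 if i == len(counts) - 1 else 0 for i in range(len(counts))],
        autopct='%1.1f%%', radius=1.5, startangle=0)
plt.show()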
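
The percentages printed on the pie slices are simply each class's share of the total. A minimal sketch with the standard library, using a hypothetical label list in place of `tweet_df_clean[target]`:

```python
from collections import Counter

# Hypothetical labels; in the notebook these come from tweet_df_clean[target].
labels = ["negative", "positive", "positive", "negative", "positive"]

counts = Counter(labels)
total = sum(counts.values())

# Percentage share per class, most frequent first (the order value_counts() uses).
shares = {label: 100 * n / total for label, n in counts.most_common()}
for label, pct in shares.items():
    print(f"{label:>8}: {pct:.1f}%")
```

Since `value_counts()` orders classes by frequency rather than by name, slice labels are safest when read from the index of the counts instead of being written out by hand.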