### The purpose of this notebook is to perform exploratory data analysis on a labeled tweet sentiment dataset

The dataset consists of tweets, each paired with a label classifying its sentiment  
as negative, neutral, or positive

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import os

In [42]:
tweet_df = pd.read_excel('data/LabeledText.xlsx')
tweet_df.head()

,File Name,Caption,LABEL
0,1.txt,How I feel today #legday #jelly #aching #gym,negative
1,10.txt,@ArrivaTW absolute disgrace two carriages from...,negative
2,100.txt,This is my Valentine's from 1 of my nephews. I...,positive
3,1000.txt,betterfeelingfilms: RT via Instagram: First da...,neutral
4,1001.txt,Zoe's first love #Rattled @JohnnyHarper15,positive


Let's start by checking for missing or duplicated values,  
then look at the data type of each column and the total number of rows,  
and finally check for class imbalance

In [4]:
tweet_df.isnull().sum()

File Name    0
Caption      0
LABEL        0
dtype: int64

In [5]:
tweet_df.duplicated().sum()

0
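
If `File Name` is unique per row, as the head above suggests, a whole-row `duplicated()` can report 0 even when the same tweet text occurs more than once. A minimal sketch of checking the text column on its own, using a made-up three-row frame (the sample rows are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample: two rows share a caption but have different file names
sample_df = pd.DataFrame({
    'File Name': ['1.txt', '2.txt', '3.txt'],
    'Caption': ['great day', 'great day', 'missed my train'],
    'LABEL': ['positive', 'positive', 'negative'],
})

# Whole-row comparison: no duplicates, because the file names differ
full_row_dupes = int(sample_df.duplicated().sum())

# Comparing on the tweet text only flags the repeated caption
caption_dupes = int(sample_df.duplicated(subset=['Caption']).sum())

print(full_row_dupes, caption_dupes)
```

Running the same `subset=['Caption']` check on `tweet_df` would show whether any tweets are repeated under different file names.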

In [6]:
tweet_df.dtypes

File Name    object
Caption      object
LABEL        object
dtype: object

In [7]:
tweet_df.shape

(4869, 3)

In [21]:
neutral_count = tweet_df[tweet_df['LABEL'] == 'neutral'].shape[0]
negative_count = tweet_df[tweet_df['LABEL'] == 'negative'].shape[0]
positive_count = tweet_df[tweet_df['LABEL'] == 'positive'].shape[0]
total_count = tweet_df.shape[0]

print(f'Negative Percentage:{round(negative_count/total_count*100)}%' +
      f'\nNeutral Percentage:{round(neutral_count/total_count*100)}%'+ 
      f'\nPositive Percentage:{round(positive_count/total_count*100)}%')

Negative Percentage:30%
Neutral Percentage:36%
Positive Percentage:34%
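
The same kind of breakdown can be read off in a single call. A sketch using `value_counts(normalize=True)` on a small hypothetical label column (the ten labels below are invented for illustration and do not reproduce the dataset's proportions):

```python
import pandas as pd

# Hypothetical labels standing in for tweet_df['LABEL']
labels = pd.Series(['negative'] * 3 + ['neutral'] * 4 + ['positive'] * 3, name='LABEL')

# normalize=True returns each class's share of the rows instead of its raw count
class_share = labels.value_counts(normalize=True).mul(100).round().astype(int)

print(class_share.sort_index())
```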


At 30%, 36%, and 34%, the three classes are roughly balanced,  
so we shouldn't need any upsampling or downsampling  
to keep the model from favoring one class

Since all we want is to learn how to predict the label  
for a given tweet, we only need to keep the caption and  
label columns

In [43]:
tweet_df = tweet_df.drop(columns=['File Name'])

In [28]:
print(f'Unique Labels:{tweet_df.LABEL.nunique()}')

Unique Labels:3


We have the expected number of class labels.  
Let's move on to cleaning up the tweets by stripping HTML tags  
and any non-alphanumeric characters (links lose their punctuation but keep their letters)

In [47]:
# remove HTML tags such as <br>
tweet_df['Caption'] = tweet_df['Caption'].str.replace(r'<[^<>]*>', '', regex=True)
# drop every character that is not a letter, digit, or space
tweet_df['Caption'] = tweet_df['Caption'].str.replace(r'[^A-Za-z0-9 ]+', '', regex=True)

tweet_df.head()

,Caption,LABEL
0,How I feel today legday jelly aching gym,negative
1,ArrivaTW absolute disgrace two carriages from ...,negative
2,This is my Valentines from 1 of my nephews I a...,positive
3,betterfeelingfilms RT via Instagram First day ...,neutral
4,Zoes first love Rattled JohnnyHarper15,positive
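
The two patterns in the cleaning cell do not delete a link; they only strip its punctuation, so something like `http://t.co/abc` would survive as a single token `httptcoabc`. A minimal sketch of removing links first, written with the standard `re` module so it can be tried on one string at a time. The `clean_tweet` helper, the link pattern, and the sample string are assumptions made for illustration, not part of the original notebook:

```python
import re

def clean_tweet(text: str) -> str:
    """Illustrative cleaner: strips links, then HTML tags, then non-alphanumerics."""
    # Assumption: links start with http:// or https:// and end at the next whitespace
    text = re.sub(r'https?://\S+', '', text)
    # The same two patterns used in the cleaning cell
    text = re.sub(r'<[^<>]*>', '', text)
    text = re.sub(r'[^A-Za-z0-9 ]+', '', text)
    # Collapse the extra spaces left behind by removed tokens
    return re.sub(r' +', ' ', text).strip()

cleaned = clean_tweet("Zoe's first <b>love</b> #Rattled http://t.co/abc123")
print(cleaned)
```

To apply it to the frame, one option would be `tweet_df['Caption'].map(clean_tweet)` in place of the two `str.replace` calls.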


In [51]:
tweet_df = tweet_df.rename(columns={'Caption':'tweets','LABEL':'labels'})
tweet_df.head()

,tweets,labels
0,How I feel today legday jelly aching gym,negative
1,ArrivaTW absolute disgrace two carriages from ...,negative
2,This is my Valentines from 1 of my nephews I a...,positive
3,betterfeelingfilms RT via Instagram First day ...,neutral
4,Zoes first love Rattled JohnnyHarper15,positive


Let's now encode our labels column using label encoding

In [52]:
from sklearn.preprocessing import LabelEncoder

tweet_lenc = LabelEncoder()
tweet_df['labels'] = tweet_lenc.fit_transform(tweet_df['labels'])

tweet_df.head()

,tweets,labels
0,How I feel today legday jelly aching gym,0
1,ArrivaTW absolute disgrace two carriages from ...,0
2,This is my Valentines from 1 of my nephews I a...,2
3,betterfeelingfilms RT via Instagram First day ...,1
4,Zoes first love Rattled JohnnyHarper15,2
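
`LabelEncoder` assigns integers in sorted order of the class names, which is why negative, neutral and positive become 0, 1 and 2 in the output above. A plain-Python sketch of that mapping follows; the variable names are illustrative. With the fitted encoder itself, `tweet_lenc.classes_` and `tweet_lenc.inverse_transform` expose the same information.

```python
# Class names as they appear in the labels column
class_names = ['negative', 'neutral', 'positive']

# Sorted, de-duplicated classes; the position in this list is the encoded integer
classes = sorted(set(class_names))
label_to_id = {name: idx for idx, name in enumerate(classes)}
id_to_label = {idx: name for name, idx in label_to_id.items()}

# Labels of the first five rows shown earlier in the notebook
first_five = ['negative', 'negative', 'positive', 'neutral', 'positive']
encoded = [label_to_id[name] for name in first_five]
decoded = [id_to_label[idx] for idx in encoded]

print(encoded)
print(decoded)
```

Keeping the reverse mapping around is useful later for turning model predictions back into readable labels.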


We'll now move on to tokenizing and padding our tweets

In [70]:
import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer()
# fit the tokenizer on the tweets to build the vocabulary
tokenizer.fit_on_texts(tweet_df['tweets'])
# retrieve the integer sequences so we can later pad them to equal length
text_sequences = tokenizer.texts_to_sequences(tweet_df['tweets'])
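
As a rough picture of what the two tokenizer calls do: `fit_on_texts` builds a word-to-index vocabulary ranked by frequency, and `texts_to_sequences` swaps each word for its index. The sketch below is a simplified plain-Python stand-in, not the Keras implementation. `fit_word_index` and `to_sequences` are hypothetical helpers, the sample tweets are invented, and the punctuation filtering that the real `Tokenizer` applies by default is left out.

```python
from collections import Counter

def fit_word_index(texts):
    """Lowercase, split on whitespace, and rank words by frequency.

    The most frequent word gets index 1; index 0 is left free for padding.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() orders by count; equal counts keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

sample_tweets = ['first day at the gym', 'gym day']
word_index = fit_word_index(sample_tweets)
sequences = to_sequences(sample_tweets, word_index)

print(word_index)
print(sequences)
```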

In [72]:
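# pad all sequences to the length of the longest one
text_sequences = tf.keras.preprocessing.sequence.pad_sequences(text_sequences)
len_seq = len(text_sequences[0])  # padded length shared by every sequence
num_seq = len(text_sequences)     # number of tweets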
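
With its default arguments, `pad_sequences` pads at the front with zeros up to the length of the longest sequence and returns a 2-D array. A plain-Python sketch of that behaviour, where `pad_pre` is a hypothetical helper that returns lists rather than an array and the input sequences are invented:

```python
def pad_pre(sequences, value=0):
    """Left-pad every sequence with `value` up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [[value] * (max_len - len(seq)) + list(seq) for seq in sequences]

padded = pad_pre([[3, 1, 4, 5, 2], [2, 1]])
print(padded)

sketch_len_seq = len(padded[0])  # length shared by every padded sequence
sketch_num_seq = len(padded)     # number of sequences
```

Because index 0 is never assigned to a word by the tokenizer, the zeros can be treated as padding by later layers.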