# Twitter Sentiment Analysis

## Introduction
  
This tutorial explains a method for predicting the sentiment of a tweet, using recurrent neural networks (RNNs) as the deep learning model. Industry increasingly exploits data from social networks. For example, sentiment analysis of tweets can be useful for a company that wants to analyse the reception of its new products. However, humans cannot analyse every tweet, so training deep learning models to predict sentiment is a way to scale the analysis.

The data used in this tutorial comes from a study by Stanford researchers. They collected 1.6 million tweets and classified them as expressing positive or negative emotions based on the emoticons written in the same tweet, simulating Facebook. The table below contains the emotion category of the emoticon used. The data was downloaded with the Hugging Face `datasets` library.
  
The tutorial is divided into three parts: first, the steps to clean the tweets and build a dataset for prediction; second, a description of GloVe and its use as a word embedding model; third, the use of an LSTM to predict sentiment. The idea behind using RNNs is that the order of the words in a sequence carries information that can be used to predict the sentiment.

In [50]:
from datasets import load_dataset
import re
import random
from torchtext.data import get_tokenizer
import pandas as pd

## Load Dataset

In [36]:
dataset = load_dataset("sentiment140")

Reusing dataset sentiment140 (/home/nftd/.cache/huggingface/datasets/sentiment140/sentiment140/1.0.0/9fe1c0ce3319c47cc65ff7e49aac6c34d9c050ab1432988c104b3b275e360f3f)


In [37]:
def read_twitts(data,twitt_n= 10000):
    """Read tweets.

    Parameters
    -----------
    data : DatasetDict
        Dataset loaded from Hugging Face
    twitt_n : int
        Number of tweets to sample. This limits memory usage
    
    Returns
    -----------
    data_train : list
        Train data
    labels_train : list
        Train label
    data_test : list
        Test data
    labels_test : list
        Test label
    """
    set_ = 'train'
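    # sample twitt_n distinct row indices from the 1.6 million training tweets
    randomlist = random.sample(range(0, 1600000), twitt_n)
    # 80% of the sampled indices go to training, the rest to test
    trainrandomlist = random.sample(randomlist,int(len(randomlist)*0.8))
    train_index_set = set(trainrandomlist)  # a set makes the membership check fast
    testrandomlist = [index for index in randomlist if index not in train_index_set]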
    #training set
    data_train, labels_train = [],[]
    for i in trainrandomlist:
        data_train.append(data[set_][i]['text'])
        labels_train.append(data[set_][i]['sentiment'])
    #test set
    data_test, labels_test = [],[]
    for i in testrandomlist:
        data_test.append(data[set_][i]['text'])
        labels_test.append(data[set_][i]['sentiment'])
    return data_train,labels_train,data_test,labels_test

In [38]:
train_data, train_label, test_data, test_label = read_twitts(dataset,16000)
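
The sampling and 80/20 split inside `read_twitts` changes on every run because it uses the global random state. The following is a minimal sketch of a seeded variant, assuming reproducibility is wanted. The name `split_indices` and its parameters are hypothetical and are not part of the original notebook.

```python
import random

def split_indices(n_total, n_sample, train_frac=0.8, seed=0):
    """Sample n_sample distinct indices and split them into train/test lists."""
    rng = random.Random(seed)  # local generator, the global random state is untouched
    sampled = rng.sample(range(n_total), n_sample)
    n_train = int(n_sample * train_frac)
    # `sampled` is already in random order, so a plain slice is a random split
    return sampled[:n_train], sampled[n_train:]

train_idx, test_idx = split_indices(n_total=1_600_000, n_sample=16_000)
print(len(train_idx), len(test_idx))
```

With the same seed the function returns the same indices, so the printed examples and the class counts can be compared between runs.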

In [39]:
def data_info(data,label, k = 10):
    print('# trainings:', len(data))
    # first k examples
    for x, y in zip(label[0:k], data[0:k]):
        print('label:', x, 'review:', y)
    # last k examples
    for x, y in zip(label[-k:], data[-k:]):
        print('label:', x, 'review:', y)

In [40]:
data_info(train_data,train_label)

# trainings: 12800
label: 4 review: 91 days till September 1st!!  #philwickham
label: 0 review: oh great more problems with my macbook case cracking 
label: 0 review: WTF!I hate receptionist !Feels bad... 
label: 0 review: @JackAllTimeLow naww i would keep you company but ya know im not there  i'll see you tomorrow
label: 0 review: 3 more followers, somebody unfollowed me 
label: 4 review: uber tired. should probably sleep now  ihave no idea why im still awake. &gt;.&lt;
label: 4 review: Watching the da vinci code 
label: 0 review: my phone's gone for the whole summer break 
label: 0 review: Is not feeling good at all 
label: 0 review: whoooo hoooo i have 40 followers -.-&quot; -.-&quot;&quot; im so laaaaaaaaaame 
label: 0 review: Hate how my sources aren't getting back to me and one of the theaters closed! Stupid cut backs on theater in this county 
label: 4 review: new iphone tomorrow?  Can't wait...
label: 0 review: I miss so many people, I could seriously make a list. That's bad!  

## Data cleaning
This section gives a step-by-step guide to cleaning the text. The objective is to transform the data into numbers so that the models can be trained. It is necessary to understand the structure of a tweet before designing a pipeline. Below is an example of a tweet that was classified as negative:
  
> @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. 

In [41]:
twit = "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it."

The steps to transform the data are the following:
### Remove users
This is a simple step: the objective is to identify and remove user mentions, that is, words that start with an @.

In [42]:
def remove_user(txt):
    return re.sub(r'@\S+','',txt)
twit = remove_user(twit)
print(twit)

 http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it.


In [43]:
train_data_clean = [remove_user(i) for i in train_data]
data_info(train_data_clean,train_label)

# trainings: 12800
label: 4 review: 91 days till September 1st!!  #philwickham
label: 0 review: oh great more problems with my macbook case cracking 
label: 0 review: WTF!I hate receptionist !Feels bad... 
label: 0 review:  naww i would keep you company but ya know im not there  i'll see you tomorrow
label: 0 review: 3 more followers, somebody unfollowed me 
label: 4 review: uber tired. should probably sleep now  ihave no idea why im still awake. &gt;.&lt;
label: 4 review: Watching the da vinci code 
label: 0 review: my phone's gone for the whole summer break 
label: 0 review: Is not feeling good at all 
label: 0 review: whoooo hoooo i have 40 followers -.-&quot; -.-&quot;&quot; im so laaaaaaaaaame 
label: 0 review: Hate how my sources aren't getting back to me and one of the theaters closed! Stupid cut backs on theater in this county 
label: 4 review: new iphone tomorrow?  Can't wait...
label: 0 review: I miss so many people, I could seriously make a list. That's bad!  Gotta re kindil
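
The behaviour of the mention pattern can be checked on a few hand-written strings. This is a small sketch. The example strings and the user names in them are made up for illustration and do not come from the dataset.

```python
import re

MENTION = re.compile(r"@\S+")  # "@" followed by one or more non-whitespace characters

def remove_user(txt):
    return MENTION.sub("", txt)

examples = [
    "@switchfoot Awww, that's a bummer.",
    "thanks @alice and @bob!",
    "mail me at someone@example.com",
]
for text in examples:
    print(repr(remove_user(text)))
```

Two side effects are worth noticing: punctuation attached to a mention is removed with it, and the part of an e-mail address from the @ onwards is removed as well. For sentiment prediction on tweets this is probably acceptable.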

### Remove URL
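Although hyperlinks can carry important information, they are removed from the text. The same regular expression also deletes every character that is not a letter, a digit, a space, or a tab, so punctuation and symbols are removed too.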

In [44]:
def remove_url(txt):
    return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+://\S+)", "", txt).split())
twit = remove_url(twit)
print(twit)

Awww thats a bummer You shoulda got David Carr of Third Day to do it


In [45]:
train_data_clean = [remove_url(i) for i in train_data_clean]
data_info(train_data_clean,train_label)

# trainings: 12800
label: 4 review: 91 days till September 1st philwickham
label: 0 review: oh great more problems with my macbook case cracking
label: 0 review: WTFI hate receptionist Feels bad
label: 0 review: naww i would keep you company but ya know im not there ill see you tomorrow
label: 0 review: 3 more followers somebody unfollowed me
label: 4 review: uber tired should probably sleep now ihave no idea why im still awake gtlt
label: 4 review: Watching the da vinci code
label: 0 review: my phones gone for the whole summer break
label: 0 review: Is not feeling good at all
label: 0 review: whoooo hoooo i have 40 followers quot quotquot im so laaaaaaaaaame
label: 0 review: Hate how my sources arent getting back to me and one of the theaters closed Stupid cut backs on theater in this county
label: 4 review: new iphone tomorrow Cant wait
label: 0 review: I miss so many people I could seriously make a list Thats bad Gotta re kindil those flames
label: 0 review: Unfortunately we couldnt
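
In the output above, HTML entities such as `&gt;` and `&quot;` survive as the tokens `gtlt` and `quot`, because only the `&` and `;` characters are stripped. One possible remedy, which is an assumption and not part of the original pipeline, is to decode the entities with the standard library function `html.unescape` before cleaning. The sketch below also splits the regular expression into two passes to make each step visible. The name `clean_text` is hypothetical.

```python
import html
import re

URL = re.compile(r"\w+://\S+")              # scheme://rest-of-link
NON_ALNUM = re.compile(r"[^0-9A-Za-z \t]")  # anything except letters, digits, space, tab

def clean_text(txt, unescape=True):
    if unescape:
        txt = html.unescape(txt)  # "&gt;" becomes ">", "&quot;" becomes '"'
    txt = URL.sub("", txt)        # drop links first
    txt = NON_ALNUM.sub("", txt)  # then drop punctuation and symbols
    return " ".join(txt.split())  # collapse repeated whitespace

sample = "still awake. &gt;.&lt; http://twitpic.com/2y1zl"
print(clean_text(sample, unescape=False))
print(clean_text(sample, unescape=True))
```

Without decoding, the entity names end up in the vocabulary as ordinary words. With decoding, they turn back into symbols and are removed with the rest of the punctuation.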

### Tokenize
Tokenizing means splitting the text into words by a delimiter, normalizing them with a function, and putting them in a list. In our case, the words are split on spaces and converted to lowercase. The function `get_tokenizer` from torchtext is used. The result is a list of tokens; an example is shown below:

In [48]:
tokenizer = get_tokenizer("basic_english")
print(tokenizer(twit))

['awww', 'thats', 'a', 'bummer', 'you', 'shoulda', 'got', 'david', 'carr', 'of', 'third', 'day', 'to', 'do', 'it']
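
For text that has already been reduced to letters, digits and spaces, the tokenization described above can be approximated in plain Python. This is only a stand-in for illustration, not torchtext's implementation, and `basic_english` may treat raw text with punctuation differently. The name `simple_tokenize` is hypothetical.

```python
def simple_tokenize(text):
    """Lowercase the text and split it on runs of whitespace."""
    return text.lower().split()

twit = "Awww thats a bummer You shoulda got David Carr of Third Day to do it"
tokens = simple_tokenize(twit)
print(tokens)
```

On the cleaned example tweet this produces the same token list as the cell above.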


In [49]:
train_data_token = [tokenizer(i) for i in train_data_clean]
data_info(train_data_token,train_label)

# trainings: 12800
label: 4 review: ['91', 'days', 'till', 'september', '1st', 'philwickham']
label: 0 review: ['oh', 'great', 'more', 'problems', 'with', 'my', 'macbook', 'case', 'cracking']
label: 0 review: ['wtfi', 'hate', 'receptionist', 'feels', 'bad']
label: 0 review: ['naww', 'i', 'would', 'keep', 'you', 'company', 'but', 'ya', 'know', 'im', 'not', 'there', 'ill', 'see', 'you', 'tomorrow']
label: 0 review: ['3', 'more', 'followers', 'somebody', 'unfollowed', 'me']
label: 4 review: ['uber', 'tired', 'should', 'probably', 'sleep', 'now', 'ihave', 'no', 'idea', 'why', 'im', 'still', 'awake', 'gtlt']
label: 4 review: ['watching', 'the', 'da', 'vinci', 'code']
label: 0 review: ['my', 'phones', 'gone', 'for', 'the', 'whole', 'summer', 'break']
label: 0 review: ['is', 'not', 'feeling', 'good', 'at', 'all']
label: 0 review: ['whoooo', 'hoooo', 'i', 'have', '40', 'followers', 'quot', 'quotquot', 'im', 'so', 'laaaaaaaaaame']
label: 0 review: ['hate', 'how', 'my', 'sources', 'arent', 'gett

### Drop empty tweets
This step is simple: after the transformations, some tweets may be left without tokens. Hence, all empty lists are dropped, together with their labels.

In [55]:
def drop_empty_tweets(data, label):
    data_non_empty = []
    label_non_empty = []
    empty_count = 0
    # walk over tweets and labels together so they stay aligned
    for tweet, tweet_label in zip(data, label):
        if len(tweet) > 0:
            data_non_empty.append(tweet)
            label_non_empty.append(tweet_label)
        else:
            empty_count += 1
    print(f"There were {empty_count} empty tweets")
    return data_non_empty,label_non_empty

In [56]:
train_data_non_empty,train_label_non_empty = drop_empty_tweets(train_data_token,train_label)

There were 27 empty tweets
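
The cleaning steps so far (remove users, remove URLs and symbols, tokenize, drop empty tweets) can be chained into one function. The following is a sketch under two assumptions: `preprocess` is a hypothetical name, and `str.lower().split()` stands in for the torchtext tokenizer so that the example runs without extra dependencies.

```python
import re

def remove_user(txt):
    return re.sub(r"@\S+", "", txt)

def remove_url(txt):
    return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+://\S+)", "", txt).split())

def preprocess(tweets, labels):
    """Clean and tokenize each tweet, dropping tweets left without tokens."""
    kept_tokens, kept_labels = [], []
    for tweet, label in zip(tweets, labels):
        # stand-in tokenizer: lowercase, then split on whitespace
        tokens = remove_url(remove_user(tweet)).lower().split()
        if tokens:  # keep the label only if the tweet still has tokens
            kept_tokens.append(tokens)
            kept_labels.append(label)
    return kept_tokens, kept_labels

tweets = [
    "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.",
    "@someone :)",
    "Watching the da vinci code ",
]
labels = [0, 4, 4]
tokens, kept = preprocess(tweets, labels)
print(tokens)
print(kept)
```

The second tweet contains only a mention and an emoticon, so nothing is left after cleaning and both the tweet and its label are dropped.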


### Convert label
The frequency of each sentiment label is shown below. The label `4` corresponds to a positive sentiment, while `0` corresponds to a negative sentiment.

In [57]:
pd.Series(train_label_non_empty).value_counts()

0    6417
4    6356
dtype: int64

This means that the classes are balanced, so there is no need to specify class weights in the loss function. However, the neural network is set up to predict labels in {0,1}. This is easy to solve by changing the label `4` to `1`.

In [52]:
def convert_labels(label):
    train_label_conv = []
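    # iterate over the argument, not the global train_label, so the labels stay aligned with the filtered tweets
    for i in label: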
        if i == 4:
            train_label_conv.append(1)
        else:
            train_label_conv.append(0)
    return train_label_conv

In [58]:
train_label_converted = convert_labels(train_label_non_empty)
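
The conversion can also be written as a dictionary lookup. This is a sketch of an alternative, not the notebook's own code. `LABEL_MAP` is a hypothetical name and the input list is toy data.

```python
from collections import Counter

LABEL_MAP = {0: 0, 4: 1}  # original label codes mapped to binary targets

def convert_labels(labels):
    # a KeyError is raised for any label other than 0 or 4
    return [LABEL_MAP[label] for label in labels]

raw = [4, 0, 0, 4, 0]
converted = convert_labels(raw)
print(converted)
print(Counter(converted))
```

Unlike an `if`/`else` that sends everything except `4` to `0`, the lookup fails loudly on an unexpected label instead of silently counting it as negative. After converting, checking that the converted labels and the filtered tweets have the same length is a cheap way to confirm that they are still aligned.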