# Twitter Sentiment Analysis

## Introduction
  
This tutorial explains a method to predict the sentiment of a tweet, using RNNs as the deep learning models. Industry increasingly exploits data from social networks. For example, sentiment analysis of tweets can be useful for a company that wants to assess how its new products are received. Humans cannot analyse every tweet, so training deep learning models to predict sentiment is a way to scale the analysis.

The data used in this tutorial comes from research by Stanford researchers. They collected 1.6 million tweets and classified them as positive or negative based on the emoticons written in the same tweet, simulating Facebook. The table below contains the emotion category of the emoticon used. The data was downloaded using the Hugging Face `datasets` library.
  
The tutorial is divided into three parts: first, the steps to clean the tweets and build a dataset for prediction; second, a description of GloVe and its use as a word embedding model; third, the use of an LSTM to predict sentiment. The idea behind using RNNs is that the order of the words in a sequence carries information that can be used to predict the sentiment.

In [30]:
from datasets import load_dataset
import re
import random
from torchtext.data import get_tokenizer

ModuleNotFoundError: No module named 'torchtext'

## Load Dataset

In [3]:
dataset = load_dataset("sentiment140")

Downloading: 4.12kB [00:00, 325kB/s]                    
Downloading: 1.59kB [00:00, 362kB/s]                  
Downloading and preparing dataset sentiment140/sentiment140 (download: 77.59 MiB, generated: 215.36 MiB, post-processed: Unknown size, total: 292.95 MiB) to /home/nftd/.cache/huggingface/datasets/sentiment140/sentiment140/1.0.0/9fe1c0ce3319c47cc65ff7e49aac6c34d9c050ab1432988c104b3b275e360f3f...
Downloading: 100%|██████████| 81.4M/81.4M [01:57<00:00, 691kB/s] 
                                Dataset sentiment140 downloaded and prepared to /home/nftd/.cache/huggingface/datasets/sentiment140/sentiment140/1.0.0/9fe1c0ce3319c47cc65ff7e49aac6c34d9c050ab1432988c104b3b275e360f3f. Subsequent calls will reuse this data.


In [23]:
def read_twitts(data, twitt_n=10000):
    """Sample tweets and split them into training (80%) and test (20%) sets.

    Parameters
    -----------
    data : DatasetDict
        Dataset loaded from Hugging Face
    twitt_n : int
        Number of tweets to sample; a smaller value limits memory use
    
    Returns
    -----------
    data_train : list
        Train data
    labels_train : list
        Train label
    data_test : list
        Test data
    labels_test : list
        Test label
    """
    set_ = 'train'
    # Draw twitt_n distinct row indices from the split instead of hard-coding its size
    randomlist = random.sample(range(len(data[set_])), twitt_n)
    # random.sample returns the indices in random order, so slicing gives a random 80/20 split
    n_train = int(len(randomlist) * 0.8)
    trainrandomlist = randomlist[:n_train]
    testrandomlist = randomlist[n_train:]
    #training set
    data_train, labels_train = [],[]
    for i in trainrandomlist:
        data_train.append(data[set_][i]['text'])
        labels_train.append(data[set_][i]['sentiment'])
    #test set
    data_test, labels_test = [],[]
    for i in testrandomlist:
        data_test.append(data[set_][i]['text'])
        labels_test.append(data[set_][i]['sentiment'])
    return data_train,labels_train,data_test,labels_test

In [24]:
train_data, train_label, test_data, test_label = read_twitts(dataset,16000)

In [25]:
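def data_info(data, label, k=10):
    """Print the number of samples, then the first and last k labelled tweets."""
    print('# trainings:', len(data))
    for x, y in zip(label[0:k], data[0:k]):
        print('label:', x, 'review:', y)
    # Slice with [-k:] so that the last element is included
    for x, y in zip(label[-k:], data[-k:]):
        print('label:', x, 'review:', y)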

In [26]:
data_info(train_data,train_label)

# trainings: 12800
label: 0 review: @alexisamore I  slacked off with that for like 2 weeks now... been too busy 
label: 4 review: eating Mother's frosted oatmeal cookies &amp;&amp; milk. 
label: 4 review: Testing that my laptop works from home  Yes it does 
label: 4 review: Tom Hanks is a Trek man!! 
label: 4 review: @michael_sargent Hmmm, not too late for me to change my ASB's name either.  
label: 4 review: Follow my best freind @Sidrraah she is beautiful lol 
label: 0 review: Ugh caaaake 
label: 4 review: Is very thankful for her amazing friend @m3r3h because she's helping him choreograph at 630am. 
label: 0 review: My email system was hacked and lots of my contacts have received email from an Indian web community 
label: 0 review: @svanwessem Thant stinks, sorry to hear that about the stolen money 
label: 4 review: Alex makes my life so much better. 
label: 0 review: @stevexmetal i love you 
label: 4 review: Church at St. Pats, NBC tour, planet hollywood dinner, and TIMES SQUARE!!!
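
The labels printed above are 0 and 4 rather than 0 and 1. A binary classifier usually expects targets in {0, 1}, so one possible preprocessing step is to remap them. The sketch below assumes that 0 marks a negative tweet and 4 a positive one; `LABEL_MAP` and `to_binary_labels` are hypothetical names that are not part of the original notebook.

```python
from collections import Counter

# Assumption: 0 marks a negative tweet and 4 a positive one.
LABEL_MAP = {0: 0, 4: 1}

def to_binary_labels(labels):
    """Map the dataset labels (0, 4) to binary targets (0, 1)."""
    return [LABEL_MAP[label] for label in labels]

# Small hand-made example; with the tutorial's data one would pass train_label instead.
example_labels = [0, 4, 4, 0, 4]
binary_labels = to_binary_labels(example_labels)

# Counting the classes is a quick way to check that the sample is roughly balanced.
label_counts = Counter(binary_labels)
print(binary_labels)
print(label_counts)
```

A label outside the mapping raises a `KeyError`, which may be preferable to silently training on an unexpected class.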

## Data cleaning
This section is a step-by-step guide to cleaning the text; the goal is to transform the data into numbers so that the models can be trained. It is necessary to understand the structure of a tweet before designing a pipeline. Below is an example of a tweet that was classified as negative:
  
> @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. 

In [12]:
twit = "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it."

The steps to transform the data are the following:
### Remove users
This is a simple step: the objective is to identify and remove the words that begin with an @.

In [13]:
def remove_user(txt):
    # Raw string so that \S (any non-whitespace character) reaches the regex engine unchanged
    return re.sub(r'@\S+', '', txt)
twit = remove_user(twit)
print(twit)

 http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it.


In [27]:
train_data_clean = [remove_user(i) for i in train_data]
data_info(train_data_clean,train_label)

# trainings: 12800
label: 0 review:  I  slacked off with that for like 2 weeks now... been too busy 
label: 4 review: eating Mother's frosted oatmeal cookies &amp;&amp; milk. 
label: 4 review: Testing that my laptop works from home  Yes it does 
label: 4 review: Tom Hanks is a Trek man!! 
label: 4 review:  Hmmm, not too late for me to change my ASB's name either.  
label: 4 review: Follow my best freind  she is beautiful lol 
label: 0 review: Ugh caaaake 
label: 4 review: Is very thankful for her amazing friend  because she's helping him choreograph at 630am. 
label: 0 review: My email system was hacked and lots of my contacts have received email from an Indian web community 
label: 0 review:  Thant stinks, sorry to hear that about the stolen money 
label: 4 review: Alex makes my life so much better. 
label: 0 review:  i love you 
label: 4 review: Church at St. Pats, NBC tour, planet hollywood dinner, and TIMES SQUARE!!!! 
label: 4 review:     danika you are.   
label: 0 review: Thinki

### Remove URL
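Hyperlinks can carry useful information, but they are removed from the text here. The same function also strips every character that is not a letter, a digit, a space or a tab, so punctuation is removed as well.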

In [10]:
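def remove_url(txt):
    # Drop URLs (scheme://...) and every character that is not a letter, digit, space or tab,
    # then collapse repeated whitespace
    return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+://\S+)", "", txt).split())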
twit = remove_url(twit)
print(twit)

switchfoot Awww thats a bummer You shoulda got David Carr of Third Day to do it


In [28]:
train_data_clean = [remove_url(i) for i in train_data_clean]
data_info(train_data_clean,train_label)

# trainings: 12800
label: 0 review: I slacked off with that for like 2 weeks now been too busy
label: 4 review: eating Mothers frosted oatmeal cookies ampamp milk
label: 4 review: Testing that my laptop works from home Yes it does
label: 4 review: Tom Hanks is a Trek man
label: 4 review: Hmmm not too late for me to change my ASBs name either
label: 4 review: Follow my best freind she is beautiful lol
label: 0 review: Ugh caaaake
label: 4 review: Is very thankful for her amazing friend because shes helping him choreograph at 630am
label: 0 review: My email system was hacked and lots of my contacts have received email from an Indian web community
label: 0 review: Thant stinks sorry to hear that about the stolen money
label: 4 review: Alex makes my life so much better
label: 0 review: i love you
label: 4 review: Church at St Pats NBC tour planet hollywood dinner and TIMES SQUARE
label: 4 review: danika you are
label: 0 review: Thinkin about milley
label: 4 review: ok ill finish my paper a
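
In the output above, the HTML entity `&amp;&amp;` ends up as the meaningless token `ampamp`, because only its punctuation is stripped. One possible fix, shown here as a sketch and not as part of the original pipeline, is to decode HTML entities with the standard library's `html.unescape` before removing the special characters. The helper name `unescape_html` is hypothetical, and `remove_url` is repeated only so that the example runs on its own.

```python
import html
import re

def unescape_html(txt):
    """Decode HTML entities such as &amp; into the characters they stand for."""
    return html.unescape(txt)

def remove_url(txt):
    # Same pattern as the tutorial's remove_url, repeated so the sketch is self-contained.
    return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+://\S+)", "", txt).split())

raw = "eating Mother's frosted oatmeal cookies &amp;&amp; milk. "
# Decoding first turns "&amp;&amp;" into "&&", which the regex then removes entirely.
cleaned = remove_url(unescape_html(raw))
print(cleaned)
```

With this ordering the decoded `&&` is dropped along with the other punctuation, so no leftover `amp` fragments reach the tokenizer.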

### Tokenize
Tokenizing splits the text on a delimiter, normalizes each piece with a function and collects the resulting words in a list. Here the text is split on spaces and every word is converted to lowercase. The results are called tokens; an example is shown below:

In [29]:
tokenizer = get_tokenizer("basic_english")

NameError: name 'get_tokenizer' is not defined
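
The behaviour described in this section, splitting on spaces and lowercasing, can also be sketched in plain Python, which may help when `torchtext` is not installed. The function `simple_tokenizer` below is a hypothetical stand-in and not torchtext's `basic_english` implementation. It assumes the text has already gone through the cleaning steps above, so no punctuation is left to handle.

```python
def simple_tokenizer(txt):
    """Lowercase the text and split it on whitespace."""
    return txt.lower().split()

# Cleaned example tweet from the previous steps.
tokens = simple_tokenizer(
    "Awww thats a bummer You shoulda got David Carr of Third Day to do it"
)
print(tokens)
```

Applied to the training set, this would amount to `[simple_tokenizer(t) for t in train_data_clean]`.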