# Sentiment Analysis of Tweets with Supervised Learning

This notebook covers the basics of analysing the sentiment of tweets with supervised learning. Supervised learning means that a labelled dataset is used to train a model; we can then test the model on held-out data to see how accurately it predicts the labels. The dataset will first be cleaned, and then three different models will be explored: **logistic regression**, which uses a logistic function (Swaminathan); an **LSTM**, which is a recurrent neural network (Srivastava); and a **transformer**, which uses a pre-trained model (huggingface).

In [20]:
#Importing all the required libraries
import os
import re
import asyncio
import numpy as np
import pandas as pd
import tensorflow_hub as hub
import tensorflow as tf
import random
import torch
import io

from sklearn import preprocessing, model_selection, metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from torch.utils.data import TensorDataset, DataLoader
from pytorch_pretrained_bert import BertAdam, BertForSequenceClassification, BertTokenizer
from pymagnitude import *
from tqdm import tqdm, trange
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers.recurrent import LSTM
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.utils import np_utils
from keras.layers import SpatialDropout1D
from keras.preprocessing import sequence, text
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

## Cleaning the data

We will use a dataset of US airline sentiments downloaded from Kaggle to train our models. The data is originally from _Crowdflower's Data for Everyone library_.

In [21]:
data_root = '/home/kyubin/Supervised_learning_notebook'
data_file = os.path.join(data_root, 'Tweets.csv') 

tweets_raw = pd.read_csv(data_file) #Importing tweets into a pandas dataframe

Let's take a look at a few of the tweets and their sentiments!

In [22]:
for i in range(0, 5):
    x = random.randint(0, len(tweets_raw) - 1) #randint includes both endpoints, so the upper bound must be the last valid index
    print('Tweet: %s, \nSentiment: %s' % (tweets_raw['text'][x], tweets_raw['airline_sentiment'][x]))

Tweet: Do less please @JetBlue, 
Sentiment: negative
Tweet: @USAirways who would like to watch the video of how loyal customers are really treated while asking for an explanation for our 6 hour delay?, 
Sentiment: negative
Tweet: @SouthwestAir FJBFSC  all I need is receipt showing the 776.20 charge I have on my Amex not the 656.xx that I keep getting sent, 
Sentiment: negative
Tweet: @JetBlue what's up w flt 4? Brothers fiancé sitting on board for 30mins w tech  issues., 
Sentiment: negative
Tweet: @AmericanAir There was no one from #AA at the #woase2015 event @HelsinkiAirport
Lots of info available of #winterops, 
Sentiment: negative


Since there are only three possible outcomes (positive, negative, neutral), it is easier to work with numerical values than with words. We will change the sentiments to numerical values: 0 for negative, 1 for neutral, and 2 for positive.

In [23]:
#Changing sentiments into numerical values
def sentiment(x):
    if x == 'negative':
        return 0
    elif x == 'neutral':
        return 1
    else:
        return 2

tweets_raw['sentiment'] = tweets_raw['airline_sentiment'].apply(sentiment)
tweets = tweets_raw[['text', 'sentiment']].copy() #Only keeping the useful columns in the dataframe

In [24]:
#Cleaning tweets
tweets = tweets.drop_duplicates(subset = 'text', keep = 'first')
def preprocessor(text):
    text = re.sub("<[^>]*>", "",text)
    text = re.sub(r'[\W]+', ' ', text.lower())
    text = re.sub(' +', ' ', text)
    tweet = text.strip()
    return tweet

tweets['text'] = tweets['text'].apply(lambda x: preprocessor(x))
print('%d total tweets' % (len(tweets)))

14427 total tweets


The tweets are now all loaded and cleaned! We can now move on to preprocessing, training and testing. We will start with the simplest model, the logistic regression model.

## Preprocessing, Training, and Testing

### Logistic Regression

Strictly speaking, a logistic regression returns a **probability that a value belongs to a certain class**. The simplest form of a logistic regression is called a binary logistic regression, which predicts responses with two possible outcomes. Since we have 3 different outcomes, we have to use a **Multinomial Logistic Regression** (Swaminathan).
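
To make the idea of class probabilities concrete, here is a minimal sketch of the softmax function that a multinomial logistic regression applies to its raw class scores. The scores below are made up for illustration, and scikit-learn may arrive at its probabilities differently depending on the solver.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for (negative, neutral, positive)
scores = [2.0, 0.5, -1.0]
probs = softmax(scores)
predicted_class = probs.index(max(probs))  # index of the most probable class

print(probs)
print(predicted_class)
```

The predicted class is simply the one with the highest probability, which here is class 0 (negative).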

**Acknowledgement** <br>
A lot of the code was taken and adapted from Jason Liu's _How can we predict the sentiment by Tweets?_ from Kaggle.

The data has to be preprocessed before it can be used for training. This can be done in many different ways, of which the bag-of-words model is the simplest. We will be using Google's Universal Sentence Encoder (Cer et al.), like we did in unsupervised learning, to convert the tweets into vectors. Since the Universal Sentence Encoder captures the meaning of a sentence, and with it some of its sentiment, it should yield better results than the bag-of-words model, which only stores how many times each token occurs in a tweet.
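
For comparison, a bag-of-words representation can be sketched in a few lines. The two example texts are invented, and `bag_of_words` is a hypothetical helper rather than part of this notebook's pipeline.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

# Two made-up, already-cleaned tweets (not taken from the dataset)
docs = ["the flight was late late again", "the crew was great"]
vocabulary = sorted(set(" ".join(docs).split()))
bow_vectors = [bag_of_words(doc, vocabulary) for doc in docs]

print(vocabulary)   # ['again', 'crew', 'flight', 'great', 'late', 'the', 'was']
print(bow_vectors)  # [[1, 0, 1, 0, 2, 1, 1], [0, 1, 0, 1, 0, 1, 1]]
```

Each vector only records word counts; word order and meaning are lost, which is the limitation a sentence encoder tries to address.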

In [25]:
# Downloads the pre-trained model from the internet (~1 GB). This may take up to a few minutes.
embed = hub.Module('https://tfhub.dev/google/universal-sentence-encoder/1') #(Google)

sess = tf.InteractiveSession()
sess.run([tf.global_variables_initializer(), tf.tables_initializer()]);  # Initialize model

In [26]:
#Putting tweets and sentiment values into separate lists
xval = tweets.text.values
yval = tweets.sentiment.values

#Converting tweets into vectors
vecs = sess.run(embed(xval))

INFO:tensorflow:Saver not created because there are no variables in the graph to restore



We will now split the data into two groups: training and testing. We will use 80 percent of the tweets for training, and the rest for testing.

In [27]:
#Splitting tweets into training/testing groups
xtrain, xtest, ytrain, ytest = train_test_split(vecs, yval, test_size = 0.2, random_state = 42)

Now that we have all the tweets vectorized and split into groups, we can load the model. Note that the `liblinear` solver used here handles the three classes with a one-vs-rest scheme. More details on the parameters can be found in _Logistic Regression_ on Scikit-Learn.

In [28]:
model = LogisticRegression(solver = 'liblinear', max_iter = 100, C = 0.00001) #Loading model
fit = model.fit(xtrain, ytrain) #Training model to our training data set



The model is now trained, so we can check the accuracy of the model with our testing set:

In [29]:
pred = fit.predict(xtest)
accuracy = accuracy_score(ytest, pred) #scikit-learn expects the true labels first, then the predictions
print('The accuracy is: %s' % (float(accuracy)))

The accuracy is: 0.6275121275121275


<div class="alert alert-info">
    <h3>Question</h3>
    <p>Play around with the parameters; how does the accuracy change as you change the parameters?</p>
    <p>Try changing the sizes of the training/testing sets. What would the advantages be of having a bigger training/testing data set?</p>
</div>

A logistic regression fits a single function that maps the input features directly to class probabilities. Now, we will take a look at a more complex model, where we will have to train our own neural network.

### LSTM

An LSTM, or **Long Short-Term Memory** network, is a special kind of RNN (recurrent neural network). An RNN contains loops, which allow previous information to persist. However, a plain RNN can only retain information for short periods of time, and an LSTM overcomes this problem. An LSTM can choose which information to remember and which to forget, allowing it to remember important information for longer periods of time (Srivastava).
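
The remember/forget mechanism can be illustrated with a single LSTM step reduced to one dimension. This is a simplified sketch with made-up weights, not the Keras implementation used later; the names `lstm_step` and `weights` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM time step with a one-dimensional input, hidden state, and cell state."""
    f = sigmoid(w["f_x"] * x + w["f_h"] * h_prev + w["f_b"])    # forget gate: how much old memory to keep
    i = sigmoid(w["i_x"] * x + w["i_h"] * h_prev + w["i_b"])    # input gate: how much new information to store
    o = sigmoid(w["o_x"] * x + w["o_h"] * h_prev + w["o_b"])    # output gate: how much memory to expose
    g = math.tanh(w["g_x"] * x + w["g_h"] * h_prev + w["g_b"])  # candidate memory content
    c = f * c_prev + i * g   # updated cell state (long-term memory)
    h = o * math.tanh(c)     # updated hidden state (output of this step)
    return h, c

# Made-up weights; a trained network would learn these values
names = ["f_x", "f_h", "f_b", "i_x", "i_h", "i_b", "o_x", "o_h", "o_b", "g_x", "g_h", "g_b"]
weights = {name: 0.5 for name in names}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:  # a toy input sequence
    h, c = lstm_step(x, h, c, weights)
print(h, c)
```

When the forget gate is close to 1 and the input gate is close to 0, the cell state is carried over almost unchanged, which is how information can survive over many time steps.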

**Acknowledgement** <br>
A lot of the code here was taken and adapted from Abhishek Thakur's _Approaching (Almost) Any NLP Problem on Kaggle_. The full Keras documentation can be found on keras.io.

We cannot use the Universal Sentence Encoder for the LSTM model, because it produces a single vector per tweet, whereas an LSTM reads a tweet as a sequence of word vectors. Therefore, we will use a pre-trained word embedding model: the 300-dimensional GloVe word vectors from Stanford, loaded through Magnitude (Plasticity Inc). The vectors will be used later in this section, when we create an embedding matrix of all the words. But first, let's import the vectors and get the samples into the right form!

In [30]:
#Download the word vectors from 
#https://github.com/plasticityai/magnitude#pre-converted-magnitude-formats-of-popular-embeddings-models 
#and import them!
vectors = Magnitude("/home/kyubin/glove.840B.300d.magnitude")

If we want to use an LSTM model, we have to tokenize the samples, and turn them into sequences.

In [31]:
token = text.Tokenizer(num_words = None) #using a keras tokenizer

token.fit_on_texts(tweets.text.values) #List of texts to train on
x_seq = token.texts_to_sequences(tweets.text.values) #Turns the tweets into sequences

The tweets are now all converted into sequences, but they have different lengths. Therefore, we will use a Keras utility to **truncate/pad** all the samples to the same length; in this case, 30.

In [32]:
max_len = 30
x_pad = sequence.pad_sequences(x_seq, maxlen=max_len) 
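
What padding and truncating do to a single sequence can be sketched in plain Python. `pad_or_truncate` is a hypothetical stand-in that mimics the `padding` and `truncating` options of Keras' `pad_sequences`, whose default for both is `'pre'`.

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Bring one integer sequence to exactly maxlen items (maxlen >= 1)."""
    seq = list(seq)
    if len(seq) > maxlen:
        # Drop items from the front ("pre") or from the back ("post")
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    # Add filler values at the front ("pre") or at the back ("post")
    return pad + seq if padding == "pre" else seq + pad

short_seq = [4, 8, 15]
long_seq = [1, 2, 3, 4, 5, 6, 7]

print(pad_or_truncate(short_seq, 5))  # [0, 0, 4, 8, 15]
print(pad_or_truncate(long_seq, 5))   # [3, 4, 5, 6, 7]
print(pad_or_truncate(long_seq, 5, padding="post", truncating="post"))  # [1, 2, 3, 4, 5]
```

With the defaults, zeros are added at the front and long tweets lose their first tokens, so the end of each tweet is always preserved.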

For this model to work, all the labels have to be **one-hot encoded**. The negative sentiments will become [1, 0, 0], the neutral sentiments will become [0, 1, 0], and the positive sentiments will become [0, 0, 1].

In [33]:
y_enc = np_utils.to_categorical(tweets.sentiment.values)
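
The encoding itself is simple enough to write by hand. The sketch below uses a hypothetical `one_hot` helper on plain lists; `to_categorical` returns the same pattern as a NumPy array of floats.

```python
def one_hot(labels, num_classes):
    """Return one row per label with a single 1 at the label's index."""
    rows = []
    for label in labels:
        row = [0] * num_classes
        row[label] = 1
        rows.append(row)
    return rows

# 0 = negative, 1 = neutral, 2 = positive
print(one_hot([0, 1, 2, 0], num_classes=3))
# [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```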

The data is now ready to be split into groups. This time, we will also include a validation set, which is important to check whether overfitting is happening (Brownlee). We will use 10 percent of the tweets for testing, and the rest (90 percent) for training and validating. From this 90 percent, 10 percent will be used for validating, and the rest for training.

In [34]:
xtrain_val, xtest, ytrain_val, ytest = train_test_split(x_pad, 
                                                        y_enc, 
                                                        random_state = 42, 
                                                        test_size = 0.1, 
                                                        shuffle = True)
xtrain, xval, ytrain, yval = train_test_split(xtrain_val,
                                              ytrain_val, 
                                              random_state = 42, 
                                              test_size = 0.1,
                                              shuffle = True)
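
The sizes of the three groups can be worked out by hand. The sketch below assumes that the held-out part of each split is rounded up to a whole sample, which agrees with the sample counts Keras reports during training; `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, held_out_fraction):
    """Sizes of the (remaining, held-out) parts, rounding the held-out part up."""
    n_held_out = math.ceil(n_samples * held_out_fraction)
    return n_samples - n_held_out, n_held_out

n_total = 14427                                  # tweets left after removing duplicates
n_train_val, n_test = split_sizes(n_total, 0.1)  # first split: hold out the test set
n_train, n_val = split_sizes(n_train_val, 0.1)   # second split: hold out the validation set

print(n_train, n_val, n_test)  # 11685 1299 1443
print(round(n_train / n_total, 2), round(n_val / n_total, 2), round(n_test / n_total, 2))  # 0.81 0.09 0.1
```

So roughly 81 percent of the tweets are used for training, 9 percent for validating, and 10 percent for testing.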

Now, we will create an embedding matrix for all the words we have in the dataset. The GloVe word vectors will be utilized here.

In [35]:
word_index = token.word_index #A dictionary of all the words and the corresponding values

embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = vectors.query(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

100%|██████████| 15088/15088 [00:17<00:00, 855.83it/s]


Now that we finally have all our data ready for training and testing, we can initialize the model. We will create a neural network with a GloVe embedding layer, an LSTM layer, and two dense layers.

In [36]:
model = Sequential() #this model is a linear stack of layers
#Embedding Layer - turns positive integers into dense vectors of fixed size
model.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))
model.add(SpatialDropout1D(0.3))
#LSTM Layer
model.add(LSTM(100, dropout=0.3, recurrent_dropout=0.3))
#Dense layer 1 (just a regular densely-connected NN layer)
model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))
#Dense layer 2
model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(3)) #Final output dimensionality of 3, because we have three possible outcomes!
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.

<div class="alert alert-info">
    <h3>Question</h3>
    <p>On lines 8, 13, and 16 above, there are dropout layers included in the model. A dropout layer sets a fraction of the inputs to zero at each update during training time. Why is this important? What does it prevent?</p>
    
</div>
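
The mechanics of a dropout layer can be sketched as follows. This is an illustrative version of so-called inverted dropout, in which the surviving values are scaled up during training; the function name and the example activations are made up.

```python
import random

def dropout(values, rate, training=True, rng=random):
    """Inverted dropout: zero each value with probability `rate` during training."""
    if not training or rate == 0.0:
        return list(values)  # at test time the values pass through unchanged
    keep = 1.0 - rate
    # Survivors are scaled by 1/keep so that the expected sum stays the same
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)       # fixed seed so the run is repeatable
activations = [1.0] * 10
dropped = dropout(activations, rate=0.3, rng=rng)

print(dropped)
print(dropout(activations, rate=0.3, training=False))
```

A different random subset of values is zeroed at every training update, while at test time nothing is dropped.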

In [37]:
#Training the model with the training dataset
model.fit(xtrain, ytrain, batch_size=512, epochs=20, verbose=1, validation_data = (xval, yval))

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where



Train on 11685 samples, validate on 1299 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f5a842e9160>

Let's see how this model performs on our testing dataset:

In [38]:
#Testing the model
loss, acc = model.evaluate(x = xtest, y = ytest, verbose = 2, batch_size = 512)
print('The loss is %s' % (loss))
print('The accuracy is %s' % (acc))

The loss is 0.5234636724697411
The accuracy is 0.7796257771887221


<div class="alert alert-info">
    <h3>Question</h3>
    <p>Play around with the parameters again! Which parameters affect the results the most?</p>
    <p>Using the training and the validation accuracy values, how would we see if overfitting is happening?</p>
    
</div>

In this section, we looked at how we can train our own neural network. To improve our model, we would have to gather and label more data, add more layers, adjust parameters, and so on. However, this is very time-consuming and expensive. Therefore, in the next section, we will take a look at how we can use a model that has already been trained on billions of words of text.

### BERT Transformer

**Acknowledgement**<br>
A lot of the code in this section was taken and adapted from Chris McCormick's _BERT Fine-Tuning Tutorial with PyTorch_.

The last model we will explore is the BERT transformer model. BERT is a pre-trained deep learning model, which has been trained on billions of words of text. To use this model, we have to fine-tune it so that it suits our specific task. Here are a few reasons why fine-tuning a model is better than building and training a model from scratch (McCormick).<br>
1. It is a lot easier to fine-tune a model, as it already contains a lot of information.
2. It saves time and money.
3. It requires less data.

<div class="alert alert-info">
    <h3>Question</h3>
    <p>Can you think of other reasons why fine tuning a model would be better than building and training a model from scratch?</p>

</div>

Fine-tuning a model still requires a lot of computing power. Therefore, it is recommended to use a GPU. In order to use a GPU in PyTorch, we need to specify it as the device.

In [6]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
torch.cuda.get_device_name(0)

'TITAN V'

In order to fine-tune the model, the data has to be in the same form as the data that was used to pre-train the original model. Therefore, we will use the BERT tokenizer to tokenize the data. Furthermore, in order for BERT to work properly, special tokens must be added at the beginning and the end of each sentence: [CLS] must be added to the beginning, and [SEP] to the end.

In [7]:
texts = tweets.text.values #Making a list of all the tweets

texts = ["[CLS] " + text + " [SEP]" for text in texts] #Adding special tokens to beginning/end of tweets
labels = tweets.sentiment.values #List of sentiment values

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True) #Loading the pre-trained BERT tokenizer
tokenized_texts = [tokenizer.tokenize(text) for text in texts] #Applying tokenizer to all tweets

We can take a look at what the tokenized tweets look like:

In [8]:
print ("First tweet tokenized:", tokenized_texts[0])

First tweet tokenized: ['[CLS]', 'virgin', '##ame', '##rica', 'what', 'dh', '##ep', '##burn', 'said', '[SEP]']
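
The `##` pieces in the output come from sub-word tokenization: a word that is not in the vocabulary is split into smaller pieces that are. The sketch below shows the greedy longest-match-first idea on a tiny invented vocabulary; it is a simplification, not the actual `BertTokenizer` code.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                          # try a shorter piece
        if piece is None:
            return [unk]                      # no piece fits: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Tiny made-up vocabulary; the real BERT vocabulary has about 30,000 entries
toy_vocab = {"virgin", "##ame", "##rica", "what", "said"}

print(wordpiece("virginamerica", toy_vocab))  # ['virgin', '##ame', '##rica']
print(wordpiece("what", toy_vocab))           # ['what']
print(wordpiece("xyz", toy_vocab))            # ['[UNK]']
```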


The samples are now tokenized, but tokens cannot be fed directly into our model. Therefore, we will convert our tokens to integers.

In [9]:
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts] #Converts tokens to integers

Just like we did with the LSTM model, we have to get all our samples to be the same length, and this is done by padding the data. We will make the length 30 again.

In [10]:
MAX_LEN = 30
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post") 
#Padding and truncating both at the end of the token

Again, we will use 81 percent of the data for training, 9 percent for validating, and 10 percent for testing. 

In [11]:
xtrain_val, xtest, ytrain_val, ytest = train_test_split(input_ids, 
                                                        labels, 
                                                        random_state = 42, 
                                                        test_size = 0.1, 
                                                        shuffle = True)
xtrain, xval, ytrain, yval = train_test_split(xtrain_val,
                                              ytrain_val, 
                                              random_state = 42, 
                                              test_size = 0.1,
                                              shuffle = True)

In [12]:
#Converting numpy arrays into torch tensors
xtrain = torch.tensor(xtrain)
xval = torch.tensor(xval)
xtest = torch.tensor(xtest) #Will be used in the testing section
ytrain = torch.tensor(ytrain)
yval = torch.tensor(yval)
ytest = torch.tensor(ytest) #Will be used in the testing section

To save memory during training, we will create an iterator of our data with torch DataLoader. 

In [13]:
# Select a batch size for training (16 or 32 recommended)
batch_size = 16

train_data = TensorDataset(xtrain, ytrain)
train_dataloader = DataLoader(train_data, batch_size=batch_size)

validation_data = TensorDataset(xval, yval)
validation_dataloader = DataLoader(validation_data, batch_size=batch_size)

test_data = TensorDataset(xtest, ytest)
test_dataloader = DataLoader(test_data, batch_size = batch_size)
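
Conceptually, iterating over a data loader without shuffling yields consecutive slices of the dataset. The generator below is a hypothetical plain-Python stand-in, using integers in place of the 11,685 training tweets.

```python
def iterate_batches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(11685))  # stand-ins for the training tweets
batch_sizes = [len(batch) for batch in iterate_batches(samples, batch_size=16)]

print(len(batch_sizes))                 # 731
print(batch_sizes[0], batch_sizes[-1])  # 16 5
```

Only one batch of 16 samples has to be moved to the GPU at a time, and the last batch is smaller because 11,685 is not a multiple of 16.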

We are now ready to load the model!

In [14]:
#Loading the model
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
model.to(device) #Moves the model to the GPU if one is available, otherwise keeps it on the CPU

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
   

We will now collect the model's trainable parameters and pass them to the optimizer. These parameters will get updated during training.

In [15]:
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta'] 
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01}, #Parameters to which weight decay is applied (BertAdam reads the 'weight_decay' key)
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0} #Parameters that are exempt from weight decay (they are still trained)
]

optimizer = BertAdam(optimizer_grouped_parameters,
                     lr=2e-5, #learning rate
                     warmup = 0.1) #Warmup proportion; it has no effect here because t_total is not set (see the message below)

t_total value of -1 results in schedule not being applied


The function below calculates the accuracy of the predictions compared to the given labels.

In [16]:
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
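
As a worked example of the same calculation, the sketch below computes the accuracy for four made-up rows of logits without NumPy; `accuracy_from_logits` is a hypothetical equivalent of `flat_accuracy`.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up logits for four tweets, columns = (negative, neutral, positive)
logits = [[2.1, 0.3, -1.0],
          [0.2, 1.5, 0.1],
          [-0.5, 0.0, 3.2],
          [1.0, 1.2, 0.4]]
labels = [0, 1, 2, 0]

print(accuracy_from_logits(logits, labels))  # 0.75
```

The first three rows are classified correctly and the last one is not, giving 3 out of 4.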

We can now start the training! Each epoch is divided into two sections: the training section and the validation section. We will be using 3 epochs, but this can be varied.

<div class="alert alert-info">
    <h3>Question</h3>
    <p>The more epochs you have, the higher training accuracy you would get. However, having too many epochs is not a good idea. Why?</p>
    
    

</div>

In [17]:
epochs = 3
for _ in trange(epochs, desc="Epoch"): 
  # Training
  model.train()
  
  # Tracking variables
  tr_loss = 0
  nb_tr_examples, nb_tr_steps = 0, 0
  
  # Train the data for one epoch
  for step, batch in enumerate(train_dataloader):
    batch = tuple(t.to(device) for t in batch) # Add batch to GPU
    b_input_ids, b_labels = batch # Unpack the inputs from our dataloader
    optimizer.zero_grad() # Clear out the gradients (by default they accumulate)
    loss = model(b_input_ids, labels=b_labels) # Forward propagation
    loss.backward() # Backward propagation
    optimizer.step() # Update parameters and take a step using the computed gradient

    # Update tracking variables
    tr_loss += loss.item()
    nb_tr_examples += b_input_ids.size(0)
    nb_tr_steps += 1

  print("Training loss: %s" % (tr_loss/nb_tr_steps))
    
  # Validation
  model.eval()

  # Tracking variables 
  eval_loss, eval_accuracy = 0, 0
  nb_eval_steps, nb_eval_examples = 0, 0

  # Evaluate data for one epoch
  for batch in validation_dataloader:

    batch = tuple(t.to(device) for t in batch) # Add batch to GPU
    b_input_ids, b_labels = batch # Unpack the inputs from our dataloader
    with torch.no_grad(): # Telling the model not to compute or store gradients, saving memory & speeding up validation
      logits = model(b_input_ids) # Forward pass, calculate logit predictions
    logits = logits.detach().cpu().numpy() # Move logits and labels to CPU
    label_ids = b_labels.to('cpu').numpy()

    tmp_eval_accuracy = flat_accuracy(logits, label_ids)
    
    eval_accuracy += tmp_eval_accuracy
    nb_eval_steps += 1

  print("Validation Accuracy: %s" % (eval_accuracy/nb_eval_steps))

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Training loss: 0.5691460691056075


Epoch:  33%|███▎      | 1/3 [01:35<03:10, 95.03s/it]

Validation Accuracy: 0.8376524390243902
Training loss: 0.32340468835243613


Epoch:  67%|██████▋   | 2/3 [03:08<01:34, 94.61s/it]

Validation Accuracy: 0.8475609756097561
Training loss: 0.1877086680865351


Epoch: 100%|██████████| 3/3 [04:40<00:00, 93.34s/it]

Validation Accuracy: 0.8414634146341463





Now we can use the testing dataset we created earlier to test the model.

In [19]:
model.eval()
# Tracking variables 
eval_loss, eval_accuracy = 0, 0
nb_eval_steps, nb_eval_examples = 0, 0

# Evaluate on the test set
for batch in test_dataloader:
    batch = tuple(t.to(device) for t in batch) # Add batch to GPU
    b_input_ids, b_labels = batch # Unpack the inputs from our dataloader
    with torch.no_grad(): # Telling the model not to compute or store gradients, saving memory & speeding up testing
      logits = model(b_input_ids) # Forward pass, calculate logit predictions
    logits = logits.detach().cpu().numpy() # Move logits and labels to CPU
    label_ids = b_labels.to('cpu').numpy()

    tmp_eval_accuracy = flat_accuracy(logits, label_ids)
    
    eval_accuracy += tmp_eval_accuracy
    nb_eval_steps += 1

print("Test set Accuracy: %s" % (eval_accuracy/nb_eval_steps))

Test set Accuracy: 0.8337912087912088


<div class="alert alert-info">
    <h3>Question</h3>
    <p>Compare the accuracy values of all the models! Do the results make sense?</p>
</div>

### References

Brownlee, Jason. “What Is the Difference Between Test and Validation Datasets?” *Machine Learning Mastery*, 26 July 2017, machinelearningmastery.com/difference-test-validation-datasets/. Accessed 11 Sept. 2019.

Cer, Daniel, et al. _Universal Sentence Encoder_. 2018.

Google. “TensorFlow Hub.” *Tfhub.Dev*, Google, 2019, tfhub.dev/google/universal-sentence-encoder/1. Accessed 30 Aug. 2019.

google-research. “BERT.” *GitHub*, 28 Mar. 2019, github.com/google-research/bert.

huggingface. “Huggingface/Pytorch-Transformers.” *GitHub*, 11 Sept. 2019, github.com/huggingface/pytorch-transformers#quick-tour. Accessed 11 Sept. 2019.

Keras. “Keras: The Python Deep Learning Library.” *Keras Documentation*, 2019, keras.io/. Accessed 11 Sept. 2019.

Liu, Jason. “How Can We Predict the Sentiment by Tweets?” *Kaggle*, 18 Oct. 2016, www.kaggle.com/jiashenliu/how-can-we-predict-the-sentiment-by-tweets/notebook. Accessed 11 Sept. 2019.

McCormick, Chris. “BERT Fine-Tuning Tutorial with PyTorch.” *Chris McCormick*, 22 July 2019, mccormickml.com/2019/07/22/BERT-fine-tuning/. Accessed 11 Sept. 2019.

Plasticity Inc. “Magnitude: A Fast, Simple Vector Embedding Utility Library.” *GitHub*, 28 Nov. 2018, github.com/plasticityai/magnitude. Accessed 30 Aug. 2019.

Scikit-Learn. “Logistic Regression.” *Scikit-Learn.Org*, 2014, scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html. Accessed 11 Sept. 2019.

Srivastava, Pranjal. “Essentials of Deep Learning : Introduction to Long Short Term Memory.” *Analytics Vidhya*, 23 Dec. 2017, www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/. Accessed 11 Sept. 2019.

Swaminathan, Saishruthi. “Logistic Regression — Detailed Overview.” *Medium*, Towards Data Science, 15 Mar. 2018, towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc. Accessed 11 Sept. 2019.

Thakur, Abhishek. “Approaching (Almost) Any NLP Problem on Kaggle.” *Kaggle*, 24 July 2018, www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle/comments. Accessed 11 Sept. 2019.