<a href="https://colab.research.google.com/github/ngtinc21/Machine-Learning-Algorithms/blob/main/RNN_Example_2_NLP_Twitter_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NLP Twitter Sentiment Analysis - Positive/Negative Sentiment**

This example classifies the sentiment of Tweets using NLP. A single Tweet is limited to 140 characters. Since this is a supervised learning task, the data comes as a training dataset of Tweets labeled “1” or “0” and a test dataset without labels.

The labels mean:

- Label “0”: Positive Sentiment
- Label “1”: Negative Sentiment

  * `train_tweets.csv` contains Training dataset which consists of Tweets labeled “1” or “0”
  * `test_tweets.csv` contains Test dataset without labels.


## **1. Data Acquisition**

**Importing required libraries**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### **Downloading Data**


In [None]:
! git clone https://github.com/MohamedAfham/Twitter-Sentiment-Analysis-Supervised-Learning

fatal: destination path 'Twitter-Sentiment-Analysis-Supervised-Learning' already exists and is not an empty directory.


### **Loading Data**

In [None]:
train_tweets= pd.read_csv("/content/Twitter-Sentiment-Analysis-Supervised-Learning/Data/train_tweets.csv")
train_tweets

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation
...,...,...,...
31957,31958,0,ate @user isz that youuu?ðððððð...
31958,31959,0,to see nina turner on the airwaves trying to...
31959,31960,0,listening to sad songs on a monday morning otw...
31960,31961,1,"@user #sikh #temple vandalised in in #calgary,..."


In [None]:
test_tweets = pd.read_csv('/content/Twitter-Sentiment-Analysis-Supervised-Learning/Data/test_tweets.csv')
test_tweets

Unnamed: 0,id,tweet
0,31963,#studiolife #aislife #requires #passion #dedic...
1,31964,@user #white #supremacists want everyone to s...
2,31965,safe ways to heal your #acne!! #altwaystohe...
3,31966,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew..."
...,...,...
17192,49155,thought factory: left-right polarisation! #tru...
17193,49156,feeling like a mermaid ð #hairflip #neverre...
17194,49157,#hillary #campaigned today in #ohio((omg)) &am...
17195,49158,"happy, at work conference: right mindset leads..."


In [None]:
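# Removing the "id" column from the train data. The dataframe will now have only 2 columns: `label` and `tweet`.
# `.copy()` gives an independent dataframe, which avoids pandas' SettingWithCopyWarning when a column is added later.
train_tweets= train_tweets[["label","tweet"]].copy()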
train_tweets

Unnamed: 0,label,tweet
0,0,@user when a father is dysfunctional and is s...
1,0,@user @user thanks for #lyft credit i can't us...
2,0,bihday your majesty
3,0,#model i love u take with u all the time in ...
4,0,factsguide: society now #motivation
...,...,...
31957,0,ate @user isz that youuu?ðððððð...
31958,0,to see nina turner on the airwaves trying to...
31959,0,listening to sad songs on a monday morning otw...
31960,1,"@user #sikh #temple vandalised in in #calgary,..."


In [None]:
# Keeping only the `tweet` column of the test data. `test` is a pandas Series, not a dataframe.
test = test_tweets["tweet"]
test

0        #studiolife #aislife #requires #passion #dedic...
1         @user #white #supremacists want everyone to s...
2        safe ways to heal your #acne!!    #altwaystohe...
3        is the hp and the cursed child book up for res...
4          3rd #bihday to my amazing, hilarious #nephew...
                               ...                        
17192    thought factory: left-right polarisation! #tru...
17193    feeling like a mermaid ð #hairflip #neverre...
17194    #hillary #campaigned today in #ohio((omg)) &am...
17195    happy, at work conference: right mindset leads...
17196    my   song "so glad" free download!  #shoegaze ...
Name: tweet, Length: 17197, dtype: object

## **2. Exploratory Data Analysis**


### **1) Tweet Length**


In [None]:
# looking at the length of each Tweet
train_tweets['length'] = train_tweets['tweet'].apply(len)

In [None]:
# Visualizing the average length of Tweets with Positive Sentiment (label "0") compared with that of Tweets with Negative Sentiment (label "1").
# `length` counts characters, not words. From the bar plot below, the average length of Tweets in each category is almost the same.
sns.barplot(x='label', y='length', data=train_tweets)
plt.title('Average Tweet Length (characters) vs Label')



### **2) Data Label Distribution**
A balanced dataset contains an equal or almost equal number of samples from the positive and negative classes. The number of Tweets with Positive Sentiment (label "0") and the number of Tweets with Negative Sentiment (label "1") are examined below.

From the graph below, there is a major imbalance in the data. There are many more Tweets labelled "0" and very few labelled "1".

**Imbalanced Data**

Imbalanced classes are a common problem in machine learning classification, where the classes contain very different numbers of observations. Class imbalance can be found in many different areas including medical diagnosis, spam filtering, and fraud detection. More references are in [this article](https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/).
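
To make the effect of imbalance concrete, the sketch below uses made-up label counts (the `labels` list is hypothetical, not the real dataset). It shows the accuracy obtained by always predicting the majority class, and the inverse-frequency class weights that are one common way to counter the imbalance.

```python
from collections import Counter

# Hypothetical label list: 93 positive (0) and 7 negative (1) Tweets.
labels = [0] * 93 + [1] * 7

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Accuracy of a "classifier" that always predicts the majority class.
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / n_samples

# Inverse-frequency weights: n_samples / (n_classes * count_of_class).
# This is the formula scikit-learn documents for class_weight="balanced".
class_weights = {c: n_samples / (n_classes * k) for c, k in counts.items()}

print(f"majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.2f}")
print("class weights:", class_weights)
```

A model trained on such data has to beat this baseline before its accuracy says anything about how well it detects the minority class.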

In [None]:
# Plotting the number of Tweets with Positive Sentiment (label "0") and the number of Tweets with Negative Sentiment (label "1") in the dataset.
sns.countplot(x= 'label',data = train_tweets)
plt.title('Label Counts')

## **3. Text Processing**

The main issue with this data is that it is all in text format (strings). Classification algorithms need some sort of numerical feature vector to perform the classification task. This notebook tries out the following techniques for preprocessing the raw data:
1. Removal of punctuations.
2. Removal of commonly used words (stopwords).
3. Normalization of words.



In [None]:
# import the `nltk` library for text processing
import nltk

#### **i) Removal Of Punctuations**




In [None]:
# import the `TextBlob` library to remove all the punctuation from the Tweet; its word tokenizer relies on NLTK's `punkt` data
nltk.download('punkt')
from textblob import TextBlob

In [None]:
# Rebuilding the Tweet as a string of words (with hashtag signs and other punctuation removed)
def form_sentence(tweet):
  # TextBlob(tweet) wraps the Tweet in a TextBlob object that the library can process.
  tweet_blob = TextBlob(tweet)
  # `.words` is the list of word tokens without punctuation; join them back into one string, separated by a space (`" "`).
  return ' '.join(tweet_blob.words)

In [None]:
# example
form_sentence("#this is a good and clear example!!! XDD :)")

#### **ii) Removal Of Stopwords And Words With Unusual Symbols** 

Stopwords are very common words in English, for example 'the', 'a', etc. 
- Import a list of stopwords from the NLTK library to remove them. 
- Stopwords contribute little to learning because they carry little information about sentiment. 
- **Removing them shrinks the vocabulary and saves computation**, and it **can improve the accuracy of the model**.


In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')
import re

In [None]:
# Removing stopwords and words with unusual symbols
def no_user_alpha(tweet):
    # removes the keyword "user" from the tweet
    tweet_list = [ele for ele in tweet.split() if ele != 'user']

    # keeps only the tokens made up entirely of letters or underscores, with the use of a regular expression.
    # Tokens containing any other character are dropped whole: "Hello" is kept, while "@Hello", "#Hello" and "123Hello" are removed.
    # [Regular Expression](https://www.w3schools.com/python/python_regex.asp)
    clean_tokens = [t for t in tweet_list if re.match(r'[^\W\d]*$', t)]
    clean_s = ' '.join(clean_tokens)

    # removes all the stopwords provided by the NLTK library; the set is built once instead of reloading the list for every word
    stop_words = set(stopwords.words('english'))
    clean_mess = [word for word in clean_s.split() if word.lower() not in stop_words]
    return clean_mess

In [None]:
# example
no_user_alpha("#this is a good and clear example!!! :)")
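
The token filter used inside `no_user_alpha` can be checked in isolation. The sketch below applies the same pattern to a few hand-picked tokens (the sample list is an assumption made for illustration). A token is kept only if it consists entirely of letters or underscores; a token containing a digit or symbol is dropped whole rather than trimmed.

```python
import re

# Same pattern as in no_user_alpha: zero or more characters that are
# word characters but not digits, anchored to the end of the token.
pattern = re.compile(r'[^\W\d]*$')

# Hypothetical sample tokens.
tokens = ["Hello", "@Hello", "#Hello", "123Hello", "good", "example!!!", ":)", "snake_case"]

# re.match anchors at the start, and `$` anchors at the end, so the
# whole token must consist of allowed characters to be kept.
kept = [t for t in tokens if pattern.match(t)]
print(kept)
```

In the combined function, `form_sentence` runs first and strips symbols such as "#" and "!", which is why hashtag words still survive this filter there.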

#### **iii) Function For Text Processing (Combining Step i and ii)**



In [None]:
# Define a single function that performs both text-processing steps,
# reusing `form_sentence` and `no_user_alpha` defined above:
def text_processing(tweet):

    # Step i: rebuild the Tweet as a string of words (hashtag signs and other punctuation removed)
    new_tweet = form_sentence(tweet)

    # Step ii: remove the "user" keyword, stopwords and words with unusual symbols
    no_punc_tweet = no_user_alpha(new_tweet)

    return no_punc_tweet

In [None]:
# apply the text processing function to our train and test dataframes
# apply() method to apply a function to entire column in the dataframe
train_tweets['tweet_list'] = train_tweets['tweet'].apply(text_processing)
test_tweets['tweet_list'] = test_tweets['tweet'].apply(text_processing)

In [None]:
# comparing the original list and the list with only important words in the same dataframe
train_tweets

Unnamed: 0,label,tweet,length,tweet_list
0,0,@user when a father is dysfunctional and is s...,102,"[father, dysfunctional, selfish, drags, kids, ..."
1,0,@user @user thanks for #lyft credit i can't us...,122,"[thanks, lyft, credit, ca, use, cause, offer, ..."
2,0,bihday your majesty,21,"[bihday, majesty]"
3,0,#model i love u take with u all the time in ...,86,"[model, love, u, take, u, time]"
4,0,factsguide: society now #motivation,39,"[factsguide, society, motivation]"
...,...,...,...,...
31957,0,ate @user isz that youuu?ðððððð...,68,"[ate, isz, youuu]"
31958,0,to see nina turner on the airwaves trying to...,131,"[see, nina, turner, airwaves, trying, wrap, ma..."
31959,0,listening to sad songs on a monday morning otw...,63,"[listening, sad, songs, monday, morning, otw, ..."
31960,1,"@user #sikh #temple vandalised in in #calgary,...",67,"[sikh, temple, vandalised, calgary, wso, conde..."


## **4. Model Building & Training**


In [None]:
from sklearn.model_selection import train_test_split

# The test split here (`msg_test` and `label_test`) is held out from the labeled training data; it is not the `test_tweets` dataframe
msg_train, msg_test, label_train, label_test = train_test_split(train_tweets['tweet'], train_tweets['label'], test_size=0.2)
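
Because the labels are heavily imbalanced, a purely random split can leave the two parts with slightly different class ratios, and the split changes on every run. The sketch below, on a small made-up dataset, shows the `stratify` and `random_state` parameters of `train_test_split`. Using them here is a suggestion, not something the split above does.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 90 positive (0) and 10 negative (1) examples.
texts = [f"tweet {i}" for i in range(100)]
labels = [0] * 90 + [1] * 10

# stratify keeps the 90/10 class ratio in both parts;
# random_state makes the split reproducible.
x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

print(sum(y_tr), "negative Tweets in train,", sum(y_te), "in test")
```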


## **4.1 Scikit-learn Pipeline (and Naive Bayes Classification) Method**

#### **Creating and Training the Model**

A pipeline chains several transformers and a final estimator so that they can be fit and applied as a single object. Scikit-learn's [pipeline](http://scikit-learn.org/stable/modules/pipeline.html) lets us define this workflow once, so that every transformation applied to the training data can be reused on future data. 

Library: **sklearn.pipeline.Pipeline**

*   Pipeline of transforms with a final estimator.
*   The purpose of the pipeline is to assemble several steps that can be **cross-validated** together while setting different parameters. 

---

Sequentially transform and classify the data in the following order:
1. **Bag of Words (BOW) Transformation:** A simplified representation of a text document that records how often each word occurs and ignores word order.
  * Library: **sklearn.feature_extraction.text.CountVectorizer**
    *   Converts a collection of text documents to a matrix of token counts

2. **TF-IDF Transformation:** A weighting scheme that scores how important a word is to a document: the count of a word is scaled down when the word appears in many documents.
  * Library: **sklearn.feature_extraction.text.TfidfTransformer**
    *   Transforms a count matrix to a normalized tf or tf-idf representation

3. **Naive Bayes Classification:** A classification algorithm that uses Bayes' theorem to classify objects. 
  * Library: **sklearn.naive_bayes.MultinomialNB**
    *   Naive Bayes classifier for multinomial models
    *   The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
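
The three steps above can be sketched on a tiny corpus. The documents, labels and the name `toy_pipeline` below are made up for illustration, and the default `CountVectorizer` analyzer is used instead of the notebook's `text_processing` function, so this shows the mechanics rather than the exact model trained in this notebook.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: label 0 = positive, label 1 = negative.
docs = [
    "love this sunny day",
    "happy birthday to you",
    "what a lovely happy morning",
    "i hate this awful hate",
    "awful people spreading hate",
    "hate hate hate",
]
y = [0, 0, 0, 1, 1, 1]

# Inspect the first two steps on their own.
bow = CountVectorizer()
counts = bow.fit_transform(docs)                   # sparse matrix of token counts
tfidf = TfidfTransformer().fit_transform(counts)   # re-weighted, L2-normalised rows
print("vocabulary size:", len(bow.vocabulary_))
print("count matrix shape:", counts.shape, "tf-idf shape:", tfidf.shape)

# Chain all three steps; fit() runs them in order.
toy_pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("classifier", MultinomialNB()),
])
toy_pipeline.fit(docs, y)
print(toy_pipeline.predict(["happy sunny morning", "so much hate"]))
```

Calling `predict` on raw strings works because the pipeline applies the fitted vectorizer and TF-IDF transform before the classifier, which is the main convenience of wrapping the steps together.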





In [None]:
# Import required libraries
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

In [None]:
# Instantiation
pipeline = Pipeline([
    ('bow',CountVectorizer(analyzer=text_processing)),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', MultinomialNB()),  # train on TF-IDF vectors w/ Naive Bayes classifier
])

In [None]:
# Check the sizes of the train and test splits
msg_train.shape, msg_test.shape, label_train.shape, label_test.shape

((25569,), (6393,), (25569,), (6393,))

**iii) Model Fitting**

Now we use `.fit` to train the model using the train dataset.

In [None]:
pipeline.fit(msg_train,label_train)

Pipeline(steps=[('bow',
                 CountVectorizer(analyzer=<function text_processing at 0x7f8e2aa7fdd0>)),
                ('tfidf', TfidfTransformer()),
                ('classifier', MultinomialNB())])

### **5. Model Prediction & Evaluation**

#### **Prediction**

In [None]:
# `predictions` is an array of labels (either 0 or 1), one for each Tweet in the held-out test split, msg_test
predictions = pipeline.predict(msg_test)
predictions

Let's see what our predictions look like. `predictions` is an array of labels (either 0 or 1), one for each Tweet in the held-out test split, `msg_test`.

In [None]:
predictions

#### **Evaluation**


In [None]:
# import libraries to evaluate our model to see how good it is in predicting the type of Tweet by comparing the predicted labels with the actual labels in the test dataset.
from sklearn.metrics import confusion_matrix, classification_report ,accuracy_score

In [None]:
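# print the Classification Report
# It summarizes precision, recall and F1-score for each class. Read more about these metrics
# [here](https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9)
# Note: scikit-learn documents the argument order as (y_true, y_pred). The predictions are passed first here,
# so "support" counts predicted labels and the precision and recall columns are exchanged.
print(classification_report(predictions,label_test))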

In [None]:
# Print the accuracy score
# Accuracy score is the fraction of correctly predicted labels. Note that there is room to improve this accuracy by tuning parameters using GridSearchCV and by trying other preprocessing techniques.
print(accuracy_score(predictions,label_test))

#### **Confusion Matrix**
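Because the predictions are passed as the first argument, the rows of the matrix are the predicted labels and the columns are the actual labels. The counts depend on the random train/test split, so a fresh run gives slightly different numbers. The following confusion matrix can be read as:
* Tweets predicted as Positive Sentiment, correctly: 5954
* Tweets predicted as Positive Sentiment, incorrectly: 390
* Tweets predicted as Negative Sentiment, incorrectly: 2
* Tweets predicted as Negative Sentiment, correctly: 72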

In [None]:
print(confusion_matrix(predictions,label_test))

## **4.2 LSTM Method**

In [137]:
msg_test

29452                                   @user   bihday rg 
7242     new zelda got me, link inventory the world, fr...
13905                 5 proven #habits that   people have 
21143    love you...!!! ððð #friend #friends #...
3869     a sad couple days in #orlando . gun violence i...
                               ...                        
21709    "stood up for coffee" uploaded to @user  #brun...
10495    defo missing mummy karen @user #misogs   #zshq...
24377                     about to see @user in conce!!!  
7892      @user when you get excited about summer but t...
31700     @user camp colca, camp titicaca and camp mora...
Name: tweet, Length: 6393, dtype: object

In [None]:
# Word2vec
import gensim

# WORD2VEC 
W2V_SIZE = 300
W2V_WINDOW = 7
W2V_EPOCH = 32
W2V_MIN_COUNT = 10

# gensim 4.x names are used here: `vector_size` (formerly `size`) and `wv.key_to_index` (formerly `wv.vocab`)
w2v_model = gensim.models.word2vec.Word2Vec(vector_size=W2V_SIZE, 
                                            window=W2V_WINDOW, 
                                            min_count=W2V_MIN_COUNT, 
                                            workers=2)
documents = [_text.split() for _text in msg_train]
w2v_model.build_vocab(documents)

In [None]:
words = w2v_model.wv.key_to_index.keys()
vocab_size = len(words)
print("Vocab size", vocab_size)

w2v_model.train(documents, total_examples=len(documents), epochs=W2V_EPOCH)

Vocab size 3150


(5973261, 10774304)

In [None]:
# similarity queries live on the word vectors (`wv`) of the model
w2v_model.wv.most_similar("pray")

### **Keras Tokenizer class**
- Remove punctuation and split strings into lists of individual words
- Convert the individual words into integers

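By default, the Tokenizer **removes all punctuation, lowercases words**, and then **converts words to sequences of integers**. A Tokenizer is first fit on a list of strings and then converts this list into a list of lists of integers. Note that the cell below passes `lower=False`, so lowercasing is switched off and the original casing is kept.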
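
The fit, transform and pad steps can be illustrated without Keras. The sketch below is a simplified stand-in written for this explanation, not the Keras implementation: it ignores punctuation filtering and lowercasing, and the sample texts and the helper `pad` are hypothetical. It mirrors two conventions of the real classes: word indices start at 1 with the most frequent word first, and `pad_sequences` pads and truncates at the front by default.

```python
from collections import Counter

texts = ["love this day", "hate this", "love love"]  # hypothetical Tweets

# 1) Fit: rank words by frequency; the most frequent word gets index 1.
#    Index 0 is reserved for padding.
freq = Counter(word for text in texts for word in text.split())
word_index = {w: i for i, (w, _) in enumerate(freq.most_common(), start=1)}

# 2) Transform: replace each known word by its integer index.
sequences = [[word_index[w] for w in text.split() if w in word_index] for text in texts]

# 3) Pad: keep the last max_len items if the sequence is too long,
#    then left-pad with zeros up to max_len.
def pad(seq, max_len):
    seq = seq[-max_len:]
    return [0] * (max_len - len(seq)) + seq

padded = [pad(s, 4) for s in sequences]
print(word_index)
print(padded)
```

The `+ 1` in `vocab_size = len(tokenizer.word_index) + 1` in the next cell accounts for the reserved padding index 0.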

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

### Create a Tokenizer object
tokenizer = Tokenizer(
    ### max number of words to keep
    # num_words = None,    

    ### String of characters to be removed; this commented-out alternative would keep '!?",.'
    ### (the default filter, used here, removes them as well)
    # filters = '#$%&()*+-<=>@[\\]^_`{|}~\t\n',   

    ### Whether to convert the texts to lowercase
    lower = False, 
    
    ### Separator for word splitting.
    # split = ' '
    )

### Fit the tokenizer on the training texts
tokenizer.fit_on_texts(msg_train)

### +1 because index 0 is reserved for padding
vocab_size = len(tokenizer.word_index) + 1
print("Total words", vocab_size)

### Convert lists of strings into lists of lists of integers, padded to a fixed length
SEQUENCE_LENGTH = 300
x_train = pad_sequences(tokenizer.texts_to_sequences(msg_train), maxlen=SEQUENCE_LENGTH)
x_test = pad_sequences(tokenizer.texts_to_sequences(msg_test), maxlen=SEQUENCE_LENGTH)

In [None]:
# Labels are converted from pandas Series to NumPy column vectors of shape (n, 1)
print("msg_train", x_train.shape)
y_train = label_train.to_numpy().reshape(-1,1)
print("label_train", y_train.shape)
print()
print("msg_test",x_test.shape)
y_test = label_test.to_numpy().reshape(-1,1)
print("label_test", y_test.shape)

### **Embedding layer**

Keras offers an ***Embedding layer*** that can be used for neural networks on text data.

It requires that the input data be ***integer encoded***, so that each word is represented by a unique integer. This data preparation step can be performed using the Tokenizer API also provided with Keras.

The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset.

It is a flexible layer that can be used in a variety of ways, such as:

 - It can be used alone to learn a word embedding that can be saved and used in another model later.
 - It can be used as part of a deep learning model where the embedding is learned along with the model itself.
 - It can be used to load a pre-trained word embedding model, a type of transfer learning
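
Conceptually, an embedding layer is a lookup table: each integer selects one row of a weight matrix. The NumPy sketch below illustrates that lookup with made-up sizes and random weights. It is a stand-in for the idea, not Keras code.

```python
import numpy as np

vocab_size, embedding_dim = 5, 3   # hypothetical sizes; index 0 is padding

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))
embedding_matrix[0] = 0.0          # keep the padding row at zero

# One padded, integer-encoded "Tweet" of length 4.
sequence = np.array([0, 0, 4, 2])

# The lookup is plain row indexing: one vector per token.
embedded = embedding_matrix[sequence]
print(embedded.shape)
```

Loading a pre-trained embedding amounts to filling `embedding_matrix` with vectors learned elsewhere, such as the Word2Vec vectors trained above, instead of random values.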

In [133]:
vocab_size, W2V_SIZE

(39739, 300)

In [138]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Masking, Embedding
from tensorflow.keras import callbacks
import time
import numpy as np

# The parameters are the padded integer sequences and the label column vectors prepared above
def LSTM_model(x_train, y_train, x_test, y_test):
  start_time = time.time()
  model = Sequential()

  # Embedding layer: copy the Word2Vec vector of every tokenizer word that Word2Vec knows.
  # Rows of unknown words (and the padding row 0) stay all-zero.
  embedding_matrix = np.zeros((vocab_size, W2V_SIZE))
  for word, i in tokenizer.word_index.items():
    if word in w2v_model.wv:
      embedding_matrix[i] = w2v_model.wv[word]

  model.add(
      Embedding(input_dim=vocab_size,
                input_length = SEQUENCE_LENGTH,
                output_dim=W2V_SIZE,
                weights=[embedding_matrix],
                trainable=False,
                mask_zero=True))

  # Masking layer for pre-trained embeddings: masks time steps whose embedding vector is all zeros
  model.add(Masking(mask_value=0.0))

  # Recurrent layer
  model.add(LSTM(64, return_sequences=False, 
                dropout=0.1, recurrent_dropout=0.1))

  # Fully connected layer
  model.add(Dense(64, activation='relu'))

  # Dropout for regularization
  model.add(Dropout(0.5))

  # Output layer
  model.add(Dense(1, activation='sigmoid'))

  # Compile the model
  model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

  early_stopping = callbacks.EarlyStopping(monitor="val_loss", min_delta=0.001, patience=5, restore_best_weights=True)
  history_LSTM = model.fit(x_train, y_train, batch_size=8, validation_data=(x_test,y_test),epochs=8,verbose=1, callbacks=[early_stopping])
  loss, acc = model.evaluate(x_test, y_test, verbose=0)
  # predict the probability of label 1 for each Tweet in the test dataset
  predictions = model.predict(x_test)
  end_time = time.time()
  return acc, pd.DataFrame(history_LSTM.history), predictions, end_time - start_time

LSTM_acc, LSTM_plot, LSTM_predictions, LSTM_elapse_time = LSTM_model(x_train ,y_train, x_test, y_test)
LSTM_plot.plot()
print(f"The accuracy of LSTM model is {LSTM_acc*100:0.1f}%, with training time {int(LSTM_elapse_time // 60)} minutes and {LSTM_elapse_time%60:0.0f} seconds.")


Epoch 1/8
Epoch 2/8
Epoch 3/8

KeyboardInterrupt: ignored

### **Model Prediction & Evaluation**

### **Evaluation**

We will now evaluate our model to see how good it is in predicting the type of Tweet by comparing the predicted labels with the actual labels in our test dataset.

In [None]:
from sklearn.metrics import confusion_matrix, classification_report,accuracy_score

#### **Classification Report**
We use different statistical evaluation methods to evaluate our model. You can read more about these methods [here](https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9).

In [None]:
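# `LSTM_predictions` holds sigmoid probabilities, so threshold them at 0.5 to get 0/1 labels.
# Note: scikit-learn documents the argument order as (y_true, y_pred). The predictions are passed first here,
# so "support" counts predicted labels and the precision and recall columns are exchanged.
print(classification_report((LSTM_predictions > 0.5).astype(int), y_test))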

              precision    recall  f1-score   support

           0       1.00      0.94      0.97      6325
           1       0.15      1.00      0.25        68

    accuracy                           0.94      6393
   macro avg       0.57      0.97      0.61      6393
weighted avg       0.99      0.94      0.96      6393



#### **Accuracy Score**
Accuracy score is the fraction of correctly predicted labels. Note that there is room to improve this accuracy by tuning parameters using GridSearchCV and by trying other preprocessing techniques.


In [None]:
# threshold the sigmoid probabilities at 0.5 to get 0/1 labels before scoring
print(accuracy_score((LSTM_predictions > 0.5).astype(int), y_test))

0.9374315657750665


#### **Confusion Matrix**
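Because the predictions are passed as the first argument, the rows of the matrix are the predicted labels and the columns are the actual labels. The following confusion matrix can be read as:
* Tweets predicted as Positive Sentiment, correctly: 5925
* Tweets predicted as Positive Sentiment, incorrectly (actually negative): 400
* Tweets predicted as Negative Sentiment, incorrectly (actually positive): 0
* Tweets predicted as Negative Sentiment, correctly: 68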

In [None]:
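# threshold the sigmoid probabilities at 0.5 to get 0/1 labels before building the matrix
print(confusion_matrix((LSTM_predictions > 0.5).astype(int), y_test))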

[[5925  400]
 [   0   68]]
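
The headline metrics can be recomputed by hand from the matrix printed above. The sketch below copies its four numbers and reads rows as predicted labels and columns as actual labels, which follows from the predictions being passed as the first argument. The variable names are illustrative. Note that the classification report earlier in this section was produced with the same argument order, so its precision and recall columns are exchanged relative to the values computed here.

```python
# Matrix printed above: rows = predicted label, columns = actual label.
cm = [[5925, 400],
      [0, 68]]

pred0_true0, pred0_true1 = cm[0]
pred1_true0, pred1_true1 = cm[1]
total = pred0_true0 + pred0_true1 + pred1_true0 + pred1_true1

# Share of all Tweets whose predicted label equals the actual label.
accuracy = (pred0_true0 + pred1_true1) / total

# Metrics for the negative class (label 1):
# precision = of the Tweets flagged as negative, the share that is truly negative
precision_neg = pred1_true1 / (pred1_true1 + pred1_true0)
# recall = of the truly negative Tweets, the share that was flagged
recall_neg = pred1_true1 / (pred1_true1 + pred0_true1)

print(f"accuracy={accuracy:.4f} precision(1)={precision_neg:.2f} recall(1)={recall_neg:.2f}")
```

Read this way, the model finds only a small share of the negative Tweets even though its overall accuracy is high, which is the typical signature of the class imbalance discussed in the exploratory analysis.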
