<a href="https://colab.research.google.com/github/prachitshukla/Team-2/blob/coronavirus_sentiment_analysis/Copy_of_M4_Mini_Hackathon_To_Perform_Classification_of_Coronavirus_Tweets_DAN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Programme in Deep Learning (Foundations and Applications)
## A Program by IISc and TalentSprint

### Mini Project Notebook: Text classification of coronavirus tweets from the peak COVID-19 period using LSTMs/RNNs/CNNs/BERT


## Learning Objectives

At the end of the mini-hackathon, you will be able to:

* preprocess the text data
* represent the text/words using pretrained word embeddings - Word2Vec/GloVe
* build a deep neural network (RNN, LSTM, GRU, Bidirectional LSTM, CNN, BERT) to classify the tweets


### Introduction

First, why is sentiment analysis needed for social media?

People all around the world are using social media more than ever. Sentiment analysis of social media data helps gauge public opinion on topics such as movies, events, politics, and sports, and yields valuable insights from this data. It also has practical applications: some businesses use it for market research and to understand customers' experiences with their products or services.

A natural follow-up question is: why perform sentiment analysis on COVID-19 tweets? What about coronavirus tweets could be positive? Sentiment analysis of movie or book reviews is familiar, but what is the purpose of exploring and analyzing this type of data?

The use of social media for communication during crises has increased remarkably in recent years. As mentioned above, analyzing social media data helps in understanding public sentiment. During the coronavirus pandemic, many people took to social media to express anger, grief, or sadness, while others spread happiness and positivity. People also used social media to ask their network for help with vaccines or hospitals during this hard time. Experts who take this social data into account may be better placed to address issues related to the pandemic. That is why analyzing this type of data is important for understanding the overall issues people faced.



## Dataset

The given challenge is to build a multiclass classification model to predict the sentiment of Covid-19 tweets. The tweets have been pulled from Twitter and manual tagging has been done. We are given information like Location, Tweet At, Original Tweet, and Sentiment.

The training dataset consists of 36000 tweets and the testing dataset consists of 8955 tweets. There are 5 sentiments namely ‘Positive’, ‘Extremely Positive’, ‘Negative’, ‘Extremely Negative’, and ‘Neutral’ in the sentiment column.

## Description

This dataset has the following information about the user who tweeted:

1. **UserName:** Twitter handle
2. **ScreenName:** a personal identifier on Twitter and is separate from the username
3. **Location:** where in the world the person tweets from
4. **TweetAt:** date of the tweet posted (DD-MM-YYYY)
5. **OriginalTweet:** the tweet itself
6. **Sentiment:** sentiment value
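Because `TweetAt` is stored day-first, parsing it with an explicit format avoids day/month mix-ups. The following is a minimal sketch using only the standard library; the sample strings are made up for illustration and are not taken from the dataset.

```python
from datetime import datetime

# Hypothetical sample values in the DD-MM-YYYY layout described above
sample_dates = ["16-03-2020", "02-04-2020"]

# An explicit format string keeps 02-04-2020 as 2 April, not 4 February
parsed_dates = [datetime.strptime(value, "%d-%m-%Y") for value in sample_dates]

for raw, parsed in zip(sample_dates, parsed_dates):
    print(raw, "->", parsed.date().isoformat())
```

On a whole DataFrame column, `pd.to_datetime(df["TweetAt"], format="%d-%m-%Y")` applies the same idea, where `df` stands for whichever DataFrame holds the tweets.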



## Problem Statement

To build and implement a multiclass deep neural network model that classifies tweets into Positive, Extremely Positive, Negative, Extremely Negative, and Neutral sentiments.

## Grading = 10 Marks

Here is a handy link to Kaggle's competition documentation (https://www.kaggle.com/docs/competitions), which includes, among other things, instructions on submitting predictions (https://www.kaggle.com/docs/competitions#making-a-submission).

## Instructions for downloading train and test dataset from Kaggle API are as follows:

### 1. Create an API key in Kaggle.

To do this, go to the competition site on Kaggle at (https://www.kaggle.com/t/db0ea322e4b14ad1b29d14fbe406d4e5) and open your user settings page as follows:

* Click on your profile picture at the top-right corner of the page.

![alt text](https://i.imgur.com/kSLmEj2.png)

* In the popout menu, click the Settings option.

![alt text](https://i.imgur.com/tNi6yun.png)








### 2. Next, scroll down to the API access section and click generate to download an API key (kaggle.json).
![alt text](https://i.imgur.com/vRNBgrF.png)


### 3. Upload your kaggle.json file using the following snippet in a code cell:



In [None]:
from google.colab import files
files.upload()

In [None]:
#If successfully uploaded in the above step, the 'ls' command here should display the kaggle.json file.
%ls

### 4. Install the Kaggle API using the following command


In [None]:
#!pip uninstall urllib3
#!pip install urllib3>=1.26.11
!pip install -U -q kaggle==1.5.8

#### 4.1 List of installed packages

In [None]:
!pip list

### 5. Move the kaggle.json file into ~/.kaggle, which is where the API client expects your token to be located:



In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

In [None]:
# Execute the following command to verify whether the kaggle.json is stored in the appropriate location: ~/.kaggle/kaggle.json
!ls ~/.kaggle

In [None]:
!chmod 600 /root/.kaggle/kaggle.json # run this command to ensure your Kaggle API token is secure on colab

### 6. Now download the train and test data from Kaggle

**NOTE: If you get a '404 - Not Found' error after running the cell below, it is most likely that the user (whose kaggle.json is uploaded above) has not 'accepted' the rules of the competition and therefore has 'not joined' the competition.**

If you encounter a **401 - Unauthorized** error, download the latest **kaggle.json** by repeating steps 1 and 2.

In [None]:
#If you get a forbidden link, you have most likely not joined the competition.
!kaggle competitions download -c perform-classification-of-coronavirus-tweets

In [None]:
!unzip /content/perform-classification-of-coronavirus-tweets.zip

## YOUR CODING STARTS FROM HERE

* install gensim

## Import required packages

In [None]:
# Import required packages
import numpy as np
import pandas as pd
import chardet
import re
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt_tab')
nltk.download('stopwords')
from nltk.corpus import stopwords
import string
import itertools
import seaborn as sns
from sklearn.manifold import TSNE
from matplotlib import pyplot as plt
#import matplotlib
#import matplotlib.patches as mpatches
tsne = TSNE(n_components=2)
from gensim.utils import simple_preprocess
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
from sklearn.utils import shuffle
from sklearn.metrics.pairwise import cosine_similarity
from wordcloud import WordCloud

##   **Stage 1**:  Data Loading and Exploratory Data Analysis (1 Point)

* Load the Dataset


In [None]:
# Detect the encoding of each CSV file with chardet, then load it into a DataFrame
with open('corona_nlp_test.csv/corona_nlp_test.csv',"rb") as data_test:
  result = chardet.detect(data_test.read())
  print(result)
  data_test_set = pd.read_csv('corona_nlp_test.csv/corona_nlp_test.csv', encoding=result['encoding'])

with open('corona_nlp_train.csv/corona_nlp_train.csv',"rb") as data_train:
  result = chardet.detect(data_train.read())
  print(result)
  data_train_set = pd.read_csv('corona_nlp_train.csv/corona_nlp_train.csv', encoding=result['encoding'])

* check first 5 records of train dataframe

In [None]:
print(data_train_set.head())

* Check for Missing Values

In [None]:
print(data_train_set.isnull().sum())

* Visualize the sentiment column values


In [None]:
print(data_train_set["Sentiment"])

* Visualize the top 10 locations with the most tweets using countplot (Tweet count vs Location)


In [None]:
plt.figure(figsize=(20,5))
sns.countplot(data=data_train_set, x=data_train_set['Location'],  order= data_train_set['Location'].value_counts().iloc[:10].index)
plt.show()

* Plotting Pie Chart for the Sentiments in percentage


In [None]:
plt.figure(figsize=(20,5))
# value_counts() already returns the number of tweets per sentiment
sentiment_count = data_train_set['Sentiment'].value_counts()
print(sentiment_count)
plt.pie(sentiment_count.values, labels=sentiment_count.index, autopct='%1.1f%%')
plt.show()

* WordCloud for the Tweets/Text

    * Visualize the most commonly used words in each sentiment using wordcloud
    * Refer to the following [link](https://medium.com/analytics-vidhya/word-cloud-a-text-visualization-tool-fb7348fbf502) for Word Cloud: A Text Visualization tool
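A word cloud scales each word by its frequency, so a per-sentiment cloud needs the tweets grouped by label first. The sketch below shows that grouping and counting step on a few made-up tweets; the sample data and the `top_words_per_sentiment` helper are illustrative assumptions, not part of the dataset or of the `wordcloud` package.

```python
from collections import Counter, defaultdict

# Made-up (sentiment, tweet) pairs standing in for the Sentiment and OriginalTweet columns
samples = [
    ("Positive", "great support from local stores"),
    ("Positive", "stores are open and staff are great"),
    ("Negative", "panic buying left empty shelves"),
    ("Negative", "empty shelves again today"),
]

def top_words_per_sentiment(pairs, top_n=3):
    """Count lower-cased, whitespace-split words separately for each sentiment."""
    counts = defaultdict(Counter)
    for sentiment, tweet in pairs:
        counts[sentiment].update(tweet.lower().split())
    return {sentiment: counter.most_common(top_n) for sentiment, counter in counts.items()}

top_words = top_words_per_sentiment(samples)
for sentiment, words in top_words.items():
    print(sentiment, words)
```

Per-sentiment frequencies of this kind, converted with `dict(words)`, could then be handed to `WordCloud.generate_from_frequencies` to draw one cloud per class.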




In [None]:
plt.figure(figsize=(20,5))
text=' '.join(data_train_set['OriginalTweet'].astype(str))
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

##   **Stage 2**: Data Pre-Processing  (2 Points)

####  Clean and Transform the data into a specified format
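Tweets usually carry URLs, @mentions and hashtags that add little sentiment signal. One possible extra cleaning step, sketched here with the standard `re` module, strips them before tokenization. The `strip_tweet_noise` helper and the sample tweet are hypothetical and not part of the original notebook.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")

def strip_tweet_noise(tweet):
    """Remove URLs and @mentions, keep hashtag words, and collapse whitespace."""
    tweet = URL_PATTERN.sub(" ", tweet)
    tweet = MENTION_PATTERN.sub(" ", tweet)
    tweet = tweet.replace("#", " ")  # "#covid19" becomes "covid19"
    return re.sub(r"\s+", " ", tweet).strip()

cleaned = strip_tweet_noise("@user Stock up calmly! https://t.co/abc123 #covid19")
print(cleaned)
```

Such a helper could be applied to each tweet before it reaches the tokenizer, so that fragments of links do not end up as tokens.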


* function to preprocess the data

In [None]:
# Data Preprocessing function
def preprocess_text(sen):

    sen = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", sen)
    sen = re.sub(r"\'s", " \'s", sen)
    sen = re.sub(r"[\([{})\]]", "", sen)

    # Tokenizing words
    tokens = word_tokenize(sen)

    # Converting to lower case
    tokens = [w.lower() for w in tokens]

    # Remove punctuations
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]

    # Remove non alphabet
    words = [word for word in stripped if word.isalpha()]

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    words = [w for w in words if w not in stop_words]

    return words

* list of unique sentiments

In [None]:
# list of sentiments
print(data_train_set['Sentiment'].unique())

* function to segregate sentiment specific words

In [None]:
def words_per_sentiment(sentences_per_sentiment):
  lines_for_sentiments = []
  # segregate sentiment specific words
  for sen in sentences_per_sentiment:
      # Call the preprocess_text function on each tweet
      lines_for_sentiments.append(preprocess_text(sen))
  return lines_for_sentiments


* segregate sentiment specific words

In [None]:
# Store the preprocessed tweets in a new list - Positive
sentences_pos = list(data_train_set.loc[data_train_set['Sentiment']=='Positive', 'OriginalTweet'])
print(len(sentences_pos))

lines_pos = words_per_sentiment(sentences_pos)
print(len(lines_pos))
print(lines_pos[0])


In [None]:
# Store the preprocessed tweets in a new list - Extremely Positive
sentences_ext_pos = list(data_train_set.loc[data_train_set['Sentiment']=='Extremely Positive', 'OriginalTweet'])
print(len(sentences_ext_pos))

lines_ext_pos = words_per_sentiment(sentences_ext_pos)
print(len(lines_ext_pos))
print(lines_ext_pos[0])

In [None]:
# Store the preprocessed tweets in a new list - Neutral
sentences_neu = list(data_train_set.loc[data_train_set['Sentiment']=='Neutral', 'OriginalTweet'])
print(len(sentences_neu))

lines_neu = words_per_sentiment(sentences_neu)
print(len(lines_neu))
print(lines_neu[0])

In [None]:
# Store the preprocessed tweets in a new list - Negative
sentences_neg = list(data_train_set.loc[data_train_set['Sentiment']=='Negative', 'OriginalTweet'])
print(len(sentences_neg))

lines_neg = words_per_sentiment(sentences_neg)
print(len(lines_neg))
print(lines_neg[0])

In [None]:
# Store the preprocessed tweets in a new list - Extremely Negative
sentences_ext_neg = list(data_train_set.loc[data_train_set['Sentiment']=='Extremely Negative', 'OriginalTweet'])
print(len(sentences_ext_neg))

lines_ext_neg = words_per_sentiment(sentences_ext_neg)
print(len(lines_ext_neg))
print(lines_ext_neg[0])

* convert sentiment specific list of words to simple word list

In [None]:
text_ext_pos = list(itertools.chain.from_iterable(lines_ext_pos))
text_pos = list(itertools.chain.from_iterable(lines_pos))
text_neu = list(itertools.chain.from_iterable(lines_neu))
text_neg = list(itertools.chain.from_iterable(lines_neg))
text_ext_neg = list(itertools.chain.from_iterable(lines_ext_neg))

print(f'''Extremely Positive : {len(text_ext_pos)} \t Positive : {len(text_pos)} \t Neutral : {len(text_neu)} \t Negative : {len(text_neg)} \t Extremely Negative : {len(text_ext_neg)}''')


##   **Stage 3**: Build the Word Embeddings using pretrained Word2vec/Glove (Text Representation) (1 Point)
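A simple way to turn pretrained word vectors into one fixed-size vector per tweet is to average the vectors of its tokens. The sketch below uses a tiny made-up embedding table in place of GloVe and skips out-of-vocabulary tokens. `toy_embeddings` and `average_embedding` are illustrative names, and skipping unknown words is only one possible choice among several.

```python
import numpy as np

# Tiny stand-in for a pretrained table such as GloVe (values are made up)
toy_embeddings = {
    "food": np.array([1.0, 0.0, 2.0]),
    "prices": np.array([3.0, 2.0, 0.0]),
}

def average_embedding(tokens, embeddings, dim):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    known = [embeddings[token] for token in tokens if token in embeddings]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# "skyrocket" is not in the toy table, so only two vectors are averaged
tweet_vector = average_embedding(["food", "prices", "skyrocket"], toy_embeddings, dim=3)
print(tweet_vector)
```

Averaging discards word order, which keeps the representation cheap to compute and gives every tweet the same dimensionality regardless of its length.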



* Import GloVe Embedding Files

In [None]:
!wget -qq https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/glove.6B.zip
!unzip -o -q glove.6B.zip

* create GloVe 50d embedding

In [None]:
GloVe_Dict_50d = {}
# Loading the 50-dimensional vectors; the GloVe file is UTF-8 encoded
with open("glove.6B.50d.txt", 'r', encoding="utf-8") as f:
  for line in f:
      values = line.split()
      word = values[0]
      vector = np.asarray(values[1:], "float32")
      GloVe_Dict_50d[word] = vector

print(len(GloVe_Dict_50d))

* Vector representation of words

In [None]:
def gen_vectors(text):
  # Keep only the words that have a GloVe vector
  vectors = [GloVe_Dict_50d[word] for word in text if word in GloVe_Dict_50d]
  # Guard against an empty result before reading the vector size
  vector_size = len(vectors[0]) if vectors else 0
  print("There are %d words and the vector size of each word is %d" % (len(vectors), vector_size))
  return vectors

In [None]:
# Passing the words of each sentiment class to the function gen_vectors
vectors_ext_pos = gen_vectors(text_ext_pos)
vectors_pos = gen_vectors(text_pos)
vectors_neu = gen_vectors(text_neu)
vectors_neg = gen_vectors(text_neg)
vectors_ext_neg = gen_vectors(text_ext_neg)

* Find cosine similarity
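Cosine similarity is the dot product of two vectors divided by the product of their norms, so it depends on direction and not on length. A minimal NumPy sketch with made-up vectors (the `cosine_sim` helper is illustrative):

```python
import numpy as np

def cosine_sim(vec_a, vec_b):
    """cos(theta) = (a . b) / (||a|| * ||b||)."""
    return float(np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))

a = np.array([1.0, 2.0, 3.0])

same_direction = cosine_sim(a, 2 * a)   # scaling a vector does not change the angle
opposite = cosine_sim(a, -a)            # reversed direction
orthogonal = cosine_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

print(same_direction, opposite, orthogonal)
```

Values close to 1 indicate vectors pointing the same way, values close to 0 indicate unrelated directions, and values close to -1 indicate opposite directions.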

In [None]:
def find_cosine_similarity(text, n_rows=4):
  # Keep only the words that have a GloVe vector, so every lookup below succeeds
  known_words = [word for word in text if word in GloVe_Dict_50d]

  # Rows: the first n_rows known words; columns: all known words
  row_words = known_words[:n_rows]
  all_vectors = np.array([GloVe_Dict_50d[word] for word in known_words])
  row_vectors = all_vectors[:n_rows]

  # cosine_similarity works on 2D arrays and returns an (n_rows, n_words) matrix
  word_similarity = cosine_similarity(row_vectors, all_vectors)

  # Create a DataFrame to view the similarity between words
  return pd.DataFrame(word_similarity, columns=known_words, index=row_words)

In [None]:
#df_neu = find_cosine_similarity(text_neu)
df_ext_pos = find_cosine_similarity(text_ext_pos)
df_pos = find_cosine_similarity(text_pos)
df_neg = find_cosine_similarity(text_neg)
df_ext_neg = find_cosine_similarity(text_ext_neg)

In [None]:
def glove_embeddings(text, dim):
    # An empty token list gets a zero vector
    if len(text) < 1:
        return np.zeros(dim)
    # Words missing from GloVe get a random vector
    vectorized = [GloVe_Dict_50d[word] if word in GloVe_Dict_50d else np.random.randn(dim) for word in text]
    # Return the average vector
    return np.mean(vectorized, axis=0)

def get_glove_embeddings(text, dimension):
    embeddings = text.apply(lambda x: glove_embeddings(x, dimension))
    return list(embeddings)

* Visualization of word vectors using TSNE

In [None]:
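# Tokenize each tweet first; iterating over a raw string would yield characters, not words
tweet_tokens = data_train_set['OriginalTweet'].apply(preprocess_text)
word_embeddings = get_glove_embeddings(tweet_tokens, dimension=50)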

In [None]:
def tsne_visualization(word_embeddings, n_points=100):
    # Project the first n_points tweet embeddings to 2D
    vectors = np.asarray(word_embeddings[:n_points])
    points = tsne.fit_transform(vectors)
    sentiments = data_train_set['Sentiment'].iloc[:n_points].to_numpy()

    plt.figure(figsize=(20,10))
    sns.scatterplot(x=points[:,0], y=points[:,1], hue=sentiments)

    # Label each point with its sentiment
    for label, px, py in zip(sentiments, points[:,0], points[:,1]):
        plt.annotate(label, xy=(px,py), xytext=(0,0), textcoords='offset points')
    plt.show()

In [None]:
tsne_visualization(word_embeddings)

In [None]:
stop_words = set(stopwords.words('english'))

data_train_set['OriginalTweet'] = data_train_set['OriginalTweet'].apply(lambda x:simple_preprocess(x, max_len=30))

# Remove stop words
data_train_set['OriginalTweet'] = data_train_set['OriginalTweet'].apply(lambda x: [w for w in x if not w in stop_words])

data_train_set.head()

* Map Sentiment values to numbers

In [None]:
# Extremely Negative = 0 ... Extremely Positive = 4
sentiment_to_id = {"Extremely Negative": 0, "Negative": 1, "Neutral": 2, "Positive": 3, "Extremely Positive": 4}
data_train_set["Sentiment"] = data_train_set["Sentiment"].str.strip().map(sentiment_to_id)
data_train_set.head()

In [None]:
# Store OriginalTweet and Sentiment
X = data_train_set["OriginalTweet"]
y = data_train_set['Sentiment']

In [None]:
#Get GloVe embedding for OriginalTweet
train_embeddings  = get_glove_embeddings(data_train_set['OriginalTweet'], dimension=50)
print(len(train_embeddings))

* Prepare Train and Test Sets

In [None]:
# Storing the train_embeddings in X
X = np.array(train_embeddings)

# Converting X into torch tensor
X = torch.Tensor(X)

# Reshaping X to 50 dimensions (one GloVe 50d vector per tweet)
X = X.reshape(-1, 50)
print(X.shape)

In [None]:
# Storing the labels in y
y = data_train_set['Sentiment']

# Converting y into a torch tensor of integer class ids
y = torch.tensor(y.to_numpy(), dtype=torch.long)

# Reshaping y to 1 dimension
y = y.reshape(-1,1)
print(y.shape)

* Prepare Train and Validation set

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, train_size = 30000)

In [None]:
# Set up device to run CUDA operations
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

In [None]:
train_dataset = TensorDataset(X_train,y_train)
# Shuffle the training batches every epoch; the evaluation order can stay fixed
train_loader = DataLoader(train_dataset,batch_size = 32, shuffle = True)
test_dataset = TensorDataset(X_test,y_test)
test_loader = DataLoader(test_dataset,batch_size = 32)

##   **Stage 4**: Build and Train the Deep Recurrent Model using Pytorch/Keras (4 Points)



In [None]:
class DAN(torch.nn.Module):
    def __init__(self, input_size, hidden_size, dp = 0.5, d_out = 5):
        super(DAN, self).__init__()
        self.input_size = input_size
        self.hidden_size  = hidden_size
        self.bn1 = nn.BatchNorm1d(input_size)
        self.fc1 = torch.nn.Linear(self.input_size, self.hidden_size)
        self.dropout1 = nn.Dropout(dp)
        self.bn2 = nn.BatchNorm1d(self.hidden_size)
        self.fc2 = torch.nn.Linear(self.hidden_size, 10)
        self.fc3 = torch.nn.Linear(10, d_out)

    def forward(self, x):
        x = self.bn1(x)
        # ReLU between the linear layers; without a non-linearity
        # the stacked layers reduce to a single linear map
        x = F.relu(self.fc1(x))
        x = self.dropout1(x)
        x = self.bn2(x)
        x = F.relu(self.fc2(x))
        # Raw class scores; CrossEntropyLoss applies softmax internally
        x = self.fc3(x)
        return x

In [None]:
# Input dimension of 50 (GloVe 50d) and hidden layer size of 32
model = DAN(50, 32)
#model = DAN(50, 64)
criterion = nn.CrossEntropyLoss()
# We will set the learning rate (lr) as 0.01
#optimizer = torch.optim.Adam(model.parameters(), lr = 0.001)
optimizer = torch.optim.Adam(model.parameters(), lr = 0.01)

print(model)

* Training Model

In [None]:
# Switch to training mode so that dropout and batch normalization behave accordingly
model.train()

# No of Epochs
epochs = 50

for epoch in range(epochs):
  running_loss = 0.0

  # Iterate through all the batches in each epoch
  for inputs, target in train_loader:

    # Zero the parameter gradients
    optimizer.zero_grad()

    # Forward pass
    outputs = model(inputs)

    # Compute Loss (CrossEntropyLoss expects 1-D integer class labels)
    target = target.reshape(-1).long()
    loss = criterion(outputs, target)

    # Backward pass, run for every batch
    loss.backward()

    # optimizer.step() updates the weights accordingly
    optimizer.step()

    running_loss += loss.item()

  # Report the average loss over all batches of the epoch
  epoch_loss = running_loss / len(train_loader)
  print('Epoch {}: train loss: {}'.format(epoch, epoch_loss))

print("We got the training loss as %f for %d epochs" % (epoch_loss, epochs))

##   **Stage 5**: Evaluate the Model and get model predictions on the test dataset (2 Points)
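Beyond a single loss value, a five-class problem is often easier to judge with overall accuracy plus a confusion matrix, which shows which sentiments get mistaken for each other. The following is a minimal sketch on made-up class ids; the arrays are illustrative and are not model output.

```python
import numpy as np

num_classes = 5

# Made-up true and predicted class ids in the range 0..4
y_true = np.array([0, 1, 2, 3, 4, 3, 1, 0])
y_pred = np.array([0, 1, 2, 4, 4, 3, 2, 0])

# Fraction of positions where the prediction equals the true label
accuracy = float((y_true == y_pred).mean())

# Rows are true classes, columns are predicted classes
confusion = np.zeros((num_classes, num_classes), dtype=int)
np.add.at(confusion, (y_true, y_pred), 1)

print("accuracy:", accuracy)
print(confusion)
```

The diagonal of the matrix counts correct predictions per class, while off-diagonal cells reveal confusions such as "Positive" being predicted as "Extremely Positive".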

* Upload the model predictions to Kaggle by mapping the sentiment column values from numerical to categorical







In [None]:
# Creating empty lists to store the labels and the predictions
labels = []
predictions = []

In [None]:
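model.eval()
total_loss = 0.0

# Gradients are not needed for evaluation
with torch.no_grad():
    for inputs, target in test_loader:

        # Forward pass
        outputs = model(inputs)

        # The predicted class is the index of the largest output
        _, out = torch.max(outputs, 1)
        predictions.append(out)

        # The true class ids go into labels
        target = target.reshape(-1).long()
        labels.append(target)

        total_loss += criterion(outputs, target).item()

print("We got the average loss as %f for test set." % (total_loss / len(test_loader)))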

In [None]:
#Check Label and Prediction
labels = torch.cat(labels, 0)
predictions = torch.cat(predictions,0)

In [None]:
# Count the positions where the prediction equals the true label
correct = (labels == predictions).sum().item()
print("%d predicted values match the label out of total %d validation set values." % (correct, len(test_dataset)))

### Instructions for preparing Kaggle competition predictions


* Get the predictions using the trained model and prepare a csv file
    * The network outputs one score per class; take the class with the highest score as the prediction using `np.argmax`.

* The predictions (csv) file should contain 2 columns, as in Sample_Submission.csv
  - First column is the Test_Id, which is treated as the index
  - Second column is the prediction in decoded form (e.g. Positive, Negative, etc.).
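The steps above can be sketched as follows. The scores are made up, the id-to-label mapping assumes an encoding that runs from 0 for "Extremely Negative" to 4 for "Extremely Positive", and both the `Sentiment` column name and ids starting at 0 are assumptions; check Sample_Submission.csv for the exact header and ids.

```python
import csv
import io

import numpy as np

# Assumed encoding: 0 = Extremely Negative ... 4 = Extremely Positive
id_to_sentiment = {
    0: "Extremely Negative",
    1: "Negative",
    2: "Neutral",
    3: "Positive",
    4: "Extremely Positive",
}

# Made-up model outputs: one row per test tweet, one score per class
scores = np.array([
    [0.1, 0.2, 0.1, 2.5, 0.3],
    [1.9, 0.4, 0.2, 0.1, 0.0],
    [0.0, 0.1, 3.0, 0.2, 0.1],
])

# The predicted class is the index of the highest score in each row
predicted_ids = np.argmax(scores, axis=1)
predicted_labels = [id_to_sentiment[int(i)] for i in predicted_ids]

# Written to an in-memory buffer here; a real submission would use
# open("submission.csv", "w", newline="") instead
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Test_Id", "Sentiment"])  # second column name is an assumption
for test_id, label in enumerate(predicted_labels):
    writer.writerow([test_id, label])

print(buffer.getvalue())
```

Whatever encoding was used when training must be inverted exactly here, otherwise the decoded labels will not match the classes the model learned.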