<a href="https://colab.research.google.com/github/prachitshukla/Team-2/blob/coronavirus_sentiment_analysis/Copy_of_M4_Mini_Hackathon_To_Perform_Classification_of_Coronavirus_Tweets_PS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Programme in Deep Learning (Foundations and Applications)
## A Program by IISc and TalentSprint

### Mini Project Notebook: To perform text classification of coronavirus tweets during the peak Covid-19 period using LSTMs/RNNs/CNNs/BERT.


## Learning Objectives

At the end of the mini-hackathon, you will be able to:

* preprocess the tweet text
* represent the text/words using pretrained word embeddings (Word2Vec/GloVe)
* build a deep neural network (RNN, LSTM, GRU, CNN, bidirectional LSTM/GRU, BERT) to classify the tweets


### Introduction

First, we need to understand why sentiment analysis is needed for social media.

People all around the world are using social media more than ever. Sentiment analysis of social media data helps us understand wider public opinion on topics such as movies, events, politics, and sports, and yields valuable insights from this social data. It also has powerful practical applications: some businesses now use it for market research and to understand customers' experiences with their products or services.

An interesting question that may arise with this problem statement is: why perform sentiment analysis on COVID-19 tweets? What could be positive about coronavirus tweets? You may have heard of sentiment analysis on movie or book reviews, but what is the purpose of exploring and analyzing this type of data?

The use of social media for communication during times of crisis has increased remarkably in recent years. As mentioned above, analyzing social media data is important because it helps us understand public sentiment. During the coronavirus pandemic, many people took to social media to express their anger, grief, or sadness, while some also spread happiness and positivity. People also used social media to ask their network for help related to vaccines or hospitals during this hard time. Many issues related to the pandemic could be addressed if experts took this social data into account. That is why analyzing this type of data is important for understanding the overall issues people faced.



## Dataset

The given challenge is to build a multiclass classification model to predict the sentiment of Covid-19 tweets. The tweets have been pulled from Twitter and manual tagging has been done. We are given information like Location, Tweet At, Original Tweet, and Sentiment.

The training dataset consists of 36000 tweets and the testing dataset consists of 8955 tweets. There are 5 sentiments namely ‘Positive’, ‘Extremely Positive’, ‘Negative’, ‘Extremely Negative’, and ‘Neutral’ in the sentiment column.

## Description

This dataset has the following information about each tweet and the user who posted it:

1. **UserName:** the Twitter handle
2. **ScreenName:** a personal identifier on Twitter, separate from the username
3. **Location:** where in the world the person tweets from
4. **TweetAt:** date of the tweet posted (DD-MM-YYYY)
5. **OriginalTweet:** the tweet itself
6. **Sentiment:** sentiment value



## Problem Statement

Build and implement a multiclass deep neural network model that classifies tweets as Positive, Extremely Positive, Negative, Extremely Negative, or Neutral.

## Grading = 10 Marks

Here is a handy link to Kaggle's competition documentation (https://www.kaggle.com/docs/competitions), which includes, among other things, instructions on submitting predictions (https://www.kaggle.com/docs/competitions#making-a-submission).

## Instructions for downloading train and test dataset from Kaggle API are as follows:

### 1. Create an API key in Kaggle.

To do this, go to the competition site on Kaggle (https://www.kaggle.com/t/db0ea322e4b14ad1b29d14fbe406d4e5) and open your user settings page as follows:

* Click on your profile picture at the top-right corner of the page.

![alt text](https://i.imgur.com/kSLmEj2.png)

* In the popout menu, click the Settings option.

![alt text](https://i.imgur.com/tNi6yun.png)








### 2. Next, scroll down to the API access section and click generate to download an API key (kaggle.json).
![alt text](https://i.imgur.com/vRNBgrF.png)


### 3. Upload your kaggle.json file using the following snippet in a code cell:



In [None]:
from google.colab import files
files.upload()

In [None]:
# If the upload in the previous step succeeded, 'ls' should list the kaggle.json file.
%ls

### 4. Install the Kaggle API using the following command


In [None]:
#!pip uninstall urllib3
#!pip install "urllib3>=1.26.11"
!pip install -U -q kaggle==1.5.8

#### 4.1 List the installed packages

In [None]:
!pip list

### 5. Move the kaggle.json file into ~/.kaggle, which is where the API client expects your token to be located:



In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

In [None]:
# Execute the following command to verify whether the kaggle.json is stored in the appropriate location: ~/.kaggle/kaggle.json
!ls ~/.kaggle

In [None]:
!chmod 600 ~/.kaggle/kaggle.json # restrict the token file to your user so the Kaggle API token stays private on Colab
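
Steps 3 to 5 can also be carried out from Python instead of shell commands. The following is a minimal sketch, assuming the token was uploaded to the current working directory; `install_kaggle_token` is a hypothetical helper name used here for illustration and is not part of the Kaggle client.

```python
import os
import shutil
from pathlib import Path


def install_kaggle_token(src="kaggle.json", dest_dir=None):
    """Copy the token into dest_dir (default: ~/.kaggle) and make it owner-only."""
    src = Path(src)
    if not src.is_file():
        raise FileNotFoundError(f"{src} not found; upload kaggle.json first")
    dest_dir = Path(dest_dir) if dest_dir is not None else Path.home() / ".kaggle"
    dest_dir.mkdir(parents=True, exist_ok=True)   # same effect as: mkdir -p ~/.kaggle
    dest = dest_dir / "kaggle.json"
    shutil.copy(src, dest)                        # same effect as: cp kaggle.json ~/.kaggle/
    os.chmod(dest, 0o600)                         # same effect as: chmod 600 on the token
    return dest
```

Calling `install_kaggle_token()` with no arguments after the upload cell would place the token at `~/.kaggle/kaggle.json`.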

### 6. Now download the train and test data from Kaggle

**NOTE: If you get a '404 - Not Found' error after running the cell below, it is most likely that the user (whose kaggle.json is uploaded above) has not 'accepted' the rules of the competition and therefore has 'not joined' the competition.**

If you encounter a **401 - Unauthorized** error, download a fresh **kaggle.json** by repeating steps 1 and 2.

In [None]:
# If you get a '403 - Forbidden' error, you have most likely not joined the competition.
!kaggle competitions download -c perform-classification-of-coronavirus-tweets

In [None]:
!unzip /content/perform-classification-of-coronavirus-tweets.zip

## YOUR CODING STARTS FROM HERE

## Import required packages

In [None]:
# Import required packages
import numpy as np
import pandas as pd
import chardet
#import re
#from nltk.tokenize import word_tokenize
#import nltk
#nltk.download('punkt_tab')
#nltk.download('stopwords')
#from nltk.corpus import stopwords
#from nltk.tokenize import word_tokenize
#import string
#import itertools
import seaborn as sns
#from sklearn.manifold import TSNE
from matplotlib import pyplot as plt
#import matplotlib
#import matplotlib.patches as mpatches
#tsne = TSNE(n_components=2)
#from gensim.utils import simple_preprocess
#from sklearn.model_selection import train_test_split
#from torch.utils.data import TensorDataset, DataLoader
#import torch
#import torch.nn as nn
#import torch.nn.functional as F
#import torchvision
#import torchvision.transforms as transforms
#from sklearn.utils import shuffle
#from sklearn.metrics.pairwise import cosine_similarity
from wordcloud import WordCloud

##   **Stage 1**:  Load the Data and Perform Exploratory Data Analysis (1 Point)

* Load the Dataset


In [None]:
# Detect each CSV file's encoding with chardet, then load the file into a DataFrame
def load_csv(path):
  with open(path, "rb") as f:
    result = chardet.detect(f.read())
  print(result)
  return pd.read_csv(path, encoding=result['encoding'])

data_test_set = load_csv('corona_nlp_test.csv/corona_nlp_test.csv')
data_train_set = load_csv('corona_nlp_train.csv/corona_nlp_train.csv')

* Check the first 5 records of the train dataframe

In [None]:
print(data_train_set.head())

* Check for Missing Values

In [None]:
print(data_train_set.isnull().sum())

* Visualize the sentiment column values


In [None]:
sns.countplot(data=data_train_set, x="Sentiment")  # bar chart of tweets per sentiment
plt.show()

* Visualize the top 10 locations with the most tweets using a countplot (Tweet count vs Location)


In [None]:
plt.figure(figsize=(20,5))
top_locations = data_train_set['Location'].value_counts().iloc[:10].index
sns.countplot(data=data_train_set, x='Location', order=top_locations)
plt.show()

* Plot a pie chart of the sentiment percentages


In [None]:
plt.figure(figsize=(20,5))
# Count tweets per sentiment once, then reuse the counts for printing and plotting
sentiment_count = data_train_set['Sentiment'].value_counts()
print(sentiment_count)
plt.pie(sentiment_count.values, labels=sentiment_count.index, autopct='%1.1f%%')
plt.show()

* WordCloud for the Tweets/Text

    * Visualize the most commonly used words in each sentiment using wordcloud
    * Refer to the following [link](https://medium.com/analytics-vidhya/word-cloud-a-text-visualization-tool-fb7348fbf502) for Word Cloud: A Text Visualization tool




In [None]:
plt.figure(figsize=(20,5))
text=' '.join(data_train_set['OriginalTweet'].astype(str))
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
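
The cell above draws a single cloud over all tweets, whereas the task asks for the most common words in each sentiment. Since a word cloud is sized by word frequency, one possible starting point is to count words per sentiment first. The sketch below does this with the standard library only; the helper name `top_words_by_sentiment`, the toy tweets, and the tiny stop-word set are illustrative assumptions, not part of the dataset.

```python
from collections import Counter, defaultdict
import re

# Tiny illustrative stop-word set (assumption); a real run would use a fuller list.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "for", "on"}


def top_words_by_sentiment(tweets, sentiments, n=3):
    """Return {sentiment: [(word, count), ...]} with the n most frequent words."""
    counters = defaultdict(Counter)
    for tweet, sentiment in zip(tweets, sentiments):
        words = re.findall(r"[a-z']+", str(tweet).lower())
        counters[sentiment].update(w for w in words if w not in STOP_WORDS)
    return {s: c.most_common(n) for s, c in counters.items()}


# Toy data standing in for the OriginalTweet and Sentiment columns
tweets = ["Stores are empty, panic everywhere", "Thank you to the grocery workers",
          "Panic buying is out of control", "Grateful for grocery delivery"]
labels = ["Negative", "Positive", "Negative", "Positive"]
top = top_words_by_sentiment(tweets, labels, n=1)
print(top)
```

On the real data, the two lists would be replaced by `data_train_set['OriginalTweet']` and `data_train_set['Sentiment']`, and a dict of the resulting counts could be passed to `WordCloud.generate_from_frequencies` to draw one cloud per sentiment.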

##   **Stage 2**: Data Pre-Processing  (2 Points)

####  Clean and Transform the data into a specified format


In [None]:
# YOUR CODE HERE
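
As a starting point for this stage, the sketch below shows one common way to clean tweets and encode the labels. It is not a reference solution: the cleaning rules (dropping URLs, mentions, digits, and punctuation), the helper names `clean_tweet` and `tokenize`, and the label order in `SENTIMENTS` are all assumptions to adapt as needed.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")

# Assumed label order (most negative to most positive); any fixed order works
# as long as the same mapping is reused when decoding predictions.
SENTIMENTS = ["Extremely Negative", "Negative", "Neutral", "Positive", "Extremely Positive"]
LABEL_TO_ID = {name: i for i, name in enumerate(SENTIMENTS)}


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @mentions, '#' signs, digits, and punctuation."""
    text = str(text).lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")          # keep the hashtag word, drop the symbol
    text = NON_ALPHA_RE.sub(" ", text)     # drop digits, punctuation, emoji
    return " ".join(text.split())          # collapse repeated whitespace


def tokenize(text):
    """Split a cleaned tweet into word tokens."""
    return clean_tweet(text).split()
```

For example, `clean_tweet("@MeNyrbie Check https://t.co/abc #COVID19 prices are UP!!")` returns `"check covid prices are up"`. Note that stripping digits turns "covid19" into "covid"; whether that is desirable depends on the embeddings used in Stage 3.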

##   **Stage 3**: Build the Word Embeddings using pretrained Word2Vec/GloVe (Text Representation) (1 Point)



In [None]:
# YOUR CODE HERE
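
One way to approach this stage is to build a vocabulary from the tokenized tweets and then fill an embedding matrix whose rows line up with the vocabulary ids. The sketch below illustrates the idea with a toy dictionary standing in for the pretrained vectors; the helper names, the `<pad>`/`<unk>` tokens, and the random initialisation for out-of-vocabulary words are assumptions, and real vectors would be loaded from a Word2Vec or GloVe file.

```python
from collections import Counter
import numpy as np

PAD, UNK = "<pad>", "<unk>"


def build_vocab(tokenized_tweets, min_freq=1):
    """Map each token seen at least min_freq times to an integer id."""
    counts = Counter(tok for tweet in tokenized_tweets for tok in tweet)
    vocab = {PAD: 0, UNK: 1}
    for tok, freq in counts.most_common():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab


def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Rows follow vocab ids; words missing from `pretrained` keep small random vectors."""
    rng = np.random.default_rng(seed)
    matrix = rng.normal(scale=0.1, size=(len(vocab), dim)).astype(np.float32)
    matrix[vocab[PAD]] = 0.0                      # padding contributes nothing
    for word, idx in vocab.items():
        if word in pretrained:
            matrix[idx] = pretrained[word]
    return matrix


# Toy stand-in for pretrained vectors (made up for illustration)
pretrained = {"panic": np.array([1.0, 0.0, 0.0]), "grocery": np.array([0.0, 1.0, 0.0])}
vocab = build_vocab([["panic", "buying"], ["grocery", "panic"]])
emb = build_embedding_matrix(vocab, pretrained, dim=3)
print(emb.shape)
```

The resulting matrix can be used to initialise the embedding layer of the model in the next stage.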

##   **Stage 4**: Build and Train the Deep Recurrent Model using PyTorch/Keras (4 Points)



In [None]:
# YOUR CODE HERE
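
A minimal PyTorch sketch of one possible architecture is given below: an embedding layer, a bidirectional LSTM, and a linear layer over the five sentiment classes. The class name `TweetClassifier`, the helper `train_step`, and all hyperparameters are illustrative assumptions rather than a prescribed solution, and the final lines only run a smoke test on random token ids.

```python
import torch
import torch.nn as nn


class TweetClassifier(nn.Module):
    """Embedding -> bidirectional LSTM -> linear layer over the sentiment classes."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=64, num_classes=5,
                 pretrained=None, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        if pretrained is not None:                  # optional (vocab_size, embed_dim) tensor
            self.embedding.weight.data.copy_(pretrained)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):                   # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)
        _, (h_n, _) = self.lstm(embedded)           # h_n: (2, batch, hidden_dim)
        final = torch.cat([h_n[0], h_n[1]], dim=1)  # forward and backward final states
        return self.fc(final)                       # raw logits, one per class


def train_step(model, batch_ids, batch_labels, optimizer, loss_fn):
    """Run one optimisation step and return the batch loss as a float."""
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(batch_ids), batch_labels)
    loss.backward()
    optimizer.step()
    return loss.item()


# Smoke test on random data (not real tweets)
torch.manual_seed(0)
model = TweetClassifier(vocab_size=50, embed_dim=8, hidden_dim=4)
ids = torch.randint(1, 50, (6, 10))
labels = torch.randint(0, 5, (6,))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = train_step(model, ids, labels, optimizer, nn.CrossEntropyLoss())
logits = model(ids)
print(logits.shape, loss)
```

With padded batches, the forward-direction final state also covers the padding steps; packing the sequences with `torch.nn.utils.rnn.pack_padded_sequence` is one way to avoid that.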

##   **Stage 5**: Evaluate the Model and get model predictions on the test dataset (2 Points)

* Upload the model predictions to Kaggle after mapping the sentiment column values from numerical back to categorical labels







### Instructions for preparing Kaggle competition predictions


* Get the predictions using the trained model and prepare a CSV file
    * The DeepNet model outputs a score for each class; take the class with the highest score as the prediction using `np.argmax`.

* The predictions (CSV) file should contain 2 columns, as in Sample_Submission.csv
  - The first column is the Test_Id, which is treated as the index
  - The second column is the prediction in decoded form (e.g. Positive, Negative, etc.).
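
The instructions above can be sketched as follows. This is an illustration under stated assumptions: the id-to-label order in `ID_TO_LABEL` must match whatever encoding was used for training, the header name of the second column is a guess to be checked against Sample_Submission.csv, and the helper names and toy scores are made up.

```python
import csv
import numpy as np

# Assumed id -> label order; it must match the encoding used during training.
ID_TO_LABEL = ["Extremely Negative", "Negative", "Neutral", "Positive", "Extremely Positive"]


def decode_predictions(scores):
    """Pick the highest-scoring class per row and map it back to its label."""
    class_ids = np.argmax(np.asarray(scores), axis=1)
    return [ID_TO_LABEL[i] for i in class_ids]


def write_submission(test_ids, labels, path="submission.csv"):
    """Write a two-column CSV: Test_Id, then the decoded sentiment."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Test_Id", "Sentiment"])   # second header name is an assumption
        writer.writerows(zip(test_ids, labels))


# Toy scores for three tweets (made up for illustration)
scores = [[0.1, 0.2, 0.1, 0.5, 0.1],
          [0.7, 0.1, 0.1, 0.05, 0.05],
          [0.1, 0.1, 0.6, 0.1, 0.1]]
decoded = decode_predictions(scores)
print(decoded)
```

Calling `write_submission(range(len(decoded)), decoded)` would then produce a `submission.csv` ready to compare against the sample file before uploading.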