<a href="https://colab.research.google.com/github/mehulbhardwaj/DeepLearningIISc/blob/main/Copy_of_Cohort_6_Team_5_M4_Mini_Hackathon_To_Classify_Coronavirus_Tweets_During_Covid_19.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Programme in Deep Learning (Foundations and Applications)
## A Program by IISc and TalentSprint

### Mini Project Notebook: To perform text classification of coronavirus tweets during the peak COVID-19 period using LSTMs/RNNs/CNNs/BERT.


## Learning Objectives

At the end of the mini-hackathon, you will be able to:

* preprocess the tweet text
* represent the text/words using pretrained word embeddings (Word2Vec/GloVe)
* build a deep neural network (RNN, LSTM, GRU, CNN, Bidirectional LSTM/GRU, BERT) to classify the tweets


### Introduction

First, we need to understand why sentiment analysis is needed for social media.

People from all around the world have been using social media more than ever. Sentiment analysis on social media data helps to understand the wider public opinion about certain topics such as movies, events, politics, sports, and more and gain valuable insights from this social data. Sentiment analysis has some powerful applications. Nowadays it is also used by some businesses to do market research and understand the customer’s experiences for their products or services.

An interesting question that may arise with this problem statement is: why perform sentiment analysis on COVID-19 tweets? What about coronavirus tweets could be positive? You may have heard of sentiment analysis on movie or book reviews, but what is the purpose of exploring and analyzing this type of data?

The use of social media for communication during times of crisis has increased remarkably in recent years. As mentioned above, analyzing social media data is important because it helps us understand public sentiment. During the coronavirus pandemic, many people took to social media to express their anger, grief, or sadness, while some also spread happiness and positivity. People also used social media to ask their network for help related to vaccines or hospitals during this hard time. Many issues related to the pandemic could be addressed if experts took this social data into account. That is why analyzing this type of data is important for understanding the overall issues people faced.



## Dataset

The given challenge is to build a multiclass classification model to predict the sentiment of Covid-19 tweets. The tweets have been pulled from Twitter and manual tagging has been done. We are given information like Location, Tweet At, Original Tweet, and Sentiment.

The training dataset consists of 36000 tweets and the testing dataset consists of 8955 tweets. There are 5 sentiments namely ‘Positive’, ‘Extremely Positive’, ‘Negative’, ‘Extremely Negative’, and ‘Neutral’ in the sentiment column.

## Description

This dataset has the following information about the user who tweeted:

1. **UserName:** Twitter handle
2. **ScreenName:** a personal identifier on Twitter and is separate from the username
3. **Location:** where in the world the person tweets from
4. **TweetAt:** date of the tweet posted (DD-MM-YYYY)
5. **OriginalTweet:** the tweet itself
6. **Sentiment:** sentiment value
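
Because `TweetAt` is stored as text in DD-MM-YYYY order, it usually needs an explicit format when it is converted to a date. The following is a minimal sketch on made-up rows (the sample values and the `TweetAt_parsed` column name are assumptions for illustration, not part of the dataset):

```python
import pandas as pd

# Toy rows in the DD-MM-YYYY layout described above (values are illustrative).
sample = pd.DataFrame({"TweetAt": ["16-03-2020", "02-04-2020", "13-04-2020"]})

# An explicit format avoids a month-first guess for ambiguous dates such as 02-04-2020.
sample["TweetAt_parsed"] = pd.to_datetime(sample["TweetAt"], format="%d-%m-%Y")

# Month numbers extracted from the parsed dates.
months = sample["TweetAt_parsed"].dt.month.tolist()
print(months)
```

With the format given explicitly, `02-04-2020` is read as 2 April rather than 4 February.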



## Problem Statement

To build and implement a multiclass deep neural network model that classifies tweets into Positive, Extremely Positive, Negative, Extremely Negative, and Neutral sentiments.

## Grading = 10 Marks

Here is a handy link to Kaggle's competition documentation (https://www.kaggle.com/docs/competitions), which includes, among other things, instructions on submitting predictions (https://www.kaggle.com/docs/competitions#making-a-submission).

## Instructions for downloading train and test dataset from Kaggle API are as follows:

### 1. Create an API key in Kaggle.

To do this, go to the competition site on Kaggle at (https://www.kaggle.com/t/15cef0def403469ebbb5db1a67991873) and open your user settings page. Click Account.

* Click on your profile picture at the top-right corner of the page.

![alt text](https://i.imgur.com/kSLmEj2.png)

* In the popout menu, click the Settings option.

![alt text](https://i.imgur.com/tNi6yun.png)








### 2. Next, scroll down to the API access section and click generate to download an API key (kaggle.json).
![alt text](https://i.imgur.com/vRNBgrF.png)


### 3. Upload your kaggle.json file using the following snippet in a code cell:



In [None]:
from google.colab import files
files.upload()

In [None]:
#If successfully uploaded in the above step, the 'ls' command here should display the kaggle.json file.
%ls

### 4. Install the Kaggle API using the following command


In [None]:
!pip install -U -q kaggle
!pip install transformers==4.28.1
!pip install scikit-learn
!pip install imbalanced-learn
!pip install emoji

### 5. Move the kaggle.json file into ~/.kaggle, which is where the API client expects your token to be located:



In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

In [None]:
# Execute the following command to verify whether the kaggle.json is stored in the appropriate location: ~/.kaggle/kaggle.json
!ls ~/.kaggle

In [None]:
!ls -lart ~/.kaggle

In [None]:
!chmod 600 /root/.kaggle/kaggle.json # run this command to ensure your Kaggle API token is secure on colab

### 6. Now download the train and test data from Kaggle

**NOTE: If you get a '404 - Not Found' error after running the cell below, it is most likely that the user (whose kaggle.json is uploaded above) has not 'accepted' the rules of the competition and therefore has 'not joined' the competition.**

If you encounter a **401 - Unauthorized** error, download the latest **kaggle.json** by repeating steps 1 and 2.

In [None]:
#If you get a forbidden link, you have most likely not joined the competition.
!kaggle competitions download -c to-classify-coronavirus-tweets-during-covid-19

In [None]:
!unzip /content/to-classify-coronavirus-tweets-during-covid-19.zip

## YOUR CODING STARTS FROM HERE

## Import required packages

In [None]:


# Import required packages (duplicates removed, grouped by origin)
import html
import json  # For saving metrics
import re

import emoji
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import torch
from google.colab import drive
from imblearn.over_sampling import RandomOverSampler
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from scipy.sparse import hstack
from scipy.stats import chi2_contingency
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.class_weight import compute_class_weight
from torch.nn import CrossEntropyLoss
from torch.utils.data import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainerCallback, TrainingArguments
from wordcloud import STOPWORDS, WordCloud

##   **Stage 1**:  Data Loading and Exploratory Data Analysis (1 Point)

* Load the Dataset


In [None]:
# YOUR CODE HERE

# Load the training dataset
train_df = pd.read_csv('corona_nlp_train.csv/corona_nlp_train.csv', encoding='latin1')

# Load the testing dataset
test_df = pd.read_csv('corona_nlp_test.csv/corona_nlp_test.csv', encoding='latin1')

train_df.info()
test_df.info()



* Check for Missing Values

In [None]:
# YOUR CODE HERE
# Check for missing values in the training dataset
train_missing_values = train_df.isnull().sum()
print("Missing values in training dataset:\n", train_missing_values)

# Check for missing values in the testing dataset
test_missing_values = test_df.isnull().sum()
print("\nMissing values in testing dataset:\n", test_missing_values)


Fill the missing locations using the distribution in the rest of the data.

In [None]:
#check relation between location and sentiment

plt.figure(figsize=(12, 6))
sns.countplot(x='Location', hue='Sentiment', data=train_df, order=train_df['Location'].value_counts().index[:10])
plt.title("Sentiment Distribution by Location")
plt.xticks(rotation=45)
plt.show()



# Create a contingency table
contingency_table = pd.crosstab(train_df['Location'], train_df['Sentiment'])

# Perform chi-square test
chi2, p, dof, expected = chi2_contingency(contingency_table)

print(f"Chi-Square Statistic: {chi2}, P-value: {p}")

In [None]:
#check relation between TweetAt and sentiment
# Create a contingency table
contingency_table = pd.crosstab(train_df['TweetAt'], train_df['Sentiment'])

# Perform chi-square test
chi2, p, dof, expected = chi2_contingency(contingency_table)

print(f"Chi-Square Statistic: {chi2}, P-value: {p}")

In [None]:
#check relation between UserName and sentiment
# Create a contingency table
contingency_table = pd.crosstab(train_df['UserName'], train_df['Sentiment'])

# Perform chi-square test
chi2, p, dof, expected = chi2_contingency(contingency_table)

print(f"Chi-Square Statistic: {chi2}, P-value: {p}")
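
The chi-square p-values above can become extremely small on a dataset of this size, even for weak associations, so an effect size may be a useful companion. The sketch below computes Cramér's V on a toy table; the `cramers_v` helper and the toy rows are assumptions added for illustration. Note that an ID-like column with one row per value yields V = 1 by construction, which says nothing about its predictive value.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V effect size: 0 means no association, 1 means perfect association."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim))) if min_dim > 0 else 0.0

# Toy data: every location has the same sentiment mix, so there is no association.
toy = pd.DataFrame({
    "Location": ["London", "London", "Delhi", "Delhi", "NYC", "NYC"],
    "Sentiment": ["Positive", "Negative", "Positive", "Negative", "Positive", "Negative"],
})
v_independent = cramers_v(toy["Location"], toy["Sentiment"])

# A column compared with itself is perfectly associated.
v_identical = cramers_v(toy["Sentiment"], toy["Sentiment"])
print(v_independent, v_identical)
```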

In [None]:
# Use Sentiment-Weighted Probabilistic Assignment for filling missing Location values.

# Step 1: Calculate sentiment-specific probability distribution of locations
location_probs = (
    train_df.groupby('Sentiment')['Location']
    .value_counts(normalize=True)
    .rename("probability")
    .reset_index()
)

# Step 2: Calculate the overall probability distribution of locations
overall_location_probs = (
    train_df['Location'].value_counts(normalize=True)
    .rename("probability")
    .reset_index()
    .rename(columns={'index': 'Location'})
)

# Step 3: Define a function to sample a location by sentiment-specific probabilities
def sample_location_by_sentiment(sentiment):
    # Get probabilities for the given sentiment
    sentiment_probs = location_probs[location_probs['Sentiment'] == sentiment]
    locations = sentiment_probs['Location'].values
    probabilities = sentiment_probs['probability'].values

    # If no probabilities are available (edge case), fallback to overall probabilities
    if len(locations) == 0:
        return np.random.choice(overall_location_probs['Location'], p=overall_location_probs['probability'])

    # Sample a location weighted by probabilities
    return np.random.choice(locations, p=probabilities)

# Step 4: Fill missing locations in train_df using sentiment-specific probabilities
train_df['Location'] = train_df.apply(
    lambda row: sample_location_by_sentiment(row['Sentiment']) if pd.isna(row['Location']) else row['Location'],
    axis=1
)

# Step 5: Fill missing locations in test_df using overall probabilities
test_df['Location'] = test_df['Location'].fillna(
    pd.Series(np.random.choice(overall_location_probs['Location'],
                                p=overall_location_probs['probability'],
                                size=len(test_df)))
)

"""
# Calculate the probability distribution of locations
location_probs = train_df['Location'].value_counts(normalize=True)

# Get the locations and their probabilities
locations = location_probs.index
probs = location_probs.values

# Fill missing values in training and testing data
train_df['Location'] = train_df['Location'].fillna(pd.Series(np.random.choice(locations,
                                                                            p=probs,
                                                                            size=len(train_df))))
test_df['Location'] = test_df['Location'].fillna(pd.Series(np.random.choice(locations,
                                                                           p=probs,
                                                                           size=len(test_df))))
"""

In [None]:
"""
# Calculate the probability distribution of locations
location_probs = train_df['Location'].value_counts(normalize=True)

# Get the locations and their probabilities
locations = location_probs.index
probs = location_probs.values

# Fill missing values in training and testing data
num_missing_train = train_df['Location'].isnull().sum()
random_locations_train = pd.Series(np.random.choice(locations, p=probs, size=num_missing_train))
train_df['Location'] = train_df['Location'].fillna(random_locations_train)

num_missing_test = test_df['Location'].isnull().sum()
random_locations_test = pd.Series(np.random.choice(locations, p=probs, size=num_missing_test))
test_df['Location'] = test_df['Location'].fillna(random_locations_test)
"""
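
The sampling-based fill draws from NumPy's global random state, so the imputed locations change between runs. One possible variation, sketched here on toy data, samples only for the missing rows and uses a seeded generator; the `rng` seed and the toy frame are assumptions, not the notebook's actual data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the imputed values are repeatable

toy = pd.DataFrame({
    "Sentiment": ["Positive", "Positive", "Positive", "Negative", "Negative", "Negative"],
    "Location":  ["London",   "London",   None,       "Delhi",    None,       "Delhi"],
})

filled = toy.copy()
for sentiment, group in toy.groupby("Sentiment"):
    # Observed location frequencies within this sentiment (missing values are excluded).
    probs = group["Location"].value_counts(normalize=True)
    missing_idx = group.index[group["Location"].isna()]
    if len(missing_idx) and len(probs):
        # Draw one location per missing row, weighted by the observed frequencies.
        filled.loc[missing_idx, "Location"] = rng.choice(
            probs.index.to_numpy(), size=len(missing_idx), p=probs.to_numpy()
        )

print(filled)
```

Each sentiment in the toy frame has a single observed location, so the outcome is deterministic here; on real data the draws would vary with the seed.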


* Check for Missing Values

In [None]:

# YOUR CODE HERE
# Check for missing values in the training dataset
train_missing_values = train_df.isnull().sum()
print("Missing values in training dataset:\n", train_missing_values)

# Check for missing values in the testing dataset
test_missing_values = test_df.isnull().sum()
print("\nMissing values in testing dataset:\n", test_missing_values)


* Visualize the sentiment column values


In [None]:
"""
# YOUR CODE HERE
sentiment_counts = train_df['Sentiment'].value_counts()

plt.figure(figsize=(10, 6))
sns.barplot(x=sentiment_counts.index, y=sentiment_counts.values)
plt.title('Distribution of Sentiments')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.tight_layout()  # Adjust layout to prevent labels from overlapping
plt.show()
"""

* Visualize top 10 Countries that had the highest tweets using countplot (Tweet count vs Location)


In [None]:
"""
# Get the top 10 locations
top_10_locations = train_df['Location'].value_counts().nlargest(10).index

# Filter the dataframe for the top 10 locations
filtered_df = train_df[train_df['Location'].isin(top_10_locations)]

# Create the countplot
plt.figure(figsize=(12, 6))  # Adjust figure size as needed
sns.countplot(data=filtered_df, x='Location', order=top_10_locations)
plt.title('Top 10 Countries with Highest Tweet Counts')
plt.xlabel('Location')
plt.ylabel('Tweet Count')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()
"""

* Plotting Pie Chart for the Sentiments in percentage


In [None]:
"""
# Calculate sentiment percentages
sentiment_counts = train_df['Sentiment'].value_counts()
sentiment_percentages = sentiment_counts / sentiment_counts.sum() * 100

# Create the pie chart
plt.figure(figsize=(8, 8))  # Adjust figure size as needed
plt.pie(sentiment_percentages, labels=sentiment_percentages.index, autopct='%1.1f%%', startangle=90)
plt.title('Distribution of Sentiments in Percentage')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
"""

* WordCloud for the Tweets/Text

    * Visualize the most commonly used words in each sentiment using wordcloud
    * Refer to the following [link](https://medium.com/analytics-vidhya/word-cloud-a-text-visualization-tool-fb7348fbf502) for Word Cloud: A Text Visualization tool




In [None]:
# Combine all tweets for each sentiment
sentiment_texts = {sentiment: ' '.join(train_df[train_df['Sentiment'] == sentiment]['OriginalTweet'].astype(str).tolist())
                  for sentiment in train_df['Sentiment'].unique()}

# Add custom words to ignore
#custom_stopwords = ['COVID', 'https', 't', 'co', 'coronavirus','COVID19']
#STOPWORDS.update(custom_stopwords)

# Create and display word clouds for each sentiment
for sentiment, text in sentiment_texts.items():
    # Create a word cloud object
    wordcloud = WordCloud(width=800, height=400, background_color='white',
                          stopwords=STOPWORDS, min_font_size=10).generate(text)

    # Display the generated image
    plt.figure(figsize=(8, 8), facecolor=None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.title(f"Word Cloud for {sentiment} Sentiment")
    plt.show()



##   **Stage 2**: Data Pre-Processing  (2 Points)

####  Clean and Transform the data into a specified format


In [None]:
#from sklearn.feature_extraction.text import TfidfVectorizer
#
#
#def define_common_words_tfidf(df, text_column='OriginalTweet', sentiment_column='Sentiment', threshold=0.1):
"""
    Identifies common words with low TF-IDF scores across documents for different sentiment classes.

    Args:
        df (pd.DataFrame): The input DataFrame containing text and sentiment columns.
        text_column (str): The name of the column containing the text data (default: 'OriginalTweet').
        sentiment_column (str): The name of the column containing the sentiment labels (default: 'Sentiment').
        threshold (float): Threshold for identifying low-importance words (default: 0.1).

    Returns:
        list: A list of common words to exclude for training.
"""
    # Combine all text by sentiment category
#    df_grouped = df.groupby(sentiment_column)[text_column].apply(lambda x: ' '.join(x)).reset_index()

    # Vectorize using TF-IDF
#    vectorizer = TfidfVectorizer(stop_words='english')
#    tfidf_matrix = vectorizer.fit_transform(df_grouped[text_column])
#    words = vectorizer.get_feature_names_out()
#    avg_tfidf_scores = tfidf_matrix.mean(axis=0).A1  # Average TF-IDF score for each word across sentiments

    # Identify low-importance words based on the threshold
#    common_words = [word for word, score in zip(words, avg_tfidf_scores) if score < 0.1]

#    return common_words

#common_words = define_common_words(train_df)  # Identify common words to remove from data
#print(common_words)


In [None]:
# Download necessary NLTK resources if you haven't already
nltk.download('stopwords')
nltk.download('wordnet')

#common_words = ['COVID', 'https', 't', 'co', 'coronavirus', 'COVID19']

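def split_hashtag(hashtag):
    # Insert a space at each lowercase/digit-to-uppercase boundary, e.g. "StayHome" -> "Stay Home";
    # hashtags without such a boundary (e.g. "coronavirus", "COVID19") are returned unchanged
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', ' ', hashtag)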

def process_hashtags(text):
    """
    Processes hashtags by splitting camel case and converting to lowercase.
    """
    return re.sub(r'#(\w+)', lambda match: split_hashtag(match.group(1)).lower(), text)

# Define emoticon dictionary
emoticon_dict = {
    ":)": "positive_smile",
    ":-)": "positive_smile",
    ":(": "negative_sad",
    ":-(": "negative_sad",
    ";)": "positive_wink",
    ";-)": "positive_wink",
}

def convert_emojis(text):
    """
    Converts emojis to their descriptive text.
    """
    return emoji.demojize(text)

def convert_emoticons(text):
    """
    Converts emoticons to sentiment tokens.
    """
    for emoticon, sentiment in emoticon_dict.items():
        text = text.replace(emoticon, sentiment)
    return text


def clean_text(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)

    # Remove mentions
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)

    # Process hashtags
    text = process_hashtags(text)

    # Handle HTML ampersands
    text = html.unescape(text)

    # Convert emojis and emoticons
    text = convert_emojis(text)
    text = convert_emoticons(text)

    # Remove non-ASCII characters (but keep emojis processed above)
    text = re.sub(r'[^\x00-\x7F]+', '', text)

    # Replace multiple spaces with a single space
    text = re.sub(r'\s+', ' ', text).strip()

    # Remove non-alphanumeric characters except spaces
    text = re.sub(r'[^\w\s]', '', text)

    # Function to remove duplicate words
    text = re.sub(r'\b(\w+)\s+\1\b', r'\1', text)

    # Convert to lowercase
    text = text.lower()

    #remove all 1 char words
    text = " ".join(re.findall(r'\b\w{2,}\b', text))

    # Replace COVID-related terms with "coronavirus"
    patterns = r"covid[_\-]19|covid19|covid 19|covid"
    text = re.sub(patterns, "coronavirus", text, flags=re.IGNORECASE)

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    text = " ".join(word for word in text.split() if word not in stop_words)

    return text

# Apply the cleaning function to the 'OriginalTweet' column in both DataFrames
train_df['CleanedTweet'] = train_df['OriginalTweet'].apply(clean_text)
test_df['CleanedTweet'] = test_df['OriginalTweet'].apply(clean_text)
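
Regex-based cleaning is easy to get subtly wrong, so it can help to check a few steps in isolation on a made-up tweet. The sketch below reuses the URL, mention, punctuation, and repeated-word patterns from `clean_text`; it skips hashtag splitting, emoji handling, and the other steps, and the sample tweet is an assumption.

```python
import re

raw = "@WHO Stay safe safe everyone! Details: https://t.co/abc123 #StayHome"

no_urls = re.sub(r"http\S+", "", raw)                  # drop links
no_mentions = re.sub(r"@[A-Za-z0-9_]+", "", no_urls)   # drop @handles
no_punct = re.sub(r"[^\w\s]", "", no_mentions)         # drop punctuation (underscores survive)
deduped = re.sub(r"\b(\w+)\s+\1\b", r"\1", no_punct)   # collapse an immediately repeated word
tidy = re.sub(r"\s+", " ", deduped).strip().lower()    # normalise whitespace and case

print(tidy)
```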

In [None]:
train_df['CleanedTweet'].head()

In [None]:
""" commented out as it requires higher RAM to run
#Given high correlation between location and sentiment, and 18% dataset missing location values, we will train a model to fill missing location values

# Split dataset into known and missing locations
known_locations = train_df[train_df['Location'].notna()]
missing_locations = train_df[train_df['Location'].isna()]


# Vectorize the text data
vectorizer = CountVectorizer(max_features=2000)  # Limit to top 2000 words
X_known = vectorizer.fit_transform(known_locations['CleanedTweet'])
y_known = known_locations['Location']

# Add sentiment as an additional feature
sentiment_mapping = {'Extremely Negative': 0, 'Negative': 1, 'Neutral': 2, 'Positive': 3, 'Extremely Positive': 4}
sentiment_known = known_locations['Sentiment'].map(sentiment_mapping).values.reshape(-1, 1)
X_known = np.hstack((X_known.toarray(), sentiment_known))  # Combine text and sentiment

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_known, y_known, test_size=0.2, random_state=42)

# Train a Random Forest model
rf_model = RandomForestClassifier(random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)

# Evaluate the model


y_pred = rf_model.predict(X_val)
print("Accuracy on Validation Set:", accuracy_score(y_val, y_pred))
print("Classification Report:\n", classification_report(y_val, y_pred))
"""



In [None]:
""" commented out as it requires higher RAM to run
# Predict missing location values for the training dataset

# Vectorize the text data for missing locations
X_missing = vectorizer.transform(missing_locations['CleanedTweet'])

# Add sentiment as an additional feature
sentiment_missing = missing_locations['Sentiment'].map(sentiment_mapping).values.reshape(-1, 1)
X_missing = np.hstack((X_missing.toarray(), sentiment_missing))  # Combine text and sentiment

# Predict missing locations
predicted_locations = rf_model.predict(X_missing)

# Fill missing values in the original dataframe
train_df.loc[train_df['Location'].isna(), 'Location'] = predicted_locations

# Check the distribution of location after filling missing values
print(train_df['Location'].value_counts(normalize=True))
"""

In [None]:
""" commented out as it requires higher RAM to run
# Predict missing location values in test_df

# Filter rows with missing locations in test_df
missing_test_locations = test_df[test_df['Location'].isna()]

# Vectorize the text data for missing locations in test_df
X_missing_test = vectorizer.transform(missing_test_locations['CleanedTweet'])

# Add sentiment as an additional feature
sentiment_missing_test = missing_test_locations['Sentiment'].map(sentiment_mapping).values.reshape(-1, 1)
X_missing_test = np.hstack((X_missing_test.toarray(), sentiment_missing_test))  # Combine text and sentiment

# Predict missing locations
predicted_test_locations = rf_model.predict(X_missing_test)

# Fill missing values in the original test_df
test_df.loc[test_df['Location'].isna(), 'Location'] = predicted_test_locations

# Check the distribution of location after filling missing values in test_df
print(test_df['Location'].value_counts(normalize=True))
"""

In [None]:
# Step 2: Combine text, TweetAt and location into a single feature
train_df['text_with_location'] = train_df['CleanedTweet'] + " Location: " + train_df['Location'] + " TweetAt: " + train_df['TweetAt']
test_df['text_with_location'] = test_df['CleanedTweet'] + " Location: " + test_df['Location'] + " TweetAt: " + test_df['TweetAt']


In [None]:
# Visualise cleaned word clouds
# Combine all tweets for each sentiment
sentiment_texts = {sentiment: ' '.join(train_df[train_df['Sentiment'] == sentiment]['CleanedTweet'].astype(str).tolist())
                  for sentiment in train_df['Sentiment'].unique()}

# Create and display word clouds for each sentiment
for sentiment, text in sentiment_texts.items():
    # Create a word cloud object
    wordcloud = WordCloud(width=800, height=400, background_color='white',
                          stopwords=STOPWORDS, min_font_size=10).generate(text)

    # Display the generated image
    plt.figure(figsize=(8, 8), facecolor=None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.title(f"Word Cloud for {sentiment} Sentiment")
    plt.show()

In [None]:
from google.colab import drive
drive.mount('/content/drive')

file_path = "/content/drive/MyDrive/Cohort6_Team5_Hackathon1_CoronavirusClassification_Models/cleaned_data_train.csv"
train_df.to_csv(file_path, index=False)

file_path = "/content/drive/MyDrive/Cohort6_Team5_Hackathon1_CoronavirusClassification_Models/cleaned_data_test.csv"
test_df.to_csv(file_path, index=False)
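
One thing to watch when these CSV files are read back later: a tweet that became an empty string after cleaning is written as an empty field and, by default, reloaded as `NaN`. The sketch below demonstrates this with an in-memory buffer instead of Google Drive; the toy frame is an assumption.

```python
import io
import pandas as pd

toy = pd.DataFrame({"CleanedTweet": ["stay home", ""], "Sentiment": ["Positive", "Neutral"]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
default_reload = pd.read_csv(buffer)                      # empty string comes back as NaN
buffer.seek(0)
safe_reload = pd.read_csv(buffer, keep_default_na=False)  # empty string is preserved

print(default_reload["CleanedTweet"].isna().tolist())
print(safe_reload["CleanedTweet"].tolist())
```

`keep_default_na=False` applies to every column, so calling `.fillna("")` on the text column after loading may be the safer choice when other columns should keep their missing-value handling.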


##   **Stage 3**: Build the Word Embeddings using pretrained Word2vec/Glove (Text Representation) (1 Point)



In [None]:
!wget -qq https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/glove.6B.zip
!unzip glove.6B.zip

GloVe_Dict_200d = {}
# Loading the 200-dimensional vector of the model
with open("glove.6B.200d.txt", 'r', encoding='utf-8') as f:
  for line in f:
      values = line.split()
      word = values[0]
      vector = np.asarray(values[1:], "float32")
      GloVe_Dict_200d[word] = vector


def glove_embeddings(text, dim):
    # Split the cleaned tweet into tokens; iterating over the string itself would yield characters
    tokens = text.split()
    if len(tokens) < 1:
        return np.zeros(dim)
    # Look up each token; out-of-vocabulary tokens get a random vector
    vectorized = [GloVe_Dict_200d[word] if word in GloVe_Dict_200d else np.random.randn(dim) for word in tokens]
    # Return the average over tokens
    return np.sum(vectorized, axis=0) / len(tokens)


def get_glove_embeddings(text, dimension):
    embeddings = text.apply(lambda x: glove_embeddings(x, dimension))
    return list(embeddings)

glove_embeddings_train = get_glove_embeddings(train_df['CleanedTweet'],dimension=200)
glove_embeddings_test = get_glove_embeddings(test_df['CleanedTweet'],dimension=200)
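
To make the averaging step easier to inspect, here is a small sketch with a hypothetical three-dimensional vocabulary standing in for the GloVe dictionary. Unlike the cell above, this variant skips out-of-vocabulary tokens instead of giving them random vectors, which keeps the result deterministic; `toy_vectors` and `average_embedding` are illustrative names.

```python
import numpy as np

# Hypothetical 3-dimensional stand-in for the GloVe dictionary loaded above.
toy_vectors = {
    "stay": np.array([1.0, 0.0, 0.0], dtype="float32"),
    "home": np.array([0.0, 1.0, 0.0], dtype="float32"),
}

def average_embedding(text, vectors, dim):
    """Mean of the vectors of known tokens; zeros if no token is in the vocabulary."""
    known = [vectors[token] for token in text.split() if token in vectors]
    if not known:
        return np.zeros(dim, dtype="float32")
    return np.mean(known, axis=0)

sentence_vector = average_embedding("stay home zzzunknown", toy_vectors, dim=3)
empty_vector = average_embedding("", toy_vectors, dim=3)
print(sentence_vector, empty_vector)
```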


In [None]:
glove_embeddings_train[0]

##   **Stage 4**: Build and Train the Deep Recurrent Model using Pytorch/Keras (4 Points)



In [None]:
import numpy as np
import pandas as pd

def word_statistics(dataframe, text_column):
  """Calculates word statistics for sentences in a dataframe's text column.

  Args:
    dataframe: The input Pandas DataFrame.
    text_column: The name of the column containing text data.

  Returns:
    A tuple containing the mean, 75th percentile, and 99th percentile
    of the number of words in sentences.
  """
  word_counts = []
  for text in dataframe[text_column]:
      sentences = text.split(".")  # Split the text into sentences
      word_counts.extend([len(sentence.split()) for sentence in sentences if sentence])  # Count words in each sentence

  mean_words = np.mean(word_counts)
  percentile_75 = np.percentile(word_counts, 75)
  percentile_99 = np.percentile(word_counts, 99)
  return mean_words, percentile_75, percentile_99


# Calculate and print the word statistics for the training set
mean_words, percentile_75, percentile_99 = word_statistics(train_df, "text_with_location")

print(f"Mean words per sentence in training set: {mean_words}")
print(f"75th percentile of word count in training set: {percentile_75}")
print(f"99th percentile of word count in training set: {percentile_99}")
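
Length statistics like these are typically used to choose a truncation length for the tokenizer. The sketch below shows the idea on three made-up sentences, counting whitespace-separated words; this is only an approximation, since a subword tokenizer generally produces more tokens than words, and the `headroom` value is an assumption.

```python
import numpy as np

texts = [
    "stay home",
    "wash your hands often",
    "panic buying at the supermarket again today",
]

# Whitespace word counts as a rough proxy for sequence length.
token_counts = [len(t.split()) for t in texts]

# Round the 95th percentile up so that about 95% of texts fit without truncation.
p95 = int(np.ceil(np.percentile(token_counts, 95)))

# Leave room for special tokens added by the tokenizer (assumed headroom).
headroom = 2
suggested_max_length = p95 + headroom
print(token_counts, p95, suggested_max_length)
```

Counting `len(tokenizer(text)["input_ids"])` for each text would give a more faithful estimate once a tokenizer is loaded.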

In [None]:
 # Step 1: Mount Google Drive and configure base model

drive.mount('/content/drive')

# Set the path in Google Drive where you want to save the best model
best_model_path = "/content/drive/MyDrive/Cohort6_Team5_Hackathon1_CoronavirusClassification_Models/BERTweet_best_model"


# Use BERTweet model and tokenizer
model_name = "vinai/bertweet-base"  # BERTweet model for first training
tokenizer = AutoTokenizer.from_pretrained(model_name, normalization=True)

 # Step 2: Pick base model or your own pre trained model
# for first training
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

#for finetuning pre-trained model
#model = AutoModelForSequenceClassification.from_pretrained(best_model_path)

# Callback to print the validation F1 score at log events and the accuracy after each evaluation.
# It relies on the global names `trainer`, `val_dataset`, and `val_labels`.
# Note: on_log fires only every `logging_steps` steps, so the modulo-100 check passes only on those steps.
class MetricsCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        # Print F1 score when a log event falls on a multiple of 100 steps
        if state.global_step % 100 == 0 and state.global_step > 0:
            predictions = trainer.predict(val_dataset)
            preds = predictions.predictions.argmax(-1)
            f1 = f1_score(val_labels, preds, average='weighted')
            print(f"Step {state.global_step}: Validation F1 Score = {f1:.4f}")

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        # Calculate accuracy after each evaluation run
        predictions = trainer.predict(val_dataset)
        preds = predictions.predictions.argmax(-1)
        acc = accuracy_score(val_labels, preds)
        print(f"Epoch {state.epoch}: Validation Accuracy = {acc:.4f}")


# Add a SaveMetricsCallback that logs metrics (e.g., F1 score, accuracy) to a JSON file (metrics.json) on Google Drive after every evaluation.
class SaveMetricsCallback(TrainerCallback):
    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        log_path = "/content/drive/MyDrive/Cohort6_Team5_Hackathon1_CoronavirusClassification_Models/metrics.json"
        metrics = metrics or {}
        with open(log_path, 'a') as f:
            f.write(json.dumps({'epoch': state.epoch, 'metrics': metrics}) + '\n')



# Custom dataset class
class TweetDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = torch.tensor(labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

# Prepare data
label_map = {'Extremely Negative': 0, 'Negative': 1, 'Neutral': 2, 'Positive': 3, 'Extremely Positive': 4}
train_texts = train_df['text_with_location'].tolist()
train_labels = train_df['Sentiment'].map(label_map).tolist()
# Use the same input format for the test set as for training
test_texts = test_df['text_with_location'].tolist()
# Test labels are available only if the test file contains a Sentiment column
test_labels = test_df['Sentiment'].map(label_map).tolist() if 'Sentiment' in test_df.columns else None

# Step 3: Balance sentiment classes
#ros = RandomOverSampler(random_state=42)
#train_texts_resampled, train_labels_resampled = ros.fit_resample(np.array(train_texts).reshape(-1, 1), train_labels)
#train_texts_resampled = train_texts_resampled.flatten().tolist()  # Flatten back to list format

class_weights = compute_class_weight(
    class_weight="balanced",
    classes=np.unique(train_labels),
    y=train_labels
)
class_weights = torch.tensor(class_weights, dtype=torch.float)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
class_weights = class_weights.to(device)

# Modify the model's loss function
model.config.problem_type = "single_label_classification"
model.config.num_labels = len(np.unique(train_labels))
loss_fn = CrossEntropyLoss(weight=class_weights)

# Subclass Trainer to use the custom loss function
class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels").to(device)  # Ensure labels are on the same device
        outputs = model(**inputs)
        logits = outputs.logits
        loss = loss_fn(logits, labels)  # Apply custom loss function
        return (loss, outputs) if return_outputs else loss

# Step 4: Set a reduced max_length based on max sentence length in data
max_length = 60

# Encode the test tweets using the BERTweet tokenizer with the new max_length
# (training tweets are encoded after the train/validation split below)
test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=max_length)

# Split training data for validation
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_texts, train_labels, test_size=0.2, random_state=42       #removed train_texts_resampled, train_labels_resampled
)

train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=max_length)

# Create dataset objects
train_dataset = TweetDataset(train_encodings, train_labels)
val_dataset = TweetDataset(val_encodings, val_labels)
test_dataset = TweetDataset(test_encodings, [0] * len(test_df))  # Dummy labels for test


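# Update dropout probabilities to reduce overfitting (the default is 0.1).
# Changing model.config after the model is built does not alter existing layers,
# so the probability is also set directly on the model's Dropout modules.
model.config.hidden_dropout_prob = 0.2
model.config.attention_probs_dropout_prob = 0.2
for module in model.modules():
    if isinstance(module, torch.nn.Dropout):
        module.p = 0.2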


# Step 5: Training arguments with best model saving
training_args = TrainingArguments(
    output_dir='./checkpoints',
    num_train_epochs=10,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=500,
    weight_decay=0.01,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    learning_rate=1e-5,
    logging_dir='./logs',
    logging_steps=500,
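    evaluation_strategy="steps",  # evaluate every eval_steps (named eval_strategy in newer transformers releases)
    eval_steps=500,
    save_strategy="steps",  # save a checkpoint every save_steps
    load_best_model_at_end=True,
    save_total_limit=3,       # keep at most 3 checkpoints (the best one is always retained)
    report_to="none",
    save_steps=500,
    # Early stopping is not a TrainingArguments option; use
    # EarlyStoppingCallback(early_stopping_patience=3) via trainer.add_callback instead.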
)

def compute_metrics(pred):
    """
    Compute evaluation metrics for the model.

    Args:
        pred: An EvalPrediction object from the Trainer containing logits and labels.

    Returns:
        dict: A dictionary of computed metrics.
    """
    logits, labels = pred
    predictions = logits.argmax(axis=-1)  # Get the predicted class by taking the argmax

    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions, average='weighted'),
        "precision": precision_score(labels, predictions, average='weighted'),
        "recall": recall_score(labels, predictions, average='weighted')
    }


trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics  # Report accuracy, F1, precision, and recall at each evaluation
)
"""
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    loss_function=custom_loss
)
"""
# Add the F1 score and accuracy callback
trainer.add_callback(SaveMetricsCallback())
trainer.add_callback(MetricsCallback())
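
The `class_weight="balanced"` setting used for the loss above assigns each class the weight `n_samples / (n_classes * class_count)`, so rarer sentiment classes contribute more to the loss. The following is a minimal pure-Python sketch of that formula on made-up labels; the function name `balanced_class_weights` and the toy label list are illustrative and not part of the notebook.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * counts[cls]) for cls in sorted(counts)}


# Toy labels (hypothetical): class 0 is three times as frequent as class 1
toy_labels = [0, 0, 0, 1]
weights = balanced_class_weights(toy_labels)
print(weights)
```

In this toy case the rare class receives a weight three times larger than the frequent one, which is the effect the weighted `CrossEntropyLoss` relies on.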




In [None]:
# Step 6: Train the model
trainer.train()

# Step 7: Save the best model to Google Drive
trainer.save_model(best_model_path)  # Save the best model checkpoint to Google Drive



##   **Stage 5**: Evaluate the Model and get model predictions on the test dataset (2 Points)

* Upload the model predictions to Kaggle by mapping the sentiment column values from numerical labels back to categorical names
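
One way to keep the numerical-to-categorical mapping consistent with the encoding used for training is to derive it by inverting the label-to-id dictionary rather than typing it twice. A small sketch, where `label2id`, `id2label`, and the sample predictions are illustrative names and values:

```python
# Label encoding used when preparing the training labels
label2id = {
    'Extremely Negative': 0,
    'Negative': 1,
    'Neutral': 2,
    'Positive': 3,
    'Extremely Positive': 4,
}

# Invert the dictionary so predicted class ids can be decoded back to names
id2label = {idx: name for name, idx in label2id.items()}

# Hypothetical predicted class ids, e.g. the argmax of the model logits
sample_predictions = [4, 0, 2]
decoded = [id2label[idx] for idx in sample_predictions]
print(decoded)
```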







In [None]:
# Step 8: Evaluate the model
eval_results = trainer.evaluate()
print(f"Evaluation Results: {eval_results}")

# Step 9: Make predictions on the test set
predictions = trainer.predict(test_dataset)
predicted_labels = predictions.predictions.argmax(-1)

# Map predicted labels back to sentiment categories
sentiment_mapping = {0: 'Extremely Negative', 1: 'Negative', 2: 'Neutral', 3: 'Positive', 4: 'Extremely Positive'}
predicted_sentiments = [sentiment_mapping[label] for label in predicted_labels]


# Add predictions to the test dataframe
test_df['Predicted Sentiment'] = predicted_sentiments

# Save predictions to a CSV file
test_df[['Predicted Sentiment']].to_csv('./bertweet_predictions.csv', index=False)

In [None]:

# Classification report on the validation split. The test set has no ground-truth
# labels here, so the report is computed from validation predictions instead.
class_names = ['Extremely Negative', 'Negative', 'Neutral', 'Positive', 'Extremely Positive']
val_predictions = trainer.predict(val_dataset)
val_predicted_labels = val_predictions.predictions.argmax(-1)
report = classification_report(val_labels, val_predicted_labels, target_names=class_names)

In [None]:
print(report)

In [None]:


# Load the best saved model
best_model_path = "/content/drive/MyDrive/Cohort6_Team5_Hackathon1_CoronavirusClassification_Models/BERTweet_best_model"
model = AutoModelForSequenceClassification.from_pretrained(best_model_path)

# Generate test_id starting from 1
test_ids = np.arange(1, len(test_df) + 1)  # Create a sequence from 1 to the number of rows in test_df

# Map predicted labels to sentiment categories (same as before)
predicted_sentiments = [sentiment_mapping[label] for label in predicted_labels]

# Create submission file with generated 'test_id' and 'Sentiment' columns
submission_df = pd.DataFrame({'Test_Id': test_ids, 'Sentiment': predicted_sentiments})
submission_df.to_csv('submission.csv', index=False)

# Upload to Kaggle (using Kaggle API)
!kaggle competitions submit -c to-classify-coronavirus-tweets-during-covid-19 -f submission.csv -m "Predictions"

### Instructions for preparing Kaggle competition predictions


* Get the predictions using the trained model and prepare a CSV file
    * The deep network outputs a score for each class; take the class with the maximum score as the prediction using `np.argmax`.

* The predictions (CSV) file should contain 2 columns, as in Sample_Submission.csv
  - The first column is Test_Id, which is used as the index
  - The second column is the prediction in decoded form (e.g. Positive, Negative).
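
The steps above can be sketched as follows, assuming the model output is a NumPy array of per-class scores. The `logits` values are made up for illustration, and the CSV is written to an in-memory buffer instead of a file:

```python
import io

import numpy as np
import pandas as pd

sentiment_mapping = {0: 'Extremely Negative', 1: 'Negative', 2: 'Neutral',
                     3: 'Positive', 4: 'Extremely Positive'}

# Hypothetical per-class scores for three tweets (rows) and five classes (columns)
logits = np.array([
    [0.1, 0.2, 0.3, 2.5, 0.4],
    [3.0, 0.5, 0.1, 0.2, 0.1],
    [0.2, 0.1, 1.9, 0.3, 0.6],
])

# Pick the highest-scoring class per row, then decode it to the sentiment name
predicted_ids = np.argmax(logits, axis=1)
predicted_names = [sentiment_mapping[int(i)] for i in predicted_ids]

# Two columns: Test_Id (starting from 1) and the decoded Sentiment
submission = pd.DataFrame({
    'Test_Id': np.arange(1, len(predicted_names) + 1),
    'Sentiment': predicted_names,
})

buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Passing a file name such as `submission.csv` to `to_csv` in place of the buffer would produce the file to upload.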