# NLP Challenge Task

This task contains three problems, weighted as follows:

* Problem 1: 20%
* Problem 2: 30%
* Problem 3: 50%

## Problem 1 TF-IDF

Implement TF-IDF using Python, NumPy, Pandas, and whatever text-cleaning library you require.

TF-IDF is the product of two statistics: term frequency and inverse document frequency. There are various ways to determine the exact values of both statistics; you can use the following formulas.

### Term Frequency
$$tf_{t,d} = \log_{10}(count(t,d) +1)$$ 

* $tf_{t,d}$ is the log-scaled frequency of the term $t$ in the document $d$, where $count(t,d)$ is the number of times $t$ occurs in $d$

### Inverse Document Frequency
$$idf_t = \log_{10}(\frac{N}{df_t})$$

* $N$ is the total number of documents
* $df_t$ is the number of documents in which term $t$ occurs

### TF-IDF
$$tf\text{-}idf_{t,d} = tf_{t,d} \times idf_t $$
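
As a quick sanity check of the three formulas above, the sketch below evaluates them with only the standard library. The helper name `toy_tfidf` and the pre-tokenized toy corpus are illustrative assumptions, not part of the required interface.

```python
import math
from collections import Counter

def toy_tfidf(term, tokenized_docs):
    """Return the tf-idf score of `term` for every document (hypothetical helper)."""
    n_docs = len(tokenized_docs)
    # df_t: number of documents that contain the term at least once
    df = sum(1 for doc in tokenized_docs if term in doc)
    if df == 0:
        # idf is undefined for unseen terms; this sketch simply returns zeros
        return [0.0] * n_docs
    idf = math.log10(n_docs / df)
    # tf_{t,d} = log10(count(t, d) + 1)
    return [math.log10(Counter(doc)[term] + 1) * idf for doc in tokenized_docs]

toy_docs = [["cats", "purr"], ["cats", "sleep"], ["dogs", "bark"]]
toy_scores = toy_tfidf("cats", toy_docs)
print([round(s, 4) for s in toy_scores])
```

Note that a term occurring in every document has $idf_t = \log_{10}(1) = 0$, so its tf-idf is zero regardless of how often it appears.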

### What is expected? 
Your implementation should include the following two functions:
 * `compute_tfidf_weights(train_docs)`
 * `word_tfidf_vector(word, tf_df, idf_df)`


In [1]:
!pip install numpy pandas nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk import word_tokenize
from collections import Counter
import math

def compute_tfidf_weights(train_docs):
  """
  Computes the TF-IDF weights for a list of documents.

  Args:
    train_docs: A list of documents, where each document is a string.

  Returns:
    A tuple of (tf_df, idf_df), where tf_df is a DataFrame of term frequencies and idf_df is a DataFrame of inverse document frequencies.
  """

  # Consider English stopwords
  stop_words = set(stopwords.words('english'))
  
  # Tokenize documents, remove stopwords and count term frequencies
  tf_values = []
  df_counter = Counter()

  for doc in train_docs:
    tokens = [word for word in word_tokenize(doc.lower()) if word.isalpha() and word not in stop_words]
    tf_values.append(Counter(tokens))
    df_counter.update(set(tokens))

  # print("TF values: ", tf_values)
  # print("DF counter: ", df_counter)

  # Convert term frequencies to tf scores
  tf_df = pd.DataFrame(tf_values).fillna(0)
  # np.log10 is applied element-wise; DataFrame.applymap is deprecated in recent pandas
  tf_df = np.log10(1 + tf_df)

  # print("TF DataFrame: ")
  # print(tf_df)

  # Calculate idf scores
  N = len(train_docs)
  idf_scores = {word: math.log10(N / df_counter[word]) for word in df_counter.keys()}
  # idf_scores = {word: math.log10(N / (1 + df_counter[word])) for word in df_counter.keys()}
  # idf_scores = {word: math.log10((1 + N) / (1 + df_counter[word])) for word in df_counter.keys()}

  idf_df = pd.DataFrame([idf_scores])

  # print("IDF DataFrame: ")
  # print(idf_df)

  return tf_df, idf_df

def word_tfidf_vector(word, tf_df, idf_df):
  """
  Calculates the TF-IDF vector for a word.

  Args:
    word: A string.
    tf_df: A DataFrame of term frequencies.
    idf_df: A DataFrame of inverse document frequencies.

  Returns:
    A NumPy array of shape (N,), where N is the number of documents, or
    None if the word is not in the vocabulary.
  """

  if word not in tf_df.columns:
    print(f"The word '{word}' does not exist in the documents.")
    return None
  
  # print("TF for the word:")
  # print(tf_df[word])
  
  # print("IDF for the word:")
  # print(idf_df[word].values[0])

  tf_idf_value = tf_df[word] * idf_df[word].values[0]
  # print("TF-IDF values:")
  # print(tf_idf_value)

  return np.array(tf_idf_value)

In [3]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [4]:
# Test

# Load the training data
train_docs = [
  "This is a document about cats.",
  "This is another document about cats.",
  "This is a document about dogs.",
]

# Compute the TF-IDF weights
tf_df, idf_df = compute_tfidf_weights(train_docs)

# Calculate the TF-IDF vector for the word "cats"
cat_tfidf_vector = word_tfidf_vector("cats", tf_df, idf_df)

# Print the TF-IDF vector
print(cat_tfidf_vector)

[0.05300875 0.05300875 0.        ]
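
The printed vector can be checked by hand against the formulas from the problem statement. After stop-word removal, "cats" occurs once in each of the first two documents and never in the third, so $N = 3$ and $df_{cats} = 2$:

$$tf_{cats,d_1} = \log_{10}(1 + 1) \approx 0.3010$$

$$idf_{cats} = \log_{10}(\frac{3}{2}) \approx 0.1761$$

$$tf\text{-}idf_{cats,d_1} \approx 0.3010 \times 0.1761 \approx 0.0530$$

For the third document $count(cats, d_3) = 0$, so $tf_{cats,d_3} = \log_{10}(1) = 0$ and the product is $0$.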


In [5]:
from google.colab import drive
drive.mount('/content/drive')

# Set the working directory for the tasks
import os
SKELETON_DIR = '/content/drive/MyDrive/NLP'
os.chdir(SKELETON_DIR)

Mounted at /content/drive


In [6]:
import pandas as pd
import numpy as np

# Load the CSV file
# df = pd.read_csv('/content/drive/MyDrive/NLP/Dataset/Corona_NLP_train.csv', encoding='latin1')

p = 0.01
df = pd.read_csv('/content/drive/MyDrive/NLP/Dataset/Corona_NLP_train.csv', encoding='latin1', skiprows=lambda i: i>0 and np.random.random() > p)


# Preprocess the text data (this is a simple example, you might need to do more cleaning depending on your data)
df['OriginalTweet'] = df['OriginalTweet'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)  # remove URLs

# Calculate TF-IDF weights
tf_df, idf_df = compute_tfidf_weights(df['OriginalTweet'].tolist())

# Calculate TF-IDF vector for a specific word
tf_idf_vector_word = word_tfidf_vector("coronavirus", tf_df, idf_df) 
print(tf_idf_vector_word)




[0.         0.11996046 0.11996046 0.         0.         0.11996046
 0.19013283 0.         0.         0.         0.11996046 0.
 0.11996046 0.         0.11996046 0.         0.         0.
 0.11996046 0.11996046 0.         0.         0.11996046 0.
 0.         0.         0.11996046 0.         0.11996046 0.
 0.         0.11996046 0.11996046 0.11996046 0.         0.11996046
 0.         0.11996046 0.         0.         0.         0.11996046
 0.         0.         0.11996046 0.11996046 0.         0.11996046
 0.11996046 0.         0.11996046 0.11996046 0.         0.
 0.         0.         0.11996046 0.11996046 0.11996046 0.
 0.         0.11996046 0.11996046 0.11996046 0.11996046 0.
 0.         0.         0.11996046 0.11996046 0.         0.
 0.11996046 0.         0.         0.         0.         0.
 0.11996046 0.11996046 0.         0.         0.11996046 0.
 0.         0.11996046 0.         0.         0.11996046 0.
 0.11996046 0.11996046 0.11996046 0.11996046 0.         0.11996046
 0.         0.11
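
The cell above only strips URLs, and its comment notes that more cleaning may be needed. One possible extension, sketched with the standard `re` module, also drops @mentions and the `#` sign. The function name `clean_tweet` and the sample tweet are made up for illustration, and which tokens are worth removing depends on the data.

```python
import re

# Raw strings avoid invalid-escape warnings; the dot after "www" is escaped on purpose.
URL_RE = re.compile(r"http\S+|www\.\S+", flags=re.IGNORECASE)
MENTION_RE = re.compile(r"@\w+")

def clean_tweet(text):
    """Strip URLs and @mentions, drop '#' but keep the tag word, collapse whitespace."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = text.replace("#", "")
    return " ".join(text.split())

sample = "Stay safe @user! Info: https://example.com/covid #coronavirus"
cleaned = clean_tweet(sample)
print(cleaned)
```

Keeping the hashtag word (here "coronavirus") means it still contributes to the term counts.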

## Problem 2 POS for classification

Robots and chat bots receive different commands to do certain tasks. 

Write a simple program that receives an interaction in the form of a sentence and returns:
* A tuple of (command, object) if the sentence is a command
* None if the sentence is not a command

To write this function, you can utilize a Part-of-speech tagger or named-entity recognizer from libraries like NLTK and Spacy.

Consider the following EXAMPLE sentences:

* Commands:
  * Grab the book
  * Fetch the ball
  * Open the jar
  * Can you hand this spoon to John?

* Not commands:
  * Hey, how is it going?
  * How is your day today?
  * Do you like the weather?

This list is not exhaustive; your function should be able to handle more cases.

Expected outcome:

1. A function that performs the task
2. If your function has limitations, highlight those limitations with examples. You are not required to submit a different file. Write your answer in a 'Text' block in this notebook. 

In [7]:
!pip install spacy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [8]:
import spacy
nlp = spacy.load('en_core_web_sm')

def extract_command(sentence):
    """
    Extract (command, object) pairs from a sentence using spaCy's dependency parse.

    Args:
    sentence (str): a string that may contain a command.

    Returns:
    list or None: a list of (verb lemma, object text) tuples, one for each direct or
    prepositional object whose head is a verb, or None if no such pair exists.
    """
    doc = nlp(sentence)
    command_object_pairs = []
    for token in doc:
        if token.dep_ in {"dobj", "pobj"} and token.head.pos_ == "VERB":
            command_object_pairs.append((token.head.lemma_, token.text))
    if command_object_pairs:
        return command_object_pairs
    else:
        return None


In [9]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def extract_command_nltk(sentence):
    """
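    Extract (command, object) pairs from a sentence using NLTK part-of-speech tags.

    Args:
    sentence (str): a string that may contain a command.

    Returns:
    list or None: a list of (verb, following token) tuples, one for each verb that is
    immediately followed by a determiner, or None if no such pair exists. Note that
    the second element is the determiner itself, not the noun it modifies.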
    """
    text = word_tokenize(sentence)
    pos_tags = nltk.pos_tag(text)
    
    command_object_pairs = []
    for i in range(len(pos_tags) - 1):
        if pos_tags[i][1].startswith('VB') and pos_tags[i+1][1] == 'DT': 
            command = pos_tags[i][0]
            obj = pos_tags[i+1][0]
            command_object_pairs.append((command, obj))
            
    if command_object_pairs:
        return command_object_pairs
    else:
        return None


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [10]:
sentences = ["Grab the book",
             "Fetch the ball",
             "Open the jar",
             "Can you hand this spoon to John?",
             "Hey, how is it going?",
             "How is your day today?",
             "Do you like the weather?"]

print('Spacy: ')
for sentence in sentences:
    result = extract_command(sentence)
    if result:
        for pair in result:
            print(f"Sentence: {sentence}\nCommand: {pair[0]}, Object: {pair[1]}\n")
    else:
        print(f"Sentence: {sentence}\nThis sentence does not contain a command.\n")

print('\n\n\n\nNLTK: ')
for sentence in sentences:
    result = extract_command_nltk(sentence)
    if result:
        for pair in result:
            print(f"Sentence: {sentence}\nCommand: {pair[0]}, Object: {pair[1]}\n")
    else:
        print(f"Sentence: {sentence}\nThis sentence does not contain a command.\n")


Spacy: 
Sentence: Grab the book
Command: grab, Object: book

Sentence: Fetch the ball
Command: fetch, Object: ball

Sentence: Open the jar
Command: open, Object: jar

Sentence: Can you hand this spoon to John?
Command: hand, Object: spoon

Sentence: Hey, how is it going?
This sentence does not contain a command.

Sentence: How is your day today?
This sentence does not contain a command.

Sentence: Do you like the weather?
Command: like, Object: weather





NLTK: 
Sentence: Grab the book
Command: Grab, Object: the

Sentence: Fetch the ball
Command: Fetch, Object: the

Sentence: Open the jar
Command: Open, Object: the

Sentence: Can you hand this spoon to John?
This sentence does not contain a command.

Sentence: Hey, how is it going?
This sentence does not contain a command.

Sentence: How is your day today?
This sentence does not contain a command.

Sentence: Do you like the weather?
This sentence does not contain a command.
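
In the output above, the spaCy version reports "Do you like the weather?" as a command, while the NLTK version returns the determiner "the" as the object and misses "Can you hand this spoon to John?". A minimal sketch of a stricter rule set follows. It assumes the sentence has already been tagged with Penn Treebank tags, which are supplied by hand here so that the rules can be read independently of any tagger; `command_from_tags` is a hypothetical helper, not part of NLTK or spaCy.

```python
def command_from_tags(tagged):
    """Return (command, object) for a tagged sentence, or None (hypothetical helper).

    `tagged` is a list of (token, Penn Treebank tag) pairs.
    """
    if not tagged:
        return None
    words = [w.lower() for w, _ in tagged]
    tags = [t for _, t in tagged]

    if tags[0] == "VB":
        verb_idx = 0  # plain imperative: "Grab the book"
    elif (len(tagged) >= 3 and tags[0] == "MD"
          and words[1] == "you" and tags[2] == "VB"):
        verb_idx = 2  # polite request: "Can you hand ..."
    else:
        return None   # anything else is treated as a non-command

    # The object is the first noun after the verb, which skips determiners.
    for word, tag in tagged[verb_idx + 1:]:
        if tag.startswith("NN"):
            return (words[verb_idx], word)
    return None

grab = [("Grab", "VB"), ("the", "DT"), ("book", "NN")]
hand = [("Can", "MD"), ("you", "PRP"), ("hand", "VB"), ("this", "DT"),
        ("spoon", "NN"), ("to", "TO"), ("John", "NNP"), ("?", ".")]
like = [("Do", "VBP"), ("you", "PRP"), ("like", "VB"), ("the", "DT"),
        ("weather", "NN"), ("?", ".")]
print(command_from_tags(grab), command_from_tags(hand), command_from_tags(like))
```

These rules have their own limitations. "Would you like the book?" has the same modal, "you", base-verb shape as a polite request and would be returned as `("like", "book")`, a false positive. "Please open the jar" would be rejected because it does not start with a verb, a false negative. The result also depends on the tagger labelling a sentence-initial imperative as `VB`, which real taggers do not always do.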



## Problem 3 Word embedding as features for classification

### Task
Implement a sentiment classifier based on Twitter data to analyse the sentiments of COVID-19 tweets.  

Train and test multiple classification models, using sentence embeddings of the tweets as features.

Report the accuracy and F1 score (micro- and macro-averaged) for each classifier and discuss the differences.

### Dataset
The dataset is provided in the Dataset.zip file. You are required to use the original tweet text for this classification task.

### Tweet representation
After necessary pre-processing of the tweets, convert the words into their embeddings, then take the mean of all the word vectors in a tweet to end up with a single vector representing each tweet. The tweet vector is then used for sentiment classification.

In the process of finding the embeddings for each word, you can ignore out-of-vocabulary words.
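
The averaging step described above can be sketched without any embedding library. The two-dimensional vectors below are made up for illustration, `mean_embedding` is a hypothetical helper, and falling back to a zero vector when no token is in the vocabulary is one possible choice rather than a requirement.

```python
def mean_embedding(tokens, vectors, dim):
    """Average the vectors of in-vocabulary tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the i-th component of every known vector together
    return [sum(component) / len(known) for component in zip(*known)]

toy_vectors = {"stay": [1.0, 0.0], "home": [0.0, 1.0]}  # made-up 2-d embeddings
tweet_vec = mean_embedding(["stay", "home", "covid19"], toy_vectors, dim=2)
print(tweet_vec)
```

The out-of-vocabulary token "covid19" is skipped, so the mean is taken over two vectors, not three.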

### Embedding choice
For the embeddings, you can use GloVe via Gensim. Sample code is given below.

However, this is a suggested option. You can use any word embedding of your choice, for example, word2vec, TF-IDF, etc., from any library of your choice.   

### Classifier choice
You are required to implement the following classifiers: 
* One traditional classification model (not based on a neural network)
* One classifier based on a neural network

You can use PyTorch/TensorFlow/scikit-learn to implement your classifier. However, you are free to develop a classifier from scratch. 

### Your answer must include the following: 
1. Code for data loading, data pre-processing, training, and testing of the models.  
2. A discussion on the comparison between the classifiers based on classifier accuracy and F1 score.

### Suggestion (Optional)
Consider saving a cleaned up version of the dataset after creating the embeddings to a file which can be loaded and used for further experimentation. 

In [11]:
import gensim.downloader as api

list(api.info()['models'].keys())

['fasttext-wiki-news-subwords-300',
 'conceptnet-numberbatch-17-06-300',
 'word2vec-ruscorpora-300',
 'word2vec-google-news-300',
 'glove-wiki-gigaword-50',
 'glove-wiki-gigaword-100',
 'glove-wiki-gigaword-200',
 'glove-wiki-gigaword-300',
 'glove-twitter-25',
 'glove-twitter-50',
 'glove-twitter-100',
 'glove-twitter-200',
 '__testing_word2vec-matrix-synopsis']

In [12]:
model = api.load("glove-wiki-gigaword-50")
model.most_similar("pneumatic")



[('hydraulic', 0.8155338764190674),
 ('actuators', 0.7667093873023987),
 ('sprinkler', 0.7374147772789001),
 ('valve', 0.7271664142608643),
 ('actuation', 0.7141326665878296),
 ('hose', 0.7138993740081787),
 ('paddles', 0.7132106423377991),
 ('valves', 0.709661066532135),
 ('high-pressure', 0.7025710344314575),
 ('turntable', 0.7003635764122009)]

In [13]:
import gensim.downloader as api
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.neural_network import MLPClassifier
from nltk.tokenize import word_tokenize
import numpy as np
import nltk

nltk.download('punkt')

# Load data
p = 0.01
df_train = pd.read_csv('/content/drive/MyDrive/NLP/Dataset/Corona_NLP_train.csv', encoding='latin1') # skiprows=lambda i: i>0 and np.random.random() > p
tweets_train = df_train['OriginalTweet'].values
labels_train = df_train['Sentiment'].values

# print(tweets_train)
# print(labels_train)

df_test = pd.read_csv('/content/drive/MyDrive/NLP/Dataset/Corona_NLP_test.csv', encoding='latin1') # replace with your test dataset path
tweets_test = df_test['OriginalTweet'].values
labels_test = df_test['Sentiment'].values

# print(tweets_test)
# print(labels_test)

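# Pre-process tweets: remove URLs.
# NOTE: tweets_train and tweets_test were extracted above, before this step,
# so the embeddings below are still computed from the uncleaned text.
df_train['OriginalTweet'] = df_train['OriginalTweet'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)
df_test['OriginalTweet'] = df_test['OriginalTweet'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)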


# Convert tweets into embedding
model = api.load("glove-wiki-gigaword-50")

def tweet_to_embedding(tweet):
    embeddings = []
    for word in word_tokenize(tweet):
        if word in model:
            embeddings.append(model[word])
    if embeddings:
        return np.mean(embeddings, axis=0)
    else:
        return np.zeros(model.vector_size)

# Convert training tweets to embeddings
tweet_embeddings_train = np.array([tweet_to_embedding(tweet) for tweet in tweets_train])
tweet_embeddings_test = np.array([tweet_to_embedding(tweet) for tweet in tweets_test])


# Encode labels
le = LabelEncoder()
labels_train_encoded = le.fit_transform(labels_train)
labels_test_encoded = le.transform(labels_test)


# Traditional classifier (Random Forest)
clf_rf = RandomForestClassifier()
clf_rf.fit(tweet_embeddings_train, labels_train_encoded)

# Neural network based classifier (MLP)
clf_mlp = MLPClassifier()
clf_mlp.fit(tweet_embeddings_train, labels_train_encoded)


# Test classifiers
y_pred_rf = clf_rf.predict(tweet_embeddings_test)
y_pred_mlp = clf_mlp.predict(tweet_embeddings_test)

# Compare results
print("Random Forest Accuracy: ", accuracy_score(labels_test_encoded, y_pred_rf))
print("Random Forest F1 Score (micro): ", f1_score(labels_test_encoded, y_pred_rf, average='micro'))
print("Random Forest F1 Score (macro): ", f1_score(labels_test_encoded, y_pred_rf, average='macro'))

print("MLP Accuracy: ", accuracy_score(labels_test_encoded, y_pred_mlp))
print("MLP F1 Score (micro): ", f1_score(labels_test_encoded, y_pred_mlp, average='micro'))
print("MLP F1 Score (macro): ", f1_score(labels_test_encoded, y_pred_mlp, average='macro'))


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Random Forest Accuracy:  0.3417588204318062
Random Forest F1 Score (micro):  0.3417588204318062
Random Forest F1 Score (macro):  0.33127373269275534
MLP Accuracy:  0.3796735123749342
MLP F1 Score (micro):  0.3796735123749342
MLP F1 Score (macro):  0.38438828560258903
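
In the results above, accuracy and micro-averaged F1 are identical for both classifiers. This is expected: when every sample has exactly one label, each misclassification counts as one false positive and one false negative, so micro-averaged precision, recall, and F1 all reduce to accuracy. The sketch below reproduces this on made-up labels; `f1_scores` is a hypothetical helper written from the standard definitions, not the scikit-learn implementation.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over classes; macro: average the per-class F1 values.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

y_true = ["pos", "pos", "pos", "neg", "neu"]  # made-up labels
y_pred = ["pos", "pos", "neg", "neg", "pos"]
micro, macro = f1_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(micro, accuracy, round(macro, 4))
```

Macro-averaged F1 weights every class equally, so it is the more informative second number when comparing the classifiers. In the reported run the MLP scores about 0.38 on both accuracy and macro F1, against about 0.34 and 0.33 for the Random Forest.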
