# Assignment 2 - CT5120

### Instructions:
- Complete all the tasks below and upload your submission as a Python notebook on Blackboard with the filename “`StudentID_Lastname.ipynb`” before **23:59** on **November 25, 2022**.
- This is an individual assignment; you **must not** work with other students to complete this assessment.
- The assignment is worth $50$ marks and constitutes 19% of the final grade. The breakdown of the marking scheme for each task is as follows:

| Task | Marks for write-up | Marks for code | Total Marks |
| :--- | :----------------- | :------------- | :---------- |
| 1    |                  5 |              5 |          10 |
| 2    |                  - |             10 |          10 |
| 3    |                  5 |              5 |          10 |
| 4    |                  5 |              5 |          10 |
| 5    |                  5 |              5 |          10 |



---

This assignment involves tasks for feature engineering, training and evaluating a classifier for suggestion detection. You will work with the data from SemEval-2019 Task 9 subtask A to classify whether a piece of text contains a suggestion or not. 


Download train.csv, test_seen.csv and test_unseen.csv from [GitHub](https://github.com/sharduls007/Assignment_2_CT5120) or run the code cell below to fetch the data as comma-separated values (CSV) files. Each file starts with a header row; train.csv then has 5,440 rows and test_seen.csv has 1,360 rows, each spread across 3 columns of data. Each row of data contains a unique id, a piece of text and a label assigned by an annotator. A label of $1$ indicates that the given text contains a suggestion while a label of $0$ indicates that the text does not contain a suggestion.

You can find more details about the dataset in Sections 1, 2, 3 and 4 of [SemEval-2019 Task 9: Suggestion Mining from Online Reviews and Forums](https://aclanthology.org/S19-2151/).

We will use test_seen.csv to benchmark our model, which is why it includes labels. test_unseen.csv, on the other hand, is used for the [Kaggle](https://www.kaggle.com/competitions/nlp2022ct5120suggestionmining/overview) competition.


In [1]:
!curl "https://raw.githubusercontent.com/sharduls007/Assignment_2_CT5120/master/train.csv" > train.csv
!curl "https://raw.githubusercontent.com/sharduls007/Assignment_2_CT5120/master/test_seen.csv" > test.csv
!curl "https://raw.githubusercontent.com/sharduls007/Assignment_2_CT5120/master/test_unseen.csv" > test_unseen.csv



In [2]:
import numpy as np
import pandas as pd

# Read the CSV file.
train_df = pd.read_csv('train.csv', 
                 names=['id', 'text', 'label'], header=0)

test_df = pd.read_csv('test.csv', 
                 names=['id', 'text', 'label'], header=0)

# Store the texts and the labels as two parallel lists
# (one list of texts and one list of labels per split).
train_texts, train_labels = train_df["text"].to_list(), train_df["label"].to_list() 
test_texts, test_labels = test_df["text"].to_list(), test_df["label"].to_list() 

# Check that training set and test set are of the right size.
assert len(test_texts) == len(test_labels) == 1360
assert len(train_texts) == len(train_labels) == 5440

In [3]:
print(train_labels.count(0), train_labels.count(1))
print(test_labels.count(0), test_labels.count(1))

#imbalanced classification task

4106 1334
1027 333


---

## Task 1: Data Pre-processing (10 Marks)

Explain at least 3 steps that you will perform to preprocess the texts before training a classifier.



Edit this cell to write your answer below the line in no more than 300 words.

---

Suggestion detection is a binary classification problem: we predict whether a sentence is a suggestion or not, so the context of the sentence is an important feature. To help the algorithm capture that context, we apply the following preprocessing steps. First, we strip the quotation marks at the beginning and end of each sentence. Other punctuation is kept so that specific terms are not broken up.

Next, we tokenize the sentences so that the later steps can work on individual tokens. We then remove stopwords, which removes noise that might hurt classification performance and lets the model focus on the content words of the sentence.

We also lemmatize the tokens to reduce them to their base form. Lemmatization is chosen over stemming because stemming simply chops off word endings and can produce non-words, whereas lemmatization relies on a vocabulary (WordNet) and, when given a part-of-speech tag, on the role of the word, so it returns valid base forms more accurately.

In addition, we perform feature enrichment by adding synonyms, which helps the algorithm disambiguate and emphasize the context of the sentence. Rather than nouns, we look up synonyms for adjectives and verbs, as those are more likely to signal a suggestion. The unique synonyms are retrieved from WordNet.

Finally, we detokenize the token lists back into sentences so that they can be passed to the feature extraction step.

---

In the code cell below, write an implementation of the steps you defined above. You are free to use a library such as `nltk` or `sklearn` for this task.

In [4]:
# your code goes here
import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer, TreebankWordTokenizer
from nltk.corpus import stopwords, wordnet as wn
from nltk.stem import WordNetLemmatizer
from nltk.wsd import lesk

detokenizer = TreebankWordDetokenizer()
tokenizer = TreebankWordTokenizer()
    
def data_preprocessing(texts):
    processed = []

    english_stopwords = stopwords.words('english')
    adj_tags = ['JJ', 'JJR', 'JJS']
#     noun_tags = ['NN', 'NNP', 'NNS', 'NNPS']
#     adverb_tags = [] #['RB', 'RBR', 'RBS']
    verb_tags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']

    lemmatizer = WordNetLemmatizer()
    for i, sentence in enumerate(texts):
        # stripping extra "" from sentence + tokenization
        tokens = tokenizer.tokenize(sentence[1:-1])
        
        # to remove stopwords + lemmatize tokens
        lemmatized = [lemmatizer.lemmatize(token) for token in tokens if (token not in english_stopwords)]
        
        # get unique synonyms from wordnet for adjectives and verbs
        for word, pos_tag in nltk.pos_tag(lemmatized):
            if pos_tag in adj_tags + verb_tags:
                # without disambiguation
                syns = set(synset.lemma_names()[0].replace("_", " ") for synset in wn.synsets(word))
                lemmatized.extend(syns)
                
        # to form sentence with detokenizer
        processed.append(detokenizer.detokenize(lemmatized))
        
    return processed
    
processed_train = data_preprocessing(train_texts)
processed_test = data_preprocessing(test_texts)
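
One detail worth checking in the stopword step: NLTK's English stopword list is all lowercase, so an exact-match lookup leaves capitalised stopwords such as "The" in place. The sketch below is stdlib-only and illustrates the difference; the tiny `STOPWORDS` set and the `remove_stopwords` helper are hypothetical stand-ins, not part of the pipeline above.

```python
# Tiny hypothetical stand-in for nltk's stopwords.words('english'), which is lowercase.
STOPWORDS = {"the", "a", "is", "to", "it"}

def remove_stopwords(tokens, case_insensitive=False):
    """Drop stopwords; optionally lowercase each token before the lookup."""
    kept = []
    for token in tokens:
        key = token.lower() if case_insensitive else token
        if key not in STOPWORDS:
            kept.append(token)
    return kept

tokens = ["The", "app", "is", "slow", "to", "load"]
print(remove_stopwords(tokens))                         # "The" survives the exact-match lookup
print(remove_stopwords(tokens, case_insensitive=True))  # "The" is removed as well
```

Whether lowercasing before the lookup helps here is an open question; it would change the features and therefore the scores reported further down.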


---

## Task 2: Feature Engineering (I) - TF-IDF as features (10 Marks)

In the lectures we have seen that raw counts of words and `tf-idf` scores can be useful features for a classification task. Complete the following code cell to create a suggestion detector which uses `tf-idf` scores as features for a Naïve Bayes classifier.

After applying your preprocessing steps, use the training data to train the classifier and make predictions on the test set. You **must not** use the test set for training.

If everything is implemented correctly, then you should see a single floating point value between 0 and 1 at the end which denotes the accuracy of the classifier.
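
As a rough illustration of how `tf-idf` re-weights raw counts, here is a stdlib-only sketch on a toy corpus. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, which is the default documented for scikit-learn's `TfidfTransformer`. The function name `tfidf` and the toy sentences are made up for the example, and it splits on whitespace rather than using character n-grams to keep the output readable.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalised)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

docs = ["please add a dark mode", "the dark mode is great", "please fix the crash"]
for row in tfidf(docs):
    print({term: round(weight, 3) for term, weight in row.items()})
```

Terms that occur in fewer documents (such as "add") end up with a larger weight than terms shared across documents (such as "dark").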

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.naive_bayes import GaussianNB

# Calculate tf-idf scores for the words in the training set.
# ... your code goes here
import time
start_time = time.time()

## Raw counts with a character analyzer: character n-grams can span word boundaries,
## which makes the features more robust to misspellings and word derivations.
count_vect = CountVectorizer(analyzer='char', ngram_range=(5, 5))
X_train_counts = count_vect.fit_transform(processed_train) 

## to transform raw counts to tf-idf scores --> same as using TfidfVectorizer directly.
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

X_test_counts = count_vect.transform(processed_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)


# Train a Naïve Bayes classifier using the tf-idf scores for words as features.
# ... your code goes here
NB_classifier_tfidf = GaussianNB()
NB_classifier_tfidf.fit(X_train_tfidf.toarray(), train_labels)


# Predict on the test set.
predictions = []    # save your predictions on the test set into this list

# ... your code goes here
predictions = NB_classifier_tfidf.predict(X_test_tfidf.toarray())
print("--- Execution Time ---")
print("--- %.2f minutes ---" % ((time.time() - start_time)/60))

#################### DO NOT EDIT BELOW THIS LINE #################



def accuracy(labels, predictions):
  '''
  Calculate the accuracy score for a given set of predictions and labels.
  
  Args:
    labels (list): A list containing gold standard labels annotated as `0` and `1`.
    predictions (list): A list containing predictions annotated as `0` and `1`.

  Returns:
    float: A floating point value to score the predictions against the labels.
  '''

  assert len(labels) == len(predictions)
  
  correct = 0
  for label, prediction in zip(labels, predictions):
    if label == prediction:
      correct += 1 
  
  score = correct / len(labels)
  return score

# Calculate accuracy score for the classifier using tf-idf features.
accuracy(test_labels, predictions)

--- Execution Time ---
--- 0.43 minutes ---


0.7352941176470589

---

## Task 3: Evaluation Metrics (10 marks)

Why is accuracy not the best measure for evaluating a classifier? Describe an evaluation metric which might work better than accuracy for a classification task such as suggestion detection.

Edit this cell to write your answer below the line in no more than 150 words.

---

Accuracy is not suitable for measuring model performance on an imbalanced classification problem because it does not show how many samples of each class are classified correctly. On an imbalanced dataset, simply predicting the majority class every time already gives a high accuracy. Since most real-world classification tasks, including suggestion detection, have imbalanced data, accuracy is not an ideal measure.
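
To make the point concrete, the label counts printed earlier for the test set (1027 zeros, 333 ones) already fix the accuracy of a classifier that always predicts 0. The sketch below only reuses those counts; `plain_accuracy`, `baseline_labels` and `always_zero` are names introduced for this illustration.

```python
def plain_accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the label."""
    correct = sum(1 for label, pred in zip(labels, predictions) if label == pred)
    return correct / len(labels)

# Class counts taken from the test-set output earlier in the notebook.
baseline_labels = [0] * 1027 + [1] * 333
always_zero = [0] * len(baseline_labels)

# Number of actual suggestions that the baseline manages to flag.
suggestions_found = sum(
    1 for label, pred in zip(baseline_labels, always_zero) if label == 1 and pred == 1
)

print(round(plain_accuracy(baseline_labels, always_zero), 4))  # 0.7551
print(suggestions_found)                                       # 0
```

This baseline scores higher than the 0.7353 accuracy reported for the tf-idf classifier in Task 2, even though it never finds a single suggestion.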

Better options include precision and recall. Precision is the fraction of predicted positives that are truly positive (it penalizes false positives), while recall is the fraction of actual positives that the model finds (it penalizes false negatives). Rather than picking one over the other, we can combine them into a single score, the **F-score**, which is the harmonic mean of precision and recall. This avoids the problem that precision or recall alone does not tell the whole story.

We implement the F-score in the next section to evaluate model performance.


---

In the code cell below, write an implementation of the evaluation metric you defined above. Please write your own implementation from scratch.

In [6]:
def evaluate(labels, predictions):
  '''
  Calculate an evaluation score other than accuracy for a given set of predictions and labels.
  
  Args:
    labels (list): A list containing gold standard labels annotated as `0` and `1`.
    predictions (list): A list containing predictions annotated as `0` and `1`.

  Returns:
    float: A floating point value to score the predictions against the labels.
  '''

  # check that labels and predictions are of same length
  assert len(labels) == len(predictions)

  score = 0.0
  
  #################### EDIT BELOW THIS LINE #########################

  # your code goes here

  # compute confusion matrix
  classes = len(np.unique(labels)) # Number of classes 
  cm = np.zeros((classes, classes))

  for i in range(len(labels)):
      cm[labels[i]][predictions[i]] += 1
  
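  # f-score = 2 * (precision * recall) / (precision + recall) = tp / (tp + 1/2 * (fp + fn))
  # NOTE: rows of cm are gold labels and columns are predictions, so using cm[0][0] as tp
  # treats label 0 (no suggestion) as the positive class.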
  score = cm[0][0] / (cm[0][0] + (1/2*(cm[1][0]+cm[0][1])))

  #################### EDIT ABOVE THIS LINE #########################

  return score

# Calculate evaluation score based on the metric of your choice
# for the classifier trained in Task 2 using tf-idf features.
evaluate(test_labels, predictions)

0.8392857142857143
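
Note that `evaluate` above uses `cm[0][0]` as the true-positive count, so the reported score is the F-score with label 0 (no suggestion) as the positive class. A minimal sketch of a per-class F-score, assuming binary 0/1 labels, is given below; the helper name `f1_for_class` and the toy lists are hypothetical and only meant to show how the choice of positive class changes the result.

```python
def f1_for_class(labels, predictions, positive=1):
    """F1 = tp / (tp + (fp + fn) / 2), with `positive` as the class of interest."""
    tp = fp = fn = 0
    for label, pred in zip(labels, predictions):
        if pred == positive and label == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif label == positive:
            fn += 1
    denom = tp + 0.5 * (fp + fn)
    # Return 0.0 when the class is neither predicted nor present.
    return tp / denom if denom else 0.0

gold = [0, 0, 0, 0, 1, 1]
pred = [0, 0, 0, 1, 1, 0]
print(f1_for_class(gold, pred, positive=0))  # 0.75, non-suggestion class
print(f1_for_class(gold, pred, positive=1))  # 0.5, suggestion class
```

On imbalanced data the two values can differ considerably, so it is worth stating which class the reported F-score refers to.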

---

## Task 4: Feature Engineering (II) - Other features (10 Marks)

Describe features other than those defined in Task 2 which might improve the performance of your suggestion detector. If these features require any additional pre-processing steps, then define those steps as well.


Edit this cell to write your answer below the line in no more than 500 words.

---

In Task 2, we used a `bag-of-words` model to provide raw counts, with a character analyzer to cope with potential misspellings and word derivations, and then `tf-idf` scores to re-weight the counts so that terms that are more distinctive for a text receive a higher weight. However, tf-idf has some limitations. Although some local ordering information is kept by extracting n-grams, the representation still discards the overall structure of the sentence and the meaning of its context. It also scales poorly to a large vocabulary, since storing the vocabulary and the resulting feature vectors costs memory and computation time.

To reduce the memory cost seen in Task 2, we can use a `hashing vectorizer`. This vectorizer is stateless: it holds only its constructor parameters, so there is no need to call `fit` or to allocate memory for a vocabulary dictionary. As with bag-of-words, we can extract n-grams to deal with derived or misspelled words. The drawback is that hashing can cause collisions, where distinct tokens are mapped to the same feature index. This can be mitigated by choosing a larger *n_features*. The hashing vectorizer is well suited to large corpora.
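
The idea behind hashing can be sketched in a few lines of plain Python. This is not scikit-learn's implementation (which uses MurmurHash3 and, by default, signed values); it uses MD5 from the standard library purely for illustration, and the helper names `char_ngrams` and `hashed_counts` are made up for the example.

```python
import hashlib

def char_ngrams(text, n=5):
    """All overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def hashed_counts(text, n_features=16, n=5):
    """Count n-grams into a fixed-size vector; no vocabulary is stored."""
    vector = [0] * n_features
    for gram in char_ngrams(text, n):
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        # Map the hash to a bucket; distinct n-grams may share a bucket (collision).
        index = int.from_bytes(digest[:8], "big") % n_features
        vector[index] += 1
    return vector

small = hashed_counts("please add a dark mode", n_features=8)
large = hashed_counts("please add a dark mode", n_features=2**16)
print(small)                        # 18 n-grams squeezed into 8 buckets
print(sum(1 for v in large if v))   # number of buckets used with the larger table
```

With only 8 buckets the 18 n-grams must collide, while a table of size 2**16 spreads them over at least as many buckets, which is why a larger *n_features* reduces collisions.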

Beyond that, we can use word embeddings to capture semantic and contextual information that raw counts and tf-idf do not include. Embeddings also avoid sparse vectors by representing each word as a dense real-valued vector of fixed dimension, and they make it easy to compute the similarity between words in vector space. In this task, we use `GloVe`, `FastText` and `Word2Vec` embeddings. All three are pretrained models; the word vectors of each text are stacked vertically and mean-pooled to produce one vector per text as model input. The Gensim downloader is used to retrieve the pretrained models.
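
Mean pooling itself is simple enough to show with toy data. The sketch below is stdlib-only and makes two assumptions that differ from a real setup: the embeddings are hypothetical 3-dimensional vectors in a plain dict, and unknown tokens fall back to a zero vector (a random vector is another common choice). The name `mean_pool` is introduced just for this example.

```python
def mean_pool(tokens, vectors, oov_vector):
    """Average the vectors of `tokens`, using `oov_vector` for unknown tokens."""
    # An empty token list falls back to the out-of-vocabulary vector.
    if not tokens:
        return list(oov_vector)
    dim = len(oov_vector)
    total = [0.0] * dim
    for token in tokens:
        vec = vectors.get(token, oov_vector)
        for i in range(dim):
            total[i] += vec[i]
    return [value / len(tokens) for value in total]

# Hypothetical 3-dimensional embeddings; the pretrained models here use 300 dimensions.
toy_vectors = {"add": [1.0, 0.0, 0.0], "dark": [0.0, 1.0, 0.0], "mode": [0.0, 0.0, 1.0]}
oov = [0.0, 0.0, 0.0]

print(mean_pool(["add", "dark", "mode"], toy_vectors, oov))
print(mean_pool(["add", "darkk"], toy_vectors, oov))  # misspelled token uses the OOV vector
```

Every text ends up as one fixed-length vector regardless of how many tokens it has, which is what the classifier needs as input.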

GloVe is an unsupervised learning algorithm that generates vector representations for tokens. It is trained on aggregated global word-word co-occurrence statistics from a corpus, and the resulting vectors capture linear structure in a similar way to Word2Vec. We use the "glove-wiki-gigaword-300" pretrained model, which is trained on Wikipedia 2014 and Gigaword 5 (6B tokens, 400K vocab, uncased).

For FastText, we use the "fasttext-wiki-news-subwords-300" model, which provides 1 million word vectors trained on Wikipedia 2017, the UMBC webbase corpus and the statmt.org news dataset.

For Word2Vec, we use the "word2vec-google-news-300" model, which is trained on part of the Google News dataset (about 100 billion words) and contains vectors for 3 million words and phrases.


---

In the code cell below, write an implementation of the features (and any additional pre-preprocessing steps) you defined above. You are free to use a library such as `nltk` or `sklearn` for this task.

After creating your features, use the training data to train a Naïve Bayes classifier and use the test set to evaluate its performance using the metric defined in Task 3. You **must not** use the test set for training.

To make sure that your code doesn't take too long to run or use too much memory, you can consider a time limit of 3 minutes and a memory limit of 12GB for this task.

### Hashing Vectorizer

In [7]:
from sklearn.feature_extraction.text import HashingVectorizer
start_time = time.time()

hv = HashingVectorizer(n_features=2**16, analyzer='char', ngram_range=(5, 5))
X_train_hv = hv.transform(processed_train)
X_test_hv = hv.transform(processed_test)

NB_classifier_hash = GaussianNB()
NB_classifier_hash.fit(X_train_hv.toarray(), train_labels)

print("--- Execution Time ---")
print("--- %.2f minutes ---" % ((time.time() - start_time)/60))

predictions = NB_classifier_hash.predict(X_test_hv.toarray())
print("f-score: ", evaluate(test_labels, predictions))

--- Execution Time ---
--- 0.14 minutes ---
f-score:  0.8329621380846325


HashingVectorizer takes less time than CountVectorizer while achieving a similar f-score.

### Word Embeddings

In [8]:
def word_embedding(KeyedVectors, texts):
    # use glove / word2vec / fasttext
    all_text_vecs = []
    oov = np.random.rand(1,300) # random vector to represent out-of-vocab
    tokens = []
    # tokenize sentences again
    for text in texts:
        tokens.append(tokenizer.tokenize(text))
        
    for toks in tokens:
        text_vecs = []

        for tok in toks:
            if tok in KeyedVectors:
                text_vecs.append(KeyedVectors[tok])
            else:
                text_vecs.append(oov)

        all_text_vecs.append(text_vecs)

    all_pooled_vecs = []

    for text_vecs in all_text_vecs:
        # Vstack and take the mean of the text_vecs
        mean_pool = np.mean(np.vstack(text_vecs), axis=0)
        all_pooled_vecs.append(mean_pool)

    # Vstack to reshape
    embedded = np.vstack(all_pooled_vecs)
        
    return embedded

#### GloVe

In [9]:
# Create your features.
# ... your code goes here
start_time = time.time()

import gensim.downloader as api

print("Downloading GloVe embeddings. Please note this may take a few minutes.")
glove = api.load('glove-wiki-gigaword-300')
print("Finished downloading GloVe")

X_train = word_embedding(glove, processed_train)
X_test = word_embedding(glove, processed_test)

# Train a Naïve Bayes classifier using the features you defined.
# ... your code goes here
NB_classifier_glove = GaussianNB()
NB_classifier_glove.fit(X_train, train_labels)
print("--- Execution Time ---")
print("--- %.2f minutes ---" % ((time.time() - start_time)/60))

predictions = NB_classifier_glove.predict(X_test)

# Evaluate on the test set.
# ... your code goes here
print("f-score: ", evaluate(test_labels, predictions))



Downloading GloVe embeddings. Please note this may take a few minutes.
Finished downloading GloVe
--- Execution Time ---
--- 0.61 minutes ---
f-score:  0.550326797385621


#### FastText

In [10]:
start_time = time.time()

print("Downloading FastText embeddings. Please note this may take a few minutes.")
fasttext = api.load('fasttext-wiki-news-subwords-300')
print("Finished downloading FastText")

X_train = word_embedding(fasttext, processed_train)
X_test = word_embedding(fasttext, processed_test)

# Train a Naïve Bayes classifier using the features you defined.
# ... your code goes here
NB_classifier_fasttext = GaussianNB()
NB_classifier_fasttext.fit(X_train, train_labels)
print("--- Execution Time ---")
print("--- %.2f minutes ---" % ((time.time() - start_time)/60))

predictions = NB_classifier_fasttext.predict(X_test)

# Evaluate on the test set.
# ... your code goes here
print("f-score: ", evaluate(test_labels, predictions))

Downloading FastText embeddings. Please note this may take a few minutes.
Finished downloading FastText
--- Execution Time ---
--- 1.47 minutes ---
f-score:  0.3953318745441284


#### Word2Vec

In [11]:
start_time = time.time()

print("Downloading Word2Vec embeddings. Please note this may take a few minutes.")
word2vec = api.load('word2vec-google-news-300')
print("Finished downloading Word2Vec")

X_train = word_embedding(word2vec, processed_train)
X_test = word_embedding(word2vec, processed_test)

# Train a Naïve Bayes classifier using the features you defined.
# ... your code goes here
NB_classifier_word2vec = GaussianNB()
NB_classifier_word2vec.fit(X_train, train_labels)
print("--- Execution Time ---")
print("--- %.2f minutes ---" % ((time.time() - start_time)/60))

predictions = NB_classifier_word2vec.predict(X_test)

# Evaluate on the test set.
# ... your code goes here
print("f-score: ", evaluate(test_labels, predictions))

Downloading Word2Vec embeddings. Please note this may take a few minutes.
Finished downloading Word2Vec
--- Execution Time ---
--- 0.46 minutes ---
f-score:  0.5486610058785107


---

## Task 5: Kaggle Competition (10 marks)

Head over to https://www.kaggle.com/t/1f90b74da0b7484da9647638e22d1068  
Use the classifier above to predict the labels for test_unseen.csv from the competition page and upload the results to the leaderboard. The current baseline score is 0.36823. Make an improvement above the baseline. Please note that the evaluation metric for the competition is the f-score.

Read the competition page for more details.



In [12]:
# from google.colab import drive
# drive.mount('/content/drive')

In [14]:
# Preparing submission for Kaggle
StudentID = "22221970_Chin" # Please add your student id and lastname
test_unseen = pd.read_csv("test_unseen.csv", names=['id', 'text'], header=0)

# preparing the test_unseen dataset
processed_test_unseen = data_preprocessing(test_unseen['text'].to_list())

model_data = {
    'tf-idf': [NB_classifier_tfidf,  tfidf_transformer.transform(count_vect.transform(processed_test_unseen)).toarray()],
    'hash': [NB_classifier_hash, hv.transform(processed_test_unseen).toarray()],
    'glove': [NB_classifier_glove, word_embedding(glove, processed_test_unseen)],
    'fasttext': [NB_classifier_fasttext, word_embedding(fasttext, processed_test_unseen)],
    'word2vec': [NB_classifier_word2vec, word_embedding(word2vec, processed_test_unseen)]
}

# Here Id is unique identifier assigned to each test sample ranging from test_0 till test_1699
# Expected is a list of prediction made by your classifier
sub = {"Id": [f"test_{i}" for i in range(len(test_unseen))],
       "Expected": model_data['tf-idf'][0].predict(model_data['tf-idf'][1])}
# sub
sub_df = pd.DataFrame(sub)
# The code below will generate a StudentID.csv on your drive on the left hand side in the explorer
# Please upload the file as a submission on the competition page
# You can index your submission StudentID_Lastname_index.csv, where index is your number of submission
index = 5
sub_df.to_csv(f"{StudentID}_{index}.csv", sep=",", header=True, index=False)

Mention the approach that you have chosen briefly, and what is the mean average f-score that you have achieved? Did it improve above the chosen baseline model (0.36823)? Why or why not?

Edit this cell to write your answer below the line in no more than 500 words.

---

In conclusion, I have chosen the following steps for data preprocessing:
1. strip the first and last character of each text
2. tokenize the text
3. remove stopwords and lemmatize the remaining tokens
4. enrich the tokens with WordNet synonyms for adjectives and verbs
5. detokenize the tokens to prepare the text for feature extraction.

These steps aim to retain the information in the context while removing some of the noise. The synonym enrichment may help to emphasize the meaning of the context. As this is a suggestion detection task, I think it is more appropriate to look up synonyms of adjectives and verbs, which are likely to contribute more to recognizing an act of suggestion.

Furthermore, I performed feature extraction with n-gram bag-of-words counts and tf-idf scores. Training and prediction took about 0.43 minutes and reached an accuracy of 0.7353 and an f-score of 0.8393 on the test set. Since this is an imbalanced binary classification task, the f-score is the more suitable measure of model performance.

In the following task, I performed feature engineering with an n-gram hashing vectorizer. It is faster than the previous feature extraction method while obtaining a similar f-score of 0.83 (\~0.14 minutes) on the test set. I also tried extracting features with word embeddings, namely GloVe, FastText and Word2Vec. However, all of them achieved a much lower f-score than the vectorizers. GloVe and Word2Vec achieved similar f-scores of about 0.55 (\~0.61 minutes and \~0.46 minutes respectively), while FastText performed worst with an f-score of about 0.40 (\~1.47 minutes).

Therefore, I have chosen the n-gram hashing vectorizer as my approach. It achieved an f-score of 0.72 on the unseen test set, an improvement over the chosen baseline of 0.36823.

---