# Assignment 2 - CT5120

### Instructions:
- Complete all the tasks below and upload your submission as a Python notebook on Blackboard with the filename “`StudentID_Lastname.ipynb`” before **23:59** on **November 25, 2022**.
- This is an individual assignment, you **must not** work with other students to complete this assessment.
- The assignment is worth $50$ marks and constitutes 19% of the final grade. The breakdown of the marking scheme for each task is as follows:

| Task | Marks for write-up | Marks for code | Total Marks |
| :--- | :----------------- | :------------- | :---------- |
| 1    |                  5 |              5 |          10 |
| 2    |                  - |             10 |          10 |
| 3    |                  5 |              5 |          10 |
| 4    |                  5 |              5 |          10 |
| 5    |                  5 |              5 |          10 |



---

This assignment involves tasks for feature engineering, training and evaluating a classifier for suggestion detection. You will work with the data from SemEval-2019 Task 9 subtask A to classify whether a piece of text contains a suggestion or not. 


Download train.csv, test_seen.csv and test_unseen.csv from the [Github](https://github.com/sharduls007/Assignment_2_CT5120) or uncomment the code cell below to get the data as a comma-separated values (CSV) file. The CSV file contains a header row followed by 5,440 rows in train.csv and 1,360 rows in test_seen.csv spread across 3 columns of data. Each row of data contains a unique id, a piece of text and a label assigned by an annotator. A label of $1$ indicates that the given text contains a suggestion while a label of $0$ indicates that the text does not contain a suggestion.

You can find more details about the dataset in Sections 1, 2, 3 and 4 of [SemEval-2019 Task 9: Suggestion Mining from Online Reviews and Forums
](https://aclanthology.org/S19-2151/).

We will be using test_seen.csv for benchmarking our model, hence it has label. On the other hand, test_unseen is used for [Kaggle](https://www.kaggle.com/competitions/nlp2022ct5120suggestionmining/overview) competition.


In [1]:
# !curl "https://raw.githubusercontent.com/sharduls007/Assignment_2_CT5120/master/train.csv" > train.csv
# !curl "https://raw.githubusercontent.com/sharduls007/Assignment_2_CT5120/master/test_seen.csv" > test.csv
# !curl "https://raw.githubusercontent.com/sharduls007/Assignment_2_CT5120/master/test_unseen.csv" > test_unseen.csv

In [2]:
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet, stopwords
import string, nltk

# Read the CSV file.
train_df = pd.read_csv('train.csv', 
                 names=['id', 'text', 'label'], header=0)

test_df = pd.read_csv('test.csv', 
                 names=['id', 'text', 'label'], header=0)

# Store the data as a list of tuples where the first item is the text
# and the second item is the label.
train_texts, train_labels = train_df["text"].to_list(), train_df["label"].to_list() 
test_texts, test_labels = test_df["text"].to_list(), test_df["label"].to_list() 

# Check that training set and test set are of the right size.
assert len(test_texts) == len(test_labels) == 1360
assert len(train_texts) == len(train_labels) == 5440

---

## Task 1: Data Pre-processing (10 Marks)

Explain at least 3 steps that you will perform to preprocess the texts before training a classifier.



Edit this cell to write your answer below the line in no more than 300 words.

---
* **Lowercase:** Lowercase is better to use in some cases where some algorithms are designed to take lowercase letters/words. We use lower() on each sentence to convert all the words into lowercase.

* **Punctuation Removal:** Punctuation Removal is the process of removing punctuation marks (like "!",",","?"). The idea behind this is that we do not require punctuations to carry out some of the tasks in NLP, but it can be quite nice if you use them for Sentiment Analysis. In our case, suggestions wouldn't usually contain any punctuation and will have little impact if we remove them from the dataset. We use string.punctuation library which is inbuilt in Python to check for any type of punctuation marks and if we find it, we just discard it.

* **Tokeniation:** Tokenization is the process of splitting either a query, paragraph or a document into smallest unit i.e., a word. For e.g., the sentence "I am a human" can be tokenized as "I", "am", "a", "human" and this is a good practice of NLP structures as discrete elements can be processed by the NLP model and token occurances can be used as vector representing the document. The NLTK.tokenize package has a 'word_tokenize' method which automatically converts your text into a list of tokens.

* **Stopword Removal:** Stopwords are the words which have the highest frequency in a document e.g., I, You, The, An, etc. So they provide almost no information or the meaning in any sentence. These sentences are better off removed from the sentence and hence this is called Stopword Removal. In NLTK library, we have stopwords function where there is a dictionary of stopwords defined, which we can use to filter out stopwords in a given document.

* **Lemmatization:** Lemmatization is a preprocessing method where you bring a word to its base form, e.g., running to run, better to good. Lemmatization is still in its very early stages as not every word will be converted as there are many grammatical constraints, but its a good alternative to Stemming where you just cut the words and sometimes get non-sensical words. To perform lemmatization, import WordNetLemmatizer() class from nltk,stem package. To make sure Lemmatization works well, we have to define Parts-of=speech tagging to lemmatize words better, so we used wordnet from nltk.corpus package, and tag the words into either Adjective (ADJ), Nouns (NOUN), Verbs (VERB), or Adverbs (ADV) and then send both the word and the POS of the word to lemmatize them.

---

In the code cell below, write an implementation of the steps you defined above. You are free to use a library such as `nltk` or `sklearn` for this task.

In [3]:
stopword = stopwords.words('english')
lemma = WordNetLemmatizer()
# Ref: https://www.machinelearningplus.com/nlp/lemmatization-examples-python/
# This function tags the word with appropriate POS
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [4]:
# your code goes here
def preprocessing(text_list):
    new_list = []
    for i in range(len(text_list)):
        text_list[i] = text_list[i].lower()
        text_list[i] = "".join([word for word in text_list[i] if word not in string.punctuation])
        text_list[i] = word_tokenize(text_list[i])
        text_list[i] = [word for word in text_list[i] if word not in stopword]
        text_list[i] = [lemma.lemmatize(word, get_wordnet_pos(word)) for word in text_list[i]]
        text_list[i] = " ".join(text_list[i])
    return text_list

train_texts = preprocessing(train_texts)
test_texts = preprocessing(test_texts)

---

## Task 2: Feature Engineering (I) - TF-IDF as features (10 Marks)

In the lectures we have seen that raw counts of words and `tf-idf` scores can be useful features for a classification task. Complete the following code cell to create a suggestion detector which uses `tf-idf` scores as features for a Naïve Bayes classifier.

After applying your preprocessing steps, use the training data to train the classifier and make predictions on the test set. You **must not** use the test set for training.

If everything is implemented correctly, then you should see a single floating point value between 0 and 1 at the end which denotes the accuracy of the classifier.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.naive_bayes import GaussianNB

# Calculate tf-idf scores for the words in the training set.
# ... your code goes here
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1))
tfidf = TfidfTransformer()
v_train = vectorizer.fit_transform(train_texts)
tf_train = tfidf.fit_transform(v_train)
  

v_test = vectorizer.transform(test_texts)
tf_test = tfidf.transform(v_test) 

# Train a Naïve Bayes classifier using the tf-idf scores for words as features.
# ... your code goes here
NB_classifier = GaussianNB()
NB_classifier.fit(tf_train.toarray(), train_labels)

# Predict on the test set.
predictions = []    # save your predictions on the test set into this list

# ... your code goes here
p = NB_classifier.predict(tf_test.toarray())
predictions = p

#################### DO NOT EDIT BELOW THIS LINE #################


#################### DO NOT EDIT BELOW THIS LINE #################

def accuracy(labels, predictions):
  '''
  Calculate the accuracy score for a given set of predictions and labels.
  
  Args:
    labels (list): A list containing gold standard labels annotated as `0` and `1`.
    predictions (list): A list containing predictions annotated as `0` and `1`.

  Returns:
    float: A floating point value to score the predictions against the labels.
  '''

  assert len(labels) == len(predictions)
  
  correct = 0
  for label, prediction in zip(labels, predictions):
    if label == prediction:
      correct += 1 
  
  score = correct / len(labels)
  return score

# Calculate accuracy score for the classifier using tf-idf features.
accuracy(test_labels, predictions)

0.5294117647058824

---

## Task 3: Evaluation Metrics (10 marks)

Why is accuracy not the best measure for evaluating a classifier? Describe an evaluation metric which might work better than accuracy for a classification task such as suggestion detection.

Edit this cell to write your answer below the line in no more than 150 words.

---
Accuracy is not the best measure when the dataset is imbalanced, so in our case, too, it's not the best evaluation metric. Instead, we turn towards another metric to find out the reality of our model.

For any classification problems, a *confusion matrix* for each class to evaluate on the terms of **precision**, **recall** and **f1-score** is beneficial for finding out how the model works and its effectiveness.

* **Confusion Matrix:** It is a special type of error table which allows visualization of the performance of an algorithm, and is typically used in the Supervised Learning methods. It is a 2x2 matrix which shows numbers based on the values as the representation below in the table. Some of the concepts followed in the matrix are:
    1. True Positive(TP): The values which are predicted as positive and match when evaluated it with the original labels, are called True Positives.
    2. False Negative(FN): The values which are predicted as negative but the original labels are positive, are called False Negatives.
    3. False Positives(FP): The values which are predicted as positive, but the original labels state them negative, are called False Positives.
    4. True Negative(TN): The values which we predicted as negative and they matched the original labels, are called as True Negatives.

|    Total Population (P+N)   |     Positive(PP)     |    Negative(PN)      |
| :-------------------------- | -------------------: | -------------------: |
|       Positive (P)          |  True positive (TP)  |  False negative (FN) |
|       Negative (N)          |  False positive (FP) |  True negative (TN)  |

*Source: [Wikipedia -> Confusion Matrix](https://en.wikipedia.org/wiki/Confusion_matrix)*


* Precision: Precision is a count of how many are actually positive values with respect to the values which were predicted as positive. The equation can be drawn as:
$$ Precision = \frac {TP}{TP + FP} $$ 

* Recall: Recall is a count of how many were predicted correctly with respect to all the positive classes. The equation can be drawn as:
$$ Recall = \frac {TP}{TP + FN} $$ 

* F1-Score: It is difficult to compare models with low precision and high recall or vice versa. To make them comparable, we use the F-measure or the F1-score. It is the harmonic mean which uses both precision and recall to find out the comparable way to include both precision and recall. The equation can be drawn as:
$$ F1-score = \frac {2 * Precision * Recall }{Precision + Recall} $$ 

---



In the code cell below, write an implementation of the evaluation metric you defined above. Please write your own implementation from scratch.

In [6]:
def evaluate(labels, predictions):
  '''
  Calculate an evaluation score other than accuracy for a given set of predictions and labels.
  
  Args:
    labels (list): A list containing gold standard labels annotated as `0` and `1`.
    predictions (list): A list containing predictions annotated as `0` and `1`.

  Returns:
    float: A floating point value to score the predictions against the labels.
  '''

  # check that labels and predictions are of same length
  assert len(labels) == len(predictions)

  # score = 0.0
  
  #################### EDIT BELOW THIS LINE #########################

  # your code goes here
  classification_report = pd.DataFrame(index = ['class 0', 'class 1'], columns=['precision','recall','f1-score'])

  matrix0 =  np.zeros((2,2))
  matrix1 = np.zeros((2,2))

  for i in range(len(predictions)):
    if predictions[i] == 0 and labels[i] == 0:
      matrix0[0,0] += 1
    if predictions[i] == 1 and labels[i] == 0:
      matrix0[0,1] += 1
    if predictions[i] == 0 and labels[i] == 1:
      matrix0[1,0] += 1
    if predictions[i] == 1 and labels[i] == 1:
      matrix0[1,1] += 1

  precision = matrix0[0,0] / (matrix0[0,0] + matrix0[0,1])
  recall = matrix0[0,0] / (matrix0[0,0] + matrix0[1,0])
  f1= 2 * (precision * recall) / (precision + recall)

  list_row = [precision, recall, f1]
  classification_report.loc['class 0'] = list_row


  for i in range(len(predictions)):
      if predictions[i] == 1 and labels[i] == 1:
        matrix1[0,0] += 1
      if predictions[i] == 0 and labels[i] == 1:
        matrix1[0,1] += 1
      if predictions[i] == 1 and labels[i] == 0:
        matrix1[1,0] += 1
      if predictions[i] == 0 and labels[i] == 0:
        matrix1[1,1] += 1
  
  precision = matrix1[0,0] / (matrix1[0,0] + matrix1[0,1])
  recall = matrix1[0,0] / (matrix1[0,0] + matrix1[1,0])
  f1= 2 * (precision * recall) / (precision + recall)
  classification_report.loc['class 1'] = [precision, recall, f1]

  print(classification_report)

  #################### EDIT ABOVE THIS LINE #########################

# Calculate evaluation score based on the metric of your choice
# for the classifier trained in Task 2 using tf-idf features.
evaluate(test_labels, predictions)

        precision    recall  f1-score
class 0  0.486855  0.815661  0.609756
class 1  0.660661  0.294511  0.407407


---

## Task 4: Feature Engineering (II) - Other features (10 Marks)

Describe features other than those defined in Task 2 which might improve the performance of your suggestion detector. If these features require any additional pre-processing steps, then define those steps as well.


Edit this cell to write your answer below the line in no more than 500 words.

---
There are two things which I feel can change the accuracy and evaluation of the score.

1. **Bag of n-grams:** A bag of n-grams model provides not only with the words in the sentence, but also the n-grams of the sentences. There can be a different meaning to 2 or 3 words when we use together e.g., 'the dog barks' and 'the dog barked' has a difference of grammatical tense and changes the way humans interpret the sentence, so in similar way we can vectorize n-grams to get good TF-IDF values to process our model well. In our case we have used *trigrams* to fit our model. This doesn't require any additional preprocessing steps than what we have used.

2. **Multinomial Naive Bayes:** Multinomial Naive Bayes is a probabilistic learning algorithm used to classify labels by calculating probabilities and then comparing them to a threshold which is usually set by checking our for the line which separates the given classes in the mathematical space. Multinomial Naïve Bayes consider a feature vector where a given term represents the number of times it appears or very often i.e. frequency. We use the MultinomialNB() class from the *sklean.naive_bayes* package to use them. There is a hyperparameter called *alpha* which is used for smoothening of the algorithm so that it is not as steep when classifying. We are using 0.03 as the alpha value (found through just trying different values, getting better score.)
---

In the code cell below, write an implementation of the features (and any additional pre-preprocessing steps) you defined above. You are free to use a library such as `nltk` or `sklearn` for this task.

After creating your features, use the training data to train a Naïve Bayes classifier and use the test set to evaluate its performance using the metric defined in Task 3. You **must not** use the test set for training.

To make sure that your code doesn't take too long to run or use too much memory, you can consider a time limit of 3 minutes and a memory limit of 12GB for this task.

In [7]:
# Create your features.
# ... your code goes here
vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(1, 3))
tfidf = TfidfTransformer()

v_train2 = vectorizer2.fit_transform(train_texts)
v_test2 = vectorizer2.transform(test_texts)

tf_train2 = tfidf.fit_transform(v_train2)
tf_test2 = tfidf.transform(v_test2)

# Train a Naïve Bayes classifier using the features you defined.
# ... your code goes here
from sklearn.naive_bayes import MultinomialNB
NB_classifier2 = MultinomialNB(alpha= 0.03)
NB_classifier2.fit(tf_train2.toarray(), train_labels)


# Evaluate on the test set.
# ... your code goes here
predictions2 = NB_classifier2.predict(tf_test2.toarray())
evaluate(test_labels, predictions2)
accuracy(test_labels, predictions2)


        precision    recall  f1-score
class 0  0.925024  0.831146  0.875576
class 1   0.42042  0.645161  0.509091


0.8014705882352942

---

## Task 5: Kaggle Competition (10 marks)

Head over to https://www.kaggle.com/t/1f90b74da0b7484da9647638e22d1068  
Use above classifier to predict the label for test_unseen.csv from competition page and upload the results to the leaderboard. The current baseline score is 0.36823. Make an improvement above the baseline. Please note that the evaluation metric for the competition is the f-score.

Read competition page for more details.



In [8]:
# from google.colab import drive
# drive.mount('/content/drive')

In [9]:
# Preparing submission for Kaggle
StudentID = "22222806_Rana" # Please add your student id and lastname
test_unseen = pd.read_csv("test_unseen.csv", names=['id', 'text'], header=0)
utest_texts = test_unseen["text"].to_list()
test_vect = vectorizer2.transform(utest_texts)
test_tfidf = tfidf.transform(test_vect)
test_pred = NB_classifier2.predict(test_tfidf.toarray())


# Here Id is unique identifier assigned to each test sample ranging from test_0 till test_1699
# Expected is a list of prediction made by your classifier

sub = {"Id": [f"test_{i}" for i in range(len(test_unseen))],
       "Expected": test_pred}
sub_df = pd.DataFrame(sub)
# The code below will generate a StudentID.csv on your drive on the left hand side in the explorer
# Please upload the file as a submission on the competition page
# You can index your submission StudentID_Lastname_index.csv, where index is your number of submission
sub_df.to_csv(f"{StudentID}.csv", sep=",", header=1, index=None)

Mention the approach that you have chosen briefly, and what is the mean average f-score that you have achieved? Did it improve above the chosen baseline model (0.36823)? Why or why not?

Edit this cell to write your answer below the line in no more than 500 words.

---
We have used the MultinomialNB instead of GaussianNB to fit our test cases as we did a trigram fit of our training case to train the model. We achieved a mean average f-score on 0.79117 on Kaggle. I feel that this method was crucial as we wanted to find whether a statement was a suggestion or not. In order to do so, we have to weight not only the words, but the words around it and I felt that having trigrams would be much better to weight and put in the model. It definitely improved when we run the training model and evaluated them. So that was the experimental motivation which led me to believe that this model may work well to surpass the baseline score. And it worked well as the models are better given the n-gram approach when finding something like a suggestion mining model where the position of the texts and their trigram weight matters more in the general context.

---