# Assignment 2 - CT5120

### Instructions:
- Complete all the tasks below and upload your submission as a Python notebook on Blackboard with the filename “`StudentID_Lastname.ipynb`” before **23:59** on **November 25, 2022**.
- This is an individual assignment; you **must not** work with other students to complete this assessment.
- The assignment is worth $50$ marks and constitutes 19% of the final grade. The breakdown of the marking scheme for each task is as follows:

| Task | Marks for write-up | Marks for code | Total Marks |
| :--- | :----------------- | :------------- | :---------- |
| 1    |                  5 |              5 |          10 |
| 2    |                  - |             10 |          10 |
| 3    |                  5 |              5 |          10 |
| 4    |                  5 |              5 |          10 |
| 5    |                  5 |              5 |          10 |



---

This assignment involves tasks for feature engineering, training and evaluating a classifier for suggestion detection. You will work with the data from SemEval-2019 Task 9 subtask A to classify whether a piece of text contains a suggestion or not. 


Download train.csv, test_seen.csv and test_unseen.csv from [GitHub](https://github.com/sharduls007/Assignment_2_CT5120) or uncomment the code cell below to get the data as comma-separated values (CSV) files. Each CSV file contains a header row; train.csv has 5,440 rows and test_seen.csv has 1,360 rows, spread across 3 columns of data. Each row of data contains a unique id, a piece of text and a label assigned by an annotator. A label of $1$ indicates that the given text contains a suggestion while a label of $0$ indicates that the text does not contain a suggestion.

You can find more details about the dataset in Sections 1, 2, 3 and 4 of [SemEval-2019 Task 9: Suggestion Mining from Online Reviews and Forums
](https://aclanthology.org/S19-2151/).

We will be using test_seen.csv for benchmarking our model, hence it has labels. On the other hand, test_unseen.csv has no labels and is used for the [Kaggle](https://www.kaggle.com/competitions/nlp2022ct5120suggestionmining/overview) competition.


In [1]:
# !curl "https://raw.githubusercontent.com/sharduls007/Assignment_2_CT5120/master/train.csv" > train.csv
# !curl "https://raw.githubusercontent.com/sharduls007/Assignment_2_CT5120/master/test_seen.csv" > test.csv
# !curl "https://raw.githubusercontent.com/sharduls007/Assignment_2_CT5120/master/test_unseen.csv" > test_unseen.csv

In [2]:
import numpy as np
import pandas as pd

# Read the CSV file.
train_df = pd.read_csv('train.csv', 
                 names=['id', 'text', 'label'], header=0)

test_df = pd.read_csv('test.csv', 
                 names=['id', 'text', 'label'], header=0)

# Store the texts and the labels as two parallel lists
# (the i-th label belongs to the i-th text).
train_texts, train_labels = train_df["text"].to_list(), train_df["label"].to_list() 
test_texts, test_labels = test_df["text"].to_list(), test_df["label"].to_list() 

# Check that training set and test set are of the right size.
assert len(test_texts) == len(test_labels) == 1360
assert len(train_texts) == len(train_labels) == 5440

---

## Task 1: Data Pre-processing (10 Marks)

Explain at least 3 steps that you will perform to preprocess the texts before training a classifier.



Edit this cell to write your answer below the line in no more than 300 words.

---

> Check whether there are any missing values.

> Remove stop words (function words). They carry little information about the content of a text, so they can usually be removed with little loss of information.

> Remove punctuation. Punctuation marks are usually less informative than words, so they can be removed.

> Lowercase the text. This is one of the most common preprocessing steps; it maps differently capitalised forms of a word to the same token.

> Stemming. Stemming is a simplified analysis of word structure that removes the endings or beginnings of words and leaves a common stem. In the code below, I use `PorterStemmer` to reduce each word to its stem.

> Lemmatisation. Lemmatisation is a linguistic analysis of word structure that maps morphologically related words to a common lemma. In the code below, I use `WordNetLemmatizer` to get the lemma of each word.

---

In the code cell below, write an implementation of the steps you defined above. You are free to use a library such as `nltk` or `sklearn` for this task.

In [43]:
from nltk.corpus import stopwords
from string import punctuation
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def data_preprocessing(df):
    
    # Check if there is any missing value.
    print(f"Are there missing value? {df.isnull().any()}")
    
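    # Remove stop words. NLTK's English list is lowercase and this filter runs
    # before lowercasing, so capitalised stop words (e.g. "I", "This") are kept.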
    stops = set(stopwords.words('english')) # store the english stop words
    df['preprocess'] = df['text'].apply(lambda x: ' '.join([w for w in x.split() if w not in (stops)]))
    
    # remove punctuation (strips leading and trailing punctuation from each token)
    punct = punctuation + '—«»–„“‘’'
    df['preprocess'] = df['preprocess'].apply(lambda x: ' '.join([w.strip(punct) for w in x.split()]))

    # lowercase the text
    df['preprocess'] = df['preprocess'].apply(lambda x: ' '.join([w.lower() for w in x.split()]))
    
    # stemming
    stemmer = PorterStemmer()
    df["preprocess"] = df["preprocess"].apply(lambda x: ' '.join([stemmer.stem(w) for w in x.split()]))

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    df["preprocess"] = df["preprocess"].apply(lambda x: ' '.join([lemmatizer.lemmatize(w) for w in x.split()]))
        
    return df

In [44]:
train_preprocess_df = data_preprocessing(train_df)
test_preprocess_df = data_preprocessing(test_df)

Are there missing value? id            False
text          False
label         False
preprocess    False
dtype: bool
Are there missing value? id            False
text          False
label         False
preprocess    False
dtype: bool


In [5]:
train_preprocess_df.head()

|      | id      | text | label | preprocess |
| :--- | :------ | :--- | :---- | :--------- |
| 0    | train_0 | "One would hope if I search for a word in the ... | 0 | one would hope i search word titl app would sh... |
| 1    | train_1 | "I would be beyond excited to get a response." | 0 | i would beyond excit get respons |
| 2    | train_2 | "Just like the user can select apps that are a... | 1 | just like user select app allow run background... |
| 3    | train_3 | "Once you create a CoreIndependentInputSource ... | 0 | onc creat coreindependentinputsourc touch visu... |
| 4    | train_4 | "I Have problems with Contact class on Windows... | 0 | i have problem contact class window 8.1 window... |


In [6]:
test_preprocess_df.head()

|      | id    | text | label | preprocess |
| :--- | :---- | :--- | :---- | :--------- |
| 0    | dev_0 | "I understand why you might do this for the FR... | 1 | i understand might free version feedli implor ... |
| 1    | dev_1 | "This is a significant bug and kind of hard to... | 0 | thi signific bug kind hard believ fix |
| 2    | dev_2 | "This leads the user to have to type AGAIN the... | 0 | thi lead user type again whole descript |
| 3    | dev_3 | "Needless to say, disappointed, it appears I c... | 0 | needle say disappoint appear i cannot develop ... |
| 4    | dev_4 | "Implementing the Auto-Upload feature in a Sil... | 0 | implement auto-upload featur silverlight 8.1 a... |
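
The `preprocess` values above still contain tokens such as `i` and `thi` (from "I" and "This"). This suggests that capitalised stop words pass the filter because it runs before lowercasing. A minimal sketch of the ordering effect, assuming a tiny hypothetical stop-word set (`STOPS`) rather than the full NLTK list:

```python
# Hypothetical, deliberately tiny stop-word set (all lowercase, like NLTK's English list).
STOPS = {"i", "this", "is", "a", "the", "to"}

def filter_then_lower(text):
    """Remove stop words first, then lowercase (the order used in data_preprocessing)."""
    kept = [w for w in text.split() if w not in STOPS]
    return " ".join(w.lower() for w in kept)

def lower_then_filter(text):
    """Lowercase first, so capitalised stop words are matched as well."""
    kept = [w for w in text.lower().split() if w not in STOPS]
    return " ".join(kept)

sentence = "This is a significant bug"
print(filter_then_lower(sentence))  # this significant bug
print(lower_then_filter(sentence))  # significant bug
```

Reordering the steps in `data_preprocessing` would change the features, so the scores reported in the later tasks would have to be recomputed.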


---

## Task 2: Feature Engineering (I) - TF-IDF as features (10 Marks)

In the lectures we have seen that raw counts of words and `tf-idf` scores can be useful features for a classification task. Complete the following code cell to create a suggestion detector which uses `tf-idf` scores as features for a Naïve Bayes classifier.

After applying your preprocessing steps, use the training data to train the classifier and make predictions on the test set. You **must not** use the test set for training.

If everything is implemented correctly, then you should see a single floating point value between 0 and 1 at the end which denotes the accuracy of the classifier.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.naive_bayes import GaussianNB

# Calculate tf-idf scores for the words in the training set.
# ... your code goes here

count_vectorizer = CountVectorizer()
train_X = train_df["preprocess"]
X_train_counts = count_vectorizer.fit_transform(train_X)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)


# Train a Naïve Bayes classifier using the tf-idf scores for words as features.
# ... your code goes here
NB_classifier_tfidf = GaussianNB()
# GaussianNB expects a dense array, hence the .toarray() conversion.
NB_classifier_tfidf.fit(X_train_tfidf.toarray(), train_labels)


# Predict on the test set.
predictions = []    # save your predictions on the test set into this list

# ... your code goes here
# generate the features for the test set
test_X = test_df["preprocess"]
X_test_counts = count_vectorizer.transform(test_X)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

predictions = NB_classifier_tfidf.predict(X_test_tfidf.toarray())


#################### DO NOT EDIT BELOW THIS LINE #################



def accuracy(labels, predictions):
    '''
    Calculate the accuracy score for a given set of predictions and labels.

    Args:
    labels (list): A list containing gold standard labels annotated as `0` and `1`.
    predictions (list): A list containing predictions annotated as `0` and `1`.

    Returns:
    float: A floating point value to score the predictions against the labels.
    '''

    assert len(labels) == len(predictions)

    correct = 0
    for label, prediction in zip(labels, predictions):
        if label == prediction:
            correct += 1 

    score = correct / len(labels)
    return score

# Calculate accuracy score for the classifier using tf-idf features.
accuracy(test_labels, predictions)

0.5279411764705882
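
To make the features concrete, the following sketch computes tf-idf weights by hand for a toy corpus. It assumes the smoothed formula that scikit-learn's `TfidfTransformer` uses by default, $\text{idf}(t) = \ln\frac{1+n}{1+\text{df}(t)} + 1$, followed by L2 normalisation of each row. The corpus and the helper name `tfidf_rows` are made up for illustration, and whitespace splitting stands in for the tokenisation done by `CountVectorizer`.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows); each row holds L2-normalised tf-idf weights."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed idf (assumed to match TfidfTransformer's defaults).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard against empty documents
        rows.append([v / norm for v in raw])
    return vocab, rows

docs = ["pleas add dark mode", "app crash start", "pleas fix crash"]
vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:6s} {weight:.3f}")
```

In the first document, a term that occurs in only one document (`dark`) receives a larger weight than a term shared with another document (`pleas`).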

---

## Task 3: Evaluation Metrics (10 Marks)

Why is accuracy not the best measure for evaluating a classifier? Describe an evaluation metric which might work better than accuracy for a classification task such as suggestion detection.

Edit this cell to write your answer below the line in no more than 150 words.

---

> I define accuracy as $\text{Accuracy}=\frac{\text{correct}}{\text{len(labels)}}$. It is misleading when the classes in the dataset are imbalanced. For instance, if 90% of the data belongs to class A and 10% to class B, a classifier that never predicts class B still reaches 90% accuracy, because it is correct on every class A example. Therefore, accuracy is not the best measure for evaluating a classifier.

> The F1 score, computed from the confusion matrix, works better than accuracy for classification, especially on imbalanced data. It is the harmonic mean of precision and recall on the positive class, $F_1=\frac{2TP}{2TP+FP+FN}$, so a classifier only scores well if it both finds the positive examples and avoids false alarms.

---
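
The 90%/10% example from the answer above can be worked through numerically. The sketch below uses made-up labels and a degenerate classifier that always predicts the majority class. The helpers `toy_accuracy` and `f1_positive` are illustrative names, and label `1` is assumed to be the positive (minority) class.

```python
def toy_accuracy(labels, predictions):
    """Fraction of predictions that match the gold labels."""
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)

def f1_positive(labels, predictions, positive=1):
    """F1 for the chosen positive class; returns 0.0 when it is undefined."""
    pairs = list(zip(labels, predictions))
    tp = sum(l == positive and p == positive for l, p in pairs)
    fp = sum(l != positive and p == positive for l, p in pairs)
    fn = sum(l == positive and p != positive for l, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# 90 examples of the majority class (0) and 10 of the minority class (1).
labels = [0] * 90 + [1] * 10
predictions = [0] * 100  # always predict the majority class

print(toy_accuracy(labels, predictions))  # 0.9
print(f1_positive(labels, predictions))   # 0.0
```

Accuracy rewards the classifier for the 90 easy cases, while F1 on the minority class exposes that it never finds a single positive example.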

In the code cell below, write an implementation of the evaluation metric you defined above. Please write your own implementation from scratch.

In [9]:
def evaluate(labels, predictions):
    '''
    Calculate an evaluation score other than accuracy for a given set of predictions and labels.

    Args:
    labels (list): A list containing gold standard labels annotated as `0` and `1`.
    predictions (list): A list containing predictions annotated as `0` and `1`.

    Returns:
    float: A floating point value to score the predictions against the labels.
    '''

    # check that labels and predictions are of same length
    #assert len(labels) == len(predictions)

    score = 0.0

    #################### EDIT BELOW THIS LINE #########################

    # your code goes here
    num_classes = len(np.unique(labels)) # Number of classes 
    c_matrix = np.zeros((num_classes, num_classes))
    
    for i in range(len(labels)):
        c_matrix[labels[i]][predictions[i]] += 1

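    # F1 score with class 0 as the reference class: c_matrix[0][0] counts the
    # examples whose gold label and prediction are both 0.
    score = 2*c_matrix[0][0] / (2*c_matrix[0][0] + c_matrix[1][0] + c_matrix[0][1])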
    
    #################### EDIT ABOVE THIS LINE #########################

    return score

# Calculate evaluation score based on the metric of your choice
# for the classifier trained in Task 2 using tf-idf features.
evaluate(test_labels, predictions)

0.6085365853658536
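
Because F1 is not symmetric in the two classes, its value depends on which label is treated as positive. In the cell above, `c_matrix[0][0]` counts examples whose gold label and prediction are both `0`, so the reported score corresponds to class `0`. A hedged sketch of a per-class variant follows; the function name `f1_per_class` and the toy data are assumptions, not part of the assignment template.

```python
def f1_per_class(labels, predictions, num_classes=2):
    """Build a confusion matrix and return one F1 score per class."""
    # matrix[gold][predicted] counts how often each (gold, predicted) pair occurs.
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for gold, pred in zip(labels, predictions):
        matrix[gold][pred] += 1

    scores = []
    for c in range(num_classes):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp  # gold label c, predicted as another class
        fp = sum(matrix[r][c] for r in range(num_classes)) - tp  # predicted c, gold label differs
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

labels      = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predictions = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
f1_class0, f1_class1 = f1_per_class(labels, predictions)
print(f"F1 (class 0): {f1_class0:.3f}")  # F1 (class 0): 0.800
print(f"F1 (class 1): {f1_class1:.3f}")  # F1 (class 1): 0.400
```

On this toy data the two scores differ by a factor of two, which is why it matters to state which class a reported F1 refers to.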

---

## Task 4: Feature Engineering (II) - Other features (10 Marks)

Describe features other than those defined in Task 2 which might improve the performance of your suggestion detector. If these features require any additional pre-processing steps, then define those steps as well.


Edit this cell to write your answer below the line in no more than 500 words.

---

> In this code, I use word embeddings with mean pooling as features. A word embedding maps a word to an n-dimensional vector, such that words which are related to each other are mapped to similar vectors. The benefit is that the trained classifier can handle words it did not see during training, provided their vectors are close to those of words it did see. I use a pre-trained word embedding model, GloVe, in the code below. GloVe is very similar to word2vec, but it starts by building a co-occurrence matrix, which makes it a hybrid model. To obtain one feature vector per text, I apply mean pooling, i.e. I average the vectors of all tokens in the text.

> One additional pre-processing step is tokenization, because each token has to be looked up in the embedding model. Tokenization is the process of dividing a text into smaller units, i.e. tokens. Hence, besides the data pre-processing steps in Task 1, I also apply tokenization in the code below, using the `word_tokenize` function provided by `nltk`.

---
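
Mean pooling can be illustrated without downloading GloVe. The sketch below uses a hypothetical three-dimensional embedding table (`TOY_VECTORS`) in place of the 300-dimensional model loaded in this notebook. It also assumes, for simplicity, that out-of-vocabulary tokens are skipped and that a text with no known tokens maps to a zero vector; this is one possible design choice, not the only one.

```python
# Hypothetical 3-dimensional embeddings, standing in for real pre-trained vectors.
TOY_VECTORS = {
    "add":  [0.2, 0.4, 0.0],
    "dark": [0.6, 0.0, 0.2],
    "mode": [0.4, 0.2, 0.4],
}
DIM = 3

def mean_pool(tokens, vectors=TOY_VECTORS, dim=DIM):
    """Average the vectors of all known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension across all token vectors.
    return [sum(component) / len(known) for component in zip(*known)]

print(mean_pool(["add", "dark", "mode"]))  # one vector for the whole text
print(mean_pool(["unknownword"]))          # falls back to the zero vector
```

Whatever the length of the text, the result has a fixed size, which is what a classifier such as Naïve Bayes needs as input.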

In the code cell below, write an implementation of the features (and any additional pre-preprocessing steps) you defined above. You are free to use a library such as `nltk` or `sklearn` for this task.

After creating your features, use the training data to train a Naïve Bayes classifier and use the test set to evaluate its performance using the metric defined in Task 3. You **must not** use the test set for training.

To make sure that your code doesn't take too long to run or use too much memory, you can consider a time limit of 3 minutes and a memory limit of 12GB for this task.

In [38]:
# Create your features.
# ... your code goes here
import time
import gensim.downloader as api
glove = api.load('glove-wiki-gigaword-300')

start = time.time()

# Tokenization
train_tokens = [word_tokenize(sent) for sent in train_df["preprocess"]]
test_tokens = [word_tokenize(sent) for sent in test_df["preprocess"]]

def word_embedding(tokens):
    all_text_vecs = []
    oov = np.random.rand(1,300) # random out-of-vocab vector (drawn anew on every call)
    all_pooled_vecs = []

    for toks in tokens:
        text_vecs = []

        for tok in toks:
            if tok in glove:
                text_vecs.append(glove[tok])
            else:
                text_vecs.append(oov)
        
        if len(text_vecs) == 0: text_vecs.append(oov)
        
        all_text_vecs.append(text_vecs)
          
    for text_vec in all_text_vecs:
        mean_pool = np.mean(np.vstack(text_vec), axis=0)
        all_pooled_vecs.append(mean_pool)
    
    return all_pooled_vecs

# Train a Naïve Bayes classifier using the features you defined.
# ... your code goes here
NB_classifier_emb = GaussianNB()
NB_classifier_emb.fit(word_embedding(train_tokens), train_labels)
preds = NB_classifier_emb.predict(word_embedding(test_tokens))

print(f"======== time elapse: {time.time() - start:.2f} =========")

# Evaluate on the test set.
# ... your code goes here
score = evaluate(test_labels, preds)
print(f"score: {score}")

score: 0.706140350877193


---

## Task 5: Kaggle Competition (10 Marks)

Head over to https://www.kaggle.com/t/1f90b74da0b7484da9647638e22d1068  
Use the above classifier to predict the labels for test_unseen.csv from the competition page and upload the results to the leaderboard. The current baseline score is 0.36823; try to improve on it. Please note that the evaluation metric for the competition is the F-score.

Read the competition page for more details.



In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [39]:
# Preparing submission for Kaggle
StudentID = "20230033_Li" # Please add your student id and lastname
test_unseen = pd.read_csv("test_unseen.csv", names=['id', 'text'], header=0)

# Here Id is a unique identifier assigned to each test sample, ranging from test_0 to test_1699
# Expected is the list of predictions made by your classifier
unseen_texts = data_preprocessing(test_unseen)["preprocess"]
unseen_tokens = [word_tokenize(sent) for sent in unseen_texts]
sub = {"Id": [f"test_{i}" for i in range(len(test_unseen))],
       "Expected": list(NB_classifier_emb.predict(word_embedding(unseen_tokens)))}

sub_df = pd.DataFrame(sub)
# The code below will generate a StudentID.csv on your drive on the left hand side in the explorer
# Please upload the file as a submission on the competition page
# You can index your submission StudentID_Lastname_index.csv, where index is your number of submission
sub_df.to_csv(f"{StudentID}.csv", sep=",", header=1, index=None)

Are there missing value? id      False
text    False
dtype: bool
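
Before uploading, it can help to sanity-check the submission file. The sketch below assumes only what the comments above state: an `Id` column counting up from `test_0` and an `Expected` column of `0`/`1` predictions. The helper `check_submission` and the in-memory CSV are illustrative and not part of the competition tooling.

```python
import csv
import io

def check_submission(csv_text, expected_rows):
    """Return True if the CSV has Id/Expected columns, sequential ids and binary labels."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["Id", "Expected"]:
        return False
    rows = list(reader)
    if len(rows) != expected_rows:
        return False
    for i, row in enumerate(rows):
        if row["Id"] != f"test_{i}" or row["Expected"] not in {"0", "1"}:
            return False
    return True

sample = "Id,Expected\ntest_0,1\ntest_1,0\ntest_2,0\n"
print(check_submission(sample, expected_rows=3))  # True
```

For a file on disk, the same function could be called with the file's contents, e.g. `check_submission(open(path).read(), expected_rows=len(test_unseen))`.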


Briefly mention the approach that you have chosen. What is the mean average F-score that you have achieved? Did it improve above the chosen baseline model (0.36823)? Why or why not?

Edit this cell to write your answer below the line in no more than 500 words.

---

> I use word embeddings with mean pooling to represent the pre-processed texts as feature vectors. I then apply the trained Naïve Bayes classifier to predict labels for the unseen, pre-processed data. The mean average F-score I achieved is 0.59352, which improves on the chosen baseline model (0.36823). I believe that there are two reasons. First, the data pre-processing steps (stop-word removal, punctuation removal, lowercasing, stemming, lemmatization and tokenization) reduce noise and keep the informative parts of the text. Second, word embeddings with mean pooling allow the trained classifier to make decisions on unseen but similar data.
---