---

This project covers feature engineering, training, and evaluation of a classifier for suggestion detection. I work with the data from SemEval-2019 Task 9 subtask A to classify whether a piece of text contains a suggestion or not.


Download train.csv, test_seen.csv and test_unseen.csv from the GitHub Suggestion-mining folder, or run the code cell below to fetch train.csv and test_seen.csv as comma-separated values (CSV) files (test_seen.csv is saved locally as test.csv). Each CSV file contains a header row followed by the data: 5,440 rows in train.csv and 1,360 rows in test_seen.csv, spread across 3 columns. Each row contains a unique id, a piece of text and a label assigned by an annotator. A label of $1$ indicates that the given text contains a suggestion while a label of $0$ indicates that the text does not contain a suggestion.

You can find more details about the dataset in Sections 1, 2, 3 and 4 of [SemEval-2019 Task 9: Suggestion Mining from Online Reviews and Forums](https://aclanthology.org/S19-2151/).

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!curl "https://raw.githubusercontent.com/mahima-sharma10/Suggestion-mining/main/train.csv" > train.csv
!curl "https://raw.githubusercontent.com/mahima-sharma10/Suggestion-mining/main/test_seen.csv" > test.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  670k  100  670k    0     0  1610k      0 --:--:-- --:--:-- --:--:-- 1610k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  168k  100  168k    0     0   505k      0 --:--:-- --:--:-- --:--:--  503k


In [3]:
import numpy as np
import pandas as pd

# Read the CSV file.
train_df = pd.read_csv('train.csv', 
                 names=['id', 'text', 'label'], header=0)

test_df = pd.read_csv('test.csv', 
                 names=['id', 'text', 'label'], header=0)

# Store the data as two parallel lists: one with the texts
# and one with the corresponding labels.
train_texts, train_labels = train_df["text"].to_list(), train_df["label"].to_list() 
test_texts, test_labels = test_df["text"].to_list(), test_df["label"].to_list() 

# Check that training set and test set are of the right size.
assert len(test_texts) == len(test_labels) == 1360
assert len(train_texts) == len(train_labels) == 5440

---

## Task 1: Data Pre-processing 



Edit this cell to write your answer below the line in no more than 300 words.

---

>
1. Lowercasing: every word is converted to lower case (NLP -> nlp). Words such as "Book" and "book" mean the same thing, but without lowercasing they are represented as two different terms in the vector space model, which adds dimensions. Python's `lower()` method makes this step straightforward, and lowercased text is treated uniformly in all later steps.

2. Removing punctuation: punctuation adds noise that can introduce ambiguity while training the model, so the code below strips it from the text.

3. Removing hyperlinks: hyperlinks are references that contribute little to the meaning of a sentence and occur rarely, so removing them might help accuracy.
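As a minimal sketch, the three steps can be applied to a single string as follows. The helper name `preprocess` and the sample sentence are illustrative assumptions, and the final whitespace normalization is an extra convenience step that is not part of the answer above.

```python
import re

def preprocess(text):
    """Lowercase, drop hyperlinks, drop punctuation, and normalize spaces."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)   # remove hyperlinks
    text = re.sub(r"[^\w\s]", "", text)   # remove punctuation
    return " ".join(text.split())         # collapse leftover whitespace

# Hypothetical example sentence, not taken from the dataset.
sample = "Please ADD a dark mode! Details: https://example.com/request"
print(preprocess(sample))
# please add a dark mode details
```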

In the code cell below, write an implementation of the steps you defined above. You are free to use a library such as `nltk` or `sklearn` for this task.

In [4]:
import re

import nltk
nltk.download(['punkt', 'wordnet', 'omw-1.4', 'averaged_perceptron_tagger', 'universal_tagset', 'stopwords'])
from nltk import word_tokenize
from nltk.corpus import stopwords

# Lowercase the training and test texts.
words_train = [text.lower() for text in train_texts]
words_test = [text.lower() for text in test_texts]

# Remove hyperlinks (runs of non-whitespace characters beginning with "http").
reg_train = [re.sub(r"http\S+", "", text) for text in words_train]
reg_test = [re.sub(r"http\S+", "", text) for text in words_test]

# Remove punctuation (anything that is not a word character or whitespace).
punct_train = [re.sub(r'[^\w\s]', '', text) for text in reg_train]
punct_test = [re.sub(r'[^\w\s]', '', text) for text in reg_test]
# Tokenize each text and join the tokens back with single spaces so that
# the result can be passed to CountVectorizer.
def myfunction(z):
  text_lower = [w.lower() for w in z]
  tokenized_sents = [word_tokenize(i) for i in text_lower]
  # Build the result outside any loop so that an empty input returns [].
  processed_texts = [' '.join(tokens) for tokens in tokenized_sents]
  return processed_texts
# Save the tokenized and re-joined texts for the training data.
processed_texts = myfunction(punct_train)
train_df['processed_texts'] = processed_texts
trainX = train_df['processed_texts']
# Save the tokenized and re-joined texts for the test data.
processed_texts = myfunction(punct_test)
test_df['processed_texts'] = processed_texts
testX = test_df['processed_texts']

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


---

## Task 2: Feature Engineering (I) - TF-IDF as features

We have seen that raw counts of words and `tf-idf` scores can be useful features for a classification task. In the following code cell, I create a suggestion detector which uses `tf-idf` scores as features for a Naïve Bayes classifier.

After applying the preprocessing steps, I use the training data to train the classifier and make predictions on the test set.

If everything is implemented correctly, then we should see a single floating point value between 0 and 1 at the end which denotes the accuracy of the classifier.
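To make the weighting concrete, here is a pure-Python sketch of the formula that scikit-learn's `TfidfTransformer` applies with `smooth_idf=True` and its default L2 normalization: idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplied by the raw term count. The function name `tfidf`, the whitespace tokenization and the toy documents are assumptions made for illustration only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Fall back to 1.0 so that an empty document does not divide by zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

# Hypothetical toy corpus; "search" occurs in every document.
toy_docs = ["please add search", "search is slow", "please fix search"]
for vec in tfidf(toy_docs):
    print({t: round(w, 3) for t, w in vec.items()})
```

Terms that occur in every document, such as "search" here, receive the lowest weight within each document, while rarer terms such as "add" are weighted more heavily.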

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.naive_bayes import GaussianNB

# Calculate tf-idf scores for the words in the training set.
# Count unigrams, bigrams and trigrams, then reweight the counts with tf-idf.
count_vect = CountVectorizer(ngram_range=(1, 3))
X_train_counts = count_vect.fit_transform(trainX)
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# Generate the features for the test set with the vocabulary and idf
# weights fitted on the training set.
X_test_counts = count_vect.transform(testX)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

# Train a Naïve Bayes classifier using the tf-idf scores for words as features.
# GaussianNB needs dense input, hence the toarray() calls.
NB_classifier_tfidf = GaussianNB()
NB_classifier_tfidf.fit(X_train_tfidf.toarray(), train_labels)

# Predict on the test set.
predictions = NB_classifier_tfidf.predict(X_test_tfidf.toarray())

#################### DO NOT EDIT BELOW THIS LINE #################



def accuracy(labels, predictions):
  '''
  Calculate the accuracy score for a given set of predictions and labels.
  
  Args:
    labels (list): A list containing gold standard labels annotated as `0` and `1`.
    predictions (list): A list containing predictions annotated as `0` and `1`.

  Returns:
    float: A floating point value to score the predictions against the labels.
  '''

  assert len(labels) == len(predictions)
  
  correct = 0
  for label, prediction in zip(labels, predictions):
    if label == prediction:
      correct += 1 
  
  score = correct / len(labels)
  return score

# Calculate accuracy score for the classifier using tf-idf features.
accuracy(test_labels, predictions)
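`GaussianNB` only accepts dense arrays, and converting a trigram feature matrix with `toarray()` can use a lot of memory. One possible alternative, which is not what this notebook does, is `MultinomialNB`: it is also a Naïve Bayes variant, it accepts sparse matrices directly, and it is commonly paired with tf-idf features. The sketch below uses a made-up toy corpus and hypothetical variable names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, not taken from the SemEval dataset.
toy_train_texts = [
    "please add an offline mode",
    "the app crashed twice today",
    "it would be nice to allow dark mode",
    "i bought this phone last week",
]
toy_train_labels = [1, 0, 1, 0]

# The pipeline vectorizes the text and feeds the sparse matrix to the classifier.
toy_model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
toy_model.fit(toy_train_texts, toy_train_labels)

toy_predictions = toy_model.predict(["please allow offline mode"])
print(toy_predictions)
```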

---

## Task 3: Evaluation Metrics

Accuracy is not always the best measure for evaluating a classifier. This section describes an evaluation metric which might work better than accuracy for a classification task such as suggestion detection.

Accuracy can be misleading because it hides the effect of class imbalance: a classifier that always predicts the majority class can still reach a high accuracy.
A metric that might work better for this task is the F1 score, the harmonic mean of precision and recall. The `evaluate` function in the next code cell computes the F1 score of the model trained in Task 2. The F1 score is preferable to accuracy here for the following reasons:
1. Accuracy is appropriate when true positives and true negatives matter most, while the F1 score is appropriate when false negatives and false positives are costly.
2. Accuracy can be used when the class distribution is roughly balanced, while the F1 score is a better metric when the classes are imbalanced, as in this dataset.
3. Most real-life classification problems have an imbalanced class distribution, so the F1 score is often the better metric for evaluating a model.
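
The effect of class imbalance can be illustrated with a toy example: with one suggestion among ten texts, a classifier that never predicts a suggestion still reaches 90% accuracy, while its F1 score is 0. The function name `accuracy_and_f1` and the toy labels are assumptions made for this sketch.

```python
def accuracy_and_f1(labels, predictions):
    """Return (accuracy, F1 for the positive class) for binary 0/1 labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for l, p in pairs if l == 1 and p == 1)
    fp = sum(1 for l, p in pairs if l == 0 and p == 1)
    fn = sum(1 for l, p in pairs if l == 1 and p == 0)
    correct = sum(1 for l, p in pairs if l == p)
    acc = correct / len(pairs)
    # F1 = 2TP / (2TP + FP + FN), defined as 0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return acc, f1

# Hypothetical imbalanced data: 1 suggestion, 9 non-suggestions.
toy_labels = [1] + [0] * 9
always_negative = [0] * 10
print(accuracy_and_f1(toy_labels, always_negative))
# (0.9, 0.0)
```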
---

In the code cell below, write an implementation of the evaluation metric you defined above. Please write your own implementation from scratch.

In [None]:
def evaluate(labels, predictions):
  '''
  Calculate an evaluation score other than accuracy for a given set of predictions and labels.
  
  Args:
    labels (list): A list containing gold standard labels annotated as `0` and `1`.
    predictions (list): A list containing predictions annotated as `0` and `1`.

  Returns:
    float: A floating point value to score the predictions against the labels.
  '''
  
  # check that labels and predictions are of same length
  assert len(labels) == len(predictions)

  score = 0.0
  
  #################### EDIT BELOW THIS LINE #########################

  # Count the four cells of the confusion matrix in a single pass.
  tp = tn = fp = fn = 0
  for lbl, pred in zip(labels, predictions):
    if lbl == 1 and pred == 1:
      tp += 1
    elif lbl == 0 and pred == 0:
      tn += 1
    elif lbl == 0 and pred == 1:
      fp += 1
    elif lbl == 1 and pred == 0:
      fn += 1

  # Precision and recall for the positive (suggestion) class.
  # Guard against division by zero when there are no predicted or no gold positives.
  precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
  recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0

  # F1 is the harmonic mean of precision and recall.
  if precision + recall > 0:
    final_score = 2 * precision * recall / (precision + recall)
  else:
    final_score = 0.0

  #################### EDIT ABOVE THIS LINE #########################

  return final_score

# Calculate evaluation score based on the metric of your choice
# for the classifier trained in Task 2 using tf-idf features.
evaluate(test_labels, predictions)

---

## Task 4: Feature Engineering (II) - Other features 

This section describes features other than those defined in Task 2 which might improve the performance of the suggestion detector.


Edit this cell to write your answer below the line in no more than 500 words.

---
  
> I added a preprocessing step that removes stopwords from the training and test data. The cleaned data is vectorized with tf-idf (`TfidfVectorizer`) and used to train a Bernoulli Naïve Bayes model, which slightly improved the performance of the model.
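
Stopword removal works token by token inside each document. The sketch below uses a small made-up stopword set so that it runs without downloads; the notebook itself relies on NLTK's English stopword list, and the names `TOY_STOPWORDS` and `drop_stopwords` are hypothetical.

```python
# Small illustrative stopword set (an assumption, far shorter than NLTK's list).
TOY_STOPWORDS = {"the", "a", "to", "it", "is", "be"}

def drop_stopwords(text, stopword_set):
    """Drop stopwords token by token, keeping the rest of the document."""
    return " ".join(tok for tok in text.split() if tok not in stopword_set)

print(drop_stopwords("it would be nice to add a dark mode", TOY_STOPWORDS))
# would nice add dark mode
```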
 
  

---

In the code cell below, write an implementation of the features (and any additional pre-preprocessing steps) you defined above. You are free to use a library such as `nltk` or `sklearn` for this task.

After creating your features, use the training data to train a Naïve Bayes classifier and use the test set to evaluate its performance using the metric defined in Task 3. You **must not** use the test set for training.

To make sure that your code doesn't take too long to run or use too much memory, you can consider a time limit of 3 minutes and a memory limit of 12GB for this task.

In [None]:
# Stopword removal: drop stopwords token by token within each document.
stops = set(stopwords.words('english'))

def remove_stopwords(texts):
  return [' '.join(w for w in word_tokenize(t) if w not in stops) for t in texts]

train_stpwrdX = remove_stopwords(punct_train)
test_stpwrdX = remove_stopwords(punct_test)

# Create your features.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf_vec = tfidf_vectorizer.fit_transform(train_stpwrdX)
# Inspect the size of the feature matrix (documents x vocabulary terms).
print(X_train_tfidf_vec.shape)

# Generate the features for the test set.
X_test_vec = tfidf_vectorizer.transform(test_stpwrdX)

# Train a Naïve Bayes classifier using the features you defined.
# BernoulliNB binarizes its input (threshold 0.0 by default), so each tf-idf
# score is reduced to whether the term is present in the document.
from sklearn.naive_bayes import BernoulliNB
NB_classifier_tfidf = BernoulliNB()
NB_classifier_tfidf.fit(X_train_tfidf_vec, train_labels)

# Evaluate on the test set, including the F1 metric defined in Task 3.
from sklearn.metrics import classification_report
preds = NB_classifier_tfidf.predict(X_test_vec)
print(classification_report(test_labels, preds))
print(evaluate(test_labels, preds))