# CS4765/6765 Assignment 2: Sentiment Analysis for climate-related text in corporate disclosures

We've seen the problem of sentiment analysis in class. In this
assignment you will write your own sentiment analysis system. You will
implement two variants of a naive Bayes classifier. The starter code
also provides a most-frequent class baseline and logistic regression.
You will then compare the various approaches.

In this assignment we will consider a three-way sentiment analysis task for climate-related text in corporate disclosures.

Read through this notebook in its entirety before getting started. You should only make changes / write code in parts of the notebook where the instructions ask you to do so. These are indicated with TODO throughout.


## Data

I've provided you with the following files for this assignment: `train-sample.csv`, `dev.csv`, `test.csv`.

Each of these files is a CSV file. You will use `train-sample.csv` for training models, `dev.csv` for evaluation during model development and for model refinement, and `test.csv` for final evaluation.

Each row in each file is an instance consisting of a text, and a label. (Each row also contains an id number, which we won't use in this assignment.) Each text is a climate-related paragraph from an annual report from a company. Each label is 0, 1 or 2. The labels have the following meanings:

- 2: Opportunity arising due to climate change (positive sentiment)

- 1: Neutral

- 0: Risk or threat that negatively impacts an entity (negative sentiment)

`train-sample.csv`, `dev.csv`, and `test.csv` contain 800, 200, and 300 instances, respectively.

The starter code handles reading these files for you. It also takes care of tokenizing the texts. Do not modify these parts of the starter code. You should only modify the starter code where you are instructed to (indicated with TODO).
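
For orientation only, the sketch below parses a tiny in-memory CSV with the same three-column shape (id, text, label) described above and counts the labels. The header names and the example rows are invented for illustration and are not taken from the provided files; the starter code already does the real loading.

```python
import csv
import io
from collections import Counter

# Hypothetical miniature file: the header names and rows are made up.
sample = io.StringIO(
    'id,text,label\n'
    '0,"Rising sea levels threaten our coastal facilities.",0\n'
    '1,"Demand for our low-carbon products is growing.",2\n'
    '2,"We report our emissions annually.",1\n'
)

reader = csv.reader(sample)
next(reader)  # skip the header row
# Keep the text and the integer label; the id column is ignored.
rows = [(text, int(label)) for _, text, label in reader]

# How many instances carry each label?
label_counts = Counter(label for _, label in rows)
print(label_counts)
```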

The data for this assignment is from the [ClimateBERT climate sentiment dataset](https://huggingface.co/datasets/climatebert/climate_sentiment). The license is [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). You are of course welcome to read more about this dataset, but it is not necessary to do so to be able to do this assignment. To create the data for this assignment, I converted the original dataset files from parquet to CSV. The original dataset does not include a development set, so I randomly split the training data into 80% training and 20% dev to create `train-sample.csv` and `dev.csv`. `test.csv` is the same test data as the original dataset (but formatted as CSV instead of parquet).





## Models

In this assignment you will implement two different approaches to sentiment analysis: multinomial naive Bayes and binary multinomial naive Bayes. The starter code further includes a most-frequent class baseline and logistic regression. Further details on the models are provided below.

## Implementation

Your code must be able to run in the virtual environment on the NLP VM on the lab machines. You must not use NLTK or any other NLP toolkits. You should not import any modules that this notebook does not already import for you. Although the starter code uses an implementation of logistic regression from scikit-learn (sklearn), you must not use scikit-learn for any of the code that you write in this assignment.

## Experimental Setup

Throughout this assignment, we will always use the training data (`train-sample.csv`) for training models. We will implement our models and then use the development data (`dev.csv`) for preliminary evaluation. Once we've completed this, we will do our final evaluation on the test data (`test.csv`). The starter code guides you through this process.


In [None]:
import re

# A simple tokenizer based on the word tokenizer from sklearn.
# It applies case folding by default.
word_tokenize_pattern = re.compile(r"(?u)\b\w\w+\b")
def word_tokenize(s, apply_case_folding=True):
    tokens = word_tokenize_pattern.findall(s)
    if apply_case_folding:
        return [x.lower() for x in tokens]
    return tokens

In [None]:
# An example to demonstrate the tokenization
word_tokenize('''This is a sentence. Here is another one! This is to show, just as an example, how "word_tokenize" works.''')

In [None]:
# Helper function to load a data file
import csv

def get_texts_and_labels(fname):
    texts = []
    labels = []
    # Open with newline='' (as the csv module recommends) and close
    # the file automatically when done.
    with open(fname, newline='') as f:
        csv_reader = csv.reader(f)
        # Ignore header row
        next(csv_reader)
        for _, text, label in csv_reader:
            texts.append(text)
            labels.append(int(label))
    return texts, labels

In [None]:
# Load the training and dev data
train_texts, train_labels = get_texts_and_labels('data/train-sample.csv')
dev_texts, dev_labels = get_texts_and_labels('data/dev.csv')

# Some sanity checks
assert len(train_texts) == len(train_labels)
assert len(dev_texts) == len(dev_labels)

for label in train_labels + dev_labels:
    assert 0 <= label <= 2

In [None]:
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

# A helper function to print out macro-averaged P, R, and F1, and accuracy.
# Uses implementations of evaluation metrics from sklearn.
def print_results(gold_labels, predicted_labels):
    p,r,f,_ = precision_recall_fscore_support(gold_labels, 
                                              predicted_labels,
                                              average='macro',
                                              zero_division=0)
    acc = accuracy_score(gold_labels, predicted_labels)

    print("Precision: ", p)
    print("Recall: ", r)
    print("F1: ", f)
    print("Accuracy: ", acc)
    print()
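
To make the macro-averaged scores printed by `print_results` more concrete, here is a minimal sketch that computes them by hand in plain Python. The function name `macro_prf` and the toy label lists are made up for illustration. The sketch assumes all three labels are always included in the average and treats undefined values as 0, in the spirit of `zero_division=0`.

```python
def macro_prf(gold, predicted, labels=(0, 1, 2)):
    """Macro-averaged precision, recall, and F1 (hypothetical helper)."""
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(gold, predicted))
    for c in labels:
        tp = sum(1 for g, p in pairs if g == c and p == c)
        fp = sum(1 for g, p in pairs if g != c and p == c)
        fn = sum(1 for g, p in pairs if g == c and p != c)
        # Undefined ratios (zero denominator) are treated as 0.
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    # Macro averaging: every class counts equally, regardless of its size.
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


toy_gold = [0, 0, 1, 2, 2, 2]
toy_pred = [0, 1, 1, 2, 2, 0]
macro_p, macro_r, macro_f = macro_prf(toy_gold, toy_pred)
print(macro_p, macro_r, macro_f)
```

Because each class contributes equally, a classifier that ignores a rare class is penalized more heavily by macro-averaged scores than by accuracy.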

## Most-frequent Class Baseline

The starter code below trains a most-frequent class baseline on the training data and evaluates it on the dev data. (Note that sklearn also includes an implementation of a most-frequent class baseline, which you might find useful for your projects: https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html)

In [None]:
# A most-frequent class baseline
class Baseline:
    def __init__(self, labels):
        self.train(labels)

    def train(self, labels):
        # Count classes to determine which is the most frequent
        label_freqs = {}
        for l in labels:
            label_freqs[l] = label_freqs.get(l, 0) + 1
        self.mfc = sorted(label_freqs, reverse=True, 
                          key=lambda x : label_freqs[x])[0]
    
    def classify(self, test_instance):
        # Ignore the test instance and always return the most frequent class
        return self.mfc


In [None]:
baseline_classifier = Baseline(train_labels)
baseline_dev_predictions = [baseline_classifier.classify(x) for x in dev_texts]
print_results(dev_labels, baseline_dev_predictions)

## Logistic Regression

The implementation below uses the scikit-learn (sklearn) Python module for logistic regression. Scikit-learn is a popular tool for doing many machine learning tasks in Python. It includes implementations of many classifiers (including naive Bayes, but we're implementing that ourselves in this assignment). Read the comments in the code below to learn how to use it.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# sklearn provides functionality for tokenizing text and
# extracting features from it. This uses the word_tokenize function
# defined above for tokenization, as opposed to sklearn's
# default tokenization (the two are very similar; note that
# sklearn does not lowercase the text itself when it is given
# a callable analyzer, so word_tokenize applies case folding).
# Using exactly the same tokenization for logistic regression
# and for the NB models we will implement enables us to 
# more easily compare results between the various methods.
# http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
count_vectorizer = CountVectorizer(analyzer=word_tokenize)

# train_counts will be a DxV matrix where D is the number of
# training documents and V is the number of types in the
# training documents. Each cell in the matrix indicates the
# frequency (count) of a type in a document.
train_counts = count_vectorizer.fit_transform(train_texts)

# Train a logistic regression classifier on the training
# data. A wide range of options are available. sklearn uses
# L2 regularization by default. The algorithm for minimizing
# the loss during training is LBFGS, rather than the SGD that
# we saw in lecture. The maximum number of iterations is set
# to 500 (max_iter=500) to allow the model to converge on
# this training data. The random_state is set to 0 (an
# arbitrarily chosen number) to help ensure results are
# consistent from run to run.
# http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
lr = LogisticRegression(max_iter=500,
                        random_state=0)
lr_classifier = lr.fit(train_counts, train_labels)

# Transform the dev documents into a DxV matrix, similar to
# that for the training documents, where D is the number of
# dev documents, and V is the number of types in the training
# documents.
dev_counts = count_vectorizer.transform(dev_texts)

# Predict the class for each dev document. 
lr_dev_predictions = lr_classifier.predict(dev_counts)
print_results(dev_labels, lr_dev_predictions)
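
As a rough picture of the DxV count matrices described in the comments above, the following sketch builds them by hand for a few toy documents. The helper names `build_vocab` and `count_matrix` and the toy documents are hypothetical. Unlike this sketch, sklearn returns a sparse matrix and may order the columns differently.

```python
def build_vocab(tokenized_docs):
    """Map each type seen in the training documents to a column index."""
    vocab = {}
    for tokens in tokenized_docs:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def count_matrix(tokenized_docs, vocab):
    """Return a D x V list of lists holding per-document type counts."""
    matrix = []
    for tokens in tokenized_docs:
        row = [0] * len(vocab)
        for tok in tokens:
            if tok in vocab:  # types unseen in training have no column
                row[vocab[tok]] += 1
        matrix.append(row)
    return matrix


toy_train_docs = [["climate", "risk", "risk"], ["climate", "opportunity"]]
toy_dev_docs = [["risk", "flood"]]

toy_vocab = build_vocab(toy_train_docs)
toy_train_counts = count_matrix(toy_train_docs, toy_vocab)
toy_dev_counts = count_matrix(toy_dev_docs, toy_vocab)
print(toy_train_counts)
print(toy_dev_counts)
```

Note that "flood" is dropped from the dev matrix because it never occurs in the training documents, which is why V is always the number of types in the training data.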

## Multinomial Naive Bayes (6 marks)

Implement multinomial naive Bayes, i.e., as described in Appendix B.1-B.3 of the textbook. Follow the structure of the starter code provided. Read the comments in the code, and the description below, to understand how it is intended to work.

- The constructor, `__init__`, should train the classifier on the provided training data. Similarly to the constructor for `Baseline` above, you might find it useful to have the constructor call helper methods (i.e., `train` in the case of `Baseline`).

- `classify` predicts the class for a provided test instance. Be sure to compute probabilities in log space to avoid underflow errors
  from multiplying many probabilities.

- **Optional:** It can be very helpful to ensure that your probability distributions are actually
  probability distributions! You can do this by adding `assert` statements to `sanity_check` to make
  sure that all the probabilities you estimate are between 0 and 1,
  and that the distributions you estimate sum to 1, e.g., $\sum_{w \in
  V}P(w|c) = 1$.

  You can add further simple checks to your code. For example, `__init__` already checks that the number of training documents and the number of labels are the same. You might also want to make sure that all the classes you output are `0`, `1`, or `2`.
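
The two points above about log space and sanity checks can be illustrated with a small sketch. It is not part of the required implementation: the helper `is_valid_distribution`, the dictionary representation of a distribution, and the toy numbers are all assumptions made for illustration.

```python
import math

# Multiplying many small probabilities underflows to exactly 0.0 ...
toy_probs = [1e-5] * 100
product = 1.0
for p in toy_probs:
    product *= p

# ... whereas summing their logs gives a usable (very negative) score.
log_score = sum(math.log(p) for p in toy_probs)
print(product, log_score)


def is_valid_distribution(dist, tol=1e-9):
    """Check that dist (outcome -> probability) is a probability distribution."""
    in_range = all(0.0 <= p <= 1.0 for p in dist.values())
    sums_to_one = math.isclose(sum(dist.values()), 1.0, abs_tol=tol)
    return in_range and sums_to_one


# A made-up distribution over a three-word vocabulary.
toy_dist = {"climate": 0.5, "risk": 0.3, "opportunity": 0.2}
print(is_valid_distribution(toy_dist))
```

Comparing with a small tolerance rather than `== 1` matters because floating-point sums are rarely exact.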


In [None]:
import math

class NB:

    # TODO: Complete this class

    def __init__(self, 
                 documents,
                 labels,
                 binary=False):
        # documents is a list of training documents. You will need to tokenize each document.
        # labels is a list of corresponding class labels for the training documents (i.e.,
        # labels[i] is the class label for documents[i]).
        # binary indicates whether to use binary multinomial naive Bayes (which you will
        # implement in the next step) or (regular) multinomial naive Bayes. You can ignore
        # it when you first implement multinomial naive Bayes, and then use it to implement
        # binary multinomial naive Bayes after.
        assert len(documents) == len(labels)
        self.binary = binary
         
    def sanity_check(self):
        # You might want to add some checks here to check that, for example,
        # you've estimated valid probability distributions
        assert True
    
    def classify(self, test_instance):
        # test_instance is an instance to classify. 
        # Return the predicted class: must be one of 0, 1, or 2
        tokens = word_tokenize(test_instance)
        return 0


In [None]:
nb_classifier = NB(train_texts, train_labels)
nb_classifier.sanity_check()
nb_dev_predictions = [nb_classifier.classify(x) for x in dev_texts]
print_results(dev_labels, nb_dev_predictions)

## Binary Multinomial Naive Bayes (1 mark)

Also implement binary multinomial naive Bayes (i.e., multinomial naive Bayes with binary features). Recall
that in this model, the frequency of a given word in a given document
is either 0 (if the word does not occur in the document) or 1 (if the
word occurs 1 or more times in the document). Repeated occurrences of
a word are ignored. This model is discussed in Appendix B.4 of the textbook.

Implement this model by adding functionality to the class `NB` for when the value `True` is passed to the constructor for the parameter `binary`. **Hint:** This should be a very small amount of additional code. (It's only worth 1 mark!) If you're writing lots of code, you are likely off track.
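
To see what "repeated occurrences are ignored" means for a single document, the sketch below compares ordinary and binary counts for a made-up token list. It shows one possible way to obtain binary counts, and how this fits into `NB` is left to your implementation.

```python
from collections import Counter

# A made-up tokenized document in which "risk" occurs three times.
toy_tokens = ["risk", "climate", "risk", "risk", "flood"]

# Ordinary multinomial counts: every occurrence is counted.
multinomial_counts = Counter(toy_tokens)

# Binary counts: each type counts at most once per document.
binary_counts = Counter(set(toy_tokens))

print(multinomial_counts)
print(binary_counts)
```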

In [None]:
nb_bin_classifier = NB(train_texts, train_labels, binary=True)
nb_bin_classifier.sanity_check()
nb_bin_dev_predictions = [nb_bin_classifier.classify(x) for x in dev_texts]
print_results(dev_labels, nb_bin_dev_predictions)

## Test Data

So far, we've only evaluated on the development data. Once you have completed the tasks above (i.e., your implementation of the class `NB`), run the code below to evaluate the classifiers on the test data.

In [None]:
test_texts,test_labels = get_texts_and_labels('data/test.csv')

print("Test results:")
print()

print("Baseline:")
baseline_test_predictions = [baseline_classifier.classify(x) for x in test_texts]
print_results(test_labels, baseline_test_predictions)

print("Logistic Regression")
test_counts = count_vectorizer.transform(test_texts)
lr_test_predictions = lr_classifier.predict(test_counts)
print_results(test_labels, lr_test_predictions)

print("Multinomial Naive Bayes:")
nb_test_predictions = [nb_classifier.classify(x) for x in test_texts]
print_results(test_labels, nb_test_predictions)

print("Binary Multinomial Naive Bayes:")
nb_bin_test_predictions = [nb_bin_classifier.classify(x) for x in test_texts]
print_results(test_labels, nb_bin_test_predictions)


## Report (3 marks)

Write a brief report describing the results of the various methods for sentiment analysis considered in this assignment. Address at least the following in your report:

1. Does each of logistic regression, multinomial naive Bayes, and binary multinomial naive Bayes outperform the baseline?

2. Does logistic regression outperform multinomial naive Bayes?

3. Which of multinomial naive Bayes and binary multinomial naive Bayes performs better?

4. Which method performs best? Is the relative performance of methods consistent across the dev and test data?

5. Carry out some error analysis to attempt to explain what causes the difference in performance between binary multinomial naive Bayes and logistic regression on the test data. For this, you might find it helpful to examine the per-class P, R, and F values, or a confusion matrix. You can do this using `sklearn.metrics.precision_recall_fscore_support` and `sklearn.metrics.confusion_matrix`. You can read the documentation for these functions here:
  
    https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
   
    https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

    I've also included some examples of how to use them below:

**TODO** Write your report as markdown in this cell

In [None]:
# TODO If you need to write extra code to answer question 5 (e.g., to determine 
# per-class P, R, and F values or to create a confusion matrix) you can put that 
# code here

In [None]:
# Compute a confusion matrix for binary NB on the test data. 
# Note that I've transposed the confusion matrix so that 
# the rows correspond to system predictions and columns correspond to
# gold standard labels, following the convention in the textbook 
# and class. (By default, sklearn does it the other way around.)
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

from sklearn.metrics import confusion_matrix
print(confusion_matrix(test_labels, nb_bin_test_predictions, labels=[0, 1, 2]).transpose())

In [None]:
# Calculate P, R, F for each class for binary NB on the test data.

for label in range(3):
    print(label)
    # This prints P, R, F, and support (i.e., number of gold standard instances) for each class
    print(precision_recall_fscore_support(test_labels, nb_bin_test_predictions, labels=[label]))


## What to submit

When you're done, submit this file to the assignment 2 dropbox on D2L. (You don't need to submit any of the data files we provided you with for this assignment.)

## Grading

Your assignments will be graded based primarily on the correctness of their implementation and the written answers in the report.

Assignments that do not conform to the specifications outlined above (e.g., that modify parts of the starter code that you were not asked to modify) might not be graded. Assignments that we are unable to run in a reasonable amount of time (less than one minute) also might not be graded. Grades will be out of 10 and broken down as follows:

- Multinomial NB: 6

- Binary multinomial NB: 1

- Report / discussion: 3