# CS4765/6765 Assignment 2: Sentiment Analysis for climate-related text in corporate disclosures

We've seen the problem of sentiment analysis in class. In this assignment you will write your own sentiment analysis system. You will implement two variants of a naive Bayes classifier. The starter code further provides a most-frequent class baseline and logistic regression. You will then compare the various approaches.

In this assignment we will consider a three-way sentiment analysis task for climate-related text in corporate disclosures.

Read through this notebook in its entirety before getting started. You should only make changes / write code in parts of the notebook where the instructions ask you to do so. These are indicated with TODO throughout.


## Data

I've provided you with the following files for this assignment: `train-sample.csv`, `dev.csv`, `test.csv`.

Each of these files is a CSV file. You will use `train-sample.csv` for training models, `dev.csv` for evaluation during model development and for model refinement, and `test.csv` for final evaluation.

Each row in each file is an instance consisting of a text, and a label. (Each row also contains an id number, which we won't use in this assignment.) Each text is a climate-related paragraph from an annual report from a company. Each label is 0, 1 or 2. The labels have the following meanings:

- 2: Opportunity arising due to climate change (positive sentiment)

- 1: Neutral

- 0: Risk or threat that negatively impacts an entity (negative sentiment)

`train-sample.csv`, `dev.csv`, and `test.csv` contain 800, 200, and 320 instances, respectively.

The starter code handles reading these files for you. It also takes care of tokenizing the texts. Do not modify these parts of starter code. You should only modify the starter code where you are instructed to (indicated with TODO).

The data for this assignment is from the [ClimateBERT climate sentiment dataset](https://huggingface.co/datasets/climatebert/climate_sentiment). The license is [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). You are of course welcome to read more about this dataset, but it is not necessary to do so to be able to do this assignment. To create the data for this assignment, I converted the original dataset files from parquet to CSV. The original dataset does not include a development set. I randomly split the training data into 80% training and 20% dev to create `train-sample.csv` and `dev.csv`. `test.csv` is the same test data as the original dataset (but formatted as CSV instead of parquet).
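
As a rough illustration of the 80%/20% random split described above, here is a minimal sketch using only the standard library. It is not the procedure actually used to create the assignment files: the function name `split_train_dev`, the seed, and the placeholder rows are all assumptions made for the example.

```python
import random

def split_train_dev(rows, dev_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and split off a dev portion (hypothetical helper)."""
    rows = list(rows)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)  # seeded shuffle so the split is repeatable
    n_dev = int(len(rows) * dev_fraction)
    return rows[n_dev:], rows[:n_dev]  # (train, dev)

# Placeholder rows standing in for 1000 (id, text, label) training instances.
rows = [(i, f"text {i}", i % 3) for i in range(1000)]
train_rows, dev_rows = split_train_dev(rows)
print(len(train_rows), len(dev_rows))
```

With 1000 rows this yields 800 training and 200 dev instances, matching the file sizes listed above.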





## Models

In this assignment you will implement two different approaches to sentiment analysis: multinomial naive Bayes and binary multinomial naive Bayes. The starter code further includes a most-frequent class baseline and logistic regression. Further details on the models are provided below.

## Implementation

Your code must be able to run in the virtual environment on the NLP VM on the lab machines. You must not use NLTK or any other NLP toolkits. You should not import any modules that this notebook does not already import for you. Although the starter code uses an implementation of logistic regression from scikit-learn (sklearn), you must not use scikit-learn for any of the code that you write in this assignment.

## Experimental Setup

Throughout this assignment, we will always use the training data (`train-sample.csv`) for training models. We will implement our models and then use the development data (`dev.csv`) for preliminary evaluation. Once we've completed this, we will do our final evaluation on the test data (`test.csv`). The starter code guides you through this process.


In [76]:
import re

# A simple tokenizer based on the word tokenizer from sklearn.
# It applies case folding.
word_tokenize_pattern = re.compile(r"(?u)\b\w\w+\b")
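def word_tokenize(s, apply_case_folding=True):
    tokens = word_tokenize_pattern.findall(s)
    # Lowercase the tokens unless case folding is explicitly disabled.
    return [x.lower() for x in tokens] if apply_case_folding else tokens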

In [77]:
# An example to demonstrate the tokenization
word_tokenize('''This is a sentence. Here is another one! This is to show, just as an example, how "word_tokenize" works.''')

['this',
 'is',
 'sentence',
 'here',
 'is',
 'another',
 'one',
 'this',
 'is',
 'to',
 'show',
 'just',
 'as',
 'an',
 'example',
 'how',
 'word_tokenize',
 'works']

In [78]:
# This cell was written to check why the tokenizer leaves out single-letter tokens.
# I asked the professor, who told me not to change anything.
#text = 'What a Beautiful day to be alive'
#word_tokenize(text)
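
The tokenization example above contains no token for "a", which is the behaviour the commented-out cell was probing. A minimal sketch of why: the pattern `\w\w+` requires at least two word characters. The second pattern, `one_plus`, is a hypothetical variant shown only for contrast; it is not part of the assignment and the tokenizer should not be changed.

```python
import re

# Same pattern as word_tokenize above: \w\w+ needs at least two word characters.
two_plus = re.compile(r"(?u)\b\w\w+\b")
# Hypothetical variant (not used in the assignment) that also keeps one-character tokens.
one_plus = re.compile(r"(?u)\b\w+\b")

text = "What a Beautiful day to be alive"
tokens_two_plus = [t.lower() for t in two_plus.findall(text)]
tokens_one_plus = [t.lower() for t in one_plus.findall(text)]

print(tokens_two_plus)  # "a" is missing
print(tokens_one_plus)  # "a" is kept
```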

In [79]:
# Helper function to load a data file
import csv

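def get_texts_and_labels(fname):
    texts = []
    labels = []
    # Open the file in a with-block so it is closed after reading.
    with open(fname, newline='') as f:
        csv_reader = csv.reader(f)
        # Ignore header row
        next(csv_reader)
        for line in csv_reader:
            # Each row is (id, text, label); the id is not used.
            _, text, label = line
            texts.append(text)
            labels.append(int(label))
    return texts, labels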

In [80]:
# Load the training and dev data
train_texts, train_labels = get_texts_and_labels('data/train-sample.csv')
dev_texts, dev_labels = get_texts_and_labels('data/dev.csv')

# Some sanity checks
assert len(train_texts) == len(train_labels)
assert len(dev_texts) == len(dev_labels)

for label in train_labels + dev_labels:
    assert 0 <= label <= 2

In [81]:
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

# A helper function to print out macro averaged P,R, and F1 and accuracy.
# Uses implementations of evaluation metrics from sklearn.
def print_results(gold_labels, predicted_labels):
    p,r,f,_ = precision_recall_fscore_support(gold_labels,
                                              predicted_labels,
                                              average='macro',
                                              zero_division=0)
    acc = accuracy_score(gold_labels, predicted_labels)

    print("Precision: ", p)
    print("Recall: ", r)
    print("F1: ", f)
    print("Accuracy: ", acc)
    print()

## Most-frequent Class Baseline

The starter code below trains a most-frequent class baseline on the training data and evaluates it on the dev data. (Note that sklearn also includes an implementation of a most-frequent class baseline, which you might find useful for your projects: https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html)

In [82]:
# A most-frequent class baseline
class Baseline:
    def __init__(self, labels):
        self.train(labels)

    def train(self, labels):
        # Count classes to determine which is the most frequent
        label_freqs = {}
        for l in labels:
            label_freqs[l] = label_freqs.get(l, 0) + 1
        self.mfc = sorted(label_freqs, reverse=True,
                          key=lambda x : label_freqs[x])[0]

    def classify(self, test_instance):
        # Ignore the test instance and always return the most frequent class
        return self.mfc


In [83]:
baseline_classifier = Baseline(train_labels)
baseline_dev_predictions = [baseline_classifier.classify(x) for x in dev_texts]
print_results(dev_labels, baseline_dev_predictions)

Precision:  0.14166666666666666
Recall:  0.3333333333333333
F1:  0.19883040935672514
Accuracy:  0.425
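
The baseline numbers above can be reproduced by hand, which also shows why macro-averaged recall is exactly 1/3 for any constant predictor over three classes. The sketch below assumes the dev class counts 67, 85 and 48 (the per-class support values reported later in this notebook) and infers that the baseline predicts class 1, since the accuracy 0.425 equals 85/200.

```python
# Dev class counts, taken from the per-class "Support" values reported later.
gold_counts = {0: 67, 1: 85, 2: 48}
total = sum(gold_counts.values())
mfc = 1  # inferred: accuracy 0.425 = 85/200, so class 1 is always predicted

precisions, recalls, f1s = [], [], []
for c, n_gold in gold_counts.items():
    n_pred = total if c == mfc else 0    # every document is labelled mfc
    tp = n_gold if c == mfc else 0       # correct only for gold instances of mfc
    p = tp / n_pred if n_pred else 0.0   # same convention as zero_division=0
    r = tp / n_gold
    f = 2 * p * r / (p + r) if p + r else 0.0
    precisions.append(p)
    recalls.append(r)
    f1s.append(f)

# Macro averaging gives each class equal weight.
macro_p = sum(precisions) / len(gold_counts)
macro_r = sum(recalls) / len(gold_counts)
macro_f = sum(f1s) / len(gold_counts)
print(macro_p, macro_r, macro_f)
```

Two of the three classes contribute zero to every average, which is why the macro scores are so much lower than the accuracy.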



## Logistic Regression

The implementation below uses the scikit-learn (sklearn) Python module for logistic regression. Scikit-learn is a popular tool for doing many machine learning tasks in Python. It includes implementations of many classifiers (including naive Bayes, but we're implementing that ourselves in this assignment). Read the comments in the code below to learn how to use it.

In [84]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# sklearn provides functionality for tokenizing text and
# extracting features from it. This uses the word_tokenize function
# defined above for tokenization, as opposed to sklearn's
# default tokenization (although the two are very similar,
# except that by default sklearn does not apply case folding).
# Using exactly the same tokenization for logistic regression
# and for the NB models we will implement enables us to
# more easily compare results between the various methods.
# http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
count_vectorizer = CountVectorizer(analyzer=word_tokenize)

# train_counts will be a DxV matrix where D is the number of
# training documents and V is the number of types in the
# training documents. Each cell in the matrix indicates the
# frequency (count) of a type in a document.
train_counts = count_vectorizer.fit_transform(train_texts)

# Train a logistic regression classifier on the training
# data. A wide range of options are available. sklearn uses
# L2 regularization by default. The algorithm for minimizing
# the loss during training is LBFGS instead of SGD, which we
# saw in lecture.  The maximum number of iterations is set
# to 500 (max_iter=500) to allow the model to converge on
# this training data. The random_state is set to 0  (an
# arbitrarily chosen number) to help ensure results are
# consistent from run to run.
# http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
lr = LogisticRegression(max_iter=500,
                        random_state=0)
lr_classifier = lr.fit(train_counts, train_labels)

# Transform the dev documents into a DxV matrix, similar to
# that for the training documents, where D is the number of
# dev documents, and V is the number of types in the training
# documents.
dev_counts = count_vectorizer.transform(dev_texts)

# Predict the class for each dev document.
lr_dev_predictions = lr_classifier.predict(dev_counts)
print_results(dev_labels, lr_dev_predictions)

Precision:  0.7900114973711068
Recall:  0.7734038142620232
F1:  0.7801965694519533
Accuracy:  0.78
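
The comments in the cell above describe a D×V count matrix whose columns are fixed by the training data. The following plain-Python sketch illustrates that idea on a toy corpus. It is not how sklearn works internally (sklearn returns a sparse matrix), and the helper names `build_vocab` and `to_counts` and the example sentences are assumptions made for illustration.

```python
import re

pattern = re.compile(r"(?u)\b\w\w+\b")

def tokenize(s):
    return [t.lower() for t in pattern.findall(s)]

def build_vocab(docs):
    """Map each type seen in docs to a column index (alphabetical order)."""
    types = sorted({t for d in docs for t in tokenize(d)})
    return {t: j for j, t in enumerate(types)}

def to_counts(docs, vocab):
    """Return a dense D x V list of lists; types outside vocab are skipped."""
    matrix = []
    for d in docs:
        row = [0] * len(vocab)
        for t in tokenize(d):
            if t in vocab:
                row[vocab[t]] += 1
        matrix.append(row)
    return matrix

train_docs = ["Flood risk is rising", "New solar opportunity, solar growth"]
dev_docs = ["Solar risk and wind"]

vocab = build_vocab(train_docs)                # V = 8 training types
train_matrix = to_counts(train_docs, vocab)    # 2 x 8
dev_matrix = to_counts(dev_docs, vocab)        # 1 x 8; "and", "wind" are dropped
print(sorted(vocab), train_matrix, dev_matrix)
```

The dev row has the same width as the training rows because types that never occurred in training have no column to count into.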



## Multinomial Naive Bayes (6 marks)

Implement multinomial naive Bayes, i.e., as described in Appendix B.1-B.3 of the textbook. Follow the structure of the starter code provided. Read the comments in the code, and the description below, to understand how it is intended to work.

- The constructor, `__init__`, should train the classifier on the provided training data. Similarly to the constructor for `Baseline` above, you might find it useful to have the constructor call helper methods (i.e., `train` in the case of `Baseline`).

- `classify` predicts the class for a provided test instance. Be sure to compute probabilities in log space to avoid underflow errors
  from multiplying many probabilities.

- **Optional:** It can be very helpful to ensure that your probability distributions are actually probability distributions! You can do this by adding `assert` statements to `sanity_check` to make sure that all the probabilities you estimate are between 0 and 1, and that the distributions you estimate sum to 1, e.g., $\sum_{w \in V}P(w|c) = 1$.

  You can add further simple checks to your code; for example, `__init__` could check that the number of training documents and the number of labels are the same. You might also want to make sure that all the classes you output are `0`, `1`, or `2`.
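
To make the two points above concrete, here is a small sketch on made-up counts. It uses the add-one estimate $\hat{P}(w|c) = \frac{\mathrm{count}(w,c) + 1}{\sum_{w' \in V}\mathrm{count}(w',c) + |V|}$, checks that each smoothed distribution sums to 1, and shows why scores are accumulated as sums of logs. The toy counts, the vocabulary, and the name `smoothed_prob` are assumptions for illustration only.

```python
import math

# Hypothetical toy counts: count(w, c) for two classes over a 3-type vocabulary.
counts = {0: {"risk": 3, "flood": 1}, 2: {"opportunity": 2, "risk": 1}}
vocab = {"risk", "flood", "opportunity"}

def smoothed_prob(word, c):
    """Add-one (Laplace) estimate of P(word | c)."""
    total = sum(counts[c].values())
    return (counts[c].get(word, 0) + 1) / (total + len(vocab))

# Each smoothed distribution sums to 1 over the vocabulary (up to rounding).
sums = {c: sum(smoothed_prob(w, c) for w in vocab) for c in counts}

# Multiplying many small probabilities underflows to 0.0 ...
product = 1.0
for _ in range(500):
    product *= smoothed_prob("flood", 2)

# ... while the equivalent sum of logs stays finite and comparable.
log_sum = 500 * math.log(smoothed_prob("flood", 2))
print(sums, product, log_sum)
```

Once every class score has underflowed to 0.0 the classes can no longer be ranked, whereas the log scores remain distinct.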


In [85]:
import math

class NB:

    # TODO: Complete this class

    def __init__(self,
                 documents,
                 labels,
                 binary=False):
        # documents is a list of training documents. You will need to tokenize each document
        # labels is a list of corresponding class labels for the training documents (i.e.,
        # labels[i] is the class label for documents[i]).
        # binary indicates whether to use binary multinomial naive Bayes (which you will
        # implement in the next step) or (regular) multinomial naive Bayes. You can ignore
        # it when you first implement multinomial naive Bayes, and then use it to implement
        # binary multinomial naive Bayes after.
        #sanity check
        assert len(documents) == len(labels)
        #all the initialization
        self.binary = binary
        self.counts_of_words = {}
        self.total_documents_in_each_class = {}
        self.vocab = set()
        self.train(documents, labels)

    def train(self, documents, labels):
        # Every document has exactly one label (checked in __init__),
        # so iterate over both lists together.
        for d, l in zip(documents, labels):
            self.total_documents_in_each_class[l] = self.total_documents_in_each_class.get(l, 0) + 1

            # Create the per-class word-count dictionary the first time a
            # class is seen. (collections.defaultdict would also work, but
            # extra imports are not permitted in this assignment.)
            if l not in self.counts_of_words:
                self.counts_of_words[l] = {}

            # Tokenize before looping over the words.
            tokens = word_tokenize(d)
            # When binary=True (binary multinomial NB), count each type at
            # most once per document.
            if self.binary:
                tokens = set(tokens)

            for t in tokens:
                self.counts_of_words[l][t] = self.counts_of_words[l].get(t, 0) + 1

        # The vocabulary is the union of the types seen in any class.
        for words in self.counts_of_words.values():
            self.vocab.update(words.keys())

    def sanity_check(self):
        # You might want to add some checks here to check that, for example,
        # you've estimated valid probability distributions

        #Checking if all the prior sum ==1
        total_docs = sum(self.total_documents_in_each_class.values())
        class_priors = {c: self.total_documents_in_each_class[c] / total_docs
                    for c in self.total_documents_in_each_class}

        prior_sum = sum(class_priors.values())
        assert abs(prior_sum - 1.0) < 1e-6, f"Class priors do not sum to 1 (sum={prior_sum})"
        print(f"Class priors sum to 1 (sum={prior_sum})")

        #checking if each class’s word probability distribution (after Laplace smoothing) sums to 1.
        vocab_size = len(self.vocab)
        for c in self.counts_of_words.keys():
            total_words_in_class = sum(self.counts_of_words[c].values())
            probs_sum = 0.0
            for w in self.vocab:
                word_count = self.counts_of_words[c].get(w, 0)
                p = (word_count + 1) / (total_words_in_class + vocab_size)
                probs_sum += p

            assert abs(probs_sum - 1.0) < 1e-6, \
                f"Word probabilities for class '{c}' sum to {probs_sum}, not 1"
            print(f"Word probabilities for class '{c}' sum to 1 (sum={probs_sum})")

        print("All sanity checks passed!")

    def classify(self, test_instance):
        # test_instance is an instance to classify.
        # Return the predicted class: must be one of 0, 1, or 2
        tokens = word_tokenize(test_instance)
        # When binary=True, consider only unique tokens (presence/absence)
        if self.binary:
            tokens = set(tokens)
        prediction = {}
        total_docs = sum(self.total_documents_in_each_class.values())
        vocab_size = len(self.vocab)
        for c in self.total_documents_in_each_class:
            prior = self.total_documents_in_each_class[c] / total_docs
            total_words_in_class = sum(self.counts_of_words[c].values())
            # Work in log space to avoid underflow.
            log_prob = math.log(prior)
            for t in tokens:
                # Tokens not seen in training get count 0 here and are still
                # scored with add-one smoothing rather than being skipped.
                word_count = self.counts_of_words[c].get(t, 0)
                prob = (word_count + 1) / (total_words_in_class + vocab_size)
                log_prob += math.log(prob)

            prediction[c] = log_prob

        return max(prediction, key=prediction.get)


In [86]:
nb_classifier = NB(train_texts, train_labels)
nb_classifier.sanity_check()
nb_dev_predictions = [nb_classifier.classify(x) for x in dev_texts]
print_results(dev_labels, nb_dev_predictions)

Class priors sum to 1 (sum=1.0)
Word probabilities for class '2' sum to 1 (sum=0.9999999999999388)
Word probabilities for class '0' sum to 1 (sum=1.0000000000000493)
Word probabilities for class '1' sum to 1 (sum=0.9999999999999967)
All sanity checks passed!
Precision:  0.7541887125220459
Recall:  0.7648339186420837
F1:  0.7576305220883534
Accuracy:  0.755
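
The sanity-check output above reports sums such as 0.9999999999999388 rather than exactly 1. That is ordinary floating-point rounding from adding many probabilities, which is why the check compares against 1 with a tolerance. A minimal standard-library illustration (the list of ten 0.1 values is just an example):

```python
import math

values = [0.1] * 10             # ten probabilities that should sum to 1
naive_sum = sum(values)         # left-to-right float addition accumulates rounding error
exact_sum = math.fsum(values)   # fsum compensates for the lost low-order bits

print(naive_sum == 1.0)                             # False
print(exact_sum == 1.0)                             # True
print(math.isclose(naive_sum, 1.0, abs_tol=1e-6))   # True
```

Comparing with a tolerance, as `sanity_check` does with `abs(probs_sum - 1.0) < 1e-6`, avoids spurious assertion failures.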



## Binary Multinomial Naive Bayes (1 mark)

Also implement binary multinomial naive Bayes (i.e., multinomial naive Bayes with binary features). Recall that in this model, the frequency of a given word in a given document is either 0 (if the word does not occur in the document) or 1 (if the word occurs 1 or more times in the document). Repeated occurrences of a word are ignored. This model is discussed in Appendix B.4 of the textbook.

Implement this model by adding functionality to the class `NB` for when the value `True` is passed to the constructor for the parameter `binary`. **Hint:** This should be a very small amount of additional code. (It's only worth 1 mark!) If you're writing lots of code, you are likely off track.
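
As a small illustration of what "repeated occurrences are ignored" means for a single document, the sketch below counts the same token list with and without clipping. The helper name `count_tokens` and the token list are made up for the example; it uses a plain dictionary to stay within the assignment's no-extra-imports rule.

```python
def count_tokens(tokens, binary=False):
    """Count tokens in one document; with binary=True each type counts at most once."""
    if binary:
        tokens = set(tokens)   # drop repeated occurrences within the document
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return counts

tokens = ["risk", "flood", "risk", "risk", "storm"]
raw_counts = count_tokens(tokens)
binary_counts = count_tokens(tokens, binary=True)
print(raw_counts)
print(binary_counts)
```

In the binary variant "risk" contributes 1 rather than 3, so a class's count for a word becomes the number of documents in that class containing it.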

In [87]:
nb_bin_classifier = NB(train_texts, train_labels, binary=True)
nb_bin_classifier.sanity_check()
nb_bin_dev_predictions = [nb_bin_classifier.classify(x) for x in dev_texts]
print_results(dev_labels, nb_bin_dev_predictions)

Class priors sum to 1 (sum=1.0)
Word probabilities for class '2' sum to 1 (sum=0.9999999999999148)
Word probabilities for class '0' sum to 1 (sum=1.0000000000000082)
Word probabilities for class '1' sum to 1 (sum=0.999999999999959)
All sanity checks passed!
Precision:  0.7509100309515325
Recall:  0.7628816700809677
F1:  0.7544819366057808
Accuracy:  0.75



## Test Data

So far, we've only evaluated on the development data. Once you have completed the tasks above (i.e., your implementation of the class `NB`), run the code below to evaluate the classifiers on the test data.

In [88]:
test_texts,test_labels = get_texts_and_labels('data/test.csv')

print("Test results:")
print()

print("Baseline:")
baseline_test_predictions = [baseline_classifier.classify(x) for x in test_texts]
print_results(test_labels, baseline_test_predictions)

print("Logistic Regression")
test_counts = count_vectorizer.transform(test_texts)
lr_test_predictions = lr_classifier.predict(test_counts)
print_results(test_labels, lr_test_predictions)

print("Multinomial Naive Bayes:")
nb_test_predictions = [nb_classifier.classify(x) for x in test_texts]
print_results(test_labels, nb_test_predictions)

print("Binary Multinomial Naive Bayes:")
nb_bin_test_predictions = [nb_bin_classifier.classify(x) for x in test_texts]
print_results(test_labels, nb_bin_test_predictions)


Test results:

Baseline:
Precision:  0.16979166666666667
Recall:  0.3333333333333333
F1:  0.22498274672187715
Accuracy:  0.509375

Logistic Regression
Precision:  0.6405949791868577
Recall:  0.6391179383355766
F1:  0.6397733310398869
Accuracy:  0.684375

Multinomial Naive Bayes:
Precision:  0.7011079918103196
Recall:  0.7250513895414245
F1:  0.709793342224737
Accuracy:  0.740625

Binary Multinomial Naive Bayes:
Precision:  0.7208921022480345
Recall:  0.7452035797534663
F1:  0.7300925925925926
Accuracy:  0.759375



## Report (3 marks)

Write a brief report describing the results of the various methods for sentiment analysis considered in this assignment. Address at least the following in your report:

1. Does each of logistic regression, multinomial naive Bayes, and binary multinomial naive Bayes outperform the baseline?

1. Does logistic regression outperform multinomial naive Bayes?

1. Which of multinomial naive Bayes and binary multinomial naive Bayes performs better?

1. Which method performs best? Is the relative performance of methods consistent across the dev and test data?

1. Carry out some error analysis to attempt to explain what causes the difference in performance between binary multinomial naive Bayes and logistic regression on the test data. For this, you might find it helpful to examine the per-class P, R, and F values, or a confusion matrix. You can do this using `sklearn.metrics.precision_recall_fscore_support` and `sklearn.metrics.confusion_matrix`. You can read the documentation for these functions here:
  
    https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
   
    https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

    I've also included some examples of how to use them below:

# Sentiment Analysis Report

### **1) Does each of Logistic Regression, Multinomial Naive Bayes, and Binary Multinomial Naive Bayes outperform the baseline?**

Yes, all three models outperform the **Most Frequent Class (MFC)** baseline by a wide margin on both the development and test data.

| Model | Dev Accuracy | Test Accuracy |
|--------|---------------|----------------|
| Baseline (MFC) | 0.425 | 0.5094 |
| Multinomial Naive Bayes | 0.755 | 0.7406 |
| Binary Multinomial Naive Bayes | 0.75 | 0.7594 |
| Logistic Regression | 0.78 | 0.6844 |


---

### **2) Does Logistic Regression outperform Multinomial Naive Bayes?**

The result varies. On the **development** data, **Logistic Regression** achieves the highest scores on every metric. On the **test** data, however, **Multinomial Naive Bayes** outperforms **Logistic Regression** on every metric.

| Metric | Multinomial NB (Dev) | Logistic Regression (Dev) | Multinomial NB (Test) | Logistic Regression (Test) |
|---------|----------------------|----------------------------|------------------------|-----------------------------|
| **Precision** | 0.7542 | 0.7900 | 0.7011 | 0.6406 |
| **Recall** | 0.7648 | 0.7734 | 0.7251 | 0.6391 |
| **F1-score** | 0.7576 | 0.7802 | 0.7098 | 0.6398 |
| **Accuracy** | 0.7550 | 0.7800 | 0.7406 | 0.6844 |

---

### **3) Which of Multinomial Naive Bayes and Binary Multinomial Naive Bayes performs better?**

On the **development** data, **Multinomial Naive Bayes** performs marginally better than **Binary Multinomial Naive Bayes** (accuracy 0.755 vs. 0.75). On the **test** data, however, **Binary Multinomial Naive Bayes** outperforms **Multinomial Naive Bayes** by almost 2 percentage points in accuracy, as shown below:

| Metric | Multinomial NB (Dev) | Binary Multinomial NB (Dev) | Multinomial NB (Test) | Binary Multinomial NB (Test) |
|---------|----------------------|----------------------------|------------------------|-----------------------------|
| **Precision** | 0.7542 | 0.7509 | 0.7011 | 0.7209 |
| **Recall** | 0.7648 | 0.7629 | 0.7251 | 0.7452 |
| **F1-score** | 0.7576 | 0.7545 | 0.7098 | 0.7301 |
| **Accuracy** | 0.7550 | 0.7500 | 0.7406 | 0.7594 |
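
To put these accuracy differences in perspective, it can help to convert them into numbers of documents. The sketch below assumes 200 dev instances and 320 test instances (the sum of the per-class test support values, 106 + 163 + 51) and uses the unrounded accuracies printed earlier in the notebook.

```python
n_dev, n_test = 200, 320   # test size inferred from per-class support values

accuracy = {
    "Multinomial NB": {"dev": 0.755, "test": 0.740625},
    "Binary Multinomial NB": {"dev": 0.75, "test": 0.759375},
}

# Convert each accuracy into a count of correctly classified documents.
correct = {
    name: {"dev": round(acc["dev"] * n_dev), "test": round(acc["test"] * n_test)}
    for name, acc in accuracy.items()
}

dev_diff = correct["Multinomial NB"]["dev"] - correct["Binary Multinomial NB"]["dev"]
test_diff = correct["Binary Multinomial NB"]["test"] - correct["Multinomial NB"]["test"]
print(correct, dev_diff, test_diff)
```

Under these assumptions the dev difference is a single document and the test difference is six documents, so neither gap should be over-interpreted.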


---

### **4) Which method performs best? Is the relative performance consistent across the dev and test data?**

To decide which model performs best, we first need to fix the criterion: it could be accuracy, F1, or some other score. In my view, the best model is the most robust one, i.e., the one whose performance is consistent across the dev and test data. Although **Logistic Regression** performs best on the **development** data, **Binary Multinomial Naive Bayes** achieves the best scores on the **test** data, so the relative performance of the methods is not consistent across the two sets. The rankings by accuracy are:

**Dev data:** **Logistic Regression > Multinomial NB > Binary Multinomial NB > Baseline**

**Test data:** **Binary Multinomial NB > Multinomial NB > Logistic Regression > Baseline**

| Model | Dev Accuracy | Test Accuracy |
|--------|---------------|----------------|
| Logistic Regression | 0.78 | 0.6844 |
| Multinomial NB | 0.755 | 0.7406 |
| Binary Multinomial NB | 0.75 | 0.7594 |
| Baseline | 0.425 | 0.5094 |

**Binary Multinomial Naive Bayes** has nearly the same accuracy on both datasets (0.75 on dev, 0.7594 on test), which suggests that it generalizes better than the other models. **Multinomial Naive Bayes** is the second most robust, while **Logistic Regression** drops by almost 10 percentage points from dev to test. Taking both accuracy and dev/test consistency into account, the ranking becomes:

**Binary Multinomial NB > Multinomial NB > Logistic Regression > Baseline**
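
One simple way to quantify the consistency discussed above is the change in accuracy from dev to test. The sketch below uses the unrounded accuracies printed earlier; the choice of the absolute gap as a stability measure is an assumption of this example, not something prescribed by the assignment.

```python
# (dev accuracy, test accuracy) as printed earlier in the notebook
accuracy = {
    "Logistic Regression": (0.78, 0.684375),
    "Multinomial NB": (0.755, 0.740625),
    "Binary Multinomial NB": (0.75, 0.759375),
    "Baseline": (0.425, 0.509375),
}

# Signed change from dev to test; negative means the model got worse on test.
gaps = {name: test - dev for name, (dev, test) in accuracy.items()}

# Models ordered from smallest to largest absolute change.
by_stability = sorted(gaps, key=lambda name: abs(gaps[name]))

for name in by_stability:
    print(f"{name}: {gaps[name]:+.4f}")
```

By this measure alone the baseline changes less than logistic regression. Its accuracy is simply the share of the most frequent class in each dataset, so its gap reflects the different label distributions of dev and test, which is why the ranking above also takes absolute accuracy into account.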

---

### **5) Error Analysis: Binary MNB vs Logistic Regression**

I carried out the following analysis on both the **dev** and **test** data:

1. Computed confusion matrices for Binary Multinomial Naive Bayes and Logistic Regression.

2. Computed per-class precision, recall, and F1-scores using the scikit-learn library.

3. Inspected misclassified documents to determine whether longer or shorter documents were more prone to errors.

From these three analyses I made the following observations:
1. Class 2 is the hardest class on the test data. For example, Binary Multinomial NB reaches F1 = 0.600 on Class 2, compared with 0.813 on Class 0 and 0.778 on Class 1. On the dev data the pattern is less clear: there Binary Multinomial NB scores higher on Class 2 (F1 = 0.762) than on Class 1 (F1 = 0.711).

2. The confusion matrices indicate that most errors involve Class 1, which is confused with both Class 0 and Class 2; direct confusion between Class 0 and Class 2 is rare.

3. Most of the documents misclassified by all models contain fewer than 100 words.

Overall, the varying performance of the models appears to stem from a combination of class difficulty, document length, and model characteristics:

1. **Class-specific difficulty:** Class 2 is the hardest class on the test data, likely because it contains fewer distinctive keywords or overlaps more with Class 1. It is also the smallest class (51 of the 320 test documents).

2. **Document length effects:** Many of the commonly misclassified documents are short (fewer than 100 words), which gives the models less evidence to work with. Logistic Regression may benefit more from longer texts because its weights are learned jointly and so account for correlated words, whereas Naive Bayes treats each word as independent evidence given the class.

3. **Model characteristics:**
   - Naive Bayes (Binary & Multinomial): Performs well for classes with strong keyword signals (Class 0 and Class 1) but struggles with nuanced content in Class 2 or short texts.
   - Logistic Regression: Learns feature weights jointly rather than assuming independence, giving it higher F1 for Class 0 and Class 1 on the dev set, but it has lower recall on Class 2 there (35 of 48, versus 40 of 48 for Binary Multinomial NB).

Overall, the differences appear to arise from the interplay between text length, class-specific ambiguity, and how each model weighs word-level evidence.

In [89]:
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

classes = [0, 1, 2]

#defining function for printing confusion matrices
def print_confusion_matrices(labels, predictions_dict):
    for name, preds in predictions_dict.items():
        cm = confusion_matrix(labels, preds, labels=classes).transpose()
        print(f"\n{name} Confusion Matrix (predicted x true):")
        print("Rows = System Labels (Predicted), Columns = Gold Labels (True)")
        print(cm)

#defining function for printing class-wise PRF
def print_prf(labels, predictions_dict):
    for name, preds in predictions_dict.items():
        print(f"\n{name} - per class metrics:")
        # One call returns arrays of P, R, F and support, ordered as in `classes`
        p, r, f, s = precision_recall_fscore_support(labels, preds, labels=classes)
        for i, label in enumerate(classes):
            print(f"Class {label}: Precision={p[i]:.3f}, Recall={r[i]:.3f}, F1={f[i]:.3f}, Support={s[i]}")

#defining function for analyzing the effect of document length/word count
def analyze_documents(texts, labels, predictions_dict, top_n=10):
    # Word counts
    word_counts = [(i, len(text.split()), text) for i, text in enumerate(texts)]
    word_counts_sorted = sorted(word_counts, key=lambda x: x[1], reverse=True)

    print("\nLongest documents:")
    for i, count, text in word_counts_sorted[:top_n]:
        print(f"Index: {i}, Words: {count}, Text: {text}")

    # Misclassified documents
    misclassified = {}
    for name, preds in predictions_dict.items():
        misclassified[name] = [(i, texts[i], labels[i], preds[i])
                               for i in range(len(texts)) if labels[i] != preds[i]]
        misclassified[name] = sorted(misclassified[name], key=lambda x: len(x[1].split()), reverse=True)

    for name, mis_docs in misclassified.items():
        print(f"\n{name} - misclassified documents (longest first):")
        for i, text, true, pred in mis_docs[:top_n]:
            print(f"True: {true}, Pred: {pred}, Words: {len(text.split())}, Text: {text}")

    # Correct predictions for first model (example: Binary MNB)
    first_model = list(predictions_dict.keys())[0]
    correct_docs = [(i, texts[i], labels[i], predictions_dict[first_model][i])
                    for i in range(len(texts)) if labels[i] == predictions_dict[first_model][i]]
    correct_docs_sorted = sorted(correct_docs, key=lambda x: len(x[1].split()), reverse=True)

    print(f"\n{first_model} - correctly predicted documents (highest to lowest word count):")
    for i, text, true, pred in correct_docs_sorted[:top_n]:
        print(f"Words: {len(text.split())}, True: {true}, Pred: {pred}, Text: {text}")

    # Common misclassified across all models
    mis_idx_sets = [set(i for i, _, _, _ in mis_docs) for mis_docs in misclassified.values()]
    common_mis_idx = set.intersection(*mis_idx_sets) if mis_idx_sets else set()

    print("\nCommon misclassified documents across all models:")
    for i in sorted(common_mis_idx):
        text = texts[i]
        preds_str = ', '.join([f"{name} Pred: {predictions_dict[name][i]}" for name in predictions_dict])
        print(f"Index: {i}, Words: {len(text.split())}, True: {labels[i]}, {preds_str}")

#for dev data:
dev_predictions = {
    "Binary Multinomial NB": nb_bin_dev_predictions,
    "Multinomial NB": nb_dev_predictions,
    "Logistic Regression": lr_dev_predictions
}

print("DEV Data Analysis:")
print_confusion_matrices(dev_labels, dev_predictions)
print_prf(dev_labels, dev_predictions)
analyze_documents(dev_texts, dev_labels, dev_predictions)

#for test data:
test_predictions = {
    "Binary Multinomial NB": nb_bin_test_predictions,
    "Multinomial NB": nb_test_predictions,
    "Logistic Regression": lr_test_predictions
}
print("Test Data Analysis:")
print_confusion_matrices(test_labels, test_predictions)
print_prf(test_labels, test_predictions)
analyze_documents(test_texts, test_labels, test_predictions)


DEV Data Analysis:

Binary Multinomial NB Confusion Matrix (predicted x true):
Rows = System Labels (Predicted), Columns = Gold Labels (True)
[[51 11  0]
 [14 59  8]
 [ 2 15 40]]

Multinomial NB Confusion Matrix (predicted x true):
Rows = System Labels (Predicted), Columns = Gold Labels (True)
[[52 11  0]
 [12 60  9]
 [ 3 14 39]]

Logistic Regression Confusion Matrix (predicted x true):
Rows = System Labels (Predicted), Columns = Gold Labels (True)
[[53  8  0]
 [13 68 13]
 [ 1  9 35]]

Binary Multinomial NB - per class metrics:
Class 0: Precision=0.823, Recall=0.761, F1=0.791, Support=67
Class 1: Precision=0.728, Recall=0.694, F1=0.711, Support=85
Class 2: Precision=0.702, Recall=0.833, F1=0.762, Support=48

Multinomial NB - per class metrics:
Class 0: Precision=0.825, Recall=0.776, F1=0.800, Support=67
Class 1: Precision=0.741, Recall=0.706, F1=0.723, Support=85
Class 2: Precision=0.696, Recall=0.812, F1=0.750, Support=48

Logistic Regression - per class metrics:
Class 0: Precision=0.

In [90]:
# Compute a confusion matrix for binary NB on the test data.
# Note that I've transposed the confusion matrix so that
# the rows correspond to system predictions and columns correspond to
# gold standard labels, following the convention in the textbook
# and class. (By default, sklearn does it the other way around.)
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

from sklearn.metrics import confusion_matrix
print(confusion_matrix(test_labels, nb_bin_test_predictions, labels=[0, 1, 2]).transpose())

[[ 91  23   4]
 [ 10 119  14]
 [  5  21  33]]


In [91]:
# Calculate P, R, F for each class for binary NB on the test data.

for label in list(range(3)):
    print(label)
    # This prints P, R, F, and support (i.e., number of gold standard instances) for each class
    print(precision_recall_fscore_support(test_labels, nb_bin_test_predictions, labels=[label]))


0
(array([0.77118644]), array([0.85849057]), array([0.8125]), array([106]))
1
(array([0.83216783]), array([0.73006135]), array([0.77777778]), array([163]))
2
(array([0.55932203]), array([0.64705882]), array([0.6]), array([51]))
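
The per-class values above can be recovered directly from the transposed confusion matrix printed in the previous cell, which is a useful check on how to read it. In the sketch below rows are system predictions and columns are gold labels, so precision divides by a row sum and recall by a column sum. The variable names are chosen for the example.

```python
# Binary NB test confusion matrix from above: rows = predicted, columns = gold.
cm = [[91, 23, 4],
      [10, 119, 14],
      [5, 21, 33]]

per_class = {}
for c in range(3):
    tp = cm[c][c]
    n_predicted = sum(cm[c])             # row sum: everything labelled c by the system
    n_gold = sum(row[c] for row in cm)   # column sum: everything that is truly c
    precision = tp / n_predicted
    recall = tp / n_gold
    f1 = 2 * precision * recall / (precision + recall)
    per_class[c] = (precision, recall, f1, n_gold)
    print(f"Class {c}: P={precision:.4f}, R={recall:.4f}, F1={f1:.4f}, support={n_gold}")
```

The results agree with the sklearn output above, and the three support values sum to 320 test instances.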


## What to submit

When you're done, submit this file to the assignment 2 dropbox on D2L. (You don't need to submit any of the data files we provided you with for this assignment.)

## Grading

Your assignments will be graded based primarily on the correctness of their implementation and the written answers in the report.

Assignments that do not conform to the specifications outlined above might not be graded (e.g., modifying parts of the starter code that you were not asked to modify). Assignments that we are unable to run in a reasonable amount of time (less than one minute) also might not be graded. Grades will be out of 10 and broken down as follows:

- Multinomial NB: 6

- Binary multinomial NB: 1

- Report / discussion: 3