
# Assignment 4 for Course 1MS041
Make sure you pass the `# ... Test` cells and submit your solution notebook in the corresponding assignment on the course website. You can submit multiple times before the deadline and your highest score will be used.

---
## Assignment 4, PROBLEM 1
Maximum Points = 24


This time the assignment consists of only one problem, but we will do a more comprehensive analysis instead.

Consider the dataset `Corona_NLP_train.csv` that you can get from the course website [git](https://github.com/datascience-intro/1MS041-2024/blob/main/notebooks/data/Corona_NLP_train.csv). The data is "Coronavirus tweets NLP - Text Classification" that can be found on [kaggle](https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification). The data has several columns, but we will only be working with `OriginalTweet` and `Sentiment`.

1. [3p] Load the data and filter out those tweets that have `Sentiment`=`Neutral`. Let $X$ represent the `OriginalTweet` and let
    $$
        Y =
        \begin{cases}
        1 & \text{if sentiment is towards positive}
        \\
        0 & \text{if sentiment is towards negative}.
        \end{cases}
    $$
    Put the resulting arrays into the variables $X$ and $Y$. Split the data into three parts, train/test/validation, where train is 60% of the data, test is 15% and validation is 25%. Do not split randomly: a fixed split makes sure that everyone uses the same partitions (we are in this case assuming the data is IID as presented in the dataset). That is, the splitting layout is [train, test, validation].

2. [4p] There are many ways to solve this classification problem. The first main issue to resolve is to convert the $X$ variable to something that you can feed into a machine learning model. For instance, you can use [`CountVectorizer`](https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) as the first step. The step that comes after should be a `LogisticRegression` model, but for this to work you need to put together the `CountVectorizer` and the `LogisticRegression` model into a [`Pipeline`](https://scikit-learn.org/1.5/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline). Fill in the variable `model` such that it accepts the raw text as input and outputs a number $0$ or $1$, and make sure that `model.predict_proba` works for it. **Hint: You might need to tune the parameters of `LogisticRegression` to get convergence; make sure that it doesn't take too long, or the autograder might kill your code.**
3. [3p] Use your trained model and calculate the precision and recall on both classes. Fill in the corresponding variables with the answer.
4. [3p] Let us now define a cost function
    * A positive tweet that is classified as negative will have a cost of 1
    * A negative tweet that is classified as positive will have a cost of 5
    * Correct classifications cost 0
    
    Complete the function `cost` so that it computes the cost of a prediction model under a given prediction threshold (recall our precision-recall lecture and the `predict_proba` function of trained models).

5. [4p] Now, we wish to select the threshold of our classifier that minimizes the cost. Fill in the selected threshold value in the variable `optimal_threshold`.
6. [4p] With your newly computed threshold value, compute the cost of putting this model in production by computing the cost using the validation data. Also provide a confidence interval of the cost using Hoeffding's inequality with 99% confidence.
7. [3p] Let $t$ be the threshold you found and $f$ the model you fitted (one of the outputs of `predict_proba`). If we define the random variable
    $$
        C = (1-1_{f(X)\geq t})Y+5(1-Y)1_{f(X) \geq t}
    $$
    then $C$ denotes the cost of a randomly chosen tweet. In the previous step we estimated $\mathbb{E}[C]$ using the empirical mean. However, since the threshold is chosen to minimize cost, $C=0$ or $C=1$ is more likely than $C=5$, and as such $C$ will have a low variance. Compute the empirical variance of $C$ on the validation set. What would be the confidence interval if we used Bennett's inequality instead of Hoeffding's in point 6, with the computed empirical variance as our guess for the variance?
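
To make the definition of $C$ concrete, here is a minimal sketch on made-up data. The arrays `y_true` and `scores` and the threshold `t = 0.8` are hypothetical values chosen for illustration only; they are not taken from the dataset or from the fitted model.

```python
import numpy as np

# Hypothetical toy data: true labels Y and model scores f(X) for six tweets.
y_true = np.array([1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.4, 0.7, 0.2, 0.85, 0.95])
t = 0.8  # illustrative threshold, not the one found in the assignment

# Indicator 1_{f(X) >= t}: predict positive when the score reaches the threshold.
pred = (scores >= t).astype(int)

# C = (1 - 1_{f(X) >= t}) * Y + 5 * (1 - Y) * 1_{f(X) >= t}
C = (1 - pred) * y_true + 5 * (1 - y_true) * pred

mean_cost = C.mean()      # empirical estimate of E[C]
var_cost = C.var(ddof=0)  # empirical variance of C
```

In this toy example the second tweet is a missed positive (cost 1) and the last tweet is a negative classified as positive (cost 5); every other tweet costs 0.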

In [2]:
import pandas as pd
import numpy as np


# Part 1

# Load the data from the file specified in the problem definition and make sure that it is loaded using
# the search path `data/Corona_NLP_train.csv`. This is to make sure the autograder and your computer have the same
# file path and can load the data correctly.

# Contrary to how many other problems are structured, this problem actually requires you to
# have X on the shape (n_samples, ) that is a 1-dimensional array. Otherwise it will cause a bunch
# of errors in the autograder or also in for instance CountVectorizer.

# Make sure that all your data is numpy arrays and not pandas dataframes or series.

data = pd.read_csv('data/Corona_NLP_train.csv', encoding='latin1')
data = data.loc[data['Sentiment'] != 'Neutral']

X = data['OriginalTweet'].values
Y = np.where(data['Sentiment'].isin(['Positive', 'Extremely Positive']), 1, 0)

n = len(X)
train_idx = int(0.6 * n)
test_idx = train_idx + int(0.15 * n)

X_train = X[:train_idx]
Y_train = Y[:train_idx]
X_test = X[train_idx:test_idx]
Y_test = Y[train_idx:test_idx]
X_valid = X[test_idx:]
Y_valid = Y[test_idx:]

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# Part 2

# Train a machine learning model or pipeline that can take the raw strings from X and predict Y=0,1 depending on the
# sentiment of the tweet. Store the trained model in the variable `model`.
model = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', LogisticRegression(max_iter=200, solver='lbfgs'))
])
model.fit(X_train, Y_train)


In [7]:
from sklearn.metrics import precision_score, recall_score

# Part 3

# Evaluate the model on the test set and calculate precision, and recall on both classes. Store the results in the
# variables `precision_0`, `precision_1`, `recall_0`, `recall_1`.
Y_pred = model.predict(X_test)
precision = precision_score(Y_test, Y_pred, average=None)
recall = recall_score(Y_test, Y_pred, average=None)

precision_0 = precision[0]
precision_1 = precision[1]
recall_0 = recall[0]
recall_1 = recall[1]

precision_0, precision_1, recall_0, recall_1

(np.float64(0.8568291611769873),
 np.float64(0.8707557502738226),
 np.float64(0.8464208242950109),
 np.float64(0.8797491700479528))

In [8]:
# Initialize counts
TP_0, FP_0, FN_0 = 0, 0, 0
TP_1, FP_1, FN_1 = 0, 0, 0

# Iterate over true and predicted labels
for true, pred in zip(Y_test, Y_pred):
    # For class 0
    if true == 0 and pred == 0:
        TP_0 += 1  # True positive for class 0
    elif true != 0 and pred == 0:
        FP_0 += 1  # False positive for class 0
    elif true == 0 and pred != 0:
        FN_0 += 1  # False negative for class 0

    # For class 1
    if true == 1 and pred == 1:
        TP_1 += 1  # True positive for class 1
    elif true != 1 and pred == 1:
        FP_1 += 1  # False positive for class 1
    elif true == 1 and pred != 1:
        FN_1 += 1  # False negative for class 1

# Calculate precision and recall for class 0
precision_0 = TP_0 / (TP_0 + FP_0) if (TP_0 + FP_0) > 0 else 0
recall_0 = TP_0 / (TP_0 + FN_0) if (TP_0 + FN_0) > 0 else 0

# Calculate precision and recall for class 1
precision_1 = TP_1 / (TP_1 + FP_1) if (TP_1 + FP_1) > 0 else 0
recall_1 = TP_1 / (TP_1 + FN_1) if (TP_1 + FN_1) > 0 else 0

precision_0, precision_1, recall_0, recall_1

(0.8568291611769873,
 0.8707557502738226,
 0.8464208242950109,
 0.8797491700479528)

In [9]:

# Part 4

def cost(model,threshold,X,Y):
    # Hint, make sure that the model has a predict_proba method
    # think about how the decision is made based on the probabilities
    # and how the threshold can be used to make the decision.
    # For reference take a look at the lecture notes "Bayes classifier"
    # which contains how the decision is made based on the probabilities when the threshold is 0.5.

    # Fill in what is missing to compute the cost and return it
    # Note that we are interested in average cost
    Y_prob = model.predict_proba(X)[:, 1]
    # Use >= so the decision rule matches the indicator 1_{f(X) >= t} in the problem statement
    predictions = (Y_prob >= threshold).astype(int)
    total_cost = 0

    for true_label, predicted_label in zip(Y, predictions):
        if true_label == 1 and predicted_label == 0:  # positive misclassified as negative
            total_cost += 1  # cost = 1
        elif true_label == 0 and predicted_label == 1:  # negative misclassified as positive
            total_cost += 5  # cost = 5

    average_cost = total_cost / len(Y)
    return average_cost

In [11]:

# Part 5

# Find the optimal threshold for the model on the test set. Store the threshold in the variable `optimal_threshold`
# and the cost at the optimal threshold in the variable `cost_at_optimal_threshold` evaluated on the test set.
thresholds = np.arange(0, 1, 0.1)
costs = []
for threshold in thresholds:
    costs.append(cost(model, threshold, X_test, Y_test))

optimal_threshold = thresholds[np.argmin(costs)]
cost_at_optimal_threshold = np.min(costs)
optimal_threshold, cost_at_optimal_threshold

(np.float64(0.8), np.float64(0.27392344497607657))
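
As a rough sanity check on the grid search, and assuming the predicted probabilities are well calibrated (an assumption that is not verified here), the cost-minimizing threshold can be derived by comparing the expected cost of the two possible decisions for a tweet with $p = P(Y = 1 \mid X)$:

$$
\mathbb{E}[C \mid \text{predict } 1] = 5(1 - p), \qquad \mathbb{E}[C \mid \text{predict } 0] = p.
$$

Predicting $1$ is the cheaper decision exactly when

$$
5(1 - p) \leq p \iff p \geq \frac{5}{6} \approx 0.833,
$$

which is close to the value $0.8$ selected on the grid with step $0.1$.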

### Hoeffding's Epsilon Formula

The margin of error ($\epsilon$) based on Hoeffding's inequality is given by:

$$
\epsilon = \sqrt{-\frac{(b-a)^2}{2n} \ln\left(\frac{\delta}{2}\right)}
$$

Where:
- $n$: number of samples,
- $a, b$: bounds of the random variable, $a \leq X_i \leq b$,
- $\delta$: allowed failure probability (e.g., $\delta = 0.01$ for 99% confidence).
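
The formula can be wrapped in a small helper to see how the margin shrinks with the sample size. This is only a sketch: the function name `hoeffding_epsilon` and the sample sizes used below are illustrative choices, not part of the assignment.

```python
import numpy as np

def hoeffding_epsilon(n, a, b, delta):
    """Half-width of the two-sided Hoeffding interval for the mean of
    n independent samples that take values in [a, b]."""
    return np.sqrt((b - a) ** 2 * np.log(2 / delta) / (2 * n))

# Illustrative values: costs in [0, 5], 99% confidence.
eps_1k = hoeffding_epsilon(n=1000, a=0, b=5, delta=0.01)
eps_4k = hoeffding_epsilon(n=4000, a=0, b=5, delta=0.01)

# The margin scales like 1 / sqrt(n): four times the data halves epsilon.
ratio = eps_1k / eps_4k
```

Note that $\ln(2/\delta) = -\ln(\delta/2)$, so this is the same expression as above written without the leading minus sign.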


In [13]:

# Part 6

n = len(Y_valid)
delta = 0.01  # 99% confidence
a = 0
b = 5
# Hoeffding margin of error, with the range written as (b - a) to match the formula
epsilon = np.sqrt(-1 / (2 * n / (b - a)**2) * np.log(delta / 2))
cost_at_optimal_threshold_valid = cost(model, optimal_threshold, X_valid, Y_valid)
cost_interval_valid = (cost_at_optimal_threshold_valid - epsilon, cost_at_optimal_threshold_valid + epsilon)

assert(type(cost_interval_valid) == tuple)
assert(len(cost_interval_valid) == 2)
cost_interval_valid

(np.float64(0.18761279025396377), np.float64(0.3656041434939434))

### Bennett's Epsilon Formula

The margin of error ($\epsilon$) based on Bennett's inequality is given by solving the following equation:

$$
\exp\left(-n \frac{\sigma^2}{b^2} h\left(\frac{b \epsilon}{\sigma^2}\right)\right) = \frac{\alpha}{2},
$$

Where:
- $h(u) = (1 + u) \ln(1 + u) - u$,
- $n$: number of samples,
- $b$: the upper bound of the range of the random variable,
- $\sigma$: standard deviation of the random variable,
- $\alpha$: allowed failure probability (e.g., $\alpha = 0.01$ for 99% confidence),
- $\epsilon$: the solution to the equation, representing the margin of error.

---

### Cost Function and Interval for $C$

The cost $C$ of a single prediction is defined as:

$$
C = (1 - \text{prediction}) \cdot \text{true\_label} + 5 \cdot (1 - \text{true\_label}) \cdot \text{prediction},
$$

Where:
- $\text{prediction}$: the predicted label,
- $\text{true\_label}$: the actual label,
- The constants $1$ and $5$ are the costs of a false negative and a false positive, respectively.

The variance of $C$ is:

$$
\sigma^2 = \text{Var}(C).
$$

The confidence interval for the expected cost $\mathbb{E}[C]$ is:

$$
\text{Interval of } C = \left( \text{cost\_at\_optimal\_threshold\_valid} - \epsilon, \; \text{cost\_at\_optimal\_threshold\_valid} + \epsilon \right),
$$

Where:
- $\text{cost\_at\_optimal\_threshold\_valid}$: the observed cost at the optimal threshold,
- $\epsilon$: the solution to Bennett's epsilon equation.


In [14]:

# Part 7

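def bennett_epsilon(n,b,sigma,alpha):
    # Numerically solve exp(-n*sigma^2/b^2 * h(b*epsilon/sigma^2)) = alpha/2 for epsilon
    import scipy.optimize as so
    h = lambda u: (1+u)*np.log(1+u)-u
    f = lambda epsilon: np.exp(-n*sigma**2/b**2*h(b*epsilon/sigma**2))-alpha/2
    ans = so.fsolve(f,0.002)  # root search started from the initial guess 0.002
    epsilon = np.abs(ans[0])
    print("Numerical error", f(epsilon))  # residual of the equation at the solution
    return epsilon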

Y_valid_prob = model.predict_proba(X_valid)[:, 1]
predictions = (Y_valid_prob >= optimal_threshold).astype(int)
C = [
    (1 - prediction) * true_label + 5 * (1 - true_label) * prediction
    for true_label, prediction in zip(Y_valid, predictions)
]
variance_of_C = np.var(C, ddof=0)
sigma = np.sqrt(variance_of_C)
epsilon2 = bennett_epsilon(n, b, sigma, delta)
interval_of_C = (cost_at_optimal_threshold_valid - epsilon2, cost_at_optimal_threshold_valid + epsilon2)

assert(type(interval_of_C) == tuple)
assert(len(interval_of_C) == 2)

interval_of_C

Numerical error -3.577867169202165e-15


(np.float64(0.2465085328181294), np.float64(0.30670840092977775))

### Derivation of Bennett's Epsilon Formula

The margin of error ($\epsilon$) is derived from Bennett's inequality:

$$
\mathbb{P}\left( \left| \bar{X} - \mu \right| \geq \epsilon \right) \leq 2 \exp\left(-n \cdot \frac{\sigma^2}{b^2} \cdot h\left(\frac{b \epsilon}{\sigma^2}\right)\right),
$$

where:
- $ \bar{X} $: the sample mean,  
- $ \mu $: the true mean,  
- $ n $: the number of samples,  
- $ b $: the upper bound of the random variable,  
- $ \sigma^2 $: the variance of the random variable,  
- $ h(u) = (1 + u) \ln(1 + u) - u $.

To find $\epsilon$ for a given failure probability $\alpha$ (e.g., $\alpha = 0.01$ for 99% confidence), set:

$$
2 \exp\left(-n \cdot \frac{\sigma^2}{b^2} \cdot h\left(\frac{b \epsilon}{\sigma^2}\right)\right) = \alpha.
$$

Rearranging:

$$
\exp\left(-n \cdot \frac{\sigma^2}{b^2} \cdot h\left(\frac{b \epsilon}{\sigma^2}\right)\right) = \frac{\alpha}{2}.
$$

Taking the natural logarithm:

$$
-n \cdot \frac{\sigma^2}{b^2} \cdot h\left(\frac{b \epsilon}{\sigma^2}\right) = \ln\left(\frac{\alpha}{2}\right).
$$

Solving for $h\left(\frac{b \epsilon}{\sigma^2}\right)$:

$$
h\left(\frac{b \epsilon}{\sigma^2}\right) = -\frac{b^2}{n \sigma^2} \ln\left(\frac{\alpha}{2}\right).
$$

Finally, solving for $\epsilon$:

$$
\epsilon = \frac{\sigma^2}{b} \cdot h^{-1}\left(-\frac{b^2}{n \sigma^2} \ln\left(\frac{\alpha}{2}\right)\right),
$$

where $ h^{-1} $ is the inverse function of $ h(u) $, solved numerically in the code.
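
The last step of the derivation can be sketched directly in code by inverting $h$ with a bracketing root finder. This is one possible approach, not the one used in the notebook cell above (which calls `fsolve` on the exponential form); the names `h_inverse` and `bennett_epsilon_closed_form`, the argument `sigma2` (standing for $\sigma^2$), and the numeric inputs are all hypothetical.

```python
import numpy as np
from scipy.optimize import brentq

def h(u):
    """h(u) = (1 + u) ln(1 + u) - u, increasing on [0, inf) with h(0) = 0."""
    return (1 + u) * np.log1p(u) - u

def h_inverse(y, upper=1.0):
    """Numerically invert h for y > 0 using a bracketing root finder."""
    # Grow the bracket until h(upper) >= y, so the root lies in [0, upper].
    while h(upper) < y:
        upper *= 2
    return brentq(lambda u: h(u) - y, 0.0, upper)

def bennett_epsilon_closed_form(n, b, sigma2, alpha):
    """epsilon = sigma^2 / b * h^{-1}( -b^2 / (n sigma^2) * ln(alpha / 2) )."""
    y = -(b ** 2) / (n * sigma2) * np.log(alpha / 2)
    return sigma2 / b * h_inverse(y)

# Illustrative inputs only.
eps = bennett_epsilon_closed_form(n=1000, b=5, sigma2=1.0, alpha=0.01)

# Plugging epsilon back into the bound should recover alpha.
recovered_alpha = 2 * np.exp(-1000 * 1.0 / 5 ** 2 * h(5 * eps / 1.0))
```

Because $h$ is monotone on $[0, \infty)$, a bracketing method is guaranteed to find the unique non-negative root, so no initial guess or absolute value is needed.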


### Summary Table: Difficulty of Estimating Precision and Recall

- Precision: $P(Y = 1 \mid g(X) = 1)$
- Recall: $P(g(X) = 1 \mid Y = 1)$

| **Scenario**                   | **Precision (Class 1)**    | **Recall (Class 1)**      | **Precision (Class 0)**   | **Recall (Class 0)**      |
|--------------------------------|----------------------------|---------------------------|---------------------------|---------------------------|
| $g(X) = 1$ for all $X$         | Depends on $P(Y=1)$        | Trivial ($1$)             | Undefined                 | Trivial ($0$)             |
| $g(X) = 0$ for all $X$         | Undefined                  | Trivial ($0$)             | Depends on $P(Y=0)$       | Trivial ($1$)             |
| $P(Y = 1)$ close to $0$        | Harder                     | Harder                    | Easier                    | Easier                    |
| $P(Y = 1)$ close to $1$        | Easier                     | Easier                    | Harder                    | Harder                    |
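
The first row of the table can be checked on a toy example. The sketch below uses a hypothetical helper `precision_recall` and made-up labels; it returns NaN where the table says "Undefined", which is a convention chosen here for illustration.

```python
import numpy as np

def precision_recall(y_true, y_pred, cls):
    """Empirical precision and recall for class `cls`.
    Returns NaN when the corresponding denominator is zero."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == cls) & (y_pred == cls))
    n_pred = np.sum(y_pred == cls)  # how often the class is predicted
    n_true = np.sum(y_true == cls)  # how often the class actually occurs
    precision = tp / n_pred if n_pred > 0 else float("nan")
    recall = tp / n_true if n_true > 0 else float("nan")
    return precision, recall

# Made-up labels and the constant classifier g(X) = 1.
y_true = np.array([1, 0, 1, 1, 0])
always_one = np.ones_like(y_true)

p1, r1 = precision_recall(y_true, always_one, cls=1)
p0, r0 = precision_recall(y_true, always_one, cls=0)
```

Here the class 1 precision equals the fraction of positives in the sample, the class 1 recall is 1, the class 0 precision is undefined because class 0 is never predicted, and the class 0 recall is 0.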


In [15]:
# Confidence interval parameters
delta = 0.01  # 99% confidence level

# Hoeffding confidence intervals for precision and recall of class 1 (test set)
n_predPos = np.sum(Y_pred == 1)  # Number of predicted positives
n_actualPos = np.sum(Y_test == 1)  # Number of actual positives

# Precision interval
precision_epsilon = np.sqrt(np.log(2 / delta) / (2 * n_predPos))
precision_1_interval = (precision_1 - precision_epsilon, precision_1 + precision_epsilon)

# Recall interval
recall_epsilon = np.sqrt(np.log(2 / delta) / (2 * n_actualPos))
recall_1_interval = (recall_1 - recall_epsilon, recall_1 + recall_epsilon)


In [16]:
precision_1_interval

(np.float64(0.8396559234706035), np.float64(0.9018555770770417))

In [17]:
recall_1_interval

(np.float64(0.8484891517599025), np.float64(0.9110091883360032))

### Confidence Interval Formulas for Precision and Recall (Class 1)

#### Precision Confidence Interval
The margin of error ($\epsilon$) for precision is given by:

$$
\epsilon_{\text{precision}} = \sqrt{\frac{\ln\left(\frac{2}{\delta}\right)}{2 \cdot n_{\text{predPos}}}}
$$

The confidence interval for precision is:

$$
\text{Precision Interval} = \left( \text{Precision} - \epsilon_{\text{precision}}, \; \text{Precision} + \epsilon_{\text{precision}} \right)
$$

Where:
- $n_{\text{predPos}}$: Number of predicted positives ($g(X) = 1$),
- $\delta$: Allowed failure probability (e.g., $\delta = 0.01$ for 99% confidence).

---

#### Recall Confidence Interval
The margin of error ($\epsilon$) for recall is given by:

$$
\epsilon_{\text{recall}} = \sqrt{\frac{\ln\left(\frac{2}{\delta}\right)}{2 \cdot n_{\text{actualPos}}}}
$$

The confidence interval for recall is:

$$
\text{Recall Interval} = \left( \text{Recall} - \epsilon_{\text{recall}}, \; \text{Recall} + \epsilon_{\text{recall}} \right)
$$

Where:
- $n_{\text{actualPos}}$: Number of actual positives ($Y = 1$),
- $\delta$: Allowed failure probability (e.g., $\delta = 0.01$ for 99% confidence).
