
# Assignment 4 for Course 1MS041
Make sure you pass the `# ... Test` cells and submit your solution notebook in the corresponding assignment on the course website. You can submit multiple times before the deadline and your highest score will be used.

---
## Assignment 4, PROBLEM 1
Maximum Points = 24


This time the assignment consists of only one problem, but we will do a more comprehensive analysis instead.

Consider the dataset `Corona_NLP_train.csv` that you can get from the course website [git](https://github.com/datascience-intro/1MS041-2024/blob/main/notebooks/data/Corona_NLP_train.csv). The data is "Coronavirus tweets NLP - Text Classification", which can be found on [kaggle](https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification). The data has several columns, but we will only be working with `OriginalTweet` and `Sentiment`.

1. [3p] Load the data and filter out those tweets that have `Sentiment`=`Neutral`. Let $X$ represent the `OriginalTweet` and let 
    $$
        Y = 
        \begin{cases}
        1 & \text{if sentiment is towards positive}
        \\
        0 & \text{if sentiment is towards negative}.
        \end{cases}
    $$
    Put the resulting arrays into the variables $X$ and $Y$. Split the data into three parts, train/test/validation, where train is 60% of the data, test is 15%, and validation is 25%. Do not split randomly; this ensures that we all use the same splits (we are in this case assuming the data is IID as presented in the dataset). That is, the splitting layout is [train, test, validation].

2. [4p] There are many ways to solve this classification problem. The first main issue to resolve is to convert the $X$ variable to something that you can feed into a machine learning model. For instance, you can use [`CountVectorizer`](https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) as the first step. The step that comes after should be a `LogisticRegression` model, but for this to work you need to put together the `CountVectorizer` and the `LogisticRegression` model into a [`Pipeline`](https://scikit-learn.org/1.5/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline). Fill in the variable `model` such that it accepts the raw text as input and outputs a number $0$ or $1$, and make sure that `model.predict_proba` works for it. **Hint: You might need to play with the parameters of LogisticRegression to get convergence. Make sure that it doesn't take too long, or the autograder might kill your code.**
3. [3p] Use your trained model and calculate the precision and recall on both classes. Fill in the corresponding variables with the answer.
4. [3p] Let us now define a cost function
    * A positive tweet that is classified as negative will have a cost of 1
    * A negative tweet that is classified as positive will have a cost of 5
    * Correct classifications cost 0
    
    Complete the function `cost` so that it computes the cost of a prediction model under a certain prediction threshold (recall our precision-recall lecture and the `predict_proba` function of trained models).

5. [4p] Now, we wish to select the threshold of our classifier that minimizes the cost. Fill in the selected threshold value in the variable `optimal_threshold`.
6. [4p] With your newly computed threshold value, compute the cost of putting this model in production by computing the cost using the validation data. Also provide a confidence interval for the cost using Hoeffding's inequality with 99% confidence.
7. [3p] Let $t$ be the threshold you found and $f$ the model you fitted (one of the outputs of `predict_proba`). If we define the random variable
    $$
        C = (1-1_{f(X)\geq t})Y+5(1-Y)1_{f(X) \geq t}
    $$
    then $C$ denotes the cost of a randomly chosen tweet. In the previous step we estimated $\mathbb{E}[C]$ using the empirical mean. However, since the threshold is chosen to minimize cost, it is more likely that $C=0$ or $C=1$ than $C=5$; as such, $C$ will have a low variance. Compute the empirical variance of $C$ on the validation set. What would the confidence interval be if we used Bennett's inequality instead of Hoeffding's in point 6, with the computed empirical variance as our guess for the variance?

In [None]:
import pandas as pd
import numpy as np

# Load the training data (path relative to the notebook)
file_path = 'data/Corona_NLP_train.csv'
data = pd.read_csv(file_path, encoding='latin1')

# Filter out rows where `Sentiment` is 'Neutral'
filtered_data = data[data['Sentiment'] != 'Neutral']

# Define X and Y based on the problem statement
X = filtered_data['OriginalTweet'].values  # 1-dimensional array
Y = np.where(filtered_data['Sentiment'].isin(['Positive', 'Extremely Positive']), 1, 0)  # Map sentiment to binary

# Split the data into train/test/validation sets (60%/15%/25%)
n_samples = len(X)
train_end = int(0.6 * n_samples)
test_end = train_end + int(0.15 * n_samples)

X_train = X[:train_end]
Y_train = Y[:train_end]
X_test = X[train_end:test_end]
Y_test = Y[train_end:test_end]
X_valid = X[test_end:]
Y_valid = Y[test_end:]

# Verify the shapes of the splits
X_train.shape, Y_train.shape, X_test.shape, Y_test.shape, X_valid.shape, Y_valid.shape

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Define the pipeline
model = Pipeline([
    ('vectorizer', CountVectorizer()),  # Convert raw text to feature vectors
    ('classifier', LogisticRegression(max_iter=1000))  # Logistic regression model with increased max_iter for convergence
])

# Train the pipeline on the training data
model.fit(X_train, Y_train)

# Verify that the model works and `predict_proba` can be used
model.predict_proba(X_test[:5])  # Check probability predictions on a few test samples

In [None]:
from sklearn.metrics import precision_score, recall_score

# Get predictions on the test set
Y_pred = model.predict(X_test)

# Calculate precision and recall for each class
precision_0 = precision_score(Y_test, Y_pred, pos_label=0)
precision_1 = precision_score(Y_test, Y_pred, pos_label=1)
recall_0 = recall_score(Y_test, Y_pred, pos_label=0)
recall_1 = recall_score(Y_test, Y_pred, pos_label=1)

precision_0, precision_1, recall_0, recall_1
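
As a sanity check on what `precision_score` and `recall_score` report per class, the same quantities can be computed directly from counts. The sketch below is only an illustration on made-up toy labels; the helper name `precision_recall` is hypothetical and not part of scikit-learn.

```python
import numpy as np

def precision_recall(y_true, y_pred, label):
    """Precision and recall when `label` is treated as the class of interest."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    true_pos = np.sum((y_pred == label) & (y_true == label))
    predicted = np.sum(y_pred == label)  # everything predicted as `label`
    actual = np.sum(y_true == label)     # everything that truly is `label`
    precision = true_pos / predicted if predicted > 0 else 0.0
    recall = true_pos / actual if actual > 0 else 0.0
    return float(precision), float(recall)

# Toy labels, invented for illustration only
y_true_toy = [1, 1, 0, 0, 1]
y_pred_toy = [1, 0, 0, 1, 1]

p1, r1 = precision_recall(y_true_toy, y_pred_toy, label=1)
p0, r0 = precision_recall(y_true_toy, y_pred_toy, label=0)
```

Here class 1 has two true positives, one false positive, and one false negative, while class 0 has one of each, so the two classes get different scores from the same predictions.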

In [None]:
def cost(model, threshold, X, Y):
    """
    Compute the cost of a prediction model under a certain prediction threshold.

    Parameters:
    - model: The trained model with `predict_proba` method.
    - threshold: The threshold for classification.
    - X: Features (e.g., tweets).
    - Y: True labels.

    Returns:
    - Average cost of predictions.
    """
    # Get probabilities of the positive class
    probabilities = model.predict_proba(X)[:, 1]

    # Make predictions based on the threshold
    predictions = (probabilities >= threshold).astype(int)

    # Calculate costs based on the given rules
    costs = np.zeros_like(Y, dtype=float)
    costs[(Y == 1) & (predictions == 0)] = 1  # Positive classified as negative
    costs[(Y == 0) & (predictions == 1)] = 5  # Negative classified as positive

    # Return the average cost
    return np.mean(costs)

# Example usage: Calculate the cost for the test set with a threshold of 0.5
example_cost = cost(model, 0.5, X_test, Y_test)
example_cost
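
To check the cost rules by hand, it can help to run them on a tiny example where the answer is easy to verify. The sketch below assumes a hypothetical `StubModel` that returns fixed probabilities in place of the fitted pipeline, and re-implements the cost as `average_cost` using the formula for $C$ from point 7; all names and data here are made up for illustration.

```python
import numpy as np

class StubModel:
    """Hypothetical stand-in for a fitted model: returns fixed probabilities."""

    def __init__(self, positive_probs):
        self.positive_probs = np.asarray(positive_probs, dtype=float)

    def predict_proba(self, X):
        p = self.positive_probs[: len(X)]
        # Column 0: probability of class 0, column 1: probability of class 1
        return np.column_stack([1.0 - p, p])

def average_cost(model, threshold, X, Y):
    """Mean of C = (1 - pred) * Y + 5 * (1 - Y) * pred over the samples."""
    Y = np.asarray(Y)
    predictions = (model.predict_proba(X)[:, 1] >= threshold).astype(int)
    per_sample = (1 - predictions) * Y + 5 * (1 - Y) * predictions
    return float(per_sample.mean())

X_toy = ["a", "b", "c", "d"]            # placeholder tweets
Y_toy = np.array([1, 1, 0, 0])          # true labels
stub = StubModel([0.9, 0.2, 0.7, 0.1])  # predictions at 0.5: 1, 0, 1, 0

toy_cost = average_cost(stub, 0.5, X_toy, Y_toy)
```

With threshold 0.5 the second tweet is a missed positive (cost 1) and the third is a false positive (cost 5), so the average over four tweets is $(1 + 5)/4 = 1.5$.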

In [None]:
# Define a range of thresholds to evaluate
thresholds = np.linspace(0, 1, 101)  # 0.0 to 1.0 in 0.01 increments

# Compute costs for each threshold
costs = [cost(model, t, X_test, Y_test) for t in thresholds]

# Find the optimal threshold (minimizing cost)
optimal_index = np.argmin(costs)
optimal_threshold = thresholds[optimal_index]
cost_at_optimal_threshold = costs[optimal_index]

optimal_threshold, cost_at_optimal_threshold
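
The loop above calls `predict_proba` once per threshold. If that turns out to be slow, one possible alternative is to compute the probabilities once and evaluate all thresholds with broadcasting. The sketch below assumes the probabilities are already available as an array; `sweep_costs` is a hypothetical helper and the inputs are toy values.

```python
import numpy as np

def sweep_costs(probs, y, thresholds):
    """Average cost at each threshold, from precomputed class-1 probabilities."""
    probs = np.asarray(probs, dtype=float)
    y = np.asarray(y)
    thresholds = np.asarray(thresholds, dtype=float)
    # preds[i, j] is True when sample j is classified positive at threshold i
    preds = probs[None, :] >= thresholds[:, None]
    false_neg = (~preds) & (y[None, :] == 1)  # cost 1 each
    false_pos = preds & (y[None, :] == 0)     # cost 5 each
    return (1.0 * false_neg + 5.0 * false_pos).mean(axis=1)

# Toy inputs, invented for illustration only
toy_probs = [0.1, 0.4, 0.6, 0.9]
toy_labels = [0, 0, 1, 1]
toy_thresholds = np.array([0.0, 0.5, 1.0])

toy_costs = sweep_costs(toy_probs, toy_labels, toy_thresholds)
best_toy_threshold = float(toy_thresholds[np.argmin(toy_costs)])
```

In the notebook, `probs` would be `model.predict_proba(X_test)[:, 1]`. Note that `np.argmin` returns the first minimizer, so ties are broken toward the smallest threshold, as in the loop version.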

In [None]:
from math import sqrt, log

# Given values
alpha = 0.01  # Significance level, i.e. 1 - alpha = 99% confidence
range_C = 5  # Costs lie in [0, 5], so the range is 5
n_valid = len(Y_valid)  # Number of validation samples

# Compute epsilon using Hoeffding's inequality
epsilon = range_C * sqrt((log(2 / alpha)) / (2 * n_valid))
# Cost at the optimal threshold on validation set
cost_at_optimal_threshold_valid = cost(model, optimal_threshold, X_valid, Y_valid)

# Calculate bounds
lower_bound = cost_at_optimal_threshold_valid - epsilon
upper_bound = cost_at_optimal_threshold_valid + epsilon

# Final cost interval
cost_interval_valid = (lower_bound, upper_bound)

# Output results
print("Cost at optimal threshold (validation):", cost_at_optimal_threshold_valid)
print("Confidence interval for cost:", cost_interval_valid)
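
The Hoeffding computation above can be wrapped in a small reusable function. The sketch below is a minimal version under the assumption that each cost lies in $[0, 5]$; the function name `hoeffding_interval`, the clipping of the interval to that range, and the example numbers (mean 0.30, 10,000 samples) are illustrative choices and not results from this dataset.

```python
from math import log, sqrt

def hoeffding_interval(sample_mean, n, alpha=0.01, low=0.0, high=5.0):
    """Two-sided (1 - alpha) interval for the mean of n IID values in [low, high]."""
    epsilon = (high - low) * sqrt(log(2.0 / alpha) / (2.0 * n))
    # The true mean cannot leave [low, high], so clip the interval to that range
    return max(low, sample_mean - epsilon), min(high, sample_mean + epsilon)

# Hypothetical numbers, for illustration only
lower, upper = hoeffding_interval(sample_mean=0.30, n=10_000)
```

The half-width shrinks like $1/\sqrt{n}$ and scales linearly with the range of the cost, which is why the factor 5 makes this interval fairly wide.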

In [None]:
from math import sqrt, log

# Compute the cost per sample on the validation set using the optimal threshold
probabilities_valid = model.predict_proba(X_valid)[:, 1]
predictions_valid = (probabilities_valid >= optimal_threshold).astype(int)

# Compute C for each sample in the validation set
C = (1 - predictions_valid) * Y_valid + 5 * (1 - Y_valid) * predictions_valid

# Compute the empirical variance of C
variance_of_C = np.var(C, ddof=1)  # Use ddof=1 for sample variance

# Number of validation samples
n_valid = len(C)

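# Constants for the Bennett-type bound
alpha = 0.01  # Significance level (99% confidence)
mean_cost = np.mean(C)
b = 5  # Known range of C (costs lie in [0, 5]); the empirical max - min may underestimate it

# Compute the confidence interval using a closed-form relaxation of Bennett's inequality,
# with the empirical variance plugged in as the guess for the true variance
epsilon_bennett = sqrt((2 * variance_of_C * log(2 / alpha)) / n_valid) + (b * log(2 / alpha)) / (3 * n_valid)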
lower_bound_bennett = mean_cost - epsilon_bennett
upper_bound_bennett = mean_cost + epsilon_bennett

interval_of_C = (lower_bound_bennett, upper_bound_bennett)

interval_of_C
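
For reference, the half-width used in the cell above can be written out and compared with the Hoeffding half-width from point 6. This is a sketch of a commonly used closed-form consequence of Bennett's inequality, not necessarily the exact form given in the lectures. The symbols $\bar{C}_n$ (empirical mean), $\hat{\sigma}^2$ (empirical variance), $b$ (range of $C$, here 5) and $\alpha$ (here 0.01) are notation introduced for this note, and plugging in $\hat{\sigma}^2$ for the true variance is the assumption stated in point 7.

$$
    \left|\bar{C}_n - \mathbb{E}[C]\right|
    \leq \sqrt{\frac{2\hat{\sigma}^2 \ln(2/\alpha)}{n}} + \frac{b\ln(2/\alpha)}{3n}
    \qquad \text{versus} \qquad
    \left|\bar{C}_n - \mathbb{E}[C]\right|
    \leq b\sqrt{\frac{\ln(2/\alpha)}{2n}},
$$

each holding with probability at least $1-\alpha$. Comparing the leading terms, the variance-based bound is narrower roughly when $\hat{\sigma}^2 < b^2/4$, up to the lower-order $1/n$ term, so a low-variance cost benefits the most.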