# Exam 4th of January 2024
**Course: 1MS041 (Introduction to Data Science)**


This notebook contains the exam problems, instructions, and code cells required for completion.

## Instructions
1. Complete the problems by following the instructions.
2. Submit the completed notebook with your solutions saved.
3. This exam has **3 problems** for a total of **40 points**. To pass, you need **20 points**.
4. Comment your code: comments can earn you partial credit even if your solution is incorrect.
5. Follow the instructions rigorously, and ensure that your answers are clear and unambiguous.
6. You are **not allowed to communicate with others** or use external help (forums, AI tools, etc.).

Good luck!



## Problem 1: Rejection Sampling and Monte Carlo Integration (14 Points)

In this problem, you will perform rejection sampling from complex distributions and use your samples to compute integrals (Monte Carlo integration).
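As a reminder of the general technique, here is a minimal sketch of rejection sampling with a Uniform(0, 1) proposal. The target density `example_pdf` (f(x) = 3x² on [0, 1]), the helper name `rejection_sample`, and the bound of 3 are illustrative assumptions; they are not the distribution or the function signature used in this exam.

```python
import numpy as np

def rejection_sample(target_pdf, bound, n_samples, rng):
    """Draw n_samples from target_pdf on [0, 1] with a Uniform(0, 1) proposal.

    `bound` must satisfy target_pdf(x) <= bound for every x in [0, 1].
    """
    accepted_batches = []
    n_collected = 0
    while n_collected < n_samples:
        # Propose a batch of candidates and matching uniform heights
        batch_size = 2 * (n_samples - n_collected)
        x = rng.uniform(0.0, 1.0, size=batch_size)
        u = rng.uniform(0.0, 1.0, size=batch_size)
        # Accept a candidate when the scaled height falls under the target density
        accepted = x[u * bound <= target_pdf(x)]
        accepted_batches.append(accepted)
        n_collected += accepted.size
    return np.concatenate(accepted_batches)[:n_samples]

# Hypothetical target (NOT the exam's distribution): f(x) = 3x^2 on [0, 1]
example_pdf = lambda x: 3.0 * x**2
rng = np.random.default_rng(0)
example_samples = rejection_sample(example_pdf, bound=3.0, n_samples=10_000, rng=rng)
```

The expected acceptance rate is 1 / `bound` for a uniform proposal on [0, 1], so a tighter bound (or a proposal shaped more like the target) wastes fewer draws.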

### Tasks
1. **[4 Points]** Complete the function `problem1_inversion` to produce samples from the given distribution using rejection sampling.
2. **[2 Points]** Generate 100,000 samples using the above function, store them in `problem1_samples`, and plot a histogram of the samples together with the true density.
3. **[2 Points]** Use the generated samples to compute the integral:  
   $$\int_0^1 \sin(x) \frac{2e^{x^2}}{x} e^{-1} dx$$  
   Store the result in `problem1_integral`.
4. **[2 Points]** Use Hoeffding's inequality to compute a 95% confidence interval for the integral and store it in `problem1_interval`.
5. **[4 Points]** Complete the function `problem1_inversion_2` to produce samples from a second distribution. Choose the proposal distribution so that the rejection rate is as low as possible.
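For tasks 3 and 4, recall that for i.i.d. values bounded in $[a, b]$, Hoeffding's inequality gives a $(1-\alpha)$ interval of half-width $(b-a)\sqrt{\ln(2/\alpha)/(2n)}$ around the sample mean. The sketch below illustrates this on a stand-in example, estimating $E[\sin(X)]$ for $X \sim \text{Uniform}(0, 1)$. The helper name `hoeffding_interval`, the uniform samples, and the bounds `lower=0`, `upper=1` are assumptions chosen for illustration, not the exam's distribution or integrand.

```python
import numpy as np

def hoeffding_interval(values, lower, upper, alpha=0.05):
    """Return (mean, (lo, hi)) where (lo, hi) is a Hoeffding (1 - alpha) interval.

    Assumes the values are i.i.d. and every value lies in [lower, upper].
    """
    values = np.asarray(values, dtype=float)
    n = values.size
    estimate = values.mean()
    # Half-width from solving 2 * exp(-2 n eps^2 / (upper - lower)^2) = alpha
    epsilon = (upper - lower) * np.sqrt(np.log(2.0 / alpha) / (2.0 * n))
    return estimate, (estimate - epsilon, estimate + epsilon)

# Stand-in example: X ~ Uniform(0, 1), integrand sin(x), which lies in [0, 1]
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100_000)
estimate, interval = hoeffding_interval(np.sin(x), lower=0.0, upper=1.0, alpha=0.05)
```

The interval is only valid if the stated bounds really hold for the function being averaged, so the bounds should be justified for the actual integrand.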

### Code


In [None]:

from Utils import timeout
import numpy as np
import matplotlib.pyplot as plt

# Part 1: Fill in the function to perform rejection sampling
@timeout
def problem1_inversion(n_samples=1):
    # Write your code here
    return np.array([])

# Part 2: Generate samples and plot histogram
problem1_samples = None

# Part 3: Compute integral using Monte Carlo
problem1_integral = None

# Part 4: Compute confidence interval using Hoeffding's inequality
problem1_interval = None

# Part 5: Complete a second inversion function
def problem1_inversion_2(n_samples=1):
    # Write your code here
    return np.array([])

# Local test for Problem 1
try:
    assert isinstance(problem1_inversion(10), np.ndarray)
    print("Good: problem1_inversion returns a numpy array.")
except Exception:
    print("Error: problem1_inversion does not return a numpy array.")

try:
    assert isinstance(problem1_samples, np.ndarray)
    print("Good: problem1_samples is a numpy array.")
except Exception:
    print("Error: problem1_samples is not a numpy array.")



## Problem 2: Logistic Regression for Spam Detection (13 Points)

In this problem, you will build and calibrate a logistic regression model to classify emails as spam or not spam.

### Tasks
1. **[2 Points]** Load `data/spam.csv` and create numpy arrays `problem2_X` (shape `(n_emails, 3)`) and `problem2_Y` (shape `(n_emails,)`). Split the data into train (40%), calibration (20%), and test (40%) sets.
2. **[4 Points]** Implement the loss function inside the `ProportionalSpam` class.
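3. **[4 Points]** Train the model on the training set and store it in `problem2_ps`. Calibrate its predicted probabilities on the calibration set using `DecisionTreeRegressor` and store the calibrated model in `problem2_calibrator`.
4. **[3 Points]** Use the trained and calibrated models to make predictions on the test set and store them in `problem2_final_predictions`. Compute the 0-1 loss (`problem2_01_loss`) and provide a 99% confidence interval for it (`problem2_interval`).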

### Code


In [None]:

import numpy as np
from scipy import optimize
from sklearn.tree import DecisionTreeRegressor

# Part 1: Load data and split into train/calibration/test sets
problem2_X = None
problem2_Y = None
problem2_X_train, problem2_X_calib, problem2_X_test = None, None, None
problem2_Y_train, problem2_Y_calib, problem2_Y_test = None, None, None

# Part 2: Define logistic regression model
class ProportionalSpam:
    def __init__(self):
        self.coeffs = None
        self.result = None

    def loss(self, X, Y, coeffs):
        # Define the loss function here
        return None

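    def fit(self, X, Y):
        # Minimize the loss over the coefficients: coeffs[0] is the intercept
        # and coeffs[1:] are the feature weights (matching predict below)
        opt_loss = lambda coeffs: self.loss(X, Y, coeffs)
        initial_args = np.zeros(X.shape[1] + 1)
        self.result = optimize.minimize(opt_loss, initial_args, method="CG")
        self.coeffs = self.result.x

    def predict(self, X):
        # Returns the logistic probability rounded to one decimal place;
        # returns None if the model has not been fitted yet
        if self.coeffs is not None:
            G = lambda x: np.exp(x) / (1 + np.exp(x))
            return np.round(10 * G(np.dot(X, self.coeffs[1:]) + self.coeffs[0])) / 10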

# Part 3: Train model and calibrate probabilities
problem2_ps = None
problem2_calibrator = None

# Part 4: Make predictions and compute test loss
problem2_final_predictions = None
problem2_01_loss = None
problem2_interval = None



## Problem 3: Markov Chains (13 Points)

Answer the following questions for each of the four Markov chains (A, B, C, and D).

### Tasks
1. **[2 Points]** Provide the transition matrix for each Markov chain.
2. **[2 Points]** Determine if each chain is irreducible.
3. **[3 Points]** Determine if each chain is aperiodic. Provide the period for each state.
4. **[3 Points]** Determine if each chain has a stationary distribution. If it does, provide the distribution.
5. **[3 Points]** Determine if each chain is reversible.
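The properties above can also be checked numerically. The sketch below does so for a hypothetical 3-state chain `P_example`, which is not one of the exam's chains. All helper names are assumptions made for illustration. The period search uses a cutoff of `3 * n` steps, and the stationary solver assumes the chain is irreducible, so that the solution is unique.

```python
import math
import numpy as np

def is_irreducible(P):
    """True if every state can reach every other state."""
    n = P.shape[0]
    reach = np.linalg.matrix_power(np.eye(n) + (P > 0), n - 1)
    return bool(np.all(reach > 0))

def state_period(P, state, max_steps=None):
    """gcd of the return times to `state` found within max_steps (0 if none)."""
    n = P.shape[0]
    max_steps = 3 * n if max_steps is None else max_steps
    A = (P > 0).astype(int)
    power = np.eye(n, dtype=int)
    period = 0
    for step in range(1, max_steps + 1):
        # Track only whether a path of this length exists, not its probability
        power = (power @ A > 0).astype(int)
        if power[state, state]:
            period = math.gcd(period, step)
    return period

def stationary_distribution(P):
    """Solve pi P = pi with sum(pi) = 1 by least squares (irreducible chain assumed)."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

def is_reversible(P, pi, tol=1e-10):
    """Check detailed balance: pi_i P_ij == pi_j P_ji for all i, j."""
    flow = pi[:, None] * P
    return bool(np.allclose(flow, flow.T, atol=tol))

# Hypothetical chain (NOT from the exam): a walk on three states in a line
P_example = np.array([[0.0, 1.0, 0.0],
                      [0.5, 0.0, 0.5],
                      [0.0, 1.0, 0.0]])
pi_example = stationary_distribution(P_example)
periods_example = [state_period(P_example, s) for s in range(3)]
```

For this example chain every state has period 2, so it is irreducible but not aperiodic, yet it still has a stationary distribution that satisfies detailed balance.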

### Code


In [None]:

# Part 1: Transition matrices
problem3_A = None
problem3_B = None
problem3_C = None
problem3_D = None

# Part 2: Irreducibility
problem3_A_irreducible = None
problem3_B_irreducible = None
problem3_C_irreducible = None
problem3_D_irreducible = None

# Part 3: Aperiodicity and periods
problem3_A_is_aperiodic = None
problem3_B_is_aperiodic = None
problem3_C_is_aperiodic = None
problem3_D_is_aperiodic = None

problem3_A_periods = None
problem3_B_periods = None
problem3_C_periods = None
problem3_D_periods = None

# Part 4: Stationary distributions
problem3_A_has_stationary = None
problem3_B_has_stationary = None
problem3_C_has_stationary = None
problem3_D_has_stationary = None

problem3_A_stationary_dist = None
problem3_B_stationary_dist = None
problem3_C_stationary_dist = None
problem3_D_stationary_dist = None

# Part 5: Reversibility
problem3_A_is_reversible = None
problem3_B_is_reversible = None
problem3_C_is_reversible = None
problem3_D_is_reversible = None
