# Exam - Introduction to Data Science

## Instructions:
1. Complete the problems by following instructions.
2. When done, submit this file with your solutions saved, following the instruction sheet.

This exam has 3 problems worth a total of 40 points. You need 20 points to pass.


In [None]:
# Insert your anonymous exam ID as a string in the variable below
examID = "XXX"

## Problem 1:  Random Walks and Hitting Times (15 points)

Consider a one-dimensional random walk on the integers, where at each step, a particle moves one step to the right with probability $p$ and one step to the left with probability $1-p$. Let $X_n$ denote the position of the particle after $n$ steps. Suppose the particle starts at position $0$, i.e., $X_0 = 0$.

1.  **(5 points)** Write a function `simulate_random_walk(n, p)` that simulates a random walk of `n` steps with probability `p` of moving right. The function should return a list of the positions $X_0, X_1, ..., X_n$ of the particle during the walk.

2.  **(5 points)** Suppose $p=0.5$. Write a function `estimate_hitting_time(num_simulations, target)` that estimates the expected number of steps for the particle to reach position `target` for the first time. The function should run `num_simulations` independent random walks and, for each walk, record the first time the particle hits `target`. If the target is not reached within 1000 steps, the walk should be terminated and excluded from the estimate. The function should return the average hitting time over all successful simulations.

3.  **(5 points)** For $p=0.6$ and `target=5`, use at least 1000 simulations to estimate the expected hitting time, and provide a 95% confidence interval using the bootstrap method. Use the `makeBootstrappedConfidenceIntervalOfStatisticT` function from `ExamJanuary_2020_problem.ipynb`, which you may copy into your notebook.
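
As a rough illustration of the percentile bootstrap mentioned in part 3, the sketch below resamples a data set with replacement and takes the 2.5th and 97.5th percentiles of the resampled means. It is a vectorized variant for the sample mean only, not the course's `makeBootstrappedConfidenceIntervalOfStatisticT`; the name `bootstrap_mean_ci`, the seed, and the synthetic exponential data are assumptions made for this example.

```python
import numpy as np

def bootstrap_mean_ci(data, alpha=0.05, B=1000, rng=None):
    """Percentile bootstrap (1 - alpha) confidence interval for the mean."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data)
    n = len(data)
    # Each row is one bootstrap resample of size n, drawn with replacement.
    resamples = rng.choice(data, size=(B, n), replace=True)
    boot_means = resamples.mean(axis=1)
    lower, upper = np.percentile(
        boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)]
    )
    return lower, upper

# Hypothetical data standing in for a sample of hitting times.
rng = np.random.default_rng(0)
data = rng.exponential(scale=25.0, size=500)
lower, upper = bootstrap_mean_ci(data, rng=rng)
print(f"Sample mean: {data.mean():.2f}, 95% CI: ({lower:.2f}, {upper:.2f})")
```

The interval is read directly off the empirical distribution of the resampled means, so it needs no normality assumption on the data.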

In [None]:
import numpy as np

def simulate_random_walk(n, p):
    """Simulates a 1D random walk.
    
    Args:
        n (int): Number of steps.
        p (float): Probability of moving right.
    
    Returns:
        list: List of positions of the particle during the walk.
    """
    positions = [0]
    current_position = 0
    for _ in range(n):
        if np.random.rand() < p:
            current_position += 1
        else:
            current_position -= 1
        positions.append(current_position)
    return positions

def estimate_hitting_time(num_simulations, target):
    """Estimates the expected hitting time to reach a target.
    
    Args:
        num_simulations (int): Number of simulations.
        target (int): Target position.
    
    Returns:
        float: Average hitting time over successful simulations.
    """
    hitting_times = []
    for _ in range(num_simulations):
        positions = simulate_random_walk(1000, 0.5)
        try:
            # list.index returns the first step at which the walk equals target
            hitting_time = positions.index(target)
            hitting_times.append(hitting_time)
        except ValueError:
            continue  # Did not hit target in 1000 steps.
    if hitting_times:
        return np.mean(hitting_times)
    else:
        return np.nan
  

def makeBootstrappedConfidenceIntervalOfStatisticT(dataset, statT, alpha, B=100):
    '''make a bootstrapped 1-alpha confidence interval for ANY given statistic statT
    from the dataset with B Bootstrap replications for 0 < alpha < 1, and
    return lower CI, upper CI, bootstrapped_samples '''
    n = len(dataset) # sample size of the original dataset
    bootstrappedStatisticTs=[] # list to store the statistic T from each bootstrapped data
    for b in range(B):
        #sample indices at random between 0 and n-1 to make the bootstrapped dataset
        randIndices=np.random.randint(0,n, size = n)  
        bootstrappedDataset = dataset[randIndices] # resample with replacement from original dataset
        bootstrappedStatisticT = statT(bootstrappedDataset)
        bootstrappedStatisticTs.append(bootstrappedStatisticT)
    # now get the [alpha/2, 1-alpha/2] percentile-based CI
    alphaAsPercentage=alpha*100.0
    lowerBootstrap1MinusAlphaCIForStatisticT = np.percentile(bootstrappedStatisticTs,alphaAsPercentage/2)
    upperBootstrap1MinusAlphaCIForStatisticT = np.percentile(bootstrappedStatisticTs,100-alphaAsPercentage/2)
    return (lowerBootstrap1MinusAlphaCIForStatisticT,upperBootstrap1MinusAlphaCIForStatisticT,
            np.array(bootstrappedStatisticTs))

# Test the functions
walk = simulate_random_walk(10, 0.6)
print(f"Simulated Random Walk: {walk}")

estimated_time = estimate_hitting_time(100, 5)
print(f"Estimated Hitting Time (p=0.5, target=5): {estimated_time}")

def calculate_hitting_time(positions, target):
    """Return the first index at which `positions` equals `target`, or NaN if never."""
    try:
        return positions.index(target)
    except ValueError:
        return np.nan


def estimate_hitting_time_p6(num_simulations, target, p):
    """Collect hitting times of `target` over walks that reach it within 1000 steps."""
    hitting_times = []
    for _ in range(num_simulations):
        positions = simulate_random_walk(1000, p)
        hitting_time = calculate_hitting_time(positions, target)
        if not np.isnan(hitting_time):
            hitting_times.append(hitting_time)
    return np.array(hitting_times)
        
hitting_times_p6 = estimate_hitting_time_p6(1000, 5, 0.6)
# estimate_hitting_time_p6 already drops walks that never hit the target, so no NaN filtering is needed
lower_ci, upper_ci, _ = makeBootstrappedConfidenceIntervalOfStatisticT(hitting_times_p6, np.mean, 0.05, 1000)
print(f"Estimated Hitting Time (p=0.6, target=5): {np.mean(hitting_times_p6)}")
print(f"95% Confidence Interval: ({lower_ci}, {upper_ci})")

# Answers for the exam
problem1_walk = walk
problem1_estimated_time = estimated_time
problem1_mean_hitting_time_p6 = np.mean(hitting_times_p6)
problem1_confidence_interval = (lower_ci, upper_ci)
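
One way to sanity-check the $p=0.6$ estimate is to compare it with the standard result that a right-biased walk ($p > 0.5$) reaches a positive target after `target / (2p - 1)` steps on average, which gives 25 for `target=5`. The sketch below also stops each walk as soon as the target is hit, instead of always simulating 1000 steps. The function name `first_hitting_time`, the seed, and the number of simulations are assumptions made for this example, not part of the exam.

```python
import numpy as np

def first_hitting_time(target, p, max_steps=1000, rng=None):
    """Return the first step at which the walk equals `target`, or None if
    the target is not reached within `max_steps` steps."""
    rng = np.random.default_rng() if rng is None else rng
    position = 0
    if position == target:
        return 0
    for step in range(1, max_steps + 1):
        position += 1 if rng.random() < p else -1
        if position == target:
            return step  # stop as soon as the target is hit
    return None

rng = np.random.default_rng(2020)  # hypothetical seed, for reproducibility only
raw_times = [first_hitting_time(5, 0.6, rng=rng) for _ in range(2000)]
times = np.array([t for t in raw_times if t is not None])

theoretical = 5 / (2 * 0.6 - 1)  # target / (2p - 1), valid for p > 0.5
print(f"Simulated mean hitting time: {times.mean():.2f}")
print(f"Theoretical mean hitting time: {theoretical:.2f}")
```

For $p=0.5$ no such finite value exists, because the expected hitting time of the symmetric walk is infinite. The estimate in part 2 is therefore a mean conditional on hitting the target within 1000 steps.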

## Problem 2:  Hypothesis Testing (15 points)

A company claims that their new battery lasts an average of 60 hours. You are tasked with testing their claim. You collect data from 25 randomly selected batteries and find that the sample mean battery life is 58 hours, with a sample standard deviation of 5 hours. Assume that the battery life is normally distributed.

1.  **(5 points)** State the null and alternative hypotheses for a two-sided test. Define all the parameters in the problem.

2.  **(5 points)** Calculate the test statistic using the appropriate test. Explain why you chose that particular test.

3.  **(5 points)** Calculate the p-value for this test. At a significance level of $\alpha=0.05$, state whether you reject or fail to reject the null hypothesis, and explain what that means in this context.

In [None]:
import numpy as np
from scipy import stats

# Given data
sample_mean = 58
sample_std = 5
sample_size = 25
population_mean = 60

# 1. Null and alternative hypotheses (Written in markdown cell below)

# 2. Calculate the test statistic
# We use a one-sample t-test because the population standard deviation is unknown
# and must be estimated from a small sample of normally distributed data
degrees_freedom = sample_size - 1
test_statistic = (sample_mean - population_mean) / (sample_std / np.sqrt(sample_size))

# 3. Calculate the p-value
# Two-sided p-value; the survival function sf = 1 - cdf is more accurate in the tails
p_value = 2 * stats.t.sf(abs(test_statistic), degrees_freedom)

alpha = 0.05
reject_null = p_value < alpha

print(f"Test statistic: {test_statistic}")
print(f"P-value: {p_value}")
print(f"Reject null hypothesis: {reject_null}")

# Answers for the exam
problem2_test_statistic = test_statistic
problem2_p_value = p_value
problem2_reject_null = reject_null
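
An equivalent way to reach the same decision is to check whether the claimed mean lies inside the 95% t-based confidence interval for the population mean. The sketch below assumes the same summary statistics as the cell above; the variable names are chosen for this example only.

```python
import numpy as np
from scipy import stats

sample_mean, sample_std, sample_size = 58, 5, 25
hypothesized_mean = 60
alpha = 0.05

standard_error = sample_std / np.sqrt(sample_size)
# Critical value of the t distribution with n - 1 degrees of freedom
t_critical = stats.t.ppf(1 - alpha / 2, df=sample_size - 1)

ci_lower = sample_mean - t_critical * standard_error
ci_upper = sample_mean + t_critical * standard_error
claim_inside_ci = ci_lower <= hypothesized_mean <= ci_upper

print(f"Critical value: {t_critical:.4f}")
print(f"95% CI for the mean: ({ci_lower:.3f}, {ci_upper:.3f})")
print(f"Claimed mean inside CI: {claim_inside_ci}")
```

If the claimed mean falls inside the interval, the two-sided test at the same level fails to reject the null hypothesis, and vice versa.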

**1. Null and alternative hypotheses:**

   *   **Null Hypothesis ($H_0$):** The true mean battery life is equal to 60 hours, i.e., $\mu = 60$.
   *   **Alternative Hypothesis ($H_1$):** The true mean battery life is not equal to 60 hours, i.e., $\mu \neq 60$.

   Where:
    *   $\mu$ is the population mean battery life.
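
A worked version of parts 2 and 3 might read as follows. The symbols $\bar{x}$ (sample mean), $s$ (sample standard deviation), $n$ (sample size), and $\mu_0$ (hypothesized mean) are notation introduced here for the given values 58, 5, 25, and 60.

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{58 - 60}{5 / \sqrt{25}} = -2, \qquad \text{df} = n - 1 = 24$$

The two-sided p-value is $2\,P(T_{24} \geq 2) \approx 0.057$, which is larger than $\alpha = 0.05$. Equivalently, $|t| = 2$ is below the critical value $t_{0.975,\,24} \approx 2.064$. We therefore fail to reject $H_0$: the sample does not provide sufficient evidence at the 5% level that the mean battery life differs from 60 hours. This does not prove that the claim is true.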

## Problem 3:  Naive Bayes Classifier (10 points)

Consider a simplified spam detection problem. Assume that emails are composed of words from a vocabulary of size $V$ and that each word $w_i$ appears in an email with frequency $n_i$. Let $C$ represent the class labels where $C = 1$ if the email is spam and $C = 0$ if the email is not spam. Assume the word occurrences are independent given the class label.
The simplified Naive Bayes classifier works as follows:

The probability of an email belonging to class $C$ is given by
$$P(C | \text{email}) \propto P(C) \prod_{i=1}^V P(w_i | C)^{n_i}$$

You have the following data:
*   $P(C=1) = 0.3$ (probability an email is spam)
*   $P(C=0) = 0.7$ (probability an email is not spam)
*   $P(\text{'free'} | C=1) = 0.2$
*   $P(\text{'free'} | C=0) = 0.01$
*   $P(\text{'money'} | C=1) = 0.1$
*   $P(\text{'money'} | C=0) = 0.001$
*   $P(\text{'game'} | C=1) = 0.05$
*   $P(\text{'game'} | C=0) = 0.005$

You receive an email containing the words "free", "money", and "game", each exactly once. All other words have a frequency of zero.

1.  **(5 points)** Compute the probability that the email is spam, i.e. $P(C=1 | \text{email})$, using the Naive Bayes classifier. You do not have to normalize the results.
2.  **(5 points)** Compute the probability that the email is not spam, i.e. $P(C=0 | \text{email})$ using the Naive Bayes classifier. You do not have to normalize the results.



In [None]:
# Given probabilities
p_spam = 0.3
p_not_spam = 0.7
p_free_spam = 0.2
p_free_not_spam = 0.01
p_money_spam = 0.1
p_money_not_spam = 0.001
p_game_spam = 0.05
p_game_not_spam = 0.005

# Unnormalized posterior for each class: prior times the product of word likelihoods
# (each of the three words is taken to appear once, so every exponent n_i is 1)
p_email_spam = p_spam * p_free_spam * p_money_spam * p_game_spam
p_email_not_spam = p_not_spam * p_free_not_spam * p_money_not_spam * p_game_not_spam

print(f"Unnormalized P(spam | email): {p_email_spam}")
print(f"Unnormalized P(not spam | email): {p_email_not_spam}")

# Answers for the exam
problem3_p_email_spam = p_email_spam
problem3_p_email_not_spam = p_email_not_spam
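
Although the problem does not require it, the two unnormalized scores can be turned into proper posterior probabilities by dividing each by their sum. The sketch below does this in log space, which is the usual way to avoid numerical underflow when many small word probabilities are multiplied. The dictionary layout and the assumption that each word appears once ($n_i = 1$) are choices made for this example.

```python
import math

priors = {"spam": 0.3, "not_spam": 0.7}
likelihoods = {
    "spam": {"free": 0.2, "money": 0.1, "game": 0.05},
    "not_spam": {"free": 0.01, "money": 0.001, "game": 0.005},
}
counts = {"free": 1, "money": 1, "game": 1}  # assumption: each word appears once

# log P(C) + sum_i n_i * log P(w_i | C) for each class
log_scores = {
    c: math.log(priors[c])
    + sum(n * math.log(likelihoods[c][w]) for w, n in counts.items())
    for c in priors
}

# Normalize with the log-sum-exp trick so the posteriors sum to 1
max_log = max(log_scores.values())
log_norm = max_log + math.log(
    sum(math.exp(s - max_log) for s in log_scores.values())
)
posteriors = {c: math.exp(s - log_norm) for c, s in log_scores.items()}

for c, prob in posteriors.items():
    print(f"P({c} | email) = {prob:.6f}")
```

Under these numbers the spam class dominates, since its unnormalized score is several orders of magnitude larger than the non-spam score.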