# Homework 2

## Follow These Steps Before Submitting
Once you are finished, complete the following steps.

1.  Restart your kernel by clicking 'Kernel' > 'Restart & Run All'.

2.  Fix any errors which result from this.

3.  Repeat steps 1. and 2. until your notebook runs without errors.

4.  Submit your completed notebook to OWL by the deadline.


# 1. Fridge Light Failure

Imagine that you work for a company that sells fridges, with a lifetime warranty for the fridge lights. Your boss is interested in understanding the distribution of the number of lights that will fail in a given month, based on data collected over the last several years. The data were collected by hand and manually entered, so data entry errors are possible. There are also some months with missing data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline

from scipy.stats import zscore
from scipy.special import factorial
import scipy.optimize as so

In [None]:
# Uncomment the line below if you are using Google colab
# !gdown https://drive.google.com/uc?id=1wzY4XdkcwdNdVGL42PEN0ORtzWcdj2VF

1. Read the CSV file using Pandas and store it. All you've been given are a meaningless ID and the counts of the number of fridge lights that failed in each month.

In [None]:
df = pd.read_csv('Fridge Light Data.csv')
print(df.head())
print(f"\nshape: {df.shape}")

2. Count the number of null values in the dataset, then remove them.

In [None]:
null_count = df['x'].isnull().sum()
print(f"number of null values: {null_count}")

df_clean = df.dropna()
print(f"\nshape after removing nulls: {df_clean.shape}")

Number of null values: 3

3. Plot the distribution of the data. Add a title and axes labels to your plot.

In [None]:
plt.figure(figsize=(10, 6))
plt.hist(df_clean['x'], bins=30, edgecolor='black', alpha=0.7)
plt.xlabel('number of fridge light failures')
plt.ylabel('frequency')
plt.title('distribution of fridge light failures per month')
plt.grid(True, alpha=0.3)
plt.show()

4. Count the number of observations that you know with **certainty** are data entry errors, then remove them. Explain why you know they are errors.

In [None]:
certain_errors = df_clean[df_clean['x'] < 0]
print("certain data entry errors (negative values):")
print(certain_errors)

num_certain_errors = len(certain_errors)
print(f"\nnumber of certain errors: {num_certain_errors}")

df_clean = df_clean[df_clean['x'] >= 0]
print(f"\nshape after removing certain errors: {df_clean.shape}")

Number of certain data entry errors: 3

These are data entry errors because the number of fridge light failures cannot be negative. You cannot have -1 or -2 failures: counts must be non-negative integers.
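
The same reasoning extends to fractional values, since a month cannot have 12.5 failures. Whether the real data contain any such values is not checked above. A minimal sketch of the check, using a hypothetical list `sample_counts` and a hypothetical helper `is_valid_count` rather than the actual data:

```python
# Hypothetical values for illustration only; not taken from the fridge light data.
sample_counts = [31.0, 28.0, -2.0, 12.5, 30.0]

def is_valid_count(value):
    """A valid count is a non-negative whole number."""
    return value >= 0 and float(value).is_integer()

# Collect every value that cannot possibly be a count.
invalid = [v for v in sample_counts if not is_valid_count(v)]
print(f"values that cannot be counts: {invalid}")
```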

5. Compute the negative log-likelihood based on the assumption that the data comes from a Poisson($\lambda$) distribution. The negative log-likelihood for a Poisson($\lambda$) distribution is as follows:

$$
nll(\lambda) = n \lambda - \left(\sum_{i=1}^{n}x_i\right) \log(\lambda) + \sum_{i=1}^{n}\log(x_i!)
$$

In [None]:





def poissonNegLogLikelihood(lam, data):
    n = len(data)
    nll = n * lam - np.sum(data) * np.log(lam) + np.sum(np.log(factorial(data)))
    return nll

dummy_data = pd.DataFrame({'x': [20, 22, 18, 6, 8]})
result = poissonNegLogLikelihood(25, dummy_data.x)
print(f"negative log-likelihood for dummy data: {result}")
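
A floating-point factorial overflows for counts above 170, at which point `log(x!)` becomes infinite. One possible alternative, sketched here with only the standard library, relies on the identity log(x!) = lgamma(x + 1). The function name `poisson_nll_lgamma` is an assumption made for this illustration and is not part of the assignment.

```python
import math

def poisson_nll_lgamma(lam, data):
    """Poisson negative log-likelihood using lgamma(x + 1) in place of log(x!)."""
    n = len(data)
    log_factorials = sum(math.lgamma(x + 1) for x in data)
    return n * lam - sum(data) * math.log(lam) + log_factorials

dummy = [20, 22, 18, 6, 8]

# Direct evaluation with exact integer factorials, for comparison.
direct = (len(dummy) * 25
          - sum(dummy) * math.log(25)
          + sum(math.log(math.factorial(x)) for x in dummy))

print(poisson_nll_lgamma(25, dummy), direct)

# A count of 500 is far past the point where a float factorial overflows,
# yet the lgamma form stays finite.
print(math.isfinite(poisson_nll_lgamma(25, [500])))
```

The two printed values should agree to within floating-point error, so the rewrite changes numerical behaviour for large counts only, not the result.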

6. Consider integer values of $\lambda$ ranging from 10 to 50. Compute the negative log-likelihood for each of these values and plot it. Include a title and axes labels on your plot. Based on the plot, what value of $\lambda$ is most likely to have generated the data? You may give a small range.

In [None]:
lambda_values = np.arange(10, 51)
nll_values = [poissonNegLogLikelihood(lam, df_clean['x']) for lam in lambda_values]

plt.figure(figsize=(10, 6))
plt.plot(lambda_values, nll_values, 'b-', linewidth=2)
plt.xlabel('lambda')
plt.ylabel('negative log-likelihood')
plt.title('negative log-likelihood vs lambda')
plt.grid(True, alpha=0.3)
plt.show()

min_idx = np.argmin(nll_values)
best_lambda_range = lambda_values[min_idx]
print(f"\nlambda with minimum nll: {best_lambda_range}")

Value of $\lambda$ most likely to have generated the data: 29-31

7. Compute the negative log-likelihood based on the assumption that the data comes from a Normal($\mu$, $\sigma$) distribution. The negative log-likelihood for a Normal($\mu$, $\sigma$) distribution is as follows:

$$
nll(\mu, \sigma) = \frac{n}{2}\log(2\pi) + \frac{n}{2}\log(\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2
$$

In [None]:





def normalNegLogLikelihood(params, data):
    mu, sigma = params
    n = len(data)
    nll = (n/2) * np.log(2*np.pi) + (n/2) * np.log(sigma**2) + (1/(2*sigma**2)) * np.sum((data - mu)**2)
    return nll

dummy_data = pd.DataFrame({'x': [20, 22, 18, 6, 8]})
result = normalNegLogLikelihood([25, 3], dummy_data.x)
print(f"negative log-likelihood for dummy data: {result}")

8. Determine the exact value of $\lambda$ that minimizes the negative log-likelihood for the Poisson distribution. Use so.minimize with method="Powell" and without a Jacobian.

In [None]:
result_poisson = so.minimize(lambda lam: poissonNegLogLikelihood(lam, df_clean['x']), 
                              x0=30, 
                              method='Powell')

optimal_lambda = result_poisson.x[0]
print(f"optimal lambda: {optimal_lambda}")
print(f"minimum negative log-likelihood: {result_poisson.fun}")

Value of $\lambda$ that minimizes the negative log-likelihood: 29.85
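
The optimizer's answer can be cross-checked analytically. Differentiating the Poisson negative log-likelihood from question 5 with respect to $\lambda$ and setting the derivative to zero gives the sample mean:

$$
\frac{d}{d\lambda}\,nll(\lambda) = n - \frac{1}{\lambda}\sum_{i=1}^{n}x_i = 0
\quad\Longrightarrow\quad
\hat{\lambda} = \frac{1}{n}\sum_{i=1}^{n}x_i
$$

The second derivative, $\frac{1}{\lambda^2}\sum_{i=1}^{n}x_i$, is positive whenever at least one count is nonzero, so this stationary point is a minimum. The value returned by `so.minimize` should therefore agree with `df_clean['x'].mean()` up to the optimizer's tolerance.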

9. Determine the set of values of $\mu$ and $\sigma$ that minimizes the negative log-likelihood for the Normal distribution.

In [None]:
result_normal = so.minimize(lambda params: normalNegLogLikelihood(params, df_clean['x']), 
                            x0=[30, 5], 
                            method='Powell')

optimal_mu = result_normal.x[0]
optimal_sigma = result_normal.x[1]
print(f"optimal mu: {optimal_mu}")
print(f"optimal sigma: {optimal_sigma}")
print(f"minimum negative log-likelihood: {result_normal.fun}")

Values for $\mu$ and $\sigma$ that minimize the negative log-likelihood: 29.85, 7.42
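
For the Normal model the minimizer also has a closed form: $\mu$ is the sample mean and $\sigma$ is the population standard deviation (dividing by $n$, not $n-1$). The sketch below runs on the dummy data from question 7 rather than the real data, and uses a hypothetical helper `normal_nll`. It checks that nudging either parameter away from the closed-form values never lowers the negative log-likelihood.

```python
import math
from statistics import fmean, pstdev

def normal_nll(mu, sigma, data):
    """Normal negative log-likelihood, following the formula in question 7."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return ((n / 2) * math.log(2 * math.pi)
            + (n / 2) * math.log(sigma ** 2)
            + squared_error / (2 * sigma ** 2))

dummy = [20, 22, 18, 6, 8]
mu_hat = fmean(dummy)      # sample mean
sigma_hat = pstdev(dummy)  # population standard deviation (divides by n)
best = normal_nll(mu_hat, sigma_hat, dummy)

# Perturb each parameter slightly in both directions.
neighbours = [normal_nll(mu_hat + dm, sigma_hat + ds, dummy)
              for dm in (-0.1, 0.0, 0.1)
              for ds in (-0.1, 0.0, 0.1)]

print(f"mu_hat={mu_hat:.3f}, sigma_hat={sigma_hat:.3f}")
print(all(value >= best for value in neighbours))
```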


10. There are three data entry errors that could potentially be valid entries. Which three are most likely to be data entry errors? Identify them by their value (i.e., not their index in the data).

In [None]:
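# assign returns a new DataFrame, avoiding a SettingWithCopyWarning on the filtered frame
df_clean = df_clean.assign(zscore=zscore(df_clean['x']))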
df_clean_sorted = df_clean.sort_values('zscore', ascending=False)

print("observations with highest z-scores:")
print(df_clean_sorted[['ID', 'x', 'zscore']].head(10))

print("\nthree most likely data entry errors:")
top_3_errors = df_clean_sorted.head(3)
print(top_3_errors[['ID', 'x', 'zscore']])

Three most likely data entry errors: 63, 61, and 52
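
`scipy.stats.zscore` standardizes each value as (x - mean) / std, using the population standard deviation by default. Sorting in descending order surfaces only unusually large counts, whereas ranking by absolute z-score would also catch unusually small ones. A minimal sketch on hypothetical values (not the assignment data):

```python
from statistics import fmean, pstdev

# Hypothetical monthly counts for illustration only.
counts = [28, 31, 30, 27, 63, 29, 2, 32]

mean = fmean(counts)
std = pstdev(counts)  # population standard deviation, matching zscore's default ddof=0
z_scores = {x: (x - mean) / std for x in counts}

# Rank by distance from the mean in either direction.
ranked = sorted(counts, key=lambda x: abs(z_scores[x]), reverse=True)
print(ranked[:2])
```

Here both the very large and the very small value rise to the top of the ranking, which a descending sort on the signed z-score would not show.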

11. Remove the three observations that you think are most likely to be data entry errors. Determine the parameters that minimize the negative log-likelihoods for the Poisson($\lambda$) and Normal($\mu$, $\sigma$) distributions. Do your results change in a meaningful way?

In [None]:
df_final = df_clean[~df_clean['x'].isin([63, 61, 52])]
print(f"shape after removing three outliers: {df_final.shape}")

result_poisson_final = so.minimize(lambda lam: poissonNegLogLikelihood(lam, df_final['x']), 
                                   x0=30, 
                                   method='Powell')

optimal_lambda_final = result_poisson_final.x[0]
print(f"\noptimal lambda (after removing outliers): {optimal_lambda_final}")

result_normal_final = so.minimize(lambda params: normalNegLogLikelihood(params, df_final['x']), 
                                  x0=[30, 5], 
                                  method='Powell')

optimal_mu_final = result_normal_final.x[0]
optimal_sigma_final = result_normal_final.x[1]
print(f"optimal mu (after removing outliers): {optimal_mu_final}")
print(f"optimal sigma (after removing outliers): {optimal_sigma_final}")

The parameters changed slightly after removing the outliers. $\lambda$ decreased from 29.85 to around 29.00, $\mu$ decreased from 29.85 to around 29.00, and $\sigma$ decreased from 7.42 to around 5.70. This makes sense because removing extreme high values reduces both the mean and the variance.

In [None]:
print("comparison of models:")
print(f"poisson nll: {result_poisson_final.fun}")
print(f"normal nll: {result_normal_final.fun}")
print(f"\ndifference (normal - poisson): {result_normal_final.fun - result_poisson_final.fun}")

12. Between the Poisson and Normal models, which one do you think is best to use to represent the data? Why? Provide your answer in no more than two lines.

The Poisson model is the better choice: it has a lower negative log-likelihood, and it is designed for non-negative integer counts of events, which matches these data better than a continuous Normal distribution.