# MSDM5054 HW1 
## Written by LIU, Liangjie

## Problem 1: Basic Knowledge
## Answers:
### 1.
Wrong. Overfitting happens when a model captures not only the underlying pattern in the training data but also the noise, so it performs very well on the training data but generalizes poorly to unseen data. A training error of zero is a strong indicator of overfitting, but it is not a necessary condition: a model that is too complex relative to the amount of data can fit noise alongside the signal while still leaving a training error greater than zero.
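
A minimal sketch of this point, under assumed settings that are not part of the assignment (a linear signal, Gaussian noise, 30 training points, and a degree-12 polynomial as the overly flexible model). The flexible fit leaves a training error above zero, yet its error on fresh data is larger than its training error.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
sigma = 0.3  # assumed noise level


def true_f(x):
    # Hypothetical signal used only for this illustration
    return 2.0 * x


x_train = np.linspace(0, 1, 30)
y_train = true_f(x_train) + rng.normal(0, sigma, x_train.size)
x_test = np.linspace(0, 1, 2000)
y_test = true_f(x_test) + rng.normal(0, sigma, x_test.size)


def train_test_mse(degree):
    """Fit a polynomial of the given degree by least squares; return (train MSE, test MSE)."""
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse = np.mean((y_train - model(x_train)) ** 2)
    test_mse = np.mean((y_test - model(x_test)) ** 2)
    return train_mse, test_mse


train_mse_flex, test_mse_flex = train_test_mse(12)  # overly flexible model
train_mse_lin, test_mse_lin = train_test_mse(1)     # model matching the signal

print(f"degree 12: train MSE={train_mse_flex:.4f}, test MSE={test_mse_flex:.4f}")
print(f"degree  1: train MSE={train_mse_lin:.4f}, test MSE={test_mse_lin:.4f}")
```

The gap between training and test error for the degree-12 fit is the symptom of overfitting, even though its training error is not zero.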
### 2.
The variance of KNN decreases as K increases. K represents the number of nearest neighbors considered when making a prediction. In the context of model performance, variance refers to how much the model’s predictions fluctuate with different training datasets.

A larger K means the prediction is based on a broader set of neighbors, which smooths out the decision boundary. The model becomes less sensitive to fluctuations or noise in the training data, leading to more stable predictions across different datasets. 
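
The effect can be checked with a small simulation. This is only a sketch under assumed settings (a one-dimensional sine signal with Gaussian noise, 100 training points per dataset, and a fixed query point `x0`), none of which come from the assignment. It redraws the training set many times and measures how much the KNN prediction at `x0` varies.

```python
import numpy as np


def knn_predict(x_train, y_train, x0, k):
    """Average the responses of the k training points closest to x0 (1-D inputs)."""
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[nearest].mean()


def prediction_variance(k, n_datasets=300, n_points=100, x0=0.5, seed=0):
    """Variance of the KNN prediction at x0 across independently drawn training sets."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_datasets):
        x = rng.uniform(0, 1, n_points)
        y = np.sin(2 * np.pi * x) + rng.normal(0, 0.5, n_points)
        preds.append(knn_predict(x, y, x0, k))
    return np.var(preds)


variances = {k: prediction_variance(k) for k in (1, 5, 25)}
for k, v in variances.items():
    print(f"K={k:2d}: variance of prediction at x0 = {v:.4f}")
```

Averaging over more neighbors reduces the contribution of the noise to each prediction, which is why the measured variance shrinks as K grows.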
### 3.
The expected misclassification error of the Bayes Classifier is:

$$
\text{Expected Error} = \int \left[1 - \max_{j} q_j(x)\right] p(x) \, dx
$$

For each $x$, the Bayes classifier assigns $x$ to the class $j$ with the highest posterior probability $q_j(x)$. This prediction is correct with probability $\max_{j} q_j(x)$, so the probability of misclassification at $x$ is $1 - \max_{j} q_j(x)$. Integrating this misclassification probability over all possible $x$, weighted by the density $p(x)$, gives the expected error.
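
As a worked instance of the formula, the sketch below evaluates the integral numerically for an assumed toy problem: two classes with equal priors and class-conditional densities $N(-1, 1)$ and $N(1, 1)$. These choices are illustrative only. For this setup the Bayes error has the closed form $\Phi(-1)$, which gives a check on the numerical result.

```python
import numpy as np
from math import erf, sqrt


def normal_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


priors = np.array([0.5, 0.5])   # assumed class priors
means = np.array([-1.0, 1.0])   # assumed class means, unit variance

x = np.linspace(-10, 10, 200_001)
dx = x[1] - x[0]

# joint[j] = prior_j * f_j(x), so p(x) = sum_j joint[j] and q_j(x) = joint[j] / p(x)
joint = np.stack([pi * normal_pdf(x, mu) for pi, mu in zip(priors, means)])
p_x = joint.sum(axis=0)

# [1 - max_j q_j(x)] * p(x) simplifies to p(x) - max_j joint[j]
integrand = p_x - joint.max(axis=0)
bayes_error = np.sum(integrand) * dx

closed_form = 0.5 * (1 + erf(-1 / sqrt(2)))  # Phi(-1)
print(f"numerical: {bayes_error:.6f}, closed form: {closed_form:.6f}")
```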

### 4.
Proof:

1.	Minimize the Sum of Squared Residuals:
$$
SSR = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right)^2
$$
2.	Take the Partial Derivative with respect to $\hat{\beta}_0$ and set it to zero:
  
$$
\frac{\partial SSR}{\partial \hat{\beta}_0} = -2 \sum_{i=1}^{n} \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right) = 0
$$

Simplify to:

$$
\sum_{i=1}^{n} \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right) = 0
$$

Which implies:

$$
\sum_{i=1}^{n} e_i = 0
$$


Multiple linear regression: the sum of the residuals is also zero, by the same argument as in simple linear regression.

Proof:

1.	Minimize the Sum of Squared Residuals:
$$
SSR = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_{i1} - \hat{\beta}_2 X_{i2} - \cdots - \hat{\beta}_p X_{ip}\right)^2
$$
2.	Take the Partial Derivative with respect to $\hat{\beta}_0$ and set it to zero:
  
$$
\frac{\partial SSR}{\partial \hat{\beta}_0} = -2 \sum_{i=1}^{n} \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_{i1} - \hat{\beta}_2 X_{i2} - \cdots - \hat{\beta}_p X_{ip}\right) = 0
$$

Simplify to:

$$
\sum_{i=1}^{n} \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_{i1} - \hat{\beta}_2 X_{i2} - \cdots - \hat{\beta}_p X_{ip}\right) = 0
$$

Which implies:

$$
\sum_{i=1}^{n} e_i = 0
$$

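This shows that under the Ordinary Least Squares estimates the sum of the residuals is zero. The argument relies on the model containing an intercept $\hat{\beta}_0$, since the result comes from the first-order condition with respect to $\hat{\beta}_0$.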
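
The result can be verified numerically. The sketch below uses simulated data (the coefficients, sample size, and noise level are arbitrary assumptions) and fits least squares with `np.linalg.lstsq`, once with an intercept column and once without, to show that the zero-sum property depends on the intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
# Simulated response with a true intercept of 2.0 (arbitrary choice)
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=n)


def ols_residuals(design, response):
    """Least-squares residuals for the given design matrix."""
    beta, *_ = np.linalg.lstsq(design, response, rcond=None)
    return response - design @ beta


resid_with_intercept = ols_residuals(np.column_stack([np.ones(n), X]), y)
resid_no_intercept = ols_residuals(X, y)

print("sum of residuals, with intercept:   ", resid_with_intercept.sum())
print("sum of residuals, without intercept:", resid_no_intercept.sum())
```

With the intercept column the sum is zero up to floating-point error. Without it, nothing forces the residuals to sum to zero.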

### 5.
We can use R² to compare the performance of the two models.

R²: Measures the proportion of variance in the dependent variable that is predictable from the independent variables.

Adjusted R²: Adjusts R² based on the number of predictors in the model, penalizing for adding variables that do not improve the model sufficiently. It is particularly useful when models have different numbers of predictors.

When comparing models with the same number of predictors, R² provides a direct comparison of the proportion of variance explained without needing adjustment for differing model complexities.
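
A short sketch of why the adjustment does not matter here, using the standard formula $\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1}$. The sample size and R² values below are made-up numbers for illustration.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors (plus intercept) fitted on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


n = 100  # hypothetical sample size

# Same number of predictors: the penalty factor is identical, so the ranking by R² is preserved
same_p_a = adjusted_r2(0.80, n, p=5)
same_p_b = adjusted_r2(0.75, n, p=5)

# Different numbers of predictors: a slightly higher R² can still give a lower adjusted R²
more_predictors = adjusted_r2(0.805, n, p=20)

print(f"p=5,  R²=0.800 -> adjusted {same_p_a:.4f}")
print(f"p=5,  R²=0.750 -> adjusted {same_p_b:.4f}")
print(f"p=20, R²=0.805 -> adjusted {more_predictors:.4f}")
```

For fixed n and p, adjusted R² is an increasing function of R², so both criteria pick the same model.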

### 6.
Linear Regression

Advantages:

1. Easy to understand and interpret the relationship between predictors and the response variable through coefficients.
2. Generally fast to train and make predictions, even on large datasets.
3. A foundational method with a rich theoretical background and numerous extensions.

Disadvantages:

1. Assumes a linear relationship between predictors and the response, which may not hold in real-world data.
2. Outliers can disproportionately influence the model parameters, leading to biased estimates.
3. Cannot naturally capture interactions or complex non-linear relationships without feature engineering.

KNN Regression

Advantages:

1. Can model complex and non-linear relationships without assuming a specific functional form.
2. The model is essentially the training data, leading to instant updates when new data is added.
3. Easy to grasp the concept of making predictions based on local neighborhoods.

Disadvantages:

1. Requires calculating distances to all training points for each prediction, which can be slow with large datasets.
2. Requires careful preprocessing to ensure that all features contribute appropriately to distance calculations.
3. Selecting an optimal K is crucial; too small K can lead to overfitting, while too large K can oversmooth the predictions.
4. Unlike linear regression, KNN does not provide explicit relationships between predictors and the response.

In [1]:
# Problem 2: Investigation of Life Expectancy

import pandas as pd
import numpy as np
import statsmodels.api as sm
import math

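data = pd.read_csv("Life Expectancy Data.csv")

# Strip stray whitespace from the column names
data.columns = data.columns.str.strip()

# Drop the Country column so it is not used as a predictor
data = data.drop(columns=['Country'])

# Drop rows where the response is missing
data = data.dropna(subset=['Life expectancy'])

print(data.shape)

# Encode Status as a 0/1 dummy variable (1 = Developing)
data['Status'] = data['Status'].astype('category')
data = pd.get_dummies(data, columns=['Status'], drop_first=True)
data['Status_Developing'] = data['Status_Developing'].astype(int)

# Impute missing predictor values with column means, then add an intercept column
X = data.drop(columns=['Life expectancy'])
X_filled = X.fillna(X.mean())
X = sm.add_constant(X_filled)
y = data['Life expectancy']

# Fit the full model with all predictors
full_model = sm.OLS(y, X).fit()

print(full_model.summary())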

(2928, 21)
                            OLS Regression Results                            
Dep. Variable:        Life expectancy   R-squared:                       0.820
Model:                            OLS   Adj. R-squared:                  0.819
Method:                 Least Squares   F-statistic:                     664.0
Date:                Fri, 20 Sep 2024   Prob (F-statistic):               0.00
Time:                        23:47:54   Log-Likelihood:                -8239.5
No. Observations:                2928   AIC:                         1.652e+04
Df Residuals:                    2907   BIC:                         1.665e+04
Df Model:                          20                                         
Covariance Type:            nonrobust                                         
                                      coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------

In [2]:
conf_intervals = full_model.conf_int(alpha=0.05)
print("\n95% confidence interval:")
adult_mortality_ci = conf_intervals.loc['Adult Mortality']
hiv_aids_ci = conf_intervals.loc['HIV/AIDS']

print("\nAdult Mortality:", adult_mortality_ci.tolist())
print("HIV/AIDS:", hiv_aids_ci.tolist())

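# An interval entirely below (above) zero indicates a significant negative (positive) effect at the 5% level.
if adult_mortality_ci[0] < 0 and adult_mortality_ci[1] < 0:
    print("\nAdult Mortality has negative impact on the life expectancy.")
elif adult_mortality_ci[0] > 0 and adult_mortality_ci[1] > 0:
    print("\nAdult Mortality has positive impact on the life expectancy.")
else:
    print("\nAdult Mortality has no statistically significant impact on the life expectancy.")

if hiv_aids_ci[0] < 0 and hiv_aids_ci[1] < 0:
    print("HIV/AIDS has negative impact on the life expectancy.")
elif hiv_aids_ci[0] > 0 and hiv_aids_ci[1] > 0:
    print("HIV/AIDS has positive impact on the life expectancy.")
else:
    print("HIV/AIDS has no statistically significant impact on the life expectancy.")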


95% confidence interval:

Adult Mortality: [-0.021350525396983937, -0.01822835814112676]
HIV/AIDS: [-0.5053295472183785, -0.4360966923109017]

Adult Mortality has negative impact on the life expectancy.
HIV/AIDS has negative impact on the life expectancy.


In [3]:
conf_int_full_97 = full_model.conf_int(alpha=0.03)

schooling_ci = conf_int_full_97.loc['Schooling']
alcohol_ci = conf_int_full_97.loc['Alcohol']

print("\nSchooling 97% confidence intervals:", schooling_ci.values)
print("Alcohol 97% confidence intervals:", alcohol_ci.values)

if schooling_ci[0] > 0 and schooling_ci[1] > 0:
    print("Schooling has positive impact on the life expectancy.")
elif schooling_ci[0] < 0 and schooling_ci[1] < 0:
    print("Schooling has negative impact on the life expectancy.")
else:
    print("Schooling has no statistically significant impact on the life expectancy.")

if alcohol_ci[0] > 0 and alcohol_ci[1] > 0:
    print("Alcohol has positive impact on the life expectancy.")
elif alcohol_ci[0] < 0 and alcohol_ci[1] < 0:
    print("Alcohol has negative impact on the life expectancy.")
else:
    print("Alcohol has no statistically significant impact on the life expectancy.")


Schooling 97% confidence intervals: [0.58155514 0.76666183]
Alcohol 97% confidence intervals: [0.00418148 0.11756867]
Schooling has positive impact on the life expectancy.
Alcohol has positive impact on the life expectancy.


In [4]:
p_values = full_model.pvalues.drop('const')

sorted_p = p_values.sort_values()

top7_vars = sorted_p.index[:7].tolist()
print("\ntop7_vars:", top7_vars)

X_small = X[top7_vars + ['const']]

small_model = sm.OLS(y, X_small).fit()

print(small_model.summary())


top7_vars: ['HIV/AIDS', 'Adult Mortality', 'Schooling', 'under-five deaths', 'infant deaths', 'BMI', 'Income composition of resources']
                            OLS Regression Results                            
Dep. Variable:        Life expectancy   R-squared:                       0.791
Model:                            OLS   Adj. R-squared:                  0.791
Method:                 Least Squares   F-statistic:                     1583.
Date:                Fri, 20 Sep 2024   Prob (F-statistic):               0.00
Time:                        23:47:54   Log-Likelihood:                -8458.3
No. Observations:                2928   AIC:                         1.693e+04
Df Residuals:                    2920   BIC:                         1.698e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                                      coef    std err          t      P>|

In [5]:
new_observation = {
    'Year': 2008,
    'Adult Mortality': 125,
    'infant deaths': 94,
    'Alcohol': 4.1,
    'percentage_expenditure': 100,
    'Hepatitis_B': 20,
    'Measles': 13,
    'BMI': 55,
    'under-five deaths': 2,
    'Polio': 12,
    'Total_expenditure': 5.9,
    'Diphtheria': 12,
    'HIV/AIDS': 0.5,
    'GDP': 5892,
    'Population': 1.34e6,
    'thinness_1_19_years': np.nan, 
    'thinness_5_9_years': np.nan,
    'Income composition of resources': 0.9,
    'Schooling': 18
}

new_obs_df = pd.DataFrame([new_observation])

new_obs_df = sm.add_constant(new_obs_df,has_constant='add')

X_new = new_obs_df[top7_vars + ['const']]

prediction = small_model.get_prediction(X_new)
pred_summary = prediction.summary_frame(alpha=0.01)

print(pred_summary)

        mean   mean_se  mean_ci_lower  mean_ci_upper  obs_ci_lower  \
0  89.576442  0.794559      87.528455      91.624429     78.167474   

   obs_ci_upper  
0    100.985409  


In [6]:
full_model_aic = full_model.aic
small_model_aic = small_model.aic

print("\nfull_model_aic:", full_model_aic)
print("small_model_aic:", small_model_aic)

if small_model_aic < full_model_aic:
    print("\nSmall_model_aic is lower.")
else:
    print("\nFull_model_aic is lower.")


full_model_aic: 16521.085082746766
small_model_aic: 16932.62646207667

Full_model_aic is lower.
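
The AIC values above can be cross-checked by hand from the log-likelihoods printed in the two regression summaries. The sketch below assumes AIC = 2k - 2·log-likelihood with k counting the intercept plus the slope coefficients (Df Model + 1), which is consistent with the figures reported above. The log-likelihoods are the rounded values from the summaries, so agreement is only up to rounding.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * log_likelihood


# Rounded log-likelihoods and Df Model values taken from the summaries above
full_aic_check = aic(-8239.5, n_params=20 + 1)
small_aic_check = aic(-8458.3, n_params=7 + 1)

print(f"full model AIC  ~ {full_aic_check:.1f}")
print(f"small model AIC ~ {small_aic_check:.1f}")
```

The full model uses 13 more parameters, but its higher log-likelihood more than offsets the penalty, so its AIC is lower.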


In [7]:
# Problem 3: Implementing KNN regression

import numpy as np
import pandas as pd
import time

class KNNRegressor:

    def __init__(self, n_neighbors):
        self.n_neighbors = n_neighbors
        self.X_train = None
        self.y_train = None

    def fit(self, X, y):
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        # Predict each test point as the mean response of its n_neighbors nearest training points
        predictions = []
        for test_point in X:
            # Euclidean distance from the test point to every training point
            distances = np.linalg.norm(self.X_train - test_point, axis=1)
            neighbor_indices = np.argsort(distances)[:self.n_neighbors]
            neighbor_values = self.y_train[neighbor_indices]
            predictions.append(np.mean(neighbor_values))
        return np.array(predictions)

def mean_squared_error(y_true, y_pred):
    return np.mean((np.array(y_true) - np.array(y_pred)) ** 2)

def load_data(train_path, test_path):
    # Load datasets
    train_df = pd.read_csv(train_path)
    test_df = pd.read_csv(test_path)

    X_train = train_df.drop('medv', axis=1).values
    y_train = train_df['medv'].values
    X_test = test_df.drop('medv', axis=1).values
    y_test = test_df['medv'].values

    return X_train, y_train, X_test, y_test

def standardize_data(X_train, X_test):
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    
    # Avoid division by zero
    std_replaced = np.where(std == 0, 1, std)
    
    X_train_std = (X_train - mean) / std_replaced
    X_test_std = (X_test - mean) / std_replaced
    return X_train_std, X_test_std

def evaluate_knn(X_train, y_train, X_test, y_test, K_values):
    results = {}
    best_mse = float('inf')
    best_K = None

    for K in K_values:
        knn = KNNRegressor(n_neighbors=K)
        knn.fit(X_train, y_train)
        
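        # perf_counter is a monotonic, high-resolution clock suited to timing short intervals
        start_time = time.perf_counter()
        predictions = knn.predict(X_test)
        end_time = time.perf_counter()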
        
        mse = mean_squared_error(y_test, predictions)
        running_time = end_time - start_time
        
        results[K] = {'MSE': mse, 'Time': running_time}
        
        if mse < best_mse:
            best_mse = mse
            best_K = K

        print(f"K={K}: MSE={mse:.4f}, Time={running_time:.4f} seconds")

    return results, best_K

def main():
    train_path = 'boston_housing_train.csv'
    test_path = 'boston_housing_test.csv'

    X_train, y_train, X_test, y_test = load_data(train_path, test_path)

    K_values = list(range(1, 21))  # K from 1 to 20

    print("=== Without Standardization ===")
    results_no_std, best_K_no_std = evaluate_knn(X_train, y_train, X_test, y_test, K_values)
    print(f"Best K (No Standardization): {best_K_no_std} with MSE={results_no_std[best_K_no_std]['MSE']:.4f}")

    print("\n=== With Standardization ===")
    X_train_std, X_test_std = standardize_data(X_train, X_test)
    results_std, best_K_std = evaluate_knn(X_train_std, y_train, X_test_std, y_test, K_values)
    print(f"Best K (With Standardization): {best_K_std} with MSE={results_std[best_K_std]['MSE']:.4f}")

    improvement = results_std[best_K_std]['MSE'] < results_no_std[best_K_no_std]['MSE']
    if improvement:
        print("\nStandardization improved the performance.")
    else:
        print("\nStandardization did not improve the performance.")

if __name__ == "__main__":
    main()

=== Without Standardization ===
K=1: MSE=44.5171, Time=0.0054 seconds
K=2: MSE=46.0559, Time=0.0049 seconds
K=3: MSE=41.5232, Time=0.0047 seconds
K=4: MSE=40.8879, Time=0.0046 seconds
K=5: MSE=42.2411, Time=0.0046 seconds
K=6: MSE=43.8894, Time=0.0046 seconds
K=7: MSE=43.9851, Time=0.0046 seconds
K=8: MSE=42.8303, Time=0.0053 seconds
K=9: MSE=44.0434, Time=0.0047 seconds
K=10: MSE=45.6143, Time=0.0046 seconds
K=11: MSE=45.7835, Time=0.0046 seconds
K=12: MSE=45.8784, Time=0.0047 seconds
K=13: MSE=45.7650, Time=0.0046 seconds
K=14: MSE=46.5139, Time=0.0046 seconds
K=15: MSE=46.5349, Time=0.0046 seconds
K=16: MSE=48.1896, Time=0.0052 seconds
K=17: MSE=49.1327, Time=0.0048 seconds
K=18: MSE=48.9265, Time=0.0048 seconds
K=19: MSE=50.1288, Time=0.0047 seconds
K=20: MSE=51.0900, Time=0.0046 seconds
Best K (No Standardization): 4 with MSE=40.8879

=== With Standardization ===
K=1: MSE=25.4693, Time=0.0048 seconds
K=2: MSE=16.7776, Time=0.0049 seconds
K=3: MSE=19.7349, Time=0.0047 seconds
K=4: 