## Important instruction

For programming exercises, edit only the code inside the marked region, as shown in the following format.

```
##############################################

#Edit the following code

var1 = 3
var2 = 4
print(var1 + var2)

##############################################
```

You are welcome to experiment with other parts of the code, but points are awarded only for answering the question asked, which requires nothing more than completing or changing the code inside the region marked in the format above.

## Question 7: Maximum Likelihood Estimation for Gaussian Distribution

You are given a dataset `data_points` that follows a Gaussian distribution. Your task is to compute the mean, the variance, and the log likelihood by completing the functions `calculate_mean_variance` and `gaussian_log_likelihood`.

*For more information, refer to **4.2.3 Gaussian (Normal) Density** section of **Introduction to Machine Learning, 4th Edition, The MIT Press, 2020** by Ethem Alpaydın.*
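
For reference, the quantities the two functions are expected to compute can be written as below. This is the standard Gaussian maximum likelihood result; the symbols $N$, $x^t$, $m$, and $s^2$ are notation assumed here and may differ from the course slides. The variance estimate divides by $N$ rather than $N-1$, which matches the default behavior of `np.var`.

```latex
\begin{align}
m &= \frac{1}{N}\sum_{t=1}^{N} x^t, \qquad
s^2 = \frac{1}{N}\sum_{t=1}^{N} \left(x^t - m\right)^2 \\
\log L(\mu, \sigma^2 \mid \mathcal{X})
  &= -\frac{N}{2}\log\left(2\pi\sigma^2\right)
     - \frac{\sum_{t=1}^{N}\left(x^t - \mu\right)^2}{2\sigma^2}
\end{align}
```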

In [17]:
import numpy as np

def calculate_mean_variance(data_points):
    n = len(data_points)
    ############################################################
    #Edit the following code

    mean = np.sum(data_points) / n
    variance = np.var(data_points)  # or np.sum((data_points - mean) ** 2) / n
    
    ############################################################
    return mean, variance

def gaussian_log_likelihood(mean, variance, data_points):
    n = len(data_points)
    ############################################################
    #Edit the following code

    log_likelihood = -n / 2 * np.log(2 * np.pi * variance) - np.sum((data_points - mean) ** 2) / (2 * variance)
    ############################################################
    return log_likelihood

data_points = np.array([1.2, 2.5, 3.8, 4.2, 5.6])

# Calculating mean and variance
mean, variance = calculate_mean_variance(data_points)

# Calculating log likelihood
log_likelihood = gaussian_log_likelihood(mean, variance, data_points)

print("Mean:", mean)
print("Variance:", variance)
print("Log Likelihood:", log_likelihood)

Mean: 3.4599999999999995
Variance: 2.2543999999999995
Log Likelihood: -9.12690232142906
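
As a sanity check on the printed values, the sketch below recomputes them with the Python standard library only, summing per-point log densities instead of using the closed-form expression. It also illustrates one consequence of plugging in the maximum likelihood estimates: the squared-error term equals `n / 2`, so the log likelihood reduces to `-n/2 * (log(2*pi*variance) + 1)`. The variable name `closed_form` is introduced here for illustration.

```python
import math
from statistics import NormalDist, fmean, pvariance

data_points = [1.2, 2.5, 3.8, 4.2, 5.6]
n = len(data_points)

# Maximum likelihood estimates: sample mean and population variance (divide by n).
mean = fmean(data_points)
variance = pvariance(data_points, mu=mean)

# Log likelihood as the sum of per-point log densities.
dist = NormalDist(mu=mean, sigma=math.sqrt(variance))
log_likelihood = sum(math.log(dist.pdf(x)) for x in data_points)

# At the maximum likelihood estimates the squared-error term equals n / 2,
# so the log likelihood simplifies to the expression below.
closed_form = -n / 2 * (math.log(2 * math.pi * variance) + 1)

print("Mean:", mean)
print("Variance:", variance)
print("Log Likelihood (sum of log densities):", log_likelihood)
print("Log Likelihood (simplified form):", closed_form)
```

Both log likelihood values should agree with the notebook output up to floating-point rounding.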


## Question 8: Analyzing Bias, Variance, and Error in Regression Models

Consider a dataset of oceanographic temperature and salinity measurements. The dataset has been loaded and split into training and testing sets. Three regression models have been applied to predict salinity based on temperature: Linear Regression, Polynomial Regression of degree 2, and Polynomial Regression of degree 5. The `calculate_metrics` function has been defined to compute bias, squared bias, variance, and error for each model.

Analyze the performance metrics (bias, squared bias, variance, and error) for each regression model obtained by completing the `calculate_metrics` function. Compare and contrast the models in terms of their bias, variance, and overall predictive performance.


*For more information, refer to the **4.7 Tuning Model Complexity: Bias/Variance Dilemma** section of **Introduction to Machine Learning, 4th Edition, The MIT Press, 2020** by Ethem Alpaydın and the [numpy documentation](https://numpy.org/doc/stable/reference/routines.statistics.html) for the relevant statistics functions.*

In [20]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def calculate_metrics(model, X_test, y_test):
    ############################################################
    #Edit the following code
    y_pred = model.predict(X_test)
    true_mean = np.mean(y_test)
    bias = np.mean(y_pred) - true_mean
    bias_squared = bias ** 2
    variance = np.var(y_pred)
    error = np.mean((y_test - y_pred) ** 2)
    #do not use mean_squared_error from scikit learn package
    ############################################################
    return bias, bias_squared, variance, error

# Load the oceanographic dataset from the provided link
url = "https://raw.githubusercontent.com/JakeMWu/single-linear-regression-CalCOFI-oceanographic-data/main/tempsal.csv"
oceanographic_data = pd.read_csv(url, nrows=800)  # Use only the first 800 rows

# Remove rows with NaN values in 'T_degC' or 'Salnty'
oceanographic_data = oceanographic_data.dropna(subset=['T_degC', 'Salnty'])

# Use 'T_degC' as the feature and 'Salnty' as the target variable
X = oceanographic_data['T_degC'].values.reshape(-1, 1)
y = oceanographic_data['Salnty'].values

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear Regression
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Polynomial Regression (Degree 2)
poly_model = make_pipeline(PolynomialFeatures(2), LinearRegression())
poly_model.fit(X_train, y_train)

# Polynomial Regression (Degree 5)
poly_model5 = make_pipeline(PolynomialFeatures(5), LinearRegression())
poly_model5.fit(X_train, y_train)

# Calculate metrics for each model
bias_linear, _, variance_linear, error_linear = calculate_metrics(linear_model, X_test, y_test)
bias_poly, _, variance_poly, error_poly = calculate_metrics(poly_model, X_test, y_test)
bias_poly5, _, variance_poly5, error_poly5 = calculate_metrics(poly_model5, X_test, y_test)

# Print performance metrics
print("Linear Regression:")
print(f"Bias: {bias_linear:.3f}, Variance: {variance_linear:.3f}, Error: {error_linear:.3f}\n")

print("Polynomial Regression (Degree 2):")
print(f"Bias: {bias_poly:.3f}, Variance: {variance_poly:.3f}, Error: {error_poly:.3f}\n")

print("Polynomial Regression (Degree 5):")
print(f"Bias: {bias_poly5:.3f}, Variance: {variance_poly5:.3f}, Error: {error_poly5:.3f}\n")

Linear Regression:
Bias: -0.003, Variance: 0.201, Error: 0.064

Polynomial Regression (Degree 2):
Bias: -0.007, Variance: 0.205, Error: 0.061

Polynomial Regression (Degree 5):
Bias: -0.001, Variance: 0.225, Error: 0.036
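
To make the four metrics concrete, here is a minimal standard-library sketch of the same definitions used in `calculate_metrics`, applied to a handful of made-up numbers. The `y_test` and `y_pred` values are hypothetical and are not taken from the dataset. Under these definitions, "variance" is the spread of the predictions across test points, so the error does not have to equal squared bias plus variance; the printed results above show the same thing, with a variance (0.201) larger than the error (0.064) for the linear model.

```python
from statistics import fmean, pvariance

# Hypothetical values for illustration only; not taken from the dataset.
y_test = [33.0, 33.5, 34.0, 34.5]
y_pred = [33.2, 33.4, 34.1, 34.7]

# Bias: difference between the average prediction and the average true value.
bias = fmean(y_pred) - fmean(y_test)
bias_squared = bias ** 2

# Variance: spread of the predictions around their own mean (divide by n).
variance = pvariance(y_pred)

# Error: mean squared difference between true values and predictions.
error = fmean([(t - p) ** 2 for t, p in zip(y_test, y_pred)])

print(f"Bias: {bias:.3f}, Squared bias: {bias_squared:.3f}, "
      f"Variance: {variance:.4f}, Error: {error:.3f}")
```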



### **Answer the following question with a brief reasoning.**
Based on the bias, variance, and error results, explain which regression model is best suited for the above data and why.

Polynomial Regression (Degree 5) is the best suited of the three models for this data because it has the lowest test error (0.036, versus 0.064 for Linear Regression and 0.061 for Degree 2), a reduction of roughly 44% relative to the linear model. Its variance is only slightly higher (0.225, versus 0.201 and 0.205), so the extra flexibility does not appear to cost much on the held-out data. The biases are very small for all three models (-0.003 for Linear, -0.007 for Degree 2, and -0.001 for Degree 5), which means each model's predictions are, on average, very close to the mean of the actual values; bias therefore does not separate the models, and the error does. The complexity added by the degree-5 polynomial features is justified by the gain in fit, although a higher-degree model still carries a greater risk of overfitting.

## Question 9: AIC and BIC Evaluation in Regression Analysis - Model Performance

A regression analysis is performed in the following code using three different models: Linear Regression, Lasso Regression, and Ridge Regression. Calculate the AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) for each model by completing the ```calculate_aic_bic``` function.

*For more information, refer to the **4.8 Model Selection Procedures** section of **Introduction to Machine Learning, 4th Edition, The MIT Press, 2020** by Ethem Alpaydın and the Lecture 05: Parametric Methods slides from class.*

*Further reading on https://vitalflux.com/aic-vs-bic-for-regression-models-formula-examples/*

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error
from tabulate import tabulate

def calculate_aic_bic(y_true, y_pred, num_params, n):
    mse = mean_squared_error(y_true, y_pred)
    log_likelihood = -0.5 * len(y_true) * np.log(2 * np.pi * mse) - 0.5 * len(y_true)
    ############################################################
    #Edit the following code
    aic = 2 * num_params - 2 * log_likelihood
    bic = np.log(n) * num_params - 2 * log_likelihood
    ############################################################
    return aic, bic, mse

# Generating sample data
np.random.seed(42)
X = np.random.rand(100, 5)
y = 2*X[:, 0] + 3*X[:, 1] - 4*X[:, 2] + np.random.randn(100)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear Regression
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
linear_predictions = linear_model.predict(X_test)
linear_params = X_train.shape[1] + 1  # Including the intercept
linear_aic, linear_bic, linear_mse = calculate_aic_bic(y_test, linear_predictions, linear_params, len(y_test))

# Lasso Regression
lasso_model = Lasso(alpha=0.01)
lasso_model.fit(X_train, y_train)
lasso_predictions = lasso_model.predict(X_test)
lasso_params = np.count_nonzero(lasso_model.coef_) + 1  # Including the intercept
lasso_aic, lasso_bic, lasso_mse = calculate_aic_bic(y_test, lasso_predictions, lasso_params, len(y_test))

# Ridge Regression
ridge_model = Ridge(alpha=0.01)
ridge_model.fit(X_train, y_train)
ridge_predictions = ridge_model.predict(X_test)
ridge_params = X_train.shape[1] + 1  # Including the intercept
ridge_aic, ridge_bic, ridge_mse = calculate_aic_bic(y_test, ridge_predictions, ridge_params, len(y_test))

# Create a table to print the values
table = [["Linear Regression", linear_aic, linear_bic, linear_mse],
         ["Lasso Regression", lasso_aic, lasso_bic, lasso_mse],
         ["Ridge Regression", ridge_aic, ridge_bic, ridge_mse]]

headers = ["Model", "AIC", "BIC", "MSE"]
print(tabulate(table, headers, tablefmt="grid"))

+-------------------+---------+---------+----------+
| Model             |     AIC |     BIC |      MSE |
+===================+=========+=========+==========+
| Linear Regression | 68.2704 | 74.2448 | 0.975938 |
+-------------------+---------+---------+----------+
| Lasso Regression  | 67.9644 | 73.9388 | 0.96112  |
+-------------------+---------+---------+----------+
| Ridge Regression  | 68.2765 | 74.2509 | 0.976235 |
+-------------------+---------+---------+----------+
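
The table values can be reproduced from the reported MSEs alone. The sketch below uses only the standard library and assumes `n = 20` test samples (20% of the 100 generated rows) and 6 parameters for every model (5 coefficients plus the intercept); the Lasso count is inferred from the fact that BIC minus AIC is the same for all three rows. The helper name `aic_bic_from_mse` is hypothetical and not part of the assignment code.

```python
import math

def aic_bic_from_mse(mse, num_params, n):
    """Gaussian log likelihood at the MLE variance (= mse), then AIC and BIC."""
    log_likelihood = -0.5 * n * math.log(2 * math.pi * mse) - 0.5 * n
    aic = 2 * num_params - 2 * log_likelihood
    bic = math.log(n) * num_params - 2 * log_likelihood
    return aic, bic

n_test = 20      # assumed: 20% of the 100 generated samples
num_params = 6   # assumed: 5 coefficients + intercept for every model

# MSE values copied from the table above.
reported_mse = {
    "Linear Regression": 0.975938,
    "Lasso Regression": 0.96112,
    "Ridge Regression": 0.976235,
}

for name, mse in reported_mse.items():
    aic, bic = aic_bic_from_mse(mse, num_params, n_test)
    print(f"{name}: AIC={aic:.4f}, BIC={bic:.4f}")
```

Because the parameter count and sample size are the same for every model here, the ranking by AIC or BIC is the same as the ranking by MSE.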


### **Answer the following question with a brief reasoning.**
Determine the best model based on the AIC and BIC results. Provide reasoning for your choice.

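Lasso Regression is the best model by both criteria: it has the lowest AIC (67.9644) and the lowest BIC (73.9388), and lower values indicate a better balance between fit and complexity. The margin is small, about 0.31 on each criterion relative to Linear Regression. The gap between BIC and AIC is the same for all three models (about 5.97), so the complexity penalty is identical and Lasso did not drop any coefficients at `alpha=0.01`; its advantage comes entirely from its slightly lower MSE (0.96112). Lasso's regularization shrinks the coefficients, which can reduce overfitting and, at larger `alpha`, eliminate less important features, so it remains a reasonable choice for generalizability and simplicity on this dataset.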

## Question 10: Linear Regression with K-Fold Cross-Validation

Your task is to implement a linear regression model with k-fold cross-validation using the scikit-learn package for Python. Please follow the instructions below to complete the exercise.

1. **Split the Data:**
   - Use `train_test_split` to hold out a test set with a test size of 20% and a random state of 99. The remaining 80% is the combined training + validation set that k-fold cross-validation divides into folds.

2. **Set Up K-Fold Cross-Validation:**
   - Set the number of folds for cross-validation to 5.
   - Initialize a KFold cross-validator with shuffling and a random state of 99.

3. **Initialize Linear Regression Model:**
   - Create an instance of the linear regression model.

4. **Perform K-Fold Cross-Validation:**
   - Use a loop to iterate through the folds generated by KFold.
   - For each fold, train the model on the training data and make predictions on both the training and validation sets.
   - For each fold, calculate mean squared errors.
   - Append the errors to lists for further analysis.

5. **Fit the Model on the Entire Training + Validation Set:**
   - After cross-validation, fit the model on the entire training + validation set to utilize all available data for training.

6. **Make Predictions on the Test Set:**
   - Use the trained model to make predictions on the test set.
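
To show what "iterating through the folds" in step 4 means, here is a minimal pure-Python sketch of how a shuffled k-fold split can generate training and validation indices. It is not scikit-learn's implementation, and its shuffle order will not match `KFold(..., random_state=99)`; the function name `k_fold_indices` is hypothetical.

```python
import random

def k_fold_indices(n_samples, n_splits, seed):
    """Yield (train_indices, val_indices) pairs for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    fold_sizes = [base + (1 if i < extra else 0) for i in range(n_splits)]

    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, n_splits=5, seed=99))
for fold_number, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: train={sorted(train_idx)}, val={sorted(val_idx)}")
```

Each sample lands in exactly one validation fold, so every row of the training + validation set is used for validation once and for training in the other four folds.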


For more information, refer to https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html and https://www.kaggle.com/code/jnikhilsai/cross-validation-with-linear-regression

In [25]:
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the dataset
df = pd.read_csv('https://raw.githubusercontent.com/KrishnaTejaJ/datasets-CSCI-B455/main/assignment%202/question12/linear_data.csv')

# Extract features and target variable
X = df.drop('Chance of Admit', axis=1)
y = df['Chance of Admit']

# Lists to store performance metrics for each fold
train_errors, val_errors = [], []

############################################################
#Write the code for the corresponding instructions mentioned

# Split the data into a combined training + validation set and a held-out test set
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size = 0.2, random_state = 99)

# Set the number of folds for cross-validation
k_folds = 5

# Initialize KFold cross-validator
kf = KFold(n_splits=k_folds, shuffle = True, random_state= 99)

# Initialize linear regression model
model = LinearRegression()

# Perform k-fold cross-validation
for train_index, val_index in kf.split(X_train_val):
    X_train, X_val = X_train_val.iloc[train_index], X_train_val.iloc[val_index]
    y_train, y_val = y_train_val.iloc[train_index], y_train_val.iloc[val_index]

    # Fit the model on the training data
    model.fit(X_train, y_train)

    # Make predictions on the training and validation sets
    y_train_pred = model.predict(X_train) 
    y_val_pred = model.predict(X_val)

    # Calculate mean squared error for training and validation sets
    train_error = mean_squared_error(y_train, y_train_pred)
    val_error = mean_squared_error(y_val, y_val_pred)

    # Append errors to the lists
    train_errors.append(train_error)
    val_errors.append(val_error)

# Fit the model on the entire training + validation set
model.fit(X_train_val, y_train_val)

# Make predictions on the test set
y_test_pred = model.predict(X_test)

############################################################

# Calculate mean squared error for the test set
test_error = mean_squared_error(y_test, y_test_pred)

# Calculate the average training and validation errors
avg_train_error = np.mean(train_errors)
avg_val_error = np.mean(val_errors)

print(f'Average Training Error: {avg_train_error}')
print(f'Average Validation Error: {avg_val_error}')
print(f'Test Error: {test_error}')

Average Training Error: 0.00390418409025593
Average Validation Error: 0.004228318833330217
Test Error: 0.004303691767806357
