# CS145 Introduction to Data Mining - Assignment 1
## Deadline: 11:59PM, January 14, 2025

## Instructions
Each assignment is structured as a Jupyter notebook, offering interactive tutorials that align with our lectures. You will encounter two types of problems: *write-up problems* and *coding problems*.

1. **Write-up Problems:** These problems are primarily theoretical, requiring you to demonstrate your understanding of lecture concepts and to provide mathematical proofs or derivations. Your answers should include sufficient steps for the mathematical derivations.
2. **Coding Problems:** Here, you will be engaging with practical coding tasks. These may involve completing code segments provided in the notebooks or developing models from scratch.

To ensure clarity and consistency in your submissions, please adhere to the following guidelines:

* For write-up problems, use Markdown bullet points to format text answers. Also, express all mathematical equations using $\LaTeX$ and avoid plain text such as `x0`, `x^1`, or `R x Q` for equations.
* For coding problems, comment your code thoroughly for readability and make sure it runs. Code that does not run may lose **all** points. Coding problems are graded automatically, and altering the grading code will result in a deduction of **all** points.
* Your submission should show the entire process of data loading, preprocessing, model implementation, training, and result analysis. This can be achieved through a mix of explanatory text cells, inline comments, intermediate result displays, and experimental visualizations.

### Submission Requirements

* Submit your `.ipynb` file through Gradescope on BruinLearn. Submissions in PDF format will not be graded.
* Late submissions are accepted up to 24 hours after the deadline. A submission that is $t$ hours late has its score multiplied by the penalty factor
  $$
  \mathbf{1}(t \leq 24) \, e^{-(\ln(2)/12) t}.
  $$
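
As a quick illustration of the formula above, the sketch below evaluates the penalty factor for a few lateness values. It assumes $t$ is measured in hours after the deadline and that $t \ge 0$; the function name `late_penalty_factor` is hypothetical and not part of any grading code.

```python
import math

def late_penalty_factor(t_hours: float) -> float:
    """Evaluate 1(t <= 24) * exp(-(ln 2 / 12) * t), with t assumed to be hours late."""
    if t_hours > 24:
        # The indicator is zero once the 24-hour window has passed.
        return 0.0
    return math.exp(-(math.log(2) / 12) * t_hours)

for t in (0, 6, 12, 24, 25):
    print(t, round(late_penalty_factor(t), 3))
```

Under these assumptions the factor halves every 12 hours: it is 1.0 at $t = 0$, 0.5 at $t = 12$, 0.25 at $t = 24$, and 0 afterwards.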

### Collaboration and Integrity

* Collaboration is encouraged, but all final submissions must be your own work. Please acknowledge any collaboration or external sources used, including websites, papers, and GitHub repositories.
* Any suspected cases of academic misconduct will be reported to The Office of the Dean of Students.

---

## Outline
* **Part 1: Write-up (90 points)**
  1. Introduction & Know Your Data  
  2. Linear Regression  
  3. Logistic Regression & Practical Classification
* **Part 2: Coding (40 points)**
  1. Exploratory Data Analysis & Preprocessing (8 points)  
  2. Linear Regression with Regularization and Cross Validation (12 points)  
  3. Logistic Regression with Regularization and Cross Validation (10 points)  
  4. Implement Gradient Descent and Compare with Sklearn (10 points)

---

## Part 1: Write-up (90 points)

### 1. Introduction & Know Your Data (30 points)

1. **One-Hot Encoding (10 points)**  
   One-hot encoding is the process of converting a single categorical variable with $k$ discrete values into $k$ binary variables (indicators). In which scenarios is one-hot encoding particularly important, and in which cases might it be inappropriate or unnecessary? Consider the following examples and state whether or not you would apply one-hot encoding, and why:
   - (a) Zipcode  
   - (b) Income Level (e.g., discrete categories such as `low`, `medium`, `high`)  
   - (c) Age  
   - (d) Cuisine Category  
   - (e) All the states in the U.S.  

2. **True/False: Simple Explanations (10 points)**  
   For each statement below, write **True** or **False**, and provide a **brief** justification (1-3 sentences) for your answer:
   - (a) Categorical variables can only be used when the number of categories is finite.  
   - (b) *Correlation* refers to the linear dependence between two variables.  
   - (c) Supervised learning and unsupervised learning differ in that supervised learning requires labeled data while unsupervised learning does not.  
   - (d) Median is usually preferred over mean as a summary statistic when there are extreme outliers.  
   - (e) Sample variance is an unbiased estimator of the population variance.

3. **Data Preprocessing (10 points)**  
   (a) Why are normalization or standardization useful steps in data preprocessing?  
   (b) List two ways to normalize a dataset.  
   (c) Name two ways to deal with missing values and explain when/why each approach might be used.  
   (d) Suppose you have a table with 4 columns: 3 numeric columns and 1 categorical column. You want to predict one numeric column from the remaining three columns.  
   - What type of machine learning task is this?  
   - What preprocessing steps might you consider for the 3 feature columns?
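
To make the mechanics of one-hot encoding from Question 1 concrete, here is a minimal sketch on a made-up column. The DataFrame `toy` and its `color` column are illustrative assumptions, not part of the assignment data, and the sketch does not say whether encoding is appropriate for any of the cases listed above.

```python
import pandas as pd

# A hypothetical categorical column with k = 3 distinct values.
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# pd.get_dummies replaces the column with k binary indicator columns.
encoded = pd.get_dummies(toy, columns=["color"], dtype=int)
print(encoded)
```

Each row of `encoded` contains exactly one 1, which marks the category that row belonged to.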

**[TODO: Provide your responses here. ]**

### 2. Linear Regression (30 points)

Consider a dataset with $n$ samples and $d$ features, where $\boldsymbol{X} \in \mathbb{R}^{n \times d}$ is the feature matrix and $\boldsymbol{y} \in \mathbb{R}^n$ is the target vector.

1. **Closed-form for Ridge Regression (10 points)**  
   Recall the Ridge regression objective:  
   $$
   J(\boldsymbol{w}) = \frac{1}{2n} \sum_{i=1}^n \bigl(\boldsymbol{x}_i^\top \boldsymbol{w} - y_i\bigr)^2 \;+\; \frac{\lambda}{2n}\|\boldsymbol{w}\|_2^2,
   $$
   where $\boldsymbol{w}\in \mathbb{R}^d$, $\lambda \ge 0$ is the regularization parameter, and $\|\boldsymbol{w}\|_2^2 = \sum_{j=1}^d w_j^2$.  
   **Task:** Derive the closed-form solution for $\boldsymbol{w}^\star$. Show your steps:
   - Write $J(\boldsymbol{w})$ in matrix form.  
   - Take the gradient w.r.t. $\boldsymbol{w}$.  
   - Set the gradient to zero and solve for $\boldsymbol{w}^\star$.  

2. **Regularization Intuition (5 points)**  
   Recall the question from lecture: *When $\lambda$ is very large (i.e., goes to infinity) in Ridge regression, what happens to the magnitude of the weights?* Explain why this leads to a simpler or more complex model.

3. **Bias and Variance in High-degree Polynomials (5 points)**  
   Suppose you fit a polynomial regression model with a very high polynomial degree to a small dataset:
   - (a) How does this typically affect the model’s bias and variance?  
   - (b) In practice, what approaches might help mitigate overfitting in a high-degree polynomial scenario?

4. **Ridge vs. No Regularization (5 points)**  
   If you train a model without any regularization vs. a model with $\lambda > 0$ (Ridge regression), how can this alter the learned weights and generalization performance? Provide a short explanation.

5. **2D Residual Analysis (5 points)**  
   You have data with two features $X_1, X_2$ and one target $Y$. You fit a linear regression model:  
   $$
   \hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2.
   $$
   If you plot $(\hat{Y} - Y)$ (the residual) against $X_1$ and $X_2$ and observe distinct patterns, what might that suggest about your model or data?
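
One possible way to sanity-check a closed-form answer to Question 1 is to confirm numerically that the gradient of $J$ vanishes at your candidate $\boldsymbol{w}^\star$. The sketch below uses synthetic data and, as a stand-in for your own solution, the weights from `sklearn.linear_model.Ridge` with `fit_intercept=False`. sklearn minimizes $\|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w}\|_2^2 + \alpha\|\boldsymbol{w}\|_2^2$, which has the same minimizer as $J$ when $\alpha = \lambda$. The helper names and the data are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
n, d, lam = 50, 3, 2.0
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

def ridge_objective(w):
    """J(w) exactly as written in Question 1."""
    residual = X @ w - y
    return (residual @ residual) / (2 * n) + lam / (2 * n) * (w @ w)

def numerical_gradient(f, w, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at w."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

# Stand-in candidate; replace with the w* from your own derivation.
w_candidate = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
grad_norm = np.linalg.norm(numerical_gradient(ridge_objective, w_candidate))
print("gradient norm at candidate:", grad_norm)
```

A gradient norm close to zero is consistent with the candidate being the minimizer. A clearly nonzero value suggests an algebra slip in the derivation.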

**[TODO: Provide your responses here. ]**

### 3. Logistic Regression & Practical Classification (30 points)

1. **Odds and Log-Odds (6 points)**  
   (a) Given the odds $\frac{P(Y=1)}{1 - P(Y=1)}$, what is its numerical range?  
   (b) What is the range of the log-odds $\ln \left(\frac{P(Y=1)}{1 - P(Y=1)}\right)$?  
   (c) When $P(Y=1) = 0.5$, what is the value of $\ln(\text{odds})$?

2. **Logistic Regression Model (4 points)**  
   Write down the logistic regression model for a single feature $X$. Briefly interpret the meaning of the learned parameter $\beta_1$ in a logistic regression context.

3. **Regularization in Logistic Regression (5 points)**  
   Similar to Ridge regression for linear models, logistic regression can also include $L_2$ penalty on the weights.  
   - (a) When $\lambda$ is large, do we expect more or less complex decision boundaries? Why?  
   - (b) Without regularization, would the gradient with respect to a weight ever become exactly zero if its feature is extremely rare but always associated with the “positive” class?

4. **Classification Threshold (5 points)**  
   Suppose you have a logistic regression model that predicts $P(Y=1|X)$. In practice, how would you convert these probabilities into a binary classification label (0 or 1)? Which threshold is commonly used, and why might one choose a different threshold?

5. **Evaluation Metrics (10 points)**  
   (a) Define **Precision** and **Recall**.  
   (b) Show how they combine in the $F_1$ measure.  
   (c) If you had a highly imbalanced dataset (e.g. 99% negatives, 1% positives), why might accuracy alone be misleading?  
   (d) What would be the accuracy of a classifier that predicts **all negative** in the above scenario?
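
For Question 5, the following sketch shows how the three metrics can be computed with `sklearn.metrics` on a small made-up set of labels. The label vectors are assumptions chosen for illustration and are unrelated to the imbalanced scenario in parts (c) and (d).

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical ground truth and predictions: 2 true positives,
# 1 false negative, 1 false positive, 4 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(precision, recall, f1)
```

Working out the same three numbers by hand from the counts in the comment is a useful check on your written definitions.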

**[TODO: Provide your responses here. ]**

---

## Part 2: Coding (40 points)

Below, we will work with real-world data to practice data preprocessing, train models, and experiment with regularization.

### Overview

**Datasets**  
- **Housing Dataset (for Regression):** We will use a (smaller) version of the California Housing dataset to predict continuous target values (housing prices).
- **Heart Disease Dataset (for Classification):** We will use a simplified heart disease dataset (`heart.csv`), with a binary label indicating the presence or absence of heart disease.

> **Files you will need:**
> - `housing.csv`
> - `heart.csv`  

> **Important**: For all plots, please use `matplotlib` (or `seaborn`).

---

### 1. Exploratory Data Analysis & Preprocessing (8 points)

#### 1.1 Load and Explore the Datasets (4 points)

**Housing (Regression)**  
1. Load the `housing.csv` file into a Pandas DataFrame.  
2. Print out the first 5 rows.  
3. Display a summary (using `df.info()`) of data types and missing values.  
4. Create histograms of at least two numeric columns (e.g., `median_income`, `median_house_value`). Discuss any interesting observations in 1-2 sentences.

**Heart Disease (Classification)**  
1. Load the `heart.csv` file into a separate Pandas DataFrame.  
2. Print out the first 5 rows.  
3. Display a summary of data types and check for missing values.  
4. Create histograms or bar plots of at least two features (e.g., `age`, `chol`, or a categorical variable). Briefly comment on any skewness or notable patterns.

#### 1.2 Data Cleaning & Feature Engineering (4 points)

- For **both** datasets:
  1. Identify if there are any **missing** values and decide how to handle them (drop or impute). Justify your choice in a text cell.  
  2. If there are categorical variables (e.g., `ocean_proximity` in housing, `cp` or `thal` in heart), convert them to numeric using a method of your choice (e.g., one-hot encoding, label encoding).  
  3. Optionally, create **at least one new feature** in one of the datasets (e.g., `rooms_per_household` in housing, or some ratio in heart data).  
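
The cleaning steps above might look roughly like the sketch below on a toy frame. The column names `total_rooms` and `households` and all values are assumptions made for illustration, so check the actual columns in `housing.csv` before reusing anything. Median imputation is just one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value (hypothetical columns and numbers).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, np.nan, 1274.0],
    "households": [126.0, 1138.0, 177.0, 219.0],
})

# Step 1: count missing values per column.
print(toy.isna().sum())

# Step 2: impute the missing entry with the column median.
median_rooms = toy["total_rooms"].median()
toy["total_rooms"] = toy["total_rooms"].fillna(median_rooms)

# Step 3: derive a ratio feature from two existing columns.
toy["rooms_per_household"] = toy["total_rooms"] / toy["households"]
print(toy)
```

Dropping rows instead of imputing would be `toy.dropna()`. Which is preferable depends on how much data is missing and why.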

In [None]:
# TODO: Data loading and EDA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# 1. Load housing data
# 2. Display the first few rows
# 3. Show data info (e.g. housing.info())
# 4. Plot histograms

# 5. Load heart data
# 6. Display the first few rows
# 7. Show data info
# 8. Plot relevant columns (bar/histogram)
# TODO: Handle missing values if any
# TODO: Convert categorical to numeric
# TODO: Create at least one new feature

### 2. Linear Regression with Regularization and Cross Validation (12 points)

We will focus on the **Housing** dataset for the regression tasks.

#### 2.1 Train-Test Split (2 points)

Split the **housing** dataset into 80% train and 20% test. **Important**: ensure you also separate the **target** (`median_house_value`) from the input features.
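
A minimal sketch of separating the target and splitting 80/20, shown on a toy frame. The column names and the `random_state` value are assumptions; fixing `random_state` simply makes the split reproducible between runs.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with two features and a target column.
toy = pd.DataFrame({
    "feature_a": range(10),
    "feature_b": range(10, 20),
    "target": range(20, 30),
})

# Separate the target from the input features before splitting.
X_toy = toy.drop(columns=["target"])
y_toy = toy["target"]

X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(X_toy_train.shape, X_toy_test.shape)
```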

In [None]:
# TODO: Housing train-test split
from sklearn.model_selection import train_test_split

# Example:
# X_train, X_test, y_train, y_test = train_test_split(...)

#### 2.2 Baseline Linear Regression + Evaluation (2 points)

1. Train a standard linear regression (no regularization) on the training set.
2. Compute the Mean Squared Error (MSE) on both the training and test sets.
3. Print the RMSE (square root of MSE).
4. Briefly comment in a Markdown cell: Is there a large gap between train and test RMSE?
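
As a small worked example of steps 2 and 3, the sketch below computes MSE and RMSE for three made-up predictions. Taking the square root explicitly with NumPy avoids relying on version-specific options of `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions; squared errors are 1, 0, and 4.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

mse = mean_squared_error(y_true, y_pred)  # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)
print(f"MSE = {mse:.3f}, RMSE = {rmse:.3f}")
```

RMSE is in the same units as the target, which usually makes it easier to interpret than MSE.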

In [None]:
# TODO: Baseline Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 1. Fit
# 2. Predict on train and test
# 3. Compute MSE, RMSE

**[TODO: Provide your responses here. ]**

#### 2.3 Ridge Regression and Cross Validation (8 points)

1. Use **Ridge** regression (`sklearn.linear_model.Ridge`) with a few different values of $\lambda$ (called `alpha` in `sklearn`)—for example: `[0.0, 0.01, 0.1, 1.0, 10.0, 100.0]`.
2. Use **cross validation** (`cross_val_score` or `KFold` + `for loop`) on the **training set** to estimate how well each alpha performs.
3. Plot alpha values (x-axis) vs. average RMSE across folds (y-axis). You can also store or print these average values in a table if you prefer.
4. Choose the alpha that yields the best average RMSE and refit a final Ridge model on the entire training set.
5. Evaluate the final model’s performance on the test set. Compare it with the baseline from Section 2.2.
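
The cross-validation loop in steps 2 to 4 could be sketched as follows on synthetic data. One detail worth knowing is that `cross_val_score` returns *negated* MSE for `scoring="neg_mean_squared_error"`, so the sign must be flipped before taking the square root. The data from `make_regression`, the 5-fold setting, and the shortened alpha list are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training set.
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

alpha_list = [0.01, 0.1, 1.0, 10.0, 100.0]
avg_cv_rmse_list = []
for alpha in alpha_list:
    # Scores are negative MSE values, one per fold.
    neg_mse = cross_val_score(
        Ridge(alpha=alpha), X_toy, y_toy, cv=cv, scoring="neg_mean_squared_error"
    )
    avg_cv_rmse_list.append(np.sqrt(-neg_mse).mean())

# Lower RMSE is better, so pick the alpha with the smallest average.
best_alpha = alpha_list[int(np.argmin(avg_cv_rmse_list))]
print(best_alpha, avg_cv_rmse_list)
```

Because the alphas span several orders of magnitude, a logarithmic x-axis can make the plot easier to read. Note that `alpha = 0.0` cannot be placed on a log axis.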

In [None]:
# TODO: Ridge regression cross validation
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, KFold

# 1. alpha_list = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
# 2. cross_val each alpha
# 3. store average CV RMSE
# 4. pick best alpha
# 5. re-train on entire train set and evaluate on test set
# TODO: Plot alpha vs. average CV RMSE
import matplotlib.pyplot as plt

# Something like:
# plt.plot(alpha_list, avg_cv_rmse_list, marker='o')
# plt.xlabel(...)
# plt.ylabel(...)
# plt.title(...)
# plt.show()

### 3. Logistic Regression with Regularization and Cross Validation (10 points)

We will focus on the **Heart Disease** dataset for classification tasks.

#### 3.1 Train-Test Split (2 points)

Split the **heart** dataset into 80% train and 20% test. Separate out the target column (often labeled something like `target` or `condition` in heart datasets).

In [None]:
# TODO: Heart train-test split

#### 3.2 Baseline Logistic Regression (3 points)

1. Train a standard logistic regression (with no regularization or `C` very large in `LogisticRegression`) on the training set.
2. Print out classification accuracy on **both** the training and test sets.
3. Also print out or compute other metrics such as **precision**, **recall**, or an $F_1$ score on the test set.  You can use `sklearn.metrics.classification_report` or compute manually.

In [None]:
# TODO: Logistic Regression baseline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score

# 1. Fit
# 2. Evaluate (accuracy, precision, recall, f1)

#### 3.3 Logistic Regression + Cross Validation (5 points)

1. Choose a list of `C` values (the inverse of $\lambda$ in logistic regression). For example: `[0.001, 0.01, 0.1, 1.0, 10.0, 100.0]`.
2. Perform **cross validation** on the training set to estimate the average accuracy (or $F_1$) for each `C`.
3. Plot the metric vs. `C` on a simple line plot.
4. Pick the best `C` and retrain on the entire training set. Evaluate on the test set.
5. In a short Markdown cell, discuss how increasing/decreasing `C` affects the decision boundary complexity.
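
Steps 1, 2, and 4 can be sketched on synthetic data as below. The dataset from `make_classification`, the use of `StratifiedKFold` (which keeps class proportions similar across folds), and the 5-fold setting are assumptions for illustration rather than requirements of the assignment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the heart training set.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

c_list = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
avg_cv_acc = []
for c in c_list:
    model = LogisticRegression(C=c, solver="liblinear")
    scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="accuracy")
    avg_cv_acc.append(scores.mean())

# Higher accuracy is better, so pick the C with the largest average.
best_c = c_list[int(np.argmax(avg_cv_acc))]
print(best_c, avg_cv_acc)
```

Swapping `scoring="accuracy"` for `scoring="f1"` evaluates the same loop with the $F_1$ score instead.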

In [None]:
# TODO: Logistic Regression cross validation for classification
from sklearn.model_selection import cross_val_score

# 1. c_list = [...]
# 2. for c in c_list:
#       log_reg = LogisticRegression(C=c, solver='liblinear', ...)
#       scores = cross_val_score(log_reg, X_train, y_train, cv=..., scoring='accuracy')
#       ...
# 3. Plot c vs. average CV score
# TODO: Final evaluation on test set after picking best C

**[TODO: Provide your responses here. ]**

### 4. Implement Gradient Descent and Compare with Sklearn (10 points)

In this task, you will implement gradient descent from scratch to perform linear regression on the Housing dataset and compare your implementation with sklearn's `LinearRegression`.

**Instructions:**

- **Implement Gradient Descent:**  
  1. Use the Mean Squared Error (MSE) as the loss function:
     $$
     J(\boldsymbol{w}, b) = \frac{1}{n} \sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2, \quad \text{where} \quad \hat{y}_i = \boldsymbol{w}^\top \boldsymbol{x}_i + b.
     $$
  2. Initialize the weights and bias (you may combine them into one vector by adding a bias term to your feature matrix).
  3. Set a learning rate (e.g., `lr = 0.01`) and choose the number of iterations.
  4. At each iteration, compute the gradient with respect to the weights and bias, update them, and record the training MSE.
  5. Plot the training MSE versus the iteration number to observe the convergence.

- **Comparison with Sklearn:**  
  1. After implementing gradient descent, print the final learned coefficients and bias.
  2. Train a linear regression model using sklearn's `LinearRegression` on the same training data.
  3. Compare the coefficients and training MSE from your implementation with those from sklearn's model.
  4. Briefly discuss any differences observed in a Markdown cell.
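
Before implementing the full version, it may help to see the update rule on a one-dimensional problem. The sketch below minimizes the assumed toy function $f(w) = (w - 3)^2$, whose gradient is $2(w - 3)$, with two different learning rates. It illustrates the generic update $w \leftarrow w - \text{lr} \cdot \nabla f(w)$ only and is not the MSE gradient required by the task; the function name `gradient_descent_1d` is hypothetical.

```python
def gradient_descent_1d(grad_fn, w0, lr, num_iters):
    """Repeatedly apply w <- w - lr * grad_fn(w) and record every iterate."""
    w = w0
    history = [w]
    for _ in range(num_iters):
        w = w - lr * grad_fn(w)
        history.append(w)
    return w, history

def grad(w):
    # Gradient of the toy objective f(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

# A small step size converges toward the minimizer w = 3.
w_small, _ = gradient_descent_1d(grad, w0=0.0, lr=0.1, num_iters=100)
# A step size that is too large overshoots more on every step and diverges.
w_large, _ = gradient_descent_1d(grad, w0=0.0, lr=1.1, num_iters=100)
print(w_small, w_large)
```

The same sensitivity to step size appears in linear regression when features have very different scales, which is one reason to consider standardizing features before running gradient descent.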

In [None]:
# TODO: Implement Gradient Descent for Linear Regression
import numpy as np
import matplotlib.pyplot as plt

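# Assume X_train and y_train are defined from previous parts.
# Note: with lr = 0.01, gradient descent can diverge on unscaled features;
# consider standardizing X_train first (e.g., sklearn's StandardScaler).
# Add a column of ones to account for the bias term
n_samples, n_features = X_train.shape
X_train_bias = np.hstack((np.ones((n_samples, 1)), np.asarray(X_train, dtype=float)))  # shape: (n_samples, n_features+1)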

# Initialize weights (including bias) to zeros
w = np.zeros(n_features + 1)  # shape: (n_features+1, )

# Hyperparameters
lr = 0.01
num_iters = 1000

# To store MSE for each iteration
mse_history = []

for i in range(num_iters):
    # TODO: Compute predictions: ...
    # TODO: Compute error: ...
    # TODO: Compute gradient: ...
    # TODO: Update weights: ...
    # TODO: Compute current MSE and append to mse_history
    pass

# Plot the training MSE vs. iteration
plt.plot(mse_history)
plt.xlabel("Iteration")
plt.ylabel("MSE")
plt.title("Gradient Descent Convergence")
plt.show()

# Print learned weights from gradient descent
print("Learned weights from Gradient Descent:", w)

# Comparison with sklearn's LinearRegression
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
# Combine the intercept and coefficients for comparison
sklearn_weights = np.hstack((lin_reg.intercept_, lin_reg.coef_))
print("Sklearn LinearRegression coefficients:", sklearn_weights)

**[TODO: Provide your responses here. ]**