# Take-Home Assignment 1 (IS-ML) 

**Course:** Intelligent Systems – Machine Learning (MICS2-62)

**Title:** THA1 — Supervised Learning (Regression) • Classification • Unsupervised (Study)

**Group:** 13

**Members:** Jose Ignacio Valdivia Aguero, Esteban Leiva Montenegro, Isaac Palma Medina

**Spec reference:** Official assignment document
- Implement the regression models in Section A **from scratch** (no scikit-learn for the models themselves).
- You may use numpy/pandas/matplotlib for data handling and plotting.
- For section B (classification), scikit-learn is allowed per the specification.


## Contents
A. Supervised Learning — Regression
1. Data Acquisition (A.I.1)
2. Data Transformation (A.I.2)
3. Least Squares — Closed Form (A.I.3)
4. Linear Regression via Gradient Descent (A.I.4–A.I.8)
5. Discussion LS vs GD (A.I.9)
6. Polynomial Regression (A.II.1–A.II.4)

B. Supervised Learning — Classification (B.I.1–B.I.6)

C. Unsupervised Learning — Free Choice Study (C.I.1–C.I.6)


## Environment and Reproducibility


In [2]:
# Imports and a seeded random generator for reproducibility
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# A. Supervised Learning — Regression
### Dataset Context
- Input X: Temperature (°C)
- Output Y: Net hourly electrical energy output (MW)

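With group number \(n = 13\), the assigned rows run from \((n - 1) \times 20 + 2\) to \(n \times 20 + 1\):

$$(13 - 1) \times 20 + 2 = 242 \quad\text{and}\quad 13 \times 20 + 1 = 261$$

### Dataset from rows 242 to 261 (x, y)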

| **x** | 26 | 23 | 27 | 27 | 19 | 14 | 12 | 18 | 24 | 20 | 22 | 29 | 11 | 17 | 25 | 14 | 26 | 8 | 13 | 17 | 
|:-----:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| **y** | 440 | 452 | 434 | 437 | 460 | 469 | 469 | 454 | 443 | 446 | 447 | 437 | 477 | 457 | 440 | 466 | 440 | 475 | 467 | 458 | 



## A.I.1 Data Acquisition
- Load the Excel file provided by the course.
- Select only your 20 rows using your group number `n`.
- Show the selected data in a table (X, Y).
- Keep a clean, documented pipeline.


In [None]:

data = {
    "x": [26, 23, 27, 27, 19, 14, 12, 18, 24, 20, 22, 29, 11, 17, 25, 14, 26, 8, 13, 17],
    "y": [440, 452, 434, 437, 460, 469, 469, 454, 443, 446, 447, 437, 477, 457, 440, 466, 440, 475, 467, 458]
}
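print(len(data["x"]), len(data["y"]))  # should both be 20

# Keep NumPy copies of the hardcoded rows so later cells can use X and Y directly
import numpy as np
X = np.array(data["x"], dtype=float)
Y = np.array(data["y"], dtype=float)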

# Load the Excel file and select group-specific rows
# file_path = "./Data Take Home Assignment 1 Exercise A.xlsx"
# n = <enter_group_number>
# start = (n - 1) * 20 + 2
# end = n * 20 + 1
# df = pd.read_excel(file_path)
# subset = df.iloc[start-2:end-1]  # assuming sheet row 1 is the header: sheet row r is df row r-2
# X = subset.iloc[:, 0].to_numpy(dtype=float)
# Y = subset.iloc[:, 1].to_numpy(dtype=float)
# subset.head()


20 20


## A.I.2 Data Transformation
- Choose and justify a transformation (e.g., Min–Max or Z-score).
- State the formula in the report, and display the transformed data here if you use it for GD.
- Keep original X, Y copies for reference.

**Formulas:**
- Min–Max: \( x' = \frac{x - \min(x)}{\max(x) - \min(x)} \)
- Z-score: \( x' = \frac{x - \mu}{\sigma} \)
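
The two formulas above can be sketched in a few lines of NumPy. This is a minimal illustration on the group's 20 temperature values; the helper names `min_max_scale` and `z_score` are hypothetical, and the Z-score uses the population standard deviation (NumPy's default), which is an assumption about the intended definition of \(\sigma\).

```python
import numpy as np

# Temperatures (°C) from rows 242 to 261
X = np.array([26, 23, 27, 27, 19, 14, 12, 18, 24, 20,
              22, 29, 11, 17, 25, 14, 26, 8, 13, 17], dtype=float)

def min_max_scale(v):
    """Map v linearly onto the interval [0, 1]."""
    return (v - v.min()) / (v.max() - v.min())

def z_score(v):
    """Center v at 0 and scale it to unit (population) standard deviation."""
    return (v - v.mean()) / v.std()

X_mm = min_max_scale(X)
X_z = z_score(X)

print("Min-Max range:", X_mm.min(), X_mm.max())
print("Z-score mean/std:", X_z.mean(), X_z.std())
```

Either transformation is invertible, so parameters fitted on the scaled data can be mapped back to the original units.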


In [None]:
# Apply a chosen transformation to X and/or Y (if needed for GD stability)
# X_t = ...
# Y_t = ...
# Display summary
# print({"X_mean": X.mean(), "X_std": X.std(), "Y_mean": Y.mean(), "Y_std": Y.std()})
# print({"X_t_mean": X_t.mean(), "X_t_std": X_t.std(), "Y_t_mean": Y_t.mean(), "Y_t_std": Y_t.std()})


## A.I.3 Least Squares — Closed Form
Model: \( h_\theta(x) = \theta_1 x + \theta_0 \)

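Closed form solution (OLS): \( \theta = (X^T X)^{-1} X^T y \), where \(X\) is the design matrix with a leading column of ones (for \(\theta_0\)) followed by the input values (for \(\theta_1\)).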

Tasks:
- Compute \(\theta_0\), \(\theta_1\).
- Print parameters and SSE.
- Plot data points and fitted line.
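
As a cross-check on the matrix formula, the normal equations can be solved directly with NumPy. The sketch below is one possible way to do it, not necessarily the one required by the specification; the name `A` stands for the design matrix called \(X\) in the formula, and `np.linalg.solve` is used in place of an explicit inverse.

```python
import numpy as np

x = np.array([26, 23, 27, 27, 19, 14, 12, 18, 24, 20,
              22, 29, 11, 17, 25, 14, 26, 8, 13, 17], dtype=float)
y = np.array([440, 452, 434, 437, 460, 469, 469, 454, 443, 446,
              447, 437, 477, 457, 440, 466, 440, 475, 467, 458], dtype=float)

# Design matrix: a column of ones (for theta_0) next to the inputs (for theta_1)
A = np.column_stack([np.ones_like(x), x])

# Solve (A^T A) theta = A^T y rather than forming the inverse explicitly
theta = np.linalg.solve(A.T @ A, A.T @ y)
theta_0, theta_1 = theta

# Sum of squared errors of the fitted line
residuals = y - A @ theta
sse = float(residuals @ residuals)

print("theta_0:", theta_0, "theta_1:", theta_1, "SSE:", sse)
```

The result should agree with the covariance/variance formulas, since both are algebraically the same solution for a single input variable.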


In [None]:
import numpy as np

def linear_regression_least_squares(X, Y):
    """
    Performs Linear Regression using the analytical Least Squares solution.

    Parameters
    ----------
    X : np.ndarray
        1D vector (array) with input values — e.g., temperature in Celsius.
    Y : np.ndarray
        1D vector (array) with output values — e.g., energy in Megawatts (MW).

    Returns
    -------
    theta_0 : float
        Intercept term of the model (predicted value of Y when X = 0).
    theta_1 : float
        Slope of the model (average change in Y for each unit change in X).
    Y_pred : np.ndarray
        Predicted Y values from the fitted regression model.
    SSE : float
        Sum of Squared Errors, representing total fitting error.
    """

    # Step 1: Ensure input data are NumPy arrays (for vectorized math operations)
    X = np.array(X, dtype=float)
    Y = np.array(Y, dtype=float)

    # Step 2: Compute the means of X and Y
    # These are the "center" of the data and are used to center both variables
    x_mean = np.mean(X)
    y_mean = np.mean(Y)

    # Step 3: Compute components for the analytical formulas
    # numerador   = sum of cross-deviations, proportional to cov(X, Y)
    # denominador = sum of squared deviations, proportional to var(X)
    numerador = np.sum((X - x_mean) * (Y - y_mean))
    denominador = np.sum((X - x_mean)**2)

    # Step 4: Compute the slope (theta_1) and intercept (theta_0)
    # Formula: theta_1 = cov(X,Y) / var(X)   (the common 1/n factor cancels)
    #          theta_0 = mean(Y) - theta_1 * mean(X)
    theta_1 = numerador / denominador
    theta_0 = y_mean - theta_1 * x_mean

    # Step 5: Generate predictions for all input values
    Y_pred = theta_1 * X + theta_0

    # Step 6: Compute the total squared error (SSE)
    # This measures the total deviation between real and predicted values
    SSE = np.sum((Y - Y_pred)**2)

    return theta_0, theta_1, Y_pred, SSE


Plot the data points and the least-squares regression line

In [None]:
#Plot results
import matplotlib.pyplot as plt
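
One way to fill in the plotting cell is sketched here. To keep the block self-contained it recomputes the slope and intercept inline with the same formulas as `linear_regression_least_squares`; the figure size, colors, and labels are illustrative choices, not requirements from the specification.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([26, 23, 27, 27, 19, 14, 12, 18, 24, 20,
              22, 29, 11, 17, 25, 14, 26, 8, 13, 17], dtype=float)
y = np.array([440, 452, 434, 437, 460, 469, 469, 454, 443, 446,
              447, 437, 477, 457, 440, 466, 440, 475, 467, 458], dtype=float)

# Same closed-form estimates as in the function above
theta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
theta_0 = y.mean() - theta_1 * x.mean()

# Evaluate the line on an evenly spaced grid so it draws as one clean segment
x_line = np.linspace(x.min(), x.max(), 100)
y_line = theta_0 + theta_1 * x_line

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(x, y, label="Data (rows 242 to 261)")
ax.plot(x_line, y_line, color="red", label="Least-squares fit")
ax.set_xlabel("Temperature (°C)")
ax.set_ylabel("Net hourly electrical energy output (MW)")
ax.set_title("Linear regression: closed-form least squares")
ax.legend()
```

In a notebook the figure renders at the end of the cell; in a plain script a call to `plt.show()` would be needed.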



## A.I.4 Linear Regression via Gradient Descent — Cost and Gradients
Model: \( h_\theta(x) = \theta_1 x + \theta_0 \)

Cost (half the mean squared error): \( J(\theta) = \frac{1}{2n} \sum_{i=1}^n (y^{(i)} - h_\theta(x^{(i)}))^2 \)

Gradients:
- \( \frac{\partial J}{\partial \theta_0} = -\frac{1}{n} \sum_{i=1}^n (y^{(i)} - h_\theta(x^{(i)})) \)
- \( \frac{\partial J}{\partial \theta_1} = -\frac{1}{n} \sum_{i=1}^n (y^{(i)} - h_\theta(x^{(i)}))\, x^{(i)} \)

Update rule: \( \theta_j \leftarrow \theta_j - \alpha \frac{\partial J}{\partial \theta_j} \)

Tasks:
- Choose learning rate \(\alpha\) and initial parameters.
- Implement cost and gradient functions.
- Prepare plotting utilities for line and cost curve.
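
The cost and gradient formulas above translate almost line by line into NumPy. The following is a minimal sketch using the function names from the template cell (`compute_cost`, `compute_grads`); the toy data and the finite-difference check are added here only to illustrate that the analytic gradients match the cost.

```python
import numpy as np

def compute_cost(theta0, theta1, Xv, Yv):
    """J = 1/(2n) * sum((y - h)^2) for the line h = theta1 * x + theta0."""
    residual = Yv - (theta1 * Xv + theta0)
    return np.sum(residual ** 2) / (2 * len(Xv))

def compute_grads(theta0, theta1, Xv, Yv):
    """Partial derivatives of J with respect to theta0 and theta1."""
    residual = Yv - (theta1 * Xv + theta0)
    dtheta0 = -np.mean(residual)
    dtheta1 = -np.mean(residual * Xv)
    return dtheta0, dtheta1

def numeric_grad(f, value, eps=1e-6):
    """Central finite difference of a one-argument function f at value."""
    return (f(value + eps) - f(value - eps)) / (2 * eps)

# Toy data lying exactly on y = 2x + 1 (values chosen only for illustration)
Xv = np.array([0.0, 1.0, 2.0, 3.0])
Yv = np.array([1.0, 3.0, 5.0, 7.0])

g0, g1 = compute_grads(0.5, 0.5, Xv, Yv)
num_g0 = numeric_grad(lambda t0: compute_cost(t0, 0.5, Xv, Yv), 0.5)
num_g1 = numeric_grad(lambda t1: compute_cost(0.5, t1, Xv, Yv), 0.5)

print("analytic:", g0, g1)
print("numeric: ", num_g0, num_g1)
```

At the true parameters \((\theta_0, \theta_1) = (1, 2)\) of the toy data, both the cost and the gradients are zero, which is a quick sanity check for any implementation.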


In [None]:
# Define cost J(theta0, theta1), gradients, and a helper to plot the current line
# def compute_cost(theta0, theta1, Xv, Yv):
#     ...
#     return J
# def compute_grads(theta0, theta1, Xv, Yv):
#     ...
#     return dtheta0, dtheta1
# def plot_fit(theta0, theta1, Xv, Yv, title):
#     ...


### A.I.5 First GD Iteration
- Initialize \(\theta_0\), \(\theta_1\) (random or chosen) and \(\alpha\).
- Perform exactly one update step.
- Report parameters and plot current line.
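
A single update step can be written out explicitly, as in the sketch below. The starting point \(\theta_0 = \theta_1 = 0\), the learning rate \(\alpha = 0.1\), and the Z-scoring of the input are all assumptions made for illustration; the assignment leaves these choices to the group.

```python
import numpy as np

x = np.array([26, 23, 27, 27, 19, 14, 12, 18, 24, 20,
              22, 29, 11, 17, 25, 14, 26, 8, 13, 17], dtype=float)
y = np.array([440, 452, 434, 437, 460, 469, 469, 454, 443, 446,
              447, 437, 477, 457, 440, 466, 440, 475, 467, 458], dtype=float)

# Assumption: Z-score the input so that a moderate learning rate is stable
x_z = (x - x.mean()) / x.std()

theta0, theta1 = 0.0, 0.0   # assumed initial parameters
alpha = 0.1                 # assumed learning rate

# Gradients of J at the current parameters
residual = y - (theta1 * x_z + theta0)
d0 = -np.mean(residual)
d1 = -np.mean(residual * x_z)

# Exactly one update step
theta0 = theta0 - alpha * d0
theta1 = theta1 - alpha * d1

print("After iteration 1:", theta0, theta1)
```

Starting from zero, the first step moves \(\theta_0\) to \(\alpha\) times the mean of \(y\), and \(\theta_1\) becomes negative because energy output falls as temperature rises.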


In [None]:
# theta0, theta1 = ...  # initial values
# alpha = ...
# d0, d1 = compute_grads(theta0, theta1, X, Y)
# theta0 = theta0 - alpha * d0
# theta1 = theta1 - alpha * d1
# print(theta0, theta1)
# plot_fit(theta0, theta1, X, Y, "GD: iteration 1")


### A.I.6 Second GD Iteration
- One more update step and plot.


In [None]:
# d0, d1 = compute_grads(theta0, theta1, X, Y)
# theta0 = theta0 - alpha * d0
# theta1 = theta1 - alpha * d1
# print(theta0, theta1)
# plot_fit(theta0, theta1, X, Y, "GD: iteration 2")


### A.I.7 Third GD Iteration
- One more update step and plot.


In [None]:
# d0, d1 = compute_grads(theta0, theta1, X, Y)
# theta0 = theta0 - alpha * d0
# theta1 = theta1 - alpha * d1
# print(theta0, theta1)
# plot_fit(theta0, theta1, X, Y, "GD: iteration 3")


### A.I.8 Last GD Iteration and Cost Curve
- Run until a stopping condition (fixed iters or small parameter change).
- Plot cost per iteration.
- Report final parameters and final fitted line.
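
The full loop with a stopping condition might look like the sketch below. The tolerance `tol`, the iteration cap `max_iters`, the learning rate, and the Z-scored input are assumptions for illustration. The final parameters are mapped back to the original temperature scale so they can be compared with the least-squares result.

```python
import numpy as np

x = np.array([26, 23, 27, 27, 19, 14, 12, 18, 24, 20,
              22, 29, 11, 17, 25, 14, 26, 8, 13, 17], dtype=float)
y = np.array([440, 452, 434, 437, 460, 469, 469, 454, 443, 446,
              447, 437, 477, 457, 440, 466, 440, 475, 467, 458], dtype=float)
x_z = (x - x.mean()) / x.std()   # assumed transformation of the input

def cost(theta0, theta1):
    """Half mean squared error on the Z-scored input."""
    return np.sum((y - (theta1 * x_z + theta0)) ** 2) / (2 * len(y))

theta0, theta1 = 0.0, 0.0   # assumed initial parameters
alpha = 0.1                 # assumed learning rate
tol = 1e-8                  # stop when both parameters change by less than this
max_iters = 10_000          # safety cap on the number of iterations
costs = []

for t in range(max_iters):
    residual = y - (theta1 * x_z + theta0)
    new0 = theta0 + alpha * np.mean(residual)          # theta0 - alpha * dJ/dtheta0
    new1 = theta1 + alpha * np.mean(residual * x_z)    # theta1 - alpha * dJ/dtheta1
    step = max(abs(new0 - theta0), abs(new1 - theta1))
    theta0, theta1 = new0, new1
    costs.append(cost(theta0, theta1))
    if step < tol:
        break

# Map the parameters back to the original temperature scale
slope = theta1 / x.std()
intercept = theta0 - slope * x.mean()

print("iterations:", t + 1)
print("intercept:", intercept, "slope:", slope, "final cost:", costs[-1])
```

The list `costs` holds one value per iteration and can be passed directly to a line plot for the cost curve.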


In [None]:
# costs = []
# for t in range(...):
#     d0, d1 = compute_grads(theta0, theta1, X, Y)
#     theta0 = theta0 - alpha * d0
#     theta1 = theta1 - alpha * d1
#     costs.append(compute_cost(theta0, theta1, X, Y))
# # Plot cost curve
# # Plot final fit
# print("Final:", theta0, theta1)


### A.I.9 Discussion — LS vs GD
- Compare convergence, numerical stability, sensitivity to scaling, and final fit.
- Comment on advantages and limitations for small vs larger datasets.


## A.II Polynomial Regression (Quadratic)
Model: \( h_\theta(x) = \theta_2 x^2 + \theta_1 x + \theta_0 \)

Cost: \( J(\theta) = \frac{1}{4n} \sum_{i=1}^n (y^{(i)} - h_\theta(x^{(i)}))^4 \)

Gradients (using the chain rule), with \(e^{(i)} = y^{(i)} - h_\theta(x^{(i)})\):
- \( \frac{\partial J}{\partial \theta_0} = -\frac{1}{n} \sum_{i=1}^n (e^{(i)})^3 \)
- \( \frac{\partial J}{\partial \theta_1} = -\frac{1}{n} \sum_{i=1}^n (e^{(i)})^3 \, x^{(i)} \)
- \( \frac{\partial J}{\partial \theta_2} = -\frac{1}{n} \sum_{i=1}^n (e^{(i)})^3 \, (x^{(i)})^2 \)

Tasks:
- Use data from A.I.1 or A.I.2 (justify choice).
- Choose learning rate and initial parameters.
- Run GD until the model fits well; plot intermediate fits and final curve.
- Report initial values, learning rate, and a brief discussion vs linear regression.
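
The quartic cost and its gradients can be sketched as follows. Because the cubed errors make the gradients very large on raw data, this illustration Z-scores both \(x\) and \(y\); that choice, the zero initialization, the learning rate, and the fixed iteration count are all assumptions. The function names follow the template cell, except that they take the precomputed feature matrix `Phi` instead of the raw inputs.

```python
import numpy as np

x = np.array([26, 23, 27, 27, 19, 14, 12, 18, 24, 20,
              22, 29, 11, 17, 25, 14, 26, 8, 13, 17], dtype=float)
y = np.array([440, 452, 434, 437, 460, 469, 469, 454, 443, 446,
              447, 437, 477, 457, 440, 466, 440, 475, 467, 458], dtype=float)

# Assumption: Z-score both variables to keep the cubed errors small
x_z = (x - x.mean()) / x.std()
y_z = (y - y.mean()) / y.std()

def poly_features(v):
    """Return an (n, 3) matrix with columns [1, v, v^2]."""
    return np.column_stack([np.ones_like(v), v, v ** 2])

def poly_cost(theta, Phi, Yv):
    """J = 1/(4n) * sum(e^4) with e = y - Phi @ theta."""
    e = Yv - Phi @ theta
    return np.sum(e ** 4) / (4 * len(Yv))

def poly_grads(theta, Phi, Yv):
    """Gradient of J: -(1/n) * sum(e^3 * feature) for each feature column."""
    e = Yv - Phi @ theta
    return -(Phi.T @ (e ** 3)) / len(Yv)

Phi = poly_features(x_z)
theta = np.zeros(3)   # assumed initial [theta0, theta1, theta2]
alpha = 0.01          # assumed learning rate
costs = [poly_cost(theta, Phi, y_z)]

for t in range(5000):
    theta = theta - alpha * poly_grads(theta, Phi, y_z)
    costs.append(poly_cost(theta, Phi, y_z))

print("Final theta (standardized units):", theta)
print("Cost: initial", costs[0], "final", costs[-1])
```

Predictions in MW can be recovered with `y.mean() + y.std() * (Phi @ theta)`, which undoes the Z-scoring of the output.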


In [None]:
# Implement polynomial features and GD with the 4th-power cost
# def poly_features(x):
#     # return [1, x, x^2]
#     ...
# def poly_cost(theta, Xv, Yv):
#     ...
# def poly_grads(theta, Xv, Yv):
#     ...
# # GD loop with plots of intermediate fits
# theta = ...  # [theta0, theta1, theta2]
# alpha = ...
# for t in range(...):
#     dtheta = poly_grads(theta, X, Y)
#     theta = theta - alpha * dtheta
#     # optionally plot selected iterations
# print("Final theta:", theta)


# B. Supervised Learning — Classification
Follow the instructions to create two synthetic datasets and compare classifiers (k-NN, Naive Bayes, Decision Trees, Random Forests; optionally SVM/ANN). Use scikit-learn here. Keep code concise and plots clear.

## B.I.1 Dataset 1 Creation (Binary, 2D continuous, 40/40)
- Generate and display points.
- Document data generation approach briefly.
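
One simple way to generate such a dataset is to draw each class from its own 2D Gaussian. The class centers and the spread in the sketch below are assumed values, picked only so that the classes are mostly separable with a little overlap.

```python
import numpy as np

rng = np.random.default_rng(42)   # seeded for reproducibility

n_per_class = 40
# Assumed class centers and unit spread (illustrative choices)
class0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n_per_class, 2))
class1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(n_per_class, 2))

# Stack the points and build the matching label vector (0s then 1s)
X1 = np.vstack([class0, class1])
y1 = np.repeat([0, 1], n_per_class)

print(X1.shape, np.bincount(y1))
```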


In [None]:
# Create Dataset 1 (binary, two continuous features, 40 points per class)
# X1, y1 = ...
# Display/plot


## B.I.2 Dataset 2 (Dataset 1 + Outliers)
- Add four outliers per class at random.
- Display updated dataset.
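
A possible construction of Dataset 2 is sketched below: four extra points per class are drawn uniformly from an enlarged bounding box around Dataset 1. The margin of 2.0 and the uniform sampling are assumptions; since the positions are random, some of the added points may land inside a cluster rather than far from it.

```python
import numpy as np

rng = np.random.default_rng(42)
n_per_class = 40

# Dataset 1, regenerated here so the block is self-contained (assumed centers)
X1 = np.vstack([rng.normal([0.0, 0.0], 1.0, (n_per_class, 2)),
                rng.normal([3.0, 3.0], 1.0, (n_per_class, 2))])
y1 = np.repeat([0, 1], n_per_class)

n_out = 4                           # outliers per class
lo = X1.min(axis=0) - 2.0           # lower corner of the enlarged bounding box
hi = X1.max(axis=0) + 2.0           # upper corner of the enlarged bounding box

# First n_out random points are labeled class 0, the next n_out class 1
outliers = rng.uniform(lo, hi, size=(2 * n_out, 2))
outlier_labels = np.repeat([0, 1], n_out)

X2 = np.vstack([X1, outliers])
y2 = np.concatenate([y1, outlier_labels])

print(X2.shape, np.bincount(y2))
```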


In [None]:
# Create Dataset 2 by adding outliers to Dataset 1
# X2, y2 = ...
# Display/plot


## B.I.3 Train/Test Split
- Justify split strategy and report sizes.
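
A stratified split keeps the class balance identical in the training and test sets, which matters with only 40 points per class. The sketch below uses scikit-learn's `train_test_split`; the 75/25 ratio and the seed are assumed values, and the dataset is regenerated with assumed Gaussian centers so that the block runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X1 = np.vstack([rng.normal([0.0, 0.0], 1.0, (40, 2)),
                rng.normal([3.0, 3.0], 1.0, (40, 2))])
y1 = np.repeat([0, 1], 40)

# stratify=y1 preserves the 50/50 class ratio in both parts
X1_tr, X1_te, y1_tr, y1_te = train_test_split(
    X1, y1, test_size=0.25, stratify=y1, random_state=42
)

print("train:", X1_tr.shape, np.bincount(y1_tr))
print("test: ", X1_te.shape, np.bincount(y1_te))
```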


In [None]:
# Perform train/test split for both datasets
# X1_tr, X1_te, y1_tr, y1_te = ...
# X2_tr, X2_te, y2_tr, y2_te = ...
# print(...)


## B.I.4 Train Models and Report Accuracy + Confusion Matrix
- k-NN, Naive Bayes, Decision Tree, Random Forest.
- Train on Dataset 1 and 2; evaluate on train and test splits.
- Report accuracy and confusion matrix; discuss briefly.
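
The training and evaluation loop can be kept compact by storing the four classifiers in a dictionary. The sketch below runs on a regenerated Dataset 1 only; the hyperparameters (`n_neighbors=5`, `n_estimators=100`) and the seeds are assumptions, and the same loop would be repeated for Dataset 2.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X1 = np.vstack([rng.normal([0.0, 0.0], 1.0, (40, 2)),
                rng.normal([3.0, 3.0], 1.0, (40, 2))])
y1 = np.repeat([0, 1], 40)
X_tr, X_te, y_tr, y_te = train_test_split(
    X1, y1, test_size=0.25, stratify=y1, random_state=42
)

models = {
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# results[(model name, split name)] = (accuracy, confusion matrix)
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    for split, Xs, ys in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
        pred = model.predict(Xs)
        results[(name, split)] = (accuracy_score(ys, pred),
                                  confusion_matrix(ys, pred))

for (name, split), (acc, cm) in results.items():
    print(f"{name:14s} {split:5s} accuracy={acc:.3f} confusion={cm.tolist()}")
```

Comparing the train and test rows for each model gives a first indication of overfitting; an unpruned decision tree, for example, typically fits the training split perfectly.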


In [None]:
# Fit and evaluate models (scikit-learn allowed here)
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.naive_bayes import GaussianNB
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.metrics import accuracy_score, confusion_matrix
# ...


## B.I.5 Decision Boundaries
- Visualize decision boundaries for each model on both datasets.
- Keep the plotting style simple and consistent.
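
A common way to draw a decision boundary is to predict the class on a dense grid and shade the regions. The helper below follows the `plot_decision_boundary(model, Xv, yv, title)` signature from the template cell; the optional `ax` and `step` parameters, the 1.0 margin, and the color map are additions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

def plot_decision_boundary(model, Xv, yv, title, ax=None, step=0.05):
    """Shade the predicted class over a grid and overlay the data points."""
    if ax is None:
        _, ax = plt.subplots(figsize=(5, 4))
    # Grid covering the data with a margin of 1.0 on every side
    x_min, x_max = Xv[:, 0].min() - 1.0, Xv[:, 0].max() + 1.0
    y_min, y_max = Xv[:, 1].min() - 1.0, Xv[:, 1].max() + 1.0
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                         np.arange(y_min, y_max, step))
    # Predict a class for every grid point, then reshape back to the grid
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3, cmap="coolwarm")
    ax.scatter(Xv[:, 0], Xv[:, 1], c=yv, cmap="coolwarm", edgecolor="k")
    ax.set_xlabel("Feature 1")
    ax.set_ylabel("Feature 2")
    ax.set_title(title)
    return ax

# Example on a regenerated Dataset 1 (assumed Gaussian centers)
rng = np.random.default_rng(42)
X1 = np.vstack([rng.normal([0.0, 0.0], 1.0, (40, 2)),
                rng.normal([3.0, 3.0], 1.0, (40, 2))])
y1 = np.repeat([0, 1], 40)

knn = KNeighborsClassifier(n_neighbors=5).fit(X1, y1)
ax = plot_decision_boundary(knn, X1, y1, "k-NN (k=5) on Dataset 1")
```

Passing the same `ax` grid of subplots for every model keeps the style consistent across the eight model/dataset combinations.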


In [None]:
# Plot decision boundaries for each classifier
# def plot_decision_boundary(model, Xv, yv, title):
#     ...
# ...


## B.I.6 Discussion
- Compare models and decision boundaries.
- Discuss robustness to outliers and dataset differences.


# C. Unsupervised Learning — Free Choice Study
Choose one UL task (e.g., clustering or dimensionality reduction). Keep the study small but coherent, with clear motivations, metrics, and conclusions.

## C.I.1 Task and Metrics
- State the UL task and justify evaluation metrics.

## C.I.2 Data and Splits
- Choose a small dataset; describe preprocessing and any split strategy.

## C.I.3–C.I.4 Models and Methods
- Select two UL models; explain how they work and how they are optimized.
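
If clustering is chosen as the task, a comparison of two models could be set up as in the sketch below. The choice of k-means and agglomerative clustering, the synthetic two-blob data, and the silhouette score as the metric are all assumptions for illustration; the assignment leaves the task, models, and dataset open.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

# Small synthetic dataset: two well-separated 2D blobs (assumed centers)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (50, 2)),
               rng.normal([5.0, 5.0], 1.0, (50, 2))])

# Model 1: k-means, which alternates assignments and centroid updates
# to reduce the within-cluster sum of squared distances
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Model 2: agglomerative clustering, which greedily merges the closest
# clusters (Ward linkage by default) until two remain
agglo_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# Silhouette score lies in [-1, 1]; higher means tighter, better-separated clusters
scores = {
    "k-means": silhouette_score(X, kmeans_labels),
    "agglomerative": silhouette_score(X, agglo_labels),
}
for name, score in scores.items():
    print(f"{name:14s} silhouette={score:.3f}")
```

The silhouette score needs no ground-truth labels, which makes it a reasonable default metric when the data come without classes.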

## C.I.5 Results and Discussion
- Present results clearly; comment on findings.

## C.I.6 Conclusions and Limitations
- Summarize, list limitations, suggest improvements.
