*2025 Spring DSAA 2011 Machine Learning*
## Lab Note 04
*Weiwen Chen, Zixin Zhong* \
*Hong Kong University of Science and Technology (Guangzhou)*

**Question 1**. Apply linear classification (affine model) using **only numpy**. Consider the dataset $(\mathbf{x}_i, y_i), i = 1,2,3,4,5$ with samples
$$
\begin{aligned}
&\mathbf{x}_1 = -7, \quad \mathbf{x}_2= -2,\quad \mathbf{x}_3=1, \quad \mathbf{x}_4=5, \quad \mathbf{x}_5=7  \\
&y_1 = +1 , \quad y_2= +1,\quad y_3= -1, \quad y_4=-1, \quad y_5=-1.
\end{aligned}
$$
1. Write down the design matrix and target vector.
1. Estimate $\bar{\mathbf{w}}^*$.
2. Predict the label of a new test point $x_{\text{new}}$.
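
One possible way to work through Question 1 with numpy is sketched below. It builds the design matrix with a bias column, solves the least-squares problem with `np.linalg.lstsq`, and classifies by the sign of the prediction. The test point `x_new = -3.0` is a hypothetical value chosen for illustration, since the question does not fix one.

```python
import numpy as np

x = np.array([-7.0, -2.0, 1.0, 5.0, 7.0])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])

# Design matrix: a column of ones (bias) next to the raw feature
X = np.column_stack([np.ones_like(x), x])

# Least-squares estimate of w_bar = [bias, slope]
w_bar, *_ = np.linalg.lstsq(X, y, rcond=None)

# Hypothetical test point (not part of the lab data)
x_new = -3.0
y_new = np.sign(np.array([1.0, x_new]) @ w_bar)

print(X)
print(w_bar)
print(y_new)
```

Solving with `lstsq` rather than an explicit inverse is a design choice made here for numerical stability; for this small, well-conditioned problem both routes should give the same weights.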

**Question 2**. 
1. Train a polynomial classification model on the dataset from *Question 1* and predict labels for new test points.
<br> Set the order of the polynomial to $2$ and $3$ in turn.
1. Simulate a toy dataset (e.g. using *numpy.random*) and train a polynomial classification model on it.
<br> Set the order of the polynomial to $2$ and $3$ in turn.

In [None]:
import numpy as np
def makePolyDataset(orgx : np.ndarray, order : int) -> np.ndarray :
	x = np.vstack(
		[orgx ** i for i in range(0, order + 1)]
	).T
	return x

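# Dataset from Question 1
x = np.array([-7, -2, 1, 5, 7])
y = np.array([1, 1, -1, -1, -1])
p2 = makePolyDataset(x, 2)  # design matrix for order 2: columns 1, x, x^2
p3 = makePolyDataset(x, 3)  # design matrix for order 3: columns 1, x, x^2, x^3

# Least-squares weights via the normal equations: w = (P^T P)^{-1} P^T y
w2 = np.linalg.inv(p2.T @ p2) @ p2.T @ y
w3 = np.linalg.inv(p3.T @ p3) @ p3.T @ y

print(w2, w3)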

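# Toy dataset: labels are the sign of a noisy quadratic
def f(x : np.ndarray | float) -> np.ndarray | float :
	return 0.5 * x**2 - 2 * x + 1

np.random.seed(42)  # fix the seed so the simulated data is reproducible
x = np.random.uniform(-5, 5, 6)  # 6 inputs drawn uniformly from [-5, 5)
y = np.sign(np.random.normal(0, 0.5, 6) + f(x))  # add Gaussian noise (std 0.5), then take the sign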

print(x, y)
p2 = makePolyDataset(x, 2)
p3 = makePolyDataset(x, 3)
w2 = np.linalg.inv(p2.T @ p2) @ p2.T @ y
w3 = np.linalg.inv(p3.T @ p3) @ p3.T @ y

print(w2, w3)
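
The cell above stops at the fitted weights. A minimal sketch of the prediction step for Question 2 might look like this; the helper names `poly_features` and `predict` and the test points in `x_test` are illustrative assumptions, not part of the lab.

```python
import numpy as np

def poly_features(x: np.ndarray, order: int) -> np.ndarray:
    # Columns are x**0, x**1, ..., x**order
    return np.vander(x, N=order + 1, increasing=True)

def predict(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    # Classify by the sign of the fitted polynomial's value
    order = w.shape[0] - 1
    return np.sign(poly_features(x, order) @ w)

x = np.array([-7.0, -2.0, 1.0, 5.0, 7.0])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
x_test = np.array([-4.0, 3.0])  # hypothetical test points

predictions = {}
for order in (2, 3):
    P = poly_features(x, order)
    w, *_ = np.linalg.lstsq(P, y, rcond=None)  # least-squares weights
    predictions[order] = predict(x_test, w)
    print(order, w, predictions[order])
```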

## Using `numpy.random`

The `numpy.random` module generates pseudo-random numbers. It supports drawing random numbers from common distributions, random sampling, and shuffling arrays.

### Common Functions:

In [None]:
import numpy as np

# Generate a random float in the range [0, 1)
random_float = np.random.rand()

# Generate a random array with a specified shape (2x3 matrix)
random_array = np.random.rand(2, 3)

# Generate random integers in the range [low, high)
random_int = np.random.randint(low=10, high=20, size=5)

# Sample random numbers from a normal distribution
random_normal = np.random.normal(loc=0, scale=1, size=5)

# Shuffle an array in place (returns None)
array = np.array([1, 2, 3, 4, 5])
np.random.shuffle(array)

### Applications:
* Data augmentation (generating random samples)
* Data randomization (shuffling training data)
* Simulating random processes
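
The functions above use NumPy's legacy global random state. NumPy's documentation recommends the `Generator` API for new code, obtained through `np.random.default_rng`. The sketch below mirrors the calls above with that API; the seed `42` is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator for reproducible results

random_float = rng.random()                          # float in [0, 1)
random_array = rng.random((2, 3))                    # 2x3 array of floats in [0, 1)
random_int = rng.integers(low=10, high=20, size=5)   # integers in [10, 20)
random_normal = rng.normal(loc=0, scale=1, size=5)   # samples from N(0, 1)

array = np.array([1, 2, 3, 4, 5])
rng.shuffle(array)                                   # shuffles in place

# Re-creating the generator with the same seed reproduces the same stream
same_float = np.random.default_rng(42).random()
print(random_float == same_float)
```

Passing a generator object around, instead of relying on `np.random.seed`, keeps the randomness of different parts of an experiment independent of each other.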

**Question 3**. Apply ridge regression to the dataset provided in `DSAA2011-LA04-data.csv`.

(Source link: https://www.kaggle.com/datasets/budincsevity/szeged-weather)
1. Use a ridge regression model to predict Apparent Temperature (C), with Humidity as the input feature.
2. Divide the dataset into a training set and a testing set.
3. Compare the effects of different regularization parameters (alpha).
4. Calculate mean squared error (MSE) as the evaluation metric.

## Tips:
## Basic Functions of pandas

pandas is a powerful library for data manipulation and analysis. It is widely used for reading, cleaning, transforming, and analyzing data.

### Common Functions:

In [None]:
import numpy as np  # needed for np.nan in mean_str below
import pandas as pd

# Read data
df = pd.read_csv('DSAA2011-LA04-data.csv')  # Read from a CSV file
print(df.head())  # View the first 5 rows

# Basic information about the data
print(df.info())  # Display dataset information
print(df.describe())  # Show statistical information for numerical data

# Data selection
print(df['Temperature (C)'])  # Select a single column
print(df[['Temperature (C)', 'Apparent Temperature (C)']])  # Select multiple columns
print(df.iloc[0:5])  # Select rows (first 5 rows)

# Data cleaning
df = df.dropna()  # Remove missing values
df['new_column'] = df['Temperature (C)'] * 2  # Create a new column

# Data grouping and aggregation
# df.groupby('Temperature (C)').mean() can fail on non-numeric columns in recent pandas versions,
# so aggregate each column according to its dtype instead.
def mean_str(col):
    if pd.api.types.is_numeric_dtype(col):
        return col.mean()
    else:
        # Keep a non-numeric value only if it is the same across the whole group
        return col.iloc[0] if col.nunique() == 1 else np.nan

grouped = df.groupby('Temperature (C)').agg(mean_str)

# Save data
df.to_csv('output.csv', index=False)  # Save to a CSV file

### Applications:

- Data preprocessing: cleaning, transforming, and preparing data
- Data analysis: statistics and visualization
- Data import and export: supports various formats (CSV, Excel, SQL, etc.)

## Mean Squared Error (MSE)

### What is MSE?

Mean Squared Error (MSE) is a common metric used to evaluate model performance, particularly in linear regression and curve fitting. It represents the average squared difference between predicted values and true values.

### Formula:

Given true values $y_i$ and predicted values $\hat{y}_i$ for $n$ samples, the MSE is:

$$ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

### Characteristics:
- The smaller the MSE, the better the model's predictive performance.
- MSE is more sensitive to outliers than the mean absolute error, because squaring amplifies large errors more than small ones.
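
A small numeric example may make both points concrete. The values below are made up for illustration; the second prediction vector replaces one value with a large error to show how MSE reacts compared with the mean absolute error (MAE).

```python
import numpy as np

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Average of squared differences
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Average of absolute differences, shown only for comparison
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
y_outlier = np.array([2.5, 0.0, 2.0, 17.0])  # last prediction is off by 10

print(mse(y_true, y_pred), mae(y_true, y_pred))        # 0.375 0.5
print(mse(y_true, y_outlier), mae(y_true, y_outlier))  # 25.125 2.75
```

In this example the single large error multiplies the MSE by 67 but the MAE only by 5.5.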

### Role in Ridge Regression:

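In regression, the MSE (or, equivalently up to a constant factor, the sum of squared errors) is the part of the objective function that measures how well the model fits the data. Ridge regression adds a regularization term $\lambda \|w\|^2$, the squared L2 norm of the weights scaled by $\lambda$, which discourages large weights and helps prevent overfitting. The code below calls this parameter `alpha`.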
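
Written out, and assuming the penalty is added to the sum of squared errors (the form that the closed-form `fit` function in the code below implements, with `alpha` playing the role of $\lambda$), the ridge objective is:

$$ J(\mathbf{w}) = \|X\mathbf{w} - \mathbf{y}\|^2 + \lambda \|\mathbf{w}\|^2 $$

Setting the gradient to zero gives the minimizer:

$$ \nabla J(\mathbf{w}) = 2X^\top (X\mathbf{w} - \mathbf{y}) + 2\lambda \mathbf{w} = 0 \quad \Longrightarrow \quad \mathbf{w}^* = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y} $$

If the fit term is the MSE, i.e. $\frac{1}{n}\|X\mathbf{w} - \mathbf{y}\|^2$, the same solution is obtained with $\lambda$ replaced by $n\lambda$.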


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
data = pd.read_csv("./DSAA2011-LA04-data.csv")
X = data['Humidity'].values.reshape(-1, 1)
X = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a bias column of ones
Y = data['Apparent Temperature (C)'].values.reshape(-1, 1)
alpha = [1e-2, 1e-1, 1, 10, 100]

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
print(X, Y)

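def fit(x : np.ndarray, y : np.ndarray, alpha : float) -> np.ndarray :
	# Closed-form ridge solution: w = (X^T X + alpha * I)^{-1} X^T y
	# Note: the identity also penalizes the bias weight (the column of ones).
	return np.linalg.inv(x.T @ x + alpha * np.identity(x.shape[1])) @ x.T @ y

def mse(y_pred : np.ndarray, y_correct : np.ndarray) -> float :
	# Mean of squared residuals, returned as a plain float rather than a 1x1 array
	return float(np.mean((y_pred - y_correct) ** 2))

bstAlpha, bstMse = None, float("inf")
for a in alpha :
	w = fit(x_train, y_train, a)
	mse_train = mse(x_train @ w, y_train)
	mse_test = mse(x_test @ w, y_test)
	print(f"a={a}, train MSE={mse_train:.4f}, test MSE={mse_test:.4f}")
	# Track the alpha with the lowest test MSE
	# (a separate validation set would be preferable for model selection)
	if mse_test < bstMse :
		bstAlpha, bstMse = a, mse_test
print(f"best alpha={bstAlpha}, test MSE={bstMse:.4f}")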
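
As a cross-check, the same experiment could be run with scikit-learn's `Ridge` and `mean_squared_error`. The sketch below uses synthetic data in place of the CSV so that it runs on its own; the linear relation and noise level are assumptions made up for the example, not properties of the weather dataset. Note that `Ridge` does not penalize the intercept, so its weights will generally differ slightly from the closed-form `fit` above, which penalizes the bias column as well.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (Humidity, Apparent Temperature): assumed, not real data
rng = np.random.default_rng(0)
humidity = rng.uniform(0.0, 1.0, size=(500, 1))
apparent_temp = 30.0 - 25.0 * humidity[:, 0] + rng.normal(0.0, 3.0, size=500)

x_train, x_test, y_train, y_test = train_test_split(
    humidity, apparent_temp, test_size=0.2, random_state=42
)

results = {}
for a in [1e-2, 1e-1, 1, 10, 100]:
    model = Ridge(alpha=a).fit(x_train, y_train)  # intercept is fitted by default
    mse_train = mean_squared_error(y_train, model.predict(x_train))
    mse_test = mean_squared_error(y_test, model.predict(x_test))
    results[a] = (mse_train, mse_test)
    print(f"alpha={a}: train MSE={mse_train:.3f}, test MSE={mse_test:.3f}")
```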
