# (a) Find a closed-form solution for this problem

The objective function is: $w^* = \arg\min_{w\in\mathbb{R}^d} \frac {1}{N} \sum_{i=1}^{N}\|w^Tx_i - y_i\|^2+\lambda\|w\|^2_2$

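The closed-form solution for this optimization problem is: $$w^* = (X^TX+N\lambda I)^{-1}X^TY$$

where $X \in \mathbb{R}^{N\times d}$ stacks the inputs $x_i^T$ as rows, $Y$ stacks the targets $y_i$, and $I$ is the $d \times d$ identity matrix.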
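
For completeness, the formula can be obtained by writing the objective in matrix form and setting its gradient to zero. The following is a short sketch of that derivation, assuming $\lambda > 0$ so that $X^TX+N\lambda I$ is positive definite and therefore invertible; the symbol $J$ is introduced here only for illustration.

$$
\begin{aligned}
J(w) &= \frac{1}{N}\|Xw - Y\|_2^2 + \lambda\|w\|_2^2 \\
\nabla_w J(w) &= \frac{2}{N}X^T(Xw - Y) + 2\lambda w = 0 \\
(X^TX + N\lambda I)\,w &= X^TY \\
w &= (X^TX+N\lambda I)^{-1}X^TY
\end{aligned}
$$

Since $J$ is strictly convex for $\lambda > 0$, this stationary point is the unique minimizer.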

In [1]:
##imports from libraries
import pandas as pd
import numpy as np
import time
from sklearn import linear_model
from sklearn import metrics

# (b) For “Individual household electric power consumption” dataset

In [None]:
## Load data and preprocessing

## Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00235/household_power_consumption.zip"
data = pd.read_csv(url, sep=";", low_memory=False, 
                 parse_dates={"timestamp":["Date", "Time"]}, 
                 infer_datetime_format=True, index_col="timestamp")

# Preprocess the data
data = data.dropna()
data = data.astype(float)

# Split the data into training and testing sets
n = data.shape[0]
train_ratio = 0.75
train_index = int(n * train_ratio)
X_train = data[["Global_reactive_power","Voltage","Global_intensity"]][:train_index].to_numpy()
y_train = data[["Global_active_power"]][:train_index].to_numpy().ravel()
X_test = data[["Global_reactive_power","Voltage","Global_intensity"]][train_index:].to_numpy()
y_test = data[["Global_active_power"]][train_index:].to_numpy().ravel()

In [None]:
## Closed form solution and optimal linear regressor

# Define lambda here:
lambd = 0.1 # change the value

## Calculate the closed-form solution here:
# def linear_ridge_regression(X, y, lambd):
#     n, m = X.shape
#     identity = np.identity(m)
#     identity[0, 0] = 0
#     w = np.linalg.inv(X.T @ X + lambd * identity) @ X.T @ y
#     return w

reg = linear_model.Ridge(alpha=lambd)
start = time.time()
## Find the optimal linear regressor here:
# w = linear_ridge_regression(X_train, y_train, lambd)
reg.fit(X_train, y_train)
w = reg.coef_

end = time.time()

# Evaluate the performance on the test set
# y_pred = X_test @ w
# mse = np.mean((y_test - y_pred) ** 2)
# print("Mean squared error:", mse)
y_pred = reg.predict(X_test)
mse = metrics.mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
print("Consumed Time in seconds:", end-start)
print("Optimal Regressor:", w)

Mean Squared Error: 0.0018484036970121591
Consumed Time in seconds: 0.11168336868286133
Optimal Regressor: [-0.1899048   0.00404011  0.23987332]
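
The closed-form formula from (a) can also be evaluated directly with NumPy. The following is a minimal sketch on synthetic data, not the code used to produce the result above: the names `ridge_closed_form`, `ridge_objective_gradient`, `X_demo`, and `y_demo` are hypothetical, and no intercept is fitted. Note that scikit-learn's `Ridge` penalizes the unnormalized sum of squared errors with `alpha` and fits an unpenalized intercept by default, so under the objective in (a) `alpha` corresponds to $N\lambda$ rather than $\lambda$.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve min_w (1/N)||Xw - y||^2 + lam*||w||^2 via the normal equations."""
    n_samples, n_features = X.shape
    A = X.T @ X + n_samples * lam * np.eye(n_features)
    # Solving the linear system avoids forming the explicit inverse.
    return np.linalg.solve(A, X.T @ y)

def ridge_objective_gradient(X, y, w, lam):
    """Gradient of (1/N)||Xw - y||^2 + lam*||w||^2 with respect to w."""
    n_samples = X.shape[0]
    return (2.0 / n_samples) * X.T @ (X @ w - y) + 2.0 * lam * w

# Synthetic data (assumption: used only for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y_demo = X_demo @ w_true + 0.01 * rng.normal(size=200)

w_hat = ridge_closed_form(X_demo, y_demo, lam=0.1)
grad_norm = np.linalg.norm(ridge_objective_gradient(X_demo, y_demo, w_hat, lam=0.1))
print("w_hat:", w_hat)
print("gradient norm at w_hat:", grad_norm)
```

The gradient norm at `w_hat` should be zero up to floating-point error, which is a quick sanity check that the solution is a stationary point of the objective.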


# (c) For “Greenhouse gas observing network” dataset

In [2]:
!pip install wget

from numpy.random import multivariate_normal
from scipy.linalg import toeplitz
from numpy.random import randn

import requests
import csv
import wget


In [None]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00328/ghg_data.zip'
file_name = 'datasetGHG'
wget.download(url, file_name)
!unzip datasetGHG

In [4]:
# read .dat to feature and label vectors
'''    
    X : `numpy.array`, shape=(n_samples, n_features)
        The features matrix
    y : `numpy.array`, shape=(n_samples, n_outputs)
        The labels matrix
'''
N = 2921 # number of data 
d = int(5232/16*15) # number of features
dy = int(5232/16) # number of outputs
count = ["%04d" % x for x in range(1, N+1)]
X = np.zeros((N, d))
y = np.zeros((N, dy))
for n in range(1, N+1):
    # Each file holds 16 whitespace-separated rows: 15 feature rows followed by 1 label row
    with open("./ghg_data/ghg.gid.site{}.dat".format(count[n-1])) as f:
        datContent = [line.strip().split() for line in f]
    X[n-1,:] = np.array(datContent[:15]).astype(float).reshape((-1))
    y[n-1,:] = np.array(datContent[15]).astype(float)
print(X.shape)
print(y.shape)

(2921, 4905)
(2921, 327)


In [5]:
n, m = X.shape

train_ratio = 0.8
train_index = int(n * train_ratio)
X_train = X[:train_index]
y_train = y[:train_index]
X_test = X[train_index:]
y_test = y[train_index:]

In [6]:
# Define lambda here:
lambd = 0.1 # change the value

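# Calculate the closed-form solution here:
def linear_ridge_regression(X, y, lambd):
    # Closed-form ridge solution without an intercept term.
    # Note: lambd is used directly here, so it plays the role of N*lambda in the formula from (a).
    n, m = X.shape
    w = np.linalg.inv(X.T @ X + lambd * np.identity(m)) @ X.T @ y
    return w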

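reg = linear_model.Ridge(alpha=lambd)
# Note: the timer below covers both the closed-form solve and the scikit-learn fit
start = time.time()
# Find the optimal linear regressor here:
w1 = linear_ridge_regression(X_train, y_train, lambd)  # closed form, no intercept
reg.fit(X_train, y_train)  # scikit-learn Ridge, fits an intercept by default
w2 = reg.coef_

end = time.time()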

# Evaluate the performance on the test set
y_pred1 = X_test @ w1
y_pred2 = reg.predict(X_test)
mse1 = np.mean((y_test - y_pred1) ** 2)
mse2 = np.mean((y_test - y_pred2) ** 2)
print("Mean Squared Error:", mse1)  # closed-form solution
print("Mean Squared Error:", mse2)  # scikit-learn Ridge
print("Consumed Time in seconds:", end-start)
print("Optimal Regressor:", w1)  # closed-form solution
print("Optimal Regressor:", w2)  # scikit-learn Ridge

Mean Squared Error: 78563.38866984137
Mean Squared Error: 81864.41059090091
Consumed Time in seconds: 14.252393245697021
Optimal Regressor: [[-1.34424906e-01 -1.39461914e-01 -1.38639972e-01 ... -3.57860393e-01
  -4.45591855e-01 -6.69582966e-01]
 [ 4.72797279e-03 -1.17182496e-02 -2.62901689e-02 ... -2.09593741e-02
  -3.16648816e-02 -7.41467648e-02]
 [ 2.78227387e-01  7.40226029e-02  1.69706704e-02 ...  1.09537850e-01
  -3.10123876e-01  7.29959455e-02]
 ...
 [-2.41603239e+00 -8.89302727e-01  3.00106098e+00 ...  2.98104366e+00
  -2.31228301e+00  5.26471446e+00]
 [-2.56668479e+00 -2.32272928e+00  1.01233516e+00 ...  3.06039113e+00
   3.90717584e+00 -5.21008847e-02]
 [-2.34720888e-01  5.16715107e-01  1.63194143e-04 ...  2.84157762e-03
  -3.59185574e+00 -1.26026889e+00]]
Optimal Regressor: [[ 3.94074100e-04  1.76765028e-02  2.65691526e-01 ... -2.26227174e+00
  -2.63778702e+00 -3.98327295e-01]
 [ 3.54619938e-04  6.35239273e-03  9.22494184e-02 ... -1.11892145e+00
  -2.21654729e+00  7.61037890e

# (d) How would you address even bigger datasets?

When handling big datasets, we should consider iterative approaches with a suitable solver. The time complexity of an iterative approach depends on the cost per iteration and the total number of iterations. For some solvers, such as Gradient Descent, the cost per iteration is linear in $N$ (the number of samples in the dataset), while for others, such as Stochastic Gradient Descent, it is independent of $N$.
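
To illustrate a solver whose per-iteration cost is linear in $N$, here is a minimal sketch of full-batch gradient descent on the ridge objective from (a). The function name `ridge_gradient_descent`, the iteration count, and the synthetic data are assumptions made for illustration; the step size is set from the spectral norm of $X$, which is convenient for a small demo but would itself be costly on a very large dataset.

```python
import numpy as np

def ridge_gradient_descent(X, y, lam, n_iters=500):
    """Full-batch gradient descent on (1/N)||Xw - y||^2 + lam*||w||^2."""
    n_samples, n_features = X.shape
    # Step size 1/L, where L = 2 * (largest eigenvalue of X^T X / N + lam)
    # is the Lipschitz constant of the gradient.
    lipschitz = 2.0 * (np.linalg.norm(X, 2) ** 2 / n_samples + lam)
    step = 1.0 / lipschitz
    w = np.zeros(n_features)
    for _ in range(n_iters):
        # Each iteration touches all N samples: O(N * d) work.
        grad = (2.0 / n_samples) * X.T @ (X @ w - y) + 2.0 * lam * w
        w -= step * grad
    return w

# Synthetic data (assumption: used only for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = X_demo @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=500)
lam = 0.1

w_gd = ridge_gradient_descent(X_demo, y_demo, lam)
# Reference: closed-form solution from (a)
w_exact = np.linalg.solve(X_demo.T @ X_demo + 500 * lam * np.eye(5), X_demo.T @ y_demo)
print("max abs difference:", np.max(np.abs(w_gd - w_exact)))
```

On this well-conditioned toy problem the iterates approach the closed-form solution; on ill-conditioned data many more iterations may be needed.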

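For large datasets with a high-dimensional feature space, it is better to choose an iterative approach over the closed-form approach, since forming and solving the $d \times d$ system costs $O(Nd^2 + d^3)$ time and $O(d^2)$ memory for $d$ features.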

Other approaches to address scalability issues include parallel computing, mini-batch learning, feature selection, and dimensionality reduction.
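
As an illustration of mini-batch learning, the sketch below runs mini-batch stochastic gradient descent on the same ridge objective, so each update costs $O(bd)$ for batch size $b$ regardless of $N$. The function name `ridge_minibatch_sgd`, the batch size, the decaying step-size schedule, and the synthetic data are all assumptions chosen for a small demo, not tuned settings.

```python
import numpy as np

def ridge_minibatch_sgd(X, y, lam, batch_size=32, n_epochs=50, step0=0.1, seed=0):
    """Mini-batch SGD on (1/N)||Xw - y||^2 + lam*||w||^2."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    t = 0
    for _ in range(n_epochs):
        # Reshuffle the samples at the start of every epoch.
        order = rng.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Stochastic gradient estimate from the batch: O(batch_size * d) work.
            grad = (2.0 / len(idx)) * Xb.T @ (Xb @ w - yb) + 2.0 * lam * w
            t += 1
            # Decaying step size to damp the gradient noise over time.
            w -= step0 / (1.0 + 0.01 * t) * grad
    return w

# Synthetic data (assumption: used only for illustration)
rng = np.random.default_rng(1)
n_demo, d_demo = 2000, 5
X_demo = rng.normal(size=(n_demo, d_demo))
y_demo = X_demo @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=n_demo)
lam = 0.1

w_sgd = ridge_minibatch_sgd(X_demo, y_demo, lam)
# Reference: closed-form solution from (a)
w_exact = np.linalg.solve(
    X_demo.T @ X_demo + n_demo * lam * np.eye(d_demo), X_demo.T @ y_demo
)
print("max abs difference:", np.max(np.abs(w_sgd - w_exact)))
```

Because only one batch is needed at a time, the same loop could read batches from disk, which is what makes this style of solver attractive when the full dataset does not fit in memory.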