# (Kernel) Ridge Regression
Download the Spotify Tracks Dataset and perform ridge regression to predict the tracks’ popularity. Note that this dataset contains both numerical and categorical features. The student is thus required to follow these guidelines:
- first, train the model using only the numerical features;
- second, appropriately handle the categorical features (for example, with one-hot encoding or other techniques) and use them together with the numerical ones to train the model;
- in both cases, experiment with different training parameters;
- use 5-fold cross-validation to compute your risk estimates, and thoroughly discuss and compare the performance of the model.
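
As one possible way to approach the categorical features mentioned in the guidelines above, the sketch below one-hot encodes a single column with `pd.get_dummies` and joins it with a numerical column. The toy DataFrame, its values, and the column name `track_genre` are assumptions made for illustration; the actual categorical columns depend on the downloaded dataset.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up for illustration).
toy_df = pd.DataFrame({
    "energy": [0.8, 0.3, 0.5],
    "track_genre": ["rock", "jazz", "rock"],  # assumed categorical column name
})

# One-hot encode the categorical column; dtype=float keeps the design matrix numeric.
genre_dummies = pd.get_dummies(toy_df["track_genre"], prefix="genre", dtype=float)

# Combine the numerical and the encoded categorical features into one design matrix.
features_df = pd.concat([toy_df[["energy"]], genre_dummies], axis=1)
print(features_df.columns.tolist())
```

Each distinct category becomes its own 0/1 column, so a feature with many distinct values widens the design matrix considerably; grouping rare categories first may be worth considering.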

The student is required to implement the ridge regression code from scratch (without using libraries such as Scikit-learn); this is not mandatory for the 5-fold cross-validation.
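
For reference, the closed-form solution that a from-scratch implementation would compute can be written as follows. The symbols mirror the arguments of the `ridge_regression(alpha, y, s)` function in this notebook: $S$ is the $n \times d$ design matrix, $y$ the vector of targets, $\alpha > 0$ the regularization strength, and $I_d$ the $d \times d$ identity matrix.

$$
w = \operatorname*{arg\,min}_{u \in \mathbb{R}^d} \; \lVert S u - y \rVert^2 + \alpha \lVert u \rVert^2
  = \left(\alpha I_d + S^\top S\right)^{-1} S^\top y
$$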

Optional: Instead of regular ridge regression, implement kernel ridge regression using a Gaussian kernel.
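
A minimal sketch of what the optional kernel variant could look like with NumPy only. The function names, the parametrization $\exp(-\lVert a - b \rVert^2 / (2\gamma))$ of the Gaussian kernel, and the synthetic data are assumptions for illustration, not part of the assignment text.

```python
import numpy as np

def gaussian_kernel(a, b, gamma):
    """Kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 * gamma))."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * gamma))

def kernel_ridge_fit(x_train, y_train, alpha, gamma):
    """Solve (K + alpha * I) c = y for the dual coefficients c."""
    k = gaussian_kernel(x_train, x_train, gamma)
    return np.linalg.solve(k + alpha * np.eye(len(x_train)), y_train)

def kernel_ridge_predict(x_new, x_train, coef, gamma):
    """Predict as a kernel-weighted combination of the training points."""
    return gaussian_kernel(x_new, x_train, gamma) @ coef

# Synthetic example (made-up data): a nonlinear target of the first feature.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))
y = np.sin(x[:, 0])
coef = kernel_ridge_fit(x, y, alpha=1e-3, gamma=1.0)
pred = kernel_ridge_predict(x, x, coef, gamma=1.0)
print("training MSE:", np.mean((pred - y) ** 2))
```

The kernel matrix has one row and one column per training point, so memory grows quadratically with the training set size; on a large dataset this sketch would likely need a subsample of the rows.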


In [None]:
import pandas as pd
import numpy as np

In [None]:
dataset = "data/dataset.csv"

dataset_df = pd.read_csv(dataset)
dataset_df

In [None]:
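# Random split with roughly 70% of the rows for training; the fixed seed
# (an arbitrary choice) keeps the split reproducible across runs
rng = np.random.default_rng(0)
mask = rng.random(len(dataset_df)) < 0.7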

train_df = dataset_df[mask]
test_df = dataset_df[~mask]

y_train_df = train_df[["popularity"]]
y_train_df

In [None]:
def ridge_regression(alpha, y, s):
    """Return the ridge coefficients w = (alpha * I + s^T s)^{-1} s^T y as a DataFrame."""
    X = s.to_numpy(dtype=float)  # Design matrix, shape (n_rows, n_cols)
    targets = y.to_numpy(dtype=float)  # Target column, shape (n_rows, 1)
    n_cols = X.shape[1]

    # Solve the regularized normal equations; np.linalg.solve avoids forming
    # the explicit inverse and is numerically more stable than np.linalg.inv
    w = np.linalg.solve(alpha * np.identity(n_cols) + X.T @ X, X.T @ targets)

    # Convert the coefficients to a DataFrame for better presentation
    w_df = pd.DataFrame(w, columns=["Values"], index=s.columns)

    return w_df


def predict(w, x):
    return w.transpose().dot(x)

In [None]:
# Numerical features used as predictors
numeric_features = [
    "duration_ms", "danceability", "energy", "loudness", "speechiness",
    "acousticness", "instrumentalness", "liveness", "valence", "tempo",
]

train_numeric_df = train_df[numeric_features]
test_numeric_df = test_df[numeric_features]


In [None]:
alpha = 0.3

result = ridge_regression(alpha, y_train_df, train_numeric_df)
result

In [None]:
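# Predict for the first training row; .iloc is positional, whereas .loc[0]
# raises a KeyError whenever the row labelled 0 ended up in the test split
predict(result, train_numeric_df.iloc[0])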

In [None]:
def square_loss(w, features_df, y):
    """Mean squared error of the linear predictor w on (features_df, y)."""
    X = features_df.to_numpy()  # Convert the DataFrame to a NumPy array
    # Calculate predictions for all rows at once
    predictions = X @ w.to_numpy()
    squared_diff = (predictions - y.to_numpy()) ** 2
    return squared_diff.mean()


y_test_df = test_df[["popularity"]]
print("Loss:", square_loss(result, test_numeric_df, y_test_df))
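
The loss above comes from a single random train/test split. The 5-fold cross-validation requested in the assignment could be sketched as follows; this is one possible layout under stated assumptions, with hypothetical helper names (`ridge_fit`, `cross_validate`) and synthetic data standing in for the real features and `popularity` column.

```python
import numpy as np

def ridge_fit(x, y, alpha):
    """Closed-form ridge solution w = (alpha * I + X^T X)^{-1} X^T y."""
    n_features = x.shape[1]
    return np.linalg.solve(alpha * np.eye(n_features) + x.T @ x, x.T @ y)

def cross_validate(x, y, alpha, n_folds=5, seed=0):
    """Return the mean squared loss averaged over n_folds held-out folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))        # shuffle the rows once
    folds = np.array_split(indices, n_folds)  # n_folds nearly equal parts
    fold_losses = []
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        w = ridge_fit(x[train_idx], y[train_idx], alpha)
        residuals = x[test_idx] @ w - y[test_idx]
        fold_losses.append(np.mean(residuals**2))
    return float(np.mean(fold_losses))

# Synthetic data (made up for illustration): y = X w_true + small noise.
rng = np.random.default_rng(42)
x_demo = rng.normal(size=(200, 4))
w_true = np.array([1.0, -2.0, 0.5, 0.0])
y_demo = x_demo @ w_true + 0.1 * rng.normal(size=200)

# Compare a few regularization strengths by their cross-validated loss.
for alpha in (0.01, 1.0, 100.0):
    print(alpha, cross_validate(x_demo, y_demo, alpha))
```

To use this on the notebook's data, the feature DataFrame and the `popularity` column would first be converted to NumPy arrays (for example with `.to_numpy()`); reusing the same `seed` for every `alpha` keeps the folds identical, so the estimates are directly comparable.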