# Task

Continue with the One-piece [competition case](https://www.kaggle.com/t/f5f7783abf31495f9593b3d93a18f9eb).

This time, the task builds on the multiple linear regression framework from Assignment 2 (yes, like a TV series):
$$y=x'\beta + \epsilon.$$


1. Have you transformed any variables? For example, polynomial terms or interactions?
1. Week 6 covers LOWESS, a simple nonparametric method.
1. Model averaging (called ensembling in ML language; stacking is one variant). If you have a few competitive models, the average of their predictions is sometimes better than any single model's prediction.
1. Use the test sample to generate predictions, submit them on Kaggle, and include a screenshot of your Kaggle score in the PDF file.
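
The model-averaging idea in the list above can be sketched on synthetic data. This is a minimal illustration, not the competition setup: the data, the two column subsets, and the helper names `fit_predict` and `mse` are all assumptions. It uses plain equal-weight averaging rather than stacking, which would learn the weights instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration, not the competition data)
n, p = 400, 6
X = rng.normal(size=(n, p))
beta = np.array([1.0, -0.5, 0.8, 0.0, 0.3, -0.7])
y = X @ beta + rng.normal(size=n)

# Hold out the last 100 rows for validation
X_train, X_val = X[:300], X[300:]
y_train, y_val = y[:300], y[300:]


def fit_predict(cols):
    """Fit OLS on the given columns and predict on the validation rows."""
    coef, *_ = np.linalg.lstsq(X_train[:, cols], y_train, rcond=None)
    return X_val[:, cols] @ coef


def mse(pred):
    """Mean squared error on the validation rows."""
    return float(np.mean((y_val - pred) ** 2))


# Two deliberately different candidate models
pred_a = fit_predict([0, 1, 2])
pred_b = fit_predict([2, 4, 5])

# Equal-weight model average
pred_avg = 0.5 * (pred_a + pred_b)

print("Model A MSE:", mse(pred_a))
print("Model B MSE:", mse(pred_b))
print("Average MSE:", mse(pred_avg))
```

Because squared error is convex, the MSE of the averaged prediction is never larger than the average of the two individual MSEs. It can still be worse than the better single model, so the comparison should be made on held-out data.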

**Note:**
- All instructions in Assignment 1 apply.
- Use $\leq 300$ words.

## Import Packages

In [46]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso, LinearRegression, LassoCV, ElasticNetCV
from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import Pipeline

## 1. Read Data

In [55]:
train_url = "https://github.com/joshcpld/ada/raw/main/Assignment%202/data/train_data.csv"
train_df = pd.read_csv(train_url)

test_url = "https://github.com/joshcpld/ada/raw/main/Assignment%202/data/test_data.csv"
test_df = pd.read_csv(test_url)

train_df.head()

|  | ID | Y | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | ... | X41 | X42 | X43 | X44 | X45 | X46 | X47 | X48 | X49 | X50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | -1.399091 | 1.174139 | 1.413109 | 0.164693 | -1.067338 | 0.015324 | -1.28097 | 0.489681 | -0.371982 | ... | -0.115044 | -2.580043 | -0.812428 | 0.77282 | -0.460444 | 0.190422 | -0.362052 | -1.119038 | 0.916313 | -1.517434 |
| 1 | 1 | 3.09799 | 0.208922 | 0.931231 | 0.838779 | 0.893483 | -0.510555 | 0.900289 | -0.04249 | 0.8394 | ... | 1.155635 | 0.673035 | -0.438152 | -0.001316 | -0.7618 | 1.335092 | 0.901978 | -1.549504 | -0.456224 | 0.223405 |
| 2 | 2 | -1.707346 | -0.744982 | 0.962118 | 0.615392 | -0.427943 | -0.014912 | 1.138781 | 1.159491 | 0.055467 | ... | 0.299277 | 1.387495 | -0.007519 | -0.464825 | 0.830986 | 0.373124 | 0.319232 | -0.577295 | -1.363846 | -0.347154 |
| 3 | 3 | 0.610625 | -0.170428 | -1.361771 | 0.206042 | 0.623124 | 0.907441 | -0.873814 | 1.287383 | 0.901191 | ... | 1.209247 | 0.095866 | -0.287905 | -1.110714 | -1.660352 | 0.207231 | -0.419119 | -0.517563 | -1.050697 | -0.096327 |
| 4 | 4 | -0.689196 | -0.858792 | 0.321308 | -0.415649 | 1.014056 | -0.522858 | 0.926634 | -0.390663 | 0.790054 | ... | -1.191989 | -1.127448 | 0.246358 | 0.407769 | 1.132454 | -0.016621 | 0.964745 | 0.091532 | 0.649593 | -0.81802 |


## 2. Split predictors

In [56]:
X0_train = train_df.drop(columns=['Y', 'ID'])
y0_train = train_df['Y'].values

cols = X0_train.columns.tolist()

## 3. Polynomial and interaction terms

In [57]:
# Define function
def matrix(df, cols):
    # Original terms: X1, X2, X3, ..., X50
    base = df[cols].copy()

    # Squared terms: X1^2, X2^2, X3^2, ..., X50^2
    squares = pd.DataFrame({f"{c}^2": df[c]**2 for c in cols}, index=df.index)

    # Interaction terms: X1*X2, X1*X3, ..., X49*X50
    inter_dict = {}
    for i, c1 in enumerate(cols):
        for j in range(i+1, len(cols)):
            c2 = cols[j]
            inter_dict[f"{c1}*{c2}"] = df[c1] * df[c2]
    inter = pd.DataFrame(inter_dict, index=df.index)

    # Combine all together
    X_poly = pd.concat([base, squares, inter], axis=1)
    return X_poly

# Create the poly training set
X_poly_train = matrix(train_df, cols)

print("Shape of original X0_train:", X0_train.shape)
print("Shape of poly training set:", X_poly_train.shape)

Shape of original X0_train: (2400, 50)
Shape of poly training set: (2400, 1325)
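
The column count printed above can be checked by hand: 50 linear terms, 50 squared terms, and one interaction per unordered pair of predictors. A small sketch of that arithmetic follows; the variable names are illustrative only.

```python
from itertools import combinations
from math import comb

p = 50  # number of original predictors X1..X50
cols = [f"X{i}" for i in range(1, p + 1)]

n_linear = p          # X1, ..., X50
n_square = p          # X1^2, ..., X50^2
n_inter = comb(p, 2)  # X1*X2, ..., X49*X50 (unordered pairs)

# Cross-check the closed form against an explicit enumeration of pairs
assert n_inter == len(list(combinations(cols, 2)))

total = n_linear + n_square + n_inter
print(n_inter, total)  # 1225 1325
```

With 2400 rows and 1325 columns, the expanded design is wide relative to the sample size, which is one reason a penalized estimator such as LASSO is a natural choice for the next step.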


## 4. Standardize Predictors

In [58]:
scaler = StandardScaler()
X1_train = scaler.fit_transform(X_poly_train.values)
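
For reference, the standardization step amounts to subtracting each column's training mean and dividing by its training standard deviation, then reusing those same statistics on the test set. The sketch below reproduces this with NumPy on synthetic arrays (the data and variable names are assumptions); `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for the training and test design matrices
X_train = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
X_test = rng.normal(loc=5.0, scale=3.0, size=(50, 4))

# Column statistics are estimated on the training data only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # ddof=0, matching StandardScaler

Z_train = (X_train - mu) / sigma
Z_test = (X_test - mu) / sigma  # reuse training statistics, do not refit

print(Z_train.mean(axis=0).round(6), Z_train.std(axis=0).round(6))
```

The standardized training columns have mean 0 and standard deviation 1 by construction. The test columns will generally be close to, but not exactly, 0 and 1, because they are scaled with the training statistics.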

## 5. LASSO with 5-fold CV

In [59]:
alphas = np.logspace(-5, 2, num=100)
lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=10000, random_state=42)

# Fit the model
lasso_cv.fit(X1_train, y0_train)
lasso_cv
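
After fitting, it can help to look at which penalty was chosen and how many coefficients survive. The sketch below does this on synthetic data with a smaller grid; the data, grid, and variable names are assumptions for illustration, not the notebook's results. It relies on the fitted `LassoCV` attributes `alpha_`, `coef_`, and `mse_path_`.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic data: only the first two of 30 predictors matter
n, p = 200, 30
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

X_std = StandardScaler().fit_transform(X)
alphas = np.logspace(-3, 1, num=30)
lasso = LassoCV(alphas=alphas, cv=5, max_iter=10000).fit(X_std, y)

# Number of predictors with a non-zero coefficient at the chosen penalty
n_selected = int(np.count_nonzero(lasso.coef_))

# If the chosen alpha sits on the edge of the grid, the grid may be too narrow
at_grid_edge = bool(np.isclose(lasso.alpha_, [alphas.min(), alphas.max()]).any())

# Cross-validated MSE per alpha, averaged over the 5 folds
mean_cv_mse = lasso.mse_path_.mean(axis=1)

print("Chosen alpha:", lasso.alpha_)
print("Non-zero coefficients:", n_selected, "of", p)
print("Alpha on grid edge:", at_grid_edge)
print("Best mean CV MSE:", mean_cv_mse.min())
```

The same three checks apply to the fitted `lasso_cv` above: a chosen `alpha_` at either end of `np.logspace(-5, 2, num=100)` would suggest widening the grid.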

## 6. Model Prediction

In [None]:
X0_test = test_df.drop(columns=['ID'])

# Build the polynomial test features and standardize them with the training scaler
X_poly_test = matrix(X0_test, cols)
X1_test = scaler.transform(X_poly_test.values)

# Predict and create Kaggle submission
pred_test = lasso_cv.predict(X1_test)
# Build the submission directly from the test IDs; no template file is needed
submission_ass3 = pd.DataFrame({'ID': test_df['ID'], 'Y': pred_test})
submission_ass3.to_csv('submission_ass3_group9.csv', index=False)
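
Before uploading, a quick sanity check of the submission frame can catch common mistakes. The sketch below is a hypothetical helper (`find_problems` is not part of the notebook) and assumes the expected format is the two columns `ID` and `Y` written above, with one row per test ID.

```python
import numpy as np
import pandas as pd


def find_problems(df, expected_ids):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if list(df.columns) != ["ID", "Y"]:
        problems.append("columns should be exactly ['ID', 'Y']")
        return problems
    if len(df) != len(expected_ids):
        problems.append("row count differs from the test set")
    if not df["ID"].is_unique:
        problems.append("duplicate IDs")
    if not np.isfinite(df["Y"].to_numpy(dtype=float)).all():
        problems.append("missing or infinite predictions")
    return problems


# Hypothetical stand-ins for test_df['ID'] and pred_test
ids = np.arange(5)
good = pd.DataFrame({"ID": ids, "Y": [0.1, -1.2, 3.4, 0.0, 2.2]})
bad = pd.DataFrame({"ID": ids, "Y": [0.1, np.nan, 3.4, 0.0, 2.2]})

print(find_problems(good, ids))
print(find_problems(bad, ids))
```

In the notebook, the equivalent call would be `find_problems(submission_ass3, test_df['ID'])`, run just before `to_csv`.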