# Hyperparameter tuning
We outline the challenges we faced when designing the experiments used to tune our hyperparameters. If this doesn't interest you, skip to **"Final Experimental Setup"**.
## Initial Experimental Setup
The goal is to test several models across several datasets and parameters. The initial plan was:
1. Three datasets: numerical, complete, FAMD complete
    - Numerical: Just the numerical and ordinal data. This is roughly 30 columns.
    - Complete: Both the numerical data and the categorical data, which is one hot encoded. This is roughly 500 columns.
    - FAMD complete: This is the complete data, but the categorical columns have been prepared for PCA.
2. Yes/no PCA: We want to test the effect of PCA on improving model performance.
3. 2 models: Linear, Polynomial
4. Hyperparameter set 1: `alpha`
   - This is just different values of regularization for the model.
5. Hyperparameter set 2: `degree`
   - This is strictly for polynomial regression, where we try different degrees for polynomial features.

Suppose we pick $n$ values of `alpha` and $m$ values of `degree`. In total, that's $3 \text{ datasets} \times 2 \text{ (yes/no PCA)} \times (n\text{ linear reg alphas} + n\text{ polynomial reg alphas} \times m\text{ polynomial reg degrees}) = 6 \times (n + n \times m)$. With $n = m = 4$, that is $6 \times (4 + 16) = 120$ different models to evaluate. Note that we don't simply multiply every layer together, which would give $3 \times 2 \times 2 \times 4 \times 4 = 192$ models, because linear regression does not tune `degree`.
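
As a sanity check on the arithmetic above, the grid can be enumerated directly. This is a minimal sketch: the dataset labels and the placeholder `alpha` and `degree` values are illustrative assumptions, and only the counts matter.

```python
from itertools import product

datasets = ["numerical", "complete", "famd_complete"]
pca_options = [True, False]
alphas = [0.01, 0.1, 1, 10]   # placeholder values; n = 4
degrees = [2, 3, 4, 5]        # placeholder values; m = 4

# Linear models tune alpha only; polynomial models tune alpha and degree.
linear_configs = list(product(datasets, pca_options, alphas))
poly_configs = list(product(datasets, pca_options, alphas, degrees))
n_models = len(linear_configs) + len(poly_configs)

# Naive count: multiply every layer, including the two model types.
n_naive = len(datasets) * len(pca_options) * 2 * len(alphas) * len(degrees)

print(n_models, n_naive)  # 120 192
```

The gap between the two counts is exactly the linear models that would otherwise be repeated once per degree.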
### Caveats to the initial setup
While working on the experiment, we came across several roadblocks and realizations.
1. It doesn't make sense to use the FAMD complete dataset without PCA. The whole point of FAMD is to prepare a dataset for PCA; without PCA, it's functionally the same as the complete dataset. **Practically, we should only test the FAMD complete dataset in tandem with PCA.**
2. The polynomial features preprocessing step is too memory-intensive for the complete dataset at higher degrees. Let's say that we are working with $32 \text{ GB}$ of memory on our machine.

Assuming that we have $1800$ rows with $500$ columns, and each value is a `float` costing 8 bytes, then the space cost of `degree=3` for the complete dataset is:
$$
\begin{aligned}
\sum_{i=1}^3{500 \choose i}&=20833750 \text{ features} \\
\implies 1800 \text{ rows} \times 20833750 \text{ features} \times 8 \text{ bytes} \times \frac{1\text{ GB}}{10^9 \text{ bytes}} &\approx 300 \text{ GB}
\end{aligned}
$$

For `degree=2`, the cost is approximately $1.8 \text{ GB}$. 

Meanwhile, if we have $1800$ rows with only $30$ columns, it will take `degree=7` to cross $32 \text{ GB}$:
$$
\begin{aligned}
\sum_{i=1}^7{30 \choose i}&\approx2.8\times 10^6 \text{ features} \\
\implies 1800 \text{ rows} \times 2.8 \times 10^6 \text{ features} \times 8 \text{ bytes} \times \frac{1\text{ GB}}{10^9 \text{ bytes}} &\approx 40 \text{ GB}
\end{aligned}
$$

For `degree=6`, the cost is approximately $11 \text{ GB}$.
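
The estimates above can be reproduced with a few lines of Python. This is a sketch under the same assumptions as the text (1800 rows, 8 bytes per value, decimal gigabytes); the helper names are hypothetical. Note that the sum of binomial coefficients counts only products of distinct features, whereas scikit-learn's `PolynomialFeatures` with default settings also generates powers such as $x^2$, giving $\binom{n+d}{d} - 1$ columns when `include_bias=False`, so the real cost is somewhat higher.

```python
from math import comb

def n_interaction_features(n_cols, degree):
    """Feature count used in the text: sum of C(n_cols, i) for i = 1..degree."""
    return sum(comb(n_cols, i) for i in range(1, degree + 1))

def n_polynomial_features(n_cols, degree):
    """Column count with powers included and no bias column."""
    return comb(n_cols + degree, degree) - 1

def size_gb(n_rows, n_features, bytes_per_value=8):
    """Size of a dense n_rows x n_features matrix in decimal gigabytes."""
    return n_rows * n_features * bytes_per_value / 1e9

for n_cols, degree in [(500, 2), (500, 3), (30, 6), (30, 7)]:
    text_estimate = size_gb(1800, n_interaction_features(n_cols, degree))
    with_powers = size_gb(1800, n_polynomial_features(n_cols, degree))
    print(f"{n_cols} cols, degree={degree}: "
          f"{text_estimate:.1f} GB (text), {with_powers:.1f} GB (with powers)")
```

Either way of counting leads to the same conclusion: the complete dataset runs out of memory long before the numerical one does.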

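We also measured the column type with `sys.getsizeof()`, which returns $120$ for the `float64` dtype. Read as a per-cell cost, that would put even `degree=2` on 500 columns at roughly 27 GB. However, that call measures the dtype object itself rather than a cell; each `float64` value still occupies 8 bytes. Either way, the estimates above cover only the size of the transformed data and ignore the other factors that limit our memory allowance, so the usable degrees are lower than they suggest.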

**Through experiments, we've determined which degrees we can actually use. Practically, with the complete dataset we can only go up to approximately `degree=1`, while with the numerical dataset alone we can go up to approximately `degree=3`. This limits the combinations we can experiment with.**
### Outline of final experimental setup
The datasets are the only real limitation for what we can experiment with. Given what we know now, our final experimental setup should look like this, again assuming 4 alphas and 4 degrees.
- The numerical dataset can be tested with everything.
    - $2 \times (4 \times 4 + 4)=40$ models based on the numerical dataset. 
- The complete dataset can be tested with everything except polynomial degree, which should be limited to 2.
    - $2 \times (4 \times 1 + 4) = 16$ models based on the complete dataset.
- The FAMD complete dataset can only be tested with PCA, and its polynomial degree should also be limited to 2.
    - $1 \times (4 \times 1 + 4) = 8$ models based on the FAMD complete dataset.

In this example, $40 + 16 + 8 = 64$ models are being trained, assuming we keep the same 4-4 split of alphas and degrees. However, the real number depends on what our experiments show we can actually run.
## Final Experimental Setup
Numerical features will always be scaled.

Models will be trained on the numerical dataset, generating $2 \times (7 + 7 \times 2) = 42$ models:
- With and without PCA.
- Linear Regression, $7$ choices of `alpha`.
- Polynomial Regression, $7$ choices of `alpha` crossed with $2$ choices of degrees: 2 and 3.

Models will be trained on the complete dataset, generating $2 \times 7 = 14$ models:
- With and without PCA.
- Linear Regression, $7$ choices of `alpha`.
- No Polynomial regression. 

Models will be trained on the FAMD complete dataset, generating $1 \times 7 = 7$ models:
- With PCA.
- Linear Regression, $7$ choices of `alpha`.
- No Polynomial regression.

The result should be $42 + 14 + 7 = 63$ models.
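
The final setup can be tallied in the same spirit. This is a small sketch in which the `setup` dictionary and the `count_models` helper are hypothetical names that simply encode the three lists above.

```python
# Final setup as described above: 7 alphas everywhere; polynomial degrees
# 2 and 3 on the numerical dataset only; FAMD complete only with PCA.
N_ALPHAS = 7

setup = {
    "numerical":     {"pca_options": 2, "poly_degrees": 2},
    "complete":      {"pca_options": 2, "poly_degrees": 0},
    "FAMD complete": {"pca_options": 1, "poly_degrees": 0},
}

def count_models(pca_options, poly_degrees, n_alphas=N_ALPHAS):
    # Linear regression contributes n_alphas models; each polynomial
    # degree contributes another n_alphas models.
    return pca_options * (n_alphas + n_alphas * poly_degrees)

counts = {name: count_models(**cfg) for name, cfg in setup.items()}
total = sum(counts.values())
print(counts, total)
```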

# Experiment

## Read and create datasets
Here we create the datasets we'll be experimenting with, and split the target column from the feature matrices.

In [1]:
from data_processor import DataProcessor
import pandas as pd

# read data
df = pd.read_csv('../data/train.csv')
dp = DataProcessor(df)

# complete data
complete_df = dp.complete_data()
target = complete_df['SalePrice']
complete_df.drop('SalePrice', axis=1, inplace=True)
display(complete_df.head())

# numerical data
num_df = dp.numerical_data()
num_df.drop('SalePrice', axis=1, inplace=True)
display(num_df.head())

# FAMD complete data
_, famd_cat_df = dp.famd_data()
famd_df = pd.concat([num_df, famd_cat_df], axis=1)
display(famd_df.head())
display(target.head())

Unnamed: 0,LotFrontage,LotArea,OverallQual,OverallCond,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,...,MoSold_8,MoSold_9,MoSold_10,MoSold_11,MoSold_12,YrSold_2006,YrSold_2007,YrSold_2008,YrSold_2009,YrSold_2010
0,65.0,8450,7,5,196.0,706,0,150,856,856,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,80.0,9600,6,8,0.0,978,0,284,1262,1262,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,68.0,11250,7,5,162.0,486,0,434,920,920,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,60.0,9550,7,5,0.0,216,0,540,756,961,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,84.0,14260,8,5,350.0,655,0,490,1145,1145,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


Unnamed: 0,LotFrontage,LotArea,OverallQual,OverallCond,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,...,BsmtFinType1,BsmtFinType2,HeatingQC,Electrical,KitchenQual,Functional,FireplaceQu,GarageQual,GarageCond,PoolQC
0,65.0,8450,7,5,196.0,706,0,150,856,856,...,2.0,5.0,0.0,4.0,2.0,6.0,-1.0,4.0,4.0,-1.0
1,80.0,9600,6,8,0.0,978,0,284,1262,1262,...,0.0,5.0,0.0,4.0,3.0,6.0,4.0,4.0,4.0,-1.0
2,68.0,11250,7,5,162.0,486,0,434,920,920,...,2.0,5.0,0.0,4.0,2.0,6.0,4.0,4.0,4.0,-1.0
3,60.0,9550,7,5,0.0,216,0,540,756,961,...,0.0,5.0,2.0,4.0,2.0,6.0,2.0,4.0,4.0,-1.0
4,84.0,14260,8,5,350.0,655,0,490,1145,1145,...,2.0,5.0,0.0,4.0,2.0,6.0,4.0,4.0,4.0,-1.0


Unnamed: 0,LotFrontage,LotArea,OverallQual,OverallCond,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,...,MoSold_8,MoSold_9,MoSold_10,MoSold_11,MoSold_12,YrSold_2006,YrSold_2007,YrSold_2008,YrSold_2009,YrSold_2010
0,65.0,8450,7,5,196.0,706,0,150,856,856,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.057354,0.0,0.0
1,80.0,9600,6,8,0.0,978,0,284,1262,1262,...,0.0,0.0,0.0,0.0,0.0,0.0,0.055132,0.0,0.0,0.0
2,68.0,11250,7,5,162.0,486,0,434,920,920,...,0.0,0.125988,0.0,0.0,0.0,0.0,0.0,0.057354,0.0,0.0
3,60.0,9550,7,5,0.0,216,0,540,756,961,...,0.0,0.0,0.0,0.0,0.0,0.056433,0.0,0.0,0.0,0.0
4,84.0,14260,8,5,350.0,655,0,490,1145,1145,...,0.0,0.0,0.0,0.0,0.130189,0.0,0.0,0.057354,0.0,0.0


0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64

In [2]:
import sys
# note: this measures the dtype object itself, not the storage used by one cell
sys.getsizeof(num_df['LotFrontage'].dtype)

120
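
One caveat on the measurement above: `sys.getsizeof()` applied to a `dtype` reports the size of the Python object that describes the type, not the storage for one cell. A hedged sketch on a synthetic column (the 1800-row `Series` is an assumption standing in for `num_df['LotFrontage']`) shows how to read the per-value cost instead:

```python
import sys
import numpy as np
import pandas as pd

# Synthetic stand-in for one float64 column with 1800 rows.
col = pd.Series(np.zeros(1800), dtype="float64")

dtype_object_size = sys.getsizeof(col.dtype)  # size of the dtype descriptor object
bytes_per_value = col.dtype.itemsize          # storage for one float64 value
data_bytes = col.to_numpy().nbytes            # raw buffer: rows * itemsize

print(dtype_object_size, bytes_per_value, data_bytes)
```

`itemsize` and `nbytes` are the figures that matter when estimating how large a transformed feature matrix will be.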

## Define model pipeline
This function creates the pipeline that `GridSearchCV` will use.
- We always scale the numerical features. We use `ColumnTransformer` to scale *only* the numerical features: one-hot encoded columns don't need scaling, and when the FAMD complete dataset is used, we don't want to rescale its prepared categorical columns.
- Generate the polynomial features if `degree>1`. If `degree=1`, this step is skipped and the model is plain linear regression.
- Use `PCA`, if enabled, keeping enough components to reach an explained variance ratio of 95%.
- Add the model. This will usually be `Ridge`, which is linear regression with L2 regularization.

In [3]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

def create_pipeline(model, use_pca, degree=1):
    steps = []

    # always scale numerical data
    scaler = ColumnTransformer(
        transformers=[('scaler', StandardScaler(), num_df.columns)],
        remainder='passthrough'  # Keep one-hot-encoded categorical features
    )
    steps.append(('scaler', scaler))

    # generate polynomial combinations of the features if the model is polynomial
    if degree > 1:
        steps.append(('poly', PolynomialFeatures(degree=degree, include_bias=False)))

    # use pca, with an explained variance ratio of 95%
    if use_pca:
        steps.append(('pca', PCA(n_components=0.95)))

    # add the model
    steps.append(('model', model))

    return Pipeline(steps)
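
To illustrate what `PCA(n_components=0.95)` does in the pipeline above: when `n_components` is a float between 0 and 1, scikit-learn keeps the smallest number of components whose cumulative explained variance reaches that fraction. A minimal sketch on synthetic data (the data and its shape are assumptions, not the housing dataset):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 200 rows, 10 columns built from only 3 underlying signals plus small noise
signals = rng.normal(size=(200, 3))
X = signals @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95).fit(X_scaled)

kept = int(pca.n_components_)
explained = float(pca.explained_variance_ratio_.sum())
print(f"kept {kept} of {X.shape[1]} components, "
      f"explaining {explained:.1%} of the variance")
```

Because the number of retained components depends on the data, the width of the matrix passed to the model can differ between datasets and between cross-validation folds.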

## Define experiment
This function builds the pipeline and runs a cross-validated grid search over the given parameters. The cross-validation results are then trimmed to the columns we need.

In [4]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

def run_experiment(X, y, dataset_name, use_pca, degree, param_grid, verbose=1):
    # build the pipeline and tune it with 10-fold cross-validated grid search
    pipeline = create_pipeline(Ridge(), use_pca, degree=degree)
    regr = GridSearchCV(
        estimator=pipeline, param_grid=param_grid, cv=10,
        scoring='neg_root_mean_squared_error', n_jobs=-1
    )
    regr.fit(X, y)

    # examine results
    cv_results = pd.DataFrame(regr.cv_results_)
    if verbose == 1:
        display(cv_results)

    # add to results list
    # 1. prune unnecessary stats
    pruned_results = cv_results[['mean_fit_time', 'params', 'mean_test_score']].copy()
    # 2. add degrees, dataset, and pca usage as parameters
    pruned_results['dataset_name'] = dataset_name
    pruned_results['use_pca'] = use_pca
    pruned_results['degree'] = degree
    # 3. add rmse and nrmse (rmse as a percentage of the mean target), in pretty format;
    #    the scores are negated rmse values, so flip the sign first
    pruned_results['rmse'] = pruned_results['mean_test_score'].apply(lambda x: round(-x))
    pruned_results['nrmse'] = pruned_results['mean_test_score'].apply(lambda x: f'{round(-x / y.mean() * 100, 1)}%')
    
    return pruned_results
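
The sign handling in step 3 follows from scikit-learn's convention that a higher score is always better, which is why `neg_root_mean_squared_error` reports RMSE as a negative number. A standalone sketch of the conversion, where `summarize_score` is a hypothetical helper and the inputs are illustrative rather than taken from the experiments:

```python
def summarize_score(mean_test_score, target_mean):
    """Convert a negated RMSE from GridSearchCV into RMSE and NRMSE."""
    rmse = -mean_test_score            # undo scikit-learn's sign flip
    nrmse = rmse / target_mean * 100   # RMSE as a percentage of the mean target
    return round(rmse), f"{round(nrmse, 1)}%"

# Illustrative numbers only.
print(summarize_score(-30000.0, 200000.0))  # (30000, '15.0%')
```

Normalizing by the mean target makes the error easier to compare against the scale of the prices being predicted.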

## Experiments

### Experiment 1: Numerical dataset

In [5]:
# init params
param_grid = {
    'model__alpha' : [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}
degrees = [1, 2, 3]
pca_options = [True, False]
num_results = pd.DataFrame()

# run experiment
for use_pca in pca_options:
    for degree in degrees:
        exp_result = run_experiment(
            X=num_df, y=target, dataset_name='numerical', 
            use_pca=use_pca, degree=degree, param_grid=param_grid
        )
        num_results = pd.concat([num_results, exp_result], ignore_index=True)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_model__alpha,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0.018708,0.000831,0.005575,0.000212,0.001,{'model__alpha': 0.001},-25375.364675,-28142.530153,-25185.455554,-41487.017487,-38292.886814,-28977.232406,-27707.671185,-26927.254983,-62616.433441,-29874.77574,-33458.662244,10994.122885,6
1,0.010978,0.004137,0.003085,0.001208,0.01,{'model__alpha': 0.01},-25375.335285,-28142.527248,-25185.454821,-41486.999736,-38292.921339,-28977.236212,-27707.654967,-26927.281647,-62616.362843,-29874.70684,-33458.648094,10994.108095,5
2,0.014149,0.002853,0.004914,0.00124,0.1,{'model__alpha': 0.1},-25375.041457,-28142.498285,-25185.447549,-41486.822278,-38293.26661,-28977.274348,-27707.492873,-26927.548308,-62615.65698,-29874.017984,-33458.506667,10993.960209,4
3,0.01516,0.000854,0.005324,0.00038,1.0,{'model__alpha': 1},-25372.11062,-28142.216775,-25185.379781,-41485.052761,-38296.721786,-28977.662945,-27705.880506,-26930.216482,-62608.609428,-29867.143637,-33457.099472,10992.482595,3
4,0.01557,0.001432,0.00563,0.000542,10.0,{'model__alpha': 10},-25343.534145,-28140.197483,-25185.18824,-41467.854614,-38331.514268,-28982.258386,-27690.598318,-26957.049206,-62539.221675,-29799.797761,-33443.72141,10977.829074,2
5,0.017344,0.002529,0.006199,0.000705,100.0,{'model__alpha': 100},-25120.537637,-28185.587652,-25223.845179,-41337.558631,-38698.463238,-29086.602028,-27608.392234,-27236.074011,-61937.436068,-29246.169285,-33368.066596,10841.767043,1
6,0.016901,0.002516,0.005309,0.000646,1000.0,{'model__alpha': 1000},-25355.689207,-30616.503446,-27163.881978,-41616.7622,-42724.077207,-31792.396365,-29137.213478,-29961.782256,-59446.560266,-28433.095571,-34624.796197,9932.123729,7


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_model__alpha,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,4.644471,1.327642,0.020389,0.001689,0.001,{'model__alpha': 0.001},-25925.003722,-30989.57341,-31132.089543,-68976.443155,-44967.621394,-34805.212647,-26723.804786,-29075.012562,-55725.005522,-32268.422551,-38058.818929,13469.277776,6
1,4.881876,1.301396,0.022331,0.00312,0.01,{'model__alpha': 0.01},-25924.985869,-30989.575017,-31132.079391,-68976.413378,-44967.602845,-34805.210732,-26723.805171,-29075.009686,-55725.063793,-32268.418105,-38058.816399,13469.280076,5
2,3.740188,0.489264,0.020717,0.005845,0.1,{'model__alpha': 0.1},-25924.807355,-30989.59109,-31131.977888,-68976.115625,-44967.417371,-34805.191579,-26723.809038,-29074.980931,-55725.646499,-32268.37365,-38058.791103,13469.303069,4
3,4.458057,1.059332,0.022523,0.005499,1.0,{'model__alpha': 1},-25923.023634,-30989.75268,-31130.964787,-68973.139282,-44965.563796,-34805.000638,-26723.84862,-29074.694236,-55731.47319,-32267.929687,-38058.539055,13469.533042,3
4,4.27521,1.164612,0.021386,0.002225,10.0,{'model__alpha': 10},-25905.327524,-30991.453538,-31121.025917,-68943.494865,-44947.144246,-34803.148891,-26724.335715,-29071.912392,-55789.703555,-32263.54838,-38056.109502,13471.837125,2
5,3.402701,0.832688,0.016074,0.007172,100.0,{'model__alpha': 100},-25741.970508,-31016.456277,-31039.686164,-68658.54743,-44774.140032,-34790.118531,-26737.85809,-29052.225922,-56368.233924,-32225.3414,-38040.457828,13495.313312,1
6,2.367853,0.24844,0.005272,0.001404,1000.0,{'model__alpha': 1000},-25090.384773,-31733.874096,-31279.590187,-66657.372144,-43850.447118,-35017.397038,-27415.169228,-29400.914492,-61717.485396,-32239.10513,-38440.17396,13766.653346,7


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_model__alpha,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,52.519001,4.604616,0.092039,0.017868,0.001,{'model__alpha': 0.001},-68354.347196,-82079.343842,-113897.70676,-82669.642815,-95832.04658,-77250.83448,-72081.765196,-70549.808193,-89959.867179,-71939.43501,-82461.479725,13453.346913,7
1,43.45453,6.700369,0.050833,0.010723,0.01,{'model__alpha': 0.01},-68354.347197,-82079.343841,-113897.706645,-82669.642815,-95832.046581,-77250.83448,-72081.765196,-70549.808193,-89959.867179,-71939.43501,-82461.479714,13453.346887,6
2,38.917366,1.765735,0.055102,0.025336,0.1,{'model__alpha': 0.1},-68354.3472,-82079.343821,-113897.705497,-82669.642816,-95832.046583,-77250.834478,-72081.765198,-70549.8082,-89959.867179,-71939.435009,-82461.479598,13453.346618,5
3,48.423488,7.342771,0.071975,0.018224,1.0,{'model__alpha': 1},-68354.347232,-82079.343628,-113897.694015,-82669.642826,-95832.046609,-77250.834464,-72081.765219,-70549.808263,-89959.867179,-71939.435,-82461.478444,13453.343929,4
4,42.671894,9.079529,0.044931,0.016359,10.0,{'model__alpha': 10},-68354.347551,-82079.341697,-113897.579197,-82669.642929,-95832.046864,-77250.834317,-72081.76543,-70549.8089,-89959.867186,-71939.434912,-82461.466898,13453.317037,3
5,36.594175,4.323732,0.023156,0.011517,100.0,{'model__alpha': 100},-68354.350742,-82079.322383,-113896.431042,-82669.643958,-95832.049418,-77250.832849,-72081.767539,-70549.815266,-89959.867255,-71939.434023,-82461.351447,13453.048131,2
6,22.723867,5.326384,0.018823,0.010206,1000.0,{'model__alpha': 1000},-68354.382644,-82079.129263,-113884.952251,-82669.654245,-95832.074951,-77250.818169,-72081.788626,-70549.87892,-89959.867945,-71939.425135,-82460.197215,13450.359906,1


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_model__alpha,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0.006379,0.001522,0.001829,0.000428,0.001,{'model__alpha': 0.001},-25906.129058,-34773.990799,-24592.281251,-41914.260612,-37645.883756,-29503.344771,-27709.409295,-26967.482944,-71662.876656,-30654.542329,-35133.020147,13253.82105,7
1,0.005432,0.001413,0.002056,0.000673,0.01,{'model__alpha': 0.01},-25906.033754,-34771.911713,-24592.192767,-41914.186378,-37645.928821,-29503.119904,-27709.470076,-26967.509241,-71660.18232,-30654.327675,-35132.486265,13253.106598,6
2,0.007084,0.002031,0.002615,0.000678,0.1,{'model__alpha': 0.1},-25905.081639,-34751.166303,-24591.309971,-41913.444714,-37646.379755,-29500.876571,-27710.076698,-26967.772181,-71633.329803,-30652.183081,-35127.162072,13245.988134,5
3,0.005089,0.001002,0.001953,0.000532,1.0,{'model__alpha': 1},-25895.65149,-34548.172448,-24582.682995,-41906.094766,-37650.91644,-29478.968064,-27716.025016,-26970.398435,-71373.578395,-30630.93013,-35075.341818,13177.318034,4
4,0.007131,0.001432,0.002802,0.00065,10.0,{'model__alpha': 10},-25809.208503,-32892.182497,-24513.428456,-41838.522616,-37698.362282,-29304.118567,-27765.087296,-26996.202509,-69423.792935,-30435.793566,-34667.669923,12674.782623,3
5,0.006583,0.001455,0.002349,0.000492,100.0,{'model__alpha': 100},-25291.488456,-28476.905186,-24418.90718,-41454.829768,-38197.81469,-29009.03685,-27858.286313,-27227.830702,-64324.097545,-29330.560536,-33558.975723,11485.810351,1
6,0.006523,0.001514,0.002353,0.000476,1000.0,{'model__alpha': 1000},-25242.48104,-30370.587767,-26822.147467,-41551.358807,-42563.987949,-31832.327313,-29211.662646,-29890.430328,-59739.871245,-28266.208329,-34549.106289,10043.261182,2


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_model__alpha,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0.206827,0.062213,0.010297,0.001692,0.001,{'model__alpha': 0.001},-55389.728372,-253879.89406,-884735.395859,-103175.760671,-64959.799878,-61535.758026,-66502.645733,-136961.641983,-207489.08992,-63069.251677,-189769.896618,240627.474656,7
1,0.277696,0.065538,0.012324,0.001807,0.01,{'model__alpha': 0.01},-54965.673439,-177169.298776,-860707.51223,-99450.878923,-59729.213623,-52204.19441,-60941.657197,-100133.620474,-159325.387906,-56578.012145,-168120.544912,234753.49699,6
2,0.220671,0.015748,0.010101,0.001547,0.1,{'model__alpha': 0.1},-54545.940915,-75087.117348,-720916.36011,-91129.91434,-54191.661716,-43236.54201,-49060.888488,-51411.525021,-75390.867926,-49877.577636,-126484.839551,198666.065101,5
3,0.302567,0.042004,0.013373,0.005771,1.0,{'model__alpha': 1},-48476.172565,-49582.316339,-429626.846963,-85030.180636,-48508.007592,-40014.931769,-40107.802837,-42007.362375,-78514.943037,-48631.507177,-91050.007129,113830.471468,4
4,0.293703,0.081081,0.010842,0.001203,10.0,{'model__alpha': 10},-36295.674895,-35623.825055,-145462.768753,-74976.284261,-39352.10597,-35239.653832,-34299.035463,-33786.752794,-69811.433298,-43507.535927,-54835.507025,33425.883327,3
5,0.236966,0.029697,0.007755,0.002545,100.0,{'model__alpha': 100},-28472.444414,-26013.344522,-53191.780326,-68793.141066,-35173.521031,-30798.300031,-26849.716474,-25558.830271,-53593.48142,-32708.46462,-38115.302418,14224.805823,2
6,0.186469,0.045032,0.004855,0.000319,1000.0,{'model__alpha': 1000},-24039.636302,-27231.354846,-30447.814731,-65745.980498,-38874.366327,-31057.042039,-24343.200881,-25499.788742,-42241.005439,-28301.331261,-33778.152106,12096.58357,1


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_model__alpha,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,4.47163,0.739104,0.033914,0.006612,0.001,{'model__alpha': 0.001},-226618.743728,-593617.763029,-7615862.0,-349394.650796,-588513.479181,-374401.905056,-240756.579341,-64966.150003,-512655.554838,-329397.944092,-1089619.0,2181158.0,7
1,4.521357,0.728802,0.035923,0.005078,0.01,{'model__alpha': 0.01},-85957.329873,-572982.728,-1022915.0,-166694.048509,-181644.94307,-96042.182414,-94851.34296,-64893.625203,-158504.486289,-122473.499706,-256695.9,290940.2,4
2,3.690834,0.120926,0.031464,0.005248,0.1,{'model__alpha': 0.1},-81671.379121,-464707.318047,-1873199.0,-150273.417889,-109542.939116,-80150.070991,-98172.998409,-63682.381425,-127558.022237,-106507.977093,-315546.5,530859.1,6
3,4.234298,0.464422,0.038066,0.002719,1.0,{'model__alpha': 1},-69719.60493,-277542.516204,-1799009.0,-130632.435044,-68844.154965,-72284.765061,-79382.1609,-56249.447496,-86831.02947,-97923.251648,-273841.8,512060.0,5
4,4.442668,0.982727,0.031032,0.013584,10.0,{'model__alpha': 10},-50979.185074,-136783.847301,-1322189.0,-125204.658831,-52641.906618,-58931.511566,-61623.349073,-45776.645547,-69817.917845,-74245.902331,-199819.4,375288.6,3
5,3.495936,0.110287,0.018046,0.009009,100.0,{'model__alpha': 100},-37177.955468,-64838.06389,-652959.7,-123793.549832,-42640.106695,-45847.800087,-38784.341565,-33419.211342,-86241.656752,-44544.217767,-117024.7,180633.0,2
6,2.286677,0.665803,0.01265,0.001619,1000.0,{'model__alpha': 1000},-30851.929455,-40592.564546,-145117.6,-123755.167285,-38155.70213,-35448.763823,-29497.356088,-27679.335484,-122744.932752,-34416.081673,-62825.94,44835.88,1


In [6]:
ranked_results = num_results.sort_values(by='mean_test_score', ascending=False)
ranked_results

Unnamed: 0,mean_fit_time,params,mean_test_score,dataset_name,use_pca,degree,rmse,nrmse
5,0.017344,{'model__alpha': 100},-33368.07,numerical,True,1,33368,18.4%
4,0.01557,{'model__alpha': 10},-33443.72,numerical,True,1,33444,18.5%
3,0.01516,{'model__alpha': 1},-33457.1,numerical,True,1,33457,18.5%
2,0.014149,{'model__alpha': 0.1},-33458.51,numerical,True,1,33459,18.5%
1,0.010978,{'model__alpha': 0.01},-33458.65,numerical,True,1,33459,18.5%
0,0.018708,{'model__alpha': 0.001},-33458.66,numerical,True,1,33459,18.5%
26,0.006583,{'model__alpha': 100},-33558.98,numerical,False,1,33559,18.5%
34,0.186469,{'model__alpha': 1000},-33778.15,numerical,False,2,33778,18.7%
27,0.006523,{'model__alpha': 1000},-34549.11,numerical,False,1,34549,19.1%
6,0.016901,{'model__alpha': 1000},-34624.8,numerical,True,1,34625,19.1%


### Experiment 2: Complete dataset

In [7]:
# same parameters, but degree locked to 1
complete_results = pd.DataFrame()
# run experiment
for use_pca in pca_options:
    exp_result = run_experiment(
        X=complete_df, y=target, dataset_name='complete', 
        use_pca=use_pca, degree=1, param_grid=param_grid
    )    
    complete_results = pd.concat([complete_results, exp_result], ignore_index=True)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_model__alpha,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0.891496,0.239938,0.017366,0.002267,0.001,{'model__alpha': 0.001},-24004.3474,-29005.999662,-25036.567098,-41373.959064,-36189.813883,-25959.060609,-25178.386154,-24230.663783,-61321.745945,-27541.417916,-31984.196151,11179.627617,6
1,0.900477,0.247625,0.017533,0.003693,0.01,{'model__alpha': 0.01},-24003.82025,-29004.960342,-25035.967934,-41373.740851,-36189.78012,-25958.927366,-25178.306022,-24230.422338,-61321.770093,-27541.12385,-31983.881917,11179.757399,5
2,0.74551,0.132904,0.016417,0.002252,0.1,{'model__alpha': 0.1},-23998.559249,-28994.589968,-25029.987167,-41371.563046,-36189.447193,-25957.601266,-25177.508767,-24228.014774,-61322.009969,-27538.18903,-31980.747043,11181.052395,4
3,1.045968,0.226466,0.019204,0.00186,1.0,{'model__alpha': 1},-23946.979988,-28893.127221,-24971.247841,-41350.207761,-36186.576516,-25944.959557,-25169.931993,-24204.614751,-61324.250974,-27509.414211,-31950.131081,11193.723762,3
4,0.861843,0.231519,0.016686,0.001973,10.0,{'model__alpha': 10},-23518.583964,-28067.475225,-24474.97609,-41171.612358,-36194.184986,-25869.02468,-25125.16636,-24027.001801,-61333.074096,-27269.855871,-31705.095543,11296.413254,2
5,0.689964,0.115075,0.010764,0.004169,100.0,{'model__alpha': 100},-22280.862036,-26115.805739,-22893.173438,-40519.658795,-37035.326099,-26423.088438,-25349.318648,-24010.015014,-60948.435145,-26494.636371,-31207.031972,11450.611678,1
6,0.500699,0.070328,0.005522,0.001431,1000.0,{'model__alpha': 1000},-23451.843222,-29497.388219,-25244.056434,-40505.451708,-42099.167672,-30670.732924,-27780.034071,-27628.680196,-58396.473439,-26681.61472,-33195.544261,10240.908121,7


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_model__alpha,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0.115955,0.01592,0.011603,0.000713,0.001,{'model__alpha': 0.001},-28566.415993,-34907.303604,-29207.3312,-48345.450725,-31666.829221,-51419.148641,-27909.920365,-29120.481604,-81347.94129,-28136.473292,-39062.729593,16275.054658,7
1,0.119842,0.017554,0.012671,0.00107,0.01,{'model__alpha': 0.01},-28248.41745,-34650.306528,-28948.809147,-47881.151406,-31629.751054,-50665.90801,-27780.992956,-29094.1487,-81090.321308,-27848.574625,-38783.838118,16199.709296,6
2,0.112646,0.02793,0.012625,0.001906,0.1,{'model__alpha': 0.1},-26505.432091,-33371.614449,-28025.187614,-45425.002064,-31772.788616,-45732.62681,-27242.956249,-28924.155453,-79184.761559,-26494.118362,-37267.864327,15582.172108,5
3,0.112437,0.018223,0.012308,0.002002,1.0,{'model__alpha': 1},-24401.670082,-31451.360771,-27854.164143,-41689.86579,-33119.861656,-33725.670372,-27193.646599,-27713.587204,-72911.040631,-25181.099675,-34524.196692,13679.614141,4
4,0.108482,0.025328,0.013865,0.001957,10.0,{'model__alpha': 10},-22804.147091,-28773.651225,-25828.507771,-39916.811945,-34104.047296,-26550.129435,-25654.878754,-25015.982544,-65467.421905,-25190.360354,-31930.593832,12187.258844,2
5,0.120426,0.021812,0.011681,0.003009,100.0,{'model__alpha': 100},-21633.741566,-25941.508206,-22896.604002,-39943.063925,-36346.110075,-25809.179038,-24967.606231,-24024.017949,-61526.146863,-25791.252626,-30887.923048,11654.995709,1
6,0.092721,0.019975,0.006975,0.001834,1000.0,{'model__alpha': 1000},-23353.493305,-29447.829385,-25226.282648,-40417.573588,-42021.732583,-30576.529398,-27719.56908,-27620.518457,-58448.298924,-26601.993877,-33143.382124,10264.285382,3


In [8]:
ranked_results = complete_results.sort_values(by='mean_test_score', ascending=False)
ranked_results

Unnamed: 0,mean_fit_time,params,mean_test_score,dataset_name,use_pca,degree,rmse,nrmse
12,0.120426,{'model__alpha': 100},-30887.923048,complete,False,1,30888,17.1%
5,0.689964,{'model__alpha': 100},-31207.031972,complete,True,1,31207,17.2%
4,0.861843,{'model__alpha': 10},-31705.095543,complete,True,1,31705,17.5%
11,0.108482,{'model__alpha': 10},-31930.593832,complete,False,1,31931,17.6%
3,1.045968,{'model__alpha': 1},-31950.131081,complete,True,1,31950,17.7%
2,0.74551,{'model__alpha': 0.1},-31980.747043,complete,True,1,31981,17.7%
1,0.900477,{'model__alpha': 0.01},-31983.881917,complete,True,1,31984,17.7%
0,0.891496,{'model__alpha': 0.001},-31984.196151,complete,True,1,31984,17.7%
13,0.092721,{'model__alpha': 1000},-33143.382124,complete,False,1,33143,18.3%
6,0.500699,{'model__alpha': 1000},-33195.544261,complete,True,1,33196,18.3%


### Experiment 3: FAMD complete dataset

In [9]:
# same parameters, but degree locked to 1 and pca locked to true
famd_results = pd.DataFrame()
# run experiment
exp_result = run_experiment(
    X=famd_df, y=target, dataset_name='FAMD complete', 
    use_pca=True, degree=1, param_grid=param_grid
)    
famd_results = pd.concat([famd_results, exp_result], ignore_index=True)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_model__alpha,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0.981071,0.197595,0.014752,0.001055,0.001,{'model__alpha': 0.001},-25361.041242,-28134.039008,-25178.986642,-41481.107551,-38290.516257,-28965.195212,-27702.570995,-26920.145643,-62609.87193,-29875.917464,-33451.939194,10994.940721,6
1,0.868252,0.217951,0.017884,0.007649,0.01,{'model__alpha': 0.01},-25361.011983,-28134.03624,-25178.985872,-41481.089836,-38290.550793,-28965.199129,-27702.554836,-26920.172279,-62609.801386,-29875.848566,-33451.925092,10994.925928,5
2,0.719494,0.133003,0.012779,0.003194,0.1,{'model__alpha': 0.1},-25360.719471,-28134.008642,-25178.978218,-41480.912732,-38290.896179,-28965.238374,-27702.393331,-26920.43865,-62609.096058,-29875.159731,-33451.784139,10994.778009,4
3,1.017692,0.235063,0.016057,0.004028,1.0,{'model__alpha': 1},-25357.80176,-28133.740753,-25178.906636,-41479.146742,-38294.352498,-28965.638036,-27700.786827,-26923.103928,-62602.053835,-29868.285569,-33450.381658,10993.300073,3
4,0.790932,0.296126,0.014635,0.004581,10.0,{'model__alpha': 10},-25329.353432,-28131.853959,-25178.677009,-41461.982243,-38329.155155,-28970.340355,-27685.560911,-26949.907571,-62532.717702,-29800.93901,-33437.048735,10978.64347,2
5,0.718044,0.159388,0.01159,0.006488,100.0,{'model__alpha': 100},-25107.385346,-28178.271583,-25216.976016,-41331.901121,-38696.113881,-29075.457394,-27603.732917,-27228.640972,-61931.320955,-29247.118696,-33361.691888,10842.563079,1
6,0.496846,0.092045,0.005451,0.001453,1000.0,{'model__alpha': 1000},-25345.30756,-30611.026503,-27155.530126,-41610.772989,-42720.234441,-31781.96024,-29132.060438,-29952.657949,-59441.474056,-28429.901509,-34618.092581,9933.144987,7


In [10]:
ranked_results = famd_results.sort_values(by='mean_test_score', ascending=False)
ranked_results

Unnamed: 0,mean_fit_time,params,mean_test_score,dataset_name,use_pca,degree,rmse,nrmse
5,0.718044,{'model__alpha': 100},-33361.691888,FAMD complete,True,1,33362,18.4%
4,0.790932,{'model__alpha': 10},-33437.048735,FAMD complete,True,1,33437,18.5%
3,1.017692,{'model__alpha': 1},-33450.381658,FAMD complete,True,1,33450,18.5%
2,0.719494,{'model__alpha': 0.1},-33451.784139,FAMD complete,True,1,33452,18.5%
1,0.868252,{'model__alpha': 0.01},-33451.925092,FAMD complete,True,1,33452,18.5%
0,0.981071,{'model__alpha': 0.001},-33451.939194,FAMD complete,True,1,33452,18.5%
6,0.496846,{'model__alpha': 1000},-34618.092581,FAMD complete,True,1,34618,19.1%


## Results of the complete experiment

In [12]:
# combine results for all experiments and rank by rmse
total_results = pd.concat([num_results, complete_results, famd_results], ignore_index=True)
ranked_total_results = total_results.sort_values(by='mean_test_score', ascending=False)

# display entire df
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    display(ranked_total_results)

Unnamed: 0,mean_fit_time,params,mean_test_score,dataset_name,use_pca,degree,rmse,nrmse
54,0.120426,{'model__alpha': 100},-30887.92,complete,False,1,30888,17.1%
47,0.689964,{'model__alpha': 100},-31207.03,complete,True,1,31207,17.2%
46,0.861843,{'model__alpha': 10},-31705.1,complete,True,1,31705,17.5%
53,0.108482,{'model__alpha': 10},-31930.59,complete,False,1,31931,17.6%
45,1.045968,{'model__alpha': 1},-31950.13,complete,True,1,31950,17.7%
44,0.74551,{'model__alpha': 0.1},-31980.75,complete,True,1,31981,17.7%
43,0.900477,{'model__alpha': 0.01},-31983.88,complete,True,1,31984,17.7%
42,0.891496,{'model__alpha': 0.001},-31984.2,complete,True,1,31984,17.7%
55,0.092721,{'model__alpha': 1000},-33143.38,complete,False,1,33143,18.3%
48,0.500699,{'model__alpha': 1000},-33195.54,complete,True,1,33196,18.3%
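
Since the combined ranking is dominated by one dataset, it can help to look at the best configuration per dataset as well. The sketch below rebuilds a tiny frame from a few rows of the tables above (scores rounded to two decimals; the `alpha` column is unpacked from `params` for readability) and picks the top row of each group. It is an illustration, not part of the original notebook.

```python
import pandas as pd

# A few rows copied from the result tables above (mean_test_score is negated RMSE).
rows = [
    ("numerical",     True,  1, 100,  -33368.07),
    ("numerical",     False, 2, 1000, -33778.15),
    ("complete",      False, 1, 100,  -30887.92),
    ("complete",      True,  1, 100,  -31207.03),
    ("FAMD complete", True,  1, 100,  -33361.69),
    ("FAMD complete", True,  1, 1000, -34618.09),
]
results = pd.DataFrame(
    rows, columns=["dataset_name", "use_pca", "degree", "alpha", "mean_test_score"]
)

# Highest (least negative) score per dataset = lowest RMSE per dataset.
best_idx = results.groupby("dataset_name")["mean_test_score"].idxmax()
best_per_dataset = results.loc[best_idx].sort_values("mean_test_score", ascending=False)
print(best_per_dataset)
```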


## Analysis
- `Ridge` performs much better than plain linear regression thanks to regularization. You would have to set `alpha` as low as `0.000000000000001` before the error explodes from 21% NRMSE into the millions, as it does in linear regression.
- In general, `alpha=100` performs best for the linear (`degree=1`) models; performance degrades at `alpha=1000`.
- The expectation was that the complete dataset without PCA would be terrible. With regularization, however, it ranks first (RMSE of 30888, or 17.1% NRMSE), which suggests that overfitting was the real issue.
- Unfortunately, FAMD failed to perform as expected. It took the lead in raw linear regression with no regularization, but as soon as any regularization was added, FAMD was overtaken.

The lesson is that regularization is an incredibly strong tuner. 
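
One way to see why regularization matters so much is to fit both models on nearly collinear features, a situation that wide, one-hot encoded data can resemble. The following is a toy sketch on synthetic data (all values are assumptions, unrelated to the housing dataset): ordinary least squares assigns huge, offsetting coefficients to two almost identical columns, while `Ridge` keeps them small.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 1e-6 * rng.normal(size=200)   # almost an exact copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

ols_norm = float(np.linalg.norm(ols.coef_))
ridge_norm = float(np.linalg.norm(ridge.coef_))
print(f"OLS coefficient norm:   {ols_norm:.1f}")
print(f"Ridge coefficient norm: {ridge_norm:.1f}")
```

Large offsetting coefficients fit the training folds well but generalize poorly, which is consistent with the error blowing up when `alpha` approaches zero.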