# Application of HMLasso

The main goal of this notebook is to use the previously implemented HMLasso to drastically reduce the number of columns.

## Imports

Please install root/mosek/mosek.lib before starting the optimization.

In [1]:
!cp "/content/drive/MyDrive/Statapp/file_04_HMLasso.py" "HMLasso.py"

In [4]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler # To standardize the data

import HMLasso as hml # Lasso with High Missing Rate

In [3]:
columns_types = pd.read_csv("/content/drive/MyDrive/Statapp/data_03_columns_types.csv")
data = pd.read_csv("/content/drive/MyDrive/Statapp/data_03.csv")


## Formatting the database

Let us start by identifying the temporal variables.

We reuse the code from a previous notebook to do so.

In [5]:
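# For each wave-dependent variable, record the waves (1 to 14) in which it appears.
temporal_variables = {}
# Wave columns look like R1MPART: a prefix letter, a wave number, then a suffix.
waves_columns = [col for col in data.columns if "genetic_" not in col and col[1] in "123456789"]
for col in waves_columns:
  char = col[0] # R or H
  if col[2] in "01234": # Two-digit wave number (10 to 14)
    wave = col[1:3]
    suffix = col[3:]
  else: # One-digit wave number (1 to 9)
    wave = col[1]
    suffix = col[2:]
  variable = char + 'w' + suffix # Wave-independent name, e.g. "RwMPART"

  if variable not in temporal_variables:
    temporal_variables[variable] = np.zeros(14, dtype=bool)

  temporal_variables[variable][int(wave)-1] = True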

temporal_variables = pd.DataFrame(temporal_variables)

# We manually add "GHIw":
temporal_variables["GHIw"] = np.ones((14), dtype=bool)
waves_columns += [f"GHI{w}" for w in range(1,15)]
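
The slicing logic above can also be written as a single regular expression. The following is a minimal sketch, not the notebook's own code: the helper name `split_wave_column` and the pattern are hypothetical, and it assumes that wave columns start with one capital letter followed by a wave number between 1 and 14. Unlike the loop above, it would read a prefix such as `R20` as wave 2 rather than wave 20.

```python
import re

# Hypothetical pattern: one capital letter, a wave number from 1 to 14, then the suffix.
# The two-digit alternative comes first so that "12" is not read as wave 1.
WAVE_COLUMN = re.compile(r"^([A-Z])(1[0-4]|[1-9])(.*)$")

def split_wave_column(col):
    """Return (wave-independent name, wave number), or None for non-wave columns."""
    match = WAVE_COLUMN.match(col)
    if match is None:
        return None
    char, wave, suffix = match.groups()
    return char + "w" + suffix, int(wave)

print(split_wave_column("R1MPART"))
print(split_wave_column("HHIDPN"))
```

Names such as `GHI1` do not match this pattern either, since their second character is not a digit, which is consistent with `GHIw` being added by hand.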

## Don't do this at home!

In this section, we look at what happens if we simply line up all variables from all waves, regardless of whether each variable is present in a given wave.

In [6]:
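# Wave-1 columns: variables flagged True in the first row, with 'w' replaced by the wave number
wave1_mask = temporal_variables.iloc[0]
columns_wave1 = [col.replace('w', str(1)) for col in temporal_variables.columns[wave1_mask]]
# Columns that do not depend on a wave
non_waves_columns = [col for col in data.columns if col not in waves_columns]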

data_wave1 = data.loc[data["INW1"] == 1, columns_wave1 + non_waves_columns]
data_wave1.head()

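| Unnamed: 0 | R1MPART | R1MLEN | R1MCURLN | R1MLENM | R1MNEV | H1ANYFIN | H1ANYFAM | R1FAMR | R1FINR | H1HHRESP | ... | REXITWV_14.0 | REPLDEATH_1.0 | REPLDEATH_2.0 | REPLDEATH_3.0 | REPLDEATH_4.0 | REPLDEATH_5.0 | REPLDEATH_7.0 | REXPDEATH_1.0 | REXPDEATH_2.0 | REXPDEATH_7.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 20.2 | NaN | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.0 | 25.8 | NaN | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 31.3 | 31.3 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 2.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 0.0 | 31.2 | 31.2 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 2.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | NaN | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |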
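
The same selection can be done for any wave, not only the first. The sketch below uses a small made-up presence table instead of the real `temporal_variables`; the helper name `columns_for_wave` and the toy variable names are assumptions made for illustration.

```python
import pandas as pd

# Toy presence table: one row per wave, one column per wave-independent variable.
presence = pd.DataFrame({
    "RwMPART": [True, True, False],
    "HwANYFIN": [True, False, True],
})

def columns_for_wave(presence, wave):
    """List the concrete column names of the variables present in the given wave."""
    mask = presence.iloc[wave - 1].to_numpy()  # row `wave - 1` holds the flags for that wave
    # Replace only the first 'w', which is the wave placeholder.
    return [name.replace("w", str(wave), 1) for name in presence.columns[mask]]

print(columns_for_wave(presence, 1))
print(columns_for_wave(presence, 2))
```

Limiting the replacement to the first `'w'` keeps the helper safe if a suffix ever contained a lowercase `w`.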


In [7]:
columns_to_drop_in_Xy = ["HHIDPN", "HHID", "PN"] + [f"INW{w}" for w in range(1,15)] + ["genetic_Section_A_or_E"] + ["GHI1"]
X = data_wave1.drop(columns=columns_to_drop_in_Xy).values
y = data_wave1["GHI1"].values

In [8]:
print(X.shape, y.shape)

(12651, 946) (12651,)


In [9]:
scaler = StandardScaler() # Alternative: StandardScaler(with_std=False)
X = scaler.fit_transform(X)

In [10]:
X

array([[-0.17716539, -0.54910809,         nan, ...,  0.91789705,
        -0.8697622 , -0.16526982],
       [-0.17716539, -0.04533656,         nan, ..., -1.0894468 ,
         1.14973955, -0.16526982],
       [-0.17716539,  0.44943904,  0.34480223, ...,  0.91789705,
        -0.8697622 , -0.16526982],
       ...,
       [-0.17716539,  0.30550432,  0.20068924, ...,         nan,
                nan,         nan],
       [-0.17716539,  2.04171688, -2.27625286, ...,  0.91789705,
        -0.8697622 , -0.16526982],
       [-0.17716539, -0.83697753, -2.27625286, ...,         nan,
                nan,         nan]])
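
The missing values survive standardization: `StandardScaler` disregards NaNs when fitting and keeps them when transforming. The NumPy sketch below reproduces that behaviour on a toy array to make the computation explicit. The function name `standardize_ignoring_nan` is hypothetical, and the sketch does not handle columns that are entirely NaN.

```python
import numpy as np

def standardize_ignoring_nan(X):
    """Center and scale each column using only its observed (non-NaN) entries."""
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)          # population standard deviation (ddof=0)
    std = np.where(std == 0, 1.0, std)  # leave constant columns unscaled
    return (X - mean) / std             # NaN entries stay NaN

X_toy = np.array([[1.0, np.nan],
                  [3.0, 2.0],
                  [5.0, 4.0]])
Z = standardize_ignoring_nan(X_toy)
print(Z)
```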

In [12]:
hml.ERRORS_HANDLING = "ignore"
lasso = hml.HMLasso(mu = 100, verbose = True)
lasso.fit(X, y)
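
The log printed by this call mentions three quantities: R, rho_pair and S_pair. The code of `HMLasso.py` is not shown here, so the following is only a sketch of how such pairwise quantities are usually defined for HMLasso, assuming the module follows that formulation: S_pair averages products of two columns over the rows where both are observed, rho_pair averages products of a column with `y` over the rows where the column is observed, and R is the fraction of rows where both columns are observed. The function name `pairwise_statistics` is hypothetical, and `y` is assumed to have no missing values.

```python
import numpy as np

def pairwise_statistics(X, y):
    """Pairwise estimates from incomplete data (assumed definitions, see text)."""
    n = X.shape[0]
    observed = ~np.isnan(X)
    X0 = np.where(observed, X, 0.0)     # zero out missing entries so they add nothing
    obs = observed.astype(float)
    counts = obs.T @ obs                # counts[j, k]: rows where j and k are both observed
    R = counts / n
    S_pair = np.divide(X0.T @ X0, counts,
                       out=np.zeros_like(counts), where=counts > 0)
    n_j = obs.sum(axis=0)               # rows where column j is observed
    rho_pair = np.divide(X0.T @ y, n_j,
                         out=np.zeros_like(n_j), where=n_j > 0)
    return R, rho_pair, S_pair

X_toy = np.array([[1.0, 2.0], [np.nan, 1.0], [3.0, np.nan], [-1.0, 0.0]])
y_toy = np.array([1.0, 2.0, 3.0, 4.0])
R, rho_pair, S_pair = pairwise_statistics(X_toy, y_toy)
print(R, rho_pair, S_pair, sep="\n")
```

With 946 columns, a matrix of this shape has 946 × 946 = 894916 entries, which matches the number of variables CVXPY reports, assuming the first problem optimizes over a full matrix of that size.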

[Imputing parameters] Starting...
[Imputing parameters] R calculated.
[Imputing parameters] rho_pair calculated.
[Imputing parameters] S_pair calculated.
[Imputing parameters] Parameters imputed.
[First Problem] Starting...
[First Problem] Objective and constraints well-defined.
                                     CVXPY                                     
                                     v1.3.1                                    
(CVXPY) Apr 10 09:49:52 AM: Your problem has 894916 variables, 1 constraints, and 0 parameters.
(CVXPY) Apr 10 09:49:52 AM: It is compliant with the following grammars: DCP, DQCP
(CVXPY) Apr 10 09:49:52 AM: (If you need to solve this problem multiple times, but with different data, consider using parameters.)
(CVXPY) Apr 10 09:49:52 AM: CVXPY will first compile your problem; then, it will invoke a numerical solver to obtain a solution.
-------------------------------------------------------------------------------
                                  Compil