The goal of this notebook is to get started with XGBoost and apply it to our data.

## Library imports

In [1]:
!cp "/content/drive/MyDrive/Statapp/file_04_HMLasso.py" "HMLasso.py"

In [2]:
!cp "/content/drive/MyDrive/Statapp/manipulate_data.py" "manipulate_data.py"

In [3]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler # To standardize the data
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score

import xgboost as xgb # eXtreme Gradient Boosting
import HMLasso as hml # Lasso with High Missing Rate
import manipulate_data as manip # Useful functions

import time # To measure elapsed time during simulation

## Data imports

In [4]:
columns_types = pd.read_csv("/content/drive/MyDrive/Statapp/data_03_columns_types.csv")
data = pd.read_csv("/content/drive/MyDrive/Statapp/data_03.csv")
# data = pd.read_csv("/content/drive/MyDrive/Statapp/data_04.csv")

In [5]:
data.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42232 entries, 0 to 42231
Columns: 4161 entries, HHIDPN to GHI14
dtypes: float64(4061), int64(99), object(1)
memory usage: 1.3 GB


## Trying XGBoost

This section uses XGBoost as a regressor to predict the index (GHI).

### Using HMLasso

To speed up the computation, we use HMLasso to select a small subset of potentially useful variables. To do so, we first fit HMLasso on (X, y), where each row of X is a (HHIDPN, wave) observation and y is the corresponding GHIw.

In [6]:
untimed_data = manip.drop_time(data, keep_genetic=False)
untimed_data.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264618 entries, 0 to 264617
Columns: 192 entries, HHIDPN to GHIw
dtypes: float64(190), int64(2)
memory usage: 387.6 MB


In [7]:
X = untimed_data.drop(columns=["HHIDPN", "GHIw"]).values
y = untimed_data["GHIw"].values

y_scaled = y - y.mean()
X_scaled = StandardScaler().fit_transform(X)

hml.ERRORS_HANDLING = "ignore"
lasso = hml.HMLasso(mu = 100, verbose = True)
lasso.fit(X_scaled, y_scaled)

[Imputing parameters] Starting...
[Imputing parameters] R calculated.
[Imputing parameters] rho_pair calculated.
[Imputing parameters] S_pair calculated.
[Imputing parameters] Parameters imputed.
[First Problem] Starting...
[First Problem] Objective and constraints well-defined.
                                     CVXPY                                     
                                     v1.3.1                                    
(CVXPY) May 10 04:28:08 PM: Your problem has 36100 variables, 1 constraints, and 0 parameters.
(CVXPY) May 10 04:28:08 PM: It is compliant with the following grammars: DCP, DQCP
(CVXPY) May 10 04:28:08 PM: (If you need to solve this problem multiple times, but with different data, consider using parameters.)
(CVXPY) May 10 04:28:08 PM: CVXPY will first compile your problem; then, it will invoke a numerical solver to obtain a solution.
-------------------------------------------------------------------------------
                                  Compila
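
A side note on the scaling step in the cell above: scikit-learn's `StandardScaler` ignores missing values when it computes means and standard deviations, and leaves them as NaN in its output, so standardizing does not impute anything before HMLasso runs. A minimal sketch on a made-up array (`X_toy` is illustrative, not project data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up matrix with one missing entry in the second column.
X_toy = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [3.0, 30.0]])

X_std = StandardScaler().fit_transform(X_toy)

# The second column is standardized using its two observed values only;
# the missing entry is still NaN afterwards.
print(X_std)
```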

In [8]:
columns_for_lasso = untimed_data.drop(columns = ["HHIDPN", "GHIw"]).columns
criteria = pd.Series(abs(lasso.beta_opt) > 1e-9)
columns_to_keep = list(pd.Series(columns_for_lasso)[criteria.index[criteria]])
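
The selection rule used here (keep the variables whose coefficient is non-zero up to a `1e-9` tolerance) can be sketched on synthetic data. This sketch makes a loud assumption: scikit-learn's ordinary `Lasso` stands in for `HMLasso`, whose internals are not shown in this notebook, and the data, column names and `alpha` value are invented for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic features: only x0 and x3 actually drive the target.
toy = pd.DataFrame(rng.normal(size=(n, 6)), columns=[f"x{i}" for i in range(6)])
y_toy = 3.0 * toy["x0"] - 2.0 * toy["x3"] + rng.normal(scale=0.1, size=n)

# Same preprocessing as in the notebook: standardize X, center y.
X_toy = StandardScaler().fit_transform(toy.values)
y_centered = (y_toy - y_toy.mean()).values

# Stand-in for HMLasso (assumption): a plain L1-penalized regression.
toy_lasso = Lasso(alpha=0.5).fit(X_toy, y_centered)

# Keep the columns whose coefficient is numerically non-zero.
mask = np.abs(toy_lasso.coef_) > 1e-9
selected = list(toy.columns[mask])
print(selected)
```

A boolean mask applied directly to the column index gives the same list as the two-step `pd.Series` construction in the cell above.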

In [9]:
# Loading data
waves = [1,2,3,4,5,6,7,8,9,10,11,12,13,14]
columns_to_keep_for_each_wave = [var.replace('w', str(wave)) for var in columns_to_keep for wave in waves] + [var for var in data.columns if 'genetic_' in var]
working_data = manip.get_sample(data, waves = waves)
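
To make the name expansion above concrete, here is a small sketch in plain Python. The template names `RwMSTAT` and `RwMLEN` are assumptions inferred from per-wave columns such as `R1MSTAT` and `R1MLEN`; the name `Rwwork` is purely hypothetical and only illustrates a caveat.

```python
toy_waves = [1, 2, 3]
templates = ["RwMSTAT", "RwMLEN"]  # assumed wave-agnostic names

# Same comprehension shape as in the notebook: one column name per (variable, wave).
expanded = [var.replace("w", str(wave)) for var in templates for wave in toy_waves]
print(expanded)

# Caveat: str.replace substitutes every lowercase "w", not only the wave placeholder.
hypothetical = "Rwwork"
replaced_all = hypothetical.replace("w", "1")       # every "w" is replaced
replaced_first = hypothetical.replace("w", "1", 1)  # only the first "w" is replaced
print(replaced_all, replaced_first)
```

If a selected variable name contained a second lowercase `w`, limiting the replacement count as in the last line would be one way to keep the rest of the name intact.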

In [10]:
# Keep only the columns selected by the lasso
working_data = working_data[['HHIDPN'] + columns_to_keep_for_each_wave + [f'GHI{wave}' for wave in range(1, 15)]]
working_data.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3396 entries, 0 to 3395
Columns: 1500 entries, HHIDPN to GHI14
dtypes: float64(1498), int64(1), object(1)
memory usage: 39.0 MB


In [11]:
# Formatting the database
variables_per_type = manip.get_columns_types(working_data, columns_types)

working_data[variables_per_type["Char"]] = working_data[variables_per_type["Char"]].astype('category')
working_data[variables_per_type["Categ"]] = working_data[variables_per_type["Categ"]].astype('category')

working_data.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3396 entries, 0 to 3395
Columns: 1500 entries, HHIDPN to GHI14
dtypes: category(855), float64(644), int64(1)
memory usage: 19.7 MB


In [12]:
FeatureTypes = []
for col in working_data.dtypes:
  if col == "category":
    FeatureTypes.append('c') # 'c' for categorical
  else:
    FeatureTypes.append('q') # 'q' for quantitative
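
A compact way to build the same kind of list is sketched below on a made-up frame. Two points are worth keeping in mind: XGBoost spells the corresponding parameter `feature_types`, and such a list needs one entry per column of the matrix actually passed to `fit`, so it is computed here after dropping the identifier and the target. The toy values are invented for the illustration.

```python
import pandas as pd

# Made-up frame: one identifier, one numeric feature, one categorical feature, one target.
toy = pd.DataFrame({
    "HHIDPN": [1, 2, 3],
    "R1MLEN": [2.0, 5.8, 8.0],
    "R1MSTAT": pd.Categorical([1.0, 1.0, 2.0]),
    "GHI14": [0.1, 0.4, 0.3],
})

# Drop the identifier and the target first, so the list has one entry per feature.
X_toy = toy.drop(columns=["HHIDPN", "GHI14"])
toy_feature_types = [
    "c" if isinstance(dtype, pd.CategoricalDtype) else "q"  # 'c' categorical, 'q' quantitative
    for dtype in X_toy.dtypes
]
print(dict(zip(X_toy.columns, toy_feature_types)))
```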

### Training with different data

In [13]:
def train_model(working_data, data_to_use="all", simulation="short", params_grid=None, random_state=None, verbose=False):
  """
  Train an XGBoost regressor to predict GHI14 and tune it by grid-search cross-validation.

  inputs:
  - working_data: the database on which the estimator will be trained and tested.
  - data_to_use:
     > 'all' = socioeconomic data, genetic data and previous GHI (waves 1 to 13) are used for prediction
     > 'socio' = only socioeconomic data are used
     > 'sociogenetic' = only socioeconomic and genetic data are used
     > 'socioghi' = only socioeconomic data and previous GHI are used
     > 'ghi' = only previous GHI are used
  - simulation:
     > 'short' = only a few hyperparameters will be tested. Does not take more than 10 minutes.
     > 'long' = a lot of hyperparameters will be tested. Can take up to 3h.
  - params_grid: the hyperparameter grid to cross-validate. If specified, simulation is ignored.
  - random_state: the random_state used in the train/test split.
  - verbose: True or False

  returns: a dict with keys "data", "best_parameters", "r2_train" and "r2_test".
  """

  # Creating (X, y)
  basic_columns = ["genetic_VERSION", "genetic_Section_A_or_E", "HHIDPN", "GHI14"]
  genetic_columns = [col for col in working_data.columns if 'genetic_' in col and col != 'genetic_VERSION' and col != 'genetic_Section_A_or_E']
  GHI_columns = [f'GHI{wave}' for wave in range(1, 14)]

  message = {'all' : "Socioeconomic data, genetic data, precedent GHI will be used for prediction.",
             'socio' : "Only socioeconomic data will be used for prediction.",
             'sociogenetic' : "Only socioeconomic data and genetic data will be used for prediction.",
             'socioghi' : "Only socioeconomic data and precedent GHI will be used for prediction.",
             'ghi' : "Only precedent GHI will be used for prediction."}
  # Fail early on an unknown option instead of hitting an undefined X later
  if data_to_use not in message:
    raise ValueError(f"Unknown data_to_use: {data_to_use!r}")

  if data_to_use == 'all':
    columns_to_delete = basic_columns
  elif data_to_use == 'socio':
    columns_to_delete = basic_columns + genetic_columns + GHI_columns
  elif data_to_use == 'sociogenetic':
    columns_to_delete = basic_columns + GHI_columns
  elif data_to_use == 'socioghi':
    columns_to_delete = basic_columns + genetic_columns

  if verbose:
    print(message[data_to_use])

  if data_to_use == 'ghi':
    X = working_data[GHI_columns]
  else:
    X = working_data.drop(columns = columns_to_delete)
  y = working_data["GHI14"]

  # Splitting into Training and Testing sets
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)


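  # Performing cross-validation to train and fine-tune the model
  # Note: `FeatureTypes` is not a parameter XGBoost recognizes; it is ignored with a
  # 'Parameters: { "FeatureTypes" } are not used.' warning. With enable_categorical=True,
  # categorical features are taken from the pandas `category` dtype of the columns of X.
  model = xgb.XGBRegressor(tree_method='gpu_hist', enable_categorical=True, FeatureTypes=FeatureTypes)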

  if params_grid is None:
    if simulation == 'long':
      params_grid = {"eta" : [0.1, 0.05, 0.03, 0.01], # learning rate
                    "lambda" : [1, 0.5, 2], # coefficient for L2 penalization
                    "alpha" : [0, 0.5, 1], # coefficient for L1 penalization
                    "max_depth" : [3, 4, 5], # max depth of trees
                    "n_estimators" : [100, 200] # number of trees
                    }
    elif simulation == 'short':
      params_grid = {"eta" : [0.05, 0.03], # learning rate
                    "lambda" : [1, 0.5], # coefficient for L2 penalization
                    "alpha" : [0.5, 1], # coefficient for L1 penalization
                    "max_depth" : [3, 4], # max depth of trees
                    "n_estimators" : [100] # number of trees
                    }

  grid = GridSearchCV(model, params_grid, refit = True, verbose = verbose, n_jobs=-1, scoring="r2") 
  grid.fit(X_train, y_train)

  results = pd.DataFrame(grid.cv_results_)
  # results.drop(columns = [col for col in results.columns if "split" in col or "time" in col]+["params"]).sort_values(by=["rank_test_score"]).head(5)

  if verbose:
    print("Model refitted with best hyperparameters.")
    print("Best parameters : " + str(grid.best_params_))
    print("R2 score on train : ", str(grid.score(X_train, y_train)))
    print("R2 score on test : ", str(grid.score(X_test, y_test)))
  
  # Storing results
  final_results = {}
  final_results["data"] = data_to_use
  final_results["best_parameters"] = list(grid.best_params_.items())
  final_results["r2_train"] = grid.score(X_train, y_train)
  final_results["r2_test"] = grid.score(X_test, y_test)

  return final_results
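
The cost of the two built-in grids can be estimated before launching anything: the number of candidates is the product of the list lengths, and each candidate is fitted once per fold. The sketch below copies the grids from `train_model` and assumes 5 folds, which matches the "Fitting 5 folds" message printed further down; `count_fits` is a helper name introduced only for this illustration.

```python
import math

grid_short = {"eta": [0.05, 0.03], "lambda": [1, 0.5], "alpha": [0.5, 1],
              "max_depth": [3, 4], "n_estimators": [100]}
grid_long = {"eta": [0.1, 0.05, 0.03, 0.01], "lambda": [1, 0.5, 2], "alpha": [0, 0.5, 1],
             "max_depth": [3, 4, 5], "n_estimators": [100, 200]}

def count_fits(grid, n_folds=5):
    """Return (number of candidates, number of cross-validation fits) for a grid."""
    n_candidates = math.prod(len(values) for values in grid.values())
    return n_candidates, n_candidates * n_folds

print(count_fits(grid_short))  # (16, 80)
print(count_fits(grid_long))   # (216, 1080)
```

The final refit on the whole training set (`refit = True`) adds one more fit per call on top of these counts.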

In [14]:
# SHORT SIMULATION
params_grid_short = {"eta" : [0.05, 0.03], # learning rate
              "lambda" : [1], # coefficient for L2 penalization
              "alpha" : [0.5], # coefficient for L1 penalization
              "max_depth" : [3], # max depth of trees
              "n_estimators" : [100] # number of trees
              }

# LONG SIMULATION
params_grid_long = {"eta" : [0.05, 0.03], # learning rate
               "lambda" : [1, 0.5], # coefficient for L2 penalization
               "alpha" : [0.5, 1], # coefficient for L1 penalization
               "max_depth" : [3, 4], # max depth of trees
               "n_estimators" : [100] # number of trees
              }

params_grid = {"short" : params_grid_short, "long" : params_grid_long}

In [None]:
results = {"Random_state" : [], "Data_used" : [], "best_parameters" : [], "r2_train" : [], "r2_test" : []}

speed = "short"
number_of_simulations = 10
t0 = time.time()
for random_state in range(number_of_simulations):

  t_beginning = time.time()
  print("random_state : ", random_state)

  for data_to_use in ['all', 'socio', 'sociogenetic', 'socioghi', 'ghi']:
    # result = train_model(working_data, data_to_use=data_to_use, random_state=random_state**3, simulation=speed)
    result = train_model(working_data, data_to_use=data_to_use, random_state=random_state**3+1123, params_grid=params_grid[speed])
    results["Random_state"].append(random_state)
    results["Data_used"].append(data_to_use)
    results["best_parameters"].append(result["best_parameters"])
    results["r2_train"].append(result["r2_train"])
    results["r2_test"].append(result["r2_test"])
    
  t_end = time.time()
  print("elapsed_time : ", t_end - t_beginning)
print("Simulation completed.")
print("Overall elapsed_time = ", time.time() - t0)

results = pd.DataFrame(results).sort_values("r2_test", ascending=False)
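
For reference, the loop above can be summarized without running any model. The sketch below lists the seeds actually passed to `train_test_split` and counts the calls to `train_model`; it only restates values from the cell above.

```python
number_of_simulations = 10
feature_sets = ['all', 'socio', 'sociogenetic', 'socioghi', 'ghi']

# Seed schedule used in the loop: random_state**3 + 1123
seeds = [random_state**3 + 1123 for random_state in range(number_of_simulations)]
print(seeds)

# One call to train_model per (seed, feature set) pair
n_runs = number_of_simulations * len(feature_sets)
print(n_runs)  # 50
```

Because every feature set reuses the same seed within an iteration, the five models of one iteration are compared on the same train/test split.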

In [23]:
results.to_csv("XGBoost_simulation.csv", index=False)

In [18]:
results.groupby(["Data_used"])[["r2_train", "r2_test"]].agg(["count", "std", "mean"])

| Data_used | r2_train count | r2_train std | r2_train mean | r2_test count | r2_test std | r2_test mean |
|---|---|---|---|---|---|---|
| all | 10 | 0.026671 | 0.506276 | 10 | 0.020598 | 0.371964 |
| ghi | 10 | 0.019411 | 0.414276 | 10 | 0.020952 | 0.338193 |
| socio | 10 | 0.029931 | 0.370739 | 10 | 0.011842 | 0.218624 |
| sociogenetic | 10 | 0.035356 | 0.366517 | 10 | 0.013695 | 0.216808 |
| socioghi | 10 | 0.019482 | 0.510325 | 10 | 0.017338 | 0.374705 |
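
The aggregation behind this summary can be reproduced on a tiny made-up results frame. The sketch also adds the gap between the mean train and test scores, a simple indicator of overfitting; the numbers below are invented for the illustration and are not the notebook's results.

```python
import pandas as pd

# Made-up results: two runs for each of two feature sets.
toy_results = pd.DataFrame({
    "Data_used": ["all", "all", "socio", "socio"],
    "r2_train": [0.50, 0.52, 0.36, 0.38],
    "r2_test": [0.36, 0.38, 0.21, 0.23],
})

# Select the score columns with a list, then aggregate with a list to fix the column order.
summary = (
    toy_results
    .groupby("Data_used")[["r2_train", "r2_test"]]
    .agg(["count", "std", "mean"])
)

# Gap between mean train and mean test R2 for each feature set.
summary[("gap", "mean")] = summary[("r2_train", "mean")] - summary[("r2_test", "mean")]
print(summary.round(3))
```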


In [19]:
manip.get_sample(data, waves = waves)

|  | HHIDPN | R1MSTAT | R1MPART | R1MRCT | R1MLEN | R1MCURLN | R1MLENM | R1MDIV | R1MWID | R1MNEV | ... | genetic_4_PP_COGENT17 | genetic_4_SBP_COGENT17 | genetic_4_EGFR_CKDGEN19 | genetic_4_EGFRTE_CKDGEN19 | genetic_4_EA3_W23_SSGAC18 | genetic_4_HBA1CAA_MAGIC17 | genetic_4_HBA1CEA_MAGIC17 | genetic_4_GCOG2_CHARGE18 | genetic_VERSION | genetic_Section_A_or_E |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10003030 | 1.0 | 0.0 | 2.0 | 2.0 | 0.2 | 0.0 | 1.0 | 0.0 | 0.0 | ... | -0.17590 | 0.40720 | -1.46539 | -1.27411 | 1.58558 | -1.31440 | 2.86850 | 1.07132 | 4.3 | E |
| 1 | 10004040 | 1.0 | 0.0 | 1.0 | 5.8 | 5.8 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.79770 | 0.31145 | -0.27350 | -0.17942 | 0.39630 | 0.14468 | 0.84597 | -0.08080 | 4.3 | E |
| 2 | 10013040 | 1.0 | 0.0 | 2.0 | 8.0 | 7.2 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 1.95144 | 2.37466 | -0.11260 | -0.07336 | -0.25041 | 1.04307 | 3.21742 | -0.67176 | 4.3 | E |
| 3 | 10038010 | 1.0 | 0.0 | 1.0 | 29.3 | 29.3 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 2.21032 | 1.56604 | -2.78818 | -2.59190 | 0.92959 | 1.82469 | 1.38731 | 0.10438 | 4.3 | E |
| 4 | 10038040 | 1.0 | 0.0 | 1.0 | 29.0 | 29.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.70801 | 1.23272 | -0.37825 | -0.21145 | 0.34560 | 0.60019 | 0.40549 | 0.30028 | 4.3 | E |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3391 | 205496020 | 1.0 | 0.0 | 1.0 | 35.0 | 35.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | -1.12762 | -0.09069 | 0.67810 | 0.68729 | 0.01451 | -0.05045 | 1.44899 | -1.61619 | 4.3 | E |
| 3392 | 207347020 | 1.0 | 0.0 | 2.0 | 34.5 | 34.5 | 0.0 | 1.0 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3393 | 207644020 | 1.0 | 0.0 | 3.0 | 22.4 | 22.4 | 0.0 | 2.0 | 0.0 | 0.0 | ... | -0.29558 | 1.16657 | -0.15472 | -0.34230 | -0.79866 | -2.31821 | -2.36931 | -1.28410 | 4.3 | E |
| 3394 | 208289020 | 1.0 | 0.0 | 1.0 | 34.6 | 34.6 | 0.0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


In [22]:
train_model(manip.get_sample(data, waves = waves), data_to_use="all", simulation="short", params_grid=None, random_state=123, verbose=True)

Socioeconomic data, genetic data, precedent GHI will be used for prediction.
Fitting 5 folds for each of 16 candidates, totalling 80 fits




Parameters: { "FeatureTypes" } are not used.

Model refitted with best hyperparameters.
Best parameters : {'alpha': 1, 'eta': 0.05, 'lambda': 0.5, 'max_depth': 3, 'n_estimators': 100}
R2 score on train :  0.5382506288534002
R2 score on test :  0.3859139857606113


{'data': 'all',
 'best_parameters': [('alpha', 1),
  ('eta', 0.05),
  ('lambda', 0.5),
  ('max_depth', 3),
  ('n_estimators', 100)],
 'r2_train': 0.5382506288534002,
 'r2_test': 0.3859139857606113}