<a name='title'></a>
# Blood glucose prediction model


## Contents

- [Description](#1)
    - [Imports](#1.1)

- [Data analysis](#2)

- [Feature engineering](#3)

- [Model training](#4)

- [Results and conclusion](#5)


<a name='1'></a>
## Description



Objective: forecast blood glucose levels one hour ahead using each participant's data from the preceding six hours.

Predicting blood glucose fluctuations is essential for managing type 1 diabetes. Effective algorithms for this task can ease some of the challenges faced by people living with the condition.
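
The six-hour history reaches the model as one column per 5-minute step, named after how far back the reading lies (`bg-5:55` through `bg-0:00` in the table previews of this notebook). The following is a minimal sketch of that naming scheme; `lag_columns` is a hypothetical helper written for illustration, not part of the competition tooling.

```python
def lag_columns(prefix, hours=6, step_minutes=5):
    """Build column names such as 'bg-5:55' ... 'bg-0:00', oldest first."""
    total_minutes = hours * 60
    names = []
    # Walk from the oldest step (5:55 back) down to the prediction time (0:00)
    for minutes_back in range(total_minutes - step_minutes, -1, -step_minutes):
        h, m = divmod(minutes_back, 60)
        names.append(f"{prefix}-{h}:{m:02d}")
    return names


bg_cols = lag_columns("bg")
print(len(bg_cols), bg_cols[0], bg_cols[-1])
```

The target column, `bg+1:00`, sits one hour after the last of these steps.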

<a name='1.1'></a>
### Imports

In [None]:
%pip install -r requirements.txt

In [2]:
import zipfile
import py7zr
import os
import numpy as np
import pandas as pd
import warnings
import kaggle
from sklearn.metrics import precision_recall_curve, roc_auc_score, classification_report, roc_curve
from catboost import CatBoostRegressor

<a name='2'></a>
## Data analysis

We begin with a brief analysis: we read the data, display the tables, describe the columns, and print a few rows of each table to get an initial picture.

### Instructions for the user:
To download the data from Kaggle you need a Kaggle API key.
1. Go to your Kaggle account (https://www.kaggle.com/account).
2. In the API section, click "Create New API Token". This downloads the file kaggle.json.
3. Place kaggle.json in the ~/.kaggle/ folder (on UNIX systems such as Linux/Mac) or in C:\Users\YOUR_USER\.kaggle\ (on Windows).
4. Make sure the file has the proper permissions (chmod 600 ~/.kaggle/kaggle.json on UNIX).

Manual alternative: go to https://www.kaggle.com/competitions/brist1d/overview, download the files, and unzip them into a /kaggle_data folder in the same location as the notebook. The cells below read from ./blood_glucose_data, so adjust the paths if you use a different folder.
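
Before running the download cell, it can help to confirm that the token is where the steps above put it. This is a small sketch using only the standard library; `check_kaggle_token` is a hypothetical helper and the permission check only applies on POSIX systems.

```python
import os
import stat
from pathlib import Path


def check_kaggle_token(config_dir=None):
    """Return 'ok' or a short description of what is wrong with kaggle.json."""
    config_dir = Path(config_dir) if config_dir is not None else Path.home() / ".kaggle"
    token = config_dir / "kaggle.json"
    if not token.is_file():
        return f"missing: {token}"
    # Permission bits are only meaningful on POSIX systems (Linux/Mac)
    if os.name == "posix":
        mode = stat.S_IMODE(token.stat().st_mode)
        if mode & 0o077:  # readable or writable by group/others
            return f"too permissive ({oct(mode)}): run chmod 600 {token}"
    return "ok"


print(check_kaggle_token())
```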

In [5]:
!kaggle competitions download -c brist1d

Traceback (most recent call last):
  File "c:\Users\chans\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\Users\chans\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\chans\AppData\Local\Programs\Python\Python310\Scripts\kaggle.exe\__main__.py", line 7, in <module>
  File "c:\Users\chans\AppData\Local\Programs\Python\Python310\lib\site-packages\kaggle\cli.py", line 63, in main
    out = args.func(**command_args)
  File "c:\Users\chans\AppData\Local\Programs\Python\Python310\lib\site-packages\kaggle\api\kaggle_api_extended.py", line 1037, in competition_download_cli
    self.competition_download_files(competition, path, force,
  File "c:\Users\chans\AppData\Local\Programs\Python\Python310\lib\site-packages\kaggle\api\kaggle_api_extended.py", line 1000, in competition_download_files
    url = response.retries.history[0]

In [6]:
# Unzip the downloaded archive
with zipfile.ZipFile("brist1d.zip", 'r') as zip_ref:
    zip_ref.extractall("blood_glucose_data")  # Folder where the files will be extracted

# Path of the folder that contains the .7z files
folder_path = './blood_glucose_data'

# Check whether the folder exists
if os.path.exists(folder_path):
    # List all files in the folder
    for file_name in os.listdir(folder_path):
        # Check whether the file has a .7z extension
        if file_name.endswith('.7z'):
            file_path = os.path.join(folder_path, file_name)
            print(f"Extracting {file_path}...")

            # Extract the .7z archive
            with py7zr.SevenZipFile(file_path, mode='r') as z:
                z.extractall(path=folder_path)

            print(f"File {file_name} extracted.")
else:
    print(f"The folder {folder_path} does not exist.")

In [51]:
train = pd.read_csv('./blood_glucose_data/train.csv')
test = pd.read_csv('./blood_glucose_data/test.csv')
sample_submission = pd.read_csv('./blood_glucose_data/sample_submission.csv')


In [24]:
train.head()

Unnamed: 0,id,p_num,time,bg-5:55,bg-5:50,bg-5:45,bg-5:40,bg-5:35,bg-5:30,bg-5:25,...,activity-0:40,activity-0:35,activity-0:30,activity-0:25,activity-0:20,activity-0:15,activity-0:10,activity-0:05,activity-0:00,bg+1:00
0,p01_0,p01,06:10:00,,,9.6,,,9.7,,...,,,,,,,,,,13.4
1,p01_1,p01,06:25:00,,,9.7,,,9.2,,...,,,,,,,,,,12.8
2,p01_2,p01,06:40:00,,,9.2,,,8.7,,...,,,,,,,,,,15.5
3,p01_3,p01,06:55:00,,,8.7,,,8.4,,...,,,,,,,,,,14.8
4,p01_4,p01,07:10:00,,,8.4,,,8.1,,...,,,,,,,,,,12.7


In [25]:
test.head()

Unnamed: 0,id,p_num,time,bg-5:55,bg-5:50,bg-5:45,bg-5:40,bg-5:35,bg-5:30,bg-5:25,...,activity-0:45,activity-0:40,activity-0:35,activity-0:30,activity-0:25,activity-0:20,activity-0:15,activity-0:10,activity-0:05,activity-0:00
0,p01_8459,p01,06:45:00,,9.2,,,10.2,,,...,,,,,,,,,,
1,p01_8460,p01,11:25:00,,,9.9,,,9.4,,...,,,,,,,,Walk,Walk,Walk
2,p01_8461,p01,14:45:00,,5.5,,,5.5,,,...,,,,,,,,,,
3,p01_8462,p01,04:30:00,,3.4,,,3.9,,,...,,,,,,,,,,
4,p01_8463,p01,04:20:00,,,8.3,,,10.0,,...,,,,,,,,,,
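
The previews show that many cells are empty, so a useful first summary is the share of missing values per feature group. The sketch below runs on a tiny made-up frame, not the competition data; `missing_by_group` is a hypothetical helper that groups columns by the text before the first `-`.

```python
import numpy as np
import pandas as pd


def missing_by_group(df):
    """Fraction of missing cells per column prefix ('bg', 'activity', ...)."""
    groups = {}
    for col in df.columns:
        # Columns without '-' (e.g. 'p_num', 'bg+1:00') form their own group
        prefix = col.split("-")[0] if "-" in col else col
        groups.setdefault(prefix, []).append(col)
    return pd.Series({p: df[cols].isna().mean().mean() for p, cols in groups.items()})


# Toy data for illustration only
toy = pd.DataFrame({
    "bg-0:05": [9.6, np.nan],
    "bg-0:00": [np.nan, np.nan],
    "activity-0:00": [np.nan, "Walk"],
    "bg+1:00": [13.4, 12.8],
})
summary = missing_by_group(toy)
print(summary)
```

On the real data the same call would be `missing_by_group(train)`.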


<a name='3'></a>
## Feature engineering

In [52]:
# Drop the 'id' column from the datasets
train = train.drop('id', axis=1)
test_id = test['id']  # Save the test id column for the final submission
test = test.drop('id', axis=1)

# Split time-of-day columns (strings such as '06:10:00') into hour and minute features
def convert_time_columns(df):
    # Iterate over a copy of the column list because the loop adds and drops columns
    for col in list(df.columns):
        if pd.api.types.is_datetime64_any_dtype(df[col]) or 'time' in col.lower():
            # An explicit format avoids pandas' per-element format inference and its warning
            df[col] = pd.to_datetime(df[col], format='%H:%M:%S', errors='coerce')
            df[col + '_hour'] = df[col].dt.hour
            df[col + '_minute'] = df[col].dt.minute
            df.drop(col, axis=1, inplace=True)  # Drop the original time column after conversion

# Apply the time conversion to both train and test datasets
convert_time_columns(train)
convert_time_columns(test)
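
To see what the time conversion produces, here is a short standalone example on made-up values. It assumes the `HH:MM:SS` layout visible in the `time` column; the third value is deliberately invalid to show how `errors='coerce'` behaves.

```python
import pandas as pd

# Made-up sample values; the last one cannot be parsed
times = pd.Series(["06:10:00", "23:55:00", "not a time"])

# Unparseable entries become NaT instead of raising an error
parsed = pd.to_datetime(times, format="%H:%M:%S", errors="coerce")

features = pd.DataFrame({
    "time_hour": parsed.dt.hour,
    "time_minute": parsed.dt.minute,
})
print(features)
```

Rows that fail to parse end up with missing hour and minute values, so they are handled by whatever missing-value strategy is applied later.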



In [53]:
# Handle 'p_num' in the test set: find unique values and replace them with numbers
unique_pnums = test['p_num'].unique()
pnum_mapping = {value: idx for idx, value in enumerate(unique_pnums)}
test['p_num'] = test['p_num'].map(pnum_mapping)

# Also ensure the same mapping is applied to 'p_num' in the train set, if it exists
if 'p_num' in train.columns:
    train['p_num'] = train['p_num'].map(pnum_mapping)

# Fill NaN values with a placeholder before converting to integer
train['p_num'] = train['p_num'].fillna(-1).astype(int)
test['p_num'] = test['p_num'].fillna(-1).astype(int)
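
Because the mapping is built from the test set only, any participant that appears just in the training data is encoded as -1. The toy example below illustrates this, and also shows one possible alternative (an assumption, not what the notebook does): building the mapping from the union of both sets. The participant labels are made up.

```python
import pandas as pd

# Made-up participant labels for illustration
train_p = pd.Series(["p01", "p03", "p02"])
test_p = pd.Series(["p01", "p02", "p15"])

# Mapping built from the test set only, as in the cell above
test_mapping = {value: idx for idx, value in enumerate(test_p.unique())}
encoded_train = train_p.map(test_mapping).fillna(-1).astype(int)
print(encoded_train.tolist())

# Alternative: one mapping over every participant seen in either set
all_ids = sorted(set(train_p) | set(test_p))
union_mapping = {value: idx for idx, value in enumerate(all_ids)}
print(train_p.map(union_mapping).tolist())
```

With the test-only mapping, every train-only participant shares the single code -1; with the union mapping each participant keeps its own code.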

In [54]:
# Define a mapping for activity intensity based on the provided activities
activity_intensity_mapping = {
    'Indoor climbing': 576,
    'Run': 600,
    'Strength training': 360,
    'Swim': 720,
    'Bike': 480,
    'Dancing': 360,
    'Stairclimber': 600,
    'Spinning': 600,
    'Walking': 240,
    'HIIT': 900,
    'Outdoor Bike': 480,
    'Walk': 240,
    'Aerobic Workout': 480,
    'Tennis': 480,
    'Workout': 360,
    'Hike': 480,
    'Zumba': 360,
    'Sport': 480,
    'Yoga': 180,
    'Swimming': 720,
    'Weights': 360,
    'Running': 600
}

# Replace the text values in the columns that have the word "activity" in their name by these numbers
activity_columns = [col for col in train.columns if 'activity' in col]

for col in activity_columns:
    train[col] = train[col].map(activity_intensity_mapping)
    test[col] = test[col].map(activity_intensity_mapping)
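
`Series.map` with a dictionary turns every label that is missing from the dictionary into NaN, so a label that the mapping does not cover is dropped without any error. A quick check like the following sketch can reveal such labels; the data and the label `'Rowing'` are made up for illustration.

```python
import pandas as pd

# Shortened mapping and made-up column for illustration
mapping = {"Walk": 240, "Run": 600}
col = pd.Series(["Walk", None, "Rowing", "Run"])

mapped = col.map(mapping)
# Labels present in the data but absent from the mapping
unmapped = sorted(set(col.dropna()) - set(mapping))

print(mapped.tolist())
print(unmapped)
```

On the real data, the same set difference can be computed per activity column before applying `activity_intensity_mapping`.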

In [41]:
# Fill missing values with 0 in both train and test
# Alternative worth trying: fill with the column means (train.mean()) in both train and test
train.fillna(0, inplace=True)
test.fillna(0, inplace=True)

In [42]:
# Get all columns that start with 'bg-'
bg_columns = [col for col in train.columns if col.startswith('bg-')]

# Iterate over the columns in steps of 3
for i in range(0, len(bg_columns), 3):
    if i + 2 < len(bg_columns):
        # Sum the values of the three consecutive columns
        train[bg_columns[i + 2]] = train[bg_columns[i]] + train[bg_columns[i + 1]] + train[bg_columns[i + 2]]
        test[bg_columns[i + 2]] = test[bg_columns[i]] + test[bg_columns[i + 1]] + test[bg_columns[i + 2]]

# Drop the columns that were summed
columns_to_drop = [bg_columns[i] for i in range(len(bg_columns)) if i % 3 != 2]
train.drop(columns=columns_to_drop, inplace=True)
test.drop(columns=columns_to_drop, inplace=True)

# Define the prefixes to be processed
prefixes = ['insulin-', 'carbs-', 'hr-', 'steps-', 'cals-', 'activity-']

for prefix in prefixes:
    # Get all columns that start with the current prefix
    prefix_columns = [col for col in train.columns if col.startswith(prefix)]

    # Only the oldest two thirds of the window are aggregated;
    # the most recent third keeps its 5-minute resolution
    n_aggregated = int(np.floor(len(prefix_columns) * 2 / 3))

    # Iterate over the columns in steps of 3
    for i in range(0, n_aggregated, 3):
        if i + 2 < len(prefix_columns):
            # Sum the values of the three consecutive columns, treating NaNs as 0
            train[prefix_columns[i + 2]] = train[prefix_columns[i]].fillna(0) + train[prefix_columns[i + 1]].fillna(0) + train[prefix_columns[i + 2]].fillna(0)
            test[prefix_columns[i + 2]] = test[prefix_columns[i]].fillna(0) + test[prefix_columns[i + 1]].fillna(0) + test[prefix_columns[i + 2]].fillna(0)

    # Drop the columns that were summed
    columns_to_drop = [prefix_columns[i] for i in range(n_aggregated) if i % 3 != 2]
    train.drop(columns=columns_to_drop, inplace=True)
    test.drop(columns=columns_to_drop, inplace=True)
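
Summing three 5-minute columns into one 15-minute value works well for quantities that accumulate, such as steps or carbs. For a level such as glucose, a row with a reading in every slot ends up roughly three times larger than a row with a single reading. The toy example below compares the sum with a mean over the available readings; the mean is shown only as a possible alternative, not as the method used in this notebook, and the values are made up.

```python
import numpy as np
import pandas as pd

# Row 0: one reading per 15 minutes; row 1: one reading every 5 minutes
toy = pd.DataFrame({
    "bg-0:10": [np.nan, 6.0],
    "bg-0:05": [np.nan, 6.3],
    "bg-0:00": [9.6, 6.6],
})

summed = toy.fillna(0).sum(axis=1)   # what the aggregation above computes
averaged = toy.mean(axis=1)          # mean over non-missing readings

print(pd.DataFrame({"sum": summed, "mean": averaged}))
```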


In [55]:
train.head()

Unnamed: 0,p_num,bg-5:55,bg-5:50,bg-5:45,bg-5:40,bg-5:35,bg-5:30,bg-5:25,bg-5:20,bg-5:15,...,activity-0:30,activity-0:25,activity-0:20,activity-0:15,activity-0:10,activity-0:05,activity-0:00,bg+1:00,time_hour,time_minute
0,0,,,9.6,,,9.7,,,9.2,...,,,,,,,,13.4,6,10
1,0,,,9.7,,,9.2,,,8.7,...,,,,,,,,12.8,6,25
2,0,,,9.2,,,8.7,,,8.4,...,,,,,,,,15.5,6,40
3,0,,,8.7,,,8.4,,,8.1,...,,,,,,,,14.8,6,55
4,0,,,8.4,,,8.1,,,8.3,...,,,,,,,,12.7,7,10


<a name='4'></a>
## Model training

We now begin the final part of the work: training our model.

### Version 1:


In [56]:
# Split training data into features (X) and target (y)
X_train = train.drop('bg+1:00', axis=1)  # 'bg+1:00' is the target: glucose one hour ahead
X_test = test
y_train = train['bg+1:00']

In [57]:
model = CatBoostRegressor(iterations=1000, learning_rate=0.1, depth=6, verbose=100)
model.fit(X_train, y_train)

0:	learn: 2.8636171	total: 70.7ms	remaining: 1m 10s
100:	learn: 1.9490180	total: 6.56s	remaining: 58.4s
200:	learn: 1.8772306	total: 12.6s	remaining: 50.2s
300:	learn: 1.8283394	total: 18.9s	remaining: 43.8s
400:	learn: 1.7916101	total: 25.1s	remaining: 37.4s
500:	learn: 1.7582353	total: 31.3s	remaining: 31.2s
600:	learn: 1.7295277	total: 37.3s	remaining: 24.7s
700:	learn: 1.7040631	total: 43.3s	remaining: 18.5s
800:	learn: 1.6805356	total: 49.4s	remaining: 12.3s
900:	learn: 1.6580273	total: 55.3s	remaining: 6.08s
999:	learn: 1.6373102	total: 1m 1s	remaining: 0us


<catboost.core.CatBoostRegressor at 0x279525b87f0>

In [58]:
predictions = model.predict(X_test)

In [59]:
submission = pd.DataFrame({'id': test_id, 'bg+1:00': predictions})
submission.to_csv('submission.csv', index=False)
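
Version 1 trains on every row, so the log above reports only the training loss. One way to estimate how the model behaves on unseen participants is to hold out whole participants instead of random rows. The following is a sketch of that idea under stated assumptions: `holdout_by_group` is a hypothetical helper, and the 25% hold-out fraction and the seed are arbitrary choices.

```python
import numpy as np


def holdout_by_group(groups, holdout_fraction=0.25, seed=42):
    """Return boolean (train_mask, val_mask) that keep each group on one side."""
    groups = np.asarray(groups)
    unique = np.unique(groups)
    rng = np.random.default_rng(seed)
    n_holdout = max(1, int(round(len(unique) * holdout_fraction)))
    held_out = rng.choice(unique, size=n_holdout, replace=False)
    val_mask = np.isin(groups, held_out)
    return ~val_mask, val_mask


# Toy participant codes: four participants with two rows each
p_num = np.array([0, 0, 1, 1, 2, 2, 3, 3])
train_mask, val_mask = holdout_by_group(p_num)
print(p_num[train_mask], p_num[val_mask])
```

With the notebook's data, the masks could be applied as `X_train[train_mask]` and `X_train[val_mask]`, using `X_train['p_num']` as the groups.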

### Version 2:

In [44]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import (Ridge, SGDRegressor, BayesianRidge, ElasticNet, Lasso, PassiveAggressiveRegressor)
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor, BaggingRegressor, VotingRegressor)
from sklearn.svm import SVR
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.dummy import DummyRegressor
# HistGradientBoostingRegressor is stable since scikit-learn 1.0; the experimental import is no longer needed
from sklearn.ensemble import HistGradientBoostingRegressor



In [45]:
# Split training data into features (X) and target (y)
X_train = train.drop('bg+1:00', axis=1)  # 'bg+1:00' is the target: glucose one hour ahead
X_test = test
y_train = train['bg+1:00']

X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.01, random_state=42)

# Define regressors
models = {
     'CatBoost': CatBoostRegressor(iterations=1000, learning_rate=0.1, depth=6, verbose=100),
#     'XGBoost': XGBRegressor(n_estimators=1000, learning_rate=0.1),
#     'GradientBoosting': GradientBoostingRegressor(n_estimators=500),
     'HistGradientBoosting': HistGradientBoostingRegressor(),
#     'BayesianRidge': BayesianRidge(),
#     'SGDRegressor': SGDRegressor(max_iter=1000, tol=1e-3),
#     'PassiveAggressiveRegressor': PassiveAggressiveRegressor(max_iter=1000, tol=1e-3),
#     'LightGBM': LGBMRegressor(n_estimators=1000, learning_rate=0.1),  # TODO: resolve error
#     'RandomForest': RandomForestRegressor(n_estimators=500),
#     'AdaBoost': AdaBoostRegressor(n_estimators=500),
#     'Bagging': BaggingRegressor(n_estimators=500),
#     'SVR': SVR(),
#     'KNeighbors': KNeighborsRegressor(),
#     'MLPRegressor': MLPRegressor(max_iter=1000),
#     'Ridge': Ridge(),
#     'Lasso': Lasso(),
#     'ElasticNet': ElasticNet(),
#     'DecisionTree': DecisionTreeRegressor(),
#     'DummyRegressor': DummyRegressor(strategy='mean')  # Baseline model
}

# Train each model and evaluate RMSE
rmse_scores = {}
for name, model in models.items():
    print(f'Training Start {name}....')
    model.fit(X_train_split, y_train_split)
    y_pred = model.predict(X_val)
    # Take the square root of the MSE explicitly; the squared=False argument
    # is deprecated in recent scikit-learn releases
    rmse = mean_squared_error(y_val, y_pred) ** 0.5
    rmse_scores[name] = rmse
    print(f'{name} RMSE: {rmse:.4f}')

    # The test set has no target column; errors='ignore' makes the drop a no-op
    X_test = test.drop('bg+1:00', axis=1, errors='ignore')
    test_predictions = model.predict(X_test)

    # Save predictions to CSV
    submission = pd.DataFrame({'id': test_id, 'bg+1:00': test_predictions})
    submission.to_csv(f'{name}_submission.csv', index=False)

    print("Model training, evaluation, and test predictions completed.")

Training Start CatBoost....
0:	learn: 2.8956077	total: 124ms	remaining: 2m 3s
100:	learn: 2.0443697	total: 3.99s	remaining: 35.5s
200:	learn: 1.9530426	total: 7.76s	remaining: 30.9s
300:	learn: 1.8960458	total: 11.5s	remaining: 26.7s
400:	learn: 1.8526457	total: 15.4s	remaining: 23s
500:	learn: 1.8139396	total: 19s	remaining: 18.9s
600:	learn: 1.7769658	total: 22.8s	remaining: 15.1s
700:	learn: 1.7457416	total: 26.5s	remaining: 11.3s
800:	learn: 1.7174111	total: 30.2s	remaining: 7.5s
900:	learn: 1.6912117	total: 33.9s	remaining: 3.73s
999:	learn: 1.6665593	total: 38s	remaining: 0us
CatBoost RMSE: 1.8358
Model training, evaluation, and test predictions completed.
Training Start HistGradientBoosting....




HistGradientBoosting RMSE: 1.9572
Model training, evaluation, and test predictions completed.
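
The RMSE values above are root mean squared errors, expressed in the same units as the glucose readings. As a worked illustration of the metric, here is a plain NumPy version applied to made-up predictions; the `rmse` function is written for this example only.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up predictions with errors of +1, -1 and 0
score = rmse([13.4, 12.8, 15.5], [12.4, 13.8, 15.5])
print(f"{score:.4f}")
```

The squared errors here are 1, 1 and 0, so the result is the square root of 2/3, about 0.8165.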




<a name='5'></a>
## Results and conclusion