## Data review and group work primer

In this project I will review the data and code from the article "XGBoost for the Diabetes Dataset" (https://xgboosting.com/xgboost-for-the-diabetes-dataset/). The article introduces the data as follows: "The diabetes dataset is a well-known dataset for regression tasks, containing various physiological measurements and a target variable representing a quantitative measure of disease progression one year after baseline."

The code is described as doing the following: "In this example, we’ll load the diabetes dataset from scikit-learn, perform hyperparameter tuning using GridSearchCV with common XGBoost regression parameters, save the best model, load it, and use it to make a prediction on a sample data point." I have provided the code below.
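To get a rough sense of what the tuning step costs, the sketch below counts the candidates in a grid with the same values as the one in the article's code, using scikit-learn's `ParameterGrid`. The variable names `n_candidates`, `n_folds`, and `n_fits` are my own and do not appear in the article.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the parameter grid in the article's code
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}

# Every combination of the listed values: 3 * 3 * 3 * 2 * 2
n_candidates = len(ParameterGrid(param_grid))

# The article's GridSearchCV call uses cv=3, so each candidate is fitted 3 times
n_folds = 3
n_fits = n_candidates * n_folds

print(f"Candidates: {n_candidates}")        # Candidates: 108
print(f"Cross-validation fits: {n_fits}")   # Cross-validation fits: 324
```

With the default `refit=True`, GridSearchCV also fits the best candidate once more on the full training set, so the search trains 325 models in total.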

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBRegressor
import numpy as np

# Load the diabetes dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Print key information about the dataset
print(f"Dataset shape: {X.shape}")
print(f"Features: {diabetes.feature_names}")
print(f"Target: {diabetes.DESCR.splitlines()[1]}")

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Create XGBRegressor
model = XGBRegressor(objective='reg:squarederror', random_state=42, n_jobs=1)

# Perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Print best score and parameters
print(f"Best score: {grid_search.best_score_:.3f}")
print(f"Best parameters: {grid_search.best_params_}")

# Access best model
best_model = grid_search.best_estimator_

# Save best model
best_model.save_model('best_model_diabetes.ubj')

# Load saved model
loaded_model = XGBRegressor()
loaded_model.load_model('best_model_diabetes.ubj')

# Create a sample data point
sample_data_point = np.array([[0.03807591, 0.05068012, 0.06169621, 0.02187235, -0.0442235,
                               -0.03482076, -0.04340085, -0.00259226, 0.01990842, -0.01764613]])

# Use loaded model for prediction
prediction = loaded_model.predict(sample_data_point)
print(f"Predicted value: {prediction[0]:.3f}")

Dataset shape: (442, 10)
Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
Target: 
Best score: 0.444
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.05, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.8}
Predicted value: 223.099


Now, for my own exploration, I'd like to use some of the tools pandas offers. To do this, I will first import pandas and then convert the feature array to a pandas DataFrame.



In [None]:
import pandas as pd
from sklearn.datasets import load_diabetes
import numpy as np

# Load the dataset (a Bunch object holding the arrays and their metadata)
diabetes = load_diabetes()

# Convert the feature array to a pandas DataFrame,
# using the dataset's feature names as column labels
diabetes_df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)

# Show the first five rows
diabetes_df.head()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641


Looking at the DataFrame above, every column is numeric, which is good: it seems we won't need to remove or encode anything. The columns (age, sex, bmi, and so on) are the features, that is, the predictors of disease progression in the regression task. The diabetes dataset also has a target (the dependent variable), which measures each patient's actual disease progression one year after baseline. This is worth having in the DataFrame too, so in the next cell we add it as another column.

In [None]:
diabetes_df['target'] = diabetes.target
diabetes_df.head()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204,75.0
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641,135.0
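
As an alternative to building the DataFrame by hand, scikit-learn can return it directly. The sketch below uses the `as_frame=True` option of `load_diabetes`, whose `.frame` attribute holds the features and the target together; the variable name `diabetes_frame` is my own.

```python
from sklearn.datasets import load_diabetes

# With as_frame=True, the returned Bunch exposes a `frame` attribute:
# a single DataFrame with the ten feature columns plus a 'target' column
diabetes_frame = load_diabetes(as_frame=True).frame

print(diabetes_frame.shape)             # (442, 11)
print(diabetes_frame.columns.tolist())
```

This should give the same table as the two manual steps above, with `target` as the last column.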


Next, we use the describe() method to get summary statistics for every column.

In [None]:
diabetes_df.describe()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
count,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0
mean,-2.511817e-19,1.23079e-17,-2.245564e-16,-4.79757e-17,-1.381499e-17,3.918434e-17,-5.777179e-18,-9.04254e-18,9.293722e-17,1.130318e-17,152.133484
std,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,77.093005
min,-0.1072256,-0.04464164,-0.0902753,-0.1123988,-0.1267807,-0.1156131,-0.1023071,-0.0763945,-0.1260971,-0.1377672,25.0
25%,-0.03729927,-0.04464164,-0.03422907,-0.03665608,-0.03424784,-0.0303584,-0.03511716,-0.03949338,-0.03324559,-0.03317903,87.0
50%,0.00538306,-0.04464164,-0.007283766,-0.005670422,-0.004320866,-0.003819065,-0.006584468,-0.002592262,-0.001947171,-0.001077698,140.5
75%,0.03807591,0.05068012,0.03124802,0.03564379,0.02835801,0.02984439,0.0293115,0.03430886,0.03243232,0.02791705,211.5
max,0.1107267,0.05068012,0.1705552,0.1320436,0.1539137,0.198788,0.1811791,0.1852344,0.1335973,0.1356118,346.0
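
In the summary above, every feature has a mean of essentially zero and the same standard deviation, 0.04761905. This matches the scikit-learn dataset description, which says each feature has been mean-centered and scaled so that the sum of squares of each column is 1. With 442 rows and pandas' default sample standard deviation (ddof=1), that works out to 1/sqrt(441) = 1/21 ≈ 0.047619. A small sketch to check this, with variable names of my own choosing:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
features = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Sum of squared values per column; expected to be 1 for every feature
sum_of_squares = (features ** 2).sum()

# Sample standard deviation implied by a unit sum of squares and zero mean
expected_std = 1 / np.sqrt(len(features) - 1)

means_are_zero = np.allclose(features.mean(), 0)
unit_sum_of_squares = np.allclose(sum_of_squares, 1)
std_matches = np.allclose(features.std(), expected_std)

print(f"Means are ~0: {means_are_zero}")
print(f"Sum of squares is ~1: {unit_sum_of_squares}")
print(f"Std matches 1/sqrt(n - 1) = {expected_std:.8f}: {std_matches}")
```

So the values in columns such as age and sex are not raw measurements, which is why sex takes only two values (-0.044642 and 0.05068) rather than 0 and 1.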


In [None]:
diabetes_df.dtypes

Unnamed: 0,0
age,float64
sex,float64
bmi,float64
bp,float64
s1,float64
s2,float64
s3,float64
s4,float64
s5,float64
s6,float64


This is what we expected, given that every value in the first few rows of the DataFrame was a float.
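
Since head() only shows five rows, it can be worth confirming the same thing for the whole DataFrame. The sketch below rebuilds the DataFrame so that it runs on its own, then checks that every column has a numeric dtype and that no values are missing. The names `all_numeric` and `n_missing` are my own.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.datasets import load_diabetes

# Rebuild the DataFrame from the earlier cells
diabetes = load_diabetes()
diabetes_df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
diabetes_df['target'] = diabetes.target

# True only if every column, including the target, has a numeric dtype
all_numeric = all(is_numeric_dtype(dtype) for dtype in diabetes_df.dtypes)

# Total number of missing values across all rows and columns
n_missing = int(diabetes_df.isna().sum().sum())

print(f"All columns numeric: {all_numeric}")
print(f"Missing values: {n_missing}")
```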