# Logistic regression submission

To run on Kaggle, this notebook needs two inputs:
1. My [Diabetes prediction challenge: logistic regression](https://www.kaggle.com/datasets/gperdrizet/diabetes-prediction-challenge-logistic-regression) dataset
2. The [Diabetes prediction challenge](https://www.kaggle.com/competitions/playground-series-s5e12) dataset

The `Diabetes prediction challenge: logistic regression` dataset contains two files: the scikit-learn model pipeline, serialized with joblib, and a Python module defining the custom transformers used in the pipeline. Once both inputs are attached, set the `KAGGLE` flag below to `True` and run the notebook.

To run from a clone of the [diabetes-prediction](https://github.com/gperdrizet/diabetes-prediction) GitHub repo in any other environment, set `KAGGLE` to `False`.

## Notebook setup

### Imports

In [8]:
# Standard library imports
import sys
import urllib.request
from pathlib import Path

# Third party imports
import joblib
import pandas as pd

### Install dependencies

In [None]:
# Install the matching scikit-learn version to avoid unpickling warnings
# The model was trained with scikit-learn 1.7.2
import subprocess
import sys

# Single source of truth for the version used in both branches below
required_version = '1.7.2'

try:
    import sklearn
    current_version = sklearn.__version__

    if current_version != required_version:
        print(f'Current scikit-learn version: {current_version}')
        print(f'Required version: {required_version}')
        print('Installing matching version...')
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', f'scikit-learn=={required_version}'])
        print('Installation complete. Please restart the kernel.')

    else:
        print(f'scikit-learn {current_version} already installed')

except ImportError:
    print(f'scikit-learn not found, installing {required_version}...')
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', f'scikit-learn=={required_version}'])
    print('Installation complete. Please restart the kernel.')

scikit-learn 1.7.2 already installed


### Run configuration

In [2]:
# Flag to control file paths for Kaggle vs other environments
KAGGLE = False
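
As an alternative to setting the flag by hand, the environment could be detected automatically. The sketch below is not part of the original notebook: `running_on_kaggle` is a hypothetical helper, and it assumes that Kaggle kernels set the `KAGGLE_KERNEL_RUN_TYPE` environment variable, which is worth confirming in a live session before relying on it.

```python
import os


def running_on_kaggle(environ=None):
    """Return True when a Kaggle-specific environment variable is present."""
    environ = os.environ if environ is None else environ

    # Assumption: Kaggle kernels define KAGGLE_KERNEL_RUN_TYPE
    return 'KAGGLE_KERNEL_RUN_TYPE' in environ


# Drop-in replacement for the hand-set flag
KAGGLE = running_on_kaggle()
print(f'Running on Kaggle: {KAGGLE}')
```

The manual flag is still the more predictable choice if the variable name ever changes.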

### Add custom transformers to path

In [None]:
# Add path to custom transformers module
if KAGGLE:
    # On Kaggle, the transformers file should be uploaded as part of the dataset
    transformers_path = Path('/kaggle/input/diabetes-model')

else:
    # For local/GitHub, use the models directory
    transformers_path = Path('../models').resolve()

sys.path.insert(0, str(transformers_path))

# Import custom transformers (needed for model deserialization)
from logistic_regression_transformers import IDColumnDropper, IQRClipper, ConstantFeatureRemover
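
The import above matters because pickle-based serialization stores a reference to each class (module name plus class name), not the class definition itself. The sketch below illustrates this with the standard library `pickle` module, which joblib builds on. `demo_transformers` and `DemoClipper` are hypothetical stand-ins, not the notebook's real transformers.

```python
import pickle
import sys
import types

# Build a throwaway module holding a stand-in class (hypothetical names)
module = types.ModuleType('demo_transformers')


class DemoClipper:
    def __init__(self, factor=1.5):
        self.factor = factor


# Make the class look as if it were defined in demo_transformers
DemoClipper.__module__ = 'demo_transformers'
DemoClipper.__qualname__ = 'DemoClipper'
module.DemoClipper = DemoClipper
sys.modules['demo_transformers'] = module

payload = pickle.dumps(DemoClipper(factor=2.0))

# The payload names the module; it does not embed the class source
module_named_in_payload = b'demo_transformers' in payload

# With the module unavailable, loading fails
del sys.modules['demo_transformers']
try:
    pickle.loads(payload)
    load_failed = False
except ModuleNotFoundError:
    load_failed = True

# Once the module is importable again, loading succeeds
sys.modules['demo_transformers'] = module
restored = pickle.loads(payload)
print(load_failed, restored.factor)
```

This is why the directory holding `logistic_regression_transformers` has to be on `sys.path` before `joblib.load` is called.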

## 1. Asset loading

In [None]:
# Set file paths based on environment
if KAGGLE:
    # Kaggle paths - data is in /kaggle/input/
    test_df_path = '/kaggle/input/playground-series-s5e12/test.csv'
    model_path = '/kaggle/input/diabetes-model/logistic_regression.joblib'

else:
    # Otherwise, load from GitHub
    test_df_path = 'https://gperdrizet.github.io/FSA_devops/assets/data/unit3/diabetes_prediction_test.csv'
    model_url = 'https://github.com/gperdrizet/diabetes-prediction/raw/refs/heads/main/models/logistic_regression.joblib'
    
    # Download model to temporary location
    model_path = Path('logistic_regression.joblib')
    urllib.request.urlretrieve(model_url, model_path)

# Load the testing dataset
test_df = pd.read_csv(test_df_path)

# Load the model
model = joblib.load(model_path)

# Display first few rows of testing data
test_df.head()

| | id | age | alcohol_consumption_per_week | physical_activity_minutes_per_week | diet_score | sleep_hours_per_day | screen_time_hours_per_day | bmi | waist_to_hip_ratio | systolic_bp | ... | triglycerides | gender | ethnicity | education_level | income_level | smoking_status | employment_status | family_history_diabetes | hypertension_history | cardiovascular_history |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 700000 | 45 | 4 | 100 | 4.3 | 6.8 | 6.2 | 25.5 | 0.84 | 123 | ... | 111 | Female | White | Highschool | Middle | Former | Employed | 0 | 0 | 0 |
| 1 | 700001 | 35 | 1 | 87 | 3.5 | 4.6 | 9.0 | 28.6 | 0.88 | 120 | ... | 145 | Female | White | Highschool | Middle | Never | Unemployed | 0 | 0 | 0 |
| 2 | 700002 | 45 | 1 | 61 | 7.6 | 6.8 | 7.0 | 28.5 | 0.94 | 112 | ... | 184 | Male | White | Highschool | Low | Never | Employed | 0 | 0 | 0 |
| 3 | 700003 | 55 | 2 | 81 | 7.3 | 7.3 | 5.0 | 26.9 | 0.91 | 114 | ... | 128 | Male | White | Graduate | Middle | Former | Employed | 0 | 0 | 0 |
| 4 | 700004 | 77 | 2 | 29 | 7.3 | 7.6 | 8.5 | 22.0 | 0.83 | 131 | ... | 133 | Male | White | Graduate | Low | Current | Unemployed | 0 | 0 | 0 |


## 2. Inference

In [5]:
predictions_df = pd.DataFrame({
    'id': test_df['id'].astype(int),
    'diagnosed_diabetes': model.predict(test_df).astype(int)
})

predictions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 2 columns):
 #   Column              Non-Null Count   Dtype
---  ------              --------------   -----
 0   id                  300000 non-null  int64
 1   diagnosed_diabetes  300000 non-null  int64
dtypes: int64(2)
memory usage: 4.6 MB


## 3. Save submission file

In [6]:
# Set submission file path based on environment
if KAGGLE:
    submission_path = Path('submission.csv')

else:
    # Create data directory if it doesn't exist
    data_dir = Path('../data')
    data_dir.mkdir(parents=True, exist_ok=True)
    submission_path = data_dir / 'logistic_regression_submission.csv'

# Save submission file
predictions_df.to_csv(submission_path, index=False)
print(f'Submission saved to: {submission_path}')

Submission saved to: ../data/logistic_regression_submission.csv
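
Before uploading, the saved file can be sanity-checked. The sketch below is an optional addition, not part of the original notebook: `check_submission` is a hypothetical helper that uses only the standard library, and it assumes the two columns written above (`id`, `diagnosed_diabetes`) with labels restricted to 0 or 1.

```python
import csv


def check_submission(path, expected_rows=None):
    """Return a list of problems found in a two-column submission CSV."""
    problems = []
    seen_ids = set()
    row_count = 0

    with open(path, newline='') as handle:
        reader = csv.reader(handle)
        header = next(reader, None)

        # Column names taken from the predictions dataframe above
        if header != ['id', 'diagnosed_diabetes']:
            problems.append(f'unexpected header: {header}')

        for row in reader:
            row_count += 1

            if len(row) != 2:
                problems.append(f'row {row_count}: expected 2 fields, got {len(row)}')
                continue

            row_id, label = row

            if row_id in seen_ids:
                problems.append(f'row {row_count}: duplicate id {row_id}')
            seen_ids.add(row_id)

            # Assumption: predictions are hard class labels, 0 or 1
            if label not in ('0', '1'):
                problems.append(f'row {row_count}: label {label!r} is not 0 or 1')

    if expected_rows is not None and row_count != expected_rows:
        problems.append(f'expected {expected_rows} rows, found {row_count}')

    return problems
```

In this notebook it could be called as `check_submission(submission_path, expected_rows=len(test_df))`; an empty list means no problems were found.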


## 4. Clean up

In [7]:
# Clean up downloaded model file if not on Kaggle
if not KAGGLE and model_path.exists():
    model_path.unlink()
    print(f'Cleaned up temporary model file: {model_path}')

Cleaned up temporary model file: logistic_regression.joblib
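
The manual cleanup step could be avoided by downloading into a temporary directory that is removed automatically. The sketch below is one possible variation, not the notebook's method: `fetch_to_temp` is a hypothetical helper, and a local `file://` URL stands in for the GitHub model URL so the example runs offline.

```python
import tempfile
import urllib.request
from pathlib import Path


def fetch_to_temp(url, filename, tmp_dir):
    """Download url into tmp_dir and return the local path."""
    target = Path(tmp_dir) / filename
    urllib.request.urlretrieve(url, target)
    return target


with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for the remote model file (assumption for this demo only)
    source = Path(tmp_dir) / 'source.bin'
    source.write_bytes(b'stand-in for a model file')

    local_copy = fetch_to_temp(source.as_uri(), 'copy.bin', tmp_dir)
    copied_bytes = local_copy.read_bytes()

# The directory and everything in it are gone once the block exits
copy_removed = not local_copy.exists()
print(f'Temporary copy removed: {copy_removed}')
```

With this pattern, `joblib.load` would need to be called inside the `with` block, while the downloaded file still exists.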
