## Kaggle Setup Instructions

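To run this notebook on Kaggle:
1. Create a dataset named `diabetes-model` containing `logistic_regression.joblib` and `logistic_regression_transformers.py` from the models directory. The notebook reads both files from `/kaggle/input/diabetes-model`.
2. Add the dataset to your notebook as an input. The competition data (`playground-series-s5e12`) must also be attached, because the test set is read from `/kaggle/input/playground-series-s5e12/test.csv`.
3. Set `KAGGLE = True` in the run configuration below.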

The transformers file is required because the saved pipeline contains custom transformer classes. `joblib.load` can only deserialize the model if those classes are importable under the same module name they had when the model was saved.
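As a minimal illustration of that importability requirement, the sketch below uses the standard library's `pickle`, which joblib relies on for Python objects, together with a throwaway module. The names `demo_transformers` and `Doubler` are hypothetical stand-ins and are not part of this project.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for logistic_regression_transformers.py
MODULE_NAME = 'demo_transformers'
MODULE_SOURCE = (
    'class Doubler:\n'
    '    def transform(self, x):\n'
    '        return x * 2\n'
)


def roundtrip_demo():
    """Pickle an object, then unpickle it without and with its module on sys.path."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, f'{MODULE_NAME}.py').write_text(MODULE_SOURCE)

        # Make the module importable and serialize an instance of its class
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        module = importlib.import_module(MODULE_NAME)
        payload = pickle.dumps(module.Doubler())

        # Simulate a fresh session in which the module cannot be found
        sys.path.remove(tmp)
        del sys.modules[MODULE_NAME]
        importlib.invalidate_caches()
        try:
            pickle.loads(payload)
            failed_without_path = False
        except ModuleNotFoundError:
            failed_without_path = True

        # Restore the path entry, as the notebook does with sys.path.insert
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        restored = pickle.loads(payload)
        value = restored.transform(21)

        # Leave the interpreter state as it was found
        sys.path.remove(tmp)
        sys.modules.pop(MODULE_NAME, None)

    return failed_without_path, value


failed_without_path, value = roundtrip_demo()
print(failed_without_path, value)
```

The serialized bytes store only the module and class names, not the class definition, which is why the transformers file has to travel with the model.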

# Logistic regression submission

## Notebook setup

### Imports

In [None]:
import joblib
import sys
import urllib.request
from pathlib import Path

import pandas as pd

### Run configuration

In [None]:
# Flag to control file paths for Kaggle vs other environments
KAGGLE = False

### Add custom transformers to path

In [None]:
# Add path to custom transformers module
if KAGGLE:
    # On Kaggle, the transformers file should be uploaded as part of the dataset
    # Add the input directory to the path
    transformers_path = Path('/kaggle/input/diabetes-model')
else:
    # For local/GitHub, use the models directory
    transformers_path = Path('../models').resolve()

sys.path.insert(0, str(transformers_path))

# Import custom transformers
from logistic_regression_transformers import IDColumnDropper, IQRClipper, ConstantFeatureRemover

## 1. Asset loading

In [None]:
# Set file paths based on environment
if KAGGLE:

    # Kaggle paths - data is in /kaggle/input/
    test_df_path = '/kaggle/input/playground-series-s5e12/test.csv'
    model_path = '/kaggle/input/diabetes-model/logistic_regression.joblib'
else:

    # Otherwise, load from GitHub
    test_df_path = 'https://gperdrizet.github.io/FSA_devops/assets/data/unit3/diabetes_prediction_test.csv'
    model_url = 'https://github.com/gperdrizet/diabetes-prediction/raw/refs/heads/main/models/logistic_regression.joblib'
    
    # Download the model to the working directory; it is deleted in the clean-up step
    model_path = Path('logistic_regression.joblib')
    urllib.request.urlretrieve(model_url, model_path)

# Load the testing dataset
test_df = pd.read_csv(test_df_path)

# Load the model
model = joblib.load(model_path)

# Display first few rows of the test data
test_df.head()
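The non-Kaggle branch above saves the downloaded model next to the notebook. If a separate temporary location is preferred, one option is a small helper along the following lines. This is a sketch only: `fetch_to_temp` is a hypothetical name, and the demonstration uses a local `file://` URL in place of the GitHub URL so that it runs offline.

```python
import tempfile
import urllib.request
from pathlib import Path


def fetch_to_temp(url: str, filename: str) -> Path:
    """Download url into a newly created temporary directory and return the file path."""
    tmp_dir = Path(tempfile.mkdtemp(prefix='model_'))
    target = tmp_dir / filename
    urllib.request.urlretrieve(url, target)
    return target


# Offline demonstration: a placeholder file served through a file:// URL
source = Path(tempfile.mkdtemp()) / 'source.bin'
source.write_bytes(b'placeholder model bytes')
downloaded = fetch_to_temp(source.as_uri(), 'logistic_regression.joblib')
print(downloaded)
```

With this approach the clean-up step would need to remove the temporary directory as well, for example with `shutil.rmtree(downloaded.parent)`.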

## 2. Inference

In [None]:
predictions_df = pd.DataFrame({
    'id': test_df['id'].astype(int),
    'diagnosed_diabetes': model.predict(test_df).astype(int)
})

predictions_df.info()

## 3. Save submission file

In [None]:
# Set submission file path based on environment
if KAGGLE:
    submission_path = Path('submission.csv')

else:
    # Create data directory if it doesn't exist
    data_dir = Path('../data')
    data_dir.mkdir(parents=True, exist_ok=True)
    submission_path = data_dir / 'logistic_regression_submission.csv'

# Save submission file
predictions_df.to_csv(submission_path, index=False)
print(f'Submission saved to: {submission_path}')
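Before uploading, it can be worth checking the file that was just written. The sketch below uses only the standard library and assumes the format this notebook produces: an `id` column and a `diagnosed_diabetes` column holding 0/1 labels. The helper name `check_submission_csv` and the toy file are illustrative, not part of the project.

```python
import csv
import tempfile
from pathlib import Path


def check_submission_csv(path, expected_rows=None):
    """Return a list of problems found in a submission CSV (empty if none)."""
    problems = []
    with open(path, newline='') as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames != ['id', 'diagnosed_diabetes']:
            problems.append(f'unexpected columns: {reader.fieldnames}')
            return problems
        rows = list(reader)

    # Each test row should appear exactly once
    ids = [row['id'] for row in rows]
    if len(set(ids)) != len(ids):
        problems.append('duplicate ids')

    # Assumes integer class labels, as written by this notebook
    if any(row['diagnosed_diabetes'] not in {'0', '1'} for row in rows):
        problems.append('labels other than 0/1')

    if expected_rows is not None and len(rows) != expected_rows:
        problems.append(f'expected {expected_rows} rows, found {len(rows)}')
    return problems


# Toy file standing in for the real submission
demo_path = Path(tempfile.mkdtemp()) / 'submission.csv'
demo_path.write_text('id,diagnosed_diabetes\n1,0\n2,1\n3,1\n')
demo_problems = check_submission_csv(demo_path, expected_rows=3)
print(demo_problems)
```

In this notebook the call would look like `check_submission_csv(submission_path, expected_rows=len(test_df))`.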

## 4. Clean up

In [None]:
# Clean up downloaded model file if not on Kaggle
if not KAGGLE and model_path.exists():

    model_path.unlink()
    print(f'Cleaned up temporary model file: {model_path}')