# Diabetes Dataset Preparation Notebook

This notebook serves as the first step in a series of notebooks aimed at analyzing a diabetes dataset. In this notebook, we'll focus on data acquisition, sampling, and preparation for model training in subsequent steps. The key processes include:

1. **Environment Setup**:
    - Importing necessary libraries.
    - Establishing Azure ML client connection.

2. **Data Acquisition**:
    - Defining the data paths.
    - Loading data into `MLTable` objects using `mltable.from_delimited_files()`.

3. **Data Sampling**:
    - Random sampling of data using `mltable` functionality.
    - Converting sampled data to Pandas DataFrame for further analysis.

4. **Production Data Simulation**:
    - Splitting the data by `PatientID` into development and production sets using `mltable` filters.
    - Separating production features and labels, and enriching the features with simulated dates.

5. **Data Saving and Versioning**:
    - Saving the data to disk in MLTable and Parquet formats.
    - Creating or updating the data assets in Azure ML with versioning.

This notebook sets the stage for the following notebooks in this series:
- **Train and Deploy a Model**: Building, training, and deploying a machine learning model.
- **Invoke a Real-Time Endpoint**: Making real-time predictions using the deployed model.
- **Create Synthetic Data**: Generating synthetic data for further analysis.
- **Explore Collected Data from Production**: Analyzing data collected from the production environment to gain insights and improve the model.

**Note**: This notebook assumes familiarity with Azure ML, Scikit-learn, and the `mltable` library for tabular data manipulation and analysis. For more detailed information on `mltable`, please refer to the [official documentation](link_to_mltable_documentation).


In [None]:
# !pip install mltable scikit-learn

In [None]:
# Import necessary libraries
import mltable
import os
import time
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

## Load Data

In this section, we'll load the diabetes dataset from two distinct files using `mltable.from_delimited_files()`. 

The `paths` variable holds a list of dictionaries that specify the URLs of these CSV files. Allowed keys are `file`, `folder`, and `pattern`. For demonstration purposes, we use both `pattern` and `file`.

In [None]:
paths = [
    {
        "pattern": "https://raw.githubusercontent.com/MicrosoftLearning/mslearn-dp100/main/data/diabetes.csv"
    },
    {
        "file": "https://raw.githubusercontent.com/MicrosoftLearning/mslearn-dp100/main/data/diabetes2.csv"
    },
    # {
    #     "folder": "https://raw.githubusercontent.com/MicrosoftLearning/mslearn-dp100/main/data/"
    # }
]

# Create an MLTable object
tbl = mltable.from_delimited_files(paths)

# tbl.show(5)


The `mltable.from_delimited_files()` function simplifies the loading process: by default it treats the comma as the delimiter, reads column names from the header row, and infers column data types, encapsulating the data within an MLTable object for easy manipulation and analysis.

## Sample Data
For initial exploration and model testing, we can draw a random sample from the loaded data. The `tbl.take_random_sample()` method does this: `probability` sets the chance that each row is selected, and `seed` ensures reproducibility.

In [None]:
# Take a random sample of the table
tbl_sample = tbl.take_random_sample(probability=0.001, seed=735)

# Convert to Pandas DataFrame
df = tbl_sample.to_pandas_dataframe()

# Display the first few rows
df.head()
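
To make the roles of `probability` and `seed` more concrete, here is a minimal pure-Python sketch of per-row random sampling. It does not use `mltable`; it assumes each row is kept independently with the given probability, and the helper name `bernoulli_sample` is hypothetical.

```python
import random


def bernoulli_sample(rows, probability, seed):
    """Keep each row independently with the given probability (hypothetical helper)."""
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return [row for row in rows if rng.random() < probability]


# Hypothetical stand-in for table rows
rows = list(range(10_000))

sample_a = bernoulli_sample(rows, probability=0.01, seed=735)
sample_b = bernoulli_sample(rows, probability=0.01, seed=735)

print(len(sample_a))          # close to 1% of the rows, but not a fixed size
print(sample_a == sample_b)   # same seed, same sample
```

Note that with this kind of sampling the sample size varies around `probability * number_of_rows`, so a very small `probability` yields only a handful of rows.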

Our data contains a `PatientID` (as patients can visit the doctor multiple times, this ID is not unique per row), several features, and the label `Diabetic`, which is to be predicted.

## Production data

To simulate MLOps, we need `production` data that will be used for inference with the trained model. We therefore split our data chronologically (by increasing `PatientID`) and hold the later samples (higher `PatientID`) out of our dataset as the `production` sample.

In [None]:
median_patient_id = str(int(df["PatientID"].median()))
median_patient_id

The built-in `mltable` filters make this easy.

In [None]:
# tbl_sample.random_split(percent=.5)
tbl_dev = tbl.filter(f'PatientID <= {median_patient_id}')
tbl_prod = tbl.filter(f'PatientID > {median_patient_id}')
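
The threshold split can be sketched in plain Python as follows. The rows and ID values are hypothetical, and the sketch only illustrates the logic (everything at or below the median goes to development, everything above goes to production), not the `mltable` implementation.

```python
from statistics import median

# Hypothetical rows; only PatientID matters for the split.
patient_ids = (1001, 1002, 1002, 1005, 1009, 1010, 1013)
rows = [{"PatientID": pid, "Diabetic": pid % 2} for pid in patient_ids]

# Threshold: the median PatientID, truncated to an integer
threshold = int(median(r["PatientID"] for r in rows))

# Development data: lower IDs; production data: higher IDs
dev = [r for r in rows if r["PatientID"] <= threshold]
prod = [r for r in rows if r["PatientID"] > threshold]

print(threshold, len(dev), len(prod))
```

In the notebook, the median is computed on the sampled `df` while the filters run on the full `tbl`, so the resulting split is only approximately half and half.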

### Enriching Production Data with Dates

We split features and labels and, in doing so, drop the `PatientID`, as it contains no useful information for our model (other than that the model could memorize IDs).

In [None]:
# Features: everything except the label and the ID
tbl_prod_features = tbl_prod.drop_columns(["Diabetic", "PatientID"])
# Labels: only the target column
tbl_prod_labels = tbl_prod.keep_columns(["Diabetic"])
tbl_prod_features.show(5)

In [None]:
tbl_prod_labels.show(5)

To simulate `production` data, we create a `date` column and fill it with artificial dates within 30 days before or after the current date.

This way, roughly half of the data is available for immediate inference, and the rest simulates continuous inference over the next month.

In [None]:
import numpy as np
from datetime import datetime, timedelta

# Convert the production features to a Pandas DataFrame
df_prod_features = tbl_prod_features.to_pandas_dataframe()

# Get today's date
today = datetime.today()

# Generate random day offsets between -30 and 30 (randint's upper bound is exclusive)
random_days = np.random.randint(-30, 31, size=len(df_prod_features))

# Create date column with random dates
df_prod_features['date'] = [(today + timedelta(days=int(d))).isoformat() + 'Z' for d in random_days]

df_prod_features.head(5)
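
As a variation on the cell above, the following standard-library sketch generates the simulated timestamps reproducibly. Because `datetime.today()` returns local time, the sketch uses an explicit UTC timestamp so that the trailing `Z` is accurate. The helper name `simulated_dates`, its parameters, and the fixed reference date are assumptions for illustration.

```python
import random
from datetime import datetime, timedelta, timezone


def simulated_dates(n, now, max_offset_days=30, seed=0):
    """Return n ISO 8601 UTC timestamps within +/- max_offset_days of `now` (hypothetical helper)."""
    rng = random.Random(seed)  # seeded for reproducibility
    # random.randint includes both endpoints
    offsets = [rng.randint(-max_offset_days, max_offset_days) for _ in range(n)]
    return [(now + timedelta(days=d)).strftime("%Y-%m-%dT%H:%M:%SZ") for d in offsets]


# Fixed reference time so the example is repeatable; in practice use datetime.now(timezone.utc)
now = datetime(2024, 1, 31, 12, 0, 0, tzinfo=timezone.utc)
dates = simulated_dates(5, now, seed=42)

for d in dates:
    print(d)
```

Seeding the generator means repeated runs of the notebook produce the same `date` column, which can make downstream monitoring results easier to compare.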

## Data Saving and Versioning

Now that we have everything (training data and enriched production data), we can create AzureML data assets. 

We first define an asset version based on the current UTC time. This versioning is valuable for traceability and for managing data versions in machine learning workflows.

For our training data (`tbl_dev`), which is tabular, MLTable is the best format to use.

In contrast, we need to store the production data as a `URI_FOLDER` in order for it to be processed by our Data and Model Monitor (which leverages Spark).

In [None]:
# Set the version number of the data asset to the current UTC time
VERSION = time.strftime("%Y.%m.%d.%H%M%S", time.gmtime())
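
A small sketch of what this format string produces, using fixed timestamps instead of the current time so the output is predictable. The helper name `asset_version` is hypothetical.

```python
import time


def asset_version(epoch_seconds):
    """Format a Unix timestamp as a UTC-based version string (hypothetical helper)."""
    return time.strftime("%Y.%m.%d.%H%M%S", time.gmtime(epoch_seconds))


v0 = asset_version(0)               # the Unix epoch
v1 = asset_version(86_400 + 3_661)  # one day, one hour, one minute, one second later

print(v0)  # 1970.01.01.000000
print(v1)  # 1970.01.02.010101
```

Because every field is zero-padded and ordered from year down to second, these version strings sort chronologically when compared as plain text.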

In [None]:
# Delete the data directory if it exists
import shutil
if os.path.exists("data"):
    shutil.rmtree("data")

# Create the data directory and subdirectories for training and production data
os.makedirs("data")
os.makedirs("data/train")
os.makedirs("data/prod")
os.makedirs("data/prod/features")
os.makedirs("data/prod/labels")

Write the production features to a Parquet file, connect to the AzureML workspace, and define the `URI_FOLDER` data asset for `production` data (only the features). The `create_or_update` method either creates a new data asset or updates an existing one in AzureML.

In [None]:
df_prod_features.to_parquet("./data/prod/features/data.parquet")

# Connect to the AzureML workspace
ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Define the data asset
production_inputs = Data(
    path="./data/prod/features",
    type=AssetTypes.URI_FOLDER,
    description="Example data from the diabetes dataset (Production)",
    name="diabetes-urifolder-production",
    version=VERSION,
)

# Create or update the data asset in AzureML
ml_client.data.create_or_update(production_inputs)


In [None]:
# Save the production labels to disk in MLTable format
tbl_prod_labels.save("./data/prod/labels")

# Define the data asset
production_labels = Data(
    path="./data/prod/labels",
    type=AssetTypes.MLTABLE,
    description="Example data from the diabetes dataset (Production Labels)",
    name="diabetes-mltable-production-labels",
    version=VERSION,
)

# Create or update the data asset in AzureML
ml_client.data.create_or_update(production_labels)

In [None]:
# Save the development data to disk in MLTable format
tbl_dev.save("./data/dev")

# Define the data asset
training_data = Data(
    path="./data/dev/",
    type=AssetTypes.MLTABLE,
    description="Example data from the diabetes dataset (Train, Test, Eval)",
    name="diabetes-mltable-dev",
    version=VERSION,
)

# Create or update the data asset in AzureML
ml_client.data.create_or_update(training_data)

## Next step

We've created data assets that lay the foundation of this accelerator. Proceed with the next notebook to learn how to train and deploy models in AzureML.