# Diabetes Dataset Preparation Notebook

This notebook serves as the first step in a series of notebooks aimed at analyzing a diabetes dataset. In this notebook, we'll focus on data acquisition, sampling, and preparation for model training in subsequent steps. The key processes include:

1. **Environment Setup**:
    - Importing necessary libraries.
    - Establishing Azure ML client connection.

2. **Data Acquisition**:
    - Defining the data paths.
    - Loading data into `MLTable` objects using `mltable.from_delimited_files()`.

3. **Data Sampling**:
    - Random sampling of data using `mltable` functionality.
    - Converting sampled data to Pandas DataFrame for further analysis.

4. **Data Saving and Versioning**:
    - Saving the data to disk in MLTable format.
    - Creating or updating the data asset in Azure ML with versioning.

5. **Data Splitting**:
    - Splitting the development data into training, test, and evaluation sets using the `random_split()` method of `MLTable`.

6. **Data Persistence**:
    - Saving each split to disk as Parquet and registering it as a data asset in Azure ML.

This notebook sets the stage for the following notebooks in this series:
- **Train and Deploy a Model**: Building, training, and deploying a machine learning model.
- **Invoke a Real-Time Endpoint**: Making real-time predictions using the deployed model.
- **Create Synthetic Data**: Generating synthetic data for further analysis.
- **Explore Collected Data from Production**: Analyzing data collected from the production environment to gain insights and improve the model.

**Note**: This notebook assumes familiarity with Azure ML, Scikit-learn, and the `mltable` library for tabular data manipulation and analysis. For more detailed information on `mltable`, please refer to the [official documentation](link_to_mltable_documentation).


In [1]:
# !pip install mltable scikit-learn

In [2]:
# Import necessary libraries
import mltable  # Library for working with tabular data
import os  # Operating system interfaces
import time  # Time access and conversions
from azure.ai.ml import MLClient  # Azure ML Client for managing ML assets
from azure.ai.ml.entities import Data  # Data entity class for Azure ML
from azure.ai.ml.constants import AssetTypes  # Constants for Azure ML asset types
from azure.identity import DefaultAzureCredential  # Credential class for Azure authentication
from sklearn.model_selection import train_test_split  # Split arrays or matrices into random train and test subsets

## Load Data

In this section, we'll load the diabetes dataset from two separate CSV files using `mltable.from_delimited_files()`. The `paths` variable holds a list of dictionaries that specify the URLs of these files. Allowed keys are `file`, `folder`, and `pattern`.
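As a small illustration of the key constraint described above, the sketch below checks a `paths` list before it is handed to `mltable`. The helper `validate_paths` is hypothetical (it is not part of the `mltable` library), and it assumes that each entry carries exactly one of the allowed keys, as the entries in this notebook do.

```python
# Hypothetical helper, not part of mltable: a pre-flight check for the paths list.
ALLOWED_PATH_KEYS = {"file", "folder", "pattern"}


def validate_paths(paths):
    """Return paths unchanged if every entry uses exactly one allowed key."""
    for index, entry in enumerate(paths):
        keys = set(entry)
        # Assumption: one key per entry, taken from the allowed set.
        if len(keys) != 1 or not keys <= ALLOWED_PATH_KEYS:
            raise ValueError(
                f"paths[{index}] must use exactly one of "
                f"{sorted(ALLOWED_PATH_KEYS)}, got {sorted(keys)}"
            )
    return paths


example_paths = validate_paths(
    [
        {"pattern": "https://raw.githubusercontent.com/MicrosoftLearning/mslearn-dp100/main/data/diabetes.csv"},
        {"file": "https://raw.githubusercontent.com/MicrosoftLearning/mslearn-dp100/main/data/diabetes2.csv"},
    ]
)
print(len(example_paths))
```

A check like this only catches misspelled keys early; whether a URL is reachable is still determined when the table is actually read.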

In [3]:
# Define the file paths
paths = [
    {
        "pattern": "https://raw.githubusercontent.com/MicrosoftLearning/mslearn-dp100/main/data/diabetes.csv"
    },
    {
        "file": "https://raw.githubusercontent.com/MicrosoftLearning/mslearn-dp100/main/data/diabetes2.csv"
    },
    # {
    #     "folder": "https://raw.githubusercontent.com/MicrosoftLearning/mslearn-dp100/main/data/"
    # }
]

# Create an MLTable object
tbl = mltable.from_delimited_files(paths)

tbl.show(5)


Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
0,1354778,0,171,80,34,23,43.509726,1.213191,21,False
1,1147438,8,92,93,47,36,21.240576,0.158365,23,False
2,1640031,7,115,47,52,35,41.511523,0.079019,23,False
3,1883350,9,103,78,25,304,29.582192,1.28287,43,True
4,1424119,1,85,59,27,35,42.604536,0.549542,22,False


The `mltable.from_delimited_files()` function simplifies loading: by default it reads the files with a comma delimiter, takes the column names from the header row, and infers the column data types. The result is an `MLTable` object that can be filtered, sampled, and converted for further analysis.

## Sample Data
For initial exploration, we'll draw a random sample from the loaded data using the `tbl.take_random_sample()` method. The `probability` argument sets the chance that each row is included in the sample, and `seed` makes the selection reproducible.
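To make the roles of `probability` and `seed` concrete, here is a minimal standard-library sketch of per-row (Bernoulli) sampling. It illustrates the idea only and is not how `mltable` implements sampling internally; the function `bernoulli_sample` and the 10,000 integer "rows" are hypothetical.

```python
import random


def bernoulli_sample(rows, probability, seed):
    """Keep each row independently with the given probability."""
    rng = random.Random(seed)  # a fixed seed gives the same selection every run
    return [row for row in rows if rng.random() < probability]


rows = list(range(10_000))  # hypothetical stand-in for table rows
sample_a = bernoulli_sample(rows, probability=0.001, seed=735)
sample_b = bernoulli_sample(rows, probability=0.001, seed=735)

# The sample size is only approximately probability * len(rows),
# but the same seed always reproduces the same rows.
print(len(sample_a), sample_a == sample_b)
```

Because each row is selected independently, the sample size varies around `probability * number_of_rows` rather than hitting it exactly.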

In [4]:
# Take a random sample of the table
tbl_sample = tbl.take_random_sample(probability=0.001, seed=735)

# Convert to Pandas DataFrame
df = tbl_sample.to_pandas_dataframe()

# Display the first few rows
df.head()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
0,1430470,11,128,101,46,231,31.419341,0.113117,62,True
1,1654612,0,89,53,28,27,30.384366,0.140559,21,False
2,1090869,0,114,80,32,144,35.262387,0.130052,25,False
3,1855732,9,154,63,15,73,29.76703,0.617465,21,True
4,1948008,0,163,54,7,41,40.640848,0.133926,32,False


In [5]:
median_patient_id = str(int(df["PatientID"].median()))
median_patient_id

'1605160'

In [6]:
# tbl_sample.random_split(percent=.5)
tbl_dev = tbl.filter(f'PatientID <= {median_patient_id}')
tbl_prod = tbl.filter(f'PatientID > {median_patient_id}')
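The two `filter` calls above partition the table at a `PatientID` threshold: rows at or below the threshold go to the development set and the rest to the production set. Because the median is computed on the small sample rather than on the full table, the two parts are only roughly equal in size. The plain-Python sketch below illustrates the partition with the five `PatientID` values from the `tbl.show(5)` output and the threshold from cell `In [5]`; the helper `split_at_threshold` is hypothetical.

```python
def split_at_threshold(patient_ids, threshold):
    """Partition IDs into (dev, prod) using the same comparisons as the filters."""
    dev = [pid for pid in patient_ids if pid <= threshold]
    prod = [pid for pid in patient_ids if pid > threshold]
    return dev, prod


# PatientID values shown by tbl.show(5) and the threshold printed in In [5]
patient_ids = [1354778, 1147438, 1640031, 1883350, 1424119]
threshold = 1605160

dev_ids, prod_ids = split_at_threshold(patient_ids, threshold)
print(dev_ids)   # [1354778, 1147438, 1424119]
print(prod_ids)  # [1640031, 1883350]
```

Since the two conditions are complementary, every row with a `PatientID` lands in exactly one of the two sets.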

## Creating Synthetic Production Data with Dates

In [16]:
# Separate the production features from the labels
tbl_prod_features = tbl_prod.drop_columns(["Diabetic", "PatientID"])
tbl_prod_labels = tbl_prod.keep_columns(["Diabetic"])
tbl_prod_features.show(5)

Unnamed: 0,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age
0,7,115,47,52,35,41.511523,0.079019,23
1,9,103,78,25,304,29.582192,1.28287,43
2,0,82,92,9,253,19.72416,0.103424,26
3,0,133,47,19,227,21.941357,0.17416,21
4,1,88,86,11,58,43.225041,0.230285,22


In [21]:
import numpy as np
from datetime import datetime, timedelta

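# Convert the production features to a Pandas DataFrame
df_prod_features = tbl_prod_features.to_pandas_dataframe()

# Get today's date (local time)
today = datetime.today()

# Generate a random day offset per row, from -30 to 29 (the upper bound of randint is exclusive)
random_days = np.random.randint(-30, 30, size=len(df_prod_features))

# Create a date column with random ISO 8601 timestamps.
# Note: the 'Z' suffix denotes UTC, but datetime.today() returns local time,
# so the label is only accurate when the machine clock is set to UTC.
df_prod_features['date'] = [(today + timedelta(days=int(d))).isoformat() + 'Z' for d in random_days]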

df_prod_features.head(5)


Unnamed: 0,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,date
0,7,115,47,52,35,41.511523,0.079019,23,2023-11-05T18:22:11.028144Z
1,9,103,78,25,304,29.582192,1.28287,43,2023-10-01T18:22:11.028144Z
2,0,82,92,9,253,19.72416,0.103424,26,2023-09-17T18:22:11.028144Z
3,0,133,47,19,227,21.941357,0.17416,21,2023-09-21T18:22:11.028144Z
4,1,88,86,11,58,43.225041,0.230285,22,2023-10-05T18:22:11.028144Z
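The cell above draws unseeded random offsets and appends `Z` to a naive local timestamp. As a hedged alternative, the sketch below uses only the standard library to produce reproducible, timezone-aware UTC timestamps. The function `random_utc_timestamps`, the seed, and the fixed reference time are all hypothetical choices for illustration; the notebook itself uses today's date.

```python
import random
from datetime import datetime, timedelta, timezone


def random_utc_timestamps(count, reference, seed, max_offset_days=30):
    """Return ISO 8601 UTC strings offset from an aware reference datetime."""
    rng = random.Random(seed)  # seeded so the synthetic dates are reproducible
    stamps = []
    for _ in range(count):
        # randrange excludes the upper bound, like np.random.randint
        offset = rng.randrange(-max_offset_days, max_offset_days)
        moment = (reference + timedelta(days=offset)).astimezone(timezone.utc)
        stamps.append(moment.isoformat().replace("+00:00", "Z"))
    return stamps


# Hypothetical fixed reference time; pass datetime.now(timezone.utc) for "today"
reference = datetime(2023, 10, 13, 18, 22, 11, tzinfo=timezone.utc)
print(random_utc_timestamps(3, reference, seed=0))
```

Starting from a timezone-aware datetime means the `Z` suffix is accurate regardless of the clock settings of the machine that runs the notebook.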


## Data Saving and Versioning

After sampling, we save the `MLTable` objects to disk and define a version for the data assets based on the current UTC time. A time-based version makes each registered asset traceable to the run that produced it and keeps successive versions easy to tell apart.

In [7]:
# Set the version number of the data asset to the current UTC time
VERSION = time.strftime("%Y.%m.%d.%H%M%S", time.gmtime())
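The format string above orders the fields from most to least significant and zero-pads each one, so comparing two version strings as text gives the same order as comparing the times they encode. The sketch below shows this with fixed epoch values; the wrapper `utc_version` is a hypothetical helper added here only to make the behaviour testable.

```python
import time


def utc_version(epoch_seconds=None):
    """Format a UTC time (default: now) as a version string."""
    return time.strftime("%Y.%m.%d.%H%M%S", time.gmtime(epoch_seconds))


print(utc_version(0))       # 1970.01.01.000000
print(utc_version(86_400))  # 1970.01.02.000000

# Later times yield lexicographically larger version strings.
print(utc_version(0) < utc_version(86_400))  # True
```

Note that the resolution is one second, so two registrations within the same second would receive the same version string.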

In [8]:
# Delete the data directory if it exists, then recreate the folder structure
import shutil
if os.path.exists("data"):
    shutil.rmtree("data")

os.makedirs("data")
os.makedirs("data/train")
os.makedirs("data/test")
os.makedirs("data/eval")
os.makedirs("data/prod")
os.makedirs("data/prod/features")
os.makedirs("data/prod/labels")

In [10]:

# Save the tables to disk in MLTable format
tbl_prod_features.save("./data/prod/features")
tbl_prod_labels.save("./data/prod/labels")
tbl_dev.save("./data/dev")

print(os.listdir("./data/prod/features"))

with open("./data/prod/features/MLTable") as f:
    print(f.read())


['MLTable']
paths:
- pattern: https://raw.githubusercontent.com/MicrosoftLearning/mslearn-dp100/main/data/diabetes.csv
- file: https://raw.githubusercontent.com/MicrosoftLearning/mslearn-dp100/main/data/diabetes2.csv
transformations:
- read_delimited:
    delimiter: ','
    empty_as_string: false
    encoding: utf8
    header: all_files_same_headers
    include_path_column: false
    infer_column_types: true
    partition_size: 20971520
    path_column: Path
    support_multi_line: false
- filter: PatientID > 1605160
- drop_columns:
  - Diabetic
  - PatientID
type: mltable



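Next, we connect to the Azure ML workspace and define the data assets. The `create_or_update` method either creates a new data asset or updates an existing one. The production features, which now include the synthetic `date` column, are first written to a Parquet file and then registered as a `uri_folder` asset.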

In [23]:
df_prod_features.to_parquet("./data/prod/features/data.parquet")

# Connect to the AzureML workspace
ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Define the data asset
my_data = Data(
    path="./data/prod/features",
    type=AssetTypes.URI_FOLDER,
    description="Example data from the diabetes dataset (Production)",
    name="diabetes-urifolder-production",
    version=VERSION,
)

# Create or update the data asset in AzureML
ml_client.data.create_or_update(my_data)


Found the config file in: /config.json
Uploading features (0.16 MBs): 100%|██████████| 163811/163811 [00:00<00:00, 1215408.99it/s]



Data({'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': None, 'type': 'uri_folder', 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'diabetes-urifolder-production', 'description': 'Example data from the diabetes dataset (Production)', 'tags': {}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/13c1109b-ba76-4ca6-8161-8767bdf3c75c/resourceGroups/ai-services-rg/providers/Microsoft.MachineLearningServices/workspaces/schaeffler-ops-it-aml/data/diabetes-urifolder-production/versions/2023.10.13.173852', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/hehein5/code/Users/hehein/mlops-ci-cd-monitor-demo', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7f7c0ade2fe0>, 'serialize': <msrest.serialization.Serializer object at 0x7f7c0ade15a0>, 'version': '2023.10.13.173852', 'latest_version': None, 'path': 'azureml://subscriptions/13c1109b-ba76

In [12]:


# Connect to the AzureML workspace
ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Define the data asset
my_data = Data(
    path="./data/prod/labels",
    type=AssetTypes.MLTABLE,
    description="Example data from the diabetes dataset (Production Labels)",
    name="diabetes-mltable-production-labels",
    version=VERSION,
)

# Create or update the data asset in AzureML
ml_client.data.create_or_update(my_data)


# Connect to the AzureML workspace
ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Define the data asset
my_data = Data(
    path="./data/dev/",
    type=AssetTypes.MLTABLE,
    description="Example data from the diabetes dataset (Train, Test, Eval)",
    name="diabetes-mltable-dev",
    version=VERSION,
)

# Create or update the data asset in AzureML
ml_client.data.create_or_update(my_data)

Found the config file in: /config.json
Found the config file in: /config.json


Data({'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': ['https://raw.githubusercontent.com/MicrosoftLearning/mslearn-dp100/main/data/diabetes.csv', 'https://raw.githubusercontent.com/MicrosoftLearning/mslearn-dp100/main/data/diabetes2.csv'], 'type': 'mltable', 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'diabetes-mltable-dev', 'description': 'Example data from the diabetes dataset (Train, Test, Eval)', 'tags': {}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/13c1109b-ba76-4ca6-8161-8767bdf3c75c/resourceGroups/ai-services-rg/providers/Microsoft.MachineLearningServices/workspaces/schaeffler-ops-it-aml/data/diabetes-mltable-dev/versions/2023.10.13.173852', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/hehein5/code/Users/hehein/mlops-ci-cd-monitor-demo', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7f7c7630fe80>, 'seriali

## Split Data
The traditional approach would be to use scikit-learn's `train_test_split` to separate the data into training and test sets, so that the model can be validated on unseen data.

That approach works well, but here we use the native `random_split()` method of `MLTable` instead, which makes it easy to split the development data into train, test, and eval sets.

In [13]:
tbl_train, tbl_test_eval = tbl_dev.random_split(percent=.7)
tbl_test, tbl_eval = tbl_test_eval.random_split(percent=.5)
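Applying the two splits in sequence gives approximate shares of 70% for train and 0.3 × 0.5 = 15% each for test and eval. The standard-library sketch below illustrates this arithmetic with a per-row random split. The helper `bernoulli_split`, the seeds, and the 10,000 integer "rows" are hypothetical, and treating the split as an independent per-row draw is an assumption made for illustration, not a description of how `mltable` implements `random_split`.

```python
import random


def bernoulli_split(rows, percent, seed):
    """Send each row to the first part with probability `percent`, else to the second."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < percent else second).append(row)
    return first, second


rows = list(range(10_000))  # hypothetical stand-in for the development rows
train, test_eval = bernoulli_split(rows, percent=0.7, seed=1)
test, eval_rows = bernoulli_split(test_eval, percent=0.5, seed=2)

# Shares come out close to 0.70 / 0.15 / 0.15, not exactly equal to them.
for name, part in [("train", train), ("test", test), ("eval", eval_rows)]:
    print(name, round(len(part) / len(rows), 3))
```

In the notebook's own cell, no `seed` is passed to `random_split`, so the splits will generally differ between runs; passing a seed is worth considering when reproducible splits matter.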


After splitting, we materialize each split as a local Parquet file, create a new `MLTable` that points to that file, and register it as a data asset.

In [14]:
tables = {"train": tbl_train, "test": tbl_test, "eval": tbl_eval}

for stage, table in tables.items():
    print(f"Registering stage: {stage}")
    path = f"./data/{stage}/"
    parquet_path = f"{path}{stage}.parquet"
    print(parquet_path)

    # Materialize the split and save it locally as Parquet
    df_stage = table.to_pandas_dataframe()
    df_stage.to_parquet(parquet_path)

    # Create a new MLTable that points to the local Parquet file
    tbl_stage = mltable.from_parquet_files(paths=[{"file": parquet_path}])
    tbl_stage.save(path)

    # Define the data asset
    my_data = Data(
        path=path,
        type=AssetTypes.MLTABLE,
        description=f"Example data from the diabetes dataset ({stage})",
        name=f"diabetes-mltable-{stage}",
        version=VERSION,
    )

    # Create or update the data asset in Azure ML
    ml_client.data.create_or_update(my_data)

Registering stage: train
./data/train/train.parquet
Registering stage: test
./data/test/test.parquet
Registering stage: eval
./data/eval/eval.parquet

Uploading train (0.21 MBs): 100%|██████████| 207473/207473 [00:00<00:00, 2809414.27it/s]
Uploading test (0.05 MBs): 100%|██████████| 51181/51181 [00:00<00:00, 624857.87it/s]
Uploading eval (0.05 MBs): 100%|██████████| 52748/52748 [00:00<00:00, 631680.80it/s]



With the data loaded, versioned, split, and registered as data assets, it is ready for training and evaluation in the following notebooks.
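For reference, a later notebook could read one of the registered assets back roughly as sketched below. The helper `load_registered_table` is hypothetical. It relies on `ml_client.data.get()` and `mltable.load()` with an `azureml:/` URI built from the asset ID. The `loader` parameter and the fake client exist only so the sketch can run without an Azure workspace, and the version `"1"` is a placeholder.

```python
from types import SimpleNamespace


def load_registered_table(ml_client, name, version, loader=None):
    """Look up a data asset and load it through its azureml:/ URI."""
    if loader is None:
        import mltable  # imported lazily so the sketch runs without Azure packages

        loader = mltable.load
    asset = ml_client.data.get(name=name, version=version)
    # asset.id starts with "/", so the result has the form azureml://...
    return loader(f"azureml:/{asset.id}")


# Stand-in client for illustration; in practice use MLClient.from_config(...)
fake_client = SimpleNamespace(
    data=SimpleNamespace(
        get=lambda name, version: SimpleNamespace(
            id=f"/fake/data/{name}/versions/{version}"
        )
    )
)

# With the identity loader, the call simply returns the URI it would load.
uri = load_registered_table(
    fake_client, "diabetes-mltable-train", "1", loader=lambda u: u
)
print(uri)  # azureml://fake/data/diabetes-mltable-train/versions/1
```

With a real `MLClient` and the default loader, the returned object would be an `MLTable`, which can then be converted with `to_pandas_dataframe()` as elsewhere in this notebook.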