**Company Name:**
- Major Hospital

**Problem Type:**
- Classification (Multi-Class)

**Problem:**
- The company wants to automate patient classification: determine whether a patient has hepatitis and, if so, which category of the disease they have.

**Goal:**
- Predict each patient's diagnosis category from the following details (the features used for prediction):
  - X (Patient ID/No.)
  - Age (in years)
  - Sex (f,m)
  - ALB
  - ALP
  - ALT
  - AST
  - BIL
  - CHE
  - CHOL
  - CREA
  - GGT
  - PROT


- These features are used to predict the target variable:
  - Category (diagnosis) (values: '0=Blood Donor', '0s=suspect Blood Donor', '1=Hepatitis', '2=Fibrosis', '3=Cirrhosis')
  - The Category column has been label-encoded, so 0, 1, 2, 3, and 4 correspond to '0=Blood Donor', '0s=suspect Blood Donor', '1=Hepatitis', '2=Fibrosis', and '3=Cirrhosis', respectively.
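
The encoding described above can be written down as a lookup table, which is handy for turning numeric predictions back into readable diagnoses. This is a minimal sketch: the names `CATEGORY_LABELS` and `decode_category` are illustrative and do not appear in the notebook, and it assumes the integer codes follow the order listed above.

```python
# Hypothetical helper: map encoded Category values back to their labels.
CATEGORY_LABELS = {
    0: "0=Blood Donor",
    1: "0s=suspect Blood Donor",
    2: "1=Hepatitis",
    3: "2=Fibrosis",
    4: "3=Cirrhosis",
}


def decode_category(code: int) -> str:
    """Return the diagnosis label for an encoded Category value."""
    return CATEGORY_LABELS[int(code)]


print(decode_category(2))
```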

In [30]:
# Import main libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import catboost
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
import optuna
from optuna.samplers import TPESampler
from sklearn.metrics import ConfusionMatrixDisplay
#import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
import warnings
warnings.filterwarnings('ignore')

In [31]:
np.random.seed(42)

## Task 1: Understand the training code 

In [48]:
train = pd.read_parquet('data/train.parquet')

test = pd.read_parquet('data/test.parquet')

In [49]:
# Split train and test data into features X and targets Y.
# Note: this LabelEncoder is instantiated but not used below,
# because Category is already integer-encoded in the parquet files.
le = LabelEncoder()

target_column_name = 'Category'
Y_train = train[target_column_name]
X_train = train.drop([target_column_name], axis=1)
Y_test = test[target_column_name]
X_test = test.drop([target_column_name], axis=1)

In [50]:
Y_train

0      0
1      0
2      0
3      0
4      0
      ..
487    0
488    0
489    0
490    0
491    0
Name: Category, Length: 492, dtype: int64
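
The preview above shows only class 0 at both ends of the series, so it says little about how the five categories are distributed. Counting the labels is one quick way to check for imbalance. The sketch below uses a small made-up label list (`toy_labels`) rather than the real `Y_train`, and `class_shares` is a hypothetical helper name.

```python
from collections import Counter

# Made-up labels for illustration only; use list(Y_train) in the notebook.
toy_labels = [0, 0, 0, 0, 0, 0, 1, 2, 2, 3, 4, 4]


def class_shares(labels):
    """Return {class: fraction of samples}, ordered by class value."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: counts[cls] / total for cls in sorted(counts)}


shares = class_shares(toy_labels)
for cls, share in shares.items():
    print(f"class {cls}: {share:.2f}")
```

On the real data, `Y_train.value_counts(normalize=True)` should give the same kind of summary directly from pandas.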

In [51]:


# Transform string data to numeric one-hot vectors
categorical_selector = selector(dtype_exclude=np.number)
categorical_columns = categorical_selector(X_train)
categorical_encoder = OneHotEncoder(handle_unknown='ignore')

# Standardize numeric data by removing the mean and scaling to unit variance
numerical_selector = selector(dtype_include=np.number)
numerical_columns = numerical_selector(X_train)
numerical_encoder = StandardScaler()

# Create a preprocessor that will preprocess both numeric and categorical data
preprocessor = ColumnTransformer([
    ('categorical-encoder', categorical_encoder, categorical_columns),
    ('standard_scaler', numerical_encoder, numerical_columns),
])

# Chain preprocessing and the XGBoost classifier into a single pipeline
xgb = make_pipeline(
    preprocessor,
    XGBClassifier(use_label_encoder=False, eval_metric='mlogloss'),
)

print('Training model...')

model = xgb.fit(X_train, Y_train)

print('Accuracy score: ', xgb.score(X_test, Y_test))

Training model...
Accuracy score:  0.926829268292683
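
A single accuracy figure can hide weak performance on rare classes, which matters when most samples belong to one category. The sketch below illustrates this with made-up labels (not results from the notebook): plain accuracy looks high while one class is never predicted correctly. The helper `per_class_recall` is hypothetical, and balanced accuracy is computed here as the mean of the per-class recalls.

```python
from collections import defaultdict

# Made-up labels for illustration only; not results from the notebook.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 2, 4]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 4]


def per_class_recall(y_true, y_pred):
    """Return {class: fraction of that class's samples predicted correctly}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, guess in zip(y_true, y_pred):
        totals[truth] += 1
        hits[truth] += int(truth == guess)
    return {cls: hits[cls] / totals[cls] for cls in sorted(totals)}


accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)
balanced_accuracy = sum(recall.values()) / len(recall)

print(f"accuracy: {accuracy:.2f}")
print(f"recall per class: {recall}")
print(f"balanced accuracy: {balanced_accuracy:.2f}")
```

In the notebook itself, the already-imported `classification_report(Y_test, xgb.predict(X_test))` would report per-class precision and recall on the real test set.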


## Task 2: Create a cloud client

In [52]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()
ml_client = MLClient.from_config(credential=credential)

Found the config file in: ./config.json


## Task 3: Register the training and test data

In [54]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

train_data_name = 'hepatitis_c_train_parquet'
test_data_name = 'hepatitis_c_test_parquet'

training_data = Data(
    name=train_data_name,
    path='data/train.parquet',
    type=AssetTypes.URI_FILE,
    description='RAI hepatitis c train data',
)
tr_data = ml_client.data.create_or_update(training_data)

test_data = Data(
    name=test_data_name,
    path='data/test.parquet',
    type=AssetTypes.URI_FILE,
    description='RAI hepatitis c test data',
)
ts_data = ml_client.data.create_or_update(test_data)

ClientAuthenticationError: DefaultAzureCredential failed to retrieve a token from the included credentials.
Attempted credentials:
	EnvironmentCredential: EnvironmentCredential authentication unavailable. Environment variables are not fully configured.
Visit https://aka.ms/azsdk/python/identity/environmentcredential/troubleshoot to troubleshoot this issue.
	ManagedIdentityCredential: ManagedIdentityCredential authentication unavailable, no response from the IMDS endpoint.
	SharedTokenCacheCredential: SharedTokenCacheCredential authentication unavailable. No accounts were found in the cache.
	AzureCliCredential: Azure CLI not found on path
	AzurePowerShellCredential: PowerShell is not installed
	AzureDeveloperCliCredential: Azure Developer CLI could not be found. Please visit https://aka.ms/azure-dev for installation instructions and then,once installed, authenticate to your Azure account using 'azd auth login'.
To mitigate this issue, please refer to the troubleshooting guidelines here at https://aka.ms/azsdk/python/identity/defaultazurecredential/troubleshoot.
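
The traceback above means that every credential type `DefaultAzureCredential` tried was unavailable in this environment, so the registration calls never reached the workspace. One common workaround is to test a credential up front and fall back to an interactive login if it fails. The sketch below shows that pattern. `first_working_credential` and `azure_credential` are hypothetical helper names, the code assumes `azure-identity` is installed when `azure_credential` is called, and the interactive fallback assumes a browser is available on the machine.

```python
# Scope for Azure Resource Manager, used only to check that a token can be issued.
ARM_SCOPE = "https://management.azure.com/.default"


def first_working_credential(factories, scope=ARM_SCOPE):
    """Return the first credential whose get_token(scope) call succeeds.

    `factories` is a list of zero-argument callables that build credentials.
    """
    errors = []
    for factory in factories:
        try:
            credential = factory()
            credential.get_token(scope)  # Fails fast if login is not possible.
            return credential
        except Exception as exc:  # Broad on purpose: any failure means "try next".
            errors.append(f"{getattr(factory, '__name__', factory)}: {exc}")
    raise RuntimeError("No credential could get a token:\n" + "\n".join(errors))


def azure_credential():
    """Try DefaultAzureCredential, then an interactive browser login."""
    # Imported lazily so the helper above stays usable without azure-identity.
    from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

    return first_working_credential(
        [DefaultAzureCredential, InteractiveBrowserCredential]
    )
```

In the notebook, this would mean replacing `credential = DefaultAzureCredential()` with `credential = azure_credential()` before building the `MLClient`.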

## Create a compute cluster

In [None]:
from azure.ai.ml.entities import AmlCompute

compute_name = 'trainingcompute'

# Autoscaling cluster: scales down to 0 nodes when idle, up to 4 under load.
my_compute = AmlCompute(
    name=compute_name,
    size='Standard_DS12_v2',
    min_instances=0,
    max_instances=4,
    idle_time_before_scale_down=3600,  # seconds of idle time before scale-down
)
ml_client.compute.begin_create_or_update(my_compute).result()

## Create the job

In [None]:
from azure.ai.ml import command, Input, Output

target_column_name = 'Category'

# Create the job
job = command(
    description='Trains hepatitis c model',
    experiment_name='hepatitis_c_test',
    compute=compute_name,
    inputs=dict(training_data=Input(type='uri_file', path=f'{train_data_name}@latest'), 
                target_column_name=target_column_name),
    outputs=dict(model_output=Output(type=AssetTypes.MLFLOW_MODEL)),
    code='../src/',
    environment='azureml://registries/azureml/environments/responsibleai-ubuntu20.04-py38-cpu/versions/37',
    command='python train.py ' + 
            '--training_data ${{inputs.training_data}} ' +
            '--target_column_name ${{inputs.target_column_name}} ' +
            '--model_output ${{outputs.model_output}}'
)
job = ml_client.jobs.create_or_update(job)
ml_client.jobs.stream(job.name)
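
The job above calls `train.py` from `../src/`, which is not shown in this notebook. For the command string to work, that script has to accept the three flags passed to it. A hypothetical skeleton of its argument parsing, assuming nothing about the script beyond those flag names, could look like this:

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that the job's command string passes to train.py."""
    parser = argparse.ArgumentParser(description="Train the hepatitis C model")
    parser.add_argument("--training_data", type=str, required=True,
                        help="Path to the training parquet file")
    parser.add_argument("--target_column_name", type=str, required=True,
                        help="Name of the label column, e.g. Category")
    parser.add_argument("--model_output", type=str, required=True,
                        help="Folder where the trained model is written")
    return parser.parse_args(argv)


# Example invocation with placeholder paths; at run time Azure ML substitutes
# the ${{inputs...}} and ${{outputs...}} expressions with real locations.
args = parse_args([
    "--training_data", "train.parquet",
    "--target_column_name", "Category",
    "--model_output", "outputs/model",
])
print(args.training_data, args.target_column_name, args.model_output)
```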

## Register the model

In [16]:
from azure.ai.ml.entities import Model

model_name = 'hepatitis_c_model'

# Register the model.
model_path = f'azureml://jobs/{job.name}/outputs/model_output'
model = Model(
    name=model_name,
    path=model_path,
    type=AssetTypes.MLFLOW_MODEL,
)
registered_model = ml_client.models.create_or_update(model)