## Training 

The main task on this dataset is to predict, from a patient's attributes, whether that person has heart disease. A secondary, exploratory task is to examine the dataset for insights that help in understanding the problem.

#### Libraries and modules to use

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import pandas as pd
import mlflow

#### Data reading

In [7]:
import pandas as pd

file_url = "https://mlrawdata123.blob.core.windows.net/rawdata/raw_data.csv"
df = pd.read_csv(file_url)
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,1,145,233,1,2,150,0,2.3,3,0,fixed,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3,normal,1
2,67,1,4,120,229,0,2,129,1,2.6,2,2,reversible,0
3,37,1,3,130,250,0,0,187,0,3.5,3,0,normal,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0,normal,0


#### Data preprocessing

In [8]:
df["thal"] = df["thal"].astype("category").cat.codes
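As a small illustration of what the cell above does, the sketch below applies the same `astype("category").cat.codes` conversion to a toy series. The toy values are an assumption: they mirror the three `thal` levels visible in the `df.head()` output, and the full dataset may contain others.

```python
import pandas as pd

# Toy values (assumed): the three `thal` levels seen in df.head()
toy_thal = pd.Series(["fixed", "normal", "reversible", "normal"], name="thal")

# Categories are sorted, then mapped to integer codes 0, 1, 2, ...
toy_category = toy_thal.astype("category")
toy_codes = toy_category.cat.codes

# Keep the code -> label mapping so the integers can be decoded later
code_to_label = dict(enumerate(toy_category.cat.categories))

print(toy_codes.tolist())
print(code_to_label)
```

Integer codes impose an order on the levels (`fixed` < `normal` < `reversible` here), which a linear model will treat as a numeric scale.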

Train-test split

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("target", axis=1), df["target"], test_size=0.3
)
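The split above is unseeded, so each run yields a different partition and different metrics. A minimal sketch of a reproducible, class-balanced variant is shown below on a toy frame; the seed value and the use of `stratify` are assumptions for illustration, not settings taken from this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for df; only the `target` column name comes from the notebook
toy = pd.DataFrame({"age": range(20), "target": [0, 1] * 10})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop("target", axis=1),
    toy["target"],
    test_size=0.3,
    random_state=1234,       # assumed seed: the same split on every run
    stratify=toy["target"],  # keep the class ratio equal in train and test
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```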

Define numerical and categorical features

In [11]:
num_features = X_train.select_dtypes('number').columns.tolist()
cat_features = X_train.select_dtypes('object').columns.tolist()
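The sketch below shows how `select_dtypes` sorts columns into the two groups, using a toy frame whose column names are borrowed from the dataset. It also illustrates a likely side effect of the earlier `cat.codes` step: once `thal` is integer-coded it counts as numeric, so if it is the only string column (as the `df.head()` output suggests), `cat_features` would come out empty.

```python
import pandas as pd

# Toy frame (assumed values); `thal` is stored as plain Python strings
toy = pd.DataFrame({
    "age": [63, 67, 37],
    "oldpeak": [2.3, 1.5, 3.5],
    "thal": pd.Series(["fixed", "normal", "normal"], dtype="object"),
})

# Before encoding: `thal` is picked up as a categorical (object) column
num_before = toy.select_dtypes("number").columns.tolist()
cat_before = toy.select_dtypes("object").columns.tolist()

# After integer-encoding, `thal` moves into the numeric group
toy["thal"] = toy["thal"].astype("category").cat.codes
num_after = toy.select_dtypes("number").columns.tolist()
cat_after = toy.select_dtypes("object").columns.tolist()

print(num_before, cat_before)
print(num_after, cat_after)
```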

Define preprocessing pipeline

In [12]:
column_transformer = ColumnTransformer(
    [("OHE", OneHotEncoder(sparse_output=False), cat_features),
     ("scaler", MinMaxScaler(), num_features)
     ]
).set_output(transform="pandas")

### Modeling

Set experiment

In [13]:
mlflow.set_experiment(experiment_name="heart_disease")

<Experiment: artifact_location='', creation_time=1702953146080, experiment_id='a2190bbd-2482-476b-8b5f-527a4bbd78eb', last_update_time=None, lifecycle_stage='active', name='heart_disease', tags={}>

#### Logistic regression

In [17]:
y_train

243    0
201    0
26     0
262    0
158    1
      ..
71     0
211    1
120    0
3      0
241    1
Name: target, Length: 212, dtype: int64

In [56]:

# Chain preprocessing and the classifier so both are fit on the training data only
pipe = Pipeline(steps=[
    ("transformers", column_transformer),
    ("model", LogisticRegression(random_state=1234, penalty='l2'))
])

pipe.fit(X_train, y_train)

# Hard class predictions and the predicted probability of the positive class
y_pred = pipe.predict(X_test)
y_pred_prob = pipe.predict_proba(X_test)[:, 1]

accuracy = metrics.accuracy_score(y_true=y_test, y_pred=y_pred)
recall = metrics.recall_score(y_true=y_test, y_pred=y_pred)
precision = metrics.precision_score(y_true=y_test, y_pred=y_pred)
auc_score = metrics.roc_auc_score(y_true=y_test, y_score=y_pred_prob)

In [57]:
auc_score

0.8987878787878787
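Recall and precision answer different questions: recall is the share of actual positives that the model finds, while precision is the share of predicted positives that are correct. The toy labels below are made up purely to show that the two can diverge on the same predictions.

```python
from sklearn import metrics

# Made-up labels: 3 actual positives, 5 predicted positives
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 1, 1, 0, 0]

# True positives = 2, false negatives = 1, false positives = 3
toy_recall = metrics.recall_score(y_true=toy_true, y_pred=toy_pred)        # 2 / (2 + 1)
toy_precision = metrics.precision_score(y_true=toy_true, y_pred=toy_pred)  # 2 / (2 + 3)

print(f"recall={toy_recall:.3f} precision={toy_precision:.3f}")
```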