# More accurate ML example

Here we work with a dataset from [Kaggle](https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset?resource=download). Our goal is to predict whether a person has diabetes or not.

I won't try to create a perfect model; I will just show how we can use easydags for ML tasks. Let's suppose that we have a classification task with a huge class imbalance. Usually our first question is how to handle it:

1. Do nothing with the data?
2. Undersample?
3. Oversample?
4. Apply SMOTE?
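
To make options 2 and 3 concrete, here is a minimal standard-library sketch of what random undersampling and random oversampling do to the class counts. It is not imblearn's implementation: the helper names and the toy 90/10 label list are made up for illustration, and both helpers balance the classes fully (the tutorial's `RandomOverSampler(sampling_strategy=0.5)` only brings the minority up to half the majority). SMOTE is different again, since it synthesizes new minority rows by interpolating between neighbours instead of repeating existing ones.

```python
import random
from collections import Counter


def group_by_class(labels):
    """Map each class label to the list of row indices that carry it."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    return by_class


def random_undersample(labels, seed=0):
    """Keep a random subset of every class, sized to the smallest class."""
    rng = random.Random(seed)
    by_class = group_by_class(labels)
    target = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, target))
    return sorted(kept)


def random_oversample(labels, seed=0):
    """Repeat random rows of every class until it matches the largest class."""
    rng = random.Random(seed)
    by_class = group_by_class(labels)
    target = max(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        kept.extend(idxs)
        kept.extend(rng.choices(idxs, k=target - len(idxs)))
    return sorted(kept)


# Toy, hypothetical labels: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10

under_counts = Counter(labels[i] for i in random_undersample(labels))
over_counts = Counter(labels[i] for i in random_oversample(labels))

print("under:", dict(under_counts))
print("over:", dict(over_counts))
```

Undersampling throws data away (20 rows remain out of 100), while oversampling keeps everything and grows the training set (180 rows), which is one reason the four options have different training times.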

The first part of this notebook does this task the "usual" way, one option after another. But the four options are independent, so we can train them in parallel and get the result faster.


## Prepare the functions that we need



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE

def base_rf(X_train,y_train, name = 'base'):
    # Train a random forest with a grid search (scored on recall) and tag the result with a name

    rf = RandomForestClassifier()
    
    params = {
        'criterion': ['gini', 'entropy', 'log_loss'],
        'n_estimators': [10, 30, 100]
    }
    
    clf = GridSearchCV(rf, param_grid=params, scoring='recall')
    clf.fit(X_train, y_train.values.ravel())

    clf.name = name

    return clf
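
To get a feeling for how much work each call to `base_rf` does, the sketch below counts the model fits behind one grid search. It assumes the scikit-learn defaults of 5-fold cross-validation (when `cv` is not passed) and `refit=True`; the variable names are only for illustration.

```python
from itertools import product

# Same grid as in base_rf
params = {
    "criterion": ["gini", "entropy", "log_loss"],
    "n_estimators": [10, 30, 100],
}

cv_folds = 5  # assumed: GridSearchCV default when cv is not given
combos = list(product(*params.values()))  # every (criterion, n_estimators) pair

# One fit per combination and fold, plus one final refit on all the training data
fits_per_search = len(combos) * cv_folds + 1

# base, under, over and smote each run their own grid search
total_fits = fits_per_search * 4

print(len(combos), fits_per_search, total_fits)
```

Under those assumptions each call trains 46 forests, so the four options together train 184, which is why running them one after another takes minutes.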



# Usual way


In [2]:
%%time

import time
t = time.time()


data = pd.read_csv('data/diabetes_prediction_dataset.csv')
df = pd.get_dummies(data)
features = df[[c for c in df.columns if c not in ['diabetes','gender_Other', 'smoking_history_No Info']]]
target = df[['diabetes']]
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.15, random_state=40, stratify=target)

# base
base = base_rf(X_train,y_train)

# under
rus = RandomUnderSampler(random_state=42)

X_rus, y_rus = rus.fit_resample(X_train, y_train)

under = base_rf(X_rus,y_rus, 'under')


#over
ros = RandomOverSampler(sampling_strategy=0.5)

X_ros, y_ros = ros.fit_resample(X_train, y_train )


over = base_rf(X_ros,y_ros,'over')


#Smote

smote = SMOTE()

X_smote, y_smote = smote.fit_resample(X_train, y_train)


smote = base_rf(X_smote,y_smote, 'smote')


# metrics

from sklearn import metrics
metrics_list = []
for model in [base,smote, over,under]:
    
    y_pred = model.predict(X_test)
    
    aux = pd.DataFrame([metrics.f1_score(y_test, y_pred),
    metrics.recall_score(y_test, y_pred),
    metrics.precision_score(y_test, y_pred),
    metrics.accuracy_score(y_test, y_pred),
    model.name])
    metrics_list.append(aux.transpose())

results = pd.concat(metrics_list)  # named `results` so the sklearn `metrics` module is not shadowed
print(results)



print(f'time: {int(time.time() - t)} seconds')


          0         1         2         3      4
0  0.799641  0.698039  0.935857  0.970267   base
0  0.766347  0.721569  0.817052    0.9626  smote
0  0.783864  0.723922   0.85463  0.966067   over
0  0.596203  0.911373  0.443004  0.895067  under
time: 283 seconds
CPU times: user 4min 41s, sys: 1.33 s, total: 4min 42s
Wall time: 4min 43s

Columns 0 to 4 of the table above are, in order: F1 score, recall, precision, accuracy, and model name.


# As a dag


We will need these nodes:

- pre_pro (preprocessing)
- base
- under
- over
- smote
- final metrics

The steps to build and run the DAG are the following:

1. Define the function that each node will run (this is the usual first step before defining a DAG).
2. Create the nodes (note that in this example we do not declare the dependencies when creating them).
3. Define the dependencies using `>>` (that's the hard dependency operator).
4. Create the list of nodes from all the ExecNodes available in the environment. If you do not want to include every created node, create the list yourself as usual.
5. Create the DAG with the list of nodes.
6. Run the DAG.
7. Check the HTML output in an IFrame.
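
Before looking at the easydags code, it can help to see the shape of the graph on its own. The sketch below is not easydags: it uses `graphlib` from the Python standard library (3.9+) to show which nodes become runnable together, given the dependencies of this tutorial. The dictionary layout and the `waves` name are just for illustration.

```python
from graphlib import TopologicalSorter

# node -> set of nodes it depends on (same structure as the tutorial's DAG)
dependencies = {
    "pre_pro": set(),
    "base": {"pre_pro"},
    "under": {"pre_pro"},
    "over": {"pre_pro"},
    "smote": {"pre_pro"},
    "metrics": {"pre_pro", "base", "under", "over", "smote"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

# Collect the nodes in "waves": everything in a wave can run in parallel
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # nodes whose dependencies are all done
    waves.append(ready)
    sorter.done(*ready)

for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

The middle wave holds the four models, which is exactly the part that a DAG runner can execute concurrently.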


In [3]:
%%time
from easydags import  ExecNode, DAG, search_nodes
from IPython.display import IFrame

import time
t = time.time()


# defining the functions
def pre_pro():
    df = pd.get_dummies(data)
    features = df[[c for c in df.columns if c not in ['diabetes','gender_Other', 'smoking_history_No Info']]]
    target = df[['diabetes']]
    X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.15, random_state=40, stratify=target)

    return (X_train, X_test, y_train, y_test)



def model_base (**kwargs):
    X = kwargs['data'][0]
    y = kwargs['data'][2]
    base = base_rf(X,y)
    return base

def model_under (**kwargs):
    X = kwargs['data'][0]
    y = kwargs['data'][2]
    rus = RandomUnderSampler(random_state=42)

    X_rus, y_rus = rus.fit_resample(X, y)
    
    under = base_rf(X_rus,y_rus, 'under')
    
    return under

def model_over (**kwargs):
    X = kwargs['data'][0]
    y = kwargs['data'][2]
    ros = RandomOverSampler(sampling_strategy=0.5)

    X_ros, y_ros = ros.fit_resample(X, y )
    
    over = base_rf(X_ros,y_ros,'over')
    
    return over

def model_smote (**kwargs):
    X = kwargs['data'][0]
    y = kwargs['data'][2]
    smote = SMOTE()

    X_smote, y_smote = smote.fit_resample(X, y)
     
    smote = base_rf(X_smote,y_smote, 'smote')
    
    return smote

def models_metrics (**kwargs):
    X_test = kwargs['data'][1]
    y_test = kwargs['data'][3]

    base = kwargs['base']
    smote = kwargs['smote']
    under = kwargs['under']
    over = kwargs['over']
    

    from sklearn import metrics
    metrics_list = []
    for model in [base,smote, over,under]:
        
        y_pred = model.predict(X_test)
        
        aux = pd.DataFrame([metrics.f1_score(y_test, y_pred),
        metrics.recall_score(y_test, y_pred),
        metrics.precision_score(y_test, y_pred),
        metrics.accuracy_score(y_test, y_pred),
        model.name])
        metrics_list.append(aux.transpose())
    
    metrics = pd.concat(metrics_list)

    return metrics




# defining the nodes
node_prepro = ExecNode('pre_pro', output_name = 'data',exec_function = pre_pro)

node_metrics = ExecNode('metrics',exec_function = models_metrics)

node_base = ExecNode('base', output_name = 'base',exec_function = model_base)

node_under = ExecNode('under', output_name = 'under',exec_function = model_under)

node_over = ExecNode('over', output_name = 'over',exec_function = model_over)

node_smote = ExecNode('smote', output_name = 'smote',exec_function = model_smote)


# Define dependencies
node_prepro >> node_base >> node_metrics

node_prepro >> node_under >> node_metrics

node_prepro >> node_over >> node_metrics

node_prepro >> node_smote >> node_metrics

node_prepro  >> node_metrics

# Creating the list of nodes... you can also do it by yourself! 
#nodes = [node_prepro, node_base,node_under,node_over,node_smote,node_metrics] 
nodes = [] 
globs = globals().copy()
for obj_name in globs:         
    if isinstance(globs[obj_name], ExecNode):
        nodes.append(globs[obj_name])


dag = DAG(nodes,name = 'Real ML imbalance and grid search',max_concurrency=3, debug = False, error_type_fatal= False)

dag.execute()
    

print(f'time: {int(time.time() - t)} seconds')


IFrame(src=f"{dag.name}_states_run.html", width='100%', height=600)





2023-06-15 11:16:20.815 | INFO     | easydags.node:execute:146 - Start executing pre_pro at 2023-06-15, 11:16:20
2023-06-15 11:16:21.073 | INFO     | easydags.node:execute:146 - Start executing smote at 2023-06-15, 11:16:21
2023-06-15 11:16:21.073 | INFO     | easydags.node:execute:146 - Start executing over at 2023-06-15, 11:16:21
2023-06-15 11:16:21.073 | INFO     | easydags.node:execute:146 - Start executing base at 2023-06-15, 11:16:21
2023-06-15 11:17:37.426 | INFO     | easydags.node:execute:146 - Start executing under at 2023-06-15, 11:17:37
2023-06-15 11:19:02.651 | INFO     | easydags.node:execute:146 - Start executing metrics at 2023-06-15, 11:19:02


drawing
time: 162 seconds
CPU times: user 6min 2s, sys: 1.05 s, total: 6min 3s
Wall time: 2min 42s


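## You can inspect the time in each node to understand why this is roughly 1.7x faster (162 vs. 283 seconds) and not 4x... but hey! That is still a good performance gain!

The reason is basically that not all the inner nodes have the same computation time. Also, with `max_concurrency=3` only three of the four models train at once: the log above shows `under` starting at 11:17:37, more than a minute after the other three.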
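
The effect can be sketched with a tiny scheduling simulation. The durations below are hypothetical (they are not measured per-node timings from this run), and the model ignores thread overhead and CPU contention, so take it only as an illustration of why unequal nodes cap the speedup.

```python
import heapq


def makespan(durations, workers):
    """Total wall time when tasks are handed, in order, to the first free worker."""
    free_at = [0] * workers  # time at which each worker becomes free
    heapq.heapify(free_at)
    finish = 0
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        end = start + duration
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


# Hypothetical training times in seconds for four model nodes
durations = [76, 120, 161, 20]

sequential = makespan(durations, workers=1)
parallel = makespan(durations, workers=3)
speedup = sequential / parallel

print(sequential, parallel, round(speedup, 2))
```

Even with one worker per node the wall time could not drop below the slowest node, so a 4x speedup would only be possible if all four nodes took the same time.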


# Final DAG

If you run this tutorial you will get the DAG HTML yourself. Here I add a PNG version so you can check it out without running the tutorial:

![Final DAG](https://raw.githubusercontent.com/magralo/easydags/main/resource_readme/dag_tut_ml_imb.png)
              
