#### Marina Borges
São Paulo, Brazil  
github.com/inaborges

# Using Hermione ML to clean and encode the values of your dataset

![title](tenor.gif)

First of all, you need to open the Jupyter Lab terminal:

![title](terminal.png)

Then you need to install hermione using:

* pip install hermione-ml

You can check the installed Hermione version:

In [1]:
!hermione info


 _                         _                  
| |__   ___ _ __ _ __ ___ (_) ___  _ __   ___ 
| '_ \ / _ \ '__| '_ ` _ \| |/ _ \| '_ \ / _ \
| | | |  __/ |  | | | | | | | (_) | | | |  __/
|_| |_|\___|_|  |_| |_| |_|_|\___/|_| |_|\___|
v0.2



To create a new project, you can type:

* hermione new project_name

After that, choose yes [y] or no [n] to decide whether the project starts with an implemented example.

Next, we will analyze and clean the data using Hermione's modules.

In [2]:
import sys
sys.path.append("src/ml../../")

In [3]:
import pandas as pd

from ml.data_source.spreadsheet import Spreadsheet
from ml.preprocessing.preprocessing import Preprocessing
from ml.model.trainer import TrainerSklearn

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [5]:
df = Spreadsheet().get_data('data/raw/train.csv')

In [6]:
df.columns

Index(['Survived', 'Pclass', 'Sex', 'Age'], dtype='object')

In [7]:
p = Preprocessing()

In [8]:
df = p.clean_data(df)
df = p.categ_encoding(df)

Cleaning data
Category encoding


In [9]:
df.head()

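|   | Survived | Age | Pclass_1 | Pclass_2 | Pclass_3 | Sex_female | Sex_male |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 22.0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 1 | 38.0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 1 | 26.0 | 0 | 0 | 1 | 1 | 0 |
| 3 | 1 | 35.0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 0 | 35.0 | 0 | 0 | 1 | 0 | 1 |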
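To make the encoding step less of a black box, here is a minimal sketch of one-hot encoding in plain Python. It is only an illustration of the idea behind the `Pclass_*` and `Sex_*` columns, not Hermione's actual `categ_encoding` implementation. The helper name `one_hot` is hypothetical, and the raw rows are the values implied by the first three encoded rows in the output above.

```python
def one_hot(rows, categorical):
    """One-hot encode the given categorical keys in a list of dict rows."""
    # Collect the sorted distinct values (levels) of each categorical column.
    levels = {col: sorted({row[col] for row in rows}) for col in categorical}
    encoded = []
    for row in rows:
        # Keep the non-categorical values unchanged.
        out = {k: v for k, v in row.items() if k not in categorical}
        # Add one 0/1 indicator per level of each categorical column.
        for col in categorical:
            for level in levels[col]:
                out[f"{col}_{level}"] = int(row[col] == level)
        encoded.append(out)
    return encoded


# Raw values implied by the first three encoded rows shown above.
rows = [
    {"Survived": 0, "Age": 22.0, "Pclass": 3, "Sex": "male"},
    {"Survived": 1, "Age": 38.0, "Pclass": 1, "Sex": "female"},
    {"Survived": 1, "Age": 26.0, "Pclass": 3, "Sex": "female"},
]
encoded_rows = one_hot(rows, ["Pclass", "Sex"])
print(encoded_rows[0])
```

Because these three rows contain no second-class passenger, the sketch produces no `Pclass_2` column. An indicator column only exists for values that actually appear in the data, which is why new data sometimes has to be padded with missing columns before prediction.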


In [10]:
X = df.drop(columns=["Survived"])
y = df["Survived"]

In [11]:
# Use the same random state that is passed to TrainerSklearn().train()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((571, 6), (143, 6), (571,), (143,))
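The shapes above follow from simple arithmetic. The sketch below assumes the dataset has 714 rows (571 + 143, taken from the output) and that scikit-learn rounds a fractional `test_size` up when no `train_size` is given. The helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    # The test set size is rounded up; the training set gets the rest.
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(714, 0.2)
print(n_train, n_test)  # 571 143
```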

In [12]:
rf = TrainerSklearn().train(
    X, y,
    classification=True,
    algorithm=RandomForestClassifier,
    preprocessing=p,
    data_split=('train_test', {'test_size': .2}),
    random_state=123,
)



In [13]:
rf.get_metrics()

{'accuracy': 0.8321678321678322,
 'f1': 0.7818181818181819,
 'precision': 0.7818181818181819,
 'recall': 0.7818181818181819,
 'roc_auc': 0.8448347107438016}
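As a sanity check on these numbers, the sketch below recomputes accuracy, precision, recall, and F1 from confusion-matrix counts. The counts (43 true positives, 12 false positives, 12 false negatives, 76 true negatives) do not appear in the notebook. They are inferred here as the counts consistent with the reported metrics on 143 test rows. `roc_auc` cannot be recovered from counts alone, so it is left out.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute standard binary classification metrics from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "f1": f1,
            "precision": precision, "recall": recall}


# Inferred counts (an assumption, not notebook output): 43 + 12 + 12 + 76 = 143.
metrics = binary_metrics(tp=43, fp=12, fn=12, tn=76)
print({name: round(value, 4) for name, value in metrics.items()})
```

Precision, recall, and F1 all coincide here because the number of false positives equals the number of false negatives.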

In [14]:
rf.get_columns()

['Age', 'Pclass_1', 'Pclass_2', 'Pclass_3', 'Sex_female', 'Sex_male']

In [15]:
rf.predict_proba(X_test, binary=True)

array([1.        , 0.21666667, 0.15833333, 0.9       , 0.6       ,
       0.        , 0.        , 0.        , 0.14285714, 1.        ,
       0.79      , 0.04      , 0.04      , 1.        , 1.        ,
       0.48      , 0.        , 0.53719697, 1.        , 0.        ,
       0.36666667, 1.        , 0.58      , 1.        , 0.45666667,
       1.        , 0.        , 1.        , 0.        , 0.19492965,
       0.        , 0.        , 1.        , 0.0594246 , 0.        ,
       1.        , 0.53719697, 0.12801088, 0.52833333, 0.        ,
       0.        , 0.        , 0.24675325, 0.        , 1.        ,
       0.2       , 0.325     , 1.        , 1.        , 0.31      ,
       0.        , 0.4       , 0.        , 0.1       , 1.        ,
       0.53719697, 1.        , 0.        , 0.04      , 0.        ,
       1.        , 1.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 1.        , 0.        , 0.53719697,
       0.        , 0.1       , 0.1       , 0.45333333, 0.76666
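These values are probabilities of the positive class (`Survived = 1`). To turn them into class labels you need a cutoff. The sketch below uses 0.5, which is an assumption for illustration and not necessarily what the trained model's `predict` uses. The helper name `to_labels` is hypothetical, and the inputs are the first five probabilities from the output above.

```python
def to_labels(probabilities, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels using a cutoff."""
    return [int(prob >= threshold) for prob in probabilities]


# First five probabilities from the predict_proba output above.
first_probs = [1.0, 0.21666667, 0.15833333, 0.9, 0.6]
labels = to_labels(first_probs)
print(labels)  # [1, 0, 0, 1, 1]
```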

In [16]:
# Predicting new data
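def predict_new(X, model, probs=True):
    # Apply the same cleaning and encoding used on the training data
    # (relies on the Preprocessing instance `p` created above).
    X = p.clean_data(X)
    X = p.categ_encoding(X)

    # Add any dummy column the model saw in training but that is missing here,
    # then put the columns in the same order as in training.
    columns = model.get_columns()
    for col in columns:
        if col not in X.columns:
            X[col] = 0
    X = X[columns]
    print(X)
    if probs:
        return model.predict_proba(X)
    return model.predict(X)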

In [17]:
new_data = pd.DataFrame({
    'Pclass':3,
    'Sex': 'male',
    'Age':4
}, index=[0])

new_data

|   | Pclass | Sex | Age |
|---|---|---|---|
| 0 | 3 | male | 4 |
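A single new row like this one only yields the indicator columns for the values it contains, so it has to be aligned with the columns the model was trained on. The sketch below shows that alignment step in plain Python. The helper name `align_columns` is hypothetical, and the encoded row is what one-hot encoding this passenger would be expected to produce (an assumption, not output from the notebook).

```python
def align_columns(row, columns, fill_value=0):
    """Return the row restricted to `columns`, in that order, filling gaps."""
    return {col: row.get(col, fill_value) for col in columns}


# Column list reported by rf.get_columns() earlier in the notebook.
model_columns = ['Age', 'Pclass_1', 'Pclass_2', 'Pclass_3',
                 'Sex_female', 'Sex_male']

# Assumed encoding of the new passenger: third class, male, age 4.
encoded_new = {"Pclass_3": 1, "Sex_male": 1, "Age": 4}

aligned = align_columns(encoded_new, model_columns)
print(aligned)
```

With a pandas DataFrame, `X.reindex(columns=model_columns, fill_value=0)` does the same padding and reordering in one call.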

