Recall the titanic.csv dataset, which contains details about the passengers and survivors of the sinking of the legendary ship. Using H2O, train 2 ML models with an algorithm of your choice.

| Variable Name | Description                                 |
|---------------|---------------------------------------------|
| survived      | Survived (1) or died (0)                    |
| pclass        | Passenger's class                           |
| name          | Passenger's name                            |
| sex           | Passenger's sex                             |
| age           | Passenger's age                             |
| sibsp         | Number of siblings/spouses aboard           |
| parch         | Number of parents/children aboard           |
| ticket        | Ticket number                               |
| fare          | Fare                                        |
| cabin         | Cabin                                       |
| embarked      | Port of embarkation                         |


In [1]:
# pip install h2o
import h2o
from h2o.estimators import H2ORandomForestEstimator, H2OGeneralizedLinearEstimator

In [4]:
# Initialize the H2O server.
# This sets up a local H2O environment for running ML models.
h2o.init()

# Load the dataset
data = h2o.import_file('titanic.csv', header=1)

data

Checking whether there is an H2O instance running at http://localhost:54321. connected.
Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html


H2O_cluster_uptime:,4 hours 16 mins
H2O_cluster_timezone:,Europe/Lisbon
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.46.0.6
H2O_cluster_version_age:,4 months and 23 days
H2O_cluster_name:,H2O_from_python_Alice_Dias_1vxncm
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3.914 Gb
H2O_cluster_total_cores:,12
H2O_cluster_allowed_cores:,12


Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803.0,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450.0,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877.0,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463.0,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909.0,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742.0,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736.0,30.0708,,C


In [7]:
# Convert the 'Sex' and 'Embarked' columns to factors, since they are categorical
data['Sex'] = data['Sex'].asfactor()
data['Embarked'] = data['Embarked'].asfactor()
data['Survived'] = data['Survived'].asfactor()  # Ensure 'Survived' is treated as a factor for classification


# Split the data into training and test sets
train, test = data.split_frame(ratios=[0.8], seed=42)
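
The 0.8 ratio is a target rather than an exact count: as far as I know, H2O's `split_frame` assigns rows probabilistically, so the test frame ends up with roughly, not exactly, 20% of the rows. The plain-Python sketch below only illustrates that idea. `probabilistic_split` is a hypothetical helper, not H2O's implementation, and the 1000 row ids are made up.

```python
import random


def probabilistic_split(rows, ratio, seed):
    """Send each row to the first part with probability `ratio`.

    Illustrative only: a hypothetical helper, not H2O's actual algorithm.
    """
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < ratio else second).append(row)
    return first, second


rows = list(range(1000))  # made-up row ids, not the Titanic data
train_rows, test_rows = probabilistic_split(rows, ratio=0.8, seed=42)

# The two parts always cover every row, but their sizes only approximate 80/20.
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split reproducible, which is what `seed=42` does in the cell above.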

For the first model, treat the problem as a regression problem.

In [None]:
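# Configure and train the regression model.
# family='binomial' fits a logistic regression, since 'Survived' was converted to a factor above.
# columns[1:-1] skips the first column (PassengerId) and the last one (Embarked).
glm_regressao = H2OGeneralizedLinearEstimator(family='binomial')
glm_regressao.train(x=train.columns[1:-1], y='Survived', training_frame=train)

# Evaluate the model's performance on the test set
regression_performance = glm_regressao.model_performance(test_data=test)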
print(regression_performance)



glm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
ModelMetricsBinomialGLM: glm
** Reported on test data. **

MSE: 0.13814580582639124
RMSE: 0.371679708655707
LogLoss: 0.44664868290134696
AUC: 0.8806899004267426
AUCPR: 0.8796336709819358
Gini: 0.7613798008534851
Null degrees of freedom: 186
Residual degrees of freedom: 179
Null deviance: 253.29106816471833
Residual deviance: 167.04660740510383
AIC: 183.04660740510383

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.4139980714151632
       0    1    Error    Rate
-----  ---  ---  -------  ------------
0      93   18   0.1622   (18.0/111.0)
1      12   64   0.1579   (12.0/76.0)
Total  105  82   0.1604   (30.0/187.0)

Maximum Metrics: Maximum metrics at their respective thresholds
metric                       threshold    value     idx
---------------------------  -----------  --------  -----
max f1                       0.413998     0.810127  79
max f2                       0.353534     
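
The `max f1` value and the error rates in the report follow directly from the confusion matrix. A minimal sketch in plain Python, using the four counts printed above (the variable names are mine, not H2O's):

```python
# Counts copied from the GLM confusion matrix above (rows = actual, columns = predicted).
tn, fp = 93, 18   # actual 0: predicted 0 / predicted 1
fn, tp = 12, 64   # actual 1: predicted 0 / predicted 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

error_class_0 = fp / (tn + fp)   # non-survivors predicted as survivors
error_class_1 = fn / (fn + tp)   # survivors predicted as non-survivors
error_total = (fp + fn) / (tn + fp + fn + tp)

print(round(f1, 6))             # 0.810127
print(round(error_class_0, 4))  # 0.1622
print(round(error_class_1, 4))  # 0.1579
print(round(error_total, 4))    # 0.1604
```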

For the second model, treat the problem as a classification problem.

In [10]:
# Configure and train the classification model
drf_classificacao = H2ORandomForestEstimator()
drf_classificacao.train(x=train.columns[1:-1], y='Survived', training_frame=train)

# Evaluate the model's performance on the test set
classification_performance = drf_classificacao.model_performance(test_data=test)
print(classification_performance)

drf Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
ModelMetricsBinomial: drf
** Reported on test data. **

MSE: 0.12336165192172423
RMSE: 0.35122877433622124
LogLoss: 0.4010828769229657
Mean Per-Class Error: 0.15238264580369842
AUC: 0.8898174490279753
AUCPR: 0.8844342937662759
Gini: 0.7796348980559507

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.5562165600061417
       0    1    Error    Rate
-----  ---  ---  -------  ------------
0      102  9    0.0811   (9.0/111.0)
1      17   59   0.2237   (17.0/76.0)
Total  119  68   0.139    (26.0/187.0)

Maximum Metrics: Maximum metrics at their respective thresholds
metric                       threshold    value     idx
---------------------------  -----------  --------  -----
max f1                       0.556217     0.819444  59
max f2                       0.134738     0.833333  112
max f0point5                 0.708493     0.856164  45
max accuracy                 0.581883     0.860963
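
The `Mean Per-Class Error` line is the unweighted average of the two per-class error rates in the confusion matrix, which makes it less sensitive to the 111-versus-76 class imbalance than the overall error. A small check in plain Python, with counts copied from the DRF matrix above (variable names are illustrative):

```python
# Counts copied from the DRF confusion matrix above (rows = actual, columns = predicted).
tn, fp = 102, 9   # actual 0: predicted 0 / predicted 1
fn, tp = 17, 59   # actual 1: predicted 0 / predicted 1

error_class_0 = fp / (tn + fp)
error_class_1 = fn / (fn + tp)

# Average the two class error rates without weighting by class size.
mean_per_class_error = (error_class_0 + error_class_1) / 2
overall_error = (fp + fn) / (tn + fp + fn + tp)

print(round(mean_per_class_error, 6))  # 0.152383
print(round(overall_error, 4))         # 0.139
```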

For the case at hand, which of the models is more useful, and which is better in terms of predictive performance?

In [11]:
print("Regression performance:")
print(regression_performance.auc())

print("Classification performance:")
print(classification_performance.auc())

Regression performance:
0.8806899004267426
Classification performance:
0.8898174490279753
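
Both reports are computed on the same test frame, so the two AUCs can be compared directly. The Gini values in the reports are a linear rescaling of AUC (Gini = 2·AUC − 1). A minimal sketch with the two AUCs printed above; the variable names are illustrative:

```python
# AUC values copied from the output above.
auc_glm = 0.8806899004267426
auc_drf = 0.8898174490279753

# Gini coefficient: a linear rescaling of AUC.
gini_glm = 2 * auc_glm - 1
gini_drf = 2 * auc_drf - 1

# Difference between the two models on the same test frame.
auc_gap = auc_drf - auc_glm

print(round(gini_glm, 6))  # 0.76138
print(round(gini_drf, 6))  # 0.779635
print(round(auc_gap, 4))   # 0.0091
```

The random forest is ahead by about 0.009 AUC on 187 test rows, so the gap is small. At its max-F1 threshold it also makes fewer errors overall (26 versus 30), although it misses more actual survivors (17 versus 12).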
