### Base de dados de medidas de bico do penguim: três classes, 333 pinguins (11 registros faltantes)
O artigo estuda as diferenças na aparência entre machos e fêmeas, entre outras coisas. É por isso que usaremos a classificação macho/fêmea com base nas medidas corporais (como os criadores do conjunto de dados fizeram em seu artigo).
<url>https://christophm.github.io/interpretable-ml-book/data.html#penguins</url>

Cada linha representa um pinguim e contém as seguintes informações:

Sexo do pinguim (macho/fêmea), que é o alvo da classificação (sex).
Espécie de pinguim, que pode ser de Chinstrap, Gentoo ou Adelie (species).
Massa corporal do pinguim, medida em gramas (body_mass_g).
Comprimento do bico, medido em milímetros (bill_length_mm).
Profundidade do bico, medida em milímetros (bill_depth_mm).
Comprimento da nadadeira (a “cauda”), medido em milímetros (flipper_length_mm).

<img src='img/lter_penguins.jpg'></img>
<img src='img/culmen_depth.jpg'></img>

#### Vamos ler o dataset (csv). Obtido de
<url>https://gist.github.com/slopp/ce3b90b9168f2f921784de84fa445651#file-penguins-csv</url>

In [1]:
import pandas as pd
df = pd.read_csv('datasets/penguins.csv')

In [2]:
df.head()

Unnamed: 0,rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,1,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,2,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,3,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,4,Adelie,Torgersen,,,,,,2007
4,5,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007


### Vemos que existem registros com dados faltantes. Para o propósito deste tutorial vamos contar quantas linhas tem dados faltantes e apagá-las

In [3]:
print('shape antes: ' + str(df.shape))
print(df.isnull().sum())
df.dropna(inplace=True)
print('shape depois: ' + str(df.shape))

shape antes: (344, 9)
rowid                 0
species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
year                  0
dtype: int64
shape depois: (333, 9)


In [4]:
df['bill_depth_mm'].describe()

count    333.000000
mean      17.164865
std        1.969235
min       13.100000
25%       15.600000
50%       17.300000
75%       18.700000
max       21.500000
Name: bill_depth_mm, dtype: float64

In [5]:
df['bill_length_mm'].describe()

count    333.000000
mean      43.992793
std        5.468668
min       32.100000
25%       39.500000
50%       44.500000
75%       48.600000
max       59.600000
Name: bill_length_mm, dtype: float64

In [6]:
df['flipper_length_mm'].describe()

count    333.000000
mean     200.966967
std       14.015765
min      172.000000
25%      190.000000
50%      197.000000
75%      213.000000
max      231.000000
Name: flipper_length_mm, dtype: float64

In [7]:
df['body_mass_g'].describe()

count     333.000000
mean     4207.057057
std       805.215802
min      2700.000000
25%      3550.000000
50%      4050.000000
75%      4775.000000
max      6300.000000
Name: body_mass_g, dtype: float64

### Criar categorias para cada feature contínua em cinco quartis

#### No tutorial constam os quartis:
<code>(13.1,14.7]</code>, ou seja, >13.1<=14.7
(14.7,16.3]
(16.3,18]
(18,19.6]
(19.6,21.2]

In [None]:
quartiles = 5

bill_length_mm_range = pd.qcut(df['bill_length_mm'], quartiles) 
bill_length_mm_cat = pd.qcut(df['bill_length_mm'], quartiles, labels=False)

bill_depth_mm_range = pd.qcut(df['bill_depth_mm'], quartiles)
bill_depth_mm_cat = pd.qcut(df['bill_depth_mm'], quartiles, labels=False)

flipper_length_mm_range = pd.qcut(df['flipper_length_mm'], quartiles)
flipper_length_mm_cat = pd.qcut(df['flipper_length_mm'], quartiles, labels=False)

body_mass_g_range = pd.qcut(df['body_mass_g'], quartiles)
body_mass_g_cat = pd.qcut(df['body_mass_g'], quartiles, labels=False)

In [9]:
df['bill_length_mm_range'] = bill_length_mm_range
print(bill_length_mm_range.value_counts(sort=False))
df['bill_length_mm_cat'] = bill_length_mm_cat

df['bill_depth_mm_range'] = bill_depth_mm_range
print(bill_depth_mm_range.value_counts(sort=False))
df['bill_depth_mm_cat'] = bill_depth_mm_cat

df['flipper_length_mm_range'] = bill_length_mm_range
print(flipper_length_mm_range.value_counts(sort=False))
df['flipper_length_mm_cat'] = bill_length_mm_cat

df['body_mass_g_range'] = body_mass_g_range
print(body_mass_g_range.value_counts(sort=False))
df['body_mass_g_cat'] = body_mass_g_cat

bill_length_mm
(32.099000000000004, 38.6]    69
(38.6, 42.0]                  65
(42.0, 46.1]                  68
(46.1, 49.5]                  66
(49.5, 59.6]                  65
Name: count, dtype: int64
bill_depth_mm
(13.099, 15.04]    67
(15.04, 16.8]      67
(16.8, 17.9]       67
(17.9, 18.9]       68
(18.9, 21.5]       64
Name: count, dtype: int64
flipper_length_mm
(171.999, 188.4]    67
(188.4, 194.0]      67
(194.0, 203.0]      70
(203.0, 215.0]      65
(215.0, 231.0]      64
Name: count, dtype: int64
body_mass_g
(2699.999, 3475.0]    68
(3475.0, 3800.0]      69
(3800.0, 4300.0]      64
(4300.0, 4990.0]      65
(4990.0, 6300.0]      67
Name: count, dtype: int64


In [10]:
df.head(3)

Unnamed: 0,rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year,bill_length_mm_range,bill_length_mm_cat,bill_depth_mm_range,bill_depth_mm_cat,flipper_length_mm_range,flipper_length_mm_cat,body_mass_g_range,body_mass_g_cat
0,1,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007,"(38.6, 42.0]",1,"(17.9, 18.9]",3,"(38.6, 42.0]",1,"(3475.0, 3800.0]",1
1,2,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007,"(38.6, 42.0]",1,"(16.8, 17.9]",2,"(38.6, 42.0]",1,"(3475.0, 3800.0]",1
2,3,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007,"(38.6, 42.0]",1,"(17.9, 18.9]",3,"(38.6, 42.0]",1,"(2699.999, 3475.0]",0


### Este é o dataframe processado e completo, mas para o algoritmo OneFeatureRule, vamos deixar apenas as colunas numéricas, pois a target é o sexo (y)

In [11]:
X = df.drop(columns=['rowid','species','island','bill_length_mm','bill_depth_mm','flipper_length_mm','body_mass_g', 'sex','year','bill_length_mm_range','bill_depth_mm_range','flipper_length_mm_range','body_mass_g_range'])

In [12]:
X

Unnamed: 0,bill_length_mm_cat,bill_depth_mm_cat,flipper_length_mm_cat,body_mass_g_cat
0,1,3,1,1
1,1,2,1,1
2,1,3,1,0
4,0,4,0,0
5,1,4,1,1
...,...,...,...,...
339,4,4,4,2
340,2,3,2,0
341,4,3,4,1
342,4,4,4,2


In [13]:
y = df['sex']
y

0        male
1      female
2      female
4      female
5        male
        ...  
339      male
340    female
341      male
342      male
343    female
Name: sex, Length: 333, dtype: object

In [14]:
from modules.OneFeatureRule import OneFeatureRule

clf = OneFeatureRule()
results = clf.fit(X, y)
print(results)
#arredondando acurácia para imprimir
print('%.2f' % results[0]['accuracy'])
print('%.2f' % results[1]['accuracy'])
print('%.2f' % results[2]['accuracy'])
print('%.2f' % results[3]['accuracy'])
print(clf)

[{'feature': 'bill_length_mm_cat', 'accuracy': 0.71, 'rules': {0: 'female', 1: 'male', 2: 'female', 3: 'female', 4: 'male'}}, {'feature': 'bill_depth_mm_cat', 'accuracy': 0.76, 'rules': {0: 'female', 1: 'male', 2: 'female', 3: 'male', 4: 'male'}}, {'feature': 'flipper_length_mm_cat', 'accuracy': 0.71, 'rules': {0: 'female', 1: 'male', 2: 'female', 3: 'female', 4: 'male'}}, {'feature': 'body_mass_g_cat', 'accuracy': 0.76, 'rules': {0: 'female', 1: 'female', 2: 'male', 3: 'female', 4: 'male'}}]
0.71
0.76
0.71
0.76
Melhor variavel de decisão para seus dados: body_mass_g_cat


### Neste caso, considerando duas casas decimais, a regra com peso corporal e a regra com a profundidade do bico são equivalentes

In [23]:
#as regras são por sua vez um dicionário com keys (features categóricas) e valores (coluna target)
rule_dict = results[clf.ideal_variable_index]['rules']

winner_rules = []
for k,v in rule_dict.items():
    winner_rules.append('if ' + clf.ideal_variable + ' in range: ' + str(k) + ' then sex == ' + v)

for i in range(len(winner_rules)):
    print(winner_rules[i])

if body_mass_g_cat in range: 0 then sex == female
if body_mass_g_cat in range: 1 then sex == female
if body_mass_g_cat in range: 2 then sex == male
if body_mass_g_cat in range: 3 then sex == female
if body_mass_g_cat in range: 4 then sex == male
