<img src="https://raw.githubusercontent.com/andre-marcos-perez/ebac-course-utils/main/media/logo/newebac_logo_black_half.png" alt="ebac-logo">

---

# **Module** | Data Analysis: Machine Learning Fundamentals
**Exercise** Notebook<br>
Professor [André Perez](https://www.linkedin.com/in/andremarcosperez/)

---

# **Topics**

<ol type="1">
  <li>Theory;</li>
  <li>Categorical attributes;</li>
  <li>Numerical attributes;</li>
  <li>Missing data.</li>
</ol>

---

# **Exercises**

## 1\. Penguins

In this exercise, we will use a dataset with information about penguins. The idea is to prepare the dataset to predict the penguin's species (response variable) from its physical and geographic characteristics (predictor variables).

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns

In [3]:
data = sns.load_dataset('penguins')

In [4]:
data.head()

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |


### **1.1. Null values**

The dataset has missing values; use the concepts from the lesson to handle them.

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


In [6]:
# answer to question 1.1: fraction of missing values per column
data.isna().sum() / len(data)

species              0.000000
island               0.000000
bill_length_mm       0.005814
bill_depth_mm        0.005814
flipper_length_mm    0.005814
body_mass_g          0.005814
sex                  0.031977
dtype: float64

In [7]:
# Drop the rows with NaN values in the 'bill_length_mm' or 'sex' columns
data = data.dropna(subset=['bill_length_mm', 'sex'])


In [8]:
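# Spline-interpolate any remaining gaps; assign back, since a chained `inplace=True` may not update `data`
data['bill_depth_mm'] = data['bill_depth_mm'].interpolate(method='spline', order=2)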

In [9]:
# Compute the mean of one column and the median of the other
mean_value = data['flipper_length_mm'].mean()
median_value = data['body_mass_g'].median()

# Replace NaN values with the mean and the median (assigned back instead of a chained `inplace=True`)
data['flipper_length_mm'] = data['flipper_length_mm'].fillna(mean_value)  # Replaces NaN with the mean
data['body_mass_g'] = data['body_mass_g'].fillna(median_value)  # Replaces NaN with the median


In [10]:
data.isna().sum()

species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64
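
Dropping rows is not the only option for this step. As a minimal sketch of imputation, assuming a small hypothetical frame `toy` in place of the penguins data, a numeric column can be filled with its median and a categorical column with its most frequent value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the penguins data (not the real dataset)
toy = pd.DataFrame({
    "flipper_length_mm": [181.0, np.nan, 195.0, 193.0],
    "sex": ["Male", "Female", None, "Female"],
})

# Numeric column: fill missing values with the median of the observed values
flipper_median = toy["flipper_length_mm"].median()
toy["flipper_length_mm"] = toy["flipper_length_mm"].fillna(flipper_median)

# Categorical column: fill missing values with the most frequent category (mode)
sex_mode = toy["sex"].mode()[0]
toy["sex"] = toy["sex"].fillna(sex_mode)

# Total number of missing values left in the frame
remaining_nulls = int(toy.isna().sum().sum())
print(remaining_nulls)
```

Whether to drop or impute depends on how many rows are affected; with under 4% of rows missing a value, as reported above, either choice is defensible.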

### **1.2. Numerical variables**

Identify the numerical variables and create a new column **standardizing** their values. The new column must have the same name as the original column plus the suffix "*_std*".

> **Note**: You must not transform the response variable.

In [11]:
# answer to question 1.2
data.head()

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |


In [12]:
data.dtypes

species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
dtype: object

In [13]:
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler object
scaler = StandardScaler()

# Fit and transform each column to the standard scale (the scaler is refit on every call)
data['bill_length_mm_std'] = scaler.fit_transform(data[['bill_length_mm']])
data['bill_depth_mm_std'] = scaler.fit_transform(data[['bill_depth_mm']])
data['flipper_length_mm_std'] = scaler.fit_transform(data[['flipper_length_mm']])
data['body_mass_g_std'] = scaler.fit_transform(data[['body_mass_g']])

In [14]:
data.dtypes

species                   object
island                    object
bill_length_mm           float64
bill_depth_mm            float64
flipper_length_mm        float64
body_mass_g              float64
sex                       object
bill_length_mm_std       float64
bill_depth_mm_std        float64
flipper_length_mm_std    float64
body_mass_g_std          float64
dtype: object
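
Standardization subtracts the column mean and divides by the column standard deviation. The sketch below writes that out by hand on a hypothetical frame `toy`, looping over the numeric columns so the "*_std*" suffix is added uniformly. It assumes the population standard deviation (`ddof=0`), which is the convention `StandardScaler` uses.

```python
import pandas as pd

# Hypothetical toy frame standing in for the penguins data (not the real dataset)
toy = pd.DataFrame({"body_mass_g": [3750.0, 3800.0, 3250.0, 3450.0]})

# Select the numeric columns once, before any new columns are added
numeric_cols = toy.select_dtypes(include="number").columns

for col in numeric_cols:
    mean = toy[col].mean()
    std = toy[col].std(ddof=0)  # population standard deviation
    toy[f"{col}_std"] = (toy[col] - mean) / std

print(toy)
```

Each standardized column ends up with mean 0 and standard deviation 1, which puts variables measured in millimeters and grams on a comparable scale.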

### **1.3. Categorical variables**

Identify the nominal and ordinal categorical variables and create a new column applying the correct conversion technique to their values. The new column must have the same name as the original column plus the suffix "*_nom*" or "*_ord*".

> **Note**: You must not transform the response variable.

In [15]:
# answer to question 1.3
data[['species', 'island', 'sex']].head()

|   | species | island | sex |
|---|---|---|---|
| 0 | Adelie | Torgersen | Male |
| 1 | Adelie | Torgersen | Female |
| 2 | Adelie | Torgersen | Female |
| 4 | Adelie | Torgersen | Female |
| 5 | Adelie | Torgersen | Male |


In [16]:
data['species'].unique()

array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)

In [17]:
data['island'].unique()

array(['Torgersen', 'Biscoe', 'Dream'], dtype=object)

In [18]:
data['species_nom'] = data['species'].map({'Adelie':1, 'Chinstrap':2, 'Gentoo':3})

In [19]:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data['sex_nom'] = le.fit_transform(data['sex'])

In [20]:
data.loc[data['island'] == 'Biscoe', 'island_nom'] = 0
data.loc[data['island'] == 'Dream', 'island_nom'] = 1
data.loc[data['island'] == 'Torgersen', 'island_nom'] = 2


In [21]:
data.head()

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | bill_length_mm_std | bill_depth_mm_std | flipper_length_mm_std | body_mass_g_std | species_nom | sex_nom | island_nom |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male | -0.896042 | 0.780732 | -1.426752 | -0.568475 | 1 | 1 | 2.0 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female | -0.822788 | 0.119584 | -1.069474 | -0.506286 | 1 | 0 | 2.0 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female | -0.67628 | 0.424729 | -0.426373 | -1.190361 | 1 | 0 | 2.0 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female | -1.335566 | 1.085877 | -0.569284 | -0.941606 | 1 | 0 | 2.0 |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male | -0.859415 | 1.747026 | -0.783651 | -0.692852 | 1 | 1 | 2.0 |
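
Nominal categories such as `island` have no natural order, so integer codes like 0, 1, 2 can suggest a ranking that does not exist. A common alternative is one-hot encoding, which creates one indicator column per category. The sketch below uses a hypothetical frame `toy`; the `island_nom` prefix is an assumption chosen to follow the exercise's suffix rule.

```python
import pandas as pd

# Hypothetical toy frame standing in for the penguins data (not the real dataset)
toy = pd.DataFrame({"island": ["Torgersen", "Biscoe", "Dream", "Biscoe"]})

# One indicator column per category, named island_nom_<category>
dummies = pd.get_dummies(toy["island"], prefix="island_nom", dtype=int)

# Append the indicator columns to the original frame
toy = pd.concat([toy, dummies], axis=1)

print(toy)
```

Each row has exactly one indicator set to 1, so no category is treated as larger or smaller than another.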


### **1.4. Cleanup**

Drop the original columns and keep only the response variable and the predictor variables with the suffixes "*_std*", "*_nom*", and "*_ord*".

In [22]:
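# answer to question 1.4
# Keep the response variable `species`; drop the original predictors and `species_nom`, which encodes the response
df = data.drop(columns=['species_nom', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex'])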
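
Instead of listing every column to drop, the columns to keep can be selected by suffix. This is a minimal sketch on a hypothetical frame `toy`, assuming `species` is the response variable and that every processed predictor ends in `_std`, `_nom`, or `_ord`:

```python
import pandas as pd

# Hypothetical toy frame standing in for the processed penguins data
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "island": ["Torgersen", "Biscoe"],
    "body_mass_g": [3750.0, 5000.0],
    "body_mass_g_std": [-1.0, 1.0],
    "island_nom": [2, 0],
})

# Columns whose names end in one of the three suffixes
suffix_cols = toy.filter(regex=r"_(std|nom|ord)$").columns.tolist()

# Response variable first, then the processed predictors
clean = toy[["species"] + suffix_cols]

print(clean.columns.tolist())
```

Selecting by suffix keeps working if more processed columns are added later, whereas an explicit drop list has to be updated by hand.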

---