# Preprocessing - Lesson outline:

## 1 - Data Cleaning
### (i) Pandas

* Removing Duplicate Data: \\
A dataset may contain repeated rows. \\
Pandas provides the .drop_duplicates() method to remove
these duplicates.
* Handling Missing Values: \\
Incomplete data (NaN) distorts analyses. \\
Pandas provides the .fillna() and .dropna() methods
to deal with missing values.
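
As a minimal sketch of the three methods listed above, the snippet below applies them to a small made-up table. The `toy` DataFrame and its column names are assumptions for illustration only and are not part of the lesson's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 1 and 2 are exact duplicates,
# row 3 lacks a radius and row 4 lacks a label
toy = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "radius": [10.0, 12.0, 12.0, np.nan, 14.0],
    "label": ["benign", "malignant", "malignant", "benign", None],
})

# Remove exact duplicate rows, keeping the first occurrence
deduped = toy.drop_duplicates()

# Replace the missing radius with the mean of the observed radii
filled = deduped.fillna({"radius": deduped["radius"].mean()})

# Drop the rows whose label is missing
dropped = deduped.dropna(subset=["label"])
```

None of these calls modifies `toy`; each one returns a new DataFrame, which keeps the original available for comparison.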

### (ii) Using the SimpleImputer Class
* The SimpleImputer class belongs to the sklearn.impute module.

## 2 - Data Normalization

Rescaling: a minimum and a maximum value define a new interval \\
that will contain the values of an attribute (for example, [0, 1]). \\
For the interval [0, 1], the rescaled value of attribute $j$ of object $x_i$ can be computed as: \\

$$ x'_{ij} = \frac{x_{ij} - \mathrm{min}_j}{\mathrm{max}_j - \mathrm{min}_j} $$

where $\mathrm{min}_j$ and $\mathrm{max}_j$ are the smallest and largest values of attribute $j$ in the dataset.


Standardization: the continuous attributes may still span different intervals, \\
but they must share the same values for a measure of location and a measure of spread, \\
namely the mean and the standard deviation. After the transformation, every attribute has mean 0 and standard deviation 1: \\

$$ x'_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j} $$

where $\bar{x}_j$ and $\sigma_j$ are the mean and the standard deviation of attribute $j$.
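
The two formulas above can be sketched directly with NumPy. The matrix `X` below is made-up data, and using the population standard deviation (`ddof=0`, the NumPy default) is an assumption of this sketch, since the text does not say which estimator is intended.

```python
import numpy as np

# Hypothetical toy matrix: 4 objects (rows) x 2 attributes (columns)
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0],
              [5.0, 800.0]])

# Rescaling to [0, 1]: subtract the column minimum, divide by the column range
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_rescaled = (X - col_min) / (col_max - col_min)

# Standardization: subtract the column mean, divide by the column
# standard deviation (population version, ddof=0, assumed here)
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)
```

Both expressions divide by zero when an attribute is constant, so such columns need separate handling. scikit-learn offers the same transformations as `MinMaxScaler` and `StandardScaler` in `sklearn.preprocessing`.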


## Case Study:
### Breast Cancer dataset

* https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data


In [1]:
import pandas as pd
import numpy as np

In [2]:
# Dataset with missing values
df = pd.read_csv('breast_cancer_missing.csv')

In [3]:
df

| | sample_id | mean_radius | mean_texture | mean_perimeter | mean_area | mean_smoothness | mean_compactness | mean_concavity | mean_concave_points | mean_symmetry | ... | worst_texture | worst_perimeter | worst_area | worst_smoothness | worst_compactness | worst_concavity | worst_concave_points | worst_symmetry | worst_fractal_dimension | diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | 17.99 | 10.38 | 122.80 | NaN | 0.11840 | 0.27760 | 0.3001 | 0.14710 | NaN | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.66560 | NaN | 0.2654 | 0.4601 | 0.11890 | malignant |
| 1 | 842517 | 20.57 | NaN | NaN | 1326.0 | 0.08474 | NaN | 0.0869 | 0.07017 | 0.1812 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.18660 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | NaN |
| 2 | 84300903 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | NaN | NaN | 0.12790 | 0.2069 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.42450 | 0.4504 | 0.2430 | NaN | 0.08758 | malignant |
| 3 | 84348301 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | NaN | 0.10520 | 0.2597 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.86630 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | malignant |
| 4 | 84358402 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | NaN | 0.10430 | 0.1809 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.20500 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | malignant |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 564 | 926424 | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.2439 | 0.13890 | 0.1726 | ... | 26.40 | 166.10 | NaN | 0.1410 | 0.21130 | 0.4107 | 0.2216 | 0.2060 | NaN | malignant |
| 565 | 926682 | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.1440 | 0.09791 | 0.1752 | ... | 38.25 | 155.00 | 1731.0 | 0.1166 | 0.19220 | 0.3215 | NaN | 0.2572 | 0.06637 | malignant |
| 566 | 926954 | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | NaN | 0.05302 | 0.1590 | ... | 34.12 | 126.70 | 1124.0 | 0.1139 | 0.30940 | 0.3403 | 0.1418 | 0.2218 | 0.07820 | NaN |
| 567 | 927241 | NaN | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.3514 | 0.15200 | 0.2397 | ... | 39.42 | 184.60 | 1821.0 | 0.1650 | 0.86810 | 0.9387 | 0.2650 | 0.4087 | 0.12400 | malignant |


## 1) Exploratory data analysis

In [4]:
# Shape of the dataset (rows, columns)
df.shape

(569, 32)

In [5]:
# Data types and non-null counts per column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   sample_id                569 non-null    int64  
 1   mean_radius              527 non-null    float64
 2   mean_texture             521 non-null    float64
 3   mean_perimeter           508 non-null    float64
 4   mean_area                502 non-null    float64
 5   mean_smoothness          523 non-null    float64
 6   mean_compactness         510 non-null    float64
 7   mean_concavity           502 non-null    float64
 8   mean_concave_points      514 non-null    float64
 9   mean_symmetry            519 non-null    float64
 10  mean_fractal_dimension   513 non-null    float64
 11  radius_error             511 non-null    float64
 12  texture_error            513 non-null    float64
 13  perimeter_error          505 non-null    float64
 14  area_error               5

In [6]:
# Number of samples per class
df['diagnosis'].value_counts()

diagnosis
benign       316
malignant    189
Name: count, dtype: int64
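
The two counts above add up to 505 of the 569 rows because `value_counts()` skips missing values by default. As a small sketch on made-up labels (the `labels` Series is an assumption for illustration), passing `dropna=False` makes the missing class labels visible as well:

```python
import pandas as pd

# Hypothetical labels: two of the five entries are missing
labels = pd.Series(["benign", "malignant", None, "benign", None])

# Default behaviour: missing labels are not counted
counts_default = labels.value_counts()

# dropna=False adds an entry for the missing labels
counts_with_nan = labels.value_counts(dropna=False)
```

Comparing the two totals is a quick way to spot unlabeled rows before any cleaning step.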

In [7]:
# 4 random samples drawn from the whole dataset
# (df.sample does not balance the classes)
df.sample(4)

| | sample_id | mean_radius | mean_texture | mean_perimeter | mean_area | mean_smoothness | mean_compactness | mean_concavity | mean_concave_points | mean_symmetry | ... | worst_texture | worst_perimeter | worst_area | worst_smoothness | worst_compactness | worst_concavity | worst_concave_points | worst_symmetry | worst_fractal_dimension | diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 411 | 905520 | 11.04 | 16.83 | NaN | 373.2 | 0.1077 | 0.07804 | 0.03046 | 0.0248 | 0.1714 | ... | 26.44 | 79.93 | 471.4 | 0.1369 | 0.1482 | 0.1067 | 0.07431 | 0.2998 | NaN | benign |
| 166 | 87127 | 10.8 | 9.71 | NaN | 357.6 | NaN | 0.05736 | 0.02531 | NaN | 0.1381 | ... | 12.02 | 73.66 | 414.0 | 0.1436 | 0.1257 | 0.1047 | 0.04603 | 0.209 | 0.07699 | benign |
| 483 | 912558 | 13.7 | 17.64 | 87.76 | 571.1 | NaN | 0.07957 | 0.04548 | 0.0316 | 0.1732 | ... | 23.53 | 95.78 | 686.5 | 0.1199 | 0.1346 | NaN | 0.09077 | 0.2518 | 0.0696 | benign |
| 542 | 921644 | 14.74 | 25.42 | 94.7 | 668.6 | 0.08275 | NaN | 0.04105 | 0.03027 | 0.184 | ... | 32.29 | 107.4 | 826.4 | NaN | 0.1376 | 0.1611 | 0.1095 | 0.2722 | 0.06956 | benign |


In [8]:
# 4 random samples drawn from the whole dataset,
# restricted to the columns ['sample_id','diagnosis']
df[['sample_id','diagnosis']].sample(4)

| | sample_id | diagnosis |
|---|---|---|
| 436 | 908916 | benign |
| 68 | 859471 | benign |
| 355 | 9010258 | benign |
| 75 | 8610404 | malignant |


In [None]:
# Per-class sampling
# Group the samples by diagnosis type


In [9]:
df.groupby('diagnosis')[['sample_id','diagnosis']].sample(5)

| | sample_id | diagnosis |
|---|---|---|
| 418 | 906024 | benign |
| 293 | 891703 | benign |
| 183 | 873843 | benign |
| 551 | 923780 | benign |
| 220 | 8812816 | benign |
| 207 | 879830 | malignant |
| 197 | 877159 | malignant |
| 56 | 857637 | malignant |
| 62 | 858986 | malignant |
| 45 | 857010 | malignant |
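
Because the sampling is random, the rows shown above change on every run. As a hedged sketch on made-up data (the `toy` DataFrame and the seed value are assumptions), passing `random_state` to the grouped `sample` call makes a balanced draw reproducible:

```python
import pandas as pd

# Hypothetical imbalanced data: 6 benign and 4 malignant samples
toy = pd.DataFrame({
    "sample_id": range(10),
    "diagnosis": ["benign"] * 6 + ["malignant"] * 4,
})

# Draw 3 rows from each class; the fixed seed makes the draw repeatable
balanced = toy.groupby("diagnosis").sample(n=3, random_state=42)

# Each class contributes the same number of rows
counts = balanced["diagnosis"].value_counts()
```

The call raises an error if a class has fewer rows than `n`, unless `replace=True` is also passed.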


In [None]:
# Difference between two attributes per class
# mean_area and worst_area


In [10]:
df.groupby('diagnosis')[['mean_area','worst_area']].agg(['min','max','mean','std'])

| diagnosis | mean_area min | mean_area max | mean_area mean | mean_area std | worst_area min | worst_area max | worst_area mean | worst_area std |
|---|---|---|---|---|---|---|---|---|
| benign | 143.5 | 992.1 | 464.033684 | 134.70506 | 185.2 | 1032.0 | 557.475524 | 162.800566 |
| malignant | 386.1 | 2501.0 | 990.943558 | 367.001705 | 553.6 | 4254.0 | 1440.750303 | 595.849372 |


## 2) Data Cleaning

### Missing Data

Missing data (or missing values) are absent values, shown as NaN, which stands for "Not a Number".

#### Python syntax used:

- To detect NaN: df.isnull() (or pd.isnull(df))
- To replace NaN values: df.fillna()
- To replace NaN with the column mean: df.fillna(df.mean(numeric_only=True), inplace=True)
- To remove the rows that contain NaN: df.dropna(inplace=True)
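
The snippet below is a small sketch of the four idioms in the list, run on a made-up table that mixes numeric columns with a text label, as the lesson's dataset does. The `toy` DataFrame is an assumption for illustration; `numeric_only=True` is used so that the text column is skipped when the means are computed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two numeric attributes and a text label
toy = pd.DataFrame({
    "mean_area": [100.0, np.nan, 300.0],
    "worst_area": [np.nan, 250.0, 350.0],
    "diagnosis": ["benign", None, "malignant"],
})

# Select the rows that contain at least one NaN
rows_with_nan = toy[toy.isnull().any(axis=1)]

# Column means of the numeric attributes only; 'diagnosis' is skipped
numeric_means = toy.mean(numeric_only=True)

# fillna with a Series fills each column with its matching mean
filled = toy.fillna(numeric_means)

# Remove the rows that still contain NaN (here, the missing label)
cleaned = filled.dropna()
```

The returned copies are assigned to new names here instead of using `inplace=True`, so each intermediate result can be inspected.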

In [11]:
# Dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   sample_id                569 non-null    int64  
 1   mean_radius              527 non-null    float64
 2   mean_texture             521 non-null    float64
 3   mean_perimeter           508 non-null    float64
 4   mean_area                502 non-null    float64
 5   mean_smoothness          523 non-null    float64
 6   mean_compactness         510 non-null    float64
 7   mean_concavity           502 non-null    float64
 8   mean_concave_points      514 non-null    float64
 9   mean_symmetry            519 non-null    float64
 10  mean_fractal_dimension   513 non-null    float64
 11  radius_error             511 non-null    float64
 12  texture_error            513 non-null    float64
 13  perimeter_error          505 non-null    float64
 14  area_error               5

In [12]:
# Count the missing values per attribute
df.isnull().sum()

sample_id                   0
mean_radius                42
mean_texture               48
mean_perimeter             61
mean_area                  67
mean_smoothness            46
mean_compactness           59
mean_concavity             67
mean_concave_points        55
mean_symmetry              50
mean_fractal_dimension     56
radius_error               58
texture_error              56
perimeter_error            64
area_error                 60
smoothness_error           54
compactness_error          60
concavity_error            59
concave_points_error       45
symmetry_error             58
fractal_dimension_error    52
worst_radius               47
worst_texture              58
worst_perimeter            66
worst_area                 60
worst_smoothness           63
worst_compactness          53
worst_concavity            50
worst_concave_points       52
worst_symmetry             59
worst_fractal_dimension    75
diagnosis                  64
dtype: int64

In [13]:
# The most critical case is diagnosis, the class label
df['diagnosis'].isnull().sum()

64

In [14]:
# Remove the rows where diagnosis is missing
df.dropna(subset=['diagnosis'], inplace=True)

In [15]:
# Confirm that the removal worked
df['diagnosis'].isnull().sum()

0

In [16]:
# New shape of the dataset
df.shape

(505, 32)

#### Handling missing values in the remaining columns


In [20]:
# Get the names of the columns that contain NaN
# and convert them to a list
nan_columns = df.columns[df.isnull().any()].tolist()

In [21]:
nan_columns

['mean_radius',
 'mean_texture',
 'mean_perimeter',
 'mean_area',
 'mean_smoothness',
 'mean_compactness',
 'mean_concavity',
 'mean_concave_points',
 'mean_symmetry',
 'mean_fractal_dimension',
 'radius_error',
 'texture_error',
 'perimeter_error',
 'area_error',
 'smoothness_error',
 'compactness_error',
 'concavity_error',
 'concave_points_error',
 'symmetry_error',
 'fractal_dimension_error',
 'worst_radius',
 'worst_texture',
 'worst_perimeter',
 'worst_area',
 'worst_smoothness',
 'worst_compactness',
 'worst_concavity',
 'worst_concave_points',
 'worst_symmetry',
 'worst_fractal_dimension']

In [22]:
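# Loop over the columns, computing the mean of each one
# and replacing its NaN values with that mean.
# Assigning the result back avoids the chained inplace call,
# which newer pandas versions warn about or ignore.
for col in nan_columns:
  df[col] = df[col].fillna(df[col].mean())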
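
The column-by-column loop above can also be written as a single assignment. The sketch below shows that variant on made-up data; the `toy` DataFrame and the `cols` name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with NaN in two numeric columns
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 4.0, 8.0],
    "label": ["x", "y", "z"],
})

# Names of the columns that contain at least one NaN
cols = toy.columns[toy.isnull().any()].tolist()

# Fill every selected column with its own mean in one step
toy[cols] = toy[cols].fillna(toy[cols].mean())
```

Here `toy[cols].mean()` returns one mean per column, and `fillna` matches those means to the columns by name.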

In [23]:
# Count the missing values for all attributes
df.isnull().sum()

sample_id                  0
mean_radius                0
mean_texture               0
mean_perimeter             0
mean_area                  0
mean_smoothness            0
mean_compactness           0
mean_concavity             0
mean_concave_points        0
mean_symmetry              0
mean_fractal_dimension     0
radius_error               0
texture_error              0
perimeter_error            0
area_error                 0
smoothness_error           0
compactness_error          0
concavity_error            0
concave_points_error       0
symmetry_error             0
fractal_dimension_error    0
worst_radius               0
worst_texture              0
worst_perimeter            0
worst_area                 0
worst_smoothness           0
worst_compactness          0
worst_concavity            0
worst_concave_points       0
worst_symmetry             0
worst_fractal_dimension    0
diagnosis                  0
dtype: int64

### Missing Data with SimpleImputer
* https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
* https://medium.com/@lainetnr/simpleimputer-lidando-com-dados-faltantes-com-scikit-learn-861de5f7e41d

In [26]:
# Import the SimpleImputer class
from sklearn.impute import SimpleImputer

In [27]:
# Read the dataset
df = pd.read_csv('breast_cancer_missing.csv')

In [28]:
# Remove the rows where the diagnosis class is missing
df.dropna(subset=['diagnosis'],inplace=True)

In [29]:
# Get the names of the attributes with missing values
nan_columns = df.columns[df.isna().any()].tolist()

In [30]:
nan_columns

['mean_radius',
 'mean_texture',
 'mean_perimeter',
 'mean_area',
 'mean_smoothness',
 'mean_compactness',
 'mean_concavity',
 'mean_concave_points',
 'mean_symmetry',
 'mean_fractal_dimension',
 'radius_error',
 'texture_error',
 'perimeter_error',
 'area_error',
 'smoothness_error',
 'compactness_error',
 'concavity_error',
 'concave_points_error',
 'symmetry_error',
 'fractal_dimension_error',
 'worst_radius',
 'worst_texture',
 'worst_perimeter',
 'worst_area',
 'worst_smoothness',
 'worst_compactness',
 'worst_concavity',
 'worst_concave_points',
 'worst_symmetry',
 'worst_fractal_dimension']

In [31]:
# Create the SimpleImputer object
imputer = SimpleImputer(missing_values = np.nan, strategy='mean')

In [32]:
# Apply the transformation to the dataset
df[nan_columns] = imputer.fit_transform(df[nan_columns])

In [33]:
# Check the final result
df.isnull().sum()

sample_id                  0
mean_radius                0
mean_texture               0
mean_perimeter             0
mean_area                  0
mean_smoothness            0
mean_compactness           0
mean_concavity             0
mean_concave_points        0
mean_symmetry              0
mean_fractal_dimension     0
radius_error               0
texture_error              0
perimeter_error            0
area_error                 0
smoothness_error           0
compactness_error          0
concavity_error            0
concave_points_error       0
symmetry_error             0
fractal_dimension_error    0
worst_radius               0
worst_texture              0
worst_perimeter            0
worst_area                 0
worst_smoothness           0
worst_compactness          0
worst_concavity            0
worst_concave_points       0
worst_symmetry             0
worst_fractal_dimension    0
diagnosis                  0
dtype: int64
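
To see what the imputer does in the steps above, the following sketch runs it on a tiny made-up array. The arrays `X` and `X_new` are assumptions for illustration. After fitting, the `statistics_` attribute holds the per-column fill values, and `transform` reuses them on rows the imputer has not seen.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy matrix with one NaN in each column
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# Learn the column means and fill the NaN values in one call
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)

# statistics_ stores the fill value learned for each column
learned_means = imputer.statistics_

# Reuse the learned means on new rows without refitting
X_new = np.array([[np.nan, np.nan]])
X_new_filled = imputer.transform(X_new)
```

Separating `fit` from `transform` is the main practical difference from the pandas `fillna` approach: the fill values are computed once and can be applied consistently to other data.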