# Preprocessing - Lesson outline:

## 1 - Data Cleaning
### (i) Pandas

* Removing Duplicate Data: \\
A dataset may contain repeated rows. \\
Pandas provides the .drop_duplicates() method to remove
these duplicates.
* Handling Missing Values: \\
Incomplete data (NaN) can distort the analyses. \\
Pandas provides .fillna() to replace missing values and .dropna()
to drop the rows or columns that contain them.
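
The three methods above can be sketched on a small table. This is only an illustration: the DataFrame `toy` and its columns `radius` and `label` are made up for the example and are not part of the lesson's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one repeated row and two missing values.
toy = pd.DataFrame({
    "radius": [1.0, 1.0, np.nan, 4.0],
    "label": ["a", "a", "b", None],
})

# drop_duplicates() keeps the first occurrence of each repeated row.
deduped = toy.drop_duplicates()

# dropna(subset=...) removes rows whose value in the given column is missing.
dropped = deduped.dropna(subset=["label"])

# fillna() replaces the remaining NaN in "radius" with the column mean.
filled = dropped.fillna({"radius": dropped["radius"].mean()})

print(filled)
```

None of these calls modifies `toy`; each one returns a new DataFrame, which makes the intermediate steps easy to inspect.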

### (ii) Using the SimpleImputer Class
* The SimpleImputer class belongs to the sklearn.impute module.
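
A minimal sketch of how SimpleImputer might be used for mean imputation. The matrix `X` is a hypothetical example, and the choice of `strategy="mean"` is an assumption made here to mirror the mean-based filling used later in the case study.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy matrix: rows are samples, columns are attributes.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# strategy may also be "median", "most_frequent" or "constant".
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

print(imputer.statistics_)  # per-column means learned by fit: [ 2. 15.]
print(X_filled)             # NaNs replaced by 2.0 and 15.0
```

Because the imputer stores the learned statistics, the same fitted object can later fill new data with `imputer.transform(...)`.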

## 2 - Data Normalization

Rescaling (min-max): uses a minimum and a maximum value to define a new interval \\
that will contain the values of an attribute (example: [0, 1]). \\
For the interval [0, 1], the rescaled value of attribute $j$ of object $x_i$ can be computed as:

$$ x_{ij} = \frac{x_{ij} - min_j}{max_j - min_j} $$
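
As a quick check of the rescaling formula, the sketch below applies it by hand to a hypothetical single-attribute array `x` and compares the result with scikit-learn's MinMaxScaler, whose default output range is [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical attribute with three observations (one column).
x = np.array([[2.0], [4.0], [10.0]])

# Formula applied directly: (x - min_j) / (max_j - min_j), column by column.
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Same transformation through scikit-learn.
scaled = MinMaxScaler().fit_transform(x)

print(manual.ravel())  # 0.0, 0.25, 1.0
print(np.allclose(manual, scaled))
```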


Standardization: the different continuous attributes may span different ranges, \\
but they must share the same values for some measure of location and some measure of spread. \\
These measures are the mean and the standard deviation: after the transformation, every attribute has mean 0 and standard deviation 1.

$$ x_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j} $$
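
The standardization formula can be verified the same way. In this sketch the array `x` is a hypothetical attribute; note that StandardScaler divides by the population standard deviation (`ddof=0`), which is also the default of `np.std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical attribute with three observations (one column).
x = np.array([[2.0], [4.0], [6.0]])

# Formula applied directly: (x - mean_j) / sigma_j, with sigma_j computed using ddof=0.
manual = (x - x.mean(axis=0)) / x.std(axis=0)

# Same transformation through scikit-learn.
scaled = StandardScaler().fit_transform(x)

print(manual.ravel())               # approximately -1.2247, 0.0, 1.2247
print(manual.mean(), manual.std())  # approximately 0 and 1
```

Using pandas' `Series.std()` instead would give slightly different values, since it defaults to the sample standard deviation (`ddof=1`).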


## Case Study:
### Breast Cancer dataset

* https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data


In [None]:
import pandas as pd
import numpy as np

### Reading the Dataset

In [None]:
# Reads the dataset from a CSV file
def readDataset(dataset):
    return pd.read_csv(dataset)

### Data Cleaning

#### Removing NaN samples

In [None]:
# Drops, in place, the rows whose value in `feature` is NaN
def removeSamples(dataframe, feature):
  dataframe.dropna(subset=[feature], inplace=True)

#### Handling missing data in the remaining columns


In [None]:
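def fillSamples(dataframe):
  # Fills the NaNs of every affected column with that column's mean
  nan_columns = dataframe.columns[dataframe.isna().any()].tolist()
  for col in nan_columns:
    # Assign the result back: fillna(inplace=True) on a selected column is deprecated in recent pandas
    dataframe[col] = dataframe[col].fillna(dataframe[col].mean())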

#### Normalizing numeric attributes
* StandardScaler and MinMaxScaler

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

In [None]:
# Standardizes the given columns in place (mean 0, standard deviation 1)
def stdScaler(dataframe, columns):
  scaler = StandardScaler()
  dataframe[columns] = scaler.fit_transform(dataframe[columns])

In [None]:
# Rescales the given columns in place to the interval [0, 1]
def minMax(dataframe, columns):
  scaler = MinMaxScaler()
  dataframe[columns] = scaler.fit_transform(dataframe[columns])

### Testing StandardScaler standardization

In [None]:
# Load the dataset
df = readDataset('breast_cancer_missing.csv')

In [None]:
# Remove the samples whose diagnosis is NaN
removeSamples(df, 'diagnosis')

In [None]:
# Store the names of the columns with missing data
nan_columns = df.columns[df.isna().any()].tolist()

In [None]:
# Replace the missing values with the column mean
fillSamples(df)

In [None]:
# Apply StandardScaler standardization to the columns that had missing data
stdScaler(df, nan_columns)

In [None]:
# Inspect the first 5 rows of the result
df.head()

Unnamed: 0,sample_id,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,mean_compactness,mean_concavity,mean_concave_points,mean_symmetry,...,worst_texture,worst_perimeter,worst_area,worst_smoothness,worst_compactness,worst_concavity,worst_concave_points,worst_symmetry,worst_fractal_dimension,diagnosis
0,842302,1.085817,-2.172329,1.283164,3.42126e-16,1.622464,3.508049,2.736543,2.562904,1.054117e-15,...,-1.484189,2.356961,2.106997,1.462564,2.759463,0.0,2.37501,2.859841,2.124614,malignant
2,84300903,1.57282,0.481953,1.591406,1.646896,0.968696,2.809158e-16,3.625689e-16,2.055453,0.9665091,...,-0.00752,1.374195,1.533224,0.597244,1.143226,0.886155,2.020424,0.0,0.2103,malignant
3,84348301,-0.796309,0.269513,-0.65277,-0.8114597,3.412898,3.635574,3.625689e-16,1.455497,2.971777,...,0.167159,-0.267728,-0.579187,3.776567,4.104876,2.073151,2.249955,6.292452,5.431268,malignant
4,84358402,1.744704,-1.205359,1.809745,1.929777,0.277781,0.5769897,3.625689e-16,1.43171,-0.02093352,...,-1.603042,1.365011,1.285206,0.25695,-0.328214,0.633198,0.746131,-0.909796,-0.449808,malignant
5,843786,-0.501242,-0.873269,-0.439141,-0.5376068,2.320808,1.329997,0.8776898,0.81299,1.034871,...,-0.328065,-0.129039,-0.257318,2.284133,1.816267,1.313273,0.929756,1.821801,2.46078,malignant
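
Entries in this output such as 3.42126e-16 or 0.0 are most likely cells that were originally missing: a value imputed with the column mean lands at (or within floating-point error of) zero once the column is standardized. The sketch below reproduces this on a hypothetical column; the data and the reuse of the name `mean_area` are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical column with one missing value at index 1.
col = pd.DataFrame({"mean_area": [1.0, np.nan, 3.0, 8.0]})

# Mean imputation: the NaN becomes 4.0, the mean of the observed values.
col["mean_area"] = col["mean_area"].fillna(col["mean_area"].mean())

# Standardization: the imputed cell equals the column mean, so it maps to ~0.
col["mean_area"] = StandardScaler().fit_transform(col[["mean_area"]]).ravel()

imputed_value = col.loc[1, "mean_area"]
print(imputed_value)
```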


### Testing MinMaxScaler normalization

In [None]:
# Load the dataset
df = readDataset('breast_cancer_missing.csv')

In [None]:
# Remove the samples whose diagnosis is NaN
removeSamples(df, 'diagnosis')

In [None]:
# Store the names of the columns with missing data
nan_columns = df.columns[df.isna().any()].tolist()

In [None]:
# Replace the missing values with the column mean
fillSamples(df)

In [None]:
# Apply MinMaxScaler normalization to the columns that had missing data
minMax(df, nan_columns)

In [None]:
# Inspect the first 5 rows of the result
df.head()

Unnamed: 0,sample_id,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,mean_compactness,mean_concavity,mean_concave_points,mean_symmetry,...,worst_texture,worst_perimeter,worst_area,worst_smoothness,worst_compactness,worst_concavity,worst_concave_points,worst_symmetry,worst_fractal_dimension,diagnosis
0,842302,0.521037,0.022658,0.552093,0.217283,0.55397,0.792037,0.70314,0.731113,0.381067,...,0.15903,0.66831,0.450698,0.65664,0.619292,0.218722,0.912027,0.598462,0.54137,malignant
2,84300903,0.601496,0.39026,0.602404,0.449417,0.466746,0.260461,0.212302,0.635686,0.509596,...,0.404612,0.508442,0.374508,0.528241,0.385375,0.359744,0.835052,0.263926,0.275856,malignant
3,84348301,0.21009,0.360839,0.236112,0.102906,0.792844,0.811361,0.212302,0.522863,0.776263,...,0.433663,0.241347,0.094008,1.0,0.814012,0.548642,0.88488,1.0,1.0,malignant
4,84358402,0.629893,0.156578,0.638041,0.48929,0.374566,0.347893,0.212302,0.51839,0.378283,...,0.139263,0.506948,0.341575,0.477747,0.172415,0.319489,0.558419,0.1575,0.1843,malignant
5,843786,0.258839,0.20257,0.27098,0.141506,0.64714,0.461996,0.369728,0.402038,0.518687,...,0.351303,0.263908,0.136748,0.778547,0.482784,0.427716,0.598282,0.477035,0.587996,malignant
