# **Data cleaning and preparation**

## **Handling missing data:**

In [79]:
import pandas as pd
import numpy as np

In [80]:
nomes = pd.Series(['Matheus', 'Dercy', np.nan, 'Sergio'])
nomes

0    Matheus
1      Dercy
2        NaN
3     Sergio
dtype: object

In [81]:
nomes.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [82]:
nomes[1] = None
nomes

0    Matheus
1       None
2        NaN
3     Sergio
dtype: object

**Note: in object arrays, *None* is also treated as NA**

In [83]:
nomes.isnull()

0    False
1     True
2     True
3    False
dtype: bool
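A minimal sketch of how `None` behaves in different Series, assuming a recent pandas version. The names `obj_series` and `num_series` are illustrative and not part of the notebook above.

```python
import numpy as np
import pandas as pd

# With dtype=object, None is stored as-is but still counts as missing.
obj_series = pd.Series(['a', None, np.nan], dtype=object)
obj_mask = obj_series.isnull().tolist()

# In a numeric Series, None is converted to NaN and the dtype becomes float64.
num_series = pd.Series([1, None, 3])
num_dtype = str(num_series.dtype)

print(obj_mask)
print(num_dtype)
```

In both cases `isnull()` flags the position, so downstream code does not need to distinguish between `None` and `NaN`.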

### **Methods for handling NA**

In [84]:
nomes.notnull()

0     True
1    False
2    False
3     True
dtype: bool

In [85]:
nomes.fillna('Ercy')

0    Matheus
1       Ercy
2       Ercy
3     Sergio
dtype: object

In [86]:
nomes.dropna(inplace=False)

0    Matheus
3     Sergio
dtype: object

In [87]:
nomes

0    Matheus
1       None
2        NaN
3     Sergio
dtype: object
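The two cells above suggest that `dropna` returns a new object and leaves the original untouched. A small sketch of that behaviour, using a hypothetical `names` Series that mirrors `nomes`:

```python
import numpy as np
import pandas as pd

names = pd.Series(['Matheus', None, np.nan, 'Sergio'], dtype=object)

# dropna() returns a new Series; `names` itself is not modified.
cleaned = names.dropna()
names_len_after = len(names)

# To keep the result, rebind the name (or assign it to a new one).
names = names.dropna()

print(len(cleaned), names_len_after, names.index.tolist())
```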

### **Filtering out missing data:**

#### **Working with Series:**

In [88]:
from numpy import nan as NA
dados = pd.Series([3, NA, 5, NA, 9])
dados

0    3.0
1    NaN
2    5.0
3    NaN
4    9.0
dtype: float64

##### **Note: dados.dropna() is equivalent to dados[dados.notnull()]**

In [89]:
dados.dropna()

0    3.0
2    5.0
4    9.0
dtype: float64

In [90]:
dados[dados.notnull()]

0    3.0
2    5.0
4    9.0
dtype: float64
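The equivalence shown by the two cells above can also be checked programmatically. This is a small sketch that rebuilds `dados` so it runs on its own:

```python
import numpy as np
import pandas as pd

dados = pd.Series([3, np.nan, 5, np.nan, 9])

via_dropna = dados.dropna()
via_mask = dados[dados.notnull()]  # boolean indexing with the non-NA mask

# Series.equals compares values and index labels.
same = via_dropna.equals(via_mask)
print(same)
```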

#### **Working with DataFrames**

In [91]:
data = pd.DataFrame([[4., 7.5, 3.], [2., NA, NA], [NA, NA, NA], [NA, 6.5, 8.]])
data

     0    1    2
0  4.0  7.5  3.0
1  2.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  8.0


##### **how='all' drops a label along the given axis only if every value in it is NaN. Here the axis is 0 (rows), so only the row with index 2, whose values are all NaN, was dropped**

In [92]:
data.dropna(how='all')

     0    1    2
0  4.0  7.5  3.0
1  2.0  NaN  NaN
3  NaN  6.5  8.0
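To make the difference between the default `how='any'` and `how='all'` concrete, here is a short sketch on the same values as `data`. The variable names `rows_any` and `rows_all` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[4., 7.5, 3.],
                     [2., np.nan, np.nan],
                     [np.nan, np.nan, np.nan],
                     [np.nan, 6.5, 8.]])

# how='any' (the default): drop a row if it has at least one NaN.
rows_any = data.dropna(how='any').index.tolist()

# how='all': drop a row only if every value in it is NaN.
rows_all = data.dropna(how='all').index.tolist()

print(rows_any)
print(rows_all)
```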


In [93]:
data[4] = NA
data

     0    1    2   4
0  4.0  7.5  3.0 NaN
1  2.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  8.0 NaN


In [94]:
data.dropna(how='all', axis=1)

     0    1    2
0  4.0  7.5  3.0
1  2.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  8.0
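With `axis=1` the same rule is applied to columns instead of rows, which is why the all-NaN row 2 is still present above. A brief sketch, assuming the same values as `data`, showing that `axis=1` and `axis='columns'` give the same result:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[4., 7.5, 3.],
                     [2., np.nan, np.nan],
                     [np.nan, np.nan, np.nan],
                     [np.nan, 6.5, 8.]])
data[4] = np.nan  # a column that is entirely NaN

by_number = data.dropna(how='all', axis=1)
by_name = data.dropna(how='all', axis='columns')

kept_columns = by_number.columns.tolist()
same_result = by_number.equals(by_name)
print(kept_columns, same_result, len(by_number))
```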


##### **The thresh parameter sets the minimum number of non-NaN values that a label along the selected axis (0 here, i.e. each row) must have in order to be kept**

In [76]:
# df is the DataFrame built in cells In [99] and In [75] further down
df.dropna(thresh=2)

          0         1         2
2  1.539148       NaN  0.534650
3  0.486813       NaN  1.778009
4  0.167385  0.778138  1.852519
5  0.852871  1.289665 -1.350588
6  1.197955 -0.071773 -0.216568
6,1.197955,-0.071773,-0.216568


##### **With no arguments, dropna() drops every row that contains at least one NaN. Column 4 of data is entirely NaN, so no row survives:**

In [98]:
limpeza = data.dropna()
limpeza

Empty DataFrame
Columns: [0, 1, 2, 4]
Index: []


In [99]:
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df

          0         1         2
0  0.551738       NaN  2.003972
1  1.081567       NaN  0.305688
2 -0.773015       NaN  1.046285
3  1.926864       NaN -0.686433
4  0.553448  0.166374  1.313207
5 -0.107321 -0.877288 -2.269388
6 -0.986265 -0.118848 -1.130076


In [75]:
df.iloc[:2, 2] = NA
df

          0         1         2
0  0.025285       NaN       NaN
1  1.392410       NaN       NaN
2  1.539148       NaN  0.534650
3  0.486813       NaN  1.778009
4  0.167385  0.778138  1.852519
5  0.852871  1.289665 -1.350588
6  1.197955 -0.071773 -0.216568


In [77]:
df.dropna()

          0         1         2
4  0.167385  0.778138  1.852519
5  0.852871  1.289665 -1.350588
6  1.197955 -0.071773 -0.216568


### **Filling in missing data:**

In [102]:
df.fillna(np.random.randint(0, 100))

          0          1         2
0  0.551738  25.000000  2.003972
1  1.081567  25.000000  0.305688
2 -0.773015  25.000000  1.046285
3  1.926864  25.000000 -0.686433
4  0.553448   0.166374  1.313207
5 -0.107321  -0.877288 -2.269388
6 -0.986265  -0.118848 -1.130076


In [103]:
df.fillna('Exemplo')

          0         1         2
0  0.551738   Exemplo  2.003972
1  1.081567   Exemplo  0.305688
2 -0.773015   Exemplo  1.046285
3  1.926864   Exemplo -0.686433
4  0.553448  0.166374  1.313207
5 -0.107321 -0.877288 -2.269388
6 -0.986265 -0.118848 -1.130076


In [112]:
data.fillna(data.mean())

     0    1    2   4
0  4.0  7.5  3.0 NaN
1  2.0  7.0  5.5 NaN
2  3.0  7.0  5.5 NaN
3  3.0  6.5  8.0 NaN
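Column 4 stays NaN in the output above because its own mean is NaN: `data.mean()` returns one value per column, and `fillna` uses each value only for the matching column. A sketch that rebuilds `data` to show this step by step (the names `means` and `filled` are illustrative):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[4., 7.5, 3.],
                     [2., np.nan, np.nan],
                     [np.nan, np.nan, np.nan],
                     [np.nan, 6.5, 8.]])
data[4] = np.nan

# One mean per column; NaNs are skipped, so an all-NaN column yields NaN.
means = data.mean()

# Each column's NaNs are replaced by that column's own mean.
filled = data.fillna(means)

print(means)
print(filled)
```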


#### **Note: when a dictionary is passed to fillna, each key is the label of a column to fill, and the corresponding value is what replaces the NaNs in that column**

In [107]:
data.fillna({0:'Pares', 1: 'Quebrados', 2: 'Par Impar', 4: 'Ausentes'})

       0          1          2         4
0      4        7.5          3  Ausentes
1      2  Quebrados  Par Impar  Ausentes
2  Pares  Quebrados  Par Impar  Ausentes
3  Pares        6.5          8  Ausentes
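The dictionary does not have to cover every column. In the sketch below the fill values `0.0` and `-1.0` are arbitrary choices for illustration: only columns 0 and 1 are listed, so column 2 keeps its NaNs, and numeric fill values keep the columns as floats.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[4., 7.5, 3.],
                     [2., np.nan, np.nan],
                     [np.nan, np.nan, np.nan],
                     [np.nan, 6.5, 8.]])

# Keys are column labels; columns missing from the dict are left unchanged.
fill_values = {0: 0.0, 1: -1.0}
filled = data.fillna(fill_values)

remaining_nans = int(filled[2].isna().sum())
print(filled)
print(remaining_nans)
```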
