## Data Preprocessing
COVID Data Analysis

During data analysis and modeling, a significant amount of time is spent on data **preparation**: cleaning, transforming, and organizing. It is commonly said that these tasks take up around 80% of the total time of an analysis.

The data may be stored in files or in databases, and **pandas** provides fast, flexible, high-level tools for **ad hoc processing**, so that you can get the data into a format suitable for analysis.


    - 1. Reading the data
    - 2. Handling missing data
    - 3. Filling missing data
    - 4. Removing duplicate data
    - 5. Formatting data with map
    - 6. Discretizing data
    - 7. Filtering outliers
   
**Documentation:** https://pandas.pydata.org/docs/
    

## 1. Reading the data

In [1]:
import pandas as pd
import numpy as np

In [2]:
filename = 'data/COVID19 cases Toronto.csv'
df = pd.read_csv(filename)
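
As a complement to the cell above, the sketch below shows how `pd.read_csv` can parse date columns while reading. The two-row sample is hypothetical and only mimics a few columns of the Toronto file; `parse_dates` is a standard `read_csv` parameter.

```python
import io

import pandas as pd

# Hypothetical sample that mimics a few columns of the real CSV file.
csv_text = """_id,Age Group,Episode Date,Reported Date
44294,50-59,2020-03-25,2020-03-27
44295,,2020-03-20,2020-03-28
"""

# parse_dates converts the listed columns to datetime64 while reading.
sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Episode Date", "Reported Date"],
)

print(sample.dtypes)
# The empty "Age Group" field in the second row is read as NaN.
print(sample["Age Group"].isna().sum())
```

Reading dates this way avoids a separate conversion step, at the cost of slightly slower parsing on large files.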

Viewing the first rows of a DataFrame

In [3]:
df.head()

Unnamed: 0,_id,Outbreak Associated,Age Group,Neighbourhood Name,FSA,Source of Infection,Classification,Episode Date,Reported Date,Client Gender,Outcome,Currently Hospitalized,Currently in ICU,Currently Intubated,Ever Hospitalized,Ever in ICU,Ever Intubated
0,44294,Sporadic,50-59,Malvern,M1B,Institutional,CONFIRMED,2020-03-25,2020-03-27,MALE,RESOLVED,No,No,No,No,No,No
1,44295,Sporadic,20-29,Malvern,M1B,Community,CONFIRMED,2020-03-20,2020-03-28,MALE,RESOLVED,No,No,No,Yes,No,No
2,44296,Sporadic,60-69,Malvern,M1B,Travel,CONFIRMED,2020-03-04,2020-03-08,FEMALE,RESOLVED,No,No,No,Yes,Yes,Yes
3,44297,Outbreak Associated,50-59,Rouge,M1B,N/A - Outbreak associated,CONFIRMED,2020-05-02,2020-05-04,FEMALE,RESOLVED,No,No,No,No,No,No
4,44298,Sporadic,30-39,Rouge,M1B,Close contact,CONFIRMED,2020-05-31,2020-06-06,FEMALE,RESOLVED,No,No,No,No,No,No


Checking information about the DataFrame and its columns

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14911 entries, 0 to 14910
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   _id                     14911 non-null  int64 
 1   Outbreak Associated     14911 non-null  object
 2   Age Group               14879 non-null  object
 3   Neighbourhood Name      14298 non-null  object
 4   FSA                     14344 non-null  object
 5   Source of Infection     14911 non-null  object
 6   Classification          14911 non-null  object
 7   Episode Date            14911 non-null  object
 8   Reported Date           14911 non-null  object
 9   Client Gender           14911 non-null  object
 10  Outcome                 14911 non-null  object
 11  Currently Hospitalized  14911 non-null  object
 12  Currently in ICU        14911 non-null  object
 13  Currently Intubated     14911 non-null  object
 14  Ever Hospitalized       14911 non-null  object
 15  Ev

Describing summary statistics of the data

In [5]:
df.describe(include='all')

Unnamed: 0,_id,Outbreak Associated,Age Group,Neighbourhood Name,FSA,Source of Infection,Classification,Episode Date,Reported Date,Client Gender,Outcome,Currently Hospitalized,Currently in ICU,Currently Intubated,Ever Hospitalized,Ever in ICU,Ever Intubated
count,14911.0,14911,14879,14298,14344,14911,14911,14911,14911,14911,14911,14911,14911,14911,14911,14911,14911
unique,,2,9,140,96,8,2,149,142,5,3,2,2,2,2,2,2
top,,Sporadic,50-59,Glenfield-Jane Heights,M9V,N/A - Outbreak associated,CONFIRMED,2020-04-15,2020-05-29,FEMALE,RESOLVED,No,No,No,No,No,No
freq,,9333,2354,502,850,5578,13686,292,437,7909,13195,14760,14882,14886,13063,14511,14624
mean,51749.0,,,,,,,,,,,,,,,,
std,4304.579267,,,,,,,,,,,,,,,,
min,44294.0,,,,,,,,,,,,,,,,
25%,48021.5,,,,,,,,,,,,,,,,
50%,51749.0,,,,,,,,,,,,,,,,
75%,55476.5,,,,,,,,,,,,,,,,


## 2. Handling missing data

Checking for null values

In [6]:
df.isnull()

Unnamed: 0,_id,Outbreak Associated,Age Group,Neighbourhood Name,FSA,Source of Infection,Classification,Episode Date,Reported Date,Client Gender,Outcome,Currently Hospitalized,Currently in ICU,Currently Intubated,Ever Hospitalized,Ever in ICU,Ever Intubated
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14906,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False
14907,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False
14908,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False
14909,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False


Counting null values in each column

In [7]:
df.isnull().sum()

_id                         0
Outbreak Associated         0
Age Group                  32
Neighbourhood Name        613
FSA                       567
Source of Infection         0
Classification              0
Episode Date                0
Reported Date               0
Client Gender               0
Outcome                     0
Currently Hospitalized      0
Currently in ICU            0
Currently Intubated         0
Ever Hospitalized           0
Ever in ICU                 0
Ever Intubated              0
dtype: int64
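
The absolute counts can also be expressed as a share of the rows. The following is a minimal sketch on a hypothetical four-row frame: `isnull()` yields booleans, so their column-wise mean is the fraction of missing values.

```python
import pandas as pd

# Hypothetical toy frame with a few missing entries.
toy = pd.DataFrame(
    {
        "Age Group": ["50-59", None, "20-29", None],
        "FSA": ["M1B", "M1B", None, "M9V"],
    }
)

# Mean of the boolean mask = fraction of nulls; times 100 = percentage.
missing_share = toy.isnull().mean() * 100
print(missing_share)
```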

In [8]:
filter_ = df['Age Group'].isna()
df[filter_].shape

(32, 17)

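Dropping rows from a DataFrame.
For a DataFrame, **dropna()** removes every row that contains at least one missing value, which is equivalent to "***df[df.notnull().all(axis=1)]***"; for a Series it is equivalent to "***s[s.notnull()]***".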
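
The row-wise equivalence can be checked on a small example. This is a sketch on a hypothetical three-row frame, not on the COVID data: `dropna()` with default arguments should return the same rows as a boolean mask that requires every column to be non-null.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: only the first row is complete.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

dropped = toy.dropna()
masked = toy[toy.notnull().all(axis=1)]

# Both approaches keep exactly the rows without any missing value.
print(dropped.equals(masked))
```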


In [9]:
df.shape[0]

14911

In [10]:
rows_before = df.shape[0]
rows_curr = df.dropna().shape[0]
print("{} rows were removed, which is equivalent to {}% of the dataset".format(rows_before-rows_curr,
                                                                               100-(rows_curr*100/rows_before)))

637 rows were removed, which is equivalent to 4.272013949433301% of the dataset


In [11]:
# DataFrame.append was removed in pandas 2.0; reindexing with one extra label
# adds an all-NaN row at the end (the index here is the default RangeIndex).
df = df.reindex(range(len(df) + 1))
df.tail()

Unnamed: 0,_id,Outbreak Associated,Age Group,Neighbourhood Name,FSA,Source of Infection,Classification,Episode Date,Reported Date,Client Gender,Outcome,Currently Hospitalized,Currently in ICU,Currently Intubated,Ever Hospitalized,Ever in ICU,Ever Intubated
14907,59201.0,Outbreak Associated,20-29,,,N/A - Outbreak associated,CONFIRMED,2020-05-09,2020-05-23,FEMALE,RESOLVED,No,No,No,No,No,No
14908,59202.0,Outbreak Associated,40-49,,,N/A - Outbreak associated,CONFIRMED,2020-06-18,2020-06-19,FEMALE,RESOLVED,No,No,No,No,No,No
14909,59203.0,Outbreak Associated,19 and younger,,,N/A - Outbreak associated,PROBABLE,2020-06-13,2020-06-13,MALE,RESOLVED,No,No,No,No,No,No
14910,59204.0,Outbreak Associated,50-59,,,N/A - Outbreak associated,CONFIRMED,2020-06-22,2020-06-23,FEMALE,RESOLVED,No,No,No,Yes,No,No
14911,,,,,,,,,,,,,,,,,


If you use **dropna(how='all')**, you drop only the rows that contain nothing but NaNs

In [12]:
shape_before = df.shape[0]
shape_curr = df.dropna(how='all').shape[0]
print("{} rows were removed, which is equivalent to {}% of the dataset".format(shape_before-shape_curr, 100-(shape_curr*100/shape_before)))

1 rows were removed, which is equivalent to 0.006706008583691414% of the dataset


If you use **dropna(axis=1, how='all')**, you drop the columns that contain nothing but NaNs

In [13]:
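# 'Nulo' ("null") is a helper column filled entirely with NaN.
# np.nan is used because the np.NAN alias was removed in NumPy 2.0.
df['Nulo'] = np.nan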

In [14]:
df.dropna(axis=1, how='all')

Unnamed: 0,_id,Outbreak Associated,Age Group,Neighbourhood Name,FSA,Source of Infection,Classification,Episode Date,Reported Date,Client Gender,Outcome,Currently Hospitalized,Currently in ICU,Currently Intubated,Ever Hospitalized,Ever in ICU,Ever Intubated
0,44294.0,Sporadic,50-59,Malvern,M1B,Institutional,CONFIRMED,2020-03-25,2020-03-27,MALE,RESOLVED,No,No,No,No,No,No
1,44295.0,Sporadic,20-29,Malvern,M1B,Community,CONFIRMED,2020-03-20,2020-03-28,MALE,RESOLVED,No,No,No,Yes,No,No
2,44296.0,Sporadic,60-69,Malvern,M1B,Travel,CONFIRMED,2020-03-04,2020-03-08,FEMALE,RESOLVED,No,No,No,Yes,Yes,Yes
3,44297.0,Outbreak Associated,50-59,Rouge,M1B,N/A - Outbreak associated,CONFIRMED,2020-05-02,2020-05-04,FEMALE,RESOLVED,No,No,No,No,No,No
4,44298.0,Sporadic,30-39,Rouge,M1B,Close contact,CONFIRMED,2020-05-31,2020-06-06,FEMALE,RESOLVED,No,No,No,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14907,59201.0,Outbreak Associated,20-29,,,N/A - Outbreak associated,CONFIRMED,2020-05-09,2020-05-23,FEMALE,RESOLVED,No,No,No,No,No,No
14908,59202.0,Outbreak Associated,40-49,,,N/A - Outbreak associated,CONFIRMED,2020-06-18,2020-06-19,FEMALE,RESOLVED,No,No,No,No,No,No
14909,59203.0,Outbreak Associated,19 and younger,,,N/A - Outbreak associated,PROBABLE,2020-06-13,2020-06-13,MALE,RESOLVED,No,No,No,No,No,No
14910,59204.0,Outbreak Associated,50-59,,,N/A - Outbreak associated,CONFIRMED,2020-06-22,2020-06-23,FEMALE,RESOLVED,No,No,No,Yes,No,No


If you use **dropna(thresh=N)**, you keep only the rows that contain at least **N** non-NaN values
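
Because `thresh` is easy to misread, here is a minimal sketch on a hypothetical frame. In pandas, `thresh` counts the non-null values required for a row to be kept, not the number of NaNs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; non-null counts per row are 3, 1 and 0.
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan],
        "b": [2.0, 5.0, np.nan],
        "c": [3.0, np.nan, np.nan],
    }
)

kept_2 = toy.dropna(thresh=2)  # rows with at least 2 non-null values
kept_1 = toy.dropna(thresh=1)  # rows with at least 1 non-null value

print(list(kept_2.index), list(kept_1.index))
```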

In [15]:
shape_before = df.shape[0]
shape_curr = df.dropna(thresh=2).shape[0]
print("{} rows were removed, which is equivalent to {}% of the dataset".format(shape_before-shape_curr, 100-(shape_curr*100/shape_before)))

1 rows were removed, which is equivalent to 0.006706008583691414% of the dataset


If you use **dropna(subset=[...])**, you choose which columns are considered when dropping rows with missing values

In [16]:
df.head()

Unnamed: 0,_id,Outbreak Associated,Age Group,Neighbourhood Name,FSA,Source of Infection,Classification,Episode Date,Reported Date,Client Gender,Outcome,Currently Hospitalized,Currently in ICU,Currently Intubated,Ever Hospitalized,Ever in ICU,Ever Intubated,Nulo
0,44294.0,Sporadic,50-59,Malvern,M1B,Institutional,CONFIRMED,2020-03-25,2020-03-27,MALE,RESOLVED,No,No,No,No,No,No,
1,44295.0,Sporadic,20-29,Malvern,M1B,Community,CONFIRMED,2020-03-20,2020-03-28,MALE,RESOLVED,No,No,No,Yes,No,No,
2,44296.0,Sporadic,60-69,Malvern,M1B,Travel,CONFIRMED,2020-03-04,2020-03-08,FEMALE,RESOLVED,No,No,No,Yes,Yes,Yes,
3,44297.0,Outbreak Associated,50-59,Rouge,M1B,N/A - Outbreak associated,CONFIRMED,2020-05-02,2020-05-04,FEMALE,RESOLVED,No,No,No,No,No,No,
4,44298.0,Sporadic,30-39,Rouge,M1B,Close contact,CONFIRMED,2020-05-31,2020-06-06,FEMALE,RESOLVED,No,No,No,No,No,No,


In [17]:
shape_before = df.shape[0]
shape_curr = df.dropna(subset=['Age Group', 'Neighbourhood Name']).shape[0]
print("{} rows were removed, which is equivalent to {}% of the dataset".format(shape_before-shape_curr, 100-(shape_curr*100/shape_before)))

638 rows were removed, which is equivalent to 4.278433476394852% of the dataset


## 3. Filling missing data

You can use **df.fillna()** to replace the missing values in the DataFrame

In [18]:
df.fillna("")

Unnamed: 0,_id,Outbreak Associated,Age Group,Neighbourhood Name,FSA,Source of Infection,Classification,Episode Date,Reported Date,Client Gender,Outcome,Currently Hospitalized,Currently in ICU,Currently Intubated,Ever Hospitalized,Ever in ICU,Ever Intubated,Nulo
0,44294.0,Sporadic,50-59,Malvern,M1B,Institutional,CONFIRMED,2020-03-25,2020-03-27,MALE,RESOLVED,No,No,No,No,No,No,
1,44295.0,Sporadic,20-29,Malvern,M1B,Community,CONFIRMED,2020-03-20,2020-03-28,MALE,RESOLVED,No,No,No,Yes,No,No,
2,44296.0,Sporadic,60-69,Malvern,M1B,Travel,CONFIRMED,2020-03-04,2020-03-08,FEMALE,RESOLVED,No,No,No,Yes,Yes,Yes,
3,44297.0,Outbreak Associated,50-59,Rouge,M1B,N/A - Outbreak associated,CONFIRMED,2020-05-02,2020-05-04,FEMALE,RESOLVED,No,No,No,No,No,No,
4,44298.0,Sporadic,30-39,Rouge,M1B,Close contact,CONFIRMED,2020-05-31,2020-06-06,FEMALE,RESOLVED,No,No,No,No,No,No,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14907,59201.0,Outbreak Associated,20-29,,,N/A - Outbreak associated,CONFIRMED,2020-05-09,2020-05-23,FEMALE,RESOLVED,No,No,No,No,No,No,
14908,59202.0,Outbreak Associated,40-49,,,N/A - Outbreak associated,CONFIRMED,2020-06-18,2020-06-19,FEMALE,RESOLVED,No,No,No,No,No,No,
14909,59203.0,Outbreak Associated,19 and younger,,,N/A - Outbreak associated,PROBABLE,2020-06-13,2020-06-13,MALE,RESOLVED,No,No,No,No,No,No,
14910,59204.0,Outbreak Associated,50-59,,,N/A - Outbreak associated,CONFIRMED,2020-06-22,2020-06-23,FEMALE,RESOLVED,No,No,No,Yes,No,No,


You can pass a dictionary to **df.fillna(dic)** to use a different fill value for each column

In [19]:
df.head()

Unnamed: 0,_id,Outbreak Associated,Age Group,Neighbourhood Name,FSA,Source of Infection,Classification,Episode Date,Reported Date,Client Gender,Outcome,Currently Hospitalized,Currently in ICU,Currently Intubated,Ever Hospitalized,Ever in ICU,Ever Intubated,Nulo
0,44294.0,Sporadic,50-59,Malvern,M1B,Institutional,CONFIRMED,2020-03-25,2020-03-27,MALE,RESOLVED,No,No,No,No,No,No,
1,44295.0,Sporadic,20-29,Malvern,M1B,Community,CONFIRMED,2020-03-20,2020-03-28,MALE,RESOLVED,No,No,No,Yes,No,No,
2,44296.0,Sporadic,60-69,Malvern,M1B,Travel,CONFIRMED,2020-03-04,2020-03-08,FEMALE,RESOLVED,No,No,No,Yes,Yes,Yes,
3,44297.0,Outbreak Associated,50-59,Rouge,M1B,N/A - Outbreak associated,CONFIRMED,2020-05-02,2020-05-04,FEMALE,RESOLVED,No,No,No,No,No,No,
4,44298.0,Sporadic,30-39,Rouge,M1B,Close contact,CONFIRMED,2020-05-31,2020-06-06,FEMALE,RESOLVED,No,No,No,No,No,No,


In [20]:
df.fillna(value={'FSA':'Not available', 'Classification':'No status'})

Unnamed: 0,_id,Outbreak Associated,Age Group,Neighbourhood Name,FSA,Source of Infection,Classification,Episode Date,Reported Date,Client Gender,Outcome,Currently Hospitalized,Currently in ICU,Currently Intubated,Ever Hospitalized,Ever in ICU,Ever Intubated,Nulo
0,44294.0,Sporadic,50-59,Malvern,M1B,Institutional,CONFIRMED,2020-03-25,2020-03-27,MALE,RESOLVED,No,No,No,No,No,No,
1,44295.0,Sporadic,20-29,Malvern,M1B,Community,CONFIRMED,2020-03-20,2020-03-28,MALE,RESOLVED,No,No,No,Yes,No,No,
2,44296.0,Sporadic,60-69,Malvern,M1B,Travel,CONFIRMED,2020-03-04,2020-03-08,FEMALE,RESOLVED,No,No,No,Yes,Yes,Yes,
3,44297.0,Outbreak Associated,50-59,Rouge,M1B,N/A - Outbreak associated,CONFIRMED,2020-05-02,2020-05-04,FEMALE,RESOLVED,No,No,No,No,No,No,
4,44298.0,Sporadic,30-39,Rouge,M1B,Close contact,CONFIRMED,2020-05-31,2020-06-06,FEMALE,RESOLVED,No,No,No,No,No,No,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14907,59201.0,Outbreak Associated,20-29,,Not available,N/A - Outbreak associated,CONFIRMED,2020-05-09,2020-05-23,FEMALE,RESOLVED,No,No,No,No,No,No,
14908,59202.0,Outbreak Associated,40-49,,Not available,N/A - Outbreak associated,CONFIRMED,2020-06-18,2020-06-19,FEMALE,RESOLVED,No,No,No,No,No,No,
14909,59203.0,Outbreak Associated,19 and younger,,Not available,N/A - Outbreak associated,PROBABLE,2020-06-13,2020-06-13,MALE,RESOLVED,No,No,No,No,No,No,
14910,59204.0,Outbreak Associated,50-59,,Not available,N/A - Outbreak associated,CONFIRMED,2020-06-22,2020-06-23,FEMALE,RESOLVED,No,No,No,Yes,No,No,


**df.ffill()** (the older spelling **df.fillna(method='ffill')** is deprecated in recent pandas versions) fills each missing value with the last valid value that precedes it in the same column
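
To make the forward-fill behaviour concrete, the sketch below applies it to a hypothetical Series. The `limit` parameter caps how many consecutive missing values are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two gaps.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

filled = s.ffill()           # every gap takes the last valid value before it
limited = s.ffill(limit=1)   # at most one consecutive NaN is filled per gap

print(filled.tolist())
print(limited.tolist())
```

Forward fill assumes that neighbouring rows are related (for example, a time-ordered series); on unordered records such as individual cases it copies values between unrelated rows.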

In [21]:
df.tail()

Unnamed: 0,_id,Outbreak Associated,Age Group,Neighbourhood Name,FSA,Source of Infection,Classification,Episode Date,Reported Date,Client Gender,Outcome,Currently Hospitalized,Currently in ICU,Currently Intubated,Ever Hospitalized,Ever in ICU,Ever Intubated,Nulo
14907,59201.0,Outbreak Associated,20-29,,,N/A - Outbreak associated,CONFIRMED,2020-05-09,2020-05-23,FEMALE,RESOLVED,No,No,No,No,No,No,
14908,59202.0,Outbreak Associated,40-49,,,N/A - Outbreak associated,CONFIRMED,2020-06-18,2020-06-19,FEMALE,RESOLVED,No,No,No,No,No,No,
14909,59203.0,Outbreak Associated,19 and younger,,,N/A - Outbreak associated,PROBABLE,2020-06-13,2020-06-13,MALE,RESOLVED,No,No,No,No,No,No,
14910,59204.0,Outbreak Associated,50-59,,,N/A - Outbreak associated,CONFIRMED,2020-06-22,2020-06-23,FEMALE,RESOLVED,No,No,No,Yes,No,No,
14911,,,,,,,,,,,,,,,,,,


In [22]:
# df.ffill() replaces the deprecated df.fillna(method='ffill')
df.ffill().tail()

Unnamed: 0,_id,Outbreak Associated,Age Group,Neighbourhood Name,FSA,Source of Infection,Classification,Episode Date,Reported Date,Client Gender,Outcome,Currently Hospitalized,Currently in ICU,Currently Intubated,Ever Hospitalized,Ever in ICU,Ever Intubated,Nulo
14907,59201.0,Outbreak Associated,20-29,West Humber-Clairville,M9C,N/A - Outbreak associated,CONFIRMED,2020-05-09,2020-05-23,FEMALE,RESOLVED,No,No,No,No,No,No,
14908,59202.0,Outbreak Associated,40-49,West Humber-Clairville,M9C,N/A - Outbreak associated,CONFIRMED,2020-06-18,2020-06-19,FEMALE,RESOLVED,No,No,No,No,No,No,
14909,59203.0,Outbreak Associated,19 and younger,West Humber-Clairville,M9C,N/A - Outbreak associated,PROBABLE,2020-06-13,2020-06-13,MALE,RESOLVED,No,No,No,No,No,No,
14910,59204.0,Outbreak Associated,50-59,West Humber-Clairville,M9C,N/A - Outbreak associated,CONFIRMED,2020-06-22,2020-06-23,FEMALE,RESOLVED,No,No,No,Yes,No,No,
14911,59204.0,Outbreak Associated,50-59,West Humber-Clairville,M9C,N/A - Outbreak associated,CONFIRMED,2020-06-22,2020-06-23,FEMALE,RESOLVED,No,No,No,Yes,No,No,


If you want to fill missing values with a **mean, mode, or median**, you can use, for example, **df.fillna(df.mean(numeric_only=True))**
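
The mean only makes sense for numeric columns. As a hedged illustration on a hypothetical frame, the sketch below fills a numeric column with its median and a categorical column with its mode, combining them through the dictionary form of `fillna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column.
toy = pd.DataFrame(
    {
        "day": [1.0, 2.0, np.nan, 10.0],
        "gender": ["FEMALE", None, "FEMALE", "MALE"],
    }
)

fill_values = {
    "day": toy["day"].median(),         # median ignores NaN by default
    "gender": toy["gender"].mode()[0],  # mode() returns a Series; take the first
}
filled = toy.fillna(value=fill_values)
print(filled)
```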

In [23]:
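# numeric_only=True restricts the mean to numeric columns; pandas 2.0 and later
# raise an error when text columns are included.
df.mean(numeric_only=True)
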
_id     51749.0
Nulo        NaN
dtype: float64

In [25]:
df.fillna(df.mean(numeric_only=True))

Unnamed: 0,_id,Outbreak Associated,Age Group,Neighbourhood Name,FSA,Source of Infection,Classification,Episode Date,Reported Date,Client Gender,Outcome,Currently Hospitalized,Currently in ICU,Currently Intubated,Ever Hospitalized,Ever in ICU,Ever Intubated,Nulo
0,44294.0,Sporadic,50-59,Malvern,M1B,Institutional,CONFIRMED,2020-03-25,2020-03-27,MALE,RESOLVED,No,No,No,No,No,No,
1,44295.0,Sporadic,20-29,Malvern,M1B,Community,CONFIRMED,2020-03-20,2020-03-28,MALE,RESOLVED,No,No,No,Yes,No,No,
2,44296.0,Sporadic,60-69,Malvern,M1B,Travel,CONFIRMED,2020-03-04,2020-03-08,FEMALE,RESOLVED,No,No,No,Yes,Yes,Yes,
3,44297.0,Outbreak Associated,50-59,Rouge,M1B,N/A - Outbreak associated,CONFIRMED,2020-05-02,2020-05-04,FEMALE,RESOLVED,No,No,No,No,No,No,
4,44298.0,Sporadic,30-39,Rouge,M1B,Close contact,CONFIRMED,2020-05-31,2020-06-06,FEMALE,RESOLVED,No,No,No,No,No,No,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14907,59201.0,Outbreak Associated,20-29,,,N/A - Outbreak associated,CONFIRMED,2020-05-09,2020-05-23,FEMALE,RESOLVED,No,No,No,No,No,No,
14908,59202.0,Outbreak Associated,40-49,,,N/A - Outbreak associated,CONFIRMED,2020-06-18,2020-06-19,FEMALE,RESOLVED,No,No,No,No,No,No,
14909,59203.0,Outbreak Associated,19 and younger,,,N/A - Outbreak associated,PROBABLE,2020-06-13,2020-06-13,MALE,RESOLVED,No,No,No,No,No,No,
14910,59204.0,Outbreak Associated,50-59,,,N/A - Outbreak associated,CONFIRMED,2020-06-22,2020-06-23,FEMALE,RESOLVED,No,No,No,Yes,No,No,


## 4. Removing duplicate data

You can use **df.duplicated()** to check for duplicated rows

In [26]:
df.duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
14907    False
14908    False
14909    False
14910    False
14911    False
Length: 14912, dtype: bool

Selecting the duplicated rows

In [27]:
filter_duplicados = df.duplicated()
df[filter_duplicados]

Unnamed: 0,_id,Outbreak Associated,Age Group,Neighbourhood Name,FSA,Source of Infection,Classification,Episode Date,Reported Date,Client Gender,Outcome,Currently Hospitalized,Currently in ICU,Currently Intubated,Ever Hospitalized,Ever in ICU,Ever Intubated,Nulo


To remove duplicated rows, you can use **df.drop_duplicates()**

In [28]:
shape_before = df.shape[0]
shape_curr = df.drop_duplicates().shape[0]
print("{} rows were removed, which is equivalent to {}% of the dataset".format(shape_before-shape_curr, 100-(shape_curr*100/shape_before)))

0 rows were removed, which is equivalent to 0.0% of the dataset


By default, **df.drop_duplicates()** compares all columns when looking for duplicates; if you want to compare only a subset of them, use **df.drop_duplicates(subset=[...])**

In [29]:
df

Unnamed: 0,_id,Outbreak Associated,Age Group,Neighbourhood Name,FSA,Source of Infection,Classification,Episode Date,Reported Date,Client Gender,Outcome,Currently Hospitalized,Currently in ICU,Currently Intubated,Ever Hospitalized,Ever in ICU,Ever Intubated,Nulo
0,44294.0,Sporadic,50-59,Malvern,M1B,Institutional,CONFIRMED,2020-03-25,2020-03-27,MALE,RESOLVED,No,No,No,No,No,No,
1,44295.0,Sporadic,20-29,Malvern,M1B,Community,CONFIRMED,2020-03-20,2020-03-28,MALE,RESOLVED,No,No,No,Yes,No,No,
2,44296.0,Sporadic,60-69,Malvern,M1B,Travel,CONFIRMED,2020-03-04,2020-03-08,FEMALE,RESOLVED,No,No,No,Yes,Yes,Yes,
3,44297.0,Outbreak Associated,50-59,Rouge,M1B,N/A - Outbreak associated,CONFIRMED,2020-05-02,2020-05-04,FEMALE,RESOLVED,No,No,No,No,No,No,
4,44298.0,Sporadic,30-39,Rouge,M1B,Close contact,CONFIRMED,2020-05-31,2020-06-06,FEMALE,RESOLVED,No,No,No,No,No,No,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14907,59201.0,Outbreak Associated,20-29,,,N/A - Outbreak associated,CONFIRMED,2020-05-09,2020-05-23,FEMALE,RESOLVED,No,No,No,No,No,No,
14908,59202.0,Outbreak Associated,40-49,,,N/A - Outbreak associated,CONFIRMED,2020-06-18,2020-06-19,FEMALE,RESOLVED,No,No,No,No,No,No,
14909,59203.0,Outbreak Associated,19 and younger,,,N/A - Outbreak associated,PROBABLE,2020-06-13,2020-06-13,MALE,RESOLVED,No,No,No,No,No,No,
14910,59204.0,Outbreak Associated,50-59,,,N/A - Outbreak associated,CONFIRMED,2020-06-22,2020-06-23,FEMALE,RESOLVED,No,No,No,Yes,No,No,


In [30]:
shape_before = df.shape[0]
shape_curr = df.drop_duplicates(subset=['FSA','Age Group','Outbreak Associated']).shape[0]
print("{} rows were removed, which is equivalent to {}% of the dataset".format(shape_before-shape_curr, 100-(shape_curr*100/shape_before)))

13507 rows were removed, which is equivalent to 90.57805793991416% of the dataset


You can use the **keep parameter** to choose which of the duplicated rows is kept: the first occurrence (`keep='first'`, the default) or the last one (`keep='last'`)
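
Since the row counts alone do not show which rows survive, here is a small sketch on a hypothetical frame comparing the options of `keep`. Besides `'first'` and `'last'`, pandas also accepts `keep=False`, which drops every member of a duplicated group.

```python
import pandas as pd

# Hypothetical toy frame: "M1B" appears three times.
toy = pd.DataFrame(
    {"FSA": ["M1B", "M1B", "M9V", "M1B"], "case": [1, 2, 3, 4]}
)

first = toy.drop_duplicates(subset=["FSA"], keep="first")
last = toy.drop_duplicates(subset=["FSA"], keep="last")
none = toy.drop_duplicates(subset=["FSA"], keep=False)

print(first["case"].tolist(), last["case"].tolist(), none["case"].tolist())
```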

In [31]:
df.tail()

Unnamed: 0,_id,Outbreak Associated,Age Group,Neighbourhood Name,FSA,Source of Infection,Classification,Episode Date,Reported Date,Client Gender,Outcome,Currently Hospitalized,Currently in ICU,Currently Intubated,Ever Hospitalized,Ever in ICU,Ever Intubated,Nulo
14907,59201.0,Outbreak Associated,20-29,,,N/A - Outbreak associated,CONFIRMED,2020-05-09,2020-05-23,FEMALE,RESOLVED,No,No,No,No,No,No,
14908,59202.0,Outbreak Associated,40-49,,,N/A - Outbreak associated,CONFIRMED,2020-06-18,2020-06-19,FEMALE,RESOLVED,No,No,No,No,No,No,
14909,59203.0,Outbreak Associated,19 and younger,,,N/A - Outbreak associated,PROBABLE,2020-06-13,2020-06-13,MALE,RESOLVED,No,No,No,No,No,No,
14910,59204.0,Outbreak Associated,50-59,,,N/A - Outbreak associated,CONFIRMED,2020-06-22,2020-06-23,FEMALE,RESOLVED,No,No,No,Yes,No,No,
14911,,,,,,,,,,,,,,,,,,


In [32]:
shape_before = df.shape[0]
shape_curr = df.drop_duplicates(subset=['FSA','Age Group','Outbreak Associated'], keep='last').shape[0]
print("{} rows were removed, which is equivalent to {}% of the dataset".format(shape_before-shape_curr, 100-(shape_curr*100/shape_before)))

13507 rows were removed, which is equivalent to 90.57805793991416% of the dataset


In [33]:
df.drop_duplicates(subset=['FSA','Age Group','Outbreak Associated'], keep='first')

Unnamed: 0,_id,Outbreak Associated,Age Group,Neighbourhood Name,FSA,Source of Infection,Classification,Episode Date,Reported Date,Client Gender,Outcome,Currently Hospitalized,Currently in ICU,Currently Intubated,Ever Hospitalized,Ever in ICU,Ever Intubated,Nulo
0,44294.0,Sporadic,50-59,Malvern,M1B,Institutional,CONFIRMED,2020-03-25,2020-03-27,MALE,RESOLVED,No,No,No,No,No,No,
1,44295.0,Sporadic,20-29,Malvern,M1B,Community,CONFIRMED,2020-03-20,2020-03-28,MALE,RESOLVED,No,No,No,Yes,No,No,
2,44296.0,Sporadic,60-69,Malvern,M1B,Travel,CONFIRMED,2020-03-04,2020-03-08,FEMALE,RESOLVED,No,No,No,Yes,Yes,Yes,
3,44297.0,Outbreak Associated,50-59,Rouge,M1B,N/A - Outbreak associated,CONFIRMED,2020-05-02,2020-05-04,FEMALE,RESOLVED,No,No,No,No,No,No,
4,44298.0,Sporadic,30-39,Rouge,M1B,Close contact,CONFIRMED,2020-05-31,2020-06-06,FEMALE,RESOLVED,No,No,No,No,No,No,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14435,58729.0,Sporadic,80-89,,,Travel,CONFIRMED,2020-03-27,2020-04-03,MALE,FATAL,No,No,No,Yes,No,No,
14442,58736.0,Sporadic,,,,Community,PROBABLE,2020-03-20,2020-03-20,MALE,RESOLVED,No,No,No,No,No,No,
14682,58976.0,Sporadic,90+,,,Institutional,CONFIRMED,2020-05-12,2020-05-08,MALE,RESOLVED,No,No,No,No,No,No,
14769,59063.0,Outbreak Associated,,,,N/A - Outbreak associated,CONFIRMED,2020-05-11,2020-05-15,MALE,RESOLVED,No,No,No,No,No,No,


## 5. Formatting data with map

You can use the **map method** to transform data using a function or a mapping defined by a dictionary
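
As a minimal sketch of the dictionary form of `Series.map`, the example below converts hypothetical Yes/No answers to booleans. Values that are absent from the dictionary (the made-up "Unknown" here) become NaN.

```python
import pandas as pd

# Hypothetical answers; "Unknown" is deliberately missing from the mapping.
answers = pd.Series(["Yes", "No", "No", "Unknown"])
yes_no = {"Yes": True, "No": False}

mapped = answers.map(yes_no)
print(mapped)
```

This behaviour is worth keeping in mind: an incomplete dictionary silently introduces missing values.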

In [34]:
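# pd.notna(x) is a safer missing-value test than "x is not np.NaN";
# the np.NaN and np.NAN aliases were removed in NumPy 2.0.
df['year'] = df['Reported Date'].map(lambda x: str(x)[0:4] if pd.notna(x) else np.nan)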

In [35]:
df.head()

Unnamed: 0,_id,Outbreak Associated,Age Group,Neighbourhood Name,FSA,Source of Infection,Classification,Episode Date,Reported Date,Client Gender,Outcome,Currently Hospitalized,Currently in ICU,Currently Intubated,Ever Hospitalized,Ever in ICU,Ever Intubated,Nulo,year
0,44294.0,Sporadic,50-59,Malvern,M1B,Institutional,CONFIRMED,2020-03-25,2020-03-27,MALE,RESOLVED,No,No,No,No,No,No,,2020
1,44295.0,Sporadic,20-29,Malvern,M1B,Community,CONFIRMED,2020-03-20,2020-03-28,MALE,RESOLVED,No,No,No,Yes,No,No,,2020
2,44296.0,Sporadic,60-69,Malvern,M1B,Travel,CONFIRMED,2020-03-04,2020-03-08,FEMALE,RESOLVED,No,No,No,Yes,Yes,Yes,,2020
3,44297.0,Outbreak Associated,50-59,Rouge,M1B,N/A - Outbreak associated,CONFIRMED,2020-05-02,2020-05-04,FEMALE,RESOLVED,No,No,No,No,No,No,,2020
4,44298.0,Sporadic,30-39,Rouge,M1B,Close contact,CONFIRMED,2020-05-31,2020-06-06,FEMALE,RESOLVED,No,No,No,No,No,No,,2020


In [36]:
# The last two characters of 'YYYY-MM-DD' are the day of the month.
df['day'] = df['Reported Date'].map(lambda x: int(str(x)[-2:]) if pd.notna(x) else np.nan)

In [37]:
df.head()

Unnamed: 0,_id,Outbreak Associated,Age Group,Neighbourhood Name,FSA,Source of Infection,Classification,Episode Date,Reported Date,Client Gender,Outcome,Currently Hospitalized,Currently in ICU,Currently Intubated,Ever Hospitalized,Ever in ICU,Ever Intubated,Nulo,year,day
0,44294.0,Sporadic,50-59,Malvern,M1B,Institutional,CONFIRMED,2020-03-25,2020-03-27,MALE,RESOLVED,No,No,No,No,No,No,,2020,27.0
1,44295.0,Sporadic,20-29,Malvern,M1B,Community,CONFIRMED,2020-03-20,2020-03-28,MALE,RESOLVED,No,No,No,Yes,No,No,,2020,28.0
2,44296.0,Sporadic,60-69,Malvern,M1B,Travel,CONFIRMED,2020-03-04,2020-03-08,FEMALE,RESOLVED,No,No,No,Yes,Yes,Yes,,2020,8.0
3,44297.0,Outbreak Associated,50-59,Rouge,M1B,N/A - Outbreak associated,CONFIRMED,2020-05-02,2020-05-04,FEMALE,RESOLVED,No,No,No,No,No,No,,2020,4.0
4,44298.0,Sporadic,30-39,Rouge,M1B,Close contact,CONFIRMED,2020-05-31,2020-06-06,FEMALE,RESOLVED,No,No,No,No,No,No,,2020,6.0


In [38]:
string = 'name'

In [39]:
string.upper()

'NAME'

In [40]:
def funcao(x):
    # Upper-case sporadic cases, lower-case outbreak cases, NaN for anything else
    if x == "Sporadic":
        return str(x).upper()
    elif x == 'Outbreak Associated':
        return str(x).lower()
    else:
        return np.nan

In [41]:
df['Outbreak Associated'].map(funcao)

0                   SPORADIC
1                   SPORADIC
2                   SPORADIC
3        outbreak associated
4                   SPORADIC
                ...         
14907    outbreak associated
14908    outbreak associated
14909    outbreak associated
14910    outbreak associated
14911                    NaN
Name: Outbreak Associated, Length: 14912, dtype: object

Filling missing data with **fillna** is a special case of more general value replacement. As we have just seen, **map** can modify a subset of values, but **replace** offers a simpler and more flexible way to do so
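
The sketch below illustrates two common forms of `replace` on a hypothetical Series (the category names are made up): a list of values collapsed into one replacement, and a dictionary of old-to-new pairs. Unlike `map` with a dictionary, values not mentioned are left unchanged.

```python
import pandas as pd

# Hypothetical categories, not taken from the dataset.
gender = pd.Series(["MALE", "FEMALE", "UNKNOWN", "NOT LISTED"])

# Several values collapsed into a single replacement.
merged = gender.replace(["UNKNOWN", "NOT LISTED"], "OTHER")

# One replacement per value, given as a dictionary.
renamed = gender.replace({"MALE": "M", "FEMALE": "F"})

print(merged.tolist())
print(renamed.tolist())
```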

In [42]:
df

Unnamed: 0,_id,Outbreak Associated,Age Group,Neighbourhood Name,FSA,Source of Infection,Classification,Episode Date,Reported Date,Client Gender,Outcome,Currently Hospitalized,Currently in ICU,Currently Intubated,Ever Hospitalized,Ever in ICU,Ever Intubated,Nulo,year,day
0,44294.0,Sporadic,50-59,Malvern,M1B,Institutional,CONFIRMED,2020-03-25,2020-03-27,MALE,RESOLVED,No,No,No,No,No,No,,2020,27.0
1,44295.0,Sporadic,20-29,Malvern,M1B,Community,CONFIRMED,2020-03-20,2020-03-28,MALE,RESOLVED,No,No,No,Yes,No,No,,2020,28.0
2,44296.0,Sporadic,60-69,Malvern,M1B,Travel,CONFIRMED,2020-03-04,2020-03-08,FEMALE,RESOLVED,No,No,No,Yes,Yes,Yes,,2020,8.0
3,44297.0,Outbreak Associated,50-59,Rouge,M1B,N/A - Outbreak associated,CONFIRMED,2020-05-02,2020-05-04,FEMALE,RESOLVED,No,No,No,No,No,No,,2020,4.0
4,44298.0,Sporadic,30-39,Rouge,M1B,Close contact,CONFIRMED,2020-05-31,2020-06-06,FEMALE,RESOLVED,No,No,No,No,No,No,,2020,6.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14907,59201.0,Outbreak Associated,20-29,,,N/A - Outbreak associated,CONFIRMED,2020-05-09,2020-05-23,FEMALE,RESOLVED,No,No,No,No,No,No,,2020,23.0
14908,59202.0,Outbreak Associated,40-49,,,N/A - Outbreak associated,CONFIRMED,2020-06-18,2020-06-19,FEMALE,RESOLVED,No,No,No,No,No,No,,2020,19.0
14909,59203.0,Outbreak Associated,19 and younger,,,N/A - Outbreak associated,PROBABLE,2020-06-13,2020-06-13,MALE,RESOLVED,No,No,No,No,No,No,,2020,13.0
14910,59204.0,Outbreak Associated,50-59,,,N/A - Outbreak associated,CONFIRMED,2020-06-22,2020-06-23,FEMALE,RESOLVED,No,No,No,Yes,No,No,,2020,23.0


In [43]:
df.replace(np.nan, -999)

Unnamed: 0,_id,Outbreak Associated,Age Group,Neighbourhood Name,FSA,Source of Infection,Classification,Episode Date,Reported Date,Client Gender,Outcome,Currently Hospitalized,Currently in ICU,Currently Intubated,Ever Hospitalized,Ever in ICU,Ever Intubated,Nulo,year,day
0,44294.0,Sporadic,50-59,Malvern,M1B,Institutional,CONFIRMED,2020-03-25,2020-03-27,MALE,RESOLVED,No,No,No,No,No,No,-999.0,2020,27.0
1,44295.0,Sporadic,20-29,Malvern,M1B,Community,CONFIRMED,2020-03-20,2020-03-28,MALE,RESOLVED,No,No,No,Yes,No,No,-999.0,2020,28.0
2,44296.0,Sporadic,60-69,Malvern,M1B,Travel,CONFIRMED,2020-03-04,2020-03-08,FEMALE,RESOLVED,No,No,No,Yes,Yes,Yes,-999.0,2020,8.0
3,44297.0,Outbreak Associated,50-59,Rouge,M1B,N/A - Outbreak associated,CONFIRMED,2020-05-02,2020-05-04,FEMALE,RESOLVED,No,No,No,No,No,No,-999.0,2020,4.0
4,44298.0,Sporadic,30-39,Rouge,M1B,Close contact,CONFIRMED,2020-05-31,2020-06-06,FEMALE,RESOLVED,No,No,No,No,No,No,-999.0,2020,6.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14907,59201.0,Outbreak Associated,20-29,-999,-999,N/A - Outbreak associated,CONFIRMED,2020-05-09,2020-05-23,FEMALE,RESOLVED,No,No,No,No,No,No,-999.0,2020,23.0
14908,59202.0,Outbreak Associated,40-49,-999,-999,N/A - Outbreak associated,CONFIRMED,2020-06-18,2020-06-19,FEMALE,RESOLVED,No,No,No,No,No,No,-999.0,2020,19.0
14909,59203.0,Outbreak Associated,19 and younger,-999,-999,N/A - Outbreak associated,PROBABLE,2020-06-13,2020-06-13,MALE,RESOLVED,No,No,No,No,No,No,-999.0,2020,13.0
14910,59204.0,Outbreak Associated,50-59,-999,-999,N/A - Outbreak associated,CONFIRMED,2020-06-22,2020-06-23,FEMALE,RESOLVED,No,No,No,Yes,No,No,-999.0,2020,23.0


In [44]:
df.replace([np.nan, 0, 20], [1, 2, -999])

Unnamed: 0,_id,Outbreak Associated,Age Group,Neighbourhood Name,FSA,Source of Infection,Classification,Episode Date,Reported Date,Client Gender,Outcome,Currently Hospitalized,Currently in ICU,Currently Intubated,Ever Hospitalized,Ever in ICU,Ever Intubated,Nulo,year,day
0,44294.0,Sporadic,50-59,Malvern,M1B,Institutional,CONFIRMED,2020-03-25,2020-03-27,MALE,RESOLVED,No,No,No,No,No,No,1.0,2020,27.0
1,44295.0,Sporadic,20-29,Malvern,M1B,Community,CONFIRMED,2020-03-20,2020-03-28,MALE,RESOLVED,No,No,No,Yes,No,No,1.0,2020,28.0
2,44296.0,Sporadic,60-69,Malvern,M1B,Travel,CONFIRMED,2020-03-04,2020-03-08,FEMALE,RESOLVED,No,No,No,Yes,Yes,Yes,1.0,2020,8.0
3,44297.0,Outbreak Associated,50-59,Rouge,M1B,N/A - Outbreak associated,CONFIRMED,2020-05-02,2020-05-04,FEMALE,RESOLVED,No,No,No,No,No,No,1.0,2020,4.0
4,44298.0,Sporadic,30-39,Rouge,M1B,Close contact,CONFIRMED,2020-05-31,2020-06-06,FEMALE,RESOLVED,No,No,No,No,No,No,1.0,2020,6.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14907,59201.0,Outbreak Associated,20-29,1,1,N/A - Outbreak associated,CONFIRMED,2020-05-09,2020-05-23,FEMALE,RESOLVED,No,No,No,No,No,No,1.0,2020,23.0
14908,59202.0,Outbreak Associated,40-49,1,1,N/A - Outbreak associated,CONFIRMED,2020-06-18,2020-06-19,FEMALE,RESOLVED,No,No,No,No,No,No,1.0,2020,19.0
14909,59203.0,Outbreak Associated,19 and younger,1,1,N/A - Outbreak associated,PROBABLE,2020-06-13,2020-06-13,MALE,RESOLVED,No,No,No,No,No,No,1.0,2020,13.0
14910,59204.0,Outbreak Associated,50-59,1,1,N/A - Outbreak associated,CONFIRMED,2020-06-22,2020-06-23,FEMALE,RESOLVED,No,No,No,Yes,No,No,1.0,2020,23.0


## 6. Discretizing data

You can use **pd.cut** to discretize continuous data, that is, to separate it into bins for analysis
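
To show how bin edges are treated, here is a minimal sketch on hypothetical day values. With the default `right=True`, each interval is open on the left and closed on the right, so a value equal to an edge falls into the lower bin. The label names are placeholders.

```python
import pandas as pd

# Hypothetical days chosen to sit on and next to the bin edges.
days = pd.Series([1, 10, 11, 20, 21, 31])
bins = [0, 10, 20, 31]

binned = pd.cut(days, bins=bins)  # intervals (0, 10], (10, 20], (20, 31]
labelled = pd.cut(days, bins=bins, labels=["start", "middle", "end"])

print(binned.tolist())
print(labelled.tolist())
```

A value of 0 would fall outside the first interval and become NaN unless `include_lowest=True` is passed.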

In [45]:
df['Reported Date'].dtype

dtype('O')

In [46]:
df['Reported Date'] = pd.to_datetime(df['Reported Date'])

In [47]:
df['day'] = df['Reported Date'].dt.day
df['month'] = df['Reported Date'].dt.month
df['year'] = df['Reported Date'].dt.year

In [48]:
df['day'].describe()

count    14911.000000
mean        15.981289
std          8.897289
min          1.000000
25%          8.000000
50%         16.000000
75%         24.000000
max         31.000000
Name: day, dtype: float64

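With the default `right=True`, **pd.cut** builds right-closed intervals, so the days of the month are split into:

- (0, 10]
- (10, 20]
- (20, 31]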

In [49]:
bins = [0, 10, 20, 31]
df['periodo_mes'] = pd.cut(df['day'], bins=bins, labels=['inicio_do_mes', 'meio_do_mes', 'final_do_mes'])

In [50]:
df.groupby('periodo_mes').agg({'day':['mean', 'std']})

periodo_mes,day (mean),day (std)
inicio_do_mes,5.371296,2.932437
meio_do_mes,15.802108,2.814188
final_do_mes,26.011319,3.067379


In [51]:
df.head()

Unnamed: 0,_id,Outbreak Associated,Age Group,Neighbourhood Name,FSA,Source of Infection,Classification,Episode Date,Reported Date,Client Gender,...,Currently in ICU,Currently Intubated,Ever Hospitalized,Ever in ICU,Ever Intubated,Nulo,year,day,month,periodo_mes
0,44294.0,Sporadic,50-59,Malvern,M1B,Institutional,CONFIRMED,2020-03-25,2020-03-27,MALE,...,No,No,No,No,No,,2020.0,27.0,3.0,final_do_mes
1,44295.0,Sporadic,20-29,Malvern,M1B,Community,CONFIRMED,2020-03-20,2020-03-28,MALE,...,No,No,Yes,No,No,,2020.0,28.0,3.0,final_do_mes
2,44296.0,Sporadic,60-69,Malvern,M1B,Travel,CONFIRMED,2020-03-04,2020-03-08,FEMALE,...,No,No,Yes,Yes,Yes,,2020.0,8.0,3.0,inicio_do_mes
3,44297.0,Outbreak Associated,50-59,Rouge,M1B,N/A - Outbreak associated,CONFIRMED,2020-05-02,2020-05-04,FEMALE,...,No,No,No,No,No,,2020.0,4.0,5.0,inicio_do_mes
4,44298.0,Sporadic,30-39,Rouge,M1B,Close contact,CONFIRMED,2020-05-31,2020-06-06,FEMALE,...,No,No,No,No,No,,2020.0,6.0,6.0,inicio_do_mes


## 7. Filtering outliers

In [52]:
df[df['day'] < 20].index

Int64Index([    2,     3,     4,     5,     6,     7,     8,     9,    13,
               14,
            ...
            14897, 14899, 14900, 14902, 14903, 14904, 14905, 14906, 14908,
            14909],
           dtype='int64', length=9156)

In [53]:
shape_before = df.shape[0]
# No day of the month exceeds 200, so this filter drops nothing
shape_curr = df.drop(df[df['day'] > 200].index).shape[0]
print("{} rows were removed, which is equivalent to {}% of the dataset".format(shape_before-shape_curr, 100-(shape_curr*100/shape_before)))

0 rows were removed, which is equivalent to 0.0% of the dataset
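
The fixed threshold above is arbitrary. One common alternative, which this notebook does not use, is the interquartile-range (IQR) rule. The sketch below applies it to hypothetical values, and the 1.5 multiplier is the conventional choice, assumed here.

```python
import pandas as pd

# Hypothetical values with one obvious outlier.
values = pd.Series([1, 2, 3, 4, 5, 100])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Keep only values within 1.5 * IQR of the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
filtered = values[values.between(lower, upper)]

print(filtered.tolist())
```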
