# **Data Cleaning**
Master efficient workflows for cleaning real-world, messy data.

https://www.kaggle.com/learn/data-cleaning

https://www.kaggle.com/alexisbcook/missing-values

---
**Author:** Marcos Bezerra

**GitHub:** [https://github.com/marcos-bezerra/Data_Cleaning_Kaggle](https://github.com/marcos-bezerra/Data_Cleaning_Kaggle)

**Google Drive:** [https://drive.google.com/marcos-bezerra/Data_Cleaning_Kaggle](https://drive.google.com/drive/folders/1eHyIT60C7-QV_DaCjMFHBLAaSxbl1ikN?usp=sharing)

**Version:** 1.0 - 13 Feb 2022

---

# **Lesson 01 - Handling Missing Values**

Data cleaning is a key part of data science, but it can be deeply frustrating. Why are some of your text fields garbled? What should you do about those missing values? Why aren't your dates formatted correctly? How can you quickly clean up inconsistent data entry? In this course, you'll learn why you've run into these problems and, more importantly, how to fix them!
You'll learn how to tackle some of the most common data cleaning problems so you can get to analyzing your data faster. You'll work through five hands-on exercises with real, messy data and answer some of the most frequently asked data cleaning questions. In this notebook, we'll look at how to deal with missing values.

**Take a first look at the data**
The first thing we need to do is load the libraries and the dataset we'll be using.
For demonstration, we'll use a dataset of events that occurred in American football games. In the exercise that follows, you'll apply your new skills to a dataset of building permits issued in San Francisco.


In [None]:
# Authorize access to Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
# set the path to the root of the working folder
import os
pathRaiz = '/content/drive/MyDrive/I2A2/Desafio_03_Kaggle_DataClean'
os.chdir(pathRaiz)
os.getcwd()

'/content/drive/MyDrive/I2A2/Desafio_03_Kaggle_DataClean'

In [None]:
ls -lah dataset

total 429M
-rw------- 1 root root 432K Feb 12 12:42  catalog.csv
-rw------- 1 root root  45M Feb 13 12:33  ks-projects-201612.csv
-rw------- 1 root root  47M Feb 13 13:48  ks-projects-201612-utf8.csv
-rw------- 1 root root 263M Sep 20  2019 'NFL Play by Play 2009-2017 (v4).csv'
-rw------- 1 root root  76M Feb  9 11:51 'NFL Play by Play 2009-2017 (v4).csv.zip'
-rw------- 1 root root 231K Feb 12 12:40  pakistan_intellectual_capital.csv


In [None]:
# modules we'll use
import pandas as pd
import numpy as np
# read in all our data
nfl_data = pd.read_csv("dataset/NFL Play by Play 2009-2017 (v4).csv")
# set seed for reproducibility
np.random.seed(42)



The first thing to do when you get a new dataset is take a look at some of it.

This lets you check that everything was read in correctly and gives you an idea of what's going on with the data.

In this case, let's see if there are any missing values, which will be represented with NaN or None.

In [None]:
# look at the first five rows of the nfl_data file. 
# I can see a handful of missing data already!
nfl_data.head()

Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,yrdln,yrdline100,ydstogo,ydsnet,GoalToGo,FirstDown,posteam,DefensiveTeam,desc,PlayAttempted,Yards.Gained,sp,Touchdown,ExPointResult,TwoPointConv,DefTwoPoint,Safety,Onsidekick,PuntResult,PlayType,Passer,Passer_ID,PassAttempt,PassOutcome,PassLength,AirYards,YardsAfterCatch,QBHit,PassLocation,InterceptionThrown,...,Accepted.Penalty,PenalizedTeam,PenaltyType,PenalizedPlayer,Penalty.Yards,PosTeamScore,DefTeamScore,ScoreDiff,AbsScoreDiff,HomeTeam,AwayTeam,Timeout_Indicator,Timeout_Team,posteam_timeouts_pre,HomeTimeouts_Remaining_Pre,AwayTimeouts_Remaining_Pre,HomeTimeouts_Remaining_Post,AwayTimeouts_Remaining_Post,No_Score_Prob,Opp_Field_Goal_Prob,Opp_Safety_Prob,Opp_Touchdown_Prob,Field_Goal_Prob,Safety_Prob,Touchdown_Prob,ExPoint_Prob,TwoPoint_Prob,ExpPts,EPA,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2009-09-10,2009091000,1,1,,15:00,15,3600.0,0.0,TEN,30.0,30.0,0,0,0.0,,PIT,TEN,R.Bironas kicks 67 yards from TEN 30 to PIT 3....,1,39,0,0,,,,0,0,,Kickoff,,,0,,,0,0,0,,0,...,0,,,,0,0.0,0.0,0.0,0.0,PIT,TEN,0,,3,3,3,3,3,0.001506,0.179749,0.006639,0.281138,0.2137,0.003592,0.313676,0.0,0.0,0.323526,2.014474,,,0.485675,0.514325,0.546433,0.453567,0.485675,0.060758,,,2009
1,2009-09-10,2009091000,1,1,1.0,14:53,15,3593.0,7.0,PIT,42.0,58.0,10,5,0.0,0.0,PIT,TEN,(14:53) B.Roethlisberger pass short left to H....,1,5,0,0,,,,0,0,,Pass,B.Roethlisberger,00-0022924,1,Complete,Short,-3,8,0,left,0,...,0,,,,0,0.0,0.0,0.0,0.0,PIT,TEN,0,,3,3,3,3,3,0.000969,0.108505,0.001061,0.169117,0.2937,0.003638,0.423011,0.0,0.0,2.338,0.077907,-1.068169,1.146076,0.546433,0.453567,0.551088,0.448912,0.546433,0.004655,-0.032244,0.036899,2009
2,2009-09-10,2009091000,1,1,2.0,14:16,15,3556.0,37.0,PIT,47.0,53.0,5,2,0.0,0.0,PIT,TEN,(14:16) W.Parker right end to PIT 44 for -3 ya...,1,-3,0,0,,,,0,0,,Run,,,0,,,0,0,0,,0,...,0,,,,0,0.0,0.0,0.0,0.0,PIT,TEN,0,,3,3,3,3,3,0.001057,0.105106,0.000981,0.162747,0.304805,0.003826,0.421478,0.0,0.0,2.415907,-1.40276,,,0.551088,0.448912,0.510793,0.489207,0.551088,-0.040295,,,2009
3,2009-09-10,2009091000,1,1,3.0,13:35,14,3515.0,41.0,PIT,44.0,56.0,8,2,0.0,0.0,PIT,TEN,(13:35) (Shotgun) B.Roethlisberger pass incomp...,1,0,0,0,,,,0,0,,Pass,B.Roethlisberger,00-0022924,1,Incomplete Pass,Deep,34,0,0,right,0,...,0,,,,0,0.0,0.0,0.0,0.0,PIT,TEN,0,,3,3,3,3,3,0.001434,0.149088,0.001944,0.234801,0.289336,0.004776,0.318621,0.0,0.0,1.013147,-1.712583,3.318841,-5.031425,0.510793,0.489207,0.461217,0.538783,0.510793,-0.049576,0.106663,-0.156239,2009
4,2009-09-10,2009091000,1,1,4.0,13:27,14,3507.0,8.0,PIT,44.0,56.0,8,2,0.0,1.0,PIT,TEN,(13:27) (Punt formation) D.Sepulveda punts 54 ...,1,0,0,0,,,,0,0,Clean,Punt,,,0,,,0,0,0,,0,...,0,,,,0,0.0,0.0,0.0,0.0,PIT,TEN,0,,3,3,3,3,3,0.001861,0.21348,0.003279,0.322262,0.244603,0.006404,0.208111,0.0,0.0,-0.699436,2.097796,,,0.461217,0.538783,0.558929,0.441071,0.461217,0.097712,,,2009
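The empty cells in the preview above are missing values. As a minimal sketch of how pandas treats them, the toy frame below (hypothetical data, not taken from the NFL file) contains one `NaN` and one `None`, and `isnull()` flags both as missing:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "yards": [5.0, np.nan, 12.0],            # numeric column with a NaN
    "passer": ["A.Smith", None, "B.Jones"],  # text column with a None
})

# isnull() returns a boolean mask; both NaN and None count as missing
missing_mask = toy.isnull()
print(missing_mask)
```

The same mask is what `nfl_data.isnull()` produces for the full dataset, just with many more rows and columns.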


# **How many missing data points do we have?**

In [None]:
# OK, now we know that we have some missing values. Let's see how many we have in each column.

# get the number of missing data points per column
missing_values_count = nfl_data.isnull().sum()

In [None]:
# look at the # of missing points in the first ten columns
missing_values_count[0:10]

Date                0
GameID              0
Drive               0
qtr                 0
down            61154
time              224
TimeUnder           0
TimeSecs          224
PlayTimeDiff      444
SideofField       528
dtype: int64

In [None]:
nfl_data.isnull().sum()

Date             0
GameID           0
Drive            0
qtr              0
down         61154
             ...  
Win_Prob     25009
WPA           5541
airWPA      248501
yacWPA      248762
Season           0
Length: 102, dtype: int64

In [None]:
# That seems like a lot! It might be helpful to see what percentage of the values
# in our dataset are missing, to give us a better sense of the scale
# of this problem:

# how many total missing values do we have?
total_cells = np.prod(nfl_data.shape)  # np.product was removed in NumPy 2.0
total_missing = missing_values_count.sum()

# percent of data that is missing
percent_missing = (total_missing/total_cells) * 100
print(percent_missing)

24.87214126835169


Wow, almost a quarter of the cells in this dataset are empty! In the next step, we'll take a closer look at some of the columns with missing values and try to figure out what might be going on with them.

**Figure out why the data is missing**

This is the point at which we get into the part of data science that I like to call "data intuition", by which I mean "really looking at your data and trying to figure out why it is the way it is and how that will affect your analysis". It can be a frustrating part of data science, especially if you're new to the field and don't have much experience. To deal with missing values, you'll need to use your intuition to figure out why the value is missing. One of the most important questions you can ask yourself to help figure this out is:

**Is this value missing because it wasn't recorded or because it doesn't exist?**

If a value is missing because it doesn't exist (like the height of the oldest child of someone who has no children), then it doesn't make sense to try to guess what it might be. You probably want to keep these values as NaN. On the other hand, if a value is missing because it wasn't recorded, you can try to guess what it might have been based on the other values in that column and row. This is called imputation, and we'll learn how to do it next! :)

**Let's work through an example.**

Looking at the number of missing values in the nfl_data dataframe, I notice that the column "TimeSecs" has 224 missing values:

In [None]:
# look at the # of missing points in the first ten columns
missing_values_count[0:10]

Date                0
GameID              0
Drive               0
qtr                 0
down            61154
time              224
TimeUnder           0
TimeSecs          224
PlayTimeDiff      444
SideofField       528
dtype: int64

Looking at the documentation, I can see that this column holds the number of seconds left in the game when the play was made. This means that these values are probably missing because they were not recorded, rather than because they don't exist. So it would make sense for us to try to guess what they should be rather than just leaving them as NAs.

On the other hand, there are other fields, like "PenalizedTeam", that also have a lot of missing values. In this case, though, the field is missing because if there was no penalty, it doesn't make sense to say which team was penalized. For this column, it would make more sense to either leave it empty or add a third value like "neither" and use that to replace the NAs.
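The two cases above call for different treatment per column. The sketch below uses a hypothetical four-row frame with the same column names. Linear interpolation is only one possible way to guess an unrecorded time; it is an assumption made here for illustration, not a method taken from the course:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "TimeSecs": [3600.0, 3593.0, np.nan, 3515.0],
    "PenalizedTeam": [None, "PIT", None, None],
})

cleaned = toy.copy()

# Not recorded: estimate the value from its neighbours
# (linear interpolation is an assumption, one of several options)
cleaned["TimeSecs"] = cleaned["TimeSecs"].interpolate()

# Does not exist: say so explicitly instead of guessing
cleaned["PenalizedTeam"] = cleaned["PenalizedTeam"].fillna("neither")

print(cleaned)
```

Working on a copy keeps the original frame available, so the imputed values can always be compared against what was actually recorded.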

***Tip: This is a great place to read the dataset documentation if you haven't already! If you're working with a dataset that you got from another person, you can also try reaching out to them for more information.***

If you're doing very careful data analysis, this is the point at which you'd look at each column individually to figure out the best strategy for filling in its missing values. For the rest of this notebook, we'll cover some "quick and dirty" techniques that can help you with missing values but will probably also end up removing some useful information or adding some noise to your data.

**Drop missing values**

If you're in a hurry or don't have a reason to figure out why your values are missing, one option is to simply remove any rows or columns that contain missing values. (Note: I generally don't recommend this approach for important projects! It's usually worth taking the time to go through your data and look at all the columns with missing values one by one to really get to know your dataset.)

If you're sure you want to drop rows with missing values, pandas has a handy function, dropna(), to help you do this. Let's try it out on our NFL dataset!


In [None]:
# remove all the rows that contain a missing value
nfl_data.dropna()

Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,yrdln,yrdline100,ydstogo,ydsnet,GoalToGo,FirstDown,posteam,DefensiveTeam,desc,PlayAttempted,Yards.Gained,sp,Touchdown,ExPointResult,TwoPointConv,DefTwoPoint,Safety,Onsidekick,PuntResult,PlayType,Passer,Passer_ID,PassAttempt,PassOutcome,PassLength,AirYards,YardsAfterCatch,QBHit,PassLocation,InterceptionThrown,...,Accepted.Penalty,PenalizedTeam,PenaltyType,PenalizedPlayer,Penalty.Yards,PosTeamScore,DefTeamScore,ScoreDiff,AbsScoreDiff,HomeTeam,AwayTeam,Timeout_Indicator,Timeout_Team,posteam_timeouts_pre,HomeTimeouts_Remaining_Pre,AwayTimeouts_Remaining_Pre,HomeTimeouts_Remaining_Post,AwayTimeouts_Remaining_Post,No_Score_Prob,Opp_Field_Goal_Prob,Opp_Safety_Prob,Opp_Touchdown_Prob,Field_Goal_Prob,Safety_Prob,Touchdown_Prob,ExPoint_Prob,TwoPoint_Prob,ExpPts,EPA,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season


In [None]:
# Oh dear, it looks like that removed all our data! 😱 This is because every row
# in our dataset had at least one missing value. We might have better
# luck removing all the columns that have at least one missing value instead.

# remove all columns with at least one missing value
columns_with_na_dropped = nfl_data.dropna(axis=1)
columns_with_na_dropped.head()

Unnamed: 0,Date,GameID,Drive,qtr,TimeUnder,ydstogo,ydsnet,PlayAttempted,Yards.Gained,sp,Touchdown,Safety,Onsidekick,PlayType,Passer_ID,PassAttempt,AirYards,YardsAfterCatch,QBHit,InterceptionThrown,Rusher_ID,RushAttempt,Receiver_ID,Reception,Fumble,Sack,Challenge.Replay,Accepted.Penalty,Penalty.Yards,HomeTeam,AwayTeam,Timeout_Indicator,Timeout_Team,posteam_timeouts_pre,HomeTimeouts_Remaining_Pre,AwayTimeouts_Remaining_Pre,HomeTimeouts_Remaining_Post,AwayTimeouts_Remaining_Post,ExPoint_Prob,TwoPoint_Prob,Season
0,2009-09-10,2009091000,1,1,15,0,0,1,39,0,0,0,0,Kickoff,,0,0,0,0,0,,0,,0,0,0,0,0,0,PIT,TEN,0,,3,3,3,3,3,0.0,0.0,2009
1,2009-09-10,2009091000,1,1,15,10,5,1,5,0,0,0,0,Pass,00-0022924,1,-3,8,0,0,,0,00-0017162,1,0,0,0,0,0,PIT,TEN,0,,3,3,3,3,3,0.0,0.0,2009
2,2009-09-10,2009091000,1,1,15,5,2,1,-3,0,0,0,0,Run,,0,0,0,0,0,00-0022250,1,,0,0,0,0,0,0,PIT,TEN,0,,3,3,3,3,3,0.0,0.0,2009
3,2009-09-10,2009091000,1,1,14,8,2,1,0,0,0,0,0,Pass,00-0022924,1,34,0,0,0,,0,00-0026901,0,0,0,0,0,0,PIT,TEN,0,,3,3,3,3,3,0.0,0.0,2009
4,2009-09-10,2009091000,1,1,14,8,2,1,0,0,0,0,0,Punt,,0,0,0,0,0,,0,,0,0,0,0,0,0,PIT,TEN,0,,3,3,3,3,3,0.0,0.0,2009


In [None]:
# just how much data did we lose?
print("Columns in original dataset: %d \n" % nfl_data.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])

Columns in original dataset: 102 

Columns with na's dropped: 41


We've lost quite a bit of data, but at this point we have successfully removed all the NaNs from our data.
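Dropping every column that has any missing value is the bluntest setting. As a hedged sketch of two gentler variants, `dropna()` also accepts `thresh` (the minimum number of non-missing values required to keep a row or column) and `subset` (check only the listed columns). The toy frame and the 50% cut-off below are hypothetical choices:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "GameID": [1, 1, 2, 2],
    "down": [1.0, np.nan, 2.0, 3.0],
    "PenaltyType": [None, None, None, "Holding"],
})

# Keep only columns where at least half of the values are present
min_non_missing = int(len(toy) * 0.5)
mostly_complete = toy.dropna(axis=1, thresh=min_non_missing)

# Drop rows only when a specific column of interest is missing
rows_with_down = toy.dropna(subset=["down"])

print(list(mostly_complete.columns))
print(len(rows_with_down))
```

Here "PenaltyType" is dropped because only one of its four values is present, while "down" survives despite having one gap.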

**Filling in missing values automatically**

Another option is to try to fill in the missing values. For this next part, I'm taking a small subsection of the NFL data so that it prints well.


In [None]:
# get a small subset of the NFL dataset
subset_nfl_data = nfl_data.loc[:,'EPA':'Season'].head()
subset_nfl_data

Unnamed: 0,EPA,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2.014474,,,0.485675,0.514325,0.546433,0.453567,0.485675,0.060758,,,2009
1,0.077907,-1.068169,1.146076,0.546433,0.453567,0.551088,0.448912,0.546433,0.004655,-0.032244,0.036899,2009
2,-1.40276,,,0.551088,0.448912,0.510793,0.489207,0.551088,-0.040295,,,2009
3,-1.712583,3.318841,-5.031425,0.510793,0.489207,0.461217,0.538783,0.510793,-0.049576,0.106663,-0.156239,2009
4,2.097796,,,0.461217,0.538783,0.558929,0.441071,0.461217,0.097712,,,2009


We can use the pandas fillna() function to fill in missing values in a dataframe for us. One option is to specify what we want the NaN values to be replaced with. Here, I'm saying that I would like to replace all the NaN values with 0.

In [None]:
# replace all NA's with 0
subset_nfl_data.fillna(0)

Unnamed: 0,EPA,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2.014474,0.0,0.0,0.485675,0.514325,0.546433,0.453567,0.485675,0.060758,0.0,0.0,2009
1,0.077907,-1.068169,1.146076,0.546433,0.453567,0.551088,0.448912,0.546433,0.004655,-0.032244,0.036899,2009
2,-1.40276,0.0,0.0,0.551088,0.448912,0.510793,0.489207,0.551088,-0.040295,0.0,0.0,2009
3,-1.712583,3.318841,-5.031425,0.510793,0.489207,0.461217,0.538783,0.510793,-0.049576,0.106663,-0.156239,2009
4,2.097796,0.0,0.0,0.461217,0.538783,0.558929,0.441071,0.461217,0.097712,0.0,0.0,2009
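A single replacement value rarely suits every column. `fillna()` also accepts a dictionary mapping column names to fill values, which the sketch below demonstrates on a hypothetical toy frame; the label "no pass" is an invented placeholder, not a value from the dataset:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "airEPA": [np.nan, -1.07, np.nan],
    "PassOutcome": [None, "Complete", None],
})

# One fill value per column; "no pass" is a hypothetical label
fill_values = {"airEPA": 0.0, "PassOutcome": "no pass"}
filled = toy.fillna(value=fill_values)

print(filled)
```

Like the call above, this returns a new dataframe and leaves the original untouched unless the result is assigned back.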


I could also be a bit more savvy and replace each missing value with whatever value comes directly after it in the same column. (This makes a lot of sense for datasets where the observations have some sort of logical order.)

In [None]:
# replace all NA's with the value that comes directly after it in the same column,
# then replace all the remaining NA's with 0
# (bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas)
subset_nfl_data.bfill(axis=0).fillna(0)

Unnamed: 0,EPA,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2.014474,-1.068169,1.146076,0.485675,0.514325,0.546433,0.453567,0.485675,0.060758,-0.032244,0.036899,2009
1,0.077907,-1.068169,1.146076,0.546433,0.453567,0.551088,0.448912,0.546433,0.004655,-0.032244,0.036899,2009
2,-1.40276,3.318841,-5.031425,0.551088,0.448912,0.510793,0.489207,0.551088,-0.040295,0.106663,-0.156239,2009
3,-1.712583,3.318841,-5.031425,0.510793,0.489207,0.461217,0.538783,0.510793,-0.049576,0.106663,-0.156239,2009
4,2.097796,0.0,0.0,0.461217,0.538783,0.558929,0.441071,0.461217,0.097712,0.0,0.0,2009
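Note that the last row above still ends up with 0, because there is no later value to copy from. The sketch below, on a hypothetical ordered column, contrasts backward fill with forward fill (copying the previous value instead), which may be the more natural choice when earlier observations are the better guess:

```python
import numpy as np
import pandas as pd

# Toy ordered data, invented for illustration only
ordered = pd.DataFrame({"score": [np.nan, 3.0, np.nan, np.nan, 7.0, np.nan]})

# Backward fill: take the next known value; the trailing gap stays NaN
backward = ordered["score"].bfill().fillna(0)

# Forward fill: take the previous known value; the leading gap stays NaN
forward = ordered["score"].ffill().fillna(0)

print(backward.tolist())
print(forward.tolist())
```

Either way, the final `fillna(0)` only touches the gaps at the edge of the column that the directional fill could not reach.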
