# 0. Contents <a name="Contents"></a>
1. [Importing libraries](#import)
2. [Building a DataFrame](#read)
3. [Identifying missing data](#identificando)
4. [Handling missing data](#tratando)
5. [Duplicate data](#duplicados)
6. [Mapping](#map)



# 1. Importing libraries <a name="import"></a>

<div style="text-align: right">

[Back to contents](#Contents)

</div>

In [1]:
import pandas as pd
import numpy as np

# 2. Building a DataFrame <a name="read"></a>
<div style="text-align: right">

[Back to contents](#Contents)

</div>

In [2]:
df = pd.DataFrame(
    np.random.randn(9, 4)*100,
    index=["A", "B", "C", "D", "E", "F", "G", "H", "I"],
    columns=["coluna1", "coluna2", "coluna3","coluna4"],
)
df

| | coluna1 | coluna2 | coluna3 | coluna4 |
|---|---|---|---|---|
| A | -32.366292 | 51.515425 | -118.332165 | 25.412576 |
| B | -38.483049 | -57.838634 | 89.205764 | -92.573976 |
| C | -61.54641 | 225.735532 | 43.534795 | -105.415982 |
| D | 68.847121 | -72.310971 | -14.423825 | -96.363953 |
| E | -18.87889 | 21.81289 | -39.944867 | -36.446658 |
| F | 89.066931 | -3.038381 | -80.294392 | -27.444039 |
| G | 41.146292 | -71.557678 | -118.516871 | -58.231884 |
| H | 26.829477 | -115.884829 | -23.938607 | -123.765576 |
| I | -51.922059 | 54.784925 | 118.000455 | 59.3746 |
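
Because `np.random.randn` draws fresh values on every run, the numbers shown here will not match a new execution. A minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator; the helper name `make_frame` and the seed value 42 are illustrative choices, not part of the original notebook:

```python
import numpy as np
import pandas as pd


def make_frame(seed):
    """Build a 9x4 frame of scaled standard-normal draws from a seeded generator."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        rng.standard_normal((9, 4)) * 100,
        index=list("ABCDEFGHI"),
        columns=["coluna1", "coluna2", "coluna3", "coluna4"],
    )


# The same seed always yields the same table.
df_seeded = make_frame(42)
print(df_seeded.shape)
```

Calling `make_frame` twice with the same seed returns identical frames, which makes the later manual checks repeatable.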


In [None]:
df.dtypes

## Changing data types

In [None]:
df['coluna1'] = df['coluna1'].astype(int)

In [None]:
df.dtypes

In [None]:
df['coluna3'] = df['coluna3'].astype(str)

In [None]:
df.dtypes

In [None]:
df
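
One detail worth checking when converting floats with `astype(int)`: the fractional part is dropped rather than rounded. A small sketch on a hypothetical Series (the values are made up for illustration):

```python
import pandas as pd

s = pd.Series([-32.9, 51.6, 89.2])

# astype(int) discards the fractional part (truncation toward zero).
truncated = s.astype(int)

# Rounding first gives the nearest integer instead.
rounded = s.round().astype(int)

print(truncated.tolist())
print(rounded.tolist())
```

If the nearest integer is what you want, round before converting.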

## Inserting missing values into the table

In [None]:
df.iloc[4,2]

In [None]:
df.iloc[4,2] = np.nan

In [None]:
df

In [None]:
df.iloc[1,0] = np.nan
df.iloc[4,0] = np.nan
df.iloc[3,0] = np.nan
df.iloc[8,0] = np.nan
df.iloc[6,0] = np.nan
df.iloc[4,3] = np.nan
df.iloc[8,3] = np.nan

In [None]:
df

In [None]:
# Data types may change after a missing value is inserted
df.dtypes
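
The dtype change happens because NumPy-backed integer columns cannot hold `NaN`, so pandas stores the data as floats instead. A minimal sketch on hypothetical Series, also showing pandas' nullable `"Int64"` extension dtype as one way to keep integers alongside missing values:

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])
with_nan = pd.Series([1, np.nan, 3])                  # NaN forces a float dtype
nullable = pd.Series([1, None, 3], dtype="Int64")     # nullable integer dtype

print(ints.dtype)
print(with_nan.dtype)
print(nullable.dtype)
```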

# 3. Identifying missing data <a name="identificando"></a>
<div style="text-align: right">

[Back to contents](#Contents)

</div>

In [None]:
df.isna()

In [None]:
# isnull is an alias of isna
df.isnull()

In [None]:
df['coluna1'].isna()

In [None]:
df[df['coluna1'].isna()]

In [None]:
df[~df['coluna1'].isna()]
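
Negating `isna()` with `~` keeps the rows that have a value. The `notna()` method expresses the same filter directly; a short sketch on a hypothetical Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0], index=["A", "B", "C"])

kept_with_tilde = s[~s.isna()]   # invert the "is missing" mask
kept_with_notna = s[s.notna()]   # ask for "is present" directly

print(kept_with_notna)
```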

In [None]:
df['coluna1']

In [None]:
df['coluna1'].isna()

In [None]:
df['coluna1'].isna().sum()

In [None]:
df.isna().sum()

In [None]:
df['coluna2']

In [None]:
df['coluna2'].isna().sum()

In [None]:
percentage = (df.isnull().sum() / len(df)) * 100
percentage
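
The count and the percentage can be combined into one summary table. A sketch, assuming a hypothetical helper named `missing_summary`; it relies on the fact that the mean of a boolean mask is the fraction of `True` values:

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return the number and percentage of missing values per column."""
    mask = frame.isna()
    return pd.DataFrame({"missing": mask.sum(), "percent": mask.mean() * 100})


example = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0, np.nan], "b": [1.0, 2.0, 3.0, 4.0]}
)
summary = missing_summary(example)
print(summary)
```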

# 4. Handling missing data <a name="tratando"></a>
<div style="text-align: right">

[Back to contents](#Contents)

</div>

In [None]:
df['coluna1']

## Replacing with 0

In [None]:
df['coluna1'].fillna(0)

## Replacing with the mean

In [None]:
df['coluna1']

In [None]:
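# Manual check of the mean: sum of the four non-missing values divided by 4
# (values from one particular random run; yours will differ)
(-23.0-132.0+94.0+0)/4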

In [None]:
df['coluna1'].mean()

In [None]:
med_col1 = df['coluna1'].mean()

In [None]:
df['coluna1'].fillna(med_col1)
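
`mean()` skips missing values by default, which is why the manual check divides by the number of present values only. A sketch on a hypothetical Series, reusing the four numbers from the manual calculation:

```python
import numpy as np
import pandas as pd

s = pd.Series([-23.0, np.nan, -132.0, 94.0, np.nan, 0.0])

# Manual mean: sum of the present values divided by how many are present.
manual_mean = s.dropna().sum() / s.notna().sum()

# fillna returns a new Series; the original is left unchanged.
filled = s.fillna(s.mean())

print(manual_mean, s.mean())
print(filled.tolist())
```

Assign the result back (for example `s = s.fillna(...)`) if the filled values should persist.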

## Replacing with the median

In [None]:
df['coluna1']

In [None]:
df['coluna1'].sort_values()

In [None]:
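# Manual check of the median: with an even number of values, average the two middle ones
# (values from one particular random run; yours will differ)
(-77-33)/2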

In [None]:
df['coluna1'].median()

In [None]:
mediana_col1 = df['coluna1'].median()

In [None]:
df['coluna1'].fillna(mediana_col1)
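
With an even number of present values, the median is the average of the two middle values after sorting. A sketch on a hypothetical Series whose middle values are the -77 and -33 used in the manual check:

```python
import numpy as np
import pandas as pd

s = pd.Series([-132.0, np.nan, -77.0, -33.0, np.nan, 94.0])

# Sort the present values and average the two in the middle.
present = s.dropna().sort_values().tolist()
manual_median = (present[1] + present[2]) / 2

filled = s.fillna(s.median())

print(manual_median, s.median())
```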

In [None]:
# Forward fill: propagate the last valid value into the missing entries that follow it
df['coluna1'].ffill()
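
Forward fill can only copy a value that appears earlier, so a missing value at the start of the Series stays missing. A sketch on a hypothetical Series, with a backward fill chained afterwards as one possible way to cover the leading gap:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

forward = s.ffill()          # the leading NaN has nothing before it to copy
both = s.ffill().bfill()     # backward fill then covers the leading gap

print(forward.tolist())
print(both.tolist())
```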

In [None]:
df['coluna1'].dropna()

In [None]:
# drop every row that has at least one NA
df.dropna()
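
Dropping every row with any missing value can discard a lot of data. `dropna` also accepts `how` and `subset` to narrow the rule; a sketch on a hypothetical frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"x": [1.0, np.nan, np.nan], "y": [1.0, 2.0, np.nan]},
    index=["A", "B", "C"],
)

any_na = frame.dropna()                  # drop rows with at least one NA
all_na = frame.dropna(how="all")         # drop only rows where every value is NA
subset_y = frame.dropna(subset=["y"])    # only look at column "y"

print(list(any_na.index), list(all_na.index), list(subset_y.index))
```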

# 5. Duplicate data <a name="duplicados"></a>
<div style="text-align: right">

[Back to contents](#Contents)

</div>

In [None]:
# Append a copy of rows D through H to create duplicates
df_dup = pd.concat([df, df.loc['D':'H', :]]).sort_index()
df_dup

In [None]:
df_dup.drop_duplicates()

In [None]:
df_dup.drop_duplicates(subset=['coluna1'])

In [None]:
df_dup.duplicated()

In [None]:
df_dup[df_dup.duplicated()]
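
By default `duplicated()` flags only the second and later copies of a row. The `keep` parameter changes which copies are flagged or kept; a sketch on a hypothetical frame:

```python
import pandas as pd

frame = pd.DataFrame(
    {"k": ["a", "b", "a", "c", "b"], "v": [1, 2, 1, 3, 9]}
)

later_copies = frame.duplicated()             # first occurrence is not flagged
all_copies = frame.duplicated(keep=False)     # every member of a duplicate group

# Deduplicate on column "k" only, keeping the last row seen for each key.
last_per_key = frame.drop_duplicates(subset=["k"], keep="last")

print(later_copies.tolist())
print(all_copies.tolist())
print(list(last_per_key.index))
```

Note that rows 1 and 4 share the key `"b"` but differ in `v`, so they count as duplicates only when `subset=["k"]` is given.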

# 6. Mapping <a name="map"></a>
<div style="text-align: right">

[Back to contents](#Contents)

</div>

In [None]:
# 1 = female, 0 = male
genero = pd.Series([1,0,1,1,1,1,0,0,0,1,1,0])
genero

In [None]:
genero.map({1:'Feminino', 0:'Masculino'})

In [None]:
genero_2 = genero.map({1:'Feminino', 0:'Masculino'})

In [None]:
# Values with no key in the dict (here 0) are mapped to NaN
genero.map({1:'Feminino', 2:'Masculino'})
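
When the dictionary has no entry for a value, `map` returns `NaN` for it rather than raising an error. A sketch on a hypothetical Series, with `fillna` supplying a fallback label (the label `"Unknown"` is an illustrative choice):

```python
import pandas as pd

codes = pd.Series([1, 0, 1, 0])
labels = {1: "Feminino", 2: "Masculino"}   # no entry for 0

mapped = codes.map(labels)                          # 0 becomes NaN
with_fallback = codes.map(labels).fillna("Unknown")

print(mapped.isna().tolist())
print(with_fallback.tolist())
```

Counting the `NaN` values after a `map` is a quick way to spot codes that the dictionary does not cover.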

In [None]:
genero_2.map('Genero: {}'.format)
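
When `map` receives a function, missing values are passed to it like any other value, so a formatter turns `NaN` into the text `nan`. Passing `na_action="ignore"` leaves missing values untouched. A sketch on a hypothetical Series, created with `dtype=object` as an assumption to keep the behaviour the same across pandas versions:

```python
import numpy as np
import pandas as pd

labels = pd.Series(["Feminino", np.nan, "Masculino"], dtype=object)

formatted = labels.map("Genero: {}".format)                       # NaN is formatted too
skipped = labels.map("Genero: {}".format, na_action="ignore")     # NaN stays NaN

print(formatted.tolist())
print(skipped.tolist())
```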