# Lesson 5 - pandas

In today's lesson, we will explore the following topics in Python:

- 1) Creating pivot tables and reshaping a df (melt, pivot, pivot_table)
- 2) Data transformation (cut, qcut, get_dummies)
- 3) Extra utilities (MultiIndex to single index, combine_first)
_______

### Objectives

Show how to create pivot tables, how to transform continuous and categorical data, and how to work with a MultiIndex.

____
____
____

In [2]:
!pip install matplotlib



In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 

In [151]:
df = pd.read_csv("data/titanic.csv")

In [3]:
df.head(15)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [152]:
df.drop(['PassengerId','Ticket','Name'],inplace=True,axis=1)

In [5]:
df.head(15)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,male,22.0,1,0,7.25,,S
1,1,1,female,38.0,1,0,71.2833,C85,C
2,1,3,female,26.0,0,0,7.925,,S
3,1,1,female,35.0,1,0,53.1,C123,S
4,0,3,male,35.0,0,0,8.05,,S
5,0,3,male,,0,0,8.4583,,Q
6,0,1,male,54.0,0,0,51.8625,E46,S
7,0,3,male,2.0,3,1,21.075,,S
8,1,3,female,27.0,0,2,11.1333,,S
9,1,2,female,14.0,1,0,30.0708,,C


## Building a Pivot Table with pandas

It's time to build a pivot table in Python with the pandas library. We will explore the different facets of a pivot table and build a flexible one from scratch.


    * pivot_table requires a `data` argument and an `index` parameter
    * data is the pandas DataFrame you pass to the function
    * index is the column (or columns) used to group your data; it appears as the index of the resulting table



In [153]:
# single index
table = pd.pivot_table(data=df, index=['Sex'], aggfunc=['sum','mean'])
table

Unnamed: 0_level_0,sum,sum,sum,sum,sum,sum,mean,mean,mean,mean,mean,mean
Unnamed: 0_level_1,Age,Fare,Parch,Pclass,SibSp,Survived,Age,Fare,Parch,Pclass,SibSp,Survived
Sex,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2
female,7286.0,13966.6628,204,678,218,233,27.915709,44.479818,0.649682,2.159236,0.694268,0.742038
male,13919.17,14727.2865,136,1379,248,109,30.726645,25.523893,0.235702,2.389948,0.429809,0.188908


In [7]:
# multiple index columns
table = pd.pivot_table(df, index=['Sex','Pclass'])
table

Unnamed: 0_level_0,Unnamed: 1_level_0,Age,Fare,Parch,SibSp,Survived
Sex,Pclass,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
female,1,34.611765,106.125798,0.457447,0.553191,0.968085
female,2,28.722973,21.970121,0.605263,0.486842,0.921053
female,3,21.75,16.11881,0.798611,0.895833,0.5
male,1,41.281386,67.226127,0.278689,0.311475,0.368852
male,2,30.740707,19.741782,0.222222,0.342593,0.157407
male,3,26.507589,12.661633,0.224784,0.498559,0.135447


### Aggregation function
By default, `.pivot_table()` uses the mean (`aggfunc='mean'`) as its aggregation function, but we can use different aggregation functions for different columns. To do so, pass a dictionary to the `aggfunc` parameter, with the column name as the key and the aggregation function (or a list of functions) as the value. <br>
Let's create a pivot table that computes the mean and the sum of 'Age' and the sum of 'Survived':


In [160]:
# different aggregation functions
table = pd.pivot_table(df, 
                       index=['Sex','Pclass'], 
                       aggfunc={'Age':['mean', 'sum'], 'Survived':'sum'})
table

Unnamed: 0_level_0,Unnamed: 1_level_0,Age,Age,Survived
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,sum,sum
Sex,Pclass,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
female,1,34.611765,2942.0,91
female,2,28.722973,2125.5,70
female,3,21.75,2218.5,72
male,1,41.281386,4169.42,45
male,2,30.740707,3043.33,17
male,3,26.507589,6706.42,47


What is the difference between this pivot_table and a groupby?

In [9]:
df.groupby(['Sex','Pclass']).agg({'Age':'mean','Survived':'sum'})

Unnamed: 0_level_0,Unnamed: 1_level_0,Age,Survived
Sex,Pclass,Unnamed: 2_level_1,Unnamed: 3_level_1
female,1,34.611765,91
female,2,28.722973,70
female,3,21.75,72
male,1,41.281386,45
male,2,30.740707,17
male,3,26.507589,47
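
One way to answer the question above is that `pivot_table` behaves like a `groupby` followed by an `unstack`: both aggregate by the same keys, but `pivot_table` can place one key in the columns directly. The sketch below uses a small hypothetical sample (the `sample` frame is an assumption, not the Titanic file) to compare the two routes.

```python
import pandas as pd

# Small hypothetical sample standing in for the Titanic data.
sample = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Pclass": [1, 3, 1, 3, 3],
    "Survived": [1, 0, 0, 1, 0],
})

# pivot_table: one grouping key in the index, the other in the columns.
wide = pd.pivot_table(sample, index="Sex", columns="Pclass",
                      values="Survived", aggfunc="sum")

# groupby: both keys end up in a row MultiIndex;
# unstack then moves Pclass from the rows to the columns.
long = sample.groupby(["Sex", "Pclass"])["Survived"].sum()
wide_from_groupby = long.unstack("Pclass")

print(wide)
print(wide_from_groupby)
```

Both results hold the same numbers; the choice is mostly about which layout is more convenient for the next step.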


To make it look more like an Excel pivot table, we can show one of the grouping keys as columns and add row and column totals:

In [158]:
table = pd.pivot_table(df,
                       index=['Sex'], # first grouping column, shown as the index
                       columns=['Pclass'], # second grouping column, shown as columns
                       values=['Survived'], # column whose values are aggregated
                       aggfunc=['sum', 'mean'],
                       margins=True) # adds the 'All' totals
table
# For example, 'All' in the female row under 'mean' is the survival rate of all women, regardless of Pclass

Unnamed: 0_level_0,sum,sum,sum,sum,mean,mean,mean,mean
Unnamed: 0_level_1,Survived,Survived,Survived,Survived,Survived,Survived,Survived,Survived
Pclass,1,2,3,All,1,2,3,All
Sex,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3
female,91,70,72,233,0.968085,0.921053,0.5,0.742038
male,45,17,47,109,0.368852,0.157407,0.135447,0.188908
All,136,87,119,342,0.62963,0.472826,0.242363,0.383838


Formatting the output
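
The cell that creates `table_prob` is not among the cells shown here. Judging by its output, it is presumably a pivot table of the mean of 'Survived' with margins; the sketch below builds such a table under that assumption, using a small hypothetical sample instead of the Titanic file.

```python
import pandas as pd

# Small hypothetical sample standing in for the Titanic data.
sample = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Pclass": [1, 3, 1, 3, 3],
    "Survived": [1, 0, 0, 1, 0],
})

# Assumed definition: survival rate by Sex and Pclass, with 'All' totals.
table_prob = pd.pivot_table(sample,
                            index=["Sex"],
                            columns=["Pclass"],
                            values=["Survived"],
                            aggfunc="mean",
                            margins=True)

# Plain numeric alternative to Styler formatting: percentages with one decimal.
percent = (table_prob * 100).round(1)
print(percent)
```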

In [32]:
(table_prob*100).style.format('{0:,.1f}%')

Unnamed: 0_level_0,Survived,Survived,Survived,Survived
Pclass,1,2,3,All
Sex,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
female,96.8%,92.1%,50.0%,74.2%
male,36.9%,15.7%,13.5%,18.9%
All,63.0%,47.3%,24.2%,38.4%


`pd.pivot_table()` accepts several useful parameters: <br>
pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True) <br>
It follows the same idea as `DataFrame.unstack()`.

## Undoing a pivot table
To unpivot a table we use `pd.melt()`. It is used when we want to keep one or more columns as identifier columns and turn the remaining columns into rows. It follows the same idea as `DataFrame.stack()`. The columns to keep are given by `id_vars`, and the columns to melt are given by `value_vars`.

Parameters: <br>
pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)


   * __frame__: DataFrame <br>
   * __id_vars__: Columns to use as identifiers. These are the columns you want to keep as they are. <br> 
   * __value_vars__: Columns to unpivot, going from wide to long format. If not specified, all columns that are not in id_vars are used. <br> 
   * __var_name__: Name of the new column that holds the former column names. <br>
   * __value_name__: Name of the column that holds the values. <br>
   * __col_level__: If the columns are a MultiIndex, the level to melt.<br>

In [43]:
Let's simplify our probability table by dropping the totals row and column, resetting the index, and flattening the column MultiIndex by renaming the columns.

Unnamed: 0_level_0,Sex,Survived,Survived,Survived
Pclass,Unnamed: 1_level_1,1,2,3
0,female,0.968085,0.921053,0.5
1,male,0.368852,0.157407,0.135447


In [33]:
table_prob

Unnamed: 0_level_0,Survived,Survived,Survived,Survived
Pclass,1,2,3,All
Sex,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
female,0.968085,0.921053,0.5,0.742038
male,0.368852,0.157407,0.135447,0.188908
All,0.62963,0.472826,0.242363,0.383838


In [34]:
table_prob = table_prob.drop(('Survived', 'All'), axis=1).drop('All').reset_index()
table_prob

Unnamed: 0_level_0,Sex,Survived,Survived,Survived
Pclass,Unnamed: 1_level_1,1,2,3
0,female,0.968085,0.921053,0.5
1,male,0.368852,0.157407,0.135447


In [35]:
table_prob.columns = ['Sex', 1, 2, 3]
table_prob

Unnamed: 0,Sex,1,2,3
0,female,0.968085,0.921053,0.5
1,male,0.368852,0.157407,0.135447


Look at the final result of our df after using `pd.melt()`:

In [163]:
table_prob = pd.DataFrame([["female", 0.968085, 0.921053, 0.500000], ["male", 0.368852, 0.157407, 0.135447]], columns=["Sex", 1, 2, 3])
table_prob

Unnamed: 0,Sex,1,2,3
0,female,0.968085,0.921053,0.5
1,male,0.368852,0.157407,0.135447


In [169]:
pd.melt(table_prob,
       id_vars=['Sex'])

Unnamed: 0,Sex,variable,value
0,female,1,0.968085
1,male,1,0.368852
2,female,2,0.921053
3,male,2,0.157407
4,female,3,0.5
5,male,3,0.135447


It converted the separate Pclass columns into one column with the class and another with its value. <br>
To make the new columns easier to understand, we can rename them:

In [37]:
pd.melt(table_prob,
       id_vars=['Sex'],
       var_name='Class_melt',
       value_name='porc_of_survived')

Unnamed: 0,Sex,Class_melt,porc_of_survived
0,female,1,0.968085
1,male,1,0.368852
2,female,2,0.921053
3,male,2,0.157407
4,female,3,0.5
5,male,3,0.135447
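
The melt above can be reversed with `DataFrame.pivot`, which is listed among today's topics. A minimal sketch, reusing the `table_prob` values shown earlier; the names `Pclass` and `prob_survived` are chosen here only for illustration.

```python
import pandas as pd

table_prob = pd.DataFrame(
    [["female", 0.968085, 0.921053, 0.500000],
     ["male", 0.368852, 0.157407, 0.135447]],
    columns=["Sex", 1, 2, 3],
)

# Wide to long: one row per (Sex, Pclass) pair.
long = pd.melt(table_prob, id_vars=["Sex"],
               var_name="Pclass", value_name="prob_survived")

# Long back to wide: pivot needs unique (index, columns) pairs.
wide = long.pivot(index="Sex", columns="Pclass", values="prob_survived")

# Restore the original flat layout.
restored = wide.reset_index()
restored.columns.name = None
print(restored)
```

Unlike `pivot_table`, `pivot` does not aggregate, so it raises an error if the same (index, columns) pair appears more than once.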


## Data transformation

### pd.cut()
`pd.cut()` splits the data into bins and computes which bin each row of the df belongs to. Given a number of bins, `pd.cut()` chooses equal-width bins based on the range of the values, not on their frequency.  <br>
It is often used to turn continuous variables into categorical ones. For example, we can convert the numeric age into groups such as child, young, adult, and elderly.
<br><br>
<a href='https://pandas.pydata.org/docs/reference/api/pandas.cut.html'>Parameters:</a> <br>
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)

When given a number of groups, pd.cut() chooses bins of equal width:


In [38]:
df['cut_bins'] = pd.cut(df.Age, 4)
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,cut_bins
0,0,3,male,22.0,1,0,7.25,,S,"(20.315, 40.21]"
1,1,1,female,38.0,1,0,71.2833,C85,C,"(20.315, 40.21]"
2,1,3,female,26.0,0,0,7.925,,S,"(20.315, 40.21]"
3,1,1,female,35.0,1,0,53.1,C123,S,"(20.315, 40.21]"
4,0,3,male,35.0,0,0,8.05,,S,"(20.315, 40.21]"


We can pass names for the groups and turn the numeric variable directly into a categorical one


In [39]:
df['cut_classes'] = pd.cut(df.Age, 4, labels=["jovens", "adultos", "meia-idade", "idosos"])
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,cut_bins,cut_classes
0,0,3,male,22.0,1,0,7.25,,S,"(20.315, 40.21]",adultos
1,1,1,female,38.0,1,0,71.2833,C85,C,"(20.315, 40.21]",adultos
2,1,3,female,26.0,0,0,7.925,,S,"(20.315, 40.21]",adultos
3,1,1,female,35.0,1,0,53.1,C123,S,"(20.315, 40.21]",adultos
4,0,3,male,35.0,0,0,8.05,,S,"(20.315, 40.21]",adultos


In [40]:
df.cut_bins.unique()

[(20.315, 40.21], NaN, (40.21, 60.105], (0.34, 20.315], (60.105, 80.0]]
Categories (4, interval[float64]): [(0.34, 20.315] < (20.315, 40.21] < (40.21, 60.105] < (60.105, 80.0]]

In [41]:
df.cut_classes.value_counts()

adultos       385
jovens        179
meia-idade    128
idosos         22
Name: cut_classes, dtype: int64

In [42]:
df.cut_bins.value_counts()

(20.315, 40.21]    385
(0.34, 20.315]     179
(40.21, 60.105]    128
(60.105, 80.0]      22
Name: cut_bins, dtype: int64

We can also pass a list with the bin edges:

In [43]:
pd.cut(df.Age, [0,20,60,80]).unique()

[(20.0, 60.0], NaN, (0.0, 20.0], (60.0, 80.0]]
Categories (3, interval[int64]): [(0, 20] < (20, 60] < (60, 80]]
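
With explicit edges, the `right` parameter decides which side of each interval is closed. The sketch below uses a few hypothetical ages (not the Titanic column) and hypothetical labels to show where the boundary values 20, 60 and 80 land.

```python
import pandas as pd

# Hypothetical ages, chosen to sit on or near the bin edges.
ages = pd.Series([0.42, 20, 20.5, 60, 80])
edges = [0, 20, 60, 80]

# Default right=True: intervals are (0, 20], (20, 60], (60, 80].
default_bins = pd.cut(ages, edges)

# right=False: intervals are [0, 20), [20, 60), [60, 80),
# so the value 80 falls outside every bin and becomes NaN.
left_closed = pd.cut(ages, edges, right=False)

# Labels replace the interval notation (one label per bin).
labeled = pd.cut(ages, edges, labels=["young", "adult", "senior"])

print(pd.DataFrame({"age": ages, "default": default_bins,
                    "left_closed": left_closed, "labeled": labeled}))
```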

In [44]:
df.Age.describe()

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

Note that the …

### pd.qcut()
`pd.qcut()` is used when we want to discretize our data into quantiles. Given a number of groups, `pd.qcut()` chooses the bins so that each group holds approximately the same number of values.

#### `pd.cut()` x `pd.qcut()`
   * `pd.cut()` creates **equal-width bins**, but the **frequency** of samples in each bin is **unequal**
   * `pd.qcut()` creates **bins of unequal width**, but the **frequency** of samples in each bin is **equal**.

<br>
Parameters:<br>
pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')

In [45]:
pd.cut(df.Age, 4).value_counts()

(20.315, 40.21]    385
(0.34, 20.315]     179
(40.21, 60.105]    128
(60.105, 80.0]      22
Name: Age, dtype: int64

In [47]:
pd.qcut(df.Age, 4).value_counts()

(20.125, 28.0]     183
(0.419, 20.125]    179
(38.0, 80.0]       177
(28.0, 38.0]       175
Name: Age, dtype: int64

In [52]:
pd.qcut(df.Age, 4).value_counts()/df.Age.notnull().sum()
#df.Age.notnull().sum() is the number of people with a non-null Age
#This division gives the share of the total that each group represents

(20.125, 28.0]     0.256303
(0.419, 20.125]    0.250700
(38.0, 80.0]       0.247899
(28.0, 38.0]       0.245098
Name: Age, dtype: float64
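
`pd.qcut()` also accepts `labels` and `retbins`, both listed in the signature above. A small sketch with hypothetical ages, showing how to name the quartiles and recover the edges that were chosen:

```python
import pandas as pd

# Eight hypothetical ages, so each quartile receives two values.
ages = pd.Series([2, 14, 22, 26, 27, 35, 38, 54])

# labels names each quantile bin; retbins=True also returns the bin edges.
quartile, edges = pd.qcut(ages, 4, labels=["q1", "q2", "q3", "q4"], retbins=True)

counts = quartile.value_counts()
print(counts)
print(edges)
```

The returned `edges` can be passed to `pd.cut()` later to apply the same bins to new data.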

<a href='https://towardsdatascience.com/discretisation-using-decision-trees-21910483fa4b'>Discretization using decision trees</a>

### pd.get_dummies()

#### categorical variables
Categorical variables represent groups or classes in our data. They can be of two types:
* ordinal: they have a meaningful order. For example, for income we could have: upper class > middle class > lower class  
* nominal: they have no meaningful order. For example: sex and postal code (CEP).

<img src="variaveis_categoricas.jpeg" style="width: 500px">

Dummies are variables whose value is 1 or 0 for each observation. `pd.get_dummies()` converts categorical variables into numeric ones by giving each category its own column.
<br>
<br>
<a href="https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html">Parameters:</a> <br>
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)

In [None]:
#Many models do not accept categorical variables as strings, hence the need to convert them to numbers
#For ordinal variables you must preserve the order when transforming the data, so you do it by hand: create a dictionary (upper class: 1, middle class: 2, lower class: 3) and apply it with replace
#An ordinal variable stays in a single column so the model understands that it has an order and that the order matters
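
The manual ordinal encoding described in the comments above can be sketched as follows. The column name, the category names and the numeric codes are hypothetical; the ordered `Categorical` is shown as an alternative, not as the method used in the lesson.

```python
import pandas as pd

# Hypothetical ordinal column.
income = pd.Series(["upper", "lower", "middle", "upper"], name="income_class")

# Option 1: explicit dictionary, applied with map (replace also works).
order = {"lower": 1, "middle": 2, "upper": 3}
encoded = income.map(order)

# Option 2: ordered categorical; .codes gives 0-based integers in the stated order.
as_cat = pd.Categorical(income, categories=["lower", "middle", "upper"], ordered=True)
codes = as_cat.codes

print(encoded.tolist())
print(codes.tolist())
```

Either way the variable stays in a single column, which is what lets a model see the order.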

In [54]:
pd.get_dummies(df, columns=['Sex', 'cut_classes'], drop_first=True)
#If you do not say which columns to convert, all categorical columns are converted
#If a category is not ordinal and you encode it as 1, 2, 3 instead of dummies, the model will assume an order that does not exist

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Cabin,Embarked,cut_bins,Sex_male,cut_classes_adultos,cut_classes_meia-idade,cut_classes_idosos
0,0,3,22.0,1,0,7.2500,,S,"(20.315, 40.21]",1,1,0,0
1,1,1,38.0,1,0,71.2833,C85,C,"(20.315, 40.21]",0,1,0,0
2,1,3,26.0,0,0,7.9250,,S,"(20.315, 40.21]",0,1,0,0
3,1,1,35.0,1,0,53.1000,C123,S,"(20.315, 40.21]",0,1,0,0
4,0,3,35.0,0,0,8.0500,,S,"(20.315, 40.21]",1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,27.0,0,0,13.0000,,S,"(20.315, 40.21]",1,1,0,0
887,1,1,19.0,0,0,30.0000,B42,S,"(0.34, 20.315]",0,0,0,0
888,0,3,,1,2,23.4500,,S,,0,0,0,0
889,1,1,26.0,0,0,30.0000,C148,C,"(20.315, 40.21]",1,1,0,0


In [None]:
#Creating many dummy columns fills the df with zeros and can end up hurting the model
#So you keep the most common categories as they are and put the rest into "others"
#This keeps the df from having a huge number of columns, unless it also has many rows (e.g. 1000:1000000)
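
Grouping rare categories before creating dummies, as the comments above suggest, could look like the sketch below. The series, the threshold of 2 occurrences and the label "other" are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical categorical column with a few rare values.
cabin_deck = pd.Series(["C", "C", "E", "G", "C", "B", "E", "T"])

# Keep categories that appear at least twice; everything else becomes "other".
counts = cabin_deck.value_counts()
common = counts[counts >= 2].index
grouped = cabin_deck.where(cabin_deck.isin(common), "other")

# Three dummy columns instead of five.
dummies = pd.get_dummies(grouped)
print(dummies)
```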

In [55]:
pd.get_dummies(pd.cut(df.Age, 4))

Unnamed: 0,"(0.34, 20.315]","(20.315, 40.21]","(40.21, 60.105]","(60.105, 80.0]"
0,0,1,0,0
1,0,1,0,0
2,0,1,0,0
3,0,1,0,0
4,0,1,0,0
...,...,...,...,...
886,0,1,0,0
887,1,0,0,0
888,0,0,0,0
889,0,1,0,0


## Multi-index

In [57]:
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,cut_bins,cut_classes
0,0,3,male,22.0,1,0,7.25,,S,"(20.315, 40.21]",adultos
1,1,1,female,38.0,1,0,71.2833,C85,C,"(20.315, 40.21]",adultos
2,1,3,female,26.0,0,0,7.925,,S,"(20.315, 40.21]",adultos
3,1,1,female,35.0,1,0,53.1,C123,S,"(20.315, 40.21]",adultos
4,0,3,male,35.0,0,0,8.05,,S,"(20.315, 40.21]",adultos


To set indexes, use the `set_index()` method, passing the columns you want to use as a list.

In [58]:
df_row_index = df.set_index(["Pclass", 'Sex'])
df_row_index

Unnamed: 0_level_0,Unnamed: 1_level_0,Survived,Age,SibSp,Parch,Fare,Cabin,Embarked,cut_bins,cut_classes
Pclass,Sex,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
3,male,0,22.0,1,0,7.2500,,S,"(20.315, 40.21]",adultos
1,female,1,38.0,1,0,71.2833,C85,C,"(20.315, 40.21]",adultos
3,female,1,26.0,0,0,7.9250,,S,"(20.315, 40.21]",adultos
1,female,1,35.0,1,0,53.1000,C123,S,"(20.315, 40.21]",adultos
3,male,0,35.0,0,0,8.0500,,S,"(20.315, 40.21]",adultos
...,...,...,...,...,...,...,...,...,...,...
2,male,0,27.0,0,0,13.0000,,S,"(20.315, 40.21]",adultos
1,female,1,19.0,0,0,30.0000,B42,S,"(0.34, 20.315]",jovens
3,female,0,,1,2,23.4500,,S,,
1,male,1,26.0,0,0,30.0000,C148,C,"(20.315, 40.21]",adultos


In [59]:
df_row_index.index

MultiIndex([(3,   'male'),
            (1, 'female'),
            (3, 'female'),
            (1, 'female'),
            (3,   'male'),
            (3,   'male'),
            (1,   'male'),
            (3,   'male'),
            (3, 'female'),
            (2, 'female'),
            ...
            (3,   'male'),
            (3, 'female'),
            (2,   'male'),
            (3,   'male'),
            (3, 'female'),
            (2,   'male'),
            (1, 'female'),
            (3, 'female'),
            (1,   'male'),
            (3,   'male')],
           names=['Pclass', 'Sex'], length=891)

To access elements:

In [60]:
df_row_index.loc[(3, 'female')]


Unnamed: 0_level_0,Unnamed: 1_level_0,Survived,Age,SibSp,Parch,Fare,Cabin,Embarked,cut_bins,cut_classes
Pclass,Sex,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
3,female,1,26.0,0,0,7.9250,,S,"(20.315, 40.21]",adultos
3,female,1,27.0,0,2,11.1333,,S,"(20.315, 40.21]",adultos
3,female,1,4.0,1,1,16.7000,G6,S,"(0.34, 20.315]",jovens
3,female,0,14.0,0,0,7.8542,,S,"(0.34, 20.315]",jovens
3,female,0,31.0,1,0,18.0000,,S,"(20.315, 40.21]",adultos
3,...,...,...,...,...,...,...,...,...,...
3,female,0,,8,2,69.5500,,S,,
3,female,1,15.0,0,0,7.2250,,C,"(0.34, 20.315]",jovens
3,female,0,22.0,0,0,10.5167,,S,"(20.315, 40.21]",adultos
3,female,0,39.0,0,5,29.1250,,Q,"(20.315, 40.21]",adultos
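
Selecting by tuple on an unsorted MultiIndex works, but pandas may warn about performance. A minimal sketch, on a small hypothetical frame, of sorting the index first and of using `xs` to select on the inner level alone:

```python
import pandas as pd

# Small hypothetical sample standing in for the Titanic data.
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Sex": ["male", "female", "female", "female", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 27.0],
})

# Sorting the MultiIndex makes tuple lookups efficient.
indexed = sample.set_index(["Pclass", "Sex"]).sort_index()

# Full key: every row for third-class women.
third_class_women = indexed.loc[(3, "female")]

# Inner level only: all women, whatever the class.
all_women = indexed.xs("female", level="Sex")

print(third_class_women)
print(all_women)
```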


In [61]:
df_row_index.reset_index(['Sex'])

Unnamed: 0_level_0,Sex,Survived,Age,SibSp,Parch,Fare,Cabin,Embarked,cut_bins,cut_classes
Pclass,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
3,male,0,22.0,1,0,7.2500,,S,"(20.315, 40.21]",adultos
1,female,1,38.0,1,0,71.2833,C85,C,"(20.315, 40.21]",adultos
3,female,1,26.0,0,0,7.9250,,S,"(20.315, 40.21]",adultos
1,female,1,35.0,1,0,53.1000,C123,S,"(20.315, 40.21]",adultos
3,male,0,35.0,0,0,8.0500,,S,"(20.315, 40.21]",adultos
...,...,...,...,...,...,...,...,...,...,...
2,male,0,27.0,0,0,13.0000,,S,"(20.315, 40.21]",adultos
1,female,1,19.0,0,0,30.0000,B42,S,"(0.34, 20.315]",jovens
3,female,0,,1,2,23.4500,,S,,
1,male,1,26.0,0,0,30.0000,C148,C,"(20.315, 40.21]",adultos


### Multi-index nas colunas

In [62]:
table

Unnamed: 0_level_0,Survived,Survived,Survived,Survived
Pclass,1,2,3,All
Sex,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
female,91,70,72,233
male,45,17,47,109
All,136,87,119,342


Accessing columns:

In [63]:
table.columns

MultiIndex([('Survived',     1),
            ('Survived',     2),
            ('Survived',     3),
            ('Survived', 'All')],
           names=[None, 'Pclass'])

How to access a column:

In [64]:
table[('Survived', 1)]

Sex
female     91
male       45
All       136
Name: (Survived, 1), dtype: int64

Slicing with a MultiIndex

In [65]:
table.loc[:, ('Survived', 1):('Survived', 3)]

Unnamed: 0_level_0,Survived,Survived,Survived
Pclass,1,2,3
Sex,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
female,91,70,72
male,45,17,47
All,136,87,119


To get the column names at each hierarchical level

In [66]:
table

Unnamed: 0_level_0,Survived,Survived,Survived,Survived
Pclass,1,2,3,All
Sex,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
female,91,70,72,233
male,45,17,47,109
All,136,87,119,342


In [67]:
table.columns.get_level_values(0)

Index(['Survived', 'Survived', 'Survived', 'Survived'], dtype='object')

In [68]:
table.columns.get_level_values(1)

Index([1, 2, 3, 'All'], dtype='object', name='Pclass')

In [70]:
nivel_0 = table.columns.get_level_values(0)
nivel_1 = table.columns.get_level_values(1)

[j + '_' + str(nivel_1[i]) for i, j in enumerate(nivel_0)]
#enumerate yields (index, value)

['Survived_1', 'Survived_2', 'Survived_3', 'Survived_All']
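
The same flattening can be written so that it works for any number of column levels, by joining each column tuple directly. A sketch on a small table rebuilt by hand from the values above:

```python
import pandas as pd

# Two-level columns, rebuilt by hand from the pivot table shown earlier.
table = pd.DataFrame(
    [[91, 70], [45, 17]],
    index=["female", "male"],
    columns=pd.MultiIndex.from_tuples([("Survived", 1), ("Survived", 2)],
                                      names=[None, "Pclass"]),
)

# Each column label is a tuple; join its parts into a single string.
flat = table.copy()
flat.columns = ["_".join(str(level) for level in col) for col in flat.columns]
print(flat.columns.tolist())
```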

## Exercises

1. Download the alcohol consumption by country data from <a href="https://www.kaggle.com/justmarkham/alcohol-consumption-by-country">kaggle</a>, analyze it using the methods you already know, and then answer:

In [1]:
import pandas as pd

In [2]:
drinks = pd.read_csv("data/drinks.csv")
drinks

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,Asia
1,Albania,89,132,54,4.9,Europe
2,Algeria,25,0,14,0.7,Africa
3,Andorra,245,138,312,12.4,Europe
4,Angola,217,57,45,5.9,Africa
...,...,...,...,...,...,...
188,Venezuela,333,100,3,7.7,South America
189,Vietnam,111,2,1,2.0,Asia
190,Yemen,6,0,0,0.1,Asia
191,Zambia,32,19,4,2.5,Africa


In [82]:
drinks.head(15)

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,Asia
1,Albania,89,132,54,4.9,Europe
2,Algeria,25,0,14,0.7,Africa
3,Andorra,245,138,312,12.4,Europe
4,Angola,217,57,45,5.9,Africa
5,Antigua & Barbuda,102,128,45,4.9,North America
6,Argentina,193,25,221,8.3,South America
7,Armenia,21,179,11,3.8,Europe
8,Australia,261,72,212,10.4,Oceania
9,Austria,279,75,191,9.7,Europe


In [83]:
drinks.tail(15)

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
178,Tuvalu,6,41,9,1.0,Oceania
179,Uganda,45,9,0,8.3,Africa
180,Ukraine,206,237,45,8.9,Europe
181,United Arab Emirates,16,135,5,2.8,Asia
182,United Kingdom,219,126,195,10.4,Europe
183,Tanzania,36,6,1,5.7,Africa
184,USA,249,158,84,8.7,North America
185,Uruguay,115,35,220,6.6,South America
186,Uzbekistan,25,101,8,2.4,Asia
187,Vanuatu,21,18,11,0.9,Oceania


In [84]:
drinks.describe(include="all")

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
count,193,193.0,193.0,193.0,193.0,193
unique,193,,,,,6
top,Turkmenistan,,,,,Africa
freq,1,,,,,53
mean,,106.160622,80.994819,49.450777,4.717098,
std,,101.143103,88.284312,79.697598,3.773298,
min,,0.0,0.0,0.0,0.0,
25%,,20.0,4.0,1.0,1.3,
50%,,76.0,56.0,8.0,4.2,
75%,,188.0,128.0,59.0,7.2,


In [78]:
drinks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   country                       193 non-null    object 
 1   beer_servings                 193 non-null    int64  
 2   spirit_servings               193 non-null    int64  
 3   wine_servings                 193 non-null    int64  
 4   total_litres_of_pure_alcohol  193 non-null    float64
 5   continent                     193 non-null    object 
dtypes: float64(1), int64(3), object(2)
memory usage: 9.2+ KB


In [77]:
drinks.isna().sum()

country                         0
beer_servings                   0
spirit_servings                 0
wine_servings                   0
total_litres_of_pure_alcohol    0
continent                       0
dtype: int64

a. Find the most consumed drink in each country and the amount.

In [3]:
drinks

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,Asia
1,Albania,89,132,54,4.9,Europe
2,Algeria,25,0,14,0.7,Africa
3,Andorra,245,138,312,12.4,Europe
4,Angola,217,57,45,5.9,Africa
...,...,...,...,...,...,...
188,Venezuela,333,100,3,7.7,South America
189,Vietnam,111,2,1,2.0,Asia
190,Yemen,6,0,0,0.1,Asia
191,Zambia,32,19,4,2.5,Africa


In [12]:
drinks.iloc[:, 1:4].idxmax(axis=1)  # only the three *_servings columns

0        beer_servings
1      spirit_servings
2        beer_servings
3        wine_servings
4        beer_servings
            ...       
188      beer_servings
189      beer_servings
190      beer_servings
191      beer_servings
192      beer_servings
Length: 193, dtype: object

In [4]:
drinks.max(axis=1, numeric_only=True)

0        0.0
1      132.0
2       25.0
3      312.0
4      217.0
       ...  
188    333.0
189    111.0
190      6.0
191     32.0
192     64.0
Length: 193, dtype: float64

b. Create a df in which the drinks are grouped into a single column.

In [13]:
drinks.head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,Asia
1,Albania,89,132,54,4.9,Europe
2,Algeria,25,0,14,0.7,Africa
3,Andorra,245,138,312,12.4,Europe
4,Angola,217,57,45,5.9,Africa


In [28]:
pd.set_option('display.max_rows', 1000)

In [148]:
drinks.set_index(["country", "continent"])

Unnamed: 0_level_0,Unnamed: 1_level_0,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
country,continent,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Afghanistan,Asia,0,0,0,0.0
Albania,Europe,89,132,54,4.9
Algeria,Africa,25,0,14,0.7
Andorra,Europe,245,138,312,12.4
Angola,Africa,217,57,45,5.9
Antigua & Barbuda,North America,102,128,45,4.9
Argentina,South America,193,25,221,8.3
Armenia,Europe,21,179,11,3.8
Australia,Oceania,261,72,212,10.4
Austria,Europe,279,75,191,9.7


In [66]:
melt = pd.melt(drinks.set_index(["country", "continent"]), ignore_index=False)
melt

Unnamed: 0_level_0,Unnamed: 1_level_0,variable,value
country,continent,Unnamed: 2_level_1,Unnamed: 3_level_1
Afghanistan,Asia,beer_servings,0.0
Albania,Europe,beer_servings,89.0
Algeria,Africa,beer_servings,25.0
Andorra,Europe,beer_servings,245.0
Angola,Africa,beer_servings,217.0
Antigua & Barbuda,North America,beer_servings,102.0
Argentina,South America,beer_servings,193.0
Armenia,Europe,beer_servings,21.0
Australia,Oceania,beer_servings,261.0
Austria,Europe,beer_servings,279.0


In [67]:
melt2 = pd.melt(drinks, id_vars=["country"], ignore_index=False)
melt2

Unnamed: 0,country,variable,value
0,Afghanistan,beer_servings,0
1,Albania,beer_servings,89
2,Algeria,beer_servings,25
3,Andorra,beer_servings,245
4,Angola,beer_servings,217
5,Antigua & Barbuda,beer_servings,102
6,Argentina,beer_servings,193
7,Armenia,beer_servings,21
8,Australia,beer_servings,261
9,Austria,beer_servings,279


In [69]:
melt2 = melt2.groupby(["country", "variable"]).sum()
melt2

Unnamed: 0_level_0,Unnamed: 1_level_0,value
country,variable,Unnamed: 2_level_1
Afghanistan,beer_servings,0
Afghanistan,continent,Asia
Afghanistan,spirit_servings,0
Afghanistan,total_litres_of_pure_alcohol,0
Afghanistan,wine_servings,0
Albania,beer_servings,89
Albania,continent,Europe
Albania,spirit_servings,132
Albania,total_litres_of_pure_alcohol,4.9
Albania,wine_servings,54


c. Using this new df, find the most consumed drink per country and the amount.

In [73]:
for pais in melt.index.unique():
    melt_pais = melt.loc[pais].reset_index()
    idx = melt_pais["value"].idxmax()
    print(pais[0], ":", melt_pais.loc[idx, "variable"], ":", melt_pais.loc[idx, "value"])


Afghanistan : beer_servings : 0.0
Albania : spirit_servings : 132.0
Algeria : beer_servings : 25.0
Andorra : wine_servings : 312.0
Angola : beer_servings : 217.0
Antigua & Barbuda : spirit_servings : 128.0
Argentina : wine_servings : 221.0
Armenia : spirit_servings : 179.0
Australia : beer_servings : 261.0
Austria : beer_servings : 279.0
Azerbaijan : spirit_servings : 46.0
Bahamas : spirit_servings : 176.0
Bahrain : spirit_servings : 63.0
Bangladesh : beer_servings : 0.0
Barbados : spirit_servings : 173.0
Belarus : spirit_servings : 373.0
Belgium : beer_servings : 295.0
Belize : beer_servings : 263.0
Benin : beer_servings : 34.0
Bhutan : beer_servings : 23.0
Bolivia : beer_servings : 167.0
Bosnia-Herzegovina : spirit_servings : 173.0
Botswana : beer_servings : 173.0
Brazil : beer_servings : 245.0
Brunei : beer_servings : 31.0
Bulgaria : spirit_servings : 252.0
Burkina Faso : beer_servings : 25.0
Burundi : beer_servings : 88.0
Cote d'Ivoire : beer_servings : 37.0
Cabo Verde : beer_servi

In [74]:
melt2

Unnamed: 0_level_0,Unnamed: 1_level_0,value
country,variable,Unnamed: 2_level_1
Afghanistan,beer_servings,0
Afghanistan,continent,Asia
Afghanistan,spirit_servings,0
Afghanistan,total_litres_of_pure_alcohol,0
Afghanistan,wine_servings,0
Albania,beer_servings,89
Albania,continent,Europe
Albania,spirit_servings,132
Albania,total_litres_of_pure_alcohol,4.9
Albania,wine_servings,54


In [81]:
for pais in melt2.index.get_level_values(0).unique():
    melt_pais = melt2.loc[pais]
    

Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua & Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
       ...
       'United Arab Emirates', 'United Kingdom', 'Uruguay', 'Uzbekistan',
       'Vanuatu', 'Venezuela', 'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', name='country', length=193)

In [116]:
melt2.loc['Albania'].iloc[[0, 2, 3, 4]]["value"].idxmax() #?

TypeError: reduction operation 'argmax' not allowed for this dtype
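
The error above most likely comes from `continent` having been melted into the `value` column, which leaves that column with object dtype. One possible fix, sketched below on two rows copied from the dataset, is to keep `continent` in `id_vars` and melt only the three servings columns; the names `drink` and `servings` are chosen here for illustration.

```python
import pandas as pd

# Two rows copied from the drinks data shown above.
drinks = pd.DataFrame({
    "country": ["Albania", "Andorra"],
    "beer_servings": [89, 245],
    "spirit_servings": [132, 138],
    "wine_servings": [54, 312],
    "continent": ["Europe", "Europe"],
})

# Keep the text columns as identifiers so the value column stays numeric.
long = pd.melt(drinks,
               id_vars=["country", "continent"],
               value_vars=["beer_servings", "spirit_servings", "wine_servings"],
               var_name="drink", value_name="servings")

# idxmax per country returns the row label of the largest value in each group.
top = long.loc[long.groupby("country")["servings"].idxmax()]
print(top)
```

This also replaces the explicit loop over countries with a single `groupby`.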

2. Consider the closing price and volume data for the stocks in "data/stocks.csv". <br>
a. Choose a Python method taught in today's lesson to obtain a DataFrame whose rows are the stock symbols and whose columns are the dates.

In [125]:
data = pd.read_csv('data/stocks.csv')
data

Unnamed: 0,Date,Close,Volume,Symbol
0,2016-10-03,31.5,14070500,CSCO
1,2016-10-03,112.52,21701800,AAPL
2,2016-10-03,57.42,19189500,MSFT
3,2016-10-04,113.0,29736800,AAPL
4,2016-10-04,57.24,20085900,MSFT
5,2016-10-04,31.35,18460400,CSCO
6,2016-10-05,57.64,16726400,MSFT
7,2016-10-05,31.59,11808600,CSCO
8,2016-10-05,113.05,21453100,AAPL


In [131]:
#?
table = pd.pivot_table(data,
                       index=['Symbol'],
                       columns=['Date'],
                       aggfunc='sum',
                       margins=1)
table

Unnamed: 0_level_0,Close,Close,Close,Close,Volume,Volume,Volume,Volume
Date,2016-10-03,2016-10-04,2016-10-05,All,2016-10-03,2016-10-04,2016-10-05,All
Symbol,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
AAPL,112.52,113.0,113.05,338.57,21701800,29736800,21453100,72891700
CSCO,31.5,31.35,31.59,94.44,14070500,18460400,11808600,44339500
MSFT,57.42,57.24,57.64,172.3,19189500,20085900,16726400,56001800
All,201.44,201.59,202.28,605.31,54961800,68283100,49988100,173233000
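Since each (Symbol, Date) pair appears only once in this dataset, no aggregation is actually needed, and `DataFrame.pivot` is an alternative to `pivot_table`. The sketch below rebuilds the nine rows shown above in a DataFrame named `stocks` (a name chosen for this example) and also shows one way to flatten the resulting two-level columns into a single level.

```python
import pandas as pd

# Same nine rows shown in the exercise
stocks = pd.DataFrame({
    "Date": ["2016-10-03"] * 3 + ["2016-10-04"] * 3 + ["2016-10-05"] * 3,
    "Close": [31.5, 112.52, 57.42, 113.0, 57.24, 31.35, 57.64, 31.59, 113.05],
    "Volume": [14070500, 21701800, 19189500, 29736800, 20085900,
               18460400, 16726400, 11808600, 21453100],
    "Symbol": ["CSCO", "AAPL", "MSFT", "AAPL", "MSFT", "CSCO",
               "MSFT", "CSCO", "AAPL"],
})

# pivot only reshapes; it raises an error if a (Symbol, Date) pair repeats
close_wide = stocks.pivot(index="Symbol", columns="Date", values="Close")

# Reshape both measures, then flatten the two-level columns into single names
wide = stocks.pivot(index="Symbol", columns="Date", values=["Close", "Volume"])
wide.columns = [f"{measure}_{date}" for measure, date in wide.columns]
```

`pivot_table` remains the safer choice when duplicates may exist, because it aggregates them instead of failing.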


b. Using the original DataFrame, convert the stock symbols into dummy variables.

In [133]:
pd.get_dummies(data, columns=["Symbol"], drop_first=True)

Unnamed: 0,Date,Close,Volume,Symbol_CSCO,Symbol_MSFT
0,2016-10-03,31.5,14070500,1,0
1,2016-10-03,112.52,21701800,0,0
2,2016-10-03,57.42,19189500,0,1
3,2016-10-04,113.0,29736800,0,0
4,2016-10-04,57.24,20085900,0,1
5,2016-10-04,31.35,18460400,1,0
6,2016-10-05,57.64,16726400,0,1
7,2016-10-05,31.59,11808600,1,0
8,2016-10-05,113.05,21453100,0,0
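To make the effect of `drop_first=True` explicit, the sketch below compares the full encoding with the reduced one on a small `symbols` DataFrame (a name chosen for this example). Passing `dtype=int` is an addition made here: in pandas 2.0 and later the default output of `get_dummies` is boolean, so `dtype=int` is what reproduces the 0/1 values shown above.

```python
import pandas as pd

symbols = pd.DataFrame({"Symbol": ["CSCO", "AAPL", "MSFT", "AAPL", "MSFT", "CSCO"]})

# Full encoding: one indicator column per symbol
full = pd.get_dummies(symbols, columns=["Symbol"], dtype=int)

# drop_first=True removes the first category (AAPL, alphabetically),
# which then acts as the baseline: a row of all zeros means AAPL
reduced = pd.get_dummies(symbols, columns=["Symbol"], drop_first=True, dtype=int)

print(full)
print(reduced)
```

Dropping one column avoids perfectly redundant indicators, which matters mainly for linear models; for simple inspection the full encoding is often easier to read.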


3. Consider the data in the file "german_credit.csv", which contains data on loans granted by a bank.<br>
a. Find the mean loan amount ("Credit amount") with the loan purpose ("Purpose") in the rows and the sex ("Sex") in the columns.

In [135]:
gc = pd.read_csv("data/german_credit.csv").drop("Unnamed: 0", axis=1)
gc

Unnamed: 0,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose
0,67,male,2,own,,little,1169,6,radio/TV
1,22,female,2,own,little,moderate,5951,48,radio/TV
2,49,male,1,own,little,,2096,12,education
3,45,male,2,free,little,little,7882,42,furniture/equipment
4,53,male,2,free,little,little,4870,24,car
5,35,male,1,free,,,9055,36,education
6,53,male,2,own,quite rich,,2835,24,furniture/equipment
7,35,male,3,rent,little,moderate,6948,36,car
8,61,male,1,own,rich,,3059,12,radio/TV
9,28,male,3,own,little,moderate,5234,30,car


In [140]:
table = pd.pivot_table(gc,
                       index=['Purpose'],
                       columns=['Sex'],
                       values=['Credit amount'],
                       aggfunc='mean',
                       margins=True)
table

Unnamed: 0_level_0,Credit amount,Credit amount,Credit amount
Sex,female,male,All
Purpose,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
business,3195.421053,4392.525641,4158.041237
car,3369.723404,3922.333333,3768.192878
domestic appliances,1409.833333,1586.166667,1498.0
education,2134.041667,3390.171429,2879.20339
furniture/equipment,2774.72973,3269.11215,3066.98895
radio/TV,2400.517647,2525.635897,2487.653571
repairs,2126.4,2905.058824,2728.090909
vacation/others,11653.666667,7061.222222,8209.333333
All,2877.774194,3448.04058,3271.258
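The same table (without the "All" margins) can also be obtained with `groupby` followed by `unstack`, which makes explicit what `pivot_table` does internally: group, aggregate, then move one grouping key to the columns. The sketch below uses only the ten rows displayed above, stored in a hypothetical `gc_demo` DataFrame, so its numbers differ from the full-dataset means.

```python
import pandas as pd

# First ten rows of the dataset, as displayed above (only the columns needed)
gc_demo = pd.DataFrame({
    "Sex": ["male", "female", "male", "male", "male",
            "male", "male", "male", "male", "male"],
    "Credit amount": [1169, 5951, 2096, 7882, 4870,
                      9055, 2835, 6948, 3059, 5234],
    "Purpose": ["radio/TV", "radio/TV", "education", "furniture/equipment", "car",
                "education", "furniture/equipment", "car", "radio/TV", "car"],
})

# Group by both keys, average, then move "Sex" from the index to the columns
by_group = (
    gc_demo.groupby(["Purpose", "Sex"])["Credit amount"]
    .mean()
    .unstack("Sex")
)

# Equivalent pivot_table call, without margins
pivoted = gc_demo.pivot_table(index="Purpose", columns="Sex",
                              values="Credit amount", aggfunc="mean")
print(by_group)
```

Combinations that never occur in the data (for example, female and car in this ten-row sample) appear as NaN in both versions.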



b. Convert the categorical variables into numeric ones.

In [141]:
gc

Unnamed: 0,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose
0,67,male,2,own,,little,1169,6,radio/TV
1,22,female,2,own,little,moderate,5951,48,radio/TV
2,49,male,1,own,little,,2096,12,education
3,45,male,2,free,little,little,7882,42,furniture/equipment
4,53,male,2,free,little,little,4870,24,car
5,35,male,1,free,,,9055,36,education
6,53,male,2,own,quite rich,,2835,24,furniture/equipment
7,35,male,3,rent,little,moderate,6948,36,car
8,61,male,1,own,rich,,3059,12,radio/TV
9,28,male,3,own,little,moderate,5234,30,car


In [143]:
pd.get_dummies(gc, columns=["Sex", "Housing", "Saving accounts", "Checking account", "Purpose"], drop_first=True)

Unnamed: 0,Age,Job,Credit amount,Duration,Sex_male,Housing_own,Housing_rent,Saving accounts_moderate,Saving accounts_quite rich,Saving accounts_rich,Checking account_moderate,Checking account_rich,Purpose_car,Purpose_domestic appliances,Purpose_education,Purpose_furniture/equipment,Purpose_radio/TV,Purpose_repairs,Purpose_vacation/others
0,67,2,1169,6,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0
1,22,2,5951,48,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0
2,49,1,2096,12,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0
3,45,2,7882,42,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0
4,53,2,4870,24,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0
5,35,1,9055,36,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
6,53,2,2835,24,1,1,0,0,1,0,0,0,0,0,0,1,0,0,0
7,35,3,6948,36,1,0,1,0,0,0,1,0,1,0,0,0,0,0,0
8,61,1,3059,12,1,1,0,0,0,1,0,0,0,0,0,0,1,0,0
9,28,3,5234,30,1,1,0,0,0,0,1,0,1,0,0,0,0,0,0


In [144]:
gc["Checking account"].unique()

array(['little', 'moderate', nan, 'rich'], dtype=object)
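Because "Checking account" contains NaN, the encoding with `drop_first=True` leaves missing values as a row of zeros, which is indistinguishable from the dropped baseline category ("little"). One possible way to keep them apart, sketched below on the ten displayed values, is the `dummy_na=True` option of `get_dummies`; the `checking` DataFrame and `dtype=int` are choices made for this example.

```python
import numpy as np
import pandas as pd

# "Checking account" values from the ten rows displayed above
checking = pd.DataFrame({
    "Checking account": ["little", "moderate", np.nan, "little", "little",
                         np.nan, np.nan, "moderate", np.nan, "moderate"]
})

# dummy_na=True adds an extra indicator column for missing values
dummies = pd.get_dummies(checking, columns=["Checking account"],
                         dummy_na=True, dtype=int)
print(dummies)
```

An alternative with a similar effect is to fill the missing values with an explicit label (for example with `fillna`) before encoding.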

4. Consider the (fake) dataset with acceleration tests for three different cars. Use one of the methods taught in class to create a single column with the date values and another with the acceleration values.

In [145]:
s = 'Carro A'
x = 'Carro B'
three = 'Carro C'

s_data = [s, 2.5, 2.51, 2.54]
x_data = [x, 2.92, 2.91, 2.93]
three_data = [three, 3.33, 3.31, 3.35]

data = [s_data, x_data, three_data] 
car = pd.DataFrame(data, columns=['car_model', 'Sept 1 9am', 'Sept 1 10am', 'Sept 1 11am'])
car

Unnamed: 0,car_model,Sept 1 9am,Sept 1 10am,Sept 1 11am
0,Carro A,2.5,2.51,2.54
1,Carro B,2.92,2.91,2.93
2,Carro C,3.33,3.31,3.35


In [149]:
pd.melt(car, id_vars=["car_model"])

Unnamed: 0,car_model,variable,value
0,Carro A,Sept 1 9am,2.5
1,Carro B,Sept 1 9am,2.92
2,Carro C,Sept 1 9am,3.33
3,Carro A,Sept 1 10am,2.51
4,Carro B,Sept 1 10am,2.91
5,Carro C,Sept 1 10am,3.31
6,Carro A,Sept 1 11am,2.54
7,Carro B,Sept 1 11am,2.93
8,Carro C,Sept 1 11am,3.35
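The generic column names `variable` and `value` can be replaced directly in the call through the `var_name` and `value_name` parameters of `pd.melt`. The sketch below rebuilds the `car` table and names the new columns `date` and `acceleration` (names chosen here to match the exercise statement), then pivots back to confirm that the reshape loses no information.

```python
import pandas as pd

car = pd.DataFrame(
    [["Carro A", 2.5, 2.51, 2.54],
     ["Carro B", 2.92, 2.91, 2.93],
     ["Carro C", 3.33, 3.31, 3.35]],
    columns=["car_model", "Sept 1 9am", "Sept 1 10am", "Sept 1 11am"],
)

# Wide to long: one row per (car, date) pair, with descriptive column names
long_car = pd.melt(car, id_vars=["car_model"],
                   var_name="date", value_name="acceleration")

# Long back to wide, as a round-trip check
back = long_car.pivot(index="car_model", columns="date", values="acceleration")
print(long_car)
```

Note that `pivot` sorts the date labels as plain strings, so the column order of `back` may differ from the original table.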


## References:
pd.melt(): <br>
https://towardsdatascience.com/shape-tables-like-jelly-with-pandas-melt-and-pivot-f2e13e666d6 <br>
https://pub.towardsai.net/understanding-pandas-melt-pd-melt-362954f8c125