In [2]:
import pandas as pd
import numpy as np

# DataFrame Calculations

Today we will see how to create new columns in a DataFrame. So far we have created columns through conditionals (using `.loc` or `np.where`) and through the methods `.astype()`, `.map()` and `.fillna()`.

Creating columns is extremely simple: just remember that a `DataFrame` behaves like a dictionary of `Series`! We can create new columns the same way we add keys to a dictionary: with the *assignment* operator, `=`.
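
As a minimal sketch of the dictionary analogy, the snippet below uses a small made-up table (`tb_demo` and its columns are hypothetical and are not part of the sleep dataset):

```python
import pandas as pd

# Hypothetical toy table, used only for illustration.
tb_demo = pd.DataFrame({'animal': ['cat', 'cow', 'pig'],
                        'weight_kg': [3.3, 465.0, 192.0]})

# Adding a key to a dictionary...
d = {'a': 1}
d['b'] = 2

# ...and adding a column to a DataFrame use the same assignment syntax.
tb_demo['is_heavy'] = tb_demo['weight_kg'] > 100

# Like dictionary keys, column names can be listed and tested with `in`.
print(list(tb_demo.columns))
print('is_heavy' in tb_demo)
```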

For today's class we will use a new dataset: the data from the paper *Sleep in Mammals: Ecological and Constitutional Correlates*, which contains information about the sleep and life of certain animals.

## Reading the DataFrame

Let's start by loading the DataFrame and looking at the documentation and the data.

Documentation:
http://lib.stat.cmu.edu/datasets/sleep

In [3]:
tb_animals = pd.read_csv('http://www.statsci.org/data/general/sleep.txt', sep='\t')

In [4]:
tb_animals.describe()

Unnamed: 0,BodyWt,BrainWt,NonDreaming,Dreaming,TotalSleep,LifeSpan,Gestation,Predation,Exposure,Danger
count,62.0,62.0,48.0,50.0,58.0,58.0,58.0,62.0,62.0,62.0
mean,198.789984,283.134194,8.672917,1.972,10.532759,19.877586,142.353448,2.870968,2.419355,2.612903
std,899.158011,930.278942,3.666452,1.442651,4.60676,18.206255,146.805039,1.476414,1.604792,1.441252
min,0.005,0.14,2.1,0.0,2.6,2.0,12.0,1.0,1.0,1.0
25%,0.6,4.25,6.25,0.9,8.05,6.625,35.75,2.0,1.0,1.0
50%,3.3425,17.25,8.35,1.8,10.45,15.1,79.0,3.0,2.0,2.0
75%,48.2025,166.0,11.0,2.55,13.2,27.75,207.5,4.0,4.0,4.0
max,6654.0,5712.0,17.9,6.6,19.9,100.0,645.0,5.0,5.0,5.0


In [5]:
tb_animals.head()

Unnamed: 0,Species,BodyWt,BrainWt,NonDreaming,Dreaming,TotalSleep,LifeSpan,Gestation,Predation,Exposure,Danger
0,Africanelephant,6654.0,5712.0,,,3.3,38.6,645.0,3,5,3
1,Africangiantpouchedrat,1.0,6.6,6.3,2.0,8.3,4.5,42.0,3,1,3
2,ArcticFox,3.385,44.5,,,12.5,14.0,60.0,1,1,1
3,Arcticgroundsquirrel,0.92,5.7,,,16.5,,25.0,5,2,3
4,Asianelephant,2547.0,4603.0,2.1,1.8,3.9,69.0,624.0,3,5,4


# Calculations with DataFrames

The simplest ways to create new columns are from constants, from lists, or from calculations with other columns. Let's see how to do each of these.

## Constant columns

We can create a column with a constant value simply by assigning a number to the column.

In [10]:
tb_animals['new_column'] = 1

In [9]:
tb_animals.head()

Unnamed: 0,Species,BodyWt,BrainWt,NonDreaming,Dreaming,TotalSleep,LifeSpan,Gestation,Predation,Exposure,Danger,new_column
0,Africanelephant,6654.0,5712.0,,,3.3,38.6,645.0,3,5,3,1
1,Africangiantpouchedrat,1.0,6.6,6.3,2.0,8.3,4.5,42.0,3,1,3,1
2,ArcticFox,3.385,44.5,,,12.5,14.0,60.0,1,1,1,1
3,Arcticgroundsquirrel,0.92,5.7,,,16.5,,25.0,5,2,3,1
4,Asianelephant,2547.0,4603.0,2.1,1.8,3.9,69.0,624.0,3,5,4,1


In [11]:
tb_animals = tb_animals.drop('new_column', axis = 1)

In [12]:
tb_animals.head()

Unnamed: 0,Species,BodyWt,BrainWt,NonDreaming,Dreaming,TotalSleep,LifeSpan,Gestation,Predation,Exposure,Danger
0,Africanelephant,6654.0,5712.0,,,3.3,38.6,645.0,3,5,3
1,Africangiantpouchedrat,1.0,6.6,6.3,2.0,8.3,4.5,42.0,3,1,3
2,ArcticFox,3.385,44.5,,,12.5,14.0,60.0,1,1,1
3,Arcticgroundsquirrel,0.92,5.7,,,16.5,,25.0,5,2,3
4,Asianelephant,2547.0,4603.0,2.1,1.8,3.9,69.0,624.0,3,5,4


## Creating columns from `lists`

We can create a column from a list (or another list-like iterable). Pandas will interpret the iterable as a `Series`, that is, each of its elements becomes one row of our table. The iterable must therefore have the same length as the number of rows in the table.

In [13]:
[i for i in range(tb_animals.shape[0])]

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 57,
 58,
 59,
 60,
 61]

In [14]:
tb_animals['id_linha'] = [i for i in range(tb_animals.shape[0])]

In [15]:
tb_animals.head()

Unnamed: 0,Species,BodyWt,BrainWt,NonDreaming,Dreaming,TotalSleep,LifeSpan,Gestation,Predation,Exposure,Danger,id_linha
0,Africanelephant,6654.0,5712.0,,,3.3,38.6,645.0,3,5,3,0
1,Africangiantpouchedrat,1.0,6.6,6.3,2.0,8.3,4.5,42.0,3,1,3,1
2,ArcticFox,3.385,44.5,,,12.5,14.0,60.0,1,1,1,2
3,Arcticgroundsquirrel,0.92,5.7,,,16.5,,25.0,5,2,3,3
4,Asianelephant,2547.0,4603.0,2.1,1.8,3.9,69.0,624.0,3,5,4,4


In [16]:
tb_animals['erro'] = [1,2,3]

ValueError: Length of values (3) does not match length of index (62)

## Creating columns from calculations

We can use the mathematical operators to perform operations on the columns of a DataFrame. The operation is applied to each element of the column, just like with NumPy arrays.

In [19]:
tb_animals['BrainWt']/1000 

0     5.7120
1     0.0066
2     0.0445
3     0.0057
4     4.6030
       ...  
57    0.0123
58    0.0025
59    0.0580
60    0.0039
61    0.0170
Name: BrainWt, Length: 62, dtype: float64

In [20]:
tb_animals['BrainWt_kg'] = tb_animals['BrainWt']/1000 

In [22]:
tb_animals[['BrainWt', 'BrainWt_kg']].head()

Unnamed: 0,BrainWt,BrainWt_kg
0,5712.0,5.712
1,6.6,0.0066
2,44.5,0.0445
3,5.7,0.0057
4,4603.0,4.603


## Calculations between columns

We can also perform operations between columns. Just as the comparison operators (`<`, `>`, `==`, etc.) can be applied to a column to create a new column, the mathematical operators can be applied to two columns to create new columns.

In [25]:
tb_animals['BrainWt_kg']/tb_animals['BodyWt']

0     0.000858
1     0.006600
2     0.013146
3     0.006196
4     0.001807
        ...   
57    0.006150
58    0.024038
59    0.013842
60    0.001114
61    0.004198
Length: 62, dtype: float64

In [26]:
tb_animals['ratio_brain_body'] = tb_animals['BrainWt_kg']/tb_animals['BodyWt']

In [27]:
tb_animals['ratio_brain_body'].describe()

count    62.000000
mean      0.009624
std       0.008915
min       0.000858
25%       0.003103
50%       0.006611
75%       0.013668
max       0.039604
Name: ratio_brain_body, dtype: float64

In [28]:
tb_animals.head()

Unnamed: 0,Species,BodyWt,BrainWt,NonDreaming,Dreaming,TotalSleep,LifeSpan,Gestation,Predation,Exposure,Danger,id_linha,BrainWt_kg,ratio_brain_body
0,Africanelephant,6654.0,5712.0,,,3.3,38.6,645.0,3,5,3,0,5.712,0.000858
1,Africangiantpouchedrat,1.0,6.6,6.3,2.0,8.3,4.5,42.0,3,1,3,1,0.0066,0.0066
2,ArcticFox,3.385,44.5,,,12.5,14.0,60.0,1,1,1,2,0.0445,0.013146
3,Arcticgroundsquirrel,0.92,5.7,,,16.5,,25.0,5,2,3,3,0.0057,0.006196
4,Asianelephant,2547.0,4603.0,2.1,1.8,3.9,69.0,624.0,3,5,4,4,4.603,0.001807


In [29]:
tb_animals[tb_animals['ratio_brain_body']>0.03]

Unnamed: 0,Species,BodyWt,BrainWt,NonDreaming,Dreaming,TotalSleep,LifeSpan,Gestation,Predation,Exposure,Danger,id_linha,BrainWt_kg,ratio_brain_body
26,Groundsquirrel,0.101,4.0,10.4,3.4,13.8,9.0,28.0,5,1,3,26,0.004,0.039604
41,Owlmonkey,0.48,15.5,15.2,1.8,17.0,12.0,140.0,2,2,2,41,0.0155,0.032292
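
One detail worth keeping in mind: operations between two `Series` are matched by index label, not by position. Inside a single DataFrame every column shares the same index, so this goes unnoticed. The sketch below uses two made-up `Series` (`a` and `b` are hypothetical) with different labels to make the alignment visible:

```python
import pandas as pd

# Hypothetical Series with partially overlapping labels, in different order.
a = pd.Series([10.0, 20.0, 30.0], index=['x', 'y', 'z'])
b = pd.Series([2.0, 5.0, 10.0], index=['z', 'y', 'w'])

ratio = a / b

# Labels present in both Series are divided pairwise:
#   y -> 20 / 5, z -> 30 / 2
# Labels present in only one of them (w and x) become NaN.
print(ratio)
```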


### Boolean comparisons between columns

Just as we can compare a column with a value, we can also compare two columns with each other:

In [30]:
tb_animals['ratio_brain_body']>0.01

0     False
1     False
2      True
3     False
4     False
      ...  
57    False
58     True
59     True
60    False
61    False
Name: ratio_brain_body, Length: 62, dtype: bool

In [32]:
# Count the rows where the comparison is True
(tb_animals['Dreaming'] > tb_animals['NonDreaming']).sum()

0

In [33]:
tb_animals[tb_animals['Dreaming'] > tb_animals['NonDreaming']]

Unnamed: 0,Species,BodyWt,BrainWt,NonDreaming,Dreaming,TotalSleep,LifeSpan,Gestation,Predation,Exposure,Danger,id_linha,BrainWt_kg,ratio_brain_body
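
When reading a comparison like this one, remember that the columns contain missing values: any comparison involving `NaN` evaluates to `False`. A small sketch with made-up numbers (`tb_demo` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical values; the middle row has both measurements missing.
tb_demo = pd.DataFrame({'Dreaming': [2.0, np.nan, 1.8],
                        'NonDreaming': [6.3, np.nan, 1.0]})

mask = tb_demo['Dreaming'] > tb_demo['NonDreaming']

# True counts as 1 and False as 0, so .sum() counts the rows that satisfy the condition.
n_true = mask.sum()
print(mask.tolist(), n_true)
```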


## Using `string` methods on columns

Applying `str` methods is syntactically a little more involved than using the operators: we need to go through an attribute of the `Series` to reach the methods.

In [34]:
tb_animals['Species'].head()

0           Africanelephant
1    Africangiantpouchedrat
2                 ArcticFox
3      Arcticgroundsquirrel
4             Asianelephant
Name: Species, dtype: object

In [35]:
tb_animals['Species'].lower()

AttributeError: 'Series' object has no attribute 'lower'

To access the `string` methods we use the `.str` attribute of the `Series`:

In [36]:
tb_animals['Species'].str.lower()

0            africanelephant
1     africangiantpouchedrat
2                  arcticfox
3       arcticgroundsquirrel
4              asianelephant
               ...          
57                 treehyrax
58                 treeshrew
59                    vervet
60              wateropossum
61      yellow-belliedmarmot
Name: Species, Length: 62, dtype: object
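
A convenient property of the `.str` accessor is that missing values pass through as `NaN` instead of raising an error. A minimal sketch with a made-up `Series` (`species` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing name.
species = pd.Series(['ArcticFox', np.nan, 'Cat'])

lower = species.str.lower()   # lowercase version of each name
lengths = species.str.len()   # number of characters in each name

print(lower.tolist())
print(lengths.tolist())
```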

In [37]:
tb_animals['lower_species'] = tb_animals['Species'].str.lower()

In [38]:
tb_animals.head()

Unnamed: 0,Species,BodyWt,BrainWt,NonDreaming,Dreaming,TotalSleep,LifeSpan,Gestation,Predation,Exposure,Danger,id_linha,BrainWt_kg,ratio_brain_body,lower_species
0,Africanelephant,6654.0,5712.0,,,3.3,38.6,645.0,3,5,3,0,5.712,0.000858,africanelephant
1,Africangiantpouchedrat,1.0,6.6,6.3,2.0,8.3,4.5,42.0,3,1,3,1,0.0066,0.0066,africangiantpouchedrat
2,ArcticFox,3.385,44.5,,,12.5,14.0,60.0,1,1,1,2,0.0445,0.013146,arcticfox
3,Arcticgroundsquirrel,0.92,5.7,,,16.5,,25.0,5,2,3,3,0.0057,0.006196,arcticgroundsquirrel
4,Asianelephant,2547.0,4603.0,2.1,1.8,3.9,69.0,624.0,3,5,4,4,4.603,0.001807,asianelephant


Besides the basic `string` methods, we can also use regex functions! The syntax is the same: we use the `.str` attribute to access these methods.

Let's start with the `.contains()` method, which returns a boolean vector indicating whether or not a pattern was found in each row of our column.

In [39]:
tb_animals['lower_species'].str.contains(r'monk|ape|man|gorilla|baboon|chimpanzee')

0     False
1     False
2     False
3     False
4     False
      ...  
57    False
58    False
59    False
60    False
61    False
Name: lower_species, Length: 62, dtype: bool

In [40]:
tb_animals['id_primata'] = tb_animals['lower_species'].str.contains(r'monk|ape|man|gorilla|baboon|chimpanzee')

In [41]:
tb_animals.head()

Unnamed: 0,Species,BodyWt,BrainWt,NonDreaming,Dreaming,TotalSleep,LifeSpan,Gestation,Predation,Exposure,Danger,id_linha,BrainWt_kg,ratio_brain_body,lower_species,id_primata
0,Africanelephant,6654.0,5712.0,,,3.3,38.6,645.0,3,5,3,0,5.712,0.000858,africanelephant,False
1,Africangiantpouchedrat,1.0,6.6,6.3,2.0,8.3,4.5,42.0,3,1,3,1,0.0066,0.0066,africangiantpouchedrat,False
2,ArcticFox,3.385,44.5,,,12.5,14.0,60.0,1,1,1,2,0.0445,0.013146,arcticfox,False
3,Arcticgroundsquirrel,0.92,5.7,,,16.5,,25.0,5,2,3,3,0.0057,0.006196,arcticgroundsquirrel,False
4,Asianelephant,2547.0,4603.0,2.1,1.8,3.9,69.0,624.0,3,5,4,4,4.603,0.001807,asianelephant,False


In [42]:
sum(tb_animals['id_primata'])

7
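
`.str.contains()` also accepts the `case` and `na` parameters, which can replace the separate lowercase column and decide how missing names are treated. A sketch on a made-up `Series` (`species` and the shortened `pattern` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical column with mixed case and one missing value.
species = pd.Series(['Owlmonkey', 'Cat', np.nan, 'Baboon'])
pattern = r'monk|baboon'

# case=False ignores upper/lower case; na=False treats missing names as "no match".
is_primate = species.str.contains(pattern, case=False, na=False)

print(is_primate.tolist())
print(is_primate.sum())
```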

We can use the `.findall()` method to keep track of which part of the `string` matched our pattern:

In [48]:
import re
pattern = r'monk|ape|man|gorilla|baboon|chimpanzee'
text = 'O Pedro ama apes mas não gorillas'
re.findall(pattern, text)[0]

'ape'

In [44]:
tb_animals['lista_primata'] = tb_animals['lower_species'].str.findall(r'monk|ape|man|gorilla|baboon|chimpanzee')

In [45]:
tb_animals.head(10)

Unnamed: 0,Species,BodyWt,BrainWt,NonDreaming,Dreaming,TotalSleep,LifeSpan,Gestation,Predation,Exposure,Danger,id_linha,BrainWt_kg,ratio_brain_body,lower_species,id_primata,lista_primata
0,Africanelephant,6654.0,5712.0,,,3.3,38.6,645.0,3,5,3,0,5.712,0.000858,africanelephant,False,[]
1,Africangiantpouchedrat,1.0,6.6,6.3,2.0,8.3,4.5,42.0,3,1,3,1,0.0066,0.0066,africangiantpouchedrat,False,[]
2,ArcticFox,3.385,44.5,,,12.5,14.0,60.0,1,1,1,2,0.0445,0.013146,arcticfox,False,[]
3,Arcticgroundsquirrel,0.92,5.7,,,16.5,,25.0,5,2,3,3,0.0057,0.006196,arcticgroundsquirrel,False,[]
4,Asianelephant,2547.0,4603.0,2.1,1.8,3.9,69.0,624.0,3,5,4,4,4.603,0.001807,asianelephant,False,[]
5,Baboon,10.55,179.5,9.1,0.7,9.8,27.0,180.0,4,4,4,5,0.1795,0.017014,baboon,True,[baboon]
6,Bigbrownbat,0.023,0.3,15.8,3.9,19.7,19.0,35.0,1,1,1,6,0.0003,0.013043,bigbrownbat,False,[]
7,Braziliantapir,160.0,169.0,5.2,1.0,6.2,30.4,392.0,4,5,4,7,0.169,0.001056,braziliantapir,False,[]
8,Cat,3.3,25.6,10.9,3.6,14.5,28.0,63.0,1,2,1,8,0.0256,0.007758,cat,False,[]
9,Chimpanzee,52.16,440.0,8.3,1.4,9.7,50.0,230.0,1,1,1,9,0.44,0.008436,chimpanzee,True,[chimpanzee]


The `.findall()` method returns a list: if we want to turn that list into a string we can use the `.map()` method. Let's start by defining a function that selects the first element of each list and then use `.map()` to apply that function to our column.

In [54]:
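def prim_elem(lista):
    """Return the first match in `lista`, or a default label if the list is empty."""
    # An alternative is to test `if lista:` and return np.nan for empty lists.
    try:
        return lista[0]
    except IndexError:
        return 'Não é Primata'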

tb_animals['nome_primata'] = tb_animals['lista_primata'].map(prim_elem)
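
As an alternative sketch (not the approach used in this class), the `.str` accessor can also index into lists: `.str[0]` takes the first element and yields `NaN` for empty lists, which `.fillna()` can then replace. The `Series` below is made up for illustration:

```python
import pandas as pd

# Hypothetical lowercase names.
lower_species = pd.Series(['baboon', 'cat', 'owlmonkey'])

matches = lower_species.str.findall(r'monk|baboon')  # one list per row
first = matches.str[0]                               # NaN where the list is empty
first_filled = first.fillna('Não é Primata')

print(first_filled.tolist())
```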

In [56]:
tb_animals.head(10)

Unnamed: 0,Species,BodyWt,BrainWt,NonDreaming,Dreaming,TotalSleep,LifeSpan,Gestation,Predation,Exposure,Danger,id_linha,BrainWt_kg,ratio_brain_body,lower_species,id_primata,lista_primata,nome_primata
0,Africanelephant,6654.0,5712.0,,,3.3,38.6,645.0,3,5,3,0,5.712,0.000858,africanelephant,False,[],Não é Primata
1,Africangiantpouchedrat,1.0,6.6,6.3,2.0,8.3,4.5,42.0,3,1,3,1,0.0066,0.0066,africangiantpouchedrat,False,[],Não é Primata
2,ArcticFox,3.385,44.5,,,12.5,14.0,60.0,1,1,1,2,0.0445,0.013146,arcticfox,False,[],Não é Primata
3,Arcticgroundsquirrel,0.92,5.7,,,16.5,,25.0,5,2,3,3,0.0057,0.006196,arcticgroundsquirrel,False,[],Não é Primata
4,Asianelephant,2547.0,4603.0,2.1,1.8,3.9,69.0,624.0,3,5,4,4,4.603,0.001807,asianelephant,False,[],Não é Primata
5,Baboon,10.55,179.5,9.1,0.7,9.8,27.0,180.0,4,4,4,5,0.1795,0.017014,baboon,True,[baboon],baboon
6,Bigbrownbat,0.023,0.3,15.8,3.9,19.7,19.0,35.0,1,1,1,6,0.0003,0.013043,bigbrownbat,False,[],Não é Primata
7,Braziliantapir,160.0,169.0,5.2,1.0,6.2,30.4,392.0,4,5,4,7,0.169,0.001056,braziliantapir,False,[],Não é Primata
8,Cat,3.3,25.6,10.9,3.6,14.5,28.0,63.0,1,2,1,8,0.0256,0.007758,cat,False,[],Não é Primata
9,Chimpanzee,52.16,440.0,8.3,1.4,9.7,50.0,230.0,1,1,1,9,0.44,0.008436,chimpanzee,True,[chimpanzee],chimpanzee


## Sorting values

We can use the `.sort_values()` method to sort a DataFrame by one (or more) columns.

In [57]:
tb_animals.sort_values(by='ratio_brain_body', ascending=False)

Unnamed: 0,Species,BodyWt,BrainWt,NonDreaming,Dreaming,TotalSleep,LifeSpan,Gestation,Predation,Exposure,Danger,id_linha,BrainWt_kg,ratio_brain_body,lower_species,id_primata,lista_primata,nome_primata
26,Groundsquirrel,0.101,4.00,10.4,3.4,13.8,9.0,28.0,5,1,3,26,0.00400,0.039604,groundsquirrel,False,[],Não é Primata
41,Owlmonkey,0.480,15.50,15.2,1.8,17.0,12.0,140.0,2,2,2,41,0.01550,0.032292,owlmonkey,True,[monk],monk
31,Lessershort-tailedshrew,0.005,0.14,7.7,1.4,9.1,2.6,21.5,5,2,4,31,0.00014,0.028000,lessershort-tailedshrew,False,[],Não é Primata
49,Rhesusmonkey,6.800,179.00,8.4,1.2,9.6,29.0,164.0,2,3,2,49,0.17900,0.026324,rhesusmonkey,True,[monk],monk
32,Littlebrownbat,0.010,0.25,17.9,2.0,19.9,24.0,50.0,1,1,1,32,0.00025,0.025000,littlebrownbat,False,[],Não é Primata
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60,Wateropossum,3.500,3.90,12.8,6.6,19.4,3.0,14.0,2,1,1,60,0.00390,0.001114,wateropossum,False,[],Não é Primata
7,Braziliantapir,160.000,169.00,5.2,1.0,6.2,30.4,392.0,4,5,4,7,0.16900,0.001056,braziliantapir,False,[],Não é Primata
44,Pig,192.000,180.00,6.5,1.9,8.4,27.0,115.0,4,4,4,44,0.18000,0.000937,pig,False,[],Não é Primata
11,Cow,465.000,423.00,3.2,0.7,3.9,30.0,281.0,5,5,5,11,0.42300,0.000910,cow,False,[],Não é Primata


In [58]:
tb_animals

Unnamed: 0,Species,BodyWt,BrainWt,NonDreaming,Dreaming,TotalSleep,LifeSpan,Gestation,Predation,Exposure,Danger,id_linha,BrainWt_kg,ratio_brain_body,lower_species,id_primata,lista_primata,nome_primata
0,Africanelephant,6654.000,5712.0,,,3.3,38.6,645.0,3,5,3,0,5.7120,0.000858,africanelephant,False,[],Não é Primata
1,Africangiantpouchedrat,1.000,6.6,6.3,2.0,8.3,4.5,42.0,3,1,3,1,0.0066,0.006600,africangiantpouchedrat,False,[],Não é Primata
2,ArcticFox,3.385,44.5,,,12.5,14.0,60.0,1,1,1,2,0.0445,0.013146,arcticfox,False,[],Não é Primata
3,Arcticgroundsquirrel,0.920,5.7,,,16.5,,25.0,5,2,3,3,0.0057,0.006196,arcticgroundsquirrel,False,[],Não é Primata
4,Asianelephant,2547.000,4603.0,2.1,1.8,3.9,69.0,624.0,3,5,4,4,4.6030,0.001807,asianelephant,False,[],Não é Primata
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57,Treehyrax,2.000,12.3,4.9,0.5,5.4,7.5,200.0,3,1,3,57,0.0123,0.006150,treehyrax,False,[],Não é Primata
58,Treeshrew,0.104,2.5,13.2,2.6,15.8,2.3,46.0,3,2,2,58,0.0025,0.024038,treeshrew,False,[],Não é Primata
59,Vervet,4.190,58.0,9.7,0.6,10.3,24.0,210.0,4,3,4,59,0.0580,0.013842,vervet,False,[],Não é Primata
60,Wateropossum,3.500,3.9,12.8,6.6,19.4,3.0,14.0,2,1,1,60,0.0039,0.001114,wateropossum,False,[],Não é Primata


Remember that, by default, DataFrame methods such as `.sort_values()` return a new object and leave the original unchanged! If we want to keep the result we need to assign it explicitly:

In [59]:
tb_animals = tb_animals.sort_values(by=['Predation', 'ratio_brain_body'], ascending=False)

In [62]:
tb_animals.head(10)

Unnamed: 0,Species,BodyWt,BrainWt,NonDreaming,Dreaming,TotalSleep,LifeSpan,Gestation,Predation,Exposure,Danger,id_linha,BrainWt_kg,ratio_brain_body,lower_species,id_primata,lista_primata,nome_primata
26,Groundsquirrel,0.101,4.0,10.4,3.4,13.8,9.0,28.0,5,1,3,26,0.004,0.039604,groundsquirrel,False,[],Não é Primata
31,Lessershort-tailedshrew,0.005,0.14,7.7,1.4,9.1,2.6,21.5,5,2,4,31,0.00014,0.028,lessershort-tailedshrew,False,[],Não é Primata
10,Chinchilla,0.425,6.4,11.0,1.5,12.5,7.0,112.0,5,4,4,10,0.0064,0.015059,chinchilla,False,[],Não é Primata
52,Roedeer,14.83,98.2,,,2.6,17.0,150.0,5,5,5,52,0.0982,0.006622,roedeer,False,[],Não é Primata
3,Arcticgroundsquirrel,0.92,5.7,,,16.5,,25.0,5,2,3,3,0.0057,0.006196,arcticgroundsquirrel,False,[],Não é Primata
27,Guineapig,1.04,5.5,7.4,0.8,8.2,7.6,68.0,5,3,4,27,0.0055,0.005288,guineapig,False,[],Não é Primata
45,Rabbit,2.5,12.1,7.5,0.9,8.4,18.0,31.0,5,5,5,45,0.0121,0.00484,rabbit,False,[],Não é Primata
21,Goat,27.66,115.0,3.3,0.5,3.8,20.0,148.0,5,5,5,21,0.115,0.004158,goat,False,[],Não é Primata
53,Sheep,55.5,175.0,3.2,0.6,3.8,20.0,151.0,5,5,5,53,0.175,0.003153,sheep,False,[],Não é Primata
13,Donkey,187.1,419.0,,,3.1,40.0,365.0,5,5,5,13,0.419,0.002239,donkey,False,[],Não é Primata
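
When sorting by several columns, `ascending` can also be a list with one entry per column. The sketch below, on a made-up table (`tb_demo` is hypothetical), sorts one column in descending and the other in ascending order, and checks that the original table keeps its row order:

```python
import pandas as pd

# Hypothetical toy table.
tb_demo = pd.DataFrame({'Predation': [1, 5, 5, 3],
                        'ratio': [0.02, 0.01, 0.04, 0.03]})

# Predation from highest to lowest; ties broken by ratio from lowest to highest.
sorted_tb = tb_demo.sort_values(by=['Predation', 'ratio'], ascending=[False, True])

print(sorted_tb.index.tolist())  # row labels in the new order
print(tb_demo.index.tolist())    # the original table is untouched
```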


## Aggregation methods across columns

We can use the aggregation methods to create new columns: just change the axis along which the operation is performed!

In [63]:
tb_animals[['Predation', 'Exposure', 'Danger']].mean(axis = 0)

Predation    2.870968
Exposure     2.419355
Danger       2.612903
dtype: float64

In [64]:
tb_animals[['Predation', 'Exposure', 'Danger']].mean(axis=1)

26    3.000000
31    3.666667
10    4.333333
52    5.000000
3     3.333333
        ...   
24    1.666667
25    1.000000
23    2.000000
29    1.000000
19    1.000000
Length: 62, dtype: float64

In [65]:
tb_animals['risco'] = tb_animals[['Predation', 'Exposure', 'Danger']].mean(axis=1)

In [70]:
tb_risco = tb_animals[['Predation', 'Exposure', 'Danger', 'risco']]

In [67]:
tb_animals[['Predation', 'Exposure', 'Danger', 'risco']].mean(axis=0)

Predation    2.870968
Exposure     2.419355
Danger       2.612903
risco        2.634409
dtype: float64

In [72]:
tb_animals[['Predation', 'Exposure', 'Danger']].sum(axis=1)

26     9
31    11
10    13
52    15
3     10
      ..
24     5
25     3
23     6
29     3
19     3
Length: 62, dtype: int64

In [73]:
tb_animals['Predation'] + tb_animals['Exposure'] + tb_animals['Danger']

26     9
31    11
10    13
52    15
3     10
      ..
24     5
25     3
23     6
29     3
19     3
Length: 62, dtype: int64
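
To make the role of `axis` concrete, here is a sketch on a two-row table small enough to check by hand (`tb_demo` is a hypothetical table; the two rows mimic animals with indices 5, 1, 3 and 1, 1, 1):

```python
import pandas as pd

# Hypothetical two-row table.
tb_demo = pd.DataFrame({'Predation': [5, 1],
                        'Exposure': [1, 1],
                        'Danger': [3, 1]})

col_means = tb_demo.mean(axis=0)  # collapses the rows: one value per column
row_means = tb_demo.mean(axis=1)  # collapses the columns: one value per row

print(col_means.tolist())
print(row_means.tolist())
```

Only the `axis=1` result has one value per row, which is why it can be stored as a new column.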

# Conditional calculations

We can use the `.loc` attribute to create conditional columns. Let's start with a simple example: creating a column from a constant.

## Constant conditional columns

In [74]:
tb_animals['flag_alto_risco'] = 0

In [75]:
tb_animals.describe()

Unnamed: 0,BodyWt,BrainWt,NonDreaming,Dreaming,TotalSleep,LifeSpan,Gestation,Predation,Exposure,Danger,id_linha,BrainWt_kg,ratio_brain_body,risco,flag_alto_risco
count,62.0,62.0,48.0,50.0,58.0,58.0,58.0,62.0,62.0,62.0,62.0,62.0,62.0,62.0,62.0
mean,198.789984,283.134194,8.672917,1.972,10.532759,19.877586,142.353448,2.870968,2.419355,2.612903,30.5,0.283134,0.009624,2.634409,0.0
std,899.158011,930.278942,3.666452,1.442651,4.60676,18.206255,146.805039,1.476414,1.604792,1.441252,18.041619,0.930279,0.008915,1.386521,0.0
min,0.005,0.14,2.1,0.0,2.6,2.0,12.0,1.0,1.0,1.0,0.0,0.00014,0.000858,1.0,0.0
25%,0.6,4.25,6.25,0.9,8.05,6.625,35.75,2.0,1.0,1.0,15.25,0.00425,0.003103,1.416667,0.0
50%,3.3425,17.25,8.35,1.8,10.45,15.1,79.0,3.0,2.0,2.0,30.5,0.01725,0.006611,2.166667,0.0
75%,48.2025,166.0,11.0,2.55,13.2,27.75,207.5,4.0,4.0,4.0,45.75,0.166,0.013668,4.0,0.0
max,6654.0,5712.0,17.9,6.6,19.9,100.0,645.0,5.0,5.0,5.0,61.0,5.712,0.039604,5.0,0.0


In [76]:
tb_animals.loc[tb_animals['risco']>=4, 'flag_alto_risco'] = 1

In [77]:
tb_animals.describe()

Unnamed: 0,BodyWt,BrainWt,NonDreaming,Dreaming,TotalSleep,LifeSpan,Gestation,Predation,Exposure,Danger,id_linha,BrainWt_kg,ratio_brain_body,risco,flag_alto_risco
count,62.0,62.0,48.0,50.0,58.0,58.0,58.0,62.0,62.0,62.0,62.0,62.0,62.0,62.0,62.0
mean,198.789984,283.134194,8.672917,1.972,10.532759,19.877586,142.353448,2.870968,2.419355,2.612903,30.5,0.283134,0.009624,2.634409,0.274194
std,899.158011,930.278942,3.666452,1.442651,4.60676,18.206255,146.805039,1.476414,1.604792,1.441252,18.041619,0.930279,0.008915,1.386521,0.449749
min,0.005,0.14,2.1,0.0,2.6,2.0,12.0,1.0,1.0,1.0,0.0,0.00014,0.000858,1.0,0.0
25%,0.6,4.25,6.25,0.9,8.05,6.625,35.75,2.0,1.0,1.0,15.25,0.00425,0.003103,1.416667,0.0
50%,3.3425,17.25,8.35,1.8,10.45,15.1,79.0,3.0,2.0,2.0,30.5,0.01725,0.006611,2.166667,0.0
75%,48.2025,166.0,11.0,2.55,13.2,27.75,207.5,4.0,4.0,4.0,45.75,0.166,0.013668,4.0,1.0
max,6654.0,5712.0,17.9,6.6,19.9,100.0,645.0,5.0,5.0,5.0,61.0,5.712,0.039604,5.0,1.0


In [78]:
tb_animals.head()

Unnamed: 0,Species,BodyWt,BrainWt,NonDreaming,Dreaming,TotalSleep,LifeSpan,Gestation,Predation,Exposure,Danger,id_linha,BrainWt_kg,ratio_brain_body,lower_species,id_primata,lista_primata,nome_primata,risco,flag_alto_risco
26,Groundsquirrel,0.101,4.0,10.4,3.4,13.8,9.0,28.0,5,1,3,26,0.004,0.039604,groundsquirrel,False,[],Não é Primata,3.0,0
31,Lessershort-tailedshrew,0.005,0.14,7.7,1.4,9.1,2.6,21.5,5,2,4,31,0.00014,0.028,lessershort-tailedshrew,False,[],Não é Primata,3.666667,0
10,Chinchilla,0.425,6.4,11.0,1.5,12.5,7.0,112.0,5,4,4,10,0.0064,0.015059,chinchilla,False,[],Não é Primata,4.333333,1
52,Roedeer,14.83,98.2,,,2.6,17.0,150.0,5,5,5,52,0.0982,0.006622,roedeer,False,[],Não é Primata,5.0,1
3,Arcticgroundsquirrel,0.92,5.7,,,16.5,,25.0,5,2,3,3,0.0057,0.006196,arcticgroundsquirrel,False,[],Não é Primata,3.333333,0
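
The same flag can be built in a single step with `np.where`, which was mentioned at the start of this class. A sketch comparing both approaches on made-up values (`tb_demo` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical risk scores.
tb_demo = pd.DataFrame({'risco': [3.0, 4.333333, 5.0, 1.0]})

# Two steps with .loc: default value first, then overwrite where the condition holds.
tb_demo['flag_loc'] = 0
tb_demo.loc[tb_demo['risco'] >= 4, 'flag_loc'] = 1

# One step with np.where(condition, value_if_true, value_if_false).
tb_demo['flag_where'] = np.where(tb_demo['risco'] >= 4, 1, 0)

print(tb_demo)
```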


In [79]:
# numeric_only=True skips the text columns (needed in pandas 2.0 or later)
tb_animals.groupby('flag_alto_risco').mean(numeric_only=True)

Unnamed: 0_level_0,BodyWt,BrainWt,NonDreaming,Dreaming,TotalSleep,LifeSpan,Gestation,Predation,Exposure,Danger,id_linha,BrainWt_kg,ratio_brain_body,id_primata,risco
flag_alto_risco,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
0,162.586089,203.836,9.602778,2.366667,11.875,17.214634,103.914634,2.244444,1.577778,1.888889,31.977778,0.203836,0.011467,0.111111,1.903704
1,294.623824,493.041176,5.883333,0.957143,6.314286,26.3,235.058824,4.529412,4.647059,4.529412,26.588235,0.493041,0.004746,0.117647,4.568627


A very useful attribute for this kind of view is `.T`: it returns the transposed DataFrame:

In [80]:
tb_animals.groupby('flag_alto_risco').mean(numeric_only=True).T

flag_alto_risco,0,1
BodyWt,162.586089,294.623824
BrainWt,203.836,493.041176
NonDreaming,9.602778,5.883333
Dreaming,2.366667,0.957143
TotalSleep,11.875,6.314286
LifeSpan,17.214634,26.3
Gestation,103.914634,235.058824
Predation,2.244444,4.529412
Exposure,1.577778,4.647059
Danger,1.888889,4.529412


## Conditional columns using operations

In [81]:
tb_animals['max_risco'] = tb_animals[['Predation', 'Exposure', 'Danger']].max(axis = 1)

In [82]:
tb_animals.loc[tb_animals['max_risco'] < 5, 'risco_2'] = tb_animals[['Predation', 'Exposure', 'Danger']].mean(axis = 1)
tb_animals.loc[tb_animals['max_risco'] == 5, 'risco_2'] = 5

In [83]:
tb_animals[['Predation', 'Exposure', 'Danger', 'risco', 'risco_2']]

Unnamed: 0,Predation,Exposure,Danger,risco,risco_2
26,5,1,3,3.000000,5.000000
31,5,2,4,3.666667,5.000000
10,5,4,4,4.333333,5.000000
52,5,5,5,5.000000,5.000000
3,5,2,3,3.333333,5.000000
...,...,...,...,...,...
24,1,3,1,1.666667,1.666667
25,1,1,1,1.000000,1.000000
23,1,4,1,2.000000,2.000000
29,1,1,1,1.000000,1.000000
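
As an alternative sketch for this kind of rule (not the approach used in this class), `Series.mask(condition, value)` keeps the original values and replaces them only where the condition is true. The table below is made up for illustration:

```python
import pandas as pd

cols = ['Predation', 'Exposure', 'Danger']

# Hypothetical three-row table.
tb_demo = pd.DataFrame({'Predation': [5, 1, 1],
                        'Exposure': [1, 3, 1],
                        'Danger': [3, 1, 1]})

row_max = tb_demo[cols].max(axis=1)
row_mean = tb_demo[cols].mean(axis=1)

# Keep the row mean, except where the largest index is 5: there, use 5.
tb_demo['risco_2'] = row_mean.mask(row_max == 5, 5.0)

print(tb_demo)
```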


In [84]:
tb_animals['flag_alto_risco_2'] = 0
tb_animals.loc[tb_animals['risco_2']>=4, 'flag_alto_risco_2'] = 1

In [87]:
tb_animals.groupby('flag_alto_risco_2').mean(numeric_only=True).T

flag_alto_risco_2,0,1
BodyWt,16.130439,555.411
BrainWt,84.165366,671.597143
NonDreaming,9.635294,6.335714
Dreaming,2.364706,1.1375
TotalSleep,11.995,7.283333
LifeSpan,17.252632,24.865
Gestation,95.702703,224.547619
Predation,2.02439,4.52381
Exposure,1.487805,4.238095
Danger,1.756098,4.285714


# Quantiles

Quantiles are cut points on a numeric variable, computed so that a given percentage of the observations falls below each point. For example, the 0.5 quantile (50%, or the *median*) of the variable `BodyWt` is a number such that 50% of the observations have `BodyWt` below it.

The best-known quantiles are the **quartiles**:

1. 0.25, or first quartile, where 25% of the observations are below the quantile;
1. 0.5, or median, where 50% of the observations are below the quantile;
1. and 0.75, or third quartile, where 75% of the observations are below the quantile.

In addition, we often use the 0.05 and 0.95 quantiles to represent, respectively, the lowest and highest values of a variable.

In [88]:
tb_animals['BodyWt'].median()

3.3425

In [91]:
tb_animals['BodyWt'].quantile(0.05)

0.024250000000000008

In [92]:
tb_animals['BodyWt'].quantile([0.25, 0.5, 0.75])

0.25     0.6000
0.50     3.3425
0.75    48.2025
Name: BodyWt, dtype: float64
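
A quantile does not have to be one of the observed values: by default `.quantile()` interpolates linearly between neighboring observations. A sketch on four made-up numbers (`values` is hypothetical), small enough to verify by hand:

```python
import pandas as pd

# Hypothetical observations.
values = pd.Series([1.0, 2.0, 3.0, 4.0])

median = values.quantile(0.5)   # halfway between 2 and 3
q25 = values.quantile(0.25)     # three quarters of the way from 1 to 2

# Share of observations strictly below the median.
share_below = (values < median).mean()

print(median, q25, share_below)
```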

A common use of quantiles is the **discretization of continuous variables**, that is, creating a categorical variable (`string`) from a numeric variable.

In [93]:
q25 = tb_animals['BodyWt'].quantile(0.25)
q50 = tb_animals['BodyWt'].quantile(0.5)
q75 = tb_animals['BodyWt'].quantile(0.75)
print(q25, q50, q75)

0.6000000000000001 3.3425 48.2025


In [94]:
tb_animals.loc[tb_animals['BodyWt'] >= q75, 'cat_peso'] = 'Pesados'
tb_animals.loc[tb_animals['BodyWt'] < q75, 'cat_peso'] = 'Médios-Pesados'
tb_animals.loc[tb_animals['BodyWt'] < q50, 'cat_peso'] = 'Leves-Médios'
tb_animals.loc[tb_animals['BodyWt'] < q25, 'cat_peso'] = 'Leves'

In [95]:
tb_animals['cat_peso'].value_counts()

Leves             16
Pesados           16
Médios-Pesados    15
Leves-Médios      15
Name: cat_peso, dtype: int64

## Categorizing data

The task above is so common that there is a dedicated function for *cutting* a numeric variable according to its quantiles: `pd.qcut()`.

In [97]:
tb_animals['cat_peso']

26             Leves
31             Leves
10             Leves
52    Médios-Pesados
3       Leves-Médios
           ...      
24           Pesados
25    Médios-Pesados
23           Pesados
29           Pesados
19           Pesados
Name: cat_peso, Length: 62, dtype: object

In [100]:
pd.qcut(tb_animals['BodyWt'], 4)

26        (0.004, 0.6]
31        (0.004, 0.6]
10        (0.004, 0.6]
52     (3.342, 48.202]
3         (0.6, 3.342]
            ...       
24    (48.202, 6654.0]
25     (3.342, 48.202]
23    (48.202, 6654.0]
29    (48.202, 6654.0]
19    (48.202, 6654.0]
Name: BodyWt, Length: 62, dtype: category
Categories (4, interval[float64, right]): [(0.004, 0.6] < (0.6, 3.342] < (3.342, 48.202] < (48.202, 6654.0]]

In [101]:
tb_animals['BodyWt_Interval'] = pd.qcut(tb_animals['BodyWt'], 4)

In [102]:
tb_animals['BodyWt_Interval'].value_counts()

(0.004, 0.6]        16
(48.202, 6654.0]    16
(0.6, 3.342]        15
(3.342, 48.202]     15
Name: BodyWt_Interval, dtype: int64

In [103]:
tb_animals['BodyWt_Interval'].value_counts(normalize = True)

(0.004, 0.6]        0.258065
(48.202, 6654.0]    0.258065
(0.6, 3.342]        0.241935
(3.342, 48.202]     0.241935
Name: BodyWt_Interval, dtype: float64
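
`pd.qcut()` also accepts a `labels` argument, so the intervals can be given readable names directly. A sketch on eight made-up weights (`weights` is hypothetical; the labels reuse the category names from the manual version):

```python
import pandas as pd

# Hypothetical body weights in kg.
weights = pd.Series([0.005, 0.1, 0.5, 1.0, 3.0, 10.0, 50.0, 6654.0])
labels = ['Leves', 'Leves-Médios', 'Médios-Pesados', 'Pesados']

# Four quantile-based groups, named from lightest to heaviest.
cat = pd.qcut(weights, 4, labels=labels)
counts = cat.value_counts()

print(cat.tolist())
print(counts)
```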

The intervals between quantiles are not uniform: in the example above the `Leves` category had animals between roughly 0 kg and 0.6 kg, while `Médios-Pesados` had animals between 3.3 kg and 48 kg! This happens because, when we cut by quantiles, we create intervals with a uniform number of observations, and as a consequence we give up uniform interval widths.

If we want to *cut* a variable into equal-width intervals we can use the `pd.cut` function:

In [104]:
tb_animals['cat_risco'] = pd.cut(tb_animals['risco'], 3)

In [106]:
tb_animals['cat_risco'].value_counts()

(0.996, 2.333]    31
(3.667, 5.0]      17
(2.333, 3.667]    14
Name: cat_risco, dtype: int64
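
Besides a number of intervals, `pd.cut()` also accepts explicit bin edges and labels. A sketch on made-up risk scores (`risco`, the edges and the label names are all hypothetical choices):

```python
import pandas as pd

# Hypothetical risk scores.
risco = pd.Series([1.0, 2.0, 3.0, 4.333333, 5.0])

# Three equal-width intervals between the minimum and the maximum.
cat_auto = pd.cut(risco, 3)

# Explicit edges: (0, 2], (2, 4] and (4, 5], each with its own name.
cat_manual = pd.cut(risco, bins=[0, 2, 4, 5], labels=['low', 'medium', 'high'])

print(cat_auto.cat.categories)
print(cat_manual.tolist())
```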

# Bonus

## Correlation

Correlation is a statistical indicator that lets us measure how strongly two variables are related (an increase/decrease in one goes along with an increase/decrease in the other).

We will learn more about this indicator in the future: for now, it is enough to know that:

1. The closer to 1, the more directly correlated the variables are;
1. The closer to -1, the more inversely correlated the variables are;
1. The closer to 0, the less correlated the variables are.

We can use the `.corr()` method to view the correlation matrix of the numeric variables of a table:

In [107]:
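# numeric_only=True skips text and categorical columns (needed in pandas 2.0 or later)
tb_animals.corr(numeric_only=True)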

Unnamed: 0,BodyWt,BrainWt,NonDreaming,Dreaming,TotalSleep,LifeSpan,Gestation,Predation,Exposure,Danger,id_linha,BrainWt_kg,ratio_brain_body,id_primata,risco,flag_alto_risco,max_risco,risco_2,flag_alto_risco_2
BodyWt,1.0,0.934164,-0.375946,-0.109383,-0.307186,0.302451,0.651102,0.059495,0.338274,0.133581,-0.293881,0.934164,-0.206684,-0.059574,0.197911,0.066044,0.261448,0.283205,0.286167
BrainWt,0.934164,1.0,-0.369218,-0.105139,-0.358102,0.509253,0.747242,0.033855,0.3678,0.145879,-0.31533,1.0,-0.199441,0.037174,0.204463,0.139818,0.256504,0.284774,0.30129
NonDreaming,-0.375946,-0.369218,1.0,0.514254,0.962715,-0.384432,-0.594703,-0.318185,-0.543757,-0.483852,0.1593,-0.369218,0.344769,0.0879,-0.488397,-0.44392,-0.432609,-0.455507,-0.413377
Dreaming,-0.109383,-0.105139,0.514254,1.0,0.727087,-0.295745,-0.450899,-0.447471,-0.537225,-0.579337,0.006344,-0.105139,-0.056772,-0.169448,-0.560928,-0.443143,-0.476951,-0.506142,-0.400841
TotalSleep,-0.307186,-0.358102,0.962715,0.727087,1.0,-0.410202,-0.631326,-0.395835,-0.642285,-0.587742,0.097175,-0.358102,0.25744,0.037904,-0.592435,-0.521043,-0.494504,-0.53008,-0.477303
LifeSpan,0.302451,0.509253,-0.384432,-0.295745,-0.410202,1.0,0.614849,-0.102544,0.360352,0.061778,-0.344098,0.509253,-0.1191,0.405716,0.123375,0.229133,0.072078,0.117312,0.200472
Gestation,0.651102,0.747242,-0.594703,-0.450899,-0.631326,0.614849,1.0,0.200504,0.638279,0.378617,-0.27352,0.747242,-0.301431,0.147839,0.449642,0.410179,0.407577,0.43518,0.425486
Predation,0.059495,0.033855,-0.318185,-0.447471,-0.395835,-0.102544,0.200504,1.0,0.618246,0.916042,0.051697,0.033855,-0.090758,-0.177373,0.91087,0.696052,0.905051,0.909189,0.80774
Exposure,0.338274,0.3678,-0.543757,-0.537225,-0.642285,0.360352,0.638279,0.618246,1.0,0.787203,-0.1642,0.3678,-0.335945,0.0661,0.87801,0.860177,0.778665,0.822943,0.817713
Danger,0.133581,0.145879,-0.483852,-0.579337,-0.587742,0.061778,0.378617,0.916042,0.787203,1.0,-0.01324,0.145879,-0.21641,-0.1173,0.975345,0.823986,0.879637,0.928268,0.837444
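
To see the two extreme cases, the sketch below builds made-up variables (`x`, `y_up` and `y_down` are hypothetical) that move exactly together or exactly in opposite directions. `.corr()` uses Pearson's correlation by default, which can also be computed by hand as a check:

```python
import numpy as np
import pandas as pd

# Hypothetical data.
x = pd.Series([1.0, 2.0, 3.0, 4.0])
y_up = 2 * x + 1   # increases whenever x increases
y_down = -x        # decreases whenever x increases

r_up = x.corr(y_up)      # close to 1
r_down = x.corr(y_down)  # close to -1

# Pearson's r by hand: sum of products of deviations from the mean,
# divided by the product of the square roots of the sums of squared deviations.
dx = x - x.mean()
dy = y_up - y_up.mean()
r_manual = (dx * dy).sum() / (np.sqrt((dx ** 2).sum()) * np.sqrt((dy ** 2).sum()))

print(r_up, r_down, r_manual)
```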

