DataFrame - Conditional selection, set_index

In [4]:
import numpy as np
import pandas as pd

In [5]:
from numpy.random import randn
np.random.seed(101)

In [6]:
df = pd.DataFrame(np.random.randn(5, 4), index='A B C D E'.split(), columns='W X Y Z'.split())

In [7]:
df

,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


A DataFrame behaves much like a NumPy array: a comparison is applied element by element and returns a DataFrame of booleans

In [8]:
df > 0

,W,X,Y,Z
A,True,True,True,True
B,True,False,False,True
C,False,True,True,False
D,True,False,False,True
E,True,True,True,True
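
To make the comparison with NumPy concrete, the sketch below applies the same `> 0` test to a plain array and to a DataFrame built from it. The `values` array and the names `small`, `arr_mask` and `df_mask` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Illustrative data (not the random values used above)
values = np.array([[1.5, -0.3], [-2.0, 0.7]])
small = pd.DataFrame(values, index=['A', 'B'], columns=['W', 'X'])

arr_mask = values > 0   # NumPy: boolean array with the same shape
df_mask = small > 0     # pandas: boolean DataFrame with the same shape and labels

# The underlying booleans are identical; pandas only adds the labels
print(np.array_equal(arr_mask, df_mask.to_numpy()))
```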


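Indexing with the boolean DataFrame also works much as in NumPy: values where the mask is True are kept, and the rest become NaN (the empty cells below)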

In [9]:
bol = df>0

In [10]:
df[bol]

,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,,,0.605965
C,,0.740122,0.528813,
D,0.188695,,,0.955057
E,0.190794,1.978757,2.605967,0.683509
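
Indexing with a boolean DataFrame gives the same result as calling `DataFrame.where` with that mask, which may read more explicitly. A minimal sketch, using illustrative data and the hypothetical names `small`, `mask`, `masked` and `same`:

```python
import pandas as pd

small = pd.DataFrame({'W': [1.5, -2.0], 'X': [-0.3, 0.7]}, index=['A', 'B'])
mask = small > 0

masked = small[mask]       # cells where the mask is False become NaN
same = small.where(mask)   # the same operation, spelled out

print(masked.equals(same))
print(masked.isna().sum().sum())   # number of cells replaced by NaN
```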


Filtering on column W: rows where W is not greater than 0 are dropped, which in this case excludes row C

In [11]:
df[df['W']>0]

,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


Returns the values of column Y for the rows where column W is greater than 0

In [12]:
df[df['W']>0]['Y']

A    0.907969
B   -0.848077
D   -0.933237
E    2.605967
Name: Y, dtype: float64

Equivalent to the code above, split into separate steps

In [14]:
bol  = df['W']>0
df2 = df[bol]
df2['Y']

A    0.907969
B   -0.848077
D   -0.933237
E    2.605967
Name: Y, dtype: float64
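
The same selection can also be written in one step with `.loc`, passing the row mask and the column label together. This is a sketch on illustrative data; `small`, `chained` and `single` are hypothetical names.

```python
import pandas as pd

small = pd.DataFrame(
    {'W': [1.5, -2.0, 0.2], 'Y': [0.9, 0.5, -0.9]},
    index=['A', 'B', 'C'],
)

chained = small[small['W'] > 0]['Y']      # two steps: filter rows, then pick Y
single = small.loc[small['W'] > 0, 'Y']   # one step: row mask, column label

print(chained.equals(single))
```

The single `.loc` call is usually the safer form when the selection will later be assigned to, since it avoids working on an intermediate copy.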

"and" only compares single values; it cannot handle a Series, so this raises an error

In [24]:
df[(df['W']>0) and (df['Y']>1)]

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
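
The error comes from Python itself: `and` and `or` first ask each operand for a single True/False value, and pandas refuses to reduce a multi-element Series to one boolean. A minimal sketch with an illustrative Series `s` (the names `cond` and `raised` are also hypothetical):

```python
import pandas as pd

s = pd.Series([1.5, -2.0, 0.2])
cond = s > 0                 # Series of booleans, one per element

try:
    bool(cond)               # what `and` / `or` do to each operand
    raised = False
except ValueError:
    raised = True

print(raised)
# Reducing explicitly is allowed, because it states which answer is wanted
print(cond.any(), cond.all())
```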

Using the "&" operator symbol instead makes the element-wise comparison work

In [25]:
df[(df['W']>0) & (df['Y']>1)]

,W,X,Y,Z
E,0.190794,1.978757,2.605967,0.683509


The same applies to "or": use "|" instead

In [26]:
df[(df['W']>0) or (df['Y']>1)]


ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [27]:
df[(df['W']>0) | (df['Y']>1)]

,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509
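
Two details are easy to trip over when combining conditions this way. The parentheses are needed because `&` and `|` bind more tightly than comparisons such as `>`, and negation is written `~` rather than `not`. A small sketch on illustrative data (`small`, `both`, `either` and `neither` are hypothetical names):

```python
import pandas as pd

small = pd.DataFrame(
    {'W': [1.5, -2.0, 0.2, -1.0], 'Y': [0.9, 1.5, 2.6, 0.5]},
    index=['A', 'B', 'C', 'D'],
)

w_pos = small['W'] > 0
y_big = small['Y'] > 1

both = small[w_pos & y_big]          # rows satisfying both conditions
either = small[w_pos | y_big]        # rows satisfying at least one
neither = small[~(w_pos | y_big)]    # rows satisfying none

print(list(both.index), list(either.index), list(neither.index))
```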


In [28]:
df

,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


reset_index() resets the index to 0, 1, 2, ... and moves the old index into a column named index, but called this way the result is not saved

In [29]:
df.reset_index()

,index,W,X,Y,Z
0,A,2.70685,0.628133,0.907969,0.503826
1,B,0.651118,-0.319318,-0.848077,0.605965
2,C,-2.018168,0.740122,0.528813,-0.589001
3,D,0.188695,-0.758872,-0.933237,0.955057
4,E,0.190794,1.978757,2.605967,0.683509


In [30]:
df

,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


Using "inplace=True", the change is applied to the original DataFrame

In [31]:
df.reset_index(inplace=True)

In [32]:
df

,index,W,X,Y,Z
0,A,2.70685,0.628133,0.907969,0.503826
1,B,0.651118,-0.319318,-0.848077,0.605965
2,C,-2.018168,0.740122,0.528813,-0.589001
3,D,0.188695,-0.758872,-0.933237,0.955057
4,E,0.190794,1.978757,2.605967,0.683509
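
An alternative to `inplace=True` is to assign the returned DataFrame to a name, which keeps the original available. The sketch below also uses the `drop=True` parameter of `reset_index`, which discards the old index instead of turning it into a column. The data and the names `small`, `kept` and `dropped` are illustrative.

```python
import pandas as pd

small = pd.DataFrame({'W': [1.5, -2.0]}, index=['A', 'B'])

kept = small.reset_index()               # old index becomes a column named 'index'
dropped = small.reset_index(drop=True)   # old index is discarded

print(list(kept.columns))
print(list(dropped.columns))
print(list(small.index))                 # the original is untouched
```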


Define a column holding Brazilian states

In [33]:
col = 'RS RJ SP AM SC'.split()

In [34]:
col

['RS', 'RJ', 'SP', 'AM', 'SC']

In [38]:
df['Estado'] = col

In [39]:
df

,index,W,X,Y,Z,estado,Estado
0,A,2.70685,0.628133,0.907969,0.503826,RS,RS
1,B,0.651118,-0.319318,-0.848077,0.605965,RJ,RJ
2,C,-2.018168,0.740122,0.528813,-0.589001,SP,SP
3,D,0.188695,-0.758872,-0.933237,0.955057,AM,AM
4,E,0.190794,1.978757,2.605967,0.683509,SC,SC
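
Assigning a list as a new column works here because the list has one value per row. As a hedged illustration of that requirement, the sketch below shows that a list of the wrong length is rejected. The data and the column name `Short` are hypothetical.

```python
import pandas as pd

small = pd.DataFrame({'W': [1.5, -2.0, 0.2]})

small['Estado'] = ['RS', 'RJ', 'SP']   # 3 values for 3 rows: accepted

try:
    small['Short'] = ['RS', 'RJ']      # 2 values for 3 rows: rejected
    failed = False
except ValueError:
    failed = True

print(failed)
print(list(small.columns))
```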


Use Estado as the index of the DataFrame

In [40]:
df.set_index('Estado')

Estado,index,W,X,Y,Z,estado
RS,A,2.70685,0.628133,0.907969,0.503826,RS
RJ,B,0.651118,-0.319318,-0.848077,0.605965,RJ
SP,C,-2.018168,0.740122,0.528813,-0.589001,SP
AM,D,0.188695,-0.758872,-0.933237,0.955057,AM
SC,E,0.190794,1.978757,2.605967,0.683509,SC


In [41]:
df

,index,W,X,Y,Z,estado,Estado
0,A,2.70685,0.628133,0.907969,0.503826,RS,RS
1,B,0.651118,-0.319318,-0.848077,0.605965,RJ,RJ
2,C,-2.018168,0.740122,0.528813,-0.589001,SP,SP
3,D,0.188695,-0.758872,-0.933237,0.955057,AM,AM
4,E,0.190794,1.978757,2.605967,0.683509,SC,SC


"inplace=True" to apply the change to the original

In [42]:
df.set_index('Estado', inplace=True)

In [43]:
df

Estado,index,W,X,Y,Z,estado
RS,A,2.70685,0.628133,0.907969,0.503826,RS
RJ,B,0.651118,-0.319318,-0.848077,0.605965,RJ
SP,C,-2.018168,0.740122,0.528813,-0.589001,SP
AM,D,0.188695,-0.758872,-0.933237,0.955057,AM
SC,E,0.190794,1.978757,2.605967,0.683509,SC
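
Once a column is the index, its values can be used as row labels with `.loc`. Note that, unlike `reset_index`, `set_index` does not preserve the previous index: it is replaced. A minimal sketch on illustrative data, with the hypothetical names `small` and `by_state`:

```python
import pandas as pd

small = pd.DataFrame({'W': [1.5, -2.0, 0.2], 'Estado': ['RS', 'RJ', 'SP']})

by_state = small.set_index('Estado')   # returns a new DataFrame

print(by_state.loc['RJ', 'W'])         # look up a value by state label
print('Estado' in by_state.columns)    # the column has moved into the index
print(by_state.index.name)
```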
