http://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html


where and mask
query - mention


In [23]:
import seaborn as sns
import pandas as pd
import numpy as np

# Basics

First, let's recap traditional indexing and selection. (We went over most of this in the intro video.)

In [25]:
tips = sns.load_dataset('tips')
tips.head(3)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3


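Next, let's go over indexing and selecting with DataFrames. There are basically four ways to do so: selecting columns by name, slicing rows, label-based selection with .loc, and position-based selection with .iloc.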

In [29]:
# get one or more columns (a list of names returns a DataFrame)
tips[['total_bill', 'tip']].head()

Unnamed: 0,total_bill,tip
0,16.99,1.01
1,10.34,1.66
2,21.01,3.5
3,23.68,3.31
4,24.59,3.61


In [34]:
# get some rows
tips[3:5]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [31]:
# select rows and columns by label (both endpoints are included)
tips.loc[2:4, 'sex':'smoker']

Unnamed: 0,sex,smoker
2,Male,No
3,Male,No
4,Female,No


In [32]:
# select rows and columns by integer position (the end of each slice is excluded)
tips.iloc[1:3, 0:2]

Unnamed: 0,total_bill,tip
1,10.34,1.66
2,21.01,3.5
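
One detail from the last two cells is worth spelling out: `.loc` slices by label and includes both endpoints, while `.iloc` and plain `[]` row slices work by position and exclude the end. Here is a minimal sketch of the difference. The `demo` frame is an assumption: a hand-typed copy of the first five rows shown above, so that the example runs without downloading the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the first five rows of tips
demo = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59],
    'tip': [1.01, 1.66, 3.50, 3.31, 3.61],
    'sex': ['Female', 'Male', 'Male', 'Male', 'Female'],
})

by_label = demo.loc[2:4, 'total_bill':'tip']  # labels 2, 3 and 4 (end included)
by_position = demo.iloc[2:4, 0:2]             # positions 2 and 3 (end excluded)
by_slice = demo[2:4]                          # plain row slices are positional too

print(len(by_label), len(by_position), len(by_slice))  # prints: 3 2 2
```

With the default integer index the labels and positions happen to coincide, which is exactly why the off-by-one between the two is easy to miss.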


But this is just the tip (well, actually it's most of the iceberg). Let's go over some of the more advanced bits of indexing.

# Multi-index

You might not think you'd need a MultiIndex, but it turns out to be a rather frequent use case. The initial idea was to provide a framework for working with higher-dimensional data (and thus a replacement for panels), but because several common operations produce one, it became quite commonplace. Let's do an example below.

In almost all cases a MultiIndex comes from a groupby (you will almost never construct one or read one in yourself).



In [12]:
mi_tips = tips.groupby(['sex', 'smoker']).total_bill.mean()
mi_tips

sex     smoker
Male    Yes       22.284500
        No        19.791237
Female  Yes       17.977879
        No        18.105185
Name: total_bill, dtype: float64

Ultimately there are a ton of operations that you can do on top of this type of data. But the way that I have always dealt with it is simply by resetting the index.

In [13]:
mi_tips.reset_index()

Unnamed: 0,sex,smoker,total_bill
0,Male,Yes,22.2845
1,Male,No,19.791237
2,Female,Yes,17.977879
3,Female,No,18.105185


Notice that the index levels are now ordinary columns, with their values repeated on every row. That makes it easy to select only the smokers:

In [15]:
ri_tips = mi_tips.reset_index()
ri_tips[ri_tips['smoker'] == 'Yes']

Unnamed: 0,sex,smoker,total_bill
0,Male,Yes,22.2845
2,Female,Yes,17.977879
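
If you'd rather not reset the index, the same selection can be made on the MultiIndex directly. This is a small sketch, assuming a Series shaped like `mi_tips`; the name `mi_demo` and the hand-typed values (copied from the output above) are only there so the example runs on its own.

```python
import pandas as pd

# Hypothetical copy of the grouped result shown above
index = pd.MultiIndex.from_tuples(
    [('Male', 'Yes'), ('Male', 'No'), ('Female', 'Yes'), ('Female', 'No')],
    names=['sex', 'smoker'],
)
mi_demo = pd.Series([22.284500, 19.791237, 17.977879, 18.105185],
                    index=index, name='total_bill')

# Cross-section: every row whose 'smoker' level equals 'Yes'
# (the 'smoker' level is dropped from the result)
smokers = mi_demo.xs('Yes', level='smoker')

# A full key (one label per level) returns a single value
male_smoker = mi_demo.loc[('Male', 'Yes')]

print(smokers)
print(male_smoker)
```

Whether this is nicer than resetting the index is a matter of taste; `xs` saves a step, while the reset version reads like ordinary column filtering.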


Of course, you can also move just one index level out into a column:

In [16]:
ri0_tips = mi_tips.reset_index(level=0)
ri0_tips.loc['Yes']

smoker,sex,total_bill
Yes,Male,22.2845
Yes,Female,17.977879


And finally, you can pull columns back into the index (basically only useful for certain types of joins).

In [17]:
ri_tips.set_index(['sex', 'smoker'])

sex,smoker,total_bill
Male,Yes,22.2845
Male,No,19.791237
Female,Yes,17.977879
Female,No,18.105185


In [21]:
ri0_tips.set_index('sex', append=True)

smoker,sex,total_bill
Yes,Male,22.2845
No,Male,19.791237
Yes,Female,17.977879
No,Female,18.105185
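
To make the remark about joins a bit more concrete, here is a rough sketch of joining two frames that share a `(sex, smoker)` index. Everything in it is made up for illustration: the `raw` frame, its numbers, and the names `bills` and `tip_frame` are hypothetical and not taken from the tips data.

```python
import pandas as pd

# Hypothetical miniature data set, one row per (sex, smoker) pair
raw = pd.DataFrame({
    'sex': ['Male', 'Male', 'Female', 'Female'],
    'smoker': ['Yes', 'No', 'Yes', 'No'],
    'total_bill': [20.0, 10.0, 30.0, 40.0],
    'tip': [2.0, 1.0, 3.0, 4.0],
})

# set_index moves the key columns into a two-level index
bills = raw[['sex', 'smoker', 'total_bill']].set_index(['sex', 'smoker'])
tip_frame = raw[['sex', 'smoker', 'tip']].set_index(['sex', 'smoker'])

# join aligns rows on the shared (sex, smoker) index
joined = bills.join(tip_frame)
print(joined)
```

The same result is available through `merge` on the plain columns, so this is mostly a convenience when the frames already carry the index.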


# Modifications

The next little indexing trick is mostly about speed: getting and setting single values with .at (by label) and .iat (by position). It is pretty simple:

In [37]:
tips.head(3)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3


In [38]:
tips.at[0, 'total_bill'] = 6
tips.head(3)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,6.0,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3


In [39]:
tips.iat[0, 0]

6.0

If you are modifying single values of a DataFrame, you should use .at and .iat. They are faster, and they make it explicit that you are touching exactly one cell (modifying data in other ways can often result in odd errors).

So just to prove it's faster let's time it!

In [43]:
%%timeit
tips.at[0, 'total_bill'] = 6

6.36 µs ± 219 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [44]:
%%timeit
tips['total_bill'][0] = 6

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


122 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
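
Note that the slow cell above is a chained assignment, which is what triggers the warning, so its timing likely includes the cost of the warning itself. A fairer comparison, sketched below, times `.at` against `.loc`, the usual warning-free way to set values. The `demo` frame is a hypothetical stand-in for `tips`, the loop count of 1000 is arbitrary, and the numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Hypothetical stand-in for tips
demo = pd.DataFrame({'total_bill': [16.99, 10.34, 21.01]})


def set_with_at():
    # scalar access by label
    demo.at[0, 'total_bill'] = 6.0


def set_with_loc():
    # general label-based indexer, also free of chained assignment
    demo.loc[0, 'total_bill'] = 6.0


n_calls = 1000
at_seconds = timeit.timeit(set_with_at, number=n_calls)
loc_seconds = timeit.timeit(set_with_loc, number=n_calls)

print(f".at:  {at_seconds / n_calls * 1e6:.1f} µs per call")
print(f".loc: {loc_seconds / n_calls * 1e6:.1f} µs per call")
```

Either way, both forms assign through a single indexer on the frame itself, which is what avoids the copy-versus-view ambiguity.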


# Where, Masks and Queries

These are built into pandas, but I have personally never used them, mostly because they are largely redundant with boolean indexing and the need for them doesn't come up too often.

They are a bit faster, yes. But the mental space is probably not worth it. So if you wanna learn it, go for it. If not, probably won't matter.

Let me show you how you'd duplicate the where/mask functionality with plain boolean indexing below.

In [52]:
df = pd.DataFrame(np.random.randn(25).reshape((5, 5)))
df.head()

Unnamed: 0,0,1,2,3,4
0,0.480753,-1.059331,1.658786,-0.947911,0.959558
1,0.18578,-0.125483,-0.81708,0.001897,-1.063209
2,-2.022341,0.01342,0.288912,0.196264,2.067028
3,0.388568,0.754839,-0.162831,-0.120153,0.459455
4,-0.999674,0.034427,-0.60752,0.085853,-1.265579


In [53]:
df.where(df > 0)

Unnamed: 0,0,1,2,3,4
0,0.480753,,1.658786,,0.959558
1,0.18578,,,0.001897,
2,,0.01342,0.288912,0.196264,2.067028
3,0.388568,0.754839,,,0.459455
4,,0.034427,,0.085853,


In [56]:
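# same result with boolean indexing, but this modifies df in place
# (np.NaN was removed in NumPy 2.0, so use np.nan)
df[df < 0] = np.nan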
df

Unnamed: 0,0,1,2,3,4
0,0.480753,,1.658786,,0.959558
1,0.18578,,,0.001897,
2,,0.01342,0.288912,0.196264,2.067028
3,0.388568,0.754839,,,0.459455
4,,0.034427,,0.085853,
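
For completeness, here is a small sketch of the other two methods named in the heading. `mask` is the inverse of `where` (it blanks out values where the condition is true), and `query` filters rows with a string expression. The `demo` frame and its column names `a` and `b` are made up for the example.

```python
import pandas as pd

# Hypothetical frame with a mix of positive and negative values
demo = pd.DataFrame({'a': [1.0, -2.0, 3.0], 'b': [-1.0, 5.0, -6.0]})

kept_positive = demo.where(demo > 0)    # keep values where the condition holds
masked_negative = demo.mask(demo <= 0)  # hide values where the condition holds

# Both produce the same frame, with NaN in place of the non-positive values
print(kept_positive.equals(masked_negative))  # prints: True

# query: rows where column a is positive, equivalent to demo[demo['a'] > 0]
positive_a = demo.query('a > 0')
print(positive_a)
```

Unlike the in-place boolean assignment shown above, `where` and `mask` return a new frame and leave the original untouched.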


So that's it. This is really all I know about indexing and probably all you'll need to know too. If you've got any questions or comments, please add them!