
## Selection by label
To get a cross-section of the data using a single row label:

In [0]:
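import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Ten consecutive daily timestamps starting 2019-01-01, used as the row index
dates = pd.date_range('20190101', periods=10)
# 10x4 frame of standard-normal draws; no seed is set, so values differ between runs
df = pd.DataFrame(np.random.randn(10, 4),
                  index=dates,
                  columns=list('ABCD'))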


In [0]:
dates[0]

Timestamp('2019-01-01 00:00:00', freq='D')

In [0]:
df.loc[dates[0]]

A   -1.158573
B   -0.941688
C    2.568583
D    0.494481
Name: 2019-01-01 00:00:00, dtype: float64
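
As a small, self-contained sketch of the same idea, the cell below builds a hand-written frame (the name `demo` and its values are made up for illustration) and shows that a single row label returns a Series indexed by the column names, while a label missing from the index raises `KeyError`.

```python
import pandas as pd

# Hypothetical demo frame; values are chosen by hand so the result is predictable.
demo = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]},
    index=pd.date_range("20190101", periods=3),
)

row = demo.loc[pd.Timestamp("2019-01-02")]  # one row label -> Series
print(type(row).__name__)  # Series
print(list(row.index))     # ['A', 'B']
print(row["B"])            # 20.0

# A label that is not in the index raises KeyError instead of returning NaN.
try:
    demo.loc[pd.Timestamp("2020-01-01")]
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)      # True
```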

Selecting on a multi-axis by label:

In [0]:
df.loc[:, ['A','D']]

Unnamed: 0,A,D
2019-01-01,-1.158573,0.494481
2019-01-02,-0.683202,-0.18701
2019-01-03,0.564102,-0.146018
2019-01-04,-0.939398,0.034436
2019-01-05,1.135158,-2.690404
2019-01-06,-0.953653,2.677651
2019-01-07,1.677311,0.46266
2019-01-08,-0.612544,0.027003
2019-01-09,1.10621,-0.057844
2019-01-10,0.16568,-0.706584


With label slicing, both endpoints are *included*:

In [0]:
df.loc['20190102':'20190104', ['A', 'B']]

Unnamed: 0,A,B
2019-01-02,-0.683202,0.056144
2019-01-03,0.564102,1.448998
2019-01-04,-0.939398,2.276844
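
To make the endpoint rule concrete, here is a minimal sketch on a hypothetical five-row frame (`demo` is an illustrative name, not part of the notebook's data). It compares a label slice with `.loc` against a positional slice with `.iloc`.

```python
import pandas as pd

# Hypothetical frame with five daily rows.
demo = pd.DataFrame({"A": range(5)}, index=pd.date_range("20190101", periods=5))

by_label = demo.loc["20190102":"20190104"]  # stop label is included
by_position = demo.iloc[1:3]                # stop position is excluded

print(len(by_label))     # 3
print(len(by_position))  # 2
```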


Passing a single row label reduces the dimensions of the returned object (a Series instead of a DataFrame):

In [0]:
df.loc['20190102',['A','B']]

A   -0.683202
B    0.056144
Name: 2019-01-02 00:00:00, dtype: float64
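
The sketch below, on a hypothetical two-row frame named `demo`, contrasts a scalar row label with a one-element list of labels. Wrapping the label in a list is one way to keep the result two-dimensional.

```python
import pandas as pd

# Hypothetical frame; names and values are for illustration only.
demo = pd.DataFrame(
    {"A": [1, 2], "B": [3, 4]},
    index=pd.date_range("20190101", periods=2),
)
day = pd.Timestamp("2019-01-02")

as_series = demo.loc[day, ["A", "B"]]   # scalar row label -> Series
as_frame = demo.loc[[day], ["A", "B"]]  # list of row labels -> DataFrame

print(as_series.shape)  # (2,)
print(as_frame.shape)   # (1, 2)
```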

For getting a scalar value:

In [0]:
df.loc[dates[0], 'A']


-1.1585727488744095
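
For scalar lookups, pandas also provides the `.at` accessor, which takes exactly one row label and one column label. A minimal sketch on a hypothetical frame (`demo` is an assumed name) shows that it returns the same value as `.loc`:

```python
import pandas as pd

# Hypothetical two-row frame for illustration.
demo = pd.DataFrame({"A": [1.5, 2.5]}, index=pd.date_range("20190101", periods=2))
first = demo.index[0]

via_loc = demo.loc[first, "A"]  # general label-based indexer
via_at = demo.at[first, "A"]    # scalar-only label-based accessor

print(via_loc, via_at)  # 1.5 1.5
```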

## Selection by position
Select a row by passing its integer position:

In [0]:
df.iloc[2]

A    0.564102
B    1.448998
C    1.036779
D   -0.146018
Name: 2019-01-03 00:00:00, dtype: float64

By integer slices, which behave like NumPy/Python slices (the stop position is excluded):

In [0]:
df.iloc[2:4, 0:2]

Unnamed: 0,A,B
2019-01-03,0.564102,1.448998
2019-01-04,-0.939398,2.276844
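
One consequence of following Python slice semantics is sketched below on a hypothetical frame (`demo` is an illustrative name): a slice that runs past the end is truncated silently, whereas a single out-of-range position raises `IndexError`.

```python
import pandas as pd

# Hypothetical five-row frame.
demo = pd.DataFrame({"A": range(5), "B": range(5, 10)})

print(demo.iloc[2:4, 0:2].shape)  # (2, 2)
print(demo.iloc[3:100].shape)     # (2, 2): the slice stops at the last row

try:
    demo.iloc[100]
    out_of_bounds_raises = False
except IndexError:
    out_of_bounds_raises = True
print(out_of_bounds_raises)       # True
```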


By lists of integer positions, similar to NumPy/Python style:

In [0]:
df.iloc[[1,2,4],[0,2]]
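
Because `df` holds random values, the result of the cell above changes from run to run. The sketch below repeats the same call on a hypothetical frame filled with `0..19` (the name `demo` is an assumption), so the picked cells are easy to verify by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical 5x4 frame whose cell values are 0..19, row by row.
demo = pd.DataFrame(np.arange(20).reshape(5, 4), columns=list("ABCD"))

picked = demo.iloc[[1, 2, 4], [0, 2]]  # rows 1, 2, 4 and columns 0, 2
print(picked.values.tolist())  # [[4, 6], [8, 10], [16, 18]]
print(list(picked.columns))    # ['A', 'C']
```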

For slicing rows explicitly:

In [0]:
df.iloc[1:3, :]

For slicing columns explicitly:

In [0]:
df.iloc[:,1:3]

For getting a value explicitly:

In [0]:
df.iloc[1,1]
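
The three calls above can be checked on deterministic data. This is a minimal sketch using a hypothetical frame `demo` filled with `0..19`; it also shows `.iat`, the scalar-only positional accessor in pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical 5x4 frame whose cell values are 0..19, row by row.
demo = pd.DataFrame(np.arange(20).reshape(5, 4), columns=list("ABCD"))

rows = demo.iloc[1:3, :]   # rows 1 and 2, all columns
cols = demo.iloc[:, 1:3]   # all rows, columns 1 and 2
value = demo.iloc[1, 1]    # single cell
same_value = demo.iat[1, 1]  # scalar-only positional accessor

print(rows.shape)         # (2, 4)
print(cols.shape)         # (5, 2)
print(value, same_value)  # 5 5
```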


## Boolean indexing
Using a single column's values to select data:

In [0]:
mask = df['A'] > 0

In [0]:
mask.head()

2019-01-01    False
2019-01-02    False
2019-01-03     True
2019-01-04    False
2019-01-05     True
Freq: D, Name: A, dtype: bool

In [0]:
df[mask]  # keeps rows where mask is True; df[~mask] would keep the other rows

Unnamed: 0,A,B,C,D
2019-01-03,0.564102,1.448998,1.036779,-0.146018
2019-01-05,1.135158,0.720135,-0.771679,-2.690404
2019-01-07,1.677311,-0.572894,0.405523,0.46266
2019-01-09,1.10621,-0.385144,1.194672,-0.057844
2019-01-10,0.16568,0.786776,-1.079196,-0.706584
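
A short sketch of negating and combining masks, on a hypothetical frame named `demo`. Note that conditions are combined with `&` and `|` (not `and`/`or`), and each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical frame with hand-picked signs.
demo = pd.DataFrame({"A": [-1.0, 2.0, -3.0, 4.0], "B": [1.0, -2.0, -3.0, 4.0]})

mask = demo["A"] > 0
positive_a = demo[mask]       # rows where A > 0
non_positive_a = demo[~mask]  # ~ inverts the mask
both_positive = demo[(demo["A"] > 0) & (demo["B"] > 0)]

print(list(positive_a.index))      # [1, 3]
print(list(non_positive_a.index))  # [0, 2]
print(list(both_positive.index))   # [3]
```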


Selecting values from a DataFrame where a boolean condition is met; entries that fail the condition become NaN:

In [0]:
df[df > 0]

Unnamed: 0,A,B,C,D
2019-01-01,,,2.568583,0.494481
2019-01-02,,0.056144,0.677926,
2019-01-03,0.564102,1.448998,1.036779,
2019-01-04,,2.276844,,0.034436
2019-01-05,1.135158,0.720135,,
2019-01-06,,0.296899,,2.677651
2019-01-07,1.677311,,0.405523,0.46266
2019-01-08,,,,0.027003
2019-01-09,1.10621,,1.194672,
2019-01-10,0.16568,0.786776,,
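
Indexing a frame with a boolean frame keeps the original shape and fills the failing cells with NaN, which matches what `DataFrame.where` does. A minimal sketch on a hypothetical frame `demo`:

```python
import pandas as pd

# Hypothetical frame with a mix of signs.
demo = pd.DataFrame({"A": [-1.0, 2.0], "B": [3.0, -4.0]})

by_indexing = demo[demo > 0]
by_where = demo.where(demo > 0)

print(by_indexing.shape)             # (2, 2): shape is preserved
print(by_indexing.isna().sum().sum())  # 2 cells failed the condition
print(by_indexing.equals(by_where))  # True
```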


Using the `isin()` method for filtering:

In [0]:
df2 = df.copy()

In [0]:
df2['E'] = ['one','one','two','three','four','three','two','three','four','four']

In [0]:
df2

Unnamed: 0,A,B,C,D,E
2019-01-01,-1.158573,-0.941688,2.568583,0.494481,one
2019-01-02,-0.683202,0.056144,0.677926,-0.18701,one
2019-01-03,0.564102,1.448998,1.036779,-0.146018,two
2019-01-04,-0.939398,2.276844,-0.400286,0.034436,three
2019-01-05,1.135158,0.720135,-0.771679,-2.690404,four
2019-01-06,-0.953653,0.296899,-1.022055,2.677651,three
2019-01-07,1.677311,-0.572894,0.405523,0.46266,two
2019-01-08,-0.612544,-1.102798,-0.829374,0.027003,three
2019-01-09,1.10621,-0.385144,1.194672,-0.057844,four
2019-01-10,0.16568,0.786776,-1.079196,-0.706584,four


In [0]:
df2[df2['E'].isin(['two', 'four'])]


Unnamed: 0,A,B,C,D,E
2019-01-03,0.564102,1.448998,1.036779,-0.146018,two
2019-01-05,1.135158,0.720135,-0.771679,-2.690404,four
2019-01-07,1.677311,-0.572894,0.405523,0.46266,two
2019-01-09,1.10621,-0.385144,1.194672,-0.057844,four
2019-01-10,0.16568,0.786776,-1.079196,-0.706584,four
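
Since `isin()` returns a boolean Series, it can be negated with `~` to exclude the listed values instead. A small sketch on a hypothetical frame `demo`:

```python
import pandas as pd

# Hypothetical frame with a categorical-like text column.
demo = pd.DataFrame({"E": ["one", "two", "three", "four"], "x": [1, 2, 3, 4]})

wanted = ["two", "four"]
kept = demo[demo["E"].isin(wanted)]       # rows whose E is in the list
excluded = demo[~demo["E"].isin(wanted)]  # rows whose E is not in the list

print(kept["E"].tolist())      # ['two', 'four']
print(excluded["E"].tolist())  # ['one', 'three']
```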


## Setting
Setting a new column automatically aligns the data by index:

In [0]:
s1 = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
               index=pd.date_range('20190101', periods=10))

In [0]:
s1

2019-01-01     1
2019-01-02     2
2019-01-03     3
2019-01-04     4
2019-01-05     5
2019-01-06     6
2019-01-07     7
2019-01-08     8
2019-01-09     9
2019-01-10    10
Freq: D, dtype: int64

In [0]:
df['F'] = s1

In [0]:
df

Unnamed: 0,A,B,C,D,F
2019-01-01,-1.158573,-0.941688,2.568583,0.494481,1
2019-01-02,-0.683202,0.056144,0.677926,-0.18701,2
2019-01-03,0.564102,1.448998,1.036779,-0.146018,3
2019-01-04,-0.939398,2.276844,-0.400286,0.034436,4
2019-01-05,1.135158,0.720135,-0.771679,-2.690404,5
2019-01-06,-0.953653,0.296899,-1.022055,2.677651,6
2019-01-07,1.677311,-0.572894,0.405523,0.46266,7
2019-01-08,-0.612544,-1.102798,-0.829374,0.027003,8
2019-01-09,1.10621,-0.385144,1.194672,-0.057844,9
2019-01-10,0.16568,0.786776,-1.079196,-0.706584,10
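
In the example above the Series and the frame share the same index, so the alignment is invisible. The sketch below uses a hypothetical frame `demo` and a Series `shifted` whose index starts one day later, to show that values are matched by label: rows with no matching label get NaN, and Series labels absent from the frame are dropped.

```python
import pandas as pd

# Hypothetical frame covering 2019-01-01 .. 2019-01-03.
demo = pd.DataFrame({"A": [1.0, 2.0, 3.0]},
                    index=pd.date_range("20190101", periods=3))
# Hypothetical Series covering 2019-01-02 .. 2019-01-04.
shifted = pd.Series([10, 20, 30], index=pd.date_range("20190102", periods=3))

demo["F"] = shifted  # aligned on the index labels, not on position

print(demo["F"].isna().tolist())  # [True, False, False]
print(demo["F"].tolist()[1:])     # [10.0, 20.0]
print(len(demo))                  # 3: the 2019-01-04 value is not added
```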


Setting values by label (here, every column of the first row):

In [0]:
df.loc[dates[0]] = 0

Setting values by position:

In [0]:
df.iloc[0,1] = 0

Setting by assigning with a NumPy array:

In [0]:
df.loc[:, 'D'] = np.array([5] * len(df))

The result of the prior setting operations:

In [0]:
df
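
The same three setting operations, replayed on a hypothetical deterministic frame (`demo`, filled with `0.0..11.0`), make the effect of each step easy to check:

```python
import numpy as np
import pandas as pd

# Hypothetical 3x4 float frame whose cell values are 0.0..11.0, row by row.
dates_demo = pd.date_range("20190101", periods=3)
demo = pd.DataFrame(np.arange(12, dtype=float).reshape(3, 4),
                    index=dates_demo, columns=list("ABCD"))

demo.loc[dates_demo[0]] = 0                       # by label: whole first row
demo.iloc[0, 1] = 0                               # by position: row 0, column 1
demo.loc[:, "D"] = np.array([5] * len(demo))      # whole column from an array

print(demo.iloc[0].tolist())  # [0.0, 0.0, 0.0, 5.0]
print(demo["D"].tolist())     # [5.0, 5.0, 5.0]
print(demo.iloc[1, 0])        # 4.0: untouched cells keep their values
```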

A `where` operation with setting, which flips the sign of every positive value:

In [0]:
df2 = df.copy()

In [0]:
df2

Unnamed: 0,A,B,C,D,F
2019-01-01,0.0,0.0,0.0,5,0
2019-01-02,-0.683202,0.056144,0.677926,5,2
2019-01-03,0.564102,1.448998,1.036779,5,3
2019-01-04,-0.939398,2.276844,-0.400286,5,4
2019-01-05,1.135158,0.720135,-0.771679,5,5
2019-01-06,-0.953653,0.296899,-1.022055,5,6
2019-01-07,1.677311,-0.572894,0.405523,5,7
2019-01-08,-0.612544,-1.102798,-0.829374,5,8
2019-01-09,1.10621,-0.385144,1.194672,5,9
2019-01-10,0.16568,0.786776,-1.079196,5,10


In [0]:
df2[df2 > 0] = -df2

In [0]:
df2


Unnamed: 0,A,B,C,D,F
2019-01-01,0.0,0.0,0.0,-5,0
2019-01-02,-0.683202,-0.056144,-0.677926,-5,-2
2019-01-03,-0.564102,-1.448998,-1.036779,-5,-3
2019-01-04,-0.939398,-2.276844,-0.400286,-5,-4
2019-01-05,-1.135158,-0.720135,-0.771679,-5,-5
2019-01-06,-0.953653,-0.296899,-1.022055,-5,-6
2019-01-07,-1.677311,-0.572894,-0.405523,-5,-7
2019-01-08,-0.612544,-1.102798,-0.829374,-5,-8
2019-01-09,-1.10621,-0.385144,-1.194672,-5,-9
2019-01-10,-0.16568,-0.786776,-1.079196,-5,-10
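
The in-place assignment above can also be written without mutation using `DataFrame.mask`, which replaces values where the condition is True. This is a sketch on a hypothetical frame `demo`, not the notebook's own approach:

```python
import pandas as pd

# Hypothetical frame with positive, negative, and zero values.
demo = pd.DataFrame({"A": [1.0, -2.0, 0.0], "B": [-3.0, 4.0, 5.0]})

# In-place version, as in the cell above.
flipped = demo.copy()
flipped[flipped > 0] = -flipped

# Non-mutating version: where the condition holds, take the value from -demo.
via_mask = demo.mask(demo > 0, -demo)

print(flipped["A"].tolist())     # [-1.0, -2.0, 0.0]
print(flipped["B"].tolist())     # [-3.0, -4.0, -5.0]
print(flipped.equals(via_mask))  # True
```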
