# Select rows/cols conditionally

***Stick to loc/iloc*** and avoid chained indexing, e.g. ```df[df['colX'] > 0]['colY']```. The chained form performs two separate indexing operations, and the intermediate result may be a copy, so assignments made through it are unreliable.
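
As a minimal sketch of the difference (the frame `demo` and its values are made up for illustration; only the column names `colX`/`colY` come from the text above), the chained form indexes twice, while ```.loc``` takes the row and column indexers in one call:

```python
import pandas as pd

# Hypothetical frame for illustration only.
demo = pd.DataFrame({"colX": [-1, 2, 3], "colY": [10, 20, 30]})

# Chained indexing: a boolean row filter, then a column lookup on the result.
chained = demo[demo["colX"] > 0]["colY"]

# Single .loc call: row indexer and column indexer together.
selected = demo.loc[demo["colX"] > 0, "colY"]

# For reading, both return the same Series; the difference matters on assignment.
print(selected.tolist())
```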

# References

* [Indexing and selecting data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) (MUST)
* [How to deal with SettingWithCopyWarning in Pandas](https://stackoverflow.com/a/53954986/4281353) (MUST)
* [SettingWithCopyWarning in Pandas: Views vs Copies](https://realpython.com/pandas-settingwithcopywarning/)

In [3]:
import numpy as np
import pandas as pd

# Basics

## Label

* [Pandas - what is exactly "label" and where is it defined?](https://stackoverflow.com/questions/70502134/pandas-what-is-exactly-label-and-where-is-it-defined)
* [SettingWithCopyWarning in Pandas: Views vs Copies](https://realpython.com/pandas-settingwithcopywarning/)

<img src="image/pandas_labels.png" align="left"/>



## Index/Indexer as label(s)

Indices identify locations in a dataframe and are defined as combinations of labels.

* [Different choices for indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#different-choices-for-indexing) (MUST)

> Getting values from an object with multi-axes selection uses the following notation (using .loc as an example, but the following applies to .iloc as well). 
> ```
> df.loc[row_indexer,column_indexer]
> ```

> ```.loc``` is primarily label based and accepts the following kinds of **indexing / indexer**:
> * A single label, e.g. 5 or 'a' (Note that 5 is interpreted as a label of the index. This use is not an integer position along the index.).
> * A list or array of labels ['a', 'b', 'c'].
> * A slice object with labels 'a':'f' (Note that contrary to usual Python slices, both the start and the stop are included, when present in the index! See Slicing with labels and Endpoints are inclusive.)
> * A boolean array (any NA values will be treated as False).
> * A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
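
The indexer kinds listed above can be sketched on a small frame. The frame `demo`, its string row labels, and its values are assumptions made for illustration and are not part of the original notebook:

```python
import pandas as pd

# Illustrative frame with string row labels (hypothetical data).
demo = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
    index=["x", "y", "z"],
)

single = demo.loc["y", "A"]                  # single labels -> scalar value
listed = demo.loc[["x", "z"], ["A", "C"]]    # lists of labels -> DataFrame
sliced = demo.loc["x":"y", "A":"B"]          # label slices include both endpoints
masked = demo.loc[demo["A"] > 1, "B"]        # boolean Series as the row indexer
called = demo.loc[lambda d: d["C"] < 9, :]   # callable that returns a boolean Series
```

Note that the label slices keep both `'y'` and `'B'`, unlike a positional Python slice.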


## Example

In [4]:
df = pd.DataFrame(np.random.choice(10, (3, 5)), columns=list('ABCDE'))
df

Unnamed: 0,A,B,C,D,E
0,2,2,9,2,3
1,2,8,4,6,5
2,2,6,8,3,7


In [7]:
# Rows where column B equals 2 (a boolean Series)
row_index = (df['B'] == 2)
# Columns to keep, as a list of labels
col_index = ['A', 'E']

df.loc[row_index, col_index]

Unnamed: 0,A,E
0,2,3


In [16]:
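# Rows where column B is 2 or 6 (a boolean Series)
row_indices = (df['B'].isin([2, 6]))
# Label slice: columns C through E, with E included
col_indices = slice('C', 'E')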

In [17]:
df.loc[row_indices, col_indices]

Unnamed: 0,C,D,E
0,9,2,3
2,8,3,7
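
For comparison, the same selection can be sketched with ```.iloc```. The frame below re-creates the values shown in the example output so the snippet runs on its own; the name `example` is an assumption. ```.iloc``` takes integer positions, its slices exclude the stop position, and a boolean mask has to be passed as a plain array rather than as a Series:

```python
import pandas as pd

# Same values as the example frame shown above, hard-coded for reproducibility.
example = pd.DataFrame(
    {"A": [2, 2, 2], "B": [2, 8, 6], "C": [9, 4, 8], "D": [2, 6, 3], "E": [3, 5, 7]}
)

mask = example["B"].isin([2, 6])

by_label = example.loc[mask, "C":"E"]              # label slice, 'E' included
by_position = example.iloc[mask.to_numpy(), 2:5]   # positional slice, stop excluded
```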


---
# Conditional Update

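**Assigning through ```.loc``` updates the original dataframe, NOT a copy**, because ```df.loc[rows, cols] = value``` is a single set operation on ```df``` itself. DO NOT assign through chained indexing, e.g. ```df[df['column'] == 'x']['columnB'] = value```, as the assignment may land on a temporary copy and pandas emits ```SettingWithCopyWarning```: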

```
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
```
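
A minimal sketch of the two forms, assuming a made-up frame `scores` (its name, columns, and values are illustrative only). The chained assignment is left commented out because it may silently fail to update the frame:

```python
import pandas as pd

# Hypothetical data for illustration.
scores = pd.DataFrame({"name": ["a", "b", "c"], "score": [5, -3, 8]})

# Chained form (not executed): the write may go to a temporary object,
# leaving `scores` unchanged while pandas emits a warning.
# scores[scores["score"] < 0]["score"] = 0

# Single .loc assignment: row mask and column label in one call,
# so `scores` itself is updated.
scores.loc[scores["score"] < 0, "score"] = 0
```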

In [23]:
df_to_update = pd.DataFrame(np.random.choice(10, (3, 5)), columns=list('ABCDE'))
df_to_update

Unnamed: 0,A,B,C,D,E
0,2,2,8,3,3
1,2,8,5,8,0
2,9,1,4,3,4


In [24]:
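# NOTE: the mask is built from `df` (where B is 2, 8, 6), not from `df_to_update`;
# .loc aligns the boolean Series on the row index, so rows 0 and 2 are selected.
row_indices = (df['B'].isin([2,6]))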
col_indices = slice('C', 'E')

df_to_update.loc[row_indices, col_indices] = -1
df_to_update

Unnamed: 0,A,B,C,D,E
0,2,2,-1,-1,-1
1,2,8,5,8,0
2,9,1,-1,-1,-1


---
# Old/Obsolete ways

* [Selecting/excluding sets of columns in pandas](https://stackoverflow.com/a/51601986/4281353)

In [1]:
import numpy as np
import pandas as pd

# Create a dataframe with columns A,B,C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

# include the columns you want
df[df.columns[df.columns.isin(['A', 'B'])]]

# or more simply include columns:
df[['A', 'B']]

# exclude columns you don't want
df[df.columns[~df.columns.isin(['C','D'])]]

# or even simpler since 0.24
# with the caveat that it reorders columns alphabetically 
df[df.columns.difference(['C', 'D'])]

Unnamed: 0,A,B
0,-1.389901,0.355966
1,-0.335859,-1.081798
2,0.000856,0.863560
3,-1.073584,1.606143
4,-1.236630,-1.081519
...,...,...
95,-0.101795,1.018276
96,-0.002801,-0.716710
97,-1.284928,0.633800
98,-1.074113,0.553467
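
For reference, the column selections above can also be written with ```.loc```. This is a sketch on a small made-up frame (`frame` and its zero values are assumptions), and ```drop(columns=...)``` is shown as one further way to exclude columns:

```python
import numpy as np
import pandas as pd

# Small illustrative frame with the same column names as above.
frame = pd.DataFrame(np.zeros((3, 4)), columns=list("ABCD"))

# Include columns: all rows, a list of column labels.
included = frame.loc[:, ["A", "B"]]

# Exclude columns: all rows, a boolean mask over the column labels.
excluded = frame.loc[:, ~frame.columns.isin(["C", "D"])]

# Exclude columns by name; returns a new frame and keeps the column order.
dropped = frame.drop(columns=["C", "D"])
```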
