## Tips for Selecting Columns in a DataFrame

Notebook to accompany this [post](https://pbpython.com/selecting-columns.html).



In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv(
    'https://data.cityofnewyork.us/api/views/vfnx-vebw/rows.csv?accessType=DOWNLOAD&bom=true&format=true'
)

Build a list that pairs each column name with its integer position, so we can look up the positions to pass to `.iloc`.

In [3]:
col_mapping = [f"{i}:{name}" for i, name in enumerate(df.columns)]

In [4]:
col_mapping

['0:X',
 '1:Y',
 '2:Unique Squirrel ID',
 '3:Hectare',
 '4:Shift',
 '5:Date',
 '6:Hectare Squirrel Number',
 '7:Age',
 '8:Primary Fur Color',
 '9:Highlight Fur Color',
 '10:Combination of Primary and Highlight Color',
 '11:Color notes',
 '12:Location',
 '13:Above Ground Sighter Measurement',
 '14:Specific Location',
 '15:Running',
 '16:Chasing',
 '17:Climbing',
 '18:Eating',
 '19:Foraging',
 '20:Other Activities',
 '21:Kuks',
 '22:Quaas',
 '23:Moans',
 '24:Tail flags',
 '25:Tail twitches',
 '26:Approaches',
 '27:Indifferent',
 '28:Runs from',
 '29:Other Interactions',
 '30:Lat/Long']

We can also build a dictionary that maps each position to its column name.

In [5]:
col_mapping_dict = {i: name for i, name in enumerate(df.columns)}

In [6]:
col_mapping_dict

{0: 'X',
 1: 'Y',
 2: 'Unique Squirrel ID',
 3: 'Hectare',
 4: 'Shift',
 5: 'Date',
 6: 'Hectare Squirrel Number',
 7: 'Age',
 8: 'Primary Fur Color',
 9: 'Highlight Fur Color',
 10: 'Combination of Primary and Highlight Color',
 11: 'Color notes',
 12: 'Location',
 13: 'Above Ground Sighter Measurement',
 14: 'Specific Location',
 15: 'Running',
 16: 'Chasing',
 17: 'Climbing',
 18: 'Eating',
 19: 'Foraging',
 20: 'Other Activities',
 21: 'Kuks',
 22: 'Quaas',
 23: 'Moans',
 24: 'Tail flags',
 25: 'Tail twitches',
 26: 'Approaches',
 27: 'Indifferent',
 28: 'Runs from',
 29: 'Other Interactions',
 30: 'Lat/Long'}
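
The lookup can also go the other way, from a name to its position. The sketch below assumes a small stand-in frame (`toy` is hypothetical and only reuses the first few column names from the listing above); `Index.get_loc` and `Index.get_indexer` are standard pandas `Index` methods.

```python
import pandas as pd

# Hypothetical empty frame that reuses the first five column names from the data.
toy = pd.DataFrame(columns=["X", "Y", "Unique Squirrel ID", "Hectare", "Shift"])

# Position -> name, the same kind of mapping as col_mapping_dict.
pos_to_name = dict(enumerate(toy.columns))

# Name -> position for a single column.
shift_pos = toy.columns.get_loc("Shift")

# Several names at once; get_indexer returns -1 for a name that is not present.
positions = toy.columns.get_indexer(["Y", "Hectare", "Not A Column"])

print(pos_to_name[2])
print(shift_pos)
print(positions)
```

Checking for `-1` in the result of `get_indexer` is a cheap way to catch a misspelled column name before it reaches `.iloc`.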

Use `.iloc` to select just the third column (position 2, Unique Squirrel ID). Positions are zero-based.

In [7]:
df.iloc[:, 2]

0       37F-PM-1014-03
1       21B-AM-1019-04
2       11B-PM-1014-08
3       32E-PM-1017-14
4       13E-AM-1017-05
             ...      
3018    30B-AM-1007-04
3019    19A-PM-1013-05
3020    22D-PM-1012-07
3021    29B-PM-1010-02
3022     5E-PM-1012-01
Name: Unique Squirrel ID, Length: 3023, dtype: object
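
A single integer returns a Series, as in the output above. Wrapping the same integer in a list returns a one-column DataFrame instead. A minimal sketch, assuming a made-up three-column frame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical frame with made-up values.
toy = pd.DataFrame(
    {"X": [-73.95, -73.96], "Y": [40.79, 40.78], "Unique Squirrel ID": ["a-01", "b-02"]}
)

as_series = toy.iloc[:, 2]    # scalar position -> Series
as_frame = toy.iloc[:, [2]]   # one-element list -> DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The list form is handy when later code expects a DataFrame regardless of how many columns were selected.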

Pass a list of integers to select multiple columns by index

In [8]:
df.iloc[:, [0,1,2]]

Unnamed: 0,X,Y,Unique Squirrel ID
0,-73.956134,40.794082,37F-PM-1014-03
1,-73.968857,40.783783,21B-AM-1019-04
2,-73.974281,40.775534,11B-PM-1014-08
3,-73.959641,40.790313,32E-PM-1017-14
4,-73.970268,40.776213,13E-AM-1017-05
...,...,...,...
3018,-73.963943,40.790868,30B-AM-1007-04
3019,-73.970402,40.782560,19A-PM-1013-05
3020,-73.966587,40.783678,22D-PM-1012-07
3021,-73.963994,40.789915,29B-PM-1010-02


We can also use slice notation to select a range of columns. The end position is excluded, so `0:3` selects positions 0, 1, and 2.

In [9]:
df.iloc[:, 0:3]

Unnamed: 0,X,Y,Unique Squirrel ID
0,-73.956134,40.794082,37F-PM-1014-03
1,-73.968857,40.783783,21B-AM-1019-04
2,-73.974281,40.775534,11B-PM-1014-08
3,-73.959641,40.790313,32E-PM-1017-14
4,-73.970268,40.776213,13E-AM-1017-05
...,...,...,...
3018,-73.963943,40.790868,30B-AM-1007-04
3019,-73.970402,40.782560,19A-PM-1013-05
3020,-73.966587,40.783678,22D-PM-1012-07
3021,-73.963994,40.789915,29B-PM-1010-02
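
One detail worth keeping in mind: positional slices with `.iloc` exclude the end point, while label slices with `.loc` include it. A small sketch, assuming a hypothetical one-row frame (`toy`) that borrows four column names from the data:

```python
import pandas as pd

# Hypothetical one-row frame; the values are placeholders.
toy = pd.DataFrame([[1, 2, 3, 4]], columns=["X", "Y", "Unique Squirrel ID", "Hectare"])

# Positional slice: the end position (3) is excluded.
by_position = toy.iloc[:, 0:3]

# Label slice: the end label is included.
by_label = toy.loc[:, "X":"Unique Squirrel ID"]

print(list(by_position.columns))
print(list(by_label.columns))
```

Both selections return the same three columns here, even though the slice endpoints are written differently.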


If we want to combine the list and slice notation, we need to use `np.r_`, which expands the slices and concatenates everything into a single array of integer positions.

In [10]:
np.r_[0:3,15:19,24,25]

array([ 0,  1,  2, 15, 16, 17, 18, 24, 25])

We can pass the output of np.r_ to .iloc to use multiple selection approaches

In [11]:
df.iloc[:, np.r_[0:3,15:19,24,25]]

Unnamed: 0,X,Y,Unique Squirrel ID,Running,Chasing,Climbing,Eating,Tail flags,Tail twitches
0,-73.956134,40.794082,37F-PM-1014-03,False,False,False,False,False,False
1,-73.968857,40.783783,21B-AM-1019-04,False,False,False,False,False,False
2,-73.974281,40.775534,11B-PM-1014-08,False,True,False,False,False,False
3,-73.959641,40.790313,32E-PM-1017-14,False,False,False,True,False,False
4,-73.970268,40.776213,13E-AM-1017-05,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...
3018,-73.963943,40.790868,30B-AM-1007-04,False,False,False,True,False,False
3019,-73.970402,40.782560,19A-PM-1013-05,False,False,False,False,False,False
3020,-73.966587,40.783678,22D-PM-1012-07,False,False,False,True,False,False
3021,-73.963994,40.789915,29B-PM-1010-02,False,False,False,True,False,False
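
`np.r_` also accepts slices with a step and negative scalars, which `.iloc` interprets as positions counted from the end. A minimal sketch, assuming a hypothetical ten-column frame (`toy`, with columns named `c0` to `c9`):

```python
import numpy as np
import pandas as pd

# Hypothetical one-row frame with columns c0..c9.
toy = pd.DataFrame([list(range(10))], columns=[f"c{i}" for i in range(10)])

# First two columns, every second column from 4 up to (not including) 9, and the last column.
idx = np.r_[0:2, 4:9:2, -1]
selected = toy.iloc[:, idx]

print(idx)
print(list(selected.columns))
```

Printing `idx` before passing it to `.iloc` is a quick check that the slices expanded the way you intended.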


We can use the same notation when reading a CSV file by passing the `np.r_` output to the `usecols` argument of `read_csv`.

In [12]:
df_2 = pd.read_csv(
    'https://data.cityofnewyork.us/api/views/vfnx-vebw/rows.csv?accessType=DOWNLOAD&bom=true&format=true',
    usecols=np.r_[1,2,5:8,15:25],
)

In [13]:
df_2.head()

Unnamed: 0,Y,Unique Squirrel ID,Date,Hectare Squirrel Number,Age,Running,Chasing,Climbing,Eating,Foraging,Other Activities,Kuks,Quaas,Moans,Tail flags
0,40.794082,37F-PM-1014-03,10142018,3,,False,False,False,False,False,,False,False,False,False
1,40.783783,21B-AM-1019-04,10192018,4,,False,False,False,False,False,,False,False,False,False
2,40.775534,11B-PM-1014-08,10142018,8,,False,True,False,False,False,,False,False,False,False
3,40.790313,32E-PM-1017-14,10172018,14,Adult,False,False,False,True,True,,False,False,False,False
4,40.776213,13E-AM-1017-05,10172018,5,Adult,False,False,False,False,True,,False,False,False,False
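
Note that `usecols` only decides which columns are read; the result keeps the column order of the file, not the order of the positions you passed. The sketch below assumes a tiny in-memory CSV (`csv_text` is made up for illustration) so it runs without a network connection.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content with six columns.
csv_text = (
    "X,Y,ID,Hectare,Shift,Date\n"
    "1.0,2.0,a-01,37F,PM,10142018\n"
    "3.0,4.0,b-02,21B,AM,10192018\n"
)

# Ask for position 4 first, then positions 0 and 1.
subset = pd.read_csv(io.StringIO(csv_text), usecols=np.r_[4, 0:2])
print(list(subset.columns))

# Reorder explicitly after reading if a specific order matters.
reordered = subset[["Shift", "X", "Y"]]
print(list(reordered.columns))
```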


We can also select columns using a boolean array

In [14]:
run_cols = df.columns.str.contains('run', case=False)
run_cols

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True, False, False,
       False, False, False, False, False, False, False, False, False,
       False,  True, False, False])

In [15]:
df.iloc[:, run_cols].head()

Unnamed: 0,Running,Runs from
0,False,False
1,False,False
2,False,False
3,False,True
4,False,False
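
Because the result of `str.contains` is a boolean array, several masks can be combined with `|`, `&`, and `~` before selecting. A minimal sketch, assuming a hypothetical one-row frame (`toy`) that borrows four column names from the data; `.loc` is used here, which accepts the same boolean array:

```python
import pandas as pd

# Hypothetical one-row frame; the values are placeholders.
toy = pd.DataFrame(
    [[False, True, False, True]],
    columns=["Running", "Runs from", "Eating", "Tail flags"],
)

run_mask = toy.columns.str.contains("run", case=False)
tail_mask = toy.columns.str.startswith("Tail")

# Columns that match either pattern.
either = toy.loc[:, run_mask | tail_mask]

# Columns that match neither pattern.
neither = toy.loc[:, ~(run_mask | tail_mask)]

print(list(either.columns))
print(list(neither.columns))
```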


A lambda function lets us build the boolean array and select the columns in a single line.

In [16]:
df.iloc[:, lambda df: df.columns.str.contains('run', case=False)].head()

Unnamed: 0,Running,Runs from
0,False,False
1,False,False
2,False,False
3,False,True
4,False,False
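
The callable form is most useful inside a method chain, where the frame being indexed has no name yet: `.iloc` calls the function with the frame as it exists at that point in the chain. A sketch under that assumption, using a hypothetical one-row frame (`toy`):

```python
import pandas as pd

# Hypothetical one-row frame; the values are placeholders.
toy = pd.DataFrame(
    [[False, True, False]],
    columns=["Running", "Runs from", "Eating"],
)

# The lambda receives the renamed frame, so it sees the lower-case names.
result = (
    toy
    .rename(columns=str.lower)
    .iloc[:, lambda d: d.columns.str.contains("run")]
)

print(list(result.columns))
```

Naming the lambda argument something other than `df` (here `d`) avoids confusion with the outer variable.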


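A more complex example: a regular expression with `|` matches any column whose name contains one of several words. None of the column names listed above contain "district", "precinct", or "boundaries", so the result below has rows but no columns.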

In [17]:
df.iloc[:, lambda df: df.columns.str.contains('district|precinct|boundaries',
                                              case=False)].head()

0
1
2
3
4


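Combining integer positions with a boolean array. The boolean array is first converted to a list of integer positions. Because no column names matched here, that list is empty and only the first three columns are selected.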

In [18]:
location_cols = df.columns.str.contains('district|precinct|boundaries',
                                        case=False)
location_cols

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False])

In [19]:
location_indices = [i for i, col in enumerate(location_cols) if col]
location_indices

[]

In [20]:
df.iloc[:, np.r_[0:3,location_indices]].head()

Unnamed: 0,X,Y,Unique Squirrel ID
0,-73.956134,40.794082,37F-PM-1014-03
1,-73.968857,40.783783,21B-AM-1019-04
2,-73.974281,40.775534,11B-PM-1014-08
3,-73.959641,40.790313,32E-PM-1017-14
4,-73.970268,40.776213,13E-AM-1017-05
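
One thing to watch when the list of matches is empty: NumPy treats an empty Python list as a float array, so `np.r_` returns float positions. Whether `.iloc` accepts those may depend on the pandas version. A sketch of one way to keep the positions as integers, using `np.flatnonzero` (this is an alternative, not the approach used in the cells above; `toy` is a hypothetical frame):

```python
import numpy as np
import pandas as pd

# Hypothetical one-row frame; the values are placeholders.
toy = pd.DataFrame(
    [[1.0, 2.0, "a-01", True]],
    columns=["X", "Y", "Unique Squirrel ID", "Running"],
)

# No column name matches, so the mask is all False.
no_match = toy.columns.str.contains("district|precinct", case=False)

# An empty Python list turns the whole np.r_ result into floats.
from_list = np.r_[0:3, [i for i, hit in enumerate(no_match) if hit]]

# np.flatnonzero returns an integer array even when nothing matches.
from_flatnonzero = np.r_[0:3, np.flatnonzero(no_match)]

print(from_list.dtype)
print(from_flatnonzero.dtype)
print(list(toy.iloc[:, from_flatnonzero].columns))
```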
