## <font color='darkblue'>Pandas Data Selection</font>
([Article source](https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/)) There are [multiple ways to select](http://pandas.pydata.org/pandas-docs/stable/indexing.html#different-choices-for-indexing) and index rows and columns from [**Pandas DataFrames**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). I find tutorials online focusing on advanced selections of row and column choices a little complex for my requirements.

### <font color='darkgreen'>Selection Options</font>
There are three main options for selecting and indexing data in Pandas, which can be confusing. The three selection cases and methods covered in this post are:
1. [Selecting data by row numbers (.iloc)](#sect2_1)
2. [Selecting data by label or by a conditional statement (.loc)](#sect2_2)
3. [Selecting in a hybrid approach (.ix) (now deprecated in Pandas 0.20.1)](#sect2_3)


### <font color='darkgreen'>Data Setup</font>
This blog post, [inspired by other tutorials](http://chrisalbon.com/), describes selection activities with these operations. The tutorial is suited to the general data science situation where, typically:
1. Each row in the data frame represents a data sample.
2. Each column is a variable, and is usually named. I rarely select columns without their names.
3. I need to quickly and often select relevant rows from the data frame for modelling and visualisation activities.

For the uninitiated, the [**Pandas library**](http://pandas.pydata.org/) for Python provides high-performance, easy-to-use data structures and data analysis tools for handling tabular data in “series” and in “data frames”. It’s brilliant at making your data processing easier and I’ve written before about [grouping and summarising data](http://104.236.88.249/summarising-aggregation-and-grouping-data-in-python-pandas/) with Pandas.
![image1](images/1.PNG)
<br/>
Summary of iloc and loc methods discussed in this blog post. iloc and loc are operations for retrieving data from Pandas dataframes.

<a id='sect1'></a>
## <font color='darkblue'>Selection and Indexing Methods for Pandas DataFrames</font>
For these explorations we’ll need some sample data – I downloaded the uk-500 sample data set from www.briandunning.com. This data contains artificial names, addresses, companies and phone numbers for fictitious UK characters. To follow along, you can download the .csv file [here](https://s3-eu-west-1.amazonaws.com/shanebucket/downloads/uk-500.csv). Load the data as follows (the diagrams here come from a [Jupyter notebook](http://jupyter.org/) in the Anaconda Python install):

In [19]:
import pandas as pd
import random
 
def load_data():
    # read the data from the downloaded CSV file.
    data = pd.read_csv('../../datas/uk-500.csv')
    # set a numeric id for use as an index for examples.
    data['id'] = [random.randint(0,1000) for x in range(data.shape[0])]
    return data
 
data = load_data()
data.head(5)

Unnamed: 0,first_name,last_name,company_name,address,city,county,postal,phone1,phone2,email,web,id
0,Aleshia,Tomkiewicz,Alan D Rosenburg Cpa Pc,14 Taylor St,St. Stephens Ward,Kent,CT2 7PP,01835-703597,01944-369967,atomkiewicz@hotmail.com,http://www.alandrosenburgcpapc.co.uk,262
1,Evan,Zigomalas,Cap Gemini America,5 Binney St,Abbey Ward,Buckinghamshire,HP11 2AX,01937-864715,01714-737668,evan.zigomalas@gmail.com,http://www.capgeminiamerica.co.uk,194
2,France,Andrade,"Elliott, John W Esq",8 Moor Place,East Southbourne and Tuckton W,Bournemouth,BH6 3BE,01347-368222,01935-821636,france.andrade@hotmail.com,http://www.elliottjohnwesq.co.uk,162
3,Ulysses,Mcwalters,"Mcmahan, Ben L",505 Exeter Rd,Hawerby cum Beesby,Lincolnshire,DN36 5RP,01912-771311,01302-601380,ulysses@hotmail.com,http://www.mcmahanbenl.co.uk,823
4,Tyisha,Veness,Champagne Room,5396 Forth Street,Greets Green and Lyng Ward,West Midlands,B70 9DT,01547-429341,01290-367248,tyisha.veness@hotmail.com,http://www.champagneroom.co.uk,524
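If the CSV file is not to hand, the selections in this post can still be tried on a small in-memory frame. The sketch below rebuilds the five rows shown above with a subset of their columns; the helper name `load_sample` and the choice of columns are my own assumptions for illustration, not part of the original data setup.

```python
import pandas as pd

def load_sample():
    """Hypothetical stand-in for load_data(): the five head() rows shown above,
    restricted to a few columns so the frame stays easy to read."""
    return pd.DataFrame({
        "first_name": ["Aleshia", "Evan", "France", "Ulysses", "Tyisha"],
        "last_name": ["Tomkiewicz", "Zigomalas", "Andrade", "Mcwalters", "Veness"],
        "county": ["Kent", "Buckinghamshire", "Bournemouth",
                   "Lincolnshire", "West Midlands"],
        "email": ["atomkiewicz@hotmail.com", "evan.zigomalas@gmail.com",
                  "france.andrade@hotmail.com", "ulysses@hotmail.com",
                  "tyisha.veness@hotmail.com"],
        "id": [262, 194, 162, 823, 524],
    })

sample = load_sample()
print(sample.shape)  # (5, 5)
```

Because the real `id` column is generated randomly, the values in your own run will differ from the ones printed in this post.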


<a id='sect2_1'></a>
### <font color='darkgreen'>1. Selecting pandas data using “iloc”</font>
The [iloc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) indexer for Pandas Dataframe is used for [**integer-location based indexing / selection by position**](http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-position).

The [iloc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) indexer syntax is `data.iloc[<row selection>, <column selection>]`, which is sure to be a source of confusion for R users. “iloc” in pandas is used to select rows and columns by number, in the order that they appear in the data frame. You can imagine that **each row has a row number from 0 to the total number of rows minus one** (<font color='blue'>data.shape\[0] - 1</font>) and iloc\[] allows selections based on these numbers. The same applies for columns (<font color='blue'>ranging from 0 to data.shape\[1] - 1</font>).
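To make the numbering concrete, here is a minimal sketch on a made-up three-row frame (the frame `df` and its values are illustrative assumptions, not the uk-500 data). Valid positions run from 0 to `shape[0] - 1`, negative positions count back from the end, and a position outside that range raises an `IndexError`.

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30], "b": ["x", "y", "z"]})
n_rows, n_cols = df.shape

first = df.iloc[0]            # position 0 is the first row
last = df.iloc[n_rows - 1]    # position n_rows - 1 is the last row
also_last = df.iloc[-1]       # negative positions count back from the end

try:
    df.iloc[n_rows]           # one past the end is not a valid position
    out_of_range = False
except IndexError:
    out_of_range = True
```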

There are two “arguments” to iloc – a row selector, and a column selector.  For example:

In [2]:
# Single selections using iloc and DataFrame
# Rows:
data.iloc[0] # first row of data frame (Aleshia Tomkiewicz) - Note a Series data type output.
data.iloc[1] # second row of data frame (Evan Zigomalas)
data.iloc[-1] # last row of data frame (Mi Richan)
# Columns:
data.iloc[:,0] # first column of data frame (first_name)
data.iloc[:,1] # second column of data frame (last_name)
data.iloc[:,-1] # last column of data frame (id)

0      873
1      718
2      614
3       29
4      757
      ... 
495    462
496    688
497    372
498    575
499    732
Name: id, Length: 500, dtype: int64

Multiple columns and rows can be selected together using the [.iloc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) indexer.

In [3]:
# Multiple row and column selections using iloc and DataFrame
data.iloc[0:5] # first five rows of dataframe
data.iloc[:, 0:2] # first two columns of data frame with all rows
data.iloc[[0,3,6,24], [0,5,6]] # 1st, 4th, 7th, 25th row + 1st 6th 7th columns.
data.iloc[0:5, 5:8] # first 5 rows and 6th, 7th, 8th columns of data frame (county -> phone1).

Unnamed: 0,county,postal,phone1
0,Kent,CT2 7PP,01835-703597
1,Buckinghamshire,HP11 2AX,01937-864715
2,Bournemouth,BH6 3BE,01347-368222
3,Lincolnshire,DN36 5RP,01912-771311
4,West Midlands,B70 9DT,01547-429341


There are two gotchas to remember when using iloc in this manner:
1. Note that [.iloc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) returns a Pandas [**Series**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) when a single row or a single column is selected with a plain integer, and a Pandas [**DataFrame**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) when both the row and column selections are lists or slices. To counter this, pass a single-valued list if you require [**DataFrame**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) output. When using .loc or .iloc, you can control the output format by passing lists or single values to the selectors.
2. When selecting multiple columns or multiple rows in this manner, remember that a slice such as \[1:5] runs from the first number to one minus the second number: \[1:5] selects 1, 2, 3, 4, and in general \[x:y] goes from x to y-1.
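Both gotchas can be checked quickly on a small frame. This is a sketch on made-up data (the frame `df` is an assumption for illustration), showing how a plain integer versus a single-valued list changes the output type, and that positional slices stop one short of the end value.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [10, 20, 30, 40],
    "b": ["w", "x", "y", "z"],
    "c": [1.0, 2.0, 3.0, 4.0],
})

row_as_series = df.iloc[1]       # plain integer row selector    -> Series
row_as_frame = df.iloc[[1]]      # single-valued list            -> one-row DataFrame
col_as_series = df.iloc[:, 0]    # plain integer column selector -> Series
col_as_frame = df.iloc[:, [0]]   # single-valued list            -> one-column DataFrame

# Slices are half-open: 1:3 selects positions 1 and 2, but not 3.
middle = df.iloc[1:3]
```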

In practice, I rarely use the [iloc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) indexer, unless I want the first ( .iloc\[0] ) or the last ( .iloc\[-1] )  row of the data frame.

<a id='sect2_2'></a>
### <font color='darkgreen'>2. Selecting pandas data using “loc”</font>
The Pandas [loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) indexer can be used with [**DataFrames**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) for two different use cases:
* [a.) Selecting rows by label/index](#sect2_2_1)
* [b.) Selecting rows with a boolean / conditional lookup](#sect2_2_2)

The loc indexer is used with the same syntax as iloc: `data.loc[<row selection>, <column selection>]` .

<a id='sect2_2_1'></a>
### 2a. Label-based / Index-based indexing using .loc
Selections using the loc method are based on the index of the data frame (if any). Where the index is set on a DataFrame using `df.set_index()`, the .loc method selects directly on the index values of the rows. For example, setting the index of our test data frame to the person’s “last_name”:

In [4]:
data.set_index("last_name", inplace=True)
data.head()

Unnamed: 0_level_0,first_name,company_name,address,city,county,postal,phone1,phone2,email,web,id
last_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Tomkiewicz,Aleshia,Alan D Rosenburg Cpa Pc,14 Taylor St,St. Stephens Ward,Kent,CT2 7PP,01835-703597,01944-369967,atomkiewicz@hotmail.com,http://www.alandrosenburgcpapc.co.uk,873
Zigomalas,Evan,Cap Gemini America,5 Binney St,Abbey Ward,Buckinghamshire,HP11 2AX,01937-864715,01714-737668,evan.zigomalas@gmail.com,http://www.capgeminiamerica.co.uk,718
Andrade,France,"Elliott, John W Esq",8 Moor Place,East Southbourne and Tuckton W,Bournemouth,BH6 3BE,01347-368222,01935-821636,france.andrade@hotmail.com,http://www.elliottjohnwesq.co.uk,614
Mcwalters,Ulysses,"Mcmahan, Ben L",505 Exeter Rd,Hawerby cum Beesby,Lincolnshire,DN36 5RP,01912-771311,01302-601380,ulysses@hotmail.com,http://www.mcmahanbenl.co.uk,29
Veness,Tyisha,Champagne Room,5396 Forth Street,Greets Green and Lyng Ward,West Midlands,B70 9DT,01547-429341,01290-367248,tyisha.veness@hotmail.com,http://www.champagneroom.co.uk,757


Now with the index set, we can directly select rows for different “last_name” values using `.loc[<label>]`  – either singly, or in multiples. For example:

In [5]:
# Single selection
data.loc['Andrade']

first_name                                France
company_name                 Elliott, John W Esq
address                             8 Moor Place
city              East Southbourne and Tuckton W
county                               Bournemouth
postal                                   BH6 3BE
phone1                              01347-368222
phone2                              01935-821636
email                 france.andrade@hotmail.com
web             http://www.elliottjohnwesq.co.uk
id                                           614
Name: Andrade, dtype: object

In [6]:
# Multiple selection
data.loc[['Zigomalas', 'Veness']]

Unnamed: 0_level_0,first_name,company_name,address,city,county,postal,phone1,phone2,email,web,id
last_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Zigomalas,Evan,Cap Gemini America,5 Binney St,Abbey Ward,Buckinghamshire,HP11 2AX,01937-864715,01714-737668,evan.zigomalas@gmail.com,http://www.capgeminiamerica.co.uk,718
Veness,Tyisha,Champagne Room,5396 Forth Street,Greets Green and Lyng Ward,West Midlands,B70 9DT,01547-429341,01290-367248,tyisha.veness@hotmail.com,http://www.champagneroom.co.uk,757


Selecting single or multiple rows using [.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) index selections with pandas. Note that the first example returns a [**Series**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html), and the second returns a [**DataFrame**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). You can get a single-row DataFrame by passing a single-element list to the .loc operation.
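The difference between a single label and a list of labels is easy to verify. Below is a minimal sketch on a three-row frame indexed by last name; the frame is a cut-down, illustrative stand-in for the tutorial data rather than the real file.

```python
import pandas as pd

df = pd.DataFrame(
    {"first_name": ["Evan", "France", "Tyisha"],
     "county": ["Buckinghamshire", "Bournemouth", "West Midlands"]},
    index=pd.Index(["Zigomalas", "Andrade", "Veness"], name="last_name"),
)

one = df.loc["Andrade"]                  # single label       -> Series
one_row_frame = df.loc[["Andrade"]]      # one-element list   -> one-row DataFrame
two = df.loc[["Zigomalas", "Veness"]]    # list of labels     -> DataFrame, in list order
```

One caveat worth keeping in mind: if a label occurs more than once in the index, `.loc[label]` returns a DataFrame containing every matching row rather than a Series.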

Select columns with [.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) using the names of the columns. In most of my data work, typically I have named columns, and use these named selections.

In [7]:
data.loc[['Zigomalas', 'Veness'], ['city', 'email']]

Unnamed: 0_level_0,city,email
last_name,Unnamed: 1_level_1,Unnamed: 2_level_1
Zigomalas,Abbey Ward,evan.zigomalas@gmail.com
Veness,Greets Green and Lyng Ward,tyisha.veness@hotmail.com


When using the [.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) indexer, columns are referred to by names using lists of strings, or “:” slices.

You can select ranges of index labels: the selection `data.loc['Bruch':'Julio']` will return all rows in the data frame between the index entries for “Bruch” and “Julio”, including both end rows. The following examples should now make sense:

In [10]:
# Set the index to "last_name" (skip this line if it is already the index from the earlier cell).
data.set_index("last_name", inplace=True)

# Select rows with index values 'Andrade' and 'Veness', with all columns between 'city' and 'email'
data.loc[['Andrade', 'Veness'], 'city':'email']
# Select all rows from 'Andrade' through 'Veness' (inclusive), with just 'first_name', 'address' and 'city' columns
data.loc['Andrade':'Veness', ['first_name', 'address', 'city']]
 
# Change the index to be based on the 'id' column
data.set_index('id', inplace=True)
# Select the row with 'id' = 123 (passing a list keeps DataFrame output)
data.loc[[123]]

Unnamed: 0_level_0,first_name,company_name,address,city,county,postal,phone1,phone2,email,web
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
123,Carmen,"Norda, Beth Dorsey Esq",11 Denison St #7,Orford Ward,Cheshire,WA2 9QB,01692-491267,01417-973243,carmen@hotmail.com,http://www.nordabethdorseyesq.co.uk


Note that in the last example, `data.loc[123]` (<font color='brown'>the row with index value 123</font>) is not the same as `data.iloc[123]` (<font color='brown'>the 124th row in the data, since positions start at 0</font>). The index of the [**DataFrame**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) can be out of numeric order, and/or a string or multi-value.
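The label-versus-position distinction is easiest to see with an integer index that is out of order. The following sketch uses a made-up four-row frame (an assumption for illustration) and also shows that label slices include both endpoints while positional slices exclude the end.

```python
import pandas as pd

df = pd.DataFrame(
    {"first_name": ["Aleshia", "Evan", "France", "Ulysses"]},
    index=pd.Index([2, 0, 3, 1], name="id"),
)

by_label = df.loc[0, "first_name"]    # the row whose index VALUE is 0
by_position = df.iloc[0, 0]           # the first row in the frame

# Label slices include both endpoints: every row from label 2 through label 3.
label_slice = df.loc[2:3]
# Positional slices exclude the end: position 2 only.
position_slice = df.iloc[2:3]
```

Here `by_label` is "Evan" and `by_position` is "Aleshia", even though both selectors were written as `0`.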

<a id='sect2_2_2'></a>
### 2b. Boolean / Logical indexing using .loc
[Conditional selections with boolean arrays](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing) using `data.loc[<selection>]` are the most common method that I use with Pandas [**DataFrames**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). With boolean indexing or logical selection, you pass an array or Series of True/False values to the [.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) indexer to select the rows where your [**Series**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) has True values.

**In most use cases, you will make selections based on the values of different columns in your data set**.

For example, the statement `data['first_name'] == 'Antonio'` produces a Pandas [**Series**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) with a True/False value for every row in the ‘data’ DataFrame, where there are “True” values for the rows where the `first_name` is “Antonio”. This type of boolean array can be passed directly to the [.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) indexer like so:

In [11]:
data.loc[data['first_name'] == 'Antonio']

Unnamed: 0_level_0,first_name,company_name,address,city,county,postal,phone1,phone2,email,web
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
830,Antonio,Combs Sheetmetal,353 Standish St #8264,Little Parndon and Hare Street,Hertfordshire,CM20 2HT,01559-403415,01388-777812,antonio.villamarin@gmail.com,http://www.combssheetmetal.co.uk
250,Antonio,Saint Thomas Creations,425 Howley St,Gaer Community,Newport,NP20 3DE,01463-409090,01242-318420,antonio_glasford@glasford.co.uk,http://www.saintthomascreations.co.uk
824,Antonio,Radisson Suite Hotel,35 Elton St #3,Ipplepen,Devon,TQ12 5LL,01324-171614,01442-946357,antonio.heilig@gmail.com,http://www.radissonsuitehotel.co.uk


Using a boolean True/False series to select rows in a pandas data frame – all rows with first name of “Antonio” are selected.
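It can help to look at the boolean Series on its own before passing it to the indexer. A minimal sketch, using a made-up three-row frame (the rows are illustrative assumptions loosely based on the output above):

```python
import pandas as pd

df = pd.DataFrame({
    "first_name": ["Antonio", "Evan", "Antonio"],
    "county": ["Hertfordshire", "Buckinghamshire", "Devon"],
})

# The comparison is evaluated row by row and yields a boolean Series
# that shares the frame's index.
mask = df["first_name"] == "Antonio"

# .loc keeps only the rows where the mask is True.
selected = df.loc[mask]
```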

As before, a second argument can be passed to .loc to select particular columns out of the data frame. Again, columns are referred to by name for the loc indexer and can be a single string, a list of columns, or a slice “:” operation.

In [14]:
data.loc[data['first_name'] == 'Erasmo', ['email', 'company_name', 'phone1']]

Unnamed: 0_level_0,email,company_name,phone1
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
704,erasmo.talentino@hotmail.com,Active Air Systems,01492-454455
431,egath@hotmail.com,Pan Optx,01445-796544
934,erasmo_rhea@hotmail.com,Martin Morrissey,01507-386397


Note that when selecting columns, if one column only is selected, the [.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) operator returns a [**Series**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html). For a single column DataFrame, use a one-element list to keep the [**DataFrame**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) format, for example:

In [16]:
# Return Series
data.loc[data['first_name'] == 'Erasmo', 'email'].__class__

pandas.core.series.Series

In [17]:
# Return DataFrame
data.loc[data['first_name'] == 'Erasmo', ['email']].__class__

pandas.core.frame.DataFrame

Make sure you understand the following additional examples of [.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) selections for clarity:

In [20]:
data = load_data()

# Select rows with first name Antonio, and all columns between 'city' and 'email'
data.loc[data['first_name'] == 'Antonio', 'city':'email']
 
# Select rows where the email column ends with 'hotmail.com', include all columns
data.loc[data['email'].str.endswith("hotmail.com")]   
 
# Select rows with first_name equal to some values, all columns
data.loc[data['first_name'].isin(['France', 'Tyisha', 'Eric'])]   
       
# Select rows with first name Antonio AND gmail email addresses
data.loc[data['email'].str.endswith("gmail.com") & (data['first_name'] == 'Antonio')] 
 
# select rows with id column between 100 and 200, and just return 'postal' and 'web' columns
data.loc[(data['id'] > 100) & (data['id'] <= 200), ['postal', 'web']] 
 
# A lambda function that yields True/False values can also be used.
# Select rows where the company name has 4 words in it.
data.loc[data['company_name'].apply(lambda x: len(x.split(' ')) == 4)] 
 
# Selections can be achieved outside of the main .loc for clarity:
# Form a separate variable with your selections:
idx = data['company_name'].apply(lambda x: len(x.split(' ')) == 4)
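# Select only the True values in 'idx' and only the 3 columns specified.
# Use the exact column label 'company_name'. The run captured below passed the
# non-existent label 'company', which is why its output shows a warning and an
# empty 'company' column; pandas 1.0+ raises KeyError for a missing label instead.
data.loc[idx, ['email', 'first_name', 'company_name']]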

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)


Unnamed: 0,email,first_name,company
2,france.andrade@hotmail.com,France,
5,erampy@rampy.co.uk,Eric,
11,charlesetta_erm@gmail.com,Charlesetta,
15,mthrossell@throssell.co.uk,Michell,
16,edgar.kanne@yahoo.com,Edgar,
...,...,...,...
481,ahmad.alsaqri@yahoo.com,Ahmad,
484,jacquelyne_reibman@yahoo.com,Jacquelyne,
487,isabelle.kono@yahoo.com,Isabelle,
492,elbert@hotmail.com,Elbert,
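The combined conditions in the examples above rely on the element-wise operators `&` (and), `|` (or) and `~` (not), with each comparison wrapped in parentheses; Python's plain `and` / `or` do not work on Series. Here is a small sketch on made-up rows (the frame and the `id` values are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "first_name": ["Antonio", "France", "Antonio", "Eric"],
    "email": ["antonio.villamarin@gmail.com", "france.andrade@hotmail.com",
              "antonio_glasford@glasford.co.uk", "erampy@rampy.co.uk"],
    "id": [830, 162, 250, 150],
})

is_antonio = df["first_name"] == "Antonio"
is_gmail = df["email"].str.endswith("gmail.com")
in_id_range = (df["id"] > 100) & (df["id"] <= 200)

both = df.loc[is_antonio & is_gmail]           # AND: Antonio with a gmail address
either = df.loc[is_antonio | in_id_range]      # OR: Antonio, or id in (100, 200]
not_antonio = df.loc[~is_antonio, ["email"]]   # NOT, keeping one column as a DataFrame
```

Naming the intermediate masks, as done here, tends to keep longer selections readable.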


Logical selections and boolean [**Series**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) can also be passed to the generic \[] indexer of a pandas DataFrame and will give the same results: `data.loc[data['id'] == 9]` returns the same rows as `data[data['id'] == 9]`.
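One way to confirm that the two spellings agree is `DataFrame.equals`, which compares shape, labels and values. A minimal sketch on a made-up frame (the data is an assumption for illustration):

```python
import pandas as pd

df = pd.DataFrame({"id": [9, 12, 9], "first_name": ["Evan", "France", "Tyisha"]})
mask = df["id"] == 9

via_loc = df.loc[mask]     # boolean mask through .loc
via_brackets = df[mask]    # the same mask through the generic [] indexer

same = via_loc.equals(via_brackets)
```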

<a id='sect2_3'></a>
### <font color='darkgreen'>3. Selecting pandas data using ix</font>
> Note: The ix indexer was deprecated in the Pandas 0.20 series and removed entirely in Pandas 1.0. Use .loc or .iloc instead.
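Code that used `.ix` to mix labels and positions can usually be rewritten by converting one side first, then using `.loc` or `.iloc`. The sketch below shows both directions on a made-up two-row frame; the frame and the last names used as its index are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["Abbey Ward", "Ipplepen"],
     "county": ["Buckinghamshire", "Devon"],
     "email": ["evan.zigomalas@gmail.com", "antonio.heilig@gmail.com"]},
    index=pd.Index(["Zigomalas", "Heilig"], name="last_name"),
)

# Row by label, columns by position: turn the positions into labels, then use .loc.
via_loc = df.loc["Heilig", df.columns[[0, 2]]]

# The same selection the other way round: turn the row label into a position,
# then use .iloc.
row_pos = df.index.get_loc("Heilig")
via_iloc = df.iloc[row_pos, [0, 2]]
```

Both variants return the `city` and `email` values of the same row; which one reads better depends on whether the surrounding code mostly works with labels or with positions.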

## <font color='darkblue'>Setting values in DataFrames using .loc</font>
With a slight change of syntax, you can actually update your [**DataFrame**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) in the same statement as you select and filter using [.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) indexer. This particular pattern allows you to update values in columns depending on different conditions. **<font color='darkred'>The setting operation does not make a copy of the data frame, but edits the original data.</font>** As an example:

In [30]:
data.loc[data['id'] > 990, ['id', 'first_name']]

Unnamed: 0,id,first_name
312,994,Eleni
337,991,Joaquin
392,1000,Magdalene
458,992,Ivan
460,993,Carlton


In [31]:
# Change the first name of all rows with an ID greater than 990 to "John"
data.loc[data['id'] > 990, "first_name"] = "John"

In [32]:
data.loc[data['id'] > 990, ['id', 'first_name']]

Unnamed: 0,id,first_name
312,994,John
337,991,John
392,1000,John
458,992,John
460,993,John
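The same pattern in a self-contained form, on a made-up three-row frame (the data is an assumption for illustration). Taking a `.copy()` beforehand is one way to keep the original values, since the assignment through `.loc` changes the frame it is called on.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [994, 120, 1000],
    "first_name": ["Eleni", "Evan", "Magdalene"],
})
original = df.copy()   # independent copy, unaffected by the assignment below

# One .loc call: rows chosen by the boolean mask, column chosen by name.
df.loc[df["id"] > 990, "first_name"] = "John"
```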


That’s the basics of indexing and selecting with Pandas. If you’re looking for more, take a look at the [.iat](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.iat.html), and [.at](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.at.html) operations for some more [performance-enhanced value accessors](http://pandas.pydata.org/pandas-docs/stable/indexing.html#fast-scalar-value-getting-and-setting) in the [Pandas Documentation](http://pandas.pydata.org/pandas-docs/stable/) and take a look at [selecting by callable functions](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-callable) for more iloc and loc fun.
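As a brief taste of those two extras, here is a minimal sketch on a made-up two-row frame (the data is an assumption for illustration): `.at` and `.iat` read or write a single cell by label and by position respectively, and `.loc` also accepts a callable that receives the frame and returns an indexer.

```python
import pandas as pd

df = pd.DataFrame(
    {"first_name": ["Evan", "France"], "id": [194, 162]},
    index=pd.Index(["Zigomalas", "Andrade"], name="last_name"),
)

by_label = df.at["Andrade", "first_name"]   # single cell by row/column label
by_position = df.iat[1, 0]                  # single cell by row/column position

df.at["Andrade", "id"] = 163                # .at can also set a single value

# A callable passed to .loc is called with the frame and must return a valid
# indexer, here a boolean Series.
small_ids = df.loc[lambda d: d["id"] < 190]
```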