# Selecting Subsets of Data from DataFrames with `loc`

## Subset selection with `loc`
The `loc` indexer selects data differently from *just the brackets* and has its own set of rules that we must learn.

### Simultaneous row and column subset selection with `loc`
The `loc` indexer can select rows and columns simultaneously, which is not possible with *just the brackets*. Separate the row selection from the column selection with a **comma**. The selection will look something like this:

```
df.loc[rows, cols]
```

### `loc` primarily selects data by label

Most importantly, `loc` primarily selects data by the **label** of the rows and columns. Provide `loc` with the labels of the rows and/or columns you would like to select. It also accepts boolean selections, a topic covered in a later chapter.

### Read in data
Let's get started by reading in a sample DataFrame.

In [1]:
import pandas as pd
df = pd.read_csv('../data/sample_data.csv', index_col=0)
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2
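
If the CSV file is not at hand, the same table can be rebuilt inline to follow along. This is a sketch that assumes the output shown above is the complete contents of the file; building it from a dictionary is just one convenient option.

```python
import pandas as pd

# Rebuild the sample data from the values printed above (assumed complete).
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK", "TX", "TX"],
        "color": ["blue", "green", "red", "white", "gray", "black", "red"],
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese", "Melon", "Beans"],
        "age": [30, 2, 12, 4, 32, 33, 69],
        "height": [165, 70, 120, 80, 180, 172, 150],
        "score": [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
    },
    # The names play the role of the first CSV column used as the index.
    index=["Jane", "Niko", "Aaron", "Penelope", "Dean", "Christina", "Cornelia"],
)
print(df)
```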


### Select two rows and three columns with `loc`
Let's make our first selection with `loc` by simultaneously selecting some rows and some columns. Let's select the rows `Dean` and `Cornelia` along with the columns `age`, `state`, and `score`. Each selection is written as a list of labels, and the two lists are placed within the brackets, separated by a comma.

In [2]:
rows = ['Dean', 'Cornelia']
cols = ['age', 'state', 'score']
df.loc[rows, cols]

Unnamed: 0,age,state,score
Dean,32,AK,1.8
Cornelia,69,TX,2.2
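
Two details of list selection are worth trying out: the result follows the order of the labels in the lists, and a label that does not exist raises a `KeyError` in recent pandas versions. The sketch below uses a trimmed copy of the sample data, and `Bob` is a hypothetical label that is deliberately absent.

```python
import pandas as pd

# Trimmed copy of the sample data (assumed values from the table above).
df = pd.DataFrame(
    {"state": ["NY", "AK", "TX"], "age": [30, 32, 69], "score": [4.6, 1.8, 2.2]},
    index=["Jane", "Dean", "Cornelia"],
)

# The output order follows the lists, not the original DataFrame order.
result = df.loc[["Cornelia", "Dean"], ["score", "age"]]
print(result)

# 'Bob' is a hypothetical label that is not in the index.
try:
    df.loc[["Dean", "Bob"], ["age"]]
    raised = False
except KeyError:
    raised = True
print(raised)
```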


### The possible types of row and column selections
In the above example, we used a list of labels for both the row and column selection. You are not limited to just lists. All of the following are valid objects available for both row and column selections with `loc`.

* A single label
* A list of labels
* A slice with labels
* A boolean Series or array (covered in a later chapter)
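
The first three kinds of selection can be tried side by side. This is a minimal sketch on a trimmed copy of the sample data; the variable names are only illustrative.

```python
import pandas as pd

# Trimmed copy of the sample data (assumed values from the table above).
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL"],
        "color": ["blue", "green", "red", "white"],
        "age": [30, 2, 12, 4],
    },
    index=["Jane", "Niko", "Aaron", "Penelope"],
)

single = df.loc["Niko", "state"]                      # single labels
listed = df.loc[["Jane", "Aaron"], ["state", "age"]]  # lists of labels
sliced = df.loc["Niko":"Penelope", "state":"color"]   # slices with labels

print(single)
print(listed)
print(sliced)
```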

### Select two rows and a single column
Let's select the rows `Dean` and `Aaron` along with the `food` column. We can use a list for the row selection and a single string for the column selection. The rows are returned in the order given in the list.

In [3]:
rows = ['Dean', 'Aaron']
cols = 'food'
df.loc[rows, cols]

Dean     Cheese
Aaron     Mango
Name: food, dtype: object

### Series Returned
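In the above example, a Series and not a DataFrame was returned. Whenever you select a single row or a single column using a single label, such as a string, rather than a list or slice, pandas will return a Series. Selecting both a single row and a single column returns a scalar, as shown later in this chapter.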
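
To keep a DataFrame when selecting one column, the label can be wrapped in a one-item list. The sketch below contrasts the two forms on a trimmed copy of the sample data.

```python
import pandas as pd

# Trimmed copy of the sample data (assumed values from the table above).
df = pd.DataFrame(
    {"food": ["Mango", "Cheese"], "age": [12, 32]},
    index=["Aaron", "Dean"],
)

rows = ["Dean", "Aaron"]
as_series = df.loc[rows, "food"]    # single label: one-dimensional result
as_frame = df.loc[rows, ["food"]]   # one-item list: two-dimensional result

print(type(as_series).__name__)
print(type(as_frame).__name__)
```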

## `loc` with slice notation
Lists, tuples, and strings are the core Python objects that allow subset selection with slice notation. This same notation is allowed with DataFrames. Let's select all of the rows from `Jane` to `Penelope` with slice notation along with the columns `state` and `color`.

In [4]:
cols = ['state', 'color']
df.loc['Jane':'Penelope', cols]

Unnamed: 0,state,color
Jane,NY,blue
Niko,TX,green
Aaron,FL,red
Penelope,AL,white


### Slice notation is inclusive of the stop label
Slice notation with the `loc` indexer is inclusive of the stop label. This differs from slicing a Python list, which excludes the stop integer.
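
The difference is easiest to see by slicing a list and a DataFrame that hold the same names. This is a small sketch with an assumed one-column DataFrame built only for the comparison.

```python
import pandas as pd

names = ["Jane", "Niko", "Aaron", "Penelope", "Dean"]
df = pd.DataFrame({"age": [30, 2, 12, 4, 32]}, index=names)

# Python list: the stop position (3, which is 'Penelope') is excluded.
list_slice = names[0:3]

# loc: the stop label 'Penelope' is included.
loc_slice = df.loc["Jane":"Penelope"]

print(list_slice)
print(list(loc_slice.index))
```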

### Slice notation only works within the brackets attached to the object
Python only allows us to use slice notation within the brackets that are attached to an object. If we try to assign slice notation to a variable outside of the brackets, we will get a syntax error like we do below.

In [5]:
rows = 'Jane':'Penelope'

SyntaxError: invalid syntax (3746084903.py, line 1)
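
If a slice does need to be stored in a variable, Python's built-in `slice` object is one way to do it. The sketch below, on a trimmed copy of the sample data, checks that it gives the same result as the colon notation.

```python
import pandas as pd

# Trimmed copy of the sample data (assumed values from the table above).
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK"],
        "color": ["blue", "green", "red", "white", "gray"],
    },
    index=["Jane", "Niko", "Aaron", "Penelope", "Dean"],
)

# slice(start, stop) is the object form of start:stop.
rows = slice("Jane", "Penelope")
cols = ["state", "color"]

from_variable = df.loc[rows, cols]
from_colon = df.loc["Jane":"Penelope", cols]
print(from_variable.equals(from_colon))
```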

### Slice both the rows and columns
Both row and column selections support slice notation. In the following example, we slice all the rows up to and including label `Dean` along with columns from `height` until the end.

In [6]:
df.loc[:'Dean', 'height':]

Unnamed: 0,height,score
Jane,165,4.6
Niko,70,8.3
Aaron,120,9.0
Penelope,80,3.3
Dean,180,1.8


### Selecting all of the rows and some of the columns
Slice notation can select all of the rows, which is done with a single colon. In this example, we select all of the rows and two of the columns.

In [7]:
cols = ['food', 'color']
df.loc[:, cols]

Unnamed: 0,food,color
Jane,Steak,blue
Niko,Lamb,green
Aaron,Mango,red
Penelope,Apple,white
Dean,Cheese,gray
Christina,Melon,black
Cornelia,Beans,red


### Could have used *just the brackets*
It is not necessary to use `loc` for this selection since, ultimately, it is just a selection of two columns. This could have been accomplished with *just the brackets*.

In [8]:
cols = ['food', 'color']
df[cols]

Unnamed: 0,food,color
Jane,Steak,blue
Niko,Lamb,green
Aaron,Mango,red
Penelope,Apple,white
Dean,Cheese,gray
Christina,Melon,black
Cornelia,Beans,red
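
The two forms can be compared directly with the `equals` method. This is a quick check on a trimmed copy of the sample data rather than a proof that they match in every situation.

```python
import pandas as pd

# Trimmed copy of the sample data (assumed values from the table above).
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL"],
        "color": ["blue", "green", "red"],
        "food": ["Steak", "Lamb", "Mango"],
    },
    index=["Jane", "Niko", "Aaron"],
)

cols = ["food", "color"]
with_loc = df.loc[:, cols]
with_brackets = df[cols]
print(with_loc.equals(with_brackets))
```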


### A single colon is slice notation for select all
That single colon might look intimidating, but it is simply slice notation that selects all items. In the following example, all of the elements of a Python list are selected using a single colon.

In [9]:
a_list = [1, 2, 3, 4, 5, 6]
a_list[:]

[1, 2, 3, 4, 5, 6]

### Use a single colon to select all the columns
It is possible to use a single colon to represent a slice of all the rows or all of the columns. Below, a colon is used as slice notation for all of the columns.

In [10]:
rows = ['Penelope','Cornelia']
df.loc[rows, :]

Unnamed: 0,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Cornelia,TX,red,Beans,69,150,2.2


### The above can be shortened
By default, pandas will select all of the columns if you only provide a row selection. Providing the colon is not necessary and the following will do the same.

In [11]:
rows = ['Penelope', 'Cornelia']
df.loc[rows]

Unnamed: 0,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Cornelia,TX,red,Beans,69,150,2.2


Though it is not syntactically necessary, one reason to use the colon is to reinforce the idea that `loc` may be used for simultaneous row and column selection. The first object passed to `loc` always selects rows and the second always selects columns.
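
A short check, again on a trimmed copy of the sample data, shows that the explicit colon and the shortened form return the same DataFrame.

```python
import pandas as pd

# Trimmed copy of the sample data (assumed values from the table above).
df = pd.DataFrame(
    {"state": ["AL", "AK", "TX"], "age": [4, 32, 69]},
    index=["Penelope", "Dean", "Cornelia"],
)

rows = ["Penelope", "Cornelia"]
explicit = df.loc[rows, :]   # rows first, then all columns
shortened = df.loc[rows]     # column selection omitted
print(explicit.equals(shortened))
```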

### Use slice notation to select a range of rows with all of the columns
Similarly, we can slice from `Niko` through `Dean` while selecting all of the columns. Again, omitting the column selection returns all of the columns.

In [12]:
df.loc['Niko':'Dean']

Unnamed: 0,state,color,food,age,height,score
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8


Again, you could have written the above as `df.loc['Niko':'Dean', :]` to reinforce the fact that `loc` first selects rows and then columns.

### Changing the step size
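The step size must be an integer when using slice notation with `loc`, even though the start and stop are labels. The step counts rows by position within the sliced range. In this example, we select every other row beginning at `Niko` and ending at `Christina`.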

In [13]:
df.loc['Niko':'Christina':2]

Unnamed: 0,state,color,food,age,height,score
Niko,TX,green,Lamb,2,70,8.3
Penelope,AL,white,Apple,4,80,3.3
Christina,TX,black,Melon,33,172,9.5
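
The step works the same way for columns. The sketch below rebuilds the sample data inline (assumed values from the table at the top of the chapter) and takes every other row and every other column.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK", "TX", "TX"],
        "color": ["blue", "green", "red", "white", "gray", "black", "red"],
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese", "Melon", "Beans"],
        "age": [30, 2, 12, 4, 32, 33, 69],
        "height": [165, 70, 120, 80, 180, 172, 150],
        "score": [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
    },
    index=["Jane", "Niko", "Aaron", "Penelope", "Dean", "Christina", "Cornelia"],
)

# Every other row from 'Niko' through 'Christina'.
every_other_row = df.loc["Niko":"Christina":2]

# Every other column from 'state' through 'score', for all rows.
every_other_col = df.loc[:, "state":"score":2]

print(list(every_other_row.index))
print(list(every_other_col.columns))
```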


## Select a single row and a single column
If the row and column selections are both a single label, then a scalar value and NOT a DataFrame or Series is returned.

In [14]:
rows = 'Jane'
cols = 'state'
df.loc[rows, cols]

'NY'
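
To confirm that the result is a plain value rather than a pandas container, its type can be inspected. This is a minimal sketch on a trimmed copy of the sample data.

```python
import pandas as pd

# Trimmed copy of the sample data (assumed values from the table above).
df = pd.DataFrame(
    {"state": ["NY", "TX"], "age": [30, 2]},
    index=["Jane", "Niko"],
)

state = df.loc["Jane", "state"]
age = df.loc["Jane", "age"]

# Neither value is a Series or a DataFrame.
print(isinstance(state, (pd.Series, pd.DataFrame)))
print(state, age)
```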

### Select a single row as a Series with `loc`
The `loc` indexer will return a single row as a Series when given a single row label. Let's select the row for `Niko`. Notice that the column names have now become index labels.

In [15]:
df.loc['Niko']

state        TX
color     green
food       Lamb
age           2
height       70
score       8.3
Name: Niko, dtype: object

### Confusing output
This output is potentially confusing. The once horizontal `Niko` row is now displayed as a vertical Series. It has the appearance of a column, but if you look at the values, you will see that they align with their old column names, which are now the index labels.
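
Inspecting the returned Series makes the mapping clearer: its index holds the old column names and its name is the row label. The sketch below uses a trimmed copy of the sample data.

```python
import pandas as pd

# Trimmed copy of the sample data (assumed values from the table above).
df = pd.DataFrame(
    {"state": ["NY", "TX"], "food": ["Steak", "Lamb"], "age": [30, 2]},
    index=["Jane", "Niko"],
)

row = df.loc["Niko"]

print(row.name)         # the row label
print(list(row.index))  # the old column names
print(row["food"])      # look up a value by its old column name
```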

## Summary of `loc`
* Primarily uses labels
* Selects rows and columns simultaneously
* Selection can be a single label, a list of labels, a slice of labels, or a boolean Series/array
* A comma separates row and column selections

