
# Selecting Subsets of Data in Pandas


## Part 1: Selection with `[]`, `.loc` and `.iloc`

Pandas offers so many options for subset selection that covering them takes multiple articles.



# The anatomy of a DataFrame and a Series
The pandas library has two primary containers of data, the DataFrame and the Series. You will spend nearly all your time working with these two objects when you use pandas.

At first glance, the DataFrame looks like any other two-dimensional table of data that you have seen. It has rows and it has columns. Technically, there are three main components of the DataFrame.

## The three components of a DataFrame
A DataFrame is composed of three different components, the **index**, **columns**, and the **data**. The data is also known as the **values**.

The index represents the sequence of values on the far left-hand side of the DataFrame. All the values in the index are in **bold** font. Each individual value of the index is called a **label**. Sometimes the index is referred to as the **row labels**. In the simplest case, the row labels are not very interesting and are just the integers from 0 up to n-1, where n is the number of rows in the table. Pandas gives a DataFrame this simple index by default.

The columns are the sequence of values at the very top of the DataFrame. They are also in **bold** font. Each individual value of the columns is called a **column**, but can also be referred to as **column name** or **column label**.
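
As a small illustration of these two components, the sketch below builds a tiny DataFrame without specifying an index. The column names and values are made up for the example and are not the workshop data.

```python
import pandas as pd

# Hypothetical two-column table; no index is given, so pandas supplies one
small = pd.DataFrame({"food": ["Steak", "Lamb", "Mango"], "age": [30, 2, 12]})

row_labels = list(small.index)       # default labels: 0 .. n-1
column_labels = list(small.columns)  # labels taken from the dict keys

print(row_labels)     # [0, 1, 2]
print(column_labels)  # ['food', 'age']
```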



## Axis and axes
It is also common terminology to refer to the rows or columns as an **axis**. Collectively, we call them **axes**. So, the rows form one axis and the columns form the other.

The word axis appears as a parameter in many DataFrame methods. Pandas allows you to choose the direction of how the method will work with this parameter. This has nothing to do with subset selection so you can just ignore it for now.


### Each row has a label and each column has a label
The main takeaway from the DataFrame anatomy is that each row has a label and each column has a label. These labels are used to refer to specific rows or columns in the DataFrame. It's the same as how humans use names to refer to specific people.

# What is subset selection?

Before we start doing subset selection, it might be good to define what it is. Subset selection is simply selecting particular rows and columns of data from a DataFrame (or Series). This could mean selecting all the rows and some of the columns, some of the rows and all of the columns, or some of each of the rows and columns.

# Focusing only on `[]`, `.loc`, and `.iloc`
There are many ways to select subsets of data, but in this article we will only cover the usage of the square brackets (**`[]`**), **`.loc`** and **`.iloc`**. Collectively, they are called the **indexers**. These are by far the most common ways to select data. A different part of this series will discuss a few methods that can be used to make subset selections.

If you have a DataFrame, `df`, your subset selection will look something like the following:

```
df[ ]
df.loc[ ]
df.iloc[ ]
```

A real subset selection will have something inside of the square brackets. All selections in this workshop will take place inside of those square brackets.

Notice that the square brackets also follow `.loc` and `.iloc`. All indexing in Python happens inside of these square brackets.

# A term for just those square brackets
The term **indexing operator** is used to refer to the square brackets following an object. The **`.loc`** and **`.iloc`** indexers also use the indexing operator to make selections. I will use the term **just the indexing operator** to refer to **`df[]`**. This will distinguish it from **`df.loc[]`** and **`df.iloc[]`**.

# Read data into a DataFrame with `read_csv`
Let's begin by using pandas to read in a DataFrame and, from there, use the indexing operator by itself to select subsets of data.



In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('/content/sample_data.csv', index_col=0)
df

name,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2
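
The path `/content/sample_data.csv` only exists in the original Colab session. If the file is not at hand, a sketch like the following rebuilds the same table. The CSV text is typed in by hand from the output above, and the header `name` for the first column is inferred from the index name shown there.

```python
import io
import pandas as pd

# Values copied from the displayed output; this is an assumption about the file's contents
csv_text = """name,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2
"""

# index_col=0 turns the first column (the names) into the row labels
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df.shape)  # (7, 6)
```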


# Extracting the individual DataFrame components
Earlier, we mentioned the three components of the DataFrame: the index, columns and data (values). We can extract each of these components into its own variable. Let's do that and then inspect them:

In [None]:
index = df.index
columns = df.columns
values = df.values

In [None]:
index

Index(['Jane', 'Niko', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'], dtype='object', name='name')

In [None]:
columns

Index(['state', 'color', 'food', 'age', 'height', 'score'], dtype='object')

In [None]:
values

array([['NY', 'blue', 'Steak', 30, 165, 4.6],
       ['TX', 'green', 'Lamb', 2, 70, 8.3],
       ['FL', 'red', 'Mango', 12, 120, 9.0],
       ['AL', 'white', 'Apple', 4, 80, 3.3],
       ['AK', 'gray', 'Cheese', 32, 180, 1.8],
       ['TX', 'black', 'Melon', 33, 172, 9.5],
       ['TX', 'red', 'Beans', 69, 150, 2.2]], dtype=object)

# Data types of the components
Let's output the type of each component to understand exactly what kind of object they are.

In [None]:
type(index)

pandas.core.indexes.base.Index

In [None]:
type(columns)

pandas.core.indexes.base.Index

In [None]:
type(values)

numpy.ndarray

# Understanding these types
Interestingly, both the index and the columns are the same type. They are both a pandas **`Index`** object. This object is quite powerful in itself, but for now you can just think of it as a sequence of labels for either the rows or the columns.

The values are a NumPy **`ndarray`**, which stands for n-dimensional array, and is the primary container of data in the NumPy library. Pandas is built directly on top of NumPy and it's this array that is responsible for the bulk of the workload.
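
The `dtype=object` in the array above appears because the columns hold different kinds of data, so the combined array falls back to generic Python objects. A minimal sketch with a made-up two-column frame shows the difference between the whole table and a single numeric column.

```python
import pandas as pd

# Hypothetical subset of the workshop table: one text column, one integer column
small = pd.DataFrame(
    {"food": ["Steak", "Lamb", "Mango"], "age": [30, 2, 12]},
    index=["Jane", "Niko", "Aaron"],
)

all_values = small.values             # mixes text and integers
age_values = small["age"].to_numpy()  # integers only

print(all_values.dtype)  # object
print(age_values.dtype)  # an integer dtype, typically int64
```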

# Beginning with just the indexing operator on DataFrames
We will begin our journey of selecting subsets by using just the indexing operator on a DataFrame. Its main purpose is to select a single column or multiple columns of data.

## Selecting a single column as a Series
To select a single column of data, simply put the name of the column between the brackets. Let's select the `food` column:

In [None]:
df['food']

name,food
Jane,Steak
Niko,Lamb
Aaron,Mango
Penelope,Apple
Dean,Cheese
Christina,Melon
Cornelia,Beans


# Anatomy of a Series
Selecting a single column of data returns the other pandas data container, the Series. A Series is a one-dimensional sequence of labeled data. There are two main components of a Series, the **index** and the **data** (or **values**). There are NO columns in a Series.

The visual display of a Series is just plain text, as opposed to the nicely styled table for DataFrames. The sequence of person names on the left is the index. The sequence of food items on the right is the values.

You will also notice two extra pieces of data on the bottom of the Series. The **name** of the Series is the old column name. You will also see the data type, or **`dtype`**, of the Series. You can ignore both of these items for now.
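
The pieces of a Series described here can be inspected directly. The sketch below uses a made-up three-row subset of the table rather than the full workshop data.

```python
import pandas as pd

# Hypothetical subset of the workshop table
small = pd.DataFrame(
    {"food": ["Steak", "Lamb", "Mango"], "age": [30, 2, 12]},
    index=pd.Index(["Jane", "Niko", "Aaron"], name="name"),
)

food = small["food"]

print(type(food).__name__)  # Series
print(list(food.index))     # ['Jane', 'Niko', 'Aaron']
print(food.to_list())       # ['Steak', 'Lamb', 'Mango']
print(food.name)            # food
```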

# Selecting multiple columns with just the indexing operator
It's possible to select multiple columns with just the indexing operator by passing it a list of column names. Let's select `color`, `food`, and `score`:

In [None]:
df[['color', 'food', 'score']]

name,color,food,score
Jane,blue,Steak,4.6
Niko,green,Lamb,8.3
Aaron,red,Mango,9.0
Penelope,white,Apple,3.3
Dean,gray,Cheese,1.8
Christina,black,Melon,9.5
Cornelia,red,Beans,2.2


# Selecting multiple columns returns a DataFrame
Selecting multiple columns returns a DataFrame. You can actually select a single column as a DataFrame with a one-item list:

In [None]:
df[['food']]

name,food
Jane,Steak
Niko,Lamb
Aaron,Mango
Penelope,Apple
Dean,Cheese
Christina,Melon
Cornelia,Beans


Although this resembles the Series from above, it is technically a DataFrame, a different object.
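
One way to see the difference is to compare the type and shape of the two results. This is a minimal sketch on a made-up three-row frame, not the workshop data.

```python
import pandas as pd

# Hypothetical subset of the workshop table
small = pd.DataFrame(
    {"food": ["Steak", "Lamb", "Mango"], "age": [30, 2, 12]},
    index=["Jane", "Niko", "Aaron"],
)

as_series = small["food"]   # bare column name
as_frame = small[["food"]]  # one-item list

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
```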

# Column order doesn't matter
When selecting multiple columns, you can select them in any order that you choose. It doesn't have to be the same order as the original DataFrame. For instance, let's select `height` and `color`.

In [None]:
df[['height', 'color']]

name,height,color
Jane,165,blue
Niko,70,green
Aaron,120,red
Penelope,80,white
Dean,180,gray
Christina,172,black
Cornelia,150,red


# Exceptions
There are a couple of common exceptions that arise when doing selections with just the indexing operator.
* If you misspell a column name, you will get a **`KeyError`**
* If you forget to use a list to contain multiple columns, you will also get a **`KeyError`**

In [None]:
#df['hight'] #misspell a word = KeyError

In [None]:
#df['color', 'age'] # should be:  df[['color', 'age']]
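
To see both mistakes without stopping the notebook, the failing selections can be wrapped in `try`/`except`. The frame below is a made-up two-row subset used only for illustration.

```python
import pandas as pd

# Hypothetical subset of the workshop table
small = pd.DataFrame(
    {"color": ["blue", "green"], "age": [30, 2], "height": [165, 70]},
    index=["Jane", "Niko"],
)

errors = []
# small['color', 'age'] is the same as passing the tuple ('color', 'age')
for bad_key in ["hight", ("color", "age")]:
    try:
        small[bad_key]
    except KeyError as exc:
        errors.append(type(exc).__name__)

print(errors)                         # ['KeyError', 'KeyError']
print(small[["color", "age"]].shape)  # (2, 2)
```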

# Summary of just the indexing operator
* Its primary purpose is to select columns by the column names
* Select a single column as a Series by passing the column name directly to it: **`df['col_name']`**
* Select multiple columns as a DataFrame by passing a **list** to it: **`df[['col_name1', 'col_name2']]`**
* You can also select rows with it by passing a slice, but this is confusing and not used often, so it is left for the Extra Topics section

# Getting started with `.loc`
The **`.loc`** indexer selects data in a different way than just the indexing operator. It can select subsets of rows or columns. It can also simultaneously select subsets of rows and columns. Most importantly, it only selects data by the **LABEL** of the rows and columns.

# Select a single row as a Series with `.loc`
The **`.loc`** indexer will return a single row as a Series when given a single row label. Let's select the row for **`Niko`**.

In [None]:
df.loc['Niko']

,Niko
state,TX
color,green
food,Lamb
age,2
height,70
score,8.3


We now have a Series, where the old column names are now the index labels. The **`name`** of the Series has become the old index label, **`Niko`** in this case.
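
A short sketch of this relationship, on a made-up two-row subset of the table:

```python
import pandas as pd

# Hypothetical subset of the workshop table
small = pd.DataFrame(
    {"state": ["NY", "TX"], "food": ["Steak", "Lamb"], "age": [30, 2]},
    index=pd.Index(["Jane", "Niko"], name="name"),
)

niko = small.loc["Niko"]

print(niko.name)          # Niko
print(list(niko.index))   # ['state', 'food', 'age']
print(niko.loc["food"])   # Lamb
```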

# Select multiple rows as a DataFrame with `.loc`
To select multiple rows, put all the row labels you want to select in a list and pass that to **`.loc`**. Let's select `Niko` and `Penelope`.

In [None]:
df.loc[['Niko', 'Penelope']]

name,state,color,food,age,height,score
Niko,TX,green,Lamb,2,70,8.3
Penelope,AL,white,Apple,4,80,3.3


# Use slice notation to select a range of rows with `.loc`
It is possible to 'slice' the rows of a DataFrame with `.loc` by using **slice notation**. Slice notation uses a colon to separate **start**, **stop** and **step** values. For instance, we can select all the rows from `Niko` through `Dean` like this:

In [None]:
df.loc['Niko':'Dean']

name,state,color,food,age,height,score
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8


# `.loc` includes the last value with slice notation
Notice that the row labeled with `Dean` was kept. In other data containers such as Python lists, the last value is excluded.
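
The contrast with a plain Python list can be sketched side by side. The names and ages come from the first five rows of the table; the Series itself is built by hand for the example.

```python
import pandas as pd

names = ["Jane", "Niko", "Aaron", "Penelope", "Dean"]
ages = pd.Series([30, 2, 12, 4, 32], index=names)

by_label = ages.loc["Niko":"Dean"]  # stop label is included
from_list = names[1:4]              # plain list: stop position is excluded

print(list(by_label.index))  # ['Niko', 'Aaron', 'Penelope', 'Dean']
print(from_list)             # ['Niko', 'Aaron', 'Penelope']
```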

# Other slices
You can use slice notation similarly to how you use it with lists. Let's slice from the beginning through `Aaron`:

In [None]:
df.loc[:'Aaron']

name,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0


Slice from `Niko` to `Christina` stepping by 2:

In [None]:
df.loc['Niko':'Christina':2]

name,state,color,food,age,height,score
Niko,TX,green,Lamb,2,70,8.3
Penelope,AL,white,Apple,4,80,3.3
Christina,TX,black,Melon,33,172,9.5


Slice from `Dean` to the end:

In [None]:
df.loc['Dean':]

name,state,color,food,age,height,score
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


# Selecting rows and columns simultaneously with `.loc`
Unlike just the indexing operator, it is possible to select rows and columns simultaneously with `.loc`. You do it by separating your row and column selections by a **comma**. It will look something like this:

```
df.loc[row_selection, column_selection]
```

## Select two rows and three columns
For instance, if we wanted to select the rows `Dean` and `Cornelia` along with the columns `age`, `state` and `score` we would do this:

In [None]:
df.loc[['Dean', 'Cornelia'], ['age', 'state', 'score']]

name,age,state,score
Dean,32,AK,1.8
Cornelia,69,TX,2.2


# Use any combination of selections for either row or columns for `.loc`
Row or column selections can be any of the following as we have already seen:
* A single label
* A list of labels
* A slice with labels

We can use any of these three for either row or column selections with **`.loc`**. Let's see some examples.

Let's select two rows and a single column:

In [None]:
df.loc[['Dean', 'Aaron'], 'food']

name,food
Dean,Cheese
Aaron,Mango


Select a slice of rows and a list of columns:

In [None]:
df.loc['Jane':'Penelope', ['state', 'color']]

name,state,color
Jane,NY,blue
Niko,TX,green
Aaron,FL,red
Penelope,AL,white


Select a single row and a single column. This returns a scalar value.

In [None]:
df.loc['Jane', 'age']

30

Select a slice of rows and a slice of columns:

In [None]:
df.loc[:'Dean', 'height':]

name,height,score
Jane,165,4.6
Niko,70,8.3
Aaron,120,9.0
Penelope,80,3.3
Dean,180,1.8


## Selecting all of the rows and some columns
It is possible to select all of the rows by using a single colon. You can then select columns as normal:

In [None]:
df.loc[:, ['food', 'color']]

name,food,color
Jane,Steak,blue
Niko,Lamb,green
Aaron,Mango,red
Penelope,Apple,white
Dean,Cheese,gray
Christina,Melon,black
Cornelia,Beans,red


You can also use this notation to select all of the columns:

In [None]:
df.loc[['Penelope','Cornelia'], :]

name,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Cornelia,TX,red,Beans,69,150,2.2


But, it isn't necessary as we have seen, so you can leave out that last colon:

In [None]:
df.loc[['Penelope','Cornelia']]

name,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Cornelia,TX,red,Beans,69,150,2.2


# Assign row and column selections to variables
It might be easier to assign row and column selections to variables before you use `.loc`. This is useful if you are selecting many rows or columns:

In [None]:
rows = ['Jane', 'Niko', 'Dean', 'Penelope', 'Christina']
cols = ['state', 'age', 'height', 'score']
df.loc[rows, cols]

name,state,age,height,score
Jane,NY,30,165,4.6
Niko,TX,2,70,8.3
Dean,AK,32,180,1.8
Penelope,AL,4,80,3.3
Christina,TX,33,172,9.5


# Summary of `.loc`

* Only uses labels
* Can select rows and columns simultaneously
* Selection can be a single label, a list of labels or a slice of labels
* Put a comma between row and column selections

# Getting started with `.iloc`
The `.iloc` indexer is very similar to `.loc` but only uses integer locations to make its selections. The name `.iloc` itself stands for integer location, which should help you remember what it does.
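
To connect the two views, the sketch below looks up the integer location of a row label and then selects the same row both ways. It uses `Index.get_loc`, a pandas method that the tutorial itself does not cover, on a made-up four-row subset of the table.

```python
import pandas as pd

# Hypothetical subset of the workshop table
small = pd.DataFrame(
    {"state": ["NY", "TX", "FL", "AL"], "age": [30, 2, 12, 4]},
    index=["Jane", "Niko", "Aaron", "Penelope"],
)

position = small.index.get_loc("Penelope")  # integer location of a row label

by_position = small.iloc[position]
by_label = small.loc["Penelope"]

print(position)                      # 3
print(by_position.equals(by_label))  # True
```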

# Selecting a single row with `.iloc`
By passing a single integer to `.iloc`, it will select one row as a Series:

In [None]:
df.iloc[3]

,Penelope
state,AL
color,white
food,Apple
age,4
height,80
score,3.3


# Selecting multiple rows with `.iloc`
Use a list of integers to select multiple rows:

In [None]:
df.iloc[[5, 2, 4]]           # remember, don't do df.iloc[5, 2, 4]  Error!

name,state,color,food,age,height,score
Christina,TX,black,Melon,33,172,9.5
Aaron,FL,red,Mango,12,120,9.0
Dean,AK,gray,Cheese,32,180,1.8
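
The error mentioned in the comment above can be observed safely with `try`/`except`. This sketch uses a made-up numeric frame and catches `Exception` broadly, since the exact exception class is an implementation detail that may differ between pandas versions.

```python
import pandas as pd

# Hypothetical 6x3 frame, just large enough for the positions used below
small = pd.DataFrame({"a": range(6), "b": range(6), "c": range(6)})

try:
    small.iloc[5, 2, 4]   # three comma-separated selections on a 2-D object
    outcome = "no error"
except Exception as exc:  # broad on purpose; the exact class may vary by version
    outcome = type(exc).__name__

rows = small.iloc[[5, 2, 4]]  # the list form selects three rows

print(outcome)
print(list(rows.index))  # [5, 2, 4]
```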


# Use slice notation to select a range of rows with `.iloc`
Slice notation works just like it does with a list in this instance and is exclusive of the last element:

In [None]:
df.iloc[3:5]

name,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8


Select from integer location 3 to the end:

In [None]:
df.iloc[3:]

name,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


Select from integer location 3 to the end, stepping by 2:

In [None]:
df.iloc[3::2]

name,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Christina,TX,black,Melon,33,172,9.5


# Selecting rows and columns simultaneously with `.iloc`
Just like with `.loc`, any combination of a single integer, a list of integers or a slice can be used to select rows and columns simultaneously. Just remember to separate the selections with a **comma**.

Select two rows and two columns:

In [None]:
df.iloc[[2,3], [0, 4]]

name,state,height
Aaron,FL,120
Penelope,AL,80


Select a slice of the rows and two columns:

In [None]:
df.iloc[3:6, [1, 4]]

name,color,height
Penelope,white,80
Dean,gray,180
Christina,black,172


Select slices for both:

In [None]:
df.iloc[2:5, 2:5]

name,food,age,height
Aaron,Mango,12,120
Penelope,Apple,4,80
Dean,Cheese,32,180


Select a single row and column:

In [None]:
df.iloc[0, 2]

'Steak'

Select all the rows and a single column:

In [None]:
df.iloc[:, 5]

name,score
Jane,4.6
Niko,8.3
Aaron,9.0
Penelope,3.3
Dean,1.8
Christina,9.5
Cornelia,2.2


# Selecting subsets of Series

We can also, of course, do subset selection with a Series. Earlier I recommended using just the indexing operator for column selection on a DataFrame. Since Series do not have columns, I suggest using only **`.loc`** and **`.iloc`**. You can use just the indexing operator, but it's ambiguous because it can take both labels and integers. I will come back to this at the end of the tutorial.

Typically, you will create a Series by selecting a single column from a DataFrame. Let's select the **`food`** column:

In [None]:
food = df['food']
food

name,food
Jane,Steak
Niko,Lamb
Aaron,Mango
Penelope,Apple
Dean,Cheese
Christina,Melon
Cornelia,Beans


# Series selection with `.loc`
Series selection with `.loc` is quite simple, since we are only dealing with a single dimension. You can again use a single row label, a list of row labels or a slice of row labels to make your selection. Let's see several examples.

Let's select a single value:

In [None]:
food.loc['Aaron']

'Mango'

Select three different values. This returns a Series:

In [None]:
food.loc[['Dean', 'Niko', 'Cornelia']]

name,food
Dean,Cheese
Niko,Lamb
Cornelia,Beans


Slice from `Niko` to `Christina`, which is inclusive of the last label:

In [None]:
food.loc['Niko':'Christina']

name,food
Niko,Lamb
Aaron,Mango
Penelope,Apple
Dean,Cheese
Christina,Melon


Slice from `Penelope` to the end:

In [None]:
food.loc['Penelope':]

name,food
Penelope,Apple
Dean,Cheese
Christina,Melon
Cornelia,Beans


Select a single value in a list, which returns a Series:

In [None]:
food.loc[['Aaron']]

name,food
Aaron,Mango


# Series selection with `.iloc`
Series subset selection with **`.iloc`** happens similarly to **`.loc`** except it uses integer location. You can use a single integer, a list of integers or a slice of integers. Let's see some examples.

Select a single value:

In [None]:
food.iloc[0]

'Steak'

Use a list of integers to select multiple values:

In [None]:
food.iloc[[4, 1, 3]]

name,food
Dean,Cheese
Niko,Lamb
Penelope,Apple


Use a slice, which is exclusive of the last integer:

In [None]:
food.iloc[4:6]

name,food
Dean,Cheese
Christina,Melon


# Comparison to Python lists and dictionaries
It may be helpful to compare pandas' ability to make selections by label and integer location to that of Python lists and dictionaries.

Python lists allow for selection of data only through integer location. You can use a single integer or slice notation to make the selection but NOT a list of integers.

Let's see examples of subset selection of lists using integers:

In [None]:
some_list = ['a', 'two', 10, 4, 0, 'asdf', 'mgmt', 434, 99]

In [None]:
some_list[5]

'asdf'

In [None]:
some_list[-1]

99

In [None]:
some_list[:4]

['a', 'two', 10, 4]

In [None]:
some_list[3:]

[4, 0, 'asdf', 'mgmt', 434, 99]

In [None]:
some_list[2:6:3]

[10, 'asdf']

### Selection by label with Python dictionaries
All values in each dictionary are labeled by a **key**. We use this key to make single selections. Dictionaries only allow selection with a single label. Slices and lists of labels are not allowed.

In [None]:
d = {'a':1, 'b':2, 't':20, 'z':26, 'A':27}

In [None]:
d['a']

1

In [None]:
d['A']

27

### Pandas has the power of lists and dictionaries
DataFrames and Series are able to make selections with integers like a list and with labels like a dictionary.
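
A rough side-by-side sketch of this comparison follows. The list and dictionary are shortened versions of the ones above, and the Series `s` is built from the dictionary purely for illustration.

```python
import pandas as pd

some_list = ["a", "two", 10, 4]
d = {"a": 1, "b": 2, "t": 20}
s = pd.Series(d)  # index labels come from the dict keys

try:
    some_list[[0, 2]]  # lists reject a list of positions
    list_result = "ok"
except TypeError:
    list_result = "TypeError"

try:
    d[["a", "t"]]      # dicts reject a list of keys
    dict_result = "ok"
except TypeError:
    dict_result = "TypeError"

print(list_result, dict_result)     # TypeError TypeError
print(s.iloc[[0, 2]].to_list())     # [1, 20]
print(s.loc[["a", "t"]].to_list())  # [1, 20]
print(s.loc["a":"t"].to_list())     # [1, 2, 20]
```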

# Extra Topics
A few more items are important enough to belong in this tutorial and are covered now.

# Using just the indexing operator to select rows from a DataFrame - Confusing!
Above, I used just the indexing operator to select a column or columns from a DataFrame. But, it can also be used to select rows using a **slice**. This behavior is very confusing in my opinion. The entire operation changes completely when a slice is passed.

Let's use an integer slice as our first example:

In [None]:
df[3:6]

name,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5


To add to this confusion, you can slice by labels as well.

In [None]:
df['Aaron':'Christina']

name,state,color,food,age,height,score
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5


# I recommend not doing this!
This feature is not deprecated, and it is completely up to you whether you use it. However, I strongly prefer not to select rows in this manner, as it can be ambiguous, especially if you have integers in your index.

Using **`.iloc`** and **`.loc`** is explicit and clearly tells the person reading the code what is going to happen. Let's rewrite the above using **`.iloc`** and **`.loc`**.

In [None]:
df.iloc[3:6]      # more explicit than df[3:6]

name,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5


In [None]:
df.loc['Aaron':'Christina']     # more explicit than df['Aaron':'Christina']

name,state,color,food,age,height,score
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
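
The ambiguity with integers in the index, mentioned above, can be sketched with a made-up frame whose row labels are themselves integers. The same-looking slice gives different rows depending on whether it is read as positions or as labels.

```python
import pandas as pd

# Hypothetical frame whose row labels are integers
numbered = pd.DataFrame({"value": list("abcde")}, index=[0, 1, 2, 3, 4])

plain = numbered[1:3]             # easy to misread: position or label?
by_position = numbered.iloc[1:3]  # positions 1 and 2, stop excluded
by_label = numbered.loc[1:3]      # labels 1 through 3, stop included

print(list(plain.index))        # [1, 2]
print(list(by_position.index))  # [1, 2]
print(list(by_label.index))     # [1, 2, 3]
```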


# Cannot simultaneously select rows and columns with `[]`
An exception will be raised if you try to select rows and columns simultaneously with just the indexing operator. You must use **`.loc`** or **`.iloc`** to do so.

In [None]:
#df[3:6, 'Aaron':'Christina']
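
A hedged sketch of what happens: the plain-bracket attempt fails, while `.loc` handles the same kind of request. The frame is a made-up three-row subset, and `Exception` is caught broadly because the exact exception class may differ between pandas and Python versions.

```python
import pandas as pd

# Hypothetical subset of the workshop table
small = pd.DataFrame(
    {"state": ["NY", "TX", "FL"], "food": ["Steak", "Lamb", "Mango"], "age": [30, 2, 12]},
    index=["Jane", "Niko", "Aaron"],
)

try:
    small[0:2, "food":"age"]  # rows and columns together with plain brackets
    outcome = "no error"
except Exception as exc:      # broad on purpose; the exact class may vary
    outcome = type(exc).__name__

both = small.loc["Jane":"Niko", "food":"age"]  # explicit row and column slices

print(outcome)
print(both.shape)  # (2, 2)
```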

# Using just the indexing operator to select rows from a Series - Confusing!
You can also use just the indexing operator with a Series. Again, this is confusing because it can accept integers or labels. Let's see some examples:

In [None]:
food

name,food
Jane,Steak
Niko,Lamb
Aaron,Mango
Penelope,Apple
Dean,Cheese
Christina,Melon
Cornelia,Beans


In [None]:
food[2:4]

name,food
Aaron,Mango
Penelope,Apple


In [None]:
food['Niko':'Dean']

name,food
Niko,Lamb
Aaron,Mango
Penelope,Apple
Dean,Cheese


Since Series don't have columns, you can use a single label or a list of labels to make selections as well:

In [None]:
food['Dean']

'Cheese'

In [None]:
food[['Dean', 'Christina', 'Aaron']]

name,food
Dean,Cheese
Christina,Melon
Aaron,Mango


Again, I recommend against doing this; always use **`.iloc`** or **`.loc`** instead.
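
One case where the plain indexing operator is easy to misread is a Series with integer labels. The following sketch uses a made-up Series whose labels are integers in shuffled order, so a label and a position with the same number point at different values.

```python
import pandas as pd

# Hypothetical Series whose labels are integers in a shuffled order
s = pd.Series(["first", "second", "third"], index=[2, 0, 1])

plain = s[0]             # looked up as a label here, not as a position
by_label = s.loc[0]
by_position = s.iloc[0]

print(plain)        # second
print(by_label)     # second
print(by_position)  # first
```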

# Summary
We covered an incredible amount of ground. Let's summarize all the main points:

* Before learning pandas, ensure you have the fundamentals of Python
* Always refer to the documentation when learning new pandas operations
* The DataFrame and the Series are the containers of data
* A DataFrame is two-dimensional, tabular data
* A Series is a single dimension of data
* The three components of a DataFrame are the **index**, the **columns** and the **data** (or **values**)
* Each row and column of the DataFrame is referenced by both a **label** and an **integer location**
* There are three primary ways to select subsets from a DataFrame - **`[]`**, **`.loc`** and **`.iloc`**
* I use the term **just the indexing operator** to refer to **`[]`** immediately following a DataFrame/Series
* Just the indexing operator's primary purpose is to select a column or columns from a DataFrame
* Using a single column name to just the indexing operator returns a single column of data as a Series
* Passing multiple columns in a list to just the indexing operator returns a DataFrame
* A Series has two components, the **index** and the **data** (**values**). It has no columns
* **`.loc`** makes selections **only by label**
* **`.loc`** can simultaneously select rows and columns
* **`.loc`** can make selections with either a single label, a list of labels, or a slice of labels
* **`.loc`** makes row selections first followed by column selections: **`df.loc[row_selection, col_selection]`**
* **`.iloc`** is analogous to **`.loc`** but uses only **integer location** to refer to rows or columns
* **`.ix`** was deprecated and then removed in pandas 1.0, so it should not be used
* **`.loc`** and **`.iloc`** work the same for Series except they only select based on the index, as there are no columns
* Pandas combines the power of Python lists (selection via integer location) and dictionaries (selection by label)
* You can use just the indexing operator to select rows from a DataFrame, but I recommend against this and instead sticking with the explicit **`.loc`** and **`.iloc`**
* Normally data is imported without setting an index. Use the **`set_index`** method to use a column as an index.
* You can select a single column as a Series from a DataFrame with dot notation
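
The last two points in the summary can be sketched briefly. The frame `raw` is a made-up example of data imported without an index, so the names sit in an ordinary column.

```python
import pandas as pd

# Hypothetical table imported without an index
raw = pd.DataFrame(
    {"name": ["Jane", "Niko", "Aaron"], "food": ["Steak", "Lamb", "Mango"], "age": [30, 2, 12]}
)

indexed = raw.set_index("name")  # returns a new DataFrame; raw is unchanged

print(list(raw.index))      # [0, 1, 2]
print(list(indexed.index))  # ['Jane', 'Niko', 'Aaron']

# Dot notation works for column names that are valid Python identifiers
print(indexed.food.equals(indexed["food"]))  # True
```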