# Selecting Subsets of Data from DataFrames with `loc`

In this chapter, we use the `loc` indexer to select subsets of data from DataFrames. The `loc` indexer selects data in a different manner than *just the brackets*. It has its own separate set of rules that we must learn. 

## Simultaneous row and column subset selection

The `loc` indexer can select rows and columns simultaneously.  This is done by separating the row and column selections with a **comma**. The selection will look something like this:

```python
df.loc[rows, cols]
```

### Just the brackets cannot do this

Simultaneous row and column subset selection is not possible with *just the brackets*. Reiterating from above, the `loc` indexer has a completely different and distinct set of rules that you must abide by to use it correctly. It's best to forget about how *just the brackets* works when first learning subset selection with `loc`.

### `loc` primarily selects data by label

Very importantly, `loc` primarily selects subsets by the **label** of the rows and columns. It also makes selections via boolean selection, a topic covered in a later chapter.

### Read in data

Let's get started by reading in a sample DataFrame with the first column set as the index.

In [1]:
import pandas as pd
df = pd.read_csv('input/sample_data.csv', index_col=0)
df

### Select two rows and three columns with `loc`

Let's make our first selection with `loc` by simultaneously selecting some rows and some columns. Let's select the rows `Dean` and `Cornelia` along with the columns `age`, `state`, and `score`. The row labels go in one list and the column labels in another, and both lists are placed within the brackets following `loc`. The row and column selections must be separated by a comma.

In [None]:
rows = ['Dean', 'Cornelia']
cols = ['age', 'state', 'score']
df.loc[rows, cols]

### The possible types of row and column selections

In the above example, we used a list of labels for both the row and column selection. You are not limited to just lists. All of the following are valid objects available for both row and column selections with `loc`.

* A single label
* A list of labels
* A slice with labels
* A boolean Series (covered in a later chapter)
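
As a minimal sketch of the label-based selection types listed above, the following builds a small DataFrame and applies a single label, a list, and a slice. The DataFrame is a hypothetical stand-in for the sample data, and every value in it is invented for illustration. Boolean Series are left out since they are covered later.

```python
import pandas as pd

# Hypothetical stand-in for the sample data; all values are invented
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL"],
        "age": [30, 2, 12, 4],
        "score": [4.6, 8.3, 9.0, 3.3],
    },
    index=["Jane", "Niko", "Aaron", "Penelope"],
)

single = df.loc["Niko", "age"]                         # a single label for rows and columns
listed = df.loc[["Jane", "Aaron"], ["age", "score"]]   # a list of labels for each
sliced = df.loc["Niko":"Penelope", "state":"age"]      # a slice with labels for each
mixed = df.loc[["Jane", "Aaron"], "state":"age"]       # the types can be mixed freely

print(single)
print(listed)
print(sliced)
print(mixed)
```

The row and column selections are independent of one another, so any type on the left of the comma can be paired with any type on the right.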

### Select two rows and a single column

Let's select the rows `Dean` and `Aaron` along with the `food` column. We can use a list for the row selection and a single string for the column selection.

In [None]:
rows = ['Dean', 'Aaron']
cols = 'food'
df.loc[rows, cols]

### Series returned

In the above example, a Series and not a DataFrame was returned. Whenever you select a single row or a single column using a string label, and use a list or slice for the other selection, pandas returns a Series.
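
One way to check this is to look at the type of the result. The sketch below uses an invented DataFrame (a hypothetical stand-in for the sample data) and compares a string label against a one-item list for the column selection.

```python
import pandas as pd

# Hypothetical stand-in for the sample data; all values are invented
df = pd.DataFrame(
    {"food": ["Steak", "Lamb", "Mango"], "age": [30, 12, 32]},
    index=["Jane", "Aaron", "Dean"],
)

one_col = df.loc[["Dean", "Aaron"], "food"]      # string label for the column
as_frame = df.loc[["Dean", "Aaron"], ["food"]]   # one-item list for the column

print(type(one_col).__name__)    # Series
print(type(as_frame).__name__)   # DataFrame
```

The string label drops the column dimension, while the one-item list keeps it.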

## `loc` with slice notation

Let's take a moment to review Python's slice notation, which is used to select subsets from some core Python objects such as lists, tuples, and strings. Slice notation always has three components - the **start**, **stop**, and **step**. Syntactically, each component is separated by a colon like this - `start:stop:step`. All components of slice notation are optional. Each has a default value if not included in the notation: the start defaults to the beginning, the stop defaults to the end, and the step size defaults to 1.

### Example slices

Let's take a look at several slice notations and the value of each component of the slice.

* `'Niko':'Christina':2` - start is 'Niko', stop is 'Christina', step is 2
* `'Niko':'Christina'` - start is 'Niko', stop is 'Christina', step is 1
* `'Niko'::2` - start is 'Niko', stop is the end, step is 2
* `'Niko':` - start is 'Niko', stop is the end, step is 1
* `:'Christina':2` - start is the beginning, stop is 'Christina', step is 2
* `:` - start is the beginning, stop is the end, step is 1. All components take their default value.
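
The same defaults can be seen with integer positions on a plain Python list. This is a small sketch using a made-up list of letters, where each slice mirrors one of the patterns above.

```python
letters = ["a", "b", "c", "d", "e", "f"]

full = letters[1:5:2]       # start 1, stop 5, step 2
no_step = letters[1:5]      # step defaults to 1
no_stop = letters[1::2]     # stop defaults to the end
no_start = letters[:5:2]    # start defaults to the beginning
everything = letters[:]     # all components take their default value

print(full)        # ['b', 'd']
print(no_step)     # ['b', 'c', 'd', 'e']
print(no_stop)     # ['b', 'd', 'f']
print(no_start)    # ['a', 'c', 'e']
print(everything)  # ['a', 'b', 'c', 'd', 'e', 'f']
```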

This same slice notation is allowed within the `loc` indexer. Let's select all of the rows from `Jane` to `Penelope` with slice notation along with the columns `state` and `color`.

In [None]:
cols = ['state', 'color']
df.loc['Jane':'Penelope', cols]

### Slice notation is inclusive of the stop label

Slice notation with the `loc` indexer includes the stop label. This behaves differently than slicing done on Python lists, which is exclusive of the stop integer.
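
The difference can be seen side by side in the following sketch. It assumes a small invented DataFrame (a hypothetical stand-in for the sample data) and a Python list holding the same names.

```python
import pandas as pd

names = ["Jane", "Niko", "Aaron", "Penelope", "Dean"]

# Hypothetical stand-in for the sample data; the ages are invented
df = pd.DataFrame({"age": [30, 2, 12, 4, 32]}, index=names)

by_label = df.loc["Niko":"Penelope"]   # label slice: the stop label is included
by_position = names[1:3]               # list slice: position 3 ('Penelope') is excluded

print(list(by_label.index))   # ['Niko', 'Aaron', 'Penelope']
print(by_position)            # ['Niko', 'Aaron']
```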

### Slice notation only works within the brackets attached to the object

Python only allows us to use slice notation within the brackets that are attached to an object. If we try to assign slice notation to a variable outside of those brackets, for instance with `rows = 'Jane':'Penelope'`, Python raises a `SyntaxError`.
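
The sketch below demonstrates the error without halting the script by compiling the offending statement from a string. It also shows Python's built-in `slice` function, which is not covered in this chapter but is one way to store a slice in a variable. The DataFrame is an invented stand-in for the sample data.

```python
import pandas as pd

# Hypothetical stand-in for the sample data; the ages are invented
df = pd.DataFrame(
    {"age": [30, 2, 12, 4]},
    index=["Jane", "Niko", "Aaron", "Penelope"],
)

# Compile the statement from a string so the SyntaxError can be caught
try:
    compile("rows = 'Jane':'Aaron'", "<example>", "exec")
    raised = False
except SyntaxError:
    raised = True
print(raised)   # True

# The built-in slice function builds the same slice as an assignable object
rows = slice("Jane", "Aaron")
same = df.loc[rows].equals(df.loc["Jane":"Aaron"])
print(same)     # True
```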

### Slice both the rows and columns

Both row and column selections support slice notation. In the following example, we slice all the rows from the beginning up to and including label `Dean` along with columns from `height` until the end.

In [None]:
df.loc[:'Dean', 'height':]

### Selecting all of the rows and some of the columns

It is possible to use slice notation to select all of the rows or columns. We do so with a single colon, which is sometimes referred to as the **empty slice**. In this example, we select all of the rows and two of the columns.

In [None]:
cols = ['food', 'color']
df.loc[:, cols]

### Could have used *just the brackets*

It isn't necessary to use `loc` for this selection as we are only selecting two distinct columns. This could have been accomplished with *just the brackets*.

In [None]:
cols = ['food', 'color']
df[cols]

### A single colon is slice notation to select all values

That single colon might be intimidating, but it is technically slice notation that selects all items. In the following example, all of the elements of a Python list are selected using a single colon.

In [None]:
a_list = [1, 2, 3, 4, 5, 6]
a_list[:]

### Use a single colon to select all the columns

It is possible to use a single colon to represent a slice of all of the rows or all of the columns. Below, a colon is used as slice notation for all of the columns.

In [None]:
rows = ['Penelope','Cornelia']
df.loc[rows, :]

### The above can be shortened

By default, pandas selects all of the columns if you only provide a row selection. Providing the colon is not necessary so the following syntax makes the exact same selection.

In [None]:
rows = ['Penelope', 'Cornelia']
df.loc[rows]

Though it is not syntactically necessary, one reason to use the colon is to reinforce the idea that `loc` may be used for simultaneous row and column selection. The first object passed to `loc` always selects rows and the second always selects columns.
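
As a quick check that the two forms make the same selection, the following sketch compares them with the `equals` method on an invented DataFrame (a hypothetical stand-in for the sample data).

```python
import pandas as pd

# Hypothetical stand-in for the sample data; all values are invented
df = pd.DataFrame(
    {"state": ["NY", "AL", "TX"], "color": ["blue", "white", "red"]},
    index=["Jane", "Penelope", "Cornelia"],
)

rows = ["Penelope", "Cornelia"]
without_colon = df.loc[rows]
with_colon = df.loc[rows, :]

identical = without_colon.equals(with_colon)
print(identical)   # True
```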

### Use slice notation to select a range of rows with all of the columns

Similarly, we can use slice notation to select several rows at a time. Below, the slice begins at the row labeled by `Niko` and goes all the way through `Dean`. We do not provide a specific column selection to return all of the columns.

In [None]:
df.loc['Niko':'Dean']

You could have written the above as `df.loc['Niko':'Dean', :]` to reinforce the fact that `loc` first selects rows and then columns.

### Changing the step size

The step size must be an integer when using slice notation with `loc`. In this example, we select every other row beginning at `Niko` and ending at `Christina`.

In [None]:
df.loc['Niko':'Christina':2, :]

### Select a single row and a single column

If the row and column selections are both a single label, then a scalar value and NOT a DataFrame or Series is returned.

In [None]:
rows = 'Jane'
cols = 'state'
df.loc[rows, cols]

### Select a single row as a Series with `loc`

The `loc` indexer returns a single row as a Series when given a single row label. Let's select the row for `Niko`. Notice that the column names have now become index labels.

In [None]:
df.loc['Niko']

Again, the column selection isn't necessary, but does provide clarity.

In [None]:
df.loc['Niko', :]

### Confusing output

This output is potentially confusing. The original row that was labeled by `Niko` had horizontal data. Selecting a single row returns a Series that displays the row data vertically.
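
To make the change in orientation concrete, this sketch selects one row from an invented DataFrame (a hypothetical stand-in for the sample data) and inspects the resulting Series.

```python
import pandas as pd

# Hypothetical stand-in for the sample data; all values are invented
df = pd.DataFrame(
    {"state": ["NY", "TX"], "age": [30, 2]},
    index=["Jane", "Niko"],
)

row = df.loc["Niko"]

print(type(row).__name__)   # Series
print(row.index.tolist())   # ['state', 'age']
print(row.name)             # Niko
```

The column names of the DataFrame become the index of the Series, and the row label becomes its name.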

### Selecting a single row as a DataFrame

It is possible to select a single row as a DataFrame instead of a Series. Create the row selection as a one-item list instead of just a string label. The returned result is a DataFrame and maintains the same horizontal orientation for the row.

In [None]:
rows = ['Niko']
df.loc[rows, :]

## Summary of the `loc` indexer

* Primarily uses labels
* Selects rows and columns simultaneously with `df.loc[rows, cols]`
* Both row and column selections can be a:
    * single label
    * list of labels
    * slice of labels
    * boolean Series
* A comma separates row and column selections

## Exercises

Read in the movie dataset by executing the cell below and use it for the following exercises.

In [None]:
pd.set_option('display.max_columns', 50)
movie = pd.read_csv('input/movie.csv', index_col='title')
movie.head(3)

# The DataFrame and Series

The DataFrame and Series are the two primary objects when using pandas to analyze data. In this chapter, we will learn how to read in data into a DataFrame and understand its components. We will also learn how to select a single column of data as a Series and examine its components.

## Reading external data with pandas

The one thing you need for data analysis is **data**. If you do not have any data, then you won't be able to use pandas to analyze it. This book contains many data sets stored externally in the `data` directory one level above where this notebook resides. Most of these datasets are stored as comma-separated value (**CSV**) files. These CSVs are human-readable and separate each individual piece of data with a comma. The comma is referred to as the **delimiter**. Despite the name, CSV files can use other one-character delimiters besides commas, such as tabs or semicolons.
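
As a small illustration of delimiters, the sketch below reads two in-memory strings with the `read_csv` function, one delimited by commas and one by semicolons. The text and its values are invented, and `io.StringIO` is used only so that no file on disk is needed. The `sep` parameter tells pandas which delimiter to expect.

```python
import io
import pandas as pd

# Invented data, written once with commas and once with semicolons
comma_text = "name,age\nJane,30\nNiko,2\n"
semicolon_text = "name;age\nJane;30\nNiko;2\n"

comma_df = pd.read_csv(io.StringIO(comma_text))                # comma is the default
semicolon_df = pd.read_csv(io.StringIO(semicolon_text), sep=";")

same = comma_df.equals(semicolon_df)
print(same)   # True
```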


### City of Chicago bike rides

We begin our data analysis adventure with a dataset on public bike rides from the city of Chicago. The data is contained in the `bikes.csv` file. There are about 50,000 recorded rides from 2013 through 2017. Each row of the dataset represents a single ride from a single person using the city's public bike stations. There are 19 columns of data containing information on gender, start time, trip duration, bike station name, temperature, wind speed, and more. Let's print out the first three lines of the `bikes.csv` file using Python's built-in capabilities for reading files. This does not use pandas. Take note of the commas separating each value on each line.

In [None]:
with open('input/bikes.csv') as f:
    for i in range(3):
        print(f.readline())     

### Understanding the file location

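The cell above opens the file with the relative location `input/bikes.csv`. With the datasets stored in the `data` directory one level above this notebook, as described earlier, the location would instead be the string `../data/bikes.csv`. Both are relative to the directory where this notebook resides on your machine. Let's cover every part of the second string to ensure we understand what it means.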

The file location string begins with two dots, `..`. This translates as 'move one level above the current working directory' to the **Jupyter Notebooks** directory. Appearing next in the string is `/data`, which translates as 'move down into the `data` directory'.

Note that the forward slash was written to separate the directories. Both macOS and Linux operating systems use this forward slash to separate directories and files from one another. On the other hand, the Windows operating system uses the backslash. Fortunately, we can always use a forward slash regardless of our operating system, as Python will automatically handle the file location string for us.
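
One way to see that a forward-slash string is understood under both conventions is with the standard library's `pathlib` module. This sketch uses `PurePosixPath` and `PureWindowsPath`, which only parse the string and never touch the file system, so it runs the same on any operating system.

```python
from pathlib import PurePosixPath, PureWindowsPath

location = "../data/bikes.csv"

posix_parts = PurePosixPath(location).parts       # macOS and Linux rules
windows_parts = PureWindowsPath(location).parts   # Windows rules

print(posix_parts)                 # ('..', 'data', 'bikes.csv')
print(windows_parts)               # ('..', 'data', 'bikes.csv')
print(PureWindowsPath(location))   # ..\data\bikes.csv
```

Both conventions split the string into the same three parts, and the Windows version is displayed with backslashes.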

The string ends with `/bikes.csv`, which translates as 'reference the filename `bikes.csv`'. In summary, the file location `../data/bikes.csv` represents a relative location to where the dataset resides.

### Import pandas

To use the pandas library, we need to import it into our namespace. By convention, pandas is imported and aliased to the name `pd`. After running the import statement below, we will have access to all pandas objects through the variable name `pd`. It is possible to use any other valid variable name as an alias, but it's best to use `pd` as the official documentation uses it along with most everyone else.

In [None]:
import pandas as pd
bikes = pd.read_csv('input/bikes.csv')

### Display DataFrame in Jupyter Notebook

We assigned the output from the `read_csv` function to the `bikes` variable name, which now refers to a DataFrame object. To visually display the DataFrame, place the variable name as the last line in a code cell. By default, pandas outputs the first and last 5 rows and first and last 10 columns. If there are 60 or fewer total rows, it displays all rows. We cover how to change these display options in an upcoming chapter.

### `head` and `tail` methods

A very useful and simple method is `head`, which returns the first 5 rows of the DataFrame by default. This avoids long default output and is something I highly recommend when doing data analysis within a notebook. The `tail` method returns the last 5 rows by default. There will only be a few instances in the book where the `head` method is not used, as displaying up to 60 rows is far too many and will take up a lot of space on a screen or page.

In [None]:
bikes.head()

The last five rows of the DataFrame may be displayed with the `tail` method.

In [None]:
bikes.tail()

### First and last `n` rows

Both the `head` and `tail` methods accept a single integer parameter `n` controlling the number of rows returned. Here, we output the first three rows.

In [None]:
bikes.head(3)

## Components of a DataFrame

The DataFrame is composed of three separate components - the **columns**, the **index**, and the **data**. These terms will be used throughout the book and understanding them is vital to your ability to use pandas. Take a look at the following graphic of our `bikes` DataFrame stylized to put emphasis on each component.

![DataFrame components](images/df_components.png)

### The columns

The columns provide a **label** for each column and are always displayed in **bold** font above the data. A column is a single vertical sequence of data. In the above DataFrame, the column name `tripduration` references all the values in that column (993, 623, 1040, etc...).

The columns are also referred to as the **column names** or the **column labels** with individual values referred to as a **column name** or **column label**.

Most DataFrames, like the one above, use strings for column names, but it is possible that they can be other types such as integers. The column names are not required to be unique, though having duplicate columns would be bad practice, as it's vital to be able to uniquely identify each column.

### The index

The index provides a **label** for each row and is always displayed to the left of the data. A row is a single horizontal sequence of data. For instance, the index label **3** references all the values in its row (12907, Subscriber, Male, etc...).

The index is also referred to as the **index names/labels** or the **row names/labels** with the individual values referred to as a(n) **index name/label** or **row name/label**.

In the above DataFrame, the index is simply a sequence of integers beginning at 0. The values in the index are not limited to integers. Strings are a common type that are used in the index and make for more descriptive labels.

Surprisingly, values in the index are not required to be unique. In fact, all of the index values can be the same. A row label does not guarantee a one-to-one mapping to one specific row.
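
Here is a minimal sketch of a non-unique index, using a tiny invented DataFrame. The `is_unique` attribute of the index reports whether any label repeats, and selecting the repeated label with `loc` returns every matching row.

```python
import pandas as pd

# Invented data with the row label 'a' appearing twice
df = pd.DataFrame({"tripduration": [993, 623, 1040]}, index=["a", "a", "b"])

unique = df.index.is_unique
matches = df.loc["a"]   # both rows labeled 'a' come back

print(unique)         # False
print(len(matches))   # 2
```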

### The data

The actual data is to the right of the index and below the columns and is displayed with normal font. The data is also referred to as the **values**. The data represents all the values for all the columns. It is important to note that the index and the columns are NOT part of the data. They are separate objects that act as **labels** for either rows or columns.

### The Axes

The index and columns are known collectively as the **axes**, each representing a single **axis** of the two-dimensional DataFrame. pandas uses the integer **0** to reference the index and **1** for the columns.
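
Each component can also be accessed directly as an attribute. The following sketch uses a tiny invented DataFrame in place of `bikes` (the temperature values are made up) and retrieves the index, the columns, the data, and the axes.

```python
import pandas as pd

# Tiny invented stand-in for the bikes DataFrame
bikes = pd.DataFrame(
    {"tripduration": [993, 623, 1040], "temperature": [73.9, 69.1, 73.0]}
)

index = bikes.index        # row labels
columns = bikes.columns    # column labels
data = bikes.to_numpy()    # the values, without any labels
axes = bikes.axes          # a list holding the index (axis 0) and columns (axis 1)

print(index)
print(columns)
print(data)
print(len(axes))   # 2
```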

[1]: images/df_components.png

## What type of object is `bikes`?

Let's verify that `bikes` is indeed a DataFrame with the `type` function.

In [None]:
type(bikes)

### Fully-qualified name

The above output is something called the **fully-qualified name**. Only the word after the last dot is the name of the type. We have now verified that the `bikes` variable has type `DataFrame`. 

The fully-qualified name always contains the package and module name where the type was defined. The package name is the first part of the fully-qualified name and, in this case, is `pandas`. The module name is the word immediately preceding the name of the type. Here, it is `frame`.
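
The pieces of a fully-qualified name can be assembled by hand from a type's `__module__` and `__qualname__` attributes. The helper function below is hypothetical and written only for illustration. It is applied to `Fraction` from the standard library and to an empty DataFrame. The exact module path reported for the DataFrame may differ between pandas versions, so no output is shown for it.

```python
from fractions import Fraction
import pandas as pd


def fully_qualified_name(obj):
    """Hypothetical helper: join the defining module and the type name."""
    cls = type(obj)
    return f"{cls.__module__}.{cls.__qualname__}"


frac_name = fully_qualified_name(Fraction(1, 2))
df_name = fully_qualified_name(pd.DataFrame())

print(frac_name)   # fractions.Fraction
print(df_name)     # begins with 'pandas' and ends with 'DataFrame'
```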

### Package vs Module

A Python **package** is a directory containing other directories or modules that contain Python code. A Python **module** is a file (typically a text file ending in .py) that contains Python code. 

### Sub-packages

Any directory containing other directories or modules within a Python package is considered a **sub-package**. In this case, `core` is the sub-package.

### Where are the packages located on my machine?

Third-party packages are installed in the `site-packages` directory which itself is set up during Python installation. We can get the actual location with help from the standard library's `site` module's `getsitepackages` function.

In [None]:
import site
site.getsitepackages()

If you navigate to this directory in your file system, you'll find the 'pandas' directory. Within it will be a 'core' directory which will contain the 'frame.py' file. It is this file which contains Python code where the DataFrame class is defined.
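
Another way to find the same location, sketched here as an alternative to browsing the file system, is the `__file__` attribute of the imported package. It assumes pandas is installed as ordinary files on disk.

```python
from pathlib import Path
import pandas as pd

package_dir = Path(pd.__file__).parent             # the 'pandas' directory
frame_file = package_dir / "core" / "frame.py"     # file named in the text above

print(package_dir)
print(frame_file.exists())
```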

## Select a single column from a DataFrame - a Series

To select a single column from a DataFrame, append a set of square brackets, `[]`, to the end of the DataFrame variable name. Place the column name as a string within those brackets to select it. This returns a single column of data as a pandas **Series**. This is a separate (but similar) type of object from a DataFrame.

Let's select the column name `tripduration`, assign it to a variable name, and output the first few values to the screen. The `head` and `tail` methods work the same as they do with DataFrames.

In [None]:
td = bikes['tripduration']
td.head()

Select the last three values in the Series by passing the `tail` method the integer 3.

In [None]:
td.tail(3)

Let's verify that `td` has the type Series.

In [None]:
type(td)

## Components of a Series

A Series is a similar type of object to a DataFrame but only contains a single dimension of data. It has two components - the **index** and the **data**. Let's take a look at a stylized Series graphic.

![Series components](images/series_components.png)

It's important to note that a Series has no rows and no columns. In appearance, it resembles a one-column DataFrame, but it technically has no columns. It just has a sequence of values that are labeled by an index.
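
The distinction shows up in the number of dimensions. This sketch builds a small invented Series and compares it with the one-column DataFrame produced by the `to_frame` method, which is not covered in this chapter and is used here only for the comparison.

```python
import pandas as pd

# Invented values standing in for a column of trip durations
td = pd.Series([993, 623, 1040], name="tripduration")
one_col_df = td.to_frame()

print(td.ndim, td.shape)                   # 1 (3,)
print(one_col_df.ndim, one_col_df.shape)   # 2 (3, 1)
print(hasattr(td, "columns"))              # False
```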

### The index

A Series index serves as labels for the values. Each value is referenced by a single **label** or **name**. In the above image, the index label **3** corresponds to the value 667. The Series index is virtually identical to the DataFrame index, so the same rules apply to it. Index values can be duplicated and can be types other than integers, such as strings.

### Output of Series vs DataFrame

Notice that there is no nice HTML styling for the Series. It's just plain text. Below the Series display, you will see a few other items printed to the screen - the **name**, **length**, and **dtype**. These other items are NOT part of the Series itself and are just extra pieces of information to help you understand the Series.

* The **name** is not important right now. If the Series was formed from a column of a DataFrame, it will be set to that column name.
* The **length** is the number of values in the Series
* The **dtype** is the data type of the Series, which will be discussed in an upcoming chapter.
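
These three items can also be retrieved individually. The sketch below assumes a tiny invented DataFrame in place of `bikes` and selects its one column as a Series.

```python
import pandas as pd

# Tiny invented stand-in for the bikes DataFrame
bikes = pd.DataFrame({"tripduration": [993, 623, 1040]})
td = bikes["tripduration"]

name = td.name      # set from the column name
length = len(td)    # number of values
dtype = td.dtype    # data type of the values

print(name)     # tripduration
print(length)   # 3
print(dtype)
```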

[0]: images/series_components.png

## Changing display options

pandas gives you the ability to change how the output on your screen is displayed. For instance, the default number of columns displayed for a DataFrame is 20, meaning that if your DataFrame has more than 20 columns, then only the first and last 10 columns will be shown on the screen. All the other columns will be hidden and unable to be displayed. This is problematic as many DataFrames have more than 20 columns.

### Get current option value with `get_option`

There are a few dozen display options you can control to change the visual representation of your DataFrame. It is not necessary to remember the option names as the official documentation provides descriptions for all [available options](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html).

Let's first learn how to retrieve each option value with the `get_option` function. This is not a DataFrame method, but instead, a function that is accessed directly from `pd`. Below are three of the most common options to change.

In [None]:
pd.get_option('display.max_columns')

In [None]:
pd.get_option('display.max_rows')

In [None]:
pd.get_option('display.max_colwidth')

### Use the `set_option` function to change an option value

To change an option's value, use the `set_option` function. You can set as many options as you would like at one time. Its usage is a bit strange. Pass it the option name as a string and follow it immediately with the value you want to set it to. Continue this pattern of option name followed by new value to set as many options as you desire. Below, we set the maximum number of columns to 100 and the maximum number of rows to 4.

In [None]:
pd.set_option('display.max_columns', 100, 'display.max_rows', 4)

We now read in the housing dataset, which contains 81 columns, all of which will be visible. Run the cell below in your notebook to see the output.

In [None]:
housing = pd.read_csv('input/housing.csv')
housing
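
To confirm that `set_option` took effect without needing a dataset, the option values can be read back with `get_option`. The sketch below also calls `reset_option`, a pandas function not covered above, to restore the defaults afterward.

```python
import pandas as pd

# Set two options in a single call, following the name-then-value pattern
pd.set_option("display.max_columns", 100, "display.max_rows", 4)

cols_after = pd.get_option("display.max_columns")
rows_after = pd.get_option("display.max_rows")
print(cols_after)   # 100
print(rows_after)   # 4

# Restore both options to their default values
pd.reset_option("display.max_columns")
pd.reset_option("display.max_rows")
rows_default = pd.get_option("display.max_rows")
print(rows_default)   # 60
```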