# Advanced tabular data analysis

In the previous notebook, we recapped some of the basic tabular data manipulation tools we covered in Introduction to Python. In this notebook, we'll expand beyond those with some new, more-advanced features of Pandas.

## Exercise

Import the pandas, numpy, and matplotlib libraries with their usual aliases `pd`, `np`, and `plt`. Load the New York data once again, and perform the same data cleaning steps we did before: merge in the borough names, rename the Roadway Name column to `roadway_name`, and parse the dates. Put the dataset in a variable named `data`.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
data = pd.read_csv("../data/Traffic_Volume_Counts.csv")

In [None]:
data = data.rename(columns={"Roadway Name": "roadway_name"})

In [None]:
borough = pd.read_csv("../data/Traffic_Borough.csv")

In [None]:
borough["borough"] = borough.RBoro.replace({
    1: "Manhattan",
    2: "Bronx",
    3: "Brooklyn",
    4: "Queens",
    5: "Staten Island"
}).astype("category")


In [None]:
borough = borough.drop_duplicates().dropna(subset="borough")

In [None]:
data = data.merge(borough, on="SegmentID", how="left", validate="m:1")

In [None]:
data["Date"] = pd.to_datetime(data.Date, format="%m/%d/%Y")

## The index

Every `pandas` dataframe has an _index_. This is what is displayed in bold to the left of the data when you display a dataframe.

In [None]:
data

The _index_ is basically a special column that Pandas uses as the label for each row. When you first read a CSV, the index will always just be sequential numbers. This is used to _align data_ when performing operations. For example, consider something like the following:

In [None]:
data["am_traffic"] = data["7:00-8:00AM"] + data["8:00-9:00AM"]

In [None]:
data.am_traffic

Pandas knew which values to add up because it matched rows with the same index value. This is true even when the two datasets are not in the same order. For example, suppose we sorted the 8-9AM values by total traffic.

In [None]:
sorted_traffic = data["8:00-9:00AM"].sort_values()
data["am_traffic_sorted"] = data["7:00-8:00AM"] + sorted_traffic
data.am_traffic_sorted

This looks like the same result we got before, and it should be - Pandas should have used the index to align the values. We can confirm that they're all equal.

In [None]:
assert (data.am_traffic_sorted == data.am_traffic).all()

Well, that was not what I expected. Let's look at the ones that are different.

In [None]:
data.loc[data.am_traffic_sorted != data.am_traffic]

This is once again missing data causing issues in our analysis. Because many different operations can result in NaN, NaN is considered to not equal NaN. For example, in pandas, dividing zero by zero results in a NaN, and so does taking the square root of a negative number with `np.sqrt`. We wouldn't want code that compared those two results to silently say they were equal, even though both were invalid.

In some languages, any comparisons with NaN result in NaN, which usually eventually causes an error. This is not the case with Python. So if you had a function that compared two variables, and both were NaN, it might say they were not equal, and hide that you had missing values in your computation. So in Python, I recommend frequently checking for missing values using the `.isnull()` function.
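To make the NaN behavior concrete, here is a minimal sketch using a made-up three-element Series (the values are purely illustrative and not from the traffic data). It shows that an equality check quietly returns `False` wherever a value is missing, while `.isnull()` flags the missing value explicitly.

```python
import numpy as np
import pandas as pd

# Plain Python floats: NaN never compares equal, not even to itself.
nan = float("nan")
nan_equals_itself = nan == nan

# Hypothetical toy Series with one missing value (not the traffic data).
s = pd.Series([1.0, np.nan, 3.0])

# Element-wise comparison gives False wherever either side is NaN...
naive_equal = s == s
# ...so missing values have to be detected explicitly.
missing = s.isnull()

print(nan_equals_itself)
print(naive_equal.tolist())
print(missing.tolist())
```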

We can rewrite our assertion to check that they are either equal or both NaN. They are. The index alignment worked.

In [None]:
assert ((data.am_traffic_sorted == data.am_traffic) | (data.am_traffic_sorted.isnull() & data.am_traffic.isnull())).all()
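If you'd like to see index alignment without loading the traffic data, the following is a small sketch with two made-up Series (`early` and `late` are hypothetical names I've chosen for illustration). One of them is re-sorted before adding, and the sums still line up by index label.

```python
import pandas as pd

# Hypothetical counts for three rows, labelled 0, 1 and 2.
early = pd.Series([10, 20, 30], index=[0, 1, 2])
late = pd.Series([1, 2, 3], index=[0, 1, 2])

# Reorder one Series; the index labels travel with the values.
late_shuffled = late.sort_values(ascending=False)

# Addition pairs values by index label, not by position.
total = early + late
total_shuffled = early + late_shuffled

print(late_shuffled.index.tolist())
# Sort by index before comparing so the row order of the result doesn't matter.
print(total_shuffled.sort_index().tolist())
print(total.tolist())
```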

## Meaningful indices

By default, Pandas uses a row number as an index. But it is possible to have meaningful indices as well. For example, there is a `RowID` field in the data. This might be a reference into some other database, or some other identifier used for tracking purposes. (It's not; I created it for the purposes of this exercise. But let's ignore that for now.)

We can use `.set_index` to set this as the index of the dataframe.

In [None]:
data = data.set_index("RowID")
data

Now, we can use `.loc` to look up items by their index. For instance, let's look up RowID 5512.

In [None]:
data.loc[5512]

You can also select multiple rows, by enclosing multiple index values in a list.

In [None]:
data.loc[[5512, 5518]]

"Slicing" is also a possibility. This uses the `:` operator to specify a range of values. For instance, let's select all rows between 5512 and 5518.

In [None]:
data.loc[5512:5518,:]

That might not be what you expect - you might expect 5512:5518 to give you 7 rows, but we got 8,714. But look closely at the first and last rows - 5512 and 5518. Pandas has given us all rows that sit between these two rows positionally, not necessarily numerically. We can sort the data frame based on the index, so that positional and numerical ranges are equivalent.

In [None]:
data = data.sort_index()
data.loc[5512:5518,:]

That's more what we might expect. Note that slicing by label with `.loc` returns the rows for both the start and the end of the range specified.
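The same effect can be reproduced on a tiny scale. This is a sketch with a made-up five-row frame whose index labels are unique but deliberately out of order; the labels and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy frame whose index labels are unique but out of order.
df = pd.DataFrame(
    {"value": ["a", "b", "c", "d", "e"]},
    index=[30, 10, 20, 40, 15],
)

# Unsorted: the slice runs from the row labelled 10 to the row labelled 40
# in their current positions, so labels 30 and 15 are left out.
unsorted_labels = df.loc[10:40].index.tolist()

# Sorted: the same slice now covers every label from 10 through 40.
sorted_labels = df.sort_index().loc[10:40].index.tolist()

print(unsorted_labels)
print(sorted_labels)
```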

## Exercise

Use `.loc` and a slice to select range 2100-2105. Then re-sort the dataframe based on `am_traffic` using the `sort_values` cell below. Note that running your code with the slice now produces a different result.

In [None]:
data.loc[2100:2105]

In [None]:
data = data.sort_values("am_traffic")

## Selecting rows and columns at the same time

By adding a comma to `.loc`, we can select a column or columns as well. If we want to select multiple columns, enclose them in another set of `[]`.

In [None]:
data = data.sort_index()
data.loc[5142:5145, "roadway_name"]

In [None]:
data.loc[5142:5145, ["roadway_name", "am_traffic"]]

## Integer / positional indexing

`.loc` selects rows or columns based on their index or names. Sometimes, you want to select based on numeric position (often, for example, to extract the first row). `.iloc` indexes data frames based on integers.

In [None]:
# we are going to sort the data frame again to get the index out of order
data = data.sort_values("am_traffic")

In [None]:
data.iloc[2:4]

Note that we got the third and fourth rows (row indexing starts with 0). You can display the whole data frame to check if you want. Also note that only two rows were returned; when using `.iloc`, the first value of the slice is included (row #2) but the last value is excluded (row #4). This is how slices usually work in Python.
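Since `.loc` slices include the end of the range and `.iloc` slices do not, it can help to see the two side by side. Here is a minimal sketch on a made-up frame; I've used letters as index labels (an assumption for illustration) so that labels and positions can't be confused.

```python
import pandas as pd

# Hypothetical toy frame with string labels, so positions and labels differ.
df = pd.DataFrame({"value": [10, 20, 30, 40, 50]}, index=list("abcde"))

# Position-based: start included, stop excluded (positions 2 and 3 only).
by_position = df.iloc[2:4]

# Label-based: both endpoints included (labels "c", "d" and "e").
by_label = df.loc["c":"e"]

print(by_position.index.tolist())
print(by_label.index.tolist())
```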

A challenge is when you want to select columns and rows using `.iloc`. `.iloc` refers to both by their positions, but generally it's much preferable to refer to columns by name. The following does not work:

In [None]:
data.iloc[2:4, "roadway_name"]

If you are only retrieving values, you can select the rows you want, and then the columns (or vice-versa):

In [None]:
data.iloc[2:4]["roadway_name"]

However, if you are changing the data, you _must not do this_. This is called chained indexing, and it can lead to unexpected results. The problem is that `.iloc[2:4]` may (but does not always) create a copy of that part of the original dataset. If you then change a column value, it may (or may not) only be represented in that copy.

When running the code below, you will get a "setting with copy warning" which warns of exactly this situation. Note that the roadway names did not change in the original data.

In [None]:
data.iloc[2:4]["roadway_name"] = "TEST"

In [None]:
data.iloc[2:4]

The way around this is awkward, but you can use the `.columns.get_loc` function to get the appropriate positional index of a column. That said, needing to combine `iloc` with modifying data is rare; I've only ever had to do it once.

In [None]:
data.iloc[2:4, data.columns.get_loc("roadway_name")] = "TEST"

In [None]:
data.iloc[2:4]
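Here is the same `get_loc` pattern as a self-contained sketch, using a made-up four-row frame rather than the traffic data (the street names and counts are invented for illustration). The point is that rows and column are both given by position in one `.iloc` call, so the assignment lands in the original frame.

```python
import pandas as pd

# Hypothetical toy frame; names and values are made up for illustration.
df = pd.DataFrame(
    {
        "roadway_name": ["A ST", "B ST", "C ST", "D ST"],
        "am_traffic": [5, 6, 7, 8],
    }
)

# Translate the column name into its integer position...
col_pos = df.columns.get_loc("roadway_name")

# ...then select rows and column by position in a single .iloc call,
# so the assignment writes to the original frame rather than a copy.
df.iloc[1:3, col_pos] = "TEST"

print(df)
```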

### Exercise

Update the borough for records 2-10 to be Queens.

In [None]:
data.iloc[2:10, data.columns.get_loc("borough")] = "Queens"

In [None]:
data

## Non-unique indices

Up until now, we've only used indices where every value was unique. It's possible to use a non-unique index as well. For instance, let's set the SegmentID to be the index. Most segments have multiple observations, so this is non-unique. You'll notice I've added a `.reset_index` call before the call to `set_index`. `reset_index` will convert the existing index back into a column, so we don't lose that information. 

In [None]:
data = data.reset_index().set_index("SegmentID")
data

Now, fetching a single index using `.loc` may result in more than one row.

In [None]:
data.loc[35832]

However, it may also result in a single row.

In [None]:
data.loc[202]

Any code you write will have to handle the possibility of getting either a dataframe or a single row when using `.loc`. For this reason, I prefer to avoid non-unique indices. The code below will assert that the index is unique.

In [None]:
assert not data.index.duplicated().any()
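To illustrate why this matters, here is a small sketch on a made-up frame where one index label appears twice and another only once. The frame and its column names are assumptions for illustration. It also uses `Index.is_unique`, which is another way to check for duplicates.

```python
import pandas as pd

# Hypothetical toy frame: segment 1 has two observations, segment 2 has one.
df = pd.DataFrame(
    {"segment": [1, 1, 2], "count": [10, 20, 30]}
).set_index("segment")

several = df.loc[1]  # label appears twice, so we get a DataFrame
single = df.loc[2]   # label appears once, so we get a Series

print(type(several).__name__, type(single).__name__)

# Passing a list of labels returns a DataFrame in both cases.
always_frame = df.loc[[2]]
print(type(always_frame).__name__)

# False here, because segment 1 is repeated.
print(df.index.is_unique)
```

If you do end up with a non-unique index, wrapping the label in a list is one way to get a consistent return type.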

## Exercise

Set the index to roadway_name, and extract all records from "EAST 241 STREET".

In [None]:
data = data.reset_index().set_index("roadway_name")
data.loc["EAST 241 STREET"]

## Hierarchical indexing / MultiIndex

Pandas also allows multiple columns to be set as the index, which allows for a few additional features. Unique identification of the data in this dataset should be a combination of SegmentID, Direction, and Date. Let's index by all three columns. This requires passing a list to `set_index`.

In [None]:
data = data.reset_index().set_index(["SegmentID", "Direction", "Date"])
data

You can think of the multiindex as being an index where each item is a tuple (SegmentID, Direction, Date). You can select by all or part of this tuple, but you always have to go from left to right - i.e. you can't select by direction unless you also select by SegmentID. When you select a single value from a level in your call to `.loc`, that index level will drop off in the result. For instance, there is no SegmentID when I select only SegmentID 89274.

In [None]:
data.loc[89274]

In [None]:
data.loc[89274, "WB"]

In [None]:
data.loc[89274, "WB", "2016-03-02"]

### Selecting columns with a multiindex

Previously, we used the comma to separate the row indexes from the column indexes. But in the examples above, we used the comma to separate different index levels. To select columns, you need to enclose all of your row selectors in a tuple.

In [None]:
data.loc[(89274, "WB", "2016-03-02"), "roadway_name"]
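The following sketch repeats these selections on a made-up four-row frame, so you can run it without the traffic data. The level names `segment`, `direction`, and `date` and all of the values are assumptions for illustration. I've used `pd.Timestamp` objects for the dates here rather than strings.

```python
import pandas as pd

d1, d2 = pd.Timestamp("2016-03-01"), pd.Timestamp("2016-03-02")

# Hypothetical toy counts indexed by (segment, direction, date).
df = pd.DataFrame(
    {
        "segment": [1, 1, 1, 2],
        "direction": ["EB", "WB", "WB", "EB"],
        "date": [d1, d1, d2, d2],
        "count": [10, 20, 30, 40],
    }
).set_index(["segment", "direction", "date"]).sort_index()

# Selecting one value of the first level drops that level from the result.
one_segment = df.loc[1]
print(list(one_segment.index.names))

# Wrapping the row key in a tuple leaves the slot after the comma for columns.
one_direction = df.loc[(1, "WB"), :]
value = df.loc[(1, "WB", d2), "count"]
print(len(one_direction), value)
```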

### Selecting multiple values in a multiindex

Like with a regular index, you can select multiple values by enclosing them in a list. Keep in mind that the order matters: a list of tuples selects those exact index values, while a tuple of lists selects any combination of the specified index values. When using a list of tuples, you must include all index levels.

When doing anything more complicated than indexing based on a single value at each level like we did above, you should add a `, :` at the end of your call to `.loc` to tell it to select all columns (or, alternately, specify the columns you want to select). Otherwise, `pandas` may misinterpret part of your selection as referring to the columns you want, which will either produce an error or give unwanted results.

In [None]:
data.loc[[(89274, "WB", "2016-02-27"), (156485, "EB", "2015-02-07")], :]

You can also use a tuple of lists, which allows any combination of the listed values.

In [None]:
data.loc[([89274, 156485], ["EB", "WB"], ["2016-02-27", "2015-02-07"]), :]
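The difference between a list of tuples and a tuple of lists is easier to see on a small example. This is a sketch on a made-up six-row frame; the level names and values are assumptions for illustration, and the dates are `pd.Timestamp` objects rather than strings.

```python
import pandas as pd

d1, d2 = pd.Timestamp("2016-03-01"), pd.Timestamp("2016-03-02")

# Hypothetical toy counts indexed by (segment, direction, date).
df = pd.DataFrame(
    {
        "segment": [1, 1, 1, 2, 2, 3],
        "direction": ["EB", "WB", "WB", "EB", "WB", "NB"],
        "date": [d1, d1, d2, d2, d1, d1],
        "count": [10, 20, 30, 40, 50, 60],
    }
).set_index(["segment", "direction", "date"]).sort_index()

# List of tuples: exactly these two index entries.
exact = df.loc[[(1, "WB", d1), (2, "EB", d2)], :]

# Tuple of lists: every existing row whose segment, direction and date
# each appear in the corresponding list.
combos = df.loc[([1, 2], ["EB", "WB"], [d1, d2]), :]

print(exact["count"].tolist())
print(sorted(combos["count"]))
```

In this toy data, the list of tuples returns two rows, while the tuple of lists returns every row except the one for segment 3.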

It is also possible to slice MultiIndexes, but it is very confusing and I wouldn't recommend it. You can read more about that [in the pandas documentation](https://pandas.pydata.org/docs/user_guide/advanced.html#advanced-indexing-with-hierarchical-index). That page also has lots of instructions on using MultiIndexes in general, and it was the source of many of the examples above.

## Exercise

Select northbound counts only for sensors 88137 and 36705.

In [None]:
data.loc[([88137, 36705], "NB"), :]

## Getting rid of the index

Sometimes, you may want to get rid of the index you've set, and get back to an index that's just row numbers, 0 through whatever. `.reset_index` will do this. Any columns previously used as part of the index will be converted back into columns (though they may not be in the same order they were before).

In [None]:
data = data.reset_index()

In [None]:
data
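As a quick self-contained check of what `.reset_index` does, here is a sketch on a made-up two-row frame (the column names are assumptions for illustration). The index levels come back as ordinary columns, and the new index is just row numbers.

```python
import pandas as pd

# Hypothetical toy frame with a two-level index.
df = pd.DataFrame(
    {"segment": [1, 2], "direction": ["EB", "WB"], "count": [10, 20]}
).set_index(["segment", "direction"])

# Move the index levels back into columns and restore a 0..n-1 index.
flat = df.reset_index()

print(flat.columns.tolist())
print(flat.index.tolist())
```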

## Indexing: Matt's perspective

Maybe it's because I was an R user first, but I've never been a big fan of indexing in `pandas`; I find it confusing and prone to creating errors (for example, the slicing situation above, where sorting the data changed the results). When I'm selecting data I generally prefer to use the masking syntax we've seen before, and when combining datasets I like to use merge with a common key, rather than relying on indexing. I will occasionally use a meaningful index, perhaps even a hierarchical one, if the selection possibilities are very desirable for a particular problem.