In [1]:
import pandas as pd
import numpy as np

# Section 7: Going Multidimensional

In this section, we'll kick things up a notch and work with multi-index dataframes. This allows us to support more than one level of labels, enabling us to reflect multidimensional datasets within the confines of a two-dimensional data structure.

**Hierarchical indices** represent a hierarchy of relationships that becomes intricately coupled with our data values. Using Pandas techniques and methods, we can very efficiently change these hierarchies in order to answer specific questions.

As a side note, multi-level index, multiindex, and hierarchical index are all used interchangeably within the Pandas community.

## Introducing New Data

In this section, we'll be working with a new dataset that contains daily stock information from 2014 through 2019 for the technology companies Apple, Facebook, Microsoft, Google, and Amazon.

In [2]:
tech_url = 'https://andybek.com/pandas-tech'

In [3]:
tech = pd.read_csv(tech_url)

In [4]:
tech.head()

Unnamed: 0,date,month,year,day,name,open,close,high,low,volume,volume_type
0,2014-01-02,1,2014,2,FB,54.86,54.71,55.22,54.19,43257622,medium
1,2014-01-02,1,2014,2,AAPL,79.38,79.02,79.58,78.86,8398851,low
2,2014-01-02,1,2014,2,GOOGL,557.73,556.56,558.88,554.13,1822719,medium
3,2014-01-02,1,2014,2,MSFT,37.35,37.16,37.4,37.1,30643745,medium
4,2014-01-02,1,2014,2,AMZN,398.8,397.97,399.36,394.02,2140246,medium


Here we see we have date in the first column, then columns of decomposed dates, the stock tickers under "name", the opening, closing, high, and low prices for that day, the volume of shares traded, and the qualitative trade volume.

Examining the shape of the dataframe, we have 7105 rows and 11 columns.

In [5]:
tech.shape

(7105, 11)

Let's also check out the type of data in the dataframe.

In [6]:
tech.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7105 entries, 0 to 7104
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         7105 non-null   object 
 1   month        7105 non-null   int64  
 2   year         7105 non-null   int64  
 3   day          7105 non-null   int64  
 4   name         7105 non-null   object 
 5   open         7105 non-null   float64
 6   close        7105 non-null   float64
 7   high         7105 non-null   float64
 8   low          7105 non-null   float64
 9   volume       7105 non-null   int64  
 10  volume_type  7105 non-null   object 
dtypes: float64(4), int64(4), object(3)
memory usage: 610.7+ KB


We have four floating point columns (open, close, high, low), four integer columns (month, year, day, volume), and three object columns (date, name, and volume_type).

## Index and RangeIndex

Let's review the terminology that we've grown accustomed to. In Series, the index is simply a label for each value in the series. In dataframes, the index still serves as a label for each row, but is accompanied by another set of labels for the column dimension.

Remember that the default index is of the type `RangeIndex`, which is simply an immutable (unchangeable) object that represents a sequence of increasing or decreasing integers. In Pandas, `RangeIndex` is derived from the `Index` class, which happens to be the very same class that the column labels belong to.

In [7]:
type(tech.index)

pandas.core.indexes.range.RangeIndex

In [8]:
type(tech.columns)

pandas.core.indexes.base.Index

So what is the `Index` class? It's essentially another immutable data structure (in this case backed by a NumPy array) that is ordered and sliceable.
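To make these properties concrete, here is a minimal sketch on a tiny hypothetical dataframe. `df` is an assumption for illustration, built from two rows of prices shown earlier, and is not the course dataset. It checks the class relationship and shows what happens when we try to change a label in place.

```python
import pandas as pd

# Hypothetical two-row frame; the values are copied from the first rows of the dataset.
df = pd.DataFrame({"open": [54.86, 79.38], "close": [54.71, 79.02]})

# RangeIndex is a subclass of Index, so row and column labels share a base class.
print(isinstance(df.index, pd.RangeIndex))
print(isinstance(df.index, pd.Index))
print(type(df.columns) is pd.Index)

# Index objects are ordered and sliceable...
print(list(df.columns[:1]))

# ...but immutable: assigning to a single position raises a TypeError.
immutable_error = None
try:
    df.columns[0] = "opening"
except TypeError as err:
    immutable_error = str(err)
print(immutable_error)
```

Replacing the whole index (for example with `set_index()`) is still allowed. Only in-place modification of an existing `Index` object is blocked.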

Oftentimes we find the need to replace the default indices with something more meaningful. We've previously seen this with the `set_index()` method.

In [9]:
tech.set_index('date')

Unnamed: 0_level_0,month,year,day,name,open,close,high,low,volume,volume_type
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2014-01-02,1,2014,2,FB,54.86,54.71,55.22,54.19,43257622,medium
2014-01-02,1,2014,2,AAPL,79.38,79.02,79.58,78.86,8398851,low
2014-01-02,1,2014,2,GOOGL,557.73,556.56,558.88,554.13,1822719,medium
2014-01-02,1,2014,2,MSFT,37.35,37.16,37.40,37.10,30643745,medium
2014-01-02,1,2014,2,AMZN,398.80,397.97,399.36,394.02,2140246,medium
...,...,...,...,...,...,...,...,...,...,...
2019-08-23,8,2019,23,MSFT,137.19,133.39,138.35,132.80,38515386,medium
2019-08-23,8,2019,23,AAPL,209.43,202.64,212.05,201.00,46882843,medium
2019-08-23,8,2019,23,GOOGL,1185.17,1153.58,1195.67,1150.00,1813141,medium
2019-08-23,8,2019,23,AMZN,1793.03,1749.62,1804.90,1745.23,5277898,medium


But what does it mean for *date* to "be" an index? One of the key implications has to do with how we select from the dataframe. If we want to extract all the prices for August 1, 2019, we can use that date label directly in our selection.

In [10]:
tech.set_index('date').loc['2019-08-01']

Unnamed: 0_level_0,month,year,day,name,open,close,high,low,volume,volume_type
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2019-08-01,8,2019,1,GOOGL,1217.63,1211.78,1236.3,1207.0,1771271,medium
2019-08-01,8,2019,1,FB,194.17,192.73,198.47,190.88,17777013,medium
2019-08-01,8,2019,1,MSFT,137.0,138.06,140.94,136.93,40557502,medium
2019-08-01,8,2019,1,AMZN,1871.72,1855.32,1897.92,1844.01,4713311,medium
2019-08-01,8,2019,1,AAPL,213.9,208.43,218.03,206.74,54017922,medium


## Creating a MultiIndex

In the previous lecture, we used the `set_index()` method to change the index from the default to something more meaningful. We can take this a step further and use more than one field as the index for our dataframe.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html

Let's start simply: instead of "date", we'll pass in a list of strings on which to set indices. 

In [11]:
tech.set_index(['date','name'])

Unnamed: 0_level_0,Unnamed: 1_level_0,month,year,day,open,close,high,low,volume,volume_type
date,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2014-01-02,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622,medium
2014-01-02,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851,low
2014-01-02,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719,medium
2014-01-02,MSFT,1,2014,2,37.35,37.16,37.40,37.10,30643745,medium
2014-01-02,AMZN,1,2014,2,398.80,397.97,399.36,394.02,2140246,medium
...,...,...,...,...,...,...,...,...,...,...
2019-08-23,MSFT,8,2019,23,137.19,133.39,138.35,132.80,38515386,medium
2019-08-23,AAPL,8,2019,23,209.43,202.64,212.05,201.00,46882843,medium
2019-08-23,GOOGL,8,2019,23,1185.17,1153.58,1195.67,1150.00,1813141,medium
2019-08-23,AMZN,8,2019,23,1793.03,1749.62,1804.90,1745.23,5277898,medium


The result is a MultiIndex, in which a single index has more than one component to it. When we promoted the "date" and "name" columns to indices, they were removed as regular columns from the dataframe. 

This looks a bit weird initially. But we'll get used to it.

Let's make this change permanent by setting the `inplace` parameter to `True`.

In [12]:
tech.set_index(['date','name'], inplace = True)

Now let's look at the first two days.

In [13]:
tech.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,month,year,day,open,close,high,low,volume,volume_type
date,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2014-01-02,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622,medium
2014-01-02,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851,low
2014-01-02,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719,medium
2014-01-02,MSFT,1,2014,2,37.35,37.16,37.4,37.1,30643745,medium
2014-01-02,AMZN,1,2014,2,398.8,397.97,399.36,394.02,2140246,medium
2014-01-03,FB,1,2014,3,55.0,54.56,55.65,54.53,38287706,medium
2014-01-03,GOOGL,1,2014,3,557.5,552.5,558.47,552.47,1669229,medium
2014-01-03,MSFT,1,2014,3,37.2,36.91,37.22,36.6,31134795,medium
2014-01-03,AAPL,1,2014,3,79.0,77.28,79.1,77.2,14043410,low
2014-01-03,AMZN,1,2014,3,398.29,396.44,402.71,396.22,2213512,medium


In our dataframe, we see that there are two bold-faced columns: the date and the name of the stock. In the rendered output, the date is no longer repeated on every row; a blank cell indicates that the value from the row above carries over.

Hierarchical indexing is not only about looks. By using a multiindex, we're creating a hierarchy of relationships within our data, where the information across the two index levels is inseparable from the actual values.

Let's see what type of object this is.


In [14]:
type(tech.index)

pandas.core.indexes.multi.MultiIndex

We see that our index is now of the type "MultiIndex". Cool!
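As a quick sanity check, the sketch below rebuilds the same kind of index on a four-row stand-in and inspects it. `toy` and `toy_mi` are hypothetical names, and the rows are copied from the output above. It also uses `reset_index()`, which is not covered in this section, to show that the promoted columns can be moved back.

```python
import pandas as pd

# Four rows copied from the output above; `toy` is a hypothetical stand-in for `tech`.
toy = pd.DataFrame({
    "date": ["2014-01-02", "2014-01-02", "2014-01-03", "2014-01-03"],
    "name": ["FB", "GOOGL", "FB", "GOOGL"],
    "open": [54.86, 557.73, 55.00, 557.50],
    "close": [54.71, 556.56, 54.56, 552.50],
})

toy_mi = toy.set_index(["date", "name"])

print(type(toy_mi.index).__name__)   # MultiIndex
print(toy_mi.index.nlevels)          # 2
print(list(toy_mi.index.names))      # ['date', 'name']
print(list(toy_mi.columns))          # ['open', 'close']

# reset_index() moves the index levels back into regular columns.
restored = toy_mi.reset_index()
print(list(restored.columns))        # ['date', 'name', 'open', 'close']
```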

## MultiIndex from `read_csv()`

It turns out that we don't have to wait until the dataframe is read in to set a multiindex. Instead, we can set the multiindex (or even a single index for that matter) when reading in the dataframe with `read_csv`!

All we really need to do is pass in a list of the index columns to the `index_col` parameter.

In [15]:
pd.read_csv(tech_url, index_col = ['date','name'])

Unnamed: 0_level_0,Unnamed: 1_level_0,month,year,day,open,close,high,low,volume,volume_type
date,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2014-01-02,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622,medium
2014-01-02,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851,low
2014-01-02,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719,medium
2014-01-02,MSFT,1,2014,2,37.35,37.16,37.40,37.10,30643745,medium
2014-01-02,AMZN,1,2014,2,398.80,397.97,399.36,394.02,2140246,medium
...,...,...,...,...,...,...,...,...,...,...
2019-08-23,MSFT,8,2019,23,137.19,133.39,138.35,132.80,38515386,medium
2019-08-23,AAPL,8,2019,23,209.43,202.64,212.05,201.00,46882843,medium
2019-08-23,GOOGL,8,2019,23,1185.17,1153.58,1195.67,1150.00,1813141,medium
2019-08-23,AMZN,8,2019,23,1793.03,1749.62,1804.90,1745.23,5277898,medium


There are other ways to create multiindex dataframes as well, but we won't cover them here. Generally speaking, those other methods should only be used in the specific cases that call for them.



## Indexing Hierarchical DataFrames

So we've created hierarchical dataframes. Now how do we extract values from them? When we set the multiindex, we created an association between each pair of date and stock ticker and a value in our dataframe. Thus, indexing the data will look a bit different.

In [16]:
tech.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,month,year,day,open,close,high,low,volume,volume_type
date,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2014-01-02,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622,medium
2014-01-02,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851,low
2014-01-02,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719,medium
2014-01-02,MSFT,1,2014,2,37.35,37.16,37.4,37.1,30643745,medium
2014-01-02,AMZN,1,2014,2,398.8,397.97,399.36,394.02,2140246,medium
2014-01-03,FB,1,2014,3,55.0,54.56,55.65,54.53,38287706,medium
2014-01-03,GOOGL,1,2014,3,557.5,552.5,558.47,552.47,1669229,medium
2014-01-03,MSFT,1,2014,3,37.2,36.91,37.22,36.6,31134795,medium
2014-01-03,AAPL,1,2014,3,79.0,77.28,79.1,77.2,14043410,low
2014-01-03,AMZN,1,2014,3,398.29,396.44,402.71,396.22,2213512,medium


Suppose we want to know what price Google closed at on January 2, 2014. Let's first try selecting for that particular date.

In [17]:
tech.loc['2014-01-02']

Unnamed: 0_level_0,month,year,day,open,close,high,low,volume,volume_type
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
FB,1,2014,2,54.86,54.71,55.22,54.19,43257622,medium
AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851,low
GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719,medium
MSFT,1,2014,2,37.35,37.16,37.4,37.1,30643745,medium
AMZN,1,2014,2,398.8,397.97,399.36,394.02,2140246,medium


Now we've isolated the full dataset for that date. To avoid being redundant, Pandas automatically drops the index that we are indexing along (in this case, "date").

Next, let's extract the Google data by going back to our `loc[]` indexer and adding another label to it.

In [18]:
tech.loc['2014-01-02', 'GOOGL']

month                1
year              2014
day                  2
open            557.73
close           556.56
high            558.88
low             554.13
volume         1822719
volume_type     medium
Name: (2014-01-02, GOOGL), dtype: object

What we get back is a Series containing the values for Google on that single date. Now, to isolate the close price, we can simply grab it out of the series as an attribute.

In [19]:
tech.loc['2014-01-02', 'GOOGL'].close

556.56

Another, perhaps more elegant, way of doing the same thing is to take advantage of the tight coupling that we have between the "date" and "name" fields. They are both levels in our multiindex, so we can treat them as one dimension and capture them in a single tuple of values. In other words, we can identify the rows by the multiindex in one go by using a tuple, instead of narrowing down one level at a time. By using a tuple, we are able to pass in multiple index labels while remaining in the first dimension (rows) of the `loc[]` indexer.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html

In [20]:
tech.loc[('2014-01-02', 'GOOGL')]

month                1
year              2014
day                  2
open            557.73
close           556.56
high            558.88
low             554.13
volume         1822719
volume_type     medium
Name: (2014-01-02, GOOGL), dtype: object

Best of all, we can go in and make use of the second dimension (columns) within `loc[]`.

In [21]:
tech.loc[('2014-01-02', 'GOOGL'), 'close']

556.56
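The three routes to the same value can be checked side by side. This is a minimal sketch on a hypothetical four-row frame (`toy`), with values copied from the rows shown above.

```python
import pandas as pd

# Hypothetical stand-in for `tech`, with rows copied from the output above.
toy = pd.DataFrame({
    "date": ["2014-01-02", "2014-01-02", "2014-01-03", "2014-01-03"],
    "name": ["FB", "GOOGL", "FB", "GOOGL"],
    "open": [54.86, 557.73, 55.00, 557.50],
    "close": [54.71, 556.56, 54.56, 552.50],
}).set_index(["date", "name"])

# A tuple addresses both index levels at once and returns one row as a Series.
row = toy.loc[("2014-01-02", "GOOGL")]
print(type(row).__name__)  # Series

# Three equivalent ways to reach the closing price.
close_attr = row.close                                    # attribute access
close_item = row["close"]                                 # bracket access
close_loc = toy.loc[("2014-01-02", "GOOGL"), "close"]     # rows and column in one call
print(close_attr, close_item, close_loc)
```

Bracket access and the single `loc[]` call also work for column labels that are not valid Python attribute names, which attribute access cannot handle.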

What about selecting by position using `iloc[]`? This works exactly the same way in multiindex dataframes as it does in single-index dataframes. The hierarchical structure has no impact on how `iloc[]` functions.

In [22]:
tech.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,month,year,day,open,close,high,low,volume,volume_type
date,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2014-01-02,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622,medium
2014-01-02,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851,low
2014-01-02,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719,medium
2014-01-02,MSFT,1,2014,2,37.35,37.16,37.4,37.1,30643745,medium
2014-01-02,AMZN,1,2014,2,398.8,397.97,399.36,394.02,2140246,medium


For example, on January 2, 2014 Google is at index position 2 (index starting at 0) in the dataframe, with the closing price as column position 4 (again index starting at 0). To get that value, we just use `iloc[]` the way we always have.

In [23]:
tech.iloc[2,4]

556.56

As another example, let's select the opening and closing prices for Apple on January 3, 2014.

In [24]:
tech.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,month,year,day,open,close,high,low,volume,volume_type
date,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2014-01-02,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622,medium
2014-01-02,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851,low
2014-01-02,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719,medium
2014-01-02,MSFT,1,2014,2,37.35,37.16,37.4,37.1,30643745,medium
2014-01-02,AMZN,1,2014,2,398.8,397.97,399.36,394.02,2140246,medium
2014-01-03,FB,1,2014,3,55.0,54.56,55.65,54.53,38287706,medium
2014-01-03,GOOGL,1,2014,3,557.5,552.5,558.47,552.47,1669229,medium
2014-01-03,MSFT,1,2014,3,37.2,36.91,37.22,36.6,31134795,medium
2014-01-03,AAPL,1,2014,3,79.0,77.28,79.1,77.2,14043410,low
2014-01-03,AMZN,1,2014,3,398.29,396.44,402.71,396.22,2213512,medium


The label-based approach with `loc[]` would be as follows:

In [25]:
tech.loc[('2014-01-03', 'AAPL'), ['open', 'close']]

open        79
close    77.28
Name: (2014-01-03, AAPL), dtype: object

The position-based approach with `iloc[]` would be:

In [26]:
tech.iloc[8, [3,4]]

open        79
close    77.28
Name: (2014-01-03, AAPL), dtype: object

The result is exactly the same!
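Counting positions by eye gets tedious on larger frames. As a sketch, the positions can be looked up from the labels with `Index.get_loc()` and `Index.get_indexer()`, two lookup methods not otherwise used in this section. The frame `toy` is a hypothetical four-row stand-in with values copied from the output above.

```python
import pandas as pd

# Hypothetical stand-in for `tech`, with rows copied from the output above.
toy = pd.DataFrame({
    "date": ["2014-01-02", "2014-01-02", "2014-01-03", "2014-01-03"],
    "name": ["FB", "GOOGL", "FB", "GOOGL"],
    "open": [54.86, 557.73, 55.00, 557.50],
    "close": [54.71, 556.56, 54.56, 552.50],
}).set_index(["date", "name"])

# Translate labels into integer positions.
row_pos = toy.index.get_loc(("2014-01-03", "FB"))       # position of one (date, name) pair
col_pos = toy.columns.get_indexer(["open", "close"])    # positions of the two columns

by_label = toy.loc[("2014-01-03", "FB"), ["open", "close"]]
by_pos = toy.iloc[row_pos, col_pos]

print(row_pos, list(col_pos))
print(by_label.equals(by_pos))
```

In this toy frame each (date, name) pair is unique, so `get_loc()` returns a single integer.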

## Indexing Ranges and Slices and the `slice()` Object in MultiIndex DataFrames

We previously looked at ways to select values from multiindex dataframes by label or position. Now we'll go further and extract slices.

Suppose we want to extract multiple days from our tech stocks dataframe. If we want to select multiple dates, we just pass in a list of dates that we want to select. In the command below, the list of dates corresponds to the outer level of the index (the "date" level).

In [27]:
tech.loc[['2015-01-06', '2015-01-07']]

Unnamed: 0_level_0,Unnamed: 1_level_0,month,year,day,open,close,high,low,volume,volume_type
date,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2015-01-06,AAPL,1,2015,6,106.54,106.26,107.43,104.63,65797116,medium
2015-01-06,FB,1,2015,6,77.23,76.15,77.59,75.36,27399288,medium
2015-01-06,AMZN,1,2015,6,302.2,295.29,303.0,292.38,3519034,medium
2015-01-06,MSFT,1,2015,6,46.38,45.65,46.75,45.54,36447854,medium
2015-01-06,GOOGL,1,2015,6,520.49,506.64,521.21,505.55,2731813,medium
2015-01-07,FB,1,2015,7,76.76,76.15,77.36,75.82,22045333,medium
2015-01-07,MSFT,1,2015,7,45.98,46.23,46.46,45.49,29114061,medium
2015-01-07,GOOGL,1,2015,7,510.99,505.15,511.49,503.65,2345875,medium
2015-01-07,AMZN,1,2015,7,297.54,298.42,301.28,295.33,2640349,medium
2015-01-07,AAPL,1,2015,7,107.2,107.75,108.2,106.7,40105934,medium


If we also want a subset of stock names, we just need to specify those in another list of index labels. However, this must be done within a tuple. If we do not do this, Pandas will think we are indexing along the column axis.

In [30]:
## This will not work
# tech.loc[['2015-01-06', '2015-01-07'], ['FB', 'AMZN']]

To get this to work, we have to wrap the two lists of index labels (dates and stock names) into a tuple. In the second dimension (the column axis), let's start by selecting all columns by using a colon `:`.

In [33]:
tech.loc[(['2015-01-06', '2015-01-07'], ['FB', 'AMZN']), :]

Unnamed: 0_level_0,Unnamed: 1_level_0,month,year,day,open,close,high,low,volume,volume_type
date,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2015-01-06,FB,1,2015,6,77.23,76.15,77.59,75.36,27399288,medium
2015-01-06,AMZN,1,2015,6,302.2,295.29,303.0,292.38,3519034,medium
2015-01-07,FB,1,2015,7,76.76,76.15,77.36,75.82,22045333,medium
2015-01-07,AMZN,1,2015,7,297.54,298.42,301.28,295.33,2640349,medium


If we only want to select particular columns, we can easily do so by passing in a list of column labels!

In [34]:
tech.loc[(['2015-01-06', '2015-01-07'], ['FB', 'AMZN']), ['close', 'volume']]

Unnamed: 0_level_0,Unnamed: 1_level_0,close,volume
date,name,Unnamed: 2_level_1,Unnamed: 3_level_1
2015-01-06,FB,76.15,27399288
2015-01-06,AMZN,295.29,3519034
2015-01-07,FB,76.15,22045333
2015-01-07,AMZN,298.42,2640349
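The difference between the two spellings can be reproduced on a small frame. The sketch below assumes a hypothetical six-row stand-in (`toy`) built from the January 2014 rows shown earlier; the exact wording of the error message may differ between Pandas versions.

```python
import pandas as pd

# Hypothetical stand-in for `tech`: two dates, three tickers, values copied from the output above.
toy = pd.DataFrame({
    "date": ["2014-01-02"] * 3 + ["2014-01-03"] * 3,
    "name": ["AAPL", "FB", "GOOGL"] * 2,
    "open": [79.38, 54.86, 557.73, 79.00, 55.00, 557.50],
    "close": [79.02, 54.71, 556.56, 77.28, 54.56, 552.50],
}).set_index(["date", "name"])

dates = ["2014-01-02", "2014-01-03"]
names = ["FB", "GOOGL"]

# Without the tuple, the second list is looked up among the COLUMN labels and fails.
caught_key_error = False
try:
    toy.loc[dates, names]
except KeyError as err:
    caught_key_error = True
    print("KeyError:", err)

# With the tuple, both lists stay on the row axis; the column selection comes after the comma.
subset = toy.loc[(dates, names), ["close"]]
print(subset)
```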


We can also slice our multidimensional dataframe by specifying a range of values separated by a colon. For instance, perhaps we want to select a range of dates from the outer level of our multiindex, and a range of columns.

In [35]:
tech.loc['2017-01-03':'2017-01-31','open':"low"]

Unnamed: 0_level_0,Unnamed: 1_level_0,open,close,high,low
date,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2017-01-03,AMZN,757.92,753.67,758.76,747.70
2017-01-03,FB,116.03,116.86,117.84,115.51
2017-01-03,MSFT,62.79,62.58,62.84,62.13
2017-01-03,AAPL,115.80,116.15,116.33,114.76
2017-01-03,GOOGL,800.62,808.01,811.44,796.89
...,...,...,...,...,...
2017-01-31,MSFT,64.86,64.65,65.15,64.26
2017-01-31,AAPL,121.15,121.35,121.39,120.62
2017-01-31,FB,130.17,130.32,130.66,129.52
2017-01-31,GOOGL,819.50,820.19,823.07,813.40


But what if we wanted to go a step further and isolate a specific stock (e.g. Google) within this date range slice? We can do that too. But how?

Based on what we've done already, the first thing to try might be to wrap the index dimension in a tuple and add the stock as the second item in the tuple. 

Unfortunately, this does not quite work:

In [37]:
## This results in a syntax error:
# tech.loc[('2017-01-03':'2017-01-31', 'GOOGL'),'open':"low"]

So what do we do? Well, in order to slice on a hierarchical index, we have to use the `slice()` object. In order to isolate Google stock prices in the sliced date range, we have to wrap our date range within a slice object, then wrap that in a tuple together with "GOOGL".

Importantly, we do not use the colon `:` slice operator. Instead, we identify the start and end of the slice within the `slice()` object, separated by a comma.



In [39]:
tech.loc[(slice('2017-01-03','2017-01-31'), 'GOOGL'),'open':"low"]

Unnamed: 0_level_0,Unnamed: 1_level_0,open,close,high,low
date,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2017-01-03,GOOGL,800.62,808.01,811.44,796.89
2017-01-04,GOOGL,809.89,807.77,813.43,804.11
2017-01-05,GOOGL,807.5,813.02,813.74,805.92
2017-01-06,GOOGL,814.99,825.21,828.96,811.5
2017-01-09,GOOGL,826.37,827.18,830.43,821.62
2017-01-10,GOOGL,827.07,826.01,829.41,823.14
2017-01-11,GOOGL,826.62,829.86,829.9,821.47
2017-01-12,GOOGL,828.38,829.53,830.38,821.01
2017-01-13,GOOGL,831.0,830.94,834.65,829.52
2017-01-17,GOOGL,830.0,827.46,830.18,823.2
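Why the colon fails inside the tuple comes down to Python syntax: `a:b` is only valid directly inside square brackets, where Python turns it into a `slice` object behind the scenes. The sketch below illustrates this with a hypothetical helper class (`ShowKey`) and then applies `slice()` to a six-row stand-in frame (`toy`) whose values are copied from outputs in this section.

```python
import pandas as pd


class ShowKey:
    """Hypothetical helper that returns whatever is passed inside the brackets."""

    def __getitem__(self, key):
        return key


show = ShowKey()
# Inside brackets, the colon syntax is converted to a slice object.
print(show["2014-01-02":"2014-01-03"] == slice("2014-01-02", "2014-01-03"))

# Hypothetical stand-in for `tech`, sorted so that both index levels are in order.
toy = pd.DataFrame({
    "date": ["2014-01-02", "2014-01-02", "2014-01-03", "2014-01-03", "2014-01-06", "2014-01-06"],
    "name": ["AMZN", "FB", "AMZN", "FB", "AMZN", "FB"],
    "high": [399.36, 55.22, 402.71, 55.65, 397.00, 57.26],
    "low": [394.02, 54.19, 396.22, 54.53, 388.42, 54.05],
}).set_index(["date", "name"]).sort_index()

# Inside a parenthesized tuple the colon is not allowed, so the slice is spelled out.
# Label-based slices include both endpoints.
fb_range = toy.loc[(slice("2014-01-02", "2014-01-03"), "FB"), "high":"low"]
print(fb_range)
```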


Let's do another example, in which we want the opening prices for Facebook and Amazon for all the dates in the dataframe. In other words, since we want the data for all of the dates, we need to skip the outer level of the index. But within the inner level, we want FB and AMZN only, and on the column axis we want "open" only.

Let's first try the most intuitive solution, which is to open a tuple for the multiindex, use a colon to select all dates from the outer dimension of the index, and identify FB and AMZN in a list as the inner dimension of the index. As you might have guessed, this does not work. 

In [58]:
## This results in a syntax error
# tech.loc[(:, ["FB",'AMZN']), 'open']

To get this to work, we once again have to invoke the `slice()` object. In order to slice for everything in a given dimension, we use `None`.

In [62]:
tech.loc[(slice(None), ['FB','AMZN']), "open"]

date        name
2014-01-02  FB        54.86
            AMZN     398.80
2014-01-03  FB        55.00
            AMZN     398.29
2014-01-06  AMZN     396.13
                     ...   
2019-08-21  AMZN    1819.39
2019-08-22  FB       183.43
            AMZN    1828.00
2019-08-23  AMZN    1793.03
            FB       180.84
Name: open, Length: 2842, dtype: float64
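One way to convince ourselves that `slice(None)` really means "everything on this level" is to compare it against a plain boolean mask built from the inner level. This is a sketch on a hypothetical six-row frame (`toy`) with values copied from outputs in this section; `get_level_values()` is used only for the cross-check.

```python
import pandas as pd

# Hypothetical stand-in for `tech`, sorted so that both index levels are in order.
toy = pd.DataFrame({
    "date": ["2014-01-02", "2014-01-02", "2014-01-03", "2014-01-03", "2014-01-06", "2014-01-06"],
    "name": ["AMZN", "FB", "AMZN", "FB", "AMZN", "FB"],
    "high": [399.36, 55.22, 402.71, 55.65, 397.00, 57.26],
    "low": [394.02, 54.19, 396.22, 54.53, 388.42, 54.05],
}).set_index(["date", "name"]).sort_index()

# slice(None) is what a bare colon becomes: no start, no stop, no step.
print(slice(None) == slice(None, None, None))

# All dates on the outer level, FB only on the inner level.
via_slice = toy.loc[(slice(None), ["FB"]), "high"]

# Cross-check: a boolean mask on the inner level selects the same rows.
mask = toy.index.get_level_values("name").isin(["FB"])
via_mask = toy.loc[mask, "high"]

print(via_slice.equals(via_mask))
```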

## BONUS - Use Colons `:` with `pd.IndexSlice`

In the previous lecture, we saw how indexing slices when selecting along multiple indices requires comma-separated slice boundaries within a `slice()` object.

There is an alternative approach that allows us to use the colon operator instead. We can do this with the `pd.IndexSlice` object.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.IndexSlice.html

Suppose we want the high and low prices for all trading days in the dataset for Amazon and Facebook. We can achieve this by using `pd.IndexSlice[]`, which is an indexer object that uses square brackets. It allows us to slice multiindexes more easily and intuitively using the colon `:` operator, without the need to invoke the `slice()` object.

In [64]:
tech.loc[pd.IndexSlice[:, ['FB','AMZN']], ['high','low']]

Unnamed: 0_level_0,Unnamed: 1_level_0,high,low
date,name,Unnamed: 2_level_1,Unnamed: 3_level_1
2014-01-02,FB,55.22,54.19
2014-01-02,AMZN,399.36,394.02
2014-01-03,FB,55.65,54.53
2014-01-03,AMZN,402.71,396.22
2014-01-06,AMZN,397.00,388.42
...,...,...,...
2019-08-21,AMZN,1829.58,1815.00
2019-08-22,FB,184.11,179.91
2019-08-22,AMZN,1829.41,1800.10
2019-08-23,AMZN,1804.90,1745.23


When working with more complex multiindexes, where we need to call `pd.IndexSlice[]` several times, it is a good idea to assign it to a shorter variable so we don't have to keep typing the full name.

In [66]:
i = pd.IndexSlice
tech.loc[i[:, 'FB'], ['high','low']]

Unnamed: 0_level_0,Unnamed: 1_level_0,high,low
date,name,Unnamed: 2_level_1,Unnamed: 3_level_1
2014-01-02,FB,55.22,54.19
2014-01-03,FB,55.65,54.53
2014-01-06,FB,57.26,54.05
2014-01-07,FB,58.55,57.22
2014-01-08,FB,58.41,57.23
...,...,...,...
2019-08-19,FB,187.50,184.85
2019-08-20,FB,186.00,182.39
2019-08-21,FB,185.90,183.14
2019-08-22,FB,184.11,179.91


Let's try another example, where we want a week's worth of data from January 6 through January 10 in 2014 for FB and AMZN, high and low prices. The `pd.IndexSlice[]` selector allows us to use colons to make those selections.
* Also notice how we don't need to use tuples for this.

In [72]:
tech.loc[i['2014-01-06':'2014-01-10', ["FB","AMZN"]], ["high","low"]]

Unnamed: 0_level_0,Unnamed: 1_level_0,high,low
date,name,Unnamed: 2_level_1,Unnamed: 3_level_1
2014-01-06,AMZN,397.0,388.42
2014-01-06,FB,57.26,54.05
2014-01-07,FB,58.55,57.22
2014-01-07,AMZN,398.47,394.29
2014-01-08,AMZN,403.0,396.04
2014-01-08,FB,58.41,57.23
2014-01-09,AMZN,406.89,398.44
2014-01-09,FB,58.96,56.65
2014-01-10,FB,58.3,57.06
2014-01-10,AMZN,403.76,393.8
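Under the hood, `pd.IndexSlice[]` simply hands back whatever is placed between its brackets, so the colon syntax ends up as the same tuple of `slice()` objects and lists that we wrote by hand earlier. The sketch below checks this on a hypothetical six-row frame (`toy`) with values copied from outputs in this section.

```python
import pandas as pd

# Hypothetical stand-in for `tech`, sorted so that both index levels are in order.
toy = pd.DataFrame({
    "date": ["2014-01-02", "2014-01-02", "2014-01-03", "2014-01-03", "2014-01-06", "2014-01-06"],
    "name": ["AMZN", "FB", "AMZN", "FB", "AMZN", "FB"],
    "high": [399.36, 55.22, 402.71, 55.65, 397.00, 57.26],
    "low": [394.02, 54.19, 396.22, 54.53, 388.42, 54.05],
}).set_index(["date", "name"]).sort_index()

i = pd.IndexSlice

# IndexSlice builds the tuple for us from the bracket syntax.
key = i["2014-01-03":"2014-01-06", ["FB", "AMZN"]]
manual_key = (slice("2014-01-03", "2014-01-06"), ["FB", "AMZN"])
print(key == manual_key)

# Both spellings select the same rows and columns.
via_indexslice = toy.loc[key, ["high", "low"]]
via_slice = toy.loc[manual_key, ["high", "low"]]
print(via_indexslice.equals(via_slice))
```

So the choice between `slice()` and `pd.IndexSlice[]` is a matter of readability rather than behavior.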
