In [1]:
import pandas as pd
import numpy as np

# Section 7: Going Multidimensional

In this section, we'll kick things up a notch and work with multi-index dataframes. This allows us to support more than one level of labels, enabling us to reflect multidimensional datasets within the confines of a two-dimensional data structure.

**Hierarchical indices** represent a hierarchy of relationships that become intricately coupled with our data/values. Using Pandas techniques and methods, we can very efficiently change these hierarchies in order to answer specific questions.

As a side note, multi-level index, multiindex, and hierarchical index are all used interchangeably within the Pandas community.

## Introducing New Data

In this section, we'll be working with a new dataset that contains daily stock information from 2014 through 2019 for the technology companies Apple, Facebook, Microsoft, Google, and Amazon.

In [2]:
tech_url = 'https://andybek.com/pandas-tech'

In [3]:
tech = pd.read_csv(tech_url)

In [4]:
tech.head()

,date,month,year,day,name,open,close,high,low,volume,volume_type
0,2014-01-02,1,2014,2,FB,54.86,54.71,55.22,54.19,43257622,medium
1,2014-01-02,1,2014,2,AAPL,79.38,79.02,79.58,78.86,8398851,low
2,2014-01-02,1,2014,2,GOOGL,557.73,556.56,558.88,554.13,1822719,medium
3,2014-01-02,1,2014,2,MSFT,37.35,37.16,37.4,37.1,30643745,medium
4,2014-01-02,1,2014,2,AMZN,398.8,397.97,399.36,394.02,2140246,medium


Here we see we have date in the first column, then columns of decomposed dates, the stock tickers under "name", the opening, closing, high, and low prices for that day, the volume of shares traded, and the qualitative trade volume.

Examining the shape of the dataframe, we have 7105 rows and 11 columns.

In [5]:
tech.shape

(7105, 11)

Let's also check out the type of data in the dataframe.

In [6]:
tech.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7105 entries, 0 to 7104
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         7105 non-null   object 
 1   month        7105 non-null   int64  
 2   year         7105 non-null   int64  
 3   day          7105 non-null   int64  
 4   name         7105 non-null   object 
 5   open         7105 non-null   float64
 6   close        7105 non-null   float64
 7   high         7105 non-null   float64
 8   low          7105 non-null   float64
 9   volume       7105 non-null   int64  
 10  volume_type  7105 non-null   object 
dtypes: float64(4), int64(4), object(3)
memory usage: 610.7+ KB


We have 4 floating point columns (open, close, high, low), 4 integer columns (month, year, day, volume), and three object columns (date, name, and volume_type).

## Index and RangeIndex

Let's review the terminology that we've grown accustomed to. In Series, the index is simply a label for each value in the series. In dataframes, the index still serves as a label for each row, but is accompanied by another set of labels for the column dimension.

Remember that the default index is of the type `RangeIndex`, which is simply an immutable (unchangeable) object that represents a sequence of increasing or decreasing integers. In Pandas, `RangeIndex` is derived from the `Index` class, which happens to be the very class used for the column labels.

In [7]:
type(tech.index)

pandas.core.indexes.range.RangeIndex

In [8]:
type(tech.columns)

pandas.core.indexes.base.Index

So what is the `Index` class? It's essentially another immutable data structure (typically backed by a NumPy array) that is ordered and sliceable.
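As a quick check of these claims, the sketch below uses a small made-up dataframe (the name `toy` and its values are hypothetical, not part of the tech dataset) to confirm that both axes hold `Index` objects and that item assignment on an index fails.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({"open": [10.0, 11.0, 12.0], "close": [10.5, 11.5, 12.5]})

# RangeIndex is a subclass of Index, so row and column labels share a base class.
rows_are_index = isinstance(toy.index, pd.Index)
cols_are_index = isinstance(toy.columns, pd.Index)
range_is_subclass = issubclass(pd.RangeIndex, pd.Index)

# Index objects are ordered and sliceable...
first_two = toy.index[:2]

# ...but immutable: item assignment raises a TypeError.
try:
    toy.index[0] = 99
    mutation_failed = False
except TypeError:
    mutation_failed = True

print(rows_are_index, cols_are_index, range_is_subclass, mutation_failed)
```

Replacing the whole index (for example with `set_index()`) is still allowed; only in-place modification of individual labels is blocked.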

Oftentimes we find the need to replace the default indices with something more meaningful. We've previously seen this with the `set_index()` method.

In [9]:
tech.set_index('date')

date,month,year,day,name,open,close,high,low,volume,volume_type
2014-01-02,1,2014,2,FB,54.86,54.71,55.22,54.19,43257622,medium
2014-01-02,1,2014,2,AAPL,79.38,79.02,79.58,78.86,8398851,low
2014-01-02,1,2014,2,GOOGL,557.73,556.56,558.88,554.13,1822719,medium
2014-01-02,1,2014,2,MSFT,37.35,37.16,37.40,37.10,30643745,medium
2014-01-02,1,2014,2,AMZN,398.80,397.97,399.36,394.02,2140246,medium
...,...,...,...,...,...,...,...,...,...,...
2019-08-23,8,2019,23,MSFT,137.19,133.39,138.35,132.80,38515386,medium
2019-08-23,8,2019,23,AAPL,209.43,202.64,212.05,201.00,46882843,medium
2019-08-23,8,2019,23,GOOGL,1185.17,1153.58,1195.67,1150.00,1813141,medium
2019-08-23,8,2019,23,AMZN,1793.03,1749.62,1804.90,1745.23,5277898,medium


But what does it mean for *date* to "be" an index? One of the key implications has to do with how we select from the dataframe. If we wanted to extract or index all the prices for August 1, 2019, we can use that for our selection.

In [10]:
tech.set_index('date').loc['2019-08-01']

date,month,year,day,name,open,close,high,low,volume,volume_type
2019-08-01,8,2019,1,GOOGL,1217.63,1211.78,1236.3,1207.0,1771271,medium
2019-08-01,8,2019,1,FB,194.17,192.73,198.47,190.88,17777013,medium
2019-08-01,8,2019,1,MSFT,137.0,138.06,140.94,136.93,40557502,medium
2019-08-01,8,2019,1,AMZN,1871.72,1855.32,1897.92,1844.01,4713311,medium
2019-08-01,8,2019,1,AAPL,213.9,208.43,218.03,206.74,54017922,medium


## Creating a MultiIndex

In the previous lecture, we used the `set_index()` method to change the index from the default to something more meaningful. We can take this a step further and use more than one field as the index for our dataframe.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html

Let's start simply: instead of "date", we'll pass in a list of strings on which to set indices. 

In [11]:
tech.head()

,date,month,year,day,name,open,close,high,low,volume,volume_type
0,2014-01-02,1,2014,2,FB,54.86,54.71,55.22,54.19,43257622,medium
1,2014-01-02,1,2014,2,AAPL,79.38,79.02,79.58,78.86,8398851,low
2,2014-01-02,1,2014,2,GOOGL,557.73,556.56,558.88,554.13,1822719,medium
3,2014-01-02,1,2014,2,MSFT,37.35,37.16,37.4,37.1,30643745,medium
4,2014-01-02,1,2014,2,AMZN,398.8,397.97,399.36,394.02,2140246,medium


In [12]:
tech.set_index(['date','name'])

date,name,month,year,day,open,close,high,low,volume,volume_type
2014-01-02,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622,medium
2014-01-02,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851,low
2014-01-02,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719,medium
2014-01-02,MSFT,1,2014,2,37.35,37.16,37.40,37.10,30643745,medium
2014-01-02,AMZN,1,2014,2,398.80,397.97,399.36,394.02,2140246,medium
...,...,...,...,...,...,...,...,...,...,...
2019-08-23,MSFT,8,2019,23,137.19,133.39,138.35,132.80,38515386,medium
2019-08-23,AAPL,8,2019,23,209.43,202.64,212.05,201.00,46882843,medium
2019-08-23,GOOGL,8,2019,23,1185.17,1153.58,1195.67,1150.00,1813141,medium
2019-08-23,AMZN,8,2019,23,1793.03,1749.62,1804.90,1745.23,5277898,medium


The result is a MultiIndex, in which a single index has more than one component to it. When we promoted the "date" and "name" columns to indices, they were removed as regular columns from the dataframe. 

This looks a bit weird initially. But we'll get used to it.

Let's make this change permanent by setting the `inplace` parameter to `True`.

In [13]:
tech.set_index(['date','name'], inplace = True)

Now let's look at the first two days.

In [14]:
tech.head(10)

date,name,month,year,day,open,close,high,low,volume,volume_type
2014-01-02,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622,medium
2014-01-02,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851,low
2014-01-02,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719,medium
2014-01-02,MSFT,1,2014,2,37.35,37.16,37.4,37.1,30643745,medium
2014-01-02,AMZN,1,2014,2,398.8,397.97,399.36,394.02,2140246,medium
2014-01-03,FB,1,2014,3,55.0,54.56,55.65,54.53,38287706,medium
2014-01-03,GOOGL,1,2014,3,557.5,552.5,558.47,552.47,1669229,medium
2014-01-03,MSFT,1,2014,3,37.2,36.91,37.22,36.6,31134795,medium
2014-01-03,AAPL,1,2014,3,79.0,77.28,79.1,77.2,14043410,low
2014-01-03,AMZN,1,2014,3,398.29,396.44,402.71,396.22,2213512,medium


In the rendered dataframe, we see two bold-faced columns: the date and the name of the stock. In a notebook, a repeated date is displayed only once; a blank cell means the date from the row above carries over.

Hierarchical indexing is not only about looks. By using a multiindex, we create a hierarchy of relationships within our data, where the information across the two index levels is inseparable from the actual values.

Let's see what type of object this is.


In [15]:
type(tech.index)

pandas.core.indexes.multi.MultiIndex

We see that our index is now of the type "MultiIndex". Cool!
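To see the same mechanics on a frame small enough to inspect by eye, here is a minimal sketch. The variable names `toy` and `multi` are hypothetical; the four rows are copied from the output shown earlier.

```python
import pandas as pd

# Four rows copied from the tech dataset output above.
toy = pd.DataFrame({
    "date": ["2014-01-02", "2014-01-02", "2014-01-03", "2014-01-03"],
    "name": ["FB", "AAPL", "FB", "AAPL"],
    "close": [54.71, 79.02, 54.56, 77.28],
})

# Passing a list of column names to set_index() builds a two-level index.
multi = toy.set_index(["date", "name"])

print(type(multi.index).__name__)   # class name of the new index
print(multi.index.nlevels)          # number of levels in the index
print(list(multi.index.names))      # names of the levels, outermost first
print(list(multi.columns))          # the promoted columns are gone
```

Note that this sketch assigns the result to a new variable instead of using `inplace=True`, which leaves the original `toy` frame unchanged.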

## MultiIndex from `read_csv()`

It turns out that we don't have to wait until the dataframe is read in to set a multiindex. Instead, we can set the multiindex (or even a single index for that matter) when reading in the dataframe with `read_csv`!

All we really need to do is pass in a list of the index columns to the `index_col` parameter.

In [16]:
pd.read_csv(tech_url, index_col = ['date','name'])

date,name,month,year,day,open,close,high,low,volume,volume_type
2014-01-02,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622,medium
2014-01-02,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851,low
2014-01-02,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719,medium
2014-01-02,MSFT,1,2014,2,37.35,37.16,37.40,37.10,30643745,medium
2014-01-02,AMZN,1,2014,2,398.80,397.97,399.36,394.02,2140246,medium
...,...,...,...,...,...,...,...,...,...,...
2019-08-23,MSFT,8,2019,23,137.19,133.39,138.35,132.80,38515386,medium
2019-08-23,AAPL,8,2019,23,209.43,202.64,212.05,201.00,46882843,medium
2019-08-23,GOOGL,8,2019,23,1185.17,1153.58,1195.67,1150.00,1813141,medium
2019-08-23,AMZN,8,2019,23,1793.03,1749.62,1804.90,1745.23,5277898,medium


There are other ways to create multiindex dataframes as well, but we won't cover them here. Generally speaking, those other methods should only be used for the specific cases that necessitate them.
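The `index_col` approach can also be tried without a network connection. The sketch below is an illustration under assumptions: it reads a tiny in-memory CSV (the `csv_text` string is a hand-typed excerpt, and `io.StringIO` stands in for the real URL) and compares setting the multiindex at read time with setting it afterwards.

```python
import io
import pandas as pd

# Hand-typed excerpt of the dataset; io.StringIO stands in for the real file.
csv_text = """date,name,open,close
2014-01-02,FB,54.86,54.71
2014-01-02,AAPL,79.38,79.02
2014-01-03,FB,55.0,54.56
2014-01-03,AAPL,79.0,77.28
"""

# Option 1: build the multiindex while reading.
at_read = pd.read_csv(io.StringIO(csv_text), index_col=["date", "name"])

# Option 2: read first, then promote the columns with set_index().
after_read = pd.read_csv(io.StringIO(csv_text)).set_index(["date", "name"])

print(list(at_read.index.names))
print(list(at_read.index) == list(after_read.index))
```

Both routes lead to the same index labels; `index_col` simply saves a step.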



## Indexing Hierarchical DataFrames

So we've set hierarchical dataframes. Now how do we extract values from them? When we set the multiindex, we created an association between each pair of date and stock ticker and a value in our dataframe. Thus, indexing the data will look a bit different.

In [17]:
tech.head(10)

date,name,month,year,day,open,close,high,low,volume,volume_type
2014-01-02,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622,medium
2014-01-02,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851,low
2014-01-02,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719,medium
2014-01-02,MSFT,1,2014,2,37.35,37.16,37.4,37.1,30643745,medium
2014-01-02,AMZN,1,2014,2,398.8,397.97,399.36,394.02,2140246,medium
2014-01-03,FB,1,2014,3,55.0,54.56,55.65,54.53,38287706,medium
2014-01-03,GOOGL,1,2014,3,557.5,552.5,558.47,552.47,1669229,medium
2014-01-03,MSFT,1,2014,3,37.2,36.91,37.22,36.6,31134795,medium
2014-01-03,AAPL,1,2014,3,79.0,77.28,79.1,77.2,14043410,low
2014-01-03,AMZN,1,2014,3,398.29,396.44,402.71,396.22,2213512,medium


Suppose we want to know what price Google closed at on January 2, 2014. Let's first try selecting for that particular date.

In [18]:
tech.loc['2014-01-02']

name,month,year,day,open,close,high,low,volume,volume_type
FB,1,2014,2,54.86,54.71,55.22,54.19,43257622,medium
AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851,low
GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719,medium
MSFT,1,2014,2,37.35,37.16,37.4,37.1,30643745,medium
AMZN,1,2014,2,398.8,397.97,399.36,394.02,2140246,medium


Now we've isolated the full dataset for that date. To avoid being redundant, Pandas automatically drops the index that we are indexing along (in this case, "date").

Next, let's extract the Google data by going back to our `loc[]` indexer and adding another label to it.

In [19]:
tech.loc['2014-01-02', 'GOOGL']

month                1
year              2014
day                  2
open            557.73
close           556.56
high            558.88
low             554.13
volume         1822719
volume_type     medium
Name: (2014-01-02, GOOGL), dtype: object

What is returned is a Series containing the values for Google on that single date. Now, to isolate the close price, we can simply grab it out of the series as an attribute.

In [20]:
tech.loc['2014-01-02', 'GOOGL'].close

556.56

Another, perhaps more elegant, way of doing the same thing is to take advantage of the tight coupling between the "date" and "name" fields. They are both levels of our multiindex, so we can treat them as one dimension and capture them in a single tuple of values. In other words, we can identify the rows by the multiindex in one go by using a tuple, instead of selecting one level at a time. By using a tuple, we are able to pass in multiple index labels while remaining in the first dimension (rows) of the `loc[]` indexer.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html

In [21]:
tech.loc[('2014-01-02', 'GOOGL')]

month                1
year              2014
day                  2
open            557.73
close           556.56
high            558.88
low             554.13
volume         1822719
volume_type     medium
Name: (2014-01-02, GOOGL), dtype: object

Best of all, we can go in and make use of the second dimension (columns) within `loc[]`.

In [22]:
tech.loc[('2014-01-02', 'GOOGL'), 'close']

556.56

What about selecting by position using `iloc[]`? This works exactly the same way in multiindex dataframes as it does in single-index dataframes. The hierarchical structure has no impact on how `iloc[]` functions.

In [23]:
tech.head()

date,name,month,year,day,open,close,high,low,volume,volume_type
2014-01-02,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622,medium
2014-01-02,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851,low
2014-01-02,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719,medium
2014-01-02,MSFT,1,2014,2,37.35,37.16,37.4,37.1,30643745,medium
2014-01-02,AMZN,1,2014,2,398.8,397.97,399.36,394.02,2140246,medium


For example, on January 2, 2014 Google is at index position 2 (index starting at 0) in the dataframe, with the closing price as column position 4 (again index starting at 0). To get that value, we just use `iloc[]` the way we always have.

In [24]:
tech.iloc[2,4]

556.56

As another example, let's select the opening and closing prices for Apple on January 3, 2014.

In [25]:
tech.head(10)

date,name,month,year,day,open,close,high,low,volume,volume_type
2014-01-02,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622,medium
2014-01-02,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851,low
2014-01-02,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719,medium
2014-01-02,MSFT,1,2014,2,37.35,37.16,37.4,37.1,30643745,medium
2014-01-02,AMZN,1,2014,2,398.8,397.97,399.36,394.02,2140246,medium
2014-01-03,FB,1,2014,3,55.0,54.56,55.65,54.53,38287706,medium
2014-01-03,GOOGL,1,2014,3,557.5,552.5,558.47,552.47,1669229,medium
2014-01-03,MSFT,1,2014,3,37.2,36.91,37.22,36.6,31134795,medium
2014-01-03,AAPL,1,2014,3,79.0,77.28,79.1,77.2,14043410,low
2014-01-03,AMZN,1,2014,3,398.29,396.44,402.71,396.22,2213512,medium


The label-based approach with `loc[]` would be as follows:

In [26]:
tech.loc[('2014-01-03', 'AAPL'), ['open', 'close']]

open        79
close    77.28
Name: (2014-01-03, AAPL), dtype: object

The position-based approach with `iloc[]` would be:

In [27]:
tech.iloc[8, [3,4]]

open        79
close    77.28
Name: (2014-01-03, AAPL), dtype: object

The result is exactly the same!
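Counting positions by hand gets error-prone on larger frames. As a hedged sketch, the positions can instead be looked up from the labels with `Index.get_loc()` and `Index.get_indexer()`. The frame `toy` is a hypothetical four-row excerpt of the data above, so its positions differ from those in the full dataset.

```python
import pandas as pd

# Four-row excerpt of the tech data; not the full dataset.
toy = pd.DataFrame({
    "date": ["2014-01-02", "2014-01-02", "2014-01-03", "2014-01-03"],
    "name": ["FB", "AAPL", "FB", "AAPL"],
    "open": [54.86, 79.38, 55.0, 79.0],
    "close": [54.71, 79.02, 54.56, 77.28],
}).set_index(["date", "name"])

# Label-based selection with a tuple for the row and a list for the columns.
by_label = toy.loc[("2014-01-03", "AAPL"), ["open", "close"]]

# Translate the same labels into integer positions, then use iloc[].
row_pos = toy.index.get_loc(("2014-01-03", "AAPL"))
col_pos = toy.columns.get_indexer(["open", "close"])
by_position = toy.iloc[row_pos, col_pos]

print(row_pos, list(col_pos))
print(by_label.equals(by_position))
```

`get_loc()` returns a single integer here because every (date, name) pair in the toy frame is unique.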

## Indexing Ranges and Slices and the `slice()` Object in MultiIndex DataFrames

We previously looked at ways to select values from multiindex dataframes by label or position. Now we'll go further and extract slices.

Suppose we want to extract multiple days from our tech stocks dataframe. If we want to select multiple dates, we just pass in a list of dates that we want to select. In the command below, the list of dates corresponds to the outer level of the index (the "date" level).

In [28]:
tech.loc[['2015-01-06', '2015-01-07']]

date,name,month,year,day,open,close,high,low,volume,volume_type
2015-01-06,AAPL,1,2015,6,106.54,106.26,107.43,104.63,65797116,medium
2015-01-06,FB,1,2015,6,77.23,76.15,77.59,75.36,27399288,medium
2015-01-06,AMZN,1,2015,6,302.2,295.29,303.0,292.38,3519034,medium
2015-01-06,MSFT,1,2015,6,46.38,45.65,46.75,45.54,36447854,medium
2015-01-06,GOOGL,1,2015,6,520.49,506.64,521.21,505.55,2731813,medium
2015-01-07,FB,1,2015,7,76.76,76.15,77.36,75.82,22045333,medium
2015-01-07,MSFT,1,2015,7,45.98,46.23,46.46,45.49,29114061,medium
2015-01-07,GOOGL,1,2015,7,510.99,505.15,511.49,503.65,2345875,medium
2015-01-07,AMZN,1,2015,7,297.54,298.42,301.28,295.33,2640349,medium
2015-01-07,AAPL,1,2015,7,107.2,107.75,108.2,106.7,40105934,medium


If we also want a subset of stock names, we just need to specify those in another list of index labels. However, this must be done within a tuple. If we do not do this, Pandas will think we are indexing along the column axis.

In [29]:
## This will not work: the second list is interpreted as column labels, which raises a KeyError
# tech.loc[['2015-01-06', '2015-01-07'], ['FB', 'AMZN']]

To get this to work, we have to wrap the two lists of index labels (dates and stock names) into a tuple. On the column axis, let's start by selecting all columns by using a colon `:`.

In [30]:
tech.loc[(['2015-01-06', '2015-01-07'], ['FB', 'AMZN']), :]

date,name,month,year,day,open,close,high,low,volume,volume_type
2015-01-06,FB,1,2015,6,77.23,76.15,77.59,75.36,27399288,medium
2015-01-06,AMZN,1,2015,6,302.2,295.29,303.0,292.38,3519034,medium
2015-01-07,FB,1,2015,7,76.76,76.15,77.36,75.82,22045333,medium
2015-01-07,AMZN,1,2015,7,297.54,298.42,301.28,295.33,2640349,medium


If we only want to select particular columns, we can easily do so by passing in a list of column labels!

In [31]:
tech.loc[(['2015-01-06', '2015-01-07'], ['FB', 'AMZN']), ['close', 'volume']]

date,name,close,volume
2015-01-06,FB,76.15,27399288
2015-01-06,AMZN,295.29,3519034
2015-01-07,FB,76.15,22045333
2015-01-07,AMZN,298.42,2640349
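The difference between the failing and the working call can be reproduced on a small frame. In this sketch, `toy` is a hypothetical six-row excerpt of the 2015 rows shown above, sorted with `sort_index()` as a precaution.

```python
import pandas as pd

# Six rows copied from the 2015-01-06 and 2015-01-07 output above.
toy = pd.DataFrame({
    "date": ["2015-01-06"] * 3 + ["2015-01-07"] * 3,
    "name": ["AAPL", "AMZN", "FB"] * 2,
    "close": [106.26, 295.29, 76.15, 107.75, 298.42, 76.15],
    "volume": [65797116, 3519034, 27399288, 40105934, 2640349, 22045333],
}).set_index(["date", "name"]).sort_index()

dates = ["2015-01-06", "2015-01-07"]
names = ["FB", "AMZN"]

# Without the tuple, the second list is looked up among the column labels.
try:
    toy.loc[dates, names]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# With the tuple, both lists apply to the row index, one per level.
subset = toy.loc[(dates, names), ["close", "volume"]]

print(raised_key_error)
print(subset)
```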


We can also slice our multidimensional dataframe by specifying a range of values separated by a colon. For instance, perhaps we want to select a range of dates from the outer level of our multiindex, and a range of columns.

In [32]:
tech.loc['2017-01-03':'2017-01-31','open':"low"]

date,name,open,close,high,low
2017-01-03,AMZN,757.92,753.67,758.76,747.70
2017-01-03,FB,116.03,116.86,117.84,115.51
2017-01-03,MSFT,62.79,62.58,62.84,62.13
2017-01-03,AAPL,115.80,116.15,116.33,114.76
2017-01-03,GOOGL,800.62,808.01,811.44,796.89
...,...,...,...,...,...
2017-01-31,MSFT,64.86,64.65,65.15,64.26
2017-01-31,AAPL,121.15,121.35,121.39,120.62
2017-01-31,FB,130.17,130.32,130.66,129.52
2017-01-31,GOOGL,819.50,820.19,823.07,813.40


But what if we wanted to go a step further and isolate a specific stock (e.g. Google) within this date range slice? We can do that too. But how?

Based on what we've done already, the first thing to try might be to wrap the index dimension in a tuple and add the stock as the second item in the tuple. 

Unfortunately, this does not quite work:

In [33]:
## This results in a syntax error:
# tech.loc[('2017-01-03':'2017-01-31', 'GOOGL'),'open':"low"]

So what do we do? Well, in order to slice on a hierarchical index, we have to use the `slice()` object. In order to isolate Google stock prices in the sliced date range, we have to wrap our date range within a slice object, then wrap that in a tuple together with "GOOGL".

Importantly, we do not use the colon `:` slice operator. Instead, we pass the start and end of the slice to the `slice()` object, separated by a comma.



In [34]:
tech.loc[(slice('2017-01-03','2017-01-31'), 'GOOGL'),'open':"low"]

date,name,open,close,high,low
2017-01-03,GOOGL,800.62,808.01,811.44,796.89
2017-01-04,GOOGL,809.89,807.77,813.43,804.11
2017-01-05,GOOGL,807.5,813.02,813.74,805.92
2017-01-06,GOOGL,814.99,825.21,828.96,811.5
2017-01-09,GOOGL,826.37,827.18,830.43,821.62
2017-01-10,GOOGL,827.07,826.01,829.41,823.14
2017-01-11,GOOGL,826.62,829.86,829.9,821.47
2017-01-12,GOOGL,828.38,829.53,830.38,821.01
2017-01-13,GOOGL,831.0,830.94,834.65,829.52
2017-01-17,GOOGL,830.0,827.46,830.18,823.2
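Here is a minimal sketch of the same pattern on a frame small enough to verify by hand. The frame `toy` and its `open` values are made up for illustration, and the index is sorted first because label slicing on a multiindex expects a sorted index.

```python
import pandas as pd

# Hypothetical data: the prices are placeholders, not real quotes.
toy = pd.DataFrame({
    "date": ["2017-01-03", "2017-01-03", "2017-01-04",
             "2017-01-04", "2017-01-05", "2017-01-05"],
    "name": ["FB", "GOOGL"] * 3,
    "open": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
}).set_index(["date", "name"]).sort_index()

# slice(start, stop) replaces start:stop inside the tuple.
# Label slices include both endpoints.
googl = toy.loc[(slice("2017-01-03", "2017-01-04"), "GOOGL"), :]

print(googl)
```

Both index levels are kept in the result, just as in the output above.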


Let's do another example, in which we want the opening prices for Facebook and Amazon for all the dates in the dataframe. In other words, since we want the data for all of the dates, we need to skip the outer level of the index. But within the inner level, we want FB and AMZN only, and on the column axis we want "open" only.

Let's first try the most intuitive solution, which is to open a tuple for the multiindex, use a colon to select all dates from the outer dimension of the index, and identify FB and AMZN in a list as the inner dimension of the index. As you might have guessed, this does not work. 

In [35]:
## This results in a syntax error
# tech.loc[(:, ["FB",'AMZN']), 'open']

To get this to work, we once again have to invoke the `slice()` object. In order to slice for everything in a given dimension, we use `None`.

In [36]:
tech.loc[(slice(None), ['FB','AMZN']), "open"]

date        name
2014-01-02  FB        54.86
            AMZN     398.80
2014-01-03  FB        55.00
            AMZN     398.29
2014-01-06  AMZN     396.13
                     ...   
2019-08-21  AMZN    1819.39
2019-08-22  FB       183.43
            AMZN    1828.00
2019-08-23  AMZN    1793.03
            FB       180.84
Name: open, Length: 2842, dtype: float64

## BONUS - Use Colons `:` with `pd.IndexSlice`

In the previous lecture, we saw that slicing along multiple index levels requires passing the slice boundaries, separated by commas, to a `slice()` object.

There is an alternative approach that allows us to use the colon operator instead. We can do this with the `pd.IndexSlice` object.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.IndexSlice.html

Suppose we want the high and low prices for all trading days in the dataset for Amazon and Facebook. We can achieve this by using `pd.IndexSlice[]`, which is an indexer object that uses square brackets. It allows us to slice multiindexes more easily and intuitively using the colon `:` operator, without the need to invoke the `slice()` object.

In [37]:
tech.loc[pd.IndexSlice[:, ['FB','AMZN']], ['high','low']]

date,name,high,low
2014-01-02,FB,55.22,54.19
2014-01-02,AMZN,399.36,394.02
2014-01-03,FB,55.65,54.53
2014-01-03,AMZN,402.71,396.22
2014-01-06,AMZN,397.00,388.42
...,...,...,...
2019-08-21,AMZN,1829.58,1815.00
2019-08-22,FB,184.11,179.91
2019-08-22,AMZN,1829.41,1800.10
2019-08-23,AMZN,1804.90,1745.23


When working with more complex multiindexes where we need to call `pd.IndexSlice[]` several times, it is a good idea to assign it to a shorter variable so we don't have to keep typing the full name.

In [38]:
i = pd.IndexSlice
tech.loc[i[:, 'FB'], ['high','low']]

date,name,high,low
2014-01-02,FB,55.22,54.19
2014-01-03,FB,55.65,54.53
2014-01-06,FB,57.26,54.05
2014-01-07,FB,58.55,57.22
2014-01-08,FB,58.41,57.23
...,...,...,...
2019-08-19,FB,187.50,184.85
2019-08-20,FB,186.00,182.39
2019-08-21,FB,185.90,183.14
2019-08-22,FB,184.11,179.91


Let's try another example, where we want a week's worth of data from January 6 through January 10 in 2014 for FB and AMZN, high and low prices. The `pd.IndexSlice[]` selector allows us to use colons to make those selections.
* Also notice how we don't need to use tuples for this.

In [39]:
tech.loc[i['2014-01-06':'2014-01-10', ["FB","AMZN"]], ["high","low"]]

date,name,high,low
2014-01-06,AMZN,397.0,388.42
2014-01-06,FB,57.26,54.05
2014-01-07,FB,58.55,57.22
2014-01-07,AMZN,398.47,394.29
2014-01-08,AMZN,403.0,396.04
2014-01-08,FB,58.41,57.23
2014-01-09,AMZN,406.89,398.44
2014-01-09,FB,58.96,56.65
2014-01-10,FB,58.3,57.06
2014-01-10,AMZN,403.76,393.8
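It can help to see what `pd.IndexSlice` actually produces. The sketch below prints the objects it returns and checks, on a hypothetical four-row excerpt named `toy`, that the colon form and the `slice()` form select the same rows.

```python
import pandas as pd

idx = pd.IndexSlice

# IndexSlice hands back whatever is written inside the brackets,
# so the colon syntax becomes ordinary slice objects inside a tuple.
all_dates_fb = idx[:, ["FB"]]
week_two_names = idx["2014-01-06":"2014-01-10", ["FB", "AMZN"]]
print(all_dates_fb)
print(week_two_names)

# Four rows copied from the output above.
toy = pd.DataFrame({
    "date": ["2014-01-06", "2014-01-06", "2014-01-07", "2014-01-07"],
    "name": ["AMZN", "FB", "FB", "AMZN"],
    "high": [397.0, 57.26, 58.55, 398.47],
}).set_index(["date", "name"]).sort_index()

# The two spellings are interchangeable.
with_index_slice = toy.loc[idx[:, "FB"], :]
with_slice_object = toy.loc[(slice(None), "FB"), :]
print(with_index_slice.equals(with_slice_object))
```

In other words, `pd.IndexSlice` is a convenience for writing the same tuple of `slice()` objects with friendlier syntax.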


## Cross Sections with `xs()`

Many of these multiindexing methods are powerful and useful, but not particularly intuitive.

The `xs()` method is a subset of label-based indexing, but has a much more straightforward syntax. 
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.xs.html

Suppose we want a cross-section of our dataframe for the first trading day of the year 2019. We could do the following.

In [40]:
tech.xs('2019-01-02')

name,month,year,day,open,close,high,low,volume,volume_type
MSFT,1,2019,2,99.55,101.12,101.75,98.94,35329345,medium
FB,1,2019,2,128.99,135.68,137.51,128.56,28146193,medium
GOOGL,1,2019,2,1027.2,1054.68,1060.79,1025.28,1593395,medium
AMZN,1,2019,2,1465.2,1539.13,1553.36,1460.93,7983103,medium
AAPL,1,2019,2,154.89,157.92,158.85,154.23,37039737,medium


At first there's nothing particularly impressive here. There's not much difference between this and using the `loc[]` indexer.

In [41]:
tech.loc['2019-01-02']

name,month,year,day,open,close,high,low,volume,volume_type
MSFT,1,2019,2,99.55,101.12,101.75,98.94,35329345,medium
FB,1,2019,2,128.99,135.68,137.51,128.56,28146193,medium
GOOGL,1,2019,2,1027.2,1054.68,1060.79,1025.28,1593395,medium
AMZN,1,2019,2,1465.2,1539.13,1553.36,1460.93,7983103,medium
AAPL,1,2019,2,154.89,157.92,158.85,154.23,37039737,medium


But what if we wanted to skip the date level within our index and select solely based on stock name? If we wanted to do this with `loc[]`, we'd need to use a `slice()` object within a tuple, as we saw earlier.


In [42]:
tech.loc[(slice(None), 'FB'), :]

date,name,month,year,day,open,close,high,low,volume,volume_type
2014-01-02,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622,medium
2014-01-03,FB,1,2014,3,55.00,54.56,55.65,54.53,38287706,medium
2014-01-06,FB,1,2014,6,54.39,57.20,57.26,54.05,68974359,high
2014-01-07,FB,1,2014,7,57.67,57.92,58.55,57.22,77329009,high
2014-01-08,FB,1,2014,8,57.59,58.23,58.41,57.23,56800776,high
...,...,...,...,...,...,...,...,...,...,...
2019-08-19,FB,8,2019,19,186.01,186.17,187.50,184.85,9699661,low
2019-08-20,FB,8,2019,20,185.45,183.81,186.00,182.39,10087592,low
2019-08-21,FB,8,2019,21,185.00,183.55,185.90,183.14,8409548,low
2019-08-22,FB,8,2019,22,183.43,182.04,184.11,179.91,10829509,low


But with `xs()`, we simply pass in the label that we want to select by and specify the `level` (that is, the index level that the label belongs to). In this case, the company name is at level 1, and the date is at level 0.

In [43]:
tech.xs('FB', level = 1)

date,month,year,day,open,close,high,low,volume,volume_type
2014-01-02,1,2014,2,54.86,54.71,55.22,54.19,43257622,medium
2014-01-03,1,2014,3,55.00,54.56,55.65,54.53,38287706,medium
2014-01-06,1,2014,6,54.39,57.20,57.26,54.05,68974359,high
2014-01-07,1,2014,7,57.67,57.92,58.55,57.22,77329009,high
2014-01-08,1,2014,8,57.59,58.23,58.41,57.23,56800776,high
...,...,...,...,...,...,...,...,...,...
2019-08-19,8,2019,19,186.01,186.17,187.50,184.85,9699661,low
2019-08-20,8,2019,20,185.45,183.81,186.00,182.39,10087592,low
2019-08-21,8,2019,21,185.00,183.55,185.90,183.14,8409548,low
2019-08-22,8,2019,22,183.43,182.04,184.11,179.91,10829509,low


Note that the output is technically not identical to the output when using the `loc[]` indexer. By default, `xs()` drops the level that we're selecting from, in this case the "name" index. But you can override this behavior by setting the `drop_level` parameter to `False`.

In [44]:
tech.xs('FB', level = 1, drop_level = False)

date,name,month,year,day,open,close,high,low,volume,volume_type
2014-01-02,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622,medium
2014-01-03,FB,1,2014,3,55.00,54.56,55.65,54.53,38287706,medium
2014-01-06,FB,1,2014,6,54.39,57.20,57.26,54.05,68974359,high
2014-01-07,FB,1,2014,7,57.67,57.92,58.55,57.22,77329009,high
2014-01-08,FB,1,2014,8,57.59,58.23,58.41,57.23,56800776,high
...,...,...,...,...,...,...,...,...,...,...
2019-08-19,FB,8,2019,19,186.01,186.17,187.50,184.85,9699661,low
2019-08-20,FB,8,2019,20,185.45,183.81,186.00,182.39,10087592,low
2019-08-21,FB,8,2019,21,185.00,183.55,185.90,183.14,8409548,low
2019-08-22,FB,8,2019,22,183.43,182.04,184.11,179.91,10829509,low


But wait, there is more to this method! For instance, we could select from multiple levels of our index by passing same-sized **tuples** to both the `key` and `level` parameters.

In [45]:
tech.xs(key=('2019-01-02', 'FB'), level = (0, 1), drop_level = False)

date,name,month,year,day,open,close,high,low,volume,volume_type
2019-01-02,FB,1,2019,2,128.99,135.68,137.51,128.56,28146193,medium


When we get to multi-index columns later on, we'll see that we can use the exact same method to select from the other axis by setting the `axis` parameter to 1.

A quick note: slicing with `xs()` requires the `slice()` object. The colon shorthand will not work here, because it is only valid inside square brackets.
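
To see the `drop_level` behavior and the tuple form of `key`/`level` on a small scale, here is a minimal sketch. The `toy` dataframe and its values are made up for illustration and are not part of the course dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the course's `tech` dataframe.
toy = pd.DataFrame(
    {
        "date": ["2019-01-02", "2019-01-02", "2019-01-03", "2019-01-03"],
        "name": ["AAPL", "FB", "AAPL", "FB"],
        "close": [10.0, 20.0, 11.0, 21.0],
    }
).set_index(["date", "name"])

# Default behavior: the level we select from ("name") is dropped.
dropped = toy.xs("FB", level="name")

# drop_level=False keeps both index levels in the result.
kept = toy.xs("FB", level="name", drop_level=False)

# Same-sized tuples for key and level select from several levels at once.
one_row = toy.xs(("2019-01-03", "FB"), level=("date", "name"), drop_level=False)

print(dropped.index.names)
print(kept.index.names)
print(one_row)
```

Levels can be referred to by name, as here, or by position, as in the cells above.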

## Skill Challenge

#### 1. From the *tech* dataframe, select all of the stock prices between July 13, 2015 and August 17, 2016. Assign the resulting dataframe slice to the variable *tech_df2*.

This can be accomplished using `pd.IndexSlice` as follows:

In [46]:
tech_df2 = tech.loc[pd.IndexSlice['2015-07-13':'2016-08-17']]

In [47]:
tech_df2

Unnamed: 0_level_0,Unnamed: 1_level_0,month,year,day,open,close,high,low,volume,volume_type
date,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2015-07-13,MSFT,7,2015,13,44.98,45.54,45.62,44.95,28178329,medium
2015-07-13,AMZN,7,2015,13,448.29,455.57,457.87,447.54,3956802,medium
2015-07-13,FB,7,2015,13,88.66,90.10,90.22,88.42,29976670,medium
2015-07-13,GOOGL,7,2015,13,559.51,571.73,572.85,558.70,2089641,medium
2015-07-13,AAPL,7,2015,13,125.03,125.66,125.76,124.32,41440538,medium
...,...,...,...,...,...,...,...,...,...,...
2016-08-17,GOOGL,8,2016,17,800.00,805.42,805.63,796.30,1066070,medium
2016-08-17,AMZN,8,2016,17,764.41,764.63,765.22,759.20,1891116,low
2016-08-17,MSFT,8,2016,17,57.54,57.56,57.68,57.23,18856423,medium
2016-08-17,FB,8,2016,17,123.66,124.37,124.38,122.85,13794179,low


Note that we could have also done this just using the `loc[]` indexer.

In [48]:
tech.loc['2015-07-13':'2016-08-17']

Unnamed: 0_level_0,Unnamed: 1_level_0,month,year,day,open,close,high,low,volume,volume_type
date,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2015-07-13,MSFT,7,2015,13,44.98,45.54,45.62,44.95,28178329,medium
2015-07-13,AMZN,7,2015,13,448.29,455.57,457.87,447.54,3956802,medium
2015-07-13,FB,7,2015,13,88.66,90.10,90.22,88.42,29976670,medium
2015-07-13,GOOGL,7,2015,13,559.51,571.73,572.85,558.70,2089641,medium
2015-07-13,AAPL,7,2015,13,125.03,125.66,125.76,124.32,41440538,medium
...,...,...,...,...,...,...,...,...,...,...
2016-08-17,GOOGL,8,2016,17,800.00,805.42,805.63,796.30,1066070,medium
2016-08-17,AMZN,8,2016,17,764.41,764.63,765.22,759.20,1891116,low
2016-08-17,MSFT,8,2016,17,57.54,57.56,57.68,57.23,18856423,medium
2016-08-17,FB,8,2016,17,123.66,124.37,124.38,122.85,13794179,low
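
The two approaches match because `pd.IndexSlice['a':'b']` simply builds a plain `slice` object. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy frame: four dates x two tickers, already sorted.
dates = ["2015-07-10", "2015-07-13", "2016-08-17", "2016-08-18"]
idx = pd.MultiIndex.from_product([dates, ["AAPL", "FB"]], names=["date", "name"])
toy = pd.DataFrame({"close": list(range(8))}, index=idx)

# Both selections slice the outermost ("date") level, endpoints included.
via_index_slice = toy.loc[pd.IndexSlice["2015-07-13":"2016-08-17"]]
via_plain_loc = toy.loc["2015-07-13":"2016-08-17"]

print(via_index_slice.equals(via_plain_loc))
```

`pd.IndexSlice` becomes more useful once we need to slice inner levels, where the plain colon syntax is awkward to combine with tuples.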


#### 2. Select 10 days at random from *tech_df2*, but only AAPL price data.

To accomplish this, we first select only AAPL price data from *tech_df2*, and then chain on the `sample()` method to randomly grab 10 dates.

In [49]:
tech_df2.xs(key = "AAPL", level = 1, drop_level = False).sample(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,month,year,day,open,close,high,low,volume,volume_type
date,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2015-11-04,AAPL,11,2015,4,123.13,122.0,123.82,121.62,44886050,medium
2016-02-04,AAPL,2,2016,4,95.86,96.6,97.33,95.19,46471652,medium
2015-11-20,AAPL,11,2015,20,119.2,119.3,119.92,118.85,34287096,medium
2015-12-17,AAPL,12,2015,17,112.02,108.98,112.25,108.98,44772827,medium
2016-06-28,AAPL,6,2016,28,92.9,93.59,93.66,92.14,40444914,medium
2016-07-07,AAPL,7,2016,7,95.7,95.94,96.5,95.62,25139558,medium
2016-05-23,AAPL,5,2016,23,95.87,96.43,97.19,95.67,38018643,medium
2016-05-16,AAPL,5,2016,16,92.39,93.88,94.39,91.65,61259756,medium
2016-07-19,AAPL,7,2016,19,99.56,99.87,100.0,99.34,23779924,medium
2015-08-04,AAPL,8,2015,4,117.42,114.64,117.7,113.25,124138623,high


Alternatively using the `slice()` object:

In [50]:
tech_df2.loc[(slice(None), 'AAPL'), :].sample(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,month,year,day,open,close,high,low,volume,volume_type
date,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2016-06-16,AAPL,6,2016,16,96.45,97.55,97.75,96.07,31326815,medium
2015-08-12,AAPL,8,2015,12,112.53,115.24,115.42,109.63,101685610,high
2016-04-11,AAPL,4,2016,11,108.97,109.02,110.61,108.83,29407518,medium
2016-02-01,AAPL,2,2016,1,96.47,96.43,96.71,95.4,40943541,medium
2015-11-11,AAPL,11,2015,11,116.37,116.11,117.42,115.21,45217971,medium
2016-07-13,AAPL,7,2016,13,97.41,96.87,97.67,96.84,25892171,medium
2016-06-30,AAPL,6,2016,30,94.44,95.6,95.77,94.3,35836356,medium
2015-07-16,AAPL,7,2015,16,127.74,128.51,128.57,127.35,36222447,medium
2016-05-16,AAPL,5,2016,16,92.39,93.88,94.39,91.65,61259756,medium
2015-11-27,AAPL,11,2015,27,118.29,117.81,118.41,117.6,13046445,low
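
Because `sample()` is random, the two outputs above contain different rows. If a repeatable draw is needed, `sample()` accepts a `random_state` seed. A minimal sketch on made-up data (the `toy` frame and the seed value are arbitrary choices for illustration):

```python
import pandas as pd

# Hypothetical toy frame: five dates x two tickers.
dates = ["2016-01-04", "2016-01-05", "2016-01-06", "2016-01-07", "2016-01-08"]
idx = pd.MultiIndex.from_product([dates, ["AAPL", "FB"]], names=["date", "name"])
toy = pd.DataFrame({"close": [float(i) for i in range(10)]}, index=idx)

# Keep only AAPL rows, then draw three of them with a fixed seed.
aapl = toy.xs("AAPL", level="name", drop_level=False)
first_draw = aapl.sample(3, random_state=42)
second_draw = aapl.sample(3, random_state=42)

# The same seed gives the same rows on both draws.
print(first_draw.equals(second_draw))
```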


#### 3. Select all of the intraday *high* and *low* prices for AAPL and GOOGL in all of the dates in *tech_df2*.

We can achieve this by using `loc[]` to select only AAPL and GOOGL from the "name" index level, high and low from the columns.

In [51]:
tech_df2.loc[(slice(None), ["AAPL", "GOOGL"]), ["low", "high"]]

Unnamed: 0_level_0,Unnamed: 1_level_0,low,high
date,name,Unnamed: 2_level_1,Unnamed: 3_level_1
2015-07-13,GOOGL,558.70,572.85
2015-07-13,AAPL,124.32,125.76
2015-07-14,GOOGL,574.17,589.71
2015-07-14,AAPL,125.04,126.37
2015-07-15,GOOGL,580.21,588.69
...,...,...,...
2016-08-15,AAPL,108.08,109.54
2016-08-16,GOOGL,797.00,804.26
2016-08-16,AAPL,109.21,110.23
2016-08-17,GOOGL,796.30,805.63


Alternatively, using `pd.IndexSlice[]`:

In [52]:
tech_df2.loc[pd.IndexSlice[:, ['AAPL', 'GOOGL']], ['low','high']]

Unnamed: 0_level_0,Unnamed: 1_level_0,low,high
date,name,Unnamed: 2_level_1,Unnamed: 3_level_1
2015-07-13,GOOGL,558.70,572.85
2015-07-13,AAPL,124.32,125.76
2015-07-14,GOOGL,574.17,589.71
2015-07-14,AAPL,125.04,126.37
2015-07-15,GOOGL,580.21,588.69
...,...,...,...
2016-08-15,AAPL,108.08,109.54
2016-08-16,GOOGL,797.00,804.26
2016-08-16,AAPL,109.21,110.23
2016-08-17,GOOGL,796.30,805.63
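
The `slice(None)` tuple and the `pd.IndexSlice` form are two spellings of the same row selector. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy frame: two dates x three tickers, three price columns.
idx = pd.MultiIndex.from_product(
    [["2015-07-13", "2015-07-14"], ["AAPL", "FB", "GOOGL"]],
    names=["date", "name"],
)
toy = pd.DataFrame(
    {
        "low": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "high": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5],
        "close": [1.2, 2.2, 3.2, 4.2, 5.2, 6.2],
    },
    index=idx,
)

# All dates, two tickers, two columns: written two ways.
with_slice = toy.loc[(slice(None), ["AAPL", "GOOGL"]), ["low", "high"]]
with_index_slice = toy.loc[pd.IndexSlice[:, ["AAPL", "GOOGL"]], ["low", "high"]]

print(with_slice.equals(with_index_slice))
```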


## The Anatomy of the MultiIndex Object

How is a multiindex object structurally put together in Pandas? We know that our dataframe has a multi-level index, but that's not obvious from just looking at the dataframe itself.

In [53]:
tech.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,month,year,day,open,close,high,low,volume,volume_type
date,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2014-01-02,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622,medium
2014-01-02,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851,low
2014-01-02,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719,medium
2014-01-02,MSFT,1,2014,2,37.35,37.16,37.4,37.1,30643745,medium
2014-01-02,AMZN,1,2014,2,398.8,397.97,399.36,394.02,2140246,medium


But from its type, there's nothing obvious that points it out as a multiindex dataframe.

In [54]:
type(tech)

pandas.core.frame.DataFrame

If we look at the type of the index, however, we do see that it's a `MultiIndex` object.

In [55]:
type(tech.index)

pandas.core.indexes.multi.MultiIndex

Let's take a closer look at the innards of the multiindex.

In [56]:
tech.index

MultiIndex([('2014-01-02',    'FB'),
            ('2014-01-02',  'AAPL'),
            ('2014-01-02', 'GOOGL'),
            ('2014-01-02',  'MSFT'),
            ('2014-01-02',  'AMZN'),
            ('2014-01-03',    'FB'),
            ('2014-01-03', 'GOOGL'),
            ('2014-01-03',  'MSFT'),
            ('2014-01-03',  'AAPL'),
            ('2014-01-03',  'AMZN'),
            ...
            ('2019-08-22',  'MSFT'),
            ('2019-08-22',    'FB'),
            ('2019-08-22',  'AMZN'),
            ('2019-08-22',  'AAPL'),
            ('2019-08-22', 'GOOGL'),
            ('2019-08-23',  'MSFT'),
            ('2019-08-23',  'AAPL'),
            ('2019-08-23', 'GOOGL'),
            ('2019-08-23',  'AMZN'),
            ('2019-08-23',    'FB')],
           names=['date', 'name'], length=7105)

As we can see, the multiindex is a data structure in its own right. It has its own specific attributes, sequence of values, and methods.

The beauty of this is that the object can support very complex structures and hierarchies in a self-contained way. Let's break this down one component at a time, starting with the `names` attribute, which returns a list (technically a `FrozenList`) of the labels that we're indexing by.

In [57]:
tech.index.names

FrozenList(['date', 'name'])

The next component is called `levels`, which contains a list of lists containing the range of values for each of the labels in the multiindex.

First, let's take a look at `nlevels`, or number of levels.

In [58]:
tech.index.nlevels

2

As we already know, this particular index has two levels (in our case "date" and "name").

Now let's look at the actual levels.

In [59]:
tech.index.levels

FrozenList([['2014-01-02', '2014-01-03', '2014-01-06', '2014-01-07', '2014-01-08', '2014-01-09', '2014-01-10', '2014-01-13', '2014-01-14', '2014-01-15', '2014-01-16', '2014-01-17', '2014-01-21', '2014-01-22', '2014-01-23', '2014-01-24', '2014-01-27', '2014-01-28', '2014-01-29', '2014-01-30', '2014-01-31', '2014-02-03', '2014-02-04', '2014-02-05', '2014-02-06', '2014-02-07', '2014-02-10', '2014-02-11', '2014-02-12', '2014-02-13', '2014-02-14', '2014-02-18', '2014-02-19', '2014-02-20', '2014-02-21', '2014-02-24', '2014-02-25', '2014-02-26', '2014-02-27', '2014-02-28', '2014-03-03', '2014-03-04', '2014-03-05', '2014-03-06', '2014-03-07', '2014-03-10', '2014-03-11', '2014-03-12', '2014-03-13', '2014-03-14', '2014-03-17', '2014-03-18', '2014-03-19', '2014-03-20', '2014-03-21', '2014-03-24', '2014-03-25', '2014-03-26', '2014-03-27', '2014-03-28', '2014-03-31', '2014-04-01', '2014-04-02', '2014-04-03', '2014-04-04', '2014-04-07', '2014-04-08', '2014-04-09', '2014-04-10', '2014-04-11', '2014-0

Again, this returns a list of lists, with each list corresponding to a specific index level and containing all of the unique values within that index level. Since it's a list, we can examine each level individually.

In [60]:
tech.index.levels[0]

Index(['2014-01-02', '2014-01-03', '2014-01-06', '2014-01-07', '2014-01-08',
       '2014-01-09', '2014-01-10', '2014-01-13', '2014-01-14', '2014-01-15',
       ...
       '2019-08-12', '2019-08-13', '2019-08-14', '2019-08-15', '2019-08-16',
       '2019-08-19', '2019-08-20', '2019-08-21', '2019-08-22', '2019-08-23'],
      dtype='object', name='date', length=1421)

In [61]:
tech.index.levels[1]

Index(['AAPL', 'AMZN', 'FB', 'GOOGL', 'MSFT'], dtype='object', name='name')

A quick note - the highest level of the index (at the 0 position of the `levels` list), is the index level at the far left of the dataframe. As we move to the right, the levels increase. 

If we want to look at the length of each of our levels, we can use the `levshape` attribute.

In [62]:
tech.index.levshape

(1421, 5)

Thus, we have 1421 dates and 5 names (tickers).

Lastly, don't forget that multiindices are not just freely-floating sequences of labels. They represent a tight coupling of hierarchies of labels. An easy way to look at label combinations is to access the `values` attribute of the multiindex.

In [63]:
tech.index.values

array([('2014-01-02', 'FB'), ('2014-01-02', 'AAPL'),
       ('2014-01-02', 'GOOGL'), ..., ('2019-08-23', 'GOOGL'),
       ('2019-08-23', 'AMZN'), ('2019-08-23', 'FB')], dtype=object)

We see that this gives us an array of tuples, with each tuple containing two items - one date and one stock ticker. If we were to add a third dimension to our index, we would get a third value in each tuple.

Essentially, the `index.values` attribute gives all existing combinations of the different index levels present in the dataframe. 
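
The difference between `levels` (the unique values per level) and `values` (the combinations that actually occur) is easiest to see on a tiny index. A minimal sketch, using a made-up index where one date/ticker combination is deliberately missing:

```python
import pandas as pd

# Hypothetical three-row index: ("2014-01-03", "AAPL") is intentionally absent.
idx = pd.MultiIndex.from_tuples(
    [("2014-01-02", "FB"), ("2014-01-02", "AAPL"), ("2014-01-03", "FB")],
    names=["date", "name"],
)

print(idx.names)     # labels of the levels
print(idx.nlevels)   # number of levels
print(idx.levels)    # unique values per level, stored in sorted order
print(idx.levshape)  # number of unique values per level
print(idx.values)    # only the combinations that exist, in row order
```

Here `levshape` is `(2, 2)`, so four combinations are possible, but `values` holds only the three that are present.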

## Adding Another Level to a MultiIndex DataFrame

In this lecture, we'll be adding another level to our index, namely *volume_type*. 

The most intuitive way to do this is to use the `set_index()` method with the `append` parameter set to `True`.

In [64]:
tech.set_index('volume_type', append = True, inplace = True)

In [65]:
tech.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,month,year,day,open,close,high,low,volume
date,name,volume_type,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2014-01-02,FB,medium,1,2014,2,54.86,54.71,55.22,54.19,43257622
2014-01-02,AAPL,low,1,2014,2,79.38,79.02,79.58,78.86,8398851
2014-01-02,GOOGL,medium,1,2014,2,557.73,556.56,558.88,554.13,1822719
2014-01-02,MSFT,medium,1,2014,2,37.35,37.16,37.4,37.1,30643745
2014-01-02,AMZN,medium,1,2014,2,398.8,397.97,399.36,394.02,2140246
2014-01-03,FB,medium,1,2014,3,55.0,54.56,55.65,54.53,38287706
2014-01-03,GOOGL,medium,1,2014,3,557.5,552.5,558.47,552.47,1669229
2014-01-03,MSFT,medium,1,2014,3,37.2,36.91,37.22,36.6,31134795
2014-01-03,AAPL,low,1,2014,3,79.0,77.28,79.1,77.2,14043410
2014-01-03,AMZN,medium,1,2014,3,398.29,396.44,402.71,396.22,2213512


We see that "volume_type" is added as the third level of our index. Now let's check our index attributes.

In [66]:
tech.index.nlevels

3

In [67]:
tech.index.levels[2]

Index(['high', 'low', 'medium'], dtype='object', name='volume_type')

In [68]:
tech.index.values

array([('2014-01-02', 'FB', 'medium'), ('2014-01-02', 'AAPL', 'low'),
       ('2014-01-02', 'GOOGL', 'medium'), ...,
       ('2019-08-23', 'GOOGL', 'medium'),
       ('2019-08-23', 'AMZN', 'medium'), ('2019-08-23', 'FB', 'medium')],
      dtype=object)

As expected, the `values` attribute now consists of 3-tuples, with each tuple capturing one value from each level of the multiindex.
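
The role of `append=True` can be checked on a small frame. A minimal sketch with made-up data (the `toy` frame is hypothetical); it skips `inplace=True` so the original stays available for comparison:

```python
import pandas as pd

# Hypothetical toy frame with a two-level index and a volume_type column.
toy = pd.DataFrame(
    {
        "date": ["2014-01-02", "2014-01-02", "2014-01-03"],
        "name": ["FB", "AAPL", "FB"],
        "volume_type": ["medium", "low", "medium"],
        "close": [54.71, 79.02, 54.56],
    }
).set_index(["date", "name"])

# append=True adds the column as a new innermost level.
three_level = toy.set_index("volume_type", append=True)

# Without append, the existing levels would be replaced instead.
replaced = toy.set_index("volume_type")

print(three_level.index.names)
print(replaced.index.names)
```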

Since we introduced another level, selecting by label will look a bit different from our two-index dataframes. For example, suppose we want to select data from high-volume trading days in January 2019. We can still use the `loc[]` indexer for this, but we'll need a longer tuple.

Our approach will be to slice all of the dates in January 2019 from the first level, take all names from the second level, and specify only high-volume trading days from the third level. For this, we will use the `loc[]` indexer with the `slice()` object.

In [69]:
tech.loc[(slice("2019-01-01","2019-01-31"), slice(None), 'high'), :]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,month,year,day,open,close,high,low,volume
date,name,volume_type,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2019-01-03,AAPL,high,1,2019,3,143.98,142.19,145.72,142.0,91312195
2019-01-04,AMZN,high,1,2019,4,1530.0,1575.39,1594.0,1518.31,9182575
2019-01-08,AMZN,high,1,2019,8,1664.69,1656.58,1676.61,1616.61,8881428
2019-01-31,AMZN,high,1,2019,31,1692.85,1718.73,1736.41,1679.08,10910338
2019-01-31,FB,high,1,2019,31,165.6,166.69,171.68,165.0,77233602
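
The same three-part selector can be tried on a handful of rows. A minimal sketch on made-up data (the `toy` frame is hypothetical); the index is sorted first because label slicing on a multiindex relies on sorted levels:

```python
import pandas as pd

# Hypothetical three-level toy frame.
idx = pd.MultiIndex.from_tuples(
    [
        ("2018-12-31", "AAPL", "high"),
        ("2019-01-03", "AAPL", "high"),
        ("2019-01-03", "FB", "medium"),
        ("2019-01-31", "FB", "high"),
        ("2019-02-01", "FB", "high"),
    ],
    names=["date", "name", "volume_type"],
)
toy = pd.DataFrame({"close": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=idx).sort_index()

# One selector per level: a date range, every name, only "high".
selected = toy.loc[(slice("2019-01-01", "2019-01-31"), slice(None), "high"), :]

print(selected)
```

Note that the slice endpoints do not have to be present in the index: `"2019-01-01"` is not a row label here, yet the range still resolves.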


As another example, let's grab all of the high-volume trading days irrespective of date or company. We can easily do this with the `xs()` method (although other approaches would work as well).

In [70]:
tech.xs(key = 'high', level = 2, drop_level = False)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,month,year,day,open,close,high,low,volume
date,name,volume_type,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2014-01-06,FB,high,1,2014,6,54.39,57.20,57.26,54.05,68974359
2014-01-07,FB,high,1,2014,7,57.67,57.92,58.55,57.22,77329009
2014-01-08,FB,high,1,2014,8,57.59,58.23,58.41,57.23,56800776
2014-01-09,FB,high,1,2014,9,58.66,57.22,58.96,56.65,92349222
2014-01-13,FB,high,1,2014,13,57.89,55.91,58.25,55.38,63106519
...,...,...,...,...,...,...,...,...,...,...
2019-04-30,GOOGL,high,4,2019,30,1190.63,1198.96,1200.98,1183.00,6658855
2019-06-03,FB,high,6,2019,3,175.00,164.15,175.05,161.01,56059609
2019-06-03,AMZN,high,6,2019,3,1760.01,1692.69,1766.29,1672.00,9098708
2019-06-03,GOOGL,high,6,2019,3,1066.93,1038.74,1067.00,1027.03,4844480


If we only cared about Facebook's high-volume days, we could add that in as well.

In [71]:
tech.xs(key = ('FB','high'), level = (1,2), drop_level = False)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,month,year,day,open,close,high,low,volume
date,name,volume_type,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2014-01-06,FB,high,1,2014,6,54.39,57.20,57.26,54.05,68974359
2014-01-07,FB,high,1,2014,7,57.67,57.92,58.55,57.22,77329009
2014-01-08,FB,high,1,2014,8,57.59,58.23,58.41,57.23,56800776
2014-01-09,FB,high,1,2014,9,58.66,57.22,58.96,56.65,92349222
2014-01-13,FB,high,1,2014,13,57.89,55.91,58.25,55.38,63106519
...,...,...,...,...,...,...,...,...,...,...
2018-10-31,FB,high,10,2018,31,155.00,151.79,156.40,148.96,60101251
2018-12-19,FB,high,12,2018,19,141.21,133.24,144.91,132.50,57404894
2018-12-21,FB,high,12,2018,21,133.39,124.95,134.90,123.42,56901491
2019-01-31,FB,high,1,2019,31,165.60,166.69,171.68,165.00,77233602


As an interesting tidbit: according to our data, 115 of the 375 high-volume trading days came from trading in Facebook shares.
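
Counts like these can be read off the lengths of the two `xs()` selections. A minimal sketch on made-up data (the `toy` frame and its numbers are hypothetical and unrelated to the figures above):

```python
import pandas as pd

# Hypothetical three-level toy frame.
idx = pd.MultiIndex.from_tuples(
    [
        ("2019-01-03", "AAPL", "high"),
        ("2019-01-03", "FB", "medium"),
        ("2019-01-04", "AMZN", "high"),
        ("2019-01-31", "FB", "high"),
        ("2019-02-01", "FB", "high"),
    ],
    names=["date", "name", "volume_type"],
)
toy = pd.DataFrame({"volume": [90, 20, 80, 70, 60]}, index=idx)

# All high-volume rows, then only Facebook's high-volume rows.
high_days = toy.xs("high", level="volume_type", drop_level=False)
fb_high_days = toy.xs(("FB", "high"), level=("name", "volume_type"), drop_level=False)

# Share of high-volume rows that belong to Facebook.
fb_share = len(fb_high_days) / len(high_days)
print(len(fb_high_days), len(high_days), fb_share)
```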

## Shuffling Levels with `swaplevel()` and `reorder_levels()`

What happens if we're unhappy with the order of the levels in our multiindex? When we added a new index level in our last lecture, it was appended to the dataframe as the innermost (or lowest-level) index. What if instead we wanted it to be in a different position?

This is easily doable using the `swaplevel()` method, which swaps indices at two levels that we specify.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.swaplevel.html
* If given no arguments, the method by default swaps the two innermost levels.

In [72]:
tech.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,month,year,day,open,close,high,low,volume
date,name,volume_type,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2014-01-02,FB,medium,1,2014,2,54.86,54.71,55.22,54.19,43257622
2014-01-02,AAPL,low,1,2014,2,79.38,79.02,79.58,78.86,8398851
2014-01-02,GOOGL,medium,1,2014,2,557.73,556.56,558.88,554.13,1822719
2014-01-02,MSFT,medium,1,2014,2,37.35,37.16,37.4,37.1,30643745
2014-01-02,AMZN,medium,1,2014,2,398.8,397.97,399.36,394.02,2140246


In our example dataset, within our three indices we will swap "volume_type" and "name".

In [73]:
tech.swaplevel(i = 2, j = 1)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,month,year,day,open,close,high,low,volume
date,volume_type,name,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2014-01-02,medium,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622
2014-01-02,low,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851
2014-01-02,medium,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719
2014-01-02,medium,MSFT,1,2014,2,37.35,37.16,37.40,37.10,30643745
2014-01-02,medium,AMZN,1,2014,2,398.80,397.97,399.36,394.02,2140246
...,...,...,...,...,...,...,...,...,...,...
2019-08-23,medium,MSFT,8,2019,23,137.19,133.39,138.35,132.80,38515386
2019-08-23,medium,AAPL,8,2019,23,209.43,202.64,212.05,201.00,46882843
2019-08-23,medium,GOOGL,8,2019,23,1185.17,1153.58,1195.67,1150.00,1813141
2019-08-23,medium,AMZN,8,2019,23,1793.03,1749.62,1804.90,1745.23,5277898


Since "name" and "volume_type" are the two innermost indices, the default behavior will produce the same result.

In [74]:
tech.swaplevel()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,month,year,day,open,close,high,low,volume
date,volume_type,name,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2014-01-02,medium,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622
2014-01-02,low,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851
2014-01-02,medium,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719
2014-01-02,medium,MSFT,1,2014,2,37.35,37.16,37.40,37.10,30643745
2014-01-02,medium,AMZN,1,2014,2,398.80,397.97,399.36,394.02,2140246
...,...,...,...,...,...,...,...,...,...,...
2019-08-23,medium,MSFT,8,2019,23,137.19,133.39,138.35,132.80,38515386
2019-08-23,medium,AAPL,8,2019,23,209.43,202.64,212.05,201.00,46882843
2019-08-23,medium,GOOGL,8,2019,23,1185.17,1153.58,1195.67,1150.00,1813141
2019-08-23,medium,AMZN,8,2019,23,1793.03,1749.62,1804.90,1745.23,5277898


The method also works with label names instead of label positions.

In [75]:
tech.swaplevel(i = 'volume_type', j = 'name')

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,month,year,day,open,close,high,low,volume
date,volume_type,name,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2014-01-02,medium,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622
2014-01-02,low,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851
2014-01-02,medium,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719
2014-01-02,medium,MSFT,1,2014,2,37.35,37.16,37.40,37.10,30643745
2014-01-02,medium,AMZN,1,2014,2,398.80,397.97,399.36,394.02,2140246
...,...,...,...,...,...,...,...,...,...,...
2019-08-23,medium,MSFT,8,2019,23,137.19,133.39,138.35,132.80,38515386
2019-08-23,medium,AAPL,8,2019,23,209.43,202.64,212.05,201.00,46882843
2019-08-23,medium,GOOGL,8,2019,23,1185.17,1153.58,1195.67,1150.00,1813141
2019-08-23,medium,AMZN,8,2019,23,1793.03,1749.62,1804.90,1745.23,5277898


Unfortunately, this method does not include an `inplace` parameter. Thus if we want to make our changes permanent, we have to reassign the variable to the new dataframe. Let's go ahead and do that - in this case it works nicely because the "least variable" index types are at the higher levels. Thus, as we work "inward" in our indices, they have increasing variability.

In [76]:
tech = tech.swaplevel(i = 'volume_type', j = 'name')

In [77]:
tech.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,month,year,day,open,close,high,low,volume
date,volume_type,name,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2014-01-02,medium,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622
2014-01-02,low,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851
2014-01-02,medium,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719
2014-01-02,medium,MSFT,1,2014,2,37.35,37.16,37.4,37.1,30643745
2014-01-02,medium,AMZN,1,2014,2,398.8,397.97,399.36,394.02,2140246
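
The three equivalent `swaplevel()` calls, and the fact that the method returns a new object rather than modifying the original, can be verified on a small frame. A minimal sketch with made-up data (the `toy` frame is hypothetical and starts in the date, name, volume_type order):

```python
import pandas as pd

# Hypothetical three-level toy frame.
idx = pd.MultiIndex.from_tuples(
    [
        ("2014-01-02", "FB", "medium"),
        ("2014-01-02", "AAPL", "low"),
        ("2014-01-03", "FB", "medium"),
    ],
    names=["date", "name", "volume_type"],
)
toy = pd.DataFrame({"close": [54.71, 79.02, 54.56]}, index=idx)

swapped_default = toy.swaplevel()                      # two innermost levels
swapped_by_position = toy.swaplevel(i=2, j=1)          # by level position
swapped_by_name = toy.swaplevel(i="volume_type", j="name")  # by level name

print(swapped_default.index.names)
print(toy.index.names)  # the original is left untouched
```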


In Pandas, there is also another more powerful method that allows you to do more extensive reorders with more than two indices. This is the `reorder_levels()` method, which allows you to express a new order for the multiindex all in one go. All you need to do is pass in a list of the indices declared in the order in which you want them positioned.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reorder_levels.html

In [78]:
tech.reorder_levels([2, 0, 1])

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,month,year,day,open,close,high,low,volume
name,date,volume_type,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
FB,2014-01-02,medium,1,2014,2,54.86,54.71,55.22,54.19,43257622
AAPL,2014-01-02,low,1,2014,2,79.38,79.02,79.58,78.86,8398851
GOOGL,2014-01-02,medium,1,2014,2,557.73,556.56,558.88,554.13,1822719
MSFT,2014-01-02,medium,1,2014,2,37.35,37.16,37.40,37.10,30643745
AMZN,2014-01-02,medium,1,2014,2,398.80,397.97,399.36,394.02,2140246
...,...,...,...,...,...,...,...,...,...,...
MSFT,2019-08-23,medium,8,2019,23,137.19,133.39,138.35,132.80,38515386
AAPL,2019-08-23,medium,8,2019,23,209.43,202.64,212.05,201.00,46882843
GOOGL,2019-08-23,medium,8,2019,23,1185.17,1153.58,1195.67,1150.00,1813141
AMZN,2019-08-23,medium,8,2019,23,1793.03,1749.62,1804.90,1745.23,5277898


Instead of calling the reorder on the entire dataframe, we also have the option of calling it on the index object itself. This will return the reordered multiindex.

In [79]:
tech.index.reorder_levels([2, 0, 1])

MultiIndex([(   'FB', '2014-01-02', 'medium'),
            ( 'AAPL', '2014-01-02',    'low'),
            ('GOOGL', '2014-01-02', 'medium'),
            ( 'MSFT', '2014-01-02', 'medium'),
            ( 'AMZN', '2014-01-02', 'medium'),
            (   'FB', '2014-01-03', 'medium'),
            ('GOOGL', '2014-01-03', 'medium'),
            ( 'MSFT', '2014-01-03', 'medium'),
            ( 'AAPL', '2014-01-03',    'low'),
            ( 'AMZN', '2014-01-03', 'medium'),
            ...
            ( 'MSFT', '2019-08-22', 'medium'),
            (   'FB', '2019-08-22',    'low'),
            ( 'AMZN', '2019-08-22', 'medium'),
            ( 'AAPL', '2019-08-22', 'medium'),
            ('GOOGL', '2019-08-22',    'low'),
            ( 'MSFT', '2019-08-23', 'medium'),
            ( 'AAPL', '2019-08-23', 'medium'),
            ('GOOGL', '2019-08-23', 'medium'),
            ( 'AMZN', '2019-08-23', 'medium'),
            (   'FB', '2019-08-23', 'medium')],
           names=['name', 'date', 'volume_t
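
`reorder_levels()` also accepts level names in place of positions. A minimal sketch on made-up data (the `toy` frame is hypothetical and starts in the date, volume_type, name order):

```python
import pandas as pd

# Hypothetical three-level toy frame.
idx = pd.MultiIndex.from_tuples(
    [
        ("2014-01-02", "medium", "FB"),
        ("2014-01-02", "low", "AAPL"),
        ("2014-01-03", "medium", "FB"),
    ],
    names=["date", "volume_type", "name"],
)
toy = pd.DataFrame({"close": [54.71, 79.02, 54.56]}, index=idx)

# The new order, given by position and by name.
by_position = toy.reorder_levels([2, 0, 1])
by_name = toy.reorder_levels(["name", "date", "volume_type"])

# Calling the method on the index returns only the reordered MultiIndex.
index_only = toy.index.reorder_levels([2, 0, 1])

print(by_position.index.names)
```

Only the order of the levels changes. The rows stay where they were, so the result is not automatically sorted by the new outermost level.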

## Removing MultiIndex Levels with `droplevel()`

In previous lectures we worked on adding new levels to our multiindex. But what about removing levels?

Pandas has offered a method that makes this very easy since version 0.24. The `droplevel()` method allows you to drop a level from your multiindex.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.droplevel.html
* The levels to be dropped can be identified by index position or by name
* This method discards the index that is removed - there is no way to keep that level around using this method.

In [80]:
tech.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,month,year,day,open,close,high,low,volume
date,volume_type,name,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2014-01-02,medium,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622
2014-01-02,low,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851
2014-01-02,medium,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719
2014-01-02,medium,MSFT,1,2014,2,37.35,37.16,37.4,37.1,30643745
2014-01-02,medium,AMZN,1,2014,2,398.8,397.97,399.36,394.02,2140246


In [81]:
tech.droplevel(1)

Unnamed: 0_level_0,Unnamed: 1_level_0,month,year,day,open,close,high,low,volume
date,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2014-01-02,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622
2014-01-02,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851
2014-01-02,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719
2014-01-02,MSFT,1,2014,2,37.35,37.16,37.40,37.10,30643745
2014-01-02,AMZN,1,2014,2,398.80,397.97,399.36,394.02,2140246
...,...,...,...,...,...,...,...,...,...
2019-08-23,MSFT,8,2019,23,137.19,133.39,138.35,132.80,38515386
2019-08-23,AAPL,8,2019,23,209.43,202.64,212.05,201.00,46882843
2019-08-23,GOOGL,8,2019,23,1185.17,1153.58,1195.67,1150.00,1813141
2019-08-23,AMZN,8,2019,23,1793.03,1749.62,1804.90,1745.23,5277898


The instructor prefers using `reset_index()` because it's more powerful and flexible. It allows us to get rid of a level (or multiple levels) if we want to, and simultaneously restores that level (or levels) as a column in the dataframe.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html
* If you wish, you can discard the removed index by setting the `drop` parameter to `True`.
* Whereas a single-index dataframe will reset to the default index, a multi-index dataframe will simply remove the specified index and keep the remainders.

In [82]:
tech.reset_index(1)

Unnamed: 0_level_0,Unnamed: 1_level_0,volume_type,month,year,day,open,close,high,low,volume
date,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2014-01-02,FB,medium,1,2014,2,54.86,54.71,55.22,54.19,43257622
2014-01-02,AAPL,low,1,2014,2,79.38,79.02,79.58,78.86,8398851
2014-01-02,GOOGL,medium,1,2014,2,557.73,556.56,558.88,554.13,1822719
2014-01-02,MSFT,medium,1,2014,2,37.35,37.16,37.40,37.10,30643745
2014-01-02,AMZN,medium,1,2014,2,398.80,397.97,399.36,394.02,2140246
...,...,...,...,...,...,...,...,...,...,...
2019-08-23,MSFT,medium,8,2019,23,137.19,133.39,138.35,132.80,38515386
2019-08-23,AAPL,medium,8,2019,23,209.43,202.64,212.05,201.00,46882843
2019-08-23,GOOGL,medium,8,2019,23,1185.17,1153.58,1195.67,1150.00,1813141
2019-08-23,AMZN,medium,8,2019,23,1793.03,1749.62,1804.90,1745.23,5277898
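
The difference between the two methods comes down to what happens to the removed level. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical three-level toy frame.
idx = pd.MultiIndex.from_tuples(
    [
        ("2014-01-02", "medium", "FB"),
        ("2014-01-02", "low", "AAPL"),
        ("2014-01-03", "medium", "FB"),
    ],
    names=["date", "volume_type", "name"],
)
toy = pd.DataFrame({"close": [54.71, 79.02, 54.56]}, index=idx)

# droplevel() throws the level away.
dropped = toy.droplevel("volume_type")

# reset_index() removes the level from the index and restores it as a column.
restored = toy.reset_index("volume_type")

print(dropped.columns.tolist())
print(restored.columns.tolist())
```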


What do we do if we want to remove multiple levels from our index?

One way is to call `droplevel()` or `reset_index()` multiple times. Thankfully, both of these methods accept lists of level names (or level positions) so that we can remove more than one at a time.

In [83]:
tech.droplevel(['volume_type', 'name'])

Unnamed: 0_level_0,month,year,day,open,close,high,low,volume
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2014-01-02,1,2014,2,54.86,54.71,55.22,54.19,43257622
2014-01-02,1,2014,2,79.38,79.02,79.58,78.86,8398851
2014-01-02,1,2014,2,557.73,556.56,558.88,554.13,1822719
2014-01-02,1,2014,2,37.35,37.16,37.40,37.10,30643745
2014-01-02,1,2014,2,398.80,397.97,399.36,394.02,2140246
...,...,...,...,...,...,...,...,...
2019-08-23,8,2019,23,137.19,133.39,138.35,132.80,38515386
2019-08-23,8,2019,23,209.43,202.64,212.05,201.00,46882843
2019-08-23,8,2019,23,1185.17,1153.58,1195.67,1150.00,1813141
2019-08-23,8,2019,23,1793.03,1749.62,1804.90,1745.23,5277898


In [84]:
tech.droplevel([1, 2])

Unnamed: 0_level_0,month,year,day,open,close,high,low,volume
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2014-01-02,1,2014,2,54.86,54.71,55.22,54.19,43257622
2014-01-02,1,2014,2,79.38,79.02,79.58,78.86,8398851
2014-01-02,1,2014,2,557.73,556.56,558.88,554.13,1822719
2014-01-02,1,2014,2,37.35,37.16,37.40,37.10,30643745
2014-01-02,1,2014,2,398.80,397.97,399.36,394.02,2140246
...,...,...,...,...,...,...,...,...
2019-08-23,8,2019,23,137.19,133.39,138.35,132.80,38515386
2019-08-23,8,2019,23,209.43,202.64,212.05,201.00,46882843
2019-08-23,8,2019,23,1185.17,1153.58,1195.67,1150.00,1813141
2019-08-23,8,2019,23,1793.03,1749.62,1804.90,1745.23,5277898


In [85]:
tech.reset_index(['volume_type','name'], drop = True)

Unnamed: 0_level_0,month,year,day,open,close,high,low,volume
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2014-01-02,1,2014,2,54.86,54.71,55.22,54.19,43257622
2014-01-02,1,2014,2,79.38,79.02,79.58,78.86,8398851
2014-01-02,1,2014,2,557.73,556.56,558.88,554.13,1822719
2014-01-02,1,2014,2,37.35,37.16,37.40,37.10,30643745
2014-01-02,1,2014,2,398.80,397.97,399.36,394.02,2140246
...,...,...,...,...,...,...,...,...
2019-08-23,8,2019,23,137.19,133.39,138.35,132.80,38515386
2019-08-23,8,2019,23,209.43,202.64,212.05,201.00,46882843
2019-08-23,8,2019,23,1185.17,1153.58,1195.67,1150.00,1813141
2019-08-23,8,2019,23,1793.03,1749.62,1804.90,1745.23,5277898


It's also worth noting that `reset_index()` is useful for eliminating our index entirely and resetting it to the standard range index. All of the existing index levels are moved into the dataframe as columns, unless the `drop` parameter is set to `True`, in which case they are discarded.

In [86]:
tech.reset_index()

Unnamed: 0,date,volume_type,name,month,year,day,open,close,high,low,volume
0,2014-01-02,medium,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622
1,2014-01-02,low,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851
2,2014-01-02,medium,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719
3,2014-01-02,medium,MSFT,1,2014,2,37.35,37.16,37.40,37.10,30643745
4,2014-01-02,medium,AMZN,1,2014,2,398.80,397.97,399.36,394.02,2140246
...,...,...,...,...,...,...,...,...,...,...,...
7100,2019-08-23,medium,MSFT,8,2019,23,137.19,133.39,138.35,132.80,38515386
7101,2019-08-23,medium,AAPL,8,2019,23,209.43,202.64,212.05,201.00,46882843
7102,2019-08-23,medium,GOOGL,8,2019,23,1185.17,1153.58,1195.67,1150.00,1813141
7103,2019-08-23,medium,AMZN,8,2019,23,1793.03,1749.62,1804.90,1745.23,5277898
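To make the effect of `drop` concrete, here is a minimal sketch on a tiny hand-built frame. The `toy` dataframe and its values are illustrative assumptions, not the course dataset. It compares discarding a level with moving it back into the columns.

```python
import pandas as pd

# Hypothetical two-level frame standing in for the course dataset
toy = pd.DataFrame(
    {"close": [79.02, 54.71, 77.28]},
    index=pd.MultiIndex.from_tuples(
        [("2014-01-02", "AAPL"), ("2014-01-02", "FB"), ("2014-01-03", "AAPL")],
        names=["date", "name"],
    ),
)

# drop=True discards the level; droplevel() produces the same index
dropped = toy.reset_index("name", drop=True)
same_as_droplevel = dropped.index.equals(toy.droplevel("name").index)

# drop=False (the default) moves the level into the columns instead
kept = toy.reset_index("name")

# With no arguments, every level becomes a column and a RangeIndex takes over
flat = toy.reset_index()

print(list(dropped.columns))
print(list(kept.columns))
print(type(flat.index).__name__)
```

So `droplevel()` and `reset_index(..., drop=True)` are two routes to the same result, while the default `drop=False` keeps the information around as ordinary columns.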


## Sorting MultiIndex DataFrames with `sort_index()`

Consider this slice of our dataframe, where we select the opening, closing, high, and low prices for Apple from Jan 2, 2014 through April 2, 2014 for all volume types.

In [87]:
tech.loc[(slice('2014-01-02','2014-04-02'), slice(None), 'AAPL'), 'open':'low']

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,open,close,high,low
date,volume_type,name,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2014-01-02,low,AAPL,79.38,79.02,79.58,78.86
2014-01-03,low,AAPL,79.00,77.28,79.10,77.20
2014-01-06,low,AAPL,76.78,77.70,78.11,76.23
2014-01-07,low,AAPL,77.76,77.15,77.99,76.85
2014-01-08,low,AAPL,76.97,77.64,77.94,76.96
...,...,...,...,...,...,...
2014-03-27,low,AAPL,77.11,76.78,77.36,76.45
2014-03-28,low,AAPL,76.82,76.69,76.99,76.32
2014-03-31,low,AAPL,77.03,76.68,77.26,76.56
2014-04-01,low,AAPL,76.84,77.38,77.41,76.68


What if we wanted to select not just Apple, but a slice going from Apple to Facebook? How would we do that?

Let's first try using a `slice()` object to slice from *AAPL* to *FB*. This does not work - Pandas returns an unsorted index error.

In [88]:
## This generates an UnsortedIndexError
# tech.loc[(slice('2014-01-02','2014-04-02'), slice(None), slice('AAPL','FB')), 'open':'low']

The reason for this is that it's not very meaningful to talk about slicing along a sequence of values that is not sorted or does not have an intrinsic order. Pandas wants us to sort our index before slicing it.
* Note that the highest level index (the date) is already sorted, so we can easily slice on that

So how do we go about sorting our indexes? We can do this using the `sort_index()` method. If we run this with no arguments, we see that every level becomes nice and sorted. By default, the sort is in ascending order.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html

In [89]:
tech.sort_index()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,month,year,day,open,close,high,low,volume
date,volume_type,name,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2014-01-02,low,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851
2014-01-02,medium,AMZN,1,2014,2,398.80,397.97,399.36,394.02,2140246
2014-01-02,medium,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622
2014-01-02,medium,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719
2014-01-02,medium,MSFT,1,2014,2,37.35,37.16,37.40,37.10,30643745
...,...,...,...,...,...,...,...,...,...,...
2019-08-23,medium,AAPL,8,2019,23,209.43,202.64,212.05,201.00,46882843
2019-08-23,medium,AMZN,8,2019,23,1793.03,1749.62,1804.90,1745.23,5277898
2019-08-23,medium,FB,8,2019,23,180.84,177.75,183.13,176.66,17331221
2019-08-23,medium,GOOGL,8,2019,23,1185.17,1153.58,1195.67,1150.00,1813141


Let's make this permanent by setting the `inplace` parameter to `True`.

In [90]:
tech.sort_index(inplace = True)

Now that our index is sorted, we can proceed to slice on the "name" level of the index.

In [91]:
tech.loc[(slice('2014-01-02','2014-04-02'), slice(None), slice('AAPL','FB')), 'open':'low']

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,open,close,high,low
date,volume_type,name,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2014-01-02,low,AAPL,79.38,79.02,79.58,78.86
2014-01-02,medium,AMZN,398.80,397.97,399.36,394.02
2014-01-02,medium,FB,54.86,54.71,55.22,54.19
2014-01-03,low,AAPL,79.00,77.28,79.10,77.20
2014-01-03,medium,AMZN,398.29,396.44,402.71,396.22
...,...,...,...,...,...,...
2014-04-01,low,AAPL,76.84,77.38,77.41,76.68
2014-04-01,medium,AMZN,338.35,342.99,344.43,338.00
2014-04-02,high,FB,63.24,62.72,63.91,62.21
2014-04-02,low,AAPL,77.48,77.51,77.64,77.18


In this lecture we sorted the index to address an error. In general, however, the instructor recommends always sorting the index, especially when working with multiindex dataframes. Why?
* Sorting indexes improves retrieval performance, which becomes significant for large dataframes or for frequent lookups
* It enables slicing syntax. You can only slice on sorted indexes
* Overall, it is a good practice when working with tabular data representations, including Pandas, Excel, SQL, etc.
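The slicing point can be reproduced on a small scale. The sketch below uses a hypothetical `toy` frame (made-up for illustration) whose "name" level is deliberately out of order, catches the resulting `UnsortedIndexError`, and then repeats the same slice after sorting.

```python
import pandas as pd
from pandas.errors import UnsortedIndexError

# Hypothetical frame whose second level ("name") is deliberately unsorted
toy = pd.DataFrame(
    {"close": [37.16, 79.02, 54.71, 54.56, 77.28]},
    index=pd.MultiIndex.from_tuples(
        [
            ("2014-01-02", "MSFT"),
            ("2014-01-02", "AAPL"),
            ("2014-01-02", "FB"),
            ("2014-01-03", "FB"),
            ("2014-01-03", "AAPL"),
        ],
        names=["date", "name"],
    ),
)

# All dates, tickers from AAPL through FB
ticker_slice = (slice(None), slice("AAPL", "FB"))

try:
    toy.loc[ticker_slice, :]
    raised = False
except UnsortedIndexError:
    raised = True

# After sorting, the same label slice is well defined
result = toy.sort_index().loc[ticker_slice, :]
print(raised, len(result))
```

Only the call on the unsorted frame fails; the sorted copy returns the AAPL and FB rows for both dates.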

For multiindex dataframes, we can also fine-tune the sort using the `level` parameter. For instance, if we want to sort one level in descending order and another in ascending order, we can specify those levels in the `sort_index()` call. We must also pass a list of booleans to the `ascending` parameter, one per level, so that Pandas knows which level to sort in ascending order and which in descending order.
* Remember that if we don't customize the sort, then each level of the multiindex is sorted in ascending order by default

In [92]:
tech.sort_index(level=(0, 2), ascending=[False, True])

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,month,year,day,open,close,high,low,volume
date,volume_type,name,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2019-08-23,medium,AAPL,8,2019,23,209.43,202.64,212.05,201.00,46882843
2019-08-23,medium,AMZN,8,2019,23,1793.03,1749.62,1804.90,1745.23,5277898
2019-08-23,medium,FB,8,2019,23,180.84,177.75,183.13,176.66,17331221
2019-08-23,medium,GOOGL,8,2019,23,1185.17,1153.58,1195.67,1150.00,1813141
2019-08-23,medium,MSFT,8,2019,23,137.19,133.39,138.35,132.80,38515386
...,...,...,...,...,...,...,...,...,...,...
2014-01-02,low,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851
2014-01-02,medium,AMZN,1,2014,2,398.80,397.97,399.36,394.02,2140246
2014-01-02,medium,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622
2014-01-02,medium,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719
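The same pattern is easier to verify by eye on a handful of rows. This is a sketch on a hypothetical two-level `toy` frame (not the course data), sorting dates newest-first while keeping tickers alphabetical within each date.

```python
import pandas as pd

# Hypothetical frame with its rows in no particular order
toy = pd.DataFrame(
    {"close": [37.16, 79.02, 54.56, 77.28]},
    index=pd.MultiIndex.from_tuples(
        [
            ("2014-01-02", "MSFT"),
            ("2014-01-02", "AAPL"),
            ("2014-01-03", "FB"),
            ("2014-01-03", "AAPL"),
        ],
        names=["date", "name"],
    ),
)

# One boolean per listed level: dates descending, tickers ascending
mixed = toy.sort_index(level=["date", "name"], ascending=[False, True])
print(mixed.index.tolist())
```

Levels can be given by name, as here, or by position, as in the cell above.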


## More MultiIndex Methods: `idx.is_lexsorted(), idx.sortlevel(), idx.set_names(), idx.to_flat_index()`

Up until this point, we've modified the multiindex primarily by using methods that act on the dataframe itself. But the multiindex object can also be manipulated as a standalone data structure, and can then be passed along to or become associated with other data frames.

In this lecture we'll explore some of these methods.
* https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html

Let's begin by creating a new variable for the multiindex. This variable will be directly linked to the underlying dataframe, such that if we change the multiindex object, the changes will carry over to the dataframe.

In [93]:
tidx = tech.index

In [94]:
tidx

MultiIndex([('2014-01-02',    'low',  'AAPL'),
            ('2014-01-02', 'medium',  'AMZN'),
            ('2014-01-02', 'medium',    'FB'),
            ('2014-01-02', 'medium', 'GOOGL'),
            ('2014-01-02', 'medium',  'MSFT'),
            ('2014-01-03',    'low',  'AAPL'),
            ('2014-01-03', 'medium',  'AMZN'),
            ('2014-01-03', 'medium',    'FB'),
            ('2014-01-03', 'medium', 'GOOGL'),
            ('2014-01-03', 'medium',  'MSFT'),
            ...
            ('2019-08-22',    'low',    'FB'),
            ('2019-08-22',    'low', 'GOOGL'),
            ('2019-08-22', 'medium',  'AAPL'),
            ('2019-08-22', 'medium',  'AMZN'),
            ('2019-08-22', 'medium',  'MSFT'),
            ('2019-08-23', 'medium',  'AAPL'),
            ('2019-08-23', 'medium',  'AMZN'),
            ('2019-08-23', 'medium',    'FB'),
            ('2019-08-23', 'medium', 'GOOGL'),
            ('2019-08-23', 'medium',  'MSFT')],
           names=['date', 'volume_type', 'n

In [95]:
type(tidx)

pandas.core.indexes.multi.MultiIndex

Let's first check whether this index is sorted using `tidx.is_lexsorted()`. This method tests whether an index is lexicographically sorted. In other words, is it in alphabetical order?
* https://www.geeksforgeeks.org/python-pandas-multiindex-is_lexsorted/
* Keep in mind that under lexicographic sorting, numbers stored as strings are compared character by character, starting with their first digit. So "10" would come before "7".

In [96]:
tidx.is_lexsorted()

True

This returned `True`, meaning that the entire index is lexicographically sorted. This makes sense because we sorted the index in the previous lecture.
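Depending on the installed pandas version, `is_lexsorted()` may emit a deprecation warning or no longer exist (to our knowledge it was deprecated in the 1.x series in favor of `is_monotonic_increasing`). The sketch below uses that property as an assumed stand-in on two hypothetical indexes, and also shows the string-versus-number ordering mentioned above.

```python
import pandas as pd

# Two small hypothetical indexes: one sorted, one not
sorted_idx = pd.MultiIndex.from_tuples(
    [("2014-01-02", "AAPL"), ("2014-01-02", "FB"), ("2014-01-03", "AAPL")],
    names=["date", "name"],
)
unsorted_idx = pd.MultiIndex.from_tuples(
    [("2014-01-02", "FB"), ("2014-01-02", "AAPL"), ("2014-01-03", "AAPL")],
    names=["date", "name"],
)

# This is a property (no parentheses), used here in place of is_lexsorted()
print(sorted_idx.is_monotonic_increasing)
print(unsorted_idx.is_monotonic_increasing)

# Lexicographic order compares strings character by character
as_strings = sorted(["7", "10"])
as_numbers = sorted([7, 10])
print(as_strings, as_numbers)
```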

Next, let's modify the sort for a given level within the multiindex. We previously did this with the `sort_index()` method on the entire dataframe. But we can work directly on the multiindex object by using `sortlevel()`.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.sortlevel.html
* With this method, we specify the level as the first parameter, then we specify the order for that level, and finally we choose whether the remaining levels are sorted
* Key thing to remember: `sortlevel()` returns a new sorted index (together with an array of positions) rather than changing anything in place, and because we operated on the multiindex and not on the dataframe, none of the associated values in the dataframe get sorted.

In [97]:
tidx.sortlevel(0, ascending = False, sort_remaining = True)

(MultiIndex([('2019-08-23', 'medium',  'MSFT'),
             ('2019-08-23', 'medium', 'GOOGL'),
             ('2019-08-23', 'medium',    'FB'),
             ('2019-08-23', 'medium',  'AMZN'),
             ('2019-08-23', 'medium',  'AAPL'),
             ('2019-08-22', 'medium',  'MSFT'),
             ('2019-08-22', 'medium',  'AMZN'),
             ('2019-08-22', 'medium',  'AAPL'),
             ('2019-08-22',    'low', 'GOOGL'),
             ('2019-08-22',    'low',    'FB'),
             ...
             ('2014-01-03', 'medium',  'MSFT'),
             ('2014-01-03', 'medium', 'GOOGL'),
             ('2014-01-03', 'medium',    'FB'),
             ('2014-01-03', 'medium',  'AMZN'),
             ('2014-01-03',    'low',  'AAPL'),
             ('2014-01-02', 'medium',  'MSFT'),
             ('2014-01-02', 'medium', 'GOOGL'),
             ('2014-01-02', 'medium',    'FB'),
             ('2014-01-02', 'medium',  'AMZN'),
             ('2014-01-02',    'low',  'AAPL')],
            names=['da

We can also sort multiple levels at once.

In [98]:
tidx.sortlevel((0, 1, 2), ascending = [True, True, False])

(MultiIndex([('2014-01-02',    'low',  'AAPL'),
             ('2014-01-02', 'medium',  'MSFT'),
             ('2014-01-02', 'medium', 'GOOGL'),
             ('2014-01-02', 'medium',    'FB'),
             ('2014-01-02', 'medium',  'AMZN'),
             ('2014-01-03',    'low',  'AAPL'),
             ('2014-01-03', 'medium',  'MSFT'),
             ('2014-01-03', 'medium', 'GOOGL'),
             ('2014-01-03', 'medium',    'FB'),
             ('2014-01-03', 'medium',  'AMZN'),
             ...
             ('2019-08-22',    'low', 'GOOGL'),
             ('2019-08-22',    'low',    'FB'),
             ('2019-08-22', 'medium',  'MSFT'),
             ('2019-08-22', 'medium',  'AMZN'),
             ('2019-08-22', 'medium',  'AAPL'),
             ('2019-08-23', 'medium',  'MSFT'),
             ('2019-08-23', 'medium', 'GOOGL'),
             ('2019-08-23', 'medium',    'FB'),
             ('2019-08-23', 'medium',  'AMZN'),
             ('2019-08-23', 'medium',  'AAPL')],
            names=['da
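The outputs above begin with an opening parenthesis because `sortlevel()` returns a tuple. As a minimal sketch on a hypothetical `toy` frame, the second element of that tuple can be used to reorder the dataframe so that it matches the sorted index.

```python
import pandas as pd

# Hypothetical frame standing in for the course dataset
toy = pd.DataFrame(
    {"close": [79.02, 54.71, 77.28]},
    index=pd.MultiIndex.from_tuples(
        [("2014-01-02", "AAPL"), ("2014-01-02", "FB"), ("2014-01-03", "AAPL")],
        names=["date", "name"],
    ),
)

# sortlevel() hands back a new index plus the positions that produce it
sorted_idx, indexer = toy.index.sortlevel("date", ascending=False)

# The dataframe itself is untouched until we reorder it explicitly
untouched_first = toy.index[0]
reordered = toy.iloc[indexer]

print(untouched_first)
print(reordered.index.equals(sorted_idx))
```

In everyday use `sort_index()` on the dataframe is usually simpler; the indexer is mainly helpful when the same ordering has to be applied to several objects.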

Now let's talk about aesthetics, things like **name**, **label**, etc. Take a look at our index as it stands now.

In [99]:
tech.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,month,year,day,open,close,high,low,volume
date,volume_type,name,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2014-01-02,low,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851
2014-01-02,medium,AMZN,1,2014,2,398.8,397.97,399.36,394.02,2140246
2014-01-02,medium,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622
2014-01-02,medium,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719
2014-01-02,medium,MSFT,1,2014,2,37.35,37.16,37.4,37.1,30643745


Notice that the indexes that we promoted from the columns still have their old column names, like "volume_type" and "name". We can make this more presentable by using the `index.set_names()` method.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.set_names.html
* To use this method, we pass in a list of names that is equal in length to the number of levels in our index

In [100]:
tidx.set_names(['Trading Date','Volume Category','Ticker'], inplace = True)

In [101]:
tech.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,month,year,day,open,close,high,low,volume
Trading Date,Volume Category,Ticker,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2014-01-02,low,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851
2014-01-02,medium,AMZN,1,2014,2,398.8,397.97,399.36,394.02,2140246
2014-01-02,medium,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622
2014-01-02,medium,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719
2014-01-02,medium,MSFT,1,2014,2,37.35,37.16,37.4,37.1,30643745


We see that the index names have been renamed in the underlying dataframe!
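For comparison, here is a sketch of two variations on a hypothetical `toy` frame: calling `set_names()` without `inplace=True`, which returns a renamed copy, and renaming just one level through the `level` parameter.

```python
import pandas as pd

# Hypothetical frame standing in for the course dataset
toy = pd.DataFrame(
    {"close": [79.02, 54.71]},
    index=pd.MultiIndex.from_tuples(
        [("2014-01-02", "AAPL"), ("2014-01-02", "FB")],
        names=["date", "name"],
    ),
)

# Without inplace=True, set_names() returns a renamed copy of the index
renamed = toy.index.set_names(["Trading Date", "Ticker"])
before = list(toy.index.names)  # the frame's own index is unchanged
print(before, list(renamed.names))

# Rename a single level by pointing at it with `level`, then reattach it
toy.index = toy.index.set_names("Ticker", level="name")
after = list(toy.index.names)
print(after)
```

Assigning the result back to `toy.index` makes the change explicit, which some people find easier to follow than an in-place update.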

Finally, we can use the `index.to_flat_index()` method to convert the multiindex into a flat, non-hierarchical structure. The multiindex is converted into a single index of tuples containing the level values.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.to_flat_index.html

In [102]:
tidx.to_flat_index()

Index([    ('2014-01-02', 'low', 'AAPL'),  ('2014-01-02', 'medium', 'AMZN'),
          ('2014-01-02', 'medium', 'FB'), ('2014-01-02', 'medium', 'GOOGL'),
        ('2014-01-02', 'medium', 'MSFT'),     ('2014-01-03', 'low', 'AAPL'),
        ('2014-01-03', 'medium', 'AMZN'),    ('2014-01-03', 'medium', 'FB'),
       ('2014-01-03', 'medium', 'GOOGL'),  ('2014-01-03', 'medium', 'MSFT'),
       ...
             ('2019-08-22', 'low', 'FB'),    ('2019-08-22', 'low', 'GOOGL'),
        ('2019-08-22', 'medium', 'AAPL'),  ('2019-08-22', 'medium', 'AMZN'),
        ('2019-08-22', 'medium', 'MSFT'),  ('2019-08-23', 'medium', 'AAPL'),
        ('2019-08-23', 'medium', 'AMZN'),    ('2019-08-23', 'medium', 'FB'),
       ('2019-08-23', 'medium', 'GOOGL'),  ('2019-08-23', 'medium', 'MSFT')],
      dtype='object', length=7105)

This output is useful in visualizing how Pandas "thinks" about our multiindexes. The levels we specify are combined together into tuples, each of which serves as a single label.
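A small sketch of the same idea, using a hypothetical two-row index: the flat version has a single level whose labels are plain tuples, and the hierarchy can be rebuilt from those tuples if the level names are supplied again.

```python
import pandas as pd

# Hypothetical three-level index
idx = pd.MultiIndex.from_tuples(
    [("2014-01-02", "low", "AAPL"), ("2014-01-02", "medium", "FB")],
    names=["date", "volume_type", "name"],
)

flat = idx.to_flat_index()
print(flat.nlevels, flat[0])

# Rebuild the hierarchy from the tuples; level names must be re-supplied
rebuilt = pd.MultiIndex.from_tuples(flat.tolist(), names=idx.names)
print(rebuilt.equals(idx))
```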

To illustrate, consider what we see when we extract a single column as a series from a dataframe. It looks kind of weird, with multiple levels and only a single column of values.

In [103]:
tech.close

Trading Date  Volume Category  Ticker
2014-01-02    low              AAPL        79.02
              medium           AMZN       397.97
                               FB          54.71
                               GOOGL      556.56
                               MSFT        37.16
                                          ...   
2019-08-23    medium           AAPL       202.64
                               AMZN      1749.62
                               FB         177.75
                               GOOGL     1153.58
                               MSFT       133.39
Name: close, Length: 7105, dtype: float64

By using `to_flat_index()`, we solidify the relationship between the index labels and the underlying data. And this relationship carries over to subsets and slices of data that we select.

## Reshaping with `stack()`

The `stack()` method is used alongside pivot tables to essentially take the column axis and rotate it into the innermost level of the index.

In [104]:
tech.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,month,year,day,open,close,high,low,volume
Trading Date,Volume Category,Ticker,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2014-01-02,low,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851
2014-01-02,medium,AMZN,1,2014,2,398.8,397.97,399.36,394.02,2140246
2014-01-02,medium,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622
2014-01-02,medium,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719
2014-01-02,medium,MSFT,1,2014,2,37.35,37.16,37.4,37.1,30643745


In [105]:
tech.stack()

Trading Date  Volume Category  Ticker        
2014-01-02    low              AAPL    month            1.00
                                       year          2014.00
                                       day              2.00
                                       open            79.38
                                       close           79.02
                                                    ...     
2019-08-23    medium           MSFT    open           137.19
                                       close          133.39
                                       high           138.35
                                       low            132.80
                                       volume    38515386.00
Length: 56840, dtype: float64

When the `stack()` method is called on a dataframe, the columns are pivoted into the inner-most levels of the index. In our case, since the columns were single level, the output is a Series.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html
* The column labels are now a new level within the multiindex

Let's break this down more by creating a variable for this new stacked series.

In [106]:
stacked = tech.stack()

In [107]:
stacked.head(10)

Trading Date  Volume Category  Ticker        
2014-01-02    low              AAPL    month           1.00
                                       year         2014.00
                                       day             2.00
                                       open           79.38
                                       close          79.02
                                       high           79.58
                                       low            78.86
                                       volume    8398851.00
              medium           AMZN    month           1.00
                                       year         2014.00
dtype: float64

It may feel like we lost a dimension since we've gone from a dataframe to a series, but in reality *all we did was absorb the columns into a wider index*. We went from a 3-level indexed dataframe with a 1-level column axis to a 4-level multiindexed series with a single sequence of values. So nothing was lost here.

In [108]:
type(stacked)

pandas.core.series.Series

In [109]:
stacked.index.nlevels

4
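The bookkeeping is easy to check on a tiny example. The sketch below stacks a hypothetical 2-row, 2-column `toy` frame (illustrative values only) and confirms that every cell survives as one entry in the resulting series.

```python
import pandas as pd

# Hypothetical frame with one integer column and one float column
toy = pd.DataFrame(
    {"day": [2, 2], "close": [79.02, 54.71]},
    index=pd.MultiIndex.from_tuples(
        [("2014-01-02", "AAPL"), ("2014-01-02", "FB")],
        names=["date", "name"],
    ),
)

stacked_toy = toy.stack()

# 2 rows x 2 columns become 4 values under a 3-level index
print(type(stacked_toy).__name__, stacked_toy.index.nlevels, len(stacked_toy))

# A single Series holds one dtype, so the integer column is upcast to float
print(stacked_toy.dtype)
```

The upcast is why values such as the month and year show up as `1.00` and `2014.00` in the stacked output above.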

Note that our new level does not yet have a name, and it looks kind of weird.

In [110]:
stacked.index.names

FrozenList(['Trading Date', 'Volume Category', 'Ticker', None])

See how the fourth index name is `None`. We cannot change *names* because it's immutable. Instead, we create a new instance of the index. Here's how we'll do it:
1. Assign the names of the stacked index to a new variable *names*
2. Use the `set_names()` method on the index of the stacked series to change the names, passing a new list built from every entry of *names* except the last one (`None`), plus a new name for the fourth level. Note that because we're placing the contents of `names[:-1]` inside a new list, we need the `*` operator to unpack them. Otherwise, the whole slice would be nested inside the new list as a single element.
  * https://stackoverflow.com/questions/11315010/what-do-and-before-a-variable-name-mean-in-a-function-signature

In [111]:
names = stacked.index.names

We'll be passing this slice of "names" into the `set_names()` function call.

In [112]:
names[:-1]

FrozenList(['Trading Date', 'Volume Category', 'Ticker'])

In [113]:
stacked.index.set_names([*names[:-1], 'Previously a Column Axis'], inplace=True)

In [114]:
stacked.head()

Trading Date  Volume Category  Ticker  Previously a Column Axis
2014-01-02    low              AAPL    month                          1.00
                                       year                        2014.00
                                       day                            2.00
                                       open                          79.38
                                       close                         79.02
dtype: float64

A worked example of unpacking:

In [115]:
def unpacker(*args):
    # *args collects every positional argument into a tuple
    for arg in args:
        print(arg)

In [116]:
unpacker(*names)

Trading Date
Volume Category
Ticker
None
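The `unpacker` function shows `*` in a function call; in the `set_names()` cell the same operator is used inside a list literal. The sketch below contrasts the two list forms with a plain Python list, then shows an alternative that renames only the last level and avoids building a list at all. The small series `s` is a hypothetical stand-in for `stacked`.

```python
import pandas as pd

names = ["Trading Date", "Volume Category", "Ticker", None]

unpacked = [*names[:-1], "Previously a Column Axis"]  # four flat entries
nested = [names[:-1], "Previously a Column Axis"]     # a list plus a string

print(len(unpacked), len(nested))

# Alternative sketch: rename only the last level via the `level` parameter
s = pd.Series(
    [1.0, 79.02],
    index=pd.MultiIndex.from_tuples(
        [("2014-01-02", "AAPL", "month"), ("2014-01-02", "AAPL", "close")],
        names=["Trading Date", "Ticker", None],
    ),
)
s.index = s.index.set_names("Previously a Column Axis", level=s.index.nlevels - 1)
print(list(s.index.names))
```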


## The Flip Side: Un-Stacking with `unstack()`

In this lecture, we'll talk about the `unstack()` method, which takes the inner-most level of a multiindex (right-most index) and kicks it into the column axis. It does the exact opposite of `stack()`.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.unstack.html

Consider our *stacked* dataframe (actually a series) from the previous lecture that has a 4-level multiindex. We will use the `unstack()` method to take the "Previously a Column Axis" index back to the column axis.

In [117]:
stacked

Trading Date  Volume Category  Ticker  Previously a Column Axis
2014-01-02    low              AAPL    month                              1.00
                                       year                            2014.00
                                       day                                2.00
                                       open                              79.38
                                       close                             79.02
                                                                      ...     
2019-08-23    medium           MSFT    open                             137.19
                                       close                            133.39
                                       high                             138.35
                                       low                              132.80
                                       volume                      38515386.00
Length: 56840, dtype: float64

In [118]:
stacked.unstack()

Unnamed: 0_level_0,Unnamed: 1_level_0,Previously a Column Axis,month,year,day,open,close,high,low,volume
Trading Date,Volume Category,Ticker,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2014-01-02,low,AAPL,1.0,2014.0,2.0,79.38,79.02,79.58,78.86,8398851.0
2014-01-02,medium,AMZN,1.0,2014.0,2.0,398.80,397.97,399.36,394.02,2140246.0
2014-01-02,medium,FB,1.0,2014.0,2.0,54.86,54.71,55.22,54.19,43257622.0
2014-01-02,medium,GOOGL,1.0,2014.0,2.0,557.73,556.56,558.88,554.13,1822719.0
2014-01-02,medium,MSFT,1.0,2014.0,2.0,37.35,37.16,37.40,37.10,30643745.0
...,...,...,...,...,...,...,...,...,...,...
2019-08-23,medium,AAPL,8.0,2019.0,23.0,209.43,202.64,212.05,201.00,46882843.0
2019-08-23,medium,AMZN,8.0,2019.0,23.0,1793.03,1749.62,1804.90,1745.23,5277898.0
2019-08-23,medium,FB,8.0,2019.0,23.0,180.84,177.75,183.13,176.66,17331221.0
2019-08-23,medium,GOOGL,8.0,2019.0,23.0,1185.17,1153.58,1195.67,1150.00,1813141.0


We see that the "Previously a Column Axis" index has now returned to the column axis, with the components of that index (month, year, day, open, close, high, low, volume) now serving as the columns themselves.
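To confirm that the two methods undo each other, here is a minimal round-trip sketch on a hypothetical `toy` frame with float-only columns (illustrative values, not the full dataset).

```python
import pandas as pd

# Hypothetical frame with a sorted two-level index
toy = pd.DataFrame(
    {"close": [79.02, 54.71], "open": [79.38, 54.86]},
    index=pd.MultiIndex.from_tuples(
        [("2014-01-02", "AAPL"), ("2014-01-02", "FB")],
        names=["date", "name"],
    ),
)

# stack() pushes the columns into the index; unstack() pulls them back out
round_trip = toy.stack().unstack()

print(list(round_trip.columns))
print(round_trip.index.equals(toy.index))
```

With mixed integer and float columns the values would still match, but the integers would come back as floats, as in the output above.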

We can chain on multiple unstacks. This creates a **multiindex column axis**, where we have two levels within our column axis. 

In [119]:
stacked.unstack().unstack()

Unnamed: 0_level_0,Previously a Column Axis,month,month,month,month,month,year,year,year,year,year,day,day,day,day,day,open,open,open,open,open,close,close,close,close,close,high,high,high,high,high,low,low,low,low,low,volume,volume,volume,volume,volume
Unnamed: 0_level_1,Ticker,AAPL,AMZN,FB,GOOGL,MSFT,AAPL,AMZN,FB,GOOGL,MSFT,AAPL,AMZN,FB,GOOGL,MSFT,AAPL,AMZN,FB,GOOGL,MSFT,AAPL,AMZN,FB,GOOGL,MSFT,AAPL,AMZN,FB,GOOGL,MSFT,AAPL,AMZN,FB,GOOGL,MSFT,AAPL,AMZN,FB,GOOGL,MSFT
Trading Date,Volume Category,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2,Unnamed: 25_level_2,Unnamed: 26_level_2,Unnamed: 27_level_2,Unnamed: 28_level_2,Unnamed: 29_level_2,Unnamed: 30_level_2,Unnamed: 31_level_2,Unnamed: 32_level_2,Unnamed: 33_level_2,Unnamed: 34_level_2,Unnamed: 35_level_2,Unnamed: 36_level_2,Unnamed: 37_level_2,Unnamed: 38_level_2,Unnamed: 39_level_2,Unnamed: 40_level_2,Unnamed: 41_level_2
2014-01-02,low,1.0,,,,,2014.0,,,,,2.0,,,,,79.38,,,,,79.02,,,,,79.58,,,,,78.86,,,,,8398851.0,,,,
2014-01-02,medium,,1.0,1.0,1.0,1.0,,2014.0,2014.0,2014.0,2014.0,,2.0,2.0,2.0,2.0,,398.80,54.86,557.73,37.35,,397.97,54.71,556.56,37.16,,399.36,55.22,558.88,37.40,,394.02,54.19,554.13,37.10,,2140246.0,43257622.0,1822719.0,30643745.0
2014-01-03,low,1.0,,,,,2014.0,,,,,3.0,,,,,79.00,,,,,77.28,,,,,79.10,,,,,77.20,,,,,14043410.0,,,,
2014-01-03,medium,,1.0,1.0,1.0,1.0,,2014.0,2014.0,2014.0,2014.0,,3.0,3.0,3.0,3.0,,398.29,55.00,557.50,37.20,,396.44,54.56,552.50,36.91,,402.71,55.65,558.47,37.22,,396.22,54.53,552.47,36.60,,2213512.0,38287706.0,1669229.0,31134795.0
2014-01-06,high,,,1.0,,,,,2014.0,,,,,6.0,,,,,54.39,,,,,57.20,,,,,57.26,,,,,54.05,,,,,68974359.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2019-08-21,low,,8.0,8.0,8.0,8.0,,2019.0,2019.0,2019.0,2019.0,,21.0,21.0,21.0,21.0,,1819.39,185.00,1195.82,138.55,,1823.54,183.55,1191.58,138.79,,1829.58,185.90,1200.56,139.49,,1815.00,183.14,1187.92,138.00,,2039231.0,8409548.0,708272.0,14982314.0
2019-08-21,medium,8.0,,,,,2019.0,,,,,21.0,,,,,212.99,,,,,212.64,,,,,213.65,,,,,211.60,,,,,21564747.0,,,,
2019-08-22,low,,,8.0,8.0,,,,2019.0,2019.0,,,,22.0,22.0,,,,183.43,1193.80,,,,182.04,1191.52,,,,184.11,1198.78,,,,179.91,1178.91,,,,10829509.0,867915.0,
2019-08-22,medium,8.0,8.0,,,8.0,2019.0,2019.0,,,2019.0,22.0,22.0,,,22.0,213.19,1828.00,,,138.66,212.46,1805.60,,,137.78,214.44,1829.41,,,139.20,210.75,1800.10,,,136.29,22267819.0,2658388.0,,,18559088.0


In chaining the unstacking, we've created numerous gaps in our data as indicated by the `NaNs`. This is because we don't have observations for each and every 4-level combination - we saw this in the 4-level multiindex as well. But because of the way we've pivoted, these now need to be filled in as `NaN`.
* For example, we have no data for Amazon on 2014-01-02 in the Low volume category, the reason being on that day Amazon had a Medium volume trading day.

There is a parameter called `fill_value` that allows us to replace `NaN` with something else. For instance, we can replace them with a hyphen to make the gaps easier to see.


In [120]:
stacked.unstack().unstack(fill_value = '-')

Unnamed: 0_level_0,Previously a Column Axis,month,month,month,month,month,year,year,year,year,year,day,day,day,day,day,open,open,open,open,open,close,close,close,close,close,high,high,high,high,high,low,low,low,low,low,volume,volume,volume,volume,volume
Unnamed: 0_level_1,Ticker,AAPL,AMZN,FB,GOOGL,MSFT,AAPL,AMZN,FB,GOOGL,MSFT,AAPL,AMZN,FB,GOOGL,MSFT,AAPL,AMZN,FB,GOOGL,MSFT,AAPL,AMZN,FB,GOOGL,MSFT,AAPL,AMZN,FB,GOOGL,MSFT,AAPL,AMZN,FB,GOOGL,MSFT,AAPL,AMZN,FB,GOOGL,MSFT
Trading Date,Volume Category,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2,Unnamed: 25_level_2,Unnamed: 26_level_2,Unnamed: 27_level_2,Unnamed: 28_level_2,Unnamed: 29_level_2,Unnamed: 30_level_2,Unnamed: 31_level_2,Unnamed: 32_level_2,Unnamed: 33_level_2,Unnamed: 34_level_2,Unnamed: 35_level_2,Unnamed: 36_level_2,Unnamed: 37_level_2,Unnamed: 38_level_2,Unnamed: 39_level_2,Unnamed: 40_level_2,Unnamed: 41_level_2
2014-01-02,low,1,-,-,-,-,2014,-,-,-,-,2,-,-,-,-,79.38,-,-,-,-,79.02,-,-,-,-,79.58,-,-,-,-,78.86,-,-,-,-,8.39885e+06,-,-,-,-
2014-01-02,medium,-,1,1,1,1,-,2014,2014,2014,2014,-,2,2,2,2,-,398.8,54.86,557.73,37.35,-,397.97,54.71,556.56,37.16,-,399.36,55.22,558.88,37.4,-,394.02,54.19,554.13,37.1,-,2.14025e+06,4.32576e+07,1.82272e+06,3.06437e+07
2014-01-03,low,1,-,-,-,-,2014,-,-,-,-,3,-,-,-,-,79,-,-,-,-,77.28,-,-,-,-,79.1,-,-,-,-,77.2,-,-,-,-,1.40434e+07,-,-,-,-
2014-01-03,medium,-,1,1,1,1,-,2014,2014,2014,2014,-,3,3,3,3,-,398.29,55,557.5,37.2,-,396.44,54.56,552.5,36.91,-,402.71,55.65,558.47,37.22,-,396.22,54.53,552.47,36.6,-,2.21351e+06,3.82877e+07,1.66923e+06,3.11348e+07
2014-01-06,high,-,-,1,-,-,-,-,2014,-,-,-,-,6,-,-,-,-,54.39,-,-,-,-,57.2,-,-,-,-,57.26,-,-,-,-,54.05,-,-,-,-,6.89744e+07,-,-
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2019-08-21,low,-,8,8,8,8,-,2019,2019,2019,2019,-,21,21,21,21,-,1819.39,185,1195.82,138.55,-,1823.54,183.55,1191.58,138.79,-,1829.58,185.9,1200.56,139.49,-,1815,183.14,1187.92,138,-,2.03923e+06,8.40955e+06,708272,1.49823e+07
2019-08-21,medium,8,-,-,-,-,2019,-,-,-,-,21,-,-,-,-,212.99,-,-,-,-,212.64,-,-,-,-,213.65,-,-,-,-,211.6,-,-,-,-,2.15647e+07,-,-,-,-
2019-08-22,low,-,-,8,8,-,-,-,2019,2019,-,-,-,22,22,-,-,-,183.43,1193.8,-,-,-,182.04,1191.52,-,-,-,184.11,1198.78,-,-,-,179.91,1178.91,-,-,-,1.08295e+07,867915,-
2019-08-22,medium,8,8,-,-,8,2019,2019,-,-,2019,22,22,-,-,22,213.19,1828,-,-,138.66,212.46,1805.6,-,-,137.78,214.44,1829.41,-,-,139.2,210.75,1800.1,-,-,136.29,2.22678e+07,2.65839e+06,-,-,1.85591e+07
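The source of the gaps is easier to see with only three observations. In this sketch the series `s` is a hypothetical example in which each ticker appears under just one volume category, so unstacking the tickers leaves empty cells. It then fills them with a number; a numeric fill keeps the columns numeric, whereas a string such as a hyphen generally turns them into object columns.

```python
import pandas as pd

# Hypothetical closing prices; each ticker has one volume category that day
s = pd.Series(
    [79.02, 54.71, 37.16],
    index=pd.MultiIndex.from_tuples(
        [
            ("2014-01-02", "low", "AAPL"),
            ("2014-01-02", "medium", "FB"),
            ("2014-01-02", "medium", "MSFT"),
        ],
        names=["date", "volume_type", "name"],
    ),
)

# 2 rows x 3 ticker columns, but only 3 observations to place
wide = s.unstack(level="name")
print(wide.shape, int(wide.isna().sum().sum()))

# Fill the empty cells with a number instead of NaN
filled = s.unstack(level="name", fill_value=0.0)
print(int(filled.isna().sum().sum()))
```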


We can also specify the level that we want to unstack, instead of having to do the default of the innermost level.

In [121]:
# Default behavior
stacked.unstack()

Unnamed: 0_level_0,Unnamed: 1_level_0,Previously a Column Axis,month,year,day,open,close,high,low,volume
Trading Date,Volume Category,Ticker,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2014-01-02,low,AAPL,1.0,2014.0,2.0,79.38,79.02,79.58,78.86,8398851.0
2014-01-02,medium,AMZN,1.0,2014.0,2.0,398.80,397.97,399.36,394.02,2140246.0
2014-01-02,medium,FB,1.0,2014.0,2.0,54.86,54.71,55.22,54.19,43257622.0
2014-01-02,medium,GOOGL,1.0,2014.0,2.0,557.73,556.56,558.88,554.13,1822719.0
2014-01-02,medium,MSFT,1.0,2014.0,2.0,37.35,37.16,37.40,37.10,30643745.0
...,...,...,...,...,...,...,...,...,...,...
2019-08-23,medium,AAPL,8.0,2019.0,23.0,209.43,202.64,212.05,201.00,46882843.0
2019-08-23,medium,AMZN,8.0,2019.0,23.0,1793.03,1749.62,1804.90,1745.23,5277898.0
2019-08-23,medium,FB,8.0,2019.0,23.0,180.84,177.75,183.13,176.66,17331221.0
2019-08-23,medium,GOOGL,8.0,2019.0,23.0,1185.17,1153.58,1195.67,1150.00,1813141.0


What if we wanted "Volume Category" to supply the column labels? We can do that by specifying the level position:

In [122]:
stacked.unstack(level = 1)

Unnamed: 0_level_0,Unnamed: 1_level_0,Volume Category,high,low,medium
Trading Date,Ticker,Previously a Column Axis,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2014-01-02,AAPL,month,,1.00,
2014-01-02,AAPL,year,,2014.00,
2014-01-02,AAPL,day,,2.00,
2014-01-02,AAPL,open,,79.38,
2014-01-02,AAPL,close,,79.02,
...,...,...,...,...,...
2019-08-23,MSFT,open,,,137.19
2019-08-23,MSFT,close,,,133.39
2019-08-23,MSFT,high,,,138.35
2019-08-23,MSFT,low,,,132.80


This also works by *level name* instead of level position.

In [123]:
stacked.unstack(level = 'Ticker')

Unnamed: 0_level_0,Unnamed: 1_level_0,Ticker,AAPL,AMZN,FB,GOOGL,MSFT
Trading Date,Volume Category,Previously a Column Axis,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2014-01-02,low,month,1.00,,,,
2014-01-02,low,year,2014.00,,,,
2014-01-02,low,day,2.00,,,,
2014-01-02,low,open,79.38,,,,
2014-01-02,low,close,79.02,,,,
...,...,...,...,...,...,...,...
2019-08-23,medium,open,209.43,1793.03,180.84,1185.17,137.19
2019-08-23,medium,close,202.64,1749.62,177.75,1153.58,133.39
2019-08-23,medium,high,212.05,1804.90,183.13,1195.67,138.35
2019-08-23,medium,low,201.00,1745.23,176.66,1150.00,132.80


So when should we use this? Well, if you notice a particular index level whose values would serve well as column labels, you can try performing an `unstack()` to see how it works out. In the example immediately above, the ticker symbols (the "Ticker" level) have become column labels, and the rest of the data is identified by the remaining multiindex levels. Depending on what you're doing, this can be a very convenient way to look at your data.
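
To see the level argument in isolation, here is a minimal sketch on a small made-up series. The `toy` series and its "Field" level name are assumptions for illustration only, not part of the *tech* dataset.

```python
import pandas as pd

# Hypothetical toy data: two dates, two tickers, two fields.
idx = pd.MultiIndex.from_product(
    [["2014-01-02", "2014-01-03"], ["AAPL", "MSFT"], ["open", "close"]],
    names=["Trading Date", "Ticker", "Field"],
)
toy = pd.Series(
    [79.38, 79.02, 37.35, 37.16, 79.00, 77.28, 37.20, 36.91], index=idx
)

# Default: the innermost level ("Field") moves to the columns.
by_default = toy.unstack()

# The same level can be addressed by position or by name.
by_position = toy.unstack(level=1)
by_name = toy.unstack(level="Ticker")

print(by_default.columns.name)      # Field
print(by_position.equals(by_name))  # True
```

Referring to levels by name tends to be more robust, since it keeps working if the order of the index levels changes.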

## Bonus Lecture: Creating MultiLevel Columns Manually

In the previous lecture, we created multilevel columns by chaining together multiple `unstack()` calls. 

In [124]:
stacked.head()

Trading Date  Volume Category  Ticker  Previously a Column Axis
2014-01-02    low              AAPL    month                          1.00
                                       year                        2014.00
                                       day                            2.00
                                       open                          79.38
                                       close                         79.02
dtype: float64

In [125]:
stacked.unstack().unstack()

Unnamed: 0_level_0,Previously a Column Axis,month,month,month,month,month,year,year,year,year,year,day,day,day,day,day,open,open,open,open,open,close,close,close,close,close,high,high,high,high,high,low,low,low,low,low,volume,volume,volume,volume,volume
Unnamed: 0_level_1,Ticker,AAPL,AMZN,FB,GOOGL,MSFT,AAPL,AMZN,FB,GOOGL,MSFT,AAPL,AMZN,FB,GOOGL,MSFT,AAPL,AMZN,FB,GOOGL,MSFT,AAPL,AMZN,FB,GOOGL,MSFT,AAPL,AMZN,FB,GOOGL,MSFT,AAPL,AMZN,FB,GOOGL,MSFT,AAPL,AMZN,FB,GOOGL,MSFT
Trading Date,Volume Category,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2,Unnamed: 25_level_2,Unnamed: 26_level_2,Unnamed: 27_level_2,Unnamed: 28_level_2,Unnamed: 29_level_2,Unnamed: 30_level_2,Unnamed: 31_level_2,Unnamed: 32_level_2,Unnamed: 33_level_2,Unnamed: 34_level_2,Unnamed: 35_level_2,Unnamed: 36_level_2,Unnamed: 37_level_2,Unnamed: 38_level_2,Unnamed: 39_level_2,Unnamed: 40_level_2,Unnamed: 41_level_2
2014-01-02,low,1.0,,,,,2014.0,,,,,2.0,,,,,79.38,,,,,79.02,,,,,79.58,,,,,78.86,,,,,8398851.0,,,,
2014-01-02,medium,,1.0,1.0,1.0,1.0,,2014.0,2014.0,2014.0,2014.0,,2.0,2.0,2.0,2.0,,398.80,54.86,557.73,37.35,,397.97,54.71,556.56,37.16,,399.36,55.22,558.88,37.40,,394.02,54.19,554.13,37.10,,2140246.0,43257622.0,1822719.0,30643745.0
2014-01-03,low,1.0,,,,,2014.0,,,,,3.0,,,,,79.00,,,,,77.28,,,,,79.10,,,,,77.20,,,,,14043410.0,,,,
2014-01-03,medium,,1.0,1.0,1.0,1.0,,2014.0,2014.0,2014.0,2014.0,,3.0,3.0,3.0,3.0,,398.29,55.00,557.50,37.20,,396.44,54.56,552.50,36.91,,402.71,55.65,558.47,37.22,,396.22,54.53,552.47,36.60,,2213512.0,38287706.0,1669229.0,31134795.0
2014-01-06,high,,,1.0,,,,,2014.0,,,,,6.0,,,,,54.39,,,,,57.20,,,,,57.26,,,,,54.05,,,,,68974359.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2019-08-21,low,,8.0,8.0,8.0,8.0,,2019.0,2019.0,2019.0,2019.0,,21.0,21.0,21.0,21.0,,1819.39,185.00,1195.82,138.55,,1823.54,183.55,1191.58,138.79,,1829.58,185.90,1200.56,139.49,,1815.00,183.14,1187.92,138.00,,2039231.0,8409548.0,708272.0,14982314.0
2019-08-21,medium,8.0,,,,,2019.0,,,,,21.0,,,,,212.99,,,,,212.64,,,,,213.65,,,,,211.60,,,,,21564747.0,,,,
2019-08-22,low,,,8.0,8.0,,,,2019.0,2019.0,,,,22.0,22.0,,,,183.43,1193.80,,,,182.04,1191.52,,,,184.11,1198.78,,,,179.91,1178.91,,,,10829509.0,867915.0,
2019-08-22,medium,8.0,8.0,,,8.0,2019.0,2019.0,,,2019.0,22.0,22.0,,,22.0,213.19,1828.00,,,138.66,212.46,1805.60,,,137.78,214.44,1829.41,,,139.20,210.75,1800.10,,,136.29,22267819.0,2658388.0,,,18559088.0
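
The wide output above is hard to read, so here is a minimal sketch of the same chained call on a tiny made-up series. The `small` series and the "Field" level name are hypothetical. Each `unstack()` moves the innermost remaining index level to the columns, so two calls leave a two-level column axis.

```python
import pandas as pd

# Hypothetical three-level series: 2 dates x 2 tickers x 2 fields.
idx = pd.MultiIndex.from_product(
    [["2014-01-02", "2014-01-03"], ["AAPL", "MSFT"], ["open", "close"]],
    names=["Trading Date", "Ticker", "Field"],
)
small = pd.Series(range(8), index=idx, dtype="float64")

# First call moves "Field" to the columns, second call moves "Ticker".
wide = small.unstack().unstack()

print(wide.columns.nlevels)      # 2
print(list(wide.columns.names))  # ['Field', 'Ticker']
print(wide.index.name)           # Trading Date
```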


We can also create multilevel columns manually. Specifically:
1. We will build a standalone multiindex object
2. We will prepare the sample dataset to go along with that object
3. We will build a new dataframe from those two components.

Let's start by resetting our index in place so we have a clean dataframe to work with and a basic range index.

In [127]:
tech.reset_index(inplace = True)

In [128]:
tech.head()

Unnamed: 0,Trading Date,Volume Category,Ticker,month,year,day,open,close,high,low,volume
0,2014-01-02,low,AAPL,1,2014,2,79.38,79.02,79.58,78.86,8398851
1,2014-01-02,medium,AMZN,1,2014,2,398.8,397.97,399.36,394.02,2140246
2,2014-01-02,medium,FB,1,2014,2,54.86,54.71,55.22,54.19,43257622
3,2014-01-02,medium,GOOGL,1,2014,2,557.73,556.56,558.88,554.13,1822719
4,2014-01-02,medium,MSFT,1,2014,2,37.35,37.16,37.4,37.1,30643745


In this exercise, we want to build a dataframe whose columns carry a two-level multiindex:
* the column labels combine the volume category (low and high) with the tickers of Microsoft and Amazon
* the values are 10 randomly sampled closing prices for each of those combinations

Let's begin by creating the multiindex. Remember that we want a multiindex object that has two levels: one for the "Volume Category" and another for the "Ticker". We will be examining two types of volumes (low and high) and two tickers (AMZN and MSFT), so there are a total of 4 possible combinations of index labels. This can be done using the `from_product()` method.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.from_product.html

In [129]:
pd.MultiIndex.from_product([['low','high'], ['MSFT','AMZN']], names = ['Volume Category','Ticker'])

MultiIndex([( 'low', 'MSFT'),
            ( 'low', 'AMZN'),
            ('high', 'MSFT'),
            ('high', 'AMZN')],
           names=['Volume Category', 'Ticker'])

We now have all four combinations, that is, the Cartesian product of the two lists. Let's assign it to a variable.

In [130]:
cols = pd.MultiIndex.from_product([['low','high'], ['MSFT','AMZN']], names = ['Volume Category','Ticker'])
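
As a quick check on what `from_product()` builds, the sketch below compares it with spelling out every pair by hand through `MultiIndex.from_tuples()`. The variable names here are illustrative only.

```python
import itertools
import pandas as pd

volumes = ["low", "high"]
tickers = ["MSFT", "AMZN"]
names = ["Volume Category", "Ticker"]

from_product = pd.MultiIndex.from_product([volumes, tickers], names=names)

# Equivalent construction: list every (volume, ticker) pair explicitly.
pairs = list(itertools.product(volumes, tickers))
from_tuples = pd.MultiIndex.from_tuples(pairs, names=names)

print(from_product.equals(from_tuples))  # True
print(len(from_product))                 # 4, i.e. 2 volumes x 2 tickers
```

`from_product()` is the more compact choice when every combination is wanted; `from_tuples()` is useful when only some pairs should appear.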

Now let's prepare the data. We need the sequence of values that we'll associate with each of the four labels, so we'll begin by defining some boolean series that capture our logic.

Remember we have four conditions here (two volume categories and two tickers), so we'll create a boolean series for each one from our *tech* dataframe.

In [131]:
low = tech['Volume Category'] == 'low'
high = tech['Volume Category'] == 'high'
amzn = tech.Ticker == 'AMZN'
msft = tech.Ticker == 'MSFT'

In [132]:
msft

0       False
1       False
2       False
3       False
4        True
        ...  
7100    False
7101    False
7102    False
7103    False
7104     True
Name: Ticker, Length: 7105, dtype: bool
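
Before applying these masks to the full dataset, here is a minimal sketch of how two of them combine. The `mini` dataframe is a hypothetical four-row stand-in for *tech*.

```python
import pandas as pd

# Hypothetical miniature of the tech dataframe.
mini = pd.DataFrame({
    "Volume Category": ["low", "high", "low", "high"],
    "Ticker": ["MSFT", "MSFT", "AMZN", "AMZN"],
    "close": [37.16, 36.91, 397.97, 396.44],
})

is_low = mini["Volume Category"] == "low"
is_msft = mini["Ticker"] == "MSFT"

# & is an element-wise AND: True only where both conditions hold.
both = is_low & is_msft
selected = mini[both]

print(both.tolist())               # [True, False, False, False]
print(selected["close"].tolist())  # [37.16]
```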

With these boolean series, we can now easily make selections from our *tech* dataframe. Remember the goal here: we want a random sample of closing prices on high- and low-volume days for both MSFT and AMZN.
We'll start by creating a list of arrays, with each array grabbing one of the four pieces.

In [134]:
[
 tech[low & msft].close.sample(10).values,
 tech[low & amzn].close.sample(10).values,
 tech[high & msft].close.sample(10).values,
 tech[high & amzn].close.sample(10).values,
]

[array([ 72.82, 126.24,  74.68, 137.46,  77.61,  57.6 ,  77.91,  41.27,
         85.51,  53.93]),
 array([ 382.36,  780.22,  333.57,  374.59,  764.04, 1186.1 ,  310.35,
         559.44,  535.02,  767.58]),
 array([ 47.87, 103.73,  49.16,  50.79,  55.91,  43.07,  51.78,  91.33,
        106.16,  39.55]),
 array([ 445.1 , 1668.4 ,  635.35,  364.47, 1470.9 , 1431.42,  327.82,
        1390.  , 1371.99,  488.1 ])]

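We now have a Python list of four elements, each a NumPy array of 10 prices. This is a 4 x 10 arrangement, and what we really need is 10 x 4 (10 rows, 4 columns). Note that `sample()` draws at random, so the values change each time the cell runs; that's why the numbers shown further down differ from the ones above.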


In [136]:
data = [
 tech[low & msft].close.sample(10).values,
 tech[low & amzn].close.sample(10).values,
 tech[high & msft].close.sample(10).values,
 tech[high & amzn].close.sample(10).values,
]

We can reshape this by zipping the four arrays together and unpacking each resulting tuple into a list. `zip(*data)` pairs up the arrays position by position, so each iteration of the comprehension yields a tuple of four items (one from each original array); `[*i]` then unpacks that tuple into a standalone list. The end result is that the four arrays of 10 elements each are converted into 10 lists of 4 elements each.

In [137]:
data = [[*i] for i in zip(*data)]

In [138]:
data

[[120.33, 759.22, 114.26, 964.91],
 [65.68, 404.54, 54.13, 1530.42],
 [74.68, 1801.38, 106.16, 1429.95],
 [63.28, 768.56, 47.52, 531.07],
 [137.46, 335.78, 41.19, 1788.61],
 [85.72, 764.04, 91.33, 1575.39],
 [41.11, 853.42, 44.08, 1718.73],
 [76.29, 346.38, 50.16, 313.18],
 [74.41, 1823.54, 43.07, 482.18],
 [99.05, 847.38, 100.13, 358.69]]
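
The zip-based reshape is a transpose, and NumPy can do the same thing directly. The sketch below is an alternative, not what the lecture uses; the `groups` list is a hypothetical stand-in for the four sampled arrays, shortened to three values each.

```python
import numpy as np

# Hypothetical stand-in for the four sampled arrays (3 values each).
groups = [
    np.array([1.0, 2.0, 3.0]),
    np.array([10.0, 20.0, 30.0]),
    np.array([100.0, 200.0, 300.0]),
    np.array([1000.0, 2000.0, 3000.0]),
]

# Approach from the lecture: zip, then unpack each tuple into a list.
via_zip = [[*row] for row in zip(*groups)]

# NumPy alternative: stack the 1-D arrays side by side as columns.
via_numpy = np.column_stack(groups)

print(via_numpy.shape)                               # (3, 4)
print(np.array_equal(via_numpy, np.array(via_zip)))  # True
```

Either result can be passed to `pd.DataFrame` together with a column multiindex of matching length.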

Finally, let's create a dataframe from this. Because we meticulously constructed the data, we simply need to combine it with the column names from the *cols* variable that we created above.

In [139]:
pd.DataFrame(data, columns=cols)

Volume Category,low,low,high,high
Ticker,MSFT,AMZN,MSFT,AMZN
0,120.33,759.22,114.26,964.91
1,65.68,404.54,54.13,1530.42
2,74.68,1801.38,106.16,1429.95
3,63.28,768.56,47.52,531.07
4,137.46,335.78,41.19,1788.61
5,85.72,764.04,91.33,1575.39
6,41.11,853.42,44.08,1718.73
7,76.29,346.38,50.16,313.18
8,74.41,1823.54,43.07,482.18
9,99.05,847.38,100.13,358.69


Check out that end result! We have a dataframe of 10 random records with a 2-level multiindex column axis (one for Volume Category and another for Ticker).

In [141]:
df = pd.DataFrame(data, columns=cols)

In [142]:
df.columns

MultiIndex([( 'low', 'MSFT'),
            ( 'low', 'AMZN'),
            ('high', 'MSFT'),
            ('high', 'AMZN')],
           names=['Volume Category', 'Ticker'])

In [143]:
df.columns.nlevels

2
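
With a two-level column axis in place, columns can be selected by the outer label alone or by a full tuple. The sketch below uses a hypothetical two-row `demo` dataframe with values copied from the output above.

```python
import pandas as pd

cols = pd.MultiIndex.from_product(
    [["low", "high"], ["MSFT", "AMZN"]], names=["Volume Category", "Ticker"]
)

# Hypothetical two-row dataframe with the same column structure.
demo = pd.DataFrame(
    [[120.33, 759.22, 114.26, 964.91], [65.68, 404.54, 54.13, 1530.42]],
    columns=cols,
)

low_only = demo["low"]              # sub-frame with columns MSFT and AMZN
high_amzn = demo[("high", "AMZN")]  # a single Series

print(low_only.columns.tolist())  # ['MSFT', 'AMZN']
print(high_amzn.tolist())         # [964.91, 1530.42]
```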

There are better ways to create multiindex columns, and we'll cover those in other sections of the course. This lecture nevertheless gives you an appreciation of what goes on internally.