# Advanced Indexing

- Index Objects
- Index Values and Names
- Changing DataFrame Index
- Changing Index Name Labels
- Building an Index, Then a DataFrame


- Hierarchical Indexing
- Extracting Data with a MultiIndex
- Setting & Sorting a MultiIndex
- Using .loc[] with nonunique indexes
- Indexing multiple Levels of a MultiIndex

Indexes are a key building block of DataFrames.

Key building blocks of pandas data structures:
1. Index: a sequence of labels
2. Series: a 1D array with an index
3. DataFrame: a 2D array with Series as columns

pandas Index objects are:
- Immutable (like dictionary keys)
- Homogeneous in data type (like NumPy arrays)
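
The building blocks and the two Index properties listed above can be checked directly. The following is a minimal sketch; the variable names (`days`, `prices`, `frame`, `numeric`, `mutated`) are illustrative and not part of the original notebook.

```python
import pandas as pd

# An Index is a sequence of labels
days = pd.Index(['mon', 'tues', 'wed'], name='weekday')

# A Series is a 1D array of values paired with an Index
prices = pd.Series([10.70, 10.86, 10.72], index=days, name='price')

# A DataFrame is a 2D structure whose columns are Series sharing one Index
frame = pd.DataFrame({'price': prices, 'double': prices * 2})

print(frame.index.equals(days))        # the DataFrame reuses the same labels
print(type(frame['price']).__name__)   # each column comes back as a Series

# An Index holds a single dtype for all of its labels
numeric = pd.Index([1, 2, 3])
print(numeric.dtype)

# An Index is immutable: item assignment raises TypeError
try:
    days[0] = 'monday'
    mutated = True
except TypeError:
    mutated = False
print(mutated)
```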



In [2]:
import pandas as pd
import numpy as np

#### Creating an Index

In [5]:
# note that the default index runs from 0 to 4

prices = [10.7, 10.86, 10.72, 10.71,10.79]

shares = pd.Series(prices)

print(shares)

0    10.70
1    10.86
2    10.72
3    10.71
4    10.79
dtype: float64


In [7]:
# recreate series index declaring days as the series index

days = ['mon', 'tues', 'wed', 'thurs', 'fri']

shares = pd.Series(prices, index=days)

print(shares)

mon      10.70
tues     10.86
wed      10.72
thurs    10.71
fri      10.79
dtype: float64


In [8]:
print(shares.index)

Index(['mon', 'tues', 'wed', 'thurs', 'fri'], dtype='object')


In [10]:
# access a single index label by position
print(shares.index[2])

wed


In [11]:
print(shares.index[:2])

Index(['mon', 'tues'], dtype='object')


In [12]:
print(shares.index[-2:])

Index(['thurs', 'fri'], dtype='object')


In [13]:
print(shares.index.name)

None


In [14]:
shares.index.name = 'weekday'

In [15]:
shares

weekday
mon      10.70
tues     10.86
wed      10.72
thurs    10.71
fri      10.79
dtype: float64

In [16]:
print(shares.index.name)

weekday


In [19]:
# raises TypeError because an Index is immutable (its elements cannot be changed in place)
# this restriction helps pandas optimize Series and DataFrames

shares.index[2] = 'Wednesday'

TypeError: Index does not support mutable operations

In [20]:
# you can overwrite Index all at once

shares.index = ['Monday', 'Tuesday', 'Wednesday',
               'Thursday', 'Friday']

print(shares)

Monday       10.70
Tuesday      10.86
Wednesday    10.72
Thursday     10.71
Friday       10.79
dtype: float64


In [21]:
unemployment = pd.read_csv('unemployment.csv')

FileNotFoundError: File b'unemployment.csv' does not exist

In [None]:
# use the Zip column as the index
unemployment.index = unemployment['Zip']

unemployment.head()

# delete the now-redundant Zip column (square brackets, not parentheses)

del unemployment['Zip']



In [None]:
# importing index in read_csv

unemployment = pd.read_csv('Unemployment.csv', index_col='Zip')
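
Because `unemployment.csv` was not available when the cells above ran, here is a self-contained sketch of both approaches. The CSV text is a hypothetical stand-in: apart from `Zip`, the column names and all values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for unemployment.csv (made-up columns and values)
csv_text = """Zip,unemployment,participants
1001,0.06,13801
1002,0.09,24551
1003,0.17,11477
"""

# Option 1: read first, then promote the column to the index
by_assignment = pd.read_csv(io.StringIO(csv_text))
by_assignment.index = by_assignment['Zip']
del by_assignment['Zip']   # drop the now-redundant column

# Option 2: let read_csv build the index while parsing
by_index_col = pd.read_csv(io.StringIO(csv_text), index_col='Zip')

print(by_assignment.index.name)
print(by_assignment.equals(by_index_col))
```

Passing `index_col` avoids the intermediate state in which the same values live in both the index and a column.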



#### Changing index of a DataFrame

As you saw in the previous exercise, indexes are immutable objects. This means that if you want to change or modify the index in a DataFrame, then you need to change the whole index. You will do this now, using a list comprehension to create the new index.

A list comprehension is a succinct way to generate a list in one line. For example, the following list comprehension generates a list that contains the cubes of all numbers from 0 to 9: 

cubes = [i**3 for i in range(10)]

This is equivalent to the following code:

cubes = []
for i in range(10):
    cubes.append(i**3)
    
Before getting started, print the sales DataFrame in the IPython Shell and verify that the index is given by month abbreviations containing lowercase characters.

In [31]:
sales = pd.read_csv('./data/sales.csv',
                   index_col='month')

In [32]:
sales.head()

       eggs  salt  spam
month                  
Jan      47  12.0    17
Feb     110  50.0    31
Mar     221  89.0    72
Apr      77  87.0    20
May     132   NaN    52


In [33]:
# Create the list of new indexes: new_idx
new_idx = [i.upper() for i in sales.index]

# Assign new_idx to sales.index
sales.index = new_idx

# Print the sales DataFrame
print(sales)

     eggs  salt  spam
JAN    47  12.0    17
FEB   110  50.0    31
MAR   221  89.0    72
APR    77  87.0    20
MAY   132   NaN    52
JUN   205  60.0    55
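
As an alternative to the list comprehension, a string-valued Index also exposes vectorized string methods through its `.str` accessor. The sketch below rebuilds the `sales` table inline from the values printed above so that it runs on its own; the names `upper_by_comprehension` and `upper_by_str` are illustrative.

```python
import pandas as pd

# sales table rebuilt from the values shown in this notebook
sales = pd.DataFrame(
    {'eggs': [47, 110, 221, 77, 132, 205],
     'salt': [12.0, 50.0, 89.0, 87.0, None, 60.0],
     'spam': [17, 31, 72, 20, 52, 55]},
    index=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
)

# List comprehension, as in the cell above
upper_by_comprehension = [label.upper() for label in sales.index]

# Vectorized string method on the Index; returns a new Index
upper_by_str = sales.index.str.upper()

# Either result replaces the whole index in a single assignment
sales.index = upper_by_str
print(list(sales.index))
```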


#### Changing index name labels

Notice that in the previous exercise, the index was not labeled with a name. In this exercise, you will set its name to 'MONTHS'.

Similarly, if all the columns are related in some way, you can provide a label for the set of columns.

To get started, print the sales DataFrame in the IPython Shell and verify that the index has no name, only its data (the month names).

In [34]:
# Assign the string 'MONTHS' to sales.index.name
sales.index.name = 'MONTHS'

# Print the sales DataFrame
print(sales)



        eggs  salt  spam
MONTHS                  
JAN       47  12.0    17
FEB      110  50.0    31
MAR      221  89.0    72
APR       77  87.0    20
MAY      132   NaN    52
JUN      205  60.0    55


In [35]:
# Assign the string 'PRODUCTS' to sales.columns.name 
sales.columns.name = 'PRODUCTS'

# Print the sales dataframe again
print(sales)

PRODUCTS  eggs  salt  spam
MONTHS                    
JAN         47  12.0    17
FEB        110  50.0    31
MAR        221  89.0    72
APR         77  87.0    20
MAY        132   NaN    52
JUN        205  60.0    55
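
Assigning to `.name` modifies the existing DataFrame. A possible alternative, not used in the original exercise, is `rename_axis`, which returns a new DataFrame and leaves the original untouched. A minimal sketch on a two-row subset of the sales data:

```python
import pandas as pd

# Two-row subset of the sales table, enough to show the axis labels
sales = pd.DataFrame(
    {'eggs': [47, 110], 'salt': [12.0, 50.0], 'spam': [17, 31]},
    index=['JAN', 'FEB'],
)

# Name the row axis, then the column axis; each call returns a new object
labeled = sales.rename_axis('MONTHS').rename_axis('PRODUCTS', axis='columns')

print(labeled.index.name, labeled.columns.name)
print(sales.index.name)   # the original DataFrame is unchanged
```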


#### Building an index, then a DataFrame

You can also build the DataFrame and index independently, and then put them together. If you take this route, be careful, as any mistakes in generating the DataFrame or the index can cause the data and the index to be aligned incorrectly.

In this exercise, the sales DataFrame has been provided for you without the month index. Your job is to build this index separately and then assign it to the sales DataFrame. Before getting started, print the sales DataFrame in the IPython Shell and note that it's missing the month information.

In [36]:
# Generate the list of months: months
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']

# Assign months to sales.index
sales.index = months

# Print the modified sales DataFrame
print(sales)

PRODUCTS  eggs  salt  spam
Jan         47  12.0    17
Feb        110  50.0    31
Mar        221  89.0    72
Apr         77  87.0    20
May        132   NaN    52
Jun        205  60.0    55
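
The alignment caveat at the start of this section is worth seeing concretely. In the sketch below (using only the `eggs` column; the names `error` and `shuffled` are illustrative), a label list of the wrong length is rejected, but a list of the right length in the wrong order is accepted without complaint.

```python
import pandas as pd

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
eggs = [47, 110, 221, 77, 132, 205]

sales = pd.DataFrame({'eggs': eggs})
sales.index = months          # same length: labels attach by position

# A label list of the wrong length raises ValueError
try:
    sales.index = months[:5]
    error = None
except ValueError as exc:
    error = exc
print(type(error).__name__)

# A list of the right length but wrong order is accepted silently,
# so 'Jan' now points at the value that belongs to June
shuffled = pd.DataFrame({'eggs': eggs})
shuffled.index = months[::-1]
print(shuffled.loc['Jan', 'eggs'])
```

pandas can only validate the length of the new index, so the ordering has to be checked by the person building it.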


# Hierarchical Indexing

Represent multi-dimensional data with Hierarchical Indexes

In [39]:
stock = pd.read_csv('./data/tmp_clean_stock_data.csv')

In [41]:
stock

     name     Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep     Oct     Nov     Dec
0     IBM  156.08  160.01  159.81  165.22  172.25  167.15  164.75  152.77  145.36  146.11  137.21  137.96
1    MSFT   45.51   43.08   42.13   43.47   47.53   45.96   45.61   45.51   43.56   48.70   53.88   55.40
2  GOOGLE  512.42  537.99  559.72  540.50  535.24  532.92  590.09  636.84  617.93  663.59  735.39  755.35
3   APPLE  110.64  125.43  125.97  127.29  128.76  127.81  125.34  113.39  112.80  113.36  118.16  111.73


In a long-format stocks table, with one row per symbol per date, both the Date and Symbol columns contain repeated values. We would prefer a meaningful index that uniquely identifies each row.

Individually, neither the 'Date' nor the 'Symbol' column is suitable because of these repetitions. The trick is to use a tuple (Symbol, Date) to represent each record in the table uniquely.

The index then comprises two columns: Symbol and Date.



In [42]:
# requires a long-format table that has 'Symbol' and 'Date' columns
stocks = stocks.set_index(['Symbol', 'Date'])

KeyError: 'Symbol'

In [None]:
# note that the output is a MultiIndex

print(stocks.index)

# prints None: a MultiIndex has no single name
print(stocks.index.name)

# prints the names of the index levels
print(stocks.index.names)

In [None]:
# sort the DataFrame by its index
# sort_index() returns a new, sorted DataFrame

stocks = stocks.sort_index()

# the printout is easier to read: each outer label (AAPL, CSCO, MSFT) is shown once, with blanks for its repeats
print(stocks)
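
Since the stocks file used in the lecture is not available here, the steps above can be sketched on a small hypothetical table. The symbols and dates follow the cells in this section; the `Close` and `Volume` values are made up for illustration only.

```python
import pandas as pd

# Hypothetical long-format table; Close and Volume values are made up
stocks = pd.DataFrame({
    'Date':   ['2016-10-03', '2016-10-03', '2016-10-03',
               '2016-10-04', '2016-10-04', '2016-10-04'],
    'Symbol': ['CSCO', 'AAPL', 'MSFT', 'CSCO', 'AAPL', 'MSFT'],
    'Close':  [31.0, 112.0, 57.0, 31.5, 113.0, 57.5],
    'Volume': [100, 200, 300, 110, 210, 310],
})

# Neither column is unique on its own, but the (Symbol, Date) pair is
stocks = stocks.set_index(['Symbol', 'Date'])
print(type(stocks.index).__name__)
print(stocks.index.name)            # a MultiIndex has no single name
print(list(stocks.index.names))     # one name per level
print(stocks.index.is_unique)

# Sorting groups the rows by Symbol, then by Date within each Symbol
stocks = stocks.sort_index()
print(stocks)
```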

In [None]:
# insert tuple into .loc
stocks.loc[('CSCO', '2016-10-04')]

# extract Volume element in the table
stocks.loc[('CSCO', '2016-10-04'), 'Volume']

# extracts all dates for AAPL, slices the outer index
stocks.loc['AAPL']

# slice using range of symbols
stocks.loc['CSCO':'MSFT']
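
The four lookups above can be tried on a small stand-in table. This is a sketch with made-up `Close` and `Volume` values; only the symbols, dates, and column names come from the cells in this section.

```python
import pandas as pd

# Hypothetical data; Close and Volume values are made up
stocks = pd.DataFrame({
    'Symbol': ['AAPL', 'AAPL', 'CSCO', 'CSCO', 'MSFT', 'MSFT'],
    'Date':   ['2016-10-03', '2016-10-04'] * 3,
    'Close':  [112.0, 113.0, 31.0, 31.5, 57.0, 57.5],
    'Volume': [200, 210, 100, 110, 300, 310],
}).set_index(['Symbol', 'Date']).sort_index()

# A full (Symbol, Date) tuple selects one row, returned as a Series
row = stocks.loc[('CSCO', '2016-10-04')]

# Adding a column label narrows the result to a single value
volume = stocks.loc[('CSCO', '2016-10-04'), 'Volume']

# An outer label alone returns every date for that symbol
aapl = stocks.loc['AAPL']

# A label slice on the outer level includes both endpoints
span = stocks.loc['CSCO':'MSFT']

print(volume)
print(list(aapl.index))
print(sorted(set(span.index.get_level_values('Symbol'))))
```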




## Fancy Indexing

In [None]:
# Slice AAPL and MSFT data for a particular date, returning all cols
stocks.loc[(['AAPL', 'MSFT'], '2016-10-05'), :]

stocks.loc[(['AAPL', 'MSFT'], '2016-10-05'), 'Close']

# this also works on the inner index
stocks.loc[('CSCO', ['2016-10-05', '2016-10-03']), :]

In [None]:
# trick for slicing hierarchical indexes
# the bare : syntax is not valid inside a tuple
# therefore we use slice() explicitly


stocks.loc[(slice(None), slice('2016-10-03', '2016-10-04')), :]
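
The fancy-indexing and `slice()` forms above can be exercised on a small stand-in table, sketched below with placeholder values. The sketch also shows `pd.IndexSlice`, a pandas helper that is not used in the original notebook but allows the usual `:` syntax for the same selection.

```python
import pandas as pd

symbols = ['AAPL', 'CSCO', 'MSFT']
dates = ['2016-10-03', '2016-10-04', '2016-10-05']

# Hypothetical table: one row per (Symbol, Date); values are placeholders
index = pd.MultiIndex.from_product([symbols, dates], names=['Symbol', 'Date'])
stocks = pd.DataFrame(
    {'Close': [float(i) for i in range(9)],
     'Volume': list(range(100, 109))},
    index=index,
).sort_index()

# A list on the outer level combined with one label on the inner level
fancy = stocks.loc[(['AAPL', 'MSFT'], '2016-10-05'), :]

# slice(None) means "everything on this level"
explicit = stocks.loc[(slice(None), slice('2016-10-03', '2016-10-04')), :]

# pd.IndexSlice expresses the same selection with : syntax
idx = pd.IndexSlice
shorthand = stocks.loc[idx[:, '2016-10-03':'2016-10-04'], :]

print(len(fancy), len(explicit))
print(explicit.equals(shorthand))
```

Slicing on an inner level generally expects the MultiIndex to be sorted, which is one reason to call `sort_index()` first.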



#### Setting up dataframe for example

In [71]:
sales = pd.read_csv('./data/sales.csv')

In [72]:
sales.columns

Index(['month', 'eggs', 'salt', 'spam'], dtype='object')

In [73]:
del sales['month']

In [74]:
state = pd.Series(['CA', 'CA', 'NY', 'NY', 'TX', 'TX'])
month = pd.Series([1, 2, 1, 2, 1, 2])

In [77]:
sales['state'] = state
sales['month'] = month

In [82]:
sales = sales.set_index(['state', 'month'])

In [83]:
sales

             eggs  salt  spam
state month                  
CA    1        47  12.0    17
      2       110  50.0    31
NY    1       221  89.0    72
      2        77  87.0    20
TX    1       132   NaN    52
      2       205  60.0    55


#### Extracting data with a MultiIndex

In the video, Dhavide explained the concept of a hierarchical index, or a MultiIndex. You will now practice working with these types of indexes.

The sales DataFrame you have been working with has been extended to now include State information as well. In the IPython Shell, print the new sales DataFrame to inspect the data. Take note of the MultiIndex!

Extracting elements from the outermost level of a MultiIndex is just like in the case of a single-level Index. You can use the .loc[] accessor as Dhavide demonstrated in the video.

In [84]:
# Print sales.loc[['CA', 'TX']]
print(sales.loc[['CA', 'TX']])

# Print sales['CA':'TX']
print(sales['CA':'TX'])




             eggs  salt  spam
state month                  
CA    1        47  12.0    17
      2       110  50.0    31
TX    1       132   NaN    52
      2       205  60.0    55
             eggs  salt  spam
state month                  
CA    1        47  12.0    17
      2       110  50.0    31
NY    1       221  89.0    72
      2        77  87.0    20
TX    1       132   NaN    52
      2       205  60.0    55
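
The lookups above only touch the outer level. As a sketch of indexing on both levels of the same `sales` table (rebuilt inline from the values shown in this notebook; the result names are illustrative):

```python
import pandas as pd

# sales table with the (state, month) MultiIndex shown above
sales = pd.DataFrame(
    {'eggs': [47, 110, 221, 77, 132, 205],
     'salt': [12.0, 50.0, 89.0, 87.0, float('nan'), 60.0],
     'spam': [17, 31, 72, 20, 52, 55]},
    index=pd.MultiIndex.from_arrays(
        [['CA', 'CA', 'NY', 'NY', 'TX', 'TX'], [1, 2, 1, 2, 1, 2]],
        names=['state', 'month']),
)

# One row: a (state, month) tuple
ny_month1 = sales.loc[('NY', 1)]

# Several states, one month: a list on the outer level, a label on the inner
ca_tx_month2 = sales.loc[(['CA', 'TX'], 2), :]

# Every state, one month: slice(None) on the outer level
all_month2 = sales.loc[(slice(None), 2), :]

print(ny_month1)
print(ca_tx_month2)
print(all_month2)
```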
