## Task
Explore indexing in pandas

## Notebook Summary
* Index object: immutable array vs. ordered set
* `[]`, `.loc[]`, `.iloc[]`, `.ix[]` with integer and non-integer indexes
  * Series
  * DataFrame
* MultiIndex object
* Hierarchical indexes, swap levels, sort index
* Re-indexing

## References
* *Python for Data Analysis*, Wes McKinney, O'Reilly, 2012
* *Numerical Python*, Robert Johansson, Apress, 2015
* *Python Data Science Handbook*, Jake VanderPlas, O'Reilly, 2016


In [5]:
# display the result of every expression in a cell, not only the last one (like the Python shell)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import platform
print 'python.version = ', platform.python_version()
import IPython
print 'ipython.version =', IPython.version_info

import numpy as np
print 'numpy.version =', np.__version__

import pandas as pd
print 'pandas.version =', pd.__version__
from pandas import Series, DataFrame, Index


python.version =  2.7.10
ipython.version = (5, 1, 0, '')
numpy.version = 1.11.2
pandas.version = 0.19.1


In [27]:
# Index object

# create
Index(['a','b','c','d'])
Index(range(5))

i1 = Index(['a','b','c','d','e','f'])
i2 = Index(['a', 'b', 'x', 'y', 'z'])

print '\n----- access as array'
i1[1]
i1[2:4]
i1[::2]

i1.shape, i1.dtype, i1.ndim, i1.size

# cannot assign values to Index - it is immutable
# i1[1] = 99 - TypeError: Index does not support mutable operations

print '\n----- as ordered set'
# access as ordered set

i1 & i2 # intersection
i1 | i2 # union
i1 ^ i2 # symmetric difference (labels found in exactly one of the two indexes)

print '\n----- delete items from index'
i1.drop('a')
i1 # original index is not affected


Index([u'a', u'b', u'c', u'd'], dtype='object')

Int64Index([0, 1, 2, 3, 4], dtype='int64')


----- access as array


'b'

Index([u'c', u'd'], dtype='object')

Index([u'a', u'c', u'e'], dtype='object')

((6,), dtype('O'), 1, 6)


----- as ordered set


Index([u'a', u'b'], dtype='object')

Index([u'a', u'b', u'c', u'd', u'e', u'f', u'x', u'y', u'z'], dtype='object')

Index([u'c', u'd', u'e', u'f', u'x', u'y', u'z'], dtype='object')


----- delete items from index


Index([u'b', u'c', u'd', u'e', u'f'], dtype='object')

Index([u'a', u'b', u'c', u'd', u'e', u'f'], dtype='object')
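
The set operators above reflect pandas 0.19. Later pandas releases deprecated the operator forms on `Index` in favor of named methods, so the following is a minimal sketch of the same operations written with `intersection`, `union`, and `symmetric_difference`. The variable names (`common`, `either`, `only_one`, `mutated`, `dropped`) are illustrative and not part of the original notebook.

```python
import pandas as pd

i1 = pd.Index(['a', 'b', 'c', 'd', 'e', 'f'])
i2 = pd.Index(['a', 'b', 'x', 'y', 'z'])

# Set-style operations via named methods
common = i1.intersection(i2)              # labels present in both
either = i1.union(i2)                     # labels present in at least one
only_one = i1.symmetric_difference(i2)    # labels present in exactly one

# Index is immutable: item assignment raises TypeError
try:
    i1[1] = 'zzz'
    mutated = True
except TypeError:
    mutated = False

# drop returns a new Index and leaves the original untouched
dropped = i1.drop('a')
```

Because every operation returns a new `Index`, `i1` still holds all six labels afterwards.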

In [49]:
# Series w/ integer index - [] vs. loc vs iloc vs ix 

s = Series(['a','b','c','d','e','f'], index=range(1,7))
s
print '-----'

s[1] # [] - with an integer index, a single value is looked up by label (explicit index)
s[1:3] # [] - with an integer index, a slice is positional (implicit index)
s.loc[[1,3]] # loc - always uses the explicit index (labels)
s.iloc[[1,3]] # iloc - always uses the implicit index (positions)
s.ix[[1,3]] # ix - label-based here because the index is integer, so same as loc


1    a
2    b
3    c
4    d
5    e
6    f
dtype: object

-----


'a'

2    b
3    c
dtype: object

1    a
3    c
dtype: object

2    b
4    d
dtype: object

1    a
3    c
dtype: object
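
The difference between label-based and position-based access is easiest to see when the integer labels do not start at 0. The sketch below repeats the Series from the cell above and uses only `loc` and `iloc`, which avoids the mixed behavior of `[]`. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd', 'e', 'f'], index=range(1, 7))

by_label = s.loc[1]        # label 1 is the first element
by_position = s.iloc[1]    # position 1 is the second element

label_slice = s.loc[1:3]       # label slice: stop label 3 is included
position_slice = s.iloc[1:3]   # positional slice: stop position 3 is excluded
```

The label slice returns three items while the positional slice returns two, even though both are written as `1:3`.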

In [56]:
# Series w/ non-integer index - [] vs. loc vs iloc vs ix 

s = Series(range(1,7), index=['a','b','c','d','e','f'])
s
print '-----'

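s['a'] # single label
s['a':'c'] # label slice - the stop label 'c' is included
s.loc['a']
# s.iloc['a'] - will result in TypeError, since iloc accepts only integer positions
s.ix['a':'c'] # same as []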


----- with non-integer index


a    1
b    2
c    3
d    4
e    5
f    6
dtype: int64

-----


1

a    1
b    2
c    3
dtype: int64

1

a    1
b    2
c    3
dtype: int64
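
With a string index the two accessors cannot be confused, but the slice semantics still differ. This is a small sketch, using the same Series as above, that contrasts an inclusive label slice with an exclusive positional slice and confirms that `iloc` rejects labels. The names `label_slice`, `position_slice`, and `iloc_accepts_labels` are illustrative.

```python
import pandas as pd

s = pd.Series(range(1, 7), index=['a', 'b', 'c', 'd', 'e', 'f'])

label_slice = s.loc['a':'c']    # stop label 'c' is included: 3 items
position_slice = s.iloc[0:2]    # stop position 2 is excluded: 2 items

# iloc only understands integer positions
try:
    s.iloc['a']
    iloc_accepts_labels = True
except TypeError:
    iloc_accepts_labels = False
```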

In [102]:
# DataFrame w/ integer index - [] vs. loc vs iloc vs ix 
df = DataFrame(np.arange(12).reshape(4,3), columns=['Col1', 'Col2', 'Col3'], index=[101,102,103,104])
df


# indexing 
# df[0] - results in KeyError since there is no column labelled 0
print '\n----- []'
df['Col1'] # get 1st column as Series
df[['Col1', 'Col2']] # get 2 cols as DataFrame

print '\n----- loc'
df.loc[101] # get 1st row as Series
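df.loc[100:102] # get 2 rows as DataFrame - label 100 is absent, but slice bounds need not exist in a sorted index; stop label 102 is included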

print '\n----- iloc'
df.iloc[0] # get 1st row as Series
df.iloc[0:1] # get 1st row as DataFrame

print '\n----- ix'
df.ix[101,:] # get 1st row as Series
df.ix[101:102,:] # get 2 rows as DataFrame


Unnamed: 0,Col1,Col2,Col3
101,0,1,2
102,3,4,5
103,6,7,8
104,9,10,11



----- []


101    0
102    3
103    6
104    9
Name: Col1, dtype: int64

Unnamed: 0,Col1,Col2
101,0,1
102,3,4
103,6,7
104,9,10



----- loc


Col1    0
Col2    1
Col3    2
Name: 101, dtype: int64

Unnamed: 0,Col1,Col2,Col3
101,0,1,2
102,3,4,5



----- iloc


Col1    0
Col2    1
Col3    2
Name: 101, dtype: int64

Unnamed: 0,Col1,Col2,Col3
101,0,1,2



----- ix


Col1    0
Col2    1
Col3    2
Name: 101, dtype: int64

Unnamed: 0,Col1,Col2,Col3
101,0,1,2
102,3,4,5
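
Later pandas releases deprecated and then removed `.ix`, so the `ix` calls above only run on old versions such as the 0.19.1 used here. As a hedged sketch, the same selections can be expressed with `loc` (labels) and `iloc` (positions). The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  columns=['Col1', 'Col2', 'Col3'],
                  index=[101, 102, 103, 104])

row = df.loc[101]             # Series for row label 101
rows = df.loc[101:102]        # DataFrame; both labels are included
cell = df.loc[102, 'Col2']    # scalar at row label 102, column 'Col2'
block = df.iloc[0:2, 0:2]     # first 2 rows and first 2 columns by position
```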


In [89]:
# DataFrame w/ non-integer index - [] vs. loc vs iloc vs ix 
df = DataFrame(np.arange(12).reshape(4,3), columns=['Col1', 'Col2', 'Col3'], index=['a','b','c','d'])
df
print '-----'

# indexing 
# df[0] - results in KeyError
df['Col1'] # get 1st column as Series
df[['Col1', 'Col2']] # get 2 cols as DataFrame

# TypeError - the explicit index holds string labels, so there is no label 0
# df.loc[0] # get 1st row as Series
# df.loc[0:1] # get 2 rows as DataFrame
df.loc['c'] # get 3rd row as Series
df.loc[:'c'] # get 1st 3 rows as DataFrame

df.iloc[0] # get 1st row as Series
df.iloc[0:1] # get 1st row as DataFrame

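df.ix[0] # get 1st row as Series - there are no integer labels, so ix falls back to position
df.ix[0:1] # get 1st row as DataFrame - positional slice, so the stop is exclusive
df.ix[:'b', :'Col2'] # label slice - stop is inclusive, so 'b' and 'Col2' are part of result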


Unnamed: 0,Col1,Col2,Col3
a,0,1,2
b,3,4,5
c,6,7,8
d,9,10,11


-----


a    0
b    3
c    6
d    9
Name: Col1, dtype: int64

Unnamed: 0,Col1,Col2
a,0,1
b,3,4
c,6,7
d,9,10


Col1    6
Col2    7
Col3    8
Name: c, dtype: int64

Unnamed: 0,Col1,Col2,Col3
a,0,1,2
b,3,4,5
c,6,7,8


Col1    0
Col2    1
Col3    2
Name: a, dtype: int64

Unnamed: 0,Col1,Col2,Col3
a,0,1,2


Col1    0
Col2    1
Col3    2
Name: a, dtype: int64

Unnamed: 0,Col1,Col2,Col3
a,0,1,2


Unnamed: 0,Col1,Col2
a,0,1
b,3,4
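
The last `ix` call above mixes row labels and column labels, and `ix[0]` silently switches to positions. One way to keep that intent explicit without `ix` is sketched below: use `loc` for labels, `iloc` for positions, and translate positions into labels through `df.columns` when the two must be combined. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  columns=['Col1', 'Col2', 'Col3'],
                  index=['a', 'b', 'c', 'd'])

# Pure label selection: both stop labels are included
by_labels = df.loc[:'b', :'Col2']

# Rows by label, columns by position (first two), expressed through loc
mixed = df.loc[:'b', df.columns[:2]]

# Pure positional selection of the first row
first_row = df.iloc[0]
```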


In [126]:
# MultiIndex object

# Create MultiIndex
pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('A', 3), ('B', 1), ('B',2), ('C',1), ('C',4)], names=['Letters', 'Numbers'])
pd.MultiIndex.from_arrays([['A','A','A','B','B','C','C'],[1,2,3,1,2,1,4]], names=['Letters', 'Numbers'])
pd.MultiIndex.from_product([['A', 'B'],[1,2]], names=['Letters', 'Numbers'])


MultiIndex(levels=[[u'A', u'B', u'C'], [1, 2, 3, 4]],
           labels=[[0, 0, 0, 1, 1, 2, 2], [0, 1, 2, 0, 1, 0, 3]],
           names=[u'Letters', u'Numbers'])

MultiIndex(levels=[[u'A', u'B', u'C'], [1, 2, 3, 4]],
           labels=[[0, 0, 0, 1, 1, 2, 2], [0, 1, 2, 0, 1, 0, 3]],
           names=[u'Letters', u'Numbers'])

MultiIndex(levels=[[u'A', u'B'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=[u'Letters', u'Numbers'])
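
The first two constructors above describe the same index in two layouts: a list of row tuples versus one array per level. The sketch below checks that equivalence and lists the Cartesian product built by `from_product`. The helper variables (`tuples`, `names`, `letters`, `numbers`) are illustrative.

```python
import pandas as pd

tuples = [('A', 1), ('A', 2), ('A', 3), ('B', 1), ('B', 2), ('C', 1), ('C', 4)]
names = ['Letters', 'Numbers']

# One tuple per row
from_tuples = pd.MultiIndex.from_tuples(tuples, names=names)

# One array per level: transpose the tuples
letters, numbers = zip(*tuples)
from_arrays = pd.MultiIndex.from_arrays([list(letters), list(numbers)], names=names)

# Every combination of the two input lists
from_product = pd.MultiIndex.from_product([['A', 'B'], [1, 2]], names=names)

same_index = from_tuples.equals(from_arrays)
outer_labels = list(from_tuples.get_level_values('Letters'))
```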

In [11]:
# Hierarchical indexing - Series

s = Series(np.arange(9), index=[['a','a','a','b','b','b','c','c','c'],['x','y','z','x','y','z','x','y','z']])
s
s['a']
s['b':'c']
s[['a','c']]

print '---'
s.ix[['a','c']]

s['a','y']
s[:,'x']

print '\n----- Stack & Unstack'

s.unstack() # same as unstack(level=1); the default level=-1 is the innermost level
type(s.unstack())
s.unstack(level=0)

s.unstack().stack()
type(s.unstack().stack())


a  x    0
   y    1
   z    2
b  x    3
   y    4
   z    5
c  x    6
   y    7
   z    8
dtype: int64

x    0
y    1
z    2
dtype: int64

b  x    3
   y    4
   z    5
c  x    6
   y    7
   z    8
dtype: int64

a  x    0
   y    1
   z    2
c  x    6
   y    7
   z    8
dtype: int64

---


a  x    0
   y    1
   z    2
c  x    6
   y    7
   z    8
dtype: int64

1

a    0
b    3
c    6
dtype: int64


----- Stack & Unstack


Unnamed: 0,x,y,z
a,0,1,2
b,3,4,5
c,6,7,8


pandas.core.frame.DataFrame

Unnamed: 0,a,b,c
x,0,3,6
y,1,4,7
z,2,5,8


a  x    0
   y    1
   z    2
b  x    3
   y    4
   z    5
c  x    6
   y    7
   z    8
dtype: int64

pandas.core.series.Series
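
As a compact recap of the cell above, the sketch below unstacks the inner level into columns, stacks it back, and selects on the inner level with `xs`, which is an alternative spelling of `s[:, 'x']`. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(9),
              index=[list('aaabbbccc'), list('xyzxyzxyz')])

wide = s.unstack()             # inner level (x, y, z) becomes the columns
wide_outer = s.unstack(level=0)  # outer level (a, b, c) becomes the columns
round_trip = wide.stack()      # back to a Series with a two-level index

inner_x = s.xs('x', level=1)   # all entries whose inner label is 'x'
```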

In [15]:
# Hierarchical indexing - DataFrame

df = DataFrame(np.arange(36).reshape(9,4), 
               index=[['a','a','a','b','b','b','c','c','c'],['x','y','z','x','y','z','x','y','z']],
               columns=[['Iris','Iris','Campanula','Campanula'],['Petal','Sepal','Petal','Sepal']]
              )
df.index.names = ['Item','SubItem']
df.columns.names = ['Flower', 'Metric']
df.name = 'Flower Metrics' # note: unlike Series, DataFrame has no built-in name; this only sets a plain Python attribute
df

print '-----'

df['Iris']
df['Iris']['Petal']
type(df['Iris']['Petal']) # this is a Series

print '\n----- IndexSlice'
# ToDo


Unnamed: 0_level_0,Flower,Iris,Iris,Campanula,Campanula
Unnamed: 0_level_1,Metric,Petal,Sepal,Petal,Sepal
Item,SubItem,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
a,x,0,1,2,3
a,y,4,5,6,7
a,z,8,9,10,11
b,x,12,13,14,15
b,y,16,17,18,19
b,z,20,21,22,23
c,x,24,25,26,27
c,y,28,29,30,31
c,z,32,33,34,35


-----


Unnamed: 0_level_0,Metric,Petal,Sepal
Item,SubItem,Unnamed: 2_level_1,Unnamed: 3_level_1
a,x,0,1
a,y,4,5
a,z,8,9
b,x,12,13
b,y,16,17
b,z,20,21
c,x,24,25
c,y,28,29
c,z,32,33


Item  SubItem
a     x           0
      y           4
      z           8
b     x          12
      y          16
      z          20
c     x          24
      y          28
      z          32
Name: Petal, dtype: int64

pandas.core.series.Series


----- IndexSlice
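
One possible way to fill in the IndexSlice part is sketched below, assuming the same flower DataFrame as in the cell above. `pd.IndexSlice` lets each level of a MultiIndex be sliced separately inside `loc`. Both axes are sorted first as a precaution, since label slicing on a MultiIndex generally expects sorted levels. The names `df_sorted`, `idx`, and `petals` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(36).reshape(9, 4),
    index=[list('aaabbbccc'), list('xyzxyzxyz')],
    columns=[['Iris', 'Iris', 'Campanula', 'Campanula'],
             ['Petal', 'Sepal', 'Petal', 'Sepal']])
df.index.names = ['Item', 'SubItem']
df.columns.names = ['Flower', 'Metric']

# Sort rows and columns so that label slices are well defined
df_sorted = df.sort_index().sort_index(axis=1)

idx = pd.IndexSlice
# Rows: Item 'a' through 'b', SubItem 'x' only
# Columns: every Flower, Metric 'Petal' only
petals = df_sorted.loc[idx['a':'b', 'x'], idx[:, 'Petal']]
```

Note that sorting the columns puts Campanula before Iris, so the selected columns come back in that order.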


In [16]:
# Re-arranging Indexes - DataFrame

print '\n----- Unstack original DataFrame'
df.unstack()
df.unstack(level=0)
df['Iris']['Petal'].unstack() # same as unstacking Series


print '\n----- Swap column levels'
df.swaplevel('Item', 'SubItem') # note: 'Item' and 'SubItem' are row-index levels; swaplevel works on rows (axis=0) by default

print '\n----- Swap column levels and sort outer level'
df.swaplevel('Item', 'SubItem').sortlevel(0) # same as...
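df.swaplevel('Item', 'SubItem').sort_index() # ...this; later pandas releases deprecate sortlevel in favor of sort_index

df.swaplevel(0,1, axis=1).sortlevel(0, axis=1) # axis=1: swap and sort the column levels instead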



----- Unstack original DataFrame


Flower,Iris,Iris,Iris,Iris,Iris,Iris,Campanula,Campanula,Campanula,Campanula,Campanula,Campanula
Metric,Petal,Petal,Petal,Sepal,Sepal,Sepal,Petal,Petal,Petal,Sepal,Sepal,Sepal
SubItem,x,y,z,x,y,z,x,y,z,x,y,z
Item,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3
a,0,4,8,1,5,9,2,6,10,3,7,11
b,12,16,20,13,17,21,14,18,22,15,19,23
c,24,28,32,25,29,33,26,30,34,27,31,35


Flower,Iris,Iris,Iris,Iris,Iris,Iris,Campanula,Campanula,Campanula,Campanula,Campanula,Campanula
Metric,Petal,Petal,Petal,Sepal,Sepal,Sepal,Petal,Petal,Petal,Sepal,Sepal,Sepal
Item,a,b,c,a,b,c,a,b,c,a,b,c
SubItem,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3
x,0,12,24,1,13,25,2,14,26,3,15,27
y,4,16,28,5,17,29,6,18,30,7,19,31
z,8,20,32,9,21,33,10,22,34,11,23,35


SubItem,x,y,z
Item,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a,0,4,8
b,12,16,20
c,24,28,32



----- Swap column levels


Unnamed: 0_level_0,Flower,Iris,Iris,Campanula,Campanula
Unnamed: 0_level_1,Metric,Petal,Sepal,Petal,Sepal
SubItem,Item,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
x,a,0,1,2,3
y,a,4,5,6,7
z,a,8,9,10,11
x,b,12,13,14,15
y,b,16,17,18,19
z,b,20,21,22,23
x,c,24,25,26,27
y,c,28,29,30,31
z,c,32,33,34,35



----- Swap column levels and sort outer level


Unnamed: 0_level_0,Flower,Iris,Iris,Campanula,Campanula
Unnamed: 0_level_1,Metric,Petal,Sepal,Petal,Sepal
SubItem,Item,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
x,a,0,1,2,3
x,b,12,13,14,15
x,c,24,25,26,27
y,a,4,5,6,7
y,b,16,17,18,19
y,c,28,29,30,31
z,a,8,9,10,11
z,b,20,21,22,23
z,c,32,33,34,35


Unnamed: 0_level_0,Flower,Iris,Iris,Campanula,Campanula
Unnamed: 0_level_1,Metric,Petal,Sepal,Petal,Sepal
SubItem,Item,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
x,a,0,1,2,3
x,b,12,13,14,15
x,c,24,25,26,27
y,a,4,5,6,7
y,b,16,17,18,19
y,c,28,29,30,31
z,a,8,9,10,11
z,b,20,21,22,23
z,c,32,33,34,35


Unnamed: 0_level_0,Metric,Petal,Petal,Sepal,Sepal
Unnamed: 0_level_1,Flower,Campanula,Iris,Campanula,Iris
Item,SubItem,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
a,x,2,0,3,1
a,y,6,4,7,5
a,z,10,8,11,9
b,x,14,12,15,13
b,y,18,16,19,17
b,z,22,20,23,21
c,x,26,24,27,25
c,y,30,28,31,29
c,z,34,32,35,33
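
Swapping levels only relabels the rows; it does not reorder them, which is why a sort usually follows. The sketch below shows both steps on a deliberately small frame with a hypothetical single column `v`, using `sort_index` rather than `sortlevel` so that it also runs on newer pandas releases.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['a', 'a', 'b', 'b'], ['x', 'y', 'x', 'y']],
    names=['Item', 'SubItem'])
df = pd.DataFrame({'v': [0, 1, 2, 3]}, index=index)

# Swap the two row levels: same row order, new level order
swapped = df.swaplevel('Item', 'SubItem')

# Sort by the new outer level (SubItem), then by the inner level (Item)
ordered = swapped.sort_index()
```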


In [None]:
# reset_index

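# reset_index - move the index levels into ordinary columns (hierarchical to flat)
# set_index - build an index from existing columns (flat to hierarchical)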
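
Since this cell has no runnable code yet, here is a minimal sketch of the round trip it describes. The Series, its name `value`, and the level names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4),
              index=[['a', 'a', 'b', 'b'], ['x', 'y', 'x', 'y']],
              name='value')
s.index.names = ['Item', 'SubItem']

# Hierarchical to flat: index levels become ordinary columns
flat = s.reset_index()

# Flat to hierarchical: rebuild the two-level index from those columns
rebuilt = flat.set_index(['Item', 'SubItem'])['value']
```

After `reset_index` the result is a DataFrame with a default integer index, and `set_index` restores the original two-level index.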



In [11]:
# Reindexing - Series - conform Series to new index

s = Series(range(5), index=['b','a','d','e','c'])
s
print '-----'

print 'New index value will have NaN'
s.reindex(['a','b','c','x'])

print 'New index values will contain 99'
s.reindex(['a','b','c','x','y','z'], fill_value=99)
# s.reindex(['a','b','c','x','y','z'], method='ffill') # - will not work: the index is not sorted (monotonic), which ffill/bfill require

print '-----'
s = Series(['a','b','c'], index=range(3))
s.reindex(range(10))

print 'Fill new index values with ffill'
s.reindex(range(10), method='ffill')

print 'Fill new index values with bfill'
s.reindex(range(10), method='bfill')

# drop returns a new Series without the given labels; s itself is unchanged (reindex above also returned new objects)
s.drop(2)
s.drop([0,1,2])


b    0
a    1
d    2
e    3
c    4
dtype: int64

-----
New index value will have NaN


a    1.0
b    0.0
c    4.0
x    NaN
dtype: float64

New index values will contain 99


a     1
b     0
c     4
x    99
y    99
z    99
dtype: int64

-----


0      a
1      b
2      c
3    NaN
4    NaN
5    NaN
6    NaN
7    NaN
8    NaN
9    NaN
dtype: object

Fill new index values with ffill


0    a
1    b
2    c
3    c
4    c
5    c
6    c
7    c
8    c
9    c
dtype: object

Fill new index values with bfill


0      a
1      b
2      c
3    NaN
4    NaN
5    NaN
6    NaN
7    NaN
8    NaN
9    NaN
dtype: object

0    a
1    b
dtype: object

Series([], dtype: object)
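
Filling with `method='ffill'` depends on the ordering of the existing index rather than on its dtype. The sketch below, using the first Series from the cell above, shows the call failing on the unsorted string index and succeeding once the index is sorted. The names `unsorted_ok` and `filled` are illustrative.

```python
import pandas as pd

s = pd.Series(range(5), index=['b', 'a', 'd', 'e', 'c'])

# The index is not monotonic, so a fill method cannot be applied
try:
    s.reindex(['a', 'b', 'c', 'x'], method='ffill')
    unsorted_ok = True
except ValueError:
    unsorted_ok = False

# After sorting, 'x' is filled from the last label that precedes it ('e')
filled = s.sort_index().reindex(['a', 'b', 'c', 'x'], method='ffill')
```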

In [15]:
# Reindexing - DataFrame

df = DataFrame(np.arange(9).reshape(3,3), index=['Row1', 'Row2', 'Row3'], columns=['Col1', 'Col2', 'Col3'])
df
print '-----'

print 'Fill new row index values with ffill; new column values will not be filled'
df.reindex(['Row1', 'Row2', 'Row3', 'Row4'], columns=['Col1', 'Col2', 'Col3', 'Col4'], method='ffill')

print '-----'

df.drop('Row1')
df.drop('Col1', axis=1)


Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,3,4,5
Row3,6,7,8


-----
Fill new row index values with ffill; new column values will not be filled


Unnamed: 0,Col1,Col2,Col3,Col4
Row1,0,1,2,
Row2,3,4,5,
Row3,6,7,8,
Row4,6,7,8,


-----


Unnamed: 0,Col1,Col2,Col3
Row2,3,4,5
Row3,6,7,8


Unnamed: 0,Col2,Col3
Row1,1,2
Row2,4,5
Row3,7,8
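
To see the row and column effects separately, the sketch below reindexes each axis in its own call and then repeats the two `drop` calls. It assumes the same 3x3 DataFrame as above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=['Row1', 'Row2', 'Row3'],
                  columns=['Col1', 'Col2', 'Col3'])

# New row label: filled forward from Row3 because the row index is sorted
rows_added = df.reindex(['Row1', 'Row2', 'Row3', 'Row4'], method='ffill')

# New column label without a fill method: the new column holds NaN
cols_added = df.reindex(columns=['Col1', 'Col2', 'Col3', 'Col4'])

# drop returns new objects; df keeps its original shape
without_row = df.drop('Row1')
without_col = df.drop('Col1', axis=1)
```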
