In [24]:
import sys
print(sys.version)
import numpy as np
print(np.__version__)
import pandas as pd
print(pd.__version__)

3.6.1 |Anaconda 4.4.0 (x86_64)| (default, May 11 2017, 13:04:09) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
1.12.1
0.20.1


## Why have an Index?

You may wonder why you would want an Index to go with your data.  It's true that you won't always need one.  But an index lets pandas match rows by label rather than by position, which keeps your data aligned when you combine it.

For an example, here are two Series containing grades.  Imagine that they came from different sources, or were preprocessed separately, so the rows are not in the same order.  In fact, they don't even have all the same students.

In [25]:
grades1 = pd.Series([88,78,92], index = ['Ben','Sue','Blake'])
grades2 = pd.Series([84,81,50], index = ['Sue', 'May', 'Blake'])
print(grades1, '\n', grades2)

Ben      88
Sue      78
Blake    92
dtype: int64 
 Sue      84
May      81
Blake    50
dtype: int64


If these were NumPy arrays and you added them together, the values would be combined by position, mixing up the grades of different people.  Pandas instead aligns the two Series on their index labels before adding.

In [26]:
avg_grades = (grades1 + grades2) /2
avg_grades

Ben       NaN
Blake    71.0
May       NaN
Sue      81.0
dtype: float64

You can check that Blake and Sue got the right average.  Notice that because Ben wasn't in the second Series, he gets a NaN, which is NumPy's Not a Number.  The same goes for May, who wasn't in the first Series.  Pandas uses NaN to denote missing values.
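To see what goes wrong without an index, here is a small sketch that adds the same grades by position, the way plain NumPy arrays would, and compares the result with the label-aligned version.  The names `positional` and `by_label` are only illustrative.

```python
import numpy as np
import pandas as pd

grades1 = pd.Series([88, 78, 92], index=['Ben', 'Sue', 'Blake'])
grades2 = pd.Series([84, 81, 50], index=['Sue', 'May', 'Blake'])

# .values drops the index, so NumPy pairs entries by position:
# Ben's 88 is averaged with Sue's 84, Sue's 78 with May's 81.
positional = (grades1.values + grades2.values) / 2
print(positional.tolist())   # [86.0, 79.5, 71.0]

# pandas pairs entries by label, so Sue's 78 meets Sue's 84.
by_label = (grades1 + grades2) / 2
print(by_label['Sue'])       # 81.0
```

Only Blake happens to sit in the same position in both Series, so his is the only positional average that is correct.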

The same alignment happens if we place these Series into a DataFrame.

In [27]:
gradebook = pd.DataFrame({'midterm': grades1,'final': grades2})
gradebook

       final  midterm
Ben      NaN     88.0
Blake   50.0     92.0
May     81.0      NaN
Sue     84.0     78.0


The Index for the DataFrame is formed from the union of the two Indexes we're passing in.  Since Ben doesn't appear in the Series grades2, he gets a NaN in the `final` column.  The same is true for May and grades1: she gets a NaN in the `midterm` column.

In [28]:
gradebook.index

Index(['Ben', 'Blake', 'May', 'Sue'], dtype='object')
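The union can also be computed directly on the Index objects.  The sketch below uses `Index.union()` and `Index.intersection()`; the variable names `all_students` and `in_both` are made up for illustration.

```python
import pandas as pd

grades1 = pd.Series([88, 78, 92], index=['Ben', 'Sue', 'Blake'])
grades2 = pd.Series([84, 81, 50], index=['Sue', 'May', 'Blake'])

# Every label that appears in either Series.
all_students = grades1.index.union(grades2.index)
print(all_students)

# Labels that appear in both Series, i.e. the rows with no missing grade.
in_both = grades1.index.intersection(grades2.index)
print(in_both)

# The DataFrame's index holds the same labels as the union.
gradebook = pd.DataFrame({'midterm': grades1, 'final': grades2})
print(set(gradebook.index) == set(all_students))
```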

You can check if items are in an Index with `in`.

In [29]:
'Ben' in gradebook.index

True
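Be aware of what `in` tests when you apply it to the container rather than to its Index.  The following sketch shows that on a Series it checks the labels (not the values), and on a DataFrame it checks the column names.

```python
import pandas as pd

grades1 = pd.Series([88, 78, 92], index=['Ben', 'Sue', 'Blake'])
grades2 = pd.Series([84, 81, 50], index=['Sue', 'May', 'Blake'])
gradebook = pd.DataFrame({'midterm': grades1, 'final': grades2})

print('Ben' in grades1)         # True: 'Ben' is an index label
print(88 in grades1)            # False: 88 is a value, not a label
print(88 in grades1.values)     # True: search the underlying array instead

print('Ben' in gradebook)       # False: a DataFrame checks its column names
print('final' in gradebook)     # True
print('Ben' in gradebook.index) # True
```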

If you decide that these student names should actually be a regular column instead of an index, you can move them into the DataFrame with `reset_index()`.

In [30]:
gradebook.reset_index()

   index  final  midterm
0    Ben    NaN     88.0
1  Blake   50.0     92.0
2    May   81.0      NaN
3    Sue   84.0     78.0
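The new column is called `index` because the Index had no name.  As a sketch of one way to get a friendlier column name, you can name the Index first with `rename_axis()`; the name `student` is an arbitrary choice for this example.  `set_index()` performs the reverse move.

```python
import pandas as pd

grades1 = pd.Series([88, 78, 92], index=['Ben', 'Sue', 'Blake'])
grades2 = pd.Series([84, 81, 50], index=['Sue', 'May', 'Blake'])
gradebook = pd.DataFrame({'midterm': grades1, 'final': grades2})

# Name the index, then move it into a column of that name.
named = gradebook.rename_axis('student').reset_index()
print(named)

# reset_index() returns a new DataFrame; gradebook itself is unchanged.
print(gradebook.index.tolist())

# set_index() turns the column back into the index.
restored = named.set_index('student')
print(restored.index.tolist())
```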


## Checking Missing Values

Quick aside: how do you check if a Series has missing values?  You might think you can check equality with `np.nan`, but this doesn't work, because NaN never compares equal to anything, not even to itself.

In [31]:
avg_grades == np.nan

Ben      False
Blake    False
May      False
Sue      False
dtype: bool

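Use a dedicated function such as `np.isnan()` instead.  It works here because the Series holds floats; pandas also provides the `isnull()` method, which handles any dtype.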

In [32]:
np.isnan(avg_grades)

Ben       True
Blake    False
May       True
Sue      False
dtype: bool
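Here is a brief sketch of the pandas-native checks, using the `isnull()` and `dropna()` methods.  The `names` Series is a made-up example to show that `isnull()` also works on non-numeric data.

```python
import numpy as np
import pandas as pd

grades1 = pd.Series([88, 78, 92], index=['Ben', 'Sue', 'Blake'])
grades2 = pd.Series([84, 81, 50], index=['Sue', 'May', 'Blake'])
avg_grades = (grades1 + grades2) / 2

# NaN is not equal to itself, which is why == np.nan is always False.
print(np.nan == np.nan)                 # False

# isnull() marks the missing entries; summing the mask counts them.
missing = avg_grades.isnull()
print(int(missing.sum()))               # 2

# dropna() keeps only the rows that have a value.
print(avg_grades.dropna())

# isnull() also handles object data, where None marks a missing value.
names = pd.Series(['Ben', None, 'Sue'])
print(names.isnull().tolist())          # [False, True, False]
```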

## Duplicate Labels

Finally, it is possible to have duplicate values in an index.  Let's make another Series to show what happens.

In [33]:
grades3 = pd.Series([92,93,89,95], index = ['Ben','Ben','Sue','May'])
grades3

Ben    92
Ben    93
Sue    89
May    95
dtype: int64

If we select values for Ben, we'll get all the rows that correspond.  This could surprise you and cause errors if you're not expecting it.

In [34]:
grades3['Ben']

Ben    92
Ben    93
dtype: int64
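One source of surprise is that the return type depends on the label: a unique label gives a single value, while a repeated label gives a Series.  The sketch below shows this, along with `Index.duplicated()` as one way to pull out the rows whose labels repeat.  The name `dup_mask` is illustrative.

```python
import pandas as pd

grades3 = pd.Series([92, 93, 89, 95], index=['Ben', 'Ben', 'Sue', 'May'])

# A unique label returns a scalar; a repeated label returns a Series.
print(isinstance(grades3['Sue'], pd.Series))   # False
print(isinstance(grades3['Ben'], pd.Series))   # True

# Selecting with a list of labels always returns a Series.
print(grades3.loc[['Sue']])

# keep=False flags every occurrence of a repeated label.
dup_mask = grades3.index.duplicated(keep=False)
print(grades3[dup_mask])
```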

Let's see what happens if we do some arithmetic with this Series.

In [35]:
(grades1 + grades3) / 2

Ben      90.0
Ben      90.5
Blake     NaN
May       NaN
Sue      83.5
dtype: float64

We get two rows for Ben, one corresponding to each entry in `grades3`.  In fact, this operation follows the rules for an outer join.  Every row with Ben in the first Series is combined with every row for Ben in the second Series.  Moreover, since Blake appears only in the first Series and May only in the second, each of them gets a row in the output, but the value is NaN.
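To make the outer-join comparison concrete, here is a sketch that performs the join explicitly with `DataFrame.join(how='outer')` and then averages the two columns.  The column names `grades1` and `grades3` are chosen for this example only.

```python
import pandas as pd

grades1 = pd.Series([88, 78, 92], index=['Ben', 'Sue', 'Blake'])
grades3 = pd.Series([92, 93, 89, 95], index=['Ben', 'Ben', 'Sue', 'May'])

# Outer join on the index: Ben's single row in grades1 is paired with
# both of his rows in grades3; unmatched labels get NaN on one side.
joined = grades1.to_frame('grades1').join(grades3.to_frame('grades3'), how='outer')
print(joined)

# Averaging the joined columns gives the same values as the
# Series arithmetic (grades1 + grades3) / 2.
avg = (joined['grades1'] + joined['grades3']) / 2
print(avg)
```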

Sometimes this functionality is useful, and other times it might indicate that you're equating rows by mistake.  You might wonder if we actually have two different Bens in the class.

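You can check if an Index has duplicate labels with its `is_unique` property.  Call it on the Index itself: the Series property of the same name tests the values, not the labels.

In [40]:
((grades1 + grades3) / 2).index.is_unique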

False
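Note that both a Series and its Index have an `is_unique` property, and they answer different questions.  A short sketch of the difference, plus one way to list the repeated labels with `Index.value_counts()` (the name `counts` is illustrative):

```python
import pandas as pd

grades3 = pd.Series([92, 93, 89, 95], index=['Ben', 'Ben', 'Sue', 'May'])

# Series.is_unique looks at the values: 92, 93, 89, 95 are all distinct.
print(grades3.is_unique)          # True

# Index.is_unique looks at the labels: 'Ben' appears twice.
print(grades3.index.is_unique)    # False

# Count how often each label occurs and keep the repeated ones.
counts = grades3.index.value_counts()
print(counts[counts > 1])
```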