<h1 id="tocheading">Table of Contents and Notebook Setup</h1>
<div id="toc"></div>

In [1]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

These notes include the fundamental mechanics of interacting with data in <b>Series</b> and <b>DataFrame</b> objects.

# Reindexing (Crucial for Time Series Data with Mismatched Indexes)

## Reindexing in Series

Reindexing involves creating a new object with the data conformed to a new index. Consider the following.

In [2]:
import pandas as pd
obj = pd.Series([4.3, 5.6, 7.4, 8.7], index=['d', 'c', 'a', 'b'])
obj

d    4.3
c    5.6
a    7.4
b    8.7
dtype: float64

In [3]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a    7.4
b    8.7
c    5.6
d    4.3
e    NaN
dtype: float64

This is <i>extremely</i> important for time series data, where we often need to fill in the blanks (perhaps one variable is measured less frequently than another).

In [4]:
obj3 = pd.Series(['blue', 'red', 'yellow'], index = [0, 2, 4])
obj3

0      blue
2       red
4    yellow
dtype: object

In [5]:
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2       red
3       red
4    yellow
5    yellow
dtype: object

The method 'ffill' forward-fills: each new index entry takes the most recent preceding value instead of being left as NaN. See CERN Research 2018 for concatenating time series DataFrames together.
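
A minimal sketch of the time series case, assuming a hypothetical sensor that reports every second day (the dates, the values, and the names `readings` and `daily_index` are made up for illustration). Reindexing onto a daily index leaves the gaps as NaN unless a fill method is given:

```python
import pandas as pd

# Hypothetical readings taken every two days (illustrative values only).
readings = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.date_range('2018-01-01', periods=3, freq='2D'),
)

# Target index: every day from the first reading to the last.
daily_index = pd.date_range('2018-01-01', '2018-01-05', freq='D')

daily_nan = readings.reindex(daily_index)                    # gaps become NaN
daily_ffill = readings.reindex(daily_index, method='ffill')  # gaps carry the last reading forward
print(daily_ffill)
```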

## Reindexing in DataFrames

In a DataFrame we can reindex either the rows or the columns. When passed only a sequence, reindex conforms the <b>rows</b> by default.

In [6]:
import numpy as np
frame = pd.DataFrame(np.arange(9).reshape((3,3)), 
                     index=['a', 'b', 'c'],
                     columns=['Victoria', 'Geneva', 'St Genis'])
frame

Unnamed: 0,Victoria,Geneva,St Genis
a,0,1,2
b,3,4,5
c,6,7,8


In [7]:
frame.reindex(['a', 'r', 'b', 'q', 'c'])

Unnamed: 0,Victoria,Geneva,St Genis
a,0.0,1.0,2.0
r,,,
b,3.0,4.0,5.0
q,,,
c,6.0,7.0,8.0


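We can also use the ffill method here. Forward filling follows the sorted order of the existing labels, so the new rows 'q' and 'r' both take the values of row 'c'.

In [8]:
frame.reindex(['a', 'r', 'b', 'q', 'c'], method='ffill') # forward fill follows the sorted order of the labels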

Unnamed: 0,Victoria,Geneva,St Genis
a,0,1,2
r,6,7,8
b,3,4,5
q,6,7,8
c,6,7,8


To reindex the <b> columns </b> we use the columns keyword argument.

In [9]:
frame.reindex(columns=['Victoria', 'Vancouver', 'Geneva', 'Paris', 'St Genis'])

Unnamed: 0,Victoria,Vancouver,Geneva,Paris,St Genis
a,0,,1,,2
b,3,,4,,5
c,6,,7,,8
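
As a small sketch, rows and columns can also be reindexed in a single call, and `fill_value` can replace the default NaN. The labels 'd' and 'Paris' below are placeholders chosen for illustration:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'b', 'c'],
                     columns=['Victoria', 'Geneva', 'St Genis'])

# Reindex both axes at once; labels missing from the original get 0 instead of NaN.
both = frame.reindex(index=['a', 'b', 'd'],
                     columns=['Victoria', 'Paris'],
                     fill_value=0)
print(both)
```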


# Dropping Entries from an Axis

The drop method of a Series returns a new object with the indicated index label and its value removed.

In [10]:
obj = pd.Series(np.arange(3), index=['a', 'b', 'c'])
obj2 = obj.drop('b')
obj2

a    0
c    2
dtype: int64

## Removing Rows/Columns from DataFrame

In a dataframe, index values can be deleted from either axis. Consider the following.

In [11]:
data = pd.DataFrame(np.arange(25).reshape((5,5)), 
                   index = ['Victoria', 'Paris', 'St. Genis', 'Geneva', 'Vancouver'],
                   columns = ['one', 'two', 'three', 'four', 'five'])
data

Unnamed: 0,one,two,three,four,five
Victoria,0,1,2,3,4
Paris,5,6,7,8,9
St. Genis,10,11,12,13,14
Geneva,15,16,17,18,19
Vancouver,20,21,22,23,24


In [12]:
data.drop(['Victoria', 'St. Genis']) # Note: does not modify the original DataFrame; it returns a new one

Unnamed: 0,one,two,three,four,five
Paris,5,6,7,8,9
Geneva,15,16,17,18,19
Vancouver,20,21,22,23,24


In [13]:
data.drop('two', axis='columns')

Unnamed: 0,one,three,four,five
Victoria,0,2,3,4
Paris,5,7,8,9
St. Genis,10,12,13,14
Geneva,15,17,18,19
Vancouver,20,22,23,24


## Modifying the Original DataFrame

Notice that the drop method doesn't actually modify the original DataFrame - it just returns a new one. We can modify the original DataFrame by using the <b> inplace </b> argument.

In [14]:
data.drop('Victoria', inplace=True)
data

Unnamed: 0,one,two,three,four,five
Paris,5,6,7,8,9
St. Genis,10,11,12,13,14
Geneva,15,16,17,18,19
Vancouver,20,21,22,23,24
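
An alternative to `inplace=True` is to bind the returned object to a name, which leaves the original available. The sketch below uses a smaller illustrative frame and also shows the `errors='ignore'` option, which skips labels that are not present ('Lyon' is deliberately absent here):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(9).reshape((3, 3)),
                    index=['Victoria', 'Paris', 'Geneva'],
                    columns=['one', 'two', 'three'])

# drop returns a new DataFrame; the original is left untouched.
trimmed = data.drop('Victoria')

# errors='ignore' drops the labels that exist and silently skips the rest.
safe = data.drop(['Paris', 'Lyon'], errors='ignore')

print(trimmed.index.tolist(), safe.index.tolist())
```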


# Indexing, Selection, and Filtering

## Indexing in Series

Since the columns of a DataFrame are Series, we start with indexing in Series.

In [15]:
obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
obj

a    0
b    1
c    2
d    3
dtype: int64

We can index either by integer position or by the index labels themselves.

In [16]:
print(obj[0])
print(obj['c'])

0
2


Slicing works exactly the same way and returns a sub-series.

In [17]:
obj[2:4]

c    2
d    3
dtype: int64

In [18]:
obj['b':'d'] # end points inclusive with index slicing

b    1
c    2
d    3
dtype: int64

We can also pass in a list of labels to get a sub-series in that order.

In [19]:
obj[['b', 'd', 'c']]

b    1
d    3
c    2
dtype: int64

Arguably the most important indexing tool in pandas is logical (boolean) indexing:

In [20]:
obj[obj>=2]

c    2
d    3
dtype: int64
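
Conditions can be combined, as in this short sketch. Note that pandas uses `&`, `|`, and `~` rather than `and`, `or`, and `not`, and that each comparison needs its own parentheses:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

middle = obj[(obj >= 1) & (obj < 3)]   # both conditions hold
edges = obj[(obj < 1) | (obj > 2)]     # either condition holds
small = obj[~(obj >= 2)]               # negation of a condition

print(middle.index.tolist(), edges.index.tolist(), small.index.tolist())
```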

## Indexing in DataFrames

In [21]:
data = pd.DataFrame(np.arange(16).reshape(4,4),
                   index = ['Victoria', 'Vancouver', 'Lyon', 'Geneva'],
                   columns = ['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Victoria,0,1,2,3
Vancouver,4,5,6,7
Lyon,8,9,10,11
Geneva,12,13,14,15


In [22]:
print(type(data['two'])) #returns a series
data['two'] 

<class 'pandas.core.series.Series'>


Victoria      1
Vancouver     5
Lyon          9
Geneva       13
Name: two, dtype: int64

In [23]:
print(type(data[['two', 'four']]))
data[['two','four']]

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,two,four
Victoria,1,3
Vancouver,5,7
Lyon,9,11
Geneva,13,15


The following syntax can be used to select a subset of rows.

In [24]:
data[:2]

Unnamed: 0,one,two,three,four
Victoria,0,1,2,3
Vancouver,4,5,6,7


In [25]:
data[data['three']>5]

Unnamed: 0,one,two,three,four
Vancouver,4,5,6,7
Lyon,8,9,10,11
Geneva,12,13,14,15


The row selection syntax data[:2] is provided merely as a convenience. Passing a single label or a list of labels to [] selects columns instead.

### Example: Deleting Bad Data

In [26]:
mu, sigma = 1, 0.1
data = pd.DataFrame(np.random.normal(mu, sigma, 16).reshape(4,4),
                   index = ['Victoria', 'Vancouver', 'Lyon', 'Geneva'],
                   columns = ['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Victoria,0.951828,0.952464,0.941566,1.044894
Vancouver,0.871968,1.072197,1.02579,1.103237
Lyon,0.986426,1.128422,0.988293,0.798681
Geneva,1.198642,1.054365,0.963429,1.045781


Remove data that strays too far from the mean (here, more than 1.5 standard deviations).

In [27]:
data[(data>1.15) | (data<0.85)] = None
data

Unnamed: 0,one,two,three,four
Victoria,0.951828,0.952464,0.941566,1.044894
Vancouver,0.871968,1.072197,1.02579,1.103237
Lyon,0.986426,1.128422,0.988293,
Geneva,,1.054365,0.963429,1.045781
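
The same masking can be written without modifying the original frame. This is only a sketch: the seed, the name `n_sigma`, and the use of `DataFrame.where` are choices made here for illustration, and the random values will differ from those shown above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the example is repeatable
mu, sigma = 1, 0.1
data = pd.DataFrame(rng.normal(mu, sigma, 16).reshape(4, 4),
                    index=['Victoria', 'Vancouver', 'Lyon', 'Geneva'],
                    columns=['one', 'two', 'three', 'four'])

n_sigma = 1.5  # illustrative threshold, matching the 0.85 to 1.15 band
within = (data - mu).abs() <= n_sigma * sigma

# where keeps values whose mask entry is True and sets the rest to NaN.
cleaned = data.where(within)
print(cleaned)
```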


Drop all rows that contain at least one missing value.

In [28]:
data.dropna()

Unnamed: 0,one,two,three,four
Victoria,0.951828,0.952464,0.941566,1.044894
Vancouver,0.871968,1.072197,1.02579,1.103237
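
dropna accepts a few options that control what gets dropped. A brief sketch on a small hand-made frame (the values are illustrative):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'one': [1.0, np.nan, 0.9],
                     'two': [1.1, 1.0, np.nan],
                     'three': [1.0, 1.0, 1.0]},
                    index=['Victoria', 'Lyon', 'Geneva'])

rows_any = data.dropna()                  # drop rows with any NaN
rows_all = data.dropna(how='all')         # drop rows only if every value is NaN
cols_any = data.dropna(axis='columns')    # drop columns with any NaN
rows_one = data.dropna(subset=['one'])    # only consider column 'one'

print(rows_any.index.tolist(), cols_any.columns.tolist())
```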


## Selection with loc and iloc

### Selection on Rows

The loc operator selects by label and the iloc operator selects by integer position. Using these operators, we can select a subset of rows and columns at the same time.

In [29]:
mu, sigma = 1, 0.1
data = pd.DataFrame(np.random.normal(mu, sigma, 16).reshape(4,4),
                   index = ['Victoria', 'Vancouver', 'Lyon', 'Geneva'],
                   columns = ['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Victoria,1.154539,1.113937,0.988796,0.876886
Vancouver,1.013969,1.18117,1.302196,0.997266
Lyon,0.723571,0.899098,1.005422,0.967721
Geneva,1.016107,1.002229,0.866705,1.197267


In [30]:
data.loc['Geneva', ['one', 'four']]

one     1.016107
four    1.197267
Name: Geneva, dtype: float64

In [31]:
data.iloc[2, [3, 0, 1]]

four    0.967721
one     0.723571
two     0.899098
Name: Lyon, dtype: float64

Note that loc uses labels while iloc uses integer positions. We can also select entire rows.

In [32]:
data.iloc[2]

one      0.723571
two      0.899098
three    1.005422
four     0.967721
Name: Lyon, dtype: float64

### Selecting subsets of DataFrames

In [33]:
mu, sigma = 1, 0.1
data = pd.DataFrame(np.random.normal(mu, sigma, 16).reshape(4,4),
                   index = ['Victoria', 'Vancouver', 'Lyon', 'Geneva'],
                   columns = ['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Victoria,0.945155,1.073735,1.110402,1.076347
Vancouver,1.103414,0.990882,0.999848,1.157775
Lyon,0.910648,0.91525,0.834112,1.038216
Geneva,0.776056,0.924459,1.143676,1.084434


In [34]:
data.iloc[[1,2], [0,1,3]] #[row, column] notation

Unnamed: 0,one,two,four
Vancouver,1.103414,0.990882,1.157775
Lyon,0.910648,0.91525,1.038216


We can slice using iloc or loc with both rows and columns:

In [35]:
data.loc['Victoria':'Lyon', 'two':'four']

Unnamed: 0,two,three,four
Victoria,1.073735,1.110402,1.076347
Vancouver,0.990882,0.999848,1.157775
Lyon,0.91525,0.834112,1.038216


In [36]:
data.iloc[1:3, 2:4]

Unnamed: 0,three,four
Vancouver,0.999848,1.157775
Lyon,0.834112,1.038216


Since slicing with data.iloc or data.loc returns a DataFrame, we can chain further operations onto the result.

In [37]:
data.iloc[:, :3][data.two>0.95]

Unnamed: 0,one,two,three
Victoria,0.945155,1.073735,1.110402
Vancouver,1.103414,0.990882,0.999848
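
A chained selection like this can usually be written as a single loc call that takes the row mask and the column slice together. The sketch below uses deterministic integer data rather than random values so that the two forms can be compared directly:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Victoria', 'Vancouver', 'Lyon', 'Geneva'],
                    columns=['one', 'two', 'three', 'four'])

mask = data['two'] > 4

chained = data.iloc[:, :3][mask]           # slice columns, then filter rows
single = data.loc[mask, 'one':'three']     # one call: row mask plus label slice

print(single)
```

The single-call form is also the safer choice when assigning values, since assigning through a chained selection may act on a temporary copy.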


## Integer Indexes

When you create a Series without specifying an index, it gets a default integer index.

In [38]:
ser = pd.Series(['a','b','c'])
ser

0    a
1    b
2    c
dtype: object

Indexing with something like ser[-1] is ambiguous because pandas cannot tell whether the integer is meant as a label or as a position. Thus <b>if the index axis contains integers, data selection will always be label-oriented</b>, and the following generates an error:

In [39]:
try:
    ser[-1]
except KeyError:  # -1 is looked up as a label, and no such label exists
    print('Error Generated')

Error Generated


But if we don't use integer indices then this is fine.

In [40]:
ser = pd.Series(['a','b','c'], index=['one', 'two', 'three'])
ser[-1]

'c'

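For more precise handling, it's typically better to use loc for labels and iloc for integer positions. Recent pandas versions also deprecate positional access through plain [] on a labelled Series, which is one more reason to prefer iloc.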

In [41]:
print(ser.iloc[1]) #Better programming style
print(ser[1])

b
b
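
The difference matters most when the labels are themselves integers. A small sketch with a deliberately shuffled integer index (chosen here for illustration):

```python
import pandas as pd

# Integer labels that do not coincide with the positions.
ser = pd.Series(['a', 'b', 'c'], index=[2, 0, 1])

by_label = ser.loc[0]       # the element whose label is 0
by_position = ser.iloc[0]   # the first element, whatever its label
last = ser.iloc[-1]         # negative positions work with iloc

print(by_label, by_position, last)
```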


# Arithmetic and Data Alignment

Sometimes you may want to add several DataFrames together. The problem is that their indexes might not line up: one DataFrame may contain labels that another lacks. In this case, operations such as addition yield NaN wherever a label is missing from either operand.

In [42]:
df1 = pd.DataFrame(np.arange(16).reshape(4,4),
                   index = ['Victoria', 'Vancouver', 'Lyon', 'Geneva'],
                   columns = ['two', 'three', 'four', 'five'])
df1

Unnamed: 0,two,three,four,five
Victoria,0,1,2,3
Vancouver,4,5,6,7
Lyon,8,9,10,11
Geneva,12,13,14,15


In [43]:
df2 = pd.DataFrame(np.arange(25).reshape(5,5),
                   index = ['Victoria', 'Vancouver', 'Lyon', 'Geneva', 'Paris'],
                   columns = ['one', 'two', 'three', 'four', 'five'])
df2

Unnamed: 0,one,two,three,four,five
Victoria,0,1,2,3,4
Vancouver,5,6,7,8,9
Lyon,10,11,12,13,14
Geneva,15,16,17,18,19
Paris,20,21,22,23,24


In [44]:
df1+df2

Unnamed: 0,five,four,one,three,two
Geneva,34.0,32.0,,30.0,28.0
Lyon,25.0,23.0,,21.0,19.0
Paris,,,,,
Vancouver,16.0,14.0,,12.0,10.0
Victoria,7.0,5.0,,3.0,1.0


The rows and columns that the DataFrames do not have in common are filled with nulls. Sometimes, however, you might want to fill these slots with a special value, such as 0.

In [45]:
df1.add(df2, fill_value=0)

Unnamed: 0,five,four,one,three,two
Geneva,34.0,32.0,15.0,30.0,28.0
Lyon,25.0,23.0,10.0,21.0,19.0
Paris,24.0,23.0,20.0,22.0,21.0
Vancouver,16.0,14.0,5.0,12.0,10.0
Victoria,7.0,5.0,0.0,3.0,1.0


Whenever a value is missing from one of the two DataFrames, it is treated as zero before the addition. There are a number of arithmetic operations on DataFrames on page 149 of the textbook.
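
One subtlety is worth a sketch: `fill_value` only substitutes for a value missing from one operand. If a cell is missing from both DataFrames, the result there is still NaN. The two small frames below are made up to show each case:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], index=['x', 'y'], columns=['a', 'b'])
df2 = pd.DataFrame([[10, 20], [30, 40]], index=['y', 'z'], columns=['b', 'c'])

total = df1.add(df2, fill_value=0)

# ('y', 'b') exists in both frames: 4 + 10
# ('z', 'c') exists only in df2:    0 + 40
# ('x', 'c') exists in neither:     stays NaN
print(total)
```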

## Operations between DataFrames and Series

Suppose we have a series that corresponds to one of the rows of a DataFrame:

In [46]:
df = pd.DataFrame(np.arange(16).reshape(4,4),
                   index = ['Victoria', 'Vancouver', 'Lyon', 'Geneva'],
                   columns = ['two', 'three', 'four', 'five'])
df

Unnamed: 0,two,three,four,five
Victoria,0,1,2,3
Vancouver,4,5,6,7
Lyon,8,9,10,11
Geneva,12,13,14,15


In [47]:
ser = df.iloc[0]
ser

two      0
three    1
four     2
five     3
Name: Victoria, dtype: int64

The operation is broadcast down all the rows of the DataFrame:

In [48]:
df-ser

Unnamed: 0,two,three,four,five
Victoria,0,0,0,0
Vancouver,4,4,4,4
Lyon,8,8,8,8
Geneva,12,12,12,12


If a label appears in only one of the Series index and the DataFrame columns, the result is reindexed to the union of the two, with NaN where a label is missing from either.

In [49]:
ser2 = pd.Series(range(3), index=['three', 'five', 'six'])
ser2

three    0
five     1
six      2
dtype: int64

In [50]:
df+ser2

Unnamed: 0,five,four,six,three,two
Victoria,4.0,,,1.0,
Vancouver,8.0,,,5.0,
Lyon,12.0,,,9.0,
Geneva,16.0,,,13.0,


Note that by default the Series index is matched against the DataFrame's columns and the operation is broadcast down the rows. If we instead want to match on the row index and broadcast across the columns, we must specify the axis.

In [51]:
ser3 = pd.Series(range(3), index=['Victoria', 'Vancouver', 'Geneva'])
ser3

Victoria     0
Vancouver    1
Geneva       2
dtype: int64

In [52]:
df.add(ser3, axis='index') #vertical axis

Unnamed: 0,two,three,four,five
Geneva,14.0,15.0,16.0,17.0
Lyon,,,,
Vancouver,5.0,6.0,7.0,8.0
Victoria,0.0,1.0,2.0,3.0
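
A common use of matching on the row index is centring each row by its own mean. This is a sketch of that idea (the name `centered` is chosen here for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['Victoria', 'Vancouver', 'Lyon', 'Geneva'],
                  columns=['two', 'three', 'four', 'five'])

row_means = df.mean(axis='columns')        # one mean per row, indexed by city

# Match the Series on the row index and broadcast across the columns.
centered = df.sub(row_means, axis='index')
print(centered)
```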


# Function Applications and Mappings

NumPy ufuncs (element-wise array methods) work with pandas objects.

In [53]:
mu, sigma = 0, 0.1
data = pd.DataFrame(np.random.normal(mu, sigma, 16).reshape(4,4),
                   index = ['Victoria', 'Vancouver', 'Lyon', 'Geneva'],
                   columns = ['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Victoria,-0.064386,0.036402,0.193661,-0.069118
Vancouver,0.006797,-0.086853,0.130316,0.046497
Lyon,0.092051,-0.114491,-0.03165,-0.060582
Geneva,-0.146532,-0.038371,0.147527,0.002364


In [54]:
data = np.abs(data)
data

Unnamed: 0,one,two,three,four
Victoria,0.064386,0.036402,0.193661,0.069118
Vancouver,0.006797,0.086853,0.130316,0.046497
Lyon,0.092051,0.114491,0.03165,0.060582
Geneva,0.146532,0.038371,0.147527,0.002364


## Operating on DataFrame columns and rows 

We can operate on the rows or columns of DataFrames using functions.

In [55]:
f = lambda x: x.max()-x.min()
data.apply(f, axis='columns')

Victoria     0.157260
Vancouver    0.123520
Lyon         0.082841
Geneva       0.145163
dtype: float64

In [56]:
data.apply(f, axis='index')

one      0.139735
two      0.078090
three    0.162011
four     0.066755
dtype: float64

Notice that in both cases a Series is returned, because the function returns a single value per row or column. This need not be the case; if the function itself returns a Series, apply assembles the results into a DataFrame.

In [57]:
def f(x):
    return pd.Series([x.min(), x.max()], index = ['min','max'])

data.apply(f, axis='columns')

Unnamed: 0,min,max
Victoria,0.036402,0.193661
Vancouver,0.006797,0.130316
Lyon,0.03165,0.114491
Geneva,0.002364,0.147527


We can also use element-wise functions. apply passes each column (or row) to the function as a Series, so a reduction such as x.min() or x.max() yields one value per column, while an element-wise expression such as $x^2$ yields a Series of the same length and therefore a DataFrame of the same shape.

In [58]:
data.apply(lambda x: x**2)

Unnamed: 0,one,two,three,four
Victoria,0.004146,0.001325,0.037505,0.004777
Vancouver,4.6e-05,0.007543,0.016982,0.002162
Lyon,0.008473,0.013108,0.001002,0.00367
Geneva,0.021472,0.001472,0.021764,6e-06
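
To make the mechanism concrete, this sketch records the type of each argument that apply passes in, using a small deterministic frame and hypothetical helper names (`square`, `seen_types`):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['one', 'two'])

seen_types = []

def square(col):
    # Each call receives a whole column as a Series, not a single element.
    seen_types.append(type(col).__name__)
    return col ** 2

squared = data.apply(square)                           # Series in, Series out
spread = data.apply(lambda col: col.max() - col.min()) # Series in, scalar out

print(set(seen_types))
print(spread)
```

For a simple power, writing `data ** 2` directly gives the same result without going through apply.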


# Sorting and Ranking

## Sorting by Column or Index Labels

We can sort DataFrames by either column or index labels. If the data is a time series and the index holds timestamps, this can come in handy.

In [59]:
df = pd.DataFrame(np.arange(16).reshape((4,4)),
                 index=['three','one', 'two', 'four'],
                 columns=['d','b','a','c'])
df

Unnamed: 0,d,b,a,c
three,0,1,2,3
one,4,5,6,7
two,8,9,10,11
four,12,13,14,15


In [60]:
df.sort_index()

Unnamed: 0,d,b,a,c
four,12,13,14,15
one,4,5,6,7
three,0,1,2,3
two,8,9,10,11


In [61]:
df.sort_index(axis='columns')

Unnamed: 0,a,b,c,d
three,2,1,3,0
one,6,5,7,4
two,10,9,11,8
four,14,13,15,12


It is sorted in ascending order by default but can also be put in descending order.

In [62]:
df.sort_index(axis='columns', ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,1,2
one,4,7,5,6
two,8,11,9,10
four,12,15,13,14


## Sorting by Column Values

We can also sort a DataFrame by the values in one of its columns or rows.

In [63]:
df = pd.DataFrame({'b': [4,8,1,2], 'a': [7,3,9,1]})
df

Unnamed: 0,a,b
0,7,4
1,3,8
2,9,1
3,1,2


In [64]:
df.sort_values(by='a')

Unnamed: 0,a,b
3,1,2
1,3,8
0,7,4
2,9,1
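
Sorting by several columns works the same way: pass a list to `by`, and optionally a matching list to `ascending`. A short sketch with made-up values that contain ties in column 'a':

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 1, 0], 'b': [4, 8, 1, 2]})

# Sort by 'a' ascending; within equal 'a', sort by 'b' descending.
ordered = df.sort_values(by=['a', 'b'], ascending=[True, False])
print(ordered)
```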


## Ranking

Ranking assigns ranks (sort order) from 1 up to the number of valid data points in the array.

In [65]:
ser = pd.Series([4,7,5,8,5,2,6,7])
ser.rank()

0    2.0
1    6.5
2    3.5
3    8.0
4    3.5
5    1.0
6    5.0
7    6.5
dtype: float64

Note the .5 values: when several entries in the Series are tied, each receives the average of the ranks they span. We can instead break ties by giving the lower rank to the entry that comes first.

In [66]:
ser.rank(method='first')

0    2.0
1    6.0
2    3.0
3    8.0
4    4.0
5    1.0
6    5.0
7    7.0
dtype: float64

Here the first time 5 occurs, the rank is 3.0 and the second time 5 occurs, the rank is 4.0. This differs from the default method, which assigned 3.5 to both.
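
A few other tie-breaking options exist. The sketch below applies `method='min'`, `method='dense'`, and `ascending=False` to the same Series:

```python
import pandas as pd

ser = pd.Series([4, 7, 5, 8, 5, 2, 6, 7])

lowest = ser.rank(method='min')        # tied entries all get the lowest rank of the group
dense = ser.rank(method='dense')       # like 'min', but ranks increase by 1 between groups
descending = ser.rank(ascending=False) # largest value gets rank 1; ties are averaged

print(pd.DataFrame({'min': lowest, 'dense': dense, 'descending': descending}))
```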

DataFrames can also be ranked: we can choose to do this using columns (horizontal axis) or index (vertical axis).

In [67]:
df = pd.DataFrame({'b': [4,8,1,2], 'a': [7,3,9,1], 'c': [1,8,3,8], 'd': [9,8,0,5]})
df

Unnamed: 0,a,b,c,d
0,7,4,1,9
1,3,8,8,8
2,9,1,3,0
3,1,2,8,5


In [68]:
df.rank(axis='index')

Unnamed: 0,a,b,c,d
0,3.0,3.0,1.0,4.0
1,2.0,4.0,3.5,3.0
2,4.0,1.0,2.0,1.0
3,1.0,2.0,3.5,2.0


More ranking possibilities are listed on page 156 of the textbook.

# Axis Indexes with Duplicate Labels

Labels don't necessarily need to be unique, but it's a good idea to keep them unique. Regardless, we can still work with Series and DataFrames that have non-unique indexes.

In [69]:
obj = pd.Series(range(5), index=['a','a','b','b','c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [70]:
obj.index.is_unique

False

In [71]:
obj['a']

a    0
a    1
dtype: int64

Similar logic applies to DataFrames. Lesson? Keep indexes unique.
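
A sketch of the DataFrame case, on a made-up frame: selecting a duplicated label returns a DataFrame, selecting a unique label returns a Series, and `Index.duplicated` offers one way to keep only the first row per label.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2),
                  index=['a', 'a', 'b', 'b', 'c'],
                  columns=['x', 'y'])

dup_rows = df.loc['a']     # label appears twice: a DataFrame with two rows
single_row = df.loc['c']   # label appears once: a Series

# Keep only the first occurrence of each label.
deduped = df[~df.index.duplicated(keep='first')]
print(deduped)
```

The return type depending on the data is exactly what makes duplicate labels awkward in downstream code.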