# Chapter 5. Getting Started with pandas

pandas is designed for vectorized operations on tabular data.

## 5.1 Introduction to pandas data structures

The two most common data structures from pandas are *Series* and *DataFrame*.

### Series

A one-dimensional, array-like object containing a sequence of values and an associated array of data labels (called the *index*).

The simplest Series is formed from an array or list of values.

In [624]:
import pandas as pd
import numpy as np

obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

The index is shown to the left of each data point.
The values and the index can each be extracted separately.

In [625]:
obj.values

array([ 4,  7, -5,  3])

In [626]:
obj.index

RangeIndex(start=0, stop=4, step=1)

The index can be specified.

In [627]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [628]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

The index can be used to select single or multiple values.

In [629]:
obj2['a']

-5

In [630]:
obj2['d'] = 6

In [631]:
obj2[['c', 'a', 'd']]

c    3
a   -5
d    6
dtype: int64

Boolean indexing, scalar multiplication, and mathematical functions can also be used, just as with NumPy; the link between index labels and values is preserved.

In [632]:
obj2[obj2 > 0]

d    6
b    7
c    3
dtype: int64

In [633]:
obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

In [634]:
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

A Series can be thought of as a fixed-length, ordered dictionary that maps index labels to values.
It can often be used in much the same way as a dictionary, for example in membership tests with `in`.
A Series can also be created directly from a dictionary.

In [635]:
'b' in obj2

True

In [636]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3  = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

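By default, the resulting Series uses the dictionary keys as its index, in the order they appear in the dictionary (older versions of pandas sorted the keys instead).
This can be overridden by passing an explicit `index` when the Series is created.
Note that index labels without a matching key are assigned `NaN`, and only dictionary values whose keys appear in the index are retained.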

In [637]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [638]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [639]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

Series can be added with the `+` operator, which automatically aligns the values by index label.

In [640]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

Both the Series object itself and its index have a `name` attribute.

In [641]:
obj4.name = 'population'
obj4.index.name = 'state'
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

### DataFrame

A rectangular table of data with an *ordered* collection of columns.
It has both a row and column index.

A DataFrame can be constructed from a dictionary of equal-length lists or NumPy arrays.

In [642]:
data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}
frame = pd.DataFrame(data)
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [643]:
frame.head()

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


The column order can be declared during creation.

In [644]:
pd.DataFrame(data, columns=['year', 'state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


Passing a column name that is not a key in the dictionary creates a column filled with missing values.

In [645]:
frame2 = pd.DataFrame(
    data, 
    columns=['year', 'state', 'pop', 'debt'],
    index=['one', 'two', 'three', 'four', 'five', 'six']
)
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,


In [646]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [647]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [648]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

In [649]:
type(frame2.year)

pandas.core.series.Series

In [650]:
type(frame2)

pandas.core.frame.DataFrame

A DataFrame column can be retrieved as a Series using either dictionary-like notation or attribute access.
Rows can be selected by label with the `loc` indexer.

In [651]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [652]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

In [653]:
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

Columns can be modified by assignment.

In [654]:
frame2['debt'] = 16.5
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16.5
two,2001,Ohio,1.7,16.5
three,2002,Ohio,3.6,16.5
four,2001,Nevada,2.4,16.5
five,2002,Nevada,2.9,16.5
six,2003,Nevada,3.2,16.5


In [655]:
frame2['debt'] = np.arange(6.0)
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0.0
two,2001,Ohio,1.7,1.0
three,2002,Ohio,3.6,2.0
four,2001,Nevada,2.4,3.0
five,2002,Nevada,2.9,4.0
six,2003,Nevada,3.2,5.0


In [656]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
six,2003,Nevada,3.2,
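
The assignment above matches values by index label rather than by position. The following is a minimal sketch of that behavior; the names `df` and `partial` are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame, used only for illustration.
df = pd.DataFrame({"pop": [1.5, 1.7, 3.6]}, index=["one", "two", "three"])

# The Series is deliberately out of order and has no entry for "two".
partial = pd.Series([-1.7, -1.2], index=["three", "one"])

# Values are matched to rows by label; rows without a match become NaN.
df["debt"] = partial
print(df)
```

The order of the values in `partial` does not matter, because only the labels decide where each value lands.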


Columns can be easily created and deleted.

In [657]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False
six,2003,Nevada,3.2,,False


In [658]:
del frame2['eastern']

In [659]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

DataFrames can also be created from nested dictionaries.
The outer keys become the columns and the inner keys become the row index.

In [660]:
pop = {
    'Nevada': {2001: 2.4, 2002: 2.9},
    'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}
}
frame3 = pd.DataFrame(pop)
frame3

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


A DataFrame can be transposed.

In [661]:
frame3.T

Unnamed: 0,2001,2002,2000
Nevada,2.4,2.9,
Ohio,1.7,3.6,1.5


A DataFrame can be created from a dictionary of Series.

In [662]:
pdata = {
    'Ohio': frame3['Ohio'][:-1],
    'Nevada': frame3['Nevada'][:2]
}
pd.DataFrame(pdata)

Unnamed: 0,Ohio,Nevada
2001,1.7,2.4
2002,3.6,2.9


A DataFrame's `index` and `columns` each have their own `name` attribute.

In [663]:
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3

state,Nevada,Ohio
year,,
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


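The `values` attribute returns the data as a 2-D ndarray.
If the columns have different types, the dtype of the array is chosen to accommodate all of them (often `object`).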

In [664]:
frame3.values

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])

In [665]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)
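
The two outputs above differ in dtype because a single ndarray can hold only one dtype. A small sketch of that effect follows, using `to_numpy()` as the method form of `values`; the frames `numeric` and `mixed` are made up for this example.

```python
import numpy as np
import pandas as pd

# Integer and float columns can share a common numeric dtype.
numeric = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]})
numeric_arr = numeric.to_numpy()

# Adding a column of strings forces the array to fall back to object.
mixed = numeric.assign(state=["Ohio", "Nevada"])
mixed_arr = mixed.to_numpy()

print(numeric_arr.dtype, mixed_arr.dtype)
```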

### Index Objects

Index objects hold the axis labels and other metadata, such as the axis name.
They are immutable, so their contents cannot be modified in place.

In [666]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [667]:
index[1:]

Index(['b', 'c'], dtype='object')
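
Immutability can be checked directly. The sketch below (with hypothetical names `idx` and `ser`) tries to overwrite one label, which pandas rejects with a `TypeError`, and then shows that labels can still be changed by assigning a whole new index.

```python
import pandas as pd

idx = pd.Index(["a", "b", "c"])

# Item assignment on an Index is not supported and raises TypeError.
try:
    idx[1] = "z"
    mutated = True
except TypeError:
    mutated = False

# To change the labels, replace the whole index instead.
ser = pd.Series(range(3), index=idx)
ser.index = ["x", "y", "z"]

print(mutated, list(ser.index), list(idx))
```

Because the original `idx` is never modified, it remains safe to share between several objects.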

Index objects can be shared among data structures.

In [668]:
labels = pd.Index(np.arange(3))
labels

Int64Index([0, 1, 2], dtype='int64')

In [669]:
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [670]:
obj2.index is labels

True

In [671]:
frame3.columns

Index(['Nevada', 'Ohio'], dtype='object', name='state')

In [672]:
'Ohio' in frame3.columns

True

In [673]:
2003 in frame3.index

False

## 5.2 Essential functionality

This section discusses the fundamental interactions with Series and DataFrames.

### Reindexing

This creates a *new object* with the data conformed to a new index.

In [674]:
obj = pd.Series([4.3, 7.2, -5.3, 3.6], index = ['d', 'b', 'a', 'c'])
obj

d    4.3
b    7.2
a   -5.3
c    3.6
dtype: float64

In [675]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.3
e    NaN
dtype: float64

The `method` option describes how to fill values for labels that are introduced by reindexing.
Here is an example with `'ffill'`, which forward-fills each gap with the previous valid value.

In [676]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3

0      blue
2    purple
4    yellow
dtype: object

In [677]:
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
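
Forward-filling is only one way to handle the new labels. As a hedged sketch, the example below rebuilds the same Series under the hypothetical name `colors` and shows two alternatives that `reindex` also accepts: `method='bfill'` (backward-fill) and a constant `fill_value`.

```python
import pandas as pd

colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# Backward fill: each new label takes the next available value.
back = colors.reindex(range(5), method="bfill")

# A constant can be supplied instead of a fill method.
const = colors.reindex(range(6), fill_value="missing")

print(back)
print(const)
```

Note that the fill methods require the original index to be sorted (monotonic), as it is here.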

With a DataFrame, `reindex` can alter the row index, the columns, or both.

In [678]:
frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=['a', 'c', 'd'],
    columns=['Ohio', 'Texas', 'California']
)
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [679]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


Columns can be reindexed with the `columns` keyword.

In [680]:
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


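Reindexing could previously be done more succinctly by label-indexing with `loc`.
As the warning in the output below states, this raises a `KeyError` in newer versions of pandas when any label is missing, so `reindex` is the safer choice.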

In [681]:
frame.loc[['a', 'b', 'c', 'd'], states]

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)


Unnamed: 0,Texas,Utah,California
a,1.0,,2.0
b,,,
c,4.0,,5.0
d,7.0,,8.0
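
As the warning above suggests, `reindex` can produce the same table without relying on `loc` accepting missing labels. A minimal sketch, rebuilding `frame` and `states` so the block is self-contained:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "c", "d"],
    columns=["Ohio", "Texas", "California"],
)
states = ["Texas", "Utah", "California"]

# Rows and columns are reindexed in one call; labels that are absent
# ("b" and "Utah") are filled with NaN instead of raising an error.
result = frame.reindex(index=["a", "b", "c", "d"], columns=states)
print(result)
```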


### Dropping entries from an axis

The `drop` method returns a *new object* with the indicated values deleted from an axis.

In [682]:
obj = pd.Series(np.arange(5.0), index=['a', 'b', 'c', 'd', 'e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [683]:
new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [684]:
obj.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

With DataFrame, index values can be deleted from either axis.

In [685]:
data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three', 'four']
)
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [686]:
data.drop(['Colorado', 'Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [687]:
data.drop('two', axis=1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [688]:
data.drop(['two', 'four'], axis='columns')

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14


Passing `inplace=True` modifies the object in place instead of returning a new one.

In [689]:
obj.drop('c', inplace=True)
obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
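
For a DataFrame, `drop` also accepts `index` and `columns` keywords, which can be easier to read than an `axis` argument. The sketch below rebuilds the `data` frame from above and removes a row and two columns in a single call; the name `trimmed` is introduced only for this example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# Drop one row label and two column labels at once.
trimmed = data.drop(index=["Colorado"], columns=["two", "four"])

# Without inplace=True the original frame is left untouched.
print(trimmed)
print(data.shape)
```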

### Indexing, selection, and filtering

A Series can be indexed by position, by Boolean mask, or by index label.
Note that slicing with labels includes the endpoint, unlike positional slicing.

In [690]:
obj = pd.Series(np.arange(4.0), index=['a', 'b', 'c', 'd'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [691]:
obj['b']

1.0

In [692]:
obj[1]

1.0

In [693]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [694]:
obj[['b', 'a', 'd']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [695]:
obj[[1, 3]]

b    1.0
d    3.0
dtype: float64

In [696]:
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

In [697]:
obj['b':'c']

b    1.0
c    2.0
dtype: float64

In [698]:
obj['b':'c'] = 5
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64
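
Label-based slices include the end label, while positional slices exclude the end position. The sketch below makes the difference explicit by using the `loc` (label) and `iloc` (position) indexers on a fresh copy of the Series; `by_label` and `by_position` are names chosen for this example.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])

# Label slice: both "b" and "c" are returned.
by_label = obj.loc["b":"c"]

# Positional slice: only position 1 is returned, position 2 is excluded.
by_position = obj.iloc[1:2]

print(by_label)
print(by_position)
```

Using `loc` and `iloc` explicitly avoids any ambiguity about whether an integer is meant as a label or as a position.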

Indexing a DataFrame with a single label or a list of labels retrieves one or more columns, while slices and Boolean arrays select rows.

In [699]:
data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three', 'four']
)
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [700]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [701]:
data[['three', 'one']]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


In [702]:
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [703]:
data < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [704]:
data[data['three'] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [705]:
data[data < 5] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


`loc` and `iloc` are special indexing attributes for selecting rows and columns of a DataFrame: `loc` selects by axis label and `iloc` selects by integer position.

In [706]:
data.loc['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int64

In [707]:
data.iloc[2, [3, 0, 1]]

four    11
one      8
two      9
Name: Utah, dtype: int64

In [708]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

In [709]:
data.iloc[[1, 2], [3, 0, 1]]

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9


In [710]:
data.loc[:'Utah', 'two']

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64

In [711]:
data.iloc[:, :3][data.three > 5]

Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14
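
The chained selection above can also be written as a single `loc` call that takes a Boolean mask for the rows and a list of labels for the columns. A minimal sketch, using a freshly built copy of `data` (before any values were set to zero):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# Rows where column "three" exceeds 5, restricted to three named columns.
subset = data.loc[data["three"] > 5, ["one", "two", "three"]]

# The same rows and columns selected purely by position.
same_by_position = data.iloc[1:, :3]

print(subset)
```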


### Arithmetic and data alignment

When objects are added together and their indexes do not match, the index of the result is the union of the two indexes.

In [712]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [713]:
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

`NaN` is returned for every label that appears in only one of the two Series.

In [714]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

For DataFrame, alignment is performed on both rows and columns.

In [715]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), 
                   columns=list('bcd'), 
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), 
                   columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


In [716]:
df2

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


When the two are added together, values are only returned for the row and column labels present in both DataFrames; everything else is `NaN`.

In [717]:
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


The arithmetic methods, such as `add`, accept a `fill_value` that stands in for a value missing from one of the two operands.

In [718]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [719]:
df2.loc[1, 'b'] = np.nan
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [720]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [721]:
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,5.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0
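
The same `fill_value` idea applies to Series. The sketch below uses two small, made-up Series (`left` and `right`) to contrast the plain `+` operator with `add(..., fill_value=0)`.

```python
import pandas as pd

left = pd.Series([1.0, 2.0], index=["a", "b"])
right = pd.Series([10.0, 20.0], index=["b", "c"])

# Plain addition: labels present on only one side give NaN.
plain = left + right

# With fill_value, the missing side is treated as 0 before adding.
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```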


Arithmetic between a DataFrame and a Series has defined behavior.
It differs slightly from arithmetic between 1-D and multidimensional ndarrays.

In [722]:
arr = np.arange(12.0).reshape((3, 4))
arr

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [723]:
arr[0]

array([0., 1., 2., 3.])

The subtraction below is performed once per row (an example of broadcasting).

In [724]:
arr - arr[0]

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

A similar mechanism is used for operations between a Series and a DataFrame.

In [725]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [726]:
series = frame.iloc[0]
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [727]:
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


If a label is not found in both the DataFrame columns and the Series index, the objects are reindexed to form the union.

In [728]:
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
series2

b    0
e    1
f    2
dtype: int64

In [729]:
frame + series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


In [730]:
series3 = frame['d']
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [731]:
frame.sub(series3)

Unnamed: 0,Ohio,Oregon,Texas,Utah,b,d,e
Utah,,,,,,,
Ohio,,,,,,,
Texas,,,,,,,
Oregon,,,,,,,


In [732]:
frame.sub(series3, axis='index')

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0
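
A common use of the `axis` argument is centering data. As an illustrative sketch (the names `col_centered` and `row_centered` are not from the text above), subtracting column means works with the default alignment on columns, while subtracting row means needs `axis='index'`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12.0).reshape((4, 3)),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)

# frame.mean() is indexed by column label, so the default alignment works.
col_centered = frame - frame.mean()

# frame.mean(axis=1) is indexed by row label, so match on the row index.
row_centered = frame.sub(frame.mean(axis=1), axis="index")

print(col_centered)
print(row_centered)
```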


### Function application and mapping

NumPy ufuncs (element-wise array functions) work on DataFrames, too.

In [733]:
frame = pd.DataFrame(np.random.randn(4, 3), 
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,-0.209365,-0.808293,0.493412
Ohio,-1.637488,0.602871,0.254673
Texas,-1.903313,-0.817953,-0.281141
Oregon,-1.4462,-0.223332,0.146243


In [734]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,0.209365,0.808293,0.493412
Ohio,1.637488,0.602871,0.254673
Texas,1.903313,0.817953,0.281141
Oregon,1.4462,0.223332,0.146243


Use the `apply` method to apply a function to each column (the default) or each row, which is passed to the function as a 1-D Series.

In [735]:
f = lambda x: x.max() - x.min()

frame.apply(f)

b    1.693947
d    1.420824
e    0.774553
dtype: float64

To apply the function to each row instead, pass `axis=1`.

In [736]:
frame.apply(f, axis=1)

Utah      1.301705
Ohio      2.240359
Texas     1.622171
Oregon    1.592443
dtype: float64

In [737]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

frame.apply(f)

Unnamed: 0,b,d,e
min,-1.903313,-0.817953,-0.281141
max,-0.209365,0.602871,0.493412
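
Element-wise Python functions can be applied as well. The sketch below uses `Series.map` on one column and combines `apply` with `map` to cover a whole frame; the small frame and the formatter `fmt` are made up for this example.

```python
import pandas as pd

frame = pd.DataFrame(
    {"b": [-0.2094, -1.6371], "e": [0.4934, 0.2547]},
    index=["Utah", "Ohio"],
)

# Hypothetical formatter: render a float with two decimal places.
def fmt(x):
    return f"{x:.2f}"

# Series.map calls the function once per element of the column.
formatted = frame["e"].map(fmt)

# Applying map to every column formats the entire frame.
all_formatted = frame.apply(lambda col: col.map(fmt))

print(formatted)
print(all_formatted)
```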


### Sorting and ranking

Use the `sort_index()` method to sort a Series or DataFrame lexicographically by its row or column labels.

In [738]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [739]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [740]:
frame.sort_index(axis=1)

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [741]:
frame.sort_index(axis=1, ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


Use the `sort_values` method to sort a Series by its values.

In [742]:
obj = pd.Series([4, 7, -3, -2])
obj.sort_values()

2   -3
3   -2
0    4
1    7
dtype: int64

In [743]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

For a DataFrame, a column can be specified to use for sorting.

In [744]:
frame = pd.DataFrame({
    'b': [4, 7, -3, 2],
    'a': [0, 1, 0, 1]
})
frame.sort_values(by='b')

Unnamed: 0,b,a
2,-3,0
3,2,1
0,4,0
1,7,1


In [745]:
frame.sort_values(by=['a', 'b'])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1
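
Sort direction is controlled with the `ascending` argument, which can be a single Boolean or a list with one entry per sort key. A minimal sketch on the same frame (the names `desc` and `mixed` are introduced here for illustration):

```python
import pandas as pd

frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# Largest values of "b" first.
desc = frame.sort_values(by="b", ascending=False)

# Ascending by "a", then descending by "b" within each group of "a".
mixed = frame.sort_values(by=["a", "b"], ascending=[True, False])

print(desc)
print(mixed)
```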


Ranking assigns ranks from 1 through the number of valid data points (rows).
By default, tied values are each assigned the mean rank of their group.
There are a few other ways to handle ties, and they can be selected using the `method` argument.

In [746]:
obj = pd.Series([7, -5, 7, 4, 3, 0, 4])
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [747]:
obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [748]:
obj.rank(pct=True)

0    0.928571
1    0.142857
2    0.928571
3    0.642857
4    0.428571
5    0.285714
6    0.642857
dtype: float64

In [749]:
obj.rank(ascending=False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
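
To see how the tie-breaking options differ, the sketch below ranks the same Series with several values of `method` and collects the results side by side in a DataFrame (called `comparison` here purely for illustration).

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 3, 0, 4])

# One column of ranks per tie-breaking method.
methods = ["average", "min", "max", "first", "dense"]
comparison = pd.DataFrame({m: obj.rank(method=m) for m in methods})

print(comparison)
```

With `'dense'`, ranks increase by exactly 1 between groups of tied values, so the largest rank equals the number of distinct values rather than the number of rows.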

DataFrames can be ranked over the columns or the rows.

In [750]:
frame = pd.DataFrame({
    'b': [4.3, 7, -3, 2],
    'a': [0, 1, 0, 1],
    'c': [-2, 5, 8, -2.5]
}).sort_index(axis=1)
frame

Unnamed: 0,a,b,c
0,0,4.3,-2.0
1,1,7.0,5.0
2,0,-3.0,8.0
3,1,2.0,-2.5


In [751]:
frame.rank()

Unnamed: 0,a,b,c
0,1.5,3.0,2.0
1,3.5,4.0,3.0
2,1.5,1.0,4.0
3,3.5,2.0,1.0


In [752]:
frame.rank(axis=1)

Unnamed: 0,a,b,c
0,2.0,3.0,1.0
1,1.0,3.0,2.0
2,2.0,1.0,3.0
3,2.0,3.0,1.0


### Axis indexes with duplicate labels

Many pandas functions require the labels to be unique, but uniqueness is not mandatory for a Series or DataFrame.
Here is a small example Series with non-unique labels.

In [753]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [754]:
obj.index.is_unique

False

In [755]:
obj.a

a    0
a    1
dtype: int64

In [756]:
obj.c

4

Here is an example with DataFrame.

In [757]:
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'b', 'b', 'b'])
df

Unnamed: 0,0,1,2
a,-0.771682,1.70093,-0.930985
b,1.018207,-0.674151,0.397952
b,-1.038253,0.349713,-0.113963
b,-0.029995,-1.020177,0.901631


In [758]:
df.loc['b']

Unnamed: 0,0,1,2
b,1.018207,-0.674151,0.397952
b,-1.038253,0.349713,-0.113963
b,-0.029995,-1.020177,0.901631
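
Because the result type depends on whether a label is repeated, it can help to check for duplicates before indexing. The sketch below uses `Index.duplicated()` to flag repeated labels and keep only the first occurrence of each; the variable names are chosen for this example.

```python
import pandas as pd

obj = pd.Series(range(5), index=["a", "a", "b", "b", "c"])

# True for every label that has already appeared earlier in the index.
dup_mask = obj.index.duplicated()

# Keep only the first occurrence of each label.
first_only = obj[~dup_mask]

# A repeated label returns a Series; a unique label returns a scalar.
repeated = obj.loc["a"]
single = obj.loc["c"]

print(first_only)
print(type(repeated).__name__, single)
```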
