In [1]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np

pandas objects are equipped with a set of common mathematical and statistical methods.
Most of these fall into the category of reductions or summary statistics, methods
that extract a single value (like the sum or mean) from a Series or a Series of values from
the rows or columns of a DataFrame. Compared with the equivalent methods of vanilla
NumPy arrays, they are all built from the ground up to exclude missing data. Consider
a small DataFrame:

In [2]:
df = DataFrame([[1.4, np.nan], [7.1, -4.5],
[np.nan, np.nan], [0.75, -1.3]],
index=['a', 'b', 'c', 'd'],
columns=['one', 'two'])
df

,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


Calling DataFrame’s sum method returns a Series containing column sums:

In [3]:
df.sum()

one    9.25
two   -5.80
dtype: float64

Passing axis=1 sums over the rows instead:

In [4]:
df.sum(axis=1)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

NA values are excluded by default. When the entire slice (row or column in this case) is
NA, sum returns 0 (as for row c above), whereas mean returns NaN. The exclusion
can be disabled using the skipna option:

In [5]:
df.mean(axis=1, skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
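
As a minimal sketch of how these options interact, the snippet below rebuilds the same small DataFrame and compares the default behaviour with skipna=False. The min_count argument is not used in these notes; it is included here on the assumption that your pandas version supports it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# Default: NA values are skipped within each row.
row_sums = df.sum(axis=1)
row_means = df.mean(axis=1)

# skipna=False: a single NA in the row makes the result NA.
strict_sums = df.sum(axis=1, skipna=False)

# min_count=1: require at least one non-NA value, otherwise return NA.
guarded_sums = df.sum(axis=1, min_count=1)
```

With min_count=1 the all-NA row c comes back as NaN instead of 0, which can be easier to spot downstream.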

Some methods, like idxmin and idxmax, return indirect statistics like the index value
where the minimum or maximum values are attained:

In [7]:
df.idxmax()

one    b
two    d
dtype: object

Other methods are accumulations:

In [8]:
df.cumsum()

,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


Another type of method is neither a reduction nor an accumulation. describe is one
such example, producing multiple summary statistics in one shot:

In [9]:
df.describe()

,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


On non-numeric data, describe produces alternate summary statistics:

In [10]:
obj = Series(['a', 'a', 'b', 'c'] * 4)
obj

0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object

In [11]:
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

Method Description<br>
count Number of non-NA values<br>
describe Compute set of summary statistics for Series or each DataFrame column<br>
min, max Compute minimum and maximum values<br>
argmin, argmax Compute index locations (integers) at which minimum or maximum value obtained, respectively<br>
idxmin, idxmax Compute index values at which minimum or maximum value obtained, respectively<br>
quantile Compute sample quantile ranging from 0 to 1<br>
sum Sum of values<br>
mean Mean of values<br>
median Arithmetic median (50% quantile) of values<br>
mad Mean absolute deviation from mean value<br>
var Sample variance of values<br>
std Sample standard deviation of values<br>
skew Sample skewness (3rd moment) of values<br>
kurt Sample kurtosis (4th moment) of values<br>
cumsum Cumulative sum of values<br>
cummin, cummax Cumulative minimum or maximum of values, respectively<br>
cumprod Cumulative product of values<br>
diff Compute 1st arithmetic difference (useful for time series)<br>
pct_change Compute percent changes<br>

<h3>Correlation and Covariance</h3>

In [15]:
import pandas_datareader as pdr

In [20]:
all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = pdr.get_data_yahoo(ticker, '1/1/2000', '1/1/2010')

# Build the price and volume tables once, after every ticker has been downloaded
price = DataFrame({tic: data['Adj Close'] for tic, data in all_data.items()})
volume = DataFrame({tic: data['Volume'] for tic, data in all_data.items()})

In [21]:
returns = price.pct_change()

In [22]:
returns.tail()

Date,AAPL,IBM,MSFT,GOOG
2009-12-24,0.03434,0.004385,0.002587,0.011117
2009-12-28,0.012294,0.013326,0.005484,0.007098
2009-12-29,-0.011861,-0.003477,0.007058,-0.005571
2009-12-30,0.012147,0.005461,-0.013699,0.005376
2009-12-31,-0.0043,-0.012597,-0.015504,-0.004416


The corr method of Series computes the correlation of the overlapping, non-NA,
aligned-by-index values in two Series. Relatedly, cov computes the covariance:

In [24]:
returns.MSFT.corr(returns.IBM)

0.4959795983674717

In [25]:
returns.MSFT.cov(returns.IBM)

0.0002159577259311431

DataFrame’s corr and cov methods, on the other hand, return a full correlation or
covariance matrix as a DataFrame, respectively:

In [26]:
returns.corr()

,AAPL,IBM,MSFT,GOOG
AAPL,1.0,0.410011,0.424305,0.470676
IBM,0.410011,1.0,0.49598,0.390689
MSFT,0.424305,0.49598,1.0,0.443586
GOOG,0.470676,0.390689,0.443586,1.0


In [27]:
returns.cov()

,AAPL,IBM,MSFT,GOOG
AAPL,0.001027,0.000252,0.000309,0.000303
IBM,0.000252,0.000367,0.000216,0.000142
MSFT,0.000309,0.000216,0.000516,0.000205
GOOG,0.000303,0.000142,0.000205,0.00058


Using DataFrame’s corrwith method, you can compute pairwise correlations between
a DataFrame’s columns or rows with another Series or DataFrame. Passing a Series
returns a Series with the correlation value computed for each column:

In [28]:
returns.corrwith(returns.IBM)

AAPL    0.410011
IBM     1.000000
MSFT    0.495980
GOOG    0.390689
dtype: float64

Passing a DataFrame computes the correlations of matching column names. Here I
compute correlations of percent changes with volume:

In [29]:
returns.corrwith(volume)

AAPL   -0.057549
IBM    -0.007892
MSFT   -0.014245
GOOG    0.062647
dtype: float64
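
Because the Yahoo download above depends on an external service, here is a small self-contained sketch of the same calls on synthetic data. The tickers AAA, BBB and CCC and all the numbers are made up for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)

# Hypothetical daily returns for three made-up tickers.
returns = pd.DataFrame(rng.normal(scale=0.01, size=(250, 3)),
                       columns=['AAA', 'BBB', 'CCC'])
# Hypothetical volumes sharing the same column names.
volume = pd.DataFrame(rng.randint(1000, 5000, size=(250, 3)),
                      columns=['AAA', 'BBB', 'CCC'])

pair_corr = returns['AAA'].corr(returns['BBB'])  # scalar
corr_matrix = returns.corr()                     # 3 x 3 DataFrame
vs_one = returns.corrwith(returns['BBB'])        # one value per column
vs_volume = returns.corrwith(volume)             # matched by column name
```

The Series returned by corrwith(returns['BBB']) is the same as the BBB column of the full correlation matrix.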

<h3>Unique Values, Value Counts, and Membership</h3>

Another class of related methods extracts information about the values contained in a
one-dimensional Series. To illustrate these, consider this example:

In [30]:
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

In [31]:
uniques = obj.unique()

In [32]:
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

In [33]:
obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

In [35]:
pd.value_counts(obj.values, sort=False)

d    1
b    2
a    3
c    3
dtype: int64

isin is responsible for vectorized set membership and can be very useful in
filtering a data set down to a subset of values in a Series or column in a DataFrame:

In [36]:
mask = obj.isin(['b', 'c'])

In [37]:
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [38]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object

In some cases, you may want to compute a histogram on multiple related columns in
a DataFrame. Here’s an example:

In [39]:
data = DataFrame({'Qu1': [1, 3, 4, 3, 4],
'Qu2': [2, 3, 1, 2, 3],
'Qu3': [1, 5, 2, 4, 4]})

In [40]:
data

,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


Passing pandas.value_counts to this DataFrame’s apply function gives:

In [41]:
result = data.apply(pd.value_counts).fillna(0)

In [42]:
result

,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0
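
The same table can also be built with the Series.value_counts method, in case the top-level pd.value_counts function is not available in your pandas version. This is only a sketch of one equivalent spelling; the cast to int is an extra step not present in the cell above.

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Count the values in each column; apply aligns the results on the
# union of observed values, leaving NaN where a value never occurs.
counts = data.apply(lambda col: col.value_counts())

# Replace the NaN gaps with 0 and restore integer counts.
counts = counts.fillna(0).astype(int).sort_index()
```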


<h2>Handling Missing Data</h2>

In [2]:
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [3]:
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [4]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [5]:
string_data[0] = None

In [6]:
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

Method Description<br>
dropna Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.<br>
fillna Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.<br>
isnull Return like-type object containing boolean values indicating which values are missing / NA.<br>
notnull Negation of isnull.<br>

<h3>Filtering Out Missing Data</h3>

In [7]:
from numpy import nan as NA

In [8]:
data = Series([1, NA, 3.5, NA, 7])

In [9]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [10]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

In [11]:
data = DataFrame([[1., 6.5, 3.], [1., NA, NA],
[NA, NA, NA], [NA, 6.5, 3.]])

In [12]:
cleaned = data.dropna()

In [13]:
data

,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [14]:
cleaned

,0,1,2
0,1.0,6.5,3.0


Passing how='all' will only drop rows that are all NA:

In [15]:
data.dropna(how='all')

,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


Dropping columns in the same way is only a matter of passing axis=1 :

In [16]:
data[4]=NA

In [17]:
data

,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [18]:
data.dropna(axis=1, how='all')

,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


Suppose you want to keep only rows containing a certain number of observations. You can
indicate this with the thresh argument:

In [20]:
df = DataFrame(np.random.randn(7,3))
df

,0,1,2
0,0.53079,1.820979,0.828763
1,-1.466207,1.293301,-2.142601
2,-1.092346,0.888633,-0.016253
3,0.258848,-1.821599,-1.408268
4,-1.039877,-1.537137,0.464077
5,0.418899,0.673003,0.44293
6,0.136677,1.525506,-1.205894


In [23]:
df.loc[:4,1] = NA; df.loc[:2,2]=NA
df

,0,1,2
0,0.53079,,
1,-1.466207,,
2,-1.092346,,
3,0.258848,,-1.408268
4,-1.039877,,0.464077
5,0.418899,0.673003,0.44293
6,0.136677,1.525506,-1.205894


In [24]:
df.dropna(thresh=3)

,0,1,2
5,0.418899,0.673003,0.44293
6,0.136677,1.525506,-1.205894
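
Since the frame above is random, here is a small deterministic sketch comparing the three dropna variants side by side. The column names x, y and z are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0, np.nan],
                   'y': [np.nan, np.nan, 6.0, np.nan],
                   'z': [9.0, np.nan, 11.0, 12.0]})

no_na = df.dropna()                  # keep rows with no NA at all
not_all_na = df.dropna(how='all')    # drop only rows that are entirely NA
at_least_two = df.dropna(thresh=2)   # keep rows with >= 2 non-NA values
```

Here thresh counts the non-NA values a row must have to be kept, not the number of NA values it may contain.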


<h3>Filling in Missing Data</h3>

In [25]:
df.fillna(0)

,0,1,2
0,0.53079,0.0,0.0
1,-1.466207,0.0,0.0
2,-1.092346,0.0,0.0
3,0.258848,0.0,-1.408268
4,-1.039877,0.0,0.464077
5,0.418899,0.673003,0.44293
6,0.136677,1.525506,-1.205894


By calling fillna with a dict, you can use a different fill value for each column:

In [26]:
df.fillna({1: 0.5, 3:-1})

,0,1,2
0,0.53079,0.5,
1,-1.466207,0.5,
2,-1.092346,0.5,
3,0.258848,0.5,-1.408268
4,-1.039877,0.5,0.464077
5,0.418899,0.673003,0.44293
6,0.136677,1.525506,-1.205894
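
Note that column 2 is left unfilled in the output above: the dict key 3 does not match any column label, and such keys appear to be ignored silently. A minimal sketch on a small hand-made frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({0: [1.0, 2.0, 3.0],
                   1: [np.nan, 5.0, np.nan],
                   2: [np.nan, np.nan, 9.0]})

# Key 3 is not a column label, so column 2 keeps its NaN values.
partly_filled = df.fillna({1: 0.5, 3: -1})

# Using the actual label 2 fills that column as well.
fully_filled = df.fillna({1: 0.5, 2: -1})
```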


fillna returns a new object, but you can modify the existing object in place:

In [27]:
_ = df.fillna(0, inplace=True)
df

,0,1,2
0,0.53079,0.0,0.0
1,-1.466207,0.0,0.0
2,-1.092346,0.0,0.0
3,0.258848,0.0,-1.408268
4,-1.039877,0.0,0.464077
5,0.418899,0.673003,0.44293
6,0.136677,1.525506,-1.205894


In [29]:
df = DataFrame(np.random.randn(6,3))
df

,0,1,2
0,-0.117709,0.578032,0.570881
1,3.036025,1.187093,-1.118264
2,-1.012357,1.055387,1.206807
3,0.802131,0.53983,-1.437613
4,-0.735336,-2.198529,1.115896
5,-0.254628,-0.027795,-0.367443


In [30]:
df.loc[2:, 1]=NA; df.loc[4:, 2]=NA

In [31]:
df

,0,1,2
0,-0.117709,0.578032,0.570881
1,3.036025,1.187093,-1.118264
2,-1.012357,,1.206807
3,0.802131,,-1.437613
4,-0.735336,,
5,-0.254628,,


In [32]:
df.fillna(method='ffill')

,0,1,2
0,-0.117709,0.578032,0.570881
1,3.036025,1.187093,-1.118264
2,-1.012357,1.187093,1.206807
3,0.802131,1.187093,-1.437613
4,-0.735336,1.187093,-1.437613
5,-0.254628,1.187093,-1.437613


In [33]:
df.fillna(method='ffill', limit=2)

,0,1,2
0,-0.117709,0.578032,0.570881
1,3.036025,1.187093,-1.118264
2,-1.012357,1.187093,1.206807
3,0.802131,1.187093,-1.437613
4,-0.735336,,-1.437613
5,-0.254628,,-1.437613
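
Forward filling is also available through the dedicated ffill method, which may be preferable if the method= argument of fillna is discouraged in your pandas version. A small deterministic sketch, with made-up column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, np.nan, np.nan],
                   'b': [2.0, 3.0, np.nan, np.nan]})

filled = df.ffill()          # carry the last valid value forward
capped = df.ffill(limit=1)   # fill at most one consecutive NA per gap
```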


With fillna you can do lots of other things with a little creativity. For example, you
might pass the mean or median value of a Series:

In [34]:
data = Series([1., NA, 3.5, NA, 7])

In [35]:
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

<i>fillna function arguments</i><br>
Argument Description<br>
value Scalar value or dict-like object to use to fill missing values<br>
method Interpolation, by default 'ffill' if function called with no other arguments<br>
axis Axis to fill on, default axis=0<br>
inplace Modify the calling object without producing a copy<br>
limit For forward and backward filling, maximum number of consecutive periods to fill<br>

<h2>Hierarchical Indexing</h2>

In [2]:
data = Series(np.random.randn(10),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                                         [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

In [3]:
data

a  1   -0.860304
   2    0.562665
   3   -1.816994
b  1    0.011137
   2   -1.481147
   3    0.462495
c  1    0.715410
   2    0.824325
d  2    0.868672
   3   -0.257167
dtype: float64

What you’re seeing is a prettified view of a Series with a MultiIndex as its index. The
“gaps” in the index display mean “use the label directly above”:

In [4]:
data.index

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           codes=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

With a hierarchically-indexed object, so-called partial indexing is possible, enabling
you to concisely select subsets of the data:

In [5]:
data['b']

1    0.011137
2   -1.481147
3    0.462495
dtype: float64

In [6]:
data['b':'c']

b  1    0.011137
   2   -1.481147
   3    0.462495
c  1    0.715410
   2    0.824325
dtype: float64

In [7]:
data.loc[['b','d']]

b  1    0.011137
   2   -1.481147
   3    0.462495
d  2    0.868672
   3   -0.257167
dtype: float64

In [8]:
data[:,2]

a    0.562665
b   -1.481147
c    0.824325
d    0.868672
dtype: float64
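
The same selections can be written explicitly with loc, which is a reasonable habit when the inner level holds integers. A sketch with fixed values so the results are easy to check:

```python
import pandas as pd

data = pd.Series([10, 20, 30, 40, 50, 60],
                 index=[['a', 'a', 'b', 'b', 'c', 'c'],
                        [1, 2, 1, 2, 1, 2]])

outer_b = data.loc['b']           # inner labels 1 and 2 remain
outer_range = data.loc['b':'c']   # label slice on the outer level, inclusive
inner_2 = data.loc[:, 2]          # one value per outer label
```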

Hierarchical indexing plays a critical role in reshaping data and group-based operations
like forming a pivot table. For example, this data could be rearranged into a DataFrame
using its unstack method:

In [9]:
data.unstack()

,1,2,3
a,-0.860304,0.562665,-1.816994
b,0.011137,-1.481147,0.462495
c,0.71541,0.824325,
d,,0.868672,-0.257167


The inverse operation of unstack is stack:

In [10]:
data.unstack().stack()

a  1   -0.860304
   2    0.562665
   3   -1.816994
b  1    0.011137
   2   -1.481147
   3    0.462495
c  1    0.715410
   2    0.824325
d  2    0.868672
   3   -0.257167
dtype: float64
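
A minimal round trip on a Series whose index covers every outer/inner combination, so no missing values are introduced along the way:

```python
import pandas as pd

data = pd.Series([10, 20, 30, 40],
                 index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

wide = data.unstack()   # outer level becomes rows, inner level becomes columns
long = wide.stack()     # back to a Series with a two-level index
```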

In [11]:
frame = DataFrame(np.arange(12).reshape((4, 3)),
 index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
 columns=[['Ohio', 'Ohio', 'Colorado'],
 ['Green', 'Red', 'Green']])

In [12]:
frame

,,Ohio,Ohio,Colorado
,,Green,Red,Green
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


The hierarchical levels can have names (as strings or any Python objects). If so, these
will show up in the console output (don’t confuse the index names with the axis labels!):

In [13]:
frame.index.names = ['key1', 'key2']

In [14]:
frame.columns.names = ['state', 'color']

In [15]:
frame

,state,Ohio,Ohio,Colorado
,color,Green,Red,Green
key1,key2,,,
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


With partial column indexing you can similarly select groups of columns:

In [16]:
frame['Ohio']

,color,Green,Red
key1,key2,,
a,1,0,1
a,2,3,4
b,1,6,7
b,2,9,10


A MultiIndex can be created by itself and then reused; the columns in the above DataFrame
with level names could be created like this:

pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
                          names=['state', 'color'])
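
A short sketch of the "reuse" part: the same MultiIndex object labels the columns of two different frames. The names frame_a and frame_b are made up for the example.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays(
    [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
    names=['state', 'color'])

# The same index object can label any frame with three columns.
frame_a = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=columns)
frame_b = pd.DataFrame(np.zeros((4, 3)), columns=columns)
```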

<h3>Reordering and Sorting Levels</h3>

The swaplevel method takes two level numbers or names and
returns a new object with the levels interchanged (but the data is otherwise unaltered):

In [18]:
frame.swaplevel('key1', 'key2')

,state,Ohio,Ohio,Colorado
,color,Green,Red,Green
key2,key1,,,
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11


sort_index, on the other hand, sorts the data by the index labels along the given axis.
When swapping levels, it’s not uncommon to also use sort_index so that the result
is lexicographically sorted:

In [23]:
frame.sort_index(axis=1)

,state,Colorado,Ohio,Ohio
,color,Green,Green,Red
key1,key2,,,
a,1,2,0,1
a,2,5,3,4
b,1,8,6,7
b,2,11,9,10


In [25]:
frame.swaplevel(0, 1).sort_index(axis=0)

,state,Ohio,Ohio,Colorado
,color,Green,Red,Green
key2,key1,,,
1,a,0,1,2
1,b,6,7,8
2,a,3,4,5
2,b,9,10,11


<h3>Summary Statistics by Level</h3>

In [26]:
frame.sum(level='key2')

state,Ohio,Ohio,Colorado
color,Green,Red,Green
key2,,,
1,6,8,10
2,12,14,16


In [27]:
frame.sum(level='color', axis=1)

,color,Green,Red
key1,key2,,
a,1,2,1
a,2,8,4
b,1,14,7
b,2,20,10
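
The level argument of sum used above may not exist in your pandas version (as far as I know it was deprecated and then removed in pandas 2.0). A sketch of the same per-level sums written with groupby, assuming the same frame as above:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Sum rows that share the same key2 label.
by_key2 = frame.groupby(level='key2').sum()

# For a column level: transpose, group on that level, transpose back.
by_color = frame.T.groupby(level='color').sum().T
```

The transpose detour avoids relying on groupby's axis argument, which is an assumption about what is safest across versions rather than the only way to do it.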


<h3>Using a DataFrame’s Columns</h3>

In [28]:
frame = DataFrame({'a': range(7), 'b': range(7, 0, -1),
 'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
 'd': [0, 1, 2, 0, 1, 2, 3]})

In [29]:
frame

,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


DataFrame’s set_index function will create a new DataFrame using one or more of its
columns as the index:

In [30]:
frame2 = frame.set_index(['c', 'd'])

In [31]:
frame2

c,d,a,b
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


By default the columns are removed from the DataFrame, though you can leave them in:

In [32]:
frame.set_index(['c','d'], drop=False)

c,d,a,b,c,d
one,0,0,7,one,0
one,1,1,6,one,1
one,2,2,5,one,2
two,0,3,4,two,0
two,1,4,3,two,1
two,2,5,2,two,2
two,3,6,1,two,3


reset_index, on the other hand, does the opposite of set_index; the hierarchical index
levels are moved into the columns:

In [34]:
frame2.reset_index()

,c,d,a,b
0,one,0,0,7
1,one,1,1,6
2,one,2,2,5
3,two,0,3,4
4,two,1,4,3
5,two,2,5,2
6,two,3,6,1
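
A compact round trip on a smaller frame, showing that reset_index puts the former index levels back in front of the remaining columns:

```python
import pandas as pd

frame = pd.DataFrame({'a': range(4),
                      'c': ['one', 'one', 'two', 'two'],
                      'd': [0, 1, 0, 1]})

indexed = frame.set_index(['c', 'd'])   # columns c and d become index levels
restored = indexed.reset_index()        # the levels move back into columns
```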


<h2>Other pandas Topics</h2>

<h3>Integer Indexing</h3>

In [35]:
ser = Series(np.arange(3.))
ser[-1]

KeyError: -1

In this case, pandas could “fall back” on integer indexing, but there’s not a safe and
general way (that I know of) to do this without introducing subtle bugs. Here we have
an index containing 0, 1, 2, but inferring what the user wants (label-based indexing or
position-based) is difficult:

In [36]:
ser

0    0.0
1    1.0
2    2.0
dtype: float64

On the other hand, with a non-integer index, there is no potential for ambiguity:

In [37]:
ser2 = Series(np.arange(3.), index=['a', 'b', 'c'])

In [38]:
ser2[-1]

2.0

To keep things consistent, if you have an axis index containing integers, data selection
with integers will always be label-oriented. This includes slicing with loc, too:

In [39]:
ser.loc[:1]

0    0.0
1    1.0
dtype: float64

In cases where you need reliable position-based indexing regardless of the index type,
you can use the iat and iloc indexers, which are available on both Series and DataFrame:

In [40]:
ser3 = Series(range(3), index=[-5, 1, 3])

In [44]:
ser3.iat[2]

2

In [2]:
frame = DataFrame(np.arange(6).reshape(3, 2), index=[2, 0, 1])

In [5]:
frame.iloc[0]

0    0
1    1
Name: 2, dtype: int32
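
To make the label/position distinction concrete, here is a sketch that puts loc and iloc side by side on a Series with an integer index:

```python
import pandas as pd

ser3 = pd.Series(range(3), index=[-5, 1, 3])

by_label = ser3.loc[1]        # the element whose label is 1
last = ser3.iloc[-1]          # the last element, whatever its label
label_slice = ser3.loc[:1]    # label slice, endpoint included
pos_slice = ser3.iloc[:1]     # position slice, endpoint excluded
```

Spelling out loc or iloc removes the ambiguity that plain square brackets have on integer-labelled data.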

<h3>Panel Data</h3>

To create a Panel, you can use a dict of DataFrame objects or a three-dimensional
ndarray:

In [61]:
import pandas as pd
import pandas_datareader.data as web

In [62]:
pdata = pd.Panel(dict((stk, web.get_data_yahoo(stk, '1/1/2009', '6/1/2012')) 
                      for stk in ['AAPL', 'GOOG', 'MSFT']))

Panel is deprecated and will be removed in a future version.
The recommended way to represent these types of 3-dimensional data are with a MultiIndex on a DataFrame, via the Panel.to_frame() method
Alternatively, you can use the xarray package http://xarray.pydata.org/en/stable/.
Pandas provides a `.to_xarray()` method to help automate this conversion.

  exec(code_obj, self.user_global_ns, self.user_ns)


In [63]:
pdata

<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 861 (major_axis) x 6 (minor_axis)
Items axis: AAPL to MSFT
Major_axis axis: 2009-01-02 00:00:00 to 2012-06-01 00:00:00
Minor_axis axis: High to Adj Close

In [64]:
pdata = pdata.swapaxes('items', 'minor')

In [66]:
pdata['Adj Close']

Date,AAPL,GOOG,MSFT
2009-01-02,11.314104,160.060059,15.688256
2009-01-05,11.791602,163.412491,15.834874
2009-01-06,11.597112,166.406265,16.020086
2009-01-07,11.346518,160.403763,15.055480
2009-01-08,11.557216,161.987823,15.526204
2009-01-09,11.292908,156.946732,15.063194
2009-01-12,11.053535,155.761169,15.024609
2009-01-13,10.935095,156.573120,15.294700
2009-01-14,10.638372,149.923050,14.731374
2009-01-15,10.395261,148.936752,14.847122


loc-based label indexing generalizes to three dimensions, so we can select all data at a
particular date or a range of dates like so:

In [67]:
pdata.loc[:, '6/1/2012', :]

,High,Low,Open,Close,Volume,Adj Close
AAPL,81.807144,80.074287,81.308571,80.141426,130246900.0,69.940491
GOOG,285.255798,283.113831,284.827393,284.42392,6138700.0,284.42392
MSFT,28.959999,28.440001,28.76,28.450001,56634300.0,23.857056


In [68]:
pdata.loc['Adj Close', '5/22/2012':, :]

Date,AAPL,GOOG,MSFT
2012-05-22,69.439308,299.278229,24.955561
2012-05-23,71.133591,303.592072,24.410501
2012-05-24,70.480331,300.702881,24.376957
2012-05-25,70.102554,294.660553,24.36857
2012-05-29,71.346817,296.060303,24.787857
2012-05-30,72.207024,293.016693,24.603371
2012-05-31,72.027489,289.345459,24.477589
2012-06-01,69.940491,284.42392,23.857056


An alternate way to represent panel data, especially for fitting statistical models, is in
“stacked” DataFrame form:

In [69]:
stacked = pdata.loc[:, '5/30/2012':, :].to_frame()
stacked

Date,minor,High,Low,Open,Close,Volume,Adj Close
2012-05-30,AAPL,82.855713,80.937141,81.314285,82.738571,132357400.0,72.207024
2012-05-30,GOOG,294.844849,290.675476,292.981842,293.016693,3827600.0,293.016693
2012-05-30,MSFT,29.48,29.120001,29.35,29.34,41585500.0,24.603371
2012-05-31,AAPL,83.071426,81.637146,82.96286,82.53286,122918600.0,72.027489
2012-05-31,GOOG,293.898407,288.418945,293.260773,289.345459,5958800.0,289.345459
2012-05-31,MSFT,29.42,28.940001,29.299999,29.190001,39134000.0,24.477589
2012-06-01,AAPL,81.807144,80.074287,81.308571,80.141426,130246900.0,69.940491
2012-06-01,GOOG,285.255798,283.113831,284.827393,284.42392,6138700.0,284.42392
2012-06-01,MSFT,28.959999,28.440001,28.76,28.450001,56634300.0,23.857056


DataFrame has a related to_panel method, the inverse of to_frame:

In [70]:
stacked.to_panel()

<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 3 (major_axis) x 3 (minor_axis)
Items axis: High to Adj Close
Major_axis axis: 2012-05-30 00:00:00 to 2012-06-01 00:00:00
Minor_axis axis: AAPL to MSFT
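
The deprecation warning shown earlier recommends a DataFrame with a MultiIndex in place of Panel. Below is a rough sketch of that layout using synthetic data instead of the Yahoo download; the values, the two-column layout and the level name ticker are all made up for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2012-05-30', periods=3, freq='D')
rng = np.random.RandomState(0)

# Hypothetical per-ticker frames standing in for downloaded price data.
frames = {tic: pd.DataFrame(rng.rand(3, 2), index=dates,
                            columns=['Close', 'Volume'])
          for tic in ['AAPL', 'GOOG', 'MSFT']}

# Concatenating a dict adds its keys as the outer index level.
stacked = pd.concat(frames, names=['ticker', 'Date'])

one_ticker = stacked.loc['GOOG']                # all dates for one ticker
one_date = stacked.xs(dates[0], level='Date')   # all tickers on one date
closes = stacked['Close'].unstack('ticker')     # dates as rows, tickers as columns
```

The xs and unstack calls play roughly the role that the three-axis loc selections play on a Panel.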