In [423]:
"""
Begun - 8/6/2016
Python for Data Analysis - Chapter 5 - Getting Started with Pandas

Things to review:
- 

Things learned here:

Series Indexing
    - Can think about it as a fixed-length, ordered dict
    - Contains a data array and its index (also an array)
        obj = Series([4, 7, -5, 3])
        obj.values - outputs the values as an array
        obj.index - outputs the index as a Pandas index or range
    - Can specify the index
        obj2 = Series([4, 7, -5, 3], index=['first','second','third','fourth'])
        obj2['first'] 
            # outputs 4, a scalar
        obj2[['first','third']] 
            # outputs subset of the Series object with index
    - Can use math functions like in Numpy.  Operations are handled element-by-element on matching indices
    - Can use it in many functions that expect a dict
    - Can name a Series (obj.name = 'x'), name its index (obj.index.name = 'y'), and replace its index by assignment (obj.index = ['new', 'values'])

DataFrame Indexing
     - Creation from a dict: frame = pd.DataFrame(data). 
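         In the pandas version used here, columns are sorted alphabetically by dict key (newer versions keep the dict's insertion order)
     - df['my_col'] and df.my_col are equivalent for retrieval when my_col is a valid Python identifier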
     - del df['my_col'] to delete column my_col
     - A column selected from a DataFrame is returned as a pd.Series that is a view on the underlying data, not a copy (in the pandas version used here)
     - Setting names of the index and the columns
         frame3.index.name = 'year'
         frame3.columns.name = 'state'
    - Index objects are immutable
        index = df.index
        index[1] = 'a' # raises TypeError: index refers to the same object as df.index, and Index objects are immutable

    - Types of Index Objects
        Index - most general.  Axis labels in a NumPy array of Python objects
        Int64Index - integer index.
        MultiIndex - Hierarchical Index of nested levels
        DatetimeIndex - Stores nanosecond timestamps (using NumPy's datetime64 dtype)
        PeriodIndex - Stores timespans ("periods")
    - See lots of index methods in Table 5-3.  A few here:
        append - Concatenate with additional index objects.  Very useful for adding to a multi-index (as done in OM)
        unique - compute the array of unique values of the index
    - Reindex - Create a new object with the data conformed to a new index
        method= ... gives the ability to fill in NaNs resulting from the reindex
            method='ffill' fills values forward.  method='bfill' (or 'backfill') fills values backward.
        Can also reindex the columns
    - drop indexes - applies to rows and columns (for columns must specify axis = 1)
        Does NOT drop in-place
    
Indexing, Selection, & Filtering
    - Series: selections like series['selection'] work when 'selection' is an index label (i.e. row name)
    - Data Frame: selections like df['selection'] work when 'selection' is a column name
    - Index-based selection in a df: df.ix[[index or index range], [column or column range]]
        (.ix was removed in pandas 1.0; use .loc for labels and .iloc for positions)
        If only 1 parameter is given it's assumed to be an index reference
        If only 1 parameter with a single value is given then the result is a Series that's weirdly transposed
            Columns are the new index of the Series and the column values are the Series values
    
Data Alignment and Arithmetic
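    - df1.add(df2).fillna(0) is a useful recipe: it replaces the NaNs produced by non-matching labels with 0, where plain df1 + df2 leaves them as NaN.
        df1.add(df2, fill_value=0) instead treats a value missing on one side as 0 before adding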

Broadcasting - Arithmetic between DF's and Series
    - "By default, arithmetic between DataFrame and Series matches the index of the Series on the DataFrame's columns,
        broadcasting down the rows."
    - df.add(series, axis=0) will match on index of df and index of series.
    - apply - applies functions column-by-column or row-by-row 
    - map - applies functions element-by-element to a series
    - applymap - applies functions element-by-element to a DataFrame

Sorting
    - sorting by index or column name
        series.sort_index()
        df.sort_index()
        df.sort_index(axis=1) # by column
        df.sort_index(axis=1, ascending=False) # by column descending
    - sorting by value
        series.sort_values()
        df.sort_values(by=['column'])  # column list or single value
        df.sort_values(by='row_label', axis=1)  # sort the columns by the values in one row

Ranking values
    - obj.rank() # Break ties by using the average rank of the things that are tied        
        obj.rank(method = 'first') # Break ties by using the order of the tied value's occurrence in the column
        obj.rank(method = 'max')   # Break ties by using the maximum rank of the things that are tied
        obj.rank(method = 'min')   # Break ties by using the minimum rank of the things that are tied
        obj.rank(method = 'average') # THIS IS THE DEFAULT!
        obj.rank(ascending = False) # rank by descending value. Can be paired with the above methods

Statistics
    Reductions
    - df.sum(), df.mean(), df.idxmin(), df.idxmax()
      df.sum(axis=1) # sums across the columns, giving one value per row
      df.sum(skipna=False) # by default, NA values are ignored
      df.sum(level=this_index) # for MultiIndex indexes (removed in pandas 2.0; use df.groupby(level=this_index).sum())

    Accumulations
    - df.cumsum(), df.pct_change
      Parameters are the same as with reduction statistics (axes, levels, skipna, etc.)

    Statistical Wrappers
    - df.describe() # produces summary statistics, e.g. for numeric data (count, mean, std, min, 25%, 50%, 75%, max)

    Correlation, Covariance
    - Series v. Series:   series1.corr(series2) # returns a single number
                series1.cov(series2)            # returns a single number
    - DataFrame: df.corr() and df.cov() return the full correlation and covariance matrices between the columns of df
    - DataFrame with Pairwise DataFrame: df1.corrwith(df2) # Returns a single number for each label match
    - DataFrame with Pairwise Series: df1.corrwith(series) # Returns a single number for each label of df1

Uniqueness, Value Counting, Vectorized Membership Test
    - series.unique()
    - series.value_counts()
    - series.isin(['a','b','c']) # Returns True, False etc. vectorized
    These are series methods.  To use them with dataframes, use apply.  df.apply(pd.value_counts).fillna(0) <-- you made a histogram!

Missing data
    - dropna() # By default drops a row that contains any NA
        dropna(how='all') # will require all of the values to be NA before dropping the row
        dropna(thresh=5)  # will require 5 of the values to be non-NA or else the row is dropped
    - fillna()
        fillna(method='ffill', limit=3) # Forward fill the NaNs up to 3 values.  If limit isn't specified then fill is unbounded.
        fillna(0)         # Fills missing values with number 0
        fillna({1: 2, 3: 4}) # Fills columns 1 and 3 with the constants 2 and 4
    - fillna() can take the result of a function as well.  E.g. data.fillna(data.mean())

Multiple Indexes
    - A series can have multiple indexes
    - series_with_multiindex.unstack() # turns it into a dataframe
    - Either columns or rows can have multiple indices.  Just pass a multi-level vector to index=[] or columns=[]
    - Create a MultiIndex on its own.  
        pd.MultiIndex.from_arrays([vector1, vector2], names=names)
            For example: 
                vector1 = ['Ohio', 'Ohio', 'Colorado']
                vector2 = ['Green', 'Red', 'Green']
                names = ['state', 'color']
    - Swap & Sort levels
        df.swaplevel('index1','index2')
        df.sort_index(level='index2')  # the book's df.sortlevel is the older spelling
        df.sort_index() sorts outside-in
        **** Data selection performance is much better for MultiIndex operations if the index is sorted outside-in
    - set_index(column_vector) will set the columns provided in the column vector as indices
        df.set_index(['column1','column3'])
    - reset_index() to move all index keys into columns and set the index to the natural numbers

Statistics and Multiple Levels
    - df.sum(level='key2')  # removed in pandas 2.0; df.groupby(level='key2').sum() is the equivalent

More subtle indexing tools
    - Indexes -- To use integer-based indexing even when there is a non-integer index
        For Series:     series.iloc[2]   # the book's series.iget_value(2) is the older spelling
        For DataFrame:  frame.iloc[0]    # the book's frame.irow(0) is the older spelling

Pulling in Web Data
    - from pandas_datareader import data as web
        all_data = {}
        for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
            all_data[ticker] = web.get_data_yahoo(ticker)

Panel Data (pd.Panel was removed in pandas 0.25; a DataFrame with a MultiIndex replaces it)
    - from pandas_datareader import data as web
      pdata = pd.Panel(dict((stk, web.get_data_yahoo(stk)) for stk in ['AAPL', 'GOOG', 'MSFT', 'DELL']))
      pdata = pdata.swapaxes('items', 'minor') # To get time on the major axis
    - Turn it into a DataFrame
        stacked = pdata.ix[:, '5/30/2016':, :].to_frame() # Turn



Iterating Over Dicts
    - df = pd.DataFrame({a: data['column'] for a, data in all_data.iteritems()})
        # Gives a DataFrame with one column per key a, holding data['column']
        # iteritems() is Python 2; in Python 3 use all_data.items()


"""
import pandas as pd
import numpy as np
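
The MultiIndex notes in the docstring above have no cell of their own in this part of the notebook, so here is a minimal sketch of them. The data values and the names s, wide, swapped, flat and rebuilt are made up for illustration, and the calls use current pandas spellings (swaplevel, sort_index(level=...)), not the older ones from the book.

```python
import pandas as pd

# Hypothetical data: two index levels named 'state' and 'color'
idx = pd.MultiIndex.from_arrays(
    [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
    names=['state', 'color'])
s = pd.Series([1, 2, 3], index=idx)

# unstack() moves the inner level ('color') into the columns
wide = s.unstack()

# Swap the two levels, then sort on the new outer level
swapped = s.swaplevel('state', 'color').sort_index(level='color')

# reset_index() turns the index levels into ordinary columns ...
flat = s.rename('value').reset_index()
# ... and set_index() rebuilds the MultiIndex from those columns
rebuilt = flat.set_index(['state', 'color'])
```

Combinations that never occur in the data, such as ('Colorado', 'Red'), show up as NaN after unstack().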

In [32]:
"""
Series
- Contains an array of data of any NumPy data type
- Contains an associated array of data labels called its index
"""
obj = pd.Series([4, 7, -5, 3])
obj 
    # Default index runs from 0 to n-1

0    4
1    7
2   -5
3    3
dtype: int64

In [33]:
obj.values

array([ 4,  7, -5,  3])

In [34]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [35]:
"""Create a Series with a specific index"""
obj2 = pd.Series([4, 7, -5, 3], index=['d','b','a','c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [36]:
obj2.index

Index([u'd', u'b', u'a', u'c'], dtype='object')

In [37]:
# Retrieve a single value or a subset of the Series
obj2['a']
obj2[['c','a','d']]

c    3
a   -5
d    4
dtype: int64

In [38]:
# Filtering, scalar multiplication, and math works like in NumPy
obj2[obj2 > 0]
obj2 * 2
np.exp(obj2)

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

In [39]:
# Can test whether a label is in the index with `in`, as with dict keys
'b' in obj2 # True
'e' in obj2 # False

False

In [40]:
"""Create a Pandas Series object from a Python dict"""
# The dict's keys are used as the Series's index
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [41]:
# Specifying an index different from the dict's keys acts 
# as a filter - the passed index is used but might not match
# to keys in the dict. Non-matched values lead to a NaN.
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [42]:
"""Search for null / missing / NA / NaN values """
pd.isnull(obj4)
pd.notnull(obj4) # equivalent to ~(pd.isnull(obj4))

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [43]:
# Can just use these as methods of the instance
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [44]:
# 8/22/2016
# Pandas series add element-by-element using index matching
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

In [45]:
# Can name the index and the Series
obj4.name = 'population'
obj4.index.name = 'state'
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

In [46]:
# Can alter the index in-place by assignment
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

In [47]:
# Create a DataFrame from a dict of equal-length lists
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year' : [2000, 2001, 2002, 2001, 2001],
        'pop'  : [1.5,  1.7,  3.6,  2.4,  2.9]}
frame = pd.DataFrame(data)
frame

Unnamed: 0,pop,state,year
0,1.5,Ohio,2000
1,1.7,Ohio,2001
2,3.6,Ohio,2002
3,2.4,Nevada,2001
4,2.9,Nevada,2001


In [48]:
# Can force an order of the columns
pd.DataFrame(data, columns=['year', 'state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2001,Nevada,2.9


In [49]:
# Passing a column name that doesn't match a key results in NA values
frame2 = pd.DataFrame(data, columns= ['year', 'state', 'pop', 'debt'],
                            index =  ['one',  'two',   'three', 'four', 'five'])
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2001,Nevada,2.9,


In [50]:
# Can retrieve a single column in dict-like notation or as an attribute
frame2['state']
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2001
Name: year, dtype: int64

In [51]:
# Columns modified by assignment
frame2['debt'] = 16.5 # a constant
frame2
frame2['debt'] = np.arange(5.) # a np array of floats
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0.0
two,2001,Ohio,1.7,1.0
three,2002,Ohio,3.6,2.0
four,2001,Nevada,2.4,3.0
five,2001,Nevada,2.9,4.0


In [52]:
# DF column assignments using a list or array must match the length of the DataFrame
# Series are assigned via index matching
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five',]) # only three of the 5 indices used
frame2['debt'] = val
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2001,Nevada,2.9,-1.7


In [53]:
# Delete columns with del, as with a dict
frame2['eastern'] = frame2.state == 'Ohio'
del frame2['eastern']
frame2.columns

Index([u'year', u'state', u'pop', u'debt'], dtype='object')

In [54]:
# A nested dict of dicts turns the inner keys into the row index and the outer keys into columns
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
    # 2000, 2001, 2002 become rows
    # Nevada and Ohio become columns
pop
frame3 = pd.DataFrame(pop)
frame3

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [55]:
# Passing an explicit index selects those row labels from the inner dict keys. 
pd.DataFrame(pop, index=[2001, 2002, 2003])
    # 2001 and 2002 match inner dict keys; 2003 has no data, so its row is all NaN (2000 is left out)

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2003,,


In [56]:
# You can pass Series variables directly into the dataframe creation and get equivalent behavior
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]} # Specifically the first two values of the Nevada column
pd.DataFrame(pdata)

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7


In [57]:
# Setting names of the index and the columns
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3

state,Nevada,Ohio
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [58]:
# Indexes are objects that hold axis labels and other metadata
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index # an object with values that are the index of the obj

Index([u'a', u'b', u'c'], dtype='object')

In [59]:
# index = obj.index binds the same Index object rather than a copy, so a change to index would be a change to obj.index
# Index objects are immutable, so assignments like the one below are not allowed
index[1] = 'd'

TypeError: Index does not support mutable operations
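
Since an Index cannot be edited element by element, relabeling has to produce a new index. The sketch below shows one way to do that with rename, which returns a new Series. The names mutated and renamed are made up for illustration.

```python
import pandas as pd

obj = pd.Series(range(3), index=['a', 'b', 'c'])

# Item assignment on an Index raises TypeError
try:
    obj.index[1] = 'd'
    mutated = True
except TypeError:
    mutated = False

# rename() builds a new Series with a relabeled index; obj is untouched
renamed = obj.rename(index={'b': 'd'})
```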

In [60]:
# Example of assigning an index as a pre-defined Index object
index = pd.Index(np.arange(3))
obj2 = pd.Series([1.5, -2.5, 0], index=index)
obj2.index

Int64Index([0, 1, 2], dtype='int64')

In [61]:
# Reindexing creates a new object with the data rearranged to match the new index.  
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
    # The index is not in alphabetical order
obj
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
    # The values are rearranged to follow the new label order; each value stays with its label
    # A new label, 'e', has no data and gets NaN
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [62]:
# To avoid NaN's, use fillna().  reindex also takes fill_value, which fills only the labels introduced by the reindex; fillna() fills every NaN.
obj.reindex(['a', 'b', 'c', 'd', 'e']).fillna(0)

a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
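
The two fill options give the same result in the cell above because obj has no missing values of its own. A small sketch with a NaN already in the data shows where they differ. The Series s and the result names are made up for illustration.

```python
import numpy as np
import pandas as pd

# 'b' is already missing before any reindexing happens
s = pd.Series([4.5, np.nan, -5.3], index=['d', 'b', 'a'])

# fill_value only fills labels that the reindex introduces ('e')
filled_new_only = s.reindex(['a', 'b', 'd', 'e'], fill_value=0)

# fillna(0) fills every NaN, including the one that was already there
filled_all = s.reindex(['a', 'b', 'd', 'e']).fillna(0)
```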

In [63]:
# In a time-series, imposing a new index will often mean having NaNs.
# The index values 0, 2, 4 have gaps.  The values at 0, 2, and 4 are carried forward to fill 1, 3, 5, respectively
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [64]:
# Reindex can also alter column values, not just index values
# Example of a ROW reindex as above
frame = pd.DataFrame(np.arange(9).reshape(3, 3), index=['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'])
frame
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [65]:
# Example of a Column reindex
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)
    # Ohio is gone pecan, and Utah is added with NaN values

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


In [66]:
# Row and Column reindexing at 1 time is possible.  Interpolation only happens row-wise in that case
frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill', columns=states)
    # Reindexes the row, ffill's b with the value in a, and then changes the columns to the "States" value

Unnamed: 0,Texas,Utah,California
a,1,,2
b,1,,2
c,4,,5
d,7,,8


In [67]:
# Using .ix for label-indexing (.ix was removed in pandas 1.0; .loc and reindex cover this case):
frame.ix[['a','b','c','d'], states]

Unnamed: 0,Texas,Utah,California
a,1.0,,2.0
b,,,
c,4.0,,5.0
d,7.0,,8.0


In [68]:
# Dropping entries from an axis.  Can drop several at once by passing a list
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
new_obj
newer_obj = obj.drop(['c', 'd'])
new_obj, newer_obj

(a    0.0
 b    1.0
 d    3.0
 e    4.0
 dtype: float64, a    0.0
 b    1.0
 e    4.0
 dtype: float64)

In [69]:
# Drop columns in a similar way but have to specify axis = 1
data = pd.DataFrame(np.arange(16).reshape(4, 4), index=['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one', 'two', 'three', 'four'])
data.drop(['one', 'three'], axis=1)
data.drop(['one', 'four'], axis=1)

Unnamed: 0,two,three
Ohio,1,2
Colorado,5,6
Utah,9,10
New York,13,14


In [79]:
#8/24/2016
# Indexing, Selection, Filtering - location 2672
obj = pd.Series(np.arange(4.), index=['a','b','c','d'])
obj['b'], obj[1]
    # Both return the same scalar: 'b' is the label at position 1

(1.0, 1.0)

In [82]:
# Integer-based selection on the index differs from label-based selection
obj[1:3], obj[['b','c','d']]
    # obj[1:3] slices by position and excludes its endpoint.  A label slice such as obj['b':'d'] includes its endpoint
    # and returns the same rows as the label list used here

(b    1.0
 c    2.0
 dtype: float64, b    1.0
 c    2.0
 d    3.0
 dtype: float64)
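
The endpoint rule is easier to see with the explicit indexers. This is a minimal sketch using .iloc for positions and .loc for labels; the names by_position and by_label are made up for illustration.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_position = obj.iloc[1:3]   # positions 1 and 2; the stop position is excluded
by_label = obj.loc['b':'c']   # labels 'b' through 'c'; the stop label is included
```

Both selections return the rows labeled 'b' and 'c', even though the slices are written differently.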

In [86]:
# Index & selection with a Data Frame
data = pd.DataFrame(np.arange(16).reshape((4, 4)), index=['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one','two','three','four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [88]:
# Column-based selection is the simplest
data['two'], data[['three','one']]

(Ohio         1
 Colorado     5
 Utah         9
 New York    13
 Name: two, dtype: int64,           three  one
 Ohio          2    0
 Colorado      6    4
 Utah         10    8
 New York     14   12)

In [93]:
# Index-based selection of dataframes produces different outcomes based on the inputs
data.ix['Colorado',['two','three']]
    # The result is a Series: the selected columns become its index and the row label 'Colorado' becomes its name

two      5
three    6
Name: Colorado, dtype: int64

In [94]:
# Index and column specification simultaneously
data.ix[['Colorado','Utah'],[3,0,1]]
    # The result is just a reordered subset of the original DF's structure

Unnamed: 0,four,one,two
Colorado,7,4,5
Utah,11,8,9


In [96]:
# Also works with slices
data.ix[:'Utah','two']

Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int64
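
The .ix selections above can be written with .loc and .iloc, which is what pandas offers since .ix was removed. This is a sketch of equivalents for the three cells above, not the notebook's original code; the result names are made up for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Labels on both axes
row = data.loc['Colorado', ['two', 'three']]

# Label slice: the endpoint 'Utah' is included
up_to_utah = data.loc[:'Utah', 'two']

# Positions on both axes
by_position = data.iloc[[1, 2], [3, 0, 1]]

# Mixed case: row labels plus column positions, translated through data.columns
mixed = data.loc[['Colorado', 'Utah'], data.columns[[3, 0, 1]]]
```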

In [100]:
## Data Alignment and Arithmetic
# Series: Indexes are matched in arithmetic.  When they don't match a NaN results
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a','b','c','d'])
s2 = pd.Series([-2.1, -3.6, -1.5, 4, 3.1], index=['a','c','e','f','g'])
s1 + s2
    # Only labels a and c are shared.  The union of indexes is shown in the output, with NaN where a label is missing from either side

a    5.2
b    NaN
c   -0.2
d    NaN
e    NaN
f    NaN
g    NaN
dtype: float64

In [109]:
# DataFrame: Like with series, but Indexes and Columns are matched
df1 = pd.DataFrame(np.arange(9.).reshape((3,3)), columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4,3)), columns=list('bde'), index=['Utah','Ohio','Texas','Oregon'])
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


In [115]:
# Using fill-values
df1.add(df2).fillna(0)

Unnamed: 0,b,c,d,e
Colorado,0.0,0.0,0.0,0.0
Ohio,3.0,0.0,6.0,0.0
Oregon,0.0,0.0,0.0,0.0
Texas,9.0,0.0,12.0,0.0
Utah,0.0,0.0,0.0,0.0
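
Filling after the addition is not the same as passing fill_value to add. A short sketch with the same df1 and df2 shows the difference; the names after and during are made up for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Add first, then replace every NaN in the result with 0
after = df1.add(df2).fillna(0)

# Treat a value missing from one side as 0 while adding;
# cells missing from both sides stay NaN
during = df1.add(df2, fill_value=0)
```

For example, column 'c' exists only in df1, so after keeps 0 in that column while during keeps df1's own values.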


In [117]:
## Data Frames and Series combination
# NumPy example of subtracting the values in a single array from an array with the same number of columns but different rows
arr = np.arange(12.).reshape((3,4))
arr - arr[0]
    # arr[0] is broadcast: it is subtracted from every row of arr, column by column

array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])

In [123]:
# DF and Series example
frame = pd.DataFrame(np.arange(12.).reshape((4,3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [127]:
series = frame.ix['Utah']
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [128]:
# Broadcasting with index of Series matching on column of DF
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


In [133]:
# As before, if there are not the expected matches then the cell value is filled with NaN
series2 = pd.Series(range(3), index=['b','e','f'])
series2
frame + series2
    # 'd' isn't an index in the series and 'f' isn't a column in the DF 

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


In [141]:
# You can change how indexes and columns are matched in Series & DF arithmetic
series3 = frame['d']
''' Utah       1.0
    Ohio       4.0
    Texas      7.0
    Oregon    10.0'''
frame.sub(series3)
    # Just a bunch of NaN's because the ['b','d','e'] columns of Frame don't match any of the ['Utah','Ohio','Texas','Oregon'] series indexes

Unnamed: 0,Ohio,Oregon,Texas,Utah,b,d,e
Utah,,,,,,,
Ohio,,,,,,,
Texas,,,,,,,
Oregon,,,,,,,


In [142]:
# Specifying axis=0 matches the Series index against the DataFrame's row index and broadcasts across the columns.  
frame.sub(series3, axis=0)

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0
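
A common use of axis=0 matching is centering each row on its own mean. This is a small sketch using the same frame as above; the names row_means and centered are made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# One mean per row, indexed by the row labels
row_means = frame.mean(axis=1)

# axis=0 lines the Series up with the row index, so each row loses its own mean
centered = frame.sub(row_means, axis=0)
```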


In [149]:
## Function Application 
frame = pd.DataFrame(np.random.randn(4, 3), columns=['b','d','e'], index=['Utah','Ohio','Texas','Oregon'])
frame

Unnamed: 0,b,d,e
Utah,0.806038,-0.666838,-1.107427
Ohio,0.681895,0.637201,-0.230924
Texas,0.095892,-1.846964,0.179568
Oregon,-0.204658,0.891902,-0.841593


In [151]:
# Applying ufuncs
np.abs(frame)

Unnamed: 0,b,d,e
Utah,0.806038,0.666838,1.107427
Ohio,0.681895,0.637201,0.230924
Texas,0.095892,1.846964,0.179568
Oregon,0.204658,0.891902,0.841593


In [152]:
# Using a lambda
f = lambda x: x.max() - x.min()
frame.apply(f)
    # By default (axis=0) the lambda is called once per column and receives that column as a Series

b    1.010697
d    2.738865
e    1.286995
dtype: float64

In [153]:
frame.apply(f, axis=1)
    # Specifying axis=1 calls the lambda once per row

Utah      1.913465
Ohio      0.912819
Texas     2.026532
Oregon    1.733495
dtype: float64

In [154]:
# Using "apply" on a more complicated function that returns a Series
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

frame.apply(f)
    # The result is a DataFrame because the function is called once per column and returns 2 labeled values each time

Unnamed: 0,b,d,e
min,-0.204658,-1.846964,-1.107427
max,0.806038,0.891902,0.179568


In [158]:
## Applymap v. Apply

# applymap is element-wise.  apply calls the function once per column or row, typically to aggregate
format = lambda x: '%.2f' % x
#frame.apply(format)
    # This returns an error saying that the lambda expects a float not a series
frame.applymap(format)

Unnamed: 0,b,d,e
Utah,0.81,-0.67,-1.11
Ohio,0.68,0.64,-0.23
Texas,0.1,-1.85,0.18
Oregon,-0.2,0.89,-0.84


In [164]:
# Series.map applies a function element-by-element to a Series.  applymap is the DataFrame counterpart.
frame['e'].map(format)

Utah      -1.11
Ohio      -0.23
Texas      0.18
Oregon    -0.84
Name: e, dtype: object
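
The element-wise and per-column behaviours can be compared side by side without depending on applymap, which recent pandas releases deprecate in favour of DataFrame.map. The sketch below combines apply with Series.map instead. The two-row frame reuses a few of the random values shown above, and the names fmt, by_element and spread are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'b': [0.806038, -0.204658],
                      'e': [-1.107427, -0.841593]},
                     index=['Utah', 'Oregon'])

fmt = lambda x: '%.2f' % x

# Element-wise: apply hands over one column at a time, map formats each value in it
by_element = frame.apply(lambda col: col.map(fmt))

# Aggregating: one number per column
spread = frame.apply(lambda col: col.max() - col.min())
```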

In [177]:
## Sorting and Ranking

# Sorting by Index - Series
obj = pd.Series(range(4), index=['d','a','B','c'])
obj.sort_index()
    # Sorts by the index's value (i.e. by character code, so uppercase 'B' sorts before the lowercase letters)

B    2
a    1
c    3
d    0
dtype: int64

In [178]:
# Sorting by Index - DataFrame
# With a DataFrame can sort by index value or column value
frame = pd.DataFrame(np.arange(8).reshape((2,4)), index=['three','one'], columns=['d','a','b','c'])
frame.sort_index(axis=1)
    # Sort by the Column value
frame.sort_index(axis=1, ascending=False)
    # Sort by the Column value, descending

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


In [183]:
# Sorting by Value - Series
obj = pd.Series([4, 7, -3, 2, np.nan, 5])
obj.sort_values()
    # Missing values are put at the end by default

2   -3.0
3    2.0
0    4.0
5    5.0
1    7.0
4    NaN
dtype: float64

In [193]:
# Sorting by Value - Data Frame
# Main difference is that you have to specify 1 or more columns
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame.sort_values(by='b')
    # Note that the indexes are preserved when we sort

Unnamed: 0,a,b
2,0,-3
3,1,2
0,0,4
1,1,7
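
Passing a list to by sorts on several columns, with later columns breaking ties in earlier ones. This is a minimal sketch using the same frame; the names by_a_then_b and b_descending are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort on 'a' first, then on 'b' within each value of 'a'
by_a_then_b = frame.sort_values(by=['a', 'b'])

# Single column, largest first
b_descending = frame.sort_values(by='b', ascending=False)
```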


In [210]:
# Ranking - Determining order value
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
pd.DataFrame({'obj': obj, 'obj.rank()': obj.rank()
              , 'obj.rank(method=\'first\')': obj.rank(method='first')
              , 'obj.rank(method=\'max\')': obj.rank(method='max')
              , 'obj.rank(ascending=False)': obj.rank(ascending = False, method='max')})
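    # Rank gives each value's position in sorted order.  
    # By default, ties are assigned the mean rank (may be a non-integer)
    # With method='first', ties are broken by order of appearance: the first occurrence in the original array
    #     gets the lowest rank, the next gets the next-lowest, etc.
    # With method='max', every tied value gets the highest rank in its tied group
    # ascending=False ranks from largest to smallest (combined with method='max' in the last entry above)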

Unnamed: 0,obj,obj.rank(),obj.rank(ascending=False),obj.rank(method='first'),obj.rank(method='max')
0,7,6.5,2.0,6.0,7.0
1,-5,1.0,7.0,1.0,1.0
2,7,6.5,2.0,7.0,7.0
3,4,4.5,4.0,4.0,5.0
4,2,3.0,5.0,3.0,3.0
5,0,2.0,6.0,2.0,2.0
6,4,4.5,4.0,5.0,5.0
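
The tie-breaking methods are easiest to compare on the two tied pairs in obj (the 7s and the 4s). This sketch also adds method='min', which is listed in the notes at the top but not in the table above; the dict name ranks is made up for illustration.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

ranks = {
    'average': obj.rank(),                # ties share the mean of their ranks
    'first': obj.rank(method='first'),    # ties broken by order of appearance
    'min': obj.rank(method='min'),        # ties all get the lowest rank in the group
    'max': obj.rank(method='max'),        # ties all get the highest rank in the group
}
```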


In [216]:
## Duplicate Indexes
obj = pd.Series(range(5), index = ['a','a','b','b','c'])
obj.index.is_unique
    # is_unique is a property, not a method, so it takes no parentheses

False

In [None]:
## Descriptive Statistics

In [223]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], index=list('abcd'), columns=['one','two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [224]:
df.sum() # Sums down the rows, giving one total per column  # Output is a series

one    9.25
two   -5.80
dtype: float64

In [225]:
df.sum(axis=1) # Sums across the columns, giving one total per row  # Output is a series

a    1.40
b    2.60
c     NaN
d   -0.55
dtype: float64

In [227]:
df.mean(axis=1, skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
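
The notes at the top mention reductions by index level (df.sum(level=...)). A sketch of the same idea with groupby(level=...), which works whether or not the level keyword is available, follows. The data and the names key1, key2 and x are made up for illustration.

```python
import pandas as pd

idx = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                                names=['key1', 'key2'])
df = pd.DataFrame({'x': [1, 2, 3, 4]}, index=idx)

# Sum over all rows that share the same value of the 'key2' level
by_key2 = df.groupby(level='key2').sum()

# Same idea on the outer level
by_key1 = df.groupby(level='key1').sum()
```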

In [251]:
# Correlation and Covariance -- using Yahoo Finance data
from pandas_datareader import data as web
all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = web.get_data_yahoo(ticker)
    
# iteritems() is Python 2; in Python 3 use all_data.items()
price = pd.DataFrame({tic: data['Adj Close'] for tic, data in all_data.iteritems()})
volume = pd.DataFrame({tic: data['Volume'] for tic, data in all_data.iteritems() })

In [258]:
# Off-thread -- Just concat this data to see what it looks like
pd.concat([price, volume], axis=1).sort_index(axis=1)

Unnamed: 0_level_0,AAPL,AAPL,GOOG,GOOG,IBM,IBM,MSFT,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2010-01-04,27.990226,123432400,313.062468,3927000,113.304536,6155300,25.884104,38409100
2010-01-05,28.038618,150476200,311.683844,6031900,111.935822,6841400,25.892466,49749600
2010-01-06,27.592626,138040000,303.826685,7987100,111.208683,5605300,25.733566,58182400
2010-01-07,27.541619,119282800,296.753749,12876600,110.823732,5840600,25.465944,50559700
2010-01-08,27.724725,111902700,300.709808,9483900,111.935822,4197200,25.641571,51197400
2010-01-11,27.480148,115557400,300.255255,14479800,110.763844,5730400,25.315406,68754700
2010-01-12,27.167562,148614900,294.945572,9742900,111.644958,8081500,25.148142,65912100
2010-01-13,27.550775,151473000,293.252243,13041800,111.405433,6455400,25.382312,51863500
2010-01-14,27.391211,108223500,294.630868,8511900,113.184773,7111800,25.892466,63228100
2010-01-15,26.933449,148516900,289.710772,10909600,112.731385,8494400,25.808835,79913200


In [260]:
# See the percentage returns (the day-over-day change)
returns = price.pct_change()
returns.tail()

Unnamed: 0_level_0,AAPL,GOOG,IBM,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2016-08-17,-0.001463,0.003564,-0.001618,0.002089
2016-08-18,-0.001282,-0.00309,0.005734,0.000695
2016-08-19,0.002567,-0.002675,-0.008181,0.000347
2016-08-22,-0.007772,-0.004217,-0.00025,0.000868
2016-08-23,0.003133,-9.1e-05,0.001625,0.003815


In [284]:
# Correlations and Covariances - Series
pd.DataFrame({'corr': returns['MSFT'].corr(returns['IBM'])
              ,'cov': returns['MSFT'].cov(returns['IBM'])}
              , index=['series'])
    # equivalent to returns.MSFT.corr(returns.IBM)

Unnamed: 0,corr,cov
series,0.502104,9e-05


In [290]:
# Correlations and Covariances - Data Frame
returns.tail().corr()
    # The full correlation matrix is returned

Unnamed: 0,AAPL,GOOG,IBM,MSFT
AAPL,1.0,0.333753,-0.242712,0.391183
GOOG,0.333753,1.0,-0.066199,0.596282
IBM,-0.242712,-0.066199,1.0,0.27492
MSFT,0.391183,0.596282,0.27492,1.0


In [291]:
returns.tail().cov()
    # The full covariance matrix is returned

Unnamed: 0,AAPL,GOOG,IBM,MSFT
AAPL,1.9e-05,5e-06,-5e-06,2e-06
GOOG,5e-06,1e-05,-1e-06,3e-06
IBM,-5e-06,-1e-06,2.6e-05,2e-06
MSFT,2e-06,3e-06,2e-06,2e-06


In [295]:
# Computing pair-wise correlations -- DataFrame with a Series
returns.corrwith(returns['IBM'])
    # One correlation per column of returns, each computed against the IBM series

AAPL    0.388022
GOOG    0.405625
IBM     1.000000
MSFT    0.502104
dtype: float64

In [298]:
# Computing pair-wise correlations - DataFrame with another DataFrame
returns.corrwith(volume[['AAPL','GOOG']])
    # Only matching columns are correlated
    # Here the daily returns show almost no correlation with volume.

AAPL   -0.078426
GOOG   -0.006641
IBM          NaN
MSFT         NaN
dtype: float64
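
The correlation calls can be tried without downloading price data. The sketch below uses a small made-up returns table (columns A, B, C, D are hypothetical, not real tickers) where B is exactly twice A and C is the negative of A, so the expected correlations are easy to reason about.

```python
import numpy as np
import pandas as pd

base = np.array([0.01, -0.02, 0.03, 0.0])
returns = pd.DataFrame({'A': base,
                        'B': 2 * base,                      # moves with A
                        'C': -base,                         # moves against A
                        'D': [0.02, 0.01, -0.01, 0.03]})    # unrelated values

pair = returns['A'].corr(returns['B'])    # Series vs. Series: a single number
matrix = returns.corr()                   # DataFrame: full correlation matrix
vs_a = returns.corrwith(returns['A'])     # each column against one Series
```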

In [None]:
# Unique Value, Value Counts, Membership

In [302]:
obj = pd.Series(['c','a','d','a','a','b','b','c','c'])

In [303]:
# Unique Values
obj.unique()

array(['c', 'a', 'd', 'b'], dtype=object)

In [305]:
# Value Counts
obj.value_counts()
# Equivalent: pd.value_counts(obj.values)

c    3
a    3
b    2
d    1
dtype: int64

In [312]:
# Vectorized Set Membership
mask = obj.isin(['b','c'])
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object

In [323]:
# Use these Series methods with DataFrames through apply
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4], 'Qu2': [2, 3, 1, 2, 3], 'Qu3': [1, 9, 2, 4, 4]})
data

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,9
2,4,1,2
3,3,2,4
4,4,3,4


In [324]:
# Computes a histogram.  The left hand side is a list of the unique values (notice that it goes from 1,2,3,4 --> 9)
data.apply(pd.value_counts).fillna(0)

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
9,0.0,0.0,1.0
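
The same table can be built by calling the Series method inside apply, which avoids the top-level pd.value_counts function (recent pandas releases deprecate it). This is a sketch on the same data; the name counts is made up for illustration.

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 9, 2, 4, 4]})

# Count values column by column; values absent from a column come back as NaN, then 0
counts = data.apply(lambda col: col.value_counts()).fillna(0)
```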


In [328]:
## Missing Data
# Identify missing data - Series
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [342]:
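# Drop missing data - Series - Function approach
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
    # Equivalent to data = pd.Series([1, np.nan, 3.5, np.nan, 7])
data.dropna()
    # dropna() returns a new Series and leaves data untouched
data
    # so the NaNs are still here, as the output shows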

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64
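
A short sketch of keeping the result of dropna(), since the call above does not change data in place. The name cleaned is my own.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

cleaned = data.dropna()            # new Series without the NaNs
same_thing = data[data.notnull()]  # boolean-mask version of the same filter

print(cleaned)
print(data.isnull().sum())         # the original still holds its NaNs
```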

In [345]:
# Drop missing data - Series - Boolean approach
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
    # Equivalent to data = pd.Series([1, np.nan, 3.5, np.nan, 7])
data[data.isnull()==False]
data[data.notnull()]
    # The two expressions are equivalent (and give the same result as data.dropna()); only the last one is displayed

0    1.0
2    3.5
4    7.0
dtype: float64

In [360]:
# Drop missing data in Data Frames
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])
    # Each row has three values, so the constructor creates columns 0-2 only; the all-NaN column 3 is added on the next line
data[3] = np.nan
data

Unnamed: 0,0,1,2,3
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [361]:
# Drop missing data in Data Frames - Function approach
# BE CAREFUL - df.dropna() by default drops any row that contains any NA value
# Column 3 is all NaN, so every row is dropped and the result is empty
cleaned = data.dropna()
cleaned

Unnamed: 0,0,1,2,3


In [363]:
# Setting how='all' requires that all values are NaN before the row is dropped
data.dropna(how='all')

Unnamed: 0,0,1,2,3
0,1.0,6.5,3.0,
1,1.0,,,
3,,6.5,3.0,


In [364]:
# Drop columns (as opposed to rows) with missing data the usual way: axis=1
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [368]:
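# Drop rows that have fewer than thresh non-NA values.
df = pd.DataFrame(np.random.randn(7, 3))
df.loc[:4, 1] = np.nan   # .loc replaces .ix (removed from later pandas); label slices include the end point
df.loc[:2, 2] = np.nan
df.dropna(thresh=3)
    # Only rows 5 and 6 have all three values present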

Unnamed: 0,0,1,2
5,-0.976653,0.596823,-0.376416
6,-1.402458,-1.166058,-2.379381
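
A side-by-side sketch of the three row-dropping rules covered above, on the small frame from the earlier cell (without the extra all-NaN column). The variable names are mine.

```python
import numpy as np
import pandas as pd

NA = np.nan
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])

any_na = data.dropna()                # default how='any': keep rows with no NA at all
all_na = data.dropna(how='all')       # drop a row only when every value is NA
at_least_two = data.dropna(thresh=2)  # keep rows with at least 2 non-NA values

print(any_na.index.tolist())
print(all_na.index.tolist())
print(at_least_two.index.tolist())
```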


In [369]:
## Filling missing data with a constant
df.fillna(0)

Unnamed: 0,0,1,2
0,-0.171552,0.0,0.0
1,-1.334006,0.0,0.0
2,-0.677129,0.0,0.0
3,-0.139236,0.0,0.036122
4,-0.412806,0.0,-0.811747
5,-0.976653,0.596823,-0.376416
6,-1.402458,-1.166058,-2.379381


In [371]:
# Filling missing data with a dict matches the dict key to the df column name
df.fillna({1: 0.5, 2: -1})

Unnamed: 0,0,1,2
0,-0.171552,0.5,-1.0
1,-1.334006,0.5,-1.0
2,-0.677129,0.5,-1.0
3,-0.139236,0.5,0.036122
4,-0.412806,0.5,-0.811747
5,-0.976653,0.596823,-0.376416
6,-1.402458,-1.166058,-2.379381


In [374]:
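# Filling missing data by forward filling
df = pd.DataFrame(np.random.randn(6, 3))
df.loc[2:, 1] = np.nan; df.loc[4:, 2] = np.nan   # .loc replaces .ix (removed from later pandas)
df.fillna(method='ffill')
    # Newer pandas spells this df.ffill()
    # NOTE: the table below shows df before filling; ffill would carry 1.306762 down column 1 and 0.052512 down column 2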

Unnamed: 0,0,1,2
0,-1.600411,0.220101,1.854071
1,1.412282,1.306762,0.958741
2,-0.852246,,-0.113506
3,1.48679,,0.052512
4,-0.598562,,
5,-1.310831,,
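
A small deterministic sketch of forward filling, using the ffill() method and its limit argument. The frame and column names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, np.nan, np.nan],
                   'y': [np.nan, 2.0, np.nan, 4.0]})

filled = df.ffill()           # carry the last valid value forward
limited = df.ffill(limit=1)   # fill at most one consecutive NaN per gap

print(filled)
print(limited)
```

A leading NaN (row 0 of 'y') stays NaN because there is nothing earlier to carry forward.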


In [376]:
## Hierarchical Indexing - A Series with a MultiIndex
data = pd.Series(np.random.randn(10), index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'], [1,2,3,1,2,3,1,2,2,3]])
    # The index is a list of two arrays: the first becomes the outer level, the second the inner level
    # The data keeps the order it was given in; the display just leaves out repeated outer labels
data

a  1    0.842251
   2    0.445471
   3    1.499606
b  1   -0.499740
   2   -0.847721
   3   -1.222757
c  1   -0.167686
   2    0.523020
d  2   -1.224501
   3   -1.518357
dtype: float64

In [398]:
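# Selection - Can subselect a Series by its index levels, working from the outside in
data[:'c']          # Example 1: slice of outer labels, up to and including 'c'
data.loc[['b','d']] # Example 2: list of outer labels (.loc replaces .ix); only this result is displayed

# You can use the [] notation to select from the inner levels too, but in general DO NOT DO THIS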

b  1   -0.499740
   2   -0.847721
   3   -1.222757
d  2   -1.224501
   3   -1.518357
dtype: float64
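
A sketch of explicit selection on a two-level index, with np.arange standing in for the random values so the results are predictable. Using xs for the inner level is my suggestion for avoiding ambiguous [] indexing, not something the notes above prescribe.

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(10),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

outer_one = data.loc['b']           # one outer label: result is indexed by the inner level
outer_many = data.loc[['b', 'd']]   # several outer labels
outer_slice = data.loc['b':'c']     # slice of outer labels, end point included
inner = data.xs(2, level=1)         # every entry whose inner label is 2

print(outer_many)
print(inner)
```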

In [399]:
# MultiIndex series to DataFrame --- unstack() by default moves the inner-most level to the columns
data.unstack()

Unnamed: 0,1,2,3
a,0.842251,0.445471,1.499606
b,-0.49974,-0.847721,-1.222757
c,-0.167686,0.52302,
d,,-1.224501,-1.518357
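
A sketch of choosing which level unstack() moves to the columns, again with predictable values in place of the random ones. Missing (outer, inner) combinations show up as NaN, as in the output above.

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(10),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

wide = data.unstack()               # default: inner-most level becomes the columns
wide_outer = data.unstack(level=0)  # move the outer level to the columns instead

print(wide)
print(wide_outer)
```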


In [406]:
# Columns can have a hierarchical index
# The indexes and columns can be named.
frame = pd.DataFrame(np.arange(12).reshape((4,3)), 
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]], 
                     columns=[['Ohio','Ohio','Colorado'],['Green','Red','Green']])
frame.index.names=['key1','key2']
frame.columns.names=['state','color']
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [408]:
# Creating a multi-index on its own, for example to reuse across data frames
pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']], names=['state', 'color'])

MultiIndex(levels=[[u'Colorado', u'Ohio'], [u'Green', u'Red']],
           labels=[[1, 1, 0], [0, 1, 0]],
           names=[u'state', u'color'])

In [411]:
# Swapping Levels
frame.swaplevel('key1','key2')

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11


In [414]:
# Sorting levels
# sortlevel was removed from later pandas; sort_index(level=...) gives the same result
frame.sort_index(level='key2')

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
b,1,6,7,8
a,2,3,4,5
b,2,9,10,11
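
A sketch combining the two operations above: sorting on one named level, and swapping the levels before sorting so the new outer level drives the order. It rebuilds the same frame so it can run on its own.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Sort on key2 first; the level order (key1, key2) is unchanged.
by_key2 = frame.sort_index(level='key2')

# Swap the levels, then sort: key2 is now the outer level.
swapped_sorted = frame.swaplevel('key1', 'key2').sort_index()

print(by_key2)
print(swapped_sorted)
```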


In [415]:
# Summary statistics and MultiIndexes
frame.sum() # Sums each column down the rows: one total per (state, color) column

state     color
Ohio      Green    18
          Red      22
Colorado  Green    26
dtype: int64

In [418]:
frame.sum(level='key2') # Sums each column within each key2 group
    # Newer pandas: frame.groupby(level='key2').sum()

state,Ohio,Ohio,Colorado
color,Green,Red,Green
key2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,6,8,10
2,12,14,16


In [421]:
frame.sum(level='color',axis=1) # Within each row, sums the columns that share a color

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,2,1
a,2,8,4
b,1,14,7
b,2,20,10
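
The level= argument to sum() was dropped from later pandas releases. A sketch of the same three aggregations with groupby follows; transposing for the column-wise case is one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

column_totals = frame.sum()                           # one total per column
by_key2 = frame.groupby(level='key2').sum()           # totals within each key2 group
by_color = frame.T.groupby(level='color').sum().T     # per row, totals by column color

print(by_key2)
print(by_color)
```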


In [422]:
# Using DataFrame columns as the index
frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1), 'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'], 'd': [0, 1, 2, 0, 1, 2, 3]})
frame

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


In [424]:
# set_index moves columns c and d into a hierarchical row index
frame2 = frame.set_index(['c','d'])
frame2

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


In [425]:
# reset_index does the opposite: the index levels become ordinary columns again
frame2.reset_index()

Unnamed: 0,c,d,a,b
0,one,0,0,7
1,one,1,1,6
2,one,2,2,5
3,two,0,3,4
4,two,1,4,3
5,two,2,5,2
6,two,3,6,1
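
A sketch of the set_index / reset_index round trip, plus the drop=False option, which keeps the columns in the frame as well as in the index. The names kept and restored are mine.

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})

frame2 = frame.set_index(['c', 'd'])            # c and d leave the columns
kept = frame.set_index(['c', 'd'], drop=False)  # c and d stay as columns too

# reset_index puts the levels first, so reorder to compare with the original.
restored = frame2.reset_index()[['a', 'b', 'c', 'd']]

print(frame2.loc[('two', 3)])
print(restored)
```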


In [444]:
#### Appendix - Panel Data - Time series of several metrics for several similar things.
# E.g. metrics for stocks over time, or health records for individuals over time.
# NOTE: pd.Panel was removed in pandas 0.25, so this cell only runs on older versions.
from pandas_datareader import data as web
pdata = pd.Panel(dict((stk, web.get_data_yahoo(stk)) for stk in ['AAPL', 'GOOG', 'MSFT', 'DELL']))
pdata = pdata.swapaxes('items', 'minor')
    # Swap so the items are the metrics (Open ... Adj Close) and the minor axis holds the tickers; dates stay on the major axis

In [445]:
# Explore pdata
pdata

<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 1692 (major_axis) x 4 (minor_axis)
Items axis: Open to Adj Close
Major_axis axis: 2010-01-04 00:00:00 to 2016-08-23 00:00:00
Minor_axis axis: AAPL to MSFT

In [449]:
pdata['Adj Close']
pdata.loc[:, '6/1/2012', :]
    # All metrics, on one particular day, for all stocks.  Only this second result is displayed.

Unnamed: 0,Open,High,Low,Close,Volume,Adj Close
AAPL,569.159996,572.650009,560.520012,560.989983,130246900.0,73.371509
DELL,12.15,12.3,12.045,12.07,19397600.0,11.67592
GOOG,571.790972,572.650996,568.350996,570.981,6138700.0,285.205295
MSFT,28.76,28.959999,28.440001,28.450001,56634300.0,25.262972
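
Because pd.Panel no longer exists in current pandas, here is a sketch of the same kind of lookup with a MultiIndex DataFrame instead. The tickers AAA and BBB and all numbers are made up; nothing is downloaded.

```python
import pandas as pd

dates = pd.date_range('2016-05-31', periods=3, freq='D', name='Date')

# One small frame per (hypothetical) ticker, all sharing the same dates.
per_ticker = {
    'AAA': pd.DataFrame({'Close': [10.0, 11.0, 12.0], 'Volume': [100, 110, 120]}, index=dates),
    'BBB': pd.DataFrame({'Close': [20.0, 21.0, 22.0], 'Volume': [200, 210, 220]}, index=dates),
}

# Stack them into one frame indexed by (Date, ticker).
stacked = pd.concat(per_ticker, names=['ticker']).swaplevel().sort_index()

# One metric for all tickers: dates down the rows, tickers across the columns.
close_by_ticker = stacked['Close'].unstack('ticker')

# All metrics for all tickers on a single day.
one_day = stacked.xs(pd.Timestamp('2016-06-01'), level='Date')

print(close_by_ticker)
print(one_day)
```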


In [450]:
# Just turn this into a stacked data frame (indexed by date and stock) and come back from the land of the lost
stacked = pdata.loc[:, '5/30/2016':, :].to_frame()
    # all metrics, from 5/30/2016 onwards, all stocks
stacked

Unnamed: 0_level_0,Unnamed: 1_level_0,Open,High,Low,Close,Volume,Adj Close
Date,minor,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2016-05-31,AAPL,99.599998,100.400002,98.820000,99.860001,42307200.0,99.321953
2016-05-31,GOOG,731.739990,739.729980,731.260010,735.719971,2129500.0,735.719971
2016-05-31,MSFT,52.259998,53.000000,52.080002,53.000000,37653100.0,52.671715
2016-06-01,AAPL,99.019997,99.540001,98.330002,98.459999,29173300.0,97.929494
2016-06-01,GOOG,734.530029,737.210022,730.659973,734.150024,1253600.0,734.150024
2016-06-01,MSFT,52.439999,52.950001,52.439999,52.849998,25324800.0,52.522643
2016-06-02,AAPL,97.599998,97.839996,96.629997,97.720001,40191600.0,97.193484
2016-06-02,GOOG,732.500000,733.020020,724.169983,730.400024,1341800.0,730.400024
2016-06-02,MSFT,52.639999,52.740002,51.840000,52.480000,22840800.0,52.154936
2016-06-03,AAPL,97.790001,98.269997,97.449997,97.919998,28504900.0,97.392403
