<h1 id="tocheading">Table of Contents and Notebook Setup</h1>
<div id="toc"></div>

In [1]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')


In [2]:
import pandas as pd
import numpy as np

# Some Math Prereqs

The <b> covariance </b> of two variables X and Y, each measured n times, is defined as

$$cov(X,Y) = \frac{1}{n-1}\sum_{i=1}^n (x_i-\mu_x)(y_i-\mu_y) $$

i represents the 'i'th measurement 

n is the total number of measurements 

$x_i$ and $y_i$ are individual measurements

$\mu_x$ and $\mu_y$ are the mean values of X and Y

The covariance tells us whether the two variables deviate from their expected values (means) at the same time. If both deviate in the same direction in a given measurement (both above or both below their means), that measurement contributes a positive term to the sum. If one deviates positively and the other negatively, it contributes a negative term. A large positive covariance means the variables tend to rise and fall together; a large negative covariance means they are also related, but one tends to go up as the other goes down.

The problem is that the size of the covariance depends on the units we use for measurement. We want a quantity that lies in the same range for every pair of variables, so we divide by the standard deviations, which are the square roots of the <b> variances </b>:

$$\sigma_x^2 \equiv cov(X,X) = \frac{1}{n-1}\sum_{i=1}^n (x_i-\mu_x)^2 $$

We define the <b> correlation </b> (or more precisely: the <i> linear correlation </i>) as follows:

$$corr(X,Y)=\frac{cov(X,Y)}{\sqrt{\sigma_x^2 \sigma_y^2}}=\frac{cov(X,Y)}{\sigma_x \sigma_y}$$

The inequality $-1 \leq corr(X,Y) \leq 1$ always holds. This follows from the Cauchy-Schwarz inequality, treating the deviations $x_i-\mu_x$ and $y_i-\mu_y$ as the components of two vectors.

If $corr(X,Y)=1$ then the values are perfectly correlated (in the vector space of measurements, $x_i-\mu_x$ and $y_i-\mu_y$ point in the same direction). If $corr(X,Y)=-1$ then the values are perfectly anticorrelated (in the vector space they point in opposite directions). A value of $corr(X,Y)=0$ means the values are linearly uncorrelated.
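The definitions above can be checked numerically. The following is a minimal sketch that assumes a small set of made-up measurements; the helper names `sample_cov` and `sample_corr` are hypothetical and exist only for this illustration. It compares the hand-written formulas with NumPy's `np.cov` and `np.corrcoef`, which use the same $1/(n-1)$ normalisation by default.

```python
import numpy as np

# Made-up measurements, chosen only for illustration.
x = np.array([1.0, 2.0, 4.0, 7.0])
y = np.array([1.5, 3.0, 3.5, 8.0])

def sample_cov(a, b):
    """Sample covariance with the 1/(n-1) normalisation."""
    n = len(a)
    return np.sum((a - a.mean()) * (b - b.mean())) / (n - 1)

def sample_corr(a, b):
    """Linear correlation: covariance scaled by both standard deviations."""
    return sample_cov(a, b) / np.sqrt(sample_cov(a, a) * sample_cov(b, b))

cov_xy = sample_cov(x, y)
corr_xy = sample_corr(x, y)

# NumPy returns 2x2 matrices; the off-diagonal entry is the X-Y value.
cov_numpy = np.cov(x, y)[0, 1]
corr_numpy = np.corrcoef(x, y)[0, 1]

# A positive linear rescaling gives +1, a sign flip gives -1.
corr_same = sample_corr(x, 3.0 * x + 2.0)
corr_opposite = sample_corr(x, -x)

print(cov_xy, cov_numpy)
print(corr_xy, corr_numpy)
print(corr_same, corr_opposite)
```

Because the correlation divides out the scale of each variable, multiplying `x` by 3 and adding 2 leaves it at 1, whereas the covariance would change by a factor of 3.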

Such information is useful when comparing stock trends. Suppose that Amazon always drops the day after Microsoft goes up, and we find a strict anticorrelation ($corr(X,Y)=-1$). In the future, when Microsoft goes up, we may want to sell our Amazon stock, since we expect it to drop, and then buy it back the next day at a discount.

# Basic Mathematical Functions of Pandas

pandas objects are equipped with a variety of mathematical and statistical methods, most of which also handle missing data.

In [3]:
df = pd.DataFrame([[1.4,np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,-1.3]],
                  index=['a','b','c','d'], columns=['one','two'])
df

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3


## The Sum Method

Use <b> sum </b> to return the column sums. 

In [4]:
df.sum()

one    9.25
two   -5.80
dtype: float64

We can also use axis='columns' to sum <i> across </i> the columns instead.

In [5]:
df.sum(axis='columns')

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
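Note that row `c`, which contains only missing values, sums to 0.0 because NaN entries are skipped by default. As a small sketch of one way to change this, the `min_count` argument of `sum` asks for a minimum number of non-missing values before a sum is reported. The DataFrame is rebuilt here so the example runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# Default behaviour: NaN is skipped, so the all-NaN row 'c' sums to 0.0.
row_sums = df.sum(axis='columns')

# min_count=1 requires at least one non-NaN value; otherwise the result is NaN.
row_sums_strict = df.sum(axis='columns', min_count=1)

print(row_sums)
print(row_sums_strict)
```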

## The Mean Method

By default, <b> mean </b> skips NA values. Passing skipna=False turns this off, so any row (or column) that contains an NA value returns NaN:

In [6]:
df.mean(axis='columns', skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
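To make the effect of `skipna` concrete, the sketch below computes the row means both ways on the same DataFrame, and also shows `dropna` as one possible way to discard incomplete rows entirely before averaging. The variable names are chosen for this illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# skipna=True (the default): average over whatever values are present.
means_default = df.mean(axis='columns')

# skipna=False: a single NaN in a row makes that row's mean NaN.
means_strict = df.mean(axis='columns', skipna=False)

# dropna() removes rows with any NaN, so only complete rows are averaged.
means_complete = df.dropna().mean(axis='columns')

print(means_default)
print(means_strict)
print(means_complete)
```

With the default, row `a` averages to 1.4 (its only present value), while row `c` is NaN in every case because it has no values at all.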

## Accumulation Method (Integration of Rows/Columns)

In [7]:
df.cumsum()

    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8


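We can use this for numerical integration: multiplying each value by a step size dx before accumulating turns the running sum into a Riemann-sum approximation of the integral down each column.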

In [8]:
dx = 0.001
(df*dx).cumsum()

       one     two
a  0.00140     NaN
b  0.00850 -0.0045
c      NaN     NaN
d  0.00925 -0.0058
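To see that this really behaves like an integral, here is a minimal sketch that applies the same `(values * dx).cumsum()` pattern to a function with a known antiderivative. The choice of $f(x)=x^2$ on $[0,1]$ and the grid of 1000 points are assumptions made for illustration; the exact integral is $1/3$.

```python
import numpy as np
import pandas as pd

n = 1000
dx = 1.0 / n
x = np.arange(n) * dx            # left edges of the n intervals covering [0, 1)

f = pd.Series(x**2, index=x)

# Left Riemann sum: entry k approximates the integral of x^2 from 0 to x_k + dx.
running_integral = (f * dx).cumsum()

approx_total = running_integral.iloc[-1]
exact_total = 1.0 / 3.0

print(approx_total, exact_total)
```

The approximation improves as `dx` shrinks; with a left Riemann sum the error falls roughly in proportion to `dx`.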


## Basic Statistical Method 

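We can use the <b> describe </b> method to get summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each column, with NA values excluded.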

In [9]:
df.describe()

            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000
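Each row of this summary corresponds to an individual DataFrame method. As a quick sketch, the rows can be compared against `count`, `mean`, `std`, and `quantile` computed separately on the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

summary = df.describe()

# Each statistic is also available as its own method; NaN is skipped in all of them.
counts = df.count()
means = df.mean()
stds = df.std()                  # sample standard deviation (divides by n - 1)
lower_quartile = df.quantile(0.25)

print(summary.loc['std'])
print(stds)
```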


## Summary

See Table 5-8 on page 160 of the textbook for the full list of these simple methods.

# Correlation and Covariance

## Introduction to Correlation in DataFrames

Correlation and Covariance look at the relationship between two data sets. Below we compare stock datasets. 

In [10]:
import pandas_datareader.data as web
stocks = ['AMZN', 'GOOG', 'AAPL', 'TD', 'JNJ', 'IBM']

# pd.datetime was removed in recent pandas releases; pd.Timestamp works in its place
start = pd.Timestamp(2017, 7, 29)
end = pd.Timestamp(2018, 8, 2)
f1 = web.DataReader(stocks, 'iex', start, end)
f1['open'].head() #opening price for the stock on that day

Symbols         AAPL     AMZN    GOOG       IBM       JNJ       TD
date
2017-07-31  147.0895  1019.05  941.89  137.0588  128.7204  49.5724
2017-08-01  154.1153   996.11  932.38  137.6474  129.7238  49.0433
2017-08-02  156.2937  1001.77  928.61  137.7613  128.6620  49.2164
2017-08-03  154.1055   999.47  930.34  137.1063  128.4964  49.2260
2017-08-04  153.1439   989.68  926.75  137.6474  130.2790  49.0433


Let's apply a function to see how much each stock changes between the open and the close of each day. Recall that <i> apply </i> can pass either the rows or the columns of a DataFrame to a function; with axis='columns' the function below receives one row (one trading day) at a time.

In [11]:
def find_change(x, stock):
    return x['close'][stock]-x['open'][stock]

stock_day_changes = pd.DataFrame([f1.apply(find_change, axis='columns', args=(stock,)) 
                                  for stock in stocks], index=stocks)
stock_day_changes = stock_day_changes.transpose()
stock_day_changes.head()

             AMZN   GOOG    AAPL      TD     JNJ     IBM
date
2017-07-31 -31.27 -11.39 -1.1480  0.0096  0.5650  0.2753
2017-08-01   0.08  -1.55  1.5013 -0.2117 -0.6430  0.1519
2017-08-02  -5.88   1.78 -2.0999  0.1539  0.0779 -0.6360
2017-08-03 -12.55  -6.69 -1.4523 -0.2693  1.4027  0.4841
2017-08-04  -2.10   1.21  0.3140 -0.2117 -0.5455  0.1519
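The same table can probably be obtained without a row-wise apply. The sketch below is one possible alternative, not the approach used in the cell above: it assumes the downloaded frame has two-level columns (price attribute on top, ticker symbol below) and stands in a small hand-built table, `prices`, for the downloaded data so that it runs without a network connection. The column level names are also assumptions.

```python
import numpy as np
import pandas as pd

# Small stand-in for the downloaded data: two attributes, two symbols, three days.
dates = pd.date_range('2017-07-31', periods=3, freq='D', name='date')
columns = pd.MultiIndex.from_product([['open', 'close'], ['AMZN', 'GOOG']],
                                     names=['Attributes', 'Symbols'])
prices = pd.DataFrame([[1019.05, 941.89, 987.78, 930.50],
                       [996.11, 932.38, 996.19, 930.83],
                       [1001.77, 928.61, 995.89, 930.39]],
                      index=dates, columns=columns)

# Selecting a top-level key gives one column per symbol, so a single
# subtraction is aligned by date and by symbol.
day_changes = prices['close'] - prices['open']

# Row-wise version, written in the same style as find_change, for comparison.
def find_change(row, stock):
    return row['close'][stock] - row['open'][stock]

via_apply = pd.DataFrame({s: prices.apply(find_change, axis='columns', args=(s,))
                          for s in ['AMZN', 'GOOG']})

print(day_changes)
```

The vectorized subtraction avoids calling a Python function once per row and per stock, which tends to matter for longer date ranges.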


Correlation for entire DataFrame:

In [12]:
stock_day_changes.corr()

          AMZN      GOOG      AAPL        TD       JNJ       IBM
AMZN  1.000000  0.732779  0.666508  0.411271  0.185849  0.441665
GOOG  0.732779  1.000000  0.646076  0.389808  0.334801  0.523690
AAPL  0.666508  0.646076  1.000000  0.379218  0.272717  0.472721
TD    0.411271  0.389808  0.379218  1.000000  0.269354  0.528172
JNJ   0.185849  0.334801  0.272717  0.269354  1.000000  0.386633
IBM   0.441665  0.523690  0.472721  0.528172  0.386633  1.000000


Or we can compute the correlation between a single pair of columns:

In [13]:
stock_day_changes['AMZN'].corr(stock_day_changes['GOOG'])

0.73277921990735062

Or we can use <i> corrwith </i> to correlate every column with one chosen Series:

In [14]:
stock_day_changes.corrwith(stock_day_changes.AMZN)

AMZN    1.000000
GOOG    0.732779
AAPL    0.666508
TD      0.411271
JNJ     0.185849
IBM     0.441665
dtype: float64
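The three calls above are different views of the same quantity, and the covariance matrix is available through `cov` in the same way. The following sketch uses synthetic random data (the column names `A`, `B`, `C` and the noise levels are assumptions for illustration, not stock data) to show that the results agree and that correlation is covariance rescaled by the standard deviations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=250)
changes = pd.DataFrame({
    'A': base,
    'B': base + 0.5 * rng.normal(size=250),   # built from A plus noise
    'C': rng.normal(size=250),                # independent of A
})

corr_matrix = changes.corr()
cov_matrix = changes.cov()

pair = changes['A'].corr(changes['B'])        # one pair of columns
against_a = changes.corrwith(changes['A'])    # every column against A

# Correlation is the covariance divided by both standard deviations.
rescaled = cov_matrix.loc['A', 'B'] / (changes['A'].std() * changes['B'].std())

print(corr_matrix)
print(pair, rescaled)
```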

# Unique Values, Counting Occurrences, and Membership of Elements in a Series

pandas also provides methods for finding the unique values in a Series, counting how often each occurs, and testing membership. These are useful for DataFrames as well, since their rows and columns can be extracted as Series.

## Uniqueness

In [15]:
obj = pd.Series(['c', 'a', 'c', 'b', 'a', 'c', 'b', 'a', 'c'])
uniques = obj.unique()
uniques

array(['c', 'a', 'b'], dtype=object)

## Counting Occurrences

In [16]:
obj = pd.Series(['c', 'a', 'c', 'b', 'a', 'c', 'b', 'a', 'c'])
obj.value_counts()

c    4
a    3
b    2
dtype: int64

The returned Series is sorted by the number of occurrences, in descending order. Pass sort=False to skip the sort; the order of the result may then depend on the pandas version:

In [17]:
obj.value_counts(sort=False)

b    2
c    4
a    3
dtype: int64
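When relative frequencies are more useful than raw counts, `value_counts` also accepts `normalize=True`. A brief sketch on the same Series:

```python
import numpy as np
import pandas as pd

obj = pd.Series(['c', 'a', 'c', 'b', 'a', 'c', 'b', 'a', 'c'])

counts = obj.value_counts()                    # absolute counts, most frequent first
fractions = obj.value_counts(normalize=True)   # counts divided by the total

print(counts)
print(fractions)
```

The fractions sum to 1, so they can be read directly as the share of each value in the Series.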

## Membership

Sometimes we want to see if an element is contained in a Series. We can use the <i> isin </i> method for this.

In [18]:
mask = obj.isin(['b', 'c'])
mask

0     True
1    False
2     True
3     True
4    False
5     True
6     True
7    False
8     True
dtype: bool

Then we can use the mask to extract the elements that we want.

In [19]:
obj[mask]

0    c
2    c
3    b
5    c
6    b
8    c
dtype: object

This often makes boolean indexing of DataFrames easier when there are many conditions to combine.
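As an illustration of that point, the sketch below filters a DataFrame with a single `isin` call instead of a chain of equality tests. The `trades` table and its column names are hypothetical and serve only as an example.

```python
import pandas as pd

# Hypothetical table, for illustration only.
trades = pd.DataFrame({
    'symbol': ['AMZN', 'GOOG', 'AAPL', 'TD', 'AMZN', 'IBM'],
    'shares': [10, 5, 20, 50, 3, 8],
})

wanted = ['AMZN', 'AAPL']

# One isin call replaces (symbol == 'AMZN') | (symbol == 'AAPL') | ...
mask = trades['symbol'].isin(wanted)
selected = trades[mask]

# The same rows, written out as chained comparisons for comparison.
chained = trades[(trades['symbol'] == 'AMZN') | (trades['symbol'] == 'AAPL')]

# ~ negates the mask to keep every other row.
others = trades[~mask]

print(selected)
print(others)
```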

Suppose we have a Series of distinct values and a Series of non-distinct values like below:

In [20]:
to_match = pd.Series(['c','b','c','a','b'])
unique_vals = pd.Series(['b','a','c'])

We can use the <i> Index.get_indexer </i> method to get, for each element of the non-distinct Series, the integer position of that value in the Series of distinct values:

In [21]:
pd.Index(unique_vals).get_indexer(to_match)

array([2, 0, 2, 1, 0])
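Two details are worth knowing when using this: a value that does not appear among the distinct values is reported as -1, and the resulting positions can be used to look the original values back up. The sketch below adds a hypothetical unmatched entry `'z'` to show both.

```python
import pandas as pd

to_match = pd.Series(['c', 'b', 'c', 'a', 'b', 'z'])   # 'z' has no match
unique_vals = pd.Series(['b', 'a', 'c'])

# Position of each to_match element within unique_vals; -1 means "not found".
codes = pd.Index(unique_vals).get_indexer(to_match)

# Keep only the matched positions and map them back to the original values.
found = codes >= 0
recovered = unique_vals.to_numpy()[codes[found]]

print(codes)
print(recovered)
```

Checking for -1 before indexing matters, because a position of -1 would otherwise silently select the last element.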