# getting started with pandas
pandas contains data structures and data manipulation tools designed to make cleaning and analysis fast and easy. it is often used in tandem with:
* numerical computing tools such as NumPy and SciPy
* analytical libraries like statsmodels and scikit-learn
* data visualization libraries like matplotlib

pandas adopts many coding idioms from NumPy. the biggest difference is that pandas is designed for tabular or heterogeneous data, while NumPy is designed for working with homogeneous numerical array data.
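
As a rough illustration of that difference, the sketch below builds a NumPy array and a DataFrame from mixed values. The variable names are made up for this example.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype, so mixing numbers and text
# forces every element to a common type (here, strings).
mixed_array = np.array([1, 2.5, "three"])
print(mixed_array.dtype)

# A DataFrame keeps a separate dtype per column.
mixed_frame = pd.DataFrame({"count": [1, 2], "ratio": [0.5, 2.5], "label": ["a", "b"]})
print(mixed_frame.dtypes)
```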

## Introduction to pandas Data Structures

### Series
a **Series** is a one-dimensional, array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its *index*.

In [1]:
import pandas as pd
import numpy as np

In [2]:
#simple series formed from only an array of data
simpleSeries = pd.Series([4, 7, -5, 3])
print(simpleSeries)

0    4
1    7
2   -5
3    3
dtype: int64


the column of numbers on the left shows the *index* (index labels can be integers, strings, etc). the column on the right holds the data. <u>A series is a homogeneous data structure: only one type of data</u>.

In [3]:
#you can get the array representation of a Series with the values attribute
print(simpleSeries.values)

[ 4  7 -5  3]


In [4]:
#you can get the index object with the index attribute
print(simpleSeries.index)

RangeIndex(start=0, stop=4, step=1)


it is often desirable to create a series with an index consisting of **labels** instead of numbers

In [5]:
labelSeries = pd.Series(simpleSeries.values, index=["d", "b", "a", "c"])
print(labelSeries)

d    4
b    7
a   -5
c    3
dtype: int64


In [6]:
print(labelSeries.index)

Index(['d', 'b', 'a', 'c'], dtype='object')


you can use labels in the index of the object when selecting single values, or even a set of values:

In [7]:
print("datapoint - a: ", labelSeries["a"])
#here ["c", "a", "d"] is interpreted as a list of index labels, even though it contains strings
print("datapoint - c, a, and d: ", labelSeries[["c", "a", "d"]])
#a labeled series can still be accessed by position with iloc (plain labelSeries[0] is deprecated in recent pandas)
print("datapoint d: ", labelSeries.iloc[0])

datapoint - a:  -5
datapoint - c, a, and d:  c    3
a   -5
d    4
dtype: int64
datapoint d:  4


## numpy functions
using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link

In [10]:
#get all array elements greater than 0
print(labelSeries[labelSeries > 0])

#multiply all elements of the array by (scalar) 2
print(labelSeries * 2)

#you can use numpy functions directly in pandas
print(np.exp(labelSeries))

d    4
b    7
c    3
dtype: int64
d     8
b    14
a   -10
c     6
dtype: int64
d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64


### Series as an ordered dictionary
another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dict

In [11]:
#is the label 'b' in label series?
print('b' in labelSeries)
#is 'e'?
print('e' in labelSeries)

True
False


In fact, you can create a Series from a Python dict. If you pass a dict to the Series constructor, it will put the labels in the order of the dict's keys; you can override this by passing the labels explicitly, in the order you want them.

In [22]:
myDict = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

#you can pass in the dict by itself and the Series will take its labels from the dict keys
dictSeris = pd.Series(myDict)
print(dictSeris)

#passing an explicit index overrides the dict order. note how California is assigned a NaN value
#(it has no entry in the dict), and Utah isn't included (it isn't in the index)
dictSeries2 = pd.Series(myDict, index=states)
print(dictSeries2)

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64


the **pd.isnull(object)** and **pd.notnull(object)** functions can be used to detect missing data. they return a boolean Series, with the same index, marking which values are (or are not) null.

In [23]:
#calling the top-level pandas functions
print(pd.isnull(dictSeries2))
print(pd.notnull(dictSeries2))

#calling the method on the series object
print(dictSeries2.isnull())

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool


## series data alignment
a useful feature is that, for many applications, a Series automatically aligns by index label in arithmetic operations

In [25]:
print(dictSeris + dictSeries2)

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

Both the Series object and its index have a name attribute

In [27]:
dictSeries2.name = 'population'
dictSeries2.index.name = 'state'

print(dictSeries2)

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64


## DataFrame
a **DataFrame** represents a rectangular table of data, and contains an ordered collection of columns, <u>each of which can be a different value type</u>. it can be thought of as a dict of **Series** objects, all sharing the same index. under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or other collection.

> while a DataFrame is physically two-dimensional, you can use it to represent higher-dimensional data in a tabular format using hierarchical indexing (discussed in Chapter 8), which is an ingredient in some of the more advanced data-handling features.
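
A minimal sketch of that idea, assuming a two-level (state, year) index built with `pd.MultiIndex.from_product`. The numbers and names here are illustrative only.

```python
import pandas as pd

# Two index levels (state, year) let a 1-D Series stand in for a 2-D table.
index = pd.MultiIndex.from_product([["Ohio", "Nevada"], [2000, 2001]],
                                   names=["state", "year"])
pop_by_state_year = pd.Series([1.5, 1.7, 2.4, 2.9], index=index)
print(pop_by_state_year)

# Selecting an outer label returns everything stored under it.
ohio = pop_by_state_year.loc["Ohio"]
print(ohio)

# unstack() pivots the inner level into columns, giving a DataFrame.
wide = pop_by_state_year.unstack()
print(wide)
```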

There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:

In [28]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2000, 2001, 2002], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

popDataFrame = pd.DataFrame(data)
print(popDataFrame)

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2000  2.4
4  Nevada  2001  2.9
5  Nevada  2002  3.2


the resulting DataFrame will have its index assigned automatically, as with Series, and the columns follow the order of the dict's keys.

In [29]:
#the head method selects the first n (default is 5) rows
print(popDataFrame.head(2))

  state  year  pop
0  Ohio  2000  1.5
1  Ohio  2001  1.7


In [31]:
#the tail method selects the last n rows
print(popDataFrame.tail(3))

    state  year  pop
3  Nevada  2000  2.4
4  Nevada  2001  2.9
5  Nevada  2002  3.2


if you pass in a column that isn't contained within the dict, then <u>it will appear with missing values (NaN) in the result</u>

In [32]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['one', 'two', 'three', 'four', 'five', 'six'])
print(frame2)

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2000  Nevada  2.4  NaN
five   2001  Nevada  2.9  NaN
six    2002  Nevada  3.2  NaN


A column in a DataFrame can be retrieved as a Series either by dict-like notation, or by attribute

In [33]:
#getting a series by indexing
print(frame2['state'])

#getting a series by attribute
print(frame2.year)

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object
one      2000
two      2001
three    2002
four     2000
five     2001
six      2002
Name: year, dtype: int64


**the Series will have the same index as the DataFrame, and their name attribute will be set to the name of the column**.

rows can also be retrieved by label with the **loc** attribute (or by integer position with **iloc**):

In [51]:
print(frame2.loc['three'])

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object


Columns can be modified by assignment. The empty 'debt' column could be assigned with a scalar, or an array of values:

In [53]:
#assign the entire column with a scalar value
frame2['debt'] = 16.5
print(frame2)

#assign with an array of values
frame2['debt'] = np.arange(6.)
print(frame2)

       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2000  Nevada  2.4  16.5
five   2001  Nevada  2.9  16.5
six    2002  Nevada  3.2  16.5
       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2000  Nevada  2.4   3.0
five   2001  Nevada  2.9   4.0
six    2002  Nevada  3.2   5.0


if you assign with a Series then the labels will be realigned exactly to the DataFrame's index, inserting missing values in any holes

In [36]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
print(frame2)

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2000  Nevada  2.4  -1.5
five   2001  Nevada  2.9  -1.7
six    2002  Nevada  3.2   NaN


assigning a column that doesn't exist <u>will create a new column</u>. the del keyword will delete columns, as with a dict.

> WARNING: new columns cannot be created with the attribute style syntax

In [37]:
frame2['eastern'] = frame2.state == 'Ohio'
print(frame2)

       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2000  Nevada  2.4  -1.5    False
five   2001  Nevada  2.9  -1.7    False
six    2002  Nevada  3.2   NaN    False


In [38]:
#then delete
del frame2['eastern']
print(frame2.columns)

Index(['year', 'state', 'pop', 'debt'], dtype='object')
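
The warning about attribute-style syntax can be checked with a small sketch. The frame and column names below are made up for this example, and the warning pandas emits is silenced so the output stays short.

```python
import warnings
import pandas as pd

demo = pd.DataFrame({"state": ["Ohio", "Nevada"], "population": [1.5, 2.4]})

# Attribute assignment sets a plain Python attribute on the object;
# pandas typically warns about it, and no column is added.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    demo.eastern = demo["state"] == "Ohio"
created_by_attribute = "eastern" in demo.columns
print(created_by_attribute)

# Bracket assignment does create the column.
demo["eastern"] = demo["state"] == "Ohio"
created_by_brackets = "eastern" in demo.columns
print(created_by_brackets)
```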


> WARNING: the column returned from indexing a DataFrame is a <u>view</u> on the underlying data, not a copy. thus, any in-place modifications to the Series will be reflected in the DataFrame. the column can be explicitly copied with the Series's **copy** method.
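
A short sketch of the explicit-copy route mentioned in the warning, using a made-up frame: changes to the copied Series do not reach the DataFrame.

```python
import pandas as pd

demo = pd.DataFrame({"debt": [1.0, 2.0, 3.0]}, index=["one", "two", "three"])

# copy() returns an independent Series with its own data.
debt_copy = demo["debt"].copy()
debt_copy["one"] = 99.0

print(debt_copy["one"])          # the copy changed
print(demo.loc["one", "debt"])   # the DataFrame did not
```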

Another common form of data is a nested dict of dicts. pandas will treat the outer dict keys as the columns and the inner keys as the row indices:

In [40]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
populationFrame = pd.DataFrame(pop)
print(populationFrame)

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


you can transpose the DataFrame (swap rows and columns) with a similar syntax to NumPy arrays:

In [42]:
print(populationFrame.T)

        2001  2002  2000
Nevada   2.4   2.9   NaN
Ohio     1.7   3.6   1.5


Dicts of Series are treated in the same way as Dicts of Dicts

In [45]:
pdata = pd.DataFrame({'Ohio': populationFrame['Ohio'][:-1], 'Nevada': populationFrame['Nevada'][:2]})
print(pdata)

      Ohio  Nevada
2001   1.7     2.4
2002   3.6     2.9


the names of a DataFrame's index and columns will be displayed when it is printed

In [46]:
populationFrame.index.name = 'year'
populationFrame.columns.name = 'state'
print(populationFrame)

state  Nevada  Ohio
year               
2001      2.4   1.7
2002      2.9   3.6
2000      NaN   1.5


the **values** attribute, similar to the Series object, will return a two-dimensional ndarray:

In [47]:
print(populationFrame.values)

[[2.4 1.7]
 [2.9 3.6]
 [nan 1.5]]


if the DataFrame's columns are different data types, then the data type of the values array will be chosen to accommodate all of the columns


In [49]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2000, 'Nevada', 2.4, -1.5],
       [2001, 'Nevada', 2.9, -1.7],
       [2002, 'Nevada', 3.2, nan]], dtype=object)

### table 5-1: possible inputs to the DataFrame constructor
* 2d ndarray - a matrix of data, passing optional row and column labels
* dict of arrays, lists, or tuples - each sequence becomes a column in the DataFrame; all sequences must be the same length
* NumPy structured/record array - treated as the "dict of arrays" case
* Dict of Series - Each value becomes a column; indexes from each Series are unioned together to form the result's row index if no explicit index is passed
* dict of dicts - Each value becomes a column; keys are unioned to form the row index as in the "Dict of Series" case.
* List of Dicts or Series - Each item becomes a row in the DataFrame; union of dict keys or Series indexes becomes the DataFrame's column labels
* list of lists or tuples - treated as the "2D ndarray" case
* Another DataFrame - The DataFrame's indexes are used unless different ones are passed
* NumPy MaskedArray - Like the "2D ndarray" case except masked values become NA/missing in the DataFrame result
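
A few of the inputs listed above can be sketched as follows. The data and variable names are invented for illustration.

```python
import numpy as np
import pandas as pd

# 2D ndarray with optional row and column labels
from_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                          index=["r1", "r2"], columns=["a", "b", "c"])

# list of dicts: each dict becomes a row; keys are unioned into columns
from_records = pd.DataFrame([{"a": 1, "b": 2}, {"a": 3, "c": 4}])

# dict of Series: each Series becomes a column; indexes are unioned
from_series = pd.DataFrame({"x": pd.Series([1, 2], index=["p", "q"]),
                            "y": pd.Series([3, 4], index=["q", "r"])})

print(from_array, from_records, from_series, sep="\n\n")
```

Keys or labels missing from one input show up as NaN in the result.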

## Index Objects
* Index objects are responsible for holding the axis labels and other metadata. any array, or other sequence of labels, is internally converted into an Index
* Index objects are immutable
* Index objects behave like a fixed-size set
* unlike sets, Index objects can contain duplicate labels

In [60]:
index = labelSeries.index
print(index)
print(index[1])

#set-like behavior in labelSeries
print('c' in index)
print('e' in index)

Index(['d', 'b', 'a', 'c'], dtype='object')
b
True
False


In [62]:
#duplicate labels
foobarSeries = pd.Series([0, 1, 2, 3], index = ['foo', 'foo', 'bar', 'bar'])

print(foobarSeries)
print(foobarSeries['foo'])

foo    0
foo    1
bar    2
bar    3
dtype: int64
foo    0
foo    1
dtype: int64
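
The immutability point from the list above can be checked with a small sketch. The `labels` and `extended` names are made up here.

```python
import pandas as pd

labels = pd.Index(["d", "b", "a", "c"])

# Item assignment on an Index raises TypeError.
mutation_failed = False
try:
    labels[1] = "z"
except TypeError as exc:
    mutation_failed = True
    print("cannot modify:", exc)

# Operations that "change" an Index return a new object instead.
extended = labels.append(pd.Index(["e"]))
print(labels)
print(extended)
```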


## Pandas Essential Functionality

### Reindexing
reindexing a pandas object means that you create a new object with the old object's data *conformed* to the new index

In [63]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index = ['d', 'b', 'a', 'c'])
print(obj)

#calling reindex rearranges the data according to the new index
objReindexed = obj.reindex(['a', 'b', 'c', 'd', 'e'])
print(objReindexed)

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64


for ordered data like time series, it may be desirable to do some interpolation, or filling of values when reindexed. we can do this by passing in an argument to the *method* parameter

In [64]:
colorObj = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
print(colorObj)

#ffill - fill forward
#bfill - fill backwards
print(colorObj.reindex(range(6), method='ffill'))

0      blue
2    purple
4    yellow
dtype: object
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object


reindex can alter either the (row) index, columns, or both. When passed only a sequence, it reindexes the rows in the result

In [66]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index = ['a', 'c', 'd'], columns = ['Ohio', 'Texas', 'California'])
print(frame)

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8


In [83]:
#reindex only the rows
print(frame.reindex(['a', 'b', 'c', 'd']))

   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0


In [84]:
#the columns can be reindexed with the columns keyword
states = ['Texas', 'Utah', 'California']
print(frame.reindex(columns = states))

   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8


### Dropping Entries from an Axis
dropping one or more entries from an axis is easy if you already have an index array or list without those entries. the *drop* method will return a new object with the indicated value or values deleted from an axis

In [89]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
print(obj)

new_obj = obj.drop('c')
print(new_obj)

newer_ob = obj.drop(['d', 'c'])
print(newer_ob)

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
a    0.0
b    1.0
e    4.0
dtype: float64


DataFrame index values can be deleted from either axis

In [91]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)), index = ['Ohio', 'Colorado', 'Utah', 'New York'], columns = ['one', 'two', 'three', 'four'])
print(data)

#dropping colorado and ohio
print(data.drop(['Colorado', 'Ohio']))

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15


In [92]:
#you can drop columns by passing axis=1, or axis='columns'
print(data.drop(['two', 'four'], axis=1))

          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14


**many functions like drop, which modify the size or shape of a Series or DataFrame can manipulate an object *in place* by using the inplace argument**

In [93]:
data.drop('one', axis='columns', inplace=True)
print(data)

          two  three  four
Ohio        1      2     3
Colorado    5      6     7
Utah        9     10    11
New York   13     14    15


### Indexing, Selection, And Filtering

Series indexing (obj[...]) works analogously to NumPy array indexing, except that **you can use the Series's index values instead of only integers**

In [95]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
print(obj)

#get b
print(obj['b'])
#by position (plain obj[1] on a labeled series is deprecated in recent pandas)
print(obj.iloc[1])

#get c and d
print(obj[2:4])

print(obj[['b', 'a', 'd']])

#logical indexing
print(obj[obj < 2])

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64
1.0
1.0
c    2.0
d    3.0
dtype: float64
b    1.0
a    0.0
d    3.0
dtype: float64
a    0.0
b    1.0
dtype: float64


**NOTE:** slicing with labels is different from normal Python slicing in that the <u>endpoints are inclusive</u>

In [98]:
#slicing directly with index labels
print(obj['b':'c'])

#assigning to a slice modifies the corresponding section of the series

obj['b':'c'] = 5
print(obj)

b    1.0
c    2.0
dtype: float64
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64
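
To contrast the inclusive label slice with an ordinary positional slice, here is a minimal sketch on a fresh Series (the `letters` name is only for this example):

```python
import numpy as np
import pandas as pd

letters = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])

by_label = letters.loc["b":"c"]    # includes both 'b' and 'c'
by_position = letters.iloc[1:2]    # position 1 only; the stop is excluded
print(by_label)
print(by_position)
```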


In [108]:
#slicing with [] selects rows; a single label selects a column
print(data[:2])

print(data['three'])

print(data['three'] > 5)
print(data[data['three'] > 5])

print(data < 5)

          two  three  four
Ohio        1      2     3
Colorado    5      6     7
Ohio         2
Colorado     6
Utah        10
New York    14
Name: three, dtype: int64
Ohio        False
Colorado     True
Utah         True
New York     True
Name: three, dtype: bool
          two  three  four
Colorado    5      6     7
Utah        9     10    11
New York   13     14    15
            two  three   four
Ohio       True   True   True
Colorado  False  False  False
Utah      False  False  False
New York  False  False  False


### Selecting with loc and iloc
to select rows and columns together, use *loc* and *iloc*. they enable you to select a subset of rows and columns from a DataFrame with NumPy-like notation, using either axis labels (loc) or integer positions (iloc)

In [113]:
print(data)

#selecting a single row, and multiple columns by label
print(data.loc['Colorado', ['two', 'three']])

#an equivalent selection with integers
print(data.iloc[1, [0, 1]])

          two  three  four
Ohio        1      2     3
Colorado    5      6     7
Utah        9     10    11
New York   13     14    15
two      5
three    6
Name: Colorado, dtype: int64
two      5
three    6
Name: Colorado, dtype: int64


In [114]:
#both indexing functions work with slices in addition to single labels
print(data.loc[:'Utah', 'two'])

print(data.iloc[:, :3])
print(data.iloc[:, :3][data.three > 5])

Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int64
          two  three  four
Ohio        1      2     3
Colorado    5      6     7
Utah        9     10    11
New York   13     14    15
          two  three  four
Colorado    5      6     7
Utah        9     10    11
New York   13     14    15


### Table 5-4, Indexing Options with DataFrame
* df[val] - select a single column, or sequence of columns, from the DataFrame; special-case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion)
* df.loc[val] - select a single row, or subset of rows, from the DataFrame by label
* df.loc[:, val] - selects single column or subset of columns by label
* df.loc[val1, val2] - selects both rows and columns by label
* df.iloc[where] - select a single row, or subset of rows, from the DataFrame by integer position
* df.iloc[:, where] - selects single column or subset of columns by integer position
* df.iloc[where_i, where_j] - selects both rows and columns by integer position
* df.at[label_i, label_j] - selects a single scalar value by row and column label
* df.iat[i, j] - selects a single scalar value by row and column position (integers)
* reindex method - select either rows or columns by labels
* get_value, set_value methods - select a single value by row and column labels (these were removed in pandas 1.0; use at and iat instead)
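
A few of the options in the table, sketched on a small frame built just for this example (the `table` name is not from the notes above):

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=["Ohio", "Colorado", "Utah", "New York"],
                     columns=["one", "two", "three", "four"])

print(table.loc[:, ["one", "three"]])   # columns by label
print(table.iloc[:, [0, 2]])            # the same columns by position
print(table.at["Utah", "two"])          # single scalar by label
print(table.iat[2, 1])                  # the same scalar by position
```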

if you have an axis index containing integers, data selection with [] will always be label-oriented. for more precise handling use *loc* (labels) and *iloc* (integer positions)

on the other hand **slicing with integers is always integer oriented**
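
The integer-index case can be sketched like this, assuming a Series with the default 0, 1, 2 index. The names are illustrative.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))   # default integer index 0, 1, 2

# Plain [] treats an integer as a label, so -1 is not found.
try:
    ser[-1]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)

print(ser.iloc[-1])    # last element by position
print(ser.loc[:1])     # label slice: labels 0 and 1, endpoint included
print(ser.iloc[:1])    # position slice: first element only
```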

### Arithmetic and Data Alignment
when adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. this is similar to an *automatic outer join* on the index labels in databases
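
A minimal sketch of that union behaviour with two made-up Series whose indexes only partly overlap. The `add` method with `fill_value` is shown as one way to avoid the NaN results.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Labels present in only one operand produce NaN in the result.
total = s1 + s2
print(total)

# add() with fill_value treats a label missing from one side as 0.
filled = s1.add(s2, fill_value=0)
print(filled)
```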

### function applications and mapping
NumPy functions also work with pandas objects

In [116]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index = ['Utah', 'Ohio', 'Texas', 'Oregon'])
print(frame)

print(np.abs(frame))

               b         d         e
Utah   -0.702725  1.461847 -0.600543
Ohio   -0.878414  0.159343 -0.268247
Texas  -0.462516 -2.476017 -1.829872
Oregon -1.443210 -0.342621 -1.251918
               b         d         e
Utah    0.702725  1.461847  0.600543
Ohio    0.878414  0.159343  0.268247
Texas   0.462516  2.476017  1.829872
Oregon  1.443210  0.342621  1.251918


In [120]:
#apply
#another frequent operation is applying a function on one-dimensional arrays to each column or row. the DataFrame apply method does exactly this
f = lambda x: x.max() - x.min()

#here the function f, which computes the difference between the maximum and minimum of a Series, is invoked once on each column in frame.
#    the result is a Series having the columns of frame as its index. you can pass axis='columns' (or axis=1) to apply the function per row instead
print(frame.apply(f))
print(frame.apply(f, axis=1))

b    0.980694
d    3.937864
e    1.561625
dtype: float64
Utah      2.164572
Ohio      1.037757
Texas     2.013501
Oregon    1.100588
dtype: float64
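
For the "mapping" half of this section, a small sketch with fixed (non-random) numbers: `Series.map` works element by element, and `apply` can also return a Series per column. The frame and names are assumptions for illustration.

```python
import pandas as pd

demo = pd.DataFrame({"b": [-0.5, 1.25], "d": [2.0, -3.75]}, index=["Utah", "Ohio"])

# Series.map applies a function to each element of one column.
formatted = demo["d"].map(lambda x: f"{x:.2f}")
print(formatted)

# When the applied function returns a Series, apply yields a DataFrame
# with one result column per input column.
summary = demo.apply(lambda col: pd.Series([col.min(), col.max()], index=["min", "max"]))
print(summary)
```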


### Summarizing and Computing Descriptive Statistics
pandas objects are equipped with a set of common mathematical and statistical methods. most of these fall into the category of *reductions* or *summary statistics*: methods that extract a single value from a Series, or a Series of values from the rows or columns of a DataFrame

In [123]:
df = pd.DataFrame([[1.3, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], index = ['a', 'b', 'c', 'd'], columns = ['one', 'two'])
print(df)

    one  two
a  1.30  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3


In [124]:
#print sum of columns (across rows)
print(df.sum())

one    9.15
two   -5.80
dtype: float64


In [126]:
#print sum of rows (across columns)
print(df.sum(axis=1))

a    1.30
b    2.60
c    0.00
d   -0.55
dtype: float64


NA values are excluded unless the entire slice is NA. this can be disabled with the skipna option

In [127]:
print(df.mean(axis='columns', skipna=False))

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64


In [128]:
print(df.mean(axis='columns', skipna=True))

a    1.300
b    1.300
c      NaN
d   -0.275
dtype: float64


some methods, like *idxmin* and *idxmax*, return indirect statistics, such as the index label where the minimum or maximum value occurs

In [129]:
#return index with max value
print(df.idxmax())

one    b
two    d
dtype: object


In [131]:
#return index with minimum value
print(df.idxmin())

one    d
two    b
dtype: object


other methods are *accumulations*

In [132]:
#get the cumulative sum of df
print(df.cumsum())

    one  two
a  1.30  NaN
b  8.40 -4.5
c   NaN  NaN
d  9.15 -5.8


### Correlation and Covariance
some summary statistics, like *correlation* and *covariance*, are computed from pairs of arguments

In [135]:
import pandas_datareader.data as web
all_data = {ticker: web.get_data_yahoo(ticker) for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}
print(all_data)

{'AAPL':                   High         Low        Open       Close       Volume  \
Date                                                                      
2016-08-26   26.987499   26.577499   26.852501   26.735001  111065200.0   
2016-08-29   26.860001   26.572500   26.655001   26.705000   99881200.0   
2016-08-30   26.625000   26.375000   26.450001   26.500000   99455600.0   
2016-08-31   26.642500   26.410000   26.415001   26.525000  118649600.0   
2016-09-01   26.700001   26.405001   26.535000   26.682501  106806000.0   
...                ...         ...         ...         ...          ...   
2021-08-19  148.000000  144.500000  145.029999  146.699997   86960300.0   
2021-08-20  148.500000  146.779999  147.440002  148.190002   59947400.0   
2021-08-23  150.190002  147.889999  148.309998  149.710007   60131800.0   
2021-08-24  150.860001  149.149994  149.449997  149.619995   48606400.0   
2021-08-25  150.320007  147.800003  149.809998  148.360001   57790733.0   

             Ad

In [136]:
price = pd.DataFrame({ticker: data['Adj Close'] for ticker, data in all_data.items()})
print(price)

                  AAPL         IBM        MSFT         GOOG
Date                                                       
2016-08-26   25.073166  126.406105   53.691029   769.539978
2016-08-29   25.045031  127.523842   53.755795   772.150024
2016-08-30   24.852774  127.268356   53.561485   769.090027
2016-08-31   24.876219  126.853180   53.163647   767.049988
2016-09-01   25.023932  127.380173   53.283924   768.780029
...                ...         ...         ...          ...
2021-08-19  146.699997  138.020004  296.769989  2738.270020
2021-08-20  148.190002  139.110001  304.359985  2768.739990
2021-08-23  149.710007  139.619995  304.649994  2821.989990
2021-08-24  149.619995  139.839996  302.619995  2847.969971
2021-08-25  148.360001  139.860001  302.010010  2859.000000

[1258 rows x 4 columns]


In [137]:
volume = pd.DataFrame({ticker: data['Volume'] for ticker, data in all_data.items()})
print(volume)

                   AAPL        IBM        MSFT     GOOG
Date                                                   
2016-08-26  111065200.0  2498900.0  20971200.0  1166700
2016-08-29   99881200.0  2475900.0  16417200.0   847600
2016-08-30   99455600.0  1813300.0  16930200.0  1130000
2016-08-31  118649600.0  2323600.0  20860300.0  1248600
2016-09-01  106806000.0  2358400.0  26075400.0   925100
...                 ...        ...         ...      ...
2021-08-19   86960300.0  4160100.0  29850500.0   914800
2021-08-20   59947400.0  2656300.0  40796100.0   778200
2021-08-23   60131800.0  3039600.0  22830200.0  1054500
2021-08-24   48606400.0  2365600.0  18175800.0   756300
2021-08-25   57790733.0  1957777.0  17057369.0   628711

[1258 rows x 4 columns]


In [138]:
returns = price.pct_change()
print(returns)

                AAPL       IBM      MSFT      GOOG
Date                                              
2016-08-26       NaN       NaN       NaN       NaN
2016-08-29 -0.001122  0.008842  0.001206  0.003392
2016-08-30 -0.007676 -0.002003 -0.003615 -0.003963
2016-08-31  0.000943 -0.003262 -0.007428 -0.002653
2016-09-01  0.005938  0.004154  0.002262  0.002255
...              ...       ...       ...       ...
2021-08-19  0.002323 -0.010396  0.020775  0.002515
2021-08-20  0.010157  0.007897  0.025575  0.011127
2021-08-23  0.010257  0.003666  0.000953  0.019233
2021-08-24 -0.000601  0.001576 -0.006663  0.009206
2021-08-25 -0.008421  0.000143 -0.002016  0.003873

[1258 rows x 4 columns]


In [139]:
#the corr method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series; cov computes the covariance
print(returns['MSFT'].corr(returns['IBM']))

print(returns['MSFT'].cov(returns['IBM']))

0.5167814957608666
0.00014544044572982248
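
Since the cells above need a network download, here is an offline sketch of the same calls on synthetic prices (made-up numbers, not market data). It also cross-checks `Series.corr` against `np.corrcoef` on the non-NA rows.

```python
import numpy as np
import pandas as pd

# Synthetic daily prices (made-up numbers, not market data).
prices = pd.DataFrame({"A": [10.0, 10.5, 10.4, 11.0, 11.2],
                       "B": [20.0, 20.8, 20.5, 21.5, 21.6]})
rets = prices.pct_change()

# Series.corr ignores the NaN that pct_change leaves in the first row.
pair_corr = rets["A"].corr(rets["B"])
print(pair_corr)

# The same number from NumPy on the non-NA rows.
clean = rets.dropna()
numpy_corr = np.corrcoef(clean["A"], clean["B"])[0, 1]
print(numpy_corr)

# DataFrame.corr returns the full correlation matrix.
corr_matrix = rets.corr()
print(corr_matrix)
```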
