# Tutorial 03 - Introduction to `numpy` and `pandas`

Python is becoming a de facto standard programming language in science and data analysis.  However, this was not always the case.  Python was initially designed as a general-purpose programming language, with no particular emphasis on data analysis.

Python became a scientific computing workhorse through the development of two packages: `numpy` and `pandas`.  The purpose of this tutorial is to introduce these two packages, and along the way to take a first look at some financial data.

## Import Packages

As usual, let's start by importing the packages that we will need.

In [1]:
import numpy as np
import pandas as pd

Both of these packages have a lot of functionality, but here is what they do in brief:

`numpy`: vector and matrix computation (similar to what is native in R and Matlab)

`pandas`: introduces the `DataFrame` data structure, which allows for analysis of tabular data (like R and SQL)

## `numpy.array`

We have already discussed the `list` structure, which is Python's simplest and most flexible way of storing multiple values in a single variable.

The flexibility of `lists` comes at a cost of performance: computations on large `lists` are very slow.

The `array` structure in the `numpy` package can be thought of as a vector or a matrix, and allows for efficient computation.
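As a small illustration of the vector-like behavior (a sketch only; the names `values_list` and `values_arr` are made up for this example), the same operator can mean different things for a `list` and an `array`:

```python
import numpy as np

values_list = [1, 2, 3]
values_arr = np.array(values_list)

# Multiplying a list by 2 repeats the list
doubled_list = values_list * 2
# Multiplying an array by 2 doubles each element, like a vector
doubled_arr = values_arr * 2

print(doubled_list)  # [1, 2, 3, 1, 2, 3]
print(doubled_arr)   # [2 4 6]
```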

The easiest way to create an array is to start with a `list` and then pass it to the `np.array()` function.

Let's try the following code:

In [2]:
l = [1, 2, 3]
arr1 = np.array(l)
arr2 = np.array([1, 2, 3])

Let's now explore the types of the variables we just created.

In [3]:
print(type(l))
print(type(arr1))
print(type(arr2))

<class 'list'>
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


Let's look at the content of what's inside our two arrays.

In [4]:
print(arr1)
print(arr2)

[1 2 3]
[1 2 3]


If you evaluate an array in the console (without using `print()`), this is what it looks like:

In [5]:
arr1

array([1, 2, 3])

We typically won't work with `arrays` directly, or have to build them from scratch.

Usually, we will be working with `arrays` indirectly, since `pandas` `DataFrames` are built on top of them.

It's good to know `arrays` exist, and to realize that `numpy` is what makes scientific computing possible in Python.

**Code Challenge:** Create an `array` that consists of 5 zeros.

## `pandas.DataFrame`

The `DataFrame` structure from the `pandas` package is going to be our primary data analysis workhorse.

A `DataFrame` is a convenient way to store rectangular data that consists of rows and columns.

Usually, the data that goes in a `DataFrame` comes from an external source.  

In this class, our data will usually come from special text files, called CSV files, and will be read into a `DataFrame` via the `pandas.read_csv()` function.  We will occasionally use built-in Python functions to query data from the internet.
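To see what `read_csv()` does without needing a file on disk, here is a minimal sketch that reads CSV text from a string.  The `io.StringIO` wrapper and the names `csv_text` and `df_sample` are assumptions made for this illustration; the two data rows are copied from the SPY data used in this tutorial.

```python
import io
import pandas as pd

# Stand-in for a CSV file: a header line followed by one line per row
csv_text = """date,open,high,low,close,volume,adjusted
2018-11-30,273.809998,276.279999,273.450012,275.649994,98204200,271.527222
2018-12-03,280.279999,280.399994,277.51001,279.299988,103176300,275.122589
"""

# read_csv() accepts a file path or a file-like object such as StringIO
df_sample = pd.read_csv(io.StringIO(csv_text))
print(df_sample)
```

The header line becomes the column names, and each remaining line becomes one row.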

Let's read in our first data set into a `DataFrame` by typing the following:

In [6]:
df_spy = pd.read_csv('../data/spy_dec_2018.csv')
df_spy

Unnamed: 0,date,open,high,low,close,volume,adjusted
0,2018-11-30,273.809998,276.279999,273.450012,275.649994,98204200,271.527222
1,2018-12-03,280.279999,280.399994,277.51001,279.299988,103176300,275.122589
2,2018-12-04,278.369995,278.850006,269.899994,270.25,177986000,266.207977
3,2018-12-06,265.920013,269.970001,262.440002,269.839996,204185400,265.804108
4,2018-12-07,269.459991,271.220001,262.630005,263.570007,161018900,259.627899
5,2018-12-10,263.369995,265.160004,258.619995,264.070007,151445900,260.120422
6,2018-12-11,267.660004,267.869995,262.480011,264.130005,121504400,260.179504
7,2018-12-12,267.470001,269.0,265.369995,265.459991,97976700,261.489624
8,2018-12-13,266.519989,267.48999,264.119995,265.369995,96662700,261.40094
9,2018-12-14,262.959991,264.029999,259.850006,260.470001,116961100,256.574249


This `DataFrame` consists of all the December end-of-day prices for SPY, which is an ETF that tracks the S&P500 Index.


## Exploring a `DataFrame`

Let's explore the `df_spy` `DataFrame` that we have just created.  

This is a very typical thing to do once you've loaded a new dataset.

First, we can use the `type()` function to make sure what we have created is in fact a `DataFrame`.

In [7]:
type(df_spy)

pandas.core.frame.DataFrame

Next, we can use the `.dtypes` attribute of the `DataFrame` to see the data types of each of the columns.

In [8]:
df_spy.dtypes

date         object
open        float64
high        float64
low         float64
close       float64
volume        int64
adjusted    float64
dtype: object

We can check the number of rows and columns by using the `.shape` attribute.

In [9]:
df_spy.shape

(20, 7)

As we can see, our `DataFrame` `df_spy` consists of 20 rows and 7 columns.

A `DataFrame` can be thought of as a collection of rows and a collection of columns.  Both viewpoints are useful.
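Here is a rough sketch of the two viewpoints on a tiny hand-built `DataFrame`.  The name `df_small` is made up for this example, and `.iloc` (position-based row access) is a `pandas` feature that has not been introduced yet in this tutorial.

```python
import pandas as pd

# Three rows copied from the SPY data, with only three of the columns
df_small = pd.DataFrame({
    "date": ["2018-11-30", "2018-12-03", "2018-12-04"],
    "close": [275.649994, 279.299988, 270.25],
    "volume": [98204200, 103176300, 177986000],
})

first_row = df_small.iloc[0]   # row view: every column's value for one trade date
close_col = df_small["close"]  # column view: one variable across all trade dates

print(first_row)
print(close_col)
```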

**Code Challenge:** Try the `DataFrame.info()` and `DataFrame.describe()` methods on `df_spy`.

## `DataFrame` Columns

In order to isolate a particular column, say the `close` column, we can use square brackets (`[ ]`) as follows:

In [10]:
df_spy['close']

0     275.649994
1     279.299988
2     270.250000
3     269.839996
4     263.570007
5     264.070007
6     264.130005
7     265.459991
8     265.369995
9     260.470001
10    255.360001
11    255.080002
12    251.259995
13    247.169998
14    240.699997
15    234.339996
16    246.179993
17    248.070007
18    247.750000
19    249.919998
Name: close, dtype: float64

 


**Code Challenge:** Isolate the `date` column of `df_spy`.

As we can see from the following code, each column of a `DataFrame` is actually a different kind of `pandas` structure called a `Series`. 

In [11]:
type(df_spy['close'])

pandas.core.series.Series

Here is a bit of `pandas` inside baseball (don't get too bogged down with these details for now):

- A `pandas.DataFrame` is really a collection of columns glued together, and each column is a `pandas.Series`.

- A `pandas.Series` has two major components: 1) `.values`; 2) `.index`.

- The `.values` of a `Series` is a `numpy.array`.
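To make the `.index` component a little more concrete, here is a minimal sketch that builds a `Series` by hand (the name `s` and the three prices are chosen only for illustration).  When no index is supplied, `pandas` labels the rows 0, 1, 2, and so on.

```python
import numpy as np
import pandas as pd

s = pd.Series([275.649994, 279.299988, 270.25], name="close")

print(s.index)   # the row labels
print(s.values)  # the underlying data
```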

Let's look at the `.values` attribute of the `close` column of `df_spy`:

In [12]:
df_spy['close'].values

array([275.649994, 279.299988, 270.25    , 269.839996, 263.570007,
       264.070007, 264.130005, 265.459991, 265.369995, 260.470001,
       255.360001, 255.080002, 251.259995, 247.169998, 240.699997,
       234.339996, 246.179993, 248.070007, 247.75    , 249.919998])

**Code Challenge:** Verify that the `.values` component of the `close` column of `df_spy` is in fact a `numpy.array`.

## Component-wise Column Operations

As we saw in the previous section, a `pandas.DataFrame` column is essentially a fancy `numpy.array`.

We can think of a `numpy.array` as a vector or matrix in Python.

As such, we can perform vector-like calculations with `DataFrame` columns.

For example, we can perform scalar arithmetic on the entire column, and the calculation *broadcasts* as we would expect.  The following code divides all the `close` prices by 100.

In [13]:
df_spy['close'] / 100

0     2.7565
1     2.7930
2     2.7025
3     2.6984
4     2.6357
5     2.6407
6     2.6413
7     2.6546
8     2.6537
9     2.6047
10    2.5536
11    2.5508
12    2.5126
13    2.4717
14    2.4070
15    2.3434
16    2.4618
17    2.4807
18    2.4775
19    2.4992
Name: close, dtype: float64

The following code adds 100 to all the `close` prices:

In [14]:
df_spy['close'] + 100

0     375.649994
1     379.299988
2     370.250000
3     369.839996
4     363.570007
5     364.070007
6     364.130005
7     365.459991
8     365.369995
9     360.470001
10    355.360001
11    355.080002
12    351.259995
13    347.169998
14    340.699997
15    334.339996
16    346.179993
17    348.070007
18    347.750000
19    349.919998
Name: close, dtype: float64

We can also perform component-wise calculations between two columns.

Let's say we want to calculate the intraday range of SPY for each of the trade-dates in `df_spy`.  This is the difference between the `high` and the `low` of each day.  We can do this easily from the columns of our `DataFrame`.

In [15]:
df_spy['high'] - df_spy['low']

0      2.829987
1      2.889984
2      8.950012
3      7.529999
4      8.589996
5      6.540009
6      5.389984
7      3.630005
8      3.369995
9      4.179993
10     7.119995
11     4.670013
12    10.049988
13     6.970001
14     9.730011
15     6.569992
16    12.419998
17     9.329986
18     4.949997
19     2.720001
dtype: float64
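The same component-wise subtraction works on plain `numpy` arrays, which is a hint that `numpy` is doing the work underneath.  This is only a sketch: the arrays `high` and `low` below are typed in by hand from the first three rows of `df_spy`.

```python
import numpy as np

high = np.array([276.279999, 280.399994, 278.850006])
low = np.array([273.450012, 277.51001, 269.899994])

# Subtraction pairs up the elements position by position
intraday = high - low
print(intraday)
```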

**Code Challenge:** Calculate the difference between the `close` and `open` columns of `df_spy`.

## Adding Columns to a `DataFrame`

Data analysis often involves starting with a `DataFrame` and then adding new columns to it.  The newly added columns are often functions of the existing columns.

Continuing our previous example, let's say we want to save our intraday ranges back into `df_spy` for further analysis later.  We can do this easily with the following code:

In [16]:
df_spy['intraday_range'] = df_spy['high'] - df_spy['low']
df_spy

Unnamed: 0,date,open,high,low,close,volume,adjusted,intraday_range
0,2018-11-30,273.809998,276.279999,273.450012,275.649994,98204200,271.527222,2.829987
1,2018-12-03,280.279999,280.399994,277.51001,279.299988,103176300,275.122589,2.889984
2,2018-12-04,278.369995,278.850006,269.899994,270.25,177986000,266.207977,8.950012
3,2018-12-06,265.920013,269.970001,262.440002,269.839996,204185400,265.804108,7.529999
4,2018-12-07,269.459991,271.220001,262.630005,263.570007,161018900,259.627899,8.589996
5,2018-12-10,263.369995,265.160004,258.619995,264.070007,151445900,260.120422,6.540009
6,2018-12-11,267.660004,267.869995,262.480011,264.130005,121504400,260.179504,5.389984
7,2018-12-12,267.470001,269.0,265.369995,265.459991,97976700,261.489624,3.630005
8,2018-12-13,266.519989,267.48999,264.119995,265.369995,96662700,261.40094,3.369995
9,2018-12-14,262.959991,264.029999,259.850006,260.470001,116961100,256.574249,4.179993


Notice that we wrote the code as if the `intraday_range` column already existed and simply assigned a value to it; `pandas` creates the column for us.  The value we assigned is the result of a component-wise operation on two of the existing columns.
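As a quick check of this behavior, the following sketch tests whether the column exists before and after the assignment.  The name `df_small` and its two rows are stand-ins for `df_spy`, used only for this illustration.

```python
import pandas as pd

df_small = pd.DataFrame({
    "high": [276.279999, 280.399994],
    "low": [273.450012, 277.51001],
})

exists_before = "intraday_range" in df_small.columns

# Assigning to a column name that does not exist yet creates the column
df_small["intraday_range"] = df_small["high"] - df_small["low"]

exists_after = "intraday_range" in df_small.columns

print(exists_before, exists_after)
```

Assigning to a column name that already exists overwrites that column instead of adding a new one.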

## Aggregating Calculations on `DataFrames`

We have already seen how to do *component-wise* calculations on the columns of a `DataFrame`. A *component-wise* calculation can be thought of as a function that takes in vectors and scalars, and returns a vector.

It is often useful to perform *aggregation* calculations on a `DataFrame`.  An *aggregation* can be thought of as a function that takes in a vector, and returns a scalar.
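The contrast between the two kinds of calculation can be sketched on a short hand-built `Series` (the three volumes are copied from the first rows of the SPY data; the names `volume`, `in_millions`, and `largest` are made up for this example):

```python
import pandas as pd

volume = pd.Series([98204200, 103176300, 177986000], name="volume")

in_millions = volume / 1_000_000  # component-wise: Series in, Series of the same length out
largest = volume.max()            # aggregation: Series in, a single number out

print(in_millions)
print(largest)
```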

For example, we can use the following code to calculate the total SPY volume during December:

In [17]:
df_spy['volume'].sum()

3200984700

Let's break down what we did here:

- First, we isolated the `volume` column of the `DataFrame`, which is a `Series` (a souped-up `numpy.array`).

- A `Series`, like a `numpy.array`, has a method called `.sum()` which adds up all the components.

Next, let's calculate some summary statistics on the `intraday_range` column that we added to our `DataFrame` earlier.  Notice that each of these summary statistics is just an aggregating calculation on the column. 

In [18]:
print("Mean: ", df_spy['intraday_range'].mean()) # average
print("St Dev:", df_spy['intraday_range'].std()) # standard deviation
print("Min:" , df_spy['intraday_range'].min()) # minimum
print("Max:" , df_spy['intraday_range'].max()) # maximum

Mean:  6.421497299999994
St Dev: 2.802254726800228
Min: 2.7200010000000248
Max: 12.419997999999993
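One detail worth knowing when comparing against `numpy`: the `.std()` method of a `Series` computes the sample standard deviation by default (`ddof=1`), while `np.std()` defaults to the population version (`ddof=0`).  A minimal sketch, using three hand-typed intraday ranges rather than the full column:

```python
import numpy as np
import pandas as pd

ranges = pd.Series([2.829987, 2.889984, 8.950012])

pandas_std = ranges.std()               # divides by n - 1 (ddof=1)
numpy_std = np.std(ranges.values)       # divides by n (ddof=0)
pandas_std_ddof0 = ranges.std(ddof=0)   # matches the numpy default

print(pandas_std, numpy_std, pandas_std_ddof0)
```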


**Code Challenge:** Calculate the average daily volume for the trade dates in `df_spy`.

## Related Reading

*PDSH* - Section 3.1 - Introducing Pandas Objects

*PDSH* - Section 2.1 - Understanding Data Types in Python

*PDSH* - Section 2.2 - The Basics of NumPy Arrays

*PDSH* - Section 2.3 - Computation on NumPy Arrays: Universal Functions

*PDSH* - Section 2.4 - Aggregations: Min, Max, and Everything In Between