# Tutorial 03 - Introduction to `numpy` and `pandas`

- Python is becoming a de facto standard programming language in science and data analysis.

- However, this was not always the case.  Python was not initially designed for data analysis (unlike R and Matlab, which both were).

- Python became a scientific computing workhorse through the development of two packages:
    - `numpy`
    - `pandas`

- The purpose of this tutorial is to introduce these two packages, and along the way to take a first look at some financial data.

## Import Packages

As usual, let's start by importing the packages that we will need.

In [1]:
##> import numpy as np
##> import pandas as pd




Both of these packages have a lot of functionality, but here is what they do in brief:
- `numpy`: vector and matrix computation (similar to what is native in R and Matlab)
- `pandas`: a dataframe data structure for storing and analyzing tabular data (similar to the `data.frame` in R)

## `numpy.array`

- We have already discussed the `list` structure, which is Python's simplest and most flexible way of storing multiple values in a single variable.

- However, this flexibility comes at a cost in performance: numerical computation on large lists is slow.

- The `array` structure in the numpy package can be thought of as a vector or a matrix, and allows for efficient computation.
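
To make the contrast between lists and arrays concrete, here is a minimal sketch. The variable names and the numbers are made up for illustration. It shows that arithmetic on a list needs an explicit loop, while the same arithmetic on an array applies to every element at once.

```python
import numpy as np

prices = [10.0, 20.0, 30.0]      # a plain Python list (made-up numbers)
prices_arr = np.array(prices)    # the same values stored in a numpy array

# With a list, arithmetic on every element needs an explicit loop.
doubled_list = [p * 2 for p in prices]

# With an array, the operation is applied to every element at once.
doubled_arr = prices_arr * 2

# Careful: `*` on a list repeats the list instead of doubling the values.
repeated = prices * 2

print(doubled_list)
print(doubled_arr)
print(len(repeated))   # 6
```

The array version is also the one that stays fast as the data grows, because the loop runs inside `numpy` rather than in Python.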

The easiest way to create an array is to start with a list and pass it to the `np.array()` function.

Let's try the following code:

In [2]:
##> l = [1, 2, 3]
##> arr1 = np.array(l)
##> arr2 = np.array([1, 2, 3])




Let's now explore the types of the variables we just created.

In [3]:
##> print(type(l))
##> print(type(arr1))
##> print(type(arr2))




Let's look at the content of what's inside our two arrays.

In [4]:
##> print(arr1)
##> print(arr2)



If you evaluate an array on its own in the console (without `print()`), this is what it looks like:

In [5]:
##> arr1
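
As a small aside on the "vector or a matrix" description above: passing a nested list to `np.array()` gives a two-dimensional array. The sketch below uses made-up values and hypothetical variable names, and checks the `.shape` attribute of each array.

```python
import numpy as np

vec = np.array([1, 2, 3])                 # flat list -> one-dimensional array (a vector)
mat = np.array([[1, 2, 3], [4, 5, 6]])    # nested list -> two-dimensional array (a matrix)

print(vec.shape)   # (3,)
print(mat.shape)   # (2, 3)
print(mat)
```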


- We typically won't work with arrays directly, or have to build them from scratch.
- Usually, we will be working with them indirectly, since `pandas` dataframes are built on top of them.
- It's good to know that arrays exist, and to realize that `numpy` is what makes scientific computing possible in Python.

## `pandas.DataFrame`

- The `DataFrame` structure from the `pandas` package is going to be our primary workhorse.

- A `DataFrame` is a convenient way to store rectangular data that consists of rows and columns.

- Usually, the data that goes in a `DataFrame` come from an external source.  

- In this class, our data will usually come from special text files, called CSV files, and will be read into a `DataFrame` via the `pandas.read_csv()` function.

Let's read our first data set into a `DataFrame` by typing the following:

In [6]:
##> df_spy = pd.read_csv("data/spy_dec_2018.csv")
##> df_spy




- This dataframe consists of all the December 2018 end-of-day prices for SPY (an ETF that tracks the S&P 500 Index).
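
If the CSV file is not at hand, the following sketch shows how `pd.read_csv()` behaves on a tiny in-memory stand-in. Everything here is an assumption for illustration: the rows are made up (they are NOT real SPY prices), the column names are a guess at the file layout, and `df_demo` is a hypothetical name.

```python
import io
import pandas as pd

# Made-up rows for illustration only; these are NOT real SPY prices,
# and the column names are an assumption about the file layout.
csv_text = """date,open,high,low,close,volume
2018-12-03,100.0,102.0,99.0,101.0,1000
2018-12-04,101.0,101.5,97.0,98.0,1500
2018-12-06,98.0,99.5,96.0,99.0,1200
"""

# io.StringIO lets read_csv treat a string as if it were a file on disk.
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo.shape)   # (3, 6)
print(df_demo)
```

The first line of the text becomes the column names, and each later line becomes one row.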

## Exploring a `DataFrame`

Let's explore the `df_spy` dataframe that we have just created.  

This is a very typical thing to do once you've loaded a new dataset.

First, we can use the `type()` function to make sure that what we have created is in fact a `DataFrame`.

In [7]:
##> type(df_spy)



Next, we can use the `.dtypes` attribute of the `DataFrame` to see the data types of each of the columns.

In [8]:
##> df_spy.dtypes



We can check the number of rows and columns by using the `.shape` attribute.

In [9]:
##> df_spy.shape



- As we can see, our dataframe `df_spy` consists of 18 rows and 7 columns.

- A `DataFrame` can be thought of as a collection of rows and a collection of columns.  Both viewpoints are useful.
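
The two viewpoints can be sketched on a small made-up dataframe. The name `df_demo` and its values are hypothetical, and `.iloc[0]` (select the first row by position) is used here only as one possible way to pull out a row.

```python
import pandas as pd

# Made-up values for illustration only.
df_demo = pd.DataFrame({
    "date": ["2018-12-03", "2018-12-04"],
    "close": [101.0, 98.0],
    "volume": [1000, 1500],
})

first_row = df_demo.iloc[0]     # row view: one trade date across all columns
close_col = df_demo["close"]    # column view: one variable across all rows

print(df_demo.shape)            # (2, 3)
print(first_row["close"])       # 101.0
print(close_col.tolist())       # [101.0, 98.0]
```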

## `DataFrame` Columns

To isolate a particular column, say the `close` column, we can use square brackets as follows:

In [10]:
##> df_spy['close']



As we can see from the following code, each column of a `DataFrame` is actually a different kind of `pandas` structure called a `Series`. 

In [11]:
##> type(df_spy['close'])



Here is a bit of `pandas` nerdery (don't get too bogged down with these details):
- A `DataFrame` is really a collection of columns glued together, and each column is a `Series`.
- A pandas `Series` has two major components: 1) `.values`; 2) `.index`.
- The `.values` of a `Series` is a `numpy.array`.

Let's look at the `.values` attribute of the `close` column of `df_spy`:

In [12]:
##> df_spy['close'].values
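
The two components of a `Series` can be inspected on a small example. This is a sketch with made-up values; `close` is a hypothetical stand-in for the `close` column of `df_spy`.

```python
import numpy as np
import pandas as pd

close = pd.Series([101.0, 98.0, 99.0], name="close")   # made-up values

print(type(close.values))   # the values are stored as a numpy array
print(close.index)          # the index labels each value; by default 0, 1, 2, ...
print(list(close.index))    # [0, 1, 2]
```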



## Component-wise Column Operations

- As we saw in the previous section, a `DataFrame` column is essentially a fancy `numpy.array`.

- We can think of a `numpy.array` as a vector in Python.

- As such, we can perform vector-like calculations with `DataFrame` columns.

We can perform scalar arithmetic on the entire column, and the calculation *broadcasts* as we would expect:

In [13]:
##> df_spy['close'] / 100



In [14]:
##> df_spy['close'] + 100



We can also perform component-wise calculations between two columns.

Let's say we want to calculate the intraday range of SPY for each of the trade-dates in `df_spy`.  This is the difference between the `high` and the `low` of each day.  We can do this easily from the columns of our `DataFrame`.

In [15]:
##> df_spy['high'] - df_spy['low']
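
To see exactly which numbers are combined, here is a sketch on three made-up rows (`df_demo` and its values are hypothetical). A scalar is applied to every element, while a column-with-column operation pairs up the elements row by row.

```python
import pandas as pd

# Made-up values for illustration only.
df_demo = pd.DataFrame({"high": [102.0, 101.5, 99.5], "low": [99.0, 97.0, 96.0]})

scaled = df_demo["high"] / 100                      # the scalar is applied to every element
intraday = df_demo["high"] - df_demo["low"]         # row i of high minus row i of low

print(scaled.tolist())
print(intraday.tolist())   # [3.0, 4.5, 3.5]
```

In both cases the result is a `Series` with the same number of rows as the inputs.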


## Adding Columns to a `DataFrame`

- Data analysis often involves starting with a `DataFrame` and then adding new columns to it, which are functions of the existing columns.

- Continuing our previous example, let's say we want to save our intraday ranges back into `df_spy` for further analysis later.

We can do this easily with the following code:

In [16]:
##> df_spy['intraday_range'] = df_spy['high'] - df_spy['low']
##> df_spy



- Notice that we assigned to `intraday_range` as if the column already existed; `pandas` creates the column at the moment of assignment. Its values are a function of two of our existing columns.
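
The following sketch checks this behavior on a small made-up dataframe (`df_demo` is a hypothetical name): the column is absent before the assignment and present afterwards.

```python
import pandas as pd

# Made-up values for illustration only.
df_demo = pd.DataFrame({"high": [102.0, 101.5], "low": [99.0, 97.0]})

had_column_before = "intraday_range" in df_demo.columns
print(had_column_before)   # False

# Assigning to a column name that does not exist yet creates the column.
df_demo["intraday_range"] = df_demo["high"] - df_demo["low"]

has_column_after = "intraday_range" in df_demo.columns
print(has_column_after)    # True
print(df_demo.shape)       # (2, 3)
```

Assigning to a name that already exists overwrites that column instead, so it is worth checking the spelling of the column name.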

## Aggregating Calculations on `DataFrames`

- We have already seen how to do *component-wise* calculations on the columns of a `DataFrame`.

- A *component-wise* calculation can be thought of as a function that takes in vectors and scalars, and returns a vector of the same length.

- It is often useful to perform *aggregation* calculations on a `DataFrame`.

- An *aggregation* can be thought of as a function that takes in a vector, and returns a scalar.
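
The difference between the two kinds of calculation shows up in what comes back. In this sketch (made-up numbers, hypothetical names), the component-wise result is still a `Series` of the same length, while the aggregation collapses the column to one number.

```python
import pandas as pd

volume = pd.Series([1000, 1500, 1200])   # made-up values

in_thousands = volume / 1000   # component-wise: a Series goes in, a Series comes out
total = volume.sum()           # aggregation: a Series goes in, a single number comes out

print(type(in_thousands).__name__, len(in_thousands))   # Series 3
print(total)                                            # 3700
```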

For example, we can use the following code to calculate the total SPY volume during December:

In [17]:
##> df_spy['volume'].sum()


Let's break down what we did here:
- First, we isolated the `volume` column of the `DataFrame`, which is a `Series` (a souped-up `array`).
- Like a `numpy.array`, a `Series` has a method called `.sum()`, which adds up all the components.

Next, let's calculate some summary statistics on the `intraday_range` column that we added to our dataframe earlier.  Notice that each of these summary statistics is just an aggregating calculation on the column.

In [18]:
##> print("Mean: ", df_spy['intraday_range'].mean()) # average
##> print("St Dev:", df_spy['intraday_range'].std()) # standard deviation
##> print("Min:" , df_spy['intraday_range'].min()) # minimum
##> print("Max:" , df_spy['intraday_range'].max()) # maximum
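
As an optional aside, `pandas` can also produce these statistics in one call with the `.describe()` method of a `Series`. The sketch below uses made-up ranges and hypothetical names. It also illustrates one detail worth knowing: the `.std()` method in `pandas` computes the sample standard deviation by default (`ddof=1`), whereas `np.std()` computes the population version by default (`ddof=0`), so the two can disagree on the same data.

```python
import numpy as np
import pandas as pd

ranges = pd.Series([3.0, 4.5, 3.5])   # made-up intraday ranges

# One call returns count, mean, std, min, quartiles, and max.
summary = ranges.describe()
print(summary)

# pandas defaults to the sample standard deviation (ddof=1) ...
sample_std = ranges.std()
# ... while numpy defaults to the population standard deviation (ddof=0).
population_std = np.std(ranges.to_numpy())

print(sample_std, population_std)
```

Passing `ddof=1` to `np.std()` reproduces the `pandas` value.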



## Related Reading

*PDSH* - Section 3.1 - Introducing Pandas Objects

*PDSH* - Section 2.1 - Understanding Data Types in Python

*PDSH* - Section 2.2 - The Basics of NumPy Arrays

*PDSH* - Section 2.3 - Computation on NumPy Arrays: Universal Functions

*PDSH* - Section 2.4 - Aggregations: Min, Max, and Everything In Between