# Python for Finance 2020

MSc in Finance, Universidade Católica Portuguesa

Instructor: João Brogueira de Sousa [jbsousa@ucp.pt]

## Working with data

In this notebook, you will learn the basics of handling data with Python.

In Python, the two main packages for numerical programming and data analysis are [Numpy](https://numpy.org/) and [Pandas](https://pandas.pydata.org/).

- [Numpy](https://numpy.org/) provides tools to work with N-dimensional array objects, `numpy.ndarray`.

- [Pandas](https://pandas.pydata.org/) provides data analysis tools to efficiently handle data in tabular form. This is the package we will use.

This notebook is an introduction to Pandas. To learn more about Numpy, you are encouraged to explore [Numpy's Quickstart tutorial](https://numpy.org/devdocs/user/quickstart.html).
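
For a first impression of a Numpy array, here is a minimal sketch. The prices are made-up values chosen for illustration, and the variable names `prices` and `simple_returns` are not part of the course material.

```python
import numpy as np

# A small, made-up array of prices (illustrative values only)
prices = np.array([100.0, 102.0, 101.0, 105.0])

print(type(prices))   # the array is a numpy.ndarray object
print(prices.shape)   # one dimension with four elements

# Arithmetic is applied element by element, without an explicit loop:
# each price is divided by the previous one to obtain a simple return
simple_returns = prices[1:] / prices[:-1] - 1
print(simple_returns)
```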

We can access both packages using `import` statements:

In [70]:
import numpy as np # import numpy and give it a shorter name, `np`
import pandas as pd # import pandas and give it a shorter name, `pd`

### Pandas

After the `import pandas as pd` statement above, we can access any function available in Pandas through the `pd` prefix:

In [71]:
# data from 11-02-2019 to 11-02-2020 available at https://finance.yahoo.com/quote/%5EGSPC/history?p=%5EGSPC 

SP500 = pd.read_csv('^GSPC.csv') # you need to specify the correct path to the file being read
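
If you do not have the file at hand, you can still try `pd.read_csv` on text held in memory. The sketch below assumes a tiny, hand-typed CSV with three columns (values rounded from the S&P 500 data); `csv_text` and `df` are illustrative names only.

```python
import io

import pandas as pd

# A tiny CSV typed by hand, standing in for a file on disk
csv_text = """Date,Open,Close
2019-02-11,2712.40,2709.80
2019-02-12,2722.61,2744.73
"""

# read_csv accepts a path or any file-like object, such as io.StringIO
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column names taken from the first line
```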

In [72]:
# pd.read_csv? # uncomment this line in Jupyter to display the documentation of the function

In [73]:
type(SP500)

pandas.core.frame.DataFrame

The two main data structures in Pandas are Series (1-dimensional) and DataFrame (2-dimensional). We have just created our first DataFrame. It looks like an Excel spreadsheet:

In [74]:
print(SP500) # this will show you a lot of data

           Date         Open         High          Low        Close  \
0    2019-02-11  2712.399902  2718.050049  2703.790039  2709.800049   
1    2019-02-12  2722.610107  2748.189941  2722.610107  2744.729980   
2    2019-02-13  2750.300049  2761.850098  2748.629883  2753.030029   
3    2019-02-14  2743.500000  2757.899902  2731.229980  2745.729980   
4    2019-02-15  2760.239990  2775.659912  2760.239990  2775.600098   
..          ...          ...          ...          ...          ...   
248  2020-02-05  3324.909912  3337.580078  3313.750000  3334.689941   
249  2020-02-06  3344.919922  3347.959961  3334.389893  3345.780029   
250  2020-02-07  3335.540039  3341.419922  3322.120117  3327.709961   
251  2020-02-10  3318.280029  3352.260010  3317.770020  3352.090088   
252  2020-02-11  3365.870117  3375.629883  3357.989990  3358.020020   

       Adj Close      Volume  
0    2709.800049  3361970000  
1    2744.729980  3827770000  
2    2753.030029  3670770000  
3    2745.729980  38367
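
To see how the two data structures relate, we can also build a Series and a DataFrame by hand. This is a minimal sketch with a few values rounded from the data above; the names `close` and `df` are illustrative.

```python
import pandas as pd

# A Series: one-dimensional, labelled values
close = pd.Series([2709.80, 2744.73, 2753.03], name='Close')

# A DataFrame: a two-dimensional table, built here from a dictionary of columns
df = pd.DataFrame({
    'Date': ['2019-02-11', '2019-02-12', '2019-02-13'],
    'Close': [2709.80, 2744.73, 2753.03],
})

print(type(close))
print(type(df))
print(type(df['Close']))  # a single column of a DataFrame is a Series
```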

The first thing we may want to do is to inspect how the data is organised, by looking at a small part of the DataFrame.

In [75]:
SP500.head() # displays the top of the table

| | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|
| 0 | 2019-02-11 | 2712.399902 | 2718.050049 | 2703.790039 | 2709.800049 | 2709.800049 | 3361970000 |
| 1 | 2019-02-12 | 2722.610107 | 2748.189941 | 2722.610107 | 2744.72998 | 2744.72998 | 3827770000 |
| 2 | 2019-02-13 | 2750.300049 | 2761.850098 | 2748.629883 | 2753.030029 | 2753.030029 | 3670770000 |
| 3 | 2019-02-14 | 2743.5 | 2757.899902 | 2731.22998 | 2745.72998 | 2745.72998 | 3836700000 |
| 4 | 2019-02-15 | 2760.23999 | 2775.659912 | 2760.23999 | 2775.600098 | 2775.600098 | 3641370000 |


In [76]:
SP500.tail(3) # displays the bottom 3 rows

| | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|
| 250 | 2020-02-07 | 3335.540039 | 3341.419922 | 3322.120117 | 3327.709961 | 3327.709961 | 3730650000 |
| 251 | 2020-02-10 | 3318.280029 | 3352.26001 | 3317.77002 | 3352.090088 | 3352.090088 | 3450350000 |
| 252 | 2020-02-11 | 3365.870117 | 3375.629883 | 3357.98999 | 3358.02002 | 3358.02002 | 1175962914 |


A DataFrame can have columns with different data types:

In [77]:
SP500.dtypes

Date          object
Open         float64
High         float64
Low          float64
Close        float64
Adj Close    float64
Volume         int64
dtype: object
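
Note that the `Date` column was read as text, not as dates. One common way to change this is `pd.to_datetime`; passing `parse_dates=['Date']` to `pd.read_csv` is an alternative. The sketch below uses a small hand-made DataFrame (values rounded from the data above) rather than `SP500` itself.

```python
import pandas as pd

# Dates stored as plain text, as they are after a default read_csv
df = pd.DataFrame({
    'Date': ['2019-02-11', '2019-02-12', '2019-02-13'],
    'Close': [2709.80, 2744.73, 2753.03],
})

# Convert the text column to a datetime column
df['Date'] = pd.to_datetime(df['Date'])
print(df.dtypes)

# Datetime columns expose date components through the .dt accessor
print(df['Date'].dt.year)
print(df['Date'].dt.month)
```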

Pandas distinguishes a few data types for the columns of a DataFrame:

- Booleans (`bool`)
- Integers (`int64`)
- Floats (`float64`)
- Dates and times (`datetime64`)
- Categorical data (`category`)
- Everything else (`object`)
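
The data types in the list above can be seen together in a small example. This is a sketch with invented column names and values; only the Pandas calls themselves (`pd.DataFrame`, `pd.to_datetime`, `pd.Categorical`) are real.

```python
import pandas as pd

# Each column is created with a different kind of data (made-up values)
example = pd.DataFrame({
    'is_up': [True, False, True],                      # Booleans
    'volume': [100, 250, 175],                         # integers
    'close': [10.5, 10.1, 10.8],                       # floats
    'date': pd.to_datetime(['2020-02-07', '2020-02-10', '2020-02-11']),
    'sector': pd.Categorical(['tech', 'energy', 'tech']),
})

# One data type per column
print(example.dtypes)
```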

By using Pandas we open the door to a rich collection of tools to work with data. As a quick preview:

In [78]:
SP500.describe()

| | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| count | 253.0 | 253.0 | 253.0 | 253.0 | 253.0 | 253.0 |
| mean | 2983.837594 | 2995.493635 | 2971.836488 | 2984.992244 | 2984.992244 | 3522500000.0 |
| std | 155.519072 | 154.693839 | 157.08853 | 156.019198 | 156.019198 | 579719000.0 |
| min | 2712.399902 | 2718.050049 | 2703.790039 | 2709.800049 | 2709.800049 | 1175963000.0 |
| 25% | 2873.98999 | 2890.030029 | 2855.939941 | 2878.379883 | 2878.379883 | 3218700000.0 |
| 50% | 2952.709961 | 2964.149902 | 2944.050049 | 2952.01001 | 2952.01001 | 3499150000.0 |
| 75% | 3081.25 | 3093.090088 | 3074.870117 | 3087.01001 | 3087.01001 | 3771200000.0 |
| max | 3365.870117 | 3375.629883 | 3357.98999 | 3358.02002 | 3358.02002 | 6454270000.0 |
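
Each statistic reported by `describe()` is also available as a method of its own. A minimal sketch on a short Series, with values rounded from the first five closing prices; the names `close` and `summary` are illustrative.

```python
import pandas as pd

close = pd.Series([2709.80, 2744.73, 2753.03, 2745.73, 2775.60], name='Close')

print(close.mean())         # average
print(close.std())          # sample standard deviation
print(close.quantile(0.5))  # the 50% quantile, which is the median

# describe() collects these statistics in one Series, indexed by their names
summary = close.describe()
print(summary['mean'])
```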


We can get different elements of a DataFrame in a variety of ways:

In [79]:
SP500['Close'] # select a single column

0      2709.800049
1      2744.729980
2      2753.030029
3      2745.729980
4      2775.600098
          ...     
248    3334.689941
249    3345.780029
250    3327.709961
251    3352.090088
252    3358.020020
Name: Close, Length: 253, dtype: float64

In [80]:
SP500[0:3] # select first three rows

| | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|
| 0 | 2019-02-11 | 2712.399902 | 2718.050049 | 2703.790039 | 2709.800049 | 2709.800049 | 3361970000 |
| 1 | 2019-02-12 | 2722.610107 | 2748.189941 | 2722.610107 | 2744.72998 | 2744.72998 | 3827770000 |
| 2 | 2019-02-13 | 2750.300049 | 2761.850098 | 2748.629883 | 2753.030029 | 2753.030029 | 3670770000 |


In [81]:
SP500.loc[0:3, ['Open', 'Close']] # select rows and columns by label (with .loc, the end label 3 is included)

| | Open | Close |
|---|---|---|
| 0 | 2712.399902 | 2709.800049 |
| 1 | 2722.610107 | 2744.72998 |
| 2 | 2750.300049 | 2753.030029 |
| 3 | 2743.5 | 2745.72998 |


In [82]:
SP500.iloc[0:5, 0:2] # select by position (with .iloc, the end position is excluded)

| | Date | Open |
|---|---|---|
| 0 | 2019-02-11 | 2712.399902 |
| 1 | 2019-02-12 | 2722.610107 |
| 2 | 2019-02-13 | 2750.300049 |
| 3 | 2019-02-14 | 2743.5 |
| 4 | 2019-02-15 | 2760.23999 |
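
A detail that is easy to miss in the two cells above: slices in `.loc` include the end label, while slices in `.iloc` exclude the end position, as in ordinary Python slicing. The sketch below shows this on a made-up DataFrame with the default integer index; the values and the names `by_label` and `by_position` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Open': [10.0, 11.0, 12.0, 13.0, 14.0],
    'Close': [10.5, 10.8, 12.4, 12.9, 14.2],
})

# Label-based: rows with labels 0, 1, 2 and 3 (four rows)
by_label = df.loc[0:3, ['Open', 'Close']]

# Position-based: rows at positions 0, 1 and 2 (three rows)
by_position = df.iloc[0:3, 0:2]

print(len(by_label), len(by_position))
```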


In [83]:
SP500[SP500['Close'] > SP500['Close'].max()*0.99] # Boolean indexing, note the .max() method to find the max Close price

| | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|
| 236 | 2020-01-17 | 3323.659912 | 3329.879883 | 3318.860107 | 3329.620117 | 3329.620117 | 3698170000 |
| 239 | 2020-01-23 | 3315.77002 | 3326.879883 | 3301.870117 | 3325.540039 | 3325.540039 | 3764860000 |
| 248 | 2020-02-05 | 3324.909912 | 3337.580078 | 3313.75 | 3334.689941 | 3334.689941 | 4117730000 |
| 249 | 2020-02-06 | 3344.919922 | 3347.959961 | 3334.389893 | 3345.780029 | 3345.780029 | 3868370000 |
| 250 | 2020-02-07 | 3335.540039 | 3341.419922 | 3322.120117 | 3327.709961 | 3327.709961 | 3730650000 |
| 251 | 2020-02-10 | 3318.280029 | 3352.26001 | 3317.77002 | 3352.090088 | 3352.090088 | 3450350000 |
| 252 | 2020-02-11 | 3365.870117 | 3375.629883 | 3357.98999 | 3358.02002 | 3358.02002 | 1175962914 |
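
Boolean indexing works in two steps: the comparison produces a Series of `True`/`False` values, and that Series then selects the rows. Conditions can be combined with `&` (and) and `|` (or), each wrapped in parentheses. A minimal sketch with made-up data and illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    'Close': [100.0, 104.0, 99.0, 107.0],
    'Volume': [300, 500, 450, 200],
})

# Step 1: the comparison returns one True/False value per row
mask = df['Close'] > 100
print(mask)

# Step 2: the Boolean Series selects the rows where it is True
print(df[mask])

# Combining conditions: parentheses are required around each comparison
both = df[(df['Close'] > 100) & (df['Volume'] > 250)]
either = df[(df['Close'] > 105) | (df['Volume'] > 400)]
print(both)
print(either)
```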


We can also change an instance of a DataFrame in several ways.

In [84]:
SP500['Open - Close'] = SP500['Open'] - SP500['Close'] # create a new column as the difference of two existing columns

In [85]:
SP500.head()

| | Date | Open | High | Low | Close | Adj Close | Volume | Open - Close |
|---|---|---|---|---|---|---|---|---|
| 0 | 2019-02-11 | 2712.399902 | 2718.050049 | 2703.790039 | 2709.800049 | 2709.800049 | 3361970000 | 2.599853 |
| 1 | 2019-02-12 | 2722.610107 | 2748.189941 | 2722.610107 | 2744.72998 | 2744.72998 | 3827770000 | -22.119873 |
| 2 | 2019-02-13 | 2750.300049 | 2761.850098 | 2748.629883 | 2753.030029 | 2753.030029 | 3670770000 | -2.72998 |
| 3 | 2019-02-14 | 2743.5 | 2757.899902 | 2731.22998 | 2745.72998 | 2745.72998 | 3836700000 | -2.22998 |
| 4 | 2019-02-15 | 2760.23999 | 2775.659912 | 2760.23999 | 2775.600098 | 2775.600098 | 3641370000 | -15.360108 |


In [86]:
SP500['Volume'] = SP500['Volume']/1e3 # changes the values in a given column

In [87]:
SP500.tail()

| | Date | Open | High | Low | Close | Adj Close | Volume | Open - Close |
|---|---|---|---|---|---|---|---|---|
| 248 | 2020-02-05 | 3324.909912 | 3337.580078 | 3313.75 | 3334.689941 | 3334.689941 | 4117730.0 | -9.780029 |
| 249 | 2020-02-06 | 3344.919922 | 3347.959961 | 3334.389893 | 3345.780029 | 3345.780029 | 3868370.0 | -0.860107 |
| 250 | 2020-02-07 | 3335.540039 | 3341.419922 | 3322.120117 | 3327.709961 | 3327.709961 | 3730650.0 | 7.830078 |
| 251 | 2020-02-10 | 3318.280029 | 3352.26001 | 3317.77002 | 3352.090088 | 3352.090088 | 3450350.0 | -33.810059 |
| 252 | 2020-02-11 | 3365.870117 | 3375.629883 | 3357.98999 | 3358.02002 | 3358.02002 | 1175962.914 | 7.850097 |


In [88]:
SP500.drop('Open - Close', axis='columns', inplace=True) # remove the column; inplace=True modifies SP500 directly

In [89]:
SP500.head()

| | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|
| 0 | 2019-02-11 | 2712.399902 | 2718.050049 | 2703.790039 | 2709.800049 | 2709.800049 | 3361970.0 |
| 1 | 2019-02-12 | 2722.610107 | 2748.189941 | 2722.610107 | 2744.72998 | 2744.72998 | 3827770.0 |
| 2 | 2019-02-13 | 2750.300049 | 2761.850098 | 2748.629883 | 2753.030029 | 2753.030029 | 3670770.0 |
| 3 | 2019-02-14 | 2743.5 | 2757.899902 | 2731.22998 | 2745.72998 | 2745.72998 | 3836700.0 |
| 4 | 2019-02-15 | 2760.23999 | 2775.659912 | 2760.23999 | 2775.600098 | 2775.600098 | 3641370.0 |


We can make these changes permanent in the DataFrame either by using the option `inplace=True` (when it is available) or by assigning the output back to the variable (as in the example above: `SP500['Volume'] = SP500['Volume']/1e3`).
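
The difference between the two approaches can be seen on a small DataFrame. This is only a sketch: the values are invented, and `smaller` and `columns_before` are illustrative names. Without `inplace=True`, `drop` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

df = pd.DataFrame({'Open': [10.0, 11.0], 'Close': [10.5, 10.8]})
df['Open - Close'] = df['Open'] - df['Close']

# drop without inplace=True: the result is a new DataFrame
smaller = df.drop('Open - Close', axis='columns')
columns_before = list(df.columns)
print(columns_before)         # df still has the 'Open - Close' column
print(list(smaller.columns))  # the new DataFrame does not

# Assigning the result back to the same name makes the change permanent
df = df.drop('Open - Close', axis='columns')
print(list(df.columns))
```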