# Motivation for starting this project

## Our main goals were to:

1. Take a repetitive task done in Excel and attempt to rewrite it using Python;
2. Gain a better understanding of the questions, methodologies and ways of working of Research;
3. Learn how to use Python efficiently on a day-to-day basis.

## Within the wider Research environment framework, we want to begin to answer questions such as:
1. How do we write scripts to automate the downloading of data?
2. How do we store downloaded data in an organized, durable manner?
3. How do we store intermediate results?
4. How do we assess the quality of data?

## The specific question addressed here:

In particular, Travis wanted a means to calculate and compare correlations of commodity time series.  The series are currently stored in Excel sheets.  


---------------------

The first cell in most scripts imports the necessary Python packages and sets global parameters. The packages we used in this project are very commonly used throughout the Python community. Pandas in particular provides a very convenient container, the dataframe, that looks a lot like an Excel sheet or other spreadsheet.

In [11]:
import pandas as pd
#import xlrd  #needed to read 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from IPython.display import Image
%matplotlib inline
plt.style.use('seaborn-whitegrid')
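
As a small illustration of the spreadsheet analogy, the sketch below builds a dataframe by hand. The column name and values are invented for the example and are not taken from the project's Excel file.

```python
import pandas as pd

# Hypothetical data: three days of a made-up "Premium" column.
toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2017-06-05", "2017-06-06", "2017-06-07"]),
        "Premium": [1.0, 1.5, 1.2],
    }
)

# Each column behaves like a spreadsheet column; the index plays the role of row labels.
toy = toy.set_index("Date")
print(toy)
print("Mean premium:", toy["Premium"].mean())
```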

<img src="img/picOrgData1.png">

----------------------

The code below is the most complex part of the script.  It defines a function (introduced by the keyword def), which Python uses to carry out repetitive tasks; you can think of it as a very flexible macro.

The function takes an Excel (.xlsx) file and converts it to a Pandas dataframe.
It loops through each specified sheet of the Excel file and puts the resulting Pandas dataframe into a list of dataframes, which are then combined into a single dataframe.

In [5]:
def fromXLSXtoDF(sheetNames_param, xlsxData_param):
    allDF = []
    for i in sheetNames_param:
        # print the sheet name (not necessary) but print statements give valuable feedback to the programmer and are used everywhere 
        # in the development process.
        print("Reading in sheet:", i)
        # so read each sheet named "i", and grab only columns 0 and 5 (which contain the relevant data)
        df = pd.read_excel(xlsxData_param, sheet_name=i, usecols=[0, 5])
        # set the timestamp to the index, this makes time series analysis easier.
        df = df.set_index("Date")
        # calculate diff and percent change, should be done per year
        # diff is the first difference, so the change in data from one day to the next.  
        df["PremiumDiff"] = df["Premium"].diff(1)
        # percent change is the relative change from one day to the next.
        df["PremiumPrctChange"] = df["Premium"].pct_change(1)
        # add the dataframe to the list of dataframes
        allDF.append(df)
    # concatenate along rows (axis=0), stacking the sheets on top of one another
    result = pd.concat(allDF, axis=0, join='outer')
    # change names of columns to the sheetnames
    # result.columns = sheetNames
    return result
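
To make the two calculations inside the function concrete, here is a minimal sketch on invented numbers (not the project data). It shows what `diff(1)` and `pct_change(1)` return for a short series.

```python
import pandas as pd

# Hypothetical values for four consecutive days.
premium = pd.Series([100.0, 110.0, 99.0, 99.0], name="Premium")

first_diff = premium.diff(1)        # x[t] - x[t-1]
pct_change = premium.pct_change(1)  # (x[t] - x[t-1]) / x[t-1]

print(pd.DataFrame({"Premium": premium, "Diff": first_diff, "PctChange": pct_change}))
```

The first row of both new columns is missing (NaN) because there is no previous day to compare with, which is why missing values are dropped after the function is run.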

The code below takes the required names of the Excel sheets and the Excel file and runs the function fromXLSXtoDF defined above.

In [6]:
#Calculate correlations for raw data

#Sheets that contain the desired data
sheetNames = ["SB Par Mar17", "SB Par Mar18", "SB Par Mar19"]

#Name of the Excel file
xlsxData = 'Pna Prem vs Brl - Jeff.xlsx'

#Run the function
sbYears = fromXLSXtoDF(sheetNames, xlsxData)

#Drop missing values
sbYears = sbYears.dropna()
 

Reading in sheet: SB Par Mar17
Reading in sheet: SB Par Mar18
Reading in sheet: SB Par Mar19


The code below loads the Brazilian currency data.  We didn't write a function because we only do these steps once.

In [None]:
#Read in BRL data

#Read in data from the same Excel file in sheet named 'Brl'
brl = pd.read_excel('Pna Prem vs Brl - Jeff.xlsx', 'Brl')

#Set Date (GMT) as the index, which makes time series analysis easier
brl = brl.set_index('Date (GMT)')

#Re-naming index to Date in order to merge 
brl.index.name = 'Date' 

#Calculate the percentage change, just like in the function 
brl['BrlPrctChange'] = brl["Last"].pct_change(1)

#Calculate the first difference, just like in the function
brl['BrlDiff'] = brl["Last"].diff(1)

#Take only the relevant columns of the Pandas dataframe
brl = brl[["Last", "BrlDiff", "BrlPrctChange"]]

#Rename the columns called Last to Brl
brl = brl.rename(columns = {'Last':'Brl'})

#Drop missing values
brl = brl.dropna()

#Show me the results
brl.head(3)

Combine the two data frames created above

In [None]:
# Merge data
merD = pd.merge(sbYears,brl,how='inner',on='Date')
merD.head(3)
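
The effect of the inner merge can be seen on a toy example. The two frames below are invented for illustration; both use an index named Date, as in the code above, and only the dates present in both frames survive the merge.

```python
import pandas as pd

# Two hypothetical frames indexed by "Date"; only 2017-06-06 and 2017-06-07 appear in both.
left = pd.DataFrame(
    {"Premium": [1.0, 1.5, 1.2]},
    index=pd.to_datetime(["2017-06-05", "2017-06-06", "2017-06-07"]),
)
left.index.name = "Date"

right = pd.DataFrame(
    {"Brl": [3.30, 3.28, 3.31]},
    index=pd.to_datetime(["2017-06-06", "2017-06-07", "2017-06-08"]),
)
right.index.name = "Date"

# how="inner" keeps only dates found in both frames;
# on="Date" matches the index level named "Date" in each frame.
merged = pd.merge(left, right, how="inner", on="Date")
print(merged)
```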

The data is not continuous over the years, so only take those days for which data is available

In [None]:
df1 = merD
year2016 = df1['2016-06-07': '2016-09-29']
print("2016:", year2016.shape)  #print statements show how much data each dataframe holds
year2017 = df1['2017-06-05': '2017-09-29']
print("2017:", year2017.shape)
year2018 = df1['2018-06-05': '2018-09-29']
print("2018:", year2018.shape)
year2019 = df1['2019-06-05': '2019-09-29']
print("2019:", year2019.shape)  #2019 is empty
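
Slicing by date strings relies on the index being a datetime index, and it is most predictable when that index is sorted. The sketch below, on invented data, sorts the index first and uses `.loc` for the slice. Whether the project data needs sorting is an assumption and depends on the order of rows in the Excel sheets.

```python
import pandas as pd

# Hypothetical observations from two periods, deliberately out of order.
dates = pd.to_datetime(["2017-06-06", "2016-06-08", "2017-06-05", "2016-06-07"])
frame = pd.DataFrame({"Premium": [1.5, 0.9, 1.0, 0.8]}, index=dates)
frame.index.name = "Date"

# Sorting first makes label-based slicing on the datetime index well defined.
frame = frame.sort_index()

# Both endpoints are inclusive when slicing by date strings.
toy2016 = frame.loc["2016-06-07":"2016-09-29"]
toy2017 = frame.loc["2017-06-05":"2017-09-29"]
print("2016:", toy2016.shape)
print("2017:", toy2017.shape)
```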

Take a look at the data.  The difference in scale of the data distorts any visual representation of the relationship.

In [None]:
year2017[['PremiumPrctChange', 'BrlPrctChange']].plot(figsize=(15,7))

Rescale the data based on the minimum and maximum for each column of data

In [None]:
#Rescale each column to the range 0 to 1 (min-max scaling), makes comparisons easier
def scaleDF(myDF):
    resultsStd = myDF.values #returns a numpy array
    min_max_scaler = preprocessing.MinMaxScaler()
    results_scaled = min_max_scaler.fit_transform(resultsStd)
    myresults_scaled = pd.DataFrame(results_scaled)
    
    # restore the original column names
    myresults_scaled.columns = myDF.columns
    
    # add index back in
    indx = myDF.index
    myresults_scaled.set_index(indx, inplace=True)
    
    return(myresults_scaled)

In [None]:
dfScale = scaleDF(merD)
dfScale.head(3)
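
With its default settings, MinMaxScaler maps each column to the range 0 to 1 using (x - min) / (max - min). The sketch below reproduces that formula in plain Pandas on invented numbers, to show what the rescaling does; it is an illustration, not a replacement for scaleDF.

```python
import pandas as pd

# Hypothetical frame with two columns on very different scales.
raw = pd.DataFrame({"Premium": [10.0, 20.0, 15.0], "Brl": [3.0, 3.5, 4.0]})

# Column-wise min-max scaling: (x - min) / (max - min), so each column spans 0 to 1.
scaled = (raw - raw.min()) / (raw.max() - raw.min())
print(scaled)
```

Note that in this sketch a column whose values are all identical would produce missing values, because the denominator is zero.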

Now replot the scaled data.  Patterns in the data are more apparent.

In [None]:
year2017 = dfScale['2017-06-05': '2017-09-29']
year2017[['PremiumPrctChange', 'BrlPrctChange']].plot(figsize=(15,7))

Add year, month and day columns to the data for plotting purposes.

In [None]:
df1 = dfScale  #df1 is another name for the scaled data (not a copy), so the new columns are added to dfScale as well

df1["year"] = df1.index.year
df1["month"] = df1.index.month
df1["day"] = df1.index.day
df1.head(3)

In [None]:
# Same trick of subsetting the months for which we have data.
year2016 = df1['2016-06-07': '2017-03-15']
year2017 = df1['2017-06-05': '2018-03-15']
year2018 = df1['2018-06-05': '2019-03-15']
year2019 = df1['2019-06-05': '2020-03-15']

In [None]:
# Plot the scaled Premium and Brl series, one figure per period
plot_cols = ['Premium', 'Brl']
fig, axes = plt.subplots(2,1, figsize=(15,7), sharex=True, sharey=True)
year2016[plot_cols].plot(subplots=True, ax=axes)
fig, axes = plt.subplots(2,1, figsize=(15,7), sharex=True, sharey=True)
year2017[plot_cols].plot(subplots=True, ax=axes)
fig, axes = plt.subplots(2,1, figsize=(15,7), sharex=True, sharey=True)
year2018[plot_cols].plot(subplots=True, ax=axes)

To do:
1. ARIMA
2. Granger causality
3. Cointegration

ARIMA models can be used to predict future values of a series.


In [None]:
df1["Premium"]['2018-06-05': '2019-03-15'].plot(figsize=(15,7))

In [None]:
from pandas import DataFrame
# the older statsmodels.tsa.arima_model module is no longer available in recent statsmodels releases
from statsmodels.tsa.arima.model import ARIMA
from matplotlib import pyplot

series = df1["Premium"]['2018-06-05': '2019-03-15']
# fit an ARIMA(5,1,0) model: 5 autoregressive lags, first differencing, no moving-average terms
model = ARIMA(series, order=(5,1,0))
model_fit = model.fit()
print(model_fit.summary())
# plot residual errors
residuals = DataFrame(model_fit.resid)
residuals.plot()
pyplot.show()
residuals.plot(kind='kde')
pyplot.show()
print(residuals.describe())

Correlations per month, using a 20-day rolling window: June 1st to June 20th gives one correlation, June 2nd to June 21st gives the next, and so on.

In [None]:
# Scaled data frame
dfScale = scaleDF(merD)
dfScale.head(3)
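
A rolling correlation of the kind described above can be sketched as follows. The data is random and purely illustrative, the column names mirror those used earlier, and the window is assumed to count 20 consecutive observations (trading days) rather than 20 calendar days.

```python
import numpy as np
import pandas as pd

# Hypothetical daily percent changes for 60 business days (random, for illustration only).
rng = np.random.default_rng(0)
idx = pd.bdate_range("2017-06-01", periods=60)
changes = pd.DataFrame(
    {
        "PremiumPrctChange": rng.normal(0, 0.01, 60),
        "BrlPrctChange": rng.normal(0, 0.01, 60),
    },
    index=idx,
)

# Correlation over each window of 20 consecutive observations;
# the window slides forward one day at a time.
window = 20
rolling_corr = changes["PremiumPrctChange"].rolling(window).corr(changes["BrlPrctChange"])

# The first window - 1 entries are NaN because a full window is not yet available.
print(rolling_corr.dropna().head())
```

Each value is labelled with the last day of its window, so the entry for the 20th day is the correlation over days 1 to 20.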


Use a VAR (vector autoregression) model to examine how the two time series affect one another.

In [None]:
# Import Statsmodels
from statsmodels.tsa.api import VAR
from statsmodels.tsa.stattools import adfuller
from statsmodels.tools.eval_measures import rmse, aic

Future projects: backtesting