<h1 style="text-align: center;"> Model Tuning </p>

## Notebook Description

In this notebook, time-series data are modeled for forecasting oil stock performance as part of the requirements of the RMDS 2021 Data Science Competition.

##  Table of contents
1. [Required Libraries](#Required-Libraries)
2. [Load Data](#Load-Data)
3. 
4. 
5. 
6. 
7. 
8. 
9. 
10. 
11. [ARIMA Modeling](#ARIMA-Modeling)
12. [Conclusion](#Conclusion)

## Required Libraries

[[ go back to the top ]](#Table-of-contents)

This notebook uses several Python libraries such as:

In [151]:
# Load required packages 
import datetime
from datetime import timedelta
import numpy as np
import pandas as pd

# Visuals
import matplotlib.pyplot as plt
import seaborn as sns

# Time-Series
import statsmodels.api as sm
#from statsmodels.tsa.stattools import adfuller
#from statsmodels.tsa.seasonal import seasonal_decompose
#from statsmodels.tsa.stattools import acf, pacf
#from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima_model import ARMA, ARIMA
from scipy import signal
import scipy.stats as stats

import warnings
warnings.filterwarnings("ignore")

<a id='Load-Data'></a>

## Load Data

[[ go back to the top ]](#Table-of-contents)

In [152]:
# Load Data Function
def LOAD_DATA(filepath, filename):
    # Read CSV files
    if filename.endswith('.csv'):
        new_df = pd.read_csv(filepath+filename)

    # Read Excel files
    elif filename.endswith('.xlsx'):
        new_df = pd.read_excel(filepath+filename)

    if type(df.index) != pd.core.indexes.datetimes.DatetimeIndex:
        for col in new_df.columns:
            if col.lower().find('date') != -1:
                print(f"TIMESTAMP FOUND! '{col}'")
                print()
                new_df['date'] = pd.to_datetime(new_df[col]) # format = '%Y/%m/%d'
                new_df.drop(columns = col, inplace = True)
                new_df.set_index('date', inplace = True)
                
    # Try to identify the date column
    elif type(df.index) == pd.core.indexes.datetimes.DatetimeIndex:
        print('Index already datetime')
        
    display(new_df.info())
    return new_df

In [165]:
google_mobility.index

RangeIndex(start=0, stop=382, step=1)

In [153]:
# Load Stock-Closing-Price by Company Data
fpath = '../../data/Transportation/google/'
fname = 'baseline_pct_change.csv'
google_mobility = LOAD_DATA(filepath = fpath, filename = fname)
google_mobility

Index already datetime
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 382 entries, 0 to 381
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   date                   382 non-null    object 
 1   workplaces             382 non-null    float64
 2   retail_and_recreation  382 non-null    float64
 3   grocery_and_pharmacy   382 non-null    float64
 4   residential            382 non-null    float64
 5   transit_stations       382 non-null    float64
 6   parks                  382 non-null    float64
dtypes: float64(6), object(1)
memory usage: 21.0+ KB


None

Unnamed: 0,date,workplaces,retail_and_recreation,grocery_and_pharmacy,residential,transit_stations,parks
0,2020-02-15,0.502403,6.767918,2.568272,-0.988119,4.553895,17.078580
1,2020-02-16,0.567077,8.225749,2.717209,-0.944541,3.967670,18.122368
2,2020-02-17,-17.756044,4.049697,-0.375113,4.405907,1.903259,28.271605
3,2020-02-18,-0.006305,-0.211659,-1.377153,1.345745,2.440042,5.466077
4,2020-02-19,1.001656,2.222899,0.639556,0.345455,2.634615,8.153166
...,...,...,...,...,...,...,...
377,2021-02-26,-18.122059,-11.026122,-7.390536,7.725445,-13.963636,-8.750973
378,2021-02-27,-7.028979,-7.173964,-3.032665,3.771024,-10.169615,8.134422
379,2021-02-28,-10.865575,-7.526728,-5.262815,3.687106,-9.389011,0.226268
380,2021-03-01,-18.512951,-8.222857,-3.490489,6.898305,-13.624381,-11.995798


In [154]:
# Load International Sentiment Data

fpath = '../../data/News_AI_Sentiments/'
fname = 'daily-news-sentiment-international.csv'

sentiment_int = LOAD_DATA(filepath = fpath, filename = fname)
sentiment_int

Index already datetime
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 2 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   DateTime              395 non-null    object 
 1   Daily News Sentiment  395 non-null    float64
dtypes: float64(1), object(1)
memory usage: 6.3+ KB


None

Unnamed: 0,DateTime,Daily News Sentiment
0,4/17/2000 0:00,-0.10
1,8/1/2000 0:00,0.20
2,1/24/2001 0:00,0.20
3,4/4/2001 0:00,-0.10
4,10/31/2002 0:00,-0.50
...,...,...
390,2/23/2021 0:00,-0.27
391,2/24/2021 0:00,-0.10
392,2/25/2021 0:00,0.00
393,2/26/2021 0:00,-0.30


In [155]:
# Load North American Sentiment Data

fpath = '../../data/News_AI_Sentiments/'
fname = 'daily-news-sentiment-NA.csv'

sentiment_na = LOAD_DATA(filepath = fpath, filename = fname)
sentiment_na

Index already datetime
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 396 entries, 0 to 395
Data columns (total 2 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   DateTime              396 non-null    object 
 1   Daily News Sentiment  396 non-null    float64
dtypes: float64(1), object(1)
memory usage: 6.3+ KB


None

Unnamed: 0,DateTime,Daily News Sentiment
0,7/26/2013 0:00,0.10
1,8/1/2013 0:00,-0.50
2,8/8/2013 0:00,0.10
3,8/9/2013 0:00,0.10
4,8/17/2013 0:00,0.20
...,...,...
391,2/23/2021 0:00,-0.50
392,2/24/2021 0:00,-0.50
393,2/25/2021 0:00,0.07
394,2/26/2021 0:00,-0.40


In [156]:
# Load Stock-Closing-Price by Company Data
fpath = '../../'
fname = 'closing_price_by_company.csv'
closing_price = LOAD_DATA(filepath = fpath, filename = fname)
#closing_price

Index already datetime
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 985 entries, 0 to 984
Data columns (total 11 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   Date                                     985 non-null    object 
 1   Value_PHILLIPS 66                        985 non-null    float64
 2   Value_BP P.L.C.                          985 non-null    float64
 3   Value_VALERO ENERGY CORPORATION          985 non-null    float64
 4   Value_CHEVRON CORPORATION                985 non-null    float64
 5   Value_OCCIDENTAL PETROLEUM CORPORATION   985 non-null    float64
 6   Value_MARATHON OIL CORPORATION           985 non-null    float64
 7   Value_PIONEER NATURAL RESOURCES COMPANY  985 non-null    float64
 8   Value_CONOCOPHILLIPS                     985 non-null    float64
 9   Value_EXXON MOBIL CORPORATION            985 non-null    float64
 10  Value_MARATHON PETROLEUM CO

None

In [157]:
# Load Stock-Closing-Price by Company Data
fpath = '../../data/Transportation/google/'
fname = 'baseline_pct_change.csv'
google_mobility = LOAD_DATA(filepath = fpath, filename = fname)
google_mobility

Index already datetime
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 382 entries, 0 to 381
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   date                   382 non-null    object 
 1   workplaces             382 non-null    float64
 2   retail_and_recreation  382 non-null    float64
 3   grocery_and_pharmacy   382 non-null    float64
 4   residential            382 non-null    float64
 5   transit_stations       382 non-null    float64
 6   parks                  382 non-null    float64
dtypes: float64(6), object(1)
memory usage: 21.0+ KB


None

Unnamed: 0,date,workplaces,retail_and_recreation,grocery_and_pharmacy,residential,transit_stations,parks
0,2020-02-15,0.502403,6.767918,2.568272,-0.988119,4.553895,17.078580
1,2020-02-16,0.567077,8.225749,2.717209,-0.944541,3.967670,18.122368
2,2020-02-17,-17.756044,4.049697,-0.375113,4.405907,1.903259,28.271605
3,2020-02-18,-0.006305,-0.211659,-1.377153,1.345745,2.440042,5.466077
4,2020-02-19,1.001656,2.222899,0.639556,0.345455,2.634615,8.153166
...,...,...,...,...,...,...,...
377,2021-02-26,-18.122059,-11.026122,-7.390536,7.725445,-13.963636,-8.750973
378,2021-02-27,-7.028979,-7.173964,-3.032665,3.771024,-10.169615,8.134422
379,2021-02-28,-10.865575,-7.526728,-5.262815,3.687106,-9.389011,0.226268
380,2021-03-01,-18.512951,-8.222857,-3.490489,6.898305,-13.624381,-11.995798


In [158]:
google_mobility

Unnamed: 0,date,workplaces,retail_and_recreation,grocery_and_pharmacy,residential,transit_stations,parks
0,2020-02-15,0.502403,6.767918,2.568272,-0.988119,4.553895,17.078580
1,2020-02-16,0.567077,8.225749,2.717209,-0.944541,3.967670,18.122368
2,2020-02-17,-17.756044,4.049697,-0.375113,4.405907,1.903259,28.271605
3,2020-02-18,-0.006305,-0.211659,-1.377153,1.345745,2.440042,5.466077
4,2020-02-19,1.001656,2.222899,0.639556,0.345455,2.634615,8.153166
...,...,...,...,...,...,...,...
377,2021-02-26,-18.122059,-11.026122,-7.390536,7.725445,-13.963636,-8.750973
378,2021-02-27,-7.028979,-7.173964,-3.032665,3.771024,-10.169615,8.134422
379,2021-02-28,-10.865575,-7.526728,-5.262815,3.687106,-9.389011,0.226268
380,2021-03-01,-18.512951,-8.222857,-3.490489,6.898305,-13.624381,-11.995798


<a id='Format-Column-Names'></a>

## Format Column Names

[[ go back to the top ]](#Table-of-contents)

In [159]:
def FORMAT_TITLES(dataframe, start = 0, end = 0):
    df = dataframe
    # Remove extra characters in name
    for index, label in enumerate(df.columns):
        # IF not specified, don't remove chars from end
        if end == 0:
            name = label[start:]
        # Remove 'end'-number of chars from end-of-string
        elif end != 0:
            name = label[start:end]
        # Format column names to be in 'snake case'
        formatted_name = name.strip().lower().replace('.', '').replace(" ", "_")
        df.rename(columns = {label:formatted_name}, inplace = True)
    return df

In [160]:
sentiment_int = FORMAT_TITLES(sentiment_int)
sentiment_int

Unnamed: 0,datetime,daily_news_sentiment
0,4/17/2000 0:00,-0.10
1,8/1/2000 0:00,0.20
2,1/24/2001 0:00,0.20
3,4/4/2001 0:00,-0.10
4,10/31/2002 0:00,-0.50
...,...,...
390,2/23/2021 0:00,-0.27
391,2/24/2021 0:00,-0.10
392,2/25/2021 0:00,0.00
393,2/26/2021 0:00,-0.30


In [161]:
sentiment_na = FORMAT_TITLES(sentiment_na)
sentiment_na

Unnamed: 0,datetime,daily_news_sentiment
0,7/26/2013 0:00,0.10
1,8/1/2013 0:00,-0.50
2,8/8/2013 0:00,0.10
3,8/9/2013 0:00,0.10
4,8/17/2013 0:00,0.20
...,...,...
391,2/23/2021 0:00,-0.50
392,2/24/2021 0:00,-0.50
393,2/25/2021 0:00,0.07
394,2/26/2021 0:00,-0.40


In [162]:
closing_price = FORMAT_TITLES(closing_price, start = 6)
closing_price

Unnamed: 0,Unnamed: 1,phillips_66,bp_plc,valero_energy_corporation,chevron_corporation,occidental_petroleum_corporation,marathon_oil_corporation,pioneer_natural_resources_company,conocophillips,exxon_mobil_corporation,marathon_petroleum_corporation
0,2017-03-21,78.60,34.24,67.15,108.04,63.36,15.05,182.30,45.18,81.83,49.26
1,2017-03-22,78.83,33.99,67.57,108.39,63.51,15.02,182.33,44.65,81.76,49.57
2,2017-03-23,78.48,34.06,67.10,107.87,63.04,14.68,181.55,44.48,81.86,49.30
3,2017-03-24,77.27,33.78,65.78,107.99,62.83,14.61,180.91,44.10,81.23,48.87
4,2017-03-27,77.25,33.75,66.23,106.28,62.87,14.75,180.40,44.29,81.25,48.83
...,...,...,...,...,...,...,...,...,...,...,...
980,2021-02-19,82.31,22.87,71.53,95.80,25.42,9.45,133.94,48.42,52.37,53.00
981,2021-02-22,83.96,23.63,74.25,98.39,26.47,10.20,139.47,50.88,54.30,54.85
982,2021-02-23,85.53,24.23,75.81,99.63,26.06,11.16,145.24,52.10,55.05,55.77
983,2021-02-24,87.25,25.30,78.16,103.31,28.16,11.84,150.06,54.67,56.70,56.65


In [163]:
google_indicators = list(df.Indicator.value_counts().index)
old_names = df.columns.to_list()
count = 0
for indicator in google_indicators:
    last_char = -29
    formated_name = indicator[:last_char].strip().lower()
    df_i = df[df['Indicator']==indicator].copy()
    df_i['date'] = df_i['Date Value']
    df_i = df_i.groupby(['date'], as_index=True).mean()
    df_i.rename(columns = {'Value': formated_name.replace(" ", "_")}, inplace = True)
    if count == 0:
        df_0 = df_i.copy()
    else:
        df_0 = pd.concat([df_0, df_i], axis =1)
    count += 1
display(round(df_0, 2))

AttributeError: 'DataFrame' object has no attribute 'Indicator'