# **Testing Pipeline**
## December 2020
### Ian Yu

----

## **Table of Content**

1. [Objective](#Objective)
2. [Acknowledgement](#Acknowledgement)
3. [Exploratory Data Analysis](#Exploratory-Data-Analysis)
4. [Data Cleaning and Concatenating Dataframes](#Data-Cleaning-and-Concatenating-Dataframes)
5. [Next Step](#Next-Step)

---

- [yfinance](https://pypi.org/project/yfinance/)
- [Quandl](https://www.quandl.com/data/YALE/SP_CPI-U-S-Stock-Price-Data-Consumer-Price-Index)
- [fredapi](https://pypi.org/project/fredapi/)

----

## **Objective**
The purpose of this notebook is to explore how to build an automated learning pipeline for the stock market forecasting models. It brings together findings and code from previous notebooks, and it is where we plan and build the whole pipeline.

[Back to Top](#Table-of-Content)

## **Plan**

At a high level, the major parts of the pipeline are:

- Call the APIs for new data and build the new datasets
- Engineer features and build separate datasets for different timeframes
    - Dataset for training
    - Dataset for prediction
- Preprocess and train on the new training data
- Predict on the new prediction dataset

For now, we will create a test covering just 5 days.
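
As a rough illustration of the training/prediction split in the plan, the sketch below holds out the last five rows of a frame as the prediction set. The helper name `split_train_predict`, the `horizon` parameter, and the synthetic data are assumptions made for this example; the notebook does not specify how the two datasets are separated.

```python
import numpy as np
import pandas as pd

def split_train_predict(df, horizon=5):
    """Hold out the last `horizon` rows as the prediction set (hypothetical helper)."""
    if horizon <= 0 or horizon >= len(df):
        raise ValueError("horizon must be between 1 and len(df) - 1")
    return df.iloc[:-horizon], df.iloc[-horizon:]

# Synthetic stand-in for the cleaned market data: 20 business days
dates = pd.date_range("2020-11-02", periods=20, freq="B")
demo = pd.DataFrame({"close": np.arange(20, dtype=float)}, index=dates)

train, predict = split_train_predict(demo, horizon=5)
print(train.shape, predict.shape)
```

Splitting by position rather than at random keeps the prediction rows strictly later than the training rows, which matters for time-ordered data.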

## **Data Cleaning**

First, let's import the necessary packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time 
import yfinance as yf
from fredapi import Fred
import quandl
import requests
from datetime import datetime, timedelta
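# Read the API keys from environment variables rather than hard-coding them
# (the variable names FRED_API_KEY and QUANDL_API_KEY are placeholders)
import os
fred = Fred(api_key = os.environ["FRED_API_KEY"])
quandl.ApiConfig.api_key = os.environ["QUANDL_API_KEY"]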

Then we import the data

In [2]:
# Request data from all APIs
stock = yf.Ticker("^GSPC").history(period = "max")
yields = quandl.get("USTREASURY/YIELD")
usd = yf.Ticker("DX-Y.NYB").history(period = "max")
gold = quandl.get("LBMA/GOLD")
wti = pd.DataFrame(fred.get_series_latest_release('DCOILWTICO'), columns = ["price"])
cpi = pd.DataFrame(fred.get_series_latest_release('CPALTT01USA659N'), columns = ["annual rate"])
time.sleep(1)

# Put the dataframes into a list for certain treat-all operations
datalist = [stock, yields, usd, gold, wti, cpi]

First, we explore the missing data in each dataframe.

In [3]:
## Print the missing data for each dataframe
for df in datalist:
    display(df.tail(3))
    print(df.isna().sum())

| Date | Open | High | Low | Close | Volume | Dividends | Stock Splits |
|---|---|---|---|---|---|---|---|
| 2020-12-24 | 3694.030029 | 3703.820068 | 3689.320068 | 3703.060059 | 1885090000 | 0 | 0 |
| 2020-12-28 | 3723.030029 | 3740.51001 | 3723.030029 | 3735.360107 | 3527460000 | 0 | 0 |
| 2020-12-29 | 3750.01001 | 3756.120117 | 3737.330078 | 3740.169922 | 262821878 | 0 | 0 |


Open            0
High            0
Low             0
Close           0
Volume          0
Dividends       0
Stock Splits    0
dtype: int64


| Date | 1 MO | 2 MO | 3 MO | 6 MO | 1 YR | 2 YR | 3 YR | 5 YR | 7 YR | 10 YR | 20 YR | 30 YR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-12-23 | 0.07 | 0.08 | 0.09 | 0.09 | 0.09 | 0.13 | 0.18 | 0.38 | 0.67 | 0.96 | 1.49 | 1.7 |
| 2020-12-24 | 0.09 | 0.09 | 0.09 | 0.09 | 0.1 | 0.13 | 0.17 | 0.37 | 0.66 | 0.94 | 1.46 | 1.66 |
| 2020-12-28 | 0.09 | 0.1 | 0.11 | 0.11 | 0.11 | 0.13 | 0.17 | 0.38 | 0.65 | 0.94 | 1.46 | 1.67 |


1 MO     2899
2 MO     7204
3 MO        3
6 MO        0
1 YR        0
2 YR        0
3 YR        0
5 YR        0
7 YR        0
10 YR       0
20 YR     939
30 YR     994
dtype: int64


| Date | Open | High | Low | Close | Volume | Dividends | Stock Splits |
|---|---|---|---|---|---|---|---|
| 2020-12-23 | 90.519997 | 90.669998 | 90.160004 | 90.410004 | 0 | 0 | 0 |
| 2020-12-28 | 90.220001 | 90.379997 | 89.980003 | 90.339996 | 0 | 0 | 0 |
| 2020-12-29 | 90.209 | 90.226997 | 89.852997 | 89.925003 | 0 | 0 | 0 |


Open            0
High            0
Low             0
Close           0
Volume          0
Dividends       0
Stock Splits    0
dtype: int64


| Date | USD (AM) | USD (PM) | GBP (AM) | GBP (PM) | EURO (AM) | EURO (PM) |
|---|---|---|---|---|---|---|
| 2020-12-22 | 1873.3 | 1877.1 | 1399.73 | 1405.95 | 1532.73 | 1538.1 |
| 2020-12-23 | 1867.1 | 1875.0 | 1390.06 | 1382.44 | 1532.14 | 1535.06 |
| 2020-12-24 | 1872.55 | NaN | 1376.14 | NaN | 1535.87 | NaN |


USD (AM)        1
USD (PM)      142
GBP (AM)       11
GBP (PM)      153
EURO (AM)    7837
EURO (PM)    7879
dtype: int64


|  | price |
|---|---|
| 2020-12-17 | 48.34 |
| 2020-12-18 | 49.04 |
| 2020-12-21 | 47.79 |


price    309
dtype: int64


|  | annual rate |
|---|---|
| 2017-01-01 | 2.13011 |
| 2018-01-01 | 2.442583 |
| 2019-01-01 | 1.81221 |


annual rate    4
dtype: int64


Neither `stock` nor `usd` has missing data, but the 'Dividends' and 'Stock Splits' columns are all zeroes for both the stock index and the US dollar index, and 'Volume' is zero for `usd`. These columns will therefore be dropped. For `yields`, '1 MO', '2 MO', '20 YR', and '30 YR' have too many missing values and should be dropped. '3 MO', '2 YR', and '10 YR' are the maturities of most interest, although the cell below keeps all eight remaining maturities. For `gold`, we are only interested in 'USD (AM)', which gives one daily gold price in USD. For `wti`, we ought to explore the 309 missing values further. For `cpi`, we also ought to explore the 4 missing values further.
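
One way to explore missing values like those in `wti` is to list the dates on which they occur and see which weekdays they fall on. The sketch below uses a small synthetic price series as a stand-in for the real data; the values and the `prices` name are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily price series with two gaps (stand-in for `wti`)
idx = pd.date_range("2020-12-21", periods=10, freq="B")
prices = pd.DataFrame(
    {"price": [47.8, 47.0, 48.1, 48.2, np.nan, 47.6, 48.0, 48.4, 48.5, np.nan]},
    index=idx,
)

# Keep only the rows where the price is missing
missing = prices[prices["price"].isna()]

# Which dates are missing, and on which weekdays do they fall?
missing_dates = missing.index.strftime("%Y-%m-%d").tolist()
print(missing_dates)  # ['2020-12-25', '2021-01-01']
print(missing.index.day_name().value_counts())
```

If the gaps turn out to be isolated weekdays that coincide with public holidays, as in this toy example, forward filling them is a reasonable treatment.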

In [4]:
# Drop dividends and stock splits from the stock df
stock = stock.drop(['Dividends','Stock Splits'], axis = 1)

# Drop volume, dividends, and stock splits from the usd df
usd = usd.drop(['Dividends', 'Stock Splits', 'Volume'], axis = 1)

# Forward fill the missing values from statutory holidays landing on weekdays
wti = wti.ffill()

# yields: drop the columns with too many missing values,
# then forward fill the values missing at random in '3 MO'
yields = yields.drop(['1 MO', '2 MO', '20 YR', '30 YR'], axis = 1)
yields['3 MO'] = yields['3 MO'].ffill()

# gold: keep only 'USD (AM)' so there is one value per day,
# and rename USD (AM) to price
gold = gold[['USD (AM)']].ffill().rename(columns = {'USD (AM)':'price'})

# CPI: drop the first four years as they are all NaN values
# Set 2020 annual inflation to 1.1
cpi = cpi.dropna()
cpi2 = pd.DataFrame(index = ['2020-01-01'], columns = ['annual rate'], data = [1.1])
cpi2.index = pd.to_datetime(cpi2.index)
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
cpi = pd.concat([cpi, cpi2])

In [5]:
# Put the dataframes into a list for certain treat-all operations
datalist = [stock, yields, usd, gold, wti, cpi]

Now that the data is clean, we reindex every dataframe so that they all cover the same date range.

In [6]:
# Creating a date range of our interest, freq = 'B' for business days
date_range = pd.date_range(start = '1990-01-02', end = (datetime.today() - timedelta(15)), freq = 'B')

In [7]:
stock = stock.reindex(index = date_range, method = 'ffill')
usd = usd.reindex(index = date_range, method = 'ffill')
yields = yields.reindex(index = date_range, method = 'ffill')
wti = wti.reindex(index = date_range, method = 'ffill')
gold = gold.reindex(index = date_range, method = 'ffill')
cpi = cpi.reindex(index = date_range, method = 'ffill')
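
To make the effect of `reindex(..., method = 'ffill')` concrete, here is a small sketch on made-up data: an annual series, shaped like `cpi`, is expanded onto a business-day index, and each day takes the most recent observation at or before it. The values are illustrative only.

```python
import pandas as pd

# Annual series, like `cpi`: one observation per year
annual = pd.DataFrame(
    {"annual rate": [1.8, 1.1]},
    index=pd.to_datetime(["2019-01-01", "2020-01-01"]),
)

# Five business days straddling the new year
days = pd.date_range("2019-12-30", "2020-01-03", freq="B")

# Each business day is filled with the last value observed on or before it
daily = annual.reindex(index=days, method="ffill")
print(daily)
```

The two days in 2019 carry the 2019 rate and the three days in 2020 carry the 2020 rate. Note that `method="ffill"` requires the original index to be sorted in increasing order.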

In [8]:
# Re-instantiate datalist
datalist = [stock, yields, usd, gold, wti, cpi]

# Checking the shape of each dataframe
for df in datalist:
    print(df.shape)

(8075, 5)
(8075, 8)
(8075, 4)
(8075, 1)
(8075, 1)
(8075, 1)
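
Matching row counts do not by themselves prove that the frames cover the same dates. A stricter check compares the indexes directly; the helper `same_index` below is a hypothetical name used only for this sketch, shown on synthetic frames.

```python
import pandas as pd

def same_index(frames):
    """Return True if every frame has exactly the same index as the first one."""
    first = frames[0].index
    return all(frame.index.equals(first) for frame in frames[1:])

days = pd.date_range("2020-12-07", periods=5, freq="B")
a = pd.DataFrame({"x": range(5)}, index=days)
b = pd.DataFrame({"y": range(5)}, index=days)
# Same shape as `b`, but covering the following week
c = pd.DataFrame({"y": range(5)}, index=pd.date_range("2020-12-14", periods=5, freq="B"))

aligned_ab = same_index([a, b])
aligned_abc = same_index([a, b, c])
print(aligned_ab, aligned_abc)  # True False
```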


In [9]:
# Add a string on the column names to indicate the market for each dataframe
stock.columns = "SPX " + stock.columns
yields.columns = yields.columns + ' yields'
usd.columns = "DXY " + usd.columns
wti.columns = "WTI " + wti.columns
gold.columns = "GOLD " + gold.columns
cpi.columns = "CPI " + cpi.columns
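
pandas also provides `DataFrame.add_prefix` and `DataFrame.add_suffix`, which do the same renaming but return a new frame instead of modifying the columns in place. A minimal sketch on a toy frame:

```python
import pandas as pd

demo = pd.DataFrame({"Open": [1.0], "Close": [2.0]})

# Both methods return a new DataFrame; `demo` itself is unchanged
prefixed = demo.add_prefix("SPX ")
suffixed = demo.add_suffix(" yields")

print(list(prefixed.columns))  # ['SPX Open', 'SPX Close']
print(list(suffixed.columns))  # ['Open yields', 'Close yields']
```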

In [10]:
#Concatenating all dataframes
all_df = pd.concat([stock, yields, usd, gold, wti, cpi], axis = 1)

Note that we have not fully treated the statutory holidays yet: `all_df` contains 99 duplicated rows, which are the statutory holidays whose values we forward filled. In a typical structured learning problem we would drop duplicated rows, but here `drop_duplicates` could also drop real trading days. Since these 99 rows make up only about 1% of the data, they should not introduce significant bias. And because there may be hidden patterns from Monday to Friday, that is, every five data points, we keep the duplicated rows so that every week has the same five-day structure.
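
The sketch below shows, on a tiny synthetic frame, how duplicated rows can be counted and how rows that repeat the row directly above them (the signature of a forward-filled day) can be located. The numbers are made up, and this is only one possible way to inspect the duplicates.

```python
import pandas as pd

# Five business days; 2020-12-25 repeats the values of 2020-12-24
idx = pd.date_range("2020-12-23", periods=5, freq="B")
df = pd.DataFrame(
    {"close": [3690.0, 3703.1, 3703.1, 3735.4, 3727.0],
     "volume": [10, 7, 7, 12, 11]},
    index=idx,
)

# Number of rows that duplicate an earlier row anywhere in the frame
n_dupes = int(df.duplicated().sum())

# Rows whose every column equals the row directly above
repeats_previous = df.eq(df.shift()).all(axis=1)
filled_days = df.index[repeats_previous].strftime("%Y-%m-%d").tolist()

print(n_dupes)      # 1
print(filled_days)  # ['2020-12-25']
```

`duplicated` matches a row against any earlier row, while the `shift` comparison only matches the immediately preceding day, so the second check is the narrower one for spotting forward-filled holidays.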

In [11]:
all_df

|  | SPX Open | SPX High | SPX Low | SPX Close | SPX Volume | 3 MO yields | 6 MO yields | 1 YR yields | 2 YR yields | 3 YR yields | 5 YR yields | 7 YR yields | 10 YR yields | DXY Open | DXY High | DXY Low | DXY Close | GOLD price | WTI price | CPI annual rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1990-01-02 | 353.399994 | 359.690002 | 351.980011 | 359.690002 | 162070000 | 7.83 | 7.89 | 7.81 | 7.87 | 7.90 | 7.87 | 7.98 | 7.94 | 93.129997 | 94.309998 | 93.080002 | 94.290001 | 401.65 | 22.88 | 5.397956 |
| 1990-01-03 | 359.690002 | 360.589996 | 357.890015 | 358.760010 | 192330000 | 7.89 | 7.94 | 7.85 | 7.94 | 7.96 | 7.92 | 8.04 | 7.99 | 94.150002 | 94.519997 | 94.080002 | 94.419998 | 396.30 | 23.81 | 5.397956 |
| 1990-01-04 | 358.760010 | 358.760010 | 352.890015 | 355.670013 | 177000000 | 7.84 | 7.90 | 7.82 | 7.92 | 7.93 | 7.91 | 8.02 | 7.98 | 93.720001 | 93.879997 | 92.389999 | 92.519997 | 394.95 | 23.41 | 5.397956 |
| 1990-01-05 | 355.670013 | 355.670013 | 351.350006 | 352.200012 | 158530000 | 7.79 | 7.85 | 7.79 | 7.90 | 7.94 | 7.92 | 8.03 | 7.99 | 93.339996 | 93.419998 | 92.550003 | 92.849998 | 401.20 | 23.07 | 5.397956 |
| 1990-01-08 | 352.200012 | 354.239990 | 350.540009 | 353.790009 | 140110000 | 7.79 | 7.88 | 7.81 | 7.90 | 7.95 | 7.92 | 8.05 | 8.02 | 92.519997 | 92.540001 | 91.940002 | 92.050003 | 403.75 | 21.64 | 5.397956 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2020-12-08 | 3683.050049 | 3708.449951 | 3678.830078 | 3702.250000 | 4549670000 | 0.09 | 0.09 | 0.10 | 0.14 | 0.20 | 0.39 | 0.65 | 0.92 | 90.900002 | 91.019997 | 90.750000 | 90.970001 | 1864.50 | 45.64 | 1.100000 |
| 2020-12-09 | 3705.979980 | 3712.389893 | 3660.540039 | 3672.820068 | 5209940000 | 0.08 | 0.09 | 0.10 | 0.16 | 0.21 | 0.41 | 0.68 | 0.95 | 90.919998 | 91.199997 | 90.690002 | 91.089996 | 1859.80 | 45.48 | 1.100000 |
| 2020-12-10 | 3659.129883 | 3678.489990 | 3645.179932 | 3668.100098 | 4618240000 | 0.08 | 0.09 | 0.10 | 0.14 | 0.20 | 0.39 | 0.65 | 0.92 | 91.059998 | 91.150002 | 90.669998 | 90.790001 | 1834.20 | 46.76 | 1.100000 |
| 2020-12-11 | 3656.080078 | 3665.909912 | 3633.399902 | 3663.459961 | 4367150000 | 0.08 | 0.08 | 0.10 | 0.11 | 0.18 | 0.37 | 0.63 | 0.90 | 90.739998 | 91.040001 | 90.620003 | 90.980003 | 1833.65 | 46.59 | 1.100000 |


In [12]:
# Export the dataframe as csv
# Create the output folder first, otherwise to_csv raises FileNotFoundError
import os
os.makedirs('data', exist_ok = True)
all_df.to_csv('data/1-cleaned_df.csv')
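
`to_csv` does not create missing folders, so the `data/` directory has to exist before the export. The sketch below writes a small synthetic frame into a temporary folder and reads it back with the dates parsed as the index, which is one way to confirm that the export round-trips. The frame and column name are made up for illustration.

```python
import os
import tempfile
import pandas as pd

# Synthetic stand-in for `all_df`
idx = pd.date_range("2020-12-07", periods=5, freq="B")
demo = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=idx)

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "data")
    os.makedirs(out_dir, exist_ok=True)  # create the folder before writing
    path = os.path.join(out_dir, "1-cleaned_df.csv")
    demo.to_csv(path)
    # index_col=0 restores the date index; parse_dates=True converts it back to datetimes
    restored = pd.read_csv(path, index_col=0, parse_dates=True)

same_dates = bool((restored.index == demo.index).all())
print(same_dates)
```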