In view of the rising price of gold in **Aug 2020**, this project retrieves **Gold ETF (GLD)** prices and predicts the **next-day Gold ETF price**.

**This project includes the following parts:**
1. Import the libraries
1. Read the Gold ETF data from Yahoo Finance
1. Define the explanatory variables and the dependent variable
1. Split the data into train and test datasets
1. Create a linear regression model
1. Predict the Gold ETF prices
1. Plot cumulative returns
1. MOST IMPORTANT: Use this model to predict daily moves

In [None]:
# Import the libraries and yahoo finance package
from sklearn.linear_model import LinearRegression
from datetime import datetime
from datetime import timedelta  
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
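# 'seaborn-deep' was renamed to 'seaborn-v0_8-deep' in Matplotlib 3.6; use whichever is available
plt.style.use('seaborn-v0_8-deep' if 'seaborn-v0_8-deep' in plt.style.available else 'seaborn-deep')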
!pip install yfinance
import yfinance as yf

Read the daily Gold ETF price data from 1 Jan 2010 to today and store it in the DataFrame Df.
Then, keep only the Close column and drop rows with NaN values using the dropna() function.
After that, plot the Gold ETF closing price.

In [None]:
# Read the Gold ETF data and plot the graph
ETFName = 'GLD'
now = datetime.now() + timedelta(days=1) 
Df = yf.download(ETFName, '2010-01-01', now.strftime("%Y-%m-%d"), auto_adjust=True)

# Keep only the Close column
Df = Df[['Close']]

# Drop rows with missing values
Df = Df.dropna()

# Plot the GLD closing price
Df.Close.plot(figsize=(10, 7),color='r')
plt.ylabel("Prices")
plt.title("Gold ETF Price Series")
plt.show()

To predict the Gold ETF price, we define two explanatory variables, the moving averages of the closing price over the past 3 days and the past 9 days, and store them in X.
Because the goal is to predict the next day's price, we store the next-day Gold ETF closing price in y.

In [None]:
# Define explanatory variables
Df['S3'] = Df['Close'].rolling(window=3).mean()
Df['S9'] = Df['Close'].rolling(window=9).mean()
Df['next_day_price'] = Df['Close'].shift(-1)

Df = Df.dropna()
X = Df[['S3', 'S9']]

# Define dependent variable
y = Df['next_day_price']
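
To see how the rolling mean and the shifted target line up, the sketch below applies the same operations to a made-up five-day price series. The values and the name toy are hypothetical, not GLD data, and only S3 is computed to keep the output short.

```python
import pandas as pd

# Hypothetical closing prices for five consecutive days (not real GLD data)
toy = pd.DataFrame({'Close': [10.0, 11.0, 12.0, 13.0, 14.0]})

# 3-day moving average: undefined (NaN) for the first two rows
toy['S3'] = toy['Close'].rolling(window=3).mean()

# Next-day price: undefined (NaN) for the last row, which has no following day
toy['next_day_price'] = toy['Close'].shift(-1)

# dropna() removes every row in which either column is NaN
toy = toy.dropna()
print(toy)
```

Only rows 2 and 3 survive, each pairing a moving average with the following day's close. With a 9-day window, the first eight rows and the last row are dropped in the same way.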

Machine learning training steps:
1. Split the predictors and output data into train (80%) and test (20%) data. The split is chronological: the earliest 80% of the rows are used for training.
2. The training data is used to create the linear regression model by pairing the input with the expected output.
3. The test data is used to estimate how well the model has been trained.
Note: X_train & y_train form the training dataset
      X_test & y_test form the test dataset

In [None]:
# Split the data into train and test dataset
t = .8
t = int(t*len(Df))

# Train dataset
X_train = X[:t]
y_train = y[:t]

# Test dataset
X_test = X[t:]
y_test = y[t:]
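
The split above keeps the rows in time order instead of shuffling them, so the model is never trained on data that comes after the test period. A minimal sketch of the same idea on a plain list follows; the helper name chronological_split is hypothetical.

```python
def chronological_split(rows, train_fraction=0.8):
    """Split an ordered sequence into (train, test) without shuffling."""
    cut = int(train_fraction * len(rows))
    return rows[:cut], rows[cut:]

# Hypothetical example: ten observations ordered by date
days = list(range(10))
train, test = chronological_split(days)
print(train)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(test)   # [8, 9]
```

Every training observation precedes every test observation, which mirrors how the model would be used in practice: fitted on the past and evaluated on later data.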

The dependent variable ‘y’ is the variable that you want to predict, here the next-day Gold ETF price.
The independent variables ‘x’ are the explanatory variables that you use to predict the dependent variable, here the two moving averages.
The following regression equation describes that relation:

Next-day Gold ETF price = m1 * 3-day moving average + m2 * 9-day moving average + c

In [None]:
# Create a linear regression model
linear = LinearRegression().fit(X_train, y_train)
print("Linear Regression model")
print("Gold ETF Price (y) = %.2f * 3 Days Moving Average (x1) \
+ %.2f * 9 Days Moving Average (x2) \
+ %.2f (constant)" % (linear.coef_[0], linear.coef_[1], linear.intercept_))
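
LinearRegression fits the coefficients m1, m2 and c by ordinary least squares. As an illustration of what that estimation does, the sketch below recovers known coefficients from synthetic data with NumPy. The inputs and the "true" coefficients are assumptions made up for this example, not values from the GLD model.

```python
import numpy as np

# Synthetic inputs standing in for the 3-day and 9-day moving averages
rng = np.random.default_rng(0)
x1 = rng.uniform(100, 200, size=50)
x2 = rng.uniform(100, 200, size=50)

# Assumed relation, chosen only for this illustration
y = 1.2 * x1 - 0.2 * x2 + 0.5

# Design matrix with a column of ones for the constant term c
A = np.column_stack([x1, x2, np.ones_like(x1)])

# Least-squares solution of A @ [m1, m2, c] = y
m1, m2, c = np.linalg.lstsq(A, y, rcond=None)[0]
print("m1 = %.2f, m2 = %.2f, c = %.2f" % (m1, m2, c))
```

Because the synthetic data contains no noise, the fitted values match the assumed coefficients. On real price data the fit is only approximate.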

**Predict the Gold ETF prices**
Now it’s time to check whether the model works on the test dataset.
We predict the Gold ETF prices using the linear model fitted on the train dataset.
The predict method returns the predicted next-day Gold ETF price (y) for the given explanatory variables X.

In [None]:
# Predicting the Gold ETF prices
predicted_price = linear.predict(X_test)
predicted_price = pd.DataFrame(predicted_price, index=y_test.index, columns=['price'])
predicted_price.plot(figsize=(10, 7))
y_test.plot()
plt.legend(['predicted_price', 'actual_price'])
plt.ylabel("Gold ETF Price")
plt.show()

R-squared is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model. 

In [None]:
# R-squared on the test dataset, expressed as a percentage
r2_pct = linear.score(X_test, y_test) * 100
print("R square = %.2f%%" % r2_pct)
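
The score is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on a few made-up values; the function name r_squared and the numbers are hypothetical.

```python
def r_squared(actual, predicted):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical actual and predicted prices
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.0]
print("R square = %.2f" % (r_squared(actual, predicted) * 100))  # R square = 97.50
```

A value close to 100 means the predictions track the actual prices closely. On a test dataset the value can also be negative, which happens when the model predicts worse than simply using the mean of the actual prices.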

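**Plot cumulative returns**
We take a long position (signal = 1) on days when the predicted next-day price is higher than the previous day's prediction, and hold no position (signal = 0) otherwise.
The strategy return is the signal multiplied by the next-day return of the Gold ETF.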

In [None]:
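gold = pd.DataFrame()

# Test-period close, the model's prediction for the next day, and the actual next-day price
gold['price'] = Df[t:]['Close']
gold['predicted_price_next_day'] = predicted_price['price']
gold['actual_price_next_day'] = y_test
# Return from today's close to the next day's close
gold['gold_returns'] = gold['price'].pct_change().shift(-1)

# Signal is 1 (long) when the prediction is higher than the previous day's prediction, else 0
gold['signal'] = np.where(gold.predicted_price_next_day.shift(1) < gold.predicted_price_next_day,1,0)

# The strategy earns the next-day return only on days with a long signal
gold['strategy_returns'] = gold.signal * gold['gold_returns']
((gold['strategy_returns']+1).cumprod()).plot(figsize=(10,7),color='g')
plt.ylabel('Cumulative Returns')
plt.show()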
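
To make the compounding step concrete, here is a small sketch with three made-up days. The returns and signals are hypothetical and only illustrate how the signal gates the return before the cumulative product is taken.

```python
import pandas as pd

# Hypothetical next-day returns and long/flat signals for three days
returns = pd.Series([0.01, -0.02, 0.03])
signal = pd.Series([1, 0, 1])

# The return is earned only on days where the signal is 1
strategy_returns = signal * returns

# Growth of one unit of capital: 1.01, then 1.01 * 1.00, then 1.01 * 1.03
cumulative = (strategy_returns + 1).cumprod()
print(cumulative.round(4).tolist())  # [1.01, 1.01, 1.0403]
```

The loss on the second day is skipped because the signal is 0, so the cumulative value stays flat on that day.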

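The Sharpe ratio helps investors understand the return of an investment compared to its risk.
The ratio is the average return earned in excess of the risk-free rate per unit of volatility or total risk.
The calculation below assumes a risk-free rate of zero and annualizes the daily ratio by multiplying it by the square root of 252 trading days.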

In [None]:
'Sharpe Ratio = %.2f' % (gold['strategy_returns'].mean()/gold['strategy_returns'].std()*(252**0.5))
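
The same calculation can be written as a small function that makes the risk-free rate explicit. This is a sketch under assumptions: the function name annualized_sharpe, its parameters and the sample returns are hypothetical. It uses the sample standard deviation, which is also the default of the pandas std() method.

```python
import math
import statistics

def annualized_sharpe(daily_returns, daily_risk_free=0.0, periods_per_year=252):
    """Mean excess daily return over its sample standard deviation, annualized."""
    excess = [r - daily_risk_free for r in daily_returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(periods_per_year)

# Hypothetical daily strategy returns
sample = [0.01, -0.01, 0.02, 0.0]
print("Sharpe Ratio = %.2f" % annualized_sharpe(sample))
```

Passing a positive daily_risk_free lowers the ratio, since only the return above the risk-free rate counts toward the numerator.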


How to use this model to predict daily moves?

You can use the following code to predict the gold price and generate a trading signal: **buy GLD** or **take no position**.

In [None]:
# Download the price history and rebuild the two moving-average features
data = yf.download(ETFName, '2008-07-01', now.strftime("%Y-%m-%d"), auto_adjust=True)
data['S3'] = data['Close'].rolling(window=3).mean()
data['S9'] = data['Close'].rolling(window=9).mean()
data = data.dropna()
# Predict the next-day price with the fitted model
data['predicted_gold_price'] = linear.predict(data[['S3', 'S9']])
# "Buy" when the prediction is higher than the previous day's prediction, otherwise "No Position"
data['signal'] = np.where(data.predicted_gold_price.shift(1) < data.predicted_gold_price,"Buy","No Position")
data.tail(10)
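
The signal rule can also be expressed without pandas, which makes its behaviour on the first row explicit. The sketch below is an illustration only: the function name signal_labels, the label strings and the predicted prices are hypothetical.

```python
def signal_labels(predicted_prices, buy="Buy", flat="No Position"):
    """Label a day as buy when its prediction exceeds the previous day's prediction."""
    labels = []
    previous = None
    for price in predicted_prices:
        # The first day has no previous prediction, so it gets the flat label
        if previous is not None and previous < price:
            labels.append(buy)
        else:
            labels.append(flat)
        previous = price
    return labels

# Hypothetical predicted prices for four consecutive days
print(signal_labels([100.0, 101.0, 100.5, 102.0]))
# ['No Position', 'Buy', 'No Position', 'Buy']
```

The last label in the list corresponds to the most recent day in the data, which is the signal to read for the next trading day.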