1. We use MU's closing price as a feature to predict the next day's price.
2. We use the closing price and Disparity5 as features to predict the next day's price.
3. We normalize in two ways: one uses the mean and standard deviation, the other uses min-max scaling.
4. In this project, the choice of normalization method does not affect the results much.
5. We use ElasticNetCV as the machine learning algorithm to train on the closing price.
6. We use LinearRegression as the machine learning algorithm to train on the closing price and Disparity5.
7. These two algorithms give similar results. Please compare the Mean Absolute Error, Mean Squared Error and Root Mean Squared Error.
8. Surprisingly, this simple prediction is very accurate. Why do so many investors lose money in the stock market?
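
One way to see why the normalization method matters little for LinearRegression is sketched below. Ordinary least squares with an intercept gives the same predictions under any invertible rescaling of the features, so mean/std scaling and min-max scaling should agree up to rounding. The data here is synthetic and every name (`X`, `fit_predict`, `max_gap`) is an assumption made for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the price features (not MU data)
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
y = 0.8 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=1.0, size=200)

# Keep the time order: first 150 rows for training, the rest for testing
X_train, X_test = X[:150], X[150:]
y_train = y[:150]


def fit_predict(scaler):
    """Fit the scaler on the training rows only, then fit and apply the model."""
    scaler.fit(X_train)
    model = LinearRegression().fit(scaler.transform(X_train), y_train)
    return model.predict(scaler.transform(X_test))


pred_standard = fit_predict(StandardScaler())
pred_minmax = fit_predict(MinMaxScaler())

# Largest absolute difference between the two sets of predictions
max_gap = float(np.max(np.abs(pred_standard - pred_minmax)))
print("largest difference between predictions:", max_gap)
```

This equivalence is specific to unregularized linear regression. A penalized model such as ElasticNetCV depends on the feature scale, so the two normalizations need not give identical results there.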

In [22]:
import yfinance as yf
import pandas as pd
import numpy as np
import datetime as dt
import plotly.express as px
from sklearn import model_selection
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression, ElasticNetCV, Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import TimeSeriesSplit
from sklearn import metrics

df = yf.download("MU", start="2014-01-01", end=dt.date.today())
cdf = pd.DataFrame()
# Get MU close price
cdf['Close'] = df['Close']
# Calculate the 5-day moving average
cdf['MA5'] = cdf['Close'].rolling(5).mean()
# Calculate the 6-day moving average
cdf['MA6'] = cdf['Close'].rolling(6).mean()
# Calculate the 10-day moving average
cdf['MA10'] = cdf['Close'].rolling(10).mean()
# Calculate the daily log return
cdf['ln'] = np.log(cdf['Close']/cdf['Close'].shift(1))
# Calculate Disparity in 5 days. This is the first feature
cdf['Disparity5'] = cdf['Close']/cdf['MA5']*100
# Calculate OSCP(price oscillator). This is the second feature
cdf['OSCP'] = (cdf['MA5']-cdf['MA10'])/cdf['MA5']
# Calculate BIAS6 (Type 2 feature)
cdf['BIAS6'] = ((cdf['Close']-cdf['MA6'])/cdf['MA6'])*100
# Calculate ASY5 (Type 2 feature)
cdf['ASY5'] = cdf['ln'].rolling(5).mean()

# Create label
cdf['Label']=cdf['Close'].shift(-1)
cdf = cdf.dropna()
print("cdf has null values? "+str(cdf.isnull().values.any()))

[*********************100%***********************]  1 of 1 downloaded
cdf has null values? False
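
For reference, the indicators computed in the cell above can be written out as formulas. The notation is an assumption introduced here: $C_t$ is the closing price on trading day $t$ and $\mathrm{MA}_n(t)$ is the $n$-day moving average.

```latex
\begin{aligned}
\mathrm{MA}_n(t) &= \frac{1}{n}\sum_{i=0}^{n-1} C_{t-i} \\
r_t &= \ln\frac{C_t}{C_{t-1}} \\
\mathrm{Disparity5}_t &= 100 \cdot \frac{C_t}{\mathrm{MA}_5(t)} \\
\mathrm{OSCP}_t &= \frac{\mathrm{MA}_5(t) - \mathrm{MA}_{10}(t)}{\mathrm{MA}_5(t)} \\
\mathrm{BIAS6}_t &= 100 \cdot \frac{C_t - \mathrm{MA}_6(t)}{\mathrm{MA}_6(t)} \\
\mathrm{ASY5}_t &= \frac{1}{5}\sum_{i=0}^{4} r_{t-i} \\
\mathrm{Label}_t &= C_{t+1}
\end{aligned}
```

Every indicator is a function of current and past closing prices only, while the label is the next day's close.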


In [23]:
# Let's plot using plotly. tidy_df is the long-format data that plotly needs (print it to see what it looks like)

# Plot the closing price and the four indicators in the same graph
cdf_plot = cdf[['Close','Disparity5','OSCP','BIAS6','ASY5']]

#Normalize data with min-max using SK-Learn 
x = cdf_plot.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
scaled_cdf_plot = pd.DataFrame(x_scaled)
scaled_cdf_plot.columns = cdf_plot.columns
scaled_cdf_plot.index = cdf_plot.index

tidy_df = scaled_cdf_plot.reset_index().melt(id_vars=["Date"])
px.line(tidy_df, x="Date", y="value", color="variable")

Overall, the trends of Disparity5, OSCP, BIAS6 and ASY5 are similar to the trend of the closing price.

In [24]:
# x is the feature set, do not include label
x = np.array(cdf[['Close','Disparity5','OSCP']])
# y is the label
y = np.array(cdf[['Label']])

In [25]:
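# Scale values down: fit a standard scaler on y and reuse its mean and std for x,
# so that x and y are on the same scale
y = y.reshape(-1,1)
scaler = preprocessing.StandardScaler().fit(y)

# The scaler was fitted on a single column, so apply its statistics to x by broadcasting
x = (x - scaler.mean_) / scaler.scale_
y = scaler.transform(y)
# The loop overwrites the split on every pass, so only the last (largest) split is kept
tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(x):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]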

# Assign sklearn's model to a variable
#linear = ElasticNetCV()
linear = LinearRegression()
# Fit or "train" the model, (reshape just to avoid error warning)
linear.fit(x_train, y_train.reshape(len(y_train),))

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
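
The split loop in the cell above reassigns the train and test arrays on every pass, so the model is fitted only on the final split. A minimal sketch on toy data shows what `TimeSeriesSplit(n_splits=5)` produces; the 12-row array and the `splits` list are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: 12 time-ordered rows standing in for the feature matrix
samples = np.arange(12).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=5)

splits = []
for train_index, test_index in tscv.split(samples):
    # Each split trains on an initial stretch and tests on the rows right after it
    splits.append((train_index, test_index))
    print("train rows:", len(train_index), "test rows:", test_index.tolist())

# What remains after a loop that overwrites its variables: the last split
last_train, last_test = splits[-1]
```

With 12 rows, the last split trains on the first 10 rows and tests on the final 2. The training set grows with each split and the test rows always come after the training rows, which is why the notebook ends up evaluating on the most recent part of the series.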

In [26]:
#Score returns the coefficient of determination 
print('The R^2 is:'+str(linear.score(x_test, y_test)))
# Predict() uses the model to predict the values for the input
forecast_set = linear.predict(x_test)

The R^2 is:0.9378163238556468


In [27]:
#slice dataframes for training data and testing data
cdf_train= cdf[0:len(x_train)].copy()
# Only the first feature column (Close) is converted back to prices
cdf_train.loc[:,'x_train']=scaler.inverse_transform(x_train[:, [0]]).ravel()
cdf_train.loc[:,'y_train']=scaler.inverse_transform(y_train).ravel()

cdf_test= cdf[len(x_train):].copy()
cdf_test.loc[:,'x_test']=scaler.inverse_transform(x_test[:, [0]]).ravel()
cdf_test.loc[:,'y_test']=scaler.inverse_transform(y_test).ravel()
# predict() returns a 1-D array, while inverse_transform expects 2-D input
cdf_test.loc[:,'forecast']=scaler.inverse_transform(forecast_set.reshape(-1,1)).ravel()

#Try to combine two dataframes
cdf_train_test = pd.concat([cdf_train,cdf_test],sort=False)
#Try to plot predicted values

cdf_plot=cdf_train_test[['Close','x_train','y_test','forecast']]
tidy_df_test = cdf_plot.reset_index().melt(id_vars=["Date"])
px.line(tidy_df_test, x="Date", y="value", color="variable")

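At first glance it is surprising that the forecast values almost overlap the test values. In the table below, however, each forecast is nearly identical to the same day's Close rather than to the next-day Label, so much of the overlap comes from the forecast trailing the price by one day.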

In [28]:
# Calculate MAPE (Mean Absolute Percentage Error)
MAPE = sum(abs(cdf_test['forecast']-cdf_test['y_test'])/abs(cdf_test['y_test']))/len(cdf_test['y_test'])*100
print('The MAPE with Disparity5 and OSCP is: '+str(MAPE))
cdf_test.head(20)

The MAPE with Disparity5 and OSCP is: 2.2906515709709994


Date,Close,MA5,MA6,MA10,ln,Disparity5,OSCP,BIAS6,ASY5,Label,x_test,y_test,forecast
2018-10-22,39.76,41.414,41.568333,41.773,-0.017205,96.006181,-0.008669,-4.350267,-0.012574,38.68,39.76,38.68,39.711349
2018-10-23,38.68,40.508,40.958333,41.414,-0.027539,95.487311,-0.022366,-5.562564,-0.02215,35.43,38.68,35.43,38.638252
2018-10-24,35.43,39.124,39.661667,40.796,-0.087764,90.558225,-0.042736,-10.669412,-0.035682,36.78,35.43,36.78,35.353229
2018-10-25,36.78,38.22,38.733333,40.277,0.037395,96.232339,-0.05382,-5.043029,-0.023182,35.4,36.78,35.4,36.775563
2018-10-26,35.4,37.21,37.75,39.57,-0.038242,95.135716,-0.063424,-6.225166,-0.026671,34.66,35.4,34.66,35.393074
2018-10-29,34.66,36.19,36.785,38.802,-0.021126,95.772313,-0.072175,-5.776811,-0.027455,36.01,34.66,36.01,34.66888
2018-10-30,36.01,35.656,36.16,38.082,0.03821,100.99282,-0.068039,-0.414823,-0.014305,37.72,36.01,37.72,36.07388
2018-10-31,37.72,36.114,36.0,37.619,0.046394,104.447029,-0.041674,4.777778,0.012526,40.12,37.72,40.12,37.799972
2018-11-01,40.12,36.782,36.781667,37.501,0.061685,109.075091,-0.019548,9.07608,0.017384,40.32,40.12,40.32,40.231762
2018-11-02,40.32,37.766,37.371667,37.488,0.004973,106.762697,0.007361,7.889221,0.026027,39.92,40.32,39.92,40.383095
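
The summary asks for a comparison of Mean Absolute Error, Mean Squared Error and Root Mean Squared Error, while the cell above reports only MAPE. The sketch below computes all four on the ten test rows shown in the table. It also scores a naive baseline that predicts tomorrow's price as today's close. That baseline and the `report` helper are additions for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn import metrics

# Ten test rows copied from the table above (Close, Label, forecast columns)
close = np.array([39.76, 38.68, 35.43, 36.78, 35.40,
                  34.66, 36.01, 37.72, 40.12, 40.32])
label = np.array([38.68, 35.43, 36.78, 35.40, 34.66,
                  36.01, 37.72, 40.12, 40.32, 39.92])
forecast = np.array([39.711349, 38.638252, 35.353229, 36.775563, 35.393074,
                     34.66888, 36.07388, 37.799972, 40.231762, 40.383095])


def report(name, predicted, actual):
    """Print and return MAE, MSE, RMSE and MAPE (in percent)."""
    mae = metrics.mean_absolute_error(actual, predicted)
    mse = metrics.mean_squared_error(actual, predicted)
    rmse = float(np.sqrt(mse))
    mape = float(np.mean(np.abs(predicted - actual) / np.abs(actual)) * 100)
    print(f"{name}: MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} MAPE={mape:.2f}%")
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}


model_scores = report("model forecast", forecast, label)
naive_scores = report("naive (today's close)", close, label)
```

Running the same comparison on the full `cdf_test` frame (using its `forecast`, `Close` and `y_test` columns) would show whether the regression adds anything beyond the previous day's price.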


We will add two more features (BIAS6 and ASY5) from Type 2

In [29]:
# Use Close price, Disparity5, OSCP, BIAS6 and ASY5 as input
x = np.array(cdf[['Close','Disparity5','OSCP','BIAS6','ASY5']])
# y is the label
y = np.array(cdf[['Label']])

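# Scale values down: fit a standard scaler on y and reuse its mean and std for x,
# so that x and y are on the same scale
y = y.reshape(-1,1)
scaler = preprocessing.StandardScaler().fit(y)

# The scaler was fitted on a single column, so apply its statistics to x by broadcasting
x = (x - scaler.mean_) / scaler.scale_
y = scaler.transform(y)
# The loop overwrites the split on every pass, so only the last (largest) split is kept
tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(x):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]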

# Assign sklearn's model to a variable
#linear = ElasticNetCV()
linear = LinearRegression()
# Fit or "train" the model, (reshape just to avoid error warning)
linear.fit(x_train, y_train.reshape(len(y_train),))

#Score returns the coefficient of determination 
print('The R^2 is:'+str(linear.score(x_test, y_test)))
# Predict() uses the model to predict the values for the input
forecast_set = linear.predict(x_test)

#slice dataframes for training data, testing data and forecast data.
cdf_train = cdf[0:len(x_train)].copy()
# Only the first feature column (Close) is converted back to prices
cdf_train.loc[:,'x_train']=scaler.inverse_transform(x_train[:, [0]]).ravel()
cdf_train.loc[:,'y_train']=scaler.inverse_transform(y_train).ravel()

cdf_test= cdf[len(x_train):].copy()
cdf_test.loc[:,'x_test']=scaler.inverse_transform(x_test[:, [0]]).ravel()
cdf_test.loc[:,'y_test']=scaler.inverse_transform(y_test).ravel()
# predict() returns a 1-D array, while inverse_transform expects 2-D input
cdf_test.loc[:,'forecast']=scaler.inverse_transform(forecast_set.reshape(-1,1)).ravel()

#Try to combine two dataframes
new_cdf = pd.concat([cdf_train,cdf_test],sort=False)

new_cdf_plot=new_cdf[['Close','x_train','y_test','forecast']]
tidy_df = new_cdf_plot.reset_index().melt(id_vars=["Date"])
px.line(tidy_df, x="Date", y="value", color="variable")

The R^2 is:0.937785251845711


In [30]:
# Calculate MAPE (Mean Absolute Percentage Error)
MAPE = sum(abs(cdf_test['forecast']-cdf_test['y_test'])/abs(cdf_test['y_test']))/len(cdf_test['y_test'])*100
print('The MAPE with Disparity5, OSCP BIAS6 and ASY5 is: '+str(MAPE))

The MAPE with Disparity5, OSCP BIAS6 and ASY5 is: 2.2920584878546735


When we use Close, Disparity5, OSCP, BIAS6 and ASY5 as features, the MAPE (2.2921) is slightly greater than the MAPE (2.2907) obtained with only Close, Disparity5 and OSCP as features.
The likely reason is that the LinearRegression model does not need so many features. I also tried the ElasticNetCV model: with either 3 or 5 features, the MAPE is the same.


Decision: we use 1 for buy, -1 for sell and 0 for hold.
The basic idea is:
if the forecast trend and the label trend are the same and |(forecast-label)/label| >= 0.5, buy
if the forecast trend and the label trend are opposite and |(forecast-label)/label| >= 0.5, sell
Otherwise, hold
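
The rule above can be sketched as a small function. Two points are not specified in the text and are assumptions here: "trend" is taken as the sign of the move relative to the current Close, and the 0.5 threshold is read as 0.5 percent, matching the percent scale used for MAPE. The function name `decide` and the parameter `threshold_pct` are hypothetical.

```python
import numpy as np


def decide(close, forecast, label, threshold_pct=0.5):
    """Return 1 (buy), -1 (sell) or 0 (hold) for each row.

    Assumptions: trend = sign of the move relative to today's close,
    and the threshold is expressed in percent.
    """
    close = np.asarray(close, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    label = np.asarray(label, dtype=float)

    forecast_trend = np.sign(forecast - close)
    label_trend = np.sign(label - close)

    # Relative gap between forecast and label, in percent
    deviation_pct = np.abs((forecast - label) / label) * 100
    large_gap = deviation_pct >= threshold_pct

    same_trend = forecast_trend * label_trend > 0
    opposite_trend = forecast_trend * label_trend < 0

    # Buy when the trends agree, sell when they disagree, hold otherwise
    return np.where(large_gap & same_trend, 1,
                    np.where(large_gap & opposite_trend, -1, 0))


# Two rows from the test table plus one made-up row with a small gap
signals = decide(close=[39.76, 35.43, 100.0],
                 forecast=[39.711349, 35.353229, 100.2],
                 label=[38.68, 36.78, 100.3])
print(signals.tolist())
```

Because the rule compares the forecast with the label, which is the next day's actual close, it can only be evaluated on historical data and not at the moment a trade would be placed.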

In [32]:
cdf_test.head(20)

Date,Close,MA5,MA6,MA10,ln,Disparity5,OSCP,BIAS6,ASY5,Label,x_test,y_test,forecast
2018-10-22,39.76,41.414,41.568333,41.773,-0.017205,96.006181,-0.008669,-4.350267,-0.012574,38.68,39.76,38.68,39.71514
2018-10-23,38.68,40.508,40.958333,41.414,-0.027539,95.487311,-0.022366,-5.562564,-0.02215,35.43,38.68,35.43,38.656959
2018-10-24,35.43,39.124,39.661667,40.796,-0.087764,90.558225,-0.042736,-10.669412,-0.035682,36.78,35.43,36.78,35.397916
2018-10-25,36.78,38.22,38.733333,40.277,0.037395,96.232339,-0.05382,-5.043029,-0.023182,35.4,36.78,35.4,36.781327
2018-10-26,35.4,37.21,37.75,39.57,-0.038242,95.135716,-0.063424,-6.225166,-0.026671,34.66,35.4,34.66,35.401501
2018-10-29,34.66,36.19,36.785,38.802,-0.021126,95.772313,-0.072175,-5.776811,-0.027455,36.01,34.66,36.01,34.673598
2018-10-30,36.01,35.656,36.16,38.082,0.03821,100.99282,-0.068039,-0.414823,-0.014305,37.72,36.01,37.72,36.059545
2018-10-31,37.72,36.114,36.0,37.619,0.046394,104.447029,-0.041674,4.777778,0.012526,40.12,37.72,40.12,37.784849
2018-11-01,40.12,36.782,36.781667,37.501,0.061685,109.075091,-0.019548,9.07608,0.017384,40.32,40.12,40.32,40.253015
2018-11-02,40.32,37.766,37.371667,37.488,0.004973,106.762697,0.007361,7.889221,0.026027,39.92,40.32,39.92,40.392472
