# Aim :
## Predicting Stock prices of Apple, Google and Amazon
Our first aim is to predict the next day closing stock price for Apple, Google and Amazon. **For this we will train the model that learns from the data of all the 3 companies.**
## Experiment with Transfer Learning
After this we will see if we can design a model that only learns from the data of a single company (**Google**), but predicts well for the other two (**Apple and Amazon**) also. This is called **Transfer Learning**.

# Dataset : [NYSE Kaggle](https://www.kaggle.com/dgawlik/nyse)

In [None]:
import numpy as np
import pandas
import seaborn
import matplotlib.pyplot as plt

In [None]:
df = pandas.read_csv('../input/nyse/prices.csv')

In [None]:
df.head()

In [None]:
print(df.shape)

# Filtering out the dataset
Our aim is limited to predicting **Apple, Google and Amazon stocks**. The ticker symbols for these companies are AAPL, GOOGL and AMZN respectively.

In [None]:
print(df[df['symbol'] == 'AAPL'].shape)
print(df[df['symbol'] =='GOOGL'].shape)
print(df[df['symbol'] == 'AMZN'].shape)

In [None]:
main_df = df[(df['symbol'] == 'AAPL') | (df['symbol'] == 'GOOGL') | (df['symbol'] == 'AMZN')].reset_index(drop = True)

In [None]:
print('Number of missing values : ' + str(main_df.isna().sum().sum()))

In [None]:
main_df.head(9)

# Starting and ending duration

In [None]:
print(main_df['date'].min())
print(main_df['date'].max())

So we have data from **4th of January, 2010** to **30th of December, 2016**

# One hot encode the symbols

In [None]:
main_df = pandas.get_dummies(main_df, columns = ['symbol'])

In [None]:
main_df.head(9)

In [None]:
from tabulate import tabulate
info = [[col, main_df[col].count(), main_df[col].max(), main_df[col].min()] for col in main_df.columns]
print(tabulate(info, headers = ['Feature', 'Count', 'Maximum', 'Minimum'], tablefmt= 'orgtbl'))

# Exploratory Data Analysis

In [None]:
main_df = main_df.drop(['date'], axis = 1)

In [None]:
seaborn.pairplot(main_df.drop(['symbol_AAPL', 'symbol_AMZN', 'symbol_GOOGL'], axis = 1))

In [None]:
plt.figure(figsize = (15,15))
mat = main_df.drop(['symbol_AAPL', 'symbol_AMZN', 'symbol_GOOGL'], axis = 1).corr()
seaborn.heatmap(mat, annot = True, square = True)

In [None]:
info = ['open', 'close', 'low', 'high', 'volume']

In [None]:
plt.figure(figsize = (20,10))

for i in range(4) :
    plt.subplot(2,2,i+1)
    
    plt.plot(main_df[main_df['symbol_AAPL'] == 1][info[i]].values)
    plt.plot(main_df[main_df['symbol_GOOGL']== 1][info[i]].values)
    plt.plot(main_df[main_df['symbol_AMZN'] == 1][info[i]].values)
    plt.xlabel('time' )
    plt.ylabel(info[i])
    plt.legend(['Apple', 'Google', 'Amazon'])
    plt.title(info[i] + ' vs time')
plt.show()

Since the open, low, and high prices are almost linearly correlated with the closing price, they carry largely redundant information, so only one of them (**open**) is fed to the model. The second feature is **volume traded**.
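
To make the redundancy argument concrete, here is a minimal sketch that measures the correlation of each candidate feature with the closing price. It runs on synthetic random-walk prices, not on the NYSE data, and every name in it (`close`, `open_`, `corr_with_close`) is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data (an assumption, not the NYSE prices): a random-walk close,
# with open/high/low built as small perturbations around it.
close = 100 + np.cumsum(rng.normal(0, 1, n))
open_ = close + rng.normal(0, 0.5, n)
high = np.maximum(open_, close) + rng.uniform(0, 1, n)
low = np.minimum(open_, close) - rng.uniform(0, 1, n)
volume = rng.uniform(1e6, 5e6, n)          # unrelated to price by construction

candidates = {'open': open_, 'high': high, 'low': low, 'volume': volume}

# Pearson correlation of every candidate feature with the closing price.
corr_with_close = {name: np.corrcoef(values, close)[0, 1]
                   for name, values in candidates.items()}

for name, value in corr_with_close.items():
    print(name, round(value, 3))
```

In this toy setup the three price columns are almost interchangeable as predictors of the close, while volume adds independent information. The heatmap above plays the same role for the real data.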

In [None]:
info = [[col, main_df[col].count(), main_df[col].max(), main_df[col].min()] for col in main_df.columns]
print(tabulate(info, headers = ['Feature', 'Count', 'Maximum', 'Minimum'], tablefmt = 'orgtbl'))

# Create the dataset
Each input window holds the past 300 rows, which is 100 trading days of data for each of the three companies. Besides **opening price and volume traded**, each row carries three one-hot fields that indicate which company the row belongs to.

In [None]:
X = np.array(main_df.drop(['close', 'low', 'high'], axis = 1))
y = np.array(main_df['close'])

In [None]:
print(X.shape)
print(y.shape)

# Scaling the columns

In [None]:
print(X[:3])

In [None]:
from sklearn.preprocessing import MinMaxScaler
X = MinMaxScaler().fit_transform(X)

In [None]:
print(X[:3])

In [None]:
print(y.min())
print(y.max())

In [None]:
temp = MinMaxScaler().fit_transform(np.reshape(y, (len(y),1)))
y = temp.reshape(-1)
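
Since the model is trained on scaled targets, its predictions also come out on the 0 to 1 scale. The sketch below shows the min-max transform and its inverse written out in plain NumPy, with made-up prices. Keeping the fitted `MinMaxScaler` object in a variable and calling its `inverse_transform` method would achieve the same thing.

```python
import numpy as np

# Hypothetical closing prices, only for illustration.
prices = np.array([30.0, 45.5, 120.0, 800.0])

# Min-max scaling to the default range [0, 1].
p_min, p_max = prices.min(), prices.max()
scaled = (prices - p_min) / (p_max - p_min)

# Inverse transform: map scaled values (e.g. model predictions) back to prices.
restored = scaled * (p_max - p_min) + p_min

print(scaled)
print(restored)
```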

In [None]:
print(y.min())
print(y.max())

In [None]:
print(X.min())
print(X.max())

In [None]:
print(y[3:6])

# Creating the actual time series based numpy array
An **important** point to note is that y will hold three outputs (target labels) per sample. These are the next closing prices of AAPL, AMZN and GOOGL, in that order.
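
The three targets only line up with AAPL, AMZN and GOOGL if every window starts on the same company. Here is a small sketch of that idea on toy data, assuming the rows are ordered by day with the three companies interleaved. The names `y_toy`, `X_toy` and `window_days` are hypothetical.

```python
import numpy as np

days, companies = 6, 3

# Toy closing prices in interleaved order (assumed layout):
# day0-AAPL, day0-AMZN, day0-GOOGL, day1-AAPL, ...
# The value d * 10 + c encodes day d and company index c.
y_toy = np.array([d * 10 + c for d in range(days) for c in range(companies)],
                 dtype=float)
X_toy = y_toy.reshape(-1, 1)               # a single feature column

window_days = 2
length_toy = window_days * companies       # rows per input window

X_win, y_win = [], []
# Stepping by the number of companies keeps every window aligned to a day boundary.
for i in range(length_toy, len(X_toy) - companies + 1, companies):
    X_win.append(X_toy[i - length_toy:i])  # past `window_days` days, all companies
    y_win.append(y_toy[i:i + companies])   # next day's close for each company

X_win, y_win = np.array(X_win), np.array(y_win)
print(X_win.shape, y_win.shape)
print(y_win)
```

With a step of 1 instead, two out of every three windows would start in the middle of a day, and the target columns would mix companies.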

In [None]:
length = 300                         # 100 days * 3 companies

X_res = []
y_res = []

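# Step by 3 so every window starts on an AAPL row; this assumes the rows are
# ordered by date with AAPL, AMZN, GOOGL interleaved for each day.
for i in range(length, len(X) - 2, 3) :
    X_res.append(X[i-length:i])      # features for the past 100 days of all 3 companies
    y_res.append(y[i:i+3])           # next-day closing prices for AAPL, AMZN, GOOGL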

X_res = np.array(X_res)
y_res = np.array(y_res)

In [None]:
print(X_res.shape)
print(y_res.shape)

# Train test split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test  = train_test_split(X_res, y_res, test_size = 0.3, shuffle = True, random_state = 1)
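
A note on the shuffled split: consecutive windows overlap almost entirely, so a random split can place nearly identical windows in both the training and the test set, which may make the validation loss look better than it would on truly unseen days. As an alternative, here is a sketch of a chronological split on toy arrays. The names `X_demo`, `y_demo` and `split` are hypothetical, and this is not what the notebook does.

```python
import numpy as np

n_samples = 10

# Toy windows ordered from oldest to newest (assumed ordering).
X_demo = np.arange(n_samples * 4).reshape(n_samples, 4, 1)
y_demo = np.arange(n_samples)

# Keep the first 70% for training and the most recent 30% for testing.
split = int(n_samples * 0.7)
X_tr, X_te = X_demo[:split], X_demo[split:]
y_tr, y_te = y_demo[:split], y_demo[split:]

print(X_tr.shape, X_te.shape)
```

Passing `shuffle = False` to `train_test_split` gives the same kind of ordered split.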

In [None]:
print(X_train.shape)
print(y_train.shape)

In [None]:
print(X_test.shape)
print(y_test.shape)

# Model creation
### Recurrent Neural Network
Vanilla neural networks have no memory, so they do not take past events into account when predicting. This makes them a poor fit for time series data, where there are dependencies across time. This is where RNNs come in. An RNN has **memory**, which helps it retain past data and base its predictions on it.

### Long Short Term Memory
Simple RNN models have a **short term memory** and are not able to retain dependencies that occurred long before the current state. So LSTM or **long short term memory** is used to retain those dependencies as well, using something called gated units. More info can be found [here](https://medium.com/x8-the-ai-community/understanding-recurrent-neural-networks-in-6-minutes-967ab51b94fe).
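
To give a feel for what the gated units do, here is a minimal NumPy sketch of a single LSTM time step, run over a short random sequence. The weight layout and gate ordering are choices made for this illustration and are not meant to mirror how Keras stores its weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. Shapes: W (4H, D), U (4H, H), b (4H,)."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate: how much new information to write
    f = sigmoid(z[H:2 * H])        # forget gate: how much old cell state to keep
    o = sigmoid(z[2 * H:3 * H])    # output gate: how much of the cell to expose
    g = np.tanh(z[3 * H:4 * H])    # candidate values for the cell state
    c_t = f * c_prev + i * g       # cell state carries the long-term memory
    h_t = o * np.tanh(c_t)         # hidden state is the step's output
    return h_t, c_t

rng = np.random.default_rng(0)
D, H = 2, 4                        # 2 input features, 4 hidden units (toy sizes)
W = rng.normal(0, 0.1, (4 * H, D))
U = rng.normal(0, 0.1, (4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):    # a sequence of 5 time steps
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h)
```

The forget gate is what lets the cell state survive over many steps, which is the part a simple RNN lacks.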

In [None]:
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import Bidirectional
from keras.layers import BatchNormalization

from keras.layers import Input

In [None]:
def lstm_layer (hiddenx) :
    
    model = Sequential()
    
    model.add(Bidirectional(LSTM(hiddenx, activation = 'tanh', return_sequences = True)))
    model.add(BatchNormalization())
    model.add(Dropout(0.2))
    
    return model

In [None]:
def rnn (hidden1, hidden2, hidden3) :
    
    model = Sequential()
    
    # Input Block
    model.add(Input((length, 5,)))
    
    # LSTM Block
    model.add(lstm_layer(hidden1))
    model.add(lstm_layer(hidden2))
    model.add(Bidirectional(LSTM(hidden3, activation = 'tanh', return_sequences = False)))
    model.add(BatchNormalization())
    model.add(Dropout(0.2))
    
    # Output Block
    model.add(Dense(3, activation = 'linear'))
    
    model.compile(loss = 'mean_squared_error', optimizer = 'adam')
    return model

In [None]:
model = rnn(128, 128, 32)
model.summary()

In [None]:
from keras.callbacks import ModelCheckpoint
checkp = ModelCheckpoint('./result_model.h5', monitor = 'val_loss', save_best_only = True, verbose = 1)

In [None]:
history = model.fit(X_train, y_train, epochs = 200, batch_size = 32, validation_data = (X_test, y_test), callbacks = [checkp])

In [None]:
plt.figure(figsize = (20, 5))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['train_loss', 'validation_loss'])
plt.xlabel('Epochs')
plt.ylabel('Losses')
plt.title('Losses vs Epochs')
plt.show()

In [None]:
from keras.models import load_model
model = load_model('./result_model.h5')

# Prediction

In [None]:
pred = model.predict(X_test)

In [None]:
print(pred.shape)

# Performance Metrics
#### Mean Squared Error
Sum of squares of differences between actual value and predicted value, divided by the total number of samples. This is an **absolute measure**.
#### R-squared score
This metric evaluates how well the model performs compared to a baseline that predicts the mean of the targets for every sample. This is a **relative measure**.
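
Both metrics are easy to write out by hand. Here is a small sketch for the single-output case on made-up numbers. The function names `mse` and `r2` are hypothetical helpers, not the scikit-learn functions used below.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of the squared differences between actual and predicted values.
    return np.mean((y_true - y_pred) ** 2)

def r2(y_true, y_pred):
    # 1 - (model's squared error) / (squared error of always predicting the mean).
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.5, 2.0, 2.5, 4.0])

print(mse(y_true_demo, y_pred_demo))
print(r2(y_true_demo, y_pred_demo))

# Predicting the mean everywhere gives an R-squared of exactly 0.
baseline = np.full_like(y_true_demo, y_true_demo.mean())
print(r2(y_true_demo, baseline))
```

An R-squared of 1 means perfect predictions, 0 means no better than the mean baseline, and negative values mean worse than the baseline.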

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
print('mean squared error : ' + str(mean_squared_error(y_test, pred)))
print('r2_score : ' + str(r2_score(y_test, pred)))

# Plot the Prediction and Test value

In [None]:
plt.figure(figsize = (20,15))

plt.subplot(3,1,1)
plt.plot(y_test[:50,0])
plt.plot(pred[:50,0])
plt.xlabel('Period')
plt.ylabel('Closing Prices')
plt.legend(['testing price','predicted price'])
plt.title('Comparison in closing prices for Apple', fontsize = 15)


plt.subplot(3,1,2)
plt.plot(y_test[:50,1])
plt.plot(pred[:50,1])
plt.xlabel('Period')
plt.ylabel('Closing Prices')
plt.legend(['testing price','predicted price'])
plt.title('Comparison in closing prices for Amazon',fontsize = 15)


plt.subplot(3,1,3)
plt.plot(y_test[:50,2])
plt.plot(pred[:50,2])
plt.xlabel('Period')
plt.ylabel('Closing Prices')
plt.legend(['testing price','predicted price'])
plt.title('Comparison in closing prices for Google',fontsize = 15)

# Transfer learning

In [None]:
import pandas
import numpy as np
import matplotlib.pyplot as plt

In [None]:
df = pandas.read_csv('../input/nyse/prices.csv')

In [None]:
df.head()

# Taking stock price data of Google

In [None]:
df = df[df['symbol'] == 'GOOGL'].reset_index(drop = True)

In [None]:
df.head()

In [None]:
df = df.drop(['symbol'], axis = 1)

In [None]:
print('Starting date : ' + str(df['date'].min()))
print('Ending date   : ' + str(df['date'].max()))

So we have data from **4th January, 2010 to 30th December, 2016**.

In [None]:
print('Number of missing values : ' + str(df.isna().sum().sum()))
print(df.shape)

It has already been observed that the features **open, close, low and high** are highly correlated, so we drop **low** and **high** along with the **date** column. **Close** is the target, which leaves **open** and **volume traded** as the input features.

In [None]:
from tabulate import tabulate

In [None]:
info = [[col, df[col].count(), df[col].max(), df[col].min()] for col in df.columns]
print(tabulate(info, headers = ['Feature', 'Count', 'Maximum', 'Minimum'], tablefmt = 'orgtbl'))

In [None]:
X = np.array(df.drop(['date', 'high', 'low', 'close'], axis = 1))
y = np.array(df['close'])

In [None]:
print(X.shape)
print(y.shape)

# Scale

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
X = MinMaxScaler().fit_transform(X)

In [None]:
t = np.reshape(y, (len(y),1))
t = MinMaxScaler().fit_transform(t)
y = t.reshape(-1)

In [None]:
print(X.max())
print(X.min())

In [None]:
print(y.max())
print(y.min())

# Convert to time series data

In [None]:
length = 300

X_res = []
y_res = []

for i in range(length , len(X)) :
    X_res.append(X[i-length:i])
    y_res.append(y[i])

X_res = np.array(X_res)
y_res = np.array(y_res)
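
The same windowing is needed again later for Apple and Amazon, so it can be wrapped in a small helper. Here is a sketch on toy arrays. `make_windows` is a hypothetical name and is not used elsewhere in this notebook.

```python
import numpy as np

def make_windows(features, target, length):
    """Pair each run of `length` consecutive rows with the target that follows it."""
    X_out = np.stack([features[i - length:i]
                      for i in range(length, len(features))])
    y_out = target[length:]
    return X_out, y_out

# Toy data: 10 time steps with 2 features each.
feats = np.arange(20, dtype=float).reshape(10, 2)
targ = np.arange(10, dtype=float)

X_w, y_w = make_windows(feats, targ, length=3)
print(X_w.shape, y_w.shape)
```

The first window holds rows 0 to 2 and is paired with the target at row 3, which matches the loop above.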

In [None]:
print(X_res.shape)
print(y_res.shape)

# Train/test split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size = 0.2, random_state = 1)

In [None]:
print(X_train.shape)
print(y_train.shape)

In [None]:
print(X_test.shape)
print(y_test.shape)

# Build the model

In [None]:
from keras.models import Sequential
from keras.layers import Input
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.layers import Bidirectional
from keras.layers import BatchNormalization

In [None]:
# Same building block as in the first part, defined again so that this section runs on its own.
def lstm_layer (hiddenx) :
    
    model = Sequential()
    
    model.add(Bidirectional(LSTM(hiddenx, activation = 'tanh', return_sequences = True)))
    model.add(BatchNormalization())
    model.add(Dropout(0.2))
    
    return model

def mod (hidden1, hidden2, hidden3) :
    
    model = Sequential()
    
    # Input layer
    model.add(Input((length, 2,)))
    
    # lstm layer
    model.add(lstm_layer(hidden1))
    model.add(lstm_layer(hidden2))
    model.add(Bidirectional(LSTM(hidden3, activation = 'tanh', return_sequences = False)))
    model.add(BatchNormalization())
    model.add(Dropout(0.2))
    
    # output layer
    model.add(Dense(1, activation = 'linear'))
    
    model.compile(loss = 'mean_squared_error', optimizer = 'adam')
    
    return model

In [None]:
model = mod(128, 64, 64)
model.summary()

# Model training

In [None]:
from keras.callbacks import ModelCheckpoint
checkp = ModelCheckpoint('./transfer_model.h5', monitor = 'val_loss', verbose = 1, save_best_only = True)

In [None]:
history = model.fit(X_train, y_train, epochs = 200, batch_size = 32, callbacks = [checkp], validation_data = (X_test, y_test))

In [None]:
plt.figure(figsize = (20,5))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['Training loss', 'Validation loss'])
plt.xlabel('Epochs')
plt.ylabel('Losses')
plt.title('Losses vs Epochs')

# Prediction on test data

In [None]:
from keras.models import load_model
model = load_model('./transfer_model.h5')

In [None]:
pred = model.predict(X_test)

In [None]:
print(pred.shape)

In [None]:
pred = pred.reshape(-1)

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
print('Mean squared error : ' + str(mean_squared_error(y_test, pred)))
print('r2_score : ' + str(r2_score(y_test, pred)))

In [None]:
plt.figure(figsize = (20,5))

plt.plot(y_test[:100])
plt.plot(pred[:100])
plt.legend(['Testing values', 'predicted values'])
plt.xlabel('Time Period')
plt.ylabel('Closing prices')
plt.title('Comparison b/w predicted and real closing prices (Google)')

# Now let's evaluate the model's performance on stock data of Amazon and Apple

In [None]:
df = pandas.read_csv('../input/nyse/prices.csv')

In [None]:
apple = df[df['symbol'] == 'AAPL'].reset_index(drop = True)
amazn = df[df['symbol'] == 'AMZN'].reset_index(drop = True)

In [None]:
apple.head()

In [None]:
amazn.head()

In [None]:
print('Apple,')
print('Number of missing values : ' + str(apple.isna().sum().sum()))
print(apple.shape)

print('Amazon')
print('Number of missing values : ' + str(amazn.isna().sum().sum()))
print(amazn.shape)

In [None]:
apple = apple.drop(['date', 'symbol'], axis = 1)
amazn = amazn.drop(['date', 'symbol'], axis = 1)

In [None]:
info = [[col, apple[col].count(), apple[col].max(), apple[col].min()] for col in apple.columns]
print(tabulate(info, headers = ['Feature', 'Count', 'Max', 'Min'], tablefmt = 'orgtbl'))

In [None]:
info = [[col, amazn[col].count(), amazn[col].max(), amazn[col].min()] for col in amazn.columns]
print(tabulate(info, headers = ['Feature', 'Count', 'Max', 'Min'], tablefmt = 'orgtbl'))

# Create arrays
Only use **open price and volume traded**, as other features are correlated with closing price.

In [None]:
X_apple = np.array(apple.drop(['close', 'low', 'high'], axis = 1))
y_apple = np.array(apple['close'])

X_amazn = np.array(amazn.drop(['close', 'low', 'high'], axis = 1))
y_amazn = np.array(amazn['close'])

In [None]:
print(X_apple.shape)
print(y_apple.shape)

In [None]:
print(X_amazn.shape)
print(y_amazn.shape)

# Scale

In [None]:
X_apple = MinMaxScaler().fit_transform(X_apple)
X_amazn = MinMaxScaler().fit_transform(X_amazn)

In [None]:
t = np.reshape(y_apple, (len(y_apple),1))
t = MinMaxScaler().fit_transform(t)
y_apple = t.reshape(-1)

In [None]:
t = np.reshape(y_amazn, (len(y_amazn),1))
t = MinMaxScaler().fit_transform(t)
y_amazn = t.reshape(-1)

In [None]:
print(X_apple.max())
print(X_apple.min())

In [None]:
print(y_apple.max())
print(y_apple.min())

In [None]:
print(X_amazn.max())
print(X_amazn.min())

In [None]:
print(y_amazn.max())
print(y_amazn.min())

# Verify the shapes

In [None]:
print(X_apple.shape)
print(y_apple.shape)

In [None]:
print(X_amazn.shape)
print(y_amazn.shape)

# Create timesteps of size 300

In [None]:
length = 300

X_res_apple = []
y_res_apple = []

X_res_amazn = []
y_res_amazn = []

In [None]:
for i in range(length, len(X_apple)) :
    
    X_res_apple.append(X_apple[i-length:i])            # previous 300 days of Apple stock data
    y_res_apple.append(y_apple[i])                     # next day closing price
    
    X_res_amazn.append(X_amazn[i-length:i])            # previous 300 days of Amazon stock data
    y_res_amazn.append(y_amazn[i])                     # next day closing price

X_res_apple, y_res_apple = np.array(X_res_apple), np.array(y_res_apple)
X_res_amazn, y_res_amazn = np.array(X_res_amazn), np.array(y_res_amazn)

In [None]:
print(X_res_apple.shape)
print(y_res_apple.shape)

In [None]:
print(X_res_amazn.shape)
print(y_res_amazn.shape)

In [None]:
from sklearn.utils import shuffle

In [None]:
X_res_apple, y_res_apple = shuffle(X_res_apple, y_res_apple, random_state = 1)
X_res_amazn, y_res_amazn = shuffle(X_res_amazn, y_res_amazn, random_state = 1)

# Predict for Apple stock data

In [None]:
from keras.models import load_model
model = load_model('./transfer_model.h5')

In [None]:
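for i, layer in enumerate(model.layers) :
    if i < 5 :
        layer.trainable = False          # freeze everything except the final Dense layer
# Recompile so that the changed trainable flags take effect during fit.
model.compile(loss = 'mean_squared_error', optimizer = 'adam')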

In [None]:
print(model.summary())

In [None]:
X_apple_train, X_apple_test, y_apple_train, y_apple_test = train_test_split(X_res_apple, y_res_apple, test_size = 0.2, random_state = 1)

In [None]:
print(X_apple_train.shape)
print(y_apple_train.shape)

In [None]:
print(X_apple_test.shape)
print(y_apple_test.shape)

### Training

In [None]:
checkp = ModelCheckpoint('./best_apple_model.h5', save_best_only = True, monitor = 'val_loss', verbose = 1)

In [None]:
history = model.fit(X_apple_train, y_apple_train, epochs = 50, batch_size = 32, validation_data = (X_apple_test, y_apple_test), callbacks = [checkp])

In [None]:
model = load_model('./best_apple_model.h5')
y_apple_pred = model.predict(X_apple_test)

In [None]:
print(y_apple_pred.shape)

In [None]:
y_apple_pred = y_apple_pred.reshape(-1)

In [None]:
print('Apple,')
print('Mean squared error : ' + str(mean_squared_error(y_apple_test, y_apple_pred)))
print('r2 score : ' + str(r2_score(y_apple_test, y_apple_pred)))

### Visualization

In [None]:
plt.figure(figsize = (20,5))

plt.plot(y_apple_test[100:200])
plt.plot(y_apple_pred[100:200])

plt.xlabel('Time period')
plt.ylabel('Closing prices')
plt.legend(['Actual Price', 'Predicted price'])
plt.title('Closing price comparison for Apple dataset', fontsize = 15)

# Predict for Amazon stock data

In [None]:
model = load_model('./transfer_model.h5')

In [None]:
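for i, layer in enumerate(model.layers) :
    if i < 5 :
        layer.trainable = False          # freeze everything except the final Dense layer
# Recompile so that the changed trainable flags take effect during fit.
model.compile(loss = 'mean_squared_error', optimizer = 'adam')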

In [None]:
print(model.summary())

In [None]:
X_amazn_train,  X_amazn_test, y_amazn_train, y_amazn_test = train_test_split(X_res_amazn, y_res_amazn, test_size = 0.2, random_state = 1)

In [None]:
print(X_amazn_train.shape)
print(y_amazn_train.shape)

In [None]:
print(X_amazn_test.shape)
print(y_amazn_test.shape)

### Train on the last layer

In [None]:
from keras.callbacks import ModelCheckpoint
checkp = ModelCheckpoint('./best_amazn_model.h5', monitor = 'val_loss', save_best_only = True, verbose = 1)

In [None]:
history = model.fit(X_amazn_train, y_amazn_train, epochs = 100, batch_size = 32, validation_data = (X_amazn_test, y_amazn_test), callbacks = [checkp])

### Use the 'best_amazn_model.h5' for prediction

In [None]:
model = load_model('./best_amazn_model.h5')

In [None]:
y_amazn_pred = model.predict(X_amazn_test)

In [None]:
print(y_amazn_pred.shape)

In [None]:
y_amazn_pred = y_amazn_pred.reshape(-1)

In [None]:
print('Mean squared error : ' + str(mean_squared_error(y_amazn_test, y_amazn_pred)))
print('r2_score : ' + str(r2_score(y_amazn_test, y_amazn_pred)))

### Visualization

In [None]:
plt.figure(figsize = (20,5))

plt.plot(y_amazn_test[100:200])
plt.plot(y_amazn_pred[100:200])

plt.xlabel('Time period')
plt.ylabel('Closing prices')
plt.legend(['Actual Price', 'Predicted price'])
plt.title('Closing price comparison for Amazon dataset', fontsize = 15)