# LSTM Stock Predictor Using Closing Prices

In this notebook, you will build and train a custom LSTM RNN that uses a 10-day window of Bitcoin closing prices to predict the closing price on the 11th day.

You will need to:

1. Prepare the data for training and testing
2. Build and train a custom LSTM RNN
3. Evaluate the performance of the model

## Data Preparation

In this section, you will prepare the training and testing data for the model. The model will use a rolling 10-day window to predict the closing price on the 11th day.

You will need to:
1. Use the `window_data` function to generate the X and y values for the model.
2. Split the data into 70% training and 30% testing
3. Apply the MinMaxScaler to the X and y values
4. Reshape the X_train and X_test data for the model. Note: The required input format for the LSTM is:

```python
reshape((X_train.shape[0], X_train.shape[1], 1))
```
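
As a minimal sketch of what this reshape does, the toy array below stands in for a scaled `X_train`. The name `X_toy` and its values are made up for illustration. The trailing `1` adds a features axis, so each time step carries a single value.

```python
import numpy as np

# Hypothetical stand-in for X_train: 3 samples, each a window of 4 time steps.
X_toy = np.arange(12, dtype=float).reshape(3, 4)

# Add a trailing axis of size 1 so the shape becomes (samples, time steps, features).
X_toy_3d = X_toy.reshape((X_toy.shape[0], X_toy.shape[1], 1))

print(X_toy.shape)     # (3, 4)
print(X_toy_3d.shape)  # (3, 4, 1)
```

The values themselves are unchanged; only the array's dimensionality differs.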

In [15]:
import numpy as np
import pandas as pd
import hvplot.pandas

In [16]:
# Set the random seeds for reproducibility
from numpy.random import seed   # Seeding NumPy's global random number generator makes its random draws repeatable across runs.
seed(1)
from tensorflow import random
random.set_seed(2)              # tensorflow.random.set_seed() sets TensorFlow's global random seed.

In [17]:
# Load the fear and greed sentiment data for Bitcoin
df = pd.read_csv('btc_sentiment.csv', index_col="date", infer_datetime_format=True, parse_dates=True)   # index_col="date" uses the date column as the index.
                                                                                                        # infer_datetime_format=True asks pandas to infer the datetime format (deprecated since pandas 2.0).
                                                                                                        # parse_dates=True parses the index into datetime values.
df = df.drop(columns="fng_classification")  # Drop the fng_classification column from the dataframe.
df.head()

date,fng_value
2019-07-29,19
2019-07-28,16
2019-07-27,47
2019-07-26,24
2019-07-25,42


In [18]:
# Load the historical closing prices for Bitcoin
df2 = pd.read_csv('btc_historic.csv', index_col="Date", infer_datetime_format=True, parse_dates=True)['Close']  # index_col="Date" uses the Date column as the index.
                                                                                                                # infer_datetime_format=True asks pandas to infer the datetime format (deprecated since pandas 2.0).
                                                                                                                # parse_dates=True parses the index into datetime values.
                                                                                                                # ['Close'] selects the Close column from the loaded DataFrame, which gives a Series.
df2 = df2.sort_index() # .sort_index() sorts by date in ascending order, from oldest to most recent.
df2.tail()

Date
2019-07-25    9882.429688
2019-07-26    9847.450195
2019-07-27    9478.320313
2019-07-28    9531.769531
2019-07-29    9529.889648
Name: Close, dtype: float64

In [19]:
# Join the data into a single DataFrame
df = df.join(df2, how="inner")  # .join() aligns the two objects on their index (the dates) by default.
                                # how="inner" keeps only the dates that appear in both.
df.tail()

Unnamed: 0,fng_value,Close
2019-07-25,42,9882.429688
2019-07-26,24,9847.450195
2019-07-27,47,9478.320313
2019-07-28,16,9531.769531
2019-07-29,19,9529.889648


In [20]:
df.head()

Unnamed: 0,fng_value,Close
2018-02-01,30,9114.719727
2018-02-02,15,8870.820313
2018-02-03,40,9251.269531
2018-02-04,24,8218.049805
2018-02-05,11,6937.080078


In [21]:
# This function accepts the column number for the features (X) and the target (y).
# It chunks the data up with a rolling window of Xt-n to predict Xt.
# It returns NumPy arrays of X and y.
def window_data(df, window, feature_col_number, target_col_number): # df is the source DataFrame.
                                                                    # window is the number of consecutive rows in each sample.
                                                                    # feature_col_number and target_col_number are the positional
                                                                    # column indexes for the features (X) and the target (y).
    X = []
    y = []
    for i in range(len(df) - window - 1): # i is the first row of each window. The extra -1 leaves the last available window unused.
        features = df.iloc[i:(i + window), feature_col_number]  # rows i through i + window - 1 of the feature column. iloc is used because rows and columns are selected by integer position.
        target = df.iloc[(i + window), target_col_number]       # the row immediately after the window, taken from the target column.
        X.append(features)                                      # add the window to the X list.
        y.append(target)                                        # add the target to the y list.
    return np.array(X), np.array(y).reshape(-1, 1)              # X has shape (samples, window); y is reshaped into a column of shape (samples, 1).
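
To make the windowing concrete, here is a small sketch that runs the same logic on made-up data. The `toy` DataFrame and its six prices are hypothetical, and the function body is repeated only so that the example runs on its own.

```python
import numpy as np
import pandas as pd

def window_data(df, window, feature_col_number, target_col_number):
    # Same logic as the notebook's window_data, repeated so this sketch is self-contained.
    X, y = [], []
    for i in range(len(df) - window - 1):
        X.append(df.iloc[i:(i + window), feature_col_number])
        y.append(df.iloc[(i + window), target_col_number])
    return np.array(X), np.array(y).reshape(-1, 1)

# Hypothetical toy data: six made-up closing prices.
toy = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})

X_toy, y_toy = window_data(toy, window=3, feature_col_number=0, target_col_number=0)

print(X_toy.shape)  # (2, 3): two windows of three prices each
print(y_toy.shape)  # (2, 1): one target per window
print(X_toy)        # windows [10, 11, 12] and [11, 12, 13]
print(y_toy)        # targets 13 and 14
```

With six rows and a window of 3, the loop yields two samples. The final price (15.0) is never used as a target because of the `- 1` in the loop range.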

In [37]:
df

Unnamed: 0,fng_value,Close
2018-02-01,30,9114.719727
2018-02-02,15,8870.820313
2018-02-03,40,9251.269531
2018-02-04,24,8218.049805
2018-02-05,11,6937.080078
...,...,...
2019-07-25,42,9882.429688
2019-07-26,24,9847.450195
2019-07-27,47,9478.320313
2019-07-28,16,9531.769531


In [22]:
# Predict Closing Prices using a 10 day window of previous closing prices
# Then, experiment with window sizes anywhere from 1 to 10 and see how the model performance changes
window_size = 10

# Column index 0 is the 'fng_value' column
# Column index 1 is the `Close` column
feature_column = 1
target_column = 1
X, y = window_data(df, window_size, feature_column, target_column) 

In [23]:
# Use 70% of the data for training and the remainder for testing
split = int(0.7 * len(X)) # index of the first testing sample: 70% of the number of samples, truncated to an integer.
X_train = X[: split]      # [: split] takes the first `split` samples (the earliest windows).
X_test = X[split:]        # [split:] takes the remaining samples, from index `split` to the end.
y_train = y[: split]      # the targets are split at the same index so they stay aligned with X.
y_test = y[split:]        # the remaining targets, matching X_test.

In [24]:
from sklearn.preprocessing import MinMaxScaler
# Use the MinMaxScaler to scale data between 0 and 1.
X_train_scaler = MinMaxScaler() 
X_test_scaler = MinMaxScaler()
y_train_scaler = MinMaxScaler()
y_test_scaler = MinMaxScaler()

# Fit the scaler for the Training Data
X_train_scaler.fit(X_train)
y_train_scaler.fit(y_train)

# Scale the training data
X_train = X_train_scaler.transform(X_train)
y_train = y_train_scaler.transform(y_train)

# Fit the scaler for the Testing Data
X_test_scaler.fit(X_test)
y_test_scaler.fit(y_test)

# Scale the testing data
X_test = X_test_scaler.transform(X_test)
y_test = y_test_scaler.transform(y_test)
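
The cell above fits separate scalers to the training and testing sets, so the same scaled value maps to a different price in each set. A common alternative is to fit the scaler on the training data only and reuse it for the testing data. The sketch below shows that alternative; it is not this notebook's method, and the arrays and the name `y_scaler` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy targets (made-up prices), already split chronologically.
y_train_toy = np.array([[100.0], [150.0], [200.0]])
y_test_toy = np.array([[175.0], [250.0]])

# Fit on the training data only, then apply the same mapping to both sets.
y_scaler = MinMaxScaler()
y_scaler.fit(y_train_toy)
y_train_scaled = y_scaler.transform(y_train_toy)
y_test_scaled = y_scaler.transform(y_test_toy)

print(y_train_scaled.ravel())  # training values span 0 to 1
print(y_test_scaled.ravel())   # test values may fall outside that range
```

Here the test price 250.0 scales to about 1.5, because it lies above the training maximum. That is expected when the scaler has not seen the testing data.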

In [25]:
# Reshape the features for the model
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))  # The LSTM expects 3D input with shape (samples, time steps, features).
                                                                    # shape[0] is the number of samples (windows) in X_train.
                                                                    # shape[1] is the number of time steps in each window.
                                                                    # 1 is the number of features per time step (the closing price).
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))
print (f"X_train sample values:\n{X_train[1]} \n")
print (f"X_test sample values:\n{X_test[1]}")

X_train sample values:
[[0.68162134]
 [0.72761425]
 [0.60270722]
 [0.44784942]
 [0.54023074]
 [0.52711046]
 [0.60786209]
 [0.66058747]
 [0.64516902]
 [0.58657552]] 

X_test sample values:
[[0.00242586]
 [0.00307681]
 [0.00183924]
 [0.        ]
 [0.00051155]
 [0.00254834]
 [0.00577449]
 [0.        ]
 [0.02614593]
 [0.02101502]]


---

## Build and Train the LSTM RNN

In this section, you will design a custom LSTM RNN and fit (train) it using the training data.

You will need to:
1. Define the model architecture
2. Compile the model
3. Fit the model to the training data

### Hints:
You will want to use the same model architecture and random seed for both notebooks. This is necessary to accurately compare the performance of the FNG model vs the closing price model. 

In [26]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

In [27]:
# Build the LSTM model. 
# return_sequences must be set to True on every LSTM layer that feeds another LSTM layer,
# but it is not needed on the final LSTM layer.
# Note: The dropouts help prevent overfitting
# Note: The input shape is the number of time steps and the number of indicators
# Note: Batching inputs has a different input shape of Samples/TimeSteps/Features

model = Sequential()

number_units = 5
dropout_fraction = 0.2

# Layer 1
model.add(LSTM(
    units=number_units,
    return_sequences=True,
    input_shape=(X_train.shape[1], 1),
))
model.add(Dropout(dropout_fraction))
# Layer 2
model.add(LSTM(units=number_units, return_sequences=True))
model.add(Dropout(dropout_fraction))
# Layer 3
model.add(LSTM(units=number_units))
model.add(Dropout(dropout_fraction))
# Output layer
model.add(Dense(1))


In [28]:
# Compile the model
model.compile(optimizer="adam", loss="mean_squared_error")

In [29]:
# Summarize the model
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 10, 5)             140       
                                                                 
 dropout (Dropout)           (None, 10, 5)             0         
                                                                 
 lstm_1 (LSTM)               (None, 10, 5)             220       
                                                                 
 dropout_1 (Dropout)         (None, 10, 5)             0         
                                                                 
 lstm_2 (LSTM)               (None, 5)                 220       
                                                                 
 dropout_2 (Dropout)         (None, 5)                 0         
                                                                 
 dense (Dense)               (None, 1)                 6
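
The parameter counts in the summary can be checked by hand. This sketch assumes the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel, and a bias. Under that assumption, a layer with `units` units and `input_dim` input features has `4 * (units * (input_dim + units) + units)` weights. The helper names are hypothetical.

```python
def lstm_param_count(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    # One weight per input-output pair plus one bias per output unit.
    return input_dim * units + units

print(lstm_param_count(input_dim=1, units=5))   # 140, matches lstm
print(lstm_param_count(input_dim=5, units=5))   # 220, matches lstm_1 and lstm_2
print(dense_param_count(input_dim=5, units=1))  # 6, matches dense
```

The first LSTM layer is smaller because it receives one feature per time step. The later layers receive the five outputs of the layer before them.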

In [30]:
# Train the model
# Use at least 10 epochs
# Do not shuffle the data
# Experiment with the batch size, but a smaller batch size is recommended
model.fit(X_train, y_train, epochs=10, shuffle=False, batch_size=1, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x2246ca50fc8>

---

## Model Performance

In this section, you will evaluate the model using the test data. 

You will need to:
1. Evaluate the model using the `X_test` and `y_test` data.
2. Use the X_test data to make predictions
3. Create a DataFrame of Real (y_test) vs predicted values. 
4. Plot the Real vs predicted values as a line chart

### Hints
Remember to apply the `inverse_transform` function to the predicted and y_test values to recover the actual closing prices.
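
As a minimal sketch of that round trip, the example below scales a few made-up prices and then recovers them. The array and the name `scaler` are hypothetical. In the notebook, the scaler that was fitted to the y values plays this role.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy prices (made-up values) as a column vector.
prices = np.array([[4000.0], [5000.0], [6000.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(prices)         # values mapped into the range 0 to 1
recovered = scaler.inverse_transform(scaled)  # back to the original price scale

print(np.allclose(recovered, prices))  # True
```

The inverse only recovers real prices when it comes from the same fitted scaler that produced the scaled values.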

In [31]:
# Evaluate the model
model.evaluate(X_test, y_test)



0.03376372158527374

In [32]:
# Make some predictions
predicted = model.predict(X_test)



In [33]:
# Recover the original prices instead of the scaled version
predicted_prices = y_test_scaler.inverse_transform(predicted)
real_prices = y_test_scaler.inverse_transform(y_test.reshape(-1, 1))

In [34]:
# Create a DataFrame of Real and Predicted values
stocks = pd.DataFrame({
    "Real": real_prices.ravel(),
    "Predicted": predicted_prices.ravel()
}, index = df.index[-len(real_prices): ]) 
stocks.head()

Unnamed: 0,Real,Predicted
2019-02-20,3924.23999,4065.871582
2019-02-21,3974.050049,4067.491943
2019-02-22,3937.040039,4077.04126
2019-02-23,3983.530029,4090.373047
2019-02-24,4149.089844,4107.730469


In [36]:
# Plot the real vs predicted values as a line chart
stocks.hvplot(title="10 Day Predicted vs Actual BTC Price - Close")