# LSTM Stock Predictor Using Fear and Greed Index

In this notebook, you will build and train a custom LSTM RNN that uses a 10 day window of Bitcoin fear and greed index values to predict the 11th day closing price. 

You will need to:

1. Prepare the data for training and testing
2. Build and train a custom LSTM RNN
3. Evaluate the performance of the model

## Data Preparation

In this section, you will need to prepare the training and testing data for the model. The model will use a rolling 10 day window to predict the 11th day closing price.

You will need to:
1. Use the `window_data` function to generate the X and y values for the model.
2. Split the data into 70% training and 30% testing
3. Apply the MinMaxScaler to the X and y values
4. Reshape the X_train and X_test data for the model. Note: The required input format for the LSTM is:

```python
reshape((X_train.shape[0], X_train.shape[1], 1))
```

In [48]:
import numpy as np
import pandas as pd
import hvplot.pandas

In [49]:
# Set the random seed for reproducibility
# Note: This is for the homework solution, but it is good practice to comment this out and run multiple experiments to evaluate your model
from numpy.random import seed
seed(1)
from tensorflow import random
random.set_seed(2)

In [50]:
# Load the fear and greed sentiment data for Bitcoin
df = pd.read_csv('btc_sentiment.csv', index_col="date", infer_datetime_format=True, parse_dates=True)
df = df.drop(columns="fng_classification")
df.head()

date,fng_value
2019-07-29,19
2019-07-28,16
2019-07-27,47
2019-07-26,24
2019-07-25,42


In [51]:
# Load the historical closing prices for Bitcoin
df2 = pd.read_csv('btc_historic.csv', index_col="Date", infer_datetime_format=True, parse_dates=True)['Close']
df2 = df2.sort_index()
df2.tail()

Date
2019-07-25    9882.429688
2019-07-26    9847.450195
2019-07-27    9478.320313
2019-07-28    9531.769531
2019-07-29    9529.889648
Name: Close, dtype: float64

In [52]:
# Join the data into a single DataFrame
df = df.join(df2, how="inner")
df.tail()

Unnamed: 0,fng_value,Close
2019-07-25,42,9882.429688
2019-07-26,24,9847.450195
2019-07-27,47,9478.320313
2019-07-28,16,9531.769531
2019-07-29,19,9529.889648


In [53]:
df.head()

Unnamed: 0,fng_value,Close
2018-02-01,30,9114.719727
2018-02-02,15,8870.820313
2018-02-03,40,9251.269531
2018-02-04,24,8218.049805
2018-02-05,11,6937.080078


In [55]:
# This function accepts the column number for the features (X) and the target (y)
# It chunks the data up with a rolling window of Xt-n to predict Xt
# It returns numpy arrays of X and y
def window_data(df, window, feature_col_number, target_col_number):
    X = []
    y = []
    for i in range(len(df) - window - 1):
        features = df.iloc[i:(i + window), feature_col_number]
        target = df.iloc[(i + window), target_col_number]
        X.append(features)
        y.append(target)
    return np.array(X), np.array(y).reshape(-1, 1)
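
To see what `window_data` produces, here is a minimal sketch on a toy DataFrame. The six-row frame and its values are made up for illustration; the function body is copied from the cell above. Note that with the loop bound `len(df) - window - 1`, the last possible window (rows 2 to 4, predicting row 5) is not generated.

```python
import numpy as np
import pandas as pd

def window_data(df, window, feature_col_number, target_col_number):
    X = []
    y = []
    for i in range(len(df) - window - 1):
        features = df.iloc[i:(i + window), feature_col_number]
        target = df.iloc[(i + window), target_col_number]
        X.append(features)
        y.append(target)
    return np.array(X), np.array(y).reshape(-1, 1)

# Toy data (made up): 6 days of index values and closing prices
toy = pd.DataFrame({
    "fng_value": [10, 20, 30, 40, 50, 60],
    "Close": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

X_toy, y_toy = window_data(toy, window=3, feature_col_number=0, target_col_number=1)
print(X_toy)  # each row holds 3 consecutive fng values
print(y_toy)  # each row holds the Close on the day right after that window
```

With six rows and a window of 3, this yields two samples: `[10, 20, 30]` paired with `4.0`, and `[20, 30, 40]` paired with `5.0`.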

In [56]:
# Predict Closing Prices using a 10 day window of previous fng values
# Then, experiment with window sizes anywhere from 1 to 10 and see how the model performance changes
window_size = 10

# Column index 0 is the 'fng_value' column
# Column index 1 is the `Close` column
feature_column = 0
target_column = 1
X, y = window_data(df, window_size, feature_column, target_column)

In [57]:
X.shape

(532, 10)

In [58]:
X[0:10]

array([[30, 15, 40, 24, 11,  8, 36, 30, 44, 54],
       [15, 40, 24, 11,  8, 36, 30, 44, 54, 31],
       [40, 24, 11,  8, 36, 30, 44, 54, 31, 42],
       [24, 11,  8, 36, 30, 44, 54, 31, 42, 35],
       [11,  8, 36, 30, 44, 54, 31, 42, 35, 55],
       [ 8, 36, 30, 44, 54, 31, 42, 35, 55, 71],
       [36, 30, 44, 54, 31, 42, 35, 55, 71, 67],
       [30, 44, 54, 31, 42, 35, 55, 71, 67, 74],
       [44, 54, 31, 42, 35, 55, 71, 67, 74, 63],
       [54, 31, 42, 35, 55, 71, 67, 74, 63, 67]], dtype=int64)

In [59]:
# Use 70% of the data for training and the remainder for testing
# YOUR CODE HERE!
split = int(0.7 * len(X))     # index at which to split: X has 532 samples, so split = 372
X_train = X[: split]          # samples 0-371 go to training (the sample at index 372 is excluded)
X_test = X[split: ]           # samples from index 372 onward go to testing
y_train = y[: split]
y_test = y[split : ]

In [60]:
from sklearn.preprocessing import MinMaxScaler

# Create 4 MinMaxScaler() objects, one per dataset. Our earlier RNN class examples created only 1.
X_train_scaler = MinMaxScaler()
X_test_scaler = MinMaxScaler()
y_train_scaler = MinMaxScaler()
y_test_scaler = MinMaxScaler()

# Fit each scaler on its own dataset
X_train_scaler.fit(X_train)                               # each scaler learns the min and max of the dataset it is fitted on
y_train_scaler.fit(y_train)
X_test_scaler.fit(X_test)
y_test_scaler.fit(y_test)

# Transform each dataset with its fitted scaler
X_train = X_train_scaler.transform(X_train)               # rescales each X and y dataset to values between 0 and 1 using the fitted min and max
y_train = y_train_scaler.transform(y_train)
X_test = X_test_scaler.transform(X_test)
y_test = y_test_scaler.transform(y_test)
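
The cell above fits separate scalers on the training and test sets, so the same scaled value can stand for different prices in each set. A common alternative, which is not what this notebook does, is to fit the scaler on the training data only and reuse it for the test data. The sketch below illustrates that alternative on made-up prices; the array values and the split point are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up prices, in chronological order
prices = np.array([[100.0], [150.0], [200.0], [300.0], [400.0]])
split = 3
train, test = prices[:split], prices[split:]

# Fit on the training data only, then reuse the same scaler for the test data
scaler = MinMaxScaler()
scaler.fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# Test values can fall outside 0-1 when they exceed the training range
print(train_scaled.ravel())
print(test_scaled.ravel())

# The same scaler maps scaled values back to prices
recovered = scaler.inverse_transform(test_scaled)
```

With this approach a scaled value of 1.0 means the same price in both sets, at the cost of test values that may lie outside the 0 to 1 range.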

In [61]:
# Reshape the features for the model
# YOUR CODE HERE!
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))   
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))

print(X_test[:5])     

[[[0.30379747]
  [0.37974684]
  [0.27848101]
  [0.40506329]
  [0.40506329]
  [0.34177215]
  [0.34177215]
  [0.3164557 ]
  [0.27848101]
  [0.59493671]]

 [[0.37974684]
  [0.27848101]
  [0.40506329]
  [0.40506329]
  [0.34177215]
  [0.34177215]
  [0.3164557 ]
  [0.27848101]
  [0.59493671]
  [0.62025316]]

 [[0.27848101]
  [0.40506329]
  [0.40506329]
  [0.34177215]
  [0.34177215]
  [0.3164557 ]
  [0.27848101]
  [0.59493671]
  [0.62025316]
  [0.5443038 ]]

 [[0.40506329]
  [0.40506329]
  [0.34177215]
  [0.34177215]
  [0.3164557 ]
  [0.27848101]
  [0.59493671]
  [0.62025316]
  [0.5443038 ]
  [0.5443038 ]]

 [[0.40506329]
  [0.34177215]
  [0.34177215]
  [0.3164557 ]
  [0.27848101]
  [0.59493671]
  [0.62025316]
  [0.5443038 ]
  [0.5443038 ]
  [0.56962025]]]
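
The reshape step only adds a trailing axis so that each sample has the layout (time steps, features). A minimal sketch on a made-up array, assuming one feature per time step as in this notebook:

```python
import numpy as np

# Toy feature matrix (made up): 3 samples, each a window of 4 time steps
X_2d = np.arange(12, dtype=float).reshape(3, 4)

# Add a trailing axis of size 1: one feature per time step
X_3d = X_2d.reshape((X_2d.shape[0], X_2d.shape[1], 1))

print(X_2d.shape)  # (3, 4)
print(X_3d.shape)  # (3, 4, 1)
```

The values are unchanged; `X_3d[i, t, 0]` equals `X_2d[i, t]` for every sample `i` and time step `t`.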


---

## Build and Train the LSTM RNN

In this section, you will design a custom LSTM RNN and fit (train) it using the training data.

You will need to:
1. Define the model architecture
2. Compile the model
3. Fit the model to the training data

### Hints:
You will want to use the same model architecture and random seed for both notebooks. This is necessary to accurately compare the performance of the FNG model vs the closing price model. 

In [62]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

In [63]:
# Build the LSTM model. 
# Set return_sequences=True on every LSTM layer that feeds another LSTM layer;
# the final LSTM layer does not need it.
# Note: The dropouts help prevent overfitting
# Note: The input shape is the number of time steps and the number of indicators
# Note: Batching inputs has a different input shape of Samples/TimeSteps/Features

# YOUR CODE HERE!

model = Sequential()                             # build the RNN model


number_units = 30                                # number of units in each LSTM layer
dropout_fraction = 0.2                           # dropout rate of 20%: during training, each unit's output has a 20% chance of being zeroed at each step.
                                                 # Dropout is only active during training; at evaluation and prediction time all units are used.

# layer 1

model.add(LSTM(                                  # Define layer as LSTM
    units = number_units,
    return_sequences =  True,                    # return sequences to link this LSTM layer to the next
    input_shape = (X_train.shape[1], 1)          # X_train.shape[1] is the number of time steps (10 days of FNG values); the 1 is the number of features per time step.
))

model.add(Dropout(dropout_fraction))             # first dropout layer: during training, randomly zeroes a fraction (dropout_fraction, 20%) of the outputs passed to the next LSTM layer.
                                                 # Dropout layers have no weights, so they show 0 params in the model summary.

# layer 2

model.add(LSTM(                                  # We have our second LSTM layer, but this time all we need is the units and return_sequences param
     units = number_units,
     return_sequences = True                      # another LSTM layer follows, so return_sequences=True is needed here
))

model.add(Dropout(dropout_fraction))             # 2nd dropout layer

# layer 3

model.add(LSTM(units=number_units))               # only need units param in 3rd layer, no other LSTM layers to connect.

model.add(Dropout(dropout_fraction))              # final dropout layer before final output layer

# output layer 
model.add(Dense(1))                               # Dense output layer that should return only 1 value. 

In [64]:
# Compile the model
# YOUR CODE HERE!
model.compile(optimizer='adam', loss='mean_squared_error')

In [65]:
# Summarize the model
# YOUR CODE HERE!
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_3 (LSTM)                (None, 10, 30)            3840      
_________________________________________________________________
dropout_3 (Dropout)          (None, 10, 30)            0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 10, 30)            7320      
_________________________________________________________________
dropout_4 (Dropout)          (None, 10, 30)            0         
_________________________________________________________________
lstm_5 (LSTM)                (None, 30)                7320      
_________________________________________________________________
dropout_5 (Dropout)          (None, 30)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                

**CODER'S NOTE:**

The parameter counts in this summary are determined by the architecture, not by the data.

* Total params is 18,511.
* The first LSTM layer has 3,840 params.
* Each Dropout layer shows 0 params because dropout has no trainable weights. It only zeroes activations at random during training.
* The second and third LSTM layers have 7,320 params each. They are larger than the first because their input is the 30-unit output of the previous LSTM layer rather than a single feature.
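
The LSTM parameter counts in the summary can be reproduced from the layer sizes alone. A minimal sketch, assuming the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias vector; `lstm_param_count` and `dense_param_count` are hypothetical helpers written for this illustration, not Keras functions.

```python
def lstm_param_count(input_dim, units):
    # 4 gates, each with (input_dim + units) weights per unit plus one bias per unit
    return 4 * units * (input_dim + units + 1)

def dense_param_count(input_dim, units):
    # one weight per input plus one bias, for each output unit
    return units * (input_dim + 1)

first = lstm_param_count(input_dim=1, units=30)    # 1 feature per time step
second = lstm_param_count(input_dim=30, units=30)  # fed by the previous 30-unit LSTM
third = lstm_param_count(input_dim=30, units=30)
output = dense_param_count(input_dim=30, units=1)
total = first + second + third + output

print(first, second, third, output, total)  # 3840 7320 7320 31 18511
```

The Dropout layers contribute nothing to the total because they have no weights to train.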

In [66]:
# Train the model
# Use at least 10 epochs
# Do not shuffle the data
# Experiment with the batch size, but a smaller batch size is recommended
# YOUR CODE HERE!
model.fit(X_train, y_train, epochs=10, shuffle=False, batch_size=1, verbose=1)

# CODER'S NOTE: the training loss does not improve here. It starts at 0.0322 but rises to 0.0434.

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1badf52c908>

---

## Model Performance

In this section, you will evaluate the model using the test data. 

You will need to:
1. Evaluate the model using the `X_test` and `y_test` data.
2. Use the X_test data to make predictions
3. Create a DataFrame of Real (y_test) vs predicted values. 
4. Plot the Real vs predicted values as a line chart

### Hints
Remember to apply the `inverse_transform` function to the predicted and y_test values to recover the actual closing prices.

In [67]:
# Evaluate the model
# YOUR CODE HERE!
model.evaluate(X_test, y_test)                       # not a good model: we want the test loss to be close to zero.



0.20438039302825928
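
The loss above is a mean squared error in scaled units, which is hard to read directly. Taking its square root gives an RMSE in the same 0 to 1 units as the scaled `y_test`, which can be read as a fraction of the test set's price range. A minimal sketch of that arithmetic; `rmse_in_price_units` is a hypothetical helper, and the actual minimum and maximum test prices are not shown in this notebook, so no dollar figure is computed here.

```python
import math

mse_scaled = 0.20438039302825928  # test loss reported by model.evaluate above
rmse_scaled = math.sqrt(mse_scaled)
print(round(rmse_scaled, 3))  # 0.452

def rmse_in_price_units(rmse_scaled, price_min, price_max):
    # MinMaxScaler maps [price_min, price_max] to [0, 1],
    # so a scaled error is multiplied by the range to get price units
    return rmse_scaled * (price_max - price_min)
```

An RMSE of about 0.452 means the typical error is roughly 45% of the distance between the lowest and highest closing price in the test set.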

In [68]:
# Make some predictions
# YOUR CODE HERE!
predicted = model.predict(X_test)

In [69]:
# Recover the original prices instead of the scaled version
predicted_prices = y_test_scaler.inverse_transform(predicted)
real_prices = y_test_scaler.inverse_transform(y_test.reshape(-1, 1))

In [70]:
# Create a DataFrame of Real and Predicted values
stocks = pd.DataFrame({
    "Real": real_prices.ravel(),
    "Predicted": predicted_prices.ravel()
}, index = df.index[-len(real_prices): ]) 
stocks.tail()

Unnamed: 0,Real,Predicted
2019-07-25,9772.139648,4066.88501
2019-07-26,9882.429688,4230.646484
2019-07-27,9847.450195,4088.820312
2019-07-28,9478.320313,4054.185059
2019-07-29,9531.769531,3950.74585


In [71]:
# Plot the real vs predicted values as a line chart
# YOUR CODE HERE!
stocks.hvplot(
    xlabel='Monthly Progression',
    ylabel='BTC Price',
    title='Actual vs Predicted Bitcoin Prices Fear and Greed'
)

**MODEL ANALYSIS:**

This is a poor model. In this experiment, sentiment does not predict prices.

The idea behind this model is to use the **sentiment** around Bitcoin to predict its closing price, with the FNG index as the sentiment score. Most BTC buyers get their sentiment and impressions of the Bitcoin asset from forum sites like reddit, 4chan, or Elon Musk's Twitter account, and I do not think the FNG reflects those sources.

Bitcoin had a serious crash in 2018, so the sentiment around Bitcoin was likely mostly negative (among serious fundamental investors). Negative sentiment will likely cause the model to predict low BTC prices.

Back to the model.

The FNG has very poor predictive ability in our model. Real prices outpaced predicted prices from April through the rest of our window. The number of predictions per month (see the cell below) simply reflects how many days of each month fall in the test window: 9 in February, where the test set begins, then one per day (29 to 31) in the following months. The model summary also needs a careful reading: the Dropout layers show 0 params because dropout has no trainable weights, and the LSTM param counts are fixed by the architecture, so the summary does not show that the model lacks parameters. We could add more layers, but the more likely limitation is the input. We need more data.

In [72]:
# Displaying number of predictions being made by the model.
stocks.groupby(stocks.index.month).count()

Unnamed: 0,Real,Predicted
2,9,9
3,31,31
4,30,30
5,31,31
6,30,30
7,29,29
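
The monthly counts above add up to the size of the test set, which is a quick check that every test sample received exactly one prediction. A minimal sketch of that arithmetic, using the sample count and the 70% split from earlier in the notebook:

```python
n_samples = 532                    # rows in X, from X.shape
split = int(0.7 * n_samples)       # 372 training samples
test_size = n_samples - split      # 160 test samples

# Counts per month, copied from the groupby output above
monthly_counts = {2: 9, 3: 31, 4: 30, 5: 31, 6: 30, 7: 29}
total_predictions = sum(monthly_counts.values())

print(test_size, total_predictions)  # 160 160
```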


In [73]:
# Model summary. 
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_3 (LSTM)                (None, 10, 30)            3840      
_________________________________________________________________
dropout_3 (Dropout)          (None, 10, 30)            0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 10, 30)            7320      
_________________________________________________________________
dropout_4 (Dropout)          (None, 10, 30)            0         
_________________________________________________________________
lstm_5 (LSTM)                (None, 30)                7320      
_________________________________________________________________
dropout_5 (Dropout)          (None, 30)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                