# RNN, LSTM and GRU: IBM Stock Price Prediction

## I. Recurrent Neural Networks
In a recurrent neural network we store the output activations from one or more of the layers of the network. Often these are hidden layer activations. Then, the next time we feed an input example to the network, we include the previously stored outputs as additional inputs. You can think of the additional inputs as being concatenated to the end of the “normal” inputs to that layer. For example, if a hidden layer had 10 regular input nodes and 128 hidden nodes, then it would actually have 138 total inputs (assuming you are feeding the layer’s outputs back into itself, à la Elman, rather than into another layer). Of course, the very first time you compute the output of the network, you need to fill in those extra 128 inputs with zeros or some other initial value.

Source: [Quora](https://www.quora.com/What-is-a-simple-explanation-of-a-recurrent-neural-network)
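The concatenation described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the model trained later in this notebook: the weights are random, and the names `W`, `b` and `elman_step` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_hidden = 10, 128

# One weight matrix acts on the concatenation of the 10 regular inputs
# and the 128 stored hidden activations: 138 inputs in total.
W = rng.normal(scale=0.1, size=(n_hidden, n_inputs + n_hidden))
b = np.zeros(n_hidden)

def elman_step(x_t, h_prev):
    """One recurrent step: concatenate the input with the previous hidden state."""
    combined = np.concatenate([x_t, h_prev])  # shape (138,)
    return np.tanh(W @ combined + b)

# The very first step has no stored activations, so start from zeros.
h = np.zeros(n_hidden)
sequence = rng.normal(size=(5, n_inputs))  # 5 time steps of made-up data
for x_t in sequence:
    h = elman_step(x_t, h)

print(W.shape, h.shape)
```

The same weight matrix is reused at every time step, which is what makes the layer recurrent.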

<div style="text-align:center;">
<img src="https://cdn-images-1.medium.com/max/1600/1*NKhwsOYNUT5xU7Pyf6Znhg.png"  width="50%" height="50%">
</div>

Source: [Medium](https://medium.com/ai-journal/lstm-gru-recurrent-neural-networks-81fe2bcdf1f9)

The best explanation of recurrent neural networks that I have found on the internet is this video: https://www.youtube.com/watch?v=UNmqTiOnRfg&t=3s

Even though RNNs are quite powerful, they suffer from the **vanishing gradient problem**, which prevents them from using long-term information. They can retain information from roughly the last 3-4 time steps, but results degrade over longer spans, so we rarely use plain RNNs. Instead, we use a better variant: **Long Short-Term Memory (LSTM)** networks.

### What is the vanishing gradient problem?
The vanishing gradient problem is a difficulty found in training artificial neural networks with gradient-based learning methods and backpropagation. In such methods, each of the neural network's weights receives an update proportional to the partial derivative of the error function with respect to the current weight in each iteration of training. The problem is that in some cases, the gradient will be vanishingly small, effectively preventing the weight from changing its value. In the worst case, this may completely stop the neural network from further training. As one example of the cause, traditional activation functions such as the hyperbolic tangent function have gradients in the range (0, 1], and backpropagation computes gradients by the chain rule. This has the effect of multiplying n of these small numbers to compute gradients of the "front" layers in an n-layer network, meaning that the gradient (error signal) decreases exponentially with n while the front layers train very slowly.

Source: [Wikipedia](https://en.wikipedia.org/wiki/Vanishing_gradient_problem)
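To make the exponential decay concrete, the sketch below multiplies the tanh derivative by itself once per layer. It is a deliberate simplification: it assumes every layer sees the same pre-activation value and ignores the weight matrices, and the helper names are hypothetical.

```python
import math

def tanh_derivative(z):
    """Derivative of tanh at z; it lies in (0, 1] and equals 1 only at z = 0."""
    return 1.0 - math.tanh(z) ** 2

def chained_gradient(z, n_layers):
    """Product of n_layers identical tanh derivatives (weights ignored)."""
    return tanh_derivative(z) ** n_layers

# The factor shrinks rapidly as the number of layers (or time steps) grows.
for n in (1, 5, 10, 50):
    print(f"{n:>2} layers: gradient factor = {chained_gradient(1.0, n):.3e}")
```

In an RNN the same effect appears across time steps rather than layers, which is why distant inputs have little influence on the weight updates.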

<div style="text-align:center;">
<img src="https://cdn-images-1.medium.com/max/1460/1*FWy4STsp8k0M5Yd8LifG_Q.png"  width="50%" height="50%">
</div>

Source: [Medium](https://medium.com/@anishsingh20/the-vanishing-gradient-problem-48ae7f501257)

## II. Long Short-Term Memory (LSTM)


A Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) architecture used in deep learning, primarily for tasks involving sequences of data, such as natural language processing, time series analysis, and more. 

**LSTM (Long Short-Term Memory) Architecture**

- *Sequential Data Handling*: LSTMs are designed to handle sequences of data, like words in a sentence or time-series data points. They're great at capturing patterns and dependencies in these sequences.

- *Memory Cells*: At the core of an LSTM are memory cells. These cells have the ability to store information for long periods and decide when to forget or remember information. This capability helps LSTMs avoid the vanishing gradient problem, a common issue with traditional RNNs.

- *Gates*: LSTMs use three types of gates to control the flow of information into and out of the memory cells:
   - **Forget Gate**: Decides what information from the previous cell state should be thrown away or kept.
   - **Input Gate**: Determines what new information should be added to the cell state.
   - **Output Gate**: Decides which parts of the current cell state are exposed as the hidden state (the cell's output), based on the current input and the previous hidden state.

- *Hidden State*: LSTMs also have a hidden state, which is the cell's output at each time step. It is passed to the next time step alongside the cell state and acts as short-term memory, while the cell state carries longer-term information.

- *Training*: During training, LSTMs learn the parameters of their gates and cell states through backpropagation, adjusting their weights to make accurate predictions based on the input sequence.

In essence, an LSTM is like a smart unit that can selectively remember or forget information as it processes a sequence. This ability to capture long-term dependencies in data makes LSTMs powerful tools for tasks like predicting future values in time series data or understanding the context of words in a sentence when processing natural language.

<div style="text-align:center;">
<img src="https://www.mdpi.com/water/water-12-00109/article_deploy/html/images/water-12-00109-g001.png"  width="50%" height="50%">
</div>


### Components of LSTMs
An LSTM cell contains the following components:
* Forget gate “f” (a neural network layer with sigmoid activation)
* Candidate layer “C̃” (a neural network layer with tanh activation)
* Input gate “I” (a neural network layer with sigmoid activation)
* Output gate “O” (a neural network layer with sigmoid activation)
* Hidden state “H” (a vector)
* Memory state “C” (a vector)

* Inputs to the LSTM cell at any step are X<sub>t</sub> (current input), H<sub>t-1</sub> (previous hidden state) and C<sub>t-1</sub> (previous memory state).
* Outputs from the LSTM cell are H<sub>t</sub> (current hidden state) and C<sub>t</sub> (current memory state).
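The components listed above fit together as in the following NumPy sketch of a single LSTM step. It follows the commonly used textbook formulation and is not taken from the Keras implementation; the parameter dictionary, its key names and the random weights are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: returns the new hidden state H_t and memory state C_t."""
    z = np.concatenate([h_prev, x_t])                      # inputs shared by all gates
    f = sigmoid(params["W_f"] @ z + params["b_f"])         # forget gate
    i = sigmoid(params["W_i"] @ z + params["b_i"])         # input gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])   # candidate layer
    o = sigmoid(params["W_o"] @ z + params["b_o"])         # output gate
    c_t = f * c_prev + i * c_tilde                         # update the memory state
    h_t = o * np.tanh(c_t)                                 # expose part of it as output
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 1, 4  # hypothetical sizes: one feature, four hidden units
params = {}
for name in ("f", "i", "c", "o"):
    params[f"W_{name}"] = rng.normal(scale=0.1, size=(n_hid, n_hid + n_in))
    params[f"b_{name}"] = np.zeros(n_hid)

h, c = np.zeros(n_hid), np.zeros(n_hid)
for price in [0.1, 0.2, 0.3]:  # a toy scaled price sequence
    h, c = lstm_step(np.array([price]), h, c, params)

print(h.shape, c.shape)
```

Because the memory state is updated additively (`f * c_prev + i * c_tilde`), gradients can flow through it across many steps more easily than through a plain RNN's repeated nonlinearity.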

### And now we get to the code...
I will use LSTMs to predict the IBM stock price over the test period, 2016 and 2017.

In [36]:
# Importing the libraries
import numpy as np
import warnings
import matplotlib.pyplot as plt
import plotly.express as px
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, GRU
from tensorflow.keras.optimizers import SGD
import math
from sklearn.metrics import mean_squared_error

In [37]:
class TimeSeriesGenerator(BaseEstimator, TransformerMixin):
    def __init__(self, n_timesteps=60):
        self.n_timesteps = n_timesteps
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        # Initialize lists to hold sequences and targets
        X_transformed, y_transformed = [], []
        
        # Generate the sequences and corresponding targets
        for i in range(len(X) - self.n_timesteps):
            X_transformed.append(X[i:i + self.n_timesteps])  # Window of n_timesteps consecutive values
            y_transformed.append(X[i + self.n_timesteps])    # Target: the value right after the window
        
        # Convert lists to NumPy arrays
        X_transformed, y_transformed = np.array(X_transformed), np.array(y_transformed)
        
        # Reshape X_transformed to (samples, timesteps, features)
        X_transformed = X_transformed.reshape(X_transformed.shape[0], X_transformed.shape[1], 1)
        
        return X_transformed, y_transformed
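To see what the windowing produces, here is a standalone sketch of the same loop applied to a toy series. The function `make_windows` is a hypothetical stand-in for `TimeSeriesGenerator.transform`, written without scikit-learn so that it runs on its own.

```python
import numpy as np

def make_windows(series, n_timesteps):
    """Slide a window over `series`; each window predicts the value that follows it."""
    X, y = [], []
    for i in range(len(series) - n_timesteps):
        X.append(series[i:i + n_timesteps])
        y.append(series[i + n_timesteps])
    return np.array(X), np.array(y)

# Toy single-feature series 0, 1, ..., 9 shaped (n_samples, 1), like the scaled 'High' column.
toy = np.arange(10, dtype=float).reshape(-1, 1)
X_toy, y_toy = make_windows(toy, n_timesteps=3)

print(X_toy.shape, y_toy.shape)       # (7, 3, 1) (7, 1)
print(X_toy[0].ravel(), y_toy[0])     # [0. 1. 2.] [3.]
```

A series of length N yields N - n_timesteps samples, which is why the first 60 rows of each dataset have no corresponding target.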

In [38]:
# Helper functions for plotting and evaluation
def plot_predictions(test, predicted, dates):
    # Create a DataFrame to hold the test, predicted, and Date
    df = pd.DataFrame({
        'Date': dates,              # The dates corresponding to the test set
        'Real': test,               # Real IBM stock prices
        'Predicted': predicted      # Predicted IBM stock prices
    })
    
    # Melt the DataFrame for Plotly (long-form format for easy plotting)
    df_melted = df.melt(id_vars='Date', value_vars=['Real', 'Predicted'], 
                        var_name='Type', value_name='IBM Stock Price')
    
    # Create the plot
    fig = px.line(df_melted, x='Date', y='IBM Stock Price', color='Type',
                  labels={'Date': 'Time', 'IBM Stock Price': 'IBM Stock Price'},
                  title="IBM Stock Price: Real vs Predicted")
    
    # Show the plot
    fig.show()

def return_rmse(test,predicted):
    rmse = math.sqrt(mean_squared_error(test, predicted))
    print("The root mean squared error is {}.".format(rmse))
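As a quick check of what `return_rmse` computes, the sketch below works the formula RMSE = sqrt(mean((actual - predicted)^2)) by hand on three made-up prices. The function name `rmse` and the numbers are assumptions for illustration only.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

actual = [100.0, 102.0, 104.0]
predicted = [101.0, 101.0, 106.0]

# Errors are -1, 1 and -2, so the squared errors are 1, 1 and 4 and their mean is 2.
print(rmse(actual, predicted))  # sqrt(2), about 1.414
```

The result is expressed in the same units as the inputs, so an RMSE computed on unscaled prices can be read directly as a typical error in price units.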

### 1. Import the data IBM_2006-01-01_to_2018-01-01 

In [39]:
data = pd.read_csv('IBM_2006-01-01_to_2018-01-01.csv')
data['Date'] = pd.to_datetime(data['Date'])
data['Year'] = data['Date'].dt.year
#min_year = data['Year'].min()
#max_year = data['Year'].max()
data.head()

| | Date | Open | High | Low | Close | Volume | Name | Year |
|---|---|---|---|---|---|---|---|---|
| 0 | 2006-01-03 | 82.45 | 82.55 | 80.81 | 82.06 | 11715200 | IBM | 2006 |
| 1 | 2006-01-04 | 82.2 | 82.5 | 81.33 | 81.95 | 9840600 | IBM | 2006 |
| 2 | 2006-01-05 | 81.4 | 82.9 | 81.0 | 82.5 | 7213500 | IBM | 2006 |
| 3 | 2006-01-06 | 83.95 | 85.03 | 83.41 | 84.95 | 8197400 | IBM | 2006 |
| 4 | 2006-01-09 | 84.1 | 84.25 | 83.38 | 83.73 | 6858200 | IBM | 2006 |


### 2. Split time series
Train set: all rows before 2016

Test set: all rows from 2016 onward

In [40]:
split_year = 2016
# Copy the slices so that later column assignments do not act on views of `data`
train = data[data.Year < split_year].copy()
test = data[data.Year >= split_year].copy()

### 3. Plot IBM stock price (train and test on the same plot)
Use the 'High' column as the price.

In [41]:
warnings.filterwarnings('ignore')
# Label each subset so that Plotly can color the train and test segments differently
train.loc[:, 'Type'] = 'Train' 
test.loc[:, 'Type'] = 'Test'   

# Combine the train and test datasets
data = pd.concat([train, test])

# Plot the data using plotly express
fig = px.line(data, x='Date', y='High', color='Type', 
              labels={'Date': 'Time', 'High': 'IBM Stock Price'},
              title=f"IBM Stock Price Evolution between {data['Year'].min()} and {data['Year'].max()}")

# Show the plot
fig.show()

### 4. Scale the training set (range (0,1))

In [42]:
target_col = ['High']

In [43]:
num_pipeline = Pipeline(steps=[
    ('scaler', MinMaxScaler())  # Min-Max scaling
])

In [44]:
col_transformer = ColumnTransformer(transformers=[
    ('num_pipeline', num_pipeline, target_col)
], remainder='drop', n_jobs=-1)

In [45]:
pipeline = Pipeline(steps=[
    ('col_transformer', col_transformer),  # Scale the data
    ('time_series_generator', TimeSeriesGenerator(n_timesteps=60))  # Generate 60-timestep sequences
])

In [46]:
X_train, y_train = pipeline.fit_transform(train)
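The scaling step can be illustrated with plain NumPy. This sketch assumes the usual min-max formula, (x - min) / (max - min), with the minimum and maximum taken from the training data only; the prices and helper names are made up for the example.

```python
import numpy as np

train_prices = np.array([80.0, 100.0, 120.0])  # hypothetical training values
test_prices = np.array([110.0, 130.0])         # hypothetical test values

# Fit on the training data only, as pipeline.fit_transform(train) does.
lo, hi = train_prices.min(), train_prices.max()

def scale(x):
    """Map prices to the range defined by the training minimum and maximum."""
    return (x - lo) / (hi - lo)

def unscale(x):
    """Inverse of `scale`: map scaled values back to price units."""
    return x * (hi - lo) + lo

print(scale(train_prices))  # [0.  0.5 1. ]
print(scale(test_prices))   # [0.75 1.25]
print(unscale(scale(test_prices)))
```

Test values above the training maximum scale to more than 1. That is expected: the scaler must not be refitted on the test set, otherwise information from the test period leaks into preprocessing.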

### 5. Create a data structure with 60 timesteps and 1 output for training set and then reshape it

In [49]:
print("Shape of X_train:", X_train.shape)  # Should be (num_samples, 60, 1)
print("Shape of y_train:", y_train.shape)  # Should be (num_samples, 1)

Shape of X_train: (2457, 60, 1)
Shape of y_train: (2457, 1)


### 6. Create LSTM model and train it

In [50]:
lstm = Sequential([
    LSTM(50, activation='relu', input_shape=(X_train.shape[1], 1)),  # 50 LSTM units; returns only the last timestep's output (return_sequences=False by default). input_shape=(n_steps, n_features)
#    Dropout(0.2),  # Dropout to prevent overfitting
#    LSTM(50, return_sequences=False),  # Another LSTM layer
#    Dropout(0.2),
#    Dense(25),  # Dense layer with 25 neurons
    Dense(1)  # Output layer predicting the stock price
], name="LSTM_Model")

In [51]:
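# Note: 'accuracy' is a classification metric and is not meaningful for a regression target such as a price; judge training by the MSE loss instead.
lstm.compile(optimizer='adam', loss='mse', metrics=['accuracy'])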

In [52]:
lstm.summary()

In [53]:
%%time 
epochs = 100  # number of complete passes over the training set
# batch_size is left at the Keras default of 32
history = lstm.fit(
  X_train,
  y_train,
  epochs=epochs
)

Epoch 1/100
77/77 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 7.4475e-04 - loss: 0.1460
Epoch 2/100
77/77 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 6.1408e-04 - loss: 0.0016
Epoch 3/100
77/77 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.0014 - loss: 8.2234e-04
Epoch 4/100
77/77 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 1.8017e-04 - loss: 7.2770e-04
Epoch 5/100
77/77 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 6.3699e-04 - loss: 5.9207e-04
Epoch 6/100
77/77 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 4.1977e-04 - loss: 6.1291e-04
Epoch 7/100
77/77 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.0017 - loss: 6.0759e-04
Epoch 8/100
77/77 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 5.9671e-04 - loss: 4.9267e-04


An epoch is a single pass through the entire training dataset. The number of epochs is therefore the number of times the model sees the whole dataset.

Batch size is the number of samples used in one forward and backward pass through the network. It is a trade-off between speed and the quality of the final model: large batches make better use of the hardware and give smoother gradient estimates but can generalize less well, while small batches give noisier updates that often generalize better but take longer per epoch.

Iterations are the number of batches required to complete one epoch. Each iteration performs one update of the model weights, so the iteration count determines how many updates happen per epoch. More updates can improve the fit, but too much training can lead to overfitting; too few updates shorten training but can leave the model underfit.

The iteration count per epoch is the total number of samples in the training dataset divided by the batch size, rounded up so that the last, partially filled batch is also counted.
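The calculation can be checked against the progress bars in this notebook. The sketch below assumes a batch size of 32 for both training and prediction, and the function name `steps_per_epoch` is hypothetical.

```python
import math

def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to cover n_samples, counting a final partial batch."""
    return math.ceil(n_samples / batch_size)

# 2457 training windows -> matches the "77/77" shown in the training log.
print(steps_per_epoch(2457, 32))  # 77

# 443 test windows -> matches the "14/14" shown when predicting.
print(steps_per_epoch(443, 32))   # 14
```

With 100 epochs, training therefore performs 100 × 77 = 7700 weight updates in total.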

The return_sequences parameter:

1. return_sequences=True:
The layer will return the output for each timestep in the input sequence.
So, if your input has 60 timesteps, the layer will return an output for all 60 timesteps.
The output shape from this layer would be (batch_size, timesteps, units).

2. return_sequences=False (default behavior):
The layer will return only the output from the last timestep.
The output shape from this layer would be (batch_size, units).
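The two output shapes can be reproduced with a toy recurrent layer written in NumPy. This is a simplified tanh RNN used only to illustrate the shape convention; it is not the Keras implementation, and `simple_rnn` with its arguments is a hypothetical helper.

```python
import numpy as np

def simple_rnn(inputs, units, return_sequences=False, seed=0):
    """Toy tanh RNN over inputs of shape (batch_size, timesteps, features)."""
    rng = np.random.default_rng(seed)
    batch_size, timesteps, features = inputs.shape
    W_x = rng.normal(scale=0.1, size=(features, units))  # input weights
    W_h = rng.normal(scale=0.1, size=(units, units))     # recurrent weights
    h = np.zeros((batch_size, units))
    outputs = []
    for t in range(timesteps):
        h = np.tanh(inputs[:, t, :] @ W_x + h @ W_h)
        outputs.append(h)
    if return_sequences:
        return np.stack(outputs, axis=1)  # one output per timestep
    return h                              # only the last timestep

batch = np.ones((32, 60, 1))  # 32 windows of 60 timesteps with 1 feature
all_steps = simple_rnn(batch, units=50, return_sequences=True)
last_step = simple_rnn(batch, units=50)

print(all_steps.shape)  # (32, 60, 50)
print(last_step.shape)  # (32, 50)
```

When recurrent layers are stacked, every layer except the last needs the per-timestep output, because the next recurrent layer expects a 3D input.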

### 7. Predict using test set after preparing it as the train set

In [54]:
# Apply the pipeline to the test set
X_test, y_test = pipeline.transform(test)

In [55]:
print(X_test.shape, y_test.shape)

(443, 60, 1) (443, 1)


In [67]:
# Make predictions
predicted_stock_price_lstm = lstm.predict(X_test)

# The inputs and targets were scaled, so inverse-transform the predictions back to the original price scale
# using the MinMaxScaler that was fitted on the training set
predicted_stock_price_lstm = pipeline.named_steps['col_transformer'].transformers_[0][1].named_steps['scaler'].inverse_transform(predicted_stock_price_lstm)

14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step


In [68]:
y_test_original = pipeline.named_steps['col_transformer'].transformers_[0][1].named_steps['scaler'].inverse_transform(y_test)
#for i in range(len(y_test_original)):
#    print(y_test_original.ravel()[i] - array[i])

# Question for the instructor: to compare the results and draw the plot, is it better to compare predicted_stock_price with y_test_original (i.e. y_test after inverse scaling)
# or directly with test[-X_test.shape[0]:]['High'], which holds the actual values?

### 8. Visualize the results

In [69]:
plot_predictions(y_test_original.ravel(), predicted_stock_price_lstm.ravel(), test[-X_test.shape[0]:]['Date'])

### 9. Evaluate the model using RMSE

In [70]:
# Calculate RMSE
return_rmse(y_test_original, predicted_stock_price_lstm)

The root mean squared error is 1.6118736524765744.


LSTM is not the only kind of unit that has taken the world of deep learning by storm. There are also **Gated Recurrent Units (GRU)**. It is not settled which is better, GRU or LSTM, because they often perform comparably. GRUs have fewer parameters, which can make them somewhat easier to train.

## III. Gated Recurrent Units
In simple words, a GRU does not use a separate memory cell to control the flow of information the way an LSTM does. It exposes its full hidden state directly, with no output gate. GRUs have fewer parameters and thus may train a bit faster or need less data to generalize. With large datasets, however, the more expressive LSTM may give better results.

GRUs are similar to LSTMs, except that they have only two gates: a reset gate and an update gate. The reset gate determines how to combine the new input with the previous memory, and the update gate determines how much of the previous state to keep. The update gate in a GRU plays the role of both the input gate and the forget gate in an LSTM. A GRU does not apply a second nonlinearity when computing the output, and it has no output gate.

Source: [Quora](https://www.quora.com/Whats-the-difference-between-LSTM-and-GRU-Why-are-GRU-efficient-to-train)
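The two gates can be seen at work in this NumPy sketch of a single GRU step. It uses one common textbook convention, in which the update gate `z` weights the new candidate (some references swap the roles of `z` and `1 - z`); the parameter names and random weights are assumptions for illustration, not the Keras implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step: returns the new hidden state, which is also the output."""
    gate_in = np.concatenate([h_prev, x_t])
    z = sigmoid(p["W_z"] @ gate_in + p["b_z"])      # update gate: how much to renew
    r = sigmoid(p["W_r"] @ gate_in + p["b_r"])      # reset gate: how much past to use
    cand_in = np.concatenate([r * h_prev, x_t])
    h_tilde = np.tanh(p["W_h"] @ cand_in + p["b_h"])  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde         # blend old state and candidate

rng = np.random.default_rng(0)
n_in, n_hid = 1, 4  # hypothetical sizes
p = {}
for name in ("z", "r", "h"):
    p[f"W_{name}"] = rng.normal(scale=0.1, size=(n_hid, n_hid + n_in))
    p[f"b_{name}"] = np.zeros(n_hid)

h = np.zeros(n_hid)
for price in [0.1, 0.2, 0.3]:  # a toy scaled price sequence
    h = gru_step(np.array([price]), h, p)

print(h.shape)
```

There is no separate memory state and no output gate: the single vector `h` plays both roles, which is where the parameter savings come from.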

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/37/Gated_Recurrent_Unit%2C_base_type.svg/1920px-Gated_Recurrent_Unit%2C_base_type.svg.png" width="900" height="600">

### 10. Create GRU model and train it

In [72]:
gru = Sequential([
    GRU(50, activation='relu', input_shape=(X_train.shape[1], 1)),  # 50 GRU units; returns only the last timestep's output (return_sequences=False by default). input_shape=(n_steps, n_features)
#    Dropout(0.2),  # Dropout to prevent overfitting
#    GRU(50, return_sequences=False),  # Another GRU layer
#    Dropout(0.2),
#    Dense(25),  # Dense layer with 25 neurons
    Dense(1)  # Output layer predicting the stock price
], name="GRU_Model")

In [73]:
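# Note: 'accuracy' is a classification metric and is not meaningful for a regression target such as a price; judge training by the MSE loss instead.
gru.compile(optimizer='adam', loss='mse', metrics=['accuracy'])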

In [74]:
gru.summary()
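The claim that GRUs have fewer parameters than LSTMs can be worked out by hand for the layer sizes used here (50 units, 1 input feature). The sketch uses the textbook count of one weight block per gate or candidate layer; an implementation may add extra bias terms, so the number reported by `summary()` can differ slightly. The function names are hypothetical.

```python
def gate_block_params(units, features):
    """Input weights + recurrent weights + one bias vector for a single gate."""
    return units * features + units * units + units

def lstm_params(units, features):
    # Forget gate, input gate, candidate layer and output gate: 4 blocks.
    return 4 * gate_block_params(units, features)

def gru_params(units, features):
    # Update gate, reset gate and candidate layer: 3 blocks.
    return 3 * gate_block_params(units, features)

units, features = 50, 1
print("LSTM layer:", lstm_params(units, features))  # 10400
print("GRU layer: ", gru_params(units, features))   # 7800
```

Under this count the GRU layer has three quarters of the LSTM layer's parameters for the same number of units.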

In [75]:
%%time 
epochs = 100  # number of complete passes over the training set
# batch_size is left at the Keras default of 32
history = gru.fit(
  X_train,
  y_train,
  epochs=epochs
)

Epoch 1/100
77/77 ━━━━━━━━━━━━━━━━━━━━ 1s 8ms/step - accuracy: 4.2031e-04 - loss: 0.1500
Epoch 2/100
77/77 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - accuracy: 3.2802e-04 - loss: 6.5211e-04
Epoch 3/100
77/77 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - accuracy: 0.0013 - loss: 3.0264e-04
Epoch 4/100
77/77 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - accuracy: 0.0016 - loss: 2.8037e-04
Epoch 5/100
77/77 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - accuracy: 6.4308e-04 - loss: 2.6275e-04
Epoch 6/100
77/77 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - accuracy: 8.1996e-04 - loss: 2.3325e-04
Epoch 7/100
77/77 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - accuracy: 5.9440e-04 - loss: 2.3319e-04
Epoch 8/100
77/77 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - accuracy: 7.4726e-04 - loss: 2.3256e

### 11. Predict using Test set

In [76]:
# Make predictions using the GRU model on the test set
predicted_stock_price_gru = gru.predict(X_test)

# Inverse-transform the predicted prices back to the original scale
predicted_stock_price_gru = pipeline.named_steps['col_transformer'].transformers_[0][1].named_steps['scaler'].inverse_transform(predicted_stock_price_gru)

14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step


### 12. Visualize the results

In [77]:
plot_predictions(y_test_original.ravel(), predicted_stock_price_gru.ravel(), test[-X_test.shape[0]:]['Date'])

### 13. Evaluate the results

In [79]:
# Calculate RMSE
return_rmse(y_test_original, predicted_stock_price_gru)

The root mean squared error is 1.5708045636143608.


### 14. Conclude

In [81]:
return_rmse(y_test_original, predicted_stock_price_lstm) 
return_rmse(y_test_original, predicted_stock_price_gru)

The root mean squared error is 1.6118736524765744.
The root mean squared error is 1.5708045636143608.


Both models performed similarly, with the GRU slightly outperforming the LSTM in terms of RMSE (measured in the same units as the stock price):

* LSTM RMSE: 1.61
* GRU RMSE: 1.57

Given the small difference, the GRU model can be considered marginally better for this task, but the overall performance difference is minimal. Both are viable options for time series forecasting in this case.
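To put the gap between the two scores in proportion, the relative difference can be computed directly from the two RMSE values printed above. The variable names are chosen for this illustration.

```python
lstm_rmse = 1.6118736524765744  # value printed for the LSTM model
gru_rmse = 1.5708045636143608   # value printed for the GRU model

absolute_gap = lstm_rmse - gru_rmse
relative_gap = absolute_gap / lstm_rmse

print(f"Absolute difference: {absolute_gap:.4f}")
print(f"Relative difference: {relative_gap:.2%}")
```

The GRU's error is lower by roughly 0.04 price units, or about 2.5% of the LSTM's RMSE. Both figures come from a single training run of each model, so a gap this small should be read with caution.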