Stock Prices Prediction based on Deep Learning model (MLOps Zoomcamp Project)

Table of Contents

  1. Project Description
  2. Solution Architecture (Google Cloud Platform)
  3. System in Production
  4. User Guide
  5. Model Management (MLflow)
  6. Orchestration (Prefect 2.0)
  7. Monitoring (Evidently AI)
  8. Dashboard
  9. Makefile and Dockerfile
  10. Recommendations
  11. References
  12. Annexes
  13. Exploratory Data Analysis

Project Description

This project aims to develop and deploy a stock price prediction system using an LSTM model, which predicts stock prices with a 10-day lag based on historical OHLC and technical indicators such as MACD, RSI, SMA and EMA. By providing accurate predictions, the system empowers traders to make well-informed decisions, optimize investment strategies, and minimize risks while capitalizing on market fluctuations.
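Although the exact feature pipeline lives in the training code, these indicators are standard; a minimal sketch of how they could be derived with pandas (assuming a DataFrame with a Close column; not taken verbatim from this repository):

import pandas as pd

def add_indicators(df: pd.DataFrame) -> pd.DataFrame:
    # Simple moving average of the closing price
    df['SMA_15'] = df['Close'].rolling(window=15).mean()
    # Exponential moving averages (fast and slow)
    df['EMA_12'] = df['Close'].ewm(span=12, adjust=False).mean()
    df['EMA_26'] = df['Close'].ewm(span=26, adjust=False).mean()
    # MACD: difference between the fast and slow EMAs
    df['MACD'] = df['EMA_12'] - df['EMA_26']
    # RSI (14 periods): relative strength of average gains vs. average losses
    delta = df['Close'].diff()
    gain = delta.clip(lower=0).rolling(window=14).mean()
    loss = (-delta.clip(upper=0)).rolling(window=14).mean()
    df['RSI_14'] = 100 - 100 / (1 + gain / loss)
    return df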

For this project, a cloud-based approach was chosen. By utilizing Google Cloud Platform (GCP) and MLflow integration, the efficiency and scalability of the system are significantly enhanced. Deploying the LSTM model on GCP also enables seamless batch processing of large datasets, ensuring timely predictions. Additionally, MLflow's model tracking and versioning features facilitate efficient monitoring and maintenance of the model's performance over time. Furthermore, the Prefect framework orchestrates the workflow, automating data processing, scheduling, and error handling, resulting in a streamlined and reliable end-to-end solution.

In summary, this stock price prediction system addresses the specific needs of traders and investors by providing accurate predictions, timely insights, and efficient model management through cloud technology. With this solution, traders can stay ahead in the fast-paced financial markets, make informed decisions, and enhance their overall financial outcomes.

Introduction, Motivation, Objectives and Planning Schema are explained in the paper here

Solution Architecture

Important: Linux (Ubuntu 22.04 LTS) is the recommended development environment.

The LSTM model was deployed as a batch workload using the following GCP tools:

  1. Cloud Storage
  2. Cloud Scheduler/Trigger
  3. Cloud Functions
  4. BigQuery
  5. Google Looker Studio (previously known as Data Studio)

As we can see, the workflow solution implemented in Google Cloud Platform (GCP) leverages serverless batch processing using Cloud Storage, Cloud Functions, BigQuery, and Google Looker Studio to predict stock prices based on historical OHLC (Open, High, Low, Close) data.

The process begins when users upload a CSV file containing OHLC price data for their desired stock to Google Cloud Storage. As soon as the file is uploaded, the Cloud Function is triggered to execute the LSTM (Long Short-Term Memory) model, which is designed to forecast the stock's prices for the next 10 days (see code below).

# This function runs as a Google Cloud Function inside GCP

import pickle

import numpy as np
import pandas as pd
from google.cloud import bigquery, storage


def predict_stock_prices(gs_url: str, bucket_name: str) -> np.ndarray:
    # Load the pickled LSTM model from Google Cloud Storage
    gcs_client = storage.Client()
    bucket = gcs_client.get_bucket(bucket_name)
    blob = bucket.blob("lstm_model.pkl")

    # Read the uploaded CSV and keep the last 90 closing prices
    data = pd.read_csv(gs_url, sep=",")
    df = data[['Date', 'Close', 'Symbol']]
    symbol = df['Symbol'].iloc[0]
    df = df.set_index('Date')
    df = df.tail(90)

    with blob.open(mode="rb") as f:
        lstm_model = pickle.load(f)

    # Reshape the input data into windows of 10 timesteps x 1 feature
    num_time_steps = 10  # setting 10 time steps
    num_features = 1  # setting 1 feature (Close price)
    n = int(len(df) / num_time_steps)
    subset_df = df.tail(n)
    X = np.reshape(df['Close'].values, (len(subset_df), num_time_steps, num_features))
    # The model expects 9 features, so tile the Close price across them
    X = np.repeat(X, 9, axis=-1)

    # Perform predictions using the LSTM model and keep one window's forecast
    predictions = lstm_model.predict(X)
    predictions = predictions[1]

    # Undo the min-max scaling applied during training
    min_value = np.min(df['Close'])
    max_value = np.max(df['Close'])
    predictions = predictions * (max_value - min_value) + min_value

    # Build a daily date index for the forecast and append it to the history
    y_predicted = np.squeeze(predictions)
    start_date = df.index[-1]
    dates = pd.date_range(start=pd.to_datetime(start_date) + pd.DateOffset(days=1),
                          periods=len(y_predicted),
                          freq='D')
    new_data = pd.DataFrame({'Close': y_predicted, 'Symbol': symbol}, index=dates)
    data_result = pd.concat([df, new_data])
    data_result.index = pd.to_datetime(data_result.index)
    data_result.index = data_result.index.tz_localize(None)
    print(data_result)

    # Save the predictions to BigQuery
    bq_client = bigquery.Client()
    dataset_id = 'stock_output'
    table_id = 'predicted_prices'
    table_ref = bq_client.dataset(dataset_id).table(table_id)

    table_string = f"{table_ref.project}.{table_ref.dataset_id}.{table_ref.table_id}"

    data_result = data_result.reset_index(drop=False)

    data_result.to_gbq(destination_table=table_string, project_id='ambient-decoder-391319', if_exists='replace')
    return predictions

The results of the LSTM model are then stored in a dedicated BigQuery table named predicted_prices. The data in the table feeds into a Looker Studio Dashboard, where users can visualize and explore the forecasted stock prices in a user-friendly and intuitive manner.

The integration of Cloud Functions, BigQuery, and Looker Studio streamlines the entire process, making it a powerful and efficient tool for stock market analysis and decision-making.

[Figure: GCP solution architecture diagram]

System in Production

Important: Before following this guide, ensure you have been authorized to connect to the GCP project.

Requirements

To install the app, you need to have Anaconda, Docker, and Docker-Compose installed on your system. A detailed installation guide is available at this link

NOTICE: After installing Anaconda, the Google Cloud SDK must also be installed on your system. Learn more here

If you run into issues with the Anaconda environment, you may find these tips useful

To download the Linux 64-bit archive file, at the command line, run:

curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-440.0.0-linux-x86_64.tar.gz

Once the Google Cloud SDK has been installed, your account must be authenticated (previously authorized by an admin). To do so, just write this in the command line:

gcloud auth application-default login

Clone the Repo

git clone https://github.com/jblanco89/MLOps_zoomcamp_course.git

The bash file setup.sh needs execute permission; run the following commands:

cd ~/MLOps_zoomcamp_course
sudo chmod +x setup.sh

Finally, run the bash file:

sudo bash setup.sh

User Guide

1. Go to the main repo directory:

cd ~/MLOps_zoomcamp_course

2. Install dependencies

pip install -r requirements.txt

3. Run the main Python file with ticker and end-date arguments.

For example, suppose we want to predict Microsoft stock prices for the next 10 days from July 22nd, 2023. We just need to run this command:

python ./src/main.py MSFT 2023-07-22

This system uses the Yahoo Finance API to get stock price data. Check the list of supported tickers:

Yahoo Finance Ticker List

Some examples:

python src/main.py META 2023-07-25 #Meta
python src/main.py MMM 2023-07-25 #3M
python src/main.py ETH-USD 2023-07-25 #Ethereum
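Under the hood, the download from Yahoo Finance presumably looks something like this sketch using the yfinance package (the actual logic in src/main.py may differ):

import yfinance as yf

# Fetch one year of daily OHLC data up to the requested end date
data = yf.download('MSFT', start='2022-07-22', end='2023-07-22', interval='1d')
print(data[['Open', 'High', 'Low', 'Close']].tail())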

4. Check the results in Dashboard

After a couple of seconds, the data should have been processed by the Google Cloud Function. The dashboard is available here

Do not forget to update the data by clicking here:

[Figure: dashboard data refresh button]

Model Management (MLFlow)

MLFlow Run

On a Virtual Machine instance in Google Cloud Platform, you should use a command like this:

mlflow server -h 0.0.0.0 -p 5000 --backend-store-uri postgresql://postgres:password@sql_private_ip:5432/mlflow --default-artifact-root gs://storage_bucket_name

Experiment Tracking

Experiment tracking results after 152 simulation runs:

[Figure: MLflow experiment tracking runs]

MLflow was deployed in GCP (scenario 5 according to the documentation). You may check the MLflow experiment tracking and model registry by clicking this link.
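Each training run is logged against this remote tracking server; a minimal sketch of what a run might record (the server address is a placeholder, and parameter/metric values are illustrative):

import mlflow

mlflow.set_tracking_uri('http://<vm_external_ip>:5000')
mlflow.set_experiment('lstm-stock-prediction')

with mlflow.start_run():
    # Hyperparameters explored across the simulation runs
    mlflow.log_param('hidden_units', 200)
    mlflow.log_param('learning_rate', 0.001)
    mlflow.log_param('epochs', 50)
    # Evaluation metric compared across runs
    mlflow.log_metric('rmse', 0.042)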

Model Registry

Once the best model has been identified, we can register and tag it in the MLflow model registry:

[Figure: MLflow model registry]
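Registration can be done from the MLflow UI or programmatically; a sketch of the programmatic route (the model name and run ID are placeholders):

import mlflow
from mlflow.tracking import MlflowClient

# Register the model artifact from the best run under a versioned name
result = mlflow.register_model('runs:/<run_id>/model', 'lstm_stock_model')

# Tag the new version, e.g. to mark it as the production candidate
client = MlflowClient()
client.set_model_version_tag('lstm_stock_model', result.version, 'stage', 'production')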

Orchestration (Prefect 2.0)

Prefect Cloud

Orchestration is managed in Prefect Cloud. To authenticate, run this command on your local or virtual machine:

prefect cloud login

Workflow Orchestration

After the best model was chosen with MLflow, we built the workflow using Prefect 2.0. Here is the result:

[Figure: Prefect workflow graph]
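In outline, the flow is a sequence of Prefect tasks; a condensed sketch (task and flow names are illustrative, not the exact ones in this repo):

from prefect import flow, task

@task(retries=2)
def fetch_data(ticker: str, end_date: str):
    ...  # download OHLC data from Yahoo Finance

@task
def predict(data):
    ...  # run the LSTM model and generate the 10-day forecast

@task
def publish(predictions):
    ...  # upload results to Cloud Storage / BigQuery

@flow(name='lstm-stock-pipeline')
def main_flow(ticker: str, end_date: str):
    data = fetch_data(ticker, end_date)
    predictions = predict(data)
    publish(predictions)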

Workflow Deployment

Start the agent with the default work queue:

prefect agent start --pool agent --work-queue default

[Figure: Prefect agent console output]

If the deployed workflow has a scheduled run, the agent detects it and runs the workflow automatically. In this system, orchestration is scheduled once a day, as shown below.
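The daily schedule can be attached when building the deployment, for example (assuming the flow lives in src/main.py; the cron expression and names are illustrative):

prefect deployment build src/main.py:main_flow -n daily-run -q default --cron "0 9 * * *"
prefect deployment apply main_flow-deployment.yaml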

Monitoring (Evidently AI)

The workflow generates a data drift JSON report (DataDriftPreset) using the Evidently library:

import numpy as np
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset


def generate_report(Y_pred, Y_test):
    report = Report(metrics=[
        DataDriftPreset()
    ])

    # Flatten the model output so reference and current data share one column
    Y_pred_flattened = np.ravel(Y_pred)

    current = pd.DataFrame({'Close': Y_pred_flattened})
    df_reference = pd.DataFrame({'Close': Y_test})
    df_reset = df_reference.reset_index(drop=True)
    current_reset = current.reset_index(drop=True)

    # Compare the prediction distribution against the test-set reference
    report.run(reference_data=df_reset, current_data=current_reset)
    return report.save_json("./reports/dataReport.json")

As a result, we can check the drift score every day in the dashboard.

[Figure: drift score monitor in the dashboard]
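The saved JSON report can also be inspected programmatically; a sketch, assuming the layout produced by Evidently's DataDriftPreset (key names may vary across Evidently versions):

import json

with open('./reports/dataReport.json') as f:
    report = json.load(f)

# The first metric in the preset summarizes dataset-level drift
summary = report['metrics'][0]['result']
print('dataset drift detected:', summary.get('dataset_drift'))
print('share of drifted columns:', summary.get('drift_share'))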

Dashboard

Stock Prices Prediction

Once stock prices have been uploaded to the BigQuery table, technical indicators are calculated using SQL:

WITH price_data AS (
  SELECT
    Symbol,
    index,
    Close,
    Close - LAG(Close) OVER (ORDER BY index) AS price_diff
  FROM `ambient-decoder-391319.stock_output.predicted_prices`
),
rsi_data AS (
  SELECT
  Symbol,
    index,
    Close,
    CASE WHEN price_diff > 0 THEN price_diff ELSE 0 END AS gain,
    CASE WHEN price_diff < 0 THEN ABS(price_diff) ELSE 0 END AS loss
  FROM price_data
),
histogram_data AS (
  SELECT
    Symbol,
    index,
    Close,
    gain,
    loss,
    NTILE(5) OVER (ORDER BY Close) AS bucket_number
  FROM rsi_data
)
SELECT
  Symbol,
  index,
  Close,
  CASE
    WHEN avg_gain IS NULL OR avg_loss IS NULL THEN NULL
    ELSE 100 - (100 / (1 + (NULLIF(avg_gain, 0) / NULLIF(avg_loss, 0))))
  END AS RSI_14_periods,
  -- MACD (difference of the 12- and 26-period moving averages)
  ema_12 - ema_26 AS MACD_5_periods,
  -- SMA (5 periods)
  AVG(Close) OVER (ORDER BY index ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS SMA_5_periods,
  -- SMA (15 periods)
  AVG(Close) OVER (ORDER BY index ROWS BETWEEN 14 PRECEDING AND CURRENT ROW) AS SMA_15_periods,
  bucket_number,
  COUNT(*) OVER (PARTITION BY bucket_number) AS bucket_count
FROM (
  SELECT
  Symbol,
    index,
    Close,
    AVG(Close) OVER (ORDER BY index ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS ema_12,
    AVG(Close) OVER (ORDER BY index ROWS BETWEEN 25 PRECEDING AND CURRENT ROW) AS ema_26,
    AVG(gain) OVER (ORDER BY index ROWS BETWEEN 13 PRECEDING AND CURRENT ROW) AS avg_gain,
    AVG(loss) OVER (ORDER BY index ROWS BETWEEN 13 PRECEDING AND CURRENT ROW) AS avg_loss,
    bucket_number
  FROM histogram_data
)
ORDER BY index DESC;

Using BigQuery, we get a data table like this:

[Figure: BigQuery results table of predicted prices]
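The same table can also be queried from Python with the BigQuery client, for instance to sanity-check results outside Looker Studio (a sketch; requires the google-cloud-bigquery package):

from google.cloud import bigquery

client = bigquery.Client(project='ambient-decoder-391319')

# Pull the latest predicted prices for a quick inspection
query = """
SELECT Symbol, index, Close
FROM `ambient-decoder-391319.stock_output.predicted_prices`
ORDER BY index DESC
LIMIT 20
"""
df = client.query(query).to_dataframe()
print(df)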

As a result, we'll be able to see the following dashboard: Dashboard

Makefile and Dockerfile

Dockerfile

# This Dockerfile sets up a Python 3.8 environment, 
# installs necessary dependencies (libpq-dev), and 
# installs MLflow along with its required packages. 
# It then copies the local files into the container's working directory. 
# Finally, it runs the MLflow server with specific configurations, 
# using PostgreSQL as the backend store and Google Cloud Storage for 
# artifact storage.


# syntax=docker/dockerfile:1

FROM python:3.8-slim-buster

RUN apt-get update && apt-get install -y libpq-dev

WORKDIR /MLOps_zoomcamp_course

RUN python -m ensurepip --default-pip && pip install --no-cache-dir --upgrade pip

RUN pip install psycopg2-binary

RUN pip install mlflow

COPY . .

CMD ["mlflow", "server", "-h", "0.0.0.0", "-p", "5000", "--backend-store-uri", "postgresql://postgres:1234@10.28.192.5:5432/mlflow", "--default-artifact-root", "gs://lstm_model_test"]

Makefile

# This Makefile defines rules for building and running a Docker container 
# for a LSTM application. It includes commands to create a data directory, 
# set its permissions, build a Docker image, and run the Docker container. 
# The .PHONY target ensures these rules are always executed, 
# regardless of existing files with the same names as the targets.


DOCKER_IMAGE_NAME = lstm_app_image
CURRENT_DIR := $(shell pwd)
DATA_DIR = data

# Create the directory if it does not exist
create_data_directory:
	@if [ ! -d "$(DATA_DIR)" ]; then \
                mkdir -p "$(DATA_DIR)"; \
	fi

set_data_directory_permissions:
	chmod -R 777 "$(DATA_DIR)"

build_docker_image:
	docker build -t $(DOCKER_IMAGE_NAME) .

run_docker_container:
	docker run -p 5000:5000 $(DOCKER_IMAGE_NAME)

.PHONY: create_data_directory set_data_directory_permissions build_docker_image run_docker_container
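A typical invocation chains the targets in order:

make create_data_directory set_data_directory_permissions build_docker_image run_docker_container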

Recommendations

Alternatively, the process could be made much more robust with the following workflow:

[Figure: alternative solution architecture]

References

  1. Alla, S., Adari, S.K. (2021). What Is MLOps?. In: Beginning MLOps with MLFlow. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-6549-9_3

  2. Bhandari, H. N., Rimal, B., Pokhrel, N. R., Rimal, R., Dahal, K. R., & Khatri, R. K. (2022). Predicting stock market index using LSTM. Machine Learning with Applications, 9, 100320. https://doi.org/10.1016/j.mlwa.2022.100320

  3. Moghar, A., & Hamiche, M. (2020). Stock market prediction using LSTM recurrent neural network. Procedia Computer Science, 170, 1168-1173. https://doi.org/10.1016/j.procs.2020.03.049

  4. Ghosh, A., Bose, S., Maji, G., Debnath, N., & Sen, S. (2019, September). Stock price prediction using LSTM on the Indian share market. In Proceedings of 32nd international conference on (Vol. 63, pp. 101-110). https://doi.org/10.29007/qgcz

  5. Machine Learning to Predict Stock Prices: Utilizing a Keras LSTM Model to Forecast Stock Trends (2019). Artificial Intelligence in Finance. https://towardsdatascience.com/predicting-stock-prices-using-a-keras-lstm-model-4225457f0233

  6. Stock Market Predictions with LSTM in Python (2020). Datacamp Tutorial. https://www.datacamp.com/tutorial/lstm-python-stock-market

  7. Run Prefect on Google Cloud Platform (2022). https://medium.com/@mariusz_kujawski/run-prefect-on-google-cloud-platform-7cc9f801d454

  8. Running a serverless batch workload on GCP with Cloud Scheduler, Cloud Functions, and Compute Engine. https://medium.com/google-cloud/running-a-serverless-batch-workload-on-gcp-with-cloud-scheduler-cloud-functions-and-compute-86c2bd573f25

Annexes

Neural Network Model

Our approach is mainly based on the LSTM architecture. The model is initialized as a sequential stack of layers. In this case, an LSTM layer with 200 units (or neurons) is added, ensuring that the output is returned for each timestep in the input sequence. A dropout layer with a dropout rate of 0.2 was added to prevent overfitting. Lastly, a dense layer with a single neuron and a ReLU activation function is added.

[Figure: LSTM model summary]
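A sketch of this architecture in Keras (assuming the 10-step, 9-feature input windows used by the prediction function; compile settings are illustrative):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

model = Sequential([
    # 200 LSTM units, returning the output for every one of the 10 timesteps
    LSTM(200, return_sequences=True, input_shape=(10, 9)),
    # Dropout rate of 0.2 to reduce overfitting
    Dropout(0.2),
    # Single output neuron with ReLU activation
    Dense(1, activation='relu'),
])
model.compile(optimizer='adam', loss='mse')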

Experiment tracking metric results:

- RMSE vs Epochs
- RMSE vs Hidden Units
- RMSE vs Learning Rate
- Hidden Units vs Learning Rate

[Figures: metric comparison plots from the MLflow experiment tracking UI]

Cloud Function

Here is how the Cloud Function looks in GCP:

[Figure: Cloud Function in the GCP console]
