
AutoML Proof of Concept

This is a Docker Compose based sandbox for learning how to operationalize a Google Cloud hosted, tabular ML classification model. The example problem is credit card fraud detection. The model was trained on a dataset generated by https://github.com/namebrandon/Sparkov_Data_Generation (with minor modifications), and the "real" transactions used to test the operationalized system come from the same source.

The main components are as follows:

| File | Description |
| --- | --- |
| compose.yaml | Docker Compose file describing how to run all components locally |
| retrieve_gcp_creds.sh | Logs in to Google Cloud and retrieves a credentials file, application_default_credentials.json. Do not check this file in. |
| card-fraud.proto | Protobuf definition of an authorization request |
| event-sender | Contains the Python program eventsender.py, which sends events to Hazelcast via map.put() (see the sketch below this table) |
| config/hazelcast.yaml | The configuration used by the Hazelcast instance |
| scoring-pipeline | Java code for the prediction pipeline |
| submitjob.sh | Helper script that deploys the pipeline to Hazelcast via the Hazelcast CLI |
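
For orientation, sending one event through the Hazelcast Python client looks roughly like the following sketch. The map name, key format, and value here are illustrative assumptions; the actual logic lives in event-sender/eventsender.py.

import hazelcast

# Connect to the locally running cluster (the address is an assumption;
# check compose.yaml for the actual port mapping).
client = hazelcast.HazelcastClient(cluster_members=["localhost:5701"])

# The map name and key format are illustrative, not necessarily what
# eventsender.py uses.
transactions = client.get_map("transactions").blocking()

# In the real sender, the value would be one transaction record from the
# generated CSV files (or a serialized card-fraud.proto message).
transactions.put("txn-000001", "placeholder transaction record")

client.shutdown()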

Instructions

This assumes a tabular classification ML model has already been trained and deployed to Google Cloud.

Build the Java project.

cd scoring-pipeline
mvn clean package

Generate the card transaction data.

This is the data that will be sent to Hazelcast for fraud detection. Clone https://github.com/wrmay/Sparkov_Data_Generation, a fork of the original that uses commas instead of pipes to separate fields; this was a requirement of Google's AutoML.

Generate the data with something similar to the following:

python -m venv venv
. venv/bin/activate
pip install -r requirements.txt
python datagen.py -n 10000 -o data 01-01-2022 01-31-2022

Note: attempting to generate fewer than 8 days of data will fail; pick the start and end dates accordingly.

Use event-sender/csv_sort.py to sort each of the *nnnnn.csv files by the unix timestamp column and remove the header row (this script still needs documentation). A sketch of the idea appears below.
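
For reference, a minimal sketch of that sort step might look like the following. The directory glob and the timestamp column name ("unix_time") are assumptions; check csv_sort.py and your generated files for the real names.

import csv
import glob

# Order each generated CSV by its unix timestamp column and drop the header.
# The column name "unix_time" is an assumption about the Sparkov output.
for path in glob.glob("data/*.csv"):
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        ts_col = header.index("unix_time")
        rows = sorted(reader, key=lambda row: float(row[ts_col]))
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)  # header intentionally omitted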

Copy all of the generated data files into the data/transactions_for_generator directory of this project.

Start the Simulation

docker compose up -d

The Hazelcast Management Center should be available at http://localhost:8080

Preparation

Obtain Google Cloud credentials. By default, any authenticated user can access the model.

./retrieve_gcp_creds.sh

This should create a file, application_default_credentials.json, which you should not check in to GitHub.

Obtain the project, location, and endpoint ID of the tabular classification model endpoint you will access (for example: "hazelcast-33", "us-central1", "4731246912831750144") and edit submitjob.sh accordingly.
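
Before wiring the endpoint into the pipeline, you can sanity-check it with the Vertex AI Python SDK. This is only a sketch: the feature names in the instance dict are assumptions based on the Sparkov schema, so substitute whatever features your model was actually trained on.

from google.cloud import aiplatform

# Values from the step above; replace with your own project, location,
# and endpoint ID.
aiplatform.init(project="hazelcast-33", location="us-central1")
endpoint = aiplatform.Endpoint("4731246912831750144")

# Tabular models take one dict per instance, keyed by feature name.
# These feature names are illustrative assumptions.
response = endpoint.predict(instances=[{
    "category": "grocery_pos",
    "amt": "101.37",
    "merchant": "fraud_Rippin Kub and Mann",
}])
print(response.predictions)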

Submit the Pipeline

./submitjob.sh 

