This is a Docker Compose based sandbox for learning how to operationalize a tabular ML classification model hosted on Google Cloud. The example problem is credit card fraud detection. The model was trained on a dataset generated by https://github.com/namebrandon/Sparkov_Data_Generation (with minor modifications), and the "real" transactions used to test the operationalized system come from the same source.
The main components are as follows:
File | Description |
---|---|
compose.yaml | Docker Compose file describing how to run all components locally |
retrieve_gcp_creds.sh | Logs in to Google Cloud and retrieves a credentials file, application_default_credentials.json. Do not check this file in. |
card-fraud.proto | Protobuf definition of an authorization request |
event-sender | Contains the Python program eventsender.py, which sends events to Hazelcast via map.put() (see the sketch after this table) |
config/hazelcast.yaml | The configuration used by the Hazelcast instance |
scoring-pipeline | Java code for the prediction pipeline |
submitjob.sh | Helper script to deploy the pipeline to Hazelcast via the Hazelcast CLI |
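For context, the essence of what event-sender does is put authorization events into a Hazelcast map. The following is a minimal sketch using the hazelcast-python-client; the cluster address, map name, field names, and use of a JSON string value are assumptions for illustration only — the real eventsender.py serializes requests per card-fraud.proto.

```python
# Minimal sketch of what eventsender.py does conceptually: push authorization events
# into a Hazelcast map with map.put(). Assumptions (not taken from the repo): the
# cluster is reachable at localhost:5701, the map is named "transactions", and the
# value is a JSON string; the real program serializes events per card-fraud.proto.
import json

import hazelcast

client = hazelcast.HazelcastClient(cluster_members=["localhost:5701"])
transactions = client.get_map("transactions").blocking()

event = {
    "trans_num": "2da90c7d74bd46a0caf3777415b3ebd3",  # hypothetical transaction id
    "cc_num": "4613314721966",
    "amt": 123.45,
    "merchant": "fraud_Kirlin and Sons",
    "unix_time": 1640995200,
}

# Key the entry by transaction id so a re-sent transaction overwrites the old entry.
transactions.put(event["trans_num"], json.dumps(event))
client.shutdown()
```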
This assumes a tabular classification ML model has already been trained and deployed to Google Cloud.
Build the scoring pipeline:

```shell
cd scoring-pipeline
mvn clean package
```
Next, generate the data that will be sent to Hazelcast for fraud detection. Clone https://github.com/wrmay/Sparkov_Data_Generation, which is a fork of the original that uses commas instead of pipes to separate fields; this was a requirement of Google's AutoML.
Generate the data with something similar to the following:
```shell
python -m venv venv
. venv/bin/activate
pip install -r requirements.txt
python datagen.py -n 10000 -o data 01-01-2022 01-31-2022
```
Note: attempts to generate less than 8 days of data will fail; pick the start and end dates accordingly.
Use event-sender/csv_sort.py to sort all of the *nnnnn.csv files by the unix timestamp column and remove the header row (this script still needs documentation).
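Since csv_sort.py is not yet documented, here is a rough sketch of the equivalent operation. The column name "unix_time" and the data/ directory are assumptions; check csv_sort.py for its actual behavior and arguments.

```python
# Rough equivalent of the csv_sort.py step: for each generated CSV file, sort the rows
# by the unix timestamp column and write the file back without its header row.
# The column name "unix_time" and the data/ directory are assumptions for illustration.
import csv
import glob

for path in glob.glob("data/*.csv"):
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)          # drop the header row
        ts_col = header.index("unix_time")
        rows = sorted(reader, key=lambda row: int(row[ts_col]))

    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)  # header intentionally omitted
```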
Copy all of the generated data files into the data/transactions_for_generator directory of this project.
Start all of the components:

```shell
docker compose up -d
```
The Hazelcast Management Center should be available at http://localhost:8080.
Obtain Google Cloud credentials. By default, any authenticated user can access the model.
```shell
./retrieve_gcp_creds.sh
```

This should create a file, application_default_credentials.json, which you should not check in to GitHub.
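If you want to sanity-check the credentials file before continuing, you can load it through Application Default Credentials with the google-auth library. This optional sketch is not part of the sandbox and assumes google-auth is installed.

```python
# Optional sanity check: load the downloaded file through Application Default
# Credentials and confirm it parses. Assumes the google-auth package is installed
# (pip install google-auth); this is not part of the sandbox itself.
import os

import google.auth

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "application_default_credentials.json"
credentials, project = google.auth.default()
print("Credentials loaded; detected project:", project)  # may be None for user credentials
```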
Obtain the project, location, and endpoint ID of the tabular classification model endpoint you will access. For example: "hazelcast-33", "us-central1", "4731246912831750144". Edit submitjob.sh accordingly.
Submit the scoring pipeline job:

```shell
./submitjob.sh
```
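For context, the scoring pipeline's call to the deployed model is conceptually a Vertex AI endpoint prediction. In this sandbox the Java pipeline makes that call; the Python sketch below only illustrates the shape of the request, and the project, location, endpoint ID, and feature names are placeholders, not values taken from the repo.

```python
# Illustrative only: in this sandbox the Java scoring pipeline makes the call.
# This shows the shape of a Vertex AI tabular prediction request with the
# google-cloud-aiplatform client. The project, location, endpoint id, and feature
# names below are placeholders -- use the values you put into submitjob.sh.
from google.cloud import aiplatform

aiplatform.init(project="hazelcast-33", location="us-central1")
endpoint = aiplatform.Endpoint("4731246912831750144")

# AutoML tabular endpoints take a list of instances, each a dict of feature values.
instance = {
    "amt": "123.45",
    "category": "grocery_pos",
    "unix_time": "1640995200",
}
response = endpoint.predict(instances=[instance])
print(response.predictions)  # e.g. class names and scores for fraud / not-fraud
```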