Merge pull request #10 from streamnsight/refactor/speedupv3
Refactor: parallelize transaction generation
streamnsight committed Jun 10, 2022
2 parents 10b6b33 + 8cc7484 commit 763b9b7
Showing 15 changed files with 313 additions and 231 deletions.
33 changes: 0 additions & 33 deletions 150k_create_tx_1.sh

This file was deleted.

10 changes: 0 additions & 10 deletions 150k_create_tx_2.sh

This file was deleted.

21 changes: 21 additions & 0 deletions LICENSE.md
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2016-2022 Brandon Harris

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
113 changes: 76 additions & 37 deletions README.md
@@ -1,37 +1,76 @@
## Generate Fake Credit Card Transaction Data, Including Fraudulent Transactions

### General Usage
* Create customers data file (see generate_customers.bat for syntax)
* Create transactions, utilizing prior customer file (see various .sh/.bat for syntax)

This code is heavily modified, but based on original code by [Josh Plotkin](https://github.com/joshplotkin/data_generation). Change log of modifications to original code are below.

### Modifications:

#### v 0.4
* Only surface-level changes done in scripts so that simulation can be done using Python3
* Corrected bat files to generate transactions files.

#### v 0.3
* Completely re-worked profiles / segementation of customers
* introduced fraudulent transactions
* introduced fraudulent profiles
* modification of transaction amount generation via Gamma distribution
* added 150k_ shell scripts for multi-threaded data generation (one python process for each segment launched in the background)

#### v 0.2
* Added unix time stamp for transactions for easier programamtic evaluation.
* Individual profiles modified so that there is more variation in the data.
* Modified random generation of age/gender. Original code did not appear to work correctly?
* Added batch files for windows users

#### v 0.1
* Transaction times are now included instead of just dates
* Profile specific spending windows (AM/PM with weighting of transaction times)
* Merchant names (specific to spending categories) are now included (along with code for generation)
* Travel probability is added, with profile specific options
* Travel max distances is added, per profile option
* Merchant location is randomized based on home location and profile travel probabilities
* Simulated transaction numbers via faker MD5 hash (replacing sequential 0..n numbering)
* Includes credit card number via faker
* improved cross-platform file path compatibility
# Generate Fake Credit Card Transaction Data, Including Fraudulent Transactions

Note: v1.0 runs much faster than earlier versions, but transaction files are now chunked, so several files are generated per profile. If your downstream process expects one file per profile, please check out the v0.5 release branch, `release/v0.5`.

## General Usage

General usage has changed in this version. Run the datagen script as follows:

```bash
python datagen.py -n <NUMBER_OF_CUSTOMERS_TO_GENERATE> -o <OUTPUT_FOLDER> <START_DATE> <END_DATE>
```

To see the full list of options, use:

```bash
python datagen.py -h
```

You can pass additional options with the following flags:

- `-config <CONFIG_FILE>`: pass the name of the config file, defaults to `./profiles/main_config.json`
- `-seed <INT>`: pass a seed to the Faker class
- `-c <CUSTOMER_FILE>`: pass the path to an already generated customer file
- `-o <OUTPUT_FOLDER>`: folder to save files into
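
As a rough illustration of how the flags above fit together, here is a minimal `argparse` sketch. It is not the actual parser in `datagen.py`: the `dest` names (`nb_customers`, `customer_file`, `output`), the help strings, and the date format used in the example are assumptions. Only the flag names and the config default come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags documented in this README."""
    parser = argparse.ArgumentParser(prog="datagen.py")
    # Flag names come from the README; dest names are illustrative only.
    parser.add_argument("-n", dest="nb_customers", type=int,
                        help="number of customers to generate")
    parser.add_argument("-config", default="./profiles/main_config.json",
                        help="name of the config file")
    parser.add_argument("-seed", type=int, default=None,
                        help="seed passed to Faker")
    parser.add_argument("-c", dest="customer_file", default=None,
                        help="path to an already generated customer file")
    parser.add_argument("-o", dest="output", default=None,
                        help="folder to save files into")
    # Positional dates are kept as plain strings; their format is an assumption.
    parser.add_argument("start_date")
    parser.add_argument("end_date")
    return parser


# Parse an example command line instead of sys.argv.
args = build_parser().parse_args(
    ["-n", "1000", "-o", "data", "2020-01-01", "2020-12-31"]
)
print(args.nb_customers, args.output, args.start_date, args.end_date)
```

Run `python datagen.py -h` for the authoritative list of options and defaults.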

This version modifies v0.5 to parallelize the work with `multiprocessing`, taking advantage of all available CPUs for a large speed improvement.

Because of the way it parallelizes the work (chunking transaction generation by chunking the customer list), multiple transaction files are generated per profile. Also note that if the number of customers is small, there may be empty files (i.e. files where no customer in the chunk matched the profile). This is expected.
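
The chunking behavior described above can be sketched as follows. This is a simplified illustration, not the code in `datagen.py`: the function names, the customer dictionaries, and the `<profile>_<chunk>.csv` naming are all hypothetical, and the "generation" step only records customer ids. It shows why each profile ends up with one file per chunk, and why some of those files can be empty.

```python
from multiprocessing import Pool


def chunk_customers(customers, chunk_size):
    """Split the customer list into consecutive chunks of at most chunk_size."""
    return [customers[i:i + chunk_size]
            for i in range(0, len(customers), chunk_size)]


def generate_chunk(task):
    """Build one output per profile for a single chunk of customers.

    Placeholder for real transaction generation: it only records which
    customer ids would land in each (profile, chunk) output file.
    """
    chunk_id, customers, profiles = task
    # Every profile gets a file for this chunk, even if no customer matches.
    outputs = {f"{profile}_{chunk_id:04d}.csv": [] for profile in profiles}
    for customer in customers:
        outputs[f"{customer['profile']}_{chunk_id:04d}.csv"].append(customer["id"])
    return outputs


def run_parallel(customers, profiles, chunk_size, processes=None):
    """Process all chunks in a worker pool and merge the per-chunk outputs."""
    tasks = [(chunk_id, chunk, profiles)
             for chunk_id, chunk in enumerate(chunk_customers(customers, chunk_size))]
    merged = {}
    with Pool(processes) as pool:
        # Chunks are independent, so completion order does not matter.
        for outputs in pool.imap_unordered(generate_chunk, tasks):
            merged.update(outputs)
    return merged
```

When using a sketch like this, call `run_parallel(...)` from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. With three customers, two profiles and a chunk size of 2, a chunk whose customers all share one profile yields an empty file for the other profile.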

With the standard profiles, it was benchmarked at ~95 MB/thread/min. On a 64-core/128-thread AMD E3 machine, I was able to generate 1.4 TB of data (4.5B transactions) in just under 2 hours, as opposed to days with the previous versions.
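
As a sanity check, the per-thread rate and the total quoted above are consistent with each other. The short calculation below assumes decimal units (1 TB = 1,000,000 MB), a full 120 minutes of runtime, and all 128 threads busy for the whole run; these are simplifying assumptions, not measured values.

```python
MB_PER_THREAD_PER_MIN = 95      # benchmark figure quoted in this README
THREADS = 128                   # 64 cores / 128 threads
MINUTES = 120                   # "just under 2h", rounded up


def estimated_output_tb(mb_per_thread_min, threads, minutes):
    """Total output in decimal terabytes for a fully parallel run."""
    return mb_per_thread_min * threads * minutes / 1_000_000


total_tb = estimated_output_tb(MB_PER_THREAD_PER_MIN, THREADS, MINUTES)
# Average size of one transaction record, from the quoted 1.4 TB and 4.5B rows.
bytes_per_transaction = 1.4e12 / 4.5e9

print(f"{total_tb:.2f} TB upper estimate")               # 1.46 TB upper estimate
print(f"{bytes_per_transaction:.0f} bytes/transaction")  # 311 bytes/transaction
```

The estimate of about 1.46 TB sits slightly above the reported 1.4 TB, which is what one would expect for a run that finished in a little under two hours.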

The generation code is originally based on code by [Josh Plotkin](https://github.com/joshplotkin/data_generation). The change log of modifications to the original code is below.

## Change Log

### v1.0

- Parallelized version; generation is orders of magnitude faster, depending on the hardware used.

### v0.5

- 12x speed up thanks to some code refactoring.

### v0.4

- Only surface-level changes done in scripts so that simulation can be done using Python3
- Corrected bat files to generate transactions files.

### v0.3

- Completely re-worked profiles / segmentation of customers
- introduced fraudulent transactions
- introduced fraudulent profiles
- modification of transaction amount generation via Gamma distribution
- added 150k_ shell scripts for multi-threaded data generation (one python process for each segment launched in the background)

### v0.2

- Added Unix timestamp for transactions for easier programmatic evaluation.
- Individual profiles modified so that there is more variation in the data.
- Modified random generation of age/gender. Original code did not appear to work correctly?
- Added batch files for windows users

### v0.1

- Transaction times are now included instead of just dates
- Profile specific spending windows (AM/PM with weighting of transaction times)
- Merchant names (specific to spending categories) are now included (along with code for generation)
- Travel probability is added, with profile specific options
- Travel max distances is added, per profile option
- Merchant location is randomized based on home location and profile travel probabilities
- Simulated transaction numbers via faker MD5 hash (replacing sequential 0..n numbering)
- Includes credit card number via faker
- improved cross-platform file path compatibility
18 changes: 0 additions & 18 deletions create_all_transactions.sh

This file was deleted.

35 changes: 0 additions & 35 deletions create_tx_1.sh

This file was deleted.

9 changes: 0 additions & 9 deletions create_tx_2.sh

This file was deleted.
