

### About the Dataset

The full dataset includes:

 - 89,997,827 Customer documents
 - 89,997,827 IDcard documents
 - 179,998,862 Account documents
 - 179,998,862 `Correspondance<State>` documents
 - 179,998,862*60 Statement documents

Total = 89,997,827\*2 + 179,998,862\*62 = 11,339,925,098
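As a quick sanity check, the total can be recomputed from the per-type counts. This is only a sketch: the constant and function names are illustrative, and it assumes that the factor 60 means one statement per account per month over 60 months (6 live + 54 archived).

```python
# Document counts taken from the list above.
CUSTOMERS = 89_997_827
ID_CARDS = 89_997_827
ACCOUNTS = 179_998_862
CORRESPONDENCES = 179_998_862
# Assumption: 60 statements per account (6 live months + 54 archived months).
STATEMENTS_PER_ACCOUNT = 60


def total_documents() -> int:
    """Sum every document type, including the statements of every account."""
    statements = ACCOUNTS * STATEMENTS_PER_ACCOUNT
    return CUSTOMERS + ID_CARDS + ACCOUNTS + CORRESPONDENCES + statements


print(f"{total_documents():,}")  # prints 11,339,925,098
```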

The data will be split across 3 repositories:

 - us-east
 	- 50% of Customers/IDCards/Accounts...
 	- 6 months of statements
 - us-west
 	- 50% of Customers/IDCards/Accounts...
 	- 6 months of statements
 - archives
 	- 54 months of statements for all customers

This means:

 - us-east = 89,997,827 + 179,998,862 + 179,998,862/2*6 = 809,993,275
 - us-west = 89,997,827 + 179,998,862 + 179,998,862/2*6 = 809,993,275
 - archives = 179,998,862*54 = 9,719,938,548
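The split can be double-checked with a few lines of Python. This is a sketch under the assumptions stated above (each live repository holds exactly half of the hierarchy and half of the accounts' statements); the names are illustrative only.

```python
CUSTOMERS = 89_997_827   # the IDcard count is identical
ACCOUNTS = 179_998_862   # the Correspondance count is identical
LIVE_MONTHS = 6
ARCHIVE_MONTHS = 54


def live_repository_size() -> int:
    """Size of one live repository (us-east or us-west)."""
    # Half of (Customers + IDcards + Accounts + Correspondance).
    hierarchy = (2 * CUSTOMERS + 2 * ACCOUNTS) // 2
    # Half of the accounts, each with 6 months of statements.
    statements = ACCOUNTS // 2 * LIVE_MONTHS
    return hierarchy + statements


def archives_size() -> int:
    """54 months of statements for every account."""
    return ACCOUNTS * ARCHIVE_MONTHS


sizes = {
    "us-east": live_repository_size(),
    "us-west": live_repository_size(),
    "archives": archives_size(),
}
for name, size in sizes.items():
    print(f"{name:9s} {size:>14,}")
print(f"{'total':9s} {sum(sizes.values()):>14,}")
```

The three sizes add up to the full dataset total, which confirms that the split neither drops nor duplicates documents.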




### Step 0 : run DocumentProducers 

The corresponding notebook can be executed [here](./Step%200%20-%20DocumentProducers.ipynb)

#### Why make this a dedicated step?

There are several reasons to make DocumentProducers a dedicated step:

 - generating 9+B documents randomly takes time (see below)
 - it can use a lot of CPU at the injector level and I would like to avoid having to run more than one injector
 - once we have all DocumentMessages in Kafka, we can snapshot this and run DocumentConsumers separately

#### Generation time

Based on previous tests, the producer throughput is:

 - 80K docs/s for CSV based import
 - 40K docs/s for Random Generation 

So, the projected producer time is:

 - Customers / Accounts hierarchy
 	- 2*(89,997,827 + 179,998,862)/80000 = 6,750 s => about 1.9 h
 - Non-archived statements
 	- (179,998,862*6)/40000 = 27,000 s => about 7.5 h
 - Archived statements
 	- (179,998,862*54)/40000 = 242,998 s => about 67.5 h
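The projection above can be reproduced with a small script, which also makes it easy to re-run if the measured throughput changes. This is a sketch: the two rates are the ones quoted above, and the helper and dictionary names are illustrative.

```python
CUSTOMERS = 89_997_827
ACCOUNTS = 179_998_862
CSV_RATE = 80_000      # docs/s, CSV-based import
RANDOM_RATE = 40_000   # docs/s, random generation


def hours(documents: int, docs_per_second: int) -> float:
    """Projected producer time in hours for a given volume and throughput."""
    return documents / docs_per_second / 3600


projections = {
    # Customers + IDcards + Accounts + Correspondance, imported from CSV.
    "hierarchy": hours(2 * (CUSTOMERS + ACCOUNTS), CSV_RATE),
    # 6 months of statements per account, randomly generated.
    "live statements": hours(ACCOUNTS * 6, RANDOM_RATE),
    # 54 months of statements per account, randomly generated.
    "archived statements": hours(ACCOUNTS * 54, RANDOM_RATE),
}

for step, duration in projections.items():
    print(f"{step:20s} {duration:5.1f} h")
print(f"{'total':20s} {sum(projections.values()):5.1f} h")
```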

#### Generation steps

 - import/states-hierarchy
 - import/customers
 - import/accounts
 - import/statements_live
 - import/statements_archive0
 - import/statements_archive1
 - import/statements_archive2
 - import/statements_archive3
 - import/statements_archive4
 - import/statements_archive5
 
#### Volume Steps

 - V1 : 1.6B --> Tests
   - import/states-hierarchy
   - import/customers
   - import/accounts
   - import/statements_live
 - V2 : 2.6B --> Tests
   - import/statements_archive0
 - V3 : 3.6B --> Tests
   - import/statements_archive1
 - V4 : 4.6B --> Tests
   - import/statements_archive2
 - V5 : 6.6B --> Tests
   - import/statements_archive3
 - V6 : 8.6B --> Tests
   - import/statements_archive4
 - V7 : 10.6B --> Tests
   - import/statements_archive5
  

### Step 1 : Full Content but no Archives

#### Target volumes

 - us-east = 809,993,275
 - us-west = 809,993,275
 - archives = 0

#### Import steps

##### run DocumentConsumers

The idea is to run the consumers in bulk-mode and ideally in parallel against the 2 live repositories.

 - states
    - import/states-hierarchy-us-east => us-east
    - import/states-hierarchy-us-west => us-west
 - customers
    - import/customers-us-east => us-east
    - import/customers-us-west => us-west
 - accounts
    - import/accounts-us-east => us-east
    - import/accounts-us-west => us-west
 - statements
    - import/statements_live-us-east => us-east
    - import/statements_live-us-west => us-west
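The routing rule is simply that each log is consumed by the repository named in its suffix. A minimal sketch of that rule, useful for checking the plan before launching the consumers (the function and variable names are hypothetical, not part of the importer API):

```python
LIVE_REPOSITORIES = ("us-east", "us-west")

LOGS = [
    "import/states-hierarchy-us-east",
    "import/states-hierarchy-us-west",
    "import/customers-us-east",
    "import/customers-us-west",
    "import/accounts-us-east",
    "import/accounts-us-west",
    "import/statements_live-us-east",
    "import/statements_live-us-west",
]


def target_repository(log_name: str) -> str:
    """Return the live repository whose name is the suffix of the log name."""
    for repository in LIVE_REPOSITORIES:
        if log_name.endswith("-" + repository):
            return repository
    raise ValueError(f"no live repository matches log {log_name!r}")


plan = {log: target_repository(log) for log in LOGS}
for log, repository in plan.items():
    print(f"{log} => {repository}")
```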


| Node type  | sizing | count |
|------|------|-----|
| injector  | c5.2xlarge | 1 |
| app  | m5.xlarge | 1 |
| worker  | m5.xlarge | 1 |
| mongodb  | M60 NVMe | 2 |
| es-master  | r5.large.es | 3 |
| es-data  | r5.2xlarge.es | 3 |



##### run Bulk Indexing    

The idea is to:

 - import all in us-east
 - start BAF Bulk Indexing on us-east
 - import all in us-west 
 - start BAF Bulk Indexing on us-west
 
In order to get Indexing as fast as possible:

 - scale the number of worker nodes (up to 8?)
 - at this point the ES cluster should have 12 data nodes `r5.2xlarge.es`
 

| Node type  | sizing | count |
|------|------|-----|
| injector  | c5.2xlarge | 1 |
| app  | m5.xlarge | 1 |
| worker  | m5.xlarge | 8 |
| mongodb  | M60 NVMe | 2 |
| es-master  | r5.large.es | 3 |
| es-data  | r5.2xlarge.es | 12 |

#### Testing
 


| Node type  | sizing | count |
|------|------|-----|
| injector  | c5.2xlarge | 1 |
| app  | m5.xlarge | 3 |
| worker  | m5.xlarge | 2 |
| mongodb  | M60 NVMe | 2 |
| es-master  | r5.large.es | 3 |
| es-data  | r5.2xlarge.es | 12 |


### Step 2 : 2.7B

#### Target volumes

 - us-east = 809,993,275
 - us-west = 809,993,275
 - archives = 1,079,993,172
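The archives target corresponds to 6 months of statements for all accounts, which is where the 2.7B total comes from. A short check, assuming that this first archive batch holds exactly 6 months (inferred from the number above, not stated explicitly):

```python
ACCOUNTS = 179_998_862
LIVE_REPOSITORY_SIZE = 809_993_275  # us-east and us-west, from Step 1
# Assumption: the first archive batch holds 6 months of statements.
ARCHIVED_MONTHS = 6

archives = ACCOUNTS * ARCHIVED_MONTHS
total = 2 * LIVE_REPOSITORY_SIZE + archives

print(f"archives = {archives:,}")
print(f"total    = {total:,} (about {total / 1e9:.1f}B)")
```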

#### Import steps

##### run DocumentConsumers

The idea is to run the consumers in bulk-mode against the archives repository.

 - statements
    - import/statements_archive0 => archives


| Node type  | sizing | count |
|------|------|-----|
| injector  | c5.2xlarge | 1 |
| app  | m5.xlarge | 1 |
| worker  | m5.xlarge | 1 |
| mongodb  | M60 NVMe | 2 |
| mongodb  | M80 NVMe | 5 |
| es-master  | r5.large.es | 3 |
| es-data  | r5.2xlarge.es | 12 |


##### run Bulk Indexing    

In order to get Indexing as fast as possible:

 - scale the number of worker nodes (up to 8?)
 - at this point the ES cluster should have 16 data nodes `r5.2xlarge.es`
 
| Node type  | sizing | count |
|------|------|-----|
| injector  | c5.2xlarge | 1 |
| app  | m5.xlarge | 1 |
| worker  | m5.xlarge | 8 |
| mongodb  | M60 NVMe | 2 |
| mongodb  | M80 NVMe | 5 |
| es-master  | r5.large.es | 3 |
| es-data  | r5.2xlarge.es | 16 |

#### Testing
 
| Node type  | sizing | count |
|------|------|-----|
| injector  | c5.2xlarge | 1 |
| app  | m5.xlarge | 3 |
| worker  | m5.xlarge | 2 |
| mongodb  | M60 NVMe | 2 |
| mongodb  | M80 NVMe | 5 |
| es-master  | r5.large.es | 3 |
| es-data  | r5.2xlarge.es | 16 |


### Other steps

The other steps are likely to be very similar to the previous one, however:

 - we may need to adjust the number of ES nodes
 - we will try the new BAF feature that pipes the Importer directly into Bulk Indexing
 - we may make some adjustments to the tests
 