### Initial Setup

 - initialize the repositories and MongoDB indexes [see commands here](../ToolBox/Repository-Cleanup.ipynb)
    - for `us-east` and `us-west`

### Import in US-EAST

We first import all the data into the US-EAST repository:

 - hierarchy
 - customers
 - accounts
 - live statements
 
#### Import Hierarchy

In [9]:
# import in synchronous mode
!import.sh -o import -l import/states-hierarchy-us-east -r us-east -b /

Using config /nxbench/notebooks/11B-Steps/nuxeo.properties
url=http://127.0.0.1:8080/nuxeo
login=nco-admin
Nuxeo Client configured
Connected to Nuxeo Server 11.3.26
Running Operation:StreamImporter.runDocumentConsumersEx
   nbThreads: 10 
   logName: import/states-hierarchy-us-east 
   blockDefaultSyncListeners: true 
   rootFolder: / 
   logSize: 8 
   batchSize: 500 
#####################
Execution completed
elapsed:60.039
committed:26
failures:0
consumers:8
throughput:0.4330518496310731



Import completed without issues.

#### Import Customers from CSV

In [10]:
# import customers
nbDocs = 89997827
expectedThroughput = 14000
print("expected duration (s) =", round(nbDocs/expectedThroughput))
print("expected duration (h) =", round((nbDocs/expectedThroughput)/3600))

expected duration (s) = 6428
expected duration (h) = 2


In [12]:
!tail import-useast-customers.log

Nuxeo Client configured
...........................................................................................................................
Running completed
elapsed:7427.235
committed:90918788
failures:0
consumers:16
throughput:12241.269866915482

Exit after 7443 s


Import completed without any issues. Throughput is high because the MongoDB collection is still small.

#### Import Accounts 

In [13]:
# import accounts
nbDocs = 89997827*2
expectedThroughput = 10000
print("expected duration (s) =", round(nbDocs/expectedThroughput))
print("expected duration (h) =", round((nbDocs/expectedThroughput)/3600))

expected duration (s) = 18000
expected duration (h) = 5


In [None]:
!import.sh -o import -t 16 -l import/accounts-us-east -r us-east -b / -a -w 25000 -bulk > import-useast-accounts.log

The command exited on a timeout because the throughput was far below what was expected.

    Consumers status: threads: 16, failure 0, 
    messages committed: 181842842, elapsed: 59763.90s, 
    throughput: 3042.69 msg/s


The import completed without errors, but the throughput is much lower than expected.
The import started fast but slowed down toward the end.
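
To put the gap in numbers, the sketch below recomputes the measured throughput from the consumer status above and compares it with the 10,000 docs/s planning figure used earlier. The variable names are chosen here for illustration; the input values are copied from this notebook.

```python
# Figures copied from the "Consumers status" output of the accounts import.
committed = 181842842          # messages committed
elapsed_s = 59763.90           # elapsed time in seconds
expected_throughput = 10000    # docs/s, the planning assumption used above

measured_throughput = committed / elapsed_s
expected_duration_h = committed / expected_throughput / 3600
actual_duration_h = elapsed_s / 3600

print("measured throughput (docs/s) =", round(measured_throughput, 2))
print("expected duration (h) =", round(expected_duration_h, 1))
print("actual duration (h) =", round(actual_duration_h, 1))
print("slowdown factor =", round(expected_throughput / measured_throughput, 2))
```

This is an average over the whole run, so it hides the fast start and the slow finish; the monitoring dashboards are the better source for the shape of the curve.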

#### Import Statements

In [None]:
# import statements
nbDocs = 89997827*6
expectedThroughput = 9000
print("expected duration (s) =", round(nbDocs/expectedThroughput))
print("expected duration (h) =", round((nbDocs/expectedThroughput)/3600))

In [None]:
!import.sh -o import -t 16 -l import/statements_live-us-east -r us-east -b / -a -w 200000 -bulk > import-useast-live-statements.log

The importer appears to have stalled at some point because it was started with 16 threads whereas there are 24 partitions.


See [this notebook](explore/unbalanced-import.ipynb) for more explanations on the problem created by this.


See [this notebook](explore/import-m60-througput.ipynb) for more details on the throughput limitations.
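
As a rough illustration of why 16 threads over 24 partitions is unbalanced, the sketch below assigns partitions to threads round-robin. The round-robin rule and the function name `partitions_per_thread` are assumptions made for this example; the notebook does not state how the importer actually assigns partitions. Whatever the real strategy, if each partition is read by exactly one thread, at least one of the 16 threads has to handle two or more partitions.

```python
from collections import Counter

def partitions_per_thread(nb_partitions, nb_threads):
    """Return the number of partitions each thread handles.

    Assumption: partition p is assigned to thread p % nb_threads.
    """
    counts = Counter(p % nb_threads for p in range(nb_partitions))
    return [counts.get(t, 0) for t in range(nb_threads)]

load_16 = partitions_per_thread(24, 16)
load_24 = partitions_per_thread(24, 24)

# Histogram: {partitions handled: number of threads}
print("16 threads:", dict(Counter(load_16)))
print("24 threads:", dict(Counter(load_24)))
```

Under this assumption half of the threads carry twice the load of the others, so the lightly loaded threads finish early and the run ends with only part of the pool still working.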


### Indexing US-EAST

Because Kafka is almost full at this point, we need to clean up some of the streams, typically the streams that were used to reindex US-WEST.

<img src="monitoring/kafka-storage-clean.png"/>

 - scale out the number of Nuxeo Worker nodes 
     - 1 => 5
 - configure ES for bulk indexing (see [ES toolbox](../ToolBox/Elasticsearch.ipynb))
     - no replicas
     - refresh interval disabled
 - configure the US-EAST MongoDB cluster to allow reads from secondaries
     - add `?readPreference=nearest` to the connection url
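
The connection URL change in the last step can be sketched as follows. The helper name `with_read_preference` and the example host names are hypothetical; only the `readPreference=nearest` option comes from this notebook.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def with_read_preference(mongo_url, preference="nearest"):
    """Return mongo_url with readPreference set, keeping any existing options."""
    parts = urlsplit(mongo_url)
    options = dict(parse_qsl(parts.query))
    options["readPreference"] = preference
    return urlunsplit(parts._replace(query=urlencode(options)))

# Hypothetical URL, for illustration only.
example_url = "mongodb://mongo-1.example.internal:27017,mongo-2.example.internal:27017/nuxeo"
print(with_read_preference(example_url))
```

Parsing the query string instead of appending `?readPreference=nearest` blindly avoids producing a second `?` when the URL already carries options.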
     

Associated Datadog notebook: https://app.datadoghq.com/notebook/275428/index-us-east

Start indexing on US-EAST using the Bulk Action Framework (BAF):

In [12]:
!(INJECTOR="http://127.0.0.1:8080";\
curl -H 'Content-Type:application/json+nxrequest' \
  -H 'X-NXRepository:us-east' \
  -X POST -d '{"params":{},"context":{}}' -u $NXUSER:$NXPWD \
  "$INJECTOR/nuxeo/api/v1/automation/Elasticsearch.BulkIndex")

{"commandId":"d1fc4c28-dafb-4c69-bf65-a7d8efdb2dc6"}

In [13]:
# check status
! cid="d1fc4c28-dafb-4c69-bf65-a7d8efdb2dc6"; \
  INJECTOR="http://127.0.0.1:8080"; \
  curl -H 'Content-Type:application/json+nxrequest' -H 'X-NXRepository:us-east' \
  -u $NXUSER:$NXPWD $INJECTOR/nuxeo/api/v1/bulk/$cid

{"entity-type":"bulkStatus","commandId":"d1fc4c28-dafb-4c69-bf65-a7d8efdb2dc6","state":"SCROLLING_RUNNING","processed":132270,"error":false,"errorCount":0,"total":0,"action":"index","username":"nco-admin","submitted":"2020-09-16T01:23:49.637Z","scrollStart":"2020-09-16T01:23:49.716Z","scrollEnd":null,"processingStart":null,"processingEnd":null,"completed":null,"processingMillis":0}
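
The status document is plain JSON, so it is easy to reduce it to the few fields that matter when monitoring a run. The sketch below is one possible way to do that; the function name `summarize_bulk_status` is made up for this example and the sample input is a trimmed copy of the status shown above. Note that `state` alone is not enough to declare success: one of the US-WEST runs in this notebook reports `COMPLETED` together with `"error":true`.

```python
import json

def summarize_bulk_status(raw):
    """Reduce a bulk status JSON string to the fields used for monitoring."""
    status = json.loads(raw)
    failed = bool(status.get("error")) or status.get("errorCount", 0) > 0
    done = status["state"] == "COMPLETED"
    return {
        "state": status["state"],
        "processed": status["processed"],
        "total": status["total"],
        "done": done,
        "failed": failed,
        "succeeded": done and not failed,
    }

# Trimmed copy of the status returned above.
raw = ('{"entity-type":"bulkStatus",'
       '"commandId":"d1fc4c28-dafb-4c69-bf65-a7d8efdb2dc6",'
       '"state":"SCROLLING_RUNNING","processed":132270,'
       '"error":false,"errorCount":0,"total":0}')

summary = summarize_bulk_status(raw)
print(summary)
```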

Once indexing has actually started, Nuxeo has deleted the index and recreated it with the default configuration.

We can now temporarily tweak the configuration to speed up indexing.

In [15]:
!curl -X PUT $ES_SERVER/us-east/_settings -H "Content-Type: application/json"    -d '{"index" : { "number_of_replicas" : 0  , "refresh_interval" : -1 } }'

{"acknowledged":true}

In [31]:
# check status
! cid="d1fc4c28-dafb-4c69-bf65-a7d8efdb2dc6"; \
  INJECTOR="http://127.0.0.1:8080"; \
  curl -H 'Content-Type:application/json+nxrequest' -H 'X-NXRepository:us-east' \
  -u $NXUSER:$NXPWD $INJECTOR/nuxeo/api/v1/bulk/$cid

{"entity-type":"bulkStatus","commandId":"d1fc4c28-dafb-4c69-bf65-a7d8efdb2dc6","state":"COMPLETED","processed":818298652,"error":false,"errorCount":0,"total":818298652,"action":"index","username":"nco-admin","submitted":"2020-09-16T01:23:49.637Z","scrollStart":"2020-09-16T01:23:49.716Z","scrollEnd":"2020-09-16T04:36:03.312Z","processingStart":null,"processingEnd":null,"completed":"2020-09-16T18:03:02.420Z","processingMillis":0}

In [32]:
from dateutil import parser
start = "2020-09-16T01:23:49.637Z"
end = "2020-09-16T18:03:02.420Z"
nbdocs = 818298652
s = parser.parse(start)
e = parser.parse(end) 
throughput = nbdocs / (e-s).total_seconds()

print('%s docs/s' % format(throughput, ',.2f'))

13,649.05 docs/s


In [33]:
!curl -X PUT $ES_SERVER/us-east/_settings -H "Content-Type: application/json"    -d '{"index" : { "number_of_replicas" : 1  , "refresh_interval" : null } }'

{"acknowledged":true}
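
The two `_settings` calls in this section differ only in their payload: one switches the index to bulk-loading mode, the other restores one replica and the default refresh interval (`null`). The sketch below builds the equivalent requests in Python without sending them. The function name, the constant names and the `http://localhost:9200` address are assumptions made for the example; the payloads are the ones used in the `curl` commands.

```python
import json
from urllib.request import Request

# Payloads copied from the curl commands in this section.
BULK_INDEXING = {"index": {"number_of_replicas": 0, "refresh_interval": -1}}
NORMAL = {"index": {"number_of_replicas": 1, "refresh_interval": None}}

def build_settings_request(es_server, index, settings):
    """Build (but do not send) the PUT request that updates index settings."""
    return Request(
        url=f"{es_server}/{index}/_settings",
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

# Hypothetical server address, for illustration only.
req = build_settings_request("http://localhost:9200", "us-east", NORMAL)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Keeping both payloads side by side makes it harder to forget the restore step once indexing has completed.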

### Import in US-WEST

In [20]:
# import in synchronous mode
!import.sh -o import -l import/states-hierarchy-us-west -r us-west -b /

Using config /nxbench/notebooks/11B-Steps/nuxeo.properties
url=http://127.0.0.1:8080/nuxeo
login=nco-admin
Nuxeo Client configured
Connected to Nuxeo Server 11.3.26
Running Operation:StreamImporter.runDocumentConsumersEx
   nbThreads: 10 
   logName: import/states-hierarchy-us-west 
   blockDefaultSyncListeners: true 
   rootFolder: / 
   logSize: 8 
   batchSize: 500 
#####################
Execution completed
elapsed:60.041
committed:26
failures:0
consumers:8
throughput:0.43303742442664184



In [21]:
!import.sh -o import -t 16 -l import/customers-us-west -r us-west -b / -a -w 8000 -bulk > import-uswest-customers.log

In [22]:
!tail import-uswest-customers.log

Nuxeo Client configured
................................................................................................................
Running completed
elapsed:6747.088
committed:89076866
failures:0
consumers:16
throughput:13202.268297078681

Exit after 6781 s


In [23]:
!import.sh -o import -t 16 -l import/accounts-us-west -r us-west -b / -a -w 40000 -bulk > import-uswest-accounts.log

In [24]:
!tail import-uswest-accounts.log

Async Automation Execution Scheduled
  => status url:[http://127.0.0.1:8080/nuxeo/site/api/v1/automation/StreamImporter.runDocumentConsumersEx/@async/27bba16f-b9bf-4565-85ce-f7a715ba6414/status]
#####################
Execution completed

waiting for end of Async Exec
url=http://127.0.0.1:8080/nuxeo
login=nco-admin
Nuxeo Client configured
....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

    Consumers status: threads: 16, failure 0, messages committed: 178154882, elapsed: 55277.80s, throughput: 3222.90 msg/s


In [25]:
!import.sh -o import -t 24 -l import/statements_live-us-west -r us-west -b / -a -w 80000 -bulk > import-uswest-live-statements.log


Import completed.
The import ends with some errors because there is a small overflow of statements: the last statements correspond to accounts that were not imported. This is a side effect of rounding the number of documents created.

More detailed monitoring is available [here](Monitoring-US-WEST.ipynb)


### Indexing US-WEST
 
 - configure ES for bulk indexing (see [ES toolbox](../ToolBox/Elasticsearch.ipynb))
     - no replicas
     - refresh interval disabled

In [12]:
!curl -X PUT $ES_SERVER/us-west/_settings -H "Content-Type: application/json"    -d '{"index" : { "number_of_replicas" : 0  , "refresh_interval" : -1 } }'

{"acknowledged":true}

Start indexing on US-WEST using BAF:

In [3]:
!(INJECTOR="http://127.0.0.1:8080";\
curl -H 'Content-Type:application/json+nxrequest' \
  -H 'X-NXRepository:us-west' \
  -X POST -d '{"params":{},"context":{}}' -u $NXUSER:$NXPWD \
  "$INJECTOR/nuxeo/api/v1/automation/Elasticsearch.BulkIndex")

{"commandId":"a5920313-7abc-445c-9a64-a7dcf126b50a"}

In [6]:
# check status
! cid="a5920313-7abc-445c-9a64-a7dcf126b50a"; \
  INJECTOR="http://127.0.0.1:8080"; \
  curl -H 'Content-Type:application/json+nxrequest' -H 'X-NXRepository:us-west' \
  -u $NXUSER:$NXPWD $INJECTOR/nuxeo/api/v1/bulk/$cid

{"entity-type":"bulkStatus","commandId":"a5920313-7abc-445c-9a64-a7dcf126b50a","state":"COMPLETED","processed":33838270,"error":true,"errorCount":1,"errorMessage":"Invalid command","total":0,"action":"index","username":"nco-admin","submitted":"2020-09-14T22:41:21.304Z","scrollStart":"2020-09-14T22:41:21.371Z","scrollEnd":"2020-09-14T23:49:23.660Z","processingStart":null,"processingEnd":null,"completed":"2020-09-14T23:49:23.660Z","processingMillis":0}

The run failed because of a timeout in MongoDB:

    com.mongodb.MongoExecutionTimeoutException: operation exceeded time limit
	at com.mongodb.internal.connection.ProtocolHelper.createSpecialException(ProtocolHelper.java:239) ~[mongo-java-driver-3.12.1.jar:?]
	at com.mongodb.internal.connection.ProtocolHelper.getCommandFailureException(ProtocolHelper.java:171) ~[mongo-java-driver-3.12.1.jar:?]
	at com.mongodb.internal.connection.InternalStreamConnection.receiveCommandMessageResponse(InternalStreamConnection.java:303) ~[mongo-java-driver-3.12.1.jar:?]
	at com.mongodb.internal.connection.InternalStreamConnection.sendAndReceive(InternalStreamConnection.java:259) ~[mongo-java-driver-3.12.1.jar:?]
	at com.mongodb.internal.connection.UsageTrackingInternalConnection.sendAndReceive(UsageTrackingInternalConnection.java:99) ~[mongo-java-driver-3.12.1.jar:?]
	at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.sendAndReceive(DefaultConnectionPool.java:450) ~[mongo-java-driver-3.12.1.jar:?]
	at com.mongodb.internal.connection.CommandProtocolImpl.execute(CommandProtocolImpl.java:72) ~[mongo-java-driver-3.12.1.jar:?]
	at com.mongodb.internal.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:226) ~[mongo-java-driver-3.12.1.jar:?]
	at com.mongodb.internal.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:269) ~[mongo-java-driver-3.12.1.jar:?]
	at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:131) ~[mongo-java-driver-3.12.1.jar:?]
	at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:123) ~[mongo-java-driver-3.12.1.jar:?]
	at com.mongodb.operation.QueryBatchCursor.getMore(QueryBatchCursor.java:260) ~[mongo-java-driver-3.12.1.jar:?]
	at com.mongodb.operation.QueryBatchCursor.hasNext(QueryBatchCursor.java:138) ~[mongo-java-driver-3.12.1.jar:?]
	at com.mongodb.client.internal.MongoBatchCursorAdapter.hasNext(MongoBatchCursorAdapter.java:54) ~[mongo-java-driver-3.12.1.jar:?]
	at org.nuxeo.ecm.core.api.CursorResult.hasNext(CursorResult.java:71) ~[nuxeo-core-api-11.3.26.jar:?]
	at org.nuxeo.ecm.core.api.CursorService.scroll(CursorService.java:137) ~[nuxeo-core-api-11.3.26.jar:?]
	at org.nuxeo.ecm.core.storage.mongodb.MongoDBConnection.scroll(MongoDBConnection.java:843) ~[nuxeo-core-storage-mongodb-11.3.26.jar:?]

Rerun indexing after having:

 - changed the MongoDB timeout to 12h
 - changed the concurrency for bulk indexing (several times)
 - cleaned up some of the streams
 
 
    > kafka-topics.sh -bootstrap-server b-1.nxbench-2826-bench-11.yx3zdh.c4.kafka.us-east-1.amazonaws.com:9094  --command-config kafka-ssl.properties --delete --topic nuxeo-work-default
    > kafka-topics.sh -bootstrap-server b-1.nxbench-2826-bench-11.yx3zdh.c4.kafka.us-east-1.amazonaws.com:9094  --command-config kafka-ssl.properties --delete --topic nuxeo-work-updateACEStatus
    > kafka-topics.sh -bootstrap-server b-1.nxbench-2826-bench-11.yx3zdh.c4.kafka.us-east-1.amazonaws.com:9094  --command-config kafka-ssl.properties --delete --topic nuxeo-work-elasticSearchIndexing
    > kafka-topics.sh -bootstrap-server b-1.nxbench-2826-bench-11.yx3zdh.c4.kafka.us-east-1.amazonaws.com:9094  --command-config kafka-ssl.properties --delete --topic nuxeo-bulk-.*


In [16]:
!(INJECTOR="http://127.0.0.1:8080";\
curl -H 'Content-Type:application/json+nxrequest' \
  -H 'X-NXRepository:us-west' \
  -X POST -d '{"params":{},"context":{}}' -u $NXUSER:$NXPWD \
  "$INJECTOR/nuxeo/api/v1/automation/Elasticsearch.BulkIndex")

{"commandId":"d7089bee-0935-4709-9b12-499b04b96032"}

In [17]:
# check status
! cid="d7089bee-0935-4709-9b12-499b04b96032"; \
  INJECTOR="http://127.0.0.1:8080"; \
  curl -H 'Content-Type:application/json+nxrequest' -H 'X-NXRepository:us-west' \
  -u $NXUSER:$NXPWD $INJECTOR/nuxeo/api/v1/bulk/$cid

{"entity-type":"bulkStatus","commandId":"d7089bee-0935-4709-9b12-499b04b96032","state":"SCROLLING_RUNNING","processed":60133,"error":false,"errorCount":0,"total":0,"action":"index","username":"nco-admin","submitted":"2020-09-15T03:29:26.385Z","scrollStart":"2020-09-15T03:29:26.471Z","scrollEnd":null,"processingStart":null,"processingEnd":null,"completed":null,"processingMillis":0}

In [1]:
# check status
! cid="d7089bee-0935-4709-9b12-499b04b96032"; \
  INJECTOR="http://127.0.0.1:8080"; \
  curl -H 'Content-Type:application/json+nxrequest' -H 'X-NXRepository:us-west' \
  -u $NXUSER:$NXPWD $INJECTOR/nuxeo/api/v1/bulk/$cid

{"entity-type":"bulkStatus","commandId":"d7089bee-0935-4709-9b12-499b04b96032","state":"COMPLETED","processed":801708762,"error":false,"errorCount":0,"total":801695569,"action":"index","username":"nco-admin","submitted":"2020-09-15T03:29:26.385Z","scrollStart":"2020-09-15T03:29:26.471Z","scrollEnd":"2020-09-15T08:39:23.314Z","processingStart":null,"processingEnd":null,"completed":"2020-09-15T20:09:45.964Z","processingMillis":0}

In [11]:
from dateutil import parser
start = "2020-09-15T03:29:26.385Z"
end = "2020-09-15T20:09:45.964Z"
nbdocs = 801695569
s = parser.parse(start)
e = parser.parse(end) 
throughput = nbdocs / (e-s).total_seconds()

print('%s docs/s' % format(throughput, ',.2f'))

13,357.23 docs/s
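
The status document also carries the scroll timestamps, so the run can be split into its scroll phase and its overall duration. The sketch below does this for the US-WEST run using the timestamps from the final status above. The helper name `parse_ts` is made up for this example, and reading `scrollEnd - scrollStart` as the time spent scrolling the repository is an interpretation of the field names, not something this notebook confirms.

```python
from datetime import datetime

def parse_ts(ts):
    """Parse the UTC timestamps used in the bulk status (millisecond precision)."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")

# Timestamps copied from the final US-WEST bulk status.
submitted = parse_ts("2020-09-15T03:29:26.385Z")
scroll_start = parse_ts("2020-09-15T03:29:26.471Z")
scroll_end = parse_ts("2020-09-15T08:39:23.314Z")
completed = parse_ts("2020-09-15T20:09:45.964Z")

scroll_s = (scroll_end - scroll_start).total_seconds()
total_s = (completed - submitted).total_seconds()

print("scroll phase (h) =", round(scroll_s / 3600, 2))
print("whole command (h) =", round(total_s / 3600, 2))
print("scroll share of total =", format(scroll_s / total_s, ".0%"))
```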
