
Data loss when using Fiware Orion Broker, QuantumLeap and CrateDB #722

Closed
NunopRolo opened this issue Mar 16, 2023 · 13 comments

@NunopRolo

Describe the bug
I'm using the Fiware Orion Broker, QuantumLeap and CrateDB, with the aim of recording all temporal data in CrateDB.

My docker-compose configuration is this:

orion:
    image: fiware/orion:${ORION_VERSION}
    hostname: orion
    container_name: fiware-orion
    depends_on:
        - mongo-db
    networks:
        - fiware
    expose:
        - "${ORION_PORT}"
    ports:
        - "${ORION_PORT}:${ORION_PORT}"
    command: -dbhost mongo-db
    healthcheck:
        test: curl --fail -s http://orion:${ORION_PORT}/version || exit 1
        interval: 5s

mongo-db:
    image: mongo:latest
    hostname: mongo-db
    container_name: db-mongo
    expose:
        - "${MONGO_DB_PORT}"
    ports:
        - "${MONGO_DB_PORT}:${MONGO_DB_PORT}"
    networks:
        - fiware
    volumes:
        -  ./volumes/mongo-db:/data
    healthcheck:
        test: |
            host=`hostname --ip-address || echo '127.0.0.1'`; 
            mongo --quiet $host/test --eval 'quit(db.runCommand({ ping: 1 }).ok ? 0 : 2)' && echo 0 || echo 1
        interval: 5s

quantumleap:
    image: orchestracities/quantumleap:latest
    hostname: quantumleap
    container_name: fiware-quantumleap
    ports:
        - "${QUANTUMLEAP_PORT}:${QUANTUMLEAP_PORT}"
    depends_on:
        - crate-db
        - redis-db
    environment:
        - CRATE_HOST=crate-db
        - LOGLEVEL=WARNING
    healthcheck:
        test: curl --fail -s http://quantumleap:${QUANTUMLEAP_PORT}/version || exit 1
    networks:
        - fiware

crate-db:
    image: crate:latest
    hostname: crate-db
    container_name: db-crate
    ports:
        - "4200:4200"
        - "4300:4300"
    command: crate -Cauth.host_based.enabled=false  -Ccluster.name=democluster -Chttp.cors.enabled=true -Chttp.cors.allow-origin="*" -Cdiscovery.type=single-node
    environment:
        - CRATE_HEAP_SIZE=2g 
    volumes:
        - ./volumes/crate-db:/data
    networks:
        - fiware

I registered different entities, and then I created the subscription that notifies QuantumLeap like this:

curl -iX POST \
  'http://localhost:1026/v2/subscriptions/' \
  -H 'Content-Type: application/json' \
  -H 'fiware-service: openiot' \
  -H 'fiware-servicepath: /' \
  -d '{
  "description": "Notify QuantumLeap",
  "subject": {
    "entities": [
      {
        "idPattern": "Device.*"
      }
    ],
    "condition": {
      "attrs": [
        "power"
      ]
    }
  },
  "notification": {
    "http": {
      "url": "http://quantumleap:8668/v2/notify"
    },
    "attrs": [
      "power"
    ],
    "metadata": ["dateCreated", "dateModified"]
  }
}' 

I'm then running performance tests with Apache JMeter: consecutive requests are sent for 1 minute to evaluate performance. The problem is that some of the data is not being registered in CrateDB; in the last test I did, about 18000 requests were made, but only about 10000 records ended up in CrateDB.
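
The requests update the power attribute of the devices, roughly along these lines (the entity id here is just an illustrative one matching the Device.* pattern of the subscription above):

curl -iX PATCH \
  'http://localhost:1026/v2/entities/Device001/attrs' \
  -H 'Content-Type: application/json' \
  -H 'fiware-service: openiot' \
  -H 'fiware-servicepath: /' \
  -d '{
  "power": { "value": 11.5, "type": "Number" }
}'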

Expected behavior
The CrateDB database should contain the same number of records as requests made to the Fiware Orion Broker.

Environment (please complete the following information):

  • OS: Linux

Additional context
I also tried using the TimescaleDB database in QuantumLeap, but the same problem happens, so I assume that the problem is not with the database.

Does anyone know what the problem could be?
I am available to provide more information if necessary.

@c0c0n3
Member

c0c0n3 commented Mar 21, 2023

Hello @NunopRolo and thanks for reporting this issue!

We also noticed a certain amount of data loss in our load tests. Under heavy load, the QuantumLeap notify endpoint lost about 4% of the NGSI entities.

We've never had the time to figure out exactly why, but it could be a combination of our code being too resource intensive and too few worker threads specified in the Gunicorn config---I mean too few for the request workload.

But your scenario is taking this to a whole new level :-) We're talking about an order of magnitude higher data loss, i.e. you lost about 40% of the incoming NGSI entities. One avenue to explore is Orion notification throttling. If memory serves, the default is to notify subscribers only of the last entity update received in the previous one-second timespan. So if Orion got an average of 10 entity updates (to the same entity) per second, it'd only notify QuantumLeap of one entity update per second. Can you try deleting your subscription and then adding it back with a throttling: 0 field to see if it makes any difference?
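
For reference, that's just your original subscription with a throttling field added at the top level; a quick sketch:

curl -iX POST \
  'http://localhost:1026/v2/subscriptions/' \
  -H 'Content-Type: application/json' \
  -H 'fiware-service: openiot' \
  -H 'fiware-servicepath: /' \
  -d '{
  "description": "Notify QuantumLeap",
  "subject": {
    "entities": [{ "idPattern": "Device.*" }],
    "condition": { "attrs": ["power"] }
  },
  "notification": {
    "http": { "url": "http://quantumleap:8668/v2/notify" },
    "attrs": ["power"],
    "metadata": ["dateCreated", "dateModified"]
  },
  "throttling": 0
}'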

If you're still losing a lot of data, then it could be you need to beef up your test environment, e.g. run Crate and QL on separate boxes, making sure Crate gets lots of RAM.

Hope this helps!

@NunopRolo
Author

Hello @c0c0n3
Thanks for the answer

I tried setting throttling: 0 and the same problem happens. I also tried threadpool mode in the Orion Broker and added more workers to QuantumLeap; it improved a little, but I still have a big loss of data.

It's a shame to know the problem is in the component itself and not in some wrong configuration on my part. I will keep trying to adjust some parameters to see if I can avoid this problem.
If you have any more tips please let me know.

Thank you for your help.

@SBlechmann

We were facing similar issues when conducting performance tests, even with WQ. I am very much looking forward to a solution to this.

BTW: 0 is the default value when not explicitly stating another value to throttling.

@c0c0n3
Member

c0c0n3 commented Mar 22, 2023

Hi @NunopRolo :-)

Since tweaking the configuration didn't help, my guess is that you'll need beefier hardware. Unfortunately QuantumLeap wasn't really designed with performance in mind from the outset, and our stack requires lots of horsepower to handle that kind of workload---if I understand correctly, you're trying to process 18,000 requests a minute.

The first thing I'd do is move away from Docker Compose. I'd set up a separate server machine with 4 CPUs + 16GB RAM + SSD to run Crate. Then on another server machine (same specs) I'd run Orion and QuantumLeap with about 20 workers. Finally, JMeter should run on your client machine.

Also keep in mind we have a work queue (WQ) solution to mitigate data loss.

It should bump up your throughput a little w/r/t vanilla QuantumLeap and keep data loss to a bare minimum---possibly no loss at all. But it's way more complex to deploy.

@c0c0n3
Member

c0c0n3 commented Mar 22, 2023

@SBlechmann sorry to hear you're having issues too. Work queue should help, but you'll need beefy hardware to run it smoothly, see my previous comment about it. Anyway, WQ isn't a game changer. Like I said, the problem is that performance wasn't really a design goal from the start and at this point trying to turn QuantumLeap into a high-performance solution would require a complete redesign of the architecture and a rewrite of the code from scratch.

@c0c0n3
Member

c0c0n3 commented Mar 22, 2023

BTW: 0 is the default value when not explicitly stating another value to throttling.

Do you have a reference for that? I always struggle to remember what the defaults are and couldn't find any explicit mention of that in the docs---or the default being one second for that matter. But I've bumped into this:

* https://fiware-orion.readthedocs.io/en/master/admin/perf_tuning.html#subscription-cache

@SBlechmann

Hey @c0c0n3 ,

well, we did some performance tests as well... and 300 req / s is not much and a python script should be able to handle this imho.
I don't have the data with me atm, but I believe we ran several tests from 100 to 700 req / s. For low rates the data was saved persistently, while at 700 req / s it was less than 50 %.

I know of a colleague who also ran some tests but with more hardware resources... let me reach out.

@SBlechmann

BTW: 0 is the default value when not explicitly stating another value to throttling.

Do you have a reference for that? I always struggle to remember what the defaults are and couldn't find any explicit mention of that in the docs---or the default being one second for that matter. But I've bumped into this

* https://fiware-orion.readthedocs.io/en/master/admin/perf_tuning.html#subscription-cache

Indeed, I can't find any ref for that... but I was sure I found out about that in the past.
I just did a little test and posted three subscriptions.

Sub 1: leave out throttling option
Sub 2: throttling:0
Sub 3: throttling:1

Running a GET against Orion's /v2/subscriptions will not show throttling except for Sub 3.
Yet in MongoDB, it says throttling:0 for Subs 1 and 2 and throttling:1 for Sub 3.
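
If anyone wants to repeat that check against the setup from the issue description, a rough sketch (the MongoDB names are from memory: Orion keeps subscriptions in the csubs collection of the orion database, or orion-openiot when multi-tenancy is enabled):

# What Orion reports through its API:
curl -s 'http://localhost:1026/v2/subscriptions' \
  -H 'fiware-service: openiot' \
  -H 'fiware-servicepath: /'

# What MongoDB actually stores (db name may be orion or orion-openiot;
# newer mongo images ship mongosh instead of the legacy mongo shell):
docker exec db-mongo mongo --quiet orion-openiot \
  --eval 'db.csubs.find({}, { throttling: 1 }).forEach(printjson)'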

@c0c0n3
Member

c0c0n3 commented Mar 22, 2023

@SBlechmann

300 req / s is not much and a python script should be able to handle this imho.

agree :-) indeed QL + WQ can handle that actually. We happened to experience that exact workload in a prod scenario and there was no data loss. But you'll need a setup similar to the one I mentioned earlier w/ different boxes for Crate, QL and Redis.

100 to 700 req / s. For low rates the data was saved persistently while at 700 req / s it was less than 50 %.

Did you run this test through Docker Compose on a single machine with QL+WQ? I experienced something similar when running everything in Docker Compose on my laptop, but the reason was that I didn't have enough horsepower, so the test client pumping requests would keep on getting 500s b/c there wasn't enough CPU and RAM to handle that workload. Also keep in mind, when using WQ, you might have to wait a few minutes before checking if the data is in the DB b/c of the WQ exponential backoff algorithm that retries failed inserts.

In general, if you want high performance and efficient resource usage, QL is not the right solution. The QL architecture wasn't designed for performance and it's utterly wasteful when it comes to resource usage. We desperately tried to bolt on performance improvements when we started hitting prod issues, but like I said earlier there's only so much you can do without rewriting the software from scratch using a different architecture.

To see why that's the case, think of a scenario where a device sends measurements every 5 seconds. That's one call to Orion, followed by one to MongoDB, followed by a notification to QL which finally issues an insert in the time series DB. The approach doesn't scale well. Just think about the QL bit of the journey: you pay the price for one DB insert every 5 secs. Now imagine you had 1,000 devices sending data every 5 secs. Well, that's 12,000 inserts a minute. Surely you can think of a different architecture where readings are buffered and then bulk-inserted into the DB. In such an architecture you would e.g. only do a bulk insert of 6,000 records every 30 secs, which is 2 inserts a minute vs 12,000 a minute.
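
Purely as an illustration of the bulk-insert idea, a single round trip to CrateDB's HTTP endpoint can carry many buffered readings at once (the table and column names below are made up; stmt plus bulk_args is CrateDB's parameterised bulk mechanism):

# One request inserting a whole buffer of readings,
# instead of one INSERT per notification:
curl -s -X POST 'http://localhost:4200/_sql' \
  -H 'Content-Type: application/json' \
  -d '{
  "stmt": "INSERT INTO doc.device_readings (entity_id, time_index, power) VALUES (?, ?, ?)",
  "bulk_args": [
    ["Device001", "2023-03-16T10:00:00Z", 11.5],
    ["Device002", "2023-03-16T10:00:01Z", 9.8],
    ["Device003", "2023-03-16T10:00:02Z", 12.1]
  ]
}'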

I just did a little test and posted three subscriptions.
...
Yet in MongoDB, it says throttling:0 for Subs 1 and 2 and throttling:1 for Sub 3.

Ha! that's great, thank you so much for this, very valuable piece of info indeed!

@StWiemann

I did some load-testing about 2 years ago, as well. The results weren't great for QL.
I did my testing in a Kubernetes cluster and assigned 3 worker nodes (8 cores, 64 GB RAM) to handle Fiware.
It became clear that QL is the biggest bottleneck, followed by the IoT Agents. Orion seems to be fine with a lot more load than either of those could handle. After starting enough instances of QL and the IoT Agents I was able to get about 6000 inserts/s working. With enough scaling, most of those tests were missing roughly 0-10 data points out of 180k.
I ran 3 Crate nodes with no replicas for best performance.
It might be important that I didn't reach a limit there; it was just sufficient for my proof of concept. Throwing enough hardware and a load balancer at stuff like this solves most problems, I guess. But it might not be the best solution, as c0c0n3 already said.

A big part of this is of course SSDs. If you are running this on some kind of hosting service, sometimes your volumes are mounted on rather slow hardware or you get restricted IOPS. I wasn't able to get past 120 requests/s because of that initially. Caching notifications in Redis helps with bursts.

When I did that, Scorpio was just about to be usable. I don't know if that is another feasible route to go (provided it doesn't rely on QL as well), since QL won't be rewritten I guess and imho python might not be the most performant choice there.

@c0c0n3
Member

c0c0n3 commented Mar 22, 2023

@StWiemann

I did some load-testing about 2 years ago, as well. The results weren't great for QL.

Not surprised to be honest :-)

Throwing enough hardware and a load balancer at stuff like this solves most problems

Yep it does, but that hurts your pocket :-)
Anyway, your results are totally in line with our experience and performance tests.

QL won't be rewritten I guess

You guessed right :-) At this point in time we only have barely enough resources for minor improvements, but I wish we could do a rewrite to solve most of the problems we have, performance and NGSI-LD coming on top of my list...

imho python might not be the most performant choice there

I couldn't agree more. If we ever do a rewrite, it's most likely going to be Rust...

@NunopRolo
Author

Thanks for all the comments. I now understand that the problem is in QL itself and that there won't be a change soon.
I'm running these tests on a laptop (4 cores, 16GB RAM, NVMe SSD), as I'm still investigating the Fiware stack, and apparently it has too few resources to deal with QL. I will try it on a better machine to see if that solves the problem.

If I still have problems on a better machine, I will try to find an alternative that receives the Orion Broker notifications with a better-performing service (I don't know if it will be possible, but I will investigate).

Thank you all

@c0c0n3
Member

c0c0n3 commented Mar 23, 2023

Pleasure! Keep in mind you could also turn on telemetry in QuantumLeap and then analyse the telemetry data with pandas to figure out exactly what's happening.

@c0c0n3 c0c0n3 closed this as completed Apr 4, 2023