Datastore: Performance across 1000+ rows #1314

Closed
jryd opened this issue Sep 21, 2018 · 23 comments
Labels
api: datastore Issues related to the Datastore API. type: question Request for information or clarification. Not an issue.

Comments

@jryd

jryd commented Sep 21, 2018

I have a PHP project currently running Laravel 5.7.

I have a bunch of IoT sensors that are sending data into Google Datastore.

I am wanting to query this data so that I can show a graph of this data on the frontend of the project.

The frontend makes a request to my project, my project makes the request to Google Datastore, and then builds up the result to return it as JSON to the frontend.

There's a fair bit of data (roughly 1,200 rows), and it takes about 13 seconds to fetch and render on screen, which is longer than I would like.

Here is my current code to fetch this IoT Data:

use Carbon\Carbon;
use Google\Cloud\Datastore\DatastoreClient;
use Google\Cloud\Datastore\Query\Query;

class Datastore
{
    protected $datastore;

    public function __construct()
    {
        $this->datastore = new DatastoreClient([
            'keyFilePath' => __DIR__ . '/auth.json'
        ]);
    }

    public function fetchData()
    {
        dump("GDS query starting " . date("Y-m-d H:i:s"));
        $query = $this->datastore->query()
            ->kind('TemperatureEvent')
            ->filter('device_id', '=', 'abc123')
            ->filter('published_at', '>=', '2018-09-19T04:52:01.429Z')
            ->order('published_at', Query::ORDER_ASCENDING)
            ->projection(['current_temperature', 'target_temperature', 'published_at']);

        $results = $this->datastore->runQuery($query);

        dump("GDS query finished " . date("Y-m-d H:i:s"));

        return $this->transformData($results);
    }

    private function transformData($results)
    {
        dump("GDS transform starting " . date("Y-m-d H:i:s"));

        $data = [];

        foreach ($results as $result) {
            $row = $result->get();

            $data["current_temperature"][] = [
                Carbon::parse($row["published_at"])->timestamp * 1000,
                (float)$row["current_temperature"]
            ];

            $data["target_temperature"][] = [
                Carbon::parse($row["published_at"])->timestamp * 1000,
                (float)$row["target_temperature"]
            ];
        }

        dump("GDS transform finished " . date("Y-m-d H:i:s"));

        return $data;
    }
}

Below is the output of my dump() statements:

"GDS query starting 2018-09-21 02:59:00"
"GDS query finished 2018-09-21 02:59:00"
"GDS transform starting 2018-09-21 02:59:00"
"GDS transform finished 2018-09-21 02:59:13"

From this, I would understand that the query is fast but the processing of the data is slow. But that doesn't make sense: I am running a foreach loop over data that has (I assume) already been fetched, so I would expect it to be pretty fast.

When does the library actually fetch the rows? Judging by the timings, I would almost assume that it is during the process of getting the data for that row?

In an attempt to optimise this code and make it faster, I thought I might just take every 50th row:

    private function transformData($results)
    {
        dump("GDS transform starting " . date("Y-m-d H:i:s"));

        $data = [];
        $counter = 0;

        foreach ($results as $result) {
            if ($counter % 50 == 0) {
                $row = $result->get();

                $data["current_temperature"][] = [
                    Carbon::parse($row["published_at"])->timestamp * 1000,
                    (float)$row["current_temperature"]
                ];

                $data["target_temperature"][] = [
                    Carbon::parse($row["published_at"])->timestamp * 1000,
                    (float)$row["target_temperature"]
                ];
            }

            $counter++; 
        }

        dump("GDS transform finished " . date("Y-m-d H:i:s"));

        return $data;
    }

However, the processing time is just as long.

Given what I am trying to achieve, how can I speed this up?

@jdpedrie jdpedrie added type: question Request for information or clarification. Not an issue. api: datastore Issues related to the Datastore API. labels Sep 21, 2018
@jdpedrie
Contributor

Hi @jryd, thanks for bringing this to our attention!

I've been doing some testing on my side, and using a data set of 2000 rows (similar, I think, but not identical to yours), I'm seeing a total execution time of under two seconds to read and iterate through the set.

Google Cloud PHP doesn't actually send the runQuery request until you start iterating through $results, so it's no surprise that the bulk of the execution time is within that step. Additionally, Datastore does paginate large result sets, so it's likely that 3-5 requests are made.

I'm curious if your filters may be causing the query to run slower than it otherwise would. If you removed, temporarily, the filters and instead used Query::limit() to return a similar number of results, does that impact the response time at all?
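To see where the time actually goes, it can help to time the iteration separately from the transformation. A minimal sketch, assuming the same client setup as in your original class (note that `runQuery()` itself sends no RPC):

```php
<?php
use Google\Cloud\Datastore\DatastoreClient;

// Sketch only: assumes credentials are configured as in the original class.
$datastore = new DatastoreClient(['keyFilePath' => __DIR__ . '/auth.json']);

$query = $datastore->query()
    ->kind('TemperatureEvent')
    ->limit(1200);

$start = microtime(true);
$results = $datastore->runQuery($query); // lazy iterator; no network traffic yet
printf("runQuery() returned in %.4fs\n", microtime(true) - $start);

$start = microtime(true);
$rows = iterator_to_array($results); // the paginated RPCs actually happen here
printf("Fetched %d rows in %.4fs\n", count($rows), microtime(true) - $start);
```

If the second timing dominates, the cost is in the requests themselves rather than in your transformation loop.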

@jryd
Author

jryd commented Sep 21, 2018

Hi @jdpedrie!

Thanks for looking into this with me.

I updated my query to simply be:

$query = $this->datastore->query()
            ->kind('TemperatureEvent')
            ->limit(1200);

And it is still taking just as long.

Could latency be an issue here? I had set up the datastore using 'Cloud Firestore in Datastore' mode, and had to pick the us-east-1 location (I'm in Australia).
I've just spun a new project up to use 'Cloud Datastore' mode and used the AU location.
Will comment back with results once I get some data streamed in.

@jryd
Author

jryd commented Sep 25, 2018

@jdpedrie so I've been testing this over the last few days.

I've set the datastore up in the Sydney datacentre, and I'm fetching the results from my local PC here in Brisbane.

The query is fetching 714 rows currently, and taking 5 seconds to execute and process in the foreach loop, which is double what you were seeing at 2000 rows...

What more could you suggest?

@jryd
Author

jryd commented Sep 25, 2018

I'm using the Laravel framework, and it doesn't look like its cache implementation implements the CacheItemPoolInterface needed for an authCache, as you suggest in this comment.

So I've pulled in the Symfony one you suggest and it is no quicker; still taking 5 seconds to execute.
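For anyone following along, wiring a Symfony PSR-6 adapter into the client looks roughly like this (a sketch; `FilesystemAdapter` is just one of the available adapters):

```php
<?php
use Google\Cloud\Datastore\DatastoreClient;
use Symfony\Component\Cache\Adapter\FilesystemAdapter;

// Any PSR-6 CacheItemPoolInterface implementation can back the auth token cache.
$datastore = new DatastoreClient([
    'keyFilePath' => __DIR__ . '/auth.json',
    'authCache'   => new FilesystemAdapter(),
]);
```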

@jdpedrie
Contributor

jdpedrie commented Sep 25, 2018

I'll work on seeing if I'm able to replicate the higher latency in the Sydney zone.

Could you try something, just to satisfy me that it's the request which is taking the bulk of the time rather than a code problem? The snippet below will use the library's connection to send requests and iterate through the set, but won't do any processing of the data.

$projectId = 'MY_PROJECT_ID';

$ds = \Google\Cloud\Core\Testing\TestHelpers::stub(DatastoreClient::class);
$conn = $ds->___getProperty('connection');

$cursor = null;
$done = false;

$now = microtime(true);

do {
    $res = $conn->runQuery([
        'projectId' => $projectId,
        'partitionId' => [
            'projectId' => $projectId
        ],
        'query' => [
            'kind' => [
                [
                    'name' => 'TemperatureEvent'
                ]
            ],
            'limit' => 1200,
            'startCursor' => $cursor
        ]
    ]);

    if ($res['batch']['moreResults'] === 'NOT_FINISHED') {
        $cursor = $res['batch']['endCursor'];
    } else {
        $done = true;
    }

    var_dump('iteration time: '. (microtime(true) - $now));
    $now = microtime(true);

} while (!$done);

Edit: If you have a very large data set, perhaps you should consider creating some indexes as well, to help optimize the queries you're executing. I'm not sure how much of an impact it would have, as it depends heavily on your data set, but it's worth a try and will be useful in the future anyway.

@jdpedrie
Contributor

I've been running some tests in the australia-southeast1 region, and I'm seeing considerably higher latency than from my usual us-east1 region. Some of that is due to the geographical distance, but it seems there may be something else going on as well.

australia-southeast1:

➜ php ds.php
string(31) "iteration time: 2.2753970623016"
string(31) "iteration time: 1.7546408176422"
string(30) "iteration time: 1.530779838562"

us-east1:

➜ php ds.php
string(32) "iteration time: 0.58208990097046"
string(32) "iteration time: 0.50783181190491"
string(32) "iteration time: 0.18584489822388"

@tmatsuo tmatsuo self-assigned this Sep 25, 2018
@jryd
Author

jryd commented Sep 25, 2018

@jdpedrie here's my results:

C:\xampp\htdocs\community-tools\app\Core\GDS\Datastore.php:156:string 'iteration time: 7.4898290634155' (length=31)
C:\xampp\htdocs\community-tools\app\Core\GDS\Datastore.php:156:string 'iteration time: 1.287367105484' (length=30)
C:\xampp\htdocs\community-tools\app\Core\GDS\Datastore.php:156:string 'iteration time: 1.2852709293365' (length=31)
C:\xampp\htdocs\community-tools\app\Core\GDS\Datastore.php:156:string 'iteration time: 0.47953295707703' (length=32)

I have already created a composite index for the query I am running:

indexes:
- kind: TemperatureEvent
  properties:
  - name: device_id
  - name: published_at
    direction: asc
  - name: current_temperature
  - name: target_temperature

@jryd
Author

jryd commented Oct 8, 2018

@tmatsuo have you made any progress looking into this?

@tmatsuo tmatsuo removed their assignment Dec 4, 2018
@jdpedrie
Contributor

@jryd do you see any performance improvement with my datastore-grpc branch? (Be sure you have ext-grpc installed and enabled.)

@joseph1125

joseph1125 commented Jan 29, 2019

@jryd do you see any performance improvement with my datastore-grpc branch? (Be sure you have ext-grpc installed and enabled.)

@jdpedrie there was a tiny performance improvement on the requests (around 20-30ms each); however, we believe the bottleneck is creating the connection the first time, which took around 200-300ms. We believe this can't be fixed, since as we understand it, PHP cannot maintain persistent connections between requests.

@jdpedrie
Contributor

@joseph1125 have you implemented an auth token cache?

use Google\Auth\Cache\SysVCacheItemPool;
use Google\Cloud\Datastore\DatastoreClient;

$datastore = new DatastoreClient([
    'authCache' => new SysVCacheItemPool()
]);

Any PSR-6 cache implementation will work. If you already use a persistent key/value store such as redis, that would be a great candidate for this case.
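For example, with Redis via the Symfony cache component (a sketch; the connection DSN is a placeholder for your own Redis instance):

```php
<?php
use Google\Cloud\Datastore\DatastoreClient;
use Symfony\Component\Cache\Adapter\RedisAdapter;

// RedisAdapter implements the PSR-6 CacheItemPoolInterface, so cached auth
// tokens survive across PHP processes and requests.
$redis = RedisAdapter::createConnection('redis://127.0.0.1:6379');

$datastore = new DatastoreClient([
    'authCache' => new RedisAdapter($redis),
]);
```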

@joseph1125

Thanks a lot, it significantly improves our performance.

@jryd
Author

jryd commented Feb 21, 2019

@jdpedrie sorry for the delay in getting back to you.

I went to integrate with gRPC today but your branch you reference is not there anymore. How do I get to testing this now?

@joseph1125

@jryd it is pushed to the latest version; all you need to do is upgrade.

@jryd
Author

jryd commented Feb 21, 2019

Thanks @joseph1125. As for actually utilising gRPC as opposed to REST calls: I installed gRPC in my Docker containers today; does the library just automatically detect and use it?

@joseph1125

joseph1125 commented Feb 21, 2019 via email

@jdpedrie
Contributor

@joseph1125 is correct. If gRPC is available, Google Cloud PHP will default to using it. You can verify whether gRPC is available by calling phpinfo() and searching the output for "grpc":

[screenshot: phpinfo() output showing the grpc extension section]
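Alternatively, a quick programmatic check (plain PHP; no assumptions beyond the extension name):

```php
<?php
// extension_loaded() returns true if the grpc extension is compiled in or
// enabled in php.ini; Google Cloud PHP picks gRPC automatically when present.
if (extension_loaded('grpc')) {
    echo "gRPC available: the client will use it by default.\n";
} else {
    echo "gRPC not available: the client will fall back to REST.\n";
}
```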

If you wish to switch between REST and gRPC, you can provide the transport argument to the DatastoreClient constructor:

use Google\Cloud\Datastore\DatastoreClient;

$datastore = new DatastoreClient([
    'transport' => 'rest', // available options are 'rest', 'grpc'.
]);

@jryd
Author

jryd commented Feb 22, 2019

Thanks @jdpedrie !

I tried to rerun the test you got me to do to see what impact using gRPC has - but the code doesn't seem to run anymore:

$projectId = 'my-project-xxxxx';

$ds = \Google\Cloud\Core\Testing\TestHelpers::stub(DatastoreClient::class);
$conn = $ds->___getProperty('connection');

$cursor = null;
$done = false;

$now = microtime(true);

do {
    $res = $conn->runQuery([
        'projectId' => $projectId,
        'partitionId' => [
            'projectId' => $projectId
        ],
        'query' => [
            'kind' => [
                [
                    'name' => 'TemperatureEvent'
                ]
            ],
            'limit' => 1200,
            'startCursor' => $cursor
        ]
    ]);

    if ($res['batch']['moreResults'] === 'NOT_FINISHED') {
        $cursor = $res['batch']['endCursor'];
    } else {
        $done = true;
    }

    var_dump('iteration time: '. (microtime(true) - $now));
    $now = microtime(true);
} while (!$done);

Gives me an error:

Google\ApiCore\ValidationException
Error decoding message: Given value cannot be converted to string.

I'd love to rerun the tests and see what a difference gRPC is making.

@jdpedrie
Contributor

jdpedrie commented Feb 22, 2019

Hi @jryd, running your script with dummy data I created, I do not receive an error. Can you share an example entity, represented as a PHP array, which I can insert and then read?

Please also make sure you're using the most up-to-date versions of the Google Cloud PHP clients.

@jdpedrie
Contributor

@jryd have you been able to figure this out?

@jryd
Author

jryd commented Apr 15, 2019

@jdpedrie I didn't spend any more time trying to recreate the dummy script, but I am happy to report that the performance is much better!

I'm using gRPC and I am seeing ~1500 to ~2000 rows loading within 4 or 5 seconds now.

@jdpedrie
Contributor

Excellent! I'm really glad the situation has improved. Please let us know if you run into any more problems. :)

@jim5359

jim5359 commented Apr 27, 2020

"~1500 to ~2000 rows loading within 4 or 5 seconds now"
I don't understand why this is closed. Maybe Datastore is just expected to be slow. Coming from MongoDB, this performance is totally unacceptable. I recently tried a project using the PHP library on PHP 7.2 and found that retrieving 2,800 records took ~6 seconds with a simple filter on an indexed field and a sort on the same field. I tried switching to gRPC, and I also tried the auth caching, but didn't see any noticeable improvement. I then switched over to MongoDB, and now the same queries take < 1s, which is what I'm used to.

It's unfortunate that Datastore is so slow.
