Speed of query retrieval #10

filmenczer · 2018-04-04T21:16:04Z

We noticed that the search from the article search engine (Lucene) is often very slow. Can you please do some tests with random queries? We should find the bottleneck: is it Lucene? the database? the API? the network?

shaochengcheng · 2018-04-07T13:13:15Z

Hi Fil,

I tested the API flow with random words selected from the title of articles. Here are the time escaped in the API flow (in seconds, 100 rounds):

t0_lucene_query query from article 0.554636
t1_article_filtering filtering disabled site 0.004863
t2_article_sharing query twitter sharing of the article. 11.400047
t3_network_building_old build network by old api 21.456027
t4_network_buiding_new build network by new api . 15.631910

Please note that this testing is running on the server directly without the mashape middleware and only background data flow without front-end part.

The sum of first three items is about our first step in the front-end. And third and four item is about the second step in the front-end. As you can see, Lucene itself is really fast. The problem is that the query of the database does take tens seconds. As you can see that the new network API did have better performance.

The possible solution could be indexing and partition the database. However, I am not a database expert. Unfortunately, I cannot make much progress on the performance.

Thanks
Chengcheng

filmenczer · 2018-04-07T21:30:32Z

Thank you @shaochengcheng that explains clearly --- I attributed the delay of the first phase to Lucene when in fact it is the retrieval of the tweets.

I wonder if we could speed up tweet retrieval by better indexing. @glciampaglia can we discuss this?

Also I understand that the network API is faster now. Thank you for that too! I expected a larger speedup because I thought that the network API now uses the edge table (per issue #4)? Is that a separate issue still being worked on?

glciampaglia · 2018-04-08T15:34:40Z

Thank you @shaochengcheng for running this analysis. This explains the bottleneck perfectly. I think that adding indexing to the article_sharing query could speed up things significantly, like what happened with the Botometer database. I can work on it. Could you please point me to the source code of the article_sharing query? What about the new API? Is it also an SQL query, or are you still parsing things in Python? Perhaps we could add indexes there too.

@filmenczer let's talk about this on Monday, if you are around.

filmenczer · 2018-04-09T21:16:39Z

@shaochengcheng -- the table ass_tweet_url had an index (tweet_id, url_id). Therefore when querying by url, it was not using the index, therefore it was slow. Giovanni and I created a new index (url_id, tweet_id). In this way, when querying by url, this is executed as an index scan and is MUCH faster!!!

Please update the code that creates the table to add this new index, than you can close this issue. Thanks!

shaochengcheng · 2018-04-10T01:50:34Z

Hi @filmenczer and @glciampaglia

I am not sure whether we need an extra index on table ass_tweet_url, because there is a unique constraint on it when creating. Let us look at the table info:

hoaxy=> \d+ ass_tweet_url
                                      Table "public.ass_tweet_url"
  Column  |  Type   |                         Modifiers                          | Storage | Description
----------+---------+------------------------------------------------------------+---------+-------------
 id       | integer | not null default nextval('ass_tweet_url_id_seq'::regclass) | plain   |
 tweet_id | integer |                                                            | plain   |
 url_id   | integer |                                                            | plain   |
Indexes:
    "ass_tweet_url_pkey" PRIMARY KEY, btree (id)
    "tweet_url_uq" UNIQUE, btree (tweet_id, url_id)
    "url_tweet" btree (url_id, tweet_id)
Foreign-key constraints:
    "ass_tweet_url_tweet_id_fkey" FOREIGN KEY (tweet_id) REFERENCES tweet(id) ON UPDATE CASCADE ON DELETE CASCADE
    "ass_tweet_url_url_id_fkey" FOREIGN KEY (url_id) REFERENCES url(id) ON UPDATE CASCADE ON DELETE CASCADE
Has OIDs: no

As you can see, the index is already there, "tweet_url_uq" UNIQUE, btree (tweet_id, url_id). And according to PostgreSQL docs

One should, however, be aware that there's no need to manually create indexes on unique columns; doing so would just duplicate the automatically-created index.

Thus I think, table ass_tweet_url does not need a manual index.

Am I right?

Thanks
Chengcheng

filmenczer · 2018-04-10T04:28:19Z

Giovanni will answer more definitely, but as I recall:

Before we added the index, the query was not running on the index, it was scanning the database. And it took several seconds (consistent with your measurements).
After we added the index, the query ran scanning only the index, and it was super fast (less than a second). The difference is very noticeable in the live demo.

So I think that the index was needed.

glciampaglia · 2018-04-10T13:11:11Z

The index is composite so when you look up a row by URL ID you are doing a partial lookup. However, with b-tree indexes (like the one in that table) this only works if you are using the leftmost part of the index. In other words, the index was being used when the reference was the tweet-ID, but not the other way round. Adding another index (URL ID, Tweet ID), does the trick.

glciampaglia · 2018-04-10T13:12:24Z

Btw Clayton pointed out that hash indexes would be even faster than b-tree indexes. I am not sure we need the extra speed at the moment though.

glciampaglia · 2018-05-08T21:23:58Z

Closed via 3dea321

filmenczer added the high priority label Apr 4, 2018

filmenczer assigned shaochengcheng and ZacMilano Apr 4, 2018

filmenczer changed the title ~~Speed of Lucene query retrieval~~ Speed of query retrieval Apr 7, 2018

glciampaglia closed this as completed May 8, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed of query retrieval #10

Speed of query retrieval #10

filmenczer commented Apr 4, 2018

shaochengcheng commented Apr 7, 2018

filmenczer commented Apr 7, 2018

glciampaglia commented Apr 8, 2018

filmenczer commented Apr 9, 2018

shaochengcheng commented Apr 10, 2018

filmenczer commented Apr 10, 2018

glciampaglia commented Apr 10, 2018

glciampaglia commented Apr 10, 2018

glciampaglia commented May 8, 2018

Speed of query retrieval #10

Speed of query retrieval #10

Comments

filmenczer commented Apr 4, 2018

shaochengcheng commented Apr 7, 2018

filmenczer commented Apr 7, 2018

glciampaglia commented Apr 8, 2018

filmenczer commented Apr 9, 2018

shaochengcheng commented Apr 10, 2018

filmenczer commented Apr 10, 2018

glciampaglia commented Apr 10, 2018

glciampaglia commented Apr 10, 2018

glciampaglia commented May 8, 2018