Speed of query retrieval #10
Comments
Hi Fil, I tested the API flow with random words selected from article titles. Here are the elapsed times for the API flow (in seconds, 100 rounds): t0_lucene_query query from article 0.554636 Please note that this test ran directly on the server, without the Mashape middleware, and covers only the background data flow, not the front end. The first three items together correspond to the first step in the front end, and the third and fourth items correspond to the second step. As you can see, Lucene itself is really fast. The problem is that the database query takes tens of seconds. The new network API does perform better. A possible solution would be to index and partition the database. However, I am not a database expert, so unfortunately I cannot make much progress on the performance. Thanks |
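For anyone who wants to repeat this kind of per-stage measurement, here is a minimal sketch of a timing harness. It is not the script used for the numbers above. The two stage functions are stand-ins, the stage name `t1_tweet_lookup` and the sample query words are made up for illustration, and reporting the mean over rounds is an assumption.

```python
import random
import time
from statistics import mean


def time_stages(stages, queries, rounds=100, seed=0):
    """Return the mean wall-clock seconds spent in each stage over `rounds` runs."""
    rng = random.Random(seed)
    samples = {name: [] for name, _ in stages}
    for _ in range(rounds):
        value = rng.choice(queries)  # one random query word per round
        for name, func in stages:
            start = time.perf_counter()
            value = func(value)  # the output of each stage feeds the next one
            samples[name].append(time.perf_counter() - start)
    return {name: mean(times) for name, times in samples.items()}


# Hypothetical stand-ins for the real stages (search index, then database).
def fake_lucene_query(word):
    return [1, 2, 3]  # pretend these are matching article IDs


def fake_tweet_lookup(article_ids):
    time.sleep(0.001)  # pretend the database is the slow part
    return [10 * i for i in article_ids]


stages = [
    ("t0_lucene_query", fake_lucene_query),
    ("t1_tweet_lookup", fake_tweet_lookup),  # hypothetical stage name
]
report = time_stages(stages, ["news", "vote", "health"], rounds=20)
for name, seconds in report.items():
    print(f"{name}: {seconds:.6f} s")
```

Swapping the stand-ins for the real calls should show directly which stage dominates.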
Thank you @shaochengcheng, that explains it clearly. I attributed the delay of the first phase to Lucene when in fact it is the retrieval of the tweets. I wonder if we could speed up tweet retrieval by better indexing. @glciampaglia can we discuss this? Also, I understand that the network API is faster now. Thank you for that too! I expected a larger speedup because I thought that the network API now uses the edge table (per issue #4). Is that a separate issue still being worked on? |
Thank you @shaochengcheng for running this analysis. This explains the bottleneck perfectly. I think that adding indexing to the article_sharing query could speed up things significantly, like what happened with the Botometer database. I can work on it. Could you please point me to the source code of the article_sharing query? What about the new API? Is it also an SQL query, or are you still parsing things in Python? Perhaps we could add indexes there too. @filmenczer let's talk about this on Monday, if you are around. |
@shaochengcheng -- the table Please update the code that creates the table to add this new index, then you can close this issue. Thanks! |
Hi @filmenczer and @glciampaglia I am not sure whether we need an extra index on table
As you can see, the index is already there,
Thus I think, table Am I right? Thanks |
Giovanni will answer more definitely, but as I recall:
So I think that the index was needed. |
The index is composite, so when you look up a row by URL ID you are doing a partial lookup. However, with b-tree indexes (like the one in that table) a partial lookup only works if you are using the leftmost part of the index. In other words, the index was being used when the lookup key was the tweet ID, but not the other way round. Adding another index on (URL ID, Tweet ID) does the trick. |
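The leftmost-prefix behavior described above can be reproduced with a small, self-contained sketch. The table name `tweet_url`, the column names, and the index names are hypothetical. SQLite is used only because it ships with Python, and the project's actual database engine may word its query plans differently, but the b-tree rule is the same.

```python
import sqlite3


def plan(conn, sql):
    """Return the human-readable query plan for `sql` as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan detail


conn = sqlite3.connect(":memory:")
# Hypothetical association table with a composite index led by tweet_id.
conn.execute(
    "CREATE TABLE tweet_url (tweet_id INTEGER NOT NULL, url_id INTEGER NOT NULL)"
)
conn.execute("CREATE INDEX idx_tweet_url ON tweet_url (tweet_id, url_id)")
conn.executemany(
    "INSERT INTO tweet_url VALUES (?, ?)", [(i, i % 50) for i in range(1000)]
)

by_tweet = "SELECT url_id FROM tweet_url WHERE tweet_id = 7"
by_url = "SELECT tweet_id FROM tweet_url WHERE url_id = 7"

# Lookup on the leftmost column: the index can be searched directly.
plan_tweet = plan(conn, by_tweet)
# Lookup on the second column only: the planner falls back to a full scan.
plan_url_before = plan(conn, by_url)

# Add the reversed composite index, as suggested in the comment above.
conn.execute("CREATE INDEX idx_url_tweet ON tweet_url (url_id, tweet_id)")
plan_url_after = plan(conn, by_url)

print("by tweet_id:           ", plan_tweet)
print("by url_id (before):    ", plan_url_before)
print("by url_id (new index): ", plan_url_after)
```

In the printed plans, a line starting with `SEARCH` means the index is used to jump to the matching rows, while `SCAN` means every entry is visited.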
Btw Clayton pointed out that hash indexes would be even faster than b-tree indexes. I am not sure we need the extra speed at the moment though. |
Closed via 3dea321 |
We noticed that the search from the article search engine (Lucene) is often very slow. Can you please do some tests with random queries? We should find the bottleneck: is it Lucene? the database? the API? the network?