Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize queries table #1255

Merged
merged 10 commits into from
Jan 16, 2022
Merged

Optimize queries table #1255

merged 10 commits into from
Jan 16, 2022

Conversation

DL6ER
Copy link
Member

@DL6ER DL6ER commented Dec 16, 2021

By submitting this pull request, I confirm the following:

  • I have read and understood the contributors guide.
  • I have checked that another pull request for this purpose does not exist.
  • I have considered, and confirmed that this submission will be valuable to others.
  • I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
  • I give this submission freely, and claim no ownership to its content.

How familiar are you with the codebase?:

10


This PR optimizes the queries table by using link-tables with IDs instead of storing the domain, client and forward as strings over and over. This is not a breaking change to users as we provide a VIEW queries in place of the old TABLE queries that will give the same results for external queries as before:

VIEW queries

$ sqlite3 /etc/pihole/pihole-FTL.db -header "SELECT * FROM queries ORDER BY id DESC LIMIT 1;"                                                                                           
id|timestamp|type|status|domain|client|forward|additional_info
19391970|1639654453|2|2|forum.arduino.cc|192.168.2.206|127.0.0.1#5353|

TABLE query_storage (holding the real data)

$ sqlite3 /etc/pihole/pihole-FTL.db -header "SELECT * FROM query_storage ORDER BY id DESC LIMIT 1;"                                                                                           
id|timestamp|type|status|domain|client|forward|additional_info
19391970|1639654453|2|2|114|2|1|

In the example above, storage required per query before:

   4    sizeof(id)
+  4    sizeof(timestamp)
+  1    sizeof(type)
+  1    sizeof(status)
+ 16    strlen("forum.arduino.cc") + 1
+ 14    strlen("192.168.2.206") + 1
+ 15    strlen("127.0.0.1#5353") + 1
+  1    sizeof(NULL)
= 56 bytes per record

Storage required per query with the optimization of this PR:

   4    sizeof(id)
+  4    sizeof(timestamp)
+  1    sizeof(type)
+  1    sizeof(status)
+  1    sizeof(domain)
+  1    sizeof(client)
+  1    sizeof(forward)
+  1    sizeof(additional_info)
= 14 bytes per record

This reduction by factor 4 easily outperforms the tiny extra amount of space we need for the newly introduced link-tables. We chose this hybrid approach through the VIEW because we cannot afford rewriting the entire queries table as this would literally take forever on low-end devices such as Raspberry Pis from the early days.

This also lays the foundation for a requested feature that is availability of host names in the long-term database as the linked client ID are in fact a pair of (ip,hostname) in the database. This allows us to store both without any really noticeable overhead.

This PR comes with a tiny drawback (that isn't really one, though): After the database has been optimized (upgraded to version 11), older versions of FTL will not be able to write new queries to the database. It is possible to revert the upgrade of the database without data-loss, however, it is a multiple-step task and should not be done copy-paste-like. That's why I'm not giving the precise instructions to do this here.

edit We furthermore save some space on the additional_information field that con contain domains when blocking during deep CNAME inspection.

…ngs into a linking table and referencing them by an ID. This also allows storing (IP,hostname) pair for clients.

Signed-off-by: DL6ER <dl6er@dl6er.de>
Signed-off-by: DL6ER <dl6er@dl6er.de>
… that the prepared statement will be retained for a long time and probably reused many times. Without this flag, SQLite3 assumes that the prepared statement will be used just once or at most a few times and then destroyed relatively soon. The current implementation acts on this hint by avoiding the use of [lookaside memory] so as not to deplete the limited store of lookaside memory. Also, update the tests for the expected database schema after renaming the table and adding the new view.

Signed-off-by: DL6ER <dl6er@dl6er.de>
@DL6ER DL6ER marked this pull request as draft December 16, 2021 12:18
@DL6ER
Copy link
Member Author

DL6ER commented Dec 16, 2021

Putting this back to draft mode as we might want to explore another link-table for additional_info

Signed-off-by: DL6ER <dl6er@dl6er.de>
@DL6ER DL6ER force-pushed the new/optimize-queries branch 2 times, most recently from 277542b to 729ff31 Compare December 16, 2021 21:27
…nothing is actually INSERTed as we chose the OR IGNORE path.

Signed-off-by: DL6ER <dl6er@dl6er.de>
@DL6ER DL6ER marked this pull request as ready for review December 16, 2021 21:36
@DL6ER
Copy link
Member Author

DL6ER commented Dec 16, 2021

Ready for review. The additional_info column is now also linked to a corresponding table giving us the flexibility to add multiple properties to a query in the future.

@DL6ER
Copy link
Member Author

DL6ER commented Dec 23, 2021

Branch updated from latest development

dschaper
dschaper previously approved these changes Jan 9, 2022
src/database/common.c Show resolved Hide resolved
src/database/gravity-db.c Show resolved Hide resolved
@yubiuser
Copy link
Member

yubiuser commented Jan 9, 2022

After switching to that branch

[2022-01-09 21:00:00.718 2828/T2832] SQLite3 message: cannot modify queries because it is a view in "DELETE FROM queries WHERE timestamp <= 1633982400" (1)
[2022-01-09 21:00:00.719 2828/T2832] ERROR: SQL query "DELETE FROM queries WHERE timestamp <= 1633982400" failed: SQL logic error
[2022-01-09 21:00:00.719 2828/T2832] delete_old_queries_in_DB(): Deleting queries due to age of entries failed

@yubiuser
Copy link
Member

yubiuser commented Jan 9, 2022

This will need some documentation changes in https://docs.pi-hole.net/database/ftl/

E.g. addinfo_by_id meaning of type

…en couting them

Signed-off-by: DL6ER <dl6er@dl6er.de>
@DL6ER
Copy link
Member Author

DL6ER commented Jan 9, 2022

@yubiuser Thanks for the error message, you indeed found the spot where the renaming queries -> query_storage is missing. I pushed an update.

@yubiuser
Copy link
Member

Error is fixed
[2022-01-10 07:00:00.352 24148/T24152] Notice: Database size is 149.30 MB, deleted 2955 rows

Copy link
Member

@yubiuser yubiuser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remember the documentation.

Thanks for that update.

Do we need any modification on web? I know it will use the queries view now automatically, but generating the view can take some time. We might not need all columns from there, e.g. wenn computing "Long term graph"

@yubiuser
Copy link
Member

Just an idea: now that you shrank the size per record tremendously, should we save the reply as well?

@DL6ER
Copy link
Member Author

DL6ER commented Jan 11, 2022

I will use this PR as reminder by only merging when I had time to work on the documentation changes.

Do we need any modification on web? I know it will use the queries view now automatically, but generating the view can take some time. We might not need all columns from there, e.g. wenn computing "Long term graph"

We are not requesting these columns through the VIEW, hence, they are also not looked up. For instance, take https://github.com/pi-hole/AdminLTE/blob/682a174330d37c4bb8222ff371a3a4b7d363fafa/api_db.php#L264

SELECT COUNT(timestamp) FROM queries;

compiles to the following database engine bytecode:

addr  opcode         p1    p2    p3    p4             p5  comment      
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     11    0                    00  Start at 11  
1     Null           0     1     2                    00  r[1..2]=NULL 
2     OpenRead       1     2     0     8              00  root=2 iDb=0; query_storage
3     Rewind         1     7     0                    00               
4       Column         1     1     3                    00  r[3]=query_storage.timestamp
5       AggStep        0     3     1     count(1)       01  accum=r[1] step(r[3])
6     Next           1     4     0                    01               
7     AggFinal       1     1     0     count(1)       00  accum=r[1] N=1
8     Copy           1     4     0                    00  r[4]=r[1]    
9     ResultRow      4     1     0                    00  output=r[4]  
10    Halt           0     0     0                    00               
11    Transaction    0     0     22    0              01  usesStmtJournal=0
12    Goto           0     1     0                    00

Querying query_storage does not change anything:

SELECT COUNT(timestamp) FROM query_storage;
addr  opcode         p1    p2    p3    p4             p5  comment      
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     11    0                    00  Start at 11  
1     Null           0     1     2                    00  r[1..2]=NULL 
2     OpenRead       1     4     0     k(2,,)         00  root=4 iDb=0; idx_queries_timestamps
3     Rewind         1     7     3     0              00               
4       Column         1     0     3                    00  r[3]=query_storage.timestamp
5       AggStep        0     3     1     count(1)       01  accum=r[1] step(r[3])
6     Next           1     4     0                    01               
7     AggFinal       1     1     0     count(1)       00  accum=r[1] N=1
8     Copy           1     4     0                    00  r[4]=r[1]    
9     ResultRow      4     1     0                    00  output=r[4]  
10    Halt           0     0     0                    00               
11    Transaction    0     0     22    0              01  usesStmtJournal=0
12    Goto           0     1     0                    00

as the former already uses the real storage. The same holds true for all other queries.

Just an idea: now that you shrank the size per record tremendously, should we save the reply as well?

For v6.0, I already implemented storing everything visible on the Query Log (also reply time, and DNSSEC status). We can do this already now, that's a good idea. Something for a follow-up PR.

DL6ER added a commit to pi-hole/docs that referenced this pull request Jan 16, 2022
Signed-off-by: DL6ER <dl6er@dl6er.de>
@DL6ER DL6ER merged commit 48e7bd8 into development Jan 16, 2022
@DL6ER DL6ER deleted the new/optimize-queries branch January 16, 2022 21:05
@pralor-bot
Copy link

This pull request has been mentioned on Pi-hole Userspace. There might be relevant details there:

https://discourse.pi-hole.net/t/pi-hole-ftl-v5-14-web-v5-11-and-core-v5-9-released/53529/1

@pralor-bot
Copy link

This pull request has been mentioned on Pi-hole Userspace. There might be relevant details there:

https://discourse.pi-hole.net/t/delete-queries-error/53761/2

@pralor-bot
Copy link

This pull request has been mentioned on Pi-hole Userspace. There might be relevant details there:

https://discourse.pi-hole.net/t/adding-many-lists-via-shell/54356/2

@pralor-bot
Copy link

This pull request has been mentioned on Pi-hole Userspace. There might be relevant details there:

https://discourse.pi-hole.net/t/pi-hole-blocking-less-than-expected/54394/8

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants