Poor performance with Mariadb, slower than ext4 #7573

Closed

rajil opened this issue May 28, 2018 · 9 comments
Labels: Status: Stale, Type: Performance

rajil commented May 28, 2018

System information

Type                   Version/Name
Distribution Name      Gentoo
Distribution Version
Linux Kernel           4.9.95-gentoo
Architecture           x86_64
ZFS Version            0.7.9-r0-gentoo
SPL Version            0.7.9-r0-gentoo

Describe the problem you're observing

I have a MariaDB database (dev-db/mariadb-10.1.31-r1) that performs poorly on ZFS; on ext4 it is quick.

On ext4 the query takes 18 seconds to complete.
On ZFS it takes 2 minutes 20 seconds.
On ZFS with /sys/module/zfs/parameters/zfs_compressed_arc_enabled=0, it takes 3 minutes 50 seconds.
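(For reference, the tunable was toggled at runtime along these lines; the setting reverts to the default on reboot unless made persistent as a zfs module option:)

echo 0 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled   # disable compressed ARC
cat /sys/module/zfs/parameters/zfs_compressed_arc_enabled        # verify; echo 1 restores the default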

Describe how to reproduce the problem

I am using MythTV 0.29 to populate an XMLTV database:

time /usr/bin/mythfilldatabase  --file --sourceid 1 --xmlfile /tmp/guide/guide.xml

Include any warning/errors/backtraces from the system logs

Here are some stats,

ext4

# /usr/share/bcc/tools/ext4dist 500
In file included from /virtual/main.c:3:
/lib/modules/4.9.95-gentoo/build/include/linux/fs.h:2700:9: warning: comparison of unsigned enum expression < 0 is always false [-Wtautological-compare]
        if (id < 0 || id >= READING_MAX_ID)
            ~~ ^ ~
1 warning generated.
Tracing ext4 operation latency... Hit Ctrl-C to end.
^C
23:34:20:

operation = read
     usecs               : count     distribution
         0 -> 1          : 533984   |****************************************|
         2 -> 3          : 106      |                                        |
         4 -> 7          : 103      |                                        |
         8 -> 15         : 293      |                                        |
        16 -> 31         : 66       |                                        |
        32 -> 63         : 3        |                                        |

operation = write
     usecs               : count     distribution
         0 -> 1          : 55377    |****************************************|
         2 -> 3          : 29       |                                        |
         4 -> 7          : 10       |                                        |
         8 -> 15         : 66       |                                        |

ZFS

# /usr/share/bcc/tools/zfsdist 500
In file included from /virtual/main.c:3:
/lib/modules/4.9.95-gentoo/build/include/linux/fs.h:2700:9: warning: comparison of unsigned enum expression < 0 is always false [-Wtautological-compare]
        if (id < 0 || id >= READING_MAX_ID)
            ~~ ^ ~
1 warning generated.
Tracing ZFS operation latency... Hit Ctrl-C to end.
^C
23:59:12:

operation = b'open'
     usecs               : count     distribution
         0 -> 1          : 1204     |****************************************|
         2 -> 3          : 25       |                                        |
         4 -> 7          : 5        |                                        |
         8 -> 15         : 3        |                                        |

operation = b'read'
     usecs               : count     distribution
         0 -> 1          : 275123   |****************************************|
         2 -> 3          : 10138    |*                                       |
         4 -> 7          : 18033    |**                                      |
         8 -> 15         : 1329     |                                        |
        16 -> 31         : 158      |                                        |
        32 -> 63         : 16       |                                        |
        64 -> 127        : 10       |                                        |
       128 -> 255        : 4419     |                                        |
       256 -> 511        : 171583   |************************                |
       512 -> 1023       : 60302    |********                                |
      1024 -> 2047       : 3719     |                                        |
      2048 -> 4095       : 2444     |                                        |
      4096 -> 8191       : 22       |                                        |

operation = b'fsync'
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 1        |*****                                   |
       128 -> 255        : 8        |****************************************|
       256 -> 511        : 3        |***************                         |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 1        |*****                                   |

operation = b'write'
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 5        |                                        |
         4 -> 7          : 287      |                                        |
         8 -> 15         : 42922    |******************************          |
        16 -> 31         : 55692    |****************************************|
        32 -> 63         : 1338     |                                        |
        64 -> 127        : 42       |                                        |
       128 -> 255        : 1        |                                        |
       256 -> 511        : 16       |                                        |
       512 -> 1023       : 81       |                                        |
      1024 -> 2047       : 294      |                                        |
      2048 -> 4095       : 11       |                                        |
      4096 -> 8191       : 2        |                                        |

ZFS with /sys/module/zfs/parameters/zfs_compressed_arc_enabled=0

# /usr/share/bcc/tools/zfsdist 500
In file included from /virtual/main.c:3:
/lib/modules/4.9.95-gentoo/build/include/linux/fs.h:2700:9: warning: comparison of unsigned enum expression < 0 is always false [-Wtautological-compare]
        if (id < 0 || id >= READING_MAX_ID)
            ~~ ^ ~
1 warning generated.
Tracing ZFS operation latency... Hit Ctrl-C to end.
^C
00:17:51:

operation = b'open'
     usecs               : count     distribution
         0 -> 1          : 1456     |****************************************|
         2 -> 3          : 19       |                                        |
         4 -> 7          : 5        |                                        |
         8 -> 15         : 1        |                                        |

operation = b'write'
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 6        |                                        |
         4 -> 7          : 281      |                                        |
         8 -> 15         : 47095    |*****                                   |
        16 -> 31         : 341626   |****************************************|
        32 -> 63         : 5725     |                                        |
        64 -> 127        : 376      |                                        |
       128 -> 255        : 11       |                                        |
       256 -> 511        : 9        |                                        |
       512 -> 1023       : 54       |                                        |
      1024 -> 2047       : 430      |                                        |
      2048 -> 4095       : 70       |                                        |
      4096 -> 8191       : 29       |                                        |
      8192 -> 16383      : 2        |                                        |

operation = b'read'
     usecs               : count     distribution
         0 -> 1          : 323412   |****************************************|
         2 -> 3          : 12762    |*                                       |
         4 -> 7          : 17341    |**                                      |
         8 -> 15         : 1825     |                                        |
        16 -> 31         : 159      |                                        |
        32 -> 63         : 30       |                                        |
        64 -> 127        : 43       |                                        |
       128 -> 255        : 124      |                                        |
       256 -> 511        : 114149   |**************                          |
       512 -> 1023       : 236825   |*****************************           |
      1024 -> 2047       : 11873    |*                                       |
      2048 -> 4095       : 4039     |                                        |
      4096 -> 8191       : 65       |                                        |

Mariadb config

# cat /etc/mysql/my.cnf | egrep -v "(^#.*|^$)"                  
[client]
port                                            = 3306
socket                                          = /var/run/mysqld/mysqld.sock
[mysql]
character-sets-dir=/usr/share/mysql/charsets
default-character-set=utf8
[mysqladmin]
character-sets-dir=/usr/share/mysql/charsets
default-character-set=utf8
[mysqlcheck]
character-sets-dir=/usr/share/mysql/charsets
default-character-set=utf8
[mysqldump]
character-sets-dir=/usr/share/mysql/charsets
default-character-set=utf8
[mysqlimport]
character-sets-dir=/usr/share/mysql/charsets
default-character-set=utf8
[mysqlshow]
character-sets-dir=/usr/share/mysql/charsets
default-character-set=utf8
[myisamchk]
character-sets-dir=/usr/share/mysql/charsets
[myisampack]
character-sets-dir=/usr/share/mysql/charsets
[mysqld_safe]
err-log                                         = /var/log/mysql/mysql.err
[mysqld]
character-set-server            = utf8
user                                            = mysql
port                                            = 3306
socket                                          = /var/run/mysqld/mysqld.sock
pid-file                                        = /var/run/mysqld/mysqld.pid
log-error                                       = /var/log/mysql/mysqld.err
basedir                                         = /usr
datadir                                         = /var/lib/mysql
skip-external-locking
key_buffer_size                         = 16M
max_allowed_packet                      = 4M
table_open_cache                        = 400
sort_buffer_size                        = 512K
net_buffer_length                       = 16K
read_buffer_size                        = 256K
read_rnd_buffer_size            = 512K
myisam_sort_buffer_size         = 8M
lc_messages_dir                 = /usr/share/mysql
lc_messages                     = en_US
bind-address                            = 127.0.0.1
log-bin
server-id                                       = 1
tmpdir                                          = /tmp/
innodb_buffer_pool_size = 128M
innodb_data_file_path = ibdata1:10M:autoextend:max:128M
innodb_log_file_size = 48M
innodb_log_buffer_size = 8M
innodb_log_files_in_group=2
innodb_flush_log_at_trx_commit = 1
innodb_lock_wait_timeout = 50
innodb_file_per_table
innodb_data_home_dir            = /var/lib/mysql/innodb
innodb_log_group_home_dir       = /var/lib/mysql/logs
innodb_doublewrite = 0
skip-name-resolve
loose-federated
[mysqldump]
quick
max_allowed_packet                      = 16M
[mysql]
[isamchk]
key_buffer_size                         = 20M
sort_buffer_size                        = 20M
read_buffer                             = 2M
write_buffer                            = 2M
[myisamchk]
key_buffer_size                         = 20M
sort_buffer_size                        = 20M
read_buffer_size                        = 2M
write_buffer_size                       = 2M
[mysqlhotcopy]
interactive-timeout
[mariadb]

ZFS dataset properties

# zfs get recordsize rpool/db
NAME      PROPERTY    VALUE    SOURCE
rpool/db  recordsize  128K     default
# zfs get recordsize rpool/db/logs
NAME           PROPERTY    VALUE    SOURCE
rpool/db/logs  recordsize  128K     local
# zfs get recordsize rpool/db/innodb
NAME             PROPERTY    VALUE    SOURCE
rpool/db/innodb  recordsize  16K      local
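(These datasets follow the wiki's MySQL layout. A layout like this is typically created along the following lines, with recordsize matched to InnoDB's 16K page size; names as above:)

zfs create rpool/db
zfs create -o recordsize=16K rpool/db/innodb    # InnoDB data files use 16K pages
zfs create -o recordsize=128K rpool/db/logs     # redo logs, mostly sequential writes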
ryao (Contributor) commented May 28, 2018

We have been talking about this on IRC. Here are some off-CPU time graphs that he also collected:

https://svgshare.com/i/6rA.svg
https://svgshare.com/i/6qm.svg

mysql is spending more than 7 times as much time off-CPU. This correlates well with the difference in run times.

What is happening here is that ext4 quickly queues things to be processed asynchronously. Our AIO implementation, on the other hand, is a compatibility shim that the standard permits rather than a proper asynchronous implementation. AIO writes in ZFS are handled asynchronously only after the buffer has been copied into the DMU, rather than before the copy. We also don't do any zero copy, while ext4 likely can. Reads, meanwhile, block until they have completed, and this workload is dominated by reads. Compressed ARC likely isn't doing us any favors here either.

At the very least, we need to modify the AIO implementation to operate asynchronously on reads so that the callback is done by the DMU. That should speed this up nicely.

alek-p (Contributor) commented May 28, 2018 via email

rajil changed the title from "Poor performance with Mysql, slower than ext4" to "Poor performance with Mariadb, slower than ext4" on May 28, 2018
ryao (Contributor) commented May 28, 2018

@alek-p He followed the tuning advice that I posted on the openzfs wiki:

http://open-zfs.org/wiki/Performance_tuning#MySQL

In this situation, his workload is hitting a genuine deficiency in our ZPL code. The AIO bits are a compatibility shim that does the absolute minimum needed to conform to the POSIX standard. I wrote it that way to avoid having to make more risky changes to the code base at the time.

I knew the day would come when we needed to replace it. I did not expect it to take so long before someone found a workload that suffered because of it. In any case, I think the compatibility shim did more good than harm, so I do not consider it a mistake on my part back then. Hopefully, I'll be able to make time to implement my vision for what the revision should be.

I still need to finish #7423, but the immediate fire there is out, so I am willing to take my time with it. We'll see which one I finish first.

rajil (Author) commented May 28, 2018

After adding innodb_use_atomic_writes = 0 to my.cnf, the query run time on ZFS is 43 seconds (with zfs_compressed_arc_enabled=1). Much better than before.

After a reboot, however, performance is back to poor speeds, ~3 minutes 9 seconds.
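For reference, the change is a single line in the [mysqld] section of the config posted above:

[mysqld]
innodb_use_atomic_writes = 0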

ryao (Contributor) commented May 28, 2018

I guess I was wrong about the shim doing more good than harm, at least in this case. I have updated the openzfs wiki to include this tip:

http://open-zfs.org/wiki/Performance_tuning#InnoDB

I regret that this issue eluded us for so long. :/

shodanshok (Contributor) commented May 29, 2018

Based on @rajil's comment above, the key to improving performance was disabling atomic writes (with innodb_use_atomic_writes=0) rather than disabling AIO (with innodb_use_native_aio=0). Based on the documentation found here, innodb_use_atomic_writes should only have an effect on very specific SSD and filesystem combinations.
So why did disabling atomic writes (and only them) have such a large effect on performance?

bunder2015 added the Type: Performance label on Jan 31, 2019
janetcampbell commented Apr 6, 2019

I have some suspicions here. If the dbdata dataset has logbias=throughput, it will be full of indirect sync writes, and those are awful for random reads: they likely double the IOPS needed for a given amount of space and nearly eliminate the ability to aggregate reads of either data or metadata. This is made worse because 16K is so small.

If the OP is still around and can supply details such as logbias and some zpool iostat -r output, it would be fairly clear. ZFS is not slower when properly set up. The instructions on the wiki will result in badly fragmented data and metadata and a high IOP count to service them, for a random read-heavy workload. Doing the following would likely do much better (a sketch of the corresponding commands follows the list):

32K recordsize (preserves locality of input data)
logbias=latency (preserves locality of data on disk)
primarycache=all (anything other than this usually indicates trouble)
raising zfs_immediate_write_sz to 64K (all data writes will be direct sync) or 256K (all sync writes to both data and log will be direct sync)
OR add an SLOG (even better)
raising txg timeout and dirty_data_sync (reduce RMW from partial writes)
decreasing zfs_sync_taskq_batch_pct (decrease impact of txg commit and rate limit RMW reads)
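A minimal sketch of those settings, assuming the rpool/db/innodb dataset from earlier in the thread; the echoed values are illustrative starting points, not tested recommendations:

zfs set recordsize=32K rpool/db/innodb    # affects newly written blocks only
zfs set logbias=latency rpool/db/innodb
zfs set primarycache=all rpool/db/innodb

echo 65536 > /sys/module/zfs/parameters/zfs_immediate_write_sz      # 64K: data writes become direct sync
echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout                # longer txg interval
echo 268435456 > /sys/module/zfs/parameters/zfs_dirty_data_sync     # bytes (zfs_dirty_data_sync_percent on newer releases)
echo 25 > /sys/module/zfs/parameters/zfs_sync_taskq_batch_pct       # reduce txg commit impact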

Edit: If you want to understand the problem, test the IO and disk in isolation to figure out whether your metadata is badly fragmented and whether the rest of the pipeline is healthy. Clear caches, then zfs send >/dev/null while you watch zpool iostat -r. This will likely show you many isolated 4K reads that cannot be merged. Then zfs send | zfs receive the data into another dataset, clear caches, and zfs send the new dataset >/dev/null while you watch again; you will likely see much better read IO merging.

If zfs send and receive can't get clean IO in the pool, nothing else will be able to. They make great testing tools for finding performance problems.
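A rough version of that test, assuming the rpool/db/innodb dataset from this thread; the export/import is a blunt way to empty the ARC between runs, which is impractical if the pool hosts the root filesystem (reboot instead in that case):

zfs snapshot rpool/db/innodb@fragtest
zpool export rpool && zpool import rpool        # empty the ARC
# in a second shell, watch request-size histograms: zpool iostat -r rpool 5
zfs send rpool/db/innodb@fragtest > /dev/null   # expect many unmergeable 4K reads if fragmented

# rewrite the data sequentially, then repeat the read
zfs send rpool/db/innodb@fragtest | zfs receive rpool/db/innodb_copy
zpool export rpool && zpool import rpool
zfs send rpool/db/innodb_copy@fragtest > /dev/null   # read IO should merge much better now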

janetcampbell commented

Eliminating RMW from kernel interactions with ZFS will also make a huge difference and allow you to use larger records efficiently:

#8590

stale bot commented Aug 24, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale label on Aug 24, 2020
stale bot closed this as completed on Nov 25, 2020
gedia mentioned this issue on Apr 29, 2023