Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Matomo database use utf8mb4 collation by default #9785

Closed
mattab opened this issue Feb 15, 2016 · 30 comments · Fixed by #15618
Closed

Make Matomo database use utf8mb4 collation by default #9785

mattab opened this issue Feb 15, 2016 · 30 comments · Fixed by #15618
Assignees
Labels
c: Security For issues that make Matomo more secure. Please report issues through HackerOne and not in Github.
Milestone

Comments

@mattab
Copy link
Member

mattab commented Feb 15, 2016

When we will require MySQL 5.5.3 in Piwik, it would be desired to make our database use utf8mb4 collation. This would solve some possibly weird issues and in general, is a best practise for properly managing unicode characters in MySQL.

Background information utf8mb4

Notes

  • we wouldn't need to convert the database
  • only use utf8mb4 if it's present (in case it could be disabled for some reasons)
@mattab mattab added the c: Security For issues that make Matomo more secure. Please report issues through HackerOne and not in Github. label Mar 13, 2016
@mattab
Copy link
Member Author

mattab commented Mar 13, 2016

This was originally suggested by @EgiX who recommends us to implement this improvement sooner rather than later. Requires MySQL > 5.5.3 - discussion to increase MySQL requirement covered in #9107

@ThaDafinser
Copy link
Contributor

👍 for this.

Just switched to MySQL 5.7 and it works fine with the new utf8mb4.

Also they plan to make this value the default in 5.8 maybe
http://mysqlserverteam.com/planning-the-defaults-for-mysql-5-8/

@slawa-dev
Copy link

slawa-dev commented Mar 28, 2019

I stumbled upon a problem where custom value was showing �����. Looking through the possible causes I found out that Matomo changed only to utf8_unicode_ci.
utf8_unicode_ci does not cover all unicode characters, but only those with up to 3-bytes. To cover all characters utf8mb4_unicode_ci must be used.

Can I just change the collation of database table to utf8mb4_unicode_ci? Will it cause any issues?
I guess I have to change some Matomo code too.
SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;
But then it will break on every update.

@alexhass
Copy link

This change require single file per table. Such a change is extreme problematic as many vservers have limis with open file handles and will run out of resources. I tried on my server and was made to revert... and my vserver has high limits... there are other hosters that have just 1/3 of mine.

@Findus23
Copy link
Member

@alexhass Can you expand what the encoding/collation of a database table has to do with the amount of files needed?
I'd expect MySQL/MariaDB to be able to handle utf8mb4_unicode_ci databases properly as they are using it as a default (I think).
But I also don't know that much about the details.

@alexhass
Copy link

To be able to use utf8mb4 with php you have requirements. One is that you need to reconfigure mysql to use one file per table (innodb_file_per_table=true). This increases the files from one large db file to thousands. As mysql opens all files and keep them open you will run out of file handles and get an error „Too many open files“. Than mysql cannot start and you are offline.

At https://www.drupal.org/project/utf8mb4_convert you can see what is required to make utf8mb4 working. I wish I could convert my drupal sites but I cannot because of thousands of open file handles. The hoster has limited to a number where one drupal install may be too much... note every single mysql table requires at least two file handles.

@Findus23
Copy link
Member

Hi,

Indeed on my MariaDB instance innodb_file_per_table seems to be enabled and indeed there are two files per table.
But even assuming you are running Matomo since 10 years ago (so 10122=240 files) and are using as many plugins as I am (56 other tables), that just means you have 352 files open which is nothing compared to the default global limit I found (65536) or the limit on my vServer (1597359).
Even the per-process limit of 1024 should be enough to run Matomo for many years without changing any system config.

Which brings me to the question of why the limits are so low on your server and why you can't increase them just high enough.

@alexhass
Copy link

alexhass commented Mar 28, 2019

I‘m not only running matomo on the server! Also keep in mind the operating system, dovecot, postfix, apache, every php file include and so on - all need additional file handles.

Mine is based on Virtuozzo virtualization and has 8gb ram, 200gb ssd and a typical limit is 1400 file handles for a 8,85€/month server. That is where vservers start and i‘m not willing to spend more money per month.

The file handle number cannot increased as the hoster force you to this limit. I asked them and was told to book a dedicated server for 40€/month. No no no...

If I‘m not wrong 1blue.de has a limit of ~800 file handles for 10€/month servers. That is common at this pricing and my hoster limits are the highest in Germany for entry level servers. Larger vserver for 16€ have ~2300 file handles.

One Drupal site has maybe 300-500 tables and I‘m running 7 of them.

@sgiehl
Copy link
Member

sgiehl commented Feb 14, 2020

@tsteur @mattab switching to utf8mb4 or at least activating it by default for new installs shouldn't be a problem.

only use utf8mb4 if it's present (in case it could be disabled for some reasons)

that would make it a lot more complicate, as we would need to check wich collation is actually in use before storing 4 byte chars and stuff like that.
Imho it I would be fine to require the collation for Matomo 4 as it should be very unlikely that it's not available...

@sgiehl
Copy link
Member

sgiehl commented Feb 14, 2020

I just tried to implement that, but actually it won't be that easy as simply converting all tables to utf8mb4. We have some tables where a varchar is part of a index. With utf8mb4 each char will need 4 bytes, but the default maximum index length is 767 bytes. So a varchar(200) will already be to big for a index. So either we make those fields smaller or we do not use utf8mb4 charset when it's not needed 🤔
@mattab @tsteur @diosmosis any thoughts?

@tsteur
Copy link
Member

tsteur commented Feb 15, 2020

Not sure when we have an index on varchar. Would it help to index only first say 30 char of varchar? That might be good anyway to keep index smaller

@sgiehl
Copy link
Member

sgiehl commented Feb 15, 2020

            'site_url'    => "CREATE TABLE {$prefixTables}site_url (
                              idsite INTEGER(10) UNSIGNED NOT NULL,
                              url VARCHAR(255) NOT NULL,
                                PRIMARY KEY(idsite, url)
                              ) ENGINE=$engine DEFAULT CHARSET=$charset
            ",

            'option'        => "CREATE TABLE `{$prefixTables}option` (
                                option_name VARCHAR( 255 ) NOT NULL,
                                option_value LONGTEXT NOT NULL,
                                autoload TINYINT NOT NULL DEFAULT '1',
                                  PRIMARY KEY ( option_name ),
                                  INDEX autoload( autoload )
                                ) ENGINE=$engine DEFAULT CHARSET=$charset
            ",

            'session'       => "CREATE TABLE {$prefixTables}session (
                                id VARCHAR( 255 ) NOT NULL,
                                modified INTEGER,
                                lifetime INTEGER,
                                data TEXT,
                                  PRIMARY KEY ( id )
                                ) ENGINE=$engine DEFAULT CHARSET=$charset
            ",

            'archive_numeric'     => "CREATE TABLE {$prefixTables}archive_numeric (
                                      idarchive INTEGER UNSIGNED NOT NULL,
                                      name VARCHAR(255) NOT NULL,
                                      idsite INTEGER UNSIGNED NULL,
                                      date1 DATE NULL,
                                      date2 DATE NULL,
                                      period TINYINT UNSIGNED NULL,
                                      ts_archived DATETIME NULL,
                                      value DOUBLE NULL,
                                        PRIMARY KEY(idarchive, name),
                                        INDEX index_idsite_dates_period(idsite, date1, date2, period, ts_archived),
                                        INDEX index_period_archived(period, ts_archived)
                                      ) ENGINE=$engine DEFAULT CHARSET=$charset
            ",

            'archive_blob'        => "CREATE TABLE {$prefixTables}archive_blob (
                                      idarchive INTEGER UNSIGNED NOT NULL,
                                      name VARCHAR(255) NOT NULL,
                                      idsite INTEGER UNSIGNED NULL,
                                      date1 DATE NULL,
                                      date2 DATE NULL,
                                      period TINYINT UNSIGNED NULL,
                                      ts_archived DATETIME NULL,
                                      value MEDIUMBLOB NULL,
                                        PRIMARY KEY(idarchive, name),
                                        INDEX index_period_archived(period, ts_archived)
                                      ) ENGINE=$engine DEFAULT CHARSET=$charset
            ",

@tsteur
Copy link
Member

tsteur commented Feb 16, 2020

I reckon it could be fine for these columns to limit to say 30 or 50 characters if that helps make it work?

@mattab
Copy link
Member Author

mattab commented Feb 17, 2020

Don't know if it's possible to limit to 30-50 characters the INDEX because it's also used in PRIMARY KEY?

it seems maybe we need to limit these fields to 191 chars in length? https://stackoverflow.com/questions/51934422/varchar-primary-key-cannot-be-longer-than-191

@sgiehl
Copy link
Member

sgiehl commented Feb 17, 2020

Limiting to 191 works only for keys that aren't combined with other fields. Otherwise we need to reduce it even more (by the amount of bytes the other field takes).
We could also enforce those columns to use utf8 charset. Guess none of them should contain 4 byte chars nevertheless

@alexhass
Copy link

I‘m strongly against mb4 format if not 100% needed. I cannot use matomo anymore than.

My vserver (18€/month) has a max open files limit that disallows me to convert to mb4 at all. mysql server requires two files per table and this means the same file handles once mysql starts up. You can no longer use one large innodb file that contains all tables. I have a medium vserver, but 2500 file handles are used very fast and than the server stops.

A root server is too expensive.

@tsteur
Copy link
Member

tsteur commented Feb 17, 2020

Cheers @alexhass I had no idea hosters limit these things. Is that happening as soon as one table is using mb4 ? Or would it be fine if a few tracking tables (say 3-5 tables) use mb4?

On the tables above we indeed wouldn't really need MB4 so be easiest to keep them the way they are.

@alexhass
Copy link

To run mb4 you need to enable innodb_file_per_table. The setting is global to mysql server and affects all tables. I‘m not aware that any other way exists.

@tsteur
Copy link
Member

tsteur commented Feb 17, 2020

👍 so we would need to check such a flag and only use MB4 if possible. Would that work in your case?

@slawa-dev
Copy link

slawa-dev commented Feb 19, 2020

Users have been waiting for utf8mb4 support for many years. Especially those with non English websites. It will solve a lot of issues.
Extremely limited servers can still run 3.x version of Matomo.
Utf8mb4 is the only way not to exclude certain languages that use character encoding with 4-bytes.

@alexhass
Copy link

alexhass commented Feb 19, 2020

Tsteuer: maybe.

The general problem with mb4 stays as is. This is not about excluding a language. In general I would be 100% pro mb4, but the current limitations in mysql are really bad. I have no idea why file per table is required by design.... but as I know it is required and my server is not extremly limited!

Just try not enforcing mb4 and leave an option to run without mb4, please.

@alexhass
Copy link

Maybe it is possible to detect file per table setting and use this setting to switch between utf8 and utf8mb4.

@slawa-dev
Copy link

slawa-dev commented Feb 19, 2020

Server version: 10.1.34-MariaDB MariaDB Server

show variables like 'innodb_file_format';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| innodb_file_per_table | ON    |
+-----------------------+-------+
1 row in set (0.00 sec)

https://docs.nextcloud.com/server/15/admin_manual/configuration_database/mysql_4byte_support.html

@sgiehl
Copy link
Member

sgiehl commented Feb 19, 2020

In MySQL 8 utf8mb4 will be the standard collation.

@alexhass is the utf8mb4 charset available on your system, or isn't it listed at all? E.g. does show collation like 'utf8mb4%'; return anything?

Imho we won't need utf8mb4 for all tables, it might be enough to convert the log tables...

@alexhass
Copy link

alexhass commented Feb 19, 2020

Yes it is available. Maybe you read my comments first. This is not about the mysql does not have the feature. It requires file per table and this causes the kernel to go out of file handles as there are limits.

@sgiehl
Copy link
Member

sgiehl commented Feb 19, 2020

I have read them. Actually I would have expected that utf8mb4 is not available if it would require innodb_file_per_table.
I'm not that deep into that topic, but I've read through the MySQL docs a bit, and it doesn't seem to generally require it. But if we need keys longer than 767 chars, it's a requirement (along with innodb_large_prefix option)

@Findus23
Copy link
Member

@alexhass I get your argument that enabling innodb_file_per_table (which is required for utf8mb4) is a bad idea when one is strongly limited by file handles.

What I don't understand is why one would be limited in file handles. On my vServer (Hetzner 20€/Month, but the same is true for the 3€ per Month server) the limit is (/proc/sys/fs/file-max) 1.6M
(of which 200K are used) and I'm assuming this is the Debian default. If I wanted, I could simply set it to something higher (on my desktop computer it seems to be 9*10^18).

The linux kernel confirms this default of 1.6 million handles for 16GB

  /*
   * One file with associated inode and dcache is very roughly 1K.
   * Per default don't use more than 10% of our memory for files. 
   */

https://serverfault.com/questions/716578/default-value-of-proc-sys-fs-file-max/716591

Increasing the file handler limit by 10K would allow 3000 more MySQL tables which should be enough for Matomo for a while and would worst case (if all of them are used) take up an additional 10MB of memory. So unless you are running on a microcontroller with severe memory limitations file handlers aren't scarce.

And it shouldn't matter to the Hoster what file limit I am using, if I want to spend a few MB more or less of my memory on it, which kernel I am using or even if I want to run OpenBSD or Haiku on my virtual server as thanks to KVM they just forward the low-level commands to the CPU.

I don't think allowing hosters to save single digit megabytes of memory is worth denying people to store data in their native language.

@alexhass
Copy link

I‘m on server4you vserver and the limits are the highest vserver limits on the market. 1blu and others have a lot lower limits than server4you has

@slawa-dev
Copy link

Just checked a cheap t3a.micro AWS VS with 1GB RAM and default Ubuntu. You get 95.6K file handlers.

@alexhass
Copy link

I’m not paying Amazon for underpowered and extreme expensive hosting. Find the 16gb machine with 512gb ssd for 16€ first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c: Security For issues that make Matomo more secure. Please report issues through HackerOne and not in Github.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants