Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collation and charset #62

Closed
ccinelli opened this issue Jun 28, 2019 · 10 comments
Closed

Collation and charset #62

ccinelli opened this issue Jun 28, 2019 · 10 comments

Comments

@ccinelli
Copy link

This code here is a little weird:
https://github.com/MariaDB/mariadb-connector-nodejs/blob/3eeba5ee1c4e9d7549aa857316db679c6af159a6/lib/config/connection-options.js#L28-L35 and it seems to mix up collation and charset.

Using charset: 'utf8mb4' throws and error. Adding collation: 'UTF8MB4_GENERAL_CI' also throw an error.

In order to work you need to use charset: 'UTF8MB4_GENERAL_CI'

Finally it looks like that if you pass charsetNumber: 12345678 you will get a Collations.fromIndex(224); //UTF8MB4_UNICODE_CI

In my humble opinion charset: 'utf8mb4' should work and use collation: 'UTF8MB4_GENERAL_CI'as default. Also collations should be specified withcollation:, no charset`.

@knoxcard
Copy link
Contributor

knoxcard commented Jun 29, 2019

My apologies, originally stated that my MariaDB/Sequelize connection settings matched your description, which is incorrect. Currently using charset: utf8mb4 instead of UTF8MB4_GENERAL_CI

Using Sequelize/ORM and this driver to interface with MariaDB.

Assuming that Sequelize does extra work behind the scenes, which could prevent error logging to my console.

@rusher - any thoughts?

My MariaDB/Sequelize Connection Code for Reference

   charset: 'utf8mb4',
   collate: 'utf8mb4_general_ci'
  var Sequelize = require('sequelize'),
    sequelize = new Sequelize(process.env.db_name, process.env.db_user, process.env.db_pass, {
      dialect: 'mariadb',
      port: process.env.db_port,
      pool: {
        min: 1,
        max: 60,
        idle: 10000
      },
      define: {
        charset: 'utf8mb4',
        timestamps: false,
        hooks: {
          afterCreate: (model, fn) => {
            app.get('functions').dbModelFormatJsonColumns(model, fn)
          },
          afterUpdate: (model, fn) => {
            app.get('functions').dbModelFormatJsonColumns(model, fn)
          }
        }
      },
      dialectOptions: {
        collate: 'utf8mb4_general_ci',
        useUTC: true,
        timezone: process.env.db_timezone
      },
      benchmark: false,
      logging: false
    })

@ccinelli
Copy link
Author

ccinelli commented Jul 1, 2019

Your dialectOptions has only:

        collate: 'utf8mb4_general_ci',

It does not work for me with both:

charset: 'utf8mb4',
collate: 'utf8mb4_general_ci'

@knoxcard
Copy link
Contributor

knoxcard commented Jul 3, 2019

This works for me as I do not see the error message you mentioned.

Can you post your full Sequelize connection code as well?
Try setting up your code to match mine below.

  var Sequelize = require('sequelize'),
    sequelize = new Sequelize(process.env.db_name, process.env.db_user, process.env.db_pass, {
      dialect: 'mariadb',
      dialectOptions: {
        timezone: process.env.db_timezone
      },
      port: process.env.db_port,
      pool: {
        min: 0,
        max: 2,
        idle: 10000
      },
      define: {
        charset: 'utf8mb4',
        collate: 'utf8mb4_general_ci',
        timestamps: false
      },
      benchmark: false,
      logging: false
    })

@rusher
Copy link
Collaborator

rusher commented Jul 5, 2019

Alright, there is really an issue there, because actual implementation accepts only collation name (like 'UTF8MB4_GENERAL_CI').
To be compatible with other drivers, option charset must be either a ... charset (ex: 'utf8mb4') or a collation name (ex: 'utf8mb4_esperanto_ci').

@knoxcard
Copy link
Contributor

knoxcard commented Jul 5, 2019

@rusher - you are absolutely right.

This now works without defining either charset or collate, as shown below...

  var Sequelize = require('sequelize'),
    sequelize = new Sequelize(process.env.db_name, process.env.db_user, process.env.db_pass, {
      dialect: 'mariadb',
      dialectOptions: {
        timezone: process.env.db_timezone
      },
      port: process.env.db_port,
      pool: {
        min: 0,
        max: 2,
        idle: 10000
      },
      define: {
        timestamps: false
      },
      benchmark: false,
      logging: false
    })

@knoxcard
Copy link
Contributor

knoxcard commented Jul 8, 2019

@ccinelli - close issue?

@rusher
Copy link
Collaborator

rusher commented Jul 8, 2019

done with commit: 5eb3847
So this will be in next 2.1.0 version

@rusher rusher closed this as completed Jul 8, 2019
@ccinelli
Copy link
Author

ccinelli commented Jul 8, 2019

Are you sure it will be backward compatible? Otherwise it should be in 3.0.0 assuming you use semver

@ccinelli
Copy link
Author

ccinelli commented Jul 8, 2019

Can you please take a second look at the code?

Let me add some context:

I had the usual problem of trying to store emojis in a table that was using the default latin1 as charset.

So I was trying to set a default in the sequelize so tables were going to be created with the "real" UTF8 as default. See https://medium.com/@adamhooper/in-mysql-never-use-utf8-use-utf8mb4-11761243e434

In SQL you usually use:

CREATE TABLE IF NOT EXISTS table_name (
...
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE utf8mb4_general_ci;

My understanding is that a collation adds specificity to a charset.

So my first attempt was to specify both in dialectOptions:

 charset: 'utf8mb4',
 collation: 'UTF8MB4_GENERAL_CI' 

But I had an error. Then I looked at the code and realized that in order to make it work I had to use:

  charset: 'UTF8MB4_GENERAL_CI'

That seemed wrong since that is a collation and I reported this issue.

How it should be

I thought that you should be able to specify both charset and collation. If the specified collation is not compatible with the charset then, it should error out.

If you specify only charset it should use the default collation.

If you specify only collation, it should work since collation implicitly specifies the charset.

How it works in the mysql npm module

However I dug around and I found that other databases have a different way to specify collations (if they even support them.) So it seem that if this driver follow the npm mysql conventions, we are good. Just make sure the documentation of mariadb-connector-nodejs is clear about this.

From: https://www.npmjs.com/package/mysql#connection-options

charset: The charset for the connection. This is called "collation" in the SQL-level of MySQL (like utf8_general_ci). If a SQL-level charset is specified (like utf8mb4) then the default collation for that charset is used. (Default: 'UTF8_GENERAL_CI')

This seems a little imprecise but keep things simple and makes sense.

What about this change?

Does the code after this change behave in this way? It is not completely clear from the diff but from the new tests it seems that charset needs to be replaced with collation.

If I am correct, this change will break existing code and still not behave as either the mysql driver or let you specify both charset and collation. If I am wrong, there should be tests to check these conditions.

@rusher
Copy link
Collaborator

rusher commented Jul 9, 2019

@ccinelli here is the new implementation:

The option charset now permit to indicate charset. the connector will use default collation associate to charset. For example charset: 'cp850' connector will use collation 'CP850_GENERAL_CI'.

The option collation permit to indicate collation, so 'CP850_GENERAL_CI' for example.

For compatibility, it's still possible to indicate a collation using the charset option (ex:charset: 'CP850_GENERAL_CI') but a warning will be logged.
That's why 2.1.0 is ok, and this is compatible with other drivers.

I'm still hesitant about multiplying options, but charset: 'CP850_GENERAL_CI' hurt my eyes so much I think it's better this way.

And since you've asked, that means this need better documentation, I'll see to that now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants