
Config: Add support for PostgreSQL #47

Open
sokoow opened this issue Oct 24, 2018 · 61 comments
Labels
help wanted Well suited for external contributors! idea Feedback wanted / feature request priority Issue backed by early sponsors

Comments

@sokoow

sokoow commented Oct 24, 2018

Nice idea lads, I totally support it. Have you ever considered switching to Postgres? For the deployment size I'm predicting this is going to have, MySQL might be a somewhat suboptimal choice :)


Details on possible implementation strategies can be found in this comment:

@lastzero
Member

Not right now, but in general: anything to store a few tables will do... as simple and stable as possible... many developers are familiar with mysql, so that's my default when I start a new project. Tooling is also good.

sqlite is a very lean option, but obviously - if you run multiple processes or want to directly access / backup your data - it doesn't scale well or at all.

@lastzero lastzero added the idea Feedback wanted / feature request label Oct 24, 2018
@lastzero lastzero added the declined Cannot be merged / implemented at this time label Nov 17, 2018
@lastzero
Member

It became clear that we have to build a single binary for distribution to reach broad adoption. Differences between SQL dialects are too large to have them abstracted away by our current ORM library, for example when doing date range queries. They are already different between MySQL and sqlite.

For those reasons we will not implement Postgres support for our MVP / first release. If you have time & energy, you are welcome to help us. I will close this issue for now, we can revisit it later when there is time and enough people want this 👍

@sokoow
Author

sokoow commented Nov 17, 2018

ok, fair point - I was raising this because the cost of maintenance and troubleshooting at scale is much lower with Postgres, and lots of successful projects support it. So, from what you wrote about the differences, it seems you don't have pluggable ORM-like generic read/write storage methods just yet, right?

@lastzero
Member

@sokoow We do use GORM, but it doesn't help with search queries that use database specific SQL.

If you like to dive into the subject, DATEDIFF is a great example: MySQL and SQL Server use DATEDIFF(), Postgres seems to prefer DATE_PART() whereas sqlite only has julianday().
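As a concrete illustration, here is a minimal sketch using Python's stdlib sqlite3 module. Only the SQLite variant is executed; the MySQL and PostgreSQL equivalents are shown as comments and are assumed syntax, not taken from the project's code:

```python
import sqlite3

# Date-difference idioms differ per dialect (assumed equivalents):
#   MySQL:      SELECT DATEDIFF('2018-11-17', '2018-10-24');     -- returns 24
#   PostgreSQL: SELECT '2018-11-17'::date - '2018-10-24'::date;  -- returns 24
# SQLite has neither function; it only offers julianday():
con = sqlite3.connect(":memory:")
days = con.execute(
    "SELECT CAST(julianday('2018-11-17') - julianday('2018-10-24') AS INTEGER)"
).fetchone()[0]
print(days)  # 24
```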

It goes even deeper when you look into how tables are organized. You can't abstract and optimize at the same time. We want to provide the best performance to our users.

See

@sokoow
Author

sokoow commented Nov 17, 2018

No, that's a fair point - you're not the first project to face this challenge. Something to think about at a higher abstraction level.

@LeKovr

LeKovr commented Nov 17, 2018

If you have time & energy, you are welcome to help us.

I guess it won't be so hard, so I'd like to try.

@lastzero
Member

Getting it to work somehow at a single point in time is not hard, getting it to work with decent performance, finding developers who are comfortable with it and constantly maintaining the code is incredibly hard.

Keep in mind: You also need to maintain continuous integration infrastructure and effectively run all tests with every database.

@LeKovr

LeKovr commented Nov 20, 2018

Of course, the tests might be the same for every supported database, and this might be solved within #60.
Also, sqlite support will probably entail some architectural changes (like search using Bleve and db-driver-dependent SQL queries). It won't be hard to add PostgreSQL support after that. And maybe you'll find "developers who are comfortable with it" by then.

@lastzero
Member

@LeKovr Did you see how we renamed the label from "rejected" to "descoped"? 😉 Yes indeed, later it might be great to add support for additional databases, if users actually need it in practice. Maybe everyone will be happy with an embedded database if we do it well. It is hard to predict.

What I meant was that if you change some code that involves SQL you might feel uncomfortable because you only have experience with one database, so you end up doing nothing. And that can be very dangerous for a project.

@LeKovr

LeKovr commented Nov 20, 2018

@lastzero, you are right. Maybe later - there are more important things to do for now.

@bobobo1618

I had a quick look and it looks like the queries at least are trivial to add. The biggest problem is the models. The varbinary and datetime types are hard-coded into the models but don't exist in PostgreSQL, so the migration fails.

I'm not sure what the solution is here. I'd guess that the solution is to use the types Gorm expects (e.g. []byte instead of string when you want a column filled with bytes) but there's probably a good reason why it wasn't done that way to start with.

I'll play with it some more and see. It'd be nice to put everything in my PostgreSQL DB instead of SQLite.

@LeKovr

LeKovr commented Jul 9, 2020

The varbinary and datetime types are hard-coded into the models

Maybe creating a Postgres domain named varbinary would help.

@bobobo1618

All of the varbinary have different lengths and seem to have different purposes, so I don't think that'll help unfortunately.

@lastzero
Member

lastzero commented Jul 9, 2020

Yes, we use binary for plain ASCII, especially when strings need to be sorted, indexed or compared and should not be normalized in any way.

@bobobo1618

Shouldn't that be the case by default for string fields? I know MySQL does some stupid stuff with character encodings but it shouldn't modify plain ASCII, right?

@lastzero
Member

lastzero commented Jul 9, 2020

But it uses 4 BYTES per ASCII character, so the index becomes very big. Also when you compare strings, it's somewhat more complex with unicode than just to compare bytes. I'm aware you can PROBABLY do the same with VARCHAR with the right settings and enough time to test, but it was hard to see business value in such experiments.

@bobobo1618

But it uses 4 BYTES per ASCII character

As far as I can tell looking at the SQLite docs, the MySQL docs and PostgreSQL docs, that isn't the case at all. A varchar uses a 1-4 byte prefix depending on the size of the field but each byte of payload consumes one byte of storage.

Also when you compare strings, it's somewhat more complex with unicode than just to compare bytes.

But we're not storing unicode, we're storing ASCII in a field that could contain unicode. I don't think any of those edge-cases apply here.

I'm aware you can PROBABLY do the same with VARCHAR with the right settings and enough time to test, but it was hard to see business value in such experiments.

Fair enough.

Also, queries aren't so straightforward after all. The queries extensively use 0 and 1 instead of false and true, which isn't supported by PostgreSQL (and as a side note, makes the query more difficult to read, since you don't know if it's meant to be a boolean comparison or an integer comparison).

I managed to do a little bit of cleanup of that and managed to get something working at least.
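The dialect gap described above is easy to reproduce. A minimal sketch with the stdlib sqlite3 module (the PostgreSQL behavior is described in a comment only, not executed; the table is a simplified stand-in, not the project's schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, photo_private BOOLEAN)")
con.execute("INSERT INTO photos (photo_private) VALUES (0), (1)")

# SQLite (and MySQL) store "boolean" values as plain integers, so comparing
# against 0/1 works fine:
visible = con.execute(
    "SELECT COUNT(*) FROM photos WHERE photo_private = 0"
).fetchone()[0]
print(visible)  # 1

# PostgreSQL would reject `photo_private = 0` for a true boolean column
# ("operator does not exist: boolean = integer"); it expects
# `photo_private = FALSE` or `NOT photo_private` instead.
```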

@lastzero
Member

lastzero commented Jul 9, 2020

Not in the index, check again. Maybe also not in memory when comparing.

@bobobo1618

I couldn't find documentation so I just ran a quick test to see.

import sqlite3

c = sqlite3.connect('test.db')
# Note: an explicit "id INTEGER PRIMARY KEY" column (a rowid alias in SQLite);
# the test data goes into the indexed testcolumn.
c.execute('CREATE TABLE things (id INTEGER PRIMARY KEY, testcolumn varchar(32))')
c.execute('CREATE INDEX test_idx ON things(testcolumn)')
for x in range(0, 10000):
    c.execute('INSERT INTO things(testcolumn) VALUES (?)', (hex(x * 472882049 % 15485867),))
c.commit()

Which resulted in 79.3k of actual data:

SELECT SUM(length(testcolumn)) FROM things;
79288

I analyzed it with sqlite3_analyzer.

Table:

Bytes of storage consumed......................... 167936
Bytes of payload.................................. 109288      65.1%
Bytes of metadata................................. 50517       30.1%

Index:

Bytes of storage consumed......................... 163840
Bytes of payload.................................. 129160      78.8%
Bytes of metadata................................. 30476       18.6%

So for 79288 bytes of actual data sitting in the column, we have 109288 bytes total for the data itself (1.38 bytes per byte) and 129160 for the index (1.63 bytes per byte).

I repeated the test with varbinary(32) instead of varchar(32) and got precisely the same result, down to the exact number of bytes.

So I don't see any evidence that a varchar consumes more space in an index than a varbinary.

@lastzero
Member

lastzero commented Jul 9, 2020

You'll find some information on this page: https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-conversion.html

You might also want to read this and related RFCs: https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings

Note that Microsoft, as far as I know, still uses UCS-2 instead of UTF-8 in Windows, for all the reasons I mentioned. Maybe they switched to UTF-16. Their Linux database driver for SQL Server used null-terminated strings - guess how well that works with UCS-2. Not at all.

For MySQL, we use 4-byte UTF-8 (utf8mb4), which needs 4 bytes per character in indexes unless somebody completely refactored InnoDB in the meantime. Note that the MySQL manual was wrong about InnoDB for a long time, insisting that MySQL doesn't know or support index-organized tables while InnoDB ONLY uses index-organized tables.

When you're done with this, enjoy learning about the four Unicode normalization forms: https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization

Did you know there's a difference between Linux and OS X? Apple uses decomposed form, so you need to convert all strings when copying files. Their bundled command line tools were not compiled with iconv support, so you had to compile them yourself. Some of this is still not fixed today.

@lastzero
Member

lastzero commented Jul 9, 2020

Note that Sqlite ignores VARBINARY and probably also VARCHAR to some degree. It uses dynamic typing. That's why all string keys are prefixed with at least one non-numeric character. It would convert the value to INT otherwise, and comparisons with binary data or strings would fail:

SQLite uses a more general dynamic type system. In SQLite, the datatype of a value is associated with the value itself, not with its container. The dynamic type system of SQLite is backwards compatible with the more common static type systems of other database engines in the sense that SQL statements that work on statically typed databases should work the same way in SQLite. However, the dynamic typing in SQLite allows it to do things which are not possible in traditional rigidly typed databases.

See https://www.sqlite.org/datatype3.html
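That conversion is easy to demonstrate with the stdlib sqlite3 module (a sketch; the table and column names are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# "VARBINARY" matches none of SQLite's declared-type rules for INTEGER, TEXT,
# BLOB, or REAL affinity, so the column falls through to NUMERIC affinity -
# and a purely numeric string is silently converted to an INTEGER on insert:
con.execute("CREATE TABLE t (k VARBINARY(42))")
con.execute("INSERT INTO t (k) VALUES ('123'), ('a123')")
types = [row[0] for row in con.execute("SELECT typeof(k) FROM t ORDER BY k")]
print(types)  # ['integer', 'text'] - integers sort before text in SQLite
```

Prefixing the key with a non-numeric character (here 'a123') is exactly what keeps the value stored and compared as text.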

@bobobo1618

I'm aware of Unicode encodings and some of the important differences between them. I still don't see anything in the docs indicating that using a varchar containing ASCII will consume 4 bytes in an index but I'll take your word for it.

To be clear, in case there's some miscommunication going on: my assumption is that even if the column is switched to varchar, plain ASCII (i.e. the first 128 Unicode code points, which are each encoded as a single byte in UTF-8) will still be stored in it. That being the case, 1 character = 1 byte and comparisons are bog-standard string comparisons.

In other news, here's a PoC of PostgreSQL mostly working. It's intended as an overview of the work that needs to be done, not as a serious proposal.

@bobobo1618

Actually, on string vs. []byte, it occurred to me: if you only want to store ASCII here and don't want to treat this as something that's semantically a string, is it a bad thing to use a []byte type? Is it the hassle of converting to/from strings when dealing with other APIs that's off-putting?

With []byte, gorm will choose an appropriate type for each DB by default.

@bobobo1618

bobobo1618 commented Jul 9, 2020

Ah, looks like the string vs. []byte is mostly solved by gorm V2 anyhow. You'll just be able to put type:bytes in the tag and it'll handle it for you.

@lastzero
Member

lastzero commented Jul 10, 2020

See https://mathiasbynens.be/notes/mysql-utf8mb4

The InnoDB storage engine has a maximum index length of 767 bytes, so for utf8 or utf8mb4 columns, you can index a maximum of 255 or 191 characters, respectively. If you currently have utf8 columns with indexes longer than 191 characters, you will need to index a smaller number of characters when using utf8mb4. (Because of this, I had to change some indexed VARCHAR(255) columns to VARCHAR(191).)
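The 255- and 191-character figures in that quote follow directly from the 767-byte prefix limit; a quick arithmetic check:

```python
# InnoDB's classic index-prefix limit:
INDEX_PREFIX_LIMIT = 767  # bytes

# utf8 (MySQL's legacy 3-byte subset) vs. utf8mb4 (full 4-byte UTF-8):
max_chars_utf8 = INDEX_PREFIX_LIMIT // 3
max_chars_utf8mb4 = INDEX_PREFIX_LIMIT // 4
print(max_chars_utf8, max_chars_utf8mb4)  # 255 191
```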

Maybe we can switch to []byte in Go. Let's revisit this later, there are a ton of items on our todo with higher priority and that's by far not the only change we need to support other databases.

Edit: As you can see in the code, I already implemented basic support for multiple dialects when we added Sqlite. For Postgres there's more to consider, especially data types. Sqlite pretty much doesn't care. Bool columns and date functions might also need attention. I'm fully aware Postgres is very popular in the GIS community, so it will be worth adding when we have the resources needed to implement and maintain it (see next comment).

@lastzero
Member

We also need to consider the impact on testing and continuous integration when adding support for additional storage engines and APIs. That's often underestimated and causes permanent overhead. From a contributor's perspective, it might just be a one time pull request. Anyhow, we value your efforts and feedback! Just so that you see why we're cautious.

@myxor

myxor commented Jan 19, 2022

Is there any news on PostgreSQL support?

@graciousgrey
Member

No, we are currently working on multi-user support, which is really an epic.
You can find a list of upcoming features on our roadmap: https://github.com/photoprism/photoprism/projects/5

@davralin

Not sure if there's a place to mention it - or if it's really a new issue - but how to migrate between databases would also be nice in addition to "just" supporting postgresql.

@lastzero
Member

@francisco1844

francisco1844 commented Mar 24, 2022

Is there a place where people can put money towards a particular feature? I think that would help to show how much existing and future users value a particular feature. Also, for many people it may be more appealing to put money towards a specific feature than to just make a donation and hope that the feature they need will eventually make it.

@francisco1844

I don't see PostgreSQL on the roadmap - or is it under a generic name for other DB support?

@graciousgrey
Member

Here you'll find an overview of our current funding options. Sponsors in higher tiers can give golden sponsor labels to features.

Is there a place where people can put money towards a particular feature?

Not anymore. While we like IssueHunt and are grateful for the donations we've received so far, it hasn't proven to be a sustainable funding option for us as we spend much of our time maintaining existing features and providing support.
If we don't have enough resources to provide support and bugfixes, we can't start working on new features.

@graciousgrey graciousgrey added priority Issue backed by early sponsors and removed low-priority Everything nice to have but not so important labels Mar 25, 2022
@lastzero
Member

Don't see Postgresql in the Roadmap

That's because we plan to support PostgreSQL anyway, ideally when there is less pressure to release new features than right now. We can't perform a major backend/database refactoring while pushing huge new features like multi-user support.

@dradux

dradux commented Jun 3, 2022

I would love to see postgres support! I'll contribute time, talent, and/or treasure.

@vyruss

vyruss commented Mar 23, 2023

I can also contribute Postgres knowledge & time.

@pashagolub

I can help you with PostgreSQL support.

@lastzero
Member

@pashagolub My apologies for not getting back to you sooner! We had to focus all our resources on the release and then needed a break. Any help with adding PostgreSQL is, of course, much appreciated. There are two basic strategies:

  1. Keep the current ORM (which doesn't support dynamic columns for the auto-migrations) and work around this by using only manual migrations for PostgreSQL. This seems doable to me with a few code changes, but needs to be tested before you invest a lot of time.
  2. Upgrading the ORM, which requires rewriting large chunks of code and re-testing every single detail. This approach seems cleaner, but it could also result in much more work and prevent us from releasing new features for some time, which might not be popular with some users (except those who are just waiting for PostgreSQL support, of course).

Should you decide to tackle this, I'm happy to help and give advice to the best of my ability. Also, if you have any personal questions, feel free to contact me directly via email so as to avoid notifying all issue subscribers on GitHub about a new comment.

@Tragen

Tragen commented May 30, 2023

The strategy should be doing 1 and then 2. ;)
But after 1 there is often no reason for 2.

@lastzero
Member

I often wish we had a more compatible, intuitive database abstraction layer. But compared to important core features that are still missing, like batch editing, this is not a big pain point at the moment and therefore not a top priority.

@pashagolub

  1. Keep the current ORM

Would you please name it? :) It's hard to find the name in the .mod file without actually knowing it :-)

Speaking about go.mod... I was surprised to see lib/pq dependency. :-D

@lastzero
Member

We currently use GORM v1, I was assuming this is mentioned/discussed in the comments above: https://v1.gorm.io/docs/

@pashagolub

Sorry. Missed that

@rustygreen

rustygreen commented Sep 8, 2023

Any update on when we can expect PostgreSQL support?

@lastzero
Member

lastzero commented Sep 9, 2023

We had several contributors who wanted to work on this. However, there is no pull request for it yet and so I can't tell you anything about the progress.

@fl0wm0ti0n

any news for postgres support?

@ezra-varady

Are there any contributors working on this atm? My team is interested in this feature, and I might be able to contribute some time

@lastzero
Member

lastzero commented Nov 4, 2023

@ezra-varady We appreciate any help we can get! To reiterate what I wrote above, there are two possible strategies:

  1. Keep the current GORM version (which does not support abstract/dynamic column types if you use auto-migrations to create/update the database schema) and work around this by using our manual migration package to maintain the PostgreSQL schema. This seems doable to me with a few changes in the internal/config package, though it should be tested as a proof-of-concept before you invest a lot of time.
  2. Upgrade GORM from v1 to v2, which requires rewriting large parts of the code and retesting every single detail. This approach may be beneficial in the long run, although it will probably also cause a lot more work and might prevent us from releasing new features for some time, which our users would not be happy about. For this reason, the entire work could of course also be done in a long-lived feature branch until everything is ready. However, you must then be prepared to resolve merge conflicts with our main branch (develop) from time to time until it can finally be merged.

Due to the higher chances of success (and because it doesn't block us from upgrading later), I would personally recommend going for (1), i.e. adding (a) manual migrations (for the initial setup of the database schema in the first step) and (b) hand-written SQL for custom queries for which the ORM is not used, for example:

switch DbDialect() {
case MySQL:
    res = Db().Exec(`UPDATE albums LEFT JOIN (
        SELECT p2.album_uid, f.file_hash FROM files f, (
            SELECT pa.album_uid, max(p.id) AS photo_id FROM photos p
            JOIN photos_albums pa ON pa.photo_uid = p.photo_uid AND pa.hidden = 0 AND pa.missing = 0
            WHERE p.photo_quality > 0 AND p.photo_private = 0 AND p.deleted_at IS NULL
            GROUP BY pa.album_uid) p2
        WHERE p2.photo_id = f.photo_id AND f.file_primary = 1 AND f.file_error = '' AND f.file_type IN (?)
    ) b ON b.album_uid = albums.album_uid
    SET thumb = b.file_hash WHERE ?`, media.PreviewExpr, condition)
case SQLite3:
    res = Db().Table(entity.Album{}.TableName()).
        UpdateColumn("thumb", gorm.Expr(`(
            SELECT f.file_hash FROM files f
            JOIN photos_albums pa ON pa.album_uid = albums.album_uid AND pa.photo_uid = f.photo_uid AND pa.hidden = 0 AND pa.missing = 0
            JOIN photos p ON p.id = f.photo_id AND p.photo_private = 0 AND p.deleted_at IS NULL AND p.photo_quality > 0
            WHERE f.deleted_at IS NULL AND f.file_missing = 0 AND f.file_hash <> '' AND f.file_primary = 1 AND f.file_error = '' AND f.file_type IN (?)
            ORDER BY p.taken_at DESC LIMIT 1
        ) WHERE ?`, media.PreviewExpr, condition))
default:
    log.Warnf("sql: unsupported dialect %s", DbDialect())
    return nil
}

Should you decide to tackle this, we will be happy to help and provide advice to the best of our ability. You are also welcome to contact us via email or chat if you have general questions that don't need to be documented as a public issue comment on GitHub.

@lastzero lastzero added the help wanted Well suited for external contributors! label Nov 4, 2023
@vnnv

vnnv commented Nov 24, 2023

@lastzero did you consider the option of removing GORM entirely and replacing it with something else? Perhaps a lightweight lib for db access, something similar to github.com/jmoiron/sqlx?

@stavros-k

@pashagolub

I think pgx is enough for most of the functionality. But again, if we want to be able to talk to different databases, we should come up with some kind of database abstraction. And an ORM is not the best choice, because the problem is not the relational-object mapping but the logic behind it.

@lastzero
Member

@vnnv @stavros-k @pashagolub Yes, of course we have also considered switching to a completely different library... There are many more choices now than when we started the project.

That said, some kind of abstraction seems necessary if we want to support multiple dialects with the resources we have. Also, I think it's a good idea to cover simple standard use cases instead of creating every single SQL query manually.

Either way, the amount of work required to switch to a different library would be even greater than what I described in my comment above as 2.: #47 (comment)

Even for 1. and 2. it seems extremely difficult to find contributors with the time and experience required, and my personal time is very limited due to the amount of support and feature requests we receive.

@stavros-k

That said, some kind of abstraction seems necessary if we want to support multiple dialects with the resources we have. Also, I think it's a good idea to cover simple standard use cases instead of creating every single SQL query manually.

What kind of abstraction are you looking for? I saw that it's regarding column types.
Do you have columns whose types you don't know beforehand?

If you do know them beforehand but they're "changing" frequently, Ent might be a better option, as you can extend the generated code with some Go templates. As for migrations, I would look into Atlas.

That being said, I've just been subscribed to this issue for a long time and thought I'd share what I found in my recent search for a db lib, as I was looking to start a mini side project.

I wish I had the experience to help with it.
