
Need way to housekeep records in DB #1574

Closed
8 tasks
22vincetsang opened this issue Sep 23, 2019 · 79 comments
Labels
feat New feature or request. help wanted We are looking for help on this one.
Milestone

Comments

@22vincetsang

22vincetsang commented Sep 23, 2019

There are lots of records in Hydra DB that get accumulated over time and are never cleaned up.
The flush API (as I tested with 1.0) can only clean up access tokens, as stated in the documentation, but not records in at least these tables:

  • hydra_oauth2_authentication_request
  • hydra_oauth2_authentication_request_handled
  • hydra_oauth2_authentication_session
  • hydra_oauth2_code
  • hydra_oauth2_consent_request
  • hydra_oauth2_consent_request_handled
  • hydra_oauth2_pkce
  • hydra_oauth2_refresh

We need, at minimum, an API to flush all obsolete records, if not an automatic housekeeping mechanism, which I understand isn't in place today even for access tokens.

@aeneasr aeneasr added this to the v1.1.0 milestone Sep 23, 2019
@aeneasr aeneasr added the feat New feature or request. label Sep 23, 2019
@aeneasr
Member

aeneasr commented Sep 23, 2019

Yes, I checked, and flush currently only cleans up access tokens. Since this has never become an issue so far, even in large deployments, I'm pushing the urgency back a bit.

@aeneasr
Member

aeneasr commented Sep 23, 2019

PKCE records are removed when they are used, but it's still possible that old, dead entries stick around if the authorization code is never used.
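
For illustration only, cleaning up such dead entries could look roughly like this (a sketch in MySQL syntax; it assumes hydra_oauth2_pkce has a requested_at column and uses an arbitrary 24-hour threshold, which is not an official recommendation):

-- Sketch: drop PKCE entries whose authorization code was never exchanged,
-- treating anything older than 24 hours as dead (the threshold is an assumption).
DELETE FROM hydra_oauth2_pkce
WHERE requested_at < NOW() - INTERVAL 24 HOUR;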

@22vincetsang
Author

Thanks, that's helpful - and I just verified that the normal path won't leave a record there. So it should be safe to clean up these PKCE records (I still don't understand where they came from, as the auth code should always be consumed immediately by our mobile app/SPA... maybe there's some error case I'm not aware of...).

Looking forward to this enhancement so the other DB tables can be handled too. Looking at the other tables, it doesn't seem trivial to identify which records are safe to delete, so I dare not do it myself...

Thanks again!

@pereslava

Hello. Any progress on this?
It was in the v1.3.0 milestone; I'm running v1.4.6 now and there is still no API.

As described here, it is safe to clean up the tables manually, but the PR was not merged.

@aeneasr
Member

aeneasr commented Apr 23, 2020

Contributions are welcome; I think the threads, comments, and existing PRs shed light on what needs to be done!

@aeneasr aeneasr added the help wanted We are looking for help on this one. label Apr 23, 2020
@aeneasr aeneasr modified the milestones: v1.4.0, v1.5.0 Apr 23, 2020
@pereslava

pereslava commented Apr 25, 2020

@aeneasr
Let's discuss what you mean here:

All hydra_oauth2_(authentication|consent)_request that do not have a result for hydra_oauth2_(authentication|consent)_request_handled.challenge = hydra_oauth2_(authentication|consent)_request.challenge and where hydra_oauth2_(authentication|consent)_request.requested_at is older than a certain threshold.

That says to flush all unhandled, obsolete requests.
Then you say:
All hydra_oauth2_(authentication|consent)_request where one result exists for hydra_oauth2_(authentication|consent)_request_handled.challenge = hydra_oauth2_(authentication|consent)_request.challenge and where hydra_oauth2_(authentication|consent)_request_handled.was_used=false and hydra_oauth2_(authentication|consent)_request.requested_at is older than a certain threshold.

I'm confused here: the first statement will delete all records, including those with was_used=false.
I guess the first statement is dangerous?

All hydra_oauth2_logout_request where accepted=false, rejected=false, was_used=false and (this field needs to be added) requested_at is older than a certain threshold.

That one I understand: deleting all unused, obsolete requests.

Also, the logout request currently does not have a TTL. Should this parameter be added, or can ttl.login_consent_request be used?

On the other hand, what is the reason to keep handled, used, and obsolete requests? Are they removed in some flow?
The same goes for hydra_oauth2_(authentication|consent)_request_handled: is it safe to flush those tables?
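
For reference, the two quoted 2019 criteria translate roughly into SQL like this (a sketch only, MySQL syntax, shown for the authentication tables; the 30-day threshold is an assumed example, not a recommended value):

-- Criterion 1: requests that were never handled and are past the threshold.
DELETE FROM hydra_oauth2_authentication_request
WHERE requested_at < NOW() - INTERVAL 30 DAY
  AND challenge NOT IN (
    SELECT challenge FROM hydra_oauth2_authentication_request_handled
  );

-- Criterion 2: requests that were handled but never used, past the threshold.
DELETE r
FROM hydra_oauth2_authentication_request r
  JOIN hydra_oauth2_authentication_request_handled h ON h.challenge = r.challenge
WHERE h.was_used = false
  AND r.requested_at < NOW() - INTERVAL 30 DAY;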

@aeneasr
Member

aeneasr commented Apr 25, 2020

Keep in mind, this comment is from 2019. The code has changed since then!

Maybe let's start with this: which tables accumulate the most records in your system? If we know those, we will know how to clean them up.

@pereslava

The tables I care most about:

  • hydra_oauth2_(authentication|consent)_request
  • hydra_oauth2_(authentication|consent)_request_handled
  • hydra_oauth2_logout_request

The question is: can we flush hydra_oauth2_(authentication|consent)_request_handled as well?

@aeneasr
Member

aeneasr commented Apr 25, 2020

How much data has accumulated for those? Are they growing fast? And if so, how fast?

@pereslava

@aeneasr
I'll check it again later, and hopefully will summarize the questions.

@aeneasr aeneasr modified the milestones: v1.5.0, unplanned Apr 25, 2020
@purmac

purmac commented Aug 11, 2020

Hi @aeneasr,

We are facing a similar issue in our production database: the

hydra_oauth2_(authentication|consent)_request
hydra_oauth2_(authentication|consent)_request_handled

tables grow quite quickly, at a rate of about 9,000 new records per day.
Is it safe to delete an arbitrary range, say anything before 2020-01-01, in the above tables if we need a quick purge to reduce the row counts?
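
For clarity, the one-off purge being asked about would be something like the following (a sketch only, not a recommendation; whether it is safe is exactly the open question, and requested_at as the relevant timestamp column is an assumption):

-- The purge being asked about: drop request rows created before 2020-01-01.
-- Depending on foreign keys, the matching *_request_handled rows either
-- cascade or have to be removed first.
DELETE FROM hydra_oauth2_authentication_request
WHERE requested_at < '2020-01-01';

DELETE FROM hydra_oauth2_consent_request
WHERE requested_at < '2020-01-01';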

@aeneasr
Member

aeneasr commented Aug 12, 2020

See #1987 @purmac

@purmac

purmac commented Aug 12, 2020

Hi @aeneasr

We are going to update to the latest version of Hydra, which should solve our slow query problem. We still want to tidy up some tables to keep the size within a cost margin. As you mentioned above, the latest version has had some code changes. Is the cleanup instruction in #1574 still relevant, or could you provide updated instructions on how to safely clean up those tables? We will only do a simple cleanup that removes records older than a certain time.

@aeneasr
Member

aeneasr commented Aug 13, 2020

Cleanup should now be more or less automatic for most flows. We have also introduced better foreign key checks since then. There might still be stale data, but it definitely shouldn't be too much. If you do find leaky tables in the latest version, please report them here :)

@purmac

purmac commented Aug 14, 2020

@aeneasr Let me follow up on that. The newer version of Hydra has a built-in cleanup function, so we shouldn't have to worry too much in the future. What about the existing aged data in the current tables? Will the new version of Hydra clean up the old data as well?

@aberasarte
Contributor

We are interested in this feature as well. The tables I'm seeing growing the most at the moment are hydra_oauth2_access and hydra_oauth2_refresh. We can clean up the former using the flush endpoint, but as the documentation states, this endpoint doesn't remove stale refresh tokens:

Refresh tokens can not be flushed as they are deleted automatically when performing the refresh flow.

When using the PKCE flow from an SPA, we store the refresh_token in memory for security reasons, so when the user closes the browser, the last received refresh_token is never going to be used or removed. In this closed PR @aeneasr said the following:

Removing refresh tokens will cause the refresh token chain to fail. It's important to keep track of all issued refresh tokens and access codes to be able to properly revoke them and their children later on. I believe this is true for OIDC as well.

I don't get this: once the refresh token has expired, it can't be used to refresh any access_token; therefore, it should be safe to remove it, right? Could you clarify?
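
For concreteness, the deletion the question implies would be something along these lines (a sketch; the 30-day interval stands in for the configured refresh token TTL and is an assumption, not Hydra's actual flush logic):

-- Sketch: remove refresh tokens issued before now minus the (assumed)
-- refresh token TTL, i.e. tokens that can no longer be redeemed.
DELETE FROM hydra_oauth2_refresh
WHERE requested_at < NOW() - INTERVAL 30 DAY;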

@maciej-kapusta

But aren't all the tokens with requested_at earlier than now - TTL already expired anyway?

@LAMASE

LAMASE commented Jan 11, 2021

With a few million users per day, the number of records can be a problem for me.
We started the OAuth server one year ago for a really small portion of our users, and we found that almost all the records we have today can probably be deleted (old authentication_requests that were never handled).
I'm following the discussion.

@SCLogo

SCLogo commented Jan 29, 2021

Testing with Locust, after 1 day we have 21 GB of outdated data. Any plan to review your cleanup process?

@aeneasr
Member

aeneasr commented Jan 29, 2021

Please provide useful information and context as mandated by our contribution guidelines. Also make sure to read the comments in this thread, as a lot of time was spent on this already.

@VladislavKravitski4888

As I wrote here, using the authorization code flow with refresh tokens could avoid the database growing so quickly. But for a few days now we have been using Ory Hydra in production for further use cases, gaining thousands of new users every day, and that is leading to the same problem. Our tables such as hydra_oauth2_authentication_request, hydra_oauth2_code, hydra_oauth2_consent_request, and hydra_oauth2_refresh grow by a few gigabytes every day. As I understood it, hydra_oauth2_refresh is cleaned automatically after TTL_REFRESH_TOKEN, but for the other ones we would need some sort of workaround...

@aeneasr
Member

aeneasr commented Jan 31, 2021

Are you running the flush command?

@VladislavKravitski4888

Yes, we are flushing with a request to "/oauth2/flush" automatically every x hours. As far as I can see in our logs, this only deletes rows from "hydra_oauth2_access". Or are there any dependencies on other tables?

@aeneasr
Member

aeneasr commented Feb 1, 2021

@VladislavKravitski4888

@aeneasr :-) Yes, I'm the author of it :-)

@aeneasr
Member

aeneasr commented Feb 1, 2021

Ah, I see, so the issue still persists? Could you maybe show some of the records you think should be purged from the database? Based on that, we can craft improved queries for the cleanup routine.

@LAMASE

LAMASE commented Feb 1, 2021

If those records are "technically valid" and therefore cannot be deleted automatically, there should be a configuration parameter to limit the number of records or the total space occupied in the database, sacrificing the least recently used / oldest ones, OR (better) a parameter to the flush command that does the same cleaning.

@aeneasr
Member

aeneasr commented Feb 1, 2021

Yes, I agree, but what would be really helpful is some real world examples of rows that you deem obsolete. If we get those, we can easily add this code. Without examples, we are walking in the dark!

@LAMASE

LAMASE commented Feb 1, 2021

You know why records are not deleted, so let's take any DB in real-world use:
what (and how) would you delete if you needed to shrink the DB to XX GB (or to a certain number of records, or handled sessions)?

(I'm still not in production. @VladislavKravitski4888, do you have some example data to show?)

@aeneasr
Member

aeneasr commented Feb 1, 2021

To the best of my knowledge, all records might be required for auditing or are required for a functioning system. That's why I am asking. Most people asking about this here are running load tests with synthetic data, which does not reflect real-world use. This has been explained and discussed in the earlier comments.

What we need to help you is concrete data, concrete examples, and ideas about which records are being kept that could be purged. Without this information, commenting +1 on this issue will not help anyone and we will keep going in circles.

@enrico-atzeni

enrico-atzeni commented Feb 2, 2021

I'm facing the same issue already described by @LAMASE and @VladislavKravitski4888.
I have Hydra in production with a daily average of 4k login requests started and only 100 completed. That averages out to more than 110k never-completed login requests logged in the database each month.

The real issue comes now that we are replacing separate logins with centralized single sign-on powered by Hydra, increasing the user base to a few million per day (from our statistics we expect at least a 30x increase in Hydra requests).

Of course we tried the flush command, but it only cleans the hydra_oauth2_access table, which was already small.
Digging into the database, here is some concrete data to help solve this issue:

  • table hydra_oauth2_access contains 21716 rows
  • table hydra_oauth2_authentication_request contains 1091349 rows
  • table hydra_oauth2_authentication_request_handled contains 23115 rows
  • table hydra_oauth2_authentication_session contains 1085028 rows

As I understand it, the hydra_oauth2_authentication_request table contains all started login requests, and when one is accepted, an entry is created in hydra_oauth2_authentication_request_handled.
hydra_oauth2_authentication_session instead contains all user sessions starting the login flow, so if a user comes back after a successful login with remember = true and starts the login flow again, we will find two authentication_request rows for only one authentication_session. Otherwise, with remember = false, we have a session row for each login request made.

Now the hint:

If a login request is never completed before its expiration, it should be deleted because it is useless to the user. The only use for such a row is to trigger a "The login request has expired" error message, but a new login flow must be started anyway, and "expired" or "not valid" mean the same thing to the user; therefore this operation does not penalize the user experience.
As written before, each non-completed authentication request started from a remember=false session has a dedicated row in the authentication_session table. This means we can safely delete the row from authentication_session, and the authentication_request row will be deleted on cascade (thanks to the foreign key).

Making login requests expire after 1 month in our system allows us to delete 937405 rows from each of the previously mentioned tables (hydra_oauth2_authentication_request and hydra_oauth2_authentication_session).
Running this task together with the flush command will dramatically decrease database size.

Take into account that in our application we could consider a session expired even after just 1 hour.

Running this query:

SELECT authsess.authenticated_at 
FROM hydra_oauth2_authentication_session authsess 
   JOIN hydra_oauth2_authentication_request authreq on authreq.login_session_id = authsess.id 
WHERE authreq.authenticated_at IS NULL 
   AND authsess.remember = 0 
ORDER BY authsess.authenticated_at ASC 
LIMIT 5;

gives me our very first uncompleted login sessions, dated more than a year ago:

+---------------------+
| authenticated_at    |
+---------------------+
| 2020-01-24 14:22:28 |
| 2020-01-24 14:22:51 |
| 2020-01-24 14:24:41 |
| 2020-01-24 14:25:38 |
| 2020-01-24 14:26:34 |
+---------------------+

and this is the number of uncompleted request sessions we can consider expired (more than 1 month old) and safely drop from the database without breaking the user experience:

SELECT COUNT(*) 
FROM hydra_oauth2_authentication_session authsess
    JOIN hydra_oauth2_authentication_request authreq on authreq.login_session_id = authsess.id
WHERE authreq.authenticated_at IS NULL 
    AND authsess.remember = 0
    AND authsess.authenticated_at < NOW() - INTERVAL 1 MONTH;

---
+----------+
| COUNT(*) |
+----------+
|   937405 |
+----------+
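
Put together, the cleanup described above would look roughly like this (a sketch, MySQL syntax, assuming the foreign key from hydra_oauth2_authentication_request to hydra_oauth2_authentication_session is declared ON DELETE CASCADE as described):

-- Sketch: delete stale, never-completed, remember=false sessions older than
-- one month; the matching hydra_oauth2_authentication_request rows are then
-- removed via the ON DELETE CASCADE foreign key.
DELETE authsess
FROM hydra_oauth2_authentication_session authsess
    JOIN hydra_oauth2_authentication_request authreq
        ON authreq.login_session_id = authsess.id
WHERE authreq.authenticated_at IS NULL
    AND authsess.remember = 0
    AND authsess.authenticated_at < NOW() - INTERVAL 1 MONTH;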

Please note: our application starts a new OAuth login request any time the user presses the login button; we don't try to cache or reuse the same login challenge in subsequent calls.

Let me know if I can share other specific data to help you.

@pharman
Contributor

pharman commented Feb 14, 2021

I can add some more production experience. We have 16m users and handle between 0.5-1m logins per day and around 250-500k token refreshes per day. We have the token flush command running nightly. Here are the table sizes in descending order. I'll keep this up to date to see how it's changing.

We have had Hydra in production for about 6 months, but we went from ~5m to ~16m users only a few weeks ago, so we don't have a clear picture yet. Throughout this year we'll ramp up to about 30m.

    schema_name     |                         relname                          |    size    | table_size
--------------------+----------------------------------------------------------+------------+-------------
 public             | hydra_oauth2_code                                        | 36 GB      | 38150152192
 public             | hydra_oauth2_refresh                                     | 33 GB      | 35775725568
 public             | hydra_oauth2_oidc                                        | 32 GB      | 34834407424
 public             | hydra_oauth2_authentication_request                      | 26 GB      | 28372320256
 public             | hydra_oauth2_consent_request                             | 20 GB      | 21775179776
 public             | hydra_oauth2_consent_request_handled                     | 3709 MB    |  3888947200
 public             | hydra_oauth2_authentication_session                      | 3180 MB    |  3334889472
 public             | hydra_oauth2_oidc_pkey                                   | 3047 MB    |  3194773504
 public             | hydra_oauth2_authentication_request_handled              | 3026 MB    |  3172638720
 public             | hydra_oauth2_access                                      | 2786 MB    |  2921840640

Update! After another week of running:

    schema_name     |                         relname                          |    size    | table_size
--------------------+----------------------------------------------------------+------------+-------------
 public             | hydra_oauth2_code                                        | 41 GB      | 43515076608
 public             | hydra_oauth2_refresh                                     | 39 GB      | 41881731072
 public             | hydra_oauth2_oidc                                        | 38 GB      | 40482250752
 public             | hydra_oauth2_authentication_request                      | 29 GB      | 31507857408
 public             | hydra_oauth2_consent_request                             | 23 GB      | 24580825088
 public             | hydra_oauth2_consent_request_handled                     | 4380 MB    |  4592943104
 public             | hydra_oauth2_authentication_session                      | 3637 MB    |  3814146048
 public             | hydra_oauth2_authentication_request_handled              | 3560 MB    |  3733069824
 public             | hydra_oauth2_access                                      | 2735 MB    |  2868174848

Update 2/3/21

    schema_name     |                         relname                          |    size    | table_size
--------------------+----------------------------------------------------------+------------+-------------
 public             | hydra_oauth2_code                                        | 48 GB      | 51242213376
 public             | hydra_oauth2_refresh                                     | 46 GB      | 48857759744
 public             | hydra_oauth2_oidc                                        | 43 GB      | 46416322560
 public             | hydra_oauth2_authentication_request                      | 33 GB      | 35010969600
 public             | hydra_oauth2_consent_request                             | 26 GB      | 27663220736
 public             | hydra_oauth2_consent_request_handled                     | 5091 MB    |  5337915392
 public             | hydra_oauth2_authentication_session                      | 4126 MB    |  4326875136
 public             | hydra_oauth2_oidc_pkey                                   | 4059 MB    |  4255842304
 public             | hydra_oauth2_authentication_request_handled              | 3949 MB    |  4141195264
 public             | hydra_oauth2_access                                      | 2418 MB    |  2535563264
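
For reference, figures like these can be produced with a query along these lines (a sketch, PostgreSQL-specific, and not necessarily the exact query used here):

-- Sketch (PostgreSQL): list Hydra relations by on-disk size, largest first.
SELECT schemaname AS schema_name,
       relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS size,
       pg_total_relation_size(relid) AS table_size
FROM pg_catalog.pg_statio_user_tables
WHERE relname LIKE 'hydra_%'
ORDER BY pg_total_relation_size(relid) DESC;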

@Benehiko
Contributor

Benehiko commented Mar 5, 2021

Hi all,

Please review this draft PR #2381
Add any concerns you have there and we'll see to it that it's addressed :)

aeneasr added a commit that referenced this issue Mar 8, 2021
See #1574 (comment)

Co-authored-by: hackerman <3372410+aeneasr@users.noreply.github.com>
@Benehiko Benehiko moved this from To do to Pending Review in Maintainer's Board Mar 22, 2021
Maintainer's Board automation moved this from Pending Review to Done Mar 24, 2021
@aeneasr
Member

aeneasr commented Mar 24, 2021

We are happy to announce that the hydra janitor will be part of the next 1.9.x release, due tomorrow (if everything works out). Use it with caution, as it is a new feature, but please report your results here! Thank you, everyone, for your input :)

@mohsen3

mohsen3 commented Jun 4, 2021

Hey, I noticed that janitor $DSN --requests cleans up the hydra_oauth2_authentication_request and hydra_oauth2_consent_request tables, but not hydra_oauth2_authentication_session. The latter is the second-largest table in our database right now. Is there a reason to keep those rows around while cleaning up the first two tables? (I suppose not.)

If not, I can create a PR for that.
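
A sketch of what that additional cleanup might look like (assumptions: MySQL syntax, a 30-day retention threshold, and that a session is safe to drop once no remaining login request references it; the janitor's actual query may differ):

-- Sketch: remove old authentication sessions that no remaining
-- login request references (the threshold is an assumed example).
DELETE s
FROM hydra_oauth2_authentication_session s
WHERE s.authenticated_at < NOW() - INTERVAL 30 DAY
  AND NOT EXISTS (
    SELECT 1
    FROM hydra_oauth2_authentication_request r
    WHERE r.login_session_id = s.id
  );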

@aeneasr
Member

aeneasr commented Jun 4, 2021

A PR would be welcome for this! You could also first create an issue and lay out the plan for the delete query :) Also, there's an ongoing refactoring of the system, see #2540.

@mohsen3

mohsen3 commented Jun 4, 2021

Thank you @aeneasr!
I created an issue: #2561

@Byjus-riwaz

Byjus-riwaz commented Jul 13, 2023

Is it fine to remove the hydra_oauth2_(authentication|consent)_request_handled.was_used=true records as well? What are the _request_handled tables being used for?
@aeneasr @pharman @enrico-atzeni
