Persistence of stale session ids in MnesiaCache #713

Closed
gVirtu opened this issue Sep 13, 2023 · 10 comments · Fixed by #714

@gVirtu

gVirtu commented Sep 13, 2023

Hello all! :)

I have recently noticed a peculiar scenario in a production API using Mnesia-backed Pow. I'm interested in reproducing it to get to the bottom of it, but I've had no success thus far.

For reasons yet unknown, some sessions seem to persist in the cache even after their TTL has passed.

We have set up an authentication plug similar to the one in the "How to use Pow in an API" tutorial, and we are using Pow.Store.Backend.MnesiaCache as our cache_store_backend. Under normal circumstances, creating/renewing/deleting sessions via the plug works just as expected, as does having them expire automatically. Even though the application is containerized, Mnesia data is kept between versions in an attached persistent volume. I'm currently using this snippet to check active sessions for a user:

iex> config = [pow_config: [otp_app: :app], backend: Pow.Store.Backend.MnesiaCache]
iex> {:ok, user} = get_user(...)
iex> sessions = Pow.Store.CredentialsCache.sessions(config, user)
[...] # session_ids

Testing with a newly created user, this list behaves normally: a session id is inserted after signing in, replaced after renewal, and deleted after signing out or after 30 minutes. However, for some of the earlier users, this list has been steadily growing over the months, even reaching hundreds of session ids. Luckily, these sessions cannot be used for authentication, since Pow.Store.CredentialsCache.get/2 returns :not_found.
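
That is, a check like this (config and sessions as in the snippet above):

iex> Pow.Store.CredentialsCache.get(config, hd(sessions))
:not_found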

Reproducing these steps in CredentialsCache, the first call to Base.get returns the expected {user_key, metadata}, but the second one returns :not_found. Manually assembling the session_key ([struct, :user, id, :session, session_id]), I was able to verify that the timestamps are still present in the cache.

Next, I tried to inspect the current state of MnesiaCache as well, using the code below:

iex> Process.whereis(Pow.Store.Backend.MnesiaCache) |> :sys.get_state()
%{
  invalidators: %{
    ...
  }
}

I found that all of the stale sessions still had invalidators in the state, and calling Process.read_timer/1 on each of their references returned false, which suggests they should be expired. Then I went to check the deletion logic in MnesiaCache and tried to reproduce its steps as well. Right away it begins by fetching the key, so I figured I'd get the same result by calling get/2:

iex> Pow.Store.Backend.MnesiaCache.get(config, hd(sessions))
:not_found

According to table_get/2, this means the fetch returned nil. That seemed off, since the same call through Pow.Store.Base.get/3 yielded the {user, metadata} earlier, and it was at this point that I decided to reach out for help. 😅

What could be going wrong here? I suspect it has to do with sessions that persist between deploys, but I haven't been able to reproduce it consistently yet.

Any help will be much appreciated, thanks!

gVirtu changed the title from "Persistence of stale session keys in MnesiaCache" to "Persistence of stale session ids in MnesiaCache" on Sep 13, 2023
@danschultzer
Collaborator

Very peculiar. This will take some effort to figure out, but could very well be a bug!

First, let's verify that when you fetch the key by calling Pow.Store.Backend.MnesiaCache.get(config, hd(sessions)), the config also matches what the credentials cache would set. You'll need to set a namespace, and for the session cache it would be: [namespace: "credentials"]

Next, let's check what format the existing keys have and what the expire timestamp value is. Maybe something has set a much larger expire timestamp. If the config setup above worked, it should return them.

If not, you should be able to query them directly from Mnesia with :mnesia.dirty_select(Pow.Store.Backend.MnesiaCache, [{{Pow.Store.Backend.MnesiaCache, :_, :_}, [], [:"$_"]}]). If you also have other caches in Mnesia, this may produce too many results to go through; it can be scoped with :mnesia.dirty_select(Pow.Store.Backend.MnesiaCache, [{{Pow.Store.Backend.MnesiaCache, ["credentials", :_], :_}, [], [:"$_"]}])
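
Each record in the table is a plain three-element tuple, so the results should look roughly like this (expire being a unix timestamp in milliseconds):

{Pow.Store.Backend.MnesiaCache, ["credentials", key], {value, expire}}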

Process.read_timer/1 returning false is interesting. It can happen when the timer has expired, when the ref doesn't exist, or when it has been cancelled. I suspect it may have been cancelled. Cancellation happens in multiple places, among others when the key is deleted or upserted. But if you see it consistently being false, then there may be something different going on.
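
For illustration, the false cases can be reproduced in plain iex, nothing Pow-specific:

iex> ref = Process.send_after(self(), :tick, 60_000)
iex> Process.read_timer(ref)   # milliseconds remaining, e.g. 59997
iex> Process.cancel_timer(ref) # cancels the timer, returns milliseconds remaining
iex> Process.read_timer(ref)   # false, because the timer was cancelled
iex> ref = Process.send_after(self(), :tock, 0)
iex> Process.sleep(10)
iex> Process.read_timer(ref)   # false, because the timer already fired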

First, we'll check if you can fetch the keys and values, so we can further debug whether there's anything different about the old sessions.

@gVirtu
Author

gVirtu commented Sep 14, 2023

Thank you for the reply @danschultzer !

Indeed, once I added the missing namespace, the session was returned in one piece. My bad! Could that somehow be misconfigured on the app's end as well, in a way that prevents MnesiaCache from fetching its keys from inside delete_or_reschedule/3, which would leave the invalidators and table keys untouched? Here's our Pow config:

config :app, :pow,
  user: App.Auth.User,
  users_context: App.Auth,
  extensions: [PowResetPassword, PowEmailConfirmation, PowPersistentSession, PowInvitation],
  repo: App.Repo,
  cache_store_backend: Pow.Store.Backend.MnesiaCache,
  persistent_session_cookie_opts: [
    secure: true
  ]

Querying the expire timestamps, it does look like there's a significant chunk of sessions that are already past their expiration date.

iex> alias Pow.Store.Backend.MnesiaCache
iex> :mnesia.dirty_select(MnesiaCache, [{{MnesiaCache, ["credentials", :_], :_}, [], [:"$_"]}]) |>
...>   Map.new(fn {_cache, key, {_, expire}} -> {key, DateTime.from_unix(expire, :millisecond)} end)
%{
  ["credentials", "bb8a0a9b-d5dd-4d6e-8913-f4de38bafbd5"] => {:ok, ~U[2023-08-28 21:16:42.873Z]},
  ["credentials", "65981219-a750-4001-8361-586ad5f07a5c"] => {:ok, ~U[2023-08-10 20:28:33.378Z]},
  ["credentials", "db7ffa49-143a-46cb-90e8-c4efba633849"] => {:ok, ~U[2023-08-10 19:02:25.777Z]},
  ["credentials", "b7709629-17e8-42f8-ad87-8d5ab3137a0e"] => {:ok, ~U[2023-07-13 20:15:53.189Z]},
  ["credentials", "bebe6449-2c8e-49fb-9da5-146a4212b5cf"] => {:ok, ~U[2023-06-13 23:52:52.089Z]},
  ["credentials", "dc395bc3-a9b5-4d2d-b78c-9a95ef6beb7e"] => {:ok, ~U[2023-07-14 19:53:13.103Z]},
  ["credentials", "784fb420-a415-41c6-9c5a-56fd27cd1f7b"] => {:ok, ~U[2023-06-27 22:24:41.075Z]},
  ["credentials", "78d0b931-6258-4e53-a7e9-6d9ea0501a73"] => {:ok, ~U[2023-08-29 20:58:04.250Z]},
  ["credentials", "96e426c6-cfcc-44f5-a664-351af9a52c6d"] => {:ok, ~U[2023-05-24 22:32:45.565Z]},
...
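
A quick count of the already-expired ones (sketch, same select as above):

iex> now = :os.system_time(:millisecond)
iex> :mnesia.dirty_select(MnesiaCache, [{{MnesiaCache, ["credentials", :_], :_}, [], [:"$_"]}]) |>
...>   Enum.count(fn {_cache, _key, {_value, expire}} -> expire < now end)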

@danschultzer
Collaborator

Could that somehow be misconfigured on the app's end as well?

I don't believe so, as you could fetch the credentials once the namespace was set. The only way you can override this namespace is to make a custom cache module.

Back to what could go wrong here. Following the codepath, we know that these are added to the invalidators when starting the Mnesia cache (as you already wrote):

{@mnesia_cache_tab, key, {_value, expire}}, invalidators when is_list(key) ->
  ttl = Enum.max([expire - timestamp(), 0])

  key
  |> unwrap()
  |> append_invalidator(invalidators, ttl, config)

The TTL being 0 is what catches my eye here. We are calling Process.send_after/3 with the TTL, and this will trigger the invalidation:

defp append_invalidator(key, invalidators, ttl, config) do
  invalidators = clear_invalidator(key, invalidators)
  invalidator = trigger_ttl(key, ttl, config)

  Map.put(invalidators, key, invalidator)
end

defp trigger_ttl(key, ttl, config) do
  Process.send_after(self(), {:invalidate, config, key}, ttl)
end

def handle_info({:invalidate, config, key}, %{invalidators: invalidators} = state) do
  invalidators = delete_or_reschedule(key, invalidators, config)

  {:noreply, %{state | invalidators: invalidators}}
end

defp delete_or_reschedule(key, invalidators, config) do
  config
  |> fetch(key)
  |> case do
    nil ->
      invalidators

Reading the docs, where it says that other messages may be received before the message sent by Process.send_after/4, makes me wonder if the reverse is happening here. What if the state that is ultimately set by init_invalidators only gets set after the message was sent and handled in the GenServer? That would match up with the ref being false. I'll set up a test to replicate this exact scenario, to see if that's the cause.

@danschultzer
Collaborator

danschultzer commented Sep 14, 2023

No, I can't reproduce the error by only setting an expired TTL and restarting:

# test/pow/store/backend/mnesia_cache_test.exs
    test "expired records are purged immediately on startup" do
      :mnesia.dirty_write({MnesiaCache, ["pow:test", "key"], {"value", :os.system_time(:millisecond) - 1}})
      assert MnesiaCache.get(@default_config, "key") == "value"
      restart(@default_config)
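      # assert_receive below assumes the test process subscribed to Mnesia table events (:mnesia.subscribe/1) in setup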
      assert_receive {:mnesia_table_event, {:delete, _, _}} # Wait for the delete to happen
      assert MnesiaCache.get(@default_config, "key") == :not_found
    end

This all works as expected. I'll try a few other things and see how I can reproduce something where an expired session isn't being removed.

@gVirtu
Author

gVirtu commented Sep 14, 2023

I've inspected the codepath locally with a copy of our staging environment, and one thing that caught my attention is that the config being passed around is an empty list. I guess that's to be expected, though, as I'm not passing any extra options in application.ex.

With the inspects still in place, I then tried to sign in with a user and, lo and behold, this time the config was there: 🤯

[info] POST /api/v1/signin
trigger_ttl("4981f2d6-0452-477a-9416-c7008d860be4", 1800000, [ttl: 1800000, namespace: "credentials"]): #Reference<0.1183618603.1427636226.146442>
trigger_ttl([App.Auth.User, :user, 1], 1800000, [ttl: 1800000, namespace: "credentials"]): #Reference<0.1183618603.1427636226.146450>
trigger_ttl([App.Auth.User, :user, 1, :session, "4981f2d6-0452-477a-9416-c7008d860be4"], 1800000, [ttl: 1800000, namespace: "credentials"]): #Reference<0.1183618603.1427636226.146453>
trigger_ttl("e08589b5-5eb8-48e5-880a-d7812b16538d", 2592000000, [ttl: 2592000000, namespace: "persistent_session"]): #Reference<0.1183618603.1427636226.146456>
[info] Sent 200 in 475ms

The MnesiaCache usage docs do mention namespace as a configuration option, but it looks like it defaults to "cache". I tried to pass it as "credentials" and now it looks like those entries were promptly cleared!

iex> :mnesia.dirty_select(MnesiaCache, [{{MnesiaCache, ["credentials", :_], :_}, [], [:"$_"]}])
[]
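
"Passing it" here means adding the option to the MnesiaCache child spec in application.ex. A sketch of what I changed (the rest of the child list omitted):

children = [
  # ...
  {Pow.Store.Backend.MnesiaCache, namespace: "credentials"}
]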

There are still invalidators in there, but only those with UUID keys, and they still have mostly active timers:

iex> Process.whereis(Pow.Store.Backend.MnesiaCache) |> :sys.get_state() |> Map.get(:invalidators) |> Map.values() |> Enum.map(&Process.read_timer/1)
[2488410885, false, 1903208148, false, 2488410111, false, false, 1903853122,
 1903156850, 1903687689, 2493595062, 1903897831, false, 1903561874, false,
 2488371072, 1903554220, 1903442037, 1903928609, 2488409282, 2488411728,
 1903377475, 1903772893, 2493729373, 678361955, false, 1903613699, false,
 1903366255, 1903647865, 2493596154, false, false, false, 1903357776,
 2484443349, 1903472001, false, 1903993289, 1903989452, 2505517026, 2505697206,
 1903883038, false, 2493597098]

Here's what I'd logged (before making any modifications to the namespace):

  • Return of init/1
  • Return of handle_info/2 (:invalidate)
  • Return of fetch/2 in delete_or_reschedule
  • Return of trigger_ttl/3 in append_invalidator/4
  • Return of append_invalidator/4 in init_invalidators/2

iex.log

Would it be fine to explicitly pass the namespace to MnesiaCache like this, or might there be unintended side effects? Thanks again.

@danschultzer
Collaborator

Pow.Store.CredentialsCache would force the namespace. It's to prevent conflicts with other caches in Pow, like the persistent session cache, so there should be no need to set the namespace anywhere. At a glance I don't see anything off in any of the data, but I'll be going through it more thoroughly soon and try to figure out the code path. Work got in the way.

We know that expired sessions get picked up when MnesiaCache inits, but for some reason the trigger to actually expire them either gets cancelled or never runs in the first place. Alternatively, something rewrites the session, though I have no idea why. It smells like a race condition.

Are the expired sessions consistent (i.e., the same keys)? Do you use Mnesia clustering, or is this just a single node?

@gVirtu
Author

gVirtu commented Sep 15, 2023

Pow.Store.CredentialsCache would force the namespace. It's to prevent conflicts with other caches in pow, like the persistent session cache, etc, so there should be no need to set namespace anywhere.

I see, that makes perfect sense.

I've worked on a more contrived example to reproduce this from an empty Mnesia state. I got to the problematic state after these steps (using a short TTL is convenient here):

1. Spin up the Phoenix server

Console output:

init([]): {:ok, %{invalidators: %{}}}
2. Authenticate a user

Via the custom authorization plug from the API tutorial, this calls put/3 on both CredentialsCache and PersistentSessionCache in Conn.register_before_send/1

Console output:

trigger_ttl("7c56f349-a262-49c6-a3d7-013bed0d1691", 1800000, [ttl: 1800000, namespace: "credentials"]): #Reference<0.176822019.4286054407.197793>
trigger_ttl([App.Auth.User, :user, 1], 1800000, [ttl: 1800000, namespace: "credentials"]): #Reference<0.176822019.4286054404.201834>
trigger_ttl([App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"], 1800000, [ttl: 1800000, namespace: "credentials"]): #Reference<0.176822019.4286054404.201837>
trigger_ttl("887a5604-6f68-4e66-9f2d-054604175e29", 2592000000, [ttl: 2592000000, namespace: "persistent_session"]): #Reference<0.176822019.4286054404.201840>
3. Stop the server, wait until after the TTL passes, then start it up again

Console output:

trigger_ttl([App.Auth.User, :user, 1], 0, []): #Reference<0.3574711379.2956460033.218191>
append_invalidator(["credentials", App.Auth.User, :user, 1], %{}, 0, []): %{[App.Auth.User, :user, 1] => #Reference<0.3574711379.2956460033.218191>}
trigger_ttl([App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"], 0, []): #Reference<0.3574711379.2956460033.218205>
append_invalidator(["credentials", App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"], %{[App.Auth.User, :user, 1] => #Reference<0.3574711379.2956460033.218191>}, 0, []): %{
[App.Auth.User, :user, 1] => #Reference<0.3574711379.2956460033.218191>,
[App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"] => #Reference<0.3574711379.2956460033.218205>
}
trigger_ttl("7c56f349-a262-49c6-a3d7-013bed0d1691", 0, []): #Reference<0.3574711379.2956460033.218211>
append_invalidator(["credentials", "7c56f349-a262-49c6-a3d7-013bed0d1691"], %{[App.Auth.User, :user, 1] => #Reference<0.3574711379.2956460033.218191>, [App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"] => #Reference<0.3574711379.2956460033.218205>}, 0, []): %{
[App.Auth.User, :user, 1] => #Reference<0.3574711379.2956460033.218191>,
[App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"] => #Reference<0.3574711379.2956460033.218205>,
"7c56f349-a262-49c6-a3d7-013bed0d1691" => #Reference<0.3574711379.2956460033.218211>
}
trigger_ttl("887a5604-6f68-4e66-9f2d-054604175e29", 2587044572, []): #Reference<0.3574711379.2956460033.218217>
append_invalidator(["persistent_session", "887a5604-6f68-4e66-9f2d-054604175e29"], %{[App.Auth.User, :user, 1] => #Reference<0.3574711379.2956460033.218191>, [App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"] => #Reference<0.3574711379.2956460033.218205>, "7c56f349-a262-49c6-a3d7-013bed0d1691" => #Reference<0.3574711379.2956460033.218211>}, 2587044572, []): %{
[App.Auth.User, :user, 1] => #Reference<0.3574711379.2956460033.218191>,
[App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"] => #Reference<0.3574711379.2956460033.218205>,
"7c56f349-a262-49c6-a3d7-013bed0d1691" => #Reference<0.3574711379.2956460033.218211>,
"887a5604-6f68-4e66-9f2d-054604175e29" => #Reference<0.3574711379.2956460033.218217>
}
init([]): {:ok,
%{
 invalidators: %{
   [App.Auth.User, :user, 1] => #Reference<0.3574711379.2956460033.218191>,
   [App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"] => #Reference<0.3574711379.2956460033.218205>,
   "7c56f349-a262-49c6-a3d7-013bed0d1691" => #Reference<0.3574711379.2956460033.218211>,
   "887a5604-6f68-4e66-9f2d-054604175e29" => #Reference<0.3574711379.2956460033.218217>
 }
}}
fetch([], [App.Auth.User, :user, 1]): nil
handle_info({:invalidate, [], [App.Auth.User, :user, 1]}, ...): {:noreply,
%{
 invalidators: %{
   [App.Auth.User, :user, 1] => #Reference<0.3574711379.2956460033.218191>,
   [App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"] => #Reference<0.3574711379.2956460033.218205>,
   "7c56f349-a262-49c6-a3d7-013bed0d1691" => #Reference<0.3574711379.2956460033.218211>,
   "887a5604-6f68-4e66-9f2d-054604175e29" => #Reference<0.3574711379.2956460033.218217>
 }
}}
fetch([], [App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"]): nil
handle_info({:invalidate, [], [App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"]}, ...): {:noreply,
%{
 invalidators: %{
   [App.Auth.User, :user, 1] => #Reference<0.3574711379.2956460033.218191>,
   [App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"] => #Reference<0.3574711379.2956460033.218205>,
   "7c56f349-a262-49c6-a3d7-013bed0d1691" => #Reference<0.3574711379.2956460033.218211>,
   "887a5604-6f68-4e66-9f2d-054604175e29" => #Reference<0.3574711379.2956460033.218217>
 }
}}
fetch([], "7c56f349-a262-49c6-a3d7-013bed0d1691"): nil
handle_info({:invalidate, [], "7c56f349-a262-49c6-a3d7-013bed0d1691"}, ...): {:noreply,
%{
 invalidators: %{
   [App.Auth.User, :user, 1] => #Reference<0.3574711379.2956460033.218191>,
   [App.Auth.User, :user, 1, :session, "7c56f349-a262-49c6-a3d7-013bed0d1691"] => #Reference<0.3574711379.2956460033.218205>,
   "7c56f349-a262-49c6-a3d7-013bed0d1691" => #Reference<0.3574711379.2956460033.218211>,
   "887a5604-6f68-4e66-9f2d-054604175e29" => #Reference<0.3574711379.2956460033.218217>
 }
}}

The expired sessions and invalidators remained untouched, with the same keys as before, locally on a single node.

It does look like some sort of race condition, but by looking at the code/logs I can't tell for sure what might be causing it. I want to build a minimal repo to reproduce this, but I might not be able to get to it for the next few days...

@danschultzer
Collaborator

I see. Reading this, I strongly suspect it has something to do with this part:

config
|> fetch(key)
|> case do
  nil ->
If for some reason fetching the key returns nil, then the invalidators remain untouched (which I think is incorrect in the first place; it should still clear the invalidator ref). I'm unsure how that could happen without the key actually having been deleted, and I haven't been able to reproduce this in tests. I'll try to set up a full demo app locally to see if I can reproduce it that way (since you are able to reproduce it consistently, it should be trivial).
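
Something along these lines is what I mean (sketch only; clear_invalidator/2 as already used in the module):

defp delete_or_reschedule(key, invalidators, config) do
  config
  |> fetch(key)
  |> case do
    nil ->
      # the key is already gone from the table, so drop the stale timer ref as well
      clear_invalidator(key, invalidators)

    # ...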

@danschultzer
Collaborator

I've found a way to reproduce it, and indeed it seems to be a race condition. I'll work on a solution.

@danschultzer
Collaborator

danschultzer commented Sep 18, 2023

I found the issue. It was not a race condition, but incorrect handling of the keys. I have published v1.0.34, which resolves this; please let me know if it works correctly now and all the expired keys are being wiped.
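
For anyone landing here, upgrading is the usual dependency bump in mix.exs:

defp deps do
  [
    {:pow, "~> 1.0.34"}
  ]
end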

This is a more serious issue than it appeared at first glance, so thank you so much for bringing this up! A CVE will be published shortly.
