
Captive Portal : Fix the bug affecting users getting disconnected when reconfiguring a zone #4031

Closed

5 participants
@Augustin-FL (Contributor) commented Jan 6, 2019

Note: please do not backport this PR to the 2.4.4 branch, as it includes a config version bump.

Forum thread : https://forum.netgate.com/topic/137824/pfsense-no-internet-when-it-is-said-you-are-connected/13

@plumbeo (Contributor) commented Jan 8, 2019

@Augustin-FL I'm not sure about preserving users after a reboot. It's a significant change to how captive portal works, and there are a lot of corner cases to consider.

For example, if RADIUS is in use, the reboot procedure disconnects all users (a captiveportal_radius_stop_all(7) call in system.inc), so after the boot there will be spurious RADIUS accounting packets for sessions that no longer exist, which could be a problem for many RADIUS servers.

Also, even assuming you don't explicitly disconnect the users, there's no guarantee that pfSense will be back online quickly, and that downtime will be counted as used time, which could be a problem for users with limited usage time (vouchers, etc.).

There will also be problems with traffic accounting: the new tables you're creating lose all the previous traffic information, and this will be a problem whenever you reinit the rules, not only after a reboot.

@Gertjanpfsense (Contributor) commented Jan 8, 2019

There will also be problems with traffic accounting: the new tables you're creating lose all the previous traffic information, and this will be a problem whenever you reinit the rules, not only after a reboot.

Yep, I'm seeing the same thing: https://forum.netgate.com/topic/130420/any-change-and-save-update-captive-portal-bug/41 (but there will be no traffic while the tables are flushed...)

I've been testing this patch for a day or two in a "live" environment - all good!

edit: Rebooting will disconnect users... (read bottom to top)

Jan  8 16:40:04 | php-fpm | 67158 | /diag_reboot.php: Stopping all packages.
Jan  8 16:40:04 | logportalauth | 67158 | Zone: cpzone1 - DISCONNECT: x, 90:b9:31:77:5e:26, 192.168.2.9
Jan  8 16:38:22 | logportalauth | 338 | Zone: cpzone1 - ACCEPT: x, 90:b9:31:77:5e:26, 192.168.2.9
@Augustin-FL (Contributor, Author) commented Jan 8, 2019

Thanks for your comment @plumbeo !

if RADIUS is in use, the reboot procedure disconnects all users (a captiveportal_radius_stop_all(7) call in system.inc), so after the boot there will be spurious RADIUS accounting packets for sessions that no longer exist, which could be a problem for many RADIUS servers.

That is completely true, thanks for pointing that out.

Just to make it clear: captiveportal_radius_stop_all() is responsible for sending Accounting-Stop messages (if RADIUS accounting is enabled) and for writing "disconnect" log messages for each user, but this function does not actually disconnect users (that explains the logs from @Gertjanpfsense). Its call inside system.inc causes the RADIUS server to consider all users disconnected when rebooting.

When rebooting, a captiveportal_send_server_accounting('off') is performed during shutdown, and a captiveportal_send_server_accounting('on') is performed during boot. So I think removing this captiveportal_radius_stop_all(7) call during shutdown should be enough to mitigate the problem.
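The shutdown/boot ordering described above can be condensed into a toy model (a Python sketch for illustration only; the real code is PHP in system.inc, and every function name below is a simplified stand-in, not the actual pfSense API):

```python
# Toy model of the RADIUS accounting calls around a reboot, as described:
# Accounting-Off at shutdown, Accounting-On at boot. The per-user
# stop_all() call is the one proposed for removal from the shutdown path.

events = []

def send_server_accounting(state):
    # stands in for captiveportal_send_server_accounting('on'/'off')
    events.append(f"Accounting-{state}")

def radius_stop_all():
    # stands in for captiveportal_radius_stop_all(): marks every
    # session as closed on the RADIUS server
    events.append("Accounting-Stop-per-user")

def shutdown(keep_sessions=True):
    if not keep_sessions:
        radius_stop_all()  # removing this keeps sessions open server-side
    send_server_accounting("off")

def boot():
    send_server_accounting("on")

shutdown(keep_sessions=True)
boot()
print(events)  # ['Accounting-off', 'Accounting-on']
```

With keep_sessions=True the server only sees the global off/on pair, so sessions survive the reboot from the RADIUS server's point of view; that is the mitigation proposed here.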

There will also be problems with traffic accounting because the new tables you're creating lose all the previous traffic information and this will be a problem [...]

Well, I don't think that's an issue at all (do you? please correct me if you think otherwise). If you are worried about RADIUS accounting updates, the next accounting update should simply override the value from before the reconfigure/reboot.

Anyway, I will soon add a commit to this PR (removing captiveportal_radius_stop_all(7)). Thanks again for the notice!

I'm not sure about preserving users after a reboot. It's a significant change to how captive portal works, and there are a lot of corner cases to consider.

I agree, but I think it's better to implement this feature and then fix potential issues, rather than block ourselves from ever including it.

@plumbeo (Contributor) commented Jan 8, 2019

@Augustin-FL

There will also be problems with traffic accounting because the new tables you're creating lose all the previous traffic information and this will be a problem [...]

Well, I don't think that's an issue at all (do you? please correct me if you think otherwise). If you are worried about RADIUS accounting updates, the next accounting update should simply override the value from before the reconfigure/reboot.

That's the problem: the next accounting packets will override the correct values with the new values, that is, the traffic generated from the moment the rules were reinitialised, not from the start of the session. I haven't studied the code, so I can't suggest an alternative right now, but I think those tables shouldn't be touched.

EDIT: when did this bug appear? I’m pretty sure I didn’t notice it about a year ago. Maybe it can be fixed in a different way.

Anyway, I will soon add a commit to this PR (removing captiveportal_radius_stop_all(7)). Thanks again for the notice!

I'm not sure about preserving users after a reboot. It's a significant change to how captive portal works, and there are a lot of corner cases to consider.

I agree, but I think it's better to implement this feature and then fix potential issues, rather than block ourselves from ever including it.

But why should you include it? Shutting down or rebooting a firewall has always reset all the state (firewall states, VPNs, NPT states, etc.), and that's what anybody can reasonably expect. I don't see why the list of logged-in captive portal users should be an exception, and I don't think you should change something like this without even offering an option to preserve the current behaviour.

What actual problem does this solve? The users will not be able to access the network while the firewall is rebooting anyway, and all their connections will be cut and will have to be recreated, so they may as well log in again as soon as the firewall is up.

@Gertjanpfsense (Contributor) commented Jan 9, 2019

@plumbeo: the original bug: users getting disconnected when reconfiguring a (captive portal) zone, as the subject states.
When a portal zone is reconfigured, all ipfw rules and tables are flushed and then recreated.
The tables myzone_auth_up and myzone_auth_down will be empty.
But: the database belonging to this captive portal zone, with its logged-in clients, is not flushed.
This is where things go wrong.
What happens is: portal users are redirected again to the captive portal "login" web page. index.php will discover that the user is already logged in (that is, they figure in the database), and the login procedure stops there. There is no way the user can log in again.

The quick solution was / is right now: flush all users (when reconfiguring the zone).

A next-best solution could be: flushing the database that belongs to the zone.
(Not a good thing for captive portals with many users and admins who like to reconfigure their portal many times a day ;) )

@Augustin-FL chose, IMHO, a better method: before flushing the ipfw tables (myzone_auth_up and myzone_auth_down), update the up and down byte counters in the database, which then contains everything needed to rebuild them afterwards.
When rules and tables are recreated, insert the users back from the database (which is still intact).
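The failure mode and the fix described above can be sketched as a toy model (a Python sketch for illustration only; the real implementation is PHP plus ipfw tables, and every name below is invented):

```python
# Toy model of the reconfigure bug: ipfw tables are flushed, but the
# per-zone login database is not, so index.php refuses to log the user
# in again while no firewall rule passes their traffic.

class Zone:
    def __init__(self):
        self.auth_table = set()  # stands in for the myzone_auth_up/down tables
        self.database = set()    # per-zone database of logged-in users

    def login(self, user):
        if user in self.database:
            return "already logged in"  # login procedure stops here
        self.database.add(user)
        self.auth_table.add(user)
        return "ok"

    def reconfigure_buggy(self):
        self.auth_table.clear()  # rules/tables flushed, recreated empty
        # bug: self.database is NOT flushed

    def reconfigure_fixed(self):
        # the approach in this PR, roughly: flush the tables, then
        # re-insert users from the still-intact database
        self.auth_table.clear()
        self.auth_table |= self.database

z = Zone()
z.login("alice")
z.reconfigure_buggy()
print("alice" in z.auth_table)  # False: no rule passes her traffic
print(z.login("alice"))         # "already logged in": she is stuck
z.reconfigure_fixed()
print("alice" in z.auth_table)  # True: access rebuilt from the database
```

The "stuck" state between the two prints is exactly the user-visible bug: the portal insists the user is connected while the firewall drops everything.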

@Augustin-FL went one step further: logged-in users will stay logged in after a reboot.
IMHO this goes very far... what about update / upgrade issues?
A reconfiguration: ok for me if connected users stay connected.
A reboot: for me, better make it a clean start.

@Augustin-FL: please correct my explanation if it isn't ok for you.

@jim-p (Contributor) commented Jan 10, 2019

I'd prefer not to preserve the database across a shutdown/reboot by default. It may be an option, but it must default to disabled. To do otherwise is a drastic and unexpected change.

@plumbeo (Contributor) commented Jan 10, 2019

@Gertjanpfsense I see. I'm not sure when I'll have time to work on this, but from a quick look at the relevant code, the original bug is caused by captiveportal_init_rules(), which always calls captiveportal_free_dnrules(), which resets the list of pipes, and captiveportal_delete_rules(), which destroys the firewall tables. To fix it you'd just need a check in those functions to skip removing pipes and tables when the captive portal is already enabled (for example, when the user database file exists).
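A minimal sketch of the guard described above, assuming the "already enabled" state is inferred from the user database file existing (Python for illustration; the function and callback names are hypothetical, not the actual pfSense API):

```python
import os

def init_rules(db_path, free_dnrules, delete_rules):
    # Skip resetting pipes and destroying firewall tables when the zone
    # is already enabled, inferred here from its user database file.
    already_enabled = os.path.exists(db_path)
    if not already_enabled:
        free_dnrules()   # resets the list of pipes
        delete_rules()   # destroys the firewall tables
    # ... (re)create rules either way ...
    return already_enabled
```

The teardown callbacks run only on a cold start; on a live reconfigure the existing pipes and tables (and hence the logged-in users' access) are left alone.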

@Augustin-FL (Contributor, Author) commented Jan 11, 2019

I didn't specifically intend to implement that feature. This pull request is focused on fixing the captive portal reconfigure problem.

Because a reboot is essentially a captive portal reconfigure, this pull request allows users to be saved across reboots. This was not intentional; it's more a side effect of fixing the current reconfigure problem.

I can implement some additional code to disallow login saving across reboots (e.g., drop the database during shutdown), but in that case should feature 5644 maybe be closed ("rejected - won't implement")?

In the mid/long term, I also plan to implement true High Availability for captive portal (Redmine 97), and I am thinking about the behavior pfSense should have when "one node goes back online". I'd like to implement saving users across reboots in order to prepare for High Availability sync.

@jim-p, given these new elements, could you confirm that I need to implement a database drop when pfSense shuts down?

When did this bug appear? I’m pretty sure I didn’t notice it about a year ago. Maybe it can be fixed in a different way.

This bug appeared in pfSense 2.4 (2.3.5 is fine, 2.4.0 is not). At that time, Netgate chose to deprecate pfsense-tools and rely on the stock FreeBSD kernel/ports instead. Multiple "custom" ipfw features developed specifically for pfSense were removed, such as ipfw context, meaning that captive portal was essentially rewritten (see 517b893 for details).

@jim-p (Contributor) commented Jan 11, 2019

I didn't say it couldn't be done, I just said it had to be optional and default to disabled. There are people who would benefit from the feature, it's just too much of a change to implement without a way to retain the current behavior. Make it a checkbox in the GUI they have to opt into (e.g. Retain Captive Portal login database across reboots), and if it is needed for HA, then document that as a required option when the time comes.

@plumbeo (Contributor) commented Jan 12, 2019

Hi, I think this PR is doing a lot more than fixing the original bug, and that the way it fixes it has some unfortunate side effects, so IMHO it shouldn't be merged as it is.

I have a different fix here (plumbeo@c43be93) that preserves traffic accounting and applies to 2.4.4_p1 too. I did some light testing and as a proof-of-concept it seemed ok, but there are a couple of things that I'd like more time to think about and I haven't tried the various MAC authentication options yet. @Gertjanpfsense and @Augustin-FL, could you see if it works for you too?

The other changes in this PR could then be discussed and eventually applied separately.

@Augustin-FL (Contributor, Author) commented Jan 13, 2019

@jim-p I added an option for retaining captive portal logins across reboots, defaulting to false.

@plumbeo Nice catch: I didn't think about changing the captiveportaldn.rules format. Your patch is also a lot lighter than mine.

I noted a small issue however: "connected users" are not completely reset when rebooting (the database still shows connected users while the ipfw rules are flushed during reboot, resulting in a "you are connected" error). This can be fixed easily (a simple unlink_if_exist() should do the trick).
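The fix sketched above, dropping the per-zone login database at shutdown so a rebooted portal starts clean, could look roughly like this (Python for illustration only; the real helper is the PHP unlink_if_exist() mentioned above, and the path would be zone-specific):

```python
import os

def drop_portal_db_on_shutdown(db_path):
    # Remove the per-zone login database so the "connected users" view
    # is reset after a reboot; mirrors a PHP unlink_if_exist() call.
    try:
        os.unlink(db_path)
    except FileNotFoundError:
        pass  # already gone: nothing to do
```

Because the database is what index.php consults, removing it at shutdown keeps the database and the (flushed) ipfw rules consistent after boot.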

Maybe you could improve your work and submit it as a PR, so that Netgate can decide what's the best way to fix this issue?

I think this PR is doing a lot more than fixing the original bug, and that the way it fixes it has some unfortunate side effects, so IMHO it shouldn't be merged as it is.

It is true that I included some slight refactoring in this PR (such as removing old unlink_if_exists() calls performed on files that don't exist anymore, or changing the location of some lock() calls), but this PR is not doing "a lot more" than fixing the original issue.
The side effect you are talking about is deliberate: in my opinion, accounting data should be reset when reconfiguring a zone. I'm not sure that's the right thing to do, however, so I'll let Netgate judge what's best.

@Gertjanpfsense (Contributor) commented Jan 21, 2019

I suggest extending the GUI setting (as implemented by @Augustin-FL after @jim-p's remark) with an option to remove all users when the portal settings are saved.
This could be a third choice, making it a three-state option:

  • Remove connected users when the config changes
  • Keep connected users when the config changes
  • Keep connected users when pfSense reboots

The first or second option should be the default.

I have no real reason to ask for this extension of the GUI option, but 'something' tells me it would be nice to have all the options covered. Especially since the first option is easy to code ^^

I'm using the latest patches proposed by @Augustin-FL in a live environment.
I'm editing portal settings rather often these days on a live portal setup, to check for any unexpected behavior. I'm using FreeRADIUS with a SQL server on the LAN. Some users have a daily quota restriction.
I haven't found any issues yet.

jim-p requested review from jim-p and rbgarga and removed the request for jim-p, Jan 21, 2019

jim-p approved these changes Jan 21, 2019

@plumbeo (Contributor) commented Jan 21, 2019

@jim-p could I ask again to put this on hold for a while? As I said, I have a branch (here: https://github.com/plumbeo/pfsense/commits/fix-reconfig) that needs a bit of testing but fixes this bug without interfering with traffic accounting.

I think that if someone cares about traffic statistics, resetting them without warning during normal operation is not what those users want or expect.

@plumbeo (Contributor) commented Feb 3, 2019

I submitted PR #4042. The other changes included here could be applied on top of it, for example, rebuilding the firewall tables from scratch only after a reboot instead of every time settings are changed.

@plumbeo (Contributor) commented Feb 18, 2019

@rbgarga could you please review PR #4042 before merging this? I still believe that this PR has significant side effects that shouldn't just be handwaved away. We will be forced to revert it internally for the foreseeable future, because we need precise traffic accounting and this PR makes it unreliable.

@jim-p (Contributor) left a comment

Looks like we'll take #4042 over this one for now, and then the good parts from this can be adapted as optional behavior on top of that.

@Augustin-FL (Contributor, Author) commented Feb 18, 2019

Ok, fine! Closing this for now.

Augustin-FL deleted the Augustin-FL:patch-captivportal-reconfigure branch, Mar 16, 2019
