Pade behind a proxy #408

Closed
slmc-tech opened this issue Mar 21, 2022 · 106 comments

Comments

@slmc-tech

We are trying to make Pade work in a clustered environment where nodes sit behind a reverse proxy (nginx).
When we try to configure network parameters from the openfire.xml file we see that video bridges do not come up and neither does the focus service.
When no network settings are hardcoded the bridges and focus come up as expected.
Clients that connect to the nodes directly can create and join conferences just fine. However clients who go through the reverse proxy do not get any audio or video.
We are forwarding 443 from the proxy to 7443 on the openfire nodes and are also loadbalancing udp 10000. However we do not see any udp traffic leaving the clients so webrtc / websockets does not get initiated at all.
Our setup was working fine when there was just one node (no clustering enabled).
For screenshots of our network settings please refer to this thread: https://discourse.igniterealtime.org/t/pade-1-6-2-clustering/91483/4.
Any advice on how we could make this work would be greatly appreciated.
@gjaekel Guido, hi. @deleolajide Dele thought that maybe you could offer your insight on this.
Thanks,

@gjaekel
Contributor

gjaekel commented Mar 21, 2022

You're probably using the reverse proxy because of a NATed environment. But in the screenshots in the thread, in the section "IP Address Mapping", the fields for the local and public IP (the internal and external one) are empty. These settings feed the generation of the ICE harvester configuration

org.ice4j.ice.harvest.NAT_HARVESTER_LOCAL_ADDRESS=<internal ip>
org.ice4j.ice.harvest.NAT_HARVESTER_PUBLIC_ADDRESS=<external ip>

of the JVB (.../plugins/pade/classes/jvb/config/sip-communicator.properties).
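As a rough illustration of what those two fields end up producing, the sketch below renders the two ice4j lines from an internal and an external address. The function name `nat_harvester_properties` and the sample addresses are made up for illustration; only the two property keys are taken from the comment above.

```python
import ipaddress

LOCAL_KEY = "org.ice4j.ice.harvest.NAT_HARVESTER_LOCAL_ADDRESS"
PUBLIC_KEY = "org.ice4j.ice.harvest.NAT_HARVESTER_PUBLIC_ADDRESS"


def nat_harvester_properties(internal_ip: str, external_ip: str) -> str:
    """Return the two properties lines for a JVB behind NAT (hypothetical helper)."""
    # ip_address() raises ValueError on malformed input, so typos are caught early.
    local = ipaddress.ip_address(internal_ip)
    public = ipaddress.ip_address(external_ip)
    return f"{LOCAL_KEY}={local}\n{PUBLIC_KEY}={public}\n"


if __name__ == "__main__":
    # Sample addresses only; substitute the node's private IP and the NAT's public IP.
    print(nat_harvester_properties("192.168.1.10", "203.0.113.5"), end="")
```

Comparing this output with the generated sip-communicator.properties after startup is a quick way to see whether the Admin UI values were picked up.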

@slmc-tech
Author

Hi Guido,
Thank you for looking into this.
I understand that, but which one of the two nodes should I use for the local address?
If I modify openfire.xml, which would allow me to define the private and public IP for each node, the video bridge does not come up and neither does jicofo.

@gjaekel
Contributor

gjaekel commented Mar 22, 2022

Sorry, but I can't tell you anything about an OpenFire cluster at this time; I haven't tested it so far.

Do you know the sketch at https://github.com/igniterealtime/openfire-pade-plugin/wiki/OFmeet-Network-Scheme ?

Dele recently started work on enabling Pade for an Openfire cluster, but I don't know if this is already usable. In addition to OpenFire itself, the Jitsi components must also be set up to work as a cluster: I would think that every JVB in use must register with each JiCoFo and that the JVBs must be interconnected (the "Octo" feature).

And the ICE harvester (as part of the JVB) on each node must at least announce its own address. Dele added some new code for this:

    case PluginImpl.MANUAL_HARVESTER_LOCAL_PROPERTY_NAME:
        if (value == null || value.isEmpty()) break;
        if (ClusterManager.isClusteringEnabled()) {
            String localBoundIp = JiveGlobals.getXMLProperty("network.interface");
            if (localBoundIp != null && !localBoundIp.isEmpty()) {
                value = localBoundIp;
            }
        }
        if (value == null || value.isEmpty()) break;
        props.setProperty( "org.ice4j.ice.harvest.NAT_HARVESTER_LOCAL_ADDRESS", value );
        break;
    case PluginImpl.MANUAL_HARVESTER_PUBLIC_PROPERTY_NAME:
        if (value == null || value.isEmpty()) break;
        if (ClusterManager.isClusteringEnabled()) {
            String publicBoundIp = JiveGlobals.getXMLProperty("network.interface_public");
            if (publicBoundIp != null && !publicBoundIp.isEmpty()) {
                value = publicBoundIp;
            }
        }
        if (value == null || value.isEmpty()) break;
        props.setProperty( "org.ice4j.ice.harvest.NAT_HARVESTER_PUBLIC_ADDRESS", value );
        break;

IMHO, this code will override the ICE addresses with the values for network.interface and network.interface_public, but only if a value is already provided (L320,L337). Because of that, you have to enter values in the Admin UI (dummy values, if need be). After startup, please check the generated config files.
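To make that reading of the two case branches concrete, here is a small Python model of the decision they appear to take. It is only a sketch of the quoted snippet, not of the plugin as a whole; the function name and parameters are invented for illustration.

```python
from typing import Optional


def resolve_harvester_address(admin_value: Optional[str],
                              clustering_enabled: bool,
                              xml_value: Optional[str]) -> Optional[str]:
    """Return the address that would be written, or None when the property is skipped.

    admin_value: what was entered in the Admin UI (stored in the database).
    xml_value:   the per-node value read from openfire.xml.
    """
    # An empty Admin UI value short-circuits everything, including the override.
    if not admin_value:
        return None
    # In a cluster, a non-empty per-node value from openfire.xml wins.
    if clustering_enabled and xml_value:
        return xml_value
    return admin_value


if __name__ == "__main__":
    # With an empty Admin UI field the openfire.xml value is never consulted.
    print(resolve_harvester_address("", True, "192.168.1.80"))
    # With any placeholder in the Admin UI, the per-node value takes over.
    print(resolve_harvester_address("dummy", True, "192.168.1.80"))
```

This is why a dummy value in the Admin UI matters under this reading: without it the per-node setting is silently ignored.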

Dele changed the default log level of the Jitsi component wrappers. You might add

	<!-- <<< 20220217/gj	Re-Enable JVB/JiCoFo logging -->
	<Logger name="org.jivesoftware.openfire.plugin.ofmeet.JitsiJvbWrapper" level="debug"/>
	<Logger name="org.jivesoftware.openfire.plugin.ofmeet.JitsiJicofoWrapper" level="debug"/>
	<!-- >>> -->

to the logging configuration file log4j2.xml to see what gets configured during startup.

@slmc-tech
Author

Hi Guido.
From what we can see on our environment clustering seems to work for pade as long as you go to any of the nodes directly.
The issue starts appearing once you introduce a load balancer to the environment.
I have enabled the debug logging that you suggested and, to be honest, I cannot make much sense of it. It seems that the configuration loaded on both nodes is what is defined in the database from the Admin console, but still no jvb or jicofo comes up on either node.
This happens when you try to define network interfaces in the openfire.xml file, as @deleolajide suggested.
If you remove this configuration, the video bridges and focus come up.
Maybe you can make some sense out of the attached logs.
SharedDebug.txt

@deleolajide
Member

@slmc-tech can we confirm whether your network has NAT or not? If both your nodes have static IP addresses with a public FQDN for your domain, then you do not need to specify public and private IP addresses and don't need to modify openfire.xml for clustering.

If you bypass the load-balancer, can you get a 3-way conference with participants connected directly with either node01 or node02 with audio and video working ok?

If that is the case, then the load balancer could be the cause of the issue. That is where you will need the expertise of @gjaekel
If not, then it is a bug in the clustering logic that I have to fix.

@slmc-tech
Author

@deleolajide yes this is a NATed environment. The nodes sit inside a class C network and have only private ip addresses assigned. The load balancer is a reverse proxy and load balancer (nginx) where we translate incoming tcp 443 to the private ip of the nodes on port 7443. The load balancer also passes udp 10000 to the nodes. It is basically set up like @gjaekel has shared in the diagram. We have DNS servers that translate FQDN addresses to the nodes for users that are logged in to the internal network. When you go to the nodes directly everything works just fine. You have multiuser conferences with audio video and all that jazz as expected. Regardless of which node you connect to you can join conferences and create new ones.
Things fall apart when you direct the traffic to the load balancer. Users that go through the load balancer can still join the rooms, but there is no audio or video. Also, there is no UDP traffic leaving the clients. I suspect that the browser is not instructed to create the websocket tunnel. For some reason we are also seeing a lot of 404s for gravatar, or some domain like that, on the clients that go through the load balancer.
I am pretty sure that this is a networking misconfiguration issue. My suspicion is that there is an issue with the NAT harvester settings, but if we define these settings in the DB we cannot assign a private IP for each node. If we define the private and public addresses with openfire.xml, the video bridges and jicofo refuse to cooperate. I also tried setting the private IP in these settings to the private IP of the load balancer, and it did not do the trick.
My knowledge of java is essentially non existent so I am having a really hard time troubleshooting this.
Let me know if you need any more clarification.

@slmc-tech
Author

@deleolajide @gjaekel
I thought sharing the nginx configuration will help you understand how this is set up.
The relevant parts are like so:
upstream VideoConferencing_7443 {
    ip_hash;
    server 192.168.1.10:7443;
    server 192.168.1.11:7443;
}

server {
    listen 443 ssl;
    # server names are separated by whitespace, not commas
    server_name meet.domain.com rv-xmpp-01.domain.com rv-xmpp-02.domain.com;
    access_log /var/log/nginx/access_conferencing.log;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_ssl_verify off;
    location /colibri-ws/ {
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "Upgrade";
        proxy_pass https://VideoConferencing_7443/colibri-ws/;
    }
    location /http-bind/ {
        proxy_pass https://VideoConferencing_7443/http-bind/;
    }
    location / {
        proxy_pass https://VideoConferencing_7443/ofmeet/;
    }
}

@deleolajide
Member

Thank you for responding with so much detail 👍

if we define the private and public addresses with openfire.xml the video bridges and jicofo refuse to cooperate.

The issue could be me using network.interface, which causes Openfire to bind to that network adaptor only. Let me make a change to use a different parameter value and see if it makes a difference.

@deleolajide
Member

deleolajide commented Mar 22, 2022

I have made the change. See this commit

Use this snapshot build of pade.jar to test it. I tested on my dev server and it works ok with 2 nodes. However, I don't have a load balancer in front of Openfire

Now use the following new property names instead of the old names

  <ofmeet> 
    <local_address>192.168.1.251</local_address>  
    <public_address>192.168.1.251</public_address> 
  </ofmeet>

Updated wiki page - https://github.com/igniterealtime/openfire-pade-plugin/wiki/Clustering-multiple-Jitsi-Videobridges-using-Hazelcast-plugin

@slmc-tech
Author

@deleolajide Thank you very much for spending time on this!!
From a quick look at this I can see that the videobridge and jicofo are coming up now.
I would like some time to test how this is working with the load balancer and will get back to you with my findings by tomorrow at the latest.

@slmc-tech
Author

@deleolajide so we have done some testing here and it kind of seems like we are 50% there :-)
Basically what happens is if the clients are directed to the senior node with the load-balancer everything works just fine. So they can create and join conferences and everyone is thrilled.
But.
If for some reason the client is directed to node02 which let's say is the junior node then the connection is declined. This apparently is not related to the load balancer since the connection is declined even if clients go directly to the junior node.
Did you manage to make this work in your lab? Do you think anything would change if we removed the old plugin completely and then tried to reinstall the version you have shared?
We did not go through a complete removal when we tried your latest commit.
We are also seeing an exception and some warnings which may be relevant.
2022.03.22 21:20:10 ERROR [httpbind-worker-2]: org.jivesoftware.openfire.spi.RoutingTableImpl - Primary packet routing failed
java.lang.IllegalArgumentException: Requested node f365305d-67d4-4726-a069-4ad889fa7fc1 not found in cluster
at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory.doClusterTask(ClusteredCacheFactory.java:409) ~[?:?]
at org.jivesoftware.util.cache.CacheFactory.doClusterTask(CacheFactory.java:751) ~[xmppserver-4.7.1.jar:4.7.1]
at org.jivesoftware.openfire.cluster.ClusterPacketRouter.routePacket(ClusterPacketRouter.java:43) ~[xmppserver-4.7.1.jar:4.7.1]
at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToComponent(RoutingTableImpl.java:569) ~[xmppserver-4.7.1.jar:4.7.1]
at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:354) ~[xmppserver-4.7.1.jar:4.7.1]
at org.jivesoftware.openfire.IQRouter.handle(IQRouter.java:340) ~[xmppserver-4.7.1.jar:4.7.1]
at org.jivesoftware.openfire.IQRouter.route(IQRouter.java:105) ~[xmppserver-4.7.1.jar:4.7.1]
at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:74) ~[xmppserver-4.7.1.jar:4.7.1]
at org.jivesoftware.openfire.SessionPacketRouter.route(SessionPacketRouter.java:104) ~[xmppserver-4.7.1.jar:4.7.1]
at org.jivesoftware.openfire.SessionPacketRouter.route(SessionPacketRouter.java:63) ~[xmppserver-4.7.1.jar:4.7.1]
at org.jivesoftware.openfire.http.HttpSession.lambda$sendPendingPackets$2(HttpSession.java:559) ~[xmppserver-4.7.1.jar:4.7.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_322]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_322]
at java.lang.Thread.run(Thread.java:750) [?:1.8.0_322]
2022.03.22 21:21:39 WARN [httpbind-worker-4]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Requested node f365305d-67d4-4726-a069-4ad889fa7fc1 not found in cluster
2022.03.22 21:21:43 WARN [httpbind-worker-2]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Requested node f365305d-67d4-4726-a069-4ad889fa7fc1 not found in cluster
2022.03.22 21:21:51 WARN [httpbind-worker-4]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Requested node f365305d-67d4-4726-a069-4ad889fa7fc1 not found in cluster
2022.03.22 21:21:52 WARN [hz.openfire.cached.thread-2]: com.hazelcast.nio.tcp.TcpIpConnectionErrorHandler - [192.168.1.81]:5701 [openfire] [3.12.5] Removing connection to endpoint [192.168.1.80]:5701 Cause => java.net.SocketException {Connection refused to address /192.168.1.80:5701}, Error-Count: 5
2022.03.22 21:21:52 WARN [hz.openfire.cached.thread-2]: com.hazelcast.internal.cluster.impl.MembershipManager - [192.168.1.81]:5701 [openfire] [3.12.5] Member [192.168.1.80]:5701 - 862938de-b1dc-4a9e-9f71-648871bb1e44 is suspected to be dead for reason: No connection
2022.03.22 21:21:52 WARN [ClusterManager events dispatcher]: org.jivesoftware.openfire.spi.RoutingTableImpl - Client route not found for route jvb@domain.com, while user session still exists, Current content of users cache is {user01@domain.com/Spark=ClientRoute{nodeID=35caf93e-6993-4f27-b85d-8d41c518b2a7, available=true}, jvb@domain.com/60vcrv22u5=ClientRoute{nodeID=35caf93e-6993-4f27-b85d-8d41c518b2a7, available=true}, strategydummy@domain.com/5zvze1werm=ClientRoute{nodeID=35caf93e-6993-4f27-b85d-8d41c518b2a7, available=false}}
2022.03.22 21:22:40 WARN [pool-monitoring12]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - An instance of class org.jivesoftware.openfire.reporting.stats.GetStatistics (provided by plugin monitoring) that is executed as a cluster task. This will cause issues when reloading the plugin that provides this class. The plugin implementation should be modified.

@deleolajide
Member

Do you think anything would change if we removed the old plugin completely and then tried to reinstall the version you have shared?

I believe you have to remove the old plugin on both nodes and make sure both nodes have pade version 1.6.3-SNAPSHOT

Did you manage to make this work in your lab?

Yes, but I created two instances of Openfire bound to different IP addresses on my DEV PC. I noticed that I had to wait for a few minutes before both nodes saw each other.

the connection is declined even if clients go directly to the junior node.

That sounds like a regression. You were able to connect to either node directly before.

@gjaekel
Contributor

gjaekel commented Mar 23, 2022

(As I was very busy yesterday, I wasn't able to contribute.)
@slmc-tech: Maybe it would be helpful to draw a rough network sketch (which might also be published in the wiki later on).
You'll surely realize that there are perhaps three domains of communication:

  • The clustered OpenFire setup with internal cluster traffic, acting transparently as one server towards the public.
  • The XMPP client/server view, where a client browser loads the web app from the server via HTTP(S) and communicates via XMPP.
  • The Jitsi A/V view, where a client holds a SWebSocket control connection (CoLiBri) with one or many JVBs and sends/receives UDP traffic to/from them.

Let me ask whether you want your setup to act in active/active mode (to spread the load) or active/passive mode (for redundancy). You have to answer this for both the "Openfire" and the "Jitsi" domain. The two domains don't have to use the same mode, but the configuration will have to respect the choice, of course.

Note that the SWebSocket proxy offered by Pàdé isn't mandatory from Jitsi's point of view. It just eases the network setup, because it allows running the JVB without an external IP. But this external IP is announced by the ICE component and is the important one for the A/V streams: the ICE handshake will try to choose the right target IP to be used as a route between the clients and the serving JVB (for UDP, with an optional fallback to TCP).

@slmc-tech
Author

@deleolajide you were right. We removed the plugin and cleared all parameters stored in the DB and then reinstalled the snapshot. This is what we are seeing now:

  • If only one node is up people can create and join conferences regardless of whether they go through the proxy - load balancer or not. In other words if we stop the service at any of the nodes video conferencing works just fine.
  • If both nodes are up you can only do a session with two participants regardless of whether you go directly to the nodes or through a load balancer. Also it does not matter where each of the two participants sit. It could be on the same node or split between nodes, video conferencing works when there are two participants. If there are more than two participants on the session you can see the users as logged in to the room but there is no audio or video. This again is regardless of where clients are connected. You could have all clients directly connecting to one node or be split between the nodes in any case you have no audio or video.
  • When there are two participants (in which case as discussed you have both video and audio) if one of the participants gets disconnected or the browser is reloaded then when he or she rejoins it appears like a third person has joined the room which results in both video and audio getting lost. In other words it looks to the browser like there are stale sessions.... Interestingly if you look at the sessions in the admin console at each point you can see those people that are indeed connected. So the admin console correctly reports two members in the room but for the participants it looks like there are three or more depending on the number of browser reloads.

@gjaekel thank you for your input. I understand that there are all these moving parts with this application. In terms of our "needs", we do not really need to distribute the load across multiple nodes. We are just looking for some resiliency for reboots etc. Also, I don't think there is any point in sketching up a diagram, since it would be exactly like the one you pointed me to in your earlier post. It can also be derived from the load balancer configuration.

@deleolajide from all the tests that we were able to perform, it looks to me like something breaks with the videobridge, or with how focus operates, in this new snapshot. I have been using your work for so many years and I really wish I could do more to help you with this. Alas, Java is not my cup of tea.

@gjaekel
Contributor

gjaekel commented Mar 23, 2022

When there are two participants ...

Note that the XMPP connections and the JVB connections are two different things. You "see" other members in rooms through the XMPP part, but the A/V connections are built up by the JVB. The JVB is commanded by OpenFire to "open the communication channels".

If there are more than two participants ...

If P2P is enabled, two-participant meetings connect the participants to each other directly, bypassing the JVB.

I wonder if and how your two JVBs are exposed to the internet: are both visible on separate IPs, or are they hidden behind and served through the load balancer? In the latter case: how do you manage the UDP traffic?

@slmc-tech
Author

@gjaekel I see. Thank you for clarifying this. I can see that p2p is indeed enabled. As discussed if we stop one of the servers everything works just fine so maybe this is a clustering issue then....

@gjaekel
Contributor

gjaekel commented Mar 23, 2022

... so we probably "just" have left an issue with the A/V streams.

@slmc-tech
Author

:-)

@slmc-tech
Author

@gjaekel I missed one of your points. None of the nodes are exposed to the internet. They sit within a class C internal network and are accessible from the internet through a reverse proxy where we port forward 443 to 7443 of the nodes. UDP streams are something that I am still trying to figure out how to load balance effectively. As of now we are just forwarding UDP to the nodes like so:
upstream VideoConferencing {
    hash $remote_addr consistent;
    server 192.168.1.80:10000;
    server 192.168.1.81:10000;
}

server {
    listen 10000 udp;
    access_log /var/log/nginx/video_udp.log main;
    error_log /var/log/nginx/video_udp.log;
    proxy_pass VideoConferencing;
}
This is far from ideal since you cannot control which node is used. So you cannot really pair tcp and udp streams for the clients.
Also none of the nodes have internet access.
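The pairing problem can be illustrated with a toy model. The two pick functions below are NOT nginx's real algorithms; they are stand-ins that only preserve the one property that matters here: the HTTPS upstream (ip_hash) is keyed on the first three octets of the client's IPv4 address, while the UDP upstream is keyed on the full address with a different hash. The node labels and function names are invented for the example.

```python
import zlib

NODES = ["node01", "node02"]  # hypothetical labels for the two Openfire nodes


def tcp_pick(client_ip: str) -> str:
    """Toy stand-in for ip_hash: keyed on the first three IPv4 octets only."""
    key = ".".join(client_ip.split(".")[:3])
    return NODES[zlib.crc32(key.encode()) % len(NODES)]


def udp_pick(client_ip: str) -> str:
    """Toy stand-in for 'hash $remote_addr': keyed on the full address."""
    return NODES[zlib.adler32(client_ip.encode()) % len(NODES)]


def mismatched_clients(client_ips):
    """Clients whose HTTPS/WebSocket node differs from their UDP media node."""
    return [ip for ip in client_ips if tcp_pick(ip) != udp_pick(ip)]


if __name__ == "__main__":
    clients = [f"198.51.100.{i}" for i in range(1, 11)]  # documentation range
    for ip in clients:
        print(ip, tcp_pick(ip), udp_pick(ip))
    print("mismatched:", len(mismatched_clients(clients)))
```

Because the two decisions are made independently, nothing guarantees that a client's signalling and media land on the same node, which is the concern raised above.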

@gjaekel
Contributor

gjaekel commented Mar 23, 2022

This is far from ideal since you cannot control which node is used ...

But here, perhaps in contrast to DNS requests, we have a highly stateful use case: the traffic must be directed to the bridge that knows about the session, because IMHO, from the "Jitsi point of view", the participants of a session and all other corresponding data are held there. In the case of a JVB cluster, there's an additional inter-bridge communication TCP channel. The "Octo" feature seems to allow traffic to be moved between bridges.

Having written this, I want to point out that I haven't had any practical experience with this so far. I have also never used nginx and wasn't aware that it offers UDP proxying and even load balancing.

@gjaekel
Contributor

gjaekel commented Mar 23, 2022

As a PoC, I would recommend using just one JVB to start with, i.e. just let nginx proxy to one destination.

@slmc-tech
Author

Yes nginx can indeed load balance UDP that's why we have been using it for this particular service. By using just one JVB you mean route all UDP to just one node when in fact both nodes are used for tcp load balancing?
We have been using this configuration with just one node (no clustering) for more than a year with no issues. Even now, if we shut down one of the nodes everything works just fine regardless of how you reach the node (directly or through nginx). As noted, if we fire up the second node we have no A/V, again regardless of whether you reach the nodes through a load balancer or not. You can direct all connections to one node directly and it does not work. You can split traffic between nodes and again it does not work. When it does not work, we are also seeing a lot of 404s to https://www.gravatar.com from the clients. Any idea what this is all about?

@gjaekel
Contributor

gjaekel commented Mar 23, 2022

By using just one JVB you mean route all UDP to just one node when in fact both nodes are used for tcp load balancing?

No, this is what will not work for sure IMHO. For the first sprint, expose one bridge using one external IP to the clients for TCP and UDP. Next, you may expose the second one on a different IP. After that, you may check whether the inter-JVB communication works as well, as this will allow failing over to another bridge in the case of a graceful(!) shutdown of one, or when the JVB load-balancing mechanism tries to move traffic between bridges. Note that (to my knowledge) one actual session can't be split between bridges.

@slmc-tech
Author

I see. We can confirm that everything is working just fine when there is just one node, so we don't really need to test this any further. This works fine with and without the intermediate load balancer. But if you fire up the second node, it does not work. Take the load balancer out of the equation completely: it does not work. It does not work if you direct all traffic to one node, and it does not work if you split clients between the nodes. The Openfire nodes just do not do any load balancing. It is as if they are not in a cluster, and no A/V of any kind works.
This I believe is significant because with the previous version of Pade load balancing was indeed working when you took the intermediate load balancer device out of the setup. So with Pade 1.2 you could distribute the sessions and traffic and it would work as long as the clients connected directly to the openfire nodes. This is what we noticed from the start today as mentioned in my comments earlier. In my opinion we would need to make sure that this is working without an intermediate load balancer device before making any changes to the intermediate load balancer. I hope this makes some sense.

@deleolajide
Member

In my opinion we would need to make sure that this is working without an intermediate load balancer device before making any changes to the intermediate load balancer. I hope this makes some sense.
👍

@gjaekel
Contributor

gjaekel commented Mar 23, 2022

BTW: Evaluating clustering is on my to-do list, and our network layout seems comparable. So getting this running is a win-win for me. 😉

@slmc-tech
Author

@deleolajide I hope this is going to help you approach the issue when that time comes.
We have done a lot of testing here and it looks like there is something wrong with the snapshot version of the Plugin at the clustering level. We removed all interface configuration from the openfire.xml file for both nodes and when you connect to the senior node directly you can do p2p conferencing but when you connect to the junior node directly your connection is automatically declined. You cannot join or create conferences from the junior node.
Also we did a lot of work trying various configurations including:

  • having hardcoded network settings on one of the nodes' openfire.xml file but no setting on the other node,
  • having settings on the db for one of the nodes and hardcoded settings on the openfire.xml file for the other node, and
  • having no settings on any of the openfire.xml files on the nodes and no settings on the DB.

Basically nothing worked. Keep in mind that I am talking about direct connections to the nodes here with no load balancer interfering.
Did you manage to get clustering working with this version in your lab? When it worked did you define settings on the openfire.xml file?

@deleolajide
Member

Did you manage to get clustering working with this version in your lab? When it worked did you define settings on the openfire.xml file?

I did, but it is not working any more. I had to modify openfire.xml to bind both nodes to different IP addresses on the same PC and I also modified hazelcast-local-config.xml. It looks like my test was incorrectly done or I have since messed up my cluster configuration. Sorry for misleading you.

I have to find a free weekend to spend some quality time on this and set up a proper multi-node cluster with Docker or multiple PCs.

If possible, can you confirm that normal group chat works ok on your cluster with Spark, Inverse or any other XMPP client, using clients connected to both nodes? Please confirm that the MUC room can be created ok from any node. Thanks

@deleolajide
Member

deleolajide commented Mar 24, 2022

I hope this is going to help you approach the issue when that time comes.

Indeed it has. Thanks for the support in getting this working. I think we are now almost there with these latest changes.

I got it working with 6 users, 3 on each node 👍

This is jitsi-meet on node1

This is jitsi-meet on node2

You can confirm the distribution of users on the nodes
image

The key difference is that I have given each JVB user unique names across the cluster. That is achieved by adding an extra Openfire XML Property octo_id.

  <ofmeet> 
    <local_address>192.168.1.251</local_address>  
    <public_address>93.184.216.34</public_address> 
    <octo_id>1</octo_id> 
  </ofmeet>  
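For anyone scripting this across several nodes, here is a small sketch that generates one such fragment per node and numbers octo_id sequentially so that it is unique across the cluster. Only the three element names come from the snippet above; the helper names, the sample addresses and the assumption that sequential integers are acceptable octo_id values are mine (the thread only shows the value 1).

```python
import xml.etree.ElementTree as ET


def ofmeet_fragment(local_address: str, public_address: str, octo_id: int) -> str:
    """Build the <ofmeet> fragment for one node as a string (hypothetical helper)."""
    root = ET.Element("ofmeet")
    ET.SubElement(root, "local_address").text = local_address
    ET.SubElement(root, "public_address").text = public_address
    ET.SubElement(root, "octo_id").text = str(octo_id)
    return ET.tostring(root, encoding="unicode")


def cluster_fragments(nodes):
    """nodes: list of (local, public) pairs; octo_id is assigned 1, 2, ... per node."""
    return [ofmeet_fragment(local, public, i)
            for i, (local, public) in enumerate(nodes, start=1)]


if __name__ == "__main__":
    # Sample addresses only; each fragment goes into that node's own openfire.xml.
    for fragment in cluster_fragments([("192.168.1.80", "93.184.216.34"),
                                       ("192.168.1.81", "93.184.216.34")]):
        print(fragment)
```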

I am going to test properly with two PCs over the weekend. If you can't wait for that, then try the latest snapshot at https://igniterealtime.org/projects/openfire/plugins/1.6.3-SNAPSHOT/pade.jar?snapshot=20220324.143502-29

@gjaekel
Contributor

gjaekel commented Mar 24, 2022

With clustered JVBs, a graceful shutdown might become meaningful. It is requested via the REST API (which must be enabled) and puts the bridge into a state where it continues to host existing conferences but does not accept new ones.
https://github.com/jitsi/jitsi-videobridge/blob/master/resources/graceful_shutdown.sh#L82

@slmc-tech
Author

@deleolajide Apologies for the very late response.
We apparently cannot make this work on our network. I am surprised that it works for you.
I have used the latest snapshot as per your last comment and have bound JVB explicitly.
The only difference is that in my clustering I am not defining a required member. I will do this now and post back.
All testing is done directly on the nodes, and nothing is interfering. From the logs I do not see anything interesting; basically no errors get logged as far as I can tell. The behavior we see is as discussed above, meaning that no video or audio is joined when clients connect to different nodes. I doubt that this is network related, since we do not see any packet drops or anything like that. We do see client browsers trying to reach a www.gravatar.com domain, and I have no idea why. Is it supposed to do that?
Please suggest any changes in the logs that you think would show something.
Also as far as the wiki goes I believe it should be mentioned that all jvb accounts need to have the same domain password which is defined once for the primary jvb account in the application.
Keep in mind that I never managed to make the jvb1 authenticate. This was resolved by moving to a hybrid auth for openfire.

@slmc-tech
Author

Quick update: I have set up the clustering the same way as you, Dele, and there is no change. On the browser I am seeing 404 for the https://www.gravatar.com/avatar/ domain and 405 for the https://rv-xmpp-02.domain.com:7443/ws/?room=test room. I will also disable WebSockets Data Channel from the network settings and retest.

@gjaekel
Contributor

gjaekel commented Mar 30, 2022

... 405 for the https://rv-xmpp-02.domain.com:7443/ws/?room=test room.

This is caused by a periodic connection test of the WebSocket connection, initiated by the Jitsi web client. There should be no need for this, but I found that it is meant to keep firewalls happy with the long-running WebSocket connection. We may implement a GET method to avoid this, because it looks to newbies like an error as we see.

@slmc-tech
Author

I am pretty sure I was not seeing a 405 in this before. I believe this was 200. Anyway I just cannot make this work it seems.

@deleolajide
Member

deleolajide commented Mar 30, 2022

We apparently cannot make this work on our network. I am surprised that it works for you.

Probably because I don't have any firewalls or any security restrictions in place

I assume the situation is still the same: it works when all clients connect to the same node and have the same region value in their config.js generated by ofmeet. If clients connect to both nodes, then only clients on the same node see and hear each other.

This may be obvious, but are you absolutely sure you have opened TCP port 4096 between both nodes for Jitsi Otco communication as this is what enables the multiple JVBs to share the same meeting.

On the browser I am seeing 404 for the https://www.gravatar.com/avatar/ domain

I assume this is because there is no internet access to your network, otherwise that is very strange.

We may implement a GET method to avoid this, because it looks to newbies like an error as we see.

Great idea 👍 💯

@slmc-tech
Author

I shall double-check the firewall and indeed turn it off altogether on the servers for testing. Also note that the gravatar.com 404s are seen on clients that have working internet access. The browser is instructed by the application to go to this domain and gets a 404. This only happens when there is an issue with the application, i.e. if all clients go to one node we do not see this message. I presume it is a Jitsi thing then...
I will post back once I have tested with no firewall or any kind of security in place.

@slmc-tech
Author

Just wanted to point out that JVB seems to be looking for UDP 4096 not TCP. Still testing this and it is still not working. I will post more once I have a conclusive idea of what the application is trying to do. Looking at server captures now.

@gjaekel
Contributor

gjaekel commented Mar 30, 2022

@slmc-tech You're right with UDP. You may check connectivity with nc -z -v -u IP-ADDRESS 4096
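
Because UDP is connectionless, a probe of this kind can only tell "closed" (an ICMP port-unreachable came back) from "open or filtered" (silence). A minimal Python sketch of the same idea follows; `probe_udp` is a hypothetical helper name, the payload is arbitrary, and it assumes that a listening service does not necessarily answer an unknown datagram:

```python
import socket


def probe_udp(host: str, port: int = 4096, timeout: float = 2.0) -> str:
    """Send one datagram and classify the outcome.

    Returns "closed" if the OS reports an ICMP port-unreachable,
    "reply" if the peer answered, and "open_or_filtered" on silence.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            # connect() on a UDP socket only fixes the peer address; it lets
            # the OS report ICMP errors to us as ConnectionRefusedError.
            sock.connect((host, port))
            sock.send(b"probe")
            sock.recv(1024)
            return "reply"
        except ConnectionRefusedError:
            return "closed"
        except socket.timeout:
            return "open_or_filtered"
```

Calling `probe_udp("IP-ADDRESS", 4096)` from one node against the other gives a quick hint; "open_or_filtered" is not proof that packets arrive, so a packet capture on the receiving node remains the more reliable check.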

According to the very long Jitsi Community discussion about bridging you should see something like

JVB 2021-04-27 17:33:40.768 INFO: [1] OctoRelayService.<init>#72: Created Octo UDP transport
JVB 2021-04-27 17:33:40.852 INFO: [1] [relayId=10.0.100.12:4096] BridgeOctoTransport.<init>#78: Created OctoTransport

in the log. Here is another article about Octo configuration.
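
To check a log for these entries without reading it by eye, something like the following Python sketch may help. It assumes the log lines look like the two-line excerpt above; `find_octo_relays` and the regular expression are illustrative names, not part of Jitsi or Pade:

```python
import re

# Matches the relayId field as it appears in the excerpt above.
RELAY_ID = re.compile(r"\[relayId=([^\]\s]+)\]")


def find_octo_relays(log_lines):
    """Return the relayId values from lines reporting OctoTransport creation."""
    relays = []
    for line in log_lines:
        if "Created OctoTransport" not in line:
            continue
        match = RELAY_ID.search(line)
        if match:
            relays.append(match.group(1))
    return relays


sample = [
    "JVB 2021-04-27 17:33:40.768 INFO: [1] OctoRelayService.<init>#72: Created Octo UDP transport",
    "JVB 2021-04-27 17:33:40.852 INFO: [1] [relayId=10.0.100.12:4096] BridgeOctoTransport.<init>#78: Created OctoTransport",
]
print(find_octo_relays(sample))  # ['10.0.100.12:4096']
```

An empty result for a node's log would suggest that Octo was not started on that bridge, assuming the log format matches the excerpt.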

@slmc-tech
Author

Ok I think I have found the issues. I have managed to make this work now. I believe I will manage to make this work through the load balancer also now that I understand what the traffic flows look like a bit better. I will be updating with my comments shortly. I really think that this should be properly documented in a wiki by the way... @deleolajide let me know how i can help once i have verified a working production ready set up.

@slmc-tech
Author

Dele, I am going through packet captures; that's how I found the various misconfigurations. I will update as soon as possible.

@gjaekel
Contributor

gjaekel commented Mar 30, 2022

@deleolajide
Concerning the "405" error, I compared to https://meet.ffmuc.net. They use Prosody and this delivers on the URL https://meet.ffmuc.net/xmpp-websocket?room=foo just

<!DOCTYPE html><html><head><title>Websocket</title></head><body>
			<p>It works! Now point your WebSocket client to this URL to connect to Prosody.</p>
			</body></html>

@deleolajide
Member

They use Prosody and this delivers on the URL https://meet.ffmuc.net/xmpp-websocket?room=foo

I have deployed latest code and now https://pade.chat:5443/pade/keepalive/ works 👍

let me know how i can help once i have verified a working production ready set up.

That is music to my ears. You are most welcome to help us improve documentation. I can give you permission to edit the wiki files and you can submit PRs on the main code.

@deleolajide
Member

I really think that this should be properly documented in a wiki by the way..

image

Looking forward to the updated wiki 👍

@gjaekel
Contributor

gjaekel commented Mar 30, 2022

I have deployed latest code and now https://pade.chat:5443/pade/keepalive/

Oh, that's a completely different approach: I thought that the Jitsi web client calls GET /ws/?room=foo hard-wired, so that we would have had to register a handler for this path.

The committed code should work; I simulated it by adding config.websocketKeepAliveUrl = "/pade/popup.html" to config_custom.js, and the browser console then logs that it is fetching this URL (successfully). But I wonder whether this meets the intention of the feature: in jitsi/lib-jitsi-meet#1123 (comment) and the answer to it, it is rumored that this should "keep the things running on load balances" for the path used for the WebSocket URL. This would mean that we have to register a dummy GET handler for /ws/. This mimics are located in Openfire, right?

@deleolajide
Member

This mimics are located in Openfire, right?

Yes. I have no intention of making a PR on Openfire. It is much easier to use config.websocketKeepAliveUrl, as meet.jit.si/config.js does, or to set config.websocketKeepAlive to 0.
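
To illustrate why a plain GET against the WebSocket path shows up as 405 while a dedicated keepalive URL returns 200, here is a small self-contained Python sketch. The stub server is purely illustrative and is not Openfire's implementation; its status codes simply mirror what was reported in this thread, and `StubHandler`, `get_status` and `check_stub` are hypothetical names:

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed behavior, mirroring the thread: the websocket path has no plain
# GET handler (405), while the keepalive path answers an ordinary GET (200).
STATUS_BY_PATH = {"/ws/": 405, "/pade/keepalive/": 200}


class StubHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        path = self.path.split("?", 1)[0]  # ignore the ?room=... query
        self.send_response(STATUS_BY_PATH.get(path, 404))
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def get_status(url: str) -> int:
    """Return the HTTP status of a plain GET, including error statuses."""
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def check_stub() -> dict:
    """Start the stub on a free local port and probe both paths."""
    server = HTTPServer(("127.0.0.1", 0), StubHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}"
    try:
        return {p: get_status(base + p) for p in ("/ws/?room=test", "/pade/keepalive/")}
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(check_stub())  # {'/ws/?room=test': 405, '/pade/keepalive/': 200}
```

Pointing `get_status` at a real deployment's keepalive URL would be the practical use; TLS settings and certificates are left out of this sketch.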

@gjaekel
Contributor

gjaekel commented Mar 31, 2022

It's much easier, indeed. I agree; let's first see the effects of this in the wild.

@deleolajide
Member

It will now be possible to see the settings in openfire.xml that affect Pade clustering from the Openfire Admin UI. I have updated the wiki.

image

@gjaekel
Contributor

gjaekel commented Mar 31, 2022

I don't understand the value shown for "Interfaces Allowed"; I would expect something like ethX, as on the Pade|Networking page
image

@deleolajide
Member

I don't understand the value shown for "Interfaces Allowed"; I would expect something like ethX, as on the Pade|Networking page

I have a Windows PC :-)

image

@gjaekel
Contributor

gjaekel commented Mar 31, 2022

Ahh! 😄

@gjaekel
Contributor

gjaekel commented Mar 31, 2022

From the "Octo-Docs": The SplitBridgeSelectionStrategy can be used for testing. It tries to select a new bridge for each client, regardless of the regions. This is useful while testing, because you can verify that Octo works before setting up the region configuration for the clients.

Could we offer an option to switch between RegionBasedBridgeSelectionStrategy and SplitBridgeSelectionStrategy?
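
The difference between the two strategies can be pictured with a toy Python model. This is only a sketch of the behavior described in the quote above, not Jicofo's actual implementation, which takes more factors into account; all names (`Bridge`, `select_region_based`, `select_split`, `place`) are hypothetical:

```python
from dataclasses import dataclass, field


@dataclass
class Bridge:
    name: str
    region: str
    participants: list = field(default_factory=list)


def select_region_based(bridges, client_region):
    """Toy rule: least-loaded bridge in the client's region, else least-loaded overall."""
    local = [b for b in bridges if b.region == client_region]
    return min(local or bridges, key=lambda b: len(b.participants))


def select_split(bridges, client_region):
    """Toy rule: ignore regions and prefer a bridge the conference does not use yet."""
    unused = [b for b in bridges if not b.participants]
    return min(unused or bridges, key=lambda b: len(b.participants))


def place(clients, strategy):
    """Assign (name, region) clients to two fresh bridges using the given strategy."""
    bridges = [Bridge("jvb1", "region-a"), Bridge("jvb2", "region-b")]
    for name, region in clients:
        strategy(bridges, region).participants.append(name)
    return {b.name: b.participants for b in bridges}


clients = [("alice", "region-a"), ("bob", "region-a")]
print(place(clients, select_region_based))  # {'jvb1': ['alice', 'bob'], 'jvb2': []}
print(place(clients, select_split))         # {'jvb1': ['alice'], 'jvb2': ['bob']}
```

With two clients in the same region, only the split rule forces the conference across both bridges, which is why it is handy for verifying Octo before any region configuration exists.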

@slmc-tech
Author

It will now be possible to see the settings in openfire.xml that affect Pade clustering from the Openfire Admin UI.

This is very useful indeed. OK guys, I have managed to make this work in our environment, with load balancers and all, but I must say that this is beyond complicated from a networking perspective. Keep in mind that I am working with enterprise-grade equipment and the whole setup is quite complicated to start with.
@deleolajide before I proceed with the wiki, should I send you an email with my findings from a networking perspective to verify that this is what you expect from an application perspective? How do you want to handle this?

@gjaekel
Contributor

gjaekel commented Mar 31, 2022

I use Google Docs as the source for the existing Single Node Network Diagram. May I create a new one that we can co-edit?

@slmc-tech
Author

Sure yes we can do that Guido.

@deleolajide
Member

How do you want to handle this?

I am a simple developer :-) Network configuration sends my head into a spin. Please feel free to create as many wiki pages as you like, with as much information as possible for other DevOps and network administrators who will need this stuff 👍 💯

@deleolajide
Member

verify that this is what you expect from an application perspective.

It is actually the other way round. Tell me what you need to simplify this and make it easy to use and I will do my best to get into the code 👍

@slmc-tech
Author

Ok I will try to explain this in as much detail as I can with all protocols involved. Should I start a new wiki page then or work on an existing one? I would like to manage your expectations on this in terms of time. I want to do this as soon as possible before I forget how this is set up but I have a ton of things that I have put on hold to work on this so it will take a couple of days. Sounds fair?

@gjaekel
Contributor

gjaekel commented Mar 31, 2022

@deleolajide
Member

Sounds fair?

Indeed it does.

Whatever you contribute will be fully appreciated, considering how busy you are and the sacrifices you have made to get this far and achieve the breakthrough. We can always go back to merge and edit the wiki pages. Capturing as much info as possible while it is still fresh in memory, and while we have access to screenshots and config data, should definitely be the priority.

@slmc-tech
Author

I have started this wiki page to discuss the proxy setup. We should also put the network diagrams in there, I think.
https://github.com/igniterealtime/openfire-pade-plugin/wiki/Considerations-On-Setting-Up-Pade-Behind-a-Reverse-Proxy
I will let you know once the text is all there.
