
Add DynamicInetProvider to get the IP of the gateway dynamically #557

Closed
jasonjoo2010 wants to merge 2 commits

Conversation


jasonjoo2010 commented Nov 26, 2017

This provides two implementations by default:

  • SimpleInetProvider
    identical to the previous behavior
  • DynamicInetProvider
    periodically loads an IP list from the given DNS servers; the list grows to roughly 70~100 IPs after several minutes of running.

We improved throughput from 4~20k to 30~38k per node, with total push time dropping from 40~55 minutes to under 14 minutes (4.18M tokens, 5 nodes with 2 cores/4 GB each).

our config:
ipush.threadnum = 6      // I/O threads in netty
ipush.connections = 128  // concurrent connections
ipush.qps = 5000         // max messages awaiting "completion notification"; we stop handing more to netty once this is reached
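
(For context, a minimal sketch of how settings like these might map onto Pushy's client builder. The ipush.* names are this project's own application config, not Pushy options; the method names below are from the 0.12-era com.turo.pushy API and may differ in your version, and the credential setup is omitted:)

```java
import com.turo.pushy.apns.ApnsClient;
import com.turo.pushy.apns.ApnsClientBuilder;
import io.netty.channel.nio.NioEventLoopGroup;

public class ClientConfigSketch {
    // ipush.qps = 5000 would be enforced by our own sending loop, not by
    // Pushy: stop submitting once 5000 notification futures are outstanding.
    public static ApnsClient build() throws Exception {
        return new ApnsClientBuilder()
                .setApnsServer(ApnsClientBuilder.PRODUCTION_APNS_HOST)
                // ipush.threadnum = 6: Netty I/O threads
                .setEventLoopGroup(new NioEventLoopGroup(6))
                // ipush.connections = 128: concurrent connections to APNs
                .setConcurrentConnections(128)
                // certificate/signing-key setup omitted for brevity
                .build();
    }
}
```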

Before the dynamic provider:

[screenshot]

[screenshot]

After the dynamic provider:

[screenshot]

Several days during optimization:

[screenshot]

@qiankunli

dnsjava is only responsible for domain name resolution, so why would it produce an improvement like this? In practice we ran into large numbers of Http2ChannelClosedException and Http2GoAwayException. I added a retry mechanism in pushy to handle pushes that were never sent; for example, pushes that fail on acquiring a channel or on write are retried. But for other exceptions, since it's unclear whether APNs actually delivered them, I don't dare retry blindly. My preliminary guess is that connection quality is low. Have you run into this kind of problem on your side?

@jchambers
Owner

@qiankunli Sorry, I can't reply in Chinese, but here are my thoughts.

Some users have reported that some APNs servers are dramatically faster than others for reasons that are not entirely clear. If I'm understanding this pull request correctly (and I have not had much time to read it, regrettably), it actually accumulates a list of all of the IP addresses it has ever seen when resolving api.push.apple.com.

Normally, DNS resolution only gives us ~8 results at a time, but with a short TTL. If this change means that we're remembering all of the IP addresses we've seen in the past, it would mean we're distributing connections across servers that Apple is generally not asking people to use. That may mean we wind up connecting to less loaded servers, or servers that are (again, for reasons unknown) faster.

It's also worth noting that right now, Pushy doesn't deliberately distribute connections even across the ~8 IPs that DNS already gives us. It may be that the performance gains that @jason-joo is seeing just come from diversification. Some of the servers are worse, but some are better, and on average, we see better results across the board. This seems likely given the relatively high number of connections Jason is using.

That said, I'm not sure if this is either the strategy or implementation we want to use. My knee-jerk reaction is that we should respect what's coming from DNS and not try to connect to servers that aren't in the rotation. I'm also a little hesitant about adding a new dependency when Netty (to some extent) already does this kind of thing. Could we achieve similar results with DnsNameResolver, for example?
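
(For reference, a minimal sketch of resolving all A records with Netty's DnsNameResolver, the class mentioned above. This is an illustration of the general approach, not code from this pull request:)

```java
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioDatagramChannel;
import io.netty.resolver.dns.DnsNameResolver;
import io.netty.resolver.dns.DnsNameResolverBuilder;
import java.net.InetAddress;
import java.util.List;

public class ResolveAllSketch {
    public static void main(String[] args) throws Exception {
        final NioEventLoopGroup group = new NioEventLoopGroup(1);
        final DnsNameResolver resolver = new DnsNameResolverBuilder(group.next())
                .channelType(NioDatagramChannel.class)
                .build();

        // resolveAll returns every address in the current DNS answer
        // (~8 for api.push.apple.com), not just the first one.
        final List<InetAddress> addresses =
                resolver.resolveAll("api.push.apple.com").sync().getNow();
        addresses.forEach(a -> System.out.println(a.getHostAddress()));

        resolver.close();
        group.shutdownGracefully();
    }
}
```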

If there are significant performance gains to be had, we should consider this issue seriously. At the same time, I'd like to think a bit more about the specific approach.

In the meantime, @jason-joo, thanks very much for the contribution!

@jasonjoo2010
Author

jasonjoo2010 commented Nov 29, 2017

@jchambers Thanks for your response.

I have read your post carefully and I agree with most of your opinions. I'm glad to explain a bit further.

  1. About the record's TTL

It's correct as you said in your post. But we must note that records always need more time to be refreshed all over the world (up to 48 hours in the worst case) because of the many layers of DNS forwarders.

So if you get a record with a 60-second TTL, it means you should fetch it again from your upstream after 60 seconds: no sooner, for better cache performance, and no later, to keep a valid record.

So it doesn't mean your application must refresh at exactly that point; that's purely the protocol's design. In practice, though, the connection to the IP resolved at application startup never changes, due to the JVM's DNS cache, and the connection won't be closed unless the Apple server decides to close it. Our approach, by contrast, actually obeys the TTL rules of the DNS protocol (we use a 300-second expiration, configurable). Apple probably uses the short TTL mainly for load-balancing purposes.

  2. About concurrent connections

Pushy added support for concurrent connections in recent versions (nice work) to improve performance. But there is a problem: the client creates all of its concurrent connections to the SAME server (IP address), because DNS resolution gives us only ONE IP and the JVM may cache it for quite a long time over its life cycle. So at startup we actually create N connections to the same server.

In this scenario there are mainly two problems:

  • Apple may apply a QUOTA (rate limit) to a particular client IP.
  • If one instance unluckily picks a BAD IP at boot, it will stay stuck with it (typically for about 3 days).

By distributing connections among different IPs we will, on average, get a better running condition even on a freshly booted instance, and the cluster's performance can be scaled out horizontally (to be validated in the future, since we already have enough throughput: 4.5M pushes in 15 minutes on 5 nodes).
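
(A minimal sketch of the accumulate-and-pick-randomly idea behind DynamicInetProvider. This is my illustrative simplification: the actual PR uses dnsjava against configurable DNS servers, whereas plain InetAddress.getAllByName is itself subject to the JVM DNS cache discussed above:)

```java
import java.net.InetAddress;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class DynamicInetProviderSketch {

    private final Set<InetAddress> knownAddresses = ConcurrentHashMap.newKeySet();

    public DynamicInetProviderSketch(final String host, final long refreshSeconds) {
        final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        // Re-resolve periodically (e.g. every 300 s) and accumulate every IP
        // ever seen; each answer holds ~8 addresses, so the set grows toward
        // the 70~100 addresses mentioned above.
        scheduler.scheduleAtFixedRate(() -> {
            try {
                for (final InetAddress address : InetAddress.getAllByName(host)) {
                    knownAddresses.add(address);
                }
            } catch (final Exception ignored) {
                // keep the previously accumulated list on transient failures
            }
        }, 0, refreshSeconds, TimeUnit.SECONDS);
    }

    // Each new connection picks a random accumulated address, spreading
    // connections across many APNs servers instead of pinning a single IP.
    public InetAddress next() {
        final InetAddress[] snapshot = knownAddresses.toArray(new InetAddress[0]);
        if (snapshot.length == 0) {
            throw new IllegalStateException("No addresses resolved yet");
        }
        return snapshot[ThreadLocalRandom.current().nextInt(snapshot.length)];
    }
}
```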

@qiankunli See my reply above for the detailed reasons. In summary, it comes down to the following points:

  1. Each IP (server or cluster) carries a different load. For example, with three IPs A, B, and C, you might see wildly fluctuating performance on A and C (10~300 qps), while B sends steadily at around 1000. With a fixed IP, a node that lands on an IP like A or C has no way to improve its performance; its only hope is something like a restart to get a new draw.
  2. Opening more than 3 connections to the same IP brings little improvement (if connection quality is the issue), and may also run into Apple's per-server throttling of a caller (anti-attack firewall rate limiting, traffic caps).

For now we haven't added much retry logic (this system mainly serves targeted bulk marketing pushes); the observed failure rate is under 0.1%, and retries could be added in the future.

@jmason

jmason commented Nov 30, 2017

my thoughts from the sidelines ;) -- I agree that #2 sounds like a problem. Ideally we should keep the server's location as a String and defer DNS resolution until right before we attempt to open a TCP connection, in order to allow DNS-based load balancing to work correctly. However, I am not sure that depending on a non-default DNS resolver library is the right way to do it. I would prefer to keep using the JDK's built-in DNS implementation if possible, to maintain the principle of least surprise for Pushy users.

@jasonjoo2010
Author

@jmason Glad to hear from you.

If you mean the default DNS settings of the user's actual server OS, I totally agree with you: the implementation should fetch the OS's default DNS server settings.
If you mean keeping the JVM's built-in mechanisms for resolving domains, the cache problem will remain, and we cannot disable the JVM cache option because other code in the same JVM depends on it for performance. And don't you think creating all connections to a single IP is a problem?

@jmason

jmason commented Nov 30, 2017

It's true that the default JVM behaviour on DNS caching is not good -- I haven't run a service in production without the "-Dsun.net.inetaddr.ttl=10 -Dnetworkaddress.cache.ttl=10" flags in many years for this reason!

I guess as long as the DynamicInetProvider behaviour is not the default, that would be ok, IMO.
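
(For reference, the networkaddress.cache.ttl value from the flags above can also be set programmatically, as a java.security property, provided it runs before the first name lookup; sun.net.inetaddr.ttl is the legacy system-property equivalent:)

```java
import java.security.Security;

public class DnsCacheConfigSketch {
    public static void main(String[] args) {
        // Equivalent of -Dnetworkaddress.cache.ttl=10: cache successful
        // lookups for 10 seconds instead of the JVM default.
        Security.setProperty("networkaddress.cache.ttl", "10");
        // Optionally also bound the caching of failed lookups.
        Security.setProperty("networkaddress.cache.negative.ttl", "10");
    }
}
```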

@jasonjoo2010
Author

@jmason You are right.

The new feature must be enabled manually to preserve backward compatibility.
Alternatively, if the concurrent connections could be created in a "warming up" fashion, the TTL settings alone would be enough to enjoy the benefit of a large number of concurrent connections.

@jchambers
Owner

I agree that #2 sounds like a problem. Ideally we should keep the server's location as a String and defer DNS resolution until right before we attempt to open a TCP connection, in order to allow DNS-based load balancing to work correctly.

FWIW, this was fixed in #571, and just released as part of Pushy 0.12.
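
(A minimal sketch of the general "defer resolution" technique with a Netty Bootstrap; Pushy's actual internals in #571 may differ, and the helper below is hypothetical:)

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelFuture;
import java.net.InetSocketAddress;

public class DeferredResolutionSketch {
    // 'bootstrap' is assumed to be fully configured elsewhere
    // (event loop group, channel type, handler).
    static ChannelFuture connect(final Bootstrap bootstrap) {
        // An *unresolved* address makes Netty run its AddressResolver on
        // every connect, so each attempt can land on a different APNs IP.
        // (A pre-resolved InetSocketAddress would pin all connects to one IP.)
        return bootstrap.connect(
                InetSocketAddress.createUnresolved("api.push.apple.com", 443));
    }
}
```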

@jasonjoo2010
Author

@jchambers

I checked it out and tested it, but I find it doesn't solve the problem, because all of the connections are created at once when the worker boots up under some load.

ConcurrentConnections = 16

[screenshot]

Maybe it would get better after a LONG time running (as some connections break up and are replaced)?

@jasonjoo2010
Author

@jchambers

And admittedly there is still one problem in my branch: it doesn't work in Tomcat. I plan to figure it out after finishing some work I have on hand.

@jchambers
Owner

@jasonjoo2010 Thank you again for your contribution here! I think we can achieve similar results much more simply by using Netty's built-in RoundRobinDnsAddressResolverGroup. I took a shot at that in #594.
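
(For reference, a minimal sketch of wiring the RoundRobinDnsAddressResolverGroup mentioned above into a Netty Bootstrap; this illustrates the class, not the exact code in #594:)

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioDatagramChannel;
import io.netty.channel.socket.nio.NioSocketChannel;
import io.netty.resolver.dns.DnsServerAddressStreamProviders;
import io.netty.resolver.dns.RoundRobinDnsAddressResolverGroup;

public class RoundRobinResolverSketch {
    public static Bootstrap configure() {
        return new Bootstrap()
                .group(new NioEventLoopGroup())
                .channel(NioSocketChannel.class)
                // Shuffles the resolved addresses so successive connects are
                // spread across all ~8 IPs in the DNS answer, instead of
                // always using the first one. (Channel handler setup omitted.)
                .resolver(new RoundRobinDnsAddressResolverGroup(
                        NioDatagramChannel.class,
                        DnsServerAddressStreamProviders.platformDefault()));
    }
}
```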

I'm going to close this pull request for now, and would love your feedback on #594 as your time allows. Thanks again!

jchambers closed this Mar 31, 2018