Bad hostname when mod_proxy is used to make Apache a reverse proxy #221

Closed
GoogleCodeExporter opened this Issue Apr 6, 2015 · 16 comments

Scenario:
Apache uses mod_proxy for some of its content. More specifically there are 
multiple back ends and hence mod_proxy_balancer is used. The issue is that 
mod_pagespeed reports the following error:
[mod_pagespeed 0.9.15.3-404] Invalid resource url '/blah/image.png' relative to 
'balancer://cluster/blah.aspx'

What I would expect:
The balancer://... stuff shouldn't actually be there. Maybe it's possible to 
hook into ProxyPassReverse in some way? Note that image.png may be on the 
backend but it can also be statically available on Apache. When mod_pagespeed 
needs to make requests it'll have to call Apache in all cases.

Original issue reported on code.google.com by Bla...@gmail.com on 1 Mar 2011 at 1:09

  • Merged into: #74
Whoa that's a new URL. So it looks like mod_pagespeed thinks the HTML URL is 
balancer://cluster/blah.aspx. It gets the document URL from 
request_rec.unparsed_uri. Do you know why that URL would be so strange? Is the 
original URL stored somewhere else? We cannot really optimize the page without 
knowing its base URL.

Original comment by sligocki@google.com on 1 Mar 2011 at 4:11

  • Changed state: RequestClarification
This is apparently the result of using mod_proxy_balancer. Simplified example 
config:
<Proxy balancer://cluster>
  BalancerMember http://1.1.1.1:8080 route=node1 loadfactor=50
  BalancerMember http://2.2.2.2:8080 route=node2 loadfactor=50
  ProxySet lbmethod=byrequests
  ProxySet stickysession=BALANCEID
  ProxySet nofailover=on
  ProxySet timeout=5
</Proxy>
RewriteRule (.*\.aspx.*) balancer://cluster/$1 [P]
ProxyPassReverse / balancer://cluster/

Basically what you get is an extension of mod_proxy but instead of mapping to 
http:// you map to balancer:// and Apache handles your loadbalancing, session 
stickiness ... (see 
http://httpd.apache.org/docs/2.2/mod/mod_proxy_balancer.html). Note that 
anything that doesn't contain .aspx will be handled by Apache locally. So if 
the result contains a reference to /images/blah.png and the client requests 
it, Apache will serve it from disk (or maybe even from a different set of 
back end hosts).

So for correct behavior I'd always assume that relative requests from pages 
that are mapped into your site via mod_proxy result in requests to your host 
instead of the back end as you may actually cross-reference between different 
back-ends and local content.

So if I have something like this:
RewriteRule ^/a/(.*) http://servera/$1 [P]
RewriteRule ^/b/(.*) http://serverb/$1 [P]
RewriteRule ^/c/(.*) http://serverc/$1 [P]
then if servera responds and refers to /b/images/blah.png the request for this 
resource must go to serverb as this is what would effectively happen if we let 
the HTML go to the client. So what ends up being called is 
http://serverb/images/blah.png and not http://servera/b/images/blah.png. I 
don't know if this would work correctly by default, but as it's http it's 
already possible to work around the issue by remapping the backend requests 
back to localhost. For balancer:// we basically have the same problem, but it 
can't be worked around as the scheme is balancer:// instead of http://.
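
To make the difference concrete, here is a minimal Python sketch (not mod_pagespeed code) that models the three RewriteRule lines above as a small rule table. The public hostname www.example.com, the page path and the helper name backend_for are assumptions made purely for illustration.

```python
import re
from urllib.parse import urljoin, urlsplit

# Model of the three "RewriteRule ... [P]" lines above: pattern -> backend.
PROXY_RULES = [
    (re.compile(r"^/a/(.*)"), "http://servera/"),
    (re.compile(r"^/b/(.*)"), "http://serverb/"),
    (re.compile(r"^/c/(.*)"), "http://serverc/"),
]


def backend_for(path):
    """Return the backend URL `path` would be proxied to, or None if local."""
    for pattern, backend in PROXY_RULES:
        match = pattern.match(path)
        if match:
            return backend + match.group(1)
    return None


# Hypothetical public URL of a page that servera generated.
public_page = "http://www.example.com/a/page.html"
reference = "/b/images/blah.png"

# What a browser does: resolve against the public URL, then Apache maps it.
public_url = urljoin(public_page, reference)
via_apache = backend_for(urlsplit(public_url).path)

# What goes wrong: resolving against the backend URL of the page instead.
naive = urljoin("http://servera/page.html", reference)

print(via_apache)  # http://serverb/images/blah.png
print(naive)       # http://servera/b/images/blah.png
```

Only the front-end Apache holds the rule table, which is why a fetcher that resolves references against the backend URL ends up on the wrong server.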

So possibly you'll have to actually fix forwarding of http:// in a mod_proxy 
configuration too as the current philosophy may not work in every case. I'll 
test this by reconfiguring my set up so that it uses a non load-balanced setup 
and simply refers to a single back end. If everything works as it should then 
any request initiated by mod_pagespeed resulting from replies from the back end 
should go to apache and not directly to the back end.

Original comment by Bla...@gmail.com on 2 Mar 2011 at 8:25

Ok the scope of this issue should be extended. Basically the summary is:
"mod_pagespeed does not function correctly when mod_proxy is used to make 
Apache a reverse proxy"

I tested what happens if I make the following change:
RewriteRule (.*\.aspx.*) balancer://cluster/$1 [P]
into
RewriteRule (.*\.aspx.*) http://1.1.1.1:8080/$1 [P]

The result is that mod_pagespeed sees the page as coming from 
http://1.1.1.1:8080/ and a css or js combine refers to http://1.1.1.1:8080/ to 
which the client has no access. Even better not all of the css/js files are 
actually on that host, some are served from disk by Apache. I don't really see 
how I can fix this with current domain mapping options.

At the moment to me the most logical option is still to honor ProxyPassReverse 
(but obviously I have not checked the viability of this when it comes to 
coding). Apache uses the mechanism already when a request results in a 
redirect. When it comes to html there's a third-party module 
(http://apache.webthing.com/mod_proxy_html/) that offers rewrites for html 
similar to what ProxyPassReverse does. I'm not going to try to combine it with 
mod_pagespeed though, because it'll most likely fail anyway.

Original comment by Bla...@gmail.com on 2 Mar 2011 at 10:35

Aha, great investigative work. Thanks!

This looks like issue 74. So there is a good chance that you can work around 
this by using ModPagespeedMapRewriteDomain and ModPagespeedMapOriginDomain 
(http://code.google.com/speed/page-speed/docs/using_mod.html#Mapping%20Origin%20Domains).

But it does sound like we should be dealing with this better by default. So I'm 
going to try to look back into this again.

Original comment by sligocki@google.com on 2 Mar 2011 at 3:30

  • Changed state: Accepted
Hi, I can indeed fix the http case partially by adding:
ModPagespeedMapRewriteDomain yourdomain 1.1.1.1:8080

but I don't find a fix for the fact that mod_pagespeed is trying to fetch the 
resources relative to http://1.1.1.1:8080/. So even when using http instead of 
balancer it's not working as it should.

What it would really need to be able to do is:
- mod_pagespeed must connect to localhost to fetch content
- mod_pagespeed must preserve the proper hostname (similar to 
ProxyPreserveHost) or it must be possible to instruct it which hostname to use

This is because:
- Only Apache knows correctly where to map requests to (I may have multiple 
different types of back end servers + local content underneath 1 single public 
hostname)
- The backend servers themselves may in turn require the hostname to be 
correct instead of it being derived from 1.1.1.1:8080
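
The two requirements above can be sketched in Python with the standard library. This is not how mod_pagespeed is implemented; it only shows a request that connects to the loopback address while still carrying the public hostname. The function name build_loopback_request and the default loopback address are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request


def build_loopback_request(public_url, loopback="127.0.0.1"):
    """Build a request that connects to `loopback` but keeps the public Host.

    The request is only constructed here, it is not sent.
    """
    parts = urlsplit(public_url)
    # Keep an explicit port if the public URL has one.
    target = loopback if parts.port is None else f"{loopback}:{parts.port}"
    local_url = urlunsplit((parts.scheme, target, parts.path, parts.query, ""))
    # An explicit Host header lets Apache pick the right virtual host and lets
    # back ends that parse the hostname see the public name.
    return Request(local_url, headers={"Host": parts.netloc})


req = build_loopback_request("http://yourdomain/blah/image.png")
print(req.full_url)            # http://127.0.0.1/blah/image.png
print(req.get_header("Host"))  # yourdomain
```

Because the connection goes to the local Apache, its own proxy rules decide whether the resource comes from disk or from one of the back ends.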


Original comment by Bla...@gmail.com on 2 Mar 2011 at 5:17

So the problem is that mod_pagespeed is connecting to the correct server 
http://1.1.1.1:8080/, but it is not passing in the correct hostname 
(yourdomain) and so the server (perhaps using vhost) doesn't know what content 
to serve?

Yes, that does seem broken. Do you happen to know anything about writing Apache 
modules? Specifically, where is the original URL stored if not in 
request_rec.unparsed_uri?

Original comment by sligocki@google.com on 2 Mar 2011 at 5:55

  • Changed state: Started
Summary was: Can't combine mod_proxy_balancer and mod_pagespeed

Original comment by sligocki@google.com on 2 Mar 2011 at 6:07

  • Changed title: Bad hostname when mod_proxy is used to make Apache a reverse proxy
No it's not even connecting to the right host actually. I'll try to make some 
sort of illustration of how this would actually need to work.

I think we're mostly facing a logical problem at first and a technical 
challenge afterwards. I'll make a better description including more detailed 
instructions on how to set up something similar.

Original comment by Bla...@gmail.com on 2 Mar 2011 at 6:16

I looked into issue 74 and I think I fixed it. If you are able to build from 
the source and can test a revision newer than r506 in the trunk, that would be 
great.

Thanks,
-Shawn

Original comment by sligocki@google.com on 2 Mar 2011 at 10:01

Ok, I hope the description below can help to fully understand the problem. It's 
a bit long but I'm sure it's worth the read. It should also help if you want to 
set up a reference configuration.

Let's imagine we have a company called JustaCompany. They sell products, and 
their main site consists of:
- Company Information
- A shop
- After sales customer support
They offer their services in France and Germany, and all applications support 
localisation and parse the hostname.

How their website works:
- For /shop/ all requests are reverse proxied to ecommerce application servers
- For /support/ all requests are reverse proxied to a crm application server
- /pinfo/ is locally available on each Apache and is fed from a Product 
Information Management system. It contains PDFs, images ... 
- Databases contain references to product information; the product data is not 
duplicated across sites, and the applications refer to the correct path in /pinfo/
- Everything else is redirected to a set of servers which contain the company 
information; these servers also contain some common css files and images

Their Apache config (may contain minor errors as I wrote this without testing, 
paste it with a console font for readability):
<VirtualHost *:80>
  ServerName www.justacompany.fr
  ServerAlias www.justacompany.de

  DocumentRoot /data/www
  <Directory "/data/www">
    Options FollowSymlinks
    AllowOverride None
  </Directory>

  ErrorLog /var/log/apache2/acompany/error_log
  CustomLog /var/log/apache2/acompany/access_log combined

  HostnameLookups Off
  UseCanonicalName Off
  ProxyPreserveHost On

  FileETag MTime Size

  <Proxy balancer://ecomcluster>
    BalancerMember http://1.1.1.1:8000/ route=node1 loadfactor=50
    BalancerMember http://1.1.1.2:8000/ route=node2 loadfactor=50
    ProxySet lbmethod=byrequests
    ProxySet stickysession=ECOM_BALID
  </Proxy>

  <Proxy balancer://crmcluster>
    BalancerMember http://1.1.2.1:8000/ route=node1 loadfactor=50
    BalancerMember http://1.1.2.2:8000/ route=node2 loadfactor=50
    ProxySet lbmethod=byrequests
    ProxySet stickysession=CRM_BALID
  </Proxy>

  <Proxy balancer://infocluster>
    BalancerMember http://1.1.3.1:8000/ route=node1 loadfactor=50
    BalancerMember http://1.1.3.2:8000/ route=node2 loadfactor=50
    ProxySet lbmethod=byrequests
    ProxySet stickysession=INFO_BALID
  </Proxy>

  ProxyPass /shop/ balancer://ecomcluster/
  ProxyPass /support/ balancer://crmcluster/
  ProxyPass /pinfo/ !
  ProxyPass / balancer://infocluster/
  ProxyPassReverse /shop/ balancer://ecomcluster/
  ProxyPassReverse /support/ balancer://crmcluster/
  ProxyPassReverse / balancer://infocluster/
</VirtualHost>

Now let's discuss the problem with mod_pagespeed.

A typical page in the ecommerce site will refer to:
- ecommerce specific content coming from /shop/
- references to product information in /pinfo/
- references to common images in /images/ and common css in /css/

What happens normally:
The client's web browser sends its requests and Apache distributes them 
correctly: it directs /shop/ to ecomcluster, serves /pinfo/ from disk and 
forwards /images/ and /css/ requests to infocluster.

What happens with mod_pagespeed:
The HTML-page in /shop/ is coming from balancer://ecomcluster/ and hence 
mod_pagespeed tries to reference all content relative to 
balancer://ecomcluster/. The first issue is that mod_pagespeed doesn't actually 
understand the balancer concept so it simply cannot connect. To simplify this 
we could put a hardware loadbalancer between Apache and the back end servers 
and switch to http.

Unfortunately even when simply using http this will fail because mod_pagespeed 
will try to get all of its content directly from the backend server that 
delivered the HTML. So if we forward to http://shopip/ and it serves back a 
page with a reference to /css/common.css then mod_pagespeed will send a request 
to http://shopip/css/common.css, while in reality when the client's browser makes 
the request it is served from http://infoip/css/common.css. Similarly a request 
for /pinfo/p1234.jpg will go to http://shopip/pinfo/p1234.jpg while Apache 
actually reads it from disk.

The end result is that mod_pagespeed is not able to retrieve the objects it 
wants to optimise and hence no optimisation can be performed.

How to solve this:
The primary way of retrieving relative content for mod_pagespeed should be to 
contact HTTP_HOST; it should not really care where the content came from. It is 
of course more efficient to map it so that it connects to localhost instead 
(but still uses the correct hostname in HTTP), but that is only an optimisation 
to avoid network overhead.

The exception to this rule should be when we add domains to ModPagespeedDomain 
and want to optimise by having local caching on our Apache. For example if I 
add a survey from www.asurveycompany.com I may need to add references to 
http://www.asurveycompany.com/css/our.css in my HTML. I'd like to spare my 
customers the separate download, and www.asurveycompany.com gives a 
generous expiry time, so mod_pagespeed can retrieve it from 
www.asurveycompany.com and provide it in a combined css request.
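
The rule and its exception can be summarised in a short Python sketch. It illustrates the fetch strategy proposed here, not existing mod_pagespeed behaviour. The function choose_fetch_target, the AUTHORIZED_DOMAINS set (standing in for domains listed with ModPagespeedDomain) and the page URL are assumptions.

```python
from urllib.parse import urljoin, urlsplit

# Stand-in for extra domains authorised with ModPagespeedDomain.
AUTHORIZED_DOMAINS = {"www.asurveycompany.com"}


def choose_fetch_target(page_url, reference):
    """Decide where a resource referenced from `page_url` should be fetched.

    Returns (connect_to, host_header, path), or None if the resource is on a
    domain that has not been authorised and should be left untouched.
    """
    resource = urlsplit(urljoin(page_url, reference))
    page_host = urlsplit(page_url).netloc
    if resource.netloc == page_host:
        # Same site: ask the local Apache, which knows the proxy mappings.
        return ("localhost", page_host, resource.path)
    if resource.netloc in AUTHORIZED_DOMAINS:
        # Explicitly authorised third party: fetch it from that host.
        return (resource.netloc, resource.netloc, resource.path)
    return None


# Hypothetical page served under the public hostname.
page = "http://www.justacompany.fr/shop/index.html"
print(choose_fetch_target(page, "/css/common.css"))
print(choose_fetch_target(page, "http://www.asurveycompany.com/css/our.css"))
print(choose_fetch_target(page, "http://other.example.org/x.css"))
```

The first call stays on the local Apache with the public Host header, the second goes to the authorised external host, and the third is left alone.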

I think the main issue here is that the philosophy has been geared a bit too 
much towards having mod_pagespeed as an in-between optimiser and hence a forward 
proxy. Obviously when testing/demonstrating this is what is set up easily, but 
in general it's not exactly the main use of Apache. I'd really love for this to 
be changed. Obviously this may be considered a major change but this can be 
softened by adding an option like ModPagespeedIsForwardProxy which is On by 
default and which can be turned off in order to get the connection strategy we 
really need. Without a change like this I don't see how we can use 
mod_pagespeed in a typical reverse proxy Apache set up.

Original comment by Bla...@gmail.com on 3 Mar 2011 at 10:34

Thanks for the detailed explanation.

Which servers have mod_pagespeed installed on them? I think we should support 
this sort of configuration as long as you only install mod_pagespeed on the 
front-end servers. (Although this was broken until yesterday by a bug I fixed 
in r506)

Were you able to try building from head?

If you only have mod_pagespeed installed on the front-end servers (not the 
application servers) then it shouldn't care where all the HTML content comes 
from or where the sub-resources are stored. As you said, it will just put out 
an HTTP request for each resource and Apache will work out the details. The 
only reason I know of why this wouldn't have happened until recently is that we 
were broken on the ProxyPass directive.

Original comment by sligocki@google.com on 3 Mar 2011 at 11:08

  • Changed state: RequestClarification
I have it installed only on the front end servers (the others are IIS, Tomcat ...) 
and I'm in the process of building from source, so I'll continue testing with the 
latest version soon. I'll report back with the results ASAP.

Original comment by Bla...@gmail.com on 4 Mar 2011 at 3:01

If possible, could you try your trunk release (r521) and see if the problem is 
resolved?  The mod_speling fix (issue 194) is now in trunk and it's possible 
that the fix for that issue fixes this one too.

Note that this fix will *not* be in the upcoming official release -- it missed 
the deadline slightly.

Original comment by jmara...@google.com on 8 Mar 2011 at 12:12

I got my build environment going on Friday and did some tests during the 
weekend with r515. Everything seems to work fine when it comes to mod_proxy and 
mod_proxy_balancer. So this can be closed.

Original comment by Bla...@gmail.com on 8 Mar 2011 at 10:13

Great to hear it, this fix will be in the release coming out soon.

Original comment by sligocki@google.com on 8 Mar 2011 at 3:22

  • Changed state: Duplicate