Turpentine esi request loop when using a google bot #599

Closed
csdougliss opened this issue Aug 12, 2014 · 10 comments

@csdougliss
Contributor

When I set my user agent to Googlebot, I see a loop on the Turpentine ESI requests. Each ESI request also returns the whole page instead of just the requested block (the header, for example).

This might be the cause of our issues indexing the site. Firebug reports a 302 redirect for the Turpentine ESI request.

I also don't see the cookie being set to crawler-session. In addition, the home page keeps reloading in Firebug's Net panel.

Response headers:

Age 0
Cache-Control   no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Connection  keep-alive
Content-Encoding    gzip
Content-Type    text/html; charset=utf-8
Date    Tue, 12 Aug 2014 11:01:04 GMT
Expires Thu, 19 Nov 1981 08:52:00 GMT
Location    http://www.xx.co.uk/
Pragma  no-cache
Set-Cookie  frontend=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; domain=.xx.co.uk; httponly
Transfer-Encoding   chunked
X-Varnish-Host  www.xx.co.uk
X-Varnish-URL   /turpentine/esi/getBlock/method/ajax/access/private/ttl/0/hmac/e53c498b043da05d6807ac0d1636450b6bcad8a3fdcf3424d93350705ef48eb1/data/-XSCO0rgxv8Z6H4vzOfkywZ5DKRI0sSbKlTcIrUBK2LhsPr-Q2LQzFRicnTzAbsFNnfxmXsaI8Tiz9ILhQPxUsLMUb4ULmWxfg32H0tnnsuwzcSYOKoSfUHXFR3QSHeR400i-PnAvXDVexMUyfznnhRmy5zV5CCf536LNkAOyC9RT6ahZYTo67A0SppMUAY0cBzflcwJQJru3s7yEVLZMySsA3XWPWBZIPLafKXIaG-TXlDWt7yWuaIn2Ok3CT4e4yWbxzoKAtM36DQPbTozzERLw6byg0JER1-wdGvv80.Q8MVYRKFroszkfDxIQc5nRQWQ6o3tMM-JX8IIw3DF4g==/

Request headers:

Accept  text/javascript, text/html, application/xml, text/xml, */*
Accept-Encoding gzip, deflate
Accept-Language en-US,en;q=0.5
Cookie  X-Mapping-fjhppofk=05C7BCABB68AA698F7E2F219B277775E; __utma=27950523.977198102.1407840875.1407840875.1407840875.1; __utmb=27950523.4.10.1407840875; __utmc=27950523; __utmz=27950523.1407840875.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); BVImplMain%20Site=13080
DNT 1
Host    www.xx.co.uk
Referer http://www.xx.co.uk/
User-Agent  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
X-Prototype-Version 1.7
X-Requested-With    XMLHttpRequest
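
The X-Varnish-URL in the response headers above encodes how Turpentine treats the block. As a minimal sketch, the Python below pulls the `method`, `access` and `ttl` segments out of such a path, using the same sub-patterns that the VCL posted with this issue applies in vcl_recv and vcl_fetch. The function name `parse_esi_url` is made up for illustration, and the hmac and data segments are shortened placeholders.

```python
import re

def parse_esi_url(url):
    """Extract the ESI parameters that the Turpentine VCL reads from the URL.

    Note: the VCL uses regsub with a greedy leading ".*", so it would pick the
    last occurrence of a segment; re.search picks the first. For URLs with a
    single occurrence of each segment the result is the same.
    """
    fields = {}
    for name, pattern in (
        ("method", r"/method/(\w+)/"),
        ("access", r"/access/(\w+)/"),
        ("ttl", r"/ttl/(\d+)/"),
    ):
        match = re.search(pattern, url)
        fields[name] = match.group(1) if match else None
    return fields

# Shortened path: "abc" and "xyz" are placeholders, not real hmac/data values.
url = "/turpentine/esi/getBlock/method/ajax/access/private/ttl/0/hmac/abc/data/xyz/"
info = parse_esi_url(url)
print(info)
```

For this request the result is an `ajax` block with `private` access and a `ttl` of 0. In the posted vcl_fetch, a ttl of 0 (and, separately, any status other than 200 or 404, such as this 302) ends in hit_for_pass, so the response is not served from cache.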

VCL content:

# Nexcess.net Turpentine Extension for Magento
# Copyright (C) 2012  Nexcess.net L.L.C.
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

## Nexcessnet_Turpentine Varnish v3 VCL Template

## Custom C Code

C{
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <pthread.h>

static pthread_mutex_t lrand_mutex = PTHREAD_MUTEX_INITIALIZER;

void generate_uuid(char* buf) {
    pthread_mutex_lock(&lrand_mutex);
    long a = lrand48();
    long b = lrand48();
    long c = lrand48();
    long d = lrand48();
    pthread_mutex_unlock(&lrand_mutex);
    // SID must match this regex for Kount compat /^\w{1,32}$/
    sprintf(buf, "frontend=%08lx%04lx%04lx%04lx%04lx%08lx",
        a,
        b & 0xffff,
        (b & ((long)0x0fff0000) >> 16) | 0x4000,
        (c & 0x0fff) | 0x8000,
        (c & (long)0xffff0000) >> 16,
        d
    );
    return;
}

}C

## Imports

import std;

## Custom VCL Logic

# Additional includes for logging
C{
#include <stdio.h>
#include <stdlib.h>
#include <syslog.h>
#include <stddef.h>
#include <sys/time.h>
#include <time.h>
}C

sub vcl_recv {
    # Add X-Request-Start header so we can track queue times in New Relic RPM beginning at Varnish.
    if (req.restarts == 0) {
        C{
                struct timeval detail_time;
                gettimeofday(&detail_time,NULL);
                char start[20];
                sprintf(start, "t=%lu%06lu", detail_time.tv_sec, detail_time.tv_usec);
                VRT_SetHdr(sp, HDR_REQ, "\020X-Request-Start:", start, vrt_magic_string_end);
        }C
    }

    # Bypass registration form
    if (req.url ~ "^/registration/form") {
        return (pass);
    }
}

sub vcl_error {
    set obj.http.Content-Type = "text/html; charset=utf-8";
    set obj.http.Retry-After = "5";

    if (obj.status >= 500) {
        C{
            FILE *fp;
            char ft[256];
            struct tm *tmp;
            time_t curtime;

            fp = fopen("/var/log/varnish/error_log", "a");
            time(&curtime);
            tmp = localtime(&curtime);
            strftime(ft, 256, "%D - %T", tmp);

            if(fp != NULL) {
                fprintf(fp, "%s: Error (%s) (%s) (%s)\n",
                ft, VRT_r_req_url(sp), VRT_r_obj_response(sp), VRT_r_req_xid(sp));

                fclose(fp);
            } else {
                syslog(LOG_INFO, "Error (%s) (%s) (%s)",
                VRT_r_req_url(sp), VRT_r_obj_response(sp), VRT_r_req_xid(sp));
            }
        }C
    }

    synthetic {"
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
 <html>
   <head>
     <title>"} + obj.status + " " + obj.response + {"</title>
     <style type="text/css">
     /* Errors */
        * {
            margin: 0;
            padding: 0;
        }
        .error-layout { width: 100%; max-width: 1600px; height: 100%; margin: 0 auto; padding: 0; }
        .error-layout .main-container { height: 100%; }
        .error-layout .main { background: url("/errors/default/images/error_background.jpg") no-repeat #fff; min-height: 590px; color: #1D2B33; float: left; }
        .error-layout .col-main { width: 100%; }
        .error-layout .std { margin: 100px 0 0 400px; padding-left: 270px; background: url("/errors/default/images/warning_icon.png") no-repeat 10px 0; height: 200px; }
        .error-layout h3 { font-size: 400%; margin-bottom: 10px; }
        .error-layout p { font-size: 180%; }
        .error-layout .back { float: left; background: url("/skin/frontend/vax/uk/images/icons/left_arrow.png") no-repeat 8px center #1D2B33; height: 18px; padding: 5px 8px 5px 33px; color: #FFF; font-size: 13px; line-height: 18px; text-decoration: none; margin-top: 10px; }
    /* ======================================================================================= */
    </style>
   </head>
   <body>
     <div class="page error-layout">
        <div class="main-container">
            <div class="main">
                <div class="col-main">
                    <div class="std"><h3>Oops, something went wrong</h3>
                        <!--<h1>Error "} + obj.status + " " + obj.response + {"</h1>
                        <p>"} + obj.response + {"</p>
                        <h3>Guru Meditation:</h3>
                        <p>XID: "} + req.xid + {"</p>
                        <hr>-->
                        <p><a class="back" title="Go Back" href="javascript: history.go(-1);">Go Back</a></p>
                    </div>
                 </div>
            </div>
      </div>
    </body>
 </html>
 "};
     return (deliver);
}

# customized vcl_deliver to allow use of cross sub-domain cookies
sub vcl_deliver {
    if (req.http.X-Varnish-Faked-Session) {
        # need to set the set-cookie header since we just made it out of thin air
        call generate_session_expires;
        set resp.http.Set-Cookie = req.http.X-Varnish-Faked-Session +
            "; expires=" + resp.http.X-Varnish-Cookie-Expires + "; path=/";
        if (req.http.Host) {
            if(req.http.Host ~ "^(www\.|spares\.|support\.)?vax\.co\.uk$") {
                set resp.http.Set-Cookie = resp.http.Set-Cookie +
                    "; domain=.vax.co.uk";
            } else {
                set resp.http.Set-Cookie = resp.http.Set-Cookie +
                    "; domain=" + regsub(req.http.Host, ":\d+$", "");
            }
        }
        set resp.http.Set-Cookie = resp.http.Set-Cookie + "; httponly";
        unset resp.http.X-Varnish-Cookie-Expires;
    }
    if (req.http.X-Varnish-Esi-Method == "ajax" && req.http.X-Varnish-Esi-Access == "private") {
        set resp.http.Cache-Control = "no-cache";
    }
    if (false || client.ip ~ debug_acl) {
        # debugging is on, give some extra info
        set resp.http.X-Varnish-Hits = obj.hits;
        set resp.http.X-Varnish-Esi-Method = req.http.X-Varnish-Esi-Method;
        set resp.http.X-Varnish-Esi-Access = req.http.X-Varnish-Esi-Access;
        set resp.http.X-Varnish-Currency = req.http.X-Varnish-Currency;
        set resp.http.X-Varnish-Store = req.http.X-Varnish-Store;
    } else {
        # remove Varnish fingerprints
        unset resp.http.X-Varnish;
        unset resp.http.Via;
        unset resp.http.X-Powered-By;
        unset resp.http.Server;
        unset resp.http.X-Turpentine-Cache;
        unset resp.http.X-Turpentine-Esi;
        unset resp.http.X-Turpentine-Flush-Events;
        unset resp.http.X-Turpentine-Block;
        unset resp.http.X-Varnish-Session;
        unset resp.http.X-Varnish-Host;
        unset resp.http.X-Varnish-URL;
        # this header indicates the session that originally generated a cached
        # page. it *must* not be sent to a client in production with lax
        # session validation or that session can be hijacked
        unset resp.http.X-Varnish-Set-Cookie;
    }
    return (deliver);
}


## Backends

backend default {
    .host = "127.0.0.1";
    .port = "8080";
   .first_byte_timeout = 300s;
   .between_bytes_timeout = 300s;
}


backend admin {
    .host = "127.0.0.1";
    .port = "8080";
   .first_byte_timeout = 21600s;
   .between_bytes_timeout = 21600s;
}


## ACLs

acl crawler_acl {
    "127.0.0.1";
}

acl debug_acl {
    "127.0.0.1";
}

## Custom Subroutines

sub generate_session {
    # generate a UUID and add `frontend=$UUID` to the Cookie header, or use SID
    # from SID URL param
    if (req.url ~ ".*[&?]SID=([^&]+).*") {
        set req.http.X-Varnish-Faked-Session = regsub(
            req.url, ".*[&?]SID=([^&]+).*", "frontend=\1");
    } else {
        C{
            char uuid_buf [50];
            generate_uuid(uuid_buf);
            VRT_SetHdr(sp, HDR_REQ,
                "\030X-Varnish-Faked-Session:",
                uuid_buf,
                vrt_magic_string_end
            );
        }C
    }
    if (req.http.Cookie) {
        # client sent us cookies, just not a frontend cookie. try not to blow
        # away the extra cookies
        std.collect(req.http.Cookie);
        set req.http.Cookie = req.http.X-Varnish-Faked-Session +
            "; " + req.http.Cookie;
    } else {
        set req.http.Cookie = req.http.X-Varnish-Faked-Session;
    }
}

sub generate_session_expires {
    # sets X-Varnish-Cookie-Expires to now + esi_private_ttl in format:
    #   Tue, 19-Feb-2013 00:14:27 GMT
    # this isn't threadsafe but it shouldn't matter in this case
    C{
        time_t now = time(NULL);
        struct tm now_tm = *gmtime(&now);
        now_tm.tm_sec += 14400;
        mktime(&now_tm);
        char date_buf [50];
        strftime(date_buf, sizeof(date_buf)-1, "%a, %d-%b-%Y %H:%M:%S %Z", &now_tm);
        VRT_SetHdr(sp, HDR_RESP,
            "\031X-Varnish-Cookie-Expires:",
            date_buf,
            vrt_magic_string_end
        );
    }C
}

## Varnish Subroutines

sub vcl_recv {
    # this always needs to be done so it's up at the top
    if (req.restarts == 0) {
        if (req.http.X-Forwarded-For) {
            set req.http.X-Forwarded-For =
                req.http.X-Forwarded-For + ", " + client.ip;
        } else {
            set req.http.X-Forwarded-For = client.ip;
        }
    }

    # We only deal with GET and HEAD by default
    # we test this here instead of inside the url base regex section
    # so we can disable caching for the entire site if needed
    if (!true || req.http.Authorization ||
        req.request !~ "^(GET|HEAD)$" ||
        req.http.Cookie ~ "varnish_bypass=1") {
        return (pipe);
    }

    # remove double slashes from the URL, for higher cache hit rate
    set req.url = regsuball(req.url, "(.*)//+(.*)", "\1/\2");

    if (req.http.Accept-Encoding) {
        if (req.http.Accept-Encoding ~ "gzip") {
            set req.http.Accept-Encoding = "gzip";
        } else if (req.http.Accept-Encoding ~ "deflate") {
            set req.http.Accept-Encoding = "deflate";
        } else {
            # unknown algorithm
            unset req.http.Accept-Encoding;
        }
    }

    #if (req.http.User-Agent ~ "(?i)(ads|google|bing|msn|yandex|baidu|ro|career|)bot" ||
    #   req.http.User-Agent ~ "(?i)(baidu|jike|symantec)spider" ||
    #   req.http.User-Agent ~ "(?i)scanner" ||
    #   req.http.User-Agent ~ "(?i)(web)crawler") {
    #   set req.http.X-Normalized-User-Agent = "bot";
    #} else {
    #    set req.http.X-Normalized-User-Agent = "other";
    #}

    # check if the request is for part of magento
    if (req.url ~ "^(/media/|/skin/|/js/|/)(?:(?:index|litespeed)\.php/)?") {
        # set this so Turpentine can see the request passed through Varnish
        set req.http.X-Turpentine-Secret-Handshake = "1";
        # use the special admin backend and pipe if it's for the admin section
        if (req.url ~ "^(/media/|/skin/|/js/|/)(?:(?:index|litespeed)\.php/)?admin") {
            set req.backend = admin;
            return (pipe);
        }
        if (req.http.Cookie ~ "\bcurrency=") {
            set req.http.X-Varnish-Currency = regsub(
                req.http.Cookie, ".*\bcurrency=([^;]*).*", "\1");
        }
        if (req.http.Cookie ~ "\bstore=") {
            set req.http.X-Varnish-Store = regsub(
                req.http.Cookie, ".*\bstore=([^;]*).*", "\1");
        }
        # looks like an ESI request, add some extra vars for further processing
        if (req.url ~ "/turpentine/esi/get(?:Block|FormKey)/") {
            set req.http.X-Varnish-Esi-Method = regsub(
                req.url, ".*/method/(\w+)/.*", "\1");
            set req.http.X-Varnish-Esi-Access = regsub(
                req.url, ".*/access/(\w+)/.*", "\1");

            # throw a forbidden error if debugging is off and a esi block is
            # requested by the user (does not apply to ajax blocks)
            if (req.http.X-Varnish-Esi-Method == "esi" && req.esi_level == 0 &&
                    !(false || client.ip ~ debug_acl)) {
                error 403 "External ESI requests are not allowed";
            }
        }
        # no frontend cookie was sent to us
        if (req.http.Cookie !~ "frontend=") {
            if (client.ip ~ crawler_acl ||
                    req.http.User-Agent ~ "^(?:ApacheBench/.*|.*Googlebot.*|JoeDog/.*Siege.*|magespeedtest\.com|Nexcessnet_Turpentine/.*)$") {
                # it's a crawler, give it a fake cookie
                set req.http.Cookie = "frontend=crawler-session";
            } else {
                # it's a real user, make up a new session for them
                call generate_session;
            }
        }
        if (true &&
                req.url ~ ".*\.(?:css|js|jpe?g|png|gif|ico|swf)(?=\?|&|$)") {
            # don't need cookies for static assets
            unset req.http.Cookie;
            unset req.http.X-Varnish-Faked-Session;
            return (lookup);
        }
        # this doesn't need a enable_url_excludes because we can be reasonably
        # certain that cron.php at least will always be in it, so it will
        # never be empty
        if (req.url ~ "^(/media/|/skin/|/js/|/)(?:(?:index|litespeed)\.php/)?(?:admin|api|cron\.php|registration/form|oauth|site|scripts)" ||
                # user switched stores. we pipe this instead of passing below because
                # switching stores doesn't redirect (302), just acts like a link to
                # another page (200) so the Set-Cookie header would be removed
                req.url ~ "\?.*__from_store=") {
            return (pipe);
        }
        if (true &&
                req.url ~ "(?:[?&](?:__SID|XDEBUG_PROFILE)(?=[&=]|$))") {
            # TODO: should this be pass or pipe?
            return (pass);
        }
        if (req.url ~ "[?&](utm_source|utm_medium|utm_campaign|gclid|cx|ie|cof|siteurl)=") {
            # Strip out Google related parameters
            set req.url = regsuball(req.url, "(?:(\?)?|&)(?:utm_source|utm_medium|utm_campaign|gclid|cx|ie|cof|siteurl)=[^&]+", "\1");
            set req.url = regsuball(req.url, "(?:(\?)&|\?$)", "\1");
        }

        # everything else checks out, try and pull from the cache
        return (lookup);
    }
    # else it's not part of magento so do default handling (doesn't help
    # things underneath magento but we can't detect that)
}

sub vcl_pipe {
    # since we're not going to do any stuff to the response we pretend the
    # request didn't pass through Varnish
    unset bereq.http.X-Turpentine-Secret-Handshake;
    set bereq.http.Connection = "close";
}

# sub vcl_pass {
#     return (pass);
# }

sub vcl_hash {
    hash_data(req.url);
    if (req.http.Host) {
        hash_data(req.http.Host);
    } else {
        hash_data(server.ip);
    }
    hash_data(req.http.Ssl-Offloaded);
    if (req.http.X-Normalized-User-Agent) {
        hash_data(req.http.X-Normalized-User-Agent);
    }
    if (req.http.Accept-Encoding) {
        # make sure we give back the right encoding
        hash_data(req.http.Accept-Encoding);
    }
    if (req.http.X-Varnish-Store || req.http.X-Varnish-Currency) {
        # make sure data is for the right store and currency based on the *store*
        # and *currency* cookies
        hash_data("s=" + req.http.X-Varnish-Store + "&c=" + req.http.X-Varnish-Currency);
    }

    if (req.http.X-Varnish-Esi-Access == "private" &&
            req.http.Cookie ~ "frontend=") {
        hash_data(regsub(req.http.Cookie, "^.*?frontend=([^;]*);*.*$", "\1"));


    }
    return (hash);
}

sub vcl_hit {
    # this seems to cause cache object contention issues so removed for now
    # TODO: use obj.hits % something maybe
    # if (obj.hits > 0) {
    #     set obj.ttl = obj.ttl + s;
    # }
}

# sub vcl_miss {
#     return (fetch);
# }

sub vcl_fetch {
    # set the grace period
    set req.grace = 15s;

    # Store the URL in the response object, to be able to do lurker friendly bans later
    set beresp.http.X-Varnish-Host = req.http.host;
    set beresp.http.X-Varnish-URL = req.url;

    # if it's part of magento...
    if (req.url ~ "^(/media/|/skin/|/js/|/)(?:(?:index|litespeed)\.php/)?") {
        # we handle the Vary stuff ourselves for now, we'll want to actually
        # use this eventually for compatibility with downstream proxies
        # TODO: only remove the User-Agent field from this if it exists
        unset beresp.http.Vary;
        # we pretty much always want to do this
        set beresp.do_gzip = true;

        if (beresp.status != 200 && beresp.status != 404) {
            # pass anything that isn't a 200 or 404
            set beresp.ttl = 15s;
            return (hit_for_pass);
        } else {
            # if Magento sent us a Set-Cookie header, we'll put it somewhere
            # else for now
            if (beresp.http.Set-Cookie) {
                set beresp.http.X-Varnish-Set-Cookie = beresp.http.Set-Cookie;
                unset beresp.http.Set-Cookie;
            }
            # we'll set our own cache headers if we need them
            unset beresp.http.Cache-Control;
            unset beresp.http.Expires;
            unset beresp.http.Pragma;
            unset beresp.http.Cache;
            unset beresp.http.Age;

            if (beresp.http.X-Turpentine-Esi == "1") {
                set beresp.do_esi = true;
            }
            if (beresp.http.X-Turpentine-Cache == "0") {
                set beresp.ttl = 15s;
                return (hit_for_pass);
            } else {
                if (true &&
                        bereq.url ~ ".*\.(?:css|js|jpe?g|png|gif|ico|swf)(?=\?|&|$)") {
                    # it's a static asset
                    set beresp.ttl = 2592000s;
                    set beresp.http.Cache-Control = "max-age=2592000";
                } elseif (req.http.X-Varnish-Esi-Method) {
                    # it's a ESI request
                    if (req.http.X-Varnish-Esi-Access == "private" &&
                            req.http.Cookie ~ "frontend=") {
                        # set this header so we can ban by session from Turpentine
                        set beresp.http.X-Varnish-Session = regsub(req.http.Cookie,
                            "^.*?frontend=([^;]*);*.*$", "\1");
                    }
                    if (req.http.X-Varnish-Esi-Method == "ajax" &&
                            req.http.X-Varnish-Esi-Access == "public") {
                        set beresp.http.Cache-Control = "max-age=" + regsub(
                            req.url, ".*/ttl/(\d+)/.*", "\1");
                    }
                    set beresp.ttl = std.duration(
                        regsub(
                            req.url, ".*/ttl/(\d+)/.*", "\1s"),
                        300s);
                    if (beresp.ttl == 0s) {
                        # this is probably faster than bothering with 0 ttl
                        # cache objects
                        set beresp.ttl = 15s;
                        return (hit_for_pass);
                    }
                } else {
                    set beresp.ttl = 3600s;
                }
            }
        }
        # we've done what we need to, send to the client
        return (deliver);
    }
    # else it's not part of Magento so use the default Varnish handling
}

sub vcl_deliver {
    if (req.http.X-Varnish-Faked-Session) {
        # need to set the set-cookie header since we just made it out of thin air
        call generate_session_expires;
        set resp.http.Set-Cookie = req.http.X-Varnish-Faked-Session +
            "; expires=" + resp.http.X-Varnish-Cookie-Expires + "; path=/";
        if (req.http.Host) {
            set resp.http.Set-Cookie = resp.http.Set-Cookie +
                "; domain=" + regsub(req.http.Host, ":\d+$", "");
        }
        set resp.http.Set-Cookie = resp.http.Set-Cookie + "; httponly";
        unset resp.http.X-Varnish-Cookie-Expires;
    }
    if (req.http.X-Varnish-Esi-Method == "ajax" && req.http.X-Varnish-Esi-Access == "private") {
        set resp.http.Cache-Control = "no-cache";
    }
    if (false || client.ip ~ debug_acl) {
        # debugging is on, give some extra info
        set resp.http.X-Varnish-Hits = obj.hits;
        set resp.http.X-Varnish-Esi-Method = req.http.X-Varnish-Esi-Method;
        set resp.http.X-Varnish-Esi-Access = req.http.X-Varnish-Esi-Access;
        set resp.http.X-Varnish-Currency = req.http.X-Varnish-Currency;
        set resp.http.X-Varnish-Store = req.http.X-Varnish-Store;
    } else {
        # remove Varnish fingerprints
        unset resp.http.X-Varnish;
        unset resp.http.Via;
        unset resp.http.X-Powered-By;
        unset resp.http.Server;
        unset resp.http.X-Turpentine-Cache;
        unset resp.http.X-Turpentine-Esi;
        unset resp.http.X-Turpentine-Flush-Events;
        unset resp.http.X-Turpentine-Block;
        unset resp.http.X-Varnish-Session;
        unset resp.http.X-Varnish-Host;
        unset resp.http.X-Varnish-URL;
        # this header indicates the session that originally generated a cached
        # page. it *must* not be sent to a client in production with lax
        # session validation or that session can be hijacked
        unset resp.http.X-Varnish-Set-Cookie;
    }
}

@csdougliss
Contributor Author

For some reason, when I set Googlebot as my user agent, the cookie comes back with an expiry date of 1970 and without crawler-session:

frontend=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; domain=.xx.co.uk; httponly
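
That header is a cookie deletion rather than a new session: a value of `deleted` with a 1970 expiry and `Max-Age=0` is what PHP sends when a cookie is cleared with an empty value, so the backend appears to be discarding the session. A small Python sketch for checking a Set-Cookie value for this pattern follows; the helper names `parse_set_cookie` and `is_cookie_deletion` are hypothetical.

```python
def parse_set_cookie(header):
    """Split a Set-Cookie value into (name, value, attributes)."""
    parts = [p.strip() for p in header.split(";") if p.strip()]
    name, _, value = parts[0].partition("=")
    attrs = {}
    for part in parts[1:]:
        key, _, val = part.partition("=")
        # Flag attributes such as "httponly" end up with an empty value.
        attrs[key.strip().lower()] = val.strip()
    return name, value, attrs

def is_cookie_deletion(header):
    """True when the header tells the client to drop the cookie immediately."""
    _, value, attrs = parse_set_cookie(header)
    return attrs.get("max-age") == "0" or value == "deleted"

header = ("frontend=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; "
          "Max-Age=0; path=/; domain=.xx.co.uk; httponly")
name, value, attrs = parse_set_cookie(header)
print(name, value, attrs["domain"], is_cookie_deletion(header))
```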

@eth8505
Contributor

eth8505 commented Sep 10, 2014

From what I can tell, crawlers don't actually get a cookie set. I was just playing around with that a few days ago when updating our cache warming tool.
You should see your session name in the X-Varnish-Set-Cookie header though if you have debug info enabled.

@csdougliss
Contributor Author

@eth8505 The issue I have is that if I use Google Webmaster Tools and do "Fetch as Google", I quite often get a redirect or a "temporarily unavailable" message.

If I set my user agent to Googlebot in Firefox, I just get a redirect loop. If no cookie is being set, could that cause the redirect, since no cookies exist? I am not sure whether the same happens in Google Webmaster Tools or whether that is a separate issue.
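
To look at the redirect outside the browser, one option is to send a single request with the Googlebot user agent and not follow the 3xx response, so the Location and Set-Cookie headers of each hop can be read directly. This is only a sketch using Python's standard library; `fetch_once` and `build_request` are hypothetical helper names, and the URL in the usage comment is a placeholder.

```python
import urllib.error
import urllib.request

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following 3xx responses so each hop can be inspected."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError for the redirect instead.
        return None

def build_request(url, user_agent=GOOGLEBOT_UA):
    """Create a GET request that identifies itself as Googlebot, without cookies."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch_once(url):
    """Return (status, header list) for one request, without following redirects."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(build_request(url), timeout=10) as resp:
            return resp.status, list(resp.headers.items())
    except urllib.error.HTTPError as err:
        return err.code, list(err.headers.items())

# Usage (placeholder URL, not run here):
# status, headers = fetch_once("http://www.example.com/")
```

A loop would show up as the same Location being returned again when the redirect target is requested in the same way.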

generate_session_expires sets the expiry of the cookie, and it is only called if req.http.X-Varnish-Faked-Session is set. Is that set for a crawler? I am not sure! Actually, looking at the code, it should be.

Also, I see this in the VCL:

# it's a crawler, give it a fake cookie
set req.http.Cookie = "frontend=crawler-session";

I don't see X-Varnish-Set-Cookie anywhere in the debug output :(

On the Turpentine demo site, no cookie is set at all when using Googlebot, so I wonder why one is set on mine! However, when browsing other (unofficial) Varnish sites, one is set.

@csdougliss
Contributor Author

@eth8505 I've discovered that if I disable our Cm_RedisSession module, I no longer get the redirect with Googlebot. Do you have any ideas?

@eth8505
Contributor

eth8505 commented Sep 10, 2014

In vcl_recv() the request cookie "frontend" is set in req.http.Cookie. That's the one passed to Magento to be used as the internal session ID. However, for crawlers generate_session() is never called, so req.http.X-Varnish-Faked-Session is never filled.

In vcl_fetch() the response cookie is read from beresp.http.Set-Cookie and stored in beresp.http.X-Varnish-Set-Cookie.

In vcl_deliver(), generate_session_expires() is only called if req.http.X-Varnish-Faked-Session is set, which never happens for crawlers (see vcl_recv()). Hence, no resp.http.Set-Cookie is returned to the client.
This does make sense, since crawlers usually don't remember any cookies to send along with the next request. Why send data to someone who doesn't need it? Especially as the bot will get the same session ID (crawler-session) on the next request anyway, thanks to the user agent detection in vcl_recv().
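
The flow described above can be modelled in a few lines of Python. This is a simplified sketch, not the VCL itself: the user agent pattern is copied from the vcl_recv posted in this issue, the crawler_acl IP check is left out, and `recv`, `deliver_sets_cookie` and the `<generated-uuid>` placeholder are hypothetical stand-ins for the VCL subroutines.

```python
import re

# Crawler User-Agent pattern copied from the vcl_recv posted in this issue.
CRAWLER_UA = re.compile(
    r"^(?:ApacheBench/.*|.*Googlebot.*|JoeDog/.*Siege.*"
    r"|magespeedtest\.com|Nexcessnet_Turpentine/.*)$"
)

def recv(cookie, user_agent):
    """Model the frontend-cookie branch of vcl_recv.

    Returns (cookie passed to Magento, faked session or None).
    """
    if cookie and "frontend=" in cookie:
        return cookie, None  # client already has a session
    if CRAWLER_UA.search(user_agent):
        # Crawlers share one session; X-Varnish-Faked-Session stays unset.
        return "frontend=crawler-session", None
    faked = "frontend=<generated-uuid>"  # stands in for generate_session
    return (faked + "; " + cookie if cookie else faked), faked

def deliver_sets_cookie(faked_session):
    """vcl_deliver only adds Set-Cookie when X-Varnish-Faked-Session is set."""
    return faked_session is not None

googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
cookie, faked = recv("__utmc=27950523", googlebot)
print(cookie, deliver_sets_cookie(faked))
```

For the Googlebot user agent the model passes `frontend=crawler-session` to the backend and sends no Set-Cookie to the client, whereas a regular browser without a frontend cookie gets a generated session and a Set-Cookie.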

@eth8505
Contributor

eth8505 commented Sep 10, 2014

@craigcarnell We use Cm_RedisSession as well. We have never had any problems with redirect loops though, at least not as far as I know.
Cm_RedisSession does, however, have internal bot detection, since it sets different session parameters for bots. You may want to check Cm_RedisSession_Model_Session::getLifeTime() to find out exactly what it does.
One problem we did have was crawler threads locking each other due to long wait timeouts in Cm_RedisSession. That got a little better with the current version, which uses smaller sleep cycles while waiting for session locks.
You might want to try updating to the current version:
https://github.com/colinmollenhour/Cm_RedisSession

@csdougliss
Contributor Author

@eth8505 Thanks for that explanation. I am already running the latest code from git however :(

@csdougliss
Contributor Author

@eth8505 Out of interest, do you use Cm_RedisSession to share session information across multiple hosts behind a load balancer? Have you tried Googlebot in that scenario?

@eth8505
Contributor

eth8505 commented Sep 10, 2014

@craigcarnell We have two web servers sharing the session data behind a load balancer.
I just tried Googlebot as the user agent and it works perfectly fine.

@csdougliss
Contributor Author

@eth8505 Ensuring that bots get a cookie has resolved the issue for me with Redis session. It's a workaround for now, but it's important to let it do its thing.
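
The exact change was not posted. One possible reading is that the crawler session is treated like a faked session, so that the deliver step also emits a Set-Cookie for bots. Under that assumption, the Python sketch below shows the header that the generic vcl_deliver in the posted VCL would assemble: the cookie, an expiry 14400 seconds ahead (the value used in generate_session_expires), the request host without its port, and httponly. The function name `build_set_cookie` is hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

def build_set_cookie(session, host, now=None, lifetime=14400):
    """Assemble a Set-Cookie value the way the posted generic vcl_deliver does."""
    now = now or datetime.now(timezone.utc)
    expires = (now + timedelta(seconds=lifetime)).strftime("%a, %d-%b-%Y %H:%M:%S GMT")
    header = session + "; expires=" + expires + "; path=/"
    if host:
        # Same idea as regsub(req.http.Host, ":\d+$", "") in the VCL: drop any port.
        header += "; domain=" + re.sub(r":\d+$", "", host)
    return header + "; httponly"

# Fixed timestamp so the example output is reproducible.
fixed_now = datetime(2014, 9, 10, 12, 0, 0, tzinfo=timezone.utc)
print(build_set_cookie("frontend=crawler-session", "www.xx.co.uk:80", now=fixed_now))
```

With the fixed timestamp this prints `frontend=crawler-session; expires=Wed, 10-Sep-2014 16:00:00 GMT; path=/; domain=www.xx.co.uk; httponly`.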
