How to do dns based pooling? #9

Open
WesselAtWork opened this issue Jun 20, 2024 · 26 comments

@WesselAtWork commented Jun 20, 2024

Currently I'm trying to do a simple replication proxy by using a DNS name that resolves to all the other memcached IP addresses.

I'll look at more complicated hashing later; for now, this is the simplest thing I can do.

Using the DNS name kind of works...

Setup

The memcached name resolves to 3 IPs, and the order is randomized every time I query it:
nslookup memcached

Name:   memcached.test.local
Address: 10.0.0.12
Name:   memcached.test.local
Address: 10.0.0.11
Name:   memcached.test.local
Address: 10.0.0.10

memcached -vv --port=22122 -o proxy_config=test.lua

test.lua

package.loaded["simple"] = nil
local s = require("simple")

verbose(true)

router{
  router_type = "flat",
  log = true,
}

pool{
  name = "default",
  backends = {"memcached:11211"},
}

simple.lua is in the same directory, with a small modification (changed the o to a c):

            if c.r.log ~= nil then
                top = logreq_factory(c.pool)
            else
                top = function(r) return c.pool(r) end
            end

startup looks like:

router: overriding default for    router_type
making backend for... memcached:11211
setting up a zoneless flat pool
setting up a zoneless flat pool
setting up a zoneless flat pool
setting up a zoneless flat pool
<43 server listening (proxy)
<44 server listening (proxy)

Results

Setting a unique key in each instance separately, I can kind of get them from the proxy, but it's inconsistent.

  • 10.0.0.11 is the most consistent: requesting the key located there, I (almost) always get a response.
  • 10.0.0.10 is less consistent: I get it about 40% to 50% of the time.
  • Weirdly, 10.0.0.12 is never queried! The proxy never hits this one!

Also of note are the 4 "setting up a zoneless flat pool" lines. I would expect it to only output 3 because there are only three IPs, but it called mcp_config_routes 4 times?
Maybe there is a bad address in the list?

There is obviously some kind of resolution going on inside of memcached but it's not transferring correctly to the proxy, or my env is messing with the setup.


I don't mind digging into the lua myself, but if possible can you set me on the right track?

Where do I need to focus?

As a start I should probably install luasocket and then hack around the dns module, but if I can avoid that, it would be ideal.

Additionally I wonder about freshness. How does the resolution function?
Is it a one-and-done or does it query each time?

@dormando (Member)

Hey,

First off, you'll probably want to stick to routelib: https://github.com/memcached/memcached-proxylibs/tree/main/lib/routelib - it makes a lot more sense and simple should be deprecated at this point (did I forget to mark it?)

Next, well as far as I know none of the existing memcached proxies allowed for dns based pool definitions, so we definitely don't support that out of the box. If you give a hostname to a backend it'll pick the first dns response and stick to that forever.

That said it's doable but you'll have to set up a connector to load your server list. There are a few options:

  • You have an external cron/script that checks your DNS, and if it updates, writes out a little lua-style datafile to disk and sends a HUP signal to memcached. Then, at the top of your routelib config, load this file and parse it into a table that can be passed to pools{} (I can help with some examples of this if needed; a minimal sketch follows this list). DNS SOA numbers can be useful here to avoid having to do a lot of comparison work.

  • You can use the proxy's cron utility: https://github.com/memcached/memcached/blob/master/t/proxycron.lua#L7 to do something similar but internally to the proxy. Every N seconds, do a DNS lookup (you can try to do this via luasocket, or just have lua call out to a script in a more familiar language, etc). If DNS changed since the last check, you stash the result in a global variable somewhere, then call mcp.schedule_config_reload(), which will re-run the main config, which can pick it up from there.
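
To illustrate the first option, here is a minimal sketch. The file path, file name, and table shape are my own assumptions, not an official format; the external script just needs to write a Lua file that returns a backend list, and the routelib config loads it at the top:

-- backends.lua, written out by the external cron/script (hypothetical path):
-- return { "10.0.0.10:11211", "10.0.0.11:11211", "10.0.0.12:11211" }

-- at the top of the routelib config.lua, load it and fall back if it's missing:
local ok, backend_list = pcall(dofile, "/etc/memcached/backends.lua")
if not ok or type(backend_list) ~= "table" then
  backend_list = { "localhost:11211" }
end

pools{
  default = {
    backends = backend_list,
  },
}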

It's pretty easy to mix a cron file with routelib, which you do via passing multiple start scripts (in order):
memcached -o proxy_config=routelib.lua:mycron.lua,proxy_arg=config.lua
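
As a rough sketch of what mycron.lua might contain (based on the mcp.register_cron usage that appears later in this thread; the interval, variable names, and the my_dns_lookup helper are placeholders I'm assuming, not part of the proxy API):

-- mycron.lua: re-check DNS every 30 seconds and trigger a config reload
-- only when the resolved list actually changed.
local last_list = nil

mcp.register_cron("dns_check",
{ every = 30,
  func = function ()
    -- my_dns_lookup() is a hypothetical helper (luasocket, an external
    -- script, etc.) returning a sorted array of "ip:port" strings.
    local current = table.concat(my_dns_lookup(), ",")
    if last_list ~= nil and current ~= last_list then
      mcp.schedule_config_reload()
    end
    last_list = current
  end })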

This sounds a little complicated, but because the system is flexible about where the server list comes from, it's designed to plug into any kind of server discovery already. The user might just need to do a little work to get the server list into the thing :)

If you can give some hints as to which direction would work best for you I can help come up with a more complete framework example to save you some time.

@dormando (Member)

Oh also: the reason the "zoneless flat pool" line got printed four times is that part of the config code is executed once per worker thread, and you're probably running the default of four threads.
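
(If you want to confirm this, memcached's worker thread count is set with its -t option, so something like the following should print that line only once; the rest of the flags are just the ones from earlier in this thread:)

memcached -t 1 -vv --port=22122 -o proxy_config=test.lua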

@WesselAtWork (Author)

Thanks for the quick reply!
Ok I'm going to try RouteLib and see where that puts me first then I'll come back.

So far the proxycron is the most appealing.

A question though: why do I see the proxy hitting some of the instances? If what you said is true:

"If you give a hostname to a backend it'll pick the first dns response and stick to that forever."
then I would expect it to always hit one of the instances.

Is it because there are 4 threads? It seems the IP address order is randomized each time it's queried; am I just seeing the result of some of the threads getting a different address?

@dormando (Member)

Just in case it wasn't obvious from the above: you could dump your whole routelib-focused config.lua from something else and issue a HUP, if that's easiest. You don't need to pull the backend list from within config.lua unless that's easier than the other way.

If it's a big enough thing we can look into adding some internal DNS utilities.

@dormando (Member)

Is it because there are 4 threads? It seems the IP address order is randomized each time it's queried; am I just seeing the result of some of the threads getting a different address?

This, yup. Each worker thread has its own connection to the backend servers (unless you flip an option to turn on a consolidating io thread). So you're seeing the result of each thread's DNS lookup.

@WesselAtWork (Author)

Definitely easier from inside the conf.lua; I want to keep dependencies to as few as possible.

I could probably get away with calling os.execute("nslookup") but I probably shouldn't.

The dns utils inside memcached sound great!
But there are alternatives available.

I'm gonna try following the cron template and use luasocket to see what I can come up with.

@dormando (Member)

Sounds good. Please don't suffer too long though; if something is frustrating that's probably a pain point for me to clarify, document, or patch if necessary. Though we probably wouldn't see direct DNS utilities for a while.

So if something is going sideways you can show me what you have so far and I'll adjust the framework or add an example config to the routelib repo or something.

The proxy's in pretty heavy use but not that many total users so far, so I've been focusing on polish this year to help more varied use cases.

Good luck! Thanks for checking it out.

@WesselAtWork (Author)

Ok made some progress

package.loaded["socket.core"] = nil
local s = require("socket.core")

verbose(true)
debug(true)

-- to be set via env later
local port = "11211" --string

local function dns_to_ips(address, socket)
  local info = socket.dns.getaddrinfo(address)
  if info == nil then
    print("NO DNS IPs!")
    return {}
  end

  local ips = {}
  for index, value in ipairs(info) do
    ips[index] = value.addr..':'..port
  end

  return ips
end

settings{
  active_request_limit = 100,
  backend_connect_timeout = 3,
}

pools{
  myself = {
    backends = {
      "localhost:11211",
    }
  },
  other = {
    backends = dns_to_ips("memcached", s)
  },
}

-- using cmap for the future and to see what different routing policies do 
routes{
  cmap = {
    [mcp.CMD_GET] = route_allfastest{
      children = { "other" },
    },
    [mcp.CMD_SET] = route_allfastest{
      children = { "other" },
    },
  },
  default = route_direct{
    child = "myself",
  },
}

-- Seed first, then pick a random interval so multiple proxy instances
-- don't get synced and spike the DNS server.
math.randomseed(os.time())
local variance = math.random(3, 6)

mcp.register_cron("dns",
{ every = variance,
  func = function ()
    mcp.schedule_config_reload()
  end })

The hardest part was actually installing and figuring out luasocket!
I needed to set both LUA_PATH and LUA_CPATH correctly before it was able to find it.

Sadly, luasocket's DNS capabilities are very basic.
I wanted to use SRV records (so the port config is also defined in DNS), but all I can get is the IPs :(

I can also see it reloading so that's nice!

verbosity set to:    true
debug set to:    true
settings:
{ ["backend_connect_timeout"] = 3,["active_request_limit"] = 100,} 
pools config:
{ ["other"] = { ["backends"] = { [1] = 10.0.0.110:11211,[2] = 10.0.0.111:11211,[3] = 10.0.0.112:11211,} ,} ,["myself"] = { ["backends"] = { [1] = localhost:11211,} ,} ,} 
routes:
{ ["cmap"] = { [9] = { ["f"] = allfastest,["a"] = { ["children"] = { [1] = other,} ,} ,} ,[7] = { ["f"] = allfastest,["a"] = { ["children"] = { [1] = other,} ,} ,} ,} ,["default"] = { ["f"] = direct,["a"] = { ["child"] = myself,} ,} ,} 
loaded
changing global setting:    backend_connect_timeout    to:    3
changing global setting:    active_request_limit    to:    100
making pool:    myself    
    { ["backends"] = { [1] = localhost:11211,} ,} 
making backend for... localhost:11211
making pool:    other    
    { ["backends"] = { [1] = 10.0.0.110:11211,[2] = 10.0.0.111:11211,[3] = 10.0.0.112:11211,} ,} 
making backend for... 10.0.0.110:11211
making backend for... 10.0.0.111:11211
making backend for... 10.0.0.112:11211
Checking child for route:    children
Checking child for route:    children
Checking child for route:    child
mcp_config_pools: done

The IPs get shuffled every reload.


A different problem: I am probably doing it wrong, but none of the existing routers are giving me the behaviour I want.

  • route_direct: looks like it always chooses one instance (seems to be the first one in the list).
  • route_allsync: described as requesting between pools in parallel, but it did not request all instances in parallel.
  • route_failover: I thought it would fail over inside the pool, but it seems to want to fail over between pools.
  • route_allfastest: sadly, this one behaves the same as direct. Because it queries serially rather than in parallel, and because memcached responds so fast, the first instance queried practically always responds first. After a reload the order is shuffled, and we are hoping the key's instance ends up in the first entry.

What is the conceptual reasoning behind pools? I probably missed it on the wiki.

Should I be creating a pool-per-backend? Or am I missing something crucial?

@dormando (Member) commented Jun 22, 2024

Congrats!

Well I was hoping saying "pool" everywhere would be pretty clear, but this is actually the second time this week it's confused someone :) I'll see about adding some extra examples/docs somewhere.

The primary use case of memcached is "hash keys against a list of servers, and store/fetch that key from exactly one server", thus adding servers to a "pool" increases available memory. In the proxy a pool is exactly this (there are configuration options to control the key hashing). The routes are for directing/copying keys between pools. So "allfastest" will route a key to all pools. It doesn't know/care if a pool is a single server or 100.

The concept of copying routes between pools was originally for things like "Datacenter A and B" or "availability zones 1/2", or racks/cages/etc.

So if you just have a small list of memcached servers and you want keys copied to all of them, yes you need to create one pool per backend.

You can create a "pool set" to make the configuration a bit easier. There's an example here: #7 (comment)

@dormando (Member)

Fwiw you should see if you can move more of the logic into the cron so you're only reloading the configuration if the backends actually change. It's designed to be reloaded very frequently but it is a bit wasteful on CPU.

@WesselAtWork (Author)

Agreed

-- imports
package.loaded["socket.core"] = nil
local s = require("socket.core")

-- env
local port = os.getenv("BACKEND_PORT") or "11211" --string
local dnsname = os.getenv("BACKEND_HOSTNAME") or "memcached" --string

-- setup
verbose(true)
debug(true)

function DNS2IPS(address, socket)
  local info = socket.dns.getaddrinfo(address)
  if info == nil then
    -- say("DNS returned no IPs!")
    return {}
  end

  local ips = {}
  for index, value in ipairs(info) do
    ips[index] = value.addr..':'..port
  end
  -- dsay("Got DNS: "..dump(ips))
  return ips
end

-- https://stackoverflow.com/a/54140176
function SIMPLE_TABLE_COMPARE(tA, tB)
  return table.concat(tA) == table.concat(tB)
end

----

-- settings
settings{
  active_request_limit = 100,
  backend_connect_timeout = 3,
}

-- dns
DNS_IPS = DNS2IPS(dnsname, s)
table.sort(DNS_IPS) -- performance is 1s per 1M entries

-- pools
pools{
  myself = {
    backends = {
      "localhost:11211",
    }
  },
  other = {
    backends = DNS_IPS
  },
}

-- routes
routes{
  cmap = {
    [mcp.CMD_GET] = route_allfastest{
      children = { "other" },
    },
    [mcp.CMD_SET] = route_direct{
      child = "other",
    },
  },
  default = route_direct{
    child = "myself",
  },
}

-- cron
math.randomseed(os.time()) -- seed before picking the interval
local variance = math.random(3, 6)

mcp.register_cron("dns",
{ every = variance,
  func = function ()
    local new_ips = DNS2IPS(dnsname, s)
    table.sort(new_ips)
    if not SIMPLE_TABLE_COMPARE(new_ips, DNS_IPS) then
      mcp.schedule_config_reload()
    end
  end })

Unsure why, but dsay and say don't work the way I expect them to.
They are global in the lib, but they don't do anything if I use them in this file.

@WesselAtWork (Author)

I'm going to see what the zoned config can do.

@WesselAtWork (Author)

Using zpools

package.loaded["socket.core"] = nil
local s = require("socket.core")

local port = os.getenv("BACKEND_PORT")                  or "11211" --string
local dnsname = os.getenv("BACKEND_HOSTNAME")           or "memcached" --string

verbose(true)
debug(true)

local_zone("zlocal") -- we NEED to define a local zone

function GET_DNS(address, socket)
  local info = socket.dns.getaddrinfo(address)
  if info == nil then
    -- say("NO DNS IPs!")
    return {}
  end
  return info
end

-- IPP = IP:PORT
function DNS2IPPS(address, socket)
  local info = GET_DNS(address,socket)

  local ipps = {}
  for index, value in ipairs(info) do
    ipps[index] = value.addr..':'..port
  end
  -- dsay("Got DNS: "..dump(ips))
  return ipps
end

function IPPS2POOLS(ipps)
  local pools = {}
  for index, ipp in ipairs(ipps) do
    pools[ipp] = { ["backends"] = { [1] = ipp } }
  end
  -- dsay("Generated pool: "..dump(pools))
  return pools
end

-- https://stackoverflow.com/a/54140176
function SIMPLE_TABLE_COMPARE(tA, tB)
  return table.concat(tA) == table.concat(tB)
end

settings{
  active_request_limit = 100,
  backend_connect_timeout = 3,
}

DNS_IPPS = DNS2IPPS(dnsname, s)
table.sort(DNS_IPPS) -- performance is 1s per 1M entries

local bmyself = {
  backends = {
    "localhost:11211",
  }
}
local szmain  = IPPS2POOLS(DNS_IPPS)
szmain["zlocal"] = bmyself

-- pools
pools{
  pmyself = bmyself,
  set_zmain = szmain
}

-- routes
routes{
  cmap = {
    [mcp.CMD_GET] = route_zfailover{
      children = "set_zmain",
      stats = true,
      miss = true,
    },
    [mcp.CMD_SET] = route_allsync{
      children = "set_zmain",
    },
  },
  default = route_direct{
    child = "pmyself",
  },
}

math.randomseed(os.time()) -- seed before picking the interval
local variance = math.random(3, 6)

mcp.register_cron("dns",
{ every = variance,
  func = function ()
    local candidate_ipps = DNS2IPPS(dnsname, s)
    table.sort(candidate_ipps)
    if not SIMPLE_TABLE_COMPARE(candidate_ipps, DNS_IPPS) then
      mcp.schedule_config_reload()
    end
  end })

The idea is to make the local instance the default pool; it is the "closest", so it can be checked first before the rest.

Now that I am sorting the order of the DNS responses, let me try the original settings again. Maybe it'll hash/route correctly now.

I would prefer it to function the way you described pools.
Basically: the more memcached instances there are, the more caching space you have.

Currently it's copying all the keys, which is cool for HA, but less cool for scaling.

You mentioned there are distribution options available?
What is the difference between dist_jump_hash and dist_ring_hash?


I saw this in the Wiki

Since every proxy is also a normal memcached, it is possible to create an "L1/L2" layered cache. The details will depend highly on your needs, but doing so can remove the added latency and CPU overhead of having extra network requests added from having to first go through the proxy.

How do I set that up?
Is there a special flag or something I need to set?

@WesselAtWork (Author) commented Jun 24, 2024

Ok cool.

Doing this now works fine:

-- pools
pools{
  pmyself = {
    backends = {
      "localhost:11211",
    }
  },
  pother = {
    backends = DNS_IPPS
  },
}

-- routes
routes{
  cmap = {
    [mcp.CMD_GET] = route_direct{
      child = "pother",
    },
    [mcp.CMD_SET] = route_direct{
      child = "pother",
    },
  },
  default = route_direct{
    child = "pmyself",
  },
}

I can always get a key that was set through the proxy.

Sadly, if the IP table changes, chances are high that the new table will hash differently.
Sometimes it would still hash to the instance with the key, but more often than not the setup would "lose track" of the key.

Which feels wrong to me. The key is in the pool, just not where you expect it to be.

Depending on your use case this could be a non-issue: short-lived keys (5s to 1m) would be practically unaffected, while long-lived and "permanent" keys would start showing issues.
Even with low instance churn (probably once every hour or so), it would effectively reset the cache every time.
Not super ideal.

I might do something like this:

local bmyself = {
  backends = {
    "localhost:11211",
  }
}
local szmain  = IPPS2POOLS(DNS_IPPS)
szmain["zlocal"] = bmyself

-- pools
pools{
  pmyself = bmyself,
  pother = {
    backends = DNS_IPPS
  },
  set_zmain = szmain
}

-- routes
routes{
  cmap = {
    [mcp.CMD_GET] = route_zfailover{
      children = "set_zmain",
      stats = true,
      miss = true,
    },
    [mcp.CMD_SET] = route_direct{
      child = "pother",
    },
  },
  default = route_direct{
    child = "pmyself",
  },
}

That guarantees I set a key somewhere once, and I can find it wherever it is.

I could do

    [mcp.CMD_SET] = route_zfailover{
      children = "set_zmain",
      stats = true,
      shuffle = true,
    },

That would always set zlocal first.

@WesselAtWork (Author)

Ok I figured out something a little goofy.

-- pool sets
local szmain  = IPPS2POOLS(DNS_IPPS)
szmain["zlocal"] = {
  backends = DNS_IPPS
}

-- pools
pools{
  pmyself = {
    backends = {
      "localhost:11211",
    }
  },
  set_zmain = szmain
}

-- routes
routes{
  cmap = {
    [mcp.CMD_GET] = route_zfailover{
      children = "set_zmain",
      stats = true,
      miss = true,
    },
    [mcp.CMD_SET] = route_zfailover{
      children = "set_zmain",
      stats = true,
      shuffle = true,
    },
  },
  default = route_direct{
    child = "pmyself",
  },
}

Basically I am setting zlocal equal to the list of backends:

set_zmain:
  "10.0.0.110:11211":
    backends: ["10.0.0.110:11211"]
  "10.0.0.111:11211":
    backends: ["10.0.0.111:11211"]
  "10.0.0.112:11211":
    backends: ["10.0.0.112:11211"]
  zlocal:
    backends:
    - "10.0.0.110:11211"
    - "10.0.0.111:11211"
    - "10.0.0.112:11211"

Now I get the best of both :^).

  1. It tries zlocal; hashing works and I get the benefits of its deterministic set and get.
  2. An instance gets added/removed/changed; now the original hashing order is reset, breaking most of the key locations.
  3. It fails over, and eventually finds an instance that has the key.
  4. The new hashing order will cause sets to go to a new location, and gets will also hit that new location.

This does mean that zombie keys could happen.

Consider a situation where a key is set with a 5m TTL:

  • 00:00 -> key SOMEKEY is set valueX to instance1
  • 01:00 -> hashing gets reset, ( SOMEKEY now points to instance2)
  • 01:30 -> we get SOMEKEY
  • -> failover because it's not on instance2
  • -> get valueX from instance1
  • 02:00 -> we set SOMEKEY to valueY (it is set on instance2)
  • 02:30 -> we get valueY from SOMEKEY on instance2
  • 03:00 -> instance2 gets removed, hashing reset again (SOMEKEY now points to instance3)
  • 04:00 -> we get SOMEKEY
  • -> failover because it's not on instance3
  • -> get valueX from instance1

So in this situation we sadly got old data, and a client expecting fresh data won't know. :(
The chances of this occurring are directly proportional to the size of the total pool, the TTL of the keys, and the average instance churn.

Regardless I should definitely do:

    [mcp.CMD_DELETE] = route_allsync{
      children = "set_zmain",
    },

Deletes should go everywhere.

Maybe I should just do the sets with route_allsync and hope the amplification isn't too bad?
I doubt there is an easy way to describe: if set occurs, propagate a global delete for the key then action the set.

Maybe a stop job? If an instance is going away, maybe it should issue a global delete for every key it contains?

@dormando (Member)

How often are instances being added/removed from your setup here?

Sorry you wrote a bunch here; can you back up and maybe clearly write something short about what your ultimate goal is? :) Then maybe I can clarify what's going on here.

A few random questions:

  • say/dsay: dunno, will have to check. they should work so long as you set verbose(true) or debug(true) before calling them.
  • jump vs ring (see below)
  • re: l1/l2: this is something the raw API can do but I've not created any routelib functions to do it. It can add a lot of overhead or inconsistency if not done right so I was waiting to see more use cases first.

On the hashing stuff, TL;DR:

  • Jump is fast with good key spreading characteristics, but the order of backends is important - servers must only be added or removed from the end of the array, or else keys will get re-arranged.
  • ring: this is like ketama if you google that. The order of backends doesn't matter (it's sorted by the name), but it's slower than jump and re-arranges more keys than jump when servers are added/removed.
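
For reference, a hedged sketch of how a pool might select between them: I'm assuming here that the routelib pool options table is passed through to the underlying mcp.pool() call (the seed option shown later in this thread works that way), and that a dist option picks the hash algorithm; double-check the option name against the routelib/proxy docs before relying on it:

pools{
  main = {
    -- assumption: dist selects the key hashing/distribution algorithm
    options = { dist = mcp.dist_ring_hash },
    backends = { "10.0.0.10:11211", "10.0.0.11:11211", "10.0.0.12:11211" },
  },
}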

I can probably give you some clear ideas based on your goals once you clarify. Usually people's lists of memcached servers are fairly static. They get added to or removed from very rarely. Thus people usually take a "one-time hit" when a server dies and is replaced, and they have extra misses for a while. If this isn't acceptable we use the proxy to make extra copies of keys across pools. It kinda looks like you're currently sitting halfway between these two ideas. :)

@dormando (Member)

Looks like I can make some tweaks to routelib to make this easier, but lets see what we're trying to accomplish here first.

I definitely need to set up ordered pool sets and add more examples for overriding pool options. This is all good stuff for understanding what I need to document still, thanks! :)

@WesselAtWork (Author) commented Jul 1, 2024

What I am doing is practising deploying workloads on k8s.

In this case, scaling Memcached.

k8s does have zone primitives, but actually getting that information into the applications is a little undefined. I want to get the simple multi-server deployment correct before I start problem-solving the zones.

The problem space is dealing with the very dynamic structure of deployments (especially in cloud environments): in general you can expect at minimum one pod (memcached server instance) to get deleted and replaced once every day.

K8s has a way to inject the IP addresses of all the deployed instances into a DNS record available in-cluster (it's called a headless service), but that means the client needs to deal with balancing over all the instances (which is not guaranteed to be the case).

I wanted a simple gateway into the distributed memcached deployment and the proxy seemed the best way to do it.

At this point it is very close to working as I originally intended.


Testing with ring, I can confirm it's a lot more stable!

  • Instances dropping and coming back are routed correctly.
  • Instances being added (pool size increasing) seem to be fine.
  • Instances being removed (pool size decreasing) seem to lose track again.

I'm going to do some more in-depth testing and report my findings here on how the ring type behaves exactly.

@dormando (Member) commented Jul 9, 2024

Hey, was out last week. You're definitely swimming upstream a little. Hopefully you can make a small adjustment to make things easier (and take a look at a new example I just uploaded).

To restate your problem, you have a list of servers that come back from a DNS entry as IP addresses, ie:
{ "10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211" }

Since k8s is complete chaos, your list can change at random, like:

  • The list of servers comes back in a different order (though you can sort)
  • A server gets pulled "10.0.0.2:11211"
  • A pulled server gets replaced with a new IP or hostname "10.0.0.5:11211", which will sort to the end of the list.

This doesn't play well with how the proxy (or any memcached client) works. You need to stabilize the list order, then things will get easier. Let me walk through this for jump hash or ring hash.

Jump hash

  • You need a mechanism (I don't know what k8s has available, but I bet there's something) to assign server IPs to a slot ID.
  • For example, the above list of servers is implicitly:
    {1 = "10.0.0.1:11211", 2 = "10.0.0.2:11211", 3 = "10.0.0.3:11211" }

This is actually an array of servers, 1/2/3. Your system has a desired count of servers (3 in this case), and if a server is replaced it needs to go back into the same slot. IE: if 10.0.0.2 dies and is replaced with 10.0.0.5, the array should update to:
{1 = "10.0.0.1:11211", 2 = "10.0.0.5:11211", 3 = "10.0.0.3:11211" }

Now jump hash is perfectly happy: all entries that originally mapped to 10.0.0.2 now go to 10.0.0.5: no keys move positions.

If you want to add a new server to the list, you add it to the end, ie:
{1 = "10.0.0.1:11211", 2 = "10.0.0.5:11211", 3 = "10.0.0.3:11211", 4 = "10.0.0.4:11211" }

... this works well with jump hash. If servers are added to or removed from the end of the list, a minimal number of keys end up rehashed.
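
A small Lua sketch of that slot-stable merge, as something that could run inside the config or cron (plain Lua, no proxy API; the function name and bookkeeping are my own):

-- Merge a freshly-resolved list of "ip:port" strings into an existing slot
-- array: servers that are still present keep their slot, replacements take
-- over the slot of a vanished server, and genuinely new servers are appended
-- at the end. Shrinking from the middle still compacts (and rehashes).
function MERGE_SLOTS(slots, fresh)
  local fresh_set = {}
  for _, ipp in ipairs(fresh) do fresh_set[ipp] = true end

  -- keep surviving servers in their original slot positions
  local kept, by_slot = {}, {}
  for i, ipp in ipairs(slots) do
    if fresh_set[ipp] then
      by_slot[i] = ipp
      kept[ipp] = true
    end
  end

  -- collect brand-new servers in DNS order
  local new_servers = {}
  for _, ipp in ipairs(fresh) do
    if not kept[ipp] then table.insert(new_servers, ipp) end
  end

  -- fill freed slots first, then append any leftovers at the end
  local merged = {}
  for i = 1, #slots do
    local ipp = by_slot[i] or table.remove(new_servers, 1)
    if ipp ~= nil then table.insert(merged, ipp) end
  end
  for _, ipp in ipairs(new_servers) do
    table.insert(merged, ipp)
  end

  return merged
end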

Ring hash

Ring hash is more resilient to the server list reordering: This is because it internally is a hash map with the host/ip/port of the server. Unfortunately with the options it has today it can't only look at a backend label, so in the previous example of 10.0.0.2 dying and being replaced with 10.0.0.5 you get key displacement regardless of what you do (but it does try to minimize this effect).

  • If servers are being removed from any point in the list and not for the sake of replacing a server, ring hash will still work better.

Both

Ensuring a dead server comes back with the same IP can help simplify things in both cases.

Ensuring the length of the list of servers doesn't change for no reason helps a lot for the stability of the cache system. It's not designed to have the list randomly contract and expand: the number of servers should be a deliberate calculation.

Gutter example

In this example: https://github.com/memcached/memcached-proxylibs/blob/main/lib/routelib/examples/failover-gutter.lua

We show how to handle the temporary loss of a server with less impact to clients. In short: if a backend is down, we fail over to another pool or to a remapped backend list within the same pool. At the same time, we adjust the TTL of set commands so these failed-over cache entries won't stay around for long periods of time, which improves bad cache scenarios.

This can bridge a gap: if you can get k8s to keep a stable list of servers, but they still die and get replaced with some frequency, and it can take some time (minutes, etc.) to update the server list, the gutter cache can help keep up your hit rate.
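
Translating the linked example into the shape used elsewhere in this thread, a hedged sketch might look like the following (pool names, addresses, and the 300s cap are placeholders; the gutter pool reuses the same servers with a different hash seed so failed-over keys land somewhere else):

pools{
  main = {
    backends = { "10.0.0.10:11211", "10.0.0.11:11211", "10.0.0.12:11211" },
  },
  -- same servers, remapped with a different seed; only hit on failover
  gutter = {
    options = { seed = "failover" },
    backends = { "10.0.0.10:11211", "10.0.0.11:11211", "10.0.0.12:11211" },
  },
}

routes{
  default = route_failover{
    -- cap the TTL of anything written to the gutter pool
    children = { "main", route_ttl{ ttl = 300, child = "gutter" } },
    miss = true,
  },
}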

@WesselAtWork (Author)

I got some time to check on it today.

When it's working it works great! But I am experiencing multiple problems at the moment.

Gutter

This snippet:

children = { "foo", route_ttl{ ttl = 300, child = "gutter" } },

isn't working for me.

IDK if I am doing something wrong but even the code does not look like it's expecting what route_ttl returns:

function route_ttl_start(a, ctx)
    -- if ctx:cmd() == mcp.CMD_ANY_STORAGE do etc else etc
    local fgen = mcp.funcgen_new()
    local o = { ttl = a.ttl }
    o.handle = fgen:new_handle(a.child)
    if ctx:cmd() ~= mcp.CMD_ANY_STORAGE then
        o.cmd = ctx:cmd()
    end
    fgen:ready({
        a = o,
        n = ctx:label(),
        f = route_ttl_f,
    })
    return fgen
end

And

function route_failover_start(a, ctx)
    local fgen = mcp.funcgen_new()
    local o = { t = {}, c = 0 }
    -- NOTE: if given a limit, we don't actually need handles for every pool.
    -- would be a nice small optimization to shuffle the list of children then
    -- only grab N entries.
    -- Not doing this _right now_ because I'm not confident children is an
    -- array or not.
    for _, child in pairs(a.children) do
        table.insert(o.t, fgen:new_handle(child))
        o.c = o.c + 1
    end
    if a.shuffle then
        -- shuffle the handle list
        for i=#o.t, 2, -1 do
            local j = math.random(i)
            o.t[i], o.t[j] = o.t[j], o.t[i]
        end
    end
    o.miss = a.miss
    o.limit = a.failover_count
    o.stats_id = a.stats_id
    fgen:ready({ a = o, n = ctx:label(), f = route_failover_f })
    return fgen
end

They don't seem to interface with each other.

Right now I am just using it like so:

children = { "foo", "gutter" },

Crons

Something regressed.

Crons are broken for me.

It works about 80% of the time on startup, then it executes the first reload and it always exits with Exit Code: 139 which is a SIGSEGV.

As for the startup failures, I can't quite tell if they're this issue or something else, but I am going to assume they are.

When I remove the cron stanza it's fine. I've tried running it once with an if RUN_ONCE latch; it still breaks.

Valgrind

Setup with valgrind (this was a pain memcached/memcached#420)

I built memcached with all setrlimit stanzas removed. (Only the 2 in memcached.c)
You can't have setrlimit errors if you never call it :^)

=== mcp_config_pools: done ===
=== mcp_config_routes: start ===
building root for tag:    default
making a new router
generating a route:    default    37
generating a route:    cmdmap    9
attaching to proxy default tag
==1== Thread 6:
==1== Invalid read of size 1
==1==    at 0x16C350: luaH_realasize (ltable.c:242)
==1==    by 0x16C3AA: luaH_next (ltable.c:339)
==1==    by 0x15F16D: lua_next (lapi.c:1251)
==1==    by 0x14CB29: mcp_funcgen_router_cleanup (proxy_luafgen.c:1586)
==1==    by 0x14CB29: mcp_funcgen_cleanup (proxy_luafgen.c:468)
==1==    by 0x1472D8: mcplib_attach (proxy_lua.c:1469)
==1==    by 0x161C6A: luaD_precall (ldo.c:532)
==1==    by 0x1708D0: luaV_execute (lvm.c:1624)
==1==    by 0x161F5F: ccall (ldo.c:577)
==1==    by 0x161F5F: luaD_callnoyield (ldo.c:595)
==1==    by 0x16100A: luaD_rawrunprotected (ldo.c:144)
==1==    by 0x1622AF: luaD_pcall (ldo.c:892)
==1==    by 0x15EB90: lua_pcallk (lapi.c:1057)
==1==    by 0x150DB3: proxy_thread_loadconf (proxy_config.c:683)
==1==  Address 0x1e is not stack'd, malloc'd or (recently) free'd
==1== 
==1== 
==1== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==1==  Access not within mapped region at address 0x1E
==1==    at 0x16C350: luaH_realasize (ltable.c:242)
==1==    by 0x16C3AA: luaH_next (ltable.c:339)
==1==    by 0x15F16D: lua_next (lapi.c:1251)
==1==    by 0x14CB29: mcp_funcgen_router_cleanup (proxy_luafgen.c:1586)
==1==    by 0x14CB29: mcp_funcgen_cleanup (proxy_luafgen.c:468)
==1==    by 0x1472D8: mcplib_attach (proxy_lua.c:1469)
==1==    by 0x161C6A: luaD_precall (ldo.c:532)
==1==    by 0x1708D0: luaV_execute (lvm.c:1624)
==1==    by 0x161F5F: ccall (ldo.c:577)
==1==    by 0x161F5F: luaD_callnoyield (ldo.c:595)
==1==    by 0x16100A: luaD_rawrunprotected (ldo.c:144)
==1==    by 0x1622AF: luaD_pcall (ldo.c:892)
==1==    by 0x15EB90: lua_pcallk (lapi.c:1057)
==1==    by 0x150DB3: proxy_thread_loadconf (proxy_config.c:683)
==1==  If you believe this happened as a result of a stack
==1==  overflow in your program's main thread (unlikely but
==1==  possible), you can try to increase the size of the
==1==  main thread stack using the --main-stacksize= flag.
==1==  The main thread stack size used in this run was 8388608.
==1== 
==1== HEAP SUMMARY:
==1==     in use at exit: 11,359,830 bytes in 8,769 blocks
==1==   total heap usage: 11,861 allocs, 3,092 frees, 11,923,435 bytes allocated
==1== 
==1== LEAK SUMMARY:
==1==    definitely lost: 0 bytes in 0 blocks
==1==    indirectly lost: 0 bytes in 0 blocks
==1==      possibly lost: 818,950 bytes in 6,635 blocks
==1==    still reachable: 10,540,880 bytes in 2,134 blocks
==1==         suppressed: 0 bytes in 0 blocks
==1== Rerun with --leak-check=full to see details of leaked memory
==1== 
==1== For lists of detected and suppressed errors, rerun with: -s
==1== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

Seems to happen if you run mcp_config_routes twice but unsure.

Minimal Config

If you want to test it out: this is a minimal config to cause it to happen

-- optional verbose and debug. Does not affect outcome
verbose(true)
debug(true)

pools{
  pmyself = {
    backends = {
      "localhost:11211",
    }
  },
}

routes{
  cmap = {},
  default = route_direct{
    child = "pmyself",
  },
}

mcp.register_cron("test",
{ every = 1,
  func = function ()
    print("reloaded")
    mcp.schedule_config_reload()
  end })

Bad Version

Memcached Tested: 1.6.27, 1.6.28 and 1.6.29 all have the problem (so it's probably Proxylib)
Proxylib Tested (On 1.6.29): only 4a6ebc6 works. Everything after is broken. I think 53c6cce introduced the issue.

Result

Testing with it broken yielded positive results so far!
I want to run it with everything stable, though, before I write up.

@dormando (Member)

Hey,

I'll look into the cron failure, thanks for reporting!

Can you give a more complete example of what "doesn't work" with route_ttl? i.e., a config and the expected vs. received result.

routelib is calling the route_name_conf func during the config load stage. There's some abstraction to how the children objects are handled. They're resolved before calling the _start function during the worker config load stage.

Thanks!

@dormando (Member)

Hey,

I've pushed a fix for the segfault. I'll add the missing unit tests soon then maybe cut a bugfix release. Thanks for the report! My test suite didn't catch it because many of the standard tests are using an older API that didn't use the builtin router object. :/

@WesselAtWork (Author)

Awesome!

Cron

next is working for me now!

Skip Hash (Ketama)

I'm testing with skip hash and I think I might have discovered another issue.

==1== Thread 7:
==1== Conditional jump or move depends on uninitialised value(s)
==1==    at 0x150FF1: ketama_get_server (proxy_ring_hash.c:108)
==1==    by 0x14AB46: mcplib_pool_proxy_call_helper (proxy_lua.c:1092)
==1==    by 0x14E785: mcp_run_rcontext_handle (proxy_luafgen.c:1116)
==1==    by 0x13F146: _proxy_run_rcontext_queues (proto_proxy.c:832)
==1==    by 0x13F146: proxy_run_rcontext (proto_proxy.c:1015)
==1==    by 0x13FFEF: complete_nread_proxy (proto_proxy.c:754)
==1==    by 0x120D2B: complete_nread (memcached.c:1490)
==1==    by 0x120D2B: drive_machine (memcached.c:3239)
==1==    by 0x48F45FC: ??? (in /usr/lib/libevent-2.1.so.7.0.1)
==1==    by 0x48F4D25: event_base_loop (in /usr/lib/libevent-2.1.so.7.0.1)
==1==    by 0x12C3BD: worker_libevent (thread.c:530)
==1==    by 0x405C348: ??? (in /lib/ld-musl-x86_64.so.1)
==1== 

It doesn't SEGFAULT, but I think it's affecting the consistency of the ketama; when I was testing it would sometimes freak out and then stabilize.

Route TTL

Not sure if I am doing something wrong.
But this looks like an issue at the fgen:new_handle(child) step.

 Failed to execute mcp_config_routes: invalid argument to new_handle

@dormando (Member)

Can you please include the configs you're using that aren't working, like you did for the cron issue? I don't have a lot of time to go guessing :( Thanks!

@WesselAtWork (Author)

Apologies:

Skip hash

I am definitely doing something weird.

Not sure what is causing it to happen; trying some minimal stuff does not reproduce it.

I'll do some more in-depth testing later. Ignore this issue for now.

Route TTL

Minimal to get it to happen

local stat_list = {
  "10.244.0.84:11211",
  "10.244.0.86:11211",
  "10.244.0.85:11211",
}

pools{
  main = {
    backends = stat_list ,
  },

  gutter_same = {
    options = { seed = "failover" },
    backends = stat_list,
  },
}

routes{
  cmap = {
      [mcp.CMD_SET] = route_failover{
          children = { "main", route_ttl{ ttl = 10, child = "gutter_same" } },
          stats = true,
          miss = false,
      },
  },
  default = route_failover{
      children = { "main", "gutter_same" },
  },
}

@dormando (Member) commented Jul 22, 2024

@WesselAtWork apologies for the delay. Just pushed a fix that should make the gutter example work again. It had a one-character typo.

I didn't further test the example yet, but I'll try to do that myself when I get a minute. It should work fine now. I did test that it starts; I just didn't try to push traffic.
