Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AuthN Docker RDS Connectivity Issue #175

Closed
xcskier56 opened this issue Apr 23, 2021 · 14 comments
Closed

AuthN Docker RDS Connectivity Issue #175

xcskier56 opened this issue Apr 23, 2021 · 14 comments

Comments

@xcskier56
Copy link
Contributor

xcskier56 commented Apr 23, 2021

Hi there, I'm getting the following error when trying to run the AuthN docker image on EC2 Ubuntu 20.04.

~*~ Keratin AuthN v1.10.4 ~*~
data.NewDB: dial tcp: lookup <redacted>.us-east-1.rds.amazonaws.com: device or resource busy

I have a fairly simple setup, just an EC2 instance and an RDS instance. I have done the following debugging and am quite certain that it is not a networking issue like security groups.

  1. On the EC2, I can psql to the RDS instance
  2. I used docker exec to run the AuthN container, installed psql using apk --update add postgresql-client and was able to successfully connect to rds
  3. There is a dockerized Rails app on the same EC2 that can reach the same RDS instance
  4. The docker-compose below works fine on my local machine (mac) but causes the above error on EC2

Here is my docker-compose (sensitive items removed)

version: '3.8'
volumes:
  redis_auth_data:

networks:
  frontend:
  backend:

services:
  redis:
    image: redis
    # using persistant storage
    command: ['redis-server', '--appendonly', 'yes']
    volumes:
      - redis_auth_data:/data
    networks:
      backend:
        aliases:
          - redis_auth.docker # used below

  keratin_auth:
    image: keratin/authn-server:1.10.4
    ports:
      - "21002:3000"
      - "21001:21001"
    environment:
      - AUTHN_URL=http://localhost:8080
      - APP_DOMAINS=localhost
      - HTTP_AUTH_USERNAME # passthrough from env
      - HTTP_AUTH_PASSWORD # passthrough from env
      - DATABASE_URL=postgresql://<pg_username>:<pg_password>@<host>.us-east-1.rds.amazonaws.com/keratin_authn
      - PUBLIC_PORT=21001
      - REDIS_URL=redis://redis_auth.docker:6379/1
      - SECRET_KEY_BASE=<redacted>
    networks:
      backend:
      frontend:
    depends_on:
      - redis

Thanks in advance

@cainlevy
Copy link
Member

That's a new one!

A few options to consider:

  • do you get a different error if the database url has the wrong host, port, or credentials?
  • could you also try v1.10.3? I just updated the Postgres driver in v1.10.4, and while it's currently working for my own connection to RDS, testing the previous patch could eliminate a regression

@xcskier56
Copy link
Contributor Author

@cainlevy,

  1. I do not get a different error with an invalid postgres url
  2. This is still occurring on 1.10.3.

I have created a compose file that is reproducing the problem for me. This file works on my local machine (mac) but fails on EC2 with ubuntu 20.04 AND on an Azure VM running Ubuntu 18.04. I did some searching for Go + Docker connectivity and maybe saw something about Go using different DNS lookup like 8.8.8.8 so it couldn't resolve a non-public endpoint. Not sure if real and even related but an idea. Here's the docker-file that should reproduce the problem:

version: '3.9'

services:
  keratin_auth:
    image: keratin/authn-server:1.10.3
    ports:
      - 8080:3000
    environment:
      - AUTHN_URL=http://localhost:8080
      - APP_DOMAINS=localhost
      - REDIS_URL=redis://redis:6379/1
      - DATABASE_URL=postgres://postgres:password@postgres:5432/keratin_authn?sslmode=disable
      - SECRET_KEY_BASE=changeme
      - HTTP_AUTH_USERNAME=hello
      - HTTP_AUTH_PASSWORD=world
    networks:
      - my-network
    depends_on:
      postgres:
        condition: service_healthy

  redis:
    image: redis
    ports:
      - "6379:6379"
    command: ['redis-server', '--appendonly', 'yes']
    volumes:
      - /var/www/redis:/data
    networks:
      - my-network

  postgres:
    image: postgres:12.6
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_PASSWORD=password
      - POSTGRES_USER=postgres
      - POSTGRES_DB=keratin_authn
    networks:
      - my-network
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
    volumes:
      - /var/www/postgres:/var/lib/postgresql/data

networks:
  my-network:
    driver: bridge

@kkohrt
Copy link

kkohrt commented Apr 23, 2021

Hi @cainlevy -- thanks for the help.
I spun up an unbuntu docker image in the same docker network as the keratin auth image 9and postgres) and implemented the simple go program from https://github.com/jmoiron/sqlx#usage
only substituting the line

db, err := sqlx.Connect("postgres", "user=foo dbname=bar sslmode=disable")

with the format I see in https://github.com/keratin/authn-server/blob/master/app/data/postgres/db.go#L12,
using the full DATABASE_URL that was passed into the keratin server using environment variables

db, err := sqlx.Connect("postgres", "postgres://postgres:password@postgres:5432/keratin_authn?sslmode=disable")

The example worked.

The possible differences I can see are:
(a) Latest ubuntu image
(b) I installed go
(c) I installed the sqlx package

I'm not sure if that provides any more useful information, but perhaps it give you a clue. Still stumped here.

@cainlevy
Copy link
Member

That's pretty thorough. Hmm.

From what I can discover, this error appears to be coming back from the DNS server while attempting to resolve the domain. I suppose that means we can rule out credentials, ports, and anything else that's part of the actual server connection.

So the question I'm pondering is why the DNS service might report a busy error. Does it have some assumption or configuration that AuthN breaks, maybe during its startup routine when it establishes a connection pool.

The call to data.NewDB is very early in the startup routine.

authn-server/app/app.go

Lines 31 to 37 in b2ee0bc

func NewApp(cfg *Config, logger logrus.FieldLogger) (*App, error) {
errorReporter, err := ops.NewErrorReporter(cfg.ErrorReporterCredentials, cfg.ErrorReporterType, logger)
db, err := data.NewDB(cfg.DatabaseURL)
if err != nil {
return nil, errors.Wrap(err, "data.NewDB")
}

You don't appear to have an error reporting service configured, so let's back up a step:

authn-server/main.go

Lines 44 to 53 in b2ee0bc

func serve(cfg *app.Config) {
fmt.Println(fmt.Sprintf("~*~ Keratin AuthN v%s ~*~", VERSION))
// Default logger
logger := logrus.New()
logger.Formatter = &logrus.JSONFormatter{}
logger.Level = logrus.DebugLevel
logger.Out = os.Stdout
app, err := app.NewApp(cfg, logger)

Okay that's just logging. I can't see anything that has happened to cause DNS contention. That theory feels like a dead-end.

So what if it's ENV variable parsing? Is it possible that this ENV parsing is getting tripped up on a special character?

authn-server/app/env.go

Lines 40 to 45 in b2ee0bc

func lookupURL(name string) (*url.URL, error) {
if val, ok := os.LookupEnv(name); ok {
return url.Parse(val)
}
return nil, nil
}

@kkohrt did you embed the ENV var directly as a string in your test, or did you read it from os.LookupEnv?

@kkohrt
Copy link

kkohrt commented Apr 23, 2021

@cainlevy You are faster than me creating a gist! Which I finally did after @xcskier56 & I hacked away a bit more.
The full story of what we tried is here (the story is at the bottom; files at the top--I did not know gists were alphabetized)
https://gist.github.com/kkohrt/f07c64cb7a15e402385f700ea1f43590

I used os.ExpandEnv("$DATABASE_URL") instead of os.LookupEnv(name) (see the test go file in the gist)
Probably has a lot to do with being more of a Rubist than a Go Guy... but I'm starting to like go

I'll try to pull work in that lookupURL function verbatim and see if anything new pops up.

@kkohrt
Copy link

kkohrt commented Apr 23, 2021

Alas, no luck.

I inserted methods from the Keratin source, and called them at the top of the main function (from the gist)

func lookupURL(name string) (*url.URL, error) {
    if val, ok := os.LookupEnv(name); ok {
        return url.Parse(val)
    }
    return nil, nil
 }

func NewDB(url *url.URL) (*sqlx.DB, error) {
    return sqlx.Connect("postgres", url.String())
}

func main() {
    DbUrl, err := lookupURL("DATABASE_URL")
    if err != nil {
        log.Fatalln(err)
    }
    fmt.Printf("\n-----------\nParsed HOST: %s\n", DbUrl.Hostname())
    fmt.Printf("URL back to  String: %s\n\n", DbUrl.String())

    // this Pings the database trying to connect
    // use sqlx.Open() for sql.Open() semantics
    // db, err := sqlx.Connect("postgres", dburl)
    db, err := NewDB(DbUrl)
    if err != nil {
        log.Fatalln(err)
    }

And...its cool; and working--as in, none of those routines seems to be the issue.
Maybe I try compiling the app? Or puling in keratin source and running from that? Suggestions?

@cainlevy
Copy link
Member

Probably has a lot to do with being more of a Rubist than a Go Guy.

Doesn't seem to be slowing you down!

Do I understand that I could reproduce this issue by running the given docker-compose.yml on an isolated EC2 instance? RDS isn't even necessary for the reproduction?

@xcskier56
Copy link
Contributor Author

xcskier56 commented Apr 24, 2021

@cainlevy, with the compose file I posted earlier I reproduced the error on EC2 and Azure VM and it has no external dependencies so you should be able to just run it. No RDS needed

@cainlevy
Copy link
Member

cainlevy commented Apr 24, 2021

It seems I'm missing some part of the reproduction. Here's what I tried:

The last log lines I see are a successful boot:

keratin_auth_1  | ~*~ Keratin AuthN v1.10.4 ~*~
keratin_auth_1  | {"keyID":"W0sbN9bUh-_9tSXC24-3F3OAM_bfY6nUsqT0X8hakpA","keyName":"rsa:449806","level":"info","msg":"new key generated","scope":"NewKeyStoreRotater","time":"2021-04-24T22:16:57Z"}
keratin_auth_1  | AUTHN_URL: http://localhost:8080
keratin_auth_1  | PORT: 3000

@xcskier56
Copy link
Contributor Author

@cainlevy, thank you for the detail in how you were able to successfully run the compose file. At this point, I'm pretty convinced that this is some weird issue with Ubuntu + Go/Keratin. I was able to use Amazon Linux 2 to successfully run the compose file.

I am going to just sidestep this issue by using Amazon linux 2. Feel free to close this issue if you want, and once again, thank you for your help in debugging this.

@cainlevy
Copy link
Member

What a journey. Thanks for digging in!

@bradenmacdonald
Copy link

Hey, I just ran into a weird issue that may be related to this one.

My app uses GitHub Actions to launch a docker-compose file to run redis, authn-server, and my app's test suite. It was working fine with authn-server 1.10.2 for ages, but today I tried to upgrade and found that any newer version (1.10.3 and up) would fail. It would work fine on my MacBook, but on GitHub Actions, the authn-server container would fail to start with this error:

~*~ Keratin AuthN v1.10.3 ~*~
Maintain: restore: Get: Get: dial tcp: lookup redis: device or resource busy

Now the changelog didn't have any clues for what changed in 1.10.3 so I had to check git and found this commit to bump golang to 1.15. That got me on the right track and with enough searching I figured it was some change to go's DNS resolver causing the problem.

Eventually I found a workaround from the documentation in net.go: if I set the environment variable GODEBUG to netdns=go to force the use of the pure Go DNS resolver, the issue went away and authn could lookup the redis hostname again.

Commenting here in case anyone else ever has a similar issue :)

@cainlevy
Copy link
Member

Wow excellent sleuthing. Thanks for sharing that.

It looks like the cgo resolver is only chosen in specific conditions. Do you think that had something to do with the Docker host?

@bradenmacdonald
Copy link

Something, yes... but I didn't dig any deeper into this once I found a workaround.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants