Stand alone Django backend for civic sandbox #216

Open
AraOshin opened this issue Aug 2, 2018 · 15 comments

@AraOshin

AraOshin commented Aug 2, 2018

Most of the civic sandbox backend infrastructure was integrated into the 2018 project backends rather than creating a new separate backend infrastructure so close to demo day 2018. In moving forward with sandbox backend improvements, I believe creating a standalone civic sandbox backend is important.

Some reasoning:
Moving forward, civic may want to be able to quickly spin up a sandbox package with data from an external partner who may give us access to their data, but not their backend infrastructure. We are not there yet, but I want to work towards being ready for this.
Refactoring and improving code will be easier to implement in one place rather than the current four repos.
There is currently a civic-sandbox repo which houses the code related to the “central” sandbox api (essentially a registry of the sandbox packages and layers). By merging the Django backend into this repo, it will be easier to write code for managing the sandbox layers and changes can happen in one place rather than in the registry repo and then project repo.

Some thoughts:
Django can handle hitting multiple databases (https://docs.djangoproject.com/en/2.0/topics/db/multi-db/), which this change will necessitate. However, this will mean managing all the related database secret credentials in one repo. Are there limitations here in terms of how travis will handle this in production?
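For reference, a minimal sketch of the router side of Django's multi-db support. The app labels and database aliases below are hypothetical and do not come from any of the existing repos; the router itself is a plain class, and Django only calls the methods it finds on it.

```python
# Hypothetical mapping from Django app label to database alias.
# None of these names are taken from the existing project repos.
APP_TO_DATABASE = {
    "disaster": "disaster_resilience",
    "transportation": "transportation_systems",
}


class SandboxRouter:
    """Send each app's queries to the database alias that holds its data."""

    def db_for_read(self, model, **hints):
        # Returning None tells Django this router has no opinion,
        # so it falls through to the next router or to 'default'.
        return APP_TO_DATABASE.get(model._meta.app_label)

    def db_for_write(self, model, **hints):
        return APP_TO_DATABASE.get(model._meta.app_label)

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # The sandbox does not own these schemas, so never migrate them.
        if db in APP_TO_DATABASE.values():
            return False
        return None
```

The router would be registered through the `DATABASE_ROUTERS` setting with its dotted import path; where the class lives in the new project is still an open choice.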

@MikeTheCanuck @DingoEatingFuzz

(cc: @jaronheard)

@nam20485
Member

nam20485 commented Aug 2, 2018

I'll jump in here...

This appears to make a lot of sense, moving the sandbox backends out of the respective theme backends separates multiple concerns housed in single repos, i.e. when @AraOshin commits sandbox backend code to my backend I don't have to approve code I don't know anything about, and she can maintain her own code in her own backend without my intervention.

As far as travis goes, for the DB credentials/settings, you will just have to create multiple entries for the ones that are different per backend, i.e. the DB name.

So POSTGRES_NAME will have to become DISASTER_POSTGRES_NAME, TRANSPO_POSTGRES_NAME, EMERGENCY_POSTGRES_NAME, etc.

I think some of the other setting environment variables for the DB will be the same for all of the backends (i.e. POSTGRES_HOST, POSTGRES_PORT, etc.).
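A rough sketch of how the prefixed variables could be turned into database entries, assuming the prefix list below (hypothetical) and a fallback to the shared variable whenever a backend-specific one is not set:

```python
import os

# Hypothetical prefixes; the real list would match the project backends.
BACKEND_PREFIXES = ["DISASTER", "TRANSPO", "EMERGENCY"]


def lookup(env, prefix, key):
    """Prefer e.g. DISASTER_POSTGRES_NAME, fall back to POSTGRES_NAME."""
    return env.get(f"{prefix}_POSTGRES_{key}", env.get(f"POSTGRES_{key}"))


def build_databases(env, prefixes):
    """Build one Django database entry per backend prefix."""
    databases = {}
    for prefix in prefixes:
        databases[prefix.lower()] = {
            "ENGINE": "django.contrib.gis.db.backends.postgis",
            "NAME": lookup(env, prefix, "NAME"),
            "USER": lookup(env, prefix, "USER"),
            "PASSWORD": lookup(env, prefix, "PASSWORD"),
            "HOST": lookup(env, prefix, "HOST"),
            "PORT": lookup(env, prefix, "PORT"),
        }
    return databases


# In settings.py this would be merged into DATABASES.
SANDBOX_DATABASES = build_databases(os.environ, BACKEND_PREFIXES)
```

Django still requires a 'default' alias in DATABASES, so these entries would sit alongside one rather than replace it.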

Multiple DB definitions in Django are easy:

see e.g. the 2018 transpo backend:

DATABASES = {
    'default': {
        'ENGINE': 'django.contrib.gis.db.backends.postgis',
        'PASSWORD': os.environ.get('POSTGRES_PASSWORD'),
        'NAME': os.environ.get('POSTGRES_NAME'),
        'USER': os.environ.get('POSTGRES_USER'),
        'HOST': os.environ.get('POSTGRES_HOST'),
        'PORT': os.environ.get('POSTGRES_PORT')
    },
    'odot_crash_data': {
        'ENGINE': 'django.contrib.gis.db.backends.postgis',
        'PASSWORD': os.environ.get('POSTGRES_PASSWORD'),
        'OPTIONS': {
            'options': '-c search_path=django,odot_crash_data'
        },
        'NAME': os.environ.get('POSTGRES_NAME'),
        'USER': os.environ.get('POSTGRES_USER'),
        'HOST': os.environ.get('POSTGRES_HOST'),
        'PORT': os.environ.get('POSTGRES_PORT')
    },
    'multnomah_county_permits': {
        'ENGINE': 'django.contrib.gis.db.backends.postgis',
        'PASSWORD': os.environ.get('POSTGRES_PASSWORD'),
        'OPTIONS': {
            'options': '-c search_path=django,public,multnomah_county_permits'
        },
        'NAME': os.environ.get('POSTGRES_NAME'),
        'USER': os.environ.get('POSTGRES_USER'),
        'HOST': os.environ.get('POSTGRES_HOST'),
        'PORT': os.environ.get('POSTGRES_PORT')
    },
    'passenger_census': {
        'ENGINE': 'django.contrib.gis.db.backends.postgis',
        'PASSWORD': os.environ.get('POSTGRES_PASSWORD'),
        'OPTIONS': {
            'options': '-c search_path=django,passenger_census'
        },
        'NAME': os.environ.get('POSTGRES_NAME'),
        'USER': os.environ.get('POSTGRES_USER'),
        'HOST': os.environ.get('POSTGRES_HOST'),
        'PORT': os.environ.get('POSTGRES_PORT')
    },
    ...
    'trimet_stop_events': {

If Mike and Mike approve, and you go forward, let me know if you need any help as I helped set up a lot of the backend and travis patterns.

@znmeb
Contributor

znmeb commented Aug 3, 2018

Yeah - I like this. Just out of curiosity, what's the scale of database we're anticipating, and do we always need GIS? For a lot of the small databases we had for transportation, we could probably have gone with an embedded SQLite database in the API container. And there are GIS extensions (SpatiaLite) for that, although it's not as powerful as PostGIS.

@AraOshin AraOshin closed this as completed Aug 4, 2018
@AraOshin AraOshin reopened this Aug 4, 2018
@AraOshin
Author

AraOshin commented Aug 4, 2018

thanks @nam20485 for your comments and offer of assistance!

In terms of set up: one curiosity I have is if it would make more sense to start from the backend exemplar and build it up from there or from one of the 2018 project repos and remove the files I don't need related to that team project.

@znmeb At this point, I am not anticipating any additional database. The sandbox django backend will still hit the team or partner specific existing databases. Sandbox itself as a project does not have its own datasets. The exception here is some metadata related to how the datasets should appear in the sandbox frontend and which layers go together in certain packages etc. For now, this data is some simple JSON configuration that is returned by an AWS lambda (see hackoregon/civic-sandbox). At some point, we'll grow into wanting a database for that but that's not my focus yet at this point.

@znmeb
Contributor

znmeb commented Aug 4, 2018

At some point we were talking about merging some code from the individual projects back into the exemplar. I haven't looked at the exemplar since we froze the database code. Maybe this is a good time to do the merging, since we can test the exemplar on our own machines without touching the running apps.

@MikeTheCanuck
Contributor

Hi Ara, we should be able to squeeze another small Django container into the mix.

Could you clarify whether this idea is meant to migrate all the functionality that was integrated into the non-sandbox API containers (and remove that code once migrated), or if this will stand in addition to that functionality (and leave that existing code where it is)?

As Nathan correctly asserts, we will need to define multiple variables (two for each DB to be connected) and then configure those env vars both in Travis and in AWS Parameter Store to ensure all those connections work well. I'm going to propose that each database's name be added as a suffix to the variables we'll need, e.g.:

  • POSTGRES_NAME_DISASTER-RESILIENCE
  • POSTGRES_USER_DISASTER-RESILIENCE
  • POSTGRES_PASSWORD_DISASTER-RESILIENCE

(We might need to split other env vars too, if we end up deciding to split our database resources across different AWS services - nothing in the works, but keeping our options open depending on budget, skills and requirements.)

We'll definitely want to be more careful with this box and its multiple DB connections - we ran into trouble this season with some containers just sucking the life out of the DB, and finally got that quelled. Let's keep an eye out that we're using a similar config as the non-sandbox Django apps.

I'm concerned about adding a Django app to a repo that currently houses a Lambda - is that what you were suggesting, Ara? We don't want a spread of hundreds of repos, sure, but mixing two different application bases into one repo just feels messy and hard to keep straight for others who wander by, and repos are technically free - and meant to keep different code separate.

@AraOshin
Author

AraOshin commented Aug 5, 2018

Hi @MikeTheCanuck Thank you so much for your thoughts.

  • Ultimately, I would love to see all the code migrated and use updated endpoints for the live sandbox project (the urls will change since they won't be routed through the projects themselves). However, I was envisioning an in-between phase when I will have those endpoints recreated, but the front end might still use the original endpoints until there is enough capacity on the front end to make the adjustments. This may not be an issue but my hope was to build something new and better without breaking anything. :) Build the new, wait for confirmation that the front end is able to do the migration on their end, then tear down the old was my general idea.

  • I like the naming convention you propose.

  • Yes, I was suggesting merging the two into one repo. However, I now realize that will be completely unnecessary! The way I see it, this is where our heroic lambda function will be laid to rest. The reasoning behind the lambda in the first place was to have a super lightweight solution for dealing with the centralized sandbox data, which didn't belong in any team specific repo. Now that the sandbox will have its own django project, the central endpoint can be created from there. Thanks so much for your question.... writing out an answer to why I wanted to merge them helped me realize this fairly obvious solution was staring me in the face.

@nam20485
Member

nam20485 commented Aug 5, 2018

If it were me I think I would be inclined to start with one of the 2018 theme backend repos and then add all your other sandbox apps' functionality into that. I am not sure what state the backend-exemplar repo is in, if I recall correctly we stopped short of migrating several significant updates/modifications that we made to the theme backend repos into the backend-exemplar, so it may be behind the current state of our best practice patterns and configuration. I worry that if you start with it, you may run into issues that don't have anything to do with your intended initiative here.

@BrianHGrant, @MikeTheCanuck, @znmeb What is your opinion here?

@AraOshin
Author

AraOshin commented Aug 5, 2018

@nam20485 That was my thinking as well. I am currently working out of the neighborhoods 2018 repo and it would definitely be my preference to just cull the files in this repo and remove anything specific to the neighborhood team. It seems to me as though largely, this would be just a matter of looking at the folders and removing whole folders and keeping others. It doesn't seem like there is too much intermixing of team specific code and set up related code within folders. One obvious exception being the settings.py file.

I may wade into this today and see if I am able to go along (somewhat naively, I'll admit) and just keep deleting until something breaks! Haha!

@znmeb
Contributor

znmeb commented Aug 5, 2018

I don't have any opinions about most of this, since my focus was on PostGIS, which I'm assuming won't change at all unless some vulnerability shows up that actually affects our current operations. So whatever works best for the other cities who want to build on the platform is where I'd go.

@MikeTheCanuck
Contributor

The intent of the backend exemplar was to provide an easy path to initialize a new Django project that honoured our infrastructure, architecture patterns and chosen technologies.

That it currently is inadvisable to use it as a starting place (and preferable to use a live, fully-evolved project instead) is either a sad indictment of how far the exemplar has fallen behind, or an example of what difference there is between what it actually offers and what people believe it offers.

It might actually be closer to the current patterns than we suspect, though it's been a month or more since I checked.

It would be wonderful if we could use this exercise to figure out how to get it closer to reality - and closer to a place where new Django apps are better to start there than somewhere else.

But I also recognize that that'd be asking for extra labour that doesn't directly contribute to the current problem.

That said, I've got a day to myself, and would be happy to contribute to the refactoring of the exemplar where needed.

@AraOshin
Author

AraOshin commented Aug 5, 2018

@MikeTheCanuck I hear ya! I felt a little mischievous/silly suggesting not using it. I am willing to give it a try since it sounds like you have capacity to help with the refactoring. I tried to follow along somewhat but I feel pretty ignorant of how the setup came to be and what it entailed. I know it worked by the time I jumped into the teams' repos so my desire to skip going back there just stemmed from fear of wading into the unknown.

What first steps would you suggest for me?

Clone the backend exemplar, add some simple sandbox api code and then test pushing to production with you?

or would you start by looking at the backend exemplar repo to see if anything major jumps out as missing ?

@AraOshin
Author

AraOshin commented Aug 6, 2018

Okay, out of curiosity I started looking into the differences by replacing all the backend exemplar files with the neighborhoods repo files locally and then once I was looking at that, I got kind of sucked in! (fyi, I did look for a way to compare two separate repos on github and some old posts seemed to suggest this would be possible, but eventually I came to believe it's not a current feature. Someone please let me know if I could have done that!)

So I took a first pass at the differences and reverted any change that seemed to be clearly specific to neighborhoods and didn't have value for the backend exemplar. That leaves 20 files with changes, 4 of which are completely new files and the rest are modifications.

I am going to push this branch to the backend exemplar repo so that everyone else can see the differences. Is this a reasonable way to manage this situation? I recognize that if we pull the changes in this way, it will look like the code was contributed by me or whoever ultimately merges a branch in this way, while the original authors were @MikeTheCanuck @DingoEatingFuzz @hassanshamim and maybe others.

I tried not to make any decisions beyond reverting changes that seemed clear to me. Anything that seemed even slightly up to interpretation, I left changed so that others could view and decide if the original backend exemplar code should be returned to, use the newer code from the neighborhoods repo, or a merger of the two.

@znmeb
Contributor

znmeb commented Aug 6, 2018

See my comment on the pull request - can we somehow make this test-driven? Is that worth the effort?

@DingoEatingFuzz
Contributor

DingoEatingFuzz commented Aug 6, 2018

Hi, I'm late to the party, but I have thoughts.

Multiple databases

This may be the first project to need to connect to multiple databases, but I don't think it will be the last. It's not hard to imagine projects connecting to "common" databases. This year we flirted with the notion of common databases for geocoding and census demographics, but ended up embedding that data into project databases, which is also reasonable, but if it was easy to depend on pre-existing databases, I can't imagine going the route we went this year again.

I see multiple as being N, not 5. To this end, repeating env vars in Travis and in the .env file is just not scalable. It doesn't need to be solved as a part of this project, but in the 2019 project cycle, we should explore service discovery tools, since they are built for this exact problem. Consul has a nice web page describing this, but there are other tools too, such as etcd.

I see this project, and in turn @AraOshin, as scouting out this space. By the end of this project, it will be clear what the pain points are that we should solve for.

One thing I saw in the comments that also relates to this was skipping a DB_PORT variable per database because it will always be the same. This is true today, but it isn't future proof. We should assume all infrastructure is dynamic and subject to change. A service discovery tool would give us a framework for dealing with this.

Identity and security

Another comment I saw was that sandbox code being PR'd into project repos was a bit funny because it suggests that the maintainers of the project repos ought to code review this code they aren't necessarily responsible for. And while moving that code into a central repository will mean that the backend django projects are no longer shared spaces, the databases will still be shared.

I'd like to propose that the project repos (and maintainers thereof) are the database owners, and the sandbox (and maintainers thereof) are database visitors.

What this means from a collaboration perspective is that if the sandbox project would like a schema change or net-new data, the maintainers of the sandbox (Ara for now) would have to request that from the maintainers of that database. If sandbox maintainers are allowed to change these databases at will, there will be upstream consequences. This forces the project maintainers to be in the loop.

What this means from a code/security perspective is that the sandbox can not share db credentials with the project repos. Even if today both the sandbox and the project repo only get read-only access, I want access to reflect our organization. This also protects us from a future where a project repo has write access, but the sandbox ought to only have read.
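A minimal sketch of what separate, read-only sandbox credentials might look like in settings. The `SANDBOX_` variable names and the database key are assumptions for illustration, not an agreed convention; the `OPTIONS` entry follows the same pattern as the existing `search_path` settings.

```python
def sandbox_database(env, name):
    """Connection settings for the sandbox as a read-only visitor.

    `name` is a hypothetical database key such as "DISASTER_RESILIENCE".
    The SANDBOX_ variable names are assumptions, not an agreed convention.
    """
    return {
        "ENGINE": "django.contrib.gis.db.backends.postgis",
        "NAME": env.get(f"POSTGRES_NAME_{name}"),
        # Credentials that belong to the sandbox, not to the project repo.
        "USER": env.get(f"SANDBOX_POSTGRES_USER_{name}"),
        "PASSWORD": env.get(f"SANDBOX_POSTGRES_PASSWORD_{name}"),
        "HOST": env.get("POSTGRES_HOST"),
        "PORT": env.get("POSTGRES_PORT"),
        "OPTIONS": {
            # Session default only; a client can override it, so the real
            # enforcement belongs in the role's grants on the database side.
            "options": "-c default_transaction_read_only=on",
        },
    }
```

The connection option is only a guard rail. The actual guarantee would come from the database owners granting the sandbox role nothing beyond SELECT.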

The env var structure @MikeTheCanuck proposed looks good, except that - isn't an allowed character in env var names, so those will have to be underscores.
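As a sketch of the underscore version, assuming the `POSTGRES_<SETTING>_<DATABASE>` shape proposed above (the helper name is made up for illustration):

```python
import re

# Shell variable names: letters, digits and underscores, not starting with a digit.
VALID_ENV_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def env_var_name(setting, database):
    """Build a name such as POSTGRES_NAME_DISASTER_RESILIENCE."""
    # Collapse every run of other characters (hyphens, spaces) into one underscore.
    suffix = re.sub(r"[^A-Za-z0-9]+", "_", database).strip("_").upper()
    if not suffix:
        raise ValueError("database name has no usable characters")
    name = f"POSTGRES_{setting.upper()}_{suffix}"
    if not VALID_ENV_NAME.fullmatch(name):
        raise ValueError(f"not a valid env var name: {name!r}")
    return name


print(env_var_name("NAME", "DISASTER-RESILIENCE"))
```

Generating the names from one helper would keep Travis, the .env file and settings.py from drifting apart on spelling.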

Identity management is going to be a critical part of the Civic Platform, and I see it as another theme for the 2019 project cycle.

Migration strategy

I'm so glad you are already thinking about migration, Ara. It's necessary that the lambda remains up until we are ready to switch over to the django service. To throw a wrench in here, I don't want the sandbox URL changing. If the new sandbox backend will not be backwards compatible, then this means the backend and frontend will need coordinated deploys that look something like this:

  1. New django backend at sandbox-beta.civicplatform.org
  2. Frontend is changed to accommodate changes to the sandbox api and now points to sandbox-beta.civicplatform.org
  3. Lambda function and API Gateway is taken down.
  4. New django backend is accessible from both sandbox-beta.civicplatform.org as well as sandbox.civicplatform.org
  5. Frontend is changed to now point back at sandbox.civicplatform.org
  6. sandbox-beta.civicplatform.org is removed from the load balancer rules

If the changes are backwards compatible, then this is much simpler, but I do ask that the django service is proved to work before taking down the API gateway for the current sandbox. This will reduce potential downtime due to unforeseen problems in deployment.
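One way to get some evidence on the backwards-compatibility question is to compare a response from the current lambda with the matching response from the django service. This is only a sketch: the payloads are invented for illustration, and it checks key presence, not values or types.

```python
def missing_keys(old, new, path=""):
    """Return key paths present in the old payload but absent from the new one."""
    missing = []
    if isinstance(old, dict):
        for key, old_value in old.items():
            key_path = f"{path}.{key}" if path else str(key)
            if not isinstance(new, dict) or key not in new:
                missing.append(key_path)
            else:
                # Recurse so nested objects are compared as well.
                missing.extend(missing_keys(old_value, new[key], key_path))
    return missing


# Invented payloads, not real sandbox responses.
old_payload = {"packages": {"name": "x", "layers": [1]}, "version": 1}
new_payload = {"packages": {"name": "y"}, "extra": True}
print(missing_keys(old_payload, new_payload))
```

An empty result for every endpoint the front end uses would suggest the simpler migration path is available; anything else points back to the six-step plan.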

Development and testing

I see this as an elephant in the room. Each project has a docker compose setup for the API and the DB for development use. Testing has always been an elephant in the room, but I might as well bring it up here.

Have you thought about how to manage N dev databases or N test databases?

Summary

This will be a great project both for sandbox developer experience and platform research. I'm excited to see what you end up with Ara, and please take notes if you run into any issues. They will inform how we invest our time next project cycle.

Lastly, thank you everyone else on the thread for your continued interest in the sandbox and in the civic platform 🎉

@AraOshin
Author

AraOshin commented Aug 7, 2018

@DingoEatingFuzz Thanks so much Michael for your thoughtful comments. We are very much on the same page.

A few points I wanted to highlight:

I'd like to propose that the project repos (and maintainers thereof) are the database owners, and the sandbox (and maintainers thereof) are database visitors.

Yes! I agree and aside from a few exceptions where I jumped in to help create a view here and there, I operated from this standpoint during sandbox 1.0 development. As you have said, in practice this means that sandbox should only have read access on any database connection.

Re: migration strategy. Yes, the 6 point cycle you mentioned is what I was envisioning. I have been keeping track of any changes that affect the front-end and my plan was to ensure that all sandbox 1.0 infrastructure remains unchanged until the front end is able to respond to those changes, and then complete the migration process as you described. I have been considering options for retaining backwards compatibility and it's doable, but I think I would like to go with this plan for now. The changes will all be minor, but I think it's worthwhile during the downtime we have here with no crunch related to a deadline to make systemic improvements on what were very time constrained decisions without adding in what would be some redundancies for the sake of backwards compatibility. There will come a point, of course, when the benefits wouldn't outweigh the cost but I don't think we're there yet, personally.
