Identify next steps until we can run this #26

Open
bradenmacdonald opened this issue Mar 7, 2023 · 9 comments

@bradenmacdonald
Contributor

bradenmacdonald commented Mar 7, 2023

Identify what remaining features/support are required before we can start using this for hosting instance(s) in production at one of the participating providers.

  • What is our definition of "production ready"?
  • e.g. Do we need autoscaling support before we can use this "in production"? Do we need CI running Helm chart tests? Do we need better monitoring?
  • How does this integrate with existing tooling the provider has (e.g. Grove)?
@bradenmacdonald
Contributor Author

@gabor-boros please investigate this for us, or assign to someone from our team.

@gabor-boros
Contributor

@bradenmacdonald I was lagging behind with OpenCraft-related commitments that I had to sort out first. Unfortunately, I had no time to get back to this.

Also, today it turned out that I'd have to be a core contributor to be able to review tasks here. I guess that applies to working on issues too, though I have/had concerns about that from a Serenity point of view at the moment. Ref: #25 (comment)

Taking these into account, I may not be the best person to work on this at the moment. Or should I start work on this without being a CC, @antoviaque ?

@felipemontoya
Member

I don't think you need to be a core contributor to review. Only to be assigned as a reviewer. All and any feedback you can give at any time will be appreciated.

@MoisesGSalas
Contributor

At the moment we have merged functionality for a shared Elasticsearch, but we still need openedx/edx-search#130 to be merged if we want to avoid forcing a new image build with a non-standard dependency.

We are going to keep discussing this topic at eduNEXT internally.

@antoviaque

@gabor-boros Yup, as noted in the other related ticket, you don't need to be a core contributor to do the review itself, only to be able to give a final 👍 and merge.

@MoisesGSalas
Contributor

MoisesGSalas commented Mar 27, 2023

I did a very small review and came to the following conclusion:

If we consider the simplest, most barebones production installation that we would feel comfortable running, the project would need to offer at least these basic features:

  • A central load balancer with automatic SSL certificates.
  • Autoscaling of both pods and nodes.
  • Some level of monitoring tools.

Of these features, the first one is already implemented, and the other two are being tracked in #2 and #3. In addition, we also implemented a shared Elasticsearch for the whole cluster, but it depends on changes to external packages/services that have not been merged/released (edx-search#130 and cs_comments_service#404).

We also have some housekeeping to do, namely renaming every mention of tutor-multi to harmony, hosting the chart as a chart repository, and releasing a new version.

I think that would be the bare minimum for one of the providers to actually start using the tool. Nevertheless, we could also improve our current documentation to ease adoption of the project by people who have not been part of it since the beginning.

In summary:

@gabor-boros
Contributor

As discussed, I did the review as well. I completely agree with what @MoisesGSalas said above. Without proper monitoring and auto-scaling, I wouldn't consider anything "production ready". Besides that, I would add the following:

  • Setting node affinity for resources defined by Tutor (lms, cms, workers, redis, etc.)
  • Some way to size volumes attached to instances, Elasticsearch, Redis, etc.
  • Some way to run periodic/cron jobs

Refining "Some level of monitoring tools", I would say that for a production-grade cluster we would need the following (maybe not every bit is needed for every provider, but each could be turned on/off as needed):

  • Alertmanager
  • Prometheus
  • Grafana
  • Log collection and aggregation (e.g. OpenSearch + Filebeat)

Also, if we want to make sure the chart is production ready, we should have some sort of automation for:

  • Chart testing
  • Chart packaging
  • Chart releasing
  • README formatting and markdown validation

Lastly, and this is a gray area in terms of responsibilities/project boundaries, there is the ability to support backward-compatible edX installations. At the moment (IIRC) Tutor does not support past edX installations. The latest versions of Tutor only support the latest versions of edX instances, and no backporting is supported (though I may remember wrongly). From a production point of view, this could quickly become a difficulty, especially if a provider has to deal with instances that cannot be immediately updated to new versions.

@felipemontoya
Member

@bradenmacdonald @jfavellar90 @gabor-boros @MoisesGSalas @cmltaWt0 tagging you here: please take an async look so we can figure out what the next step for this project should be.

@bradenmacdonald
Contributor Author

In terms of us deploying this in production, we have scheduled a discovery task for our team (SE-5971) to work out the plan, but we are not yet sure when we can fit it into a sprint.
