Log collection and aggregation #32

felipemontoya · 2023-04-20T21:12:40Z

During the latest meeting, we reviewed @gabor-boros's answer in #26. Most missing features had a ticket covering them, but log collection did not.

The situation is:

There are tools for collecting and aggregating logs inside a namespace where Tutor is already installed. Logstash and Vector are common alternatives.
For the umbrella portions of the cluster (the charts and pods that run in the global namespace), we don't have anything for log collection yet.

It remains an open question whether we want or need a specific tool for that, and whether participants in this repo are interested in building one.

On the plus side, we could have a tool that makes handling many instances simpler.
The con is that we would be splitting effort that could otherwise go into improving the log collection tools for an individual namespace.

I personally have not taken a side for any of the options, but we need a place where it can be discussed.

felipemontoya · 2024-05-28T17:22:23Z

@Ian2012 I know we are storing logs for some installations that want to start Aspects with some data from before Redwood. Could you please share, in this context, how we are doing that?

Ian2012 · 2024-05-28T17:57:33Z

In production, we are using Vector, deployed with a Helm chart, with a sink configuration that saves all the logs to an S3 bucket, splitting them by namespace/kind/application. Would that be a suitable solution for this problem?

Eventually, once Aspects is configured, we can trigger a job that reads the tracking logs from <namespace>/tracking/lms|lms-worker and performs the backfill.
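To make the storage layout concrete, here is a minimal Python sketch of the namespace/kind/application key scheme and of the prefixes a backfill job would read. The `LogRecord` type, its field names, the `"tracking"` kind value, the trailing slash, and the `s3_prefix`/`backfill_prefixes` helpers are all hypothetical; in Vector itself the layout would normally be expressed as a templated key prefix on the S3 sink rather than in Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    """Hypothetical shape of one collected log line and its Kubernetes metadata."""

    namespace: str    # Kubernetes namespace of the Tutor installation
    kind: str         # assumed values, e.g. "tracking" or "application"
    application: str  # e.g. "lms" or "lms-worker"
    message: str


def s3_prefix(record: LogRecord) -> str:
    """Return the <namespace>/<kind>/<application>/ prefix a record is stored under."""
    return f"{record.namespace}/{record.kind}/{record.application}/"


def backfill_prefixes(namespace: str, apps: tuple[str, ...] = ("lms", "lms-worker")) -> list[str]:
    """Return the prefixes a backfill job would read tracking logs from."""
    return [f"{namespace}/tracking/{app}/" for app in apps]


if __name__ == "__main__":
    rec = LogRecord("demo", "tracking", "lms", '{"event_type": "page_view"}')
    print(s3_prefix(rec))             # demo/tracking/lms/
    print(backfill_prefixes("demo"))  # ['demo/tracking/lms/', 'demo/tracking/lms-worker/']
```

Keeping the kind and application in the prefix means a backfill only has to list two prefixes per namespace instead of scanning every log the collector stored.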

Ian2012 · 2024-05-31T15:02:14Z

Another feasible solution I see is to store the tracking log data in ClickHouse using Vector, which would allow quicker backfills in Aspects and provide an out-of-the-box backup solution for tracking logs. This is nothing new: Cairn performs a similar operation by storing all tracking logs in ClickHouse via Vector.

gabor-boros · 2024-06-17T08:59:54Z

@bradenmacdonald and @Agrendalath, inviting you to this conversation. I think both solutions could be feasible, though you may have better insights here, especially @Agrendalath, as I know one of your clients is using tracking logs.

bradenmacdonald · 2024-06-17T17:16:55Z

@pomegranited might be a better person to ask :) I don't have much insight on this topic.

pomegranited · 2024-06-18T02:37:22Z

Hi @felipemontoya, thank you for starting the discussion! I think we need to define some scope and goals before making technology decisions.

Is this about general Open edX log collection/aggregation, like for monitoring instance health and investigating incidents? Or is it just about storing tracking logs?

How much of a solution should we provide? If we're providing log collection, do we need parsing, monitoring, dashboards, and alerting too?

What solutions are people currently using? What are their pain points?

There's a lot to consider. But we can take cues from @bmtcril's Aspects architecture and integrate with suitable open source third-party tools, rather than writing our own.

bmtcril · 2024-06-18T16:45:57Z

FWIW, Aspects can store tracking logs in ClickHouse via Vector now, though I'm not sure when we last tested it.

I definitely agree that having long term, flexible, rotated log storage for both operational and tracking logs (and potentially xAPI logs) is hugely important. I personally wouldn't mind seeing Vector used for that, but I'm sure site operators have much more valuable insight on any pain points with it.

MoisesGSalas · 2024-06-18T21:30:19Z

I've seen two common patterns for collecting logs in k8s: a sidecar container that runs alongside the application, and a DaemonSet that runs on every node and mounts /var/log/ from the host.

IIRC, Adam Blackwell mentioned that they were using the sidecar approach at 2U.

With @Ian2012, we have tested the DaemonSet approach in a couple of clusters. We installed a global Helm chart for Vector and configured the sources, transforms, and sinks.

We collect all the logs from selected pods (e.g., those with the label app.kubernetes.io/managed-by=tutor), and Cristhian wrote the transform that extracts the tracking logs. We push all the logs to S3.
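The filtering and extraction steps can be sketched in Python as follows. This is only an illustration of the logic, not the actual Vector configuration: in Vector the pod selection would typically be a label selector on the `kubernetes_logs` source and the extraction a remap transform. The `[tracking] ` marker, the `(labels, line)` input shape, and the helper names are hypothetical; the real pattern depends on how the tracking log handler formats its output.

```python
import json
from typing import Iterable, Iterator

# Selector from the comment above: only pods managed by Tutor.
SELECTOR_KEY, SELECTOR_VALUE = "app.kubernetes.io/managed-by", "tutor"

# Hypothetical marker preceding the JSON payload of a tracking log line.
TRACKING_MARKER = "[tracking] "


def matches_selector(labels: dict[str, str]) -> bool:
    """True if the pod carries the managed-by=tutor label."""
    return labels.get(SELECTOR_KEY) == SELECTOR_VALUE


def extract_tracking_events(lines: Iterable[tuple[dict[str, str], str]]) -> Iterator[dict]:
    """Yield parsed tracking events from (pod_labels, log_line) pairs."""
    for labels, line in lines:
        if not matches_selector(labels):
            continue  # pod is not managed by Tutor
        idx = line.find(TRACKING_MARKER)
        if idx == -1:
            continue  # ordinary application log line, not a tracking event
        payload = line[idx + len(TRACKING_MARKER):]
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # malformed payload; a real pipeline might route it elsewhere
        if isinstance(event, dict):
            yield event
```

Non-tracking lines are skipped here only because this sketch focuses on extraction; the comment above notes that all logs still go to S3.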

We also found that this Vector instance can serve multiple purposes: besides extracting the tracking logs and pushing them to S3, it can push the standard application logs of the Open edX services to CloudWatch, or even push the logs of other services (ingress-nginx, etc.).
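The "one collector, several destinations" idea can be sketched as a small routing function. The `Event` type, the service names in `OPENEDX_APPS`, and the sink names are hypothetical, and the destination for other services is left open because the comment does not fix it. In Vector this kind of fan-out is what the `route` transform is for.

```python
from dataclasses import dataclass

# Assumed Open edX service names; adjust to the services actually deployed.
OPENEDX_APPS = frozenset({"lms", "lms-worker", "cms", "cms-worker"})


@dataclass(frozen=True)
class Event:
    """Hypothetical normalized event produced by the collector."""

    namespace: str
    app: str           # e.g. "lms" or "ingress-nginx"
    is_tracking: bool  # set by the tracking-log extraction step
    message: str


def route(event: Event) -> str:
    """Return the name of the (hypothetical) sink an event should be sent to."""
    if event.is_tracking:
        return "s3_tracking"         # tracking logs kept long term in S3
    if event.app in OPENEDX_APPS:
        return "cloudwatch_openedx"  # standard Open edX application logs
    return "other_services"          # e.g. ingress-nginx; destination left open


if __name__ == "__main__":
    print(route(Event("demo", "lms", True, "{}")))                 # s3_tracking
    print(route(Event("demo", "cms", False, "GET /")))             # cloudwatch_openedx
    print(route(Event("ingress", "ingress-nginx", False, "...")))  # other_services
```

Checking `is_tracking` first means a tracking event emitted by the LMS goes to S3 rather than being treated as an ordinary application log.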

I think with a similar approach we can eventually cover most of this:

Is this about general Open edX log collection/aggregation, like for monitoring instance health and investigating incidents? Or is it just about storing tracking logs?

How much of a solution should we provide? If we're providing log collection, do we need parsing, monitoring, dashboards, and alerting too?

MoisesGSalas self-assigned this Jul 23, 2024
