# Introduction to the Workshop on Observability


Welcome to this workshop on observability! 

In this workshop, we will explore the concept of observability and its importance in modern development and operations. Observability is the practice that allows us to gain insights into the behavior and performance of our systems, enabling us to make informed decisions and improve our applications.

The workshop is split into sections, each in its own notebook.
* You are currently in the introduction notebook, which covers the following topics:
  * Why do we need observability?
  * What is observability?
  * How can we implement observability in data science and engineering projects and how can we derive insights and value from it?
  * What are the key components of a modern observability and monitoring stack?
* Azure resources in the monitoring stack: what functionality do they provide, and how do you use them?
  * Azure Monitor, Application Insights, and Log Analytics workspaces, and how they are connected
  * Quick intro to KQL (Kusto Query Language), the query language used in Azure Monitor and Log Analytics
* How to configure Azure services to submit logs to a Log Analytics workspace via Bicep and via the Azure portal
  * Why is this important?
  * Quick introduction to Bicep
  * Key Vault, App Service, Azure OpenAI
* How to add observability to your code via OpenTelemetry
  * Introduction to OpenTelemetry
  * How to instrument your code with OpenTelemetry
  * OpenTelemetry config files
  * Instrumenting a simple program that interacts with a database (sqlite3)
* Instrumenting a more complex codebase (an LLM web app, or bring your own code)

* How to build a dashboard in Azure Monitor
  * What is a dashboard?
  * How to create a dashboard in Azure Monitor
  * How to add widgets to the dashboard
  * How to share the dashboard with others




### Why ?

We should care about and implement good observability practices because they give us real-time and historical insight into how our code is performing and behaving. The obvious reasons are finding, diagnosing, and reducing bugs, performance issues, and other unintended behavior. Other reasons include detecting an attacker who is trying to exploit our code, meeting compliance requirements, and many more.

* **Troubleshooting**: One of the core reasons is improved troubleshooting, both proactive and reactive.
  - **Proactive**: Monitor exceptions, errors, and performance issues to identify potential problems before they affect users.
  - **Reactive**: Quickly diagnose and resolve issues when they occur, minimizing downtime and impact on users. For example, we could get a traceId from a user and use it to find the root cause of the issue they are experiencing.
* **Performance**: By understanding how our code is performing, we can identify areas for improvement and optimize our systems. This could mean identifying a slow query or a slow function; once it is visible, we can act on it and resolve it. It could also mean noticing that an external service is slow or that we are being throttled by a third-party API. If we can see the throttling, we can mitigate it, for example by caching the data, using a different API, or implementing a fallback strategy.
* **Security**: By monitoring our systems for unusual behavior, we can identify potential security threats and act on them. This could be a user trying to brute-force their way into our system, or a user trying to access data they should not be able to access. If we can see that this is happening, we can mitigate it or reduce its impact.
* **Compliance**: By monitoring our systems for compliance with regulations and standards, we can ensure that we are meeting our obligations and avoid potential penalties. This could mean monitoring for data breaches, or ensuring that we comply with GDPR or HIPAA. If we can see that we are not compliant, we can take corrective action.

There are other use cases as well, whether business, development, operations, or security related. The core idea is that we need to be able to see what is going on in our systems and be able to act on it.

### What ?

Observability is the ability to measure and understand the internal state and workings of an application and its dependencies. This allows us to monitor, analyze, and troubleshoot effectively.

In practice, this means seeing how a request propagates through the system, how long it takes, and what resources it uses.

Let's take an LLM web app as an example.

```mermaid
graph
    subgraph "LLM webapp"
        Browser(User Browser)

        LoadBalancer(Load Balancer)

        auth(Azure Entra ID)

        Browser-- redirect -->auth

        Browser-- http request -->LoadBalancer

        webserver(webserver)

        LoadBalancer-- forward to server 1 --> webserver

        keyvault(Key Vault)

        webserver-- Fetch Secrets --> keyvault


        AzureOAI(Azure OpenAI)

        webserver-- Call model --> AzureOAI

        CosmosDB(Azure CosmosDB)

        webserver-- Fetch and store chat history --> CosmosDB

    end

    subgraph AzureMonitorStack["Azure Monitor Stack"]
        Applicationinsight{{Application Insights}}
        logAnalyticsWorkspace{{Log Analytics Workspace}}
        Applicationinsight --> logAnalyticsWorkspace
    end

    AzureOAI -.-> logAnalyticsWorkspace
    keyvault -.-> logAnalyticsWorkspace
    webserver -.-> logAnalyticsWorkspace
    Browser --> Applicationinsight
    webserver --> Applicationinsight
    LoadBalancer -.-> logAnalyticsWorkspace
    CosmosDB -.-> logAnalyticsWorkspace
```

In this example, we have a web application that interacts with an Azure OpenAI model, stores chat history in Azure CosmosDB, and fetches secrets from Azure Key Vault. It could also have a load balancer to distribute requests if we have multiple backend servers. 

The web application is instrumented with OpenTelemetry, which exports the telemetry data to Application Insights, which in turn stores it in a Log Analytics workspace. The Azure services send their logs directly to the workspace.

Let's take a look at the flow of information in this system when a user sends a request.

**Application logs:**

1. When the user first loads the web application, the frontend initializes in the browser and sends a log record to Application Insights containing browser information, IP addresses, how long the page took to load, and other information.
2. Once initialization is done, the application prompts the user to log in. This produces another log record containing the user information, or any errors related to the login.
3. Next comes the first request to the backend server. The backend sends a log record to Application Insights containing the request information, which user made the request, and which server handled it.
4. The backend server then handles the request for the client. While doing so it might fetch secrets from Key Vault, call the LLM, or interact with CosmosDB.
   * Azure OpenAI: The records related to the LLM might contain how many tokens were used, how long the model took to generate the output, which model was used, and the prompt that was sent to the model.
   * CosmosDB: The records related to the database might contain how long it took to fetch the data, which query was used, and how many records were returned.
   * Key Vault: The records related to Key Vault might contain how long it took to fetch the secrets, which secrets were fetched, and which identity accessed the vault.
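As a rough illustration of step 4, the sketch below emits one structured application log record using only Python's standard `logging` module. It is not the OpenTelemetry or Application Insights API; the `JsonFormatter` helper and every field name (`user_id`, `server`, `model`, `total_tokens`, `duration_ms`) are hypothetical and chosen only to mirror the kinds of information listed above.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object (hypothetical helper)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Structured fields passed via extra={"fields": {...}} are merged in.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("llm_webapp_sketch")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
logger = build_logger(stream)

# Hypothetical record for a model call made by the backend (all values invented).
logger.info(
    "model call completed",
    extra={"fields": {
        "user_id": "user-123",
        "server": "webserver-1",
        "model": "example-model",
        "total_tokens": 512,
        "duration_ms": 840,
    }},
)

record = json.loads(stream.getvalue())
print(record["message"], record["total_tokens"], record["duration_ms"])
```

Keeping the fields as structured data rather than formatting them into the message string is what makes them easy to filter and aggregate once they reach a central store.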


**Service logs:**
We might also collect service logs from all the managed services. Here these would be Key Vault, Azure OpenAI, and Azure CosmosDB.

* Azure OpenAI: The logs related to the LLM contain how many tokens were used, how long the model took to generate the output, which model was used, and the prompt that was sent to the model.
* Azure CosmosDB: The logs related to the database contain how long it took to fetch the data, which query was used, and how many records were returned.
* Azure Key Vault: The logs related to Key Vault contain which operation was performed, which secrets were affected, and which identity accessed the vault.

It should be noted that there will most likely be some overlap and redundancy in the logs, but they are all useful in different ways.

Consider Key Vault from a security point of view. We might want to know who accessed the vault, when, and what was accessed. If it was accessed by a user or a service other than our app service, the application logs won't give us this information, but the audit logs from Key Vault will.
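The Key Vault scenario can be sketched as a simple filter over audit entries. This is a minimal illustration in plain Python, not the real Key Vault log schema: the `AuditEntry` class, its field names, the identity names, and the sample rows are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEntry:
    """Simplified, hypothetical stand-in for one Key Vault audit log row."""
    timestamp: str
    identity: str
    operation: str
    secret_name: str


# Identities we expect to read secrets (an assumption made for this sketch).
EXPECTED_IDENTITIES = {"llm-webapp-managed-identity"}


def unexpected_access(entries: list[AuditEntry]) -> list[AuditEntry]:
    """Return the entries made by identities outside the expected set."""
    return [e for e in entries if e.identity not in EXPECTED_IDENTITIES]


# Invented sample data: two accesses by the app, one by a person.
audit_log = [
    AuditEntry("2024-01-01T10:00:00Z", "llm-webapp-managed-identity", "SecretGet", "openai-key"),
    AuditEntry("2024-01-01T10:05:00Z", "alice@example.com", "SecretGet", "openai-key"),
    AuditEntry("2024-01-01T10:06:00Z", "llm-webapp-managed-identity", "SecretGet", "cosmos-key"),
]

for entry in unexpected_access(audit_log):
    print(f"{entry.timestamp} {entry.identity} {entry.operation} {entry.secret_name}")
```

The application logs would only ever contain the two accesses made by the app itself, which is why the service-side audit log is needed to spot the third one.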


**Correlation of logs:**
By itself this is already a lot of information, but if we can't correlate the logs from the different services and components, we just have a pile of data that gives little insight or value. This is where TraceIds and OperationIds come into play. They allow us to correlate the logs from different services and components and see how they relate to each other. This is crucial for understanding how different parts of a system interact and for diagnosing issues that arise in complex systems.


  * TraceId: Used in distributed systems to trace the entire journey of a request as it flows through multiple services and components. For instance, if a request starts at a web server, goes through a backend, and ends at a database, the same traceId is used across all of these components. This allows end-to-end tracking of the request, which is invaluable for debugging and performance monitoring.
  * OperationId: Typically used to identify a specific operation or task within a larger process or system. For example, in APIs or logging systems, an operationId helps to track the execution of a single operation, such as a specific API call or database query. It is usually unique to that operation and helps developers debug or analyze that specific task. In OpenTelemetry this role is played by the span ID. Note that naming varies between tools: Application Insights stores the trace ID in a column called `operation_Id`.

This means that when we have a system consisting of multiple components, we can use the traceId to see the entire flow of a request and the operationId to see the details of each individual operation within that flow.
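The correlation idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not how any particular SDK does it: the record layout and the helper names (`new_trace_id`, `new_operation_id`, `make_record`) are hypothetical. The ID sizes follow the W3C Trace Context convention of a 16-byte trace ID and an 8-byte ID per operation.

```python
import secrets
from collections import defaultdict


def new_trace_id() -> str:
    """Random 16-byte identifier rendered as 32 hex characters."""
    return secrets.token_hex(16)


def new_operation_id() -> str:
    """Random 8-byte identifier for one individual operation."""
    return secrets.token_hex(8)


def make_record(trace_id: str, component: str, message: str) -> dict:
    """Build a record that shares the request's trace ID but has its own operation ID."""
    return {
        "trace_id": trace_id,
        "operation_id": new_operation_id(),
        "component": component,
        "message": message,
    }


# Two user requests, interleaved the way they would be in a shared log store.
trace_a, trace_b = new_trace_id(), new_trace_id()
records = [
    make_record(trace_a, "webserver", "request received"),
    make_record(trace_b, "webserver", "request received"),
    make_record(trace_a, "keyvault", "secret fetched"),
    make_record(trace_b, "cosmosdb", "chat history loaded"),
    make_record(trace_a, "azure-openai", "model called"),
]

# Correlate: group every record by trace ID to rebuild each request's path.
by_trace = defaultdict(list)
for rec in records:
    by_trace[rec["trace_id"]].append(rec["component"])

print(by_trace[trace_a])
print(by_trace[trace_b])
```

Without the shared trace ID, the five records above would be indistinguishable lines from three different services. With it, each request's path can be rebuilt with a single group-by.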




### How ? 

To establish a good observability practice, we need to establish a good data collection and analysis pipeline. This can be done using a variety of tools and techniques, such as logging, metrics, and tracing. The key is to collect data that is relevant to our systems and applications and that helps us understand their behavior.

For the samples here we will use the Azure Monitor stack with the OpenTelemetry Collector, but the concepts are applicable to any observability stack.

* **Collect**: The first step in observability is to collect data from our systems in a centralized location. The goal is both to collect information that is relevant to our systems and applications, so that we can diagnose issues, and to make it accessible, so that we can analyze and act on it. Logs are of no value if they are spread across hundreds of servers where we have to SSH around and dig through hundreds of log files to figure out what is wrong.
* **Visibility**: Once we have collected our logs, we need to make them visible and accessible so that we can gain insight from them. Dashboards and alerts are a good starting point: what is the rate of unrecoverable exceptions, what are our response times, and is our database slowing down?
* **Act**: Once these metrics are easy to consume, we can start to act on them. This could mean setting up alerts for key failure modes we know about, so that we can resolve them before they affect users. It could also mean finding issues and bugs in our code. Do we have slow queries that slow down the entire application? Do we have underlying bugs that are hidden by a catch-all exception handler? Is someone trying to exploit our system? There are many ways to act on the data we have collected, and the key is to find the right balance between collecting too much data and too little. We don't want to be flooded with alerts, but we also don't want to miss critical issues.
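The "visibility" and "act" steps above can be sketched as follows, assuming the request records have already been collected in one place. This is plain Python rather than an Azure Monitor alert rule; the `RequestRecord` class, the sample data, and the 5% threshold are hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    """Hypothetical, simplified request record as it might sit in the central store."""
    path: str
    status_code: int
    duration_ms: float


def error_rate(records: list[RequestRecord]) -> float:
    """Fraction of requests that ended in a server error (HTTP 5xx)."""
    if not records:
        return 0.0
    failures = sum(1 for r in records if r.status_code >= 500)
    return failures / len(records)


def should_alert(records: list[RequestRecord], threshold: float = 0.05) -> bool:
    """Alert only when the error rate exceeds the threshold, to limit alert noise."""
    return error_rate(records) > threshold


# Invented sample window of four requests.
window = [
    RequestRecord("/chat", 200, 120.0),
    RequestRecord("/chat", 500, 30.0),
    RequestRecord("/history", 200, 45.0),
    RequestRecord("/chat", 200, 2300.0),
]

print(f"error rate: {error_rate(window):.2f}")
print(f"alert: {should_alert(window)}")

# Visibility also covers performance: surface the slowest request in the window.
slowest = max(window, key=lambda r: r.duration_ms)
print(f"slowest request: {slowest.path} ({slowest.duration_ms} ms)")
```

The threshold is where the balance mentioned above shows up: set it too low and the team is flooded with alerts, set it too high and real incidents are missed.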

#### Key components of the observability stack


**Collectors**

Collectors run alongside your application and collect data from it. This is done in a few different ways:
* Collecting data from core logging and metrics frameworks, then parsing, enriching, and exporting it to a backend, either a local agent or a remote one.
* Collecting trace information and enriching it with data from the application: function calls, internal and external API calls, database connections, and similar events.
* Collecting metrics from the application and host, such as CPU usage, memory usage, IO, and other metric data.

For this example we will use the OpenTelemetry Collector. It is part of OpenTelemetry, a vendor-neutral open-source observability framework that provides a set of APIs, libraries, agents, and instrumentation for collecting telemetry data from applications and systems. It supports metrics, logs, and traces, and can export this data to various backends for analysis and visualization.
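To make the parse, enrich, and export steps concrete, here is a toy pipeline in plain Python. It is a conceptual sketch only and does not use or mirror the OpenTelemetry Collector's actual configuration or API. The function names, the `InMemoryExporter` class, and the `LEVEL message` line format are assumptions made for the example.

```python
import json
import socket
from typing import Iterable


def parse(lines: Iterable[str]) -> list[dict]:
    """Parse raw 'LEVEL message' log lines into structured records."""
    records = []
    for line in lines:
        level, _, message = line.partition(" ")
        records.append({"level": level, "message": message})
    return records


def enrich(records: list[dict], service: str) -> list[dict]:
    """Attach context that the application itself did not log."""
    host = socket.gethostname()
    return [{**r, "service": service, "host": host} for r in records]


class InMemoryExporter:
    """Stand-in for a backend; a real exporter would send data over the network."""

    def __init__(self) -> None:
        self.received: list[str] = []

    def __call__(self, records: list[dict]) -> None:
        self.received.extend(json.dumps(r) for r in records)


def run_pipeline(lines: Iterable[str], service: str, exporter) -> None:
    """Chain the three stages: parse, then enrich, then export."""
    exporter(enrich(parse(lines), service))


backend = InMemoryExporter()
run_pipeline(["INFO request received", "ERROR model call failed"], "llm-webapp", backend)
for line in backend.received:
    print(line)
```

Because the exporter is just a callable, swapping the backend does not touch the application or the earlier stages. That separation is the property that makes a vendor-neutral collector useful.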

**Agents**

Agents are lightweight processes that run on the host and collect data from the host and the application. They can serve as a local collector or forward data to a remote one. They can also collect data from other sources, such as databases, system services, and other applications. In some cases they can even act as a local job executor that runs health checks against services or endpoints and performs similar tasks.

**Database**

The database is where the data is stored. For this workshop we will use the Azure Monitor stack, and therefore a Log Analytics workspace. This is also where the OpenTelemetry Collector is so powerful: we can develop using the tools we are familiar with, and the customer can then export the data to the system of their choice. Most companies have a standardized stack, because SOC, operations, and other teams need access to the data in real time.

Other options available in Azure include Prometheus and Elasticsearch.

**Dashboards/Visualization tools**

Dashboards are used to visualize the data and provide insights into the behavior of our systems. They can display metrics, logs, and traces, and can be customized to meet our needs. They often also allow us to set up alerts and notifications.

For this workshop we will use Grafana for more in-depth dashboards and Azure Monitor for the basic dashboards. Grafana is a powerful and flexible open-source analytics and monitoring platform that allows users to visualize and analyze data from various sources. Its features include customizable dashboards, alerting, and data exploration.

Grafana supports many data sources, including Prometheus, InfluxDB, Elasticsearch, and others, and offers visualization options such as graphs, tables, and maps.






### What are some of the terms we use in observability?

* **Traces** represent the lifecycle of a request as it travels through different components of a distributed system. They provide insights into how requests are handled, highlighting bottlenecks or failures at specific points in the system.

* **Logs** are time-stamped records of discrete events within a system. They offer detailed insights into application behavior, errors, and system processes, making them valuable for debugging and forensic analysis.

* **Metrics** are numerical values that reflect the performance or health of a system, such as CPU usage, memory consumption, request latency, or throughput. They help monitor trends and alert teams when thresholds are exceeded.

* **Exceptions** are error conditions that occur during runtime, disrupting normal operation. Observing and categorizing exceptions is crucial for identifying recurring issues and improving system reliability.

* **Events** are occurrences or changes in state within a system or application. They help track significant moments, such as deployments, user actions, or configuration changes, aiding in understanding how a system evolves over time.
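Three of these terms (logs, metrics, and exceptions) can be illustrated together in a small standard-library sketch. The `observed` context manager and the in-memory `metrics` and `exception_counts` stores are hypothetical helpers, not part of any observability SDK. Traces and events are left out to keep the example short.

```python
import logging
import time
from collections import defaultdict
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("signals_sketch")

# Metrics: numeric samples keyed by name (a hypothetical in-memory store).
metrics: dict = defaultdict(list)
# Exceptions: counts per exception type, so recurring issues stand out.
exception_counts: dict = defaultdict(int)


@contextmanager
def observed(operation: str):
    """Record a duration metric, a log line, and any exception for one operation."""
    start = time.perf_counter()
    try:
        yield
        log.info("%s succeeded", operation)
    except Exception as exc:
        exception_counts[type(exc).__name__] += 1
        log.exception("%s failed", operation)
        raise  # never swallow the error; observing must not change behavior
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        metrics[f"{operation}.duration_ms"].append(elapsed_ms)


with observed("load_history"):
    time.sleep(0.01)

try:
    with observed("call_model"):
        raise TimeoutError("model did not respond")
except TimeoutError:
    pass

print(dict(exception_counts))
print(sorted(metrics))
```

The duration is recorded in the `finally` branch so that failed operations also produce a metric sample. Otherwise slow failures would be invisible in latency charts.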

#### Next steps
Go to the next notebook to learn about the Azure Monitor stack and how we can use it to implement observability in our systems.