# OpenObserve - Data Connector

## Description
The data provider module of msticpy lets you define data sources, connectors to them, and queries against them, and returns query results from the defined data sources.

For more information on Data Providers, see the documentation:
- Data Provider: https://msticpy.readthedocs.io/en/latest/data_acquisition/DataProviders.html

In this notebook we demonstrate the OpenObserve data connector feature of msticpy.
This feature is built on top of the [OpenObserve REST API](https://openobserve.ai/docs/api/) and the [unofficial python-openobserve module](https://github.com/JustinGuese/python-openobserve), with some customizations and enhancements (https://github.com/juju4/python-openobserve/tree/devel-all).

### Installation

In [1]:
# Only run first time to install/upgrade msticpy to latest version
# %pip install --upgrade msticpy

### Authentication

Authentication for the OpenObserve data provider is handled by specifying credentials (user and password) either directly in the `connect` call or in the msticpy config file.

For more information on how to create credentials, see the OpenObserve documentation on [Users](https://openobserve.ai/docs/user-guide/users/).

Once you have created a user account, you will need the following details to connect:
- connection_str = "https://localhost:5080" (bare server install) or "https://cloud.openobserve.ai?" (cloud)
- user = "xxx" (as created)
- password = "xxx" (same)

Once you have these details, you can specify them in `msticpyconfig.yaml` as shown in the example below.

```
DataProviders:
  Openobserve:
    Args:
      connection_str: "{Openobserve url endpoint}"
      user: "{user with search permissions to connect}"
      password: "{password of the user specified}"
```
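A missing or empty key in this section is easy to overlook. The sketch below checks a settings dictionary shaped like the YAML above for the three `Args` keys. The helper name `missing_openobserve_args` and the sample values are hypothetical; it is not part of msticpy and only illustrates the expected structure.

```python
REQUIRED_ARGS = ("connection_str", "user", "password")


def missing_openobserve_args(settings: dict) -> list:
    """Return the required Args keys that are absent or empty (hypothetical helper)."""
    providers = settings.get("DataProviders") or {}
    section = providers.get("Openobserve") or {}
    args = section.get("Args") or {}
    return [key for key in REQUIRED_ARGS if not args.get(key)]


# Dict equivalent of the YAML example, with placeholder values.
example_settings = {
    "DataProviders": {
        "Openobserve": {
            "Args": {
                "connection_str": "https://localhost:5080",
                "user": "user@example.com",
                # "password" is deliberately left out to show the check
            }
        }
    }
}

print(missing_openobserve_args(example_settings))
# prints: ['password']
```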

In [1]:
# Check we are running Python 3.6 or later
import sys

MIN_REQ_PYTHON = (3, 6)
if sys.version_info < MIN_REQ_PYTHON:
    print("Check the Kernel->Change Kernel menu and ensure that Python 3.6")
    print("or later is selected as the active kernel.")
    sys.exit("Python %s.%s or later is required.\n" % MIN_REQ_PYTHON)




In [1]:
# imports
import pandas as pd
from datetime import datetime, timedelta

# data library imports
from msticpy.data.data_providers import QueryProvider

print("Imports Complete")

Imports Complete


In [1]:
import os
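# Using a custom Certificate Authority? Point requests at its bundle.
# expanduser("~") also works where the HOME variable is not set.
os.environ['REQUESTS_CA_BUNDLE'] = os.path.expanduser('~/path/to/ca-bundle-internal.pem')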
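
If the bundle path is wrong, the failure usually surfaces later as a TLS error during `connect`. A minimal sketch of an early check follows. The helper name `set_ca_bundle` is hypothetical; `REQUESTS_CA_BUNDLE` is the same environment variable used in the cell above.

```python
import os
from pathlib import Path


def set_ca_bundle(bundle_path: str) -> str:
    """Set REQUESTS_CA_BUNDLE, failing early if the file does not exist."""
    path = Path(bundle_path).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"CA bundle not found: {path}")
    os.environ["REQUESTS_CA_BUNDLE"] = str(path)
    return str(path)


# Example call (placeholder path, adjust to your environment):
# set_ca_bundle("~/path/to/ca-bundle-internal.pem")
```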

In [1]:
# use custom config file?
# os.environ['MSTICPYCONFIG'] = '/path/to/msticpyconfig.yaml'

In [None]:
# FIXME! does not get MSTICPYCONFIG...
from msticpy.config import MpConfigEdit
# mpconfig = MpConfigFile()
# mpconfig.load_default()
# mpconfig.view_settings()
mpedit = MpConfigEdit()
mpedit


## Instantiating a query provider

You can instantiate a data provider for OpenObserve by specifying the credentials in the `connect` call or in the msticpy config file.

If the details are correct and authentication succeeds, the provider reports that it is connected.

In [1]:
openobserve_prov = QueryProvider("OpenObserve")
openobserve_prov.connect(connection_str="<url>", user="<user>", password="<password>")
# openobserve_prov.connect()


connected with user user@example.com
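
To keep credentials out of the notebook itself, the `connect` arguments can be read from the environment. This is a sketch only: the variable names `OPENOBSERVE_URL`, `OPENOBSERVE_USER`, and `OPENOBSERVE_PASSWORD` and the helper `connect_args_from_env` are hypothetical and are not defined by msticpy or OpenObserve.

```python
import os
from typing import Mapping

# Hypothetical environment variable names, one per connect() argument.
ENV_MAPPING = {
    "connection_str": "OPENOBSERVE_URL",
    "user": "OPENOBSERVE_USER",
    "password": "OPENOBSERVE_PASSWORD",
}


def connect_args_from_env(environ: Mapping[str, str] = os.environ) -> dict:
    """Build the keyword arguments for connect() from environment variables."""
    missing = [var for var in ENV_MAPPING.values() if not environ.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {arg: environ[var] for arg, var in ENV_MAPPING.items()}


# Usage with the provider created above:
# openobserve_prov.connect(**connect_args_from_env())
```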


## Running an Ad-hoc OpenObserve query
You can define your own OpenObserve query and run it through the provider with `QUERY_PROVIDER.exec_query(<query>)`.

For more information, see the documentation on [Running an Ad-hoc Query](https://msticpy.readthedocs.io/en/latest/data_acquisition/DataProviders.html#running-an-ad-hoc-query).

In [1]:
openobserve_query = """
SELECT log_file_name,count(*) FROM "default" GROUP BY log_file_name
"""
df = openobserve_prov.exec_query(openobserve_query, days=7, verbosity=1)
df.head()

INFO: from 2025-03-11 18:53:41.122503 to 2025-03-18 18:53:41.122503, TZ UTC
{'query': {'end_time': 1742324021122503,
           'sql': '\n'
                  'SELECT log_file_name,count(*) FROM "default" GROUP BY '
                  'log_file_name\n',
           'start_time': 1741719221122503}}


| | count(*) | log_file_name |
|---|---|---|
| 0 | 110 | history.log |
| 1 | 2 | mail.err |
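
In the verbose output above, `start_time` and `end_time` appear to be integer microseconds since the Unix epoch, in UTC. The sketch below reproduces those two values from the printed window. The helper `to_epoch_micros` is hypothetical and is not the provider's own conversion code; treating naive datetimes as UTC is an assumption made here to match the "TZ UTC" note in the output.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_micros(moment: datetime) -> int:
    """Convert a datetime to integer microseconds since the Unix epoch.

    Naive datetimes are treated as UTC (an assumption of this sketch).
    """
    if moment.tzinfo is None:
        moment = moment.replace(tzinfo=timezone.utc)
    # Integer division by a timedelta avoids float rounding errors.
    return (moment - EPOCH) // timedelta(microseconds=1)


# Window taken from the INFO line in the output above.
end = datetime(2025, 3, 18, 18, 53, 41, 122503)
start = end - timedelta(days=7)
print(to_epoch_micros(start), to_epoch_micros(end))
# prints: 1741719221122503 1742324021122503
```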


In [1]:
openobserve_query = """SELECT body__systemd_unit, count(*) FROM "journald" group by body__systemd_unit order by count(*) desc"""
df = openobserve_prov.exec_query(
    openobserve_query,
    start_time=datetime.now() - timedelta(days=7),
    end_time=datetime.now() - timedelta(days=1),
)
df.head()

| | body__systemd_unit | count(*) |
|---|---|---|
| 0 | 00-kunai.service | 20812630 |
| 1 | init.scope | 217243 |
| 2 | osqueryd.service | 93914 |
| 3 | falcoctl-artifact-follow.service | 87232 |
| 4 | ssh.service | 36890 |


In [63]:
# first/last seen of streams
streams = ['default', 'journald', 'zeek', 'webproxy']
df_streams_seen = pd.DataFrame(columns=['first_seen', 'last_seen'])
days_period = 7

In [1]:
# Use one time window for every query so the results are comparable
end_time = datetime.now()
start_time = end_time - timedelta(days=days_period)

for s in streams:
    # panic... pandas handles the timestamp conversion anyway, so date_format is not needed:
    # firstseen_sql = f"SELECT date_format(_timestamp, '%Y-%m-%d %H:%M:%S', 'UTC') FROM \"{s}\" order by _timestamp asc limit 1"
    # lastseen_sql = f"SELECT date_format(_timestamp, '%Y-%m-%d %H:%M:%S', 'UTC') FROM \"{s}\" order by _timestamp desc limit 1"
    firstseen_sql = f"SELECT _timestamp FROM \"{s}\" order by _timestamp asc limit 1"
    lastseen_sql = f"SELECT _timestamp FROM \"{s}\" order by _timestamp desc limit 1"
    df_first = openobserve_prov.exec_query(
        firstseen_sql, start_time=start_time, end_time=end_time, verbosity=0
    )
    df_last = openobserve_prov.exec_query(
        lastseen_sql, start_time=start_time, end_time=end_time, verbosity=0
    )
    # Add one row per stream, indexed by stream name. Assigning through .loc
    # avoids concatenating onto an empty DataFrame, which raises a pandas warning.
    df_streams_seen.loc[s] = [
        df_first["_timestamp"].iloc[0],
        df_last["_timestamp"].iloc[0],
    ]
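
The loop above interpolates each stream name directly into the SQL text. That is fine for a fixed list, but a name containing a double quote would break the statement. The sketch below quotes the name as a SQL identifier first. The helpers `quote_stream` and `first_last_seen_sql` are hypothetical, and doubling embedded quotes is the standard SQL convention; whether OpenObserve's SQL dialect accepts it is an assumption to verify.

```python
def quote_stream(name: str) -> str:
    """Quote a stream name as a SQL identifier, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def first_last_seen_sql(stream: str) -> tuple:
    """Build the first-seen and last-seen queries for one stream."""
    table = quote_stream(stream)
    first = f"SELECT _timestamp FROM {table} order by _timestamp asc limit 1"
    last = f"SELECT _timestamp FROM {table} order by _timestamp desc limit 1"
    return first, last


firstseen_sql, lastseen_sql = first_last_seen_sql("journald")
print(firstseen_sql)
# prints: SELECT _timestamp FROM "journald" order by _timestamp asc limit 1
```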


In [1]:
df_streams_seen

| | first_seen | last_seen |
|---|---|---|
| default | 2025-03-12 06:27:19.380708 | 2025-03-18 06:28:48.146779 |
| journald | 2025-03-11 19:39:40.959506 | 2025-03-18 19:39:41.134215 |
| zeek | 2025-03-11 19:39:46.579635 | 2025-03-18 19:39:40.746550 |
| webproxy | 2025-03-11 19:40:16.781357 | 2025-03-18 19:39:16.745473 |
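
One use of the `last_seen` column is spotting streams that have stopped receiving events. The sketch below flags streams whose newest event is older than a chosen lag. The helper `stale_streams`, the reference time `now`, and the one-hour threshold are all hypothetical. The timestamps are copied from the output above and truncated to seconds.

```python
from datetime import datetime, timedelta


def stale_streams(last_seen: dict, now: datetime, max_lag: timedelta) -> list:
    """Return the stream names whose newest event is older than now - max_lag."""
    return [name for name, seen in last_seen.items() if now - seen > max_lag]


# last_seen values from the output above, truncated to seconds.
last_seen = {
    "default": datetime(2025, 3, 18, 6, 28, 48),
    "journald": datetime(2025, 3, 18, 19, 39, 41),
    "zeek": datetime(2025, 3, 18, 19, 39, 40),
    "webproxy": datetime(2025, 3, 18, 19, 39, 16),
}

# Hypothetical reference time shortly after the queries ran.
now = datetime(2025, 3, 18, 19, 40, 0)
print(stale_streams(last_seen, now, timedelta(hours=1)))
# prints: ['default']
```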


## References

- [OpenObserve REST API](https://openobserve.ai/docs/api/)
- [Unofficial python-openobserve module](https://github.com/JustinGuese/python-openobserve) with some customizations and enhancements (https://github.com/juju4/python-openobserve/tree/devel-all)
- OpenObserve GitHub discussions: https://github.com/openobserve/openobserve/discussions/