# `gitrpcd` Breakdown

This script uses pandas to generate:

- A simple plot of a `gitrpcd.log.csv` file. The CSV file is assumed to have been created beforehand by [kv-to-csv.py](https://github.com/gm3dmo/syslog-to-csv/blob/main/kv-to-csv.py).

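You will need to `conda install bokeh` or `pip install bokeh`. The imports below also use `pandas`, `numpy`, `panel` and `hvplot`, so those packages need to be installed as well.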

In [None]:
import pandas as pd
import numpy as np
import panel as pn
pn.extension('tabulator')
import hvplot.pandas
import pathlib

from bokeh.plotting import figure, output_file, show
from bokeh.models import ColumnDataSource
from bokeh.models.tools import HoverTool

In [None]:
pd.set_option('display.max_rows', 1000)

cwd = pathlib.Path.cwd()
csv_file = cwd / 'gitrpcd.log.csv'
# Pin column dtypes up front so pandas does not have to infer them
column_types = {
    "line_number": int,
    "line_length": int,
    "hostname": "string",
    "wiped_line": "string",
    "daemon": "string",
    "health": "string",
    "msg": "string",
    "repository_id": "string",
    "twirp_error": "string",
    "path": "string",
}
df = pd.read_csv(csv_file, dtype=column_types)

In [None]:
df.info()

Convert the `time` column to a pandas datetime, in place:

In [None]:
df['time'] = pd.to_datetime(df['time'])
df.info()
df.head()
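To illustrate what the conversion does, here is a minimal sketch on made-up data. The two timestamp strings and their ISO 8601 format are assumptions for illustration only; the real format depends on what `kv-to-csv.py` wrote to the CSV.

```python
import pandas as pd

# Hypothetical sample rows; real values come from gitrpcd.log.csv.
sample = pd.DataFrame({"time": ["2023-01-01T00:00:05Z", "2023-01-01T00:12:30Z"]})

# Before conversion the column holds plain strings.
dtype_before = sample["time"].dtype

# After conversion the column holds timestamps, which pd.Grouper can bucket.
sample["time"] = pd.to_datetime(sample["time"])
dtype_after = sample["time"].dtype

print(dtype_before, "->", dtype_after)
```

Once the column is a datetime, attributes such as `sample["time"].dt.minute` become available, and time-based grouping works.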

In [None]:
df["repository_id"].value_counts(normalize=True)

In [None]:
df["user_agent"].value_counts(normalize=True)

In [None]:
df["path"].value_counts(normalize=True)

In [None]:
df["twirp_error"].value_counts(normalize=True)
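One detail worth keeping in mind when reading these shares: `value_counts(normalize=True)` drops missing values by default, so the fractions are relative to the rows where the column is populated, not to all log lines. This matters for a sparse column. The sketch below uses a made-up series (the error names are illustrative, not taken from a real log) to show the difference.

```python
import pandas as pd

# Hypothetical column with three populated rows and three missing ones.
errors = pd.Series(["not_found", None, None, "not_found", "internal", None])

# Default: missing values are excluded, shares sum to 1 over populated rows.
share_of_populated = errors.value_counts(normalize=True)

# dropna=False: missing values get their own entry, shares are over all rows.
share_of_all_rows = errors.value_counts(normalize=True, dropna=False)

print(share_of_populated)
print(share_of_all_rows)
```

Passing `dropna=False` in the cells above would show how often each field is absent altogether.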

Create the time period *buckets* in which to group the data. This script uses `600S` (10 minutes) as the bucket granularity. Other frequencies are documented under [offset-aliases](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases). To choose a different *bucket size*, change the value assigned to `sample_frequency` in the next cell; every later cell reuses that variable. Newer pandas releases deprecate the uppercase `S` alias in favor of lowercase `s`, so `600s` may be the safer spelling there.

In [None]:
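# Set the sample frequency: 10 minutes = 600 seconds
sample_frequency = '600S'
# Count non-null values of every column per (time bucket, service) pair.
# pd.Grouper groups along the rows by default, so no axis argument is needed.
buckets = df.groupby([pd.Grouper(key='time', freq=sample_frequency),'service']).count()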
buckets.head()
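The effect of the bucket size can be seen on a small example. This is a sketch under stated assumptions: the `events` frame and its service names `a` and `b` are invented, `rows_per_bucket` is a hypothetical helper, the aliases `10min` and `60min` are used in place of `600S`, and `.size()` (rows per group) is swapped in for the notebook's `.count()` (non-null values per column) to keep the output to a single series.

```python
import pandas as pd

# Hypothetical events; the real notebook reads them from the CSV.
events = pd.DataFrame({
    "time": pd.to_datetime([
        "2023-01-01 00:01:00",
        "2023-01-01 00:04:00",
        "2023-01-01 00:09:59",
        "2023-01-01 00:10:00",
        "2023-01-01 00:25:00",
    ]),
    "service": ["a", "a", "b", "a", "b"],
})

def rows_per_bucket(frame, freq):
    """Number of rows per (time bucket, service) pair."""
    return frame.groupby([pd.Grouper(key="time", freq=freq), "service"]).size()

# Each bucket is labelled by its left edge: 00:00, 00:10, 00:20, ...
ten_minutes = rows_per_bucket(events, "10min")
# A coarser bucket folds all five rows into the 00:00 hour.
one_hour = rows_per_bucket(events, "60min")

print(ten_minutes)
print(one_hour)
```

Note that the row at `00:10:00` lands in the `00:10` bucket, not the `00:00` one, because buckets are closed on the left by default for these frequencies.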

### Which services are producing the most messages per 10 minutes?

In [None]:
# One row per time bucket, one column per service; cells hold message counts
buckets_of_service = df.groupby([pd.Grouper(key='time', freq=sample_frequency),'service'])['service'].count().unstack()
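The `.unstack()` call turns the grouped counts into a wide table, which is the shape the plots consume. The sketch below uses invented data (service names `a` and `b` are placeholders) to show that shape, and that a service with no messages in a bucket appears as a missing value rather than zero. Filling with zero is an optional extra step, not something the notebook does.

```python
import pandas as pd

# Hypothetical events standing in for the parsed log.
events = pd.DataFrame({
    "time": pd.to_datetime([
        "2023-01-01 00:01:00",
        "2023-01-01 00:04:00",
        "2023-01-01 00:09:59",
        "2023-01-01 00:10:00",
        "2023-01-01 00:25:00",
    ]),
    "service": ["a", "a", "b", "a", "b"],
})

# Long form: a Series indexed by (time bucket, service).
long_counts = events.groupby(
    [pd.Grouper(key="time", freq="10min"), "service"]
)["service"].count()

# Wide form: the service level of the index becomes the columns.
wide = long_counts.unstack()

# Optional: treat "no messages in this bucket" as an explicit zero.
wide_filled = wide.fillna(0).astype(int)

print(wide)
print(wide_filled)
```

Whether to fill with zero depends on the plot: a scatter plot simply omits missing points, while a line plot would show gaps.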

### Generate summaries of the bucket data

In [None]:
daemon_plot = buckets_of_service.hvplot.scatter(x = 'time', by='service', line_width=2, title="service lines in gitrpcd.log", width=1600, height=1200)
daemon_plot

In [None]:
buckets_of_agent = df.groupby([pd.Grouper(key='time', freq=sample_frequency),'user_agent'])['user_agent'].count().unstack()
plot_agent = buckets_of_agent.hvplot.scatter(x = 'time', by='user_agent', line_width=2, title="user_agent lines in gitrpcd.log", width=1600, height=1200)
plot_agent

In [None]:
buckets_of_twirp_method = df.groupby([pd.Grouper(key='time', freq=sample_frequency),'twirp_method'])['twirp_method'].count().unstack()
plot_twirp_method = buckets_of_twirp_method.hvplot.scatter(x = 'time', by='twirp_method', line_width=2, title="twirp_method lines in gitrpcd.log", width=1600, height=1200)
plot_twirp_method

In [None]:
buckets_of_twirp_error = df.groupby([pd.Grouper(key='time', freq=sample_frequency),'twirp_error'])['twirp_error'].count().unstack()
plot_twirp_error = buckets_of_twirp_error.hvplot.scatter(x = 'time', by='twirp_error', line_width=2, title="twirp_error lines in gitrpcd.log", width=1600, height=1200)
plot_twirp_error

In [None]:
buckets_of_repository_id = df.groupby([pd.Grouper(key='time', freq=sample_frequency),'repository_id'])['repository_id'].count().unstack()
plot_repository_id = buckets_of_repository_id.hvplot.scatter(x = 'time', by='repository_id', line_width=2, title="repository_id lines in gitrpcd.log", width=1600, height=1200)
plot_repository_id

In [None]:
buckets_of_level = df.groupby([pd.Grouper(key='time', freq=sample_frequency),'level'])['level'].count().unstack()
plot_level = buckets_of_level.hvplot.scatter(x = 'time', by='level', line_width=2, title="level lines in gitrpcd.log", width=1600, height=1200)
plot_level
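The cells above repeat one pattern per column: bucket by time, count, unstack, plot. As a possible tidy-up, the counting step could be factored into a small function. This is only a sketch: `bucket_counts` is a hypothetical helper that is not part of the notebook, the `demo` frame is invented, and the `10min` alias stands in for `sample_frequency`.

```python
import pandas as pd

def bucket_counts(frame, column, freq="10min", time_key="time"):
    """Count non-null values of `column` per time bucket, one output column per value."""
    grouped = frame.groupby([pd.Grouper(key=time_key, freq=freq), column])[column].count()
    return grouped.unstack()

# Hypothetical data with two of the columns the notebook plots.
demo = pd.DataFrame({
    "time": pd.to_datetime(["2023-01-01 00:01", "2023-01-01 00:02", "2023-01-01 00:15"]),
    "level": ["info", "error", "info"],
    "service": ["x", "x", "y"],
})

# Build one wide table per column, skipping columns this log does not contain.
wanted = ("level", "service", "twirp_error")
tables = {name: bucket_counts(demo, name) for name in wanted if name in demo.columns}

for name, table in tables.items():
    print(name)
    print(table)
```

Each resulting table can then be passed to `hvplot.scatter` with the same arguments used in the cells above, for example `tables["level"].hvplot.scatter(x='time', by='level', ...)`.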