## Hierarchical Bayesian Modeling for EA metrics for tribal knowledge silos  

In this analysis, we set out to create a methodology and a data-driven metric for identifying potential technological risks within an organisation's coding practices. We examine how programming languages are used across projects and repositories, such as those hosted on GitHub, and apply Hierarchical Bayesian Modelling (HBM) to analyse the data at multiple levels.

HBM effectively captures project-specific variances and overall project trends, providing a nuanced "risk" metric for Enterprise Architecture. This enables organisations to identify potential knowledge silos and make strategic decisions to enhance project continuity and organisational adaptability.

By analysing language usage across different organisational levels and incorporating uncertainty, HBM aims to expose pockets of siloed tribal knowledge, which is crucial for identifying hidden risks within the architectural framework. In this example the proxy is the set of languages used, but the approach can easily be extended to other features such as number of commits, time since last commit, total commits, and so on. The analysis uncovers potential vulnerabilities and compares language usage at repository and project scales against wider organisational patterns. These comparisons are critical: they reveal when a technology that seems insignificant in isolation emerges as a considerable risk in the broader organisational context because of limited expertise or exposure. This comprehensive examination ensures that technology decisions are made with a strategic perspective, reinforcing organisational resilience in the face of technological evolution.

For a more concrete example, consider a scenario where an organisation's repository primarily uses Haskell, a language not commonly used in broader enterprise contexts. Hierarchical Bayesian Modelling evaluates the risk by scrutinising Haskell's application within the repository, its relevance to the project, and its organisational prevalence. This comprehensive assessment ascertains the alignment of Haskell's use with the enterprise's technological trajectory and knowledge base, guiding strategic architectural decisions.
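
To make the three-level comparison concrete, here is a minimal plain-Python sketch. It assumes the data is nested as project -> repository -> language -> bytes, and every project name, repository name and byte count below is invented for illustration. It only computes descriptive shares; it is not the Bayesian model itself.

```python
from collections import Counter

# Hypothetical organisation: project -> repository -> language -> bytes.
# All names and byte counts are made up for illustration.
org = {
    "payments": {
        "ledger-core": {"Haskell": 90_000, "Python": 10_000},
        "ledger-api": {"Python": 100_000},
    },
    "web": {
        "storefront": {"TypeScript": 250_000, "Python": 50_000},
    },
}


def language_share(repos, language):
    """Fraction of all bytes in `repos` (an iterable of language->bytes dicts) written in `language`."""
    totals = Counter()
    for languages in repos:
        totals.update(languages)  # adds byte counts per language
    grand_total = sum(totals.values())
    return totals[language] / grand_total if grand_total else 0.0


# The same language looked at from three vantage points.
repo_share = language_share([org["payments"]["ledger-core"]], "Haskell")
project_share = language_share(org["payments"].values(), "Haskell")
org_share = language_share(
    (repo for project in org.values() for repo in project.values()), "Haskell"
)

print(f"repo: {repo_share:.2f}, project: {project_share:.2f}, org: {org_share:.2f}")
```

In this toy data Haskell makes up 90% of one repository, 45% of its project, and 18% of the organisation's code, which is the kind of gap between levels that the hierarchical model is meant to quantify with uncertainty attached.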


#### Flow
To start building a hierarchical Bayesian model with PyMC from the JSON data, we first need to parse the data and extract the information relevant for modelling. This involves aggregating language usage across repositories and projects. After that, we define a hierarchical model that captures the variability within repositories and the commonalities across projects.

* Data Preparation: Aggregate the language bytes for each language across all repositories and projects.
* Model Definition: Define a hierarchical model in PyMC, using project-level priors influencing repository-level distributions.
* Inference: Use the MCMC samplers provided by PyMC to sample from the posterior distribution.
* Analysis: Analyze the posterior distributions to identify languages with usage outside the credible regions.

In [None]:
# Import required packages

import polars as pl
import pymc as pm
import arviz as az
import pprint

from utils import load_data, json2polars


## Pre-Processing
The first step is to run `generate_dummy_data.py` so that we have data to play around with. The generated dummy data is similar to what you might pull from GitHub's REST API for repository languages: https://docs.github.com/en/rest/repos/repos?apiVersion=2022-11-28#list-repository-languages

Using the GitHub CLI `gh api` command (https://cli.github.com/manual/gh_api):

```shell
gh api \
  -H "Accept: application/vnd.github+json" \
  -H "X-GitHub-Api-Version: 2022-11-28" \
  /repos/OWNER/REPO/languages
```

Example response:

```json
{
  "C": 78769,
  "Python": 7769
}
```



From here, we can load and transform the data. In this instance we use `polars` rather than `pandas` dataframes, so we want to avoid pandas-style coding and use the Polars Expressions API instead.
Expressions are the heart of Polars and yield the best performance.

*N.B.* Here's a section of the User Guide that may help with the transition from pandas-style coding to Polars Expressions:
https://docs.pola.rs/user-guide/migration/pandas/

In [None]:
df_json = load_data("data/dummy_language_data.json")
# Pretty-print the first NUM_PROJECTS projects and their repos to visualise the data
NUM_PROJECTS = 1
first_N_projects = {k: df_json[k] for k in list(df_json)[:NUM_PROJECTS]}
pp = pprint.PrettyPrinter(depth=3)
pp.pprint(first_N_projects)

Let's flatten this into the tabular dataset we are used to, and add a new column holding the log-transformed byte count.

In [None]:
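# Flatten the nested JSON into a long-format polars DataFrame
df = json2polars(df_json)
# Natural log of the byte count (Expr.log() defaults to base e)
df = df.with_columns(pl.col('ByteCount').log().alias('logByteCount'))
print(df.head(n=20))
print("Total number of projects:", df['Project'].n_unique())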
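
The `json2polars` helper lives in `utils` and is not shown here. As a rough idea of the reshaping involved, the plain-Python sketch below flattens nested data into one row per project, repository and language. The project -> repository -> language -> bytes nesting, the `Repository` and `Language` column names, and the `flatten` function itself are assumptions made for illustration; only `Project`, `ByteCount` and `logByteCount` appear in the cells above.

```python
import math

# Hypothetical nested data in the assumed project -> repository -> language -> bytes shape.
nested = {
    "project_a": {
        "repo_1": {"C": 78769, "Python": 7769},
        "repo_2": {"Python": 1200},
    },
}


def flatten(nested_data):
    """Return one dict per (project, repository, language) with raw and log byte counts."""
    rows = []
    for project, repos in nested_data.items():
        for repo, languages in repos.items():
            for language, byte_count in languages.items():
                if byte_count <= 0:
                    continue  # the log is undefined for zero, so skip empty entries
                rows.append(
                    {
                        "Project": project,
                        "Repository": repo,    # assumed column name
                        "Language": language,  # assumed column name
                        "ByteCount": byte_count,
                        "logByteCount": math.log(byte_count),
                    }
                )
    return rows


rows = flatten(nested)
for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed straight to `pl.DataFrame(rows)` to obtain a long-format frame similar to the one printed above.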

## Hierarchical Model Specification
* Introduction to Hierarchical Bayesian Modelling
* Defining the hierarchical model structure (e.g., projects as higher-level groups, repositories within projects, and languages within repositories)
* Explanation of model parameters and priors
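
One way such a structure could be written down, offered as a sketch rather than as the model this notebook necessarily settles on: let $y_{p,r,l}$ be the log byte count of language $l$ in repository $r$ of project $p$. The normal likelihood, the prior families, and the symbols $m$, $s$, $s_{\tau}$ and $s_{\sigma}$ are all assumptions chosen for illustration.

$$
\begin{aligned}
\mu_{l} &\sim \mathcal{N}(m,\; s^{2}) \\
\tau &\sim \operatorname{HalfNormal}(s_{\tau}) \\
\alpha_{p,l} &\sim \mathcal{N}(\mu_{l},\; \tau^{2}) \\
\sigma &\sim \operatorname{HalfNormal}(s_{\sigma}) \\
y_{p,r,l} &\sim \mathcal{N}(\alpha_{p,l},\; \sigma^{2})
\end{aligned}
$$

Here $\mu_{l}$ is the organisation-wide typical usage of language $l$, $\alpha_{p,l}$ is the project-level mean that is partially pooled towards $\mu_{l}$, $\tau$ controls how far projects may drift from the organisational pattern, and $\sigma$ is the repository-to-repository spread within a project.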

## Model Implementation
* Implementing the HBM using PyMC
* Defining the model in PyMC
* Setting up the priors for each level of the hierarchy
* Incorporating the data into the model
* Model fitting (e.g., using MCMC methods)

## Model Diagnostics
* Checking model convergence (e.g., trace plots, R-hat statistics)
* Posterior predictive checks
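
To show what the R-hat statistic measures, here is a small standard-library sketch of the classic (non-split) Gelman-Rubin formula for a single parameter. The chains below are invented numbers, and `gelman_rubin_rhat` is a hypothetical helper written for illustration, not a library function.

```python
from statistics import mean, variance


def gelman_rubin_rhat(chains):
    """Classic Gelman-Rubin R-hat for one parameter.

    `chains` is a list of equally long lists of draws, one list per chain.
    """
    n = len(chains[0])
    within = mean(variance(chain) for chain in chains)          # W: average within-chain variance
    between = n * variance([mean(chain) for chain in chains])   # B: variance of chain means, scaled by n
    pooled = (n - 1) / n * within + between / n                 # pooled estimate of posterior variance
    return (pooled / within) ** 0.5


# Two chains exploring the same region (made-up draws).
agreeing = [
    [0.1, -0.2, 0.3, 0.0, -0.1, 0.2],
    [0.0, 0.2, -0.3, 0.1, -0.2, 0.1],
]
# Same first chain, but the second one is stuck around a different value.
disagreeing = [agreeing[0], [x + 5.0 for x in agreeing[1]]]

rhat_agreeing = gelman_rubin_rhat(agreeing)
rhat_disagreeing = gelman_rubin_rhat(disagreeing)
print(rhat_agreeing, rhat_disagreeing)
```

Values close to 1 suggest the chains agree, while values well above 1 indicate that they have not mixed. In practice `az.summary(idata)` reports an `r_hat` column based on a more robust rank-normalised split variant, and `az.plot_trace(idata)` gives the corresponding trace plots.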

## Results and Interpretation
* Extracting and summarizing the posterior distributions of model parameters
* Identifying significant factors and their impacts on language usage risk metrics
* Ridge plots for visualizing the distribution of language usage across projects and repositories, highlighting potential outliers or risks
* Additional plots for deeper insights (e.g., comparison of language usage trends across different organizational levels)
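
One simple way to turn posterior draws into a risk flag is to check whether an observed value falls outside a credible interval of the model's predictions. The sketch below does this with the standard library only. The draws, the observed values, the 94% width and the helper names are all assumptions made for illustration.

```python
def equal_tailed_interval(draws, prob=0.94):
    """Equal-tailed interval covering roughly `prob` of the draws (nearest-rank, no interpolation)."""
    ordered = sorted(draws)
    cut = int((1.0 - prob) / 2.0 * (len(ordered) - 1))  # number of draws trimmed from each tail
    return ordered[cut], ordered[len(ordered) - 1 - cut]


def flag_outside(predictive_draws, observed, prob=0.94):
    """Return the languages whose observed value lies outside their interval."""
    flagged = []
    for language, value in observed.items():
        lower, upper = equal_tailed_interval(predictive_draws[language], prob)
        if not lower <= value <= upper:
            flagged.append(language)
    return flagged


# Hypothetical predictive draws of log byte count per language (evenly spaced stand-ins
# for real MCMC output) and the log byte counts actually observed in one repository.
predictive_draws = {
    "Python": [10.0 + 0.01 * i for i in range(-100, 101)],
    "Haskell": [8.0 + 0.01 * i for i in range(-100, 101)],
}
observed = {"Python": 10.2, "Haskell": 11.5}

flagged = flag_outside(predictive_draws, observed)
print(flagged)
```

Here only Haskell is flagged, because its observed usage sits far above what the stand-in draws predict. With real output the draws would come from the fitted model's posterior predictive samples, and `az.hdi` offers a highest-density alternative to the equal-tailed interval used here.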

## Streamlit app