## Hierarchical Bayesian Modeling to assess tribal knowledge

In this analysis, we develop a methodology and a data-driven metric for identifying potential technological risks within an organisation's coding protocols. We examine how programming languages are used across projects and repositories, such as those hosted on GitHub, and use Hierarchical Bayesian Modelling (HBM) for multi-level data analysis.

HBM effectively captures project-specific variances and overall project trends, providing a nuanced "risk" metric for Enterprise Architecture. This enables organisations to identify potential knowledge silos and make strategic decisions to enhance project continuity and organisational adaptability.

By analysing language usage across different organisational levels and integrating uncertainty, HBM aims to expose pockets of siloed tribal knowledge, which is crucial for identifying hidden risks within the architectural framework. In this example the proxy is the set of languages used, but the approach can easily be extended to other features such as number of commits, time since last commit, or total commits. The analysis uncovers potential vulnerabilities and compares language usage at repository and project scales against wider organisational patterns. These comparisons are critical: they reveal when a technology that seems insignificant in isolation emerges as a considerable risk in the broader organisational context due to limited expertise or exposure. This examination helps ensure that technology decisions are made with a strategic perspective, reinforcing organisational resilience in the face of technological evolution.

For a more concrete example, consider a scenario where an organisation's repository primarily uses Haskell, a language not commonly used in broader enterprise contexts. Hierarchical Bayesian Modelling evaluates the risk by scrutinising Haskell's application within the repository, its relevance to the project, and its organisational prevalence. This comprehensive assessment ascertains the alignment of Haskell's use with the enterprise's technological trajectory and knowledge base, guiding strategic architectural decisions.


#### Flow
To model this problem space using a Bayesian hierarchical model, we follow four steps:

- **Data Preparation:** Aggregate the language bytes for each language across all repositories and projects.
- **Model Definition:** Define a hierarchical model in PyMC in which project-level priors influence repository-level distributions.
- **Inference:** Use the MCMC samplers provided by PyMC to sample from the posterior distribution.
- **Analysis:** Analyze the posterior distributions to identify languages with usage outside the credible regions.
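
The data-preparation step can be sketched in plain Python. The nested `{project: {repository: {language: bytes}}}` layout, the project and repository names, and the byte counts below are illustrative assumptions, not the output of this notebook's helpers. The sketch shows how the same language can dominate a repository while being rare across the organisation.

```python
from collections import Counter

# Hypothetical nested layout: project -> repository -> language -> bytes.
org = {
    "payments": {
        "ledger": {"Haskell": 9000, "Python": 1000},
        "gateway": {"Python": 20000},
    },
    "analytics": {
        "reports": {"Python": 30000, "SQL": 10000},
    },
}


def language_share(counts, language):
    """Fraction of all bytes in `counts` written in `language`."""
    total = sum(counts.values())
    return counts.get(language, 0) / total if total else 0.0


def aggregate(repos):
    """Sum language byte counts over a mapping of repositories."""
    totals = Counter()
    for languages in repos.values():
        totals.update(languages)
    return totals


# Byte counts at each level of the hierarchy
repo_counts = org["payments"]["ledger"]
project_counts = aggregate(org["payments"])
org_counts = Counter()
for repos in org.values():
    org_counts.update(aggregate(repos))

repo_share = language_share(repo_counts, "Haskell")
project_share = language_share(project_counts, "Haskell")
org_share = language_share(org_counts, "Haskell")
print(f"repo={repo_share:.3f} project={project_share:.3f} org={org_share:.3f}")
```

The gap between the repository-level and organisation-level shares is the kind of signal the hierarchical model is meant to quantify with uncertainty attached.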

In [1]:
# Import required packages

import polars as pl
import pymc as pm
import arviz as az
import pprint

from utils import load_data, json2polars




## Data Preparation
The first step is to run `generate_dummy_data.py` so that we have data to work with. The generated dummy data is similar to what you might pull from GitHub's REST API endpoint for listing repository languages: https://docs.github.com/en/rest/repos/repos?apiVersion=2022-11-28#list-repository-languages

```shell
# GitHub CLI `gh api` manual: https://cli.github.com/manual/gh_api

gh api \
  -H "Accept: application/vnd.github+json" \
  -H "X-GitHub-Api-Version: 2022-11-28" \
  /repos/OWNER/REPO/languages

# Example response:
# {
#   "C": 78769,
#   "Python": 7769
# }
```
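
As a minimal sketch of what can be done with one such response, the snippet below parses the example body shown above and derives each language's share of the repository plus a log byte count. The field names `Share` and `logByteCount` are chosen here for illustration; the notebook's own `json2polars` helper may structure things differently.

```python
import json
import math

# Example response body copied from the snippet above.
response_body = '{"C": 78769, "Python": 7769}'

byte_counts = json.loads(response_body)
total_bytes = sum(byte_counts.values())

# One row per language: raw bytes, share of the repository, and log bytes
rows = [
    {
        "Language": language,
        "ByteCount": count,
        "Share": count / total_bytes,
        "logByteCount": math.log(count),
    }
    for language, count in byte_counts.items()
]

for row in rows:
    print(row)
```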



From here, we can load and transform the data. In this instance, we use `polars` rather than `pandas` DataFrames, so we avoid pandas-style coding and use the Polars Expressions API instead. Expressions are the heart of Polars and yield the best performance.

*N.B.* Here's a section of the User Guide that may help transitioning from Pandas-style coding to using Polars Expressions.
https://docs.pola.rs/user-guide/migration/pandas/

In [2]:
df_json = load_data("data/dummy_language_data.json")
# Pretty-print the first NUM_PROJECTS projects and their repos to visualise the data
NUM_PROJECTS = 1
first_N_projects = {k: df_json[k] for k in list(df_json)[:NUM_PROJECTS]}
pp = pprint.PrettyPrinter(depth=3)
pp.pprint(first_N_projects)

If this cell raises `FileNotFoundError: [Errno 2] No such file or directory: 'data/dummy_language_data.json'`, run `generate_dummy_data.py` first to create the data file.

Let's flip this into the kind of tabular dataset we are used to, and add a new variable that log-transforms the byte count.

In [10]:
df = json2polars(df_json)
# Add a log-transformed byte count
df = df.with_columns(pl.col('ByteCount').log().alias('logByteCount'))

# Encode cols to categories rather than str
columns_to_encode = ['Project', 'Repository', 'Language']
new_columns = [
    pl.col(column_name).cast(pl.Categorical).alias(f'{column_name}_codes')
    for column_name in columns_to_encode
]

# Add all new columns to the DataFrame at once
# (the method is `with_columns`; `with_column` does not exist on DataFrame)
df = df.with_columns(new_columns)

# Number of categories for each variable
n_projects = df['Project'].n_unique()
n_repos = df['Repository'].n_unique()
n_languages = df['Language'].n_unique()

print(df.head(n=20))
print("Total number of projects:", n_projects)

## Hierarchical Model Specification

This section introduces Hierarchical Bayesian Modelling (HBM) principles and their application in structuring complex, multi-level datasets, such as those encountered in evaluating technological risks within coding languages.

#### Introduction to Hierarchical Bayesian Modelling
Hierarchical Bayesian Modelling is a statistical framework that enables data analysis across different levels of hierarchy by integrating the variability within individual units (such as repositories) and the commonalities across groups (such as projects, domains, or departments). Bayes' theorem is at the core of Bayesian inference, which updates the probability for a hypothesis as more evidence becomes available. One of the critical concepts in HBM is exchangeability: the joint distribution of a set of units (for example, the repositories within a project) does not change when the units are relabelled or reordered. This makes HBM suitable for data with no natural ordering among units, which can then be treated as identically distributed given some unknown shared parameters.

### Defining the Hierarchical Model Structure

When we analyse the usage of programming languages, we're looking at a hierarchical model structure with multiple layers. These layers represent language usage within repositories, which are nested within projects. 

- **Level 1 - Repository-Level Likelihood:** At this level, we describe the observed data, such as the amount of code written in each language within a repository, using a likelihood function.
  
  $$ Language_{ij} \mid \theta_{ij} \sim SomeDistribution(\theta_{ij}) $$

  Here, $\theta_{ij}$ is a parameter representing the language usage, where $i$ denotes the repository and $j$ the language.

- **Level 2 - Project-Level Priors:** As we move up to the project level, parameters from the repository level are considered uncertain and are described by priors.

  $$ \theta_{ij} | \mu_i, \kappa_i \sim Beta(\mu_i \kappa_i, (1 - \mu_i) \kappa_i) $$

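  The Beta distribution parameters $\mu_i$ and $\kappa_i$ belong to the project that contains repository $i$: $\mu_i$ is the expected language usage within that project, and $\kappa_i$ is a concentration parameter, so larger values mean less variability between its repositories.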

- **Level 3 - Organisational-Level Hyperpriors:** At the organisational level, we look at the broader patterns in language usage across all projects. 

  $$ \mu_i \sim Beta(a_{\mu}, b_{\mu}) $$
  $$ \kappa_i \sim Gamma(a_{\kappa}, b_{\kappa}) $$

  Hyperpriors for $\mu_i$ and $\kappa_i$ reflect our assumptions about these patterns before analysing the data.

This hierarchical approach allows for a detailed analysis of organisational language usage. We're not just modelling individual repositories but also capturing trends across projects and the entire organisation.
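
To make the three levels concrete, here is a minimal forward simulation using only the standard library. It draws from the priors; it does not fit anything. The hyperprior values, the number of repositories, and the reading of $Gamma(a_{\kappa}, b_{\kappa})$ as shape and rate are all assumptions made for this sketch, and each draw of $\theta$ stands for one language's usage share in one repository.

```python
import random

rng = random.Random(42)

# Hypothetical hyperprior settings (assumptions, not fitted values).
A_MU, B_MU = 2.0, 2.0        # mu ~ Beta(a_mu, b_mu)
A_KAPPA, B_KAPPA = 2.0, 0.1  # kappa ~ Gamma(shape=a_kappa, rate=b_kappa)


def simulate_project(n_repos):
    """Draw project-level (mu, kappa), then one theta per repository."""
    mu = rng.betavariate(A_MU, B_MU)
    # random.gammavariate takes (shape, scale), so scale = 1 / rate.
    kappa = rng.gammavariate(A_KAPPA, 1.0 / B_KAPPA)
    thetas = [
        rng.betavariate(mu * kappa, (1.0 - mu) * kappa)
        for _ in range(n_repos)
    ]
    return mu, kappa, thetas


# Simulate three projects with four repositories each
for project in range(3):
    mu, kappa, thetas = simulate_project(n_repos=4)
    shares = ", ".join(f"{t:.2f}" for t in thetas)
    print(f"project {project}: mu={mu:.2f} kappa={kappa:.1f} thetas=[{shares}]")
```

Repositories in the same project share one `mu` and `kappa`, which is what lets the fitted model borrow strength across repositories.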

### Explanation of Model Parameters and Priors

In Hierarchical Bayesian Modelling (HBM), parameters are not fixed values but are expressed as distributions, known as priors. Priors encapsulate our initial understanding or beliefs about the parameters before we examine the data. For instance, the use of a programming language within a repository can be characterised by a Beta distribution, representing the proportion of code written in that language.

The hyperparameters, denoted as $\mu_i$ and $\kappa_i$, introduce a layer of variability that accounts for differences within repositories and across projects. They are informed by higher-level distributions or hyperpriors, like the Beta distribution for $\mu_i$, which describes the mean language usage, and the Gamma distribution for $\kappa_i$, which controls how tightly repository-level usage concentrates around that mean.
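
As a short worked step (added for illustration, using the standard moments of the Beta distribution), write $\alpha = \mu_i \kappa_i$ and $\beta = (1 - \mu_i) \kappa_i$, so that $\alpha + \beta = \kappa_i$. This shows why $\mu_i$ acts as a mean and $\kappa_i$ as a concentration:

$$
E[\theta_{ij}] = \frac{\alpha}{\alpha + \beta} = \frac{\mu_i \kappa_i}{\kappa_i} = \mu_i
$$

$$
Var[\theta_{ij}] = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} = \frac{\mu_i (1 - \mu_i) \kappa_i^2}{\kappa_i^2 (\kappa_i + 1)} = \frac{\mu_i (1 - \mu_i)}{\kappa_i + 1}
$$

For instance, with $\mu_i = 0.3$ the variance is $0.21 / 11 \approx 0.019$ when $\kappa_i = 10$ and $0.21 / 101 \approx 0.002$ when $\kappa_i = 100$.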

By employing a hierarchical model, we can refine our initial beliefs in light of the data collected, which leads to a more nuanced understanding of an organisation's coding practices. The model's flexibility allows us to adapt to new information, enhancing the precision of our insights.


## Transitioning to Posterior Distributions

### From Theory to Practice: The Role of Posterior Distributions

With our model parameters defined and their priors set, the next step in Bayesian analysis is to update these beliefs with observed data. This is where the posterior distribution comes into play.


#### What is the posterior?

In the hierarchical Bayesian modelling (HBM) context, the posterior distribution is the updated belief about our model's parameters after considering the observed data. It combines our prior beliefs (the priors) and the evidence from the data (the likelihood). Mathematically, it is expressed as:

$$
P(\theta | data) \propto P(data | \theta) \times P(\theta)
$$

Where:
- $P(\theta | data)$ is the posterior distribution of the parameters $\theta$.
- $P(data | \theta)$ is the likelihood of the data given the parameters.
- $P(\theta)$ is the prior distribution of the parameters.

After observing the data, the posterior distribution provides a range of likely values for the parameters, which is crucial for making informed decisions.
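
The proportionality above can be checked numerically on a toy case. The sketch below assumes a single usage proportion with a Beta prior and binomial data (for example, 7 of 20 files written in a given language). Those numbers and the binomial likelihood are illustrative assumptions, not the model used in this notebook. It compares the closed-form conjugate update with a brute-force evaluation of prior times likelihood on a grid.

```python
import math


def beta_binomial_update(a, b, successes, trials):
    """Conjugate update: Beta(a, b) prior + binomial data -> Beta posterior."""
    return a + successes, b + (trials - successes)


def grid_posterior_mean(a, b, successes, trials, n_grid=10_001):
    """Posterior mean from prior x likelihood evaluated on a grid of theta."""
    failures = trials - successes
    # Midpoints avoid theta = 0 and theta = 1, where log() is undefined.
    grid = [(i + 0.5) / n_grid for i in range(n_grid)]
    weights = []
    for theta in grid:
        log_prior = (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
        log_like = successes * math.log(theta) + failures * math.log(1 - theta)
        weights.append(math.exp(log_prior + log_like))
    total = sum(weights)  # normalising constant (up to the grid spacing)
    return sum(t * w for t, w in zip(grid, weights)) / total


post_a, post_b = beta_binomial_update(2, 2, successes=7, trials=20)
exact_mean = post_a / (post_a + post_b)
approx_mean = grid_posterior_mean(2, 2, successes=7, trials=20)
print(f"posterior: Beta({post_a}, {post_b})")
print(f"exact mean={exact_mean:.4f} grid mean={approx_mean:.4f}")
```

For hierarchical models no such closed form is generally available, which is why the notebook relies on MCMC sampling instead.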


## Practical Implications of Posterior Analysis in Hierarchical Bayesian Modelling

Through the lens of Hierarchical Bayesian Modelling, the posterior distribution becomes a beacon, illuminating the path to understanding and action within an organisation's coding practices.


### Informing Strategic Decision-Making

The power of posterior analysis extends beyond diagnostics; it informs strategic resource allocation and risk management:

- **Credible Intervals**: The precision of parameter estimates, reflected in the credible intervals of the posterior distribution, directs our focus to areas where additional data collection or deeper investigation may be warranted.

- **Outlier Detection**: Spotting outliers within posterior distributions alerts us to unconventional language usage patterns. These could represent areas of innovation warranting further exploration or potential risks if the languages in question lack broad support.

- **Strategic Resource Allocation**: Insights gained from posterior distributions enable informed decisions on resource allocation—be it for targeted training programmes, strategic hiring to build expertise in underutilised languages, or investment in technology stacks that promise to align with and propel the organisation's strategic objectives.
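
The credible-interval and outlier ideas above can be sketched as follows. The posterior samples are simulated stand-ins rather than output from a fitted model, the repository names and shares are hypothetical, and the interval is a simple equal-tailed one with an arbitrarily chosen 94% mass. A highest-density interval, as reported by tools such as ArviZ, may differ slightly.

```python
import random


def credible_interval(samples, mass=0.94):
    """Equal-tailed interval containing roughly `mass` of the samples."""
    ordered = sorted(samples)
    tail = (1.0 - mass) / 2.0
    last = len(ordered) - 1
    lower_idx = int(round(tail * last))
    upper_idx = int(round((1.0 - tail) * last))
    return ordered[lower_idx], ordered[upper_idx]


def flag_outliers(observed, interval):
    """Return the entries of `observed` that fall outside the interval."""
    lower, upper = interval
    return {
        name: value
        for name, value in observed.items()
        if not lower <= value <= upper
    }


# Stand-in for posterior samples of one language's usage share.
rng = random.Random(0)
posterior_samples = [rng.betavariate(3, 7) for _ in range(4000)]
interval = credible_interval(posterior_samples)

# Hypothetical observed shares of that language in three repositories.
observed_shares = {"repo-a": 0.28, "repo-b": 0.35, "repo-c": 0.95}
outliers = flag_outliers(observed_shares, interval)
print(f"interval=({interval[0]:.2f}, {interval[1]:.2f}) outliers={outliers}")
```

A flagged repository is a prompt for investigation rather than a verdict: it may be a deliberate specialisation or a knowledge silo.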

Interpreting the posterior distributions derived from our hierarchical model does more than just enhance our understanding of language usage; it equips us to forecast, plan, and foster a coding environment that is both efficient and resilient to future challenges.

### Uncovering Knowledge Silos

The posterior distributions for language usage within repositories serve as a diagnostic tool, revealing languages that are disproportionately relied upon. Anomalies in these distributions may signal the existence of knowledge silos, suggesting areas where diversification and training could be beneficial. By identifying these silos, we can proactively address potential bottlenecks in knowledge transfer and code maintenance.

### Assessing Project-Level Variability

The consistency of coding practices across repositories within projects is characterised by project-level hyperparameters. When significant variability is observed in these posterior distributions, it may reflect fragmented coding practices that could undermine team collaboration and project efficiency. This insight drives us to review and possibly revise coding standards, ensuring that practices are aligned and conducive to project success.

### Evaluating Organisational Coding Norms

At the highest organisational level, posterior distributions offer a macro perspective of coding culture and norms. Deviations in these distributions can reveal organisational preferences or aversions towards specific languages. Understanding these trends is critical for shaping future strategies in technology adoption, capability development, and training initiatives.


## Model Implementation
- Implementing the HBM using PyMC (imported above as `pymc`)
- Defining the model in PyMC
- Setting up the priors for each level of the hierarchy
- Incorporating the data into the model
- Model fitting (e.g., using MCMC methods)

## Model Diagnostics
- Checking model convergence (e.g., trace plots, R-hat statistics)
- Posterior predictive checks
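
To show what the R-hat check measures, here is a minimal sketch of the classic (non-split) Gelman-Rubin statistic for one parameter. It is for intuition only and assumes equal-length chains. Library implementations use refined variants, so their values can differ from this one. The simulated chains are illustrative.

```python
import random
from statistics import mean, variance


def gelman_rubin(chains):
    """Classic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [mean(chain) for chain in chains]
    within = mean([variance(chain) for chain in chains])  # W
    between = n * variance(chain_means)                   # B
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


rng = random.Random(1)

# Four chains sampling the same distribution: R-hat should be close to 1.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]

# Two chains stuck around a different value: R-hat should be well above 1.
stuck = [
    [rng.gauss(shift, 1.0) for _ in range(1000)]
    for shift in (0.0, 0.0, 3.0, 3.0)
]

rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print(f"mixed chains: {rhat_mixed:.3f}, stuck chains: {rhat_stuck:.3f}")
```

Values near 1 indicate that the chains agree with each other; clearly larger values suggest the sampler has not converged and the posterior summaries should not yet be trusted.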

## Results and Interpretation
- Extracting and summarizing the posterior distributions of model parameters
- Identifying significant factors and their impacts on language usage risk metrics
- Ridge plots for visualizing the distribution of language usage across projects and repositories, highlighting potential outliers or risks
- Additional plots for deeper insights (e.g., comparison of language usage trends across different organizational levels)

#### Key Considerations:

- **Credible Intervals and Outliers**: Narrow credible intervals indicate that a parameter is estimated with high certainty, while wide intervals indicate areas needing further investigation. Outlier detection can pinpoint innovative areas or non-standard practices for strategic exploration.

- **Decision-Making**: Insights from the HBM should inform strategic decisions regarding technology adoption, project management, and training to align coding practices with organizational goals, enhancing project continuity and adaptability.

This combined understanding of HBM principles, model structure, and practical implications equips readers with the knowledge to interpret complex data analyses meaningfully, driving informed decision-making within the organisation.

## Streamlit app