# Automated Data Science Asset Register

[Code adapted from https://github.com/moj-analytical-services/data-science-assets]

This notebook uses the GitHub API to pull information from the repos listed in `assets.yaml`. \
The API call extracts the first `yaml` block from each repo's README and uses it to populate a dataframe. \
The GOV.UK Notify service is then used to send reminder emails for projects past their review date. \
The dataframe is formatted as HTML and saved. The ambition is to automate adding this to SharePoint, subject to permissions being granted.

In [60]:
# Load packages
from github import Github
import yaml
import re
import pandas as pd

from ai_nexus_backend.github_api import get_readme_content

In [92]:
print(yaml.__version__)

6.0.2


In [85]:

def extract_yaml_from_md(md_content: str) -> dict:
    """
    Extract the first YAML block from Markdown content string.

    If several YAML code blocks are included within the README, only the
    first will be returned. Will not match YAML code blocks with
    ```{yaml}...``` syntax.

    Parameters
    ----------
    md_content : str
        A string containing Markdown content, which may include YAML code
        blocks.

    Returns
    -------
    dict
        A dictionary containing the parsed YAML content with keys in
        lowercase.

    Raises
    ------
    ValueError
        If no YAML block is found in the provided Markdown content, or if
        the block does not parse to a mapping.
    yaml.YAMLError
        If there is an error parsing the YAML content.

    """
    # Regular expression pattern to match the FIRST fenced YAML block
    # won't match curly braces: https://regex101.com/r/oYHdwB/1
    # though curly braces are valid github MD YAML blocks:
    # https://github.com/r-leyshon/example_yaml-_metadata
    yaml_block_pattern = re.compile(r"```yaml([\s\S]*?)```")

    # Search for the first YAML block
    match = yaml_block_pattern.search(md_content)
    if match is None:
        raise ValueError("No YAML found in `md_content`")

    yaml_content = match.group(1).strip()
    try:
        # Parse the first YAML content block
        yam = yaml.safe_load(yaml_content)
    except yaml.YAMLError as e:
        # Chain the original error so the parser's detail is preserved
        raise yaml.YAMLError(f"Error parsing YAML content: {e}") from e

    # safe_load returns None, a list or a scalar when the block is not a mapping
    if not isinstance(yam, dict):
        raise ValueError("YAML block in `md_content` is not a mapping")
    return {str(k).lower(): v for k, v in yam.items()}
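
The docstring above notes that the pattern skips the `{yaml}` fence style. As a rough sketch of one way to accept both styles, the pattern below uses an alternation on the fence label. `BOTH_FENCES` and `find_first_yaml_block` are hypothetical names that are not part of the original notebook, and the function returns the raw block text rather than parsing it.

```python
import re

# Assumption: a fence label is either `yaml` or `{yaml}`, followed by a newline.
BOTH_FENCES = re.compile(r"```(?:yaml|\{yaml\})[ \t]*\n([\s\S]*?)```")


def find_first_yaml_block(md_content: str) -> str:
    """Return the text of the first YAML fence, whichever label style it uses."""
    match = BOTH_FENCES.search(md_content)
    if match is None:
        raise ValueError("No YAML found in `md_content`")
    return match.group(1).strip()
```

The returned string could then be passed to `yaml.safe_load` in the same way as in `extract_yaml_from_md`.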

In [63]:
# read in api tokens

import dotenv

secrets = dotenv.dotenv_values("../.env")
user_agent = secrets["AGENT"]
pat = secrets["PAT"] # TODO: Implement OAuth
org_nm1 = secrets["ORG_NM1"]
org_nm2 = secrets["ORG_NM2"]

github_api_token = pat  # do not echo the token: cell outputs are saved with the notebook

## GitHub scrape

In [86]:
test_url = "https://github.com/r-leyshon/example_yaml-_metadata"
readme = get_readme_content(test_url, pat, user_agent)
out = extract_yaml_from_md(readme) # only takes the first YAML chunk
out

{'name': 'Test1',
 'category': 'App',
 'description': 'Some tricky /description',
 'impact': 'War and peace',
 'g6 lead': 'Designated lead',
 'sro': 'Designated manager',
 'technical lead': 'Designated tech lead',
 'business lead': 'Designated BA',
 'last review date': 'Oct-24',
 'next review date': 'Oct-25',
 'outage impact': 'Green',
 'maintenance (fte)': '1',
 'documentation': 'https://some-url',
 'contact': 'some_email@anywhere.co.uk'}
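
The reminder emails depend on knowing whether a project is past its review date. A minimal sketch of that check for values shaped like `'Oct-25'` is shown here. `is_review_overdue` is a hypothetical helper, and it rests on two assumptions that the notebook does not state: two-digit years fall in the 2000s, and a review only becomes overdue once the named month has ended.

```python
from datetime import date

# English month abbreviations, mapped by hand to avoid locale-dependent parsing.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


def is_review_overdue(next_review: str, today: date) -> bool:
    """Return True if `today` falls after the month named in `next_review`."""
    month_abbr, two_digit_year = next_review.strip().split("-")
    # Assumption: "25" means 2025.
    review_period = (2000 + int(two_digit_year), MONTHS[month_abbr.lower()])
    return (today.year, today.month) > review_period
```

With the metadata above, `is_review_overdue(out["next review date"], date.today())` would flag the project once October 2025 has passed.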

In [88]:
readme

'# example_yaml-_metadata\n\ntesting metadata ingest from readme\n\n```yaml\nName: "Test1"\nCategory: "App"\nDescription: "Some tricky /description"\nImpact: "War and peace"\nG6 lead: "Designated lead"\nSRO: "Designated manager"\nTechnical lead: "Designated tech lead"\nBusiness lead: "Designated BA"\nLast review date: "Oct-24"\nNext review date: "Oct-25"\nOutage Impact: "Green"\nMaintenance (FTE): "1"\nDocumentation: "https://some-url"\nContact: "some_email@anywhere.co.uk"\n```\nAdditional README content...\n\nWhat about this:\n\n```{yaml}\nName: "Test2"\nCategory: "App"\nDescription: "Some tricky /description"\nImpact: "War and peace"\nG6 lead: "Designated lead"\nSRO: "Designated manager"\nTechnical lead: "Designated tech lead"\nBusiness lead: "Designated BA"\nLast review date: "Oct-24"\nNext review date: "Oct-25"\nOutage Impact: "Green"\nMaintenance (FTE): "1"\nDocumentation: "https://some-url"\nContact: "some_email@anywhere.co.uk"\n\n```\n'
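
The README above holds two metadata blocks, and the second uses the `{yaml}` fence style, so `extract_yaml_from_md` never sees it. If every block were wanted, one option is `re.finditer`, sketched below. `ANY_YAML_FENCE`, `find_all_yaml_blocks` and the shortened `sample` string are illustrative assumptions, not part of the original code.

```python
import re

# Assumption: a fence label is either `yaml` or `{yaml}`, followed by a newline.
ANY_YAML_FENCE = re.compile(r"```(?:yaml|\{yaml\})[ \t]*\n([\s\S]*?)```")


def find_all_yaml_blocks(md_content: str) -> list[str]:
    """Return the text of every YAML fence, in document order."""
    return [m.group(1).strip() for m in ANY_YAML_FENCE.finditer(md_content)]


# Shortened stand-in for the README string shown above.
sample = (
    "# demo\n\n```yaml\nName: \"Test1\"\n```\n"
    "text\n\n```{yaml}\nName: \"Test2\"\n\n```\n"
)
blocks = find_all_yaml_blocks(sample)
```

Each entry in `blocks` is still raw text, so it would need parsing before it could populate a dataframe row.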