## Repocop prototyping notebook

> **Warning**
> This notebook writes to the cloudquery database. Proceed with extreme caution.

> **Warning**
> Please clear all your outputs and restart the kernel before committing.

### Prerequisites
- Have deployTools credentials
- Be connected to the VPN

### Summary
This notebook connects to the cloudquery database, evaluates resources (that is, GitHub repositories) against best-practice rules for those resources, and writes the results back to the database as a table. Other tools can then use that table to notify teams about actions they need to take.

In [None]:
import sqlalchemy as sa
import boto3
from database_interactions import connect_to_db

#create the database connection
session = boto3.Session(profile_name='deployTools')
conn = connect_to_db(session)
engine = sa.create_engine('postgresql://', creator=lambda: conn)

Here we create our first intermediate dataframe, assessing rule REPOSITORY-01, which says that the default branch for all repositories should be main.

N.B. We have not yet exempted archived repositories from any of these rules in code. A decision about where to handle them has yet to be made.

In [None]:
from database_interactions import select
import pandas as pd

#default_branch_df needs to be a dataframe with the following columns:
# - full_name (string)
# - default_branch (string)
def repository_01(default_branch_df: pd.DataFrame) -> pd.DataFrame:
    """Repository 01: Default branch should be main"""
    # Build a new dataframe rather than adding a column to the caller's dataframe
    result = default_branch_df[['full_name']].copy()
    result['repository_01'] = default_branch_df['default_branch'] == 'main'
    return result
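
Because the rule only needs a dataframe with `full_name` and `default_branch` columns, it can be tried without touching the database. The snippet below is a minimal sketch of that idea: it restates the rule so it runs on its own, and the repository names are made up for illustration.

```python
import pandas as pd


def repository_01(default_branch_df: pd.DataFrame) -> pd.DataFrame:
    """Repository 01: Default branch should be main (restated so this snippet runs standalone)."""
    result = default_branch_df[['full_name']].copy()
    result['repository_01'] = default_branch_df['default_branch'] == 'main'
    return result


# Hypothetical repositories, invented for this example only
sample_df = pd.DataFrame({
    'full_name': ['guardian/example-a', 'guardian/example-b'],
    'default_branch': ['main', 'master'],
})

sample_result = repository_01(sample_df)
print(sample_result)
```

Each row of `sample_result` carries the repository name and a boolean saying whether it passes the rule.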


In [None]:
default_branch_df = select('github_repositories',['full_name', 'default_branch'], engine) #this includes archived repos. make a decision about this
repository_01_df = repository_01(default_branch_df)
repository_01_df.head()

Here we assess rule REPOSITORY-06, which is significantly more complicated. It states that all repositories owned by one or more P+E teams require a valid production status topic. Repos owned exclusively by teams outside of P+E are exempt from this requirement. The most common example of this is interactives repositories.

Production statuses recognised by DevX are enumerated in the `guardian_production_status` table.
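
To make the rule's logic concrete, here is a rough sketch of how it could be expressed in pandas. This is not the `repository_06` function from `rules`, whose implementation may differ. It assumes that `github_team_name` is the column compared against the non-P+E team list, that `topics` holds a list of strings per repository, and that `repo_name` and `full_name` use the same naming. The function name, team names, topics and repositories are all hypothetical.

```python
import pandas as pd


def _has_valid_topic(topics, valid_topics):
    # Assumption: topics is a list of strings; anything else counts as "no topics"
    if not isinstance(topics, (list, tuple)):
        return False
    return any(topic in valid_topics for topic in topics)


def repository_06_sketch(ownership_df, topics_df, topics_list, non_pe_teams_list):
    """Hypothetical stand-in for rules.repository_06, for illustration only."""
    # Assumption: github_team_name is the column matched against the non-P+E list
    owned = ownership_df.assign(
        is_pe=~ownership_df['github_team_name'].isin(non_pe_teams_list)
    )
    # A repo is in scope if at least one of its owning teams is a P+E team
    in_scope = owned.groupby('repo_name')['is_pe'].any().rename('in_scope').reset_index()

    valid_topics = set(topics_list)
    status_df = pd.DataFrame({
        'repo_name': topics_df['full_name'],
        'has_status': topics_df['topics'].apply(_has_valid_topic, args=(valid_topics,)),
    })

    merged = in_scope.merge(status_df, how='left', on='repo_name')
    # Repos missing from topics_df come back as NaN; treat them as having no status
    merged['has_status'] = merged['has_status'].eq(True)
    # Exempt repos (no P+E owner) pass; in-scope repos pass only with a valid topic
    merged['repository_06'] = ~merged['in_scope'] | merged['has_status']
    return merged[['repo_name', 'repository_06']]


# Made-up data for illustration
sketch_ownership_df = pd.DataFrame({
    'repo_name': ['guardian/app-a', 'guardian/app-b', 'guardian/app-b', 'guardian/interactive-c'],
    'github_team_name': ['team-pe', 'team-pe', 'team-interactives', 'team-interactives'],
})
sketch_topics_df = pd.DataFrame({
    'full_name': ['guardian/app-a', 'guardian/app-b', 'guardian/interactive-c'],
    'topics': [['production'], ['kotlin'], []],
})
sketch_result = repository_06_sketch(
    sketch_ownership_df, sketch_topics_df, ['production', 'testing'], ['team-interactives']
)
print(sketch_result)
```

In this toy data, `guardian/app-b` is co-owned by a P+E team and has no status topic, so it is the only repository that fails. `guardian/interactive-c` passes because none of its owners are P+E teams.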

In [None]:
from rules import repository_06

#gather all the required data for repository_06
valid_topics_df = select('guardian_production_status', ['status', 'priority'], engine)
topics_list = valid_topics_df['status'].tolist()
non_pe_teams_list = select('guardian_non_p_and_e_github_teams', ['team_name'], engine)['team_name'].tolist()
topics_df = select('github_repositories', ['full_name', 'topics'], engine)
teams_df = select('github_teams', ['name', 'slug'], engine)
#select function doesn't work on views, so we have to use read_sql_query
ownership_df = pd.read_sql_query("select repo_name, github_team_name, github_team_id from view_repo_ownership", con=conn)
ownership_df = ownership_df.merge(teams_df, how='left', left_on='github_team_name', right_on='name')[['repo_name', 'github_team_name', 'slug']]

In [None]:
import matplotlib.pyplot as plt
freq = ownership_df['github_team_name'].value_counts()
freq[:15].plot(kind='bar', title='Teams that own the most repos', xlabel='team name', ylabel='Count')
plt.show()

In [None]:
repository_06_df = repository_06(ownership_df, topics_df, topics_list, non_pe_teams_list)
freq = repository_06_df['repository_06'].value_counts()
freq.plot.pie(subplots=True, figsize=(5, 4), title='repository_06')

Here, we merge the interim rule tables into one table, keyed on repository name, and write that information back to the database. 

### Please exercise extreme caution before modifying this section as we have not set up a local development environment yet.

In [None]:
repository_rule_df = repository_01_df.merge(repository_06_df, how='left', left_on='full_name', right_on='repo_name')[['full_name', 'repository_01', 'repository_06']]
repository_rule_df.head()
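
One consequence of the left merge is worth keeping in mind: any repository with no matching row in the second table ends up with a missing value (`NaN`) in that rule's column, rather than `True` or `False`. The sketch below shows this on made-up data; the dataframe and repository names are hypothetical.

```python
import pandas as pd

# Made-up rule tables: the second one has no row for guardian/example-b
toy_rule_01_df = pd.DataFrame({
    'full_name': ['guardian/example-a', 'guardian/example-b'],
    'repository_01': [True, False],
})
toy_rule_06_df = pd.DataFrame({
    'repo_name': ['guardian/example-a'],
    'repository_06': [True],
})

toy_merged_df = toy_rule_01_df.merge(
    toy_rule_06_df, how='left', left_on='full_name', right_on='repo_name'
)[['full_name', 'repository_01', 'repository_06']]

# Rows where the second rule was never evaluated show up as missing values
toy_unevaluated_df = toy_merged_df[toy_merged_df['repository_06'].isna()]
print(toy_unevaluated_df)
```

Checking for missing values like this before writing the table may help distinguish "failed the rule" from "was never evaluated".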

In [None]:
repository_rule_df.to_sql('repocop_github_repository_rules', engine, if_exists='replace', index=False)
pd.read_sql_query("select * from repocop_github_repository_rules", con=conn).head()
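
Since there is no local development environment yet, one possible way to rehearse the write step is against a throwaway in-memory SQLite database. This is only a sketch under that assumption: SQLite is not PostgreSQL, so column types can differ (for example, SQLite stores booleans as integers), and the sample rows are invented.

```python
import sqlite3

import pandas as pd

# Made-up rule results standing in for repository_rule_df
sample_rules_df = pd.DataFrame({
    'full_name': ['guardian/example-a', 'guardian/example-b'],
    'repository_01': [True, False],
    'repository_06': [True, True],
})

# An in-memory database that disappears when the connection is closed
scratch_conn = sqlite3.connect(':memory:')
sample_rules_df.to_sql('repocop_github_repository_rules', scratch_conn, if_exists='replace', index=False)
round_trip_df = pd.read_sql_query('select * from repocop_github_repository_rules', con=scratch_conn)
scratch_conn.close()

print(round_trip_df)
```

Nothing here touches the cloudquery database, so the shape of the output table can be checked before running the real write.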