This repository contains two main files that work together to automate the collection and analysis of Pull Request (PR) data from GitHub repositories. These tools aim to provide insights into PR metrics such as approval time, review patterns, and SLA compliance.
This project leverages Databricks and Apache Airflow to retrieve and process GitHub Pull Request data. It integrates with the GitHub API to analyze repository activity, providing detailed metrics such as:
- Time taken for PR approval after specific labels are applied.
- PR review comments, changes, and requests count.
- SLA metrics and other actionable insights.
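As an illustration of the first metric, the sketch below derives time-to-approval from GitHub's issue-events and PR-reviews payloads. It is a hedged illustration, not the project's actual code: the label name and helper functions are assumptions, while the payload shapes follow the public GitHub REST API.

```python
# Sketch: seconds between a "labeled" event and the first approving review.
# Payload shapes follow the public GitHub REST API; the label name is an assumption.
from datetime import datetime

def parse_ts(ts: str) -> datetime:
    # GitHub timestamps are ISO 8601 in UTC, e.g. "2024-05-01T12:30:00Z".
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")

def time_to_approval(events: list[dict], reviews: list[dict],
                     label: str = "ready-for-review") -> int | None:
    labeled = [e for e in events
               if e.get("event") == "labeled" and e["label"]["name"] == label]
    approvals = [r for r in reviews if r.get("state") == "APPROVED"]
    if not labeled or not approvals:
        return None  # label never applied or PR never approved
    start = parse_ts(labeled[0]["created_at"])
    end = parse_ts(approvals[0]["submitted_at"])
    return int((end - start).total_seconds())
```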
The Databricks notebook uses PySpark and the GitHub API to extract, process, and analyze PR data.
- Retrieves closed PRs from a specified GitHub repository.
- Calculates SLA and time-to-approval metrics.
- Filters and processes data into a structured Spark DataFrame for further analysis.
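A minimal sketch of the extraction step, assuming the `requests` library; the endpoint and query parameters follow the public GitHub REST API, while the function name is illustrative:

```python
# Sketch: fetch up to `itens` closed PRs for a repository via the GitHub REST API.
import requests

def fetch_closed_prs(github_token: str, perfil: str, repositorio: str,
                     itens: int = 150) -> list[dict]:
    url = f"https://api.github.com/repos/{perfil}/{repositorio}/pulls"
    headers = {
        "Authorization": f"Bearer {github_token}",
        "Accept": "application/vnd.github+json",
    }
    prs: list[dict] = []
    page = 1
    while len(prs) < itens:
        resp = requests.get(url, headers=headers,
                            params={"state": "closed", "per_page": 100, "page": page})
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break  # no more pages
        prs.extend(batch)
        page += 1
    return prs[:itens]
```

The fields used downstream can then be flattened into rows and loaded into a structured Spark DataFrame with `spark.createDataFrame`.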
Parameters:
- `github_token`: Personal access token for GitHub authentication.
- `repositorio`: Repository name passed as a parameter.
- `perfil`: GitHub workspace name.
- `itens`: Number of PRs to fetch, limited to 150 by default.
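On Databricks, parameters like these are typically exposed as notebook widgets; a minimal sketch using the names above (the default values are assumptions):

```python
# Read notebook parameters as Databricks widgets (names from the list above).
dbutils.widgets.text("github_token", "")
dbutils.widgets.text("repositorio", "")
dbutils.widgets.text("perfil", "")
dbutils.widgets.text("itens", "150")  # assumed default

github_token = dbutils.widgets.get("github_token")
repositorio = dbutils.widgets.get("repositorio")
perfil = dbutils.widgets.get("perfil")
itens = min(int(dbutils.widgets.get("itens")), 150)  # enforce the documented 150-PR limit
```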
The Airflow DAG orchestrates the execution of the Databricks notebook for multiple repositories.
- Dynamically iterates through a list of repositories.
- Configures a cooldown function to manage GitHub API rate limits.
- Handles retry policies and execution timeouts.
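A condensed sketch of this orchestration pattern, assuming Airflow 2.x; the repository list, DAG id, cooldown length, and timing values are illustrative assumptions:

```python
# Sketch: iterate over repositories, pausing between runs to respect rate limits.
import time
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

REPOSITORIES = ["repo-a", "repo-b"]  # hypothetical repository list

def cooldown(seconds: int = 60) -> None:
    # Pause between repositories so successive runs stay under GitHub API rate limits.
    time.sleep(seconds)

with DAG(
    dag_id="github_pr_metrics",  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
    default_args={
        "retries": 2,                                # retry policy
        "retry_delay": timedelta(minutes=5),
        "execution_timeout": timedelta(minutes=30),  # per-task timeout
    },
) as dag:
    previous = None
    for repo in REPOSITORIES:
        wait = PythonOperator(task_id=f"cooldown_{repo}", python_callable=cooldown)
        # The DatabricksSubmitRunOperator for `repo` would follow here (see below).
        if previous is not None:
            previous >> wait
        previous = wait
```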
Parameters:
- `git_url`: GitHub repository URL for the Databricks notebook source.
- `git_branch`: Branch to fetch the notebook from.
- `dbw_iris_restapi_cluster_id`: Databricks cluster ID for execution.
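These parameters would feed a `DatabricksSubmitRunOperator` roughly as follows; the notebook path, task id, and connection id are assumptions:

```python
# Sketch: run the notebook from a Git source on an existing cluster.
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

git_url = "https://github.com/<org>/<repo>"   # placeholder
git_branch = "main"                           # placeholder
dbw_iris_restapi_cluster_id = "<cluster_id>"  # placeholder

run_notebook = DatabricksSubmitRunOperator(
    task_id="run_pr_metrics_notebook",        # hypothetical task id
    databricks_conn_id="databricks_default",  # hypothetical connection id
    git_source={
        "git_url": git_url,
        "git_provider": "gitHub",
        "git_branch": git_branch,
    },
    notebook_task={
        "notebook_path": "notebooks/pr_metrics",  # hypothetical path in the repo
        "base_parameters": {"repositorio": "<repo>", "perfil": "<perfil>", "itens": "150"},
    },
    existing_cluster_id=dbw_iris_restapi_cluster_id,
)
```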
- Databricks:
  - Configure a cluster and retrieve its ID (`cluster_id`).
- Apache Airflow:
  - Install Airflow with the required provider: `apache-airflow-providers-databricks`.
- GitHub API Token:
  - Generate a Personal Access Token in GitHub and store it securely (e.g., in Azure Key Vault).
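Inside the notebook, the token can then be read from a Databricks secret scope backed by Key Vault; the scope and key names below are assumptions:

```python
# Sketch: fetch the token from a secret scope instead of hard-coding it.
# Scope and key names are hypothetical; the scope would be backed by Azure Key Vault.
github_token = dbutils.secrets.get(scope="keyvault-scope", key="github-token")
```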
```bash
git clone <repository_url>
cd <repository_folder>
```
```bash
pip install apache-airflow apache-airflow-providers-databricks
```
Ensure PySpark and other dependencies are available on Databricks. If you use an environment manager (e.g., conda), activate it first, for example with `conda activate <your_environment_name>`.
Replace placeholders with your actual values:
- Replace `github_token` with your GitHub personal access token.
- Replace `perfil` with your GitHub workspace name.
- `git_url`: URL of the GitHub repository hosting the notebook.
- `git_branch`: Branch containing the notebook.
- `dbw_iris_restapi_cluster_id`: ID of your Databricks cluster.
```bash
cp airflow_dag.py <airflow_dags_folder>
```
```bash
airflow scheduler &
airflow webserver
```
Access the Airflow web UI to enable and trigger the DAG.
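The same can be done from the command line; the DAG id below is an assumption:

```bash
airflow dags unpause github_pr_metrics
airflow dags trigger github_pr_metrics
```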
| repository | created_at | merged_at | approver | pr_id | labels | sla (seconds) | num_comments | num_changes | ... |
|------------|------------|-----------|----------|-------|--------|---------------|--------------|-------------|-----|
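For downstream analysis, the output can be aggregated with PySpark; the sketch below assumes the DataFrame is named `df` and the SLA column is named `sla` with values in seconds, per the schema above:

```python
# Sketch: average SLA and total review activity per repository.
from pyspark.sql import functions as F

summary = (
    df.groupBy("repository")
      .agg(
          F.avg("sla").alias("avg_sla_seconds"),          # SLA values are in seconds
          F.sum("num_comments").alias("total_comments"),
          F.sum("num_changes").alias("total_changes"),
      )
)
summary.show()
```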
The Airflow DAG logs include:
- Task execution details
- Cooldown handling for GitHub API rate limits
- Error logs for troubleshooting