# Pull Requests Data Collection

In this notebook, we collect the raw pull request (PR) data from a GitHub repository of our choice and save it as a JSON file in an S3 bucket. To do this, we use the `srcopsmetrics` tool developed by the Thoth team.

In [4]:
import os

from dotenv import find_dotenv, load_dotenv

load_dotenv(find_dotenv(), override=True)

True

## Set Environment variables

Before you start, create a `.env` file at the root of the repository with the required credentials and paths. Refer to the workshop section **"Set your Environment Variables"** for details on how to set it up, and to this example [file](env_example_workshop) for the structure.
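Since a missing entry in the `.env` file only surfaces later as a confusing path or a failed collection, it may help to check the variables up front. The snippet below is a minimal sketch: the helper `missing_env_vars` is hypothetical (not part of `srcopsmetrics` or `python-dotenv`), and the list covers only the three variables this notebook reads explicitly. Any credentials the workshop instructions require can be appended to it.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty in `environ`."""
    # Default to the real process environment; a plain dict can be passed for testing.
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


# Variable names read by this notebook (assumption: extend with any
# credential variables your own .env file is expected to define).
REQUIRED_VARS = ["GITHUB_ORG", "GITHUB_REPO", "CEPH_BUCKET_PREFIX"]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this right after `load_dotenv(...)` makes it clear whether the `.env` file was actually picked up before any data is requested.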

In [5]:
# get the org/repo from env vars
ORG = os.getenv("GITHUB_ORG")
REPO = os.getenv("GITHUB_REPO")
CEPH_BUCKET_PREFIX = os.getenv("CEPH_BUCKET_PREFIX")

print(f"{ORG}/{REPO}")
print(
    f"Downloaded file is being stored at {CEPH_BUCKET_PREFIX}/srcopsmetrics/bot_knowledge/{ORG}/{REPO}/PullRequest.json"
)

operate-first/support
Downloaded file is being stored at mobeloper/srcopsmetrics/bot_knowledge/operate-first/support/PullRequest.json
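The cell that follows runs the collection through IPython's `!` shell escape, which only works inside a notebook. As a rough sketch of the same invocation from plain Python, the command can be assembled as an argument list and passed to `subprocess.run`. The helper name `build_collect_command` is hypothetical; the module name and flags are copied from the notebook command, not taken from the tool's documentation.

```python
import subprocess
import sys


def build_collect_command(org, repo, entities="PullRequest"):
    """Assemble the srcopsmetrics CLI call used in this notebook as an argv list."""
    return [
        sys.executable,           # the same interpreter that runs this script
        "-m", "srcopsmetrics.cli",
        "--create-knowledge",
        "--repository", f"{org}/{repo}",
        "--entities", entities,
    ]


cmd = build_collect_command("operate-first", "support")
print(" ".join(cmd[1:]))

# To actually run the collection (requires srcopsmetrics and valid credentials):
# subprocess.run(cmd, check=True)
```

Passing a list rather than a single shell string avoids quoting problems if an organization or repository name contains unusual characters.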


In [6]:
# run collection on the org/repo specified
!python -m srcopsmetrics.cli --create-knowledge --repository $ORG/$REPO --entities PullRequest

INFO:srcopsmetrics.github_knowledge:Overall repositories found: 1
INFO:srcopsmetrics.bot_knowledge:######################## Analysing operate-first/support ########################

INFO:srcopsmetrics.utils:No repo identified, creating new directory at /opt/app-root/src/ocp-ci-analysis/notebooks/time-to-merge-prediction/workshop/srcopsmetrics/bot_knowledge/operate-first/support
INFO:srcopsmetrics.bot_knowledge:########################
INFO:srcopsmetrics.bot_knowledge:Detected entities:
CodeFrequency # Commit # DependencyUpdate # Fork # Issue # IssueEvent # KebechetUpdateManager # License # PullRequest # PullRequestDiscussion # RawIssue # RawPullRequest # ReadMe # Release # Stargazer # ThothAdviseMetrics # ThothMetrics # ThothMetrics # ThothMetrics # ThothVersionManagerMetrics # TrafficClones # TrafficPaths # TrafficPaths # TrafficReferrers # TrafficClones # TrafficViews
INFO:srcopsmetrics.bot_knowledge:########################
INFO:srcopsmetrics.bot_knowledge:PullRequest inspection
INF
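The log above mentions a directory of the form `srcopsmetrics/bot_knowledge/<org>/<repo>` being created. Assuming a local copy of `PullRequest.json` ends up there (whether the file is kept locally or only uploaded to the bucket depends on how the tool is configured), a quick sanity check could look like the sketch below. Both helper names are hypothetical, and no assumption is made about the internal structure of the JSON beyond it being a dictionary or a list.

```python
import json
from pathlib import Path


def knowledge_path(org, repo, entity="PullRequest", base="."):
    """Build the path of a collected entity file (layout assumed from the log output)."""
    return Path(base) / "srcopsmetrics" / "bot_knowledge" / org / repo / f"{entity}.json"


def load_knowledge(path):
    """Return the parsed JSON at `path`, or None if the file does not exist."""
    path = Path(path)
    if not path.exists():
        return None
    with path.open(encoding="utf-8") as f:
        return json.load(f)


data = load_knowledge(knowledge_path("operate-first", "support"))
if data is None:
    print("No local PullRequest.json found")
else:
    print(f"Loaded {len(data)} top-level entries")
```

If the file is missing locally, the data may still be present in the bucket under the prefix printed earlier.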

# Conclusion

By running this notebook, we have collected the GitHub PR data and stored it in our S3 bucket. It is now ready for the cleaning and feature engineering steps of the ML workflow.

# Next Step

In the next [notebook](./02_feature_engineering.ipynb), we will engineer features from the raw PR data that can be used to train an ML model.