### Imports, Auth, and Config
* Imports the necessary Python libraries  
* Retrieves the GitHub API token from Colab's secret manager  
* Sets the base URL and request headers  
* Creates the `sample_data` directory to store the output files


In [None]:
import time, math, json, pathlib, requests, pandas as pd
from datetime import datetime
from google.colab import userdata

TOKEN = userdata.get('GH_TOKEN')
if not TOKEN:
  raise RuntimeError('Set GitHub Auth Token as GH_TOKEN via Colab Secrets before running')

BASE = 'https://api.github.com'
HEADERS = {
    "Authorization": f"Bearer {TOKEN}",
    "Accept":        "application/vnd.github+json",
    "X-GitHub-Api-Version": "2022-11-28"
}

OUTPUT_DIR = pathlib.Path('sample_data')
OUTPUT_DIR.mkdir(exist_ok=True)
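
Outside Colab, `google.colab.userdata` is not available. The following is a minimal sketch of a fallback, assuming the token is also exported as an environment variable named `GH_TOKEN`. The helper names `load_token` and `build_headers` and the environment-variable convention are assumptions for illustration, not part of the notebook.

```python
import os

def load_token(env=None):
    """Hypothetical helper: read GH_TOKEN from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return env.get("GH_TOKEN", "")

def build_headers(token):
    """Build the same three headers as the HEADERS dict, failing early on an empty token."""
    if not token:
        raise RuntimeError("Set GH_TOKEN before running")
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }

# "example-token" is a placeholder, not a real credential
headers = build_headers(load_token({"GH_TOKEN": "example-token"}))
print(headers["Authorization"])  # Bearer example-token
```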

### Core Data Extraction Function (`gh_get`)  
A reusable utility that handles every request in this project. It supports:  
* **Pagination:** Fetches successive pages until a short page arrives or `max_pages` is reached.  
* **Rate Limit Handling:** If the script hits a rate limit, the function waits until the reset time and then retries the same page.  
* **Error Retries:** Network errors and 5xx server responses are retried up to `max_retries` times with exponential backoff.

This function makes all the subsequent data-fetching calls much simpler and more reliable.
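
The two wait times described above can be sketched as pure functions, using the same formulas as `gh_get`. The helper names `backoff_delay` and `rate_limit_wait` are hypothetical, and the `now` argument exists only so the example is repeatable.

```python
import time

def backoff_delay(attempt, backoff_factor=2):
    """Exponential backoff: with factor 2, attempts 1, 2, 3 wait 1s, 2s, 4s."""
    return backoff_factor ** (attempt - 1)

def rate_limit_wait(headers, now=None):
    """Seconds to sleep until the rate-limit window resets (never less than 1)."""
    now = time.time() if now is None else now
    # X-RateLimit-Reset is a Unix timestamp in seconds; fall back to 60s if absent
    reset = int(headers.get("X-RateLimit-Reset", now + 60))
    return max(reset - now, 1)

print([backoff_delay(a) for a in (1, 2, 3)])                       # [1, 2, 4]
print(rate_limit_wait({"X-RateLimit-Reset": "1060"}, now=1000.0))  # 60.0
```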

In [None]:
def gh_get(
    url,
    *,
    params=None,
    per_page=100,
    max_pages=10,
    max_retries=3,
    backoff_factor=2,
    bar_desc='request'
):
    # Copy so the caller's dict is not mutated
    params = {**(params or {}), "per_page": per_page}
    all_rows, page = [], 1

    while True:
        resp = None  # reset per page so a previous page's response is never reused
        for attempt in range(1, max_retries+1):
            try:
                resp = requests.get(url, headers=HEADERS, params={**params, "page": page}, timeout=10)
            except requests.RequestException as e:
                wait = backoff_factor ** (attempt - 1)
                print(f"Network error ({e}), retrying in {wait}s… (attempt {attempt}/{max_retries})")
                time.sleep(wait)
                continue

            # Rate-limit hit? GitHub answers 403 or 429 with X-RateLimit-Remaining: 0
            if resp.status_code in (403, 429) and resp.headers.get("X-RateLimit-Remaining") == "0":
                reset = int(resp.headers.get("X-RateLimit-Reset", time.time()+60))
                wait = max(reset - time.time(), 1)
                print(f"Rate-limit reached, waiting {math.ceil(wait)}s")
                time.sleep(wait)
                # after sleep, retry same page
                continue

            # Retry on 5xx
            if 500 <= resp.status_code < 600:
                wait = backoff_factor ** (attempt - 1)
                print(f"Server error {resp.status_code}, retrying in {wait}s…")
                time.sleep(wait)
                continue

            # Other statuses: break out of retry loop
            break
        else:
            # Retries exhausted: fail loudly whether or not a response ever arrived
            if resp is None:
                raise RuntimeError(f"No response from {url} after {max_retries} attempts")
            resp.raise_for_status()

        #Auth/token issue
        if resp.status_code == 401:
            raise RuntimeError("Unauthorized: check your GH_TOKEN and its scopes")

        resp.raise_for_status()

        try:
            payload = resp.json()
        except ValueError:
            raise RuntimeError("Invalid JSON response")

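        if isinstance(payload, dict):
            # Search endpoints wrap results in "items"; any other object is a single record
            rows = payload["items"] if "items" in payload else [payload]
        else:
            rows = payload
        all_rows.extend(rows)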

        if len(rows) < per_page or page >= max_pages:
            break

        page += 1

    print(f"{bar_desc}: {len(all_rows)} rows")
    return all_rows
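
The pagination stop rule can be checked without touching the network. This is a minimal sketch, assuming a stand-in `fetch_page(page, per_page)` callable that returns one page of rows. Both `paginate` and `fake_fetch` are hypothetical names, not part of the notebook.

```python
def paginate(fetch_page, per_page=100, max_pages=10):
    """Collect rows page by page; stop on a short page or once max_pages is reached."""
    all_rows, page = [], 1
    while True:
        rows = fetch_page(page, per_page)
        all_rows.extend(rows)
        if len(rows) < per_page or page >= max_pages:
            break
        page += 1
    return all_rows

def fake_fetch(page, per_page, total=250):
    """Pretend the API holds `total` rows, numbered from 0."""
    start = (page - 1) * per_page
    return list(range(start, min(start + per_page, total)))

print(len(paginate(fake_fetch)))               # 250: the third page is short
print(len(paginate(fake_fetch, max_pages=2)))  # 200: capped by max_pages
```

When the total is an exact multiple of `per_page`, this rule issues one extra request that returns an empty page before stopping.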

### Task 1 - Search Public Repositories

This cell executes the first required task: searching for public repositories. It uses the `gh_get` function to query the `/search/repositories` endpoint.  
The search looks for repositories related to "marketing", sorts them by the number of stars in descending order, and fetches the first 60 results (2 pages of 30). The final list of repositories is saved to `search_repos_sample.json`.

In [None]:
search_params = {
    "q": "marketing",
    "sort": "stars",
    "order": "desc",
}
repos = gh_get(
    f"{BASE}/search/repositories",
    params=search_params,
    max_pages=2,
    per_page=30,
    bar_desc='search/repos')

# Use a context manager so the file is flushed and closed
with open(OUTPUT_DIR / "search_repos_sample.json", "w") as f:
    json.dump(repos, f, indent=4)


search/repos: 60 rows
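
The search endpoint wraps its results in an object with `total_count`, `incomplete_results`, and `items`, which is why `gh_get` reads the `items` key. Below is a small sketch with a hand-written payload. The repository names and star counts are made up for illustration.

```python
# Hand-written stand-in for one /search/repositories response
payload = {
    "total_count": 2,
    "incomplete_results": False,
    "items": [
        {"full_name": "example/alpha", "stargazers_count": 120},
        {"full_name": "example/beta", "stargazers_count": 45},
    ],
}

# Unwrap the result list, as gh_get does for search responses
rows = payload["items"] if isinstance(payload, dict) and "items" in payload else payload

summary = [(r["full_name"], r["stargazers_count"]) for r in rows]
print(summary)  # [('example/alpha', 120), ('example/beta', 45)]
```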


### Task 2 - Fetch Commits for a Repository

This cell handles the second task: fetching commit history. It uses `gh_get` to retrieve the 300 most recent commits (3 pages of 100) from the `microsoft/vscode` repository and saves them to `commits_sample.json`.

In [None]:
OWNER, REPO = 'microsoft', 'vscode'
commits = gh_get(f'{BASE}/repos/{OWNER}/{REPO}/commits',
                 max_pages=3)

# Use a context manager so the file is flushed and closed
with open(OUTPUT_DIR / 'commits_sample.json', 'w') as f:
    json.dump(commits, f, indent=4)

request: 300 rows
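
Each element of the commits list nests the author and message under a `commit` key. The sketch below summarizes data of that shape. The three records are invented stand-ins for the contents of `commits_sample.json`.

```python
from collections import Counter

# Invented records that follow the shape of the commits endpoint
commits = [
    {"sha": "a1", "commit": {"author": {"name": "Ada", "date": "2025-06-01T10:00:00Z"},
                             "message": "Fix typo"}},
    {"sha": "b2", "commit": {"author": {"name": "Lin", "date": "2025-06-02T09:30:00Z"},
                             "message": "Add test\n\nLonger description"}},
    {"sha": "c3", "commit": {"author": {"name": "Ada", "date": "2025-06-03T08:15:00Z"},
                             "message": "Update docs"}},
]

# Count commits per author name
by_author = Counter(c["commit"]["author"]["name"] for c in commits)
# The subject line is the first line of the commit message
subjects = [c["commit"]["message"].splitlines()[0] for c in commits]

print(by_author.most_common(1))  # [('Ada', 2)]
print(subjects)                  # ['Fix typo', 'Add test', 'Update docs']
```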


### Task 3 - List Repository Contents

This cell completes the third task: listing the contents of a repository. It gets the list of all files and folders in the root of the `pandas-dev/pandas` repository and saves the output to `contents_sample.json`.

In [None]:
OWNER, REPO = 'pandas-dev', 'pandas'
contents = gh_get(f"{BASE}/repos/{OWNER}/{REPO}/contents")

# Use a context manager so the file is flushed and closed
with open(OUTPUT_DIR / "contents_sample.json", "w") as f:
    json.dump(contents, f, indent=4)

request: 31 rows
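
Every entry in a contents listing carries a `type` field (`file` or `dir`, among others), so the saved list is easy to split. This is a sketch with invented entries standing in for `contents_sample.json`.

```python
# Invented entries that follow the shape of the contents endpoint
contents = [
    {"name": "README.md", "path": "README.md", "type": "file", "size": 1024},
    {"name": "src", "path": "src", "type": "dir", "size": 0},
    {"name": "setup.py", "path": "setup.py", "type": "file", "size": 512},
]

dirs = sorted(e["name"] for e in contents if e["type"] == "dir")
files = sorted(e["name"] for e in contents if e["type"] == "file")
total_bytes = sum(e["size"] for e in contents if e["type"] == "file")

print(dirs)         # ['src']
print(files)        # ['README.md', 'setup.py']
print(total_bytes)  # 1536
```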


### Just for fun

The cell below fetches the 100 most-starred machine-learning repos created since June 2024, keeps the 15 with the most stars, ranks those by **“⭐ per day”**, and displays them.

In [39]:
ml_repos = gh_get(
    f"{BASE}/search/repositories",
    params={
        "q": "topic:machine-learning created:>=2024-06-01",
        "sort": "stars",
        "order": "desc"
    },
    per_page=100,
    max_pages=1,
    bar_desc="ml repos"
)

df = (
    pd.json_normalize(ml_repos)
      .loc[:, ["full_name", "stargazers_count", "created_at", "html_url"]]
      .head(15)
      .copy()
)

df["created_at"] = pd.to_datetime(df["created_at"])
days_live = (pd.Timestamp.now(tz="UTC") - df["created_at"]).dt.days.clip(lower=1)
df["⭐ per day"] = (df["stargazers_count"] / days_live).round(1)
df["stargazers_count"] = df["stargazers_count"].apply(lambda x: f"{x:,}")
df.sort_values("⭐ per day", ascending=False, inplace=True)

from IPython.display import display
display(df.style.format({"⭐ per day": "{:.1f}"}).hide(axis="index"))

outfile = OUTPUT_DIR / "ml_repos_sample.json"
df.to_json(outfile, orient="records", indent=4, force_ascii=False)

ml repos: 100 rows


| full_name | stargazers_count | created_at | html_url | ⭐ per day |
|---|---|---|---|---|
| Olow304/memvid | 7563 | 2025-05-27 16:01:08+00:00 | https://github.com/Olow304/memvid | 504.2 |
| mediar-ai/screenpipe | 14994 | 2024-06-19 13:23:56+00:00 | https://github.com/mediar-ai/screenpipe | 42.0 |
| patchy631/ai-engineering-hub | 9695 | 2024-10-21 10:43:24+00:00 | https://github.com/patchy631/ai-engineering-hub | 41.6 |
| armankhondker/awesome-ai-ml-resources | 3294 | 2025-02-09 00:12:17+00:00 | https://github.com/armankhondker/awesome-ai-ml-resources | 27.0 |
| roboflow/rf-detr | 2228 | 2025-03-19 20:43:00+00:00 | https://github.com/roboflow/rf-detr | 26.8 |
| tensorzero/tensorzero | 6586 | 2024-07-16 21:00:53+00:00 | https://github.com/tensorzero/tensorzero | 20.0 |
| awslabs/agent-squad | 6012 | 2024-07-23 12:48:30+00:00 | https://github.com/awslabs/agent-squad | 18.6 |
| BragAI/bRAG-langchain | 2898 | 2024-11-16 07:41:36+00:00 | https://github.com/BragAI/bRAG-langchain | 14.0 |
| huggingface/speech-to-speech | 4056 | 2024-08-07 15:32:09+00:00 | https://github.com/huggingface/speech-to-speech | 13.2 |
| plexe-ai/plexe | 1953 | 2025-01-05 18:34:25+00:00 | https://github.com/plexe-ai/plexe | 12.4 |
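
The ⭐-per-day figure is the star count divided by whole days since creation, with a floor of one day. Below is a standard-library sketch of the same arithmetic. The function name `stars_per_day` is hypothetical, and the `as_of` timestamp is an assumption chosen for illustration, not the time the notebook actually ran.

```python
from datetime import datetime, timezone

def stars_per_day(stars, created_at, as_of):
    """Stars divided by whole days live (at least 1), rounded to one decimal."""
    days_live = max((as_of - created_at).days, 1)
    return round(stars / days_live, 1)

created = datetime(2025, 5, 27, 16, 1, 8, tzinfo=timezone.utc)
as_of = datetime(2025, 6, 12, tzinfo=timezone.utc)  # hypothetical reference time

# 15 whole days have elapsed, so 7563 / 15 = 504.2
print(stars_per_day(7563, created, as_of))  # 504.2
```

The one-day floor keeps a repository created within the last 24 hours from causing a division by zero.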
