Fetching candidate URLs from GitHub, PyPI, CRAN, and finally Google, with the first three domains excluded from the Google results.

The notebook defines four source-specific fetchers (GitHub, PyPI, CRAN, and general Google search), a handful of URL-normalization and deduplication utilities, and simple cache loading/saving routines. At its heart is fetch_candidate_urls(name: str) -> set[str], which takes a software name and returns a deduplicated set of URLs by calling, in order:

(1) fetch_github_urls(name), which queries GitHub's Search Repositories API with a "{name} in:name" filter, sorts results by star count in descending order (so the most popular repositories come first), and retries up to a configurable limit when rate-limited;
(2) fetch_pypi_urls(name), which first attempts an exact JSON API lookup (https://pypi.org/pypi/{pkg}/json) to retrieve the package URL, then falls back on a name search over PyPI's XML-RPC interface, resolving each hit through the JSON API in the order the search returns them, until it reaches its max_results cap;
(3) fetch_cran_urls(name), which loads and caches CRAN's master PACKAGES index, returns an exact package-page link if available, then adds case-insensitive substring matches, and finally uses difflib.get_close_matches (with a 0.6 cutoff) to find near names sorted by match quality, up to its limit; and
(4) fetch_google_urls(name), which issues Google Custom Search API calls (rotating through multiple API keys, paginating with start/num parameters), excludes known domains (GitHub, PyPI, CRAN, plus a list of social-media and tutorial sites) to avoid duplicates and noise, and returns the top results by Google's default relevance.

After collecting all candidates, it converts the combined list into a Python set, which removes duplicates but does not preserve cross-source ranking, and returns it. Finally, update_candidate_cache takes the corpus DataFrame (read from an Excel file), merges the fetched URLs with any cache of previously found candidates, and saves the cache as JSON; later cells deduplicate the cached URLs and write them back into the Excel corpus.

In [55]:
%pip install requests



In [56]:
import requests
from typing import List, Dict, Set
import os
import pandas as pd
import difflib
import time
import json
from collections import defaultdict


In [57]:
%pip install googlesearch-python beautifulsoup4



In [58]:
from googlesearch import search
from bs4 import BeautifulSoup
import re
from urllib.parse import urljoin
import xmlrpc.client
from functools import lru_cache





In [59]:


GITHUB_API_URL = "https://api.github.com/search/repositories"
GITHUB_TOKEN = os.getenv("GITHUB_TOKEN")
if not GITHUB_TOKEN:
    raise ValueError("Please set the GITHUB_TOKEN environment variable.")

HEADERS = {
    "Authorization": f"token {GITHUB_TOKEN}",
    "Accept":        "application/vnd.github.v3+json",
    "User-Agent":    "my-software-disambiguator"  # any non-empty string
}

def fetch_github_urls(
    name: str,
    per_page: int = 5,
    max_retries: int = 3
) -> List[str]:
    """
    Return up to `per_page` GitHub repo URLs matching `name`, handling rate limits.
    """
    params = {
        "q":        f"{name} in:name",
        "sort":     "stars",
        "order":    "desc",
        "per_page": per_page
    }

    for attempt in range(1, max_retries + 1):
        resp = requests.get(GITHUB_API_URL, params=params, headers=HEADERS, timeout=10)
        # 403 could be a rate-limit on the Search API
        if resp.status_code == 403:
            reset_ts = int(resp.headers.get("X-RateLimit-Reset", time.time() + 60))
            wait = max(reset_ts - time.time(), 1)
            print(f"[Attempt {attempt}] Rate limited. Sleeping {int(wait)}s until reset…")
            time.sleep(wait)
            continue

        # a 401 means bad token, 404 would be weird, anything else we raise
        resp.raise_for_status()
        items = resp.json().get("items", [])
        return [item["html_url"] for item in items]

    # If we exhausted retries
    raise RuntimeError(f"GitHub search for '{name}' failed after {max_retries} attempts (last status: {resp.status_code})")
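
The retry loop above derives its sleep time from X-RateLimit-Reset only. As a minimal sketch, the wait computation could be pulled into a small pure function that also honors a Retry-After header when one is present. The helper name rate_limit_wait and the 60-second default are assumptions for illustration, not part of the notebook.

```python
import time
from typing import Mapping, Optional


def rate_limit_wait(
    headers: Mapping[str, str],
    now: Optional[float] = None,
    default: float = 60.0,
) -> float:
    """Return how many seconds to sleep before retrying a rate-limited request.

    Hypothetical helper: `headers` is a plain mapping here, so the lookups are
    case-sensitive (the headers object from `requests` is case-insensitive).
    """
    now = time.time() if now is None else now

    # Prefer Retry-After (a number of seconds) when the server sends it.
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return max(float(retry_after), 1.0)
        except ValueError:
            pass  # not a plain number; fall through to the reset timestamp

    # Otherwise wait until the epoch timestamp in X-RateLimit-Reset.
    reset = headers.get("X-RateLimit-Reset")
    if reset is not None:
        try:
            return max(float(reset) - now, 1.0)
        except ValueError:
            pass

    # No usable header: fall back to a fixed default.
    return default


wait_from_reset = rate_limit_wait({"X-RateLimit-Reset": "1100"}, now=1000.0)
wait_from_retry = rate_limit_wait({"Retry-After": "30"}, now=1000.0)
```

Keeping the computation free of network calls makes it easy to check without hitting the API.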


In [60]:
PYPI_JSON_URL    = "https://pypi.org/pypi/{pkg}/json"
PYPI_PROJECT_URL = "https://pypi.org/project/{pkg}/"

@lru_cache(maxsize=512)
def _get_pypi_info(pkg: str, timeout: float = 10.0) -> Dict:
    """
    Fetches the JSON info block for `pkg`, or returns {} on error.
    """
    try:
        r = requests.get(PYPI_JSON_URL.format(pkg=pkg), timeout=timeout)
        if r.status_code == 200:
            return r.json().get("info", {})
    except requests.RequestException:
        pass
    return {}

@lru_cache(maxsize=256)
def fetch_pypi_urls(
    pkg_name: str,
    max_results: int = 5,
    timeout: float = 10.0
) -> List[str]:
    """
    1) Exact lookup via JSON API → returns info['package_url'] (or info['project_url'])
    2) Fuzzy lookup via XML‐RPC + JSON API per hit
    """
    urls: List[str] = []

    # 1) Exact match
    info = _get_pypi_info(pkg_name, timeout)
    if info:
        url = info.get("package_url") or info.get("project_url")
        if url:
            urls.append(url)

    if len(urls) >= max_results:
        return urls[:max_results]

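    # 2) Fuzzy search
    # Note: PyPI has disabled the XML-RPC `search` method, so this call is
    # expected to raise a Fault; the broad `except` below swallows it and
    # only the exact match (if any) is returned.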
    try:
        client = xmlrpc.client.ServerProxy("https://pypi.org/pypi")
        hits = client.search({"name": pkg_name}, "or")
        seen = {pkg_name.lower()}  # one-element set; set(str) would split the name into characters

        for hit in hits:
            name = hit.get("name")
            key  = name.lower() if name else None
            if not key or key in seen:
                continue
            seen.add(key)

            # pull its JSON info to get the true URL
            info = _get_pypi_info(name, timeout)
            if info:
                url = info.get("package_url") or info.get("project_url")
                if url:
                    urls.append(url)
                    if len(urls) >= max_results:
                        break
                    continue

            # fallback (should rarely be needed)
            urls.append(PYPI_PROJECT_URL.format(pkg=name))
            if len(urls) >= max_results:
                break

    except Exception:
        pass

    return urls[:max_results]
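
The exact lookup and the `seen` check above compare names by lowercase only. PyPI also treats runs of "-", "_" and "." as equivalent (the normalization rule from PEP 503), so a sketch of a normalizer that could be applied before comparing names might look like this. The helper name normalize_pypi_name is an assumption for illustration.

```python
import re


def normalize_pypi_name(name: str) -> str:
    """Normalize a project name the way PEP 503 describes:
    collapse runs of '-', '_' and '.' into a single '-' and lowercase."""
    return re.sub(r"[-_.]+", "-", name).lower()


# Names that differ only in case or separators collapse to one key.
raw_names = ["Scikit_Learn", "scikit-learn", "scikit.learn", "reQuests"]
unique_names = sorted({normalize_pypi_name(n) for n in raw_names})
```

Using the normalized form as the key in `seen` would avoid listing the same project twice under different spellings.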

In [61]:
CRAN_PACKAGES_URL = "https://cran.r-project.org/src/contrib/PACKAGES"
CRAN_BASE_URL     = "https://cran.r-project.org/web/packages/{pkg}/index.html"
CRAN_SHORT_URL    = "https://cran.r-project.org/package={pkg}"

@lru_cache(maxsize=1)
def _load_cran_packages(timeout: float = 10.0) -> List[str]:
    """
    Fetch and parse the CRAN PACKAGES index into a list of package names.
    Cached in memory so we only download it once.
    """
    resp = requests.get(CRAN_PACKAGES_URL, timeout=timeout)
    resp.raise_for_status()
    pkgs = []
    for line in resp.text.splitlines():
        if line.startswith("Package:"):
            pkgs.append(line.split(":", 1)[1].strip())
    return pkgs

@lru_cache(maxsize=256)
def fetch_cran_urls(
    name: str,
    max_results: int = 5,
    timeout: float = 10.0
) -> List[str]:
    """
    Return up to `max_results` canonical CRAN URLs for packages matching `name`:
      1) exact match
      2) substring match
      3) fuzzy match via difflib
    """
    pkgs = _load_cran_packages(timeout)
    urls: List[str] = []
    name_lower = name.lower()

    # 1) Exact
    if name in pkgs:
        urls.append(CRAN_SHORT_URL.format(pkg=name))

    # 2) Substring
    if len(urls) < max_results:
        subs = [p for p in pkgs if name_lower in p.lower() and p != name]
        for p in subs:
            if len(urls) >= max_results:
                break
            urls.append(CRAN_SHORT_URL.format(pkg=p))

    # 3) Fuzzy
    if len(urls) < max_results:
        # cutoff=0.6 is a sensible default; tweak as needed
        fuzzy = difflib.get_close_matches(name, pkgs, n=max_results, cutoff=0.6)
        for p in fuzzy:
            if len(urls) >= max_results:
                break
            # CRAN_SHORT_URL ends in "package=<name>", so recover the name after the "="
            if p not in [u.rsplit("=", 1)[-1] for u in urls]:
                urls.append(CRAN_SHORT_URL.format(pkg=p))

    return urls[:max_results]
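
The fuzzy step above passes the name to difflib.get_close_matches as typed, so a differently cased input such as "TidyR" scores lower than it needs to. A minimal sketch of a case-insensitive variant follows; the helper name close_matches_ci and the tiny package list are assumptions for illustration.

```python
import difflib
from typing import Dict, List


def close_matches_ci(
    name: str,
    packages: List[str],
    n: int = 5,
    cutoff: float = 0.6,
) -> List[str]:
    """Case-insensitive wrapper around difflib.get_close_matches.

    Matching is done on lowercase names; results are mapped back to the
    original spelling (if two names differ only by case, the last one wins).
    """
    by_lower: Dict[str, str] = {p.lower(): p for p in packages}
    hits = difflib.get_close_matches(name.lower(), list(by_lower), n=n, cutoff=cutoff)
    return [by_lower[h] for h in hits]


# Illustrative package list, not the real CRAN index.
matches = close_matches_ci("TidyR", ["tidyr", "dplyr", "ggplot2", "tidyverse"])
best = matches[0]  # the best-scoring match comes first
```

get_close_matches returns its results sorted by similarity, best first, which is why the first element is taken as the top candidate.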

In [62]:
GOOGLE_API_URL = "https://www.googleapis.com/customsearch/v1"

API_KEYS = [
    os.getenv("GOOGLE_API_KEY"),
    os.getenv("GOOGLE_API_KEY1"),
    os.getenv("GOOGLE_API_KEY2"),
    os.getenv("GOOGLE_API_KEY3")
]

API_KEYS = [k for k in API_KEYS if k]

CSE_ID = os.environ["GOOGLE_CSE_ID"]
EXCLUDE_SITES = [
    "github.com",
    "pypi.org",
    "cran.r-project.org",
    "youtube.com",
    "youtu.be",
    "medium.com",
    "stackoverflow.com",
    "reddit.com",
    "twitter.com",
    "facebook.com",
    "linkedin.com",
    "geeksforgeeks.org",
    "w3schools.com",
    "tutorialspoint.com"
]

@lru_cache(maxsize=128)
def fetch_google_urls(name: str, num_results: int = 5) -> List[str]:
    exclude_query = " ".join(f"-site:{d}" for d in EXCLUDE_SITES)
    query = f"{name} {exclude_query}"
    key_idx = 0
    urls = []
    page_size = 10
    nkeys = len(API_KEYS)
    # loop over pages of results
    for start in range(1, num_results + 1, page_size):
        params = {
            "key":   API_KEYS[key_idx],
            "cx":    CSE_ID,
            "q":     query,
            "start": start,
            "num":   min(page_size, num_results - len(urls)),
        }

        # try each key up to nkeys times
        for attempt in range(nkeys):
            resp = requests.get(GOOGLE_API_URL, params=params, timeout=5)

            # rate-limited? rotate key and retry
            if resp.status_code == 429:
                retry_after = int(resp.headers.get("Retry-After", 0))
                wait = retry_after if retry_after > 0 else 2 ** attempt
                print(f"[{name!r}] key#{key_idx} 429 → sleeping {wait}s…")
                time.sleep(wait)
                key_idx = (key_idx + 1) % nkeys
                params["key"] = API_KEYS[key_idx]
                continue

            # other errors raise
            resp.raise_for_status()

            # success!
            data = resp.json()
            for item in data.get("items", []):
                urls.append(item["link"])
                print(f"[{name!r}] key#{key_idx} → {item['link']}")
                if len(urls) >= num_results:
                    break
            break  # out of retry-loop

        if len(urls) >= num_results:
            break  # out of paging-loop

    return urls
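
The query construction and paging arithmetic in fetch_google_urls can be checked in isolation, without any API key. The sketch below mirrors those two steps; the helper names build_query and page_starts are assumptions for illustration.

```python
from typing import Iterable, List


def build_query(name: str, exclude_sites: Iterable[str]) -> str:
    """Append one `-site:<domain>` term per excluded domain to the search name."""
    exclusions = " ".join(f"-site:{d}" for d in exclude_sites)
    return f"{name} {exclusions}".strip()


def page_starts(num_results: int, page_size: int = 10) -> List[int]:
    """1-based `start` offsets needed to cover `num_results` results."""
    return list(range(1, num_results + 1, page_size))


query = build_query("mnnpy", ["github.com", "pypi.org"])
starts = page_starts(25)
```

With the default num_results of 5, a single page starting at 1 is enough, so the paging loop runs once.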

In [63]:
def fetch_candidate_urls(name: str) -> set[str]:
    """
    For each software name, fetch candidate URLs in this order:
      1. GitHub
      2. PyPI
      3. CRAN
      4. General Google search (excluding above domains)
    """
    results = []

    # GitHub
    try:
        results += fetch_github_urls(name)
    except Exception as e:
        print(f"[!] GitHub fetch failed for '{name}': {e}")

    # PyPI
    try:
        results += fetch_pypi_urls(name)
    except Exception as e:
        print(f"[!] PyPI fetch failed for '{name}': {e}")

    # CRAN
    try:
        results += fetch_cran_urls(name)
    except Exception as e:
        print(f"[!] CRAN check failed for '{name}': {e}")

    # Google
    try:
        time.sleep(1)
        results += fetch_google_urls(name)
    except Exception as e:
        print(f"[!] Google search failed for '{name}': {e}")

    # dedupe (note: a set does not preserve the cross-source order)
    return set(results)
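
Because fetch_candidate_urls returns a set, the GitHub, PyPI, CRAN, Google ordering is lost. If that ordering matters downstream, one possible alternative (a sketch, not what the notebook does) is to deduplicate with dict.fromkeys, which keeps the first occurrence of each URL in insertion order. The helper name dedupe_preserving_order is an assumption for illustration.

```python
from typing import Iterable, List


def dedupe_preserving_order(urls: Iterable[str]) -> List[str]:
    """Drop repeated URLs while keeping the first occurrence of each, in order."""
    return list(dict.fromkeys(urls))


combined = [
    "https://github.com/chriscainx/mnnpy",  # from GitHub
    "https://pypi.org/project/mnnpy/",      # from PyPI
    "https://github.com/chriscainx/mnnpy",  # repeated by a later source
]
ordered = dedupe_preserving_order(combined)
```

Switching to this would change the return type from set[str] to list[str], so callers that rely on set operations would need adjusting.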

In [64]:
"""names = ["TensOrflow",'tidyr','reQuests']
for name in names:
    print(f"Candidate URLs for '{name}':")
    urls = fetch_candidate_urls(name)
    for url in urls:
        print(f"  - {url}")
    print()"""


In [65]:
def load_candidates(path: str) -> Dict[str, Set[str]]:
    """Load a JSON cache of {name: [urls…]}, return {name: set(urls)…}."""
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, "r", encoding="utf-8") as f:
            try:
                data = json.load(f)
            except json.JSONDecodeError:
                print("⚠️ Warning: corrupt JSON cache; starting fresh.")
                data = {}
    else:
        data = {}

    # convert lists→sets
    return {name: set(urls) for name, urls in data.items()}

In [66]:
def save_candidates(candidates: Dict[str, Set[str]], path: str):
    """Convert sets→lists and write out a pretty JSON file."""
    serializable = {name: sorted(list(urls)) for name, urls in candidates.items()}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(serializable, f, indent=2, ensure_ascii=False)
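
JSON has no set type, which is why the cache is written as sorted lists and converted back to sets on load. A minimal round-trip sketch of that conversion, written to a temporary directory so nothing on disk is touched (the file name inside it is illustrative):

```python
import json
import os
import tempfile

cache = {
    "mnnpy": {
        "https://pypi.org/project/mnnpy/",
        "https://github.com/chriscainx/mnnpy",
    }
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "candidate_urls.json")

    # sets -> sorted lists, so the file is valid JSON and diffs stay stable
    with open(path, "w", encoding="utf-8") as f:
        json.dump({k: sorted(v) for k, v in cache.items()}, f, indent=2, ensure_ascii=False)

    # lists -> sets again on the way back in
    with open(path, "r", encoding="utf-8") as f:
        restored = {k: set(v) for k, v in json.load(f).items()}
```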

In [67]:
def update_candidate_cache(
    corpus: pd.DataFrame,
    fetcher,                # your fetch_candidate_urls(name) function
    cache_path: str
) -> Dict[str, Set[str]]:
    # 1) load existing
    candidates = load_candidates(cache_path)

    # 2) iterate unique names
    for name in corpus['name'].unique():
        # initialize if needed
        if name not in candidates:
            candidates[name] = set()

        # 3) add any pre-existing URLs from your dataframe
        urls_cell = corpus.loc[corpus['name'] == name, 'candidate_urls'].dropna().astype(str)
        for cell in urls_cell:
            for u in cell.split(','):
                u = u.strip()
                if u:
                    candidates[name].add(u)

        # 4) fetch & add new ones (the fetch always runs; set.update
        #    simply ignores URLs that are already present)
        candidates[name].update(fetcher(name))

    # 5) persist back to JSON
    save_candidates(candidates, cache_path)
    return candidates
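
Step 3 above splits each candidate_urls cell on commas. As a sketch, that parsing could live in a small helper that also tolerates missing values; the name parse_url_cell is an assumption for illustration. Note that any URL containing a comma would be split in two by this scheme.

```python
from typing import Set


def parse_url_cell(cell: object) -> Set[str]:
    """Split a comma-separated cell into a set of non-empty, stripped URLs.

    None and NaN (whose string form is "nan") are treated as empty.
    """
    if cell is None:
        return set()
    text = str(cell)
    if text.lower() == "nan":
        return set()
    return {u.strip() for u in text.split(",") if u.strip()}


parsed = parse_url_cell(" https://a.org , ,https://b.org")
```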


In [68]:
from urllib.parse import urlparse, urlunparse
from typing import Dict, Iterable, List

def normalize_url(u: str) -> str:
    p = urlparse(u)
    scheme = "https"
    netloc = p.netloc.lower()
    path = p.path.rstrip("/")
    # drop params, query, fragment
    return urlunparse((scheme, netloc, path, "", "", ""))

def dedupe_candidates(candidates: Dict[str, Iterable[str]]) -> None:
    """
    For each key in `candidates`, normalize its URLs and drop duplicates,
    preferring the https version when http & https both appear.
    Modifies `candidates` in place, replacing each value with a List[str].
    """
    for key, urls in candidates.items():
        seen: Dict[str, str] = {}
        for u in urls:
            norm = normalize_url(u)
            if norm not in seen:
                # first time we see this normalized URL,
                # store the original
                seen[norm] = u
            else:
                # if we already have an http version, but now see an https one, upgrade it
                if u.startswith("https") and not seen[norm].startswith("https"):
                    seen[norm] = u
        # replace with de-duplicated list
        candidates[key] = list(seen.values())
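
To see what the normalization key does, the sketch below repeats the normalize_url logic so it runs on its own, then applies it to an http/https pair and to two URLs that differ only in their query string. The example.org URLs are placeholders.

```python
from urllib.parse import urlparse, urlunparse


def normalize_url(u: str) -> str:
    # Same logic as normalize_url above, repeated so this sketch is self-contained.
    p = urlparse(u)
    return urlunparse(("https", p.netloc.lower(), p.path.rstrip("/"), "", "", ""))


# http and https variants of the same repository share one key.
pair = ["http://github.com/chriscainx/mnnpy", "https://GitHub.com/chriscainx/mnnpy/"]
keys = {normalize_url(u) for u in pair}

# Caveat: the query string is dropped, so these distinct pages also collapse.
a = normalize_url("https://example.org/search?q=foo")
b = normalize_url("https://example.org/search?q=bar")
```

The CRAN short links are unaffected by the query-string caveat, since "package=<name>" is part of the path there rather than a query.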



In [70]:
"""corpus = pd.read_excel("../corpus_v2.xlsx")
cache_file = "../candidate_urls.json"

candidates = update_candidate_cache(corpus, fetch_candidate_urls, cache_file)
print(f"Cached URLs for {len(candidates)} names.")"""


In [None]:
corpus     = pd.read_excel("../corpus_v2.xlsx")
candidates = load_candidates("../candidate_urls.json")


dedupe_candidates(candidates)

# now http://github.com/chriscainx/mnnpy is merged into the https:// one
save_candidates(candidates, "../candidate_urls.json")

15 {'https://scanpy.readthedocs.io/en/stable/generated/scanpy.external.pp.mnn_correct.html', 'https://github.com/granatumx/gbox-mnnpy', 'https://github.com/chriscainx/mnnpy', 'https://www.amazon.es/Anatomy-Muscles-Pictures-Education-20X27inch/dp/B0888LZ94B', 'https://github.com/aysalama/mnnpython', 'https://pypi.org/project/mnnpy/', 'https://cran.r-project.org/package=mnonr', 'https://cran.r-project.org/package=knnp', 'https://cran.r-project.org/package=nn2poly', 'https://scanpy.readthedocs.io/en/1.9.x/generated/scanpy.external.pp.mnn_correct.html', 'https://anaconda.org/bioconda/mnnpy', 'https://cran.r-project.org/package=nplyr', 'http://github.com/chriscainx/mnnpy', 'https://www.amazon.se/-/en/Anatomy-Muscles-Pictures-Education-16x20inch/dp/B0888N3YVH', 'https://cran.r-project.org/package=mpoly'}
14 ['https://scanpy.readthedocs.io/en/stable/generated/scanpy.external.pp.mnn_correct.html', 'https://github.com/granatumx/gbox-mnnpy', 'https://github.com/chriscainx/mnnpy', 'https://www.am

In [72]:
candidates = load_candidates("../candidate_urls.json")
corpus = pd.read_excel("../corpus_v2.xlsx")
# Join each name's URLs into one comma-separated string (the format that
# update_candidate_cache splits on); names missing from the cache get "".
# Joining directly avoids stripping quotes and brackets from the URLs themselves.
corpus['candidate_urls'] = corpus['name'].map(
    lambda n: ",".join(sorted(candidates.get(n, ())))
)
corpus.to_excel("../corpus_v2.xlsx", index=False) # save the updated corpus with candidate URLs