lucas-link-code/Repository-Hash-Search-Tools

Repository Hash Audit Tools

OVERVIEW

This repository contains four command-line tools for recursive SHA256 hash auditing across package registries and code hosting platforms. Each tool searches its platform, offers an interactive selection menu, downloads and extracts the selected packages or repositories, and computes a SHA256 hash for every file they contain.

The four tools are:

  • npm_audit.py: NPM package registry auditor
  • github_audit.py: GitHub repository auditor
  • hf_audit2.py: HuggingFace model repository auditor
  • pypi_audit.py: PyPI package index auditor

All tools write hash reports in the same normalized format, suitable for security analysis, integrity verification, and forensic auditing.

WHAT IT DOES

Each tool follows a similar workflow:

  1. Search the target platform using a keyword query
  2. Display matching results in an interactive menu
  3. Allow selection of one or all packages/repositories
  4. Download package archives or repository files
  5. Extract files from archives or process repository file trees
  6. Compute SHA256 hashes for each file
  7. Write results to a text file in normalized format

HOW IT WORKS

NPM Audit Tool (npm_audit.py)

Searches the NPM registry through its search API, retrieves package metadata, downloads each tarball from the registry CDN, extracts it in memory with tarfile, and hashes every file. Gzip-compressed tarballs are read as a stream, so nothing is written to disk.
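As an illustration of in-memory tarball hashing, here is a minimal sketch. It assumes the tarball bytes have already been downloaded; the function name hash_tarball and the 64 KiB read size are hypothetical and are not taken from npm_audit.py.

```python
import hashlib
import io
import tarfile


def hash_tarball(data: bytes) -> dict:
    """Return {member path: SHA256 hex digest} for a gzip-compressed tarball.

    Hypothetical helper: the real npm_audit.py may be structured differently.
    """
    hashes = {}
    # "r:gz" reads the gzip-compressed archive from the in-memory buffer.
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, symlinks, and other non-file entries
            extracted = archive.extractfile(member)
            digest = hashlib.sha256()
            # Read in fixed-size chunks so large members are never fully buffered.
            for chunk in iter(lambda: extracted.read(65536), b""):
                digest.update(chunk)
            hashes[member.name] = digest.hexdigest()
    return hashes
```

Only regular files are hashed here; whether the actual tool also reports links or directories is not stated in this README.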

GitHub Audit Tool (github_audit.py)

Searches repositories through the GitHub API, retrieves each file tree via the Git Trees API, and downloads files from raw.githubusercontent.com. Git LFS pointer files are detected, and the SHA256 recorded in the pointer is reported instead of a hash of the pointer itself. Large files are streamed to limit memory use. An optional path filter restricts hashing to files whose path contains a given string.
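The LFS pointer check could look roughly like the sketch below. It relies on the published Git LFS pointer layout (a "version" line followed by an "oid sha256:" line); the function name, the regex, and the size cutoff are assumptions and may differ from the pattern github_audit.py actually uses.

```python
import re
from typing import Optional

# First line of a Git LFS pointer file, per the Git LFS pointer specification.
LFS_VERSION_LINE = "version https://git-lfs.github.com/spec/v1"
# The object id line carries the SHA256 of the real file content.
LFS_OID_RE = re.compile(r"^oid sha256:([0-9a-f]{64})$", re.MULTILINE)


def lfs_pointer_sha256(content: bytes) -> Optional[str]:
    """Return the SHA256 recorded in an LFS pointer, or None if not a pointer.

    Hypothetical helper written for illustration only.
    """
    if len(content) >= 1024:
        return None  # pointer files are small text files; larger blobs are real content
    try:
        text = content.decode("utf-8")
    except UnicodeDecodeError:
        return None  # binary data cannot be a pointer file
    if not text.startswith(LFS_VERSION_LINE):
        return None
    match = LFS_OID_RE.search(text)
    return match.group(1) if match else None
```

When the function returns None, the caller would fall back to hashing the downloaded content itself.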

HuggingFace Audit Tool (hf_audit2.py)

Uses the huggingface_hub library to search models and retrieve file metadata, including LFS information. For LFS-tracked files, the SHA256 from the server metadata is used without downloading the file. Non-LFS files are streamed and hashed. Results are sorted by download count so that popular models appear first.

PyPI Audit Tool (pypi_audit.py)

Searches PyPI through the XML-RPC API, retrieves package metadata via the JSON API, and downloads every distribution file (wheels and source distributions) for the selected packages. Files are extracted from ZIP archives (wheels) and TAR.GZ archives (source distributions), and each extracted file is hashed.
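For the wheel case, hashing the members of a ZIP archive held in memory might be sketched as follows. The function name hash_zip and the chunk size are hypothetical; the sketch assumes the wheel bytes were downloaded beforehand.

```python
import hashlib
import io
import zipfile


def hash_zip(data: bytes) -> dict:
    """Return {member path: SHA256 hex digest} for a ZIP archive such as a wheel.

    Hypothetical helper: not copied from pypi_audit.py.
    """
    hashes = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries have no content to hash
            digest = hashlib.sha256()
            # Stream each member so large files are not held in memory at once.
            with archive.open(info) as member:
                for chunk in iter(lambda: member.read(65536), b""):
                    digest.update(chunk)
            hashes[info.filename] = digest.hexdigest()
    return hashes
```

Source distributions (TAR.GZ) would go through a tarfile-based routine instead.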

OUTPUT FORMAT

All tools produce normalized output with the following structure:

    AUDIT TARGET: platform/identifier@version
    file/path/within/package sha256_hash_hexadecimal_64_chars

    file/path/another/file sha256_hash_hexadecimal_64_chars

Each file entry consists of the file path, its SHA256 hash, and a blank line separator. The AUDIT TARGET header identifies the package or repository being audited.
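A report in this shape could be produced by something like the sketch below. The exact line layout is an assumption: the path and hash are placed on one line separated by a space, entries are sorted by path, and the target string in the usage note is invented for illustration.

```python
def format_report(target: str, hashes: dict) -> str:
    """Render one audit target and its {path: sha256} map as report text.

    Assumed layout: header line, then "path hash" followed by a blank line
    for every file. The real tools may arrange the fields differently.
    """
    lines = [f"AUDIT TARGET: {target}"]
    for path in sorted(hashes):
        lines.append(f"{path} {hashes[path]}")
        lines.append("")  # blank line separator after each entry
    return "\n".join(lines) + "\n"
```

For example, format_report("npm/example@1.0.0", {"package/index.js": digest}) yields a header followed by a single entry and its separator.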

REQUIREMENTS

Python 3.7 or higher

Core dependencies (all tools):

  • requests: HTTP client for API calls and file downloads
  • tqdm: Progress bar display

Additional dependencies:

  • huggingface_hub: Required for hf_audit2.py (HuggingFace API client)

Standard library modules used:

  • argparse, hashlib, sys, json, tarfile, zipfile, io, xmlrpc.client, typing, os, re

INSTALLATION

Install base dependencies:

    pip install requests tqdm

For the HuggingFace tool:

    pip install requests tqdm huggingface_hub

GITHUB TOKEN SETUP

The github_audit.py tool uses a GitHub Personal Access Token to obtain higher rate limits. Without a token, GitHub limits API requests to 60 per hour. With a token, the limit increases to 5000 per hour.

Create token:

  1. Go to https://github.com/settings/tokens
  2. Generate new token (classic)
  3. Select scope: public_repo (access to public repositories only)
  4. Copy token

Set token:

    export GITHUB_TOKEN="ghp_your_token_here"

Or pass via command line:

    python github_audit.py query --token ghp_your_token_here
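One way the two token sources could be combined into request headers is sketched below. The precedence (command-line value over GITHUB_TOKEN) and the function name are assumptions; the User-Agent value is the one this README lists for the GitHub tool.

```python
import os
from typing import Optional


def github_headers(cli_token: Optional[str] = None) -> dict:
    """Build GitHub API request headers, adding authentication when a token exists.

    Hypothetical helper: assumes --token takes precedence over GITHUB_TOKEN.
    """
    token = cli_token or os.environ.get("GITHUB_TOKEN")
    headers = {
        "Accept": "application/vnd.github+json",
        "User-Agent": "DeepSearch-Audit/1.0",
    }
    if token:
        # Without this header, requests count against the unauthenticated limit.
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

The resulting dictionary can be passed as the headers argument of requests.get.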

USAGE EXAMPLES

NPM Audit:

    python npm_audit.py express
    python npm_audit.py react --limit 20 --output npm_report.txt

GitHub Audit:

    python github_audit.py qwen
    python github_audit.py qwen --name-only --path-filter qwen --sort-by stars
    python github_audit.py qwen --token ghp_token --output github_report.txt

HuggingFace Audit:

    python hf_audit2.py Qwen
    python hf_audit2.py gpt2 --limit 20 --output hf_report.txt

PyPI Audit:

    python pypi_audit.py requests
    python pypi_audit.py numpy --limit 10 --output pypi_report.txt

COMMAND LINE OPTIONS

npm_audit.py:

  • query (optional): Search keyword for NPM packages
  • --limit: Maximum packages to display (default: 15)
  • --output: Output file path (default: audit_report_npm.txt)

github_audit.py:

  • query (optional): Search keyword for GitHub repositories
  • --token: GitHub Personal Access Token (or set GITHUB_TOKEN env var)
  • --limit: Maximum repos to display (default: 15)
  • --output: Output file path (default: audit_report_github.txt)
  • --name-only: Only search repos with keyword in their name
  • --path-filter: Only hash files whose path contains this string (case-insensitive)
  • --sort-by: Sort results by stars or forks (default: forks)

hf_audit2.py:

  • query (optional): Search query for model name or family
  • --limit: Maximum models to display (default: 15)
  • --output: Output file path (default: audit_report.txt)

pypi_audit.py:

  • query (optional): Search query for PyPI packages
  • --limit: Maximum packages to display (default: 15)
  • --output: Output file path (default: audit_report_pypi.txt)
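The github_audit.py options listed above could be declared with argparse roughly as follows. The option names and defaults come from this README; the function name build_parser and the help strings are assumptions rather than the script's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented github_audit.py options (illustrative sketch)."""
    parser = argparse.ArgumentParser(prog="github_audit.py")
    # nargs="?" makes the positional query optional, enabling interactive prompting.
    parser.add_argument("query", nargs="?", default=None,
                        help="search keyword for GitHub repositories")
    parser.add_argument("--token", default=None,
                        help="GitHub Personal Access Token")
    parser.add_argument("--limit", type=int, default=15,
                        help="maximum repos to display")
    parser.add_argument("--output", default="audit_report_github.txt",
                        help="output file path")
    parser.add_argument("--name-only", action="store_true",
                        help="only search repos with the keyword in their name")
    parser.add_argument("--path-filter", default=None,
                        help="only hash files whose path contains this string")
    parser.add_argument("--sort-by", choices=["stars", "forks"], default="forks",
                        help="sort results by stars or forks")
    return parser
```

The other three tools take only the query, --limit, and --output arguments, so their parsers would be a subset of this one with a different default output path.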

INTERACTIVE MODE

All tools support interactive mode. If no query is provided via command line, the tool will prompt for input. After searching, a numbered menu is displayed showing matching results. Selection options:

  • Enter a number (1-N) to audit a specific package/repository
  • Enter 'all' to audit all listed results
  • Enter 'exit' to quit
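The selection rules above can be sketched as a small parser. The function name, the return convention (None for exit, a list of zero-based indices otherwise), and the case-insensitive matching are assumptions made for illustration.

```python
from typing import List, Optional


def parse_selection(raw: str, count: int) -> Optional[List[int]]:
    """Translate menu input into zero-based result indices.

    Returns None for 'exit', every index for 'all', and a single index for a
    number between 1 and count. Raises ValueError for anything else.
    Hypothetical helper: the real tools may handle input differently.
    """
    choice = raw.strip().lower()
    if choice == "exit":
        return None
    if choice == "all":
        return list(range(count))
    if choice.isdecimal() and 1 <= int(choice) <= count:
        return [int(choice) - 1]  # menu numbers start at 1
    raise ValueError(f"invalid selection: {raw!r}")
```

A caller would typically loop on input() and re-prompt whenever ValueError is raised.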

TECHNICAL DETAILS

Memory Efficiency: All tools process files in memory without writing intermediate files to disk. Large files are streamed in chunks to keep memory use bounded. The NPM and PyPI tools extract archives directly from HTTP streams.
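Chunked hashing of this kind can be sketched with hashlib alone. The function name is hypothetical, and the sketch takes any iterable of byte chunks so that it stays independent of how the download is performed.

```python
import hashlib
from typing import Iterable


def sha256_of_chunks(chunks: Iterable[bytes]) -> str:
    """Hash a stream of byte chunks without holding the whole file in memory."""
    digest = hashlib.sha256()
    for chunk in chunks:
        if chunk:  # skip empty keep-alive chunks
            digest.update(chunk)
    return digest.hexdigest()
```

With requests, the iterable could be response.iter_content(chunk_size=65536) on a response opened with stream=True, so memory use stays near one chunk regardless of file size.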

LFS Handling: The GitHub tool detects Git LFS pointer files by regex pattern matching. When a pointer is detected, the tool extracts the SHA256 hash from the pointer file rather than hashing the pointer content. The HuggingFace tool uses server metadata to obtain LFS file hashes without downloading.

Archive Processing: The NPM tool handles gzip-compressed tarballs using tarfile in streaming mode. The PyPI tool handles both ZIP archives (wheels) and TAR.GZ archives (source distributions). Files are extracted and hashed without temporary disk storage.

Error Handling: Network errors, missing packages, invalid archives, and invalid user input are reported without aborting the run. Error messages are printed to stderr, which keeps them separate from the audit results written to output files.

Progress Indication: All tools use tqdm for progress bars. The GitHub and HuggingFace tools show progress per file. The NPM and PyPI tools show progress per package.

CONFIGURATION

Default output files:

  • npm_audit.py: audit_report_npm.txt
  • github_audit.py: audit_report_github.txt
  • hf_audit2.py: audit_report.txt
  • pypi_audit.py: audit_report_pypi.txt

API Endpoints:

User Agent strings:

  • NPM/PyPI tools: DeepSearch-Intel/1.0
  • GitHub tool: DeepSearch-Audit/1.0
  • HuggingFace tool: DeepSearch-Audit/1.0

USE CASES

  • Security auditing of package contents before deployment
  • Integrity verification of downloaded packages
  • Forensic analysis of package file structures
  • Hash-based comparison of package versions
  • Automated compliance checking for file integrity
  • Threat intelligence gathering on package repositories
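As an example of hash-based comparison between two versions, the sketch below diffs two path-to-hash maps. It assumes the reports have already been parsed into dictionaries; the names HashDiff and diff_hashes are hypothetical and are not part of the tools.

```python
from typing import Dict, List, NamedTuple


class HashDiff(NamedTuple):
    added: List[str]    # paths present only in the new version
    removed: List[str]  # paths present only in the old version
    changed: List[str]  # paths present in both with different hashes


def diff_hashes(old: Dict[str, str], new: Dict[str, str]) -> HashDiff:
    """Compare two {path: sha256} maps from audits of different versions."""
    old_paths, new_paths = set(old), set(new)
    added = sorted(new_paths - old_paths)
    removed = sorted(old_paths - new_paths)
    changed = sorted(p for p in old_paths & new_paths if old[p] != new[p])
    return HashDiff(added, removed, changed)
```

An empty result in all three fields means the two versions contain byte-identical files under identical paths.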

LIMITATIONS

  • The GitHub tool may encounter truncated file trees for repositories with a very large number of files.
  • PyPI XML-RPC search can be slow for broad queries.
  • NPM search results are weighted by popularity, which may filter out less popular packages.
  • The HuggingFace tool requires internet connectivity, and private repositories require access through a valid HuggingFace account.

LICENSE

Check individual script headers for license information.

CONTRIBUTING

This is a collection of audit tools for security professionals. Contributions should maintain the normalized output format and follow the existing code structure.

SUPPORT

For issues or questions, refer to individual tool documentation files:

  • npm_audit.md
  • github_audit.md
  • hf_audit2.md
  • pypi_audit.md
