A multithreaded tool for mining data from APIs with configurable endpoints.
- Multithreaded API requests
- Configurable endpoints
- Retry logic with rotating proxies
- Progress tracking
- Data storage to disk or database
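At its core, the tool fans requests out across a pool of worker threads and retries failures before handing records off to storage. The real logic lives in utils/ and example_client/; the snippet below is only a simplified sketch of that pattern, using illustrative names (fetch_page, NUM_THREADS, MAX_RETRIES) and a placeholder URL rather than the tool's actual code.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

# Illustrative defaults; the tool's real settings live in main.py.
NUM_THREADS = 8
MAX_RETRIES = 3

def fetch_page(url: str) -> dict:
    """Fetch one URL, retrying with simple exponential backoff on failure."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == MAX_RETRIES:
                raise
            time.sleep(2 ** attempt)  # back off before the next attempt

# Placeholder URLs; the real targets come from the endpoint configuration.
urls = [f"https://api.example.com/locations?page={i}" for i in range(10)]

with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
    futures = {pool.submit(fetch_page, u): u for u in urls}
    for future in as_completed(futures):
        record = future.result()  # raises if every retry failed
        # ...hand the record off to storage (disk or database)
```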
Install the required Python dependencies:
pip install -r requirements.txt
Create a .env file in the root directory and configure the required variables:
SCRAPOXY_USER=<your_scrapoxy_user>
SCRAPOXY_TOKEN=<your_scrapoxy_token>
SCRAPOXY_PORT=<your_scrapoxy_port>
SCRAPOXY_URL=<your_scrapoxy_url>
SCRAPOXY_CRT=<optional_certificate_path>
DEFAULT_ENDPOINT=locations
DUCKDB_TOKEN=<your_duckdb_token>
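These variables are read from the environment at runtime. As a rough illustration, assuming python-dotenv (a common choice that may or may not match what requirements.txt actually pins), they can be loaded like this:

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is available

load_dotenv()  # reads the .env file from the current working directory

scrapoxy_user = os.getenv("SCRAPOXY_USER")
scrapoxy_token = os.getenv("SCRAPOXY_TOKEN")
scrapoxy_port = os.getenv("SCRAPOXY_PORT")
scrapoxy_url = os.getenv("SCRAPOXY_URL")
scrapoxy_crt = os.getenv("SCRAPOXY_CRT")          # optional certificate path
default_endpoint = os.getenv("DEFAULT_ENDPOINT", "locations")
duckdb_token = os.getenv("DUCKDB_TOKEN")
```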
Scrapoxy is recommended for proxy management. Follow these steps to install and configure it; for the latest instructions, see the Scrapoxy docs.
- Install Docker if it is not already installed:
brew install --cask docker
- Pull the Scrapoxy Docker image:
docker pull scrapoxy/scrapoxy
- Run the Scrapoxy container:
docker run -d -p 8888:8888 -p 8890:8890 -v ./scrapoxy:/cfg -e AUTH_LOCAL_USERNAME=admin -e AUTH_LOCAL_PASSWORD=password -e BACKEND_JWT_SECRET=secret1 -e FRONTEND_JWT_SECRET=secret2 -e STORAGE_FILE_FILENAME=/cfg/scrapoxy.json scrapoxy/scrapoxy
- Access the Scrapoxy dashboard: open your browser and navigate to http://localhost:8888, then log in with your Scrapoxy credentials.
- Configure your cloud provider using Scrapoxy's docs.
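Once Scrapoxy is running, the tool's HTTP traffic is routed through its proxy endpoint using the credentials from .env. The snippet below is a minimal sketch of that pattern with the requests library; how SCRAPOXY_URL and SCRAPOXY_PORT combine into the proxy address, and the target URL itself, are assumptions made for illustration, and the real client logic lives in example_client/.

```python
import os

import requests

user = os.getenv("SCRAPOXY_USER")
token = os.getenv("SCRAPOXY_TOKEN")
host = os.getenv("SCRAPOXY_URL", "localhost")   # assumed to be the proxy host
port = os.getenv("SCRAPOXY_PORT", "8888")
ca_cert = os.getenv("SCRAPOXY_CRT")             # optional Scrapoxy CA certificate

# Route both HTTP and HTTPS traffic through the Scrapoxy proxy with basic auth.
proxy = f"http://{user}:{token}@{host}:{port}"
proxies = {"http": proxy, "https": proxy}

resp = requests.get(
    "https://api.example.com/locations",        # placeholder target URL
    proxies=proxies,
    verify=ca_cert if ca_cert else False,       # CA cert if configured, else skip TLS checks
    timeout=30,
)
print(resp.status_code)
```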
Run the tool with the desired endpoint:
python main.py --endpoint locations
The project is laid out as follows:
- example_client/: API client logic and endpoint configurations.
- utils/: Utility modules for logging, retries, and workers.
- main.py: Entry point for the tool.
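Endpoint definitions live in example_client/. Their exact format is project-specific and not documented here, but conceptually an endpoint entry pairs a name with a URL template, default parameters, and a parser, roughly like the hypothetical sketch below (every name in it is illustrative, not the actual configuration):

```python
# Hypothetical shape of an endpoint configuration; the real definitions in
# example_client/ may look quite different.
ENDPOINTS = {
    "locations": {
        "url": "https://api.example.com/v1/locations",
        "params": {"page_size": 100},
        "parser": lambda payload: payload.get("results", []),
    },
}

def build_request(endpoint_name: str, page: int) -> tuple[str, dict]:
    """Return the URL and query parameters for one page of an endpoint."""
    cfg = ENDPOINTS[endpoint_name]
    params = dict(cfg["params"], page=page)
    return cfg["url"], params
```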
- Ensure that Scrapoxy is running and properly configured before starting the tool.
- You can customize the number of threads, batch size, and maximum records by modifying the global variables in main.py.
- If you encounter issues with proxies, verify that Scrapoxy is running and the .env file is correctly configured.
- For database-related issues, ensure that the DuckDB connection string is valid and the database file is accessible; a quick connectivity check like the sketch below can help.
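The following is a minimal sanity check of the database side: open the DuckDB file directly and run a trivial query. The path is a placeholder for whatever your configuration actually points at.

```python
import duckdb

# Replace the path with the database file your configuration actually uses.
con = duckdb.connect("data/api_data.duckdb")
print(con.execute("SELECT 1").fetchone())  # (1,) means the file is reachable
con.close()
```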
- Implement more robust error handling and retries for database write operations.
- Add unit tests for core components (e.g., client functions, parsers, worker logic).
- Add configuration to allow easier selection of different data storage backends (e.g., PostgreSQL, local files only).
- Improve documentation on how to add and configure new API endpoints.
- Add support for different output formats.
- Implement stateful job resumption so that interrupted runs can be picked up where they left off.
- Improve logging so that individual requests are easier to trace.
- Explore options for dynamic scaling of worker threads based on workload.