This is a Scrapy-based web scraper designed to extract Russian vocabulary words, their translations, example sentences, and media files (audio and images).
- Scrapes vocabulary words along with their English translations.
- Extracts example sentences with phonetic transcription.
- Downloads pronunciation audio and related images.
- Handles pagination for comprehensive data collection.
- Uses requests for efficient file downloads.
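
To make the feature list more concrete, here is a minimal sketch of what one scraped record could look like. The class names, field names, and sample values are hypothetical illustrations, not the spider's actual output schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Example:
    """One example sentence (hypothetical field names)."""
    sentence: str        # Russian example sentence
    transcription: str   # phonetic transcription of the sentence


@dataclass
class VocabEntry:
    """One vocabulary record (hypothetical field names)."""
    word: str
    translation: str
    examples: List[Example] = field(default_factory=list)
    audio_path: Optional[str] = None   # local path of the downloaded audio
    image_path: Optional[str] = None   # local path of the downloaded image


# Made-up sample data, for illustration only.
entry = VocabEntry(
    word="книга",
    translation="book",
    examples=[Example("Я читаю книгу.", "ya chitayu knigu")],
    audio_path="audio/kniga.mp3",
)

# ensure_ascii=False keeps the Cyrillic text readable in the JSON output.
line = json.dumps(asdict(entry), ensure_ascii=False)
print(line)
```

A related point for Russian text: when FEED_EXPORT_ENCODING is not set, Scrapy's JSON feed export writes non-ASCII characters as \uXXXX escapes, so setting it to utf-8 in the project settings keeps output.json readable.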
Ensure you have Python installed, then install the required dependencies:
pip install -r requirements.txt

Create a .env file and define the following:
ALLOWED_DOMAINS=<the website url>
START_URLS=<the website url>

Run the Scrapy spider:
scrapy crawl general -o output.json

This will save the extracted data in output.json.
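
The README does not show how the spider consumes the two .env variables, so the following is only a sketch under stated assumptions: the values are already loaded into the process environment (for example by a dotenv loader), and multiple values are comma-separated. The helper names split_env_list and to_domain are hypothetical. Scrapy's allowed_domains expects bare domain names rather than full URLs, so the sketch reduces a URL to its host.

```python
import os
from urllib.parse import urlparse


def split_env_list(name, environ=os.environ):
    """Return the comma-separated values of a variable as a clean list."""
    raw = environ.get(name, "")
    return [part.strip() for part in raw.split(",") if part.strip()]


def to_domain(value):
    """Reduce a URL such as https://example.com/path to its host."""
    # Prefix scheme-less values with // so urlparse treats them as a host.
    parsed = urlparse(value if "://" in value else "//" + value)
    return parsed.netloc or value


# Placeholder values standing in for a real .env file (assumption).
env = {
    "ALLOWED_DOMAINS": "https://example.com",
    "START_URLS": "https://example.com/vocabulary, https://example.com/phrases",
}

allowed_domains = [to_domain(v) for v in split_env_list("ALLOWED_DOMAINS", env)]
start_urls = split_env_list("START_URLS", env)

print(allowed_domains)
print(start_urls)
```

In a spider, these two lists would typically be assigned to the allowed_domains and start_urls class attributes.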
├── ruvocab/
│   ├── spiders/
│   │   └── general.py    # Main Scrapy spider
│   └── utils.py          # Download helper
├── audio/                # Folder for downloaded audio files
├── images/               # Folder for downloaded images
├── requirements.txt
├── .env                  # Environment variables
└── README.md             # This file
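
The layout above keeps audio and images in separate folders. How the spider decides which folder a file belongs in is not documented here; one plausible approach, sketched below, is to route by file extension. The function name media_folder and both extension sets are assumptions for illustration.

```python
import os
from urllib.parse import urlparse

# Assumed extension lists; adjust to whatever the target site serves.
AUDIO_EXTENSIONS = {".mp3", ".ogg", ".wav"}
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def media_folder(url):
    """Return "audio", "images", or None based on the URL's file extension."""
    # Use only the path component so query strings do not affect the result.
    path = urlparse(url).path
    extension = os.path.splitext(path)[1].lower()
    if extension in AUDIO_EXTENSIONS:
        return "audio"
    if extension in IMAGE_EXTENSIONS:
        return "images"
    return None


print(media_folder("https://example.com/media/kniga.mp3?v=2"))
print(media_folder("https://example.com/img/Kniga.JPG"))
```

Returning None for unknown extensions lets the caller skip files that are neither audio nor images instead of saving them in the wrong folder.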
The scraper includes a helper function to download files efficiently:
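import os

import requests


def download_file(url, folder):
    """Download url into folder and return the local file path."""
    # Create the target folder (e.g. audio/ or images/) if it is missing.
    os.makedirs(folder, exist_ok=True)
    filename = os.path.basename(url)
    file_path = os.path.join(folder, filename)
    # Skip the request if the file was already downloaded.
    if os.path.exists(file_path):
        return file_path
    # Stream the response so large files are not held in memory;
    # the timeout prevents a stalled server from hanging the crawl.
    response = requests.get(url, stream=True, timeout=30)
    response.raise_for_status()
    with open(file_path, "wb") as f:
        for chunk in response.iter_content(1024):
            f.write(chunk)
    return file_path

This project is licensed under the MIT License.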