kimanxo/RuVocab

Russian Vocabulary Scraper

This is a Scrapy-based web scraper that extracts Russian vocabulary words, their translations, example sentences, and media files (audio and images).

Features

  • Scrapes vocabulary words along with their English translations.
  • Extracts example sentences with phonetic transcription.
  • Downloads pronunciation audio and related images.
  • Handles pagination for comprehensive data collection.
  • Uses requests to stream media files to disk, skipping files that were already downloaded.
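As a rough illustration of the pagination feature, the sketch below resolves a "next page" link against the current page URL using only the standard library. The function name next_page_url and the example URLs are hypothetical; the README does not show how general.py follows pages, so this is one plausible approach and not the spider's actual code.

```python
from urllib.parse import urljoin


def next_page_url(current_url, next_href):
    """Resolve the href of a 'next page' link against the current page URL.

    Returns None when no next link was found, which signals the last page.
    """
    if not next_href:
        return None
    # urljoin handles absolute paths, relative paths, and query-only hrefs.
    return urljoin(current_url, next_href)
```

Inside a Scrapy callback, response.urljoin(href) performs the same resolution against the response URL.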

Installation

Ensure you have Python installed, then install the required dependencies:

pip install -r requirements.txt

Environment Setup

Create a .env file in the project root and define the following:

ALLOWED_DOMAINS=<the website domain, without scheme or path>
START_URLS=<the website url>
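The README does not show how the spider reads these values. A minimal sketch, assuming the variables are already present in the process environment (for example, loaded from .env by a package such as python-dotenv, which is an assumption) and that multiple values are separated by commas (also an assumption). The helper name split_env_list is hypothetical.

```python
import os


def split_env_list(name, environ=os.environ):
    """Return the comma-separated values stored under name, without blanks."""
    raw = environ.get(name, "")
    return [part.strip() for part in raw.split(",") if part.strip()]


# Hypothetical usage inside the spider class:
#   allowed_domains = split_env_list("ALLOWED_DOMAINS")
#   start_urls = split_env_list("START_URLS")
```

Scrapy's allowed_domains attribute expects bare domain names, while start_urls expects full URLs including the scheme.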

Usage

Run the Scrapy spider:

scrapy crawl general -o output.json

This will save the extracted data in output.json.
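To inspect the result afterwards, the file can be loaded with the standard json module. This is a small sketch; the helper name load_items is hypothetical, and the item field names depend on the spider, so none are assumed here.

```python
import json


def load_items(path="output.json"):
    """Load the list of scraped items written by Scrapy's JSON feed export."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of items")
    return items


# Example:
#   items = load_items()
#   print(len(items), "items scraped")
```

Depending on the FEED_EXPORT_ENCODING setting, Cyrillic text may appear in the file as \uXXXX escapes; json.load decodes both forms to the same strings.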

Project Structure

├── ruvocab/
│   ├── spiders/
│   │   └── general.py  # Main Scrapy spider
│   └── utils.py  # Download helper
├── audio/  # Folder for downloaded audio files
├── images/  # Folder for downloaded images
├── requirements.txt
├── .env  # Environment variables
└── README.md  # This file

Download Helper (download_file)

The scraper includes a helper function in utils.py that streams a file to disk and skips files that already exist locally:

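import os

import requests


def download_file(url, folder):
    """Download url into folder and return the local file path."""
    os.makedirs(folder, exist_ok=True)  # create the target folder if missing
    filename = os.path.basename(url)
    file_path = os.path.join(folder, filename)
    if os.path.exists(file_path):
        return file_path  # already downloaded, skip the request
    # Stream the body so large media files are never held fully in memory;
    # the timeout keeps a stalled server from hanging the crawl.
    response = requests.get(url, stream=True, timeout=30)
    response.raise_for_status()
    with open(file_path, "wb") as f:
        for chunk in response.iter_content(1024):
            f.write(chunk)
    return file_path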
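The helper above derives the local filename with os.path.basename(url), which keeps any query string as part of the name (for example "privet.mp3?v=2"). If the target site appends query parameters to media URLs, a variant along these lines could be used instead. The function name filename_from_url and the example URL are hypothetical.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(url):
    """Return the last path segment of url, ignoring query string and fragment."""
    path = urlparse(url).path
    # Decode percent-escapes first so an encoded slash cannot leak into the name.
    return posixpath.basename(unquote(path))
```

With this variant, filename_from_url("https://example.com/audio/privet.mp3?v=2") yields "privet.mp3".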

License

This project is licensed under the MIT License.
