# Introduction

This notebook retrieves and analyzes **Russian court cases** (criminal and administrative) from the **sudrf.ru** court information system. It uses the open-source [**sudrfparser**](https://github.com/dataout-org/sudrfparser) project to:

* automatically open court websites,
* bypass captchas (optionally via 2Captcha),
* download case metadata and full texts,
* save results as structured JSON files.

## Install a ChromeDriver Version That Matches Your Chrome Browser

Before running the parser, install a ChromeDriver build that matches the Google Chrome installed on your computer.  
Chrome and ChromeDriver must share the **same major version** (e.g., Chrome 142 -> ChromeDriver 142).  
If they don't match, Selenium fails with an error like:

> This version of ChromeDriver only supports Chrome version XXX.  
> Current browser version is YYY.
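
To catch a mismatch before Selenium does, the two version strings can be compared up front. The sketch below is illustrative and not part of sudrfparser: the helper names `major_version`, `versions_match`, and `read_version` are hypothetical, and the version strings in the comments are made-up examples of the usual `--version` output.

```python
import re
import subprocess


def major_version(version_output: str) -> int:
    """Extract the major version from text such as 'ChromeDriver 142.0.0.0 (...)'."""
    match = re.search(r"(\d+)\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in: {version_output!r}")
    return int(match.group(1))


def versions_match(chrome_output: str, driver_output: str) -> bool:
    """True when Chrome and ChromeDriver report the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


def read_version(command: list) -> str:
    """Run a command such as [path_to_binary, '--version'] and return its output."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Example use (binary locations differ per system, so adjust both paths):
# driver_out = read_version(["/opt/homebrew/bin/chromedriver", "--version"])
# chrome_out = read_version(["<path to your Chrome binary>", "--version"])
# print(versions_match(chrome_out, driver_out))
```

Only the major version is compared, since that is the part Selenium complains about.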

Install the dependencies, pinned to the versions the parser expects:

In [None]:
!pip install --upgrade beautifulsoup4 requests 2captcha-python webdriver_manager "selenium<4.0.0" "urllib3<2.0.0"
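
If another environment or a later `pip install` has upgraded these packages, the pins above may no longer hold. A minimal check, assuming only the two upper bounds from the install command (`selenium<4.0.0`, `urllib3<2.0.0`), could look like this. `PINS` and `check_pins` are hypothetical names introduced for the sketch.

```python
from importlib.metadata import PackageNotFoundError, version

# Exclusive upper bounds on the major version, taken from the pip command above.
PINS = {"selenium": 4, "urllib3": 2}


def major(version_string: str) -> int:
    """Return the leading integer of a version string such as '3.141.0'."""
    return int(version_string.split(".")[0])


def check_pins(get_version=version) -> dict:
    """Report, per package, whether the installed major version is below its bound."""
    report = {}
    for package, bound in PINS.items():
        try:
            installed = get_version(package)
        except PackageNotFoundError:
            report[package] = "not installed"
            continue
        report[package] = "ok" if major(installed) < bound else f"too new ({installed})"
    return report


# print(check_pins())  # uncomment in the notebook to inspect the current environment
```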

Clone the parser repository if you have not already done so (uncomment the line below):

In [None]:
# !git clone https://github.com/dataout-org/sudrfparser.git

In [None]:
import sys
from pathlib import Path
import inspect

# Notebook directory: .../sudrfparser/example
nb_dir = Path.cwd()

# Project root is two levels above
project_root = nb_dir.parent.parent
print("Detected project root:", project_root)

# Add project root to sys.path so Python can find sudrfparser/
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Now import the module
import sudrfparser.sudrfparser as sp

print("Loaded sudrfparser from:", sp.__file__)
print(
    "get_* functions:",
    [name for name, obj in inspect.getmembers(sp)
     if inspect.isfunction(obj) and name.startswith("get_")]
)

The cell below sets the search filters (administrative articles, date range, optional region whitelist) and, if you use it, the 2Captcha configuration.
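
The configuration cell passes dates as `DD.MM.YYYY` strings and articles as `"<article>_<part>"` strings. A typo in either would only surface once the scraper is already running, so a quick pre-flight check may save time. The sketch below is an assumption-laden illustration: `build_articles`, `parse_date`, and `validate_filters` are hypothetical helpers, and the article pattern is inferred from the comment in the configuration cell rather than from the parser itself.

```python
import re
from datetime import datetime

# Assumed shape: dotted article number, underscore, part number (e.g. "6.21_3").
ARTICLE_PATTERN = re.compile(r"^\d+(\.\d+)*_\d+$")


def build_articles(article: str, parts) -> list:
    """Build ["<article>_<part>", ...] for every part in `parts`."""
    return [f"{article}_{part}" for part in parts]


def parse_date(value: str) -> datetime:
    """Parse a DD.MM.YYYY string; raises ValueError on any other format."""
    return datetime.strptime(value, "%d.%m.%Y")


def validate_filters(articles, start_date: str, end_date: str) -> None:
    """Raise ValueError if an article or the date range looks malformed."""
    bad = [a for a in articles if not ARTICLE_PATTERN.match(a)]
    if bad:
        raise ValueError(f"unexpected article format: {bad}")
    if parse_date(start_date) > parse_date(end_date):
        raise ValueError("start_date is later than end_date")


# Same values as in the configuration cell:
articles = build_articles("6.21", range(1, 9))
validate_filters(articles, "01.01.2021", "31.12.2025")
```

Building the list with `build_articles` also avoids typing each part by hand.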

It then gathers administrative cases for those parameters, court by court. Adjust `path_to_driver` if your ChromeDriver is installed elsewhere.

In [None]:
# scrape all courts from sudrf_websites.json, saving per-court results

import os
import json
from pathlib import Path

# -------------------------------------------------------------------
# 0) Make sure the parser module is loaded as `sp`
# -------------------------------------------------------------------
try:
    sp
except NameError:
    # Same import as in the setup cell; needs project_root on sys.path
    import sudrfparser.sudrfparser as sp

# -------------------------------------------------------------------
# 1) Global config
# -------------------------------------------------------------------

# Administrative articles: "<article>_<part>"
adm_articles = ["6.21_1", "6.21_2", "6.21_3", "6.21_4", "6.21_5", "6.21_6", "6.21_7", "6.21_8"]

# Date range (inclusive) in DD.MM.YYYY
start_date = "01.01.2021"
end_date   = "31.12.2025"

# Where to save all results
base_path_to_save = nb_dir.parent.parent / "sudrf_results_run_18.12_6_21"
base_path_to_save.mkdir(exist_ok=True)

# TwoCaptcha config (if you don't want auto-captcha, set captcha_config = {})
captcha_config = {
    "config": {"apiKey": ""},  # put your real key here
    "numeric": 1,
    "minLen": 5,
    "maxLen": 5,
    "phrase": 0,
}

# Chromedriver path
path_to_driver = "/opt/homebrew/bin/chromedriver"

# Optional: limit to specific regions (e.g. {"78"}) or set None to process all
regions_whitelist = None  # e.g. {"78"} to only process St. Petersburg courts

# -------------------------------------------------------------------
# 2) Load courts list from sudrf_websites.json
# -------------------------------------------------------------------
courts_json_path = project_root / "sudrfparser" / "courts_info" / "sudrf_websites.json"
with courts_json_path.open("r", encoding="utf-8") as f:
    courts_by_region = json.load(f)

print("Base output directory:", base_path_to_save)
print("This run will save EACH court's summary separately and skip finished courts.\n")

# -------------------------------------------------------------------
# 3) Main loop: process court-by-court, saving immediately
# -------------------------------------------------------------------
for region_code, courts in courts_by_region.items():
    if regions_whitelist is not None and region_code not in regions_whitelist:
        continue

    print(f"\n=== Region {region_code} ===")

    for court in courts:
        court_id   = court["court_id"]
        court_name = court["court_name"]
        website    = court["court_website"]
        srv_nums   = court.get("srv", ["1"])

        # Folder for this court
        court_save_dir = base_path_to_save / region_code / court_id
        court_save_dir.mkdir(parents=True, exist_ok=True)

        summary_file = court_save_dir / "court_summary.json"
        error_file   = court_save_dir / "error.json"

        # Skip if already successfully processed
        if summary_file.exists():
            print(f"  -> SKIP {court_id} (summary exists)")
            continue

        print(f"  -> Processing {court_id}: {court_name}")
        print(f"     website={website}, srv={srv_nums}")

        try:
            result = sp.get_adm_cases(
                website=website,
                region=region_code,
                adm_articles=adm_articles,
                start_date=start_date,
                end_date=end_date,
                path_to_driver=path_to_driver,
                court_code=court_id,            # required for form2; ok for form1
                srv_num=srv_nums,               # e.g. ["1"] or ["1", "2"]
                path_to_save=str(court_save_dir) + os.sep,
                captcha_config=captcha_config,
            )

            # Save per-court summary immediately
            with summary_file.open("w", encoding="utf-8") as f:
                json.dump(result, f, ensure_ascii=False, indent=2)

            # Remove a stale error file left over from a previous failed run
            if error_file.exists():
                error_file.unlink()

            print(f"Saved summary -> {summary_file}")
        except Exception as e:
            print(f"ERROR for {court_id}: {e}")

            # Save error info for this court so we know it failed
            with error_file.open("w", encoding="utf-8") as f:
                json.dump({"error": str(e)}, f, ensure_ascii=False, indent=2)

            print(f"Saved error log -> {error_file}")
            # continue to next court

print("\nDone. You can inspect per-court folders under:", base_path_to_save)
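
Because the loop writes either `court_summary.json` or `error.json` into each `<region>/<court_id>` folder, the state of a long run can be read back from disk. The following is a minimal sketch of such a report. `summarize_run` is a hypothetical helper, and it assumes the folder layout created by the loop above.

```python
import json
from pathlib import Path


def summarize_run(base_dir) -> dict:
    """Classify every <region>/<court_id> folder as done, failed, or pending."""
    base_dir = Path(base_dir)
    done, failed, pending = [], {}, []
    for court_dir in sorted(p for p in base_dir.glob("*/*") if p.is_dir()):
        label = f"{court_dir.parent.name}/{court_dir.name}"
        error_file = court_dir / "error.json"
        if (court_dir / "court_summary.json").exists():
            done.append(label)
        elif error_file.exists():
            with error_file.open("r", encoding="utf-8") as f:
                failed[label] = json.load(f).get("error", "")
        else:
            # Folder was created but the court was never finished
            pending.append(label)
    return {"done": done, "failed": failed, "pending": pending}


# Example use in the notebook:
# report = summarize_run(base_path_to_save)
# print(len(report["done"]), "done,", len(report["failed"]), "failed")
```

Courts listed under `failed` are retried automatically on the next run, since only folders with a summary file are skipped.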