Converts PDF or Word to MD using markitdown and adjusts using custom functions.

docs2md - Document to Markdown Converter

docs2md is a robust Python tool to convert multiple document formats to Markdown and apply customizable post‑processing adjustments. It uses markitdown as the conversion engine.

Features

  • Multi‑format conversion: PDF, DOCX, XLSX, images, audio, HTML, PowerPoint, CSV, JSON, XML, ZIP, YouTube URLs, and EPUB
  • Configurable post‑processing: built‑in functions for tables, headers, and formatting
  • YAML configuration: a simple, clear system to define inputs, outputs, and adjustments
  • Robust path handling: cross‑platform validation and resolution
  • Detailed logging: automatic log rotation and cleanup
  • Modular architecture: maintainable code with separation of responsibilities

Using

Docker (Docker Compose, recommended)

Launch the conversion tool using Docker Compose:

cp config.yml.example config.yml
docker compose run --rm docs2md

Volume Mapping

Make sure docker-compose.yml defines the following volume mappings (host path → container path):

  • ./input → /app/input
  • ./output → /app/output
  • ./config.yml → /app/config.yml

Local setup (PDM)

  1. Clone the repository:

    git clone https://github.com/mjanez/docs2md.git
    cd docs2md
  2. Install dependencies using PDM:

    pdm install

    Note: This project uses PDM for dependency management. If you don't have PDM installed, install it first:

    pip install pdm

Configuration

Create a configuration file by copying the example template:

cp config.yml.example config.yml

Edit config.yml to specify your input file and output directory:

# docs2md configuration
input_file: "input/docx/your-document.docx"
output_dir: "output"

# Export embedded images (data:image/...;base64,...) to files and replace links in markdown
export_images: true
# Optional: custom directory for exported images (defaults to "output/<filename>_images")
# images_output_dir: "output/images"

# Optional: markdown adjustment functions
adjust_functions:
  - adjust_markdown_tables
  - remove_index_texts
  - convert_usage_notes
  - remove_bug_double_header_tables
  - adjust_double_header_tables
  - remove_exact_empty_cells_in_tables
  - adjust_complex_double_header_tables

Configuration Options

  • input_file: Path to your input document (relative to project root)
  • output_dir: Directory where converted files will be saved
  • export_images: Export embedded images to files and replace data URIs in markdown (default: true)
  • images_output_dir: Optional custom directory for exported images (defaults to output/<filename>_images)
  • adjust_functions: List of post-processing functions to apply (optional)
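The defaults listed above can be sketched in a few lines of Python. This is only an illustration, not the project's actual loader: `resolve_config` is a hypothetical helper, it takes an already-parsed mapping (so no YAML library is needed to run it), it assumes `input_file` and `output_dir` are the required keys, and it assumes `<filename>` means the input file name without its extension.

```python
from pathlib import Path


def resolve_config(raw: dict) -> dict:
    """Fill in the documented defaults for a parsed config.yml mapping."""
    # Assumption: these two keys are the required ones.
    missing = [key for key in ("input_file", "output_dir") if not raw.get(key)]
    if missing:
        raise ValueError(f"Missing required config keys: {', '.join(missing)}")

    input_file = Path(raw["input_file"])
    output_dir = Path(raw["output_dir"])
    # Assumption: "<filename>" is the input name without its extension.
    default_images_dir = output_dir / f"{input_file.stem}_images"

    return {
        "input_file": input_file,
        "output_dir": output_dir,
        "export_images": raw.get("export_images", True),
        "images_output_dir": Path(raw.get("images_output_dir") or default_images_dir),
        "adjust_functions": list(raw.get("adjust_functions") or []),
    }


config = resolve_config({"input_file": "input/docx/report.docx", "output_dir": "output"})
print(config["images_output_dir"].as_posix())  # output/report_images
```

Resolving every default in one place means the rest of a pipeline can read the result without repeating fallback logic.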

Available Adjustment Functions

The adjustment functions are now organized in specialized modules:

Table Adjustments

  • adjust_markdown_tables: Improves table formatting by splitting applicability information
  • remove_bug_double_header_tables: Fixes tables with duplicate headers
  • adjust_double_header_tables: Merges double header rows into proper single headers
  • remove_exact_empty_cells_in_tables: Cleans up empty table cells
  • adjust_complex_double_header_tables: Handles complex table structures with proper sectioning

Content Adjustments

  • adjust_markdown_headers: Adjusts header structure (placeholder for future implementation)
  • remove_index_texts: Removes stray index references like "Tabla . 1 - Description"
  • convert_usage_notes: Converts usage notes to proper markdown note blocks
  • remove_empty_lines_excess: Reduces excessive empty lines while preserving structure
  • normalize_whitespace: Standardizes whitespace usage throughout the document

Usage

Convert a document using your configuration file:

pdm run python src/main.py config.yml

Direct Module Execution

The command above runs src/main.py directly. This approach:

  • Uses PDM's environment management
  • Gives direct access to the main module
  • Is useful for development and debugging

Advanced Usage

You can create multiple configuration files for different documents:

# Create specific config files
cp config.yml config-document1.yml
cp config.yml config-document2.yml

# Convert different documents (using the simple entry point)
python docs2md.py config-document1.yml
python docs2md.py config-document2.yml

# Alternative using PDM
pdm run python src/main.py config-document1.yml
pdm run python src/main.py config-document2.yml

Example Workflow

  1. Place your document in the input/ directory (create subdirectories as needed)
  2. Update config.yml with the correct input file path
  3. Run the converter:
    python docs2md.py config.yml
  4. Check the output in your specified output directory

Output

The tool generates two files:

  • [filename].md: The raw converted markdown
  • [filename]_adjusted.md: The markdown with applied adjustments

If export_images is enabled, any embedded base64 images are exported to a folder (by default output/[filename]_images) and the Markdown links are updated to reference those files.
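As a rough illustration of what exporting embedded images involves, the sketch below finds `data:image/...;base64,...` links in a Markdown string, writes each payload to a file, and rewrites the link. It is not the project's implementation: `export_embedded_images`, the `image_001.png` naming scheme, and the assumption that the images folder sits next to the Markdown file are all hypothetical.

```python
import base64
import re
from pathlib import Path

# Matches ![alt](data:image/<subtype>;base64,<payload>)
DATA_URI_IMAGE = re.compile(
    r"!\[(?P<alt>[^\]]*)\]"
    r"\(data:image/(?P<subtype>[\w.+-]+);base64,(?P<payload>[A-Za-z0-9+/=]+)\)"
)


def export_embedded_images(markdown: str, images_dir: Path) -> str:
    """Write each embedded image to images_dir and return markdown with file links."""
    images_dir.mkdir(parents=True, exist_ok=True)
    counter = 0

    def replace(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        # Hypothetical naming scheme: image_001.png, image_002.jpeg, ...
        extension = match.group("subtype").split("+")[0]  # "svg+xml" -> "svg"
        target = images_dir / f"image_{counter:03d}.{extension}"
        target.write_bytes(base64.b64decode(match.group("payload")))
        # Assumption: the images folder is a sibling of the markdown file.
        return f"![{match.group('alt')}]({images_dir.name}/{target.name})"

    return DATA_URI_IMAGE.sub(replace, markdown)
```

Replacing the data URIs keeps the Markdown small and readable, since a single embedded image can otherwise add thousands of characters to one line.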

Timestamped logs are saved automatically in the output directory. Old logs are cleaned up so that only the 10 most recent log files are kept.
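A keep-the-latest-N cleanup like the one described for logs might look like the sketch below. It is an assumption-laden illustration rather than the tool's own code: `cleanup_old_logs` is a hypothetical name, log files are assumed to end in `.log`, and "most recent" is judged by modification time.

```python
from pathlib import Path


def cleanup_old_logs(log_dir: Path, keep: int = 10, pattern: str = "*.log") -> list[Path]:
    """Delete all but the `keep` most recently modified logs; return the removed paths."""
    # Newest first, so everything past index `keep` is stale.
    logs = sorted(log_dir.glob(pattern), key=lambda p: p.stat().st_mtime, reverse=True)
    removed = logs[keep:]
    for old_log in removed:
        old_log.unlink()
    return removed
```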

Modular Adjustment System

The adjustment functions have been organized into specialized modules for better maintainability and traceability:

Table Adjustments (adjustments/table_adjustments.py)

  • adjust_markdown_tables: Split applicability into separate rows
  • remove_bug_double_header_tables: Remove duplicate header rows
  • adjust_double_header_tables: Fix malformed headers
  • remove_exact_empty_cells_in_tables: Clean empty table cells
  • adjust_complex_double_header_tables: Handle complex table structures

Content Adjustments (adjustments/content_adjustments.py)

  • adjust_markdown_headers: Header structure improvements
  • remove_index_texts: Remove table references and indices
  • convert_usage_notes: Convert notes to markdown note blocks
  • remove_empty_lines_excess: Reduce excessive empty lines
  • normalize_whitespace: Standardize whitespace usage
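Conceptually, the names listed under adjust_functions are looked up in the ADJUSTMENT_FUNCTIONS registry (see the Development section) and applied in order. The sketch below shows one way such a lookup loop could work. The registry here is a stand-in filled with no-op functions, and `run_adjustments` and its skip-unknown-names behavior are assumptions, not the project's confirmed logic.

```python
import logging
from typing import Callable, Dict, Iterable, List

# Stand-in registry with hypothetical no-op functions. The real registry is
# defined in src/adjustments/__init__.py and maps names to functions that
# take the path of the markdown file to modify.
ADJUSTMENT_FUNCTIONS: Dict[str, Callable[[str], None]] = {
    "adjust_markdown_tables": lambda file_path: None,
    "remove_index_texts": lambda file_path: None,
}


def run_adjustments(file_path: str, names: Iterable[str]) -> List[str]:
    """Apply the named adjustments in order; return the names actually applied."""
    applied = []
    for name in names:
        func = ADJUSTMENT_FUNCTIONS.get(name)
        if func is None:
            # Assumption: unknown names are logged and skipped, not fatal.
            logging.warning("Unknown adjustment %r, skipping", name)
            continue
        func(file_path)
        applied.append(name)
    return applied
```

Because the functions run in list order, the order of entries in adjust_functions can matter when one adjustment depends on the output of another.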

Using Specific Adjustment Categories

You can now apply only specific types of adjustments:

# Apply only table adjustments
python demo_adjustments.py table output/document.md

# Apply only content adjustments  
python demo_adjustments.py content output/document.md

# Apply custom pipeline
python demo_adjustments.py custom output/document.md "adjust_markdown_tables,remove_index_texts"

# Show available adjustments
python demo_adjustments.py info

Advanced Configuration

Create targeted configuration files for different adjustment needs:

# config-tables-only.yml
input_file: "input/docx/document.docx"
output_dir: "output"
adjust_functions:
  - adjust_markdown_tables
  - remove_exact_empty_cells_in_tables
  - adjust_double_header_tables

# config-content-only.yml
input_file: "input/docx/document.docx"
output_dir: "output"
adjust_functions:
  - remove_index_texts
  - convert_usage_notes
  - normalize_whitespace

Development

If you want to add a new adjustment (e.g., to fix a specific table pattern or normalize specific text), follow these steps to maintain consistency with the modular architecture:

  1. Create the adjustment module

    • Create a file in src/adjustments/, for example src/adjustments/my_adjustments.py or group by category (table_*, content_*).
    • Import base utilities from src/adjustments/base.py:
import logging
import re

from .base import validate_file_path, apply_pattern_to_file, apply_line_replacements

def my_new_adjustment(file_path: str) -> None:
    """Describe what the adjustment does.

    Args:
        file_path: Path to the markdown file to modify.
    """
    file_path = validate_file_path(file_path)

    # Example: replace by pattern in each line.
    # PATTERN_TO_SEARCH is a placeholder; substitute the regex for your case.
    pat = re.compile(r"PATTERN_TO_SEARCH")

    def repl(match, lines, i):
        # Build and return the replacement string for the line
        return match.group(0).replace('old', 'new')

    apply_pattern_to_file(file_path, pat, repl)
    logging.info("Applied my_new_adjustment to %s", file_path)
  2. Register the adjustment in the adjustments package

    • Open src/adjustments/__init__.py and add an import and an entry in ADJUSTMENT_FUNCTIONS:
from .my_adjustments import my_new_adjustment

ADJUSTMENT_FUNCTIONS['my_new_adjustment'] = my_new_adjustment
  3. Add tests

    • Create a test in tests/test_my_adjustment.py that creates a temporary file, runs the function, and validates the output (using pytest):
from adjustments.my_adjustments import my_new_adjustment

def test_my_new_adjustment(tmp_path):
    # tmp_path is pytest's built-in per-test temporary directory fixture,
    # so no temporary file is left behind after the run.
    md_file = tmp_path / 'sample.md'
    md_file.write_text('| **Applicability** | old. 1 |\n', encoding='utf-8')

    my_new_adjustment(str(md_file))

    # Re-read from disk so the assertion sees what the adjustment wrote.
    assert 'new' in md_file.read_text(encoding='utf-8')
    • Run the tests:
pdm run pytest tests/test_my_adjustment.py
  4. Hot test / dry-run

    • Use the demo script to apply and check the adjustment without modifying the global configuration:
python demo_adjustments.py custom output/document.md "my_new_adjustment"

General Best Practices for Adjustments:

  • Add clear docstrings describing inputs/outputs and side effects.
  • Ensure idempotency: running the adjustment multiple times should not break the document.
  • Handle errors with logging and avoid silent exceptions.
  • Optimize regex patterns to avoid unnecessary iteration over large lines.
  • If the adjustment is costly, consider processing in chunks or using a max_rows parameter in the function for testing.
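The idempotency guideline is easy to check mechanically: run an adjustment twice and confirm the second run changes nothing. The sketch below is a minimal illustration. `collapse_blank_lines` is a hypothetical adjustment written only for this example, and `is_idempotent` is a hypothetical helper, not part of the project.

```python
import re
from pathlib import Path


def collapse_blank_lines(file_path: str) -> None:
    """Hypothetical adjustment: reduce runs of 3+ newlines to a single blank line."""
    path = Path(file_path)
    text = path.read_text(encoding="utf-8")
    path.write_text(re.sub(r"\n{3,}", "\n\n", text), encoding="utf-8")


def is_idempotent(adjustment, file_path: str) -> bool:
    """Run the adjustment twice and report whether the second run was a no-op."""
    path = Path(file_path)
    adjustment(file_path)
    after_first_run = path.read_text(encoding="utf-8")
    adjustment(file_path)
    return path.read_text(encoding="utf-8") == after_first_run
```

A check like this fits naturally into the pytest file for a new adjustment, next to the test that validates its output.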

With this template, you can create specialized adjustments and easily register them in the modular system.

Troubleshooting

Common Issues

  1. "Configuration file not found"

    • Ensure config.yml exists in the project root
    • Check that the path to your config file is correct
  2. "Input file not found"

    • Verify the input_file path in your config.yml
    • Ensure the file exists and is accessible
  3. Python import errors

    • Make sure you're running from the project root: python docs2md.py config.yml
    • Alternative: pdm run python src/main.py config.yml
    • Verify all dependencies are installed: pdm install
  4. Permission errors

    • Ensure you have read access to input files
    • Verify write permissions for the output directory

Getting Help

  • Check the logs in your output directory for detailed error information
  • Review the configuration file format against the example
  • Ensure all required fields are present in your config.yml

Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

License

This project is licensed under the MIT License. See the LICENSE file for details.
