docs2md is a robust Python tool to convert multiple document formats to Markdown and apply customizable post‑processing adjustments. It uses markitdown as the conversion engine.
- Multi‑format conversion: PDF, DOCX, XLSX, images, audio, HTML, PowerPoint, CSV, JSON, XML, ZIP, YouTube URLs, and EPUB
- Configurable post‑processing: built‑in functions for tables, headers, and formatting
- YAML configuration: a simple, clear system to define inputs, outputs, and adjustments
- Robust path handling: cross‑platform validation and resolution
- Detailed logging: automatic log rotation and cleanup
- Modular architecture: maintainable code with separation of responsibilities
Launch the conversion tool using Docker Compose:
cp config.yml.example config.yml
docker compose run --rm docs2md
Ensure the following volume mappings are defined in docker-compose.yml:
- ./input → /app/input
- ./output → /app/output
- ./config.yml → /app/config.yml
- Clone the repository:
git clone https://github.com/mjanez/docs2md.git
cd docs2md
- Install dependencies using PDM:
pdm install
Note: This project uses PDM for dependency management. If you don't have PDM installed, install it first:
pip install pdm
Create a configuration file by copying the example template:
cp config.yml.example config.yml
Edit config.yml to specify your input file and output directory:
# docs2md configuration
input_file: "input/docx/your-document.docx"
output_dir: "output"
# Export embedded images (data:image/...;base64,...) to files and replace links in markdown
export_images: true
# Optional: custom directory for exported images (defaults to "output/<filename>_images")
# images_output_dir: "output/images"
# Optional: markdown adjustment functions
adjust_functions:
- adjust_markdown_tables
- remove_index_texts
- convert_usage_notes
- remove_bug_double_header_tables
- adjust_double_header_tables
- remove_exact_empty_cells_in_tables
- adjust_complex_double_header_tables
Configuration keys:
- input_file: Path to your input document (relative to project root)
- output_dir: Directory where converted files will be saved
- export_images: Export embedded images to files and replace data URIs in markdown (default: true)
- images_output_dir: Optional custom directory for exported images (defaults to output/<filename>_images)
- adjust_functions: List of post-processing functions to apply (optional)
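To make the documented defaults concrete, here is a minimal sketch of how a parsed configuration could be normalized. It is not taken from the docs2md source. The helper name `resolve_config` is hypothetical, it works on an already-parsed dictionary (for example, the output of a YAML loader), and it assumes that `<filename>` means the input file name without its extension and that `input_file` and `output_dir` are required.

```python
from pathlib import Path
from typing import Any


def resolve_config(raw: dict[str, Any]) -> dict[str, Any]:
    """Fill in documented defaults for a parsed config (hypothetical helper)."""
    # Treating input_file and output_dir as required is an assumption.
    missing = [key for key in ("input_file", "output_dir") if not raw.get(key)]
    if missing:
        raise ValueError(f"Missing required config keys: {', '.join(missing)}")

    input_file = Path(raw["input_file"])
    output_dir = Path(raw["output_dir"])

    # Documented default: "output/<filename>_images"; here it is built
    # relative to output_dir (an assumption).
    default_images_dir = output_dir / f"{input_file.stem}_images"

    return {
        "input_file": input_file,
        "output_dir": output_dir,
        "export_images": raw.get("export_images", True),
        "images_output_dir": Path(raw.get("images_output_dir") or default_images_dir),
        "adjust_functions": list(raw.get("adjust_functions") or []),
    }
```

Resolving defaults in one place keeps the rest of the code free of repeated "is this key set?" checks.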
The adjustment functions are organized in specialized modules.
Table adjustments:
- adjust_markdown_tables: Improves table formatting by splitting applicability information
- remove_bug_double_header_tables: Fixes tables with duplicate headers
- adjust_double_header_tables: Merges double header rows into proper single headers
- remove_exact_empty_cells_in_tables: Cleans up empty table cells
- adjust_complex_double_header_tables: Handles complex table structures with proper sectioning
Content adjustments:
- adjust_markdown_headers: Adjusts header structure (placeholder for future implementation)
- remove_index_texts: Removes stray index references like "Tabla . 1 - Description"
- convert_usage_notes: Converts usage notes to proper markdown note blocks
- remove_empty_lines_excess: Reduces excessive empty lines while preserving structure
- normalize_whitespace: Standardizes whitespace usage throughout the document
Convert a document using your configuration file:
pdm run python src/main.py config.yml
Direct module execution:
- Uses PDM's environment management
- Direct access to the main module
- Useful for development and debugging
You can create multiple configuration files for different documents:
# Create specific config files
cp config.yml config-document1.yml
cp config.yml config-document2.yml
# Convert different documents (using simple entrypoint)
python docs2md.py config-document1.yml
python docs2md.py config-document2.yml
# Alternative using PDM
pdm run python src/main.py config-document1.yml
pdm run python src/main.py config-document2.yml
- Place your document in the input/ directory (create subdirectories as needed)
- Update config.yml with the correct input file path
- Run the converter:
python docs2md.py config.yml
- Check the output in your specified output directory
The tool generates two files:
- [filename].md: The raw converted markdown
- [filename]_adjusted.md: The markdown with applied adjustments
If export_images is enabled, any embedded base64 images are exported to a folder (by default output/[filename]_images) and the Markdown links are updated to reference those files.
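The image export step can be sketched as follows. This is an illustration of the idea, not the docs2md implementation. The function name, the `image_001.png` naming scheme, and the regular expression are assumptions, and the pattern only handles simple image subtypes such as png or jpeg.

```python
import base64
import re
from pathlib import Path

# Matches markdown images whose target is a base64 data URI, e.g.
# ![alt](data:image/png;base64,iVBORw0...)
DATA_URI_IMAGE = re.compile(
    r"!\[(?P<alt>[^\]]*)\]"
    r"\(data:image/(?P<ext>[a-zA-Z0-9]+);base64,(?P<data>[A-Za-z0-9+/=\s]+)\)"
)


def export_embedded_images(markdown: str, images_dir: Path, link_prefix: str) -> str:
    """Write each embedded image to images_dir and return markdown with file links."""
    images_dir.mkdir(parents=True, exist_ok=True)
    counter = 0

    def replace(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        # Sequential file names are an assumption made for this sketch.
        file_name = f"image_{counter:03d}.{match.group('ext')}"
        (images_dir / file_name).write_bytes(base64.b64decode(match.group("data")))
        return f"![{match.group('alt')}]({link_prefix}/{file_name})"

    return DATA_URI_IMAGE.sub(replace, markdown)
```

Replacing large base64 payloads with file links keeps the generated Markdown small and readable in editors and diffs.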
Logs are automatically saved in the output directory with timestamps and are cleaned up automatically (the 10 most recent log files are kept).
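A retention rule like "keep the 10 most recent log files" can be sketched as below. The function name, the `*.log` glob pattern, and the use of modification time to decide recency are assumptions; the actual tool may name and order its logs differently.

```python
from pathlib import Path


def cleanup_old_logs(log_dir: Path, keep: int = 10, pattern: str = "*.log") -> list[Path]:
    """Delete all but the `keep` most recently modified logs; return removed paths."""
    # Newest first, so everything after the first `keep` entries is stale.
    logs = sorted(log_dir.glob(pattern), key=lambda p: p.stat().st_mtime, reverse=True)
    removed = []
    for old_log in logs[keep:]:
        old_log.unlink()
        removed.append(old_log)
    return removed
```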
The adjustment functions have been organized into specialized modules for better maintainability and traceability.
Table adjustments:
- adjust_markdown_tables: Split applicability into separate rows
- remove_bug_double_header_tables: Remove duplicate header rows
- adjust_double_header_tables: Fix malformed headers
- remove_exact_empty_cells_in_tables: Clean empty table cells
- adjust_complex_double_header_tables: Handle complex table structures
Content adjustments:
- adjust_markdown_headers: Header structure improvements
- remove_index_texts: Remove table references and indices
- convert_usage_notes: Convert notes to markdown note blocks
- remove_empty_lines_excess: Reduce excessive empty lines
- normalize_whitespace: Standardize whitespace usage
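One way a list of adjustment names can drive a modular set of functions is a name-to-function registry. The sketch below is illustrative only: `run_adjustments` is a hypothetical helper, the registry here is an empty stand-in for the `ADJUSTMENT_FUNCTIONS` mapping that this README places in src/adjustments/__init__.py, and both applying the names in list order and skipping unknown names are assumptions.

```python
import logging
from typing import Callable, Dict, Iterable, List

# Stand-in registry for this sketch; each function takes a markdown file path.
ADJUSTMENT_FUNCTIONS: Dict[str, Callable[[str], None]] = {}


def run_adjustments(file_path: str, names: Iterable[str]) -> List[str]:
    """Apply the named adjustments to file_path in order; return the names that ran."""
    applied = []
    for name in names:
        func = ADJUSTMENT_FUNCTIONS.get(name)
        if func is None:
            # Skipping unknown names (instead of failing) is a choice of this sketch.
            logging.warning("Unknown adjustment %r, skipping", name)
            continue
        func(file_path)
        applied.append(name)
    return applied
```

Because each adjustment rewrites the file in place, the order of names matters: a later adjustment sees the output of the earlier ones.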
You can now apply only specific types of adjustments:
# Apply only table adjustments
python demo_adjustments.py table output/document.md
# Apply only content adjustments
python demo_adjustments.py content output/document.md
# Apply custom pipeline
python demo_adjustments.py custom output/document.md "adjust_markdown_tables,remove_index_texts"
# Show available adjustments
python demo_adjustments.py info
Create targeted configuration files for different adjustment needs:
# config-tables-only.yml
input_file: "input/docx/document.docx"
output_dir: "output"
adjust_functions:
- adjust_markdown_tables
- remove_exact_empty_cells_in_tables
- adjust_double_header_tables

# config-content-only.yml
input_file: "input/docx/document.docx"
output_dir: "output"
adjust_functions:
- remove_index_texts
- convert_usage_notes
- normalize_whitespace
If you want to add a new adjustment (e.g., to fix a specific table pattern or normalize specific text), follow these steps to maintain consistency with the modular architecture:
- Create the adjustment module
  - Create a file in src/adjustments/, for example src/adjustments/my_adjustments.py, or group by category (table_*, content_*).
  - Import base utilities from src/adjustments/base.py:
from .base import validate_file_path, apply_pattern_to_file, apply_line_replacements
import re
import logging


def my_new_adjustment(file_path: str) -> None:
    """Describe what the adjustment does.

    Args:
        file_path: Path to the markdown file to modify.
    """
    file_path = validate_file_path(file_path)

    # Example: replace by pattern in each line
    pat = re.compile(r"PATTERN_TO_SEARCH")

    def repl(match, lines, i):
        # Build and return the replacement string for the line
        return match.group(0).replace('old', 'new')

    apply_pattern_to_file(file_path, pat, repl)
    logging.info(f"Applied my_new_adjustment to {file_path}")
- Register the adjustment in the adjustments package
  - Open src/adjustments/__init__.py and add an import and an entry in ADJUSTMENT_FUNCTIONS:
from .my_adjustments import my_new_adjustment
ADJUSTMENT_FUNCTIONS['my_new_adjustment'] = my_new_adjustment
- Add tests
  - Create a test in tests/test_my_adjustment.py that creates a temporary file, runs the function, and validates the output (using pytest):
import tempfile
from pathlib import Path

from adjustments.my_adjustments import my_new_adjustment


def test_my_new_adjustment():
    # delete=False keeps the file on disk after the handle is closed,
    # so the adjustment can reopen it by name on any platform.
    with tempfile.NamedTemporaryFile(mode='w', suffix='.md', delete=False) as tmp:
        tmp.write('| **Applicability** | old. 1 |\n')

    my_new_adjustment(tmp.name)

    # Read the file back from disk rather than through the original handle.
    content = Path(tmp.name).read_text()
    assert 'new' in content
  - Run the tests:
pdm run pytest tests/test_my_adjustment.py
- Hot test / dry-run
  - Use the demo script to apply and check the adjustment without modifying the global configuration:
python demo_adjustments.py custom output/document.md "my_new_adjustment"
General best practices for adjustments:
- Add clear docstrings describing inputs/outputs and side effects.
- Ensure idempotency: running the adjustment multiple times should not break the document.
- Handle errors with logging and avoid silent exceptions.
- Optimize regex patterns to avoid unnecessary iteration over large lines.
- If the adjustment is costly, consider processing in chunks or using a max_rows parameter in the function for testing.
With this template, you can create specialized adjustments and easily register them in the modular system.
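The idempotency guideline above can be checked mechanically. The sketch below is an assumption-laden illustration: `collapse_blank_lines` is a hypothetical adjustment written only for this example (it is not one of the built-in functions), and `is_idempotent` is a hypothetical helper that runs any file-based adjustment twice and compares the results.

```python
import re
from pathlib import Path
from typing import Callable


def collapse_blank_lines(file_path: str) -> None:
    """Example adjustment (hypothetical): reduce runs of 3+ newlines to one blank line."""
    path = Path(file_path)
    text = path.read_text(encoding="utf-8")
    path.write_text(re.sub(r"\n{3,}", "\n\n", text), encoding="utf-8")


def is_idempotent(adjustment: Callable[[str], None], file_path: str) -> bool:
    """Run the adjustment twice and report whether the second run changed nothing."""
    path = Path(file_path)
    adjustment(file_path)
    after_first = path.read_text(encoding="utf-8")
    adjustment(file_path)
    return path.read_text(encoding="utf-8") == after_first
```

A check like this fits naturally into a pytest test for each new adjustment, run against a copy of a representative document since the file is modified in place.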
- "Configuration file not found"
  - Ensure config.yml exists in the project root
  - Check that the path to your config file is correct
- "Input file not found"
  - Verify the input_file path in your config.yml
  - Ensure the file exists and is accessible
- Python import errors
  - Make sure you're running from the project root: python docs2md.py config.yml
  - Alternative: pdm run python src/main.py config.yml
  - Verify all dependencies are installed: pdm install
- Permission errors
  - Ensure you have read access to input files
  - Verify write permissions for the output directory
- Check the logs in your output directory for detailed error information
- Review the configuration file format against the example
- Ensure all required fields are present in your config.yml
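Several of the checks above can be run before a conversion starts. The following is a minimal sketch and not part of docs2md: `preflight_check` is a hypothetical helper, the message wording is illustrative, and it assumes that a missing output directory is acceptable because it can be created later.

```python
import os
from pathlib import Path
from typing import List


def preflight_check(config_path: str, input_file: str, output_dir: str) -> List[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    if not Path(config_path).is_file():
        problems.append(f"Configuration file not found: {config_path}")
    if not Path(input_file).is_file():
        problems.append(f"Input file not found: {input_file}")
    elif not os.access(input_file, os.R_OK):
        problems.append(f"Input file is not readable: {input_file}")
    out = Path(output_dir)
    # Only an existing directory is checked for write access (assumption:
    # a missing directory will be created by the caller).
    if out.exists() and not os.access(out, os.W_OK):
        problems.append(f"Output directory is not writable: {output_dir}")
    return problems
```

Collecting every problem in one pass, rather than stopping at the first, lets a user fix the configuration in a single round.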
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.
This project is licensed under the MIT License. See the LICENSE file for details.