# Markdown to Clean Text Conversion

This notebook converts markdown files to clean text files by stripping markdown formatting and images and reducing links to their visible text, while preserving the main content. It's part of the data preprocessing pipeline for text analysis.

In [4]:
from pathlib import Path
import re

## Define Input/Output Directories

We set up the input directory containing markdown files and create an output directory for the cleaned text files.

In [5]:
markdown_dir = Path("../../data/markdown")
txt_output_dir = Path("../../data/txt_clean")
txt_output_dir.mkdir(parents=True, exist_ok=True)
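
A quick sanity check before converting can catch a wrong relative path early. The helper below is a minimal sketch and not part of the original notebook; the name `list_markdown_files` is an assumption made for illustration.

```python
from pathlib import Path


def list_markdown_files(directory):
    """Return the .md files directly inside `directory`, sorted by name.

    Hypothetical helper: raises FileNotFoundError if the directory is missing,
    so a mistyped relative path fails loudly instead of silently matching nothing.
    """
    directory = Path(directory)
    if not directory.is_dir():
        raise FileNotFoundError(f"Input directory not found: {directory}")
    return sorted(directory.glob("*.md"))
```

For example, `len(list_markdown_files(markdown_dir))` gives the number of files the conversion loop would process.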

## Conversion Process

The conversion cell:
1. Defines a `clean_markdown()` function that removes:
   - Base64 encoded images
   - External images, i.e. image embeds whose URL starts with `http`
   - Markdown links (keeping only the link text)
   - Markdown heading markers (`#`, `##`, ...)
   - Emphasis and inline-code markers (`*`, `**`, `_`, and backticks)
   - Excess line breaks
2. Processes each markdown file in the input directory
3. Saves the cleaned text to corresponding .txt files in the output directory
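
As a rough illustration of the removal rules in step 1, the sketch below applies comparable regular expressions to a short made-up sample. The function name `strip_markdown_demo`, the sample string, and the example.com URLs are hypothetical, and the single image pattern is a simplification that treats base64 and external images alike.

```python
import re


def strip_markdown_demo(text):
    """Hypothetical stand-in that mirrors the removal rules listed above."""
    text = re.sub(r'!\[[^\]]*\]\([^)]*\)', '', text)        # drop images entirely
    text = re.sub(r'\[([^\]]+)\]\([^)]+\)', r'\1', text)    # links: keep only the text
    text = re.sub(r'^#+\s+', '', text, flags=re.MULTILINE)  # heading markers
    text = re.sub(r'[*_`]', '', text)                       # emphasis and backticks
    text = re.sub(r'\n{3,}', '\n\n', text)                  # collapse extra blank lines
    return text.strip()


sample = (
    "## Results\n\n"
    "![plot](https://example.com/plot.png)\n\n\n"
    "See the **full** [paper](https://example.com/paper) for _details_.\n"
)

print(strip_markdown_demo(sample))
# Results
#
# See the full paper for details.
```

Because the emphasis rule deletes every `*`, `_`, and backtick, it also alters ordinary text that contains those characters, such as snake_case identifiers.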

In [6]:
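def clean_markdown(md_text):
    """Strip common markdown syntax and return plain text."""
    # Remove base64-embedded images: ![alt](data:image/...)
    # [^\]]* keeps the match inside one pair of brackets, so text between
    # an unrelated image and a later link is not swallowed.
    md_text = re.sub(r'!\[[^\]]*\]\(data:image/[^)]*\)', '', md_text)
    # Remove externally hosted images: ![alt](http...)
    md_text = re.sub(r'!\[[^\]]*\]\(http[^)]*\)', '', md_text)
    # Replace links [text](url) with just their text
    md_text = re.sub(r'\[([^\]]+)\]\([^\)]+\)', r'\1', md_text)
    # Remove heading markers (#, ##, ### etc.) at the start of a line
    md_text = re.sub(r'^#+\s+', '', md_text, flags=re.MULTILINE)
    # Remove emphasis and inline-code markers (*, **, _, `).
    # Note: this also strips these characters from ordinary text (e.g. snake_case).
    md_text = re.sub(r'[*_`]', '', md_text)
    # Collapse runs of three or more newlines into a single blank line
    md_text = re.sub(r'\n{3,}', '\n\n', md_text)
    return md_text.strip()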

# Loop through .md files
for md_path in markdown_dir.glob("*.md"):
    with md_path.open("r", encoding="utf-8") as f:
        md_text = f.read()

    cleaned_text = clean_markdown(md_text)

    txt_filename = md_path.stem + ".txt"
    txt_path = txt_output_dir / txt_filename

    with txt_path.open("w", encoding="utf-8") as f:
        f.write(cleaned_text)

    print(f"Cleaned and saved: {txt_filename}")

Cleaned and saved: Shimizu and Srinivasan - 2022 - Improving classification and reconstruction of imagined images from EEG signals.txt
Cleaned and saved: Moses et al. - 2021 - Neuroprosthesis for Decoding Speech in a Paralyzed Person with Anarthria.txt
Cleaned and saved: Shukla et al. - 2025 - A Survey on Bridging EEG Signals and Generative AI From Image and Text to Beyond.txt
Cleaned and saved: Goldstein et al. - 2024 - Alignment of brain embeddings and artificial contextual embeddings in natural language points to com.txt
Cleaned and saved: Anumanchipalli et al. - 2019 - Speech synthesis from neural decoding of spoken sentences.txt
Cleaned and saved: Guenther et al. - 2024 - Image classification and reconstruction from low-density EEG.txt
Cleaned and saved: Tang et al. - 2023 - Semantic reconstruction of continuous language from non-invasive brain recordings.txt
Cleaned and saved: Pereira et al. - 2018 - Toward a universal decoder of linguistic meaning from brain activation.txt
Clean