<a href="https://colab.research.google.com/github/jvictorferreira3301/LaTeX_OCR_IC/blob/main/latexOCR.ipynb" target="_parent">
        <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" style="height: 27px; margin-right: 10px;"/>
</a>

## LaTeX OCR 📝

### 1. Introduction

LaTeX-OCR is an application that combines optical character recognition (OCR) with Transformer models to convert images of mathematical expressions into LaTeX code, saving time in academic and scientific writing. It still struggles with complex inputs and handwritten manuscripts, but it shows how AI can ease the digitization and editing of mathematical content.

<div style="display: flex; justify-content: center;">
    <figure style="margin-right: 10px; text-align: center;">
        <img src="./assets/wf.png" width="auto" height="auto" style="display: flex; margin:0">
        <figcaption>Fig. 1 - Flowchart of the application.</figcaption>
    </figure>
    <figure style="text-align: center;">
        <img src="./assets/gif.gif" width="auto" height="auto" style="display: flex; margin:10px">
        <figcaption>Fig. 2 - Example of use with the GUI.</figcaption>
    </figure>
</div>

#### 1.1. Background

<div style="text-align: center;">
    <figure>
        <img src="./assets/timeline.svg" width="auto" height="auto">
        <figcaption>Fig. 3 - Timeline of models.</figcaption>
    </figure>
</div>

- **Optical Character Recognition (OCR):** OCR is the technology used to identify and extract text from images. In applications like LaTeX-OCR, it must go beyond reading plain text and also recognize the hierarchical structure of mathematical expressions.

- **Image Tokenization:** Tokenization is the process of breaking data into smaller units, called tokens, that can be processed individually. In traditional text OCR, each character or word may be a token; in images of equations, each part of the expression (symbol, operator, or substructure) can be tokenized.

- **Attention and Self-Attention:** Attention is a mechanism that allows the model to “focus” on specific parts of the data when making decisions. Self-attention enables each part of the sequence (or image) to consider all other parts, capturing complex dependencies.

- **Positional Encoding:** Transformers process all tokens of a sequence in parallel and have no built-in notion of order, so positional encoding adds position information to each token to preserve that order.

- **Transformers and Encoder-Decoder Networks:** Transformers are neural networks that use attention mechanisms to process data sequences. Encoder-Decoder models are a common Transformer architecture for translating input sequences to output sequences.
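
The attention and positional-encoding ideas above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the model used in this notebook: it uses the fixed sinusoidal encoding (one common choice), sets queries, keys, and values equal to the tokens (no learned projections), and works on made-up embedding values.

```python
import math

def positional_encoding(num_positions, dim):
    """Sinusoidal positional encoding: one `dim`-sized vector per position."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            # Each even/odd index pair shares a frequency; even uses sin, odd uses cos.
            angle = pos / (10000 ** ((i - i % 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens (assumption: no learned weights)."""
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        row = softmax(scores)
        weights.append(row)
        # Output is the attention-weighted average of all tokens.
        outputs.append([sum(w * tok[d] for w, tok in zip(row, tokens)) for d in range(dim)])
    return outputs, weights

# Three toy "token" embeddings of size 4 (made-up values).
embeddings = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.3, 0.1], [0.5, 0.5, 0.0, 1.0]]
pe = positional_encoding(len(embeddings), 4)
# Add position information to each embedding before attention.
tokens = [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]
outputs, weights = self_attention(tokens)
print([round(w, 3) for w in weights[0]])  # how much token 0 attends to each token
```

Each row of `weights` sums to 1, so every output vector is a weighted average over the whole sequence, which is how each token can take all others into account.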

With this foundation in place, we train the model on a dataset extracted from the LaTeX source files of the following books by Prof. Aldebaro Klautau:

- _Digital Signal Processing with Python, Matlab or Octave_
- _Digital Communications with Python, Matlab or Octave_


### 2. Data extraction

#### 2.1. Extract equations

Extract all equations from the .tex files into a single .tex file.

In [None]:
# 70% ChatGPT :D

import re
from pathlib import Path
from wand.image import Image
from wand.color import Color
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

def load_macros(macros_file):
    """Reads macros from a LaTeX macros file and returns a dictionary of replacements."""
    macros = {}
    macro_pattern = r'\\(?:def|newcommand)\s*\\(\w+)\s*(?:\[(.*?)\])?\s*(\{(.*)\})'
    
    with open(macros_file, 'r') as file:
        for line in file:
            match = re.match(macro_pattern, line.strip())
            if match:
                command, _, definition, inside_def = match.groups()

                macros[rf'\\{command}'] = inside_def  # Store macro with regex-escaped backslash

                # print(f'{command} {inside_def}')
            
    return macros

def replacement_function(match, definition, param_mapping):
    if len(param_mapping) > 0:
        print(f'{match} {definition} {param_mapping}')

def replace_macros(equations, macros):
    """Replaces macros in equations using the given macros dictionary, ensuring standalone replacements."""
    for i, equation in enumerate(equations):
        for macro, definition in macros.items():
            # Replace each macro with its definition
            # This regex captures the argument inside the curly braces
            equation = re.sub(
                rf'(?<!\\)({macro})\{{(.*?)\}}',  # Match macro and its argument
                lambda match: definition.replace('#1', match.group(2).replace('$', '')),  # Replace #1 with the captured argument
                equation
            )
          

        equations[i] = equation
    return equations

def replace_macros2(equations, macros):
    """Replaces macros in equations using the given macros dictionary, ensuring standalone replacements."""
    for i, equation in enumerate(equations):
        for macro, definition in macros.items():
            # Check if the definition contains placeholders
            param_pattern = re.findall(r'#(\d+)', definition)

            # Create a mapping for parameters if there are placeholders
            param_mapping = {}
            for j in range(len(param_pattern)):
                param_index = int(param_pattern[j]) - 1  # Convert to 0-based index
                # Capture parameters from the equation based on their positions
                matches = re.findall(r'(\{.*?\}|\S+)', equation)  # Capture everything in braces and standalone words
                if param_index < len(matches):
                    param_mapping[f'#{j + 1}'] = matches[param_index]

            # Replace each macro using a lambda to avoid escape sequence issues
            equation = re.sub(
                rf'(?<!\\)({macro})(?![a-zA-Z0-9])',
                lambda match: re.sub(
                    r'(#\d+)',
                    lambda m: param_mapping.get(m.group(0), m.group(0)),  # Substitute placeholders with mapped values
                    definition
                ),
                equation
            )
        equations[i] = equation
    return equations

def extract_equations(input_tex_file, macros_file):
    # Load macros from macros_file
    macros = load_macros(macros_file)
    
    # Regular expressions to match equations
    equation_patterns = [
        # r'\$\$(.*?)\$\$',            # Inline equations with $$ ... $$
        # r'\$(.*?)\$',                # Inline equations with $ ... $
        r'\\\[(.*?)\\\]',            # Displayed equations with \[ ... \]
        r'\\begin\{equation\}(.*?)\\end\{equation\}',  # equation environment
        # r'\\begin\{align\}(.*?)\\end\{align\}'         # align environment
    ]
    
    # Read the input LaTeX file
    with open(input_tex_file, 'r') as file:
        tex_content = file.read()
    
    # Find all matches for each pattern
    equations = []
    for pattern in equation_patterns:
        matches = re.findall(pattern, tex_content, re.DOTALL)
        equations.extend(matches)
    
    # Replace macros in equations
    equations = replace_macros(equations, macros)
    equations = replace_macros2(equations, macros)

    return equations
    
tex_files_folder = 'bootor_tex'

output_tex_file = 'outputs/extracted_equations.tex'

macros_file = 'macros/macros.tex'

total_equations = 0

os.makedirs('outputs', exist_ok=True)

# Write equations to a new .tex file
with open(output_tex_file, 'w') as output_file:
    output_file.write('\\documentclass{article}\n')
    output_file.write('\\usepackage{amsmath}\n')
    output_file.write('\\usepackage{amssymb}\n')
    output_file.write('\\begin{document}\n')

    for subfolder in Path(tex_files_folder).glob('*'):

        for file in subfolder.glob('*.tex'):  # skip figures and other non-.tex files
            equations = extract_equations(file, macros_file)

            # Write each extracted equation
            for eq in equations:
                output_file.write('\n\\begin{equation}\n')
                output_file.write(eq.strip())
                output_file.write('\n\\end{equation}\n')

            print(f"Extracted {len(equations)} equations from {file}.")
            total_equations = total_equations + len(equations)

    # Write end of the document
    output_file.write('\\end{document}\n')

print(f'Found {total_equations} equations in the folder {tex_files_folder}')
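
To see which macro definitions `load_macros` picks up, the pattern can be tried on a few sample lines. The sketch below copies the regular expression from the cell above; the macro names (`\calA`, `\vect`, `\bx`) are made up for illustration and do not come from the actual macros file.

```python
import re

# Same pattern as in load_macros above.
macro_pattern = r'\\(?:def|newcommand)\s*\\(\w+)\s*(?:\[(.*?)\])?\s*(\{(.*)\})'

# Hypothetical macro definitions, one per line as load_macros expects.
sample_lines = [
    r'\def\calA{\mathcal{A}}',
    r'\newcommand\vect[1]{\mathbf{#1}}',
    r'\newcommand{\bx}{\mathbf{x}}',  # macro name wrapped in braces
]

parsed = {}
skipped = []
for line in sample_lines:
    match = re.match(macro_pattern, line)
    if match:
        command, _num_args, _braced_body, body = match.groups()
        parsed[command] = body
    else:
        skipped.append(line)

print(parsed)
print(skipped)
```

The third line is skipped because the pattern expects the macro name to follow `\def` or `\newcommand` directly, without surrounding braces. Definitions that span several lines are also missed, since the file is matched one line at a time.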

#### 2.2. Equations to .txt

Parse the extracted equations into a .txt file, with one equation per line.

In [None]:
def extract_equations(input_file, output_file):
    # Regular expression for LaTeX equations in \begin{equation} ... \end{equation}, ignoring \label{}
    equation_pattern = r'\\begin\{equation\}(.+?)\\end\{equation\}'
    label_pattern = r'\\label\{[^}]*\}'  # Pattern to match and remove labels

    # Read input file and extract equations
    with open(input_file, 'r') as file:
        content = file.read()

        # Find all matches for the equation pattern
        equations = re.findall(equation_pattern, content, re.DOTALL)
        
        # Process each equation to remove labels and convert to a single line
        cleaned_equations = []
        for eq in equations:
            cleaned_eq = re.sub(label_pattern, '', eq)  # Remove label

            # Drop whole-line comments and cut trailing comments at the first unescaped %
            eq_lines = []
            for line in cleaned_eq.split('\n'):
                if line.strip().startswith('%'):
                    continue
                eq_lines.append(re.split(r'(?<!\\)%', line)[0])

            # Join with a space so tokens on consecutive lines do not run together
            cleaned_eq = ' '.join(eq_lines)

            cleaned_eq = re.sub(r'\s+', ' ', cleaned_eq)  # Replace multiple spaces/newlines with a single space
            cleaned_eq = cleaned_eq.strip()  # Trim leading and trailing spaces
            cleaned_equations.append(cleaned_eq)

    # Write each cleaned equation to a new line in the output file
    with open(output_file, 'w') as file:
        total_equations = 0
        for equation in cleaned_equations:
            if len(equation) > 0:
                file.write(equation + '\n')
                total_equations += 1

        print(f'Extracted {total_equations} usable equations!')

# Usage
input_tex_file = 'outputs/extracted_equations.tex'
output_txt_file = 'outputs/extracted_equations.txt'
extract_equations(input_tex_file, output_txt_file)
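
The cleaning step can be checked in isolation on a single equation. The helper below is a standalone sketch (the name `clean_equation` and the sample equation are made up). It treats only an unescaped `%` as the start of a comment and joins lines with a space; both choices are assumptions of this sketch.

```python
import re

def clean_equation(raw):
    """Return `raw` as a single line without labels or LaTeX comments."""
    raw = re.sub(r'\\label\{[^}]*\}', '', raw)
    kept = []
    for line in raw.split('\n'):
        if line.strip().startswith('%'):
            continue  # whole-line comment
        # Cut at the first % that is not escaped as \%
        kept.append(re.split(r'(?<!\\)%', line)[0])
    # Collapse all runs of whitespace into single spaces
    return re.sub(r'\s+', ' ', ' '.join(kept)).strip()

# Hypothetical equation body with a label, comments, and an escaped percent sign.
raw = r"""
\label{eq:growth}
% growth of the signal
y = 1.05 \% x % five percent
+ \alpha
"""
print(clean_equation(raw))
```

One limitation of the lookbehind: a `%` that directly follows a line break `\\` is also treated as escaped, so such a trailing comment would be kept.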


#### 2.3. Generate PDFs



Compile each extracted equation to PDF (one file per equation).

In [3]:
def extract_equations(input_txt_file):
    """Extract equations from the input .txt file, each line being an equation."""
    equations = []
    with open(input_txt_file, 'r') as file:
        # Read each line and strip any extra whitespace
        for line in file:
            stripped_line = line.strip()
            if stripped_line:  # Only add non-empty lines
                equations.append(stripped_line)
    return equations

def create_tex_file(equation, index):
    """Create a .tex file for the given equation."""
    tex_content = r"""\documentclass{article}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{xcolor}
\begin{document}
\pagestyle{empty}
\begin{equation*}
""" + equation + r"""
\end{equation*}
\end{document}
"""
    file_name = f"{str(index).zfill(5)}.tex"
    with open(file_name, 'w') as file:
        file.write(tex_content)
    return file_name

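def compile_tex_file(tex_file):
    """Compile the .tex file using XeLaTeX."""
    # nonstopmode keeps XeLaTeX from waiting for input when an equation fails to compile
    subprocess.run(['xelatex', '-interaction=nonstopmode', tex_file], check=True)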

def compile_tex_files(tex_files):
    """Compile multiple .tex files concurrently."""
    with ThreadPoolExecutor(max_workers=10) as executor:
        executor.map(compile_tex_file, tex_files)

def main(input_tex_file):
    if not os.path.exists('outputs/pdfs'):
        os.makedirs('outputs/pdfs')

    equations = extract_equations(input_tex_file)

    tex_files = []
    for index, equation in enumerate(equations):
        tex_file = create_tex_file(equation.strip(), index)
        tex_files.append(tex_file)

    # Compile all .tex files concurrently
    compile_tex_files(tex_files)

    # Move PDFs to the output directory and clean up auxiliary files
    for tex_file in tex_files:
        pdf_file = Path(tex_file).with_suffix('.pdf')
        if pdf_file.exists():
            pdf_file.rename(Path('outputs/pdfs') / pdf_file.name)
            # Clean up auxiliary files
            os.remove(tex_file)
            os.remove(tex_file.replace('.tex', '.aux'))
            os.remove(tex_file.replace('.tex', '.log'))

if __name__ == "__main__":
    input_tex_file = 'outputs/extracted_equations.txt'
    main(input_tex_file)
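
In `compile_tex_files`, the results of `executor.map` are never consumed, so an exception raised in a worker (for example by `check=True` when an equation does not compile) is not reported; the missing PDF is simply skipped later. If a list of failures is wanted, one option is to return a status from each worker and collect the results. The sketch below shows the idea with stand-in Python commands instead of `xelatex`; `run_command` and `run_all` are hypothetical helper names.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_command(command):
    """Run one command and report whether it succeeded instead of raising."""
    result = subprocess.run(command, capture_output=True, text=True)
    return command, result.returncode == 0

def run_all(commands, max_workers=4):
    """Run commands concurrently and split them into successes and failures."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Consuming the iterator is what makes the per-command results visible.
        results = list(executor.map(run_command, commands))
    succeeded = [cmd for cmd, ok in results if ok]
    failed = [cmd for cmd, ok in results if not ok]
    return succeeded, failed

# Stand-in commands: one exits with status 0, one with status 1
# (as a failed compilation would).
good = [sys.executable, '-c', 'raise SystemExit(0)']
bad = [sys.executable, '-c', 'raise SystemExit(1)']
succeeded, failed = run_all([good, bad, good])
print(f'{len(succeeded)} succeeded, {len(failed)} failed')
```

Applied to this section, each command would be the `xelatex` call for one `.tex` file, and `failed` would list the equations that need attention.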

#### 2.4. PDF to PNG

Convert the PDF files to PNG images.

In [4]:
SRC_PATH = 'outputs/pdfs'
TRG_PATH = 'outputs/images'
RESOLUTION = 200  # lower resolution for faster processing

def convert_pdf_to_png(pdf_path):
    output_path = Path(TRG_PATH) / f"{pdf_path.stem}.png"
    with Image(filename=str(pdf_path), resolution=RESOLUTION) as img:
        img.format = 'png'
        img.depth = 8
        img.trim(color=Color('rgba(0,0,0,0)'), fuzz=0)  # Trim transparent areas
        img.background_color = Color('white')  # Set white background
        img.alpha_channel = 'remove'           # Remove transparency
        img.save(filename=str(output_path))
    print(f"Converted {pdf_path} to {output_path}")

def main():
    os.makedirs(TRG_PATH, exist_ok=True)

    pdf_dir = Path(SRC_PATH)
    pdf_files = list(pdf_dir.glob('*.pdf'))

    with ThreadPoolExecutor(max_workers=10) as executor:
        executor.map(convert_pdf_to_png, pdf_files)

if __name__ == "__main__":
    main()
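
The `RESOLUTION` setting is the rasterization density in dots per inch, and the whole page is rendered before it is trimmed to the equation. A quick calculation shows how the pixel count grows with that value. This sketch assumes a US Letter page (8.5 x 11 in), which may not match the paper size of a given TeX installation.

```python
def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page rasterized at `dpi` dots per inch."""
    return round(width_in * dpi), round(height_in * dpi)

# Assumption: US Letter (8.5 x 11 in); an A4 setup would give different numbers.
for dpi in (100, 200, 300):
    width, height = page_pixels(8.5, 11, dpi)
    print(f'{dpi} dpi -> {width} x {height} px ({width * height / 1e6:.2f} MP)')
```

Doubling the resolution quadruples the number of pixels to render, which is why a lower value speeds up the conversion.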

### 3. Training the model

### 4. Model 

#### 4.1. Performance