# latex.convert

> Convert LaTeX files into Obsidian.md notes

This module contains functions and methods to automatically make Obsidian notes from LaTeX files of mathematical papers, most notably those on arXiv.

See the [Potential Problems](#potential-problems) section below for some common errors that arise from this module and how to circumvent them.

In [None]:
#| default_exp latex.convert

In [None]:
#| export
from collections import OrderedDict
import os
from os import PathLike
from pathlib import Path
import re
from typing import Union

from pylatexenc import latexwalker, latex2text
from pylatexenc.latexwalker import (
    LatexWalker, LatexEnvironmentNode, get_default_latex_context_db,
    LatexNode, LatexSpecialsNode, LatexMathNode, LatexMacroNode, LatexCharsNode,
    LatexGroupNode, LatexCommentNode
)
from pylatexenc.latex2text import (
    MacroTextSpec, EnvironmentTextSpec)
from pylatexenc.macrospec import (
    MacroSpec, LatexContextDb, EnvironmentSpec
)
import regex

from trouver.helper import (
    find_regex_in_text, dict_with_keys_topologically_sorted,
    containing_string_priority, replace_string_by_indices, text_from_file
)
from trouver.markdown.markdown.file import (
    MarkdownFile, MarkdownLineEnum
)

from trouver.markdown.obsidian.vault import VaultNote
from trouver.markdown.obsidian.personal.index_notes import (
    correspond_headings_with_folder, convert_title_to_folder_name
)
from trouver.markdown.obsidian.personal.reference import setup_folder_for_new_reference
import warnings

In [None]:
#| export
DEFAULT_NUMBERED_ENVIRONMENTS = ['theorem', 'corollary', 'lemma', 'proposition',
                                 'definition', 'conjecture', 'remark', 'example',
                                 'question']

In [None]:
from fastcore.test import ExceptionExpected, test_eq
from trouver.helper import _test_directory, non_utf8_chars_in_file

## Potential problems

The following are some frequent problems that arise when using this module:


#### UnicodeDecodeErrors arise when reading LaTeX files

By default, the `text_from_file` function in `trouver.helper` reads files and attempts to decode them as `utf-8`. If a LaTeX file has characters that cannot be decoded as `utf-8`, then a `UnicodeDecodeError` may be raised. In this case, one can identify these characters using the `trouver.helper.non_utf8_chars_in_file` function and modify the LaTeX file manually. It may be useful to use a text editor to jump to the positions of these characters and to change the encoding of the LaTeX file to `utf-8`; for example, the author of `trouver` has opened some `ANSI`-encoded LaTeX documents in `Notepad++` and converted their encoding to `UTF-8`.

In [None]:
latex_file_path = _test_directory() / 'latex_examples' / 'circumflex_E_example.tex'
contents, non_utf8_chars = non_utf8_chars_in_file(latex_file_path)
#print(contents.decode(encoding='utf-8'))
test_eq(len(non_utf8_chars), 4)
print(f'The following are the non unicode characters and their positions: {non_utf8_chars}')

The following are the non unicode characters and their positions: [('Ê', 130), ('Ê', 165), ('Ë', 196), ('Ì', 227)]
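
The re-encoding step described above can also be scripted. The following is a minimal sketch and not part of `trouver`: the helper name `reencode_to_utf8` is hypothetical, and the source encoding `cp1252` is only an assumption (it is what `ANSI` commonly means on Western-language Windows systems) that has to be checked for each file.

```python
from pathlib import Path
import tempfile


def reencode_to_utf8(path, source_encoding='cp1252'):
    # Hypothetical helper: rewrite the file at `path` as UTF-8.
    # Decode with the assumed legacy encoding; a wrong guess gives garbled
    # text or a UnicodeDecodeError rather than a clean conversion.
    text = Path(path).read_bytes().decode(source_encoding)
    Path(path).write_bytes(text.encode('utf-8'))
    return text


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / 'sample.tex'
    # Write a small file in the legacy encoding; '\u00ca' (E with circumflex)
    # is a single byte there and is not valid UTF-8 on its own.
    sample.write_bytes('\\section{\u00ca}'.encode('cp1252'))
    reencode_to_utf8(sample)
    # The file can now be read as UTF-8.
    roundtrip = sample.read_text(encoding='utf-8')
```

Working on a copy of the LaTeX file is advisable, since the helper overwrites the file in place.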


#### `NoDocumentNodeError`s arise even though the LaTeX file has a document environment (i.e. `\begin{document}...\end{document}`)

The `find_document_node` function in this module is sometimes unable to detect the document environment of a LaTeX file. This error is known to arise when
- there are macros (which include commands) defined that expand to text containing `\begin{...}` or `\end{...}`.

In [None]:
# TODO in the above explanation, include an example.
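
As an illustration of the kind of macro meant here, the sketch below builds a hypothetical preamble in which `\beq` and `\eeq` each expand to one half of an environment, and flags such definitions with a regular expression. The sample text and the names `sample`, `suspect_pattern`, and `suspects` are assumptions made for this example; whether `pylatexenc` fails on this exact string is not verified here.

```python
import re

# Hypothetical LaTeX source; `\beq` and `\eeq` each expand to half of an environment.
sample = r"""\documentclass{article}
\newcommand{\beq}{\begin{displaymath}}
\newcommand{\eeq}{\end{displaymath}}
\begin{document}
\beq x^2 \eeq
\end{document}"""

# Flag single-line `\newcommand` definitions whose body contains
# `\begin{` or `\end{`.
suspect_pattern = re.compile(
    r'\\newcommand\s*\{\\(\w+)\}\s*\{[^\n]*\\(?:begin|end)\{')
suspects = suspect_pattern.findall(sample)
print(suspects)
```

The pattern only looks at definitions written on a single line, so it is a rough diagnostic rather than a complete check.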

## LaTeX comments

In [None]:
#| export
def remove_comments(text: str) -> str:
    # Find all occurrences of the comment pattern %[^\n]*
    return re.sub(r"%[^\n]*", "", text)

In [None]:
text = r"""% Commands with parameters
\newcommand{\field}[1]{\mathbb{#1}}
\newcommand{\mat}[4]{\left[\begin{array}{cc}#1 & #2 \\
                                         #3 & #4\end{array}\right]}
\newcommand{\dual}[1]{#1^{\vee}}
\newcommand{\compl}[1]{\hat{#1}}
"""
assert '%' not in remove_comments(text)
print(remove_comments(text))

text = r"""Hi. I'm not commented. %But I am!"""
test_eq(remove_comments(text), "Hi. I'm not commented. ")


\newcommand{\field}[1]{\mathbb{#1}}
\newcommand{\mat}[4]{\left[\begin{array}{cc}#1 & #2 \\
                                         #3 & #4\end{array}\right]}
\newcommand{\dual}[1]{#1^{\vee}}
\newcommand{\compl}[1]{\hat{#1}}
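
Note that the pattern `%[^\n]*` also matches an escaped percent sign `\%`, which LaTeX prints as a literal `%`. The following is a sketch of one possible variant, with the hypothetical name `remove_comments_keep_escaped`, that leaves `\%` alone. It does not handle every case (for instance a `%` that directly follows a line break `\\`), so it is an illustration rather than a replacement.

```python
import re


def remove_comments_keep_escaped(text: str) -> str:
    # A `%` preceded by a backslash is treated as a literal percent sign,
    # not as the start of a comment.
    return re.sub(r'(?<!\\)%[^\n]*', '', text)


line = r'About 50\% of cases. % trailing comment'
kept = remove_comments_keep_escaped(line)
# For comparison, the plain pattern cuts the line at the escaped percent sign.
cut = re.sub(r'%[^\n]*', '', line)
```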



## Divide LaTeX file into parts

To make Obsidian notes from a LaTeX file, I use sections/subsections and environments as places to make new notes.

Things to think about:
- Sections/subsections
- Environments, including theorems, corollaries, propositions, lemmas, definitions, notations
- Citations
- Macros defined in the preamble?

LatexMacroNodes include: sections/subsections, citations, references, and labels, e.g.

```latex
> \section{Introduction}
\cite{ellenberg2nilpotent}
\subsection{The section conjecture}
\'e
\ref{fundamental-exact-sequence}
\cite{stix2010period}
\ref{fundamental-exact-sequence}
\cite{stix2012rational}
\cite[Appendix C]{stix2010period}
\subsection{The tropical section conjecture}
\label{subsec:tropical-section-conjecture}
```

#### Divide the preamble from the rest of the document

Some macros and commands defined in the preamble seem to prevent the `pylatexenc` methods from properly identifying the document environment/node in a LaTeX document. To circumvent this, we define a function to divide the preamble from the rest of the document:

In [None]:
#| export
def divide_preamble(
        text: str, # LaTeX document
        document_environment_name: str = "document"
        ) -> tuple[str, str]:
    """Divide the preamble from the rest of a LaTeX document.
    """
    begin_environment_str = rf'\begin{{{document_environment_name}}}'
    pattern = re.compile(re.escape(begin_environment_str))
    match = re.search(pattern, text) 
    start_match, end_match = match.span()
    return text[:start_match], text[start_match:]

    

In [None]:
latex_file_path = _test_directory() / 'latex_examples' / 'example_with_a_command_with_begin.tex'
text = text_from_file(latex_file_path)

preamble, document = divide_preamble(text)
assert r'\begin{displaymath}' in preamble
assert r'Hyun Jong Kim' in preamble

assert r'Hyun Jong Kim' not in document
assert document.startswith(r'\begin{document}')
assert document.endswith('\\end{document}')
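
As written, `divide_preamble` assumes that `\begin{document}` occurs in the text; if it does not, `re.search` returns `None` and the call to `.span()` raises an `AttributeError`. The sketch below shows one way to make that failure explicit. The name `split_at_document` and the choice of `ValueError` are assumptions for this example, not part of the module.

```python
def split_at_document(text: str, document_environment_name: str = 'document'):
    # Hypothetical variant of `divide_preamble` with an explicit error.
    begin = rf'\begin{{{document_environment_name}}}'
    index = text.find(begin)
    if index == -1:
        raise ValueError(f'No {begin} found in the text.')
    # Everything before `\begin{document}` is the preamble.
    return text[:index], text[index:]


preamble, document = split_at_document(
    '\\documentclass{article}\n\\begin{document}\nHello\n\\end{document}')
```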

#### Get the Document Node

In [None]:
#| export
class NoDocumentNodeError(Exception):
    """Exception raised when a LatexEnvironmentNode corresponding to the document 
    environment is expected in a LaTeX string, but no such node exists.
    
    **Attributes**
    - text - str
        - The text in which the document environment is not found.
    """
    
    def __init__(self, text):
        self.text = text
        super().__init__(
            f"The following text does not contain a document environment:\n{text}")



In [None]:
#| export
def find_document_node(
        text: str, # LaTeX str
        document_environment_name: str = "document" # The name of the document environment.
        ) -> LatexEnvironmentNode:
    """Find the `LatexNode` object for the main document in `text`.
    
    **Raises**
    - NoDocumentNodeError
        - If document environment node is not detected.
    """
    w = LatexWalker(text)
    nodelist, _, _ = w.get_latex_nodes(pos=0)
    for node in nodelist:
        if node.isNodeType(LatexEnvironmentNode)\
                and node.environmentname == document_environment_name:
            return node
    raise NoDocumentNodeError(text)

The main content of virtually all LaTeX math articles belongs to a document environment, which pylatexenc can often detect. The `find_document_node` function returns this `LatexEnvironmentNode` object:

In [None]:
latex_file_path = _test_directory() / 'latex_examples' / 'latex_example_1' / 'main.tex'
text = text_from_file(latex_file_path)
document_node = find_document_node(text)

If the LaTeX file has no `document` environment, then a `NoDocumentNodeError` is raised:

In [None]:
# This latex document has its `document` environment commented out.
latex_file_path = _test_directory() / 'latex_examples' / 'latex_example_2' / 'main.tex'
text = text_from_file(latex_file_path)
with ExceptionExpected(NoDocumentNodeError):
    document_node = find_document_node(text)

At the time of this writing, a `NoDocumentNodeError` may be raised even if the LaTeX file has a proper `document` environment:

In [None]:
latex_file_path = _test_directory() / 'latex_examples' / 'example_with_a_command_with_begin.tex'
text = text_from_file(latex_file_path)

# Perhaps in the future, pylatexenc will be able to find the document node for this file.
# When that time comes, delete this example.
with ExceptionExpected(NoDocumentNodeError):
    find_document_node(text)



The `divide_preamble` function can be used to circumvent this problem:

In [None]:
preamble, document = divide_preamble(text)
document_node = find_document_node(document)
test_eq(document_node.environmentname, 'document')
assert document_node.isNodeType(LatexEnvironmentNode)

In [None]:
# hide
# Find no document node error causes

# latex_file_path = r'_tests\latex_full\litt_cfag\main.tex'
# text = text_from_file(latex_file_path)
# document_node = find_document_node(text)

### Detect environment names used in a file

In [None]:
#| export
def environment_names_used(
        text: str # LaTeX document
        ) -> set[str]: # The set of all environment names used in the main document.
    """Return the set of all environment names used in the main document
    of the latex code.
    """
    document_node = find_document_node(text)
    return {node.environmentname for node in document_node.nodelist
            if node.isNodeType(LatexEnvironmentNode)}        

Writers often use different environment names. For example, writers often use `theorem`, `thm`, or `theo` for theorem environments or `lemma` or `lem` for lemma environments. The `environment_names_used` function returns the environment names actually used in the tex file.

In the example below, note that only the environments that are actually used are returned. For instance, the preamble of the document defines the theorem environments `problem` and `lemma` (among other things), but these are not actually used in the document itself.

In [None]:
latex_file_path = _test_directory() / 'latex_examples' / 'has_fully_written_out_environment_names.tex'
sample_text_1 = text_from_file(latex_file_path)
sample_output_1 = environment_names_used(sample_text_1)
test_eq({'corollary', 'proof', 'maincorollary', 'abstract', 'proposition'}, sample_output_1)

The document in the example below uses shorter names for theorem environments:

In [None]:
latex_file_path = _test_directory() / 'latex_examples' / 'has_shorter_environment_names.tex'
sample_text_2 = text_from_file(latex_file_path)
sample_output_2 = environment_names_used(sample_text_2)
test_eq({'conj', 'notation', 'corollary', 'defn'}, sample_output_2)
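
For a quick look at a file without parsing it, a regular expression can approximate this. The sketch below is an assumption-laden alternative and not how `environment_names_used` works: it also reports the `document` environment itself and environments nested inside other environments, whereas `environment_names_used` only inspects the nodes directly inside the document node. The name `environment_names_by_regex` is hypothetical.

```python
import re


def environment_names_by_regex(text: str) -> set:
    # Collect every name that appears in a `\begin{...}`.
    return set(re.findall(r'\\begin\s*\{([^{}]+)\}', text))


sample = r"""\begin{document}
\begin{thm} Statement. \end{thm}
\begin{proof} \begin{equation*} x = y \end{equation*} \end{proof}
\end{document}"""
names = environment_names_by_regex(sample)
```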

#### Identify the numbering convention of a LaTeX document

LaTeX documents have various numbering conventions. Here are some examples of papers on the arXiv and notes on their numbering schemes. Note that the source code to these articles is publicly available on the arXiv.

- Ellenberg, Venkatesh, and Westerland, *[Homological stability for Hurwitz spaces and the Cohen-Lenstra conjecture over function fields](https://arxiv.org/abs/0912.0325)*
    - The subsections and theorem-like environments of each section share a numbering scheme, e.g. section 1 has subsection `1.1 The Cohen-Lenstra heuristics`, `1.2 Theorem`, `1.3 Hurwitz spaces`. This is accomplished by defining theorem-like environments using the `subsection` counter, e.g.

        ```latex
        \theoremstyle{plain}
        \newtheorem{thm}[subsection]{Theorem}
        \newtheorem{prop}[subsection]{Proposition}
        \newtheorem{cor}[subsection]{Corollary}
        \newtheorem{remark}{Remark}
        \newtheorem{conj}[subsection]{Conjecture}
        \newtheorem*{conj*}{Conjecture}
        ```

        defines the `thm`, `prop`, `cor`, and `conj` environments to be numbered using the `subsection` counter, the `remark` environment to be numbered using its own counter (no shared counter is specified for it), and the `conj*` environment to be an unnumbered environment with a different name than the `conj` environment.

    - The `\swapnumbers` command is included in the preamble to change the way that theorems are numbered in the document, e.g. the article has `1.2 Theorem` as opposed to `Theorem 1.2`.
    - The equations are numbered along the subsections - this is accomplished by the lines 

        ```latex
        \numberwithin{equation}{subsection}
        \renewcommand{\theequation}{\thesubsection.\arabic{equation}}
        ```

        in the preamble.
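
The preamble features mentioned above can be detected with simple regular expressions. The following is a minimal sketch under the assumption that the commands appear uncommented in the preamble; the sample preamble and the names `uses_swapnumbers` and `numbered_within` are made up for this illustration.

```python
import re

# Hypothetical preamble fragment modeled on the conventions described above.
preamble = r"""\swapnumbers
\theoremstyle{plain}
\newtheorem{thm}[subsection]{Theorem}
\numberwithin{equation}{subsection}
"""

# `\swapnumbers` puts the number before the name, e.g. `1.2 Theorem`.
uses_swapnumbers = re.search(r'\\swapnumbers\b', preamble) is not None
# Map each counter to the counter it is numbered within.
numbered_within = dict(
    re.findall(r'\\numberwithin\s*\{(\w+)\}\s*\{(\w+)\}', preamble))
```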

In [None]:
# TODO: consider different 
# TODO: consider swapnumbers
# TODO: consider 

In [None]:
#| export
def counters_for_environments(
        text: str # The LaTeX document
        ) -> dict:  
    r"""Return the dict specifying the counters for each theorem-like environment.

    This function uses two separate regex patterns, one to detect the invocations of `\newtheorem`
    in which the optional parameter is the second parameter and one to detect those in which
    the optional parameter is the third parameter.

    Assumes that
    - invocations of the `\newtheorem` command are exclusively in the
    preamble of the LaTeX document.
    - theorem-like environments are defined using the `\newtheorem` command.
    - no environments of the same name are defined twice.

    """
    preamble, _ = divide_preamble(text)
    second_parameter_pattern = re.compile(
        # r'\\newtheorem\s*\{\s*(\w+)\s*\}\s*(\[\s*(\w+)\s*\])?\s*\{\s*(.*)\s*\}')
        # In this case, the optional parameter (if any) should not follow the newtheorem.
        r'\\newtheorem\s*\{\s*(\w+)\s*\}\s*(\[\s*(\w+)\s*\])?\s*\{\s*(.*)\s*\}(?!\s*\[\s*(\w+)\s*\])')
    third_parameter_pattern = re.compile(
        r'\\newtheorem\s*\{\s*(\w+)\s*\}\s*\{\s*(.*)\s*\}\s*(\[\s*(\w+)\s*\])?')
    second_results = _search_counters_by_pattern(preamble, second_parameter_pattern, 3)
    third_results = _search_counters_by_pattern(preamble, third_parameter_pattern, 4)
    return second_results | third_results
    

def _search_counters_by_pattern(
        preamble: str,
        newtheorem_regex: re.Pattern,
        counter_group: int # This depends on which `newtheorem_regex` is used, and is either 3 or 4. 
        ) -> dict:
    """
    Capture the newly defined theorem-like environment names as well as the
    counters that they belong to"""
    counters = {}
    for match in newtheorem_regex.finditer(preamble):
        env_name = match.group(1)
        counter = match.group(counter_group)
        # If no counter was specified, use the environment name as the counter
        if counter is None:
            counter = env_name
        counters[env_name] = counter
    return counters

In [None]:
text = text_from_file(_test_directory() / 'latex_examples' / 'newtheorem_example.tex') 

counters = counters_for_environments(text)
test_eq(counters, {'theorem': 'theorem', 'lemma': 'theorem', 'definition': 'theorem', 'corollary': 'corollary', 'remark': 'theorem'})


In [None]:
text = r"""
\theoremstyle{plain}
\newtheorem{thm}[subsection]{Theorem}
\newtheorem{prop}[subsection]{Proposition}
\newtheorem{cor}[subsection]{Corollary}
\newtheorem{remark}{Remark}
\newtheorem{conj}[subsection]{Conjecture}
\newtheorem*{conj*}{Conjecture}
\begin{document}
\end{document}
"""
counters = counters_for_environments(text)
test_eq(counters, {'thm': 'subsection', 'prop': 'subsection', 'cor': 'subsection', 'remark': 'remark', 'conj': 'subsection'})
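
One use of such a dict is to see which environments share a numbering. The sketch below groups the environment names from the example above by counter; it starts from a literal copy of that dict, and the name `shared` is chosen only for this illustration.

```python
from collections import defaultdict

# The dict from the example above, written out literally.
counters = {'thm': 'subsection', 'prop': 'subsection', 'cor': 'subsection',
            'remark': 'remark', 'conj': 'subsection'}

# Group the environment names by the counter that numbers them.
shared = defaultdict(list)
for environment_name, counter in counters.items():
    shared[counter].append(environment_name)
shared = dict(shared)
```

Here `thm`, `prop`, `cor`, and `conj` end up under `subsection`, while `remark` is alone under its own counter.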

In [None]:
#| hide

# Test that the contents of the `counters_for_environments` function are detecting
# The defined commands correctly.
text = text_from_file(_test_directory() / 'latex_examples' / 'newtheorem_example.tex') 
preamble, _ = divide_preamble(text)
second_parameter_pattern = re.compile(
    # r'\\newtheorem\s*\{\s*(\w+)\s*\}\s*(\[\s*(\w+)\s*\])?\s*\{\s*(.*)\s*\}')
    r'\\newtheorem\s*\{\s*(\w+)\s*\}\s*(\[\s*(\w+)\s*\])?\s*\{\s*(.*)\s*\}(?!\s*\[\s*(\w+)\s*\])')
third_parameter_pattern = re.compile(
    r'\\newtheorem\s*\{\s*(\w+)\s*\}\s*\{\s*(.*)\s*\}\s*(\[\s*(\w+)\s*\])?')
second_results = _search_counters_by_pattern(preamble, second_parameter_pattern, 3)
third_results = _search_counters_by_pattern(preamble, third_parameter_pattern, 4)
assert 'remark' not in second_results
assert 'remark' in third_results

#### Getting the display names of environments

For example, `\newtheorem{theorem}{Theorem}` defines a theorem-like environment called `theorem` whose display name is `Theorem`.

In [None]:
#| export
def display_names_of_environments(
        text: str # The LaTeX document
        ) -> dict:  
    r"""Return the dict specifying the display names for each theorem-like environment.

    This function uses two separate regex patterns, one to detect the invocations of `\newtheorem`
    in which the optional parameter is the second parameter and one to detect those in which
    the optional parameter is the third parameter.

    Assumes that
    - invocations of the `\newtheorem` command are exclusively in the
    preamble of the LaTeX document.
    - theorem-like environments are defined using the `\newtheorem` command.
    - no environments of the same name are defined twice.

    """
    preamble, _ = divide_preamble(text)
    second_parameter_pattern = re.compile(
        # In this case, the optional parameter (if any) should not follow the newtheorem.
        r'\\newtheorem\*?\s*\{\s*(\w+\*?)\s*\}\s*(\[\s*(\w+)\s*\])?\s*\{\s*(.*)\s*\}(?!\s*\[\s*(\w+)\s*\])')
    third_parameter_pattern = re.compile(
        r'\\newtheorem\*?\s*\{\s*(\w+\*?)\s*\}\s*\{\s*(.*)\s*\}\s*(\[\s*(\w+)\s*\])?')
    second_results = _search_display_names_by_pattern(preamble, second_parameter_pattern, 4)
    third_results = _search_display_names_by_pattern(preamble, third_parameter_pattern, 2)
    return second_results | third_results
    

def _search_display_names_by_pattern(
        preamble: str,
        newtheorem_regex: re.Pattern,
        display_name_group: int # This depends on which `newtheorem_regex` is used, and is either 2 or 4. 
        ) -> dict:
    """
    Capture the newly defined theorem-like environment names as well as
    their display names."""
    display_names = {}
    for match in newtheorem_regex.finditer(preamble):
        env_name = match.group(1)
        display_name = match.group(display_name_group)
        display_names[env_name] = display_name
    return display_names

In [None]:
text = text_from_file(_test_directory() / 'latex_examples' / 'newtheorem_example.tex') 
display_names = display_names_of_environments(text)
test_eq(display_names, {'theorem': 'Theorem',
 'lemma': 'Lemma',
 'definition': 'Definition',
 'corollary': 'Corollary',
 'conjecture*': 'Conjecture',
 'remark': 'Remark'})
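
Display names could be combined with a numbering to form a readable title such as `Theorem 1.2`, or `1.2 Theorem` when `\swapnumbers` is in effect. The following is only a sketch: the helper `note_title` and its parameters are hypothetical and are not defined in this module.

```python
def note_title(environment_name: str, numbering: str, display_names: dict,
               swap_numbers: bool = False) -> str:
    # Fall back to the raw environment name if no display name is known.
    display_name = display_names.get(environment_name, environment_name)
    if swap_numbers:
        return f'{numbering} {display_name}'
    return f'{display_name} {numbering}'


example_display_names = {'theorem': 'Theorem', 'lemma': 'Lemma'}
plain_title = note_title('theorem', '1.2', example_display_names)
swapped_title = note_title('theorem', '1.2', example_display_names, swap_numbers=True)
```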

### Divide latex text into parts

In [None]:
# Examples of different numbering conventions:
# Ellenberg, Venkatesh, Westerland, Homological stability - 1.1 The Cohen-Lenstra heuristics, 1.2 Theorem, 1.3 Hurwitz spaces, 1.4 Stability of homology, 1.5 Conjecture, 1.6 Some context


In [None]:
#| export
def divide_latex_text(
        text: str, # The text of a LaTeX document
        environments_to_divide_along: list[str], # A list of the names of environments that warrant a new note
        numbered_environments: list[str], # A list of the names of environments that do not warrant a new note
        numbering_convention: str,
        section_name: str = 'section', # The command name for sections
        subsection_name: str= 'subsection', # The command name for subsections
        proof_name: str = 'proof', # The environment name for proofs
        ) -> list[tuple[str, str]]: 
    """Divide LaTeX text to convert into Obsidian.md notes.
    

    """
    return

In [None]:
#| export
# TODO: numbering convention could be theorems separate (e.g. theorem 1, 2, ...)
# and subsections separate.
# TODO: fix up this method
def divide_latex_text(
        text, # The text of a latex document.
        numbered_environments: list[str] = DEFAULT_NUMBERED_ENVIRONMENTS, # A list of the names of environments which are numbered in the latex code. 
        numbering_convention: str = 'separate', # One of <br><br> - 'separate': Subsections of a section have separate numberings, e.g. 'Lemma 1.2.1, Proposition 1.2.2, Figure 1.2.3, Theorem 1.3.1' <br> - 'shared': Subsections of a section share numberings, e.g.  'Lemma 1.1, Proposition 1.2, Figure 1.3, Theorem 1.4'
        section_name: str = 'section', # The command name for sections. For example, SGA has chapters and sections. For the purposes of this function, it is appropriate to regard them as sections and subsections, respectively.
        subsection_name: str = 'subsection', # The command name for subsections
        proof_name: str = 'proof' # The environment name for proofs
        ) -> list[tuple[str, str]]: # Each tuple corresponds to an Obsidian note to be constructed.  Such a tuple is of the form `[<node_type & numbering>, <text>]` where `node_type & numbering` is a string which serves as a title for the text making up the note, and `text` is the content of the note.
    """Divides latex text to convert into Obsidian notes.
    
    """
    document_node = find_document_node(text)
    section_num = 0
    subsection_num = 0
    environment_num = 0
    outside_num = 1  # Since not everything is in a nice environment, many 
                     # notes will need their own numbers.
    parts = []
    accumulation = ''
    for node in document_node.nodelist:
        (section_num, subsection_num, environment_num, outside_num,
         accumulation)\
            = _process_node(
                section_num, subsection_num, environment_num, outside_num,
                accumulation, parts, node, section_name, subsection_name,
                proof_name, numbered_environments, numbering_convention)
            
    outside_num += 1
    parts.append([str(outside_num), accumulation])
    return parts
            
def _process_node(
        section_num: int, subsection_num: int, environment_num: int,
        outside_num: int, accumulation: str, parts: list, node: LatexNode,
        section_name: str, subsection_name: str, proof_name: str,
        numbered_environments: list[str], numbering_convention: str) -> tuple:
    """
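    Choose the node-processing method depending on whether the node is a
    section, a subsection, or a numbered environment, and run it. Otherwise,
    append a proof to the most recent part or add the node's text to
    `accumulation`.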
    """
    process_method_to_run = None
    if node.isNodeType(LatexMacroNode) and node.macroname == section_name:
        process_method_to_run = _process_section
    elif node.isNodeType(LatexMacroNode) and node.macroname == subsection_name:
        process_method_to_run = _process_subsection
    elif (node.isNodeType(LatexEnvironmentNode)
          and node.environmentname in numbered_environments):
        process_method_to_run = _process_environment_node
    if process_method_to_run:
        (section_num, subsection_num, environment_num, outside_num,
        accumulation)\
            = process_method_to_run(
            section_num, subsection_num, environment_num, outside_num,
            accumulation, parts, node, section_name, subsection_name,
            numbering_convention)
    elif (node.isNodeType(LatexEnvironmentNode)
          and node.environmentname == proof_name):
          # TODO: if the environment is a proof, and if it starts a section/subsection,
          # Then the proof is appended into the title of the section/subsection, see
          # landesman_litt_ipwc, around line 1858-1863 for example.
        parts[-1][1] += f'\n{node.latex_verbatim()}'
    else:
        accumulation += node.latex_verbatim()
    return (section_num, subsection_num, environment_num, outside_num,
            accumulation)


def _process_section(
        section_num: int, subsection_num: int, environment_num: int,
        outside_num: int, accumulation: str, parts: list[list],
        node: LatexMacroNode, section_name: str, subsection_name: str,
        numbering_convention: str) -> tuple:
    """Do stuff when the node is a section node. Return updated
    section_num, subsection_num, environment_num
    """
    numbered, title  = _section_title(
        node.latex_verbatim(), section_name, subsection_name)
    section_num += 1 if numbered else 0
    subsection_num = 0
    environment_num = 0
    if accumulation.strip() != '':
        parts.append([str(outside_num), accumulation])
        outside_num += 1
        accumulation = ''
    parts.append([f'{section_name} {section_num}', title])
    return (section_num, subsection_num, environment_num, outside_num,
            accumulation)
    

def _process_subsection(
        section_num: int, subsection_num: int, environment_num: int,
        outside_num: int, accumulation: str, parts: list[list],
        node: LatexMacroNode, section_name: str, subsection_name: str,
        numbering_convention: str) -> tuple:
    """Do stuff when the node is a subsection node.
    """
    numbered, title  = _section_title(
        node.latex_verbatim(), section_name, subsection_name)
    subsection_num += 1 if numbered else 0
    if numbering_convention == 'separate':
        environment_num = 0
    if accumulation.strip() != '':
        parts.append([str(outside_num), accumulation])
        outside_num += 1
        accumulation = ''
    parts.append([f'{subsection_name} {section_num}.{subsection_num}', title])
    return (section_num, subsection_num, environment_num, outside_num,
            accumulation)


def _process_environment_node(
        section_num: int, subsection_num: int, environment_num: int,
        outside_num: int, accumulation: str, parts: list[list],
        node: LatexEnvironmentNode, section_name: str, subsection_name: str,
        numbering_convention: str) -> tuple:
    """Do stuff when the node is a numbered environment node.
    """
    environment_num += 1
    if accumulation.strip() != '':
        parts.append([str(outside_num), accumulation])
        outside_num += 1
        accumulation = ''
    if numbering_convention == 'separate':
        pointed_numbering = f'{section_num}.{subsection_num}.{environment_num}'
        numbering = f'{node.environmentname} {pointed_numbering}'
    elif numbering_convention == 'shared':
        numbering = f'{node.environmentname} {section_num}.{environment_num}'
    parts.append([numbering, node.latex_verbatim()])
    return (section_num, subsection_num, environment_num, outside_num,
            accumulation)


def _section_title(text: str, section_name, subsection_name) -> tuple[bool, str]:
    """Return whether a section or subsection is numbered, along with its
    title, from a latex str.
    
    **Parameters**
    - text - str
    - section_name - str
    - subsection_name - str
    
    **Returns**
    - bool, str
    """
    # TODO: test things like `\\section {Generating series of special divisors}`
    # See qiu_amsd for example.
    # TODO: deal with the possibility of multi-line sections/subsections,
    # e.g. \subsection{Arithmetic intersrection\n pairing},
    # see qiu_amsd for example
    regex_search = re.search(r'\\' + fr'(?:{section_name}|{subsection_name}) *?'
                             + r'(?:\[.*\])?(\*)?\{(.*)\}', text)
    # regex_search = re.search(r'\\' + fr'(?:{section_name}|{subsection_name})'
    #                          + r'(?:\[.*\])?(\*)?\{(.*)\}', text)
    # print(text)
    # print(section_name, subsection_name)
    if regex_search is None:
        print(text, section_name, subsection_name)
    return not bool(regex_search.group(1)), regex_search.group(2)
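
The TODO items in `_section_title` mention section commands followed by a space and titles that span several lines. The sketch below shows one way these cases might be handled with `re.DOTALL` and whitespace normalization. The function name `section_title_sketch` is hypothetical, the star is matched directly after the command name (as in `\section*{...}`), and nested braces after the title are not handled, so this is an illustration under those assumptions rather than a drop-in replacement.

```python
import re


def section_title_sketch(text: str, names=('section', 'subsection')):
    alternatives = '|'.join(re.escape(name) for name in names)
    # Allow whitespace after the command name, an optional star, an optional
    # short title in brackets, and a title that may contain newlines.
    match = re.search(
        rf'\\(?:{alternatives})\s*(\*)?\s*(?:\[.*?\])?\s*\{{(.*)\}}',
        text, re.DOTALL)
    if match is None:
        raise ValueError(f'No section or subsection found in: {text!r}')
    numbered = match.group(1) is None
    # Collapse line breaks and repeated spaces inside the title.
    title = ' '.join(match.group(2).split())
    return numbered, title


spaced = section_title_sketch(r'\section {Generating series of special divisors}')
multiline = section_title_sketch('\\subsection*{Arithmetic intersection\n pairing}')
```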



The `divide_latex_text` function divides LaTeX text into parts, each of which corresponds to an Obsidian note to be constructed.
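
To illustrate the difference between the `'separate'` and `'shared'` numbering conventions without parsing any LaTeX, the sketch below replays the counter logic of the helper functions above on a hand-written list of events. The function `number_events` and the event tuples are assumptions made for this example only.

```python
def number_events(events, numbering_convention='separate'):
    # Simplified stand-in for the counters kept by `divide_latex_text`.
    section = subsection = environment = 0
    labels = []
    for kind, name in events:
        if kind == 'section':
            section += 1
            subsection = 0
            environment = 0
            labels.append(f'section {section}')
        elif kind == 'subsection':
            subsection += 1
            # Only the 'separate' convention restarts environment numbering
            # at each subsection.
            if numbering_convention == 'separate':
                environment = 0
            labels.append(f'subsection {section}.{subsection}')
        else:
            environment += 1
            if numbering_convention == 'separate':
                labels.append(f'{name} {section}.{subsection}.{environment}')
            else:
                labels.append(f'{name} {section}.{environment}')
    return labels


events = [('section', 'Introduction'), ('subsection', 'A'),
          ('environment', 'lemma'), ('subsection', 'B'),
          ('environment', 'theorem')]
separate_labels = number_events(events, 'separate')
shared_labels = number_events(events, 'shared')
```

Under `'separate'` the theorem in the second subsection is labeled `theorem 1.2.1`, while under `'shared'` it continues the section-wide count as `theorem 1.2`.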

In [None]:
# latex_file_path = r'_tests\latex_full\pauli_wickelgren\main.tex'
# text = text_from_file(latex_file_path)
# parts = divide_latex_text(text, numbering_convention='separate')
# for title, text in parts[39:44]:
#     print(title, text)



In [None]:
# TODO: Find a list of environment names commonly used.

In [None]:
# TODO: examples with different numbering convention and different numbered environments

In [None]:
# TODO: make numbering_convention work correctly.
# Here are some latex files with different conventions:
# - All subsections in a section share numbering, 
#   - achter_pries_imht https://arxiv.org/abs/math/0608038: e.g. Lemmas 2.1, 2.2, 2.3 are in subsection 2.2 and Lemma 2.4 and Remark 2.5 are in subsection 2.4.
#   - pauli_wickelgren https://arxiv.org/abs/2010.09374: e.g. Example 3.5, 3.11 are in subsubsection 3.3.2, Exercise 4.1, Remark 4.2, are in subsection 4.1, Theorem 4.3 is in subsection 4.2, Theorem 4.4 is in subsection 4.3
# - Different environment types have different counts and the counts do not show the section number.
#   - vankataramana_imbrd https://arxiv.org/abs/1205.6543: 
#       - e.g. section 1 has Theorem 1, Remark 1, Remark 2, Remark 3, subsection 1.1.3 has Remark 4, Subsection 2.2 has Definition 1

## Formatting modifications

### Identify macros and commands to replace

Authors usually define a lot of custom commands and macros in their LaTeX files. Such customizations vary from author to author and most customized commands are not recognized by Obsidian. 

See `nbs/_tests/latex_examples/commands_example/main.tex` for some examples of custom commands.

In [None]:
#| export
def custom_newcommands(
        preamble: str, # The preamble of a LaTeX document.
        ) -> dict[str, tuple[int, Union[str, None], str]]: # The keys are the names of the newly defined commands and the values are tuples consisting of 1. the number of parameters 2. The default argument if specified or `None` otherwise, and 3. the display text of the command.
    """
    Return a dict mapping commands defined in `preamble` to the number of arguments,
    the default argument (if any), and the display text of the commands.

    Assumes that the newcommands only have at most one default parameter (newcommands with
    multiple default parameters are not valid in LaTeX).

    Ignores all commented newcommands.
    """
    preamble = remove_comments(preamble)
    newcommand_regex = regex.compile(
        r'(?<!%)\s*\\(?:re)?newcommand\s*\{\\\s*(\w+)\s*\}\s*(\[(\d+)\]\s*(?:\[(\w+)\])?)?\s*\{((?>[^{}]+|\{(?5)\})*)\}', re.MULTILINE)
    # newcommand_regex = regex.compile(
    #     r'(?<!%)\s*\\(?:re)?newcommand\s*\{\\\s*(\w+)\s*\}\s*(\[(\d+)\]\s*(?:\[(\w+)\])?)?\s*\{\s*(.*)\s*\}', re.MULTILINE)
    commands = {}
    for match in newcommand_regex.finditer(preamble):
        name = match.group(1)
        num_args = match.group(3)
        optional_default_arg = match.group(4)
        definition = match.group(5)

        # Convert the number of arguments to an integer, if it was specified
        if num_args is not None:
            num_args = int(num_args)
        else:
            num_args = 0

        commands[name] = (num_args, optional_default_arg, definition)
    return commands



In [None]:
# Basic
text_1 = r'\newcommand{\con}{\mathcal{C}}'
test_eq(custom_newcommands(text_1), {'con': (0, None, r'\mathcal{C}')})

# With a parameter
text_2 = r'\newcommand{\field}[1]{\mathbb{#1}}'
test_eq(custom_newcommands(text_2), {'field': (1, None, r'\mathbb{#1}')}) 

# With multiple parameters, the first of which has a default value of `2`
text_3 = r'\newcommand{\plusbinomial}[3][2]{(#2 + #3)^#1}'
test_eq(custom_newcommands(text_3), {'plusbinomial': (3, '2', r'(#2 + #3)^#1')}) 

# The display text has backslashes `\` and curly braces `{}`
text_4 = r'\newcommand{\beq}{\begin{displaymath}}'
test_eq(custom_newcommands(text_4), {'beq': (0, None, '\\begin{displaymath}')})


# Basic with spaces in the newcommand declaration
text_6 = r'\newcommand {\con}  {\mathcal{C}}'
test_eq(custom_newcommands(text_6), {'con': (0, None, r'\mathcal{C}')})

# With a parameter and spaces in the newcommand declaration
text_7 = r'\newcommand   {\field}   [1] {\mathbb{#1}}'
test_eq(custom_newcommands(text_7), {'field': (1, None, r'\mathbb{#1}')}) 

# With multiple parameters, a default value, and spaces in the newcommand declaration
text_8 = r'\newcommand {\plusbinomial} [3] [2] {(#2 + #3)^#1}'
test_eq(custom_newcommands(text_8), {'plusbinomial': (3, '2', r'(#2 + #3)^#1')}) 

# With a comment `%`
text_9 = r'% \newcommand{\con}{\mathcal{C}}'
test_eq(custom_newcommands(text_9), {})


# Spanning multiple lines
text_10 = r'''\newcommand{\mat}[4]{\left[\begin{array}{cc}#1 & #2 \\
                                         #3 & #4\end{array}\right]}'''
test_eq(
    custom_newcommands(text_10),
    {'mat': (4, None,
             '\\left[\\begin{array}{cc}#1 & #2 \\\\\n                                         #3 & #4\\end{array}\\right]')})


In [None]:
#| export
def custom_mathoperators(
        preamble: str, # The preamble of a LaTeX document.
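        ) -> dict[str, tuple[int, None, str]]: # The keys are the names of the newly defined operators and the values are tuples consisting of 1. the number of parameters (always `0`), 2. `None` (operators have no default argument), and 3. the display text of the operator.
    """
    Return a dict mapping the math operators declared in `preamble` with
    `DeclareMathOperator` to tuples of the number of arguments, the default
    argument, and the display text, in the same format as `custom_newcommands`.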
    """
    declaremathoperator_regex = re.compile(r'\\DeclareMathOperator\s*\{\\\s*(\w+)\s*\}\s*\{\s*(.*)\s*\}')
    commands = {}
    for match in declaremathoperator_regex.finditer(preamble):
        name = match.group(1)
        definition = match.group(2)

        commands[name] = (0, None, definition)
    return commands

In [None]:
text_1 = r'\DeclareMathOperator{\Hom}{Hom}'
test_eq(custom_mathoperators(text_1), {'Hom': (0, None, 'Hom')})

text_2 = r'\DeclareMathOperator{\tConf}{\widetilde{Conf}}'
test_eq(custom_mathoperators(text_2), {'tConf': (0, None, r'\widetilde{Conf}')})
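
The pattern in `custom_mathoperators` captures the definition with a greedy `(.*)`. This is what lets nested braces such as `\widetilde{Conf}` be captured whole, but it also appears to mean that the function relies on each declaration sitting on its own line. The following is a minimal sketch of that behavior using only the standard `re` module; `operators_in` is a hypothetical helper written for this illustration and is not part of the module.

```python
import re

# Same pattern as the one compiled in `custom_mathoperators`.
PATTERN = re.compile(
    r'\\DeclareMathOperator\s*\{\\\s*(\w+)\s*\}\s*\{\s*(.*)\s*\}')


def operators_in(preamble: str) -> dict[str, str]:
    """Map each declared operator name to the captured definition text.

    Hypothetical helper for illustration only.
    """
    return {m.group(1): m.group(2) for m in PATTERN.finditer(preamble)}


# One declaration per line: `.` does not cross newlines, so each
# definition is captured separately.
one_per_line = (
    '\\DeclareMathOperator{\\Hom}{Hom}\n'
    '\\DeclareMathOperator{\\End}{End}')

# Two declarations on one line: the greedy `(.*)` runs to the last `}`
# on the line, so only one (over-long) definition is captured.
same_line = r'\DeclareMathOperator{\Hom}{Hom} \DeclareMathOperator{\End}{End}'

print(operators_in(one_per_line))
print(operators_in(same_line))
```

If declarations sharing a line turn out to matter in practice, a balanced-brace pattern like the one used in `custom_newcommands` would be one way to avoid the over-capture.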

In [None]:
# TODO: use a regexp pattern like this one to extract balanced curly braces
# \\mat\{((?>[^{}]+|\{(?1)\})*)\}\{((?>[^{}]+|\{(?2)\})*)\}
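
As an alternative to the recursive regex pattern in the TODO above, balanced curly braces can also be extracted by counting nesting depth. The following is a rough sketch of that swapped-in technique using only the standard library; `balanced_brace_args` is a hypothetical name, and the sketch assumes that braces are balanced and that escaped braces (`\{`, `\}`) do not occur in the arguments.

```python
def balanced_brace_args(text: str, start: int, count: int) -> list[str]:
    """Read `count` consecutive `{...}` groups from `text`, beginning at
    index `start`, and return their contents without the outer braces.

    Hypothetical helper for illustration; escaped braces are not handled.
    """
    args = []
    pos = start
    for _ in range(count):
        # Skip whitespace between consecutive groups.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text) or text[pos] != '{':
            raise ValueError(f'Expected an opening brace at index {pos}')
        depth = 0
        group_start = pos + 1
        while pos < len(text):
            if text[pos] == '{':
                depth += 1
            elif text[pos] == '}':
                depth -= 1
                if depth == 0:
                    break  # Found the brace closing this group.
            pos += 1
        else:
            # The loop ran off the end of the text without closing the group.
            raise ValueError('Unbalanced curly braces')
        args.append(text[group_start:pos])
        pos += 1  # Move past the closing brace.
    return args


text = r'\mat{{123}}{asdfasdf{}{}}{{{}}}{{asdf}{asdf}{}}'
print(balanced_brace_args(text, len(r'\mat'), 4))
```

The depth counter trades the compactness of the regex for explicit error messages when the braces are unbalanced.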

In [None]:
#| export
def regex_pattern_detecting_command(
        command_name: str,
        command_tuple: tuple[int, Union[None, str], str], # Consists of 1. the number of parameters 2. The default argument if specified or `None`, and 3. the display text of the command.
        ) -> regex.Pattern:
    """Return a `regex.Pattern` object (not a `re.Pattern` object) detecting
    the command with the specified number of parameters, optional argument,
    and display text.

    Assumes that the curly braces used to write the invocations of the commands
    are balanced and properly nested. Assumes that there are no two commands
    of the same name.
    """
    num_parameters, optional_arg, _ = command_tuple
    backslash_name = fr"\\{command_name}"
    optional_argument_detection = r"(?:\[(.*?)\])?" if optional_arg is not None else ""
    if optional_arg is not None:
        pattern = f"{backslash_name}\\s*(?:{optional_argument_detection})"
        trailing_arguments = [_argument_detection(i) for i in range(2, 1+num_parameters)]
        trailing_args_pattern = "\\s*".join(trailing_arguments)
        pattern = (f"{pattern}\\s*{trailing_args_pattern}")
    elif num_parameters > 0:
        arguments = [_argument_detection(i) for i in range(1, 1+num_parameters)]
        args_pattern = "\\s*".join(arguments)
        pattern = f"{backslash_name}\\s*{args_pattern}"
    else:
        pattern = f"{backslash_name}"
    return regex.compile(pattern)

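def _argument_detection(group_num: int):
    """Return a pattern matching one balanced `{...}` argument, captured
    as group number `group_num`."""
    return r"\{((?>[^{}]+|\{(?1)\})*)\}".replace("1", str(group_num))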
    

In [None]:
pattern = regex_pattern_detecting_command('Sur', (0, None, r'\mathrm{Sur}'))
text = r'The number of elements of $\Sur(\operatorname{Cl} \mathcal{O}_L, A)$ is ...'
match = pattern.search(text)
start, end = match.span()
test_eq(text[start:end], r'\Sur')

pattern = regex_pattern_detecting_command('field', (1, None, r'\mathbb{\#1}'))
text = r'\field{Q}'
# print(pattern.pattern)
match = pattern.search(text)
start, end = match.span()
test_eq(text[start:end], text)


pattern = regex_pattern_detecting_command('mat', (4, None, r'\left[\begin{array}{cc}#1 & #2 \\ #3 & #4\end{array}\right]'))
text = r'\mat{{123}}{asdfasdf{}{}}{{{}}}{{asdf}{asdf}{}}' # This is a balanced str.
match = pattern.search(text)
start, end = match.span()
test_eq(text[start:end], text)
test_eq(match.group(1), r'{123}')

pattern = regex_pattern_detecting_command('plusbinomial', (3, '2', r'(#2 + #3)^#1'))
text = r'\plusbinomial{x}{y}'
match = pattern.search(text)
start, end = match.span()
test_eq(text[start:end], text)

text = r'\plusbinomial[4]{x}{y}'
match = pattern.search(text)
start, end = match.span()
test_eq(text[start:end], text)
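
Once an invocation has been detected, its captured arguments can be substituted into the placeholders `#1`, `#2`, ... of the display text. The following is a minimal sketch of that substitution step, assuming the arguments have already been extracted (for example from the groups of a match); `substitute_arguments` is a hypothetical helper and is not part of this module.

```python
import re


def substitute_arguments(definition: str, arguments: list[str]) -> str:
    """Replace the placeholders `#1`, `#2`, ... in `definition` with the
    corresponding entries of `arguments`.

    Hypothetical helper for illustration only.
    """
    def _lookup(match: re.Match) -> str:
        # LaTeX placeholders are 1-indexed and go up to `#9`.
        return arguments[int(match.group(1)) - 1]

    # A function is used as the replacement so that backslashes in the
    # arguments are inserted literally rather than treated as escapes.
    return re.sub(r'#(\d)', _lookup, definition)


# A command tuple in the format (number of parameters, default, display text).
num_parameters, default, definition = (3, '2', r'(#2 + #3)^#1')

# `\plusbinomial{x}{y}`: the optional argument is absent, so the default is used.
optional = None
first = optional if optional is not None else default
print(substitute_arguments(definition, [first, 'x', 'y']))

# `\plusbinomial[4]{x}{y}`: the optional argument overrides the default.
print(substitute_arguments(definition, ['4', 'x', 'y']))
```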


In [None]:
#| export 
def replace_commands_in_latex_document(
        document: str, 
        command_dict: dict[str, tuple[int, Union[None, str], str]]
        ) -> str:
    # TODO: not yet implemented; currently returns `None`.
    return
    

In [None]:
# export
def custom_commands(text: str) -> list[dict[str, Union[str, int]]]:
    # TODO: test that commented out custom commands are not used.
    r"""Return a list of dicts describing the custom commands/macros/operators
    in a LaTeX document along with their corresponding definitions.
    
    Assumes that the document's `\begin{document}` invocation (if it exists)
    is in its own line. The preamble (particularly the definitions of the
    commands/macros) may be commented. In this case, the format of the
    comments should be `%<definition of the command/macro/operator>`, i.e.
    the comment should start with a single comment symbol `%` and should not
    have any characters in between the comment symbol and the definition of
    the command/macro/operator.
    
    TODO: Make it so that commands that depend on other commands are
    fully decoded - e.g. boggess-sankar has the command `\tunignsheaf`
    which depends on the custom commands `\ol` and `\cV`.
    
    **Parameters**
    - text - str
        - The text of a LaTeX document. This should at least contain the
        preamble of the document.
        
    **Returns**
    - list[dict[str, Union[str, int]]]
        - For each dict, the keys include `'name'`, `'format'`, and
        optionally `'num_params'`, `'default_argument'`. The values
        are the command name (without the starting backslash), the
        format of the output of the command, the number of parameters
        that the command takes (if any), and the default value of
        the optional first parameter (if any).
    """
    text = preamble_text(text)
    lines = text.split('\n')
    lines = [line for line in lines if _type_of_line(line)]
    custom_names_and_defs = [_custom_name_and_def(line) for line in lines]
    custom_names_and_defs = [custom for custom in custom_names_and_defs
                             if custom]
    return custom_names_and_defs
    
    
def preamble_text(text: str) -> str:
    r"""Returns the preamble text of a LaTeX document text, i.e. the text before
    the `\begin{document}`.
    
    **Parameters**
    - text - str
        - The LaTeX document text
        
    **Returns**
    - str
        - The preamble of the LaTeX document text.
    """
    matches = find_regex_in_text(text, pattern=r'\\begin\{document\}')
    if not matches:
        return text
    else:
        start, _ = matches[0]
        return text[:start]
             
             
def _type_of_line(line: str) -> Union[str, None]:
    """Returns the type of LaTeX preamble line.
    
    **Parameters**
    - line - str

    **Returns**
    - Union[str, None]
        - one of `'command'`, `'mathoperator'`, `'def'`, or `None`.
    """
    if find_regex_in_text(line, pattern=r'%?\\(re)?newcommand'):
        return 'command'
    elif find_regex_in_text(line, pattern=r'%?\\DeclareMathOperator'):
        return 'mathoperator'
    elif find_regex_in_text(line, pattern=r'%?\\def'):
        return 'def'
    return None


def _custom_name_and_def(
        line: str
        ) -> Union[dict[str, Union[str, int]], None]:
    line_type = _type_of_line(line)
    if line_type == 'command':
        return _command_name_and_def(line)
    elif line_type == 'mathoperator':
        return _math_operator_name_and_def(line)
    elif line_type == 'def':
        return _def_name_and_def(line)
    # elif _type_of_line(line) == 'mathtext':
    #     return _mathtext_name_and_def(line)
        
def _command_name_and_def(line: str) -> dict[str, Union[str, int]]:
    """Returns a dict containing 1. the name of the command and 2.
    How the command manifests, including any inputs.
    
    See https://www.overleaf.com/learn/latex/Commands#Defining_a_new_command
    for how LaTeX commands are defined.
    
    **Returns**
    - dict
        - The keys include 'name', 'format', and optionally
        'num_params', 'default_argument'. The values are the command
        name, the format of the output of the command, the number of
        parameters that the command takes (if any), and the default
        value of the optional first parameter (if any).
    """
    match = re.search(
        r'^(?<!%)\\(?:re)?newcommand(?:\{| *)\\([^\{\}\] ]*)(?: *?|\})(?:\[(\d+)\])?'
        r'(?:\[(\d+)\])?\{(.*?)\}[\s]*$', line)
    # match = re.search(
    #     r'^%?\\(?:re)?newcommand(?:\{| *)\\([^\{\}\] ]*)(?: *?|\})(?:\[(\d+)\])?'
    #     r'(?:\[(\d+)\])?\{(.*?)\}[\s]*$', line)
    if match:
        command_dict = {'name': match.group(1), 'format': match.group(4)}
        if match.group(2):
            command_dict['num_params'] = int(match.group(2))
        if match.group(3):
            command_dict['default_argument'] = match.group(3)
        return command_dict
    else:
        # print(line)
        return None

def _math_operator_name_and_def(line: str) -> dict[str, Union[str, int]]:
    match = re.search(
        r'^%?\\DeclareMathOperator(?:\{| *)\\([^\{\}\] ]*)(?: *?|\})'
        r'\{(.*?)\}[\s]*$', line)
    if match:
        command_dict = {'name': match.group(1),
                        'format': f'\\operatorname{{{match.group(2)}}}'}
        return command_dict
    else:
        # print(line)
        return None

def _def_name_and_def(line: str) -> dict[str, Union[str, int]]:
    match = re.search(
        r'^%?\\def(?:\{| *)\\([^\{\}\] ]*)(?: *?|\})\{(.*?)\}[\s]*$', line)
    if match:
        command_dict = {'name': match.group(1),
                        'format': match.group(2)}
        return command_dict
    else:
        # print(line)
        return None

# def _mathtext_name_and_def(line: str) -> dict[str, Union[str, int]]:
#     match = re.search(
#         r'%?\\DeclareMathOperator(?:\{| *)([^\{\}\] ]*)(?: *?|\})\{(.*?)\} *$', line)
# def _simple_command_name_and_def(line: str) -> dict[str]:
#     match = re.search(
#         r'%?\\newcommand(?:\{| *)([^\} ]*)(?: *|\})\{(.*?)\} *$', line)
#     return match.group(1), match.group(2)
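
The line-based approach of `custom_commands` (cut the document at `\begin{document}`, then keep only the lines that look like definitions) can be sketched with the standard library alone. The following is a simplified stand-in rather than the module's implementation: `classify_preamble_lines` and `DEFINITION_KINDS` are hypothetical names, and the sketch does not handle commented definitions.

```python
import re

# Checked in order; the labels mirror the return values of `_type_of_line`.
DEFINITION_KINDS = [
    ('command', re.compile(r'\\(?:re)?newcommand')),
    ('mathoperator', re.compile(r'\\DeclareMathOperator')),
    ('def', re.compile(r'\\def')),
]


def classify_preamble_lines(text: str) -> list[tuple[str, str]]:
    """Return `(kind, line)` pairs for the definition lines in the preamble.

    Hypothetical helper for illustration only.
    """
    # Everything before the first `\begin{document}` is the preamble; if the
    # marker is absent, the whole text is treated as the preamble.
    preamble = text.split(r'\begin{document}', 1)[0]
    classified = []
    for line in preamble.split('\n'):
        for kind, pattern in DEFINITION_KINDS:
            if pattern.search(line):
                classified.append((kind, line))
                break
    return classified


document = r'''\documentclass{article}
\newcommand{\con}{\mathcal{C}}
\DeclareMathOperator{\Hom}{Hom}
\def\inv{^{-1}}
\begin{document}
\newcommand{\late}{x}
\end{document}'''

for kind, line in classify_preamble_lines(document):
    print(kind, line)
```

Definitions that appear after `\begin{document}`, such as `\late` above, are ignored because only the preamble is scanned.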

In [None]:
# # hide
# import unittest
# class CustomCommandsHelperUnitTest(unittest.TestCase):
#     def test_type_of_line(self):
#         self.assertEqual(_type_of_line(r'%\newcommand \X {{\cal X}}'), 'command')
#         self.assertEqual(_type_of_line(r'\newcommand \X {{\cal X}}'), 'command')
#         self.assertEqual(_type_of_line(r'\newcommand  \QQ {{\mathbb Q}}'), 'command')
#         self.assertEqual(_type_of_line(r'%\newcommand \CM {{\cal M}}'), 'command')
#         self.assertEqual(_type_of_line(r'%\renewcommand{\P}{\ensuremath{{\mathbb{P}}}}'), 'command')
#         self.assertEqual(_type_of_line(r'%\def\subsectie{\emppsubsection[]{\unskip}}'), 'def')

        
#     def test_command_name_and_def(self):
#         self.assertDictEqual(
#             _command_name_and_def(r'%\newcommand \X {{\cal X}}'),
#             {'name': r'X', 'format': r'{\cal X}'})
#         self.assertDictEqual(
#             _command_name_and_def(r'%\newcommand \val {\mathop{\rm val}}'),
#             {'name': r'val', 'format': r'\mathop{\rm val}'})
#         self.assertDictEqual(
#             _command_name_and_def(r'\newcommand{\colim}{\mathop{\mathrm{colim}}}'),
#             {'name': r'colim', 'format': r'\mathop{\mathrm{colim}}'})
#         self.assertDictEqual(
#             _command_name_and_def(r'\newcommand{\G}{\mathbb G}'),
#             {'name': r'G', 'format': '\mathbb G'})
#         self.assertDictEqual(
#             _command_name_and_def(r'\newcommand{\al}[1]{\overline{#1}}'),
#             {'name': r'al', 'num_params': 1, 'format': '\overline{#1}'})
#         self.assertDictEqual(
#             _command_name_and_def(r'\newcommand{\barEll}[1]{\overline{\operatorname{Ell}}_{#1}}'),
#             {'name': r'barEll', 'num_params': 1, 'format': '\overline{\operatorname{Ell}}_{#1}'})
#         self.assertDictEqual(
#             _command_name_and_def(r'%\newcommand{\plusbinomial}[3][2]{(#2 + #3)^#1}'),
#             {'name': r'plusbinomial', 'num_params': 3, 'default_argument': '2', 'format': r'(#2 + #3)^#1'})
#         self.assertDictEqual(
#             _command_name_and_def(r'%\newcommand\mx[1]{\ensuremath{\begin{pmatrix}#1\end{pmatrix}}}'),
#             {'name': r'mx', 'num_params': 1, 'format': r'\ensuremath{\begin{pmatrix}#1\end{pmatrix}}'})
#         self.assertDictEqual(
#             _command_name_and_def(r'%\renewcommand{\bar}[1]{{\overline{#1}}}'),
#             {'name': r'bar', 'num_params': 1, 'format': r'{\overline{#1}}'})
#         self.assertIsNone(
#             _command_name_and_def(r'%%\newcommand{}{\unskip\nolinebreak\hfill\hbox{\quad $\square$}}'))
#         self.assertIsNone(
#             _command_name_and_def(r'%%\newcommand{}{\unskip\nolinebreak\hfill\hbox{\quad $\square$}}'))
                
#     def test_math_operator_name_and_def(self):
#         self.assertDictEqual(
#             _math_operator_name_and_def(r'%\DeclareMathOperator{\spec}{Spec}'),
#             {'name': r'spec', 'format': r'\operatorname{Spec}'})
#         self.assertDictEqual(
#             _math_operator_name_and_def(r'%\DeclareMathOperator{\sg}{SG}'),
#             {'name': r'sg', 'format': r'\operatorname{SG}'})
#         self.assertIsNone(
#             _math_operator_name_and_def(r'%%\DeclareMathOperator{\sp}{Sp}'))
#         self.assertIsNone(
#             _math_operator_name_and_def(r'%%\DeclareMathOperator{\std}{std}'))
#         self.assertIsNone(
#             _math_operator_name_and_def(r'%%\DeclareMathOperator{\u}{U}'))


#     def test_def_name_and_def(self):
#         self.assertDictEqual(
#             _def_name_and_def(r'%\def\dual{^\vee}'),
#             {'name': r'dual', 'format': r'^\vee'})
#         self.assertDictEqual(
#             _def_name_and_def(r'%\def\inv{^{-1}}'),
#             {'name': r'inv', 'format': r'^{-1}'})
#         self.assertDictEqual(
#             _def_name_and_def(r'%\def\integ{\mathbb Z}'),
#             {'name': r'integ', 'format': r'\mathbb Z'})
#         self.assertDictEqual(
#             _def_name_and_def(r'%\def\smooth{{\rm sm}}		'),
#             {'name': r'smooth', 'format': r'{\rm sm}'})
        
# unittest.main(argv=[''], verbosity=1, exit=False)