# latex.convert

> Convert LaTeX files into Obsidian.md notes

This module contains functions and methods to automatically make Obsidian notes from LaTeX files of mathematical papers, most notably those on arXiv.

See the [Potential Problems](#potential-problems) section below for some common errors that arise from this module and how to circumvent them.

In [None]:
#| default_exp latex.convert

In [None]:
#| export
from collections import OrderedDict
import os
from os import PathLike
from pathlib import Path
from pylatexenc import latexwalker, latex2text
from pylatexenc.latexwalker import (
    LatexWalker, LatexEnvironmentNode, get_default_latex_context_db,
    LatexNode, LatexSpecialsNode, LatexMathNode, LatexMacroNode, LatexCharsNode,
    LatexGroupNode, LatexCommentNode
)
from pylatexenc.latex2text import (
    MacroTextSpec, EnvironmentTextSpec)
from pylatexenc.macrospec import (
    MacroSpec, LatexContextDb, EnvironmentSpec
)
import re
from typing import Union
from trouver.helper import (
    find_regex_in_text, dict_with_keys_topologically_sorted,
    containing_string_priority, replace_string_by_indices, text_from_file
)
from trouver.markdown.markdown.file import (
    MarkdownFile, MarkdownLineEnum
)

from trouver.markdown.obsidian.vault import VaultNote
from trouver.markdown.obsidian.personal.index_notes import (
    correspond_headings_with_folder, convert_title_to_folder_name
)
from trouver.markdown.obsidian.personal.reference import setup_folder_for_new_reference
import warnings

In [None]:
#| export
DEFAULT_NUMBERED_ENVIRONMENTS = ['theorem', 'corollary', 'lemma', 'proposition',
                                 'definition', 'conjecture', 'remark', 'example',
                                 'question']

In [None]:
from fastcore.test import ExceptionExpected, test_eq
from trouver.helper import _test_directory, non_utf8_chars_in_file

## Potential problems

The following are some problems that frequently arise when using this module:


#### UnicodeDecodeErrors arise when reading LaTeX files

By default, the `text_from_file` method in `trouver.helper` reads files and attempts to decode them as `utf-8`. If a LaTeX file has characters that cannot be decoded as `utf-8`, then a `UnicodeDecodeError` may be raised. In this case, one can identify these characters using the `trouver.helper.non_utf8_chars_in_file` method and modify the LaTeX file manually. It may be useful to use a text editor to jump to the positions of these characters and to change the encoding of the LaTeX file to `utf-8`; for example, the author of `trouver` has opened some `ANSI`-encoded LaTeX documents in `Notepad++` and converted their encoding to `UTF-8`.

In [None]:
latex_file_path = _test_directory() / 'latex_examples' / 'circumflex_E_example.tex'
contents, non_utf8_chars = non_utf8_chars_in_file(latex_file_path)
#print(contents.decode(encoding='utf-8'))
test_eq(len(non_utf8_chars), 4)
print(f'The following are the non unicode characters and their positions: {non_utf8_chars}')

The following are the non unicode characters and their positions: [('Ê', 130), ('Ê', 165), ('Ë', 196), ('Ì', 227)]
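The re-encoding step described above can also be scripted. The following is a minimal sketch using only the standard library. The helper name `reencode_to_utf8` is hypothetical and not part of `trouver`, and the sketch assumes that the file is encoded in `cp1252`, which is what Windows tools often label `ANSI`; a different source encoding would have to be passed explicitly.

```python
import tempfile
from pathlib import Path


def reencode_to_utf8(path: Path, source_encoding: str = 'cp1252') -> None:
    """Rewrite `path` in UTF-8, assuming it is currently in `source_encoding`.

    Hypothetical helper for illustration; not part of `trouver`.
    """
    # Work with bytes so that line endings are left untouched.
    text = path.read_bytes().decode(source_encoding)
    path.write_bytes(text.encode('utf-8'))


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / 'sample.tex'
    # Simulate an ANSI (cp1252) encoded LaTeX file containing a circumflex E.
    sample.write_bytes('\\title{Ê}'.encode('cp1252'))
    reencode_to_utf8(sample)
    # The file can now be read as UTF-8 without a UnicodeDecodeError.
    roundtrip = sample.read_text(encoding='utf-8')
    print(roundtrip)
```

If the guessed source encoding is wrong, the decoded characters will be wrong as well, so it is worth checking the flagged positions by eye after converting.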


#### `NoDocumentNodeErrors` arise even though the LaTeX file has a document environment (i.e. `\begin{document}...\end{document}`)

The `find_document_node` method in this module is sometimes unable to detect the document environment of a LaTeX file. This error is known to arise when
- there are macros (which include commands) defined whose expansions contain `\begin{...}` or `\end{...}`.
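To make the situation above concrete, here is a rough sketch that scans a preamble for macro definitions whose bodies contain `\begin{` or `\end{`. The preamble and the macro names `\beq`, `\eeq`, and `\ZZ` are made up for illustration, the regex only handles single-line `\newcommand`/`\renewcommand` definitions, and whether a given definition actually triggers a `NoDocumentNodeError` is not something this sketch checks.

```python
import re

# Hypothetical preamble; the macro names are invented for illustration.
preamble = r"""
\documentclass{amsart}
\newcommand{\beq}{\begin{displaymath}}
\newcommand{\eeq}{\end{displaymath}}
\newcommand{\ZZ}{\mathbb{Z}}
"""

# Capture the macro name and the remainder of the line (its definition).
definition_pattern = re.compile(
    r'\\(?:re)?newcommand\*?\s*\{?(\\[A-Za-z]+)\}?(.*)')

suspicious = [
    match.group(1)
    for match in definition_pattern.finditer(preamble)
    if r'\begin{' in match.group(2) or r'\end{' in match.group(2)
]
print(suspicious)  # ['\\beq', '\\eeq']
```

Macros flagged this way are candidates to inspect when the document environment is not detected.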

In [None]:
# TODO in the above explanation, include an example.

## Divide LaTeX file into parts

To make Obsidian notes from a LaTeX file, I use sections/subsections and environments as places to make new notes.

Things to think about:
- Sections/subsections
- Environments, including theorems, corollaries, propositions, lemmas, definitions, notations
- Citations
- Macros defined in the preamble?

LatexMacroNodes include: sections/subsections, citations, references, and labels, e.g.

```latex
\section{Introduction}
\cite{ellenberg2nilpotent}
\subsection{The section conjecture}
\'e
\ref{fundamental-exact-sequence}
\cite{stix2010period}
\ref{fundamental-exact-sequence}
\cite{stix2012rational}
\cite[Appendix C]{stix2010period}
\subsection{The tropical section conjecture}
\label{subsec:tropical-section-conjecture}
```
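As a rough illustration of the macros listed above, the following sketch collects macro names from a short LaTeX fragment with a regular expression and then picks out the ones that mark note boundaries. This is only a stand-in for the `LatexMacroNode` objects that `pylatexenc` produces: the regex does no real parsing (for example, it does not skip comments), and the fragment itself is abbreviated from the listing above.

```python
import re

latex_fragment = r"""\section{Introduction}
See \cite{ellenberg2nilpotent} and \ref{fundamental-exact-sequence}.
\subsection{The tropical section conjecture}
\label{subsec:tropical-section-conjecture}
"""

# A macro name is taken to be a backslash followed by letters.
macro_names = re.findall(r'\\([A-Za-z]+)', latex_fragment)
print(macro_names)  # ['section', 'cite', 'ref', 'subsection', 'label']

# Only sectioning macros are used as places to start new notes.
sectioning_macros = {'section', 'subsection'}
boundaries = [name for name in macro_names if name in sectioning_macros]
print(boundaries)  # ['section', 'subsection']
```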

#### Divide the preamble from the rest of the document

Some macros and commands defined in the preamble seem to prevent the `pylatexenc` methods from properly identifying the document environment/node in a LaTeX document. To circumvent this, we define a function to divide the preamble from the rest of the document:

In [None]:
#| export
def divide_preamble(
        text: str, # LaTeX document
        document_environment_name: str = "document"
        ) -> tuple[str, str]:
    """Divide the preamble from the rest of a LaTeX document.

    **Raises**
    - ValueError
        - If the beginning of the document environment is not found in `text`.
    """
    begin_environment_str = rf'\begin{{{document_environment_name}}}'
    match = re.search(re.escape(begin_environment_str), text)
    if match is None:
        raise ValueError(
            f'The text does not contain {begin_environment_str}.')
    # Everything before the start of the match is the preamble.
    start_match = match.start()
    return text[:start_match], text[start_match:]

    

In [None]:
latex_file_path = _test_directory() / 'latex_examples' / 'example_with_a_command_with_begin.tex'
text = text_from_file(latex_file_path)

preamble, document = divide_preamble(text)
assert r'\begin{displaymath}' in preamble
assert r'Hyun Jong Kim' in preamble

assert r'Hyun Jong Kim' not in document
assert document.startswith(r'\begin{document}')
assert document.endswith('\\end{document}')

#### Get the Document Node

In [None]:
#| export
class NoDocumentNodeError(Exception):
    """Exception raised when a LatexEnvironmentNode corresponding to the document 
    environment is expected in a LaTeX string, but no such node exists.
    
    **Attributes**
    - text - str
        - The text in which the document environment is not found.
    """
    
    def __init__(self, text):
        self.text = text
        super().__init__(
            f"The following text does not contain a document environment:\n{text}")



In [None]:
#| export
def find_document_node(
        text: str, # LaTeX str
        document_environment_name: str = "document" # The name of the document environment.
        ) -> LatexEnvironmentNode:
    """Find the `LatexNode` object for the main document in `text`.
    
    **Raises**
    - NoDocumentNodeError
        - If document environment node is not detected.
    """
    w = LatexWalker(text)
    nodelist, _, _ = w.get_latex_nodes(pos=0)
    for node in nodelist:
        if node.isNodeType(LatexEnvironmentNode)\
                and node.environmentname == document_environment_name:
            return node
    raise NoDocumentNodeError(text)

The main content of virtually all LaTeX math articles belongs to a document environment, which pylatexenc can often detect. The `find_document_node` function returns this `LatexEnvironmentNode` object:

In [None]:
latex_file_path = _test_directory() / 'latex_examples' / 'latex_example_1' / 'main.tex'
text = text_from_file(latex_file_path)
document_node = find_document_node(text)

If the LaTeX file has no `document` environment, then a `NoDocumentNodeError` is raised:

In [None]:
# This latex document has its `document` environment commented out.
latex_file_path = _test_directory() / 'latex_examples' / 'latex_example_2' / 'main.tex'
text = text_from_file(latex_file_path)
with ExceptionExpected(NoDocumentNodeError):
    document_node = find_document_node(text)

At the time of this writing, a `NoDocumentNodeError` may be raised even if the LaTeX file has a proper `document` environment:

In [None]:
latex_file_path = _test_directory() / 'latex_examples' / 'example_with_a_command_with_begin.tex'
text = text_from_file(latex_file_path)

# Perhaps in the future, pylatexenc will be able to find the document node for this file.
# When that time comes, delete this example.
with ExceptionExpected(NoDocumentNodeError):
    find_document_node(text)



The `divide_preamble` function can be used to circumvent this problem:

In [None]:
preamble, document = divide_preamble(text)
document_node = find_document_node(document)
test_eq(document_node.environmentname, 'document')
assert document_node.isNodeType(LatexEnvironmentNode)

In [None]:
# hide
# Find no document node error causes

# latex_file_path = r'_tests\latex_full\litt_cfag\main.tex'
# text = text_from_file(latex_file_path)
# document_node = find_document_node(text)

### Detect environment names used in a file

In [None]:
#| export
def environment_names_used(
        text: str # LaTeX document
        ) -> set[str]: # The set of all environment names used in the main document.
    """Return the set of all environment names used in the main document
    of the latex code.
    """
    document_node = find_document_node(text)
    return {node.environmentname for node in document_node.nodelist
            if node.isNodeType(LatexEnvironmentNode)}        

Writers often use different environment names. For example, writers often use `theorem`, `thm`, or `theo` for theorem environments or `lemma` or `lem` for lemma environments. The `environment_names_used` function returns the environment names actually used in the tex file.
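One way to put the returned names to use is to pick out those that look like theorem-like environments, for instance to build a `numbered_environments` argument. The sketch below is only an illustration: the alias table `ENVIRONMENT_ALIASES` and the helper `numbered_environments_in_use` are hypothetical and not part of `trouver`, and the abbreviations listed are merely ones that commonly appear in papers.

```python
# Hypothetical alias table; extend it with the abbreviations a given paper uses.
ENVIRONMENT_ALIASES = {
    'theorem': {'theorem', 'thm', 'theo'},
    'lemma': {'lemma', 'lem'},
    'corollary': {'corollary', 'cor'},
    'definition': {'definition', 'defn'},
    'conjecture': {'conjecture', 'conj'},
}


def numbered_environments_in_use(names_used: set[str]) -> list[str]:
    """Return the names in `names_used` that appear in the alias table."""
    known_names = set().union(*ENVIRONMENT_ALIASES.values())
    return sorted(names_used & known_names)


# A set of the kind that `environment_names_used` returns.
names_used = {'conj', 'notation', 'corollary', 'defn'}
print(numbered_environments_in_use(names_used))  # ['conj', 'corollary', 'defn']
```

Names missing from the table, such as `notation` here, are dropped, so the table has to be reviewed for each paper.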

In the example below, note that only the environments that are actually used are returned. For instance, the preamble of the document defines the theorem environments `problem` and `lemma` (among others), but these are not actually used in the document itself.

In [None]:
latex_file_path = _test_directory() / 'latex_examples' / 'has_fully_written_out_environment_names.tex'
sample_text_1 = text_from_file(latex_file_path)
sample_output_1 = environment_names_used(sample_text_1)
test_eq({'corollary', 'proof', 'maincorollary', 'abstract', 'proposition'}, sample_output_1)

The document in the example below uses shorter names for theorem environments:

In [None]:
latex_file_path = _test_directory() / 'latex_examples' / 'has_shorter_environment_names.tex'
sample_text_2 = text_from_file(latex_file_path)
sample_output_2 = environment_names_used(sample_text_2)
test_eq({'conj', 'notation', 'corollary', 'defn'}, sample_output_2)

### Divide latex text into parts

In [None]:
#| export
# TODO: numbering convention could be theorems separate (e.g. theorem 1, 2, ...)
# and subsections separate.
def divide_latex_text(
        text, # The text of a latex document.
        numbered_environments: list[str] = DEFAULT_NUMBERED_ENVIRONMENTS, # A list of the names of environments which are numbered in the latex code.
        numbering_convention: str = 'separate',
        section_name: str = 'section',
        subsection_name: str = 'subsection',
        proof_name: str = 'proof') -> list:
    """Divides latex text to convert into Obsidian notes.
    
    **Parameters**
    - text - str
    - numbered_environments - list
    - numbering_convention - str
        - One of
            
            - 'separate' - Subsections of a section have separate numberings, 
            e.g. 'Lemma 1.2.1, Proposition 1.2.2, Figure 1.2.3, Theorem 1.3.1'
            - 'shared' - Subsections of a section share numberings, e.g.
            'Lemma 1.1, Proposition 1.2, Figure 1.3, Theorem 1.4'
            
    - section_name - str
        - The macronames for "sections". Defaults to `'section'`.
        For example, SGA has chapters and sections. For the purposes of this function,
        it is appropriate to regard them as sections and subsections, respectively.
    - subsection_name - str
        - Defaults to `'subsection'`.
    - proof_name - str
        - The environment names for proofs. Defaults to `'proof'`.
        
    **Returns**
    - list of list
        - Each list corresponds to an Obsidian note to be constructed.
        Such a list is of the form `[<node_type & numbering>, <text>]` where 
        `node_type & numbering` is a string which serves as a title for the
        text making up the note, and `text` is the content of the note.
    """
    document_node = find_document_node(text)
    section_num = 0
    subsection_num = 0
    environment_num = 0
    outside_num = 1  # Since not everything is in a nice environment, many 
                     # notes will need their own numbers.
    parts = []
    accumulation = ''
    for node in document_node.nodelist:
        (section_num, subsection_num, environment_num, outside_num,
         accumulation)\
            = _process_node(
                section_num, subsection_num, environment_num, outside_num,
                accumulation, parts, node, section_name, subsection_name,
                proof_name, numbered_environments, numbering_convention)
    outside_num += 1
    parts.append([str(outside_num), accumulation])
    return parts
            
def _process_node(
        section_num: int, subsection_num: int, environment_num: int,
        outside_num: int, accumulation: str, parts: list, node: LatexNode,
        section_name: str, subsection_name: str, proof_name: str,
        numbered_environments: list[str], numbering_convention: str) -> tuple:
    """
    Choose and run the node-processing method appropriate for `node`,
    depending on whether the node is a section/subsection, a numbered
    environment, a proof, or something else.

    Return the updated `section_num`, `subsection_num`, `environment_num`,
    `outside_num`, and `accumulation`.
    """
    process_method_to_run = None
    if node.isNodeType(LatexMacroNode) and node.macroname == section_name:
        process_method_to_run = _process_section
    elif node.isNodeType(LatexMacroNode) and node.macroname == subsection_name:
        process_method_to_run = _process_subsection
    elif (node.isNodeType(LatexEnvironmentNode)
          and node.environmentname in numbered_environments):
        process_method_to_run = _process_environment_node
    if process_method_to_run:
        (section_num, subsection_num, environment_num, outside_num,
        accumulation)\
            = process_method_to_run(
            section_num, subsection_num, environment_num, outside_num,
            accumulation, parts, node, section_name, subsection_name,
            numbering_convention)
    elif (node.isNodeType(LatexEnvironmentNode)
          and node.environmentname == proof_name
          and parts):  # With no earlier part to attach to, fall through and accumulate.
          # TODO: if the environment is a proof, and if it starts a section/subsection,
          # Then the proof is appended into the title of the section/subsection, see
          # landesman_litt_ipwc, around line 1858-1863 for example.
        parts[-1][1] += f'\n{node.latex_verbatim()}'
    else:
        accumulation += node.latex_verbatim()
    return (section_num, subsection_num, environment_num, outside_num,
            accumulation)


def _process_section(
        section_num: int, subsection_num: int, environment_num: int,
        outside_num: int, accumulation: str, parts: list[list],
        node: LatexMacroNode, section_name: str, subsection_name: str,
        numbering_convention: str) -> tuple:
    """Process a section node: update the counters, flush `accumulation`
    into `parts`, and append the section title to `parts`.

    Return the updated `section_num`, `subsection_num`, `environment_num`,
    `outside_num`, and `accumulation`.
    """
    numbered, title  = _section_title(
        node.latex_verbatim(), section_name, subsection_name)
    section_num += 1 if numbered else 0
    subsection_num = 0
    environment_num = 0
    if accumulation.strip() != '':
        parts.append([str(outside_num), accumulation])
        outside_num += 1
        accumulation = ''
    parts.append([f'{section_name} {section_num}', title])
    return (section_num, subsection_num, environment_num, outside_num,
            accumulation)
    

def _process_subsection(
        section_num: int, subsection_num: int, environment_num: int,
        outside_num: int, accumulation: str, parts: list[list],
        node: LatexMacroNode, section_name: str, subsection_name: str,
        numbering_convention: str) -> tuple:
    """Process a subsection node: update the counters, flush `accumulation`
    into `parts`, and append the subsection title to `parts`.
    """
    numbered, title  = _section_title(
        node.latex_verbatim(), section_name, subsection_name)
    subsection_num += 1 if numbered else 0
    if numbering_convention == 'separate':
        environment_num = 0
    if accumulation.strip() != '':
        parts.append([str(outside_num), accumulation])
        outside_num += 1
        accumulation = ''
    parts.append([f'{subsection_name} {section_num}.{subsection_num}', title])
    return (section_num, subsection_num, environment_num, outside_num,
            accumulation)


def _process_environment_node(
        section_num: int, subsection_num: int, environment_num: int,
        outside_num: int, accumulation: str, parts: list[list],
        node: LatexEnvironmentNode, section_name: str, subsection_name: str,
        numbering_convention: str) -> tuple:
    """Process a numbered environment node: flush `accumulation` into
    `parts` and append the numbered environment to `parts`.
    """
    environment_num += 1
    if accumulation.strip() != '':
        parts.append([str(outside_num), accumulation])
        outside_num += 1
        accumulation = ''
    if numbering_convention == 'separate':
        pointed_numbering = f'{section_num}.{subsection_num}.{environment_num}'
        numbering = f'{node.environmentname} {pointed_numbering}'
    elif numbering_convention == 'shared':
        numbering = f'{node.environmentname} {section_num}.{environment_num}'
    else:
        # Without this branch, `numbering` would be unbound below.
        raise ValueError(
            f"Unknown numbering_convention: {numbering_convention!r}")
    parts.append([numbering, node.latex_verbatim()])
    return (section_num, subsection_num, environment_num, outside_num,
            accumulation)


def _section_title(
        text: str, section_name: str, subsection_name: str
        ) -> tuple[bool, str]:
    """Return whether a section or subsection in a latex str is numbered,
    along with its title.
    
    **Parameters**
    - text - str
    - section_name - str
    - subsection_name - str
    
    **Returns**
    - bool, str
        - Whether the section/subsection is numbered, and its title.

    **Raises**
    - ValueError
        - If no section or subsection macro is found in `text`.
    """
    # TODO: test things like `\\section {Generating series of special divisors}`
    # See qiu_amsd for example.
    # TODO: deal with the possibility of multi-line sections/subsections,
    # e.g. \subsection{Arithmetic intersrection\n pairing},
    # see qiu_amsd for example
    regex_search = re.search(r'\\' + fr'(?:{section_name}|{subsection_name}) *?'
                             + r'(?:\[.*\])?(\*)?\{(.*)\}', text)
    if regex_search is None:
        raise ValueError(
            f'No {section_name} or {subsection_name} macro found in: {text}')
    # A starred macro (e.g. `\section*{...}`) is unnumbered.
    return not bool(regex_search.group(1)), regex_search.group(2)



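The `divide_latex_text` function divides LaTeX text into parts, each of which is meant to become an Obsidian note. Each part is a list of the form `[<node_type & numbering>, <text>]`, where the first entry serves as a title for the note and the second entry is the content of the note.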
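The two values of `numbering_convention` determine how the titles of numbered environments are formed. The following standalone sketch mirrors the formatting used for environment titles in this module. The helper name `environment_title` is hypothetical and exists only for this illustration; it is not part of the module.

```python
def environment_title(
        environment_name: str, section_num: int, subsection_num: int,
        environment_num: int, numbering_convention: str = 'separate') -> str:
    """Form a note title for a numbered environment.

    Hypothetical helper for illustration only.
    """
    if numbering_convention == 'separate':
        # Each subsection has its own count, e.g. 'lemma 1.2.1'.
        return (f'{environment_name} '
                f'{section_num}.{subsection_num}.{environment_num}')
    if numbering_convention == 'shared':
        # Subsections of a section share one count, e.g. 'theorem 1.4'.
        return f'{environment_name} {section_num}.{environment_num}'
    raise ValueError(f'Unknown numbering_convention: {numbering_convention!r}')


print(environment_title('lemma', 1, 2, 1, 'separate'))  # lemma 1.2.1
print(environment_title('theorem', 1, 3, 4, 'shared'))  # theorem 1.4
```

Note that the titles use the environment names exactly as they appear in the LaTeX source, so a `thm` environment yields a title starting with `thm`.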

In [None]:
# latex_file_path = r'_tests\latex_full\pauli_wickelgren\main.tex'
# text = text_from_file(latex_file_path)
# parts = divide_latex_text(text, numbering_convention='separate')
# for title, text in parts[39:44]:
#     print(title, text)



In [None]:
# TODO: Find a list of environment names commonly used.

In [None]:
# TODO: examples with different numbering convention and different numbered environments

In [None]:
# TODO: make numbering_convention work correctly.
# Here are some latex files with different conventions:
# - All subsections in a section share numbering, 
#   - achter_pries_imht https://arxiv.org/abs/math/0608038: e.g. Lemmas 2.1, 2.2, 2.3 are in subsection 2.2 and Lemma 2.4 and Remark 2.5 are in subsection 2.4.
#   - pauli_wickelgren https://arxiv.org/abs/2010.09374: e.g. Example 3.5, 3.11 are in subsubsection 3.3.2, Exercise 4.1, Remark 4.2, are in subsection 4.1, Theorem 4.3 is in subsection 4.2, Theorem 4.4 is in subsection 4.3
# - Different environment types have different counts and the counts do not show the section number.
#   - vankataramana_imbrd https://arxiv.org/abs/1205.6543: 
#       - e.g. section 1 has Theorem 1, Remark 1, Remark 2, Remark 3, subsection 1.1.3 has Remark 4, Subsection 2.2 has Definition 1