# latex.convert

> Convert LaTeX files into Obsidian.md notes

This module contains functions and methods to automatically make Obsidian notes from LaTeX files of mathematical papers, most notably those on arXiv.

See the [Potential Problems](#potential-problems) section below for some common errors that arise from this module and how to circumvent them.

In [None]:
#| default_exp latex.convert

In [None]:
#| export
from collections import OrderedDict
import os
from os import PathLike
from pathlib import Path
from pylatexenc import latexwalker, latex2text
from pylatexenc.latexwalker import (
    LatexWalker, LatexEnvironmentNode, get_default_latex_context_db,
    LatexNode, LatexSpecialsNode, LatexMathNode, LatexMacroNode, LatexCharsNode,
    LatexGroupNode, LatexCommentNode
)
from pylatexenc.latex2text import (
    MacroTextSpec, EnvironmentTextSpec)
from pylatexenc.macrospec import (
    MacroSpec, LatexContextDb, EnvironmentSpec
)
import re
from typing import Union
from trouver.helper import (
    find_regex_in_text, dict_with_keys_topologically_sorted,
    containing_string_priority, replace_string_by_indices, text_from_file
)
from trouver.markdown.markdown.file import (
    MarkdownFile, MarkdownLineEnum
)

from trouver.markdown.obsidian.vault import VaultNote
from trouver.markdown.obsidian.personal.index_notes import (
    correspond_headings_with_folder, convert_title_to_folder_name
)
from trouver.markdown.obsidian.personal.reference import setup_folder_for_new_reference
import warnings

In [None]:
#| export
DEFAULT_NUMBERED_ENVIRONMENTS = ['theorem', 'corollary', 'lemma', 'proposition',
                                 'definition', 'conjecture', 'remark', 'example',
                                 'question']

In [None]:
from fastcore.test import *
from trouver.helper import _test_directory, non_utf8_chars_in_file

## Potential problems

The following are some problems that frequently arise when using this module:


#### UnicodeDecodeErrors arise when reading LaTeX files

By default, the `text_from_file` function in `trouver.helper` reads a file and attempts to decode it as `utf-8`. If a LaTeX file contains characters that cannot be decoded as `utf-8`, then a `UnicodeDecodeError` may be raised. In this case, one can identify these characters using the `trouver.helper.non_utf8_chars_in_file` function and modify the LaTeX file manually. It may be useful to use a text editor to jump to the positions of these characters and to change the encoding of the LaTeX file to `utf-8`; for example, the author of `trouver` has opened some `ANSI`-encoded LaTeX documents in `Notepad++` and converted their encoding to `UTF-8`.

In [None]:
latex_file_path = _test_directory() / 'latex_examples' / 'circumflex_E_example.tex'
contents, non_utf8_chars = non_utf8_chars_in_file(latex_file_path)
#print(contents.decode(encoding='utf-8'))
test_eq(len(non_utf8_chars), 4)
print(f'The following are the non-UTF-8 characters and their positions: {non_utf8_chars}')

The following are the non-UTF-8 characters and their positions: [('Ê', 130), ('Ê', 165), ('Ë', 196), ('Ì', 227)]
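
Once the offending characters have been located, the file can also be re-encoded programmatically instead of through a text editor. The following is a minimal sketch using only the standard library; the helper name `reencode_to_utf8` is hypothetical (it is not part of `trouver`), and it assumes that the `ANSI` encoding in question is Windows-1252 (`cp1252`), which may not hold for every file.

```python
import tempfile
from pathlib import Path


def reencode_to_utf8(path, source_encoding='cp1252'):
    """Rewrite the file at `path` in UTF-8, assuming it is currently
    encoded in `source_encoding` (an assumption the caller must verify)."""
    path = Path(path)
    # Work with raw bytes so that line endings are left untouched.
    text = path.read_bytes().decode(source_encoding)
    path.write_bytes(text.encode('utf-8'))
    return text


with tempfile.TemporaryDirectory() as temp_dir:
    example_path = Path(temp_dir) / 'example.tex'
    # Simulate an ANSI (here: cp1252) encoded LaTeX file.
    example_path.write_bytes('Théorème'.encode('cp1252'))

    try:
        example_path.read_text(encoding='utf-8')
        decodable_before = True
    except UnicodeDecodeError:
        decodable_before = False

    reencode_to_utf8(example_path)
    text_after = example_path.read_text(encoding='utf-8')
```

If the guessed source encoding is wrong, the file may still decode without error but contain the wrong characters, so it is worth checking the result by eye.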


#### `NoDocumentNodeError`s arise even though the LaTeX file has a document environment (i.e. `\begin{document}...\end{document}`)

The `find_document_node` function in this module is sometimes unable to detect the document environment of a LaTeX file. This error is known to arise when
- macros (including commands) are defined whose expansions contain `\begin{...}` or `\end{...}`. For example, a preamble may declare `\newcommand`s that stand for `\begin{equation}` and `\end{equation}`.

In [None]:
# TODO in the above explanation, include an example.
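
As a rough diagnostic, lines of a LaTeX file that define such macros can be flagged with a regular expression. This is only a sketch: the helper name `lines_defining_begin_end_macros` is hypothetical (it is not part of `trouver`), it works line by line and so misses definitions that span several lines, and a flagged line is a candidate cause rather than a confirmed one.

```python
import re

# Commands that introduce a macro definition.
DEFINITION_PATTERN = re.compile(
    r'\\(?:newcommand|renewcommand|providecommand|def)\b')
# A literal \begin{ or \end{ somewhere on the line.
BEGIN_END_PATTERN = re.compile(r'\\(?:begin|end)\s*\{')


def lines_defining_begin_end_macros(text):
    """Return (line number, line) pairs for lines that define a macro
    whose body mentions a begin or end of an environment."""
    flagged = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        # Drop anything after an unescaped % so comments are ignored.
        code = re.split(r'(?<!\\)%', line, maxsplit=1)[0]
        if DEFINITION_PATTERN.search(code) and BEGIN_END_PATTERN.search(code):
            flagged.append((line_number, line.strip()))
    return flagged


sample = r'''\documentclass{article}
\newcommand{\be}{\begin{equation}}
\newcommand{\ee}{\end{equation}}
\newcommand{\R}{\mathbb{R}}
\begin{document}
\be x = 1 \ee
\end{document}
'''
flagged = lines_defining_begin_end_macros(sample)
for line_number, line in flagged:
    print(line_number, line)
```

Replacing the flagged shorthand macros in the file by the environments they stand for is one possible manual workaround.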

## Divide LaTeX file into parts

To make Obsidian notes from a LaTeX file, I use sections, subsections, and environments as the places at which to start new notes.

Things to think about:
- Sections/subsections
- Environments, including theorems, corollaries, propositions, lemmas, definitions, notations
- Citations
- Macros defined in the preamble?

LatexMacroNodes include: sections/subsections, citations, references, and labels, e.g.

```latex
\section{Introduction}
\cite{ellenberg2nilpotent}
\subsection{The section conjecture}
\'e
\ref{fundamental-exact-sequence}
\cite{stix2010period}
\ref{fundamental-exact-sequence}
\cite{stix2012rational}
\cite[Appendix C]{stix2010period}
\subsection{The tropical section conjecture}
\label{subsec:tropical-section-conjecture}
```
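
To give a feel for how sectioning macros can serve as split points, here is a rough sketch that locates them with a regular expression from the standard library. It is an approximation for illustration only, not how this module works (the module relies on `pylatexenc` nodes); the helper name `find_section_headings` is hypothetical, and titles containing nested braces are not handled.

```python
import re

# Matches \section{...}, \subsection{...}, \subsubsection{...}, with an
# optional star. Titles with nested braces are deliberately not supported.
SECTION_PATTERN = re.compile(
    r'\\(section|subsection|subsubsection)\*?\s*\{([^{}]*)\}')


def find_section_headings(text):
    """Return (macro name, title, start index) for each sectioning macro."""
    return [(match.group(1), match.group(2), match.start())
            for match in SECTION_PATTERN.finditer(text)]


sample = r'''\section{Introduction}
\cite{ellenberg2nilpotent}
\subsection{The section conjecture}
\ref{fundamental-exact-sequence}
\subsection{The tropical section conjecture}
\label{subsec:tropical-section-conjecture}
'''
headings = find_section_headings(sample)
for macro_name, title, start in headings:
    print(macro_name, repr(title), start)
```

The start indices could then be used to slice the text into one chunk per section or subsection.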

#### Get the Document Node

In [None]:
#| export
class NoDocumentNodeError(Exception):
    """Exception raised when a LatexEnvironmentNode corresponding to the document 
    environment is expected in a LaTeX string, but no such node exists.
    
    **Attributes**
    - text - str
        - The text in which the document environment is not found.
    """
    
    def __init__(self, text):
        self.text = text
        super().__init__(
            f"The following text does not contain a document environment:\n{text}")



In [None]:
#| export
def find_document_node(
        text: str, # LaTeX str
        document_environment_name: str = "document" # The name of the document environment.
        ) -> LatexEnvironmentNode:
    """Finds the `LatexNode` object for the main document in `text`.
    
    **Raises**
    - NoDocumentNodeError
        - If document environment node is not detected.
    """
    w = LatexWalker(text)
    nodelist, _, _ = w.get_latex_nodes(pos=0)
    for node in nodelist:
        if node.isNodeType(LatexEnvironmentNode)\
                and node.environmentname == document_environment_name:
            return node
    raise NoDocumentNodeError(text)

The main content of virtually all LaTeX math articles belongs to a document environment, which pylatexenc can often detect. The `find_document_node` function returns this `LatexEnvironmentNode` object:

In [None]:
latex_file_path = _test_directory() / 'latex_examples' / 'latex_example_1' / 'main.tex'
text = text_from_file(latex_file_path)
document_node = find_document_node(text)

If the tex file has no `document` environment, then a `NoDocumentNodeError` is raised:

In [None]:
from fastcore.test import ExceptionExpected
# This latex document has its `document` environment commented out.
latex_file_path = _test_directory() / 'latex_examples' / 'latex_example_2' / 'main.tex'
text = text_from_file(latex_file_path)
with ExceptionExpected(NoDocumentNodeError):
    document_node = find_document_node(text)

In [None]:
# TODO: sometimes, latex files have document environments, but pylatexenc won't detect them.
# Compile a list of examples of these and figure out what is wrong.
# ellenberg_li_shusterman, kohler_roessler_fpf_i, litt_cfag, qiu_amsd
# ellenberg_li_shusterman: I think the issue is in lines 113-116; there are \newcommands declared for \begin{displaymath}, \end{displaymath}, \begin{equation}, and \end{equation}
# kohler_roessler_fpf_i: The issue was that the tex document was OCR'd using mathpix and did not have a begin/end document environment at all.
# litt_cfag: Same as above.
# qiu_amsd: There are new commands declared with \begin and \end's.
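
When `pylatexenc` fails to detect a document environment that is in fact present, one crude fallback is to search for the literal `\begin{document}` and `\end{document}` markers. The sketch below uses only the standard library; the helper name `document_body_span` is hypothetical (it is not part of `trouver`), and the approach ignores `verbatim`-like environments and other edge cases, so it should be treated as a debugging aid rather than a replacement for `find_document_node`.

```python
import re


def document_body_span(text, environment_name='document'):
    """Return (start, end) indices of the text between the begin and end
    markers of the environment, or None if they are not both found."""
    # Blank out comments (from an unescaped % to the end of the line) while
    # keeping the string length, so indices remain valid for `text`.
    uncommented = re.sub(
        r'(?<!\\)%.*', lambda m: ' ' * len(m.group(0)), text)
    name = re.escape(environment_name)
    begin = re.search(r'\\begin\s*\{' + name + r'\}', uncommented)
    ends = list(re.finditer(r'\\end\s*\{' + name + r'\}', uncommented))
    if begin is None or not ends:
        return None
    end = ends[-1]  # Use the last end marker in the file.
    if end.start() < begin.end():
        return None
    return begin.end(), end.start()


sample = r'''\documentclass{article}
% \begin{document} (commented out)
\begin{document}
Hello.
\end{document}
'''
span = document_body_span(sample)
body = sample[span[0]:span[1]]
commented_only = document_body_span(r'% \begin{document} x \end{document}')
```

A `None` result here suggests the file really lacks an uncommented document environment, as in the OCR'd examples noted above.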


In [None]:
# hide
# Find no document node error causes

# latex_file_path = r'_tests\latex_full\litt_cfag\main.tex'
# text = text_from_file(latex_file_path)
# document_node = find_document_node(text)