In [None]:
#| default_exp markdown.obsidian.personal.machine_learning.notation_identification

# markdown.obsidian.personal.machine_learning.notation_identification
> Functions for finding notations introduced in mathematical text. 

**Note: This module is deprecated in favor of markdown.obsidian.personal.machine_learning.tokenize.def_and_notat_token_classification**

In `markdown.obsidian.personal.notation`, we explained why it is convenient to keep notation notes. To make such notation notes automatically, one needs a way to identify notations; more precisely, one needs a way to identify where notations are newly introduced.

The author of `trouver` surrounds (newly introduced) definitions and notations with double asterisks `**` in his `Obsidian.md` math vault; note that surrounding text with double asterisks `**` can boldface said text[^1]. This particular notebook focuses on notations, rather than definitions.

[^1]: However, LaTeX math mode text does not get bold-faced with double asterisks.

For the purposes of `trouver`, a notation is (contained in) a purely LaTeX math mode text[^2] that is surrounded by double asterisks `**`, whereas a definition is any other double asterisk surrounded text. 

[^2]: For the purposes of `trouver`, a math mode text must be an in-line math mode text surrounded by single dollar signs `$` or display math mode text surrounded by double dollar signs `$$`. This is because Markdown does not recognize `\( \)` and `\[ \]` for math mode text.

For example, in the Markdown text


```markdown
The **Galois group** of a Galois extension $L/K$ is the group **$\operatorname{Gal}(L/K)$** whose elements are the automorphisms of the field $L$ fixing $K$ pointwise. For example, $\operatorname{Gal}(\mathbb{C}/\mathbb{R})$ is isomorphic to $\mathbb{Z}/2 \mathbb{Z}$ with complex conjugation as the nontrivial element.

```

the text `Galois group` constitutes a definition and `$\operatorname{Gal}(L/K)$` constitutes a notation. On the other hand, the LaTeX math mode strings `$\operatorname{Gal}(\mathbb{C}/\mathbb{R})$` and `$\mathbb{Z}/2 \mathbb{Z}$` do not newly introduce notations (or definitions for that matter) in the context of the text.


As another example, in

```markdown
A **Hausdorff space** or a **$T_2$-space** is a topological space $X$ such that for all $x,y \in X$, there exist open neighborhoods $U$ and $V$ of $x$ and $y$ such that $U \cap V = \emptyset$.
```

the texts `Hausdorff space` and `$T_2$-space` are both definitions.
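
The convention above can be sketched in a few lines. The following is only an illustration under simplifying assumptions: `classify_bold_spans` and the two regular expressions are hypothetical names that are not part of `trouver` (which uses `defs_and_notats_separations` for this), and the regular expressions ignore edge cases such as escaped dollar signs.

```python
import re

# Hypothetical helpers for illustration only; not part of `trouver`.
# Text surrounded by double asterisks.
BOLD_SPAN = re.compile(r'\*\*(.+?)\*\*', flags=re.DOTALL)
# Exactly one `$$...$$` or `$...$` math mode string.
PURE_MATH = re.compile(r'\$\$.+?\$\$|\$[^$]+\$', flags=re.DOTALL)


def classify_bold_spans(text: str) -> list[tuple[str, str]]:
    """Label each double-asterisk span as 'notation' or 'definition'."""
    labels = []
    for match in BOLD_SPAN.finditer(text):
        inner = match.group(1)
        # A notation is a span consisting purely of math mode text.
        kind = 'notation' if PURE_MATH.fullmatch(inner) else 'definition'
        labels.append((inner, kind))
    return labels


sample = (r'A **Hausdorff space** or a **$T_2$-space** is a topological space; '
          r'the group **$\operatorname{Gal}(L/K)$** is a notation.')
for inner, kind in classify_bold_spans(sample):
    print(f'{kind}: {inner}')
```

Here `$T_2$-space` is labeled a definition because the bold text is not purely math mode text, in line with the example above.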

The following is an example of a notation that can also be regarded as a definition. For the purposes of `trouver`, it is considered a notation. In some sense, most if not all notations can be regarded as definitions as well, so we classify such cases as notations.

```markdown
Given an ideal $I$ of a ring $R$, **$R/IR$** is the ring whose underlying set is the set $R$ modulo the equivalence relation $\sim$ where $a \sim b$ if and only if $a-b \in I$ and whose addition and multiplication structures are defined by...
```

One downside to this convention is that a notation for the purposes of `trouver` might contain extraneous information. For example, in 

```markdown
Let $R$ be a ring and let $M$ be a module. The **dual of $M$** is defined as

**$$ M^\vee := \operatorname{Hom}_R(M,R).$$**
```

the entire displayed math mode text is considered a notation by `trouver`, even if only `M^\vee` is the actual notation.

In [None]:
#| export
import os
from os import PathLike
from pathlib import Path
import pandas as pd

from fastai.text.learner import TextLearner

from trouver.helper.definition_and_notation import (
    defs_and_notats_separations,
)
from trouver.helper.date_and_time import current_time_formatted_to_minutes
from trouver.helper.regex import latex_indices, replace_string_by_indices
from trouver.markdown.markdown.file import MarkdownFile, MarkdownLineEnum
from trouver.markdown.obsidian.personal.note_processing import process_standard_information_note
from trouver.markdown.obsidian.personal.machine_learning.database_update import append_to_database
from trouver.markdown.obsidian.vault import(
    # all_note_paths_by_name, note_path_by_name,
    VaultNote
)


In [None]:
from unittest import mock
import shutil
import tempfile

from fastcore.test import *
from pathvalidate import validate_filename 
from torch import Tensor

from trouver.helper.tests import _test_directory

## Get notation data

Given information notes with notations marked with double asterisks `**`, we extract the data of these double asterisks and organize them for machine learning.

Ultimately, we would like to have an ML model that can find the locations where notations are newly introduced in a note. The approach here is to train a categorization model which takes as input a text with a single double asterisk pair surrounding a LaTeX math mode string and outputs whether the LaTeX math mode string contains a notation. We then use the categorization model to test the LaTeX math mode strings one by one and thereby find all those containing notations.
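
As a rough sketch of this approach, each LaTeX math mode string in a text gives rise to one candidate input for the model. The name `one_pair_candidates` and the regular expression below are assumptions made for illustration; the regular expression is a simplified stand-in for `latex_indices`.

```python
import re

# Simplified, hypothetical stand-in for `latex_indices`:
# tries `$$...$$` before `$...$` and ignores escaped dollar signs.
MATH_SPAN = re.compile(r'\$\$.+?\$\$|\$[^$]+\$', flags=re.DOTALL)


def one_pair_candidates(text: str) -> list[tuple[int, int, str]]:
    """Return one `(start, end, marked_text)` triple per math mode string.

    `marked_text` is `text` with double asterisks around `text[start:end]` only.
    """
    candidates = []
    for match in MATH_SPAN.finditer(text):
        start, end = match.span()
        marked = f'{text[:start]}**{text[start:end]}**{text[end:]}'
        candidates.append((start, end, marked))
    return candidates


text = r'Let $n \geq 1$ be an integer and write $\mathbb{Z}/n\mathbb{Z}$ for the ring.'
for start, end, marked in one_pair_candidates(text):
    print(marked)
```

Each printed text would then be passed to the categorization model, which decides whether the single marked math mode string introduces a notation.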

In [None]:
#| export
def add_one_double_asts_to_line(
        line: str, # The text to which to add the double asterisks `**`
        start: int, # The first double asterisks are added in between `line[start-1]` and `line[start]`.
        end: int # The second double asterisks are added in between `line[end-1]` and `line[end]`.
        ) -> str: # The str obtained from `line` by surrounding the substring `line[start:end]` with double asterisks.
    # TODO: implement a function which adds multiple double asts.
    """
    Return `line` with the substring `line[start:end]` surrounded by
    double asterisks `**`.
    
    Used in `_definition_data_from_line`
    """
    return f'{line[:start]}**{line[start:end]}**{line[end:]}'

In [None]:
test_eq(add_one_double_asts_to_line("I will add just one double ast pair.", 2,6), 'I **will** add just one double ast pair.')
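
A variant that marks several substrings at once could look like the following. This is only a sketch: `add_double_asts_to_spans` is a hypothetical name, and the spans are assumed to be non-overlapping. Processing the spans from right to left keeps the remaining indices valid, since inserting asterisks only shifts the text to the right of the insertion point.

```python
def add_double_asts_to_spans(
        line: str,
        spans: list[tuple[int, int]],  # Non-overlapping `(start, end)` pairs
        ) -> str:
    """Surround each `line[start:end]` with double asterisks."""
    # Work from the rightmost span so that earlier indices stay valid.
    for start, end in sorted(spans, reverse=True):
        line = f'{line[:start]}**{line[start:end]}**{line[end:]}'
    return line


print(add_double_asts_to_spans('a $x$ b $y$', [(2, 5), (8, 11)]))
# a **$x$** b **$y$**
```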

In [None]:
#| export
def notation_data_from_text(
        with_double_asts: str # May or may not have double asterisks to signify definitions and notations
        ) -> tuple[str, list[tuple[int, int, bool]]]:
    """Extracts data on the locations of notations in a text with
    double asterisks.
    
    Used in `notation_data_from_note`

    **Returns**

    - tuple[str, list[tuple[int, int, bool]]]
        - The str is the str `no_double_asts`, which is the same as
        `with_double_asts`, except with the double asterisks removed.
        - Each tuple in the list represents a data point for a LaTeX
          math-mode string in `no_double_asts` and consists of

            1. The indices `start, end` where the data point considers
               whether or not the LaTeX math-mode substring
               `no_double_asts[start:end]` is surrounded by
               double-asterisks (and hence is supposed to introduce a notation).

            2. A bool which is `True` if the data point represents a
               str with double-asterisks surrounding a notation and `False`
               otherwise.
    """
    defs_and_notats = defs_and_notats_separations(with_double_asts)
    only_indices = [(start, end) for start, end, _ in defs_and_notats]
    replace_with = [with_double_asts[start+2:end-2]
                    for start, end in only_indices]
    no_double_asts = replace_string_by_indices(
        with_double_asts, only_indices, replace_with)

    bold_indices_in_no_double_asts = [
        (start - 4*i, end - 4*i - 4, is_notat)
        for i, (start, end, is_notat) in enumerate(defs_and_notats)]

    notation_indices = [(start, end, True) for start, end, is_notat 
                        in bold_indices_in_no_double_asts if is_notat]

    notat_indices_in_no_double_asts = [
        (start, end) for start, end, _ in bold_indices_in_no_double_asts]
    all_latex_indices = latex_indices(no_double_asts)
    non_notat_indices = [tuppy for tuppy in all_latex_indices
                         if tuppy not in notat_indices_in_no_double_asts]
    non_notat_indices = [(start, end, False)
                         for start, end in non_notat_indices]
    
    return no_double_asts, notation_indices + non_notat_indices


In [None]:
sample_output = notation_data_from_text(
    r'**here is a double ast text**. It is not a LaTeX math mode string,'
    r'so it will not be included as a data point.'
    r'On the other hand, **$\operatorname{Gal}(L/K)$** and $\mathbb{Z}/2\mathbb{Z}$'
    r'are both included LaTeX math mode strings and are included as data points.'
    r'The bool for the former is `True`, whereas the bool for the latter is `False`.')

assert '**' not in sample_output[0]
start, end, is_notation = sample_output[1][0]
test_eq(sample_output[0][start:end], r'$\operatorname{Gal}(L/K)$')
start, end, is_notation = sample_output[1][1]
test_eq(sample_output[0][start:end], r'$\mathbb{Z}/2\mathbb{Z}$')
print(sample_output)

('here is a double ast text. It is not a LaTeX math mode string,so it will not be included as a data point.On the other hand, $\\operatorname{Gal}(L/K)$ and $\\mathbb{Z}/2\\mathbb{Z}$are both included LaTeX math mode strings and are included as data points.The bool for the former is `True`, whereas the bool for the latter is `False`.', [(124, 149, True), (154, 178, False)])
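
The index arithmetic in `notation_data_from_text` rests on the observation that removing one pair of double asterisks shortens the text by four characters. The `i`-th marked span (counting from `0`) therefore starts `4*i` characters earlier in the stripped text and ends `4*i + 4` characters earlier. The following sketch checks this on a small example; `strip_double_asts_with_indices` is a hypothetical name, and a plain regular expression stands in for `defs_and_notats_separations`.

```python
import re

# Hypothetical stand-in for `defs_and_notats_separations`.
BOLD_SPAN = re.compile(r'\*\*(.+?)\*\*', flags=re.DOTALL)


def strip_double_asts_with_indices(
        text: str) -> tuple[str, list[tuple[int, int]]]:
    """Remove double asterisks and return the shifted span indices."""
    spans = [match.span() for match in BOLD_SPAN.finditer(text)]
    stripped = BOLD_SPAN.sub(r'\1', text)
    # Each earlier span removed 4 characters; the current span loses
    # its own 4 asterisks between its start and its end.
    shifted = [(start - 4 * i, end - 4 * i - 4)
               for i, (start, end) in enumerate(spans)]
    return stripped, shifted


stripped, shifted = strip_double_asts_with_indices(
    'The **Galois group** is **$G$**.')
print(stripped)
print([stripped[start:end] for start, end in shifted])
```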


In [None]:
#| export
def _notation_data_with_indices(
        note: VaultNote, vault: PathLike) -> tuple[
            str, list[tuple[int, int, bool]]]:
    r"""Obtain notation data from a note including the indices.
    
    Used in `notation_data_from_note`

    **Parameters**
    - note - VaultNote
    - vault - PathLike

    **Returns**
    - tuple[str, list[tuple[int, int, bool]]]
        - The str is the str of the processed MarkdownFile except
        without double asterisks.
        - Each tuple in the list consists of

            1. The indices `start, end` where the data point considers
            whether or not the substring `no_double_asts[start:end]`
            contains a notation.
            2. A bool that is `True` if the LaTeX text contains
            notation.
    """
    # TODO: test
    mf = process_standard_information_note(
        MarkdownFile.from_vault_note(note), vault,
        remove_double_asterisks = False)
    with_double_asts = str(mf)
    no_double_asts, data = notation_data_from_text(with_double_asts)
    return no_double_asts, data

In [None]:
#| export

def notation_data_from_note(
        note: VaultNote, vault: PathLike
        ) -> list[tuple[str, str, bool]]:
    # TODO: Implement the option to include multiple-lines in the data.
    """Obtain notation data from a note.

    Note that the tuples might not be in any particular order.
    
    **Returns**

    - list[tuple[str, str, bool]]
        - Each tuple consists of 
            1. The name of `note`,
            2. The processed str of `note` with only a single double
            asterisk surrounded LaTeX text. Note that the processed str
            merges display math mode text into single lines, cf.
            `process_standard_information_note`.
            3. A bool that is `True` if the LaTeX text contains
            notation.
    """
    # TODO: treat '`$`` separately.
    no_double_asts, data = _notation_data_with_indices(note, vault)
    return [
        (note.name,
         add_one_double_asts_to_line(no_double_asts, start, end),
         is_notat) for start, end, is_notat in data]


We first set up an example:

In [None]:
test_vault = _test_directory() / 'test_vault_6'
vn = VaultNote(test_vault, name='reference_with_tag_labels_Definition 2')
print(vn.text())

---
cssclass: clean-embeds
aliases: []
tags: [_meta/literature_note, _meta/definition, _meta/notation]
---
# Ring of integers modulo $n$[^1]

Let $n \geq 1$ be an integer. The **ring of integers modulo $n$**, denoted by **$\mathbb{Z}/n\mathbb{Z}$**, is, informally, the ring whose elements are represented by the integers with the understanding that $0$ and $n$ are equal.

More precisely, $\mathbb{Z}/n\mathbb{Z}$ has the elements $0,1,\ldots,n-1$.

...


# See Also
- [[reference_with_tag_labels_Exercise 1|reference_with_tag_labels_Z_nZ_is_a_ring]]
# Meta
## References

## Citations and Footnotes
[^1]: Kim, Definition 2


In [None]:
sample_output = notation_data_from_note(vn, test_vault)
total_count_for_is_notation = 0
for name, with_one_double_asts, is_notation in sample_output:
    test_eq(name, vn.name)
    test_eq(with_one_double_asts.count('**'), 2)
    if is_notation:
        total_count_for_is_notation += 1
test_eq(total_count_for_is_notation, 1)
sample_output



[('reference_with_tag_labels_Definition 2',
  'Let $n \\geq 1$ be an integer. The ring of integers modulo $n$, denoted by **$\\mathbb{Z}/n\\mathbb{Z}$**, is, informally, the ring whose elements are represented by the integers with the understanding that $0$ and $n$ are equal.\n\nMore precisely, $\\mathbb{Z}/n\\mathbb{Z}$ has the elements $0,1,\\ldots,n-1$.\n\n...\n',
  True),
 ('reference_with_tag_labels_Definition 2',
  'Let **$n \\geq 1$** be an integer. The ring of integers modulo $n$, denoted by $\\mathbb{Z}/n\\mathbb{Z}$, is, informally, the ring whose elements are represented by the integers with the understanding that $0$ and $n$ are equal.\n\nMore precisely, $\\mathbb{Z}/n\\mathbb{Z}$ has the elements $0,1,\\ldots,n-1$.\n\n...\n',
  False),
 ('reference_with_tag_labels_Definition 2',
  'Let $n \\geq 1$ be an integer. The ring of integers modulo **$n$**, denoted by $\\mathbb{Z}/n\\mathbb{Z}$, is, informally, the ring whose elements are represented by the integers with the unders

## Make database of notation data

In [None]:
#| export
def append_notation_data_to_database(
        vault: PathLike, # The vault from which the data is drawn
        file: PathLike,  # The path to a CSV file
        notes: list[VaultNote], # The notes to add to the database
        backup: bool = True # If `True`, makes a copy of `file` in the same directory and with the same name, except with an added extension of `.bak`.
        ) -> None:
    """Append the notation data of `notes` to the CSV database at `file`."""
    to_turn_into_a_df = []
    current_time = current_time_formatted_to_minutes()
    for note in notes:
        notation_data_for_note = notation_data_from_note(note,  vault)
        for _, with_one_double_asts, is_notat in notation_data_for_note:
            row_dict = {
                'Time added': current_time,
                'Time modified': current_time,
                'Note name': note.name,
                'LaTeX in text': with_one_double_asts,
                'Is notation': is_notat
            }
            to_turn_into_a_df.append(row_dict)
    df = pd.DataFrame(to_turn_into_a_df)
    append_to_database(
        file, df,
        cols=['Time added', 'Time modified', 'Note name',
              'LaTeX in text', 'Is notation'],
        pivot_column='LaTeX in text',
        columns_to_update=['Time modified', 'Note name', 'Is notation'],
        backup=backup)
    
    

In [None]:
# TODO: example
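
Pending a proper example, the following sketch illustrates one plausible reading of the `pivot_column` and `columns_to_update` arguments passed above. This reading is an assumption and is not taken from the source of `append_to_database`: a new row whose pivot value already exists only overwrites the listed columns (so `'Time added'` is preserved), and any other new row is appended. `upsert_rows` is a hypothetical name, and rows are modeled as plain dicts rather than a `DataFrame`.

```python
def upsert_rows(
        existing: list[dict],
        new: list[dict],
        pivot_column: str,
        columns_to_update: list[str],
        ) -> list[dict]:
    """Merge `new` into `existing`, keyed on `pivot_column` (assumed behavior)."""
    by_key = {row[pivot_column]: dict(row) for row in existing}
    for row in new:
        key = row[pivot_column]
        if key in by_key:
            # Known pivot value: only refresh the designated columns.
            for column in columns_to_update:
                by_key[key][column] = row[column]
        else:
            # Unknown pivot value: add the whole row.
            by_key[key] = dict(row)
    return list(by_key.values())


existing = [{'Time added': 't0', 'Time modified': 't0', 'Note name': 'note_a',
             'LaTeX in text': 'Let **$n$** be an integer.', 'Is notation': False}]
new = [{'Time added': 't1', 'Time modified': 't1', 'Note name': 'note_a',
        'LaTeX in text': 'Let **$n$** be an integer.', 'Is notation': True},
       {'Time added': 't1', 'Time modified': 't1', 'Note name': 'note_b',
        'LaTeX in text': 'Write **$G$** for the group.', 'Is notation': True}]
merged = upsert_rows(existing, new, 'LaTeX in text',
                     ['Time modified', 'Note name', 'Is notation'])
for row in merged:
    print(row)
```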

## Use ML categorization model to find and mark notations in notes 

In [None]:
#| export
def automatically_mark_notations(
        vn: VaultNote, # The information note to which to mark notations.
        learn: TextLearner, # The ML model which predicts where notation notes should occur.  This is a classifier which takes as input a str with double asterisks surrounding LaTeX text. The model outputs whether or not the single double asterisk pair surrounds a LaTeX text with notation.
        create_notation_notes: bool = False, # If `True`, creates the notations notes for the predicted notations and links them to the 'See Also' sections of the information notes.
        reference_name: str = '' # The name of the reference that `vn` belongs to; this is only relevant when `create_notation_notes=True` so that the created notation notes have file names starting with the reference name.
        ) -> None:
    # TODO: before running this, make sure to warn or check that this
    # will change contents of files drastically.
    # TODO: implement `overwrite` parameter
    """Predict and mark where notations occur in a note, and optionally
    create a notation note, and add the notation note to the `See Also`
    section of the note.

    Assumes that no double asterisks are already in the contents of `vn`.

    This function removes links, headings, footnotes, etc.
    from the original note and merges multi-line display math mode LaTeX
    text into single lines. Use with caution.
    """
    no_double_asts, index_data = _notation_data_with_indices(vn, vn.vault)
    notations_to_add = _get_notation_indices_to_add(
        no_double_asts, index_data, learn)
    with_double_asts = no_double_asts
    for start, end in reversed(notations_to_add):
        with_double_asts = add_one_double_asts_to_line(
            with_double_asts, start, end)

    original_mf = MarkdownFile.from_vault_note(vn)
    _, end_metadata = original_mf.metadata_lines()
    see_also_line = original_mf.get_line_number_of_heading('See Also')
    original_mf.remove_lines(end_metadata + 1, see_also_line)
    original_mf.insert_line(end_metadata + 1, {
        'type': MarkdownLineEnum.HEADING, 'line': '# Topic[^1]'})
    original_mf.add_line_in_section('Topic[^1]', {
        'type': MarkdownLineEnum.DEFAULT, 'line': with_double_asts})
    original_mf.write(vn)


def _get_notation_indices_to_add(
        no_double_asts: str, index_data: list[tuple[int, int, bool]],
        learn: TextLearner
        ) -> list[tuple[int, int]]:
    """Used in `automatically_mark_notations`"""
    # Only spans not already marked as notations need a prediction.
    candidates = [(start, end) for start, end, is_notat in index_data
                  if not is_notat]
    to_test = [add_one_double_asts_to_line(no_double_asts, start, end)
               for start, end in candidates]
    with learn.no_bar(), learn.no_logging():
        predictions = [learn.predict(one_double_ast)
                       for one_double_ast in to_test]
    # Pair each prediction with the candidate span it was made for.
    notations_to_add = [
        span for span, prediction in zip(candidates, predictions)
        if prediction[0] == 'True']
    # Keep the spans that were already marked as notations.
    notations_to_add.extend([
        (start, end) for (start, end, is_notat) in index_data if is_notat])
    notations_to_add.sort()
    return notations_to_add


In [None]:
# TODO: Test 
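
The prediction step can be illustrated without a trained model. In the sketch below, `mark_predicted_notations` and `toy_predict` are hypothetical names; `toy_predict` is a hard-coded rule standing in for the first entry of `learn.predict(...)`, and the candidate spans are found with a simplified regular expression rather than `latex_indices`.

```python
import re
from typing import Callable


def mark_predicted_notations(
        text: str,
        candidate_spans: list[tuple[int, int]],
        predict: Callable[[str], str],  # Hypothetical stand-in for `learn.predict(...)[0]`
        ) -> str:
    """Mark every candidate span that `predict` labels as `'True'`."""
    def mark(s: str, start: int, end: int) -> str:
        return f'{s[:start]}**{s[start:end]}**{s[end:]}'

    # Each candidate is judged in isolation, with only its own span marked.
    accepted = [(start, end) for start, end in candidate_spans
                if predict(mark(text, start, end)) == 'True']
    # Mark from right to left so that earlier indices stay valid.
    for start, end in sorted(accepted, reverse=True):
        text = mark(text, start, end)
    return text


def toy_predict(marked: str) -> str:
    """Toy rule for illustration: accept marked `\\operatorname` expressions."""
    return 'True' if r'**$\operatorname{' in marked else 'False'


text = (r'Let $L/K$ be a Galois extension. '
        r'Its Galois group $\operatorname{Gal}(L/K)$ acts on $L$.')
spans = [match.span() for match in re.finditer(r'\$[^$]+\$', text)]
result = mark_predicted_notations(text, spans, toy_predict)
print(result)
```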

In [None]:
with tempfile.TemporaryDirectory(prefix='tmp_dir_', dir=os.getcwd()) as tmp_dir:
    tmp_dir = Path(tmp_dir)
    temp_vault = tmp_dir / 'test_vault_6'
    shutil.copytree('_tests/test_vault_6', temp_vault)

    note = VaultNote(temp_vault, name='number_theory_reference_1_Definition 15')

    with mock.patch('__main__.TextLearner') as mock_textlearner_class:
        mock_textlearner = mock_textlearner_class.return_value
        mock_textlearner.predict.side_effect = [
            ('False', Tensor([0]), Tensor([1, 0])),
            ('True', Tensor([0]), Tensor([0, 1])),
            ('False', Tensor([0]), Tensor([1, 0])),
            ('False', Tensor([0]), Tensor([1, 0])),
            ]
        automatically_mark_notations(note, mock_textlearner)
        print('The following is the note after the double asterisks are added, '
              'assuming that the ML model predictions are as above:')
        print(note.text())
        assert r'**$\operatorname{Gal}(L/K)$**' in note.text()



The following is the note after the double asterisks are added, assuming that the ML model predictions are as above:
---
cssclass: clean-embeds
aliases: []
tags: [_meta/literature_note, _meta/definition, _meta/notation]
---
# Topic[^1]
%%This is an example file to which  `automatcally_mark_notations` will be applied.%%

Let $L/K$ be a Galois field extension. Its Galois group **$\operatorname{Gal}(L/K)$** is defined as the group of automorphisms of $L$ fixing $K$ pointwise.

# See Also

# Meta
## References

## Citations and Footnotes
[^1]: Kim, 


In [None]:
# TODO: test 'w' after implementing `overwrite.`

In [None]:
# TODO: test 'a' after implementing `overwrite.`

In [None]:
# TODO: test `None` after implementing `overwrite.`