# Text datasets

First, we are going to pull in a real-world text data source from Hugging Face.

In [1]:
from nova_py.transcribe import MINT
from nova_py import TACO
import itertools
import tensorflow as tf
import pandas as pd
from datasets import load_dataset_builder, load_dataset
from itertools import chain
import json



In [2]:
mint = MINT.load()

In [3]:
mint.save()

### Annotations for Invalid Sequences

The easiest manual training for the model (in other words, without implementing something reinforcement-based) is to teach it what an invalid sequence looks like. We can do this by giving it large blocks of free text from books and similar sources. The largest corpus of text that I know of is The Stack v2 on Hugging Face; for this step we stream the `sedthh/gutenberg_english` dataset instead, since it gives us human-written text that would be nonsensical in a programming context. We split that text by sentence and then tell the model each sentence is invalid. The one catch is that there are a lot of variations of invalid sequences, so we may end up overfitting to a few of them with this technique.
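
As a rough illustration of the idea (split prose into sentences, then mark every token as invalid), here is a minimal pure-Python sketch. The helpers `split_sentences` and `label_invalid` are hypothetical stand-ins and are not part of `nova_py`; the whitespace tokeniser is only an assumption about how tokens might look, and this sketch skips the pretagging step the notebook applies.

```python
import re

BAD_SEQ_TAG = "BAD SEQUENCE ERROR"  # tag name as it appears in mint.tags


def split_sentences(text):
    """Naive splitter: break on '.', '!' or '?' followed by whitespace or end of text."""
    parts = re.split(r"[.!?](?:\s+|$)", text)
    return [p.strip() for p in parts if p.strip()]


def label_invalid(sentence):
    """Pair each lower-cased whitespace token with the bad-sequence tag."""
    return [[tok, BAD_SEQ_TAG] for tok in sentence.lower().split()]


prose = "Congress shall make no law. The war is inevitable!"
examples = [label_invalid(s) for s in split_sentences(prose)]

for example in examples:
    print(example)
```

Each inner list is one sentence's worth of `[token, tag]` pairs, which is the same pairing the ground-truth tensor uses further down.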

In [4]:
dataset_name = "sedthh/gutenberg_english"

dataset = load_dataset(dataset_name, 'default', split='train', streaming=True)


Next we load the text data. We're going to use the first 5 documents for this.

In [5]:
just_text = [d['TEXT'].split('.') for d in dataset.take(5)]

flat_text = list(itertools.chain.from_iterable(just_text))

just_text

[['The United States Bill of Rights',
  '\r\nThe Ten Original Amendments to the Constitution of the United States\r\nPassed by Congress September 25, 1789\r\nRatified December 15, 1791\r\n\r\n\r\nI\r\nCongress shall make no law respecting an establishment of religion, or\r\n\r\nprohibiting the free exercise thereof; or abridging the freedom of speech, or of\r\n\r\nthe press, or the right of the people peaceably to assemble, and to petition the\r\n\r\nGovernment for a redress of grievances',
  '\r\nII\r\nA well-regulated militia, being necessary to the security of a free State,\r\n\r\nthe right of the people to keep and bear arms, shall not be\r\n\r\ninfringed',
  '\r\nIII\r\nNo soldier shall, in time of peace be quartered in any house, without the\r\n\r\nconsent of the owner, nor in time of war, but in a manner to be prescribed by\r\n\r\nlaw',
  '\r\nIV\r\nThe right of the people to be secure in their persons, houses, papers, and\r\n\r\neffects, against unreasonable searches and seizur

In [6]:
len(flat_text)

436
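
Splitting on `'.'` leaves the `\r\n` line breaks inside each fragment (visible in the output above) and can produce empty or whitespace-only fragments when a text ends with a period. If that turns out to matter for tokenisation, a small cleaning pass could look like the sketch below. `clean_fragments` is a hypothetical helper, not something the notebook or `nova_py` provides, and whether `TACO.inBatch` already handles this internally is not known here.

```python
import itertools


def clean_fragments(documents):
    """Split each document on '.', collapse whitespace runs, drop empty fragments."""
    fragments = itertools.chain.from_iterable(doc.split(".") for doc in documents)
    cleaned = (" ".join(frag.split()) for frag in fragments)
    return [frag for frag in cleaned if frag]


# Toy document shaped like the Gutenberg text above.
docs = ["The United States Bill of Rights.\r\nII\r\nA well-regulated militia.\r\n\r\n"]
sentences = clean_fragments(docs)
print(sentences)
```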

In [35]:
tokens = TACO.inBatch(flat_text)

tokens

<tf.Tensor: shape=(296, 482), dtype=string, numpy=
array([[b'the', b'united', b'states', ..., b'<pad>', b'<pad>', b'<pad>'],
       [b'the', b'ten', b'original', ..., b'<pad>', b'<pad>', b'<pad>'],
       [b'ii', b'a', b'well-regulated', ..., b'<pad>', b'<pad>',
        b'<pad>'],
       ...,
       [b'it', b'is', b'in', ..., b'<pad>', b'<pad>', b'<pad>'],
       [b'gentlemen', b'may', b'cry', ..., b'<pad>', b'<pad>', b'<pad>'],
       [b'the', b'war', b'is', ..., b'<pad>', b'<pad>', b'<pad>']],
      dtype=object)>

In [8]:
mint_pass = mint(tokens)

In [31]:
# The bad-sequence tag is the last of the 11 entries in mint.tags (index 10).
bad_seq_tag = list(mint.tags.keys())[10]

bad_seq_tag

'BAD SEQUENCE ERROR'

In [40]:
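# Pretag each token of the flattened batch.
flat_pretagged_batch = tf.map_fn(mint.pretag, tf.reshape(tokens, [-1]))

# Replace every empty pretag with the bad-sequence tag.
training_tags = tf.where(flat_pretagged_batch == '', bad_seq_tag, flat_pretagged_batch)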

In [47]:
# Pair each token with its tag, giving shape (batch, seq_len, 2).
ground_truths = tf.reshape(tf.transpose(tf.stack([tf.reshape(tokens, [-1]), training_tags])), shape=(tokens.shape[0], tokens.shape[1], 2))

ground_truths

<tf.Tensor: shape=(296, 482, 2), dtype=string, numpy=
array([[[b'the', b'~pad~'],
        [b'united', b'BAD SEQUENCE ERROR'],
        [b'states', b'BAD SEQUENCE ERROR'],
        ...,
        [b'<pad>', b'~pad~'],
        [b'<pad>', b'~pad~'],
        [b'<pad>', b'~pad~']],

       [[b'the', b'~pad~'],
        [b'ten', b'BAD SEQUENCE ERROR'],
        [b'original', b'BAD SEQUENCE ERROR'],
        ...,
        [b'<pad>', b'~pad~'],
        [b'<pad>', b'~pad~'],
        [b'<pad>', b'~pad~']],

       [[b'ii', b'BAD SEQUENCE ERROR'],
        [b'a', b'~pad~'],
        [b'well-regulated', b'BAD SEQUENCE ERROR'],
        ...,
        [b'<pad>', b'~pad~'],
        [b'<pad>', b'~pad~'],
        [b'<pad>', b'~pad~']],

       ...,

       [[b'it', b'BAD SEQUENCE ERROR'],
        [b'is', b'~relation~'],
        [b'in', b'BAD SEQUENCE ERROR'],
        ...,
        [b'<pad>', b'~pad~'],
        [b'<pad>', b'~pad~'],
        [b'<pad>', b'~pad~']],

       [[b'gentlemen', b'BAD SEQUENCE ERROR'],
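
The reshape/transpose/stack chain in cell `In [47]` pairs each token with its tag along a new last axis. The same pairing can be sketched on toy data in NumPy with `np.where` and `np.stack(..., axis=-1)`. This is an illustration rather than the notebook's code: the pretags below are made up, and NumPy is swapped in for TensorFlow only to keep the example small (`tf.stack` accepts the same `axis=-1` argument).

```python
import numpy as np

BAD_SEQ_TAG = "BAD SEQUENCE ERROR"

# Toy batch of 2 padded sequences, 3 tokens each.
tokens = np.array([["the", "united", "states"],
                   ["it", "is", "<pad>"]])

# Made-up pretags: '' marks tokens that received no pretag.
pretags = np.array([["~pad~", "", ""],
                    ["", "~relation~", "~pad~"]])

# Fill every untagged position with the bad-sequence tag.
training_tags = np.where(pretags == "", BAD_SEQ_TAG, pretags)

# Pair token and tag along a new last axis: shape (batch, seq_len, 2).
ground_truths = np.stack([tokens, training_tags], axis=-1)

print(ground_truths.shape)
print(ground_truths[0])
```

Stacking on the last axis avoids flattening and reshaping by hand, which makes it harder to mix up the batch and sequence dimensions.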
     

In [48]:
mint.train(ground_truths)

In [49]:
mint.TransitionMatrix

<tf.Tensor: shape=(1464, 11), dtype=float64, numpy=
array([[3.43696508e-02, 1.81171054e-01, 7.30040149e-04, ...,
        7.88762342e-05, 7.88762342e-05, 7.79171473e-01],
       [8.56656668e-02, 1.04512113e-01, 8.56656668e-02, ...,
        8.56656668e-02, 8.56656668e-02, 1.24496886e-01],
       [4.17626825e-02, 4.49628983e-02, 2.36480664e-02, ...,
        2.36480664e-02, 2.36480664e-02, 7.16805028e-01],
       ...,
       [9.09090936e-02, 9.09090936e-02, 9.09090936e-02, ...,
        9.09090936e-02, 9.09090936e-02, 9.09090936e-02],
       [9.09090936e-02, 9.09090936e-02, 9.09090936e-02, ...,
        9.09090936e-02, 9.09090936e-02, 9.09090936e-02],
       [5.00483768e-02, 9.70469207e-02, 3.94949248e-45, ...,
        3.94949248e-45, 3.94949248e-45, 8.52904702e-01]])>

In [51]:
i = mint.TransitionStates['']

ps = mint.TransitionMatrix[i,:]

In [52]:
ps

<tf.Tensor: shape=(11,), dtype=float64, numpy=
array([3.43696508e-02, 1.81171054e-01, 7.30040149e-04, 7.88762342e-05,
       7.88762342e-05, 7.88762342e-05, 4.08452430e-03, 7.88762342e-05,
       7.88762342e-05, 7.88762342e-05, 7.79171473e-01])>
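
One way to read this row, assuming each row of `mint.TransitionMatrix` is a probability distribution over the 11 tags (the 11 columns match the 11 entries of `mint.tags`, and the row sums to roughly 1), is to rank the tags by probability. The sketch below copies the printed values by hand; the tag order is taken from the `mint.tags` output shown later in the notebook and is assumed not to have changed between runs.

```python
# Tag order copied from the mint.tags output (assumed unchanged between runs).
tags = ["~relation~", "~pad~", "~var~", "~value~", "~func~", "~break~",
        "~container~", "~definition~", "~brelation~", "~connector~",
        "BAD SEQUENCE ERROR"]

# The row `ps` printed above.
ps = [3.43696508e-02, 1.81171054e-01, 7.30040149e-04, 7.88762342e-05,
      7.88762342e-05, 7.88762342e-05, 4.08452430e-03, 7.88762342e-05,
      7.88762342e-05, 7.88762342e-05, 7.79171473e-01]

row_sum = sum(ps)

# Index of the largest probability, then the tag at that index.
best_index = max(range(len(ps)), key=ps.__getitem__)
best_tag = tags[best_index]

# All tags sorted from most to least likely.
ranked = sorted(zip(tags, ps), key=lambda pair: pair[1], reverse=True)

print(round(row_sum, 6), best_tag)
for tag, p in ranked[:3]:
    print(f"{tag}: {p:.3f}")
```

Under that reading, about 78% of the mass for this state sits on `BAD SEQUENCE ERROR` and about 18% on `~pad~`, which is consistent with the prose-heavy training data.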

In [138]:
example_text = ["Hello = \"1\""]

new_tokens = TACO.inBatch(example_text)

new_tokens

<tf.Tensor: shape=(1, 5), dtype=string, numpy=array([[b'hello', b'=', b'"', b'1', b'"']], dtype=object)>

In [175]:
mint(new_tokens)

<tf.Tensor: shape=(1, 5, 2), dtype=string, numpy=
array([[[b'hello', b'BAD SEQUENCE ERROR'],
        [b'=', b'~relation~'],
        [b'"', b'~container~'],
        [b'1', b'~connector~'],
        [b'"', b'~container~']]], dtype=object)>

As you can see, our training has worked (ish): `hello`, the kind of free-text word we trained on, is tagged `BAD SEQUENCE ERROR`, and most sequences that resemble the prose we labelled as invalid will produce the same tag.

In [176]:
mint.save()

### Training the model on basic sequences

To be continued...

In [4]:
mint.tags

{'~relation~': 0,
 '~pad~': 1,
 '~var~': 2,
 '~value~': 3,
 '~func~': 4,
 '~break~': 5,
 '~container~': 6,
 '~definition~': 7,
 '~brelation~': 8,
 '~connector~': 9,
 'BAD SEQUENCE ERROR': 10}

In [5]:
# example of a valid equation

ground_truths = tf.constant([[["variable", "~var~"], ["=", "~relation~"], ["\"", "~container~"], ["hello", "~value~"], ["\"", "~container~"]]])
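
Hand-written ground truths like the one above are easy to get wrong (a misspelled tag, or one tag too few). A small checker can catch that before training. The sketch below is plain Python; `make_ground_truth` is a hypothetical helper, not part of `nova_py`, and the tag set is copied from the `mint.tags` output above.

```python
# Tag names copied from the mint.tags output.
KNOWN_TAGS = {"~relation~", "~pad~", "~var~", "~value~", "~func~", "~break~",
              "~container~", "~definition~", "~brelation~", "~connector~",
              "BAD SEQUENCE ERROR"}


def make_ground_truth(tokens, tags, known_tags=KNOWN_TAGS):
    """Return [token, tag] pairs after checking lengths and tag names."""
    if len(tokens) != len(tags):
        raise ValueError(f"{len(tokens)} tokens but {len(tags)} tags")
    unknown = [tag for tag in tags if tag not in known_tags]
    if unknown:
        raise ValueError(f"unknown tags: {unknown}")
    return [[tok, tag] for tok, tag in zip(tokens, tags)]


example = make_ground_truth(
    ["variable", "=", '"', "hello", '"'],
    ["~var~", "~relation~", "~container~", "~value~", "~container~"],
)

# A batch with a single sequence; nested lists like this have shape (1, 5, 2).
batch = [example]
print(batch)
```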

In [6]:
mint.train(ground_truths, num_epochs=30)

In [7]:
test_string = ["variable = \"string\""]

In [8]:
tokens = TACO.inBatch(test_string)

tokens

<tf.Tensor: shape=(1, 5), dtype=string, numpy=array([[b'variable', b'=', b'"', b'string', b'"']], dtype=object)>

In [21]:
mint(tokens)

[b'variable' b'~var~']
[b'=' b'~relation~']
[b'"' b'~container~']
[b'string' b'~value~']
[b'"' b'~container~']


<tf.Tensor: shape=(5,), dtype=string, numpy=array([b'~var~', b'=', b'"', b'~value~', b'"'], dtype=object)>