# Embedding Trainer
---
Main script to train our various embedding models. It imports the 'Embedding Model' objects defined in other files, runs the training loop, and saves the results to various folders.

## Create our Dataloader and Tokenizer
---
We already defined these and trained the tokenizer in the other files, so the first step is to instantiate them.

### Import Dependencies

In [1]:
# To allow easy imports of the other packages in this project, add the project root (the parent of the current working directory) to sys.path
import sys
import os
project_root = os.path.dirname(os.getcwd())
sys.path.append(project_root)
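
The cell above appends a new entry to `sys.path` every time it is re-run, and it assumes the notebook is started from a directory directly under the project root. A minimal sketch of the same idea that skips duplicates is shown below; `add_project_root` is a hypothetical helper name, not something defined elsewhere in this project.

```python
import sys
from pathlib import Path


def add_project_root(start=None):
    """Append the parent of `start` (default: cwd) to sys.path if it is not already there.

    Hypothetical helper: mirrors os.path.dirname(os.getcwd()) from the cell above.
    """
    root = str(Path(start or Path.cwd()).resolve().parent)
    if root not in sys.path:
        sys.path.append(root)
    return root


project_root = add_project_root()
```

Re-running this cell leaves `sys.path` unchanged after the first call, which keeps the import search order stable during a long notebook session.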

In [2]:
from dataloader.dataloader import MyDataLoader
from tokenizer.tokenizer import MyTokenizer

### Instantiate our Dataloader

In [3]:
dl = MyDataLoader(promptuser=False)    # By setting promptuser=False, we just use the 'enwiki_articles_20240320_mini' dataset (50MB)
dataloader = dl.get_dataloader()

for batch in dataloader:
    sample_data = batch[0][0]
    break

print(f"Number of chars: {len(sample_data)}")
print(f"{'='*60}")
print(sample_data[:500])

Number of chars: 62911
There are a vast number of absurd and mischievous fallacies, which pass readily in the world for sense and virtue, while in truth they tend only to fortify error and encourage crime. Mr. Bentham has enumerated the most conspicuous of these in the book before us.

Whether it be necessary there should be a middleman between the cultivator and the possessor, learned economists have doubted; but neither gods, men, nor booksellers can doubt the necessity of a middleman between Mr. Bentham and the pub


### Instantiate our Tokenizer

In [5]:
tokenizer = MyTokenizer()

# Ensure our tokenizer is running properly
chars_to_print = 200
print(tokenizer.encode_as_ids(sample_data[:chars_to_print]))
print(f"{'-'*60}")
print(tokenizer.encode_as_pieces(sample_data[:chars_to_print]))
print(f"{'-'*60}")
print(tokenizer.decode(tokenizer.encode_as_ids(sample_data[:chars_to_print])))

[689, 147, 5, 4474, 1122, 16, 7587, 21, 1716, 2509, 15815, 1226, 173, 164, 15941, 128, 564, 6311, 33, 6, 635, 65, 1771, 21, 3073, 15941, 1084, 33, 1710, 142, 3598, 454, 32, 2512, 1342, 6381, 21, 4281, 3426, 15944, 1416, 15944, 122, 2714, 90, 319, 49]
------------------------------------------------------------
['▁There', '▁are', '▁a', '▁vast', '▁number', '▁of', '▁absurd', '▁and', '▁mis', 'chie', 'vous', '▁fall', 'ac', 'ies', ',', '▁which', '▁pass', '▁readily', '▁in', '▁the', '▁world', '▁for', '▁sense', '▁and', '▁virtue', ',', '▁while', '▁in', '▁truth', '▁they', '▁tend', '▁only', '▁to', '▁fort', 'ify', '▁error', '▁and', '▁encourage', '▁crime', '.', '▁Mr', '.', '▁B', 'enth', 'am', '▁has', '▁e']
------------------------------------------------------------
There are a vast number of absurd and mischievous fallacies, which pass readily in the world for sense and virtue, while in truth they tend only to fortify error and encourage crime. Mr. Bentham has e


## Ensure Our Data Pipeline is Set Up