Training a Tokenizer from large texts read from memory

I have around 800M files, each of which contains a text field among several other fields. I would like to train a new tokenizer on the text field only. Because of the sheer number of files, I cannot extract the text and write it out to new files. Is there any way to train a new tokenizer on this data by reading just the text field from each file and passing it directly to the training process?
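
To make the idea concrete, here is a rough sketch of the kind of streaming I have in mind. It is only an illustration under several assumptions: that each file is a single JSON object, that the field is literally named "text", and that the files sit under one root directory. The names `iter_text_fields` and `batched` are hypothetical helpers, not part of any library. The commented-out lines at the end assume the Hugging Face `tokenizers` library, whose `train_from_iterator` method accepts an iterator of strings or of lists of strings; the vocabulary size and paths there are placeholders.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator, List


def iter_text_fields(root: str, field: str = "text",
                     pattern: str = "*.json") -> Iterator[str]:
    """Lazily yield the text field of every matching file under `root`.

    Only one file is held in memory at a time, and nothing is written to disk.
    Assumes each file is one JSON object (an assumption, adjust to the real format).
    """
    for path in Path(root).rglob(pattern):
        try:
            with path.open(encoding="utf-8") as fh:
                record = json.load(fh)
        except (OSError, ValueError):
            # Skip unreadable, badly encoded, or malformed files.
            continue
        text = record.get(field) if isinstance(record, dict) else None
        if isinstance(text, str) and text:
            yield text


def batched(texts: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Group a stream of strings into lists of at most `batch_size` items."""
    batch: List[str] = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# Possible usage, assuming the Hugging Face `tokenizers` package is installed:
#
# from tokenizers import Tokenizer, models, trainers
# tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# trainer = trainers.BpeTrainer(vocab_size=30000, special_tokens=["[UNK]"])
# tokenizer.train_from_iterator(batched(iter_text_fields("/path/to/files")),
#                               trainer=trainer)
```

The point of the generator is that the text is produced on demand, so the training process would pull it file by file instead of needing an extracted copy of the corpus.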

Thanks,
Ravi.