I have around 800M files, each containing a text field among other fields. I would like to train a new tokenizer on the text field only. Because of the huge volume of files, I cannot extract the text and write it to new files. Is there any way to train a new tokenizer on this data by reading only the text field from each file and streaming it to the training process?
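
One approach, sketched below with the `tokenizers` library: `Tokenizer.train_from_iterator` accepts any Python iterator, so a generator that lazily reads each file and yields only its text field can feed training without writing intermediate files. This is a minimal sketch, assuming each file is a JSON document with a `"text"` key; the directory layout, model choice, and trainer settings are illustrative, not prescriptive.

```python
import json
from pathlib import Path

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer


def iter_text_fields(data_dir):
    """Lazily yield the "text" field from each JSON file, one file at a time."""
    for path in Path(data_dir).rglob("*.json"):  # assumed layout: JSON files under data_dir
        with open(path, encoding="utf-8") as f:
            record = json.load(f)
        text = record.get("text")  # assumed key name
        if text:
            yield text


tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=32000, special_tokens=["[UNK]", "[PAD]"])

# train_from_iterator consumes the generator lazily, so only one record's
# text is held in memory at a time; no extracted corpus is ever written.
tokenizer.train_from_iterator(iter_text_fields("data/"), trainer=trainer)
tokenizer.save("tokenizer.json")
```

Since the generator is consumed lazily, memory stays bounded regardless of corpus size; the main cost is the I/O of opening 800M small files, which you may want to parallelize or batch upstream.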