Not able to correctly load .txt file #2840
Comments
I guess this is an interesting corner case, because in most cases mlpack/src/mlpack/core/data/load_csv.cpp Line 45 in 051b80b
We could pass another argument to the Also, I'm wondering if it might be useful to use the |
Thanks @zoq, after making the change in line 45 of mlpack/core/data/load_csv.cpp, I am getting the correct output.
https://github.com/mlpack/mlpack/blob/master/src/mlpack/core/data/tokenizers/split_by_any_of.hpp
An example of how to use the tokenizer: mlpack/src/mlpack/tests/string_encoding_test.cpp Lines 189 to 198 in 9e112d7
Unfortunately, I don't think |
Updated my comment 7 hrs ago. |
@ayushsingh11 you can simply read the file using a file stream object, store the lines in a vector, and then use the tokenizer to tokenise each line. Not sure if that is your end goal. Convert a file to a vector of tokens -->
Thank you @jeffin-ntx, this simple method worked.
Sorry, which simple method?
Using an fstream object to load the txt file.
I will add this code to the loadvocab function in #2822.
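For anyone landing here later, the fstream approach mentioned above can be sketched roughly as follows. This is only a minimal illustration using the C++ standard library, not the code from #2822; `LoadLines` is a hypothetical helper name, and the sketch assumes the file holds one token per line.

```cpp
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of mlpack): read a text file into a vector,
// one element per line, without interpreting commas or any other delimiter.
std::vector<std::string> LoadLines(const std::string& path)
{
  std::ifstream file(path);
  if (!file.is_open())
    throw std::runtime_error("Could not open file: " + path);

  std::vector<std::string> lines;
  std::string line;
  while (std::getline(file, line))
  {
    // Drop a trailing carriage return left by Windows-style line endings.
    if (!line.empty() && line.back() == '\r')
      line.pop_back();
    lines.push_back(line);
  }
  return lines;
}
```

Because the lines are never split on commas, tokens such as `2,`, `##,`, and `,` should come through unchanged, assuming the file has no blank lines that need to be skipped.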
Issue description
I want to load a txt file as a vector of strings. Specifically, I want to load the vocab.txt file from a BERT model, which will be used for tokenizing any given text. Each row of the file is a token. I am able to load all the tokens correctly except when a token contains a comma. In those cases, either the comma is removed or, if the token is like "2,", one element is deleted from the vector.
I have attached a zip folder that contains the simple code used for testing and the vocab.txt file whose outputs are discussed below.
What change should I make to the code below to ensure that the data from the txt file is loaded correctly?
Your environment
Steps to reproduce
One can run the following code:
Expected behavior
The size of vec is : 10
un
##ing
[UNK]
##ed
wa
runn
king
2,
##,
,
Actual behavior
The size of vec is : 9
un
##ing
[UNK]
##ed
wa
runn
king
2
(The last entry is a blank)
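A rough way to see why comma-bearing tokens are the ones that suffer: any reader that treats `,` as a field separator will consume the comma instead of keeping it. The sketch below uses only the standard library and is not mlpack's actual CSV loader, whose behaviour may differ in detail; `SplitOnComma` is a hypothetical helper written purely for illustration.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split one line on commas, the way a generic
// comma-delimited reader might. The delimiter itself is discarded.
std::vector<std::string> SplitOnComma(const std::string& line)
{
  std::vector<std::string> fields;
  std::istringstream stream(line);
  std::string field;
  while (std::getline(stream, field, ','))
    fields.push_back(field);
  return fields;
}
```

With this helper, `SplitOnComma("2,")` yields the single field `2`, and `SplitOnComma(",")` yields one empty field, which resembles the truncated `2` and the blank entry shown above.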
Vocab_tester.zip
vocab.txt