
Not able to correctly load .txt file #2840

Closed
ayushsingh11 opened this issue Feb 15, 2021 · 9 comments

Comments

@ayushsingh11
Contributor

ayushsingh11 commented Feb 15, 2021

Issue description

I want to load a .txt file as a vector of strings. Specifically, I want to load the vocab.txt file from a BERT model, which will be used for tokenizing any given text. Each row of the file is one token. I am able to load all the tokens correctly except when a token contains a comma: in those cases either the comma is removed, or, if the token is something like "2,", an element is dropped from the vector entirely.
I have attached a zip folder that contains the simple code used for testing and the vocab.txt file whose outputs are discussed below.
What should I change in the code below to ensure that the data from the .txt file is loaded correctly?

Your environment

  • version of mlpack: 3.4.2
  • operating system: ubuntu
  • compiler: g++
  • version of dependencies (Boost/Armadillo):
  • any other environment information you think is relevant:

Steps to reproduce

One can run the following code:

#include <mlpack/prereqs.hpp>
#include <mlpack/core.hpp>

using namespace std;

void loadVocab(const std::string& vocabFile) {
    arma::mat temp;
    mlpack::data::DatasetInfo info;
    mlpack::data::Load(vocabFile, temp, info);
    vector<string> vocabset;

    // Loading contents of vocab file from DatasetInfo object to vector<string>.
    for (size_t i = 0; i < info.NumMappings(0); ++i)
        vocabset.push_back(info.UnmapString(i, 0));

    cout << "The size of vec is : " << vocabset.size() << endl;
    for (size_t i = 0; i < vocabset.size(); ++i)
        cout << vocabset[i] << endl;
}

int main() {
    loadVocab("vocab.txt");
    return 0;
}

Expected behavior

The size of vec is : 10
un
##ing
[UNK]
##ed
wa
runn
king
2,
##,
,

Actual behavior

The size of vec is : 9
un
##ing
[UNK]
##ed
wa
runn
king
2

(The last entry is a blank)
Vocab_tester.zip
vocab.txt

@zoq
Member

zoq commented Feb 22, 2021

I guess this is an interesting corner case, because in most cases the comma is used as a delimiter. As a quick workaround you can remove the comma from the stringRule parameter here:

stringRule = quotedRule.copy() | qi::raw[*~qi::char_(" ,\r\n")];

We could pass another argument to the Load method to define the delimiter, but since we are planning to remove the Boost dependencies (#2646), including boost::spirit, the library we currently use for parsing, I'm not sure it makes sense to implement that support at this point. @shrit, correct me if I'm wrong.

Also, I'm wondering if it might be useful to use the tokenizer functionality instead, which already allows you to specify the delimiter; maybe @lozhnikov or @jeffin143 can provide some more insight in that direction.

@ayushsingh11
Contributor Author

ayushsingh11 commented Feb 23, 2021

Thanks @zoq, after making the change on line 45 of mlpack/core/data/load_csv.cpp, I am getting the correct output.
Looking forward to knowing whether I should add that argument to the Load method.
Also, which tokenizer functionality did you mean?

@jeffin-ntx

https://github.com/mlpack/mlpack/blob/master/src/mlpack/core/data/tokenizers/split_by_any_of.hpp

An example of using the tokenizer:

std::vector<boost::string_view> tokens;
boost::string_view line(stringEncodingInput[0]);
SplitByAnyOf tokenizer(" ,.");
boost::string_view token = tokenizer(line);
while (!token.empty())
{
    tokens.push_back(token);
    token = tokenizer(line);
}

@ayushsingh11
Contributor Author

Unfortunately, I don't think the tokenizer functionality would solve this issue; the problem is about correctly loading a .txt file. I will, however, use the tokenizer in the implementation of the BERT tokenizer.

@ayushsingh11
Contributor Author

Updated my comment 7 hrs ago.

@jeffin-ntx

@ayushsingh11 you can simply read the file using a file stream object, store the lines in a vector, and then use the tokenizer to tokenize each line.

Not sure if that is your end goal: converting a file to a vector of tokens.
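A minimal sketch of that approach using only the standard library: the SplitLine helper below is a hypothetical stand-in for mlpack's SplitByAnyOf (with std::string_view in place of boost::string_view), assuming the goal is exactly "read lines with a file stream, then split each line on a set of delimiter characters".

```cpp
#include <fstream>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for mlpack's SplitByAnyOf tokenizer: split `line`
// on any character found in `delims`, skipping empty pieces.
std::vector<std::string> SplitLine(std::string_view line, std::string_view delims)
{
  std::vector<std::string> tokens;
  size_t start = 0;
  while (start < line.size())
  {
    const size_t end = line.find_first_of(delims, start);
    const size_t stop = (end == std::string_view::npos) ? line.size() : end;
    if (stop > start)
      tokens.emplace_back(line.substr(start, stop - start));
    start = stop + 1;
  }
  return tokens;
}

// Read a file line by line and collect the tokens from every line.
std::vector<std::string> TokenizeFile(const std::string& path)
{
  std::ifstream f(path);
  std::string line;
  std::vector<std::string> tokens;
  while (std::getline(f, line))
    for (const std::string& t : SplitLine(line, " ,."))
      tokens.push_back(t);
  return tokens;
}
```

Note that because the file is read with getline first, commas inside a line only matter if you list the comma as a delimiter yourself.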

@ayushsingh11
Contributor Author

Thank you @jeffin-ntx, this simple method worked.
I'm not sure why, but I earlier thought we couldn't use fstream inside mlpack.
Closing this issue.

@jeffin-ntx

sorry which simple method ?

@ayushsingh11
Contributor Author

ayushsingh11 commented Feb 23, 2021

Using an fstream object to load the .txt file.
I meant using this code:

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

int main()
{
    // Read the vocab file line by line; each line is one token.
    std::fstream f;
    f.open("vocab.txt");

    std::string line;
    std::vector<std::string> vocabset;
    while (std::getline(f, line))
        vocabset.push_back(line);

    std::cout << "The size of vec is : " << vocabset.size() << endl;
    for (size_t i = 0; i < vocabset.size(); ++i)
        cout << vocabset[i] << endl;
    f.close();
    return 0;
}

Will add this code to the loadVocab function in #2822.
