
Not able to correctly load .txt file #2840

Closed
ayushsingh11 opened this issue Feb 15, 2021 · 9 comments

Comments

@ayushsingh11
Contributor

ayushsingh11 commented Feb 15, 2021

Issue description

I want to load a .txt file as a vector of strings. Specifically, I want to load the vocab.txt file from a BERT model, which will be used for tokenizing any given text. Each row of the file is one token. I am able to load all the tokens correctly except when a token contains a comma: in those cases either the comma is removed, or, if the token is something like "2,", an element is dropped from the vector entirely.
I have attached a zip folder that contains the simple code used for testing and the vocab.txt file whose outputs are discussed below.
What should I change in the code below to ensure that the data from the .txt file is loaded correctly?

Your environment

  • version of mlpack: 3.4.2
  • operating system: ubuntu
  • compiler: g++
  • version of dependencies (Boost/Armadillo):
  • any other environment information you think is relevant:

Steps to reproduce

One can run the following code:

#include <mlpack/prereqs.hpp>
#include <mlpack/core.hpp>

using namespace std;

void loadVocab(const std::string& vocabFile) {
    arma::mat temp;
    mlpack::data::DatasetInfo info;
    mlpack::data::Load(vocabFile, temp, info);
    vector<string> vocabset;

    // Loading contents of vocab file from DatasetInfo object to vector<string>.
    for (size_t i = 0; i < info.NumMappings(0); ++i)
        vocabset.push_back(info.UnmapString(i, 0));

    cout << "The size of vec is : " << vocabset.size() << endl;
    for (size_t i = 0; i < vocabset.size(); ++i)
        cout << vocabset[i] << endl;
}

int main() {
    loadVocab("vocab.txt");
    return 0;
}

Expected behavior

The size of vec is : 10
un
##ing
[UNK]
##ed
wa
runn
king
2,
##,
,

Actual behavior

The size of vec is : 9
un
##ing
[UNK]
##ed
wa
runn
king
2

(The last entry is a blank)
Vocab_tester.zip
vocab.txt

@zoq
Member

zoq commented Feb 22, 2021

I guess this is an interesting corner case, because in most cases the comma is used as a delimiter. As a quick workaround you can remove the comma from the stringRule parameter here:

stringRule = quotedRule.copy() | qi::raw[*~qi::char_(" ,\r\n")];

We could pass another argument to the Load method to define the delimiter, but since we are planning to remove the Boost dependencies (#2646), including boost::spirit, the library we currently use for parsing, I'm not sure it makes sense to implement that support at this point. @shrit, correct me if I'm wrong.

Also, I'm wondering if it might be useful to use the tokenizer functionality instead, which already allows you to specify the delimiter; maybe @lozhnikov or @jeffin143 can provide some more insight in that direction.

@ayushsingh11
Contributor Author

ayushsingh11 commented Feb 23, 2021

Thanks @zoq, after making the change on line 45 of mlpack/core/data/load_csv.cpp, I am getting the correct output.
Looking forward to knowing whether I should add that argument to the Load method.
Also, which tokenizer functionality did you mean?

@jeffin-ntx

https://github.com/mlpack/mlpack/blob/master/src/mlpack/core/data/tokenizers/split_by_any_of.hpp

An example of using the tokenizer:

std::vector<boost::string_view> tokens;
boost::string_view line(stringEncodingInput[0]);
SplitByAnyOf tokenizer(" ,.");
boost::string_view token = tokenizer(line);
while (!token.empty())
{
    tokens.push_back(token);
    token = tokenizer(line);
}

@ayushsingh11
Contributor Author

Unfortunately, I don't think the tokenizer functionality would solve this issue; the problem is about correctly loading a .txt file. I will, however, use the tokenizer in the implementation of the BERT tokenizer.

@ayushsingh11
Contributor Author

Updated my comment 7 hrs ago.

@jeffin-ntx

@ayushsingh11 you can simply read the file using a file stream object, store the lines in a vector, and then use the tokenizer to tokenize each line.

Not sure if that is your end goal: converting a file to a vector of tokens.
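A minimal sketch of that approach using only the standard library: the SplitLine helper below is a hypothetical stand-in for mlpack's SplitByAnyOf (with std::string_view in place of boost::string_view), assuming the goal is exactly "read lines with a file stream, then split each line on a set of delimiter characters".

```cpp
#include <fstream>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for mlpack's SplitByAnyOf tokenizer: split `line`
// on any character found in `delims`, skipping empty pieces.
std::vector<std::string> SplitLine(std::string_view line, std::string_view delims)
{
  std::vector<std::string> tokens;
  size_t start = 0;
  while (start < line.size())
  {
    const size_t end = line.find_first_of(delims, start);
    const size_t stop = (end == std::string_view::npos) ? line.size() : end;
    if (stop > start)
      tokens.emplace_back(line.substr(start, stop - start));
    start = stop + 1;
  }
  return tokens;
}

// Read a file line by line and collect the tokens from every line.
std::vector<std::string> TokenizeFile(const std::string& path)
{
  std::ifstream f(path);
  std::string line;
  std::vector<std::string> tokens;
  while (std::getline(f, line))
    for (const std::string& t : SplitLine(line, " ,."))
      tokens.push_back(t);
  return tokens;
}
```

Note that because the file is read with getline first, commas inside a line only matter if you list the comma as a delimiter yourself.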

@ayushsingh11
Contributor Author

Thank you @jeffin-ntx, this simple method worked.
I'm not sure why, but I earlier thought we couldn't use fstream inside mlpack.
Closing this issue.

@jeffin-ntx

sorry which simple method ?

@ayushsingh11
Contributor Author

ayushsingh11 commented Feb 23, 2021

Using an fstream object to load the .txt file.
I meant using this code:

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

int main()
{
    // Read the vocab file line by line; each line is one token.
    std::fstream f;
    f.open("vocab.txt");

    std::string line;
    std::vector<std::string> vocabset;
    while (std::getline(f, line))
        vocabset.push_back(line);

    std::cout << "The size of vec is : " << vocabset.size() << endl;
    for (size_t i = 0; i < vocabset.size(); ++i)
        cout << vocabset[i] << endl;
    f.close();
    return 0;
}

Will add this code to the loadVocab function in #2822.
