
Use lazy loading instead of one-off loading #28

Merged: 28 commits merged into master on May 10, 2019

3 participants
@chengfx (Contributor) commented May 8, 2019

Modify the loading method used when building the dictionary and encoding data, in order to avoid memory overflow.

Feixiang Cheng and others added some commits Apr 25, 2019

@ljshou ljshou added the enhancement label May 8, 2019

@ljshou (Member) left a comment

Thanks, Feixiang, for the request; this is a very important feature. Have we tested on large-scale data, such as knowledge distillation with something like 1000 million samples?

@chengfx (Contributor, Author) commented May 9, 2019

> Thanks, Feixiang, for the request; this is a very important feature. Have we tested on large-scale data, such as knowledge distillation with something like 1000 million samples?

Hi @ljshou. I have already tested the Query Binary Classification task with 35 million samples (a file of about 1.5 GB) on my local PC (32 GB memory). It runs normally, including building the dictionary, encoding the raw data, and training.
In the dictionary-building phase, there will be no memory overflow as long as the word universe fits in memory. I tested this with 200 million samples (a file of about 8.4 GB); the process needs only about 10 GB of memory (mostly used by the word universe). There is a function parameter chunk_size which controls the number of lines loaded at a time; I think it could be exposed as a config parameter later.
In the encoding phase, we assume that all encoded data fits in memory; otherwise, an out-of-memory exception will be raised.

For 1000 million samples, I think the current version could build the dictionary, but would fail when encoding the raw data. 😄
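The chunked dictionary-building approach described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation; `chunk_size` is the only name taken from the discussion, and the function names are hypothetical:

```python
from collections import Counter

def iter_chunks(file_path, chunk_size=1000000):
    """Yield lists of at most chunk_size lines, so the whole
    file is never resident in memory at once."""
    chunk = []
    with open(file_path, 'r', encoding='utf-8') as fin:
        for line in fin:
            chunk.append(line.rstrip('\n'))
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:
        yield chunk

def build_word_counts(file_path, chunk_size=1000000):
    """Accumulate word frequencies chunk by chunk; only the
    word universe (the Counter) grows with the data size."""
    counts = Counter()
    for chunk in iter_chunks(file_path, chunk_size):
        for line in chunk:
            counts.update(line.split())
    return counts
```

With this pattern, peak memory is bounded by one chunk plus the vocabulary counter, which matches the observation above that only the word universe needs to fit in memory during dictionary building.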

@chengfx chengfx requested a review from ljshou May 9, 2019

problem.py Outdated
@@ -285,6 +300,7 @@ def build(self, training_data_path, file_columns, input_types, file_with_col_hea

assert loaded_emb_dim == word_emb_dim, "The dimension of defined word embedding is inconsistent with the pretrained embedding provided!"

logging.info("constrct embedding table")

@woailaosang (Contributor) commented May 9, 2019

constrct -> construct ?

@chengfx (Author, Contributor) commented May 10, 2019

Thanks, Zhijie, done.

chengfx added some commits May 10, 2019

@ljshou ljshou merged commit fa9e8e2 into master May 10, 2019

1 check passed: license/cla — All CLA requirements met.
@ljshou (Member) commented May 10, 2019

Thanks, Feixiang.

ericwtlin pushed a commit that referenced this pull request May 10, 2019

Use lazy loading instead of one-off loading (#28)
* Add new config about knowledge distillation for query binary classifier

* remove inferenced result in knowledge distillation for query binary classifier

* Add AUC.py in tools folder

* Add test_data_path into conf_kdqbc_bilstmattn_cnn.json

* Modify AUC.py

* Rename AUC.py into calculate_AUC.py

* Modify test&calculate AUC commands for Knowledge Distillation for Query Binary Classifier

* Add cpu_thread_num parameter in conf.training_params

* Rename cpu_thread_num into cpu_num_workers

* update comments in ModelConf.py

* Add cup_num_workers in model_zoo/advanced/conf.json

* Add the description of cpu_num_workers in Tutorial.md

* Update inference speed of compressed model

* Add ProcessorsScheduler Class

* Add license in ProcessorScheduler.py

* use lazy loading instead of one-off loading

* Remove Debug Info in problem.py

* use open instead of codecs.open

* update the inference of build dictionary for classification

* update typo

* update typo
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.