
Pretrained model #10

Open
Mihonarium opened this issue Jul 12, 2021 · 15 comments
Labels: enhancement (New feature or request)

@Mihonarium
Contributor

Mihonarium commented Jul 12, 2021

While it's relatively easy to train the model on the Dataset-mini (even Colab allows that), it's not as easy to reproduce the paper's results with the Dataset-full. It would be great if you could publish a model trained on the full dataset.

(By the way, congratulations on the paper, and thanks for publishing the work, it's really cool!)

@Mihonarium
Contributor Author

Oh, sorry, I just saw that you actually use the mini dataset for training and the full one for the full-scale evaluation. Closing the issue.

@mimbres
Owner

mimbres commented Jul 12, 2021

Thanks. Yes, the training part is actually the same.
I have a plan for Colab. The g-drive (raw) files are there exactly for the purpose of mounting them in Colab.

Training in Colab:
I haven't tested it, but it should work. You first need to modify config/default.yaml: OUTPUT_ROOT_DIR and LOG_ROOT_DIR must be set to your gdrive directory, and other paths like SOURCE_ROOT etc. should point to the (raw) dataset I shared.
During training, it saves a model checkpoint every epoch, usually every twenty minutes, though it can take longer.
So if Colab auto-shuts down, you can continue training from the last checkpoint.
If you run into any problems, just let me know. It would be a nice contribution.
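For reference, a minimal sketch of that Colab setup, assuming the repo is cloned under /content and the dataset lives in Google Drive; the exact key names and nesting inside default.yaml may differ, so check them against your copy (and note that dumping the YAML this way drops its comments, so editing the file by hand works just as well):

```python
# Minimal Colab sketch (assumed paths): mount Google Drive, then point the
# config's output/log/source paths at Drive so checkpoints survive a shutdown.
import yaml
from google.colab import drive

drive.mount('/content/drive')

CFG = '/content/neural-audio-fp/config/default.yaml'  # assumed clone location

with open(CFG) as f:
    cfg = yaml.safe_load(f)

# Key names follow the comment above; their exact spelling/nesting inside
# default.yaml may differ, so adjust to match your copy of the config.
cfg['OUTPUT_ROOT_DIR'] = '/content/drive/MyDrive/nafp/checkpoint/'
cfg['LOG_ROOT_DIR']    = '/content/drive/MyDrive/nafp/logs/'
cfg['SOURCE_ROOT']     = '/content/drive/MyDrive/neural-audio-fp-dataset/'  # shared (raw) dataset

with open(CFG, 'w') as f:
    yaml.safe_dump(cfg, f)  # note: rewrites the file without its comments
```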

About sharing a trained model: yes, I can. The plan for the next update is to write a one-page Colab demo that loads it.
But if you want to try it early, here is the link.

I really welcome feedback from Colab users. I feel that is the way for this open project to go.

@mimbres mimbres self-assigned this Jul 12, 2021
@mimbres mimbres reopened this Jul 12, 2021
@mimbres mimbres added the enhancement New feature or request label Jul 12, 2021
@mimbres
Owner

mimbres commented Jul 12, 2021

I am wondering whether it is possible to install faiss (required for constructing the search engine) smoothly in Colab. I've never tried it. It is also an important prerequisite for developing the Colab demo. I'll test it out a bit tonight.

  • Installation of faiss-gpu on Colab.

@Mihonarium
Contributor Author

I was able to run the training process in Colab with Miniconda, but just installing requirements without Miniconda leads to an error. #12 should fix it.

Restoring from that checkpoint doesn't work for some reason. It outputs a long list of messages like WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer's state 'v' for (root).model.div_enc.split_fc_layers.124.layer_with_weights-0.bias for all the layers, weights, etc., and this warning at the end:

WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details
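As the warning itself suggests, restoring with expect_partial() silences these messages when only the model weights are needed. A generic sketch, with a placeholder model and an assumed checkpoint path rather than the repo's actual loader:

```python
# Generic illustration of the warning's suggestion, not the repo's loader:
# restore weights only and call expect_partial() so optimizer slots stored
# in the checkpoint don't trigger "Unresolved object" warnings.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(8)])   # placeholder model
ckpt = tf.train.Checkpoint(model=model)                    # no optimizer attached
latest = tf.train.latest_checkpoint('./logs/checkpoint/640_lamb')  # assumed path
if latest:
    ckpt.restore(latest).expect_partial()  # ignore checkpointed optimizer state
```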

@mimbres
Owner

mimbres commented Jul 12, 2021

@Mihonarium Thanks for the report. Yes, it seems we don't need conda for Colab; a plain pip install works smoothly. Installation of faiss-gpu was super smooth too: !pip install faiss-gpu.
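For anyone else setting up Colab, a quick sanity check that faiss-gpu installed correctly could look like this (random vectors, purely illustrative):

```python
# Quick post-install check for faiss-gpu in Colab: build a small index on the
# GPU and search it with random vectors.
import numpy as np
import faiss

d = 128                                                # embedding dimension
xb = np.random.random((10000, d)).astype('float32')   # fake database vectors
xq = np.random.random((5, d)).astype('float32')       # fake query vectors

index = faiss.IndexFlatL2(d)                           # exact L2 index on CPU
res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 0, index)      # move it to GPU 0
gpu_index.add(xb)
distances, ids = gpu_index.search(xq, 5)               # top-5 neighbours per query
print(ids.shape)                                       # (5, 5)
```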

About your checkpoint loading issue, let me ask:

  • Just use the config/640_lamb.yaml in the repo.
  • Did you specify the config? The command should look like:
python run.py train -c 640_lamb 640_lamb  # ignore this line..
!python run.py generate -c 640_lamb 640_lamb 101

By the way, just try the generate command first. Continuing training from a checkpoint created on a different type of device is an unusual scenario.
If you send me your notebook, I'll look at it tomorrow.

@Mihonarium
Contributor Author

Yes, I did specify the config.

What's even stranger, the issue with all the warnings appears only with run.py train and not with generate.

The notebook: https://gist.github.com/Mihonarium/e3fd355cb560b82373fd2186139f1bc2 (the last cells show that generate and training from scratch work).

@mimbres
Owner

mimbres commented Jul 12, 2021

@Mihonarium Oh, that is expected behavior, as I wrote above. The checkpoint file contains the optimizer's state, which is GPU-device dependent. So if you want to continue training using my checkpoint as the initial parameters, it's possible, but I didn't consider that use case. It requires loading the model without connecting the optimizer first (as in generate), then initializing a fresh optimizer and starting training.
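A rough sketch of that idea, using placeholder names since the repo's actual model and trainer objects aren't shown here: restore only the model variables from the checkpoint, then attach a freshly initialized optimizer for further training.

```python
# Hedged sketch of "load the model first, then a fresh optimizer"; `model` and
# the optimizer choice are placeholders, not the fingerprinter's actual setup.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(8)])  # placeholder network

# 1) Restore weights only, as the generate path does; optimizer state stored
#    in the checkpoint is deliberately ignored.
ckpt = tf.train.Checkpoint(model=model)
ckpt.restore(tf.train.latest_checkpoint('./logs/checkpoint/640_lamb')).expect_partial()

# 2) Attach a newly initialized optimizer and continue training; its state
#    (momentum, step count, ...) starts from scratch.
optimizer = tf.keras.optimizers.Adam(1e-4)
new_ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
# ... run the training loop and save new_ckpt periodically ...
```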

@mimbres
Owner

mimbres commented Jul 12, 2021

@Mihonarium About the training-from-scratch error: first, for a P100 GPU, I recommend

BSZ:
    TR_BATCH_SZ : 320
        # Training batch size N must be EVEN number.
    TR_N_ANCHOR : 160

You didn't get an out-of-memory error, though, and this is not related to your issue anyway.
I am now checking the CPU specs of the Colab VM.
In config, try:

DEVICE:
    CPU_N_WORKERS : 4 # 4 for minimal system. 8 is recommended.
    CPU_MAX_QUEUE : 10 # 10 for minimal system. 20 is recommended.

It depends on how many threads the system can handle.
I will run it tomorrow.
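As a quick way to see what the Colab VM can handle before picking those values, something like this works (psutil is typically preinstalled in Colab):

```python
# See how many threads and how much RAM the Colab VM offers before setting
# CPU_N_WORKERS and CPU_MAX_QUEUE.
import os
import psutil

print('CPU threads:', os.cpu_count())
print('RAM (GB):   ', round(psutil.virtual_memory().total / 1e9, 1))
```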

@Mihonarium
Contributor Author

that is expected behavior, as I wrote above. The checkpoint file contains the optimizer's state, which is GPU-device dependent.

Got it, makes sense. Thanks!

Training from scratch didn't give any errors; I interrupted it. I included it to show that the errors come from loading the checkpoint (I didn't know that was the expected behavior) and not from something else. You're right, though: I would probably have hit an out-of-memory error if I had trained longer. I was actually able to train the model successfully with a batch size of 320.

@Mihonarium
Contributor Author

Mihonarium commented Jul 12, 2021

Got unsupported operand type(s) for +: 'PosixPath' and 'str' from line 306 of dataset.py when I tried to generate from a custom source.

@mimbres
Owner

mimbres commented Jul 12, 2021

@Mihonarium Solved by removing pathlib for the input argument. Also fixed the same issue for the --output option.
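For readers who hit the same TypeError elsewhere: pathlib paths don't support + with strings, so the general fix is to cast to str or join with the / operator. An illustrative snippet, not the actual dataset.py code:

```python
# Illustration of the TypeError and two common fixes; not the actual
# dataset.py code, just the general pattern.
from pathlib import Path

source_root = Path('/content/drive/MyDrive/custom_source')  # hypothetical argument

# source_root + '/wav/'               # TypeError: unsupported operand type(s) for +
fixed_a = str(source_root) + '/wav/'  # fix 1: cast to str before concatenating
fixed_b = source_root / 'wav'         # fix 2: stay in pathlib and join with '/'
print(fixed_a, fixed_b)
```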

@TheMightyRaider

TheMightyRaider commented Jul 28, 2021

@mimbres @Mihonarium Is it possible for you guys to share the trained model? It's quite hard to train with a batch size of 320. 🤞

@Mihonarium
Contributor Author

@TheMightyRaider the trained model is available here

@TheMightyRaider

Thanks! @Mihonarium

@haha010508

haha010508 commented Nov 8, 2022

I used the pretrained model and the same dataset (Dataset-mini) for the evaluation step, but I got very poor results. I want to know why. This is my command and output:
```
CUDA_VISIBLE_DEVICES=1 python run.py evaluate 640_lamb 101.index -c 640_lamb
cli: Configuration from ./config/640_lamb.yaml
Load 29,500 items from ./logs/emb/640_lamb/101.index/query.mm.
Load 29,500 items from ./logs/emb/640_lamb/101.index/db.mm.
Load 581,922 items from ./logs/emb/640_lamb/101.index/dummy_db.mm.
Creating index: ivfpq
Copy index to GPU.
Training index...
Elapsed time: 23.07 seconds.
581922 items from dummy DB
29500 items from reference DB
Added total 611422 items to DB. 2.25 sec.
Created fake_recon_index, total 611422 items. 0.04 sec.
test_id: icassp, n_test: 2000
========= Top1 hit rate (%) of segment-level search =========
              ---------------- Query length ----------------
segments         1      3      5      9     11     19
seconds        (1s)   (2s)   (3s)   (5s)   (6s)  (10s)

Top1 exact     3.75   5.90   6.45   7.25   7.25   7.80
Top1 near      4.00   6.15   6.70   7.30   7.30   7.80
Top3 exact     4.40   7.00   7.85   8.60   8.45   8.95
Top10 exact    5.40   8.35   9.40  10.90  11.15  10.90

average search + evaluation time 7.25 ms/query
Saved test_ids and raw score to ./logs/emb/640_lamb/101.index/.
```
Do I need to retrain?

Rodrigo29Almeida pushed a commit to Rodrigo29Almeida/neural-audio-fp that referenced this issue Apr 16, 2024