
How to set the batch size for prediction? #36

Closed
o0windseed0o opened this issue Jan 2, 2018 · 9 comments
@o0windseed0o

Hi all, I assume it's possible to set the training batch size to 100 and the prediction batch size to 10, right?
So I tried several prediction batch sizes (1, 10, 50, and 100) and got different results after predicting. This is binary classification using match_pyramid, predicting 42,155 samples in total:
size=1: numpy.core._internal.AxisError: axis 1 is out of bounds for array of dimension 1
size=10: outputs predictions for 42,142 samples
size=50: outputs predictions for 42,142 samples
size=100: outputs predictions for 42,092 samples
Does anyone know what went wrong?
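(For context on the size=1 failure: that error message is what NumPy produces when argmax is taken over axis 1 of a 1-D array. A minimal sketch, assuming the pipeline squeezes a size-1 batch of predictions down to one dimension before taking the per-sample argmax — the arrays here are made up for illustration:)

```python
import numpy as np

# With batch_size > 1, predictions are a 2-D array of class scores,
# and taking the argmax over axis 1 works as expected.
preds_2d = np.array([[0.3, 0.7], [0.9, 0.1]])
labels = np.argmax(preds_2d, axis=1)  # one predicted class per row

# If a size-1 batch gets squeezed to a 1-D array somewhere in the
# pipeline, the same call raises exactly the error quoted above:
# "axis 1 is out of bounds for array of dimension 1".
preds_1d = np.array([0.3, 0.7])
try:
    np.argmax(preds_1d, axis=1)
    raised = False
except Exception:  # NumPy raises an AxisError here
    raised = True
```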

@uduse
Member

uduse commented Jan 3, 2018

@o0windseed0o can you paste your complete config file here? It's probably an iteration-boundary bug somewhere in our code.

@o0windseed0o
Author

o0windseed0o commented Jan 3, 2018

@uduse Thanks for your reply! Please see the following.

{
  "net_name": "match_pyramid",
  "global":{
      "model_type": "PY",
      "weights_file": "examples/QA/weights/matchpyramid_classify.weights",
      "save_weights_iters": 10,
      "num_iters": 200,
      "display_interval": 10,
      "test_weights_iters": 200,
      "optimizer": "adam",
      "learning_rate": 0.0001
  },
  "inputs": {
    "share": {
        "text1_corpus": "./data/QA/corpus_preprocessed.txt",
        "text2_corpus": "./data/QA/corpus_preprocessed.txt",
        "use_dpool": true,
        "embed_size": 100,
        "train_embed": true,
        "vocab_size": 28780,
        "target_mode": "classification",
        "class_num": 2,
        "text1_maxlen": 25,
        "text2_maxlen": 50
    },
    "train": {
        "input_type": "PointGenerator", 
        "phase": "TRAIN",
        "use_iter": false,
        "query_per_iter": 20,
        "batch_per_iter": 5,
        "batch_size": 100,
        "relation_file": "./data/QA/relation_train.txt"
    },
    "valid": {
        "input_type": "PointGenerator", 
        "phase": "EVAL",
        "batch_size": 100,
        "relation_file": "./data/QA/relation_train.txt"
    },
    "test": {
        "input_type": "PointGenerator", 
        "phase": "EVAL",
        "batch_size": 100,
        "relation_file": "./data/QA/relation_test.txt"
    },
    "predict": {
        "input_type": "PointGenerator", 
        "phase": "PREDICT",
        "batch_size": 50,
        "relation_file": "./data/QA/relation_test.txt"
    }
  },
  "outputs": {
    "predict": {
      "save_format": "TEXTNET",
      "save_path": "predict.test.medqa_matchpyramid_classify.txt"
    }
  },
  "model": {
    "model_path": "matchzoo/models/",
    "model_py": "matchpyramid.MatchPyramid",
    "setting": {
        "kernel_count": 32, 
        "kernel_size": [3, 3], 
        "dpool_size": [3, 10],
        "dropout_rate": 0.5
    }
  },
  "losses": [
    {
       "object_name": "categorical_crossentropy",
       "object_params": {}
    }
  ],
  "metrics": [ "accuracy" ]
}

There are several parameters in the config file that I don't know how to set, such as query_per_iter and batch_per_iter. Are there any instructions or introductions on how to write config files?

As for the error, if it's not related to the config file, it might be in the batch-generation logic, since the missing samples are always the last ones.

@o0windseed0o
Author

Have you figured out what caused the problem? Or can anyone tell me which .py file I should check, related to the batch generator?

@uduse
Member

uduse commented Jan 8, 2018

@o0windseed0o I haven't had a chance to dive into the problem yet. You might want to look at the PointGenerator class.

@o0windseed0o
Author

@uduse I have checked the PointGenerator class, and I think the problem is around the while True loop in the get_batch_generator function: there is no handling for the leftover samples that don't fill a complete batch. Please take a look when you are free. Thank you!
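(The symptom does match a common generator pattern. A minimal sketch of the suspected behavior, under the assumption described above — this is illustrative code, not the actual MatchZoo implementation:)

```python
# Hypothetical sketch of the suspected bug: a generator that only walks
# over complete batches silently drops the trailing samples whenever
# len(data) is not divisible by batch_size.
def batches_dropping_tail(data, batch_size):
    n_full = len(data) // batch_size  # integer division: the tail is lost
    for i in range(n_full):
        yield data[i * batch_size:(i + 1) * batch_size]

# A possible fix: step through by batch_size and let the final slice be
# shorter than a full batch.
def batches_keeping_tail(data, batch_size):
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

data = list(range(105))
n_dropped = sum(len(b) for b in batches_dropping_tail(data, 50))  # 2 full batches only
n_kept = sum(len(b) for b in batches_keeping_tail(data, 50))      # all samples
```

With 105 samples and batch_size=50, the first version emits 100 predictions and the second emits all 105 — the same pattern as the missing-last-samples counts reported above.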

@bwanglzu
Member

Need to figure out whether a bug exists.

@Genie-Liu

Today I ran into the same situation: I have 5,000 prediction samples, but the output contains only 4,998 predictions. No matter how I change batch_size, the output is always 4,998. I later found that there are duplicate samples in my prediction set.

@o0windseed0o Not sure if you have duplicate samples as well.

@bwanglzu By the way, can the model handle the duplicate case?
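(If the generator or the relation-file loader keys samples by the (query, doc) id pair, duplicates would collapse into one entry, which would explain a count that shrinks independently of batch_size. A hedged sketch of that effect, with made-up ids:)

```python
# Hypothetical illustration: deduplicating (query_id, doc_id) pairs shrinks
# the sample count regardless of how they are later split into batches.
pairs = [("q1", "d1"), ("q1", "d2"), ("q2", "d1"), ("q1", "d1"), ("q2", "d1")]
unique_pairs = list(dict.fromkeys(pairs))  # order-preserving deduplication
# 5 input pairs collapse to 3 unique pairs, so 2 predictions go missing.
```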

@bwanglzu
Member

bwanglzu commented Jul 3, 2018

Apparently there's something wrong in the Generator; I guess @faneshion and @yangliuy are the right people to ask.

@Genie-Liu can you provide a bit more context?

@bwanglzu
Member

@faneshion any ideas?
