Training ner on a new corpus #11
That's just how the optimizer works. To do the training with any non-trivial amount of data you need to compile in 64-bit mode and use a 64-bit OS. Otherwise you can only use 2GB of RAM, which isn't very much.

@davisking The config of my machine is 16GB memory, 64-bit, 2.1GHz processor.

13GB is a lot for 198 samples. How exactly did you run the trainer?
@davisking I used the total_word_feature_extractor.dat for the vocab file:

```python
trainer = ner_trainer(vocabfile)
trainer.num_threads = 16
for line in input_lines:
    trainer.add(sample)
```

Here the sample variable is of type ner_training_instance; this is how we are initialising it.
That doesn't look like it should run at all. The trainer.add() method is supposed to take a ner_training_instance object. Is that what strip_braces() returns?

Can you post a complete program that reproduces the problem? One that I can run?
Ya, that's exactly what it returns.
@KanwalSingh the way you are labeling your training data has a bug: the entity range you pass covers only the first token, so it will label only "india" as builder, not "india bulls", as the line "{india bulls :: builder}..." would suggest. The correction is to pass the full token range to add_entity. See here for documentation on the xrange function and here for example usage. Please fix this bug and let us know if the problem persists.
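For anyone hitting the same labeling bug: MITIE's `ner_training_instance.add_entity` takes a half-open token range (`xrange` in the Python 2 examples), so the end index is exclusive. A minimal sketch of the range semantics in plain Python (the token list and label below are made up for illustration):

```python
tokens = ["india", "bulls", "is", "a", "builder"]

# add_entity(xrange(0, 1), "builder") would cover only token 0:
wrong = [tokens[i] for i in range(0, 1)]   # ["india"]

# add_entity(xrange(0, 2), "builder") covers tokens 0 and 1:
right = [tokens[i] for i in range(0, 2)]   # ["india", "bulls"]

print(wrong, right)
```

So to tag the two-token entity "india bulls", the range must be (0, 2), not (0, 1).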
Hi, that's my bad in mentioning the label data; yes, I am giving it as 0,2.
What about the RAM usage? When I run the python trainer I don't get anything like 13GB of RAM usage. What happens when you run our provided train_ner.py python program? Does it use a lot of RAM, or is it only on your data that this happens?
I am facing this problem too. When I run the train_ner.py python program, it uses about 718 MB of RAM (value under RES in the top command). When I run it on my data it uses 6.7 GB of RAM (value under RES in the top command). Screenshots attached.

The python program was faster than the C++ program in determining the best C, but after determining the best C, the python program started consuming a lot of RAM (it did not consume a lot of RAM until then). Best C was calculated to be 300.69. The C++ program took about 2.5 hours to determine the best C, whereas the python program took about an hour. I had to force shutdown the system an hour or so after the best C was determined, in both cases.

I've used the code under tools to build a custom total_word_feature_extractor. This is the output from the python code before it starts determining the best C:

```
words in dictionary: 282
```

Machine Configuration:

Please let me know if I've made a mistake somewhere, since this was the first time I've executed the program.
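As an aside, the resident memory (the RES column in top) can also be read non-interactively, which makes it easier to log usage over a long training run. A minimal sketch using the standard `ps` resident-set-size format specifier (`$$` is just this shell's own PID, standing in for the trainer's PID):

```shell
# Print the resident set size (in KB) of a process by PID.
ps -o rss= -p $$
```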
Can you post the inputs you used to run this such that I can exactly reproduce this issue?

I'm not sure if I am allowed to share the training data. I'll post it here if I am allowed to, on Monday.
Sounds good. The only reason I can think of that might cause this is if you use a very large number of labels. How many different label strings did you use? E.g. the example program uses just person and org, so that's 2 different types of labels. If you used 1000 then it's going to take a huge amount of RAM, because the last step solves a big multiclass linear SVM that uses an amount of RAM linear in the number of distinct labels.
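A quick way to sanity-check this before training is to count the distinct label strings in the dataset. A small sketch (the label list here is made up, but it mirrors the whitespace and case variants that turn out to matter later in this thread):

```python
# Count distinct label strings; stray whitespace or case differences
# silently create extra labels, each of which costs RAM to train.
labels = ["builder", " builder", "price", "price ", "Price"]

distinct = set(labels)
normalized = {label.strip().lower() for label in labels}

print(len(distinct))    # 5 raw variants
print(len(normalized))  # 2 intended labels
```

If the two counts disagree, the label set almost certainly contains accidental duplicates.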
There are 8 different labels in the training data. And the data contains about 600 sentences. Each sentence/phrase contains three to ten words.
That amount should be fine.
I too had 10 labels max.
Can you post your training data? :)

We (and a bunch of other groups) have been using MITIE to train models and haven't had any issues. So I need one of you guys to post a program that reproduces the issue you are having or it's going to be impossible to debug :)
Hey Davis, I understand that. I will be sharing the training data by Monday.
Sounds good. You can email me at davis@dlib.net
@davisking, I won't be able to share the training data. Sorry! @KanwalSingh could you please share the training data you used with Davis.

@davisking I have mailed you the training data.

Thanks. Please also include a working python program that, when executed, causes this large RAM usage bug to appear so that I can debug it.
Done, mailed it to you.
I've looked this over and the problem is in the training data given to MITIE. However, before I explain the issue it's helpful to understand a little about how MITIE works.

In MITIE, each sentence is chunked into entities and then each chunk is labeled by a multiclass classifier (a multiclass support vector machine, specifically). To classify each chunk, MITIE creates a 501354 dimensional vector which is given to the multiclass classifier. The way the multiclass classifier works is it learns one linear function for each label (and an additional one for the special 'not an entity' category). So if you have N labels then the classifier has 501354\*(N+1) numbers it needs to learn. Moreover, since we use a cutting plane solver there is an additional factor of RAM usage in the solver; let's call it Z. The amount of RAM used by the multiclass trainer is 501354\*(N+1)\*Z\*sizeof(double) + (the amount of RAM needed to store the training data). That means there are 501354\*(N+1)\*Z\*sizeof(double) bytes of RAM usage no matter the size of your dataset. The Z value is normally in the range 40-80. However, if you give input data that is basically impossible to correctly classify then the solver needs to work harder to find a way to separate it, so Z might go up to about 200. It will also take a long time to train.

In your case, you gave data with these 18 labels: 'builder', 'project', ' builder', 'size', 'location', 'price', ' price ', 'Infra', 'time', 'price psf', 'loction', ' time', ' size', ' location', ' infra', ' price', ' project', 'infra'. What's happening is that MITIE is trying to figure out, for example, how to separate the ' price', 'price', and 'price ' labels, but this is probably impossible, as I'm sure you meant to give all these things the same label. The MITIE solver still tries, and it needs a large Z to build up a high accuracy solution that can do this. So if Z=200 and N=18 then 501354\*(N+1)\*Z\*sizeof(double) is about 14GB of RAM.
So the solution is to fix your training data so that the labels make sense. If you do that then I would expect more like 2GB-4GB of RAM usage. I have also updated the MITIE code to print the labels supplied by the user. So if you pull from github and rerun, it will show these labels, and that should make this kind of error a lot easier to spot in the future.
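The arithmetic above can be checked directly. A short sketch plugging numbers into the formula 501354\*(N+1)\*Z\*sizeof(double), where sizeof(double) is 8 bytes (the Z values are the illustrative ones from the explanation, not measurements):

```python
# RAM used by the multiclass SVM solver, per the formula above.
def svm_ram_bytes(n_labels, z):
    return 501354 * (n_labels + 1) * z * 8

buggy = svm_ram_bytes(18, 200)  # 18 accidental label variants, hard-to-separate data
clean = svm_ram_bytes(8, 60)    # 8 real labels, a typical Z

print(buggy / 2**30)  # ~14.2 GiB
print(clean / 2**30)  # ~2.0 GiB
```

This reproduces both figures in the explanation: roughly 14GB with the buggy 18-label data, and the 2GB-4GB range once the labels are deduplicated.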
@davisking Thanks for such a detailed explanation of how MITIE works. It solved my issue regarding the long training duration.
gagan-bansal did you solve your issue? |
Is there any memory leak? It's taking a lot of memory for very few training samples. It's getting killed after printing this:

```
num feats in chunker model: 4095
train: precision, recall, f1-score: 0.984615 0.984615 0.984615
now do training
num training samples: 198
```

I observed the memory usage and saw that it kept on increasing gradually once it reached this point, as if some memory were being filled with garbage in each iteration.