
Training ner on a new corpus #11

Closed
KanwalSingh opened this issue Feb 19, 2015 · 26 comments

@KanwalSingh

Is there a memory leak? It's taking a lot of memory for very few training samples.
It's getting killed after printing this:

num feats in chunker model: 4095
train: precision, recall, f1-score: 0.984615 0.984615 0.984615
now do training
num training samples: 198

I observed the memory usage and saw that it kept increasing gradually once it reached this point, as if each iteration were filling memory with garbage.

@davisking
Contributor

That's just how the optimizer works. To do the training with any non-trivial amount of data you need to compile in 64-bit mode and use a 64-bit OS. Otherwise you can only use 2GB of RAM, which isn't very much.

@KanwalSingh
Author

@davisking
The top command showed 13 GB of memory usage, and the process got killed after that (safe to assume it exhausted the available memory, hence was killed).

My machine has 16 GB of memory, a 64-bit OS, and a 2.1 GHz processor.

@davisking
Contributor

13GB is a lot for 198 samples. How exactly did you run the trainer?

davisking reopened this Feb 19, 2015
@KanwalSingh
Author

@davisking I used the total_word_feature_extractor.dat as the vocab file:

trainer = ner_trainer(vocabfile)

for line in input_lines:
    sample = strip_braces(line)
    trainer.add(sample)

trainer.num_threads = 16
ner = trainer.train()

Here the sample variable is of type ner_training_instance; this is how we are initialising it:
line = "{india bulls :: builder} panvel greens"
strip_line = "india bulls panvel greens"
sample = ner_training_instance(strip_line.split())
sample.add_entity(xrange(0,1),"builder")

@davisking
Contributor

davisking commented Feb 23, 2015 via email

@KanwalSingh
Author

Yes, that's exactly what it returns.

@arjunmajum
Contributor

@KanwalSingh the way you are labeling your training data has a bug...

sample.add_entity(xrange(0,1),"builder")

will label only "india" as builder, not "india bulls" as the line "{india bulls :: builder}..." would suggest. The correction is as follows:

sample.add_entity(xrange(0,2),"builder")

See https://docs.python.org/2/library/functions.html#xrange for documentation on the xrange function and http://www.pythoncentral.io/how-to-use-pythons-xrange-and-range/ for example usage.

Please fix this bug and let us know if the problem persists.
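
For reference, here is a minimal end-to-end sketch of the corrected labeling, written against the MITIE Python API as it is used in this thread; the feature-extractor path, the single training sample, and the output filename are placeholders, not values from this issue:

from mitie import ner_trainer, ner_training_instance

trainer = ner_trainer("total_word_feature_extractor.dat")  # placeholder path

# Tokens for "india bulls panvel greens"; "india bulls" spans tokens 0 and 1,
# so the entity range is xrange(0, 2) -- the end index is exclusive.
sample = ner_training_instance("india bulls panvel greens".split())
sample.add_entity(xrange(0, 2), "builder")

trainer.add(sample)        # a real run needs many such samples
trainer.num_threads = 4
ner = trainer.train()
ner.save_to_disk("new_ner_model.dat")  # placeholder output filename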

@KanwalSingh
Author

Hi, that's my mistake in how I wrote out the label data; yes, I am giving it as (0, 2).


@davisking
Contributor

What about the RAM usage? When I run the Python trainer I don't get anything like 13 GB of RAM usage. What happens when you run our provided train_ner.py example program? Does it use a lot of RAM, or does this happen only on your data?

@manalgandhi

I am facing this problem too.

When I run the train_ner.py Python program, it uses about 718 MB of RAM (the RES value in the top command).

When I run it on my data it uses 6.7 GB of RAM (the RES value in the top command).

Screenshots attached.

The Python program was faster than the C++ program at determining the best C. But after determining the best C, the Python program started consuming a lot of RAM (it did not consume much RAM until then). (The best C was calculated to be 300.69.)

The C++ program took about 2.5 hours to determine the best C, whereas the Python program took about an hour. In both cases I had to force a shutdown of the system an hour or so after the best C was determined.

I've used the code under tools to build a custom total_word_feature_extractor.

This is the output from the python code before it starts determining the best C:

words in dictionary: 282
num features: 271
now do training
C: 20
epsilon: 0.01
num threads: 4
cache size: 5
loss per missed segment: 3
C: 20 loss: 3 0.949386
C: 35 loss: 3 0.949386
C: 20 loss: 4.5 0.948403
C: 5 loss: 3 0.944963
C: 20 loss: 1.5 0.946437
C: 27.5 loss: 3.375 0.948894
C: 21.2605 loss: 3.35924 0.95086
C: 19.135 loss: 3.2257 0.949877
C: 22.119 loss: 3.19385 0.950369
C: 21.9391 loss: 3.60092 0.949877
C: 21.941 loss: 3.36495 0.95086
best C: 21.2605
best loss: 3.35924
num feats in chunker model: 4095
train: precision, recall, f1-score: 0.996075 0.997543 0.996808
now do training
num training samples: 2043
...

Machine configuration:
8 GB RAM
Intel i5, 2.7 GHz
Ubuntu 12.04, 64-bit

Please let me know if I've made a mistake somewhere since this was the first time I've executed the program.

[Screenshots: "train_ner - custom data - python - top" and "train_ner - python - top"]

@davisking
Contributor

Can you post the inputs you used to run this so that I can reproduce the issue exactly?

@manalgandhi

I'm not sure if I am allowed to share the training data. If I am, I'll post it here on Monday.

@davisking
Contributor

Sounds good.

The only reason I can think of that might cause this is if you use a very large number of labels. How many different label strings did you use? E.g. the example program uses just person and org so that's 2 different types of labels. If you used 1000 then it's going to take a huge amount of RAM because it solves a big multiclass linear SVM in the last step that uses an amount of RAM linear in the number of distinct labels.
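
As a quick way to check this before training, one can count the distinct label strings that will be passed to add_entity; the labeled_entities list and its (range, label) tuple layout below are placeholders for however your own preprocessing stores the annotations, not part of MITIE:

# Placeholder annotations: (token range, label string) pairs.
labeled_entities = [(xrange(0, 2), "builder"), (xrange(3, 4), "location")]

# Collect every label string you are about to pass to add_entity.
distinct_labels = set()
for rng, label in labeled_entities:
    distinct_labels.add(label)

print "number of distinct labels:", len(distinct_labels)
for label in sorted(distinct_labels):
    print repr(label)   # repr() exposes stray leading/trailing spaces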

@manalgandhi

There are 8 different labels in the training data.

And the data contains about 600 sentences. Each sentence/phrase contains three to ten words.

@davisking
Contributor

davisking commented Feb 27, 2015 via email

@KanwalSingh
Author

I too had 10 labels at most.


@davisking
Contributor

davisking commented Feb 27, 2015 via email

@KanwalSingh
Author

Hey Davis, I understand that. I will share the training data by Monday, and will also see if we can share the code with you. Hope that's fine with you; please also share your personal email.

On Fri, Feb 27, 2015 at 6:22 PM, Davis E. King wrote:

Can you post your training data? :)

We (and a bunch of other groups) have been using MITIE to train models and haven't had any issues. So I need one of you guys to post a program that reproduces the issue you are having or it's going to be impossible to debug :)



@davisking
Contributor

davisking commented Feb 27, 2015 via email

@manalgandhi

@davisking, I won't be able to share the training data. Sorry!

@KanwalSingh, could you please share the training data you used with Davis?

@KanwalSingh
Author

@davisking I have mailed you the training data

@davisking
Contributor

Thanks. Please also include a working Python program that, when executed, causes this large RAM usage bug to appear so that I can debug it.

@KanwalSingh
Author

Done, mailed it to you.


@davisking
Contributor

I've looked this over and the problem is in the training data given to MITIE. However, before I explain the issue it's helpful to understand a little about how MITIE works. In MITIE, each sentence is chunked into entities and then each chunk is labeled by a multiclass classifier (specifically, a multiclass support vector machine). To classify each chunk, MITIE creates a 501354-dimensional vector which is given to the multiclass classifier.

Now the way the multiclass classifier works is that it learns one linear function for each label (and an additional one for the special 'not an entity' category). So if you have N labels then the classifier has 501354*(N+1) numbers it needs to learn. Moreover, since we use a cutting plane solver there is an additional factor of RAM usage in the solver; let's call it Z. The amount of RAM used by the multiclass trainer is 501354*(N+1)*Z*sizeof(double) + (the amount of RAM needed to store the training data). That means there are 501354*(N+1)*Z*sizeof(double) bytes of RAM usage no matter the size of your dataset.

The Z value is normally in the range 40-80. However, if you give input data that is basically impossible to correctly classify then the solver needs to work harder to find a way to separate it so Z might go up to about 200. It will also take a long time to train. In your case, you gave data with these 18 labels: 'builder', 'project', ' builder', 'size', 'location', 'price', ' price ', 'Infra', 'time', 'price psf', 'loction', ' time', ' size', ' location', ' infra', ' price', ' project', 'infra'.

Now what's happening is that MITIE is trying to figure out, for example, how to separate the ' price', 'price', and 'price ' labels, but this is probably impossible, as I'm sure you meant to give all of these the same label. The MITIE solver still tries, and it needs a large Z to build up a high-accuracy solution that can do this. So if Z=200 and N=18 then 501354*(N+1)*Z*sizeof(double) is about 14GB of RAM.
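
To make that figure concrete, here is a quick back-of-the-envelope check of the arithmetic (the dimension, N, and Z values come directly from the explanation above):

dims = 501354                          # feature vector dimension
N = 18                                 # distinct labels in the supplied data
Z = 200                                # solver factor for hard-to-separate data
ram_bytes = dims * (N + 1) * Z * 8     # sizeof(double) == 8
print ram_bytes / (1024.0 ** 3)        # about 14.2 GB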

So the solution is to fix your training data so that the labels make sense. If you do that then I would expect more like 2GB-4GB of RAM usage. I have also updated the MITIE code to print the labels supplied by the user, so if you pull from GitHub and rerun, it will show these labels and that should make this kind of error a lot easier to spot in the future.
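
A minimal sketch of the kind of label cleanup described here, assuming the raw label strings are available before they are handed to add_entity; normalize_label and the example list are illustrative, not part of MITIE:

def normalize_label(label):
    # Collapse variants like ' price', 'price ' and 'Price' into a single label.
    return label.strip().lower()

raw_labels = ['price', ' price ', ' price', 'location', ' location', 'loction']
print sorted(set(normalize_label(l) for l in raw_labels))
# -> ['location', 'loction', 'price']  (typos such as 'loction' still need a manual fix)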

@gagan-bansal

@davisking Thanks for such a detailed explanation of how MITIE works. It solved my issue with the long training duration.
I am using rasa_nlu with MITIE. There are a few issues (160 and 260) in rasa_nlu whose root cause may be the one you have explained here.

@ghost

ghost commented Dec 19, 2017

@gagan-bansal, did you solve your issue?
