
Training ner on a new corpus #11

Closed
KanwalSingh opened this issue Feb 19, 2015 · 26 comments

@KanwalSingh

Is there a memory leak? It's taking a lot of memory for very few training samples.
It's getting killed after printing this:

num feats in chunker model: 4095
train: precision, recall, f1-score: 0.984615 0.984615 0.984615
now do training
num training samples: 198

I observed the memory usage and saw that it kept increasing gradually once it reached this point, as if each iteration were filling memory with garbage.

@davisking
Contributor

That's just how the optimizer works. To do the training with any non-trivial amount of data you need to compile in 64-bit mode and use a 64-bit OS. Otherwise you can only use 2GB of RAM, which isn't very much.

@KanwalSingh
Author

@davisking
The top command showed 13 GB of memory usage, and the process got killed after that (safe to assume it exhausted the available memory, hence was killed).

My machine has 16 GB of memory, a 64-bit OS, and a 2.1 GHz processor.

@davisking
Contributor

13GB is a lot for 198 samples. How exactly did you run the trainer?

davisking reopened this Feb 19, 2015
@KanwalSingh
Author

@davisking I used the total_word_feature_extractor.dat as the vocab file:

trainer = ner_trainer(vocabfile)

for line in input_lines:
    sample = strip_braces(line)
    trainer.add(sample)

trainer.num_threads = 16
ner = trainer.train()

Here the sample variable is of type ner_training_instance; this is how we are initialising it:
line = "{india bulls :: builder} panvel greens"
strip_line = "india bulls panvel greens"
sample = ner_training_instance(strip_line.split())
sample.add_entity(xrange(0,1),"builder")

@davisking
Contributor

davisking commented Feb 23, 2015 via email

@KanwalSingh
Author

Yes, that's exactly what it returns.

@arjunmajum
Contributor

@KanwalSingh the way you are labeling your training data has a bug...

sample.add_entity(xrange(0,1),"builder")

will label only "india" as builder, not "india bulls" as the line "{india bulls :: builder}..." would suggest. The correction is as follows:

sample.add_entity(xrange(0,2),"builder")

See https://docs.python.org/2/library/functions.html#xrange for documentation on the xrange function and http://www.pythoncentral.io/how-to-use-pythons-xrange-and-range/ for example usage.

Please fix this bug and let us know if the problem persists.
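
For reference, here is a minimal end-to-end sketch of the corrected labeling, written against the MITIE Python API as it is used in this thread; the feature-extractor path, the single training sample, and the output filename are placeholders, not values from this issue:

from mitie import ner_trainer, ner_training_instance

trainer = ner_trainer("total_word_feature_extractor.dat")  # placeholder path

# Tokens for "india bulls panvel greens"; "india bulls" spans tokens 0 and 1,
# so the entity range is xrange(0, 2) -- the end index is exclusive.
sample = ner_training_instance("india bulls panvel greens".split())
sample.add_entity(xrange(0, 2), "builder")

trainer.add(sample)        # a real run needs many such samples
trainer.num_threads = 4
ner = trainer.train()
ner.save_to_disk("new_ner_model.dat")  # placeholder output filename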

@KanwalSingh
Author

Hi, that's my mistake in how I wrote out the label data; yes, I am giving it as (0, 2).


@davisking
Contributor

What about the RAM usage? When I run the Python trainer I don't get anything like 13 GB of RAM usage. What happens when you run our provided train_ner.py example program? Does it use a lot of RAM, or does this happen only on your data?

@manalgandhi

I am facing this problem too.

When I run the train_ner.py Python program, it uses about 718 MB of RAM (the RES value in the top command).

When I run it on my data it uses 6.7 GB of RAM (the RES value in the top command).

Screenshots attached.

The Python program was faster than the C++ program at determining the best C. But after determining the best C, the Python program started consuming a lot of RAM (it did not consume much RAM until then). (The best C was calculated to be 300.69.)

The C++ program took about 2.5 hours to determine the best C, whereas the Python program took about an hour. In both cases I had to force a shutdown of the system an hour or so after the best C was determined.

I've used the code under tools to build a custom total_word_feature_extractor.

This is the output from the python code before it starts determining the best C:

words in dictionary: 282
num features: 271
now do training
C: 20
epsilon: 0.01
num threads: 4
cache size: 5
loss per missed segment: 3
C: 20 loss: 3 0.949386
C: 35 loss: 3 0.949386
C: 20 loss: 4.5 0.948403
C: 5 loss: 3 0.944963
C: 20 loss: 1.5 0.946437
C: 27.5 loss: 3.375 0.948894
C: 21.2605 loss: 3.35924 0.95086
C: 19.135 loss: 3.2257 0.949877
C: 22.119 loss: 3.19385 0.950369
C: 21.9391 loss: 3.60092 0.949877
C: 21.941 loss: 3.36495 0.95086
best C: 21.2605
best loss: 3.35924
num feats in chunker model: 4095
train: precision, recall, f1-score: 0.996075 0.997543 0.996808
now do training
num training samples: 2043
...

Machine configuration:
8 GB RAM
Intel i5, 2.7 GHz
Ubuntu 12.04, 64-bit

Please let me know if I've made a mistake somewhere since this was the first time I've executed the program.

[Screenshots: "train_ner - custom data - python - top" and "train_ner - python - top"]

@davisking
Contributor

Can you post the inputs you used to run this so that I can reproduce the issue exactly?

@manalgandhi

I'm not sure if I am allowed to share the training data. If I am, I'll post it here on Monday.

@davisking
Contributor

Sounds good.

The only reason I can think of that might cause this is if you use a very large number of labels. How many different label strings did you use? E.g. the example program uses just person and org so that's 2 different types of labels. If you used 1000 then it's going to take a huge amount of RAM because it solves a big multiclass linear SVM in the last step that uses an amount of RAM linear in the number of distinct labels.
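
As a quick way to check this before training, one can count the distinct label strings that will be passed to add_entity; the labeled_entities list and its (range, label) tuple layout below are placeholders for however your own preprocessing stores the annotations, not part of MITIE:

# Placeholder annotations: (token range, label string) pairs.
labeled_entities = [(xrange(0, 2), "builder"), (xrange(3, 4), "location")]

# Collect every label string you are about to pass to add_entity.
distinct_labels = set()
for rng, label in labeled_entities:
    distinct_labels.add(label)

print "number of distinct labels:", len(distinct_labels)
for label in sorted(distinct_labels):
    print repr(label)   # repr() exposes stray leading/trailing spaces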

@manalgandhi

There are 8 different labels in the training data.

And the data contains about 600 sentences. Each sentence/phrase contains three to ten words.

@davisking
Contributor

davisking commented Feb 27, 2015 via email

@KanwalSingh
Author

I too had 10 labels at most.


@davisking
Contributor

davisking commented Feb 27, 2015 via email

@KanwalSingh
Author

Hey Davis, I understand that. I will share the training data by Monday, and will also see if we can share the code with you. Hope that's fine with you; please also share your personal email.

On Fri, Feb 27, 2015 at 6:22 PM, Davis E. King wrote:

Can you post your training data? :)

We (and a bunch of other groups) have been using MITIE to train models and haven't had any issues. So I need one of you guys to post a program that reproduces the issue you are having or it's going to be impossible to debug :)



@davisking
Contributor

davisking commented Feb 27, 2015 via email

@manalgandhi

@davisking, I won't be able to share the training data. Sorry!

@KanwalSingh, could you please share the training data you used with Davis?

@KanwalSingh
Author

@davisking I have mailed you the training data

@davisking
Contributor

Thanks. Please also include a working Python program that, when executed, causes this large RAM usage bug to appear so that I can debug it.

@KanwalSingh
Author

Done, mailed it to you.


@davisking
Contributor

I've looked this over and the problem is in the training data given to MITIE. However, before I explain the issue it's helpful to understand a little about how MITIE works. In MITIE, each sentence is chunked into entities and then each chunk is labeled by a multiclass classifier (specifically, a multiclass support vector machine). To classify each chunk, MITIE creates a 501354-dimensional vector which is given to the multiclass classifier.

Now the way the multiclass classifier works is that it learns one linear function for each label (and an additional one for the special 'not an entity' category). So if you have N labels then the classifier has 501354*(N+1) numbers it needs to learn. Moreover, since we use a cutting plane solver there is an additional factor of RAM usage in the solver; let's call it Z. The amount of RAM used by the multiclass trainer is 501354*(N+1)*Z*sizeof(double) + (the amount of RAM needed to store the training data). That means there are 501354*(N+1)*Z*sizeof(double) bytes of RAM usage no matter the size of your dataset.

The Z value is normally in the range 40-80. However, if you give input data that is basically impossible to correctly classify then the solver needs to work harder to find a way to separate it so Z might go up to about 200. It will also take a long time to train. In your case, you gave data with these 18 labels: 'builder', 'project', ' builder', 'size', 'location', 'price', ' price ', 'Infra', 'time', 'price psf', 'loction', ' time', ' size', ' location', ' infra', ' price', ' project', 'infra'.

Now what's happening is that MITIE is trying to figure out, for example, how to separate the ' price', 'price', and 'price ' labels, but this is probably impossible, as I'm sure you meant to give all of these the same label. The MITIE solver still tries, and it needs a large Z to build up a high-accuracy solution that can do this. So if Z=200 and N=18 then 501354*(N+1)*Z*sizeof(double) is about 14GB of RAM.
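
To make that figure concrete, here is a quick back-of-the-envelope check of the arithmetic (the dimension, N, and Z values come directly from the explanation above):

dims = 501354                          # feature vector dimension
N = 18                                 # distinct labels in the supplied data
Z = 200                                # solver factor for hard-to-separate data
ram_bytes = dims * (N + 1) * Z * 8     # sizeof(double) == 8
print ram_bytes / (1024.0 ** 3)        # about 14.2 GB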

So the solution is to fix your training data so that the labels make sense. If you do that then I would expect more like 2GB-4GB of RAM usage. I have also updated the MITIE code to print the labels supplied by the user, so if you pull from GitHub and rerun, it will show these labels and that should make this kind of error a lot easier to spot in the future.
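
A minimal sketch of the kind of label cleanup described here, assuming the raw label strings are available before they are handed to add_entity; normalize_label and the example list are illustrative, not part of MITIE:

def normalize_label(label):
    # Collapse variants like ' price', 'price ' and 'Price' into a single label.
    return label.strip().lower()

raw_labels = ['price', ' price ', ' price', 'location', ' location', 'loction']
print sorted(set(normalize_label(l) for l in raw_labels))
# -> ['location', 'loction', 'price']  (typos such as 'loction' still need a manual fix)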

@gagan-bansal

@davisking Thanks for such a detailed explanation of how MITIE works. It solved my issue with the long training duration.
I am using rasa_nlu with MITIE. There are a few issues (160 and 260) in rasa_nlu whose root cause may be the one you have explained here.

@ghost

ghost commented Dec 19, 2017

@gagan-bansal, did you solve your issue?
