
Add utf8 character support #12

Open
wants to merge 1 commit into master
Conversation

@5kg commented May 27, 2015

I tested this by feeding the model a Chinese novel, and it produced some interesting results.

@VitoVan commented Jun 12, 2015

Wow, this is just what I was going to do, thank you.

@5kg (Author) commented Jun 12, 2015

@VitoVan You can also try the original code on byte input without any modification.

In my experiments, the trained LSTM model can actually learn the UTF-8 encoding of Chinese characters; I didn't see any broken code points in the generated text.
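For readers curious what the character-level alternative looks like: the PR diff isn't shown above, so this is only a sketch of the general approach in Lua 5.1 / LuaJIT (which char-rnn uses), not the actual patch. A byte-range pattern splits a string into complete UTF-8 characters, and the vocabulary is then built over characters instead of raw bytes.

```lua
-- Sketch only (not the code from this PR): split a string into UTF-8
-- characters and build a character-level vocabulary in Lua 5.1 / LuaJIT.
local function utf8_chars(s)
  local chars = {}
  -- one lead byte (ASCII or 0xC2-0xF4) followed by any continuation bytes
  for ch in s:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
    chars[#chars + 1] = ch
  end
  return chars
end

local function build_vocab(text)
  local vocab, size = {}, 0
  for _, ch in ipairs(utf8_chars(text)) do
    if not vocab[ch] then
      size = size + 1
      vocab[ch] = size
    end
  end
  return vocab, size
end
```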

@VitoVan commented Jun 12, 2015

------ADDED 2015-6-12 16:43:56------
Sorry, I'm new to Lua, so the following may be a stupid question:
------ADDED 2015-6-12 16:43:56------

@5kg I haven't tried it yet. Well then, if the original code works well, what is the point of this pull request? To make learning faster on Chinese?

@inDream mentioned this pull request Jun 17, 2015
@karpathy (Owner) commented Jun 17, 2015
I assume this code is backward compatible with previous datasets?

@VitoVan commented Jun 17, 2015

I think so, haven't tested it.


@wb14123 commented Jun 17, 2015

This patch increases the vocab size a lot. I have a 16M dataset: the original code generates a vocab of size 230, but this code generates one of size 180,128, which would need 241G of memory to load.

@wb14123 commented Jun 17, 2015

I just realized that my dataset is not UTF-8. But this change may break support for input streams other than plain text. And the vocab generated from a UTF-8 dataset is also bigger than the original size.
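One quick way to check whether a dataset really is UTF-8 is to compare the byte-level vocabulary with the character-level one: on a non-UTF-8 (e.g. GBK or binary) file the character pattern mis-groups bytes and the "character" vocab explodes, which matches the numbers above. A rough diagnostic sketch (my own, not part of the patch; the sample path is the one shipped with char-rnn):

```lua
-- Rough diagnostic (not part of this PR): count distinct bytes vs. distinct
-- UTF-8 "characters" in a file. A huge character count over a small byte
-- alphabet usually means the file is not valid UTF-8.
local function vocab_sizes(path)
  local f = assert(io.open(path, "rb"))
  local data = f:read("*a")
  f:close()

  local bytes, nbytes = {}, 0
  for i = 1, #data do
    local b = data:byte(i)
    if not bytes[b] then bytes[b] = true; nbytes = nbytes + 1 end
  end

  local chars, nchars = {}, 0
  for ch in data:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
    if not chars[ch] then chars[ch] = true; nchars = nchars + 1 end
  end
  return nbytes, nchars
end

print(vocab_sizes("data/tinyshakespeare/input.txt"))
```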

@hughperkins (Contributor)

> @5kg I haven't tried it yet. Well then, if the original code works well, what is the point of this pull request? To make learning faster on Chinese?

Presumably the advantage is that the model doesn't have to spend effort learning how to construct UTF-8 code points, and won't ever emit invalid code points.

But the increase in vocab size will vastly increase the number of parameters in the fully connected Linear layers, as far as I can see. Based on my calculations at https://www.reddit.com/r/MachineLearning/comments/3ejizl/karpathy_charrnn_doubt/ctfndk6 , the number of weights is:

4 * rnn_size * ( vocab_size + rnn_size + 2 ) + (rnn_size + 1) * vocab_size

E.g., if rnn_size is 128 and vocab_size is 96, the number of weights is about 128K, which takes about 512 KB of memory (4 bytes per float).

But if vocab_size is 180,128, the number of weights is about 115M, which takes about 460 MB of memory.
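Those two numbers follow directly from the formula above; a short snippet to reproduce them:

```lua
-- Plug the two vocab sizes into the weight-count formula above.
local function n_weights(rnn_size, vocab_size)
  return 4 * rnn_size * (vocab_size + rnn_size + 2)
       + (rnn_size + 1) * vocab_size
end

print(n_weights(128, 96))      -- 128096    (~0.5 MB at 4 bytes per float)
print(n_weights(128, 180128))  -- 115528608 (~460 MB at 4 bytes per float)
```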

@hughperkins (Contributor)

Hmm, but actually, I don't remember there being that many Chinese characters. I think only 10 to 20 thousand are in normal use?

@InnovativeInventor

What is the status on this?
