
char_indices, indices_char for custom training text? #9

Closed

shiffman opened this issue Oct 26, 2017 · 11 comments

@shiffman (Member)

In class this week, to compare with my Markov chain example, I ran through training an LSTM model using train.py with itp.txt. I've updated the code in this repo to reflect this. The steps I took were:

  1. Run train.py
  2. Run json_checkpoints_var.py
  3. Swap out hamlet.js in the example for a new itp.js.

I realize this is a tiny, tiny dataset with very little training, so getting good results isn't really possible. However, I got nonsense results containing characters that aren't in the original dataset. I imagine this has something to do with char_indices and indices_char. Is there a way to auto-generate these during the training process? (They would be different depending on the characters used in the training data, yes?)

@cvalenzuela (Member)

The dictionary is auto-generated during the training process, but if you auto-generate it again at inference time there are discrepancies in how JavaScript and Python sort characters. The mapping must be identical on both sides for the LSTM to work.

This was my initial approach:

In Python, using hamlet.txt in train.py:

chars = sorted(list(set(text)))                           # unique chars, sorted by code point
char_indices = dict((c, i) for i, c in enumerate(chars))  # char -> index

returns:

{' ': 0, '!': 1, '&': 2, "'": 3, ',': 4, '-': 5, '.': 6, '1': 7, '2': 8, ':': 9, ';': 10, '?': 11, '[': 12, ']': 13, 'a': 14, 'b': 15, 'c': 16, 'd': 17, 'e': 18, 'f': 19, 'g': 20, 'h': 21, 'i': 22, 'j': 23, 'k': 24, 'l': 25, 'm': 26, 'n': 27, 'o': 28, 'p': 29, 'q': 30, 'r': 31, 's': 32, 't': 33, 'u': 34, 'v': 35, 'w': 36, 'x': 37, 'y': 38, 'z': 39}

But the equivalent in lstm.js:

let chars = Array.from(new Set(Array.from(text))).sort(); // unique chars, sorted by UTF-16 code unit
let char_indices = chars.reduce((acc, cur, i) => {
  acc[cur] = i; // char -> index
  return acc;
}, {});

returns:

{ '1': 8, '2': 9, '\n': 0, ' ': 1, '!': 2, '&': 3, '\'': 4, ',': 5, '-': 6, '.': 7, ':': 10, ';': 11, '?': 12, '[': 13, ']': 14, a: 15, b: 16, c: 17, d: 18, e: 19, f: 20, g: 21, h: 22, i: 23, j: 24, k: 25, l: 26, m: 27, n: 28, o: 29, p: 30, q: 31, r: 32, s: 33, t: 34, u: 35, v: 36, w: 37, x: 38, y: 39, z: 40 }

That's why I just copied the dictionary from train.py into the JS. I'm not sure if there's a better approach to this.

@shiffman (Member Author)

Ah, I see. Perhaps we could add something to either train.py or json_checkpoints_var.py that generates a JSON file with the character tables? That would at least remove the manual copying and be less prone to error.
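
A minimal sketch of that idea, assuming it runs at the end of train.py (the chars.json filename is an assumption, not something the repo currently uses):

import json

# Rebuild the vocabulary exactly as train.py does.
text = open('itp.txt').read()
chars = sorted(list(set(text)))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# Write both tables next to the checkpoints so lstm.js can load the exact
# mapping used in training instead of re-deriving (and re-sorting) it.
with open('chars.json', 'w') as f:
    json.dump({'char_indices': char_indices,
               'indices_char': indices_char}, f)

(JSON object keys are always strings, so indices_char's integer keys come out as "0", "1", … on the JS side.)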

@shiffman (Member Author)

I did a round of work on this in 1b8160d and 93e4fe8. I'm not getting good results, but this is most likely due to the tiniest dataset ever and training for only 50 epochs. @cvalenzuela, can you look over what I've done to training and examples/lstm_1 and see if you notice anything awry?

@shiffman (Member Author)

I also added a README with instructions on doing the training if you want to take a look.

https://github.com/ITPNYU/p5-deeplearn-js/blob/master/training/lstm/README.md

@cvalenzuela (Member)

nice!
I'm looking at the example now

@cvalenzuela (Member)

So I did some tests and I think I got better results. Here's what I did:

I trained a new model with the itp.txt data. In train.py I updated the length of the dictionary, which always needs to match the source text:

NLABELS = len(chars)  # dictionary length, derived from the source text

and trained it for 1000 epochs, since it's a really small source text.

In lstm.js, the onehot variable needs to have a shape that matches the dictionary length:

const onehot = track(deeplearn.Array2D.zeros([1, 32])); // 32 = dictionary length for itp.txt
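
For reference, a rough NumPy analogue of what that line sets up, just to show why the second dimension has to equal the dictionary length ('a' is an arbitrary example character):

import numpy as np

text = open('itp.txt').read()
chars = sorted(list(set(text)))
char_indices = dict((c, i) for i, c in enumerate(chars))

vocab_size = len(chars)             # 32 for itp.txt; avoids hard-coding
onehot = np.zeros((1, vocab_size))  # same shape as the Array2D above
onehot[0, char_indices['a']] = 1.0  # one-hot encode the character 'a'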

Here is what I'm getting from the lstm_1 example

[Screenshot: generated text output from the lstm_1 example, 2017-10-26 19:54]

I'll push the code now

@shiffman (Member Author)

awesome, that's great! Thank you! Can the onehot variable pull its shape dynamically?

@shiffman (Member Author)

Oh, and feel free to add any of these details to the README in train/!

@cvalenzuela (Member)

Maybe in the training process train.py could output just one file that has all the variables and objects, which could later be imported into the sketch file?
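
A sketch of what that single file could look like, written at the end of train.py (manifest.json and its keys are assumptions, not an existing format in this repo):

import json

text = open('itp.txt').read()
chars = sorted(list(set(text)))

manifest = {
    'nlabels': len(chars),  # lets lstm.js size the onehot tensor dynamically
    'char_indices': {c: i for i, c in enumerate(chars)},
    'indices_char': {i: c for i, c in enumerate(chars)},
}

with open('manifest.json', 'w') as f:
    json.dump(manifest, f)

Reading nlabels from a file like this at load time would also answer the question above about pulling the onehot shape dynamically.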

We could also try other RNN implementations that use words instead of characters, maybe using something like this, this, or this. I'll post how it goes.

@cvalenzuela (Member)

I updated the README in train/ to reflect these changes.

@shiffman (Member Author)

This is great! I made a small change in the code (2dbaf98) to skip saving the model at step 0. Or maybe it's a good idea to leave that in, since you get something immediately even if it's nonsense. In any case, I'm closing this issue for now! We can open new ones as things come up re: LSTM training and generation examples!
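
The guard amounts to something like this (a hypothetical sketch; the loop and the save stand-in are placeholders, not the actual diff in 2dbaf98):

num_steps = 1000
checkpoint_every = 100

for step in range(num_steps):
    # ... run one optimization step here ...
    # Save periodically, but skip the untrained step-0 checkpoint.
    if step > 0 and step % checkpoint_every == 0:
        print('saving checkpoint at step', step)  # stand-in for the real save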
