Introduction • Goal • Results • Questions
Incorrect | Correct |
---|---|
I'm fond to reading and dressing up myself. | I'm fond of reading and dressing up myself. |
Full dataset here.
Word | Index |
---|---|
I | 85 |
learn | 10 |
a | 32 |
lot | 25 |
with | 76 |
it | 23 |
. | 4 |
1. Raw Data
'I learn a lot with it.'
2. Tokenize Data
['I', 'learn', 'a', 'lot', 'with', 'it', '.']
3. Build Vocab Dictionary
85, 10, 32, 25, 76, 23, 4
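Steps 1–3 above can be sketched in plain Python. This is a minimal illustration that hard-codes the toy vocabulary from the table; a real pipeline would build the vocabulary from the whole corpus:

```python
# Minimal sketch of tokenizing and indexing, using the toy vocabulary
# from the table above (a real vocab is built from the corpus).
import re

vocab = {"I": 85, "learn": 10, "a": 32, "lot": 25, "with": 76, "it": 23, ".": 4}

def tokenize(sentence):
    # Split on whitespace and separate punctuation into its own token.
    return re.findall(r"\w+|[^\w\s]", sentence)

raw = "I learn a lot with it."
tokens = tokenize(raw)
indices = [vocab[t] for t in tokens]
print(tokens)   # ['I', 'learn', 'a', 'lot', 'with', 'it', '.']
print(indices)  # [85, 10, 32, 25, 76, 23, 4]
```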
4. Pad Data
<Start> marks the beginning of the sentence, <End> marks the end of the sentence, and <Pad> keeps the sentence length consistent.
Word | Index |
---|---|
<Pad> | 0 |
<Start> | 1 |
<End> | 2 |
[10, 17, 23, 4] → [10, 17, 23, 4, 2, 0, 0, 0]
[5, 18, 38, 40, 44, 4] → [1, 5, 18, 38, 40, 44, 4, 0]
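The padding step can be sketched as below. Note that the first example above appends <End> (2) while the second prepends <Start> (1) — in a teacher-forcing setup these would typically be the decoder target and decoder input, respectively, though the post does not say so explicitly, so the two flags here are an assumption:

```python
# Sketch of the padding step, using the special indices from the
# table above (<Pad>=0, <Start>=1, <End>=2) and a fixed length of 8.
PAD, START, END = 0, 1, 2
MAX_LEN = 8

def pad_sequence(ids, max_len=MAX_LEN, add_start=False, add_end=False):
    # Optionally prepend <Start> and/or append <End>, then pad with <Pad>.
    out = ([START] if add_start else []) + list(ids) + ([END] if add_end else [])
    return out + [PAD] * (max_len - len(out))

print(pad_sequence([10, 17, 23, 4], add_end=True))
# [10, 17, 23, 4, 2, 0, 0, 0]
print(pad_sequence([5, 18, 38, 40, 44, 4], add_start=True))
# [1, 5, 18, 38, 40, 44, 4, 0]
```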
Embedding(input_dim, output_dim, weights=[embedding_matrix], trainable=[True | False])
- input_dim: vocabulary size.
- output_dim: output size, i.e. the dimension of the word vectors.
- weights: pre-trained embedding weights.
- trainable: if False, the layer is frozen.
LSTM(units, return_sequences=[True | False], return_state=[True | False])
- units: hidden dimension (output size).
- return_sequences: if True, return the output at every timestep; otherwise, only the output of the last timestep is returned.
- return_state: whether to return the last hidden and cell states.
Dense(units, activation=)
- units: number of decoder tokens.
- activation: softmax.
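The three layers above can be wired into an encoder–decoder model roughly as follows. This is a minimal sketch: the vocabulary size, embedding dimension, and hidden size are placeholder values, not the post's actual hyperparameters, and the pre-trained embedding weights are omitted for brevity:

```python
# Sketch of a seq2seq model in Keras (hypothetical sizes; the original
# post does not give exact hyperparameters or embedding weights).
import numpy as np
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

vocab_size = 100   # assumed vocabulary size
embed_dim = 32     # assumed embedding dimension
hidden = 64        # assumed LSTM hidden size

# Encoder: embed source tokens, keep only the final hidden/cell states.
enc_in = Input(shape=(None,))
enc_emb = Embedding(vocab_size, embed_dim)(enc_in)
_, h, c = LSTM(hidden, return_state=True)(enc_emb)

# Decoder: embed target tokens, start from the encoder states, and
# return the output at every timestep for a per-token softmax.
dec_in = Input(shape=(None,))
dec_emb = Embedding(vocab_size, embed_dim)(dec_in)
dec_out = LSTM(hidden, return_sequences=True)(dec_emb, initial_state=[h, c])
probs = Dense(vocab_size, activation="softmax")(dec_out)

model = Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

src = np.array([[1, 5, 18, 2, 0]])   # padded source ids
tgt = np.array([[1, 5, 18, 2, 0]])   # padded target ids (teacher forcing)
pred = model.predict([src, tgt], verbose=0)
print(pred.shape)  # (1, 5, 100): a distribution over the vocab per token
```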
Example:
"To discuss about private problems..."
False Positive | Correct |
---|---|
discuss about | discuss |
explain about | explain |
mention about | mention |
describe about | describe |
- Clean every occurrence of false-positive data: {'discuss about', 'explain about', 'mention about', 'describe about'}
- Train a new model
- Analyze the results of the two models
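The cleaning step can be sketched with a simple regex substitution. This only handles the four base-form patterns listed above; inflected forms (e.g. "discussed about") would need extra patterns:

```python
# Sketch of the cleaning step: collapse the false-positive patterns
# ("discuss about" -> "discuss", etc.) in every target sentence.
import re

FALSE_POSITIVES = ["discuss", "explain", "mention", "describe"]
pattern = re.compile(r"\b(" + "|".join(FALSE_POSITIVES) + r")\s+about\b")

def clean(sentence):
    # Drop the redundant "about" after these verbs.
    return pattern.sub(r"\1", sentence)

print(clean("To discuss about private problems..."))
# To discuss private problems...
```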
- To use the Seq2Seq neural network architecture from machine translation to perform grammatical error correction.
- Word embedding: pre-trained word embeddings.
- To analyze the effect of cleaning the data.
Model 1:
Model 2:
When training starts, the validation loss decreases steadily, but after a certain number of epochs it begins to increase while the training loss continues to decrease. This suggests overfitting: the weights adjust to better predict the training set without actually improving the model's ability to predict sentences outside the training set.
Results using only 10721 sentences and a small percentage of the validation set:
Some of the prediction results:
The comparison is done using the second method.
Training method:
Clean data results:
Dirty data results:
Proportion of target sentences cleaned: 0.00022037905196938736
Number of target sentences cleaned: 33
Clean data results after 20 epochs
Submit your questions and bug reports here.
© luowensheng.