
Seq2Seq Neural Network for Grammatical Error Correction

Introduction · Goal · Results · Questions



Introduction

(Back to top 🔼)

Seq2Seq Examples

Neural Network Structure

(figure)

Keras Layers

NMT model summary:
(figures)

Dataset

Incorrect: I'm fond to reading and dressing up myself.
Correct:   I'm fond of reading and dressing up myself.

Full dataset here.

Data Processing

Word     Index
I        85
learn    10
a        32
lot      25
with     76
it       23
.        4

1. Raw Data

'I learn a lot with it.'

2. Tokenize Data

['I', 'learn', 'a', 'lot', 'with', 'it', '.']

3. Build Vocab Dictionary

85, 10, 32, 25, 76, 23, 4

4. Pad Data

Data Padding

<Start> is the beginning of the sentence.

<End> is the end of the sentence.

<Pad> keeps the sentence length consistent.

Token     Index
<Pad>     0
<Start>   1
<End>     2

Encoder sentence (input):

[10, 17, 23, 4] → [10, 17, 23, 4, 2, 0, 0, 0]

Decoder sentence (input):

[5, 18, 38, 40, 44, 4] → [1, 5, 18, 38, 40, 44, 4, 0]
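
For concreteness, here is a minimal sketch of steps 1–4 using the toy word-to-index table above. The helper name encode, the maximum length of 8, and the naive period-splitting tokenizer are illustrative assumptions, not code from this repository.

```python
# Minimal sketch of the preprocessing pipeline (tokenize -> indices -> pad),
# assuming the toy word-to-index table shown above and max_len=8.
word_index = {"<Pad>": 0, "<Start>": 1, "<End>": 2,
              "I": 85, "learn": 10, "a": 32, "lot": 25,
              "with": 76, "it": 23, ".": 4}

def encode(sentence, max_len=8, add_start=False, add_end=False):
    tokens = sentence.replace(".", " .").split()                # 2. tokenize
    ids = [word_index[t] for t in tokens]                       # 3. look up indices
    if add_start:
        ids = [word_index["<Start>"]] + ids                     # prepend <Start>
    if add_end:
        ids = ids + [word_index["<End>"]]                       # append <End>
    return ids + [word_index["<Pad>"]] * (max_len - len(ids))   # 4. pad to max_len

print(encode("I learn a lot with it.", add_end=True))    # encoder-style input
# [85, 10, 32, 25, 76, 23, 4, 2]
print(encode("I learn a lot with it.", add_start=True))  # decoder-style input
# [1, 85, 10, 32, 25, 76, 23, 4]
```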

Embedding Layer

Embedding(input_dim, output_dim, weights=[embedding_matrix], trainable=[True | False])

  • input_dim: vocabulary size.
  • output_dim: output size, i.e. the dimension of the word vectors.
  • weights: pre-trained embedding weights.
  • trainable: if False, the layer's weights are frozen.
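
A minimal usage sketch of this layer: the vocabulary size of 5,000, the 300-dimensional vectors, and the zero-filled embedding_matrix are placeholders, not the values used in this project.

```python
import numpy as np
from tensorflow.keras.layers import Embedding

vocab_size, emb_dim = 5000, 300                      # placeholder sizes
embedding_matrix = np.zeros((vocab_size, emb_dim))   # stand-in for pre-trained word vectors

# Embedding layer initialised with the pre-trained matrix and frozen during training.
embedding_layer = Embedding(input_dim=vocab_size,
                            output_dim=emb_dim,
                            weights=[embedding_matrix],
                            trainable=False)
```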

LSTM Layer

LSTM(units, return_sequences=[True | False], return_state=[True | False])

  • units: hidden dimension (output size).
  • return_sequences: if True, return the output for every word; otherwise only the output of the last word is returned.
  • return_state: whether to also return the last hidden and cell states.
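
A small sketch illustrating the effect of the two flags; the sequence length of 8 and the dimensions 300/256 are placeholders.

```python
from tensorflow.keras.layers import Input, LSTM

x = Input(shape=(8, 300))   # 8 timesteps of 300-dimensional word vectors (placeholder sizes)

all_steps = LSTM(256, return_sequences=True)(x)           # (batch, 8, 256): one output per word
last_step = LSTM(256)(x)                                  # (batch, 256): output of the last word only
out, state_h, state_c = LSTM(256, return_state=True)(x)   # also returns the final hidden and cell states
```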

Dense Layer

Dense(units, activation=softmax)

  • units: number of decoder tokens (target vocabulary size).
  • activation: softmax over the target vocabulary.
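
Putting the three layers together, here is a minimal sketch of an encoder–decoder NMT-style model. The vocabulary sizes and dimensions are placeholders, and the exact architecture used in this repository may differ.

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

num_encoder_tokens, num_decoder_tokens = 10000, 10000   # placeholder vocabulary sizes
emb_dim, hidden_dim = 300, 256                          # placeholder dimensions

# Encoder: embed the incorrect sentence and keep only the final LSTM states.
encoder_inputs = Input(shape=(None,))
encoder_emb = Embedding(num_encoder_tokens, emb_dim)(encoder_inputs)
_, state_h, state_c = LSTM(hidden_dim, return_state=True)(encoder_emb)

# Decoder: embed the corrected sentence, initialise the LSTM with the encoder
# states, and predict a softmax distribution over the target vocabulary.
decoder_inputs = Input(shape=(None,))
decoder_emb = Embedding(num_decoder_tokens, emb_dim)(decoder_inputs)
decoder_outputs, _, _ = LSTM(hidden_dim, return_sequences=True, return_state=True)(
    decoder_emb, initial_state=[state_h, state_c])
outputs = Dense(num_decoder_tokens, activation="softmax")(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], outputs)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
```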

Dirty Data

Example:

"To discuss about private problems..."

False Positive     Correct
discuss about      discuss
explain about      explain
mention about      mention
describe about     describe

Effect of Cleaning Data

  • Clean every occurrence of false-positive data (a minimal cleaning sketch follows this list)

    {‘discuss about’, ‘explain about’, ‘mention about’, ‘describe about’}
    
  • Train a new model

  • Analyze the result of the two models
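
A minimal cleaning sketch, as referenced in the first bullet above. It assumes the sentences are plain strings; the helper name clean_sentence is hypothetical, and the phrase list comes from the table above.

```python
import re

# False-positive phrases and their corrected forms (from the table above).
FALSE_POSITIVES = {
    "discuss about": "discuss",
    "explain about": "explain",
    "mention about": "mention",
    "describe about": "describe",
}

def clean_sentence(sentence):
    """Replace every occurrence of a false-positive phrase with its correction."""
    for wrong, right in FALSE_POSITIVES.items():
        sentence = re.sub(r"\b%s\b" % re.escape(wrong), right, sentence)
    return sentence

print(clean_sentence("To discuss about private problems..."))
# To discuss private problems...
```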

Goal

(Back to top 🔼)

  • To use the Seq2Seq neural-network architecture from machine translation to perform grammatical error correction.

  • To use pre-trained word embeddings.

  • To analyze the effect of cleaning the data.

Results

(Back to top 🔼)

Method 1

(results figure)

Method 2

(results figure)

Method 3

(results figure)

Method 4

Model 1: (results figure)

Model 2: (results figure)

Method 5

(results figure)

Remarks:

When training starts, the validation loss gets lower and lower, but after a certain number of epochs the validation loss starts to increase while the training loss continues to decrease. This suggests the model may be overfitting: the weights keep adjusting themselves to better predict the training set without actually improving the model's chances of correctly predicting sentences that are not included in the training set.

Results using only 10,721 sentences and a small percentage of the validation set:

(results figure)

Some of the prediction results:

(figure)

Comparisons

(Back to top 🔼)

The comparison is done using the second method.

Training method:

(figure)

Clean data results:

(figure)

Dirty data results:

(figure)

Fraction of target sentences cleaned: 0.00022037905196938736 (≈ 0.022%)

Number of target sentences cleaned: 33

Clean data results after 20 epochs:

(figure)

Questions

Submit your questions and bug reports here.


© luowensheng.
