
Seq2Seq Neural Network for Grammatical Error Correction

Introduction · Goal · Results · Questions



Introduction

(Back to top 🔼)

Seq2Seq Examples

Neural Network Structure

(figure)

Keras Layers

NMT model summary:
(figures)

Dataset

Incorrect: I'm fond to reading and dressing up myself.
Correct:   I'm fond of reading and dressing up myself.

Full dataset here.

Data Processing

Word     Index
I        85
learn    10
a        32
lot      25
with     76
it       23
.        4

1. Raw Data

'I learn a lot with it.'

2. Tokenize Data

['I', 'learn', 'a', 'lot', 'with', 'it', '.']

3. Build Vocab Dictionary

85, 10, 32, 25, 76, 23, 4

4. Pad Data

Data Padding

<Start> is the beginning of the sentence.

<End> is the end of the sentence.

<Pad> keeps the sentence length consistent.

Token     Index
<Pad>     0
<Start>   1
<End>     2

Encoder sentence (input):

[10, 17, 23, 4] → [10, 17, 23, 4, 2, 0, 0, 0]

Decoder sentence (input):

[5, 18, 38, 40, 44, 4] → [1, 5, 18, 38, 40, 44, 4, 0]
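
For concreteness, here is a minimal sketch of steps 1–4 using the toy word-to-index table above. The helper name encode, the maximum length of 8, and the naive period-splitting tokenizer are illustrative assumptions, not code from this repository.

```python
# Minimal sketch of the preprocessing pipeline (tokenize -> indices -> pad),
# assuming the toy word-to-index table shown above and max_len=8.
word_index = {"<Pad>": 0, "<Start>": 1, "<End>": 2,
              "I": 85, "learn": 10, "a": 32, "lot": 25,
              "with": 76, "it": 23, ".": 4}

def encode(sentence, max_len=8, add_start=False, add_end=False):
    tokens = sentence.replace(".", " .").split()                # 2. tokenize
    ids = [word_index[t] for t in tokens]                       # 3. look up indices
    if add_start:
        ids = [word_index["<Start>"]] + ids                     # prepend <Start>
    if add_end:
        ids = ids + [word_index["<End>"]]                       # append <End>
    return ids + [word_index["<Pad>"]] * (max_len - len(ids))   # 4. pad to max_len

print(encode("I learn a lot with it.", add_end=True))    # encoder-style input
# [85, 10, 32, 25, 76, 23, 4, 2]
print(encode("I learn a lot with it.", add_start=True))  # decoder-style input
# [1, 85, 10, 32, 25, 76, 23, 4]
```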

Embedding Layer

Embedding(input_dim, output_dim, weights=[embedding_matrix], trainable=[True | False])

  • input_dim: vocabulary size.
  • output_dim: output size, i.e. the dimension of the word vectors.
  • weights: pre-trained embedding weights.
  • trainable: if False, the layer's weights are frozen.
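
A minimal usage sketch of this layer: the vocabulary size of 5,000, the 300-dimensional vectors, and the zero-filled embedding_matrix are placeholders, not the values used in this project.

```python
import numpy as np
from tensorflow.keras.layers import Embedding

vocab_size, emb_dim = 5000, 300                      # placeholder sizes
embedding_matrix = np.zeros((vocab_size, emb_dim))   # stand-in for pre-trained word vectors

# Embedding layer initialised with the pre-trained matrix and frozen during training.
embedding_layer = Embedding(input_dim=vocab_size,
                            output_dim=emb_dim,
                            weights=[embedding_matrix],
                            trainable=False)
```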

LSTM Layer

LSTM(units, return_sequences=[True | False], return_state=[True | False])

  • units: hidden dimension (output size).
  • return_sequences: if True, return the output for every word; otherwise only the output of the last word is returned.
  • return_state: whether to also return the last hidden and cell states.
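
A small sketch illustrating the effect of the two flags; the sequence length of 8 and the dimensions 300/256 are placeholders.

```python
from tensorflow.keras.layers import Input, LSTM

x = Input(shape=(8, 300))   # 8 timesteps of 300-dimensional word vectors (placeholder sizes)

all_steps = LSTM(256, return_sequences=True)(x)           # (batch, 8, 256): one output per word
last_step = LSTM(256)(x)                                  # (batch, 256): output of the last word only
out, state_h, state_c = LSTM(256, return_state=True)(x)   # also returns the final hidden and cell states
```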

Dense Layer

Dense(units, activation=softmax)

  • units: number of decoder tokens (target vocabulary size).
  • activation: softmax over the target vocabulary.
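
Putting the three layers together, here is a minimal sketch of an encoder–decoder NMT-style model. The vocabulary sizes and dimensions are placeholders, and the exact architecture used in this repository may differ.

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

num_encoder_tokens, num_decoder_tokens = 10000, 10000   # placeholder vocabulary sizes
emb_dim, hidden_dim = 300, 256                          # placeholder dimensions

# Encoder: embed the incorrect sentence and keep only the final LSTM states.
encoder_inputs = Input(shape=(None,))
encoder_emb = Embedding(num_encoder_tokens, emb_dim)(encoder_inputs)
_, state_h, state_c = LSTM(hidden_dim, return_state=True)(encoder_emb)

# Decoder: embed the corrected sentence, initialise the LSTM with the encoder
# states, and predict a softmax distribution over the target vocabulary.
decoder_inputs = Input(shape=(None,))
decoder_emb = Embedding(num_decoder_tokens, emb_dim)(decoder_inputs)
decoder_outputs, _, _ = LSTM(hidden_dim, return_sequences=True, return_state=True)(
    decoder_emb, initial_state=[state_h, state_c])
outputs = Dense(num_decoder_tokens, activation="softmax")(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], outputs)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
```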

Dirty Data

Example:

"To discuss about private problems..."

False Positive     Correct
discuss about      discuss
explain about      explain
mention about      mention
describe about     describe

Effect of Cleaning Data

  • Clean every occurrence of false-positive data (a minimal cleaning sketch follows this list)

    {‘discuss about’, ‘explain about’, ‘mention about’, ‘describe about’}
    
  • Train a new model

  • Analyze the result of the two models
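
A minimal cleaning sketch, as referenced in the first bullet above. It assumes the sentences are plain strings; the helper name clean_sentence is hypothetical, and the phrase list comes from the table above.

```python
import re

# False-positive phrases and their corrected forms (from the table above).
FALSE_POSITIVES = {
    "discuss about": "discuss",
    "explain about": "explain",
    "mention about": "mention",
    "describe about": "describe",
}

def clean_sentence(sentence):
    """Replace every occurrence of a false-positive phrase with its correction."""
    for wrong, right in FALSE_POSITIVES.items():
        sentence = re.sub(r"\b%s\b" % re.escape(wrong), right, sentence)
    return sentence

print(clean_sentence("To discuss about private problems..."))
# To discuss private problems...
```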

Goal

(Back to top 🔼)

  • To use the Seq2Seq neural-network architecture from machine translation to perform grammatical error correction.

  • To use pre-trained word embeddings.

  • To analyze the effect of cleaning the data.

Results

(Back to top 🔼)

Method 1

(results figure)

Method 2

(results figure)

Method 3

(results figure)

Method 4

Model 1: (results figure)

Model 2: (results figure)

Method 5

(results figure)

Remarks:

When training starts, the validation loss gets lower and lower, but after a certain number of epochs the validation loss starts to increase while the training loss continues to decrease. This suggests the model may be overfitting: the weights keep adjusting themselves to better predict the training set without actually improving the model's chances of correctly predicting sentences that are not included in the training set.

Results using only 10,721 sentences and a small percentage of the validation set:

(results figure)

Some of the prediction results:

(figure)

Comparisons

(Back to top 🔼)

The comparison is done using the second method.

Training method:

(figure)

Clean data results:

(figure)

Dirty data results:

(figure)

Fraction of target sentences cleaned: 0.00022037905196938736 (≈ 0.022%)

Number of target sentences cleaned: 33

Clean data results after 20 epochs:

(figure)

Questions

Submit your questions and bug reports here.


© luowensheng.
