part1.py
- Read the CSV file into a dataframe (read the message data only).
- Extract only the mail content, since we don't need any other mail details.
- The mail content contains raw data other than plain sentences; filter this raw data out.
- Using the NLTK sentence tokenizer, split the email paragraphs into sentences. Preprocessing: (a) remove sentences that contain links or noise symbols (like ~~~, ----, *****, >>, ==); (b) remove sentences with fewer than 3 or more than 25 words. Before preprocessing the sentence count was 6,627,371; after preprocessing it is 3,672,744.
- Save the preprocessed sentences to a file (a sketch of this pipeline follows the list).
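
A minimal sketch of the part1.py pipeline under stated assumptions: the file name emails.csv, the column name message, the output file name, and the blank-line header/body split are illustrative guesses, not the project's actual choices.

```python
# Sketch of part1.py. File/column names and the header/body split are
# assumptions; requires nltk.download("punkt") for the sentence tokenizer.
import re
import pandas as pd
from nltk.tokenize import sent_tokenize

df = pd.read_csv("emails.csv", usecols=["message"])  # read message data only

def extract_body(raw):
    # Assumption: headers and body are separated by the first blank line.
    parts = raw.split("\n\n", 1)
    return parts[1] if len(parts) > 1 else parts[0]

# Links and noise symbols that mark a sentence as raw data, not prose.
noise = re.compile(r"https?://|www\.|~~~|----|\*{5}|>>|==")

sentences = []
for mail in df["message"]:
    for sent in sent_tokenize(extract_body(mail)):
        words = sent.split()
        # Keep only plain sentences of 3 to 25 words with no noise markers.
        if noise.search(sent) or not 3 <= len(words) <= 25:
            continue
        sentences.append(" ".join(words))

with open("preprocessed_sentences.txt", "w") as out:
    out.write("\n".join(sentences))
```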
Rules for actionable sentences (a rule-matcher sketch follows the examples below):
- Sentence starts with a VB (go, do, make).
- Sentence starts with a VB-Phrase.
Examples of verb phrases:
- VB-Phrase: {<RB><VB>} (carefully drive)
- VB-Phrase: {<UH><,><VB>} (Bah! go get) some work
- VB-Phrase: {<UH><,><VBP>} (Great! have fun)
- VB-Phrase: {<NN.?>+<,><VB>} (Virat, please mail) me the docs
- VB-Phrase: {<RB>+<,>*<VB>} (Just carefully listen)
- VB-Phrase: {<PRP><VB>} (you stop) this
- Sentence starts with "please".
- Sentence contains "please".
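
A sketch of a matcher for these rules using nltk.RegexpParser. The chunk grammar below is reconstructed from the examples above (the POS tags were garbled in the source), so the project's exact patterns may differ.

```python
# Rule-based actionability check; the chunk grammar is a reconstruction.
# Requires nltk.download("punkt") and nltk.download("averaged_perceptron_tagger").
import nltk

grammar = r"""
VB-Phrase: {<RB><VB>}
VB-Phrase: {<UH><,><VB>}
VB-Phrase: {<UH><,><VBP>}
VB-Phrase: {<NN.?>+<,><VB>}
VB-Phrase: {<RB>+<,>*<VB>}
VB-Phrase: {<PRP><VB>}
"""
chunker = nltk.RegexpParser(grammar)

def is_actionable(sentence):
    words = nltk.word_tokenize(sentence)
    if not words:
        return False
    # Rules 3 and 4: sentence starts with / contains "please".
    if "please" in (w.lower() for w in words):
        return True
    tagged = nltk.pos_tag(words)
    # Rule 1: sentence starts with a base-form verb (VB).
    if tagged[0][1] == "VB":
        return True
    # Rule 2: sentence starts with a VB-Phrase chunk.
    first = chunker.parse(tagged)[0]
    return isinstance(first, nltk.Tree) and first.label() == "VB-Phrase"

print(is_actionable("Please mail me the docs"))  # True
```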
part2.py
- Read the data from the file created by part1.py.
- Classify sentences into true (actionable) and false (non-actionable) classes according to the rules described above.
- For the time being, only 50,000 sentences per class have been classified.
- The classified sentences are saved into two separate files, one for the true (actionable) and one for the false (non-actionable) class (see the sketch below).
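
A sketch of part2.py, reusing the is_actionable() matcher sketched above; the input/output file names and the exact handling of the 50,000-per-class cap are assumptions.

```python
# Label sentences with the rules and write each class to its own file.
actionable, non_actionable = [], []

with open("preprocessed_sentences.txt") as f:
    for line in f:
        sentence = line.strip()
        if not sentence:
            continue
        bucket = actionable if is_actionable(sentence) else non_actionable
        if len(bucket) < 50000:  # for the time being, cap each class
            bucket.append(sentence)
        if len(actionable) == 50000 and len(non_actionable) == 50000:
            break

with open("actionable.txt", "w") as out:
    out.write("\n".join(actionable))
with open("non_actionable.txt", "w") as out:
    out.write("\n".join(non_actionable))
```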
part3.py
Following is the flow for objective 2 (a model sketch follows the list):
- Text data
- Embedding
- Deep network (GRU)
- Fully connected layer
- Output layer (sigmoid)
- Final output
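
A Keras sketch consistent with this flow and with the parameter counts in the summary below: 1,673,150 embedding parameters over 50 dimensions imply a vocabulary of 33,463; 7,968 GRU parameters match 32 units over 50-dimensional inputs (older Keras GRU parameterization); the remaining 33 parameters correspond to a Dense(1) sigmoid layer. The optimizer, loss, and training-call details are assumptions.

```python
# Embedding -> GRU -> fully connected sigmoid output, sized to match the
# summary below; hyperparameters other than the layer sizes are assumed.
from keras.models import Sequential
from keras.layers import Embedding, GRU, Dense

vocab_size = 33463    # inferred: 1,673,150 embedding params / 50 dims
max_len = 30          # tokens per padded sentence
embedding_dim = 50

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=max_len))
model.add(GRU(32))                          # deep network (GRU)
model.add(Dense(1, activation="sigmoid"))   # fully connected output layer

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["acc"])
model.summary()

# x_train: (98486, 30) padded token ids; y_train: (98486,) 0/1 labels
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=25, verbose=2)
```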
Summary of the built model:

```
Layer (type)                 Output Shape        Param #
embedding_3 (Embedding)      (None, 30, 50)      1673150
gru_3 (GRU)                  (None, 32)          7968
dense_3 (Dense)              (None, 1)           33
Total params: 1,681,151
Trainable params: 1,681,151
Non-trainable params: 0
```
Training log:

```
(98486, 30) (98486,)
Train on 98486 samples, validate on 1520 samples
Epoch 1/25
 - 182s - loss: 0.3617 - acc: 0.7993 - val_loss: 0.7880 - val_acc: 0.7750
Epoch 2/25
 - 170s - loss: 0.0851 - acc: 0.9762 - val_loss: 0.9968 - val_acc: 0.7743
```