# NLP-based text entity extraction - French

### Input 

In [1]:
import os
import utils
import spacy
from spacy import displacy
import pandas as pd
import numpy as np

In [2]:
utils.PDF('assets/Use_caes_NLP.pdf',size=(600,500))

### Expected output to be extracted

In [3]:
utils.PDF('assets/output.pdf',size=(600,500))

# My Solution

## The simple solution is to train a custom spaCy NER model on top of the ***fr_core_news_md*** French language model.

    - Augmentation was needed in any case, so I augmented the given input data as follows.

##### The actual training inputs look like this; the augmented values go into the {xxxxxx} and {xxxx / 0.000} placeholders.

<br>

In [5]:
utils.load_text('assets/input_augment.txt', print_=True)

POS. DESCRIPTION QUANTITE   COMMANDEE QUANTITE   CONFIRMEE PRIX /  UNITE  NET /  UNITE  MONTANT  NET  10  PFWVTG001005000803 / PWVTGG50080 Z (*) Vis à panneaux PFS+ TF-T * 5,00x80 Zn (T25)  double tête plate six lobes int. avec nervures  filet partiel - pointe coupante T17  emballage standard Art. client: {xxxxxx} QTE: 200 Pce  Demandé: 2.000 Pce Arrondi:  2.400 Pce  Stock: 2.400 Pce  {xxxx / 0.000} Pce  -2 %  {xxxx / 0.000}   Pce  117,53   20  PFWVTV001004500603 / PWVTVK45060 Z (*) Vis à panneaux PFS+ TF-T * 4,50x60 Zn (T20)  double tête plate six lobes int. avec nervures  filet total - pointe coupante T17  emballage petit Art. client: {xxxxxx} QTE: 200 Pce  2.400 Pce Stock: 2.400 Pce  {xxxx / 0.000} Pce {xxxx / 0.000}  68,71   Sous Total 188,64   Remise cumulée -2,40   Montant net 186,24   TVA (21%) 39,11   Montant total 225,35  Poids brut 27,600 KG
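
The placeholder filling described above can be sketched roughly as follows. This is a minimal illustration and not the code behind `utils`: the function name `fill_template`, the labels `ART_CLIENT` and `UNIT_PRICE`, the value ranges, and the fixed `/ 1.000` suffix are all assumptions made for the example. It returns each filled text together with character offsets, which is the shape spaCy NER training data usually takes.

```python
import random
import re

# The two placeholder tags that appear in the augmented template.
PLACEHOLDER = re.compile(r"\{xxxxxx\}|\{xxxx / 0\.000\}")


def fill_template(template, rng):
    """Fill every placeholder with a random value and record its character span.

    Returns (text, entities), where entities is a list of (start, end, label).
    The labels are hypothetical names chosen for this sketch.
    """
    parts = []
    entities = []
    cursor = 0    # position in the template
    out_len = 0   # length of the output built so far
    for match in PLACEHOLDER.finditer(template):
        literal = template[cursor:match.start()]
        parts.append(literal)
        out_len += len(literal)
        if match.group() == "{xxxxxx}":
            # Assumed format: a six-digit client article number.
            value = str(rng.randrange(100000, 1000000))
            label = "ART_CLIENT"
        else:
            # Assumed format: a price with a decimal comma, per 1.000 pieces.
            value = f"{rng.randrange(1, 100)},{rng.randrange(0, 100):02d} / 1.000"
            label = "UNIT_PRICE"
        parts.append(value)
        entities.append((out_len, out_len + len(value), label))
        out_len += len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), entities


# A short excerpt of the template above, enough to show the idea.
TEMPLATE = "Art. client: {xxxxxx} QTE: 200 Pce  {xxxx / 0.000} Pce  -2 %  {xxxx / 0.000}   Pce"

rng = random.Random(42)  # seeded so that the samples are reproducible
samples = [fill_template(TEMPLATE, rng) for _ in range(100)]

text, entities = samples[0]
for start, end, label in entities:
    print(label, repr(text[start:end]))
```

Tracking the offsets while the text is being built avoids searching for the values afterwards, which would be ambiguous when the same number shows up twice in one sample.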


<br>

##### After training the model on 100 augmented samples for one epoch, this is the output.

In [4]:
test_input = utils.load_text('assets/test_input.txt')
french_model = utils.load_model('fr_model2')
output = utils.inference(french_model, test_input)


Loaded ('fr', 'core_news_md') model 



<br><br><br><br><br><br><br><br><br><br>

# Approaches

#### Below are the approaches I tried to extract the mentioned information from the document given above.

## 1. Camelot
    - Even before going ahead with NLP, I tried a simple CV-based approach using Camelot, which internally leverages OpenCV.

##### But, as expected, the results weren't promising.

In [25]:
# as-is extraction
df0 = pd.read_parquet('assets/camelot_plain-0.parquet', engine='fastparquet')
df0

Unnamed: 0,0,1
0,Tél. +32 (0)9 272 70 70 \n \ninfo@pgb-europe.com,
1,Fax +32 (0)9 272 70 99 \n \nwww.pgb-europe.com,Société Certifiée ISO 9001
2,CONFIRMATION D'ORDRE - 301437,
3,,Plasticentre sa
4,Date du document\n02/05/2022,
5,,Lid Menouquin :115 /PLH
6,Numéro de client\n1000293,
7,,Chée de Russeignies 11
8,Numéro de TVA client\nBE0415809702,
9,,9600 Renaix


In [26]:
# table we are interested in
df1 = pd.read_parquet('assets/camelot_plain-1.parquet', engine='fastparquet')
df1

Unnamed: 0,0,1,2,3,4,5,6,7
0,,,QUANTITE,QUANTITE,PRIX /,,NET /,MONTANT
1,POS.,DESCRIPTION,,,,,,
2,,,COMMANDEE,CONFIRMEE,UNITE,,UNITE,NET
3,,PFWVTG001005000803 / PWVTGG50080 Z (*),,,,,,
4,,"Vis à panneaux PFS+ TF-T * 5,00x80 Zn (T25)",,,,,,
5,,double tête plate six lobes int. avec nervures,,,,,,
6,,,Demandé:,,,,,
7,,,,,"49,97 €",,"48,97 €",
8,,filet partiel - pointe coupante T17,,,,,,
9,,,2.000 Pce,Stock:,,,,
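
One way to see why this output is hard to use: a single order line is spread over many physical rows, so the description and its price never end up in the same row. The quick check below uses only the rows printed above, copied in as CSV; the column positions (1 for the description, 4 for the unit price) are read off the header rows and are an assumption of this sketch.

```python
import csv
import io

# Rows 3-9 of the Camelot table shown above, copied verbatim.
CAMELOT_ROWS = '''3,,PFWVTG001005000803 / PWVTGG50080 Z (*),,,,,,
4,,"Vis à panneaux PFS+ TF-T * 5,00x80 Zn (T25)",,,,,,
5,,double tête plate six lobes int. avec nervures,,,,,,
6,,,Demandé:,,,,,
7,,,,,"49,97 €",,"48,97 €",
8,,filet partiel - pointe coupante T17,,,,,,
9,,,2.000 Pce,Stock:,,,,
'''

DESCRIPTION_COL = 1  # assumed from the "DESCRIPTION" header cell
PRICE_COL = 4        # assumed from the "PRIX / UNITE" header cells

rows_with_description = []
rows_with_price = []
for index, *cells in csv.reader(io.StringIO(CAMELOT_ROWS)):
    if cells[DESCRIPTION_COL]:
        rows_with_description.append(int(index))
    if cells[PRICE_COL]:
        rows_with_price.append(int(index))

# Rows that carry both a description and a price.
rows_with_both = sorted(set(rows_with_description) & set(rows_with_price))

print("description rows:", rows_with_description)
print("price rows:", rows_with_price)
print("rows with both:", rows_with_both)
```

Stitching these fragments back into one record per order line would need layout heuristics of its own, which is what pushed me towards plain text extraction instead.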


## 2. Apache Tika 
    - Used Apache Tika to extract the text content from the PDF file.

In [None]:
# Here is the extracted content

utils.load_text('assets/tika_text.txt', print_=True)

##### The text extraction was good, so I went ahead with custom NER training.


## 3. Custom NER training in spaCy

    - As a first cut, I passed in all of the extracted text content. My first impression was that the model hadn't learned anything at all, because it couldn't find any entities when given the full text.
<br>

##### My reasoning for why it didn't learn anything was this
***There was a lot of contextually irrelevant content in the header and footer, which confuses the model.***

In [9]:
test_input = utils.load_text('assets/tika_text.txt')
french_model = utils.load_model('fr_model')
output = utils.inference(french_model, test_input)


Loaded ('fr', 'core_news_md') model 



## 4. Transformer model to the rescue.
    - Translated the French text to English so that it could be used to train a Transformer model.
<br><br>

***The results from the Transformer model are even worse.***

In [2]:
test_input = utils.load_text('assets/tika_text.txt')
transformer_model = utils.load_model('eng_transformers')
output = utils.inference(transformer_model, test_input, transformer=True)


Loaded ('en', 'core_web_trf') model 



<br><br><br><br><br><br>
# This is the actual hack behind the solution.

## Just use the relevant content to train and test.

***Rather than this***

In [5]:
utils.load_text('assets/tika_text.txt')

"Numéro de TVA : BE 0425.888.396 RPR GENT KBC IBAN BE50 4459 6389 4118 BIC KREDBEBB  ING IBAN BE98 3900 3232 4293 BIC BBRUBEBB  BNP IBAN BE78 2900 1606 0086 BIC GEBABEBB  BEL IBAN BE30 5645 1378 2011 BIC GKCCBEBB 1 / 2  pgb-Europe nv  Gontrode Heirweg 170  9090 Melle - Belgium    Tél. +32 (0)9 272 70 70  info@pgb-europe.com  Fax +32 (0)9 272 70 99  www.pgb-europe.com  Société Certifiée ISO 9001  CONFIRMATION D'ORDRE - 301437  Date du document 02/05/2022 Numéro de client 1000293 Numéro de TVA client BE0415809702 Votre réf. 13639 E-mail florine.fernez@plasticentre.b  e Tél. 069532060  Plasticentre sa  Lid Menouquin :115  /PLH   Chée de Russeignies 11  9600 Renaix  Adresse de facturation Menouquin sv Pro  Groep Meno0  Vieux Chemin de Thines 14  1400 Nivelles  Adresse de livraison Plasticentre  003147999  Rue De L'Artisanat 11  7900 Leuze-en-Hainaut  Vous pourrez trouver nos conditions générales de vente et de livraison sur ce  document ou via le lien suivant :   https://www.pgb-europe.com

<br>
<br>

***Just use this***

In [7]:
utils.load_text('assets/test_input.txt')

'POS. DESCRIPTION QUANTITE   COMMANDEE QUANTITE   CONFIRMEE PRIX /  UNITE  NET /  UNITE  MONTANT  NET  10  PFWVTG001005000803 / PWVTGG50080 Z (*) Vis à panneaux PFS+ TF-T * 5,00x80 Zn (T25)  double tête plate six lobes int. avec nervures  filet partiel - pointe coupante T17  emballage standard Art. client: 924734 QTE: 200 Pce  Demandé: 2.000 Pce Arrondi:  2.400 Pce  Stock: 2.400 Pce  49,97 / 1.000   Pce  -2 %  48,97 / 1.000   Pce  117,53   20  PFWVTV001004500603 / PWVTVK45060 Z (*) Vis à panneaux PFS+ TF-T * 4,50x60 Zn (T20)  double tête plate six lobes int. avec nervures  filet total - pointe coupante T17  emballage petit Art. client: 929735 QTE: 200 Pce  2.400 Pce Stock: 2.400 Pce  28,63 / 1.000 Pce 28,63 / 1.000 Pce  68,71   Sous Total 188,64   Remise cumulée -2,40   Montant net 186,24   TVA (21%) 39,11   Montant total 225,35  Poids brut 27,600 KG'
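
Cutting the relevant part out of the full extracted text could look roughly like this. It is a sketch under assumptions: I take the table to start at the `POS. DESCRIPTION` header and to end at the `Poids brut ... KG` line, as in the sample above, and `crop_order_table` is a hypothetical helper name rather than something from `utils`. A document whose table continues across pages would need more care.

```python
import re

# Assumed boundaries of the order table, taken from the sample above.
TABLE_START = "POS. DESCRIPTION"
TABLE_END = re.compile(r"Poids brut\s+[\d.,]+\s+KG")


def crop_order_table(full_text):
    """Return only the order-table part of the extracted text, or None if absent."""
    start = full_text.find(TABLE_START)
    if start == -1:
        return None
    end_match = TABLE_END.search(full_text, start)
    # If the closing marker is missing, keep everything after the table header.
    end = end_match.end() if end_match else len(full_text)
    return full_text[start:end].strip()


# Shortened stand-in for the full text: header noise, table, footer noise.
full_text = (
    "Numéro de TVA : BE 0425.888.396 RPR GENT KBC IBAN BE50 4459 6389 4118 "
    "CONFIRMATION D'ORDRE - 301437  "
    "POS. DESCRIPTION QUANTITE   COMMANDEE Art. client: 924734 QTE: 200 Pce  "
    "Montant total 225,35  Poids brut 27,600 KG  "
    "Vous pourrez trouver nos conditions générales de vente"
)

print(crop_order_table(full_text))
```

The same cropping has to be applied to both the training samples and the text passed in at inference time, otherwise the model sees a different kind of input than it was trained on.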

<br><br>
## 5. Transformer model, but with only the most relevant content.

***Translated the French text to English, but only the part that is contextually necessary for training.***
<br><br>
***Still, the results from the Transformer aren't that great. There is no real difference; in fact, it's not even close.***

In [4]:
test_input = utils.load_text('assets/tika_text.txt')
transformer_model = utils.load_model('eng_transformers2')
output = utils.inference(transformer_model, test_input, transformer=True)


Loaded ('en', 'core_web_trf') model 



## 6. Train the same old *fr_core_news_md* French model, but with the new kind of data mentioned above.

***Voilà! Something finally seems to work.***
<br><br>
***Results are great!!!***

In [8]:
test_input = utils.load_text('assets/test_input.txt')
french_model = utils.load_model('fr_model2')
output = utils.inference(french_model, test_input)


Loaded ('fr', 'core_news_md') model 



## 6.1. Comparing the *fr_core_news_md* French model with the Transformer model.

<br><br>
***!!!***

#### Transformers don't live up to their name in this particular task. (But that doesn't mean they are not capable!)

In [8]:
test_input = utils.load_text('assets/test_input.txt')
transformer_model = utils.load_model('eng_transformers2')
output = utils.inference(transformer_model, test_input, transformer=True)


Loaded ('en', 'core_web_trf') model 
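
To put a number on a comparison like this instead of eyeballing the rendered entities, span-level precision, recall and F1 are a common choice. The sketch below is not part of `utils`; `span_prf` is a hypothetical helper, and the gold and predicted spans are made-up placeholders, not results from these models. With spaCy, predicted spans could be collected as `(ent.start_char, ent.end_char, ent.label_)` for each `ent` in `doc.ents`.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold_set = set(gold)
    predicted_set = set(predicted)
    true_positives = len(gold_set & predicted_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up spans for illustration only; the labels are hypothetical.
gold_spans = [(13, 19, "ART_CLIENT"), (40, 53, "UNIT_PRICE")]
predicted_spans = [(13, 19, "ART_CLIENT"), (60, 66, "UNIT_PRICE")]

precision, recall, f1 = span_prf(gold_spans, predicted_spans)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is strict: a span that is off by one character counts as both a miss and a false alarm, which suits this task because a price or article number is only useful when it is extracted in full.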



<br><br><br><br><br><br>

# The most important revelation to me!!!

**The simple medium-sized French model was able to capture the given entities, with no need for Transformer models. It just needs concise, contextually relevant training samples.**


In [9]:
test_input = utils.load_text('assets/test_input.txt')
french_model = utils.load_model('fr_model')
output = utils.inference(french_model, test_input)


Loaded ('fr', 'core_news_md') model 

