How_to_use_EN
For our first example, we will build a classifier that detects the language of a document (or sentence). This case is simple because each document is classified with a single label (mono-ranking) and the classification has a single level.
We build a prototype in three steps:
- the constitution of the training corpus
- indexing of documents
- tests and validation of the classifier
The production step is completed by:
- construction and storage of neural networks of classifiers of the classification hierarchy.
- publication of the classifier with a WebService
- development of client applications.
The training corpus is defined by:
- the documents.
- the catalog.
This separation makes it possible to manipulate the catalogs independently of the documents.
To learn how to predict the language of a sentence, the classifier must have example sentences for each language.
We have defined a format for the document corpus (.mflf, Many FiLe Format). This format compacts all the documents into a single file: if each document were in a separate file, the access time and the storage volume would increase considerably (it takes about 10 ms to open a file on disk, and a corpus can contain several million documents).
Each document is separated by a line that begins and ends with the ##### character sequence. The text between the two sequences is the identifier of the document.
Sample documents file:
#####CS1#####
o podmínkách a pravidlech prijetí Bulharské republiky a Rumunska do Evropské unie.
#####DA1#####
om vilkårene og de nærmere bestemmelser for optagelse af Republikken Bulgarien og Rumænien i den Europæiske Union.
#####DE1#####
über die Bedingungen und Einzelheiten der Aufnahme der Republik Bulgarien und Rumäniens in die Europäische Union.
#####EL1#####
σχετικά με τους όρους και τις λεπτομέρειες της προσχώρησης της Δημοκρατίας της Βουλγαρίας και της Ρουμανίας στην Ευρωπαϊκή Ένωση.
...
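A .mflf file can thus be read with a simple scan: a line that both starts and ends with the ##### sequence opens a new document, and the text between the two sequences is the document identifier. As an illustration only (MflfReader is our own sketch, not a MYCLASS class):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MflfReader {

    // Parse .mflf content: "#####ID#####" lines delimit documents.
    public static Map<String, String> parse(String content) {
        Map<String, String> docs = new LinkedHashMap<>();
        String currentId = null;
        StringBuilder body = new StringBuilder();
        for (String line : content.split("\n")) {
            if (line.startsWith("#####") && line.endsWith("#####") && line.length() > 10) {
                if (currentId != null) {
                    docs.put(currentId, body.toString().trim());
                }
                currentId = line.substring(5, line.length() - 5); // text between the markers
                body.setLength(0);
            } else if (currentId != null) {
                body.append(line).append('\n');
            }
        }
        if (currentId != null) {
            docs.put(currentId, body.toString().trim());
        }
        return docs;
    }

    public static void main(String[] args) {
        String sample = "#####DA1#####\nom vilkårene ...\n#####DE1#####\nüber die Bedingungen ...\n";
        Map<String, String> docs = parse(sample);
        System.out.println(docs.keySet()); // prints [DA1, DE1]
    }
}
```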
The catalog assigns each document its class (or classes). Each line describes the classification of one document in the following form:
IDDOC CLASS1 CLASS2 ... CLASSn
Sample catalog file:
CS1 CS
DA1 DA
DE1 DE
EL1 EL
...
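The catalog format is equally simple to parse: the first whitespace-separated field is the document identifier and the remaining fields are its classes. A minimal sketch (CatalogReader is our name, not part of MYCLASS):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CatalogReader {

    // Each line: IDDOC CLASS1 CLASS2 ... CLASSn (whitespace-separated).
    public static Map<String, List<String>> parse(String content) {
        Map<String, List<String>> catalog = new LinkedHashMap<>();
        for (String line : content.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 2) continue; // skip blank or malformed lines
            catalog.put(parts[0], Arrays.asList(parts).subList(1, parts.length));
        }
        return catalog;
    }
}
```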
As the source of the training corpus, we chose a 1000-sentence extract from DGT2014 (the translation memory of the European Commission; see [DGT2014](https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory)), which contains 85 million sentences in 24 languages.
Building the corpus always requires converting the source to the .mflf and .cat formats.
The folder MYCLASS_MODEL\sample\langdetect contains the catalogs and the corpus used in the rest of this walkthrough.
The training of the neural networks is done on an indexed corpus of documents.
Indexing organizes the linguistic structure (the text) into an efficient numerical structure. In our case, we use a BOW (Bag Of Words): each word in the document is replaced by a word identifier and its number of occurrences in the document. We can also perform additional treatments:
- Eliminate the stop words (the, a, one, ...), which in general have no classifying power. Not in our particular case, however: they are good markers for language detection.
- Define the notion of "word" during indexing: should we keep capital letters, numbers, ...?
- Lemmatize the words, i.e. reduce each word to its root (global, globally, globalisation, ...). This gathers several forms of the same concept under a single term.
- Filter words by their minimum and maximum number of occurrences in the whole corpus.
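Conceptually, these treatments reduce each document to a map from word identifier to occurrence count. A hedged illustration of BOW construction with stop-word filtering (the class and the tokenization rule are ours; in MYCLASS these choices are made in ConfigurationForCat and TokenCatNative):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class BowIndexer {

    private final Map<String, Integer> wordIds = new HashMap<>(); // word -> numeric id
    private final Set<String> stopWords;

    public BowIndexer(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    // Tokenize on non-letters and lower-case (one possible definition of "word"),
    // drop stop words, then count occurrences per word id.
    public Map<Integer, Integer> index(String document) {
        Map<Integer, Integer> bow = new HashMap<>();
        for (String token : document.toLowerCase().split("[^\\p{L}]+")) {
            if (token.isEmpty() || stopWords.contains(token)) continue;
            int id = wordIds.computeIfAbsent(token, w -> wordIds.size()); // sequential ids
            bow.merge(id, 1, Integer::sum);
        }
        return bow;
    }
}
```

For language detection the stop-word set would be left empty, since (as noted above) stop words are good markers of the language.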
To index the corpus, we use the same indexer as MYCAT (the code of the indexer can be found here).
Open the myclass project in your IDE; you will find the code associated with our example in the org.olanto.demo.langdetect_word package. Three classes are used for indexing:
- ConfigurationForCat: indexing settings
- TokenCatNative: definition of words (tokens)
- CreateIndexForCat: the program performing the indexing
The folder structure must exist before indexing. By convention, the structures are in the data folder. Each project has its own folder, which contains a sto0 folder and a mnn folder.
The index structure is created at the first launch of the program. To empty an index, delete its files manually. (Keeping this operation manual avoids accidentally clearing an index that took a lot of computing time to build.)
As seen above, the CreateIndexForCat program performs the indexing.
public static void main(String[] args) {
id = new IdxStructure("NEW", new ConfigurationForCat()); // build a new structure
// first pass to count words
id.indexdirOnlyCount(SenseOS.getMYCLASS_ROOT() + "MYCLASS_MODEL/sample/langdetect");
id.flushIndexDoc(); // flush buffers
// second pass to build BOW
id.indexdirBuildDocBag(SenseOS.getMYCLASS_ROOT() + "MYCLASS_MODEL/sample/langdetect");
id.flushIndexDoc(); // flush buffers
id.Statistic.global(); // display statistics
id.close(); // close index
}
Indexing is done in two passes:
- count the words to be able to filter them
- build BOWs for each document
The indexing log is a fairly complete file (see log of indexing).
These lines give statistics on the corpus.
STATISTICS global:
wordmax: 1048576, wordnow: 401448, used: 38%
documax: 262144, docunow: 21537, used: 8%
docIdx: 21537 docValid: 21537, valid: 100%
totidx: 0
In our case, there are 401448 different words for 21537 documents.
The program to use for training and testing is ExperimentManual. All parameters are described in Training and Test Parameters.
We define the catalogs here. The test catalog is empty because we use an 80/20 split of the training catalog. We load the catalogs with a two-character classification.
// define path to catalog
String fntrain = SenseOS.getMYCLASS_ROOT() + "MYCLASS_MODEL/sample/langdetect/corpus_dgt2014.cat";
String fntest = SenseOS.getMYCLASS_ROOT() + "MYCLASS_MODEL/sample/langdetect/EMPTY.cat";
// load catalog at the specified level (2 char codification in this case)
NNBottomGroup BootGroup = new NNBottomGroup(id, fntrain, fntest, 2, false, false);
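The 80/20 test holds out one fifth of the cataloged documents for testing and trains on the rest. The log shown later (Train 0..17228 Test ..21536) is consistent with a contiguous split, since 21536 × 0.8 = 17228.8. A sketch under that assumption (our own illustration, not MYCLASS's actual split code):

```java
import java.util.ArrayList;
import java.util.List;

public class TrainTestSplit {

    // Deterministic 80/20 split: the first 80% of the ids train, the rest test.
    public static List<List<String>> split8020(List<String> docIds) {
        int cut = (int) (docIds.size() * 0.8);
        List<List<String>> result = new ArrayList<>();
        result.add(new ArrayList<>(docIds.subList(0, cut)));             // training set
        result.add(new ArrayList<>(docIds.subList(cut, docIds.size()))); // test set
        return result;
    }
}
```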
Our test is performed on the first three predictions. We run the tests in mono-ranking (maintest).
// TEST -- parameters for the test
3, // int Nfirst,
true, // boolean maintest,
true, // boolean maintestGroupdetail,
true, // boolean maintestDocumentdetail,
false, // boolean multitest,
false, // boolean multitestGroupdetail ,
false // boolean multitestOtherdetail
Our test is completed by calculating a confusion matrix.
// display a Confusion Matrix
NNOneN.ConfusionMatrix(false);
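A confusion matrix counts, for each real class (rows), how often each class was predicted (columns); the diagonal holds the correct predictions. A minimal illustration (not the NNOneN implementation):

```java
public class Confusion {

    // matrix[real][predicted] counts the test documents.
    public static int[][] build(int[] realClasses, int[] predictedClasses, int nClasses) {
        int[][] matrix = new int[nClasses][nClasses];
        for (int i = 0; i < realClasses.length; i++) {
            matrix[realClasses[i]][predictedClasses[i]]++; // off-diagonal cells are errors
        }
        return matrix;
    }
}
```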
Finally, we ask for the list of the most important words (features) of each class.
// display top features
NNOneN.explainGroup(20, true);
The log of the execution of the manual test is quite complete [see test log](Log_Test_annex).
Let's look at the most interesting parts of the log
We see that 21536 documents are identified in the catalog. The minimum number of words per document is 1, the maximum is 104, and the average is 20.
#doc:21536, avg:20, min:1, max:104
STOP [2018/05/31 12:34:07]: avgLength() - 62 ms
Words appearing only once are discarded, which reduces the number of words to 33573. The system has prepared an 80/20 test. The classification concerns 22 classes (languages to be detected).
GLOBALMINOCC: 2 , MAX features:33573
start mem: 121454680
2.
lasttraindoc:21536
lasttestdoc:21536
Train 0..17228 Test ..21536
maxgroup:22
after localgroup: 124413040
Active group:22
The system divides the training over the cores and shows the progression. The corpus is processed 5 times in this case.
START [2018/05/31 12:34:07] : TrainWinnow
after init: 125684344
filter used:0, open:33573, discarded:0, filtred:0
Start loop 0 + 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 - 4 - 5 - 7 - 3 - 6 - 0 - 1 - 2 End loop 0
Start loop 1 + 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 - 5 - 6 - 1 - 4 - 7 - 2 - 3 - 0 End loop 1
Start loop 2 + 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 - 4 - 0 - 1 - 3 - 5 - 6 - 7 - 2 End loop 2
Start loop 3 + 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 - 1 - 0 - 2 - 3 - 4 - 6 - 7 - 5 End loop 3
Start loop 4 + 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 - 1 - 2 - 0 - 5 - 3 - 7 - 4 - 6 End loop 4
# features: 33573
# maxgroup: 22
# maxtrain: 17228
# avg doc : 20
# repeatK: 5
size of NN: 738 [Kn]
estimate #eval (if no discarded feature): 37884 [Kev]
estimate power (if no discarded feature): 185 [Mev/sec]
STOP [2018/05/31 12:34:08]: TrainWinnow - 219 ms
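The name TrainWinnow indicates that the networks are trained with the Winnow algorithm, a linear classifier with multiplicative weight updates. A minimal single-class sketch of the classic rule (the promotion factor and threshold below are the textbook defaults, not necessarily the project's settings):

```java
import java.util.Arrays;
import java.util.Set;

public class Winnow {

    private final double[] w;                 // one weight per feature
    private final double threshold;
    private static final double ALPHA = 2.0;  // promotion/demotion factor (illustrative)

    public Winnow(int nFeatures) {
        w = new double[nFeatures];
        Arrays.fill(w, 1.0);                  // Winnow starts with all weights at 1
        threshold = nFeatures;                // classic choice: threshold = #features
    }

    public boolean predict(Set<Integer> activeFeatures) {
        double sum = 0;
        for (int f : activeFeatures) sum += w[f];
        return sum >= threshold;
    }

    // Multiplicative update: promote active weights on a missed positive,
    // demote them on a false positive, do nothing when the prediction is right.
    public void train(Set<Integer> activeFeatures, boolean label) {
        if (predict(activeFeatures) == label) return;
        double factor = label ? ALPHA : 1.0 / ALPHA;
        for (int f : activeFeatures) w[f] *= factor;
    }
}
```

In the mistake-driven spirit of Winnow, weights only change on errors, which is why repeating the corpus (repeatK: 5 in the log) can still improve the model.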
The documents reserved for the test are used.
Mainclass1000.0,1.06,2,300.0,300.0,9979,20,9955,18,4
The result line is interpreted as follows:
- Mainclass1000.0,1.06,2,300.0,300.0, : training parameters
- 9979 : the sum of the accuracies of the first three predictions (99.79%)
- 20 : the sum of the errors over the first three predictions (0.20%)
- 9955 : accuracy of the first prediction (99.55%)
- 18 : accuracy of the second prediction (0.18%)
- 4 : accuracy of the third prediction (0.04%)

detail in: C:/MYCLASS_MODEL/experiment/langdetect/detailworddetect-MainDetail-Class.txt
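These figures are accuracies at rank: a document counts for rank k when its correct class appears in position k of the ranked predictions. A small sketch of how accuracy within the first k predictions can be computed (our own illustration):

```java
public class TopKAccuracy {

    // Fraction of documents whose real class appears among the first k predictions.
    // predictions[i] holds the ranked classes proposed for document i.
    public static double accuracyAtK(int[] real, int[][] predictions, int k) {
        int hits = 0;
        for (int i = 0; i < real.length; i++) {
            for (int rank = 0; rank < k && rank < predictions[i].length; rank++) {
                if (predictions[i][rank] == real[i]) { hits++; break; }
            }
        }
        return (double) hits / real.length;
    }
}
```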
START [2018/05/31 12:34:08] : ConfusionMatrix
1000.0,1.06,2,300.0,300.0,995,995,4,0
confusion matrix: (line=real category; colums= prediction)
>>predict,HU,ET,PL,LT,EL,LV,MT,DE,SL,BG,EN,SV,NL,SK,FR,CS,FI,PT,IT,ES,DA,RO,
HU,203,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,
ET,0,204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,
PL,0,0,202,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
LT,0,0,0,194,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
EL,0,0,0,0,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
LV,0,0,0,0,0,193,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
MT,0,0,0,0,0,1,186,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
DE,0,0,0,0,0,0,0,190,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
SL,0,0,0,0,0,0,0,0,203,0,0,0,0,1,0,0,0,0,0,0,0,0,
BG,0,0,0,0,0,0,0,1,0,216,0,0,0,0,0,0,0,0,0,0,0,0,
EN,0,0,0,0,0,0,0,0,0,0,215,0,0,0,0,0,0,0,0,1,1,0,
SV,0,0,0,0,0,0,0,0,0,0,0,221,0,0,0,0,0,0,0,0,1,0,
NL,0,0,0,0,0,0,0,0,0,0,0,0,201,0,0,0,0,0,0,0,0,0,
SK,0,0,0,0,0,0,0,0,1,0,0,0,0,195,0,1,0,0,0,1,0,0,
FR,0,0,0,0,0,0,0,0,0,0,0,0,0,0,211,0,0,0,0,0,0,0,
CS,0,0,0,0,0,0,0,0,1,0,0,0,0,2,0,199,0,0,0,0,0,0,
FI,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,205,0,0,0,0,0,
PT,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,177,0,0,0,0,
IT,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,195,0,0,0,
ES,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,197,0,0,
DA,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,200,0,
RO,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,98,
STOP [2018/05/31 12:34:08]: ConfusionMatrix - 47 ms
After layout in a Spreadsheet, we obtain the following result:
The yellow boxes show the errors. The test covers 20% of the data that was removed from the training. The results are excellent but remember that the tested sentences had more than 50 characters like all the rest of the corpus.
This feature gives us the words most valued during training. Among them, we find the "tool words" (stop words). Other words also appear in the list, because the list does not take the word's frequency in the corpus into account.
groupe, nbdoc, kw1, kw2, kw3, ...
---- 0 HU, nbdoc:1000,az,és,bizottság,hl,tanács,nem,következo,csatlakozás,kell,ig,szóló,egészül,vagy,vonatkozó,rendelete,hoeromu,mint,támogatás,nemzeti,melléklet
---- 1 ET, nbdoc:1000,bulgaaria,või,ning,kui,lk,rumeenia,euroopa,eü,lisa,artikli,aasta,mis,nõukogu,kohta,kuni,bulgaarias,alusel,suhtes,määruse,punkti
---- 2 PL, nbdoc:1000,w,dnia,dz,sie,dla,oraz,we,panstwa,przez,bulgarii,rumunii,które,czlonkowskie,przy,jest,sa,zgodnie,rocznie,lub,przystapienia
---- 3 LT, nbdoc:1000,del,eb,iš,iki,istojimo,kaip,pagal,bulgarijos,punkte,ol,saugyklai,bulgarija,gali,i,nuo,yra,dalyje,arba,bulgarijoje,tarybos
---- 4 EL, nbdoc:1000,?a?,st?,t??,ap?,t??,??a,t??,p??,µe,t?,de?aµe??,t??,st??t??,t?,ß????a??a,??,ta,s?µe??,st?,?
---- 5 LV, nbdoc:1000,gada,uz,punkta,eiropas,pievienošanas,kas,lpp,vai,attieciba,ka,pec,panta,ov,atbilstigi,padomes,lai,lidz,dalas,bulgarija,ša
---- 6 MT, nbdoc:1000,li,apos,ankara,jew,ghandu,doganali,fuq,ghandha,ghandhom,dawn,u,minn,ghal,ta,ikunu,inkluzi,fl-appendici,ma,kif,dak
---- 7 DE, nbdoc:1000,und,aschebecken,vom,für,von,verordnung,werden,nach,wird,aus,über,mit,nummer,oder,absatz,abl,zur,nicht,dem,sind
---- 8 SL, nbdoc:1000,iz,pristopa,bolgarija,ul,ali,glede,lahko,št,bolgariji,prilogi,odstavka,sveta,tem,pogodbe,kot,skladu,sklep,države,podlagi,odbora
---- 9 BG, nbdoc:1000,??,?,?,?,???,??,??,??,?????,?????????,????,???,son,????????,altesse,royale,grand-duc,??,????????????,???????????
---- 10 EN, nbdoc:1000,shall,and,decision,following,oj,or,accession,regulation,european,council,committee,treaty,republic,member,may,with,amended,by,areas,executive
---- 11 SV, nbdoc:1000,och,av,skall,för,från,till,att,inte,enligt,får,förordning,vid,följande,är,genom,senast,anslutningen,tillämpas,republiken,som
---- 12 NL, nbdoc:1000,van,het,met,voor,door,bulgarije,zijn,aan,wordt,worden,asvijver,dat,tot,lid,volgende,bijlage,op,een,roemenië,bij
---- 13 SK, nbdoc:1000,ú,alebo,pre,ktoré,ako,rozhodnutie,pristúpenia,týchto,sú,môže,komisia,popol,súlade,sa,vo,ods,nariadenia,opatrenia,štátov,odseku
---- 14 FR, nbdoc:1000,dans,bulgarie,est,les,adhésion,une,du,république,paragraphe,règlement,aux,qui,sur,pour,cette,état,au,roumanie,sont,peut
---- 15 CS, nbdoc:999,nebo,pristoupení,pro,spolecenství,narízení,ve,techto,opatrení,být,cl,pokud,oblast,príloze,odst,smernice,státy,komise,souladu,prosince,které
---- 16 FI, nbdoc:1000,ey,kuin,päivänä,sekä,euroopan,jotka,mukaisesti,tuhka-allas,kohdassa,artiklan,neuvoston,liitteessä,sovelletaan,osalta,komissio,eyvl,bulgariassa,vuoden,annettu,että
---- 17 PT, nbdoc:1000,em,ao,conselho,roménia,artigo,com,não,regulamento,uma,adesão,membros,os,aos,bacia,cinzas,decisão,nas,comissão,é,até
---- 18 IT, nbdoc:1000,di,della,dell,dal,che,il,è,consiglio,regolamento,adesione,nel,stati,sono,dei,dicembre,allegato,commissione,gu,pag,stagno
---- 19 ES, nbdoc:1000,las,consejo,los,comisión,el,decisión,y,adhesión,ejecutivo,miembros,podrá,unión,artículo,declaración,hasta,con,diciembre,rumanía,miembro,común
---- 20 DA, nbdoc:1000,og,til,af,stk,skal,fra,ikke,ef,disse,inden,afsnit,følgende,omhandlet,bilag,forordning,rumænien,før,nye,tiltrædelsesdatoen,fastsættes
---- 21 RO, nbdoc:537,?i,în,acord,cu,sa,care,pe,prin,sau,prezentul,pentru,catre,poate,nu,sunt,acest,consiliul,fiecare,acordul,este
(total time: 17 seconds)
In the settings, we indicated that we wanted the details of the test. The setting indicates that the detail files must be placed in the folder MYCLASS_MODEL\experiment\langdetect
We find a detail file per class, with the number of tests performed and the performance for the first three predictions:
group,tottest,in1,in2,in3,...
HU,204,0.99509805,0.0,0.0
ET,205,0.99512196,0.0,0.0
PL,202,1.0,0.0,0.0
LT,194,1.0,0.0,0.0
MT,187,0.9946524,0.0,0.0053475937
...
We find a detail file for each tested document, with:
- the identifier of the document
- the expected class
- for the first three predictions: the predicted class and the prediction score
- the size of the BOW (word bag)
It is interesting to examine this file to understand the causes of a bad prediction. In our case, the document EN959 received a prediction of DA (Danish) with a low score (below 1000). Looking at the training file, the sentence to be classified is composed of proper names.
#####EN959#####
TPP at ‘Zaharni zavodi’ ashpond, Veliko Tarnovo, Gorna Oryahovitsa;.
Summary
We saw:
- how to format the corpus and the catalog
- how to index a corpus
- how to do the training and the test
In the next example, we will examine a corpus where the classification is hierarchical and multi-class: Patent Classification.