Olanto Foundation edited this page Jun 4, 2018 · 4 revisions


A first example: language detection

For our first example, we will build a classifier that detects the language of a document (or sentence). This case is simple because each document is classified with a single symbol (mono-label) and the classification hierarchy has a single level.

Building a prototype takes three steps:

  • building the training corpus
  • indexing the documents
  • testing and validating the classifier

Moving to production then adds:

  • building and storing the neural networks of the classifiers in the classification hierarchy
  • publishing the classifier as a web service
  • developing client applications

The training corpus

The training corpus is defined by:

  • the documents.
  • the catalog.

This separation makes it possible to manipulate the catalogs independently of the documents.

The documents

To learn how to predict the language of a sentence, the classifier needs example sentences for each language.

We have defined a format for the document corpus: .mflf (Many FiLe Format). This format packs all the documents into a single file; if each document were stored in a separate file, access time and storage volume would grow considerably (opening a file on disk takes about 10 ms, and a corpus can contain several million documents).

Format mflf

Each document is separated by a line that begins and ends with the ##### character sequence. The text between the two sequences is the identifier of the document.

Sample documents file:

 #####CS1#####
 o podmínkách a pravidlech prijetí Bulharské republiky a Rumunska do Evropské unie.
 #####DA1#####
 om vilkårene og de nærmere bestemmelser for optagelse af Republikken Bulgarien og Rumænien i den Europæiske Union.
 #####DE1#####
 über die Bedingungen und Einzelheiten der Aufnahme der Republik Bulgarien und Rumäniens in die Europäische Union.
 #####EL1#####
 σχετικά με τους όρους και τις λεπτομέρειες της προσχώρησης της Δημοκρατίας της Βουλγαρίας και της Ρουμανίας στην Ευρωπαϊκή Ένωση.
 ...
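The layout above is simple enough to parse directly. The following is a minimal, illustrative sketch of such a parser (`MflfParser` is a hypothetical name; this is not the project's actual reader):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal illustrative parser for the .mflf layout shown above
// (hypothetical class; not the project's actual reader).
public class MflfParser {

    /** Splits an .mflf text into (document id -> document text). */
    public static Map<String, String> parse(String mflf) {
        Map<String, String> docs = new LinkedHashMap<>();
        String id = null;
        StringBuilder body = new StringBuilder();
        for (String line : mflf.split("\n")) {
            String t = line.trim();
            // a separator line begins and ends with "#####"; between them: the id
            if (t.length() > 10 && t.startsWith("#####") && t.endsWith("#####")) {
                if (id != null) docs.put(id, body.toString().trim());
                id = t.substring(5, t.length() - 5);
                body.setLength(0);
            } else if (id != null) {
                body.append(line).append('\n');
            }
        }
        if (id != null) docs.put(id, body.toString().trim());
        return docs;
    }
}
```

Applied to the sample above, this would return each sentence keyed by its identifier (CS1, DA1, DE1, ...).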

Format cat

The catalog assigns the class (or classes) of each document. Each line describes the classification of one document with the following form:

IDDOC CLASS1 CLASS2 ... CLASSn

Sample catalog file:

 CS1	CS
 DA1	DA
 DE1	DE
 EL1	EL
 ...
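A catalog line is just an identifier followed by one or more classes. A minimal illustrative loader might look like this (`CatParser` is a hypothetical name, not the project's loader):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative loader for the .cat layout: IDDOC CLASS1 CLASS2 ... CLASSn
// (hypothetical class; not the project's actual loader).
public class CatParser {

    /** Returns (document id -> list of its classes). */
    public static Map<String, List<String>> parse(String cat) {
        Map<String, List<String>> catalog = new LinkedHashMap<>();
        for (String line : cat.split("\n")) {
            String[] f = line.trim().split("\\s+");     // tab- or space-separated
            if (f.length < 2) continue;                 // skip blank lines
            catalog.put(f[0], Arrays.asList(f).subList(1, f.length));
        }
        return catalog;
    }
}
```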

Choice of training corpus and preparation

As the source of the training corpus, we chose a 1000-sentence extract per language from DGT2014, the translation memory of the European Commission (see [DGT2014](https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory)), which contains 85 million sentences in 24 languages.

Building the corpus always requires adapting the source to the .mflf and .cat formats.

The folder MYCLASS_MODEL\sample\langdetect contains the catalogs and the corpus used in the rest of this walkthrough.

Indexing documents

The training of the neural networks is done on an indexed corpus of documents.

Indexing turns the linguistic structure (the text) into an efficient numerical structure. In our case, we use BOW (Bag Of Words): each word in the document is replaced by a word identifier together with its number of occurrences in the document. Additional treatments can also be performed:

  • Eliminating stop words (the, a, one, ...), which in general carry no classifying power. Our case is the exception: they are excellent markers for language detection.

  • Defining the notion of "word" during indexing: should capital letters, numbers, ... be kept?

  • Lemmatizing words, i.e. reducing each word to its root (global, globally, globalisation, ...). This gathers several forms of the same concept under a single term.

  • Filtering by minimum and maximum number of occurrences over the whole corpus.
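The core BOW idea described above can be sketched in a few lines: assign each word an identifier and count its occurrences per document. This is an illustrative simplification (`BowSketch` is a hypothetical class); the real indexer also handles stop words, lemmatization and occurrence filtering:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the BOW idea only: each word is mapped to an identifier and its
// occurrences are counted per document. Hypothetical class; the real indexer
// adds stop-word handling, lemmatization and occurrence filtering.
public class BowSketch {

    /** Returns the id of a word, assigning a new one on first sight. */
    static int idOf(Map<String, Integer> lexicon, String word) {
        return lexicon.computeIfAbsent(word, k -> lexicon.size());
    }

    /** Builds the (word id -> occurrence count) bag for one document. */
    public static Map<Integer, Integer> bow(Map<String, Integer> lexicon, String doc) {
        Map<Integer, Integer> bag = new HashMap<>();
        for (String w : doc.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) bag.merge(idOf(lexicon, w), 1, Integer::sum);
        }
        return bag;
    }
}
```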

To index the corpus, we use the same indexer as MYCAT (the code of the indexer can be found here).

Open the myclass project in your IDE; you will find the code associated with our example in the org.olanto.demo.langdetect_word package. Three classes are used for indexing.

Index folder structure

The folder structure must exist before indexing.

Ready to experiment

By convention, structures are in the data folder.

Each project has its folder. Each folder contains a sto0 folder and a mnn folder.

The index structure is created at the first launch of the program. To empty the index, delete its files manually. (These manual operations avoid accidentally clearing an index that took a lot of computing time to build.)

Index our corpus

As mentioned above, the CreateIndexForCat program performs the indexing:

public static void main(String[] args) {
    id = new IdxStructure("NEW", new ConfigurationForCat());  // build a new structure
    // first pass: count words
    id.indexdirOnlyCount(SenseOS.getMYCLASS_ROOT() + "MYCLASS_MODEL/sample/langdetect");
    id.flushIndexDoc();  // flush buffers
    // second pass: build BOW
    id.indexdirBuildDocBag(SenseOS.getMYCLASS_ROOT() + "MYCLASS_MODEL/sample/langdetect");
    id.flushIndexDoc();  // flush buffers
    id.Statistic.global();  // display statistics
    id.close();  // close index
}

Indexing is done in two passes:

  • count the words to be able to filter them
  • build BOWs for each document

The indexing log is a fairly complete file; see log of indexing.

These lines give statistics on the corpus.

 STATISTICS global:
 wordmax: 1048576, wordnow: 401448, used: 38%
 documax: 262144, docunow: 21537, used: 8%
 docIdx: 21537 docValid: 21537, valid: 100%
 totidx: 0

In our case: 401448 distinct words for 21537 documents.

First training and test

The program to use for training and testing is ExperimentManual. All parameters are described in Training and Test Parameters.

We define the catalogs here. The test catalog is empty because we use an 80/20 split of the training catalog. We load the catalogs with a two-character classification.

    // define path to catalog
    String fntrain = SenseOS.getMYCLASS_ROOT() + "MYCLASS_MODEL/sample/langdetect/corpus_dgt2014.cat";
    String fntest = SenseOS.getMYCLASS_ROOT() + "MYCLASS_MODEL/sample/langdetect/EMPTY.cat";
    // load catalog at the specified level (2 char codification in this case)
    NNBottomGroup BootGroup = new NNBottomGroup(id, fntrain, fntest, 2, false, false);

Our test looks at the first three predictions and is run in mono-ranking mode (maintest).

    // TEST -- parameters for the test
            3, // int Nfirst,
            true, // boolean maintest,
            true, // boolean maintestGroupdetail,
            true, // boolean maintestDocumentdetail,
            false, // boolean multitest,
            false, // boolean multitestGroupdetail ,
            false // boolean multitestOtherdetail 

Our test is completed by computing the confusion matrix.

  // display a Confusion Matrix
  NNOneN.ConfusionMatrix(false);

Finally, we ask for the list of the most important words (features) of each class.

  // display top features
  NNOneN.explainGroup(20, true);

The log of the execution of the manual test is quite complete; [see test log](Log_Test_annex).

Let's look at the most interesting parts of the log.

Statistics on BOW (bag of words)

We see that 21536 documents are listed in the catalog. The minimum number of words per document is 1, the maximum 104, and the average 20.

 #doc:21536, avg:20, min:1, max:104
 STOP [2018/05/31 12:34:07]: avgLength() - 62 ms

Word filtering and document distribution for testing

Words appearing only once are discarded, which reduces the number of words to 33573. The system has prepared an 80/20 split. The classification involves 22 classes (the languages to be detected).

 GLOBALMINOCC: 2 , MAX features:33573
 start mem: 121454680
 2.
 lasttraindoc:21536
 lasttestdoc:21536
 Train 0..17228 Test ..21536
 maxgroup:22
 after localgroup: 124413040
 Active group:22
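The split boundary visible in this log (Train 0..17228 out of 21536 documents) is simply the first 80% of the catalog. A sketch of the computation (`SplitSketch` is a hypothetical helper, not the actual code):

```java
// Sketch of the 80/20 split visible in the log above: the first 80% of the
// catalog trains, the rest tests (hypothetical helper, not the actual code).
public class SplitSketch {
    /** Documents [0, boundary) are used for training, [boundary, nDocs) for testing. */
    public static int trainBoundary(int nDocs) {
        return (int) (nDocs * 0.8);
    }
}
```

trainBoundary(21536) gives 17228, matching "Train 0..17228 Test ..21536" in the log.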

Start training

The system distributes the training over the cores and shows the progression. In this case the corpus is presented 5 times (repeatK: 5).

 START [2018/05/31 12:34:07] : TrainWinnow
 after init: 125684344
 filter  used:0, open:33573, discarded:0, filtred:0
 Start loop 0 + 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 - 4 - 5 - 7 - 3 - 6 - 0 - 1 - 2 End loop 0
 Start loop 1 + 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 - 5 - 6 - 1 - 4 - 7 - 2 - 3 - 0 End loop 1
 Start loop 2 + 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 - 4 - 0 - 1 - 3 - 5 - 6 - 7 - 2 End loop 2
 Start loop 3 + 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 - 1 - 0 - 2 - 3 - 4 - 6 - 7 - 5 End loop 3
 Start loop 4 + 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 - 1 - 2 - 0 - 5 - 3 - 7 - 4 - 6 End loop 4
 # features: 33573
 # maxgroup: 22
 # maxtrain: 17228
 # avg doc : 20
 # repeatK: 5
 size of NN: 738 [Kn]
 estimate #eval (if no discarded feature): 37884 [Kev]
 estimate power (if no discarded feature): 185 [Mev/sec]
 STOP [2018/05/31 12:34:08]: TrainWinnow - 219 ms
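TrainWinnow refers to the Winnow family of algorithms, which learn by multiplicative weight updates: weights of active features are multiplied up after a false negative and down after a false positive. The following is a textbook single-class Winnow sketch for intuition only; it is not Olanto's implementation, and `WinnowSketch` and its parameters are illustrative assumptions:

```java
import java.util.Arrays;

// Textbook single-class Winnow learner, sketched only to illustrate the
// multiplicative updates behind TrainWinnow; NOT Olanto's implementation.
public class WinnowSketch {
    final double[] w;          // one weight per feature, initialised to 1
    final double theta;        // decision threshold
    final double alpha = 2.0;  // promotion factor (false negative)
    final double beta = 0.5;   // demotion factor (false positive)

    WinnowSketch(int nFeatures) {
        w = new double[nFeatures];
        Arrays.fill(w, 1.0);
        theta = nFeatures;     // a common choice of threshold
    }

    /** Predicts membership: sum of weights of active features vs threshold. */
    boolean predict(int[] activeFeatures) {
        double s = 0;
        for (int f : activeFeatures) s += w[f];
        return s > theta;
    }

    /** On a mistake, multiplies the active weights up (promote) or down (demote). */
    void train(int[] activeFeatures, boolean isMember) {
        if (predict(activeFeatures) == isMember) return;  // correct: no update
        double factor = isMember ? alpha : beta;
        for (int f : activeFeatures) w[f] *= factor;
    }
}
```

One such learner per class, fed with the BOW features, yields the per-class scores that the test phase then ranks.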

Test Mono Class

The documents reserved for the test are used.

 Mainclass1000.0,1.06,2,300.0,300.0,9979,20,9955,18,4

The result line is interpreted as follows:

  • Mainclass1000.0,1.06,2,300.0,300.0,: training parameters

  • 9979,: cumulative accuracy of the first three predictions (99.79%)

  • 20,: errors remaining after the first three predictions (0.20%)

  • 9955,: documents whose correct class is the first prediction (99.55%)

  • 18,: documents whose correct class is the second prediction (0.18%)

  • 4: documents whose correct class is the third prediction (0.04%)

    detail in: C:/MYCLASS_MODEL/experiment/langdetect/detailworddetect-MainDetail-Class.txt
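The cumulative top-k accuracy reported above is just the sum of the per-rank hit counts divided by the number of tested documents. A hypothetical helper makes the arithmetic explicit:

```java
// Hypothetical helper making the arithmetic of the result line explicit:
// per-rank hit counts accumulate into a top-k accuracy percentage.
public class TopK {
    /** hitsAtRank[i] = number of documents whose correct class is prediction i+1. */
    public static double topKAccuracy(int[] hitsAtRank, int k, int totalDocs) {
        int hits = 0;
        for (int i = 0; i < k && i < hitsAtRank.length; i++) hits += hitsAtRank[i];
        return 100.0 * hits / totalDocs;
    }
}
```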

Build the confusion matrix

 START [2018/05/31 12:34:08] : ConfusionMatrix
 1000.0,1.06,2,300.0,300.0,995,995,4,0
 confusion matrix: (line=real category; colums= prediction)
 >>predict,HU,ET,PL,LT,EL,LV,MT,DE,SL,BG,EN,SV,NL,SK,FR,CS,FI,PT,IT,ES,DA,RO,
 HU,203,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,
 ET,0,204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,
 PL,0,0,202,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 LT,0,0,0,194,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 EL,0,0,0,0,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 LV,0,0,0,0,0,193,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 MT,0,0,0,0,0,1,186,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 DE,0,0,0,0,0,0,0,190,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 SL,0,0,0,0,0,0,0,0,203,0,0,0,0,1,0,0,0,0,0,0,0,0,
 BG,0,0,0,0,0,0,0,1,0,216,0,0,0,0,0,0,0,0,0,0,0,0,
 EN,0,0,0,0,0,0,0,0,0,0,215,0,0,0,0,0,0,0,0,1,1,0,
 SV,0,0,0,0,0,0,0,0,0,0,0,221,0,0,0,0,0,0,0,0,1,0,
 NL,0,0,0,0,0,0,0,0,0,0,0,0,201,0,0,0,0,0,0,0,0,0,
 SK,0,0,0,0,0,0,0,0,1,0,0,0,0,195,0,1,0,0,0,1,0,0,
 FR,0,0,0,0,0,0,0,0,0,0,0,0,0,0,211,0,0,0,0,0,0,0,
 CS,0,0,0,0,0,0,0,0,1,0,0,0,0,2,0,199,0,0,0,0,0,0,
 FI,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,205,0,0,0,0,0,
 PT,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,177,0,0,0,0,
 IT,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,195,0,0,0,
 ES,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,197,0,0,
 DA,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,200,0,
 RO,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,98,
 STOP [2018/05/31 12:34:08]: ConfusionMatrix - 47 ms

After laying out the matrix in a spreadsheet, we obtain the following result:

The yellow boxes show the errors. The test covers the 20% of the data that was withheld from training. The results are excellent, but remember that the tested sentences have more than 50 characters, like the rest of the corpus.
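Since the rows of the matrix are the real categories and the columns the predictions, the per-class accuracy is the diagonal cell divided by its row sum. A small illustrative helper (`ConfusionStats` is a hypothetical name):

```java
// Hypothetical helper: rows of the matrix are the real categories, so the
// per-class accuracy is the diagonal cell divided by its row sum.
public class ConfusionStats {
    public static double classAccuracy(int[][] matrix, int classIndex) {
        int rowSum = 0;
        for (int v : matrix[classIndex]) rowSum += v;
        return (double) matrix[classIndex][classIndex] / rowSum;
    }
}
```

For the HU row above (203 correct out of 204 tested), this gives 203/204 ≈ 0.99510, which is exactly the in1 value reported in the per-class detail file later in this page.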

Build the top features

This feature gives us the words most strongly weighted during training. Among them we find the function words (stop words). Other words also appear in the list, because the list does not take the frequency of the word in the corpus into account.

 groupe, nbdoc, kw1, kw2, kw3, ...
 ---- 0 HU, nbdoc:1000,az,és,bizottság,hl,tanács,nem,következo,csatlakozás,kell,ig,szóló,egészül,vagy,vonatkozó,rendelete,hoeromu,mint,támogatás,nemzeti,melléklet
 ---- 1 ET, nbdoc:1000,bulgaaria,või,ning,kui,lk,rumeenia,euroopa,eü,lisa,artikli,aasta,mis,nõukogu,kohta,kuni,bulgaarias,alusel,suhtes,määruse,punkti
 ---- 2 PL, nbdoc:1000,w,dnia,dz,sie,dla,oraz,we,panstwa,przez,bulgarii,rumunii,które,czlonkowskie,przy,jest,sa,zgodnie,rocznie,lub,przystapienia
 ---- 3 LT, nbdoc:1000,del,eb,iš,iki,istojimo,kaip,pagal,bulgarijos,punkte,ol,saugyklai,bulgarija,gali,i,nuo,yra,dalyje,arba,bulgarijoje,tarybos
 ---- 4 EL, nbdoc:1000,?a?,st?,t??,ap?,t??,??a,t??,p??,µe,t?,de?aµe??,t??,st??t??,t?,ß????a??a,??,ta,s?µe??,st?,?
 ---- 5 LV, nbdoc:1000,gada,uz,punkta,eiropas,pievienošanas,kas,lpp,vai,attieciba,ka,pec,panta,ov,atbilstigi,padomes,lai,lidz,dalas,bulgarija,ša
 ---- 6 MT, nbdoc:1000,li,apos,ankara,jew,ghandu,doganali,fuq,ghandha,ghandhom,dawn,u,minn,ghal,ta,ikunu,inkluzi,fl-appendici,ma,kif,dak
 ---- 7 DE, nbdoc:1000,und,aschebecken,vom,für,von,verordnung,werden,nach,wird,aus,über,mit,nummer,oder,absatz,abl,zur,nicht,dem,sind
 ---- 8 SL, nbdoc:1000,iz,pristopa,bolgarija,ul,ali,glede,lahko,št,bolgariji,prilogi,odstavka,sveta,tem,pogodbe,kot,skladu,sklep,države,podlagi,odbora
 ---- 9 BG, nbdoc:1000,??,?,?,?,???,??,??,??,?????,?????????,????,???,son,????????,altesse,royale,grand-duc,??,????????????,???????????
 ---- 10 EN, nbdoc:1000,shall,and,decision,following,oj,or,accession,regulation,european,council,committee,treaty,republic,member,may,with,amended,by,areas,executive
 ---- 11 SV, nbdoc:1000,och,av,skall,för,från,till,att,inte,enligt,får,förordning,vid,följande,är,genom,senast,anslutningen,tillämpas,republiken,som
 ---- 12 NL, nbdoc:1000,van,het,met,voor,door,bulgarije,zijn,aan,wordt,worden,asvijver,dat,tot,lid,volgende,bijlage,op,een,roemenië,bij
 ---- 13 SK, nbdoc:1000,ú,alebo,pre,ktoré,ako,rozhodnutie,pristúpenia,týchto,sú,môže,komisia,popol,súlade,sa,vo,ods,nariadenia,opatrenia,štátov,odseku
 ---- 14 FR, nbdoc:1000,dans,bulgarie,est,les,adhésion,une,du,république,paragraphe,règlement,aux,qui,sur,pour,cette,état,au,roumanie,sont,peut
 ---- 15 CS, nbdoc:999,nebo,pristoupení,pro,spolecenství,narízení,ve,techto,opatrení,být,cl,pokud,oblast,príloze,odst,smernice,státy,komise,souladu,prosince,které
 ---- 16 FI, nbdoc:1000,ey,kuin,päivänä,sekä,euroopan,jotka,mukaisesti,tuhka-allas,kohdassa,artiklan,neuvoston,liitteessä,sovelletaan,osalta,komissio,eyvl,bulgariassa,vuoden,annettu,että
 ---- 17 PT, nbdoc:1000,em,ao,conselho,roménia,artigo,com,não,regulamento,uma,adesão,membros,os,aos,bacia,cinzas,decisão,nas,comissão,é,até
 ---- 18 IT, nbdoc:1000,di,della,dell,dal,che,il,è,consiglio,regolamento,adesione,nel,stati,sono,dei,dicembre,allegato,commissione,gu,pag,stagno
 ---- 19 ES, nbdoc:1000,las,consejo,los,comisión,el,decisión,y,adhesión,ejecutivo,miembros,podrá,unión,artículo,declaración,hasta,con,diciembre,rumanía,miembro,común
 ---- 20 DA, nbdoc:1000,og,til,af,stk,skal,fra,ikke,ef,disse,inden,afsnit,følgende,omhandlet,bilag,forordning,rumænien,før,nye,tiltrædelsesdatoen,fastsættes
 ---- 21 RO, nbdoc:537,?i,în,acord,cu,sa,care,pe,prin,sau,prezentul,pentru,catre,poate,nu,sunt,acest,consiliul,fiecare,acordul,este

(total time: 17 seconds)

Test details

In the settings, we asked for the details of the test. The setting indicates that the detail files must be placed in the folder MYCLASS_MODEL\experiment\langdetect.

We find a detail file per class, with the number of tests performed and the performance for the first three predictions:

      group,tottest,in1,in2,in3,...
      HU,204,0.99509805,0.0,0.0
      ET,205,0.99512196,0.0,0.0
      PL,202,1.0,0.0,0.0
      LT,194,1.0,0.0,0.0
      MT,187,0.9946524,0.0,0.0053475937
      ...

We also find a detail file for each tested document, with:

  • the identifier of the document
  • the expected class
  • for the first three predictions: the predicted class and the prediction score
  • the size of the BOW (word bag)

Document detail

It is interesting to examine this file to understand the causes of a bad prediction. In our case, the document EN959 was predicted as DA (Danish) with a low score (below 1000). Looking at the training file, we see that the sentence to be classified is composed of proper names.

 #####EN959#####
 TPP at ‘Zaharni zavodi’ ashpond, Veliko Tarnovo, Gorna Oryahovitsa;.

Summary

We saw:

  • how to format the corpus and the catalog
  • how to index a corpus
  • how to do the training and the test

In the next example, we will examine a corpus where the classification is hierarchical and multi-class: Patent Classification.

Annex