In [2]:
import __init__

import pandas as pd

from bin.splitLogFile import extractSummaryLine

# Experiment

Based on an __annotated ground truth__, we tried to learn a model to classify __domain-specific words__.

We use as input a combination of 4 datasets:
* animal
* vehicle
* plant
* other - a __random sample from the whole vocabulary__

To do so we will explore the __Cartesian product__ of:
* __domains:__ a combination of _N_ of the previously presented domains
* __strict:__ try to compose missing concepts
* __randomForest / knn:__ knn allows us to check whether there is anything consistent to learn; randomForest is a basic model used as a first approach to learn the function
* __feature:__ one of the features presented in the guided tour
* __postFeature:__ any extra processing applied after the feature extraction (such as normalisation)
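
As a rough sketch of the search space (this is not the code of `trainAll_domainClf.py`), the Cartesian product can be enumerated with `itertools`. The value names are copied from the result tables of this notebook. The assumption that a combination holds at least two domains is ours; it is consistent with the 198 random forest runs reported in the results.

```python
from itertools import combinations, product

# Values as they appear in the result tables
DOMAINS = ['animal', 'plant', 'vehicle', 'other']
STRICT = ['', 'strict']
CLASSIFIERS = ['RandomForestClassifier', 'KNeighborsClassifier']
FEATURES = ['angular', 'identity', 'polar']
POST = ['noPost', 'postAbs', 'postNormalize']

# Assumption: a domain combination contains at least two domains
domain_combos = ['-'.join(combo)
                 for n in range(2, len(DOMAINS) + 1)
                 for combo in combinations(DOMAINS, n)]

# Every (domains, strict, clf, feature, post) configuration to train
grid = list(product(domain_combos, STRICT, CLASSIFIERS, FEATURES, POST))
rf_runs = [run for run in grid if run[2] == 'RandomForestClassifier']

print(len(domain_combos), len(grid), len(rf_runs))
```

Under that assumption there are 11 domain combinations, hence 11 x 2 x 3 x 3 = 198 configurations per classifier.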

We use 10-fold cross-validation.
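
To make the splitting explicit, here is a minimal standard-library sketch of k-fold index generation. The splitter actually used by the training script is not shown in this notebook, so the function name `k_fold_indices` is hypothetical, and shuffling or stratification (which a real pipeline would likely add) is left out.

```python
def k_fold_indices(n_samples, n_folds=10):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Usage: 25 samples split into 10 folds
for train_idx, test_idx in k_fold_indices(25, n_folds=10):
    print(len(train_idx), len(test_idx))
```

The model is trained on each `train_idx` and scored on the matching `test_idx`; the reported scores are then aggregated over the 10 folds.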

_Once you have downloaded the files, you can use this script to reproduce the experiment at home_:

```
python experiment/trainAll_domainClf.py > ../data/learnedModel/domain/log.txt
```

# Results

Here is a summary of the results we gathered.
Detailed reports can be found in the logs.

In [6]:
# Parse the summary file: one trained model per line
with open('../../data/learnedModel/domain/summary.txt') as summaryFile:
    summaryDf = pd.DataFrame([extractSummaryLine(line) for line in summaryFile],
                             columns=['domain', 'strict', 'clf', 'feature', 'post', 'precision', 'recall', 'f1'])

# Drop the kNN runs, then rank the remaining models by f1-score
summaryDf = summaryDf[summaryDf['clf'] != 'KNeighborsClassifier'].sort_values('f1', ascending=False)
print(len(summaryDf))
summaryDf[:5]

198


index,domain,strict,clf,feature,post,precision,recall,f1
338,plant-vehicle,,RandomForestClassifier,identity,postNormalize,0.972,0.97,0.97
356,plant-vehicle,strict,RandomForestClassifier,identity,postNormalize,0.971,0.969,0.968
340,plant-vehicle,,RandomForestClassifier,polar,postAbs,0.968,0.967,0.967
352,plant-vehicle,strict,RandomForestClassifier,angular,postAbs,0.97,0.967,0.967
359,plant-vehicle,strict,RandomForestClassifier,polar,postNormalize,0.966,0.963,0.964


Considering the nature of the data (semantically close concepts, e.g. _puppy_ and _dog_, sit at very close points), __Knn__ is not relevant.

As you can see, there are a lot of trained models (198). Therefore,<br>
we need to __find a method to select the best combination__, i.e. the one most robust to the number and variety of domains.

To do so, we'll __select the feature / post-processing pair with the best average f1-score across the dataset combinations__:

In [8]:
summaryDf['f1'] = summaryDf['f1'].astype(float)
summaryDf[['feature', 'post', 'f1']].groupby(['feature', 'post']).describe().unstack(level=-1)

feature,post,f1 count,f1 mean,f1 std,f1 min,f1 25%,f1 50%,f1 75%,f1 max
angular,noPost,22.0,0.897682,0.044522,0.811,0.85975,0.914,0.9305,0.962
angular,postAbs,22.0,0.899136,0.045783,0.806,0.85975,0.9185,0.929,0.967
angular,postNormalize,22.0,0.8945,0.042041,0.813,0.86375,0.899,0.93275,0.958
identity,noPost,22.0,0.891591,0.04389,0.803,0.862,0.893,0.93,0.956
identity,postAbs,22.0,0.778818,0.069272,0.639,0.7225,0.7875,0.824,0.876
identity,postNormalize,22.0,0.897682,0.045976,0.802,0.86175,0.907,0.929,0.97
polar,noPost,22.0,0.897182,0.04248,0.807,0.8715,0.9085,0.931,0.961
polar,postAbs,22.0,0.895,0.044825,0.804,0.85875,0.9035,0.9305,0.967
polar,postNormalize,22.0,0.892591,0.043845,0.807,0.85875,0.8985,0.92675,0.964


We observe several things here:
* The __f1-score decreases as we add__ variety of __domains__ (from ~95% for 2 to ~80% for 4).
* On average, the __results are satisfying__ for such a basic model.
* The __selected feature__ (_angular, polar, Cartesian_) has __little impact on the average score__; the one exception is _identity_ with _postAbs_, whose mean f1 drops to 0.78.
* Adding the possibility to __compose concepts__ (_strict_) __improves the score only very slightly__.

If we had to select one model, we could choose the __angular__ feature with __no post-processing__, which is the __best in the edge case__ (4 domains).

In [20]:
summaryDf[summaryDf['domain'] == 'animal-plant-vehicle-other'][:1]

index,domain,strict,clf,feature,post,precision,recall,f1
126,animal-plant-vehicle-other,,RandomForestClassifier,angular,noPost,0.843,0.841,0.823


In [19]:
summaryDf[(summaryDf['feature'] == 'angular') & (summaryDf['post'] == 'noPost') & (summaryDf['strict'] == '')]

index,domain,strict,clf,feature,post,precision,recall,f1
333,plant-vehicle,,RandomForestClassifier,angular,noPost,0.965,0.962,0.962
225,animal-vehicle,,RandomForestClassifier,angular,noPost,0.953,0.953,0.95
81,animal-plant,,RandomForestClassifier,angular,noPost,0.947,0.945,0.945
369,vehicle-other,,RandomForestClassifier,angular,noPost,0.923,0.92,0.919
9,animal-other,,RandomForestClassifier,angular,noPost,0.917,0.916,0.916
261,plant-other,,RandomForestClassifier,angular,noPost,0.919,0.915,0.915
153,animal-plant-vehicle,,RandomForestClassifier,angular,noPost,0.901,0.899,0.895
306,plant-vehicle-other,,RandomForestClassifier,angular,noPost,0.894,0.887,0.878
54,animal-plant-other,,RandomForestClassifier,angular,noPost,0.865,0.861,0.859
198,animal-vehicle-other,,RandomForestClassifier,angular,noPost,0.872,0.862,0.848


## Study errors

Here are the detailed classification errors for the combined __animal, plant and vehicle__ domains:

In [25]:
!python ../../toolbox/script/detailConceptClfError.py ../../data/voc/npy/wikiEn-skipgram.npy ../../data/learnedModel/domain/animal-plant-vehicle__RandomForestClassifier_angular_noPost.dill ../../data/domain/luu_animal.txt animal ../../data/domain/luu_plant.txt plant ../../data/domain/luu_vehicle.txt vehicle

1388424 loaded from wikiEn-skipgram
mem usage 1.6GiB
loaded time 2.28821516037 s
input: elder  /  predicted: animal  /  true: plant  /  proba:[ 0.52770563  0.46344372  0.00885065]
input: periwinkle  /  predicted: animal  /  true: plant  /  proba:[ 0.59370291  0.38447223  0.02182487]
input: rocket  /  predicted: animal  /  true: plant  /  proba:[ 0.36572428  0.26895122  0.36532451]
input: dumper  /  predicted: animal  /  true: vehicle  /  proba:[ 0.53974509  0.17098901  0.28926589]
input: electric  /  predicted: animal  /  true: vehicle  /  proba:[ 0.60716588  0.10343283  0.28940128]
input: rocket  /  predicted: animal  /  true: vehicle  /  proba:[ 0.36572428  0.26895122  0.36532451]
input: semi  /  predicted: animal  /  true: vehicle  /  proba:[ 0.67749268  0.03053849  0.29196884]
input: tipper  /  predicted: animal  /  true: vehicle  /  proba:[ 0.48796488  0.11740702  0.3946281 ]

--  REPORT  --
             precision    recall  f1-score   support

     animal       0.99      1.00    

Be aware that there is __no cross-validation__ here, __so we are overfitting__.

Yet we see that __collisions seem to be due to one word having several meanings__:<br>
__rocket__, for example, __is both a plant and a vehicle__, which makes it an __unsolvable case for this model__.
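
One way to spot such ambiguous words is to look at the gap between the two most likely classes. The sketch below reuses the probabilities printed in the error log above. The class order (animal, plant, vehicle) is assumed from the order of the script arguments, and the `MARGIN` threshold is a hypothetical value chosen for illustration.

```python
# Class order assumed from the script arguments: animal, plant, vehicle
CLASSES = ['animal', 'plant', 'vehicle']

# Probabilities copied from the error log above
errors = {
    'elder': [0.52770563, 0.46344372, 0.00885065],
    'periwinkle': [0.59370291, 0.38447223, 0.02182487],
    'rocket': [0.36572428, 0.26895122, 0.36532451],
    'dumper': [0.53974509, 0.17098901, 0.28926589],
    'electric': [0.60716588, 0.10343283, 0.28940128],
    'semi': [0.67749268, 0.03053849, 0.29196884],
    'tipper': [0.48796488, 0.11740702, 0.3946281],
}


def top_two(proba):
    """Return the two most likely classes and the probability gap between them."""
    ranked = sorted(zip(proba, CLASSES), reverse=True)
    (p1, c1), (p2, c2) = ranked[0], ranked[1]
    return c1, c2, p1 - p2


MARGIN = 0.05  # hypothetical threshold, to be tuned

# Keep only the words whose two best classes are nearly tied
ambiguous = {word: top_two(proba)
             for word, proba in errors.items()
             if top_two(proba)[2] < MARGIN}
print(ambiguous)
```

With this threshold only _rocket_ is flagged: its animal and vehicle probabilities differ by about 0.0004, so the prediction is close to a coin flip.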



Let's compare with the same domains, __this time adding other__:

In [26]:
!python ../../toolbox/script/detailConceptClfError.py ../../data/voc/npy/wikiEn-skipgram.npy ../../data/learnedModel/domain/animal-plant-vehicle-other__RandomForestClassifier_angular_noPost.dill ../../data/domain/luu_animal.txt animal ../../data/domain/luu_plant.txt plant ../../data/domain/luu_vehicle.txt vehicle ../../data/domain/all_1400.txt other

1388424 loaded from wikiEn-skipgram
mem usage 1.6GiB
loaded time 2.23708605766 s
input: ant  /  predicted: other  /  true: animal  /  proba:[ 0.44698061  0.47728024  0.03956311  0.03617603]
input: aphid  /  predicted: plant  /  true: animal  /  proba:[ 0.4205326   0.06616029  0.50666878  0.00663833]
input: bryozoan  /  predicted: other  /  true: animal  /  proba:[ 0.28859803  0.37158572  0.18269929  0.15711696]
input: bullock  /  predicted: other  /  true: animal  /  proba:[ 0.44268397  0.47954619  0.0655728   0.01219704]
input: cephalochordate  /  predicted: other  /  true: animal  /  proba:[ 0.29835134  0.59678727  0.09063798  0.0142234 ]
input: cephalopod  /  predicted: other  /  true: animal  /  proba:[ 0.40773633  0.44026934  0.10582134  0.04617299]
input: coelenterate  /  predicted: other  /  true: animal  /  proba:[ 0.35773328  0.42768328  0.16387334  0.0507101 ]
input: collembolan  /  predicted: other  /  true: animal  /  proba:[ 0.28649462  0.42505779  0.27505705  0.01339054]


We observe 3 things:
* __A very bad recall on vehicle__, which could be explained by the small size of the dataset or the face.
* The __multiple semantic meanings__ of words __create noise__:
    * __grub__ is predicted as __other__ instead of animal
    * ...
* __Almost all conflicts are related to the 'other' class__. Indeed, looking deeper into the _other_ dataset<br> _(which is a random sample from the vocabulary)_, we notice that __a lot of its words actually belong to one of the other domains__ (animal, plant, vehicle):
    * cuscutaceae is a plant
    * sloanii is an animal
    * ...
    
Once again, the model proves its __ability to 'challenge' the ground truth__ _(which is highly biased here)_.
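
The share of 'other'-related conflicts can be tallied directly from the script output. The sketch below parses the `input / predicted / true / proba` lines in the format printed above and counts each (true, predicted) pair. It only uses the excerpt shown in this notebook, not the full log, and the helper name `parse_errors` is ours.

```python
import re
from collections import Counter

# Excerpt of the error log printed above (not the full log)
LOG = """\
input: ant  /  predicted: other  /  true: animal  /  proba:[ 0.44698061  0.47728024  0.03956311  0.03617603]
input: aphid  /  predicted: plant  /  true: animal  /  proba:[ 0.4205326   0.06616029  0.50666878  0.00663833]
input: bryozoan  /  predicted: other  /  true: animal  /  proba:[ 0.28859803  0.37158572  0.18269929  0.15711696]
input: bullock  /  predicted: other  /  true: animal  /  proba:[ 0.44268397  0.47954619  0.0655728   0.01219704]
input: cephalochordate  /  predicted: other  /  true: animal  /  proba:[ 0.29835134  0.59678727  0.09063798  0.0142234 ]
input: cephalopod  /  predicted: other  /  true: animal  /  proba:[ 0.40773633  0.44026934  0.10582134  0.04617299]
input: coelenterate  /  predicted: other  /  true: animal  /  proba:[ 0.35773328  0.42768328  0.16387334  0.0507101 ]
input: collembolan  /  predicted: other  /  true: animal  /  proba:[ 0.28649462  0.42505779  0.27505705  0.01339054]
"""

LINE_RE = re.compile(
    r'input:\s*(\S+)\s*/\s*predicted:\s*(\S+)\s*/\s*true:\s*(\S+)\s*/\s*proba:\[(.*?)\]')


def parse_errors(log_text):
    """Return one (word, predicted, true, proba) tuple per error line."""
    rows = []
    for match in LINE_RE.finditer(log_text):
        word, predicted, true, proba = match.groups()
        rows.append((word, predicted, true, [float(p) for p in proba.split()]))
    return rows


rows = parse_errors(LOG)
# Count how often each true class is confused with each predicted class
confusions = Counter((true, predicted) for _, predicted, true, _ in rows)
for (true, predicted), count in confusions.most_common():
    print(true, '->', predicted, count)
```

On this excerpt, 7 of the 8 errors are animals predicted as _other_, which matches the observation above.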

# Conclusion

The __recognition rate is satisfying__, but the classification errors highlight one main issue:
* A lot of __words have several meanings but only one semantic position in the word2vec space__.<br>
_A workaround for this could be to adapt the annotation of the dataset, but the real problem is inherent to word2vec_.