In [2]:
import __init__

import pandas as pd

from bin.splitLogFile import extractSummaryLine

# Experiment

Using __WordNet__ as ground truth, we tried to learn a classifier that __detects taxonomic relations__ between words.<br>
We work here on the animal domain _(feline > cat / animal > mammal)_.

We try two approaches here:
* direct parent
* belonging to a common branch

To do so, we explore the __Cartesian product__ of the following settings:
* __full:__ check for a direct parent, or for belonging to a common branch
* __strict:__ try to compose missing concepts
* __randomForest / knn:__ knn lets us check whether there is anything consistent to learn; randomForest is a basic model used as a first approach to learn the function
* __feature:__ one of the features presented in the guided tour
* __postFeature:__ any extra processing applied after feature extraction (such as normalisation)
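
As an illustration, the grid of configurations can be enumerated with `itertools.product`. This is a minimal sketch: the option values are the ones visible in the results tables of this notebook, and the assumption that they form the complete grid is ours, not something the training script confirms.

```python
from itertools import product

# Option values as they appear in the results tables of this notebook.
# Assuming they form the complete grid is a guess made for illustration.
datasets = ["animal", "animalAll"]      # direct parent / common branch
strict_modes = ["", "strict"]
classifiers = ["RandomForestClassifier", "KNeighborsClassifier"]
features = ["subCarth", "subPolar", "subAngular",
            "concatCarth", "concatPolar", "concatAngular"]
post_features = ["noPost", "postNormalize"]

# Each element of the Cartesian product is one configuration to train.
configs = list(product(datasets, strict_modes, classifiers,
                       features, post_features))

print(len(configs))
print(configs[0])
```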

We use 10-fold cross-validation.
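
For readers unfamiliar with the technique, here is a minimal pure-Python sketch of how k-fold splitting partitions a dataset. The helper `k_fold_indices` is hypothetical and only illustrates the idea; the actual training script may rely on a library implementation instead.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

# Every sample is used for testing exactly once across the 10 folds.
splits = list(k_fold_indices(25, k=10))
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), len(test_idx))
```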

Negative samples are generated by shuffling pairs.
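
One possible way to generate such negatives is sketched below. The function `shuffle_negatives` and the filtering rules are assumptions made for illustration; the script that produced `wordnetTaxo_animal_fake.txt` may proceed differently.

```python
import random

def shuffle_negatives(pairs, seed=0):
    """Build fake (parent, child) pairs by shuffling the child column.

    Shuffled pairs that happen to be true pairs, or that pair a word
    with itself, are dropped, so fewer negatives than positives may
    be returned.
    """
    positives = set(pairs)
    children = [child for _, child in pairs]
    random.Random(seed).shuffle(children)
    return [(parent, child)
            for (parent, _), child in zip(pairs, children)
            if (parent, child) not in positives and parent != child]

# Toy positive pairs: the first one comes from this notebook, the rest are made up.
true_pairs = [("feline", "cat"), ("canine", "dog"),
              ("bird", "owl"), ("fish", "trout")]
fake_pairs = shuffle_negatives(true_pairs)
print(fake_pairs)
```

Note that this filter only removes exact matches with the positive list; a shuffled pair could still be a valid, more distant ancestor relation.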

We restrict ourselves to the animal domain here.

_Once you have downloaded the files, you can use this script to reproduce the experiment at home_:

```
python experiment/trainAll_taxoClf.py > ../data/learnedModel/taxo/log.txt
```

_To build a taxonomy dataset for another domain, you can use the following script_:

```
python script/extractWordnetTaxo.py vehicle > /tmp/wordnetTaxo_vehicle.txt
```

# Results

Here is a summary of the results we gathered.
Detailed reports can be found in the logs.

In [29]:
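# Parse one summary line per trained configuration; the file is closed on exit.
with open('../../data/learnedModel/taxo/summary.txt') as summaryFile:
    summaryDf = pd.DataFrame([extractSummaryLine(line) for line in summaryFile],
                             columns=['full', 'strict', 'clf', 'feature', 'post', 'precision', 'recall', 'f1'])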

summaryDf = summaryDf[summaryDf['clf'] != 'KNeighborsClassifier']

directDf = summaryDf[summaryDf['full'] == 'animal']
belongDf = summaryDf[summaryDf['full'] == 'animalAll']

Considering the nature of the data (points are very close for semantically close concepts, e.g. _puppy, dog_), __Knn__ is not relevant.

Since the functions learned for __belonging to a chain__ and for __direct parent__ are very different in nature, we study their results separately.

# Direct parent

## Results

In [30]:
directDf.sort_values('f1', ascending=False)[:10]

| | full | strict | clf | feature | post | precision | recall | f1 |
|---|---|---|---|---|---|---|---|---|
| 70 | animal | | RandomForestClassifier | subPolar | noPost | 0.843 | 0.843 | 0.843 |
| 66 | animal | | RandomForestClassifier | subAngular | noPost | 0.837 | 0.836 | 0.836 |
| 68 | animal | | RandomForestClassifier | subCarth | noPost | 0.828 | 0.825 | 0.825 |
| 64 | animal | | RandomForestClassifier | concatPolar | noPost | 0.814 | 0.813 | 0.813 |
| 62 | animal | | RandomForestClassifier | concatCarth | noPost | 0.812 | 0.808 | 0.808 |
| 65 | animal | | RandomForestClassifier | concatPolar | postNormalize | 0.804 | 0.803 | 0.803 |
| 90 | animal | strict | RandomForestClassifier | subAngular | noPost | 0.807 | 0.803 | 0.803 |
| 63 | animal | | RandomForestClassifier | concatCarth | postNormalize | 0.802 | 0.798 | 0.798 |
| 84 | animal | strict | RandomForestClassifier | concatAngular | noPost | 0.8 | 0.797 | 0.797 |
| 60 | animal | | RandomForestClassifier | concatAngular | noPost | 0.796 | 0.796 | 0.795 |


Thus, we observe a fairly good f1-score (0.843) with __RandomForest__, __polar subtraction__ and __no normalisation__.<br>
_Not normalising is acceptable given that the classifier is based on decision trees._

__Angular subtraction__ also provides good results, supporting the idea that the angle of a concept vector points in its semantic direction.
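
To make the feature names more concrete, here is a hedged sketch of what "polar" and "angular" subtraction could look like. It assumes that "polar" means hyperspherical coordinates (one radius followed by angles), that "angular" keeps only the angles, and that "sub" is an element-wise difference between the parent and child representations. These definitions are our guesses; the guided tour remains the reference, and all function names below are hypothetical.

```python
import math

def to_polar(vec):
    """Hyperspherical coordinates [radius, angle_1, ..., angle_{n-1}] (n >= 2)."""
    radius = math.sqrt(sum(x * x for x in vec))
    angles = []
    for i in range(len(vec) - 2):
        # Angle between axis i and the remaining components, in [0, pi].
        tail = math.sqrt(sum(x * x for x in vec[i + 1:]))
        angles.append(math.atan2(tail, vec[i]))
    # The last angle covers the full circle, in (-pi, pi].
    angles.append(math.atan2(vec[-1], vec[-2]))
    return [radius] + angles

def sub_polar(parent_vec, child_vec):
    """Assumed 'subPolar': difference of the polar representations."""
    return [p - c for p, c in zip(to_polar(parent_vec), to_polar(child_vec))]

def sub_angular(parent_vec, child_vec):
    """Assumed 'subAngular': same difference, with the radius dropped."""
    return sub_polar(parent_vec, child_vec)[1:]

# Toy 3-dimensional vectors standing in for word embeddings.
print(sub_polar([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
print(sub_angular([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```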

## Study errors

Here are the details of:
* False positives, i.e. incorrectly predicted direct-parent relations. Possible reasons:
    * wrong branch
    * correct branch, but not the __direct__ parent
* False negatives, i.e. undetected parent relations
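
The two error types can be separated as in the sketch below. The labels `isParent` / `notParent` are the ones passed to the script in the next cell; the helper `split_errors` and the predicted outcomes for the sample pairs are hypothetical, except for the _salamander > newt_ line, which appears in the output further down.

```python
def split_errors(samples, positive="isParent", negative="notParent"):
    """Split misclassified pairs into false positives and false negatives.

    samples: iterable of (pair, predicted_label, true_label).
    """
    false_pos = [pair for pair, pred, true in samples
                 if pred == positive and true == negative]
    false_neg = [pair for pair, pred, true in samples
                 if pred == negative and true == positive]
    return false_pos, false_neg

# Illustrative predictions only.
samples = [
    (("salamander", "newt"), "notParent", "isParent"),     # missed relation
    (("feline", "cat"), "isParent", "isParent"),           # correct
    (("reptile", "ceratopsian"), "isParent", "notParent"), # not the direct parent
    (("cat", "piranha"), "notParent", "notParent"),        # correct
]
false_pos, false_neg = split_errors(samples)
print("false positives:", false_pos)
print("false negatives:", false_neg)
```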

In [24]:
!python ../../toolbox/script/detailConceptPairClfError.py ../../data/voc/npy/wikiEn-skipgram.npy ../../data/learnedModel/taxo/animal__RandomForestClassifier_subPolar_noPost.dill ../../data/wordPair/wordnetTaxo_animal.txt isParent ../../data/wordPair/wordnetTaxo_animal_fake.txt notParent

1388424 loaded from wikiEn-skipgram
mem usage 1.6GiB
loaded time 2.35443711281 s
input: (chordate, '>', cephalochordate)  /  predicted: notParent  /  true: isParent  /  proba:[ 0.48115061  0.51884939]
input: (chordate, '>', tunicate)  /  predicted: notParent  /  true: isParent  /  proba:[ 0.3998105  0.6001895]
input: (salamander, '>', newt)  /  predicted: notParent  /  true: isParent  /  proba:[ 0.47239304  0.52760696]
input: (lungfish, '>', ceratodus)  /  predicted: notParent  /  true: isParent  /  proba:[ 0.48341822  0.51658178]
input: (sturgeon, '>', beluga)  /  predicted: notParent  /  true: isParent  /  proba:[ 0.44348526  0.55651474]
input: (characin, '>', piranha)  /  predicted: notParent  /  true: isParent  /  proba:[ 0.40517024  0.59482976]
input: (cyprinid, '>', goldfish)  /  predicted: notParent  /  true: isParent  /  proba:[ 0.47709308  0.52290692]
input: (goldfish, '>', silverfish)  /  predicted: notParent  /  true: isParent  /  proba:[ 0.46178254  0.53821746]
input: (kill

We can identify 3 sources of incorrect predictions here:
* __Wrong prediction__
* __Words with several meanings__ (which word2vec does not support)
    * ray, '>', skate
    * ...
* __Not direct parent but in correct taxonomy chain__
    * reptile, '>', ceratopsian
    * ...
    
The __third case__ is especially __interesting__:

It reflects a __difference between the taxonomy ground truth__ in WordNet __and the one learned by the classifier.__<br>
Since the __vocabulary here is very specific__, it __may not occur often in the corpus__, which raises the following question:

__What taxonomy__ could we build (and how different would it be) with a __dedicated corpus__?

# Belonging to chain

## Results

In [32]:
belongDf.sort_values('f1', ascending=False)[:10]

| | full | strict | clf | feature | post | precision | recall | f1 |
|---|---|---|---|---|---|---|---|---|
| 36 | animalAll | strict | RandomForestClassifier | concatAngular | noPost | 0.971 | 0.971 | 0.971 |
| 38 | animalAll | strict | RandomForestClassifier | concatCarth | noPost | 0.971 | 0.971 | 0.971 |
| 14 | animalAll | | RandomForestClassifier | concatCarth | noPost | 0.969 | 0.969 | 0.969 |
| 40 | animalAll | strict | RandomForestClassifier | concatPolar | noPost | 0.968 | 0.968 | 0.968 |
| 37 | animalAll | strict | RandomForestClassifier | concatAngular | postNormalize | 0.967 | 0.967 | 0.967 |
| 39 | animalAll | strict | RandomForestClassifier | concatCarth | postNormalize | 0.965 | 0.965 | 0.965 |
| 15 | animalAll | | RandomForestClassifier | concatCarth | postNormalize | 0.963 | 0.963 | 0.963 |
| 41 | animalAll | strict | RandomForestClassifier | concatPolar | postNormalize | 0.963 | 0.963 | 0.963 |
| 12 | animalAll | | RandomForestClassifier | concatAngular | noPost | 0.961 | 0.961 | 0.961 |
| 16 | animalAll | | RandomForestClassifier | concatPolar | noPost | 0.961 | 0.961 | 0.961 |


Thus, we observe a __really high__ f1-score (up to 0.971) with __RandomForest__, __concatenation techniques__ and __no normalisation__.

_Not normalising is acceptable given that the classifier is based on decision trees._

Since we have a __much bigger dataset__ here, it __may also include 'almost' redundant__ data, e.g. _feline > cat, feline > jungle cat_.<br>
Are we __overfitting despite cross-validation__?
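
One way to probe this question, which is not something the experiment above did, would be to build folds so that all pairs sharing the same parent fall into the same fold. Near-duplicates such as _feline > cat_ and _feline > jungle cat_ then never sit on both sides of a train/test split. The helper below is a hypothetical sketch of that idea.

```python
from collections import defaultdict

def group_folds_by_parent(pairs, k=10):
    """Assign (parent, child) pairs to k folds, keeping each parent in one fold."""
    groups = defaultdict(list)
    for parent, child in pairs:
        groups[parent].append((parent, child))
    folds = [[] for _ in range(k)]
    # Place the largest groups first, each into the currently smallest fold,
    # to keep fold sizes roughly balanced.
    for parent in sorted(groups, key=lambda p: (-len(groups[p]), p)):
        smallest = min(range(k), key=lambda i: len(folds[i]))
        folds[smallest].extend(groups[parent])
    return folds

# Toy pairs: the feline ones come from this notebook, the others are made up.
pairs = [("feline", "cat"), ("feline", "jungle cat"),
         ("canine", "dog"), ("canine", "wolf"), ("bird", "owl")]
folds = group_folds_by_parent(pairs, k=3)
for fold in folds:
    print(fold)
```

If the scores dropped sharply with such grouped folds, that would suggest the plain cross-validation estimate benefits from near-redundant pairs.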

__Subtraction techniques__ also provide good results.

## Study errors

Here are the details of:
* False positives, i.e. incorrectly predicted chains
* False negatives, i.e. undetected chain membership

In [34]:
!python ../../toolbox/script/detailConceptPairClfError.py ../../data/voc/npy/wikiEn-skipgram.npy ../../data/learnedModel/taxo/animalAll_strict_RandomForestClassifier_concatAngular_noPost.dill ../../data/wordPair/wordnetTaxo_animal_all.txt isParent ../../data/wordPair/wordnetTaxo_animal_fake.txt notParent

1388424 loaded from wikiEn-skipgram
mem usage 1.6GiB
loaded time 2.1897380352 s
input: (cyprinid, '>', silverfish)  /  predicted: notParent  /  true: isParent  /  proba:[ 0.4502493  0.5497507]
input: (topminnow, '>', mollie)  /  predicted: notParent  /  true: isParent  /  proba:[ 0.49513321  0.50486679]
input: (perch, '>', walleye)  /  predicted: notParent  /  true: isParent  /  proba:[ 0.45020875  0.54979125]
input: (pike, '>', pickerel)  /  predicted: notParent  /  true: isParent  /  proba:[ 0.49210614  0.50789386]
input: (billfish, '>', spearfish)  /  predicted: notParent  /  true: isParent  /  proba:[ 0.48102727  0.51897273]
input: (grouper, '>', coney)  /  predicted: notParent  /  true: isParent  /  proba:[ 0.47348371  0.52651629]
input: (snapper, '>', yellowtail)  /  predicted: notParent  /  true: isParent  /  proba:[ 0.47417262  0.52582738]
input: (wrasse, '>', bluehead)  /  predicted: notParent  /  true: isParent  /  proba:[ 0.38904442  0.61095558]
input: (remora, '>', sharksuc

We can identify 3 sources of incorrect predictions here:
* __Wrong prediction__
* __Words with several meanings__ (which word2vec does not support)
    * hedgehog, '>', grub
    * ...
* __Challenges to the WordNet taxonomy__
    * proboscidean, '>', clam _(proboscideans are defined as having a trunk, which the clam also has)_
    * ...

In the __third case__ we again observe a __difference between the taxonomy ground truth__ in WordNet __and the one learned by the classifier.__

# Conclusion

The __recognition rate is quite satisfying__ here considering the basic model we use. More __advanced techniques__ could __improve the results.__

It is interesting to note that __false predictions__ also __challenge the ground truth__.<br>
__Unlike with antonyms__, a domain-specific human expert would probably be able to decide.

_As a side note_, __a word having several meanings__ introduces noise, showing some __limitations of the word2vec model__.
_____________________

__Further exploration could include:__
* Applying the animal-trained model to a different domain.
* Using a graph similarity measure to quantify the global error with respect to the WordNet taxonomy.
* Finding a feature that allows sub-graphs to be used as the dataset.