# Concept embeddings

### Goal: compute embeddings for Wikipedia concepts

#### What is a Wikipedia concept?

Here, a Wikipedia concept is either a class defined in the ontology (person, animal, place, city, artist, etc.) or a named entity, which is an instance of an ontology node and has its own Wikipedia page. For example: Cory Barlog, who is an Artist, which is itself a Person; or Paris, which is a City, which is itself a PopulatedPlace.

#### How are these embeddings computed?

To compute these embeddings, we rely on the abstracts of the Wikipedia pages corresponding to each concept. Abstracts come in two sizes, short and long; we try both.
These abstracts are fed to a doc2vec model, which learns a high-dimensional representation in which two articles (and therefore two concepts or named entities) that are close in meaning are also close to each other.
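
Closeness between two embeddings is commonly measured with cosine similarity. The sketch below illustrates that idea with plain Python; the three-dimensional vectors are invented for illustration and are not model output, and `rank_neighbours` is a hypothetical helper, not a gensim function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_neighbours(name, vectors, topn=2):
    """Rank every other name by cosine similarity to `name`, best first."""
    query = vectors[name]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != name
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vectors, made up for this example only.
toy_vectors = {
    "Paris": [0.9, 0.1, 0.0],
    "Lyon": [0.8, 0.2, 0.1],
    "Hormone": [0.0, 0.1, 0.9],
}

neighbours = rank_neighbours("Paris", toy_vectors)
print([name for name, _ in neighbours])  # ['Lyon', 'Hormone']
```

With real embeddings the same ranking would be done over vectors of dimension 100 or more, but the principle is unchanged.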

### Sources:

- for named entities (which correspond to Wikipedia pages, such as "Bretagne", "Testostérone", "Napoléon Bonaparte"), the source is the <a href="https://wiki.dbpedia.org/develop/datasets">DBpedia datasets</a>
- for concepts one level of abstraction higher,

In [57]:
import pickle
import json
import re
import string

from gensim.parsing.preprocessing import (
    preprocess_string,
    remove_stopwords,
    strip_tags,
    strip_punctuation,
    strip_numeric,
    strip_multiple_whitespaces,
    strip_short,
)

# Punctuation translation table and printable-character set (not used by the cells below)
table = str.maketrans('', '', '!"#$%\'()*+,-./:;<=>?@[\\]^_`{|}~')
printable = set(string.printable)

# The Krovetz stemmer is much less "destructive" than Porter, which makes it a good fit for IR.
from krovetzstemmer import Stemmer
ks = Stemmer()

# Filters applied, in order, to every abstract. Stemming is left disabled (`lambda x: ks.stem(x)`).
CUSTOM_FILTERS = [lambda x: x.lower(), strip_tags, strip_multiple_whitespaces, strip_punctuation, strip_numeric, lambda x: strip_short(x, minsize=3)]

In [120]:
with open("data/concepts.pkl", "rb") as f:
    concepts = pickle.load(f)

In [121]:
"Alex_Kozinski" in concepts

True

In [122]:
# Dictionary to serialize as JSON: concept name -> preprocessed text
abstracts = {}
files = ["long_abstracts_en.nt", "long_abstracts_en_02.nt"]

for f in files:
    with open("data/" + f, "r", encoding="utf-8") as file:
        next(file, None)  # skip the first line of each file
        for line in file:
            parts = line.split(" ")
            # Subject IRI without its prefix and trailing ">"
            nom = parts[0].replace("<http://dbpedia.org/resource/", "")[:-1]
            if nom in concepts and nom not in abstracts:
                # Object of the triple, with non-ASCII characters dropped
                raw = " ".join(parts[2:]).encode("ascii", "ignore").decode("utf-8")
                abstracts[nom] = " ".join(preprocess_string(raw, CUSTOM_FILTERS))

# Ontology definitions: English rdfs:comment literals
with open("data/dbpedia_2016-10.nt", "r") as file:
    for line in file:
        parts = line.split(" ")
        if len(parts) > 2 and parts[1] == "<http://www.w3.org/2000/01/rdf-schema#comment>":
            nom = parts[0].replace('<http://dbpedia.org/ontology/', "")[:-1]
            comment = " ".join(parts[2:])
            if nom != "" and "\"@en" in comment and nom in concepts:
                # Drop URLs and the language tag before filtering
                text = re.sub(r'https?:\/\/.*[\r\n]*', '', comment.replace("@en", ""))
                abstracts[nom] = " ".join(preprocess_string(text.strip(), CUSTOM_FILTERS))

with open("data/concepts_text.json", "w") as f:
    json.dump(abstracts, f)
print("collection saved.")

collection saved.
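
The cell above splits each line on spaces and relies on token positions. As an alternative, here is a minimal sketch that parses one literal-valued N-Triples line with a regular expression. The sample line is invented for illustration, `parse_abstract_line` is a hypothetical helper, and the pattern deliberately ignores typed literals (`^^<...>`) and blank nodes.

```python
import re

# Matches lines of the form: <subject> <predicate> "literal"@lang .
TRIPLE = re.compile(
    r'^<([^>]+)>\s+<([^>]+)>\s+"((?:[^"\\]|\\.)*)"(?:@([A-Za-z-]+))?\s*\.\s*$'
)

RESOURCE_PREFIX = "http://dbpedia.org/resource/"

def parse_abstract_line(line):
    """Return (name, text, lang) for a literal-valued resource triple, else None."""
    match = TRIPLE.match(line)
    if match is None:
        return None  # comment line, blank line, or non-literal object
    subject, _predicate, literal, lang = match.groups()
    if not subject.startswith(RESOURCE_PREFIX):
        return None
    return subject[len(RESOURCE_PREFIX):], literal, lang

# Invented sample line, shaped like an entry of the abstracts dump.
sample = (
    '<http://dbpedia.org/resource/Hormone> '
    '<http://dbpedia.org/ontology/abstract> '
    '"A hormone is a signaling molecule."@en .\n'
)

print(parse_abstract_line(sample))
print(parse_abstract_line("# a comment line\n"))
```

Returning the language tag separately avoids having to strip `@en` from the text afterwards, and comment lines are rejected without a special case for the first line.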


In [123]:
print("there are {} concept definitions out of {} concepts.".format(len(abstracts), len(concepts)))

there are 11085 concept definitions out of 11632 concepts.


In [124]:
concepts - set(abstracts.keys())

{'%C3%9Cberlingen__Reichsstadt_%C3%9Cberlingen',
 '1912%E2%80%9313_Huddersfield_Town_F.C._season__Frank_Mann',
 '1994%E2%80%9395_FA_Premier_League__Shay_Given',
 'A.C._Libertas__Alberto_Menghi',
 'Aach,_Baden-W%C3%BCrttemberg__Herrschaft_Aach',
 'Aalen__Reichsstadt_Aalen',
 'Aarberg__rafschaft_Aarberg',
 'Abbey_of_Saint_Gall__F%C3%BCrstabtei_St._Gallen',
 'AcademicConference',
 'Achamillai_Achamillai__Achamillai',
 'Action_Jackson__Action_Jackson',
 'Activity',
 'Albi__Episcopal_City_of_Albi',
 'Ali_(name)',
 'AmateurBoxer',
 'AmericanFootballCoach',
 'AmericanFootballPlayer',
 'AmericanFootballTeam',
 'AmusementParkAttraction',
 'AnatomicalStructure',
 'Apollon_Smyrni_F.C.__Vangelis_Valaoras',
 'Aquileia__rchaeological_Area_and_the_Patriarchal_Basilica_o',
 'Archeologist',
 'ArcherPlayer',
 'Aristocrat',
 "Arlon__Maark_grofschaft_vun_Arel_lb_Marquisat_comt%C3%A9_d'Arl",
 'Article',
 'ArtistDiscography',
 'Asante_Kotoko_SC__Maxwell_Konadu',
 'Aston_Villa_F.C.__Graham_Burke',
 'Athlete'
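
Many of the missing names above are percent-encoded, and many contain a double underscore. A small sketch like the following can make them easier to inspect. Reading `__` as a separator between a page and a sub-resource is an assumption drawn from the output above, and `describe` is a hypothetical helper.

```python
from urllib.parse import unquote

# A few names copied from the output above.
missing_sample = [
    "%C3%9Cberlingen__Reichsstadt_%C3%9Cberlingen",
    "1994%E2%80%9395_FA_Premier_League__Shay_Given",
    "AmateurBoxer",
    "Ali_(name)",
]

def describe(name):
    """Percent-decode a name and split it into (page, sub_resource or None)."""
    page, sep, sub = unquote(name).partition("__")
    return page, (sub if sep else None)

for name in missing_sample:
    print(describe(name))
```

Grouping the missing names this way shows at a glance how many are plain ontology classes (no `__`, no encoding) and how many are composite names.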

In [28]:
abstracts["Hormone"]

'hormone from greek ubc uae impetus chemical released cell gland organ one part the body that affects cells other parts the organism only small amount hormone required alter cell metabolism essence chemical messenger that transports signal from one cell another all multicellular organisms produce hormones plant hormones are also called phytohormones hormones animals are often transported the blood cells respond hormone when they express specific receptor for that hormone the hormone binds the receptor protein resulting the activation signal transduction mechanism that ultimately leads cell type specific responses endocrine hormone molecules are secreted released directly into the bloodstream typically into fenestrated capillaries hormones with paracrine function diffuse through the interstitial spaces nearby target tissues variety exogenous chemical compounds both natural and synthetic have hormone like effects both humans and wildlife their interference with the synthesis secretion tr
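
The abstract above reads oddly for two reasons. Words of one or two letters are gone, which is the effect of `strip_short` with `minsize=3`. The tokens `ubc uae` look like debris from `\uXXXX` escapes that were filtered as ordinary text. The sketch below reproduces that effect, assuming the dump writes non-ASCII characters as `\uXXXX` escapes. `rough_filters` is a standard-library approximation of `CUSTOM_FILTERS`, not gensim's implementation, and the sample string is invented.

```python
import re

# N-Triples escapes: \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
UNICODE_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def decode_nt_escapes(text):
    """Replace unicode escapes with the characters they encode."""
    return UNICODE_ESCAPE.sub(
        lambda m: chr(int(m.group(1) or m.group(2), 16)), text
    )

def rough_filters(text):
    """Approximate the notebook's filter chain with the standard library."""
    text = text.lower()
    text = re.sub(r"<[^>]*>", " ", text)    # like strip_tags
    text = re.sub(r"[^\w\s]|_", " ", text)  # like strip_punctuation
    text = re.sub(r"\d+", "", text)         # like strip_numeric
    return " ".join(t for t in text.split() if len(t) >= 3)  # like strip_short

# Invented sample containing the escaped Greek word for "impetus".
raw = r"Hormone (from Greek \u1F41\u03C1\u03BC\u03AE, impetus)"

without_decoding = rough_filters(raw)
decoded = decode_nt_escapes(raw).encode("ascii", "ignore").decode("ascii")
with_decoding = rough_filters(decoded)

print(without_decoding)  # hormone from greek ubc uae impetus
print(with_decoding)     # hormone from greek impetus
```

Decoding the escapes before dropping non-ASCII characters keeps such fragments out of the training vocabulary.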

### Training the model

In [140]:
import gensim

with open("data/concepts_text.json", "r") as f:
    abstracts = json.load(f)

# One tagged document per concept; the tag is the concept name.
train_corpus = [gensim.models.doc2vec.TaggedDocument(abstracts[s].split(), tags=[s]) for s in abstracts]

# dm=1 selects the distributed-memory (PV-DM) training algorithm.
model = gensim.models.doc2vec.Doc2Vec(vector_size=100, window=8, min_count=5, epochs=40, dm=1)
model.build_vocab(train_corpus)

# Train the model
%time model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

CPU times: user 2min 26s, sys: 3.09 s, total: 2min 29s
Wall time: 58.7 s


In [150]:
abstracts["SoccerLeague"]

'group sports teams that compete against each other soccer'

In [152]:
# Note: in gensim >= 4.0, `model.docvecs` is named `model.dv`.
model.docvecs.most_similar([model.docvecs["BasketballLeague"]], topn=14)

[('BasketballLeague', 1.0000001192092896),
 ('IceHockeyLeague', 0.952705442905426),
 ('VolleyballLeague', 0.9440698027610779),
 ('CyclingLeague', 0.941717803478241),
 ('SoccerLeague', 0.9358798861503601),
 ('SpeedwayLeague', 0.9340630173683167),
 ('LacrosseLeague', 0.9328960180282593),
 ('Free_Imperial_City_of_Kempten', 0.9317072629928589),
 ('BaseballLeague', 0.9310052990913391),
 ('PoloLeague', 0.9284431338310242),
 ('Mike_Reynolds_(actor)', 0.9270393252372742),
 ('Ponzano_Monferrato', 0.9268031716346741),
 ('BowlingLeague', 0.9266229867935181),
 ('Dilsen-Stokkem', 0.926305890083313)]
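
The neighbour list above mixes ontology classes with named entities such as `Mike_Reynolds_(actor)`. One way to read such a result is to separate the two groups, as in this minimal sketch. The `ontology_classes` set is a hypothetical stand-in for names collected from the ontology file, and the scores are rounded copies of some of the values above.

```python
# A subset of the neighbours shown above, scores rounded to three decimals.
neighbours = [
    ("BasketballLeague", 1.000),
    ("IceHockeyLeague", 0.953),
    ("VolleyballLeague", 0.944),
    ("Free_Imperial_City_of_Kempten", 0.932),
    ("Mike_Reynolds_(actor)", 0.927),
    ("BowlingLeague", 0.927),
]

# Hypothetical: assumed to be the names that came from the ontology definitions.
ontology_classes = {
    "BasketballLeague",
    "IceHockeyLeague",
    "VolleyballLeague",
    "BowlingLeague",
}

def split_neighbours(neighbours, ontology_classes, query):
    """Split neighbours into (classes, entities), skipping the query itself."""
    classes, entities = [], []
    for name, score in neighbours:
        if name == query:
            continue
        target = classes if name in ontology_classes else entities
        target.append((name, score))
    return classes, entities

classes, entities = split_neighbours(neighbours, ontology_classes, "BasketballLeague")
print([name for name, _ in classes])
print([name for name, _ in entities])
```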