# Concept embeddings

### Goal: compute embeddings of Wikipedia concepts

#### What is a Wikipedia concept?

Here, a Wikipedia concept is either a class defined in the ontology (person, animal, place, city, artist, etc.) or a named entity, which is an instance of one of the ontology's nodes and has its own Wikipedia page. For example: Cory Barlog is an Artist, which is itself a Person; Paris is a City, which is itself a PopulatedPlace.
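
As an illustration only, the two examples above can be written as a small lookup table that is walked upwards from a named entity to its broader types. The dictionary `PARENT` and the function `type_chain` are hypothetical names introduced for this sketch; they are not part of the notebook or of the ontology files.

```python
# Hypothetical miniature of the ontology: each key maps to a broader type.
# Only the four links named in the text are included.
PARENT = {
    "Cory_Barlog": "Artist",
    "Artist": "Person",
    "Paris": "City",
    "City": "PopulatedPlace",
}


def type_chain(concept, parent=PARENT):
    """Return the concept followed by its successively broader types."""
    chain = [concept]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


print(type_chain("Cory_Barlog"))  # ['Cory_Barlog', 'Artist', 'Person']
print(type_chain("Paris"))        # ['Paris', 'City', 'PopulatedPlace']
```

Both the named entities (first element of each chain) and the ontology classes (the remaining elements) count as concepts here, so each of them is meant to receive its own embedding.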

#### How are these embeddings computed?

To compute these embeddings, we rely on the abstracts of the Wikipedia pages corresponding to each concept. Abstracts come in two sizes, short and long; we try both.
These abstracts are fed to a doc2vec model, which computes a high-dimensional representation in which two articles (and therefore the concepts or named entities they describe) that are close in meaning are also close in the vector space.
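
"Close" in such a vector space is usually measured with cosine similarity. The sketch below shows that measure on made-up 3-dimensional vectors; the numbers in `toy_vectors` are invented for illustration and are not output of any trained model, and `cosine_similarity` is a helper written for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented toy vectors (assumption): real doc2vec vectors are learned
# from text and have far more dimensions.
toy_vectors = {
    "Aristotle": [0.9, 0.1, 0.2],
    "Socrates": [0.8, 0.2, 0.1],
    "Paris": [0.1, 0.9, 0.3],
}

sim_philosophers = cosine_similarity(toy_vectors["Aristotle"], toy_vectors["Socrates"])
sim_unrelated = cosine_similarity(toy_vectors["Aristotle"], toy_vectors["Paris"])

# With these toy values the two philosophers are closer to each other
# than the philosopher is to the city.
print(sim_philosophers > sim_unrelated)  # True
```

A well-trained embedding should behave like the toy values: related concepts score near 1, unrelated ones noticeably lower.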

### Sources:

- For named entities (which correspond to Wikipedia pages, such as "Bretagne", "Testostérone", "Napoléon Bonaparte"), the source is the <a href="https://wiki.dbpedia.org/develop/datasets">DBpedia datasets</a>, which are extracted from Wikipedia.
- For concepts one level of abstraction higher, the source is the short definition from the ontology, given by the RDF property <http://www.w3.org/2000/01/rdf-schema#comment>.
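
Both sources are N-Triples files, where each line holds a subject, a predicate, and an object. As a minimal sketch, assuming lines whose object is a plain or language-tagged literal, such a line can be split with a regular expression. The pattern `TRIPLE_RE`, the function `parse_literal_triple`, and the sample definition text are made up for this example; the pattern deliberately ignores blank nodes, typed literals, and IRI objects.

```python
import re

# Simplified pattern for: <subject> <predicate> "literal"@lang .
# Assumption: the object is a quoted literal with an optional language tag.
TRIPLE_RE = re.compile(
    r'^<([^>]*)>\s+<([^>]*)>\s+"((?:[^"\\]|\\.)*)"(?:@([A-Za-z-]+))?\s*\.\s*$'
)


def parse_literal_triple(line):
    """Return (subject, predicate, literal, lang), or None if the line does not match."""
    match = TRIPLE_RE.match(line)
    if match is None:
        return None
    return match.groups()


# Made-up sample line in the shape of an ontology definition.
sample = (
    "<http://dbpedia.org/ontology/Philosopher> "
    "<http://www.w3.org/2000/01/rdf-schema#comment> "
    '"A made-up definition."@en .'
)

subject, predicate, literal, lang = parse_literal_triple(sample)
print(subject.rsplit("/", 1)[-1], lang, literal)
```

Matching the quoted literal as a whole keeps spaces inside the text from being confused with the separators between subject, predicate, and object.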

In [6]:
import pickle
import json
import re
from gensim.parsing.preprocessing import (
    preprocess_string, remove_stopwords, strip_tags, strip_punctuation,
    strip_numeric, strip_multiple_whitespaces, strip_short,
)
import string
table = str.maketrans('', '', '!"#$%\'()*+,-./:;<=>?@[\\]^_`{|}~')
printable = set(string.printable)

# The Krovetz stemmer is much less "destructive" than the Porter stemmer.
from krovetzstemmer import Stemmer  # good stemmer for IR
ks = Stemmer()

# Filters applied in order to each text; stemming is currently disabled.
CUSTOM_FILTERS = [
    lambda x: x.lower(), strip_tags, strip_multiple_whitespaces,
    strip_punctuation, strip_numeric, lambda x: strip_short(x, minsize=3),
]  # disabled: lambda x: ks.stem(x)

In [7]:
with open("data/concepts.pkl", "rb") as f:
    concepts = pickle.load(f)

In [8]:
len(concepts)

118194

In [10]:
"Justice" in concepts

False

In [4]:
# dictionary to serialize as JSON: concept name -> preprocessed text
abstracts = {}
files = ["long_abstracts_en.nt", "long_abstracts_en_02.nt"]

for f in files:
    with open("data/" + f, "r", encoding="utf-8") as file:
        for i, line in enumerate(file):
            if i == 0:
                continue  # skip the first line (presumably a header comment)
            kelbay = line.split(" ")
            # subject IRI -> resource name (drop the prefix and the trailing ">")
            nom = kelbay[0].replace("<http://dbpedia.org/resource/", "")[:-1]
            if (nom in concepts) and (nom not in abstracts):
                # object literal: drop non-ASCII characters, then apply the filters
                raw = " ".join(kelbay[2:]).encode("ascii", "ignore").decode("utf-8")
                abstracts[nom] = " ".join(preprocess_string(raw, CUSTOM_FILTERS))

# the ontology definitions
with open("data/dbpedia_2016-10.nt", "r") as file:
    for line in file:
        lol = line.split(" ")
        if lol[1] == "<http://www.w3.org/2000/01/rdf-schema#comment>":
            nom = lol[0].replace('<http://dbpedia.org/ontology/', "")[:-1]
            if nom != "" and "\"@en" in " ".join(lol[2:]) and nom in concepts and nom not in abstracts:
                text = re.sub(r'https?:\/\/.*[\r\n]*', '', " ".join(lol[2:]).replace("@en", ""))
                abstracts[nom] = " ".join(preprocess_string(text.strip(), CUSTOM_FILTERS))
    
save = json.dumps(abstracts)
with open("data/concepts_text.json", "w") as f:
    f.write(save)
print("collection saved.") 

collection saved.


In [5]:
print("There are {} concept definitions out of {} concepts.".format(len(abstracts), len(concepts)))

There are 117756 concept definitions out of 118194 concepts.


In [7]:
concepts - set(abstracts.keys())

{'%C3%9Cberlingen__Reichsstadt_%C3%9Cberlingen__1',
 'Aach,_Baden-W%C3%BCrttemberg__Herrschaft_Aach__1',
 'Aalen__Reichsstadt_Aalen__1',
 'Aarberg__rafschaft_Aarberg__1',
 'Abbey_of_Saint_Gall__F%C3%BCrstabtei_St._Gallen__1',
 'AcademicConference',
 'Activity',
 'AmateurBoxer',
 'AmericanFootballCoach',
 'AmericanFootballPlayer',
 'AmericanFootballTeam',
 'AmusementParkAttraction',
 'AnatomicalStructure',
 'Archeologist',
 'ArcherPlayer',
 'Aristocrat',
 "Arlon__Maark_grofschaft_vun_Arel_lb_Marquisat_comt%C3%A9_d'Arl__1",
 'Article',
 'ArtistDiscography',
 'Athlete',
 'Athletics',
 'AthleticsPlayer',
 "Augsburg__''Parit%C3%A4tische_Reichsstadt_Augsburg__1",
 'AustralianFootballTeam',
 'AustralianRulesFootballPlayer',
 'AutomobileEngine',
 'AutomobilePlatform',
 'BackScene',
 'Bad_Kreuznach__Grafschaft_Sponheim-Kreuznach__1',
 'Bad_Pyrmont__Grafschaft_F%C3%BCrstentum_Pyrmont__1',
 'Bad_Urach__Grafschaft_Urach__1',
 'Bad_Wimpfen__Reichsstadt_Wimpfen__1',
 'Bad_Windsheim__Reichsstadt_Wind

In [6]:
abstracts["Philosopher"]

'philosopher person with extensive knowledge philosophy who uses this knowledge their work typically solve philosophical problems philosophy concerned with studying the subject matter fields such aesthetics ethics epistemology logic metaphysics well social philosophy and political philosophy there sense which every human being philosopher accept very humanistic and generous interpretation this say that every human being has unique contribution ideas the society however more generally accepted interpretation academia that philosopher one who has attained philosophy teaches philosophy has published literature field philosophy peer reviewed journal widely accepted other philosophers philosopher'

In [9]:
abstracts["Hormone"]

'hormone from greek ubc uae impetus chemical released cell gland organ one part the body that affects cells other parts the organism only small amount hormone required alter cell metabolism essence chemical messenger that transports signal from one cell another all multicellular organisms produce hormones plant hormones are also called phytohormones hormones animals are often transported the blood cells respond hormone when they express specific receptor for that hormone the hormone binds the receptor protein resulting the activation signal transduction mechanism that ultimately leads cell type specific responses endocrine hormone molecules are secreted released directly into the bloodstream typically into fenestrated capillaries hormones with paracrine function diffuse through the interstitial spaces nearby target tissues variety exogenous chemical compounds both natural and synthetic have hormone like effects both humans and wildlife their interference with the synthesis secretion tr

### Training the model

In [9]:
import gensim

abstracts = json.load(open("data/concepts_text.json", 'r'))

train_corpus = [gensim.models.doc2vec.TaggedDocument(abstracts[s].split(), tags=[s]) for s in abstracts]

model = gensim.models.doc2vec.Doc2Vec(vector_size=300, window=8, min_count=5, epochs=30, dm=1)
model.build_vocab(train_corpus)

# train the model
%time model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

CPU times: user 37min 5s, sys: 29.4 s, total: 37min 35s
Wall time: 15min 21s


In [166]:
abstracts["Aristotle"]

'aristotle was greek philosopher and polymath student plato and teacher alexander the great his writings cover many subjects including physics metaphysics poetry theater music logic rhetoric linguistics politics government ethics biology and zoology together with plato and socrates plato teacher aristotle one the most important founding figures western philosophy aristotle writings were the first create comprehensive system western philosophy encompassing morality aesthetics logic science politics and metaphysics aristotle views the physical sciences profoundly shaped medieval scholarship and their influence extended well into the renaissance although they were ultimately replaced newtonian physics the zoological sciences some his observations were confirmed accurate only the century his works contain the earliest known formal study logic which was incorporated the late century into modern formal logic metaphysics aristotelianism had profound influence philosophical and theological thi

In [11]:
model.save("embeddings/concepts")

In [3]:
import gensim

model = gensim.models.doc2vec.Doc2Vec.load("embeddings/concepts")

In [4]:
model.docvecs.most_similar([model.docvecs["Aristotle"]], topn=14)

[('Aristotle', 0.9999999403953552),
 ('Alfred_V._Kidder', 0.50833660364151),
 ('Socrates', 0.4701247215270996),
 ('Joseph_Margolis', 0.465733140707016),
 ('Otto_E._Neugebauer', 0.4652498960494995),
 ('Gottfried_Wilhelm_Leibniz', 0.4644862413406372),
 ('Constantine_Samuel_Rafinesque', 0.4617351293563843),
 ('Mongolian_Academy_of_Sciences', 0.46045464277267456),
 ('Ludwig_Feuerbach', 0.45850318670272827),
 ('Lucretius', 0.4526785910129547),
 ("Euclid's_Elements", 0.4521154761314392),
 ('Leonard_Bloomfield', 0.4511200785636902),
 ('Leonhard_Euler', 0.45068058371543884),
 ('John_Ruskin', 0.4505317211151123)]

In [5]:
model.docvecs.most_similar([model.docvecs["Philosopher"]], topn=14)

[('Philosopher', 1.0),
 ('Riverhead_(town),_New_York', 0.6244463920593262),
 ('Romeo_discography', 0.6134164929389954),
 ('Rudolf_Lehmann_(SS_officer)', 0.6004624962806702),
 ('Dantan_I_(community_development_block)', 0.6004362106323242),
 ('Beylagan_District', 0.5977004766464233),
 ('Tiamat_(band)', 0.5936920642852783),
 ('Fayette_County,_Georgia', 0.5920968651771545),
 ('Collinsville,_Oklahoma', 0.5913281440734863),
 ('Babadzhan', 0.5909887552261353),
 ('Crowley_County,_Colorado', 0.5905856490135193),
 ('Hexachlorobutadiene', 0.5903594493865967),
 ('Kyshlak', 0.5902799367904663),
 ('Bruce_Sterling', 0.5901660919189453)]