A language-agnostic toolset for semantic triples extraction and processing.
- Python 2.7.x
- LevelDB
- LZ4 (optional)
- Django 1.5 (optional)
-
Prepare input data (list of sentences in first-order-logic form generated by Metaphor ADP):
- Create LF file using one of Metaphor's pipeline.
- Split LF if they are too large (recommended size ~ 1GB).
-
Extract triples from LF sentences using
findtriples.py
:python findtriples.py < sentences.lf.txt > triples.csv
The output will be the following:
noun_verb_adv, <NONE>, быть-VB, можно-RB, <->, <->, 156 noun_adj, поле-NN, ледяной-ADJ, <->, <->, <->, 73 noun_verb_adv, <NONE>, быть-VB, надо-RB, <->, <->, 68 noun_verb_adv, <NONE>, быть-VB, нельзя-RB, <->, <->, 65 noun_adj, океан-NN, ледовитый-ADJ, <->, <->, <->, 47 ...
-
Create triples index using
createtriplesindex.py
:python createtriplesindex.py -i triples.csv -o triples-index-dir
-
Create query-file
query.json
:{ "query": [ { "label": "poverty", "source": [ "source_word_1", "source_word_2", ... "source_word_n", ], "target": [ "target_word_1", "target_word_2", ... "target_word_n" ] } ] }
Note that
source
ortarget
filesds may be empy depending on your next step. -
Run
findsources.py
to find source (requires a list of targets in query file):python findsources.py -i triples-index-dir -o output-dir -q query.json
-
Prepare file with list of sources (each on separate string):
source_1 source_2 ... source_n
-
Run
findpatterns.py
to find patterns:python findpatterns.py -i triples-index-dir -o output-dir -qf sources.txt
Usage:
python mokujin.py [<input file in logical form>] [<output file>]
-
Input format are sentences in first-order logic form produced by Metaphor ADP semantic pipelines.
-
Extracts the following relationships:
Verbs
subj_verb_dirobj([noun*],verb,[noun+]) ("John reads a book")
subj_verb_indirobj([noun*],verb,[noun+]) ("John gives to Mary")
subj_verb_instr([noun*],verb,[noun+]) ("Джон работает топором")
subj_verb([noun+], verb) ("John runs") // only if there is no dirobj and indirobj
subj_verb_prep_compl([noun*],verb,prep,[noun+]) ("John comes from London")
subj_verb_verb_prep_noun([noun*],verb,verb,prep,[noun+]) ("John tries to go into the house")
subj_verb_verb([noun+],verb,verb) ("John tries to go") // only if there is no prep attached to the second verb
Nouns
noun_be_prep_noun(noun,verb,prep,noun) ("intention to leave for money")
noun_be(noun,verb) ("intention to leave") // only if there is no prep attached to verb
noun_adj_prep_noun(noun,adjective,prep,noun) ("The book is good for me") -> only if "for" has "good" (and not "is") as its arg
noun_adj([noun+],adjective) ("The book is good") // only if there is no prep attached to adj as its arg
noun_verb_adv_prep_noun(adverb,verb) ("John runs fast for me") -> only if "for" has "fast" (and not "runs") as its arg
noun_verb_adv([noun*],verb,adverb) ("John runs fast") // only if there is no prep attached to adv
nn_prep([noun+],prep,noun) ("[city]&bike for John") // only if "for" has "bike" (and not some verb) as its arg
nn(noun,noun) ("city bike") // only if there is no prep attached to the second noun
nnn(noun,noun,noun) ("Tzar Ivan Grozny")
noun_equal_prep_noun(noun,noun,prep,noun) ("John is a man of heart") // only if "of" has "man" (and not "is") as its arg.
noun_equal_noun(noun,noun) ("John is a biker") // only if there is no prep attached to the second noun
noun_prep_noun(noun,prep,noun) ("house in London")
noun_prep_prep_noun(noun,prep,prep,noun) ("book out of the store")
Verbs
compl(anything,anything) ("близкий мне")
Input (Logical Form):
% В четверг , 7 февраля 2013 года , стартовала официальная продажа билетов на Олимпийские игры в Сочи —
% ровно за год до начала соревнований .
id(1).
[1001]:в-in(e1,e5,x1) & [1002]:четверг-nn(e2,x1) & [1005]:февраль-nn(e3,x2) & [1007]:год-nn(e4,x3) &
[1009]:стартовать-vb(e5,x4,u1,u2) & [1010]:официальный-adj(e6,x4) & [1011]:продажа-nn(e7,x4) &
[1012]:билет-nn(e8,x5) & [1013]:на-in(e9,x5,x6) & [1014]:олимпийский-adj(e10,x6) & [1015]:игра-nn(e11,x6) &
[1016]:в-in(e12,x6,x7) & [1017]:сочи-nn(e13,x7) & [1019]:ровно-rb(e14,e15) & [1020]:за-in(e15,e5,x8) &
[1021]:год-nn(e16,x8) & [1022]:до-in(e17,x9,x10) & [1023]:начало-nn(e18,x10) & [1024]:соревнование-nn(e19,x11) &
card(e20,u3,7) & card(e21,x3,2013) & of-in(e22,x2,x3) & of-in(e23,x4,x5) & typelt(e24,x5,s1) & typelt(e25,x6,s2) &
of-in(e26,x10,x11) & typelt(e27,x11,s3) & past(e28,e5)
% В первые же часы билеты на самые интересные широкому кругу болельщиков виды программы — хоккей , биатлон ,
% сноуборд — были раскуплены чуть менее чем полностью .
id(2).
[2001]:в-in(e1,x1,x2) & [2004]:часы-nn(e2,x2) & [2005]:билет-nn(e3,x1) & [2006]:на-in(e4,x1,x3) &
[2008]:интересный-adj(e5,x3) & [2009]:широкий-adj(e6,x3) & [2010]:круг-nn(e7,x3) & [2011]:болельщик-nn(e8,x4) &
[2012]:вид-nn(e9,x1) & [2013]:программа-nn(e10,x5) & [2015]:хоккей-nn(e11,x6) & [2017]:биатлон-nn(e12,x7) &
[2019]:сноуборд-nn(e13,x8) & [2022]:раскупить-vb(e14,u1,x8,u2) & [2023]:чуть-rb(e15,e16) & [2024]:менее-rb(e16,e14) &
[2025]:чем-cnj(e17,x9) & [2026]:полностью-rb(e18,e17) & card(e19,x2,1) & typelt(e20,x2,s1) & typelt(e21,x1,s2) &
of-in(e22,x3,x4) & typelt(e23,x4,s3) & typelt(e24,x1,s4) & of-in(e25,x1,x5) & past(e26,x8) & past(e27,e14)
% Что касается мужского хоккея , например , то недоступными оказались пропуска на все игры плей-офф — и это при том ,
% что даже сетка турнира составлена пока не целиком .
id(3).
[3002]:касаться-vb(e1,u1,x1,u2) & [3003]:мужской-adj(e2,x1) & [3004]:хоккей-nn(e3,x1) & [3006]:например-rb(e4,e5) &
[3008]:то-cnj(e5,x2) & [3009]:недоступный-adj(e6,x3) & [3010]:оказаться-vb(e7,x4,u3,u4) & [3011]:пропуск-nn(e8,x4) &
[3012]:на-in(e9,x4,x5) & [3014]:игра-nn(e10,x5) & [3015]:плей-офф-nn(e11,x6) & thing(e12,x7)
[3019]:при-in(e13,x8,x7) & [3024]:сетка-nn(e14,x9) & [3025]:турнир-nn(e15,x10) & [3026]:составить-vb(e16,x9,u5,u6) &
[3027]:пока-cnj(e17,x11) & [3029]:целиком-rb(e18,e17) & of-in(e19,x5,x6) & of-in(e20,x9,x10) & not(e21,e18) &
past(e22,e7) & past(e23,e16)
Output (List of Triples in CSV format):
rel_type,arg1,arg2,arg3,arg4,arg5,arg6,freq
noun_adj,федерация-NN, российский-ADJ,<->,<->,<->,162267
subj_verb,речь-NN,идти-VB,<->,<->,<->,85846
subj_verb_dirobj,<NONE>,обратить-VB,внимание-NN,<->,<->,64583
noun_adj,житель-NN,местный-ADJ,<->,<->,<->,17450
This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense US Army Research Lab contract W911NF-12-C-0025. The US Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the US Government.