keyfayqua
is a command-line tool that features two commands. First, the parse command annotates a corpus of texts, detecting entities and dependency relationships, and outputs each document's annotations as a CoNLL-formatted string in a column of a CSV file. Second, the match command converts the CoNLL string back into a spaCy Doc object and, using Semgrex patterns and spaCy's DependencyMatcher, detects dependency relationships between nodes in the parsed sentences. By providing Semgrex patterns, the user chooses the types of relationships the match command will search for.
- How to install
- How to use
  - parse command (initially parse the text documents)
  - test-conll command (test CoNLL string validity)
  - match command (apply Semgrex patterns to match syntactic relationships)
- Match output
- Optional pre-processing
- Create and activate a virtual Python environment (>=3.11).
- Clone this repository.
git clone https://github.com/medialab/keyfayqua.git
- Install keyfayqua in the activated virtual environment.
pip install --upgrade pip
pip install -e .
- On Mac, whose MPS GPU is not yet supported, I recommend installing two additional libraries to (slightly) improve performance:
pip install thinc-apple-ops
pip install spacy[apple]
The parse command is the first step to detecting dependency relationship patterns. It takes in a corpus of text documents, given as a CSV file, and outputs CoNLL-formatted string representations of the parsed documents, also in a CSV file. Optionally, with the --clean-social flag, you can pre-process the text with a cleaning script designed for social media posts, specifically Twitter. For help, type keyfayqua parse --help.
flowchart RL
subgraph parse command
datafile_p("--datafile")
id_col_p("--id-col")
text_col_p("--text-col")
lang_p("--lang")
clean_social_p("--clean-social")
model_p("--model")
modelpath_p("--model-path")
outfile_p("--outfile")
end
infile --file path--> datafile_p
subgraph infile[in-file CSV]
id --column name--> id_col_p
text --column name--> text_col_p
end
language[primary language] -.- text
language --2-letter abbreviation--> lang_p
genre[text type] -.- text
genre --boolean flag--> clean_social_p
subgraph models[model types]
spacy
stanza
hopsparser
udpipe
end
models --name of model type--> model_p
hopsparser -.- hopsparsermodel[hopsparser model]
hopsparsermodel --file path--> modelpath_p
subgraph outfile[out-file CSV]
outid(id)
parsed(parsed_text)
conll(conll_string)
end
outfile --file path--> outfile_p
Note about Hopsparser French models:
If you select the model type hopsparser and the French language (fr), you will need a Hopsparser model. If you do not have one, the script will download one for you. The default model is the Flaubert model for spoken French, and the default download location is in this repository at ./hopsparser_model/UD_all_spoken_French-flaubert/. If you want to use another of the Hopsparser models, you can download it yourself and provide the path with the --model-path option.
Upon completion or exit of the parse command, the CSV file to which the program has been writing each text document's annotations is compressed using Gzip. The out-file is expected to be very large despite having only 3 columns:
- an identifier for the text document, given with the option --id-col
- the version of the text that was parsed
- the CoNLL-formatted string
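Since the result is Gzip-compressed, it can be streamed row by row with Python's standard library alone. A minimal sketch (the file name and row contents below are hypothetical examples, not actual keyfayqua output):

```python
import csv
import gzip

# Write a tiny example out-file with the three columns described above
# (normally produced by `keyfayqua parse`), then stream it back row by
# row without decompressing it to disk.
path = "parsed_corpus.csv.gz"
with gzip.open(path, mode="wt", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "parsed_text", "conll_string"])
    writer.writerow(["1", "Hello world", "1\tHello\t..."])

with gzip.open(path, mode="rt", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(rows[0]["id"])  # -> 1
```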
Sometimes it's useful to quickly test the integrity of your CoNLL format. The test-conll command requires the path to the file whose strings you want to test and, optionally, the name of the strings' column if it is other than the default "conll_string". The program will raise an error and show you the problematic string if it finds an invalid CoNLL format; otherwise it will exit upon completion.
╭─ Options ───────────────────────────────────────────────────╮
│ * --datafile FILE Path to file with Conll results │
│ [default: None] │
│ [required] │
│ --conll-col TEXT CoNLL string column name │
│ [default: conll_string] │
│ --help Show this message and exit. │
╰─────────────────────────────────────────────────────────────╯
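As an illustration of the kind of check this performs: a CoNLL-U body line has exactly ten tab-separated fields. A rough standard-library sketch of such a validator (an approximation for illustration, not keyfayqua's actual implementation):

```python
# Rough CoNLL-U sanity check: every non-blank, non-comment line must
# have exactly ten tab-separated fields. Blank lines separate
# sentences; lines starting with '#' are comments.
def looks_like_conll(conll_string: str) -> bool:
    for line in conll_string.splitlines():
        if not line or line.startswith("#"):
            continue
        if len(line.split("\t")) != 10:
            return False
    return True

good = "1\tChatGPT-4\tchatgpt-4\tPROPN\tADJ\t_\t0\tROOT\t_\t_"
bad = "1\tChatGPT-4\tchatgpt-4"
print(looks_like_conll(good), looks_like_conll(bad))  # -> True False
```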
After creating a data file with annotated tokens correctly formatted in CoNLL strings, you're ready to apply Semgrex matches and detect syntactic relationships. First, you'll need a JSON file with a set of Semgrex match patterns. See an example here. Then, you'll call the match command, as explained here.
The Semgrex file's JSON format closely resembles the format that spaCy uses in Python for its DependencyMatcher.
In both cases, a pattern is composed of an array of nodes, in which the order matters. The array's first node is the "anchor" node, to which all other nodes relate, either directly or indirectly. Subsequent nodes have a relationship to the anchor or another preceding node.
In both spaCy's Python dictionary and keyfayqua's JSON format, the ordered array of nodes looks like the following example, taken from spaCy's documentation:
[
{
"RIGHT_ID": "founded",
"RIGHT_ATTRS": { "ORTH": "founded" }
},
{
"LEFT_ID": "founded",
"REL_OP": ">",
"RIGHT_ID": "subject",
"RIGHT_ATTRS": { "DEP": "nsubj" }
},
{
"LEFT_ID": "founded",
"REL_OP": ";",
"RIGHT_ID": "initially",
"RIGHT_ATTRS": { "ORTH": "initially" }
}
]
Whereas in Python the pattern's array of nodes is assigned to a variable, in the JSON format the array is the value in a JSON key-value pair whose key is the pattern's identifying name.
{
"PatternName": [
{
"RIGHT_ID": "founded",
"RIGHT_ATTRS": { "ORTH": "founded" }
},
{
"LEFT_ID": "founded",
"REL_OP": ">",
"RIGHT_ID": "subject",
"RIGHT_ATTRS": { "DEP": "nsubj" }
},
{
"LEFT_ID": "founded",
"REL_OP": ";",
"RIGHT_ID": "initially",
"RIGHT_ATTRS": { "ORTH": "initially" }
}
]
}
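Loading such a file in Python is straightforward. The sketch below uses only the standard library, with the spaCy registration step shown as a comment; the pattern name "FindFounded" is an invented example:

```python
import json

# Parse a Semgrex pattern file in keyfayqua's JSON format: each key is
# a pattern name, each value is spaCy's ordered array of nodes.
pattern_file = """
{
  "FindFounded": [
    { "RIGHT_ID": "founded", "RIGHT_ATTRS": { "ORTH": "founded" } },
    { "LEFT_ID": "founded", "REL_OP": ">",
      "RIGHT_ID": "subject", "RIGHT_ATTRS": { "DEP": "nsubj" } }
  ]
}
"""

patterns = json.loads(pattern_file)
for name, nodes in patterns.items():
    # With spaCy this would become: matcher.add(name, [nodes])
    print(name, len(nodes))  # -> FindFounded 2
```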
Even though you're applying the Semgrex patterns to a corpus that is already parsed, spaCy's DependencyMatcher still requires a language model. Consequently, the match command resembles the parse command. For more information, ask for help with the command keyfayqua match --help.
flowchart RL
subgraph match command
datafile_m("--datafile")
id_col_m("--id-col")
conll_col_m("--conll-col")
lang_m("--lang")
matchfile_m("--matchfile")
model_m("--model")
modelpath_m("--model-path")
outfile_m("--outfile")
end
infile --file path--> datafile_m
subgraph infile[in-file CSV]
id --column name--> id_col_m
parsed_text[parsed_text]
conll --column name--> conll_col_m
end
language[primary language] -.- conll
language --2-letter abbreviation--> lang_m
subgraph models[model types]
spacy
stanza
hopsparser
udpipe
end
models --name of model type--> model_m
hopsparser -.- hopsparsermodel[hopsparser model]
hopsparsermodel --file path--> modelpath_m
subgraph outfile[out-file CSV]
outid(id)
etc([various match columns...])
end
subgraph matchfilejson[match file JSON]
patterns([various match patterns...])
end
outfile --file path--> outfile_m
matchfilejson --file path-->matchfile_m
The CSV output by the match command is dynamically formatted to have as many columns as necessary to store information about the patterns you provide. A Semgrex pattern has at least two nodes: an anchor and something that relates to it. For every node in the pattern, there will be 6 columns.
- PatternName_NodeName_id: the token's index
- PatternName_NodeName_lemma: the token's lemma (or text if the model failed to lemmatize it)
- PatternName_NodeName_pos: the token's part-of-speech tag
- PatternName_NodeName_deprel: the token's dependency relationship to its head
- PatternName_NodeName_entity: the token's named-entity-recognition label
- PatternName_NodeName_noun_phrase (in development)
Each match on a Semgrex pattern is written to a row of the CSV, along with the text document's unique ID.
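The column-naming scheme can be reconstructed in a few lines. This sketch mirrors the six suffixes listed above and is an illustration, not keyfayqua's internal code:

```python
# Build the per-node column names used in the match command's CSV:
# one column per (pattern, node, attribute) combination.
SUFFIXES = ["id", "lemma", "pos", "deprel", "entity", "noun_phrase"]

def match_columns(pattern_name: str, node_names: list[str]) -> list[str]:
    return [
        f"{pattern_name}_{node}_{suffix}"
        for node in node_names
        for suffix in SUFFIXES
    ]

cols = match_columns("FindRootSubjects", ["ROOT", "SUBJECT"])
print(len(cols))   # -> 12
print(cols[0])     # -> FindRootSubjects_ROOT_id
```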
id | FindRootSubjects_ROOT_id | FindRootSubjects_ROOT_lemma | FindRootSubjects_ROOT_pos | FindRootSubjects_ROOT_deprel | FindRootSubjects_ROOT_entity | FindRootSubjects_ROOT_noun_phrase | FindRootSubjects_SUBJECT_id | FindRootSubjects_SUBJECT_lemma | FindRootSubjects_SUBJECT_pos | FindRootSubjects_SUBJECT_deprel | FindRootSubjects_SUBJECT_entity | FindRootSubjects_SUBJECT_noun_phrase |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1598065358522699776 | 18 | launch | VERB | ROOT | | | 17 | we | PRON | nsubj | | |
The original text may be cleaned if the --clean-social flag is provided with the parse command. This flag adds an extra step in which the text is pre-processed with a normalizing script designed for social media text documents, specifically Tweets. The normalizer applies the following changes:
- remove emojis
>>> import emoji
>>>
>>> text = 'ChatGPT-4 : plus de 1000 prompts 🤯 pour améliorer votre #création https://openai.com/blog/chatgpt via @siecledigital'
>>>
>>> emoji.replace_emoji(text, replace='')
'ChatGPT-4 : plus de 1000 prompts pour améliorer votre #création https://openai.com/blog/chatgpt via @siecledigital'
- separate titles / pre-colon spans from sentences [
(^(\w+\W+){1,2})*:
]
>>> import re
>>>
>>> text = 'ChatGPT-4 : plus de 1000 prompts pour améliorer votre #création https://openai.com/blog/chatgpt via @siecledigital'
>>>
>>> re.sub(r'(^(\w+\W+){1,2})*:', '\\1.', text)
'ChatGPT-4 . plus de 1000 prompts pour améliorer votre #création https.//openai.com/blog/chatgpt via @siecledigital'
- remove URLs
>>> from ural.patterns import URL_IN_TEXT_RE
>>>
>>> text = 'ChatGPT-4 . plus de 1000 prompts pour améliorer votre #création https.//openai.com/blog/chatgpt via @siecledigital'
>>>
>>> URL_IN_TEXT_RE.sub(repl='', string=text)
'ChatGPT-4 . plus de 1000 prompts pour améliorer votre #création https. via @siecledigital'
- remove citations at the end of a post [
(?!https)via(\s{0,}@\w*)
]
>>> import re
>>>
>>> text = 'ChatGPT-4 . plus de 1000 prompts pour améliorer votre #création https. via @siecledigital'
>>>
>>> re.sub(r"(?!https)via(\s{0,}@\w*)", "", text)
'ChatGPT-4 . plus de 1000 prompts pour améliorer votre #création https. '
- remove # and @ signs [
[@#]
]
>>> import re
>>>
>>> text = 'ChatGPT-4 . plus de 1000 prompts pour améliorer votre #création https. '
>>>
>>> re.sub(r'[@#]', '', text)
'ChatGPT-4 . plus de 1000 prompts pour améliorer votre création https. '
- remove trailing white space
>>> text = 'ChatGPT-4 . plus de 1000 prompts pour améliorer votre création https. '
>>>
>>> text.strip()
'ChatGPT-4 . plus de 1000 prompts pour améliorer votre création https.'
- remove double spaces
>>> text = 'ChatGPT-4 . plus de 1000 prompts pour améliorer votre création https.'
>>>
>>> re.sub(r'\s+', ' ', text)
'ChatGPT-4 . plus de 1000 prompts pour améliorer votre création https.'
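Chained together, the steps above amount to one normalizing function. The sketch below uses only the standard library, substituting simple regexes for the emoji and ural calls in steps 1 and 3, so it only approximates keyfayqua's actual cleaning script:

```python
import re

def clean_social(text: str) -> str:
    """Approximate the social-media normalizer using only the standard
    library (the real script uses the emoji and ural packages for the
    emoji and URL steps)."""
    # 1. remove emojis (rough stand-in for emoji.replace_emoji)
    text = re.sub(r'[\U0001F000-\U0001FAFF\u2600-\u27BF]', '', text)
    # 2. separate titles / pre-colon spans from sentences
    text = re.sub(r'(^(\w+\W+){1,2})*:', '\\1.', text)
    # 3. remove URLs (rough stand-in for ural's URL_IN_TEXT_RE)
    text = re.sub(r'//\S+', '', text)
    # 4. remove citations at the end of a post
    text = re.sub(r'(?!https)via(\s{0,}@\w*)', '', text)
    # 5. remove # and @ signs
    text = re.sub(r'[@#]', '', text)
    # 6-7. collapse double spaces and strip trailing white space
    return re.sub(r'\s+', ' ', text).strip()

text = ('ChatGPT-4 : plus de 1000 prompts 🤯 pour améliorer votre '
        '#création https://openai.com/blog/chatgpt via @siecledigital')
print(clean_social(text))
# -> ChatGPT-4 . plus de 1000 prompts pour améliorer votre création https.
```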
Original text:
ChatGPT-4 : plus de 1000 prompts 🤯 pour améliorer votre #création https://openai.com/blog/chatgpt via @siecledigital
Cleaned text:
ChatGPT-4 . plus de 1000 prompts pour améliorer votre création https.
CoNLL result:
ID | FORM | LEMMA | UPOS | XPOS | FEATS | HEAD | DEPREL | DEPS | MISC |
---|---|---|---|---|---|---|---|---|---|
1 | ChatGPT-4 | chatgpt-4 | PROPN | ADJ | Gender=Fem\|Number=Sing | 0 | ROOT | _ | _ |
2 | . | . | PUNCT | PUNCT | _ | 1 | punct | _ | _ |
ID | FORM | LEMMA | UPOS | XPOS | FEATS | HEAD | DEPREL | DEPS | MISC |
---|---|---|---|---|---|---|---|---|---|
1 | plus | plus | ADV | ADV | _ | 4 | advmod | _ | _ |
2 | de | de | ADP | ADP | _ | 3 | case | _ | _ |
3 | 1000 | 1000 | NUM | PRON | NumType=Card | 1 | iobj | _ | _ |
4 | prompts | prompt | NOUN | VERB | Tense=Pres\|VerbForm=Part | 0 | ROOT | _ | _ |
5 | pour | pour | ADP | ADP | _ | 6 | mark | _ | _ |
6 | améliorer | améliorer | VERB | VERB | VerbForm=Inf | 4 | acl | _ | _ |
7 | votre | votre | DET | DET | Number=Sing\|Poss=Yes | 8 | det | _ | _ |
8 | création | création | NOUN | NOUN | Gender=Fem\|Number=Sing | 6 | obj | _ | _ |
9 | https | https | INTJ | PROPN | Gender=Masc\|Number=Sing | 8 | nmod | _ | _ |
10 | . | . | PUNCT | PUNCT | _ | 4 | punct | _ | _ |
@inproceedings{grobol:hal-03223424,
title = {{Analyse en dépendances du français avec des plongements contextualisés}},
author = {Grobol, Loïc and Crabbé, Benoît},
url = {https://hal.archives-ouvertes.fr/hal-03223424},
booktitle = {{Actes de la 28ème Conférence sur le Traitement Automatique des Langues Naturelles}},
eventtitle = {{TALN-RÉCITAL 2021}},
venue = {Lille, France},
pdf = {https://hal.archives-ouvertes.fr/hal-03223424/file/HOPS_final.pdf},
hal_id = {hal-03223424},
hal_version = {v1},
}
@inproceedings{qi2020stanza,
title={Stanza: A {Python} Natural Language Processing Toolkit for Many Human Languages},
author={Qi, Peng and Zhang, Yuhao and Zhang, Yuhui and Bolton, Jason and Manning, Christopher D.},
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
year={2020}
}