DataNER

This repository contains the code used to create the DataNER corpus, which is annotated for named entity recognition (NER) using Wikipedia and WikiData.

Dependencies

You need to install MongoDB in order to use this repository.

Description

The code in this repository processes a Wikipedia XML dump and a WikiData JSON dump and uses them to build a corpus annotated with named entities.

It was developed as part of a master's thesis (written in French) at Université de Montréal. The thesis found that this process yields a lower-quality corpus than other similar corpora. The code is still a useful starting point for anyone exploring this approach, which is why it is published here.

How to use

Download the Wikipedia and WikiData dumps of your choice.
Download the WikiExtractor repository and the NECKAr tool, and move them into their respective folders in this repository (see the README files in those folders).
Add the path to your WikiData dump to the NECKAr.cfg file in the NECKAr folder.
Run the process_wikidata_dump.sh script.
Run the process_wikipedia_dump.sh script.
(Optional) Run the augment_mentions.py script to create more named entities in your corpus.
Run the extract_collection.py script to create the corpus.
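
The script steps above can be chained in a small Python driver. This is only a sketch under assumptions not stated in this README: that the scripts are run from the repository root, that they take no command-line arguments, that MongoDB is already running, and that the dumps, WikiExtractor, NECKAr and NECKAr.cfg are already in place. The helper names pipeline_commands and run_pipeline are hypothetical and not part of the repository.

```python
import subprocess
import sys


def pipeline_commands(include_augment=True):
    """Return the script steps from "How to use" as commands, in README order.

    Assumption: the scripts need no arguments; check each script before relying on this.
    """
    commands = [
        ["bash", "process_wikidata_dump.sh"],
        ["bash", "process_wikipedia_dump.sh"],
    ]
    if include_augment:
        # Optional step: creates more named entities in the corpus.
        commands.append([sys.executable, "augment_mentions.py"])
    commands.append([sys.executable, "extract_collection.py"])
    return commands


def run_pipeline(include_augment=True):
    """Run each step in turn and stop at the first one that fails."""
    for command in pipeline_commands(include_augment):
        print("Running:", " ".join(command))
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, check=True)
```

Calling run_pipeline() runs all four scripts; run_pipeline(include_augment=False) skips the optional augmentation step.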

Disclaimer

This code was developed and run on a 24-thread computer, so it may be very slow on a more typical machine. If the scripts take an unreasonable amount of time to run, I recommend using a subset of Wikipedia so that you can still produce a corpus.
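
One way to build such a subset is to keep only the first pages of the dump. The sketch below is not part of this repository. It assumes a bz2-compressed Wikipedia XML dump in which each closing </page> tag sits on its own line, and it has not been checked against the scripts here. The names take_first_pages and write_subset and the file names in the usage note are hypothetical.

```python
import bz2


def take_first_pages(lines, max_pages):
    """Yield dump lines up to the max_pages-th </page>, then close the root element."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    pages_seen = 0
    for line in lines:
        yield line
        if line.strip() == "</page>":
            pages_seen += 1
            if pages_seen >= max_pages:
                # Re-close the root element so the truncated file stays well-formed.
                yield "</mediawiki>\n"
                return


def write_subset(src_path, dst_path, max_pages):
    """Stream a .xml.bz2 dump and write a smaller .xml.bz2 file next to it."""
    with bz2.open(src_path, "rt", encoding="utf-8") as src, \
            bz2.open(dst_path, "wt", encoding="utf-8") as dst:
        dst.writelines(take_first_pages(src, max_pages))
```

For example, write_subset("full_dump.xml.bz2", "subset.xml.bz2", 100000) would keep the first 100,000 pages. The first pages of a dump are not a random sample of Wikipedia, so the resulting corpus may be skewed toward older articles.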

References

This code is based on two other works:

Giuseppe Attardi. WikiExtractor. https://github.com/attardi/wikiextractor, 2015.
Johanna Geiß, Andreas Spitz, and Michael Gertz. NECKAr: A named entity classifier for Wikidata. In Georg Rehm and Thierry Declerck, editors, Language Technologies for the Challenges of the Digital Age, pages 115–129, Cham, 2018. Springer International Publishing. ISBN 978-3-319-73706-5.

Contact

Please reach out to lucas.pages123@gmail.com if you have any questions about this repository.
