Simurg

A tool to create extendable multilingual corpora for abstractive text summarization (and other applications).

Motivation

Abstractive single-document summarization is considered a difficult problem in artificial intelligence and natural language processing. In the last two years in particular, several deep learning approaches to summarization have been introduced, renewing researchers' interest in the field.

Deep learning approaches are known to perform poorly on small amounts of data, and with some exceptions most of the data sets available for the summarization task are small. In addition, the phonetic, morphological, semantic and syntactic features of a language change over time, yet most summarization corpora are constructed from old resources. Language coverage is another problem: in summarization, as in other fields of natural language processing, most corpora are available only in English. Finally, licence terms and fees are an obstacle that prevents many academics, and non-academics in particular, from accessing these data.

Simurg is an open source framework for creating an extendable multilingual corpus for abstractive single-document summarization that addresses the problems described above.

Architecture

Creating the corpus consists of two phases:

Constructing the template corpus: The template corpus is the sharable part of the Simurg corpus.

Populating the template corpus: In this phase the template corpus will be populated with all the required information and the result will be a collection of JSON documents.
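The two phases can be pictured with a small sketch. Everything in it is an assumption made for illustration: the field names (`id`, `url`, `headline`, `body`), the helper functions and the stand-in `fake_fetch` are hypothetical and do not describe Simurg's actual schema or code. It only shows the general idea of a small shareable entry that is later expanded into a full JSON document.

```python
import json


def make_template_entry(doc_id, url):
    """Phase 1 (hypothetical): a shareable entry that only points at the source."""
    return {"id": doc_id, "url": url}


def populate_entry(entry, fetch):
    """Phase 2 (hypothetical): fill in the content via fetch(url) -> (headline, body)."""
    headline, body = fetch(entry["url"])
    return {**entry, "headline": headline, "body": body}


def fake_fetch(url):
    # Stand-in for downloading and parsing the page behind `url`.
    return ("Example headline", "Example article text for " + url)


template = make_template_entry("doc-1", "https://example.com/article")
document = populate_entry(template, fake_fetch)
print(json.dumps(document, indent=2))
```

In this sketch the template entry carries no article text, which is what would make it shareable; the text only appears once the entry is populated.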

Dependencies

Docker

Setup the Project

make build: to build the docker image
make start_redis: to start the redis server
make connect_redis: to use the redis command line interface
make run: to run the container

Template Corpus

To create the template corpus use the following commands:

make run: to run the container

In the container, start python and enter the following two Python commands:

import simurg
simurg.create_template_corpus(lang='de')

Populating the Template Corpus

Run the following command to create the final corpus:

make run: to run the container

In the container, start python and enter the following two Python commands:

import simurg
simurg.populate_template_corpus(lang='de')
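Instead of typing the calls into the interpreter, the two phases could be chained in a short script. The following is a minimal sketch under assumptions: `build_corpus` is a hypothetical helper that is not part of Simurg, and the two functions passed to it here are stand-ins that only record the calls so the example runs without the package.

```python
def build_corpus(lang, create_template, populate_template):
    """Run phase 1 and then phase 2 for one language code."""
    create_template(lang=lang)
    populate_template(lang=lang)
    return lang


# Stand-ins that only record the calls. Inside the container one would pass
# simurg.create_template_corpus and simurg.populate_template_corpus instead.
calls = []


def fake_create(lang):
    calls.append(("create", lang))


def fake_populate(lang):
    calls.append(("populate", lang))


build_corpus("de", fake_create, fake_populate)
print(calls)
```

Passing the phase functions as arguments keeps the helper independent of the package, so the order of the two steps can be checked without a running Redis server.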

Adding New Languages

Currently English, German, French and Italian are supported. To add a new language, open the file config.py and add the new language code to the variable REDIS_DBS. For example, to add Turkish (tr):

REDIS_DB = {
    'de': 0,
    'en': 1,
    'fr': 2,
    'it': 3,
    'tr': 4
}
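The mapping above gives each language code its own number. Assuming these numbers are Redis logical database indices (the variable name suggests this, but the source does not say so), the next entry can be chosen mechanically. The helper `add_language` below is hypothetical and not part of Simurg; it is a sketch of picking the lowest unused index while staying within the 16 databases that a default Redis configuration provides.

```python
# Redis provides 16 logical databases (indices 0-15) unless the
# `databases` setting in redis.conf is changed.
DEFAULT_REDIS_DATABASES = 16


def add_language(redis_dbs, lang_code, max_dbs=DEFAULT_REDIS_DATABASES):
    """Return a copy of the mapping with lang_code on the lowest unused index."""
    if lang_code in redis_dbs:
        raise ValueError("language already configured: " + lang_code)
    used = set(redis_dbs.values())
    for index in range(max_dbs):
        if index not in used:
            return {**redis_dbs, lang_code: index}
    raise ValueError("no free Redis database index left")


redis_dbs = {"de": 0, "en": 1, "fr": 2, "it": 3}
updated = add_language(redis_dbs, "tr")
print(updated)  # {'de': 0, 'en': 1, 'fr': 2, 'it': 3, 'tr': 4}
```

The helper returns a new dictionary rather than modifying its argument, so the result still has to be written back to config.py by hand.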

Parallel Execution

To construct corpora for multiple languages at the same time, open one shell per language. For example, to construct the English, German, French and Italian corpora at the same time, run the following commands. The first command starts the container and each docker exec command opens an additional shell in it:

make run # For the first language
docker exec -it simurg bash -l # For the second language
docker exec -it simurg bash -l # For the third language
docker exec -it simurg bash -l # For the fourth language
