
Welcome to Data Journalism Extractor's documentation!

This project is an attempt to create a tool that helps journalists extract and process data at scale, from multiple heterogeneous data sources, while leveraging powerful database, information-extraction and NLP tools with limited programming knowledge.

Features

This software is based on Apache Flink, a stream-processing framework written in Java and Scala, similar to Spark. It executes dataflow programs, is highly scalable, and integrates easily with other Big Data frameworks and tools such as Kafka, HDFS, YARN, Cassandra or Elasticsearch.
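To make "dataflow program" concrete, here is a minimal sketch using Flink's batch DataSet API in Scala. The input file people.csv and its two-column layout are hypothetical, chosen only to illustrate the read/map/group/aggregate shape of the programs Flink executes:

```scala
import org.apache.flink.api.scala._

object MinimalDataflow {
  def main(args: Array[String]): Unit = {
    // Set up the Flink batch execution environment
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Hypothetical input: (name, organisation) pairs in a CSV file
    val people = env.readCsvFile[(String, String)]("people.csv")

    // Count people per organisation
    val counts = people
      .map { case (_, org) => (org, 1) }
      .groupBy(0)
      .sum(1)

    // print() triggers execution and writes the result to stdout
    counts.print()
  }
}
```

The same structure scales from a local test file to a distributed cluster without code changes, which is what makes Flink a good execution backend for generated dataflow programs.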

Although you can write custom dataflow programs that suit your specific needs, you don't need to know programming, Flink or Scala to work with this tool and build complex dataflow programs that perform operations such as the following (a code sketch follows the list):

  • Extract data from relational databases (Postgres, MySQL, Oracle), NoSQL databases (MongoDB), CSV files, HDFS, etc.
  • Use complex processing tools such as soft string-matching functions, link extraction, etc.
  • Store outputs in multiple different data sinks (CSV files, databases, HDFS, etc.)
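As a rough illustration of what a soft string-matching operation can look like as a Flink dataflow, the sketch below compares names from two hypothetical CSV sources using a normalised Levenshtein similarity and writes the close matches to a CSV sink. The file names, field layouts, similarity measure and 0.8 threshold are assumptions made for this example, not this tool's actual modules:

```scala
import org.apache.flink.api.scala._

object SoftMatchSketch {
  // Classic Levenshtein edit distance, used here as the soft-matching measure
  def levenshtein(a: String, b: String): Int = {
    val d = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
      if (i == 0) j else if (j == 0) i else 0
    }
    for (i <- 1 to a.length; j <- 1 to b.length) {
      val cost = if (a(i - 1) == b(j - 1)) 0 else 1
      d(i)(j) = math.min(math.min(d(i - 1)(j) + 1, d(i)(j - 1) + 1),
                         d(i - 1)(j - 1) + cost)
    }
    d(a.length)(b.length)
  }

  // Similarity in [0, 1]: 1.0 means identical strings
  def similarity(a: String, b: String): Double =
    1.0 - levenshtein(a, b).toDouble / math.max(a.length, b.length).max(1)

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Two hypothetical sources: (name, party) and (name, amount)
    val politicians = env.readCsvFile[(String, String)]("politicians.csv")
    val donors      = env.readCsvFile[(String, String)]("donors.csv")

    // Cross the two sets (quadratic; fine for a sketch, costly at scale)
    // and keep pairs whose names are close enough
    val matches = politicians
      .cross(donors)
      .filter { case ((p, _), (d, _)) => similarity(p, d) >= 0.8 }
      .map { case ((p, party), (d, amount)) => (p, d, party, amount) }

    // Store the output in a CSV sink
    matches.writeAsCsv("matches.csv")
    env.execute("soft string matching sketch")
  }
}
```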

Examples can be found in getting-started and example-walkthrough. The modules and how to use them are described in modules.

  • getting-started
  • example-walkthrough
  • modules
