
Openews

An experimental NLP (Natural Language Processing) project that bundles news from various sources.

Architecture

Components

  1. MongoDB - database layer.
  2. Redis - queuing scrapper jobs.
  3. Flask - REST / Web services.
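A minimal sketch of how these components might be wired together; the hosts, ports, and the /health endpoint are assumptions for illustration (the real values live in the project configuration):

from flask import Flask
from pymongo import MongoClient
from redis import Redis

mongo = MongoClient("localhost", 27017)          # database layer
redis_conn = Redis(host="localhost", port=6379)  # scrapper job queue backend
app = Flask(__name__)                            # REST / Web services

@app.route("/health")
def health():
    # hypothetical endpoint verifying both backends are reachable
    mongo.admin.command("ping")
    redis_conn.ping()
    return "ok"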

Concepts

  • Scrapper: News collector.
  • DataProcessor: Processes the raw data collected by scrappers and structures it (this is the NLP part).
  • Job: A queued Scrapper.
  • Worker: A Python process that runs and waits for Jobs to be added to the queue, then executes them (see the sketch below).
  • Server: A RESTful web server managing all the services.
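A minimal sketch of the Worker concept, assuming rq (the queue name follows the Data Flow section below); this is illustrative, not the project's actual entry point:

from redis import Redis
from rq import Queue, Worker

conn = Redis()
# A Worker is a Python process that blocks, waiting for Jobs on the
# scrapper_jobs queue, and executes each one as it arrives.
worker = Worker([Queue("scrapper_jobs", connection=conn)], connection=conn)
worker.work()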

Data Flow

  1. Scrappers are queued by rq as jobs in redis to the scrapper_jobs queue once every X minutes (scheduled by cron or an equivalent method); a sketch of this flow follows the Details list below.
  2. When the jobs are executed by a worker, the scrappers begin collecting the data (news) from the various sources; each collects its own resources asynchronously (gevent).
  3. Each scrapper stores its scraped data in a nested document inside the scrappers database; the nested object is named after the scrapper class name in lower-case letters. A typical scrapper document has the following fields:
{
        category: string,
        title: string,
        url: string,
        scraped_at: datetime.utcnow(),
        bundled: 1,        (1)
        title_en: string   (2)
}

Notes:
  • (1) bundled is an optional property, present only once the document has been classified by the NLP process; it means that this document was found to be similar to other documents and was bundled with them.
  • (2) title_en is also an optional property, present only if a translation was performed. The original is kept in title.

  4. DataProcessors are queued by rq as jobs in redis to the nlp_process queue once every T minutes (scheduled by cron or an equivalent method). If similar documents are found in the raw database, they are stored in the bundled database in the following structure:

Details
  • Unique index on: title ASCENDING
  • If a translation of title was performed, it is stored in title_en.
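A sketch of this flow, assuming rq and pymongo; run_scrapper, collect, and the bundles collection name are hypothetical, while the queue name, database names, document fields, and the unique index follow the description above:

from datetime import datetime

from pymongo import ASCENDING, MongoClient
from redis import Redis
from rq import Queue

def collect(name):
    # hypothetical stub standing in for the real gevent-based collection
    return [{"category": "world", "title": "Example", "url": "http://example.com/1"}]

def run_scrapper(name):
    # Store each collected item in the scrappers database, under an
    # object named after the scrapper class in lower-case letters.
    collection = MongoClient().scrappers[name.lower()]
    for item in collect(name):
        item["scraped_at"] = datetime.utcnow()
        collection.insert_one(item)

# The bundled database enforces the unique index on title:
MongoClient().bundled.bundles.create_index([("title", ASCENDING)], unique=True)

# Scheduled by cron (or an equivalent method) once every X minutes:
Queue("scrapper_jobs", connection=Redis()).enqueue(run_scrapper, "somescrapper")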

OS dependencies

The following packages are required (names may differ slightly depending on the Linux distro):

  • libxml2-dev
  • libxslt1-dev
  • python-dev
  • blas-devel.x86_64 (aka libblas-dev)
  • lapack-devel.x86_64 (aka liblapack-dev)
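On Debian/Ubuntu, for example, the whole set can typically be installed with:

$ sudo apt-get install libxml2-dev libxslt1-dev python-dev libblas-dev liblapack-dev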

Testing

Unit tests use nose and are located under the tests folder. They can be executed by running:

$ make configure-test && make test

Any subsequent run only needs (as the environment is already prepared):

$ make test

Running

Depending on the environment it is going to run in, a general way to deploy this is as follows:

  1. Make sure you have MongoDB and Redis installed and running on the ports listed in configure-production.py.
  2. Prepare the environment:
$ make clean && make configure-prod
  3. Load (source) the virtual environment (virtualenv):
$ source venv/bin/activate
  4. To schedule all scrappers for work, use:
