An experimental NLP (Natural Language Processing) project aimed at bundling news from various sources.
- MongoDB - database layer.
- Redis - queuing scrapper jobs.
- Flask - REST / Web services.
- Scrapper: News collector.
- DataProcessor: Processes the raw data collected by scrappers and structures it (this is the NLP part).
- Job: A Queued Scrapper.
- Worker: A Python process that waits for Jobs to be added to the queue and then executes them (see the sketch after this list).
- Server: A RESTful web server managing all the services.
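For reference, a worker is essentially a standard rq worker bound to the relevant queue. A minimal sketch, assuming rq with a local Redis instance and the scrapper_jobs queue (the script itself is illustrative, not part of this repo):

    from redis import Redis
    from rq import Queue, Worker

    redis_conn = Redis()
    # Block and wait for jobs on the scrapper_jobs queue, executing each one as it arrives.
    Worker([Queue('scrapper_jobs', connection=redis_conn)], connection=redis_conn).work()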
- Scrappers are queued by rq as jobs in Redis to the scrapper_jobs queue once every X minutes (scheduled by cron or an equivalent method); see the scheduling sketch after this section.
- When the jobs are executed by a worker, the scrappers begin collecting the data (news) from the various sources; each collects its own sources asynchronously (using gevent).
- Each scrapper stores its scrapped data in a nested document inside the scrappers database; the nested object is named after the scrapper class name in lowercase. A typical scrapper document has the following fields:
{
    category: string,
    title: string,
    url: string,
    scraped_at: datetime.utcnow(),
    bundled: 1,          (1)
    title_en: string     (2)
}
Notes:
- (1) bundled is an optional property, present only when the document was already classified by the NLP process; it means that this document was found similar to other documents and was bundled with them.
- (2) title_en is also an optional property, present only if a translation was performed; the original remains in title.

- DataProcessors are queued by rq as jobs in Redis to the nlp_process queue once every T minutes (scheduled by cron or an equivalent method). If similar documents are found in the raw database, they are stored in the bundled database with the following structure:
- Unique index on: title ASCENDING
- In case a translation was made to title, it is stored in title_en.
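As an illustration of this flow, a scheduling script run by cron could enqueue the jobs and keep the unique index in place roughly as follows (a sketch only; the module paths, job function names, and collection name are assumptions, not actual names from this repo):

    from pymongo import ASCENDING, MongoClient
    from redis import Redis
    from rq import Queue

    redis_conn = Redis()

    # Every X minutes: queue the scrappers on the scrapper_jobs queue.
    Queue('scrapper_jobs', connection=redis_conn).enqueue('scrappers.run_all')    # hypothetical job function

    # Every T minutes: queue the NLP processing on the nlp_process queue.
    Queue('nlp_process', connection=redis_conn).enqueue('dataprocessor.run')      # hypothetical job function

    # The bundled database keeps a unique ascending index on title.
    bundled_db = MongoClient()['bundled']
    bundled_db['documents'].create_index([('title', ASCENDING)], unique=True)     # hypothetical collection name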
The following packages are required (names may differ slightly depending on the Linux distro):
- libxml2-dev
- libxslt1-dev
- python-dev
- blas-devel.x86_64 (aka: libblas-dev)
- lapack-devel.x86_64 (aka liblapack-dev)
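On Debian/Ubuntu-based systems, for example, the equivalents can usually be installed with (exact package names may differ on other distros):

$ sudo apt-get install libxml2-dev libxslt1-dev python-dev libblas-dev liblapack-dev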
Unit tests use nose and are located under the tests folder; they can be executed by running:
$ make configure-test && make test
Any subsequent run can be executed by only running (as the environment is already prepared):
$ make test
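Tests follow the usual nose conventions, so a new test is just a test_*-named function or class in a module under tests. A minimal sketch (file and names are illustrative):

    # tests/test_example.py -- picked up automatically by nose when running make test
    from nose.tools import assert_equal

    def test_example():
        # nose collects any function whose name starts with "test"
        assert_equal(1 + 1, 2)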
Depending on the environment it is going to run in, a general way to deploy this is as follows:
- Make sure you have MongoDB and Redis installed and running on the ports listed in configure-production.py.
- First prepare the environment:
$ make clean && make configure-prod
- Load (source) the virtual environment (virtualenv):
$ source venv/bin/activate
- To schedule all scrappers for work, use: