An experimental NLP (Natural Language Processing) project aimed at bundling news from various sources.
- MongoDB - database layer.
- Redis - queuing scrapper jobs.
- Flask - REST / Web services.
- Scrapper: News collector.
- DataProcessor: Processes the raw data collected by scrappers and structures it (this is the NLP part).
- Job: A Queued Scrapper.
- Worker: A Python process that waits for Jobs to be added to the queue and then executes them (see the sketch after this list).
- Server: A RESTful web server managing all the services.
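For reference, a worker is essentially a standard rq worker bound to the relevant queue. A minimal sketch, assuming rq with a local Redis instance and the scrapper_jobs queue (the script itself is illustrative, not part of this repo):

    from redis import Redis
    from rq import Queue, Worker

    redis_conn = Redis()
    # Block and wait for jobs on the scrapper_jobs queue, executing each one as it arrives.
    Worker([Queue('scrapper_jobs', connection=redis_conn)], connection=redis_conn).work()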
- Scrappers are queued by rq as jobs in Redis to the scrapper_jobs queue once every X minutes (scheduled by cron or an equivalent method); see the scheduling sketch after this section.
- When the jobs are executed by a worker, the scrappers begin collecting the data (news) from the various sources; each collects its own sources asynchronously (using gevent).
- Each scrapper stores its scrapped data in a nested document inside the scrappers database; the nested object is named after the scrapper class name in lowercase. A typical scrapper document has the following fields:
{
    category: string,
    title: string,
    url: string,
    scraped_at: datetime.utcnow(),
    bundled: 1,          (1)
    title_en: string     (2)
}
Notes:
- (1) bundled is an optional property, present only when the document was already classified by the NLP process; it means that this document was found similar to other documents and was bundled with them.
- (2) title_en is also an optional property, present only if a translation was performed; the original remains in title.

- DataProcessors are queued by rq as jobs in Redis to the nlp_process queue once every T minutes (scheduled by cron or an equivalent method). If similar documents are found in the raw database, they are stored in the bundled database with the following structure:
- Unique index on: title ASCENDING
- In case a translation was made to title, it is stored in title_en.
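As an illustration of this flow, a scheduling script run by cron could enqueue the jobs and keep the unique index in place roughly as follows (a sketch only; the module paths, job function names, and collection name are assumptions, not actual names from this repo):

    from pymongo import ASCENDING, MongoClient
    from redis import Redis
    from rq import Queue

    redis_conn = Redis()

    # Every X minutes: queue the scrappers on the scrapper_jobs queue.
    Queue('scrapper_jobs', connection=redis_conn).enqueue('scrappers.run_all')    # hypothetical job function

    # Every T minutes: queue the NLP processing on the nlp_process queue.
    Queue('nlp_process', connection=redis_conn).enqueue('dataprocessor.run')      # hypothetical job function

    # The bundled database keeps a unique ascending index on title.
    bundled_db = MongoClient()['bundled']
    bundled_db['documents'].create_index([('title', ASCENDING)], unique=True)     # hypothetical collection name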
The following packages are required (names may differ slightly depending on the Linux distro):
- libxml2-dev
- libxslt1-dev
- python-dev
- blas-devel.x86_64 (aka: libblas-dev)
- lapack-devel.x86_64 (aka liblapack-dev)
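On Debian/Ubuntu-based systems, for example, the equivalents can usually be installed with (exact package names may differ on other distros):

$ sudo apt-get install libxml2-dev libxslt1-dev python-dev libblas-dev liblapack-dev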
Unit tests use nose and are located under the tests folder; they can be executed by running:
$ make configure-test && make test
Any subsequent run can be executed by only running (as the environment is already prepared):
$ make test
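Tests follow the usual nose conventions, so a new test is just a test_*-named function or class in a module under tests. A minimal sketch (file and names are illustrative):

    # tests/test_example.py -- picked up automatically by nose when running make test
    from nose.tools import assert_equal

    def test_example():
        # nose collects any function whose name starts with "test"
        assert_equal(1 + 1, 2)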
Depending on the environment it is going to run in, a general way to deploy this is as follows:
- Make sure you have MongoDB and Redis installed and running on the ports listed in configure-production.py.
- First prepare the environment:
$ make clean && make configure-prod
- Load (source) the virtual environment (virtualenv):
$ source venv/bin/activate
- To schedule all scrappers for work, use: