
added command for import from api.parldata.eu

girogiro committed Feb 19, 2015
1 parent 7dbb4f8 commit da48f06ccba1594560ecd112970c6a816370e6f6
@@ -32,25 +32,33 @@ language-dependent sorting), timezone or fulltext search configuration.

Because we need to set those settings individually for each parliament,
the multi-instance functionality of SayIt cannot be used and we must
implement it differently.

A separate WSGI application runs for each parliament initialized with
parliament-specific settings. All WSGI applications share the same
codebase and the same Django project with common settings.
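
For illustration only, a per-parliament WSGI file might look roughly
like the sketch below; the module path ``subdomains.sk_nrsr.settings``
is an assumption based on the layout described here, not the actual
code.

.. code-block:: python

    # subdomains/sk_nrsr/wsgi.py -- a minimal sketch, assuming the layout above
    import os

    from django.core.wsgi import get_wsgi_application

    # point Django at the parliament-specific settings module
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'subdomains.sk_nrsr.settings')

    application = get_wsgi_application()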

The following steps are needed to add a new parliament:

#. Create a new database ``sayit_<country_code>_<parliament_code>``
with collation settings corresponding to the primary language of the
parliament. Example:

.. code-block:: SQL

CREATE DATABASE sayit_sk_nrsr WITH LC_CTYPE 'sk_SK.UTF-8' LC_COLLATE 'sk_SK.UTF-8' TEMPLATE template0 OWNER sayit;

If the required locale is missing on your system, create it first
and restart the database server:

.. code-block:: console

$ sudo locale-gen xx_YY.UTF-8
$ sudo service postgresql restart

#. Copy one of the subdirectories in ``/subdomains`` directory under a
new name ``<country_code>_<parliament_code>`` and adjust the content of
the ``settings.py`` file within.

#. Create database tables:

@@ -77,42 +85,30 @@ The following steps are needed to add a new parliament:
Importing of data
=================

Data are imported from ``api.parldata.eu`` via the ``manage.py``
commandline script of the particular subdomain and its ``load_parldata``
command. The script must be executed in the virtual environment of the
installation.


Example
-------

To initially import data for the Slovak parliament subdomain:

.. code-block:: console

$ source /home/projects/.virtualenvs/sayit/bin/activate
(sayit)$ /home/projects/sayit/subdomains/sk_nrsr/manage.py load_parldata --initial

To load new data since the last import:

.. code-block:: console

(sayit)$ /home/projects/sayit/subdomains/sk_nrsr/manage.py load_parldata

Schedule the incremental update to be executed by Cron if regular
updates are needed.
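
For example, a crontab entry along the following lines would refresh the
data every night; the schedule is arbitrary and the paths simply reuse
the installation paths from the examples above. Calling the
virtualenv's ``python`` directly avoids having to activate the
environment:

.. code-block:: console

    # illustrative schedule only
    0 3 * * * /home/projects/.virtualenvs/sayit/bin/python /home/projects/sayit/subdomains/sk_nrsr/manage.py load_parldata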


Some implementation notes
@@ -156,14 +152,14 @@ in Apache config file and its own settings in the ``subdomains``
directory. The settings for a particular subdomain are loaded as follows:

The ``VirtualHost`` block in Apache config file points to the subdomain's
WSGI application file ``subdomains/<parliament>/wsgi.py`` which loads
settings file from the same directory. The settings file imports common
settings from ``sayit_parldata_eu/settings/base.py`` and overrides the
parliament-specific ones. The common settings file loads private settings
from the ``conf/private.yml`` file, which is not present in the repository.
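
In other words, a subdomain settings file reduces to something like the
sketch below. The overridden values shown (database name, language,
time zone) are illustrative guesses based on this documentation, not the
real configuration.

.. code-block:: python

    # subdomains/sk_nrsr/settings.py -- a sketch; actual overrides may differ
    from sayit_parldata_eu.settings.base import *  # common settings

    # parliament-specific overrides (example values only)
    DATABASES['default']['NAME'] = 'sayit_sk_nrsr'
    LANGUAGE_CODE = 'sk'
    TIME_ZONE = 'Europe/Bratislava'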

The same mechanism of settings loading as in ``wsgi.py`` is used in the
domain-specific ``manage.py``.

Domain-independent commands like ``collectstatic`` can be executed by the
main ``manage.py`` file in the repository root.
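
For example (assuming the repository is checked out at
``/home/projects/sayit`` as in the paths above):

.. code-block:: console

    (sayit)$ /home/projects/sayit/manage.py collectstatic
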
@@ -0,0 +1,284 @@
from datetime import datetime
import locale

from django.core import management

from instances.models import Instance
from speeches.models import Section, Speech, Speaker

from . import vpapi

import logging
logger = logging.getLogger(__name__)

class ParldataImporter:
def __init__(self, country_code, chamber_code, **options):
self.api_url = 'http://api.parldata.eu'
self.parliament = '%s/%s' % (country_code, chamber_code)
self.initial_import = options.get('initial', False)
self.verbosity = int(options.get('verbosity', 0))

# set country specific info obtained from VPAPI
resp = vpapi.get('')
for country in resp['_links']['child']:
if country['href'] == country_code:
locale.setlocale(locale.LC_ALL, country['locale'])
vpapi.timezone(country['timezone'])
break
vpapi.parliament(self.parliament)

# in case of initial import delete all existing data
if self.initial_import:
self._vlog('Deleting all existing data')
management.call_command('flush', verbosity=0, interactive=False)

self.instance, _created = Instance.objects.get_or_create(label='default')

def _vlog(self, msg):
if self.verbosity > 0:
# logger.info(msg)
print(msg)

def load_speakers(self):
self._vlog('Importing speakers')

# get datetime of the last import of speakers
try:
latest_speaker = Speaker.objects.order_by('-updated_at')[0]
last_modified = latest_speaker.updated_at
except IndexError:
last_modified = datetime.min

# update the people modified since the last import and create new ones
updated_people = vpapi.getall(
'people',
where={'updated_at': {'$gt': last_modified.isoformat()}}
)
count_c = 0
count_u = 0
for person in updated_people:
defaults = {
'name': person['name'][:128],
'family_name': person.get('family_name', ''),
'given_name': person.get('given_name', ''),
'additional_name': person.get('additional_name', ''),
'honorific_prefix': person.get('honorific_prefix', ''),
'honorific_suffix': person.get('honorific_suffix', ''),
'patronymic_name': person.get('patronymic_name', ''),
'sort_name': person.get('sort_name', ''),
'email': person.get('email'),
'gender': person.get('gender', ''),
'birth_date': person.get('birth_date', ''),
'death_date': person.get('death_date', ''),
'summary': person.get('summary', ''),
'biography': person.get('biography', ''),
'image': person.get('image'),
}
_record, created = update_object(
Speaker.objects, person,
identifiers__identifier=person['id'],
defaults=defaults,
instance=self.instance
)
count_c += created
count_u += not created

self._vlog('Imported %i persons (%i created, %i updated)' % (count_c+count_u, count_c, count_u))

def load_debates(self):
self._vlog('Importing debates')

# get datetime of the last import of speeches
try:
latest_speech = Speech.objects.order_by('-modified')[0]
last_modified = latest_speech.modified
except IndexError:
last_modified = datetime.min

# update the speeches modified since the last import and create new ones
updated_speeches = vpapi.getall(
'speeches',
where={'updated_at': {'$gt': last_modified.isoformat()}},
sort='event_id,date,position'
)

# prepare mapping from source_id to Speaker objects
# (`identifiers` is a reverse relation, so prefetch_related is used instead of select_related)
speakers = {s.identifiers.filter(scheme='api.parldata.eu')[0].identifier: s
for s in Speaker.objects.prefetch_related('identifiers')}

sec_count_c = 0
sec_count_u = 0
sp_count_c = 0
sp_count_u = 0
chamber = {}
session = {}
sitting = {}
speech_objects = []
for speech in updated_speeches:
if speech['position'] > 50: continue # DEBUG

if speech['event_id'] != sitting.get('id'):
# in case of initial import bulk create speeches when a new sitting occurs
if self.initial_import:
Speech.objects.bulk_create(speech_objects)
speech_objects = []

sitting = vpapi.get('events/%s' % speech['event_id'])

# create/update new section corresponding to the chamber
if sitting['organization_id'] != chamber.get('id'):
chamber = vpapi.get('organizations/%s' % sitting['organization_id'])
self._vlog('Importing chamber `%s`' % chamber['name'])
defaults = {
'heading': chamber.get('name'),
'start_date': chamber.get('founding_date'),
'legislature': chamber.get('name', ''),
'source_url': '%s/%s/organizations/%s' % (self.api_url, self.parliament, chamber['id']),
}
chamber_object, created = Section.objects.update_or_create(
source_url=defaults['source_url'],
defaults=defaults,
instance=self.instance
)
sec_count_c += created
sec_count_u += not created

# create/update new section corresponding to the session
if sitting['parent_id'] != session.get('id'):
session = vpapi.get('events/%s' % sitting['parent_id'])
if int(session['identifier']) > 5: break # DEBUG
self._vlog('Importing session `%s`' % session['name'])
sd, st = local_date_time(session.get('start_date'))
defaults = {
'heading': session.get('name'),
'start_date': sd,
'start_time': st,
'legislature': chamber.get('name', ''),
'session': session.get('name', ''),
'parent': chamber_object,
'source_url': '%s/%s/events/%s' % (self.api_url, self.parliament, session['id']),
}
session_object, created = Section.objects.update_or_create(
source_url=defaults['source_url'],
defaults=defaults,
instance=self.instance
)
sec_count_c += created
sec_count_u += not created

# create/update new section corresponding to the sitting
self._vlog('Importing sitting `%s`' % sitting['name'])
sd, st = local_date_time(sitting.get('start_date'))
defaults = {
'heading': sitting.get('name'),
'start_date': sd,
'start_time': st,
'legislature': chamber.get('name', ''),
'session': session.get('name', ''),
'parent': session_object,
'source_url': '%s/%s/events/%s' % (self.api_url, self.parliament, sitting['id']),
}
sitting_object, created = Section.objects.update_or_create(
source_url=defaults['source_url'],
defaults=defaults,
instance=self.instance
)
sec_count_c += created
sec_count_u += not created

# create/update the speech
speaker = speakers.get(speech.get('creator_id'))
sd, st = local_date_time(speech.get('date'))
defaults = {
'audio': speech.get('audio', ''),
'text': speech.get('text', ''),
'section': sitting_object,
'event': '%s, %s, %s' % (chamber['name'], session['name'], sitting['name']),
'speaker': speaker,
'type': speech.get('type', 'speech'),
'start_date': sd,
'start_time': st,
'source_url': '%s/%s/speeches/%s' % (self.api_url, self.parliament, speech['id']),
}
if speaker and speech.get('attribution_text'):
defaults['speaker_display'] = '%s, %s' % (speaker.name, speech['attribution_text'])
elif speech.get('attribution_text'):
defaults['speaker_display'] = speech['attribution_text']

if self.initial_import:
speech_object = Speech(instance=self.instance, **defaults)
speech_objects.append(speech_object)
created = True
else:
speech_object, created = Speech.objects.update_or_create(
source_url=defaults['source_url'],
defaults=defaults,
instance=self.instance
)
sp_count_c += created
sp_count_u += not created

# create speeches of the last sitting when doing initial import
if self.initial_import:
Speech.objects.bulk_create(speech_objects)

self._vlog('Imported %i sections (%i created, %i updated) and %i speeches (%i created, %i updated)' % (
sec_count_c+sec_count_u, sec_count_c, sec_count_u,
sp_count_c+sp_count_u, sp_count_c, sp_count_u))


def local_date_time(dtstr):
if not dtstr:
return None, None
dt = vpapi.utc_to_local(dtstr, to_string=False)
return dt.date(), dt.time()


def update_object(qs, data, defaults=None, **kwargs):
# update or create the object itself
record, created = qs.update_or_create(defaults=defaults, **kwargs)

# update or create related objects for links and sources
for l in ('sources', 'links'):
for item in data.get(l, []):
rel = getattr(record, l)
rel.update_or_create(
url=item['url'],
defaults={'note': item.get('note', '')}
)

# in case of speakers, update or create related objects for
# identifiers, other_names and contact details
if qs.model == Speaker:
record.identifiers.update_or_create(
identifier=data['id'],
scheme='api.parldata.eu',
)
for i in data.get('identifiers', []):
record.identifiers.update_or_create(
identifier=i['identifier'],
scheme=i.get('scheme', ''),
)
for name in data.get('other_names', []):
record.other_names.update_or_create(
name=name['name'],
defaults={'note': name.get('note', '')},
)
for cd in data.get('contact_details', []):
# update_or_create returns (object, created); unpack to get the contact detail record
subrec, _created = record.contact_details.update_or_create(
label=cd.get('label', ''),
start_date=cd.get('start_date'),
defaults={
'contact_type': cd['type'],
'value': cd['value'],
'note': cd.get('note', ''),
'end_date': cd.get('end_date'),
}
)
for src in cd.get('sources', []):
subrec.sources.update_or_create(
url=src['url'],
defaults={'note': src.get('note', '')}
)

return record, created
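
The ``load_parldata`` management command referred to in the
documentation is not part of this diff. A minimal wrapper around
``ParldataImporter`` could look like the sketch below; the module path,
the way the country and chamber codes are obtained, and the option
wiring are assumptions rather than the committed code.

.. code-block:: python

    # hypothetical sketch of the load_parldata management command
    from django.conf import settings
    from django.core.management.base import BaseCommand

    from sayit_parldata_eu.importers import ParldataImporter  # assumed module path


    class Command(BaseCommand):
        help = 'Import speakers and debates from api.parldata.eu'

        def add_arguments(self, parser):
            # --initial wipes existing data and re-imports everything (see ParldataImporter)
            parser.add_argument('--initial', action='store_true', default=False)

        def handle(self, *args, **options):
            # country/chamber codes are assumed to be defined in the subdomain settings
            importer = ParldataImporter(
                settings.COUNTRY_CODE, settings.PARLIAMENT_CODE, **options)
            importer.load_speakers()
            importer.load_debates()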
