CSV data import updates #1444

Merged: 17 commits, merged Aug 14, 2018
Changes from all commits
17 changes: 12 additions & 5 deletions Dockerfile
@@ -3,17 +3,24 @@ FROM ubuntu:16.04
ENV DEBIAN_FRONTEND noninteractive
ENV PYTHONUNBUFFERED 1

RUN apt-get update -qq

# For python 3.6
RUN apt-get install -yqq software-properties-common \
&& add-apt-repository ppa:jonathonf/python-3.6 \
&& apt-get update -qq

RUN apt-get update \
&& apt-get install -y postgresql-client \
libproj-dev \
gdal-bin \
memcached \
libmemcached-dev \
build-essential \
python \
python-pip \
python3.6 \
python3-pip \
python-virtualenv \
python-dev \
python3.6-dev \
git \
libssl-dev \
libpq-dev \
@@ -25,8 +32,8 @@ RUN apt-get update \
zlib1g-dev \
python-software-properties \
ghostscript \
python-celery \
python-sphinx \
python3-celery \
python3-sphinx \
openjdk-9-jre-headless \
locales \
pkg-config \
68 changes: 68 additions & 0 deletions Dockerfile.py2
@@ -0,0 +1,68 @@
FROM ubuntu:16.04

ENV DEBIAN_FRONTEND noninteractive
ENV PYTHONUNBUFFERED 1

RUN apt-get update \
&& apt-get install -y postgresql-client \
libproj-dev \
gdal-bin \
memcached \
libmemcached-dev \
build-essential \
python \
python-pip \
python-virtualenv \
python-dev \
git \
libssl-dev \
libpq-dev \
gfortran \
libatlas-base-dev \
libjpeg-dev \
libxml2-dev \
libxslt-dev \
zlib1g-dev \
python-software-properties \
ghostscript \
python-celery \
python-sphinx \
openjdk-9-jre-headless \
locales \
pkg-config \
gcc \
libtool \
automake

RUN locale-gen en_US.UTF-8
ENV LC_ALL en_US.UTF-8
ENV LC_CTYPE en_US.UTF-8
RUN dpkg-reconfigure locales

RUN useradd -m onadata
RUN mkdir -p /srv/onadata && chown -R onadata:onadata /srv/onadata
USER onadata
RUN mkdir -p /srv/onadata/requirements

ADD requirements /srv/onadata/requirements/

WORKDIR /srv/onadata

ADD . /srv/onadata/

ENV DJANGO_SETTINGS_MODULE onadata.settings.docker

USER root

# For local development, tmux is nice to have
RUN apt-get install -y tmux

RUN rm -rf /var/lib/apt/lists/* \
&& find . -name '*.pyc' -type f -delete

USER onadata

# Configure tmux to use the bash shell.
RUN echo "set-option -g default-shell /bin/bash" > ~/.tmux.conf

CMD ["/srv/onadata/docker/docker-entrypoint.sh"]
30 changes: 30 additions & 0 deletions docker-compose-py2.yml
@@ -0,0 +1,30 @@
version: '3'

services:
db:
build: ./docker/postgis
image: postgis:9.6
volumes:
      # One level above the code to prevent having to move or delete
      # it every time we rebuild.
- ../.onadata_db:/var/lib/postgresql/data
queue:
image: rabbitmq
web:
build:
context: .
dockerfile: Dockerfile.py2
image: onadata:py2
volumes:
- .:/srv/onadata
- .inputrc:/home/onadata/.inputrc
- .bash_history:/home/onadata/.bash_history
ports:
- "3030:3030"
- "8000:8000"
depends_on:
- db
- queue
environment:
- SELECTED_PYTHON=python2

9 changes: 6 additions & 3 deletions docker-compose.yml
@@ -11,8 +11,10 @@ services:
queue:
image: rabbitmq
web:
build: .
image: onadata:master
build:
context: .
dockerfile: Dockerfile
image: onadata:py3
volumes:
- .:/srv/onadata
- .inputrc:/home/onadata/.inputrc
@@ -23,4 +25,5 @@
depends_on:
- db
- queue

environment:
- SELECTED_PYTHON=python3.6
4 changes: 2 additions & 2 deletions docker/docker-entrypoint.sh
@@ -6,8 +6,8 @@ psql -h db -U postgres -c "CREATE ROLE onadata WITH SUPERUSER LOGIN PASSWORD 'on
psql -h db -U postgres -c "CREATE DATABASE onadata OWNER onadata;"
psql -h db -U postgres onadata -c "CREATE EXTENSION postgis; CREATE EXTENSION postgis_topology;"

virtualenv /srv/onadata/.virtualenv
. /srv/onadata/.virtualenv/bin/activate
virtualenv -p `which $SELECTED_PYTHON` /srv/onadata/.virtualenv/${SELECTED_PYTHON}
. /srv/onadata/.virtualenv/${SELECTED_PYTHON}/bin/activate

cd /srv/onadata
pip install --upgrade pip
132 changes: 132 additions & 0 deletions docs/proposals/data-import.md
@@ -0,0 +1,132 @@
# Bulk data upload.

Allow bulk upload of data.

1. CSV upload to a blank form (already supported)
2. Excel upload to a blank form (not supported)
3. Allow multiple uploads of data to a form that may already include data.

Example XLSForm:

**survey**

| type | name | label |
| --- | --- | --- |
| text | a_name | Your Name? |
| begin group | fruits | Fruits |
| select one fruits | fruit | Fruit |
| end group | | |

**choices**

| list name | name | label |
| --- | --- | --- |
| fruits | mango | Mango |
| fruits | orange | Orange |
| fruits | apple | Apple |

## Excel bulk data upload to an existing form

A user should be able to upload an Excel `.xls/.xlsx` file to a form. The data in the file will add new submissions or edit existing submissions. The Excel file should satisfy the following:

1. The first row, the column header row, MUST HAVE names that match question names in the XLSForm/XForm. Any column name that does not match a question in the form will be ignored and hence not imported.
2. The column header names MAY HAVE group or repeat name separators. If there are no separators, it is ASSUMED that the column name matches the question, whether the question is within a group or a repeat (see the matching sketch after this list). For example:

| a_name | fruits/fruit |
| --- | --- |
| Alice | mango |
| Bob | orange |

Without the group name, the following file also matches the form above.

| a_name | fruit |
| --- | --- |
| Alice | mango |
| Bob | orange |

3. There MUST NOT be duplicate column names in the form or in the data upload file; otherwise the upload will be rejected.
4. The file MAY HAVE a `meta/instanceID` column, which should uniquely identify a specific record. If present, the `meta/instanceID` will be used to determine whether the record is new or an edit. If it does not exist, the system will create a new one for each new record added.
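
A minimal sketch (not the onadata implementation) of how upload columns could be matched to form questions, following rules 1 and 2 above; all names here are for illustration only:

    def match_columns(header, question_paths):
        """Map upload column names to form question paths.

        `question_paths` are the slash-separated question paths from the
        form, e.g. ["a_name", "fruits/fruit"]. A column matches either the
        full path or the bare field name (the part after the last "/").
        Unmatched columns are ignored (rule 1).
        """
        by_path = {path: path for path in question_paths}
        by_name = {path.rsplit("/", 1)[-1]: path for path in question_paths}

        matched, ignored = {}, []
        for column in header:
            path = by_path.get(column) or by_name.get(column)
            if path:
                matched[column] = path
            else:
                ignored.append(column)
        return matched, ignored

    # With the example form above:
    # match_columns(["a_name", "fruit"], ["a_name", "fruits/fruit"])
    # -> ({"a_name": "a_name", "fruit": "fruits/fruit"}, [])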

Questions:
1. What happens if an upload file has repeats? They will also be uploaded; they need to be in the flat CSV format, e.g. `fruits[1]/fruit`.

2. Which Excel sheet should have the data imported? First sheet.
3. Should an Excel template file be provided? Yes, we need to implement a blank CSV format; it will be up to the user to convert the CSV to Excel.

### Data upload expected behaviour

When the upload is complete, one of three things could happen to the data:

1. The upload will add new records to the existing form.
2. The upload will edit existing records where there is a matching `meta/instanceID`, and add new records where the `meta/instanceID` is blank, missing, or does not match an existing record.
3. The upload will overwrite existing records.

Note:

- For ANY approach, the UI should display a caution/warning, and a clear explanation of expected behaviour to the user.
- The original data submitter information will be lost in the case of an overwrite.
- No effort will be made to link an exported file from Ona with the original submitter of the data.

#### 1. The upload will add new records to the existing form.

A data upload will add new records to the existing form under the following circumstances:

1. The form has NO submissions.
2. The upload file DOES NOT have the `meta/instanceID` column. (Should the user be allowed to specify a unique column?)

#### 2. The upload will edit existing records

A data upload will edit existing records in an existing form only if the upload file CONTAINS the column `meta/instanceID` and the value in this column MATCHES an existing record in the form.

Note: A new record will be created if the `meta/instanceID` is BLANK, MISSING, or DOES NOT EXIST.
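
A minimal sketch of the new-vs-edit decision described above (function and variable names are assumptions, not existing onadata code):

    def classify_row(row, existing_instance_ids):
        """Decide whether an uploaded row is a new record or an edit.

        `existing_instance_ids` is the set of meta/instanceID values
        already stored for the form.
        """
        instance_id = (row.get("meta/instanceID") or "").strip()
        if instance_id and instance_id in existing_instance_ids:
            return "edit", instance_id
        # Blank, missing, or unknown instanceID: create a new record and
        # let the system generate a fresh meta/instanceID for it.
        return "new", None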

#### 3. The upload will OVERWRITE existing records

A data upload will OVERWRITE existing records if the parameter `overwrite=true` is passed as part of the upload request. All existing records will be PERMANENTLY DELETED, and the new data upload will become the form's submissions.

Questions:
- Should it be possible to REVERT this process? NO


## API implementation

Implement a `/api/v1/data/[pk]/import` endpoint on the API.

### `POST` /data/[pk]/import

The endpoint will accept `POST` requests to upload the data CSV/Excel file.

- Persist the uploaded file in the database and file storage. I propose we use the `MetaData` model to keep this record; we may need to use a new key, e.g. `data-imports`, to refer to these files. New models could be used to achieve the same effect if there is more information to be stored.
- An asynchronous task will be created to start the process of importing the records from the file.

Request:

POST /data/[pk]/import

{
"upload_file": ..., // the file to upload
"overwrite": false, // whether to overwrite or not, accepts true or false.
...
}

Response:

Response status code 201

    {
        "xform": [xform pk],
        "upload_id": [unique record identifier for the upload],
        "filename": [filename of the uploaded file]
    }
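
A hedged usage sketch of the proposed endpoint; the URL, form pk, token, and file name below are placeholders, and the endpoint itself may change before implementation:

    import requests

    API_URL = "https://example.com/api/v1/data/1234/import"  # placeholder pk
    TOKEN = "your-api-token"  # placeholder credential

    with open("fruits_data.csv", "rb") as upload_file:
        response = requests.post(
            API_URL,
            headers={"Authorization": "Token %s" % TOKEN},
            files={"upload_file": upload_file},
            data={"overwrite": "false"},  # "true" would replace all records
        )

    print(response.status_code)  # expected 201 on success
    print(response.json())       # {"xform": ..., "upload_id": ..., "filename": ...}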

#### Processing the Uploaded file.

Depending on the request parameters, the data import will take into account the three options described above, i.e. NEW, EDIT, or OVERWRITE.

- A count of the records processed, successful, and failed should be maintained.
- In the event of a SUCCESS or a FAILURE, a notification SHOULD be sent. The notification can be via EMAIL to the user uploading the data, via MQTT messaging/notifications, or BOTH.
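
A minimal sketch of the asynchronous import task (the task name and all helpers below are placeholders for illustration, not existing onadata functions):

    from celery import shared_task


    @shared_task
    def import_data_async(xform_id, upload_id, overwrite=False):
        """Hypothetical task that processes an uploaded data file."""
        counts = {"processed": 0, "successful": 0, "failed": 0}
        if overwrite:
            # OVERWRITE: permanently delete existing submissions first.
            delete_existing_submissions(xform_id)
        for row in read_uploaded_rows(upload_id):
            counts["processed"] += 1
            try:
                # NEW or EDIT, decided by the meta/instanceID rules above.
                create_or_edit_submission(xform_id, row)
                counts["successful"] += 1
            except Exception:
                counts["failed"] += 1
        # Notify via email and/or MQTT on completion.
        notify_user(xform_id, upload_id, counts)
        return counts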

## Questions

1. What happens if an upload file has repeats? Repeats will be part of data.
2. Which Excel sheet should have the data imported? The first sheet.
3. Should an Excel template file be provided? Yes, API endpoint will be added.
4. Should it be possible to REVERT this process? NO
5. How should we notify the user of upload status/progress? Messaging notification, job status query?
6. What limits should we impose on data file uploads? In megabytes or number of rows?
7. Is the process supposed to be atomic, i.e. must all rows go through, or are partial uploads acceptable? Partial uploads will be supported.
8. Should data imports from exports be linked to the submitting user? Yes.
9. Should media links be downloaded into the new submission? Only data will be imported; media attachments will not be imported.
4 changes: 3 additions & 1 deletion onadata/apps/api/tasks.py
@@ -57,7 +57,9 @@ def publish_xlsform_async(self, user, post_data, owner, file_data):


@task()
def delete_xform_async(xform):
def delete_xform_async(xform_id):
"""Soft delete an XForm asynchrounous task"""
xform = XForm.objects.get(pk=xform_id)
xform.soft_delete()


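# Hedged usage sketch (not part of the diff above): callers now enqueue the
# task with a serializable primary key instead of the model instance, e.g.
#     delete_xform_async.delay(xform.pk)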
1 change: 1 addition & 0 deletions onadata/apps/api/tests/viewsets/test_data_viewset.py
@@ -1538,6 +1538,7 @@ def test_geojson_format(self):
format='geojson')

self.assertEqual(response.status_code, 200)
self.assertEqual(self.xform.instances.count(), 4)

test_geo = {
'type': 'Feature',
6 changes: 4 additions & 2 deletions onadata/apps/api/tools.py
@@ -6,9 +6,9 @@
import os
import tempfile
from datetime import datetime

from future.utils import listitems

import librabbitmq
from django import forms
from django.conf import settings
from django.contrib.auth.models import Permission, User
@@ -23,7 +23,9 @@
from django.http import HttpResponseNotFound, HttpResponseRedirect
from django.shortcuts import get_object_or_404
from django.utils.translation import ugettext as _

from guardian.shortcuts import assign_perm, get_perms_for_model, remove_perm
from kombu.exceptions import OperationalError
from registration.models import RegistrationProfile
from rest_framework import exceptions
from taggit.forms import TagField
@@ -404,7 +406,7 @@ def id_string_exists_in_account():
try:
# Next run async task to apply all other perms
set_project_perms_to_xform_async.delay(xform.pk, project.pk)
except librabbitmq.ConnectionError:
except OperationalError:
        # Apply permissions synchronously
set_project_perms_to_xform(xform, project)
else:
7 changes: 4 additions & 3 deletions onadata/apps/api/viewsets/data_viewset.py
@@ -70,15 +70,16 @@ def get_data_and_form(kwargs):
return (data_id, kwargs.get('format'))


def delete_instance(instance):
def delete_instance(instance, user):
"""
Function that calls Instance.set_deleted and catches any exception that may
occur.
    :param instance: Instance to soft delete.
    :param user: User performing the deletion.
:return:
"""
try:
instance.set_deleted(timezone.now())
instance.set_deleted(timezone.now(), user)
except FormInactiveError as e:
raise ParseError(text(e))

@@ -307,7 +308,7 @@ def destroy(self, request, *args, **kwargs):

if request.user.has_perm(
CAN_DELETE_SUBMISSION, self.object.xform):
delete_instance(self.object)
delete_instance(self.object, request.user)
else:
raise PermissionDenied(_(u"You do not have delete "
u"permissions."))