# Using the surveydata package with ODK Central

This workbook demonstrates how to use the `surveydata` Python package to work with [ODK Central](https://docs.getodk.org/central-intro/) data. It covers six approaches to storage:

1. Use of local data exported by [ODK Central](https://docs.getodk.org/central-intro/)
2. Synchronizing data directly from ODK into the local file system
3. Synchronizing data directly from ODK into [AWS S3 storage](https://aws.amazon.com/s3/), including attachments
4. Synchronizing data directly from ODK into [AWS DynamoDB storage](https://aws.amazon.com/dynamodb/), with attachments saved to AWS S3
5. Synchronizing data directly from ODK into [Google Cloud Storage](https://cloud.google.com/storage), including attachments
6. Synchronizing data directly from ODK into [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs/), including attachments

All examples other than the local data export and the DynamoDB example include code to save and reload Pandas DataFrames, for cases where data ingestion and processing are separated from actual analysis or use.

## Reading credentials and configuration

This example workbook begins by loading cloud storage configuration and credentials from an `.ini` file stored in `~/.ocl/surveydata-odk-examples.ini`. The `~` in the path refers to the current user's home directory, and the `.ini` file contents are as follows:

    [aws]
    accesskeyid=idhere
    accesskeysecret=secrethere
    s3bucketname=bucketnamehere
    region=regionnamehere
    ddbtablename=tablenamehere

    [google]
    googleprojectid=idhere
    googlebucketname=bucketnamehere

    [azure]
    azureconnectionstring=connectionstringhere
    azurecontainername=oclexamples
    azureaccounturl=https://storageaccountname.blob.core.windows.net

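Feel free to update the path in `inifile_location` below. You can include only the properties needed for the examples you wish to run, but then also remove the `inifile.get` calls for any section you leave out, since they raise an error when a section or option is missing.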
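
If you would rather keep the loading code unchanged while omitting sections from the `.ini` file, one option is to read each section defensively. The following is a minimal sketch using only the standard library; `load_section` is a hypothetical helper and not part of `surveydata`, and the sample values are the placeholders from the listing above.

```python
import configparser


def load_section(parser: configparser.RawConfigParser, section: str) -> dict:
    """Return the section's key/value pairs, or an empty dict if the section is absent."""
    if not parser.has_section(section):
        return {}
    return dict(parser.items(section))


# Sample file contents with only the [aws] section present (placeholder values).
sample_ini = """
[aws]
accesskeyid=idhere
accesskeysecret=secrethere
s3bucketname=bucketnamehere
region=regionnamehere
ddbtablename=tablenamehere
"""

parser = configparser.RawConfigParser()
parser.read_string(sample_ini)

aws_settings = load_section(parser, "aws")
azure_settings = load_section(parser, "azure")  # section missing, so this is {}

print(aws_settings.get("s3bucketname"))
print(azure_settings.get("azurecontainername"))
```

With this pattern, a missing section yields `None` from `.get(...)` rather than an exception, so each example can check for the settings it needs before running.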

The ODK Central credentials are likewise loaded from `~/.ocl/.pyodk_config.toml`, which looks like this:

    [central]
    base_url = "https://yourserver.domain.com"
    username = "username_with_access"
    password = "password"
    default_project_id = default_project_id_number

Finally, for Google Cloud authentication, service account credentials are loaded from `~/.ocl/google-ocl-examples-service-account-credentials.json`, a key file that can be downloaded from the Google Cloud console.

In [1]:
# for convenience, auto-reload modules when they've changed
%load_ext autoreload
%autoreload 2

import configparser
import os

# manually initialize ODK parameters
project_id = 1
form_id = "all_fields_for_testing"

# load credentials and other configuration from a local ini file
inifile_location = os.path.expanduser("~/.ocl/surveydata-odk-examples.ini")
inifile = configparser.RawConfigParser()
inifile.read(inifile_location)

# load AWS credentials and configuration
aws_accesskey_id = inifile.get("aws", "accesskeyid")
aws_accesskey_secret = inifile.get("aws", "accesskeysecret")
s3_bucketname = inifile.get("aws", "s3bucketname")
aws_region = inifile.get("aws", "region")
ddb_tablename = inifile.get("aws", "ddbtablename")

# load Google Cloud Storage credentials and configuration
from google.oauth2 import service_account
google_project_id = inifile.get("google", "googleprojectid")
google_bucket_name = inifile.get("google", "googlebucketname")

google_credentials = service_account.Credentials.from_service_account_file(os.path.expanduser("~/.ocl/google-ocl-examples-service-account-credentials.json"))

# load Azure Blob Storage credentials and configuration
azure_connection_string = inifile.get("azure", "azureconnectionstring")
azure_container_name = inifile.get("azure", "azurecontainername")
azure_account_url = inifile.get("azure", "azureaccounturl")

## Loading data from ODK Central export

First, we'll take the simplest case: an *All data and Attachments* export from ODK Central, which has been downloaded and unzipped into a local directory (here, with the primary export file in `~/Exports/odk/all_fields_for_testing/all_fields_for_testing.csv`). The `surveydata` package makes it easy to load all submissions into a Pandas DataFrame.

This example doesn't utilize any external services or require any credentials, so it doesn't use anything loaded from the `.ini` file above. It loads all `.csv` files present in the specified directory, and it presumes that the `media` subdirectory is also present with all attachments.

In [2]:
from surveydata import ODKExportStorage
from surveydata import ODKPlatform

# initialize local storage with wide-format export and attachments_available=True since media subdirectory is present
storage = ODKExportStorage(export_file=os.path.expanduser("~/Exports/odk/all_fields_for_testing/all_fields_for_testing.csv"), attachments_available=True)

# output details about what's present in storage
print(f"Submissions in storage: {storage.list_submissions()}")
print()
print(f"Attachments in storage: {storage.list_attachments()}")
print()

# load all submissions into DataFrame and describe contents
submissions_df = ODKPlatform.get_submissions_df(storage)
print("Submission DataFrame field counts:")
print(submissions_df.count(0))
print()

Submissions in storage: ['uuid:0d70b386-59f7-488e-ab43-fb0bb934a7c8', 'uuid:6a384d76-5547-4bf0-8f50-4f5c873c83c3', 'uuid:f4a826a1-db80-48f5-97e3-de840e964d9c', 'uuid:f9374882-7671-46fa-87ec-53b516d2b37a']

Attachments in storage: [{'name': '1666784762837-9_46_21.jpg', 'location_string': '1666784762837-9_46_21.jpg'}, {'name': '1666784783763-9_46_25.m4a', 'location_string': '1666784783763-9_46_25.m4a'}, {'name': 'ChrisExample_public-9_46_36.pem', 'location_string': 'ChrisExample_public-9_46_36.pem'}]

Submission DataFrame field counts:
SubmissionDate                           4
starttime                                4
endtime                                  4
deviceid                                 4
devicephonenum                           4
username                                 4
intronote                                4
qtext                                    4
qint                                     2
qdecimal                                 1
qgeopoint-Latitude            

## Synchronizing data between ODK Central and local file system

In this next case, we'll synchronize data directly between ODK Central and the local file system. The synchronization process will be efficient, using a stored cursor to only download and store new or updated data, and it will include both submission data and all attachments. As before, we'll load all data into a DataFrame, and here we'll show how to save and reload that DataFrame in case your processing and analysis jobs are separated.

In this example, data is synchronized to the `~/Files/surveydata/odk/formid/` folder tree, where `~` refers to the current user's home directory and `formid` is the ODK form ID initialized earlier.
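
To make the idea of a stored cursor concrete, here is a minimal sketch of cursor-based incremental synchronization in plain Python. It is not how `surveydata` or ODK Central implement it; the `sync_new` function, the `__CURSOR__` key, and the `updated_at` field are all hypothetical names chosen for illustration. The point is only that the position reached by the previous run is saved alongside the data, so a later run can skip everything it has already stored.

```python
from datetime import datetime, timezone

CURSOR_KEY = "__CURSOR__"  # hypothetical key under which the sync position is stored


def sync_new(source: list, store: dict) -> list:
    """Copy records updated after the stored cursor into store and return their IDs."""
    cursor = store.get(CURSOR_KEY)  # None on the very first run
    fresh = [r for r in source if cursor is None or r["updated_at"] > cursor]
    for record in fresh:
        store[record["id"]] = record
    if fresh:
        # Advance the cursor to the latest update time that was stored.
        store[CURSOR_KEY] = max(r["updated_at"] for r in fresh)
    return [r["id"] for r in fresh]


def ts(minute: int) -> datetime:
    """Build a UTC timestamp for the made-up sample records."""
    return datetime(2022, 10, 26, 9, minute, tzinfo=timezone.utc)


source = [
    {"id": "uuid:a", "updated_at": ts(46)},
    {"id": "uuid:b", "updated_at": ts(47)},
]
store = {}

first_run = sync_new(source, store)   # everything is new
second_run = sync_new(source, store)  # nothing has changed since the cursor

print(first_run)
print(second_run)
```

The empty second run corresponds to the "Count of new submissions sync'd to storage: 0" output you see when re-running a cell against storage that is already up to date.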

In [3]:
from surveydata import ODKPlatform
from surveydata import FileStorage

# initialize the survey platform connection
platform = ODKPlatform(config_file=os.path.expanduser("~/.ocl/.pyodk_config.toml"), project_id=project_id, form_id=form_id)

# initialize the local file storage location
storage = FileStorage(os.path.expanduser("~/Files/surveydata/odk/" + form_id + "/"))

# synchronize data to ensure storage is up-to-date
new_submissions = platform.sync_data(storage, include_rejected=False)
print(f"Count of new submissions sync'd to storage: {len(new_submissions)}")
print(f"List of new submissions sync'd to storage: {new_submissions}")
print()

# output details about what's present in storage
print(f"Submissions in storage: {storage.list_submissions()}")
print()
print(f"Attachments in storage: {storage.list_attachments()}")
print()

# load all submissions into DataFrame and describe contents
submissions_df = ODKPlatform.get_submissions_df(storage)
print("Submission DataFrame field counts:")
print(submissions_df.count(0))
print()

# save DataFrame for other Python code to access (retains all DataFrame properties)
storage.store_dataframe("__ALL_DATA_DF__", submissions_df)
# example of how to reload the saved DataFrame
submissions_df_from_storage = storage.get_dataframe("__ALL_DATA_DF__")

# save DataFrame as a .csv file for others to access (doesn't retain all DataFrame properties)
storage.store_dataframe_csv("__ALL_DATA_CSV__", submissions_df)
# example of how to reload saved .csv into a DataFrame
submissions_df_from_csv = storage.get_dataframe_csv("__ALL_DATA_CSV__")

Count of new submissions sync'd to storage: 0
List of new submissions sync'd to storage: []

Submissions in storage: ['uuid:f4a826a1-db80-48f5-97e3-de840e964d9c', 'uuid:f9374882-7671-46fa-87ec-53b516d2b37a', 'uuid:0d70b386-59f7-488e-ab43-fb0bb934a7c8', 'uuid:6a384d76-5547-4bf0-8f50-4f5c873c83c3']

Attachments in storage: [{'name': '1666784762837-9_46_21.jpg', 'submission_id': 'uuid:0d70b386-59f7-488e-ab43-fb0bb934a7c8', 'location_string': 'file:/Users/crobert/Files/surveydata/odk/all_fields_for_testing/uuid%3A0d70b386-59f7-488e-ab43-fb0bb934a7c8/1666784762837-9_46_21.jpg'}, {'name': '1666784783763-9_46_25.m4a', 'submission_id': 'uuid:0d70b386-59f7-488e-ab43-fb0bb934a7c8', 'location_string': 'file:/Users/crobert/Files/surveydata/odk/all_fields_for_testing/uuid%3A0d70b386-59f7-488e-ab43-fb0bb934a7c8/1666784783763-9_46_25.m4a'}, {'name': 'ChrisExample_public-9_46_36.pem', 'submission_id': 'uuid:0d70b386-59f7-488e-ab43-fb0bb934a7c8', 'location_string': 'file:/Users/crobert/Files/surveydata
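
The code comments above note that the `.csv` copy doesn't retain all DataFrame properties. The sketch below illustrates the general reason with standard-library stand-ins rather than DataFrames: a binary serialization keeps Python types, while a CSV round trip returns every value as text. It makes no claim about the format `store_dataframe` uses internally, and the sample values are made up (only the field names are borrowed from the output above).

```python
import csv
import io
import pickle
from datetime import datetime

# One made-up record using field names from the example form.
records = [
    {"qint": 42, "qdecimal": 3.5, "starttime": datetime(2022, 10, 26, 9, 46, 21)},
]

# Binary round trip: Python types survive.
from_pickle = pickle.loads(pickle.dumps(records))

# CSV round trip: every value comes back as a string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
writer.writeheader()
writer.writerows(records)
buffer.seek(0)
from_csv = list(csv.DictReader(buffer))

print(type(from_pickle[0]["qint"]).__name__)
print(type(from_csv[0]["qint"]).__name__)
```

In practice this means a consumer of the `.csv` copy may need to re-parse dates and numeric columns, whereas the saved DataFrame can be used as-is.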

## Synchronizing data between ODK and S3

In this next case, we'll synchronize data directly between ODK and S3. The synchronization process will be efficient, using a stored cursor to only download and store new or updated data, and it will include both submission data and all attachments. As before, we'll load all data into a DataFrame, and we'll show how to save and reload that DataFrame in case your processing and analysis jobs are separated.

The AWS credentials and S3 bucket name are those loaded earlier. We recommend creating a new programmatic-access AWS user with limited access to the appropriate S3 bucket. In this example, all data is stored within the `surveydata/odk/formid/` folder, where `formid` is the ODK form ID initialized earlier. This allows data from multiple forms to be safely stored within the same S3 bucket.

In [4]:
from surveydata import ODKPlatform
from surveydata import S3Storage

# initialize the survey platform connection
platform = ODKPlatform(config_file=os.path.expanduser("~/.ocl/.pyodk_config.toml"), project_id=project_id, form_id=form_id)

# initialize the S3 storage connection
storage = S3Storage(s3_bucketname, key_name_prefix="surveydata/odk/" + form_id + "/", aws_access_key_id=aws_accesskey_id, aws_secret_access_key=aws_accesskey_secret)

# synchronize data to ensure storage is up-to-date
new_submissions = platform.sync_data(storage)
print(f"Count of new submissions sync'd to storage: {len(new_submissions)}")
print(f"List of new submissions sync'd to storage: {new_submissions}")
print()

# output details about what's present in storage
print(f"Submissions in storage: {storage.list_submissions()}")
print()
print(f"Attachments in storage: {storage.list_attachments()}")
print()

# load all submissions into DataFrame and describe contents
submissions_df = ODKPlatform.get_submissions_df(storage)
print("Submission DataFrame field counts:")
print(submissions_df.count(0))
print()

# save DataFrame for other Python code to access (retains all DataFrame properties)
storage.store_dataframe("__ALL_DATA_DF__", submissions_df)
# example of how to reload the saved DataFrame
submissions_df_from_storage = storage.get_dataframe("__ALL_DATA_DF__")

# save DataFrame as a .csv file for others to access (doesn't retain all DataFrame properties)
storage.store_dataframe_csv("__ALL_DATA_CSV__", submissions_df)
# example of how to reload saved .csv into a DataFrame
submissions_df_from_csv = storage.get_dataframe_csv("__ALL_DATA_CSV__")

Count of new submissions sync'd to storage: 0
List of new submissions sync'd to storage: []

Submissions in storage: ['uuid:0d70b386-59f7-488e-ab43-fb0bb934a7c8', 'uuid:6a384d76-5547-4bf0-8f50-4f5c873c83c3', 'uuid:f4a826a1-db80-48f5-97e3-de840e964d9c', 'uuid:f9374882-7671-46fa-87ec-53b516d2b37a']

Attachments in storage: [{'name': '1666784762837-9_46_21.jpg', 'submission_id': 'uuid:0d70b386-59f7-488e-ab43-fb0bb934a7c8', 'location_string': 's3:surveydata/odk/all_fields_for_testing/uuid%3A0d70b386-59f7-488e-ab43-fb0bb934a7c8/1666784762837-9_46_21.jpg'}, {'name': '1666784783763-9_46_25.m4a', 'submission_id': 'uuid:0d70b386-59f7-488e-ab43-fb0bb934a7c8', 'location_string': 's3:surveydata/odk/all_fields_for_testing/uuid%3A0d70b386-59f7-488e-ab43-fb0bb934a7c8/1666784783763-9_46_25.m4a'}, {'name': 'ChrisExample_public-9_46_36.pem', 'submission_id': 'uuid:0d70b386-59f7-488e-ab43-fb0bb934a7c8', 'location_string': 's3:surveydata/odk/all_fields_for_testing/uuid%3A0d70b386-59f7-488e-ab43-fb0bb934a7
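
The `location_string` values in the output above appear to follow a `scheme:prefix/encoded-submission-id/filename` layout, with the colon in the submission ID percent-encoded as `%3A`. Assuming that layout holds (it is inferred from the output here, not from documented `surveydata` behavior), a small hypothetical helper like `parse_location` below can split a location string back into its parts.

```python
from urllib.parse import unquote


def parse_location(location_string: str) -> dict:
    """Split a 'scheme:prefix/encoded-submission-id/filename' string into its parts."""
    scheme, _, path = location_string.partition(":")
    prefix, encoded_id, filename = path.rsplit("/", 2)
    return {
        "scheme": scheme,
        "prefix": prefix,
        "submission_id": unquote(encoded_id),  # turns 'uuid%3A...' back into 'uuid:...'
        "filename": filename,
    }


example = (
    "s3:surveydata/odk/all_fields_for_testing/"
    "uuid%3A0d70b386-59f7-488e-ab43-fb0bb934a7c8/1666784762837-9_46_21.jpg"
)
parts = parse_location(example)

print(parts["scheme"])
print(parts["submission_id"])
print(parts["filename"])
```

Since each attachment entry already carries `name` and `submission_id` fields, parsing is only needed when you have a bare location string, for example one copied from a log.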

## Synchronizing data between ODK and DynamoDB+S3

In this next case, we'll synchronize data directly between ODK and DynamoDB+S3, storing the submission data in DynamoDB and all attachments in S3. The synchronization process will be efficient, using a stored cursor to only download and store new or updated data, and it will include both submission data and all attachments. As before, we'll load all data into a DataFrame.

The AWS credentials, DynamoDB table name, and S3 bucket name are those loaded earlier. We recommend creating a new programmatic-access AWS user with limited access to the appropriate DynamoDB table and S3 bucket. In this example, all data is stored within the `{FormID: formid}` DynamoDB partition, where `formid` is the ODK form ID initialized earlier; all attachments are stored within the `surveydata/odk/attachments/formid/` folder in the S3 bucket.
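
To illustrate why a per-form partition keeps data from different forms separate within one table, here is a small in-memory sketch. It does not call DynamoDB, and it assumes that the `KEY` field acts as the sort key within each `FormID` partition, which is a guess based on the constructor arguments used below. The `put_item` and `query_partition` helpers and the sample values are hypothetical.

```python
# In-memory stand-in for a table keyed by (FormID, KEY); not a real DynamoDB client.
table = {}


def put_item(form_id: str, key: str, data: dict) -> None:
    """Store one item under its partition (form ID) and item key."""
    table[(form_id, key)] = data


def query_partition(form_id: str) -> dict:
    """Return only the items that belong to the given form's partition."""
    return {key: data for (fid, key), data in table.items() if fid == form_id}


# Two forms share the same table without colliding, even with identical item keys.
put_item("all_fields_for_testing", "uuid:a", {"qtext": "hello"})
put_item("all_fields_for_testing", "uuid:b", {"qtext": "world"})
put_item("another_form", "uuid:a", {"qtext": "separate"})

form_items = query_partition("all_fields_for_testing")
print(sorted(form_items))
```

The same reasoning applies to the S3 side, where the form ID in the key prefix plays the role of the partition.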

In [5]:
from surveydata import S3Storage
from surveydata import DynamoDBStorage
from surveydata import ODKPlatform

# initialize the survey platform connection
platform = ODKPlatform(config_file=os.path.expanduser("~/.ocl/.pyodk_config.toml"), project_id=project_id, form_id=form_id)

# initialize DynamoDB storage for submission data
storage = DynamoDBStorage(aws_region=aws_region, table_name=ddb_tablename + "-odk", id_field_name="KEY", partition_key_name="FormID", partition_key_value=form_id, aws_access_key_id=aws_accesskey_id, aws_secret_access_key=aws_accesskey_secret)
# initialize S3 storage for attachments
file_storage = S3Storage(s3_bucketname, key_name_prefix="surveydata/odk/attachments/" + form_id + "/", aws_access_key_id=aws_accesskey_id, aws_secret_access_key=aws_accesskey_secret)

# synchronize data to ensure storage is up-to-date
new_submissions = platform.sync_data(storage=storage, attachment_storage=file_storage)
print(f"Count of new submissions sync'd to storage: {len(new_submissions)}")
print(f"List of new submissions sync'd to storage: {new_submissions}")
print()

# output details about what's present in storage
print(f"Submissions in storage: {storage.list_submissions()}")
print()
print(f"Attachments in storage: {file_storage.list_attachments()}")
print()

# load all submissions into DataFrame and describe contents
submissions_df = ODKPlatform.get_submissions_df(storage)
print("Submission DataFrame field counts:")
print(submissions_df.count(0))
print()

Count of new submissions sync'd to storage: 0
List of new submissions sync'd to storage: []

Submissions in storage: ['uuid:0d70b386-59f7-488e-ab43-fb0bb934a7c8', 'uuid:6a384d76-5547-4bf0-8f50-4f5c873c83c3', 'uuid:f4a826a1-db80-48f5-97e3-de840e964d9c', 'uuid:f9374882-7671-46fa-87ec-53b516d2b37a']

Attachments in storage: [{'name': '1666784762837-9_46_21.jpg', 'submission_id': 'uuid:0d70b386-59f7-488e-ab43-fb0bb934a7c8', 'location_string': 's3:surveydata/odk/attachments/all_fields_for_testing/uuid%3A0d70b386-59f7-488e-ab43-fb0bb934a7c8/1666784762837-9_46_21.jpg'}, {'name': '1666784783763-9_46_25.m4a', 'submission_id': 'uuid:0d70b386-59f7-488e-ab43-fb0bb934a7c8', 'location_string': 's3:surveydata/odk/attachments/all_fields_for_testing/uuid%3A0d70b386-59f7-488e-ab43-fb0bb934a7c8/1666784783763-9_46_25.m4a'}, {'name': 'ChrisExample_public-9_46_36.pem', 'submission_id': 'uuid:0d70b386-59f7-488e-ab43-fb0bb934a7c8', 'location_string': 's3:surveydata/odk/attachments/all_fields_for_testing/uuid%

## Synchronizing data between ODK and Google Cloud Storage

In this next case, we'll synchronize data directly between ODK and Google Cloud Storage. The synchronization process will be efficient, using a stored cursor to only download and store new or updated data, and it will include both submission data and all attachments. As before, we'll load all data into a DataFrame, and we'll show how to save and reload that DataFrame in case your processing and analysis jobs are separated.

The Google project ID, bucket name, and credentials are those loaded earlier. We recommend creating a new service account with limited access to the appropriate bucket. In this example, all data is stored within the `surveydata/odk/formid/` folder, where `formid` is the ODK form ID initialized earlier. This allows data from multiple forms to be safely stored within the same bucket.

In [6]:
from surveydata import ODKPlatform
from surveydata import GoogleCloudStorage

# initialize the survey platform connection
platform = ODKPlatform(config_file=os.path.expanduser("~/.ocl/.pyodk_config.toml"), project_id=project_id, form_id=form_id)

# initialize the Google Cloud Storage connection
storage = GoogleCloudStorage(project_id=google_project_id, bucket_name=google_bucket_name, blob_name_prefix="surveydata/odk/" + form_id + "/", credentials=google_credentials)

# synchronize data to ensure storage is up-to-date
new_submissions = platform.sync_data(storage)
print(f"Count of new submissions sync'd to storage: {len(new_submissions)}")
print(f"List of new submissions sync'd to storage: {new_submissions}")
print()

# output details about what's present in storage
print(f"Submissions in storage: {storage.list_submissions()}")
print()
print(f"Attachments in storage: {storage.list_attachments()}")
print()

# load all submissions into DataFrame and describe contents
submissions_df = ODKPlatform.get_submissions_df(storage)
print("Submission DataFrame field counts:")
print(submissions_df.count(0))
print()

# save DataFrame for other Python code to access (retains all DataFrame properties)
storage.store_dataframe("__ALL_DATA_DF__", submissions_df)
# example of how to reload the saved DataFrame
submissions_df_from_storage = storage.get_dataframe("__ALL_DATA_DF__")

# save DataFrame as a .csv file for others to access (doesn't retain all DataFrame properties)
storage.store_dataframe_csv("__ALL_DATA_CSV__", submissions_df)
# example of how to reload saved .csv into a DataFrame
submissions_df_from_csv = storage.get_dataframe_csv("__ALL_DATA_CSV__")

Count of new submissions sync'd to storage: 0
List of new submissions sync'd to storage: []

Submissions in storage: ['uuid:0d70b386-59f7-488e-ab43-fb0bb934a7c8', 'uuid:6a384d76-5547-4bf0-8f50-4f5c873c83c3', 'uuid:f4a826a1-db80-48f5-97e3-de840e964d9c', 'uuid:f9374882-7671-46fa-87ec-53b516d2b37a']

Attachments in storage: [{'name': '1666784762837-9_46_21.jpg', 'submission_id': 'uuid:0d70b386-59f7-488e-ab43-fb0bb934a7c8', 'location_string': 'gs:surveydata/odk/all_fields_for_testing/uuid%3A0d70b386-59f7-488e-ab43-fb0bb934a7c8/1666784762837-9_46_21.jpg'}, {'name': '1666784783763-9_46_25.m4a', 'submission_id': 'uuid:0d70b386-59f7-488e-ab43-fb0bb934a7c8', 'location_string': 'gs:surveydata/odk/all_fields_for_testing/uuid%3A0d70b386-59f7-488e-ab43-fb0bb934a7c8/1666784783763-9_46_25.m4a'}, {'name': 'ChrisExample_public-9_46_36.pem', 'submission_id': 'uuid:0d70b386-59f7-488e-ab43-fb0bb934a7c8', 'location_string': 'gs:surveydata/odk/all_fields_for_testing/uuid%3A0d70b386-59f7-488e-ab43-fb0bb934a7

## Synchronizing data between ODK and Azure Blob Storage

In this next case, we'll synchronize data directly between ODK and Azure Blob Storage. The synchronization process will be efficient, using a stored cursor to only download and store new or updated data, and it will include both submission data and all attachments. As before, we'll load all data into a DataFrame, and we'll show how to save and reload that DataFrame in case your processing and analysis jobs are separated.

The Azure connection string and container name are those loaded earlier. While this example uses a connection string to authenticate, passwordless authentication is also supported if you log in first and then pass `account_url` (for example, the `azure_account_url` value loaded earlier) to the `AzureBlobStorage` constructor. In this example, all data is stored within the `surveydata/odk/formid/` folder, where `formid` is the ODK form ID initialized earlier. This allows data from multiple forms to be safely stored within the same container.

In [7]:
from surveydata import ODKPlatform
from surveydata import AzureBlobStorage

# initialize the survey platform connection
platform = ODKPlatform(config_file=os.path.expanduser("~/.ocl/.pyodk_config.toml"), project_id=project_id, form_id=form_id)

# initialize the Azure Blob Storage connection
storage = AzureBlobStorage(container_name=azure_container_name, blob_name_prefix="surveydata/odk/" + form_id + "/", connection_string=azure_connection_string)

# synchronize data to ensure storage is up-to-date
new_submissions = platform.sync_data(storage)
print(f"Count of new submissions sync'd to storage: {len(new_submissions)}")
print(f"List of new submissions sync'd to storage: {new_submissions}")
print()

# output details about what's present in storage
print(f"Submissions in storage: {storage.list_submissions()}")
print()
print(f"Attachments in storage: {storage.list_attachments()}")
print()

# load all submissions into DataFrame and describe contents
submissions_df = ODKPlatform.get_submissions_df(storage)
print("Submission DataFrame field counts:")
print(submissions_df.count(0))
print()

# save DataFrame for other Python code to access (retains all DataFrame properties)
storage.store_dataframe("__ALL_DATA_DF__", submissions_df)
# example of how to reload the saved DataFrame
submissions_df_from_storage = storage.get_dataframe("__ALL_DATA_DF__")

# save DataFrame as a .csv file for others to access (doesn't retain all DataFrame properties)
storage.store_dataframe_csv("__ALL_DATA_CSV__", submissions_df)
# example of how to reload saved .csv into a DataFrame
submissions_df_from_csv = storage.get_dataframe_csv("__ALL_DATA_CSV__")

Count of new submissions sync'd to storage: 0
List of new submissions sync'd to storage: []

Submissions in storage: ['uuid:0d70b386-59f7-488e-ab43-fb0bb934a7c8', 'uuid:6a384d76-5547-4bf0-8f50-4f5c873c83c3', 'uuid:f4a826a1-db80-48f5-97e3-de840e964d9c', 'uuid:f9374882-7671-46fa-87ec-53b516d2b37a']

Attachments in storage: [{'name': '1666784762837-9_46_21.jpg', 'submission_id': 'uuid:0d70b386-59f7-488e-ab43-fb0bb934a7c8', 'location_string': 'abs:surveydata/odk/all_fields_for_testing/uuid%3A0d70b386-59f7-488e-ab43-fb0bb934a7c8/1666784762837-9_46_21.jpg'}, {'name': '1666784783763-9_46_25.m4a', 'submission_id': 'uuid:0d70b386-59f7-488e-ab43-fb0bb934a7c8', 'location_string': 'abs:surveydata/odk/all_fields_for_testing/uuid%3A0d70b386-59f7-488e-ab43-fb0bb934a7c8/1666784783763-9_46_25.m4a'}, {'name': 'ChrisExample_public-9_46_36.pem', 'submission_id': 'uuid:0d70b386-59f7-488e-ab43-fb0bb934a7c8', 'location_string': 'abs:surveydata/odk/all_fields_for_testing/uuid%3A0d70b386-59f7-488e-ab43-fb0bb93