# Using the surveydata package with SurveyCTO

This workbook shows how to use the `surveydata` Python package to work with [SurveyCTO](https://www.surveycto.com) data. It demonstrates four approaches to storage:

1. Use of local data exported by [SurveyCTO Desktop](https://docs.surveycto.com/05-exporting-and-publishing-data/02-exporting-data-with-surveycto-desktop/01.using-desktop.html)
2. Synchronizing data directly from SurveyCTO into the local file system
3. Synchronizing data directly from SurveyCTO into [AWS S3 storage](https://aws.amazon.com/s3/), including attachments
4. Synchronizing data directly from SurveyCTO into [AWS DynamoDB storage](https://aws.amazon.com/dynamodb/), with attachments saved to AWS S3

## Reading credentials and configuration

This example workbook begins by loading credentials and configuration from an `.ini` file stored in `~/.ocl/surveydata-surveycto-examples.ini`. The `~` in the path refers to the current user's home directory, and the `.ini` file contents are as follows:

    [aws]
    accesskeyid=idhere
    accesskeysecret=secrethere
    s3bucketname=bucketnamehere
    region=regionnamehere
    ddbtablename=tablenamehere

    [surveycto]
    server=servernamehere
    username=emailhere
    password=passwordhere
    formid=formidhere
    privatekey=-----BEGIN RSA PRIVATE KEY-----
     FROM THE SECOND LINE TO THE LAST
     EACH KEY LINE
     MUST BE INDENTED
     BY AT LEAST ONE SPACE
     -----END RSA PRIVATE KEY-----

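Feel free to update the path in `inifile_location` below. You only need to include the properties required by the examples you plan to run, but note that the cell below calls `inifile.get()` for every property, so remove the lines for any properties you leave out.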
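As a minimal sketch of how `configparser` treats a file laid out like the one above, the snippet below parses an inline sample rather than a real file. The sample values and the key lines are placeholders, and the use of `fallback=None` is one possible way to tolerate omitted properties, not something the workbook itself does.

```python
import configparser

# hypothetical sample contents, mirroring the layout described above
SAMPLE_INI = """\
[surveycto]
server=servernamehere
formid=formidhere
privatekey=-----BEGIN RSA PRIVATE KEY-----
 FIRSTLINEOFKEYDATA
 SECONDLINEOFKEYDATA
 -----END RSA PRIVATE KEY-----
"""

parser = configparser.RawConfigParser()
parser.read_string(SAMPLE_INI)

# indented continuation lines are joined into a single multi-line value,
# with the leading indentation stripped from each line
private_key = parser.get("surveycto", "privatekey")
key_lines = private_key.split("\n")

# fallback=None avoids an exception when a section or property is omitted
aws_region = parser.get("aws", "region", fallback=None)
scto_password = parser.get("surveycto", "password", fallback=None)
```

This is why every key line after the first must be indented: without the indentation, the parser would try to read each line as a new property.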

In [20]:
# for convenience, auto-reload modules when they've changed
%load_ext autoreload
%autoreload 2

import configparser
import os

# load credentials and other configuration from a local ini file
inifile_location = os.path.expanduser("~/.ocl/surveydata-surveycto-examples.ini")
inifile = configparser.RawConfigParser()
inifile.read(inifile_location)

# load AWS credentials and configuration
aws_accesskey_id = inifile.get("aws", "accesskeyid")
aws_accesskey_secret = inifile.get("aws", "accesskeysecret")
s3_bucketname = inifile.get("aws", "s3bucketname")
aws_region = inifile.get("aws", "region")
ddb_tablename = inifile.get("aws", "ddbtablename")

# load SurveyCTO credentials and configuration
scto_server = inifile.get("surveycto", "server")
scto_username = inifile.get("surveycto", "username")
scto_password = inifile.get("surveycto", "password")
scto_formid = inifile.get("surveycto", "formid")
scto_private_key = inifile.get("surveycto", "privatekey")



## Loading data from SurveyCTO export

First, we'll take the simplest case: wide-format data exported from [SurveyCTO Desktop](https://docs.surveycto.com/05-exporting-and-publishing-data/02-exporting-data-with-surveycto-desktop/01.using-desktop.html). The `surveydata` package makes it easy to load all submissions into a pandas DataFrame, and to load all [text audits](https://docs.surveycto.com/02-designing-forms/01-core-concepts/03zd.field-types-text-audit.html) into a DataFrame when needed.

This example doesn't use any external services or require any credentials, so it uses nothing loaded from the `.ini` file above. It references a wide-format export file in the location where SurveyCTO Desktop saved it, and it presumes that the `media` subdirectory is also present with all attachments.
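Before pointing the storage class at an export, it can help to confirm that both pieces are where they are expected. The helper below is a hypothetical sketch (it is not part of `surveydata`), and it assumes the `media` folder sits alongside the exported `.csv` file.

```python
import os
import tempfile

def check_export_layout(export_file):
    """Report whether the export file and its sibling media folder exist."""
    media_dir = os.path.join(os.path.dirname(export_file), "media")
    return {
        "export_file": os.path.isfile(export_file),
        "media_dir": os.path.isdir(media_dir),
    }

# demonstrate with a throwaway directory standing in for the export folder
with tempfile.TemporaryDirectory() as export_dir:
    csv_path = os.path.join(export_dir, "example_WIDE.csv")
    with open(csv_path, "w") as f:
        f.write("KEY\n")
    before = check_export_layout(csv_path)   # media folder not yet created
    os.mkdir(os.path.join(export_dir, "media"))
    after = check_export_layout(csv_path)    # media folder now present
```

The `media_dir` result could then inform the value passed as `attachments_available` when initializing the storage object.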

In [28]:
from surveydata.surveyctoplatform import SurveyCTOPlatform
from surveydata.surveyctoexportstorage import SurveyCTOExportStorage

# initialize local storage with wide-format export and attachments_available=True since media subdirectory is present
storage = SurveyCTOExportStorage(export_file=os.path.expanduser("~/Exports/All fields for testing_WIDE.csv"), attachments_available=True)

# output details about what's present in storage
print(f"Submissions in storage: {storage.list_submissions()}")
print()
# note that we can't list attachments in storage since the media directory can mix attachments from multiple forms
#print(f"Attachments in storage: {storage.list_attachments()}")
#print()

# load all submissions into DataFrame and describe contents
submissions_df = SurveyCTOPlatform.get_submissions_df(storage)
print("Submission DataFrame field counts:")
print(submissions_df.count(0))
print()

# load all text audits into DataFrame and describe contents
textaudit_df = SurveyCTOPlatform.get_text_audit_df(storage, location_strings=submissions_df.textaudit)
if textaudit_df is not None:
    print("Text audit DataFrame field counts:")
    print(textaudit_df.count(0))
else:
    print("No text audits found.")

Submissions in storage: ['uuid:e4f56b32-cc64-4af1-abdf-56fd6dc790ce', 'uuid:2839a648-a32c-4ff1-a728-f9a0405794d0', 'uuid:2f07f119-0d10-4d40-a867-e162a5e831a6', 'uuid:8949508b-9bc0-482a-8ff9-6452ffd747e8', 'uuid:f47bec38-3b88-45d1-80e3-b33e49cea41c']

Submission DataFrame field counts:
SubmissionDate               5
starttime                    5
endtime                      5
deviceid                     5
devicephonenum               5
username                     5
device_info                  5
duration                     5
caseid                       5
comments                     5
textaudit                    5
textaudit_full               5
audioaudit                   5
speedviolationscount         2
speedviolationslist          5
pct_conversation             5
mean_light_level             5
mean_movement                5
mean_sound_level             5
mean_sound_pitch             5
light_level                  5
movement_stream              5
sound_level_stream           5
s


## Synchronizing data between SurveyCTO and local file system

In this next case, we'll synchronize data directly between SurveyCTO and the local file system. The synchronization process will be efficient, using a stored cursor to only download and store new or updated data, and it will include both submission data and all attachments. As before, we'll load all data and [text audits](https://docs.surveycto.com/02-designing-forms/01-core-concepts/03zd.field-types-text-audit.html) into DataFrames. We'll also specify that we want submissions with *any* review status (pending, approved, or rejected), as the default sync only includes approved submissions.
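To illustrate the general idea of cursor-based synchronization, here is a small self-contained sketch. It is not the `surveydata` implementation: the in-memory "server", the integer `seq` cursor, and the function names are all hypothetical stand-ins chosen for the illustration.

```python
# in-memory stand-in for a server that stamps each record with a sequence number
SERVER_RECORDS = [
    {"seq": 1, "KEY": "uuid:aaa"},
    {"seq": 2, "KEY": "uuid:bbb"},
    {"seq": 3, "KEY": "uuid:ccc"},
]

def fetch_since(cursor):
    """Return records newer than the cursor, plus the cursor to store next."""
    new_records = [r for r in SERVER_RECORDS if r["seq"] > cursor]
    next_cursor = max((r["seq"] for r in new_records), default=cursor)
    return new_records, next_cursor

def sync(store):
    """Download only what is new since the stored cursor; return the new IDs."""
    records, next_cursor = fetch_since(store.get("cursor", 0))
    for record in records:
        store["submissions"][record["KEY"]] = record
    store["cursor"] = next_cursor  # persist the cursor alongside the data
    return [r["KEY"] for r in records]

demo_store = {"submissions": {}}
first_sync = sync(demo_store)    # everything is new on the first run
second_sync = sync(demo_store)   # nothing new, so nothing is downloaded
SERVER_RECORDS.append({"seq": 4, "KEY": "uuid:ddd"})
third_sync = sync(demo_store)    # only the newly arrived record
```

Because the cursor is kept in the same place as the data, repeated runs stay cheap no matter how many submissions have already been stored.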

The SurveyCTO credentials and form ID are those loaded earlier from the `.ini` file. We recommend creating a dedicated user role for this purpose, one that allows API access plus read-only access to forms and data. The `privatekey` property in the `.ini` file is optional and is needed only when the SurveyCTO form is encrypted.

In this example, data is synchronized to the `~/Files/surveydata/formid/` folder tree, where `~` refers to the current user's home directory and `formid` is the SurveyCTO form ID loaded from the `.ini` file.
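The attachment listing printed by this example shows each submission stored under a folder whose name is the percent-encoded submission ID (for example, `uuid%3A...`). The sketch below reproduces that naming with the standard library; the helper name is hypothetical, and the exact layout inside the package is an assumption inferred from that output.

```python
import os
from urllib.parse import quote

def submission_folder(base_dir, form_id, submission_id):
    """Build a per-submission folder path under base_dir/form_id (assumed layout)."""
    # percent-encode the ID so characters like ":" are safe in a folder name
    encoded_id = quote(submission_id, safe="")
    return os.path.join(os.path.expanduser(base_dir), form_id, encoded_id)

folder = submission_folder(
    "~/Files/surveydata",
    "formidhere",
    "uuid:a45f2d93-af11-43db-842f-e2227f022c6e",
)
```

Keeping the form ID as its own path component is what lets several forms share one base directory without their files colliding.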

In [35]:
from surveydata.surveyctoplatform import SurveyCTOPlatform
from surveydata.filestorage import FileStorage

# initialize the survey platform connection
platform = SurveyCTOPlatform(scto_server, scto_username, scto_password, scto_formid, scto_private_key)

# initialize the local file storage location
storage = FileStorage(os.path.expanduser("~/Files/surveydata/" + scto_formid + "/"))

# synchronize data to ensure storage is up-to-date
new_submissions = platform.sync_data(storage, review_statuses=["pending", "approved", "rejected"])
print(f"Count of new submissions sync'd to storage: {len(new_submissions)}")
print(f"List of new submissions sync'd to storage: {new_submissions}")
print()

# output details about what's present in storage
print(f"Submissions in storage: {storage.list_submissions()}")
print()
print(f"Attachments in storage: {storage.list_attachments()}")
print()

# load all submissions into DataFrame and describe contents
submissions_df = SurveyCTOPlatform.get_submissions_df(storage)
print("Submission DataFrame field counts:")
print(submissions_df.count(0))
print()

# summarize submission review statuses
print("Submission DataFrame review statuses:")
print(submissions_df.review_status.value_counts())
print()

# load all text audits into DataFrame and describe contents
textaudit_df = SurveyCTOPlatform.get_text_audit_df(storage, location_strings=submissions_df.textaudit)
if textaudit_df is not None:
    print("Text audit DataFrame field counts:")
    print(textaudit_df.count(0))
else:
    print("No text audits found.")

Count of new submissions sync'd to storage: 1
List of new submissions sync'd to storage: ['uuid:8d744680-238a-454d-8e83-d168b1da1aaf']

Submissions in storage: ['uuid:a45f2d93-af11-43db-842f-e2227f022c6e', 'uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'uuid:d8880923-8a5d-4c7a-9b66-994a496b2ae8', 'uuid:7fac8029-4b31-49f5-83e1-ef9bf7ac1db0', 'uuid:d5f8b82d-9ef1-41ff-afc2-16eab7b8275d', 'uuid:8d744680-238a-454d-8e83-d168b1da1aaf', 'uuid:66767ff8-919e-44d3-b6db-784091a3de37']

Attachments in storage: [{'name': 'TA_a45f2d93-af11-43db-842f-e2227f022c6e.csv', 'submission_id': 'uuid:a45f2d93-af11-43db-842f-e2227f022c6e', 'location_string': 'file:/Users/crobert/Files/surveydata/all_fields_for_testing_enc/uuid%3Aa45f2d93-af11-43db-842f-e2227f022c6e/TA_a45f2d93-af11-43db-842f-e2227f022c6e.csv'}, {'name': 'TA_66767ff8-919e-44d3-b6db-784091a3de37.csv', 'submission_id': 'uuid:66767ff8-919e-44d3-b6db-784091a3de37', 'location_string': 'file:/Users/crobert/Files/surveydata/all_fields_for_testing_enc/uui

## Synchronizing data between SurveyCTO and S3

In this next case, we'll synchronize data directly between SurveyCTO and S3. The synchronization process will be efficient, using a stored cursor to only download and store new or updated data, and it will include both submission data and all attachments. As before, we'll load all data and [text audits](https://docs.surveycto.com/02-designing-forms/01-core-concepts/03zd.field-types-text-audit.html) into DataFrames.

The SurveyCTO credentials and form ID are those loaded earlier from the `.ini` file. We recommend creating a dedicated user role for this purpose, one that allows API access plus read-only access to forms and data. The `privatekey` property in the `.ini` file is optional and is needed only when the SurveyCTO form is encrypted.

The AWS credentials and S3 bucket name are also those loaded above. We recommend creating a new programmatic-access AWS user with limited access to the appropriate S3 bucket. In this example, all data is stored within the `surveydata/formid/` folder, where `formid` is the SurveyCTO form ID configured in the `.ini` file. This allows data from multiple forms to be safely stored within the same S3 bucket.
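The attachment listing for this example reports location strings of the form `s3:` followed by the key prefix, the percent-encoded submission ID, and the file name. As a rough sketch, assuming that layout holds, the two hypothetical helpers below compose such a key and split a location string back into its parts. Neither is part of `surveydata`.

```python
from urllib.parse import quote, unquote

def attachment_key(key_name_prefix, submission_id, filename):
    """Compose an object key under the form's prefix (assumed layout)."""
    return f"{key_name_prefix}{quote(submission_id, safe='')}/{filename}"

def split_location_string(location_string):
    """Split 'scheme:path' into scheme, submission ID, and file name."""
    scheme, _, path = location_string.partition(":")
    parts = path.split("/")
    # the last two path components are the encoded submission ID and the file
    return scheme, unquote(parts[-2]), parts[-1]

key = attachment_key(
    "surveydata/formidhere/",
    "uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa",
    "1666784762837.jpg",
)
parsed = split_location_string("s3:" + key)
```

Because every key begins with the form-specific prefix, objects belonging to different forms never share a key even when they live in the same bucket.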

In [40]:
from surveydata.surveyctoplatform import SurveyCTOPlatform
from surveydata.s3storage import S3Storage

# initialize the survey platform connection
platform = SurveyCTOPlatform(scto_server, scto_username, scto_password, scto_formid, scto_private_key)

# initialize the S3 storage connection
storage = S3Storage(s3_bucketname, key_name_prefix="surveydata/" + scto_formid + "/", aws_access_key_id=aws_accesskey_id, aws_secret_access_key=aws_accesskey_secret)

# synchronize data to ensure storage is up-to-date
new_submissions = platform.sync_data(storage)
print(f"Count of new submissions sync'd to storage: {len(new_submissions)}")
print(f"List of new submissions sync'd to storage: {new_submissions}")
print()

# output details about what's present in storage
print(f"Submissions in storage: {storage.list_submissions()}")
print()
print(f"Attachments in storage: {storage.list_attachments()}")
print()

# load all submissions into DataFrame and describe contents
submissions_df = SurveyCTOPlatform.get_submissions_df(storage)
print("Submission DataFrame field counts:")
print(submissions_df.count(0))
print()

# load all text audits into DataFrame and describe contents
textaudit_df = SurveyCTOPlatform.get_text_audit_df(storage, location_strings=submissions_df.textaudit)
if textaudit_df is not None:
    print("Text audit DataFrame field counts:")
    print(textaudit_df.count(0))
else:
    print("No text audits found.")

Count of new submissions sync'd to storage: 1
List of new submissions sync'd to storage: ['uuid:f32f8fde-44e1-44fb-9a45-0fc4960b7a77']

Submissions in storage: ['uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'uuid:66767ff8-919e-44d3-b6db-784091a3de37', 'uuid:a45f2d93-af11-43db-842f-e2227f022c6e', 'uuid:d5f8b82d-9ef1-41ff-afc2-16eab7b8275d', 'uuid:f32f8fde-44e1-44fb-9a45-0fc4960b7a77']

Attachments in storage: [{'name': '1666784762837.jpg', 'submission_id': 'uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'location_string': 's3:surveydata/all_fields_for_testing_enc/uuid%3A5e5a40ce-bce2-4225-856e-224f13f3fafa/1666784762837.jpg'}, {'name': '1666784783763.m4a', 'submission_id': 'uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'location_string': 's3:surveydata/all_fields_for_testing_enc/uuid%3A5e5a40ce-bce2-4225-856e-224f13f3fafa/1666784783763.m4a'}, {'name': 'AA_5e5a40ce-bce2-4225-856e-224f13f3fafa_AFTER_0S.m4a', 'submission_id': 'uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'location_string': 's3:su

## Synchronizing data between SurveyCTO and DynamoDB+S3

In this next case, we'll synchronize data directly between SurveyCTO and DynamoDB+S3, storing the submission data in DynamoDB and all attachments in S3. The synchronization process will be efficient, using a stored cursor to only download and store new or updated data, and it will include both submission data and all attachments. As before, we'll load all data and [text audits](https://docs.surveycto.com/02-designing-forms/01-core-concepts/03zd.field-types-text-audit.html) into DataFrames.

The SurveyCTO credentials and form ID are those loaded earlier from the `.ini` file. We recommend creating a dedicated user role for this purpose, one that allows API access plus read-only access to forms and data. The `privatekey` property in the `.ini` file is optional and is needed only when the SurveyCTO form is encrypted.

The AWS credentials, DynamoDB table name, and S3 bucket name will also be as loaded above. We recommend creating a new programmatic-access AWS user with limited access to the appropriate DynamoDB table and S3 bucket. In this example, all data is stored within the `{FormID: formid}` DynamoDB partition, where `formid` is the SurveyCTO form ID configured in the `.ini` file; all attachments are stored within the `surveydata/attachments/formid/` folder in the S3 bucket.
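To make the partitioning idea concrete, here is a plain-Python sketch that mimics a table keyed by a form ID and a submission ID. It uses an in-memory dictionary rather than DynamoDB, and it assumes a composite key made of `FormID` (the partition) and `KEY` (the submission ID), mirroring the names used in this example. The helper functions are hypothetical.

```python
# in-memory stand-in for a table with a composite key of (FormID, KEY)
table = {}

def put_submission(form_id, submission):
    """Store a submission under its form's partition."""
    table[(form_id, submission["KEY"])] = {"FormID": form_id, **submission}

def list_submissions(form_id):
    """List submission IDs stored in one form's partition only."""
    return sorted(key for (partition, key) in table if partition == form_id)

put_submission("form_a", {"KEY": "uuid:111", "duration": "95"})
put_submission("form_a", {"KEY": "uuid:222", "duration": "120"})
put_submission("form_b", {"KEY": "uuid:111", "duration": "60"})

form_a_ids = list_submissions("form_a")
form_b_ids = list_submissions("form_b")
```

Even though both forms contain a submission with the same ID here, the partition value keeps them as separate items, which is what allows one table to hold data for many forms.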

In [42]:
from surveydata.s3storage import S3Storage
from surveydata.dynamodbstorage import DynamoDBStorage
from surveydata.surveyctoplatform import SurveyCTOPlatform

# initialize the survey platform connection
platform = SurveyCTOPlatform(scto_server, scto_username, scto_password, scto_formid, scto_private_key)

# initialize DynamoDB storage for submission data
storage = DynamoDBStorage(aws_region=aws_region, table_name=ddb_tablename, id_field_name="KEY", partition_key_name="FormID", partition_key_value=scto_formid, aws_access_key_id=aws_accesskey_id, aws_secret_access_key=aws_accesskey_secret)
# initialize S3 storage for attachments
file_storage = S3Storage(s3_bucketname, key_name_prefix="surveydata/attachments/" + scto_formid + "/", aws_access_key_id=aws_accesskey_id, aws_secret_access_key=aws_accesskey_secret)

# synchronize data to ensure storage is up-to-date
new_submissions = platform.sync_data(storage=storage, attachment_storage=file_storage)
print(f"Count of new submissions sync'd to storage: {len(new_submissions)}")
print(f"List of new submissions sync'd to storage: {new_submissions}")
print()

# output details about what's present in storage
print(f"Submissions in storage: {storage.list_submissions()}")
print()
print(f"Attachments in storage: {file_storage.list_attachments()}")
print()

# load all submissions into DataFrame and describe contents
submissions_df = SurveyCTOPlatform.get_submissions_df(storage)
print("Submission DataFrame field counts:")
print(submissions_df.count(0))
print()

# load all text audits into DataFrame and describe contents
textaudit_df = SurveyCTOPlatform.get_text_audit_df(file_storage, location_strings=submissions_df.textaudit)
if textaudit_df is not None:
    print("Text audit DataFrame field counts:")
    print(textaudit_df.count(0))
else:
    print("No text audits found.")

Count of new submissions sync'd to storage: 1
List of new submissions sync'd to storage: ['uuid:f32f8fde-44e1-44fb-9a45-0fc4960b7a77']

Submissions in storage: ['uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'uuid:66767ff8-919e-44d3-b6db-784091a3de37', 'uuid:a45f2d93-af11-43db-842f-e2227f022c6e', 'uuid:d5f8b82d-9ef1-41ff-afc2-16eab7b8275d', 'uuid:f32f8fde-44e1-44fb-9a45-0fc4960b7a77']

Attachments in storage: [{'name': '1666784762837.jpg', 'submission_id': 'uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'location_string': 's3:surveydata/attachments/all_fields_for_testing_enc/uuid%3A5e5a40ce-bce2-4225-856e-224f13f3fafa/1666784762837.jpg'}, {'name': '1666784783763.m4a', 'submission_id': 'uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'location_string': 's3:surveydata/attachments/all_fields_for_testing_enc/uuid%3A5e5a40ce-bce2-4225-856e-224f13f3fafa/1666784783763.m4a'}, {'name': 'AA_5e5a40ce-bce2-4225-856e-224f13f3fafa_AFTER_0S.m4a', 'submission_id': 'uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', '