# Using the surveydata package with SurveyCTO

This workbook demonstrates how to use the `surveydata` Python package to work with [SurveyCTO](https://www.surveycto.com) data. It covers six different approaches to storage:

1. Use of local data exported by [SurveyCTO Desktop](https://docs.surveycto.com/05-exporting-and-publishing-data/02-exporting-data-with-surveycto-desktop/01.using-desktop.html)
2. Synchronizing data directly from SurveyCTO into the local file system
3. Synchronizing data directly from SurveyCTO into [AWS S3 storage](https://aws.amazon.com/s3/), including attachments
4. Synchronizing data directly from SurveyCTO into [AWS DynamoDB storage](https://aws.amazon.com/dynamodb/), with attachments saved to AWS S3
5. Synchronizing data directly from SurveyCTO into [Google Cloud Storage](https://cloud.google.com/storage), including attachments
6. Synchronizing data directly from SurveyCTO into [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs/), including attachments

All examples other than the local SurveyCTO Desktop data export include code to save and reload Pandas DataFrames, for cases where data ingestion and processing are separated from actual analysis or use.

Finally, it demonstrates how to update submissions with comments and/or updated review status and quality classification.

## Reading credentials and configuration

This example workbook begins by loading credentials and configuration from an `.ini` file stored in `~/.ocl/surveydata-surveycto-examples.ini`. The `~` in the path refers to the current user's home directory, and the `.ini` file contents are as follows:

    [aws]
    accesskeyid=idhere
    accesskeysecret=secrethere
    s3bucketname=bucketnamehere
    region=regionnamehere
    ddbtablename=tablenamehere

    [google]
    googleprojectid=idhere
    googlebucketname=bucketnamehere

    [azure]
    azureconnectionstring=connectionstringhere
    azurecontainername=oclexamples
    azureaccounturl=https://storageaccountname.blob.core.windows.net

    [surveycto]
    server=servernamehere
    username=emailhere
    password=passwordhere
    formid=formidhere
    privatekey=-----BEGIN RSA PRIVATE KEY-----
     FROM THE SECOND LINE TO THE LAST
     EACH KEY LINE
     MUST BE INDENTED
     BY AT LEAST ONE SPACE
     -----END RSA PRIVATE KEY-----

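Feel free to update the path in `inifile_location` below, and include only the properties needed for the example cases you wish to execute. Note that the loading cell below reads every section with `inifile.get()`, which raises an error for a missing section or property, so remove or comment out the lines for anything you leave out of the `.ini` file.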
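As a quick check of the format above, the following sketch uses only Python's standard `configparser` to show how indented continuation lines are joined into one multi-line value, and how a `fallback` can tolerate properties that are left out. The sample text uses dummy placeholder lines rather than a real key, and the variable names are illustrative.

```python
import configparser

# Placeholder .ini content; the key lines are dummies, not a real private key.
sample_ini = """\
[surveycto]
server=servernamehere
formid=formidhere
privatekey=-----BEGIN RSA PRIVATE KEY-----
 FIRSTLINEOFKEY
 SECONDLINEOFKEY==
 -----END RSA PRIVATE KEY-----
"""

parser = configparser.RawConfigParser()
parser.read_string(sample_ini)

# Indented lines are treated as continuations of the "privatekey" value,
# with the leading whitespace stripped and the lines joined by newlines.
private_key = parser.get("surveycto", "privatekey")
key_lines = private_key.splitlines()
print(len(key_lines))
print(key_lines[0])
print(key_lines[-1])

# A fallback avoids an exception when a property or a whole section is absent.
missing_password = parser.get("surveycto", "password", fallback=None)
missing_bucket = parser.get("aws", "s3bucketname", fallback=None)
print(missing_password, missing_bucket)
```

Without the leading space on each continuation line, `configparser` would try to parse the key lines as separate properties, which is why the indentation matters.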

Finally, for Google Cloud authentication, service account credentials are loaded from `~/.ocl/google-ocl-examples-service-account-credentials.json`.

In [1]:
# for convenience, auto-reload modules when they've changed
%load_ext autoreload
%autoreload 2

import configparser
import os

# load credentials and other configuration from a local ini file
inifile_location = os.path.expanduser("~/.ocl/surveydata-surveycto-examples.ini")
inifile = configparser.RawConfigParser()
inifile.read(inifile_location)

# load AWS credentials and configuration
aws_accesskey_id = inifile.get("aws", "accesskeyid")
aws_accesskey_secret = inifile.get("aws", "accesskeysecret")
s3_bucketname = inifile.get("aws", "s3bucketname")
aws_region = inifile.get("aws", "region")
ddb_tablename = inifile.get("aws", "ddbtablename")

# load Google Cloud Storage credentials and configuration
from google.oauth2 import service_account
google_project_id = inifile.get("google", "googleprojectid")
google_bucket_name = inifile.get("google", "googlebucketname")

google_credentials = service_account.Credentials.from_service_account_file(os.path.expanduser("~/.ocl/google-ocl-examples-service-account-credentials.json"))

# load Azure Blob Storage credentials and configuration
azure_connection_string = inifile.get("azure", "azureconnectionstring")
azure_container_name = inifile.get("azure", "azurecontainername")
azure_account_url = inifile.get("azure", "azureaccounturl")

# load SurveyCTO credentials and configuration
scto_server = inifile.get("surveycto", "server")
scto_username = inifile.get("surveycto", "username")
scto_password = inifile.get("surveycto", "password")
scto_formid = inifile.get("surveycto", "formid")
scto_private_key = inifile.get("surveycto", "privatekey")

## Loading data from SurveyCTO export

First, we'll take the simplest case: wide-format data exported from [SurveyCTO Desktop](https://docs.surveycto.com/05-exporting-and-publishing-data/02-exporting-data-with-surveycto-desktop/01.using-desktop.html). The `surveydata` package makes it easy to load all submissions into a Pandas DataFrame — and also load all [text audits](https://docs.surveycto.com/02-designing-forms/01-core-concepts/03zd.field-types-text-audit.html) into a DataFrame when needed.

This example doesn't utilize any external services or require any credentials, so it doesn't use anything loaded from the `.ini` file above. It references a wide-format export file in the location exported by SurveyCTO Desktop, and it presumes that the `media` subdirectory is also present with all attachments.
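Because this example presumes a particular on-disk layout, it can help to verify that layout before loading. The sketch below uses only the standard library. The helper names `check_export_layout` and `resolve_attachment` and the demo file names are hypothetical (they are not part of `surveydata`), and the sketch assumes that attachment paths in the export are relative to the folder holding the `.csv` file.

```python
import tempfile
from pathlib import Path


def check_export_layout(export_file):
    """Return (export_exists, media_dir_exists) for a wide-format export path."""
    export_path = Path(export_file).expanduser()
    media_dir = export_path.parent / "media"
    return export_path.is_file(), media_dir.is_dir()


def resolve_attachment(export_file, relative_path):
    """Resolve an attachment path relative to the export file's folder."""
    # Accept either "/" or "\" as the separator (an assumption, in case the
    # export was produced on Windows).
    parts = relative_path.replace("\\", "/").split("/")
    return Path(export_file).expanduser().parent.joinpath(*parts)


# Demonstration against a temporary folder that mimics the expected layout.
with tempfile.TemporaryDirectory() as tmp:
    export = Path(tmp) / "example_WIDE.csv"
    export.write_text("KEY,textaudit\n")
    (Path(tmp) / "media").mkdir()

    layout = check_export_layout(export)
    resolved = resolve_attachment(export, "media/TA_example.csv")
    print(layout)
    print(resolved.parent.name, resolved.name)
```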

In [2]:
from surveydata import SurveyCTOPlatform
from surveydata import SurveyCTOExportStorage
import pandas as pd
from pytz import timezone

# initialize local storage with wide-format export and attachments_available=True since media subdirectory is present
storage = SurveyCTOExportStorage(export_file=os.path.expanduser("~/Exports/All fields for testing_WIDE.csv"), attachments_available=True)

# output details about what's present in storage
print(f"Submissions in storage: {storage.list_submissions()}")
print()
# note that we can't list attachments in storage since the media directory can mix attachments from multiple forms
#print(f"Attachments in storage: {storage.list_attachments()}")
#print()

# load all submissions into DataFrame and describe contents
submissions_df = SurveyCTOPlatform.get_submissions_df(storage)
print("Submission DataFrame field counts:")
print(submissions_df.count(0))
print()

# load all text audits into DataFrame and describe contents
# (here, the "textaudit" column includes the path to each text audit file, relative to the .csv export path)
textaudit_df = SurveyCTOPlatform.get_text_audit_df(storage, location_strings=submissions_df.textaudit)
if textaudit_df is not None:
    print("Text audit DataFrame field counts:")
    print(textaudit_df.count(0))
    print()

    # summarize text audit data for analysis
    # (include timezone since exports don't include that information; use pytz.all_timezones to show list of available timezones)
    ta_summary = SurveyCTOPlatform.process_text_audits(textaudit_df, submissions_df["starttime"], submissions_df["endtime"], timezone("US/Eastern").zone, timezone("US/Eastern").zone)

    # merge text audit summaries with submission data
    all_data = pd.concat([submissions_df, ta_summary], axis='columns', join='outer', verify_integrity=True)

    # print summary of combined DataFrame
    print("Combined DataFrame field counts:")
    print(all_data.count(0))
else:
    print("No text audits found.")

Submissions in storage: ['uuid:e4f56b32-cc64-4af1-abdf-56fd6dc790ce', 'uuid:2839a648-a32c-4ff1-a728-f9a0405794d0', 'uuid:2f07f119-0d10-4d40-a867-e162a5e831a6', 'uuid:8949508b-9bc0-482a-8ff9-6452ffd747e8', 'uuid:f47bec38-3b88-45d1-80e3-b33e49cea41c']

Submission DataFrame field counts:
SubmissionDate               5
starttime                    5
endtime                      5
deviceid                     5
devicephonenum               5
username                     5
device_info                  5
duration                     5
caseid                       5
comments                     5
textaudit                    5
textaudit_full               5
audioaudit                   5
speedviolationscount         2
speedviolationslist          5
pct_conversation             5
mean_light_level             5
mean_movement                5
mean_sound_level             5
mean_sound_pitch             5
light_level                  5
movement_stream              5
sound_level_stream           5

## Synchronizing data between SurveyCTO and local file system

In this next case, we'll synchronize data directly between SurveyCTO and the local file system. The synchronization process will be efficient, using a stored cursor to only download and store new or updated data, and it will include both submission data and all attachments. As before, we'll load all data and [text audits](https://docs.surveycto.com/02-designing-forms/01-core-concepts/03zd.field-types-text-audit.html) into DataFrames, and here we'll show how to save and reload those DataFrames in case your processing and analysis jobs are separated. We'll also specify that we want submissions with *any* review status (pending, approved, or rejected), as the default sync only includes approved submissions.

The SurveyCTO credentials and form ID will be those loaded earlier, from the `.ini` file. We recommend creating a new user role for API access, which allows API access as well as read-only access to forms and data. The `privatekey` property in the `.ini` file is optional, to be used when the SurveyCTO form is encrypted.

In this example, data is synchronized to the `~/Files/surveydata/formid/` folder tree, where `~` refers to the current user's home directory and `formid` is the SurveyCTO form ID loaded from the `.ini` file.
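To make the idea of a stored cursor concrete, here is a purely conceptual sketch with in-memory stand-ins for the server and the storage location. The classes `DemoServer` and `DemoStorage` and the function `demo_sync` are hypothetical and say nothing about how `surveydata` or the SurveyCTO API implement synchronization; they only illustrate why remembering a cursor means each sync downloads just what is new.

```python
class DemoServer:
    """Hypothetical server holding (sequence_number, submission_id) pairs."""

    def __init__(self):
        self.records = []

    def add(self, submission_id):
        self.records.append((len(self.records) + 1, submission_id))

    def fetch_since(self, cursor):
        """Return IDs newer than the cursor, plus the updated cursor value."""
        new = [record for record in self.records if record[0] > cursor]
        new_cursor = new[-1][0] if new else cursor
        return [submission_id for _, submission_id in new], new_cursor


class DemoStorage:
    """Hypothetical storage that keeps submissions and the last cursor."""

    def __init__(self):
        self.submissions = []
        self.cursor = 0


def demo_sync(server, storage):
    """Download only what changed since the stored cursor; return new IDs."""
    new_ids, storage.cursor = server.fetch_since(storage.cursor)
    storage.submissions.extend(new_ids)
    return new_ids


demo_server, demo_storage = DemoServer(), DemoStorage()
demo_server.add("uuid:aaa")
demo_server.add("uuid:bbb")
first = demo_sync(demo_server, demo_storage)    # both submissions are new
second = demo_sync(demo_server, demo_storage)   # nothing new to download
demo_server.add("uuid:ccc")
third = demo_sync(demo_server, demo_storage)    # only the latest submission
print(first, second, third)
```

This matches the behavior visible in the example output, where re-running a sync against up-to-date storage reports zero new submissions.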

In [3]:
from surveydata import SurveyCTOPlatform
from surveydata import FileStorage
import pandas as pd
from pytz import timezone

# initialize the survey platform connection
platform = SurveyCTOPlatform(scto_server, scto_username, scto_password, scto_formid, scto_private_key)

# initialize the local file storage location
storage = FileStorage(os.path.expanduser("~/Files/surveydata/" + scto_formid + "/"))

# synchronize data to ensure storage is up-to-date
new_submissions = platform.sync_data(storage, review_statuses=["pending", "approved", "rejected"])
print(f"Count of new submissions sync'd to storage: {len(new_submissions)}")
print(f"List of new submissions sync'd to storage: {new_submissions}")
print()

# output details about what's present in storage
print(f"Submissions in storage: {storage.list_submissions()}")
print()
print(f"Attachments in storage: {storage.list_attachments()}")
print()

# load all submissions into DataFrame and describe contents
submissions_df = SurveyCTOPlatform.get_submissions_df(storage)
print("Submission DataFrame field counts:")
print(submissions_df.count(0))
print()

# summarize submission review statuses
print("Submission DataFrame review statuses:")
print(submissions_df.review_status.value_counts())
print()

# load all text audits into DataFrame and describe contents
textaudit_df = SurveyCTOPlatform.get_text_audit_df(storage, location_strings=submissions_df.textaudit)
if textaudit_df is not None:
    print("Text audit DataFrame field counts:")
    print(textaudit_df.count(0))
    print()

    # summarize text audit data for analysis
    ta_summary = SurveyCTOPlatform.process_text_audits(textaudit_df, submissions_df["starttime"], submissions_df["endtime"], storage.get_data_timezone(), timezone("US/Eastern").zone)

    # merge text audit summaries with submission data
    all_data = pd.concat([submissions_df, ta_summary], axis='columns', join='outer', verify_integrity=True)

    # print summary of combined DataFrame
    print("Combined DataFrame field counts:")
    print(all_data.count(0))

    # save combined DataFrame for other Python code to access (retains all DataFrame properties)
    storage.store_dataframe("__ALL_DATA_DF__", all_data)
    # example of how to reload the saved DataFrame
    all_data_from_storage = storage.get_dataframe("__ALL_DATA_DF__")

    # save combined DataFrame as a .csv file for others to access (doesn't retain all DataFrame properties)
    storage.store_dataframe_csv("__ALL_DATA_CSV__", all_data)
    # example of how to reload saved .csv into a DataFrame
    all_data_from_csv = storage.get_dataframe_csv("__ALL_DATA_CSV__")
else:
    print("No text audits found.")

Count of new submissions sync'd to storage: 0
List of new submissions sync'd to storage: []

Submissions in storage: ['uuid:a45f2d93-af11-43db-842f-e2227f022c6e', 'uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'uuid:d8880923-8a5d-4c7a-9b66-994a496b2ae8', 'uuid:7fac8029-4b31-49f5-83e1-ef9bf7ac1db0', 'uuid:d5f8b82d-9ef1-41ff-afc2-16eab7b8275d', 'uuid:8d744680-238a-454d-8e83-d168b1da1aaf', 'uuid:f32f8fde-44e1-44fb-9a45-0fc4960b7a77', 'uuid:66767ff8-919e-44d3-b6db-784091a3de37']

Attachments in storage: [{'name': 'AA_f32f8fde-44e1-44fb-9a45-0fc4960b7a77_AFTER_0S.m4a', 'submission_id': 'uuid:f32f8fde-44e1-44fb-9a45-0fc4960b7a77', 'location_string': 'file:/Users/crobert/Files/surveydata/all_fields_for_testing_enc/uuid%3Af32f8fde-44e1-44fb-9a45-0fc4960b7a77/AA_f32f8fde-44e1-44fb-9a45-0fc4960b7a77_AFTER_0S.m4a'}, {'name': 'TA_f32f8fde-44e1-44fb-9a45-0fc4960b7a77.csv', 'submission_id': 'uuid:f32f8fde-44e1-44fb-9a45-0fc4960b7a77', 'location_string': 'file:/Users/crobert/Files/surveydata/all_fields

## Synchronizing data between SurveyCTO and S3

In this next case, we'll synchronize data directly between SurveyCTO and S3. The synchronization process will be efficient, using a stored cursor to only download and store new or updated data, and it will include both submission data and all attachments. As before, we'll load all data and [text audits](https://docs.surveycto.com/02-designing-forms/01-core-concepts/03zd.field-types-text-audit.html) into DataFrames, and we'll show how to save and reload those DataFrames in case your processing and analysis jobs are separated.

The SurveyCTO credentials and form ID will be those loaded earlier, from the `.ini` file. We recommend creating a new user role for API access, which allows API access as well as read-only access to forms and data. The `privatekey` property in the `.ini` file is optional, to be used when the SurveyCTO form is encrypted.

The AWS credentials and S3 bucket name will also be as loaded earlier. We recommend creating a new programmatic-access AWS user with limited access to the appropriate S3 bucket. In this example, all data is stored within the `surveydata/formid/` folder, where `formid` is the SurveyCTO form ID configured in the `.ini` file. This allows data from multiple forms to be safely stored within the same S3 bucket.
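The per-form prefix shows up in the attachment location strings printed by this example, where the submission ID appears percent-encoded (`uuid%3A...`). The sketch below mirrors that visible layout using only the standard library. The helpers `build_key` and `split_location_string` are hypothetical and are not part of `surveydata`, whose actual key construction may differ.

```python
from urllib.parse import quote, unquote


def build_key(prefix, submission_id, attachment_name):
    """Compose a storage key under a per-form prefix (illustrative layout)."""
    # Percent-encode the submission ID so that ":" becomes "%3A".
    return prefix + quote(submission_id, safe="") + "/" + attachment_name


def split_location_string(location_string):
    """Split 'scheme:path' into (scheme, list of decoded path segments)."""
    scheme, _, path = location_string.partition(":")
    return scheme, [unquote(segment) for segment in path.split("/")]


form_id = "all_fields_for_testing_enc"
key = build_key(
    "surveydata/" + form_id + "/",
    "uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa",
    "1666784762837.jpg",
)
print(key)

scheme, segments = split_location_string("s3:" + key)
print(scheme)
print(segments)
```

Because every key begins with the form-specific prefix, keys for two different forms cannot collide even when they share one bucket.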

In [4]:
from surveydata import SurveyCTOPlatform
from surveydata import S3Storage
import pandas as pd
from pytz import timezone

# initialize the survey platform connection
platform = SurveyCTOPlatform(scto_server, scto_username, scto_password, scto_formid, scto_private_key)

# initialize the S3 storage connection
storage = S3Storage(s3_bucketname, key_name_prefix="surveydata/" + scto_formid + "/", aws_access_key_id=aws_accesskey_id, aws_secret_access_key=aws_accesskey_secret)

# synchronize data to ensure storage is up-to-date
new_submissions = platform.sync_data(storage)
print(f"Count of new submissions sync'd to storage: {len(new_submissions)}")
print(f"List of new submissions sync'd to storage: {new_submissions}")
print()

# output details about what's present in storage
print(f"Submissions in storage: {storage.list_submissions()}")
print()
print(f"Attachments in storage: {storage.list_attachments()}")
print()

# load all submissions into DataFrame and describe contents
submissions_df = SurveyCTOPlatform.get_submissions_df(storage)
print("Submission DataFrame field counts:")
print(submissions_df.count(0))
print()

# load all text audits into DataFrame and describe contents
textaudit_df = SurveyCTOPlatform.get_text_audit_df(storage, location_strings=submissions_df.textaudit)
if textaudit_df is not None:
    print("Text audit DataFrame field counts:")
    print(textaudit_df.count(0))
    print()

    # summarize text audit data for analysis
    ta_summary = SurveyCTOPlatform.process_text_audits(textaudit_df, submissions_df["starttime"], submissions_df["endtime"], storage.get_data_timezone(), timezone("US/Eastern").zone)

    # merge text audit summaries with submission data
    all_data = pd.concat([submissions_df, ta_summary], axis='columns', join='outer', verify_integrity=True)

    # print summary of combined DataFrame
    print("Combined DataFrame field counts:")
    print(all_data.count(0))

    # save combined DataFrame for other Python code to access (retains all DataFrame properties)
    storage.store_dataframe("__ALL_DATA_DF__", all_data)
    # example of how to reload the saved DataFrame
    all_data_from_storage = storage.get_dataframe("__ALL_DATA_DF__")

    # save combined DataFrame as a .csv file for others to access (doesn't retain all DataFrame properties)
    storage.store_dataframe_csv("__ALL_DATA_CSV__", all_data)
    # example of how to reload saved .csv into a DataFrame
    all_data_from_csv = storage.get_dataframe_csv("__ALL_DATA_CSV__")
else:
    print("No text audits found.")

Count of new submissions sync'd to storage: 0
List of new submissions sync'd to storage: []

Submissions in storage: ['uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'uuid:66767ff8-919e-44d3-b6db-784091a3de37', 'uuid:7fac8029-4b31-49f5-83e1-ef9bf7ac1db0', 'uuid:a45f2d93-af11-43db-842f-e2227f022c6e', 'uuid:d5f8b82d-9ef1-41ff-afc2-16eab7b8275d', 'uuid:f32f8fde-44e1-44fb-9a45-0fc4960b7a77']

Attachments in storage: [{'name': '1666784762837.jpg', 'submission_id': 'uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'location_string': 's3:surveydata/all_fields_for_testing_enc/uuid%3A5e5a40ce-bce2-4225-856e-224f13f3fafa/1666784762837.jpg'}, {'name': '1666784783763.m4a', 'submission_id': 'uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'location_string': 's3:surveydata/all_fields_for_testing_enc/uuid%3A5e5a40ce-bce2-4225-856e-224f13f3fafa/1666784783763.m4a'}, {'name': 'AA_5e5a40ce-bce2-4225-856e-224f13f3fafa_AFTER_0S.m4a', 'submission_id': 'uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'location_string': 's3:

## Synchronizing data between SurveyCTO and DynamoDB+S3

In this next case, we'll synchronize data directly between SurveyCTO and DynamoDB+S3, storing the submission data in DynamoDB and all attachments in S3. The synchronization process will be efficient, using a stored cursor to only download and store new or updated data, and it will include both submission data and all attachments. As before, we'll load all data and [text audits](https://docs.surveycto.com/02-designing-forms/01-core-concepts/03zd.field-types-text-audit.html) into DataFrames.

The SurveyCTO credentials and form ID will be those loaded earlier, from the `.ini` file. We recommend creating a new user role for API access, which allows API access as well as read-only access to forms and data. The `privatekey` property in the `.ini` file is optional, to be used when the SurveyCTO form is encrypted.

The AWS credentials, DynamoDB table name, and S3 bucket name will also be as loaded earlier. We recommend creating a new programmatic-access AWS user with limited access to the appropriate DynamoDB table and S3 bucket. In this example, all data is stored within the `{FormID: formid}` DynamoDB partition, where `formid` is the SurveyCTO form ID configured in the `.ini` file; all attachments are stored within the `surveydata/attachments/formid/` folder in the S3 bucket.
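As a rough mental model of the partitioning described above, the following in-memory sketch filters items by a partition key and returns their ID field. The field names `FormID` and `KEY` match the constructor arguments used in this example, but the item layout, the sample values, and the helper `submissions_in_partition` are assumptions for illustration only; this is not how `surveydata` or DynamoDB is actually queried.

```python
# In-memory stand-in for a table that holds submissions from two forms.
demo_table = [
    {"FormID": "form_a", "KEY": "uuid:111", "caseid": "A-1"},
    {"FormID": "form_a", "KEY": "uuid:222", "caseid": "A-2"},
    {"FormID": "form_b", "KEY": "uuid:333", "caseid": "B-1"},
]


def submissions_in_partition(items, partition_key_name, partition_key_value, id_field_name):
    """Return the IDs of all items that belong to one partition."""
    return [
        item[id_field_name]
        for item in items
        if item[partition_key_name] == partition_key_value
    ]


form_a_ids = submissions_in_partition(demo_table, "FormID", "form_a", "KEY")
form_b_ids = submissions_in_partition(demo_table, "FormID", "form_b", "KEY")
print(form_a_ids)
print(form_b_ids)
```

Keeping each form in its own partition plays the same role as the per-form folder prefix in the S3 example: several forms can share one table without mixing their submissions.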

In [5]:
from surveydata import S3Storage
from surveydata import DynamoDBStorage
from surveydata import SurveyCTOPlatform
import pandas as pd
from pytz import timezone

# initialize the survey platform connection
platform = SurveyCTOPlatform(scto_server, scto_username, scto_password, scto_formid, scto_private_key)

# initialize DynamoDB storage for submission data
storage = DynamoDBStorage(aws_region=aws_region, table_name=ddb_tablename, id_field_name="KEY", partition_key_name="FormID", partition_key_value=scto_formid, aws_access_key_id=aws_accesskey_id, aws_secret_access_key=aws_accesskey_secret)
# initialize S3 storage for attachments
file_storage = S3Storage(s3_bucketname, key_name_prefix="surveydata/attachments/" + scto_formid + "/", aws_access_key_id=aws_accesskey_id, aws_secret_access_key=aws_accesskey_secret)

# synchronize data to ensure storage is up-to-date
new_submissions = platform.sync_data(storage=storage, attachment_storage=file_storage)
print(f"Count of new submissions sync'd to storage: {len(new_submissions)}")
print(f"List of new submissions sync'd to storage: {new_submissions}")
print()

# output details about what's present in storage
print(f"Submissions in storage: {storage.list_submissions()}")
print()
print(f"Attachments in storage: {file_storage.list_attachments()}")
print()

# load all submissions into DataFrame and describe contents
submissions_df = SurveyCTOPlatform.get_submissions_df(storage)
print("Submission DataFrame field counts:")
print(submissions_df.count(0))
print()

# load all text audits into DataFrame and describe contents
textaudit_df = SurveyCTOPlatform.get_text_audit_df(file_storage, location_strings=submissions_df.textaudit)
if textaudit_df is not None:
    print("Text audit DataFrame field counts:")
    print(textaudit_df.count(0))
    print()

    # summarize text audit data for analysis
    ta_summary = SurveyCTOPlatform.process_text_audits(textaudit_df, submissions_df["starttime"], submissions_df["endtime"], storage.get_data_timezone(), timezone("US/Eastern").zone)

    # merge text audit summaries with submission data
    all_data = pd.concat([submissions_df, ta_summary], axis='columns', join='outer', verify_integrity=True)

    # print summary of combined DataFrame
    print("Combined DataFrame field counts:")
    print(all_data.count(0))

    # save combined DataFrame for other Python code to access (retains all DataFrame properties)
    file_storage.store_dataframe("__ALL_DATA_DF__", all_data)
    # example of how to reload the saved DataFrame
    all_data_from_storage = file_storage.get_dataframe("__ALL_DATA_DF__")

    # save combined DataFrame as a .csv file for others to access (doesn't retain all DataFrame properties)
    file_storage.store_dataframe_csv("__ALL_DATA_CSV__", all_data)
    # example of how to reload saved .csv into a DataFrame
    all_data_from_csv = file_storage.get_dataframe_csv("__ALL_DATA_CSV__")
else:
    print("No text audits found.")

Count of new submissions sync'd to storage: 0
List of new submissions sync'd to storage: []

Submissions in storage: ['uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'uuid:66767ff8-919e-44d3-b6db-784091a3de37', 'uuid:7fac8029-4b31-49f5-83e1-ef9bf7ac1db0', 'uuid:a45f2d93-af11-43db-842f-e2227f022c6e', 'uuid:d5f8b82d-9ef1-41ff-afc2-16eab7b8275d', 'uuid:f32f8fde-44e1-44fb-9a45-0fc4960b7a77']

Attachments in storage: [{'name': '1666784762837.jpg', 'submission_id': 'uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'location_string': 's3:surveydata/attachments/all_fields_for_testing_enc/uuid%3A5e5a40ce-bce2-4225-856e-224f13f3fafa/1666784762837.jpg'}, {'name': '1666784783763.m4a', 'submission_id': 'uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'location_string': 's3:surveydata/attachments/all_fields_for_testing_enc/uuid%3A5e5a40ce-bce2-4225-856e-224f13f3fafa/1666784783763.m4a'}, {'name': 'AA_5e5a40ce-bce2-4225-856e-224f13f3fafa_AFTER_0S.m4a', 'submission_id': 'uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa',

## Synchronizing data between SurveyCTO and Google Cloud Storage

In this next case, we'll synchronize data directly between SurveyCTO and Google Cloud Storage. The synchronization process will be efficient, using a stored cursor to only download and store new or updated data, and it will include both submission data and all attachments. As before, we'll load all data and [text audits](https://docs.surveycto.com/02-designing-forms/01-core-concepts/03zd.field-types-text-audit.html) into DataFrames, and we'll show how to save and reload those DataFrames in case your processing and analysis jobs are separated.

The SurveyCTO credentials and form ID will be those loaded earlier, from the `.ini` file. We recommend creating a new user role for API access, which allows API access as well as read-only access to forms and data. The `privatekey` property in the `.ini` file is optional, to be used when the SurveyCTO form is encrypted.

The Google project ID, bucket name, and credentials will also be as loaded earlier. We recommend creating a new service account with limited access to the appropriate bucket. In this example, all data is stored within the `surveydata/formid/` folder, where `formid` is the SurveyCTO form ID configured in the `.ini` file. This allows data from multiple forms to be safely stored within the same bucket.
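The code comments in these examples note that saving a DataFrame as `.csv` doesn't retain all DataFrame properties. The standard-library analogy below shows the general reason: a binary round trip preserves Python types, while a CSV round trip turns every value into text. It is only an analogy with a made-up record; it does not describe how `store_dataframe` or `store_dataframe_csv` serialize data internally.

```python
import csv
import io
import pickle
from datetime import datetime

# A made-up record with an integer and a datetime value.
record = {
    "KEY": "uuid:example",
    "duration": 125,
    "starttime": datetime(2022, 10, 26, 11, 45),
}

# Binary serialization keeps the original Python types.
from_pickle = pickle.loads(pickle.dumps(record))

# A CSV round trip writes str(value) and reads every field back as text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

print(type(from_pickle["duration"]).__name__, type(from_csv["duration"]).__name__)
print(type(from_pickle["starttime"]).__name__, type(from_csv["starttime"]).__name__)
```

In practice this means the `.csv` copy suits sharing with other tools, while the non-CSV copy is the better choice when other Python code needs the data exactly as it was.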

In [6]:
from surveydata import SurveyCTOPlatform
from surveydata import GoogleCloudStorage
import pandas as pd
from pytz import timezone

# initialize the survey platform connection
platform = SurveyCTOPlatform(scto_server, scto_username, scto_password, scto_formid, scto_private_key)

# initialize the Google Cloud Storage connection
storage = GoogleCloudStorage(project_id=google_project_id, bucket_name=google_bucket_name, blob_name_prefix="surveydata/" + scto_formid + "/", credentials=google_credentials)

# synchronize data to ensure storage is up-to-date
new_submissions = platform.sync_data(storage)
print(f"Count of new submissions sync'd to storage: {len(new_submissions)}")
print(f"List of new submissions sync'd to storage: {new_submissions}")
print()

# output details about what's present in storage
print(f"Submissions in storage: {storage.list_submissions()}")
print()
print(f"Attachments in storage: {storage.list_attachments()}")
print()

# load all submissions into DataFrame and describe contents
submissions_df = SurveyCTOPlatform.get_submissions_df(storage)
print("Submission DataFrame field counts:")
print(submissions_df.count(0))
print()

# load all text audits into DataFrame and describe contents
textaudit_df = SurveyCTOPlatform.get_text_audit_df(storage, location_strings=submissions_df.textaudit)
if textaudit_df is not None:
    print("Text audit DataFrame field counts:")
    print(textaudit_df.count(0))
    print()

    # summarize text audit data for analysis
    ta_summary = SurveyCTOPlatform.process_text_audits(textaudit_df, submissions_df["starttime"], submissions_df["endtime"], storage.get_data_timezone(), timezone("US/Eastern").zone)

    # merge text audit summaries with submission data
    all_data = pd.concat([submissions_df, ta_summary], axis='columns', join='outer', verify_integrity=True)

    # print summary of combined DataFrame
    print("Combined DataFrame field counts:")
    print(all_data.count(0))

    # save combined DataFrame for other Python code to access (retains all DataFrame properties)
    storage.store_dataframe("__ALL_DATA_DF__", all_data)
    # example of how to reload the saved DataFrame
    all_data_from_storage = storage.get_dataframe("__ALL_DATA_DF__")

    # save combined DataFrame as a .csv file for others to access (doesn't retain all DataFrame properties)
    storage.store_dataframe_csv("__ALL_DATA_CSV__", all_data)
    # example of how to reload saved .csv into a DataFrame
    all_data_from_csv = storage.get_dataframe_csv("__ALL_DATA_CSV__")
else:
    print("No text audits found.")

Count of new submissions sync'd to storage: 0
List of new submissions sync'd to storage: []

Submissions in storage: ['uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'uuid:66767ff8-919e-44d3-b6db-784091a3de37', 'uuid:7fac8029-4b31-49f5-83e1-ef9bf7ac1db0', 'uuid:d5f8b82d-9ef1-41ff-afc2-16eab7b8275d', 'uuid:f32f8fde-44e1-44fb-9a45-0fc4960b7a77']

Attachments in storage: [{'name': '1666784762837.jpg', 'submission_id': 'uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'location_string': 'gs:surveydata/all_fields_for_testing_enc/uuid%3A5e5a40ce-bce2-4225-856e-224f13f3fafa/1666784762837.jpg'}, {'name': '1666784783763.m4a', 'submission_id': 'uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'location_string': 'gs:surveydata/all_fields_for_testing_enc/uuid%3A5e5a40ce-bce2-4225-856e-224f13f3fafa/1666784783763.m4a'}, {'name': 'AA_5e5a40ce-bce2-4225-856e-224f13f3fafa_AFTER_0S.m4a', 'submission_id': 'uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'location_string': 'gs:surveydata/all_fields_for_testing_enc/uuid%3A

## Synchronizing data between SurveyCTO and Azure Blob Storage

In this next case, we'll synchronize data directly between SurveyCTO and Azure Blob Storage. The synchronization process will be efficient, using a stored cursor to only download and store new or updated data, and it will include both submission data and all attachments. As before, we'll load all data and [text audits](https://docs.surveycto.com/02-designing-forms/01-core-concepts/03zd.field-types-text-audit.html) into DataFrames, and we'll show how to save and reload those DataFrames in case your processing and analysis jobs are separated.

The SurveyCTO credentials and form ID will be those loaded earlier, from the `.ini` file. We recommend creating a new user role for API access, which allows API access as well as read-only access to forms and data. The `privatekey` property in the `.ini` file is optional, to be used when the SurveyCTO form is encrypted.

The Azure connection string and container name will also be as loaded earlier. While this example uses a connection string to authenticate, passwordless authentication is also supported if you log in first and then pass `account_url` to the `AzureBlobStorage` constructor. In this example, all data is stored within the `surveydata/formid/` folder, where `formid` is the SurveyCTO form ID configured in the `.ini` file. This allows data from multiple forms to be safely stored within the same container.

In [7]:
from surveydata import SurveyCTOPlatform
from surveydata import AzureBlobStorage
import pandas as pd
from pytz import timezone

# initialize the survey platform connection
platform = SurveyCTOPlatform(scto_server, scto_username, scto_password, scto_formid, scto_private_key)

# initialize the Azure Blob Storage connection
storage = AzureBlobStorage(container_name=azure_container_name, blob_name_prefix="surveydata/" + scto_formid + "/", connection_string=azure_connection_string)

# synchronize data to ensure storage is up-to-date
new_submissions = platform.sync_data(storage)
print(f"Count of new submissions sync'd to storage: {len(new_submissions)}")
print(f"List of new submissions sync'd to storage: {new_submissions}")
print()

# output details about what's present in storage
print(f"Submissions in storage: {storage.list_submissions()}")
print()
print(f"Attachments in storage: {storage.list_attachments()}")
print()

# load all submissions into DataFrame and describe contents
submissions_df = SurveyCTOPlatform.get_submissions_df(storage)
print("Submission DataFrame field counts:")
print(submissions_df.count(0))
print()

# load all text audits into DataFrame and describe contents
textaudit_df = SurveyCTOPlatform.get_text_audit_df(storage, location_strings=submissions_df.textaudit)
if textaudit_df is not None:
    print("Text audit DataFrame field counts:")
    print(textaudit_df.count(0))
    print()

    # summarize text audit data for analysis
    ta_summary = SurveyCTOPlatform.process_text_audits(textaudit_df, submissions_df["starttime"], submissions_df["endtime"], storage.get_data_timezone(), timezone("US/Eastern").zone)

    # merge text audit summaries with submission data
    all_data = pd.concat([submissions_df, ta_summary], axis='columns', join='outer', verify_integrity=True)

    # print summary of combined DataFrame
    print("Combined DataFrame field counts:")
    print(all_data.count(0))

    # save combined DataFrame for other Python code to access (retains all DataFrame properties)
    storage.store_dataframe("__ALL_DATA_DF__", all_data)
    # example of how to reload the saved DataFrame
    all_data_from_storage = storage.get_dataframe("__ALL_DATA_DF__")

    # save combined DataFrame as a .csv file for others to access (doesn't retain all DataFrame properties)
    storage.store_dataframe_csv("__ALL_DATA_CSV__", all_data)
    # example of how to reload saved .csv into a DataFrame
    all_data_from_csv = storage.get_dataframe_csv("__ALL_DATA_CSV__")
else:
    print("No text audits found.")

Count of new submissions sync'd to storage: 0
List of new submissions sync'd to storage: []

Submissions in storage: ['uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'uuid:66767ff8-919e-44d3-b6db-784091a3de37', 'uuid:7fac8029-4b31-49f5-83e1-ef9bf7ac1db0', 'uuid:d5f8b82d-9ef1-41ff-afc2-16eab7b8275d', 'uuid:f32f8fde-44e1-44fb-9a45-0fc4960b7a77']

Attachments in storage: [{'name': '1666784762837.jpg', 'submission_id': 'uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'location_string': 'abs:surveydata/all_fields_for_testing_enc/uuid%3A5e5a40ce-bce2-4225-856e-224f13f3fafa/1666784762837.jpg'}, {'name': '1666784783763.m4a', 'submission_id': 'uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'location_string': 'abs:surveydata/all_fields_for_testing_enc/uuid%3A5e5a40ce-bce2-4225-856e-224f13f3fafa/1666784783763.m4a'}, {'name': 'AA_5e5a40ce-bce2-4225-856e-224f13f3fafa_AFTER_0S.m4a', 'submission_id': 'uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'location_string': 'abs:surveydata/all_fields_for_testing_enc/uuid

## Submitting submission updates (commenting and/or updating review status and quality classification)

Below is example code for updating submissions with comments and/or reviews. Any number of updates can be submitted in a single batch, but the SurveyCTO login you use must have *Can modify or delete data* access (or else the server will reject all updates).

After running the following cell, you should re-run one of the earlier cells to re-sync with the server, in order to fetch the newly-updated submission data.

Just a note of caution: the SurveyCTO API used to update submissions is internal and undocumented, so future SurveyCTO releases may break compatibility.
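When assembling larger batches, a small helper can keep the update dictionaries consistent. The sketch below builds each update from the same keys used in the cell that follows (`submissionID`, `comment`, `reviewStatus`, `qualityClassification`). The helper name `build_update` is hypothetical, and it deliberately does not validate the status or classification values, since the full set of values accepted by the server is not documented here.

```python
def build_update(submission_id, comment=None, review_status=None, quality_classification=None):
    """Build one update dict, including only the fields that were provided."""
    update = {"submissionID": submission_id}
    if comment is not None:
        update["comment"] = comment
    if review_status is not None:
        update["reviewStatus"] = review_status
    if quality_classification is not None:
        update["qualityClassification"] = quality_classification
    if len(update) == 1:
        raise ValueError("an update needs a comment, review status, or quality classification")
    return update


# Assemble a batch of two updates for the same (example) submission ID.
example_id = "uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa"
batch = [
    build_update(example_id, comment="Checked against field notes"),
    build_update(example_id, review_status="rejected", quality_classification="poor"),
]
print(batch)
```

The resulting list has the same shape as `submission_updates` in the cell below, so it could be passed to `platform.update_submissions()` in its place.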

In [8]:
# submit one or more submission updates

# organize update bundle (list of individual updates)
#   example: just add a comment to a pending submission
submission_updates=[{"submissionID": "uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa", "comment": "Another custom comment added via Python"}]
#   example: update submission with "okay" quality classification
#submission_updates=[{"submissionID": "uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa", "qualityClassification": "okay"}]
#   example: reject submission with no quality classification
#submission_updates=[{"submissionID": "uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa", "reviewStatus": "rejected"}]
#   example: reject submission with "poor" quality classification
#submission_updates=[{"submissionID": "uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa", "reviewStatus": "rejected", "qualityClassification": "poor"}]
#   example: revert submission back to pending (unreviewed) status
#submission_updates=[{"submissionID": "uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa", "reviewStatus": "none", "comment": "(Example custom comment to explain why we're reverting back to pending status)"}]
#   example: revert a different submission back to pending (unreviewed) status
#submission_updates=[{"submissionID": "uuid:a45f2d93-af11-43db-842f-e2227f022c6e", "reviewStatus": "none", "comment": "(Example custom comment to explain why we're reverting back to pending status)"}]

#   submit bundle of reviews
platform.update_submissions(submission_updates)