# Using the ml4qc package with SurveyCTO

This workbook demonstrates how to use the `ml4qc` Python package to work with [SurveyCTO](https://www.surveycto.com) data.

## Reading credentials and configuration

This example workbook begins by loading credentials and configuration from an `.ini` file stored at `~/.ocl/ml4qc-examples.ini`. The `~` in the path refers to the current user's home directory, and the `.ini` file contents are as follows:

    [aws]
    accesskeyid=idhere
    accesskeysecret=secrethere
    s3bucketname=bucketnamehere
    region=regionnamehere
    ddbtablename=tablenamehere

    [surveycto]
    server=servernamehere
    username=emailhere
    password=passwordhere
    formid=formidhere
    privatekey=-----BEGIN RSA PRIVATE KEY-----
     FROM THE SECOND LINE TO THE LAST
     EACH KEY LINE
     MUST BE INDENTED
     BY AT LEAST ONE SPACE
     -----END RSA PRIVATE KEY-----

Feel free to update the path in `inifile_location` below, and include only the properties needed for the examples you plan to run.
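To illustrate why the key lines must be indented, here is a minimal sketch of how Python's `configparser` reads a multi-line value. The key contents are placeholders and the sample is parsed from a string rather than a file. The `fallback` argument is one way to tolerate a property that has been left out, though the cells in this workbook do not use it.

```python
import configparser

# Placeholder .ini contents; the key lines are NOT a real key
sample_ini = """
[surveycto]
server=servernamehere
privatekey=-----BEGIN RSA PRIVATE KEY-----
 LINE ONE
 LINE TWO
 -----END RSA PRIVATE KEY-----
"""

parser = configparser.RawConfigParser()
parser.read_string(sample_ini)

# Indented lines are treated as continuations of the previous value,
# joined with newlines and with their leading whitespace stripped
private_key = parser.get("surveycto", "privatekey")
key_lines = private_key.splitlines()
print(len(key_lines))

# A property that is absent can be read with a fallback instead of raising
missing_formid = parser.get("surveycto", "formid", fallback=None)
print(missing_formid)
```

Without the leading space, each key line would be parsed as a separate (and invalid) property rather than as part of `privatekey`.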

*A note on SurveyCTO access:* While the `surveydata` Python package requires only a read-only login to SurveyCTO, the `ml4qc` package can also submit submission reviews, and doing so requires write access.

In [1]:
# for convenience, auto-reload modules when they've changed
%load_ext autoreload
%autoreload 2

import configparser
import os

# load credentials and other configuration from a local ini file
inifile_location = os.path.expanduser("~/.ocl/ml4qc-examples.ini")
inifile = configparser.RawConfigParser()
inifile.read(inifile_location)

# load SurveyCTO credentials and configuration
scto_server = inifile.get("surveycto", "server")
scto_username = inifile.get("surveycto", "username")
scto_password = inifile.get("surveycto", "password")
scto_formid = inifile.get("surveycto", "formid")
scto_private_key = inifile.get("surveycto", "privatekey")

# load AWS credentials and configuration
aws_accesskey_id = inifile.get("aws", "accesskeyid")
aws_accesskey_secret = inifile.get("aws", "accesskeysecret")
s3_bucketname = inifile.get("aws", "s3bucketname")
aws_region = inifile.get("aws", "region")
ddb_tablename = inifile.get("aws", "ddbtablename")

## Synchronizing data between SurveyCTO and local file system

To start, we'll synchronize data directly between SurveyCTO and the local file system. The synchronization process will be efficient, using a stored cursor to only download and store new or updated data, and it will include both submission data and all attachments. We'll load all data and [text audits](https://docs.surveycto.com/02-designing-forms/01-core-concepts/03zd.field-types-text-audit.html) into DataFrames. We'll also specify that we want submissions with *any* review status (pending, approved, or rejected), as the default sync only includes approved submissions.
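As a rough illustration of the general idea behind cursor-based synchronization, the sketch below keeps a cursor alongside the stored data and only fetches records newer than it. Everything here is hypothetical (the in-memory `SERVER_RECORDS`, the `seq` field, and the `fetch_since` and `sync` helpers); it is not how `surveydata` or SurveyCTO implement their sync.

```python
from typing import Optional

# Hypothetical in-memory "server"; seq stands in for whatever ordering a real server uses
SERVER_RECORDS = [
    {"id": "uuid:a", "seq": 1},
    {"id": "uuid:b", "seq": 2},
]


def fetch_since(cursor: Optional[int]) -> list:
    """Return records newer than the cursor (all records when there is no cursor)."""
    return [r for r in SERVER_RECORDS if cursor is None or r["seq"] > cursor]


def sync(local_store: dict) -> list:
    """Copy new records into local_store, advance its cursor, and return the new IDs."""
    new_records = fetch_since(local_store.get("cursor"))
    for record in new_records:
        local_store.setdefault("submissions", {})[record["id"]] = record
    if new_records:
        local_store["cursor"] = max(r["seq"] for r in new_records)
    return [r["id"] for r in new_records]


store = {}
first = sync(store)    # everything is new on the first sync
second = sync(store)   # nothing has changed, so nothing is fetched
SERVER_RECORDS.append({"id": "uuid:c", "seq": 3})
third = sync(store)    # only the record added since the last sync
print(first, second, third)
```

Storing the cursor with the data is what lets a repeated sync report zero new submissions when nothing has changed on the server.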

The SurveyCTO credentials and form ID are those loaded earlier from the `.ini` file. We recommend creating a new user role for API access that grants API access plus read-only access to forms and data (and *Can modify or delete data* access if you will be submitting submission reviews later on). The `privatekey` property in the `.ini` file is optional and is only needed when the SurveyCTO form is encrypted.

In this example, data is synchronized to the `~/Files/surveydata/formid/` folder tree, where `~` refers to the current user's home directory and `formid` is the SurveyCTO form ID loaded from the `.ini` file.

In [13]:
from ml4qc import SurveyCTOMLPlatform
from surveydata.filestorage import FileStorage
from pytz import timezone
import pandas as pd

# initialize the survey platform connection
platform = SurveyCTOMLPlatform(scto_server, scto_username, scto_password, scto_formid, scto_private_key)

# initialize the local file storage location
storage = FileStorage(os.path.expanduser("~/Files/surveydata/" + scto_formid + "/"))

# synchronize data to ensure storage is up-to-date
new_submissions = platform.sync_data(storage, review_statuses=["pending", "approved", "rejected"])
print(f"Count of new submissions sync'd to storage: {len(new_submissions)}")
print(f"List of new submissions sync'd to storage: {new_submissions}")
print()

# output details about what's present in storage
print(f"Submissions in storage: {storage.list_submissions()}")
print()
print(f"Attachments in storage: {storage.list_attachments()}")
print()

# load all submissions into DataFrame and describe contents
submissions_df = SurveyCTOMLPlatform.get_submissions_df(storage)
print("Submission DataFrame field counts:")
print(submissions_df.count(0))
print()

# summarize submission review statuses
print("Submission DataFrame review statuses:")
print(submissions_df.review_status.value_counts())
print()

# load all text audits into DataFrame and describe contents
#   (and because we have two text audit fields in this form, create a combined textaudit column)
submissions_df["all_ta"] = submissions_df["textaudit"] + submissions_df["textaudit_full"]
textaudit_df = SurveyCTOMLPlatform.get_text_audit_df(storage, location_strings=submissions_df["all_ta"])
if textaudit_df is not None:
    print("Text audit DataFrame field counts:")
    print(textaudit_df.count(0))

    # summarize text audit data for analysis
    ta_summary = SurveyCTOMLPlatform.process_text_audits(textaudit_df, submissions_df["starttime"], submissions_df["endtime"], storage.get_data_timezone(), timezone("US/Eastern"))

    # merge text audit summaries with submission data
    all_data = pd.concat([submissions_df, ta_summary], axis='columns', join='inner', verify_integrity=True)

    # print summary of combined DataFrame
    print("Combined DataFrame field counts:")
    print(all_data.count(0))

else:
    print("No text audits found.")

Count of new submissions sync'd to storage: 0
List of new submissions sync'd to storage: []

Submissions in storage: ['uuid:a45f2d93-af11-43db-842f-e2227f022c6e', 'uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa', 'uuid:d8880923-8a5d-4c7a-9b66-994a496b2ae8', 'uuid:7fac8029-4b31-49f5-83e1-ef9bf7ac1db0', 'uuid:d5f8b82d-9ef1-41ff-afc2-16eab7b8275d', 'uuid:8d744680-238a-454d-8e83-d168b1da1aaf', 'uuid:f32f8fde-44e1-44fb-9a45-0fc4960b7a77', 'uuid:66767ff8-919e-44d3-b6db-784091a3de37']

Attachments in storage: [{'name': 'AA_f32f8fde-44e1-44fb-9a45-0fc4960b7a77_AFTER_0S.m4a', 'submission_id': 'uuid:f32f8fde-44e1-44fb-9a45-0fc4960b7a77', 'location_string': 'file:/Users/crobert/Files/surveydata/all_fields_for_testing_enc/uuid%3Af32f8fde-44e1-44fb-9a45-0fc4960b7a77/AA_f32f8fde-44e1-44fb-9a45-0fc4960b7a77_AFTER_0S.m4a'}, {'name': 'TA_f32f8fde-44e1-44fb-9a45-0fc4960b7a77.csv', 'submission_id': 'uuid:f32f8fde-44e1-44fb-9a45-0fc4960b7a77', 'location_string': 'file:/Users/crobert/Files/surveydata/all_fields
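
The merge step in the cell above uses `pd.concat` with `join='inner'` and `verify_integrity=True`. The sketch below, using small made-up DataFrames in place of the real submission and text audit data, shows what those two options do: the inner join keeps only rows whose index appears in both frames, and the integrity check raises an error if the concatenated columns contain duplicate names.

```python
import pandas as pd

# Made-up stand-ins for submissions_df and ta_summary, indexed by submission ID
submissions = pd.DataFrame(
    {"duration": [310, 250, 420]},
    index=["uuid:a", "uuid:b", "uuid:c"],
)
summary = pd.DataFrame(
    {"ta_fields": [12, 15]},
    index=["uuid:b", "uuid:c"],
)

# join='inner' drops uuid:a, which has no text audit summary
combined = pd.concat(
    [submissions, summary], axis="columns", join="inner", verify_integrity=True
)
print(combined)

# verify_integrity=True rejects overlapping column names along the concatenation axis
try:
    pd.concat([submissions, submissions], axis="columns", verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True
print(overlap_detected)
```

One consequence of the inner join is that submissions without any text audit data do not appear in the combined DataFrame.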

## Submitting submission updates (commenting and/or updating review status and quality classification)

Below is example code for updating submissions with comments and/or reviews. Any number of updates can be submitted in a single batch, but *Can modify or delete data* access is required for those updates to be accepted by the server.

After running the following cell, re-run the synchronization cell above to fetch the updated submission data from the server.
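Because each update is a plain dictionary, a batch can also be assembled programmatically. The sketch below uses a hypothetical helper, `build_update`, which is not part of `ml4qc`; it only uses the keys that appear in this workbook's examples (`submissionID`, `comment`, `reviewStatus`, and `qualityClassification`).

```python
from typing import Optional


def build_update(submission_id: str,
                 comment: Optional[str] = None,
                 review_status: Optional[str] = None,
                 quality_classification: Optional[str] = None) -> dict:
    """Hypothetical helper: build one update dict, omitting fields that were not given."""
    update = {"submissionID": submission_id}
    if comment is not None:
        update["comment"] = comment
    if review_status is not None:
        update["reviewStatus"] = review_status
    if quality_classification is not None:
        update["qualityClassification"] = quality_classification
    if len(update) == 1:
        raise ValueError("An update needs at least a comment, status, or classification")
    return update


# Build one comment-only update per flagged submission
flagged_ids = [
    "uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa",
    "uuid:a45f2d93-af11-43db-842f-e2227f022c6e",
]
batch = [build_update(sid, comment="Flagged for manual review") for sid in flagged_ids]
print(batch)

# The resulting list has the same shape as submission_updates and could be
# passed to platform.update_submissions(batch)
```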

In [3]:
# try submitting one or more submission updates

# organize update bundle (list of individual updates)
#   example: just add a comment to a pending submission
submission_updates=[{"submissionID": "uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa", "comment": "Another custom comment added via Python"}]
#   example: update submission with "okay" quality classification
#submission_updates=[{"submissionID": "uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa", "qualityClassification": "okay"}]
#   example: reject submission with no quality classification
#submission_updates=[{"submissionID": "uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa", "reviewStatus": "rejected"}]
#   example: reject submission with "poor" quality classification
#submission_updates=[{"submissionID": "uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa", "reviewStatus": "rejected", "qualityClassification": "poor"}]
#   example: revert submission back to pending (unreviewed) status
#submission_updates=[{"submissionID": "uuid:5e5a40ce-bce2-4225-856e-224f13f3fafa", "reviewStatus": "none", "comment": "(Example custom comment to explain why we're reverting back to pending status)"}]
#   example: revert a different submission back to pending (unreviewed) status
#submission_updates=[{"submissionID": "uuid:a45f2d93-af11-43db-842f-e2227f022c6e", "reviewStatus": "none", "comment": "(Example custom comment to explain why we're reverting back to pending status)"}]

#   submit bundle of reviews
platform.update_submissions(submission_updates)

In [14]:
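# list the numeric columns
print(all_data.select_dtypes(include=['number']).columns.values)
# list the datetime columns
print(all_data.select_dtypes(include=['datetime']).columns.values)
# list all remaining columns (neither numeric nor datetime)
print(all_data.select_dtypes(exclude=['number', 'datetime']).columns.values)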

['devicephonenum' 'duration' 'speedviolationscount' 'pct_conversation'
 'mean_light_level' 'mean_movement' 'mean_sound_level' 'mean_sound_pitch'
 'qint' 'qdecimal' 'qyesno' 'formdef_version' 'qselectmultiple_one'
 'qselectmultiple_two' 'qselectmultiple_three' 'qselectmultiple_four'
 'pct_quiet' 'pct_still' 'sd_light_level' 'sd_movement' 'sd_sound_level'
 'sd_sound_pitch' 'ta_duration_total' 'ta_duration_mean' 'ta_duration_sd'
 'ta_duration_max' 'ta_time_in_fields' 'ta_fields' 'ta_sessions'
 'ta_pct_revisits' 'ta_field_intronote_visited'
 'ta_field_intronote_visit_1_start' 'ta_field_intronote_visit_1_duration'
 'ta_field_qtext_visited' 'ta_field_qtext_visit_1_start'
 'ta_field_qtext_visit_1_duration' 'ta_field_qtext_visit_2_start'
 'ta_field_qtext_visit_2_duration' 'ta_field_qint_visited'
 'ta_field_qint_visit_1_start' 'ta_field_qint_visit_1_duration'
 'ta_field_qint_visit_2_start' 'ta_field_qint_visit_2_duration'
 'ta_field_qdecimal_visited' 'ta_field_qdecimal_visit_1_start'
 'ta_field
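
For reference, here is a small self-contained sketch of how `select_dtypes` partitions columns in the way the cell above does. The DataFrame is made up for illustration; only the column names are borrowed from the output above.

```python
import pandas as pd

# Made-up data with one numeric, one datetime, and one text column
df = pd.DataFrame({
    "duration": [310, 250],
    "starttime": pd.to_datetime(["2023-01-01 10:00", "2023-01-02 11:30"]),
    "review_status": ["APPROVED", "REJECTED"],
})

numeric_cols = list(df.select_dtypes(include=["number"]).columns)
datetime_cols = list(df.select_dtypes(include=["datetime"]).columns)
other_cols = list(df.select_dtypes(exclude=["number", "datetime"]).columns)

print(numeric_cols)
print(datetime_cols)
print(other_cols)
```

Note that `'datetime'` matches timezone-naive datetime columns; timezone-aware columns are selected with `'datetimetz'` instead.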

In [19]:
from ml4qc import SurveyMLTools

# categorize our DataFrame columns
cols_by_type = SurveyMLTools.columns_by_type(all_data)