Aggregate AppEngine Troubleshooting

Yaw Anokwa edited this page Oct 19, 2017 · 13 revisions

Introduction

You are investigating why something is not working or you've found an issue and the ODK team would like to look at your AppEngine logs and tables to diagnose the problem.

Investigating a problem on AppEngine

Aggregate writes many lines of progress and activity information to its application log. You can look in these logs for errors and warnings that might indicate why your application is not performing as expected.

There are 2 sets of Logs produced by AppEngine:

  • The frontend logs track actions performed through the webpages, submission and download actions of ODK Collect and ODK Briefcase, and the publishing of data to external services (e.g, Google Spreadsheet or Google Fusion Tables). Keep in mind that there can be up to a 60-second delay from receiving a submission to publishing that data to an external service.
  • The backend logs (i.e., the 'background' version) report actions performed through the built-in periodic watchdog process (that restarts any publishing tasks or file exports that have failed) and all export-to-file actions (export to CSV, KML or as JSON file formats).

To access your application logs:

  • Go to the logging page of Google Cloud Platform: https://console.cloud.google.com/logs
  • Choose your application id from the drop-down in the top right of the screen.
  • To view the log for the website, choose "App Engine, Default Module" then navigate under "App Engine" and choose "default", and choose "All Versions".
  • To view the log for the background worker, (which is used for exporting CSVs and publishing already-recorded submissions to external servers (e.g., Fusion Tables), choose "App Engine, Default Module" then navigate under "App Engine" and choose "background" and choose "All Versions".
  • Choose 'Any Log Level' and change to 'Error'
  • You can filter the resulting list further by using a search filter, entered in the text box above these drop-down selections.

You will likely see some logging entries for the time period in which you experienced the errors.

You may be able to correlate these errors with performance problems reported on the Google Appspot 'System Status' page here: https://status.cloud.google.com/

Inviting Someone to be a Developer

Go to https://console.cloud.google.com/permissions/projectpermissions and select the application id you're having trouble with from the top right drop-down. This presents the Permissions dashboard for this application.

Choose "Add+"

Enter the full email address (e.g., yanokwa@nafundi.com) of the user you want to give (widespread) access to your site.

Choose Project and then Owner as the Role (at the moment, the granularity of the other permissions is insufficient to view and modify an App Engine instance).

Click Add

Please also send an e-mail identifying the App Engine application id and issue. While we can correlate problems with invitations by looking at the administrators of the application, it helps if we have an e-mail from you identifying your application id and issue.

Adding Someone as Site Admin

Log onto your ODK Aggregate instance using the super-user e-mail or as a user with Site Administrator privileges.

Click on the Site Admin tab, and select the Permissions sub-tab.

In the text area under the Add Users heading, enter a username.

Click Add

In the table under Edit Users, check the checkbox under the Site Administrator column heading beside the user you've added.

Click Save Changes

And finally, Click Change Password and supply a password.

You will then need to send a private e-mail identifying the App Engine application id and issue and notifying us that we've been added as site administrators. It is the only way we'll know that this has been done.

Repairing your App Engine Database Yourself

This is occasionally necessary because Google can kill the server process during a form update before the newer form definition is fully processed, or before the older form definition is fully removed from the system.

The same thing can happen during the submission of a filled-in form.

These failures are more likely to occur if you are trying to run using just free quota, or if you are heavily using your system and the form being uploaded is large or contains many repeat groups.

If this disruption occurs while uploading a form definition to ODK Aggregate, the symptom is an infinite refresh of the ODK Aggregate website page.

If the disruption occurs while uploading a filled-in form to ODK Aggregate, any publishers or attempts to Export to CSV, or use ODK Briefcase to pull that data will fail with an error message like:

Error: Problem persisting data or accessing data (roster2@parentAuri(uuid:dd119aa7-447d-4dc0-8262-f4d2d409c5d9) is missing a (repeat) group instance OR has an extra copy of one)

If a filled-in form is corrupted, you can repair the corrupted submission yourself.

If a form definition is corrupted (and you are seeing infinite refreshes), we recommend that you give us developer access to allow us to repair your database. If you don't want to give developer access (access to all of your data), then you can follow the steps below to repair the database on your own.

Repairing the Form Definition Table

If you erroneously delete a record, your previously-collected data for that form, in all likelihood, will be lost. We strongly recommend giving us developer access to do this, unless you are very systematic and detail-oriented.

The steps are as follows:

Go to the Datastore dashboard for your app via https://console.cloud.google.com/datastore/entities/query and select the application id from the top right drop-down list.

The Kinds (data tables) that you will need to look at and clean up are:

  • opendatakit._form_info
  • opendatakit._form_info_fileset
  • opendatakit._form_info_xform_bin
  • opendatakit._form_info_xform_ref
  • opendatakit._form_info_xform_blb
  • opendatakit._form_info_submission_association

(1) select the opendatakit._form_info_submission_association Kind (data table).

There will be 2 records in this table for one form id (submission_form_id) -- this is the form that needs to be repaired.

Look at the is_persistence_model_complete field of these records.

Whichever one is FALSE should be removed -- the system was interrupted while processing that form definition and that definition cannot be repaired, it must be removed. If both records are TRUE, then the newer form definition was successfully uploaded -- you want to remove the older one.

Make note of the creation timestamps of both records, and the ID/Name of the record that needs to be removed.

The records in the other tables with similar older/newer creation timestamps are the ones you need to delete.

(2) go to opendatakit._form_info Verify that there is only one entry in this table for that form id. This is just a sanity check. Make note of the ID/name of this record -- the md5:2342asa... portion.

(3) go to opendatakit._form_info_fileset -- find the records that match the _parent_auri with this md5:... value of the opendatakit._form_info record. Determine which one correspond to the record of the opendatakit._form_info_submission_association table that needs to be removed..

Make note of the ID/name ( particularly the uuid:234d-... part) of the two entries -- and the creation timestamp. We want to correlate the records by their approximate creation timestamps.

(4) go to opendatakit._form_info_xform_bin -- match the records here via _parent_auri and the ID/name of the records in the opendatakit._form_info_fileset. Record creation timestamps

(5) go to opendatakit_form_info_xform_ref -- match the records here via the _dom_auri field and the ID/name of the opendatakit._form_info_xform_bin entry (above). Record creation timestamps Record the _sub_auri field. If your forms are extremely large, there may be multiple entries in this table sharing the same _dom_auri. Confirm that you have found them all.

(6) go to opendatakit._form_info_xform_blb -- match the records here via their ID/name and the _sub_auri field of the opendatakit._form_info_xform_ref records above. Record the creation timestamps.

(7) Now, figure out the set of records that need to be deleted from these tables such that the newest TRUE opendatakit._form_info_submission_association entry is preserved.

(8) Go through, starting at opendatakit._form_info_fileset, and delete all the records that need to be deleted.

Repairing a Filled-in Form Submission

NOTE: Upgrading to ODK Aggregate 1.4.11 should greatly reduce the frequency of this occurring. See the aggregate release notes for how to upgrade. Upgrading will not correct an already-corrupted filled-in form submission. You will still need to perform the steps below. But, after upgrading, you are considerably less likely to have this problem recur.

The steps are as follows:

(1) Open your Logging dashboard here: https://console.cloud.google.com/logs and select your application id from the top right drop-down list. Then choose to view logs with errors (change from Any log level to Error). On AppEngine, there are two modules on the server. You need to look at the logs for both the module default and module background. To do that, open the leftmost filter box, choose App Engine and select All services. If running on Tomcat or another webserver, look in your server logs.

(2a) If you are running ODK Aggregate 1.4.5 or higher, the filtered list of requests with errors should not be empty; if it is, see step (2b).

Each entry in the logging display is a single request to the server. Choose one of the requests that are in error (prefixed by a red box containing two exclamation points ('!!')). Expand it to display the first 10 log entries during the processing of this request, then choose "Show All" to see all remaining log entries. The error will be the last or the next-to-last entry in that list. You should see a detailed error message that indicates the exact location of the problem. Something like:

Error: Problem persisting data or accessing data [ChoiceSubmissionType:getValueFromEntity] SQL: SELECT * FROM opendatakit.YOUR_DATA_TABLE_NAME WHERE _PARENT_AURI = XXXXXX ORDER BY _PARENT_AURI

Make note of all such detailed errors in your log file. In particular, you need to make note of all YOUR_DATA_TABLE_NAME and XXXXXX values. In most cases, _PARENT_AURI will be referenced. In some cases, there might be two clauses, _TOP_LEVEL_AURI and _PARENT_AURI, and in others, there might be a reference to _DOM_AURI. Make note of all of these variations.

(2b) If you don't find any log entries with errors, then the data corruption may affect only the attachments to the form (or, for encrypted forms, the encrypted submission, which is handled like an attachment to a "submission placeholder"). In that case, clear the log level filter (resetting it to Any log level). Then scroll down to the bottom of the log, and begin scrolling up.

You are looking for GET requests that return a 500 response code (internal server error). In the search box, enter status:500 and method:GET to find these requests. Here is an example of such a request:

13:38:54.070 GET 500 211 B 99 ms Chrome 51 /view/binaryData?blobKey=female_crf_encrypt_V3%5B%40version%3Dnull+and+%40uiVersion%3Dnull%5D%2Fdata%5B%40key%3Duuid%3Ad8b27831-8b3a-491d-9c5f-783f8acfe049%5D%2FencryptedXmlFile&previewImage=true

If you take the blobKey argument (everything from blobKey= to the &) and run it through a URL decoder (e.g., [http://meyerweb.com/eric/tools/dencoder/] ) you get this:

female_crf_encrypt_V3[@version=null and @uiVersion=null]/data[@key=uuid:d8b27831-8b3a-491d-9c5f-783f8acfe049]/encryptedXmlFile

This identifies the problem:

form ID: female_crf_encrypt_V3

instance ID: uuid:d8b27831-8b3a-491d-9c5f-783f8acfe049

XPath of the field (attachment) causing the problem: /data/encryptedXmlFile

(3) Open your Datastore dashboard here: https://console.cloud.google.com/datastore/entities/query and select your application id from the top right drop-down list.

(4) Click on the link for the 'Query by GCL'

For non-App-Engine installations, open your database admin tool (e.., MySQL Workbench or PgAdminIII).

(5) Now, for each data table name and XXXXX value in step (2a), execute a query of the form:

SELECT * FROM `opendatakit.YOUR_TABLE_NAME` WHERE _PARENT_AURI = "XXXXXX"

Take special note of the backquote around the opendatakit.-qualified table name, and the double-quotes around the XXXXXX value. If the original query specified _DOM_AURI instead of _PARENT_AURI, use that. If this query does not return a value, you may also need to add the AND _TOP_LEVEL_AURI = "YYYYY" qualifier (but that should not be necessary), if it was present in the original error message.

(6) Look at the _ORDINAL_NUMBER column. There should be exactly one entry for a given ordinal number. The error occurs because there are two or more rows for at least one of these ordinal numbers.

(7) Select the rows with the earliest (oldest) _LAST_UPDATE_DATE and choose 'Delete'

(8) Use ODK Briefcase to Pull data from your server. If it fails, repeat all these steps. Continue repeating until ODK Briefcase completes without errors or only errors of the form (2b) are found.

If your form has many repeat groups or multiple-choice questions, there may be a considerable number of duplicates.

(9) For errors of the form (2b), there is some additional sleuthing required.

First, on the Query by GQL page, execute a copy of this query with the form id from (2b):

  select * from `opendatakit._form_info_submission_association` where SUBMISSION_FORM_ID = "female_crf_encrypt_V3"

replacing female_crf_encrypt_V3 with the form id you found in step (2b).

Make note of the value of the URI_SUBMISSION_DATA_MODEL field (in this example it is uuid:72a64ffe-f751-4a8a-a65c-54a23219fc5f).

Then execute this query, putting that value, and the field name found in step (2b) (without any leading XPath terms) into your copy of the query:

select * from `opendatakit._form_data_model` where URI_SUBMISSION_DATA_MODEL = "uuid:72a64ffe-f751-4a8a-a65c-54a23219fc5f" and ELEMENT_NAME = "encryptedXmlFile"

This will generally yield 3 rows. Scroll to the right to find the PERSIST_AS_TABLE_NAME column and make note of the 3 table names.

The data corruption is in those 3 tables. If BAD_TABLE_BN, BAD_TABLE_REF and BAD_TABLE_BLB are the 3 tables, and the instance id from (2b) is uuid:d8b27831-8b3a-491d-9c5f-783f8acfe049, then the queries to examine are of the form (with substituted names and values appropriate to your situation):

select * from `opendatakit.BAD_TABLE_BN` where _TOP_LEVEL_AURI = "uuid:d8b27831-8b3a-491d-9c5f-783f8acfe049"
select * from `opendatakit.BAD_TABLE_REF` where _TOP_LEVEL_AURI = "uuid:d8b27831-8b3a-491d-9c5f-783f8acfe049"
select * from `opendatakit.BAD_TABLE_BLB` where _TOP_LEVEL_AURI = "uuid:d8b27831-8b3a-491d-9c5f-783f8acfe049"

The data corruption is either:

  1. There are two or more entries (or missing entries) for a given _ORDINAL_NUMBER in the ..._BN table. This field should start 1, 2, ...
  2. There are two or more entries (or missing entries) for a given (PART, _DOM_AURI) tuple. These should also start 1, 2, ...

As before, select the rows with the earliest (oldest) _LAST_UPDATE_DATE and choose 'Delete'

You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.
Press h to open a hovercard with more details.