# Programmatic Updates to File Datasets on healthdata.gov

A file dataset, also called a "blob" dataset, is a static file, such as a PDF or Word document, that is hosted on Socrata and discoverable in your domain's data catalog.

This notebook outlines the steps to accomplish the following:
1. Updating PDF documents in published file datasets
2. Adding attachments to the dataset
3. Updating the `Last Update` metadata field

Note that some variables must be changed before running; these are flagged with `### UPDATE` comments in the code.
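
As a pre-flight check before running the cells below, the following minimal sketch confirms that the dataset ID and credentials look usable. The helper name `check_settings` is hypothetical, and the 4x4 pattern (two groups of four lowercase alphanumeric characters) is an assumption based on IDs like `j88g-nmjt`, not a documented rule. The environment variable names match the ones this notebook reads.

```python
import os
import re

# Assumed shape of a Socrata "4x4" identifier, e.g. 'j88g-nmjt'
FOUR_BY_FOUR = re.compile(r"[a-z0-9]{4}-[a-z0-9]{4}")

def check_settings(dataset_id, env=os.environ, required=("SOCRATA_ID", "SOCRATA_KEY")):
    """Return a list of problems found; an empty list means nothing obvious is wrong."""
    problems = []
    if not FOUR_BY_FOUR.fullmatch(dataset_id):
        problems.append(f"dataset_id {dataset_id!r} does not look like a 4x4")
    for name in required:
        # Treat a missing or empty variable the same way
        if not env.get(name):
            problems.append(f"environment variable {name} is not set")
    return problems
```

Calling `check_settings(dataset_id)` and printing the result before authenticating surfaces a forgotten `### UPDATE` value early, rather than as a `KeyError` or a failed request later.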

In [None]:
from socrata.authorization import Authorization
from socrata import Socrata
import requests
import os
import json

dataset_id = 'j88g-nmjt' ### UPDATE with the relevant dataset 4x4 to be updated
domain = "hhs-odp-testing.demo.socrata.com"
meta_url = f"https://{domain}/api/views/metadata/v1/{dataset_id}"


auth = Authorization(
  domain,  
  os.environ['SOCRATA_ID'], ### UPDATE with your Socrata credentials
  os.environ['SOCRATA_KEY']
)

### Update the PDF
1. Look up the existing dataset
2. Create a new revision
    - Revisions are essentially drafts; creating a revision is the same as going into the UI and clicking "Edit" on a dataset. Until the revision is published, you can make any needed edits to the dataset.
3. Load a new PDF file into the revision
4. "Apply" or publish the revision
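
Because applying the revision replaces the published file, it can be worth confirming that the file really is a PDF before step 3. The sketch below is an optional safeguard and not part of the Socrata workflow itself; `is_pdf` is a hypothetical helper that inspects the leading bytes of an open binary file and then rewinds it so the same handle can still be uploaded.

```python
import io

PDF_MAGIC = b"%PDF-"  # PDF files begin with this signature

def is_pdf(fileobj):
    """Check the leading bytes of an open binary file, then restore its position."""
    start = fileobj.tell()
    header = fileobj.read(len(PDF_MAGIC))
    fileobj.seek(start)  # leave the stream where it was, ready for upload
    return header == PDF_MAGIC

# In-memory data standing in for files on disk
assert is_pdf(io.BytesIO(b"%PDF-1.7\n"))
assert not is_pdf(io.BytesIO(b"PK\x03\x04"))  # e.g. a .docx (zip) file
```

With a real file, the check would sit inside the `with open(filename, 'rb') as file:` block, before the upload call.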

In [None]:
from datetime import date
date.today().strftime("%Y%m%d")

In [None]:
# Initialize Socrata client using auth variable from above
socrata = Socrata(auth)

# Look up existing Socrata dataset
view = socrata.views.lookup(dataset_id)

# Create a new revision. The default is to publish publicly;
# use `create_replace_revision(permission='private')` to publish privately
revision = view.revisions.create_replace_revision()

# Upload PDF file from disk and publish revision
filename = 'new_pdf.pdf' ### UPDATE with the filepath to the PDF/s
with open(filename, 'rb') as file:
    upload = revision.source_as_blob(filename)
    source = upload.blob(file)
    job = revision.apply()
    job.wait_for_finish(progress=lambda job: print('Job progress:', job.attributes['status']))

### Add Attachments
1. Define function
    - Attachments are accessible through a separate endpoint, hence the additional function. 
2. Create a new revision
3. Run function
    - Note that the function returns the updated revision, which must subsequently be applied (i.e. published) for the attachment to appear
    - Parsing the revision's "attachments" attribute allows the attachments to be reordered
4. Apply revision
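
The ordering mentioned in step 3 can be sketched as a plain list sort. This assumes each entry in the revision's "attachments" attribute is a dict with a `name` key, as in the `file_info` payload built by the function in this section; `sort_attachments` is a hypothetical helper, and whether the dataset page displays attachments in list order is also an assumption.

```python
def sort_attachments(attachments, key="name"):
    """Return a new list of attachment dicts sorted case-insensitively by `key`."""
    # Entries with a missing or None value sort first, as empty strings
    return sorted(attachments, key=lambda a: (a.get(key) or "").lower())

example = [
    {"name": "State_Summaries_Wyoming.pdf"},
    {"name": "state_summaries_alaska.pdf"},
    {"name": "State_Summaries_Alabama.pdf"},
]
for attachment in sort_attachments(example):
    print(attachment["name"])
```

The sorted list could then be written back with `revision.update({'attachments': sort_attachments(revision.attributes['attachments'])})` before applying the revision.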

In [None]:
def add_attachment(revision, file_path, file_name):
    """
    Add an attachment to a revision
    Args:
    ```
        revision (Revision): The revision to which you want to add the attachment
        file_path:
            (String): The path to the file in your system
            OR
            (Bytes): The bytes of a file that was downloaded
        file_name (String): The name for your file
    ```
    Returns:
    ```
        Revision
    ```
    Examples:
    ```
    revision = add_attachment(revision, 'C:\\Users\\user.name\\Desktop\\my_file.txt', 'my_file.txt')
    revision = add_attachment(revision, request.content, 'my_file.txt')
    ```
    """
    
    url = "https://{domain}/api/publishing/v1/revision/{view_id}/{revision_seq}/attachment".format(
        view_id = revision.attributes['fourfour'],
        revision_seq = revision.attributes['revision_seq'],
        domain = revision.auth.domain
    )
    
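    if isinstance(file_path, str):
        # Read the file from disk, closing the handle when done
        with open(file_path, 'rb') as f:
            file = f.read()
    else:
        # Already raw bytes (e.g. a downloaded response body)
        file = file_path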
    
    headers = {
        "x-file-name": file_name
    }

    req = requests.post(url=url, auth=revision.auth.basic, data=file, headers=headers)
    req.raise_for_status()
    uploaded = req.json()  # parse the response body once
    file_info = {
        "name": uploaded.get('filename'),
        "filename": uploaded.get('filename'),
        "blob_id": None,
        "asset_id": uploaded.get('file_id')
    }
    
    attachments = revision.attributes['attachments']
    attachments.append(file_info)
    return revision.update({
        'attachments': attachments
    })
  

In [None]:
# Create new revision
revision = view.revisions.create_replace_revision()

# Run function & apply revision
### UPDATE with the file name that users should see when they view
### the attachments on the dataset primer page
attachment_update = add_attachment(revision, filename, 'State_Summaries_Alabama.pdf')
job = attachment_update.apply()
job.wait_for_finish(progress=lambda job: print('Job progress:', job.attributes['status']))

### Update `Last Update` Metadata Field
1. Create JSON payload
2. Send `PATCH` request to metadata API endpoint
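
Step 1 can be sketched as a small function that builds the payload from a `date` object, so the value does not need to be typed by hand. The "Month D, YYYY" format is inferred from the sample value "July 30, 2021" used in this notebook and is an assumption about what the field expects; `last_update_payload` is a hypothetical helper name. Month names come out in English under Python's default locale.

```python
from datetime import date

def last_update_payload(day):
    """Build the metadata payload for the `Last Update` custom field."""
    # %B gives the full month name; day.day avoids the zero padding of %d
    value = f"{day:%B} {day.day}, {day.year}"
    return {"customFields": {"Common Core": {"Last Update": value}}}

print(last_update_payload(date(2021, 7, 30)))
```

Passing `date.today()` yields a payload stamped with the current date, which can be sent in the `PATCH` request in place of a hand-written one.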

In [None]:
last_update = "July 30, 2021" ### UPDATE with the relevant value for this metadata field
payload = {"customFields": {"Common Core": {"Last Update" : last_update}}}
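req_update = requests.patch(
    meta_url,
    json=payload,  # serializes the payload and sets Content-Type: application/json
    auth=(os.environ['SOCRATA_ID'], os.environ['SOCRATA_KEY']),
)
req_update.raise_for_status()  # fail loudly if the update was rejected
meta_new = req_update.text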


In [60]:
import utils

In [61]:
utils.add_attachment()

TypeError: add_attachment() missing 3 required positional arguments: 'revision', 'file_path', and 'file_name'

In [56]:
import json
import re
with open('metadata.json') as f:
  metadata = json.load(f)

In [57]:
profile_reports = [i for i in metadata if 'State Profile Report' in i['name']]
for i in profile_reports:
    del i['tags']
    del i['master_tags']
    i['name'] = re.sub('COVID-19 State Profile Report - ', '', i['name'])
    i['name'] = re.sub(' ', '_', i['name'])

In [58]:
with open('profile_reports_config.json', 'w') as json_file:
  json.dump(profile_reports, json_file)