## Connect to ArcGIS Enterprise

In [None]:
# Imports
from arcgis import GIS, geoanalytics

In [None]:
# Connect to the ArcGIS Enterprise deployment
gis = GIS(profile="dev_summit_demo")

## Check if GeoAnalytics is supported

**Note:** The tools available in the `arcgis.geoanalytics` module require that the ArcGIS Enterprise deployment be configured with at least one ArcGIS GeoAnalytics server.

`geoanalytics.is_supported()` returns `True` if the GIS supports GeoAnalytics.

In [None]:
# Make sure GeoAnalytics is supported
geoanalytics.is_supported()

## Big Data File Shares

A big data file share is an item created in your portal that references a location available to your ArcGIS GeoAnalytics Server. GeoAnalytics tools can read input from and write output to that location, for both feature data (points, polylines, and polygons) and tabular data.

<br/>

There are several benefits to using a big data file share:
  - A big data file share accesses the data when the analysis is run, so you can continue to add data to an existing dataset in your big data file share without having to reregister or publish your data.
 
 
  - You can also [modify the manifest](https://enterprise.arcgis.com/en/geoanalytics/10.9.1/perform-analysis/what-is-a-big-data-file-share.htm#ESRI_SECTION1_B5813F52FD444398AF4F122F38D9FD46) to remove, add, or update datasets in the big data file share.
 
 
  - Big data file shares also allow you to partition your datasets while still treating multiple partitions as a single dataset.
  
  
  - Using big data file shares for output data allows you to store your results in formats that you may use for other workflows, such as a parquet file for further analysis or storage.
  
<br/>
  
**Note:** Big data file shares are only accessed when you run GeoAnalytics Tools. This means that you can only browse and add big data files to your analysis; you cannot visualize the data on a map.

<br/>

Big data file shares can reference the following input data sources:
  - **File share** - A directory of datasets on a local disk or network share.
  - **Apache Hadoop Distributed File System (HDFS)** - An HDFS directory of datasets.
  - **Apache Hive** - Hive metastore databases.
  - **Cloud store** - An Amazon Simple Storage Service (S3) bucket, Microsoft Azure Blob container, or Microsoft Azure Data Lake (Server Manager only) store containing a directory of datasets.
  
<br/>

The following file types are supported as datasets for input and output in big data file shares:
  - **Delimited files** - (such as .csv, .tsv, and .txt)
  - **Shapefiles** - (.shp)
  - **Parquet files** - (.gz.parquet)
  - **ORC files** - (orc.crc)

Example of a big data file share that contains three datasets: Earthquakes, Hurricanes, and GlobalOceans.

## Access GeoAnalytics data stores

`geoanalytics.get_datastores()` returns an instance of the [DatastoreManager](https://developers.arcgis.com/python/api-reference/arcgis.gis.toc.html?highlight=add_bigdata#datastoremanager) helper class, which is used to manage data stores within the ArcGIS Enterprise deployment.

In [None]:
# Connect to the GeoAnalytics data stores
gax_datastores = geoanalytics.get_datastores()

## Register data on AWS S3 as a Cloud Store

When you register a cloud store, you must include an Azure container name, an Amazon S3 bucket name, or an Azure Data Lake Store account name. It is recommended that you additionally specify a folder within the container or bucket. The specified folder is composed of subfolders, and each represents an individual dataset. Each dataset is composed of all the contents of the subfolder.

**Note:** To register a cloud store as a big data file share, you must first add the cloud store as a registered data store.
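
To make the subfolder-per-dataset layout concrete, the sketch below groups object keys by the first subfolder under a registered folder. The keys and the `group_keys_by_dataset` helper are hypothetical and only illustrate the convention described above; this is not how the server itself scans the store.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_keys_by_dataset(keys, registered_folder):
    """Group object keys by the first subfolder under registered_folder."""
    root = PurePosixPath(registered_folder)
    datasets = defaultdict(list)
    for key in keys:
        try:
            relative = PurePosixPath(key).relative_to(root)
        except ValueError:
            continue  # key is outside the registered folder
        if len(relative.parts) < 2:
            continue  # file sits directly in the folder, not in a dataset subfolder
        datasets[relative.parts[0]].append(str(PurePosixPath(*relative.parts[1:])))
    return dict(datasets)


# Hypothetical keys in a bucket; the names are illustrative only
example_keys = [
    "dev_summit_bdfs/Earthquakes/1960.csv",
    "dev_summit_bdfs/Earthquakes/1970.csv",
    "dev_summit_bdfs/Hurricanes/tracks.shp",
    "dev_summit_bdfs/readme.txt",
    "other_folder/ignored.csv",
]
datasets = group_keys_by_dataset(example_keys, "dev_summit_bdfs")
for name, files in sorted(datasets.items()):
    print(name, files)
```

Files that sit directly in the registered folder, or outside it, do not belong to any dataset in this sketch.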


In [None]:
# Create a unique name for the Cloud Store
import uuid
unique_cloud_store_name = "demo_cloud_store_{0}".format(str(uuid.uuid4())[:6])
unique_cloud_store_name

In [None]:
# Get the AWS S3 accessKeyId and secretAccessKey from local environment variables 
import os
aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID")
aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY")
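
Because `os.getenv` returns `None` for an unset variable, it can help to fail early rather than send empty credentials to the server. The sketch below is one way to do that; `read_required_env` is a hypothetical helper, not part of the ArcGIS API, and the placeholder values stand in for real keys.

```python
import os


def read_required_env(names, env=None):
    """Return the values of the named variables, raising if any is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in names)


# Placeholder mapping so the example runs anywhere; omit env= to read os.environ
fake_env = {"AWS_ACCESS_KEY_ID": "EXAMPLE_ID", "AWS_SECRET_ACCESS_KEY": "EXAMPLE_SECRET"}
access_key_id, secret_access_key = read_required_env(
    ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"], env=fake_env
)
print(access_key_id)
```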

Connection information for the Cloud Store is stored within the [connectionString](https://developers.arcgis.com/rest/enterprise-administration/server/dataitem.htm#GUID-C2B4950B-5CA6-4732-8985-9AB360EA3633).

In [None]:
# Create the connection string for the Cloud Store
cloud_store_connection_string_json = {"accessKeyId": aws_access_key_id,
                                      "secretAccessKey": aws_secret_access_key,
                                      "regionEndpointUrl": "s3.us-west-1.amazonaws.com",
                                      "region": "us-west-1",
                                      "defaultEndpointsProtocol": "https",
                                      "credentialType": "accesskey"}

The [add_cloudstore](https://developers.arcgis.com/python/api-reference/arcgis.gis.toc.html?#arcgis.gis.DatastoreManager.add_cloudstore) method adds a Cloud Store data [Item](https://developers.arcgis.com/python/api-reference/arcgis.gis.toc.html?#arcgis.gis.Item).

In [None]:
# Register the Cloud Store with the GeoAnalytics server
registered_cloud_store = gax_datastores.add_cloudstore(name=unique_cloud_store_name,
                                                       conn_str=cloud_store_connection_string_json,
                                                       object_store="esri-ga-test-bdfs/dev_summit_bdfs",
                                                       provider="amazon"
                                                      )

The [validate](https://developers.arcgis.com/python/api-reference/arcgis.gis.toc.html?highlight=datastore#arcgis.gis.Datastore.validate) method validates the registered data store item. It returns `True` if validation was successful.

In [None]:
# Validate the Cloud Store item
registered_cloud_store.validate()

### Register the AWS S3 Cloud Store as a Big Data File Share

Big Data File Share data items are file shares, HDFS, Hive, or cloud data stores that contain input data for GeoAnalytics.

In [None]:
# Create a unique name for the Big Data File Share
unique_bdfs_name = "demo_bdfs_{0}".format(str(uuid.uuid4())[:6])
unique_bdfs_name

The [add_bigdata](https://developers.arcgis.com/python/api-reference/arcgis.gis.toc.html?highlight=add_bigdata#arcgis.gis.DatastoreManager.add_bigdata) method registers a big data file share with the Datastore.

In [None]:
# Register the AWS S3 Cloud Store as a Big Data File Share
registered_bdfs = gax_datastores.add_bigdata(name=unique_bdfs_name,
                                             server_path=registered_cloud_store.path,
                                             connection_type="dataStore"
                                            )

In [None]:
# Validate the Big Data File Share
registered_bdfs.validate()

## Big Data File Share manifests

Big data file shares require a [manifest](https://enterprise.arcgis.com/en/geoanalytics/latest/perform-analysis/understanding-the-big-data-file-share-manifest.htm) to outline the schema of the data, as well as the fields that represent geometry and time in the dataset.

The manifest is automatically generated when you register a big data file share, but you may need to make modifications if there are any changes to your data, or if the manifest generation was unable to determine all the information needed (for example, if the automatically generated manifest did not select the correct field for the geometry or time).

The [manifest](https://developers.arcgis.com/python/api-reference/arcgis.gis.toc.html?highlight=datastore#arcgis.gis.Datastore.manifest) property retrieves or sets the manifest resource for big data file shares, as a dictionary.

In [None]:
# View the Big Data File Share manifest
manifest = registered_bdfs.manifest
manifest
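
Since the manifest is a plain dictionary, field names can also be read from it directly. The sketch below assumes only the `datasets`, `schema`, `fields`, and `name` keys that this notebook relies on, plus a `name` key on each dataset, which is an assumption here. The sample manifest is hypothetical and heavily trimmed; a real manifest carries more information, such as geometry and time settings.

```python
def dataset_field_names(manifest):
    """Map each dataset name in a manifest dictionary to its list of field names."""
    return {
        dataset["name"]: [field["name"] for field in dataset["schema"]["fields"]]
        for dataset in manifest["datasets"]
    }


# Hypothetical, heavily trimmed manifest for illustration
sample_manifest = {
    "datasets": [
        {"name": "Earthquakes", "schema": {"fields": [{"name": "depth"}, {"name": "mag"}]}},
        {"name": "Hurricanes", "schema": {"fields": [{"name": "wind"}]}},
    ]
}
print(dataset_field_names(sample_manifest))
```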

In [None]:
# Create a function that prints the field names for each dataset in a Big Data File Share
def print_bdfs_item_dataset_fields(bdfs_name):
    # Get the BDFS item
    bdfs_item = gis.content.search(query="title:{0}".format(bdfs_name),
                                   item_type="big data file share",
                                   max_items=1)[0]
    
    # Print the field names for each dataset
    for layer in bdfs_item.layers:
        print("{0}:".format(layer.properties.name))
        for field in layer.properties.fields:
            print("\t - " + field.name)

In [None]:
# Print the field names for each dataset in the registered Big Data File Share 
print_bdfs_item_dataset_fields(unique_bdfs_name)

In [None]:
# Update the first field name of the first dataset in the Big Data File Share
manifest["datasets"][0]["schema"]["fields"][0].update({'name': "UPDATED_FIELD_NAME"})

# Update the Big Data File Share manifest
registered_bdfs.manifest = manifest
registered_bdfs.manifest["datasets"][0]
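
Editing by position works for a demo, but looking up the dataset and field by name is less fragile if the manifest order changes. The sketch below is one possible approach; `rename_field` is a hypothetical helper, the sample manifest is illustrative, and the dataset-level `name` key is an assumption. It returns a modified copy so the original dictionary stays untouched until you assign the result back.

```python
import copy


def rename_field(manifest, dataset_name, old_name, new_name):
    """Return a copy of the manifest with one field renamed in one dataset."""
    updated = copy.deepcopy(manifest)
    for dataset in updated["datasets"]:
        if dataset["name"] != dataset_name:
            continue
        for field in dataset["schema"]["fields"]:
            if field["name"] == old_name:
                field["name"] = new_name
                return updated
    raise ValueError("Field {0!r} not found in dataset {1!r}".format(old_name, dataset_name))


# Hypothetical, heavily trimmed manifest for illustration
sample_manifest = {
    "datasets": [
        {"name": "Earthquakes", "schema": {"fields": [{"name": "depth"}, {"name": "mag"}]}},
    ]
}
renamed = rename_field(sample_manifest, "Earthquakes", "mag", "magnitude")
print(renamed["datasets"][0]["schema"]["fields"])
print(sample_manifest["datasets"][0]["schema"]["fields"])
```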

In [None]:
# Print the field names for each dataset in the registered Big Data File Share 
print_bdfs_item_dataset_fields(unique_bdfs_name)

In [None]:
# Delete the first dataset from the Big Data File Share
manifest = registered_bdfs.manifest
del manifest["datasets"][0]

# Update the Big Data File Share manifest
registered_bdfs.manifest = manifest
registered_bdfs.manifest

In [None]:
# Print the field names for each dataset in the registered Big Data File Share 
print_bdfs_item_dataset_fields(unique_bdfs_name)

The [regenerate](https://developers.arcgis.com/python/api-reference/arcgis.gis.toc.html?highlight=datastore#arcgis.gis.Datastore.regenerate) method regenerates the manifest for a big data file share and returns `True` if the manifest was regenerated successfully. For example, you would regenerate the manifest after adding new data to the big data file share.

In [None]:
# Regenerate the Big Data File Share manifest
registered_bdfs.regenerate()

In [None]:
# Validate the Big Data File Share
registered_bdfs.validate()

In [None]:
# View the Big Data File Share manifest
registered_bdfs.manifest

In [None]:
# Print the field names for each dataset in the registered Big Data File Share 
print_bdfs_item_dataset_fields(unique_bdfs_name)

## Search for Big Data File Shares

In [None]:
# Get the registered Big Data File Share item
bdfs_item = gis.content.search(query="title:{0}".format(unique_bdfs_name),
                               item_type="big data file share",
                               max_items=1)[0]
bdfs_item

In [None]:
# Create a function that prints the name of each layer in a Big Data File Share
def print_bdfs_item_layers(bdfs_item):
    for index, item in enumerate(bdfs_item.layers):
        print("{0}: {1}".format(index, item.properties.name))

In [None]:
# Print the names of the layers in the registered Big Data File Share
print_bdfs_item_layers(bdfs_item)

Items that have layers have a dynamic [layers](https://developers.arcgis.com/python/api-reference/arcgis.gis.toc.html?#item) property that is used to get the individual layers in the item.

In [None]:
# Get the UberSF data from the Big Data File Share
ubersf_bdfs_layer = bdfs_item.layers[1]
ubersf_bdfs_layer

## Return GeoAnalytics tool job messages

[GPJob](https://developers.arcgis.com/python/api-reference/arcgis.geoprocessing.html?highlight=gpjob#gpjob) represents a single geoprocessing job and allows any geoprocessing task to run asynchronously. To get a GPJob back, call the tool with `future=True`; otherwise, the tool runs synchronously and returns its result directly.

`GPJob.done()` returns `True` if the call was successfully cancelled or finished running.

In [None]:
import json
import time

# Create a function that prints the job messages while a GPJob task is running
def print_job_messages(gpjob, poll_seconds=1):
    previous_message = None
    while not gpjob.done():
        if gpjob.messages:
            current_message = gpjob.messages[-1]["description"]
            if current_message != previous_message:
                # Only JSON-formatted messages contain a "messageCode" key
                if "messageCode" in current_message:
                    print(json.loads(current_message)["message"])
                previous_message = current_message
        # Pause between polls instead of busy-waiting on the server
        time.sleep(poll_seconds)
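
The polling pattern can be tried without a server. The sketch below uses `FakeJob`, a hypothetical stand-in that exposes only `done()` and `messages`; the message codes are made up. It tracks how many messages have been seen rather than only the last one, on the assumption that the message list only grows, and it gives up after a timeout.

```python
import json
import time


class FakeJob:
    """Hypothetical stand-in for a job object exposing done() and messages."""

    def __init__(self, scripted_messages):
        self._scripted = list(scripted_messages)
        self.messages = []

    def done(self):
        # Each poll reveals one more scripted message until none remain
        if self._scripted:
            self.messages.append(self._scripted.pop(0))
            return False
        return True


def collect_job_messages(job, poll_seconds=0.01, timeout_seconds=5.0):
    """Poll a job until it is done and return each new parsed message once."""
    seen = 0
    collected = []
    deadline = time.monotonic() + timeout_seconds
    while True:
        finished = job.done()
        for entry in job.messages[seen:]:
            description = entry["description"]
            if "messageCode" in description:
                collected.append(json.loads(description)["message"])
        seen = len(job.messages)
        if finished:
            return collected
        if time.monotonic() > deadline:
            raise TimeoutError("Job did not finish in time")
        time.sleep(poll_seconds)


job = FakeJob([
    {"description": json.dumps({"messageCode": "DEMO_1", "message": "Reading input"})},
    {"description": "Plain text status"},
    {"description": json.dumps({"messageCode": "DEMO_2", "message": "Writing output"})},
])
print(collect_job_messages(job))
```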

## Use GeoAnalytics tools to analyze the UberSF trips data

### Describe Dataset

The GeoAnalytics [Describe Dataset](https://developers.arcgis.com/rest/services-reference/enterprise/describe-dataset.htm) tool summarizes features into calculated field statistics, sample features, and extent boundaries.

The sample layer allows you to efficiently test your workflow before running it on the full dataset.

In [None]:
# Import the GeoAnalytics Describe Dataset tool
from arcgis.geoanalytics.summarize_data import describe_dataset

### Describe Dataset: GPJob

In [None]:
# Create a unique output name
unique_output_name = "dd_ubersf_gpjob_{0}".format(str(uuid.uuid4())[:6])
print("Output name: {0}\n".format(unique_output_name))

# Run the Describe Dataset tool on the UberSF data with future=True
dd_ubersf_gpjob = describe_dataset(input_layer=ubersf_bdfs_layer,
                                   extent_output=True,
                                   sample_size=1000,
                                   output_name=unique_output_name,
                                   future=True  # This returns a GPJob
                                  )

# Print the job messages while the tool is running
print_job_messages(dd_ubersf_gpjob)

`GPJob.result()` will return the value returned by the call. If the call hasn’t yet completed then this method will wait.

In [None]:
# Get the Describe Dataset GPJob result
dd_ubersf_gpjob.result()
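
The same `done()` and `result()` pattern appears on `concurrent.futures.Future` in the Python standard library, which makes the blocking behavior easy to try without a server. The sketch below uses a thread pool as an analogy only; `slow_task` is a hypothetical placeholder for a long-running tool.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_task(seconds):
    """Placeholder for a long-running analysis."""
    time.sleep(seconds)
    return "finished after {0} s".format(seconds)


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_task, 0.2)
    was_done_immediately = future.done()  # expected to be False right after submit
    outcome = future.result()             # blocks until slow_task returns
    is_done_after_result = future.done()  # True once result() has returned

print(was_done_immediately, outcome, is_done_after_result)
```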

### Describe Dataset: Result

In [None]:
# Create a unique output name
unique_output_name = "dd_ubersf_result_{0}".format(str(uuid.uuid4())[:6])
print("Output name: {0}\n".format(unique_output_name))

# Run the Describe Dataset tool on the UberSF data with future=False
dd_ubersf_result = describe_dataset(input_layer=ubersf_bdfs_layer,
                                    extent_output=True,
                                    sample_size=1000,
                                    output_name=unique_output_name,
                                    future=False
                                    )

# Show the Describe Dataset result
dd_ubersf_result

### Visualize the Describe Dataset result

In [None]:
# Create a map of San Francisco
map_one = gis.map("San Francisco")
map_one

In [None]:
# Add the Describe Dataset extent layer and the sample layer to the map
map_one.add_layer(dd_ubersf_result.layers[0], {"opacity": 0.5})
map_one.add_layer(dd_ubersf_result.layers[1])

### View the Describe Dataset summary statistics table

In [None]:
# Import
import pandas as pd

In [None]:
# Create a Spatially Enabled DataFrame object from the Describe Dataset summary statistics table
sdf = pd.DataFrame.spatial.from_layer(dd_ubersf_result.tables[0])
sdf[["FIELD_NAME", "COUNT", "COUNT_NON_EMPTY"]].head()

### Reconstruct Tracks

The GeoAnalytics [Reconstruct Tracks](https://developers.arcgis.com/rest/services-reference/enterprise/reconstruct-tracks.htm) tool creates line or polygon tracks from time-enabled input data. Each input feature must represent an instant in time.

The tool first determines which features belong to a track using an identifier. Using the time at each location, the tracks are ordered sequentially and transformed into a line or polygon representing the path of movement over time. Optionally, the input can be buffered by a field, which creates a polygon at each location.
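
The grouping and ordering steps can be sketched in plain Python. This is a simplified illustration, not the tool's implementation: it ignores buffering and distributed processing, and the `x`, `y`, and `time` field names and the sample records are hypothetical.

```python
from collections import defaultdict


def reconstruct_tracks_sketch(points, track_field="id", time_field="time"):
    """Group point records by identifier and order each group by time."""
    grouped = defaultdict(list)
    for point in points:
        grouped[point[track_field]].append(point)
    tracks = {}
    for track_id, members in grouped.items():
        members.sort(key=lambda p: p[time_field])
        # A track is represented here as the ordered list of (x, y) vertices
        tracks[track_id] = [(p["x"], p["y"]) for p in members]
    return tracks


# Hypothetical point records; timestamps are simple integers for brevity
points = [
    {"id": "a", "time": 2, "x": 1.0, "y": 1.0},
    {"id": "b", "time": 1, "x": 5.0, "y": 5.0},
    {"id": "a", "time": 1, "x": 0.0, "y": 0.0},
    {"id": "a", "time": 3, "x": 2.0, "y": 1.5},
]
tracks = reconstruct_tracks_sketch(points)
print(tracks)
```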

<br/>

<img src="https://pro.arcgis.com/en/pro-app/latest/tool-reference/big-data-analytics/GUID-85767AB0-D12E-4923-9C22-FE2A758DF149-web.png" width="500" align="left">

In [None]:
# Import the GeoAnalytics Reconstruct Tracks tool
from arcgis.geoanalytics.summarize_data import reconstruct_tracks

In [None]:
# Create a unique output name
unique_output_name = "rt_ubersf_{0}".format(str(uuid.uuid4())[:6])
print("Output name: {0}\n".format(unique_output_name))

# Run the Reconstruct Tracks tool on the UberSF data
rf_ubersf_output = reconstruct_tracks(input_layer=ubersf_bdfs_layer,
                                      track_fields="id",
                                      output_name=unique_output_name,
                                      future=True  # This returns a GPJob
                                      )

# Print the job messages while the tool is running
print_job_messages(rf_ubersf_output)

In [None]:
# Create a map of South San Francisco
map_two = gis.map("South San Francisco")
map_two

In [None]:
# Add the Reconstruct Tracks layer to the map
map_two.add_layer(rf_ubersf_output.result().layers[0])

### Summarize Within

The GeoAnalytics [Summarize Within](https://developers.arcgis.com/rest/services-reference/enterprise/summarize-within.htm) tool overlays a polygon layer with another layer to summarize the number of points, length of the lines, or area of the polygons within each polygon and calculates attribute field statistics about those features within the polygons.
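
The idea of summarizing features into bins can be illustrated with a much simpler case. The sketch below counts points per square bin; it deliberately swaps in square bins and point counts for the hexagon bins and line summaries used later in this notebook, and the coordinates are hypothetical.

```python
import math
from collections import Counter


def count_points_in_square_bins(points, bin_size):
    """Count points per square bin, keyed by the bin's (column, row) index."""
    counts = Counter()
    for x, y in points:
        counts[(math.floor(x / bin_size), math.floor(y / bin_size))] += 1
    return dict(counts)


# Hypothetical planar coordinates, in the same unit as bin_size
points = [(0.1, 0.2), (0.4, 0.4), (0.6, 0.1), (1.2, 0.7), (-0.1, 0.3)]
bin_counts = count_points_in_square_bins(points, bin_size=0.5)
print(bin_counts)
```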

In [None]:
# Import the GeoAnalytics Summarize Within tool
from arcgis.geoanalytics.summarize_data import summarize_within

In [None]:
# Create a unique output name
unique_output_name = "sw_ubersf_rt_{0}".format(str(uuid.uuid4())[:6])
print("Output name: {0}\n".format(unique_output_name))


# Run the Summarize Within tool on the UberSF data reconstructed tracks
sw_ubersf_output = summarize_within(summarized_layer=rf_ubersf_output.result().layers[0],
                                    bin_type="Hexagon",
                                    bin_size=0.5,
                                    bin_size_unit="Miles",
                                    output_name=unique_output_name,
                                    future=True  # This returns a GPJob
                                    )

# Print the job messages while the tool is running
print_job_messages(sw_ubersf_output)

In [None]:
# Create a map of San Francisco
map_three = gis.map("San Francisco")
map_three

In [None]:
# Add the Summarize Within layer to the map
map_three.add_layer(sw_ubersf_output.result().layers[0])