# Make your data accessible to the ArcGIS Server

Collecting, storing, managing and analyzing large quantities of numbers, figures, and files is not a new business activity. But referring to these numbers, figures and files as big data is relatively recent.
 
The GeoAnalytics Server expands your ArcGIS Enterprise deployment by providing functionality and services to process and analyze big data.


In order to run the GeoAnalytics tools, your data needs to be in one of the following formats:

- Feature layers (hosted, hosted feature layer views, and from feature services)
- Feature collections
- [Big data file shares](https://gis.fema.gov/arcgis/help/en/portal/latest/use/what-is-a-big-data-file-share.htm) registered with ArcGIS GeoAnalytics Server


## Big data file shares
The GeoAnalytics server allows you to register datasets in a format called a [big data file share](http://enterprise.arcgis.com/en/server/latest/get-started/windows/what-is-a-big-data-file-share.htm). Big data file shares are items on your Web GIS, and can reference data in any of the following data sources:
 - File Share - a directory of datasets stored locally or shared across a network
 - HDFS - a [Hadoop Distributed File System](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Introduction) directory of datasets
 - [Apache Hive](https://hive.apache.org/) - a metastore database
 - Cloud Store - an [Azure Blob Storage](https://azure.microsoft.com/en-us/services/storage/blobs/) container or [Amazon Web Services S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingBucket.html) 

GeoAnalytics Tools can write their results to big data file shares of the following types:

- File share
- HDFS
- Cloud store

The following file types are supported as datasets for input and output in big data file shares:

- Delimited files (such as .csv, .tsv, and .txt)
- Shapefiles (.shp)
- Parquet files (.gz.parquet)
- ORC files (orc.crc)

Storing your data in a big data file share datastore benefits you because:
 - The GeoAnalytics tools read your data only when they are executed, which allows you to update or add data to these locations.
 - You can use partitioned data as a single dataset.
 - Big data file shares are flexible in how time and geometry are defined, allowing data in multiple formats in a single dataset.
 


### Preparing your data
To register a file share or an HDFS directory, organize your datasets as subfolders within a single parent folder and register the parent folder. The parent folder becomes a `datastore`, and each subfolder becomes a `dataset`. For instance, to register two datasets representing earthquakes and hurricanes, your folder hierarchy would look like this:
```
|---FileShareFolder         <-- register as a datastore
   |---Earthquakes          <-- dataset 1
      |---1960              
         |---01_1960.csv
         |---02_1960.csv
      |---1961              
         |---01_1961.csv
         |---02_1961.csv
   |---Hurricanes           <-- dataset 2
      |---atlantic_hur.shp
      |---pacific_hur.shp
```
Learn more about preparing your big data file share datasets [here](http://server.arcgis.com/en/server/latest/get-started/windows/what-is-a-big-data-file-share.htm).
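
As a rough illustration of the layout rule above, the following sketch recreates the example hierarchy in a temporary directory and lists the top-level subfolders, which are what would become datasets once the parent folder is registered. The helper name `list_dataset_folders` is hypothetical, and the code uses only the Python standard library; it does not contact a GeoAnalytics server.

```python
import tempfile
from pathlib import Path


def list_dataset_folders(parent_folder):
    """Return the sorted names of the immediate subfolders of parent_folder.

    Each subfolder corresponds to one dataset once the parent is registered.
    """
    parent = Path(parent_folder)
    return sorted(p.name for p in parent.iterdir() if p.is_dir())


# Recreate the example hierarchy with empty placeholder files.
root = Path(tempfile.mkdtemp()) / "FileShareFolder"
example_files = [
    "Earthquakes/1960/01_1960.csv",
    "Earthquakes/1960/02_1960.csv",
    "Earthquakes/1961/01_1961.csv",
    "Earthquakes/1961/02_1961.csv",
    "Hurricanes/atlantic_hur.shp",
    "Hurricanes/pacific_hur.shp",
]
for relative_path in example_files:
    file_path = root / relative_path
    file_path.parent.mkdir(parents=True, exist_ok=True)
    file_path.touch()

dataset_names = list_dataset_folders(root)
print(dataset_names)
```

A check like this can help catch stray files placed directly in the parent folder before you register it.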

In [2]:
# Connect to enterprise GIS
from arcgis.gis import GIS
import arcgis.geoanalytics
portal_gis = GIS("https://pythonapi.playground.esri.com/portal", "arcgis_python", "amazing_arcgis_123")
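
The connection above embeds the username and password directly in the notebook. One possible alternative, sketched here under the assumption that your environment defines the variables `ARCGIS_PORTAL_URL`, `ARCGIS_USERNAME`, and `ARCGIS_PASSWORD` (hypothetical names, not something the API requires), is to read the credentials from the environment. The helper `load_portal_credentials` is illustrative only.

```python
import os


def load_portal_credentials(env=os.environ):
    """Read the portal URL, username, and password from a mapping of variables.

    The variable names are hypothetical; use whatever your deployment defines.
    """
    try:
        return (
            env["ARCGIS_PORTAL_URL"],
            env["ARCGIS_USERNAME"],
            env["ARCGIS_PASSWORD"],
        )
    except KeyError as missing:
        raise RuntimeError(
            "Missing environment variable: {}".format(missing.args[0])
        ) from None


# Demonstration with a plain dict standing in for the real environment.
demo_env = {
    "ARCGIS_PORTAL_URL": "https://pythonapi.playground.esri.com/portal",
    "ARCGIS_USERNAME": "arcgis_python",
    "ARCGIS_PASSWORD": "<password>",
}
url, username, password = load_portal_credentials(demo_env)
print(url, username)
```

With real environment variables set, `GIS(*load_portal_credentials())` could stand in for the hard-coded call.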

### Ensuring your GIS supports GeoAnalytics
It is best practice to confirm that your ArcGIS Enterprise deployment is configured to support the GeoAnalytics Server before running any tools.

In [3]:
# Verify that GeoAnalytics is supported 
arcgis.geoanalytics.is_supported()

True

## Registering big data file shares

The [`get_datastores()`](https://esri.github.io/arcgis-python-api/apidoc/html/arcgis.geoanalytics.toc.html#get-datastores) function of the `geoanalytics` module returns a [`DatastoreManager`](https://esri.github.io/arcgis-python-api/apidoc/html/arcgis.gis.toc.html#datastoremanager) object that lets you search for and manage the big data file share items as Python API [`Datastore`](https://esri.github.io/arcgis-python-api/apidoc/html/arcgis.gis.toc.html#datastore) objects on your GeoAnalytics server.

In [23]:
bigdata_datastore_manager = arcgis.geoanalytics.get_datastores()
bigdata_datastore_manager

<DatastoreManager for https://pythonapi.playground.esri.com/ga/admin>

You can register your data as a big data file share using the `add_bigdata()` method on a `DatastoreManager` object. Ensure the datasets are stored in a format compatible with the GeoAnalytics server as seen earlier in this guide.

`item = bigdata_datastore_manager.add_bigdata("Name_of_big_data_file_share", r"\\<file_share_path>\<big_data_folder>")`

In [14]:
data_item = bigdata_datastore_manager.add_cloudstore(name='cloud_store', 
                                         conn_str='''{"accessKeyId":"<provide key here>",
                                                      "secretAccessKey":"<provide secret key here>",
                                                      "region":"<provide region here>",
                                                      "defaultEndpointsProtocol":"<provide https or http here>",
                                                      "credentialType":"accesskey"}''', 
                                         object_store="esri-delhi-store", 
                                         provider='amazon')

Created cloud store for cloud_store
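
Hand-writing the JSON connection string is easy to get wrong (a missing quote or comma makes it invalid). As a minimal sketch, the string can instead be built from a dictionary with `json.dumps`. The keys are the ones shown in the call above; the helper name `build_s3_conn_str` and its default protocol of `https` are assumptions made for illustration.

```python
import json


def build_s3_conn_str(access_key_id, secret_access_key, region, protocol="https"):
    """Serialize the connection settings shown above into a JSON string."""
    return json.dumps({
        "accessKeyId": access_key_id,
        "secretAccessKey": secret_access_key,
        "region": region,
        "defaultEndpointsProtocol": protocol,
        "credentialType": "accesskey",
    })


# Placeholders are kept as-is; substitute your own values.
conn_str = build_s3_conn_str(
    "<provide key here>",
    "<provide secret key here>",
    "<provide region here>",
)
print(conn_str)
```

The resulting string could then be passed as the `conn_str` argument of `add_cloudstore()`.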


In [12]:
data_item.path


'/cloudStores/cloud_store1'

In [13]:
data_item = bigdata_datastore_manager.add_bigdata(name="ServiceCallsOrleans", 
                                                  server_path=data_item.path, 
                                                  connection_type='dataStore')

Created Big Data file share for ServiceCallsOrleans


In [5]:
data_item = bigdata_datastore_manager.add_bigdata("ServiceCallsOrleans", r"\\machinename\datastore")

Created Big Data file share for ServiceCallsOrleans


### Searching for big data file shares on datastore

Use the [`search()`](https://esri.github.io/arcgis-python-api/apidoc/html/arcgis.gis.toc.html#arcgis.gis.DatastoreManager.search) method on a `DatastoreManager` object to search for `Datastore` objects. In the output below, the _ServiceCallsOrleans_ big data file share registered above appears among the datastores on the GeoAnalytics server.

In [24]:
bigdata_fileshares = bigdata_datastore_manager.search()
bigdata_fileshares

[<Datastore title:"/bigDataFileShares/NYC_taxi_data15" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/all_hurricanes" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/NYCdata" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/hurricanes_1848_1900" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/ServiceCallsOrleans" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/hurricanes_dask_csv" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/hurricanes_dask_shp" type:"bigDataFileShare">]
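
Picking a datastore by list index depends on the order in which `search()` returns its results. A small sketch of selecting by name instead is shown below. It relies on the `datapath` property that this guide reads from a `Datastore` further down; the helper `find_datastore` is hypothetical, and the `SimpleNamespace` objects are stand-ins so the example runs without a server.

```python
from types import SimpleNamespace


def find_datastore(datastores, name):
    """Return the first datastore whose datapath ends with '/<name>', else None."""
    suffix = "/" + name
    for datastore in datastores:
        if datastore.datapath.endswith(suffix):
            return datastore
    return None


# Stand-in objects for illustration; real code would pass the list from search().
sample = [
    SimpleNamespace(datapath="/bigDataFileShares/all_hurricanes"),
    SimpleNamespace(datapath="/bigDataFileShares/ServiceCallsOrleans"),
]
match = find_datastore(sample, "ServiceCallsOrleans")
print(match.datapath)
```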

### Get datasets from a big data file share datastore
Let's use the `datasets` property on a `Datastore` object to find out how many datasets are available and then list them.

In [15]:
file_share_folder = bigdata_fileshares[1]
file_share_datasets = file_share_folder.datasets
len(file_share_datasets)

1

In [16]:
for i, dataset in enumerate(file_share_datasets):
    print("{:<10}{:<3}{}".format("Dataset {}:".format(i), "", dataset['name']))

Dataset 0:   hurricanes


In [18]:
# View the JSON schema of the hurricanes dataset as a sample
file_share_datasets[0]

{'name': 'hurricanes',
 'format': {'type': 'shapefile', 'extension': 'shp'},
 'schema': {'fields': [{'name': 'serial_num', 'type': 'esriFieldTypeString'},
   {'name': 'season', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'num', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'basin', 'type': 'esriFieldTypeString'},
   {'name': 'sub_basin', 'type': 'esriFieldTypeString'},
   {'name': 'name', 'type': 'esriFieldTypeString'},
   {'name': 'iso_time', 'type': 'esriFieldTypeString'},
   {'name': 'nature', 'type': 'esriFieldTypeString'},
   {'name': 'latitude', 'type': 'esriFieldTypeDouble'},
   {'name': 'longitude', 'type': 'esriFieldTypeDouble'},
   {'name': 'wind_wmo_', 'type': 'esriFieldTypeDouble'},
   {'name': 'pres_wmo_', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'center', 'type': 'esriFieldTypeString'},
   {'name': 'wind_wmo1', 'type': 'esriFieldTypeDouble'},
   {'name': 'pres_wmo1', 'type': 'esriFieldTypeDouble'},
   {'name': 'track_type', 'type': 'esriFieldTypeString'},
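
Because each dataset description is a plain dictionary, it can be summarized with ordinary Python. The sketch below groups field names by their Esri field type. The helper `fields_by_type` is hypothetical, and `sample_dataset` is a shortened copy of the output above rather than a live result.

```python
from collections import defaultdict


def fields_by_type(dataset):
    """Group the field names of a dataset description by their Esri field type."""
    grouped = defaultdict(list)
    for field in dataset["schema"]["fields"]:
        grouped[field["type"]].append(field["name"])
    return dict(grouped)


# A shortened copy of the hurricanes description shown above.
sample_dataset = {
    "name": "hurricanes",
    "format": {"type": "shapefile", "extension": "shp"},
    "schema": {"fields": [
        {"name": "serial_num", "type": "esriFieldTypeString"},
        {"name": "season", "type": "esriFieldTypeBigInteger"},
        {"name": "latitude", "type": "esriFieldTypeDouble"},
        {"name": "longitude", "type": "esriFieldTypeDouble"},
    ]},
}
summary = fields_by_type(sample_dataset)
for field_type, names in summary.items():
    print(field_type, names)
```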
   

## Get path of the big data file share item

In [26]:
file_share_folder.datapath

'/bigDataFileShares/ServiceCallsOrleans'

## Check if the data is accessible to all GeoAnalytics servers

In [13]:
file_share_folder.validate()

True

## Get schema of the data

Once a big data file share is created, the GeoAnalytics server samples the datasets to generate a [manifest](https://enterprise.arcgis.com/en/server/latest/get-started/windows/understanding-the-big-data-file-share-manifest.htm), which outlines the data schema and specifies any time and geometry fields. This process can take a few minutes depending on the size of your data. Once processing is complete, querying the `manifest` property returns the schema of each dataset in your big data file share.

In [8]:
manifest = file_share_folder.manifest
manifest

{'datasets': [{'name': 'calls',
   'format': {'quoteChar': '"',
    'fieldDelimiter': ',',
    'hasHeaderRow': True,
    'encoding': 'UTF-8',
    'escapeChar': '"',
    'recordTerminator': '\n',
    'type': 'delimited',
    'extension': 'csv'},
   'schema': {'fields': [{'name': 'NOPD_Item', 'type': 'esriFieldTypeString'},
     {'name': 'Type_', 'type': 'esriFieldTypeString'},
     {'name': 'TypeText', 'type': 'esriFieldTypeString'},
     {'name': 'Priority', 'type': 'esriFieldTypeString'},
     {'name': 'MapX', 'type': 'esriFieldTypeDouble'},
     {'name': 'MapY', 'type': 'esriFieldTypeDouble'},
     {'name': 'TimeCreate', 'type': 'esriFieldTypeString'},
     {'name': 'TimeDispatch', 'type': 'esriFieldTypeString'},
     {'name': 'TimeArrive', 'type': 'esriFieldTypeString'},
     {'name': 'TimeClosed', 'type': 'esriFieldTypeString'},
     {'name': 'Disposition', 'type': 'esriFieldTypeString'},
     {'name': 'DispositionText', 'type': 'esriFieldTypeString'},
     {'name': 'BLOCK_ADDRESS'

### Edit a big data file share

The spatial reference of the dataset is set to 4326, but we know this data is from New Orleans, Louisiana, and is actually stored in the [Louisiana State Plane Coordinate System](https://spatialreference.org/ref/esri/102682/html/). To correct this, we edit the manifest so that the dataset's spatial reference is `{"wkid": 102682, "latestWkid": 3452}`.

In [9]:
manifest['datasets'][0]['geometry']['spatialReference'] = { "wkid": 102682, "latestWkid": 3452 }

In [10]:
file_share_folder.manifest = manifest

In [20]:
file_share_folder.manifest

{'datasets': [{'name': 'calls',
   'format': {'quoteChar': '"',
    'fieldDelimiter': ',',
    'hasHeaderRow': True,
    'encoding': 'UTF-8',
    'escapeChar': '"',
    'recordTerminator': '\n',
    'type': 'delimited',
    'extension': 'csv'},
   'schema': {'fields': [{'name': 'NOPD_Item', 'type': 'esriFieldTypeString'},
     {'name': 'Type_', 'type': 'esriFieldTypeString'},
     {'name': 'TypeText', 'type': 'esriFieldTypeString'},
     {'name': 'Priority', 'type': 'esriFieldTypeString'},
     {'name': 'MapX', 'type': 'esriFieldTypeDouble'},
     {'name': 'MapY', 'type': 'esriFieldTypeDouble'},
     {'name': 'TimeCreate', 'type': 'esriFieldTypeString'},
     {'name': 'TimeDispatch', 'type': 'esriFieldTypeString'},
     {'name': 'TimeArrive', 'type': 'esriFieldTypeString'},
     {'name': 'TimeClosed', 'type': 'esriFieldTypeString'},
     {'name': 'Disposition', 'type': 'esriFieldTypeString'},
     {'name': 'DispositionText', 'type': 'esriFieldTypeString'},
     {'name': 'BLOCK_ADDRESS'
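
The manifest edit above addresses the dataset by position (`[0]`). A slightly more defensive variant, sketched here with a hypothetical helper `with_spatial_reference`, looks the dataset up by name and works on a copy so the original dictionary stays untouched. The `sample_manifest` below is a minimal stand-in, not the full manifest returned by the server.

```python
import copy


def with_spatial_reference(manifest, dataset_name, wkid, latest_wkid):
    """Return a copy of manifest with the named dataset's spatial reference replaced."""
    updated = copy.deepcopy(manifest)
    for dataset in updated["datasets"]:
        if dataset["name"] == dataset_name:
            # Create the 'geometry' entry if the dataset does not have one yet.
            dataset.setdefault("geometry", {})["spatialReference"] = {
                "wkid": wkid,
                "latestWkid": latest_wkid,
            }
            return updated
    raise KeyError("No dataset named {!r} in manifest".format(dataset_name))


# Minimal stand-in manifest; the 'geometry' contents here are illustrative only.
sample_manifest = {
    "datasets": [
        {"name": "calls", "geometry": {"spatialReference": {"wkid": 4326}}},
    ]
}
edited = with_spatial_reference(sample_manifest, "calls", 102682, 3452)
print(edited["datasets"][0]["geometry"]["spatialReference"])
```

The returned dictionary could then be assigned back to the `manifest` property in the same way as the edit above.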

### Search for big data file share items

Adding a big data file share to the GeoAnalytics server adds a corresponding [big data file share item](https://enterprise.arcgis.com/en/portal/latest/use/what-is-a-big-data-file-share.htm) on the portal. We can search for these types of items using the `item_type` parameter.

In [11]:
search_result = portal_gis.content.search("bigDataFileShares_ServiceCallsOrleans", item_type = "big data file share", max_items=40)
search_result

[<Item title:"bigDataFileShares_ServiceCallsOrleans" type:Big Data File Share owner:admin>]

In [12]:
data_item = search_result[0]

In [13]:
data_item

You are now ready to perform some cool analysis on your data.