# Making your data accessible to the GIS
Big data is popularly characterized by four V's:
 - high **volume**: a large quantity of data that cannot be analyzed in a traditional manner using the memory available on a single machine,
 - high **velocity**: data that is not just static but can also arrive from streaming sources,
 - large **variety**: tabular, non-tabular, spatial, and non-spatial formats from a variety of sources,
 - unknown **veracity**: data that is not pre-processed or screened and is of unknown quality.

## Big data file shares
Given the size and uncertainty of such data, the GeoAnalytics server allows you to register your big datasets in a format called a big data file share. Big data file shares can reference data in the following data sources:
 - file share - a directory of datasets
 - HDFS - a Hadoop Distributed File System directory of datasets
 - Hive - metastore databases

Storing your data in a big data file share datastore has the following benefits:
 - The GeoAnalytics tools read your data only when they are executed. This allows you to keep updating or adding new data to these locations.
 - You can partition your data, say using file system folders, yet treat the partitions as a single dataset.
 - Big data file shares are flexible in how time and geometry are defined. This allows you to have data in multiple formats even in a single dataset.

### Preparing your data
To register a file share or an HDFS directory, format your datasets as sub folders within a single parent folder and register that parent folder. The parent folder you register becomes a `datastore`, and each of its sub folders becomes a `dataset`. For instance, to register two datastores representing earthquakes and hurricanes, your folder hierarchy would look like this:
```
|---FileShareFolder    
   |---Earthquakes          <-- register as a datastore
      |---1960              <-- dataset 1
         |---01_1960.csv
         |---02_1960.csv
      |---1961              <-- dataset 2
         |---01_1961.csv
         |---02_1961.csv
   |---Hurricanes           <-- register as a datastore
      |---atlantic_hur.shp
      |---pacific_hur.shp
```
To learn more about preparing your data for use with GeoAnalytics server, refer to this [server documentation](http://server.arcgis.com/en/server/latest/get-started/windows/what-is-a-big-data-file-share.htm).
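
Before registering a folder, it can help to confirm that it follows the layout described above. The following is a minimal sketch using only the Python standard library: it builds a throwaway copy of the `Earthquakes` hierarchy in a temporary directory and lists each sub folder (a would-be dataset) with its files. The helper name `list_datasets` is hypothetical and is not part of the ArcGIS API for Python.

```python
from pathlib import Path
import tempfile


def list_datasets(datastore_folder):
    """Map each sub folder (a would-be dataset) to the sorted file names it holds."""
    root = Path(datastore_folder)
    layout = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        layout[sub.name] = sorted(f.name for f in sub.iterdir() if f.is_file())
    return layout


# Recreate the Earthquakes layout from the tree above in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    earthquakes = Path(tmp) / "FileShareFolder" / "Earthquakes"
    for year in ("1960", "1961"):
        year_dir = earthquakes / year
        year_dir.mkdir(parents=True)
        for month in ("01", "02"):
            # Placeholder content; real files would hold earthquake records.
            (year_dir / f"{month}_{year}.csv").write_text("placeholder\n")

    layout = list_datasets(earthquakes)
    for dataset_name, files in layout.items():
        print(dataset_name, files)
```

Pointing `list_datasets` at your own parent folder shows which sub folders would be treated as datasets once that folder is registered.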

## Searching for big data file shares
The `get_datastores()` method of the `geoanalytics` module returns a `DatastoreManager` object that lets you search for and manage `Datastore` objects on your GeoAnalytics server.

In [None]:
# Connect to enterprise GIS
from arcgis.gis import GIS
import arcgis.geoanalytics
portal_gis = GIS("portal url", "username", "password")

In [None]:
bigdata_datastore_manager = arcgis.geoanalytics.get_datastores()
bigdata_datastore_manager

<DatastoreManager for https://dev003247.esri.com:6443/arcgis/admin>

Use the `search()` method on a `DatastoreManager` object to search for `Datastore` objects.

In [None]:
bigdata_fileshares = bigdata_datastore_manager.search()
bigdata_fileshares

[<Datastore title:"/bigDataFileShares/Chicago_accidents" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/hurricanes" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/hurricanes_1m_168yrs" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/hurricanes_all" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/Hurricane_tracks" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/NYCdata" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/NYC_taxi" type:"bigDataFileShare">]
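
Picking a datastore by list position is fragile because the order of search results may change. As a sketch, the hypothetical helper below selects a datastore by name using only the string representation shown in the output above. It assumes that `repr()` of a `Datastore` contains the title in the form `title:"/bigDataFileShares/<name>"`. The `FakeDatastore` class is a stand-in so that the example runs without a server.

```python
def find_datastore(datastores, name):
    """Return the first datastore whose repr ends its title with /<name>", else None."""
    needle = f'/{name}"'  # the closing quote avoids matching longer names
    for datastore in datastores:
        if needle in repr(datastore):
            return datastore
    return None


class FakeDatastore:
    """Stand-in that mimics the repr shown above (assumption, for illustration only)."""

    def __init__(self, title):
        self.title = title

    def __repr__(self):
        return f'<Datastore title:"{self.title}" type:"bigDataFileShare">'


stores = [
    FakeDatastore("/bigDataFileShares/hurricanes_all"),
    FakeDatastore("/bigDataFileShares/hurricanes"),
]
match = find_datastore(stores, "hurricanes")
print(match)
```

With a live connection, the same helper could be called as `find_datastore(bigdata_fileshares, "Chicago_accidents")`.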

### Get datasets from a big data file share datastore
Use the `datasets` property on a `Datastore` object to get its datasets, each represented as a dictionary.

In [None]:
Chicago_accidents = bigdata_fileshares[0]
len(Chicago_accidents.datasets)

6

In [None]:
# view the first dataset as a sample
Chicago_accidents.datasets[0]

{'format': {'encoding': 'UTF-8',
  'extension': 'csv',
  'fieldDelimiter': ',',
  'hasHeaderRow': True,
  'quoteChar': '"',
  'recordTerminator': '\n',
  'type': 'delimited'},
 'geometry': {'fields': [{'formats': ['x'], 'name': 'longitude'},
   {'formats': ['y'], 'name': 'latitude'}],
  'geometryType': 'esriGeometryPoint',
  'spatialReference': {'wkid': 4326}},
 'name': 'April',
 'schema': {'fields': [{'name': 'date', 'type': 'esriFieldTypeString'},
   {'name': 'year', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'day_o_week', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'num_veh', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'injuries', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'fatalities', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'coll_type', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'weather', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'lighting', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'surf_cond', 'type': 'esriFieldTypeBigInteger'},
   {'n
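
The dataset dictionaries are verbose, so a compact summary can be easier to scan. The sketch below assumes each dataset follows the structure shown in the output above (`name`, `geometry`, and `schema` keys). The helper name `summarize_dataset` is hypothetical, and `sample` is an abbreviated copy of that output holding only its first three schema fields.

```python
def summarize_dataset(dataset):
    """Pull the name, geometry type, geometry fields and field names out of a dataset dict."""
    geometry = dataset.get("geometry", {})
    fields = dataset.get("schema", {}).get("fields", [])
    return {
        "name": dataset.get("name"),
        "geometry_type": geometry.get("geometryType"),
        "geometry_fields": [f["name"] for f in geometry.get("fields", [])],
        "field_names": [f["name"] for f in fields],
    }


# Abbreviated copy of the dataset shown above (first three schema fields only).
sample = {
    "geometry": {
        "fields": [
            {"formats": ["x"], "name": "longitude"},
            {"formats": ["y"], "name": "latitude"},
        ],
        "geometryType": "esriGeometryPoint",
        "spatialReference": {"wkid": 4326},
    },
    "name": "April",
    "schema": {
        "fields": [
            {"name": "date", "type": "esriFieldTypeString"},
            {"name": "year", "type": "esriFieldTypeBigInteger"},
            {"name": "day_o_week", "type": "esriFieldTypeBigInteger"},
        ]
    },
}

summary = summarize_dataset(sample)
print(summary)
```

With a live connection, `[summarize_dataset(d) for d in Chicago_accidents.datasets]` would give one summary per dataset, assuming every entry has the same shape.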

## Registering big data file shares
You can register your data as a big data file share using the `add_bigdata()` method on a `DatastoreManager` object. Ensure the datasets are stored in a format compatible with the GeoAnalytics server as seen earlier in this guide.

In [None]:
NYC_data_item = bigdata_datastore_manager.add_bigdata("NYCdata2", 
                                                      r"\\teton\atma_shared\datasets\NYC_taxi")

Created Big Data file share for NYCdata2


In [None]:
NYC_data_item

<Datastore title:"/bigDataFileShares/NYCdata2" type:"bigDataFileShare">

Once a big data file share is created, the GeoAnalytics server processes all the valid file types to discern the schema of the data. This process can take a few minutes depending on the size of your data. Once processed, querying the `manifest` property returns the schema.

In [None]:
NYC_data_item.manifest

{'datasets': [{'format': {'encoding': 'UTF-8',
    'extension': 'csv',
    'fieldDelimiter': ',',
    'hasHeaderRow': True,
    'quoteChar': '"',
    'recordTerminator': '\n',
    'type': 'delimited'},
   'geometry': {'fields': [{'formats': ['x'], 'name': 'pickup_longitude'},
     {'formats': ['y'], 'name': 'pickup_latitude'}],
    'geometryType': 'esriGeometryPoint',
    'spatialReference': {'wkid': 4326}},
   'name': 'sampled',
   'schema': {'fields': [{'name': 'VendorID',
      'type': 'esriFieldTypeBigInteger'},
     {'name': 'tpep_pickup_datetime', 'type': 'esriFieldTypeString'},
     {'name': 'tpep_dropoff_datetime', 'type': 'esriFieldTypeString'},
     {'name': 'passenger_count', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'trip_distance', 'type': 'esriFieldTypeDouble'},
     {'name': 'pickup_longitude', 'type': 'esriFieldTypeDouble'},
     {'name': 'pickup_latitude', 'type': 'esriFieldTypeDouble'},
     {'name': 'RateCodeID', 'type': 'esriFieldTypeBigInteger'},
     {'nam
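
Because schema discovery can take a few minutes, a script may need to wait before it reads the manifest. The sketch below polls a callable until the manifest lists at least one dataset. It rests on an assumption that is not confirmed by this guide: that the manifest is a dictionary whose `datasets` entry is missing or empty until processing finishes. The helper name `wait_for_datasets` is hypothetical, and the demo uses canned responses instead of a server.

```python
import time


def wait_for_datasets(get_manifest, timeout=300.0, interval=10.0):
    """Call get_manifest() repeatedly until it reports datasets or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        manifest = get_manifest()
        # Assumption: an unprocessed file share has no (or an empty) 'datasets' entry.
        if manifest and manifest.get("datasets"):
            return manifest
        if time.monotonic() >= deadline:
            raise TimeoutError("manifest did not list any datasets in time")
        time.sleep(interval)


# Demo with canned responses standing in for a server that is still processing.
responses = iter([{}, {"datasets": []}, {"datasets": [{"name": "sampled"}]}])
manifest = wait_for_datasets(lambda: next(responses), timeout=5.0, interval=0.01)
print(manifest["datasets"][0]["name"])
```

Against a real server the call might look like `wait_for_datasets(lambda: NYC_data_item.manifest)`, provided the assumption above holds for your deployment.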