![](images/Capture2.JPG)

#  <center> Tutorial: Pull a VDS from the Lucd UDS into a Local Dask Dataframe </center>

## Background on Lucd

The Lucd Enterprise AI Data Science Platform is a highly secure, scalable, open and flexible platform for persisting and fusing large and numerous datasets and training AI models for production against those datasets.
The Lucd platform is an end-to-end platform that can be deployed in public cloud environments or on premises on bare-metal hardware; alternatively, the Lucd multi-tenant PaaS can be accessed directly.  The platform consists of:

 - A scalable open data ingest capability
 - A petabyte scale unified data space data repository
 - 3-D Visualization and Exploration
 - An Exploratory Data Analysis Rest Service
 - A Kubernetes environment to train PyTorch and TensorFlow models
 - NLP Word Embedding and Explainable AI Assets
 - Model results visualization and exporting to internal or external serving capability

![](images/Architecture1.png)

## Introduction, Prerequisites

This tutorial covers using the Lucd Python Client to pull a Virtual Data Set (VDS) from data in the Lucd Unified Data Space (UDS) into a local Dask Dataframe via a Jupyter Notebook.  Creating the VDS with the Lucd 3D graphical UI, creating custom EDA operations, and locally creating AI models to upload to the Lucd platform are outside the scope of this tutorial and are or will be covered in other tutorials.

Prerequisites are:
 - Obtaining a Lucd account with appropriate security settings to access/retrieve data.  https://community.lucd.ai/hc/en-us/articles/360037995531
 - Obtaining the Lucd Python Client package.  A pip install will be available in the future; for now, request the package by contacting marketing@lucd.ai
 - Downloading and installing a Jupyter notebook (this tutorial assumes that an Anaconda Jupyter notebook is used)

## 1. Run Setup on the Lucd Python Client

 - Extract the Lucd Python Package from the zip file.
 - From the Anaconda Cmd Prompt or from Anaconda Navigator, navigate to the Lucd Python Package folder
 - Run:  python setup.py install

## 2. Import the following into your notebook

In [1]:
from lucd import LucdClient, log
from eda.int import asset
from eda.int import vds
from eda.int import uds
from eda.lib import lucd_uds

## 3. Access Lucd with your account information

In [2]:
client = LucdClient(domain="<your domain>",  # e.g. "https://p1.lucd.ai"
                    username="<your username>",
                    password="<your password>",
                    )
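
Typing a password directly into a notebook cell makes it easy to leak when the notebook is shared. The sketch below shows one possible alternative: reading the three connection values from environment variables and passing them on as keyword arguments. The helper `load_lucd_credentials` and the variable names `LUCD_DOMAIN`, `LUCD_USERNAME` and `LUCD_PASSWORD` are hypothetical and not part of the Lucd client; only the keyword names `domain`, `username` and `password` come from the cell above.

```python
import os


def load_lucd_credentials(env=os.environ):
    """Build keyword arguments for LucdClient from environment variables.

    The variable names below are hypothetical; use whatever naming
    convention your environment follows.
    """
    names = {
        "domain": "LUCD_DOMAIN",
        "username": "LUCD_USERNAME",
        "password": "LUCD_PASSWORD",
    }
    # Fail early with a clear message if anything is unset or empty.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}


# Demonstration with an explicit mapping so the example runs anywhere.
demo_env = {
    "LUCD_DOMAIN": "https://p1.lucd.ai",
    "LUCD_USERNAME": "demouser",
    "LUCD_PASSWORD": "not-a-real-password",
}
credentials = load_lucd_credentials(demo_env)
print(sorted(credentials))
```

In the notebook, the resulting dictionary could then be unpacked into the constructor, for example `client = LucdClient(**load_lucd_credentials())`.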

Look at data in the Lucd Unified Data Space

In [3]:
all_uds, http = uds.sources({"uid": "<your username>"})

Your view will look different depending on your security group. The following is an example of the result:

In [4]:
all_uds

[{'bytes': 1840691474,
  'lastIngest': 1575571138618,
  'records': 1697533,
  'source': 'AMAZON'},
 {'bytes': 69111650,
  'lastIngest': 1575566305862,
  'records': 50000,
  'source': 'IMDB'},
 {'bytes': 10000,
  'lastIngest': 1575564236229,
  'records': 150,
  'source': 'IRIS'},
 {'bytes': 5033542244,
  'lastIngest': 1575657022337,
  'records': 2121379,
  'source': 'MIMICIII'},
 {'bytes': 3465039079,
  'lastIngest': 1575582766041,
  'records': 8807303,
  'source': 'NYC_GREEN_TAXI'},
 {'bytes': 47316594862,
  'lastIngest': 1575932176659,
  'records': 111330249,
  'source': 'NYC_YELLOW_TAXI'},
 {'bytes': 2165822,
  'lastIngest': 1575584734133,
  'records': 7032,
  'source': 'TELCO_CHURN'}]
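
The raw byte counts and ingest timestamps are hard to read at a glance. The following is a minimal sketch that formats a few of the entries shown above, assuming `lastIngest` is a Unix timestamp in milliseconds (the values are consistent with that, but the source does not state it). The helper `summarize_source` is illustrative and not part of the Lucd client.

```python
from datetime import datetime, timezone

# Sample entries copied from the example output above; a real call to
# uds.sources(...) would supply this list instead.
all_uds = [
    {"bytes": 1840691474, "lastIngest": 1575571138618,
     "records": 1697533, "source": "AMAZON"},
    {"bytes": 10000, "lastIngest": 1575564236229,
     "records": 150, "source": "IRIS"},
    {"bytes": 47316594862, "lastIngest": 1575932176659,
     "records": 111330249, "source": "NYC_YELLOW_TAXI"},
]


def summarize_source(entry):
    """Return a readable one-line summary for a single UDS source entry."""
    size_gb = entry["bytes"] / 1e9  # decimal gigabytes
    # Assumption: lastIngest is a Unix timestamp in milliseconds.
    ingested = datetime.fromtimestamp(entry["lastIngest"] / 1000, tz=timezone.utc)
    return (f"{entry['source']}: {entry['records']:,} records, "
            f"{size_gb:.2f} GB, last ingest {ingested:%Y-%m-%d}")


# List the largest sources first.
for entry in sorted(all_uds, key=lambda e: e["bytes"], reverse=True):
    print(summarize_source(entry))
```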

Creating a Virtual Data Set (VDS) from data in the Unified Data Space (UDS) with the Lucd 3D UI Client is outside the scope of this tutorial. Once a VDS has been created from data in the UDS, you can view it as follows:

In [5]:
all_vds, http = vds.read({"uid": "<your username>"})

Your VDS view will be different. The following is an example:

In [6]:
all_vds

{'demouser_9223370452718499796': {'description': 'single day',
  'model': {'data': ['green-taxi.extra',
    'green-taxi.fare_amount',
    'green-taxi.mta_tax',
    'green-taxi.passenger_count',
    'green-taxi.total_amount',
    'green-taxi.trip_distance'],
   'labels': []},
  'name': 'Taxi Dataset',
  'operations': [],
  'query': {'aggs': {'agg_source': {'aggs': {'agg_model': {'aggs': {'topHits': {'top_hits': {'size': 10}}},
       'terms': {'field': 'model'}}},
     'terms': {'field': 'source'}}},
   'dataset': '637197317169885378',
   'query': {'function_score': {'functions': [{'random_score': {}}],
     'query': {'bool': {'filter': [],
       'must': [{'bool': {'should': [{'match_phrase': {'source': 'nyc_green_taxi'}}]}},
        {'range': {'content_date': {'gte': 1514782800000,
           'lt': 1514869200000}}}],
       'must_not': []}}}},
   'size': 100},
  'query_size': 284306,
  'username': 'demouser'},
 'demouser_9223370455919155658': {'description': '',
  'model': {'data': ['
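
The `content_date` range in the Taxi Dataset query above is given in epoch milliseconds. As a rough check that it matches the 'single day' description, the sketch below converts both bounds to UTC datetimes. It assumes the values are Unix timestamps in milliseconds; `ms_to_utc` is an illustrative helper, not a Lucd function.

```python
from datetime import datetime, timezone

# Range filter copied from the 'Taxi Dataset' query in the example output.
content_date_range = {"gte": 1514782800000, "lt": 1514869200000}


def ms_to_utc(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


start = ms_to_utc(content_date_range["gte"])
end = ms_to_utc(content_date_range["lt"])

# The difference between the bounds is the time window the VDS covers.
print(start.isoformat(), "to", end.isoformat(), "covers", end - start)
```

The bounds fall at 05:00 UTC on consecutive days, which corresponds to midnight in US Eastern Standard Time, so the window is exactly 24 hours.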

## Pull a VDS into a local Dask Dataframe

Identify the VDS ID

In [7]:
# Each key of all_vds is a VDS ID; each value holds that VDS's metadata.
for vds_id, vds_info in all_vds.items():
    print(vds_info['name'] + " is from key: " + vds_id)

Taxi Dataset is from key: demouser_9223370452718499796
Fused Movie Reviews is from key: demouser_9223370455919155658
IMDB Reviews is from key: demouser_9223370456091712540
Amazon Reviews is from key: demouser_9223370456095032890
IRIS Regression is from key: demouser_9223370456702943543
IMDB: wesley is from key: demouser_9223370456709135637
IRIS Dataset is from key: demouser_9223370459654976634


For example, if you want to pull the Taxi Dataset VDS, its VDS ID is:  demouser_9223370452718499796
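
Rather than copying the ID by hand, it can be looked up by name. The following is a minimal sketch using a trimmed dictionary shaped like `all_vds`; the helper `find_vds_ids` is hypothetical and not part of the Lucd client. It returns a list because nothing in the output suggests that VDS names must be unique.

```python
# Trimmed sample shaped like the all_vds dictionary shown earlier; only the
# 'name' field is kept because that is all the lookup needs.
all_vds = {
    "demouser_9223370452718499796": {"name": "Taxi Dataset"},
    "demouser_9223370455919155658": {"name": "Fused Movie Reviews"},
    "demouser_9223370459654976634": {"name": "IRIS Dataset"},
}


def find_vds_ids(vds_by_id, name):
    """Return every VDS ID whose 'name' matches exactly (names may repeat)."""
    return [vds_id for vds_id, meta in vds_by_id.items()
            if meta.get("name") == name]


matches = find_vds_ids(all_vds, "Taxi Dataset")
if len(matches) != 1:
    raise LookupError(
        f"expected exactly one VDS named 'Taxi Dataset', found {len(matches)}")
vds_id = matches[0]
print(vds_id)
```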

Pull the VDS into a local Dask Dataframe:

In [None]:
ddf = lucd_uds.get_dataframe("<VDS ID>")

## Now you have a local copy of the VDS in a dask dataframe to work with per your requirements

It is assumed that the appropriate packages (e.g. dask, pandas) are imported into your notebook.

For example, previewing the first rows of the Dask Dataframe:

In [None]:
ddf.head()

For example, converting the Dask Dataframe to a pandas dataframe (this loads the entire VDS into local memory):

In [None]:
pdf = ddf.compute()

For example, writing the Dask or pandas dataframe to a CSV file:

In [None]:
# Use distinct file names so the second write does not overwrite the first.
ddf.to_csv('/path/to/myfiles_dask.csv', single_file=True)
pdf.to_csv('/path/to/myfiles_pandas.csv')