<h1>Accidents Example</h1>

In [1]:
from GreyNsights.analyst import Pointer, Command, Analyst
from GreyNsights.client import DataWorker, DataSource
from GreyNsights.frameworks import framework
import numpy as np

This notebook demonstrates how to use GreyNsights on a remote dataset hosted by a data owner. The primary aim of this example is to show that pandas can be used as-is across a wide range of queries to analyze and explore a remote data source. To run this example, first run datasource.py, which starts the data source server that hosts the dataset and executes the requests made from this notebook.

In [2]:
#Pandas version of GreyNsights that performs queries remotely 
frameworks = framework()
pandas = frameworks.pandas

The analyst identity has no underlying functionality for now; it is a placeholder for future features, such as providing an actual identity in the form of a certificate.

In [3]:
identity = Analyst("Alice", port=65441, host="127.0.0.1")

This connects to the remote data owner.

In [4]:
worker = DataWorker(port=65441, host="127.0.0.1")
dataset = DataSource(identity, worker, "Sample Data")

Get the data owner's config to understand the limitations set on querying the private dataset.

In [5]:
a = dataset.get_config()
print(a)



owner_name: Bob
dataset_name: Sample Data
privacy_budget: 10.0
trusted-aggregator: None
secret-sharing: Shamirs_scheme
private_columns: 

	-N
	-o
	-n
	-e
visible_columns: 

	-d
	-e
	-f
	-a
	-u
	-l
	-t
restricted_columns: 

	-d
	-e
	-f
	-a
	-u
	-l
	-t
allowed_queries: 

	-d
	-e
	-f
	-a
	-u
	-l
	-t
restricted_columns: 

	-d
	-e
	-f
	-a
	-u
	-l
	-t
visible_queries: 

	-sum
	-count
	-mean
	-percentile
	-max
	-min
	-median
restricted_queries: 

	-N
	-o
	-n
	-e



In [6]:
a = a.approve().init_pointer()

Create a dataframe from the dataset. (It's already a dataframe; this step demonstrates GreyNsights pandas remote execution.)

In [7]:
df = pandas.DataFrame(a)

Variables and functions can be sent to the remote side for execution using send(). send() returns a pointer to the object, which now lives remotely.

In [8]:
p = 3
p = dataset.send(p)
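The idea behind send() can be pictured with a small, self-contained sketch. This is not the GreyNsights implementation: ToyOwner, ToyHandle and toy_send are hypothetical names invented for illustration, and everything runs in one process instead of over a socket. It only shows the pattern of storing an object on the owner's side and handing back a handle that carries an id rather than the value.

```python
import itertools


class ToyOwner:
    """Hypothetical stand-in for the data owner's server; keeps objects by id."""

    def __init__(self):
        self._store = {}
        self._ids = itertools.count(1)

    def register(self, obj):
        # Store the object and return the id it can be looked up by later.
        obj_id = next(self._ids)
        self._store[obj_id] = obj
        return obj_id

    def resolve(self, obj_id):
        return self._store[obj_id]


class ToyHandle:
    """Hypothetical handle given to the analyst in place of the object."""

    def __init__(self, owner, obj_id):
        self.owner = owner
        self.obj_id = obj_id

    def __repr__(self):
        return f"ToyHandle(id={self.obj_id})"


def toy_send(owner, obj):
    # The analyst keeps only the handle; the object stays with the owner.
    return ToyHandle(owner, owner.register(obj))


owner = ToyOwner()
p_handle = toy_send(owner, 3)
print(p_handle)                         # shows the handle, not the value
print(owner.resolve(p_handle.obj_id))   # only the owner side sees the value
```

In this toy version the analyst can pass p_handle wherever the value is needed, and the owner resolves it when the operation actually runs.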

In [9]:
# last 3 rows (p is a remote pointer to the value 3)
print(df.tail(p))


Pointer->Sample Data
	 	 dataset:Sample Data
	 	 dtype:<class 'pandas.core.frame.DataFrame'>
	 	 id:62799278081
	 	 port:65441
	 	 host:127.0.0.1



In [10]:
print(df)


Pointer->Sample Data
	 	 dataset:Sample Data
	 	 dtype:<class 'pandas.core.frame.DataFrame'>
	 	 id:434770736808
	 	 port:65441
	 	 host:127.0.0.1



The operations below act on the pointer, so each one is executed remotely by the data source. Every operation returns another pointer; the actual result is returned only when get() is called. The same functionality as pandas dataframes is available.

In [11]:
print(df["TMC"])


Pointer->Bob
	 	 dataset:Sample Data
	 	 dtype:<class 'pandas.core.series.Series'>
	 	 id:437555460438
	 	 port:65441
	 	 host:127.0.0.1



In [12]:
print(df["TMC"].sum())


Pointer->Sample Data
	 	 dataset:Sample Data
	 	 dtype:<class 'float'>
	 	 id:564836655698
	 	 port:65441
	 	 host:127.0.0.1



In [13]:
print(df["TMC"].sum().get())

515650203.8837159


In [14]:
print("TMC sum: ", df["TMC"].sum().get())
print("TMC std: ", df["TMC"].std().get())
print("Severity mean: ", df["Severity"].mean().get())

TMC sum:  515650097.9761645
TMC std:  20.766272454711583
Severity mean:  2.3399287483402764
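The pointer behaviour seen in the cells above (operations return pointers, get() returns the result) can be pictured as deferred evaluation. The sketch below is a minimal illustration under that assumption, not the GreyNsights Pointer class: LazyPointer and its methods are hypothetical, the "remote" computation is just a local callable, and the column values are toy numbers.

```python
import operator


class LazyPointer:
    """Hypothetical pointer: holds a deferred computation, not a value."""

    def __init__(self, compute):
        # compute is a zero-argument callable standing in for a remote call
        self._compute = compute

    @classmethod
    def of(cls, value):
        return cls(lambda: value)

    def _combine(self, op, other):
        def compute():
            # Resolve the right-hand side only when the result is requested.
            right = other.get() if isinstance(other, LazyPointer) else other
            return op(self.get(), right)

        return LazyPointer(compute)

    def __add__(self, other):
        return self._combine(operator.add, other)

    def __truediv__(self, other):
        return self._combine(operator.truediv, other)

    def sum(self):
        return LazyPointer(lambda: sum(self.get()))

    def get(self):
        # Nothing is evaluated until this point.
        return self._compute()


column = LazyPointer.of([201, 201, 241])  # toy values, not the real dataset
total = column.sum()            # still a pointer
halved = (total + 1) / 2        # arithmetic on pointers yields another pointer
print(type(total).__name__)
print(total.get())
print(halved.get())
```

Chaining operations builds up work to be done; only get() forces it, which mirrors why printing a pointer in the notebook shows its metadata rather than data.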


The number of rows should be queried as a differentially private count. The shape below reflects the number of columns in the dataset but not the number of rows, which is reported as -1.

In [17]:
df.shape

(-1, 49)
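For readers unfamiliar with differentially private counts, here is a minimal sketch of one common approach, the Laplace mechanism. The notebook does not state which mechanism GreyNsights uses, so this is an assumption for illustration only; dp_count, the epsilon value and the toy rows are all hypothetical.

```python
import random


def dp_count(rows, epsilon, rng):
    """Return len(rows) plus Laplace noise (hypothetical helper)."""
    # Adding or removing one row changes a count by at most 1, so the
    # sensitivity is 1 and the Laplace scale is 1 / epsilon.
    scale = 1.0 / epsilon
    # The difference of two exponential draws with mean `scale`
    # follows a Laplace(0, scale) distribution.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return len(rows) + noise


rng = random.Random(0)          # seeded only to make the sketch repeatable
rows = list(range(1000))        # toy data with a true count of 1000
noisy_a = dp_count(rows, epsilon=1.0, rng=rng)
noisy_b = dp_count(rows, epsilon=1.0, rng=rng)
print(noisy_a, noisy_b)
```

Each call returns a slightly different value close to the true count; a smaller epsilon gives more noise and stronger privacy.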

In [18]:
print("COLUMNS: ", df.columns)

COLUMNS:  Index(['ID', 'Source', 'TMC', 'Severity', 'Start_Time', 'End_Time',
       'Start_Lat', 'Start_Lng', 'End_Lat', 'End_Lng', 'Distance(mi)',
       'Description', 'Number', 'Street', 'Side', 'City', 'County', 'State',
       'Zipcode', 'Country', 'Timezone', 'Airport_Code', 'Weather_Timestamp',
       'Temperature(F)', 'Wind_Chill(F)', 'Humidity(%)', 'Pressure(in)',
       'Visibility(mi)', 'Wind_Direction', 'Wind_Speed(mph)',
       'Precipitation(in)', 'Weather_Condition', 'Amenity', 'Bump', 'Crossing',
       'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station',
       'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop',
       'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight',
       'Astronomical_Twilight'],
      dtype='object')


In [19]:
df.columns = [
    "ID",
    "Source",
    "TMC",
    "Severity",
    "Start_Time",
    "End_Time",
    "Start_Lat",
    "Start_Lng",
    "End_Lat",
    "End_Lng",
    "Distance_mi",
    "Description",
    "Number",
    "Street",
    "Side",
    "City",
    "County",
    "State",
    "Zipcode",
    "Country",
    "Timezone",
    "Airport_Code",
    "Weather_Timestamp",
    "Temperature_F",
    "Wind_Chill_F",
    "Humidity_%",
    "Pressure_in",
    "Visibility_mi",
    "Wind_Direction",
    "Wind_Speed_mph",
    "Precipitation_in",
    "Weather_Condition",
    "Amenity",
    "Bump",
    "Crossing",
    "Give_Way",
    "Junction",
    "No_Exit",
    "Railway",
    "Roundabout",
    "Station",
    "Stop",
    "Traffic_Calming",
    "Traffic_Signal",
    "Turning_Loop",
    "Sunrise_Sunset",
    "Civil_Twilight",
    "Nautical_Twilight",
    "Astronomical_Twilight",
]

<h3>Transforming original dataset into a subset of columns</h3>

In [20]:
df = df[
    [
        "ID",
        "Source",
        "TMC",
        "Severity",
        "Start_Time",
        "End_Time",
        "Start_Lat",
        "Start_Lng",
        "End_Lat",
        "End_Lng",
    ]
]


<h3>A wide range of data transformations applied on pointers</h3>

In [21]:
df["Somecol"] = (df["TMC"] + df["Severity"] / 10) / 2
(df["TMC"] + df["Severity"])



<GreyNsights.analyst.Pointer at 0x7ff3402b9f10>

In [22]:
df["Somecol"] = df["TMC"] + df["Severity"]

(df["TMC"] + df["Severity"] / 10) / 2

df["TMC"] > 2

(df["Severity"] > 8) | (df["TMC"] > 200)

df[df["TMC"] > 200]

df[(df["Severity"] > 8) | (df["TMC"] > 200)]



<GreyNsights.analyst.Pointer at 0x7ff340470610>

In [23]:
# Single condition (despite the name And_df, no AND is applied here)
And_df = df[(df["TMC"] > 200)]
# Multiple conditions: OR
Or_df = df[(df["Severity"] > 8) | (df["TMC"] > 200)]

In [24]:
And_df["TMC"].mean().get()

208.02294982131104

In [25]:
Or_df["TMC"].mean().get()

208.02277166764964

<h3>Sending a function across and passing pointers as arguments</h3>

In [26]:
def somefunc(x):
    return x + 2

somefunc_pt = dataset.send(somefunc)
df["Somecol"] = df["TMC"].apply(somefunc_pt)

In [28]:
print(df["Somecol"])


Pointer->Bob
	 	 dataset:Sample Data
	 	 dtype:<class 'pandas.core.series.Series'>
	 	 id:621218086782
	 	 port:65441
	 	 host:127.0.0.1



In [29]:
df["Somecol"].mean().get()

210.02201671942584