<h1>Accidents Example</h1>

In [1]:
from GreyNsights.analyst import DataWorker, DataSource, Pointer, Command, Analyst
from GreyNsights.frameworks import framework
import numpy as np

This notebook demonstrates how to use GreyNSights on a remote dataset hosted by a data owner. The primary aim is to show that pandas can be used as is, across a wide range of queries, to analyze and explore a remote data source. Before running this example, run datasource.py; this starts the data source server, which hosts the dataset and executes the requests made from this notebook.

In [2]:
# Pandas version of GreyNsights that performs queries remotely
frameworks = framework()
pandas = frameworks.pandas

The analyst identity has no underlying functionality for now; it is a placeholder for future features, such as providing an actual identity in the form of a certificate.

In [3]:
identity = Analyst("Alice", port=65441, host="127.0.0.1")

This connects to the remote data owner.

In [4]:
worker = DataWorker(port=65441, host="127.0.0.1")
dataset = DataSource(identity, worker, "Sample Data")

Get the data owner's config to understand the limitations set on querying the private dataset.

In [5]:
a = dataset.get_config()
print(a)



owner_name: Bob
dataset_name: Sample Data
privacy_budget: 10.0
trusted-aggregator: None
secret-sharing: Shamirs_scheme
private_columns: 

	-N
	-o
	-n
	-e
visible_columns: 
restricted_columns: default
allowed_queries: default
visible_queries: 
	-sum
	-count
	-mean
	-percentile
	-max
	-min
	-median
restricted_queries: None




In [6]:
a = a.approve().init_pointer()

Create a dataframe from the dataset. (It is already a dataframe; this step demonstrates GreyNSights' remote execution of pandas.)

In [7]:
df = pandas.DataFrame(a)

Variables and functions can be sent to the data source for remote execution using send(). send() returns a pointer to the variable, which now lives remotely.

In [8]:
p = 3
p = dataset.send(p)

In [9]:
# last 3 rows (p points to the value 3 sent above)
print(df.tail(p))


Pointer->Sample Data
	 	 dataset:Sample Data
	 	 dtype:<class 'pandas.core.frame.DataFrame'>
	 	 id:540050427194
	 	 port:65441
	 	 host:127.0.0.1



In [10]:
print(df)


Pointer->Sample Data
	 	 dataset:Sample Data
	 	 dtype:<class 'pandas.core.frame.DataFrame'>
	 	 id:944226731383
	 	 port:65441
	 	 host:127.0.0.1



The operations below act on the pointer, which ensures that they are executed remotely by the data source. The actual results are returned only when get() is called. The same functionality as pandas dataframes is available.

In [11]:
print(df["TMC"])


Pointer->Bob
	 	 dataset:Sample Data
	 	 dtype:<class 'pandas.core.series.Series'>
	 	 id:522977501930
	 	 port:65441
	 	 host:127.0.0.1



In [12]:
print(df["TMC"].sum())


Pointer->Sample Data
	 	 dataset:Sample Data
	 	 dtype:<class 'float'>
	 	 id:194139130899
	 	 port:65441
	 	 host:127.0.0.1



In [13]:
print(df["TMC"].sum().get())

515650733.5298479
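
The pointer behaviour seen in the cells above (operations return new pointers, and only get() returns a value) can be imitated locally with a small toy. This is a minimal sketch of the general idea, not GreyNSights' implementation: ToyStore and ToyPointer are hypothetical names, and the "remote" side is just a dictionary in the same process.

```python
import itertools
import operator


class ToyStore:
    """Stands in for the data owner's process; it holds the real objects."""

    def __init__(self):
        self._objects = {}
        self._ids = itertools.count(1)

    def put(self, value):
        """Store a value and hand back a pointer to it (similar in spirit to send())."""
        obj_id = next(self._ids)
        self._objects[obj_id] = value
        return ToyPointer(self, obj_id)

    def call(self, func, *args):
        """Run func on the stored objects and keep the result in the store."""
        resolved = [
            self._objects[a.obj_id] if isinstance(a, ToyPointer) else a
            for a in args
        ]
        return self.put(func(*resolved))

    def fetch(self, obj_id):
        """Return the real value behind a pointer id."""
        return self._objects[obj_id]


class ToyPointer:
    """Handle that forwards operations to the store instead of holding data."""

    def __init__(self, store, obj_id):
        self.store = store
        self.obj_id = obj_id

    def __add__(self, other):
        return self.store.call(operator.add, self, other)

    def sum(self):
        return self.store.call(sum, self)

    def get(self):
        return self.store.fetch(self.obj_id)

    def __repr__(self):
        return f"ToyPointer(id={self.obj_id})"


store = ToyStore()
column = store.put([201, 201, 406])  # made-up values, now held by the store
total = column.sum()                 # still a pointer, nothing is returned yet
shifted = total + 2                  # another pointer
print(total)                         # prints a pointer description, not the sum
print(shifted.get())                 # only get() brings the value back
```

In this toy, every intermediate result stays inside the store, which mirrors why printing df["TMC"].sum() above shows a pointer description rather than a number.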


In [14]:
print(df.describe().get())

                TMC      Severity     Start_Lat     Start_Lng       End_Lat  \
count  2.478818e+06  3.513617e+06  3.513617e+06  3.513617e+06  1.034799e+06   
mean   2.080226e+02  2.339929e+00  3.654195e+01 -9.579151e+01  3.755758e+01   
std    2.076627e+01  5.521935e-01  4.883520e+00  1.736877e+01  4.861215e+00   
min    2.000000e+02  1.000000e+00  2.455527e+01 -1.246238e+02  2.457011e+01   
25%    2.010000e+02  2.000000e+00  3.363784e+01 -1.174418e+02  3.399477e+01   
50%    2.010000e+02  2.000000e+00  3.591687e+01 -9.102601e+01  3.779736e+01   
75%    2.010000e+02  3.000000e+00  4.032217e+01 -8.093299e+01  4.105139e+01   
max    4.060000e+02  4.000000e+00  4.900220e+01 -6.711317e+01  4.907500e+01   

            End_Lng  Distance(mi)        Number  Temperature(F)  \
count  1.034799e+06  3.513617e+06  1.250753e+06    3.447885e+06   
mean  -1.004560e+02  2.816167e-01  5.975383e+03    6.193512e+01   
std    1.852879e+01  1.550134e+00  1.496624e+04    1.862106e+01   
min   -1.244978e+02 

In [15]:
print("TMC sum: ", df["TMC"].sum().get())
print("TMC std: ", df["TMC"].std().get())
print("Severity mean: ", df["Severity"].mean().get())

TMC sum:  515649671.3097857
TMC std:  20.766272454711583
Severity mean:  2.3399253624414285


The number of rows is not revealed directly (the shape below reports -1 rows); it should be queried as a differentially private count.

In [16]:
df.shape

(-1, 49)
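
For readers unfamiliar with differentially private counts, the following is a minimal sketch of the standard Laplace mechanism, using only the Python standard library. It is an illustration under stated assumptions, not necessarily what GreyNSights does internally: dp_count and laplace_noise are hypothetical helpers, and the epsilon value is arbitrary.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(rows, epsilon, rng):
    """Return a noisy row count.

    Adding or removing one row changes a count by at most 1 (sensitivity 1),
    so the Laplace noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return len(rows) + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)           # seeded only to make the sketch repeatable
rows = list(range(1000))         # stand-in for a private table with 1000 rows
noisy = dp_count(rows, epsilon=1.0, rng=rng)
print(noisy)                     # close to 1000, but not exactly 1000
```

A smaller epsilon gives stronger privacy and a noisier count; each such query would typically consume part of a privacy budget like the one listed in the data owner's config.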

In [17]:
print("COLUMNS: ", df.columns)

COLUMNS:  Index(['ID', 'Source', 'TMC', 'Severity', 'Start_Time', 'End_Time',
       'Start_Lat', 'Start_Lng', 'End_Lat', 'End_Lng', 'Distance(mi)',
       'Description', 'Number', 'Street', 'Side', 'City', 'County', 'State',
       'Zipcode', 'Country', 'Timezone', 'Airport_Code', 'Weather_Timestamp',
       'Temperature(F)', 'Wind_Chill(F)', 'Humidity(%)', 'Pressure(in)',
       'Visibility(mi)', 'Wind_Direction', 'Wind_Speed(mph)',
       'Precipitation(in)', 'Weather_Condition', 'Amenity', 'Bump', 'Crossing',
       'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station',
       'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop',
       'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight',
       'Astronomical_Twilight'],
      dtype='object')


In [18]:
df.columns = [
    "ID",
    "Source",
    "TMC",
    "Severity",
    "Start_Time",
    "End_Time",
    "Start_Lat",
    "Start_Lng",
    "End_Lat",
    "End_Lng",
    "Distance_mi",
    "Description",
    "Number",
    "Street",
    "Side",
    "City",
    "County",
    "State",
    "Zipcode",
    "Country",
    "Timezone",
    "Airport_Code",
    "Weather_Timestamp",
    "Temperature_F",
    "Wind_Chill_F",
    "Humidity_%",
    "Pressure_in",
    "Visibility_mi",
    "Wind_Direction",
    "Wind_Speed_mph",
    "Precipitation_in",
    "Weather_Condition",
    "Amenity",
    "Bump",
    "Crossing",
    "Give_Way",
    "Junction",
    "No_Exit",
    "Railway",
    "Roundabout",
    "Station",
    "Stop",
    "Traffic_Calming",
    "Traffic_Signal",
    "Turning_Loop",
    "Sunrise_Sunset",
    "Civil_Twilight",
    "Nautical_Twilight",
    "Astronomical_Twilight",
]

In [19]:
df = df[
    [
        "ID",
        "Source",
        "TMC",
        "Severity",
        "Start_Time",
        "End_Time",
        "Start_Lat",
        "Start_Lng",
        "End_Lat",
        "End_Lng",
    ]
]


In [20]:
df["Somecol"] = (df["TMC"] + df["Severity"] / 10) / 2
(df["TMC"] + df["Severity"])

df["LOL"] = df["TMC"]

In [21]:
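# Column arithmetic assigned to a new remote column
df["Somecol"] = df["TMC"] + df["Severity"]

# Arithmetic expression evaluated remotely (the result is a pointer)
(df["TMC"] + df["Severity"] / 10) / 2

# Boolean mask from a single condition
df["TMC"] > 2

# Boolean mask combining two conditions with OR
(df["Severity"] > 8) | (df["TMC"] > 200)

# Filter rows with a single condition
df[df["TMC"] > 200]

# Filter rows with multiple conditions: OR
df[(df["Severity"] > 8) | (df["TMC"] > 200)]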



<GreyNsights.analyst.Pointer at 0x7feda41c28b0>

In [22]:
# Single condition (despite its name, And_df does not combine conditions)
And_df = df[(df["TMC"] > 200)]
# Multiple conditions: OR
Or_df = df[(df["Severity"] > 8) | (df["TMC"] > 200)]

In [23]:
And_df["TMC"].mean().get()

208.02370964691175

In [24]:
Or_df["TMC"].mean().get()

208.02279032257925

In [25]:
def somefunc(x):
    return x + 2

somefunc_pt = dataset.send(somefunc)

In [26]:
df["Somecol"] = df["TMC"].apply(somefunc_pt)

In [27]:
print(df["Somecol"])


Pointer->Bob
	 	 dataset:Sample Data
	 	 dtype:<class 'pandas.core.series.Series'>
	 	 id:321943713955
	 	 port:65441
	 	 host:127.0.0.1



In [28]:
df["Somecol"].mean().get()

210.0228592498267
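
This result is consistent with the TMC mean of about 208.02 reported by describe() earlier: adding a constant to every value shifts the mean by the same constant. The short local check below illustrates that property with made-up values (they are not taken from the dataset). The small remaining difference in the notebook output is presumably due to noise added to private queries, which is an assumption rather than something the outputs confirm.

```python
from statistics import mean


def somefunc(x):
    """Same function that was sent to the data source above."""
    return x + 2


# Made-up sample values, used only to illustrate the shift property
values = [201.0, 201.0, 406.0, 200.0]
shifted = [somefunc(v) for v in values]

original_mean = mean(values)
shifted_mean = mean(shifted)
print(original_mean, shifted_mean)

# The mean moves by exactly the constant that was added
difference = shifted_mean - original_mean
```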