# Use a notebook with Kqlmagic to query data in a KQL database 
 
Use of this notebook is documented in the [Microsoft Fabric documentation](https://learn.microsoft.com/fabric/real-time-analytics/jupyter-notebook).

This flow uses publicly available Python packages.

Goal: Query the publicly available [NYC taxi](https://learn.microsoft.com/azure/open-datasets/dataset-taxi-yellow) dataset and use a basic clustering ML model to find the busiest taxi pickup hot spots in New York City.

Prerequisites: A KQL database with the NYC taxi data loaded. The queries in this notebook read from a table named "trips2".

## High-level notebook workflow
- Load dependencies using import commands
- Install and load the [Kqlmagic](https://pypi.org/project/Kqlmagic/) package to allow connectivity to the KQL database
- Authenticate to the KQL database
- Run KQL queries interactively from the Jupyter notebook
- Train a clustering model on a fraction of the data
- Plot the clustering results over the New York City taxi pickup locations

Start by loading the NumPy and pandas packages.

In [1]:
import numpy as np
import pandas as pd

Load the matplotlib packages used for the graphs.

In [None]:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

Install the Kqlmagic package to enable connectivity to the KQL database.

In [None]:
!pip install Kqlmagic --no-cache-dir --upgrade

Load the Kqlmagic extension into the notebook session.

In [None]:
%reload_ext Kqlmagic

Connect to your database using its URI, which you can find on the database details page.

This process uses the device code authentication flow: you receive a code to enter on a sign-in page, and are then asked to authenticate with your Microsoft Entra ID (formerly Azure Active Directory, AAD) credentials. Talk to your administrators if you run into authentication issues.

In [None]:
%kql kusto://code;cluster='enter-database-uri';database='enter-database-name'

This step counts the rows of the "trips2" table that have a 2014 pickup time and non-empty pickup coordinates.

In [None]:
%%kql trips2
| extend 
  pickup_datetime= tpepPickupDateTime
, pickup_latitude = startLat
, pickup_longitude = startLon
| where pickup_datetime between (datetime(2014-01-01)..datetime(2014-12-31))
| where isnotempty(pickup_latitude) and isnotempty(pickup_longitude)
| count

This cell shows that the render command is also available through Kqlmagic. Note that here it's KQL doing the rendering, not Python.

In [None]:
%%kql      // Note the %% magic syntax to send full cell contents to ADX (including comment marker //)
trips2
| extend 
  pickup_datetime= tpepPickupDateTime
, pickup_latitude = startLat
, pickup_longitude = startLon
| where pickup_datetime  between (datetime(2014-01-01)..datetime(2014-12-31))
| summarize count() by bin_at(pickup_datetime, 7d, datetime(2014-01-01))
| render timechart with(title='NYC 2014 Taxi Rides count per week')
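
For readers more familiar with pandas, the weekly binning done by `bin_at(pickup_datetime, 7d, datetime(2014-01-01))` can be sketched locally. This is only an illustration on four made-up timestamps; the helper name `bin_at` and the sample values are assumptions for the example, and the real aggregation runs in the KQL database.

```python
import pandas as pd

def bin_at(timestamps: pd.Series, bin_size: pd.Timedelta, fixed_point: pd.Timestamp) -> pd.Series:
    """Round each timestamp down to the start of its bin; bins are aligned to fixed_point."""
    whole_bins = (timestamps - fixed_point) // bin_size  # number of full bins since the anchor
    return fixed_point + whole_bins * bin_size

# Made-up pickup times, for illustration only
pickups = pd.Series(pd.to_datetime([
    "2014-01-01 00:05",
    "2014-01-07 23:59",
    "2014-01-08 00:00",
    "2014-01-20 12:00",
]))

weekly = bin_at(pickups, pd.Timedelta(days=7), pd.Timestamp("2014-01-01"))
rides_per_week = weekly.value_counts().sort_index()
print(rides_per_week)
```

The first two timestamps fall in the week starting 2014-01-01, the third starts a new week on 2014-01-08, and the last lands in the week starting 2014-01-15.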

The following query renames the columns, keeps only rows that have pickup coordinates, and returns three sample rows with the pickup and drop-off times and locations.

In [None]:
%%kql trips2 
| extend 
  pickup_datetime= tpepPickupDateTime
, dropoff_datetime = tpepDropoffDateTime
, pickup_latitude = startLat
, pickup_longitude = startLon
, dropoff_longitude = endLon
, dropoff_latitude = endLat
, vendor_id=vendorID
| where isnotempty(pickup_latitude) and isnotempty(pickup_longitude)
| project vendor_id, pickup_datetime, dropoff_datetime,pickup_longitude, pickup_latitude, dropoff_longitude,dropoff_latitude
| take 3

Define the latitude and longitude limits of the NYC area. Later cells use them to filter the queries and to set the plot axes.

In [None]:
south=40.61
north=40.91
west=-74.06
east=-73.77
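
As a quick sanity check on these limits, the same inclusive range test that the later queries express with `between` can be sketched in plain Python. The function name `in_nyc_bounds` and the two test coordinates (approximate values for Times Square and Newark Liberty Airport) are assumptions made for this example.

```python
# Bounding box from the cell above
south, north = 40.61, 40.91
west, east = -74.06, -73.77

def in_nyc_bounds(lat: float, lon: float) -> bool:
    """True when the point lies inside the box, edges included."""
    return south <= lat <= north and west <= lon <= east

# Approximate coordinates, for illustration only
times_square = (40.7580, -73.9855)
newark_airport = (40.6895, -74.1745)

print(in_nyc_bounds(*times_square))    # inside the box
print(in_nyc_bounds(*newark_airport))  # west of the box, so excluded
```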

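1. Specify the KQL query.
2. Cache the result in a local binary (pickle) file named after a hash of the KQL query string, so that rerunning the same query loads the saved dataframe instead of querying the database again.

NOTE: Python's built-in hash() is salted per interpreter process, and PYTHONHASHSEED is read only at interpreter startup, so setting it from a running notebook doesn't change hash(). The cache file name is therefore built from a SHA-256 digest of the query text, which is the same in every session.

In [None]:
import hashlib

def adx_query(q):
    # Stable cache key: digest of the query text only. Delete the .pkl files
    # if you change the Python variables that the query references.
    fn = "df" + hashlib.sha256(q.encode("utf-8")).hexdigest() + ".pkl"
    try:
        df = pd.read_pickle(fn)
        print("Load df from " + fn)
        return df
    except Exception:
        # Cache miss (or unreadable cache file): run the query
        print("Execute query...")
        %kql res << -query q
        try:
            df = res.to_dataframe()
            print("Save df to " + fn)
            df.to_pickle(fn)
            print("\n", df.shape, "\n", df.columns)
            return df
        except Exception as ex:
            print(ex)
            return None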
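
The choice of cache key matters because Python reads PYTHONHASHSEED once, at interpreter startup, so the value of hash() for a string depends on the seed the process was started with. The sketch below checks this by computing hash() of a query string in child processes started with different seeds, and compares it with a SHA-256 digest, which depends only on the text. The helper names `hash_in_new_process` and `stable_key` and the sample query are assumptions for this example, not part of the notebook.

```python
import hashlib
import os
import subprocess
import sys

QUERY = "trips2 | count"  # sample query text, for illustration only

def stable_key(query: str) -> str:
    """Digest that depends only on the query text."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()

def hash_in_new_process(query: str, seed: str) -> int:
    """Return hash(query) as computed by a fresh interpreter started with the given seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", query],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout)

print("hash() with seed 1:", hash_in_new_process(QUERY, "1"))
print("hash() with seed 2:", hash_in_new_process(QUERY, "2"))
print("SHA-256 key:       ", stable_key(QUERY))
```

The two hash() values differ because the processes were started with different seeds, while the digest is identical wherever it is computed.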

The "q" variable holds the main KQL query. It rounds the pickup coordinates to four decimal places and counts the pickups at each rounded location within the defined geographic boundary.

In [None]:
q = '''
set notruncation;
let South=south; let North=north; let West=west; let East=east; // copy Python variables to ADX
trips2
| extend 
  pickup_datetime= tpepPickupDateTime
, dropoff_datetime = tpepDropoffDateTime
, pickup_latitude = startLat
, pickup_longitude = startLon
, dropoff_longitude = endLon
, dropoff_latitude = endLat
, vendor_id=vendorID
| where pickup_datetime between (datetime(2014-01-01)..datetime(2014-12-31))
| where isnotempty(pickup_latitude) and isnotempty(pickup_longitude)
| extend Lat=round(pickup_latitude, 4), Long=round(pickup_longitude, 4)
| where Lat between(South..North) and Long between(West..East)
| summarize num_pickups=count() by Lat, Long
'''

aggr_pickups = adx_query(q)
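
The rounding-and-counting step of this query can be sketched in pandas on a tiny made-up sample. Four decimal places of a degree is roughly 11 meters of latitude, so nearby pickups collapse into the same grid cell. The column names mirror the query; the coordinate values are invented for the example.

```python
import pandas as pd

# Made-up pickup coordinates, for illustration only
trips = pd.DataFrame({
    "pickup_latitude":  [40.75801, 40.75804, 40.75812, 40.64131],
    "pickup_longitude": [-73.98551, -73.98549, -73.98551, -73.77809],
})

aggregated = (
    trips.assign(
        Lat=trips["pickup_latitude"].round(4),
        Long=trips["pickup_longitude"].round(4),
    )
    .groupby(["Lat", "Long"])
    .size()                               # pickups per rounded location
    .reset_index(name="num_pickups")
)
print(aggregated)
```

The first two rows round to the same location and are counted together, so four trips become three rows.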

Show the last four rows of the dataframe.

In [None]:
print(aggr_pickups[-4:])

Configure matplotlib for the map: turn off the axes grid and set the figure size to 15 x 15 inches.

In [None]:
new_style = {'grid':False}
matplotlib.rc('axes', **new_style)
from matplotlib import rcParams
rcParams['figure.figsize'] = [15, 15]

Draw a map by plotting every aggregated pickup location as a small white dot on a dark background. Areas with many pickup locations appear brighter.

In [None]:
plt.style.use('dark_background')
p = aggr_pickups.plot(kind='scatter', x='Long', y='Lat', color='white', xlim=(west, east), ylim=(south, north), s=0.02, alpha=0.6)

Take a random 0.1% sample of the raw data for training.

In [None]:
q = '''
set notruncation;
let South=south; let North=north; let West=west; let East=east; // copy Python variables to ADX
let sf=0.001; // Extract 0.1% of the raw data
trips2
| extend 
  pickup_datetime= tpepPickupDateTime
, dropoff_datetime = tpepDropoffDateTime
, pickup_latitude = startLat
, pickup_longitude = startLon
, dropoff_longitude = endLon
, dropoff_latitude = endLat
, vendor_id=vendorID
| where pickup_datetime between (datetime(2014-01-01)..datetime(2014-12-31))
| where pickup_latitude between(South..North) and pickup_longitude between(West..East)
| project pickup_datetime, pickup_latitude, pickup_longitude
| where rand() < sf'''

df = adx_query(q)
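
The `where rand() < sf` filter keeps each row independently with probability `sf`, so the sample size is close to, but not exactly, 0.1% of the rows. The sketch below imitates that in NumPy on one million imaginary rows; the row count and the fixed seed are assumptions chosen only to make the example repeatable.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed only so the sketch is repeatable
n_rows = 1_000_000                   # imaginary table size
sf = 0.001                           # sampling fraction, as in the query

keep = rng.random(n_rows) < sf       # one independent draw per row
sample_size = int(keep.sum())
print(sample_size, sample_size / n_rows)
```

Because the query result is cached in a pickle file, the sample drawn on the first run is reused on later runs of the notebook.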

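Define the clustering function. It fits MiniBatchKMeans when there are more than 1,000 rows and KMeans otherwise, and returns the centroids (with the number of rows in each cluster) together with the cluster label of every row.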

In [None]:
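def KMeans_clustering(k, features):
    from sklearn.cluster import KMeans, MiniBatchKMeans
    # Use the faster mini-batch variant when there are more than 1,000 rows
    km = MiniBatchKMeans(n_clusters=k) if features.shape[0] > 1000 else KMeans(n_clusters=k)
    km.fit(features)
    # One row per cluster center, with the same column names as the input features
    centroids = pd.DataFrame(km.cluster_centers_, columns=features.columns)
    # "num": number of input rows assigned to each cluster (aligned on cluster label)
    centroids.insert(features.shape[1], "num", pd.DataFrame(km.labels_, columns=["n"]).groupby("n").size())
    # "cluster_id": the cluster label, placed before "num"
    centroids.insert(features.shape[1], "cluster_id", range(k))
    return centroids, km.labels_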
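
To show what a k-means fit does under the hood, here is a NumPy-only sketch of the basic Lloyd iteration: assign every point to its nearest centroid, then move each centroid to the mean of its points. It is not scikit-learn's implementation (which adds smarter initialization and, for MiniBatchKMeans, batch updates), and the function name `lloyd_kmeans`, the synthetic data, and the seeds are assumptions for the example.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd iteration on an (n_points, n_features) float array."""
    rng = np.random.default_rng(seed)
    # Start from k input points picked at random (a simple init, not k-means++)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()

    def nearest(cents):
        # Distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - cents[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = nearest(centroids)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid where it is
                centroids[j] = members.mean(axis=0)
    return centroids, nearest(centroids)

# Two synthetic, well-separated pickup "hubs" (latitude, longitude)
data_rng = np.random.default_rng(42)
true_centers = np.array([[40.65, -73.78], [40.76, -73.98]])
points = np.vstack([data_rng.normal(c, 0.005, size=(200, 2)) for c in true_centers])

centroids, labels = lloyd_kmeans(points, k=2)
sizes = np.bincount(labels, minlength=2)
print(centroids)
print(sizes)
```

With blobs this well separated, each recovered centroid ends up close to one of the two true centers and each cluster holds 200 points.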

Cluster the sampled pickups into eight groups and display the centroids. Then mark the centroids on the map with stars, each sized by its cluster's share of the sample.

In [None]:
pickup_hub_loc, pickup_cluster = KMeans_clustering(8, df[['pickup_latitude', 'pickup_longitude']])
pickup_hub_loc

In [None]:
plt.scatter(x=aggr_pickups['Long'], y=aggr_pickups['Lat'], color='white', s=0.02, alpha=0.6)
plt.scatter(x=pickup_hub_loc['pickup_longitude'], y=pickup_hub_loc['pickup_latitude'], color='#ff00a0', marker='*', s=pickup_hub_loc['num']/len(df)*8000, alpha=0.6)
plt.show()
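
The star sizes come from the expression `pickup_hub_loc['num']/len(df)*8000`: each cluster's share of the sampled rows, scaled by 8000. Matplotlib interprets `s` as the marker area in points squared, so a cluster holding half of the sample gets a star with an area of 4000. The sketch below reproduces the arithmetic on made-up cluster counts; the dataframe `hubs` and its values are assumptions for the example.

```python
import pandas as pd

# Made-up cluster sizes, for illustration only
hubs = pd.DataFrame({"cluster_id": [0, 1, 2], "num": [500, 300, 200]})
n_sampled = int(hubs["num"].sum())  # stands in for len(df)

hubs["share"] = hubs["num"] / n_sampled
hubs["marker_area"] = hubs["num"] / n_sampled * 8000  # value passed to scatter(s=...)
print(hubs)
```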