# Evaluation criteria

The goal of this assignment is to get a view on your hands-on "data engineering" skills.  
At our company, our data scientists and engineers collaborate on projects.  
Your main focus will be creating performant & robust data flows.  
For a take-home assignment, we cannot grant you access to our infrastructure.  
The assignment below measures your proficiency in general programming, data science & engineering tasks using Python.  
Completion should not take more than half a day.

**We expect you to be proficient in:**
 * SQL queries (Sybase IQ system)
 * ETL flows (In collaboration with existing teams)
 * General Python to glue it all together
 * Python data science ecosystem (pandas + scikit-learn)
 
**In this exercise we expect you to demonstrate your ability to / knowledge of:**
 * Building a data science runtime
 * PEP8 / Google Python style guide
 * Efficiently getting the job done
 * Choosing meaningful names for variables & functions
 * Writing maintainable code (yes, you might need to document some steps)
 * Helping a data scientist present interactive results
 * Offering predictions via a REST API

# Setting-up a data science workspace

We allow you full freedom in setting up a data science runtime.  
The main objective is having a runtime where you can run this notebook and the code you will develop.  
You can choose a local setup on your PC, or even a cloud setup if you're up for it.   

**In your environment, you will need things for:**
 * HTTPS requests
 * Python 3 (not Python 2!)
 * (geo)pandas
 * interactive maps (e.g. folium, altair, ...)
 * REST APIs
 
**Deliverables we expect**:
 * notebook with the completed assignment
 * list of packages for your runtime (e.g. yml or txt file)
 * evidence of a working API endpoint

# Importing packages

We would like you to put all your import statements here, together in one place.  
Before submitting, please make sure you remove any unused imports :-)  

In [None]:
# your imports go here.  You get pandas for free.
import pandas as pd
import requests
import unittest
import geopandas as gpd
import folium
import json
import zipfile
import io
import os
import threading
import subprocess as sub
from pandas import json_normalize, DataFrame
from math import cos, sin, asin, sqrt, radians
from joblib import load
from pathlib import Path
from lightgbm.sklearn import LGBMRegressor
from pyproj import CRS

pd.set_option('display.max_columns', 500)

# Data ingestion exercises

## Getting store location data from an API

**Goal:** Obtain a pandas dataframe  
**Hint:** You will need to normalise/flatten the JSON, because it contains multiple nested levels  
**API call:** https://ecgplacesmw.colruytgroup.com/ecgplacesmw/v3/nl/places/filter/clp-places  

In [None]:
def get_clp_places(url: str) -> DataFrame:
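    """
    Retrieve the dataset at url and put it in a dataframe
    Parameters:
    url: link to dataset
    Returns:
    a dataframe containing the retrieved dataset
    """
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page

    return json_normalize(response.json())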


df_clp = get_clp_places("https://ecgplacesmw.colruytgroup.com/ecgplacesmw/v3/nl/places/filter/clp-places")
df_clp.head(10)

I left the placeSearchOpeningHours column unflattened because I assume we want one row per store, not one row per store and opening time.
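To make the trade-off concrete, here is a minimal sketch of both layouts with `json_normalize`. The payload is a hypothetical, heavily simplified stand-in for the API response: `placeId` and the fields inside the opening hours are assumptions for illustration, not the real schema.

```python
import pandas as pd

# Hypothetical, simplified payload; the real API returns many more fields.
sample_places = [
    {
        "placeId": 1,
        "geoCoordinates": {"latitude": 50.8, "longitude": 4.3},
        "placeSearchOpeningHours": [
            {"date": 20200101, "opens": 830, "closes": 2000},
            {"date": 20200102, "opens": 830, "closes": 2000},
        ],
    },
    {
        "placeId": 2,
        "geoCoordinates": {"latitude": 51.2, "longitude": 4.4},
        "placeSearchOpeningHours": [
            {"date": 20200101, "opens": 900, "closes": 1900},
        ],
    },
]

# One row per store: nested dicts become dotted columns, lists stay as lists.
stores = pd.json_normalize(sample_places)

# One row per store and opening day: unpack the list, carry the store id along.
opening_hours = pd.json_normalize(
    sample_places,
    record_path="placeSearchOpeningHours",
    meta=["placeId"],
)
```

The first frame keeps the opening hours as a list in a single cell, which matches the choice made above; the second is the alternative if a per-day table is ever needed.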

### Quality checks

We would like you to add several checks on this data based on these constraints:  
 * records > 200
 * latitude between 49 and 52
 * longitude between 2 and 7
 
We don't want you to create a full-blown test suite here; we're just going to use asserts from unittest.

In [None]:
# your code goes here
quality_check = unittest.TestCase('__init__')  # descriptive variable name for better readability

# checking records > 200
quality_check.assertGreater(len(df_clp), 200, "Dataframe is too small")


# normally I would isolate functions in a separate .py file, leaving it here for ease of reading
def count_in_range(col_name: str,
                   min_val: int,
                   max_val: int) -> int:

    """
    calculates number of values in col_name that fit within
    the closed interval [min_val, max_val]
    Parameters:
    col_name: name of column in df_clp
    min_val: minimum range value
    max_val: maximum range value
    Returns:
    number of rows in dataframe within range
    """
    # between() is inclusive on both ends and vectorized, unlike a row-wise apply
    return int(df_clp[col_name]
               .astype(float)
               .between(min_val, max_val)
               .sum())


# checking that latitude is in correct range
in_range_lat = count_in_range('geoCoordinates.latitude',
                              49,
                              52)
quality_check.assertEqual(in_range_lat,
                          len(df_clp),
                          f'{len(df_clp)-in_range_lat} records outside range')

# checking longitude is in correct range
in_range_lon = count_in_range('geoCoordinates.longitude',
                              2,
                              7)
quality_check.assertEqual(in_range_lon,
                          len(df_clp),
                          f'{len(df_clp)-in_range_lon} records outside range')

### Feature creation

Create a new column "antwerpen" which is 1 for all stores in Antwerpen (province) and 0 for all others 

In [None]:
# your code goes here
# info about postal codes from
# https://nl.wikipedia.org/wiki/Postnummers_in_Belgi%C3%AB#2000_-_2999_provincie_Antwerpen
df_clp['antwerpen'] = (df_clp['address.postalcode']
                       .astype(int)
                       .between(2000, 2999)
                       .astype(int))

df_clp["antwerpen"].value_counts()

## Predict used car value

A data scientist in our team made a basic model to predict car prices.  
The model was saved to disk ('lgbr_cars.model') using joblib's dump functionality.  
Documentation states the model is a LightGBM Regressor, trained using the scikit-learn API.  

**As an engineer, your task is to expose this model as a REST API.** 

First, retrieve the model via the function below.  
Change the path according to your setup.  

In [None]:
# your code goes here
def retrieve_model(path: str) -> LGBMRegressor:
    """
    Loads model from specified path
    Parameters:
    path: path to model
    Returns:
    model object
    """
    trained_model = load(path)
    return trained_model


lgbr_cars = retrieve_model("models/lgbr_cars.model")

# compare against the class itself rather than its string representation
quality_check.assertIsInstance(lgbr_cars, LGBMRegressor)

Now that you have your trained model, let's do a functional test based on the parameters below.  
You have to present the parameters in this order.  

* vehicleType: coupe
* gearbox: manuell
* powerPS: 190
* model: NaN
* kilometer: 125000
* monthOfRegistration: 5 
* fuelType: diesel
* brand: audi

Based on these parameters, you should get a predicted value of 14026.35068804.  
However, the model doesn't accept string inputs; see the integer encoding below:

In [None]:
model_test_input = [[3, 1, 190, -1, 125000, 5, 3, 1]]
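Because the model depends on the order of the features, that order can be made explicit in code. The sketch below is one possible helper; `FEATURE_ORDER` and `to_model_input` are hypothetical names, not part of the assignment. It only arranges values that are already integer-encoded, since the full string-to-integer mapping is not given beyond this single example.

```python
# Feature order as listed in the assignment text.
FEATURE_ORDER = [
    "vehicleType",
    "gearbox",
    "powerPS",
    "model",
    "kilometer",
    "monthOfRegistration",
    "fuelType",
    "brand",
]


def to_model_input(encoded_features: dict) -> list:
    """Arrange already-encoded feature values in the order the model expects."""
    missing = [name for name in FEATURE_ORDER if name not in encoded_features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    # The model predicts on a 2D structure, so wrap the single row in a list.
    return [[encoded_features[name] for name in FEATURE_ORDER]]


# Encoded values taken from the example above (coupe=3, manuell=1, NaN=-1, ...).
example_input = to_model_input({
    "brand": 1,
    "fuelType": 3,
    "vehicleType": 3,
    "gearbox": 1,
    "powerPS": 190,
    "model": -1,
    "kilometer": 125000,
    "monthOfRegistration": 5,
})
```

Passing the features by name avoids silently swapping two integer columns, which the model would accept without complaint.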

In [None]:
# your code goes here

def make_prediction(trained_model: LGBMRegressor, single_input: list) -> float:
    """
    Produces model prediction for a single instance
    Parameters:
    trained_model: a trained model
    single_input: a single instance to predict, wrapped in a list (one row of features)
    Returns:
    model prediction
    """
    predicted_value = trained_model.predict(single_input)[0]
    return predicted_value


predicted_value = make_prediction(lgbr_cars, model_test_input)

quality_check.assertAlmostEqual(predicted_value, 14026.35, places=2)

Now that you have this model up and running, we want you to **expose it as a REST API.**  
We don't expect you to set up any authentication.  
We're not looking for beautiful inputs, just make it work.  
**Building this endpoint should NOT be done in a notebook, but in proper .py file(s)**

Once it's up and running, use it to predict the following input:
* [-1,1,0,118,150000,0,1,38] ==> prediction should be 13920.70
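The endpoint itself lives outside this notebook. As a rough illustration only, a Flask app serving such a request could look like the sketch below. It is not the submitted `.py` file: `create_app` and `StubModel` are hypothetical names, and the stub replaces the real regressor so the example runs without the model file. The request shape (a JSON body with an `instance` key sent to `/predict`) mirrors the call made further down in this notebook.

```python
import json

from flask import Flask, jsonify, request


class StubModel:
    """Stand-in for the LightGBM regressor so the sketch runs without the model file."""

    def predict(self, rows):
        return [float(sum(row)) for row in rows]


def create_app(model):
    app = Flask(__name__)

    @app.route("/predict", methods=["GET", "POST"])
    def predict():
        # force=True parses the body as JSON even without a JSON content type
        payload = request.get_json(force=True, silent=True)
        instance = payload.get("instance") if isinstance(payload, dict) else None
        if not isinstance(instance, list) or len(instance) != 8:
            return jsonify(error="expected 'instance' with 8 encoded features"), 400
        prediction = model.predict([instance])[0]
        return jsonify(prediction=float(prediction))

    return app


# In a real app.py the stub would be replaced by the model loaded with joblib.
app = create_app(StubModel())

# Flask's test client exercises the route without starting a server.
client = app.test_client()
ok = client.get("/predict",
                data=json.dumps({"instance": [-1, 1, 0, 118, 150000, 0, 1, 38]}))
bad = client.get("/predict", data=json.dumps({"instance": [1, 2]}))
```

Validating the length of `instance` up front returns a readable 400 instead of an opaque model error when a feature is missing.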

In [None]:
def start_flask():
    """
    Starts a flask app in a separate thread
    to free up other notebook cells
    """
    sub.call('flask run', shell=True)


# start local flask server
threading.Thread(target=start_flask).start()

Please wait for the "Running on http://127.0.0.1:5000" message before running the next cell :)

In [None]:
# get prediction for test instance
response = requests.get('http://127.0.0.1:5000/predict',
                        data=json.dumps({'instance': [-1, 1, 0, 118, 150000, 0, 1, 38]}))
response.content

In [None]:
# shut down flask server
requests.post('http://127.0.0.1:5000/shutdown')

## Geospatial data exercise
The goal of this exercise is to read in some data from a shapefile and visualize it on a map.
- The map should be dynamic: we want to zoom in and out to see more interesting aspects of the map.
- We want you to visualize the statistical sectors within a distance of 2 km of your home location.

Specific steps to take:
- Read in the shapefile
- Transform to WGS 84 coordinates
- Create a distance function (Haversine)
- Create variables for home_lat, home_lon and perimeter_distance
- Calculate the centroid of each NIS district
- Calculate the distance to home for each NIS district centroid
- Figure out which NIS districts are near your home
- Create a dynamic, zoomable map
- Visualize the NIS districts near you (centroid < 2 km away) on the map


In [None]:
# part 1: Reading in the data
local_path = 'data/external/shapefiles/'
Path(local_path).mkdir(parents=True, exist_ok=True)

url = 'https://statbel.fgov.be/sites/default/files/files/opendata/Statistische%20sectoren/sh_statbel_statistical_sectors_20200101.shp.zip'
response = requests.get(url)
file = zipfile.ZipFile(io.BytesIO(response.content))
file.extractall(path=local_path)

# renaming this from df for ease of reading
gpd_stat_sectors = gpd.read_file(os.path.join(local_path,
                                              'sh_statbel_statistical_sectors_20200101.shp'))
# replaced deprecated init here
gpd_stat_sectors = gpd_stat_sectors.to_crs(CRS('epsg:4326'))

gpd_stat_sectors['centroid_lon'] = gpd_stat_sectors.centroid.x
gpd_stat_sectors['centroid_lat'] = gpd_stat_sectors.centroid.y

Here I assume that the statistical sectors are small enough to neglect the curvature of the earth, so even with a geographic CRS the centroids are close enough to the true ones.
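If that assumption ever becomes a concern, an alternative is to compute the centroids in the projected CRS first and reproject only the resulting points. The sketch below assumes the source data is in Belgian Lambert 72 (EPSG:31370), which is worth confirming via the `.crs` attribute of the loaded frame; the square polygon is a made-up toy sector near Brussels, not real Statbel data.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Toy 1 km x 1 km sector in Lambert 72 coordinates (metres), for illustration only.
square = Polygon([(148000, 168000), (149000, 168000),
                  (149000, 169000), (148000, 169000)])
sectors = gpd.GeoDataFrame({"name": ["toy sector"]},
                           geometry=[square],
                           crs="EPSG:31370")

# Centroids are computed in the projected CRS, then only the points are reprojected.
centroids_wgs84 = sectors.geometry.centroid.to_crs("EPSG:4326")
sectors["centroid_lon"] = centroids_wgs84.x
sectors["centroid_lat"] = centroids_wgs84.y
```

This ordering also avoids the warning geopandas emits when `centroid` is called on geometries in a geographic CRS.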

In [None]:
# Let's create some variables to indicate the location of your interest
home_lat = 50.82449164657
home_lon = 4.345775663707
perimeter_distance = 2  # km

In [None]:
# At some point we will need a distance function (google the Haversine formula, and implement it)
def haversine(lat1: float,
              lon1: float,
              lat2: float,
              lon2: float) -> float:
    """
    Calculate Haversine distance between points (lat1, lon1) and (lat2, lon2)
    Parameters:
    lat1: latitude of first point in decimal degrees
    lon1: longitude of first point in decimal degrees
    lat2: latitude of second point in decimal degrees
    lon2: longitude of second point in decimal degrees
    Returns:
    distance in kilometers between the two points
    """

    # convert degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    distance_lon = lon2 - lon1
    distance_lat = lat2 - lat1
    a = sin(distance_lat/2)**2 + cos(lat1)*cos(lat2)*sin(distance_lon/2)**2
    c = 2*asin(sqrt(a))
    radius = 6371  # earth radius in km

    return c*radius

Next, implement some sanity checks for your distance function 
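For reference when writing these checks, the function above follows the standard haversine formula. Here the latitudes $\varphi_1, \varphi_2$ and longitudes $\lambda_1, \lambda_2$ are in radians, and $R = 6371$ km is the earth radius used in the code:

```latex
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
    + \cos\varphi_1 \, \cos\varphi_2 \, \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right),
\qquad
d = 2R \arcsin\!\left(\sqrt{a}\right)
```

Two consequences make convenient test cases: identical points give $a = 0$ and therefore $d = 0$, while antipodal points such as the two poles give $a = 1$ and therefore $d = \pi R \approx 20015$ km.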

In [None]:
# implement sanity checks here
tc_haversine = unittest.TestCase('__init__')

# distance between same points is zero
tc_haversine.assertEqual(haversine(1., 0., 1., 0.), 0.)
tc_haversine.assertAlmostEqual(haversine(90., 0., -270., 0.), 0.)
tc_haversine.assertAlmostEqual(haversine(90., 0., 450., 0.), 0)  # 450 = 90+360

# distance between north and south poles is approx. 20015 km
tc_haversine.assertAlmostEqual(haversine(90., 0., -90., 0.), 20015., places=0)

# distance of flight from london to new york, data taken from
# https://www.distance.to/London/New-York
tc_haversine.assertAlmostEqual(haversine(51.500153, -0.126236, 40.714268, -74.005974),
                               5570,
                               places=0)

# basic input vars type check
with tc_haversine.assertRaises(TypeError) as context:
    haversine('a', 2, 3, 4)
tc_haversine.assertEqual(TypeError, type(context.exception))

Now, create a dynamic map 
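The first step towards the map is the distance from every centroid to home. A row-wise `apply` calls the Python function once per sector; a vectorized variant is a possible alternative when that becomes slow. The sketch below is an assumption-laden illustration, not the implementation used in this notebook: `haversine_np` is a hypothetical name and the three centroids are made-up points around the home location.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # same radius as the scalar haversine function


def haversine_np(lat1, lon1, lat2, lon2):
    """Haversine distance in km; accepts scalars or NumPy arrays in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Made-up centroids: home itself, one degree north, and 0.1 degree east.
centroid_lat = np.array([50.82449164657, 51.82449164657, 50.82449164657])
centroid_lon = np.array([4.345775663707, 4.345775663707, 4.445775663707])

distances_km = haversine_np(centroid_lat, centroid_lon,
                            50.82449164657, 4.345775663707)
is_near_home = distances_km < 2  # boolean mask, usable to filter a dataframe
```

With a dataframe, the two centroid columns can be passed in directly as arrays, and the resulting mask filters the sectors in one step.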

In [None]:
# implementation of the map goes here
# calculate distance to home
gpd_stat_sectors['distance_to_home'] = (gpd_stat_sectors[['centroid_lon', 'centroid_lat']]
                                        .apply(lambda x: haversine(x['centroid_lat'],
                                                                   x['centroid_lon'],
                                                                   home_lat,
                                                                   home_lon),
                                        axis=1))

In [None]:
# keep only sectors whose centroid is less than perimeter_distance (2 km) from home
gpd_close_to_home = gpd_stat_sectors[gpd_stat_sectors['distance_to_home'] < perimeter_distance]

In [None]:
# init map
coords_belgium = [50.5039, 4.4699]
stat_sectors_map = folium.Map(location=coords_belgium, control_scale=True, zoom_start=8)

# add nearby stat sectors
gpd_close_to_home.apply(lambda x: folium.Circle(location=[x['centroid_lat'], x['centroid_lon']],
                                                radius=10,
                                                fill=True,
                                                color='blue',
                                                popup=x['T_SEC_NL'])
                        .add_to(stat_sectors_map), axis=1)

# add home point for reference
folium.Circle(location=[home_lat, home_lon],
              radius=10,
              fill=True,
              color='red',
              popup='home').add_to(stat_sectors_map)

In [None]:
stat_sectors_map