# Predict NYC Taxi Tips 
This notebook ingests and prepares data, then trains a model on the previously configured dataset, which tracks NYC Yellow Taxi trips and various attributes around them. The goal is to predict, for a given trip, whether there will be a tip. The model is then converted to ONNX format and tracked by MLFlow.
We will later use the ONNX model for inferencing in Azure Synapse SQL Pool using the new model scoring wizard.
## Note:
**Please note that for successful conversion to ONNX, this notebook requires Scikit-learn version 0.20.3.**
This is already included in the "Python 3.6 - AzureML" kernel. Confirm that this kernel is selected and that you have already attached the compute.



## Load data
Get a sample of the NYC Yellow Taxi data from the Azure ML Dataset. You will need to restart the kernel! Do this by changing the kernel in the top-right corner.

In [20]:
#%pip list
%pip install scikit-learn==0.20.3
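
After the kernel restart, it can help to confirm that the pinned version is the one actually loaded. The following is a minimal sketch; `REQUIRED_VERSION` and `installed_version` are names introduced here for illustration and are not part of the original notebook.

```python
import sklearn

# Version pinned by the %pip install cell above
REQUIRED_VERSION = '0.20.3'

installed_version = sklearn.__version__
if installed_version == REQUIRED_VERSION:
    print('scikit-learn', installed_version, 'is active')
else:
    # A mismatch usually means the kernel was not restarted after the install
    print('Expected scikit-learn', REQUIRED_VERSION, 'but found', installed_version,
          '- restart the kernel and re-run this cell')
```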

In [106]:
from azureml.core import Dataset, Run, Workspace


# Dataset name (use the name registered in your workspace)
dataset_name = 'TaxiH1Curated'

ws = Workspace.from_config()

# Get a dataset by name
df_dataset = Dataset.get_by_name(workspace = ws,
                                 name = dataset_name, 
                                 version = 1)


# Load a TabularDataset into pandas DataFrame
df = df_dataset.to_pandas_dataframe()


In [107]:
from IPython.display import display
import numpy
import pandas
pandas.set_option('display.max_columns', None)

sampled_df = df.copy()
display(sampled_df.head(5))


Unnamed: 0,VendorID,PickUpDateTime,DropOffDateTime,PassengerCount,TripDistance,PickUpLocationID,PickUpLocationZone,PickUpLocationBorough,DropOffLocationID,DropOffLocationZone,DropOffLocationBorough,PaymentTypeID,PaymentTypeDescription,FareAmount,ExtraAmount,MTATaxAmount,TipAmount,TollsAmount,ImprovementSurchargeAmount,TotalRideAmount
0,2,2019-01-17,2019-01-17,1,0.98,186,Penn Station/Madison Sq West,Manhattan,68,East Chelsea,Manhattan,2,Cash,5.5,0.0,0.5,0.0,0.0,0.3,6.3
1,2,2019-01-09,2019-01-09,1,1.14,75,East Harlem South,Manhattan,74,East Harlem North,Manhattan,2,Cash,5.5,0.0,0.5,0.0,0.0,0.3,6.3
2,2,2019-01-14,2019-01-14,1,0.73,148,Lower East Side,Manhattan,144,Little Italy/NoLiTa,Manhattan,2,Cash,5.0,0.0,0.5,0.0,0.0,0.3,5.8
3,2,2019-01-12,2019-01-12,1,0.49,163,Midtown North,Manhattan,162,Midtown East,Manhattan,2,Cash,5.0,0.0,0.5,0.0,0.0,0.3,5.8
4,2,2019-01-12,2019-01-12,1,0.99,238,Upper West Side North,Manhattan,239,Upper West Side South,Manhattan,2,Cash,6.0,0.0,0.5,0.0,0.0,0.3,6.8


## Prepare and featurize data
- The dataset has extra dimensions that are not useful to the model. We keep only the dimensions we need and put them into the featurized dataframe.
- The data also contains outliers, so we need to filter them out.

In [108]:

def get_pickup_time(row):
    # Bin the pickup hour of a single row into a coarse time-of-day category
    pickup_hour = row['pickupHour']
    if 7 <= pickup_hour <= 10:
        return 'AMRush'
    elif 11 <= pickup_hour <= 15:
        return 'Afternoon'
    elif 16 <= pickup_hour <= 19:
        return 'PMRush'
    else:
        return 'Night'

featurized_df = pandas.DataFrame()
featurized_df['tipped'] = (sampled_df['TipAmount'] > 0).astype('int')
featurized_df['fareAmount'] = sampled_df['FareAmount'].astype('float32')
featurized_df['paymentType'] = sampled_df['PaymentTypeID'].astype('int')
featurized_df['passengerCount'] = sampled_df['PassengerCount'].astype('int')
featurized_df['tripDistance'] = sampled_df['TripDistance'].astype('float32')
featurized_df['pickupHour'] = sampled_df['PickUpDateTime'].dt.hour.astype('int')
featurized_df['TotalRideAmount'] = sampled_df['TotalRideAmount'].astype('float32')

featurized_df['pickupTimeBin'] = featurized_df.apply(get_pickup_time, axis=1)
featurized_df = featurized_df.drop(columns='pickupHour')

display(featurized_df.head(5))



Unnamed: 0,tipped,fareAmount,paymentType,passengerCount,tripDistance,TotalRideAmount,pickupTimeBin
0,0,5.5,2,1,0.98,6.3,Night
1,0,5.5,2,1,1.14,6.3,Night
2,0,5.0,2,1,0.73,5.8,Night
3,0,5.0,2,1,0.49,5.8,Night
4,0,6.0,2,1,0.99,6.8,Night
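
The row-wise `apply` above calls a Python function once per row, which can be slow on millions of trips. As a hedged alternative, the same hour ranges can be binned in one vectorized pass. This is only a sketch: `bin_pickup_hours` and `example_hours` are hypothetical names that do not appear in the original notebook, and the bin edges are copied from `get_pickup_time`.

```python
import numpy
import pandas

def bin_pickup_hours(hours):
    """Map a Series of pickup hours (0-23) to the same bins as get_pickup_time."""
    # Series.between is inclusive on both ends, matching the >= / <= checks above
    conditions = [
        hours.between(7, 10),
        hours.between(11, 15),
        hours.between(16, 19),
    ]
    choices = ['AMRush', 'Afternoon', 'PMRush']
    # Any hour outside the three ranges falls back to 'Night'
    binned = numpy.select(conditions, choices, default='Night')
    return pandas.Series(binned, index=hours.index)

example_hours = pandas.Series([0, 7, 10, 11, 15, 16, 19, 23])
binned_hours = bin_pickup_hours(example_hours)
print(binned_hours.tolist())
```

In the notebook this would be applied as `bin_pickup_hours(featurized_df['pickupHour'])`, assuming the column still exists at that point.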


In [109]:
filtered_df = featurized_df[(featurized_df.tipped >= 0) & (featurized_df.tipped <= 1)\
    & (featurized_df.fareAmount >= 1) & (featurized_df.fareAmount <= 250)\
    & (featurized_df.paymentType >= 1) & (featurized_df.paymentType <= 2)\
    & (featurized_df.passengerCount > 0) & (featurized_df.passengerCount < 8)\
    & (featurized_df.tripDistance >= 0) & (featurized_df.tripDistance <= 100)]

# Inspect the filtered data
filtered_df.info()

filtered_df.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7505194 entries, 0 to 7550459
Data columns (total 7 columns):
tipped             int64
fareAmount         float32
paymentType        int64
passengerCount     int64
tripDistance       float32
TotalRideAmount    float32
pickupTimeBin      object
dtypes: float32(3), int64(3), object(1)
memory usage: 372.2+ MB


Unnamed: 0,tipped,fareAmount,paymentType,passengerCount,tripDistance,TotalRideAmount
count,7505194.0,7505194.0,7505194.0,7505194.0,7505194.0,7505194.0
mean,0.6897795,12.22318,1.280273,1.593187,2.812785,15.49892
std,0.4625838,10.92565,0.4491324,1.22,3.682295,13.57638
min,0.0,1.0,1.0,1.0,0.0,1.0
25%,0.0,6.0,1.0,1.0,0.91,8.3
50%,1.0,9.0,1.0,1.0,1.53,11.3
75%,1.0,13.5,2.0,2.0,2.8,16.56
max,1.0,250.0,2.0,7.0,96.13,3345.3
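
To see which rule removes the most rows, the range checks can be counted per column. The sketch below runs on a small made-up frame; `bounds`, `count_out_of_range` and `toy` are illustrative names, and the bounds simply mirror three of the filters above. Note that `TotalRideAmount` is not range-checked by the filter, and its maximum in the summary above is 3345.3.

```python
import pandas

# Inclusive (low, high) bounds mirroring the filter above
bounds = {
    'fareAmount': (1, 250),
    'tripDistance': (0, 100),
    'passengerCount': (1, 7),
}

def count_out_of_range(frame, bounds):
    """Return, per column, how many rows fall outside the inclusive bounds."""
    return {
        column: int((~frame[column].between(low, high)).sum())
        for column, (low, high) in bounds.items()
    }

# Made-up rows for illustration only, not taken from the taxi dataset
toy = pandas.DataFrame({
    'fareAmount': [5.5, 0.0, 300.0, 12.0],
    'tripDistance': [0.98, 1.2, 150.0, 3.4],
    'passengerCount': [1, 0, 2, 9],
})
rejected = count_out_of_range(toy, bounds)
print(rejected)
```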


## Split training and testing data sets
- 70% of the data is used to train the model.
- 30% of the data is used to test the model.

In [110]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(filtered_df, test_size=0.3, random_state=123)

# Features: every column except the label
x_train = train_df.drop(columns=['tipped'])
# Label: kept as a single-column DataFrame
y_train = train_df[['tipped']]

x_test = test_df.drop(columns=['tipped'])
y_test = test_df[['tipped']]
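
A quick way to sanity-check a split like the one above is to compare the partition sizes and the share of tipped trips on each side. This is a minimal sketch on made-up data; `toy`, `train_toy` and `test_toy` are hypothetical names used only here.

```python
import pandas
from sklearn.model_selection import train_test_split

# Made-up frame: 100 rows with alternating labels
toy = pandas.DataFrame({'tipped': [0, 1] * 50, 'fareAmount': list(range(100))})

# Same test_size and random_state as the cell above
train_toy, test_toy = train_test_split(toy, test_size=0.3, random_state=123)

print(len(train_toy), len(test_toy))
# Share of tipped trips in each partition; the two values should be close
print(train_toy['tipped'].mean(), test_toy['tipped'].mean())
```

If the two shares differed noticeably, passing the label column through the `stratify` parameter of `train_test_split` would be one option to keep them aligned.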

## Export test data as CSV
Export the test data as a CSV file. Later, the CSV file can be loaded into a Synapse SQL pool to test the model.

In [111]:
test_df.to_csv('test_data.csv', index=False)
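
Before loading the file elsewhere, it may be worth reading it back to confirm the columns survived the export. The sketch below uses a temporary directory and a tiny made-up frame, so `toy_test`, `csv_path` and `reloaded` are illustrative names only.

```python
import os
import tempfile

import pandas

# Made-up rows with the same kind of columns as the test set
toy_test = pandas.DataFrame({
    'tipped': [0, 1],
    'fareAmount': [5.5, 9.0],
    'pickupTimeBin': ['Night', 'AMRush'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'test_data.csv')
    # index=False keeps the DataFrame index out of the file, as in the cell above
    toy_test.to_csv(csv_path, index=False)
    reloaded = pandas.read_csv(csv_path)

print(list(reloaded.columns))
print(reloaded.shape)
```

Keep in mind that `read_csv` infers types from the text, so columns stored as `float32` in the notebook come back as `float64` by default.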

## Train model
Train a binary classifier to predict whether a taxi trip will be tipped or not.

Try training with and without 'paymentType' in `integer_features` and observe the change in the results.


In [112]:
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

float_features = ['fareAmount', 'tripDistance', 'TotalRideAmount']
float_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

integer_features = ['paymentType', 'passengerCount']
integer_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['pickupTimeBin']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('float', float_transformer, float_features),
        ('integer', integer_transformer, integer_features),
        ('cat', categorical_transformer, categorical_features)
    ])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

# Train the model
clf.fit(x_train, y_train)

  y = column_or_1d(y, warn=True)


Pipeline(memory=None,
     steps=[('preprocessor', ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('float', Pipeline(memory=None,
     steps=[('imputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='median', ver...enalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))])

In [113]:
# Evaluate the model (mean accuracy on the test set)
score = clf.score(x_test, y_test)
print(score)

0.9803416210723326
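
Accuracy is easier to interpret next to a majority-class baseline. The filtered-data summary above reports a mean of about 0.69 for `tipped`, so a model that always predicts a tip would score roughly 0.69 on data with the same class balance. The helper below is a hypothetical sketch (it is not part of the original notebook) that computes this baseline for any 0/1 label array.

```python
import numpy

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent 0/1 label."""
    labels = numpy.asarray(labels).ravel()
    positive_rate = labels.mean()
    # Whichever class is more common sets the baseline
    return float(max(positive_rate, 1 - positive_rate))

# Made-up labels: three tipped trips and one untipped trip
baseline = majority_baseline_accuracy([1, 1, 1, 0])
print(baseline)
```

In the notebook, `majority_baseline_accuracy(y_test)` could be printed alongside `score` for comparison.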


## Convert the model to ONNX format
Currently, T-SQL scoring supports only the ONNX model format (https://onnx.ai/).

In [115]:
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType, Int64TensorType, DoubleTensorType, StringTensorType

def convert_dataframe_schema(df, drop=None):
    inputs = []
    for k, v in zip(df.columns, df.dtypes):
        if drop is not None and k in drop:
            continue
        if v == 'int64':
            t = Int64TensorType([1, 1])
        elif v == 'float32':
            t = FloatTensorType([1, 1])
        elif v == 'float64':
            t = DoubleTensorType([1, 1])
        else:
            t = StringTensorType([1, 1])
        inputs.append((k, t))
    return inputs

model_inputs = convert_dataframe_schema(x_train)
onnx_model = convert_sklearn(clf, "nyc_taxi_tip_predict", model_inputs)

The maximum opset needed by this model is only 11.
The maximum opset needed by this model is only 1.
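
Because `convert_dataframe_schema` declares one named input per column, the column dtypes decide which tensor type each input gets. The sketch below reproduces only that dtype-to-type-name mapping with plain strings, so it runs without `skl2onnx`. `describe_onnx_inputs` and `toy_features` are hypothetical names, and the type names are taken from the imports in the cell above.

```python
import numpy
import pandas

def describe_onnx_inputs(frame):
    """Return (column, tensor type name) pairs using the same dtype rules as above."""
    type_names = {
        'int64': 'Int64TensorType',
        'float32': 'FloatTensorType',
        'float64': 'DoubleTensorType',
    }
    # Any other dtype falls through to a string input, as in the else branch above
    return [
        (column, type_names.get(str(dtype), 'StringTensorType'))
        for column, dtype in zip(frame.columns, frame.dtypes)
    ]

# Made-up single-row frame with explicit dtypes
toy_features = pandas.DataFrame({
    'fareAmount': numpy.array([5.5], dtype='float32'),
    'paymentType': numpy.array([2], dtype='int64'),
    'tripDistance': numpy.array([0.98], dtype='float64'),
    'pickupTimeBin': ['Night'],
})
onnx_inputs = describe_onnx_inputs(toy_features)
print(onnx_inputs)
```

The `float64` column here maps to a different tensor type than a `float32` one, so data prepared for scoring likely needs the same dtypes as `x_train`.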


## Register the model with MLFlow

In [116]:
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, sep='\n')

amlworkspacekpq2soskyd5b4
MDW-Lab
eastus


In [117]:
import mlflow
import mlflow.onnx

from mlflow.models.signature import infer_signature

experiment_name = 'nyc_taxi_tip_predict_exp'
artifact_path = 'nyc_taxi_tip_predict_artifact'

mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
mlflow.set_experiment(experiment_name)

with mlflow.start_run() as run:
    # Infer signature
    input_sample = x_train.head(1)
    output_sample = pandas.DataFrame(columns=['output_label'], data=[1])
    signature = infer_signature(input_sample, output_sample)

    # Log the ONNX model as an artifact of this run
    mlflow.onnx.log_model(onnx_model, artifact_path, signature=signature, input_example=input_sample)

    # Register the model to AML model registry
    mlflow.register_model('runs:/' + run.info.run_id + '/' + artifact_path, 'nyc_taxi_tip_predict')


Registered model 'nyc_taxi_tip_predict' already exists. Creating a new version of this model...
2020/12/15 17:16:06 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: nyc_taxi_tip_predict, version 4
Created version '4' of model 'nyc_taxi_tip_predict'.
