<h1 align='Center'>
<img src="https://assets.digihaul.com/images/logo.png" width="350" height="550" align="center"/>
</h1>
<h3 align='center'> Task 3</h3>

### Problem Statement

Road haulage is essential for the people and businesses of the UK. Approximately 90% of all goods transported by land in Great Britain are moved directly by road. DigiHaul is a digital transport business, specialising in managing, consolidating and integrating data from both Carriers and Shippers to deliver a seamless end-to-end logistics service.
Shippers book shipments on the DigiHaul platform, detailing the scheduled collection and delivery time windows / locations and required vehicle types for carriers to consider. Once a carrier accepts a job and collection is scheduled, DigiHaul’s driver app facilitates real-time tracking of shipments through GPS signals, subject to carriers granting permissions for location logging.


#### Python packages:

- data handling packages   : pandas, numpy, re
- modelling packages       : sklearn, pickle
- geo package              : googlemaps

In [None]:
# Importing data handling python packages

import pandas as pd
import re
import numpy as np
import pickle
import warnings

from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import classification_report

import googlemaps

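import os

# Read the API key from the environment instead of hard-coding it in the notebook
# (the variable name GOOGLE_MAPS_API_KEY is an assumption).
g_client = googlemaps.Client(key=os.environ["GOOGLE_MAPS_API_KEY"])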

### Input data:

Three datasets are provided for the tasks, each in .csv format, and a fourth is re-used from Task 1:

1. gps_data.csv: contains the GPS logs (latitude and longitude) for each shipment, along with timestamps

2. shipment.csv: contains the details of each shipment booked: shipper, carrier, and the collection and delivery postcode, latitude, longitude and time

3. new_booking.csv: new shipments booked by the shippers

4. shipment_gps_data.csv: contains only the shipments for which valid GPS information was captured

In [149]:
# reading in input datasets

gps_data = pd.read_csv("./Test/GPS_data.csv")
shipment = pd.read_csv("./Test/Shipment_bookings.csv")
new_booking = pd.read_csv("./Test/New_bookings.csv")
valid_ship_gps = pd.read_csv(
    "./shipment_gps_data.csv"
)  # shipments with valid GPS logs (re-using data created from Task 1)

### Pre-processing:


In [None]:
class PotentialDelayPrediction:
    """
    Class to build a model and predict if there is a potential delay or not
    """

    def __init__(self, data, valid_data):
        """
        To initialize all necessary class variables
        """
        self.data = data
        self.valid_data = valid_data
        self.mode = "Train"
        self.columns_to_model = [
            "PROJECT_ID_ENCODED",
            "CARRIER_DISPLAY_ID_ENCODED",
            "VEHICLE_SIZE_ENCODED",
            "VEHICLE_BUILD_UP_ENCODED",
            "DAY_OF_FIRST_COLLECTION_SCHEDULE_EARLIEST",
            "DAY_OF_FIRST_COLLECTION_SCHEDULE_LATEST",
            "DAY_OF_LAST_DELIVERY_SCHEDULE_EARLIEST",
            "DAY_OF_LAST_DELIVERY_SCHEDULE_LATEST",
            "PERIOD_OF_FIRST_COLLECTION_SCHEDULE_EARLIEST",
            "PERIOD_OF_FIRST_COLLECTION_SCHEDULE_LATEST",
            "PERIOD_OF_LAST_DELIVERY_SCHEDULE_EARLIEST",
            "PERIOD_OF_LAST_DELIVERY_SCHEDULE_LATEST",
            "WEEKNUM_OF_FIRST_COLLECTION_SCHEDULE_EARLIEST",
            "WEEKNUM_OF_FIRST_COLLECTION_SCHEDULE_LATEST",
            "WEEKNUM_OF_LAST_DELIVERY_SCHEDULE_EARLIEST",
            "WEEKNUM_OF_LAST_DELIVERY_SCHEDULE_LATEST",
            "SCALED_DISTANCE_TO_DESTINATION",
            "SCALED_TIME_TO_DESTINATION",
            "DELAY?",
        ]

    def format_time_stamp(self, input_col_name, output_col_name):
        """
        Function to format the input time stamp to python understandable format
        Args:
            input_col_name (str): name of input column with time-stamp
            output_col_name(str): name of new output column to fill the generated output
        Returns:
            Changes the dataframes with new columns
        """
        self.data[output_col_name] = pd.to_datetime(
            self.data[input_col_name].apply(
                lambda x: (
                    re.sub("T", " ", x[:-9])
                    if "Z" not in x
                    else re.sub("T", " ", x[:-5])
                )
            )
        )

    def get_transit_details_google(self, x):
        """
        Function to call the Google Directions API to retrieve the distance and ETA between two locations
        Args:
            x (row iter): row iter from the lambda function call
        Returns:
            distance in km
            time in mins
        """
        source = (
            str(x["FIRST_COLLECTION_LATITUDE"])
            + ","
            + str(x["FIRST_COLLECTION_LONGITUDE"])
        )
        destination = (
            str(x["LAST_DELIVERY_LATITUDE"]) + "," + str(x["LAST_DELIVERY_LONGITUDE"])
        )
        result = g_client.directions(
            source, destination, mode="driving", avoid="ferries", transit_mode="bus"
        )
        distance = result[0]["legs"][0]["distance"]["value"]
        time = result[0]["legs"][0]["duration"]["value"]
        return distance / 1000, time / 60

    def get_label_encoding(self, column_name):
        """
        Method to convert categorical values to numeric entities
        Args:
            column_name: column with the categorical values
        Returns:
            Dataframe column is replaced with the encoded values
        """
        if self.mode == "Train":
            encoder = preprocessing.OrdinalEncoder(
                handle_unknown="use_encoded_value", unknown_value=-1
            )
            self.data[str(column_name) + "_ENCODED"] = encoder.fit_transform(
                np.array(self.data[column_name]).reshape(-1, 1)
            )
            # persist the fitted encoder; the context manager closes the file
            with open(column_name + "_encoder_", "wb") as pickle_file:
                pickle.dump(encoder, pickle_file)
        else:
            with open(column_name + "_encoder_", "rb") as pickle_file:
                encoder = pickle.load(pickle_file)
            self.data[str(column_name) + "_ENCODED"] = encoder.transform(
                np.array(self.data[column_name]).reshape(-1, 1)
            )

    def get_day_of_week(self, column_name):
        """
        Method to extract the day of the week as an integer (Monday=0 ... Sunday=6)
        Args:
            column_name: column with the time stamp value
        Returns:
            New dataframe column is created and populated with extracted information
        """
        self.data["DAY_OF_" + re.sub("FORMATTED_", "", column_name)] = self.data[
            column_name
        ].dt.dayofweek

    def get_part_of_day(self, column_name):
        """
        Method to extract the part of the day as a 4-hour bucket (1 = 00:00-03:59 ... 6 = 20:00-23:59)
        Args:
            column_name: column with the time stamp value
        Returns:
            New dataframe column is created and populated with extracted information
        """
        self.data["PERIOD_OF_" + re.sub("FORMATTED_", "", column_name)] = (
            self.data[column_name].dt.hour % 24 + 4
        ) // 4

    def get_weeknum_of_month(self, column_name):
        """
        Method to extract the week number of the month (1 to 5)
        Args:
            column_name: column with the time stamp value
        Returns:
            New dataframe column is created and populated with extracted information
        """
        self.data["WEEKNUM_OF_" + re.sub("FORMATTED_", "", column_name)] = (
            self.data[column_name].dt.day - 1
        ) // 7 + 1

    def get_time_features(self):
        """
        Method to call the other methods and extract the day of the week, part of the day and week number of the month
        """
        time_columns = [
            "FORMATTED_FIRST_COLLECTION_SCHEDULE_EARLIEST",
            "FORMATTED_FIRST_COLLECTION_SCHEDULE_LATEST",
            "FORMATTED_LAST_DELIVERY_SCHEDULE_EARLIEST",
            "FORMATTED_LAST_DELIVERY_SCHEDULE_LATEST",
        ]
        for col in time_columns:
            self.get_day_of_week(col)
            self.get_part_of_day(col)
            self.get_weeknum_of_month(col)

    def get_scaled_values(self, column_name):
        """
        Method to scale the values of columns based on min and max values
        Args:
            column_name: column with the numeric continuous values
        Returns:
            New dataframe column is created with the scaled values; in Train mode the fitted scaler is pickled for reuse
        """
        if self.mode == "Train":
            scaler = MinMaxScaler()
            self.data["SCALED_" + column_name] = scaler.fit_transform(
                np.array(self.data[column_name]).reshape(-1, 1)
            )
            # persist the fitted scaler; the context manager closes the file
            with open(column_name + "_scaler_", "wb") as pickle_file:
                pickle.dump(scaler, pickle_file)
        else:
            with open(column_name + "_scaler_", "rb") as pickle_file:
                scaler = pickle.load(pickle_file)
            # reuse the min/max learned in training; do not refit on unseen data
            self.data["SCALED_" + column_name] = scaler.transform(
                np.array(self.data[column_name]).reshape(-1, 1)
            )

    def get_google_estimate(self):
        """
        Method to call the Google Directions API and retrieve the estimated distance and time

        """
        self.data[["DISTANCE_TO_DESTINATION", "TIME_TO_DESTINATION"]] = self.data.apply(
            lambda x: self.get_transit_details_google(x),
            axis="columns",
            result_type="expand",
        )

    def get_target_variable(self):
        """
        Method to join the shipment data with the delivery indicator extracted in the previous task

        """
        self.data = self.valid_data.merge(self.data, how="inner", on="SHIPMENT_NUMBER")
        self.data["DELAY?"] = self.data["INDICATOR"].apply(
            lambda x: (
                0 if (x == "advanced_delivery") or (x == "threshold_delivery") else 1
            )
        )

    def train_test_split(self):
        """
        Method to split the data into train and test sets, so that the model is built on train
        and validated later on test
        """
        # print(self.data.columns)
        self.train = self.data[self.columns_to_model]
        X = self.train.iloc[:, :-1]
        y = self.train["DELAY?"]
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            X, y, test_size=0.33, stratify=y, random_state=42
        )

    def train_model(self):
        """
        Method to build the classification model using Logistic Regression algorithm
        """
        # instantiate the model (using the default parameters)
        logreg = LogisticRegression(random_state=16)

        # fit the model with data
        logreg.fit(self.X_train, self.y_train)
        # persist the fitted model; the context manager closes the file
        with open("logreg", "wb") as pickle_file:
            pickle.dump(logreg, pickle_file)

    def validation(self):
        """
        Method to load the built model and validate on the test split
        """
        with open("logreg", "rb") as pickle_file:
            model = pickle.load(pickle_file)
        self.y_pred = model.predict(self.X_test)

    def get_prediction(self):
        """
        Method to load the built model and generate predictions on unseen data
        """
        with open("logreg", "rb") as pickle_file:
            model = pickle.load(pickle_file)
        self.predictions = model.predict(self.data[self.columns_to_model[:-1]])

    def model_performance_metrics(self):
        """
        Method to generate the output metrics(accuracy, precision, recall) on the predictions made by the model
        """
        target_names = ["0", "1"]
        print(
            classification_report(self.y_test, self.y_pred, target_names=target_names)
        )

    def preprocess(self):
        """
        Method to format the input data on required columns
        """
        self.format_time_stamp(
            "FIRST_COLLECTION_SCHEDULE_EARLIEST",
            "FORMATTED_FIRST_COLLECTION_SCHEDULE_EARLIEST",
        )
        self.format_time_stamp(
            "FIRST_COLLECTION_SCHEDULE_LATEST",
            "FORMATTED_FIRST_COLLECTION_SCHEDULE_LATEST",
        )
        self.format_time_stamp(
            "LAST_DELIVERY_SCHEDULE_EARLIEST",
            "FORMATTED_LAST_DELIVERY_SCHEDULE_EARLIEST",
        )
        self.format_time_stamp(
            "LAST_DELIVERY_SCHEDULE_LATEST", "FORMATTED_LAST_DELIVERY_SCHEDULE_LATEST"
        )

    def feature_eng(self):
        """
        Method to call required methods to perform feature engineering
        """
        self.get_label_encoding("PROJECT_ID")
        self.get_label_encoding("CARRIER_DISPLAY_ID")
        self.get_label_encoding("VEHICLE_SIZE")
        self.get_label_encoding("VEHICLE_BUILD_UP")
        self.get_time_features()
        self.get_google_estimate()
        self.get_scaled_values("DISTANCE_TO_DESTINATION")
        self.get_scaled_values("TIME_TO_DESTINATION")

    def training(self):
        """
        Method to control the entire model training process and followed by validation on test data
        """
        self.preprocess()
        self.feature_eng()
        self.get_target_variable()
        self.train_test_split()
        self.train_model()
        self.validation()
        self.model_performance_metrics()

    def test(self, data):
        """
        Method to control the entire model prediction process on the unseen data
        """
        self.data = data
        self.mode = "Test"
        self.data.rename(
            columns={"SHIPPER_ID": "PROJECT_ID", "CARRIER_ID": "CARRIER_DISPLAY_ID"},
            inplace=True,
        )
        self.preprocess()
        self.feature_eng()
        self.get_prediction()
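
The three calendar features built by `get_day_of_week`, `get_part_of_day` and `get_weeknum_of_month` are plain integer formulas on the timestamp. The sketch below applies the same expressions to a few hypothetical timestamps (chosen only to hit the bucket boundaries, not taken from the data) to show which values they produce.

```python
import pandas as pd

# Hypothetical timestamps, chosen only to exercise the bucket boundaries.
ts = pd.Series(
    pd.to_datetime(
        [
            "2023-10-02 03:59:00",  # a Monday, early hours
            "2023-10-08 12:00:00",  # a Sunday, midday
            "2023-10-31 23:30:00",  # a Tuesday, late evening, last day of the month
        ]
    )
)

features = pd.DataFrame(
    {
        "day_of_week": ts.dt.dayofweek,  # Monday=0 ... Sunday=6
        "period_of_day": (ts.dt.hour % 24 + 4) // 4,  # six 4-hour buckets, 1..6
        "week_of_month": (ts.dt.day - 1) // 7 + 1,  # 1..5
    }
)
print(features)
```

Note that the period formula yields six buckets rather than five, and days 29 to 31 fall into a fifth week of the month.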

In [None]:
delay_model = PotentialDelayPrediction(shipment, valid_ship_gps)
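
Constructing the object makes no API calls; the Directions requests happen later, once per row, inside `feature_eng`. `get_transit_details_google` indexes straight into the first route, which would raise an error if no route is returned. A minimal sketch of a more defensive extraction follows, assuming the response is the list of routes that the `googlemaps` client returns. The helper name `extract_km_and_minutes` and the sample response are hypothetical.

```python
import math


def extract_km_and_minutes(routes):
    """Return (distance_km, duration_min) for the first leg of the first route.

    `routes` is the list returned by a Directions request. An empty list
    means no route was found, in which case NaNs are returned.
    """
    if not routes or not routes[0].get("legs"):
        return math.nan, math.nan
    leg = routes[0]["legs"][0]
    # distance is reported in metres and duration in seconds
    return leg["distance"]["value"] / 1000, leg["duration"]["value"] / 60


# Hypothetical response fragment, not real API output.
sample = [{"legs": [{"distance": {"value": 152300}, "duration": {"value": 7260}}]}]
print(extract_km_and_minutes(sample))
print(extract_km_and_minutes([]))
```

Returning NaN keeps the row in the dataframe, so missing estimates would still need to be imputed or dropped before scaling and modelling.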

In [None]:
delay_model.training()
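
The split, fit and report steps inside `training()` can be tried in isolation. The sketch below mirrors them on synthetic, imbalanced data from `make_classification` (a stand-in, not the shipment data); `max_iter`, `zero_division` and the class names are choices made for this illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the shipment features (not real data).
X, y = make_classification(
    n_samples=300, n_features=6, weights=[0.8, 0.2], random_state=0
)

# Stratifying on y keeps the delay rate similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=42
)

model = LogisticRegression(random_state=16, max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(
    classification_report(
        y_test, y_pred, target_names=["on time", "delayed"], zero_division=0
    )
)
```

With an imbalanced target, the per-class precision and recall in the report are more informative than overall accuracy.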

In [None]:
delay_model.test(new_booking)
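
In `Test` mode the class loads the encoders and scalers pickled during training, with the aim of mapping new bookings using the parameters learned from the training data. The sketch below illustrates that fit-once, transform-later pattern with hypothetical distances and vehicle sizes (the values are invented for illustration).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Hypothetical training and new-booking distances in km, for illustration only.
train_km = np.array([[10.0], [60.0], [110.0]])
new_km = np.array([[35.0], [160.0]])

scaler = MinMaxScaler().fit(train_km)  # learns min=10 and max=110
scaled_new = scaler.transform(new_km).ravel()
print(scaled_new)  # a distance above the training max scales to more than 1

# Hypothetical vehicle sizes; "Rigid" never appears in the training values.
train_size = np.array([["Van"], ["Artic"], ["Van"]], dtype=object)
new_size = np.array([["Artic"], ["Rigid"]], dtype=object)

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train_size)
encoded_new = encoder.transform(new_size).ravel()
print(encoded_new)  # the unseen category is mapped to -1
```

Calling `fit_transform` on the new data instead would rescale each batch to its own minimum and maximum, so the same distance could receive different scaled values at training and prediction time.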