<a href="https://colab.research.google.com/github/kenextra/ATCS_ML/blob/main/ATCS_End2End_ML_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# End-to-End Machine Learning Project Steps

1.   Look at the big picture.
2.   Get the data.
3.   Discover and visualize the data to gain insights.
4.   Prepare the data for Machine Learning algorithms.
5.   Select a model and train it.
6.   Fine-tune your model.
7.   Present your solution.
8.   Launch, monitor, and maintain your system.


# Frame the Problem

## A Taxi Fleet and the Challenge of Dispatching

### Problem Statement
Taxi demand changes throughout the day. How can dispatching be made more efficient by sending cabs to the places with the most requests at each time of day, and can the trip data available from 2015 help?

## Type of Problem

*   Supervised
*   Unsupervised

## Type of ML Task

*   Classification
*   Regression

## Select a Performance Measure
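
The models trained later in this notebook are scored with R² and the root mean squared error (RMSE). As a minimal sketch of what the two measures compute, here they are in plain NumPy; the helper names and the trip counts are made up for illustration and are not part of the dataset.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: typical prediction error, in trips."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Hypothetical hourly trip counts and model predictions
y_true_demo = [10, 20, 30, 40]
y_pred_demo = [12, 18, 33, 37]

print("RMSE:", rmse(y_true_demo, y_pred_demo))
print("R2:  ", r2(y_true_demo, y_pred_demo))
```

RMSE is expressed in the same unit as the target (trips per hour and location), which makes it easier to interpret than the squared error.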

In [None]:
%%bash
# Upgrade pandas and scikit-learn, then restart the runtime
pip install --upgrade pandas --quiet
pip install --upgrade scikit-learn --quiet

# Data collection

## Download Data

In [None]:
import urllib.request
from zipfile import ZipFile

In [None]:
DATA_URL = "https://www.mathworks.com/supportfiles/practicaldsmatlab/taxi/Taxi%20Data.zip"
DATA_NAME = "TaxiData.zip"
urllib.request.urlretrieve(DATA_URL, DATA_NAME)

In [None]:
# Open the downloaded archive for reading
with ZipFile(DATA_NAME, 'r') as zipObj:
    # Extract all the contents of the zip file into the current directory
    zipObj.extractall()

In [None]:
from pathlib import Path
DATA_DIR = Path.cwd() / 'Taxi Data'

## Load data to memory

In [None]:
import pandas as pd
import numpy as np
import sklearn
sklearn.__version__

In [None]:
pd.__version__

In [None]:
col_names = ['VendorID',
 'tpep_pickup_datetime',
 'tpep_dropoff_datetime',
 'passenger_count',
 'trip_distance',
 'pickup_longitude',
 'pickup_latitude',
 'RateCodeID',
 'store_and_fwd_flag',
 'dropoff_longitude',
 'dropoff_latitude',
 'payment_type',
 'fare_amount',
 'extra',
 'mta_tax',
 'tip_amount',
 'tolls_amount',
 'improvement_surcharge',
 'total_amount']

parse_dates = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]

# Read the coded columns as strings so they can be mapped to labels later
dtype = {'RateCodeID': str, 'payment_type': str, 'VendorID': str}

In [None]:
df_from_each_file = (pd.read_csv(f, parse_dates=parse_dates,
                                 names=col_names, dtype=dtype,
                                 low_memory=False,
                                 skiprows=1)
                        for f in DATA_DIR.iterdir()
                        if 'yellow' in str(f))
df = pd.concat(df_from_each_file, ignore_index=True)

# Data Exploration/Visualization

In [None]:
df.head()

In [None]:
df.info(show_counts=True)

In [None]:
df.describe().loc[['min', 'max']]

In [None]:
df.describe(include=['object', 'bool'])

In [None]:
_ = df.plot(kind="scatter", x='pickup_longitude', y='pickup_latitude',)

In [None]:
_ = df.plot(kind="scatter", x='dropoff_longitude', y='dropoff_latitude',)

In [None]:
Payment_Type = {"1": "Credit card",
                "2": "Cash",
                "3": "No charge",
                "4": "Dispute",
                "5": "Unknown",
                "6": "Voided trip", }

RateCode = {"1": "Standard rate",
            "2": "JFK",
            "3": "Newark",
            "4": "Nassau or Westchester",
            "5": "Negotiated fare",
            "6": "Group ride",
            "99": "99"}

VendorID = {"1": "Creative Mobile Technologies, LLC",
            "2": "VeriFone Inc.", }


# Bounding latitude/longitude
lat = [40.5612, 40.9637]
lon = [-74.1923, -73.5982]

In [None]:
def basic_preprocessing(df=None):
    print('Converting categorical features to their corresponding values...\n')
    df.loc[:, 'payment_type'] = df['payment_type'].apply(lambda x: Payment_Type[x])
    df.loc[:, 'RateCodeID'] = df['RateCodeID'].apply(lambda x: RateCode[x])
    df.loc[:, 'VendorID'] = df['VendorID'].apply(lambda x: VendorID[x])

    # Remove invalid charges
    # Only keep trips (rows) containing valid charges.
    print('Removing invalid charges...\n')
    df.query('RateCodeID != "99"', inplace=True)
    df.query('fare_amount > 0', inplace=True)
    df.query('extra >= 0', inplace=True)
    df.query('mta_tax >= 0', inplace=True)
    df.query('tip_amount >= 0', inplace=True)
    df.query('tolls_amount >= 0', inplace=True)
    df.query('improvement_surcharge >= 0', inplace=True)
    df.query('total_amount > 0', inplace=True)

    # Only keep trips where charges match the expected values.
    # ImpSurcharge is $0.30
    # Tax is $0.50
    # Total is the sum of all charges
    df.query('abs(improvement_surcharge-0.3) < 0.01', inplace=True)
    df.query('abs(mta_tax-0.5) < 0.01', inplace=True)
    df.query('abs(fare_amount+extra+mta_tax+tip_amount+tolls_amount+improvement_surcharge-total_amount) < 0.01', inplace=True)

    # Remove invalid trip information
    # Only keep trips with valid passenger and distance information.
    print('Removing invalid trip information...\n')
    df.query('passenger_count > 0', inplace=True)
    df.query('trip_distance > 0', inplace=True)

    # Remove outliers
    # Only keep trips with pickup and drop off locations inside the region of interest.
    print('Keep trips with pickup and drop off locations inside the region of interest\n')
    df.query(f'pickup_longitude >= {lon[0]} & pickup_longitude <= {lon[1]}', inplace=True)
    df.query(f'dropoff_longitude >= {lon[0]} & dropoff_longitude <= {lon[1]}', inplace=True)
    df.query(f'pickup_latitude >= {lat[0]} & pickup_latitude <= {lat[1]}', inplace=True)
    df.query(f'dropoff_latitude >= {lat[0]} & dropoff_latitude <= {lat[1]}', inplace=True)

    # Only keep trips with typical values
    # Typical trip
    print('Only keep trips with typical values..\n')
    # df.query('duration >= 1 & duration <= 120', inplace=True)
    df.query('trip_distance >= 0.01 & trip_distance <= 50', inplace=True)

    # Typical charges
    df.query('fare_amount >= 0.01 & fare_amount <= 100', inplace=True)
    df.query('tolls_amount <= 20', inplace=True)
    df.query('total_amount >= 0.5 & total_amount <= 120', inplace=True)

    df.reset_index(inplace=True, drop=True)
    return df

In [None]:
df = basic_preprocessing(df.copy())

In [None]:
df.describe().loc[['min', 'max']]

In [None]:
_ = df.plot(kind="scatter", x='pickup_longitude', y='pickup_latitude',)

In [None]:
_ = df.plot(kind="scatter", x='dropoff_longitude', y='dropoff_latitude',)

In [None]:
Names = ["Manhattan", "LaGuardia", "JFK"]
Lat1 = [40.7485, 40.766, 40.639]
Lat2 = [40.7576, 40.776, 40.650]
Lon1 = [-73.9955, -73.876, -73.793]
Lon2 = [-73.9773, -73.861, -73.775]

In [None]:
for i, loc in enumerate(Names):
    # Flag pickups that fall inside the bounding box of the current location
    isInBox = ((df.pickup_latitude >= Lat1[i]) & (df.pickup_latitude <= Lat2[i])
               & (df.pickup_longitude >= Lon1[i]) & (df.pickup_longitude <= Lon2[i]))
    df.loc[isInBox, 'location'] = loc

df['location'] = df.location.astype("category")

In [None]:
df.dropna(inplace=True)

df['pickup_time'] = pd.to_datetime(df['tpep_pickup_datetime'].dt.strftime("%Y-%m-%d %H"))

# Model Training

## Prepare the data for Machine Learning algorithms.

### Data Cleaning / Feature selection

In [None]:
taxi_pickups = df.groupby(by=['pickup_time', 'location'], as_index=False, dropna=False, )['passenger_count'].count()
taxi_pickups.columns = ['pickup_time', 'location', 'trip_count']

In [None]:
taxi_pickups.head(10)

In [None]:
taxi_pickups.shape

(26217, 3)

In [None]:
# Pass the aggregations as a list (not a set) so the summary columns keep a fixed order
summary = taxi_pickups.groupby(by='location', as_index=False,
                               sort=False, dropna=False
                               )['trip_count'].agg(['size', 'min', 'max', 'mean', 'median'])
summary.reset_index()

In [None]:
taxi_pickups['timeofday'] = taxi_pickups.pickup_time.dt.hour
taxi_pickups['dayofweek'] = taxi_pickups.pickup_time.dt.day_name()
taxi_pickups['dayofmonth'] = taxi_pickups.pickup_time.dt.day
taxi_pickups['dayofyear'] = taxi_pickups.pickup_time.dt.dayofyear
taxi_pickups.head()

In [None]:
y = taxi_pickups['trip_count']
X = taxi_pickups.drop(labels=['trip_count', 'pickup_time'], axis=1)

X.head()

### Create Train/Test Set

In [None]:
from sklearn import set_config
set_config(display="diagram")
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder, MinMaxScaler

In [None]:
strat_split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

train_idx, test_idx = next(strat_split.split(X, X['location']))


# Create the dataframes
X_train = X.loc[train_idx, :]
y_train = y.loc[train_idx,]

X_test  = X.loc[test_idx, :]
y_test  = y.loc[test_idx,]

X_train.location.value_counts(normalize=True)
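
The split above stratifies on `location`, so each location keeps roughly the same share of rows in the training and test sets. A minimal sketch of that effect on a made-up frame (the `toy` data and its 60/20/20 mix are assumptions for illustration only):

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy frame with a default RangeIndex: 60% Manhattan, 20% LaGuardia, 20% JFK
toy = pd.DataFrame({
    "location": ["Manhattan"] * 60 + ["LaGuardia"] * 20 + ["JFK"] * 20,
    "timeofday": list(range(100)),
})

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
toy_train_idx, toy_test_idx = next(splitter.split(toy, toy["location"]))

# Share of each location in the two subsets
train_share = toy.loc[toy_train_idx, "location"].value_counts(normalize=True)
test_share = toy.loc[toy_test_idx, "location"].value_counts(normalize=True)
print(pd.concat([train_share, test_share], axis=1, keys=["train", "test"]))
```

Note that `split` returns positional indices. Using them with `.loc`, as the notebook does, is only safe because the frame has a default `RangeIndex`; with any other index, `.iloc` would be the appropriate accessor.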

## Select a model, train it, and evaluate on the test set

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [None]:
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X.select_dtypes(include=['object', 'bool', 'category']).columns
numerical_cols, categorical_cols

In [None]:
numerical_ix = [X.columns.get_loc(col) for col in numerical_cols]
categorical_ix = [X.columns.get_loc(col) for col in categorical_cols]
numerical_ix, categorical_ix

In [None]:
num_pipeline = Pipeline([("num", StandardScaler()),])

cat_pipeline = Pipeline([("cat", OrdinalEncoder()),])

In [None]:
transformer = ColumnTransformer([
                 ("num_pipe", num_pipeline, numerical_ix),
                 ("cat_pipe", cat_pipeline, categorical_ix),
                 ])
transformer

In [None]:
# ColumnTransformer outputs the numerical columns first, then the categorical ones
pd.DataFrame(transformer.fit_transform(X_train), columns=[*numerical_cols, *categorical_cols])

In [None]:
errors = list()
scores = list()

### Linear Regression Model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
mdl = LinearRegression()
lr_estimator = Pipeline([('preparation', transformer), 
                     ('model', mdl)
                     ])

In [None]:
lr_estimator.fit(X_train, y_train)

In [None]:
# predict
y_train_pred = lr_estimator.predict(X_train)
y_test_pred = lr_estimator.predict(X_test)

errors.append(pd.Series({'train': r2_score(y_train, y_train_pred),
           'test' : r2_score(y_test, y_test_pred)},
          name='LR_score'))

# RMSE as sqrt(MSE); avoids `squared=False`, which newer scikit-learn releases no longer accept
errors.append(pd.Series({'train': np.sqrt(mean_squared_error(y_train, y_train_pred)),
           'test' : np.sqrt(mean_squared_error(y_test,  y_test_pred))},
          name='LR_rmse'))

pd.concat(errors, axis=1)

### Decision Tree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
mdl = DecisionTreeRegressor()
dt_estimator = Pipeline([('preparation', transformer), 
                     ('model', mdl)
                     ])

In [None]:
dt_estimator.fit(X_train, y_train)

In [None]:
# predict
y_train_pred = dt_estimator.predict(X_train)
y_test_pred = dt_estimator.predict(X_test)

errors.append(pd.Series({'train': r2_score(y_train, y_train_pred),
           'test' : r2_score(y_test, y_test_pred)},
          name='DT_score'))

# RMSE as sqrt(MSE); avoids `squared=False`, which newer scikit-learn releases no longer accept
errors.append(pd.Series({'train': np.sqrt(mean_squared_error(y_train, y_train_pred)),
           'test' : np.sqrt(mean_squared_error(y_test,  y_test_pred))},
          name='DT_rmse'))

pd.concat(errors, axis=1)

### Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
mdl = RandomForestRegressor()
rfr_estimator = Pipeline([('preparation', transformer), 
                     ('model', mdl)
                     ])

In [None]:
rfr_estimator.fit(X_train, y_train)

In [None]:
# predict
y_train_pred = rfr_estimator.predict(X_train)
y_test_pred = rfr_estimator.predict(X_test)

errors.append(pd.Series({'train': r2_score(y_train, y_train_pred),
           'test' : r2_score(y_test, y_test_pred)},
          name='RFR_score'))

# RMSE as sqrt(MSE); avoids `squared=False`, which newer scikit-learn releases no longer accept
errors.append(pd.Series({'train': np.sqrt(mean_squared_error(y_train, y_train_pred)),
           'test' : np.sqrt(mean_squared_error(y_test,  y_test_pred))},
          name='RFR_rmse'))

pd.concat(errors, axis=1)

## Fine-tune your model.

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

In [None]:
rfr_estimator.get_params().keys()

In [None]:
# Set parameters
MAX_DEPTH_OPTIONS = [15, 20]
N_ESTIMATORS = [10, 15, 20, 30]
params = {
    'model__max_depth': MAX_DEPTH_OPTIONS,
    'model__n_estimators': N_ESTIMATORS,
}

kf = KFold(shuffle=True, random_state=42, n_splits=5)

In [None]:
grid = GridSearchCV(rfr_estimator, params, verbose=10, n_jobs=1, cv=kf,
                    scoring='neg_mean_squared_error', return_train_score=True)

In [None]:
grid.fit(X_train, y_train.values.ravel())

In [None]:
grid.best_score_, grid.best_params_
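
Because the search uses `scoring='neg_mean_squared_error'`, `grid.best_score_` is a negated MSE averaged over the folds (higher is better), not an RMSE. A small sketch of the conversion, using a made-up score value rather than a result from this notebook:

```python
import numpy as np

# Hypothetical value of grid.best_score_ (negated mean squared error)
best_score_demo = -156.25

# Flip the sign, then take the square root to get back to trip units
best_rmse_demo = float(np.sqrt(-best_score_demo))
print(f"cross-validated RMSE is about {best_rmse_demo:.2f} trips")
```

The square root of the mean fold MSE is not exactly the mean of the per-fold RMSE values. scikit-learn also offers `scoring='neg_root_mean_squared_error'` if the RMSE itself should be the selection criterion.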

In [None]:
grid.best_estimator_

In [None]:
# predict
y_train_pred = grid.predict(X_train)
y_test_pred = grid.predict(X_test)

# r2_score expects (y_true, y_pred); swapping the arguments changes the result
errors.append(pd.Series({'train': r2_score(y_train, y_train_pred),
           'test' : r2_score(y_test, y_test_pred)},
          name='RFR_Grid_score'))

# RMSE as sqrt(MSE); avoids `squared=False`, which newer scikit-learn releases no longer accept
errors.append(pd.Series({'train': np.sqrt(mean_squared_error(y_train, y_train_pred)),
           'test' : np.sqrt(mean_squared_error(y_test,  y_test_pred))},
          name='RFR_Grid_rmse'))

pd.concat(errors, axis=1)

## Select and retrain best model

In [None]:
parameters = grid.best_params_
n_estimators = parameters['model__n_estimators']
max_depth = parameters['model__max_depth']
max_depth, n_estimators

(15, 30)

In [None]:
mdl = RandomForestRegressor(n_estimators=n_estimators,  max_depth=max_depth, random_state=42,)
rfr_estimator = Pipeline([('preparation', transformer), 
                     ('model', mdl)
                     ])

In [None]:
rfr_estimator.fit(X_train, y_train.values.ravel())

In [None]:
# predict
y_train_pred = rfr_estimator.predict(X_train)
y_test_pred = rfr_estimator.predict(X_test)

errors.append(pd.Series({'train': r2_score(y_train, y_train_pred),
           'test' : r2_score(y_test, y_test_pred)},
          name='mdl_score'))

# RMSE as sqrt(MSE); avoids `squared=False`, which newer scikit-learn releases no longer accept
errors.append(pd.Series({'train': np.sqrt(mean_squared_error(y_train, y_train_pred)),
           'test' : np.sqrt(mean_squared_error(y_test,  y_test_pred))},
          name='mdl_rmse'))

pd.concat(errors, axis=1)

## Save Trained Model

In [None]:
import joblib
joblib.__version__

In [None]:
joblib.dump(rfr_estimator, 'model.joblib')
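
A quick way to gain confidence in a saved model is to reload it and check that it reproduces the original predictions. The sketch below does this with a tiny stand-in model and a temporary file; the file name and data are hypothetical, and the notebook itself saves the fitted random-forest pipeline instead.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model fitted on made-up data
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 3.0 * X_demo.ravel() + 1.0
demo_model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_model.joblib"  # hypothetical file name
    joblib.dump(demo_model, path)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original
round_trip_ok = bool(np.allclose(demo_model.predict(X_demo), restored.predict(X_demo)))
print("round trip ok:", round_trip_ok)
```

Saved models are best loaded with the same scikit-learn version that produced them, so it helps to record the versions printed earlier alongside the file.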

## Evaluate on new data

In [None]:
from datetime import datetime

In [None]:
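# `td` is not defined anywhere above; assuming it is meant to be today's date as a string
td = datetime.now().strftime('%Y-%m-%d')
# 72 evenly spaced timestamps: 3 per hour, one for each location
pkt = pd.date_range(start=td + ' 00:00:00', end=td + ' 23:59:59', periods=72)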

dff = pd.DataFrame({'pickup_time': pkt})

dff['pickup_time'] = pd.to_datetime(dff['pickup_time'].dt.strftime("%Y-%m-%d %H"))

dff['location'] = Names * 24

In [None]:
dff.head(15)

In [None]:
dff['timeofday'] = dff.pickup_time.dt.hour
dff['dayofweek'] = dff.pickup_time.dt.day_name()
dff['dayofmonth'] = dff.pickup_time.dt.day
dff['dayofyear'] = dff.pickup_time.dt.dayofyear

dff.head()

In [None]:
data = dff[['location', 'timeofday', 'dayofweek', 'dayofmonth', 'dayofyear']]
data.head()

(72, 5)

In [None]:
model = joblib.load('model.joblib')

In [None]:
dff['trip_count'] = model.predict(data)
dff

In [None]:
pickups = dff.groupby(by=['pickup_time',], as_index=False, dropna=False, )['trip_count'].sum()
pickups.columns = ['pickup_time', 'total_trip']
pickups

In [None]:
pickups = pd.merge(left=dff, right=pickups, how='inner', on='pickup_time')
pickups['fraction'] = pickups.trip_count / pickups.total_trip
pickups

In [None]:
fractions = pickups.pivot(index='pickup_time', columns='location', values='fraction')
fractions
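
The three cells above compute each location's share of the predicted trips per hour with a group-by, a merge, and a pivot. As a sketch of an equivalent route, `groupby(...).transform('sum')` yields the hourly total aligned to every row, which makes the merge unnecessary. The frame below is made up for illustration.

```python
import pandas as pd

# Made-up predictions for two hours and three locations
toy_pred = pd.DataFrame({
    "pickup_time": pd.to_datetime(["2015-01-01 08:00"] * 3 + ["2015-01-01 09:00"] * 3),
    "location": ["Manhattan", "LaGuardia", "JFK"] * 2,
    "trip_count": [60.0, 25.0, 15.0, 30.0, 10.0, 10.0],
})

# Hourly total repeated on every row of that hour
hourly_total = toy_pred.groupby("pickup_time")["trip_count"].transform("sum")
toy_pred["fraction"] = toy_pred["trip_count"] / hourly_total

# One row per hour, one column per location
shares = toy_pred.pivot(index="pickup_time", columns="location", values="fraction")
print(shares)
```

Each row of `shares` sums to 1, so it can be read directly as the proportion of cabs to send to each location in that hour.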

In [None]:
import matplotlib.pyplot as plt

In [None]:
_ = fractions.plot(kind='bar', stacked=True, figsize=(12, 8))

# Model Deployment


## Public Cloud Options

- Heroku
- GCP
- AWS
- Azure

# Resources

## Books

[Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)

## Dataset
- [Taxi Data](https://www.mathworks.com/supportfiles/practicaldsmatlab/taxi/Taxi%20Data.zip) - Two percent of the total trips sampled at random from each month of 2015.
- [Full Dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) - 12 years (2009–2020) of data available.

## Coursera Courses
  1. [IBM Machine Learning Professional Certificate](https://www.coursera.org/professional-certificates/ibm-machine-learning)
  2. [Practical Data Science with MATLAB Specialization](https://www.coursera.org/specializations/practical-data-science-matlab)