# Day 1 - ML Workflow

The objective of this exercise is to use the tools and methods you learnt during the previous weeks, in order to solve a **real challenge**.

The problem to solve is a **Kaggle Competition**: [New York City Taxi Fare Prediction](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction). The goal is to predict the fare amount (inclusive of tolls) for a taxi ride in New York City given the pickup and dropoff locations.

Building a machine learning model requires a few different steps:

## Steps
1. [Get the data](#part1)
2. [Explore the data](#part2)
3. [Data cleaning](#part3)
4. [Evaluation metric](#part4)
5. [Model baseline](#part5)
6. [Build your first model](#part6)
7. [Model evaluation](#part7)
8. [Kaggle submission](#part8)
9. [Model iteration](#part9)

## 1. Get the data <a id='part1'></a>

The dataset is available on [Kaggle](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/data)

First of all:
- Follow the instructions to download the training and test sets
- Put the datasets in a separate folder on your local disk, that you can name "data" for example.

Now we are going to use Pandas to read and explore the datasets.

In [None]:
import pandas as pd

In [None]:
pip install s3fs

The training dataset is relatively big (~5GB). 
So let's only open a portion of it.  
👉 Go to the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/) to see how to open a portion of a CSV file and store it in a dataframe (e.g. read at most 1 million rows).  
💡 NB: here we read a portion of the file **directly from a URL**; exactly the same can be done with a local file.

In [None]:
%%time
url = 's3://wagon-public-datasets/taxi-fare-train.csv'
df = pd.read_csv(url, nrows=1000000)
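As a minimal sketch of the same idea on a local source, the snippet below builds a small in-memory CSV (its column names and values are made up for illustration) and reads only part of it: first with `nrows`, then chunk by chunk with `chunksize`.

```python
import io
import pandas as pd

# Hypothetical toy CSV standing in for a large local file
csv_text = "key,fare_amount\n" + "\n".join(f"trip_{i},{5.0 + i}" for i in range(10))

# Read only the first 3 data rows
head_df = pd.read_csv(io.StringIO(csv_text), nrows=3)

# Alternatively, iterate over the file in chunks of 4 rows
chunk_sizes = [len(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4)]

print(head_df)
print(chunk_sizes)
```

Reading in chunks can help when even a subset does not fit comfortably in memory, since each chunk can be processed and discarded in turn.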

Now let's display the first rows to understand the different fields 

In [None]:
df.head(2)

## 2. Explore the data <a id='part2'></a>

Before trying to solve the prediction problem, we need to get a better understanding of the data. 
For that, libraries like Pandas and Seaborn are your best friends. 
First of all, make sure you have [Seaborn](https://seaborn.pydata.org/) installed and import it into your notebook. It can also be useful to import `matplotlib.pyplot` to customize a few things.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')
plt.rcParams['font.size'] = 14
plt.figure(figsize=(12,5))
palette = sns.color_palette('Paired', 10)

### There are multiple things we want to do in terms of data exploration.

- You first want to look at the distribution of the variable you are going to predict: "fare_amount"
- Then you want to visualize other variable distributions
- And finally it is often very helpful to compute and visualize the correlation between the target variable and the other variables.
- Also, look for any missing values, or other irregularities.
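The last point can be sketched as follows on a toy dataframe (the values are hypothetical and only mimic two columns of the taxi dataset): count missing values per column, then flag one simple irregularity.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mimicking two columns of the taxi dataset
toy = pd.DataFrame({
    "fare_amount": [4.5, 16.9, np.nan, -2.0],
    "passenger_count": [1, 2, 1, 0],
})

# Number of missing values per column
missing_per_column = toy.isnull().sum()

# Simple irregularity check: fares that are not strictly positive
n_non_positive_fares = int((toy["fare_amount"] <= 0).sum())

print(missing_per_column)
print(n_non_positive_fares)
```

Note that comparisons with `NaN` evaluate to `False`, so missing fares are not counted as non-positive and need their own check.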

### Explore the target variable
- Compute simple statistics of the target variable (min, max, mean, std, ...)
- Plot distributions

In [None]:
df.fare_amount.describe()

In [None]:
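%matplotlib inline
def plot_dist(series=df["fare_amount"], title="Fare Distribution"):
    # sns.distplot is deprecated in recent seaborn versions; histplot with a KDE is its replacement
    sns.histplot(series, kde=True, stat="density")
    sns.despine()
    plt.title(title)
    plt.show()
plot_dist()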

In [None]:
# drop absurd values 
df = df[df.fare_amount.between(0, 6000)]
plot_dist(df.fare_amount)

In [None]:
# We can also visualise binned fare_amount variable
df['fare-bin'] = pd.cut(df['fare_amount'], bins = list(range(0, 50, 5))).astype(str)

# Uppermost bin
df.loc[df['fare-bin'] == 'nan', 'fare-bin'] = '[45+]'

# Adjust bin so the sorting is correct
df.loc[df['fare-bin'] == '(5, 10]', 'fare-bin'] = '(05, 10]'

In [None]:
sns.catplot(x="fare-bin", kind="count", palette=palette, data=df, height=5, aspect=3);
sns.despine()
plt.show()

### Explore other variables

- passenger_count (statistics + distribution)
- pickup_datetime (you need to build time features out of pickup datetime)
- Geospatial features (pickup_longitude, pickup_latitude,dropoff_longitude,dropoff_latitude)
- Find other variables you can compute from existing data that might explain the target 

#### Passenger Count

In [None]:
df.passenger_count.describe()

In [None]:
sns.catplot(x="passenger_count", kind="count", palette=palette, data=df, height=5, aspect=3);
sns.despine()
plt.title('Passenger Count');
plt.show()

#### Pickup Datetime 
- Extract time features from pickup_datetime (hour, day of week, month, year)
- Create a method `def extract_time_features(_df)` that you will be able to re-use later
- Be careful of timezone
- Explore the newly created features 

In [None]:
def extract_time_features(df):
    timezone_name = 'America/New_York'
    time_column = "pickup_datetime"
    df.index = pd.to_datetime(df[time_column])
    df.index = df.index.tz_convert(timezone_name)
    df["dow"] = df.index.weekday
    df["hour"] = df.index.hour
    df["month"] = df.index.month
    df["year"] = df.index.year
    return df.reset_index(drop=True)

In [None]:
%%time
df = extract_time_features(df)
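To see why the timezone matters, here is a small sketch that assumes the timestamps are stored as UTC strings of the form `YYYY-MM-DD HH:MM:SS UTC` (the two values below are hypothetical). It compares the hour and day in UTC with the hour and day in New York local time.

```python
import pandas as pd

# Hypothetical timestamps, assumed to be stored as UTC strings
raw = pd.Series(["2009-06-15 17:26:21 UTC", "2010-01-05 03:52:16 UTC"])

# Parse with an explicit format, then mark the result as UTC
utc_times = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S UTC", utc=True)

# Convert to New York local time (daylight saving time is handled automatically)
ny_times = utc_times.dt.tz_convert("America/New_York")

hours_utc = utc_times.dt.hour.tolist()
hours_ny = ny_times.dt.hour.tolist()
days_ny = ny_times.dt.day.tolist()

print(hours_utc, hours_ny, days_ny)
```

The second trip starts on January 5th in UTC but on the evening of January 4th in New York, so both the hour and the day-of-week features would be wrong without the conversion.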

In [None]:
# hour of day
sns.catplot(x="hour", kind="count", palette=palette, data=df, height=5, aspect=3);
sns.despine()
plt.title('Hour of Day');
plt.show()

In [None]:
# day of week
sns.catplot(x="dow", kind="count", palette=palette, data=df, height=5, aspect=3);
sns.despine()
plt.title('Day of Week');
plt.show()

#### Geospatial Data
- Find the boundaries (min/max latitude and longitude) of the test set
- Remove from the training set the trips that fall outside these boundaries
- Visualize pickup locations on a map (for example with a `folium` heatmap), overall and by hour of day

In [None]:
df_test = pd.read_csv("./data/test.csv")

In [None]:
# find the coordinate boundaries of the test set; training trips outside them will be removed
for col in ["pickup_latitude", "pickup_longitude", "dropoff_latitude", "dropoff_longitude"]:
    MIN = df_test[col].min()
    MAX = df_test[col].max()
    print(col, MIN, MAX)

In [None]:
df = df[df["pickup_latitude"].between(left = 40, right = 42 )]
df = df[df["pickup_longitude"].between(left = -74.3, right = -72.9 )]
df = df[df["dropoff_latitude"].between(left = 40, right = 42 )]
df = df[df["dropoff_longitude"].between(left = -74.3, right = -72.9 )]  # same bounds as pickup_longitude

In [None]:
# make sure you install folium first
import folium
from folium.plugins import HeatMap
from folium.plugins import HeatMapWithTime

In [None]:
center_location = [40.758896, -73.985130]
m = folium.Map(location=center_location, control_scale=True, zoom_start=11)

In [None]:
%matplotlib notebook
df["count"] =1
heatmap_data = df.head(10000)[['pickup_latitude', 'pickup_longitude', 'count']].groupby(['pickup_latitude', 'pickup_longitude']).sum().reset_index().values.tolist()
gradient = {0.2: 'blue', 0.4: 'lime', 0.6: 'orange', 1: 'red'}
HeatMap(data=heatmap_data, radius=5, gradient=gradient, max_zoom=13).add_to(m)
m

In [None]:
heatmap_data_by_hour = []
__df__ = df.head(10000)
for hour in df.hour.sort_values().unique():
    _df = __df__[__df__.hour == hour][['pickup_latitude', 'pickup_longitude', 'count']].groupby(['pickup_latitude', 'pickup_longitude']).sum().reset_index().values.tolist()
    heatmap_data_by_hour.append(_df)

In [None]:
m2 = folium.Map(location=center_location, control_scale=True, zoom_start=11)
HeatMapWithTime(heatmap_data_by_hour, radius=5, 
                gradient=gradient, 
                min_opacity=0.5, max_opacity=0.8, 
                use_local_extrema=False).add_to(m2)
m2

#### Distance
- Compute distance between pickup and dropoff location (tip: https://en.wikipedia.org/wiki/Haversine_formula)
- Write a method `def haversine_distance(df, **kwargs)` that you will be able to reuse later
- Compute a few statistics for distance and plot distance distribution

In [None]:
import numpy as np
def haversine_distance(df,
                         start_lat="start_lat",
                         start_lon="start_lon",
                         end_lat="end_lat",
                         end_lon="end_lon"):
    """
        Calculate the great circle distance between two points 
        on the earth (specified in decimal degrees).
        Vectorized version of the haversine distance for pandas df
        Computes distance in kms
    """

    lat_1_rad, lon_1_rad = np.radians(df[start_lat].astype(float)), np.radians(df[start_lon].astype(float))
    lat_2_rad, lon_2_rad = np.radians(df[end_lat].astype(float)), np.radians(df[end_lon].astype(float))
    dlon = lon_2_rad - lon_1_rad
    dlat = lat_2_rad - lat_1_rad

    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat_1_rad) * np.cos(lat_2_rad) * np.sin(dlon / 2.0) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    haversine_distance = 6371 * c
    return haversine_distance

df["distance"] = haversine_distance(df, 
                                   start_lat="pickup_latitude", start_lon="pickup_longitude",
                                   end_lat="dropoff_latitude", end_lon="dropoff_longitude"
                                  )
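A quick way to sanity-check a haversine implementation is to test it on cases with a known answer. The sketch below uses a scalar helper, `haversine_km`, which is a hypothetical name introduced here and not part of the exercise. It checks that one degree of latitude is about 111.2 km, that identical points are at distance 0, and that the distance is symmetric.

```python
import math

EARTH_RADIUS_KM = 6371  # same mean Earth radius as in the vectorized version

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# One degree of latitude along a meridian: radius * (pi / 180)
one_degree_lat = haversine_km(40.0, -74.0, 41.0, -74.0)

# Identical points are at distance 0
zero_dist = haversine_km(40.7, -74.0, 40.7, -74.0)

# Swapping start and end does not change the distance
d_ab = haversine_km(40.758896, -73.985130, 40.6441666667, -73.7822222222)
d_ba = haversine_km(40.6441666667, -73.7822222222, 40.758896, -73.985130)

print(round(one_degree_lat, 2), zero_dist, math.isclose(d_ab, d_ba))
```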

In [None]:
df.distance.describe()

In [None]:
%matplotlib inline
g = sns.histplot(df[df.distance < 50].distance, kde=True, stat="density")  # distplot is deprecated in recent seaborn versions
sns.despine()
plt.title("Distance distribution")
plt.show()

#### Explore how the target variable correlates with other variables
- As a first step, you can visualize the target variable against another variable. For categorical variables, it is often useful to compute the average target value for each category (Seaborn has plots that do it for you!). For continuous variables (like distance), you can use scatter plots or regression plots, or bucket the distance into different bins.
- There are many different ways to visualize the correlation between features, so be creative.

In [None]:
%matplotlib inline
sns.catplot(x="passenger_count", y="fare_amount", palette=palette, data=df, kind="bar", aspect=3)
sns.despine()
plt.show()

In [None]:
sns.catplot(x="hour", y="fare_amount", palette=palette, data=df, kind="bar", aspect=3)
sns.despine()
plt.show()

In [None]:
sns.catplot(x="dow", y="fare_amount", palette=palette, data=df, kind="bar", aspect=3)
sns.despine()
plt.show()

In [None]:
sns.scatterplot(x="distance", y="fare_amount", data=df[df.distance < 80])
plt.show()

In [None]:
sns.scatterplot(x="distance", y="fare_amount", hue="passenger_count", data=df[df.distance < 80])
plt.show()

## 3. Data cleaning <a id='part3'></a>

As you probably identified in the previous section during your data exploration, there are some values that do not seem valid.
In this section, you will take a few steps to clean the training data.


Remove all trips that look incorrect. We recommend writing a method `clean_data(df)` that you will be able to re-use in the next steps.

In [None]:
print("trips with zero or negative fares:", len(df[df.fare_amount <= 0]))
print("trips with too high distance:", len(df[df.distance >= 100]))
print("trips with too many passengers:", len(df[df.passenger_count > 8]))
print("trips with zero passenger:", len(df[df.passenger_count == 0]))

In [None]:
def clean_data(df, test=False):
    df = df.dropna(how='any', axis='rows')
    df = df[(df.dropoff_latitude != 0) | (df.dropoff_longitude != 0)]
    df = df[(df.pickup_latitude != 0) | (df.pickup_longitude != 0)]
    df = df[df.fare_amount.between(0, 4000)]
    df = df[df.passenger_count < 8]
    df = df[df.passenger_count >= 0]
    df = df[df["pickup_latitude"].between(40, 42)]
    df = df[df["pickup_longitude"].between(-74.3, -72.9 )]
    df = df[df["dropoff_latitude"].between(40, 42)]
    df = df[df["dropoff_longitude"].between(-74.3, -72.9)]  # same bounds as pickup_longitude
    return df

df_cleaned = clean_data(df)
"% data removed", (1 - len(df_cleaned) / len(df)) * 100

## 4. Evaluation metric <a id='part4'></a>

The evaluation metric for this competition is the root mean-squared error or RMSE. RMSE measures the difference between the predictions of a model, and the corresponding ground truth. A large RMSE is equivalent to a large average error, so smaller values of RMSE are better.

More details here https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/overview/evaluation
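For reference, the usual definition of RMSE can be written as follows, where $n$ is the number of trips, $\hat{y}_i$ the predicted fare and $y_i$ the true fare (this notation is introduced here for illustration and is not taken from the competition page):

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2}
```

Because the errors are squared before averaging, a few very large errors weigh much more than many small ones.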

Write a method `def compute_rmse(y_pred, y_true)` that computes the RMSE given `y_pred` and `y_true` which are two numpy arrays corresponding to model predictions and ground truth values.

This method will be useful to evaluate performance of your model

In [None]:
def compute_rmse(y_pred, y_true):
    return np.sqrt(((y_pred - y_true)**2).mean())

## 5. Model baseline <a id='part5'></a>

Before building your model, it is often useful to get a performance benchmark. For this, you will use a baseline model, i.e. a deliberately naive model, and compute the evaluation metric on that model.
Then, you will be able to see how much better your model is compared to the baseline. It is very common to see ML teams coming up with very sophisticated approaches without knowing by how much their model beats a very simple model.

- Generate predictions based on a simple heuristic
- Evaluate RMSE for these predictions

In [None]:
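# Baseline heuristic: always predict the mean fare of the cleaned data
y_pred = df_cleaned.fare_amount.mean()
# Evaluate on the same cleaned rows (the scalar prediction is broadcast against the series)
compute_rmse(y_pred, df_cleaned.fare_amount)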
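A useful way to read a mean-based baseline: when the constant prediction is the mean of the very data it is evaluated on, its RMSE equals the (population) standard deviation of the target. The sketch below checks this on a handful of hypothetical fares and compares it with a median baseline; `rmse` is a local helper defined only for this example.

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root mean squared error; y_pred may be a scalar or an array."""
    return np.sqrt(((y_pred - y_true) ** 2).mean())

# Hypothetical fares, for illustration only
fares = np.array([4.5, 16.9, 5.7, 7.7, 5.3])

mean_baseline_rmse = rmse(fares.mean(), fares)
median_baseline_rmse = rmse(np.median(fares), fares)

# np.std uses ddof=0 by default, i.e. the population standard deviation
print(mean_baseline_rmse, fares.std())
print(median_baseline_rmse)
```

Among all constant predictions, the mean minimizes the squared error on the data it was computed from, so the median baseline cannot score a lower RMSE on that same data.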

## 6. Build your first model <a id='part6'></a>

Now it is time to build your model!

To start we are going to use a linear model only. We will try more sophisticated models later during day 2.

Here are the different steps you have to follow:

1. Split the data into two different sets (training and validation). You will be measuring the performance of your model on the validation set.
2. Make sure you apply the data cleaning on your training set
3. Think about the different features you want to add in your model
4. For each of these features, make sure you apply the correct transformation so that the model can correctly learn from them (this is true for categorical variables like `hour of day` or `day of week`)
5. Train your model

##### Training/Validation Split

In [None]:
# training/validation
from sklearn.model_selection import train_test_split
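# hold out 10% of the rows for validation (an integer test_size would be an absolute number of rows)
df_train, df_val = train_test_split(df, test_size=0.1)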

##### Apply data cleaning on training set

In [None]:
df_train = clean_data(df_train)

##### List features (continuous vs categorical)

In [None]:
# features
target = "fare_amount"
features = ["distance"]
categorical_features = ["hour", "dow", "passenger_count"]

##### Features transformation
- Write a method `def transform_features(df, **kwargs)` because you will have to make sure you apply the same transformation on the validation (or test set) before making predictions
- For categorical features transformation, you can use `pandas.get_dummies` method

In [None]:
def transform_features(_df, dummy_features=None):
    encode = True if dummy_features is None else False
    dummy_features = dummy_features if dummy_features is not None else []
    for c in categorical_features:
        dummies = pd.get_dummies(_df[c], prefix=c)
        _df = pd.concat([_df, dummies], axis=1)
        if encode:
            dummy_features = dummy_features + (list(dummies.columns.values))
    for dummy_feature in [f for f in dummy_features if f not in _df.columns]:
        _df[dummy_feature] = 0 
    _df = _df[dummy_features + features]
    return _df, dummy_features
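The key point of such a transformation is that the validation (or test) set must end up with exactly the same dummy columns as the training set. As an alternative sketch, not the approach used in this notebook, the same alignment can be obtained with `DataFrame.reindex`; the toy frames below are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames: the validation set lacks hours 0 and 23 and contains an unseen hour 5
train = pd.DataFrame({"hour": [0, 12, 23]})
val = pd.DataFrame({"hour": [12, 5]})

train_dummies = pd.get_dummies(train["hour"], prefix="hour")

# Reindex on the training columns: missing dummies are filled with 0, unseen ones are dropped
val_dummies = pd.get_dummies(val["hour"], prefix="hour").reindex(
    columns=train_dummies.columns, fill_value=0
)

print(list(val_dummies.columns))
```

Dropping categories that were never seen during training is a design choice: the model has no coefficient for them, so they cannot contribute to the prediction anyway.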

##### Model training

In [None]:
# model training
from sklearn.linear_model import LassoCV
clf = LassoCV(cv=5, n_alphas=5)
X_train, dummy_features = transform_features(df_train)
y_train = df_train.fare_amount
clf.fit(X_train, y_train)

## 7. Model evaluation <a id='part7'></a>

Now to evaluate your model, you need to use your previously trained model to make predictions on the validation set. 

For this, follow these steps:
1. Apply the same transformations on the validation set
2. Make predictions
3. Evaluate predictions using `compute_rmse` method

In [None]:
X_val, _ = transform_features(df_val, dummy_features=dummy_features)
df_val["y_pred"] = clf.predict(X_val)
compute_rmse(df_val.y_pred, df_val.fare_amount)

## 8. Kaggle submission <a id='part8'></a>

Now that you have a model, you can make predictions on the Kaggle test set and be evaluated by Kaggle directly.

- Download test data from Kaggle
- Follow [instructions](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/overview/evaluation) to make sure your predictions are in the right format
- Re-train your model using all the data (do not split between train/validation)
- Apply all feature engineering and transformation methods on the test set
- Use the model to make predictions on the test set
- Submit your predictions!

In [None]:
df_test = pd.read_csv("./data/test.csv")
df_test.head(1)

In [None]:
df_test["distance"] = haversine_distance(df_test, 
                                   start_lat="pickup_latitude", start_lon="pickup_longitude",
                                   end_lat="dropoff_latitude", end_lon="dropoff_longitude"
                                  )
df_test = extract_time_features(df_test)
X_test, _ = transform_features(df_test, dummy_features=dummy_features) 
df_test["y_pred"] = clf.predict(X_test)

In [None]:
df_test.head(1)

In [None]:
df_test.reset_index(drop=True)[["key", "y_pred"]].rename(columns={"y_pred": "fare_amount"}).to_csv("lasso_v0_predictions.csv", index=False)

## 9. Push further Feature Engineering <a id='part9'></a>

You can improve your model by trying different things (but don't worry, some of these things will be covered in the next days).
- Use more data to train
- Build and add more features 
- Try different estimators
- Adjust your data cleaning to remove more or less data
- Tune the hyperparameters of your model

In the following section we will focus on advanced Feature Engineering (keep in mind that relevant feature engineering is often key to a significant increase in model performance):

👉 **Manhattan distance** better suited to our problem  
👉 **Distance to NYC center** to highlight interesting pattern ...  
👉 **Direction**   

###### Another Distance ?
- Think about the distance you used, and try to find a distance better suited to our problem (ask a TA for insights)

![Minkowski distance](https://wikimedia.org/api/rest_v1/media/math/render/svg/4ed8b780e0d3224880760b1745c444481590ee86)

In [None]:
# The Minkowski distance is a generic distance from which different distances can be derived
def minkowski_distance(x1, x2, y1, y2, p):
    return (abs(x2 - x1) ** p + abs(y2 - y1) ** p) ** (1 / p)

In [None]:
# euclidian distance = minkowski_distance(x1, x2, y1, y2, p) where p=2
# manhattan distance = minkowski_distance(x1, x2, y1, y2, p) where p=1
df['manhattan_dist'] = minkowski_distance(df['pickup_longitude'], df['dropoff_longitude'],
                                       df['pickup_latitude'], df['dropoff_latitude'], 1)

df['euclidian_dist'] = minkowski_distance(df['pickup_longitude'], df['dropoff_longitude'],
                                       df['pickup_latitude'], df['dropoff_latitude'], 2)
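A small worked example may help to see what the parameter `p` changes. The sketch below uses a local helper `minkowski` (same formula, redefined so the example is self-contained) on two points that are 3 units apart on x and 4 units apart on y.

```python
def minkowski(x1, x2, y1, y2, p):
    """Minkowski distance of order p between (x1, y1) and (x2, y2)."""
    return (abs(x2 - x1) ** p + abs(y2 - y1) ** p) ** (1 / p)

# Two points 3 units apart on x and 4 units apart on y
manhattan = minkowski(0, 3, 0, 4, p=1)  # |3| + |4|
euclidean = minkowski(0, 3, 0, 4, p=2)  # sqrt(3**2 + 4**2)

print(manhattan, euclidean)
```

Keep in mind that when the inputs are raw longitudes and latitudes, the result is expressed in degrees, not in kilometers.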

###### Distance from the center 
- Compute a new feature: the distance of the pickup location from the center
- Scatter plot *distance_to_center* against *distance*
- What do you observe ? What new features could you add ? How are these new features correlated to the target ?

In [None]:
# Let's compute distance from NYC center
nyc_center = (40.7141667, -74.0063889)
df["nyc_lat"], df["nyc_lng"] = nyc_center[0], nyc_center[1]
args =  dict(start_lat="nyc_lat", start_lon="nyc_lng",
            end_lat="pickup_latitude", end_lon="pickup_longitude")

df['distance_to_center'] = haversine_distance(df, **args)

In [None]:
idx = (df.distance < 40) & (df.distance_to_center < 40)
sns.scatterplot(x="distance_to_center", y="distance", data=df[idx].sample(10000), hue="fare-bin")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

plt.show()

In [None]:
df.distance_to_center.hist(bins=100)

👉 **Take time to step back: can you spot an interesting pattern here? What are these clusters of trips at the same distance from the center?**

In [None]:
# The clusters seem to sit at a fixed distance_to_center: compute distances to JFK
jfk_center = (40.6441666667, -73.7822222222)

df["jfk_lat"], df["jfk_lng"] = jfk_center[0], jfk_center[1]
args_pickup =  dict(start_lat="jfk_lat", start_lon="jfk_lng",
            end_lat="pickup_latitude", end_lon="pickup_longitude")
args_dropoff =  dict(start_lat="jfk_lat", start_lon="jfk_lng",
            end_lat="dropoff_latitude", end_lon="dropoff_longitude")

df['pickup_distance_to_jfk'] = haversine_distance(df, **args_pickup)
df['dropoff_distance_to_jfk'] = haversine_distance(df, **args_dropoff)

In [None]:
df.pickup_distance_to_jfk.hist(bins=100)

###### Which direction are you heading to?
- Compute a new feature: the direction you are heading to
- What do you observe ? What new features could you add ? How are these new features correlated to the target ?

In [None]:
def calculate_direction(d_lon, d_lat):
    result = np.zeros(len(d_lon))
    l = np.sqrt(d_lon**2 + d_lat**2)
    result[d_lon>0] = (180/np.pi)*np.arcsin(d_lat[d_lon>0]/l[d_lon>0])
    idx = (d_lon<0) & (d_lat>0)
    result[idx] = 180 - (180/np.pi)*np.arcsin(d_lat[idx]/l[idx])
    idx = (d_lon<0) & (d_lat<0)
    result[idx] = -180 - (180/np.pi)*np.arcsin(d_lat[idx]/l[idx])
    return result

In [None]:
df['delta_lon'] = df.pickup_longitude - df.dropoff_longitude
df['delta_lat'] = df.pickup_latitude - df.dropoff_latitude
df['direction'] = calculate_direction(df.delta_lon, df.delta_lat)
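As an alternative sketch, the angle of a displacement vector can also be obtained directly with `np.arctan2`, which handles all four quadrants in one call. The helper name `direction_degrees` is hypothetical. Unlike `calculate_direction`, it also assigns an angle to vectors with `d_lon == 0`, and to vectors with `d_lon < 0` and `d_lat == 0`, which that function leaves at 0.

```python
import numpy as np

def direction_degrees(d_lon, d_lat):
    """Angle of the (d_lon, d_lat) vector in degrees, between -180 and 180."""
    return np.degrees(np.arctan2(d_lat, d_lon))

# Unit vectors pointing east, north, west, south and north-east
d_lon = np.array([1.0, 0.0, -1.0, 0.0, 1.0])
d_lat = np.array([0.0, 1.0, 0.0, -1.0, 1.0])

angles = direction_degrees(d_lon, d_lat)
print(angles)
```

With this convention, 0 degrees corresponds to a positive `d_lon` and no change in latitude, and angles increase counter-clockwise.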

In [None]:
plt.figure(figsize=(10,6))
df.direction.hist(bins=180)

In [None]:
# plot direction vs average fare amount for fares inside manhattan
def select_within_boundingbox(df, BB):
    return (df.pickup_longitude >= BB[0]) & (df.pickup_longitude <= BB[1]) & \
           (df.pickup_latitude >= BB[2]) & (df.pickup_latitude <= BB[3]) & \
           (df.dropoff_longitude >= BB[0]) & (df.dropoff_longitude <= BB[1]) & \
           (df.dropoff_latitude >= BB[2]) & (df.dropoff_latitude <= BB[3])
BB_manhattan = (-74.025, -73.925, 40.7, 40.8)
idx_manhattan = select_within_boundingbox(df, BB_manhattan)


fig, ax = plt.subplots(1, 1, figsize=(14,6))
direc = pd.cut(df[idx_manhattan]['direction'], np.linspace(-180, 180, 37))
df[idx_manhattan].pivot_table('fare_amount', index=[direc], columns='year', aggfunc='mean').plot(ax=ax)
plt.xlabel('direction (degrees)')
plt.xticks(range(36), np.arange(-170, 190, 10))
plt.ylabel('average fare amount $USD');

In [None]:
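# numeric_only=True skips string columns such as key or fare-bin (needed with pandas 2.x)
corrs = df.corr(numeric_only=True)
other_cols = list(corrs)
other_cols.remove("fare_amount")
corrs['fare_amount'][other_cols].plot.bar(color = 'b');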
plt.title('Correlation with Fare Amount');