Permutation importance measures how important each feature is by evaluating how much the model's performance deteriorates when that feature's values are randomly shuffled. A higher permutation importance score indicates that the model relies more heavily on that feature to predict the outcome.
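The shuffling procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the eli5 implementation: the function name `permutation_importance_sketch`, the toy data, and the least-squares model are all assumptions made for the example, and the importance is computed on the training data only to keep the sketch short.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination, used here as the performance score."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def permutation_importance_sketch(predict, X, y, score, n_repeats=5, seed=0):
    """Mean drop in score when each column of X is shuffled in turn (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    baseline = score(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Shuffle one column only, which breaks its link to the target
            X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
            drops.append(baseline - score(y, predict(X_shuffled)))
        importances[j] = np.mean(drops)
    return importances

# Toy data (synthetic): y depends strongly on column 0, weakly on column 1, not on column 2
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(500, 3))
y_toy = 3.0 * X_toy[:, 0] + 0.5 * X_toy[:, 1] + rng.normal(scale=0.1, size=500)

# A plain least-squares fit stands in for the model
coef, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)
predict = lambda X: X @ coef

toy_importances = permutation_importance_sketch(predict, X_toy, y_toy, r2)
print(toy_importances)
```

The column the target depends on most should receive the largest score, and the unused column a score close to zero.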

Dataset used - https://www.kaggle.com/competitions/new-york-city-taxi-fare-prediction/data

In [16]:
!pip install eli5

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import shap
import eli5
from eli5.sklearn import PermutationImportance


In [8]:
df = pd.read_csv('data/train.csv', nrows=50000)
df.head()

Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
1,2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2,2011-08-18 00:35:00.00000049,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2
3,2012-04-21 04:30:42.0000001,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1
4,2010-03-09 07:51:00.000000135,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1


In [9]:
# Remove rows with extreme outlier coordinates or non-positive fares
df = df.query('pickup_latitude > 40.7 and pickup_latitude < 40.8 and ' +
                      'dropoff_latitude > 40.7 and dropoff_latitude < 40.8 and ' +
                      'pickup_longitude > -74 and pickup_longitude < -73.9 and ' +
                      'dropoff_longitude > -74 and dropoff_longitude < -73.9 and ' +
                      'fare_amount > 0'
                       )

In [12]:
base_features = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 
                 'dropoff_latitude', 'passenger_count']

X = df[base_features]
y = df.fare_amount

In [13]:
# Model 

train_X, valid_X, train_y, valid_y = train_test_split(X, y, random_state=1)

model = RandomForestRegressor(n_estimators=30, random_state=1).fit(train_X, train_y)

In [19]:
# Permutation importance, computed on the validation set
perm = PermutationImportance(model, random_state=1).fit(valid_X, valid_y)
eli5.show_weights(perm, feature_names=valid_X.columns.tolist())

Weight,Feature
0.8496  ± 0.0178,dropoff_latitude
0.8173  ± 0.0263,pickup_latitude
0.6186  ± 0.0617,pickup_longitude
0.5525  ± 0.0279,dropoff_longitude
-0.0026  ± 0.0019,passenger_count


Each weight is the drop in the model's validation score when that feature is shuffled (for a regressor, eli5 uses the model's default score, R², unless a different scoring is given). From the table, we can observe the following:

- Dropoff and pickup latitude: dropoff_latitude and pickup_latitude have the highest weights (0.8496 and 0.8173, respectively), so the model relies on them the most. Note that a permutation importance weight says how much the model depends on a feature, not the direction of its effect: a high weight does not mean that fares rise with latitude.

- Pickup and dropoff longitude: pickup_longitude and dropoff_longitude have somewhat lower weights (0.6186 and 0.5525, respectively). They are still important, but shuffling them hurts the model less than shuffling the latitude features does.

- Passenger count: The passenger_count feature has a weight of -0.0026 ± 0.0019, which is effectively zero. A small negative value means that predictions on the shuffled data happened to be marginally more accurate than on the real data, which is a matter of chance. The model makes essentially no use of the number of passengers when predicting the fare.
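To illustrate how these weights should be read, here is a small sketch on synthetic data using scikit-learn's `permutation_importance` from `sklearn.inspection` instead of eli5. The data and the column names `signal` and `noise` are assumptions made for the example: `signal` drives the target with a negative slope, and `noise` is an unrelated integer column standing in for something like passenger_count.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (not the taxi dataset)
rng = np.random.default_rng(0)
n = 2000
signal = rng.uniform(-1, 1, size=n)
noise = rng.integers(1, 7, size=n).astype(float)  # unrelated to the target
X_demo = np.column_stack([signal, noise])
y_demo = -5.0 * signal + rng.normal(scale=0.1, size=n)  # target falls as signal rises

tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=1)
rf = RandomForestRegressor(n_estimators=30, random_state=1).fit(tr_X, tr_y)

# Importance is measured on held-out data, as in the notebook
result = permutation_importance(rf, va_X, va_y, n_repeats=5, random_state=1)
for name, mean, std in zip(["signal", "noise"], result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.4f} +/- {std:.4f}")
```

Here `signal` gets a large positive importance even though its relationship with the target is negative, while `noise` lands near zero and may come out slightly negative depending on the shuffle.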

Let's introduce two new features, 'abs_lat_change' and 'abs_lon_change,' to the base features of the taxi dataset. These new features aim to capture the absolute change in latitude and longitude between the pickup and dropoff locations for each taxi ride.

The raw coordinates say where a ride starts and ends, but the model has to work out for itself how far apart those two points are. The absolute changes in latitude and longitude make that explicit. For instance, two rides may start in the same neighborhood, but one crosses the city (large 'abs_lon_change' and 'abs_lat_change') while the other stays within a small local area (small 'abs_lon_change' and 'abs_lat_change'). The new features separate these two scenarios directly and can potentially lead to more accurate taxi fare predictions.

In [21]:
# Create new features
df['abs_lon_change'] = (df.dropoff_longitude - df.pickup_longitude).abs()
df['abs_lat_change'] = (df.dropoff_latitude - df.pickup_latitude).abs()

In [22]:
# Add the new features to the base features
base_features.append('abs_lat_change')
base_features.append('abs_lon_change')

In [23]:
X = df[base_features]

In [24]:
# Model 

train_X, valid_X, train_y, valid_y = train_test_split(X, y, random_state=1)

model = RandomForestRegressor(n_estimators=30, random_state=1).fit(train_X, train_y)

In [25]:
# Permutation importance, computed on the validation set
perm = PermutationImportance(model, random_state=1).fit(valid_X, valid_y)
eli5.show_weights(perm, feature_names=valid_X.columns.tolist())

Weight,Feature
0.5701  ± 0.0286,abs_lat_change
0.4374  ± 0.0440,abs_lon_change
0.0940  ± 0.0207,dropoff_longitude
0.0932  ± 0.0242,pickup_longitude
0.0756  ± 0.0110,dropoff_latitude
0.0708  ± 0.0153,pickup_latitude
-0.0014  ± 0.0027,passenger_count


From the table, we can observe the following:

- Distance Traveled vs. Location Effect: The two most important features are 'abs_lat_change' and 'abs_lon_change', which represent the absolute change in latitude and longitude between the pickup and dropoff locations, respectively. Their permutation importance scores (0.5701 and 0.4374) are far higher than those of the raw coordinates, such as 'dropoff_latitude' and 'pickup_latitude' (0.0756 and 0.0708). This suggests that the distance traveled has a more substantial impact on taxi fares than the specific locations themselves.

- Reasons for the Latitude Change's Importance: 'abs_lat_change' scores higher than 'abs_lon_change', although among the raw coordinates the longitude features (0.0940 and 0.0932) now score slightly higher than the latitude features. Possible reasons why the latitudinal change matters more than the longitudinal change include:

    - Larger Latitudinal Distances: In the dataset, latitudinal distances might tend to be larger than longitudinal distances, leading to more significant variations in fare due to differences in latitudinal positions.
    - Cost of Travel: It may be more expensive to travel a fixed latitudinal distance. This could be due to factors like toll roads, traffic patterns, or differences in demand along the north-south direction, leading to higher fare fluctuations.
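Both hypotheses can be probed directly on the data. The sketch below shows one possible check; it uses a synthetic stand-in DataFrame named `rides` (the values and the fare formula are made up for illustration), so the same two steps would need to be run on the filtered taxi data to say anything about the real rides.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the filtered taxi data (values are not real)
rng = np.random.default_rng(0)
n = 1000
rides = pd.DataFrame({
    "abs_lat_change": rng.uniform(0, 0.05, size=n),
    "abs_lon_change": rng.uniform(0, 0.03, size=n),
})
rides["fare_amount"] = (2.5 + 150 * rides.abs_lat_change
                        + 120 * rides.abs_lon_change
                        + rng.normal(scale=0.5, size=n))

# Hypothesis 1: are latitudinal changes typically larger than longitudinal ones?
mean_changes = rides[["abs_lat_change", "abs_lon_change"]].mean()
print(mean_changes)

# Hypothesis 2: does a fixed change in latitude cost more than the same change in longitude?
# A linear least-squares fit gives a rough fare per degree in each direction.
A = np.column_stack([np.ones(n), rides.abs_lat_change, rides.abs_lon_change])
coef, *_ = np.linalg.lstsq(A, rides.fare_amount.to_numpy(), rcond=None)
intercept, per_deg_lat, per_deg_lon = coef
print(f"fare per degree of latitude:  {per_deg_lat:.1f}")
print(f"fare per degree of longitude: {per_deg_lon:.1f}")
```

When comparing the two per-degree figures, keep in mind that a degree of longitude covers less ground than a degree of latitude at New York's latitude, so the two are not equal distances.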

Conclusion: The permutation importance analysis highlights the dominance of the distance traveled over specific location effects in predicting taxi fares. The insights gained from permutation importance are valuable for debugging the model, understanding the drivers of the outcome, and communicating a high-level overview of the model's behavior to stakeholders.

It's important to note that the importance of features can vary depending on the dataset, context, and the specific problem being solved. Permutation importance provides an interpretable and model-agnostic approach to feature importance analysis, helping data scientists and analysts gain deeper insights into their models' behavior and improve their understanding of the underlying data patterns.