# Model metrics

## Dataset Description 

- `id` - Trip ID
- `vendor_id` - ID of the transportation company
- `pickup_datetime` - Timestamp of the trip start
- `dropoff_datetime` - Timestamp of the trip end
- `passenger_count` - Number of passengers
- `pickup_longitude` - Longitude of the pickup location
- `pickup_latitude` - Latitude of the pickup location
- `dropoff_longitude` - Longitude of the dropoff location
- `dropoff_latitude` - Latitude of the dropoff location
- `store_and_fwd_flag` - Yes/No flag: whether the trip record was stored in the vehicle's memory because the connection to the server was lost

## Tasks

### Task 1

Calculate the MSE of the two predictions in the data (`prediction_1` and `prediction_2`) against `trip_duration`.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('taxi_dataset_with_predictions.csv', index_col=0)

In [3]:
df.head()

| id | vendor_id | pickup_datetime | passenger_count | store_and_fwd_flag | trip_duration | distance_km | prediction_1 | prediction_2 |
|---|---|---|---|---|---|---|---|---|
| id2875421 | 1 | 2016-03-14 17:24:55 | 930.399753 | 0 | 455.0 | 1.500479 | 578.156451 | 355.27071 |
| id2377394 | 0 | 2016-06-12 00:43:35 | 930.399753 | 0 | 663.0 | 1.807119 | 962.657188 | 674.295781 |
| id3858529 | 1 | 2016-01-19 11:35:24 | 930.399753 | 0 | 2124.0 | 6.39208 | 2546.180515 | 2422.132431 |
| id3504673 | 1 | 2016-04-06 19:32:31 | 930.399753 | 0 | 429.0 | 1.487155 | 737.926214 | 795.992362 |
| id2181028 | 1 | 2016-03-26 13:30:55 | 930.399753 | 0 | 435.0 | 1.189925 | 666.070794 | -4.158492 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1458644 entries, id2875421 to id1209952
Data columns (total 8 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   vendor_id           1458644 non-null  int64  
 1   pickup_datetime     1458644 non-null  object 
 2   passenger_count     1458644 non-null  float64
 3   store_and_fwd_flag  1458644 non-null  int64  
 4   trip_duration       1458644 non-null  float64
 5   distance_km         1458644 non-null  float64
 6   prediction_1        1458644 non-null  float64
 7   prediction_2        1458644 non-null  float64
dtypes: float64(5), int64(2), object(1)
memory usage: 100.2+ MB


In [5]:
# Mean of the squared differences, computed directly on the pandas Series
error_1 = ((df['trip_duration'] - df['prediction_1']) ** 2).mean()
error_2 = ((df['trip_duration'] - df['prediction_2']) ** 2).mean()

In [6]:
print(f'MSE prediction_1: {int(error_1)}')
print(f'MSE prediction_2: {int(error_2)}')

MSE prediction_1: 99994
MSE prediction_2: 124936
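As a sanity check on the definition, MSE is the mean of the squared differences between the actual and predicted values. The sketch below works through it on three toy values that are not taken from the taxi dataset; the helper name `mse` is an illustrative assumption, not part of the original notebook.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


# Toy values chosen for easy arithmetic (not from the dataset)
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]

# Squared errors are 10**2, 10**2 and 30**2, so the mean is 1100 / 3
toy_mse = mse(y_true, y_pred)
print(f'Toy MSE: {toy_mse:.2f}')
```

The same expression should also work when `y_true` and `y_pred` are DataFrame columns such as `df['trip_duration']` and `df['prediction_1']`.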


### Task 2

Calculate RMSE for both predictions. 

In [7]:
# Use new names so that re-running this cell does not take the square root twice
rmse_1 = error_1 ** .5
rmse_2 = error_2 ** .5

In [8]:
print(f'RMSE prediction_1: {int(rmse_1)}')
print(f'RMSE prediction_2: {int(rmse_2)}')

RMSE prediction_1: 316
RMSE prediction_2: 353


### Task 3

Calculate MAE for both predictions. 

In [9]:
# Mean of the absolute differences, computed directly on the pandas Series
absolute_error_1 = (df['trip_duration'] - df['prediction_1']).abs().mean()
absolute_error_2 = (df['trip_duration'] - df['prediction_2']).abs().mean()

In [10]:
print(f'MAE prediction_1: {int(absolute_error_1)}')
print(f'MAE prediction_2: {int(absolute_error_2)}')

MAE prediction_1: 300
MAE prediction_2: 281
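Note that the two metrics rank the models differently: the second model has the lower MAE (281 vs 300) but the higher RMSE (353 vs 316). One plausible explanation is that squaring gives large errors more weight, so a model with a few big misses can look worse under RMSE while looking better under MAE. The toy sketch below illustrates that effect only; the values and the helper names `rmse` and `mae` are assumptions for illustration and say nothing definitive about the taxi predictions.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))


# Toy values (not from the dataset)
y_true = [100.0, 100.0, 100.0, 100.0]
even_pred = [110.0, 110.0, 110.0, 110.0]   # every error is 10
spiky_pred = [100.0, 100.0, 100.0, 140.0]  # three exact hits, one error of 40

mae_even, mae_spiky = mae(y_true, even_pred), mae(y_true, spiky_pred)
rmse_even, rmse_spiky = rmse(y_true, even_pred), rmse(y_true, spiky_pred)

# Both MAEs are equal, but the single large miss raises the RMSE
print(f'MAE:  even={mae_even:.1f}, spiky={mae_spiky:.1f}')
print(f'RMSE: even={rmse_even:.1f}, spiky={rmse_spiky:.1f}')
```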


### Task 4

How often do the predictions of the first and the second model deviate from the actual `trip_duration` by 500 or more?

In [11]:
counter_1 = sum([1 if abs(x - y) >= 500 else 0 for x, y in zip(df['prediction_1'].to_list(), df['trip_duration'].to_list())])
counter_2 = sum([1 if abs(x - y) >= 500 else 0 for x, y in zip(df['prediction_2'].to_list(), df['trip_duration'].to_list())])

In [12]:
print(f'The number of deviations >= 500 for the first model: {counter_1}')
print(f'The number of deviations >= 500 for the second model: {counter_2}')

The number of deviations >= 500 for the first model: 33061
The number of deviations >= 500 for the second model: 228789
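Since the question asks "how often", it can also help to report the share of trips alongside the raw count. A minimal sketch using a boolean mask is shown below; the helper name `count_large_errors` and the toy values are assumptions for illustration.

```python
import numpy as np


def count_large_errors(y_true, y_pred, threshold=500.0):
    """Return (count, share) of predictions whose absolute error is >= threshold."""
    abs_err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    mask = abs_err >= threshold
    # Summing a boolean mask counts the True values; its mean is their share
    return int(mask.sum()), float(mask.mean())


# Toy values (not from the dataset); absolute errors are 50, 500, 600 and 100
y_true = [400.0, 1000.0, 1500.0, 2000.0]
y_pred = [450.0, 1500.0, 900.0, 2100.0]

count, share = count_large_errors(y_true, y_pred)
print(f'{count} deviations >= 500 ({share:.0%} of trips)')
```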


### Task 5

Calculate RMSLE for both predictions. 

In [13]:
# RMSLE takes log(x + 1), which is undefined for x <= -1,
# so first replace all negative predictions with 0
pred_1 = [0 if x < 0 else x for x in df['prediction_1'].to_list()]
pred_2 = [0 if x < 0 else x for x in df['prediction_2'].to_list()]

In [14]:
from math import log


rmsle_1 = (np.mean([(log(x + 1) - log(y + 1))**2 for x, y in zip(pred_1, df['trip_duration'].to_list())])) ** .5
rmsle_2 = (np.mean([(log(x + 1) - log(y + 1))**2 for x, y in zip(pred_2, df['trip_duration'].to_list())])) ** .5

In [15]:
print(f'RMSLE prediction_1: {rmsle_1:.3f}')
print(f'RMSLE prediction_2: {rmsle_2:.3f}')

RMSLE prediction_1: 0.554
RMSLE prediction_2: 1.556
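The same calculation can be sketched in vectorized form with `np.clip` (to zero out negative predictions) and `np.log1p` (which computes `log(1 + x)`). The helper name `rmsle` and the toy values are assumptions for illustration.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; negative predictions are clipped to 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0.0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


# Toy values (not from the dataset), chosen so that log1p gives round numbers:
# log1p(y_true) is [1, 2]; log1p(clipped y_pred) is [2, 0]
y_true = [np.e - 1, np.e ** 2 - 1]
y_pred = [np.e ** 2 - 1, -5.0]  # the negative prediction is clipped to 0

# Log differences are 1 and -2, so the result is sqrt((1 + 4) / 2)
toy_rmsle = rmsle(y_true, y_pred)
print(f'Toy RMSLE: {toy_rmsle:.3f}')
```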


### Task 6

Calculate how many times the first model overpredicted and underpredicted the target.

In [16]:
over_predicted_1 = sum([1 if x - y > 0 else 0 for x, y in zip(pred_1, df['trip_duration'].to_list())])
under_predicted_1 = sum([1 if x - y < 0 else 0 for x, y in zip(pred_1, df['trip_duration'].to_list())])

In [17]:
print(f'The first model overpredicted in {over_predicted_1} cases.')
print(f'The first model underpredicted in {under_predicted_1} cases.')

The first model overpredicted in 1456721 cases.
The first model underpredicted in 1923 cases.


### Task 7

Calculate how many times the second model overpredicted and underpredicted the target.

In [18]:
over_predicted_2 = sum([1 if x - y > 0 else 0 for x, y in zip(pred_2, df['trip_duration'].to_list())])
under_predicted_2 = sum([1 if x - y < 0 else 0 for x, y in zip(pred_2, df['trip_duration'].to_list())])

In [19]:
print(f'The second model overpredicted in {over_predicted_2} cases.')
print(f'The second model underpredicted in {under_predicted_2} cases.')

The second model overpredicted in 811778 cases.
The second model underpredicted in 646866 cases.
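The over- and underprediction counts for either model can also be obtained from the sign of the difference between prediction and target. The sketch below additionally counts exact matches, which fall into neither group; the helper name `over_under_counts` and the toy values are assumptions for illustration.

```python
import numpy as np


def over_under_counts(y_true, y_pred):
    """Return (over, under, exact) counts based on the sign of y_pred - y_true."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    over = int((diff > 0).sum())    # prediction above the target
    under = int((diff < 0).sum())   # prediction below the target
    exact = int((diff == 0).sum())  # prediction equal to the target
    return over, under, exact


# Toy values (not from the dataset)
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [150.0, 180.0, 300.0, 500.0]

over, under, exact = over_under_counts(y_true, y_pred)
print(f'over={over}, under={under}, exact={exact}')
```

The three counts always add up to the number of rows, which gives a quick consistency check.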
