# Model metrics

## Dataset Description 

- `id` - Trip ID
- `vendor_id` - ID of the transportation company
- `pickup_datetime` - Timestamp of the trip start
- `dropoff_datetime` - Timestamp of the trip end
- `passenger_count` - Number of passengers
- `pickup_longitude` - Longitude of the pickup location
- `pickup_latitude` - Latitude of the pickup location
- `dropoff_longitude` - Longitude of the dropoff location
- `dropoff_latitude` - Latitude of the dropoff location
- `store_and_fwd_flag` - Yes/No flag: whether the trip record was stored in the vehicle's memory because the connection to the server was lost

## Tasks

### Task 1

Import the data you need to build the models.

In [1]:
import pandas as pd
import numpy as np

In [2]:
initial_data = pd.read_csv('initial_data.csv', index_col='id')

initial_cols = ['vendor_id', 'passenger_count', 'pickup_longitude',
                'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude',
                'trip_duration']

initial_data = initial_data[initial_cols]

In [3]:
initial_data.head()

| id | vendor_id | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | trip_duration |
|---|---|---|---|---|---|---|---|
| id2875421 | 2 | 1 | -73.982155 | 40.767937 | -73.96463 | 40.765602 | 455.0 |
| id2377394 | 1 | 1 | -73.980415 | 40.738564 | -73.999481 | 40.731152 | 663.0 |
| id3858529 | 2 | 1 | -73.979027 | 40.763939 | -74.005333 | 40.710087 | 2124.0 |
| id3504673 | 2 | 1 | -74.01004 | 40.719971 | -74.012268 | 40.706718 | 429.0 |
| id2181028 | 2 | 1 | -73.973053 | 40.793209 | -73.972923 | 40.78252 | 435.0 |


In [4]:
initial_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1458644 entries, id2875421 to id1209952
Data columns (total 7 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   vendor_id          1458644 non-null  int64  
 1   passenger_count    1458644 non-null  int64  
 2   pickup_longitude   1458644 non-null  float64
 3   pickup_latitude    1458644 non-null  float64
 4   dropoff_longitude  1458644 non-null  float64
 5   dropoff_latitude   1458644 non-null  float64
 6   trip_duration      1458644 non-null  float64
dtypes: float64(5), int64(2)
memory usage: 89.0+ MB


### Task 2

Apply a logarithmic transform (`log1p`) to the target column, so that the mean squared error computed on the transformed values is the MSLE used to evaluate the models.

In [5]:
initial_data = initial_data.assign(log_trip_duration=np.log1p(initial_data['trip_duration']))
initial_data = initial_data.drop('trip_duration', axis=1)
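
The transform works because MSLE is the mean squared error between `log(1 + y)` and `log(1 + y_pred)`. The minimal sketch below checks that equivalence against scikit-learn's `mean_squared_log_error`. The `y_pred` values are made-up numbers for illustration, not predictions from any model in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Trip durations taken from the first rows of the table above
y_true = np.array([455.0, 663.0, 2124.0, 429.0, 435.0])
# Hypothetical predictions, chosen only for illustration
y_pred = np.array([500.0, 600.0, 1800.0, 450.0, 400.0])

# MSLE computed directly on the raw values
msle_direct = mean_squared_log_error(y_true, y_pred)

# Plain MSE computed on log1p-transformed values
mse_on_logs = mean_squared_error(np.log1p(y_true), np.log1p(y_pred))

print(np.isclose(msle_direct, mse_on_logs))

# np.expm1 inverts np.log1p, mapping log-scale values back to the original scale
restored = np.expm1(np.log1p(y_true))
```

A model trained on the transformed target predicts on the log scale, so `np.expm1` is needed to turn its predictions back into durations.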

### Task 3

Split your data into train and test samples.

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
X = initial_data.drop('log_trip_duration', axis=1)
y = initial_data['log_trip_duration']

X_train, X_test, Y_train, Y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=42)
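
Two properties of `train_test_split` matter later in this notebook: a fixed `random_state` makes the split reproducible, and the returned frames keep the original index labels. The sketch below illustrates both on a small hypothetical frame (`toy` and its columns are assumptions standing in for `initial_data`).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for `initial_data`
toy = pd.DataFrame(
    {"feature": np.arange(10.0), "target": np.arange(10.0) * 2},
    index=[f"id{i}" for i in range(10)],
)
X_toy = toy.drop("target", axis=1)
y_toy = toy["target"]

# Same arguments, same random_state: the two calls select the same rows
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
X_tr_again, X_te_again, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(len(X_tr), len(X_te))
print(X_te.index.equals(X_te_again.index))
print(X_tr.index.equals(y_tr.index))  # features and target stay aligned
```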

### Task 4

Apply `KFold` splitting and evaluate the performance of a simple `LinearRegression` model with cross-validation.

In [8]:
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

In [9]:
splitter = KFold(n_splits=20, shuffle=True, random_state=33)

In [10]:
losses_train = []
losses_test = []

for train_index, test_index in splitter.split(X_train):
    # Select the rows of the current fold by position
    x_train, x_test = X_train.values[train_index], X_train.values[test_index]
    y_train, y_test = Y_train.values[train_index], Y_train.values[test_index]

    model = LinearRegression()
    model.fit(x_train, y_train)

    # The target is already log1p-transformed, so this MSE is the MSLE
    losses_train.append(np.mean((model.predict(x_train) - y_train)**2))
    losses_test.append(np.mean((model.predict(x_test) - y_test)**2))
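
To see what the splitter in the loop does, the sketch below runs the same `KFold` configuration on a hypothetical array of 100 rows (the size is an assumption chosen so the folds divide evenly) and counts how often each row lands in a validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 100  # hypothetical sample size, not the real dataset size
data = np.arange(n_rows)
splitter = KFold(n_splits=20, shuffle=True, random_state=33)

validation_counts = np.zeros(n_rows, dtype=int)
fold_sizes = []
for train_index, test_index in splitter.split(data):
    # Indices are positions in `data`, not index labels
    validation_counts[test_index] += 1
    fold_sizes.append((len(train_index), len(test_index)))

print(len(fold_sizes))
print(set(fold_sizes))
print(validation_counts.min(), validation_counts.max())
```

Every row is used for validation exactly once, and each model is trained on the remaining 19 folds.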

### Task 5

What are the mean MSLE scores across the cross-validation folds on the train and test samples?

In [11]:
print(round(np.mean(losses_train), 3))
print(round(np.mean(losses_test), 3))

0.609
0.613


### Task 6

Assess the performance of a simple `LinearRegression` model fitted on the whole training sample and evaluated on the held-out test sample.

In [12]:
model = LinearRegression()
model.fit(X_train, Y_train)
    
print(round(np.mean((model.predict(X_test) - Y_test)**2), 3))

0.606


### Task 7

Let's build another model on a different set of features, loaded from `processed_data.csv`.

In [13]:
processed_data = pd.read_csv('processed_data.csv', index_col='id')

In [14]:
processed_data.head()

| id | vendor_id | passenger_count | store_and_fwd_flag | trip_duration | distance_km |
|---|---|---|---|---|---|
| id2875421 | 1 | 930.399753 | 0 | 455.0 | 1.500479 |
| id2377394 | 0 | 930.399753 | 0 | 663.0 | 1.807119 |
| id3858529 | 1 | 930.399753 | 0 | 2124.0 | 6.39208 |
| id3504673 | 1 | 930.399753 | 0 | 429.0 | 1.487155 |
| id2181028 | 1 | 930.399753 | 0 | 435.0 | 1.189925 |


In [15]:
processed_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1458644 entries, id2875421 to id1209952
Data columns (total 5 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   vendor_id           1458644 non-null  int64  
 1   passenger_count     1458644 non-null  float64
 2   store_and_fwd_flag  1458644 non-null  int64  
 3   trip_duration       1458644 non-null  float64
 4   distance_km         1458644 non-null  float64
dtypes: float64(3), int64(2)
memory usage: 66.8+ MB


### Task 8

Apply the log transformation to the target column.

In [16]:
processed_data = processed_data.assign(log_trip_duration=np.log1p(processed_data['trip_duration']))
processed_data = processed_data.drop('trip_duration', axis=1)

In [17]:
X_2 = processed_data.drop('log_trip_duration', axis=1)
y_2 = processed_data['log_trip_duration']

### Task 9

Reuse the same train/test split as before, so that the comparison between the models is fair.

In [18]:
test_indexes = X_test.index
train_indexes = X_train.index

X_train_2 = X_2[X_2.index.isin(train_indexes)]
y_train_2 = y_2[y_2.index.isin(train_indexes)]

X_test_2 = X_2[X_2.index.isin(test_indexes)]
y_test_2 = y_2[y_2.index.isin(test_indexes)]
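
Filtering with `isin` selects the same rows as the first split but keeps them in the order of `X_2`, not in the shuffled order of `X_train`. Since `KFold` works on row positions, the cross-validation folds may then contain different rows than in Task 4, even with the same `random_state`. If identical folds matter, one option is label-based selection with `.loc`, sketched below on hypothetical data (`second` and `shuffled_index` are illustrative stand-ins for `X_2` and `X_train.index`).

```python
import pandas as pd

# Hypothetical stand-in for X_2, stored in the original file order
second = pd.DataFrame(
    {"distance_km": [1.5, 1.8, 6.4, 1.4]},
    index=["id1", "id2", "id3", "id4"],
)
# Hypothetical stand-in for X_train.index, in shuffled order
shuffled_index = pd.Index(["id3", "id1", "id4"])

by_isin = second[second.index.isin(shuffled_index)]  # keeps the order of `second`
by_loc = second.loc[shuffled_index]                  # follows `shuffled_index`

print(list(by_isin.index))
print(list(by_loc.index))
```

Both selections contain the same rows, so the final test-set score is unaffected; only the row order, and therefore the fold membership, differs.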

### Task 10

Apply `KFold` splitting and evaluate the performance of a simple `LinearRegression` model with cross-validation, this time using the built-in `cross_validate` function.

In [19]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import mean_squared_error

In [20]:
model = LinearRegression()
splitter = KFold(n_splits=20, shuffle=True, random_state=33)

scores = cross_validate(
    model, X_train_2, y_train_2,
    scoring='neg_mean_squared_error',
    cv=splitter,
    return_train_score=True
)

In [21]:
round(scores['test_score'].mean() * -1, 3)

0.431
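
`cross_validate` returns a dictionary of per-fold arrays, and with `scoring='neg_mean_squared_error'` the scores are negated MSE values, which is why the sign is flipped above. The sketch below shows how both the train and the test scores can be read from that dictionary. It uses synthetic data (`X_demo`, `y_demo` and the coefficients are assumptions), so the numbers it prints are unrelated to the taxi dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression problem, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

scores = cross_validate(
    LinearRegression(), X_demo, y_demo,
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=33),
    return_train_score=True,
)

# One entry per fold; negate to recover the usual (positive) MSE
train_mse = -scores["train_score"].mean()
test_mse = -scores["test_score"].mean()

print(sorted(scores.keys()))
print(round(train_mse, 3), round(test_mse, 3))
```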

### Task 11

Assess the performance of a simple `LinearRegression` model fitted on the whole training sample and evaluated on the held-out test sample.

In [22]:
model = LinearRegression()
model.fit(X_train_2, y_train_2)

print(round(np.mean((model.predict(X_test_2) - y_test_2)**2), 3))

0.407
