# Intro to ML

## Dataset Description 

- `id` - Trip ID
- `vendor_id` - ID of the transportation company
- `pickup_datetime` - Timestamp of the trip start
- `dropoff_datetime` - Timestamp of the trip end
- `passenger_count` - Number of passengers
- `pickup_longitude` - Longitude of the pickup location
- `pickup_latitude` - Latitude of the pickup location
- `dropoff_longitude` - Longitude of the dropoff location
- `dropoff_latitude` - Latitude of the dropoff location
- `store_and_fwd_flag` - Yes/No flag: whether the trip record was stored in the vehicle's memory because the connection to the server was lost

## Tasks

### Task 1

Load the data and inspect it.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

taxiDB = pd.read_csv('taxi_dataset.csv')

In [2]:
taxiDB.head()

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag
0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.96463,40.765602,N
1,id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N
2,id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979027,40.763939,-74.005333,40.710087,N
3,id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.01004,40.719971,-74.012268,40.706718,N
4,id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973053,40.793209,-73.972923,40.78252,N


In [3]:
taxiDB.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 10 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   id                  1458644 non-null  object 
 1   vendor_id           1458644 non-null  int64  
 2   pickup_datetime     1458644 non-null  object 
 3   dropoff_datetime    1458644 non-null  object 
 4   passenger_count     1458644 non-null  int64  
 5   pickup_longitude    1458644 non-null  float64
 6   pickup_latitude     1458644 non-null  float64
 7   dropoff_longitude   1458644 non-null  float64
 8   dropoff_latitude    1458644 non-null  float64
 9   store_and_fwd_flag  1458644 non-null  object 
dtypes: float64(4), int64(2), object(4)
memory usage: 111.3+ MB


### Task 2

Convert the datetime columns to the `datetime` data type.

In [4]:
taxiDB['pickup_datetime'] = pd.to_datetime(taxiDB['pickup_datetime'])
taxiDB['dropoff_datetime'] = pd.to_datetime(taxiDB['dropoff_datetime'])

In [5]:
taxiDB.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 10 columns):
 #   Column              Non-Null Count    Dtype         
---  ------              --------------    -----         
 0   id                  1458644 non-null  object        
 1   vendor_id           1458644 non-null  int64         
 2   pickup_datetime     1458644 non-null  datetime64[ns]
 3   dropoff_datetime    1458644 non-null  datetime64[ns]
 4   passenger_count     1458644 non-null  int64         
 5   pickup_longitude    1458644 non-null  float64       
 6   pickup_latitude     1458644 non-null  float64       
 7   dropoff_longitude   1458644 non-null  float64       
 8   dropoff_latitude    1458644 non-null  float64       
 9   store_and_fwd_flag  1458644 non-null  object        
dtypes: datetime64[ns](2), float64(4), int64(2), object(2)
memory usage: 111.3+ MB
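As an alternative to converting after loading, the same columns can be parsed while the file is read. The sketch below is only an illustration on a tiny in-memory CSV (the `csv_text` and `toy` names are hypothetical, and the single row is copied from the table in Task 1); it relies on the `parse_dates` parameter of `pd.read_csv`.

```python
import io

import pandas as pd

# A one-row stand-in for taxi_dataset.csv (hypothetical, for illustration only).
csv_text = (
    "id,pickup_datetime,dropoff_datetime\n"
    "id2875421,2016-03-14 17:24:55,2016-03-14 17:32:30\n"
)

# parse_dates asks read_csv to convert the listed columns to datetime on load.
toy = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["pickup_datetime", "dropoff_datetime"],
)

print(toy.dtypes)
```

Both approaches should end with the same `datetime64[ns]` columns; parsing at load time simply saves a separate conversion step.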


### Task 3

Create one of our targets, `trip_duration`, measured in seconds.

In [6]:
taxiDB['trip_duration'] = (taxiDB['dropoff_datetime'] - taxiDB['pickup_datetime']).dt.total_seconds()

In [7]:
taxiDB.head()

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.96463,40.765602,N,455.0
1,id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663.0
2,id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979027,40.763939,-74.005333,40.710087,N,2124.0
3,id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.01004,40.719971,-74.012268,40.706718,N,429.0
4,id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973053,40.793209,-73.972923,40.78252,N,435.0
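To see what the subtraction does, here is a minimal check on the first trip from the table above (the `pickup`, `dropoff` and `duration_s` names are introduced only for this illustration). Subtracting two datetime Series gives a timedelta Series, and `.dt.total_seconds()` turns it into float seconds.

```python
import pandas as pd

# Timestamps copied from the first row of the table above.
pickup = pd.Series(pd.to_datetime(["2016-03-14 17:24:55"]))
dropoff = pd.Series(pd.to_datetime(["2016-03-14 17:32:30"]))

# datetime - datetime -> timedelta; total_seconds() -> float seconds.
duration_s = (dropoff - pickup).dt.total_seconds()

print(duration_s.iloc[0])  # 455.0 (7 minutes 35 seconds)
```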


### Task 4

Remove the `dropoff_datetime` column from your dataset.

In [8]:
taxiDB.drop('dropoff_datetime', axis=1, inplace=True)

In [9]:
taxiDB.head()

Unnamed: 0,id,vendor_id,pickup_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id2875421,2,2016-03-14 17:24:55,1,-73.982155,40.767937,-73.96463,40.765602,N,455.0
1,id2377394,1,2016-06-12 00:43:35,1,-73.980415,40.738564,-73.999481,40.731152,N,663.0
2,id3858529,2,2016-01-19 11:35:24,1,-73.979027,40.763939,-74.005333,40.710087,N,2124.0
3,id3504673,2,2016-04-06 19:32:31,1,-74.01004,40.719971,-74.012268,40.706718,N,429.0
4,id2181028,2,2016-03-26 13:30:55,1,-73.973053,40.793209,-73.972923,40.78252,N,435.0


### Task 5

Change the values in the `vendor_id` column from {1, 2} to {0, 1}.

In [10]:
taxiDB['vendor_id'].value_counts()

2    780302
1    678342
Name: vendor_id, dtype: int64

In [11]:
taxiDB['vendor_id'] = taxiDB['vendor_id'] - 1

In [12]:
taxiDB['vendor_id'].value_counts()

1    780302
0    678342
Name: vendor_id, dtype: int64

### Task 6

Find another binary feature and encode it as {0, 1} as well.

In [13]:
taxiDB.store_and_fwd_flag.value_counts()

N    1450599
Y       8045
Name: store_and_fwd_flag, dtype: int64

In [14]:
def yes_no_to_binary(s):
    if s == 'N':
        return 0
    return 1

In [15]:
taxiDB.store_and_fwd_flag = taxiDB.store_and_fwd_flag.apply(yes_no_to_binary)

In [16]:
taxiDB.store_and_fwd_flag.value_counts()

0    1450599
1       8045
Name: store_and_fwd_flag, dtype: int64

In [17]:
taxiDB.head()

Unnamed: 0,id,vendor_id,pickup_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id2875421,1,2016-03-14 17:24:55,1,-73.982155,40.767937,-73.96463,40.765602,0,455.0
1,id2377394,0,2016-06-12 00:43:35,1,-73.980415,40.738564,-73.999481,40.731152,0,663.0
2,id3858529,1,2016-01-19 11:35:24,1,-73.979027,40.763939,-74.005333,40.710087,0,2124.0
3,id3504673,1,2016-04-06 19:32:31,1,-74.01004,40.719971,-74.012268,40.706718,0,429.0
4,id2181028,1,2016-03-26 13:30:55,1,-73.973053,40.793209,-73.972923,40.78252,0,435.0
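The same encoding can also be written with an explicit dictionary and `Series.map`. This is a sketch on a toy Series, not a replacement for the cells above; the `flags`, `flag_to_int` and `encoded` names are hypothetical.

```python
import pandas as pd

# Toy stand-in for the store_and_fwd_flag column.
flags = pd.Series(["N", "Y", "N", "N"], name="store_and_fwd_flag")

# Explicit mapping: every expected value is listed.
flag_to_int = {"N": 0, "Y": 1}
encoded = flags.map(flag_to_int)

print(encoded.tolist())  # [0, 1, 0, 0]
```

One difference in behavior is worth keeping in mind: `yes_no_to_binary` turns anything other than `'N'` into 1, whereas a dictionary passed to `map` produces `NaN` for values it does not list, which makes unexpected values visible.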


### Task 7

Save the first 10 rows of your dataset into a `csv` file. 

In [18]:
taxiDB = taxiDB.astype({'pickup_datetime': 'object', 'trip_duration': 'float64'})

In [19]:
taxiDB.head(10).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  10 non-null     object 
 1   vendor_id           10 non-null     int64  
 2   pickup_datetime     10 non-null     object 
 3   passenger_count     10 non-null     int64  
 4   pickup_longitude    10 non-null     float64
 5   pickup_latitude     10 non-null     float64
 6   dropoff_longitude   10 non-null     float64
 7   dropoff_latitude    10 non-null     float64
 8   store_and_fwd_flag  10 non-null     int64  
 9   trip_duration       10 non-null     float64
dtypes: float64(5), int64(3), object(2)
memory usage: 928.0+ bytes


In [20]:
### use ; as a separator between columns
taxiDB.head(10).to_csv('01_task6.csv', sep=';')
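
When a file written this way is read back, the same separator has to be passed to `pd.read_csv`. The sketch below shows a round trip through an in-memory buffer instead of a file; the `toy`, `buffer` and `restored` names are hypothetical.

```python
import io

import pandas as pd

# Two-row toy frame (values taken from the tables above).
toy = pd.DataFrame({"id": ["id2875421", "id2377394"],
                    "trip_duration": [455.0, 663.0]})

# Write with ';' as the separator; the index is written as the first column.
buffer = io.StringIO()
toy.to_csv(buffer, sep=";")

# Read back with the same separator and use the first column as the index.
buffer.seek(0)
restored = pd.read_csv(buffer, sep=";", index_col=0)

print(restored)
```

Without `index_col=0`, the saved index would come back as an extra column named `Unnamed: 0`.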

### Task 8

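Let's prepare the latitudes for the distance calculation: center them on the median latitude and convert degrees to kilometers (one degree of latitude is roughly 111.32 km).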

Use this article as a reference: [link](https://www.datafix.com.au/BASHing/2018-11-07.html)

In [21]:
allLat  = list(taxiDB['pickup_latitude']) + list(taxiDB['dropoff_latitude'])

In [22]:
medianLat  = sorted(allLat)[int(len(allLat)/2)]

In [23]:
latMultiplier  = 111.32

taxiDB['pickup_latitude']   = latMultiplier  * (taxiDB['pickup_latitude']   - medianLat)
taxiDB['dropoff_latitude']   = latMultiplier  * (taxiDB['dropoff_latitude']  - medianLat)

In [24]:
taxiDB.head()

Unnamed: 0,id,vendor_id,pickup_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id2875421,1,2016-03-14 17:24:55,1,-73.982155,1.516008,-73.96463,1.256121,0,455.0
1,id2377394,0,2016-06-12 00:43:35,1,-73.980415,-1.753813,-73.999481,-2.578912,0,663.0
2,id3858529,1,2016-01-19 11:35:24,1,-73.979027,1.070973,-74.005333,-4.923841,0,2124.0
3,id3504673,1,2016-04-06 19:32:31,1,-74.01004,-3.823568,-74.012268,-5.298809,0,429.0
4,id2181028,1,2016-03-26 13:30:55,1,-73.973053,4.329328,-73.972923,3.139453,0,435.0
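To make the transformation concrete, here is a small sketch on made-up latitudes. The helper `degrees_to_local_km` and the reference value 40.75 are assumptions for illustration only; the constant 111.32 is the `latMultiplier` used above.

```python
import numpy as np

KM_PER_DEGREE = 111.32  # same constant as latMultiplier above


def degrees_to_local_km(values_deg, reference_deg, km_per_degree=KM_PER_DEGREE):
    """Offset from a reference coordinate, converted from degrees to km."""
    return km_per_degree * (np.asarray(values_deg, dtype=float) - reference_deg)


# Made-up latitudes around a made-up reference of 40.75 degrees.
lat_km = degrees_to_local_km([40.75, 40.76, 40.74], reference_deg=40.75)

print(lat_km)  # approximately [0, 1.1132, -1.1132]
```

After the conversion, a value of 1.0 means "1 km north of the reference latitude", which is what lets the later distance step treat the coordinates as planar.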


### Task 9

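Let's do the same for the longitudes. The length of one degree of longitude shrinks with the cosine of the latitude, so the multiplier is scaled by the cosine of `medianLat`.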

In [25]:
allLong = list(taxiDB['pickup_longitude']) + list(taxiDB['dropoff_longitude'])

medianLong  = sorted(allLong)[int(len(allLong)/2)]

longMultiplier = np.cos(medianLat*(np.pi/180.0)) * 111.32

In [26]:
taxiDB['pickup_longitude']   = longMultiplier  * (taxiDB['pickup_longitude']   - medianLong)
taxiDB['dropoff_longitude']   = longMultiplier  * (taxiDB['dropoff_longitude']  - medianLong)

In [27]:
taxiDB.head()

Unnamed: 0,id,vendor_id,pickup_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id2875421,1,2016-03-14 17:24:55,1,-0.110015,1.516008,1.367786,1.256121,0,455.0
1,id2377394,0,2016-06-12 00:43:35,1,0.036672,-1.753813,-1.571088,-2.578912,0,663.0
2,id3858529,1,2016-01-19 11:35:24,1,0.153763,1.070973,-2.064547,-4.923841,0,2124.0
3,id3504673,1,2016-04-06 19:32:31,1,-2.4615,-3.823568,-2.649362,-5.298809,0,429.0
4,id2181028,1,2016-03-26 13:30:55,1,0.657515,4.329328,0.668452,3.139453,0,435.0
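The cosine factor in `longMultiplier` can be checked at two latitudes where the answer is easy to verify by hand. The helper name `km_per_degree_longitude` is hypothetical and exists only in this sketch.

```python
import numpy as np

KM_PER_DEGREE_LAT = 111.32  # same constant as in Task 8


def km_per_degree_longitude(latitude_deg):
    """Approximate length in km of one degree of longitude at a given latitude."""
    return np.cos(np.radians(latitude_deg)) * KM_PER_DEGREE_LAT


at_equator = km_per_degree_longitude(0.0)  # cos(0) = 1, so the full 111.32 km
at_60_deg = km_per_degree_longitude(60.0)  # cos(60 deg) = 0.5, so about 55.66 km

print(at_equator, at_60_deg)
```

Because the taxi trips cover a small area, using one multiplier computed at the median latitude is a reasonable simplification.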


### Task 10

Calculate distance in km between pickup and dropoff points using Euclidean distance. 

In [28]:
def distance_km(dropoff_latitude, pickup_latitude, dropoff_longitude, pickup_longitude):
    delta_lat = dropoff_latitude - pickup_latitude
    delta_long = dropoff_longitude - pickup_longitude
    return (delta_lat ** 2 + delta_long ** 2) ** .5

In [29]:
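# distance_km uses only arithmetic operators, so it can take whole columns at once
# instead of being applied row by row.
taxiDB['distance_km'] = distance_km(taxiDB['dropoff_latitude'], taxiDB['pickup_latitude'],
                                    taxiDB['dropoff_longitude'], taxiDB['pickup_longitude'])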

In [30]:
taxiDB.head()

Unnamed: 0,id,vendor_id,pickup_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration,distance_km
0,id2875421,1,2016-03-14 17:24:55,1,-0.110015,1.516008,1.367786,1.256121,0,455.0,1.500479
1,id2377394,0,2016-06-12 00:43:35,1,0.036672,-1.753813,-1.571088,-2.578912,0,663.0,1.807119
2,id3858529,1,2016-01-19 11:35:24,1,0.153763,1.070973,-2.064547,-4.923841,0,2124.0,6.39208
3,id3504673,1,2016-04-06 19:32:31,1,-2.4615,-3.823568,-2.649362,-5.298809,0,429.0,1.487155
4,id2181028,1,2016-03-26 13:30:55,1,0.657515,4.329328,0.668452,3.139453,0,435.0,1.189925
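As a sanity check on the planar approximation, it can be compared with the haversine (great-circle) formula, which is a different technique and is not used elsewhere in this notebook. The sketch below uses the original degree coordinates of the first trip from Task 1. Everything here is an assumption made for illustration: the function names, the mean Earth radius of 6371 km, and the use of the trip's midpoint latitude (rather than the dataset median) as the reference for the cosine factor.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius
KM_PER_DEGREE = 111.32    # constant used in Tasks 8 and 9


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def planar_km(lat1, lon1, lat2, lon2):
    """Euclidean distance in km after a local flat-Earth projection."""
    ref_lat = (lat1 + lat2) / 2  # midpoint latitude, an assumption of this sketch
    dy = KM_PER_DEGREE * (lat2 - lat1)
    dx = KM_PER_DEGREE * math.cos(math.radians(ref_lat)) * (lon2 - lon1)
    return math.hypot(dx, dy)


# Pickup and dropoff of the first trip (original degree coordinates).
pickup = (40.767937, -73.982155)
dropoff = (40.765602, -73.96463)

planar = planar_km(*pickup, *dropoff)
great_circle = haversine_km(*pickup, *dropoff)
print(planar, great_circle)
```

For a trip of this length the two values should agree to well within one percent, which suggests the Euclidean shortcut is adequate at city scale.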


### Task 11

Remove the coordinate columns, which we no longer need.

In [31]:
taxiDB = taxiDB.drop(['pickup_longitude', 'dropoff_longitude',
                      'pickup_latitude', 'dropoff_latitude'], axis=1)

In [32]:
taxiDB.head()

Unnamed: 0,id,vendor_id,pickup_datetime,passenger_count,store_and_fwd_flag,trip_duration,distance_km
0,id2875421,1,2016-03-14 17:24:55,1,0,455.0,1.500479
1,id2377394,0,2016-06-12 00:43:35,1,0,663.0,1.807119
2,id3858529,1,2016-01-19 11:35:24,1,0,2124.0,6.39208
3,id3504673,1,2016-04-06 19:32:31,1,0,429.0,1.487155
4,id2181028,1,2016-03-26 13:30:55,1,0,435.0,1.189925


### Task 12

Save the first 10 rows of your dataset into a `csv` file.

In [33]:
### use ; as a separator between columns
taxiDB.head(10).to_csv('01_task7.csv', sep=';')

### Task 13

What values does the `passenger_count` column take?

In [34]:
taxiDB.passenger_count.value_counts()

1    1033540
2     210318
5      78088
3      59896
6      48333
4      28404
0         60
7          3
9          1
8          1
Name: passenger_count, dtype: int64

### Task 14

Use the **mean-target encoding** technique for the `passenger_count` feature.

In [35]:
taxiDB['passenger_count'] = taxiDB['passenger_count'].map(taxiDB.groupby(['passenger_count'])['trip_duration'].mean())

In [36]:
taxiDB.rename(columns={'passenger_count': 'category_encoded'}, inplace=True)

In [37]:
taxiDB.head()

Unnamed: 0,id,vendor_id,pickup_datetime,category_encoded,store_and_fwd_flag,trip_duration,distance_km
0,id2875421,1,2016-03-14 17:24:55,930.399753,0,455.0,1.500479
1,id2377394,0,2016-06-12 00:43:35,930.399753,0,663.0,1.807119
2,id3858529,1,2016-01-19 11:35:24,930.399753,0,2124.0,6.39208
3,id3504673,1,2016-04-06 19:32:31,930.399753,0,429.0,1.487155
4,id2181028,1,2016-03-26 13:30:55,930.399753,0,435.0,1.189925
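Mean-target encoding replaces each category with the average target value observed for that category. The toy example below uses made-up numbers that are small enough to verify by hand; the `toy` and `category_means` names are hypothetical.

```python
import pandas as pd

# Made-up data: two categories with known means (500 and 600).
toy = pd.DataFrame({
    "passenger_count": [1, 1, 2, 2, 2],
    "trip_duration": [400.0, 600.0, 300.0, 600.0, 900.0],
})

# Mean of the target per category: 1 -> 500.0, 2 -> 600.0.
category_means = toy.groupby("passenger_count")["trip_duration"].mean()

# Replace each category with the mean target of its group.
toy["category_encoded"] = toy["passenger_count"].map(category_means)

print(toy)
```

Note that the means here are computed on all rows, as in the cell above. If the data is later split into training and validation sets, computing the means on the training part only is the usual way to avoid leaking target information into the validation rows.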


### Task 15

Save the first 10 rows of your dataset into a `csv` file.

In [38]:
### use ; as a separator between columns
taxiDB.head(10).to_csv('01_task8.csv', sep=';')

### Task 16

Use the `id` column as the index of your dataset.

In [39]:
taxiDB = taxiDB.set_index('id')

In [40]:
taxiDB.head(10)

id,vendor_id,pickup_datetime,category_encoded,store_and_fwd_flag,trip_duration,distance_km
id2875421,1,2016-03-14 17:24:55,930.399753,0,455.0,1.500479
id2377394,0,2016-06-12 00:43:35,930.399753,0,663.0,1.807119
id3858529,1,2016-01-19 11:35:24,930.399753,0,2124.0,6.39208
id3504673,1,2016-04-06 19:32:31,930.399753,0,429.0,1.487155
id2181028,1,2016-03-26 13:30:55,930.399753,0,435.0,1.189925
id0801584,1,2016-01-30 22:01:40,1061.355223,0,443.0,1.100107
id1813257,0,2016-06-17 22:34:59,1053.529749,0,341.0,1.327852
id1324603,1,2016-05-21 07:54:58,930.399753,0,1551.0,5.722427
id1301050,0,2016-05-27 23:12:23,930.399753,0,255.0,1.311541
id0012891,1,2016-03-10 21:45:01,930.399753,0,1225.0,5.126939


### Task 17

Save the first 10,000 rows of your dataset into a `csv` file.

In [41]:
taxiDB.iloc[0:10000].to_csv('01_task9.csv', sep=';')