# 1 - Introduction

> The introduction chapter must contain the author's identity, an overview of the dataset used, and the objective to be achieved.

## Identity

Name : Jason Rich Darmawan Onggo Putra

Batch : 016 RMT

## Overview of the Dataset

Source of the dataset: https://www.kaggle.com/datasets/brllrb/uber-and-lyft-dataset-boston-ma

1. Each row represents one transaction.
2. Each transaction contains columns that describe:
   1. when the transaction happened: `timestamp`, `hour`, `day`, `month`, `timezone`.
   2. where the transaction happened: `source`, `destination`.
   3. which ride-hailing platform was used: `cab_type`.
   4. which product was ordered: `product_id`, `name`.
   5. how much the ride cost: `price`.
   6. how far the destination was: `distance`.
   7. #TODO how big the demand was at the time: `surge_multiplier` (shown as `1.0` in the sample rows).
   8. where the user was at the time: `latitude`, `longitude`.
   9. what weather the user experienced at the time: `temperature`, `apparentTemperature`, `short_summary`, `long_summary`, `precipIntensity`, `precipProbability`, `windSpeed`, `windGust`, `windGustTime`, `visibility`, `temperatureHigh`, `temperatureHighTime`, `temperatureLow`, `temperatureLowTime`, `apparentTemperatureHigh`, `apparentTemperatureHighTime`, `apparentTemperatureLow`, `apparentTemperatureLowTime`, and more.

## Objective

Create a model that predicts the fare price of a ride-hailing platform with an accuracy of at least 70%.
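
"Accuracy" is not a standard metric for a continuous target such as `price`. One possible reading, which is an assumption here and not something the notebook states, is a coefficient of determination (R²) of at least 0.70. A minimal sketch of that check on made-up fares follows; the helper name `r2_score_manual` is hypothetical.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy fares and predictions, for illustration only (not model output).
y_true = [5.0, 11.0, 7.0, 26.0, 9.0]
y_pred = [6.0, 10.0, 8.0, 24.0, 9.5]

score = r2_score_manual(y_true, y_pred)
meets_objective = score >= 0.70  # assumed reading of "accuracy of at least 70%"
```

Whichever metric is finally chosen, fixing it before training makes the 70% target testable.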

~~The dataset contains different ride-hailing platforms. Therefore, the model should be able to predict the price depending on the ride-hailing platform (this requirement may change after EDA).~~

# 2 - Import Libraries

> The first cell of the notebook must contain all the libraries used in the project, and nothing else.

In [2]:
import pandas as pd
import plotly.express as px

# 3 - Data Loading

> This section covers preparing the data before further exploration. Data loading can include renaming columns, checking the size of the dataset, and so on.

## 3.1 - data head and tail

Columns under scrutiny; decision; reason:
- `id`; ignore; it is a randomly generated string.
- `timestamp`; ignore; duplicated column (same information as `datetime`).
- `datetime`; ignore; personal judgement: including a minute-level feature would likely overfit the model.
- `timezone`; ignore; redundant given column `hour`.
- `source` and `destination`; ignore; personal judgement: these columns are not ordinal, so the known solution is One-Hot Encoding. However, with e.g. 10k unique places that would add 10k columns.
- [ ] `cab_type`, `product_id`, `name`; check EDA; they may have a linear relationship with column `price`.
- [ ] `precipIntensity`, `precipProbability`; check EDA; they may have a linear relationship with column `surge_multiplier`.
- `latitude`, `longitude`; ignore; the data is too specific for a Machine Learning model to learn from.
- other columns that are not mentioned; ignore; personal judgement: these columns, e.g. `temperature`, `apparentTemperature`, and so on, should correlate with `surge_multiplier`, albeit weakly. Using them will most likely overfit the model.
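
The One-Hot Encoding concern for `source` and `destination` can be checked before deciding. The sketch below uses a toy frame (place names copied from the sample rows in this section, not the full dataset) to show that `pd.get_dummies` creates one indicator column per distinct value.

```python
import pandas as pd

# Toy frame; the real check would run on `data` instead.
toy = pd.DataFrame({
    "source": ["Haymarket Square", "West End", "Haymarket Square"],
    "destination": ["North Station", "North End", "North Station"],
})

# Number of distinct places in each column.
n_unique = toy[["source", "destination"]].nunique()

# One-Hot Encoding adds one indicator column per distinct value.
encoded = pd.get_dummies(toy, columns=["source", "destination"])
n_encoded_columns = encoded.shape[1]
```

Running `data[["source", "destination"]].nunique()` on the loaded frame would show whether the number of places is anywhere near the feared 10k.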

Columns that may correlate with column `price`; reason:
- [ ] `distance`; personal judgement: distance should correlate with price.

Columns that may correlate with column `surge_multiplier`; reason:
- [ ] `surge_multiplier` is a dependent variable; it should be affected by columns `hour`, `day`, `month`, and `precipIntensity`.
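
A quick first test of these hunches could be a Pearson correlation table. The sketch below uses synthetic rows (not taken from the dataset) in which `price` is an exact linear function of `distance`, so the expected coefficient is known in advance.

```python
import pandas as pd

# Synthetic rows for illustration only.
toy = pd.DataFrame({
    "distance": [0.5, 1.0, 2.0, 3.0, 4.0],
    "price": [6.0, 7.5, 10.5, 13.5, 16.5],  # exactly 4.5 + 3 * distance
    "hour": [9, 2, 1, 4, 3],
})

# Pearson correlation of every other column with `price`.
corr_with_price = toy.corr()["price"].drop("price")
```

On the loaded frame the same idea would be `data[["price", "distance", "surge_multiplier"]].corr()`. Note that Pearson correlation only captures linear relationships.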

reference of `precipIntensity` and `precipProbability`:
- https://www.weather.gov/media/pah/WeatherEducation/pop.pdf

In [3]:
pd.options.display.max_columns = None
data = pd.read_csv("./rideshare_kaggle.csv", on_bad_lines='warn')
data

Unnamed: 0,id,timestamp,hour,day,month,datetime,timezone,source,destination,cab_type,product_id,name,price,distance,surge_multiplier,latitude,longitude,temperature,apparentTemperature,short_summary,long_summary,precipIntensity,precipProbability,humidity,windSpeed,windGust,windGustTime,visibility,temperatureHigh,temperatureHighTime,temperatureLow,temperatureLowTime,apparentTemperatureHigh,apparentTemperatureHighTime,apparentTemperatureLow,apparentTemperatureLowTime,icon,dewPoint,pressure,windBearing,cloudCover,uvIndex,visibility.1,ozone,sunriseTime,sunsetTime,moonPhase,precipIntensityMax,uvIndexTime,temperatureMin,temperatureMinTime,temperatureMax,temperatureMaxTime,apparentTemperatureMin,apparentTemperatureMinTime,apparentTemperatureMax,apparentTemperatureMaxTime
0,424553bb-7174-41ea-aeb4-fe06d4f4b9d7,1.544953e+09,9,16,12,2018-12-16 09:30:07,America/New_York,Haymarket Square,North Station,Lyft,lyft_line,Shared,5.0,0.44,1.0,42.2148,-71.0330,42.34,37.12,Mostly Cloudy,Rain throughout the day.,0.0000,0.0,0.68,8.66,9.17,1545015600,10.000,43.68,1544968800,34.19,1545048000,37.95,1544968800,27.39,1545044400,partly-cloudy-night,32.70,1021.98,57,0.72,0,10.000,303.8,1544962084,1544994864,0.30,0.1276,1544979600,39.89,1545012000,43.68,1544968800,33.73,1545012000,38.07,1544958000
1,4bd23055-6827-41c6-b23b-3c491f24e74d,1.543284e+09,2,27,11,2018-11-27 02:00:23,America/New_York,Haymarket Square,North Station,Lyft,lyft_premier,Lux,11.0,0.44,1.0,42.2148,-71.0330,43.58,37.35,Rain,"Rain until morning, starting again in the eve...",0.1299,1.0,0.94,11.98,11.98,1543291200,4.786,47.30,1543251600,42.10,1543298400,43.92,1543251600,36.20,1543291200,rain,41.83,1003.97,90,1.00,0,4.786,291.1,1543232969,1543266992,0.64,0.1300,1543251600,40.49,1543233600,47.30,1543251600,36.20,1543291200,43.92,1543251600
2,981a3613-77af-4620-a42a-0c0866077d1e,1.543367e+09,1,28,11,2018-11-28 01:00:22,America/New_York,Haymarket Square,North Station,Lyft,lyft,Lyft,7.0,0.44,1.0,42.2148,-71.0330,38.33,32.93,Clear,Light rain in the morning.,0.0000,0.0,0.75,7.33,7.33,1543334400,10.000,47.55,1543320000,33.10,1543402800,44.12,1543320000,29.11,1543392000,clear-night,31.10,992.28,240,0.03,0,10.000,315.7,1543319437,1543353364,0.68,0.1064,1543338000,35.36,1543377600,47.55,1543320000,31.04,1543377600,44.12,1543320000
3,c2d88af2-d278-4bfd-a8d0-29ca77cc5512,1.543554e+09,4,30,11,2018-11-30 04:53:02,America/New_York,Haymarket Square,North Station,Lyft,lyft_luxsuv,Lux Black XL,26.0,0.44,1.0,42.2148,-71.0330,34.38,29.63,Clear,Partly cloudy throughout the day.,0.0000,0.0,0.73,5.28,5.28,1543514400,10.000,45.03,1543510800,28.90,1543579200,38.53,1543510800,26.20,1543575600,clear-night,26.64,1013.73,310,0.00,0,10.000,291.1,1543492370,1543526114,0.75,0.0000,1543507200,34.67,1543550400,45.03,1543510800,30.30,1543550400,38.53,1543510800
4,e0126e1f-8ca9-4f2e-82b3-50505a09db9a,1.543463e+09,3,29,11,2018-11-29 03:49:20,America/New_York,Haymarket Square,North Station,Lyft,lyft_plus,Lyft XL,9.0,0.44,1.0,42.2148,-71.0330,37.44,30.88,Partly Cloudy,Mostly cloudy throughout the day.,0.0000,0.0,0.70,9.14,9.14,1543446000,10.000,42.18,1543420800,36.71,1543478400,35.75,1543420800,30.29,1543460400,partly-cloudy-night,28.61,998.36,303,0.44,0,10.000,347.7,1543405904,1543439738,0.72,0.0001,1543420800,33.10,1543402800,42.18,1543420800,29.11,1543392000,35.75,1543420800
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
693066,616d3611-1820-450a-9845-a9ff304a4842,1.543708e+09,23,1,12,2018-12-01 23:53:05,America/New_York,West End,North End,Uber,6f72dfc5-27f1-42e8-84db-ccc7a75f6969,UberXL,13.0,1.00,1.0,42.3519,-71.0643,37.05,37.05,Partly Cloudy,Light rain in the morning and overnight.,0.0000,0.0,0.74,2.34,2.87,1543672800,9.785,44.76,1543690800,34.83,1543712400,44.09,1543690800,35.48,1543712400,partly-cloudy-night,29.65,1023.57,133,0.31,0,9.785,271.5,1543665331,1543698855,0.82,0.0000,1543683600,31.42,1543658400,44.76,1543690800,27.77,1543658400,44.09,1543690800
693067,633a3fc3-1f86-4b9e-9d48-2b7132112341,1.543708e+09,23,1,12,2018-12-01 23:53:05,America/New_York,West End,North End,Uber,55c66225-fbe7-4fd5-9072-eab1ece5e23e,UberX,9.5,1.00,1.0,42.3519,-71.0643,37.05,37.05,Partly Cloudy,Light rain in the morning and overnight.,0.0000,0.0,0.74,2.34,2.87,1543672800,9.785,44.76,1543690800,34.83,1543712400,44.09,1543690800,35.48,1543712400,partly-cloudy-night,29.65,1023.57,133,0.31,0,9.785,271.5,1543665331,1543698855,0.82,0.0000,1543683600,31.42,1543658400,44.76,1543690800,27.77,1543658400,44.09,1543690800
693068,64d451d0-639f-47a4-9b7c-6fd92fbd264f,1.543708e+09,23,1,12,2018-12-01 23:53:05,America/New_York,West End,North End,Uber,8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a,Taxi,,1.00,1.0,42.3519,-71.0643,37.05,37.05,Partly Cloudy,Light rain in the morning and overnight.,0.0000,0.0,0.74,2.34,2.87,1543672800,9.785,44.76,1543690800,34.83,1543712400,44.09,1543690800,35.48,1543712400,partly-cloudy-night,29.65,1023.57,133,0.31,0,9.785,271.5,1543665331,1543698855,0.82,0.0000,1543683600,31.42,1543658400,44.76,1543690800,27.77,1543658400,44.09,1543690800
693069,727e5f07-a96b-4ad1-a2c7-9abc3ad55b4e,1.543708e+09,23,1,12,2018-12-01 23:53:05,America/New_York,West End,North End,Uber,6d318bcc-22a3-4af6-bddd-b409bfce1546,Black SUV,27.0,1.00,1.0,42.3519,-71.0643,37.05,37.05,Partly Cloudy,Light rain in the morning and overnight.,0.0000,0.0,0.74,2.34,2.87,1543672800,9.785,44.76,1543690800,34.83,1543712400,44.09,1543690800,35.48,1543712400,partly-cloudy-night,29.65,1023.57,133,0.31,0,9.785,271.5,1543665331,1543698855,0.82,0.0000,1543683600,31.42,1543658400,44.76,1543690800,27.77,1543658400,44.09,1543690800


In [4]:
# Duplicate Dataset

data_duplicate = data.copy()

## 3.2 - data.info()

Columns that have null values; decision; reason:
- [ ] `price`; drop the rows with NaN values; it is needed as y_true, and without it we cannot test whether the model is a good fit.
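
A minimal sketch of that decision, using a toy frame that mirrors the tail of the table above, where the `Taxi` row has no price:

```python
import pandas as pd

# Toy frame; the real call would target `data` instead.
toy = pd.DataFrame({
    "name": ["UberXL", "UberX", "Taxi", "Black SUV"],
    "price": [13.0, 9.5, float("nan"), 27.0],
})

# Keep only rows that have a target value, then renumber the index.
cleaned = toy.dropna(subset=["price"]).reset_index(drop=True)
n_dropped = len(toy) - len(cleaned)
```

Restricting `dropna` to `subset=["price"]` avoids discarding rows that are only missing a feature value.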

In [17]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 693071 entries, 0 to 693070
Data columns (total 57 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           693071 non-null  object 
 1   timestamp                    693071 non-null  float64
 2   hour                         693071 non-null  int64  
 3   day                          693071 non-null  int64  
 4   month                        693071 non-null  int64  
 5   datetime                     693071 non-null  object 
 6   timezone                     693071 non-null  object 
 7   source                       693071 non-null  object 
 8   destination                  693071 non-null  object 
 9   cab_type                     693071 non-null  object 
 10  product_id                   693071 non-null  object 
 11  name                         693071 non-null  object 
 12  price                        637976 non-null  float64
 13 

## 3.3 - data.describe()

columns; insight
- `hour`
  - The distribution should be slightly left-skewed because the mean (11.62) is below the median (12). The assumption is that people use the ride-hailing platform during the morning rush hour, when public transportation is overcrowded.
  - [ ] Why is the mode 0? Do I have a strong reason to believe that hour `0` is not a natural outlier?
    
    Personal judgement: the mode should fall in the typical rush hour, but it does not. The assumption is that people use the ride-hailing platform after public transportation has closed.

    Conflicting evidence (these sources describe New York City, whereas the dataset covers Boston, MA):
    > [Out of the 472 stations, 470 are served 24 hours a day.](https://en.wikipedia.org/wiki/New_York_City_Subway)
    
    > [Buses, like the subway, operate on a 24-hour basis.](https://www.introducingnewyork.com/buses)
- `day`
  - The mode is 27. The assumption is that people receive their monthly salary on the 25th.
  
    Conflicting evidence:
    > [In February 2022, biweekly was the most common length of pay period..](https://www.bls.gov/ces/publications/length-pay-period.htm#:~:text=In%20February%202022%2C%20biweekly%20was,establishments%20paying%20employees%20each%20week.)
- `month`
  - The dataset only contains months `11` and `12`.
- `price`
  - The value in column `price` does not seem to factor in the value of column `surge_multiplier`.
- `distance`
  - [ ] Why is the min 0.02? Do I have a strong reason to believe that `0.02` is not a natural outlier?
    Assuming the measurement is in miles, 0.02 mile equals about 32 meters.

    > [The average walking pace is 2.5 to 4 mph.](https://www.nike.com/hr/a/how-long-does-it-take-to-walk-a-mile#:~:text=The%20average%20walking%20pace%20is,the%20incline%20and%20your%20age.)

    2.5 miles per hour is roughly 0.04 mile per minute, so 0.02 mile is about a 30-second walk.
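
The arithmetic behind the `distance` note can be written out as follows. It rests on the same assumption as the text, namely that `distance` is measured in miles.

```python
METERS_PER_MILE = 1609.344

distance_miles = 0.02      # minimum of column `distance` (unit assumed to be miles)
walking_speed_mph = 2.5    # lower bound of the quoted walking pace

distance_meters = distance_miles * METERS_PER_MILE
miles_per_minute = walking_speed_mph / 60
walk_time_seconds = distance_miles / miles_per_minute * 60
```

Under these assumptions the shortest trip is roughly 32 meters, or a bit under half a minute on foot, which is why the minimum looks suspicious.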

In [19]:
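pd.options.display.float_format = "{0:.2f}".format
# Summary statistics for the numeric columns of interest.
data[['hour', 'day', 'month', 'price', 'distance', 'surge_multiplier', 'precipIntensity', 
      'precipProbability']] \
    .agg(['median',
          'mean',
          lambda x: x.mode()[0],         # mode: first of the most frequent values
          "std",
          "skew",
          "min",
          "max",
          lambda x: len(x.unique())]) \
    .reset_index(drop=True) \
    .rename(index={0: "median", 
                   1: "mean", 
                   2: "mode", 
                   3: "std", 
                   4: "skew", 
                   5: "min", 
                   6: "max", 
                   7: "unique length"})
# Both lambdas are labelled '<lambda>' by agg(), so the index is reset
# and the rows are renamed by position. unique() also counts NaN as a value.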

Unnamed: 0,hour,day,month,price,distance,surge_multiplier,precipIntensity,precipProbability
median,12.0,17.0,12.0,13.5,2.16,1.0,0.0,0.0
mean,11.62,17.79,11.59,16.55,2.19,1.01,0.01,0.15
mode,0.0,27.0,12.0,7.0,2.66,1.0,0.0,0.0
std,6.95,9.98,0.49,9.32,1.14,0.09,0.03,0.33
skew,-0.05,-0.38,-0.35,1.05,0.83,8.32,3.33,2.03
min,0.0,1.0,11.0,2.5,0.02,1.0,0.0,0.0
max,23.0,30.0,12.0,97.5,7.86,3.0,0.14,1.0
unique length,24.0,17.0,2.0,148.0,549.0,7.0,63.0,29.0


In [7]:
# column `month` unique values: 11, 12
data['month'].unique()

array([12, 11])

In [72]:
# at a distance of 0.02, UberPool has 3 unique prices
# but the surge_multiplier does not change.
data[['distance', 'name', 'price', 'surge_multiplier']].value_counts().sort_index()

distance  name       price  surge_multiplier
0.02      Black      15.00  1.00                10
          Black SUV  27.50  1.00                10
          UberPool   5.50   1.00                 2
                     6.50   1.00                 6
                     7.50   1.00                 2
                                                ..
7.86      Black SUV  53.00  1.00                 1
          UberPool   13.50  1.00                 1
          UberX      17.50  1.00                 1
          UberXL     29.50  1.00                 1
          WAV        17.50  1.00                 1
Length: 42932, dtype: int64
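
The output above has to be read row by row. A more direct way to find such cases, sketched here on toy rows shaped like that output (the counts are not reproduced), is to count the distinct prices and surge multipliers per `distance` and `name` pair.

```python
import pandas as pd

# Toy rows for illustration; the real call would use `data`.
toy = pd.DataFrame({
    "distance": [0.02, 0.02, 0.02, 0.02, 0.02],
    "name": ["Black", "Black SUV", "UberPool", "UberPool", "UberPool"],
    "price": [15.0, 27.5, 5.5, 6.5, 7.5],
    "surge_multiplier": [1.0, 1.0, 1.0, 1.0, 1.0],
})

# Distinct prices and surge multipliers per (distance, product) pair.
per_trip = toy.groupby(["distance", "name"])[["price", "surge_multiplier"]].nunique()

# Pairs whose price varies although the surge multiplier stays constant.
unexplained = per_trip[(per_trip["price"] > 1) & (per_trip["surge_multiplier"] == 1)]
```

Any pair that appears in `unexplained` has price variation that `surge_multiplier` alone cannot account for.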

# 4 - Exploratory Data Analysis (EDA)

> This section explores the dataset above using queries, grouping, simple visualizations, and so on.

# 5 - Data Preprocessing

> This section covers preparing the data for model training, such as splitting the data into train-dev-test sets, transforming the data (normalization, encoding, etc.), and any other steps required.

# 6 - Model Definition

> This section contains the cells that define the model. Explain the reason for choosing an algorithm/model, the hyperparameters used, the metrics used, and anything else related to the model.

# 7 - Model Training

> The cells in this section contain only the code that trains the model and the output it produces. Run training several times with different hyperparameters to compare the results. Analyze and discuss these results in the Model Evaluation section.

# 8 - Model Evaluation

> This section evaluates the model and must show how it performs on the chosen metrics. This must be supported by visualizations of the performance trend and/or the model's error rate. Analyze the model's results and write up the analysis.

# 9 - Model Inference

> The trained model is tried on data that belongs to neither the train set nor the test set. This data must be in its original format, not scaled.

# 10 - Conclusions

> This final section must contain conclusions that relate the results obtained to the objective written in the introduction.