# Uber ride fare prediction

**Problem description**

This project focuses on analyzing Uber ride fares, including exploratory data analysis (EDA) with hypothesis testing, and building a model to predict future ride costs. The data is sourced from Kaggle: [Uber Fares Dataset](https://www.kaggle.com/datasets/yasserh/uber-fares-dataset/data).

**Project objectives:**
- Understand the structure of the data and the key factors affecting fares.
- Build a predictive model that estimates future ride costs from features such as pickup and dropoff location, trip distance, and pickup time.

**Significance of the project:**
Predicting ride costs can help Uber customers plan their expenses, and it can help operators optimize services and implement dynamic pricing.

**The dataset includes the following columns:**
- `key` – a unique identifier for each trip.
- `fare_amount` – the cost of each trip in USD (target variable).
- `pickup_datetime` – the date and time when the meter was engaged.
- `passenger_count` – the number of passengers in the vehicle (entered by the driver).
- `pickup_longitude` – the longitude where the meter was engaged.
- `pickup_latitude` – the latitude where the meter was engaged.
- `dropoff_longitude` – the longitude where the meter was disengaged.
- `dropoff_latitude` – the latitude where the meter was disengaged.
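
The objective lists distance as a feature, but the columns above contain only coordinates, so a distance would have to be derived. One common option, shown here as a minimal sketch and not necessarily the approach this notebook takes, is the haversine (great-circle) formula. The function name `haversine_km` and the Earth radius constant are assumptions made for illustration.

```python
import math

# Mean Earth radius in kilometres (an approximation; assumed for this sketch).
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Example call with hypothetical pickup and dropoff points in Manhattan.
example_km = haversine_km(40.738354, -73.999817, 40.723217, -73.999512)
```

The haversine distance ignores the street grid, so it is a lower bound on the distance actually driven.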

---

## Action Plan

1. Exploratory Data Analysis (EDA)
2. Data Preprocessing
3. Model Development and Evaluation


## 1. Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import opendatasets as od


## 2. Download the Dataset

In [2]:
dataset_url = 'https://www.kaggle.com/datasets/yasserh/uber-fares-dataset/data'

In [3]:
od.download(dataset_url)

Dataset URL: https://www.kaggle.com/datasets/yasserh/uber-fares-dataset
Downloading uber-fares-dataset.zip to .\uber-fares-dataset


In [9]:
data_dir = './uber-fares-dataset'

### Loading Training Set

- Skip the `key` column, which is only a trip identifier.
- Parse `pickup_datetime` as a datetime while loading the data.

In [12]:
selected_cols = 'fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count'.split(',')

df = pd.read_csv(data_dir+'/uber.csv',
                 usecols=selected_cols, 
                 parse_dates=["pickup_datetime"])
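
The effect of `usecols` and `parse_dates` can be checked without the downloaded file. The sketch below runs the same kind of `pd.read_csv` call on a tiny in-memory CSV. The `key` values `a` and `b` are hypothetical placeholders, and the names `toy_csv`, `wanted`, and `toy_df` exist only for this example.

```python
import io

import pandas as pd

# Tiny stand-in for the real file: same column names, two illustrative rows.
toy_csv = io.StringIO(
    "key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,"
    "dropoff_longitude,dropoff_latitude,passenger_count\n"
    "a,7.5,2015-05-07 19:52:06+00:00,-73.999817,40.738354,-73.999512,40.723217,1\n"
    "b,7.7,2009-07-17 20:04:56+00:00,-73.994355,40.728225,-73.994710,40.750325,1\n"
)

# Every column except `key`.
wanted = [
    "fare_amount",
    "pickup_datetime",
    "pickup_longitude",
    "pickup_latitude",
    "dropoff_longitude",
    "dropoff_latitude",
    "passenger_count",
]

toy_df = pd.read_csv(toy_csv, usecols=wanted, parse_dates=["pickup_datetime"])

# `key` is dropped at read time and the timestamp column has a datetime dtype.
has_key = "key" in toy_df.columns
is_datetime = pd.api.types.is_datetime64_any_dtype(toy_df["pickup_datetime"])
```

Dropping unused columns at read time avoids loading them into memory at all, which matters more as the file grows.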

In [19]:
df.shape

(200000, 7)

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype              
---  ------             --------------   -----              
 0   fare_amount        200000 non-null  float64            
 1   pickup_datetime    200000 non-null  datetime64[ns, UTC]
 2   pickup_longitude   200000 non-null  float64            
 3   pickup_latitude    200000 non-null  float64            
 4   dropoff_longitude  199999 non-null  float64            
 5   dropoff_latitude   199999 non-null  float64            
 6   passenger_count    200000 non-null  int64              
dtypes: datetime64[ns, UTC](1), float64(5), int64(1)
memory usage: 10.7 MB


In [17]:
df.describe()

| | fare_amount | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|
| count | 200000.0 | 200000.0 | 200000.0 | 199999.0 | 199999.0 | 200000.0 |
| mean | 11.359955 | -72.527638 | 39.935885 | -72.525292 | 39.92389 | 1.684535 |
| std | 9.901776 | 11.437787 | 7.720539 | 13.117408 | 6.794829 | 1.385997 |
| min | -52.0 | -1340.64841 | -74.015515 | -3356.6663 | -881.985513 | 0.0 |
| 25% | 6.0 | -73.992065 | 40.734796 | -73.991407 | 40.733823 | 1.0 |
| 50% | 8.5 | -73.981823 | 40.752592 | -73.980093 | 40.753042 | 1.0 |
| 75% | 12.5 | -73.967154 | 40.767158 | -73.963658 | 40.768001 | 2.0 |
| max | 499.0 | 57.418457 | 1644.421482 | 1153.572603 | 872.697628 | 208.0 |


In [20]:
df.pickup_datetime.min(), df.pickup_datetime.max()

(Timestamp('2009-01-01 01:15:22+0000', tz='UTC'),
 Timestamp('2015-06-30 23:40:39+0000', tz='UTC'))

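- **Data size:** The dataset contains 200,000 rows and 7 columns (the `key` column was not loaded).
- **Missing values:** `dropoff_longitude` and `dropoff_latitude` each have one missing value (199,999 non-null).
- **Data anomalies:**
  - `fare_amount`: The minimum is -52, which is not a valid fare; the maximum of 499 is a potential outlier given a 75th percentile of 12.5.
  - `passenger_count`: The minimum is 0 and the maximum is 208; both need verification.
  - Geographic coordinates: Some values fall outside the valid ranges of [-90, 90] for latitude and [-180, 180] for longitude (for example, a minimum pickup longitude of -1340.65 and a maximum pickup latitude of 1644.42).
- **Pickup dates:** Pickups range from January 1, 2009, to June 30, 2015.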
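
The checks above can be expressed as a boolean mask. The following is a minimal sketch, not the cleaning step this notebook necessarily applies. The helper name `flag_anomalies` and the passenger limit of 6 are assumptions. The toy rows reuse extreme values from the summary statistics and are not actual records.

```python
import numpy as np
import pandas as pd


def flag_anomalies(df, max_passengers=6):
    """Return a boolean Series that is True for rows failing basic sanity checks.

    max_passengers is an assumed upper bound, not a value taken from the data.
    """
    lat_cols = ["pickup_latitude", "dropoff_latitude"]
    lon_cols = ["pickup_longitude", "dropoff_longitude"]

    bad_fare = df["fare_amount"] < 0
    bad_passengers = ~df["passenger_count"].between(1, max_passengers)
    # NaN comparisons evaluate to False, so missing coordinates are flagged too.
    bad_lat = ~(df[lat_cols].abs() <= 90).all(axis=1)
    bad_lon = ~(df[lon_cols].abs() <= 180).all(axis=1)

    return bad_fare | bad_passengers | bad_lat | bad_lon


# Toy rows: one plausible trip followed by four rows with one problem each.
toy = pd.DataFrame(
    {
        "fare_amount": [7.5, -52.0, 8.5, 12.5, 6.0],
        "pickup_longitude": [-73.999817, -73.99, -73.98, -1340.64841, -73.97],
        "pickup_latitude": [40.738354, 40.73, 40.75, 40.76, 40.74],
        "dropoff_longitude": [-73.999512, -73.98, -73.97, -73.96, np.nan],
        "dropoff_latitude": [40.723217, 40.74, 40.76, 40.77, np.nan],
        "passenger_count": [1, 1, 208, 1, 1],
    }
)

flags = flag_anomalies(toy)
clean = toy[~flags]
```

Applied to the full data, `flag_anomalies(df).sum()` would give the number of rows to inspect before deciding whether to drop or correct them.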


## 3. Exploratory Data Analysis and Visualization

In [15]:
df

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,7.5,2015-05-07 19:52:06+00:00,-73.999817,40.738354,-73.999512,40.723217,1
1,7.7,2009-07-17 20:04:56+00:00,-73.994355,40.728225,-73.994710,40.750325,1
2,12.9,2009-08-24 21:45:00+00:00,-74.005043,40.740770,-73.962565,40.772647,1
3,5.3,2009-06-26 08:22:21+00:00,-73.976124,40.790844,-73.965316,40.803349,3
4,16.0,2014-08-28 17:47:00+00:00,-73.925023,40.744085,-73.973082,40.761247,5
...,...,...,...,...,...,...,...
199995,3.0,2012-10-28 10:49:00+00:00,-73.987042,40.739367,-73.986525,40.740297,1
199996,7.5,2014-03-14 01:09:00+00:00,-73.984722,40.736837,-74.006672,40.739620,1
199997,30.9,2009-06-29 00:42:00+00:00,-73.986017,40.756487,-73.858957,40.692588,2
199998,14.5,2015-05-20 14:56:25+00:00,-73.997124,40.725452,-73.983215,40.695415,1
