<h2><center>UrbanFlow Prognosticator: Propagation-Aware Traffic Prediction and Visualization System</center></h2>
<figure>
<center><img src="https://media.springernature.com/lw685/springer-static/image/chp%3A10.1007%2F978-3-030-94102-4_6/MediaObjects/509967_1_En_6_Fig1_HTML.png" width="750" height="500" alt="unsplash.com"/></center>
</figure>

## Author: Umar Kabir

Date: July 2023

<a id='table-of-contents'></a>
# Table of Contents

1. [Introduction](#introduction)
    - Motivation
    - Problem Statement
    - Objectives
    - Data Source
    - Importing Dependencies  


2. [Data](#data)
    - Data Loading
    - Data Overview
    - Combining the Data


3. [Exploratory Data Analysis](#exploratory-data-analysis)
    - Descriptive Statistics
    - Data Visualization
    - Correlation Analysis
    - Outlier Detection


4. [Data Preparation](#data-preparation)
    - Data Cleaning
    - Handling Missing Values
    - Handling Imbalanced Classes
    - Feature Selection
    - Feature Engineering
    - Data Transformation
    - Data Splitting

<a id='introduction'></a>
<font size="+2" color='#053c96'><b> Introduction</b></font>  
[back to top](#table-of-contents)  

<font size="+0" color='green'><b> Possible Target Variables</b></font>  


<font size="+0" color='green'><b> Motivation</b></font>  


<font size="+0" color='green'><b> Problem Statement</b></font>  



<font size="+0" color='green'><b> Objectives</b></font>  


<font size="+0" color='green'><b> Data Source</b></font>  


<font size="+0" color='green'><b> Importing Dependencies</b></font>  

In [22]:
import sys
# Insert the parent path relative to this notebook so we can import from the src folder.
sys.path.insert(0, "..")

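# Star imports: `pd` (pandas) and helpers such as `combine_dataframes` are expected to come from these modules.
from src.dependencies import *
from src.functions import *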

<a id='data'></a>
<font size="+2" color='#053c96'><b> Data</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Data Loading</b></font>  

In [123]:
pems04 = pd.read_csv('../data/PEMS04.csv')
pems07 = pd.read_csv('../data/PEMS07.csv')
pems08 = pd.read_csv('../data/PEMS08.csv')
data = pd.read_csv('../data/original_cleaned_nyc_taxi_data_2018.csv')
taxi_zones = pd.read_csv('../data/taxi_zone_geo.csv')
taxi_trips = pd.read_csv('../data/taxi_trip_data.csv')
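
The taxi trip file is large (10,000,000 rows, reported at 1.3+ GB in memory), so it can help to load only the columns that are needed and to give them compact dtypes. The following is a minimal sketch under stated assumptions: the column subset and the dtypes are illustrative choices, not something the notebook prescribes, and a two-row in-memory sample stands in for `'../data/taxi_trip_data.csv'` so the example runs on its own.

```python
import io

import pandas as pd

# Two sample rows standing in for '../data/taxi_trip_data.csv'.
sample_csv = io.StringIO(
    "vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,"
    "pickup_location_id,dropoff_location_id\n"
    "2,2018-03-29 13:37:13,2018-03-29 14:17:01,1,18.15,161,1\n"
    "2,2018-03-29 13:37:18,2018-03-29 14:15:33,1,4.59,13,230\n"
)

# Assumption: only these columns are needed downstream; adjust as required.
wanted = [
    "pickup_datetime", "dropoff_datetime", "passenger_count",
    "trip_distance", "pickup_location_id", "dropoff_location_id",
]

trips_small = pd.read_csv(
    sample_csv,
    usecols=wanted,  # skip columns that are never used
    dtype={  # smaller integer/float types than the int64/float64 defaults
        "passenger_count": "int8",
        "trip_distance": "float32",
        "pickup_location_id": "int16",
        "dropoff_location_id": "int16",
    },
    parse_dates=["pickup_datetime", "dropoff_datetime"],  # datetime64 instead of object
)
print(trips_small.dtypes)
```

Whether the narrower integer types are safe depends on the actual value ranges in the full file, which should be checked before adopting them.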

<font size="+0" color='green'><b> Data Overview</b></font>  

In [24]:
print(pems04.info())
pems04.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   from    340 non-null    int64  
 1   to      340 non-null    int64  
 2   cost    340 non-null    float64
dtypes: float64(1), int64(2)
memory usage: 8.1 KB
None


(340, 3)

In [83]:
pems04.head()

|   | from | to | cost |
|---|------|----|------|
| 0 | 73 | 5 | 352.6 |
| 1 | 5 | 154 | 347.2 |
| 2 | 154 | 263 | 392.9 |
| 3 | 263 | 56 | 440.8 |
| 4 | 56 | 96 | 374.6 |


In [88]:
pems04.describe()

|   | from | to | cost |
|---|------|----|------|
| count | 340.0 | 340.0 | 340.0 |
| mean | 149.544118 | 146.241176 | 410.300588 |
| std | 88.341793 | 89.092511 | 257.518655 |
| min | 0.0 | 0.0 | 3.2 |
| 25% | 71.75 | 66.75 | 328.775 |
| 50% | 144.5 | 139.5 | 367.15 |
| 75% | 229.25 | 226.5 | 422.3 |
| max | 305.0 | 306.0 | 2712.1 |


In [25]:
print(pems07.info())
pems07.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 866 entries, 0 to 865
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   from    866 non-null    int64  
 1   to      866 non-null    int64  
 2   cost    866 non-null    float64
dtypes: float64(1), int64(2)
memory usage: 20.4 KB
None


(866, 3)

In [124]:
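# Rescale the PEMS07 cost by a factor of 200.
# Not idempotent: re-running this cell scales the column again, so reload pems07 first.
pems07['cost'] = pems07['cost'] * 200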
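
Because an in-place multiplication changes the column every time the cell runs, one alternative is to derive the scaled frame from an untouched raw frame. The sketch below is only an illustration: `scale_cost` and `COST_SCALE` are hypothetical names, and the toy edge list merely mimics the `from`/`to`/`cost` layout.

```python
import pandas as pd

COST_SCALE = 200  # same factor as in the cell above


def scale_cost(edges: pd.DataFrame, factor: float) -> pd.DataFrame:
    """Return a copy with 'cost' multiplied by factor; the input is left untouched."""
    return edges.assign(cost=edges["cost"] * factor)


# Toy edge list with the same columns as the PEMS files (illustrative values).
edges_raw = pd.DataFrame({"from": [0, 1], "to": [1, 2], "cost": [0.5, 2.25]})

# Calling this repeatedly always starts from edges_raw, so the result is stable.
edges_scaled = scale_cost(edges_raw, COST_SCALE)
edges_scaled = scale_cost(edges_raw, COST_SCALE)
print(edges_scaled)
```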

In [125]:
pems07.head()

|   | from | to | cost |
|---|------|----|------|
| 0 | 721 | 445 | 158.0 |
| 1 | 542 | 480 | 515.0 |
| 2 | 770 | 702 | 185.2 |
| 3 | 32 | 266 | 119.2 |
| 4 | 34 | 56 | 125.6 |


In [126]:
pems07.describe()

|   | from | to | cost |
|---|------|----|------|
| count | 866.0 | 866.0 | 866.0 |
| mean | 443.393764 | 439.577367 | 292.493072 |
| std | 255.69856 | 255.545771 | 389.753597 |
| min | 0.0 | 0.0 | 6.4 |
| 25% | 223.25 | 219.25 | 118.0 |
| 50% | 442.5 | 440.5 | 183.4 |
| 75% | 665.75 | 660.75 | 322.4 |
| max | 882.0 | 882.0 | 4107.8 |


In [26]:
print(pems08.info())
pems08.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 295 entries, 0 to 294
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   from    295 non-null    int64  
 1   to      295 non-null    int64  
 2   cost    295 non-null    float64
dtypes: float64(1), int64(2)
memory usage: 7.0 KB
None


(295, 3)

In [89]:
pems08.head()

|   | from | to | cost |
|---|------|----|------|
| 0 | 9 | 153 | 310.6 |
| 1 | 153 | 62 | 330.9 |
| 2 | 62 | 111 | 332.9 |
| 3 | 111 | 11 | 324.2 |
| 4 | 11 | 28 | 336.0 |


In [90]:
pems08.describe()

|   | from | to | cost |
|---|------|----|------|
| count | 295.0 | 295.0 | 295.0 |
| mean | 82.376271 | 85.118644 | 315.895254 |
| std | 51.902487 | 51.507177 | 216.686639 |
| min | 1.0 | 0.0 | 6.3 |
| 25% | 33.5 | 40.0 | 240.3 |
| 50% | 82.0 | 85.0 | 328.1 |
| 75% | 127.0 | 132.0 | 372.15 |
| max | 169.0 | 169.0 | 3274.4 |


In [28]:
print(taxi_trips.info())
taxi_trips.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 17 columns):
 #   Column               Dtype  
---  ------               -----  
 0   vendor_id            int64  
 1   pickup_datetime      object 
 2   dropoff_datetime     object 
 3   passenger_count      int64  
 4   trip_distance        float64
 5   rate_code            int64  
 6   store_and_fwd_flag   object 
 7   payment_type         int64  
 8   fare_amount          float64
 9   extra                float64
 10  mta_tax              float64
 11  tip_amount           float64
 12  tolls_amount         float64
 13  imp_surcharge        float64
 14  total_amount         float64
 15  pickup_location_id   int64  
 16  dropoff_location_id  int64  
dtypes: float64(8), int64(6), object(3)
memory usage: 1.3+ GB
None


(10000000, 17)

In [87]:
taxi_trips.head()

|   | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | trip_distance | rate_code | store_and_fwd_flag | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | imp_surcharge | total_amount | pickup_location_id | dropoff_location_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2018-03-29 13:37:13 | 2018-03-29 14:17:01 | 1 | 18.15 | 3 | N | 1 | 70.0 | 0.0 | 0.0 | 16.16 | 10.5 | 0.3 | 96.96 | 161 | 1 |
| 1 | 2 | 2018-03-29 13:37:18 | 2018-03-29 14:15:33 | 1 | 4.59 | 1 | N | 1 | 25.0 | 0.0 | 0.5 | 5.16 | 0.0 | 0.3 | 30.96 | 13 | 230 |
| 2 | 2 | 2018-03-29 13:26:57 | 2018-03-29 13:28:03 | 1 | 0.3 | 1 | N | 1 | 3.0 | 0.0 | 0.5 | 0.76 | 0.0 | 0.3 | 4.56 | 231 | 231 |
| 3 | 2 | 2018-03-29 13:07:48 | 2018-03-29 14:03:05 | 2 | 16.97 | 1 | N | 1 | 49.5 | 0.0 | 0.5 | 5.61 | 5.76 | 0.3 | 61.67 | 231 | 138 |
| 4 | 2 | 2018-03-29 14:19:11 | 2018-03-29 15:19:59 | 5 | 14.45 | 1 | N | 1 | 45.5 | 0.0 | 0.5 | 10.41 | 5.76 | 0.3 | 62.47 | 87 | 138 |


In [29]:
taxi_zones.info()
taxi_zones.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 263 entries, 0 to 262
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   zone_id    263 non-null    int64 
 1   zone_name  263 non-null    object
 2   borough    263 non-null    object
 3   zone_geom  263 non-null    object
dtypes: int64(1), object(3)
memory usage: 8.3+ KB


(263, 4)

In [102]:
pems04[pems04['to'] == 5]

|   | from | to | cost |
|---|------|----|------|
| 0 | 73 | 5 | 352.6 |


<font size="+0" color='green'><b> Combining the Data</b></font>  

In [106]:
pems = combine_dataframes([pems04, pems07, pems08])
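
`combine_dataframes` is imported from `src.functions` and its body is not shown here. Since the three PEMS frames share the columns `from`, `to`, and `cost`, a row-wise concatenation is one plausible reading. The sketch below is a hypothetical stand-in (named `combine_dataframes_sketch` to make that clear), not the project's actual helper; the optional `source` column is an added assumption that keeps track of which network each edge came from, which matters because node IDs overlap across the three files.

```python
import pandas as pd


def combine_dataframes_sketch(frames, names=None):
    """Stack frames row-wise; optionally tag each row with its source name."""
    if names is None:
        return pd.concat(frames, ignore_index=True)
    tagged = [frame.assign(source=name) for frame, name in zip(frames, names)]
    return pd.concat(tagged, ignore_index=True)


# Tiny frames with the same layout as the PEMS edge lists.
a = pd.DataFrame({"from": [73, 5], "to": [5, 154], "cost": [352.6, 347.2]})
b = pd.DataFrame({"from": [9], "to": [153], "cost": [310.6]})

combined = combine_dataframes_sketch([a, b], names=["pems04", "pems08"])
print(combined)
```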

In [107]:
pems.head()

|   | from | to | cost |
|---|------|----|------|
| 0 | 73 | 5 | 352.6 |
| 1 | 5 | 154 | 347.2 |
| 2 | 154 | 263 | 392.9 |
| 3 | 263 | 56 | 440.8 |
| 4 | 56 | 96 | 374.6 |


In [108]:
pems.duplicated().sum()

18
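
The count alone does not show which rows repeat. As a minimal sketch on a toy frame (the values are made up for illustration), `duplicated(keep=False)` lists every member of each duplicate group, and `drop_duplicates()` keeps the first occurrence of each row.

```python
import pandas as pd

edges = pd.DataFrame({
    "from": [1, 1, 2, 3],
    "to": [2, 2, 3, 4],
    "cost": [10.0, 10.0, 5.0, 7.5],
})

# keep=False marks every member of a duplicate group, which makes inspection easier.
dupes = edges[edges.duplicated(keep=False)]
print(dupes)

# Drop exact repeats, keeping the first occurrence of each row.
deduped = edges.drop_duplicates().reset_index(drop=True)
print(len(edges), "->", len(deduped))
```

Whether the repeated rows should actually be dropped depends on whether the same edge is expected to appear in more than one source file.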

In [109]:
# Rename the edge endpoints to the taxi trip key names so the two frames can be merged on them.
pems.rename(columns={'from': 'pickup_location_id', 'to': 'dropoff_location_id'}, inplace=True)

In [110]:
df = pd.merge(pems, taxi_trips, on=['pickup_location_id', 'dropoff_location_id'], how='inner')
df.head()

|   | pickup_location_id | dropoff_location_id | cost | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | trip_distance | rate_code | store_and_fwd_flag | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | imp_surcharge | total_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 263 | 56 | 440.8 | 2 | 2018-10-06 03:03:53 | 2018-10-06 03:23:20 | 3 | 9.14 | 1 | N | 1 | 27.0 | 0.5 | 0.5 | 5.66 | 0.0 | 0.3 | 33.96 |
| 1 | 263 | 56 | 440.8 | 2 | 2018-04-08 00:27:46 | 2018-04-08 00:57:21 | 1 | 7.26 | 1 | N | 2 | 25.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 26.3 |
| 2 | 263 | 56 | 440.8 | 1 | 2018-03-26 08:13:56 | 2018-03-26 08:39:48 | 1 | 11.1 | 1 | N | 1 | 33.0 | 0.0 | 0.5 | 9.85 | 5.76 | 0.3 | 49.41 |
| 3 | 263 | 56 | 440.8 | 1 | 2018-06-16 09:59:43 | 2018-06-16 10:30:01 | 1 | 9.2 | 1 | N | 1 | 30.5 | 0.0 | 0.5 | 7.8 | 0.0 | 0.3 | 39.1 |
| 4 | 263 | 56 | 440.8 | 1 | 2018-03-10 00:44:57 | 2018-03-10 01:13:21 | 3 | 7.0 | 1 | N | 2 | 24.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 25.8 |


In [105]:
df.shape

(66035, 18)

In [111]:
df.duplicated().sum()

5859
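
One possible contributor to duplicated rows after an inner merge is a repeated key pair on the left side: every matching trip is then emitted once per repeated edge. Whether that explains the count above has not been verified here. The sketch below uses toy frames to show the effect and one way to guard against it with the `validate` argument of `DataFrame.merge`.

```python
import pandas as pd

keys = ["pickup_location_id", "dropoff_location_id"]

# The (263, 56) edge appears twice on the left side.
edges = pd.DataFrame({
    "pickup_location_id": [263, 263, 5],
    "dropoff_location_id": [56, 56, 154],
    "cost": [440.8, 440.8, 347.2],
})
trips = pd.DataFrame({
    "pickup_location_id": [263, 263, 13],
    "dropoff_location_id": [56, 56, 230],
    "fare_amount": [27.0, 25.0, 25.0],
})

# Naive inner join: the repeated edge doubles the two matching trips.
naive = edges.merge(trips, on=keys, how="inner")

# Keep one row per key pair on the left; validate raises MergeError
# if the left keys are still not unique.
unique_edges = edges.drop_duplicates(subset=keys)
checked = unique_edges.merge(trips, on=keys, how="inner", validate="one_to_many")

print(len(naive), len(checked))
```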

In [114]:
df.isna().sum()

pickup_location_id     0
dropoff_location_id    0
cost                   0
vendor_id              0
pickup_datetime        0
dropoff_datetime       0
passenger_count        0
trip_distance          0
rate_code              0
store_and_fwd_flag     0
payment_type           0
fare_amount            0
extra                  0
mta_tax                0
tip_amount             0
tolls_amount           0
imp_surcharge          0
total_amount           0
dtype: int64

<a id='exploratory-data-analysis'></a>
<font size="+2" color='#053c96'><b> Exploratory Data Analysis</b></font>  
[back to top](#table-of-contents)

<a id='data-exploration'></a>
<font size="+0" color='green'><b> Data Exploration</b></font>  

In [112]:
df['pickup_location_id'].nunique()

195

In [116]:
df['passenger_count'].value_counts()

1    46324
2    10185
5     2993
3     2849
6     1769
4     1396
0      516
7        2
9        1
Name: passenger_count, dtype: int64

In [117]:
df['vendor_id'].value_counts()

2    39923
1    25814
4      298
Name: vendor_id, dtype: int64

In [118]:
df['rate_code'].value_counts()

1     54658
3      6126
2      3673
5      1508
4        63
99        6
6         1
Name: rate_code, dtype: int64
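
The counts above include values that look suspicious, such as trips with 0 passengers and a `rate_code` of 99. A minimal sketch of filtering them out follows. The validity rules (1 to 6 passengers, rate codes 1 to 6) are assumptions made for illustration and should be checked against the dataset's own data dictionary before use.

```python
import pandas as pd

# Toy rows mimicking the two columns inspected above.
trips = pd.DataFrame({
    "passenger_count": [1, 0, 2, 9, 1],
    "rate_code": [1, 1, 99, 2, 5],
})

# Assumed validity rules (not taken from the notebook).
valid_passengers = trips["passenger_count"].between(1, 6)
valid_rate = trips["rate_code"].between(1, 6)

clean = trips[valid_passengers & valid_rate]
print(f"kept {len(clean)} of {len(trips)} rows")
```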

<a id='data-visualization'></a>
<font size="+0" color='green'><b> Data Visualization</b></font>  

<a id='summary-statistics'></a>
<font size="+0" color='green'><b> Summary Statistics</b></font>  

<a id='feature-correlation'></a>
<font size="+0" color='green'><b> Feature Correlation</b></font>  

<a id='data-preparation'></a>
<font size="+2" color='#053c96'><b> Data Preparation</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Data Cleaning</b></font>  

<font size="+0" color='green'><b> Handling Imbalanced Classes</b></font>  

<font size="+0" color='green'><b> Feature Engineering</b></font>  

<font size="+0" color='green'><b> Feature Selection</b></font>  

<font size="+0" color='green'><b> Data Transformation</b></font>  

<font size="+0" color='green'><b> Data Splitting</b></font>  