<h2><center>UrbanFlow Prognosticator: Propagation-Aware Traffic Prediction and Visualization System</center></h2>
<figure>
<center><img src="https://media.springernature.com/lw685/springer-static/image/chp%3A10.1007%2F978-3-030-94102-4_6/MediaObjects/509967_1_En_6_Fig1_HTML.png" width="750" height="500" alt="unsplash.com"/></center>
</figure>

## Author: Umar Kabir

Date: July 2023

<a id='table-of-contents'></a>
# Table of Contents

1. [Introduction](#introduction)
    - Motivation
    - Problem Statement
    - Objectives
    - Data Source
    - Importing Dependencies  


2. [Data](#2-data)
    - Data Loading
    - Data Overview


3. [Exploratory Data Analysis](#exploratory-data-analysis)
    - Descriptive Statistics
    - Data Visualization
    - Correlation Analysis
    - Outlier Detection


4. [Data Preparation](#data-preparation)
    - Data Cleaning
    - Handling Missing Values
    - Handling Imbalanced Classes
    - Feature Selection
    - Feature Engineering
    - Data Transformation
    - Data Splitting

<a id='introduction'></a>
<font size="+2" color='#053c96'><b> Introduction</b></font>  
[back to top](#table-of-contents)  

<font size="+0" color='green'><b> Possible Target Variables</b></font>  


<font size="+0" color='green'><b> Motivation</b></font>  


<font size="+0" color='green'><b> Problem Statement</b></font>  



<font size="+0" color='green'><b> Objectives</b></font>  


<font size="+0" color='green'><b> Data Source</b></font>  


<font size="+0" color='green'><b> Importing Dependencies</b></font>  

In [2]:
import sys
# Insert the parent path relative to this notebook so we can import from the src folder.
sys.path.insert(0, "..")

from src.dependencies import *

<a id='2-data'></a>
<font size="+2" color='#053c96'><b> Data</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Data Loading</b></font>  

In [5]:
pems04 = pd.read_csv('../data/PEMS04.csv')
pems07 = pd.read_csv('../data/PEMS07.csv')
pems08 = pd.read_csv('../data/PEMS08.csv')
original = pd.read_csv('../data/original_cleaned_nyc_taxi_data_2018.csv')
taxi_zones = pd.read_csv('../data/taxi_zone_geo.csv')
taxi_trips = pd.read_csv('../data/taxi_trip_data.csv')
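
Reading the full trip file with default settings takes over a gigabyte of memory (`taxi_trips.info()` reports 1.3+ GB). A minimal sketch of a lighter load follows, assuming only a subset of columns is needed. The helper name `load_trips`, the choice of columns, and the narrower dtypes are illustrative assumptions, not part of the original notebook; the dtypes should be checked against the real value ranges before use.

```python
import io

import pandas as pd

# Hypothetical column subset; names follow the taxi_trips.info() listing.
# The narrower dtypes are assumptions and may overflow on unexpected values.
TRIP_DTYPES = {
    "vendor_id": "int16",
    "passenger_count": "int16",
    "trip_distance": "float32",
    "fare_amount": "float32",
    "pickup_location_id": "int16",
    "dropoff_location_id": "int16",
}
DATE_COLUMNS = ["pickup_datetime", "dropoff_datetime"]


def load_trips(source, nrows=None):
    """Read only the selected columns and parse the two timestamp columns."""
    return pd.read_csv(
        source,
        usecols=list(TRIP_DTYPES) + DATE_COLUMNS,
        dtype=TRIP_DTYPES,
        parse_dates=DATE_COLUMNS,
        nrows=nrows,
    )


# Tiny in-memory sample so the sketch runs without the real file.
sample = io.StringIO(
    "vendor_id,pickup_datetime,dropoff_datetime,passenger_count,"
    "trip_distance,fare_amount,tip_amount,pickup_location_id,dropoff_location_id\n"
    "1,2018-01-01 00:10:00,2018-01-01 00:25:00,2,3.5,12.0,1.0,161,236\n"
)
preview = load_trips(sample)
print(preview.dtypes)
```

With the real data, the call would be `load_trips('../data/taxi_trip_data.csv', nrows=1_000_000)` or similar; `nrows` is handy for a quick first look before committing to a full read.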

<font size="+0" color='green'><b> Data Overview</b></font>  

In [6]:
print(pems04.info())
pems04.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   from    340 non-null    int64  
 1   to      340 non-null    int64  
 2   cost    340 non-null    float64
dtypes: float64(1), int64(2)
memory usage: 8.1 KB
None


(340, 3)

In [8]:
print(pems07.info())
pems07.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 866 entries, 0 to 865
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   from    866 non-null    int64  
 1   to      866 non-null    int64  
 2   cost    866 non-null    float64
dtypes: float64(1), int64(2)
memory usage: 20.4 KB
None


(866, 3)

In [9]:
print(pems08.info())
pems08.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 295 entries, 0 to 294
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   from    295 non-null    int64  
 1   to      295 non-null    int64  
 2   cost    295 non-null    float64
dtypes: float64(1), int64(2)
memory usage: 7.0 KB
None


(295, 3)
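
The three PEMS files each hold `from`, `to`, and `cost` columns, which reads like an edge list between sensors. Assuming that interpretation holds (and that `cost` is a distance or travel cost between the two endpoints), a minimal sketch of turning such a frame into a weighted adjacency matrix could look like this. The function name `edges_to_adjacency` and the `symmetric` option are hypothetical and introduced only for illustration.

```python
import numpy as np
import pandas as pd


def edges_to_adjacency(edges, symmetric=True):
    """Build a dense weighted adjacency matrix from from/to/cost columns.

    Assumes each row is one edge; node ids are remapped to 0..n-1 in sorted order.
    """
    node_ids = sorted(set(edges["from"]).union(edges["to"]))
    position = {node: i for i, node in enumerate(node_ids)}
    adjacency = np.zeros((len(node_ids), len(node_ids)), dtype=float)

    rows = edges[["from", "to", "cost"]].itertuples(index=False, name=None)
    for src, dst, cost in rows:
        adjacency[position[src], position[dst]] = cost
        if symmetric:
            # Treat the road link as undirected (an assumption).
            adjacency[position[dst], position[src]] = cost
    return adjacency, node_ids


# Toy edge list with the same column names as the PEMS frames.
toy_edges = pd.DataFrame({"from": [0, 1], "to": [1, 2], "cost": [10.0, 5.0]})
toy_adjacency, toy_nodes = edges_to_adjacency(toy_edges)
print(toy_adjacency)
```

A dense matrix is fine at this scale (a few hundred edges per file); whether the links should be treated as directed depends on how the sensors were paired, which the data alone does not say.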

In [11]:
print(original.info())
original.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8319928 entries, 0 to 8319927
Data columns (total 21 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   Unnamed: 0               int64  
 1   trip_distance            float64
 2   rate_code                int64  
 3   store_and_fwd_flag       object 
 4   payment_type             int64  
 5   fare_amount              float64
 6   extra                    float64
 7   mta_tax                  float64
 8   tip_amount               float64
 9   tolls_amount             float64
 10  imp_surcharge            float64
 11  total_amount             float64
 12  pickup_location_id       int64  
 13  dropoff_location_id      int64  
 14  year                     int64  
 15  month                    int64  
 16  day                      int64  
 17  day_of_week              int64  
 18  hour_of_day              int64  
 19  trip_duration            float64
 20  calculated_total_amount  float64
dtypes: float64(10), int64(10), object(1)

(8319928, 21)

In [12]:
print(taxi_trips.info())
taxi_trips.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 17 columns):
 #   Column               Dtype  
---  ------               -----  
 0   vendor_id            int64  
 1   pickup_datetime      object 
 2   dropoff_datetime     object 
 3   passenger_count      int64  
 4   trip_distance        float64
 5   rate_code            int64  
 6   store_and_fwd_flag   object 
 7   payment_type         int64  
 8   fare_amount          float64
 9   extra                float64
 10  mta_tax              float64
 11  tip_amount           float64
 12  tolls_amount         float64
 13  imp_surcharge        float64
 14  total_amount         float64
 15  pickup_location_id   int64  
 16  dropoff_location_id  int64  
dtypes: float64(8), int64(6), object(3)
memory usage: 1.3+ GB
None


(10000000, 17)
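
`taxi_trips` stores `pickup_datetime` and `dropoff_datetime` as `object`, while `original` already carries `year`, `month`, `day`, `day_of_week`, `hour_of_day`, and `trip_duration`. A minimal sketch of deriving comparable columns from the raw timestamps is shown below. The helper name `add_time_features` is hypothetical, and two details are assumptions: that `trip_duration` is measured in seconds, and that `day_of_week` uses the pandas convention (Monday = 0), which may differ from the convention used in `original`.

```python
import pandas as pd


def add_time_features(trips):
    """Return a copy of `trips` with calendar fields and a duration column."""
    out = trips.copy()
    pickup = pd.to_datetime(out["pickup_datetime"])
    dropoff = pd.to_datetime(out["dropoff_datetime"])

    out["year"] = pickup.dt.year
    out["month"] = pickup.dt.month
    out["day"] = pickup.dt.day
    out["day_of_week"] = pickup.dt.dayofweek  # pandas convention: Monday = 0
    out["hour_of_day"] = pickup.dt.hour
    # Duration in seconds (assumed unit).
    out["trip_duration"] = (dropoff - pickup).dt.total_seconds()
    return out


# Toy row with the same timestamp column names as taxi_trips.
toy_trips = pd.DataFrame(
    {
        "pickup_datetime": ["2018-03-05 08:15:00"],
        "dropoff_datetime": ["2018-03-05 08:45:30"],
    }
)
toy_features = add_time_features(toy_trips)
print(toy_features.iloc[0])
```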

In [13]:
taxi_zones.info()
taxi_zones.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 263 entries, 0 to 262
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   zone_id    263 non-null    int64 
 1   zone_name  263 non-null    object
 2   borough    263 non-null    object
 3   zone_geom  263 non-null    object
dtypes: int64(1), object(3)
memory usage: 8.3+ KB


(263, 4)
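
`taxi_zones.zone_id` looks like the lookup key for `pickup_location_id` and `dropoff_location_id` in the trip tables. That link is an assumption based on the column names, not something the outputs above confirm. Under that assumption, a minimal sketch of attaching zone names and boroughs to trips with a left join could look like this; the helper name `attach_pickup_zone` and the output column names are hypothetical.

```python
import pandas as pd


def attach_pickup_zone(trips, zones):
    """Left-join zone name and borough onto trips via the pickup location id."""
    lookup = zones[["zone_id", "zone_name", "borough"]].rename(
        columns={
            "zone_id": "pickup_location_id",
            "zone_name": "pickup_zone",
            "borough": "pickup_borough",
        }
    )
    # validate raises pandas.errors.MergeError if a zone id repeats in `zones`.
    return trips.merge(
        lookup, on="pickup_location_id", how="left", validate="many_to_one"
    )


# Toy frames with the same column names as taxi_trips and taxi_zones.
toy_zone_trips = pd.DataFrame({"pickup_location_id": [1, 2, 999]})
toy_zones = pd.DataFrame(
    {
        "zone_id": [1, 2],
        "zone_name": ["Newark Airport", "Jamaica Bay"],
        "borough": ["EWR", "Queens"],
        "zone_geom": ["POLYGON(...)", "POLYGON(...)"],
    }
)
toy_joined = attach_pickup_zone(toy_zone_trips, toy_zones)
print(toy_joined)
```

A left join keeps every trip, so location ids with no matching zone show up as missing values rather than silently dropping rows.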

<a id='exploratory-data-analysis'></a>
<font size="+2" color='#053c96'><b> Exploratory Data Analysis</b></font>  
[back to top](#table-of-contents)

<a id='data-exploration'></a>
<font size="+0" color='green'><b> Data Exploration</b></font>  

In [11]:
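# Check the number of unique values in each column.
# `df` is not assigned in the cells above; bind it to one of the loaded frames before running this cell.
print("\nNumber of Unique Values:")
print(df.nunique())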


Number of Unique Values:
WASTE_ID      58502
SOURCE           12
ORG_ID        47496
WWTP_NAME     49260
COUNTRY         188
CNTRY_ISO       180
LAT_WWTP      31311
LON_WWTP      44467
QUAL_LOC          4
LAT_OUT       13507
LON_OUT       24606
STATUS            9
POP_SERVED    22602
QUAL_POP          4
WASTE_DIS     33782
QUAL_WASTE        4
LEVEL             3
QUAL_LEVEL        2
DF            45199
HYRIV_ID      42821
RIVER_DIS     22017
COAST_10KM        2
COAST_50KM        2
DESIGN_CAP     7328
QUAL_CAP          3
dtype: int64


In [12]:
# Check for any missing values in the DataFrame
print("\nMissing Values:")
print(df.isnull().sum())


Missing Values:
WASTE_ID          0
SOURCE            0
ORG_ID            0
WWTP_NAME      5287
COUNTRY           0
CNTRY_ISO         0
LAT_WWTP          0
LON_WWTP          0
QUAL_LOC          0
LAT_OUT           0
LON_OUT           0
STATUS            0
POP_SERVED        0
QUAL_POP          0
WASTE_DIS         0
QUAL_WASTE        0
LEVEL             0
QUAL_LEVEL        0
DF            11200
HYRIV_ID        379
RIVER_DIS     10551
COAST_10KM        0
COAST_50KM        0
DESIGN_CAP    15835
QUAL_CAP          0
dtype: int64
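
The unique-value and missing-value checks can be combined into one per-column table and applied to any of the frames loaded earlier (for example `pems04` or `taxi_trips`). The sketch below is illustrative only; the helper name `column_summary` and its output column names are hypothetical.

```python
import pandas as pd


def column_summary(frame):
    """Per-column dtype, distinct non-null values, and missing-value counts."""
    missing = frame.isnull()
    return pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),
            "n_unique": frame.nunique(),  # NaN is excluded by default
            "n_missing": missing.sum(),
            "pct_missing": (missing.mean() * 100).round(2),
        }
    )


# Toy frame; with the real data this would be e.g. column_summary(pems04).
toy_frame = pd.DataFrame({"from": [0, 0, 1], "cost": [1.5, None, 2.5]})
toy_summary = column_summary(toy_frame)
print(toy_summary)
```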


<a id='data-visualization'></a>
<font size="+0" color='green'><b> Data Visualization</b></font>  

<a id='summary-statistics'></a>
<font size="+0" color='green'><b> Summary Statistics</b></font>  

<a id='feature-correlation'></a>
<font size="+0" color='green'><b> Feature Correlation</b></font>  

<a id='data-preparation'></a>
<font size="+2" color='#053c96'><b> Data Preparation</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Data Cleaning</b></font>  

<font size="+0" color='green'><b> Handling Imbalanced Classes</b></font>  

<font size="+0" color='green'><b> Feature Engineering</b></font>  

<font size="+0" color='green'><b> Feature Selection</b></font>  

<font size="+0" color='green'><b> Data Transformation</b></font>  

<font size="+0" color='green'><b> Data Splitting</b></font>  