<a id='table-of-contents'></a>
# Table of Contents

1. [Introduction](#introduction)
    - Motivation
    - Problem Statement
    - Objective
    - Data Source
    - Importing Dependencies  


2. [Data](#2-data)
    - Data Loading
    - Dataset Overview


3. [Exploratory Data Analysis](#exploratory-data-analysis)
    - Descriptive Statistics
    - Data Visualization
    - Correlation Analysis
    - Outlier Detection


4. [Data Preparation](#data-preparation)
    - Data Cleaning
    - Handling Missing Values
    - Handling Imbalanced Classes
    - Feature Selection
    - Feature Engineering
    - Data Transformation
    - Data Splitting


5. [Model Development](#model-development)
    - Baseline Model
    - Model Selection
    - Model Training
    - Hyperparameter Tuning


6. [Model Evaluation](#model-evaluation)
    - Performance Metrics
    - Confusion Matrix
    - ROC Curve
    - Precision-Recall Curve
    - Cross-Validation
    - Bias-Variance Tradeoff


7. [Model Interpretation](#model-interpretation)
    - Feature Importance
    - Model Explanation Techniques
    - Business Impact Analysis


8. [Conclusion](#conclusion)
    - Summary of Findings
    - Recommendations
    - Limitations
    - Future Work
    - Final Thoughts


9. [References](#references)

<font size="+0" color='green'><b> Importing Dependencies</b></font>  

In [1]:
import sys
# Insert the parent path relative to this notebook so we can import from the src folder.
sys.path.insert(0, "..")

from src.dependencies import *

<a id='#data'></a>
<font size="+2" color='#053c96'><b> Data</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Data Loading</b></font>  

In [5]:
data1 = pd.read_parquet("../data/primary_data.parquet")
data2 = pd.read_parquet("../data/secondary_data.parquet")
taxi = pd.read_parquet("../data/nyc_taxi_data.parquet")

<font size="+0" color='green'><b> Data Overview</b></font>  

In [3]:
data1.head()

Unnamed: 0,origin_id,destination_id,cost,traffic_flow,month,week,day,hour,minute,day_of_week,day_of_year,cost_flow,cost_flow_percentage,flow_cost,flow_cost_percentage
0,73,5,352.6,47.0,1,1,1,0,0,0,1,7.502128,2.12766,16572.2,4700.0
1,73,5,352.6,25.0,1,1,1,0,5,0,1,14.104,4.0,8815.0,2500.0
2,73,5,352.6,55.0,1,1,1,0,10,0,1,6.410909,1.818182,19393.0,5500.0
3,73,5,352.6,70.0,1,1,1,0,15,0,1,5.037143,1.428571,24682.0,7000.0
4,73,5,352.6,82.0,1,1,1,0,20,0,1,4.3,1.219512,28913.2,8200.0


In [8]:
data1.shape

(738149, 15)

In [6]:
data2.head()

Unnamed: 0,origin_id,destination_id,cost,traffic_flow,traffic_occupancy,traffic_speed,month,week,day,hour,minute,day_of_week,day_of_year,cost_flow,cost_flow_percentage,flow_speed,flow_occupancy,flow_speed_percentage,flow_occupancy_percentage,cost_speed,cost_occupancy,cost_speed_percentage,cost_occupancy_percentage,speed_flow,speed_occupancy,speed_flow_percentage,speed_occupancy_percentage,occupancy_speed,occupancy_flow,occupancy_flow_percentage,occupancy_speed_percentage
0,9,153,310.6,256.0,0.0385,68.9,7,26,1,0,0,4,183,1.213281,0.390625,3.71553,6649.350649,5.392641,17271040.0,4.507983,8067.532468,6.54279,20954630.0,17638.4,2.65265,6890.0,6890.0,2.65265,9.856,3.85,3.85
1,9,153,310.6,210.0,0.0337,68.0,7,26,1,0,5,4,183,1.479048,0.47619,3.088235,6231.454006,4.541522,18490960.0,4.567647,9216.617211,6.717128,27349010.0,14280.0,2.2916,6800.0,6800.0,2.2916,7.077,3.37,3.37
2,9,153,310.6,224.0,0.0367,66.9,7,26,1,0,10,4,183,1.386607,0.446429,3.348281,6103.542234,5.004904,16630910.0,4.64275,8463.215259,6.939836,23060530.0,14985.6,2.45523,6690.0,6690.0,2.45523,8.2208,3.67,3.67
3,9,153,310.6,208.0,0.0324,67.4,7,26,1,0,15,4,183,1.493269,0.480769,3.086053,6419.753086,4.578714,19814050.0,4.608309,9586.419753,6.837253,29587720.0,14019.2,2.18376,6740.0,6740.0,2.18376,6.7392,3.24,3.24
4,9,153,310.6,188.0,0.0261,70.4,7,26,1,0,20,4,183,1.652128,0.531915,2.670455,7203.065134,3.793259,27597950.0,4.411932,11900.383142,6.266949,45595340.0,13235.2,1.83744,7040.0,7040.0,1.83744,4.9068,2.61,2.61


In [9]:
data2.shape

(109658, 31)

In [7]:
taxi.head()

Unnamed: 0,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,rate_code,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,surcharge,mta_tax,tip_amount,tolls_amount,total_amount,year,month,week,day,hour,minute,day_of_week,day_of_year,time,inflow,outflow
0,0,2014-01-09 19:30:33,2014-01-09 19:37:00,1,1.6,-73.966895,40.76075,1,0,-73.952452,40.780517,0,7.5,1.0,0.5,1.0,0.0,10.0,2014,1,2,9,19,30,3,9,2014-01-09 19:30:00+00:00,20.0,22.0
1,0,2014-01-09 19:30:02,2014-01-09 19:46:44,1,3.5,-73.984054,40.743348,1,0,-73.94527,40.77406,0,14.0,1.0,0.5,3.1,0.0,18.6,2014,1,2,9,19,30,3,9,2014-01-09 19:30:00+00:00,20.0,22.0
2,0,2014-01-09 19:30:29,2014-01-09 19:54:52,1,9.1,-73.968481,40.75494,1,0,-73.835023,40.716944,0,27.5,1.0,0.5,4.0,5.33,38.33,2014,1,2,9,19,30,3,9,2014-01-09 19:30:00+00:00,20.0,22.0
3,0,2014-01-09 19:30:47,2014-01-09 19:43:48,4,1.3,-73.986107,40.740197,1,0,-74.006595,40.744362,0,9.5,1.0,0.5,2.75,0.0,13.75,2014,1,2,9,19,30,3,9,2014-01-09 19:30:00+00:00,20.0,22.0
4,0,2014-01-09 19:30:52,2014-01-09 19:44:17,1,3.8,-73.96327,40.768193,1,0,-73.927815,40.736685,0,14.0,1.0,0.5,1.5,0.0,17.0,2014,1,2,9,19,30,3,9,2014-01-09 19:30:00+00:00,20.0,22.0


In [10]:
taxi.shape

(242229, 29)

<a id='data-preparation'></a>
<font size="+2" color='#053c96'><b> Data Preparation</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Feature Selection</b></font>  

In [None]:
X, y = main_df.drop('traffic_flow', axis=1), main_df['traffic_flow']

In [None]:
threshold = 0.085
correlation_matrix = main_df.corr()
high_correlation_features = correlation_matrix[abs(correlation_matrix['traffic_flow']) > threshold]['traffic_flow'].index
high_correlation_features

Index(['origin_id', 'destination_id', 'traffic_flow', 'week', 'day', 'hour',
       'day_of_year', 'cost_flow', 'cost_flow_percentage', 'flow_cost',
       'flow_cost_percentage'],
      dtype='object')

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Create a SelectKBest object with the f_regression score function
k_best = SelectKBest(k=13)

# Fit the SelectKBest object on the training data
X_train_kbest = k_best.fit_transform(X_train_scaled, y_train)

# Get the indices of the selected features
selected_indices = k_best.get_support(indices=True)

# Get the names of the selected features
selected_features = X.columns[selected_indices]

# Display the selected features
print("Selected Features:", selected_features)

Selected Features: Index(['origin_id', 'destination_id', 'cost', 'year', 'month', 'week', 'day',
       'hour', 'minute', 'day_of_week', 'day_of_year', 'cost_flow',
       'flow_cost'],
      dtype='object')


In [None]:
from sklearn.feature_selection import mutual_info_regression
threshold_ = 0.05
mutual_info_scores = mutual_info_regression(X, y)
selected_features = X.columns[mutual_info_scores > threshold_]
selected_features

<font size="+0" color='green'><b> Data Splitting</b></font>  

<a id='model-development'></a>

<font size="+2" color='#053c96'><b> Model Development</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Baseline Model</b></font>  

<font size="+0" color='green'><b> Model Selection</b></font>  

<font size="+0" color='green'><b> Model Training</b></font>  

<font size="+0" color='green'><b> Hyperparameter Tuning</b></font>  

<a id='model-evaluation'></a>

<font size="+2" color='#053c96'><b> Model Evaluation</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Performance Metrics</b></font>  

<font size="+0" color='green'><b> Confusion Matrix</b></font>  

<font size="+0" color='green'><b> ROC Curve</b></font>  

<font size="+0" color='green'><b> Precision-Recall Curve</b></font>   

<font size="+0" color='green'><b> Cross-Validation</b></font>   

<font size="+0" color='green'><b> Bias-Variance Tradeoff</b></font>   

<a id='model-interpretation'></a>
<font size="+2" color='#053c96'><b> Model Interpretation</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Feature Importance</b></font>   

<font size="+0" color='green'><b> Model Explanation Techniques</b></font>   

<font size="+0" color='green'><b> Business Impact Analysis</b></font>   

<a id='conclusion'></a>

<font size="+2" color='#053c96'><b> Conclusion</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Summary of Findings</b></font>   

<font size="+0" color='green'><b> Recommendations</b></font>   

<font size="+0" color='green'><b> Limitations</b></font>   

<font size="+0" color='green'><b> Future Work</b></font>   

<font size="+0" color='green'><b> Final Thoughts</b></font>   

<a id='references'></a>

<font size="+2" color='#053c96'><b> References</b></font>  
[back to top](#table-of-contents)