<h2><center>AquaInsight: Exploring Global Wastewater Treatment Patterns</center></h2>
<figure>
<center><img src="https://th.bing.com/th/id/OIP.wuNPTx42LyVnFMqRofDVPQHaGB?pid=ImgDet&rs=1" width="750" height="500" alt="unsplash.com"/></center>
</figure>

## Author: Umar Kabir

Date: July 2023

<a id='table-of-contents'></a>
# Table of Contents

1. [Introduction](#introduction)
    - Motivation
    - Problem Statement
    - Objectives
    - Data Source
    - Importing Dependencies  


2. [Data](#2-data)
    - Data Loading
    - Data Overview


3. [Exploratory Data Analysis](#exploratory-data-analysis)
    - Descriptive Statistics
    - Data Visualization
    - Correlation Analysis
    - Outlier Detection


4. [Data Preparation](#data-preparation)
    - Data Cleaning
    - Handling Missing Values
    - Handling Imbalanced Classes
    - Feature Selection
    - Feature Engineering
    - Data Transformation
    - Data Splitting


5. [Model Development](#model-development)
    - Baseline Model
    - Model Selection
    - Model Training
    - Hyperparameter Tuning


6. [Model Evaluation](#model-evaluation)
    - Performance Metrics
    - Confusion Matrix
    - ROC Curve
    - Precision-Recall Curve
    - Cross-Validation
    - Bias-Variance Tradeoff


7. [Model Interpretation](#model-interpretation)
    - Feature Importance
    - Model Explanation Techniques
    - Business Impact Analysis


8. [Conclusion](#conclusion)
    - Summary of Findings
    - Recommendations
    - Limitations
    - Future Work
    - Final Thoughts


9. [References](#references)

<a id='introduction'></a>
<font size="+2" color='#053c96'><b> Introduction</b></font>  
[back to top](#table-of-contents)  

<font size="+0" color='green'><b> Possible Target Variables</b></font>  


<font size="+0" color='green'><b> Motivation</b></font>  


<font size="+0" color='green'><b> Problem Statement</b></font>  



<font size="+0" color='green'><b> Objectives</b></font>  


<font size="+0" color='green'><b> Data Source</b></font>  


<font size="+0" color='green'><b> Importing Dependencies</b></font>  

In [1]:
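import sys

# Insert the parent path relative to this notebook so we can import from the src folder.
sys.path.insert(0, "..")

# Project-wide imports; the cells below rely on pandas being available as `pd`.
from src.dependencies import *
import pandas as pd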

<a id='2-data'></a>
<font size="+2" color='#053c96'><b> Data</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Data Loading</b></font>  

In [2]:
pems03 = pd.read_csv('../data/PEMS03.csv')
pems04 = pd.read_csv('../data/PEMS04.csv')
pems07 = pd.read_csv('../data/PEMS07.csv')
pems08 = pd.read_csv('../data/PEMS08.csv')
original = pd.read_csv('../data/original_cleaned_nyc_taxi_data_2018.csv')
taxi_zones = pd.read_csv('../data/taxi_zone_geo.csv')
taxi_trips = pd.read_csv('../data/taxi_trip_data.csv')

<font size="+0" color='green'><b> Data Overview</b></font>  

In [3]:
pems03.head()

| Unnamed: 0 | from | to | distance |
|---|---|---|---|
| 0 | 317842 | 318711 | 0.872 |
| 1 | 318721 | 315955 | 1.322 |
| 2 | 315927 | 318236 | 1.222 |
| 3 | 318711 | 318721 | 0.233 |
| 4 | 318236 | 317843 | 1.22 |


In [4]:
pems04.head()

| Unnamed: 0 | from | to | cost |
|---|---|---|---|
| 0 | 73 | 5 | 352.6 |
| 1 | 5 | 154 | 347.2 |
| 2 | 154 | 263 | 392.9 |
| 3 | 263 | 56 | 440.8 |
| 4 | 56 | 96 | 374.6 |


In [6]:
pems07.head()

| Unnamed: 0 | from | to | cost |
|---|---|---|---|
| 0 | 721 | 445 | 0.79 |
| 1 | 542 | 480 | 2.575 |
| 2 | 770 | 702 | 0.926 |
| 3 | 32 | 266 | 0.596 |
| 4 | 34 | 56 | 0.628 |


In [7]:
pems08.head()

| Unnamed: 0 | from | to | cost |
|---|---|---|---|
| 0 | 9 | 153 | 310.6 |
| 1 | 153 | 62 | 330.9 |
| 2 | 62 | 111 | 332.9 |
| 3 | 111 | 11 | 324.2 |
| 4 | 11 | 28 | 336.0 |


In [8]:
original.head()

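| Unnamed: 0.1 | Unnamed: 0 | trip_distance | rate_code | store_and_fwd_flag | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | imp_surcharge | total_amount | pickup_location_id | dropoff_location_id | year | month | day | day_of_week | hour_of_day | trip_duration | calculated_total_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 16.97 | 1 | N | 1 | 49.5 | 0.0 | 0.5 | 5.61 | 5.76 | 0.3 | 61.67 | 231 | 138 | 2018 | 3 | 29 | 3 | 13 | 3317.0 | 61.67 |
| 1 | 4 | 14.45 | 1 | N | 1 | 45.5 | 0.0 | 0.5 | 10.41 | 5.76 | 0.3 | 62.47 | 87 | 138 | 2018 | 3 | 29 | 3 | 14 | 3648.0 | 62.47 |
| 2 | 5 | 11.6 | 1 | N | 1 | 42.0 | 0.0 | 0.5 | 14.57 | 5.76 | 0.3 | 63.13 | 68 | 138 | 2018 | 3 | 29 | 3 | 14 | 3540.0 | 63.13 |
| 3 | 10 | 5.1 | 1 | N | 1 | 26.5 | 1.0 | 0.5 | 5.65 | 0.0 | 0.3 | 33.95 | 186 | 33 | 2018 | 3 | 29 | 3 | 16 | 2585.0 | 33.95 |
| 4 | 12 | 11.11 | 1 | N | 1 | 45.5 | 1.0 | 0.5 | 10.61 | 5.76 | 0.3 | 63.67 | 163 | 138 | 2018 | 3 | 29 | 3 | 16 | 4521.0 | 63.67 |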


<a id='exploratory-data-analysis'></a>
<font size="+2" color='#053c96'><b> Exploratory Data Analysis</b></font>  
[back to top](#table-of-contents)

<a id='data-exploration'></a>
<font size="+0" color='green'><b> Data Exploration</b></font>  
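
The cells in this section call methods on a DataFrame named `df`, which the loading cell above does not create. The following is a minimal sketch of how it might be loaded, assuming the wastewater treatment plant table is available as a CSV file. The path `../data/wastewater_plants.csv`, the helper `load_wastewater_table`, and the sample rows are hypothetical placeholders, not part of the original notebook.

```python
import io

import pandas as pd


def load_wastewater_table(source):
    """Read the wastewater treatment plant table from a CSV path or file-like object."""
    return pd.read_csv(source)


# Hypothetical location (an assumption); point this at the real file.
# df = load_wastewater_table("../data/wastewater_plants.csv")

# Made-up placeholder rows, used only to show the call; not real plant records.
sample_csv = io.StringIO(
    "WASTE_ID,COUNTRY,CNTRY_ISO,POP_SERVED,DESIGN_CAP\n"
    "1,Exampleland,EXL,1000,\n"
    "2,Exampleland,EXL,2500,300.0\n"
)
df_sample = load_wastewater_table(sample_csv)
print(df_sample.shape)
```

Accepting either a path or a file-like object keeps the helper easy to try on a small in-memory sample before running it on the full file.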

In [11]:
# Check the number of unique values in each column
print("\nNumber of Unique Values:")
print(df.nunique())


Number of Unique Values:
WASTE_ID      58502
SOURCE           12
ORG_ID        47496
WWTP_NAME     49260
COUNTRY         188
CNTRY_ISO       180
LAT_WWTP      31311
LON_WWTP      44467
QUAL_LOC          4
LAT_OUT       13507
LON_OUT       24606
STATUS            9
POP_SERVED    22602
QUAL_POP          4
WASTE_DIS     33782
QUAL_WASTE        4
LEVEL             3
QUAL_LEVEL        2
DF            45199
HYRIV_ID      42821
RIVER_DIS     22017
COAST_10KM        2
COAST_50KM        2
DESIGN_CAP     7328
QUAL_CAP          3
dtype: int64
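
The counts above show 188 distinct `COUNTRY` values but only 180 distinct `CNTRY_ISO` codes, so at least one code must be paired with more than one country name. One way to list those codes is sketched below. The helper `codes_with_multiple_names` and the demo rows are hypothetical illustrations, not taken from the dataset.

```python
import pandas as pd


def codes_with_multiple_names(frame, code_col="CNTRY_ISO", name_col="COUNTRY"):
    """Return {code: sorted names} for every code paired with more than one name."""
    names_per_code = frame.groupby(code_col)[name_col].unique()
    return {
        code: sorted(names)
        for code, names in names_per_code.items()
        if len(names) > 1
    }


# Made-up placeholder rows; the real check would be run on `df`.
demo = pd.DataFrame(
    {
        "COUNTRY": ["Alpha", "Alpha Republic", "Beta", "Beta"],
        "CNTRY_ISO": ["AAA", "AAA", "BBB", "BBB"],
    }
)
duplicates = codes_with_multiple_names(demo)
print(duplicates)
```

On the full table, `codes_with_multiple_names(df)` would show which country names could be merged or need a closer look before grouping by country.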


In [12]:
# Check for any missing values in the DataFrame
print("\nMissing Values:")
print(df.isnull().sum())


Missing Values:
WASTE_ID          0
SOURCE            0
ORG_ID            0
WWTP_NAME      5287
COUNTRY           0
CNTRY_ISO         0
LAT_WWTP          0
LON_WWTP          0
QUAL_LOC          0
LAT_OUT           0
LON_OUT           0
STATUS            0
POP_SERVED        0
QUAL_POP          0
WASTE_DIS         0
QUAL_WASTE        0
LEVEL             0
QUAL_LEVEL        0
DF            11200
HYRIV_ID        379
RIVER_DIS     10551
COAST_10KM        0
COAST_50KM        0
DESIGN_CAP    15835
QUAL_CAP          0
dtype: int64
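
Raw counts are easier to judge as a share of all rows. The sketch below turns the output of `isnull().sum()` into a table of counts and percentages, keeping only the columns that have gaps. The helper `missing_summary` and the demo values are hypothetical; on the real data it would be called as `missing_summary(df)`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, largest first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame(
        {
            "missing": counts,
            "percent": (counts / len(frame) * 100).round(2),
        }
    )
    # Keep only columns that actually contain missing values.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Made-up placeholder rows for illustration only.
demo = pd.DataFrame(
    {
        "WASTE_ID": [1, 2, 3, 4],
        "WWTP_NAME": ["Plant A", None, "Plant C", "Plant D"],
        "DESIGN_CAP": [np.nan, np.nan, 120.0, np.nan],
    }
)
report = missing_summary(demo)
print(report)
```

Sorting by the missing count puts the columns most in need of imputation or removal at the top.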


<a id='data-visualization'></a>
<font size="+0" color='green'><b> Data Visualization</b></font>  

<a id='summary-statistics'></a>
<font size="+0" color='green'><b> Summary Statistics</b></font>  

<a id='feature-correlation'></a>
<font size="+0" color='green'><b> Feature Correlation</b></font>  

<a id='data-preparation'></a>
<font size="+2" color='#053c96'><b> Data Preparation</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Data Cleaning</b></font>  

<font size="+0" color='green'><b> Handling Imbalanced Classes</b></font>  

<font size="+0" color='green'><b> Feature Engineering</b></font>  

<font size="+0" color='green'><b> Feature Selection</b></font>  

<font size="+0" color='green'><b> Data Transformation</b></font>  

<font size="+0" color='green'><b> Data Splitting</b></font>  

<a id='model-development'></a>

<font size="+2" color='#053c96'><b> Model Development</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Baseline Model</b></font>  

<font size="+0" color='green'><b> Model Selection</b></font>  

<font size="+0" color='green'><b> Model Training</b></font>  

<font size="+0" color='green'><b> Hyperparameter Tuning</b></font>  

<a id='model-evaluation'></a>

<font size="+2" color='#053c96'><b> Model Evaluation</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Performance Metrics</b></font>  

<font size="+0" color='green'><b> Confusion Matrix</b></font>  

<font size="+0" color='green'><b> ROC Curve</b></font>  

<font size="+0" color='green'><b> Precision-Recall Curve</b></font>   

<font size="+0" color='green'><b> Cross-Validation</b></font>   

<font size="+0" color='green'><b> Bias-Variance Tradeoff</b></font>   

<a id='model-interpretation'></a>
<font size="+2" color='#053c96'><b> Model Interpretation</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Feature Importance</b></font>   

<font size="+0" color='green'><b> Model Explanation Techniques</b></font>   

<font size="+0" color='green'><b> Business Impact Analysis</b></font>   

<a id='conclusion'></a>

<font size="+2" color='#053c96'><b> Conclusion</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Summary of Findings</b></font>   

<font size="+0" color='green'><b> Recommendations</b></font>   

<font size="+0" color='green'><b> Limitations</b></font>   

<font size="+0" color='green'><b> Future Work</b></font>   

<font size="+0" color='green'><b> Final Thoughts</b></font>   

<a id='references'></a>

<font size="+2" color='#053c96'><b> References</b></font>  
[back to top](#table-of-contents)