<h2><center> AquaInsight: Exploring Global Wastewater Treatment Patterns</center></h2>
<figure>
<center><img src="https://th.bing.com/th/id/OIP.wuNPTx42LyVnFMqRofDVPQHaGB?pid=ImgDet&rs=1" width="750" height="500" alt="unsplash.com"/></center>
</figure>

## Author: Umar Kabir

Date: July 2023

<a id='table-of-contents'></a>
# Table of Contents

1. [Introduction](#introduction)
    - Motivation
    - Problem Statement
    - Objectives
    - Data Source
    - Importing Dependencies  


2. [Data](#2-data)
    - Data Loading
    - Data Overview


3. [Exploratory Data Analysis](#exploratory-data-analysis)
    - Descriptive Statistics
    - Data Visualization
    - Correlation Analysis
    - Outlier Detection


4. [Data Preparation](#data-preparation)
    - Data Cleaning
    - Handling Missing Values
    - Handling Imbalanced Classes
    - Feature Selection
    - Feature Engineering
    - Data Transformation
    - Data Splitting


5. [Model Development](#model-development)
    - Baseline Model
    - Model Selection
    - Model Training
    - Hyperparameter Tuning


6. [Model Evaluation](#model-evaluation)
    - Performance Metrics
    - Confusion Matrix
    - ROC Curve
    - Precision-Recall Curve
    - Cross-Validation
    - Bias-Variance Tradeoff


7. [Model Interpretation](#model-interpretation)
    - Feature Importance
    - Model Explanation Techniques
    - Business Impact Analysis


8. [Conclusion](#conclusion)
    - Summary of Findings
    - Recommendations
    - Limitations
    - Future Work
    - Final Thoughts


9. [References](#references)

<a id='introduction'></a>
<font size="+2" color='#053c96'><b> Introduction</b></font>  
[back to top](#table-of-contents)  

<font size="+0" color='green'><b> Motivation</b></font>  


<font size="+0" color='green'><b> Problem Statement</b></font>  



<font size="+0" color='green'><b> Objectives</b></font>  


<font size="+0" color='green'><b> Data Source</b></font>  


<font size="+0" color='green'><b> Importing Dependencies</b></font>  

In [2]:
import sys
# Insert the parent path relative to this notebook so we can import from the src folder.
sys.path.insert(0, "..")

from src.dependencies import *

<a id='2-data'></a>
<font size="+2" color='#053c96'><b> Data</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Data Loading</b></font>  

In [5]:
df = pd.read_csv('../data/HydroWASTE_v10.csv', encoding='ISO-8859-1')

<font size="+0" color='green'><b> Data Overview</b></font>  

In [6]:
df.shape

(58502, 25)

In [10]:
# Get information about the DataFrame, including data types and non-null counts
print("\nData Info:")
df.info()  # info() prints its report directly and returns None, so no print() wrapper is needed


Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58502 entries, 0 to 58501
Data columns (total 25 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   WASTE_ID    58502 non-null  int64  
 1   SOURCE      58502 non-null  int64  
 2   ORG_ID      58502 non-null  int64  
 3   WWTP_NAME   53215 non-null  object 
 4   COUNTRY     58502 non-null  object 
 5   CNTRY_ISO   58502 non-null  object 
 6   LAT_WWTP    58502 non-null  float64
 7   LON_WWTP    58502 non-null  float64
 8   QUAL_LOC    58502 non-null  int64  
 9   LAT_OUT     58502 non-null  float64
 10  LON_OUT     58502 non-null  float64
 11  STATUS      58502 non-null  object 
 12  POP_SERVED  58502 non-null  int64  
 13  QUAL_POP    58502 non-null  int64  
 14  WASTE_DIS   58502 non-null  float64
 15  QUAL_WASTE  58502 non-null  int64  
 16  LEVEL       58502 non-null  object 
 17  QUAL_LEVEL  58502 non-null  int64  
 18  DF          47302 non-null  float64
 19  HYRIV_ID    5
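
The `info()` report above shows that not every column is complete: `WWTP_NAME` has 53,215 non-null values and `DF` has 47,302, against 58,502 rows (5,287 and 11,200 missing, respectively). A minimal sketch of how the gaps could be summarised per column follows. The helper name `missing_summary` and the small `demo` frame are hypothetical and stand in for `df` so the cell runs without the CSV.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages for columns that have gaps."""
    missing = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "missing_pct": (missing / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical stand-in data; in the notebook this would be missing_summary(df)
demo = pd.DataFrame({
    "WWTP_NAME": ["Plant A", None, "Plant C", None],
    "DF": [2421.974, None, 1367.809, 2061.969],
    "COUNTRY": ["Lithuania"] * 4,
})
result = missing_summary(demo)
print(result)
```

Running `missing_summary(df)` on the full dataset would list every incomplete column, including any that fall in the truncated part of the report above.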

In [9]:
# Display the first few rows of the DataFrame
print("First few rows:")
df.head()

First few rows:


|  | WASTE_ID | SOURCE | ORG_ID | WWTP_NAME | COUNTRY | CNTRY_ISO | LAT_WWTP | LON_WWTP | QUAL_LOC | LAT_OUT | LON_OUT | STATUS | POP_SERVED | QUAL_POP | WASTE_DIS | QUAL_WASTE | LEVEL | QUAL_LEVEL | DF | HYRIV_ID | RIVER_DIS | COAST_10KM | COAST_50KM | DESIGN_CAP | QUAL_CAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1140441 | Akmenes aglomeracija | Lithuania | LTU | 56.247 | 22.726 | 2 | 56.223 | 22.627 | Not Reported | 1060 | 2 | 148.213 | 4 | Advanced | 1 | 2421.974 | 20228874.0 | 4.153 | 0 | 0 | 4600.0 | 2 |
| 1 | 2 | 1 | 1140443 | Alytaus m aglomeracija | Lithuania | LTU | 54.432 | 24.056 | 2 | 54.519 | 24.098 | Not Reported | 87900 | 2 | 8797.904 | 1 | Advanced | 1 | 2534.527 | 20261585.0 | 257.983 | 0 | 0 | 220000.0 | 2 |
| 2 | 3 | 1 | 1140445 | Anyksciu aglomeracija | Lithuania | LTU | 55.509 | 25.073 | 2 | 55.452 | 25.006 | Not Reported | 12400 | 2 | 1959.285 | 1 | Advanced | 1 | 1367.809 | 20243105.0 | 30.995 | 0 | 0 | 33000.0 | 2 |
| 3 | 4 | 1 | 1140447 | Ariogalos aglomeracija | Lithuania | LTU | 55.252 | 23.484 | 2 | 55.21 | 23.51 | Not Reported | 2500 | 2 | 578.482 | 1 | Secondary | 1 | 2061.969 | 20247446.0 | 13.799 | 0 | 0 | 4357.0 | 2 |
| 4 | 5 | 1 | 1140449 | Baisogalos aglomeracija | Lithuania | LTU | 55.644 | 23.741 | 2 | 55.681 | 23.835 | Not Reported | 1200 | 2 | 167.788 | 4 | Secondary | 1 | 209.549 | 20239330.0 | 0.405 | 0 | 0 | 1490.0 | 2 |
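
In the five rows shown above, the `DF` column agrees, to within rounding, with `(RIVER_DIS * 86400 + WASTE_DIS) / WASTE_DIS`. That would be consistent with `RIVER_DIS` being in m³/s and `WASTE_DIS` in m³/day, but both units are assumptions inferred from these sample values and are not confirmed anywhere in this notebook. The sketch below reproduces the check on those five rows only. The names `SECONDS_PER_DAY`, `DF_check` and `rel_diff` are introduced here for illustration.

```python
import pandas as pd

# Assumption: RIVER_DIS is in m^3/s and WASTE_DIS is in m^3/day
SECONDS_PER_DAY = 86_400

# WASTE_DIS, RIVER_DIS and DF copied from the five rows of df.head()
sample = pd.DataFrame({
    "WASTE_DIS": [148.213, 8797.904, 1959.285, 578.482, 167.788],
    "RIVER_DIS": [4.153, 257.983, 30.995, 13.799, 0.405],
    "DF": [2421.974, 2534.527, 1367.809, 2061.969, 209.549],
})

# Convert river discharge to a daily volume, then form the candidate ratio
river_per_day = sample["RIVER_DIS"] * SECONDS_PER_DAY
sample["DF_check"] = (river_per_day + sample["WASTE_DIS"]) / sample["WASTE_DIS"]

# Relative difference between the reported and recomputed values
sample["rel_diff"] = (sample["DF_check"] - sample["DF"]).abs() / sample["DF"]
print(sample.round(6))
```

If the same relationship holds across the full dataset, rows where `DF` is missing would be expected to lack `RIVER_DIS` as well, which is worth verifying before any imputation.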


<a id='exploratory-data-analysis'></a>
<font size="+2" color='#053c96'><b> Exploratory Data Analysis</b></font>  
[back to top](#table-of-contents)

<a id='data-exploration'></a>
<font size="+0" color='green'><b> Data Exploration</b></font>  

In [11]:
# Check the number of unique values in each column
print("\nNumber of Unique Values:")
print(df.nunique())


Number of Unique Values:
WASTE_ID      58502
SOURCE           12
ORG_ID        47496
WWTP_NAME     49260
COUNTRY         188
CNTRY_ISO       180
LAT_WWTP      31311
LON_WWTP      44467
QUAL_LOC          4
LAT_OUT       13507
LON_OUT       24606
STATUS            9
POP_SERVED    22602
QUAL_POP          4
WASTE_DIS     33782
QUAL_WASTE        4
LEVEL             3
QUAL_LEVEL        2
DF            45199
HYRIV_ID      42821
RIVER_DIS     22017
COAST_10KM        2
COAST_50KM        2
DESIGN_CAP     7328
QUAL_CAP          3
dtype: int64
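
The counts above show 188 distinct `COUNTRY` names but only 180 distinct `CNTRY_ISO` codes, so some codes must be shared by more than one country name. A minimal sketch of how those codes could be listed follows. The helper `codes_with_multiple_names` and the `demo` frame, including the made-up `EXA` code, are hypothetical. In the notebook the function would be called on `df`.

```python
import pandas as pd


def codes_with_multiple_names(frame, code_col="CNTRY_ISO", name_col="COUNTRY"):
    """Map each code to its names, keeping only codes with more than one name."""
    names_per_code = frame.groupby(code_col)[name_col].unique()
    return {
        code: sorted(names)
        for code, names in names_per_code.items()
        if len(names) > 1
    }


# Hypothetical stand-in data: "EXA" is shared by two spellings of a country
demo = pd.DataFrame({
    "COUNTRY": ["Lithuania", "Lithuania", "Examplia", "Examplia North"],
    "CNTRY_ISO": ["LTU", "LTU", "EXA", "EXA"],
})
duplicates = codes_with_multiple_names(demo)
print(duplicates)
```

Whether such cases are spelling variants or distinct territories under one code determines which column is safer to group by in later analysis.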


In [7]:
# Count plants per treatment level (LEVEL has 3 distinct values)
df['LEVEL'].value_counts()

In [8]:
# Count plants per reported status (STATUS has 9 distinct values)
df['STATUS'].value_counts()

<a id='data-visualization'></a>
<font size="+0" color='green'><b> Data Visualization</b></font>  

<a id='summary-statistics'></a>
<font size="+0" color='green'><b> Summary Statistics</b></font>  

<a id='feature-correlation'></a>
<font size="+0" color='green'><b> Feature Correlation</b></font>  

<a id='data-preparation'></a>
<font size="+2" color='#053c96'><b> Data Preparation</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Data Cleaning</b></font>  

<font size="+0" color='green'><b> Handling Imbalanced Classes</b></font>  

<font size="+0" color='green'><b> Feature Engineering</b></font>  

<font size="+0" color='green'><b> Feature Selection</b></font>  

<font size="+0" color='green'><b> Data Transformation</b></font>  

<font size="+0" color='green'><b> Data Splitting</b></font>  

<a id='model-development'></a>

<font size="+2" color='#053c96'><b> Model Development</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Baseline Model</b></font>  

<font size="+0" color='green'><b> Model Selection</b></font>  

<font size="+0" color='green'><b> Model Training</b></font>  

<font size="+0" color='green'><b> Hyperparameter Tuning</b></font>  

<a id='model-evaluation'></a>

<font size="+2" color='#053c96'><b> Model Evaluation</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Performance Metrics</b></font>  

<font size="+0" color='green'><b> Confusion Matrix</b></font>  

<font size="+0" color='green'><b> ROC Curve</b></font>  

<font size="+0" color='green'><b> Precision-Recall Curve</b></font>   

<font size="+0" color='green'><b> Cross-Validation</b></font>   

<font size="+0" color='green'><b> Bias-Variance Tradeoff</b></font>   

<a id='model-interpretation'></a>
<font size="+2" color='#053c96'><b> Model Interpretation</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Feature Importance</b></font>   

<font size="+0" color='green'><b> Model Explanation Techniques</b></font>   

<font size="+0" color='green'><b> Business Impact Analysis</b></font>   

<a id='conclusion'></a>

<font size="+2" color='#053c96'><b> Conclusion</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Summary of Findings</b></font>   

<font size="+0" color='green'><b> Recommendations</b></font>   

<font size="+0" color='green'><b> Limitations</b></font>   

<font size="+0" color='green'><b> Future Work</b></font>   

<font size="+0" color='green'><b> Final Thoughts</b></font>   

<a id='references'></a>

<font size="+2" color='#053c96'><b> References</b></font>  
[back to top](#table-of-contents)