<center><h2>AquaInsight: Exploring Global Wastewater Treatment Patterns</h2></center>
<figure>
<center><img src="https://th.bing.com/th/id/OIP.wuNPTx42LyVnFMqRofDVPQHaGB?pid=ImgDet&rs=1" width="750" height="500" alt="unsplash.com"/></center>
</figure>

## Author: Umar Kabir

Date: [July, 2023]

<a id='table-of-contents'></a>
# Table of Contents

1. [Introduction](#introduction)
    - Motivation
    - Problem Statement
    - Objectives
    - Data Source
    - Importing Dependencies  


2. [Data](#2-data)
    - Data Loading
    - Data Overview


3. [Exploratory Data Analysis](#exploratory-data-analysis)
    - Descriptive Statistics
    - Data Visualization
    - Correlation Analysis
    - Outlier Detection


4. [Data Preparation](#data-preparation)
    - Data Cleaning
    - Handling Missing Values
    - Handling Imbalanced Classes
    - Feature Selection
    - Feature Engineering
    - Data Transformation
    - Data Splitting


5. [Model Development](#model-development)
    - Baseline Model
    - Model Selection
    - Model Training
    - Hyperparameter Tuning


6. [Model Evaluation](#model-evaluation)
    - Performance Metrics
    - Confusion Matrix
    - ROC Curve
    - Precision-Recall Curve
    - Cross-Validation
    - Bias-Variance Tradeoff


7. [Model Interpretation](#model-interpretation)
    - Feature Importance
    - Model Explanation Techniques
    - Business Impact Analysis


8. [Conclusion](#conclusion)
    - Summary of Findings
    - Recommendations
    - Limitations
    - Future Work
    - Final Thoughts


9. [References](#references)

<a id='introduction'></a>
<font size="+2" color='#053c96'><b> Introduction</b></font>  
[back to top](#table-of-contents)  

<font size="+0" color='green'><b> Possible Target Variables</b></font>  


<font size="+0" color='green'><b> Motivation</b></font>  


<font size="+0" color='green'><b> Problem Statement</b></font>  



<font size="+0" color='green'><b> Objectives</b></font>  


<font size="+0" color='green'><b> Data Source</b></font>  


<font size="+0" color='green'><b> Importing Dependencies</b></font>  

In [1]:
import sys
# Insert the parent path relative to this notebook so we can import from the src folder.
sys.path.insert(0, "..")

from src.dependencies import *
from src.functions import *

<a id='2-data'></a>
<font size="+2" color='#053c96'><b> Data</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Data Loading</b></font>  

In [2]:
rivers = pd.read_csv('../data/river_data.csv')
df = pd.read_csv('../data/HydroWASTE_v10.csv', encoding='ISO-8859-1')
gloric = pd.read_csv('../data/gloric.csv')
africa_atlas = pd.read_csv('../data/africa-atlas.csv')
aus_ocean_atlas = pd.read_csv('../data/australia-and-oceania-atlas.csv')
cse_asia_atlas = pd.read_csv('../data/central-and-south-east-asia-atlas.csv')
eme_atlas = pd.read_csv('../data/europe-and-middle-east-atlas.csv')
greenland_atlas = pd.read_csv('../data/greenland-atlas.csv')
naa_atlas = pd.read_csv('../data/north-america-arctic-atlas.csv')
nac_atlas = pd.read_csv('../data/north-america-caribbean-atlas.csv')
siberia_atlas = pd.read_csv('../data/siberia-atlas.csv')
san_atlas = pd.read_csv('../data/south-america-north-atlas.csv')
sas_atlas = pd.read_csv('../data/south-america-south-atlas.csv')
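
The ten regional atlas files follow one naming pattern, so the reads above could also be driven by a list of region names. The sketch below is one possible arrangement, not the notebook's own code: `ATLAS_REGIONS` and `load_atlases` are hypothetical names, and the function assumes every file is called `<region>-atlas.csv` and sits in a single directory.

```python
from pathlib import Path

import pandas as pd

# Region names as they appear in the atlas file names (e.g. "africa-atlas.csv").
ATLAS_REGIONS = [
    "africa",
    "australia-and-oceania",
    "central-and-south-east-asia",
    "europe-and-middle-east",
    "greenland",
    "north-america-arctic",
    "north-america-caribbean",
    "siberia",
    "south-america-north",
    "south-america-south",
]


def load_atlases(data_dir, regions=ATLAS_REGIONS):
    """Read '<region>-atlas.csv' for each region into a dict of DataFrames."""
    data_dir = Path(data_dir)
    return {
        region: pd.read_csv(data_dir / f"{region}-atlas.csv")
        for region in regions
    }
```

With this in place, `load_atlases("../data")` would return a dict keyed by region, so `atlases["africa"]` plays the role of `africa_atlas`.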


In [7]:
rivers = rivers[rivers['HYRIV_ID'].isin(df['HYRIV_ID'])]
africa_atlas = africa_atlas[africa_atlas['HYRIV_ID'].isin(df['HYRIV_ID'])]
aus_ocean_atlas = aus_ocean_atlas[aus_ocean_atlas['HYRIV_ID'].isin(df['HYRIV_ID'])]
cse_asia_atlas = cse_asia_atlas[cse_asia_atlas['HYRIV_ID'].isin(df['HYRIV_ID'])]
eme_atlas = eme_atlas[eme_atlas['HYRIV_ID'].isin(df['HYRIV_ID'])]
greenland_atlas = greenland_atlas[greenland_atlas['HYRIV_ID'].isin(df['HYRIV_ID'])]
naa_atlas = naa_atlas[naa_atlas['HYRIV_ID'].isin(df['HYRIV_ID'])]
nac_atlas = nac_atlas[nac_atlas['HYRIV_ID'].isin(df['HYRIV_ID'])]
siberia_atlas = siberia_atlas[siberia_atlas['HYRIV_ID'].isin(df['HYRIV_ID'])]
san_atlas = san_atlas[san_atlas['HYRIV_ID'].isin(df['HYRIV_ID'])]
sas_atlas = sas_atlas[sas_atlas['HYRIV_ID'].isin(df['HYRIV_ID'])]
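
Every line in the cell above applies the same `isin` filter on `HYRIV_ID`. As a rough illustration of that step on toy data, the sketch below wraps it in a helper and applies it to a dict of frames. `keep_wwtp_reaches` and the toy values are hypothetical. Dropping missing IDs before the lookup is an added precaution, not something the original cell does.

```python
import pandas as pd


def keep_wwtp_reaches(frame, wwtp, key="HYRIV_ID"):
    """Return the rows of `frame` whose `key` also occurs in `wwtp[key]`."""
    ids = wwtp[key].dropna().unique()  # ignore plants without a linked reach
    return frame[frame[key].isin(ids)]


# Toy stand-ins for the real tables (values are made up for illustration).
wwtp = pd.DataFrame({"HYRIV_ID": [101.0, 103.0, None]})
atlases = {
    "region_a": pd.DataFrame({"HYRIV_ID": [101, 102, 103], "value": [1, 2, 3]}),
    "region_b": pd.DataFrame({"HYRIV_ID": [201, 202], "value": [4, 5]}),
}

filtered = {name: keep_wwtp_reaches(frame, wwtp) for name, frame in atlases.items()}
for name, frame in filtered.items():
    print(name, frame.shape)
```

A region with no matching reaches comes back as an empty frame that still carries all of its columns, which is the same situation as the Greenland atlas reporting `(0, 295)` after filtering.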

In [11]:
merged_atlas = combine_dataframes([africa_atlas, aus_ocean_atlas, cse_asia_atlas, eme_atlas, greenland_atlas, naa_atlas, nac_atlas, siberia_atlas, san_atlas, sas_atlas])

In [18]:
merged_atlas.shape

(42821, 295)
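
`combine_dataframes` is imported from `src.functions` and its body is not shown here. The merged shape is consistent with a plain row-wise concatenation: the filtered regional row counts (1294 + 1482 + 3429 + 18424 + 0 + 85 + 15603 + 218 + 353 + 1933) add up to 42821, and the column count stays at 295. A minimal sketch under that assumption, using the hypothetical name `combine_dataframes_sketch`:

```python
import pandas as pd


def combine_dataframes_sketch(frames):
    """Stack DataFrames that share the same columns into one table."""
    # Assumption: the real helper behaves like a row-wise concatenation.
    return pd.concat(frames, ignore_index=True)


a = pd.DataFrame({"HYRIV_ID": [1, 2], "value": [0.1, 0.2]})
b = pd.DataFrame({"HYRIV_ID": [3], "value": [0.3]})
empty = a.iloc[0:0]  # a region with no matching rows, like the Greenland atlas

merged = combine_dataframes_sketch([a, b, empty])
print(merged.shape)
```

Using `ignore_index=True` gives the combined table a fresh 0..n-1 index, so row labels from the regional files cannot collide.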

<font size="+0" color='green'><b> Data Overview</b></font>  

In [10]:
print("\nHydroWASTE Data Shape:")
print(df.shape)
print("\nRivers Data Shape:")
print(rivers.shape)
print("\nGloRiC Data Shape:")
print(gloric.shape)
print("\nAfrica Atlas Data Shape:")
print(africa_atlas.shape)
print("\nAustralia and Oceania Atlas Data Shape:")
print(aus_ocean_atlas.shape)
print("\nCentral and South East Asia Atlas Data Shape:")
print(cse_asia_atlas.shape)
print("\nEurope and Middle East Atlas Data Shape:")
print(eme_atlas.shape)
print("\nGreenland Atlas Data Shape:")
print(greenland_atlas.shape)
print("\nNorth America and Arctic Atlas Data Shape:")
print(naa_atlas.shape)
print("\nNorth America and Caribbean Atlas Data Shape:")
print(nac_atlas.shape)
print("\nSiberian Atlas Data Shape:")
print(siberia_atlas.shape)
print("\nSouth America North Atlas Data Shape:")
print(san_atlas.shape)
print("\nSouth America South Atlas Data Shape:")
print(sas_atlas.shape)


HydroWASTE Data Shape:
(58502, 25)

Rivers Data Shape:
(42821, 16)

GloRiC Data Shape:
(8477883, 17)

Africa Atlas Data Shape:
(1294, 295)

Australia and Oceania Atlas Data Shape:
(1482, 295)

Central and South East Asia Atlas Data Shape:
(3429, 295)

Europe and Middle East Atlas Data Shape:
(18424, 295)

Greenland Atlas Data Shape:
(0, 295)

North America and Arctic Atlas Data Shape:
(85, 295)

North America and Caribbean Atlas Data Shape:
(15603, 295)

Siberian Atlas Data Shape:
(218, 295)

South America North Atlas Data Shape:
(353, 295)

South America South Atlas Data Shape:
(1933, 295)


In [5]:
# Get information about the DataFrames, including data types and non-null counts.
# DataFrame.info() prints its report itself and returns None, so it is called
# directly rather than wrapped in print(), which would also print "None".
print("\nHydroWASTE Data Info:")
df.info()
print("\nRivers Data Info:")
rivers.info()
print("\nGloRiC Data Info:")
gloric.info()
print("\nAfrica Atlas Data Info:")
africa_atlas.info()
print("\nAustralia and Oceania Atlas Data Info:")
aus_ocean_atlas.info()
print("\nCentral and South East Asia Atlas Data Info:")
cse_asia_atlas.info()
print("\nEurope and Middle East Atlas Data Info:")
eme_atlas.info()
print("\nGreenland Atlas Data Info:")
greenland_atlas.info()
print("\nNorth America and Arctic Atlas Data Info:")
naa_atlas.info()
print("\nNorth America and Caribbean Atlas Data Info:")
nac_atlas.info()
print("\nSiberian Atlas Data Info:")
siberia_atlas.info()
print("\nSouth America North Atlas Data Info:")
san_atlas.info()
print("\nSouth America South Atlas Data Info:")
sas_atlas.info()


HydroWASTE Data Info:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58502 entries, 0 to 58501
Data columns (total 25 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   WASTE_ID    58502 non-null  int64  
 1   SOURCE      58502 non-null  int64  
 2   ORG_ID      58502 non-null  int64  
 3   WWTP_NAME   53215 non-null  object 
 4   COUNTRY     58502 non-null  object 
 5   CNTRY_ISO   58502 non-null  object 
 6   LAT_WWTP    58502 non-null  float64
 7   LON_WWTP    58502 non-null  float64
 8   QUAL_LOC    58502 non-null  int64  
 9   LAT_OUT     58502 non-null  float64
 10  LON_OUT     58502 non-null  float64
 11  STATUS      58502 non-null  object 
 12  POP_SERVED  58502 non-null  int64  
 13  QUAL_POP    58502 non-null  int64  
 14  WASTE_DIS   58502 non-null  float64
 15  QUAL_WASTE  58502 non-null  int64  
 16  LEVEL       58502 non-null  object 
 17  QUAL_LEVEL  58502 non-null  int64  
 18  DF          47302 non-null  float64
 19  HYRIV_ID    58123 non-nul

In [6]:
# Display the first few rows of the DataFrame
print("First few rows:")
rivers.head()

First few rows:


Unnamed: 0,OBJECTID,HYRIV_ID,NEXT_DOWN,MAIN_RIV,LENGTH_KM,DIST_DN_KM,DIST_UP_KM,CATCH_SKM,UPLAND_SKM,ENDORHEIC,DIS_AV_CMS,ORD_STRA,ORD_CLAS,ORD_FLOW,HYBAS_L12,Shape_Length
0,1,10000001,0,10000001,0.89,0.0,7.2,11.27,11.1,0,0.062,1,1,8,1120031210,0.008839
1,2,10000002,0,10000002,2.9,0.0,7.0,24.59,24.200001,0,0.126,1,1,7,1120031210,0.028957
2,3,10000003,10000009,10000009,4.63,5.7,9.8,57.23,57.200001,0,0.316,1,1,7,1120031210,0.046487
3,4,10000004,10000009,10000009,0.69,5.7,5.4,11.11,11.1,0,0.061,1,2,8,1120031210,0.00625
4,5,10000005,0,10000005,8.32,0.0,13.6,35.02,34.0,0,0.177,1,1,7,1120031210,0.085713


In [7]:
# Display the first few rows of the DataFrame
print("First few rows:")
df.head()

First few rows:


Unnamed: 0,WASTE_ID,SOURCE,ORG_ID,WWTP_NAME,COUNTRY,CNTRY_ISO,LAT_WWTP,LON_WWTP,QUAL_LOC,LAT_OUT,LON_OUT,STATUS,POP_SERVED,QUAL_POP,WASTE_DIS,QUAL_WASTE,LEVEL,QUAL_LEVEL,DF,HYRIV_ID,RIVER_DIS,COAST_10KM,COAST_50KM,DESIGN_CAP,QUAL_CAP
0,1,1,1140441,Akmenes aglomeracija,Lithuania,LTU,56.247,22.726,2,56.223,22.627,Not Reported,1060,2,148.213,4,Advanced,1,2421.974,20228874.0,4.153,0,0,4600.0,2
1,2,1,1140443,Alytaus m aglomeracija,Lithuania,LTU,54.432,24.056,2,54.519,24.098,Not Reported,87900,2,8797.904,1,Advanced,1,2534.527,20261585.0,257.983,0,0,220000.0,2
2,3,1,1140445,Anyksciu aglomeracija,Lithuania,LTU,55.509,25.073,2,55.452,25.006,Not Reported,12400,2,1959.285,1,Advanced,1,1367.809,20243105.0,30.995,0,0,33000.0,2
3,4,1,1140447,Ariogalos aglomeracija,Lithuania,LTU,55.252,23.484,2,55.21,23.51,Not Reported,2500,2,578.482,1,Secondary,1,2061.969,20247446.0,13.799,0,0,4357.0,2
4,5,1,1140449,Baisogalos aglomeracija,Lithuania,LTU,55.644,23.741,2,55.681,23.835,Not Reported,1200,2,167.788,4,Secondary,1,209.549,20239330.0,0.405,0,0,1490.0,2


In [8]:
# Display the first few rows of the DataFrame
print("First few rows:")
gloric.head()

First few rows:


Unnamed: 0,OBJECTID,Reach_ID,Next_down,Length_km,Log_Q_avg,Log_Q_var,Class_hydr,Temp_min,CMI_indx,Log_elev,Class_phys,Lake_wet,Stream_pow,Class_geom,Reach_type,Kmeans_30,Shape_Length
0,1,10000001,0,0.888,-1.20761,0.43806,12.0,11.4,-0.488,1.65321,311.0,0.0,0.03063,11.0,311.0,10.0,0.008839
1,2,10000002,0,2.904,-0.89963,0.43745,12.0,11.4,-0.518,1.51851,311.0,0.0,0.02158,11.0,311.0,10.0,0.028957
2,3,10000003,10000009,4.633,-0.50031,0.43532,12.0,11.6,-0.513,1.04139,311.0,0.0,0.00847,11.0,311.0,10.0,0.046487
3,4,10000004,10000009,0.695,-1.21467,0.43998,12.0,11.6,-0.513,1.25527,311.0,0.0,0.0,11.0,311.0,10.0,0.00625
4,5,10000005,0,8.317,-0.75203,0.42963,12.0,11.6,-0.526,-1.0,311.0,1.0,0.02313,21.0,313.0,26.0,0.085713


In [9]:
# Display the first few rows of the DataFrame
print("First few rows:")
naa_atlas.head()

First few rows:


Unnamed: 0,HYRIV_ID,NEXT_DOWN,MAIN_RIV,LENGTH_KM,DIST_DN_KM,DIST_UP_KM,CATCH_SKM,UPLAND_SKM,ENDORHEIC,DIS_AV_CMS,ORD_STRA,ORD_CLAS,ORD_FLOW,HYBAS_L12,dis_m3_pyr,dis_m3_pmn,dis_m3_pmx,run_mm_cyr,inu_pc_cmn,inu_pc_umn,inu_pc_cmx,inu_pc_umx,inu_pc_clt,inu_pc_ult,lka_pc_cse,lka_pc_use,lkv_mc_usu,rev_mc_usu,dor_pc_pva,ria_ha_csu,ria_ha_usu,riv_tc_csu,riv_tc_usu,gwt_cm_cav,ele_mt_cav,ele_mt_uav,ele_mt_cmn,ele_mt_cmx,slp_dg_cav,slp_dg_uav,sgr_dk_rav,clz_cl_cmj,cls_cl_cmj,tmp_dc_cyr,tmp_dc_uyr,tmp_dc_cmn,tmp_dc_cmx,tmp_dc_c01,tmp_dc_c02,tmp_dc_c03,tmp_dc_c04,tmp_dc_c05,tmp_dc_c06,tmp_dc_c07,tmp_dc_c08,tmp_dc_c09,tmp_dc_c10,tmp_dc_c11,tmp_dc_c12,pre_mm_cyr,pre_mm_uyr,pre_mm_c01,pre_mm_c02,pre_mm_c03,pre_mm_c04,pre_mm_c05,pre_mm_c06,pre_mm_c07,pre_mm_c08,pre_mm_c09,pre_mm_c10,pre_mm_c11,pre_mm_c12,pet_mm_cyr,pet_mm_uyr,pet_mm_c01,pet_mm_c02,pet_mm_c03,pet_mm_c04,pet_mm_c05,pet_mm_c06,pet_mm_c07,pet_mm_c08,pet_mm_c09,pet_mm_c10,pet_mm_c11,pet_mm_c12,aet_mm_cyr,aet_mm_uyr,aet_mm_c01,aet_mm_c02,aet_mm_c03,aet_mm_c04,aet_mm_c05,aet_mm_c06,aet_mm_c07,aet_mm_c08,aet_mm_c09,aet_mm_c10,aet_mm_c11,aet_mm_c12,ari_ix_cav,ari_ix_uav,cmi_ix_cyr,cmi_ix_uyr,cmi_ix_c01,cmi_ix_c02,cmi_ix_c03,cmi_ix_c04,cmi_ix_c05,cmi_ix_c06,cmi_ix_c07,cmi_ix_c08,cmi_ix_c09,cmi_ix_c10,cmi_ix_c11,cmi_ix_c12,snw_pc_cyr,snw_pc_uyr,snw_pc_cmx,snw_pc_c01,snw_pc_c02,snw_pc_c03,snw_pc_c04,snw_pc_c05,snw_pc_c06,snw_pc_c07,snw_pc_c08,snw_pc_c09,snw_pc_c10,snw_pc_c11,snw_pc_c12,glc_cl_cmj,glc_pc_c01,glc_pc_c02,glc_pc_c03,glc_pc_c04,glc_pc_c05,glc_pc_c06,glc_pc_c07,glc_pc_c08,glc_pc_c09,glc_pc_c10,glc_pc_c11,glc_pc_c12,glc_pc_c13,glc_pc_c14,glc_pc_c15,glc_pc_c16,glc_pc_c17,glc_pc_c18,glc_pc_c19,glc_pc_c20,glc_pc_c21,glc_pc_c22,glc_pc_u01,glc_pc_u02,glc_pc_u03,glc_pc_u04,glc_pc_u05,glc_pc_u06,glc_pc_u07,glc_pc_u08,glc_pc_u09,glc_pc_u10,glc_pc_u11,glc_pc_u12,glc_pc_u13,glc_pc_u14,glc_pc_u15,glc_pc_u16,glc_pc_u17,glc_pc_u18,glc_pc_u19,glc_pc_u20,glc_pc_u21,glc_pc_u22,pnv_cl_cmj,pnv_pc_c01,pnv_pc_c02,pnv_pc_c03,pnv_pc_c04,pnv_pc_c05,pnv_pc_c06,pnv_pc_c07,pnv_pc_c08,pnv_pc_c09,pnv_pc_c10,pnv_pc_c11,pnv_pc_c12,pnv_pc_c13,pnv_pc_c14,pnv_pc_c15,pnv_pc_u01,pnv_pc_u02,pnv_pc_u03,pnv_pc_u04,pnv_pc_u05,pnv_pc_u06,pnv_pc_u07,pnv_pc_u08,pnv_pc_u09,pnv_pc_u10,pnv_pc_u11,pnv_pc_u12,pnv_pc_u13,pnv_pc_u14,pnv_pc_u15,wet_cl_cmj,wet_pc_cg1,wet_pc_ug1,wet_pc_cg2,wet_pc_ug2,wet_pc_c01,wet_pc_c02,wet_pc_c03,wet_pc_c04,wet_pc_c05,wet_pc_c06,wet_pc_c07,wet_pc_c08,wet_pc_c09,wet_pc_u01,wet_pc_u02,wet_pc_u03,wet_pc_u04,wet_pc_u05,wet_pc_u06,wet_pc_u07,wet_pc_u08,wet_pc_u09,for_pc_cse,for_pc_use,crp_pc_cse,crp_pc_use,pst_pc_cse,pst_pc_use,ire_pc_cse,ire_pc_use,gla_pc_cse,gla_pc_use,prm_pc_cse,prm_pc_use,pac_pc_cse,pac_pc_use,tbi_cl_cmj,tec_cl_cmj,fmh_cl_cmj,fec_cl_cmj,cly_pc_cav,cly_pc_uav,slt_pc_cav,slt_pc_uav,snd_pc_cav,snd_pc_uav,soc_th_cav,soc_th_uav,swc_pc_cyr,swc_pc_uyr,swc_pc_c01,swc_pc_c02,swc_pc_c03,swc_pc_c04,swc_pc_c05,swc_pc_c06,swc_pc_c07,swc_pc_c08,swc_pc_c09,swc_pc_c10,swc_pc_c11,swc_pc_c12,lit_cl_cmj,kar_pc_cse,kar_pc_use,ero_kh_cav,ero_kh_uav,pop_ct_csu,pop_ct_usu,ppd_pk_cav,ppd_pk_uav,urb_pc_cse,urb_pc_use,nli_ix_cav,nli_ix_uav,rdd_mk_cav,rdd_mk_uav,hft_ix_c93,hft_ix_u93,hft_ix_c09,hft_ix_u09,gad_id_cmj,gdp_ud_cav,gdp_ud_csu,gdp_ud_usu,hdi_ix_cav
0,80000001,0,80000001,1.17,0.0,8.8,11.11,11.0,0,0.045,1,1,8,8120056940,0.045,0.0,0.371,128,0,0,0,0,0,0,0,0,0,0,0,1.823,1.807,2.408,2.402,1,0,0,0,0,61,60,0,6,12,-186,-186,-350,30,-333,-350,-348,-262,-113,-3,30,12,-82,-193,-279,-317,77,77,3,3,3,3,4,6,14,13,11,8,5,4,142,142,0,0,0,0,16,41,52,29,4,0,0,0,100,100,0,0,0,0,13,31,36,18,2,0,0,0,54,54,-46,-46,100,100,100,100,-75,-85,-73,-54,64,100,100,100,97,97,100,100,100,100,100,100,100,89,80,96,100,100,100,20,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,0,0,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,-999,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,100,17,18,11,412,11,112,-999,-999,-999,-999,-999,-999,-999,-999,74,74,81,82,82,83,82,77,70,65,65,67,68,69,5,0,0,0,0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,41,64789,0,0,832
1,80000002,0,80000002,2.39,0.0,38.5,2.8,61.8,0,0.262,2,1,7,8120056920,0.262,0.003,2.316,129,0,0,0,0,0,0,0,0,0,0,0,1.367,19.734,4.932,43.563,2,0,90,0,0,71,149,0,6,12,-188,-191,-352,30,-335,-352,-350,-264,-113,-2,30,11,-82,-195,-281,-318,77,85,3,3,3,3,4,6,14,13,11,8,5,4,142,138,0,0,0,0,16,42,52,28,4,0,0,0,101,102,0,0,0,0,13,32,36,18,2,0,0,0,54,62,-46,-38,100,100,100,100,-75,-86,-73,-54,64,100,100,100,92,91,100,100,100,100,100,100,99,62,53,93,100,100,100,20,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,56,44,0,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,-999,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,99,100,0,82,11,412,11,112,-999,-999,-999,-999,-999,-999,-999,-999,74,77,81,82,82,83,82,77,69,65,65,67,68,69,5,0,0,0,0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,41,64789,0,33,832
2,80000003,0,80000003,3.49,0.0,16.8,12.34,12.2,0,0.042,1,1,8,8120056960,0.042,0.0,0.321,109,0,0,0,0,0,0,0,0,0,0,0,2.505,2.496,3.661,3.657,1,0,0,0,0,135,135,0,6,12,-185,-185,-347,30,-330,-347,-346,-260,-111,-3,30,12,-81,-192,-276,-314,79,79,3,3,3,3,4,6,14,14,11,8,5,4,142,142,0,0,0,0,16,41,52,29,4,0,0,0,101,101,0,0,0,0,13,31,36,19,2,0,0,0,55,55,-45,-45,100,100,100,100,-72,-85,-73,-52,64,100,100,100,96,96,100,100,100,100,100,100,99,86,78,94,100,100,100,20,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,0,0,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,-999,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,100,53,54,11,412,11,112,-999,-999,-999,-999,-999,-999,-999,-999,75,75,81,82,83,83,82,77,70,65,65,67,69,70,5,0,0,0,0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,41,64789,0,0,832
3,80000004,80000002,80000002,1.4,2.3,10.2,11.64,11.6,0,0.048,1,2,8,8120056920,0.048,0.001,0.423,129,0,0,0,0,0,0,0,0,0,0,0,1.866,1.866,2.762,2.762,0,0,0,0,0,143,143,0,6,12,-188,-188,-353,30,-336,-353,-351,-265,-113,-2,30,12,-82,-195,-281,-319,77,77,3,3,3,3,4,6,14,13,11,8,5,4,142,142,0,0,0,0,16,42,52,28,4,0,0,0,101,101,0,0,0,0,13,32,36,18,2,0,0,0,54,54,-46,-46,100,100,100,100,-75,-86,-73,-54,64,100,100,100,94,94,100,100,100,100,100,100,98,72,66,93,100,100,100,20,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,0,0,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,-999,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,100,48,48,11,412,11,112,-999,-999,-999,-999,-999,-999,-999,-999,74,74,81,82,82,83,82,77,69,65,65,67,68,69,5,0,0,0,0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,41,64789,0,0,832
4,80000005,0,80000005,5.11,0.0,14.6,33.36,33.3,0,0.136,1,1,7,8120056900,0.136,0.003,1.25,129,0,0,0,0,0,0,0,0,0,0,0,6.65,6.639,10.572,10.567,0,0,0,0,0,173,173,0,6,12,-189,-189,-355,30,-339,-355,-352,-266,-114,-2,30,11,-83,-196,-284,-321,77,77,3,3,3,3,4,6,14,13,11,8,5,4,142,142,0,0,0,0,16,42,52,28,4,0,0,0,101,101,0,0,0,0,13,32,36,18,2,0,0,0,54,54,-46,-46,100,100,100,100,-75,-86,-73,-54,64,100,100,100,88,88,100,100,100,100,100,100,98,40,33,85,100,100,100,20,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,0,0,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,-999,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,100,42,42,11,412,11,112,-999,-999,-999,-999,-999,-999,-999,-999,74,74,81,82,82,83,82,77,69,65,65,67,68,69,5,0,0,0,0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,41,64789,0,0,832
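
Several atlas columns in the preview rows hold `-999`, which looks like a no-data placeholder rather than a measurement. Treating it that way is an assumption here and should be confirmed against the atlas documentation before any statistics are computed. A minimal sketch of masking such a value, with a hypothetical helper `mask_sentinel` and made-up column names:

```python
import numpy as np
import pandas as pd


def mask_sentinel(frame, sentinel=-999, columns=None):
    """Return a copy of `frame` with `sentinel` replaced by NaN."""
    cols = list(frame.columns) if columns is None else list(columns)
    out = frame.copy()
    out[cols] = out[cols].replace(sentinel, np.nan)
    return out


toy = pd.DataFrame({
    "HYRIV_ID": [80000001, 80000002],
    "col_a": [-999, 17],  # hypothetical column names and values
    "col_b": [74, -999],
})

clean = mask_sentinel(toy, columns=["col_a", "col_b"])
print(clean.isnull().sum())
```

Passing an explicit `columns` list keeps identifier columns such as `HYRIV_ID` out of the replacement, and working on a copy leaves the loaded frame untouched.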


In [10]:
df['HYRIV_ID'].min()

10000009.0
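
The minimum prints as a float (`10000009.0`) even though reach IDs are whole numbers. This is what pandas does when an integer column contains missing values, and `HYRIV_ID` has 379 of them. If the plant table is later joined to the atlas attributes on `HYRIV_ID`, one option is to drop the unlinked plants and restore an integer key first. Whether the notebook joins the tables this way is not stated, so the sketch below is only an illustration on toy data:

```python
import pandas as pd

# Toy stand-ins; in the notebook the real tables are `df` and `merged_atlas`.
wwtp = pd.DataFrame({
    "WASTE_ID": [1, 2, 3],
    "HYRIV_ID": [20228874.0, None, 20261585.0],  # float because of the gap
})
atlas = pd.DataFrame({
    "HYRIV_ID": [20228874, 20261585],
    "ele_mt_cav": [61, 71],  # made-up attribute values
})

# Drop plants with no linked reach, then restore an integer key.
linked = wwtp.dropna(subset=["HYRIV_ID"]).astype({"HYRIV_ID": "int64"})

# validate="many_to_one" raises if a reach ID is duplicated in the atlas.
joined = linked.merge(atlas, on="HYRIV_ID", how="left", validate="many_to_one")
print(joined)
```

A left join keeps every linked plant even if its reach is absent from the atlas, so any gaps show up as missing attribute values instead of silently dropped rows.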

<a id='exploratory-data-analysis'></a>
<font size="+2" color='#053c96'><b> Exploratory Data Analysis</b></font>  
[back to top](#table-of-contents)

<a id='data-exploration'></a>
<font size="+0" color='green'><b> Data Exploration</b></font>  

In [11]:
# Check the number of unique values in each column
print("\nNumber of Unique Values:")
print(df.nunique())


Number of Unique Values:
WASTE_ID      58502
SOURCE           12
ORG_ID        47496
WWTP_NAME     49260
COUNTRY         188
CNTRY_ISO       180
LAT_WWTP      31311
LON_WWTP      44467
QUAL_LOC          4
LAT_OUT       13507
LON_OUT       24606
STATUS            9
POP_SERVED    22602
QUAL_POP          4
WASTE_DIS     33782
QUAL_WASTE        4
LEVEL             3
QUAL_LEVEL        2
DF            45199
HYRIV_ID      42821
RIVER_DIS     22017
COAST_10KM        2
COAST_50KM        2
DESIGN_CAP     7328
QUAL_CAP          3
dtype: int64
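
The counts above show 188 distinct `COUNTRY` values but only 180 `CNTRY_ISO` codes, so at least some codes are shared by more than one country name. A short sketch of how those codes could be listed follows. The helper name `codes_with_multiple_names` and the toy rows are hypothetical.

```python
import pandas as pd


def codes_with_multiple_names(frame, code_col="CNTRY_ISO", name_col="COUNTRY"):
    """Map each code that has more than one distinct name to its sorted names."""
    names = frame.groupby(code_col)[name_col].unique()
    return {code: sorted(vals) for code, vals in names.items() if len(vals) > 1}


toy = pd.DataFrame({
    "COUNTRY": ["Lithuania", "Name A", "Name B", "Lithuania"],
    "CNTRY_ISO": ["LTU", "XXX", "XXX", "LTU"],
})
print(codes_with_multiple_names(toy))
```

Running the same function on `df` would show which ISO codes need their country names reconciled before any per-country aggregation.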


In [12]:
# Check for any missing values in the DataFrame
print("\nMissing Values:")
print(df.isnull().sum())


Missing Values:
WASTE_ID          0
SOURCE            0
ORG_ID            0
WWTP_NAME      5287
COUNTRY           0
CNTRY_ISO         0
LAT_WWTP          0
LON_WWTP          0
QUAL_LOC          0
LAT_OUT           0
LON_OUT           0
STATUS            0
POP_SERVED        0
QUAL_POP          0
WASTE_DIS         0
QUAL_WASTE        0
LEVEL             0
QUAL_LEVEL        0
DF            11200
HYRIV_ID        379
RIVER_DIS     10551
COAST_10KM        0
COAST_50KM        0
DESIGN_CAP    15835
QUAL_CAP          0
dtype: int64
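
Raw counts are easier to judge as a share of the 58502 rows: `DESIGN_CAP` is missing for roughly 27% of plants and `DF` for about 19%. The sketch below computes both views at once. `missing_summary` is a hypothetical helper and the toy frame is made up for illustration.

```python
import pandas as pd


def missing_summary(frame):
    """Count and percentage of missing values per column, largest first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


toy = pd.DataFrame({
    "WASTE_ID": [1, 2, 3, 4],
    "DF": [2421.974, None, None, 209.549],
    "DESIGN_CAP": [4600.0, None, None, None],
})
print(missing_summary(toy))
```

Calling `missing_summary(df)` on the plant table would list only the five columns with gaps, ordered from most to least affected.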


In [13]:
africa_atlas.duplicated().sum()

0

<a id='data-visualization'></a>
<font size="+0" color='green'><b> Data Visualization</b></font>  

<a id='summary-statistics'></a>
<font size="+0" color='green'><b> Summary Statistics</b></font>  

<a id='feature-correlation'></a>
<font size="+0" color='green'><b> Feature Correlation</b></font>  

<a id='data-preparation'></a>
<font size="+2" color='#053c96'><b> Data Preparation</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Data Cleaning</b></font>  

<font size="+0" color='green'><b> Handling Imbalanced Classes</b></font>  

<font size="+0" color='green'><b> Feature Engineering</b></font>  

<font size="+0" color='green'><b> Feature Selection</b></font>  

<font size="+0" color='green'><b> Data Transformation</b></font>  

<font size="+0" color='green'><b> Data Splitting</b></font>  

<a id='model-development'></a>

<font size="+2" color='#053c96'><b> Model Development</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Baseline Model</b></font>  

<font size="+0" color='green'><b> Model Selection</b></font>  

<font size="+0" color='green'><b> Model Training</b></font>  

<font size="+0" color='green'><b> Hyperparameter Tuning</b></font>  

<a id='model-evaluation'></a>

<font size="+2" color='#053c96'><b> Model Evaluation</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Performance Metrics</b></font>  

<font size="+0" color='green'><b> Confusion Matrix</b></font>  

<font size="+0" color='green'><b> ROC Curve</b></font>  

<font size="+0" color='green'><b> Precision-Recall Curve</b></font>   

<font size="+0" color='green'><b> Cross-Validation</b></font>   

<font size="+0" color='green'><b> Bias-Variance Tradeoff</b></font>   

<a id='model-interpretation'></a>
<font size="+2" color='#053c96'><b> Model Interpretation</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Feature Importance</b></font>   

<font size="+0" color='green'><b> Model Explanation Techniques</b></font>   

<font size="+0" color='green'><b> Business Impact Analysis</b></font>   

<a id='conclusion'></a>

<font size="+2" color='#053c96'><b> Conclusion</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Summary of Findings</b></font>   

<font size="+0" color='green'><b> Recommendations</b></font>   

<font size="+0" color='green'><b> Limitations</b></font>   

<font size="+0" color='green'><b> Future Work</b></font>   

<font size="+0" color='green'><b> Final Thoughts</b></font>   

<a id='references'></a>

<font size="+2" color='#053c96'><b> References</b></font>  
[back to top](#table-of-contents)