<h2><center> AquaInsight: Exploring Global Wastewater Treatment Patterns</center></h2>
<figure>
<center><img src="https://th.bing.com/th/id/OIP.wuNPTx42LyVnFMqRofDVPQHaGB?pid=ImgDet&rs=1" width="750" height="500" alt="unsplash.com"/></center>
</figure>

## Author: Umar Kabir

Date: [July, 2023]

<a id='table-of-contents'></a>
# Table of Contents

1. [Introduction](#introduction)
    - Motivation
    - Problem Statement
    - Objective
    - Data Source
    - Importing Dependencies  


2. [Data](#data)
    - Data Loading
    - Dataset Overview


3. [Exploratory Data Analysis](#exploratory-data-analysis)
    - Descriptive Statistics
    - Data Visualization
    - Correlation Analysis
    - Outlier Detection


4. [Data Preparation](#data-preparation)
    - Data Cleaning
    - Handling Missing Values
    - Handling Imbalanced Classes
    - Feature Selection
    - Feature Engineering
    - Data Transformation
    - Data Splitting


5. [Model Development](#model-development)
    - Baseline Model
    - Model Selection
    - Model Training
    - Hyperparameter Tuning


6. [Model Evaluation](#model-evaluation)
    - Performance Metrics
    - Confusion Matrix
    - ROC Curve
    - Precision-Recall Curve
    - Cross-Validation
    - Bias-Variance Tradeoff


7. [Model Interpretation](#model-interpretation)
    - Feature Importance
    - Model Explanation Techniques
    - Business Impact Analysis


8. [Conclusion](#conclusion)
    - Summary of Findings
    - Recommendations
    - Limitations
    - Future Work
    - Final Thoughts


9. [References](#references)

<a id='introduction'></a>
<font size="+2" color='#053c96'><b> Introduction</b></font>  
[back to top](#table-of-contents)  

<font size="+0" color='green'><b> Possible Target Variables</b></font>  


<font size="3" color='cyan'><b> Choice of Target Variable: STATUS (Status of the WWTP)</b></font>  

One candidate target is the "STATUS" column, which records the operating status of each wastewater treatment plant (WWTP). It describes the current state of each plant and could anchor several analysis and modeling tasks.

<font size="2" color='cyan'><b> Importance of STATUS as the Target Variable:</b></font>  

1. **Decision-Making Insights:**
   Understanding the status of WWTPs can provide crucial insights for decision-making processes. It allows stakeholders, policymakers, and environmentalists to identify operational WWTPs, those under construction, and those that are closed or decommissioned.

2. **Environmental Assessment:**
   The status of WWTPs is essential for environmental assessments, as it helps evaluate the impact of wastewater treatment on surrounding ecosystems and water bodies. It enables the analysis of active and non-operational WWTPs and their associated waste discharges.

3. **Maintenance and Upgrades:**
   Analyzing the status of WWTPs aids in identifying those under construction, needing maintenance, or requiring upgrades. This information is valuable for optimizing resources and ensuring the efficient functioning of WWTPs.

4. **Risk Analysis:**
   Knowing the status of WWTPs helps assess the potential risks associated with untreated or inadequately treated wastewater discharge. Non-operational or closed WWTPs may pose environmental risks and warrant further investigation.

5. **Modeling for Future Projections:**
   The status of WWTPs serves as a target variable for predictive modeling, helping to forecast future changes in the WWTP landscape and predict the emergence of new operational WWTPs or the decommissioning of existing ones.

6. **Environmental Policy Evaluation:**
   Policymakers can use the status of WWTPs to evaluate the effectiveness of environmental policies and regulations. It enables the assessment of policy impact on the number and status of operational WWTPs.

<font size="2" color='cyan'><b> Analysis and Modeling Opportunities:</b></font>  

With "STATUS" as the target variable, several data analysis and modeling opportunities arise:

1. **Classification Models:**
   The status categories (Operational, Under Construction, Closed, etc.) make it suitable for building classification models that predict the status of WWTPs based on other attributes in the dataset.

2. **Impact Assessment:**
   Analyzing the status of WWTPs in different regions or countries can help assess the impact of environmental policies and infrastructure development efforts.

3. **Performance Evaluation:**
   The target variable can be used for evaluating the performance of WWTPs by comparing the status of plants with their respective wastewater discharge and treatment levels.

4. **Spatial Analysis:**
   Geospatial analysis can be performed to visualize the distribution of operational and non-operational WWTPs on a global scale, identifying areas with a higher concentration of active plants.

In short, using "STATUS" as the target would support classification, impact assessment, and spatial analysis of wastewater treatment plants, with the aim of improving wastewater management and environmental sustainability worldwide.
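
Before any classification work on "STATUS", it is worth checking how the classes are distributed. The sketch below is only an illustration: it uses a small made-up sample (the labels and counts are assumptions, not values from the dataset) to compute the share of each class and the accuracy that always guessing the most frequent class would reach.

```python
import pandas as pd

# Hypothetical toy sample; labels and counts are illustrative only.
toy = pd.DataFrame({
    "STATUS": ["Operational"] * 6 + ["Not Reported"] * 3 + ["Closed"],
})

# Share of each class; a heavily skewed distribution signals class imbalance.
class_share = toy["STATUS"].value_counts(normalize=True)

# Accuracy of always predicting the most frequent class (a naive baseline).
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()

print(class_share)
print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.2f}")
```

Any classifier trained on "STATUS" should beat this majority-class figure to be considered useful.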


<font size="3" color='cyan'><b> Choice of Target Variable: WASTE_DIS (Treated Wastewater Discharged)</b></font>  

A second candidate target is the "WASTE_DIS" column, which records the amount of treated wastewater each wastewater treatment plant (WWTP) discharges per day. It quantifies the volume of treated wastewater released into the environment.

<font size="2" color='cyan'><b> Importance of WASTE_DIS as the Target Variable:</b></font>  

1. **Environmental Impact Assessment:**
   The "WASTE_DIS" column plays a vital role in assessing the environmental impact of wastewater treatment plants. It allows us to quantify the volume of treated wastewater discharged into rivers, lakes, or oceans, helping to evaluate its potential effect on aquatic ecosystems and water quality.

2. **Resource Management and Planning:**
   Understanding the amount of treated wastewater discharged is essential for resource management and urban planning. Policymakers can use this information to optimize water usage, allocate resources, and plan for future wastewater treatment capacity.

3. **Sustainability Evaluation:**
   The volume of treated wastewater discharged reflects the efficiency and capacity of WWTPs. By analyzing this variable, we can evaluate the sustainability of wastewater treatment practices and identify areas for improvement.

4. **Health and Sanitation Assessment:**
   Treated wastewater discharge affects public health and sanitation. Analyzing "WASTE_DIS" helps monitor compliance with environmental regulations and assess potential health risks related to water quality.

<font size="2" color='cyan'><b> Analysis and Modeling Opportunities:</b></font>  

With "WASTE_DIS" as the target variable, several data analysis and modeling opportunities arise:

1. **Regression Models:**
   The continuous nature of "WASTE_DIS" makes it suitable for building regression models to predict the volume of treated wastewater discharged based on other attributes in the dataset.

2. **Environmental Impact Prediction:**
   Using "WASTE_DIS" in predictive models, we can forecast the potential environmental impact of wastewater treatment plants, identifying areas with high waste discharge and planning for better environmental conservation.

3. **Trend Analysis:**
   Analyzing the trends in "WASTE_DIS" over time or across different regions can provide insights into the changing patterns of wastewater treatment and discharge, aiding in policy evaluation and decision-making.

4. **Resource Optimization:**
   By understanding the factors influencing "WASTE_DIS," we can optimize resources and design strategies to minimize waste generation and improve wastewater treatment efficiency.

In short, using "WASTE_DIS" as the target would give a quantitative view of the environmental impact of wastewater treatment and would support regression, trend analysis, and resource planning for water management.
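
As a rough illustration of the regression idea, the sketch below fits a straight line in log-log space between "POP_SERVED" and "WASTE_DIS". It uses only the five rows that appear in the `df.head()` output later in this notebook, so the fitted numbers are not meaningful; the choice of a log-log model and the helper name `predict_waste_dis` are assumptions made for the example.

```python
import numpy as np

# The five rows shown in this notebook's df.head() output.
pop_served = np.array([1060, 87900, 12400, 2500, 1200], dtype=float)
waste_dis = np.array([148.213, 8797.904, 1959.285, 578.482, 167.788])

# Fit log10(WASTE_DIS) = slope * log10(POP_SERVED) + intercept by least squares.
slope, intercept = np.polyfit(np.log10(pop_served), np.log10(waste_dis), deg=1)

def predict_waste_dis(population):
    """Predict discharge from population with the fitted log-log line."""
    return 10 ** (slope * np.log10(population) + intercept)

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(predict_waste_dis(10_000))
```

A real model would be fitted on the full dataset, evaluated on held-out rows, and would likely include more predictors than population alone.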


<font size="3" color='cyan'><b> Choice of Target Variable: LEVEL (Level of Treatment of the WWTP)</b></font>  

A third candidate target is the "LEVEL" column, which records the level of treatment each wastewater treatment plant (WWTP) provides: "Primary," "Secondary," or "Advanced." It indicates how sophisticated each plant's treatment process is.  

<font size="2" color='cyan'><b> Importance of LEVEL as the Target Variable:</b></font>  

1. **Treatment Process Evaluation:**
   The "LEVEL" column allows us to assess the extent to which wastewater is treated at each WWTP. It provides information on whether the plant primarily removes physical solids ("Primary"), goes through biological treatment ("Secondary"), or implements more advanced treatment techniques ("Advanced").

2. **Environmental Impact Assessment:**
   Different treatment levels have varying effects on the quality of treated wastewater discharged into the environment. Understanding the "LEVEL" helps evaluate the potential environmental impact of WWTPs and their contributions to water quality.

3. **Water Quality Monitoring:**
   The level of treatment directly affects the quality of treated wastewater. By analyzing "LEVEL," we can monitor the effectiveness of different treatment methods in improving water quality and compliance with environmental standards.

4. **Policy and Regulatory Compliance:**
   Environmental regulations often set specific treatment requirements for WWTPs. Analyzing the "LEVEL" variable allows policymakers to evaluate whether WWTPs meet the prescribed treatment standards.

<font size="2" color='cyan'><b> Analysis and Modeling Opportunities:</b></font>  

With "LEVEL" as the target variable, several data analysis and modeling opportunities arise:

1. **Classification Models:**
   The categorical nature of the "LEVEL" variable makes it suitable for building classification models. We can predict the treatment level of WWTPs based on other attributes in the dataset.

2. **Environmental Impact Assessment:**
   Using "LEVEL" in predictive models, we can assess the potential environmental impact of different treatment levels, identifying areas where more advanced treatment is needed.

3. **Treatment Efficiency Analysis:**
   Analyzing "LEVEL" helps compare the efficiency of different treatment methods and identify areas for improving wastewater treatment processes.

4. **Optimization of Treatment Resources:**
   Understanding the factors influencing the "LEVEL" of treatment can help optimize resources and allocate them effectively to achieve desired treatment outcomes.

In short, using "LEVEL" as the target would give insight into treatment processes and their environmental impact. Because the variable is categorical, it suits classification models and supports the evaluation and improvement of treatment practices.
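
Since the three treatment levels have a natural order, one option is to encode "LEVEL" as an ordered categorical. The sketch below assumes the ordering Primary < Secondary < Advanced and uses a short illustrative series; the column name `LEVEL_CODE` is hypothetical.

```python
import pandas as pd

# Illustrative values only; the real column holds these three labels.
levels = pd.Series(["Advanced", "Advanced", "Advanced", "Secondary", "Secondary", "Primary"])

# Ordered categorical: Primary < Secondary < Advanced (an assumed ordering).
level_order = ["Primary", "Secondary", "Advanced"]
level_cat = pd.Categorical(levels, categories=level_order, ordered=True)

# Integer codes (0, 1, 2) usable as an ordinal feature or label.
level_codes = pd.Series(level_cat.codes, name="LEVEL_CODE")
print(level_codes.tolist())
```

Whether to treat the levels as ordinal or as plain unordered classes is a modeling choice; most classifiers accept either encoding.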


<font size="3" color='cyan'><b> Choice of Target Variable: POP_SERVED (Population Served by the WWTP)</b></font>  

A fourth candidate target is the "POP_SERVED" column, which records the population served by each wastewater treatment plant (WWTP), that is, the number of people whose wastewater the plant treats.

<font size="2" color='cyan'><b> Importance of POP_SERVED as the Target Variable:</b></font>  

1. **Public Health and Sanitation:**
   The "POP_SERVED" column is essential for assessing public health and sanitation. It quantifies the number of people whose wastewater is treated, which relates to a lower risk of waterborne diseases.

2. **Urban Planning and Infrastructure Development:**
   Understanding the population served by WWTPs is vital for urban planning and infrastructure development. It guides decision-makers in estimating future wastewater treatment needs and optimizing resource allocation.

3. **Resource Management:**
   Analyzing "POP_SERVED" allows efficient resource management, ensuring that WWTPs have the capacity to handle the wastewater generated by the population they serve.

4. **Sustainability Evaluation:**
   The "POP_SERVED" variable reflects the impact of wastewater treatment on the surrounding communities. It is essential for evaluating the sustainability and effectiveness of wastewater management practices.

<font size="2" color='cyan'><b> Analysis and Modeling Opportunities:</b></font>  

With "POP_SERVED" as the target variable, several data analysis and modeling opportunities arise:

1. **Regression Models:**
   Because "POP_SERVED" is a numeric count, regression models can be built to predict the population served by a WWTP from other attributes in the dataset.

2. **Urban Planning Projections:**
   Using "POP_SERVED" in predictive models, we can project future population growth and estimate the corresponding increase in wastewater treatment demand.

3. **Health Impact Assessment:**
   Analyzing "POP_SERVED" allows us to assess the potential health impact of wastewater treatment on the population and identify areas where wastewater management needs improvement.

4. **Infrastructure Optimization:**
   By understanding the factors influencing "POP_SERVED," we can optimize the capacity and design of WWTPs to meet the growing needs of the population.

In short, using "POP_SERVED" as the target would give insight into public health, urban planning, and resource management related to wastewater treatment, and would support regression and projection work on wastewater infrastructure.
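
One simple way to relate "POP_SERVED" to plant load is discharge per person served. The sketch below computes it for the five rows that appear in the `df.head()` output later in this notebook; the derived column name `PER_CAPITA_DIS` is hypothetical, and the result is in whatever daily volume unit "WASTE_DIS" uses.

```python
import pandas as pd

# Values copied from the five rows shown in this notebook's df.head() output.
sample = pd.DataFrame({
    "WWTP_NAME": [
        "Akmenes aglomeracija",
        "Alytaus m aglomeracija",
        "Anyksciu aglomeracija",
        "Ariogalos aglomeracija",
        "Baisogalos aglomeracija",
    ],
    "POP_SERVED": [1060, 87900, 12400, 2500, 1200],
    "WASTE_DIS": [148.213, 8797.904, 1959.285, 578.482, 167.788],
})

# Discharge per person served; PER_CAPITA_DIS is a hypothetical column name.
sample["PER_CAPITA_DIS"] = sample["WASTE_DIS"] / sample["POP_SERVED"]
print(sample[["WWTP_NAME", "PER_CAPITA_DIS"]].round(3))
```

On the full dataset, any rows with a "POP_SERVED" of zero would need to be handled before dividing.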


<font size="3" color='cyan'><b> Choice of Target Variable: QUAL_POP (Quality Indicator for Population Served)</b></font>  

A fifth candidate target is the "QUAL_POP" column, a quality indicator for the "population served" attribute of each wastewater treatment plant (WWTP). It describes how reliable the reported population figure is.

<font size="2" color='cyan'><b> Importance of QUAL_POP as the Target Variable:</b></font>  

1. **Data Reliability Assessment:**
   The "QUAL_POP" column allows us to assess the reliability of reported population data for each WWTP. It helps identify cases where population data is directly reported, estimated with wastewater discharge information, or estimated without wastewater discharge information.

2. **Data Quality Improvement:**
   Analyzing "QUAL_POP" helps in identifying areas where data quality improvement is needed. It allows for better data collection and reporting practices to enhance the accuracy of population served estimates.

3. **Policy and Decision-Making:**
   Policymakers and stakeholders rely on accurate population data to make informed decisions about wastewater treatment infrastructure and resource allocation. The "QUAL_POP" variable helps check that such decisions rest on reliable information.

<font size="2" color='cyan'><b> Analysis and Modeling Opportunities:</b></font>  

With "QUAL_POP" as the target variable, several data analysis and modeling opportunities arise:

1. **Classification Models:**
   The categorical nature of the "QUAL_POP" variable makes it suitable for building classification models to predict the quality of population data reported by each WWTP based on other attributes in the dataset.

2. **Data Quality Assessment:**
   Using "QUAL_POP" in predictive models, we can assess the factors influencing the accuracy of reported population data and identify patterns for improving data quality.

3. **Policy Evaluation:**
   Analyzing "QUAL_POP" helps policymakers and authorities evaluate the effectiveness of data reporting policies and their impact on data quality.

4. **Resource Allocation Optimization:**
   By understanding the factors affecting "QUAL_POP," we can optimize resource allocation for data collection and reporting efforts, improving the reliability of population served estimates.

In short, using "QUAL_POP" as the target would give insight into the reliability of the population data reported for WWTPs and would support classification and data-quality work that feeds into infrastructure planning and resource management.


<font size="+0" color='green'><b> Motivation</b></font>  


<font size="+0" color='green'><b> Problem Statement</b></font>  



<font size="+0" color='green'><b> Objectives</b></font>  


<font size="+0" color='green'><b> Data Source</b></font>  


<font size="+0" color='green'><b> Importing Dependencies</b></font>  

In [1]:
import sys
# Insert the parent path relative to this notebook so we can import from the src folder.
sys.path.insert(0, "..")

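# Wildcard import of the project's shared dependencies; pandas must be exposed as `pd` for the cells below.
from src.dependencies import *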

<a id='data'></a>
<font size="+2" color='#053c96'><b> Data</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Data Loading</b></font>  

In [10]:
rivers = pd.read_csv('../data/river_data.csv')
df = pd.read_csv('../data/HydroWASTE_v10.csv', encoding='ISO-8859-1')

<font size="+0" color='green'><b> Data Overview</b></font>  

In [11]:
rivers.shape

(8477883, 16)

In [3]:
df.shape

(58502, 25)

In [12]:
# Get information about the DataFrame; for a frame this large, pandas omits the non-null counts by default
print("\nData Info:")
# info() prints its report itself and returns None, so it is not wrapped in print()
rivers.info()


Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8477883 entries, 0 to 8477882
Data columns (total 16 columns):
 #   Column        Dtype  
---  ------        -----  
 0   OBJECTID      int64  
 1   HYRIV_ID      int64  
 2   NEXT_DOWN     int64  
 3   MAIN_RIV      int64  
 4   LENGTH_KM     float64
 5   DIST_DN_KM    float64
 6   DIST_UP_KM    float64
 7   CATCH_SKM     float64
 8   UPLAND_SKM    float64
 9   ENDORHEIC     int64  
 10  DIS_AV_CMS    float64
 11  ORD_STRA      int64  
 12  ORD_CLAS      int64  
 13  ORD_FLOW      int64  
 14  HYBAS_L12     int64  
 15  Shape_Length  float64
dtypes: float64(7), int64(9)
memory usage: 1.0 GB
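
The rivers frame occupies about 1.0 GB with every column stored as int64 or float64. If memory becomes a constraint, one option is to load only the columns that are needed and request narrower dtypes. The sketch below shows the idea on a tiny in-memory stand-in for the file: the values are made up, and the chosen dtypes are assumptions that hold only if the real value ranges fit them.

```python
import io
import pandas as pd

# Tiny stand-in for river_data.csv; values are made up, column names come from rivers.info().
csv_text = """HYRIV_ID,ENDORHEIC,ORD_STRA,LENGTH_KM
10000001,0,1,2.35
10000002,0,2,0.87
10000003,1,1,4.10
"""

# Load only the needed columns with narrower dtypes than the int64/float64 defaults.
small = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["HYRIV_ID", "ENDORHEIC", "ORD_STRA", "LENGTH_KM"],
    dtype={"HYRIV_ID": "int32", "ENDORHEIC": "int8", "ORD_STRA": "int8", "LENGTH_KM": "float32"},
)

# Same data with pandas' default dtypes, for comparison.
default = pd.read_csv(io.StringIO(csv_text))

print(small.dtypes)
print(default.memory_usage(index=False).sum(), "->", small.memory_usage(index=False).sum())
```

Narrowing float columns to float32 reduces precision, so it is worth confirming that the analysis tolerates it before applying this to the real file.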


In [4]:
# Get information about the DataFrame, including data types and non-null counts
print("\nData Info:")
# info() prints its report itself and returns None, so it is not wrapped in print()
df.info()


Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58502 entries, 0 to 58501
Data columns (total 25 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   WASTE_ID    58502 non-null  int64  
 1   SOURCE      58502 non-null  int64  
 2   ORG_ID      58502 non-null  int64  
 3   WWTP_NAME   53215 non-null  object 
 4   COUNTRY     58502 non-null  object 
 5   CNTRY_ISO   58502 non-null  object 
 6   LAT_WWTP    58502 non-null  float64
 7   LON_WWTP    58502 non-null  float64
 8   QUAL_LOC    58502 non-null  int64  
 9   LAT_OUT     58502 non-null  float64
 10  LON_OUT     58502 non-null  float64
 11  STATUS      58502 non-null  object 
 12  POP_SERVED  58502 non-null  int64  
 13  QUAL_POP    58502 non-null  int64  
 14  WASTE_DIS   58502 non-null  float64
 15  QUAL_WASTE  58502 non-null  int64  
 16  LEVEL       58502 non-null  object 
 17  QUAL_LEVEL  58502 non-null  int64  
 18  DF          47302 non-null  float64
 19  HYRIV_ID    5

In [5]:
# Display the first few rows of the DataFrame
print("First few rows:")
df.head()

First few rows:


|  | WASTE_ID | SOURCE | ORG_ID | WWTP_NAME | COUNTRY | CNTRY_ISO | LAT_WWTP | LON_WWTP | QUAL_LOC | LAT_OUT | LON_OUT | STATUS | POP_SERVED | QUAL_POP | WASTE_DIS | QUAL_WASTE | LEVEL | QUAL_LEVEL | DF | HYRIV_ID | RIVER_DIS | COAST_10KM | COAST_50KM | DESIGN_CAP | QUAL_CAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1140441 | Akmenes aglomeracija | Lithuania | LTU | 56.247 | 22.726 | 2 | 56.223 | 22.627 | Not Reported | 1060 | 2 | 148.213 | 4 | Advanced | 1 | 2421.974 | 20228874.0 | 4.153 | 0 | 0 | 4600.0 | 2 |
| 1 | 2 | 1 | 1140443 | Alytaus m aglomeracija | Lithuania | LTU | 54.432 | 24.056 | 2 | 54.519 | 24.098 | Not Reported | 87900 | 2 | 8797.904 | 1 | Advanced | 1 | 2534.527 | 20261585.0 | 257.983 | 0 | 0 | 220000.0 | 2 |
| 2 | 3 | 1 | 1140445 | Anyksciu aglomeracija | Lithuania | LTU | 55.509 | 25.073 | 2 | 55.452 | 25.006 | Not Reported | 12400 | 2 | 1959.285 | 1 | Advanced | 1 | 1367.809 | 20243105.0 | 30.995 | 0 | 0 | 33000.0 | 2 |
| 3 | 4 | 1 | 1140447 | Ariogalos aglomeracija | Lithuania | LTU | 55.252 | 23.484 | 2 | 55.21 | 23.51 | Not Reported | 2500 | 2 | 578.482 | 1 | Secondary | 1 | 2061.969 | 20247446.0 | 13.799 | 0 | 0 | 4357.0 | 2 |
| 4 | 5 | 1 | 1140449 | Baisogalos aglomeracija | Lithuania | LTU | 55.644 | 23.741 | 2 | 55.681 | 23.835 | Not Reported | 1200 | 2 | 167.788 | 4 | Secondary | 1 | 209.549 | 20239330.0 | 0.405 | 0 | 0 | 1490.0 | 2 |


<a id='exploratory-data-analysis'></a>
<font size="+2" color='#053c96'><b> Exploratory Data Analysis</b></font>  
[back to top](#table-of-contents)

<a id='data-exploration'></a>
<font size="+0" color='green'><b> Data Exploration</b></font>  

In [11]:
# Check the number of unique values in each column
print("\nNumber of Unique Values:")
print(df.nunique())


Number of Unique Values:
WASTE_ID      58502
SOURCE           12
ORG_ID        47496
WWTP_NAME     49260
COUNTRY         188
CNTRY_ISO       180
LAT_WWTP      31311
LON_WWTP      44467
QUAL_LOC          4
LAT_OUT       13507
LON_OUT       24606
STATUS            9
POP_SERVED    22602
QUAL_POP          4
WASTE_DIS     33782
QUAL_WASTE        4
LEVEL             3
QUAL_LEVEL        2
DF            45199
HYRIV_ID      42821
RIVER_DIS     22017
COAST_10KM        2
COAST_50KM        2
DESIGN_CAP     7328
QUAL_CAP          3
dtype: int64
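
The counts above show 188 distinct "COUNTRY" values but only 180 distinct "CNTRY_ISO" codes, and neither column has missing values, so at least one ISO code must be paired with more than one country name. The sketch below shows one way to list such codes. It runs on a made-up frame with placeholder names and codes; on the real data the same `groupby` would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy rows; the names and codes are placeholders, not real entries.
toy = pd.DataFrame({
    "COUNTRY": ["Country A", "Country A (alt. spelling)", "Country B", "Country B"],
    "CNTRY_ISO": ["AAA", "AAA", "BBB", "BBB"],
})

# Number of distinct country names attached to each ISO code.
names_per_iso = toy.groupby("CNTRY_ISO")["COUNTRY"].nunique()

# ISO codes that map to more than one name.
ambiguous_iso = names_per_iso[names_per_iso > 1].index.tolist()

# The name variants behind each ambiguous code.
variants = (
    toy[toy["CNTRY_ISO"].isin(ambiguous_iso)]
    .groupby("CNTRY_ISO")["COUNTRY"]
    .unique()
)
print(variants)
```

The listed variants can then be reviewed by hand to decide whether they are spelling differences or genuinely distinct territories sharing a code.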


In [12]:
# Check for any missing values in the DataFrame
print("\nMissing Values:")
print(df.isnull().sum())


Missing Values:
WASTE_ID          0
SOURCE            0
ORG_ID            0
WWTP_NAME      5287
COUNTRY           0
CNTRY_ISO         0
LAT_WWTP          0
LON_WWTP          0
QUAL_LOC          0
LAT_OUT           0
LON_OUT           0
STATUS            0
POP_SERVED        0
QUAL_POP          0
WASTE_DIS         0
QUAL_WASTE        0
LEVEL             0
QUAL_LEVEL        0
DF            11200
HYRIV_ID        379
RIVER_DIS     10551
COAST_10KM        0
COAST_50KM        0
DESIGN_CAP    15835
QUAL_CAP          0
dtype: int64
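
Raw counts are easier to judge as a share of the 58,502 rows. The sketch below converts the non-zero counts from the output above into percentages. The numbers are copied from that output; the variable names are chosen for the example.

```python
import pandas as pd

# Missing-value counts copied from the output above; total rows from df.shape.
n_rows = 58502
missing_counts = pd.Series({
    "WWTP_NAME": 5287,
    "DF": 11200,
    "HYRIV_ID": 379,
    "RIVER_DIS": 10551,
    "DESIGN_CAP": 15835,
})

# Percentage of rows missing per column, largest first.
missing_pct = (missing_counts / n_rows * 100).round(2).sort_values(ascending=False)
print(missing_pct)

# On the full frame, the equivalent would be: df.isnull().mean().mul(100).round(2)
```

Columns with a large missing share, such as "DESIGN_CAP", are the ones where a choice between imputing and dropping will matter most.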


<a id='data-visualization'></a>
<font size="+0" color='green'><b> Data Visualization</b></font>  

<a id='summary-statistics'></a>
<font size="+0" color='green'><b> Summary Statistics</b></font>  

<a id='feature-correlation'></a>
<font size="+0" color='green'><b> Feature Correlation</b></font>  

<a id='data-preparation'></a>
<font size="+2" color='#053c96'><b> Data Preparation</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Data Cleaning</b></font>  

<font size="+0" color='green'><b> Handling Imbalanced Classes</b></font>  

<font size="+0" color='green'><b> Feature Engineering</b></font>  

<font size="+0" color='green'><b> Feature Selection</b></font>  

<font size="+0" color='green'><b> Data Transformation</b></font>  

<font size="+0" color='green'><b> Data Splitting</b></font>  

<a id='model-development'></a>

<font size="+2" color='#053c96'><b> Model Development</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Baseline Model</b></font>  

<font size="+0" color='green'><b> Model Selection</b></font>  

<font size="+0" color='green'><b> Model Training</b></font>  

<font size="+0" color='green'><b> Hyperparameter Tuning</b></font>  

<a id='model-evaluation'></a>

<font size="+2" color='#053c96'><b> Model Evaluation</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Performance Metrics</b></font>  

<font size="+0" color='green'><b> Confusion Matrix</b></font>  

<font size="+0" color='green'><b> ROC Curve</b></font>  

<font size="+0" color='green'><b> Precision-Recall Curve</b></font>   

<font size="+0" color='green'><b> Cross-Validation</b></font>   

<font size="+0" color='green'><b> Bias-Variance Tradeoff</b></font>   

<a id='model-interpretation'></a>
<font size="+2" color='#053c96'><b> Model Interpretation</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Feature Importance</b></font>   

<font size="+0" color='green'><b> Model Explanation Techniques</b></font>   

<font size="+0" color='green'><b> Business Impact Analysis</b></font>   

<a id='conclusion'></a>

<font size="+2" color='#053c96'><b> Conclusion</b></font>  
[back to top](#table-of-contents)

<font size="+0" color='green'><b> Summary of Findings</b></font>   

<font size="+0" color='green'><b> Recommendations</b></font>   

<font size="+0" color='green'><b> Limitations</b></font>   

<font size="+0" color='green'><b> Future Work</b></font>   

<font size="+0" color='green'><b> Final Thoughts</b></font>   

<a id='references'></a>

<font size="+2" color='#053c96'><b> References</b></font>  
[back to top](#table-of-contents)