<a href="https://colab.research.google.com/github/mohammadreza-mohammadi94/Data_Analysis_Machine_Learning/blob/master/4.%20Public%20Health%20and%20Safety/Water%20Quality/Water_Quality_Analysis_and_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div align='center'>
    <a href="https://ibb.co/jgNSnw2"><img src="https://i.ibb.co/PwLdqWn/water-quality-blog-post-img.jpg" alt="water-quality-blog-post-img" border="0"></a>
</div>

# Project Content
1. [Introduction](#1)
2. [Connecting to Kaggle](#2)
    - 2.1 [Downloading Dataset From Kaggle](#2.1)
3. [Importing Libraries](#3)
4. [Loading Dataset](#4)
5. [First Analysis of Dataset](#5)

# 2. Connecting to Kaggle <a id=2></a>

In [1]:
from google.colab import userdata
import os

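# The Kaggle CLI reads its API token from the KAGGLE_KEY environment variable.
# 'KAGGLE_PASS' and 'KAGGLE_USERNAME' are the names of the secrets stored in Colab.
os.environ["KAGGLE_KEY"] = userdata.get('KAGGLE_PASS')
os.environ["KAGGLE_USERNAME"] = userdata.get('KAGGLE_USERNAME')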

## 2.1. Downloading Dataset <a id=2.1></a>

In [2]:
!kaggle datasets download -d adityakadiwal/water-potability

Dataset URL: https://www.kaggle.com/datasets/adityakadiwal/water-potability
License(s): CC0-1.0
Downloading water-potability.zip to /content
  0% 0.00/251k [00:00<?, ?B/s]
100% 251k/251k [00:00<00:00, 71.5MB/s]


In [3]:
! unzip "water-potability.zip"

Archive:  water-potability.zip
  inflating: water_potability.csv    


# 3. Importing Libraries <a id=3></a>

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# 4. Loading Dataset <a id=4></a>

In [5]:
df = pd.read_csv("water_potability.csv")
df.head()

| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.99097 | 2.963135 | 0 |
| 1 | 3.71608 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.5466 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |


# 5. First Analysis of Dataset <a id=5></a>

## 5.1 Getting to Know the Variables <a id=5.1></a>

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3276 entries, 0 to 3275
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ph               2785 non-null   float64
 1   Hardness         3276 non-null   float64
 2   Solids           3276 non-null   float64
 3   Chloramines      3276 non-null   float64
 4   Sulfate          2495 non-null   float64
 5   Conductivity     3276 non-null   float64
 6   Organic_carbon   3276 non-null   float64
 7   Trihalomethanes  3114 non-null   float64
 8   Turbidity        3276 non-null   float64
 9   Potability       3276 non-null   int64  
dtypes: float64(9), int64(1)
memory usage: 256.1 KB


1. ph: pH of water (0 to 14).
2. Hardness: Capacity of water to precipitate soap in mg/L.
3. Solids: Total dissolved solids in ppm.
4. Chloramines: Amount of Chloramines in ppm.
5. Sulfate: Amount of Sulfates dissolved in mg/L.
6. Conductivity: Electrical conductivity of water in μS/cm.
7. Organic_carbon: Amount of organic carbon in ppm.
8. Trihalomethanes: Amount of Trihalomethanes in μg/L.
9. Turbidity: Measure of the cloudiness of water (how strongly suspended particles scatter light) in NTU.
10. **Potability: Indicates whether the water is safe for human consumption (1 = potable, 0 = not potable).**

## 5.2 Analyzing the Dataframe <a id=5.2></a>

In [7]:
df.dtypes

ph                 float64
Hardness           float64
Solids             float64
Chloramines        float64
Sulfate            float64
Conductivity       float64
Organic_carbon     float64
Trihalomethanes    float64
Turbidity          float64
Potability           int64
dtype: object


### Summary
- The dataset contains a total of 3,276 records with 10 columns.
- Nine columns are of type `float64`; only the `Potability` column is of type `int64`.
- Missing values are present in three columns:
  - **ph**: 491 missing values (15%)
  - **Sulfate**: 781 missing values (24%)
  - **Trihalomethanes**: 162 missing values (5%)
- The columns `Hardness`, `Solids`, `Chloramines`, `Conductivity`, `Organic_carbon`, `Turbidity`, and `Potability` have no missing values.
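
The counts and percentages above can be reproduced directly rather than read off `df.info()`. The following is a minimal sketch: `missing_summary` is a hypothetical helper name, and the small `demo` frame holds made-up values for illustration only. In the notebook it would be called as `missing_summary(df)`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns that have gaps."""
    counts = frame.isna().sum()
    percent = (counts / len(frame) * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only columns with at least one missing value, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Tiny illustrative frame (made-up values, not the real dataset).
demo = pd.DataFrame({
    "ph": [7.0, np.nan, 8.1, 6.5],
    "Sulfate": [np.nan, np.nan, 356.9, 310.1],
    "Hardness": [204.9, 129.4, 224.2, 214.4],
})
print(missing_summary(demo))
```

Sorting by the number of missing values puts the columns that need the most attention during imputation at the top.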

In [9]:
df.duplicated().sum()

0

_This dataset does not contain any duplicate records._

## 5.3 Unique Values <a id=5.3></a>

In [11]:
pd.DataFrame(df.nunique(), columns=['Unique Values'])

| | Unique Values |
|---|---|
| ph | 2785 |
| Hardness | 3276 |
| Solids | 3276 |
| Chloramines | 3276 |
| Sulfate | 2495 |
| Conductivity | 3276 |
| Organic_carbon | 3276 |
| Trihalomethanes | 3114 |
| Turbidity | 3276 |
| Potability | 2 |


In [15]:
pd.DataFrame(df.Potability.value_counts())

| Potability | count |
|---|---|
| 0 | 1998 |
| 1 | 1278 |
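
Since the two classes are not evenly represented, it can help to express the counts as percentages before choosing evaluation metrics. The sketch below is one possible way to do that: `class_balance` is a hypothetical helper, and the `labels` series is rebuilt from the counts reported above rather than read from the CSV. In the notebook the call would be `class_balance(df["Potability"])`.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return the count and percentage share of each class label."""
    counts = labels.value_counts().sort_index()
    share = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "percent": share})


# Labels rebuilt from the reported counts: 1,998 zeros and 1,278 ones.
labels = pd.Series([0] * 1998 + [1] * 1278, name="Potability")
balance = class_balance(labels)
print(balance)
```

A split of roughly 61% to 39% is a moderate imbalance, so accuracy alone may be misleading when the classifier is evaluated later.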


## 5.4 Statistical Summary <a id=5.4></a>

In [16]:
df.describe(include='all').T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ph | 2785.0 | 7.080795 | 1.59432 | 0.0 | 6.093092 | 7.036752 | 8.062066 | 14.0 |
| Hardness | 3276.0 | 196.369496 | 32.879761 | 47.432 | 176.850538 | 196.967627 | 216.667456 | 323.124 |
| Solids | 3276.0 | 22014.092526 | 8768.570828 | 320.942611 | 15666.690297 | 20927.833607 | 27332.762127 | 61227.196008 |
| Chloramines | 3276.0 | 7.122277 | 1.583085 | 0.352 | 6.127421 | 7.130299 | 8.114887 | 13.127 |
| Sulfate | 2495.0 | 333.775777 | 41.41684 | 129.0 | 307.699498 | 333.073546 | 359.95017 | 481.030642 |
| Conductivity | 3276.0 | 426.205111 | 80.824064 | 181.483754 | 365.734414 | 421.884968 | 481.792304 | 753.34262 |
| Organic_carbon | 3276.0 | 14.28497 | 3.308162 | 2.2 | 12.065801 | 14.218338 | 16.557652 | 28.3 |
| Trihalomethanes | 3114.0 | 66.396293 | 16.175008 | 0.738 | 55.844536 | 66.622485 | 77.337473 | 124.0 |
| Turbidity | 3276.0 | 3.966786 | 0.780382 | 1.45 | 3.439711 | 3.955028 | 4.50032 | 6.739 |
| Potability | 3276.0 | 0.39011 | 0.487849 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


### 5.4.1 Analysis Output <a id=5.4.1></a>
#### 1. **ph**
- The minimum of 0.00 and the maximum of 14.00 sit at the limits of the pH scale. Values this far from the typical 6-9 range of natural waters suggest data entry errors or extreme outliers.
- The mean (7.08) and median (7.04) are close, indicating a roughly symmetric distribution.
- The standard deviation of 1.59 means readings commonly differ from the mean by more than one pH unit, which is substantial on a logarithmic scale.
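
One way to act on these observations is to label each pH reading before deciding how to treat it. The sketch below assumes the 6-9 range mentioned above as the "typical" band; `flag_ph` and its `low`/`high` parameters are hypothetical names, and the demo values are illustrative rather than taken from the dataset.

```python
import numpy as np
import pandas as pd


def flag_ph(ph: pd.Series, low: float = 6.0, high: float = 9.0) -> pd.DataFrame:
    """Label each pH reading as typical, atypical, invalid, or missing."""
    flags = pd.Series("typical", index=ph.index)
    # Inside the 0-14 scale but outside the assumed typical band.
    flags[(ph < low) | (ph > high)] = "atypical"
    # Outside the 0-14 scale entirely.
    flags[(ph < 0) | (ph > 14)] = "invalid"
    # Comparisons with NaN are False, so missing readings are labeled last.
    flags[ph.isna()] = "missing"
    return pd.DataFrame({"ph": ph, "flag": flags})


# Illustrative readings only (not rows from the dataset).
demo_ph = pd.Series([0.0, 3.72, 7.04, 8.32, 14.0, 15.5, np.nan])
print(flag_ph(demo_ph))
```

In the notebook, `flag_ph(df["ph"])["flag"].value_counts()` would show how many readings fall into each group.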

#### 2. **Hardness**
- The mean (196.37) and median (196.97) nearly coincide and the quartiles are evenly spaced around them, which is consistent with a roughly symmetric distribution.
- The min and max values do not indicate extreme outliers.

#### 3. **Solids**
- High standard deviation indicates large variability.
- The minimum value is extremely low compared to the median and quartiles, suggesting outliers.
- The max value also indicates potential outliers.

#### 4. **Chloramines**
- The min value is quite low, possibly indicating outliers or measurement issues.
- The distribution is fairly symmetric, but the high max value suggests outliers.

#### 5. **Sulfate**
- The distribution appears to be symmetric.
- The min value is significantly lower than the 25th percentile, indicating potential outliers.

#### 6. **Conductivity**
- The mean (426.21) and median (421.88) are close, which is consistent with a roughly symmetric distribution.
- Min and max values suggest potential outliers, but they are not extreme.

#### 7. **Organic_carbon**
- The min value suggests potential outliers.
- The mean (14.28) sits slightly above the median (14.22), suggesting a roughly symmetric distribution with a slight right skew.

#### 8. **Trihalomethanes**
- The min value is very low, suggesting potential outliers.
- The mean (66.40) and median (66.62) are close, so the distribution appears roughly symmetric.

#### 9. **Turbidity**
- The min and max values suggest the presence of outliers.
- The distribution appears symmetric.

#### 10. **Potability**
- This binary variable (0 or 1) indicates water potability.
- The mean value suggests that approximately 39% of the samples are potable.

### Summary
- **Outliers**: Potential outliers are indicated in the `ph`, `Solids`, `Chloramines`, `Sulfate`, `Organic_carbon`, `Trihalomethanes`, and `Turbidity` columns.
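- **Distributions**: For most variables the mean and median are close, which is consistent with roughly symmetric distributions with some skewness and outliers. Summary statistics alone cannot confirm normality; histograms or a formal test would be needed.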
- **Missing Values**: The `ph`, `Sulfate`, and `Trihalomethanes` columns have missing values that need to be addressed.
- **Variability**: `Solids` shows the highest variability (a standard deviation of about 8,769, roughly 40% of its mean). `Conductivity` and `Organic_carbon` also span wide ranges.
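
The outlier, skewness, and variability remarks above are read from `describe()` by eye. They can be quantified per column with a short helper. The sketch below uses Tukey's 1.5 × IQR rule for outliers, which is an assumption on my part and not a method stated in the notebook; `distribution_checks` is a hypothetical name and the demo frame contains made-up numbers.

```python
import pandas as pd


def distribution_checks(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column IQR fences, outlier counts, skewness, and coefficient of variation."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    # Compare every value with its own column's fences; NaN compares as False.
    outside = numeric.lt(lower, axis="columns") | numeric.gt(upper, axis="columns")
    return pd.DataFrame({
        "lower_fence": lower,
        "upper_fence": upper,
        "n_outliers": outside.sum(),
        "skew": numeric.skew(),
        "cv": numeric.std() / numeric.mean(),
    })


# Made-up demo data: column "a" has one extreme value, column "b" is symmetric.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})
checks = distribution_checks(demo)
print(checks)
```

In the notebook this could be run as `distribution_checks(df.drop(columns="Potability"))`, since fences and skewness are not meaningful for the binary target. The coefficient of variation (`cv`) puts columns with very different units on a comparable scale.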