# Data preprocessing using the Water Potability dataset

The steps followed are:
* Getting the dataset
* Importing the libraries
* Importing the dataset
* Cleaning the dataset (filling the missing values, if present)
* Encoding categorical data, if any
* Splitting the dataset into training and testing sets
* Performing feature scaling so that no variable dominates the others merely because of its scale of measurement

## Importing the required libraries

In [1]:
import numpy as np
import pandas as pd

## Importing the dataset

In [2]:
water_potability = pd.read_csv('water_potability.csv')
water_potability.head()

| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.99097 | 2.963135 | 0 |
| 1 | 3.71608 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.5466 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |
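
The preview above has missing entries in `ph` and `Sulfate`. Before choosing a cleaning strategy, it can help to count the missing values in each column. The following is a minimal sketch that rebuilds three rows and four columns of the preview by hand (the `preview` frame is an illustration only, not the full dataset):

```python
import numpy as np
import pandas as pd

# Three rows and four columns copied from the preview; missing cells are NaN.
preview = pd.DataFrame({
    "ph": [np.nan, 3.71608, 8.099124],
    "Hardness": [204.890455, 129.422921, 224.236259],
    "Sulfate": [368.516441, np.nan, np.nan],
    "Potability": [0, 0, 0],
})

# isna() marks every missing cell; sum() counts them column by column.
missing_per_column = preview.isna().sum()
print(missing_per_column)
```

Calling `water_potability.isna().sum()` in the same way should give the per-column counts for the whole dataset.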


## Understanding the variables
The dependent variable is `Potability`, which indicates whether or not the water is safe for human consumption. The independent variables are `ph`, `Hardness`, `Solids`, `Chloramines`, `Sulfate`, `Conductivity`, `Organic_carbon`, `Trihalomethanes` and `Turbidity`.

In [4]:
X = water_potability.iloc[:, :-1].values
y = water_potability.iloc[:, -1].values

In [5]:
X

array([[           nan, 2.04890455e+02, 2.07913190e+04, ...,
        1.03797831e+01, 8.69909705e+01, 2.96313538e+00],
       [3.71608008e+00, 1.29422921e+02, 1.86300579e+04, ...,
        1.51800131e+01, 5.63290763e+01, 4.50065627e+00],
       [8.09912419e+00, 2.24236259e+02, 1.99095417e+04, ...,
        1.68686369e+01, 6.64200925e+01, 3.05593375e+00],
       ...,
       [9.41951032e+00, 1.75762646e+02, 3.31555782e+04, ...,
        1.10390697e+01, 6.98454003e+01, 3.29887550e+00],
       [5.12676292e+00, 2.30603758e+02, 1.19838694e+04, ...,
        1.11689462e+01, 7.74882131e+01, 4.70865847e+00],
       [7.87467136e+00, 1.95102299e+02, 1.74041771e+04, ...,
        1.61403676e+01, 7.86984463e+01, 2.30914906e+00]])

In [6]:
y

array([0, 0, 0, ..., 1, 1, 1], dtype=int64)

## Cleaning the data: filling the missing values

In [7]:
from sklearn.impute import SimpleImputer
# Replace every NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imputer.fit_transform(X)

In [8]:
X

array([[7.08079450e+00, 2.04890455e+02, 2.07913190e+04, ...,
        1.03797831e+01, 8.69909705e+01, 2.96313538e+00],
       [3.71608008e+00, 1.29422921e+02, 1.86300579e+04, ...,
        1.51800131e+01, 5.63290763e+01, 4.50065627e+00],
       [8.09912419e+00, 2.24236259e+02, 1.99095417e+04, ...,
        1.68686369e+01, 6.64200925e+01, 3.05593375e+00],
       ...,
       [9.41951032e+00, 1.75762646e+02, 3.31555782e+04, ...,
        1.10390697e+01, 6.98454003e+01, 3.29887550e+00],
       [5.12676292e+00, 2.30603758e+02, 1.19838694e+04, ...,
        1.11689462e+01, 7.74882131e+01, 4.70865847e+00],
       [7.87467136e+00, 1.95102299e+02, 1.74041771e+04, ...,
        1.61403676e+01, 7.86984463e+01, 2.30914906e+00]])

## Encoding categorical data
Since there are no categorical variables among the independent variables of the dataset, encoding is not necessary in this project.
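
One way to confirm that a dataset has no categorical columns is to list the columns whose dtype is not numeric. The sketch below uses a hypothetical frame with an invented text column `source` (not part of the water potability data) so that the check has something to find:

```python
import pandas as pd

frame = pd.DataFrame({
    "ph": [7.1, 6.8],
    "Hardness": [204.9, 129.4],
    "source": ["well", "river"],  # hypothetical text column for illustration
})

# Columns that are not numeric would need encoding before modelling.
categorical_columns = frame.select_dtypes(exclude="number").columns.tolist()
print(categorical_columns)
```

An empty list from the same check on `water_potability` would support skipping the encoding step.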

## Splitting the dataset into training and testing sets
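The data is split in a 7:3 ratio: 70% for training and 30% for testing (`test_size=0.3`). Setting `random_state=0` makes the split reproducible.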

In [9]:
from sklearn.model_selection import train_test_split as tts
X_train, X_test, y_train, y_test = tts(X, y, test_size=0.3, random_state=0)
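
To illustrate what `test_size=0.3` does, here is a sketch on hypothetical data (`X_demo` and `y_demo` are made up). It also passes `stratify`, an optional argument that the cell above does not use; it keeps the class proportions similar in both parts and may be worth considering if the classes are imbalanced:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 2 features, 12 of class 0 and 8 of class 1.
X_demo = np.arange(40, dtype=float).reshape(20, 2)
y_demo = np.array([0] * 12 + [1] * 8)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# 30% of 20 samples go to the test set, the remaining 70% to training.
print(X_tr.shape, X_te.shape)
# Share of class 1 in each part.
print(y_tr.mean(), y_te.mean())
```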

In [10]:
X_train

array([[7.08079450e+00, 2.15997117e+02, 3.59710251e+04, ...,
        1.43378044e+01, 8.17697753e+01, 2.93413700e+00],
       [9.13079590e+00, 2.00032348e+02, 2.82736032e+04, ...,
        1.28605139e+01, 6.41784939e+01, 3.02570737e+00],
       [7.27714397e+00, 1.94880861e+02, 1.82701051e+04, ...,
        1.54597516e+01, 7.69872315e+01, 4.93135352e+00],
       ...,
       [6.64800546e+00, 1.91841801e+02, 1.51762907e+04, ...,
        1.54382868e+01, 5.65323872e+01, 3.82978355e+00],
       [7.67591362e+00, 2.33300759e+02, 2.36731006e+04, ...,
        1.84594078e+01, 6.09935904e+01, 5.04046081e+00],
       [7.08079450e+00, 1.60915815e+02, 1.39432450e+04, ...,
        1.52086911e+01, 7.55750556e+01, 4.14155231e+00]])

In [11]:
X_test

array([[8.11195317e+00, 2.17266472e+02, 3.81844696e+04, ...,
        1.30279207e+01, 7.85820939e+01, 4.43075046e+00],
       [6.76806005e+00, 1.79805992e+02, 2.37930314e+04, ...,
        1.35573812e+01, 6.05712410e+01, 4.14580670e+00],
       [7.08079450e+00, 1.80893036e+02, 1.77056086e+04, ...,
        1.04610247e+01, 3.20748635e+01, 3.99912550e+00],
       ...,
       [7.08079450e+00, 1.93116794e+02, 1.73889826e+04, ...,
        1.62330386e+01, 6.78739099e+01, 4.21774374e+00],
       [5.96727423e+00, 1.87085084e+02, 3.08465855e+04, ...,
        1.26518540e+01, 7.22735904e+01, 3.91454361e+00],
       [4.13704491e+00, 1.16338278e+02, 1.71019516e+04, ...,
        7.84831295e+00, 7.09770011e+01, 4.32735603e+00]])

In [12]:
y_train

array([0, 0, 0, ..., 0, 0, 1], dtype=int64)

In [13]:
y_test

array([1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1,
       0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1,
       1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0,
       0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,

## Feature scaling
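Standardization rescales each feature to zero mean and unit variance, so that features measured on a much larger scale (such as `Solids`) do not dominate the others. The scaler is fitted on the training set only, and the same transformation is then applied to the test set.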

In [14]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
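
As a sanity check on what `StandardScaler` computes, the following sketch uses randomly generated data (the arrays `train` and `test` are hypothetical, with two features on very different scales) and compares the scaler's output with the formula `(x - mean) / std` applied by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features on very different scales.
train = rng.normal(loc=[7.0, 20000.0], scale=[1.5, 8000.0], size=(200, 2))
test = rng.normal(loc=[7.0, 20000.0], scale=[1.5, 8000.0], size=(50, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# The same result computed by hand from the training mean and std.
manual = (train - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(train_scaled, manual))
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

The scaled test set will generally not have exactly zero mean and unit variance, because it is transformed with statistics learned from the training set.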

In [15]:
X_train

array([[-7.24509827e-04,  5.90192457e-01,  1.61224883e+00, ...,
         9.02966016e-03,  9.60086632e-01, -1.35673034e+00],
       [ 1.40513766e+00,  1.00239795e-01,  7.30973920e-01, ...,
        -4.34286028e-01, -1.49206437e-01, -1.23763778e+00],
       [ 1.33929195e-01, -5.78573867e-02, -4.14322787e-01, ...,
         3.45711453e-01,  6.58502994e-01,  1.24076541e+00],
       ...,
       [-2.97525160e-01, -1.51124968e-01, -7.68532421e-01, ...,
         3.39270136e-01, -6.31364219e-01, -1.91890130e-01],
       [ 4.07399825e-01,  1.12123464e+00,  2.04264120e-01, ...,
         1.24586933e+00, -3.50044075e-01,  1.38266576e+00],
       [-7.24509827e-04, -1.10023172e+00, -9.09703355e-01, ...,
         2.70371455e-01,  5.69452235e-01,  2.13583196e-01]])

In [16]:
X_test

array([[ 7.06429625e-01,  6.29148504e-01,  1.86566524e+00, ...,
        -3.84049440e-01,  7.59073823e-01,  5.89702129e-01],
       [-2.15193408e-01, -5.20499365e-01,  2.17994947e-01, ...,
        -2.25165204e-01, -3.76677111e-01,  2.19116274e-01],
       [-7.24509827e-04, -4.87138419e-01, -4.78951771e-01, ...,
        -1.15434160e+00, -2.17363740e+00,  2.83488771e-02],
       ...,
       [-7.24509827e-04, -1.11995936e-01, -5.15202167e-01, ...,
         5.77764829e-01,  8.38237344e-02,  3.12674565e-01],
       [-7.64361081e-01, -2.97106816e-01,  1.02555368e+00, ...,
        -4.96902165e-01,  3.61264308e-01, -8.16547673e-02],
       [-2.01950665e+00, -2.46829935e+00, -5.48064237e-01, ...,
        -1.93838246e+00,  2.79502351e-01,  4.55231686e-01]])
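
In this notebook the missing values are imputed before the split, so the column means are computed from all rows, including those that later form the test set. A possible variation, sketched here on hypothetical random data rather than the notebook's own method, is to split first and chain the imputer and scaler in a scikit-learn `Pipeline`, so that both learn their statistics from the training rows only:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: 100 samples, 3 features, roughly 10% of cells missing.
X_demo = rng.normal(size=(100, 3))
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Means and standard deviations are learned from the training rows only.
X_tr_ready = preprocess.fit_transform(X_tr)
X_te_ready = preprocess.transform(X_te)
print(X_tr_ready.shape, X_te_ready.shape)
```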