# Determining Water Potability

**Introduction**

Potable water is critical for the populations, industries, and agricultural activities that rely on clean water. Whether water is safe to drink can be assessed by carefully analysing certain characteristics of a sample. Using classification, we hope to train a model on a data set of water samples so that it can classify water as safe or unsafe based on those characteristics. The question we ask is: is a given water sample potable, based on its measured characteristics?
The data set we will be using is ‘Water Quality and Potability’. Each observation has a pH value, hardness, total dissolved solids, chloramines level, sulfate level, conductivity, organic carbon level, trihalomethanes level, turbidity, and a potability label. With these values we hope to train a classifier that predicts potability accurately.


**The Dataset**

In [7]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
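# local helper scripts; source() stops with an error if they are missing from the working directory
source("tests.R")
source("cleanup.R")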


── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.2

ERROR: Error in file(filename, "r", encoding = encoding): cannot open the connection


In [8]:
# read the data set from the web
url <- "https://raw.githubusercontent.com/nori-2004/data-science-group-project/main/water_potability.csv"
water <- read_csv(url)

# convert the Potability column (0/1) to a factor so it can be used as a class label
water <- water |>
    mutate(Potability = as_factor(Potability))

# keep every column except Conductivity, which we do not use as a predictor
water_selected <- select(water, -Conductivity)
water_selected


Rows: 3276 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (10): ph, Hardness, Solids, Chloramines, Sulfate, Conductivity, Organic_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| ph | Hardness | Solids | Chloramines | Sulfate | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| NA | 204.8905 | 20791.32 | 7.300212 | 368.5164 | 10.37978 | 86.99097 | 2.963135 | 0 |
| 3.716080 | 129.4229 | 18630.06 | 6.635246 | NA | 15.18001 | 56.32908 | 4.500656 | 0 |
| 8.099124 | 224.2363 | 19909.54 | 9.275884 | NA | 16.86864 | 66.42009 | 3.055934 | 0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 9.419510 | 175.7626 | 33155.58 | 7.350233 | NA | 11.03907 | 69.84540 | 3.298875 | 1 |
| 5.126763 | 230.6038 | 11983.87 | 6.303357 | NA | 11.16895 | 77.48821 | 4.708658 | 1 |
| 7.874671 | 195.1023 | 17404.18 | 7.509306 | NA | 16.14037 | 78.69845 | 2.309149 | 1 |


In [9]:
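# fix the random seed so the split is reproducible (the seed value itself is arbitrary)
set.seed(2023)

# split the data into training (75%) and testing (25%) sets
water_split <- initial_split(water_selected, prop = 0.75)
water_train <- training(water_split)
water_test <- testing(water_split)
water_train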

| ph | Hardness | Solids | Chloramines | Sulfate | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| NA | 181.3897 | 18908.63 | 7.215350 | 325.9008 | 14.78284 | 93.44067 | 4.013477 | 0 |
| 8.389946 | 213.6657 | 17674.71 | 8.348815 | NA | 19.30125 | 70.82951 | 3.569717 | 0 |
| NA | 202.2548 | 11981.74 | 9.189106 | 339.9839 | 14.07923 | 62.76553 | 2.678911 | 0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| NA | 253.0544 | 28864.91 | 8.524371 | 364.7495 | 15.82307 | 70.75819 | 4.697041 | 0 |
| 4.993531 | 242.6205 | 13900.78 | 6.467744 | NA | 17.27466 | 61.24579 | 3.153809 | 1 |
| NA | 139.1697 | 33784.83 | 9.640520 | 275.3320 | 13.66449 | 70.36836 | 4.678745 | 1 |


**Methodology**

The problem that we have selected is a classification problem: classifying a sample of water as potable or non-potable. We will use K-nearest neighbours classification to do so. Of the 9 candidate predictor variables in our dataset, we will use all except conductivity, leaving 8 predictors. First, we shall prepare the data by tidying it and splitting it into training and testing sets. Then we will use cross-validation on the training set to pick the best “k” value, train the model on the training set with that value, and apply it to the testing set. Choosing “k” this way should give us a classifier that performs well on unseen data rather than one that is merely fitted to the training set. Although it would be ideal to have both high precision and high recall, we shall aim for high recall with not-potable as the “positive” class, because labelling unsafe water as potable is the more costly mistake.
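
The tuning step described above could be sketched roughly as follows. This is only an illustration under several assumptions: the helper name `tune_knn_k`, the 5 folds, the grid of “k” values, and the seed are all placeholders, the `kknn` engine package must be installed, and rows with missing values are simply dropped to keep the sketch short (imputation would be another option).

```r
library(tidymodels)

# Hypothetical helper: cross-validate a K-NN classifier over several values of k.
# `train_data` is assumed to contain a factor column `Potability` plus numeric predictors.
tune_knn_k <- function(train_data, k_values = seq(1, 51, by = 5), folds = 5) {
  # K-NN cannot use rows with missing predictors; dropping them is a simplification
  complete_data <- drop_na(train_data)

  # standardize predictors, since K-NN relies on distances between points
  knn_recipe <- recipe(Potability ~ ., data = complete_data) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

  # neighbors = tune() marks k as the parameter to be tuned
  knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

  # stratify the folds so each keeps a similar potable / not-potable mix
  cv_folds <- vfold_cv(complete_data, v = folds, strata = Potability)

  # recall uses the first factor level as the "positive" class by default,
  # which is 0 (not potable) if the levels are ordered 0, 1
  workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec) |>
    tune_grid(
      resamples = cv_folds,
      grid = tibble(neighbors = k_values),
      metrics = metric_set(accuracy, recall)
    ) |>
    collect_metrics()
}

# Example usage with the training set created earlier:
# set.seed(2023)  # arbitrary seed so the folds are reproducible
# cv_results <- tune_knn_k(water_train)
# cv_results |> filter(.metric == "recall") |> slice_max(mean, n = 1, with_ties = FALSE)
```

The result has one row per “k” value and metric, so accuracy and recall can be compared side by side before a final “k” is chosen.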


The best way to visualize our results in this case is to create several scatterplots using pairs of variables; this will also highlight any hidden correlations between them. We will experiment with our visuals to avoid overplotting, since there are over 3,000 observations in our data set. This will be done by reducing either the size or the opacity of the points.
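
A minimal sketch of one such scatterplot is shown below. The helper name `plot_pair` and the default size and opacity values are assumptions chosen for illustration; they would need adjusting once we see the actual plots.

```r
library(ggplot2)

# Hypothetical helper: scatterplot of two predictors coloured by potability,
# drawn with small, semi-transparent points to reduce overplotting.
plot_pair <- function(data, x_var, y_var, point_alpha = 0.3, point_size = 0.8) {
  ggplot(data, aes(x = .data[[x_var]], y = .data[[y_var]], colour = Potability)) +
    # na.rm = TRUE silently skips rows where either variable is missing
    geom_point(alpha = point_alpha, size = point_size, na.rm = TRUE) +
    labs(x = x_var, y = y_var, colour = "Potability")
}

# Example usage with the training set created earlier:
# plot_pair(water_train, "ph", "Hardness")
```

Lower values of `point_alpha` make dense regions easier to read, at the cost of making isolated points fainter.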

