## Length of the report {-}
The length of the report must be no more than 15 pages, when printed as PDF. However, there is no requirement on the minimum number of pages.

You may put additional stuff as Appendix. You may refer to the Appendix in the main report to support your arguments. However, your appendix is unlikely to be checked while grading, unless the grader deems it necessary. The appendix, references, and information about GitHub and individual contribution will not be included in the page count, and there is no limit on the length of the appendix.

**Delete this section from the report, when using this template.** 

## Code should be put separately in the code template {-}
Your report should be in a research-paper like style. If there is something that can only be explained by showing the code, then you may put it, otherwise do not put the code in the report. We will check your code in the code template. 

**Delete this section from the report, when using this template.** 

## Background / Motivation

What motivated you to work on this problem?

Mention any background about the problem, if it is required to understand your analysis later on.

Water is vital to the survival of all kinds of life. In particular, humans need a clean and safe (that is, potable) supply of drinking water to stay healthy. Drinking unpotable water can cause health problems such as gastrointestinal illness, reproductive problems, and neurological disorders. Because water safety is so important for the health of the general population, existing potable supplies must be protected through monitoring, and new potable supplies must be identified through testing for communities that lack them. The quality of the water in any given supply can be affected by various factors, including sedimentation, polluted runoff, and improperly maintained pipes. These factors change chemical characteristics of the water such as its pH, the concentration of dissolved solids, and the concentration of chemicals like sulfates or trihalomethanes. Understanding how these characteristics relate to potability would make it easier to predict whether a given water sample is potable, and in turn easier to manage water supplies and ensure water safety.

## Problem statement 

Describe your problem statement. Articulate your objectives using absolutely no jargon. Interpret the problem as inference and/or prediction.

We are attempting to determine how different chemical characteristics of water can affect the potability (whether it is safe to drink or not). 

This is both an inference and a prediction problem. It is important both to understand how chemical characteristics relate to water potability (inference) and to be able to predict whether water will be safe to drink based on those characteristics (prediction).

## Data sources
What data did you use? Provide details about your data. Include links to data if you are using open-access data.

We used an open-access dataset on water potability from Kaggle. 
https://www.kaggle.com/datasets/artimule/drinking-water-probability

The dataset contains information on several chemical characteristics of water samples and the classification of each water sample as potable or not. The included characteristics are: pH, Hardness, Solids, Chloramines, Sulfates, Conductivity, Organic carbon, Trihalomethanes, and Turbidity.

From Kaggle, the variables are defined as follows:

Potability: Indicates if water is safe to drink where 1 means safe to drink and 0 means not safe.

pH value: PH is an important parameter in evaluating the acid-base balance of water. It is also the indicator of the acidic or alkaline condition of water status. WHO has recommended the maximum permissible limit of pH from 6.5 to 8.5. The current investigation ranges were 6.52–6.83 which are in the range of WHO standards.

Hardness: Hardness is mainly caused by calcium and magnesium salts. These salts are dissolved from geologic deposits through which water travels. The length of time water is in contact with hardness-producing material helps determine how much hardness there is in raw water. Hardness was originally defined as the capacity of water to precipitate soap caused by Calcium and Magnesium.

Solids: Water has the ability to dissolve a wide range of inorganic and some organic minerals or salts such as potassium, calcium, sodium, bicarbonates, chlorides, magnesium, sulfates, etc. These minerals produced an unwanted taste and diluted color in the appearance of water. This is the important parameter for the use of water. The water with a high TDS value indicates that water is highly mineralized. The desirable limit for TDS is 500 mg/l and the maximum limit is 1000 mg/l which is prescribed for drinking purposes.

Chloramines:
Chlorine and chloramine are the major disinfectants used in public water systems. Chloramines are most commonly formed when ammonia is added to chlorine to treat drinking water. Chlorine levels up to 4 milligrams per liter (mg/L), or 4 parts per million (ppm), are considered safe in drinking water.

Sulfate:
Sulfates are naturally occurring substances that are found in minerals, soil, and rocks. They are present in ambient air, groundwater, plants, and food. The principal commercial use of sulfate is in the chemical industry. Sulfate concentration in seawater is about 2,700 milligrams per liter (mg/L). It ranges from 3 to 30 mg/L in most freshwater supplies, although much higher concentrations (1000 mg/L) are found in some geographic locations.

Conductivity:
Pure water is not a good conductor of electric current; rather, it is a good insulator. An increase in ion concentration enhances the electrical conductivity of water. Generally, the amount of dissolved solids in water determines electrical conductivity. Electrical conductivity (EC) actually measures the ionic process of a solution that enables it to transmit current. According to WHO standards, EC value should not exceed 400 μS/cm.

Organic_carbon:
Total Organic Carbon (TOC) in source waters comes from decaying natural organic matter (NOM) as well as synthetic sources. TOC is a measure of the total amount of carbon in organic compounds in pure water. According to the US EPA, TOC should be < 2 mg/L in treated / drinking water and < 4 mg/L in source water which is used for treatment.

Trihalomethanes:
THMs are chemicals that may be found in water treated with chlorine. The concentration of THMs in drinking water varies according to the level of organic material in the water, the amount of chlorine required to treat the water, and the temperature of the water that is being treated. THM levels up to 80 ppm are considered safe in drinking water.

Turbidity:
The turbidity of water depends on the quantity of solid matter present in the suspended state. It is a measure of the light-emitting properties of water and the test is used to indicate the quality of waste discharge with respect to the colloidal matter. The mean turbidity value obtained for Wondo Genet Campus (0.98 NTU) is lower than the WHO recommended value of 5.00 NTU.

## Stakeholders
Who cares? If you are successful, what difference will it make to them?

We are considering three main categories of stakeholders for this project. First, governmental and environmental regulatory organizations would be interested in the results of our project. They are responsible for providing and conserving potable water supplies for the general population. Because of their status as governmental organizations, it is a requirement of their job that they succeed in improving and preserving water potability or else risk the safety of many citizens. Understanding what characteristics of water most heavily influence water potability would be beneficial to them as it will let them know what qualities of the water and the environment of the water supply they should focus research and conservation efforts on in order to preserve the potability of the water. 

Other stakeholders include biotechnology research companies that are developing tests for water potability. They would also benefit from understanding what qualities of water are most relevant to the potability of water as they would know what characteristics to base their tests on for the highest accuracy of the test. Similarly, if our model meets our standards for classification accuracy and FPR, it would suggest that our method of prediction based on chemical characteristics is highly successful. Tests for water potability would be able to draw directly from that prediction potential in order to classify water as potable or not. With a more accurate test thanks to our model, they may experience higher demand and may thus benefit financially.

In our list of stakeholders, we would also like to include people who live in areas where water potability is an issue. Although they may not have the scientific equipment to directly predict whether a water source is potable, this project will benefit them as they can use our model to understand how certain characteristics of their environment that are affecting the water will change whether or not it is safe to drink. For example, if they can see that the water has lots of suspended solids and it can conduct electricity, those are two characteristics that may be included in the model and can be used to make an inference about the potability of their water. In the end, understanding how their water potability is affected by different chemicals can improve their safety and their health.

## Data quality check / cleaning / preparation 

In a tabular form, show the distribution of values of each variable used in the analysis - for both categorical and continuous variables. Distribution of a categorical variable must include the number of missing values, the number of unique values, the frequency of all its levels. If a categorical variable has too many levels, you may just include the counts of the top 3-5 levels. 

If the tables in this section take too much space, you may put them in the appendix, and just mention any useful insights you obtained from the data quality check that helped you develop the model or helped you realize the necessary data cleaning / preparation.

Were there any potentially incorrect values of variables that required cleaning? If yes, how did you clean them? 

Did you do any data wrangling or data preparation before the data was ready to use for model development? Did you create any new predictors from existing predictors? For example, if you have number of transactions and spend in a credit card dataset, you may create spend per transaction for predicting if a customer pays their credit card bill. Mention the steps at a broad level, you may put minor details in the appendix. Only mention the steps that ended up being useful towards developing your final model(s).

The dataset contains 10 variables: 9 continuous variables that are possible predictors and 1 discrete variable, the target variable potability, which is either 0 (not potable) or 1 (potable). All continuous variables are approximately normally distributed; their histograms are shown in the appendix. Only three columns contained missing values: ph, chloramines, and trihalomethanes. However, each of these columns had many missing values: 491, 781, and 162 respectively. Since there were over 3200 rows initially and still over 2000 rows after dropping every row with a missing value, we dropped those rows, judging that the remaining dataset was still large enough to limit the risk of overfitting. In the cleaned data, potability contains 1200 zeros (59.7%) and 811 ones (40.3%). After removing missing values, no column needed further cleaning since all were numerical, and no additional predictors were created from existing predictors.
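The cleaning steps described above (count missing values per column, drop incomplete rows, then check the class balance) can be sketched as follows. This is a minimal illustration using only the Python standard library; the five-row sample and the helper names are made up for the example and are not the project's actual data or code.

```python
import csv
import io
from collections import Counter

# Tiny made-up sample shaped like the dataset (values are illustrative only).
SAMPLE_CSV = """ph,Chloramines,Trihalomethanes,Potability
7.1,6.5,70.2,0
,7.3,66.4,1
6.8,,80.1,0
8.0,5.9,,1
7.4,8.1,55.0,1
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, mapping empty cells to None."""
    reader = csv.DictReader(io.StringIO(text))
    return [{k: (v if v != "" else None) for k, v in row.items()} for row in reader]

def missing_counts(rows):
    """Count missing values in each column."""
    counts = Counter()
    for row in rows:
        for col, val in row.items():
            counts[col] += val is None
    return dict(counts)

def drop_incomplete(rows):
    """Keep only the rows that have no missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def class_shares(rows, target="Potability"):
    """Share of rows in each class of the target variable."""
    counts = Counter(r[target] for r in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

rows = load_rows(SAMPLE_CSV)
complete = drop_incomplete(rows)
print(missing_counts(rows))
print(len(rows), "rows before,", len(complete), "rows after dropping")
print(class_shares(complete))
```

With pandas, the same checks are usually written with `df.isna().sum()`, `df.dropna()`, and `value_counts(normalize=True)` on the target column.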

## Exploratory data analysis

Put the relevant EDA here (visualizations, tables, etc.) that helped you figure out useful predictors for developing the model(s). Only put the EDA that ended up being useful towards developing your final model(s). 

List the insights (as bullet points) you got from EDA that ended up being useful towards developing your final model. Also add info about null model.

Again, if there are too many plots / tables, you may put them into appendix, and just mention the insights you got from them.

The plots demonstrating the relationships between each predictor and the target variable as well as the relationships between predictors can be seen in the Appendix.

Relationships between explanatory variables and potability:
 - Boxplots of each explanatory variable against potability show minimal relationship with potability: there was very little difference between the boxplots for water that was safe to drink and water that was not. This lack of relationship suggests that it might not be possible to build a high-performing model for water potability using these chemical components or their transformations as predictors.
 
Relationships between explanatory variables:
 - The scatterplot matrix and the correlation heatmap indicate minimal relationship between the explanatory variables as well. The scatterplot matrix does not show any clear associations between the variables. Further, the heatmap shows almost zero correlation between most pairs of variables, with the highest correlation between two explanatory variables being roughly 0.1. This indicates very little collinearity. We also took this as a sign that interaction terms between the predictors would likely have little effect on model performance, although low pairwise correlation does not by itself rule out interaction effects.
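
The collinearity check behind the heatmap amounts to finding the largest absolute pairwise correlation among the predictors. A minimal sketch of that calculation is shown below; the three short columns are toy numbers chosen for the example, not values from the dataset.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def strongest_pair(columns):
    """Return (|r|, name_a, name_b) for the most strongly correlated pair."""
    return max(
        (abs(pearson(columns[a], columns[b])), a, b)
        for a, b in combinations(columns, 2)
    )

# Toy columns (illustrative only): the first two are perfectly correlated.
toy = {
    "ph": [1.0, 2.0, 3.0, 4.0],
    "Hardness": [2.0, 4.0, 6.0, 8.0],
    "Solids": [1.0, -1.0, -1.0, 1.0],
}
r, a, b = strongest_pair(toy)
print(f"largest |r| = {r:.2f} between {a} and {b}")
```

A largest absolute correlation near 0.1, as reported above, would come out of the same calculation applied to the nine predictor columns.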

## Approach

What kind of a model (linear / logistic / other) did you use? What performance metric(s) did you optimize and why?

Is there anything unorthodox / new in your approach? 

What problems did you anticipate? What problems did you encounter? Did the very first model you tried work? 

Did your problem already have solution(s) (posted on Kaggle or elsewhere). If yes, then how did you build upon those solutions, what did you do differently? Is your model better as compared to those solutions in terms of prediction / inference?

**Important: Mention any code repositories (with citations) or other sources that you used, and specifically what changes you made to them for your project.**

Because the target variable for this problem is binary, we opted to fit a logistic regression model. We focused on two metrics: classification accuracy, which we attempted to maximize, and the false positive rate (FPR), which we attempted to minimize. Classification accuracy is relevant because the end goal for our model is to correctly predict whether or not water is potable based on its chemical characteristics, so a higher classification accuracy means the model predicts potability correctly more often. At the same time, it was important to us to minimize the false positive rate because it is the metric with the highest stakes in this problem. Since drinking unpotable water can have adverse effects on someone's health, it is much preferable that someone avoid drinking water that is potable (a false negative) than drink water that is not potable (a false positive). In addition to these two metrics, for some of the variable selection methods described later in this report we used classification accuracy on test data as well as the AIC and BIC of the model to choose the best variable subset. AIC and BIC penalize the number of predictors and so help guard against overfitting; this ended up being unnecessary, as seen in our ridge regression results, but we wanted to be cautious nonetheless.
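
To make the two metrics concrete, the sketch below computes classification accuracy and FPR from true and predicted labels, treating 1 (potable) as the positive class. The label lists are toy values for illustration only. It also shows why a null model that always predicts "not potable" has an FPR of zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn), treating 1 (potable) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, fp, tn, fn

def accuracy(y_true, y_pred):
    """Share of samples whose predicted label matches the true label."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)

def false_positive_rate(y_true, y_pred):
    """Share of truly non-potable samples that were predicted potable."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return fp / (fp + tn) if (fp + tn) else 0.0

# Toy labels (illustrative only).
y_true = [0, 0, 0, 1, 1]
y_model = [0, 1, 0, 1, 0]
y_null = [0] * len(y_true)  # null model: always predict "not potable"

print(accuracy(y_true, y_model), false_positive_rate(y_true, y_model))
print(accuracy(y_true, y_null), false_positive_rate(y_true, y_null))
```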

Before conducting EDA, we were concerned about possible multicollinearity among the chemical characteristics of the water, such as hardness, conductivity, and solids. Hardness implies dissolved magnesium and calcium salts, which produce electrolyte solutions that conduct electricity, and solids implies other dissolved solids, which would seem to increase along with calcium and magnesium dissolution. We therefore planned to fit a ridge regression to avoid multicollinearity issues. However, as seen in the results of our EDA, none of the predictors seemed to be highly correlated with the target variable or with each other, which raised the concern that the predictors available in the dataset would not produce a good model regardless of how many transformations or interactions we added. The first model we built was simply the model with all of the predictors, which had a classification accuracy of 59.8% and an FPR of 0.25%. The FPR is low, but only because the model was effectively predicting everything as not potable. The accuracy is only 0.1 percentage points better than the null model (59.7%), which was not high enough for our standards, so we proceeded with model development. As discussed in later sections, we tried to work around the issue of uncorrelated predictors by applying several types of variable selection to all predictors, their 2-factor interaction terms, and quadratic and cubic transformations to choose the best possible set. However, none of the resulting models was satisfactory by our standards, and when one metric improved another would worsen, making it difficult to choose which model was best overall.

## Developing the model

Explain the steps taken to develop and improve the base model - informative visualizations / addressing modeling assumption violations / variable transformation / interactions / outlier treatment / influential points treatment / addressing over-fitting / addressing multicollinearity / variable selection - stepwise regression, lasso, ridge regression). 

Did you succeed in achieving your goal, or did you fail? Why?

**Put the final model equation**.

**Important: This section should be rigorous and thorough. Present detailed information about decision you made, why you made them, and any evidence/experimentation to back them up.**

To determine what steps needed to be taken, we first built a baseline model that used each continuous variable as a predictor, with no interaction terms or polynomial transformations. The baseline model was very similar to the naive model, since it predicted almost every value to be zero: only 3 out of 1609 entries of the test dataset were predicted to be potable. The baseline model improved classification accuracy by only 0.1 percentage points over the naive model, so it needed to improve. Based on the EDA, we expected no multicollinearity problems in our model, since the correlations between the variables were all very low (correlation coefficients under 0.12). To check this, we ran a ridge regression using all the predictors, testing values of the tuning parameter alpha from 0 to 1 in increments of 0.01. The best tuning parameter was alpha = 0, which is consistent with there being no multicollinearity problem in our data: the best model was the unpenalized fit.
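
For reference, a common way to write the ridge-penalized logistic regression objective is given below. The exact scaling of the penalty depends on the software used, so this parameterization is an assumption rather than a description of the project's code; here $x_i$ is the vector of predictors for sample $i$ and $y_i$ its potability label.

$$
\hat{\beta}(\alpha) = \arg\min_{\beta_0,\,\beta}\; -\sum_{i=1}^{n}\Big[y_i \log p_i + (1 - y_i)\log(1 - p_i)\Big] + \alpha \sum_{j=1}^{p} \beta_j^2,
\qquad
p_i = \frac{1}{1 + e^{-(\beta_0 + x_i^\top \beta)}}
$$

At $\alpha = 0$ the penalty term vanishes and the estimate reduces to the ordinary (unpenalized) maximum likelihood fit of the logistic model.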

Given that the model was not up to our standards of 95% accuracy and 1% FPR, and given how few conclusions we could draw from EDA, we concluded that our best chance of finding a better model was stepwise regression. We ran forward stepwise selection over all variables and all their 2-factor interaction terms, using classification accuracy, AIC, and BIC as the metrics for choosing the best model. Forward stepwise selection picked the 31-predictor model as best by accuracy, the 40-predictor model as best by AIC, and the 16-predictor model as best by BIC. The 31-predictor model had the maximum classification accuracy of 71.6% and a false positive rate of 8.4%. The 16-predictor model had a lower FPR of 6.8% but also a lower classification accuracy of 69.2%. Both of these models had a higher classification accuracy than the baseline model but also much higher false positive rates.
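
The general shape of forward stepwise selection can be sketched as below. This is a generic greedy version under stated assumptions, not the project's code: `score` stands in for "fit a logistic model on these terms and return a metric to maximize", and the `toy_score` function with its gains is entirely hypothetical. The `aic` and `bic` helpers use the standard definitions, where lower values are better.

```python
from math import log

def aic(log_likelihood, k):
    """Akaike information criterion for a model with k parameters."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion for k parameters and n samples."""
    return k * log(n) - 2 * log_likelihood

def forward_select(candidates, score):
    """Greedily add the term that most improves score; return every step."""
    selected, remaining, path = [], list(candidates), []
    while remaining:
        best = max(remaining, key=lambda term: score(selected + [term]))
        selected = selected + [best]
        remaining.remove(best)
        path.append((list(selected), score(selected)))
    return path

# Hypothetical stand-in for model fitting: each term adds a fixed gain and
# every term costs a small penalty. These numbers are illustrative only.
GAINS = {"ph": 0.05, "Sulfate": 0.03, "ph:Sulfate": 0.04, "Turbidity": 0.0}

def toy_score(terms):
    return 0.6 + sum(GAINS[t] for t in terms) - 0.01 * len(terms)

path = forward_select(GAINS, toy_score)
best_terms, best_score = max(path, key=lambda step: step[1])
print(best_terms, round(best_score, 2))
```

To select by AIC or BIC instead of accuracy, `score` could return the negative of the criterion so that "higher is better" still holds.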

We also ran backward stepwise selection to see if the model would improve. It picked the 22-predictor model as best by accuracy, the 13-predictor model as best by AIC, and the 10-predictor model as best by BIC. The best of these was the one chosen by classification accuracy, with an accuracy of 72.6% and a false positive rate of 7.6%. Its classification accuracy was the highest of any model so far, and its false positive rate was lower than that of the most accurate forward-selection model (8.4%), so we decided to continue with it. To test polynomial transformations, we ran a for loop that added each possible quadratic and cubic transformation to the model obtained from backward stepwise selection. However, none of the transformations tested raised the classification accuracy, so we did not include them. Therefore, even though the model still did not meet our goals for accuracy and FPR, our best model was the one developed from backward stepwise selection with no polynomial transformations.

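The final model equation, written in terms of the log-odds of potability, is:

$$
\begin{aligned}
\log\frac{p}{1-p} ={}& -23.9065 + 2.8165\,\mathrm{ph} + 0.0004\,\mathrm{Solids} - 0.4270\,\mathrm{Chloramines} \\
&+ 0.0842\,\mathrm{Sulfate} - 0.0138\,\mathrm{Conductivity} - 0.0950\,\mathrm{Organic\_carbon} \\
&- 0.0022\,(\mathrm{ph}\cdot\mathrm{Hardness}) + 0.1448\,(\mathrm{ph}\cdot\mathrm{Chloramines}) \\
&- 0.0096\,(\mathrm{ph}\cdot\mathrm{Sulfate}) - 0.0004\,(\mathrm{ph}\cdot\mathrm{Conductivity}) \\
&- 2.95\times10^{-7}\,(\mathrm{Hardness}\cdot\mathrm{Solids}) - 0.0016\,(\mathrm{Hardness}\cdot\mathrm{Chloramines}) \\
&+ 5.571\times10^{-5}\,(\mathrm{Hardness}\cdot\mathrm{Sulfate}) + 3.777\times10^{-5}\,(\mathrm{Hardness}\cdot\mathrm{Conductivity}) \\
&- 1.209\times10^{-5}\,(\mathrm{Solids}\cdot\mathrm{Chloramines}) - 9.902\times10^{-7}\,(\mathrm{Solids}\cdot\mathrm{Sulfate}) \\
&+ 4.434\times10^{-8}\,(\mathrm{Solids}\cdot\mathrm{Conductivity}) - 9.351\times10^{-7}\,(\mathrm{Solids}\cdot\mathrm{Organic\_carbon}) \\
&+ 4.074\times10^{-7}\,(\mathrm{Solids}\cdot\mathrm{Trihalomethanes}) - 0.0001\,(\mathrm{Sulfate}\cdot\mathrm{Trihalomethanes}) \\
&+ 0.0002\,(\mathrm{Conductivity}\cdot\mathrm{Organic\_carbon}) + 6.752\times10^{-5}\,(\mathrm{Conductivity}\cdot\mathrm{Trihalomethanes})
\end{aligned}
$$

Here $p$ is the predicted probability that a water sample is potable, and each product of two variables is a 2-factor interaction term.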
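
A fitted logistic model of this form can be turned into a predicted probability as sketched below. The sketch assumes that the fitted equation gives the log-odds of potability (as is standard for logistic regression) and that each two-variable term is the product of its two variables. The `"a:b"` naming for interaction terms, the helper names, and the small coefficient set are hypothetical and chosen only to keep the example short; they are not the fitted coefficients.

```python
from math import exp

def log_odds(intercept, coefs, sample):
    """Evaluate the linear predictor; a term like "ph:Hardness" is a product."""
    total = intercept
    for term, beta in coefs.items():
        value = 1.0
        for name in term.split(":"):
            value *= sample[name]
        total += beta * value
    return total

def sigmoid(z):
    """Map log-odds to a probability, written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)

def predict_potable(intercept, coefs, sample, threshold=0.5):
    """Classify a sample as potable (1) when its probability reaches threshold."""
    return int(sigmoid(log_odds(intercept, coefs, sample)) >= threshold)

# Hypothetical coefficients and sample, for illustration only.
toy_intercept = -1.0
toy_coefs = {"ph": 0.1, "ph:Hardness": 0.001}
toy_sample = {"ph": 7.0, "Hardness": 200.0}

z = log_odds(toy_intercept, toy_coefs, toy_sample)
print(round(z, 3), round(sigmoid(z), 3), predict_potable(toy_intercept, toy_coefs, toy_sample))
```

Raising `threshold` above 0.5 is one simple way to trade some accuracy for a lower false positive rate.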

## Limitations of the model with regard to inference / prediction

If it is inference, will the inference hold for a certain period of time, for a certain subset of population, and / or for certain conditions.

If it is prediction, then will it be possible / convenient / expensive for the stakeholders to collect the data relating to the predictors in the model. Using your model, how soon will the stakeholder be able to predict the outcome before the outcome occurs. For example, if the model predicts the number of bikes people will rent in Evanston on a certain day, then how many days before that day will your model be able to make the prediction. This will depend on how soon the data that your model uses becomes available. If you are predicting election results, how many days / weeks / months / years before the election can you predict the results. 

When will your model become too obsolete to be useful?

Our model was meant to help with both inference and prediction. For inference, our model is limited in that it contains many interaction terms, which makes it quite difficult to interpret how any one variable affects the log-odds of the water being potable. Additionally, since our model had limited training data, our inferences are limited to the range of data we had. Specifically, they apply to water with a pH from 0.23 to 14.00, a hardness from 73.49 to 317.34, a total dissolved solids concentration from 320.94 to 56488.67 ppm, a chloramine concentration from 1.39 to 13.13 ppm, a sulfate concentration from 129.00 to 481.03 ppm, a conductivity from 201.62 to 753.34 μS/cm, a total organic carbon concentration from 2.20 to 27.01 ppm, a trihalomethane concentration from 8.58 to 124.00 ppm, and a turbidity from 1.45 to 6.49 NTU.
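
One practical way to respect these limits is to flag any new sample whose measurements fall outside the training ranges before trusting a prediction for it. The sketch below uses the ranges reported in this section; the function name and the example sample are hypothetical.

```python
# Training ranges (min, max) as reported in this section.
TRAINING_RANGES = {
    "ph": (0.23, 14.00),
    "Hardness": (73.49, 317.34),
    "Solids": (320.94, 56488.67),
    "Chloramines": (1.39, 13.13),
    "Sulfate": (129.00, 481.03),
    "Conductivity": (201.62, 753.34),
    "Organic_carbon": (2.20, 27.01),
    "Trihalomethanes": (8.58, 124.00),
    "Turbidity": (1.45, 6.49),
}

def out_of_range(sample, ranges=TRAINING_RANGES):
    """Return the names of measurements that fall outside the training range."""
    return [
        name
        for name, (low, high) in ranges.items()
        if name in sample and not low <= sample[name] <= high
    ]

# Hypothetical sample: the sulfate value is above the training maximum.
sample = {"ph": 7.0, "Sulfate": 500.0, "Turbidity": 3.0}
print(out_of_range(sample))
```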

For prediction, our model can be used immediately to predict whether a water sample is potable as soon as the chemical tests are run, since there is no time delay associated with the data the model uses. However, our model had an accuracy of only 72.6%, just 12.9 percentage points better than the naive model's 59.7%. Our model is therefore limited in its accuracy, as we were hoping for 95% accuracy in prediction. Moreover, our false positive rate was 7.6%, well above the 1% we wanted given the serious consequences of drinking unpotable water.

Our model should not become obsolete over time, because the relationship between a water sample's chemical composition and its potability should not change: a sample with the same chemical composition will be, for all intents and purposes, the same water even far in the future.

## Other sections *(optional)*

You are welcome to introduce additional sections or subsections, if required, to address any specific aspects of your project in detail. For example, you may briefly discuss potential future work that the research community could focus on to make further progress in the direction of your project's topic.

## Conclusions and Recommendations to stakeholder(s)

What conclusions do you draw based on your model? If it is inference you may draw conclusions based on the coefficients, statistical significance of predictors / interactions, etc. If it is prediction, you may draw conclusions based on prediction accuracy, or other performance metrics.

How do you use those conclusions to come up with meaningful recommendations for stakeholders? The recommendations must be action-items for stakeholders that they can directly implement without any further analysis. Be as precise as possible. The stakeholder(s) are depending on you to come up with practically implementable recommendations, instead of having to think for themselves.

If your recommendations are not practically implementable by stakeholders, how will they help them? Is there some additional data / analysis / domain expertise you need to do to make the recommendations implementable? 

Do the stakeholder(s) need to be aware about some limitations of your model? Is your model only good for one-time use, or is it possible to update your model at a certain frequency (based on recent data) to keep using it in the future? If it can be used in the future, then for how far into the future?

From our model, we can draw several inference conclusions. For instance, a higher pH, solids concentration, and sulfate concentration each correlate with a higher likelihood of a water sample being potable, all with statistical significance. We also found that higher chloramine concentration, organic carbon concentration, and conductivity all correlated with a lower likelihood of potability. However, these three were all less statistically significant than pH, solids concentration, and sulfate concentration. We also found that the interaction terms with the highest statistical significance were the ones involving pH, meaning that we are most confident that pH interacted with the other variables in the training data set.

On the prediction side, our model had an accuracy of 72.6% and a false positive rate of 7.6%, compared with our naive model's accuracy of 59.7%. Our final model was therefore not much better than the naive model in terms of accuracy, and worse in terms of false positive rate.

For our stakeholders living in areas without safe drinking water, we would not recommend using our model to predict water potability with the purpose of drinking the water, as our false positive rate is too high for our standards for us to consider the model safe to use to determine water potability. In addition, our accuracy is lower than we would like it to be, so we feel that more data is needed for stakeholders interested in using our model to predict potability.

For environmental regulators and governmental water control institutions, we recommend researching the effects that pH, solids concentration, and sulfate concentration have on water, as increases in each appeared to increase the likelihood of water potability with high statistical significance. They should also research the interactions that pH has with other concentrations, such as chloramines and sulfates, as those interactions had high statistical significance.

For biotech research companies, we found that pH, solids concentration, and sulfate concentration were the most relevant variables for determining water potability. Therefore, we recommend that these companies focus on these variables when developing tests for water potability.

As mentioned in our limitations section, we advise our stakeholders to be aware of several limitations in our model. Again, our accuracy is quite low, especially when compared to the naive model, and our false positive rate is alarmingly high. While our model is not a one-time-use 

## GitHub and individual contribution {-}

Put the **Github link** for the project repository.

Add details of each team member's contribution in the table below.

<html>
<style>
table, td, th {
  border: 1px solid black;
}

table {
  border-collapse: collapse;
  width: 100%;
}

th {
  text-align: left;
}
    

</style>
<body>

<h2>Individual contribution</h2>

<table style="width:100%">
     <colgroup>
       <col span="1" style="width: 15%;">
       <col span="1" style="width: 20%;">
       <col span="1" style="width: 50%;">
       <col span="1" style="width: 15%;"> 
    </colgroup>
  <tr>
    <th>Team member</th>
    <th>Contributed aspects</th>
    <th>Details</th>
    <th>Number of GitHub commits</th>
  </tr>
  <tr>
    <td>Elton John</td>
    <td>Data cleaning and EDA</td>
    <td>Cleaned data to impute missing values and developed visualizations to identify appropriate variable transformations.</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Xena Valenzuela</td>
    <td>Assumptions and interactions</td>
    <td>Checked and addressed modeling assumptions and identified relevant variable interactions.</td>
    <td>120</td>
  </tr>
    <tr>
    <td>Sankaranarayanan Balasubramanian</td>
    <td>Outlier and influential points treatment</td>
    <td>Identified outliers/influential points and analyzed their effect on the model.</td>
    <td>130</td>    
  </tr>
    <tr>
    <td>Chun-Li</td>
    <td>Variable selection and addressing overfitting</td>
    <td>Performed variable selection on an exhaustive set of predictors to address multicollinearity and overfitting.</td>
    <td>150</td>    
  </tr>
</table>

List the **challenges** you faced when collaborating with the team on GitHub. Are you comfortable using GitHub? 
Do you feel GitHub made collaboration easier? If not, then why? *(Individual team members can put their opinion separately, if different from the rest of the team)*

## References {-}

List and number all bibliographical references. When referenced in the text, enclose the citation number in square brackets, for example [1].



## Appendix {-}

You may put additional stuff here as Appendix. You may refer to the Appendix in the main report to support your arguments. However, the appendix section is unlikely to be checked while grading, unless the grader deems it necessary.