# Tabular data preprocessing

## Imports needed for data exploration

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

## Data importation

For tabular data we use a water potability dataset, which can be found on [Kaggle](https://www.kaggle.com/datasets/adityakadiwal/water-potability). The dataset is located in our directory under "datasets/water_potability" and is stored as a CSV file. The model you would train with this dataset is a classification model that decides whether the water is drinkable, given several parameters.</br></br>
As in the data exploration step, we first need to load our dataset into Python. The easiest way to read and work with tabular data is to use the pandas library. </br>
To read data from a CSV file, use the following command:

<code> data = pd.read_csv("pathname/to/dataset.csv", delimiter="," , index_col=None) </code>

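Later you will have to save your dataset. The command below writes <code>;</code> as the separator, so the resulting file has to be read back with <code>delimiter=";"</code>.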

<code>data.to_csv("path/to/file", sep=";", index=False)</code>
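The two commands above can be combined into a small round trip. This is only a sketch: the table, its column names and the temporary file name are made up for illustration and are not the real dataset.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in table; the column names are illustrative assumptions.
data = pd.DataFrame({"ph": [7.1, 6.8, 8.0], "Potability": [0, 1, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv")
    # Write with ";" as separator and without the row index.
    data.to_csv(path, sep=";", index=False)
    # Read back with the same separator; with the default ","
    # every row would end up in a single column.
    reloaded = pd.read_csv(path, delimiter=";", index_col=None)

print(reloaded)
```

Keeping the separator identical on both sides is the main thing to watch for when a file is saved in one step and loaded in the next.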

## Data cleaning

### Remove the Null values

An important aspect of your dataset is that it is "clean". This means that the dataset contains no None, Null, NaN or other missing values.</br>
In an earlier step we checked whether there are any Null values; now we are going to handle them. There are several possible techniques. The one we are going to use replaces the empty values in a column with the mean value of that column. To do this you can use the following commands.

<code>
mean = dataframe.mean()
dataframe.fillna(mean, inplace=True)
</code>
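As a minimal sketch of mean imputation, the example below fills the gaps in a small made-up table. The column names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up table with one missing value per column.
dataframe = pd.DataFrame({
    "ph": [7.0, np.nan, 9.0],
    "Sulfate": [300.0, 340.0, np.nan],
})

# Mean of each numeric column; NaN values are skipped by default.
column_means = dataframe.mean(numeric_only=True)

# Passing a Series to fillna fills each column with its own mean.
dataframe = dataframe.fillna(column_means)

print(dataframe)
```

Because `fillna` receives one mean per column, a missing `ph` value is never filled with the mean of another column.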

When you are done, don't forget to save your new dataset to a file and version it with DVC.

### Data transformation
When exploring the dataset we also looked at the different features and their ranges. If some of the features have very different ranges, it might be a good idea to normalize your dataset. Again, there are several techniques to normalize your data. We are going to use min-max normalization, which rescales every feature to the range [0, 1]. To do this you will need the following building blocks.
</br></br>
To iterate over the columns in your dataset: </br>
<code>for column in dataframe.columns: </code>

To find the min or the max of a column: </br>
<code>dataframe["column"].min() </br> dataframe["column"].max()</code>

The formula for min-max normalization is the following: </br>
$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
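The pieces above can be put together as follows. This is a sketch on a made-up table; the column names, the values and the guard for constant columns are assumptions, not part of the exercise.

```python
import pandas as pd

# Made-up table with two features on very different scales.
dataframe = pd.DataFrame({
    "Hardness": [100.0, 150.0, 200.0],
    "Solids": [10000.0, 20000.0, 40000.0],
})

normalized = dataframe.copy()
for column in normalized.columns:
    col_min = normalized[column].min()
    col_max = normalized[column].max()
    if col_max == col_min:
        # Assumption: a constant column is mapped to 0 to avoid dividing by zero.
        normalized[column] = 0.0
        continue
    # Min-max formula: (X - Xmin) / (Xmax - Xmin)
    normalized[column] = (normalized[column] - col_min) / (col_max - col_min)

print(normalized)
```

After this loop the smallest value in every column is 0 and the largest is 1, so no feature dominates only because of its unit.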

Again, do not forget to save your new dataset version with DVC when you are done.

### Data preprocessing
In the previous exercise you found that one of the two Hardness columns was not real. Now it is time to remove this column, since it does not give useful information to our machine learning model. This can be done with the following pandas function. Note that <code>drop</code> returns a new dataframe and leaves the original untouched, so the result has to be assigned.

<code>dataframe = dataframe.drop(["column_1", "column_2"], axis=1)</code>
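A short sketch of dropping a column is shown below. The name `Hardness_copy` is hypothetical and only stands in for whichever column you identified as not real.

```python
import pandas as pd

# Made-up table; "Hardness_copy" is a hypothetical name for the unwanted column.
dataframe = pd.DataFrame({
    "Hardness": [180.0, 200.0, 220.0],
    "Hardness_copy": [1.0, 2.0, 3.0],
    "Potability": [0, 1, 0],
})

# drop returns a new dataframe, so the result is assigned back.
dataframe = dataframe.drop(columns=["Hardness_copy"])

print(list(dataframe.columns))
```

Using `columns=[...]` is equivalent to passing the list together with `axis=1`.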

The last step is to split your data into a training set, validation set and test set. To do this you can use the following command from scikit-learn. </br>

<code>train, rest = train_test_split(dataframe, test_size=0.2)</code>

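A typical split is 80 percent training, 10 percent validation and 10 percent test. The command above puts 20 percent of the rows in <code>rest</code>, so splitting <code>rest</code> once more with <code>test_size=0.5</code> gives the validation and test sets.
</br>When you are done, save your three datasets as CSV files.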

<code>dataframe.to_csv("path/to/save/file.csv", sep=";", index=False)</code>
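An 80/10/10 split can be obtained by calling `train_test_split` twice. The sketch below uses a made-up table of 100 rows; the column names and the fixed `random_state` (added so the split is reproducible) are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up table with 100 rows so the percentages are easy to check.
dataframe = pd.DataFrame({
    "feature": range(100),
    "Potability": [0, 1] * 50,
})

# First split: 80 percent training, 20 percent held back.
train, rest = train_test_split(dataframe, test_size=0.2, random_state=42)

# Second split: the held-back part is divided equally into validation and test.
validation, test = train_test_split(rest, test_size=0.5, random_state=42)

print(len(train), len(validation), len(test))
```

Each of the three dataframes can then be written to its own file with `to_csv`, as in the command above.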

Again, do not forget to save your new datasets with DVC when you are done.