# Tabular data exploration

## Imports needed for data exploration

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns 

## Data importation

For tabular data we use a water potability dataset, which can be found on [Kaggle](https://www.kaggle.com/datasets/adityakadiwal/water-potability). The dataset is located in our directory under "datasets/water_potability" and is stored as a CSV file.</br></br>
Before we can do anything with the dataset, we need to load it into Python. The easiest way to read and work with tabular data is the Pandas library.</br>
To read data from a CSV file, use the following command:

<code>data = pd.read_csv("pathname/to/dataset.csv", delimiter=",", index_col=None)</code>

This call returns a pandas.DataFrame object. A DataFrame has a number of methods that give basic insight into the data: (1) <code>.head(n)</code> to see the first n rows of the dataframe, (2) <code>.info()</code> to see general information about the columns and the dataframe, and (3) <code>.describe()</code> to see basic statistics for each numeric feature. Use these methods to explore the dataframe and answer the following questions:

What is the number of rows in the dataframe: 

What is the number of columns in the dataframe: 

What is the number of features in the data: 

What is the memory usage of the dataframe:
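As a minimal sketch of the inspection methods mentioned above, the following reads a tiny made-up CSV from a string instead of the real file. The column names (`feature_a`, `feature_b`, `label`) and all values are invented for illustration and say nothing about the water potability data.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a real file on disk (hypothetical columns).
csv_text = """feature_a,feature_b,label
7.1,204.9,0
,129.4,0
8.3,224.2,1
6.5,,1
"""

toy = pd.read_csv(io.StringIO(csv_text), delimiter=",", index_col=None)

print(toy.head(2))     # first two rows
toy.info()             # dtypes, non-null counts and memory usage (prints directly)
print(toy.describe())  # count, mean, std, min, quartiles and max per numeric column

# The same facts are also available programmatically.
n_rows, n_columns = toy.shape                      # (rows, columns)
memory_bytes = toy.memory_usage(deep=True).sum()   # total bytes, index included
print(n_rows, n_columns, memory_bytes)
```

Note that <code>.info()</code> prints its report itself, so it does not need to be wrapped in `print`.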

## Explore the data

### Null values

An important aspect of your dataset is whether it is "clean". Among other things, this means that it contains no missing values such as None, Null or NaN.</br>
With Pandas it is easy to count the missing values in a column using the following command:

<code>number_of_nan_in_column = dataframe["column_name"].isnull().sum() </code>

Do this for all columns.
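Calling <code>.isnull()</code> on the whole dataframe instead of a single column gives the counts for every column at once. The sketch below shows this on a small made-up dataframe; the column names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up dataframe with a few missing entries (hypothetical columns).
toy = pd.DataFrame({
    "feature_a": [1.0, np.nan, 3.0, np.nan],
    "feature_b": [0.5, 0.7, np.nan, 0.9],
    "label": [0, 1, 0, 1],
})

# Number of missing values per column.
missing_per_column = toy.isnull().sum()

# Fraction of missing values per column (the mean of a boolean column).
missing_fraction = toy.isnull().mean()

print(missing_per_column)
print(missing_fraction)
```

The fraction is often easier to judge than the raw count, because it does not depend on the number of rows.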

### Feature distribution
The features/variables in your dataset can have very different ranges. This seems fine at first glance, but it can cause problems in certain algorithms such as gradient descent, where variables with larger values can have a larger effect on the gradients than variables with smaller values. It is therefore a good idea to create histograms of your variables to see which variables cover which ranges. Histograms also tell you more about the distribution of each variable. This is especially interesting for the labels, where an underrepresented class could cause problems in classification. </br>
Use the following command to show a histogram for each column in your pandas dataframe:

<code>dataframe.hist(figsize=[width_in_inch, height_in_inch])</code>

Look at the histograms and compare the ranges and shapes of the distributions. Are there any features that could cause a problem?
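To make the two concerns above concrete, here is a small sketch on synthetic data: one feature with a small range, one with a large range, and an imbalanced label. All names and numbers are invented for illustration. The `Agg` backend line is only there so the sketch runs without a display and can be left out in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data (hypothetical columns): very different ranges, imbalanced label.
toy = pd.DataFrame({
    "small_range": rng.normal(loc=0.0, scale=1.0, size=200),
    "large_range": rng.normal(loc=5000.0, scale=1000.0, size=200),
    "label": [0] * 180 + [1] * 20,
})

# One histogram per column.
axes = toy.hist(figsize=(9, 6), bins=20)

# The same information in numbers: value range per column and class counts.
value_ranges = toy.max() - toy.min()
class_counts = toy["label"].value_counts()

print(value_ranges)
print(class_counts)
```

Numeric summaries like these are a useful cross-check when the histograms of many columns become hard to read side by side.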

### Correlations
When working with large datasets, there might be so many features that training and inference become very slow. In those situations there is a higher chance that some features are redundant. This can be checked by creating a correlation matrix between the features. There are several ways to calculate the correlation, such as the "pearson" coefficient, which measures linear relationships, and the "spearman" coefficient, which measures monotonic relationships. Both give values between -1 and 1, where -1 means a perfect negative correlation, 0 means no correlation and 1 means a perfect positive correlation. With Pandas, these correlation matrices can be calculated as follows:

<code>matrix = dataframe.corr(method="spearman")</code>
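The difference between the two correlation methods shows up when a relationship is monotonic but not linear. The sketch below uses made-up columns (`x`, `x_cubed`, `x_reversed`) to illustrate this. It is a toy example, not a statement about the water potability data.

```python
import pandas as pd

x = list(range(1, 11))

# Made-up data: x_cubed grows monotonically but not linearly with x,
# x_reversed decreases linearly with x.
toy = pd.DataFrame({
    "x": x,
    "x_cubed": [v ** 3 for v in x],
    "x_reversed": x[::-1],
})

pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

# Spearman works on ranks, so any strictly increasing relation scores 1.
print(spearman.loc["x", "x_cubed"])
# Pearson measures linear association, so the cubic relation scores below 1.
print(pearson.loc["x", "x_cubed"])
# A perfectly linear decreasing relation scores -1 with both methods.
print(pearson.loc["x", "x_reversed"])
```

A pair of features with an absolute correlation close to 1 carries largely the same information, which is what makes one of them a candidate for removal.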

</br>
To make the matrix more readable you have several options. The first one is to use matplotlib:

<code>plt.matshow(matrix)</br>
plt.colorbar()</br>
plt.show()</br>
</code>
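By default, <code>plt.matshow</code> labels the axes with integer positions only. One possible way to show the column names instead is sketched below on a made-up dataframe. The column names, the fixed colour range of -1 to 1 and the `coolwarm` colormap are choices made for this illustration. The `Agg` backend line only lets the sketch run without a display and can be left out in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Made-up data (hypothetical columns).
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 5, 4, 5],
    "c": [5, 3, 4, 2, 1],
})
matrix = toy.corr(method="pearson")

fig, ax = plt.subplots()

# Fix the colour scale to the full correlation range so plots are comparable.
image = ax.matshow(matrix, vmin=-1, vmax=1, cmap="coolwarm")
fig.colorbar(image, ax=ax)

# Replace the integer tick labels with the column names.
ax.set_xticks(range(len(matrix.columns)))
ax.set_xticklabels(matrix.columns, rotation=45, ha="left")
ax.set_yticks(range(len(matrix.index)))
ax.set_yticklabels(matrix.index)

tick_labels = [tick.get_text() for tick in ax.get_xticklabels()]
print(tick_labels)
```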

</br>
A second option is to use seaborn. Create a figure and axes with matplotlib, then draw the heatmap on those axes with seaborn:

<code>
f, ax = plt.subplots()</br>
sns.heatmap(matrix, vmin=minimum, vmax=maximum, ax=ax)
</code>

One of the two Hardness columns is the real Hardness, the other is just a combination of two other columns in the dataset. Which one is the real one?