In [9]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Tanzanian Water Pumps
A priori knowledge about the project:

* Client: Tanzanian Ministry of Water
* Top priority: access to clean water for the whole population
* Dataset: contains sample records for all water pumps
* 3 Labels: functional, functional needs repair, non functional
* Features such as location, water quality, construction year, etc.
* Feature description: https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/page/25/

# Data Import

In [10]:
feature_filename = "data/Features.csv"
label_filename = "data/Labels.csv"
df_feat = pd.read_csv(feature_filename)
df_label = pd.read_csv(label_filename)
df = pd.merge(df_feat, df_label, on='id', how='inner')

# Data Understanding
In this notebook, I will try to understand the target label and the numerical features better. I will deal with the categorical features in another notebook (data_understanding2) for better readability.

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59400 entries, 0 to 59399
Data columns (total 41 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     59400 non-null  int64  
 1   amount_tsh             59400 non-null  float64
 2   date_recorded          59400 non-null  object 
 3   funder                 55765 non-null  object 
 4   gps_height             59400 non-null  int64  
 5   installer              55745 non-null  object 
 6   longitude              59400 non-null  float64
 7   latitude               59400 non-null  float64
 8   wpt_name               59400 non-null  object 
 9   num_private            59400 non-null  int64  
 10  basin                  59400 non-null  object 
 11  subvillage             59029 non-null  object 
 12  region                 59400 non-null  object 
 13  region_code            59400 non-null  int64  
 14  district_code          59400 non-null  int64  
 15  lg

The scheme_name attribute (description: "Who operates the waterpoint") has a very high number of null values.
Decision: Drop the scheme_name feature from the dataset
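To make the drop decision reproducible, the share of missing values per column can be computed directly. The following is a minimal sketch on a made-up toy frame; the values, the name `toy` and the cut-off `NULL_THRESHOLD` are assumptions for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Toy stand-in for the merged frame; all values are made up for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "funder": ["A", None, "B", "C"],
    "scheme_name": [None, None, "X", None],
})

# Share of missing values per column, highest first.
null_share = toy.isna().mean().sort_values(ascending=False)
print(null_share)

# Hypothetical cut-off: drop columns with more than half of their values missing.
NULL_THRESHOLD = 0.5
to_drop = null_share[null_share > NULL_THRESHOLD].index.tolist()
toy_reduced = toy.drop(columns=to_drop)
print(to_drop)
```

On the real data, the same two steps would be applied to `df` instead of `toy`; the threshold is a judgment call, not a fixed rule.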

In [12]:
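# Restrict to numeric columns; newer pandas versions raise an error on object columns in corr()
sns.heatmap(df.select_dtypes(include="number").corr(), annot=True)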

AttributeError: module 'seaborn' has no attribute 'heatmap'

* High correlation between gps_height and construction_year.
* High correlation between region_code and district_code.
* Relatively high correlation between longitude and construction_year.
* Surprisingly low correlation between the coordinates (longitude, latitude) and region_code/district_code.

Note that these low/high correlations may be artifacts of poor data quality, e.g. placeholder 0's that occur in the same rows of several features.
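A small, made-up example illustrates how data quality can distort a correlation: when two otherwise weakly related features share placeholder 0's in the same rows, the Pearson correlation is dominated by those rows. The numbers below are invented and only reuse the column names for readability.

```python
import pandas as pd

# Four "valid" rows with a weak, negative relationship (made-up values).
valid = pd.DataFrame({
    "gps_height": [1000, 1200, 1400, 1600],
    "construction_year": [2000, 1990, 2010, 1980],
})

# Four rows where both features hold the placeholder value 0.
placeholders = pd.DataFrame({"gps_height": [0] * 4, "construction_year": [0] * 4})
with_zeros = pd.concat([valid, placeholders], ignore_index=True)

# Pearson correlation with and without the placeholder rows.
corr_with_zeros = with_zeros["gps_height"].corr(with_zeros["construction_year"])
corr_valid_only = valid["gps_height"].corr(valid["construction_year"])
print(f"with placeholder rows: {corr_with_zeros:.2f}")
print(f"valid rows only:       {corr_valid_only:.2f}")
```

The shared zeros push the correlation close to 1 even though the valid rows are negatively correlated, which is why the heatmap should be re-checked after the 0's are handled.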

## Label

In [None]:
df["status_group"].value_counts()

The classes are not equally distributed. Training on this imbalanced dataset could bias the model toward the majority class, so overall accuracy alone may be a misleading metric.
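To quantify the imbalance, relative class frequencies are more informative than raw counts. The sketch below uses made-up label counts (not the real distribution) and derives inverse-frequency class weights with the common heuristic n_samples / (n_classes * class_count); whether weighting is the right remedy for this project is an open assumption.

```python
import pandas as pd

# Made-up label column; the real counts come from df["status_group"].
labels = pd.Series(
    ["functional"] * 6 + ["non functional"] * 3 + ["functional needs repair"] * 1,
    name="status_group",
)

# Relative frequency of each class.
class_share = labels.value_counts(normalize=True)
print(class_share)

# Inverse-frequency weights: rare classes get larger weights.
counts = labels.value_counts()
class_weight = len(labels) / (counts.size * counts)
print(class_weight)
```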

# Numerical Features

In [None]:
df.describe()

* amount_tsh - Total static head (amount water available to waterpoint): seems to contain outliers
* gps_height: a 25% quantile of 0 is suspicious and might indicate missing values
* longitude: the min. value of 0.000 looks suspicious. Need to investigate further.
* latitude: no values greater than or equal to 0
* num_private: no feature description available and seems to contain outliers as well -> drop feature
* population: seems to contain outliers; 0-values need to be investigated.
* construction_year: 0-values need to be addressed
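Since several of these observations concern suspicious 0-values, it can help to measure how common they are per numeric column. This is a minimal sketch on a made-up toy frame; the column names mirror the dataset, but the values are invented.

```python
import pandas as pd

# Toy frame with made-up values; "basin" is non-numeric and should be ignored.
toy = pd.DataFrame({
    "gps_height": [0, 1390, 0, 263],
    "longitude": [0.0, 34.9, 35.2, 38.5],
    "construction_year": [0, 1999, 0, 2009],
    "basin": ["a", "b", "c", "d"],
})

# Share of exact zeros per numeric column, highest first.
numeric = toy.select_dtypes(include="number")
zero_share = (numeric == 0).mean().sort_values(ascending=False)
print(zero_share)
```

A high share of zeros in a feature where 0 is not a plausible value (e.g. a construction year) is a hint that 0 encodes "missing".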

## amount_tsh - Total static head (amount water available to waterpoint)


In [None]:
sns.catplot(data=df, x="amount_tsh", y="status_group", kind="box")

Outliers exist for all labels.

* Explanation of the feature: https://www.pumpfundamentals.com/what%20is%20head.htm
* Typical depth of water pumps: https://homeguides.sfgate.com/depth-guide-water-pump-103317.html

Decision: Remove outliers (>120) and impute missing values (0's) with median
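The decision above could be implemented roughly as follows. This is a sketch on made-up values, and it assumes that the median is taken over the remaining non-zero values after the outliers have been removed; the notebook does not fix that detail. `UPPER` is a hypothetical name for the 120 cut-off.

```python
import pandas as pd

# Made-up amount_tsh values: two zeros (treated as missing) and one heavy outlier.
toy = pd.DataFrame({"amount_tsh": [0.0, 20.0, 50.0, 0.0, 100.0, 6000.0]})

UPPER = 120  # cut-off from the decision above

# 1) Remove outliers above the cut-off.
kept = toy[toy["amount_tsh"] <= UPPER].copy()

# 2) Treat zeros as missing, then impute with the median of the observed values.
observed = kept["amount_tsh"].mask(kept["amount_tsh"] == 0)
median_tsh = observed.median()
kept["amount_tsh"] = observed.fillna(median_tsh)
print(kept)
```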

## population & gps_height

In [None]:
for att in ["gps_height", "population"]:
    sns.catplot(data=df, x=att, y="status_group", kind="box")
    plt.show()

    # region_code of every instance where this attribute is 0
    rc0 = df[df[att] == 0]["region_code"]
    print(f"{att}: instances with value 0 per region_code\n", rc0.value_counts())

    # TOTAL number of instances (any value) for those region_codes that contain 0-valued instances
    print(f"{att}: all instances in those region_codes\n", df[df["region_code"].isin(rc0)]["region_code"].value_counts())

    print(f"Label distribution of {att} == 0\n", df[df[att] == 0]["status_group"].value_counts())

In regions that contain instances with 0 gps_height/pop, almost all instances have 0 gps_height/pop. The same holds for district_code.
However, there are too many instances with 0 gps_height/pop to justify removing them from the dataset. Therefore:
Decision: Impute 0's of the gps_height/pop features with the median value of the instances per region_code.
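A per-region median imputation can be sketched with a groupby transform. Because some regions may contain only 0's, the sketch falls back to the global median of the observed values in that case; this fallback and the helper name `impute_zeros_by_region` are my own assumptions, and the toy values are made up.

```python
import pandas as pd

# Made-up example: region 3 contains only a placeholder 0.
toy = pd.DataFrame({
    "region_code": [1, 1, 1, 2, 2, 3],
    "population": [0, 100, 300, 0, 50, 0],
})

def impute_zeros_by_region(frame, column, group="region_code"):
    """Replace 0's in `column` with the group median, else the global median."""
    # Treat 0 as missing so it does not pull the medians down.
    values = frame[column].astype(float).mask(frame[column] == 0)
    # Median of the observed values within each group, aligned to the rows.
    region_median = values.groupby(frame[group]).transform("median")
    # Fall back to the global median where a whole group is missing.
    return values.fillna(region_median).fillna(values.median())

toy["population_imputed"] = impute_zeros_by_region(toy, "population")
print(toy)
```

The same helper could be called with `"gps_height"`; whether the global fallback is acceptable depends on how many regions consist entirely of 0's.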


In [None]:
df[df["population"] > 10000]

Only three instances have a population value above 10000. In all three cases the water pumps are still functional.
Decision: Remove heavy outliers (pop > 10000)

## Longitude/Latitude

In [None]:
for att in ["longitude", "latitude"]:
    sns.catplot(data=df, x=att, y="status_group", kind="box")
    plt.show()

In [None]:
df[df["longitude"] == 0]

It is worth noting that the missing values in the latitude feature seem to be stored as -2.000000e-8 (almost zero)

In [None]:
df[df["latitude"] == -2.000000e-08]

We can clearly observe missing values for both coordinates.
Decision: Impute with median value of the region (region_code)
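The sketch below shows one way to flag the placeholder coordinates and impute them with the region median. It assumes that the placeholder longitude (0) and latitude (about -2e-08) occur in the same rows, and it uses a tolerance-based comparison instead of exact float equality; the tolerance and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up coordinates; the first row carries the placeholder values.
toy = pd.DataFrame({
    "region_code": [17, 17, 17, 19],
    "longitude": [0.0, 33.5, 33.9, 32.1],
    "latitude": [-2e-08, -3.1, -3.5, -2.4],
})

# Flag rows where both coordinates are (almost) zero.
missing_coords = (
    np.isclose(toy["longitude"], 0.0)
    & np.isclose(toy["latitude"], 0.0, atol=1e-6)
)

for col in ["longitude", "latitude"]:
    # Hide the placeholder values, then fill them with the region median.
    observed = toy[col].mask(missing_coords)
    region_median = observed.groupby(toy["region_code"]).transform("median")
    toy[col] = observed.fillna(region_median)

print(toy)
```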

## Construction_year

In [None]:
sns.catplot(data=df, x="construction_year", y="status_group", kind="box")

In [None]:
df[df["construction_year"]==0]

Note: High number of missing values (0's) for construction year.

In [None]:
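# Correlations on numeric columns only, excluding rows with a missing (0) construction_year
sns.heatmap(df[df["construction_year"] != 0].select_dtypes(include="number").corr(), annot=True)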