# Clustering Data to Unveil Maji Ndogo's Water Crisis

## Introduction

In this second part of the integrated project, we gear up for a deep analytical dive into Maji Ndogo's water scenario. We will harness a wide range of SQL functions, including intricate window functions, to tease out insights from the data tables.

## Notebook Setup

In [1]:
# Load the sql extension
%load_ext sql


In [2]:
# Create a connection to the MySQL 'md_water_services' database
%sql mysql+pymysql://root:password@localhost:3306/md_water_services

## Cleaning the Data

Let's bring up the `employee` entity. It has information on all of Maji Ndogo's workers, but note that the email addresses have not been added. If we have to send them reports and figures, we will need their emails, so we need to update the `email` attribute. Luckily, per the project description, the organisation's email addresses follow a simple pattern: `first_name.last_name@ndogowater.gov`.

We can determine the email address for each employee by:
- selecting the `employee_name` column
- replacing the space with a full stop
- making it lowercase
- stitching it all together with the `@ndogowater.gov` domain

We have to update the database again with these email addresses, so before we do, we can use a `SELECT` query to get the format right, then use `UPDATE` and `SET` to persist the changes into the database.
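
The preview-then-persist workflow described above can be sketched outside the notebook as follows. This is a minimal illustration, assuming an in-memory SQLite database as a stand-in for the MySQL server and two sample rows; SQLite concatenates strings with `||`, whereas the notebook's MySQL queries use `CONCAT()`.

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL `employee` table (sample rows only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (employee_name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO employee (employee_name) VALUES (?)",
    [("Amara Jengo",), ("Bello Azibo",)],
)

# Assumption: SQLite syntax. In MySQL the same expression is written with CONCAT().
EMAIL_EXPR = "LOWER(REPLACE(employee_name, ' ', '.')) || '@ndogowater.gov'"

# Step 1: preview the constructed addresses without changing any data.
preview = [
    row[0]
    for row in conn.execute(
        f"SELECT {EMAIL_EXPR} FROM employee ORDER BY employee_name"
    )
]
print(preview)

# Step 2: once the preview looks right, persist the same expression.
conn.execute(f"UPDATE employee SET email = {EMAIL_EXPR}")
conn.commit()

# Read the stored values back to confirm they match the preview.
saved = [
    row[0]
    for row in conn.execute("SELECT email FROM employee ORDER BY employee_name")
]
print(saved)
```

Reusing one expression for both the `SELECT` preview and the `UPDATE` keeps the persisted values identical to what was inspected.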

In [3]:
%%sql
# Construct the email addresses for Maji Ndogo's workers
SELECT
    CONCAT(LOWER(REPLACE(employee_name, " ", ".")), "@ndogowater.gov") AS new_email
FROM employee;

new_email
amara.jengo@ndogowater.gov
bello.azibo@ndogowater.gov
bakari.iniko@ndogowater.gov
malachi.mavuso@ndogowater.gov
cheche.buhle@ndogowater.gov
zuriel.matembo@ndogowater.gov
deka.osumare@ndogowater.gov
lalitha.kaburi@ndogowater.gov
enitan.zuri@ndogowater.gov
farai.nia@ndogowater.gov


In [4]:
# %%sql
# # Update the employee table with the constructed emails
# UPDATE employee
# SET email = CONCAT(LOWER(REPLACE(employee_name, " ", ".")), "@ndogowater.gov");

Let's make sure that the query above worked.

In [5]:
%sql SELECT * FROM employee LIMIT 5;

assigned_employee_id,employee_name,phone_number,email,address,province_name,town_name,position
0,Amara Jengo,99637993287,amara.jengo@ndogowater.gov,36 Pwani Mchangani Road,Sokoto,Ilanga,Field Surveyor
1,Bello Azibo,99643864786,bello.azibo@ndogowater.gov,129 Ziwa La Kioo Road,Kilimani,Rural,Field Surveyor
2,Bakari Iniko,99222599041,bakari.iniko@ndogowater.gov,18 Mlima Tazama Avenue,Hawassa,Rural,Field Surveyor
3,Malachi Mavuso,99945849900,malachi.mavuso@ndogowater.gov,100 Mogadishu Road,Akatsi,Lusaka,Field Surveyor
4,Cheche Buhle,99381679640,cheche.buhle@ndogowater.gov,1 Savanna Street,Akatsi,Rural,Field Surveyor


Awesome, now we have emails for all employees persisted in the database. Next, let's check the `phone_number` attribute. The phone numbers should be 12 characters long but, as we can see below 👇🏾, they are 13 characters long.

In [8]:
%%sql
# Check the length of the phone numbers
SELECT LENGTH(phone_number) FROM md_water_services.employee LIMIT 5;

LENGTH(phone_number)
13
13
13
13
13


That's because there is a space at the end of each number! If you try to send an automated SMS to such a number, it will fail. This happens often, and the remedy is `TRIM(column)`, which removes any leading or trailing spaces from a string.
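
The effect of a trailing space on the stored length can be reproduced with a small sketch. It assumes an in-memory SQLite table and a made-up 12-character number, not real employee data; `LENGTH()` and `TRIM()` behave the same way here as in the notebook's MySQL queries for plain ASCII digits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (phone_number TEXT)")

# Hypothetical 12-character number stored with one trailing space.
conn.execute("INSERT INTO employee VALUES ('123456789012 ')")

# Compare the raw length with the length after trimming.
before, trimmed = conn.execute(
    "SELECT LENGTH(phone_number), LENGTH(TRIM(phone_number)) FROM employee"
).fetchone()
print(before, trimmed)

# Persist the trimmed value, then re-check the stored length.
conn.execute("UPDATE employee SET phone_number = TRIM(phone_number)")
conn.commit()
after = conn.execute("SELECT LENGTH(phone_number) FROM employee").fetchone()[0]
print(after)
```

Note that in MySQL `LENGTH()` counts bytes while `CHAR_LENGTH()` counts characters; for digits-only values the two agree.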

In [9]:
%%sql
# Trim the leading and trailing whitespace in the phone_number attribute and check the length
SELECT LENGTH(TRIM(phone_number)) FROM md_water_services.employee LIMIT 5;

LENGTH(TRIM(phone_number))
12
12
12
12
12


In [10]:
%%sql
# Update the table to persist the changes to the database
UPDATE md_water_services.employee
SET employee.phone_number = TRIM(employee.phone_number);

Let's check whether the query above worked.

In [11]:
%%sql
# Confirm that the phone_number attribute was updated
SELECT LENGTH(phone_number) FROM md_water_services.employee LIMIT 5;

LENGTH(phone_number)
12
12
12
12
12


## Honoring Employees

Before we begin the analysis to find employees worth honoring, let's first count how many employees reside in each province.

In [16]:
%%sql
# Count the number of employees per province
SELECT province_name, COUNT(employee_name) AS no_of_employees
FROM md_water_services.employee
GROUP BY province_name
ORDER BY no_of_employees DESC;

province_name,no_of_employees
Hawassa,15
Akatsi,13
Kilimani,12
Sokoto,9
Amanzi,7


Let's check how many employees reside in each town as well.

In [17]:
%%sql
# Count the number of employees per town
SELECT town_name, COUNT(employee_name) AS no_of_employees
FROM md_water_services.employee
GROUP BY town_name
ORDER BY no_of_employees DESC;

town_name,no_of_employees
Rural,29
Dahabu,6
Harare,5
Lusaka,4
Zanzibar,4
Ilanga,3
Serowe,3
Kintampo,1
Yaounde,1


Suppose those in leadership positions in the organisation ask us to send out an email or message congratulating the top three field surveyors. We can use the database to get the `assigned_employee_id`s and use those to look up the `employee_name`, `email` and `phone_number` of the **three field surveyors with the most location visits**. To do this, let's first query the `visits` entity to retrieve the number of visits made by each employee.

In [20]:
%%sql
# Retrieve the five employee ids with the most location visits (we want the top three)
SELECT assigned_employee_id, COUNT(assigned_employee_id) AS no_of_visits
FROM md_water_services.visits
GROUP BY assigned_employee_id
ORDER BY no_of_visits DESC
LIMIT 5;

assigned_employee_id,no_of_visits
1,3708
30,3676
34,3539
3,3420
10,3407


The three employee ids with the highest number of visits in the `visits` entity are 1, 30 and 34.

Now all that's left is to craft a query that looks up the details of the employees whose ids we retrieved in the previous query.

In [21]:
%%sql
# Select information on the retrieved employee ids
SELECT assigned_employee_id, employee_name, phone_number, email
FROM md_water_services.employee
WHERE assigned_employee_id IN (1, 30, 34);

assigned_employee_id,employee_name,phone_number,email
1,Bello Azibo,99643864786,bello.azibo@ndogowater.gov
30,Pili Zola,99822478933,pili.zola@ndogowater.gov
34,Rudo Imani,99046972648,rudo.imani@ndogowater.gov


Awesome, we now have the `employee_name`, `phone_number` and `email` columns of the top dog employees.
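
The two-step lookup above (rank the ids, then copy them into an `IN` list) could also be done in a single query by joining `visits` to `employee`. The sketch below is one possible approach under stated assumptions: an in-memory SQLite database, invented employee names and visit counts, and only the columns needed for the join. The `JOIN ... GROUP BY ... ORDER BY ... LIMIT` statement itself uses syntax that MySQL also accepts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employee (
        assigned_employee_id INTEGER PRIMARY KEY,
        employee_name TEXT,
        phone_number TEXT,
        email TEXT
    );
    CREATE TABLE visits (assigned_employee_id INTEGER);
    """
)

# Hypothetical employees and visit counts, for illustration only.
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (1, "Employee One", "000000000001", "employee.one@ndogowater.gov"),
        (2, "Employee Two", "000000000002", "employee.two@ndogowater.gov"),
        (3, "Employee Three", "000000000003", "employee.three@ndogowater.gov"),
        (4, "Employee Four", "000000000004", "employee.four@ndogowater.gov"),
    ],
)
visit_counts = {1: 4, 2: 1, 3: 3, 4: 2}
conn.executemany(
    "INSERT INTO visits VALUES (?)",
    [(emp_id,) for emp_id, n in visit_counts.items() for _ in range(n)],
)

# Count visits per employee and fetch contact details in the same query.
top3 = conn.execute(
    """
    SELECT e.assigned_employee_id, e.employee_name, e.phone_number, e.email,
           COUNT(*) AS no_of_visits
    FROM visits AS v
    JOIN employee AS e ON e.assigned_employee_id = v.assigned_employee_id
    GROUP BY e.assigned_employee_id, e.employee_name, e.phone_number, e.email
    ORDER BY no_of_visits DESC
    LIMIT 3
    """
).fetchall()

for row in top3:
    print(row)
```

A single query avoids hard-coding ids, so the result stays correct if the visit counts change.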

## Analysing Locations

Looking at the location table, let's focus on the `province_name`, `town_name` and `location_type` to understand where the water sources are in Maji Ndogo. Let's count the records per `town_name` and then count by `province_name`.

In [9]:
%%sql
# Retrieve the number of records per town name
SELECT 
    town_name, 
    COUNT(location_id) AS records_per_town 
FROM md_water_services.location
GROUP BY town_name
ORDER BY records_per_town DESC;

town_name,records_per_town
Rural,23740
Harare,1650
Amina,1090
Lusaka,1070
Mrembo,990
Asmara,930
Dahabu,930
Kintampo,780
Ilanga,780
Isiqalo,770


In [10]:
%%sql
# Retrieve the number of records per province name
SELECT
    province_name,
    COUNT(location_id) AS records_per_province
FROM md_water_services.location
GROUP BY province_name
ORDER BY records_per_province DESC;

province_name,records_per_province
Kilimani,9510
Akatsi,8940
Sokoto,8220
Amanzi,6950
Hawassa,6030


From these tables, it's pretty clear that most of the water sources in the survey are situated in small rural communities, scattered across Maji Ndogo. If we count the records for each province, most of them have a similar number of sources, so every province is well represented in the survey. Let's create a table that shows the number of records, grouped by both `province_name` and `town_name`.

In [16]:
%%sql
# Get the number of records per province and town name
SELECT
    province_name,
    town_name,
    COUNT(location_id) AS records_per_town
FROM md_water_services.location
GROUP BY province_name, town_name
ORDER BY province_name, records_per_town DESC;

province_name,town_name,records_per_town
Akatsi,Rural,6290
Akatsi,Lusaka,1070
Akatsi,Harare,800
Akatsi,Kintampo,780
Amanzi,Rural,3100
Amanzi,Asmara,930
Amanzi,Dahabu,930
Amanzi,Amina,670
Amanzi,Pwani,520
Amanzi,Abidjan,400


These results show us that Maji Ndogo's field surveyors did an excellent job of documenting the status of the country's water crisis. Every province and town has many documented sources. This gives us confidence that the data we have is reliable enough to base our decisions on. This is an insight we can use to communicate data integrity, so let's make a note of that. Let's also check the percentage of location types in the location entity.

In [17]:
%%sql
# Count the records per location type (used to compute the percentages below)
SELECT
    location_type,
    COUNT(location_type) AS records_per_type
FROM location
GROUP BY location_type;

location_type,records_per_type
Urban,15910
Rural,23740


In [20]:
%sql SELECT ROUND(((15910 / (15910 + 23740)) * 100)) AS urban_percentage;

urban_percentage
40


In [21]:
%sql SELECT ROUND(((23740 / (15910 + 23740)) * 100)) AS rural_percentage;

rural_percentage
60
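
The percentages above were computed by typing the two counts into separate queries. As an alternative sketch, the share of each location type can be derived in one query by dividing each group's count by the table total. The example assumes an in-memory SQLite table with ten made-up rows (six rural, four urban); the SQL statement itself should also run on MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE location (location_id TEXT, location_type TEXT)")

# Hypothetical data: 6 rural and 4 urban locations.
rows = [(f"loc{i}", "Rural") for i in range(6)]
rows += [(f"loc{i}", "Urban") for i in range(6, 10)]
conn.executemany("INSERT INTO location VALUES (?, ?)", rows)

# Divide each group's count by the total row count from a scalar subquery.
# Multiplying by 100.0 forces floating-point division.
query = """
SELECT location_type,
       ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM location)) AS pct
FROM location
GROUP BY location_type
ORDER BY location_type
"""
percentages = dict(conn.execute(query).fetchall())
print(percentages)
```

Because the total comes from the table itself, the percentages update automatically if new locations are added.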


So again, what are some of the insights we gained from the location table?
1. Our entire country was properly canvassed, and our dataset represents the situation on the ground.
2. 60% of our water sources are in rural communities across Maji Ndogo. We need to keep this in mind when we make decisions.