# Beginning Our Data-driven Journey in Maji Ndogo

## Introduction

In this first part of the integrated project, we dive into Maji Ndogo's expansive dataset, which contains just over 60,000 records spread across several tables. As we navigate this trove of data, we'll use basic queries to familiarise ourselves with the contents of each table in the database. We'll also use SQL **Data Manipulation Language (DML)** to refine some data points while we're at it.

## Notebook Setup

In [1]:
# Load the sql extension
%load_ext sql

In [2]:
# Create a connection to the mysql 'md_water_services' database
%sql mysql+pymysql://root:password@localhost:3306/md_water_services

## Familiarising Ourselves With the Data

Let's start by reviewing the first few records of each table to get a high-level overview of what our data looks like. First, let's see which tables are in Maji Ndogo's database.

In [3]:
%sql SHOW TABLES

Tables_in_md_water_services
data_dictionary
employee
global_water_access
location
visits
water_quality
water_source
well_pollution


We can see that we have a total of **8** tables. Let's see what each of these tables contains, starting with the `data_dictionary` table.

In [4]:
%sql SELECT * FROM data_dictionary;

table_name,column_name,description,datatype,related_to
employee,assigned_employee_id,Unique ID assigned to each employee,INT,visits
employee,employee_name,Name of the employee,VARCHAR(255),
employee,phone_number,Contact number of the employee,VARCHAR(15),
employee,email,Email address of the employee,VARCHAR(255),
employee,address,Residential address of the employee,VARCHAR(255),
employee,town_name,Name of the town where the employee resides,VARCHAR(255),
employee,province_name,Name of the province where the employee resides,VARCHAR(255),
employee,position,Position or job title of the employee,VARCHAR(255),
visits,record_id,Unique ID assigned to each visit,int,"water_quality, water_source"
visits,location_id,ID of the location visited,varchar(255),location


The data dictionary holds a description of every column, per table, in the database. To get a specific table's column names along with each column's description, data type, and related table, we can filter on `table_name` with a query like the one below.

In [5]:
%sql SELECT column_name, description, datatype, related_to FROM data_dictionary WHERE table_name = "employee";

column_name,description,datatype,related_to
assigned_employee_id,Unique ID assigned to each employee,INT,visits
employee_name,Name of the employee,VARCHAR(255),
phone_number,Contact number of the employee,VARCHAR(15),
email,Email address of the employee,VARCHAR(255),
address,Residential address of the employee,VARCHAR(255),
town_name,Name of the town where the employee resides,VARCHAR(255),
province_name,Name of the province where the employee resides,VARCHAR(255),
position,Position or job title of the employee,VARCHAR(255),


The information above tells us that the `employee` table has **8** columns, one of which, `assigned_employee_id`, seems to be a primary key that is used to reference information in the `visits` table. We can also list the tables that have a recorded relationship to another table by running a query like this:

In [20]:
%%sql
# Retrieve related tables
SELECT DISTINCT table_name
FROM data_dictionary
WHERE related_to != "";

table_name
employee
visits
water_quality
water_source
well_pollution
location
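
The table names alone do not show which column carries each relationship. A minimal sketch of a query that returns the linking columns as well is given here. It runs on Python's built-in `sqlite3` module rather than the MySQL database, and the in-memory table holds only a handful of rows copied from the `data_dictionary` output earlier, so both the engine and the reduced table are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite stand-in for md_water_services (assumption: the real
# database is MySQL; only a few data_dictionary rows are reproduced here).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE data_dictionary (
        table_name  TEXT,
        column_name TEXT,
        datatype    TEXT,
        related_to  TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO data_dictionary VALUES (?, ?, ?, ?)",
    [
        ("employee", "assigned_employee_id", "INT", "visits"),
        ("employee", "employee_name", "VARCHAR(255)", ""),
        ("employee", "phone_number", "VARCHAR(15)", ""),
        ("visits", "record_id", "int", "water_quality, water_source"),
        ("visits", "location_id", "varchar(255)", "location"),
    ],
)

# Keep only the columns that point at another table.
relationship_query = """
SELECT table_name, column_name, related_to
FROM data_dictionary
WHERE related_to <> ''
ORDER BY table_name, column_name;
"""
relationships = conn.execute(relationship_query).fetchall()
for table_name, column_name, related_to in relationships:
    print(f"{table_name}.{column_name} -> {related_to}")
```

The same `SELECT` statement should be usable in a `%%sql` cell against the real database, since it relies only on columns shown in the `data_dictionary` output.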


We can see that **6** tables have a recorded relationship with another table according to the `data_dictionary` table. With the `data_dictionary` table as our map and the `md_water_services` database as our landscape, we now know how to navigate our data. Next, let's view the first few rows of every table except `data_dictionary`, which we already know is a reference point for the real data in the database. You can run the query below multiple times, changing the table name in the `FROM` clause each time, to display the first 10 records and their attributes for each table/entity.

In [19]:
%sql SELECT * FROM employee LIMIT 10;

assigned_employee_id,employee_name,phone_number,email,address,province_name,town_name,position
0,Amara Jengo,99637993287,,36 Pwani Mchangani Road,Sokoto,Ilanga,Field Surveyor
1,Bello Azibo,99643864786,,129 Ziwa La Kioo Road,Kilimani,Rural,Field Surveyor
2,Bakari Iniko,99222599041,,18 Mlima Tazama Avenue,Hawassa,Rural,Field Surveyor
3,Malachi Mavuso,99945849900,,100 Mogadishu Road,Akatsi,Lusaka,Field Surveyor
4,Cheche Buhle,99381679640,,1 Savanna Street,Akatsi,Rural,Field Surveyor
5,Zuriel Matembo,99034075111,,26 Bahari Ya Faraja Road,Kilimani,Rural,Field Surveyor
6,Deka Osumare,99379364631,,104 Kenyatta Street,Akatsi,Rural,Field Surveyor
7,Lalitha Kaburi,99681623240,,145 Sungura Amanpour Road,Kilimani,Rural,Field Surveyor
8,Enitan Zuri,99248509202,,117 Kampala Road,Hawassa,Zanzibar,Field Surveyor
10,Farai Nia,99570082739,,33 Angélique Kidjo Avenue,Amanzi,Dahabu,Field Surveyor
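
Rather than editing the table name by hand for every run, the preview statements can be generated in one go. The sketch below builds one `SELECT ... LIMIT` statement per table; the table names are copied from the `SHOW TABLES` output, while the helper name `preview_query` and its `n_rows` parameter are hypothetical and introduced only for this illustration.

```python
# Table names copied from the SHOW TABLES output, minus data_dictionary.
TABLES = [
    "employee",
    "global_water_access",
    "location",
    "visits",
    "water_quality",
    "water_source",
    "well_pollution",
]


def preview_query(table_name: str, n_rows: int = 10) -> str:
    """Build a statement returning at most n_rows rows of table_name.

    Hypothetical helper: the name is checked against the known table list
    because identifiers cannot be passed as bound query parameters.
    """
    if table_name not in TABLES:
        raise ValueError(f"unknown table: {table_name}")
    return f"SELECT * FROM {table_name} LIMIT {int(n_rows)};"


preview_queries = [preview_query(table) for table in TABLES]
for statement in preview_queries:
    print(statement)
```

Each printed statement can then be pasted into its own `%sql` cell.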


## Diving Into the Water Sources

Now that we are familiar with what each entity in our database contains, we can dive deeper into specific aspects of our database. A good starting point is understanding the types of water sources recorded in the database. To get that information, we can inspect the `water_source` table.

In [21]:
%sql SELECT DISTINCT type_of_water_source FROM md_water_services.water_source;

type_of_water_source
tap_in_home
tap_in_home_broken
well
shared_tap
river


We can see that we have **5** unique types of water sources recorded in our database. Understanding what each of these types means is essential for producing reports that support sound, data-driven decisions.

## Unpacking the Visits to Water Sources

The `visits` entity in the database logs information about a water source every time that source is visited. From our data exploration above, we also noticed that there is a `time_in_queue` attribute in this entity. Let's experiment and retrieve records from this entity `WHERE time_in_queue > 500`.

In [23]:
%%sql 
SELECT * 
FROM md_water_services.visits 
WHERE time_in_queue > 500 
ORDER BY time_in_queue DESC;

record_id,location_id,source_id,time_of_record,visit_count,time_in_queue,assigned_employee_id
30007,AmRu14612,AmRu14612224,2022-04-02 08:55:00,2,539,8
51858,HaRu19538,HaRu19538224,2023-03-04 18:04:00,3,539,4
53278,AkRu05704,AkRu05704224,2023-03-25 13:48:00,2,539,36
45317,HaRu20126,HaRu20126224,2022-11-19 14:22:00,6,538,16
57408,SoRu35388,SoRu35388224,2023-05-27 08:52:00,5,538,1
20372,KiZu31117,KiZu31117224,2021-11-06 09:37:00,3,537,10
33650,KiRu29348,KiRu29348224,2022-05-28 12:58:00,2,537,10
31310,SoRu37865,SoRu37865224,2022-04-23 06:01:00,2,535,40
38947,SoRu38095,SoRu38095224,2022-08-13 13:48:00,6,535,30
52264,HaRu17383,HaRu17383224,2023-03-11 07:10:00,5,535,30


We can further investigate which `type_of_water_source` has such a long `time_in_queue`. To do this, let's take the first three `source_id` values from the `visits` output above and search for them in the `water_source` entity.

In [26]:
%%sql
SELECT 
    source_id,
    type_of_water_source,
    number_of_people_served
FROM water_source
WHERE source_id IN ("AmRu14612224", "HaRu19538224", "AkRu05704224");

source_id,type_of_water_source,number_of_people_served
AkRu05704224,shared_tap,3398
AmRu14612224,shared_tap,3118
HaRu19538224,shared_tap,3142
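
The two-step lookup above (filter `visits`, then search `water_source` by hand) can also be expressed as a single join on `source_id`. The sketch below is illustrative only: it runs on Python's built-in `sqlite3` module instead of MySQL, and the two in-memory tables hold just the three rows and a subset of the columns shown in the outputs above.

```python
import sqlite3

# In-memory SQLite stand-in (assumption: the real database is MySQL, and
# the real tables have more columns and far more rows than shown here).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE visits (
        record_id     INTEGER PRIMARY KEY,
        source_id     TEXT,
        visit_count   INTEGER,
        time_in_queue INTEGER
    );
    CREATE TABLE water_source (
        source_id               TEXT PRIMARY KEY,
        type_of_water_source    TEXT,
        number_of_people_served INTEGER
    );
    """
)
# Rows copied from the visits and water_source outputs above.
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?, ?)",
    [
        (30007, "AmRu14612224", 2, 539),
        (51858, "HaRu19538224", 3, 539),
        (53278, "AkRu05704224", 2, 539),
    ],
)
conn.executemany(
    "INSERT INTO water_source VALUES (?, ?, ?)",
    [
        ("AkRu05704224", "shared_tap", 3398),
        ("AmRu14612224", "shared_tap", 3118),
        ("HaRu19538224", "shared_tap", 3142),
    ],
)

# Attach the source type to every visit with a queue time above 500.
long_queue_query = """
SELECT v.record_id,
       v.source_id,
       v.time_in_queue,
       ws.type_of_water_source,
       ws.number_of_people_served
FROM visits AS v
JOIN water_source AS ws
    ON ws.source_id = v.source_id
WHERE v.time_in_queue > 500
ORDER BY v.time_in_queue DESC, v.record_id;
"""
long_queue_rows = conn.execute(long_queue_query).fetchall()
for row in long_queue_rows:
    print(row)
```

A join avoids copying `source_id` values between cells, so the result stays correct if the threshold in the `WHERE` clause changes.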


We can see that these are water sources of the type `shared_tap`, each serving more than **3000** people. Keep in mind that, according to the project description, some sources were visited more than once by the surveyors to see whether there was a change in `time_in_queue`.

## Assessing the Quality of Water Sources

## Investigating any Pollution Issues