# Filtering and analysing a summary statistic report

In this notebook, we demonstrate how to filter and analyse a summary statistic report using the `WHERE` and `HAVING` clauses.

## Connecting to our MySQL database

In [None]:
# Load and activate the SQL extension to allow us to execute SQL in a Jupyter notebook. 
# If you get an error here, make sure that mysql and pymysql are installed correctly. 

%load_ext sql

In [None]:
# Establish a connection to the local database using the '%sql' magic command.
# Replace 'password' with your connection password and 'united_nations' with your database name if it differs.
# If you get an error here, please make sure the database name or password is correct.

%sql mysql+pymysql://root:password@localhost:3306/united_nations
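
If the connection fails even though the password is correct, one possible cause is a password containing characters such as `@` or `/`, which have a special meaning inside a connection URL. The sketch below shows one way to percent-encode them with Python's standard library. The helper name `build_mysql_url` and the sample password are hypothetical and only serve as an illustration.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, port, db_name):
    """Build a mysql+pymysql connection URL (hypothetical helper).

    The user name and password are percent-encoded so that characters
    such as '@' or '/' are not mistaken for URL separators.
    """
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{db_name}"
    )


# 'p@ss/word' is a made-up password used only to show the encoding.
url = build_mysql_url("root", "p@ss/word", "localhost", 3306, "united_nations")
print(url)
```

The printed string can then be pasted after `%sql` in place of the URL used in the cell above.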

## Exercise

We started by finding the minimum, maximum, and average percentage of people who have access to managed drinking water services, the number of countries, and the total estimated GDP for each region and sub-region. We also ordered the results by total estimated GDP.

In [None]:
%%sql

SELECT Region, 
    Sub_region,
    MIN(Pct_managed_drinking_water_services) AS min_Pct_managed_drinking_water_services, 
    MAX(Pct_managed_drinking_water_services) AS max_Pct_managed_drinking_water_services, 
    AVG(Pct_managed_drinking_water_services) AS avg_Pct_managed_drinking_water_services,
    COUNT(DISTINCT(Country_name)) AS Number_of_countries,
    SUM(EST_gdp_in_billions) AS EST_total_gdp_in_billions 
FROM united_nations.Access_to_Basic_Services 
GROUP BY Region, Sub_region
ORDER BY EST_total_gdp_in_billions;

### 1. Filter for the year 2020.

Starting from the query above, use the `WHERE` clause to keep only the results where the time period is 2020.

### 2. Focus on countries where the percentage of managed drinking water services is below 60%.

Building on your query from step 1, use the `WHERE` clause to keep only the rows where the percentage of managed drinking water services is below 60%.

### 3. Filter for the regions and sub-regions that have fewer than four countries.

Use the `HAVING` clause to filter the results above so that they only include the regions and sub-regions where `Number_of_countries` is less than four.

## Solutions

### 1. Filter for the year 2020.

In [None]:
%%sql

SELECT Region, 
    Sub_region,
    MIN(Pct_managed_drinking_water_services) AS min_Pct_managed_drinking_water_services, 
    MAX(Pct_managed_drinking_water_services) AS max_Pct_managed_drinking_water_services, 
    AVG(Pct_managed_drinking_water_services) AS avg_Pct_managed_drinking_water_services, 
    COUNT(DISTINCT(Country_name)) AS Number_of_countries,
    SUM(EST_gdp_in_billions) AS EST_total_gdp_in_billions
FROM Access_to_Basic_Services 
WHERE Time_period = 2020
GROUP BY Region, Sub_region
ORDER BY EST_total_gdp_in_billions ASC;


### 2. Focus on countries where the percentage of managed drinking water services is below 60%.

In [None]:
%%sql

SELECT Region, 
    Sub_region,
    MIN(Pct_managed_drinking_water_services) AS min_Pct_managed_drinking_water_services, 
    MAX(Pct_managed_drinking_water_services) AS max_Pct_managed_drinking_water_services, 
    AVG(Pct_managed_drinking_water_services) AS avg_Pct_managed_drinking_water_services, 
    COUNT(DISTINCT(Country_name)) AS Number_of_countries,
    SUM(EST_gdp_in_billions) AS EST_total_gdp_in_billions
FROM Access_to_Basic_Services 
WHERE Time_period = 2020
    AND Pct_managed_drinking_water_services < 60
GROUP BY Region, Sub_region
ORDER BY EST_total_gdp_in_billions ASC;
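
One consequence of filtering with `WHERE` is worth checking: rows are removed before the groups are aggregated, so `Number_of_countries` now counts only the countries below 60% in each group, and a group with no such country disappears entirely. The following is a minimal sketch of that effect, written in Python with the standard-library `sqlite3` module instead of MySQL so that it runs without a database server. The `toy_services` table and its rows are made up for illustration and are not taken from `Access_to_Basic_Services`.

```python
import sqlite3

# In-memory SQLite database with made-up rows (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_services ("
    "Sub_region TEXT, Country_name TEXT, "
    "Pct_managed_drinking_water_services REAL)"
)
conn.executemany(
    "INSERT INTO toy_services VALUES (?, ?, ?)",
    [
        ("S1", "A", 45.0),
        ("S1", "B", 55.0),
        ("S1", "C", 90.0),
        ("S2", "D", 80.0),
    ],
)

# The same summary query, with an optional WHERE clause spliced in.
summary_sql = """
SELECT Sub_region,
    COUNT(DISTINCT Country_name) AS Number_of_countries,
    MAX(Pct_managed_drinking_water_services) AS max_pct
FROM toy_services
{where}
GROUP BY Sub_region
ORDER BY Sub_region;
"""

unfiltered = conn.execute(summary_sql.format(where="")).fetchall()
filtered = conn.execute(
    summary_sql.format(where="WHERE Pct_managed_drinking_water_services < 60")
).fetchall()

print(unfiltered)  # every country contributes to its group
print(filtered)    # only countries below 60% are counted
```

In this toy data, group `S1` drops from three countries to two once the filter is applied, and `S2` is no longer returned because none of its rows pass the filter.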

### 3. Filter for the regions and sub-regions that have fewer than four countries.

In [None]:
%%sql

SELECT Region, 
    Sub_region,
    MIN(Pct_managed_drinking_water_services) AS min_Pct_managed_drinking_water_services, 
    MAX(Pct_managed_drinking_water_services) AS max_Pct_managed_drinking_water_services, 
    AVG(Pct_managed_drinking_water_services) AS avg_Pct_managed_drinking_water_services, 
    COUNT(DISTINCT(Country_name)) AS Number_of_countries,
    SUM(EST_gdp_in_billions) AS EST_total_gdp_in_billions
FROM Access_to_Basic_Services 
WHERE Time_period = 2020
    AND Pct_managed_drinking_water_services < 60
GROUP BY Region, Sub_region
HAVING Number_of_countries < 4
ORDER BY EST_total_gdp_in_billions ASC;

The `WHERE` clause may come to mind first when applying this criterion. However, the criterion depends on `Number_of_countries`, an aggregate that only exists once the rows have been grouped by region and sub-region.
The `WHERE` clause filters individual rows before `GROUP BY` and the aggregate functions are evaluated, so it cannot refer to the country count.
The `HAVING` clause filters the groups after aggregation, which makes it the appropriate choice here.
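
The difference between the two clauses can be sketched on a small scale. The example below uses Python's standard-library `sqlite3` module rather than MySQL so that it runs without a database server. The `toy_services` table and its rows are hypothetical, and the exact error message may differ between database engines.

```python
import sqlite3

# Made-up rows for illustration only: (Region, Sub_region, Country_name).
rows = [
    ("R1", "S1", "A"),
    ("R1", "S1", "B"),
    ("R1", "S1", "C"),
    ("R1", "S1", "D"),
    ("R1", "S2", "E"),
    ("R2", "S3", "F"),
    ("R2", "S3", "G"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_services (Region TEXT, Sub_region TEXT, Country_name TEXT)"
)
conn.executemany("INSERT INTO toy_services VALUES (?, ?, ?)", rows)

# HAVING runs after grouping, so it can filter on the aggregated count.
having_query = """
SELECT Region, Sub_region, COUNT(DISTINCT Country_name) AS Number_of_countries
FROM toy_services
GROUP BY Region, Sub_region
HAVING Number_of_countries < 4
ORDER BY Region, Sub_region;
"""
small_groups = conn.execute(having_query).fetchall()
print(small_groups)

# WHERE runs before grouping, so an aggregate is not allowed there.
where_query = """
SELECT Region, Sub_region, COUNT(DISTINCT Country_name) AS Number_of_countries
FROM toy_services
WHERE COUNT(DISTINCT Country_name) < 4
GROUP BY Region, Sub_region;
"""
try:
    conn.execute(where_query)
    where_error = None
except sqlite3.Error as exc:
    where_error = str(exc)
print(where_error)
```

The `HAVING` query returns only the two groups with fewer than four countries, while the `WHERE` version is rejected by the database before any rows are read.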

With this report, we can answer questions like “Out of the sub-regions that meet the criteria, which sub-region has the lowest GDP?”
Since the results are sorted by estimated GDP in ascending order, the first row contains the answer: Melanesia, with an estimated GDP of 23.85 billion.
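
Instead of reading the first row by eye, the query itself can return only that row by adding `LIMIT 1` after the `ORDER BY`. The sketch below illustrates the idea in Python with the standard-library `sqlite3` module. The `toy_report` table stands in for the grouped report, and its sub-region names and GDP values are invented for illustration, not taken from the actual results.

```python
import sqlite3

# Made-up summary rows standing in for the grouped report (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_report (Sub_region TEXT, EST_total_gdp_in_billions REAL)"
)
conn.executemany(
    "INSERT INTO toy_report VALUES (?, ?)",
    [("S1", 310.5), ("S2", 12.0), ("S3", 87.25)],
)

# Sort ascending and keep only the first row: the lowest total GDP.
lowest = conn.execute(
    """
    SELECT Sub_region, EST_total_gdp_in_billions
    FROM toy_report
    ORDER BY EST_total_gdp_in_billions ASC
    LIMIT 1;
    """
).fetchone()
print(lowest)
```

If two groups share the same lowest value, `LIMIT 1` returns only one of them, so it is worth checking for ties before quoting a single answer.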
