# Basic Aggregation Functions Practice


    
## Implementation in queries

We will again use the PostgreSQL database to query the data and see how aggregate functions work.

Connect again by using the command:

In [2]:
%load_ext sql
%sql postgres://dsa_ro_user:readonly@pgsql.dsa.lan/dsa_ro

'Connected: dsa_ro_user@dsa_ro'


### COUNT

The main use of `COUNT` is to return the number of rows in a database table or table expression (such as the result of a join).

To do so, use `COUNT(*)` in the SELECT column list.
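As a minimal sketch of how `COUNT` behaves, the snippet below uses Python's built-in `sqlite3` module with an in-memory database instead of the course's PostgreSQL server. The table `demo_cities` and its rows are hypothetical and exist only for illustration.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the course's PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_cities (name TEXT, country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO demo_cities VALUES (?, ?, ?)",
    [
        ("Alpha", "Xland", 1500000),
        ("Beta", "Xland", 2500000),
        ("Gamma", "Yland", 3000000),
        ("Delta", "Yland", None),  # population unknown (NULL)
    ],
)

# COUNT(*) counts rows; a WHERE clause narrows which rows are counted.
total_rows = conn.execute("SELECT COUNT(*) FROM demo_cities").fetchone()[0]
xland_rows = conn.execute(
    "SELECT COUNT(*) FROM demo_cities WHERE country = 'Xland'"
).fetchone()[0]

# COUNT(column) skips rows where that column is NULL.
known_pops = conn.execute("SELECT COUNT(population) FROM demo_cities").fetchone()[0]

print(total_rows, xland_rows, known_pops)  # 4 2 3
```

Note the difference between `COUNT(*)`, which counts every row, and `COUNT(population)`, which counts only rows with a non-NULL population.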

## <span style="background:yellow">Your Turn</span>

Find the number of cities in China 

The number you receive should be 61
  


In [3]:
%%sql
SELECT count(*)
FROM cities
WHERE country = 'China';



 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
1 rows affected.


count
61


Find the number of cities in Canada with a population greater than 2,000,000

The number you receive should be 1

In [4]:
%%sql
SELECT count(*)
FROM cities
WHERE country = 'Canada' 
AND population > 2000000;





 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
1 rows affected.


count
1


### MIN

This function will allow you to return the minimum value of a given column in the database table.
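The following sketch illustrates `MIN` on a small hypothetical table, `demo_cities`, using Python's built-in `sqlite3` module as a stand-in for the PostgreSQL server (an assumption made so the example is self-contained). It also shows what comes back when no rows match.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the course's PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_cities (name TEXT, country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO demo_cities VALUES (?, ?, ?)",
    [
        ("Alpha", "Xland", 1500000),
        ("Beta", "Xland", 2500000),
        ("Gamma", "Yland", None),  # NULL values are ignored by MIN
    ],
)

# Smallest population among the rows that pass the WHERE filter.
smallest = conn.execute(
    "SELECT MIN(population) FROM demo_cities WHERE country = 'Xland'"
).fetchone()[0]

# With no matching rows, MIN still returns one row, holding NULL (None in Python).
no_match = conn.execute(
    "SELECT MIN(population) FROM demo_cities WHERE country = 'Nowhere'"
).fetchone()[0]

print(smallest, no_match)  # 1500000 None
```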

## <span style="background-color:yellow">Your Turn</span>
What if we wanted to find the minimum population among cities in Japan?

How would we write this?

The number you receive should be 1063100


In [5]:
%%sql 
SELECT min(population)
FROM cities 
WHERE country = 'Japan';





 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
1 rows affected.


min
1063100


### MAX

This function will allow you to return the maximum value of a given column in the database table.
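`MAX` returns only the value, not the row it came from. The sketch below, which assumes a hypothetical `demo_cities` table in an in-memory SQLite database rather than the course's PostgreSQL server, contrasts `MAX` with one common way to retrieve the whole row.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the course's PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_cities (name TEXT, country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO demo_cities VALUES (?, ?, ?)",
    [
        ("Alpha", "Xland", 1500000),
        ("Beta", "Xland", 2500000),
        ("Gamma", "Yland", 3000000),
    ],
)

# MAX gives the largest value in the column.
largest = conn.execute("SELECT MAX(population) FROM demo_cities").fetchone()[0]

# To see which city has that value, sort descending and keep the first row.
top_row = conn.execute(
    "SELECT name, population FROM demo_cities ORDER BY population DESC LIMIT 1"
).fetchone()

print(largest)  # 3000000
print(top_row)  # ('Gamma', 3000000)
```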


## <span style="background:yellow">Your Turn</span>

What if we wanted to find the maximum population among cities in Canada?

How would we write this?

The number you receive should be 2600000

In [6]:
%%sql 
SELECT max(population)
FROM cities 
WHERE country = 'Canada';



 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
1 rows affected.


max
2600000


### AVG

This function will return the average value of a given column in the database table. 
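One detail worth seeing in action is that `AVG` ignores NULL values, so it is the sum divided by the number of non-NULL values, not by the number of rows. The sketch below assumes a hypothetical `demo_cities` table in an in-memory SQLite database, used here in place of the PostgreSQL server.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the course's PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_cities (name TEXT, country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO demo_cities VALUES (?, ?, ?)",
    [
        ("Alpha", "Xland", 1000000),
        ("Beta", "Xland", 2000000),
        ("Gamma", "Xland", None),  # unknown population, skipped by AVG
    ],
)

# AVG divides by the two non-NULL values only.
avg_pop = conn.execute("SELECT AVG(population) FROM demo_cities").fetchone()[0]

# Dividing by COUNT(*) instead counts all three rows and gives a different answer.
per_row = conn.execute(
    "SELECT SUM(population) * 1.0 / COUNT(*) FROM demo_cities"
).fetchone()[0]

print(avg_pop, per_row)  # 1500000.0 1000000.0
```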

## <span style="background:yellow">Your Turn</span>

What if we wanted to find the average population of cities in the United States?

How would you write this?

The number you receive should be 2385623.076923076923


In [7]:
%%sql 
SELECT AVG(population)
FROM cities
WHERE country = 'United States';




 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
1 rows affected.


avg
2385623.076923077


### SUM

This function returns the sum of a given column's values across the selected rows.
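As a small illustration, the sketch below sums a column over rows from two countries, first with `OR` and then with the equivalent `IN` list. It assumes a hypothetical `demo_cities` table in an in-memory SQLite database rather than the course's PostgreSQL server.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the course's PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_cities (name TEXT, country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO demo_cities VALUES (?, ?, ?)",
    [
        ("Alpha", "Xland", 1000000),
        ("Beta", "Xland", 2000000),
        ("Gamma", "Yland", 3000000),
        ("Delta", "Zland", 4000000),  # excluded by both filters below
    ],
)

# Sum over rows matching either country.
or_total = conn.execute(
    "SELECT SUM(population) FROM demo_cities "
    "WHERE country = 'Xland' OR country = 'Yland'"
).fetchone()[0]

# The same filter written with IN.
in_total = conn.execute(
    "SELECT SUM(population) FROM demo_cities WHERE country IN ('Xland', 'Yland')"
).fetchone()[0]

print(or_total, in_total)  # 6000000 6000000
```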


## <span style="background:yellow">Your Turn</span>
What if we wanted the total population of all the listed cities in Canada or Mexico?

The number you should receive is 30334100

In [8]:
%%sql
SELECT SUM(population)
FROM cities
WHERE country = 'Canada'
OR country = 'Mexico';





 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
1 rows affected.


sum
30334100





# GROUP BY

`GROUP BY` groups all the records with the same value for the specified grouping field(s) together so that aggregation can process each set separately. 


Think of each **group** as a set of rows from the table.

Each attribute that is in the SELECT column set and not used in an aggregate function must appear in the `GROUP BY` clause.
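The grouping rule above can be sketched on a toy table. This example assumes a hypothetical `demo_cities` table in an in-memory SQLite database, standing in for the PostgreSQL server. The `country` column appears in both the SELECT list and the `GROUP BY` clause, while `population` is used only inside aggregate functions.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the course's PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_cities (name TEXT, country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO demo_cities VALUES (?, ?, ?)",
    [
        ("Alpha", "Xland", 1000000),
        ("Beta", "Xland", 3000000),
        ("Gamma", "Yland", 5000000),
    ],
)

# One output row per country; the aggregates run separately on each group.
# ORDER BY is added only to make the row order predictable.
per_country = conn.execute(
    """
    SELECT country, COUNT(*), AVG(population)
    FROM demo_cities
    GROUP BY country
    ORDER BY country
    """
).fetchall()

for row in per_country:
    print(row)
# ('Xland', 2, 2000000.0)
# ('Yland', 1, 5000000.0)
```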

## <span style="background:yellow">Your Turn</span>

Write a SELECT statement to display each country's name and its average city population



In [10]:
%%sql
SELECT country, AVG(population)
FROM cities 
GROUP BY country;






 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
97 rows affected.


country,avg
Burkina Faso,1086500.0
Bangladesh,3995833.333333333
Indonesia,2259890.9090909087
Italy,2050050.0
Venezuela,2793325.0
Uruguay,1338400.0
Burma,2842850.0
Cameroon,1318800.0
Czech Republic,1243200.0
Sweden,1253300.0


Write a SELECT statement to display each country's name and the population of the city with the highest population

In [11]:
%%sql
SELECT country, MAX(population)
FROM cities 
GROUP BY country ;



 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
97 rows affected.


country,max
Burkina Faso,1086500
Bangladesh,6725000
Indonesia,9588200
Italy,2863300
Venezuela,5808900
Uruguay,1338400
Burma,4477600
Cameroon,1338100
Czech Republic,1243200
Sweden,1253300


# HAVING Clause

The `HAVING` clause filters groups by a condition on an aggregate value, so the query returns only the groups for which the aggregate comparison is true.


In [13]:
%%sql 
SELECT country, count(*) 
FROM cities 
GROUP BY country 
HAVING count(*) > 10;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
8 rows affected.


country,count
Indonesia,11
India,38
Japan,14
United States,13
Russia,12
China,61
Brazil,15
Mexico,11


This means a country is listed in the results only if it appears in more than 10 rows of `cities` (`count(*) > 10`).
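It can help to contrast `HAVING` with `WHERE`: `WHERE` filters individual rows before grouping, while `HAVING` filters whole groups after the aggregates are computed. The sketch below assumes a hypothetical `demo_cities` table in an in-memory SQLite database, used in place of the PostgreSQL server.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the course's PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_cities (name TEXT, country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO demo_cities VALUES (?, ?, ?)",
    [
        ("A", "Xland", 1000000),
        ("B", "Xland", 3000000),
        ("C", "Xland", 5000000),
        ("D", "Yland", 2000000),
        ("E", "Zland", 4000000),
        ("F", "Zland", 6000000),
    ],
)

# HAVING alone: keep countries with more than one city row.
having_only = conn.execute(
    """
    SELECT country, COUNT(*)
    FROM demo_cities
    GROUP BY country
    HAVING COUNT(*) > 1
    ORDER BY country
    """
).fetchall()

# WHERE first drops small cities, then HAVING filters the remaining groups.
where_then_having = conn.execute(
    """
    SELECT country, COUNT(*)
    FROM demo_cities
    WHERE population >= 4000000
    GROUP BY country
    HAVING COUNT(*) > 1
    ORDER BY country
    """
).fetchall()

print(having_only)        # [('Xland', 3), ('Zland', 2)]
print(where_then_having)  # [('Zland', 2)]
```

In the second query Xland drops out because only one of its rows survives the `WHERE` filter, so its group no longer satisfies `COUNT(*) > 1`.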



## <span style="background:yellow">Your Turn</span>

Write a SELECT statement to show each country's name and its average city population, but only for countries whose largest city has fewer than 5,000,000 people


In [15]:
%%sql
SELECT country, AVG(population)
FROM cities 
GROUP BY country
HAVING MAX(population) < 5000000;





 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
71 rows affected.


country,avg
Burkina Faso,1086500.0
Italy,2050050.0
Uruguay,1338400.0
Burma,2842850.0
Cameroon,1318800.0
Czech Republic,1243200.0
Sweden,1253300.0
Uganda,1353200.0
Jordan,1919000.0
Dominican Republic,2093500.0


Write a SELECT statement to show each country's name and the population of its smallest city but only for countries with an average city population between 2 and 5 million people

In [17]:
%%sql
SELECT country, MIN(population)
FROM cities
GROUP BY country
HAVING AVG(population) BETWEEN 2000000 AND 5000000;






 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
44 rows affected.


country,min
Bangladesh,1342300
Indonesia,1198100
Italy,1236800
Venezuela,1385100
Burma,1208100
Dominican Republic,1200000
Germany,1260400
Uzbekistan,2140500
South Korea,1031900
Colombia,1380400


# Combining JOIN and GROUPING for aggregates

As foreshadowed, the true power of the relational database comes from combining tables and computing statistics.

Consider the following database tables:
  * us_second_order_divisions
  * util_us_states

```SQL
dsa_ro=> \d us_second_order_divisions
        Table "public.us_second_order_divisions"
       Column       |          Type          | Modifiers 
--------------------+------------------------+-----------
 state_number_code  | smallint               | not null
 county_number_code | character varying(5)   | not null
 county_name        | character varying(100) | 
Indexes:
    "us_second_order_divisions_pkey" PRIMARY KEY, btree (state_number_code, county_number_code)

dsa_ro=> \d util_us_states
             Table "public.util_us_states"
      Column       |         Type          | Modifiers 
-------------------+-----------------------+-----------
 state_alpha_code  | character(2)          | not null
 state_number_code | smallint              | 
 state_name        | character varying(50) | 
Indexes:
    "util_us_states_pkey" PRIMARY KEY, btree (state_alpha_code)
    "util_us_states_state_number_code" btree (state_number_code)
```

Imagine we want a list of the state names and the number of counties per state.
What would the SQL look like?

We will build it up in pieces, to help you develop a methodology of query construction.

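**First**: We see that counties are listed in the `us_second_order_divisions` table.
We can go there for a count of the number of counties per state.

**Second**: State names are stored in the `util_us_states` table, so we join the two tables on their shared `state_number_code` column.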
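The pattern of joining first and then grouping can be sketched on two toy tables. The example below assumes hypothetical tables `demo_states` and `demo_counties` in an in-memory SQLite database, not the course's PostgreSQL tables. It also shows a `LEFT JOIN` variant, which is one way to keep states that have no matching counties.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the course's PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE demo_states (state_number_code INTEGER, state_name TEXT);
    CREATE TABLE demo_counties (state_number_code INTEGER, county_name TEXT);
    INSERT INTO demo_states VALUES (1, 'AAA'), (2, 'BBB'), (3, 'CCC');
    INSERT INTO demo_counties VALUES (1, 'North'), (1, 'South'), (2, 'East');
    """
)

# Inner join: only states with at least one county appear.
inner = conn.execute(
    """
    SELECT S.state_name, COUNT(*)
    FROM demo_counties AS C
    JOIN demo_states AS S ON C.state_number_code = S.state_number_code
    GROUP BY S.state_name
    ORDER BY S.state_name
    """
).fetchall()

# Left join from states: COUNT(C.county_name) is 0 when no county matched.
outer = conn.execute(
    """
    SELECT S.state_name, COUNT(C.county_name)
    FROM demo_states AS S
    LEFT JOIN demo_counties AS C ON C.state_number_code = S.state_number_code
    GROUP BY S.state_name
    ORDER BY S.state_name
    """
).fetchall()

print(inner)  # [('AAA', 2), ('BBB', 1)]
print(outer)  # [('AAA', 2), ('BBB', 1), ('CCC', 0)]
```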

## <span style="background:yellow">Your Turn</span>
Write a SELECT statement that shows the state name and number of counties for states with fewer than 20 counties



In [18]:
%%sql
SELECT state_name, count(*)
FROM us_second_order_divisions as C
JOIN util_us_states as S 
    ON (C.state_number_code=S.state_number_code)
GROUP BY S.state_name
HAVING COUNT(*) < 20;





 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
18 rows affected.


state_name,count
MASSACHUSETTS,14
,9
RHODE ISLAND,5
GUAM,1
ARIZONA,15
NEVADA,17
PALAU,16
HAWAII,5
AMERICAN SAMOA,5
VERMONT,14


Write a SELECT statement that shows the names of the five states with the fewest counties.  
Write the code initially in the EXPLAIN cell, then copy the SQL without EXPLAIN into the next cell.

In [19]:
%%sql
EXPLAIN
SELECT state_name, count(*)
FROM us_second_order_divisions as C 
JOIN util_us_states as S 
    ON (C.state_number_code = S.state_number_code)
GROUP BY S.state_name
ORDER BY COUNT(*) asc
LIMIT 5 ;





 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
10 rows affected.


QUERY PLAN
Limit (cost=117.65..117.66 rows=5 width=28)
-> Sort (cost=117.65..117.80 rows=59 width=28)
Sort Key: (count(*))
-> HashAggregate (cost=116.08..116.67 rows=59 width=28)
Group Key: s.state_name
-> Hash Join (cost=2.35..99.61 rows=3295 width=10)
Hash Cond: (c.state_number_code = s.state_number_code)
-> Seq Scan on us_second_order_divisions c (cost=0.00..51.95 rows=3295 width=2)
-> Hash (cost=1.60..1.60 rows=60 width=12)
-> Seq Scan on util_us_states s (cost=0.00..1.60 rows=60 width=12)


In [20]:
%%sql

SELECT state_name, count(*)
FROM us_second_order_divisions as C 
JOIN util_us_states as S 
    ON (C.state_number_code = S.state_number_code)
GROUP BY S.state_name
ORDER BY COUNT(*) asc
LIMIT 5 ;





 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
5 rows affected.


state_name,count
DISTRICT OF COLUMBIA,1
GUAM,1
VIRGIN ISLANDS,3
DELAWARE,3
NORTHERN MARIANA ISLANDS,4


# Save your Notebook, then `File > Close and Halt`

---