# Joining Data in SQL

Now that you've learned the basics of SQL, it's time to supercharge your queries using joins and relational set theory. In this notebook, you'll learn all about the power of joining tables while exploring interesting features of countries and their cities throughout the world. You will master inner and outer joins, as well as self joins, semi joins, anti joins and cross joins—fundamental tools in any PostgreSQL wizard's toolbox. Never fear set theory again after learning all about unions, intersections, and except clauses through easy-to-understand diagrams and examples. Lastly, you'll be introduced to the challenging topic of subqueries. You will be able to visually grasp these ideas by using Venn diagrams and other linking illustrations.

In [1]:
! postgres --version

postgres (PostgreSQL) 11.3


In [2]:
import pandas as pd
import psycopg2 as pg

In [None]:
# To set up the database:
# In pgAdmin (http://127.0.0.1:50822/browser/):
# - create a user with the same name as your macOS user
# - create the countries database with that user as the owner
# In a terminal, in the countries2 folder:
# > psql -U ksatola countries < countries.sql

In [3]:
conn = pg.connect(database="countries",user="ksatola", password="ksroot")

In [4]:
sql ='''
SELECT * 
FROM countries 
LIMIT 5
'''
df = pd.read_sql(sql, conn)
df.head()

Unnamed: 0,code,name,continent,region,surface_area,indep_year,local_name,gov_form,capital,cap_long,cap_lat
0,AFG,Afghanistan,Asia,Southern and Central Asia,652090.0,1919.0,Afganistan/Afqanestan,Islamic Emirate,Kabul,69.1761,34.5228
1,NLD,Netherlands,Europe,Western Europe,41526.0,1581.0,Nederland,Constitutional Monarchy,Amsterdam,4.89095,52.3738
2,ALB,Albania,Europe,Southern Europe,28748.0,1912.0,Shqiperia,Republic,Tirane,19.8172,41.3317
3,DZA,Algeria,Africa,Northern Africa,2381740.0,1962.0,Al-Jazair/Algerie,Republic,Algiers,3.05097,36.7397
4,ASM,American Samoa,Oceania,Polynesia,199.0,,Amerika Samoa,US Territory,Pago Pago,-170.691,-14.2846


## Inner join

You'll be working with the `countries` database containing information about the most populous world cities as well as country-level economic data, population data, and geographic data. This countries database also contains information on languages spoken in each country.

Here is the basic syntax for an `INNER JOIN`, selecting all columns from both tables:

In [5]:
sql ='''
SELECT * 
FROM cities
  -- 1. Inner join to countries
  INNER JOIN countries
    -- 2. Match on the country codes
    ON cities.country_code = countries.code;
'''
df = pd.read_sql(sql, conn)
df.head()

Unnamed: 0,name,country_code,city_proper_pop,metroarea_pop,urbanarea_pop,code,name.1,continent,region,surface_area,indep_year,local_name,gov_form,capital,cap_long,cap_lat
0,Abidjan,CIV,4765000.0,,4765000.0,CIV,Cote d'Ivoire,Africa,Western Africa,322463.0,1960,Cote dIvoire,Republic,Yamoussoukro,-4.0305,5.332
1,Abu Dhabi,ARE,1145000.0,,1145000.0,ARE,United Arab Emirates,Asia,Middle East,83600.0,1971,Al-Imarat al-´Arabiya al-Muttahida,Emirate Federation,Abu Dhabi,54.3705,24.4764
2,Abuja,NGA,1235880.0,6000000.0,1235880.0,NGA,Nigeria,Africa,Western Africa,923768.0,1960,Nigeria,Federal Republic,Abuja,7.48906,9.05804
3,Accra,GHA,2070460.0,4010050.0,2070460.0,GHA,Ghana,Africa,Western Africa,238533.0,1957,Ghana,Republic,Accra,-0.20795,5.57045
4,Addis Ababa,ETH,3103670.0,4567860.0,3103670.0,ETH,Ethiopia,Africa,Eastern Africa,1104300.0,-1000,YeItyop´iya,Republic,Addis Ababa,38.7468,9.02274


In [8]:
sql ='''
-- 1. Select name fields (with alias) and region 
SELECT cities.name AS city,
countries.name AS country,
countries.region
FROM cities
INNER JOIN countries
ON cities.country_code = countries.code
'''
df = pd.read_sql(sql, conn)
df.head()

Unnamed: 0,city,country,region
0,Abidjan,Cote d'Ivoire,Western Africa
1,Abu Dhabi,United Arab Emirates,Middle East
2,Abuja,Nigeria,Western Africa
3,Accra,Ghana,Western Africa
4,Addis Ababa,Ethiopia,Eastern Africa


## Inner join (2)

Instead of writing the full table name, you can use table aliasing as a shortcut. As with fields, you use `AS` to add the alias, placing it immediately after the table name and separated from it by a space. Check out the aliasing of `countries` and `economies` below.

Notice that to select a field in your query that appears in multiple tables, you'll need to identify which table/table alias you're referring to by using a `.` in your `SELECT` statement.
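To see why the table prefix matters, here is a minimal sketch that runs against an in-memory SQLite database rather than the PostgreSQL `countries` database. The two toy tables, their rows, and the aliases `ci` and `co` are assumptions made for illustration only. The first query selects `name` without a prefix and fails, while the second qualifies each field with its alias.

```python
import sqlite3

# Toy in-memory stand-ins for cities and countries (illustrative subset only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cities (name TEXT, country_code TEXT);
CREATE TABLE countries (code TEXT, name TEXT);
INSERT INTO cities VALUES ('Accra', 'GHA'), ('Abuja', 'NGA');
INSERT INTO countries VALUES ('GHA', 'Ghana'), ('NGA', 'Nigeria');
""")

# `name` exists in both tables, so an unqualified reference is ambiguous.
try:
    conn.execute("""
        SELECT name
        FROM cities AS ci
          INNER JOIN countries AS co
            ON ci.country_code = co.code
    """)
    ambiguous = False
except sqlite3.OperationalError as err:
    ambiguous = True
    print("Error:", err)

# Prefixing each field with its table alias resolves the ambiguity.
rows = conn.execute("""
    SELECT ci.name AS city, co.name AS country
    FROM cities AS ci
      INNER JOIN countries AS co
        ON ci.country_code = co.code
    ORDER BY ci.name
""").fetchall()
print(rows)
```

PostgreSQL reports a similar ambiguity error for the unqualified version, although the exact message differs from SQLite's.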

You'll now explore a way to get data from both the countries and economies tables to examine the inflation rate for both 2010 and 2015.

Sometimes it's easier to write SQL code out of order: you write the `SELECT` statement after you've done the `JOIN`.

- Join the tables `countries` (left) and `economies` (right) aliasing countries AS `c` and economies AS `e`.
- Specify the field to match the tables ON.
- From this join, SELECT:
    - c.code, aliased as country_code.
    - name, year, and inflation_rate, not aliased.

In [9]:
sql ='''
-- 3. Select fields with aliases
SELECT c.code AS country_code, c.name, e.year, inflation_rate
FROM countries AS c
  -- 1. Join to economies (alias e)
  INNER JOIN economies AS e
    -- 2. Match on code
    ON c.code = e.code;
'''
df = pd.read_sql(sql, conn)
df.head()

Unnamed: 0,country_code,name,year,inflation_rate
0,AFG,Afghanistan,2010,2.179
1,AFG,Afghanistan,2015,-1.549
2,AGO,Angola,2010,14.48
3,AGO,Angola,2015,10.287
4,ALB,Albania,2010,3.605


## Inner join (3)

The ability to combine multiple joins in a single query is a powerful feature of SQL.

With several joins in one query, it becomes tedious to keep writing long table names. This is when it becomes useful to alias each table using the first letter of its name (e.g. `countries AS c`). It is standard practice to alias in this way and, if you choose to alias tables or are specifically asked to in an exercise in this course, you should follow this convention.
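As a rough illustration of chaining two joins, the sketch below uses an in-memory SQLite database with three hypothetical tables (`left_table`, `right_table`, `another_table`) that share an `id` field. None of these names or values come from the `countries` database. It runs the same query twice, once with full table names and once with first-letter aliases, to show that the results are identical.

```python
import sqlite3

# Hypothetical tables sharing an `id` field (made-up values).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE left_table (id INTEGER, l_val TEXT);
CREATE TABLE right_table (id INTEGER, r_val TEXT);
CREATE TABLE another_table (id INTEGER, a_val TEXT);
INSERT INTO left_table VALUES (1, 'L1'), (2, 'L2'), (3, 'L3');
INSERT INTO right_table VALUES (1, 'R1'), (2, 'R2');
INSERT INTO another_table VALUES (2, 'A2'), (3, 'A3');
""")

# Two chained inner joins written with full table names.
long_form = conn.execute("""
    SELECT left_table.id, left_table.l_val, right_table.r_val, another_table.a_val
    FROM left_table
      INNER JOIN right_table
        ON left_table.id = right_table.id
      INNER JOIN another_table
        ON left_table.id = another_table.id
""").fetchall()

# The same query with first-letter aliases.
aliased = conn.execute("""
    SELECT l.id, l.l_val, r.r_val, a.a_val
    FROM left_table AS l
      INNER JOIN right_table AS r
        ON l.id = r.id
      INNER JOIN another_table AS a
        ON l.id = a.id
""").fetchall()

print(long_form)
print(aliased)
```

Only `id` 2 appears in all three tables, so both queries return that single row.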

Now, for each country, you want to get the country name, its region, and the fertility rate and unemployment rate for both 2010 and 2015.

Note that results should work throughout this course with or without table aliasing unless specified differently.

- Inner join countries (left) and populations (right) on the code and country_code fields respectively.
- Alias countries AS c and populations AS p.
- Select code, name, and region from countries and also select year and fertility_rate from populations (5 fields in total).

In [10]:
sql ='''
-- 4. Select fields
SELECT c.code, c.name, c.region, p.year, p.fertility_rate
  -- 1. From countries (alias as c)
  FROM countries AS c
  -- 2. Join with populations (as p)
  INNER JOIN populations AS p
    -- 3. Match on country code
    ON c.code = p.country_code
'''
df = pd.read_sql(sql, conn)
df.head()

Unnamed: 0,code,name,region,year,fertility_rate
0,ABW,Aruba,Caribbean,2010,1.704
1,ABW,Aruba,Caribbean,2015,1.647
2,AFG,Afghanistan,Southern and Central Asia,2010,5.746
3,AFG,Afghanistan,Southern and Central Asia,2015,4.653
4,AGO,Angola,Central Africa,2010,6.416


- Add an additional inner join with economies to your previous query by joining on code.
- Include the unemployment_rate column that became available through joining with economies.
- Note that year appears in both populations and economies, so you have to explicitly use e.year instead of year as you did before.

In [11]:
sql ='''
-- 6. Select fields
SELECT c.code, name, region, e.year, fertility_rate, e.unemployment_rate
  -- 1. From countries (alias as c)
  FROM countries AS c
  -- 2. Join to populations (as p)
  INNER JOIN populations AS p
    -- 3. Match on country code
    ON c.code = p.country_code
  -- 4. Join to economies (as e)
  INNER JOIN economies AS e
    -- 5. Match on country code
    ON c.code = e.code;
'''
df = pd.read_sql(sql, conn)
df.head()

Unnamed: 0,code,name,region,year,fertility_rate,unemployment_rate
0,AFG,Afghanistan,Southern and Central Asia,2015,5.746,
1,AFG,Afghanistan,Southern and Central Asia,2010,5.746,
2,AFG,Afghanistan,Southern and Central Asia,2015,4.653,
3,AFG,Afghanistan,Southern and Central Asia,2010,4.653,
4,AGO,Angola,Central Africa,2015,6.416,


- The trouble with doing your last join on c.code = e.code and not also including year is that e.g. the 2010 value for fertility_rate is also paired with the 2015 value for unemployment_rate.
- Fix your previous query: in your last ON clause, use AND to add an additional joining condition. In addition to joining on code in c and e, also join on year in e and p.
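The effect of the extra condition can be sketched on a tiny example. The code below uses an in-memory SQLite database with a single made-up country code `XXX` and invented rates, so only the row counts matter, not the values. Joining on the code alone pairs every population year with every economy year, while adding the year condition keeps only matching years.

```python
import sqlite3

# One fictional country with two years in each table (made-up values).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE populations (country_code TEXT, year INTEGER, fertility_rate REAL);
CREATE TABLE economies (code TEXT, year INTEGER, unemployment_rate REAL);
INSERT INTO populations VALUES ('XXX', 2010, 2.0), ('XXX', 2015, 1.8);
INSERT INTO economies VALUES ('XXX', 2010, 10.0), ('XXX', 2015, 8.0);
""")

# Join on code only: each population row meets both economy rows.
code_only = conn.execute("""
    SELECT p.year, e.year, p.fertility_rate, e.unemployment_rate
    FROM populations AS p
      INNER JOIN economies AS e
        ON p.country_code = e.code
    ORDER BY p.year, e.year
""").fetchall()

# Join on code and year: only rows from the same year are paired.
code_and_year = conn.execute("""
    SELECT p.year, e.year, p.fertility_rate, e.unemployment_rate
    FROM populations AS p
      INNER JOIN economies AS e
        ON p.country_code = e.code AND p.year = e.year
    ORDER BY p.year
""").fetchall()

print(len(code_only), "rows without the year condition")
print(len(code_and_year), "rows with the year condition")
```

With two years per table, the first query yields 2 x 2 = 4 rows per country and the second yields 2, which mirrors the four Afghanistan rows in the output above.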

In [12]:
sql ='''
-- 6. Select fields
SELECT c.code, name, region, e.year, fertility_rate, unemployment_rate
  -- 1. From countries (alias as c)
  FROM countries AS c
  -- 2. Join to populations (as p)
  INNER JOIN populations AS p
    -- 3. Match on country code
    ON c.code = p.country_code
  -- 4. Join to economies (as e)
  INNER JOIN economies AS e
    -- 5. Match on country code and year
    ON c.code = e.code AND p.year = e.year;
'''
df = pd.read_sql(sql, conn)
df.head()

Unnamed: 0,code,name,region,year,fertility_rate,unemployment_rate
0,AFG,Afghanistan,Southern and Central Asia,2010,5.746,
1,AFG,Afghanistan,Southern and Central Asia,2015,4.653,
2,AGO,Angola,Central Africa,2010,6.416,
3,AGO,Angola,Central Africa,2015,5.996,
4,ALB,Albania,Southern Europe,2010,1.663,14.0


## Inner join with using

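When the tables you join share a common field name, such as `code` in both `countries` and `languages`, you can use `USING` as a shortcut: `USING(code)` takes the place of an `ON` condition that compares the `code` fields of the two tables.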
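A minimal sketch of the equivalence, run against an in-memory SQLite database instead of PostgreSQL: the two toy tables hold only a few rows copied from the outputs in this notebook, and the reduced column set is an assumption for brevity. The same join is written once with `ON` and once with `USING`.

```python
import sqlite3

# Small stand-ins for countries and languages, both with a `code` field.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE countries (code TEXT, name TEXT);
CREATE TABLE languages (code TEXT, name TEXT, official INTEGER);
INSERT INTO countries VALUES ('AFG', 'Afghanistan'), ('ALB', 'Albania');
INSERT INTO languages VALUES
    ('AFG', 'Dari', 1), ('AFG', 'Pashto', 1), ('ALB', 'Albanian', 1);
""")

# Explicit ON condition comparing the shared field.
with_on = conn.execute("""
    SELECT c.name AS country, l.name AS language
    FROM countries AS c
      INNER JOIN languages AS l
        ON c.code = l.code
    ORDER BY country, language
""").fetchall()

# USING shortcut: valid because the field has the same name in both tables.
with_using = conn.execute("""
    SELECT c.name AS country, l.name AS language
    FROM countries AS c
      INNER JOIN languages AS l
        USING (code)
    ORDER BY country, language
""").fetchall()

print(with_on == with_using)
```

`USING` only works when the joining field has the same name on both sides, which is why the earlier join between `countries.code` and `populations.country_code` needed `ON`.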

You'll now explore how this can be done with the countries and languages tables.

- Inner join countries on the left and languages on the right with USING(code).
- Select the fields corresponding to:
    - country name AS country,
    - continent name,
    - language name AS language, and
    - whether or not the language is official.
    
Remember to alias your tables using the first letter of their names.

In [16]:
sql ='''
-- 4. Select fields
SELECT c.name AS country, c.continent, l.name AS language, l.official
  -- 1. From countries (alias as c)
  FROM countries AS c
  -- 2. Join to languages (as l)
  INNER JOIN languages AS l
    -- 3. Match using code
    USING(code);
'''
df = pd.read_sql(sql, conn)
df.head()

Unnamed: 0,country,continent,language,official
0,Afghanistan,Asia,Dari,True
1,Afghanistan,Asia,Pashto,True
2,Afghanistan,Asia,Turkic,False
3,Afghanistan,Asia,Other,False
4,Albania,Europe,Albanian,True


## Self-join

In this exercise, you'll use the populations table to perform a self-join to calculate the percentage increase in population from 2010 to 2015 for each country code!

Since you'll be joining the populations table to itself, you can alias populations as p1 and also populations as p2. This is good practice whenever you are aliasing and your tables have the same first letter. Note that you are required to alias the tables with self-joins.

- Join populations with itself ON country_code.
- Select the country_code from p1 and the size field from both p1 and p2. Both fields are named size, so alias p1.size as size2010 and p2.size as size2015 to tell them apart in the result.

In [17]:
sql ='''
-- 4. Select fields with aliases
SELECT p1.country_code, 
       p1.size AS size2010,
       p2.size AS size2015
-- 1. From populations (alias as p1)
FROM populations AS p1
  -- 2. Join to itself (alias as p2)
  INNER JOIN populations AS p2
    -- 3. Match on country code
    ON  p1.country_code = p2.country_code;
'''
df = pd.read_sql(sql, conn)
df.head()

Unnamed: 0,country_code,size2010,size2015
0,ABW,101597.0,103889.0
1,ABW,101597.0,101597.0
2,ABW,103889.0,103889.0
3,ABW,103889.0,101597.0
4,AFG,27962200.0,32526600.0


Notice from the result that for each country_code you have four entries laying out all combinations of 2010 and 2015.

- Extend the ON in your query to include only those records where the p1.year (2010) matches with p2.year - 5 (2015 - 5 = 2010). This will omit the three entries per country_code that you aren't interested in.

In [18]:
sql ='''
-- 5. Select fields with aliases
SELECT p1.country_code,
       p1.size AS size2010,
       p2.size AS size2015
-- 1. From populations (alias as p1)
FROM populations AS p1
  -- 2. Join to itself (alias as p2)
  INNER JOIN populations AS p2
    -- 3. Match on country code
    ON p1.country_code = p2.country_code
        -- 4. and year (with calculation)
        AND p1.year = p2.year - 5;
'''
df = pd.read_sql(sql, conn)
df.head()

Unnamed: 0,country_code,size2010,size2015
0,ABW,101597.0,103889.0
1,AFG,27962200.0,32526600.0
2,AGO,21220000.0,25022000.0
3,ALB,2913020.0,2889170.0
4,AND,84419.0,70473.0


As you just saw, you can also use SQL to calculate values like p2.year - 5 for you. With two fields like size2010 and size2015, you may want to determine the percentage increase from one field to the next:

With two numeric fields A and B, the percentage growth from A to B can be calculated as `(B - A) / A * 100.0`.

Add a new field to SELECT, aliased as growth_perc, that calculates the percentage population growth from 2010 to 2015 for each country, using p2.size and p1.size.
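As a quick check of the formula outside SQL, here is a small Python sketch. The helper name `growth_perc` is chosen for illustration, and the two sizes are the Aruba (`ABW`) values from the self-join output above.

```python
def growth_perc(a: float, b: float) -> float:
    """Percentage growth from a to b, computed as (b - a) / a * 100.0."""
    return (b - a) / a * 100.0


# Aruba: size2010 = 101597.0, size2015 = 103889.0 (from the output above).
aruba = growth_perc(101597.0, 103889.0)
print(round(aruba, 3))

# A shrinking population gives a negative result.
shrinking = growth_perc(200.0, 150.0)
print(shrinking)
```

A positive result means the population grew between the two years and a negative one means it shrank.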

In [19]:
sql ='''
SELECT p1.country_code,
       p1.size AS size2010, 
       p2.size AS size2015,
       -- 1. calculate growth_perc
       ((p2.size - p1.size)/p1.size * 100.0) AS growth_perc
-- 2. From populations (alias as p1)
FROM populations AS p1
  -- 3. Join to itself (alias as p2)
  INNER JOIN populations AS p2
    -- 4. Match on country code
    ON p1.country_code = p2.country_code
        -- 5. and year (with calculation)
        AND p1.year = p2.year - 5;
'''
df = pd.read_sql(sql, conn)
df.head()

Unnamed: 0,country_code,size2010,size2015,growth_perc
0,ABW,101597.0,103889.0,2.255972
1,AFG,27962200.0,32526600.0,16.323297
2,AGO,21220000.0,25022000.0,17.917192
3,ALB,2913020.0,2889170.0,-0.818875
4,AND,84419.0,70473.0,-16.519977


## Case when and then

Often it's useful to look at a numerical field not as raw data, but instead as being in different categories or groups.

You can use `CASE` with `WHEN`, `THEN`, `ELSE`, and `END` to define a new grouping field.

Using the countries table, create a new field AS geosize_group that groups the countries into three groups:

- If surface_area is greater than 2 million, geosize_group is 'large'.
- If surface_area is greater than 350 thousand but not larger than 2 million, geosize_group is 'medium'.
- Otherwise, geosize_group is 'small'.
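The three rules above rely on `CASE` checking its `WHEN` branches in order and stopping at the first match, so the 'medium' branch needs no explicit upper bound. The Python sketch below mirrors that logic. The function name `geosize_group` is borrowed from the field alias, and treating `None` like SQL `NULL` is an assumption made to match how `CASE` falls through to `ELSE`.

```python
from typing import Optional


def geosize_group(surface_area: Optional[float]) -> str:
    """Mirror of the CASE expression: the first matching branch wins."""
    if surface_area is None:
        # In SQL, comparisons with NULL are not true, so ELSE applies.
        return "small"
    if surface_area > 2_000_000:
        return "large"
    if surface_area > 350_000:
        # Reached only when the 'large' test failed, i.e. area <= 2 million.
        return "medium"
    return "small"


# Surface areas taken from the countries output earlier in this notebook.
groups = {
    "Algeria": geosize_group(2381740.0),
    "Afghanistan": geosize_group(652090.0),
    "Netherlands": geosize_group(41526.0),
}
print(groups)
```

Swapping the order of the first two tests would label every country above 350 thousand as 'medium', which is why the order of the `WHEN` clauses matters.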

In [20]:
sql ='''
SELECT name, continent, code, surface_area,
    -- 1. First case
    CASE WHEN surface_area > 2000000 THEN 'large'
        -- 2. Second case
        WHEN surface_area > 350000 THEN 'medium'
        -- 3. Else clause + end
        ELSE 'small' END
        -- 4. Alias name
        AS geosize_group
-- 5. From table
FROM countries;
'''
df = pd.read_sql(sql, conn)
df.head()

Unnamed: 0,name,continent,code,surface_area,geosize_group
0,Afghanistan,Asia,AFG,652090.0,medium
1,Netherlands,Europe,NLD,41526.0,small
2,Albania,Europe,ALB,28748.0,small
3,Algeria,Africa,DZA,2381740.0,large
4,American Samoa,Oceania,ASM,199.0,small
