# Denison CS181/DA210 SW Lab #11 - Step 2

Before you turn this problem in, make sure everything runs as expected. This is a combination of **restarting the kernel** and then **running all cells** (in the menubar, select Kernel$\rightarrow$Restart And Run All).

Make sure you fill in any place that says `# YOUR CODE HERE` or "YOUR ANSWER HERE".

---

#### Import Python modules and load "SQL Magic"

In [None]:
import pandas as pd
import os
import os.path
import json
import sys
import importlib

module_dir = "../../modules"
module_path = os.path.abspath(module_dir)
if module_path not in sys.path:
    sys.path.append(module_path)

%load_ext sql

#### Set credentials
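The helper defined in the next cell expects `creds.json` to hold an outer dictionary with one inner dictionary per source. As a rough sketch of that layout, the following writes a throwaway file to a temporary directory and builds a connection string from it. The keys `scheme`, `dbdir`, and `database` are taken from the helper's return statement; the source name `demo_source` and every value are placeholders, not the lab's real configuration.

```python
import json
import os
import tempfile

# Hypothetical credentials layout; all values are placeholders.
creds = {
    "demo_source": {"scheme": "sqlite", "dbdir": "path/to/dbfiles", "database": "demo"}
}

with tempfile.TemporaryDirectory() as tmpdir:
    credspath = os.path.join(tmpdir, "creds.json")
    with open(credspath, "w") as f:
        json.dump(creds, f)

    # Read the file back and pick out the inner dictionary for one source.
    with open(credspath) as f:
        entry = json.load(f)["demo_source"]

# Same template the notebook uses for its connection string.
demo_cstring = "{}:///{}/{}.db".format(entry["scheme"], entry["dbdir"], entry["database"])
print(demo_cstring)
```

Your own `creds.json` will have different source names and paths; only the nesting matters here.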

In [None]:
def getsqlite_creds(dirname=".", filename="creds.json", source="sqlite"):
    """ Using directory and filename parameters, open a credentials file
        and obtain the three parts needed for a connection string to
        a local provider, using the dictionary named by `source` within
        an outer dictionary.

        Return a scheme, a database directory, and a database name
    """
    credspath = os.path.join(dirname, filename)
    assert os.path.isfile(credspath), "credentials file not found: " + credspath
    with open(credspath) as f:
        D = json.load(f)
    sqlite = D[source]
    return sqlite["scheme"], sqlite["dbdir"], sqlite["database"]

In [None]:
scheme, dbdir, database = getsqlite_creds(source="sqlite1")
template = '{}:///{}/{}.db'
cstring = template.format(scheme, dbdir, database)
print("Connection string:", cstring)

#### Establish Connection from Client to Server

In [None]:
%sql $cstring

---

## Part C: Types of Joins

We'll observe the differences in the types of joins using the following two tables (technically they're "views" in the `book` database, but for our purposes we'll treat them as tables):

In [None]:
%sql SELECT * FROM pop_gdp

In [None]:
%sql SELECT * FROM country_land

We'll use the following "match condition" for the joins we'll explore:
```
    pop_gdp.code = country_land.code
```

#### Inner join

We have already seen inner joins, so we'll use this as our starting point.

First, we construct a combined table that includes all six columns from the two tables. A row appears in the result only if it satisfies the match condition, that is, only if the same code is present in **both** tables.

In [None]:
query = """
SELECT pop_gdp.code AS pg_code,
       pop_gdp.pop AS pg_pop,
       pop_gdp.gdp AS pg_gdp,
       country_land.code AS cl_code,
       country_land.country AS cl_country,
       country_land.land AS cl_land
FROM pop_gdp INNER JOIN country_land ON country_land.code = pop_gdp.code
"""

resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()

As only FRA, GBR, and USA are present in both tables, the resulting table has only three records.
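To experiment with this outside the notebook's `%sql` magic, here is a minimal sketch that uses Python's built-in `sqlite3` module and an in-memory database. The country codes mirror the ones discussed in this section, but the numeric values are made-up placeholders and the real views may contain other rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pop_gdp (code TEXT, pop REAL, gdp REAL);
CREATE TABLE country_land (code TEXT, country TEXT, land REAL);
-- Placeholder numbers; only the codes mirror the lab's tables.
INSERT INTO pop_gdp VALUES ('CHN', 1.0, 10.0), ('FRA', 2.0, 20.0),
                           ('GBR', 3.0, 30.0), ('USA', 4.0, 40.0);
INSERT INTO country_land VALUES ('FRA', 'France', 100.0),
                                ('GBR', 'United Kingdom', 200.0),
                                ('IND', 'India', 300.0),
                                ('USA', 'United States', 400.0),
                                ('VNM', 'Vietnam', 500.0);
""")

# Only codes that appear in both tables survive an inner join.
inner_codes = [row[0] for row in conn.execute("""
    SELECT pop_gdp.code
    FROM pop_gdp INNER JOIN country_land ON country_land.code = pop_gdp.code
    ORDER BY pop_gdp.code
""")]
print(inner_codes)
conn.close()
```

CHN, IND, and VNM are dropped because each one is missing from one of the two tables.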

#### Left join

To use a `LEFT JOIN`, we simply replace `INNER JOIN` with `LEFT JOIN` in our SQL statement.

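Next, we construct a combined table that includes all six columns from the two tables and in which every row of the `pop_gdp` table is present. Rows that satisfy the match condition are paired with their `country_land` counterpart; the remaining `pop_gdp` rows get NULL in the `country_land` columns.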

In [None]:
query = """
SELECT pop_gdp.code AS pg_code,
       pop_gdp.pop AS pg_pop,
       pop_gdp.gdp AS pg_gdp,
       country_land.code AS cl_code,
       country_land.country AS cl_country,
       country_land.land AS cl_land
FROM pop_gdp LEFT JOIN country_land ON country_land.code = pop_gdp.code
"""

resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()

As this is a `LEFT JOIN`, it has all records present in the `pop_gdp` table, even if they are not present in the `country_land` table (e.g., CHN has NULL values in the columns coming from `country_land`).
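The same behavior can be reproduced on a toy database. This sketch again uses `sqlite3` with placeholder values (only the country codes are taken from this section) and shows that the unmatched row comes back with `None`, which is how `sqlite3` reports a SQL NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pop_gdp (code TEXT, pop REAL, gdp REAL);
CREATE TABLE country_land (code TEXT, country TEXT, land REAL);
-- Placeholder numbers; only the codes mirror the lab's tables.
INSERT INTO pop_gdp VALUES ('CHN', 1.0, 10.0), ('FRA', 2.0, 20.0),
                           ('GBR', 3.0, 30.0), ('USA', 4.0, 40.0);
INSERT INTO country_land VALUES ('FRA', 'France', 100.0),
                                ('GBR', 'United Kingdom', 200.0),
                                ('IND', 'India', 300.0),
                                ('USA', 'United States', 400.0),
                                ('VNM', 'Vietnam', 500.0);
""")

# Every pop_gdp row is kept; country is NULL where no match exists.
left_rows = conn.execute("""
    SELECT pop_gdp.code, country_land.country
    FROM pop_gdp LEFT JOIN country_land ON country_land.code = pop_gdp.code
    ORDER BY pop_gdp.code
""").fetchall()
for row in left_rows:
    print(row)
conn.close()
```

The first row printed is `('CHN', None)`; the other three rows carry a country name.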

#### Right join

Some systems do not implement a `RIGHT JOIN` and provide only a `LEFT JOIN`.  In this case, we can use a `LEFT JOIN` and reverse the order of the tables in the `FROM` clause.

Now, we construct a combined table that includes all six columns from the two tables and in which every row of the `country_land` table is present, whether or not it satisfies the match condition.

In [None]:
query = """
SELECT pop_gdp.code AS pg_code,
       pop_gdp.pop AS pg_pop,
       pop_gdp.gdp AS pg_gdp,
       country_land.code AS cl_code,
       country_land.country AS cl_country,
       country_land.land AS cl_land
FROM country_land LEFT JOIN pop_gdp ON country_land.code = pop_gdp.code
"""
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()

Similar to the previous example, this result has all records present in the `country_land` table, even if they are not present in the `pop_gdp` table (e.g., IND and VNM have NULL values in the columns coming from `pop_gdp`).
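A small sketch of the reversed `LEFT JOIN`, once more on placeholder data in an in-memory `sqlite3` database. Note that the order of columns in the `SELECT` list is independent of the order of tables in the `FROM` clause, so the `pop_gdp` column can still come first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pop_gdp (code TEXT, pop REAL, gdp REAL);
CREATE TABLE country_land (code TEXT, country TEXT, land REAL);
-- Placeholder numbers; only the codes mirror the lab's tables.
INSERT INTO pop_gdp VALUES ('CHN', 1.0, 10.0), ('FRA', 2.0, 20.0),
                           ('GBR', 3.0, 30.0), ('USA', 4.0, 40.0);
INSERT INTO country_land VALUES ('FRA', 'France', 100.0),
                                ('GBR', 'United Kingdom', 200.0),
                                ('IND', 'India', 300.0),
                                ('USA', 'United States', 400.0),
                                ('VNM', 'Vietnam', 500.0);
""")

# country_land is now the left table, so all of its rows are kept.
right_rows = conn.execute("""
    SELECT pop_gdp.code AS pg_code, country_land.code AS cl_code
    FROM country_land LEFT JOIN pop_gdp ON country_land.code = pop_gdp.code
    ORDER BY country_land.code
""").fetchall()
for row in right_rows:
    print(row)
conn.close()
```

Here the rows for IND and VNM have `None` in `pg_code`, and CHN does not appear at all.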

#### Full outer join

A full outer join, written `FULL OUTER JOIN`, is likewise not implemented in all systems.  Instead, we can take the `UNION` of the `LEFT JOIN` and the `RIGHT JOIN` (here written as a `LEFT JOIN` with the table order reversed).
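As a sketch of the `UNION` approach on a toy database (placeholder values, in-memory `sqlite3`, only the codes taken from this section), the query below combines the two one-sided joins. The rows are collected into a Python set because a `UNION` without `ORDER BY` does not promise any particular row order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pop_gdp (code TEXT, pop REAL, gdp REAL);
CREATE TABLE country_land (code TEXT, country TEXT, land REAL);
-- Placeholder numbers; only the codes mirror the lab's tables.
INSERT INTO pop_gdp VALUES ('CHN', 1.0, 10.0), ('FRA', 2.0, 20.0),
                           ('GBR', 3.0, 30.0), ('USA', 4.0, 40.0);
INSERT INTO country_land VALUES ('FRA', 'France', 100.0),
                                ('GBR', 'United Kingdom', 200.0),
                                ('IND', 'India', 300.0),
                                ('USA', 'United States', 400.0),
                                ('VNM', 'Vietnam', 500.0);
""")

# UNION of the two one-sided joins; UNION also removes duplicate rows.
full_rows = conn.execute("""
    SELECT pop_gdp.code AS pg_code, country_land.code AS cl_code
    FROM pop_gdp LEFT JOIN country_land ON country_land.code = pop_gdp.code
    UNION
    SELECT pop_gdp.code AS pg_code, country_land.code AS cl_code
    FROM country_land LEFT JOIN pop_gdp ON country_land.code = pop_gdp.code
""").fetchall()
full_set = set(full_rows)
print(len(full_rows))
conn.close()
```

The six resulting rows cover every code that appears in either table.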

Finally, we construct a combined table that includes all six columns from the two tables and in which every row of either original table is present, matched where the match condition holds.

In [None]:
query = """
SELECT pop_gdp.code AS pg_code,
       pop_gdp.pop AS pg_pop,
       pop_gdp.gdp AS pg_gdp,
       country_land.code AS cl_code,
       country_land.country AS cl_country,
       country_land.land AS cl_land
FROM pop_gdp LEFT JOIN country_land ON country_land.code = pop_gdp.code
UNION
SELECT pop_gdp.code AS pg_code,
       pop_gdp.pop AS pg_pop,
       pop_gdp.gdp AS pg_gdp,
       country_land.code AS cl_code,
       country_land.country AS cl_country,
       country_land.land AS cl_land
FROM country_land LEFT JOIN pop_gdp ON country_land.code = pop_gdp.code
"""
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()

---

## Part D: `LEFT JOIN` and Set Differences

We can use a `LEFT JOIN` to compute differences between sets.  The next two exercises walk you through this process.  First, let's switch back to the `school` database.
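As a warm-up on the smaller Part C tables, the sketch below illustrates the general idea: after a `LEFT JOIN`, the rows whose right-hand key is NULL are exactly the left-hand rows with no partner. It uses an in-memory `sqlite3` database with placeholder values (only the country codes come from Part C), and the variable names are just for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pop_gdp (code TEXT, pop REAL, gdp REAL);
CREATE TABLE country_land (code TEXT, country TEXT, land REAL);
-- Placeholder numbers; only the codes mirror the Part C tables.
INSERT INTO pop_gdp VALUES ('CHN', 1.0, 10.0), ('FRA', 2.0, 20.0),
                           ('GBR', 3.0, 30.0), ('USA', 4.0, 40.0);
INSERT INTO country_land VALUES ('FRA', 'France', 100.0),
                                ('GBR', 'United Kingdom', 200.0),
                                ('IND', 'India', 300.0),
                                ('USA', 'United States', 400.0),
                                ('VNM', 'Vietnam', 500.0);
""")

# Codes in pop_gdp that have no match in country_land.
only_pop_gdp = [row[0] for row in conn.execute("""
    SELECT pop_gdp.code
    FROM pop_gdp LEFT JOIN country_land ON country_land.code = pop_gdp.code
    WHERE country_land.code IS NULL
    ORDER BY pop_gdp.code
""")]

# Swapping the table order gives the difference in the other direction.
only_country_land = [row[0] for row in conn.execute("""
    SELECT country_land.code
    FROM country_land LEFT JOIN pop_gdp ON country_land.code = pop_gdp.code
    WHERE pop_gdp.code IS NULL
    ORDER BY country_land.code
""")]
print(only_pop_gdp, only_country_land)
conn.close()
```

The set difference is not symmetric, so the choice of left-hand table determines which side's leftover rows you get.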

In [None]:
scheme, dbdir, database = getsqlite_creds(source="sqlite2")
template = '{}:///{}/{}.db'
cstring = template.format(scheme, dbdir, database)
print("Connection string:", cstring)

%sql $cstring

**Q7:** Write a SQL query that collects course information (subject, number, and title) together with class information (the term, as a column named `term`).  Your resulting table should include directed studies, and should have records for all rows in the `courses` table.

In [None]:
query = """
"""
# YOUR CODE HERE
raise NotImplementedError()

resultset = %sql $query
resultdf = resultset.DataFrame()
print(len(resultdf))
resultdf.head()

In [None]:
# Testing cell
assert len(resultdf) == 1896
assert True in list(resultdf["term"].isna())
assert True not in list(resultdf["coursetitle"].isna())
assert "Beginning Arabic I" in set(resultdf["coursetitle"])

**Q8:** Using the query from the previous question as a subquery, select the course subject, number, and title for any courses not offered in either the fall or spring terms.  (Hint: think about what you can filter from the result of the previous question.)

In [None]:
query = """
"""
# YOUR CODE HERE
raise NotImplementedError()

resultset = %sql $query
resultdf = resultset.DataFrame()
print(len(resultdf))
resultdf.head()

In [None]:
# Testing cell
assert len(resultdf) == 94
assert resultdf.iloc[0,1] == 363
assert "Beginning Arabic I" not in set(resultdf["coursetitle"])

**Q9:** Further expand your previous SQL query to retrieve the English courses (subject, number, and title) that were not offered in either semester.

In [None]:
query = """
"""
# YOUR CODE HERE
raise NotImplementedError()

resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()

In [None]:
# Testing cell
assert len(resultdf) == 2

> You've reached the third (and final) checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 3: If a full outer join is made up of rows from an inner join, left join, and right join, then why is it that in forming the outer join, you do not need to `UNION` with the inner join, as well as the left and right joins?

---

## Part E

How much time (in minutes/hours) did you spend on this lab outside of class?

YOUR ANSWER HERE