In [1]:
# preamble to be able to run notebooks in Jupyter and Colab
try:
    from google.colab import drive
    import sys
    
    drive.mount('/content/drive')
    notes_home = "/content/drive/Shared drives/CSC310/notes/"
    user_home = "/content/drive/My Drive/"
    
    sys.path.insert(1,notes_home) # let the notebook access the notes folder
    
    !pip install PyMySQL

except ModuleNotFoundError:
    notes_home = "" # running native Jupyter environment -- notes home is the same as the notebook
    user_home = ""  # under Jupyter we assume the user directory is the same as the notebook

# Relational Databases

* The data you need will often live in *databases*, systems designed for efficiently storing and querying data. 

* The bulk of these are relational databases, such as Oracle, MySQL, and SQL Server.  These are also called *Relational Database Management Systems* ([RDBMS](https://en.wikipedia.org/wiki/Relational_database_management_system)).

* These systems store data in tables and are typically queried using *Structured Query Language* ([SQL](https://en.wikipedia.org/wiki/SQL)), a declarative language for manipulating data.



## Relational Databases

* A relational database is a collection of tables.
* A table is simply a collection of rows and columns, very similar to Pandas DataFrames.
* A database typically contains multiple tables.
* Each table typically has at least one special column, a **primary key** or a **foreign key**.
* These special columns allow the user to pose queries across multiple different tables at the same time - to perform a **join** across tables.
* A primary key is a column that holds a **unique** value for each row in the table.  This is used by the db engine to optimize queries against the table.
* A foreign key is a column where each value points to the primary key of another table.  So you can think of a foreign key as a pointer from one table to another.
* Tables together with primary/foreign key relationships are called the **schema of a database**.
* SQL is used to query the data in a relational database.  
* Data returned from an SQL query is returned as a table.

These databases are called relational because each table defines a [mathematical relation](https://www.xaprb.com/blog/2012/03/13/what-makes-relational-databases-relational)!
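To make the key terminology concrete, here is a minimal sketch using Python's built-in `sqlite3` module and an in-memory database, not the MySQL server used later in these notes. The two small tables, the `try_insert` helper, and all rows are made up for illustration; they only mimic the shape of a country/city schema.

```python
import sqlite3

# In-memory SQLite database; table and column names are hypothetical
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

con.execute("CREATE TABLE country (code TEXT PRIMARY KEY, name TEXT)")
con.execute("""
    CREATE TABLE city (
        id INTEGER PRIMARY KEY,
        name TEXT,
        countrycode TEXT REFERENCES country(code)
    )
""")

con.execute("INSERT INTO country VALUES ('NLD', 'Netherlands')")
con.execute("INSERT INTO city VALUES (5, 'Amsterdam', 'NLD')")

def try_insert(statement):
    # Return True if the insert is accepted, False if a key constraint rejects it
    try:
        con.execute(statement)
        return True
    except sqlite3.IntegrityError:
        return False

# A second row with the same primary key value violates uniqueness
duplicate_ok = try_insert("INSERT INTO country VALUES ('NLD', 'Holland')")
# A foreign key value must point at an existing primary key
dangling_ok = try_insert("INSERT INTO city VALUES (6, 'Atlantis', 'XXX')")

print(duplicate_ok, dangling_ok)
```

Both inserts are rejected: the first repeats a primary key value, and the second uses a foreign key value that does not exist in the referenced table.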



## SQL Queries and Result Tables

![alt](https://www.w3resource.com/w3r_images/sql-works-with-rdbms.gif)

## There is a lot to RDBMS

![alt](https://www.assignmenthelp.net/blog/wp-content/uploads/2011/07/RDBMS.png)

## SQL

SQL (Structured Query Language) is a domain-specific language used in programming and designed for managing data held in a relational database management system.

SQL was one of the first commercial languages for Edgar F. Codd's relational model, as described in his influential 1970 paper, "[A Relational Model of Data for Large Shared Data Banks](https://sfu-db.github.io/dbsystems/Papers/p377-codd.pdf)."

Here is a nice [SQL tutorial](https://www.w3schools.com/sql).


## What Can SQL do?
* SQL can execute queries against a database
* SQL can retrieve data from a database
* SQL can insert records in a database
* SQL can update records in a database
* SQL can delete records from a database
* SQL can create new databases
* SQL can create new tables in a database
* SQL can create stored procedures in a database
* SQL can create views in a database
* SQL can set permissions on tables, procedures, and views




## SQL is actually made up of a couple of sub-languages:

* DDL: Data Definition Language, e.g. ‘create’ a table or database
* DML: Data Manipulation Language, e.g. insert or delete a row in a table
* TCL: Transaction Control Language, e.g. commit or rollback database changes
* DCL: Data Control Language, e.g. grant access permissions
* **DQL: Data Query Language**, e.g. retrieve records from one or more tables

**Note:** Only a small part of SQL actually has to do with information retrieval/querying
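The following sketch shows one statement from most of these sub-languages in action. It assumes Python's built-in `sqlite3` module with an in-memory database, and the `customers` table and its rows are hypothetical. DCL is left out because SQLite has no user accounts and therefore no `GRANT` statement.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# DDL: define a table (the table and its rows are hypothetical)
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

# DML: insert rows, then make the change permanent with TCL (commit)
con.execute("INSERT INTO customers (name) VALUES ('Ada')")
con.execute("INSERT INTO customers (name) VALUES ('Grace')")
con.commit()

# DML followed by TCL again: delete a row, then undo it with rollback
con.execute("DELETE FROM customers WHERE name = 'Ada'")
con.rollback()

# DQL: retrieve the records; both rows are still there after the rollback
names = [row[0] for row in con.execute("SELECT name FROM customers ORDER BY id")]
print(names)

con.close()
```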

## Data Retrieval with `SELECT`

From our perspective, the most important statement is `SELECT`, which allows you to extract data from the DB tables:
```
SELECT * FROM Customers;               -- get entire contents of table Customers
SELECT * FROM Customers LIMIT 2;       -- get the first two rows
SELECT CustomerID,CustomerName FROM Customers;    
                                       -- get columns CustomerID, CustomerName of table Customers
SELECT CustomerName,City FROM Customers WHERE CustomerID = 3;  
                                       -- get data subject to some conditions
```

Take a peek at the W3Schools [reference manual for SQL](https://www.w3schools.com/sql/).
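To see what these four statements return, here is a small sketch that runs them against a throwaway in-memory SQLite database using Python's built-in `sqlite3` module. The contents of the `Customers` table are invented for this example; only the queries come from the text above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, CustomerName TEXT, City TEXT)"
)
# Made-up sample rows
con.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?)",
    [(1, "Ann", "Boston"), (2, "Bob", "Denver"), (3, "Cara", "Austin")],
)

# entire contents of the table
all_rows = con.execute("SELECT * FROM Customers").fetchall()
# at most two rows
first_two = con.execute("SELECT * FROM Customers LIMIT 2").fetchall()
# only two of the three columns
two_columns = con.execute("SELECT CustomerID, CustomerName FROM Customers").fetchall()
# only rows that satisfy the condition
filtered = con.execute(
    "SELECT CustomerName, City FROM Customers WHERE CustomerID = 3"
).fetchall()

for label, rows in [("all", all_rows), ("limit 2", first_two),
                    ("columns", two_columns), ("where", filtered)]:
    print(label, rows)

con.close()
```

Each result is itself a small table: a list of rows, where every row is a tuple holding one value per selected column.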

# SQL and Python

We will use the [PyMySQL](https://pymysql.readthedocs.io/en/latest/) package together with  [Pandas DataFrames' ability to query databases using SQL](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html) to connect to a [MySQL server](https://www.mysql.com) and issue SQL commands.

We will use the 'world' database from the [MySQL website](https://dev.mysql.com/doc/world-setup/en). 

In [2]:
# our database
host = 'testdb.cwy05wfzuxbv.us-east-1.rds.amazonaws.com'
userdb = 'world'
user = 'csc310'
password = 'csc310$is$fun'

The **schema** of the world database looks as follows:

<img src="https://static.packt-cdn.com/products/9781788390415/graphics/cac1f609-1c45-46d7-b066-d9481ceddf18.png">

The database consists of three tables, the central one being the 'country' table.  The other two tables are related to the 'country' table via their 'CountryCode' columns, which are set up as **foreign keys**.

### Basic Queries

Here is a program that queries 10 rows from the table `city`.  Notice that the results are returned as a Pandas DataFrame; once we have them, we can apply standard Pandas operations to them if we so choose (which we will do later on).

In [3]:
# query city table
# if the following import fails try: conda install -c anaconda pymysql
import pymysql as sql
import pandas as pd

# Open database connection to our test database
# (keyword arguments are required by PyMySQL 1.0 and later)
db = sql.connect(host=host, user=user, password=password, database=userdb)

# get data using a pandas dataframe
sql_string = 'SELECT * FROM city limit 10;'
data = pd.read_sql(sql_string, con=db) 
print (data)

# disconnect from server
db.close()

   ID            Name CountryCode       District  Population
0   1           Kabul         AFG          Kabol     1780000
1   2        Qandahar         AFG       Qandahar      237500
2   3           Herat         AFG          Herat      186800
3   4  Mazar-e-Sharif         AFG          Balkh      127800
4   5       Amsterdam         NLD  Noord-Holland      731200
5   6       Rotterdam         NLD   Zuid-Holland      593321
6   7            Haag         NLD   Zuid-Holland      440900
7   8         Utrecht         NLD        Utrecht      234323
8   9       Eindhoven         NLD  Noord-Brabant      201843
9  10         Tilburg         NLD  Noord-Brabant      193238


Here is another program that queries the `city` table for cities where the population is greater than 5,000,000
and displays the columns `name`, `countrycode`, and `population`. It displays the results in descending order of population.

In [4]:
# show only cities where the population is greater than 5,000,000
import pymysql as sql
import pandas as pd

# Open database connection to our test database
db = sql.connect(host=host, user=user, password=password, database=userdb)

# get data using a pandas dataframe
sql_string = \
'''
SELECT 
    name, 
    countrycode, 
    population 
FROM 
    city 
WHERE 
    population > 5000000
ORDER BY
    population DESC
'''

data = pd.read_sql(sql_string, con=db) 
print (data)

# disconnect from server
db.close()

                 name countrycode  population
0     Mumbai (Bombay)         IND    10500000
1               Seoul         KOR     9981619
2           São Paulo         BRA     9968485
3            Shanghai         CHN     9696300
4             Jakarta         IDN     9604900
5             Karachi         PAK     9269265
6            Istanbul         TUR     8787958
7    Ciudad de México         MEX     8591309
8              Moscow         RUS     8389200
9            New York         USA     8008278
10              Tokyo         JPN     7980230
11             Peking         CHN     7472000
12             London         GBR     7285000
13              Delhi         IND     7206704
14              Cairo         EGY     6789479
15            Teheran         IRN     6758845
16               Lima         PER     6464693
17          Chongqing         CHN     6351600
18            Bangkok         THA     6320174
19  Santafé de Bogotá         COL     6260862
20     Rio de Janeiro         BRA 

The following function is a simple SQL client that, given the appropriate connectivity info, will prompt the user for an SQL string, submit it to the db, and display the result table.  To sign off, type `quit`.

In [5]:
def sql_client(host, user, password, userdb):
    # simple SQL client for executing SQL commands
    import pymysql as sql
    import pandas as pd

    # Open database connection to our database
    db = sql.connect(host=host, user=user, password=password, database=userdb)

    # get data using a pandas dataframe
    while True:
        try:
            sql_string = input('SQL> ')
            if sql_string == 'quit' or sql_string == 'quit;':
                print('Bye.')
                break
            data = pd.read_sql(sql_string, con=db) 
            print (data)
        except Exception as e:
            print('Error: ' + str(e))

    # disconnect from server
    db.close()

Try our SQL client function with our world db. Try something like
```
select distinct language from countrylanguage where isofficial = 'T';
```
That is, query all the languages that are considered official languages. The `distinct` keyword prevents
a language from appearing multiple times in the result table. For example, English is an official language
in multiple countries, but here we only care whether it is an official language in at least one.  Try posing the query without the `distinct` keyword.
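The effect of `distinct` can also be seen offline in a small sketch. It assumes Python's built-in `sqlite3` module and a four-row stand-in for the `countrylanguage` table; the country codes `AAA`, `BBB`, and `CCC` and the rows themselves are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE countrylanguage (countrycode TEXT, language TEXT, isofficial TEXT)"
)
# Made-up rows: English is official in two countries and unofficial in a third
con.executemany(
    "INSERT INTO countrylanguage VALUES (?, ?, ?)",
    [
        ("AAA", "English", "T"),
        ("BBB", "English", "T"),
        ("BBB", "French", "T"),
        ("CCC", "English", "F"),
    ],
)

# Without DISTINCT: one result row per matching table row
with_duplicates = [row[0] for row in con.execute(
    "SELECT language FROM countrylanguage WHERE isofficial = 'T' ORDER BY language"
)]

# With DISTINCT: each language appears once
without_duplicates = [row[0] for row in con.execute(
    "SELECT DISTINCT language FROM countrylanguage WHERE isofficial = 'T' ORDER BY language"
)]

print(with_duplicates)
print(without_duplicates)

con.close()
```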

In [6]:
sql_client(host, user, password, userdb)

SQL> select distinct language from countrylanguage where isofficial = 'T';
       language
0         Dutch
1          Dari
2        Pashto
3       English
4     Albaniana
..          ...
97   Vietnamese
98      Bislama
99    Afrikaans
100       Xhosa
101        Zulu

[102 rows x 1 columns]
SQL> quit
Bye.


### DB Meta Information

We can use SQL to access meta-information about the data and the db.

In [7]:
# show tables in the db
# if the following import fails try: conda install -c anaconda pymysql
import pymysql as sql
import pandas as pd

# Open database connection to our test database
db = sql.connect(host=host, user=user, password=password, database=userdb)

# get data using a pandas dataframe
sql_string = 'SHOW TABLES'
data = pd.read_sql(sql_string, con=db) 
print("Shape: {}".format(data.shape))
print (data)

# disconnect from server
db.close()

Shape: (3, 1)
   Tables_in_world
0             city
1          country
2  countrylanguage


We can also manipulate the data that comes back from the DB before printing it, for example to format it as a sentence.

In [8]:
# show tables in the db
# if the following import fails try: conda install -c anaconda pymysql
import pymysql as sql
import pandas as pd

# Open database connection to our test database
db = sql.connect(host=host, user=user, password=password, database=userdb)

# get data using a pandas dataframe
sql_string = 'SHOW TABLES'
data = pd.read_sql(sql_string, con=db) 

# format the output nicely
print("There are {:d} tables in the {:s} database".format(data.shape[0],userdb))
print("The tables are: " + ", ".join(list(data.iloc[:,0])))

# disconnect from server
db.close()

There are 3 tables in the world database
The tables are: city, country, countrylanguage


We can look at columns of a table together with their meta-information.

In [9]:
# show columns from db table 'country'
# if the following import fails try: conda install -c anaconda pymysql
import pymysql as sql
import pandas as pd

# Open database connection to our test database
db = sql.connect(host=host, user=user, password=password, database=userdb)

# get data using a pandas dataframe
# show columns with their associated meta info
sql_string = 'SHOW COLUMNS FROM country'
data = pd.read_sql(sql_string, con=db) 

print("Shape: {}".format(data.shape))
print("Meta Info:  {}".format(", ".join(list(data.columns))))

print("The Column Names (Fields) and their Types:")
print (data.loc[:,['Field','Type']])

print("The Column Names (Fields) and their Key Status:")
print (data.loc[:,['Field','Key']])

# disconnect from server
db.close()

Shape: (15, 6)
Meta Info:  Field, Type, Null, Key, Default, Extra
The Column Names (Fields) and their Types:
             Field                                               Type
0             Code                                            char(3)
1             Name                                           char(52)
2        Continent  enum('Asia','Europe','North America','Africa',...
3           Region                                           char(26)
4      SurfaceArea                                        float(10,2)
5        IndepYear                                        smallint(6)
6       Population                                            int(11)
7   LifeExpectancy                                         float(3,1)
8              GNP                                        float(10,2)
9           GNPOld                                        float(10,2)
10       LocalName                                           char(45)
11  GovernmentForm                                 

In [10]:
# show columns from db table 'city'
# if the following import fails try: conda install -c anaconda pymysql
import pymysql as sql
import pandas as pd

# Open database connection to our test database
db = sql.connect(host=host, user=user, password=password, database=userdb)

# get data using a pandas dataframe
# show columns with their associated meta info
sql_string = 'SHOW COLUMNS FROM city'
data = pd.read_sql(sql_string, con=db) 
print("The Column Names (Fields) and their Key Status:")
print (data.loc[:,['Field','Key']])

# disconnect from server
db.close()

The Column Names (Fields) and their Key Status:
         Field  Key
0           ID  PRI
1         Name     
2  CountryCode  MUL
3     District     
4   Population     


In [11]:
# show columns from db table 'countrylanguage'
# if the following import fails try: conda install -c anaconda pymysql
import pymysql as sql
import pandas as pd

# Open database connection to our test database
db = sql.connect(host=host, user=user, password=password, database=userdb)

# get data using a pandas dataframe
# show columns with their associated meta info
sql_string = 'SHOW COLUMNS FROM countrylanguage'
data = pd.read_sql(sql_string, con=db) 
print("The Column Names (Fields) and their Key Status:")
print (data.loc[:,['Field','Key']])

# disconnect from server
db.close()

The Column Names (Fields) and their Key Status:
         Field  Key
0  CountryCode  PRI
1     Language  PRI
2   IsOfficial     
3   Percentage     


In [12]:
# show all the foreign keys in the world db
# if the following import fails try: conda install -c anaconda pymysql
import pymysql as sql
import pandas as pd

# Open database connection to our test database
db = sql.connect(host=host, user=user, password=password, database=userdb)

# get data using a pandas dataframe
# list every foreign key that references a table in the world db
sql_string = \
'''
SELECT 
  TABLE_NAME,
  COLUMN_NAME,
  CONSTRAINT_NAME, 
  REFERENCED_TABLE_NAME,
  REFERENCED_COLUMN_NAME
FROM
  INFORMATION_SCHEMA.KEY_COLUMN_USAGE
WHERE
  REFERENCED_TABLE_SCHEMA = "world";
'''
data = pd.read_sql(sql_string, con=db)
data.index = ['Table {}'.format(i+1) for i in range(data.shape[0])]
print (data.transpose())

# disconnect from server
db.close()

                            Table 1                 Table 2
TABLE_NAME                     city         countrylanguage
COLUMN_NAME             CountryCode             CountryCode
CONSTRAINT_NAME         city_ibfk_1  countryLanguage_ibfk_1
REFERENCED_TABLE_NAME       country                 country
REFERENCED_COLUMN_NAME         code                    code




## Joins

A [SQL join](https://en.wikipedia.org/wiki/Join_(SQL)) combines columns from one table (a self-join) or from several tables in a relational database by matching values common to each.  The default and most common join is the `INNER JOIN`, which returns only those row combinations for which the join condition is true.

![alt](https://www.w3schools.com/sql/img_innerjoin.gif)

This is where the foreign keys come in handy. Recall that foreign keys are like pointers from one table to another:

<img src="https://static.packt-cdn.com/products/9781788390415/graphics/cac1f609-1c45-46d7-b066-d9481ceddf18.png">
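Before querying the real database, here is a minimal offline sketch of an inner join, assuming Python's built-in `sqlite3` module. The three city rows are taken from the `city` output shown earlier; the country rows, including one country that has no matching city, are supplied here purely for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE country (code TEXT PRIMARY KEY, name TEXT)")
con.execute("CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT, countrycode TEXT)")

# Illustrative country rows; 'ATA' has no city in this toy table
con.executemany(
    "INSERT INTO country VALUES (?, ?)",
    [("NLD", "Netherlands"), ("AFG", "Afghanistan"), ("ATA", "Antarctica")],
)
con.executemany(
    "INSERT INTO city VALUES (?, ?, ?)",
    [(1, "Kabul", "AFG"), (5, "Amsterdam", "NLD"), (6, "Rotterdam", "NLD")],
)

# Follow each city's foreign key to the matching country row
rows = con.execute("""
    SELECT city.name, country.name
    FROM city
    JOIN country ON city.countrycode = country.code
    ORDER BY city.id
""").fetchall()

for city_name, country_name in rows:
    print(city_name, country_name)

con.close()
```

Every city is paired with its country, while the country without any city does not appear in the result, because an inner join keeps only the row combinations that satisfy the join condition.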



Let's try a join on our 'world' database: We want to print out the 10 most populous cities with their names, country names, and populations.

In [13]:
# print out the top 10 city names, country names, and population.
import pymysql as sql
import pandas as pd

# Open database connection to our test database
db = sql.connect(host=host, user=user, password=password, database=userdb)

# get data using a pandas dataframe
sql_string = \
'''
SELECT 
    city.name as City, 
    country.name as Country,
    city.population as Population
FROM 
    city 
JOIN 
    country 
ON 
    city.countrycode = country.code 
ORDER BY
    city.population DESC
LIMIT
    10
'''

data = pd.read_sql(sql_string, con=db) 
print (data)

# disconnect from server
db.close()

               City             Country  Population
0   Mumbai (Bombay)               India    10500000
1             Seoul         South Korea     9981619
2         São Paulo              Brazil     9968485
3          Shanghai               China     9696300
4           Jakarta           Indonesia     9604900
5           Karachi            Pakistan     9269265
6          Istanbul              Turkey     8787958
7  Ciudad de México              Mexico     8591309
8            Moscow  Russian Federation     8389200
9          New York       United States     8008278


Let's try another one: let's query each city with its population and its country's population.

In [14]:
# let's query each city with its population and its country's population.
import pymysql as sql
import pandas as pd

# Open database connection to our test database
db = sql.connect(host=host, user=user, password=password, database=userdb)

# get data using a pandas dataframe
sql_string = \
'''
SELECT 
    country.name as Country,
    city.name as City,
    city.population as CityPop,
    country.population as CountryPop,
    city.population / country.population as Factor
FROM 
    city 
JOIN 
    country 
ON 
    city.countrycode = country.code 
ORDER BY
    Factor DESC
LIMIT
    10
'''

data = pd.read_sql(sql_string, con=db) 
print (data)

# disconnect from server
db.close()

                     Country          City  CityPop  CountryPop  Factor
0                  Singapore     Singapore  4017733     3567000  1.1264
1                  Gibraltar     Gibraltar    27025       25000  1.0810
2                      Macao         Macao   437500      473000  0.9249
3                   Pitcairn     Adamstown       42          50  0.8400
4    Cocos (Keeling) Islands        Bantam      503         600  0.8383
5  Saint Pierre and Miquelon  Saint-Pierre     5808        7000  0.8297
6           Falkland Islands       Stanley     1636        2000  0.8180
7                      Palau         Koror    12000       19000  0.6316
8                   Djibouti      Djibouti   383000      638000  0.6003
9               Cook Islands        Avarua    11900       20000  0.5950


One more! Print out how many cities are recorded for the USA, their average population, and the total population of the USA.

In [15]:
# count the cities recorded for the USA, their average population, and the country's population
import pymysql as sql
import pandas as pd

# Open database connection to our test database
db = sql.connect(host=host, user=user, password=password, database=userdb)

# get data using a pandas dataframe
sql_string = \
'''
SELECT 
    COUNT(city.name) as number, 
    AVG(city.population) as avg_pop,
    country.population as population
FROM 
    city 
JOIN 
    country 
ON 
    city.countrycode = country.code 
WHERE 
    country.code = 'USA';
'''

data = pd.read_sql(sql_string, con=db) 
print (data)

# disconnect from server
db.close()

   number      avg_pop  population
0     274  286955.3796   278357000
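
The aggregate functions `COUNT` and `AVG` can be tried offline as well. The sketch below assumes Python's built-in `sqlite3` module and reuses six rows from the `city` output shown earlier; it covers only the aggregation part of the query above, not the join.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE city (name TEXT, countrycode TEXT, population INTEGER)")
# Rows copied from the first city query output in these notes
con.executemany(
    "INSERT INTO city VALUES (?, ?, ?)",
    [
        ("Kabul", "AFG", 1780000),
        ("Qandahar", "AFG", 237500),
        ("Herat", "AFG", 186800),
        ("Mazar-e-Sharif", "AFG", 127800),
        ("Amsterdam", "NLD", 731200),
        ("Rotterdam", "NLD", 593321),
    ],
)

# COUNT and AVG collapse all rows that pass the WHERE filter into one result row
number, avg_pop = con.execute("""
    SELECT COUNT(name), AVG(population)
    FROM city
    WHERE countrycode = 'AFG'
""").fetchone()

print(number, avg_pop)

con.close()
```

Only the four rows with country code `AFG` pass the filter, so the result is a single row holding their count and their mean population.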


We have not touched upon DDL and DML, but as data scientists we are mostly consumers of data, and therefore querying databases takes priority over all other database activities.

# Team Exercise

Write Python code accessing the world db in order to answer the following questions:

1. What is the name of the country with country code 'ZAF'?
1. How many cities are there in the US with fewer than 100,000 residents according to the world db? (hint: use count; conditions can also be combined with boolean operators such as 'and'/'or')
1. How many different languages are there in the db?
1. In which countries, according to the db, is English the official language? Bonus: print the actual country names, not just the country codes!
1. What are the different languages spoken in Angola?


### Teams

You can work with your team from the last assignment or make up a new team.  But in order to get full credit for this exercise you will have to work in a team of at least 2 members.  Maximum number of members per team is 3.