In [1]:
# preamble to be able to run notebooks in Jupyter and Colab
try:
    from google.colab import drive
    import sys
    
    drive.mount('/content/drive')
    notes_home = "/content/drive/Shared drives/CSC310/notes/"
    user_home = "/content/drive/My Drive/"
    
    sys.path.insert(1,notes_home) # let the notebook access the notes folder
    
    !pip install PyMySQL

except ModuleNotFoundError:
    notes_home = "" # running native Jupyter environment -- notes home is the same as the notebook
    user_home = ""  # under Jupyter we assume the user directory is the same as the notebook

# Databases

* The data you need will often live in *databases*, systems designed for efficiently storing and querying data. 

* The bulk of these are relational databases, such as Oracle, MySQL, and SQL Server. These are also called *Relational Database Management Systems* ([RDBMS](https://en.wikipedia.org/wiki/Relational_database_management_system)).

* These systems store data in tables and are typically queried using *Structured Query Language* ([SQL](https://en.wikipedia.org/wiki/SQL)), a declarative language for manipulating data.



## Relational Databases

* A relational database is a collection of tables.
* A table is simply a collection of rows and columns, very similar to Pandas DataFrames.
* A database typically contains multiple tables.
* Each table typically has a *primary key*, a column (or set of columns) whose values uniquely identify each row. A table may also have *foreign keys*, columns that refer to the primary key of another table. These key columns allow the user to pose queries across multiple tables at the same time.
* Tables together with their primary/foreign key relationships are called the *schema* of a database.

It is called relational because each table defines a [mathematical relation](https://www.xaprb.com/blog/2012/03/13/what-makes-relational-databases-relational)!
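
The role of primary and foreign keys can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the two-table schema and its rows are made up, and SQLite stands in for a full RDBMS server. Note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

# In-memory SQLite database, used only for illustration
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked

# Hypothetical schema: every city row must point at an existing country row
conn.execute("""
    CREATE TABLE country (
        code TEXT PRIMARY KEY,   -- primary key: uniquely identifies each row
        name TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE city (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        countrycode TEXT NOT NULL REFERENCES country(code)   -- foreign key
    )""")

conn.execute("INSERT INTO country VALUES ('NLD', 'Netherlands')")
conn.execute("INSERT INTO city (name, countrycode) VALUES ('Amsterdam', 'NLD')")

# A city that refers to a nonexistent country violates the foreign key
try:
    conn.execute("INSERT INTO city (name, countrycode) VALUES ('Atlantis', 'XXX')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

city_count = conn.execute("SELECT COUNT(*) FROM city").fetchone()[0]
print(rejected, city_count)
conn.close()
```

The database itself refuses the second city, so the relationship between the two tables cannot silently break.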



## SQL Queries and Result Tables

![alt](https://www.w3resource.com/w3r_images/sql-works-with-rdbms.gif)

## There is a lot to RDBMS

![alt](https://www.assignmenthelp.net/blog/wp-content/uploads/2011/07/RDBMS.png)

## SQL

SQL (Structured Query Language) is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS).

SQL was one of the first commercial languages for Edgar F. Codd's relational model, as described in his influential 1970 paper, "[A Relational Model of Data for Large Shared Data Banks](https://sfu-db.github.io/dbsystems/Papers/p377-codd.pdf)." Despite not entirely adhering to the relational model as described by Codd, it became the most widely used database language.

Here is a nice [SQL tutorial](https://www.w3schools.com/sql).


## What Can SQL do?
* SQL can execute queries against a database
* SQL can retrieve data from a database
* SQL can insert records in a database
* SQL can update records in a database
* SQL can delete records from a database
* SQL can create new databases
* SQL can create new tables in a database
* SQL can create stored procedures in a database
* SQL can create views in a database
* SQL can set permissions on tables, procedures, and views




## SQL is actually made up of several sub-languages

* DDL: Data Definition Language, e.g. create a table or database
* DML: Data Manipulation Language, e.g. insert or delete a row in a table
* TCL: Transaction Control Language, e.g. commit or roll back database changes
* DCL: Data Control Language, e.g. grant access permissions
* **DQL: Data Query Language**, e.g. retrieve records from one or more tables

**Note:** Only a small part of SQL actually has to do with information retrieval/querying.
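
As a rough sketch of how these sub-languages show up in practice, the following uses Python's built-in `sqlite3` module with a made-up `customers` table. SQLite has no user accounts, so DCL statements such as `GRANT` are not shown. Transaction control is issued here through the connection's `commit()` and `rollback()` methods rather than as SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# DDL: define a table
cur.execute(
    "CREATE TABLE customers (customerid INTEGER PRIMARY KEY, customername TEXT, city TEXT)"
)

# DML: insert rows, then TCL: make the change permanent
cur.execute("INSERT INTO customers (customername, city) VALUES ('Ana', 'London')")
cur.execute("INSERT INTO customers (customername, city) VALUES ('Ben', 'Berlin')")
conn.commit()

# DML: delete a row, then TCL: undo the deletion
cur.execute("DELETE FROM customers WHERE customername = 'Ben'")
conn.rollback()

# DQL: query the table; both rows are still present after the rollback
rows = cur.execute(
    "SELECT customername FROM customers ORDER BY customerid"
).fetchall()
print(rows)
conn.close()
```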

## Data Retrieval with `SELECT`

From our perspective, the most important statement is `SELECT`, which lets you extract data from the tables of a database:
```
SELECT * FROM customers;               -- get entire contents of table 'customers'
SELECT * FROM customers LIMIT 2;       -- get the first two rows
SELECT customername FROM customers;    -- get column 'customername' of table 'customers'
SELECT customername FROM customers WHERE customerid = 3;  -- get data subject to a condition
```
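
To try these statements without a database server, here is a minimal sketch using Python's built-in `sqlite3` module. The `customers` table and its three rows are invented for illustration. An `ORDER BY` is added where row order matters, because SQL does not otherwise guarantee the order in which rows come back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE customers (customerid INTEGER PRIMARY KEY, customername TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO customers (customerid, customername, city) VALUES (?, ?, ?)",
    [(1, "Ana", "London"), (2, "Ben", "Berlin"), (3, "Cleo", "London")],
)

# entire contents of the table
all_rows = conn.execute("SELECT * FROM customers").fetchall()

# first two rows, ordered by the primary key
first_two = conn.execute(
    "SELECT * FROM customers ORDER BY customerid LIMIT 2"
).fetchall()

# a single column
names = conn.execute(
    "SELECT customername FROM customers ORDER BY customerid"
).fetchall()

# rows that satisfy a condition
third = conn.execute(
    "SELECT customername FROM customers WHERE customerid = 3"
).fetchall()

print(first_two)
print(names)
print(third)
conn.close()
```

Each query returns a list of tuples, one tuple per row of the result table.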

## Demo

[SQL TryIt Editor](https://www.w3schools.com/sql/trysql.asp?filename=trysql_asc)

## Details on `SELECT`

You can also use `SELECT` statements to compute values over rows, for example to count the rows that match a condition:
```
SELECT COUNT(*) FROM customers WHERE city = 'London';
```
W3Schools maintains a nice [reference manual for SQL](https://www.w3schools.com/sql/).
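
A small sketch of `COUNT(*)` in action, again using Python's built-in `sqlite3` module and an invented `customers` table. Note that string literals in SQL are written with straight single quotes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE customers (customerid INTEGER PRIMARY KEY, customername TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ana", "London"), (2, "Ben", "Berlin"), (3, "Cleo", "London")],
)

# COUNT(*) with a WHERE clause counts only the matching rows
london_count = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE city = 'London'"
).fetchone()[0]

# without a WHERE clause, COUNT(*) counts every row in the table
total_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

print(london_count, total_count)
conn.close()
```

An aggregate query like this returns a result table with a single row, which is why `fetchone()[0]` is enough to read the value.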

# SQL and Python

We will use the [PyMySQL](https://pymysql.readthedocs.io/en/latest/) package together with [pandas' ability to load the result of a SQL query into a DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html) to connect to a [MySQL server](https://www.mysql.com) and issue SQL commands.

Here is a basic program that prints out the tables contained within a database:

In [2]:
# show tables in the db
# if the following import fails try: conda install -c anaconda pymysql
import pymysql as sql
import pandas as pd

# Open database connection to our test database
host = 'testdb.cwy05wfzuxbv.us-east-1.rds.amazonaws.com'
userdb = 'world'
user = 'csc310'
password = 'csc310$is$fun'
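# PyMySQL 1.0 and later accept connection parameters as keyword arguments only
db = sql.connect(host=host, user=user, password=password, database=userdb)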

# get data using a pandas dataframe
sql_string = 'SHOW TABLES'
data = pd.read_sql(sql_string, con=db) 
print (data)

# disconnect from server
db.close()

   Tables_in_world
0             city
1          country
2  countrylanguage


In [3]:
# simple SQL client for executing SQL commands
import pymysql as sql
import pandas as pd

# Open database connection to our test database
host = 'testdb.cwy05wfzuxbv.us-east-1.rds.amazonaws.com'
userdb = 'world'
user = 'csc310'
password = 'csc310$is$fun'
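# PyMySQL 1.0 and later accept connection parameters as keyword arguments only
db = sql.connect(host=host, user=user, password=password, database=userdb)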

# get data using a pandas dataframe
sql_string = input('SQL> ')
data = pd.read_sql(sql_string, con=db) 
print (data)

# disconnect from server
db.close()

SQL> select name, capital from country
             name  capital
0           Aruba    129.0
1     Afghanistan      1.0
2          Angola     56.0
3        Anguilla     62.0
4         Albania     34.0
..            ...      ...
234         Yemen   1780.0
235    Yugoslavia   1792.0
236  South Africa    716.0
237        Zambia   3162.0
238      Zimbabwe   4068.0

[239 rows x 2 columns]


In [4]:
# show only cities where the population is greater than 5,000,000
import pymysql as sql
import pandas as pd

# Open database connection to our test database
host = 'testdb.cwy05wfzuxbv.us-east-1.rds.amazonaws.com'
userdb = 'world'
user = 'csc310'
password = 'csc310$is$fun'
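# PyMySQL 1.0 and later accept connection parameters as keyword arguments only
db = sql.connect(host=host, user=user, password=password, database=userdb)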

# get data using a pandas dataframe
sql_string = \
'''
SELECT name, countrycode, population 
FROM city 
WHERE population > 5000000
ORDER BY population DESC
'''

data = pd.read_sql(sql_string, con=db) 
print (data)

# disconnect from server
db.close()

                 name countrycode  population
0     Mumbai (Bombay)         IND    10500000
1               Seoul         KOR     9981619
2           São Paulo         BRA     9968485
3            Shanghai         CHN     9696300
4             Jakarta         IDN     9604900
5             Karachi         PAK     9269265
6            Istanbul         TUR     8787958
7    Ciudad de México         MEX     8591309
8              Moscow         RUS     8389200
9            New York         USA     8008278
10              Tokyo         JPN     7980230
11             Peking         CHN     7472000
12             London         GBR     7285000
13              Delhi         IND     7206704
14              Cairo         EGY     6789479
15            Teheran         IRN     6758845
16               Lima         PER     6464693
17          Chongqing         CHN     6351600
18            Bangkok         THA     6320174
19  Santafé de Bogotá         COL     6260862
20     Rio de Janeiro         BRA 

## Joins

A [SQL join](https://en.wikipedia.org/wiki/Join_(SQL)) combines columns from one or more tables in a relational database by using values common to each; joining a table with itself is called a self-join. The result is a set of rows that can be saved as a table or used as it is. The most common join, and the default when you write just `JOIN`, is the `INNER JOIN`, which returns only those combinations of rows from the two tables for which the join condition is true.

![alt](https://www.w3schools.com/sql/img_innerjoin.gif)

Back to the product database:
[SQL TryIt Editor](https://www.w3schools.com/sql/trysql.asp?filename=trysql_asc)

Recall that we have a `products` table and a `suppliers` table.  We want to print out a result table including the product name, supplier name, and supplier country.

```
SELECT products.productname, suppliers.suppliername, suppliers.country
FROM products
JOIN suppliers
ON products.supplierid = suppliers.supplierid;
```
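
The same join can be run locally as a sketch with Python's built-in `sqlite3` module. The rows below are invented and are not the contents of the W3Schools database. One product deliberately refers to a supplier that does not exist, to show that an inner join drops rows without a match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE suppliers (supplierid INTEGER PRIMARY KEY, suppliername TEXT, country TEXT)"
)
conn.execute(
    "CREATE TABLE products (productid INTEGER PRIMARY KEY, productname TEXT, supplierid INTEGER)"
)
conn.executemany(
    "INSERT INTO suppliers VALUES (?, ?, ?)",
    [(1, "Acme Foods", "UK"), (2, "Nordic Fish", "Norway")],
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    # supplier 99 does not exist, so 'Mystery Box' has no matching supplier row
    [(1, "Tea", 1), (2, "Salmon", 2), (3, "Mystery Box", 99)],
)

joined = conn.execute("""
    SELECT products.productname, suppliers.suppliername, suppliers.country
    FROM products
    JOIN suppliers
    ON products.supplierid = suppliers.supplierid
    ORDER BY products.productid
""").fetchall()

for row in joined:
    print(row)
conn.close()
```

Only the two products with a matching supplier appear in the result; the unmatched product is left out.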

Let's try something similar with our `world` database. Here are the columns for the `city` table:
```
SQL> show columns in city
         Field      Type Null  Key Default           Extra
0           ID   int(11)   NO  PRI    None  auto_increment
1         Name  char(35)   NO                             
2  CountryCode   char(3)   NO  MUL                        
3     District  char(20)   NO                             
4   Population   int(11)   NO            0                
```
And here are the columns for the `country` table:
```
SQL> show columns from country
             Field                                               Type Null  
0             Code                                            char(3)   NO   
1             Name                                           char(52)   NO   
2        Continent  enum('Asia','Europe','North America','Africa',...   NO   
3           Region                                           char(26)   NO   
4      SurfaceArea                                        float(10,2)   NO   
5        IndepYear                                        smallint(6)  YES   
6       Population                                            int(11)   NO   
7   LifeExpectancy                                         float(3,1)  YES   
8              GNP                                        float(10,2)  YES   
9           GNPOld                                        float(10,2)  YES   
10       LocalName                                           char(45)   NO   
11  GovernmentForm                                           char(45)   NO   
12     HeadOfState                                           char(60)  YES   
13         Capital                                            int(11)  YES   
14           Code2                                            char(2)   NO   
```
We want to construct a join that gives us the name of a city, the country it is located in, and that country's two-letter code.

In [5]:
# show name of a city, which country it is located in, and the two letter country code.
import pymysql as sql
import pandas as pd

# Open database connection to our test database
host = 'testdb.cwy05wfzuxbv.us-east-1.rds.amazonaws.com'
userdb = 'world'
user = 'csc310'
password = 'csc310$is$fun'
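# PyMySQL 1.0 and later accept connection parameters as keyword arguments only
db = sql.connect(host=host, user=user, password=password, database=userdb)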

# get data using a pandas dataframe
sql_string = \
'''
SELECT city.Name as CityName, 
       country.Name as CountryName, 
       country.Code2 as CountryCode 
FROM city
JOIN country
ON city.CountryCode = country.Code
'''

data = pd.read_sql(sql_string, con=db) 
print (data)

# disconnect from server
db.close()

            CityName  CountryName CountryCode
0         Oranjestad        Aruba          AW
1              Kabul  Afghanistan          AF
2           Qandahar  Afghanistan          AF
3              Herat  Afghanistan          AF
4     Mazar-e-Sharif  Afghanistan          AF
...              ...          ...         ...
4074        Bulawayo     Zimbabwe          ZW
4075     Chitungwiza     Zimbabwe          ZW
4076    Mount Darwin     Zimbabwe          ZW
4077          Mutare     Zimbabwe          ZW
4078           Gweru     Zimbabwe          ZW

[4079 rows x 3 columns]


We have not touched upon DDL and DML, but as data scientists we are mostly consumers of data, so querying databases takes priority over all other database activities.

## Exercise

1. Explore the `customer` database further with the [SQL TryIt Editor](https://www.w3schools.com/sql/trysql.asp?filename=trysql_asc), perhaps by creating additional inner joins.

1. Explore the `world` database further using Python.