<a href="https://colab.research.google.com/github/michalis0/Business-Intelligence-and-Analytics/blob/master/week5%20-%20SQL/walkthrough/SQL1_Walkthrough.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 5 - SQL (1/2) - Walkthrough

SQLite is a lightweight SQL database engine that can be used directly from Python, which allows SQL queries to be executed in (Python) notebooks. Only the setup (importing the dataset and creating the SQLite database) is a bit more complicated than with the SQL Explorer.

In this walkthrough, we show you the basic functions you need to know by using the `CovidData2021` dataset. First we need to import it.

## Setup

In [None]:
import pandas as pd
from sqlalchemy import create_engine

# Create an in-memory SQLite database
db = create_engine('sqlite://', echo=False)

# Load the semicolon-separated CSV file into a DataFrame
csvfile = 'https://raw.githubusercontent.com/michalis0/Business-Intelligence-and-Analytics/master/data/CovidData2021.csv'
df = pd.read_csv(csvfile, delimiter=';')

# Set the column types
df['country'] = df['country'].astype('category')
df['date'] = df['date'].astype('object')
for column in ['cases', 'deaths', 'tests', 'vaccinations']:
    df[column] = df[column].astype('int64')
    # Drop the rows that contain a negative count in this column
    df.drop(df[df[column] < 0].index, inplace=True)

# Remove the vaccinations column, then write the DataFrame to a SQL table
df = df.drop('vaccinations', axis='columns')
table_name = 'coronavirus'
df.to_sql(table_name, con=db)

The dataset contains the following elements:
- **country:** country of the observation
- **date:** day of the observation
- **cases:** number of cases in the respective country on the day of the observation
- **deaths:** number of deaths in the respective country on the day of the observation
- **tests:** number of tests in the respective country on the day of the observation

In [None]:
df.sample(5, random_state=12)

Unnamed: 0,country,date,cases,deaths,tests
11065,Montenegro,22.11.2020,409,5,0
385,Andorra,11.03.2020,0,0,0
2221,Bosnia and Herzegovina,28.01.2021,330,16,2542
14926,Slovakia,27.02.2021,2848,109,736984
3839,Denmark,14.05.2020,46,4,12977


## Basic SQLite Syntax
To run a query against the SQLite database, use the following syntax.

```python
query = """
YourQueryHere
"""
sql_df = pd.read_sql(query, con=db)
sql_df
```
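
The same pattern can be tried without downloading the dataset. The sketch below is only an illustration: it assumes Python's built-in `sqlite3` module instead of pandas and SQLAlchemy, and it fills a small `coronavirus` table with the five sample rows displayed in the setup section.

```python
import sqlite3

# In-memory SQLite database (assumption: sqlite3 replaces the SQLAlchemy engine here)
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table coronavirus "
    "(country text, date text, cases integer, deaths integer, tests integer)"
)

# The five sample rows shown in the setup section
sample_rows = [
    ("Montenegro", "22.11.2020", 409, 5, 0),
    ("Andorra", "11.03.2020", 0, 0, 0),
    ("Bosnia and Herzegovina", "28.01.2021", 330, 16, 2542),
    ("Slovakia", "27.02.2021", 2848, 109, 736984),
    ("Denmark", "14.05.2020", 46, 4, 12977),
]
conn.executemany("insert into coronavirus values (?, ?, ?, ?, ?)", sample_rows)

# Same structure as above: write the query as a string, then execute it
query = """
select country, date
from coronavirus
"""
rows = conn.execute(query).fetchall()
print(rows)
```

With pandas, `pd.read_sql(query, con=db)` plays the role of `conn.execute(query).fetchall()` and returns a DataFrame instead of a list of tuples.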

## Basic SQL Functions

Now that you know how to set up SQLite and are familiar with the basic syntax needed to run queries, let's get to the functions.

Get an overview of the `coronavirus` table by taking a look at all columns and all rows.

In [None]:
query = """
select *
from coronavirus
"""
sql_df = pd.read_sql(query, con=db)
sql_df

Unnamed: 0,index,country,date,cases,deaths,tests
0,0,Albania,25.02.2020,0,0,8
1,1,Albania,26.02.2020,0,0,5
2,2,Albania,27.02.2020,0,0,4
3,3,Albania,28.02.2020,0,0,1
4,4,Albania,29.02.2020,0,0,8
...,...,...,...,...,...,...
17596,17672,Vatican,02.03.2021,0,0,0
17597,17673,Vatican,03.03.2021,0,0,0
17598,17674,Vatican,04.03.2021,0,0,0
17599,17675,Vatican,05.03.2021,0,0,0


We may have selected too many rows; we don't need all 17601 of them. Let's only look at the first 10.

In [None]:
query = """
select *
from coronavirus
limit 10
"""
sql_df = pd.read_sql(query, con=db)
sql_df

Unnamed: 0,index,country,date,cases,deaths,tests
0,0,Albania,25.02.2020,0,0,8
1,1,Albania,26.02.2020,0,0,5
2,2,Albania,27.02.2020,0,0,4
3,3,Albania,28.02.2020,0,0,1
4,4,Albania,29.02.2020,0,0,8
5,5,Albania,01.03.2020,0,0,3
6,6,Albania,02.03.2020,0,0,2
7,7,Albania,03.03.2020,0,0,5
8,8,Albania,04.03.2020,0,0,6
9,9,Albania,05.03.2020,0,0,8


You can also select specific columns, for example `country`, `date`, and `cases`.

In [None]:
query = """
select country, date, cases
from coronavirus
"""
sql_df = pd.read_sql(query, con=db)
sql_df

Unnamed: 0,country,date,cases
0,Albania,25.02.2020,0
1,Albania,26.02.2020,0
2,Albania,27.02.2020,0
3,Albania,28.02.2020,0
4,Albania,29.02.2020,0
...,...,...,...
17596,Vatican,02.03.2021,0
17597,Vatican,03.03.2021,0
17598,Vatican,04.03.2021,0
17599,Vatican,05.03.2021,0


The keyword **`where`** is used to add a condition to your query: only the rows that satisfy it are returned.

In [None]:
query = """
select country, date, cases
from coronavirus
where country = 'Switzerland'
"""
sql_df = pd.read_sql(query, con=db)
sql_df

Unnamed: 0,country,date,cases
0,Switzerland,24.01.2020,0
1,Switzerland,25.01.2020,0
2,Switzerland,26.01.2020,0
3,Switzerland,27.01.2020,0
4,Switzerland,28.01.2020,0
...,...,...,...
402,Switzerland,02.03.2021,1130
403,Switzerland,03.03.2021,1223
404,Switzerland,04.03.2021,1223
405,Switzerland,05.03.2021,1222


The keyword **`order by`** is used to sort the results of a query by a column in ascending (`asc`, the default) or descending (`desc`) order.

In [None]:
query = """
select country, date, cases
from coronavirus
order by cases desc
"""
sql_df = pd.read_sql(query, con=db)
sql_df

Unnamed: 0,country,date,cases
0,France,02.11.2020,106091
1,Spain,25.01.2021,93822
2,France,07.11.2020,86655
3,Spain,18.01.2021,84287
4,Spain,01.02.2021,79686
...,...,...,...
17596,Vatican,02.03.2021,0
17597,Vatican,03.03.2021,0
17598,Vatican,04.03.2021,0
17599,Vatican,05.03.2021,0
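
Combining `order by` with `limit` gives a "top N" query. The following is a minimal sketch under stated assumptions: it uses Python's built-in `sqlite3` module and only the five sample rows from the setup section, so the result is not the top 3 of the full dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table coronavirus (country text, date text, cases integer)")
# Sample rows taken from the df.sample output in the setup section
conn.executemany(
    "insert into coronavirus values (?, ?, ?)",
    [
        ("Montenegro", "22.11.2020", 409),
        ("Andorra", "11.03.2020", 0),
        ("Bosnia and Herzegovina", "28.01.2021", 330),
        ("Slovakia", "27.02.2021", 2848),
        ("Denmark", "14.05.2020", 46),
    ],
)

# Sort by cases from largest to smallest, then keep the first three rows
query = """
select country, cases
from coronavirus
order by cases desc
limit 3
"""
top3 = conn.execute(query).fetchall()
print(top3)
```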


The keyword **`group by`** is used to group the results of a query by a column, usually so that an aggregate function such as `sum()` is computed once per group.

In [None]:
query = """
select country, sum(cases) as TotalCases
from coronavirus
group by country
"""
sql_df = pd.read_sql(query, con=db)
sql_df

Unnamed: 0,country,TotalCases
0,Albania,112078
1,Andorra,11019
2,Austria,471891
3,Belarus,294432
4,Belgium,785226
5,Bosnia and Herzegovina,133908
6,Bulgaria,259811
7,Croatia,246120
8,Cyprus,36456
9,Czechia,1310861


The keyword **`distinct`** is used to return only distinct (unique) values.

In [None]:
query = """
select distinct country
from coronavirus
order by country asc
"""
sql_df = pd.read_sql(query, con=db)
sql_df

Unnamed: 0,country
0,Albania
1,Andorra
2,Austria
3,Belarus
4,Belgium
5,Bosnia and Herzegovina
6,Bulgaria
7,Croatia
8,Cyprus
9,Czechia


## Aggregate SQL Functions
You will often have to find the minimum, maximum, count, average, or sum of a column. You can use the functions presented in this section in the `select` and `having` clauses, but not in the **`where` clause**.
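
To illustrate the difference between `having` and `where`, here is a small sketch. It assumes Python's built-in `sqlite3` module and uses only the five largest daily counts shown in the `order by` output above, so the totals are sums over those five rows, not over the full dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table coronavirus (country text, date text, cases integer)")
# The five rows with the most cases, copied from the order by output
conn.executemany(
    "insert into coronavirus values (?, ?, ?)",
    [
        ("France", "02.11.2020", 106091),
        ("France", "07.11.2020", 86655),
        ("Spain", "18.01.2021", 84287),
        ("Spain", "25.01.2021", 93822),
        ("Spain", "01.02.2021", 79686),
    ],
)

# having filters the groups after the aggregate has been computed
having_query = """
select country, sum(cases) as TotalCases
from coronavirus
group by country
having sum(cases) > 200000
"""
having_rows = conn.execute(having_query).fetchall()

# An aggregate inside where is rejected by SQLite
try:
    conn.execute("select country from coronavirus where sum(cases) > 200000")
    where_error = None
except sqlite3.OperationalError as exc:
    where_error = str(exc)

print(having_rows)
print(where_error)
```

`where` is evaluated row by row before any grouping takes place, which is why an aggregate cannot appear there.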

The **`min()`** function returns the smallest value of the selected column.

In [None]:
query = """
select country, date, min(cases)
from coronavirus
"""
sql_df = pd.read_sql(query, con=db)
sql_df

Unnamed: 0,country,date,min(cases)
0,Albania,25.02.2020,0


The **`max()`** function returns the largest value of the selected column.

In [None]:
query = """
select country, max(cases)
from coronavirus
"""
sql_df = pd.read_sql(query, con=db)
sql_df

Unnamed: 0,country,max(cases)
0,France,106091
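
In the two queries above, `country` and `date` are selected next to an aggregate without a `group by`. SQLite fills such "bare" columns with the values of the row that holds the minimum or maximum, whereas many other database systems reject this kind of query. The sketch below assumes Python's built-in `sqlite3` module and the five sample rows from the setup section, and it also shows a more portable way to get the same row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table coronavirus (country text, cases integer)")
# Sample rows taken from the df.sample output in the setup section
conn.executemany(
    "insert into coronavirus values (?, ?)",
    [
        ("Montenegro", 409),
        ("Andorra", 0),
        ("Bosnia and Herzegovina", 330),
        ("Slovakia", 2848),
        ("Denmark", 46),
    ],
)

# SQLite-specific: the bare column comes from the row with the min or max
lowest = conn.execute("select country, min(cases) from coronavirus").fetchone()
highest = conn.execute("select country, max(cases) from coronavirus").fetchone()

# Portable alternative: sort and keep the first row
portable_highest = conn.execute(
    "select country, cases from coronavirus order by cases desc limit 1"
).fetchone()

print(lowest, highest, portable_highest)
```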


The **`count()`** function returns the number of rows that match a specified criterion.

In [None]:
query = """
select count(distinct country)
from coronavirus
"""
sql_df = pd.read_sql(query, con=db)
sql_df

Unnamed: 0,count(distinct country)
0,51
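
As a small illustration of how `count(*)` differs from `count(distinct ...)`, the sketch below counts rows and distinct countries in a five-row table. It assumes Python's built-in `sqlite3` module and reuses the five largest daily counts from the `order by` output above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table coronavirus (country text, date text, cases integer)")
# Five rows copied from the order by output: two for France, three for Spain
conn.executemany(
    "insert into coronavirus values (?, ?, ?)",
    [
        ("France", "02.11.2020", 106091),
        ("France", "07.11.2020", 86655),
        ("Spain", "18.01.2021", 84287),
        ("Spain", "25.01.2021", 93822),
        ("Spain", "01.02.2021", 79686),
    ],
)

# count(*) counts every row, count(distinct country) counts each country once
n_rows, n_countries = conn.execute(
    "select count(*), count(distinct country) from coronavirus"
).fetchone()
print(n_rows, n_countries)
```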


The **`avg()`** function returns the average value of a column.

In [None]:
query = """
select avg(deaths)
from coronavirus
"""
sql_df = pd.read_sql(query, con=db)
sql_df

Unnamed: 0,avg(deaths)
0,47.304585


The **`sum()`** function returns the total sum of a numeric column.

In [None]:
query = """
select country, sum(tests)
from coronavirus
where country = 'Switzerland'
"""
sql_df = pd.read_sql(query, con=db)
sql_df

Unnamed: 0,country,sum(tests)
0,Switzerland,4381673


## Advanced SQL Functions
You now know the basic functions you will use very frequently. We can now look at some more advanced features of SQL.

If you want to compare values, use the following comparison operators in a `where` or `having` clause.

| Operator | Meaning |
| :---: | --- |
| = | equal to |
| > | greater than |
| < | less than |
| >= | greater than or equal to |
| <= | less than or equal to |
| <> | not equal to |

In [None]:
query = """
select country, date, cases
from coronavirus
where cases > 75000
"""
sql_df = pd.read_sql(query, con=db)
sql_df

Unnamed: 0,country,date,cases
0,France,02.11.2020,106091
1,France,07.11.2020,86655
2,Spain,18.01.2021,84287
3,Spain,25.01.2021,93822
4,Spain,01.02.2021,79686


When looking for a value to be within a certain range, you can use **`(not) between ... and ...`**.

In [None]:
query = """
select country, date, cases
from coronavirus
where cases between 5000 and 10000
"""
sql_df = pd.read_sql(query, con=db)
sql_df

Unnamed: 0,country,date,cases
0,Austria,30.10.2020,5627
1,Austria,31.10.2020,5349
2,Austria,04.11.2020,6901
3,Austria,05.11.2020,7416
4,Austria,06.11.2020,6464
...,...,...,...
819,United Kingdom,02.03.2021,6411
820,United Kingdom,03.03.2021,6420
821,United Kingdom,04.03.2021,6644
822,United Kingdom,05.03.2021,6024
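
Note that `between` includes both bounds. The following sketch, which assumes Python's built-in `sqlite3` module and the five sample rows from the setup section, checks that `between 46 and 409` returns the same rows as the explicit `>=` and `<=` comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table coronavirus (country text, cases integer)")
# Sample rows taken from the df.sample output in the setup section
conn.executemany(
    "insert into coronavirus values (?, ?)",
    [
        ("Montenegro", 409),
        ("Andorra", 0),
        ("Bosnia and Herzegovina", 330),
        ("Slovakia", 2848),
        ("Denmark", 46),
    ],
)

# Both bounds (46 and 409) are part of the range
between_rows = conn.execute(
    "select country, cases from coronavirus "
    "where cases between 46 and 409 order by cases"
).fetchall()

# Equivalent formulation with two comparisons
explicit_rows = conn.execute(
    "select country, cases from coronavirus "
    "where cases >= 46 and cases <= 409 order by cases"
).fetchall()

print(between_rows)
```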


Sometimes you will have to find values that are among a few given ones. You can then use **`(not) in ListOfValues`**.

In [None]:
query = """
select country, date, cases
from coronavirus
where cases not in (0, 1, 2, 3, 4, 5, 999)
"""
sql_df = pd.read_sql(query, con=db)
sql_df

Unnamed: 0,country,date,cases
0,Albania,10.03.2020,8
1,Albania,12.03.2020,11
2,Albania,13.03.2020,10
3,Albania,16.03.2020,9
4,Albania,20.03.2020,6
...,...,...,...
13978,United Kingdom,04.03.2021,6644
13979,United Kingdom,05.03.2021,6024
13980,United Kingdom,06.03.2021,6118
13981,Vatican,12.10.2020,7


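When filtering on a character string pattern, **`(not) like`** is the best way to do so. Remember that:

- **%** represents zero, one, or multiple characters
- **_** represents exactly one character

Note that `like` in SQLite is case-insensitive for ASCII letters by default, which is why the pattern `'_I%'` in the next query matches `Finland` and `Lithuania`.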

In [None]:
query = """
select country, date, cases
from coronavirus
where country like '_I%'
"""
sql_df = pd.read_sql(query, con=db)
sql_df

Unnamed: 0,country,date,cases
0,Finland,29.01.2020,1
1,Finland,30.01.2020,0
2,Finland,31.01.2020,0
3,Finland,01.02.2020,0
4,Finland,02.02.2020,0
...,...,...,...
1188,Lithuania,02.03.2021,427
1189,Lithuania,03.03.2021,526
1190,Lithuania,04.03.2021,547
1191,Lithuania,05.03.2021,437
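
The pattern can also be tested on individual strings. This sketch assumes Python's built-in `sqlite3` module and evaluates `like '_I%'` on a few country names that appear in this walkthrough; no table is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Country names that appear in the outputs of this walkthrough
countries = ["Finland", "Lithuania", "Spain", "Switzerland", "Denmark"]

# like returns 1 when the string matches the pattern and 0 otherwise
matches = {
    name: bool(conn.execute("select ? like '_I%'", (name,)).fetchone()[0])
    for name in countries
}
print(matches)
```

Only names whose second letter is an `i` (in upper or lower case, since SQLite's `like` ignores ASCII case by default) match the pattern.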


Sometimes we have to select observations with missing values, which are referred to as `null`. To select observations with(out) missing values, one should use **`is (not) null`** (we don't have any null values in this dataset).

In [None]:
query = """
select country, date, cases
from coronavirus
where cases is null
"""
sql_df = pd.read_sql(query, con=db)
sql_df

Unnamed: 0,country,date,cases
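
Since the dataset has no missing values, the effect of `is null` is easier to see on a toy table. The sketch below assumes Python's built-in `sqlite3` module; the row for `Exampleland` is hypothetical and exists only to provide a `null`. It also shows that comparing with `= null` never matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table coronavirus (country text, date text, cases integer)")
conn.executemany(
    "insert into coronavirus values (?, ?, ?)",
    [
        ("Andorra", "11.03.2020", 0),         # row from the setup sample
        ("Exampleland", "01.01.2021", None),  # hypothetical row with a missing value
    ],
)

# A comparison with null is never true, so this returns no rows
equals_null = conn.execute(
    "select country from coronavirus where cases = null"
).fetchall()

# is null and is not null are the correct tests for missing values
is_null = conn.execute(
    "select country from coronavirus where cases is null"
).fetchall()
is_not_null = conn.execute(
    "select country from coronavirus where cases is not null"
).fetchall()

print(equals_null, is_null, is_not_null)
```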


You can give aliases (new names) to columns when extracting the result of your query with **`as`**.

In [None]:
query = """
select country as Pays, cases as cas
from coronavirus
"""
sql_df = pd.read_sql(query, con=db)
sql_df

Unnamed: 0,Pays,cas
0,Albania,0
1,Albania,0
2,Albania,0
3,Albania,0
4,Albania,0
...,...,...
17596,Vatican,0
17597,Vatican,0
17598,Vatican,0
17599,Vatican,0


One can use arithmetic expressions, which is especially useful in the `select` clause.

| Operator | Meaning |
| :---: | --- |
| + | addition |
| - | subtraction |
| * | multiplication |
| / | division |
| % | remainder (modulo) |

Note that it might be necessary to multiply columns by 1.0 in order to convert integers into floating-point numbers. Had we not done this here, SQLite would have performed integer division, and the positive rate would have been truncated to a whole number (0 for almost every observation).

Another useful function, especially when dividing, is **`round()`**, which rounds a number to a given number of decimal places.

In [None]:
query = """
select country, date, cases, tests, round(cases*1.0/tests*1.0, 2) as PositiveRate
from coronavirus
where PositiveRate > 0
"""
sql_df = pd.read_sql(query, con=db)
sql_df

Unnamed: 0,country,date,cases,tests,PositiveRate
0,Albania,09.03.2020,2,18,0.11
1,Albania,10.03.2020,8,37,0.22
2,Albania,11.03.2020,2,43,0.05
3,Albania,12.03.2020,11,141,0.08
4,Albania,13.03.2020,10,159,0.06
...,...,...,...,...,...
9552,United Kingdom,27.02.2021,7457,383946,0.02
9553,United Kingdom,28.02.2021,6055,526679,0.01
9554,United Kingdom,01.03.2021,5462,727972,0.01
9555,United Kingdom,02.03.2021,6411,675543,0.01
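
The effect of multiplying by 1.0 can be checked on single values. This sketch assumes Python's built-in `sqlite3` module and uses 2 cases for 18 tests (the first row of the output above) and 409 cases for 0 tests (the Montenegro row of the setup sample).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

integer_division, float_division, rounded, by_zero = conn.execute(
    """
    select 2 / 18,                   -- both operands are integers
           2 * 1.0 / 18,             -- 1.0 turns the division into a float division
           round(2 * 1.0 / 18, 2),   -- keep two decimal places
           409 * 1.0 / 0             -- division by zero
    """
).fetchone()

print(integer_division, float_division, rounded, by_zero)
```

In SQLite a division by zero yields `null` (`None` in Python) rather than an error, so rows with 0 tests have no `PositiveRate` and are removed by the condition `PositiveRate > 0`.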


To combine several conditions, use the following logical operators:

| Operator | Meaning |
| :---: | --- |
| and | True if both conditions are true |
| or | True if at least one condition is true |
| not | True if the condition that follows is false |

The use of parentheses is advised when using several logical operators.

In [None]:
query = """
select country, date, cases, deaths
from coronavirus
where country = 'Spain' and (cases > 10000 or deaths > 200)
"""
sql_df = pd.read_sql(query, con=db)
sql_df

Unnamed: 0,country,date,cases,deaths
0,Spain,19.03.2020,4053,207
1,Spain,20.03.2020,2447,213
2,Spain,21.03.2020,4964,332
3,Spain,22.03.2020,3394,397
4,Spain,23.03.2020,6368,539
...,...,...,...,...
166,Spain,26.02.2021,8341,329
167,Spain,01.03.2021,15978,467
168,Spain,03.03.2021,6137,446
169,Spain,04.03.2021,6037,254
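
Parentheses matter because `and` is evaluated before `or`. The sketch below is an illustration under stated assumptions: it uses Python's built-in `sqlite3` module and the five sample rows from the setup section, and it runs the same conditions with and without parentheses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table coronavirus (country text, cases integer, deaths integer)")
# Sample rows taken from the df.sample output in the setup section
conn.executemany(
    "insert into coronavirus values (?, ?, ?)",
    [
        ("Montenegro", 409, 5),
        ("Andorra", 0, 0),
        ("Bosnia and Herzegovina", 330, 16),
        ("Slovakia", 2848, 109),
        ("Denmark", 46, 4),
    ],
)

# Read as: (country = 'Denmark' and cases > 300) or deaths > 10
without_parentheses = conn.execute(
    "select country from coronavirus "
    "where country = 'Denmark' and cases > 300 or deaths > 10 "
    "order by country"
).fetchall()

# The country condition now applies to both alternatives
with_parentheses = conn.execute(
    "select country from coronavirus "
    "where country = 'Denmark' and (cases > 300 or deaths > 10) "
    "order by country"
).fetchall()

print(without_parentheses)
print(with_parentheses)
```

Without parentheses, rows from other countries slip through as soon as `deaths > 10` holds; with parentheses, no row qualifies because the Danish sample row has 46 cases and 4 deaths.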
