# Lesson 12: A Brief Introduction to SQL Clauses

Data scientists work with data stored in tables. This lesson introduces relations, one of the most widely used ways to represent data tables. We'll also introduce SQL, the standard programming language for working with relations.

In particular, this lesson covers operations for taking subsets of relations. When data scientists begin working with a relation, they often want to subset the specific data that they plan to use. For example, a data scientist can slice out the ten relevant features from a relation with hundreds of columns. Or, they can filter a relation to remove rows with incomplete data. For the rest of this lesson, we'll introduce relation operations using a small database of people, their homes, and their pets.

To work with relations, we'll introduce a domain-specific programming language called **SQL (Structured Query Language)**. We commonly pronounce *"SQL"* like *"sequel"* instead of spelling out the acronym. SQL is a specialized language for working with relations, so it has its own syntax that makes it easier to write programs that operate on relational data.

In [30]:
import pandas as pd
import numpy as np
import sqlalchemy

In [31]:
people = pd.read_csv('data/sf-people.csv')
homes = pd.read_csv('data/sf-homes.csv')
pets = pd.read_csv('data/sf-pets.csv')

## What is SQLite?

- [Relational Database](https://www.ibm.com/topics/relational-databases)
- [SQLite](https://sqlite.org/index.html)
- [Self-contained](https://www.oreilly.com/library/view/using-sqlite/9781449394592/ch01s01.html)


Our database is stored in a file called `simplefolks.db`. This file is a SQLite database, so we'll create a `sqlalchemy` engine that can connect to it.

In [33]:
db = sqlalchemy.create_engine('sqlite:///data/simplefolks.db')

Let's inspect the tables.

In [34]:
insp = sqlalchemy.inspect(db)
insp.get_table_names()

['homes', 'people', 'pets']

In [35]:
insp.get_columns('people')

[{'name': 'name',
  'type': TEXT(),
  'nullable': False,
  'default': None,
  'autoincrement': 'auto',
  'primary_key': 0},
 {'name': 'sex',
  'type': TEXT(),
  'nullable': False,
  'default': None,
  'autoincrement': 'auto',
  'primary_key': 0},
 {'name': 'age',
  'type': INTEGER(),
  'nullable': False,
  'default': None,
  'autoincrement': 'auto',
  'primary_key': 0}]
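
As a rough cross-check of what the inspector reports, the same metadata can be read directly from SQLite. The sketch below uses Python's built-in `sqlite3` module instead of `sqlalchemy`, and a toy in-memory table that only mimics the `people` schema shown above; it is an assumption for illustration, not the contents of `simplefolks.db`.

```python
import sqlite3

# Toy in-memory stand-in for simplefolks.db (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people ("
    "name TEXT NOT NULL, sex TEXT NOT NULL, age INTEGER NOT NULL)"
)

# sqlite_master is SQLite's built-in catalog of tables, indexes, and views.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = [
    (name, col_type, bool(notnull))
    for _cid, name, col_type, notnull, _default, _pk in conn.execute(
        "PRAGMA table_info(people)"
    )
]

print(tables)
print(columns)
```

A `notnull` value of `True` here corresponds to `'nullable': False` in the inspector output.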

A relation has rows and columns, and every column has a label. For example, here are three rows of the `people` relation:

|**name**|**sex**|**age**|
|--------|-------|-------|
|Austin|M|33|
|Blair|M|90|
|Carolina|F|28|

Unlike dataframes, however, individual rows in a relation don't have labels. Also, unlike dataframes, rows of a relation aren't ordered.
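
To make the contrast with dataframes concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It builds a toy in-memory `people` table from the three rows shown above (the in-memory database is an assumption for illustration). The query result has column labels but no row labels, and without `ORDER BY` it is safest to treat the rows as an unordered collection.

```python
import sqlite3

# Toy in-memory database; the three rows are copied from the table above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, sex TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Austin", "M", 33), ("Blair", "M", 90), ("Carolina", "F", 28)],
)

cursor = conn.execute("SELECT * FROM people")

# Column labels are part of the relation.
column_labels = [description[0] for description in cursor.description]

# Rows come back as plain tuples: no index and no row labels.
rows = cursor.fetchall()

# Without ORDER BY, compare results as a set rather than relying on row order.
rows_as_set = set(rows)

print(column_labels)
print(rows_as_set)
```

The `0, 1, 2, ...` column that appears in the outputs in this lesson is the index pandas adds when it builds a dataframe from the query result; it is not stored in the relation.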

### SELECT

In [11]:
query = ''' 
SELECT *
FROM people;
'''

pd.read_sql(query, db)

| |**name**|**sex**|**age**|
|---|---|---|---|
|0|Austin|M|33|
|1|Blair|M|90|
|2|Carolina|F|28|
|3|Dani|F|41|
|4|Donald|M|70|
|5|Eliza|F|37|
|6|Farida|F|23|
|7|Georgina|F|19|
|8|Hillary|F|68|
|9|Leland|M|16|


In [18]:
query = ''' 
SELECT name
FROM people;
'''

pd.read_sql(query, db)

| |**name**|
|---|---|
|0|Austin|
|1|Blair|
|2|Carolina|
|3|Dani|
|4|Donald|
|5|Eliza|
|6|Farida|
|7|Georgina|
|8|Hillary|
|9|Leland|


### WHERE

In [20]:
query = ''' 
SELECT name
FROM people
WHERE sex = 'M';
'''

pd.read_sql(query, db)

| |**name**|
|---|---|
|0|Austin|
|1|Blair|
|2|Donald|
|3|Leland|
|4|Liam|
|5|Michael|
|6|Zed|


In [38]:
query = ''' 
SELECT name, age
FROM people
WHERE age > 45;
'''

pd.read_sql(query, db)

| |**name**|**age**|
|---|---|---|
|0|Blair|90|
|1|Donald|70|
|2|Hillary|68|
|3|Michael|48|
|4|Phoebe|52|


In [40]:
query = ''' 
SELECT name, age, sex
FROM people
WHERE age > 45 AND sex = 'M';
'''

pd.read_sql(query, db)

| |**name**|**age**|**sex**|
|---|---|---|---|
|0|Blair|90|M|
|1|Donald|70|M|
|2|Michael|48|M|
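
The queries above filter rows with `WHERE` before `SELECT` picks the columns to return. The following sketch replays two of them with Python's built-in `sqlite3` module on a toy in-memory table. The five rows are copied from outputs shown in this lesson; the in-memory database itself is an assumption for illustration and is only a subset of the real `people` table.

```python
import sqlite3

# Toy in-memory table holding a handful of rows seen in this lesson.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, sex TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [
        ("Austin", "M", 33),
        ("Blair", "M", 90),
        ("Donald", "M", 70),
        ("Hillary", "F", 68),
        ("Michael", "M", 48),
    ],
)

# WHERE can test a column (sex) that SELECT does not return.
men = conn.execute("SELECT name FROM people WHERE sex = 'M'").fetchall()

# AND keeps only the rows that satisfy both conditions.
older_men = conn.execute(
    "SELECT name, age FROM people WHERE age > 45 AND sex = 'M'"
).fetchall()

print(men)
print(older_men)
```

Hillary passes the `age > 45` test but fails `sex = 'M'`, so she is excluded from the second result.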


### ORDER BY

In [41]:
query = ''' 
SELECT name, age, sex
FROM people
WHERE age > 45 AND sex = 'M'
ORDER BY age;
'''

pd.read_sql(query, db)

| |**name**|**age**|**sex**|
|---|---|---|---|
|0|Michael|48|M|
|1|Donald|70|M|
|2|Blair|90|M|


In [42]:
query = ''' 
SELECT name, age, sex
FROM people
WHERE age > 45 AND sex = 'M'
ORDER BY age DESC;
'''

pd.read_sql(query, db)

| |**name**|**age**|**sex**|
|---|---|---|---|
|0|Blair|90|M|
|1|Donald|70|M|
|2|Michael|48|M|
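
`ORDER BY` sorts in ascending order unless `DESC` is given, and it accepts more than one sort key. Here is a small sketch with Python's built-in `sqlite3` module, using four rows copied from the `people` output earlier in this lesson. The in-memory table is an assumption for illustration, and the two-key query is an extra example that does not appear in the lesson's cells.

```python
import sqlite3

# Toy in-memory table with four rows from the people relation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, sex TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Austin", "M", 33), ("Blair", "M", 90), ("Carolina", "F", 28), ("Dani", "F", 41)],
)

# Ascending order is the default.
youngest_first = conn.execute(
    "SELECT name, age FROM people ORDER BY age"
).fetchall()

# DESC reverses the order for that sort key.
oldest_first = conn.execute(
    "SELECT name, age FROM people ORDER BY age DESC"
).fetchall()

# Two sort keys: sort by sex first, then by age (descending) within each sex.
by_sex_then_age = conn.execute(
    "SELECT name, sex, age FROM people ORDER BY sex, age DESC"
).fetchall()

print(youngest_first)
print(oldest_first)
print(by_sex_then_age)
```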


### Aggregation

In [46]:
query = ''' 
SELECT *
FROM homes;
'''

pd.read_sql(query, db)

| |**owner_name**|**area**|**value**|
|---|---|---|---|
|0|Austin|urban|145000|
|1|Blair|suburbs|95000|
|2|Carolina|suburbs|220000|
|3|Carolina|urban|190000|
|4|Dani|country|67000|
|5|Donald|urban|450000|
|6|Donald|urban|260000|
|7|Donald|urban|660000|
|8|Eliza|urban|210000|
|9|Farida|suburbs|180000|


In [47]:
query = ''' 
SELECT SUM(value)
FROM homes;
'''

pd.read_sql(query, db)

| |**SUM(value)**|
|---|---|
|0|4247000|
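
An aggregate function collapses many rows into a single value. The sketch below applies `COUNT`, `SUM`, `MIN`, and `MAX` to a toy in-memory copy of only the ten `homes` rows displayed above, using Python's built-in `sqlite3` module (the in-memory table is an assumption for illustration). Its total is smaller than the 4247000 reported above, which indicates that the real `homes` table holds more rows than the ten displayed.

```python
import sqlite3

# Toy in-memory table with the ten homes rows displayed in this lesson.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE homes (owner_name TEXT, area TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO homes VALUES (?, ?, ?)",
    [
        ("Austin", "urban", 145000),
        ("Blair", "suburbs", 95000),
        ("Carolina", "suburbs", 220000),
        ("Carolina", "urban", 190000),
        ("Dani", "country", 67000),
        ("Donald", "urban", 450000),
        ("Donald", "urban", 260000),
        ("Donald", "urban", 660000),
        ("Eliza", "urban", 210000),
        ("Farida", "suburbs", 180000),
    ],
)

# Each aggregate reduces the whole table to one number, so the result is one row.
n_homes, total, cheapest, priciest = conn.execute(
    "SELECT COUNT(*), SUM(value), MIN(value), MAX(value) FROM homes"
).fetchone()

print(n_homes, total, cheapest, priciest)
```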


### GROUP BY

In [48]:
query = ''' 
SELECT area, SUM(value)
FROM homes
GROUP BY area;
'''

pd.read_sql(query, db)

| |**area**|**SUM(value)**|
|---|---|---|
|0|country|830000|
|1|suburbs|815000|
|2|urban|2602000|


In [50]:
query = ''' 
SELECT owner_name, area, SUM(value)
FROM homes
WHERE owner_name = 'Donald'
GROUP BY owner_name, area;
'''

pd.read_sql(query, db)

| |**owner_name**|**area**|**SUM(value)**|
|---|---|---|---|
|0|Donald|urban|1370000|


In [51]:
query = ''' 
SELECT owner_name, area, SUM(value) AS total_value
FROM homes
WHERE owner_name = 'Donald'
GROUP BY owner_name, area;
'''

pd.read_sql(query, db)

| |**owner_name**|**area**|**total_value**|
|---|---|---|---|
|0|Donald|urban|1370000|
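
`GROUP BY` forms one group per distinct combination of the listed columns and applies the aggregate to each group; `AS` renames the resulting column. The sketch below shows how the grouping columns change the result, using Python's built-in `sqlite3` module and a toy in-memory table with only Carolina's and Donald's rows from the `homes` output above (the in-memory table is an assumption for illustration).

```python
import sqlite3

# Toy in-memory table with Carolina's and Donald's homes from this lesson.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE homes (owner_name TEXT, area TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO homes VALUES (?, ?, ?)",
    [
        ("Carolina", "suburbs", 220000),
        ("Carolina", "urban", 190000),
        ("Donald", "urban", 450000),
        ("Donald", "urban", 260000),
        ("Donald", "urban", 660000),
    ],
)

# One group per (owner_name, area) pair: Carolina appears twice.
per_owner_and_area = conn.execute(
    """
    SELECT owner_name, area, SUM(value) AS total_value
    FROM homes
    GROUP BY owner_name, area
    ORDER BY owner_name, area
    """
).fetchall()

# One group per owner: Carolina's two homes are summed together.
per_owner = conn.execute(
    """
    SELECT owner_name, SUM(value) AS total_value
    FROM homes
    GROUP BY owner_name
    ORDER BY owner_name
    """
).fetchall()

print(per_owner_and_area)
print(per_owner)
```

Donald's three homes are all urban, so he forms a single group either way, which matches the 1370000 total shown above.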


### JOIN

In [53]:
insp.get_columns('people')

[{'name': 'name',
  'type': TEXT(),
  'nullable': False,
  'default': None,
  'autoincrement': 'auto',
  'primary_key': 0},
 {'name': 'sex',
  'type': TEXT(),
  'nullable': False,
  'default': None,
  'autoincrement': 'auto',
  'primary_key': 0},
 {'name': 'age',
  'type': INTEGER(),
  'nullable': False,
  'default': None,
  'autoincrement': 'auto',
  'primary_key': 0}]

In [55]:
insp.get_columns('pets')

[{'name': 'name',
  'type': TEXT(),
  'nullable': False,
  'default': None,
  'autoincrement': 'auto',
  'primary_key': 0},
 {'name': 'type',
  'type': TEXT(),
  'nullable': False,
  'default': None,
  'autoincrement': 'auto',
  'primary_key': 0},
 {'name': 'owner_name',
  'type': TEXT(),
  'nullable': False,
  'default': None,
  'autoincrement': 'auto',
  'primary_key': 0}]

In [58]:
query = ''' 
SELECT * 
FROM people JOIN pets
    ON people.name = pets.owner_name;
'''

pd.read_sql(query, db)

| |**name**|**sex**|**age**|**name.1**|**type**|**owner_name**|
|---|---|---|---|---|---|---|
|0|Austin|M|33|Maru|cat|Austin|
|1|Blair|M|90|Icey|dog|Blair|
|2|Blair|M|90|Maxie|dog|Blair|
|3|Carolina|F|28|Rex|dog|Carolina|
|4|Dani|F|41|Artemis|cat|Dani|
|5|Dani|F|41|Harambe|bird|Dani|
|6|Dani|F|41|Syd|dog|Dani|
|7|Donald|M|70|Donald|cat|Donald|
|8|Donald|M|70|Meowser|cat|Donald|
|9|Donald|M|70|Mr. Muggles|cat|Donald|


In [60]:
query = ''' 
SELECT * 
FROM people LEFT JOIN pets
    ON people.name = pets.owner_name;
'''

pd.read_sql(query, db)

| |**name**|**sex**|**age**|**name.1**|**type**|**owner_name**|
|---|---|---|---|---|---|---|
|0|Austin|M|33|Maru|cat|Austin|
|1|Blair|M|90|Icey|dog|Blair|
|2|Blair|M|90|Maxie|dog|Blair|
|3|Carolina|F|28|Rex|dog|Carolina|
|4|Dani|F|41|Artemis|cat|Dani|
|5|Dani|F|41|Harambe|bird|Dani|
|6|Dani|F|41|Syd|dog|Dani|
|7|Donald|M|70|Donald|cat|Donald|
|8|Donald|M|70|Meowser|cat|Donald|
|9|Donald|M|70|Mr. Muggles|cat|Donald|
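
The rows displayed above look the same for `JOIN` and `LEFT JOIN`, because every person shown owns at least one pet. The two joins differ only for rows in the left table that have no match: an inner `JOIN` drops them, while a `LEFT JOIN` keeps them and fills the right-hand columns with `NULL`. The sketch below illustrates this with Python's built-in `sqlite3` module on toy in-memory tables. Austin, Blair, and their pets are copied from the output above; `Pat` is a hypothetical pet-less person invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, sex TEXT, age INTEGER)")
conn.execute("CREATE TABLE pets (name TEXT, type TEXT, owner_name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [
        ("Austin", "M", 33),
        ("Blair", "M", 90),
        ("Pat", "F", 50),  # hypothetical person with no pets
    ],
)
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [("Maru", "cat", "Austin"), ("Icey", "dog", "Blair"), ("Maxie", "dog", "Blair")],
)

# Inner join: only people with at least one matching pet row survive.
inner_rows = conn.execute(
    """
    SELECT people.name, pets.name
    FROM people JOIN pets
        ON people.name = pets.owner_name
    ORDER BY people.name, pets.name
    """
).fetchall()

# Left join: every person is kept; missing pet columns become NULL (None in Python).
left_rows = conn.execute(
    """
    SELECT people.name, pets.name
    FROM people LEFT JOIN pets
        ON people.name = pets.owner_name
    ORDER BY people.name, pets.name
    """
).fetchall()

print(inner_rows)
print(left_rows)
```

Selecting `people.name` and `pets.name` explicitly also avoids the ambiguity of two columns both called `name`, which is why the second one shows up as `name.1` in the outputs above.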
