# SQL with Python Reference Guide 2 
# Table Variables and Set Operators
## (Justin M. Olds)
Based on Stanford SQL course: https://lagunita.stanford.edu/courses/DB/SQL/SelfPaced/info

---
**Table Variables** - declared in the FROM clause, they serve two main purposes: 
* To make queries more readable
* To rename relations in the FROM clause (needed when the same relation appears more than once)

**Set Operators** 
* **Union** 
* **Intersect**
* **Except** 

In [1]:
import sqlite3
import pandas as pd

conn = sqlite3.connect("class.db")
c = conn.cursor()

---
### Tables and Insert code hidden below (same as before--college admissions data)

In [2]:
c.execute('DROP TABLE IF EXISTS College')
c.execute('DROP TABLE IF EXISTS Student') 
c.execute('DROP TABLE IF EXISTS Apply') 

c.execute('CREATE TABLE College(cName TEXT, state TEXT, enrollment INT)')
c.execute('CREATE TABLE Student(sID INT, sName TEXT, GPA REAL, sizeHS INT)')
c.execute('CREATE TABLE Apply(sID INT, cName TEXT, major TEXT, decision TEXT)')
conn.commit()

In [3]:
c.execute('DELETE FROM Student')
c.execute('DELETE FROM College')
c.execute('DELETE FROM Apply')

c.execute("INSERT INTO Student VALUES (123, 'Amy', 3.9, 1000)")
c.execute("INSERT INTO Student values (234, 'Bob', 3.6, 1500)")
c.execute("INSERT INTO Student values (345, 'Craig', 3.5, 500)")
c.execute("INSERT INTO Student values (456, 'Doris', 3.9, 1000)")
c.execute("INSERT INTO Student values (567, 'Edward', 2.9, 2000)")
c.execute("INSERT INTO Student values (678, 'Fay', 3.8, 200)")
c.execute("INSERT INTO Student values (789, 'Gary', 3.4, 800)")
c.execute("INSERT INTO Student values (987, 'Helen', 3.7, 800)")
c.execute("INSERT INTO Student values (876, 'Irene', 3.9, 400)")
c.execute("INSERT INTO Student values (765, 'Jay', 2.9, 1500)")
c.execute("INSERT INTO Student values (654, 'Amy', 3.9, 1000)")
c.execute("INSERT INTO Student values (543, 'Craig', 3.4, 2000)")

c.execute("INSERT INTO College values ('Stanford', 'CA', 15000)")
c.execute("INSERT INTO College values ('Berkeley', 'CA', 36000)")
c.execute("INSERT INTO College values ('MIT', 'MA', 10000)")
c.execute("INSERT INTO College values ('Cornell', 'NY', 21000)")

c.execute("INSERT INTO Apply values (123, 'Stanford', 'CS', 'Y')")
c.execute("INSERT INTO Apply values (123, 'Stanford', 'EE', 'N')")
c.execute("INSERT INTO Apply values (123, 'Berkeley', 'CS', 'Y')")
c.execute("INSERT INTO Apply values (123, 'Cornell', 'EE', 'Y')")
c.execute("INSERT INTO Apply values (234, 'Berkeley', 'biology', 'N')")
c.execute("INSERT INTO Apply values (345, 'MIT', 'bioengineering', 'Y')")
c.execute("INSERT INTO Apply values (345, 'Cornell', 'bioengineering', 'N')")
c.execute("INSERT INTO Apply values (345, 'Cornell', 'CS', 'Y')")
c.execute("INSERT INTO Apply values (345, 'Cornell', 'EE', 'N')")
c.execute("INSERT INTO Apply values (678, 'Stanford', 'history', 'Y')")
c.execute("INSERT INTO Apply values (987, 'Stanford', 'CS', 'Y')")
c.execute("INSERT INTO Apply values (987, 'Berkeley', 'CS', 'Y')")
c.execute("INSERT INTO Apply values (876, 'Stanford', 'CS', 'N')")
c.execute("INSERT INTO Apply values (876, 'MIT', 'biology', 'Y')")
c.execute("INSERT INTO Apply values (876, 'MIT', 'marine biology', 'N')")
c.execute("INSERT INTO Apply values (765, 'Stanford', 'history', 'Y')")
c.execute("INSERT INTO Apply values (765, 'Cornell', 'history', 'N')")
c.execute("INSERT INTO Apply values (765, 'Cornell', 'psychology', 'Y')")
c.execute("INSERT INTO Apply values (543, 'MIT', 'CS', 'N')")
conn.commit()


---
### SELECT Statements with table variables

In [4]:
df = pd.read_sql_query("""
    SELECT Student.sID, sName, GPA, Apply.cName, enrollment 
    FROM Student, College, Apply
    WHERE Apply.sID = Student.sID
        AND Apply.cName = College.cName
""", conn)
df

Unnamed: 0,sID,sName,GPA,cName,enrollment
0,123,Amy,3.9,Berkeley,36000
1,123,Amy,3.9,Cornell,21000
2,123,Amy,3.9,Stanford,15000
3,123,Amy,3.9,Stanford,15000
4,234,Bob,3.6,Berkeley,36000
5,345,Craig,3.5,Cornell,21000
6,345,Craig,3.5,Cornell,21000
7,345,Craig,3.5,Cornell,21000
8,345,Craig,3.5,MIT,10000
9,678,Fay,3.8,Stanford,15000


The above SELECT statement can be made a bit more readable by introducing table variables in the FROM clause.

In [5]:
df = pd.read_sql_query("""
    SELECT S.sID, sName, GPA, A.cName, enrollment 
    FROM Student S, College C, Apply A
    WHERE A.sID = S.sID
        AND A.cName = C.cName
""", conn)
df

Unnamed: 0,sID,sName,GPA,cName,enrollment
0,123,Amy,3.9,Berkeley,36000
1,123,Amy,3.9,Cornell,21000
2,123,Amy,3.9,Stanford,15000
3,123,Amy,3.9,Stanford,15000
4,234,Bob,3.6,Berkeley,36000
5,345,Craig,3.5,Cornell,21000
6,345,Craig,3.5,Cornell,21000
7,345,Craig,3.5,Cornell,21000
8,345,Craig,3.5,MIT,10000
9,678,Fay,3.8,Stanford,15000
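
Table variables also give a short way to qualify column names that exist in more than one table. The sketch below is a minimal illustration, assuming a throwaway in-memory database (named `demo` here, not part of the original notebook) with one row per table. It shows that SQLite rejects an unqualified `sID` when both Student and Apply are in the FROM clause, and that prefixing the column with a table variable resolves it.

```python
import sqlite3

# Hypothetical in-memory database so the example does not touch class.db
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Student(sID INT, sName TEXT, GPA REAL, sizeHS INT)")
demo.execute("CREATE TABLE Apply(sID INT, cName TEXT, major TEXT, decision TEXT)")
demo.execute("INSERT INTO Student VALUES (123, 'Amy', 3.9, 1000)")
demo.execute("INSERT INTO Apply VALUES (123, 'Stanford', 'CS', 'Y')")

# sID exists in both tables, so an unqualified reference is ambiguous
try:
    demo.execute("SELECT sID FROM Student S, Apply A WHERE A.sID = S.sID")
    ambiguous_error = None
except sqlite3.OperationalError as err:
    ambiguous_error = str(err)
print(ambiguous_error)

# Qualifying the column with a table variable removes the ambiguity
rows = demo.execute("""
    SELECT S.sID, sName, A.cName
    FROM Student S, Apply A
    WHERE A.sID = S.sID
""").fetchall()
print(rows)
```

Columns that appear in only one table (sName, cName in this two-table example) can stay unqualified, which is why the queries above only prefix the shared columns.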


Table variables are useful for comparing pairs of instances, such as students with the same GPA. The following SELECT statement includes the Student relation twice and returns every pair of students whose GPAs match.

In [6]:
df = pd.read_sql_query("""
    SELECT S1.sID, S1.sName, S1.GPA, S2.sID, S2.sName, S2.GPA
    FROM Student S1, Student S2 
    WHERE S1.GPA = S2.GPA
""", conn)
df

Unnamed: 0,sID,sName,GPA,sID.1,sName.1,GPA.1
0,123,Amy,3.9,123,Amy,3.9
1,123,Amy,3.9,456,Doris,3.9
2,123,Amy,3.9,654,Amy,3.9
3,123,Amy,3.9,876,Irene,3.9
4,234,Bob,3.6,234,Bob,3.6
5,345,Craig,3.5,345,Craig,3.5
6,456,Doris,3.9,123,Amy,3.9
7,456,Doris,3.9,456,Doris,3.9
8,456,Doris,3.9,654,Amy,3.9
9,456,Doris,3.9,876,Irene,3.9


The above SELECT statement also returns each student paired with themselves. To avoid this, we can add an AND condition to the WHERE clause that requires the two students to have different sIDs (<> means not equal).

In [7]:
df = pd.read_sql_query("""
    SELECT S1.sID, S1.sName, S1.GPA, S2.sID, S2.sName, S2.GPA
    FROM Student S1, Student S2 
    WHERE S1.GPA = S2.GPA
        AND S1.sID <> S2.sID
""", conn)
df

Unnamed: 0,sID,sName,GPA,sID.1,sName.1,GPA.1
0,123,Amy,3.9,456,Doris,3.9
1,123,Amy,3.9,654,Amy,3.9
2,123,Amy,3.9,876,Irene,3.9
3,456,Doris,3.9,123,Amy,3.9
4,456,Doris,3.9,654,Amy,3.9
5,456,Doris,3.9,876,Irene,3.9
6,567,Edward,2.9,765,Jay,2.9
7,789,Gary,3.4,543,Craig,3.4
8,876,Irene,3.9,123,Amy,3.9
9,876,Irene,3.9,456,Doris,3.9


The above statement removed the identical pairs, but each remaining pair still appears twice in the result (e.g., Amy-Doris and Doris-Amy). To resolve this, we can change the <> (not equals) to either > or <. For example, S1.sID < S2.sID keeps only the row in which the smaller sID comes first, rather than both orderings.

In [9]:
df = pd.read_sql_query("""
    SELECT S1.sID, S1.sName, S1.GPA, S2.sID, S2.sName, S2.GPA
    FROM Student S1, Student S2 
    WHERE S1.GPA = S2.GPA
        AND S1.sID < S2.sID
""", conn)
df

Unnamed: 0,sID,sName,GPA,sID.1,sName.1,GPA.1
0,123,Amy,3.9,456,Doris,3.9
1,123,Amy,3.9,654,Amy,3.9
2,123,Amy,3.9,876,Irene,3.9
3,456,Doris,3.9,654,Amy,3.9
4,456,Doris,3.9,876,Irene,3.9
5,567,Edward,2.9,765,Jay,2.9
6,654,Amy,3.9,876,Irene,3.9
7,543,Craig,3.4,789,Gary,3.4
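
To see how each extra condition shrinks the result, the three variants of the self-join can be counted side by side. This is a minimal sketch, assuming an in-memory copy of the Student table (the `demo` connection and the `count_pairs` helper are illustrative names, not part of the original notebook).

```python
import sqlite3

# Hypothetical in-memory copy of the Student table from this guide
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Student(sID INT, sName TEXT, GPA REAL, sizeHS INT)")
demo.executemany("INSERT INTO Student VALUES (?, ?, ?, ?)", [
    (123, 'Amy', 3.9, 1000), (234, 'Bob', 3.6, 1500),
    (345, 'Craig', 3.5, 500), (456, 'Doris', 3.9, 1000),
    (567, 'Edward', 2.9, 2000), (678, 'Fay', 3.8, 200),
    (789, 'Gary', 3.4, 800), (987, 'Helen', 3.7, 800),
    (876, 'Irene', 3.9, 400), (765, 'Jay', 2.9, 1500),
    (654, 'Amy', 3.9, 1000), (543, 'Craig', 3.4, 2000),
])

def count_pairs(extra_condition):
    """Count same-GPA pairs, optionally with an extra AND condition."""
    sql = f"""
        SELECT COUNT(*)
        FROM Student S1, Student S2
        WHERE S1.GPA = S2.GPA {extra_condition}
    """
    return demo.execute(sql).fetchone()[0]

all_pairs = count_pairs("")                          # includes self-pairs
different_ids = count_pairs("AND S1.sID <> S2.sID")  # both orderings kept
ordered_once = count_pairs("AND S1.sID < S2.sID")    # each pair listed once

print(all_pairs, different_ids, ordered_once)
```

With this data the counts are 28, 16, and 8: the 12 self-pairs disappear with <>, and switching to < halves what is left because each pair is kept in only one order.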


---

### SELECT statements with set operators

Starting with **UNION**, we can get a single result that lists the names of colleges together with the names of students. 

Note: by default the UNION operator eliminates duplicates in SQLite (e.g., there is only one Amy in the result below). The UNION ALL operator retains duplicates.

In [10]:
df = pd.read_sql_query("""
    SELECT cName FROM College
    UNION
    SELECT sName FROM Student
""", conn)
df

Unnamed: 0,cName
0,Amy
1,Berkeley
2,Bob
3,Cornell
4,Craig
5,Doris
6,Edward
7,Fay
8,Gary
9,Helen


In the above result, cName was returned as the column label, but this is not ideal because the column includes both college names and student names. As before, we can rename the label with the AS keyword.

In [11]:
df = pd.read_sql_query("""
    SELECT cName AS name FROM College
    UNION
    SELECT sName AS name FROM Student
""", conn)
df

Unnamed: 0,name
0,Amy
1,Berkeley
2,Bob
3,Cornell
4,Craig
5,Doris
6,Edward
7,Fay
8,Gary
9,Helen
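
The difference between UNION and UNION ALL mentioned in the note above can be checked directly. The sketch below is a minimal illustration, assuming an in-memory copy of the College and Student tables (the `demo` connection is an illustrative name, not part of the original notebook).

```python
import sqlite3

# Hypothetical in-memory copy of the College and Student tables
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE College(cName TEXT, state TEXT, enrollment INT)")
demo.execute("CREATE TABLE Student(sID INT, sName TEXT, GPA REAL, sizeHS INT)")
demo.executemany("INSERT INTO College VALUES (?, ?, ?)", [
    ('Stanford', 'CA', 15000), ('Berkeley', 'CA', 36000),
    ('MIT', 'MA', 10000), ('Cornell', 'NY', 21000),
])
demo.executemany("INSERT INTO Student VALUES (?, ?, ?, ?)", [
    (123, 'Amy', 3.9, 1000), (234, 'Bob', 3.6, 1500),
    (345, 'Craig', 3.5, 500), (456, 'Doris', 3.9, 1000),
    (567, 'Edward', 2.9, 2000), (678, 'Fay', 3.8, 200),
    (789, 'Gary', 3.4, 800), (987, 'Helen', 3.7, 800),
    (876, 'Irene', 3.9, 400), (765, 'Jay', 2.9, 1500),
    (654, 'Amy', 3.9, 1000), (543, 'Craig', 3.4, 2000),
])

# UNION removes duplicate names; UNION ALL keeps every row from both sides
union_rows = demo.execute("""
    SELECT cName AS name FROM College
    UNION
    SELECT sName AS name FROM Student
""").fetchall()
union_all_rows = demo.execute("""
    SELECT cName AS name FROM College
    UNION ALL
    SELECT sName AS name FROM Student
""").fetchall()

names_all = [row[0] for row in union_all_rows]
print(len(union_rows), len(union_all_rows), names_all.count('Amy'))
```

Here UNION yields 14 rows and UNION ALL yields 16, because the two students named Amy and the two named Craig each appear twice. If a particular row order matters, an explicit ORDER BY is the safe choice, since neither operator promises one.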


**INTERSECT** operator

The statement below returns the IDs of all students who have applied to both CS and EE.

In [12]:
df = pd.read_sql_query("""
    SELECT sID FROM Apply WHERE major = 'CS'
    INTERSECT
    SELECT sID FROM Apply WHERE major = 'EE'
""", conn)
df

Unnamed: 0,sID
0,123
1,345


Some systems don't support the INTERSECT operator, but the same result can be obtained with a self-join on Apply, as follows.

In [15]:
df = pd.read_sql_query("""
    SELECT DISTINCT A1.sID 
    FROM Apply A1, Apply A2
    WHERE A1.sID = A2.sID
        AND A1.major = 'CS'
        AND A2.major = 'EE'
""", conn)
df

Unnamed: 0,sID
0,123
1,345
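
The DISTINCT keyword matters in the self-join version: INTERSECT removes duplicates on its own, while a join returns one row for every matching (CS application, EE application) pair. The sketch below is a minimal illustration, assuming an in-memory copy of the Apply table (the `demo` connection and `join_sql` template are illustrative names, not part of the original notebook).

```python
import sqlite3

# Hypothetical in-memory copy of the Apply table from this guide
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Apply(sID INT, cName TEXT, major TEXT, decision TEXT)")
demo.executemany("INSERT INTO Apply VALUES (?, ?, ?, ?)", [
    (123, 'Stanford', 'CS', 'Y'), (123, 'Stanford', 'EE', 'N'),
    (123, 'Berkeley', 'CS', 'Y'), (123, 'Cornell', 'EE', 'Y'),
    (234, 'Berkeley', 'biology', 'N'),
    (345, 'MIT', 'bioengineering', 'Y'), (345, 'Cornell', 'bioengineering', 'N'),
    (345, 'Cornell', 'CS', 'Y'), (345, 'Cornell', 'EE', 'N'),
    (678, 'Stanford', 'history', 'Y'),
    (987, 'Stanford', 'CS', 'Y'), (987, 'Berkeley', 'CS', 'Y'),
    (876, 'Stanford', 'CS', 'N'), (876, 'MIT', 'biology', 'Y'),
    (876, 'MIT', 'marine biology', 'N'),
    (765, 'Stanford', 'history', 'Y'), (765, 'Cornell', 'history', 'N'),
    (765, 'Cornell', 'psychology', 'Y'),
    (543, 'MIT', 'CS', 'N'),
])

# Same self-join, run once with DISTINCT and once without
join_sql = """
    SELECT {distinct} A1.sID
    FROM Apply A1, Apply A2
    WHERE A1.sID = A2.sID
        AND A1.major = 'CS'
        AND A2.major = 'EE'
    ORDER BY A1.sID
"""
with_distinct = [r[0] for r in demo.execute(join_sql.format(distinct="DISTINCT"))]
without_distinct = [r[0] for r in demo.execute(join_sql.format(distinct=""))]

intersect_ids = [r[0] for r in demo.execute("""
    SELECT sID FROM Apply WHERE major = 'CS'
    INTERSECT
    SELECT sID FROM Apply WHERE major = 'EE'
    ORDER BY sID
""")]

print(with_distinct, without_distinct, intersect_ids)
```

Student 123 has two CS applications and two EE applications, so the join without DISTINCT lists 123 four times before DISTINCT collapses the result to match INTERSECT.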


**EXCEPT** operator (called the "Difference" operator in relational algebra) 

The following statement returns student IDs for students that applied to CS but did not apply to EE. 

In [16]:
df = pd.read_sql_query("""
    SELECT sID FROM Apply WHERE major = 'CS'
    EXCEPT
    SELECT sID FROM Apply WHERE major = 'EE'
""", conn)
df

Unnamed: 0,sID
0,543
1,876
2,987


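Some systems don't support the EXCEPT operator, and it is a bit tricky to get the same result without it, but still possible. The following attempt gets close but is not equivalent: it returns every student who applied to CS and has at least one application whose major is not EE (which can be the CS application itself), so students 123 and 345 still appear. We need more operators to fully pull it off (covered later).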

In [17]:
df = pd.read_sql_query("""
    SELECT DISTINCT A1.sID 
    FROM Apply A1, Apply A2
    WHERE A1.sID = A2.sID
        AND A1.major = 'CS'
        AND A2.major <> 'EE'
""", conn)
df

Unnamed: 0,sID
0,123
1,345
2,987
3,876
4,543
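
The gap between the attempt above and EXCEPT can be made explicit by comparing the two results as sets. The sketch below is a minimal illustration, assuming an in-memory copy of the Apply table (the `demo` connection is an illustrative name). It also includes one possible rewrite using a NOT IN subquery; subqueries have not been introduced in this guide yet, so treat that part as a preview under that assumption rather than as the course's own solution.

```python
import sqlite3

# Hypothetical in-memory copy of the Apply table from this guide
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Apply(sID INT, cName TEXT, major TEXT, decision TEXT)")
demo.executemany("INSERT INTO Apply VALUES (?, ?, ?, ?)", [
    (123, 'Stanford', 'CS', 'Y'), (123, 'Stanford', 'EE', 'N'),
    (123, 'Berkeley', 'CS', 'Y'), (123, 'Cornell', 'EE', 'Y'),
    (234, 'Berkeley', 'biology', 'N'),
    (345, 'MIT', 'bioengineering', 'Y'), (345, 'Cornell', 'bioengineering', 'N'),
    (345, 'Cornell', 'CS', 'Y'), (345, 'Cornell', 'EE', 'N'),
    (678, 'Stanford', 'history', 'Y'),
    (987, 'Stanford', 'CS', 'Y'), (987, 'Berkeley', 'CS', 'Y'),
    (876, 'Stanford', 'CS', 'N'), (876, 'MIT', 'biology', 'Y'),
    (876, 'MIT', 'marine biology', 'N'),
    (765, 'Stanford', 'history', 'Y'), (765, 'Cornell', 'history', 'N'),
    (765, 'Cornell', 'psychology', 'Y'),
    (543, 'MIT', 'CS', 'N'),
])

def ids(sql):
    """Run a single-column query and return its values as a set."""
    return {row[0] for row in demo.execute(sql)}

# The self-join attempt: CS applicants with some non-EE application
near_miss = ids("""
    SELECT DISTINCT A1.sID
    FROM Apply A1, Apply A2
    WHERE A1.sID = A2.sID AND A1.major = 'CS' AND A2.major <> 'EE'
""")

# The reference result using EXCEPT
except_ids = ids("""
    SELECT sID FROM Apply WHERE major = 'CS'
    EXCEPT
    SELECT sID FROM Apply WHERE major = 'EE'
""")

# One possible rewrite (assumption: uses a subquery, not yet covered here)
not_in_ids = ids("""
    SELECT DISTINCT sID FROM Apply
    WHERE major = 'CS'
        AND sID NOT IN (SELECT sID FROM Apply WHERE major = 'EE')
""")

wrongly_included = sorted(near_miss - except_ids)
print(wrongly_included, sorted(except_ids), sorted(not_in_ids))
```

The students the self-join wrongly includes are 123 and 345, the two who applied to both CS and EE. A join can only confirm that some matching row exists, whereas EXCEPT needs to confirm that no EE row exists, which is what the NOT IN form expresses.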
