# SQL with Python Reference Guide 3
# Subqueries in WHERE clause
## (Justin M. Olds)
Based on Stanford SQL course: https://lagunita.stanford.edu/courses/DB/SQL/SelfPaced/info

---
**Subqueries** - Nested SELECT statements


In [2]:
import sqlite3
import pandas as pd

conn = sqlite3.connect("class.db")
c = conn.cursor()

---
### Tables and Insert code (same as before--college admissions data)

In [3]:
c.execute('DROP TABLE IF EXISTS College')
c.execute('DROP TABLE IF EXISTS Student') 
c.execute('DROP TABLE IF EXISTS Apply') 

c.execute('CREATE TABLE College(cName TEXT, state TEXT, enrollment INT)')
c.execute('CREATE TABLE Student(sID INT, sName TEXT, GPA REAL, sizeHS INT)')
c.execute('CREATE TABLE Apply(sID INT, cName TEXT, major TEXT, decision TEXT)')
conn.commit()

In [4]:
c.execute('DELETE FROM Student')
c.execute('DELETE FROM College')
c.execute('DELETE FROM Apply')

c.execute("INSERT INTO Student VALUES (123, 'Amy', 3.9, 1000)")
c.execute("INSERT INTO Student values (234, 'Bob', 3.6, 1500)")
c.execute("INSERT INTO Student values (345, 'Craig', 3.5, 500)")
c.execute("INSERT INTO Student values (456, 'Doris', 3.9, 1000)")
c.execute("INSERT INTO Student values (567, 'Edward', 2.9, 2000)")
c.execute("INSERT INTO Student values (678, 'Fay', 3.8, 200)")
c.execute("INSERT INTO Student values (789, 'Gary', 3.4, 800)")
c.execute("INSERT INTO Student values (987, 'Helen', 3.7, 800)")
c.execute("INSERT INTO Student values (876, 'Irene', 3.9, 400)")
c.execute("INSERT INTO Student values (765, 'Jay', 2.9, 1500)")
c.execute("INSERT INTO Student values (654, 'Amy', 3.9, 1000)")
c.execute("INSERT INTO Student values (543, 'Craig', 3.4, 2000)")

c.execute("INSERT INTO College values ('Stanford', 'CA', 15000)")
c.execute("INSERT INTO College values ('Berkeley', 'CA', 36000)")
c.execute("INSERT INTO College values ('MIT', 'MA', 10000)")
c.execute("INSERT INTO College values ('Cornell', 'NY', 21000)")

c.execute("INSERT INTO Apply values (123, 'Stanford', 'CS', 'Y')")
c.execute("INSERT INTO Apply values (123, 'Stanford', 'EE', 'N')")
c.execute("INSERT INTO Apply values (123, 'Berkeley', 'CS', 'Y')")
c.execute("INSERT INTO Apply values (123, 'Cornell', 'EE', 'Y')")
c.execute("INSERT INTO Apply values (234, 'Berkeley', 'biology', 'N')")
c.execute("INSERT INTO Apply values (345, 'MIT', 'bioengineering', 'Y')")
c.execute("INSERT INTO Apply values (345, 'Cornell', 'bioengineering', 'N')")
c.execute("INSERT INTO Apply values (345, 'Cornell', 'CS', 'Y')")
c.execute("INSERT INTO Apply values (345, 'Cornell', 'EE', 'N')")
c.execute("INSERT INTO Apply values (678, 'Stanford', 'history', 'Y')")
c.execute("INSERT INTO Apply values (987, 'Stanford', 'CS', 'Y')")
c.execute("INSERT INTO Apply values (987, 'Berkeley', 'CS', 'Y')")
c.execute("INSERT INTO Apply values (876, 'Stanford', 'CS', 'N')")
c.execute("INSERT INTO Apply values (876, 'MIT', 'biology', 'Y')")
c.execute("INSERT INTO Apply values (876, 'MIT', 'marine biology', 'N')")
c.execute("INSERT INTO Apply values (765, 'Stanford', 'history', 'Y')")
c.execute("INSERT INTO Apply values (765, 'Cornell', 'history', 'N')")
c.execute("INSERT INTO Apply values (765, 'Cornell', 'psychology', 'Y')")
c.execute("INSERT INTO Apply values (543, 'MIT', 'CS', 'N')")
conn.commit()


---
### SELECT Statements with subqueries

The statement below finds the sIDs and names of all students who applied to a CS program at any college.

In [18]:
df = pd.read_sql_query("""
    SELECT sID, sName
    FROM Student
    WHERE sID IN
        (SELECT sID FROM Apply WHERE major = 'CS')
""", conn)
df

Unnamed: 0,sID,sName
0,123,Amy
1,345,Craig
2,987,Helen
3,876,Irene
4,543,Craig


The same result can also be obtained without a subquery by joining Student and Apply, as follows.

Note: sID must be qualified with a table name (Student.sID or Apply.sID) to avoid ambiguity, and DISTINCT must be added because a student who applied to CS at several colleges would otherwise appear once per application.

In [21]:
df = pd.read_sql_query("""
    SELECT DISTINCT Student.sID, sName
    FROM Student, Apply
    WHERE Student.sID = Apply.sID AND major = 'CS'
""", conn)
df

Unnamed: 0,sID,sName
0,123,Amy
1,345,Craig
2,987,Helen
3,876,Irene
4,543,Craig


The same subquery as before, but returning only names. Both Craigs appear because they are different students with different sIDs. If the join form were used instead, the DISTINCT it requires would apply to sName alone and would eliminate one of the Craigs (bad, because they are different Craigs!).

In [22]:
df = pd.read_sql_query("""
    SELECT sName
    FROM Student
    WHERE sID IN
        (SELECT sID FROM Apply WHERE major = 'CS')
""", conn)
df

Unnamed: 0,sName
0,Amy
1,Craig
2,Helen
3,Irene
4,Craig
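
The point about the two Craigs can be checked directly. The following is a minimal sketch, assuming an in-memory SQLite database that holds only the CS applicants copied from the tables above; the variable names `subquery_names` and `join_distinct_names` are illustrative and not part of the original notebook.

```python
import sqlite3

# In-memory database with only the rows needed for this comparison (an assumption
# made to keep the sketch self-contained; the notebook itself uses class.db).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student(sID INT, sName TEXT, GPA REAL, sizeHS INT)")
conn.execute("CREATE TABLE Apply(sID INT, cName TEXT, major TEXT, decision TEXT)")
conn.executemany("INSERT INTO Student VALUES (?, ?, ?, ?)", [
    (123, 'Amy', 3.9, 1000), (345, 'Craig', 3.5, 500), (987, 'Helen', 3.7, 800),
    (876, 'Irene', 3.9, 400), (543, 'Craig', 3.4, 2000),
])
conn.executemany("INSERT INTO Apply VALUES (?, ?, ?, ?)", [
    (123, 'Stanford', 'CS', 'Y'), (123, 'Berkeley', 'CS', 'Y'),
    (345, 'Cornell', 'CS', 'Y'), (987, 'Stanford', 'CS', 'Y'),
    (987, 'Berkeley', 'CS', 'Y'), (876, 'Stanford', 'CS', 'N'),
    (543, 'MIT', 'CS', 'N'),
])

# Subquery form: one row per student, so both Craigs survive.
subquery_names = [row[0] for row in conn.execute("""
    SELECT sName FROM Student
    WHERE sID IN (SELECT sID FROM Apply WHERE major = 'CS')
""")]

# Join form: DISTINCT is needed to drop repeated applications,
# but on sName alone it also merges the two different Craigs.
join_distinct_names = [row[0] for row in conn.execute("""
    SELECT DISTINCT sName
    FROM Student, Apply
    WHERE Student.sID = Apply.sID AND major = 'CS'
""")]

print(len(subquery_names), sorted(subquery_names))
print(len(join_distinct_names), sorted(join_distinct_names))
```

The subquery version keeps five names while the join with DISTINCT keeps four, which is the lost Craig described above.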


Another subquery example, this time returning each CS applicant's GPA.

In [5]:
df = pd.read_sql_query("""
    SELECT GPA
    FROM Student
    WHERE sID IN
        (SELECT sID FROM Apply WHERE major = 'CS')
""", conn)
df

Unnamed: 0,GPA
0,3.9
1,3.5
2,3.7
3,3.9
4,3.4


The same SELECT statement written with a join condition instead of a subquery.

This returns extra rows compared to the subquery version above, again because some students applied to CS at more than one college. The duplicates are a problem: an average computed from these rows would be wrong, because students with several CS applications are counted more than once.

In [7]:
df = pd.read_sql_query("""
    SELECT GPA
    FROM Student, Apply
    WHERE Student.sID = Apply.sID AND major = 'CS'
""", conn)
df

Unnamed: 0,GPA
0,3.9
1,3.9
2,3.5
3,3.7
4,3.7
5,3.9
6,3.4


Adding DISTINCT does not solve the problem: because only GPA is returned, DISTINCT also merges different students who happen to have the same GPA (here, Amy and Irene both have 3.9).

In [8]:
df = pd.read_sql_query("""
    SELECT DISTINCT GPA
    FROM Student, Apply
    WHERE Student.sID = Apply.sID AND major = 'CS'
""", conn)
df

Unnamed: 0,GPA
0,3.9
1,3.5
2,3.7
3,3.4


In this case, the only version of the query that returns the right data (one GPA per student who applied to CS) is the one with the subquery in the WHERE clause.
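
To make the effect on an average concrete, here is a minimal sketch that computes AVG(GPA) for the three query shapes discussed above. It assumes an in-memory SQLite database containing only the CS applicants from the tables above; the names `cs_join`, `avg_subquery`, `avg_join`, and `avg_distinct` are illustrative.

```python
import sqlite3

# In-memory copy of just the CS applicants (an assumption for self-containment).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student(sID INT, sName TEXT, GPA REAL, sizeHS INT)")
conn.execute("CREATE TABLE Apply(sID INT, cName TEXT, major TEXT, decision TEXT)")
conn.executemany("INSERT INTO Student VALUES (?, ?, ?, ?)", [
    (123, 'Amy', 3.9, 1000), (345, 'Craig', 3.5, 500), (987, 'Helen', 3.7, 800),
    (876, 'Irene', 3.9, 400), (543, 'Craig', 3.4, 2000),
])
conn.executemany("INSERT INTO Apply VALUES (?, ?, ?, ?)", [
    (123, 'Stanford', 'CS', 'Y'), (123, 'Berkeley', 'CS', 'Y'),
    (345, 'Cornell', 'CS', 'Y'), (987, 'Stanford', 'CS', 'Y'),
    (987, 'Berkeley', 'CS', 'Y'), (876, 'Stanford', 'CS', 'N'),
    (543, 'MIT', 'CS', 'N'),
])

# Shared FROM/WHERE text for the two join-based variants.
cs_join = "FROM Student, Apply WHERE Student.sID = Apply.sID AND major = 'CS'"

# Subquery in WHERE: each CS applicant counted exactly once.
avg_subquery = conn.execute("""
    SELECT AVG(GPA) FROM Student
    WHERE sID IN (SELECT sID FROM Apply WHERE major = 'CS')
""").fetchone()[0]

# Plain join: students with several CS applications are counted several times.
avg_join = conn.execute(f"SELECT AVG(GPA) {cs_join}").fetchone()[0]

# Join with DISTINCT GPA: different students sharing a GPA are merged.
avg_distinct = conn.execute(
    f"SELECT AVG(GPA) FROM (SELECT DISTINCT GPA {cs_join})"
).fetchone()[0]

print(round(avg_subquery, 3), round(avg_join, 3), round(avg_distinct, 3))
```

With this data the three averages come out to about 3.68, 3.714, and 3.625 respectively; only the first counts each CS applicant exactly once.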

---

Reference guide 2 ended with a query that attempted to reproduce the result of an EXCEPT query. Subqueries can take the place of EXCEPT in the following way: the query below returns students who applied to CS but not to EE.

Note: NOT can also be placed in front of sID (AND NOT sID IN ...) and the query works the same way.

In [9]:
df = pd.read_sql_query("""
    SELECT sID, sName
    FROM Student
    WHERE sID IN
        (SELECT sID FROM Apply WHERE major = 'CS')
        AND sID NOT IN (SELECT sID FROM Apply WHERE major = 'EE')
""", conn)
df

Unnamed: 0,sID,sName
0,987,Helen
1,876,Irene
2,543,Craig
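
The note about where NOT is placed can be verified with a small experiment. This is a minimal sketch, assuming an in-memory SQLite database with a subset of the rows above (the CS applicants, their EE applications, and Bob as a non-CS applicant); `not_in_rows` and `not_first_rows` are illustrative names.

```python
import sqlite3

# In-memory subset of the data (an assumption to keep the sketch self-contained).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student(sID INT, sName TEXT, GPA REAL, sizeHS INT)")
conn.execute("CREATE TABLE Apply(sID INT, cName TEXT, major TEXT, decision TEXT)")
conn.executemany("INSERT INTO Student VALUES (?, ?, ?, ?)", [
    (123, 'Amy', 3.9, 1000), (234, 'Bob', 3.6, 1500), (345, 'Craig', 3.5, 500),
    (987, 'Helen', 3.7, 800), (876, 'Irene', 3.9, 400), (543, 'Craig', 3.4, 2000),
])
conn.executemany("INSERT INTO Apply VALUES (?, ?, ?, ?)", [
    (123, 'Stanford', 'CS', 'Y'), (123, 'Stanford', 'EE', 'N'),
    (123, 'Cornell', 'EE', 'Y'), (234, 'Berkeley', 'biology', 'N'),
    (345, 'Cornell', 'CS', 'Y'), (345, 'Cornell', 'EE', 'N'),
    (987, 'Stanford', 'CS', 'Y'), (876, 'Stanford', 'CS', 'N'),
    (543, 'MIT', 'CS', 'N'),
])

# Form used in the notebook: sID NOT IN (...)
not_in_rows = conn.execute("""
    SELECT sID, sName FROM Student
    WHERE sID IN (SELECT sID FROM Apply WHERE major = 'CS')
      AND sID NOT IN (SELECT sID FROM Apply WHERE major = 'EE')
""").fetchall()

# Alternative placement: NOT sID IN (...), i.e. NOT applied to the whole IN test.
not_first_rows = conn.execute("""
    SELECT sID, sName FROM Student
    WHERE sID IN (SELECT sID FROM Apply WHERE major = 'CS')
      AND NOT sID IN (SELECT sID FROM Apply WHERE major = 'EE')
""").fetchall()

print(sorted(not_in_rows))
print(sorted(not_first_rows))
```

Both forms return the same three students: those who applied to CS and never to EE.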


### EXISTS
The EXISTS operator tests whether a subquery returns any rows (that is, whether it is non-empty).

This query finds all colleges such that some other college is in the same state.

Note: Because the query matches rows of College against rows of the same relation, without an extra condition it would erroneously return every college, since each college is in the same state as itself. The additional condition in the subquery requiring the college names to differ avoids this problem.

In [11]:
df = pd.read_sql_query("""
    SELECT cName, state
    FROM College C1
    WHERE EXISTS
        (SELECT * FROM College C2 
        WHERE C2.state = C1.state
            AND C1.cName <> C2.cName)        
""", conn)
df

Unnamed: 0,cName,state
0,Stanford,CA
1,Berkeley,CA
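
The self-matching pitfall described in the note can be reproduced by dropping the name condition. A minimal sketch, assuming an in-memory SQLite copy of the College table above; `without_name_check` and `with_name_check` are illustrative names.

```python
import sqlite3

# In-memory copy of the College table (an assumption for self-containment).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE College(cName TEXT, state TEXT, enrollment INT)")
conn.executemany("INSERT INTO College VALUES (?, ?, ?)", [
    ('Stanford', 'CA', 15000), ('Berkeley', 'CA', 36000),
    ('MIT', 'MA', 10000), ('Cornell', 'NY', 21000),
])

# Without the name condition, every college matches itself in the subquery,
# so EXISTS is true for all rows.
without_name_check = [row[0] for row in conn.execute("""
    SELECT cName FROM College C1
    WHERE EXISTS
        (SELECT * FROM College C2
         WHERE C2.state = C1.state)
""")]

# With the name condition, a college only qualifies if a *different*
# college shares its state.
with_name_check = [row[0] for row in conn.execute("""
    SELECT cName FROM College C1
    WHERE EXISTS
        (SELECT * FROM College C2
         WHERE C2.state = C1.state
           AND C1.cName <> C2.cName)
""")]

print(sorted(without_name_check))
print(sorted(with_name_check))
```

The first query returns all four colleges, while the second returns only the two California colleges.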


Subqueries can also be used to find a maximum value, for example the college with the largest enrollment.

Specifically, this query returns every college for which there does not exist a college with a higher enrollment.

In [12]:
df = pd.read_sql_query("""
    SELECT cName
    FROM College C1
    WHERE NOT EXISTS
        (SELECT * FROM College C2 
        WHERE C2.enrollment > C1.enrollment)
""", conn)
df

Unnamed: 0,cName
0,Berkeley


A similar query can be written to find the student with the highest GPA. Spoiler: it's a 4-way tie.

In [15]:
df = pd.read_sql_query("""
    SELECT sName, GPA
    FROM Student S1
    WHERE NOT EXISTS
        (SELECT * FROM Student S2 
        WHERE S2.GPA > S1.GPA)
""", conn)
df

Unnamed: 0,sName,GPA
0,Amy,3.9
1,Doris,3.9
2,Irene,3.9
3,Amy,3.9
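
As a cross-check on the NOT EXISTS pattern, the same rows can be obtained by comparing against a scalar subquery that uses the MAX aggregate. This is a different technique from the one in the notebook, shown only for comparison. A minimal sketch, assuming an in-memory SQLite copy of the Student table above; `top_not_exists` and `top_max` are illustrative names.

```python
import sqlite3

# In-memory copy of the Student table (an assumption for self-containment).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student(sID INT, sName TEXT, GPA REAL, sizeHS INT)")
conn.executemany("INSERT INTO Student VALUES (?, ?, ?, ?)", [
    (123, 'Amy', 3.9, 1000), (234, 'Bob', 3.6, 1500), (345, 'Craig', 3.5, 500),
    (456, 'Doris', 3.9, 1000), (567, 'Edward', 2.9, 2000), (678, 'Fay', 3.8, 200),
    (789, 'Gary', 3.4, 800), (987, 'Helen', 3.7, 800), (876, 'Irene', 3.9, 400),
    (765, 'Jay', 2.9, 1500), (654, 'Amy', 3.9, 1000), (543, 'Craig', 3.4, 2000),
])

# Pattern from the notebook: no other student has a strictly higher GPA.
top_not_exists = conn.execute("""
    SELECT sID, sName FROM Student S1
    WHERE NOT EXISTS
        (SELECT * FROM Student S2 WHERE S2.GPA > S1.GPA)
""").fetchall()

# Comparison technique (not from the notebook): match the aggregate maximum.
top_max = conn.execute("""
    SELECT sID, sName FROM Student
    WHERE GPA = (SELECT MAX(GPA) FROM Student)
""").fetchall()

print(sorted(top_not_exists))
print(sorted(top_max))
```

Both queries keep all tied students, which is why four rows come back rather than one.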


Writing a similar query with a join instead of a subquery is fundamentally flawed, because it returns every student for whom some other student has a lower GPA, that is, everyone except the students with the lowest GPA.

In [17]:
df = pd.read_sql_query("""
    SELECT DISTINCT S1.sName, S1.GPA
    FROM Student S1, Student S2
    WHERE S1.GPA > S2.GPA
""", conn)
df

Unnamed: 0,sName,GPA
0,Amy,3.9
1,Bob,3.6
2,Craig,3.5
3,Doris,3.9
4,Fay,3.8
5,Gary,3.4
6,Helen,3.7
7,Irene,3.9
8,Craig,3.4


Finding students for whom there exists some other student whose high school is smaller than theirs.

In [31]:
df = pd.read_sql_query("""
    SELECT sID, sName, sizeHS
    FROM Student S1
    WHERE EXISTS 
        (SELECT * FROM Student S2
        WHERE S2.sizeHS < S1.sizeHS)
""", conn)
df

Unnamed: 0,sID,sName,sizeHS
0,123,Amy,1000
1,234,Bob,1500
2,345,Craig,500
3,456,Doris,1000
4,567,Edward,2000
5,789,Gary,800
6,987,Helen,800
7,876,Irene,400
8,765,Jay,1500
9,654,Amy,1000
