Aggregate functions such as `AVG()` are valid in the `SELECT` clause, and they can also be used in the `HAVING` and `ORDER BY` clauses. They cannot be used in the `WHERE` clause, which filters individual rows before any aggregation happens.
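As a quick check of where SQLite accepts an aggregate, the sketch below runs one query with `AVG()` in `WHERE` and one with `AVG()` in `HAVING`. The in-memory table `grads` and its values are made up for illustration and only stand in for `recent_grads`.

```python
import sqlite3

# Hypothetical toy table standing in for recent_grads (values are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grads (Major TEXT, Major_category TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO grads VALUES (?, ?, ?)",
    [("A", "Engineering", 0.25), ("B", "Engineering", 0.5),
     ("C", "Education", 0.75), ("D", "Education", 1.0)],
)

# An aggregate placed directly in WHERE is rejected by SQLite.
try:
    conn.execute("SELECT Major FROM grads WHERE ShareWomen > AVG(ShareWomen)")
    where_error = None
except sqlite3.Error as exc:
    where_error = str(exc)

# The same aggregate is accepted in HAVING, which filters groups rather than rows.
having_rows = conn.execute(
    "SELECT Major_category, AVG(ShareWomen) FROM grads "
    "GROUP BY Major_category HAVING AVG(ShareWomen) > 0.6"
).fetchall()

print("WHERE attempt:", where_error)
print("HAVING result:", having_rows)
```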

# Subqueries

A subquery is a query nested within another query. A subquery can exist within the `SELECT`, `FROM` or `WHERE` clause and must always be contained within parentheses.
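To make the three positions concrete, here is a minimal sketch with one subquery in each of the `SELECT`, `FROM` and `WHERE` clauses. It assumes a made-up in-memory table named `grads` rather than the notebook's `jobs.db`.

```python
import sqlite3

# Hypothetical toy table (not the real recent_grads data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO grads VALUES (?, ?)",
    [("A", 0.25), ("B", 0.5), ("C", 0.75), ("D", 1.0)],
)

# Subquery in SELECT: attach the total row count to every row.
in_select = conn.execute(
    "SELECT Major, (SELECT COUNT(*) FROM grads) AS n_rows "
    "FROM grads ORDER BY Major"
).fetchall()

# Subquery in FROM: treat the result of a query as a table.
in_from = conn.execute(
    "SELECT COUNT(*) FROM (SELECT Major FROM grads WHERE ShareWomen > 0.5)"
).fetchone()[0]

# Subquery in WHERE: compare each row against a computed value.
in_where = conn.execute(
    "SELECT Major FROM grads "
    "WHERE ShareWomen = (SELECT MAX(ShareWomen) FROM grads)"
).fetchall()
```

In every case the inner query is wrapped in parentheses and could also be run on its own.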

In [1]:
import pandas as pd
import numpy as np
import sqlite3 as sql

In [2]:
conn = sql.connect("jobs.db")

def read_query(q):
    return pd.read_sql_query(q, conn)

In [7]:
# Which rows are above the average for the ShareWomen column?
# The following raises an error, because an aggregate cannot be used in WHERE:
# q = """Select ShareWomen From recent_grads
#        Where ShareWomen > Avg(ShareWomen)"""

q = """Select ShareWomen from recent_grads
        Where ShareWomen > (Select Avg(ShareWomen) from recent_grads) Limit 5"""
read_query(q)

Unnamed: 0,ShareWomen
0,0.535714
1,0.578766
2,0.558548
3,0.896019
4,0.750473


The subquery runs first and returns the average value of the `ShareWomen` column. Its result is technically a table with one row and one column named `AVG(ShareWomen)`, but SQL ignores the column name and uses the single value in the comparison.

If we had to do this using Python, we would compute the average value of `ShareWomen`, store it in a variable, and then use that variable in a table filter. While variables dominate how we express logic in **object-oriented programming languages** like **Python** and **Java**, SQLite's dialect of `SQL` doesn't have support for variables, so the subquery takes the variable's place.
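The two approaches can be compared side by side. The sketch below uses the standard-library `sqlite3` module and a made-up in-memory table `grads` (an assumption for illustration): it first stores the average in a Python variable and filters with it, then does the same thing in a single statement with a subquery.

```python
import sqlite3

# Hypothetical toy table (values are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO grads VALUES (?, ?)",
    [("A", 0.25), ("B", 0.5), ("C", 0.75), ("D", 1.0)],
)

# Step 1: compute the average and keep it in a Python variable.
avg_share = conn.execute("SELECT AVG(ShareWomen) FROM grads").fetchone()[0]

# Step 2: pass the variable to the filter as a bound parameter.
two_step = conn.execute(
    "SELECT Major FROM grads WHERE ShareWomen > ? ORDER BY Major",
    (avg_share,),
).fetchall()

# The subquery version needs no variable and only one statement.
one_step = conn.execute(
    "SELECT Major FROM grads "
    "WHERE ShareWomen > (SELECT AVG(ShareWomen) FROM grads) "
    "ORDER BY Major"
).fetchall()
```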

The subquery needs to be a full query (containing `SELECT` and `FROM` clauses, etc.) that works even when it is run on its own. In addition, the inner query should return a table with only a single row and a single column, because of where it fits in the outer query (`... WHERE ShareWomen > ?`, with the subquery taking the place of the `?`). If it instead returns a table with multiple columns, an error is returned.

Lastly, a subquery must always be contained within parentheses `(` `)`; otherwise an error is returned.
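Both failure modes can be reproduced on a small scale. The following sketch assumes a made-up in-memory table `grads` and a hypothetical helper `try_query` that returns either the rows or the error message; the exact wording of the messages depends on the SQLite version, so none is quoted here.

```python
import sqlite3

# Hypothetical toy table (values are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO grads VALUES (?, ?)",
    [("A", 0.25), ("B", 0.5), ("C", 0.75)],
)

def try_query(sql):
    """Return the rows, or the error message if SQLite rejects the query."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return f"error: {exc}"

# Valid: the subquery returns one row and one column, and is parenthesized.
valid = try_query(
    "SELECT Major FROM grads "
    "WHERE ShareWomen > (SELECT AVG(ShareWomen) FROM grads)"
)

# Invalid: the subquery returns two columns where one value is expected.
two_columns = try_query(
    "SELECT Major FROM grads "
    "WHERE ShareWomen > (SELECT AVG(ShareWomen), MAX(ShareWomen) FROM grads)"
)

# Invalid: the subquery is not wrapped in parentheses.
no_parens = try_query(
    "SELECT Major FROM grads "
    "WHERE ShareWomen > SELECT AVG(ShareWomen) FROM grads"
)

for label, result in [("valid", valid), ("two columns", two_columns),
                      ("no parentheses", no_parens)]:
    print(label, "->", result)
```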

In [12]:
# Query that returns the majors that are below the average for Unemployment_rate

q = """Select Major, Unemployment_rate From recent_grads
       Where Unemployment_rate < (Select Avg(Unemployment_rate) From recent_grads)
       Order by Unemployment_rate
       Limit 8
       """
read_query(q)

Unnamed: 0,Major,Unemployment_rate
0,MATHEMATICS AND COMPUTER SCIENCE,0.0
1,BOTANY,0.0
2,SOIL SCIENCE,0.0
3,EDUCATIONAL ADMINISTRATION AND SUPERVISION,0.0
4,ENGINEERING MECHANICS PHYSICS AND SCIENCE,0.006334
5,COURT REPORTING,0.01169
6,MATHEMATICS TEACHER EDUCATION,0.016203
7,PETROLEUM ENGINEERING,0.018381


In [16]:
# SQL statement that computes the proportion (as a float value) of rows that contain above average values for the ShareWomen.

q = """Select Cast(Count(*) as Float)/(Select Count(*) from recent_grads) as proportion_abv_avg 
        from recent_grads 
        Where ShareWomen > (Select Avg(ShareWomen) From recent_grads)"""
read_query(q)


Unnamed: 0,proportion_abv_avg
0,0.526012


The [SQLite documentation](https://sqlite.org/lang_expr.html) lists all of the operators that SQLite supports.



Using the `IN` operator, we can specify a list of values that we want to match against in the `WHERE` clause.

The following query returns the rows where `Major_category` equals either **Business** or **Engineering**:

`SELECT Major, Major_category FROM recent_grads
WHERE Major_category IN ('Business', 'Engineering')
LIMIT 7`

An equivalent query that uses `OR` instead of `IN` is:

`SELECT Major, Major_category FROM recent_grads
WHERE Major_category ='Business' or Major_category ='Engineering' 
Limit 7`
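A small check that the `IN` form and the `OR` form select the same rows, run against a made-up in-memory table `grads` (an assumption for illustration, not the real `recent_grads` data):

```python
import sqlite3

# Hypothetical toy table (rows are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grads (Major TEXT, Major_category TEXT)")
conn.executemany(
    "INSERT INTO grads VALUES (?, ?)",
    [("ACCOUNTING", "Business"),
     ("BOTANY", "Biology & Life Science"),
     ("PETROLEUM ENGINEERING", "Engineering"),
     ("FINANCE", "Business")],
)

# Filter with IN and a list of values.
with_in = conn.execute(
    "SELECT Major, Major_category FROM grads "
    "WHERE Major_category IN ('Business', 'Engineering') "
    "ORDER BY Major"
).fetchall()

# Filter with one equality test per value, joined by OR.
with_or = conn.execute(
    "SELECT Major, Major_category FROM grads "
    "WHERE Major_category = 'Business' OR Major_category = 'Engineering' "
    "ORDER BY Major"
).fetchall()

print(with_in == with_or)
```

The `IN` form stays readable as the list of values grows, whereas the `OR` form needs one more comparison per value.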

In [None]:
q = """SELECT Major_category, Major
  FROM recent_grads
 WHERE Major_category IN ('Business', 'Humanities & Liberal Arts', 'Education');"""

read_query(q)

In [19]:
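# First attempt at a query that returns the Major and Major_category columns for the rows where
# Major_category is one of the 5 highest group-level sums for the Total column.
# Note: this version sorts ascending (lowest sums first) and, because of the GROUP BY,
# returns only one Major per category; the IN subquery in the next cell addresses both.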

q = """Select Major, Major_category from recent_grads 
        Group by Major_category
        Order by Sum(Total)
        Limit 5"""

read_query(q)

Unnamed: 0,Major,Major_category
0,MULTI/INTERDISCIPLINARY STUDIES,Interdisciplinary
1,FOOD SCIENCE,Agriculture & Natural Resources
2,COURT REPORTING,Law & Public Policy
3,ASTRONOMY AND ASTROPHYSICS,Physical Sciences
4,CONSTRUCTION SERVICES,Industrial Arts & Consumer Services


In [None]:
q = """SELECT Major_category, Major
  FROM recent_grads
 WHERE Major_category IN (SELECT Major_category
                            FROM recent_grads
                           GROUP BY Major_category
                           ORDER BY SUM(TOTAL) DESC
                           LIMIT 3
                         );"""
read_query(q)
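To see what an `IN` subquery of this shape returns, here is a minimal sketch on a made-up in-memory table `grads` (an assumption for illustration). The inner query picks the 2 categories with the highest summed `Total`, and the outer query keeps every major belonging to them.

```python
import sqlite3

# Hypothetical toy table (values are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grads (Major TEXT, Major_category TEXT, Total INTEGER)")
conn.executemany(
    "INSERT INTO grads VALUES (?, ?, ?)",
    [("A", "Business", 100), ("B", "Business", 200),
     ("C", "Engineering", 150), ("D", "Education", 50),
     ("E", "Arts", 10)],
)

# Inner query on its own: categories ranked by their summed Total.
top_categories = conn.execute(
    "SELECT Major_category FROM grads "
    "GROUP BY Major_category ORDER BY SUM(Total) DESC LIMIT 2"
).fetchall()

# Outer query: all majors whose category is in that list.
majors_in_top = conn.execute(
    "SELECT Major_category, Major FROM grads "
    "WHERE Major_category IN ("
    "    SELECT Major_category FROM grads "
    "    GROUP BY Major_category ORDER BY SUM(Total) DESC LIMIT 2"
    ") ORDER BY Major"
).fetchall()
```

Unlike a plain `GROUP BY` on the outer query, this keeps one output row per major rather than one per category.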

In [30]:
# query that returns the ratio (Sample_size/Total) for each of the majors
# (the column is aliased avg_ratio; the overall average would wrap the expression in Avg())

q = """Select Major, Cast(Sample_size as float)/Total avg_ratio 
from recent_grads 
       group by Major
        Limit 5"""

read_query(q)

Unnamed: 0,Major,avg_ratio
0,ACCOUNTING,0.01028
1,ACTUARIAL SCIENCE,0.013503
2,ADVERTISING AND PUBLIC RELATIONS,0.01281
3,AEROSPACE ENGINEERING,0.009762
4,AGRICULTURAL ECONOMICS,0.01804


When a SQL statement will end up using many subqueries, it can be overwhelming at first to know where to start. In general, we want to write the innermost queries first, check that each one runs on its own, and then work our way outwards.
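The inside-out approach can be sketched in two steps. The example assumes a made-up in-memory table `grads` with hypothetical `Sample_size` and `Total` values: the inner query is run and checked on its own first, then pasted in parentheses into the outer query.

```python
import sqlite3

# Hypothetical toy table (values are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grads (Major TEXT, Sample_size INTEGER, Total INTEGER)")
conn.executemany(
    "INSERT INTO grads VALUES (?, ?, ?)",
    [("A", 1, 8), ("B", 2, 8), ("C", 4, 8)],
)

# Step 1: write the inner query and confirm that it returns a single value.
inner = "SELECT AVG(CAST(Sample_size AS FLOAT) / Total) FROM grads"
inner_result = conn.execute(inner).fetchall()

# Step 2: wrap the verified inner query in parentheses inside the outer query.
outer = (
    "SELECT Major, CAST(Sample_size AS FLOAT) / Total AS ratio "
    "FROM grads "
    f"WHERE CAST(Sample_size AS FLOAT) / Total > ({inner})"
)
outer_result = conn.execute(outer).fetchall()
```

Checking that `inner_result` has exactly one row and one column before moving on catches most mistakes early, while the statement is still small.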

In [37]:
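# query that:
# selects the Major, Major_category, and the computed ratio columns
# (the ratio column is aliased avg_ratio, although it holds each row's own ratio)
# filters to just the rows where that ratio is greater than the average ratio over all rows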

q = """Select Major, Major_category, Cast(Sample_size as float)/Total avg_ratio
        From recent_grads
        where avg_ratio > (Select Avg(Cast(Sample_size as float)/Total) from recent_grads)
        Limit 5
        """
read_query(q)

Unnamed: 0,Major,Major_category,avg_ratio
0,PETROLEUM ENGINEERING,Engineering,0.015391
1,MINING AND MINERAL ENGINEERING,Engineering,0.009259
2,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,0.012719
3,ACTUARIAL SCIENCE,Business,0.013503
4,MECHANICAL ENGINEERING,Engineering,0.01128
