# Practical SQL: Chapter Practice Notebook

This Jupyter notebook collects the SQL queries and practice exercises covered in each chapter of the book "Practical SQL: A Beginner's Guide To Storytelling With Data" by Anthony DeBarros.

The first part of this notebook imports a couple of dependencies, along with the database credentials stored in a local `sql_data` module, so that we can connect our Jupyter Notebook to PostgreSQL.

In [1]:
import psycopg2
import pandas as pd
from sql_data import db, usr, pwd

In [2]:
# Connecting to the PostgreSQL database
conn = psycopg2.connect(
    host = "localhost",
    database = db,
    user = usr,
    password = pwd,
    port = 5432
)

def execute_query(connection, query):
    """Run a query and return its result set as a pandas DataFrame."""
    connection.autocommit = True
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        results = cursor.fetchall()
        # cursor.description holds one entry per result column; item 0 is its name
        column_names = [i[0] for i in cursor.description]
        return pd.DataFrame(results, columns=column_names)
    except psycopg2.OperationalError as e:
        print(f"The error '{e}' occurred.")
    finally:
        # Close the cursor, but keep the connection open so that
        # the cells below can reuse it
        cursor.close()

## Chapter 8: Extracting Information By Grouping and Summarizing

This chapter requires us to build two tables based on the 2009 and 2014 Public Library Surveys conducted by the Institute of Museum and Library Services (IMLS). Please refer to the book's resources, as detailed in the readme page, to get the CSV files needed to populate the two library tables used from here on.

In [3]:
# Exploring the Library Data Using Aggregate Functions
# Counting Rows and Values Using COUNT()

query = """
SELECT COUNT(*)
FROM pls_fy2014_pupld14a;
"""

execute_query(conn, query)

Unnamed: 0,count
0,9305


In [4]:
query = """
SELECT COUNT(*)
FROM pls_fy2009_pupld09a;
"""

execute_query(conn, query)

Unnamed: 0,count
0,9299


In [5]:
# COUNT(column) counts only non-NULL values, so on a column without a NOT NULL constraint
# the result can be smaller than the total number of rows
query = """
SELECT COUNT(salaries)
FROM pls_fy2014_pupld14a;
"""
execute_query(conn,query)

Unnamed: 0,count
0,5983
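
The gap between 9305 rows and 5983 salary values comes from how `COUNT()` treats NULLs. The sketch below reproduces the effect on a small, made-up table. It uses Python's built-in `sqlite3` module instead of PostgreSQL so that it runs without a database server; the `libraries` table and its three rows are hypothetical.

```python
import sqlite3

# Hypothetical three-row table standing in for the library data.
# SQLite is used only so the example runs without a PostgreSQL server.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE libraries (libname TEXT, salaries INTEGER)")
demo.executemany(
    "INSERT INTO libraries VALUES (?, ?)",
    [("A", 1000), ("B", None), ("C", 2500)],  # library B has no salary reported
)

# COUNT(*) counts rows; COUNT(salaries) skips rows where salaries is NULL
all_rows, with_salary = demo.execute(
    "SELECT COUNT(*), COUNT(salaries) FROM libraries"
).fetchone()
print(all_rows, with_salary)
demo.close()
```

The same comparison could be run against the real table by selecting `COUNT(*)` and `COUNT(salaries)` in a single query.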


In [6]:
# Using the DISTINCT keyword to see how many unique values a column contains
# First, for comparison, this query counts every non-NULL value in the column, duplicates included
query = """
SELECT COUNT(libname)
FROM pls_fy2014_pupld14a;
"""
execute_query(conn,query)

Unnamed: 0,count
0,9305


In [7]:
# This query returns the number of unique values in the same column. The number is smaller because some library names are duplicated.
query = """
SELECT COUNT(DISTINCT libname)
FROM pls_fy2014_pupld14a;
"""
execute_query(conn,query)

Unnamed: 0,count
0,8515
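
To see why the second count is smaller, here is a minimal sketch on made-up data, again using the built-in `sqlite3` module as a stand-in for PostgreSQL. The table contents are hypothetical; only the `libname` column name comes from the queries above.

```python
import sqlite3

# Hypothetical data: two different libraries share the same name
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE libraries (libname TEXT)")
demo.executemany(
    "INSERT INTO libraries VALUES (?)",
    [("OXFORD PUBLIC LIBRARY",), ("OXFORD PUBLIC LIBRARY",), ("SALEM LIBRARY",)],
)

# COUNT(libname) counts every non-NULL name; COUNT(DISTINCT libname)
# counts each unique name once
total_names, unique_names = demo.execute(
    "SELECT COUNT(libname), COUNT(DISTINCT libname) FROM libraries"
).fetchone()
print(total_names, unique_names)
demo.close()
```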


In [8]:
# Finding Maximum and Minimum Values Using MAX() and MIN()
query = """
SELECT MAX(visits), MIN(visits)
FROM pls_fy2014_pupld14a;
"""
execute_query(conn, query)

Unnamed: 0,max,min
0,17729020,-3
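
A minimum of -3 cannot be a real visit count. The join queries later in this chapter filter on `visits >= 0`, which suggests that negative numbers are placeholder codes and not measurements. The sketch below, on hypothetical rows in an in-memory SQLite table, shows how such codes distort `MIN()` and how a `WHERE` filter removes them.

```python
import sqlite3

# Hypothetical rows; the negative values stand in for placeholder codes
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE libraries (libname TEXT, visits INTEGER)")
demo.executemany(
    "INSERT INTO libraries VALUES (?, ?)",
    [("A", 1200), ("B", -3), ("C", 45), ("D", -1)],
)

# Without a filter, MIN() picks up the placeholder code
raw_min = demo.execute("SELECT MIN(visits) FROM libraries").fetchone()[0]

# Filtering out negative values gives the smallest real visit count
clean_min = demo.execute(
    "SELECT MIN(visits) FROM libraries WHERE visits >= 0"
).fetchone()[0]
print(raw_min, clean_min)
demo.close()
```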


In [9]:
# Aggregating Data Using GROUP BY

# Combining GROUP BY clause with COUNT()
query = """
SELECT stabr, COUNT(*)
FROM pls_fy2014_pupld14a
GROUP BY stabr
ORDER BY COUNT(*) DESC
LIMIT 5;
"""
execute_query(conn, query)

Unnamed: 0,stabr,count
0,NY,756
1,IL,625
2,TX,556
3,IA,543
4,PA,455


In [12]:
# Aggregating data from multiple columns using GROUP BY
query = """
SELECT stabr, stataddr, COUNT(*)
FROM pls_fy2014_pupld14a
GROUP BY stabr, stataddr
ORDER BY stabr ASC, COUNT(*) DESC;
"""

execute_query(conn, query)

Unnamed: 0,stabr,stataddr,count
0,AK,00,70
1,AK,15,10
2,AK,07,5
3,AL,00,221
4,AL,07,3
...,...,...,...
101,WI,07,6
102,WI,15,3
103,WV,00,93
104,WV,15,4
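
Grouping by two columns produces one output row per unique combination of their values, which is why each state appears several times above. A minimal sketch on made-up rows (SQLite standing in for PostgreSQL; the column names are borrowed from the query above, the data is hypothetical):

```python
import sqlite3

# Hypothetical rows: state abbreviation plus an address-change code
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE libraries (stabr TEXT, stataddr TEXT)")
demo.executemany(
    "INSERT INTO libraries VALUES (?, ?)",
    [("AK", "00"), ("AK", "00"), ("AK", "15"),
     ("AL", "00"), ("AL", "00"), ("AL", "00")],
)

# One result row per (stabr, stataddr) pair, sorted by state
# and then by descending count within each state
pair_counts = demo.execute(
    """
    SELECT stabr, stataddr, COUNT(*)
    FROM libraries
    GROUP BY stabr, stataddr
    ORDER BY stabr ASC, COUNT(*) DESC
    """
).fetchall()
print(pair_counts)
demo.close()
```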


In [15]:
# Aggregating data from multiple joined tables

# No grouping yet: here we sum visits across the joined 2014 and 2009 tables to look for an overall trend in library visits
query = """
SELECT SUM(pls14.visits) AS visits_2014,
    SUM(pls09.visits) AS visits_2009
FROM pls_fy2014_pupld14a AS pls14
JOIN pls_fy2009_pupld09a AS pls09
    ON pls14.fscskey = pls09.fscskey
WHERE pls14.visits >=0 AND pls09.visits >= 0;
"""

execute_query(conn, query)

Unnamed: 0,visits_2014,visits_2009
0,1417299241,1585455205
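
Note that a plain `JOIN` is an inner join, so these totals only cover libraries whose `fscskey` appears in both surveys and whose visit counts are non-negative in both years. The sketch below illustrates this on two small hypothetical tables (`pls14` and `pls09` are made-up stand-ins, and SQLite replaces PostgreSQL so the code runs on its own).

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE pls14 (fscskey TEXT, visits INTEGER)")
demo.execute("CREATE TABLE pls09 (fscskey TEXT, visits INTEGER)")
# Library B exists only in 2014, D only in 2009, and C has a
# negative placeholder value in 2014
demo.executemany("INSERT INTO pls14 VALUES (?, ?)",
                 [("A", 100), ("B", 200), ("C", -1)])
demo.executemany("INSERT INTO pls09 VALUES (?, ?)",
                 [("A", 150), ("C", 50), ("D", 300)])

# Only library A survives both the join and the filter
visits_2014, visits_2009 = demo.execute(
    """
    SELECT SUM(pls14.visits), SUM(pls09.visits)
    FROM pls14
    JOIN pls09 ON pls14.fscskey = pls09.fscskey
    WHERE pls14.visits >= 0 AND pls09.visits >= 0
    """
).fetchone()
print(visits_2014, visits_2009)
demo.close()
```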


In [19]:
# Aggregating data to compare trends by states
query = """
SELECT
    pls14.stabr,
    SUM(pls14.visits) AS visits_2014,
    SUM(pls09.visits) AS visits_2009,
    ROUND( (CAST(SUM(pls14.visits) AS DECIMAL(10, 1)) - SUM(pls09.visits)) /
        SUM(pls09.visits) * 100, 2) AS pct_change
FROM pls_fy2014_pupld14a AS pls14
JOIN pls_fy2009_pupld09a AS pls09
    ON pls14.fscskey = pls09.fscskey
WHERE pls14.visits >=0 AND pls09.visits >= 0
GROUP BY pls14.stabr
ORDER BY pct_change DESC
LIMIT 5
;
"""

execute_query(conn, query)

Unnamed: 0,stabr,visits_2014,visits_2009,pct_change
0,GU,103593,60763,70.49
1,DC,4230790,2944774,43.67
2,LA,17242110,15591805,10.58
3,MT,4582604,4386504,4.47
4,AL,17113602,16933967,1.06
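
The `pct_change` column is the usual percent-change formula, (new - old) / old * 100. The `CAST` to `DECIMAL` matters because PostgreSQL truncates the result when it divides one integer by another. As a sanity check, the sketch below recomputes two of the rows above in plain Python; the `pct_change` helper is our own illustration, not something from the book.

```python
from decimal import Decimal, ROUND_HALF_UP

def pct_change(new, old):
    """Percent change from old to new, rounded to two decimal places."""
    change = (Decimal(new) - Decimal(old)) / Decimal(old) * 100
    return change.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Visit totals for GU and DC taken from the query output above
gu = pct_change(103593, 60763)
dc = pct_change(4230790, 2944774)
print(gu, dc)

# Integer division discards the fractional part before multiplying by 100;
# for these positive values, Python's // mimics what SQL integer division does
truncated = (103593 - 60763) // 60763 * 100
print(truncated)
```

Both recomputed values match the `pct_change` column, while the integer-only version collapses to zero.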


In [20]:
# Filtering an Aggregate Query Using HAVING
query = """
SELECT
    pls14.stabr,
    SUM(pls14.visits) AS visits_2014,
    SUM(pls09.visits) AS visits_2009,
    ROUND( (CAST(SUM(pls14.visits) AS DECIMAL(10, 1)) - SUM(pls09.visits)) /
        SUM(pls09.visits) * 100, 2) AS pct_change
FROM pls_fy2014_pupld14a AS pls14
JOIN pls_fy2009_pupld09a AS pls09
    ON pls14.fscskey = pls09.fscskey
WHERE pls14.visits >=0 AND pls09.visits >= 0
GROUP BY pls14.stabr
HAVING SUM(pls14.visits) > 50000000
ORDER BY pct_change DESC
LIMIT 5
;
"""

execute_query(conn, query)

Unnamed: 0,stabr,visits_2014,visits_2009,pct_change
0,TX,72876601,78838400,-7.56
1,CA,162787836,182181408,-10.65
2,OH,82495138,92402369,-10.72
3,NY,106453546,119810969,-11.15
4,IL,72598213,82438755,-11.94
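
The difference between the two filters is timing: `WHERE` removes individual rows before they are grouped, while `HAVING` removes whole groups after the aggregates have been computed. A minimal sketch on hypothetical data, using an in-memory SQLite table in place of the PostgreSQL tables:

```python
import sqlite3

# Hypothetical per-library visit counts for two states
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE libraries (stabr TEXT, visits INTEGER)")
demo.executemany(
    "INSERT INTO libraries VALUES (?, ?)",
    [("TX", 60), ("TX", 50), ("TX", -1), ("VT", 10), ("VT", 5)],
)

# WHERE drops the negative row first; HAVING then keeps only the
# states whose remaining visits sum to more than 100
big_states = demo.execute(
    """
    SELECT stabr, SUM(visits) AS total_visits
    FROM libraries
    WHERE visits >= 0
    GROUP BY stabr
    HAVING SUM(visits) > 100
    """
).fetchall()
print(big_states)
demo.close()
```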
