# Practical SQL

This Jupyter notebook contains the SQL queries I wrote as answers to the challenge questions at the end of each chapter of the book "Practical SQL: A Beginner's Guide To Storytelling With Data" by Anthony DeBarros.

The first part of this notebook imports a couple of dependencies, along with the personal credentials that let us link our Jupyter Notebook to PostgreSQL.

In [1]:
import psycopg2
import pandas as pd
from sql_data import db, usr, pwd

In [2]:
# Connecting to the PostgreSQL database
conn = psycopg2.connect(
    host = "localhost",
    database = db,
    user = usr,
    password = pwd,
    port = 5432
)

def execute_query(connection, query):
    """Run a query and return its result set as a pandas DataFrame."""
    connection.autocommit = True
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        results = cursor.fetchall()
        # The first item of each description entry is the column name
        column_names = [i[0] for i in cursor.description]
        return pd.DataFrame(results, columns=column_names)
    except psycopg2.OperationalError as e:
        print(f"The error '{e}' occurred.")
    finally:
        # Close the cursor but keep the connection open for the later cells
        cursor.close()
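
The column-name step in `execute_query` relies on the DB-API `cursor.description` attribute, which holds one tuple per result column with the name in position 0. Here is a minimal sketch of the same pattern. It assumes the standard library's `sqlite3` driver and a hypothetical in-memory `teachers` table, so it runs without a PostgreSQL server:

```python
import sqlite3

# In-memory SQLite database; the table and row are illustrative only
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE teachers (first_name TEXT, salary INTEGER)")
demo_conn.execute("INSERT INTO teachers VALUES ('Samuel', 43500)")

demo_cursor = demo_conn.cursor()
demo_cursor.execute("SELECT first_name, salary FROM teachers")
rows = demo_cursor.fetchall()
# Each entry of cursor.description is a tuple whose first item is the column name
column_names = [col[0] for col in demo_cursor.description]
demo_cursor.close()
demo_conn.close()

print(column_names)
print(rows)
```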

## Chapter 2: Beginning Data Exploration With Select

Challenge Questions

1. Write a query that lists the schools in alphabetical order along with teachers ordered by last name A-Z

In [3]:
query = """
SELECT school, first_name, last_name
FROM teachers
ORDER BY school, last_name;
"""

execute_query(conn, query)

Unnamed: 0,school,first_name,last_name
0,F.D. Rossevelt HS,Lee,Reynolds
1,F.D. Rossevelt HS,Kathleen,Roush
2,F.D. Rossevelt HS,Janet,Smith
3,Myers Middle School,Samantha,Bush
4,Myers Middle School,Samuel,Cole
5,Myers Middle School,Betty,Diaz


2. Write a query that finds the one teacher whose first name starts with the letter S and who earns more than $40k


In [4]:
query = """
SELECT *
FROM teachers
WHERE salary > 40000 AND
	first_name ILIKE 's%';
"""

execute_query(conn, query)

Unnamed: 0,id,first_name,last_name,school,hire_date,salary
0,3,Samuel,Cole,Myers Middle School,2005-08-01,43500


3. Rank teachers hired since Jan 1, 2010, ordered by highest paid to lowest

In [5]:
query = """
SELECT *
FROM teachers
WHERE hire_date >= '2010-01-01'
ORDER BY salary DESC;
"""

execute_query(conn, query)

Unnamed: 0,id,first_name,last_name,school,hire_date,salary
0,6,Kathleen,Roush,F.D. Rossevelt HS,2010-10-22,38500
1,1,Janet,Smith,F.D. Rossevelt HS,2011-10-30,36200
2,4,Samantha,Bush,Myers Middle School,2011-10-30,36200


## Chapter 3: Understanding Data Types

Challenge Questions

1. Your company delivers fruit and vegetables to local grocery stores, and you need to track the mileage driven by each driver each day to a tenth of a mile. Assuming no driver would ever travel more than 999 miles in a day, what would be an appropriate data type for the mileage column in your table? Why?

    Assuming we would only need to track to the nearest tenth of a mile and drivers never drive more than 999 miles, we would want a data type that reflects 4 significant figures. I would choose to use the NUMERIC data type with a precision of 4 and a scale of 1.
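
To sanity-check that choice, here is a small Python sketch that mimics how a NUMERIC(precision, scale) column treats a value: round it to `scale` decimal places, then reject anything that needs more than `precision - scale` digits left of the decimal point. The helper name `fits_numeric` is hypothetical, and the rounding mode (ties away from zero) is my assumption about how PostgreSQL rounds NUMERIC input.

```python
from decimal import Decimal, ROUND_HALF_UP

def fits_numeric(value, precision, scale):
    """Return True if value, rounded to `scale` places, fits NUMERIC(precision, scale)."""
    quantum = Decimal(1).scaleb(-scale)          # scale=1 -> Decimal('0.1')
    rounded = Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)
    limit = Decimal(10) ** (precision - scale)   # first magnitude that no longer fits
    return abs(rounded) < limit

print(fits_numeric(999.9, 4, 1))   # largest mileage the column should hold
print(fits_numeric(1000, 4, 1))    # one digit too many left of the decimal point
```

Under these assumptions, NUMERIC(4,1) accepts every mileage up to 999.9 and rejects 1000 and above.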

2. In the table listing each driver in your company, what are appropriate data types for the drivers' first and last names? Why is it a good idea to separate first and last names into two columns rather than having one larger name column?

    I would choose the VARCHAR data type for both columns, with generous length limits of (15) for first names and (30) for last names. VARCHAR minimizes storage because it accepts values up to the stated length but does not pad shorter values. Combining first and last names into one column is not advisable: it saves no space, since the combined field needs an extra character such as a space or comma to separate the two values, and it makes future queries, such as sorting by last name, harder to write.

3. Assume you have a text column that includes strings formatted as dates. One of the strings is written as '4//2017'. What will happen when you try to convert that string to the timestamp data type?

    The conversion should raise an error. The text '4//2017' is missing one of its date components (nothing appears between the two slashes), so it does not match any format that the TIMESTAMP data type accepts, such as 'YYYY-MM-DD HH:MM:SS'.
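
Python's `datetime.strptime` is a different parser from PostgreSQL's, but it rejects the same input for a similar reason, which makes the problem easy to see outside the database. The `%m/%d/%Y` layout and the helper name `parse_date` below are assumptions made for illustration.

```python
from datetime import datetime

def parse_date(text, layout="%m/%d/%Y"):
    """Return a datetime, or None when the text does not match the layout."""
    try:
        return datetime.strptime(text, layout)
    except ValueError:
        return None

good = parse_date("4/15/2017")   # complete month/day/year string
bad = parse_date("4//2017")      # the middle component is missing
print(good, bad)
```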

## Chapter 4: Importing and Exporting Data

Challenge Questions

1. Write a WITH statement to include with COPY to handle the import of an imaginary text file whose first couple of rows look like this:

In [6]:
# ---------
# id:movie:actor
# 50:#Mission: Impossible#:Tom Cruise
# ---------

	COPY example_table
	FROM 'C:/RandomDirectory/indicatedtextfile.txt'
	WITH (FORMAT CSV, HEADER, DELIMITER ':', QUOTE '#');
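
The QUOTE option matters because the movie title itself contains the delimiter. A rough way to preview how those options split the sample rows, without touching the database, is Python's `csv` module. Here `delimiter` and `quotechar` stand in for DELIMITER and QUOTE, and the sample text is held in memory rather than read from the imaginary file.

```python
import csv
import io

# The sample rows from the question, held in memory instead of a file
sample = "id:movie:actor\n50:#Mission: Impossible#:Tom Cruise\n"

# delimiter and quotechar play the roles of DELIMITER ':' and QUOTE '#'
reader = csv.DictReader(io.StringIO(sample), delimiter=":", quotechar="#")
parsed_rows = list(reader)
print(parsed_rows[0])
```

The colon inside the quoted title stays part of the movie value instead of starting a new column.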


2. Using the table us_counties_2010 you created and filled in this chapter, export to a CSV file the 20 counties in the United States that have the most housing units. Make sure you export only each county's name, state, and number of housing units.


	COPY (
		SELECT geo_name, state_us_abbreviation, housing_unit_count
		FROM us_counties_2010
		ORDER BY housing_unit_count DESC
		LIMIT 20
	)
	TO 'C:/RandomDirectory/housing_export.csv'
	WITH (FORMAT CSV, HEADER);

3. Imagine you're importing a file that contains a column with these values:

In [7]:
# ----
# 17519.668
# 20084.461
# 18973.335
# ----

Will a column in your target table with data type NUMERIC(3,8) work for these values?

    The data type NUMERIC(3,8) will not work as it has switched the precision and scale values. It should instead be NUMERIC(8,3) to indicate that there should be 8 total digits with only 3 of them being to the right of the decimal.
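
One way to double-check the precision and scale is to count the digits in the sample values. The sketch below uses Python's `decimal` module. The helper name `required_numeric` is hypothetical, and it only reports the smallest precision and scale that hold the listed values exactly.

```python
from decimal import Decimal

def required_numeric(values):
    """Return the smallest (precision, scale) that holds every value exactly."""
    max_int_digits = 0
    max_scale = 0
    for text in values:
        _, digits, exponent = Decimal(text).as_tuple()
        scale = max(-exponent, 0)                     # digits right of the decimal point
        int_digits = max(len(digits) + exponent, 0)   # digits left of the decimal point
        max_scale = max(max_scale, scale)
        max_int_digits = max(max_int_digits, int_digits)
    return max_int_digits + max_scale, max_scale

print(required_numeric(["17519.668", "20084.461", "18973.335"]))
```

For the three sample values this gives a precision of 8 and a scale of 3, which matches NUMERIC(8,3).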

## Chapter 5: Basic Math And Stats With SQL

Challenge Questions

1. Write a SQL statement for calculating the area of a circle whose radius is 5 inches. Do you need parentheses in your calculation? Why or why not?


In [8]:
query = """
SELECT 3.14 * (5 ^ 2);
"""

execute_query(conn, query)

Unnamed: 0,?column?
0,78.5


    Answer: In this case we do not need parentheses as order of operations will give exponents priority over multiplication. However, parentheses may help make the expression easier to understand.
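
The same precedence rule can be checked in Python. This is only a sketch of the arithmetic: Python writes exponentiation as `**`, and `^` means bitwise XOR there, so the SQL expression cannot be pasted in unchanged.

```python
radius = 5

# ** binds tighter than *, so both forms square the radius first
area_without_parens = 3.14 * radius ** 2
area_with_parens = 3.14 * (radius ** 2)

# Parentheses in the wrong place change the result
wrong_grouping = (3.14 * radius) ** 2

print(area_without_parens, area_with_parens, wrong_grouping)
```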

2. Using the 2010 Census county data, find out which New York state county has the highest percentage of the population that identified as "American Indian/Alaska Native Alone".

In [9]:
query = """
SELECT geo_name AS County,
	state_us_abbreviation,
	p0010001 AS total_population,
	p0010005 AS american_indian_alaska_native_alone,
	(CAST(p0010005 AS NUMERIC(8,1)) / p0010001) * 100 AS Pct_American_Indian
FROM us_counties_2010
WHERE state_us_abbreviation = 'NY'
ORDER BY Pct_American_Indian DESC
LIMIT 1;
"""

execute_query(conn, query)

Unnamed: 0,county,state_us_abbreviation,total_population,american_indian_alaska_native_alone,pct_american_indian
0,Franklin County,NY,51599,3797,7.358669741661661
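
The CAST in the query matters because dividing one integer column by another in PostgreSQL performs integer division, which would return 0 here. The sketch below reproduces both behaviors in Python with the Franklin County figures from the output above. Floor division stands in for SQL integer division.

```python
american_indian = 3797
total_population = 51599

# Floor division mirrors integer / integer in SQL: the fraction is lost
truncated = american_indian // total_population * 100

# True division keeps the fraction, like casting one operand to NUMERIC
pct = american_indian / total_population * 100

print(truncated, round(pct, 2))
```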


3. Was the 2010 median county population higher in California or New York?


In [10]:
query = """
SELECT state_us_abbreviation AS "State",
       percentile_cont(0.5)
	   WITHIN GROUP (ORDER BY p0010001) AS "Median"
FROM us_counties_2010
WHERE state_us_abbreviation IN ('NY', 'CA')
GROUP BY state_us_abbreviation;
"""

execute_query(conn, query)

Unnamed: 0,State,Median
0,CA,179140.5
1,NY,91301.0


     Answer: Based on the output from the query above, California had a higher median county population than New York - 179,140.5 to 91,301 respectively.
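
A note on the fractional result: percentile_cont(0.5) interpolates between the two middle values when a group has an even number of rows, which is how California's median can end in .5. Python's `statistics.median` behaves the same way. The population lists below are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical county populations, not taken from the census table
even_count = [10, 20, 30, 40]
odd_count = [10, 20, 30]

median_even = statistics.median(even_count)   # mean of the two middle values
median_odd = statistics.median(odd_count)     # the single middle value

print(median_even, median_odd)
```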

## Chapter 6: Joining Tables in a Relational Database

Challenge Questions

1. The table us_counties_2010 contains 3143 rows, and us_counties_2000 has 3141. That reflects the ongoing adjustments to county level geographies that typically result from government decision making. Using appropriate joins and the NULL value, identify which counties don't exist in both tables.


In [11]:
# Answer: The following query reveals that 5 counties from Alaska, along with Broomfield County, Colorado, appear in the 2010 table but have no match in the 2000 table.

query = """
SELECT
	ten.geo_name AS county,
	ten.state_us_abbreviation,
	twok.geo_name
FROM us_counties_2010 AS ten
LEFT JOIN us_counties_2000 AS twok
ON ten.state_fips = twok.state_fips
	AND ten.county_fips = twok.county_fips
WHERE twok.geo_name IS NULL;
"""

execute_query(conn, query)

Unnamed: 0,county,state_us_abbreviation,geo_name
0,Hoonah-Angoon Census Area,AK,
1,Petersburg Census Area,AK,
2,Prince of Wales-Hyder Census Area,AK,
3,Skagway Municipality,AK,
4,Wrangell City and Borough,AK,
5,Broomfield County,CO,
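
A LEFT JOIN filtered on NULL only reports rows from the left table that have no match, so checking the other direction means swapping the table order. The sketch below shows both directions. It assumes SQLite (via the standard library's `sqlite3`) and two tiny hypothetical tables, purely so it runs without the census data.

```python
import sqlite3

join_db = sqlite3.connect(":memory:")
join_db.executescript("""
    CREATE TABLE counties_2010 (state_fips TEXT, county_fips TEXT, geo_name TEXT);
    CREATE TABLE counties_2000 (state_fips TEXT, county_fips TEXT, geo_name TEXT);
    INSERT INTO counties_2010 VALUES ('01', '001', 'County A'), ('01', '003', 'County B');
    INSERT INTO counties_2000 VALUES ('01', '001', 'County A'), ('01', '005', 'County C');
""")

# Rows of the left table with no partner in the right table
anti_join = """
    SELECT l.geo_name
    FROM {left} AS l
    LEFT JOIN {right} AS r
      ON l.state_fips = r.state_fips AND l.county_fips = r.county_fips
    WHERE r.geo_name IS NULL
"""

only_2010 = join_db.execute(
    anti_join.format(left="counties_2010", right="counties_2000")).fetchall()
only_2000 = join_db.execute(
    anti_join.format(left="counties_2000", right="counties_2010")).fetchall()
join_db.close()

print(only_2010, only_2000)
```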


2. Using either the median() or percentile_cont() functions, determine the median of the percent change in county population.


In [12]:
query = """
SELECT
	PERCENTILE_CONT(.5) WITHIN GROUP (ORDER BY (ROUND((CAST(ten.p0010001 AS NUMERIC(8,1)) - twok.p0010001) / twok.p0010001 * 100, 1))) AS median_pop_change
FROM us_counties_2010 AS ten
JOIN us_counties_2000 AS twok
ON ten.state_fips = twok.state_fips
	AND ten.county_fips = twok.county_fips;
"""

execute_query(conn, query)

Unnamed: 0,median_pop_change
0,3.2


3. Which county had the greatest percentage loss of population between 2000 and 2010?

In [13]:
# Answer: The following query shows that St. Bernard Parish had the greatest percentage loss of population.

query = """
SELECT c2010.geo_name AS County,
       c2010.state_us_abbreviation,
       c2010.p0010001 AS pop_2010,
       c2000.p0010001 AS pop_2000,
       c2010.p0010001 - c2000.p0010001 AS raw_change,
       round( (CAST(c2010.p0010001 AS DECIMAL(8,1)) - c2000.p0010001)
           / c2000.p0010001 * 100, 1 ) AS pct_change
FROM us_counties_2010 c2010 INNER JOIN us_counties_2000 c2000
ON c2010.state_fips = c2000.state_fips
   AND c2010.county_fips = c2000.county_fips
ORDER BY pct_change ASC;
"""

execute_query(conn, query)

Unnamed: 0,county,state_us_abbreviation,pop_2010,pop_2000,raw_change,pct_change
0,St. Bernard Parish,LA,35897,67229,-31332,-46.6
1,Kalawao County,HI,90,147,-57,-38.8
2,Issaquena County,MS,1406,2274,-868,-38.2
3,Cameron Parish,LA,6839,9991,-3152,-31.5
4,Orleans Parish,LA,343829,484674,-140845,-29.1
...,...,...,...,...,...,...
3132,Loudoun County,VA,312311,169599,142712,84.1
3133,Lincoln County,SD,44828,24131,20697,85.8
3134,Flagler County,FL,95696,49832,45864,92.0
3135,Pinal County,AZ,375770,179727,196043,109.1
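
As a quick arithmetic check on the pct_change column, the same formula can be applied in Python to two rows from the output above. The function name `pct_change` is just a label for this sketch.

```python
def pct_change(new, old):
    """Percent change from old to new, rounded to one decimal place."""
    return round((new - old) / old * 100, 1)

st_bernard = pct_change(35897, 67229)   # St. Bernard Parish, LA
kalawao = pct_change(90, 147)           # Kalawao County, HI

print(st_bernard, kalawao)
```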


## Chapter 7: Table Design That Works For You

    CREATE TABLE albums (
        album_id bigserial,
        album_catalog_code varchar(100),
        album_title text,
        album_artist text,
        album_time interval,
        album_release_date date,
        album_genre varchar(40),
        album_description text
    );

    CREATE TABLE songs (
        song_id bigserial,
        song_title text,
        song_artist text,
        album_id bigint
    );

Use the tables to answer these questions:

1. Modify these CREATE TABLE statements to include primary and foreign keys plus
additional constraints on both tables.

    Answer: 
    
    CREATE TABLE albums (
        album_id bigserial,
        album_catalog_code varchar(80) NOT NULL,
        album_title text NOT NULL,
        album_artist text NOT NULL,
        album_release_date date,
        album_genre varchar(30),
        album_description text,
        CONSTRAINT album_id_key PRIMARY KEY (album_id),
        CONSTRAINT release_date_check CHECK (album_release_date > '1/1/1925')
    );

    CREATE TABLE songs (
        song_id bigserial,
        song_title text NOT NULL,
        song_artist text NOT NULL,
        album_id bigint REFERENCES albums (album_id),
        CONSTRAINT song_id_key PRIMARY KEY (song_id)
    );
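
To see constraints like these reject bad rows, here is a reduced sketch that runs against an in-memory SQLite database instead of PostgreSQL. It is an adaptation, not the answer above: SQLite has no bigserial, so INTEGER PRIMARY KEY stands in for it, dates are stored as ISO text, foreign keys must be switched on explicitly, and the helper `rejected` and all sample rows are hypothetical.

```python
import sqlite3

demo_db = sqlite3.connect(":memory:")
demo_db.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign keys off by default
demo_db.executescript("""
    CREATE TABLE albums (
        album_id INTEGER PRIMARY KEY,
        album_title TEXT NOT NULL,
        album_release_date TEXT CHECK (album_release_date > '1925-01-01')
    );
    CREATE TABLE songs (
        song_id INTEGER PRIMARY KEY,
        song_title TEXT NOT NULL,
        album_id INTEGER REFERENCES albums (album_id)
    );
""")
demo_db.execute("INSERT INTO albums VALUES (1, 'Example Album', '1999-05-01')")

def rejected(sql):
    """Return True when the statement violates a constraint."""
    try:
        demo_db.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

too_old = rejected("INSERT INTO albums VALUES (2, 'Old Album', '1900-01-01')")  # CHECK
no_title = rejected("INSERT INTO songs VALUES (1, NULL, 1)")                    # NOT NULL
orphan = rejected("INSERT INTO songs VALUES (2, 'Example Song', 99)")           # foreign key
valid = rejected("INSERT INTO songs VALUES (3, 'Example Song', 1)")             # satisfies all
demo_db.close()

print(too_old, no_title, orphan, valid)
```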

2. Instead of using album_id as a surrogate key for your primary key, are there any columns in 
albums that could be useful as a natural key? What would you have to know to decide?

    Answer: 
    
    album_catalog_code may be a viable natural key. To decide, we would have to know whether it is unique across record companies and whether it is always provided.


3. To speed up queries, which columns are good candidates for indexes?

    Answer: 
    
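    Primary key columns should be indexed, and PostgreSQL creates that index automatically when the constraint is declared. Foreign key columns are not indexed automatically, so songs.album_id is a good candidate. Beyond the keys, we should consider indexing the columns most likely to appear in WHERE clauses and joins: the title, artist, and album_release_date columns.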
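
The effect of an index on a foreign key lookup can be observed by asking the database for its query plan before and after creating the index. The sketch below assumes SQLite and its EXPLAIN QUERY PLAN statement, and the index name `idx_songs_album_id` is hypothetical. PostgreSQL has a different planner and uses EXPLAIN instead, so this only illustrates the idea.

```python
import sqlite3

index_db = sqlite3.connect(":memory:")
index_db.execute(
    "CREATE TABLE songs (song_id INTEGER PRIMARY KEY, song_title TEXT, album_id INTEGER)")

def plan(sql):
    """Return the planner's description of how it would run the statement."""
    rows = index_db.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)   # last column holds the detail text

lookup = "SELECT song_title FROM songs WHERE album_id = 1"
plan_before = plan(lookup)
index_db.execute("CREATE INDEX idx_songs_album_id ON songs (album_id)")
plan_after = plan(lookup)
index_db.close()

print(plan_before)
print(plan_after)
```

The exact plan wording varies between SQLite versions, but the second plan should name the new index.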

## Chapter 8: Extracting Information By Grouping and Summarizing

Challenge Questions

1. What is the pattern in the use of technology in libraries? Both the 2014 and 2009 survey tables contain the columns gpterms and pitusr. Write code to calculate the percent change in the sum of each column over time. Watch out for negative values.

In [15]:
# Answer: The following query returns the percent change in gpterms and pitusr for each state, limited to the 5 states with the largest gpterms increase.
query = """
SELECT
    pls14.stabr,
    SUM(pls14.gpterms) AS gpterms_2014,
    SUM(pls09.gpterms) AS gpterms_2009,
    ROUND( (CAST(SUM(pls14.gpterms) AS DECIMAL(10, 1)) - SUM(pls09.gpterms)) /
        SUM(pls09.gpterms) * 100, 2) AS pct_change_gpterms,
    SUM(pls14.pitusr) AS pitusr_2014,
    SUM(pls09.pitusr) AS pitusr_2009,
    ROUND( (CAST(SUM(pls14.pitusr) AS DECIMAL(10, 1)) - SUM(pls09.pitusr)) /
        SUM(pls09.pitusr) * 100, 2) AS pct_change_pitusr
FROM pls_fy2014_pupld14a AS pls14
JOIN pls_fy2009_pupld09a AS pls09
    ON pls14.fscskey = pls09.fscskey
WHERE pls14.gpterms >= 0 AND pls09.gpterms >= 0 AND pls14.pitusr >= 0 AND pls09.pitusr >= 0
GROUP BY pls14.stabr
ORDER BY pct_change_gpterms DESC, pct_change_pitusr DESC
LIMIT 5
;
"""

execute_query(conn, query)

Unnamed: 0,stabr,gpterms_2014,gpterms_2009,pct_change_gpterms,pitusr_2014,pitusr_2009,pct_change_pitusr
0,GU,547,59,827.12,39842,19564,103.65
1,DC,1000,594,68.35,1050623,140251,649.1
2,AK,994,618,60.84,771075,1061498,-27.36
3,DE,772,487,58.52,622515,451689,37.82
4,ID,1792,1151,55.69,1878131,1986141,-5.44
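
The WHERE clause matters because a negative entry distorts the sums. The sketch below shows the effect on a tiny example, assuming negative values are placeholder codes (for instance, for missing responses) rather than real counts. It uses SQLite, shortened hypothetical table names `pls09` and `pls14`, and made-up rows.

```python
import sqlite3

lib_db = sqlite3.connect(":memory:")
lib_db.executescript("""
    CREATE TABLE pls09 (fscskey TEXT, gpterms INTEGER);
    CREATE TABLE pls14 (fscskey TEXT, gpterms INTEGER);
    INSERT INTO pls09 VALUES ('A', 10), ('B', 20), ('C', -1);
    INSERT INTO pls14 VALUES ('A', 15), ('B', 30), ('C', 40);
""")

base = """
    SELECT SUM(pls14.gpterms), SUM(pls09.gpterms)
    FROM pls14 JOIN pls09 ON pls14.fscskey = pls09.fscskey
"""
unfiltered = lib_db.execute(base).fetchone()
filtered = lib_db.execute(
    base + " WHERE pls14.gpterms >= 0 AND pls09.gpterms >= 0").fetchone()
lib_db.close()

def percent_change(new, old):
    """Percent change from old to new, rounded to two decimal places."""
    return round((new - old) / old * 100, 2)

print(percent_change(*unfiltered), percent_change(*filtered))
```

With the placeholder row left in, the 2009 sum shrinks and the percent change is overstated. Filtering drops that library from both years so the comparison stays like for like.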


2. Both library survey tables contain a column called obereg. Just as we calculated the percent change in visits grouped by state, do the same to group percent changes in visits by U.S. region using obereg. 