CS6964/CS4964: Lab 1
-----

Please run the following cells to create a sample database and insert a few rows into it.

In [None]:
%load_ext sql

In [None]:
%sql postgresql://postgres:postgres@db:5432/postgres

In [None]:
%%sql
DROP TABLE IF EXISTS franchise CASCADE;
-- Franchise table with a primary key
CREATE TABLE franchise (
    franchise_id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    category TEXT
);

DROP TABLE IF EXISTS store CASCADE;
-- Store table with a primary key and a foreign key referencing franchise_id
CREATE TABLE store (
    store_id SERIAL PRIMARY KEY,
    franchise_id INT REFERENCES franchise(franchise_id),
    location TEXT NOT NULL
);

DROP TABLE IF EXISTS product CASCADE;
-- Product table with a primary key
CREATE TABLE product (
    product_id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    price DECIMAL(10, 2) NOT NULL, -- Using DECIMAL for precise monetary values
    made_by TEXT
);

DROP TABLE IF EXISTS purchase CASCADE;

CREATE TABLE purchase (
    purchase_id SERIAL PRIMARY KEY,
    product_id INT REFERENCES product(product_id),
    store_id INT REFERENCES store(store_id), -- Store where the purchase was made
    date DATE NOT NULL,
    quantity INT NOT NULL,
    purchaser_age INT
);

DROP TABLE IF EXISTS franchise_chain CASCADE;

CREATE TABLE franchise_chain (
    parent_franchise_id INT REFERENCES franchise(franchise_id),
    child_franchise_id INT REFERENCES franchise(franchise_id),
    PRIMARY KEY (parent_franchise_id, child_franchise_id)
);



In [None]:
%%sql
-- Insert data into the franchise table
INSERT INTO franchise (name, category)
VALUES
    ('Coffee Co', 'Café'),
    ('Electronics Emporium', 'Retail Electronics'),
    ('Home Essentials', 'Home Goods'),
    ('Bagels co', 'Bagels'),
    ('Fresh Bakery', 'Bakery'),
    ('Gadget World', 'Electronics'),
    ('Urban Wear', 'Clothing'),
    ('Gourmet Delight', 'Restaurant'),
    ('Tech Trendz', 'Electronics'),
    ('Sweet Treats', 'Dessert'),
    ('Book Haven', 'Bookstore'),
    ('Fitness Gear', 'Sports Equipment'),
    ('Garden Goods', 'Gardening'),
    ('Tech World', 'Electronics'),
    ('Urban Cafe', 'Cafe'),
    ('Book World', 'Bookstore'),
    ('Health Hub', 'Health'),
    ('Fashion Fiesta', 'Clothing'),
    ('Gadget Galaxy', 'Electronics'),
    ('Cozy Corner', 'Furniture'),
    ('Pet Paradise', 'Pet Supplies'),
    ('Music Mania', 'Music Store');
    
    

-- Insert data into the store table
INSERT INTO store (franchise_id, location)
VALUES
    (1, 'Downtown Plaza'),
    (1, 'West End Mall'),
    (2, 'Tech Hub Mall'),
    (3, 'City Center'),
    (5, 'Lakeview District'),
    (6, 'Greenfield Center'),
    (7, 'Downtown Market'),
    (8, 'Hilltop Plaza'),
    (9, 'Tech Park'),
    (10, 'Sunnyvale Strip'),
    (11, 'Liberty Lane'),
    (12, 'Sports Complex'),
    (13, 'Green Valley'),
    (14, 'Downtown Avenue'),
    (15, 'Central Street'),
    (16, 'Eastside Market'),
    (17, 'West End'),
    (18, 'Northern Quarter'),
    (19, 'Tech District'),
    (20, 'Oakwood Area'),
    (21, 'Riverfront'),
    (22, 'Harbor Side');

-- Insert data into the product table
INSERT INTO product (name, price, made_by)
VALUES
    ('Espresso Machine', 399.99, 'CoffeeCo Appliances'),
    ('Laptop', 899.99, 'TechGizmo'),
    ('Toaster', 29.99, 'HomeTech Inc.'),
    ('Cappuccino Maker', 299.99, 'CoffeeCo Appliances'),
    ('Bread Maker', 120.50, 'BakeTech'),
    ('Smartphone', 599.99, 'GadgetPro'),
    ('Jeans', 49.99, 'Urban Fashion Inc.'),
    ('Steak Grill', 350.00, 'Gourmet Appliances'),
    ('Tablet', 450.00, 'Tech Trendz Inc.'),
    ('Chocolate Cake', 15.99, 'Sweet Treats Bakery'),
    ('Mystery Novel', 8.99, 'Book Haven Publishing'),
    ('Yoga Mat', 20.00, 'Fitness Gear Co.'),
    ('Gardening Kit', 35.00, 'Green Thumb Inc.'),
    ('Smart TV', 1200.00, 'Tech World Electronics'),
    ('Organic Coffee', 10.99, 'Urban Cafe Supplies'),
    ('Bestseller Novel', 15.00, 'Book World Press'),
    ('Fitness Tracker', 199.99, 'Health Hub Tech'),
    ('Designer Jeans', 85.00, 'Fashion Fiesta Brand'),
    ('Wireless Headphones', 250.00, 'Gadget Galaxy Tech'),
    ('Sofa Set', 899.99, 'Cozy Corner Furnishings'),
    ('Dog Bed', 49.99, 'Pet Paradise Supplies'),
    ('Guitar', 499.99, 'Music Mania Instruments');



-- Insert data into the purchase table
INSERT INTO purchase (product_id, store_id, date, quantity, purchaser_age)
VALUES
    (1, 1, '2023-01-10', 3, 35),
    (2, 1, '2023-02-15', 2, 28),
    (3, 2, '2023-03-20', 4, 42),
    (4, 3, '2023-04-05', 1, 29),
    (5, 5, '2023-05-12', 2, 30),
    (6, 6, '2023-06-07', 1, 22),
    (7, 7, '2023-07-22', 3, 40),
    (8, 8, '2023-08-15', 1, 33),
    (9, 9, '2023-09-10', 2, 30),
    (11, 11, '2023-10-05', 1, 25),
    (12, 13, '2023-09-15', 3, 45),
    (14, 14, '2023-10-20', 1, 32);

    
INSERT INTO franchise_chain (parent_franchise_id, child_franchise_id)
VALUES
    (1, 4), -- 'Coffee Co' is the parent of 'Bagels co'
    (3, 2), -- 'Home Essentials' is the parent of 'Electronics Emporium'
    (5, 8), -- 'Fresh Bakery' is the parent of 'Gourmet Delight'
    (7, 6), -- 'Urban Wear' is the parent of 'Gadget World'
    (9, 10), -- 'Tech Trendz' is the parent of 'Sweet Treats'
    (11, 12), -- 'Book Haven' is the parent of 'Fitness Gear'
    (13, 15), -- 'Garden Goods' is the parent of 'Urban Cafe'
    (15, 17), -- 'Urban Cafe' is the parent of 'Health Hub'
    (14, 16), -- 'Tech World' is the parent of 'Book World'
    (16, 18); -- 'Book World' is the parent of 'Fashion Fiesta'


Part 1: SQL questions
---------
We will start with a series of SQL queries to strengthen our understanding of simple and advanced SQL.

Q0. (5 pts) Write a SQL query that lists all franchise names and their stores' locations.

In [None]:
%%sql
-- Your code goes here


Q1. (5 pts) Write a SQL query that calculates the total revenue for each franchise by summing the revenues from all of its stores. The revenue should be calculated from the quantities and prices of the products sold in each purchase. The result should display the name of each franchise along with its total revenue.

In [None]:
%%sql
-- Your code goes here


Q2. (5 pts) Write a SQL query to calculate the total revenue for each franchise, including those without any sales. The revenue should be determined by summing the product of quantity and price over all purchases made at the stores belonging to each franchise. The output should list the name of each franchise alongside its total revenue, and franchises with no sales must also appear in the result with their revenue shown as zero.
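
As a hint for the "include rows with no matches" requirement, here is a minimal sketch of the usual pattern: an outer join combined with `COALESCE`. It runs on SQLite through Python's standard library so that it works without the lab database, and the `team` and `goal` tables are hypothetical toy tables, not part of the lab schema. The same SQL pattern should carry over to Postgres.

```python
import sqlite3

# Hypothetical toy schema (NOT the lab's tables): teams and the points they scored.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE team (team_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE goal (goal_id INTEGER PRIMARY KEY,
                       team_id INTEGER REFERENCES team(team_id),
                       points INTEGER NOT NULL);
    INSERT INTO team (name) VALUES ('Red'), ('Blue'), ('Green');
    INSERT INTO goal (team_id, points) VALUES (1, 2), (1, 3), (2, 1);
""")

# LEFT JOIN keeps every team, even one with no matching goal rows;
# COALESCE turns the NULL sum produced for such a team into 0.
rows = conn.execute("""
    SELECT t.name, COALESCE(SUM(g.points), 0) AS total_points
    FROM team AS t
    LEFT JOIN goal AS g ON g.team_id = t.team_id
    GROUP BY t.team_id, t.name
    ORDER BY t.name
""").fetchall()

print(rows)
```

With an inner join, 'Green' would be dropped from the result because it has no rows in `goal`.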

In [None]:
%%sql
-- Your code goes here


Q3. (5 pts) Write a SQL query that lists each franchise's name, the locations of its stores, and the total revenue for each store. The revenue should be calculated as the sum of quantity times price over all purchases made at each store. Include all franchises and stores in the results, even those without any sales, showing zero revenue for these cases. The output should be grouped by both franchise name and store location and ordered by franchise name and store location.

In [None]:
%%sql
-- Your code goes here

Q4. (5 pts) Write a SQL query to find the average purchase amount per store, along with the number of unique products sold and the category of the store's franchise.

In [None]:
%%sql
-- Your code goes here


Q5. (7 pts) Write a SQL query that shows a hierarchy graph of franchises. This graph should show each franchise, its level in the hierarchy, and the root franchise it originates from. The query should list franchises at the root level with a level of 0, and increment the level by 1 for each subsequent child franchise level. Order the results by the root franchise ID and the hierarchical level.
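
Hierarchy queries like this one are typically written with a recursive common table expression (`WITH RECURSIVE`). The sketch below illustrates the general shape on a hypothetical `reports_to` table (not part of the lab schema), using SQLite through Python's standard library. It assumes the hierarchy has no cycles, and it is meant as an illustration of the technique rather than a solution.

```python
import sqlite3

# Hypothetical toy hierarchy (NOT the lab's tables):
# 1 manages 2, 2 manages 3, and 10 manages 11.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reports_to (
        manager_id INTEGER,
        employee_id INTEGER,
        PRIMARY KEY (manager_id, employee_id)
    );
    INSERT INTO reports_to VALUES (1, 2), (2, 3), (10, 11);
""")

rows = conn.execute("""
    WITH RECURSIVE hierarchy(node_id, level, root_id) AS (
        -- Base case: roots are managers that never appear as an employee.
        SELECT DISTINCT r.manager_id, 0, r.manager_id
        FROM reports_to AS r
        WHERE r.manager_id NOT IN (SELECT employee_id FROM reports_to)
        UNION ALL
        -- Recursive step: place each child one level below its parent
        -- and carry the root along unchanged.
        SELECT r.employee_id, h.level + 1, h.root_id
        FROM reports_to AS r
        JOIN hierarchy AS h ON r.manager_id = h.node_id
    )
    SELECT node_id, level, root_id
    FROM hierarchy
    ORDER BY root_id, level
""").fetchall()

for row in rows:
    print(row)
```

The base case seeds the result with level 0 rows, and the recursive step keeps joining until no new children are found.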

In [None]:
%%sql
-- Your code goes here

Q6. (7 pts) Write a SQL query to find the total sales of each franchise, including all its child franchises in the hierarchy.

In [None]:
%%sql
-- Your code goes here

Part 2: Schema creation, normalization, and Pandas primer
-----

In this part, you will design a database, populate it, and then run simple SQL queries on it. You will then run those same queries using Pandas.

Q7. (10 pts) Create a normalized (i.e., one that won't allow redundancies) database schema that has:

    - at least 3 tables
    - at least 3 primary keys
    - at least 2 foreign keys
    - at least one n-m (many-to-many) relationship between two tables

It is OK to use a database you find online for this question, as long as it satisfies the above constraints.
The code cell should include CREATE TABLE statements.
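
An n-m relationship is usually modeled with a third "junction" table whose primary key is the pair of foreign keys. The sketch below shows this on hypothetical `student`, `course`, and `enrollment` tables, using SQLite through Python's standard library. The names are illustrative assumptions, and in Postgres you would likely use `SERIAL` keys as in the sample schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Hypothetical tables for illustration only.
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT NOT NULL);
    -- Junction table: one row per (student, course) pair, so a student can
    -- take many courses and a course can have many students.
    CREATE TABLE enrollment (
        student_id INTEGER NOT NULL REFERENCES student(student_id),
        course_id  INTEGER NOT NULL REFERENCES course(course_id),
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO student (name) VALUES ('Ada'), ('Linus');
    INSERT INTO course (title) VALUES ('Databases'), ('Compilers');
    INSERT INTO enrollment VALUES (1, 1), (1, 2), (2, 1);
""")

# Count the courses each student is enrolled in.
counts = conn.execute("""
    SELECT s.name, COUNT(*) AS n_courses
    FROM student AS s
    JOIN enrollment AS e ON e.student_id = s.student_id
    GROUP BY s.student_id, s.name
    ORDER BY s.name
""").fetchall()
print(counts)

# The foreign key rejects an enrollment for a student that does not exist.
try:
    conn.execute("INSERT INTO enrollment VALUES (99, 1)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("orphan enrollment rejected:", rejected)
```

The composite primary key also prevents the same pair from being inserted twice, which is one of the redundancies that normalization aims to rule out.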


In [None]:
%%sql
-- Your code goes here

Q8. (5 pts) Insert values into the database (at least 5 rows per table).

In [None]:
%%sql
-- Your code goes here

Q9. (5 pts) Write three join SQL queries on your database, and explain what they accomplish.

In [None]:
%%sql
-- Your code goes here

Q10. (5 pts) Can you speed up the three queries from Q9? If so, show what you did and show the before/after query plans using the EXPLAIN command.
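
To illustrate what a before/after plan comparison can look like, here is a minimal sketch on SQLite through Python's standard library. SQLite's `EXPLAIN QUERY PLAN` is only an analogue of Postgres's `EXPLAIN`, and its output format is different. The `event` table, the `idx_event_year` index, and the `plan` helper are hypothetical names chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table for illustration only.
conn.execute(
    "CREATE TABLE event (event_id INTEGER PRIMARY KEY, year INTEGER, label TEXT)"
)
conn.executemany(
    "INSERT INTO event (year, label) VALUES (?, ?)",
    [(1980 + i % 40, f"event-{i}") for i in range(1000)],
)

def plan(sql):
    """Return the plan steps SQLite reports for a query (4th column of each row)."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT label FROM event WHERE year = 1990"

before = plan(query)  # without an index, SQLite has to scan the whole table
conn.execute("CREATE INDEX idx_event_year ON event (year)")
after = plan(query)   # with the index, SQLite can search on year directly

print("before:", before)
print("after: ", after)
```

In Postgres, the corresponding workflow would be to run `EXPLAIN` on the query, make a change such as adding an index, and run `EXPLAIN` again to see whether the planner picks a different strategy.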

In [None]:
%%sql
-- Your code goes here

Q11. (10 pts) Use Pandas to load your tables and run the three queries you wrote in Q9, but using Pandas methods (no SQL).

In [None]:
import pandas as pd

#TODO: your code goes here

Part 3: SQL vs. Pandas
--------------

We have an IMDB movies dataset (IMDB-movies.csv), and we are interested in getting movies that were released in 1990. We will first produce this dataset using Pandas, then we will produce it using Postgres and compare the two methods. The goal of this part is to show you why you need a database/SQL vs. just using Pandas. 
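
When comparing runtimes, keep in mind that a single measurement can be noisy. The sketch below shows one possible way to time a callable several times and keep the best run, using `time.perf_counter`. The `time_call` helper is a hypothetical name, and the summation is only a stand-in workload, not part of the lab.

```python
import time

def time_call(fn, repeats=5):
    """Run fn() several times and return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - start)
    return best, result

# Stand-in workload; replace the lambda with the code you want to measure.
best, total = time_call(lambda: sum(range(100_000)))
print(f"Best of 5 runs: {best:.6f} seconds")
```

Taking the minimum over a few runs reduces the influence of one-off effects such as cold caches or background activity.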


#### Getting the name and ranking of all movies released in 1990 (using Pandas)
Q12. (8 pts) Complete the following code cell.

In [None]:
import pandas as pd
import time

# Start the timer
start_time = time.time()

#TODO: your code goes here

# End the timer
end_time = time.time()

# Calculate the duration for loading
duration = end_time - start_time
print(f'Time taken: {duration} seconds')

# Print the loaded data


#### Getting the name and ranking of all movies released in 1990 (using SQL)

Let's first load the CSV file into the Postgres database. The COPY command allows us to import a CSV file into a Postgres table.

In [None]:
%%sql

DROP TABLE IF EXISTS movies;

CREATE TABLE movies (
  id SERIAL PRIMARY KEY,
  name TEXT,
  year INT,
  rank TEXT
);

COPY movies(id, name, year, rank)
FROM '/tmp/IMDB-movies.csv'
WITH (FORMAT csv, HEADER true, NULL 'NULL');

Q13. (8 pts) Complete the following code cell.

In [None]:
import pandas as pd
from sqlalchemy import create_engine
import time


# Establish a connection to the PostgreSQL database
engine = create_engine('postgresql://postgres:postgres@db:5432/postgres')

# Start the timer
start_time = time.time()

#TODO: your code goes here

# End the timer
end_time = time.time()

# Calculate the duration for loading
duration = end_time - start_time
print(f'Time taken: {duration} seconds')

# Print the loaded data



#### Comparison

Q14. (5 pts) Compare the runtime of the SQL and Pandas approaches for producing the dataset, and discuss why they differ.

Q15. (5 pts) What can we do to further accelerate the SQL query that retrieves the 1990 movies dataset? Write a new code cell with this improvement. Has the runtime improved?