# Project 2, Part 6, Preliminary analytics

University of California, Berkeley

Master of Information and Data Science (MIDS) program

w205 - Fundamentals of Data Engineering

Student: Jack Galvin

Year: 2022

Semester: Spring

Section: 9


# Included Modules and Packages

Code cell containing your includes for modules and packages

In [1]:
import pandas as pd
import numpy as np
import math
import psycopg2

# Supporting code

Code cells containing any supporting code, such as connecting to the database, any functions, etc.  

Remember you can freely use any code from the labs. You do not need to cite code from the labs.

In [2]:
connection = psycopg2.connect(
    user = "postgres",
    password = "ucb",
    host = "postgres",
    port = "5432",
    database = "postgres"
)

In [3]:
cursor = connection.cursor()

In [4]:
# Run a select query and return the rows in a pandas dataframe.
# pandas reads numeric values from Postgres as floats, so any float column
# whose values are all whole numbers is converted to an integer column.


def my_select_query_pandas(query, rollback_before_flag, rollback_after_flag):
    """Run a select query and return the rows in a pandas dataframe."""

    if rollback_before_flag:
        connection.rollback()

    df = pd.read_sql_query(query, connection)

    if rollback_after_flag:
        connection.rollback()

    # Convert float columns that hold only whole numbers to nullable integers
    for column in df:
        if df[column].dtype == "float64":
            fraction_flag = False

            for value in df[column].values:
                # NaN marks a missing value, so skip it
                if not np.isnan(value):
                    if value - math.floor(value) != 0:
                        fraction_flag = True
                        break

            if not fraction_flag:
                df[column] = df[column].astype("Int64")

    return df
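
The whole-number test inside the helper can be looked at on its own. The following is a minimal sketch of that check on plain Python floats; `all_whole` is a hypothetical name used only for illustration and is not part of the lab code.

```python
import math


def all_whole(values):
    """Return True if every non-NaN float in values has no fractional part."""
    # NaN stands in for a missing value and is skipped, as in the helper above
    return all(v.is_integer() for v in values if not math.isnan(v))


# A column of whole numbers with one missing value would be converted
print(all_whole([1.0, 2.0, float("nan")]))

# A column with a fractional value would stay as float
print(all_whole([1.0, 2.5]))
```

A column that passes this test is the kind the helper converts to `Int64`, the pandas nullable integer type, which can still hold missing values.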

# 2.6.1 Total dollar amount of sales

Write a query to sum the total_amount in the stage_1_peak_sales table and present the sum in a Pandas dataframe with appropriate column header name.

It is fine to leave the sum as is.  You do not have to format it or put in dollar signs.

Remember that you need to convert varchars to numeric using column::numeric before doing any math on them.

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [14]:
# Find the total dollar amount of sales

rollback_before_flag = True
rollback_after_flag = True

query = """

select sum(total_amount::numeric) as total_dollar_sales
from stage_1_peak_sales;

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

| | total_dollar_sales |
|---|---|
| 0 | 6480 |
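
To see why the cast matters, here is a small Python sketch of the same idea. The three amounts are hypothetical values made up for illustration, not rows from the table, and `Decimal(...)` plays the role that `column::numeric` plays in the query.

```python
from decimal import Decimal

# Hypothetical amounts stored as text, the way a varchar column holds them
amounts_as_text = ["12", "100", "9"]

# Text compares character by character, so "9" ranks above "100"
largest_as_text = max(amounts_as_text)

# Converting first gives real arithmetic and real ordering
amounts = [Decimal(a) for a in amounts_as_text]
largest = max(amounts)
total = sum(amounts)

print(largest_as_text, largest, total)
```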


# 2.6.2 Total number of sales

Write a query to count the total number of sales in the stage_1_peak_sales table and present the count in a Pandas dataframe with appropriate column header name.  Each record in the stage_1_peak_sales table is a sale.

It is fine to leave the count as is.  You do not have to format it.

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [11]:
# Find the total number of sales

rollback_before_flag = True
rollback_after_flag = True

query = """

select count(*) as total_number_sales
from stage_1_peak_sales;

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

| | total_number_sales |
|---|---|
| 0 | 97 |


# 2.6.3 Total dollar amount of sales, total cut paid to Peak, net to AGM

AGM is paying Peak an 18% cut to deliver the meals.

Write a query to calculate the total dollar amount of sales, the total cut paid to Peak, and the net to AGM.  

You may want to round to 2 decimal places for the total cut paid to Peak and the net to AGM, as they will be decimal.  

You do not need to format the numbers with commas, dollar signs, etc.

Remember that you need to convert varchars to numeric using column::numeric before doing any math on them.

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [33]:
# Find the total dollar amount of sales, cut paid to Peak, net to AGM

rollback_before_flag = True
rollback_after_flag = True

query = """

select sum(total_amount::numeric) as total_dollar_sales,
        round(sum(total_amount::numeric) * 0.18, 2) as cut_to_peak,
        round(sum(total_amount::numeric) * (1 - 0.18), 2) as net_agm
from stage_1_peak_sales;

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

| | total_dollar_sales | cut_to_peak | net_agm |
|---|---|---|---|
| 0 | 6480 | 1166.4 | 5313.6 |
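
The figures can be checked by hand outside the database. This is a small sketch of the arithmetic, using the total of 6480 from the query output and the 18% rate stated in the prompt. The variable names are chosen for illustration.

```python
from decimal import Decimal

total_dollar_sales = Decimal("6480")  # total from the query output
peak_rate = Decimal("0.18")           # Peak's 18% cut

# Round the cut to 2 decimal places, then subtract it to get the net
cut_to_peak = (total_dollar_sales * peak_rate).quantize(Decimal("0.01"))
net_agm = total_dollar_sales - cut_to_peak

print(cut_to_peak, net_agm)
```

The cut and the net add back up to the total, which is a quick sanity check on the query.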


# 2.6.4 Total number of meals sold

Write a query to sum the quantity in the stage_1_peak_line_items table and present the sum in a Pandas dataframe with appropriate column header name.

It is fine to leave the number as is.  You do not have to format it.

Remember that you need to convert varchars to numeric using column::numeric before doing any math on them.

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [35]:
# Find the total number of meals sold

rollback_before_flag = True
rollback_after_flag = True

query = """

select sum(quantity::numeric) as total_num_meals
from stage_1_peak_line_items;

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

| | total_num_meals |
|---|---|
| 0 | 540 |


# 2.6.5 Total number of meals sold by meal

Expanding on the last query, group the sum of quantity by meal.  Display the meal followed by the number of meals sold. 

Sort by highest number sold first.

Note that you will need to use the peak_product_mapping table and the products table in addition to the stage_1_peak_line_items table.

It is fine to leave the numbers as is.  You do not have to format them.

Remember that you need to convert varchars to numeric using column::numeric before doing any math on them.

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [59]:
# Find the total number of meals sold, by meal

rollback_before_flag = True
rollback_after_flag = True

query = """

select p.description as meal_name,
        sum(l.quantity::numeric) as total_num_sold
from stage_1_peak_line_items as l
    join peak_product_mapping as pmt
        on l.product_id::numeric = pmt.peak_product_id::numeric
    join products as p
        on pmt.product_id::numeric = p.product_id::numeric
group by p.product_id, p.description
order by total_num_sold desc;

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

| | meal_name | total_num_sold |
|---|---|---|
| 0 | Pistachio Salmon | 113 |
| 1 | Eggplant Lasagna | 107 |
| 2 | Curry Chicken | 101 |
| 3 | Teriyaki Chicken | 80 |
| 4 | Brocolli Stir Fry | 60 |
| 5 | Tilapia Piccata | 44 |
| 6 | Spinach Orzo | 27 |
| 7 | Chicken Salad | 8 |
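
As a cross-check, the per-meal counts should add up to the 540 meals found in 2.6.4. The sketch below copies the counts from the output above into a plain dictionary (meal names are kept exactly as they appear in the data) and verifies the total and the sort order.

```python
# Per-meal counts copied from the query output
meals_sold = {
    "Pistachio Salmon": 113,
    "Eggplant Lasagna": 107,
    "Curry Chicken": 101,
    "Teriyaki Chicken": 80,
    "Brocolli Stir Fry": 60,
    "Tilapia Piccata": 44,
    "Spinach Orzo": 27,
    "Chicken Salad": 8,
}

# The grouped sums should add back up to the overall total
total = sum(meals_sold.values())

# Same ordering as "order by total_num_sold desc"
ranked = sorted(meals_sold.items(), key=lambda item: item[1], reverse=True)

print(total)
print(ranked[0])
```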


# 2.6.6 Average number of meals per sale

Write a query to find the average number of meals per sale, which should be equal to the total number of meals sold divided by the total number of sales, both of which we have calculated before.

You may want to round to 1 decimal place.

Remember that you need to convert varchars to numeric using column::numeric before doing any math on them.

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [67]:
# Find the average number of meals per sale

rollback_before_flag = True
rollback_after_flag = True

query = """

with a as (select sum(quantity::numeric) as total_num_meals
from stage_1_peak_line_items),
b as (select count(*) as total_number_sales
from stage_1_peak_sales)
select round((a.total_num_meals / b.total_number_sales), 2) as avg_meals_per_sale
from a,b;

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

| | avg_meals_per_sale |
|---|---|
| 0 | 5.57 |
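
The result can be reproduced from the two earlier totals, 540 meals and 97 sales. This sketch shows the division rounded to 2 decimal places, as the query does, and to 1 decimal place, as the prompt suggests. It assumes half-up rounding to mirror how Postgres rounds numeric values.

```python
from decimal import Decimal, ROUND_HALF_UP

total_num_meals = Decimal("540")    # from 2.6.4
total_number_sales = Decimal("97")  # from 2.6.2

ratio = total_num_meals / total_number_sales

# Rounded to 2 places (as in the query) and to 1 place (as the prompt suggests)
two_places = ratio.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
one_place = ratio.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

print(two_places, one_place)
```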
