# Use subqueries to aggregate data

## Overview

I am a junior data analyst working for a company that manufactures socks. I have access to data on the company’s customers, orders, warehouses, and products. My objective is to identify top-performing warehouses to optimize warehouse performance.

## Dataset

I receive two .csv files containing the data for warehouses and orders:

- the **Warehouse table** can be viewed in [Google Sheets](https://drive.google.com/file/d/18bzqeHv2Nk_BZD0N8S9WlpC2mSm--SVd/view?usp=drive_link) or the [.csv file](/activities/sql/c05m03-count-distinct/c05m03-warehouse-data.csv), and records are saved in the format `warehouse_id,warehouse_alias,maximum_capacity,employee_total,state`
- the **Orders table** can be viewed in [Google Sheets](https://drive.google.com/file/d/1dPcjBa1mC1FFtQsMZ95CnL_NBoVSzEP5/view?usp=drive_link) or the [.csv file](/activities/sql/c05m03-count-distinct/c05m03-warehouse-orders-data.csv), and records are saved in the format `order_id,customer_id,warehouse_id,order_date,shipper_date`

Below is a preview of both tables in .csv format:

![Data in csv format](c05m03-warehouse-tables-data.png 'Data in csv format')

## Importing the data into BigQuery

I follow these steps to import the warehouse and orders data into BigQuery:

- **Create dataset** with **Dataset ID** `warehouse_orders`
- In the **Dataset info** window, select the **CREATE TABLE** button
- In the **Source** section, select the ***Upload*** option in **Create table from**
- Browse to the `c05m03-warehouse-data.csv` file and open it
- Set the file format to **CSV**
- In the **Destination** section, name the table as `warehouse`
- In the **Schema** section, select **Auto detect**
- Finally, select **Create table**

A new table `warehouse` has been created and appears in the explorer pane under the dataset `warehouse_orders`. I repeat the above steps to create a second table, `orders`, from the file `c05m03-warehouse-orders-data.csv`. A preview of the BigQuery tables is shown below:

![Data in BigQuery](c05m03-warehouse-tables-bigquery.png 'Data in BigQuery')
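
As a quick sanity check after the import, a query along the following lines could confirm that both tables loaded and how many rows each holds. This is only a sketch: it reuses the project and dataset path from the main query in this write-up, and the `table_name` and `row_count` column aliases are my own illustrative names.

```sql
-- Count the rows in each imported table (sanity check after upload).
SELECT
  'warehouse' AS table_name,
  COUNT(*) AS row_count
FROM
  `plucky-aegis-427011-v5.warehouse_orders.warehouse`
UNION ALL
SELECT
  'orders' AS table_name,
  COUNT(*) AS row_count
FROM
  `plucky-aegis-427011-v5.warehouse_orders.orders`;
```

If the row counts match the number of data rows in each .csv file (excluding the header row), the upload did not drop or duplicate records.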

## Query: Identify top-performing warehouses

To identify the top-performing warehouses, I require the following information:

- warehouse ID,
- warehouse state,
- warehouse alias,
- number of orders for each warehouse,
- total of orders for all warehouses combined, and
- a classification of each warehouse by the percentage of total orders that it fulfilled: 0-20%, 21-60%, or > 60%.
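
The percentage bands in the last item can be sketched in isolation before applying them to real data. The example below is a minimal illustration, not part of the analysis itself: the `share` values are made-up fractions chosen to sit on and around the band boundaries, and the labels match the ones used in the main query.

```sql
-- Classify hypothetical order shares into the three fulfillment bands.
-- The values in the array are illustrative only, not taken from the dataset.
SELECT
  share,
  CASE
    WHEN share <= 0.20 THEN 'Fulfilled 0-20% of Orders'
    WHEN share <= 0.60 THEN 'Fulfilled 21-60% of Orders'
    ELSE 'Fulfilled more than 60% of Orders'
  END AS fulfillment_summary
FROM
  UNNEST([0.05, 0.20, 0.35, 0.60, 0.75]) AS share
ORDER BY
  share;
```

Because `CASE` evaluates its branches in order, the second branch only sees shares above 0.20, so a share of exactly 0.20 falls in the first band and a share of exactly 0.60 falls in the second.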

I execute the following query to aggregate the data from the `warehouse` and `orders` tables into a single result set containing this information:

```sql
SELECT
  warehouse.warehouse_id,
  CONCAT(warehouse.state, ': ', warehouse.warehouse_alias) AS warehouse_name,
  COUNT(orders.order_id) AS number_of_orders,
  /* Subquery to calculate the total number of orders for all warehouses
     to compare number of orders fulfilled by each warehouse */
  (
    SELECT
      COUNT(*)
    FROM
      `plucky-aegis-427011-v5.warehouse_orders.orders` AS orders
  )
  AS total_orders,
  -- Classifying fulfillment percentage of total orders
  CASE
    WHEN
      COUNT(orders.order_id) /
      (SELECT COUNT(*) FROM `plucky-aegis-427011-v5.warehouse_orders.orders` AS orders)
      <= 0.20
    THEN "Fulfilled 0-20% of Orders"
    WHEN
      COUNT(orders.order_id) /
      (SELECT COUNT(*) FROM `plucky-aegis-427011-v5.warehouse_orders.orders` AS orders)
      > 0.20
    AND
      COUNT(orders.order_id) /
      (SELECT COUNT(*) FROM `plucky-aegis-427011-v5.warehouse_orders.orders` AS orders)
      <= 0.60
    THEN "Fulfilled 21-60% of Orders"
    ELSE "Fulfilled more than 60% of Orders"
  END AS fulfillment_summary

FROM
  `plucky-aegis-427011-v5.warehouse_orders.warehouse` AS warehouse
LEFT JOIN
  `plucky-aegis-427011-v5.warehouse_orders.orders` AS orders
ON
  warehouse.warehouse_id = orders.warehouse_id
GROUP BY
  warehouse.warehouse_id,
  warehouse_name
-- Exclude new warehouses being built without any orders
HAVING
  COUNT(orders.order_id) > 0
-- Sorting to have top performing warehouse first
ORDER BY
  number_of_orders DESC;
```

The query successfully returns the information required to identify the top-performing warehouses, as shown below:

![Top performing warehouses](c05m03-query-top-perfoming-warehouses.png 'Top performing warehouses')
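
The query above repeats the same total-orders subquery several times. As a possible alternative, and not the approach used in this activity, the total could be computed once in a common table expression (CTE) and joined to the per-warehouse counts. The sketch below assumes the same tables and columns; the CTE names `order_totals` and `warehouse_counts` are my own illustrative names.

```sql
-- Alternative sketch: compute the overall total once, then reuse it.
WITH order_totals AS (
  -- Total number of orders across all warehouses
  SELECT
    COUNT(*) AS total_orders
  FROM
    `plucky-aegis-427011-v5.warehouse_orders.orders`
),
warehouse_counts AS (
  -- Number of orders per warehouse, skipping warehouses without orders
  SELECT
    warehouse.warehouse_id,
    CONCAT(warehouse.state, ': ', warehouse.warehouse_alias) AS warehouse_name,
    COUNT(orders.order_id) AS number_of_orders
  FROM
    `plucky-aegis-427011-v5.warehouse_orders.warehouse` AS warehouse
  LEFT JOIN
    `plucky-aegis-427011-v5.warehouse_orders.orders` AS orders
  ON
    warehouse.warehouse_id = orders.warehouse_id
  GROUP BY
    warehouse.warehouse_id,
    warehouse_name
  HAVING
    COUNT(orders.order_id) > 0
)
SELECT
  warehouse_counts.warehouse_id,
  warehouse_counts.warehouse_name,
  warehouse_counts.number_of_orders,
  order_totals.total_orders,
  -- Same bands as the main query; branches are evaluated in order
  CASE
    WHEN warehouse_counts.number_of_orders / order_totals.total_orders <= 0.20
      THEN 'Fulfilled 0-20% of Orders'
    WHEN warehouse_counts.number_of_orders / order_totals.total_orders <= 0.60
      THEN 'Fulfilled 21-60% of Orders'
    ELSE 'Fulfilled more than 60% of Orders'
  END AS fulfillment_summary
FROM
  warehouse_counts
CROSS JOIN
  order_totals
ORDER BY
  number_of_orders DESC;
```

Because `order_totals` contains exactly one row, the `CROSS JOIN` simply attaches the overall total to every warehouse row. This version should return the same columns as the subquery version, with the trade-off that it no longer demonstrates subqueries in the `SELECT` list and `CASE` expression, which is the focus of this activity.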