In [None]:
import pyspark.sql.functions as f

# PySpark DataFrames: Data Analysis Part 2

Welcome to the last notebook of the PySpark DataFrames module! This part continues the previous one, and we will keep practicing more advanced operations on PySpark DataFrames.

This notebook is made up of further questions that will teach you how to use some PySpark DataFrame methods and SQL functions that are very useful for data analysis.

The data used in this notebook is the same as in the previous one: the orders and products datasets adapted from the JMP Case Study Library.

Let's load the orders and products data we saved to DBFS in the first part of this module.

In [None]:
df_orders = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("sep", ",")
    .load("/FileStore/lp-big-data/orders-data/orders-preprocessed.csv")
)

df_products = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("sep", ",")
    .load("/FileStore/lp-big-data/orders-data/products-preprocessed.csv")
)

df_orders.display()
df_products.display()

## Join tables

Once again, let's use the join operator to join the orders and products tables.

We want to use a left join on the orders table because we don't want to lose any orders data, even if the products sold in an order are not listed in the products table.

As we've seen in the previous notebook, since both tables have a column with the same name, we can simply use that column to join the tables instead of having to separately specify the name of the column in each table.

In [None]:
df_orders_products = (
    df_orders.join(
        df_products,
        on=['product_id'],
        how='left'
    )
)

df_orders_products.display()
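
To make the left-join behaviour concrete, here is a minimal plain-Python sketch of the same idea. It is not PySpark code: the toy rows and the `left_join` helper are made up for illustration, and the helper assumes each `product_id` appears at most once on the right-hand side.

```python
# Toy rows, made up for illustration; only the product_id column name mirrors the real data.
orders = [
    {"order_id": 1, "product_id": "A"},
    {"order_id": 2, "product_id": "B"},
    {"order_id": 3, "product_id": "Z"},  # "Z" is missing from products
]
products = {
    "A": {"product_category": "Toys"},
    "B": {"product_category": "Sports"},
}


def left_join(left_rows, right_by_key, key, right_columns):
    """Keep every left row; fill right-hand columns with None when there is no match."""
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key], {})
        joined.append({**row, **{col: match.get(col) for col in right_columns}})
    return joined


joined_rows = left_join(orders, products, "product_id", ["product_category"])
for row in joined_rows:
    print(row)
```

The third order survives the join with an empty `product_category`, which is what `how='left'` gives us in PySpark (the empty values show up as `null`).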

Let's recall the schema of the merged table:

In [None]:
df_orders_products.printSchema()

## Data Analytics - Continuation

1. For each customer segment, what is the maximum delivery delay and the average profit?

We can perform multiple aggregations on the same group of data at once, as we've seen in question 6 of the previous notebook. Let's see another syntax to do that: passing `agg` a dictionary that maps each column name to the name of an aggregation function.

In [None]:
(
    df_orders_products
    .groupBy('customer_status')
    .agg({
        'days_to_delivery': 'max',
        'profit': 'avg',
    })
).display()
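
One thing to keep in mind with the dictionary syntax comes from Python itself rather than from Spark: dictionary keys are unique, so a column can only appear once. The plain-Python sketch below (the `requested` name is just for illustration) shows what happens when the same column is listed twice.

```python
# Plain Python, no Spark needed: a dict literal with a repeated key keeps only the last value.
requested = {
    "days_to_delivery": "max",
    "profit": "max",
    "profit": "avg",  # silently overwrites the "max" entry above
}

print(requested)
```

If we need two aggregations of the same column, passing column expressions such as `f.max('profit')` and `f.avg('profit')` to `agg` avoids this limitation.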

2. What is the average number of units (amount) per order of each product sold?

In [None]:
# Average number of units per order for each product
(
    df_orders_products
    .groupBy(f.col('product_id'))
    .agg(f.avg('amount').alias('avg_amount'))
).display()

3. On which dates was each product delivered, across all orders?

To answer this question we need to group the data by product and then aggregate the dates on which the product was delivered into a list. We can use the `collect_list()` function to do that.

This aggregation function is different from the ones we've seen so far. Instead of collapsing all the data into a single value, it creates a list of values.

In [None]:
(
    df_orders_products
    .groupBy(f.col('product_id'))
    .agg(f.collect_list('delivery_date').alias('delivery_dates'))
).display()
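
As a rough picture of what this aggregation produces, here is a plain-Python sketch on made-up rows (not PySpark code). Like `collect_list()`, it keeps duplicates. Note that in Spark the order of the values inside each list is not guaranteed, whereas this sketch simply follows the input order.

```python
from collections import defaultdict

# Toy (product_id, delivery_date) rows, made up for illustration.
rows = [
    ("P1", "2020-01-05"),
    ("P2", "2020-01-07"),
    ("P1", "2020-02-11"),
    ("P1", "2020-01-05"),  # same product delivered twice on the same date
]

# Group by product and append every delivery date to that product's list.
delivery_dates = defaultdict(list)
for product_id, delivery_date in rows:
    delivery_dates[product_id].append(delivery_date)

print(dict(delivery_dates))
```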

4. How many orders of each category were placed in each year?

To answer this question we can use the pivot operation. This operation produces a pivot table, a cross-tabulation that shows the relationship between two columns: the values of one column become the rows and the values of the other become the columns.

As an alternative we could also use the groupBy operation, but the pivot operation results in a more readable table.

In [None]:
# Using groupBy
(
    df_orders_products
    .groupBy('order_year', 'product_category')
    .agg(f.count('order_id'))
).display()

In [None]:
# Using pivot
pivot_df = (
    df_orders_products
    .groupBy('order_year')
    .pivot('product_category')
    .agg(f.count('order_id'))
)

pivot_df.display()

This query created one row for each distinct value of the `groupBy` column and one column for each unique value of the `pivot` column.
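
The relationship between the long `groupBy` result and the pivot table can be sketched in plain Python. This is only an illustration on made-up pairs, not PySpark code, and names such as `pivot_table` are hypothetical.

```python
from collections import Counter

# Toy (order_year, product_category) pairs, made up for illustration.
orders = [
    (2019, "Toys"), (2019, "Toys"), (2019, "Sports"),
    (2020, "Toys"), (2020, "Outdoors"),
]

# The long, groupBy-style result: one count per (year, category) pair.
counts = Counter(orders)

years = sorted({year for year, _ in orders})
categories = sorted({category for _, category in orders})

# The pivoted result: one row per year, one column per category.
# Combinations that never occur stay None.
pivot_table = {
    year: {category: counts.get((year, category)) for category in categories}
    for year in years
}

for year, row in pivot_table.items():
    print(year, row)
```

Combinations that never occur in the data stay empty in the pivoted table, so expect some empty cells when a category has no orders in a given year.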

5. What is the total profit generated each year by the suppliers of each continent?

Again, we can use the pivot operation to answer this question.

In [None]:
pivot_df = (
    df_orders_products
    .groupBy('order_year')
    .pivot('supplier_continent')
    .agg(f.sum('profit'))
)

pivot_df.display()

6. Which customer bought the greatest variety of products in each year?

Let's break the question down into smaller parts:
- First, we need to count the number of unique products each customer bought in each year.
- Then, we need to get the customer with the highest number of unique products bought in each year.

In [None]:
# Find the number of unique products each client ordered in each year
unique_product_count_df = (
    df_orders_products
    .groupBy(['customer_id', 'order_year'])
    .agg(f.countDistinct('product_id').alias('nr_unique_products'))
)

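# Find the customer with the maximum number of unique products for each year.
# Sorting before groupBy does not guarantee which row f.first() returns, so we take
# the max of a (count, customer) struct instead; ties go to the larger customer_id.
result_df = (
    unique_product_count_df
    .groupBy('order_year')
    .agg(f.max(f.struct('nr_unique_products', 'customer_id')).alias('top'))
    .select(
        'order_year',
        f.col('top.customer_id').alias('most_varied_customer'),
        f.col('top.nr_unique_products').alias('max_unique_products'),
    )
)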

result_df.display()
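
The two steps listed above can also be sketched in plain Python on made-up rows, which may help when checking the logic by hand. This is an illustration only, not PySpark code. Ties are broken here by the larger `customer_id`, which is an arbitrary choice for the sketch.

```python
# Toy (customer_id, order_year, product_id) rows, made up for illustration.
rows = [
    ("C1", 2019, "P1"), ("C1", 2019, "P2"), ("C1", 2019, "P1"),
    ("C2", 2019, "P3"),
    ("C1", 2020, "P1"),
    ("C2", 2020, "P1"), ("C2", 2020, "P4"),
]

# Step 1: collect the distinct products per (customer, year).
products_seen = {}
for customer_id, order_year, product_id in rows:
    products_seen.setdefault((customer_id, order_year), set()).add(product_id)

# Step 2: per year, keep the (count, customer) pair with the largest count.
best_per_year = {}
for (customer_id, order_year), products in products_seen.items():
    candidate = (len(products), customer_id)
    if order_year not in best_per_year or candidate > best_per_year[order_year]:
        best_per_year[order_year] = candidate

print(best_per_year)
```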

7. What is the standard deviation of profit for each supplier?

In [None]:
(
    df_orders_products
    .groupBy('supplier_id')
    .agg(f.stddev('profit').alias('profit_stddev'))
).display()
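
A note on which standard deviation this is: in Spark SQL, `stddev` is an alias for `stddev_samp`, the sample standard deviation (dividing by n - 1), and `stddev_pop` gives the population version. The difference can be seen with Python's standard library on a few made-up profit values:

```python
import statistics

# Toy profit values, made up for illustration.
profits = [10.0, 12.0, 14.0, 16.0]

sample_sd = statistics.stdev(profits)       # divides by n - 1
population_sd = statistics.pstdev(profits)  # divides by n

print(sample_sd, population_sd)
```

The sample version is always at least as large as the population one, and the gap matters most for suppliers with only a few orders.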

8. How many unique customers placed orders within each supplier continent?

In [None]:
(
    df_orders_products
    .groupBy('supplier_continent')
    .agg(
        f.countDistinct('customer_id').alias('nr_unique_customers')
    )
).display()

---

Great work! We've learned a lot about PySpark DataFrames and SQL functions and how to use them to perform data analysis.

As usual, it's time to practice what we've learned. Go to notebook `exercises-part2` to solve the exercises.