# Cloud Function to Beam Pipeline

In this example, we'll convert a series of Cloud Functions into an Apache Beam pipeline.

### Background

Your team receives invoice data from your ordering system in JSON format. Currently, the data is processed by a series of Cloud Functions.


The JSON object contains a list of invoices, and each invoice has the following structure (the type of each field is shown in place of a value):
```json
{
  "invoice_id": int,
  "customer_id": int,
  "line_items": [
    {
      "line_no": int,
      "product_name": str,
      "quantity": int,
      "price_ea": float
    }
  ]
}
```
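
To make the structure concrete, here is a minimal sketch of a single invoice built as a Python dict and serialized to a JSON string. The field values are made up for illustration and are not taken from the real ordering system.

```python
import json

# Illustrative values only; they are not taken from the real ordering system.
example_invoice = {
    "invoice_id": 1,
    "customer_id": 4,
    "line_items": [
        {"line_no": 1, "product_name": "Tape", "quantity": 3, "price_ea": 0.99},
        {"line_no": 2, "product_name": "Box", "quantity": 2, "price_ea": 2.49},
    ],
}

# Serialize to a JSON string, then parse it back to confirm the round trip.
example_json = json.dumps(example_invoice, indent=2)
round_tripped = json.loads(example_json)

print(example_json)
```

Parsing the string with `json.loads` returns a dict equal to the one we started with, so the fields can be accessed by key in later processing steps.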

### Create the Sample Data

The cell below contains a function that generates sample data. Let's run it to generate some records that we can use to test our pipeline.

In [None]:
import random
import json

def generate_sample_data(n_records: int = 100, n_customers: int = 10):
    """
    params:
        n_records (int): The number of sample records to create. Default = 100
        n_customers (int): The number of unique customer_ids to include. Default = 10
    
    out:
        sample invoices (list[str]): a list of sample records formatted as JSON strings
    """
    # products and their corresponding prices
    products = ['Scissors', 'Tape', 'Printer Paper', 'Box', 'Envelope']
    prices = [9.99, 0.99, 7.99, 2.49, 1.49]
    
    # initialize variables
    invoice_no = 0
    invoices = []
    
    while len(invoices) < n_records:
        # generate a random customer ID
        customer_id = random.randint(1, n_customers)
        
        # each product gets its own line item on the invoice
        line_items = []
        for price, product in zip(prices, products):
            # get a random quantity for each product
            quantity = random.randint(0, 5)
            # if quantity is zero, skip that line
            if quantity <= 0:
                continue
            
            line_items.append(
                {'line_no': len(line_items) + 1, 'product_name': product, 'quantity': quantity, 'price_ea': price}
            )
        
        if line_items:
            invoice_no += 1
            invoices.append(
                json.dumps({'invoice_id': invoice_no, 'customer_id': customer_id, 'line_items': line_items}, indent=2)
            )
    
    return invoices

sample_data = generate_sample_data()

### Review the Data

Let's check out the sample data. Run the cell below to print one sample record.

In [None]:
print(sample_data[0])

### Cloud Functions

Your company chose a series of Cloud Functions as its ETL solution because the process has multiple stages, and each stage's output is consumed by multiple downstream stages. Rather than building one massive pipeline, you broke each stage out into its own Cloud Function to make it easier to maintain and update.
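
The fan-out described above can be sketched roughly as follows: one stage's output is computed once and then handed to two independent downstream stages. The names `stage_one`, `stage_two_a`, and `stage_two_b` are hypothetical placeholders used only for illustration; they are not the Cloud Functions defined in this notebook.

```python
import json

def stage_one(records: list) -> list:
    """Hypothetical upstream stage: parse each JSON string into a dict."""
    return [json.loads(record) for record in records]

def stage_two_a(invoices: list) -> int:
    """Hypothetical downstream stage A: count the invoices."""
    return len(invoices)

def stage_two_b(invoices: list) -> set:
    """Hypothetical downstream stage B: collect the distinct customer IDs."""
    return {invoice["customer_id"] for invoice in invoices}

# Made-up input records for illustration.
raw = [
    '{"invoice_id": 1, "customer_id": 7, "line_items": []}',
    '{"invoice_id": 2, "customer_id": 3, "line_items": []}',
    '{"invoice_id": 3, "customer_id": 7, "line_items": []}',
]

# The upstream output is produced once and reused by both downstream stages.
parsed = stage_one(raw)
invoice_count = stage_two_a(parsed)
customer_ids = stage_two_b(parsed)

print(invoice_count, sorted(customer_ids))
```

Because each stage is a separate function with a plain list as its interface, a downstream stage can be changed or replaced without touching the others.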

In [None]:
# Cloud Function 1
# It takes a list of invoices that are formatted as JSON str
# For each invoice in list of invoices:
#   1. Calculate and add the total for each line item in the invoice
#   2. Calculate and add the invoice total
# Returns the JSON str list of invoices with the updated totals
def get_total_by_invoice(invoice_list: list):
    # initialize the invoice output list
    invoices = []
    
    # loop through all the invoices in the invoice list to calculate the invoice total for each
    for invoice_json in invoice_list:
        # convert the JSON str to a dict and get the list of line_items
        invoice = json.loads(invoice_json)
        line_items = invoice.get('line_items', [])
        
        if not line_items:
            continue
        
        # calculate total for each line item and add it to a running total
        running_total = 0.00
        for line_item in line_items:
            # calculate the total for the line by multiplying quantity and price for each item
            quantity = line_item.get('quantity', 0)
            price_ea = line_item.get('price_ea', 0.00)
            line_item_total = round(quantity * price_ea, 2)
            # add the line total to the line item dict (this updates the dict inside line_items in place)
            line_item['line_item_total'] = line_item_total
            # add the line total to the invoice running total
            running_total += line_item_total
        
        # add the rounded running total to the invoice dict
        invoice['invoice_total'] = round(running_total, 2)
        # convert the invoice dict to a JSON str and add it to the invoice output list
        invoices.append(json.dumps(invoice, indent=2))
    
    # return the updated list of invoices
    return invoices
            
# Cloud Function 2
# It takes the list of invoices that are output by Cloud Function 1
# Iterates through the invoices to calculate the total spent by each customer
# Returns list of JSON str with a customer ID and that customer's total spend 
def get_total_by_customer(invoice_w_total_list: list):
    # initialize dict to keep track of running total by customer
    customer_totals = {}
    # iterate through all invoices and get total by customer ID
    for invoice_json in invoice_w_total_list:
        # convert invoice JSON to dict and get customer ID
        invoice = json.loads(invoice_json)
        customer_id = invoice.get('customer_id', -1)
        
        # get the current running total for that customer ID or 0.00 if there isn't one
        customer_total = customer_totals.get(customer_id, 0.00)
        # add the total for this invoice to the running total and overwrite the previous total in the dict
        customer_totals[customer_id] = round(customer_total + invoice.get('invoice_total', 0.00), 2)
    
    # iterate through customer totals dict and add each customer and total as JSON str to the output list
    customer_total_list = []
    for customer_id, customer_total in customer_totals.items():
        customer_total_list.append(
            json.dumps({'customer_id': customer_id, 'customer_total': round(customer_total, 2)}, indent=2)
        )
    
    # return the list of totals by customer
    return customer_total_list

In [None]:
invoices_with_totals = get_total_by_invoice(sample_data)
print(invoices_with_totals[0])
customer_totals = get_total_by_customer(invoices_with_totals)
print(customer_totals[0])

In [None]:
import apache_beam as beam
from apache_beam import Create, Map, GroupByKey, FlatMap
from apache_beam.transforms.util import WithKeys

from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib

In [None]:
p = beam.Pipeline(InteractiveRunner())

# parse each JSON string into a Python dict so later steps can access fields by key
invoices = (p | 'Create Invoice PColl' >> Create(sample_data)
              | 'Parse JSON to Dict' >> Map(json.loads))

ib.show(invoices)