# Manipulating Collections using Loops – 2

Let us continue manipulating collections using loops. We will primarily focus on slightly more advanced transformations.

* Preparing Data Sets
* Quick Recap of Dict Operations
* Performing Total Aggregations
* Performing Aggregations - Grouped
* Joining Data Sets – 1
* Joining Data Sets – 2
* Limitations of using Loops
* Exercises

## Preparing Data Sets

We will primarily use the orders and order_items data sets to understand how to manipulate collections.
* orders is available at path **/Users/itversity/Research/data/retail_db/orders/part-00000**
* order_items is available at path **/Users/itversity/Research/data/retail_db/order_items/part-00000**
* orders - columns
  * order_id - it is of type integer and unique
  * order_date - it can be considered as string
  * order_customer_id - it is of type integer
  * order_status - it is of type string
* order_items - columns
  * order_item_id - it is of type integer and unique
  * order_item_order_id - it is of type integer and refers to orders.order_id
  * order_item_product_id - it is of type integer and refers to products.product_id
  * order_item_quantity - it is of type integer and represents the number of units of the product for an order item within an order.
  * order_item_subtotal - it is the item level revenue (product of order_item_quantity and order_item_product_price)
  * order_item_product_price - it is the product price for each item within an order.
* orders is parent data set to order_items and will contain one record per order. Each order can contain multiple items.
* order_items is child data set to orders and can contain multiple entries for a given order_item_order_id.
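
As a minimal sketch of how the column descriptions above map to Python types, a record from each data set can be split on commas and converted field by field. The helper names `parse_order` and `parse_order_item` are hypothetical, and the field order is assumed to follow the column listing above.

```python
def parse_order(order):
    """Convert a comma-separated orders record into a dict with typed values."""
    order_id, order_date, order_customer_id, order_status = order.split(',')
    return {
        'order_id': int(order_id),
        'order_date': order_date,  # kept as a string
        'order_customer_id': int(order_customer_id),
        'order_status': order_status,
    }


def parse_order_item(order_item):
    """Convert a comma-separated order_items record into a dict with typed values."""
    fields = order_item.split(',')
    return {
        'order_item_id': int(fields[0]),
        'order_item_order_id': int(fields[1]),
        'order_item_product_id': int(fields[2]),
        'order_item_quantity': int(fields[3]),
        'order_item_subtotal': float(fields[4]),
        'order_item_product_price': float(fields[5]),
    }


order_rec = parse_order('1,2013-07-25 00:00:00.0,11599,CLOSED')
order_item_rec = parse_order_item('1,1,957,1,299.98,299.98')
```

Here `order_item_rec['order_item_order_id']` refers to `order_rec['order_id']`, which is the parent-child relationship described above.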

### Task 1 - Read orders into collection
Let us read the orders data set into a collection called **orders**. This will be used later.

In [None]:
orders_path = '/Users/itversity/Research/data/retail_db/orders/part-00000'
# On Windows the path would start with C:\\users\\itversity\\Research
orders_file = open(orders_path)

In [None]:
orders_raw = orders_file.read()
orders_file.close() # Close the file once its contents are read

In [None]:
orders = orders_raw.splitlines()

In [None]:
orders[:10]

In [None]:
len(orders) # same as number of records in the file

### Task 2 - Read order_items into collection
Let us read the order_items data set into a collection called **order_items**. This will be used later.

In [None]:
order_items_path = '/Users/itversity/Research/data/retail_db/order_items/part-00000'
# On Windows the path would start with C:\\users\\itversity\\Research
order_items_file = open(order_items_path)

In [None]:
order_items_raw = order_items_file.read()
order_items_file.close() # Close the file once its contents are read

In [None]:
order_items = order_items_raw.splitlines()

In [None]:
order_items[:10]

In [None]:
len(order_items) # same as number of records in the file
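
The open, read and split steps used for both data sets can also be wrapped in one function. The following is a minimal sketch, assuming a hypothetical helper named `read_lines`. It uses a `with` block so that the file is closed automatically, and it is demonstrated on a small temporary file with illustrative records rather than on the actual data sets.

```python
import os
import tempfile


def read_lines(path):
    """Read a text file and return its records as a list of strings."""
    with open(path) as data_file:  # the file is closed when the block ends
        return data_file.read().splitlines()


# Write two illustrative records to a temporary file
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('1,2013-07-25 00:00:00.0,11599,CLOSED\n')
    tmp.write('2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n')
    tmp_path = tmp.name

sample_orders = read_lines(tmp_path)
os.remove(tmp_path)  # clean up the temporary file
```

With the actual files, the same helper could be called as `read_lines(orders_path)` and `read_lines(order_items_path)`.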

## Quick Recap of Dict Operations

Let us recap some of the important concepts and operations related to `dict`. We will primarily focus on the operations which are important for aggregations and joins.
* A `dict` can contain elements of heterogeneous types.
* Typically it is used to represent a row in a table or a sheet.
* Each element in a `dict` is a key-value pair, where the key is typically the column name.
* Here are the common `dict` operations relevant to aggregations and joins.
  * Adding elements to the dict
  * Checking if the key exists
  * Getting value for a given key
  * Updating value if the key exists

In [None]:
order_count_by_date = {}

In [None]:
# Adding elements in dict
order_count_by_date['2014-01-01'] = 1
order_count_by_date['2014-01-02'] = 1
order_count_by_date['2014-01-03'] = 1

In [None]:
order_count_by_date

In [None]:
# Checking if element exists in dict
'2014-01-01' in order_count_by_date

In [None]:
'2014-01-04' in order_count_by_date

In [None]:
# Getting value for a given key
order_count_by_date['2014-01-01']

In [None]:
order_count_by_date['2014-01-04'] # Raises KeyError as the key does not exist

In [None]:
order_count_by_date.get('2014-01-01')

In [None]:
order_count_by_date.get('2014-01-04') # Returns None

In [None]:
# Updating value
order_count_by_date['2014-01-01'] = 2

In [None]:
order_count_by_date

In [None]:
order_count_by_date['2014-01-01'] += 1 # Incrementing the existing value for a given key

In [None]:
order_count_by_date

In [None]:
order_count_by_date.update({'2014-01-02': 2}) # Alternate way to update an existing element value in dict

In [None]:
order_count_by_date
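
The operations above are usually combined when counting: check whether the key exists, add it if it does not, and increment its value if it does. Here is a minimal sketch using a small illustrative list of dates. The names `order_dates`, `count_by_date` and `count_by_date_using_get` are made up for this example.

```python
order_dates = ['2014-01-01', '2014-01-02', '2014-01-01', '2014-01-03', '2014-01-01']

# Approach 1: check whether the key exists, then update or add
count_by_date = {}
for order_date in order_dates:
    if order_date in count_by_date:
        count_by_date[order_date] += 1
    else:
        count_by_date[order_date] = 1

# Approach 2: use get with a default of 0 when the key does not exist
count_by_date_using_get = {}
for order_date in order_dates:
    count_by_date_using_get[order_date] = count_by_date_using_get.get(order_date, 0) + 1
```

Both approaches build the same `dict`. The second one is shorter because `get` returns the supplied default instead of `None` when the key is missing.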

## Performing Total Aggregations

Python has built-in functions such as `len`, `sum`, `min`, `max` etc. to take care of total aggregations. Let us understand how such aggregations are typically implemented using loops.
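
As a rough sketch of the idea, loop-based versions of these aggregations could look as follows. The names `my_len`, `my_sum`, `my_min` and `my_max` are hypothetical, and the actual built-ins handle more cases than these simplified versions (for example, `min` and `max` raise `ValueError` on an empty collection, whereas the versions below return `None`).

```python
def my_len(values):
    """Count the elements by visiting each one."""
    count = 0
    for _ in values:
        count += 1
    return count


def my_sum(values):
    """Add each element to a running total."""
    total = 0
    for value in values:
        total += value
    return total


def my_min(values):
    """Track the smallest element seen so far; None if there are no elements."""
    smallest = None
    for value in values:
        if smallest is None or value < smallest:
            smallest = value
    return smallest


def my_max(values):
    """Track the largest element seen so far; None if there are no elements."""
    largest = None
    for value in values:
        if largest is None or value > largest:
            largest = value
    return largest


quantities = [1, 2, 5, 1, 3]  # illustrative values
```

All four follow the same pattern: initialize a result, update it for each element in the loop, and return it at the end. The tasks below use the same pattern with a filter condition added.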

### Task 1
Use orders and get the total number of records for a given month (201401).
* Develop a function which takes the orders collection and month as arguments.
* Month will be passed as an integer in the form of yyyyMM (example 201401).
* Return the order count.

In [None]:
orders[:10]

In [None]:
order = '1,2013-07-25 00:00:00.0,11599,CLOSED'

In [None]:
order.split(',')

In [None]:
order.split(',')[1]

In [None]:
order.split(',')[1][:7]

In [None]:
order.split(',')[1][:7].replace('-', '')

In [None]:
int(order.split(',')[1][:7].replace('-', ''))

In [None]:
def get_order_count(orders, order_month):
    order_count = 0
    for order in orders:
        # Extract yyyy-MM from the order date and convert it to an integer in yyyyMM form
        l_order_month = int(order.split(',')[1][:7].replace('-', ''))
        if l_order_month == order_month:
            order_count += 1
    return order_count

In [None]:
get_order_count(orders, 201401)

### Task 2

Use order items data set and compute total revenue generated for a given product_id.
* Define a function which takes order_items and a product_id as arguments.
* product_id will be passed as integer
* Compute revenue generated for a given product id using subtotal (5th field)
* Return the computed product revenue

In [None]:
order_items[:10]

In [None]:
order_item = '1,1,957,1,299.98,299.98'

In [None]:
order_item.split(',')

In [None]:
order_item.split(',')[2]

In [None]:
int(order_item.split(',')[2])

In [None]:
float(order_item.split(',')[4])

In [None]:
def get_product_revenue(order_items, product_id):
    product_revenue = 0.0
    for order_item in order_items:
        fields = order_item.split(',')
        l_product_id = int(fields[2])
        order_item_subtotal = float(fields[4])
        # Add the subtotal only for the items of the given product
        if l_product_id == product_id:
            product_revenue += order_item_subtotal
    return product_revenue

In [None]:
get_product_revenue(order_items, 502)

### Task 3

Use order items data set and get total number of items sold as well as total revenue generated for a given product_id.
* Define a function which takes order_items and a product_id as arguments.
* product_id will be passed as integer
* Get number of items sold for a given product id using quantity (4th field)
* Compute revenue generated for a given product id using subtotal (5th field)
* Return the number of items sold as well as revenue generated

In [None]:
t1 = (1, 200.0)

In [None]:
t2 = (2, 300.0)

In [None]:
res = (0, 0.0)

In [None]:
res = (res[0] + t1[0], res[1] + t1[1])

In [None]:
res

In [None]:
res = (res[0] + t2[0], res[1] + t2[1])

In [None]:
res

In [None]:
def get_product_metrics(order_items, product_id):
    # Tuple of (number of items sold, revenue generated)
    product_metrics = (0, 0.0)
    for order_item in order_items:
        fields = order_item.split(',')
        l_product_id = int(fields[2])
        order_metric = (int(fields[3]), float(fields[4]))
        if l_product_id == product_id:
            product_metrics = (product_metrics[0] + order_metric[0], product_metrics[1] + order_metric[1])
    return product_metrics

In [None]:
get_product_metrics(order_items, 502)

### Task 4

Create a collection with sales and commission percentage. Using that collection compute total commission amount. If the commission percent is None or not present, treat it as 0.
* Each element in the collection should be a tuple.
* First element is the sales amount and second element is commission percentage.
* Commission for each sale can be computed by multiplying commission percentage with sales (make sure to divide commission percentage by 100).
* Some of the records do not have a commission percentage. In that case the commission amount for that sale shall be 0.
* The function should take a collection of tuples and return the commission amount, which is of type float.

In [None]:
transactions = [(376.0, 8),
(548.23, 14),
(107.93, 8),
(838.22, 14),
(846.85, 21),
(234.84,),
(850.2, 21),
(992.2, 21),
(267.01,),
(958.91, 21),
(412.59,),
(283.14,),
(350.01, 14),
(226.95,),
(132.7, 14)]

In [None]:
transactions[:6]

In [None]:
sale = (376.0, 8)

In [None]:
commission_amount = round(sale[0] * (sale[1] / 100), 2)

In [None]:
commission_amount

In [None]:
sale = (234.84,)

In [None]:
commission_amount = round(sale[0] * (sale[1] / 100), 2) # Raises IndexError as the commission percentage is not present

In [None]:
len(sale)

In [None]:
commission_pct = sale[1] / 100 if len(sale) == 2 else 0

In [None]:
commission_pct

In [None]:
def get_commission_amount(sales):
    commission_amount = 0.0
    for sale in sales:
        sale_amount = sale[0]
        # Treat a missing or None commission percentage as 0
        has_commission = len(sale) == 2 and sale[1] is not None
        commission_pct = sale[1] / 100 if has_commission else 0
        commission_amount += sale_amount * commission_pct
    return round(commission_amount, 2)

In [None]:
get_commission_amount(transactions)

## Performing Aggregations - Grouped

Here are some examples of grouped aggregations.
* Get number of employees for each department
* Get daily revenue for a given month (aggregation for each day and filtering based upon month).
* Number of courses enrolled by each student
* Number of students enrolled for each course
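
A grouped aggregation can be sketched with a `dict` in which the key is the group and the value is the aggregated result. The example below counts orders per date for a given month. The function name `get_daily_order_count` and the sample records are assumptions made for illustration; the records follow the same comma-separated layout as the orders data set.

```python
def get_daily_order_count(orders, order_month):
    """Count orders per date for a month passed as an integer in yyyyMM form."""
    daily_order_count = {}
    for order in orders:
        order_date = order.split(',')[1][:10]  # yyyy-MM-dd
        l_order_month = int(order_date[:7].replace('-', ''))
        # Filter by month, then aggregate by date
        if l_order_month == order_month:
            daily_order_count[order_date] = daily_order_count.get(order_date, 0) + 1
    return daily_order_count


# Illustrative records, not taken from the actual data set
sample_orders = [
    '1,2014-01-01 00:00:00.0,11599,CLOSED',
    '2,2014-01-01 00:00:00.0,256,PENDING_PAYMENT',
    '3,2014-01-02 00:00:00.0,12111,COMPLETE',
    '4,2014-02-01 00:00:00.0,8827,CLOSED',
]
daily_counts = get_daily_order_count(sample_orders, 201401)
```

Unlike the total aggregations earlier, which return a single value, the result here has one entry per group (one per date in this case).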

## Joining Data Sets – 1

## Joining Data Sets – 2

## Limitations of using Loops

## Exercises