# Manipulating Collections using Loops – 1

* Reading files into collections
* Overview of Standard Transformations
* Row level transformations
* Getting unique elements
* Filtering Data
* Exercises

## Reading files into collections

Let us understand how to read data from files into collections.
* Python has simple yet rich APIs to perform file I/O.
* We can create a file object with open in different modes (read-only text mode by default).
* To read the contents of the file into memory, the file object provides methods such as read().
* read() returns the entire contents of the file as one large string.
* If the data has multiple records delimited by the newline character, we can apply splitlines() to the output of read().
* splitlines() splits the string at line boundaries such as the newline character and returns a list of lines.

In [5]:
path = '/Users/itversity/Research/data/retail_db/orders/part-00000'
# On Windows, the path would start with something like C:\\users\\itversity\\Research
orders_file = open(path)

In [6]:
orders_raw = orders_file.read()

In [7]:
orders = orders_raw.splitlines()

In [8]:
orders[:10]

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']

In [9]:
len(orders) # same as number of records in the file

68883
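The cells above leave orders_file open after the read. Here is a minimal sketch of the same read using a with block, which closes the file automatically. The temporary file and the three sample records are stand-ins assumed for illustration, not the real part-00000 file.

```python
import os
import tempfile

# Hypothetical sample records standing in for the real orders file
sample_records = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
    '3,2013-07-25 00:00:00.0,12111,COMPLETE',
]

# Write the sample records to a temporary file so the example is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('\n'.join(sample_records) + '\n')
    sample_path = tmp.name

# The with block closes the file as soon as the read completes
with open(sample_path) as sample_file:
    sample_orders = sample_file.read().splitlines()

os.remove(sample_path)  # clean up the temporary file

print(sample_orders[:2])
print(len(sample_orders))
```

With the real data, only the path passed to open would change.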

## Overview of Standard Transformations

Let us understand the standard transformations we perform on data in collections.
* Filtering
* Row level transformations such as standardization, cleansing etc.
* Aggregations
* Grouped Aggregations
* Sorting and Ranking

Typically we use external libraries such as Pandas or PySpark to perform these standard transformations. However, we will implement them using conventional loops, both to understand how they work and to get better at programming.
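As a rough preview of what a loop-based version of these transformations can look like, here is a sketch of a grouped aggregation followed by sorting. The sample records and the names status_counts and sorted_counts are assumptions made for illustration.

```python
# Hypothetical sample records in the same format as the orders data
sample_orders = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
    '3,2013-07-25 00:00:00.0,12111,COMPLETE',
    '4,2013-07-25 00:00:00.0,8827,CLOSED',
]

# Grouped aggregation: count the orders for each status
status_counts = {}
for order in sample_orders:
    status = order.split(',')[3]
    status_counts[status] = status_counts.get(status, 0) + 1

# Sorting: order the (status, count) pairs by count, highest first
sorted_counts = sorted(status_counts.items(), key=lambda item: item[1], reverse=True)

print(sorted_counts)
```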

## Row level transformations

Here are the details about orders.
* Data is in text file format
* Each line in the file contains one record.
* Each record contains 4 attributes which are separated by “,”
  * order_id
  * order_date
  * order_customer_id
  * order_status

### Task 1

Get all order ids and the associated statuses. Each record in the output should be a comma-separated string.

In [10]:
order = '1,2013-07-25 00:00:00.0,11599,CLOSED'

In [11]:
order.join?

Signature: order.join(iterable, /)
Docstring:
Concatenate any number of strings.

The string whose method is called is inserted in between each given string.
The result is returned as a new string.

Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs'
Type:      builtin_function_or_method


In [12]:
','.join([order.split(',')[0], order.split(',')[3]])

'1,CLOSED'

In [13]:
order_statuses = []
for order in orders:
    order_statuses.append(','.join([order.split(',')[0], order.split(',')[3]]))

In [14]:
order_statuses[:10]

['1,CLOSED',
 '2,PENDING_PAYMENT',
 '3,COMPLETE',
 '4,CLOSED',
 '5,COMPLETE',
 '6,COMPLETE',
 '7,COMPLETE',
 '8,PROCESSING',
 '9,PENDING_PAYMENT',
 '10,PENDING_PAYMENT']

In [15]:
len(order_statuses)

68883

In [16]:
order_statuses = [','.join([order.split(',')[0], order.split(',')[3]]) for order in orders] # alternative solution

In [17]:
order_statuses[:10]

['1,CLOSED',
 '2,PENDING_PAYMENT',
 '3,COMPLETE',
 '4,CLOSED',
 '5,COMPLETE',
 '6,COMPLETE',
 '7,COMPLETE',
 '8,PROCESSING',
 '9,PENDING_PAYMENT',
 '10,PENDING_PAYMENT']

In [18]:
len(order_statuses)

68883
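Both solutions above call split twice for every record. One possible variation is to split once inside a small helper and reuse the result. The helper name get_order_status and the sample records are assumptions for illustration.

```python
def get_order_status(order):
    """Return 'order_id,order_status' for one comma separated order record."""
    order_values = order.split(',')  # split once and reuse the result
    return ','.join([order_values[0], order_values[3]])


# Hypothetical sample records in the same format as the orders data
sample_orders = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
]

sample_statuses = [get_order_status(order) for order in sample_orders]
print(sample_statuses)
```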

### Task 2

Get all order ids, the dates on which the orders were placed, and the order statuses. Each record in the output should be a dict with the following column names as keys.
* order_id
* order_date
* order_status

In [21]:
def get_order_details(order):
    """Extract order details such as id, date as well as status and return as dict"""
    order_values = order.split(',')
    return ({
        'order_id': int(order_values[0]),
        'order_date': order_values[1],
        'order_status': order_values[3]
    })

In [22]:
get_order_details('1,2013-07-25 00:00:00.0,11599,CLOSED')

{'order_id': 1,
 'order_date': '2013-07-25 00:00:00.0',
 'order_status': 'CLOSED'}

In [23]:
order_details = []
for order in orders:
    order_details.append(get_order_details(order))

In [24]:
order_details[:10]

[{'order_id': 1,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'CLOSED'},
 {'order_id': 2,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'PENDING_PAYMENT'},
 {'order_id': 3,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'COMPLETE'},
 {'order_id': 4,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'CLOSED'},
 {'order_id': 5,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'COMPLETE'},
 {'order_id': 6,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'COMPLETE'},
 {'order_id': 7,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'COMPLETE'},
 {'order_id': 8,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'PROCESSING'},
 {'order_id': 9,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'PENDING_PAYMENT'},
 {'order_id': 10,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'PENDING_PAYMENT'}]

In [25]:
len(order_details)

68883

In [26]:
order_details = [get_order_details(order) for order in orders]

In [27]:
order_details[:10]

[{'order_id': 1,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'CLOSED'},
 {'order_id': 2,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'PENDING_PAYMENT'},
 {'order_id': 3,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'COMPLETE'},
 {'order_id': 4,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'CLOSED'},
 {'order_id': 5,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'COMPLETE'},
 {'order_id': 6,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'COMPLETE'},
 {'order_id': 7,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'COMPLETE'},
 {'order_id': 8,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'PROCESSING'},
 {'order_id': 9,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'PENDING_PAYMENT'},
 {'order_id': 10,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'PENDING_PAYMENT'}]

In [28]:
len(order_details)

68883

## Getting unique elements

Let us perform a few tasks to understand how to extract unique elements.
* We can create a list of elements first and then convert it into a set.
* We can also build the set directly while extracting the information.
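The first approach, building a list and then converting it, can be sketched as follows. The sample records and variable names are assumptions for illustration.

```python
# Hypothetical sample records in the same format as the orders data
sample_orders = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
    '3,2013-07-26 00:00:00.0,12111,COMPLETE',
]

# Step 1: collect every date, duplicates included
order_dates_list = []
for order in sample_orders:
    order_dates_list.append(order.split(',')[1])

# Step 2: converting the list to a set drops the duplicates
unique_order_dates = set(order_dates_list)

print(len(order_dates_list), len(unique_order_dates))
```

This variant holds every date in memory before deduplicating, so building the set directly is usually the lighter option for large inputs.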

### Task 1

Get all the unique dates from orders data.

In [39]:
order_dates = set()
for order in orders:
    order_dates.add(order.split(',')[1])

In [40]:
list(order_dates)[:10]

['2014-01-25 00:00:00.0',
 '2014-04-08 00:00:00.0',
 '2014-01-01 00:00:00.0',
 '2013-08-17 00:00:00.0',
 '2013-08-11 00:00:00.0',
 '2013-10-01 00:00:00.0',
 '2013-08-01 00:00:00.0',
 '2014-05-07 00:00:00.0',
 '2014-06-02 00:00:00.0',
 '2013-11-22 00:00:00.0']

In [42]:
len(order_dates)

364

In [33]:
order_dates = {order.split(',')[1] for order in orders}

In [35]:
list(order_dates)[:10]

['2014-01-25 00:00:00.0',
 '2014-04-08 00:00:00.0',
 '2014-01-01 00:00:00.0',
 '2013-08-17 00:00:00.0',
 '2013-08-11 00:00:00.0',
 '2013-10-01 00:00:00.0',
 '2013-08-01 00:00:00.0',
 '2014-05-07 00:00:00.0',
 '2014-06-02 00:00:00.0',
 '2013-11-22 00:00:00.0']

In [36]:
len(order_dates)

364
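A set has no defined order, which is why the dates above appear shuffled. If a chronological listing is needed, one option is sorted, sketched here on a small hypothetical set of dates. Because the strings use the YYYY-MM-DD format, plain string sorting matches date order.

```python
# Hypothetical subset of unique order dates
sample_dates = {
    '2014-01-25 00:00:00.0',
    '2013-08-17 00:00:00.0',
    '2014-01-01 00:00:00.0',
}

# sorted returns a new list; the set itself is left unchanged
sorted_dates = sorted(sample_dates)
print(sorted_dates)
```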

### Task 2

Get all the unique weekend dates from orders data.

In [57]:
order_date = '2014-01-25 00:00:00.0'

In [57]:
import datetime as dt

In [58]:
dt.datetime.strptime(order_date, '%Y-%m-%d %H:%M:%S.%f')

datetime.datetime(2014, 1, 25, 0, 0)

In [59]:
dt.datetime.strptime(order_date, '%Y-%m-%d %H:%M:%S.%f').weekday()

5

In [61]:
import calendar
calendar.day_name[dt.datetime.strptime(order_date, '%Y-%m-%d %H:%M:%S.%f').weekday()]

'Saturday'

In [62]:
calendar.day_abbr[dt.datetime.strptime(order_date, '%Y-%m-%d %H:%M:%S.%f').weekday()]

'Sat'

In [63]:
dt.datetime.strptime(order_date, '%Y-%m-%d %H:%M:%S.%f').weekday() in (5, 6)

True

In [65]:
import datetime as dt
def is_weekend(order_date):
    return dt.datetime.strptime(order_date, '%Y-%m-%d %H:%M:%S.%f').weekday() in (5, 6)

In [66]:
is_weekend('2014-01-25 00:00:00.0')

True

In [69]:
weekend_dates = set()
for order in orders:
    order_date = order.split(',')[1]
    if is_weekend(order_date):
        weekend_dates.add(order_date)

In [70]:
list(weekend_dates)[:10]

['2014-01-25 00:00:00.0',
 '2014-03-16 00:00:00.0',
 '2014-04-19 00:00:00.0',
 '2013-11-03 00:00:00.0',
 '2013-08-17 00:00:00.0',
 '2013-08-11 00:00:00.0',
 '2014-02-09 00:00:00.0',
 '2013-10-27 00:00:00.0',
 '2013-11-17 00:00:00.0',
 '2014-07-12 00:00:00.0']

In [71]:
len(weekend_dates)

103
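Since every timestamp in this data is at midnight, a possible alternative is to parse only the date part. The sketch below assumes the first 10 characters are always in YYYY-MM-DD form and uses date.fromisoformat (available from Python 3.7). The name is_weekend_date is hypothetical.

```python
import datetime as dt


def is_weekend_date(order_date):
    """Return True if the date part of order_date falls on Saturday or Sunday."""
    # Assumes the first 10 characters are in YYYY-MM-DD format
    return dt.date.fromisoformat(order_date[:10]).weekday() in (5, 6)


print(is_weekend_date('2014-01-25 00:00:00.0'))  # a Saturday
print(is_weekend_date('2014-01-27 00:00:00.0'))  # a Monday
```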

## Filtering Data

Let us perform a few tasks to understand how to filter the data in collections using loops and conditionals.

Here are the details about orders.
* Data is in text file format
* Each line in the file contains one record.
* Each record contains 4 attributes which are separated by “,”
  * order_id
  * order_date
  * order_customer_id
  * order_status

### Task 1
Create a function named get_customer_orders which takes the orders list and customer_id as arguments and returns all the orders placed by that customer.

In [79]:
orders[:10]

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']

In [80]:
order = '3,2013-07-25 00:00:00.0,12111,COMPLETE'

In [81]:
int(order.split(',')[2]) == 12111

True

In [82]:
def get_customer_orders(orders, customer_id):
    orders_filtered = []
    for order in orders:
        if int(order.split(',')[2]) == customer_id:
            orders_filtered.append(order)
    return orders_filtered

In [83]:
# Use the function and get all the orders which are placed by customer with id 12431
get_customer_orders(orders, 12431)

['3774,2013-08-16 00:00:00.0,12431,CANCELED',
 '3870,2013-08-17 00:00:00.0,12431,PENDING_PAYMENT',
 '4032,2013-08-17 00:00:00.0,12431,ON_HOLD',
 '22812,2013-12-12 00:00:00.0,12431,PENDING',
 '22927,2013-12-13 00:00:00.0,12431,CLOSED',
 '25614,2013-12-30 00:00:00.0,12431,CLOSED',
 '27585,2014-01-12 00:00:00.0,12431,PROCESSING',
 '28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT',
 '29109,2014-01-21 00:00:00.0,12431,ON_HOLD',
 '29232,2014-01-21 00:00:00.0,12431,ON_HOLD',
 '45894,2014-05-06 00:00:00.0,12431,CLOSED',
 '46217,2014-05-07 00:00:00.0,12431,CLOSED',
 '49678,2014-05-31 00:00:00.0,12431,PENDING',
 '51865,2014-06-15 00:00:00.0,12431,PROCESSING',
 '63146,2014-02-13 00:00:00.0,12431,PENDING_PAYMENT',
 '67110,2014-07-14 00:00:00.0,12431,PENDING']
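The same filter can also be written as a list comprehension with a condition. This is a sketch only: the function name get_customer_orders_lc and the sample records are assumptions for illustration.

```python
def get_customer_orders_lc(orders, customer_id):
    """Return the orders placed by customer_id, using a list comprehension."""
    return [order for order in orders if int(order.split(',')[2]) == customer_id]


# Hypothetical sample records in the same format as the orders data
sample_orders = [
    '3,2013-07-25 00:00:00.0,12111,COMPLETE',
    '3774,2013-08-16 00:00:00.0,12431,CANCELED',
    '3870,2013-08-17 00:00:00.0,12431,PENDING_PAYMENT',
]

print(get_customer_orders_lc(sample_orders, 12431))
```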

### Task 2

Create a function named get_customer_orders_for_month which takes the orders list, customer_id and month in the format YYYY-MM as arguments and returns all the orders placed by that customer in the given month.

In [73]:
order = '3,2013-07-25 00:00:00.0,12111,COMPLETE'

In [74]:
int(order.split(',')[2]) == 12111

True

In [75]:
order.split(',')[1].startswith('2013-07')

True

In [76]:
int(order.split(',')[2]) == 12111 and order.split(',')[1].startswith('2013-07')

True

In [77]:
def get_customer_orders_for_month(orders, customer_id, order_month):
    orders_filtered = []
    for order in orders:
        order_elements = order.split(',')
        if int(order_elements[2]) == customer_id and order_elements[1].startswith(order_month):
            orders_filtered.append(order)
    return orders_filtered

In [78]:
# Use the function and get all the orders which are placed by customer with id 12431 in January 2014
get_customer_orders_for_month(orders, 12431, '2014-01')

['27585,2014-01-12 00:00:00.0,12431,PROCESSING',
 '28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT',
 '29109,2014-01-21 00:00:00.0,12431,ON_HOLD',
 '29232,2014-01-21 00:00:00.0,12431,ON_HOLD']

### Task 3
Write ad hoc code to get all the orders placed by the customer with id 12431 in January 2014 whose status is either PENDING_PAYMENT or PROCESSING.

In [72]:
for order in orders:
    order_elements = order.split(',')
    if (
        int(order_elements[2]) == 12431
        and order_elements[1].startswith('2014-01')
        and order_elements[3] in ('PROCESSING', 'PENDING_PAYMENT')
    ):
        print(order)

27585,2014-01-12 00:00:00.0,12431,PROCESSING
28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT


## Exercises