# Quick Recap of Python

Let us quickly recap some of the core programming concepts of Python before we get into Spark.

## Data Engineering Life Cycle

Let us first understand the Data Engineering Life Cycle. We typically read the data, process it by applying business rules and write the data back to different targets.
* Read the data from different sources.
  * Files
  * Databases
  * Mainframes
  * APIs
* Process the data.
  * Row Level Transformations
  * Aggregations
  * Sorting
  * Ranking
  * Joining multiple data sets
* Write data to different targets.
  * Files
  * Databases
  * Mainframes
  * APIs
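
The read, process and write steps above can be sketched with the standard library alone. This is only an illustration: the in-memory buffers stand in for real sources and targets, and the sample rows are made up for the example.

```python
import csv
import io

# Hypothetical in-memory "source" standing in for a file, database or API.
source = io.StringIO(
    "1,2013-07-25,COMPLETE,299.98\n"
    "2,2013-07-25,PENDING,199.99\n"
    "3,2013-07-26,COMPLETE,250.00\n"
)

# Read: parse each line into a record (a list of strings).
records = list(csv.reader(source))

# Process: keep COMPLETE rows (row level filter), then total the amounts (aggregation).
complete = [r for r in records if r[2] == "COMPLETE"]
total = round(sum(float(r[3]) for r in complete), 2)

# Write: send the result to a target (another in-memory buffer here).
target = io.StringIO()
csv.writer(target).writerow(["COMPLETE", total])
print(target.getvalue().strip())  # COMPLETE,549.98
```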

## Python CLI or Jupyter Notebook

We can use Python CLI or Jupyter Notebook to explore APIs.
* We can launch Python CLI using `python` command.
* We can launch the Jupyter Notebook using the `jupyter notebook` command.
* A web service will be started on port number 8888 by default.
* We can go to the browser and connect to the web server using IP address and port number.
* We should be able to explore code in interactive fashion.
* We can issue magic commands such as `%%sh` to run shell commands, `%%markdown` to document using markdown, etc.

### Tasks

Let us perform these tasks to just recollect how to use Python CLI or Jupyter Notebook.
* Create variables `i` and `j` assigning `10` and `20.5` respectively.

In [1]:
i = 10
j = 20.5

* Add the values and assign result to `res`.

In [2]:
res = i + j
print(str(res))

30.5


* Get the `type` of `i`, `j` and `res`.

In [3]:
type(i)

int

In [4]:
type(j)

float

In [5]:
type(res)

float

* Get the help on `int`.

In [6]:
help(int)

Help on class int in module builtins:

class int(object)
 |  int(x=0) -> integer
 |  int(x, base=10) -> integer
 |  
 |  Convert a number or string to an integer, or return 0 if no arguments
 |  are given.  If x is a number, return x.__int__().  For floating point
 |  numbers, this truncates towards zero.
 |  
 |  If x is not a number or if base is given, then x must be a string,
 |  bytes, or bytearray instance representing an integer literal in the
 |  given base.  The literal can be preceded by '+' or '-' and be surrounded
 |  by whitespace.  The base defaults to 10.  Valid bases are 0 and 2-36.
 |  Base 0 means to interpret the base from the string as an integer literal.
 |  >>> int('0b100', base=0)
 |  4
 |  
 |  Methods defined here:
 |  
 |  __abs__(self, /)
 |      abs(self)
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __and__(self, value, /)
 |      Return self&value.
 |  
 |  __bool__(self, /)
 |      self != 0
 |  
 |  __ceil__(...)
 |      Ceiling of

* Get the help on `startswith` that is available on `str`.

In [7]:
help(str.startswith)

Help on method_descriptor:

startswith(...)
    S.startswith(prefix[, start[, end]]) -> bool
    
    Return True if S starts with the specified prefix, False otherwise.
    With optional start, test S beginning at that position.
    With optional end, stop comparing S at that position.
    prefix can also be a tuple of strings to try.



## Basic Programming Constructs

Let us recollect some of the basic programming constructs of Python.
* Comparison Operations (==, !=, <, >, <=, >=, etc) 
  * All the comparison operators return a True or False (Boolean value)
* Conditionals (if) 
  * We typically use comparison operators as part of conditionals.
* Loops (for) 
  * We can iterate through a collection using `for i in l`, where `l` is a standard collection such as a list or a set.
  * Python provides a special function called `range`, which returns a sequence of integers within the given range. It excludes the upper bound value.
* In Python, scope is defined by indentation.
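
As a small sketch of how these constructs fit together (the variable names are chosen only for this example):

```python
# Comparison operators return Boolean values.
print(10 > 5)    # True
print(10 == 5)   # False

# range excludes the upper bound, so range(1, 6) yields 1 through 5.
evens = []
for n in range(1, 6):
    # Indentation defines scope: this if belongs to the for loop,
    # and the append belongs to the if.
    if n % 2 == 0:
        evens.append(n)

print(evens)     # [2, 4]
```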

### Tasks
 
Let us perform a few tasks to quickly recap basic programming constructs of Python.
* Get all the odd numbers between 1 and 15.

In [8]:
list(range(1, 16, 2))

[1, 3, 5, 7, 9, 11, 13, 15]

* Print all those numbers which are divisible by 3 from the above list.

In [9]:
for i in range(1, 16, 2):
    if i % 3 == 0:
        print(i)

3
9
15


## Developing Functions

Let us understand how to develop functions using Python as the programming language.
* A function definition starts with `def` followed by the function name.
* Parameters can be of different types.
    * Required
    * Keyword
    * Variable Number
    * Functions
* Functions which take another function as an argument are called higher order functions.
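
The parameter types listed above can be sketched in a single function. The function name `total` and its parameters are hypothetical and exist only for this illustration.

```python
def total(first, *rest, scale=1, f=lambda x: x):
    """Add up f(value) for every value and multiply the result by scale.

    first - required parameter
    rest  - variable number of additional positional values
    scale - keyword parameter with a default value
    f     - a function, which makes total a higher order function
    """
    values = (first,) + rest
    return scale * sum(f(v) for v in values)

print(total(1))                            # 1
print(total(1, 2, 3))                      # 6
print(total(1, 2, 3, scale=10))            # 60
print(total(1, 2, 3, f=lambda x: x * x))   # 14
```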

### Tasks

Let us perform a few tasks to understand how to develop functions in Python.
* Sum of integers between lower bound and upper bound using a formula.

In [10]:
def sumOfN(n):
    return n * (n + 1) // 2

In [11]:
sumOfN(10)

55

In [12]:
def sumOfIntegers(lb, ub):
    return sumOfN(ub) - sumOfN(lb - 1)

In [13]:
sumOfIntegers(5, 10)

45
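
The formula-based version relies on the closed form for the sum of the first $n$ integers, $\sum_{k=1}^{n} k = \frac{n(n+1)}{2}$. The sum over a range is then the difference of two such sums:

$$
\sum_{k=\mathit{lb}}^{\mathit{ub}} k \;=\; \sum_{k=1}^{\mathit{ub}} k \;-\; \sum_{k=1}^{\mathit{lb}-1} k \;=\; \frac{\mathit{ub}\,(\mathit{ub}+1)}{2} \;-\; \frac{(\mathit{lb}-1)\,\mathit{lb}}{2}
$$

For $\mathit{lb} = 5$ and $\mathit{ub} = 10$ this gives $55 - 10 = 45$.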

* Sum of integers between lower bound and upper bound using loops.

In [14]:
def sumOfIntegers(lb, ub):
    total = 0
    for e in range(lb, ub + 1):
        total += e
    return total

In [15]:
sumOfIntegers(1, 10)

55

* Sum of squares of integers between lower bound and upper bound using loops.

In [16]:
def sumOfSquares(lb, ub):
    total = 0
    for e in range(lb, ub + 1):
        total += e * e
    return total

In [17]:
sumOfSquares(2, 4)

29

* Sum of the even numbers between lower bound and upper bound using loops.

In [18]:
def sumOfEvens(lb, ub):
    total = 0
    for e in range(lb, ub + 1):
        total += e if e%2==0 else 0
    return total

In [19]:
sumOfEvens(2, 4)

6

## Lambda Functions

Let us recap details related to lambda functions.
* We can develop functions without names. They are called Lambda Functions and are also known as Anonymous Functions.
* We typically pass them as arguments to higher order functions, which take functions as arguments.
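
As a quick illustration, the built-in functions `sorted`, `map` and `filter` all accept a function argument, so a lambda can be passed directly. The sample list is made up for this sketch.

```python
names = ["spark", "Python", "pandas"]

# sorted is a higher order function: key is applied to each element before comparing.
by_name = sorted(names, key=lambda s: s.lower())
print(by_name)      # ['pandas', 'Python', 'spark']

# map and filter return lazy iterators, so wrap them in list to see the values.
lengths = list(map(lambda s: len(s), names))
print(lengths)      # [5, 6, 6]

lowercase = list(filter(lambda s: s.islower(), names))
print(lowercase)    # ['spark', 'pandas']
```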

### Tasks

Let us perform a few tasks related to lambda functions.
* Create a generic function `mySum` which is supposed to perform arithmetic using integers within a range.
  * It takes 3 arguments - `lb`, `ub` and `f`.
  * Function `f` should be invoked inside the function on each element within the range.

In [20]:
def mySum(lb, ub, f):
    total = 0
    for e in range(lb, ub + 1):
        total += f(e)
    return total

* Sum of integers between lower bound and upper bound using `mySum`.

In [21]:
mySum(2, 4, lambda i: i)

9

* Sum of squares of integers between lower bound and upper bound using `mySum`.

In [22]:
mySum(2, 4, lambda i: i * i)

29

* Sum of the even numbers between lower bound and upper bound using `mySum`.

In [23]:
mySum(2, 4, lambda i: i if i%2 == 0 else 0)

6

## Overview of Collections

Let's quickly recap Collections and Tuples in Python. We will primarily talk about collections that come as part of the Python standard library, such as `list`, `set`, `dict` and `tuple`.
* Group of elements with length and index - `list`
* Group of unique elements - `set`
* Group of key value pairs - `dict`
* While `list` and `set` typically contain a group of homogeneous elements, `dict` and `tuple` typically contain a group of heterogeneous elements.
* A `list` or `set` is analogous to a database table, while a `dict` or `tuple` is analogous to an individual record.
* Typically we create a list of tuples or dicts, or a set of tuples (dicts are not hashable, so they cannot be elements of a set). Also, a dict can be considered a list of key value pairs.
* We typically use Map Reduce APIs to process the data in collections. There are also some pre-defined functions such as `len`, `sum`, `min`, `max`, etc. for aggregating data in collections.
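
The points above can be sketched with a few made-up records. The field names and values below are hypothetical and serve only as an illustration.

```python
from functools import reduce

# Each order is a tuple (a record); the list of orders is like a table.
orders = [
    (1, "2013-07-25", "CLOSED"),
    (2, "2013-07-25", "PENDING"),
    (3, "2013-07-26", "CLOSED"),
]

# A set keeps only the unique values.
statuses = set(map(lambda o: o[2], orders))

# A dict pairs attribute names with the values of one record.
first = dict(zip(("order_id", "order_date", "order_status"), orders[0]))

# filter and map select rows and extract a field.
closed_ids = list(map(lambda o: o[0], filter(lambda o: o[2] == "CLOSED", orders)))

print(len(orders))                                         # 3
print(sorted(statuses))                                    # ['CLOSED', 'PENDING']
print(first["order_status"])                               # CLOSED
print(sum(closed_ids), min(closed_ids), max(closed_ids))   # 4 1 3
print(reduce(lambda a, b: a + b, closed_ids))              # 4
```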

### Tasks

Let us perform a few tasks to quickly recap details about Collections and Tuples in Python. We will also quickly recap Map Reduce APIs.

* Create a collection of orders by reading data from a file.

In [24]:
%%sh

ls -ltr /data/retail_db/orders/part-00000

-rw-r--r-- 1 root root 2999944 Jan 21  2021 /data/retail_db/orders/part-00000


In [25]:
orders_path = "/data/retail_db/orders/part-00000"
orders = open(orders_path). \
    read(). \
    splitlines()

* Get all unique order statuses. Make sure data is sorted in alphabetical order.

In [26]:
sorted(set(map(lambda o: o.split(",")[3], orders)))

* Get count of all unique dates.

In [27]:
len(set(map(lambda o: o.split(",")[1], orders)))

* Sort the data in orders in ascending order by order_customer_id and then order_date.

In [28]:
sorted(orders, key=lambda k: (int(k.split(",")[2]), k.split(",")[1]))

['22945,2013-12-13 00:00:00.0,1,COMPLETE',
 '57963,2013-08-02 00:00:00.0,2,ON_HOLD',
 '15192,2013-10-29 00:00:00.0,2,PENDING_PAYMENT',
 '67863,2013-11-30 00:00:00.0,2,COMPLETE',
 '33865,2014-02-18 00:00:00.0,2,COMPLETE',
 '22646,2013-12-11 00:00:00.0,3,COMPLETE',
 '61453,2013-12-14 00:00:00.0,3,COMPLETE',
 '23662,2013-12-19 00:00:00.0,3,COMPLETE',
 '35158,2014-02-26 00:00:00.0,3,COMPLETE',
 '46399,2014-05-09 00:00:00.0,3,PROCESSING',
 '56178,2014-07-15 00:00:00.0,3,PENDING',
 '57617,2014-07-24 00:00:00.0,3,COMPLETE',
 '9023,2013-09-19 00:00:00.0,4,COMPLETE',
 '9704,2013-09-24 00:00:00.0,4,COMPLETE',
 '17253,2013-11-09 00:00:00.0,4,PENDING_PAYMENT',
 '37878,2014-03-15 00:00:00.0,4,COMPLETE',
 '49339,2014-05-28 00:00:00.0,4,COMPLETE',
 '51157,2014-06-10 00:00:00.0,4,CLOSED',
 '13705,2013-10-18 00:00:00.0,5,COMPLETE',
 '36472,2014-03-06 00:00:00.0,5,PROCESSING',
 '41333,2014-04-05 00:00:00.0,5,COMPLETE',
 '45832,2014-05-05 00:00:00.0,5,PENDING_PAYMENT',
 '7485,2013-09-09 00:00:00.0,6,PROC

* Create a collection of order_items by reading data from a file.

In [29]:
order_items_path = "/data/retail_db/order_items/part-00000"
order_items = open(order_items_path). \
    read(). \
    splitlines()

* Get revenue for a given order_item_order_id.

In [30]:
def get_order_revenue(order_items, order_id):
    # Keep only the items that belong to the given order id.
    order_items_filtered = filter(lambda oi:
                                  int(oi.split(",")[1]) == order_id,
                                  order_items
                                 )
    # Extract the subtotal of each remaining item.
    order_items_map = map(lambda oi:
                          float(oi.split(",")[4]),
                          order_items_filtered
                         )
    return round(sum(order_items_map), 2)

In [31]:
get_order_revenue(order_items, 2)

579.98

## Overview of Pandas Data Frames

Collections typically hold groups of objects, tuples or simple strings, and we need to parse them before we can process the data further. This process is tedious at times.
* With Data Frames we can define the structure.
* A Data Frame is nothing but a group of rows where each row has multiple attributes with names.
* A Data Frame is similar to a Database Table or a Spreadsheet with a Header.
* Pandas provides rich and simple functions to convert data in files into Data Frames and process them.
* Data can be read from files into a Data Frame using functions such as `read_csv`.
* We can perform all standard operations on Data Frames.
  * Projection or Selection     
  * Filtering     
  * Aggregations     
  * Joins     
  * Sorting
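
A minimal sketch of these operations follows. It uses three `order_items` rows read from an in-memory buffer instead of a file. The `products` lookup table and its product names are hypothetical and are there only to show a join.

```python
import io
import pandas as pd

# Three sample rows in the order_items layout, held in memory instead of a file.
data = io.StringIO(
    "2,2,1073,1,199.99,199.99\n"
    "3,2,502,5,250.0,50.0\n"
    "4,2,403,1,129.99,129.99\n"
)
columns = ["order_item_id", "order_item_order_id", "order_item_product_id",
           "order_item_quantity", "order_item_subtotal", "order_item_product_price"]
items = pd.read_csv(data, names=columns)

# Hypothetical lookup table, used only to illustrate a join.
products = pd.DataFrame({
    "order_item_product_id": [1073, 502, 403],
    "product_name": ["Product A", "Product B", "Product C"],
})

projected = items[["order_item_id", "order_item_subtotal"]]           # projection
filtered = items[items["order_item_quantity"] > 1]                    # filtering
revenue = round(float(items["order_item_subtotal"].sum()), 2)         # aggregation
joined = items.merge(products, on="order_item_product_id")            # join
ordered = items.sort_values("order_item_subtotal", ascending=False)   # sorting

print(revenue)                              # 579.98
print(filtered["order_item_id"].tolist())   # [3]
print(ordered["order_item_id"].tolist())    # [3, 2, 4]
print(sorted(joined["product_name"]))       # ['Product A', 'Product B', 'Product C']
```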

### Tasks

Let us perform a few tasks to recap the usage of Pandas Data Frames.

* Read order items data from the location on your system. On mine it is `/data/retail_db/order_items/part-00000`. Use the information below to define the schema.
* It has 6 fields with the below names in the same order as specified below.
  * order_item_id
  * order_item_order_id
  * order_item_product_id
  * order_item_quantity
  * order_item_subtotal
  * order_item_product_price

In [32]:
import pandas as pd
order_items_path = "/data/retail_db/order_items/part-00000"
order_items = pd. \
    read_csv(order_items_path,
             names=["order_item_id", "order_item_order_id",
                    "order_item_product_id", "order_item_quantity",
                    "order_item_subtotal", "order_item_product_price"
                   ]
            )

* Project order_item_id and order_item_subtotal.

In [33]:
order_items[["order_item_id", "order_item_subtotal"]]

| | order_item_id | order_item_subtotal |
|---|---|---|
| 0 | 1 | 299.98 |
| 1 | 2 | 199.99 |
| 2 | 3 | 250.00 |
| 3 | 4 | 129.99 |
| 4 | 5 | 49.98 |
| ... | ... | ... |
| 172193 | 172194 | 129.99 |
| 172194 | 172195 | 59.99 |
| 172195 | 172196 | 50.00 |
| 172196 | 172197 | 1999.99 |


* Filter for order_item_order_id 2

In [34]:
order_items.query("order_item_order_id == 2")

| | order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
|---|---|---|---|---|---|---|
| 1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
| 2 | 3 | 2 | 502 | 5 | 250.0 | 50.0 |
| 3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |


* Compute revenue for order_item_order_id 2

In [35]:
order_items. \
    query("order_item_order_id == 2")["order_item_subtotal"]. \
    sum()

579.98

* Get number of items and revenue for each order id. Give alias to the order revenue as **revenue**.

In [36]:
order_items. \
    groupby("order_item_order_id")["order_item_subtotal"]. \
    sum()

order_item_order_id
1         299.98
2         579.98
4         699.85
5        1129.86
7         579.92
          ...   
68879    1259.97
68880     999.77
68881     129.99
68882     109.99
68883    2149.99
Name: order_item_subtotal, Length: 57431, dtype: float64

In [37]:
order_items. \
    groupby("order_item_order_id")["order_item_subtotal"]. \
    agg(['sum', 'count']). \
    rename(columns={'sum': 'revenue'})

| order_item_order_id | revenue | count |
|---|---|---|
| 1 | 299.98 | 1 |
| 2 | 579.98 | 3 |
| 4 | 699.85 | 4 |
| 5 | 1129.86 | 5 |
| 7 | 579.92 | 3 |
| ... | ... | ... |
| 68879 | 1259.97 | 3 |
| 68880 | 999.77 | 5 |
| 68881 | 129.99 | 1 |
| 68882 | 109.99 | 2 |


## Limitations of Pandas

We can use Pandas for data processing. It provides rich APIs to read data from different sources, process the data and then write it to different targets.
* Pandas works well for light weight data processing.
* Pandas is typically single threaded, which means only one thread takes care of processing the data.
* As data volume grows, the processing time might grow significantly and we might also run into resource contention.
* It is not trivial to distribute processing using Pandas APIs. We end up struggling with multi threading rather than focusing on business logic.
* There are Distributed Computing Frameworks such as Hadoop Map Reduce, Spark, etc. to take care of data processing at scale on multi node Hadoop or Spark Clusters.
* Both Hadoop Map Reduce and Spark come with Distributed Computing Frameworks as well as APIs.

**Pandas is typically used for light weight Data Processing and Spark is used for Data Processing at Scale.**

## Development Life Cycle

Let us understand the development life cycle. We typically use IDEs such as PyCharm to develop Python based applications.

* Create Project - retail
* Choose the interpreter 3.x
* Make sure packages such as pandas are installed.
* Create config.py script for externalizing run time parameters such as input path, output path etc.
* Create app folder for the source code.
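
One possible shape for `config.py` is sketched below. The dictionary layout, the `dev` environment name, the output path and the `get_config` helper are all assumptions made for illustration; they are not taken from the linked project.

```python
# Hypothetical contents of config.py: run time parameters keyed by environment name.
CONFIG = {
    "dev": {
        "input_path": "/data/retail_db/order_items/part-00000",
        "output_path": "/tmp/retail/order_revenue",  # made-up target location
    },
}

def get_config(env):
    """Return the run time parameters for the given environment name."""
    return CONFIG[env]

# Application code looks up the paths instead of hard coding them.
params = get_config("dev")
print(params["input_path"])
print(params["output_path"])
```

Keeping the paths in one place means the application code in the app folder does not change when the input or output location changes.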

### Tasks

Let us develop a simple application to understand the end to end development life cycle.

* Read the data from order_items
* Get revenue for each order id
* Save the output, which contains order id and revenue, to a file.

Click [here](https://github.com/dgadiraju/python-retail/tree/v1.0) for the complete code for the above tasks.

## Exercises

Let us perform a few exercises to understand how to process the data. We will use LinkedIn data to perform some basic data processing using Python.

* Get LinkedIn archive.
  * Go to https://linkedin.com
  * Me on top -> Settings & Privacy
  * Then go to "How LinkedIn uses your data" -> Getting a copy of your data
  * Register and download. You will get a link as part of the email.
* The data contains multiple CSV files. We will limit the analysis to **Contacts.csv** and **Connections.csv**.
* Get the number of **contacts** without email ids.
* Get the number of **contacts** from each source.
* Get the number of **connections** with each title.
* Get the number of **connections** from each company.
* Get the number of **contacts** for each month in the year 2018.
* Use Postgres or MySQL as the database (you can set it up on your laptop) and write the **connections** data to the database.