# Quick Recap of Scala

Let us quickly recap some of the core programming concepts of Scala before we get into Spark.

## Data Engineering Life Cycle

Let us first understand the Data Engineering Life Cycle. We typically read the data, process it by applying business rules and write the data back to different targets.
* Read the data from different sources.
  * Files
  * Databases
  * Mainframes
  * APIs
* Processing the data
  * Row Level Transformations
  * Aggregations
  * Sorting
  * Ranking
  * Joining multiple data sets
* Write data to different targets.
  * Files
  * Databases
  * Mainframes
  * APIs

## Scala REPL or Jupyter Notebook

We can use the Scala REPL (CLI) or a Jupyter Notebook to explore APIs.

* We can launch the Scala REPL using the `scala` command.
* We can launch the Jupyter Notebook using the `jupyter notebook` command. A Scala kernel has to be installed to run Scala code in the notebook.
* A web service will be started on port number 8888 by default.
* We can go to the browser and connect to the web server using its IP address and port number.
* We should be able to explore code interactively.
* Depending on the kernel, we can issue magic commands such as %%sh to run shell commands, %%md to document using markdown, etc.

### Tasks

Let us perform these tasks to recollect how to use the Scala REPL or Jupyter Notebook.
* Create variables i and j assigning 10 and 20.5 respectively.

In [None]:
val i = 10
val j = 20.5

* Add the values and assign the result to res.

In [None]:
val res = i + j

In [None]:
println(res)

* Get the type of i, j and res.
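
One way to inspect the types, as a minimal sketch: the REPL or notebook already echoes the inferred type when a value is defined, and `getClass` reports the runtime class. The values are defined again here so that the cell runs on its own.

```scala
val i = 10      // inferred as Int
val j = 20.5    // inferred as Double
val res = i + j // Int + Double is widened to Double

// getClass returns the runtime class of each value
println(i.getClass)
println(j.getClass)
println(res.getClass)
```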

## Basic Programming Constructs

Let us recollect some of the basic programming constructs of Scala.
* Comparison Operations (==, !=, <, >, <=, >=, etc)
  * All the comparison operators return `true` or `false` (a Boolean value).
* Conditionals (if)
  * We typically use comparison operators as part of conditionals.
* Loops (for)
  * We can iterate through a collection using `for (i <- l)` where l is a standard collection such as List or Set.
  * Scala provides `to` and `until` to build a range of integers. `to` includes the upper bound value, whereas `until` excludes it.
* In Scala 2, scope is defined by curly braces. A body with a single expression does not need them.
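
As a small sketch of these constructs in Scala (the value names are only illustrative), the cell below builds ranges with `to` and `until`, uses a comparison inside an `if` expression and iterates with `for`.

```scala
// `to` includes the upper bound, `until` excludes it
val inclusive = (1 to 5).toList
val exclusive = (1 until 5).toList

// Comparison operators return Boolean values
val isGreater = 10 > 5

// if is an expression in Scala, so its result can be assigned
val label = if (isGreater) "greater" else "not greater"

// Iterate through a collection; the braces define the scope of the loop body
for (n <- inclusive) {
    if (n % 2 == 0) println(s"$n is even")
}
```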

### Tasks
 
Let us perform a few tasks to quickly recap basic programming constructs of Scala.
* Get all the odd numbers between 1 and 15.

In [None]:
(1 to 15 by 2)

* Print all those numbers which are divisible by 3 from the above list.

In [None]:
for (i <- (1 to 15 by 2))
    if(i%3 == 0) println(i)

## Developing Functions

Let us understand how to develop functions using Scala as the programming language.
* A function definition starts with `def` followed by the function name.
* Parameters can be of different types.
    * Required
    * Named (keyword), optionally with default values
    * Variable Number
    * Functions
* Functions which take another function as an argument are called higher order functions.
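
The parameter kinds listed above could look as follows in Scala. This is only a sketch; the function names `add`, `applyCommission`, `sumAll` and `applyTwice` are hypothetical and chosen for illustration.

```scala
// Required parameters: both arguments must be passed
def add(a: Int, b: Int): Int = a + b

// Default value: `rate` can be omitted by the caller
def applyCommission(amount: Double, rate: Double = 0.25): Double = amount * rate

// Variable number of arguments: `nums` is seen as a sequence inside the body
def sumAll(nums: Int*): Int = nums.sum

// Function as a parameter, which makes this a higher order function
def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

println(add(1, 2))
println(applyCommission(100.0))
// Named arguments can be passed in any order
println(applyCommission(rate = 0.5, amount = 100.0))
println(sumAll(1, 2, 3, 4))
println(applyTwice(n => n * 2, 5))
```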

### Tasks

Let us perform a few tasks to understand how to develop functions in Scala.

* Sum of integers between lower bound and upper bound using formula.



In [None]:
def sumOfN(n: Int) =
    (n * (n + 1)) / 2

In [None]:
sumOfN(10)

In [None]:
def sumOfIntegers(lb: Int, ub: Int) =
    sumOfN(ub) - sumOfN(lb - 1)

In [None]:
sumOfIntegers(5, 10)

* Sum of integers between lower bound and upper bound using loops.

In [None]:
def sumOfIntegers(lb: Int, ub: Int) = {
    var total = 0
    for (e <- (lb to ub))
        total += e
    total
}

In [None]:
sumOfIntegers(1, 10)

* Sum of squares of integers between lower bound and upper bound using loops.

In [None]:
def sumOfSquares(lb: Int, ub: Int) = {
    var total = 0
    for (e <- (lb to ub))
        total += e * e
    total
}

In [None]:
sumOfSquares(2, 4)

* Sum of the even numbers between lower bound and upper bound using loops.

In [None]:
def sumOfEvens(lb: Int, ub: Int) = {
    var total = 0
    for (e <- (lb to ub))
        if (e % 2 == 0) total += e
    total
}

In [None]:
sumOfEvens(2, 4)

## Lambda Functions

Let us recap details related to lambda functions.

* We can develop functions without names. They are called Lambda Functions and are also known as Anonymous Functions.
* We typically pass them as arguments to higher order functions.
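
As a brief sketch of the syntax (the names `square`, `squares` and `incremented` are illustrative), an anonymous function can be assigned to a value or passed inline to a higher order function such as `map`.

```scala
// An anonymous function assigned to a value
val square = (n: Int) => n * n

// Passed inline to map, a higher order function on collections
val squares = List(1, 2, 3).map(n => n * n)

// Placeholder syntax is a shorter form of n => n + 1
val incremented = List(1, 2, 3).map(_ + 1)

println(square(4))
println(squares)
println(incremented)
```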
    

### Tasks

Let us perform a few tasks related to lambda functions.

* Create a generic function mySum which applies a function to each integer within a range and adds up the results.
    * It takes 3 arguments - lb, ub and f.
    * Function f should be invoked inside the function on each element within the range.

    

In [None]:
def mySum(lb: Int, ub: Int, f: Int => Int) = {
    var total = 0
    for (e <- (lb to ub))
        total += f(e)
    total    
}

* Sum of integers between lower bound and upper bound using mySum.

In [None]:
mySum(2, 4, i => i)

* Sum of squares of integers between lower bound and upper bound using mySum.

In [None]:
mySum(2, 4, i => i * i)

* Sum of the even numbers between lower bound and upper bound using mySum.

In [None]:
mySum(2, 4, i => if(i%2 == 0) i else 0)

## Overview of Collections and Tuples

Let's quickly recap Collections and Tuples in Scala. We will primarily talk about the collections and tuples that come as part of the Scala standard library such as `List`, `Set`, `Map` and tuples.

* Group of elements with length and index - `List`
* Group of unique elements - `Set`
* Group of key value pairs - `Map`
* While List, Set and Map contain a group of homogeneous elements, a tuple contains a group of heterogeneous elements.
* We can consider List, Set and Map as a table in a database and a tuple as a row or record in a given table.
* Typically we create a List of tuples or a Set of tuples. A Map is nothing but a collection of tuples with 2 elements where the key is unique.
* We typically use Map Reduce APIs such as `map`, `filter` and `reduce` to process the data in collections. There are also some pre-defined functions such as `size`, `sum`, `min`, `max` etc for aggregating data in collections.
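
The sketch below shows each of these types in Scala. All values are made-up samples that only loosely follow the retail_db layout used in the tasks.

```scala
// List: ordered, indexed and allows duplicates
val statuses = List("COMPLETE", "CLOSED", "COMPLETE")

// Set: keeps only the unique elements
val uniqueStatuses = statuses.toSet

// Tuple: one record whose fields have different types
val order = (1, "2013-07-25", 11599, "CLOSED")

// Map: key value pairs where each key is unique
val itemCountByOrder = Map(1 -> 1, 2 -> 3)

println(statuses(0))         // access by index
println(uniqueStatuses.size)
println(order._4)            // tuple fields are accessed by position, starting at 1
println(itemCountByOrder(2)) // look up a value by key

// Map Reduce style APIs and pre-defined aggregate functions
val amounts = List(10, 20, 30)
println(amounts.map(a => a * 2).filter(a => a > 20))
println(amounts.reduce((x, y) => x + y))
println(amounts.size + " " + amounts.sum + " " + amounts.min + " " + amounts.max)
```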

### Tasks

Let us perform a few tasks to quickly recap details about Collections and Tuples in Scala. We will also quickly recap Map Reduce APIs.

* Create a collection of orders by reading data from a file.

In [None]:
import sys.process._

In [None]:
"ls -ltr /data/retail_db/orders/part-00000".!

In [None]:
val ordersPath = "/data/retail_db/orders/part-00000"

In [None]:
import scala.io.Source

In [None]:
val orders = Source.fromFile(ordersPath).
    getLines

* Get all unique order statuses. Make sure data is sorted in alphabetical order.

In [None]:
val ordersPath = "/data/retail_db/orders/part-00000"

import scala.io.Source
val orders = Source.fromFile(ordersPath).
    getLines

orders.
    map(order => order.split(",")(3)).
    toSet.
    toList.
    sorted.
    foreach(println)

* Get count of all unique dates.

In [None]:
val ordersPath = "/data/retail_db/orders/part-00000"

import scala.io.Source
val orders = Source.fromFile(ordersPath).
    getLines

orders.
    map(order => order.split(",")(1)).
    toSet.
    size

* Sort the data in orders in ascending order by order_customer_id and then order_date.

In [None]:
val ordersPath = "/data/retail_db/orders/part-00000"

import scala.io.Source
val orders = Source.fromFile(ordersPath).
    getLines

orders.
    toList.
    sortBy(k => {
        val a = k.split(",")
        (a(2).toInt, a(1))
    }).
    take(20).
    foreach(println)

* Create a collection of order_items by reading data from a file.

In [None]:
val orderItemsPath = "/data/retail_db/order_items/part-00000"

import scala.io.Source
val orderItems = Source.fromFile(orderItemsPath).
    getLines.
    toList
orderItems.take(10).foreach(println)

* Get revenue for a given order_item_order_id.

In [None]:
def getOrderRevenue(orderItems: List[String], orderId: Int) = {   
    val orderItemsFiltered = orderItems.
        filter(orderItem => orderItem.split(",")(1).toInt == orderId)
    val orderItemsMap = orderItemsFiltered.
        map(orderItem => orderItem.split(",")(4).toFloat)
    orderItemsMap.sum
}

In [None]:
val orderItemsPath = "/data/retail_db/order_items/part-00000"

import scala.io.Source
val orderItems = Source.fromFile(orderItemsPath).
    getLines.
    toList

In [None]:
print(getOrderRevenue(orderItems, 2))

## Development Life Cycle

Let us understand the development life cycle. We typically use IDEs such as PyCharm to develop Python based applications.

* Create Project - retail
* Choose the interpreter 3.x
* Make sure packages such as pandas are installed.
* Create config.py script for externalizing run time parameters such as input path, output path etc.
* Create app folder for the source code.

### Tasks

Let us develop a simple application to understand end to end development life cycle.

* Read the data from order_items
* Get revenue for each order id
* Save the output which contains order id and revenue to a file.
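
The three steps can be sketched in Scala as follows. The rows in `orderItems` are hypothetical samples standing in for the order_items file, and the output goes to a temporary file because the target path is an assumption left to the application's configuration.

```scala
import java.nio.file.Files

// Hypothetical sample rows in the order_items layout:
// index 1 holds the order id and index 4 holds the subtotal
val orderItems = List(
    "1,1,957,1,300.0,300.0",
    "2,2,1073,1,200.0,200.0",
    "3,2,502,5,250.0,50.0",
    "4,2,403,1,130.0,130.0"
)

// Group the rows by order id and add up the subtotals of each group
val revenuePerOrder = orderItems.
    map(item => {
        val a = item.split(",")
        (a(1).toInt, a(4).toDouble)
    }).
    groupBy(t => t._1).
    map { case (orderId, items) => (orderId, items.map(t => t._2).sum) }

// Format as comma separated lines, sorted by order id
val output = revenuePerOrder.
    toList.
    sortBy(t => t._1).
    map { case (orderId, revenue) => s"$orderId,$revenue" }

// Save to a temporary file (assumed target, replace with the configured output path)
val outputPath = Files.createTempFile("order_revenue", ".csv")
Files.write(outputPath, output.mkString("\n").getBytes("UTF-8"))
output.foreach(println)
```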

Click [here](https://github.com/dgadiraju/python-retail/tree/v1.0) for the complete code for the above tasks.

## Exercises

Let us perform a few exercises to understand how to process the data. We will use LinkedIn data to perform some basic data processing using Scala.

* Get LinkedIn archive.
  * Go to https://linkedin.com
  * Me on top -> Settings & Privacy
  * Then go to "How LinkedIn uses your data" -> Getting a copy of your data
  * Register and download. You will get a link as part of the email.
* The data contains multiple CSV files. We will limit the analysis to **Contacts.csv** and **Connections.csv**.
* Get the number of **contacts** without email ids.
* Get the number of **contacts** from each source.
* Get the number of **connections** with each title.
* Get the number of **connections** from each company.
* Get the number of **contacts** for each month in the year 2018.
* Use Postgres or MySQL as the database (you can set it up on your laptop) and write the connections data to the database.