ERP stands for **Enterprise Resource Planning**.

It is business process management software that manages and integrates a company's financials, supply chain, operations, commerce, reporting, manufacturing, and human resources activities.

One important ERP entity is the **Account Receivable (AR)**: the money a company's customers owe for goods or services they have received.

An **Account Receivable (AR)** document can be one of:
 * Invoice
 * Credit Note
 * Debit Note
 * Cancellation
 * Miscellaneous

Each AR is made up of several parts: the **header** (general information about the customer or supplier that defines the invoice), the **list of items**, the **list of payments**, details about the **customers**, details about the **shipping**, and so on.

# Parameters

* N: number of invoices
* M: number of payments
* K: number of customers

In [None]:
sc = None
N=10000
M=12500
K=150

# AR Header

The header of an AR document contains general information such as:
* Customer ID
* Value
* Due Date
* Posting Date
* Document Number - must be unique per fiscal year
* Fiscal Year
* Document Type

Assumptions:
* "Invoice" is the only document type

In [None]:
from random import randint
from datetime import datetime,timedelta

def headerGenerator(k=5):
  # Posting date: a random day within 200 days of 2022-01-01 (always in 2022)
  postingDate = datetime(2022,1,1)+timedelta(randint(0,200))
  return {
          # randint is inclusive on both ends, so this yields exactly k customers (1..k)
          "customerId":"Customer_{customerId}".format(customerId=str(randint(1,k)).zfill(3)),
          "value":randint(50,10000),
          "documentCurrency":"EUR", 
          "postingDate":postingDate.strftime("%Y-%m-%d"),
          "dueDate":(postingDate+timedelta(randint(0,60))).strftime("%Y-%m-%d"),
          "fiscalYear":postingDate.strftime("%Y"),
          "documentType":"Invoice"
         }

def headerList(k=5,n=1000):
  # Throwaway loop variable, so the customer count k is not shadowed by the loop index
  rawHeaderList = [headerGenerator(k) for _ in range(n)]
  # Number the documents in posting-date order
  rawHeaderList.sort(key=lambda row: row.get("postingDate"))
  for pos,val in enumerate(rawHeaderList):
    val["documentNumber"]="2022-{docNum}".format(docNum=str(pos).zfill(5))
  return rawHeaderList
  
myARList = headerList(K,N)
myARList

# AR Payments

List of lines that represent a payment made by a customer on a given AR.
* Document Number
* Payment Date
* Value Paid

In [None]:
def paymentGenerator(InvoiceList):
  # Pick a random invoice by position instead of scanning the whole list for its number
  invoice = InvoiceList[randint(0,len(InvoiceList)-1)]
  postingDate = datetime.strptime(invoice.get("postingDate"),"%Y-%m-%d")
  return { 
          "documentNumber":invoice.get("documentNumber"),
          "paymentDate":(postingDate+timedelta(randint(15,90))).strftime("%Y-%m-%d"),
          "valuePaid":randint(1,invoice.get("value")),
          "documentCurrency":invoice.get("documentCurrency")
         }

def paymentList(InvoiceList,m=250):
  return [paymentGenerator(InvoiceList) for k in range(m)]
   
myPaymentList = paymentList(myARList,M)  
myPaymentList

# Part 00
* Define the type of each table (Log or Registry): which are the keys of these tables?

Both Header and Payments are logs, because their rows can be neither updated nor deleted.
Keys:
* Header: documentNumber and fiscalYear
* Payments: documentNumber and paymentDate, under two assumptions: i) a given AR can receive multiple payments, ii) an invoice cannot receive more than one payment per day
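
Whether a candidate key really identifies a row can be checked by counting key tuples. The following is a minimal plain-Python sketch, not part of the original notebook: the helper name `duplicated_keys` and the sample rows are hypothetical, and the rows only mimic the shape of the generated headers and payments.

```python
from collections import Counter

def duplicated_keys(rows, key_fields):
    """Return the sorted key tuples that occur more than once in `rows`."""
    counts = Counter(tuple(row.get(field) for field in key_fields) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)

# Hypothetical sample rows, shaped like the generated headers and payments.
sample_headers = [
    {"documentNumber": "2022-00000", "fiscalYear": "2022"},
    {"documentNumber": "2022-00001", "fiscalYear": "2022"},
]
sample_payments = [
    {"documentNumber": "2022-00000", "paymentDate": "2022-02-01"},
    {"documentNumber": "2022-00000", "paymentDate": "2022-02-01"},  # breaks assumption ii
    {"documentNumber": "2022-00001", "paymentDate": "2022-02-03"},
]

header_dups = duplicated_keys(sample_headers, ["documentNumber", "fiscalYear"])
payment_dups = duplicated_keys(sample_payments, ["documentNumber", "paymentDate"])
print(header_dups)
print(payment_dups)
```

An empty result means the chosen fields behave as a key on that data. Because the payments in this notebook are generated at random, two payments for the same invoice can land on the same day, so assumption ii is worth checking rather than taking for granted.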

# Part 01
* Create the two RDDs and check that everything is OK
* Create a single RDD that combines information from both header and payments

In [None]:
import pyspark
sc = pyspark.SparkContext("local[*]")

In [None]:
headerRDD = sc.parallelize(myARList)
paymentsRDD = sc.parallelize(myPaymentList)
headerRDD.count()==N, paymentsRDD.count()==M

In [None]:
# Per customer: number of invoices with value >= 9000
x = headerRDD.map(lambda x: (x.get("customerId"), x)) \
             .filter(lambda x: x[1].get("value") >= 9000) \
             .map(lambda x: (x[1].get("customerId"), 1)) \
             .reduceByKey(lambda x, y: x + y)
# Per customer: number of invoices with value >= 8000
y = headerRDD.map(lambda x: (x.get("customerId"), x)) \
             .filter(lambda x: x[1].get("value") >= 8000) \
             .map(lambda x: (x[1].get("customerId"), 1)) \
             .reduceByKey(lambda x, y: x + y)

# Keep customers that also have at least one invoice in the range [8000, 9000)
xy = x.join(y) \
      .filter(lambda x: x[1][0] < x[1][1])

xy.count()

In [None]:
semiJoinHeaderRDD = headerRDD.map(lambda row: 
                        ((row.get("documentNumber"), row.get("fiscalYear")), row))
semiJoinHeaderRDD.first()

In [None]:
semiJoinPaymentsRDD = paymentsRDD.map(lambda row: 
                          ((row.get("documentNumber"), row.get("paymentDate").split("-")[0]), row))
semiJoinPaymentsRDD.first()

In [None]:
joinRDD = semiJoinPaymentsRDD.join(semiJoinHeaderRDD)
joinRDD.first()

In [None]:
joinedRdd = semiJoinPaymentsRDD.join(semiJoinHeaderRDD)
joinFirstResult = joinedRdd.first()
print("Keys:{keys}".format(keys=joinFirstResult[0]))
print("Left part of the join:{valueLeft}".format(valueLeft=joinFirstResult[1][0]))
print("Right part of the join:{valueRight}".format(valueRight=joinFirstResult[1][1]))

In [None]:
def formatRow(row):
  basicRow = {"header":row[1][1]} #header
  basicRow["keyTuple"] = row[0]
  basicRow["paymentList"] = [row[1][0]]
  return basicRow
niceJoineRDD = joinedRdd.map(lambda row: formatRow(row))
niceJoineRDD.first()

In [None]:
niceJoineRDD.map(lambda x: (x.get("keyTuple"),1)).reduceByKey(lambda left,right: left+right).first()

In [None]:
result = niceJoineRDD.filter(lambda row: row.get("keyTuple")==('2022-09826', '2022')).collect()
for pos,val in enumerate(result):
  print("Element number {pos}".format(pos=pos))
  print(val)

In [None]:
def mergerFunction(leftDict,rightDict):
  leftDict["paymentList"] +=rightDict["paymentList"]
  return leftDict
  
reducedNiceJoinedRDD =niceJoineRDD.map(lambda row: (row.get("keyTuple"),row)) \
                                  .reduceByKey(lambda left,right: mergerFunction(left,right)) \
                                  .map(lambda x: x[1])
reducedNiceJoinedRDD.collect()

# Alternative and scalable approach

In [None]:
def quantitativeRepr(row):
    return {"amount":row.get("valuePaid"),"numberOfPayments":1,"lastDate":row.get("paymentDate")}

def combineFun(firstPayment,secondPayment):
    firstPayment["amount"] += secondPayment.get("amount")
    firstPayment["numberOfPayments"] += secondPayment.get("numberOfPayments")
    # ISO dates (YYYY-MM-DD) compare correctly as plain strings
    firstPayment["lastDate"] = max(firstPayment.get("lastDate"), secondPayment.get("lastDate"))
    return firstPayment
  
# Aggregate the payments per document before joining, so each invoice carries one small summary
lightPaymentRDD = paymentsRDD.map(lambda x: (x.get("documentNumber"),quantitativeRepr(x))) \
                             .reduceByKey(combineFun)
# Optional reshaping step: .map(lambda x: {"keyTuple":x[0],"paymentStats":x[1]})
lightPaymentRDD.first()

In [None]:
newSemiJoinHeaderRDD = semiJoinHeaderRDD.map(lambda x: (x[0][0], x[1])).join(lightPaymentRDD)

In [None]:
def mergeDict(t):
    t[0].update(t[1])
    return t[0]

newSemiJoinHeaderRDD.map(lambda x: {"key":x[0], "value": mergeDict(x[1])}).first()


# Part 02
* How many invoices are open (i.e., not completely paid)?
* How many invoices are closed (i.e., completely paid)?
* How many invoices are overdue (i.e., not completely paid and with a due date in the past)?
* How many invoices have been paid late (i.e., completely paid, but with the last payment after the due date)?
* Add to the RDD the "closingDate": the date of the payment that closes the invoice.
* Add to the RDD the boolean "inTime": True if closingDate < dueDate, else False

In [None]:
RDD = newSemiJoinHeaderRDD

In [None]:
is_open          = lambda x: x[1][0].get("value") > x[1][1].get("amount")
is_closed        = lambda x: x[1][0].get("value") <= x[1][1].get("amount")
is_overdues      = lambda x: is_open(x) and datetime.strptime(x[1][0].get("dueDate"), "%Y-%m-%d") < datetime.now()
paid_not_in_time = lambda x: is_closed(x) and x[1][1].get("lastDate") > x[1][0].get("dueDate")

rdd_is_open          = RDD.filter(is_open)
rdd_is_closed        = RDD.filter(is_closed)
rdd_is_overdues      = RDD.filter(is_overdues)
rdd_paid_not_in_time = RDD.filter(paid_not_in_time)

print(f'''
    Open             {rdd_is_open.count()}
    Closed           {rdd_is_closed.count()}
    Overdues         {rdd_is_overdues.count()}
    Paid not in time {rdd_paid_not_in_time.count()}    
''')

In [None]:
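def add_closing_date(x):
    # lastDate (the latest payment) is used as the closing date of a fully paid invoice
    x[1][0]["closingDate"] = x[1][1].get("lastDate") if is_closed(x) else None
    return x

def add_in_time(x):
    # In time only if the invoice is closed and its closing date precedes the due date
    x[1][0]["inTime"] = is_closed(x) and x[1][1].get("lastDate") < x[1][0].get("dueDate")
    return x

RDD = RDD.map(add_closing_date)
RDD = RDD.map(add_in_time)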
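
The task defines closingDate as the date of the payment that closes the invoice, which can be earlier than the last recorded payment when a customer keeps paying after the balance is covered. Below is a minimal plain-Python sketch of that stricter reading, assuming the full list of payments of one invoice is available. The function names `closing_date` and `in_time` and the sample data are hypothetical, not part of the notebook.

```python
def closing_date(invoice_value, payments):
    """Return the date of the payment whose running total first reaches
    `invoice_value`, or None if the invoice is still open.

    `payments` is a list of dicts with "paymentDate" (YYYY-MM-DD) and "valuePaid".
    """
    running_total = 0
    # ISO dates sort correctly as plain strings.
    for payment in sorted(payments, key=lambda p: p["paymentDate"]):
        running_total += payment["valuePaid"]
        if running_total >= invoice_value:
            return payment["paymentDate"]
    return None

def in_time(closing, due_date):
    """True only when the invoice is closed strictly before its due date."""
    return closing is not None and closing < due_date

# Hypothetical payments for one invoice worth 100, deliberately out of order.
sample_payments = [
    {"paymentDate": "2022-03-20", "valuePaid": 30},
    {"paymentDate": "2022-03-01", "valuePaid": 80},
    {"paymentDate": "2022-04-15", "valuePaid": 10},
]
closed_on = closing_date(100, sample_payments)
print(closed_on, in_time(closed_on, "2022-03-31"))
```

In the sample, the second payment in date order already covers the invoice, so the later payment on 2022-04-15 does not move the closing date. The aggregated payment summary keeps only the latest date, so this variant needs the individual payments per invoice.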

# Part 03 - Debit Note
* How many invoices have been paid for more than their value?
* Add to the Header RDD, for each of them, a Debit Note with the value to be charged back and today's date

In [None]:
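# Strictly greater: an invoice paid exactly in full is not overpaid
paid_more_then_their_value = lambda x: x[1][0].get("value") < x[1][1].get("amount")

rdd_paid_more_then_their_value = RDD.filter(paid_more_then_their_value)
print(rdd_paid_more_then_their_value.count())

def add_debit_note(x):
    # Call the predicate on the row; the bare function object is always truthy
    if not paid_more_then_their_value(x): return x
    x[1][0]["debitNote"] = x[1][1].get("amount") - x[1][0].get("value") 
    # The key name "debitNoteDate" is an assumption; it stores today's date
    x[1][0]["debitNoteDate"] = datetime.now().strftime("%Y-%m-%d")
    return x

RDD = RDD.map(add_debit_note)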


# Part 04 - Payments Frequency
* Add to the payment RDD the computed "expectedPaymentDate". It is based on the two previous payments of the same customer: the last payment date plus the difference between it and the payment right before it.
In the example below, it cannot be computed for the first two payments, while for the third the expected payment date is 2022/10/15 (the date of the last payment) plus 3 days (the difference between it and the payment of 2022/10/12), i.e. 2022/10/18.

| customerId  | paymentDate | expectedPaymentDate | documentNumber | ... |
|-------------|-------------|---------------------|----------------|-----|
| Customer001 | 2022/10/12  | N/A                 | 2022_01001     | ... |
| Customer001 | 2022/10/15  | N/A                 | 2022_01004     | ... |
| Customer001 | 2022/10/16  | 2022/10/18 (15+3)   | 2022_00904     | ... |
| Customer001 | 2022/10/20  | 2022/10/17 (16+1)   | 2022_01004     | ... |
| Customer001 | 2022/10/30  | 2022/10/24 (20+4)   | 2022_01101     | ... |
| Customer001 | ...         | ...                 | ...            | ... |

* Show, for each customer, the average error of this method
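
The rule in the table can be sketched in plain Python for a single customer. This is an illustration under stated assumptions rather than the notebook's solution: the function names are hypothetical, dates use the `YYYY-MM-DD` format of the generated data, and "error" is taken to be the absolute difference in days between the actual and the expected payment date.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"

def expected_payment_dates(payment_dates):
    """Return (actual, expected) datetime pairs for one customer, in date order.

    `expected` is None for the first two payments, where no gap can be measured.
    """
    dates = sorted(datetime.strptime(d, DATE_FORMAT) for d in payment_dates)
    pairs = []
    for i, actual in enumerate(dates):
        if i < 2:
            expected = None
        else:
            last, before_last = dates[i - 1], dates[i - 2]
            expected = last + (last - before_last)  # last date plus the previous gap
        pairs.append((actual, expected))
    return pairs

def average_error_days(pairs):
    """Mean absolute error in days (an assumed definition); None if undefined."""
    errors = [abs((actual - expected).days)
              for actual, expected in pairs if expected is not None]
    return sum(errors) / len(errors) if errors else None

# The payment dates of Customer001 from the table above.
pairs = expected_payment_dates(
    ["2022-10-12", "2022-10-15", "2022-10-16", "2022-10-20", "2022-10-30"])
expected = [e.strftime(DATE_FORMAT) if e else None for _, e in pairs]
print(expected)
print(average_error_days(pairs))
```

On an RDD, the same function could be applied per customer once the payment dates have been grouped by customerId, which requires first attaching the customer to each payment through the header.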

# Part 05 - Cosine Similarity
* How many customers does the company have?
* Draw the histogram, without using .hist(), of the number of customers with 1 invoice, the number of customers with 2 invoices, ...
* Define the similarity between two customers as the cosine similarity computed on the average payment time per day
    * a day with no invoice posted counts as zero
    * for other days, compute the average payment timing using the due date as zero (10 days in advance means -10, 3 days after means +3)
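
Both the histogram counts and the similarity measure can be sketched in plain Python. The block below is a minimal illustration with hypothetical names and sample data: each customer is represented as a dict mapping a day to the average payment delay relative to the due date, missing days count as zero, and returning 0.0 for an all-zero vector is an assumption.

```python
from collections import Counter
from math import sqrt

def invoice_histogram(customer_ids):
    """Map 'number of invoices' -> 'number of customers with that many invoices'."""
    invoices_per_customer = Counter(customer_ids)
    return dict(Counter(invoices_per_customer.values()))

def cosine_similarity(left, right):
    """Cosine similarity of two sparse vectors stored as {day: average delay}.

    Days missing from a dict count as zero.
    """
    dot = sum(value * right.get(day, 0.0) for day, value in left.items())
    norm_left = sqrt(sum(v * v for v in left.values()))
    norm_right = sqrt(sum(v * v for v in right.values()))
    if norm_left == 0 or norm_right == 0:
        return 0.0  # assumption: similarity with an all-zero vector is 0
    return dot / (norm_left * norm_right)

# Hypothetical data: one customer id per invoice, and two delay profiles.
histogram = invoice_histogram(["C1", "C1", "C2", "C3", "C3", "C3"])
customer_a = {"2022-01-03": -10.0, "2022-01-04": 3.0}
customer_b = {"2022-01-03": -5.0, "2022-01-05": 2.0}
similarity = cosine_similarity(customer_a, customer_b)
print(histogram)
print(similarity)
```

Storing only the days with a non-zero value keeps the vectors small, since most customers have no invoice on most days. One consequence of the encoding is that a payment made exactly on the due date is indistinguishable from a day with no invoice.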