ERP stands for **Enterprise Resource Planning**.

It is business process management software that manages and integrates a company's financials, supply chain, operations, commerce, reporting, manufacturing, and human resources activities.

One important ERP entity is **Accounts Receivable (AR)**: the money a company's customers owe for goods or services they have received.

An **AR** document can be one of the following:
 * Invoice
 * Credit Note
 * Debit Note
 * Cancellation
 * Miscellaneous

Each AR is made up of several parts: the **header** (general information about the customer or supplier that defines the invoice), the **list of items**, the **list of payments**, details about the **customers**, details about the **shipping**, ...

# Parameters

* N: number of invoices
* M: number of payments
* K: number of customers

In [None]:
N=10000
M=12500
K=150

# AR Header

The Header of an AR document contains some general information like
* Customer ID
* Value
* Due Date
* Posting Date
* Document Number - must be unique per fiscal year
* Fiscal Year
* Document Type

Assumptions:
* "Invoice" is the only document type

In [None]:
from random import randint
from datetime import datetime, timedelta


def headerGenerator(k=5):
  # Posting date: a random day between 2022-01-01 and 200 days later
  postingDate = datetime(2022, 1, 1) + timedelta(randint(0, 200))
  return {
          # randint is inclusive on both ends, so IDs run from 001 to k
          "customerId": "Customer_{customerId}".format(customerId=str(randint(1, k)).zfill(3)),
          "value": randint(50, 10000),
          "documentCurrency": "EUR",
          "postingDate": postingDate.strftime("%Y-%m-%d"),
          # Due date falls 0 to 60 days after the posting date
          "dueDate": (postingDate + timedelta(randint(0, 60))).strftime("%Y-%m-%d"),
          "fiscalYear": postingDate.strftime("%Y"),
          "documentType": "Invoice"
         }


def headerList(k=5, n=1000):
  # "_" as loop variable, so it does not shadow the customer count k
  rawHeaderList = [headerGenerator(k) for _ in range(n)]
  # Number the documents in posting-date order
  rawHeaderList.sort(key=lambda row: row.get("postingDate"))
  for pos, val in enumerate(rawHeaderList):
    val["documentNumber"] = "2022-{docNum}".format(docNum=str(pos).zfill(5))
  return rawHeaderList

myARList = headerList(K, N)

# AR Payments

List of lines that represent a payment made by a customer on a given AR.
* Document Number
* Payment Date
* Value Paid

In [None]:
def paymentGenerator(InvoiceList):
  # Pick a random invoice directly instead of scanning the list for its number
  invoice = InvoiceList[randint(0, len(InvoiceList) - 1)]
  postingDate = datetime.strptime(invoice.get("postingDate"), "%Y-%m-%d")
  return {
          "documentNumber": invoice.get("documentNumber"),
          # The payment arrives 15 to 90 days after the posting date
          "paymentDate": (postingDate + timedelta(randint(15, 90))).strftime("%Y-%m-%d"),
          # A single payment never exceeds the invoice value, but the sum of payments may
          "valuePaid": randint(1, invoice.get("value")),
          "documentCurrency": invoice.get("documentCurrency")
         }


def paymentList(InvoiceList, m=250):
  return [paymentGenerator(InvoiceList) for _ in range(m)]

myPaymentList = paymentList(myARList, M)


# Part 00
* Define the type of each table (Log or Registry): which are the keys of these tables?

In [None]:
# Both tables are log tables
# Header key: documentNumber, fiscalYear
# Payments key: documentNumber, paymentDate
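
One way to sanity-check a candidate key is to count how often each key tuple occurs. The helper below is a minimal sketch in plain Python; `duplicate_keys` and the sample rows are hypothetical and not part of the notebook, but the function accepts any list of dicts shaped like `myARList` or `myPaymentList`.

```python
from collections import Counter


def duplicate_keys(rows, fields):
    """Return the key tuples that occur more than once in rows."""
    counts = Counter(tuple(row.get(f) for f in fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample: two payments on the same document and day
sample_payments = [
    {"documentNumber": "2022-00001", "paymentDate": "2022-02-01", "valuePaid": 10},
    {"documentNumber": "2022-00001", "paymentDate": "2022-02-01", "valuePaid": 25},
    {"documentNumber": "2022-00002", "paymentDate": "2022-02-03", "valuePaid": 40},
]
dups = duplicate_keys(sample_payments, ["documentNumber", "paymentDate"])
```

Since payments are generated at random, two of them can land on the same document and date, so a check like this on `myPaymentList` is worth running before relying on that key.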

# Part 01
* Create the two RDDs and check that everything is OK!
* Create a single RDD that combines the header and payment information

In [None]:
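from pyspark import SparkContext

# Reuse the notebook's SparkContext if one exists, otherwise create one
sc = SparkContext.getOrCreate()

headerRDD = sc.parallelize(myARList)
paymentsRDD = sc.parallelize(myPaymentList)
# Both checks should be True
headerRDD.count() == N, paymentsRDD.count() == M

# Key both RDDs by documentNumber, then join headers with payments
mapHeaderRDD = headerRDD.map(lambda x: (x.get("documentNumber"), x))
mapPaymentsRDD = paymentsRDD.map(lambda x: (x.get("documentNumber"), x))
# Inner join: invoices without any payment are dropped from the result
mapHeaderRDD.join(mapPaymentsRDD).first()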


# Part 02
* How many invoices are open (i.e., not completely paid)?
* How many invoices are closed (i.e., completely paid)?
* How many invoices are overdue (i.e., not completely paid and with a due date in the past)?
* How many invoices have been paid late (i.e., completely paid, with the last payment after the due date)?
* Add to the RDD a "closingDate" field: the date of the payment that closes the invoice.
* Add to the RDD a boolean "inTime" field: True if closingDate < dueDate, otherwise False.
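
The closing-date logic can be sketched for a single invoice in plain Python. This is only an illustration under stated assumptions: `closing_info` and the sample data are hypothetical names, dates are the ISO strings produced by the generators (so they compare chronologically as strings), and an invoice counts as closed once the running total of payments reaches its value.

```python
from itertools import accumulate


def closing_info(invoice, payments):
    """Return (closingDate, inTime), or (None, None) if the invoice is still open."""
    ordered = sorted(payments, key=lambda p: p["paymentDate"])
    running = accumulate(p["valuePaid"] for p in ordered)
    for payment, total in zip(ordered, running):
        if total >= invoice["value"]:
            # First payment that brings the running total up to the invoice value
            closing = payment["paymentDate"]
            return closing, closing < invoice["dueDate"]
    return None, None


# Hypothetical invoice, paid in two instalments, the second after the due date
invoice = {"documentNumber": "2022-00001", "value": 100, "dueDate": "2022-03-01"}
payments = [
    {"paymentDate": "2022-03-05", "valuePaid": 40},
    {"paymentDate": "2022-02-10", "valuePaid": 60},
]
result = closing_info(invoice, payments)
```

On RDDs, one option is to apply the same function to each invoice after grouping its payments by `documentNumber`, for example with `cogroup` or `groupByKey`.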

# Part 03 - Debit Note
* How many invoices have been paid for more than their value?
* For each of them, add to the Header RDD a Debit Note with the value to be charged back and today's date
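
A possible shape for such a Debit Note, sketched in plain Python. The helper name `debit_note_for` is hypothetical, and so is the choice of fields: the exercise does not say which header fields a Debit Note carries, so this sketch copies the customer and currency from the invoice and leaves out the due date and document number.

```python
from datetime import date


def debit_note_for(invoice, payments, today=None):
    """Build a Debit Note header for the overpaid amount, or None if not overpaid."""
    today = today or date.today()
    overpaid = sum(p["valuePaid"] for p in payments) - invoice["value"]
    if overpaid <= 0:
        return None
    return {
        "customerId": invoice["customerId"],
        "value": overpaid,  # amount to be charged back
        "documentCurrency": invoice["documentCurrency"],
        "postingDate": today.strftime("%Y-%m-%d"),
        "fiscalYear": today.strftime("%Y"),
        "documentType": "Debit Note",
    }


# Hypothetical invoice of 100 that received 130 in payments
invoice = {"documentNumber": "2022-00007", "customerId": "Customer_001",
           "value": 100, "documentCurrency": "EUR"}
payments = [{"valuePaid": 80}, {"valuePaid": 50}]
note = debit_note_for(invoice, payments, today=date(2022, 12, 1))
```

Passing `today` explicitly keeps the function reproducible; when it is omitted, the current date is used.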

# Part 04 - Payments Frequency
* Add to the Payment RDD the computed "expectedPaymentDate". It is based on the customer's two previous payments: the date of the previous payment plus the gap between it and the payment right before it, computed customer by customer.
So, in the example below, it cannot be computed for the first two payments, while for the third the expected payment date is 2022/10/15 (the date of the previous payment) plus 3 days (the gap between it and the payment of 2022/10/12).

| customerId   | paymentDate | expectedPaymentDate | documentNumber | ... |
|--------------|-------------|---------------------|----------------|-----|
| Customer_001 | 2022/10/12  | N/A                 | 2022-01001     | ... |
| Customer_001 | 2022/10/15  | N/A                 | 2022-01004     | ... |
| Customer_001 | 2022/10/16  | 2022/10/18 (15+3)   | 2022-00904     | ... |
| Customer_001 | 2022/10/20  | 2022/10/17 (16+1)   | 2022-01004     | ... |
| Customer_001 | 2022/10/30  | 2022/10/24 (20+4)   | 2022-01101     | ... |
| Customer_001 | ...         | ...                 | ...            | ... |

* Show, for each customer, the average error of this method
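
The rule can be sketched for one customer in plain Python, using the payment dates from the example table. Both helper names are hypothetical, and "average error" is taken here to mean the mean absolute difference in days between the actual and the expected date, which is an assumption: the exercise does not say whether the error is signed.

```python
from datetime import datetime

FMT = "%Y/%m/%d"  # date format used in the example table


def expected_payment_dates(payment_dates):
    """Predict each payment date from the two payments before it (None if unavailable)."""
    dates = sorted(datetime.strptime(d, FMT) for d in payment_dates)
    expected = []
    for i in range(len(dates)):
        if i < 2:
            expected.append(None)  # not enough history yet
        else:
            last, before = dates[i - 1], dates[i - 2]
            expected.append((last + (last - before)).strftime(FMT))
    return expected


def mean_abs_error_days(payment_dates, expected):
    """Mean absolute gap in days between actual and expected dates, skipping None."""
    actual = sorted(payment_dates)
    gaps = [
        abs((datetime.strptime(a, FMT) - datetime.strptime(e, FMT)).days)
        for a, e in zip(actual, expected)
        if e is not None
    ]
    return sum(gaps) / len(gaps) if gaps else None


table_dates = ["2022/10/12", "2022/10/15", "2022/10/16", "2022/10/20", "2022/10/30"]
predictions = expected_payment_dates(table_dates)
error = mean_abs_error_days(table_dates, predictions)
```

On the Payment RDD, the same functions could be applied to each customer's payments after joining them with the header to obtain the `customerId`.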

# Part 05 - Cosine Similarity
* How many customers does the company have?
* Draw the histogram, without using .hist(), of the number of customers with 1 invoice, the number of customers with 2 invoices, ...
* Define the similarity between two customers as the cosine similarity of their average payment timing per day
    * a day with no invoice posted counts as zero
    * for other days, compute the average payment timing using the due date as zero (10 days in advance means -10, 3 days late means +3)
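
A minimal sketch of the similarity itself in plain Python, assuming each customer has already been reduced to a dict mapping a posting day to the average payment timing for that day. The helper names and the sample values are hypothetical. Returning 0.0 when one vector is all zeros is also an assumption, since cosine similarity is undefined in that case.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def daily_vector(avg_timing_by_day, days):
    """Lay the per-day averages out on a shared day axis, with 0 for days without invoices."""
    return [avg_timing_by_day.get(day, 0) for day in days]


# Hypothetical per-day averages: negative means paid early, positive means paid late
days = ["2022-01-01", "2022-01-02", "2022-01-03"]
customer_a = {"2022-01-01": -10, "2022-01-03": 3}
customer_b = {"2022-01-01": -5, "2022-01-02": 2}
similarity = cosine_similarity(daily_vector(customer_a, days),
                               daily_vector(customer_b, days))
```

Both vectors must share the same day axis, so `days` should cover every posting day in the data set, not only the days on which a given customer was invoiced.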