ERP stands for **Enterprise Resource Planning**.

It is business process management software that manages and integrates a company's financials, supply chain, operations, commerce, reporting, manufacturing, and human resources activities.

One important ERP entity is **Accounts Receivable (AR)**: the money a company's customers owe for goods or services they have received.

An **AR** document can be one of:
 * Invoice
 * Credit Note
 * Debit Note
 * Cancellation
 * Miscellaneous

Each AR is made up of several parts: the **header** (general information about the customer or supplier that defines the invoice), the **list of items**, the **list of payments**, details about the **customers**, details about the **shipping**, ...

# Parameters

* N: number of invoices
* M: number of payments
* K: number of customers

In [None]:
N=10000
M=12500
K=150

# AR Header

The header of an AR document contains general information such as:
* Customer ID
* Value
* Due Date
* Posting Date
* Document Number - must be unique per fiscal year
* Fiscal Year
* Document Type
* Document Currency

Assumptions:
* "Invoice" is the only document type

In [None]:
from random import randint
from datetime import datetime, timedelta

def headerGenerator(k=5):
  # Posting date falls between 2022-01-01 and 200 days later
  postingDate = datetime(2022,1,1)+timedelta(randint(0,200))
  return {
          # randint is inclusive on both ends, so IDs run from Customer_001 to Customer_<k>
          "customerId":"Customer_{customerId}".format(customerId=str(randint(1,k)).zfill(3)),
          "value":randint(50,10000),
          "documentCurrency":"EUR",
          "postingDate":postingDate.strftime("%Y-%m-%d"),
          # Due date is 0 to 60 days after the posting date
          "dueDate":(postingDate+timedelta(randint(0,60))).strftime("%Y-%m-%d"),
          "fiscalYear":postingDate.strftime("%Y"),
          "documentType":"Invoice"
         }


def headerList(k=5,n=1000):
  # Pass k (the number of customers) to the generator, not the loop index
  rawHeaderList = [headerGenerator(k) for _ in range(n)]
  # ISO-formatted dates sort chronologically as plain strings
  rawHeaderList.sort(key=lambda row: row.get("postingDate"))
  # Number the documents in posting order so each number is unique
  for pos,val in enumerate(rawHeaderList):
    val["documentNumber"]="2022-{docNum}".format(docNum=str(pos).zfill(5))
  return rawHeaderList

myARList = headerList(K,N)
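
The header description requires the document number to be unique per fiscal year. A minimal sketch of how that could be checked is shown here; the function name `duplicated_document_numbers` and the sample rows are hypothetical and only mirror the shape of the generated headers.

```python
from collections import Counter

def duplicated_document_numbers(headers):
    """Return the (fiscalYear, documentNumber) pairs that occur more than once."""
    counts = Counter((h["fiscalYear"], h["documentNumber"]) for h in headers)
    return sorted(key for key, n in counts.items() if n > 1)

# Hypothetical sample rows, not taken from the generated data
sample_headers = [
    {"fiscalYear": "2022", "documentNumber": "2022-00000"},
    {"fiscalYear": "2022", "documentNumber": "2022-00001"},
    {"fiscalYear": "2022", "documentNumber": "2022-00001"},  # deliberate duplicate
]
duplicates = duplicated_document_numbers(sample_headers)
print(duplicates)
```

Calling the same function on `myARList` is expected to return an empty list, since the numbers there are assigned from the position in the sorted list.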

# AR Payments

List of lines that represent a payment made by a customer on a given AR.
* Document Number
* Payment Date
* Value Paid

In [None]:
def paymentGenerator(InvoiceList):
  # Pick a random invoice directly instead of scanning the whole list for a random document number
  invoice = InvoiceList[randint(0,len(InvoiceList)-1)]
  postingDate = datetime.strptime(invoice.get("postingDate"),"%Y-%m-%d")
  return {
          "documentNumber":invoice.get("documentNumber"),
          # The payment arrives 15 to 90 days after the posting date
          "paymentDate":(postingDate+timedelta(randint(15,90))).strftime("%Y-%m-%d"),
          # A single payment never exceeds the invoice value, but several payments may hit the same invoice
          "valuePaid":randint(1,invoice.get("value")),
          "documentCurrency":invoice.get("documentCurrency")
         }


def paymentList(InvoiceList,m=250):
  return [paymentGenerator(InvoiceList) for _ in range(m)]

myPaymentList = paymentList(myARList,M)
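
Since AR is about the money customers still owe, it may help to see how headers and payments combine. The sketch below computes the open amount per document as the invoice value minus the sum of its payments; `open_amounts` and the sample rows are hypothetical names and data, not part of the original notebook.

```python
from collections import defaultdict

def open_amounts(invoices, payments):
    """Map each documentNumber to its invoice value minus the sum of its payments."""
    paid = defaultdict(int)
    for p in payments:
        paid[p["documentNumber"]] += p["valuePaid"]
    return {inv["documentNumber"]: inv["value"] - paid[inv["documentNumber"]]
            for inv in invoices}

# Hypothetical sample rows in the same shape as the generated data
sample_invoices = [
    {"documentNumber": "2022-00000", "value": 500},
    {"documentNumber": "2022-00001", "value": 120},
]
sample_payments = [
    {"documentNumber": "2022-00000", "valuePaid": 200},
    {"documentNumber": "2022-00000", "valuePaid": 100},
]
balances = open_amounts(sample_invoices, sample_payments)
print(balances)
```

Because the generated payments are drawn independently of each other, the total paid on one invoice can exceed its value, so a negative open amount is possible with the synthetic data.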


# Part 05 - Cosine Similarity
* How many customers does the company have?
* Draw the histogram, without using .hist(), of the number of customers with 1 invoice, the number of customers with 2 invoices, ...
* Define the similarity between two customers as the cosine similarity of their average payment delay per posting day
    * a day with no invoice posted counts as zero
    * for other days, compute the average payment delay relative to the due date (10 days early is -10, 3 days late is +3)
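
The first two questions can be sketched in plain Python with two counting passes: invoices per customer, then customers per invoice count. This is only an illustration on a small in-memory list, not the RDD solution; `invoice_histogram` and the sample rows are hypothetical.

```python
from collections import Counter

def invoice_histogram(headers):
    """Return (number of customers, {invoice count: number of customers})."""
    invoices_per_customer = Counter(h["customerId"] for h in headers)
    histogram = Counter(invoices_per_customer.values())
    return len(invoices_per_customer), dict(sorted(histogram.items()))

# Hypothetical sample rows: two customers with 2 invoices, two with 1
sample = [{"customerId": c} for c in [
    "Customer_001", "Customer_001", "Customer_002",
    "Customer_003", "Customer_003", "Customer_004",
]]
n_customers, hist = invoice_histogram(sample)

# Text rendering of the histogram, one bar per invoice count
for n_invoices, n_cust in hist.items():
    print(f"{n_invoices:>3} | {'#' * n_cust}")
```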

# Cosine Similarity

In [None]:

customerRDD = ... # {"CustomerId":"","postingDate":datetime(),"dueDate":datetime(),"paymentDate":datetime()}
# Average delay per (customer, postingDate) as sum/count, so no per-key list has to be built and averaged afterwards.
# Delay is paymentDate - dueDate in days: negative when paid early, positive when paid late.
preparedCustomerRDD = (customerRDD
    .map(lambda x: ((x.get("CustomerId"), x.get("postingDate")), ((x.get("paymentDate") - x.get("dueDate")).days, 1)))
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
    .map(lambda x: {"customerId": x[0][0], "postingDate": x[0][1], "avgDelay": x[1][0] / x[1][1]}))
# Re-key to the form (column, (row, value)), i.e. (postingDate, (customerId, avgDelay))
toBeJoinedPreparedCustomerRDD = preparedCustomerRDD.map(lambda x: (x.get("postingDate"), (x.get("customerId"), x.get("avgDelay"))))
# Self-join on postingDate and keep only half of the matrix: (postingDateA, ((customer01, avg01), (customer02, avg02))) with customer01 > customer02
joinedPreparedCustomerRDD = toBeJoinedPreparedCustomerRDD.join(toBeJoinedPreparedCustomerRDD).filter(lambda x: x[1][0][0] > x[1][1][0])
# Move the customer pair to the key and sum the products over all posting dates; the date itself is no longer needed after the join
toBeReducedRdd = joinedPreparedCustomerRDD.map(lambda x: ((x[1][0][0], x[1][1][0]), x[1][0][1] * x[1][1][1])).reduceByKey(lambda x, y: x + y)
# toBeReducedRdd holds the numerator of the cosine similarity for each pair: ((customer01, customer02), SUM(avg01 * avg02))
# Mirror each pair to restore the full matrix that was halved above
fullCustomerSimilarity = toBeReducedRdd.flatMap(lambda row: [{"Customer01":row[0][0],"Customer02":row[0][1],"upperScore":row[1]},{"Customer01":row[0][1],"Customer02":row[0][0],"upperScore":row[1]}])
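
The pipeline above stops at the numerator (`upperScore`). For reference, here is a minimal plain-Python sketch of the complete cosine similarity, the dot product divided by the product of the two vector norms. The function name, the dictionary layout (posting date mapped to average delay in days) and the sample values are hypothetical; returning 0.0 for a customer with an all-zero vector is an assumption, since the formula is undefined in that case.

```python
from math import sqrt

def cosine_similarity(a, b):
    """a, b: dicts mapping postingDate -> average delay in days.

    Dates missing from a dict count as zero, so only shared dates
    contribute to the dot product.
    """
    dot = sum(v * b[d] for d, v in a.items() if d in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: similarity with an all-zero vector is 0
    return dot / (norm_a * norm_b)

# Hypothetical customers: -10 means paid 10 days early, 3 means paid 3 days late
customer_a = {"2022-01-03": -10.0, "2022-01-05": 3.0}
customer_b = {"2022-01-03": -5.0, "2022-01-07": 4.0}
similarity = cosine_similarity(customer_a, customer_b)
print(similarity)
```

In the RDD version, the two norms could be obtained from `preparedCustomerRDD` by summing the squared `avgDelay` per customer and then joined onto the pair scores.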