In [1]:
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
import random
%matplotlib inline 

In [2]:
df = pd.read_csv("kenya_data/diaries_transactions_all.csv", dtype={'account_startclose_balance': str})

# 1. Outcome

The general question is which loans we want to predict. If we want to predict microloan repayment, we should perhaps only use loans that closely resemble microloans as the outcome (such as Business and Agriculture loans), but there are only 48 of those. Alternatively, we could generalize the question to predicting whether people pay back any kind of loan that is similar to a formal loan.

## 1.1. Which loans to include?

### 1.1.1 Kinds of loans

We think the outcome should include all loans that are reasonably close to a formal loan. These include:
1. all formal loans (Credit card, Student loan, Individual Business or Agriculture Loan, Payday loan, Consumer/personal loan (not payday loan), Joint liability loan, School Fees Loan, Hire Purchase, Group Enterprise Loan, M-SHWARI Loan)
2. loans from informal groups

We are not sure if we should include other informal loans:
1. Informal credit at a store (Many very small new borrowings and repayments.)
2. Friends and Family: Borrowing (Usually paid back at once.)
3. Act as money guard (Usually paid back at once.)
4. Moneylender or Shylock Borrowing (Usually paid back at once with a large interest.)
Maybe the above are better suited as explanatory variables. In that case it might be a problem that not everybody has used all or some of them. One solution could be to assign a score for each category: 1 for people who borrowed this way and paid it back (within a certain amount of time), 0 for people who never borrowed this way, and -1 for people who borrowed this way but did not repay. This again raises the problem of fairness, since we do not know the conditions of each loan.
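
The scoring idea above can be sketched as a small function. This is only an illustration: the category keys and the example history are hypothetical, and what counts as "repaid in time" still has to be defined.

```python
from typing import Optional


def informal_loan_score(borrowed: bool, repaid_in_time: Optional[bool]) -> int:
    """Return +1, 0 or -1 for one informal loan category.

    borrowed: whether the person ever used this kind of loan.
    repaid_in_time: whether it was repaid within the chosen time window
    (ignored when borrowed is False).
    """
    if not borrowed:
        return 0
    return 1 if repaid_in_time else -1


# Hypothetical history of one person: (borrowed, repaid_in_time) per category.
history = {
    "store_credit": (True, True),
    "friends_family": (True, False),
    "money_guard": (False, None),
    "moneylender": (True, True),
}
scores = {name: informal_loan_score(*flags) for name, flags in history.items()}
print(scores)
```

Each score could then enter the model as one explanatory variable per category.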

Also we are not sure if we should include the following, which we need to investigate first.
1. Pawning assets
2. Supplier credit
3. Okoa Jahazi
4. Advances
5. Arrears owed to or owed by respondents.
6. Loan from employer

Below is the number of liability accounts that fall into each category.

In [19]:
df[(df["account_bsheet_desig"]=="Liability")&(df["unique_accnts"]==1)]["trx_type_desc"].value_counts()

Friends and family: Borrowing                605
Informal credit at a store                   430
Arrears owed by respondents                  232
Borrowing from an informal Group             159
Okoa Jahazi                                   82
Act as money guard                            76
Supplier credit                               67
Individual Business or Agriculture Loan       48
Wage advance                                  40
Moneylender or Shylock Borrowing              39
Consumer/ personal loan (not payday loan)     33
Hire Purchase                                 24
Joint liability loan                          21
M-SHWARI Loan                                 14
School Fees Loan                              12
Loan from employer                            12
Pawning assets                                 5
Credit card (including store card)             2
Payday loan                                    2
Group Enterprise Loan                          2
Student loan        

When choosing the kinds of loans we want to include, we might also need to assess how close each of these loans is to a microcredit. For instance, the purpose of a microcredit and that of a credit at a store are very different.

### 1.1.2 Threshold

Another question is whether we should exclude loans that are very small and, if so, what the threshold should be. The minimum wage (income) in Kenya is ~4000 KES; if we use that as the minimum threshold, we would exclude approximately 25% of the Agriculture and Business loans.

If we are only interested in microloans, we might also want to exclude very large loans (> 100 0000 KES).
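
A minimal sketch of applying such thresholds and checking how many loans they remove. The demo balances are made up, and the upper cut-off of 250,000 KES is an arbitrary placeholder, not a value we have agreed on.

```python
import pandas as pd


def apply_thresholds(df, col, min_kes, max_kes):
    """Keep rows whose value in `col` lies in [min_kes, max_kes].

    Returns the filtered frame and the share of rows that were dropped.
    """
    keep = df[col].between(min_kes, max_kes)  # inclusive on both ends
    share_dropped = 1.0 - keep.mean()
    return df[keep], share_dropped


# Hypothetical starting balances in KES, one row per loan.
loans_demo = pd.DataFrame(
    {"start_bal_kes": [1500.0, 3000.0, 18000.0, 50000.0, 298000.0]}
)

kept, share_dropped = apply_thresholds(
    loans_demo, "start_bal_kes", min_kes=4000, max_kes=250_000
)
print(len(kept), "loans kept;", round(share_dropped, 2), "dropped")
```

Running the same function on the real `start_bal_kes` column would show directly how sensitive the sample size is to the chosen cut-offs.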

## 1.2. What metric to use?

One reason for the restriction above is that the loans need to resemble each other to some degree before we can define a metric that measures how good a borrower someone was.

1. Idea: For loans that have an initial amount that is borrowed, and then smaller repayments over a period of time, we can define a metric based on the regularity of the payment heights (amounts) and the regularity of the intervals between payments. One way to do this is to compare the standard deviation with the mean: a borrower for whom sd(height of payment)/mean(height of payment) is large would be considered a bad borrower. Similarly for the intervals between payments.
2. Idea: To get started, just look at whether someone who took a loan towards the beginning of the experiment has paid it back before the end of the experiment. The concern here is that this would not be fair.
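
The first idea amounts to a coefficient of variation (sd/mean). A minimal sketch with made-up payment histories follows; the function names are ours, and `statistics.stdev` is the sample standard deviation, which matches the pandas default for `.std()`.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return sd / mean, or None with fewer than two values or a zero mean."""
    if len(values) < 2:
        return None
    m = mean(values)
    if m == 0:
        return None
    return stdev(values) / m


def regularity(pay_days, pay_amounts):
    """Return (CV of payment heights, CV of intervals between payments).

    pay_days must be sorted in increasing order.
    """
    intervals = [later - earlier for earlier, later in zip(pay_days, pay_days[1:])]
    return coefficient_of_variation(pay_amounts), coefficient_of_variation(intervals)


# Two hypothetical borrowers: one pays 5000 KES every 30 days, one pays erratically.
regular = regularity([30, 60, 90, 120], [5000, 5000, 5000, 5000])
erratic = regularity([10, 15, 90, 200], [500, 9000, 1000, 7000])
print(regular, erratic)
```

Lower values mean more regular repayment; where to put the line between a good and a bad borrower is still open.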

## 1.3 Implementation of metrics

This program computes some properties of each loan, so that we have data that is not time dependent. For each account it produces: Household ID, Account ID, Account Owner ID, Starting balance, Closing balance, Interest accrued, Number of new borrowings, Mean height of new borrowings, Total number of days the account was observed, Mean interval between payments, Standard deviation of the intervals between payments, Mean payment height, Standard deviation of the payment height.

In [10]:
# For now we only keep Individual Business and Agricultural loans.
dfb = df[(df["trx_family_code"]=="FRMLN")&(df["trx_type_code"]==2760)]

In [11]:
# this creates a new dataframe that has only one entry for each loan with the information mentioned above
loans = pd.DataFrame(columns=["hh_ids","account_ids", "m_ids_owner","start_bal_kes", "clos_bal_kes", "interests_kes", "num_new_borrowing", "mean_borrowing_kes", "tot_acc_daysofobs", "mean_pay_int_days","sd_pay_int_days","mean_pay_height_kes","sd_pay_height_kes"] )

In [12]:
i = 0
for acc in (dfb["account_ids"].unique()):
    print(acc)
    # get household corresponding to the loan
    if len(dfb[dfb["account_ids"]==acc]["hh_ids"].unique())==1:
        hh = dfb[dfb["account_ids"]==acc]["hh_ids"].unique()[0]
    else:
        print("Error: account associated to several households.")
    
    #get account owner id
    if len(dfb[dfb["account_ids"]==acc]["m_ids_owner"].unique())==1:
        if "HH" in dfb[dfb["account_ids"]==acc]["m_ids_owner"].unique():
            ind_id = "HH"
        else:
            ind_id = int(dfb[dfb["account_ids"]==acc]["m_ids_owner"].unique()[0])
    else:
        print("Error: account associated to several individuals.")
        print(dfb[dfb["account_ids"]==acc]["m_ids_owner"].unique())
        break
        
    #get total number of observation days
    if len(dfb[dfb["account_ids"]==acc]["tot_acc_daysofobs"].unique())==1:
        obs_days = dfb[dfb["account_ids"]==acc]["tot_acc_daysofobs"].unique()[0]
    else:
        print("Error: several different observation lengths.")
      
    #create new dataset with values sorted according to the day of transaction
    lna = dfb[dfb["account_ids"]==acc]
    lna = lna.sort_values("trx_stdtime_days_acc")
    
    #get starting balance
    start_bal = np.nan
    new_borrowings = np.nan
    mean_new_borrowings = np.nan
    if lna[lna["trx_prx_purpose"]=="1. Starting balance (today)"].shape[0]==1:
        start_bal = lna[lna["trx_prx_purpose"]=="1. Starting balance (today)"]["trx_value_kes"].unique()[0]
        new_borrowings = lna[lna["trx_prx_purpose"]=="2. New borrowing"].shape[0]
        all_borrow = lna[(lna["trx_prx_purpose"]=="2. New borrowing")|(lna["trx_prx_purpose"]=="1. Starting balance (today)")]
        # we want to exclude all new borrowings or starting balances that are 0
        mean_new_borrowings = all_borrow[all_borrow["trx_value_kes"]>0]["trx_value_kes"].mean()
    elif lna[lna["trx_prx_purpose"]=="1. Starting balance (today)"].shape[0]>1:
        print("Error: several starting balances")
    #if there is no starting balance at all
    elif lna[lna["trx_prx_purpose"]=="2. New borrowing"].shape[0]==1:
        start_bal = lna[lna["trx_prx_purpose"]=="2. New borrowing"]["trx_value_kes"].unique()[0]
        new_borrowings = 0
        mean_new_borrowings= start_bal
    elif lna[lna["trx_prx_purpose"]=="2. New borrowing"].shape[0]>1:
        start_bal = lna[lna["trx_prx_purpose"]=="2. New borrowing"].iloc[0,lna.columns.get_loc("trx_value_kes")]
        new_borrowings = lna[lna["trx_prx_purpose"]=="2. New borrowing"].shape[0]-1
        mean_new_borrowings = lna[lna["trx_prx_purpose"]=="2. New borrowing"]["trx_value_kes"].mean()
    else: 
        print("Error: no starting balance or new borrowing")
    
    #get total interest accrued
    interests = lna[lna["trx_prx_purpose"]=="5. Interest accruing"]["trx_value_kes"].sum()
    
    #get closing balance
    close_bal = np.nan
    if lna[lna["trx_prx_purpose"]=="6. Closing Balance--End of last DQ"].shape[0]==1:
        close_bal = lna[lna["trx_prx_purpose"]=="6. Closing Balance--End of last DQ"]["trx_value_kes"].unique()[0]
    elif lna[lna["trx_prx_purpose"]=="6. Closing Balance--End of last DQ"].shape[0]>1:
        print("Error: several closing balances.")
    else: 
        print("Error: no closing balance.")

    hei_mean,hei_sd,inter_mean,inter_sd = np.nan, np.nan, np.nan, np.nan
    #get the height of payments
    if lna[lna["trx_prx_purpose"]=="3. Payments"].shape[0]>0:
        hei_mean = lna[lna["trx_prx_purpose"]=="3. Payments"]["trx_value_kes"].mean()
        hei_sd = lna[lna["trx_prx_purpose"]=="3. Payments"]["trx_value_kes"].std()
        
        ##get the intervals between consecutive payments (lna is already sorted by day)
        pay_intervals = lna[lna["trx_prx_purpose"]=="3. Payments"]["trx_stdtime_days_acc"].diff()
        inter_mean = pay_intervals.mean()
        inter_sd = pay_intervals.std()
    
    loans.loc[i]=[hh,int(acc),ind_id,start_bal, close_bal, interests, new_borrowings, mean_new_borrowings, obs_days,inter_mean,inter_sd,hei_mean,hei_sd]
    i += 1

60137430710900000
105136540140100000
56134761927800000
105137049319900000
56134798164800000
59134726342000000
59134753176900000
Error: no closing balance.
59134691680100000
Error: no closing balance.
59134666467600000
Error: several starting balances
105137769890900000
59134727113000000
59135021173400000
60136531885600000
59136376033300000
60136436488300000
56135290200100000
60134752045800000
60134787021500000
105137414874400000
58134763288300000
61134770676600000
84136514083500000
Error: no starting balance or new borrowing
59136685821800000
59135288408400000
Error: no starting balance or new borrowing
60138558075900000
59134942516400000
62134821651800000
59134752569200000
Error: several closing balances.
59135900709000000
84136879000800000
59134865220800000
Error: several closing balances.
59136386375100000
111137291327800000
105137636847800000
60134978362300000
59135332895300000
59137629648900000
60135893778700000
58134814374400000
56135201879700000
105137629836900000
59134942469700

In [13]:
loans.head()

| | hh_ids | account_ids | m_ids_owner | start_bal_kes | clos_bal_kes | interests_kes | num_new_borrowing | mean_borrowing_kes | tot_acc_daysofobs | mean_pay_int_days | sd_pay_int_days | mean_pay_height_kes | sd_pay_height_kes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KVIHK40 | 60137430710900000 | 60134547419200000 | 100000.0 | 101300.0 | 35000.0 | 0 | 100000.0 | 129 | 32.5 | 2.12132 | 11233.333333 | 28.867513 |
| 1 | KELDK37 | 105136540140100000 | 65134432186900000 | 18000.0 | 35000.0 | 0.0 | 1 | 39000.0 | 200 | 39.2 | 19.460216 | 7166.666667 | 5307.227776 |
| 2 | KELDK20 | 56134761927800000 | 65134442822400000 | 18870.0 | 0.0 | 2580.0 | 0 | 18870.0 | 377 | 36.4 | 14.397917 | 3575.0 | 1127.718937 |
| 3 | KELDK38 | 105137049319900000 | 56134397318700000 | 298000.0 | 253608.0 | 0.0 | 0 | 298000.0 | 154 | 31.333333 | 18.147543 | 11098.0 | 0.0 |
| 4 | KELDK15 | 56134798164800000 | 65134433094700000 | 50000.0 | 21250.0 | 0.0 | 0 | 50000.0 | 356 | 69.2 | 75.014665 | 4750.0 | 612.372436 |


## 1.4. Problems in the implementation, and how they were solved for now.

### 1.4.1. Problems with this way of summarizing

As can also be seen from the errors produced, there are several problems with this summary:
1. No starting balance. Solution: In this case we can use the first new borrowing instead. New problem: How should we count the number of new borrowings in this case?
2. Several starting balances.
3. No closing balance. Possible solution: Use the last balance instead?
4. Several closing balances. Possible solution: Use last one?
5. No starting balance or new borrowing.
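
The fallbacks suggested in this list could look roughly like the sketch below. It uses the same column names and purpose labels as the loop above; taking the earliest row when there are several starting balances is our assumption, since no solution for that case has been decided.

```python
import numpy as np
import pandas as pd

START = "1. Starting balance (today)"
BORROW = "2. New borrowing"
CLOSE = "6. Closing Balance--End of last DQ"


def pick_balances(lna):
    """Return (start_bal, close_bal) for the transactions of one account.

    Start: earliest starting balance, else the first new borrowing, else NaN.
    Close: latest closing balance, else NaN.
    """
    lna = lna.sort_values("trx_stdtime_days_acc")
    starts = lna.loc[lna["trx_prx_purpose"] == START, "trx_value_kes"]
    if starts.empty:
        starts = lna.loc[lna["trx_prx_purpose"] == BORROW, "trx_value_kes"]
    closes = lna.loc[lna["trx_prx_purpose"] == CLOSE, "trx_value_kes"]
    start_bal = starts.iloc[0] if not starts.empty else np.nan
    close_bal = closes.iloc[-1] if not closes.empty else np.nan
    return start_bal, close_bal


# Hypothetical account: no starting balance and two closing balances.
demo = pd.DataFrame({
    "trx_prx_purpose": [BORROW, "3. Payments", CLOSE, CLOSE],
    "trx_stdtime_days_acc": [0, 30, 200, 120],
    "trx_value_kes": [10000.0, 2000.0, 6000.0, 8000.0],
})
start_bal, close_bal = pick_balances(demo)
print(start_bal, close_bal)
```

Accounts with neither a starting balance nor a new borrowing still come out as NaN and would have to be dropped or inspected by hand.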

### 1.4.2. Problems with the metric

Problem: The metric only works properly when there is a starting balance followed by payments. If there is a new borrowing in between, a change in payment heights or intervals does not seem problematic to me, especially since some people seem to take a loan, pay it off, and a long time after that take a new loan on the same account.

Possible Solution: Consider each new borrowing as a new loan, even if it is on the same account.

Problem with this solution: 
1. This would mess up the closing balance.
2. There are loans where one starts to pay in before one gets the money from the bank.
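
Treating each new borrowing as a new loan could be sketched as follows. The `loan_episode` column and the helper name are hypothetical; the purpose labels are the ones used in the loop above.

```python
import pandas as pd


def add_loan_episode(lna):
    """Number the loan episodes within one account.

    Every starting balance or new borrowing opens a new episode; the rows
    that follow belong to it until the next borrowing.
    """
    lna = lna.sort_values("trx_stdtime_days_acc").copy()
    opens = lna["trx_prx_purpose"].isin(
        ["1. Starting balance (today)", "2. New borrowing"]
    )
    lna["loan_episode"] = opens.cumsum()
    return lna


# Hypothetical account with a second borrowing long after the first loan.
demo = pd.DataFrame({
    "trx_prx_purpose": [
        "1. Starting balance (today)", "3. Payments", "3. Payments",
        "2. New borrowing", "3. Payments",
    ],
    "trx_stdtime_days_acc": [0, 30, 60, 200, 230],
    "trx_value_kes": [10000.0, 5000.0, 5000.0, 8000.0, 2000.0],
})
episodes = add_loan_episode(demo)
print(episodes[["trx_prx_purpose", "loan_episode"]])
```

Payments made before any borrowing end up in episode 0, which makes the second problem above visible but does not solve it. The metrics could then be computed per `(account_ids, loan_episode)` instead of per account.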

# 2. Which explanatory variables to use?

## 2.1. General Attitude

Questions:
1. Should we only use variables that a person giving a microcredit would have access to? That would exclude a lot of information like Well-Being or Goings-On, for two reasons: (a) that data is very hard to collect, and (b) using it seems to violate privacy. But we could still fit a model using these variables to see whether it gives any interesting results.
2. Is it realistic to use data about other loans (like credit from a store) to predict loan repayment?

## 2.2. What to use from Individual Data-Sets

### 2.2.1. Well-Being

There are 4 questions in the well-being data-set: Happiness, Confidence, Relationships, Economic well-being. We could use each individual's average score over the whole year, i.e. each person would have a global happiness score, confidence score, etc.

### 2.2.2. Goings-on

There are 10 questions here; again we could use an average score over the year.

### 2.2.3. Transactions

Here too, we should probably use the average income over the whole period, the average money spent on food over the whole period, etc., rather than looking at individual transactions.
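
The per-person averaging suggested for the well-being, goings-on and transaction data can be sketched with a `groupby`. The frame and its column names (`m_ids`, `happiness`, `confidence`) are made up for illustration and do not reflect the actual layout of those data-sets.

```python
import pandas as pd

# Hypothetical long-format survey data: one row per person and interview.
wellbeing = pd.DataFrame({
    "m_ids": [1, 1, 1, 2, 2],
    "happiness": [3, 4, 5, 2, 2],
    "confidence": [4, 4, 4, 1, 3],
})

# One row per person, holding the mean of each score over the whole period.
person_scores = wellbeing.groupby("m_ids")[["happiness", "confidence"]].mean()
print(person_scores)
```

The resulting table has one row per individual and could be joined to the loan summary on the account owner ID.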

### 2.2.4. Housing conditions

We could use the information about property that the family owns as a metric for evaluating the ability to pay off the debt, as well as to assess the general living conditions and financial obstacles that may make it hard for a family to pay on time.

### 2.2.5. Poverty

Here we are able to evaluate the daily income of family members.

Summary of meeting

1) Truncate the data time frame (6 months from loan start, chosen so that the window ends within the study period); drop borrowers who take longer to repay.
2) Point system to evaluate the loan (the balance going down every month by a particular %): it should reward regularity and good repayment behaviour, i.e. steady progress. Open question: do the size and the type of the loan affect this? (Choose a simple case first.)
3) How fast borrowers are able to pay the loan off, or whether they pay in larger chunks. A strong measure, but harder to see.
4) Divide borrowers into two groups (bad and good payers) to find out who is the most reliable person to give a loan to.
5) Incorporate the loan amount itself.

Optional: check seasonality and other quantitative measures.
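
A rough sketch of points 1, 3 and 4 from the meeting: a fixed window after loan start, the share repaid inside it, and a two-group label. The 180-day window stands in for "6 months", and the 0.5 cut-off for the good/bad split is an arbitrary placeholder; all function names are hypothetical.

```python
def window_fits_study(loan_start_day, obs_days, window_days=180):
    """True if the whole window after loan start lies inside the observed period."""
    return loan_start_day + window_days <= obs_days


def share_repaid_in_window(principal, payments, window_days=180):
    """Share of the principal repaid within the window, capped at 1.0.

    payments: list of (days_since_loan_start, amount_kes) tuples.
    """
    paid = sum(amount for day, amount in payments if 0 <= day <= window_days)
    return min(paid / principal, 1.0)


def label_borrower(share, cutoff=0.5):
    """Split borrowers into two groups by the share repaid."""
    return "good" if share >= cutoff else "bad"


# Two hypothetical loans of 10000 KES each.
steady = share_repaid_in_window(10000, [(30, 2000), (60, 2000), (150, 2000), (200, 4000)])
late = share_repaid_in_window(10000, [(170, 1000), (300, 9000)])
print(steady, label_borrower(steady))
print(late, label_borrower(late))
```

Loans for which `window_fits_study` is false would be excluded rather than labelled, since their window runs past the end of the study.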
