# Kiva dataset exploration

## Introduction
Kiva.org is an online crowdfunding platform that extends financial services to poor and financially excluded people around the world. Kiva lenders have provided over $1 billion in loans to over 2 million people. To set investment priorities, help inform lenders, and understand their target communities, knowing the level of poverty of each borrower is critical. However, this requires inference based on a limited set of information for each borrower.

Kiva has provided a dataset of loans issued over the last two years, and participants are invited to use this data as well as source external public datasets to help Kiva build models for assessing borrower welfare levels.

More on the data and task can be found on [Kaggle](https://www.kaggle.com/kiva/data-science-for-good-kiva-crowdfunding/home).

I will use this data to practice my data science skills, especially with pandas, matplotlib and scikit-learn, and eventually I'll build a deep learning model.

## Goals
I will try to answer the following questions.
 + How is the total amount spent? Which countries receive the largest loan amounts? Which types of projects are most funded, overall and within the most funded countries?
 + How is the money spread relative to the MPI?
 + Which features are most common for a loan (gender, group, ...) and, if available, which ones are most likely to be funded (I haven't found data on accepted / refused loans yet)?
 + What are the lending trends?
 + Eventually, write a loan request generator, just because.
 
## Plan
We'll attack the problem by first exploring the data files one by one (in this document). Next, we'll try to answer the above questions in separate notebooks:
 + [How is the amount of loan spent?](http://localhost:8888/notebooks/01%20Loans%20by%20country%20exploration.ipynb)

## Exploring Kiva's Data
Let's quickly introduce the available files.
 + *kiva_loans.csv* (671K x 20) contains detailed loan information.
 + *kiva_mpi_region_locations.csv* (2773 x 9) contains MPI (Global Multidimensional Poverty Index) by location and region.
 + *loan_theme_ids.csv* (779K x 4) Loan themes by ID
 + *loan_themes_by_region.csv* (15.7K x 21) Loan themes by region
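
To double-check the shapes listed above, a small helper along these lines could be used. It is only a sketch: `table_shapes` and `KIVA_FILES` are names made up for this example, and it assumes all four CSV files sit in the same directory (the notebook only shows the path for *kiva_loans.csv*).

```python
from pathlib import Path

import pandas as pd

# File names taken from the list above.
KIVA_FILES = [
    "kiva_loans.csv",
    "kiva_mpi_region_locations.csv",
    "loan_theme_ids.csv",
    "loan_themes_by_region.csv",
]


def table_shapes(data_dir):
    """Return {file name: (rows, columns)} for every listed file found in data_dir."""
    shapes = {}
    for name in KIVA_FILES:
        path = Path(data_dir) / name
        if path.exists():  # silently skip files that are not present
            shapes[name] = pd.read_csv(path).shape
    return shapes
```

Calling `table_shapes("data/02-kiva")` would then return one `(rows, columns)` pair per file found, assuming that directory really holds all four files.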
 
Let's start with *kiva_loans*.

In [77]:
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [85]:
kldf = pd.read_csv("data/02-kiva/kiva_loans.csv",index_col=0)
kldf.head(100)

id,funded_amount,loan_amount,activity,sector,use,country_code,country,region,currency,partner_id,posted_time,disbursed_time,funded_time,term_in_months,lender_count,tags,borrower_genders,repayment_interval,date
653051,300.0,300.0,Fruits & Vegetables,Food,"To buy seasonal, fresh fruits to sell.",PK,Pakistan,Lahore,PKR,247.0,2014-01-01 06:12:39+00:00,2013-12-17 08:00:00+00:00,2014-01-02 10:06:32+00:00,12.0,12,,female,irregular,2014-01-01
653053,575.0,575.0,Rickshaw,Transportation,to repair and maintain the auto rickshaw used ...,PK,Pakistan,Lahore,PKR,247.0,2014-01-01 06:51:08+00:00,2013-12-17 08:00:00+00:00,2014-01-02 09:17:23+00:00,11.0,14,,"female, female",irregular,2014-01-01
653068,150.0,150.0,Transportation,Transportation,To repair their old cycle-van and buy another ...,IN,India,Maynaguri,INR,334.0,2014-01-01 09:58:07+00:00,2013-12-17 08:00:00+00:00,2014-01-01 16:01:36+00:00,43.0,6,"user_favorite, user_favorite",female,bullet,2014-01-01
653063,200.0,200.0,Embroidery,Arts,to purchase an embroidery machine and a variet...,PK,Pakistan,Lahore,PKR,247.0,2014-01-01 08:03:11+00:00,2013-12-24 08:00:00+00:00,2014-01-01 13:00:00+00:00,11.0,8,,female,irregular,2014-01-01
653084,400.0,400.0,Milk Sales,Food,to purchase one buffalo.,PK,Pakistan,Abdul Hakeem,PKR,245.0,2014-01-01 11:53:19+00:00,2013-12-17 08:00:00+00:00,2014-01-01 19:18:51+00:00,14.0,16,,female,monthly,2014-01-01
1080148,250.0,250.0,Services,Services,purchase leather for my business using ksh 20000.,KE,Kenya,,KES,,2014-01-01 10:06:19+00:00,2014-01-30 01:42:48+00:00,2014-01-29 14:14:57+00:00,4.0,6,,female,irregular,2014-01-01
653067,200.0,200.0,Dairy,Agriculture,To purchase a dairy cow and start a milk produ...,IN,India,Maynaguri,INR,334.0,2014-01-01 09:51:02+00:00,2013-12-16 08:00:00+00:00,2014-01-01 17:18:09+00:00,43.0,8,"user_favorite, user_favorite",female,bullet,2014-01-01
653078,400.0,400.0,Beauty Salon,Services,to buy more hair and skin care products.,PK,Pakistan,Ellahabad,PKR,245.0,2014-01-01 11:46:01+00:00,2013-12-20 08:00:00+00:00,2014-01-10 18:18:44+00:00,14.0,8,"#Elderly, #Woman Owned Biz",female,monthly,2014-01-01
653082,475.0,475.0,Manufacturing,Manufacturing,"to purchase leather, plastic soles and heels i...",PK,Pakistan,Lahore,PKR,245.0,2014-01-01 11:49:43+00:00,2013-12-20 08:00:00+00:00,2014-01-01 18:47:21+00:00,14.0,19,user_favorite,female,monthly,2014-01-01
653048,625.0,625.0,Food Production/Sales,Food,"to buy a stall, gram flour, ketchup, and coal ...",PK,Pakistan,Lahore,PKR,247.0,2014-01-01 05:41:03+00:00,2013-12-17 08:00:00+00:00,2014-01-03 15:45:04+00:00,11.0,24,,female,irregular,2014-01-01


In this sample, every loan is fully funded, and there is no information on loans that were refused or did not go through. There is also no information on whether or when the loans were repaid. The columns most useful for my purposes are loan_amount, activity, use, country and borrower_genders. We can use them to find where the money is spent and on what types of activities. Note that the borrower_genders column needs some postprocessing to identify groups and gender biases. posted_time combined with sector and activity can help find general and country-specific trends.

I'd like to check whether loan_amount and funded_amount ever differ, which would mean that some of the money is lost somewhere.
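
The postprocessing of the gender column could look roughly like this. It is a minimal sketch that assumes `borrower_genders` is a comma-separated list with one entry per borrower, as the rows above suggest; `summarize_genders` and its output column names are hypothetical choices of mine, not part of the dataset.

```python
import pandas as pd


def summarize_genders(genders: pd.Series) -> pd.DataFrame:
    """Turn comma-separated gender strings into per-loan borrower counts."""
    # Missing values become empty lists, i.e. zero known borrowers.
    tokens = genders.fillna("").apply(
        lambda s: [g.strip() for g in s.split(",") if g.strip()]
    )
    out = pd.DataFrame(index=genders.index)
    out["n_borrowers"] = tokens.apply(len)
    # list.count matches whole entries, so "male" never matches "female".
    out["n_female"] = tokens.apply(lambda t: t.count("female"))
    out["n_male"] = tokens.apply(lambda t: t.count("male"))
    out["is_group"] = out["n_borrowers"] > 1
    return out


# Values copied from rows shown in this notebook, including one missing entry.
sample = pd.Series(
    ["female", "female, female", "male", None],
    index=[653051, 653053, 653261, 1201692],
    name="borrower_genders",
)
summary = summarize_genders(sample)
print(summary)
```

On the full table this would be called as `summarize_genders(kldf["borrower_genders"])`, and the result can be joined back to `kldf` on the index.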

In [99]:
mask = kldf["funded_amount"] != kldf["loan_amount"]
kldf[mask].head()

id,funded_amount,loan_amount,activity,sector,use,country_code,country,region,currency,partner_id,posted_time,disbursed_time,funded_time,term_in_months,lender_count,tags,borrower_genders,repayment_interval,date
653261,4275.0,5000.0,Personal Housing Expenses,Housing,"to pave the ground and repair the ceiling, to ...",PS,Palestine,jenin,USD,122.0,2014-01-02 09:53:45+00:00,2013-12-24 08:00:00+00:00,,39.0,58,"#Supporting Family, #Interesting Photo, user_f...",male,monthly,2014-01-02
653256,1925.0,2400.0,Electronics Repair,Services,to pay the annual rent for his shop,IQ,Iraq,,USD,166.0,2014-01-02 09:44:10+00:00,2013-12-29 08:00:00+00:00,,15.0,41,"#Single, #Supporting Family, #Eco-friendly, us...",male,monthly,2014-01-02
653253,2625.0,3000.0,Grocery Store,Food,to pay the annual rent on his grocery store an...,IQ,Iraq,,USD,166.0,2014-01-02 09:35:12+00:00,2013-12-29 08:00:00+00:00,,15.0,72,"#First Loan, #Biz Durable Asset, #Single, user...",male,monthly,2014-01-02
653259,2750.0,3000.0,Grocery Store,Food,to install a new floor in his grocery store an...,IQ,Iraq,,USD,166.0,2014-01-02 09:51:47+00:00,2013-12-30 08:00:00+00:00,,15.0,44,"#Biz Durable Asset, #Supporting Family, user_f...",male,monthly,2014-01-02
653263,1300.0,3000.0,Clothing,Clothing,to buy shoes and clothes to sell.,PS,Palestine,jenin,USD,122.0,2014-01-02 10:03:18+00:00,2013-12-24 08:00:00+00:00,,27.0,35,"#Parent, user_favorite",female,monthly,2014-01-02


In [117]:
# Gap between requested and funded amount, for mismatched loans only
diff = kldf.loc[mask, "loan_amount"] - kldf.loc[mask, "funded_amount"]
diff.sum()

37857335.0

A seemingly large amount of money is disappearing. Let's get a few stats out of this.

First, our biggest losses.

In [115]:
diff_sorted = diff.sort_values(ascending=False)
for i in range(5):
    print(kldf.loc[diff_sorted.index[i]])

funded_amount                                 0
loan_amount                               50000
activity                     Goods Distribution
sector                                Wholesale
use                                         NaN
country_code                                 HT
country                                   Haiti
region                                      NaN
currency                                    USD
partner_id                                  506
posted_time           2016-12-09 22:54:50+00:00
disbursed_time        2017-02-28 08:00:00+00:00
funded_time                                 NaN
term_in_months                               14
lender_count                                  0
tags                                        NaN
borrower_genders                            NaN
repayment_interval                    irregular
date                                 2016-12-09
Name: 1201692, dtype: object
funded_amount                                 0
loan_amount

The dataset does not seem to provide any information on why these two amounts don't match, and some of the differences are pretty big.
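
One thing visible in the rows printed earlier is that `funded_time` is missing for every mismatched loan shown. A quick way to check whether that holds in general is sketched below. `share_without_funded_time` is a hypothetical helper, the small frame only repeats a few rows from the outputs above, and a high share would merely suggest, not prove, that these loans never reached their target.

```python
import pandas as pd


def share_without_funded_time(loans: pd.DataFrame) -> float:
    """Fraction of mismatched loans whose funded_time is missing."""
    mismatch = loans["funded_amount"] != loans["loan_amount"]
    return loans.loc[mismatch, "funded_time"].isna().mean()


# A few rows copied from the outputs above (one matched, three mismatched).
rows = pd.DataFrame(
    {
        "funded_amount": [300.0, 4275.0, 1925.0, 0.0],
        "loan_amount": [300.0, 5000.0, 2400.0, 50000.0],
        "funded_time": ["2014-01-02 10:06:32+00:00", None, None, None],
    },
    index=[653051, 653261, 653256, 1201692],
)
share = share_without_funded_time(rows)
print(share)
```

On the full table the call would be `share_without_funded_time(kldf)`.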

Let's look at some statistics on these losses (mean, variance, distribution per country). Finally, we'll see whether some partners show up more often than others.
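
A possible starting point for these statistics is sketched here. The helper name `shortfall_stats` and the `shortfall` column are my own, hypothetical choices, and the small frame only repeats rows already shown above so that the snippet runs on its own.

```python
import pandas as pd


def shortfall_stats(loans: pd.DataFrame):
    """Summarise the gap between loan_amount and funded_amount."""
    df = loans.assign(shortfall=loans["loan_amount"] - loans["funded_amount"])
    gaps = df[df["shortfall"] != 0]  # keep only mismatched loans

    # Overall count, mean and variance of the gap.
    overall = gaps["shortfall"].agg(["count", "mean", "var"])
    # Per-country distribution, largest total gap first.
    by_country = (
        gaps.groupby("country")["shortfall"]
        .agg(["count", "sum", "mean"])
        .sort_values("sum", ascending=False)
    )
    # How often each partner appears among mismatched loans.
    by_partner = gaps["partner_id"].value_counts()
    return overall, by_country, by_partner


# Rows copied from the outputs above (one matched loan, five mismatched).
rows = pd.DataFrame(
    {
        "funded_amount": [300.0, 4275.0, 1925.0, 2625.0, 2750.0, 1300.0],
        "loan_amount": [300.0, 5000.0, 2400.0, 3000.0, 3000.0, 3000.0],
        "country": ["Pakistan", "Palestine", "Iraq", "Iraq", "Iraq", "Palestine"],
        "partner_id": [247.0, 122.0, 166.0, 166.0, 166.0, 122.0],
    },
    index=[653051, 653261, 653256, 653253, 653259, 653263],
)
overall, by_country, by_partner = shortfall_stats(rows)
print(overall)
print(by_country)
print(by_partner)
```

On the full table the call would be `shortfall_stats(kldf)`. Partners with a missing `partner_id` are dropped by `value_counts` by default, which may or may not be what we want here.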