## <a class="anchor" id="main">FoCS Lab</a>
### Contributors: Marco Distrutti, Santosh Anand

* [1 Normalization](#normalize)
* [2 Duration Field](#duration)
* [3 Search lenders](#search-lenders)
* [4 How many loans](#borrowers)
* [5 Overall amount](#overall-amount)
* [6 Overall percentage](#overall-percentage)
* [7 Overall by country/year](#year-overall-percentage)


In [1]:
import pandas as pd
import numpy as np
import itertools

df_loan_lenders = pd.read_csv('kiva-kaggle/loans_lenders.csv')
raws = df_loan_lenders

In [2]:
loan_lenders = df_loan_lenders
df_loan_lenders

Unnamed: 0,loan_id,lenders
0,483693,"muc888, sam4326, camaran3922, lachheb1865, reb..."
1,483738,"muc888, nora3555, williammanashi, barbara5610,..."
2,485000,"muc888, terrystl, richardandsusan8352, sherri4..."
3,486087,"muc888, james5068, rudi5955, daniel9859, don92..."
4,534428,"muc888, niki3008, teresa9174, mike4896, david7..."
...,...,...
1387427,678999,"michael43411218, carol5987, gooddogg1, chris41..."
1387428,1207353,"rjhoward1986, jeffrey6870, trolltech4460, elys..."
1387429,1206220,"vicky7746, gooddogg1, fairspirit, craig9729960..."
1387430,1206425,"rich6705, sergiiy9766, angela7509, barbara5610..."


## 1 <a class="anchor" id="normalize">Normalization</a>
#### Normalize the loan_lenders table. In the normalized table, each row must have one loan_id and one lender.
[⇑ index](#main)

Using **Vectorized** operations our computations are extremly faster then looping over the dataset because Vectorized operations are heavily implemented in **C** procedures. So the strategy is to create both vectors (loan_id and lender) separatly and loan_id is generated by repeating id with vectorized products.

In [3]:
%%time

#FIRST AXIS
lenders = [lender for lenders in df_loan_lenders['lenders'] for lender in lenders.split(',')]

#SECOND AXIS - using vectorized operations we boost the performance using internal opimized C procedures
loan_ids = [loan_id for loan_id in df_loan_lenders['loan_id']]
cardinality = [len(lenders.split(", ")) for lenders in df_loan_lenders['lenders']]
#create a new list with repeated ids for each loan in the same original row
flatted_ids = list(itertools.chain(*[[loan_ids[i]] * cardinality[i] for i in range(0, len(loan_ids))]))

#DATAFRAME
df_loan_lenders = pd.DataFrame({'loan_id':flatted_ids, 'lender':lenders})

Wall time: 16.8 s


In [4]:
#More then 28 milions of records
df_loan_lenders

Unnamed: 0,loan_id,lender
0,483693,muc888
1,483693,sam4326
2,483693,camaran3922
3,483693,lachheb1865
4,483693,rebecca3499
...,...,...
28293926,1206425,trogdorfamily7622
28293927,1206425,danny6470
28293928,1206425,don6118
28293929,1206486,alan5175


## 2 <a class="anchor" id="duration">Duration</a>
#### For each loan, add a column duration corresponding to the number of days between the disburse time and the planned expiration time. If any of those two dates is missing, also the duration must be missing.
[⇑ index](#main)

Specifying column names and types we can avoid unnecessary data and boost performance because we can read only what we need and the interpreter won't infer data types during the scan.

In [5]:
%%time
df_loans = pd.read_csv('kiva-kaggle/loans.csv',
                       usecols=['loan_id', 'disburse_time', 'planned_expiration_time', 'country_code', 'country_name', 'loan_amount'],
                       dtype={'loan_id': np.int32, 'disburse_time': 'str', 'planned_expiration_time': 'str', 'country_code': 'str', 'country_name': 'str', 'loan_amount': 'float'})

#it should be possible to parse datetime by specifying a lambda parser in read_csv method but
#with vectorized operations we saved more then 2 minutes for this loading.
df_loans['planned_expiration_time'] = pd.to_datetime(df_loans['planned_expiration_time'])
df_loans['disburse_time']= pd.to_datetime(df_loans['disburse_time'])
df_loans["duration"] = df_loans['planned_expiration_time'] - df_loans['disburse_time']

df_loans[['disburse_time', 'planned_expiration_time', 'duration']]

Wall time: 10.1 s


Unnamed: 0,disburse_time,planned_expiration_time,duration
0,2013-12-22 08:00:00+00:00,2014-02-14 03:30:06+00:00,53 days 19:30:06
1,2013-12-20 08:00:00+00:00,2014-03-26 22:25:07+00:00,96 days 14:25:07
2,2014-01-09 08:00:00+00:00,2014-02-15 21:10:05+00:00,37 days 13:10:05
3,2014-01-17 08:00:00+00:00,2014-02-21 03:10:02+00:00,34 days 19:10:02
4,2013-12-17 08:00:00+00:00,2014-02-13 06:10:02+00:00,57 days 22:10:02
...,...,...,...
1419602,2015-11-23 08:00:00+00:00,2016-01-02 01:00:03+00:00,39 days 17:00:03
1419603,2015-11-24 08:00:00+00:00,2016-01-02 16:40:07+00:00,39 days 08:40:07
1419604,2015-11-13 08:00:00+00:00,2016-01-03 22:20:04+00:00,51 days 14:20:04
1419605,2015-11-03 08:00:00+00:00,2016-01-05 08:50:02+00:00,63 days 00:50:02


## 3 <a class="anchor" id="search-lenders">Search lenders</a>
#### Find the lenders that have funded at least twice.
[⇑ index](#main)

Pandas aggregation methods give us the possibility to create grouped dataframe and apply aggregation functions such as the occurrences counting.

In [6]:
%%time

df_lenders_multifunder = df_loan_lenders.groupby('lender').count().rename(columns={"loan_id": "funds"}).sort_values(by=["funds"])
df_lenders_multifunder = df_lenders_multifunder[df_lenders_multifunder["funds"] >= 2]
df_lenders_multifunder

Wall time: 12.7 s


Unnamed: 0_level_0,funds
lender,Unnamed: 1_level_1
theresa5301,2
louis2781,2
leah1252,2
william6302,2
louis2768,2
...,...
themissionbeltco,76986
nms,100360
gmct,127089
trolltech4460,148347


## 4 <a class="anchor" id="borrowers">How many loans</a>
#### For each country, compute how many loans have involved that country as borrowers.
[⇑ index](#main)

There are missing values in country codes (even in the country.csv), if we group the loans by country_code we miss one record.

In [7]:
df_loans.country_code.isnull().any()

True

In [8]:
%%time
drop_for_country_borrows = ['planned_expiration_time', 'disburse_time', 'duration', 'loan_amount', 'country_code']
df_countries_borrows = df_loans.drop(columns=drop_for_country_group).groupby(['country_name']).count().rename(columns={"loan_id": "borrows"}).sort_values(by=["borrows"])
df_countries_borrows

NameError: name 'drop_for_country_group' is not defined

## 5 <a class="anchor" id="overall-amount">Overall amount</a>
#### For each country, compute the overall amount of money borrowed
[⇑ index](#main)

In [9]:
%%time
drop_for_country_borrowed = ['loan_id', 'planned_expiration_time', 'disburse_time', 'duration', 'country_code']
df_countries_borrowed = df_loans.drop(columns=drop_for_country_borrowed).groupby('country_name', as_index=False).sum().rename(columns={"loan_amount": "borrowed"}).sort_values(by=["borrowed"])
df_countries_borrowed

Wall time: 114 ms


Unnamed: 0,country_name,borrowed
28,Gaza,5000.0
89,Uruguay,8000.0
9,Botswana,8000.0
90,Vanuatu,9250.0
92,Virgin Islands,10000.0
...,...,...
14,Cambodia,51613525.0
64,Paraguay,53964700.0
40,Kenya,66735975.0
65,Peru,79437775.0


## 6 <a class="anchor" id="overall-amount">Overall percentage</a>
#### Like the previous point, but expressed as a percentage of the overall amount lent.
[⇑ index](#main)

In [10]:
total_borrowed = df_countries_borrowed.borrowed.sum()
#vectorized operation
total_borrowed_perc = (df_countries_borrowed.borrowed / total_borrowed) * 100

df_countries_borrowed_perc = pd.DataFrame({'country_name': df_countries_borrowed.country_name, 'borrowed':total_borrowed_perc})
df_countries_borrowed_perc

Unnamed: 0,country_name,borrowed
28,Gaza,0.000423
89,Uruguay,0.000677
9,Botswana,0.000677
90,Vanuatu,0.000783
92,Virgin Islands,0.000846
...,...,...
14,Cambodia,4.368706
64,Paraguay,4.567716
40,Kenya,5.648711
65,Peru,6.723825


## 7 <a class="anchor" id="year-overall-percentage">Overall by country/year</a>
#### Like the three previous points, but split for each year (with respect to disburse time).
[⇑ index](#main)

The following dataset is the grouped amount by **country_name** and **year**

In [11]:
df_loans_year = df_loans.drop(columns=drop_for_country_borrowed).rename(columns={"loan_amount": "borrowed"})
df_loans_year['year'] = df_loans.disburse_time.dt.year

df_countries_borrowed_year = df_loans_year.groupby(['country_name', 'year']).sum().sort_values(by=["borrowed"])
df_countries_borrowed_year

Unnamed: 0_level_0,Unnamed: 1_level_0,borrowed
country_name,year,Unnamed: 2_level_1
Paraguay,2018.0,50.0
Pakistan,2018.0,150.0
Philippines,2018.0,300.0
Mexico,2018.0,475.0
Thailand,2012.0,1050.0
...,...,...
Kenya,2015.0,10257950.0
Philippines,2014.0,13961450.0
Philippines,2015.0,16083375.0
Philippines,2016.0,16218925.0


We need the total amounts for each year

In [12]:
overall_years = df_loans_year.groupby(['year']).sum()
overall_years

Unnamed: 0_level_0,borrowed
year,Unnamed: 1_level_1
2005.0,102850.0
2006.0,1376575.0
2007.0,15446525.0
2008.0,39423050.0
2009.0,59689475.0
2010.0,72609150.0
2011.0,93699300.0
2012.0,119977575.0
2013.0,132043925.0
2014.0,152270425.0


Again, with vectorized operations we can boost our computation. In this case a division by matching the indexes will be done.
In case of an uncomparable matching is given the following error will be displayed: **ValueError: cannot join with no overlapping index names**

In [13]:
df_countries_borrowed_year = df_countries_borrowed_year / overall_years * 100
df_countries_borrowed_year

Unnamed: 0_level_0,Unnamed: 1_level_0,borrowed
country_name,year,Unnamed: 2_level_1
Paraguay,2018.0,0.005050
Pakistan,2018.0,0.015151
Philippines,2018.0,0.030302
Mexico,2018.0,0.047979
Thailand,2012.0,0.000875
...,...,...
Kenya,2015.0,6.582544
Philippines,2014.0,9.168852
Philippines,2015.0,10.320729
Philippines,2016.0,10.075220
