## Import libraries

In [1]:
import pandas as pd

## 1. Normalize the loan_lenders table. In the normalized table, each row must have one loan_id and one lender.
### Load dataset

In [2]:
loans_lenders_df = pd.read_csv("additional-kiva-snapshot/loans_lenders.csv")

### Basic data exploration and statistics

In [3]:
loans_lenders_df.head()

Unnamed: 0,loan_id,lenders
0,483693,"muc888, sam4326, camaran3922, lachheb1865, reb..."
1,483738,"muc888, nora3555, williammanashi, barbara5610,..."
2,485000,"muc888, terrystl, richardandsusan8352, sherri4..."
3,486087,"muc888, james5068, rudi5955, daniel9859, don92..."
4,534428,"muc888, niki3008, teresa9174, mike4896, david7..."


In [4]:
loans_lenders_df.tail()

Unnamed: 0,loan_id,lenders
1387427,678999,"michael43411218, carol5987, gooddogg1, chris41..."
1387428,1207353,"rjhoward1986, jeffrey6870, trolltech4460, elys..."
1387429,1206220,"vicky7746, gooddogg1, fairspirit, craig9729960..."
1387430,1206425,"rich6705, sergiiy9766, angela7509, barbara5610..."
1387431,1206486,"alan5175, amy38101311"


How many records are there?

In [5]:
loans_lenders_df.shape

(1387432, 2)

How many NA values are in this dataframe?

In [6]:
loans_lenders_df.isna().sum()

loan_id    0
lenders    0
dtype: int64

Are there duplicated loan_id values?

In [7]:
len(loans_lenders_df['loan_id'].unique())

1387432

Good: every loan_id is unique. The next step is to normalize the table. First, split the strings in the lenders column so that each row holds a list of lenders; then expand each list so that every lender gets its own row.

In [8]:
loans_lenders_df['lenders'] = loans_lenders_df['lenders'].apply(lambda x : x.split(','))

In [9]:
loans_lenders_df = loans_lenders_df.explode('lenders').reset_index(drop=True)

Let's see if everything works as expected:

In [10]:
loans_lenders_df.head()

Unnamed: 0,loan_id,lenders
0,483693,muc888
1,483693,sam4326
2,483693,camaran3922
3,483693,lachheb1865
4,483693,rebecca3499


In [11]:
loans_lenders_df.tail()

Unnamed: 0,loan_id,lenders
28293926,1206425,trogdorfamily7622
28293927,1206425,danny6470
28293928,1206425,don6118
28293929,1206486,alan5175
28293930,1206486,amy38101311


Out of curiosity, let's look at an arbitrary row:

In [12]:
loans_lenders_df.iloc[45]

loan_id           483738
lenders     danhostetler
Name: 45, dtype: object

In [13]:
loans_lenders_df.shape

(28293931, 2)
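One detail is worth checking after the split. The raw lenders strings separate names with a comma followed by a space, so splitting on ',' alone leaves a leading space on every name except the first. The entries such as ' 000' in later outputs are consistent with this. The following is a minimal sketch on a made-up toy frame, not the notebook's own code, assuming the names should be stripped of surrounding whitespace:

```python
import pandas as pd

# Toy stand-in for loans_lenders.csv (values are illustrative only).
toy = pd.DataFrame({
    "loan_id": [1, 2],
    "lenders": ["muc888, sam4326", "alan5175"],
})

# Split into lists, give each lender its own row, then strip whitespace.
normalized = (
    toy.assign(lenders=toy["lenders"].str.split(","))
       .explode("lenders")
       .reset_index(drop=True)
)
normalized["lenders"] = normalized["lenders"].str.strip()
```

Stripping matters later on, because a name with a leading space will not match the same name in another table during a merge.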

## 2. For each loan, add a column duration corresponding to the number of days between the disburse time and the planned expiration time. If either of those two dates is missing, the duration must be missing too.

In [14]:
loans_df = pd.read_csv("additional-kiva-snapshot/loans.csv")

In [15]:
loans_df.columns

Index(['loan_id', 'loan_name', 'original_language', 'description',
       'description_translated', 'funded_amount', 'loan_amount', 'status',
       'activity_name', 'sector_name', 'loan_use', 'country_code',
       'country_name', 'town_name', 'currency_policy',
       'currency_exchange_coverage_rate', 'currency', 'partner_id',
       'posted_time', 'planned_expiration_time', 'disburse_time',
       'raised_time', 'lender_term', 'num_lenders_total',
       'num_journal_entries', 'num_bulk_entries', 'tags', 'borrower_genders',
       'borrower_pictured', 'repayment_interval', 'distribution_model'],
      dtype='object')

In [16]:
loans_df.head()

Unnamed: 0,loan_id,loan_name,original_language,description,description_translated,funded_amount,loan_amount,status,activity_name,sector_name,...,raised_time,lender_term,num_lenders_total,num_journal_entries,num_bulk_entries,tags,borrower_genders,borrower_pictured,repayment_interval,distribution_model
0,657307,Aivy,English,"Aivy, 21 years of age, is single and lives in ...",,125.0,125.0,funded,General Store,Retail,...,2014-01-15 04:48:22.000 +0000,7.0,3,2,1,,female,True,irregular,field_partner
1,657259,Idalia Marizza,Spanish,"Doña Idalia, esta casada, tiene 57 años de eda...","Idalia, 57, is married and lives with her husb...",400.0,400.0,funded,Used Clothing,Clothing,...,2014-02-25 06:42:06.000 +0000,8.0,11,2,1,,female,True,monthly,field_partner
2,658010,Aasia,English,Aasia is a 45-year-old married lady and she ha...,,400.0,400.0,funded,General Store,Retail,...,2014-01-24 23:06:18.000 +0000,14.0,16,2,1,"#Woman Owned Biz, #Supporting Family, user_fav...",female,True,monthly,field_partner
3,659347,Gulmira,Russian,"Гулмире 36 лет, замужем, вместе с супругом вос...",Gulmira is 36 years old and married. She and ...,625.0,625.0,funded,Farming,Agriculture,...,2014-01-22 05:29:28.000 +0000,14.0,21,2,1,user_favorite,female,True,monthly,field_partner
4,656933,Ricky\t,English,Ricky is a farmer who currently cultivates his...,,425.0,425.0,funded,Farming,Agriculture,...,2014-01-14 17:29:27.000 +0000,7.0,15,2,1,"#Animals, #Eco-friendly, #Sustainable Ag",male,True,bullet,field_partner


In [17]:
loans_df.tail()

Unnamed: 0,loan_id,loan_name,original_language,description,description_translated,funded_amount,loan_amount,status,activity_name,sector_name,...,raised_time,lender_term,num_lenders_total,num_journal_entries,num_bulk_entries,tags,borrower_genders,borrower_pictured,repayment_interval,distribution_model
1419602,988180,,,,,400.0,400.0,funded,Tailoring,Services,...,2015-12-28 15:44:18.000 +0000,14.0,16,4,2,"#Parent, #Repeat Borrower, #Woman Owned Biz",,,monthly,field_partner
1419603,988213,Perlita,English,"Perlita is 52 years old, married and has three...","Perlita is 52 years old, married and has three...",300.0,300.0,funded,Pigs,Agriculture,...,2015-12-22 10:37:06.000 +0000,14.0,12,1,1,"#Animals, #Elderly, #Repeat Borrower, #Woman O...",female,true,irregular,field_partner
1419604,989109,Okyeso Nyame Group,English,Okyeso Nyame group will begin its third cycle ...,Okyeso Nyame group will begin its third cycle ...,2425.0,2425.0,funded,Bakery,Food,...,2015-12-26 20:24:47.000 +0000,8.0,76,2,1,"user_favorite, #Parent, #Vegan, #Woman Owned B...","female, female, female, male, male, female","true, true, true, true, true, true",irregular,field_partner
1419605,989143,Exequila,English,"Exequila is from San Miguel, Bohol. She is in...","Exequila is from San Miguel, Bohol. She is in...",100.0,100.0,funded,Farming,Agriculture,...,2015-12-06 21:03:57.000 +0000,12.0,3,1,1,,female,true,irregular,field_partner
1419606,989240,Lydia,French,Lydia a 37ans et habite dans une zone rurale. ...,Lydia is 37 years old and lives in a rural are...,175.0,175.0,funded,Sewing,Services,...,2015-12-04 23:17:04.000 +0000,14.0,7,1,1,,female,true,monthly,field_partner


In [18]:
loans_df.describe()

Unnamed: 0,loan_id,funded_amount,loan_amount,currency_exchange_coverage_rate,partner_id,lender_term,num_lenders_total,num_journal_entries,num_bulk_entries
count,1419607.0,1419607.0,1419607.0,1098081.0,1402817.0,1419583.0,1419607.0,1419607.0,1419607.0
mean,723371.3,796.1254,832.2284,0.1163657,149.6207,13.05139,22.25389,1.502054,1.134976
std,415676.6,1034.257,1080.551,0.03699645,87.69345,7.56666,27.7741,0.9903614,0.4950988
min,84.0,0.0,25.0,0.1,1.0,1.0,0.0,1.0,1.0
25%,364216.5,275.0,300.0,0.1,98.0,8.0,8.0,1.0,1.0
50%,724035.0,500.0,500.0,0.1,139.0,12.0,15.0,1.0,1.0
75%,1082972.0,950.0,1000.0,0.1,174.0,14.0,27.0,2.0,1.0
max,1444085.0,100000.0,100000.0,0.2,557.0,195.0,3045.0,48.0,24.0


There are many columns! For now only disburse_time and planned_expiration_time seem relevant, so it is better to keep just those (plus loan_id) in a smaller dataframe.

In [19]:
columns_of_interest = ['loan_id', 'disburse_time','planned_expiration_time']

In [20]:
loans_filtered = loans_df[columns_of_interest]

In [21]:
loans_filtered.head()

Unnamed: 0,loan_id,disburse_time,planned_expiration_time
0,657307,2013-12-22 08:00:00.000 +0000,2014-02-14 03:30:06.000 +0000
1,657259,2013-12-20 08:00:00.000 +0000,2014-03-26 22:25:07.000 +0000
2,658010,2014-01-09 08:00:00.000 +0000,2014-02-15 21:10:05.000 +0000
3,659347,2014-01-17 08:00:00.000 +0000,2014-02-21 03:10:02.000 +0000
4,656933,2013-12-17 08:00:00.000 +0000,2014-02-13 06:10:02.000 +0000


In [22]:
loans_filtered.tail()

Unnamed: 0,loan_id,disburse_time,planned_expiration_time
1419602,988180,2015-11-23 08:00:00.000 +0000,2016-01-02 01:00:03.000 +0000
1419603,988213,2015-11-24 08:00:00.000 +0000,2016-01-02 16:40:07.000 +0000
1419604,989109,2015-11-13 08:00:00.000 +0000,2016-01-03 22:20:04.000 +0000
1419605,989143,2015-11-03 08:00:00.000 +0000,2016-01-05 08:50:02.000 +0000
1419606,989240,2015-11-03 08:00:00.000 +0000,2016-01-03 20:50:06.000 +0000


Let's have a look at the two variables:

In [23]:
loans_filtered.disburse_time.describe()

count                           1416794
unique                            75668
top       2017-02-01 08:00:00.000 +0000
freq                               2800
Name: disburse_time, dtype: object

In [24]:
loans_filtered.planned_expiration_time.describe()

count                           1047773
unique                           528035
top       2017-07-20 04:34:08.000 +0000
freq                                 22
Name: planned_expiration_time, dtype: object

pandas reads both columns with the generic object dtype, even though they contain dates.
How many NAs are there?

In [25]:
loans_filtered.disburse_time.isna().sum()

2813

In [26]:
loans_filtered.planned_expiration_time.isna().sum()

371834

In [27]:
loans_filtered.disburse_time = pd.to_datetime(loans_filtered.disburse_time)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value
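The message above is pandas' SettingWithCopyWarning: loans_filtered was created by selecting columns from loans_df, so pandas cannot tell whether the assignment is meant to affect the original frame. One common way to avoid it, sketched here on made-up data rather than the real file, is to take an explicit copy of the subset before assigning to it. The explicit `format` string is an assumption based on the timestamps shown earlier.

```python
import pandas as pd

# Toy stand-in for loans_df (values are illustrative only).
toy_loans = pd.DataFrame({
    "loan_id": [657307, 657259],
    "disburse_time": ["2013-12-22 08:00:00.000 +0000", None],
    "loan_amount": [125.0, 400.0],
})

# .copy() makes the subset an independent DataFrame, so later
# assignments are unambiguous and do not trigger the warning.
toy_filtered = toy_loans[["loan_id", "disburse_time"]].copy()
toy_filtered["disburse_time"] = pd.to_datetime(
    toy_filtered["disburse_time"],
    format="%Y-%m-%d %H:%M:%S.%f %z",
    utc=True,
)
```

The original frame keeps its string column; only the copy holds parsed timestamps.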


In [28]:
loans_filtered.planned_expiration_time = pd.to_datetime(loans_filtered.planned_expiration_time)

In [29]:
loans_filtered['diff_expiration_disburse'] = loans_filtered.planned_expiration_time - loans_filtered.disburse_time

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [30]:
loans_filtered.head()

Unnamed: 0,loan_id,disburse_time,planned_expiration_time,diff_expiration_disburse
0,657307,2013-12-22 08:00:00+00:00,2014-02-14 03:30:06+00:00,53 days 19:30:06
1,657259,2013-12-20 08:00:00+00:00,2014-03-26 22:25:07+00:00,96 days 14:25:07
2,658010,2014-01-09 08:00:00+00:00,2014-02-15 21:10:05+00:00,37 days 13:10:05
3,659347,2014-01-17 08:00:00+00:00,2014-02-21 03:10:02+00:00,34 days 19:10:02
4,656933,2013-12-17 08:00:00+00:00,2014-02-13 06:10:02+00:00,57 days 22:10:02


In [31]:
loans_filtered.describe()

Unnamed: 0,loan_id,diff_expiration_disburse
count,1419607.0,1044962
mean,723371.3,52 days 02:04:44.926735
std,415676.6,29 days 14:35:07.308709
min,84.0,-138 days +08:24:08
25%,364216.5,42 days 13:50:02
50%,724035.0,52 days 12:00:02
75%,1082972.0,61 days 19:50:01
max,1444085.0,1673 days 07:07:46


In [32]:
loans_filtered.diff_expiration_disburse.isna().sum()

374645

Is the number plausible? It should be less than or equal to the total number of NAs in the two columns, since a row where both dates are missing is counted twice in that total:

In [33]:
loans_filtered.disburse_time.isna().sum() + loans_filtered.planned_expiration_time.isna().sum()

374647

Apparently yes! The total is 374647, two more than the 374645 missing durations, which means that in exactly 2 rows both planned_expiration_time and disburse_time were NA. Let's see where:

In [34]:
loans_filtered[loans_filtered[['disburse_time', 'planned_expiration_time']].isna().all(axis=1)]

Unnamed: 0,loan_id,disburse_time,planned_expiration_time,diff_expiration_disburse
423734,68814,NaT,NaT,NaT
1129851,71582,NaT,NaT,NaT
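The task asks for the duration as a number of days, while diff_expiration_disburse is a timedelta. A possible conversion, sketched on two made-up rows, uses the `.dt.days` accessor, which keeps the whole-day component and leaves missing values missing. The name `duration_days` is hypothetical.

```python
import pandas as pd

# Two illustrative rows: one complete, one with a missing disburse time.
disburse = pd.to_datetime(
    pd.Series(["2013-12-22 08:00:00+00:00", None]), utc=True
)
expiration = pd.to_datetime(
    pd.Series(["2014-02-14 03:30:06+00:00", "2014-03-26 22:25:07+00:00"]),
    utc=True,
)

# Timedelta -> whole days; NaT propagates as NaN.
duration_days = (expiration - disburse).dt.days
```

If fractional days were preferred instead, dividing the timedelta by `pd.Timedelta(days=1)` would be one option.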


To do:
- Add the computed column to the loans_lenders dataframe

## 3. Find the lenders that have funded at least twice.

In [35]:
funding_freq = loans_lenders_df.groupby('lenders').lenders.count()
funding_freq

lenders
 000               39
 00000             39
 0002              70
 00mike00           1
 0101craign0101    71
                   ..
zzanita             2
zzcyna7269          1
zzinnia             1
zzmcfate           56
zzrvmf8538          2
Name: lenders, Length: 1639026, dtype: int64

In [36]:
funding_freq = funding_freq.to_frame()

In [37]:
funding_freq

Unnamed: 0_level_0,lenders
lenders,Unnamed: 1_level_1
000,39
00000,39
0002,70
00mike00,1
0101craign0101,71
...,...
zzanita,2
zzcyna7269,1
zzinnia,1
zzmcfate,56


In [38]:
funding_freq[funding_freq.lenders >= 2]

Unnamed: 0_level_0,lenders
lenders,Unnamed: 1_level_1
000,39
00000,39
0002,70
0101craign0101,71
0132575,4
...,...
zyrorl,3
zzaman,11
zzanita,2
zzmcfate,56
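The same result can be reached a little more directly with value_counts followed by a boolean filter. This is only a sketch on made-up pairs, and it assumes, as the counting above does, that each row of the normalized table is one funding event:

```python
import pandas as pd

# Toy normalized table: one row per (loan, lender) pair.
pairs = pd.DataFrame({
    "loan_id": [1, 1, 2, 3],
    "lenders": ["ann", "bob", "ann", "cy"],
})

# Count rows per lender, then keep those with at least two.
counts = pairs["lenders"].value_counts()
repeat_lenders = counts[counts >= 2]
```

Here `repeat_lenders` is a Series indexed by lender name, so `repeat_lenders.index` gives the list of names.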


## 4. For each country, compute how many loans have involved that country as borrowers.

In [39]:
loans_df.columns

Index(['loan_id', 'loan_name', 'original_language', 'description',
       'description_translated', 'funded_amount', 'loan_amount', 'status',
       'activity_name', 'sector_name', 'loan_use', 'country_code',
       'country_name', 'town_name', 'currency_policy',
       'currency_exchange_coverage_rate', 'currency', 'partner_id',
       'posted_time', 'planned_expiration_time', 'disburse_time',
       'raised_time', 'lender_term', 'num_lenders_total',
       'num_journal_entries', 'num_bulk_entries', 'tags', 'borrower_genders',
       'borrower_pictured', 'repayment_interval', 'distribution_model'],
      dtype='object')

Once again, only a few columns of the loans dataset are needed. The country columns are added to loans_filtered, so the columns of interest are:
- loan_id
- country_code
- country_name

In [40]:
loans_filtered['country_code'] = loans_df['country_code']
loans_filtered['country_name'] = loans_df['country_name']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [41]:
loans_filtered.head()

Unnamed: 0,loan_id,disburse_time,planned_expiration_time,diff_expiration_disburse,country_code,country_name
0,657307,2013-12-22 08:00:00+00:00,2014-02-14 03:30:06+00:00,53 days 19:30:06,PH,Philippines
1,657259,2013-12-20 08:00:00+00:00,2014-03-26 22:25:07+00:00,96 days 14:25:07,HN,Honduras
2,658010,2014-01-09 08:00:00+00:00,2014-02-15 21:10:05+00:00,37 days 13:10:05,PK,Pakistan
3,659347,2014-01-17 08:00:00+00:00,2014-02-21 03:10:02+00:00,34 days 19:10:02,KG,Kyrgyzstan
4,656933,2013-12-17 08:00:00+00:00,2014-02-13 06:10:02+00:00,57 days 22:10:02,PH,Philippines


In [42]:
loans_country = loans_filtered.country_name.value_counts()
loans_country

Philippines         285336
Kenya               143699
Peru                 86000
Cambodia             79701
El Salvador          64037
                     ...  
Botswana                 1
Mauritania               1
Uruguay                  1
Canada                   1
Papua New Guinea         1
Name: country_name, Length: 96, dtype: int64

## 5. For each country, compute the overall amount of money borrowed.

In [43]:
loans_filtered['loan_amount'] = loans_df['loan_amount']
loans_filtered.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,loan_id,disburse_time,planned_expiration_time,diff_expiration_disburse,country_code,country_name,loan_amount
0,657307,2013-12-22 08:00:00+00:00,2014-02-14 03:30:06+00:00,53 days 19:30:06,PH,Philippines,125.0
1,657259,2013-12-20 08:00:00+00:00,2014-03-26 22:25:07+00:00,96 days 14:25:07,HN,Honduras,400.0
2,658010,2014-01-09 08:00:00+00:00,2014-02-15 21:10:05+00:00,37 days 13:10:05,PK,Pakistan,400.0
3,659347,2014-01-17 08:00:00+00:00,2014-02-21 03:10:02+00:00,34 days 19:10:02,KG,Kyrgyzstan,625.0
4,656933,2013-12-17 08:00:00+00:00,2014-02-13 06:10:02+00:00,57 days 22:10:02,PH,Philippines,425.0


In [44]:
overall_money_borrowed = loans_filtered.groupby('country_name')['loan_amount'].agg(Money_borrowed='sum')
overall_money_borrowed

Unnamed: 0_level_0,Money_borrowed
country_name,Unnamed: 1_level_1
Afghanistan,1967950.0
Albania,4307350.0
Armenia,22950475.0
Azerbaijan,14784625.0
Belize,150175.0
...,...
Vietnam,24681100.0
Virgin Islands,10000.0
Yemen,3444000.0
Zambia,1978975.0


## 6. Like the previous point, but expressed as a percentage of the overall amount lent.

First step: find the overall amount lent.

In [45]:
overall_amount_lent = loans_filtered.loan_amount.sum()
overall_amount_lent

1181437300.0

In [46]:
money_borrowed_perc = (overall_money_borrowed/overall_amount_lent) * 100
money_borrowed_perc

Unnamed: 0_level_0,Money_borrowed
country_name,Unnamed: 1_level_1
Afghanistan,0.166573
Albania,0.364586
Armenia,1.942589
Azerbaijan,1.251410
Belize,0.012711
...,...
Vietnam,2.089074
Virgin Islands,0.000846
Yemen,0.291509
Zambia,0.167506


If everything is correct, the Money_borrowed column of money_borrowed_perc should sum to 100. Let's find out:

In [47]:
money_borrowed_perc.Money_borrowed.sum()

100.0

That's good!
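Tasks 4 to 6 can also be computed in a single pass with named aggregation. The sketch below uses made-up data; the column name Money_borrowed_perc is an assumption chosen to mirror the names used above.

```python
import pandas as pd

# Toy loans table (values are illustrative only).
toy = pd.DataFrame({
    "country_name": ["A", "A", "B"],
    "loan_amount": [100.0, 300.0, 600.0],
})

# Number of loans and total amount per country in one groupby.
summary = toy.groupby("country_name")["loan_amount"].agg(
    Loans_freq="count", Money_borrowed="sum"
)

# Share of the overall amount, as a percentage.
summary["Money_borrowed_perc"] = (
    summary["Money_borrowed"] / summary["Money_borrowed"].sum() * 100
)
```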

## 7. Like the three previous points, but split for each year (with respect to disburse time).

In [48]:
loans_filtered.head()

Unnamed: 0,loan_id,disburse_time,planned_expiration_time,diff_expiration_disburse,country_code,country_name,loan_amount
0,657307,2013-12-22 08:00:00+00:00,2014-02-14 03:30:06+00:00,53 days 19:30:06,PH,Philippines,125.0
1,657259,2013-12-20 08:00:00+00:00,2014-03-26 22:25:07+00:00,96 days 14:25:07,HN,Honduras,400.0
2,658010,2014-01-09 08:00:00+00:00,2014-02-15 21:10:05+00:00,37 days 13:10:05,PK,Pakistan,400.0
3,659347,2014-01-17 08:00:00+00:00,2014-02-21 03:10:02+00:00,34 days 19:10:02,KG,Kyrgyzstan,625.0
4,656933,2013-12-17 08:00:00+00:00,2014-02-13 06:10:02+00:00,57 days 22:10:02,PH,Philippines,425.0


In [49]:
loans_filtered['disburse_time'].describe()

count                       1416794
unique                        75668
top       2017-02-01 08:00:00+00:00
freq                           2800
first     2005-04-14 05:27:55+00:00
last      2018-03-19 07:00:00+00:00
Name: disburse_time, dtype: object

In [50]:
type(loans_filtered['disburse_time'])

pandas.core.series.Series

In [51]:
loans_filtered['disburse_year'] = loans_filtered['disburse_time'].dt.year

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


### For each country, compute how many loans have involved that country as borrowers.

In [52]:
loans_country_year = loans_filtered.groupby(['country_name', 'disburse_year'])['loan_id'].agg(Loans_freq='count')
loans_country_year

Unnamed: 0_level_0,Unnamed: 1_level_0,Loans_freq
country_name,disburse_year,Unnamed: 2_level_1
Afghanistan,2007.0,408
Afghanistan,2008.0,370
Afghanistan,2009.0,678
Afghanistan,2010.0,632
Afghanistan,2011.0,247
...,...,...
Zimbabwe,2013.0,426
Zimbabwe,2014.0,2078
Zimbabwe,2015.0,600
Zimbabwe,2016.0,808
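The years appear as floats (2007.0) because disburse_time contains missing values, and `.dt.year` falls back to a float column to represent them. Note also that groupby drops rows whose key is missing by default, so loans without a disburse time are not counted in this table. A minimal sketch, on made-up timestamps, of keeping integer years with pandas' nullable Int64 dtype:

```python
import pandas as pd

# One valid timestamp and one missing value.
times = pd.to_datetime(
    pd.Series(["2013-12-22 08:00:00+00:00", None]), utc=True
)

# Nullable integers keep 2013 as an integer and the gap as <NA>.
years = times.dt.year.astype("Int64")
```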


To make sure everything was done correctly, let's compare the per-year results for Afghanistan with the total obtained for the same country in task 4:

In [53]:
loans_country_year.loc['Afghanistan']

Unnamed: 0_level_0,Loans_freq
disburse_year,Unnamed: 1_level_1
2007.0,408
2008.0,370
2009.0,678
2010.0,632
2011.0,247
2015.0,1
2016.0,1


In [54]:
loans_country_year.loc['Afghanistan'].sum()

Loans_freq    2337
dtype: int64

In [55]:
loans_country['Afghanistan']

2337

The results match, so the per-year split preserves the total for this country.

### For each country, compute the overall amount of money borrowed.

In [56]:
overall_money_borrowed_year = loans_filtered.groupby(
    ['country_name', 'disburse_year'])['loan_amount'].agg(Money_borrowed='sum')
overall_money_borrowed_year

Unnamed: 0_level_0,Unnamed: 1_level_0,Money_borrowed
country_name,disburse_year,Unnamed: 2_level_1
Afghanistan,2007.0,194975.0
Afghanistan,2008.0,365375.0
Afghanistan,2009.0,585125.0
Afghanistan,2010.0,563350.0
Afghanistan,2011.0,245125.0
...,...,...
Zimbabwe,2013.0,678525.0
Zimbabwe,2014.0,1311575.0
Zimbabwe,2015.0,723625.0
Zimbabwe,2016.0,788600.0


Let's perform the same check:

In [57]:
overall_money_borrowed_year.loc['Afghanistan']

Unnamed: 0_level_0,Money_borrowed
disburse_year,Unnamed: 1_level_1
2007.0,194975.0
2008.0,365375.0
2009.0,585125.0
2010.0,563350.0
2011.0,245125.0
2015.0,6000.0
2016.0,8000.0


In [58]:
overall_money_borrowed_year.loc['Afghanistan'].sum()

Money_borrowed    1967950.0
dtype: float64

In [59]:
overall_money_borrowed.loc['Afghanistan'].sum()

1967950.0

Everything looks fine

### Like the previous point, but expressed as a percentage of the overall amount lent.

In [60]:
money_borrowed_year_perc = (overall_money_borrowed_year/overall_amount_lent) * 100
money_borrowed_year_perc

Unnamed: 0_level_0,Unnamed: 1_level_0,Money_borrowed
country_name,disburse_year,Unnamed: 2_level_1
Afghanistan,2007.0,0.016503
Afghanistan,2008.0,0.030926
Afghanistan,2009.0,0.049527
Afghanistan,2010.0,0.047683
Afghanistan,2011.0,0.020748
...,...,...
Zimbabwe,2013.0,0.057432
Zimbabwe,2014.0,0.111015
Zimbabwe,2015.0,0.061250
Zimbabwe,2016.0,0.066749
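The percentages above are relative to the overall amount lent across all years. If the task is read as asking for each country's share within its own year, the denominator would instead be the yearly total. That reading is an assumption, sketched here on made-up data:

```python
import pandas as pd

# Toy loans with a country and a disburse year.
toy = pd.DataFrame({
    "country_name": ["A", "B", "A"],
    "disburse_year": [2007, 2007, 2008],
    "loan_amount": [100.0, 300.0, 50.0],
})

by_country_year = toy.groupby(
    ["country_name", "disburse_year"]
)["loan_amount"].sum()

# Total per year, broadcast back to every (country, year) row.
year_totals = by_country_year.groupby(level="disburse_year").transform("sum")
share_within_year = by_country_year / year_totals * 100
```

With this variant the shares sum to 100 within each year rather than over the whole table.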


## 8. For each lender, compute the overall amount of money lent. For each loan that has more than one lender, you must assume that all lenders contributed the same amount.

In [61]:
loans_df.head()

Unnamed: 0,loan_id,loan_name,original_language,description,description_translated,funded_amount,loan_amount,status,activity_name,sector_name,...,raised_time,lender_term,num_lenders_total,num_journal_entries,num_bulk_entries,tags,borrower_genders,borrower_pictured,repayment_interval,distribution_model
0,657307,Aivy,English,"Aivy, 21 years of age, is single and lives in ...",,125.0,125.0,funded,General Store,Retail,...,2014-01-15 04:48:22.000 +0000,7.0,3,2,1,,female,True,irregular,field_partner
1,657259,Idalia Marizza,Spanish,"Doña Idalia, esta casada, tiene 57 años de eda...","Idalia, 57, is married and lives with her husb...",400.0,400.0,funded,Used Clothing,Clothing,...,2014-02-25 06:42:06.000 +0000,8.0,11,2,1,,female,True,monthly,field_partner
2,658010,Aasia,English,Aasia is a 45-year-old married lady and she ha...,,400.0,400.0,funded,General Store,Retail,...,2014-01-24 23:06:18.000 +0000,14.0,16,2,1,"#Woman Owned Biz, #Supporting Family, user_fav...",female,True,monthly,field_partner
3,659347,Gulmira,Russian,"Гулмире 36 лет, замужем, вместе с супругом вос...",Gulmira is 36 years old and married. She and ...,625.0,625.0,funded,Farming,Agriculture,...,2014-01-22 05:29:28.000 +0000,14.0,21,2,1,user_favorite,female,True,monthly,field_partner
4,656933,Ricky\t,English,Ricky is a farmer who currently cultivates his...,,425.0,425.0,funded,Farming,Agriculture,...,2014-01-14 17:29:27.000 +0000,7.0,15,2,1,"#Animals, #Eco-friendly, #Sustainable Ag",male,True,bullet,field_partner


In [62]:
loans_df.shape

(1419607, 31)

In [63]:
len(loans_df.loan_id.unique())

1419607

In [64]:
loans_lenders_df

Unnamed: 0,loan_id,lenders
0,483693,muc888
1,483693,sam4326
2,483693,camaran3922
3,483693,lachheb1865
4,483693,rebecca3499
...,...,...
28293926,1206425,trogdorfamily7622
28293927,1206425,danny6470
28293928,1206425,don6118
28293929,1206486,alan5175


The first step is to compute the number of lenders involved in each loan:

In [65]:
loan_id_number_of_lenders= loans_lenders_df.groupby('loan_id')['lenders'].agg(number_of_lenders="count")
loan_id_number_of_lenders

Unnamed: 0_level_0,number_of_lenders
loan_id,Unnamed: 1_level_1
84,3
85,2
86,3
88,3
89,4
...,...
1444051,1
1444053,1
1444058,1
1444063,1


Let's see if the result is correct by manually checking some loans:

In [66]:
loans_lenders_df[loans_lenders_df.loan_id == 84]

Unnamed: 0,loan_id,lenders
4395150,84,ward
4395151,84,michael
4395152,84,brooke


In [67]:
loans_lenders_df[loans_lenders_df.loan_id == 85]

Unnamed: 0,loan_id,lenders
17666221,85,michael
17666222,85,patrick


In [68]:
loans_lenders_df[loans_lenders_df.loan_id == 1444065]

Unnamed: 0,loan_id,lenders
16789771,1444065,el5018


Let's now create a table that contains only the loan_id and the amount of money lent:

In [69]:
loan_id_amount = loans_df.filter(['loan_id', 'loan_amount'])
loan_id_amount.head()

Unnamed: 0,loan_id,loan_amount
0,657307,125.0
1,657259,400.0
2,658010,400.0
3,659347,625.0
4,656933,425.0


Join loans_lenders_df with loan_id_number_of_lenders, then join the result with loan_id_amount:

In [70]:
loans_lenders_join_number_of_lenders = loans_lenders_df
loans_lenders_join_number_of_lenders= loans_lenders_join_number_of_lenders.join(loan_id_number_of_lenders, on="loan_id")

In [71]:
loans_lenders_join_number_of_lenders_join_loan_id_amount = loans_lenders_join_number_of_lenders.join(loan_id_amount, on='loan_id', lsuffix='',rsuffix= '_right')

In [72]:
loans_lenders_join_number_of_lenders_join_loan_id_amount.head()

Unnamed: 0,loan_id,lenders,number_of_lenders,loan_id_right,loan_amount
0,483693,muc888,40,77678.0,675.0
1,483693,sam4326,40,77678.0,675.0
2,483693,camaran3922,40,77678.0,675.0
3,483693,lachheb1865,40,77678.0,675.0
4,483693,rebecca3499,40,77678.0,675.0
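The output above deserves a second look: loan_id_right (77678.0) differs from loan_id (483693). DataFrame.join with on= matches the caller's column against the other frame's index, and loan_id_amount still has its default positional index, so the rows appear to have been matched by position rather than by loan id. A key-based merge avoids this. The following is a sketch on made-up data, not a re-run of the notebook, and the resulting numbers would differ from those shown here:

```python
import pandas as pd

# Toy normalized pairs and a toy loan-amount table (illustrative values).
pairs = pd.DataFrame({
    "loan_id": [10, 10, 20],
    "lenders": ["ann", "bob", "cy"],
})
amounts = pd.DataFrame({
    "loan_id": [20, 10],
    "loan_amount": [250.0, 400.0],
})

# Match on the loan_id column of both frames; validate guards
# against duplicated loan ids on the right-hand side.
merged = pairs.merge(amounts, on="loan_id", how="left", validate="many_to_one")

# Equal split among the lenders of each loan, then sum per lender.
merged["number_of_lenders"] = merged.groupby("loan_id")["lenders"].transform("count")
merged["loan_amount_per_lender"] = merged["loan_amount"] / merged["number_of_lenders"]
money_lent = merged.groupby("lenders")["loan_amount_per_lender"].sum()
```

An equivalent fix would be to call set_index('loan_id') on the amount table before using join.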


In [73]:
del loans_lenders_join_number_of_lenders_join_loan_id_amount['loan_id_right']
loans_lenders_join_number_of_lenders_join_loan_id_amount

Unnamed: 0,loan_id,lenders,number_of_lenders,loan_amount
0,483693,muc888,40,675.0
1,483693,sam4326,40,675.0
2,483693,camaran3922,40,675.0
3,483693,lachheb1865,40,675.0
4,483693,rebecca3499,40,675.0
...,...,...,...,...
28293926,1206425,trogdorfamily7622,8,175.0
28293927,1206425,danny6470,8,175.0
28293928,1206425,don6118,8,175.0
28293929,1206486,alan5175,2,250.0


In [74]:
loans_lenders_join_number_of_lenders_join_loan_id_amount['loan_amount_per_lender'] = loans_lenders_join_number_of_lenders_join_loan_id_amount['loan_amount'] / loans_lenders_join_number_of_lenders_join_loan_id_amount['number_of_lenders']
loans_lenders_join_number_of_lenders_join_loan_id_amount.head()

Unnamed: 0,loan_id,lenders,number_of_lenders,loan_amount,loan_amount_per_lender
0,483693,muc888,40,675.0,16.875
1,483693,sam4326,40,675.0,16.875
2,483693,camaran3922,40,675.0,16.875
3,483693,lachheb1865,40,675.0,16.875
4,483693,rebecca3499,40,675.0,16.875


Now it is possible to find the final answer:

In [75]:
money_lent_per_lenders = loans_lenders_join_number_of_lenders_join_loan_id_amount.groupby('lenders')['loan_amount_per_lender'].agg(money_lent='sum')
money_lent_per_lenders

Unnamed: 0_level_0,money_lent
lenders,Unnamed: 1_level_1
000,1497.361065
00000,1404.284128
0002,5760.727175
00mike00,5.921053
0101craign0101,2732.053037
...,...
zzanita,106.250000
zzcyna7269,9.821429
zzinnia,16.000000
zzmcfate,5614.476700


To do:
* add some comments
* Invert the query order for optimization

## 9. For each country, compute the difference between the overall amount of money lent and the overall amount of money borrowed.
Since the country of the lender is often unknown, you can assume that the true distribution among the countries is the same as the one computed from the rows where the country is known.

In [76]:
lenders = pd.read_csv("additional-kiva-snapshot/lenders.csv")
lenders.head()

Unnamed: 0,permanent_name,display_name,city,state,country_code,member_since,occupation,loan_because,loan_purchase_num,invited_by,num_invited
0,qian3013,Qian,,,,1461300457,,,1.0,,0
1,reena6733,Reena,,,,1461300634,,,9.0,,0
2,mai5982,Mai,,,,1461300853,,,,,0
3,andrew86079135,Andrew,,,,1461301091,,,5.0,Peter Tan,0
4,nguyen6962,Nguyen,,,,1461301154,,,,,0


In [77]:
loans_lenders_df.head()

Unnamed: 0,loan_id,lenders
0,483693,muc888
1,483693,sam4326
2,483693,camaran3922
3,483693,lachheb1865
4,483693,rebecca3499


### Money lent

Merge lenders with loans_lenders:

In [78]:
lenders_loan_id = pd.merge(loans_lenders_df, lenders[['permanent_name', 'country_code']],left_on="lenders", right_on="permanent_name")

In [79]:
lenders_loan_id.head()

Unnamed: 0,loan_id,lenders,permanent_name,country_code
0,483693,muc888,muc888,US
1,483738,muc888,muc888,US
2,485000,muc888,muc888,US
3,486087,muc888,muc888,US
4,534428,muc888,muc888,US


Fill the missing country_code values by sampling from the distribution of the known ones. To avoid distorting that distribution, first drop the duplicated rows so that each lender is counted once, regardless of how many loans they funded:

In [80]:
lenders_country = lenders_loan_id[['lenders', 'country_code']].drop_duplicates()
lenders_country.head()

Unnamed: 0,lenders,country_code
0,muc888,US
696,klaus5005,DE
722,bernadette6835,AU
742,thomas9243,CH
822,herta6220,DE


How many NAs are there?

In [81]:
missing_country_code = lenders_country.country_code.isnull()
sum(missing_country_code)

105736

In [82]:
unique_countries_distribution = lenders_country.country_code.value_counts(normalize=True)
unique_countries_distribution

US    0.621301
CA    0.099305
AU    0.058015
GB    0.041945
DE    0.027591
        ...   
AD    0.000005
SC    0.000005
LS    0.000005
LY    0.000005
MP    0.000005
Name: country_code, Length: 195, dtype: float64

Now it is possible to replace the values:

In [83]:
import numpy as np

In [84]:
len(missing_country_code.index)

287569

In [85]:
len(lenders_country[missing_country_code])

105736

In [86]:
lenders_country.loc[missing_country_code,'country_code'] = np.random.choice(unique_countries_distribution.index,
                                                                            size=len(lenders_country[missing_country_code]),
                                                                            p=unique_countries_distribution.values)
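Because the fill-in is random, every run of the cell above assigns different countries, and the figures computed afterwards change slightly each time. A seeded generator makes the result reproducible. This sketch uses made-up lenders, and the seed value is arbitrary:

```python
import numpy as np
import pandas as pd

# Toy lender table with two unknown countries (illustrative values).
toy = pd.DataFrame({
    "lenders": ["ann", "bob", "cy", "dee", "eve"],
    "country_code": ["US", "US", "CA", None, None],
})

# Distribution of the known codes (NaN is ignored by value_counts).
distribution = toy["country_code"].value_counts(normalize=True)

# A seeded generator returns the same draws on every run.
rng = np.random.default_rng(seed=0)
missing = toy["country_code"].isna()
toy.loc[missing, "country_code"] = rng.choice(
    distribution.index.to_numpy(),
    size=int(missing.sum()),
    p=distribution.to_numpy(),
)
```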


In [87]:
lenders_country.head()

Unnamed: 0,lenders,country_code
0,muc888,US
696,klaus5005,DE
722,bernadette6835,AU
742,thomas9243,CH
822,herta6220,DE


How many NAs are left?

In [88]:
missing_country_code = lenders_country.country_code.isnull()
sum(missing_country_code)

0

In [89]:
money_lent_per_lenders.index

Index([' 000', ' 00000', ' 0002', ' 00mike00', ' 0101craign0101', ' 0132575',
       ' 0154884', ' 0161130', ' 0169713', ' 0185429',
       ...
       'zyra9641', 'zyrah8525', 'zyrorl', 'zzaba', 'zzaman', 'zzanita',
       'zzcyna7269', 'zzinnia', 'zzmcfate', 'zzrvmf8538'],
      dtype='object', name='lenders', length=1639026)

In [90]:
money_lent_per_lender_country = pd.merge(lenders_country, money_lent_per_lenders, left_on="lenders", right_index=True)
money_lent_per_lender_country

Unnamed: 0,lenders,country_code,money_lent
0,muc888,US,42267.265428
696,klaus5005,DE,875.155380
722,bernadette6835,AU,547.609981
742,thomas9243,CH,3721.544485
822,herta6220,DE,6782.691073
...,...,...,...
1387427,mary8615,US,54.545455
1387428,kelly3610,US,28.571429
1387429,joe4973,US,237.500000
1387430,kali7409,GB,80.000000


With a country attached to every lender, the amount of money lent by each country is a simple group-by sum:

In [91]:
money_lent_by_country = money_lent_per_lender_country.groupby('country_code')['money_lent'].agg(overall_money_lent="sum")
money_lent_by_country

Unnamed: 0_level_0,overall_money_lent
country_code,Unnamed: 1_level_1
AD,521.212121
AE,256953.020817
AF,10145.559730
AL,1313.569513
AM,1155.037989
...,...
WF,1062.940455
XK,943.601103
YE,7406.988204
ZA,56507.196848


### Money borrowed
The answer to this question was already computed in task 5:

In [92]:
overall_money_borrowed

Unnamed: 0_level_0,Money_borrowed
country_name,Unnamed: 1_level_1
Afghanistan,1967950.0
Albania,4307350.0
Armenia,22950475.0
Azerbaijan,14784625.0
Belize,150175.0
...,...
Vietnam,24681100.0
Virgin Islands,10000.0
Yemen,3444000.0
Zambia,1978975.0


The only problem is that one dataset uses the full country name while the other uses the 2-letter code. Let's add the code as a new column to the borrowed dataset:

In [93]:
country_code_name = loans_df[['country_code', 'country_name']].drop_duplicates()
country_code_name.head()

Unnamed: 0,country_code,country_name
0,PH,Philippines
1,HN,Honduras
2,PK,Pakistan
3,KG,Kyrgyzstan
7,SV,El Salvador


In [94]:
country_code_name.shape

(96, 2)

Are there any NA?

In [95]:
sum(country_code_name.country_code.isna())

1

In [96]:
sum(country_code_name.country_name.isna())

0

In [97]:
country_code_name[country_code_name.country_code.isna()]

Unnamed: 0,country_code,country_name
82889,,Namibia


Let's replace this value with "NA", which, according to the ISO 3166 standard, is the correct code for Namibia. Source: https://en.wikipedia.org/wiki/ISO_3166-2:NA

In [98]:
country_code_name.loc[country_code_name.country_code.isna(),'country_code'] = "NA"

In [99]:
country_code_name[country_code_name.country_name == 'Namibia']

Unnamed: 0,country_code,country_name
82889,,Namibia
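
The code was most likely missing in the first place because `pd.read_csv`, by default, parses the string "NA" as a missing value. As an alternative to patching the value afterwards, the file could be read with the default NA markers disabled. A minimal sketch on made-up CSV text, assuming the original file stores Namibia's code as the literal string NA:

```python
import io
import pandas as pd

# Made-up CSV in which Namibia carries the ISO code "NA".
csv_text = "country_code,country_name\nNA,Namibia\nPH,Philippines\n"

# Default parsing: the string "NA" is treated as a missing value.
default_df = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False disables the built-in NA markers;
# na_values=[""] still treats genuinely empty fields as missing.
fixed_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])

print(default_df["country_code"].isna().sum())
print(fixed_df["country_code"].tolist())
```

The same default applies to every file read with `pd.read_csv`, so any other table holding country codes may need the same treatment.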


In [100]:
money_borrowed_by_country = pd.merge(country_code_name, overall_money_borrowed, left_on="country_name", right_index=True)
money_borrowed_by_country.head()

Unnamed: 0,country_code,country_name,Money_borrowed
0,PH,Philippines,97984600.0
1,HN,Honduras,11989325.0
2,PK,Pakistan,24995850.0
3,KG,Kyrgyzstan,14726900.0
7,SV,El Salvador,41691550.0


Let's now merge the lent and borrowed amounts on the country code:

In [101]:
money_lent_by_country.head()

Unnamed: 0_level_0,overall_money_lent
country_code,Unnamed: 1_level_1
AD,521.212121
AE,256953.020817
AF,10145.55973
AL,1313.569513
AM,1155.037989


In [102]:
lent_borrowed_by_country = pd.merge(money_lent_by_country, money_borrowed_by_country, left_index=True, right_on='country_code')
lent_borrowed_by_country

Unnamed: 0,overall_money_lent,country_code,country_name,Money_borrowed
845,10145.559730,AF,Afghanistan,1967950.0
129,1313.569513,AL,Albania,4307350.0
27,1155.037989,AM,Armenia,22950475.0
44,514.103050,AZ,Azerbaijan,14784625.0
15239,2744.046977,BA,Bosnia and Herzegovina,477250.0
...,...,...,...,...
118349,40801.157644,VU,Vanuatu,9250.0
750,943.601103,XK,Kosovo,3083025.0
83,7406.988204,YE,Yemen,3444000.0
858,56507.196848,ZA,South Africa,1006525.0


In [103]:
lent_borrowed_by_country['difference_lent_borr'] = lent_borrowed_by_country['overall_money_lent'] - lent_borrowed_by_country["Money_borrowed"]
lent_borrowed_by_country


Unnamed: 0,overall_money_lent,country_code,country_name,Money_borrowed,difference_lent_borr
845,10145.559730,AF,Afghanistan,1967950.0,-1.957804e+06
129,1313.569513,AL,Albania,4307350.0,-4.306036e+06
27,1155.037989,AM,Armenia,22950475.0,-2.294932e+07
44,514.103050,AZ,Azerbaijan,14784625.0,-1.478411e+07
15239,2744.046977,BA,Bosnia and Herzegovina,477250.0,-4.745060e+05
...,...,...,...,...,...
118349,40801.157644,VU,Vanuatu,9250.0,3.155116e+04
750,943.601103,XK,Kosovo,3083025.0,-3.082081e+06
83,7406.988204,YE,Yemen,3444000.0,-3.436593e+06
858,56507.196848,ZA,South Africa,1006525.0,-9.500178e+05
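
Note that `pd.merge` performs an inner join by default, so countries that only lend (AD and AE, for example, appear in `money_lent_by_country` but not in the merged table) are dropped from the difference. If those countries should be kept, an outer join with missing amounts filled with 0 is one option. A minimal sketch on made-up numbers, assuming the lent table has its country code as a regular column:

```python
import pandas as pd

# Made-up per-country totals; column names mirror the notebook's.
lent = pd.DataFrame({
    "country_code": ["AD", "AE", "AF"],
    "overall_money_lent": [500.0, 250000.0, 10000.0],
})
borrowed = pd.DataFrame({
    "country_code": ["AF", "PH"],
    "Money_borrowed": [2000000.0, 98000000.0],
})

# how="outer" keeps every country; indicator=True adds a _merge column
# recording whether a row came from the left table, the right table, or both.
both = pd.merge(lent, borrowed, on="country_code", how="outer", indicator=True)

# A country missing on one side lent (or borrowed) nothing.
amount_cols = ["overall_money_lent", "Money_borrowed"]
both[amount_cols] = both[amount_cols].fillna(0)
both["difference_lent_borr"] = both["overall_money_lent"] - both["Money_borrowed"]
print(both[["country_code", "difference_lent_borr", "_merge"]])
```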


## 10. Which country has the highest ratio between the difference computed at the previous point and the population?

In [104]:
country_stats = pd.read_csv("additional-kiva-snapshot/country_stats.csv")
country_stats.head()

Unnamed: 0,country_name,country_code,country_code3,continent,region,population,population_below_poverty_line,hdi,life_expectancy,expected_years_of_schooling,mean_years_of_schooling,gni,kiva_country_name
0,India,IN,IND,Asia,Southern Asia,1339180127,21.9,0.623559,68.322,11.69659,6.298834,5663.474799,India
1,Nigeria,NG,NGA,Africa,Western Africa,190886311,70.0,0.527105,53.057,9.970482,6.0,5442.901264,Nigeria
2,Mexico,MX,MEX,Americas,Central America,129163276,46.2,0.761683,76.972,13.29909,8.554985,16383.10668,Mexico
3,Pakistan,PK,PAK,Asia,Southern Asia,197015955,29.5,0.550354,66.365,8.10691,5.08946,5031.173074,Pakistan
4,Bangladesh,BD,BGD,Asia,Southern Asia,164669751,31.5,0.578824,71.985,10.178706,5.241577,3341.490722,Bangladesh


In [106]:
country_pop = country_stats[['country_name', 'population']]
country_pop.head()

Unnamed: 0,country_name,population
0,India,1339180127
1,Nigeria,190886311
2,Mexico,129163276
3,Pakistan,197015955
4,Bangladesh,164669751


In [108]:
country_pop_money_stats = pd.merge(lent_borrowed_by_country, country_pop, on="country_name")
country_pop_money_stats.head()

Unnamed: 0,overall_money_lent,country_code,country_name,Money_borrowed,difference_lent_borr,population
0,10145.55973,AF,Afghanistan,1967950.0,-1957804.0,35530081
1,1313.569513,AL,Albania,4307350.0,-4306036.0,2930187
2,1155.037989,AM,Armenia,22950475.0,-22949320.0,2930450
3,514.10305,AZ,Azerbaijan,14784625.0,-14784110.0,9827589
4,2744.046977,BA,Bosnia and Herzegovina,477250.0,-474506.0,3507017


In [109]:
country_pop_money_stats['ratio_diff_money_pop'] = country_pop_money_stats['difference_lent_borr'] / country_pop_money_stats['population']
country_pop_money_stats.head()

Unnamed: 0,overall_money_lent,country_code,country_name,Money_borrowed,difference_lent_borr,population,ratio_diff_money_pop
0,10145.55973,AF,Afghanistan,1967950.0,-1957804.0,35530081,-0.055103
1,1313.569513,AL,Albania,4307350.0,-4306036.0,2930187,-1.469543
2,1155.037989,AM,Armenia,22950475.0,-22949320.0,2930450,-7.83133
3,514.10305,AZ,Azerbaijan,14784625.0,-14784110.0,9827589,-1.504348
4,2744.046977,BA,Bosnia and Herzegovina,477250.0,-474506.0,3507017,-0.135302


In [112]:
country_pop_money_stats.sort_values(by=['ratio_diff_money_pop'], ascending=False)

Unnamed: 0,overall_money_lent,country_code,country_name,Money_borrowed,difference_lent_borr,population,ratio_diff_money_pop
11,1.104320e+07,CA,Canada,50000.0,1.099320e+07,36624199,0.300162
70,8.598337e+07,US,United States,46352000.0,3.963137e+07,324459463,0.122146
15,8.657727e+04,CN,China,380525.0,-2.939477e+05,1409517397,-0.000209
71,5.732136e+03,UY,Uruguay,8000.0,-2.267864e+03,3456750,-0.000656
35,5.978758e+03,LK,Sri Lanka,74800.0,-6.882124e+04,20876917,-0.003297
...,...,...,...,...,...,...,...
46,4.873492e+03,NI,Nicaragua,30153225.0,-3.014835e+07,6217581,-4.848888
41,9.147357e+03,MN,Mongolia,15348375.0,-1.533923e+07,3075647,-4.987317
61,3.019364e+03,SV,El Salvador,41691550.0,-4.168853e+07,6377853,-6.536452
2,1.155038e+03,AM,Armenia,22950475.0,-2.294932e+07,2930450,-7.831330
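
In the sorted table above, Canada has the highest ratio (about 0.30), followed by the United States (about 0.12). To extract the top country directly instead of reading it off the sorted output, `idxmax` can be used. A minimal sketch on made-up numbers (`toy_stats` is a hypothetical stand-in for `country_pop_money_stats`):

```python
import pandas as pd

# Small made-up table with the same column names as country_pop_money_stats.
toy_stats = pd.DataFrame({
    "country_name": ["Canada", "United States", "Armenia"],
    "difference_lent_borr": [1.1e7, 4.0e7, -2.3e7],
    "population": [36_000_000, 324_000_000, 3_000_000],
})
toy_stats["ratio_diff_money_pop"] = (
    toy_stats["difference_lent_borr"] / toy_stats["population"]
)

# idxmax returns the index label of the largest ratio (NaN values are skipped).
top_row = toy_stats.loc[toy_stats["ratio_diff_money_pop"].idxmax()]
print(top_row["country_name"])
```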


## 11. Which country has the highest ratio between the difference computed at point 9 and the population that is not below the poverty line?

In [113]:
country_stats.describe()

Unnamed: 0,population,population_below_poverty_line,hdi,life_expectancy,expected_years_of_schooling,mean_years_of_schooling,gni
count,174.0,152.0,171.0,168.0,168.0,168.0,168.0
mean,43219240.0,28.476974,0.695777,71.059048,12.938925,8.282872,17494.527727
std,151506700.0,17.544183,0.159795,8.640918,3.034415,3.195489,19271.207947
min,196440.0,0.2,0.35244,48.943,4.87162,1.441532,587.473961
25%,4063056.0,15.1,0.555122,64.71225,10.75063,5.61966,3433.157219
50%,10120100.0,23.0,0.727287,73.3025,13.089405,8.567027,10492.68126
75%,31839050.0,38.725,0.827402,77.177,15.176165,11.23447,24666.624997
max,1409517000.0,82.5,0.949423,84.163,20.43272,13.37,129915.6009


Population below poverty line appears to be a percentage. Therefore the first step is to compute the population NOT below the poverty line.

In [114]:
country_stats['pop_above_poverty'] = (country_stats['population'] / 100) * (100 - country_stats['population_below_poverty_line'])
country_stats.head()

Unnamed: 0,country_name,country_code,country_code3,continent,region,population,population_below_poverty_line,hdi,life_expectancy,expected_years_of_schooling,mean_years_of_schooling,gni,kiva_country_name,pop_above_poverty
0,India,IN,IND,Asia,Southern Asia,1339180127,21.9,0.623559,68.322,11.69659,6.298834,5663.474799,India,1045900000.0
1,Nigeria,NG,NGA,Africa,Western Africa,190886311,70.0,0.527105,53.057,9.970482,6.0,5442.901264,Nigeria,57265890.0
2,Mexico,MX,MEX,Americas,Central America,129163276,46.2,0.761683,76.972,13.29909,8.554985,16383.10668,Mexico,69489840.0
3,Pakistan,PK,PAK,Asia,Southern Asia,197015955,29.5,0.550354,66.365,8.10691,5.08946,5031.173074,Pakistan,138896200.0
4,Bangladesh,BD,BGD,Asia,Southern Asia,164669751,31.5,0.578824,71.985,10.178706,5.241577,3341.490722,Bangladesh,112798800.0


Now I can compute the desired ratio in the same way as in the previous point.

In [115]:
country_poverty = country_stats[['country_name', 'pop_above_poverty']]
country_poverty.head()

Unnamed: 0,country_name,pop_above_poverty
0,India,1045900000.0
1,Nigeria,57265890.0
2,Mexico,69489840.0
3,Pakistan,138896200.0
4,Bangladesh,112798800.0


In [116]:
country_poverty_money_stats = pd.merge(lent_borrowed_by_country, country_poverty, on="country_name")
country_poverty_money_stats.head()

Unnamed: 0,overall_money_lent,country_code,country_name,Money_borrowed,difference_lent_borr,pop_above_poverty
0,10145.55973,AF,Afghanistan,1967950.0,-1957804.0,22810310.0
1,1313.569513,AL,Albania,4307350.0,-4306036.0,2511170.0
2,1155.037989,AM,Armenia,22950475.0,-22949320.0,1992706.0
3,514.10305,AZ,Azerbaijan,14784625.0,-14784110.0,9346037.0
4,2744.046977,BA,Bosnia and Herzegovina,477250.0,-474506.0,2903810.0


In [118]:
country_poverty_money_stats['ratio_diff_money_pop_above_pov'] = country_poverty_money_stats['difference_lent_borr'] / country_poverty_money_stats['pop_above_poverty']
country_poverty_money_stats.head()

Unnamed: 0,overall_money_lent,country_code,country_name,Money_borrowed,difference_lent_borr,pop_above_poverty,ratio_diff_money_pop_above_pov
0,10145.55973,AF,Afghanistan,1967950.0,-1957804.0,22810310.0,-0.08583
1,1313.569513,AL,Albania,4307350.0,-4306036.0,2511170.0,-1.714753
2,1155.037989,AM,Armenia,22950475.0,-22949320.0,1992706.0,-11.516661
3,514.10305,AZ,Azerbaijan,14784625.0,-14784110.0,9346037.0,-1.581859
4,2744.046977,BA,Bosnia and Herzegovina,477250.0,-474506.0,2903810.0,-0.163408


In [119]:
country_poverty_money_stats.sort_values(by=['ratio_diff_money_pop_above_pov'], ascending=False)

Unnamed: 0,overall_money_lent,country_code,country_name,Money_borrowed,difference_lent_borr,pop_above_poverty,ratio_diff_money_pop_above_pov
11,1.104320e+07,CA,Canada,50000.0,1.099320e+07,3.318152e+07,0.331305
70,8.598337e+07,US,United States,46352000.0,3.963137e+07,2.754661e+08,0.143870
15,8.657727e+04,CN,China,380525.0,-2.939477e+05,1.363003e+09,-0.000216
71,5.732136e+03,UY,Uruguay,8000.0,-2.267864e+03,3.121445e+06,-0.000727
35,5.978758e+03,LK,Sri Lanka,74800.0,-6.882124e+04,1.947816e+07,-0.003533
...,...,...,...,...,...,...,...
61,3.019364e+03,SV,El Salvador,41691550.0,-4.168853e+07,4.151982e+06,-10.040633
55,8.414246e+02,PY,Paraguay,53964700.0,-5.396386e+07,5.299189e+06,-10.183418
2,1.155038e+03,AM,Armenia,22950475.0,-2.294932e+07,1.992706e+06,-11.516661
53,7.102916e+04,PR,Puerto Rico,441900.0,-3.708708e+05,,
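
Canada is again at the top. Puerto Rico sits in the last row only because its poverty figure is missing, so its ratio is NaN, and `sort_values` places NaN values last by default. A minimal sketch on made-up numbers that drops such rows explicitly before ranking (`toy` is a hypothetical stand-in for `country_poverty_money_stats`):

```python
import numpy as np
import pandas as pd

# Made-up ratios; Puerto Rico has no poverty data, hence NaN.
toy = pd.DataFrame({
    "country_name": ["Canada", "Armenia", "Puerto Rico"],
    "ratio_diff_money_pop_above_pov": [0.33, -11.5, np.nan],
})

# Rows with an undefined ratio are removed before ranking.
ranked = (
    toy.dropna(subset=["ratio_diff_money_pop_above_pov"])
       .sort_values(by="ratio_diff_money_pop_above_pov", ascending=False)
)
print(ranked["country_name"].tolist())
```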


## 12. For each year, compute the total amount of loans. Each loan that has planned expiration time and disburse time in different years must have its amount distributed proportionally to the number of days in each year.
For example, a loan with disburse time December 1st, 2016, planned expiration time January 30th 2018, and amount 5000USD has an amount of 5000USD * 31 / (31+365+30) = 363.85 for 2016, 5000USD * 365 / (31+365+30) = 4284.04 for 2017, and 5000USD * 30 / (31+365+30) = 352.11 for 2018.
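
The arithmetic of this example can be checked with a small helper. This is only a sketch under the assumption that both the disburse date and the planned expiration date are counted as days of the loan, which is what the 31 + 365 + 30 split implies; `split_amount_by_year` is a hypothetical name, not part of the dataset or of pandas.

```python
from datetime import date

def split_amount_by_year(amount, start, end):
    """Distribute amount over calendar years, proportionally to the days in each year.

    Both the start and the end date are counted, matching the worked example.
    """
    total_days = (end - start).days + 1
    shares = {}
    for year in range(start.year, end.year + 1):
        # Clip the loan period to the current calendar year.
        first = max(start, date(year, 1, 1))
        last = min(end, date(year, 12, 31))
        days = (last - first).days + 1
        shares[year] = amount * days / total_days
    return shares

shares = split_amount_by_year(5000, date(2016, 12, 1), date(2018, 1, 30))
for year, value in shares.items():
    print(year, round(value, 2))
```

For the loan above this gives 363.85 for 2016, 4284.04 for 2017 and 352.11 for 2018, and the three shares add up to the full 5000 USD.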