## Import libraries

In [1]:
import pandas as pd

## 1. Normalize the loans_lenders table. In the normalized table, each row must have one loan_id and one lender.
### Load dataset

In [2]:
loans_lenders_df = pd.read_csv("additional-kiva-snapshot/loans_lenders.csv")

### Basic data exploration and statistics

In [3]:
loans_lenders_df.head()

,loan_id,lenders
0,483693,"muc888, sam4326, camaran3922, lachheb1865, reb..."
1,483738,"muc888, nora3555, williammanashi, barbara5610,..."
2,485000,"muc888, terrystl, richardandsusan8352, sherri4..."
3,486087,"muc888, james5068, rudi5955, daniel9859, don92..."
4,534428,"muc888, niki3008, teresa9174, mike4896, david7..."


In [4]:
loans_lenders_df.tail()

,loan_id,lenders
1387427,678999,"michael43411218, carol5987, gooddogg1, chris41..."
1387428,1207353,"rjhoward1986, jeffrey6870, trolltech4460, elys..."
1387429,1206220,"vicky7746, gooddogg1, fairspirit, craig9729960..."
1387430,1206425,"rich6705, sergiiy9766, angela7509, barbara5610..."
1387431,1206486,"alan5175, amy38101311"


How many records are there?

In [5]:
loans_lenders_df.shape

(1387432, 2)

How many NA values are in this dataframe?

In [6]:
loans_lenders_df.isna().sum()

loan_id    0
lenders    0
dtype: int64

Are there duplicated loan_id values?

In [7]:
len(loans_lenders_df['loan_id'].unique())

1387432

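That's good: the number of unique ids equals the number of rows, so each loan_id appears exactly once. Now I can focus on normalizing the table. The first step is to split each string in the lenders column into a list of lenders; the second is to expand each list so that every lender gets its own row.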

In [8]:
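# Split on commas and strip the space that follows each comma in the raw strings
loans_lenders_df['lenders'] = loans_lenders_df['lenders'].apply(lambda x: [name.strip() for name in x.split(',')])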

In [9]:
loans_lenders_df = loans_lenders_df.explode('lenders').reset_index(drop=True)

Let's see if everything worked as expected:

In [10]:
loans_lenders_df.head()

,loan_id,lenders
0,483693,muc888
1,483693,sam4326
2,483693,camaran3922
3,483693,lachheb1865
4,483693,rebecca3499


In [11]:
loans_lenders_df.tail()

,loan_id,lenders
28293926,1206425,trogdorfamily7622
28293927,1206425,danny6470
28293928,1206425,don6118
28293929,1206486,alan5175
28293930,1206486,amy38101311


Out of curiosity, let's look at an arbitrary row:

In [12]:
loans_lenders_df.iloc[45]

loan_id           483738
lenders     danhostetler
Name: 45, dtype: object

In [13]:
loans_lenders_df.shape

(28293931, 2)
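
The split-and-explode steps above can be checked on a small frame where the expected result is easy to verify by hand. This is only a sketch: the `toy` frame and its lender names are made up for illustration and are not part of the Kiva data.

```python
import pandas as pd

# Hypothetical toy data mimicking the raw layout: one row per loan,
# with all lenders packed into a single comma-separated string.
toy = pd.DataFrame({
    "loan_id": [1, 2],
    "lenders": ["anna, bob, carl", "dora"],
})

# Turn each string into a list, then give every list element its own row.
toy["lenders"] = toy["lenders"].str.split(",")
normalized = toy.explode("lenders").reset_index(drop=True)

# Remove the whitespace left over from the ", " separator.
normalized["lenders"] = normalized["lenders"].str.strip()

# Sanity check: one row per (loan, lender) pair, i.e. 3 + 1 rows in total.
assert len(normalized) == 4
print(normalized)
```

The same kind of check on the real table would compare the final row count with the total number of names in the original strings.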

## 2. For each loan, add a column duration corresponding to the number of days between the disburse time and the planned expiration time. If either of the two dates is missing, the duration must also be missing.
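
One possible approach, sketched on made-up data: parse both columns as datetimes and subtract them, since a missing date (`NaT`) propagates through the subtraction and yields a missing duration. The column names `disburse_time` and `planned_expiration_time`, the toy values, and the sign convention (expiration minus disbursal) are assumptions, not taken from the actual loans table.

```python
import pandas as pd

# Hypothetical toy loans; column names and values are assumptions.
loans = pd.DataFrame({
    "loan_id": [1, 2, 3],
    "disburse_time": ["2014-01-01 08:00:00", None, "2014-03-10 12:00:00"],
    "planned_expiration_time": ["2014-01-31 08:00:00", "2014-02-15 00:00:00", None],
})

# Parse the date columns; missing or unparseable values become NaT.
for col in ["disburse_time", "planned_expiration_time"]:
    loans[col] = pd.to_datetime(loans[col], errors="coerce")

# Subtracting datetimes gives a Timedelta; .dt.days extracts whole days.
# Rows where either date is NaT end up with a missing duration.
loans["duration"] = (loans["planned_expiration_time"] - loans["disburse_time"]).dt.days

print(loans[["loan_id", "duration"]])
```

Note that `.dt.days` truncates to whole days; if fractional days matter, dividing the difference by `pd.Timedelta(days=1)` is an alternative.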

## 3. Find the lenders that have funded at least twice.
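
With a normalized table (one row per loan_id and lender), this can be sketched as counting the distinct loans per lender and keeping those with a count of at least two. The toy frame below is hypothetical and only stands in for the normalized table built in section 1.

```python
import pandas as pd

# Hypothetical normalized data: one row per (loan_id, lender) pair.
normalized = pd.DataFrame({
    "loan_id": [1, 1, 2, 2, 3],
    "lenders": ["anna", "bob", "anna", "carl", "anna"],
})

# Count distinct loans per lender, so repeated rows for the same loan
# would not be counted twice.
loans_per_lender = normalized.groupby("lenders")["loan_id"].nunique()

# Keep the lenders that funded at least two loans.
repeat_lenders = loans_per_lender[loans_per_lender >= 2].index.tolist()
print(repeat_lenders)
```

Using `nunique()` rather than a plain row count is a design choice that guards against a lender appearing more than once for the same loan.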

## 4. For each country, compute how many loans have involved that country as borrowers.
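
A minimal sketch of a per-country loan count, assuming the loans table has one row per loan and a country column. The column name `country_name` and the toy rows are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy loans; the country_name column is an assumption.
loans = pd.DataFrame({
    "loan_id": [1, 2, 3, 4],
    "country_name": ["Kenya", "Peru", "Kenya", None],
})

# Count loans per country; groupby drops rows with a missing country by default.
loans_per_country = loans.groupby("country_name")["loan_id"].count()
print(loans_per_country)
```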

## 5. For each country, compute the overall amount of money borrowed.
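
A minimal sketch of the per-country total, again on made-up data. The column names `country_name` and `loan_amount` are assumptions about the loans table, and the amounts are invented.

```python
import pandas as pd

# Hypothetical toy loans; column names and amounts are assumptions.
loans = pd.DataFrame({
    "loan_id": [1, 2, 3],
    "country_name": ["Kenya", "Peru", "Kenya"],
    "loan_amount": [500, 1000, 250],
})

# Sum the borrowed amount per country and sort from largest to smallest.
amount_per_country = (
    loans.groupby("country_name")["loan_amount"]
    .sum()
    .sort_values(ascending=False)
)
print(amount_per_country)
```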