# DATA APPROPRIATION

In this notebook, I scrape data from Kiva.org to get additional information on the banks partnered with Kiva, specifically:
 - their default rates,
 - their delinquency rates,
 - their rural percentage,
 - the number of months each bank has been partnered with Kiva.
  
This information is collected to see if lenders care about default rates and delinquency rates when choosing whether to fund someone.

In [1]:
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [2]:
URL= 'https://www.kiva.org/about/where-kiva-works?region_filter=All&stage_filter=All&show_closed_partners=show_closed_partners&sort_by=riskRating'

First, I have to identify where on the page the information is located. I do this with the browser's Inspect tool, which is available by right-clicking on any web page. Once I have identified the locations, I can extract the data using BeautifulSoup and for loops like the ones below.
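As a minimal sketch of that extraction step, the snippet below parses a small hand-written HTML fragment. The fragment is hypothetical: it only imitates the `h1.name`, `div.timeOnKiva`, and `td` elements that this notebook targets, and the live page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates the structure targeted in this notebook;
# it is not copied from kiva.org.
sample_html = """
<h1 class="name"> Example Partner </h1>
<div class="timeOnKiva">12 months on Kiva</div>
<table>
  <tr><th>A</th><th>B</th></tr>
  <tr><td>x</td><td>1.25%</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find_all returns every matching tag; class_ filters on the CSS class.
names = [tag.text.strip() for tag in sample_soup.find_all('h1', class_='name')]
months = [tag.text for tag in sample_soup.find_all('div', class_='timeOnKiva')]
cells = [tag.text for tag in sample_soup.find_all('td')]
```

Each list holds the text of the matching tags in document order, which is what the loops in the following cells rely on.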

In [3]:
r = requests.get(URL)
r.raise_for_status()  # stop early if the request did not succeed
soup = BeautifulSoup(r.text, 'html.parser')

In [4]:
#LOCATE THE DATA
soup.find_all('td')[3]

<td>2.61%</td>

In [5]:
# Collect the name of every field partner from the <h1 class="name"> headings
banks = [h1.text.strip() for h1 in soup.find_all('h1', class_='name')]
    

In [6]:
# str.strip() takes a set of characters, not a suffix: it removes any of the
# characters in 'months on Kiva' from both ends, leaving the month count as a string.
time = [div.text.strip('months on Kiva')
        for div in soup.find_all('div', class_='timeOnKiva')]
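The call to `strip` in the cell above works because the month count consists of digits, none of which appear in the character set being stripped. A short sketch of that behavior, using hypothetical label text rather than values taken from the page:

```python
# str.strip(chars) removes any leading/trailing characters found in `chars`;
# it does not remove the literal substring.
label = '114 months on Kiva'   # hypothetical cell text
month_count = label.strip('months on Kiva')
month_number = int(month_count)

# A case where the difference matters: the leading 'o' and 'n' are stripped too,
# because both letters belong to the character set.
surprise = 'one month on Kiva'.strip('months on Kiva')
```

Here `month_count` is `'114'`, while `surprise` is just `'e'`. If the labels ever contained leading letters from that set, `str.replace` or a regular expression would be a safer choice.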
    

In [7]:
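# This assumes every partner row holds five <td> cells, with the
# delinquency rate in the fourth one (offset 3).
cells = soup.find_all('td')     # parse the cells once instead of on every iteration
rows = soup.find_all('tr')[1:]  # skip the header row
delinquency = []
for n in range(len(rows)):
    try:
        delinquency.append(cells[3 + 5 * n].text.strip('%'))
    except IndexError:
        # Fewer cells than expected: record a missing value
        delinquency.append(np.nan)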
    

In [8]:
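# Same layout assumption as above: the default rate is the fifth cell (offset 4).
# strip() is given a character set here, so it trims tabs, newlines, '%' and the
# letters of '(see note)' from both ends of the cell text.
cells = soup.find_all('td')
rows = soup.find_all('tr')[1:]  # skip the header row
default_rate = []
for n in range(len(rows)):
    try:
        default_rate.append(cells[4 + 5 * n].text.strip('<td>\n\t% (see note)'))
    except IndexError:
        default_rate.append(np.nan)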
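Indexing the flat list of `td` cells with a stride of five depends on every row having exactly five cells. One alternative, sketched here under that caveat, is to read the cells row by row so that a short row affects only itself. The HTML, the column headers, and the helper `cell_text` are all hypothetical.

```python
import numpy as np
from bs4 import BeautifulSoup

# Hypothetical table; the second data row is deliberately missing two cells.
sample_html = """
<table>
  <tr><th>Partner</th><th>Risk</th><th>Time</th><th>Delinquency</th><th>Default</th></tr>
  <tr><td>A</td><td>3</td><td>10</td><td>2.61%</td><td>1.50%</td></tr>
  <tr><td>B</td><td>4</td><td>20</td></tr>
  <tr><td>C</td><td>5</td><td>30</td><td>0.00%</td><td>N/A</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')


def cell_text(row, position):
    """Return the text of the td at `position` without '%', or NaN if the row is too short."""
    row_cells = row.find_all('td')
    if position < len(row_cells):
        return row_cells[position].text.strip().rstrip('%')
    return np.nan


sample_rows = sample_soup.find_all('tr')[1:]  # skip the header row
sample_delinquency = [cell_text(row, 3) for row in sample_rows]
sample_default = [cell_text(row, 4) for row in sample_rows]
```

With the stride approach, the short row for partner B would shift every later value by two positions; reading per row keeps partner C aligned.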

In [9]:
print(len(delinquency))
print(len(default_rate))
print(len(banks))
print(len(time))

545
545
545
545


In [10]:
default_rate[1]

'0.08'

In [11]:
default_rate

['1.50',
 '0.08',
 '0.02',
 '0.02',
 '0.26',
 '0.06',
 '0.17',
 '0.17',
 '0.94',
 '0.88',
 '0.33',
 '0.52',
 '0.13',
 '1.08',
 '0.10',
 '0.12',
 '0.27',
 '0.38',
 '0.02',
 '0.47',
 '0.30',
 '0.07',
 '0.00',
 '0.61',
 '0.18',
 '0.51',
 '6.84',
 '0.13',
 '28.34',
 '0.28',
 '31.71',
 '0.03',
 '1.66',
 '0.00',
 '1.26',
 '0.00',
 '3.46',
 '0.23',
 '0.38',
 '0.00',
 '0.25',
 '0.00',
 'N/A',
 '0.00',
 '0.83',
 '0.12',
 '0.66',
 '1.89',
 '0.51',
 '0.17',
 '8.36',
 '0.78',
 '8.42',
 '1.25',
 '0.01',
 '11.77',
 '0.36',
 '0.62',
 '0.56',
 '0.00',
 '0.00',
 '0.00',
 '0.00',
 '0.03',
 '0.73',
 'N/A',
 '0.00',
 '0.00',
 '0.21',
 '0.16',
 'N/A',
 '0.00',
 '0.61',
 '0.45',
 '0.27',
 '0.06',
 '1.78',
 '0.35',
 '0.00',
 '0.03',
 '0.00',
 '0.32',
 '0.00',
 '0.00',
 '19.53',
 '0.00',
 '5.50',
 '1.59',
 '0.89',
 '0.00',
 '1.48',
 '0.91',
 '1.80',
 '0.00',
 '0.06',
 '0.04',
 '5.15',
 '0.50',
 '2.96',
 '0.02',
 '24.92',
 '0.00',
 '3.78',
 '0.00',
 '19.40',
 '0.26',
 '1.14',
 '2.76',
 '0.07',
 '0.01',
 '0.93'

Next, I put this information in a dataframe.

In [12]:
repayments = pd.DataFrame()

In [13]:
repayments['Field Partner Name'] = banks
repayments['delinquency'] = delinquency
repayments['time'] = time
repayments['default_rate'] = default_rate
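The scraped values are still strings at this point, and some default rates are the literal text `'N/A'`. A minimal sketch of converting such columns to numbers, using hypothetical values in the same form as the scraper output:

```python
import pandas as pd

# Hypothetical values in the same string form the scraper produces.
sample = pd.DataFrame({
    'delinquency': ['2.61', '6.34', '0.00'],
    'time': ['114', '122', '181'],
    'default_rate': ['1.50', 'N/A', '0.02'],
})

# errors='coerce' turns unparseable entries such as 'N/A' into NaN.
for column in ['delinquency', 'time', 'default_rate']:
    sample[column] = pd.to_numeric(sample[column], errors='coerce')
```

After the conversion the rate columns are floats, and the `'N/A'` entry becomes a missing value that `isnull()` can count.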

In [14]:
repayments

Unnamed: 0,Field Partner Name,delinquency,time,default_rate
0,CrediCampo,2.61,114,1.50
1,Credo,6.34,122,0.08
2,Negros Women for Tomorrow Foundation (NWTF),3.25,140,0.02
3,Hattha Bank,0.22,157,0.02
4,Phillip Bank,0.53,178,0.26
...,...,...,...,...
540,Prisma Microfinance,0.00,181,0.00
541,Senegal Ecovillage Microfinance Fund (SEM),0.00,181,5.13
542,Regional Economic Development Center (REDC Bul...,0.00,181,14.46
543,The Shurush Initiative,0.00,181,57.16


Now that I have the bank names, I use a Kaggle dataset that contains Partner IDs and bank names to attach a Partner ID to each bank. However, this Kaggle dataset is a snapshot taken 3 years ago, so it has fewer banks, and the resulting dataset has only 259 rows, down from 545.

In [15]:
loans = pd.read_csv('/Users/nicolas/Downloads/loan_themes_by_region 2.csv')

In [16]:
loans['Partner ID'].value_counts()

123    1207
169     992
136     952
126     673
177     671
       ... 
229       1
274       1
540       1
532       1
543       1
Name: Partner ID, Length: 302, dtype: int64

In [17]:
repayments1 = repayments.merge(loans[['Field Partner Name','Partner ID','rural_pct']],on='Field Partner Name')

In [19]:
repayements1 = repayments1.drop_duplicates(ignore_index=True)
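The Kaggle file contains many rows per partner (Partner ID 123 appears 1207 times in the value counts above), so the merge repeats each matched bank once per row, and `drop_duplicates` then collapses the repeats. A toy sketch of both steps, where `left` and `right` are hypothetical stand-ins for `repayments` and `loans`:

```python
import pandas as pd

# Hypothetical stand-ins for `repayments` and `loans`.
left = pd.DataFrame({'Field Partner Name': ['A', 'B', 'C'],
                     'default_rate': ['1.50', '0.08', '0.02']})
right = pd.DataFrame({'Field Partner Name': ['A', 'A', 'A', 'B'],
                      'Partner ID': [1, 1, 1, 2],
                      'rural_pct': [87.0, 87.0, 87.0, 73.0]})

# The default inner merge repeats 'A' once per matching row and drops 'C',
# which has no match in `right`.
merged = left.merge(right, on='Field Partner Name')

# Identical rows collapse to one per partner.
deduplicated = merged.drop_duplicates(ignore_index=True)
```

If a partner had different `rural_pct` values across rows, `drop_duplicates` would keep one row per distinct value, so checking `deduplicated['Partner ID'].is_unique` may be worthwhile.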

In [20]:
repayements1

Unnamed: 0,Field Partner Name,delinquency,time,default_rate,Partner ID,rural_pct
0,CrediCampo,2.61,114,1.50,199,87.0
1,Credo,6.34,122,0.08,181,73.0
2,Negros Women for Tomorrow Foundation (NWTF),3.25,140,0.02,145,69.0
3,Kashf Foundation,77.93,104,0.17,245,25.0
4,One Acre Fund,0.00,114,0.94,202,99.0
...,...,...,...,...,...,...
254,Salone Microfinance Trust (SMT),0.00,166,1.15,57,66.0
255,Aqroinvest Credit Union,0.00,166,5.23,56,96.0
256,Credit Mongol,0.00,131,9.97,42,30.0
257,Komak Credit Union,0.00,171,1.73,30,85.0


Now that we have this final dataset with Partner IDs and the default data, we can merge the new columns into the main dataset using the common Partner ID column. However, since I have scraped data for 259 banks while the main dataset has 500 banks, the merge produces a subset of the main dataset.
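A small sketch of why an inner merge yields a subset, using hypothetical stand-ins for the loan table and the partner table. The `validate` argument is an optional safeguard and is not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the main loan table (one row per loan)
# and the scraped partner table (one row per partner).
sample_loans = pd.DataFrame({'loan_id': [10, 11, 12, 13],
                             'partner_id': [1, 1, 2, 9]})
sample_partners = pd.DataFrame({'partner_id': [1, 2],
                                'default_rate': [1.50, 0.08]})

# how='inner' keeps only loans whose partner_id appears in both tables.
# validate='many_to_one' raises an error if partner_id repeats on the right.
sample_merged = sample_loans.merge(sample_partners, how='inner',
                                   on='partner_id', validate='many_to_one')

# Share of loans that survive the merge.
kept_share = len(sample_merged) / len(sample_loans)
```

The loan with `partner_id` 9 has no scraped partner data, so it is dropped and `kept_share` is 0.75 in this toy case.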

In [21]:
loans2 = pd.read_csv('/Users/nicolas/Downloads/loans.csv')

In [22]:
# Lower-case all column names so they match the key used in the merge below
loans2.columns = loans2.columns.str.lower()

In [23]:
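# Rename the columns of the de-duplicated frame so that 'Partner ID' becomes
# 'partner_id', matching the lower-cased key in loans2. Note the spelling:
# `repayements1` is the de-duplicated frame that the merge below uses.
new_columns = ['Field Partner Name','delinquency','time','default_rate','partner_id','rural_pct']
repayements1.columns = new_columns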

In [25]:
#MERGE
loans3 = loans2.merge(repayements1[['time','delinquency','default_rate','rural_pct','partner_id']],how='inner',on='partner_id')

In [26]:
loans3.isnull().sum()

loan_id                                  0
loan_name                            25235
original_language                    21677
description                          21683
description_translated              272437
funded_amount                            0
loan_amount                              0
status                                   0
image_id                             21677
video_id                           1511259
activity_name                            0
sector_name                              0
loan_use                             21686
country_code                            31
country_name                             0
town_name                            97508
currency_policy                          0
currency_exchange_coverage_rate     243071
currency                                 0
partner_id                               0
posted_time                              0
planned_expiration_time             152761
disburse_time                            2
raised_time

In [36]:
loans2.index

RangeIndex(start=0, stop=2054041, step=1)

In [37]:
# SUBSET
loans3.index

Int64Index([      0,       1,       2,       3,       4,       5,       6,
                  7,       8,       9,
            ...
            1511565, 1511566, 1511567, 1511568, 1511569, 1511570, 1511571,
            1511572, 1511573, 1511574],
           dtype='int64', length=1511575)

In [28]:
loans3.to_csv('loans2.csv')
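By default `to_csv` also writes the index, which comes back as an extra unnamed column when the file is read again. A minimal sketch of the difference, using an in-memory buffer and hypothetical data instead of a file on disk:

```python
import io
import pandas as pd

sample = pd.DataFrame({'loan_id': [10, 11], 'partner_id': [1, 2]})

# Default: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)
reloaded_with_index = pd.read_csv(io.StringIO(with_index.getvalue()))

# index=False writes only the data columns.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
reloaded_without_index = pd.read_csv(io.StringIO(without_index.getvalue()))
```

Passing `index=False` may be preferable here if the saved file is meant to be reloaded with the same columns as `loans3`.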