# Foundations of Computer Science Project
Group members: Andrea Pianalto, Michele Sartori, Silvia Gloria Tamburini

## Description of the datasets required for the project

*Kiva* is a website that acts as an intermediary between people who need money (typically living in second- or third-world countries and wanting to carry out a project) and people who lend it to them. On the site you can take either role: to lend, click *lend* and choose the activity to fund and how much to give; to borrow, click *borrow* and start a fundraising campaign (in practice a loan) to receive the money.

In what follows, the people who lend are called *lenders* and those who receive the loan are the *borrowers*. A single loan is a *loan*, identified by its key, *loan_id*.

**country_stats:** information about the countries the _borrowers_ come from.\
**GEconV4:** geographic data on the areas the _borrowers_ come from.\
**lenders:** profile of the _lenders_ (occupation, country of origin, how long they have been lending, and more).\
**loan_coord:** geographic coordinates of the individual transactions.\
**loans:** information on the individual _borrowers_.\
**loans_lenders:** for each loan, the associated lenders.

## Importing the required libraries

In [1]:
import numpy as np
import pandas as pd
import re
#import matplotlib as mp
import random

## Import the files here when needed
Follow the instructions given for each exercise. Import only the files you need (to avoid filling up memory).

In [5]:
loan_lenders = pd.read_csv('Data/loans_lenders.csv')
loan_lenders['lenders'] = loan_lenders['lenders'].str.split(r',\s')  # raw string: '\s' is not a valid Python escape

In [2]:
loans = pd.read_csv('Data/loans.csv')

In [3]:
lenders = pd.read_csv('Data/lenders.csv')

In [114]:
countries = pd.read_csv('Data/country_stats.csv')

In [8]:
#This file is in the data folder only if Exercise 1 has been run at some point before.
norm = pd.read_csv('Data/norm_loan_lenders.csv')

## Exercise 1
Normalize the loan_lenders table. In the normalized table, each row must have one loan_id and one lender.

**Required files**: loan_lenders \
**Warning**: this exercise takes longer to run than the others (about 10 minutes)

### First solution: faster but less readable

In [6]:
# DataFrame.append was removed in pandas 2.0, so the rows are collected first and the frame is built once.
rows = [{'loan_id': loan_id, 'lenders': l} for loan_id, lender_list in zip(loan_lenders['loan_id'], loan_lenders['lenders']) for l in lender_list]
norm = pd.DataFrame(rows, columns=['loan_id','lenders'])

### Second solution: slower but more readable

In [86]:
def dataframe_from_row (index):
    id_loan = loan_lenders.at[index,'loan_id']
    lenders = loan_lenders.at[index,'lenders']
    rep_id = [id_loan]*len(lenders)
    data = {'loan_id':rep_id,'lender_norm':lenders}
    return data

In [87]:
indexes = len(list(loan_lenders.index))

In [99]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
norm = pd.concat([pd.DataFrame.from_dict(dataframe_from_row(index)) for index in range(indexes)], ignore_index=True)

Wall time: 2.94 s
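
A third option, shown here only as a sketch on toy data, is `DataFrame.explode`, which performs the same normalization in one call. The frame `toy_loan_lenders` and its contents are made up for illustration; the column names follow the ones used above.

```python
import pandas as pd

# Toy stand-in for loan_lenders after the split step: one list of lenders per loan.
toy_loan_lenders = pd.DataFrame({
    'loan_id': [1, 2],
    'lenders': [['anna', 'bob'], ['carl']],
})

# explode() emits one row per list element and repeats loan_id accordingly.
toy_norm = toy_loan_lenders.explode('lenders').reset_index(drop=True)
print(toy_norm)
```

On the real data this would be applied to `loan_lenders` after the `str.split` step; the timing has not been measured here.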


### Exporting the file

In [7]:
norm.to_csv('Data/norm_loan_lenders.csv',index=False)

### Re-importing the file
This frees some of the memory used by the previous computation by replacing *norm* with the imported file.

In [None]:
norm = pd.read_csv('Data/norm_loan_lenders.csv')

## Exercise 2
For each loan, add a column duration corresponding to the number of days between the disburse time and the planned expiration time. If any of those two dates is missing, also the duration must be missing.

**Required files**: loans

### Exploring the problem

*Disburse time*: the moment the *borrower* receives the funds. Only the date matters here. \
*Planned expiration time*: the moment the loan expires and the money must be paid back. Again, only the date matters.

*Loan length / repayment term*: the number of months from the moment the loan is actually given to the *borrower* until the last lender has to be repaid.

In [3]:
loans[['loan_id', 'planned_expiration_time', 'disburse_time']]

Unnamed: 0,loan_id,planned_expiration_time,disburse_time
0,657307,2014-02-14 03:30:06.000 +0000,2013-12-22 08:00:00.000 +0000
1,657259,2014-03-26 22:25:07.000 +0000,2013-12-20 08:00:00.000 +0000
2,658010,2014-02-15 21:10:05.000 +0000,2014-01-09 08:00:00.000 +0000
3,659347,2014-02-21 03:10:02.000 +0000,2014-01-17 08:00:00.000 +0000
4,656933,2014-02-13 06:10:02.000 +0000,2013-12-17 08:00:00.000 +0000
...,...,...,...
1419602,988180,2016-01-02 01:00:03.000 +0000,2015-11-23 08:00:00.000 +0000
1419603,988213,2016-01-02 16:40:07.000 +0000,2015-11-24 08:00:00.000 +0000
1419604,989109,2016-01-03 22:20:04.000 +0000,2015-11-13 08:00:00.000 +0000
1419605,989143,2016-01-05 08:50:02.000 +0000,2015-11-03 08:00:00.000 +0000


### Computing the difference between the two dates

In [3]:
loans['planned_expiration_time'] = pd.to_datetime(loans['planned_expiration_time'])
loans['disburse_time'] = pd.to_datetime(loans['disburse_time'])

In [4]:
date1 = loans['planned_expiration_time'].dt.date
date2 = loans['disburse_time'].dt.date
loans['loan_length'] = date1-date2

In [5]:
loans[['loan_id', 'loan_length']]

Unnamed: 0,loan_id,loan_length
0,657307,54 days
1,657259,96 days
2,658010,37 days
3,659347,35 days
4,656933,58 days
...,...,...
1419602,988180,40 days
1419603,988213,39 days
1419604,989109,51 days
1419605,989143,63 days


In [95]:
loans['loan_length'].dt.days

0          54.0
1          96.0
2          37.0
3          35.0
4          58.0
           ... 
1419602    40.0
1419603    39.0
1419604    51.0
1419605    63.0
1419606    61.0
Name: loan_length, Length: 1419607, dtype: float64
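
The requirement that a missing date yields a missing duration can be checked on a toy frame. This is a minimal sketch: the two rows are made up (the first reuses the dates of loan 657307 with a simplified timestamp format), and `duration` is the column name taken from the exercise statement.

```python
import pandas as pd

# Two toy loans; the second one has no disburse time.
toy = pd.DataFrame({
    'disburse_time': ['2013-12-22 08:00:00+00:00', None],
    'planned_expiration_time': ['2014-02-14 03:30:06+00:00',
                                '2014-03-26 22:25:07+00:00'],
})
for col in ['disburse_time', 'planned_expiration_time']:
    toy[col] = pd.to_datetime(toy[col], utc=True)

# normalize() sets the time of day to midnight, so only the dates are compared.
delta = toy['planned_expiration_time'].dt.normalize() - toy['disburse_time'].dt.normalize()

# .dt.days turns NaT into NaN, so a missing date gives a missing duration.
toy['duration'] = delta.dt.days
print(toy['duration'])
```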

## Exercise 3
Find the lenders that have funded at least twice.

**Required files**: norm

In [117]:
lenders_twice = norm.groupby('lenders').count() > 1
list(lenders_twice[lenders_twice['loan_id'] == True].index)

['000',
 '00000',
 '0002',
 '0101craign0101',
 '0132575',
 '0154884',
 '0161130',
 '0169713',
 '0185429',
 '0197462',
 '0206338',
 '0219854',
 '0221581',
 '0239059',
 '0245597',
 '0256321',
 '0265562',
 '0279282',
 '0288537',
 '0295920',
 '0307987',
 '0321212',
 '0326lsw',
 '0332772',
 '0346439',
 '0353400',
 '0367630',
 '0376099',
 '0384195',
 '0393784',
 '0407067',
 '0416503',
 '0422888',
 '0432352',
 '0443760',
 '0457584',
 '0462602',
 '0473787',
 '0483421',
 '0499990',
 '0509115',
 '0511209',
 '0526528',
 '0545998',
 '0554687',
 '0561575',
 '0579150',
 '0589889',
 '0595846',
 '0609725',
 '0614925',
 '0626305',
 '0634944',
 '0648612',
 '0653266',
 '0672816',
 '0684667',
 '0693181',
 '0703092',
 '070707Weddingtablegifts',
 '0711782',
 '0723706',
 '07272010',
 '0739360',
 '0743222',
 '0755154',
 '0764579',
 '0779467',
 '0786145',
 '0797268',
 '07brit08',
 '0802769',
 '0816',
 '0819212',
 '0822911',
 '0844736',
 '0854755',
 '0858539',
 '0868635',
 '0878881',
 '0894610',
 '0902841',
 '0
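
Counting rows treats a lender who is listed twice on the same loan as having funded twice (the normalized table does contain such repeats). Whether that is the intended reading is a judgment call; the sketch below, on made-up data, shows both counts side by side.

```python
import pandas as pd

# Toy normalized table: anna appears twice on loan 10 and once on loan 12,
# bob appears twice on loan 11, carl appears once.
toy_norm = pd.DataFrame({
    'loan_id': [10, 10, 12, 11, 11, 13],
    'lenders': ['anna', 'anna', 'anna', 'bob', 'bob', 'carl'],
})

rows_per_lender = toy_norm.groupby('lenders')['loan_id'].count()      # every row counts
loans_per_lender = toy_norm.groupby('lenders')['loan_id'].nunique()   # distinct loans only

at_least_two_rows = sorted(rows_per_lender[rows_per_lender > 1].index)
at_least_two_loans = sorted(loans_per_lender[loans_per_lender > 1].index)
print(at_least_two_rows, at_least_two_loans)
```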

## Exercise 4
For each country, compute how many loans have involved that country as borrowers.

**Required files**: loans

In [118]:
loans_per_country = loans.groupby('country_name').count()['loan_id']
loans_per_country

country_name
Afghanistan        2337
Albania            3075
Armenia           13952
Azerbaijan        10172
Belize              218
                  ...  
Vietnam           21839
Virgin Islands        2
Yemen              4206
Zambia             1277
Zimbabwe           5513
Name: loan_id, Length: 96, dtype: int64

## Exercise 5
For each country, compute the overall amount of money borrowed.

**Required files**: loans

### Exploring the problem

The *loans* dataset has two columns that look similar: *loan_amount* and *funded_amount*. Running the code below (and changing the comparison) shows that the two are usually equal (1,355,316 times), but fairly often *funded_amount* < *loan_amount* (64,279 times), and occasionally (12 times) the opposite holds.

The reason is:\
*loan_amount* is the amount requested for the loan.\
*funded_amount* is the money that actually reached the applicant.

In [119]:
loans[['loan_id', 'funded_amount', 'loan_amount']]
difference = loans['funded_amount']-loans['loan_amount']
difference[difference==0] #Change this line to compare the two columns

0          0.0
1          0.0
2          0.0
3          0.0
4          0.0
          ... 
1419602    0.0
1419603    0.0
1419604    0.0
1419605    0.0
1419606    0.0
Length: 1355316, dtype: float64
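
Instead of editing the comparison by hand, the three cases can be counted in one pass. The sketch below uses made-up amounts; the labels `equal`, `underfunded` and `overfunded` are names chosen here for illustration.

```python
import numpy as np
import pandas as pd

toy_loans = pd.DataFrame({
    'funded_amount': [125.0, 300.0, 450.0, 400.0],
    'loan_amount':   [125.0, 400.0, 425.0, 400.0],
})

# The sign of the difference is 0, -1 or 1 for the three cases.
sign = np.sign(toy_loans['funded_amount'] - toy_loans['loan_amount'])
labels = sign.map({0.0: 'equal', -1.0: 'underfunded', 1.0: 'overfunded'})
counts = labels.value_counts()
print(counts)
```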

### Solving the problem

*loan_amount* was used because, in reply to a question in a discussion thread, the professor confirmed that this is the figure to use. Still, perhaps *funded_amount* would have been more representative?

In [120]:
money_per_country = loans.groupby('country_name')['loan_amount'].sum()  # select the column first so only loan_amount is summed
money_per_country

country_name
Afghanistan        1967950.0
Albania            4307350.0
Armenia           22950475.0
Azerbaijan        14784625.0
Belize              150175.0
                     ...    
Vietnam           24681100.0
Virgin Islands       10000.0
Yemen              3444000.0
Zambia             1978975.0
Zimbabwe           5851875.0
Name: loan_amount, Length: 96, dtype: float64

## Exercise 6
Like the previous point, but expressed as a percentage of the overall amount lent.

**Required files**: none, but _money_per_country_ must be defined (Exercise 5)

In [121]:
money_per_country_percentage = money_per_country/money_per_country.sum()*100
money_per_country_percentage

country_name
Afghanistan       0.166573
Albania           0.364586
Armenia           1.942589
Azerbaijan        1.251410
Belize            0.012711
                    ...   
Vietnam           2.089074
Virgin Islands    0.000846
Yemen             0.291509
Zambia            0.167506
Zimbabwe          0.495318
Name: loan_amount, Length: 96, dtype: float64

## Exercise 7
Like the three previous points, but split for each year (with respect to disburse time).

**Required files**: loans

In [122]:
loans['disburse_year'] = loans['disburse_time'].dt.year
loans.head()

Unnamed: 0,loan_id,loan_name,original_language,description,description_translated,funded_amount,loan_amount,status,activity_name,sector_name,...,num_lenders_total,num_journal_entries,num_bulk_entries,tags,borrower_genders,borrower_pictured,repayment_interval,distribution_model,loan_length,disburse_year
0,657307,Aivy,English,"Aivy, 21 years of age, is single and lives in ...",,125.0,125.0,funded,General Store,Retail,...,3,2,1,,female,True,irregular,field_partner,54 days,2013.0
1,657259,Idalia Marizza,Spanish,"Doña Idalia, esta casada, tiene 57 años de eda...","Idalia, 57, is married and lives with her husb...",400.0,400.0,funded,Used Clothing,Clothing,...,11,2,1,,female,True,monthly,field_partner,96 days,2013.0
2,658010,Aasia,English,Aasia is a 45-year-old married lady and she ha...,,400.0,400.0,funded,General Store,Retail,...,16,2,1,"#Woman Owned Biz, #Supporting Family, user_fav...",female,True,monthly,field_partner,37 days,2014.0
3,659347,Gulmira,Russian,"Гулмире 36 лет, замужем, вместе с супругом вос...",Gulmira is 36 years old and married. She and ...,625.0,625.0,funded,Farming,Agriculture,...,21,2,1,user_favorite,female,True,monthly,field_partner,35 days,2014.0
4,656933,Ricky\t,English,Ricky is a farmer who currently cultivates his...,,425.0,425.0,funded,Farming,Agriculture,...,15,2,1,"#Animals, #Eco-friendly, #Sustainable Ag",male,True,bullet,field_partner,58 days,2013.0


### Analogue of Exercise 4
For each country, compute how many loans have involved that country as borrowers

In [123]:
loans_per_country_per_year = loans.groupby(['country_name','disburse_year']).count()['loan_id']
loans_per_country_per_year

country_name  disburse_year
Afghanistan   2007.0            408
              2008.0            370
              2009.0            678
              2010.0            632
              2011.0            247
                               ... 
Zimbabwe      2013.0            426
              2014.0           2078
              2015.0            600
              2016.0            808
              2017.0           1079
Name: loan_id, Length: 748, dtype: int64

### Analogue of Exercise 5
For each country, compute the overall amount of money borrowed.

In [125]:
money_per_country_per_year = loans.groupby(['country_name','disburse_year'])['loan_amount'].sum()
money_per_country_per_year

country_name  disburse_year
Afghanistan   2007.0            194975.0
              2008.0            365375.0
              2009.0            585125.0
              2010.0            563350.0
              2011.0            245125.0
                                 ...    
Zimbabwe      2013.0            678525.0
              2014.0           1311575.0
              2015.0            723625.0
              2016.0            788600.0
              2017.0           1237600.0
Name: loan_amount, Length: 748, dtype: float64

### Analogue of Exercise 6
Like the previous point, but expressed as a percentage of the overall amount lent.

In [126]:
money_per_country_percentage_per_year = money_per_country_per_year/money_per_country_per_year.sum()*100
money_per_country_percentage_per_year

country_name  disburse_year
Afghanistan   2007.0           0.016657
              2008.0           0.031215
              2009.0           0.049989
              2010.0           0.048129
              2011.0           0.020942
                                 ...   
Zimbabwe      2013.0           0.057969
              2014.0           0.112053
              2015.0           0.061822
              2016.0           0.067373
              2017.0           0.105733
Name: loan_amount, Length: 748, dtype: float64
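
The percentages above are relative to the total over all years. If "split for each year" is read as "share of that year's total" instead, each value has to be divided by the sum of its own year. The sketch below shows that alternative on made-up data; it is one possible reading of the exercise, not necessarily the intended one.

```python
import pandas as pd

toy = pd.DataFrame({
    'country_name':  ['A', 'A', 'B', 'B'],
    'disburse_year': [2016, 2017, 2016, 2017],
    'loan_amount':   [100.0, 300.0, 300.0, 100.0],
})

per_country_year = toy.groupby(['country_name', 'disburse_year'])['loan_amount'].sum()

# Total of each year, broadcast back onto the (country, year) index.
year_totals = per_country_year.groupby(level='disburse_year').transform('sum')

pct_within_year = per_country_year / year_totals * 100
print(pct_within_year)
```

With this normalization the percentages of each year add up to 100.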

## Exercise 8
For each lender, compute the overall amount of money lent. For each loan that has more than one lender, you must assume that all lenders contributed the same amount.

**Required files**: loan_lenders, norm and loans.

### Analysis of the problem

A single *lender* may have lent money to several different loans, and each loan may have received money from many different *lenders*. In the latter case, every *lender* is assumed to have contributed the same amount. The loan amounts come from *loans*, while who actually contributed comes from *loan_lenders*.

### Remark
The number of lenders for each *loan_id* is given in the *loans* table; a comparison showed that it differs from the number obtained by grouping *norm_loan_lenders* by *loan_id* and counting the rows in each group. However, since *loans* does not list the lenders' names, the extra contributions could not be assigned to anyone; for this reason the grouping of *norm_loan_lenders* is used.

In [9]:
norm.head()

Unnamed: 0,loan_id,lenders
0,483693,muc888
1,483693,sam4326
2,483693,camaran3922
3,483693,lachheb1865
4,483693,rebecca3499


In [12]:
lenders_per_loan = norm.groupby('loan_id').count()
lenders_per_loan

Unnamed: 0_level_0,lenders
loan_id,Unnamed: 1_level_1
84,3
85,2
86,3
88,3
89,4
...,...
1444051,1
1444053,1
1444058,1
1444063,1


In [13]:
loan_lenders_count = loan_lenders.merge(lenders_per_loan, on='loan_id')
loan_lenders_count

Unnamed: 0,loan_id,lenders_x,lenders_y
0,483693,"[muc888, sam4326, camaran3922, lachheb1865, re...",40
1,483738,"[muc888, nora3555, williammanashi, barbara5610...",15
2,485000,"[muc888, terrystl, richardandsusan8352, sherri...",15
3,486087,"[muc888, james5068, rudi5955, daniel9859, don9...",13
4,534428,"[muc888, niki3008, teresa9174, mike4896, david...",19
...,...,...,...
1387427,678999,"[michael43411218, carol5987, gooddogg1, chris4...",10
1387428,1207353,"[rjhoward1986, jeffrey6870, trolltech4460, ely...",5
1387429,1206220,"[vicky7746, gooddogg1, fairspirit, craig972996...",44
1387430,1206425,"[rich6705, sergiiy9766, angela7509, barbara561...",8


In [14]:
loan_lenders_count = loan_lenders_count.rename({'lenders_x': 'list_of_lenders', 'lenders_y': 'number_of_lenders'}, axis=1)
loan_lenders_count

Unnamed: 0,loan_id,list_of_lenders,number_of_lenders
0,483693,"[muc888, sam4326, camaran3922, lachheb1865, re...",40
1,483738,"[muc888, nora3555, williammanashi, barbara5610...",15
2,485000,"[muc888, terrystl, richardandsusan8352, sherri...",15
3,486087,"[muc888, james5068, rudi5955, daniel9859, don9...",13
4,534428,"[muc888, niki3008, teresa9174, mike4896, david...",19
...,...,...,...
1387427,678999,"[michael43411218, carol5987, gooddogg1, chris4...",10
1387428,1207353,"[rjhoward1986, jeffrey6870, trolltech4460, ely...",5
1387429,1206220,"[vicky7746, gooddogg1, fairspirit, craig972996...",44
1387430,1206425,"[rich6705, sergiiy9766, angela7509, barbara561...",8


In [17]:
loan_amount_per_loan_id = loans[['loan_id','loan_amount']]
loan_amount_per_loan_id

Unnamed: 0,loan_id,loan_amount
0,657307,125.0
1,657259,400.0
2,658010,400.0
3,659347,625.0
4,656933,425.0
...,...,...
1419602,988180,400.0
1419603,988213,300.0
1419604,989109,2425.0
1419605,989143,100.0


In [18]:
loan_lenders_count = loan_lenders_count.merge(loan_amount_per_loan_id, on='loan_id', copy=False)
loan_lenders_count

Unnamed: 0,loan_id,list_of_lenders,number_of_lenders,loan_amount
0,483693,"[muc888, sam4326, camaran3922, lachheb1865, re...",40,1225.0
1,483738,"[muc888, nora3555, williammanashi, barbara5610...",15,500.0
2,485000,"[muc888, terrystl, richardandsusan8352, sherri...",15,725.0
3,486087,"[muc888, james5068, rudi5955, daniel9859, don9...",13,400.0
4,534428,"[muc888, niki3008, teresa9174, mike4896, david...",19,625.0
...,...,...,...,...
1387423,678999,"[michael43411218, carol5987, gooddogg1, chris4...",10,325.0
1387424,1207353,"[rjhoward1986, jeffrey6870, trolltech4460, ely...",5,200.0
1387425,1206220,"[vicky7746, gooddogg1, fairspirit, craig972996...",44,2175.0
1387426,1206425,"[rich6705, sergiiy9766, angela7509, barbara561...",8,325.0


In [19]:
loan_lenders_count['single_contribution'] = loan_lenders_count['loan_amount']/loan_lenders_count['number_of_lenders']
loan_lenders_count

Unnamed: 0,loan_id,list_of_lenders,number_of_lenders,loan_amount,single_contribution
0,483693,"[muc888, sam4326, camaran3922, lachheb1865, re...",40,1225.0,30.625000
1,483738,"[muc888, nora3555, williammanashi, barbara5610...",15,500.0,33.333333
2,485000,"[muc888, terrystl, richardandsusan8352, sherri...",15,725.0,48.333333
3,486087,"[muc888, james5068, rudi5955, daniel9859, don9...",13,400.0,30.769231
4,534428,"[muc888, niki3008, teresa9174, mike4896, david...",19,625.0,32.894737
...,...,...,...,...,...
1387423,678999,"[michael43411218, carol5987, gooddogg1, chris4...",10,325.0,32.500000
1387424,1207353,"[rjhoward1986, jeffrey6870, trolltech4460, ely...",5,200.0,40.000000
1387425,1206220,"[vicky7746, gooddogg1, fairspirit, craig972996...",44,2175.0,49.431818
1387426,1206425,"[rich6705, sergiiy9766, angela7509, barbara561...",8,325.0,40.625000


In [20]:
norm_with_contributions = norm.merge(loan_lenders_count, how='left')
norm_with_contributions.head()

Unnamed: 0,loan_id,lenders,list_of_lenders,number_of_lenders,loan_amount,single_contribution
0,483693,muc888,"[muc888, sam4326, camaran3922, lachheb1865, re...",40.0,1225.0,30.625
1,483693,sam4326,"[muc888, sam4326, camaran3922, lachheb1865, re...",40.0,1225.0,30.625
2,483693,camaran3922,"[muc888, sam4326, camaran3922, lachheb1865, re...",40.0,1225.0,30.625
3,483693,lachheb1865,"[muc888, sam4326, camaran3922, lachheb1865, re...",40.0,1225.0,30.625
4,483693,rebecca3499,"[muc888, sam4326, camaran3922, lachheb1865, re...",40.0,1225.0,30.625


In [24]:
norm_with_contributions = norm_with_contributions.drop(labels=['list_of_lenders', 'number_of_lenders', 'loan_amount'], axis=1)

In [26]:
norm_with_contributions.groupby('lenders').sum()['single_contribution']

lenders
000                1764.285078
00000              1380.693644
0002               2472.563566
00mike00             52.631579
0101craign0101     2623.565117
                      ...     
zzmcfate          66113.226325
zzpaghetti9994       51.020408
zzrvmf8538          576.978086
zzzsai              267.667370
zzzworld             27.522936
Name: single_contribution, Length: 1383799, dtype: float64
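
The same equal-split computation can be written more compactly by working on the normalized table only. This is a sketch on made-up data; `toy_norm` and `toy_loans` stand in for *norm* and *loans*, and `n_lenders` is a helper column introduced here.

```python
import pandas as pd

toy_norm = pd.DataFrame({
    'loan_id': [1, 1, 2, 2, 2, 2],
    'lenders': ['anna', 'bob', 'anna', 'carl', 'dan', 'eve'],
})
toy_loans = pd.DataFrame({'loan_id': [1, 2], 'loan_amount': [100.0, 400.0]})

# Attach the loan amount to every (loan, lender) row.
merged = toy_norm.merge(toy_loans, on='loan_id', how='left')

# Number of lender rows per loan, repeated on each row of that loan.
merged['n_lenders'] = merged.groupby('loan_id')['lenders'].transform('count')

# Equal split, then total per lender.
merged['single_contribution'] = merged['loan_amount'] / merged['n_lenders']
total_per_lender = merged.groupby('lenders')['single_contribution'].sum()
print(total_per_lender)
```

Because every loan is split completely among its listed lenders, the per-lender totals add up to the sum of the loan amounts.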

## Exercise 9
For each country, compute the difference between the overall amount of money lent and the overall amount of money borrowed. Since the country of the lender is often unknown, you can assume that the true distribution among the countries is the same as the one computed from the rows where the country is known.

In [46]:
lenders = pd.read_csv('Data/lenders.csv')  # forward slash: portable, and '\l' is not a valid escape
lenders

Unnamed: 0,permanent_name,display_name,city,state,country_code,member_since,occupation,loan_because,loan_purchase_num,invited_by,num_invited
0,qian3013,Qian,,,,1461300457,,,1.0,,0
1,reena6733,Reena,,,,1461300634,,,9.0,,0
2,mai5982,Mai,,,,1461300853,,,,,0
3,andrew86079135,Andrew,,,,1461301091,,,5.0,Peter Tan,0
4,nguyen6962,Nguyen,,,,1461301154,,,,,0
...,...,...,...,...,...,...,...,...,...,...,...
2349169,janet7309,Janet,,,,1342097163,,,,,0
2349170,pj4198,,,,,1342097515,,,,,0
2349171,maria2141,Maria,,,US,1342099723,,,2.0,,0
2349172,simone9846,Simone,,,,1342100213,,,,,0


In [73]:
money_per_country_code = loans.groupby('country_code')['loan_amount'].sum()  # select the column first so only loan_amount is summed
money_per_country_code

country_code
AF     1967950.0
AL     4307350.0
AM    22950475.0
AZ    14784625.0
BA      477250.0
         ...    
XK     3083025.0
YE     3444000.0
ZA     1006525.0
ZM     1978975.0
ZW     5851875.0
Name: loan_amount, Length: 95, dtype: float64

In [44]:
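# NOTE: funded_by_lender is not defined in the cells shown here; it is presumably a dict mapping each lender to the amount lent (Exercise 8).
funded_by_lenders_df = pd.DataFrame.from_dict(funded_by_lender,orient='index',columns = ['funded_amount'])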
funded_by_lenders_df['lenders'] = funded_by_lenders_df.index

In [None]:
funded_by_lenders_df

In [None]:
lenders_country = pd.merge(lenders[['permanent_name','country_code']],funded_by_lenders_df,left_on = 'permanent_name',right_on = 'lenders')
lenders_country.dropna().head()

In [None]:
a = (lenders_country.dropna().groupby('country_code').count()/len(lenders_country.dropna()))['lenders']

In [None]:
c = list(a.index)
p = list(a.values)

In [None]:
x = pd.Series(np.random.choice(c, len(lenders_country[lenders_country['country_code'].isnull()]), True, p))

In [None]:
keys = list(lenders_country[lenders_country['country_code'].isnull()].index)
values = list(x)

In [None]:
lenders_country['country_code'] = lenders_country['country_code'].fillna(dict(zip(keys,values)))

In [None]:
# Despite its name, borrowed_country holds the amount lent by the lenders of each country.
borrowed_country = lenders_country.groupby('country_code')['funded_amount'].sum()
borrowed_country

In [None]:
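# received_country is not defined in the cells shown here; money_per_country_code (amount borrowed per country, computed above) is assumed to be what was meant.
diff_borrow_lent = money_per_country_code - borrowed_country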
diff_borrow_lent.dropna()
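
The cells above assign a random country to each lender with an unknown country, drawing from the share of lenders per country. A deterministic alternative, sketched here on made-up data and under a different reading of the assumption, is to spread the total lent by unknown-country lenders over the countries in proportion to the amounts lent by known-country lenders.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'country_code':  ['US', 'US', 'IT', np.nan, np.nan],
    'funded_amount': [300.0, 300.0, 200.0, 150.0, 50.0],
})

# Amount lent per country, using only rows where the country is known.
known = toy.dropna(subset=['country_code'])
known_per_country = known.groupby('country_code')['funded_amount'].sum()
shares = known_per_country / known_per_country.sum()

# Money lent by lenders whose country is unknown.
unknown_total = toy.loc[toy['country_code'].isna(), 'funded_amount'].sum()

# Allocate the unknown money according to the known shares.
estimated_lent = known_per_country + shares * unknown_total
print(estimated_lent)
```

This gives the same result on every run, whereas the random assignment changes with the seed; which of the two matches the intent of the exercise is an open question.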

## Exercise 10
Which country has the highest ratio between the difference computed at the previous point and the population?

In [9]:
countries = pd.read_csv('Data/country_stats.csv')  # forward slash: portable, and '\c' is not a valid escape
countries

Unnamed: 0,country_name,country_code,country_code3,continent,region,population,population_below_poverty_line,hdi,life_expectancy,expected_years_of_schooling,mean_years_of_schooling,gni,kiva_country_name
0,India,IN,IND,Asia,Southern Asia,1339180127,21.9,0.623559,68.322,11.696590,6.298834,5663.474799,India
1,Nigeria,NG,NGA,Africa,Western Africa,190886311,70.0,0.527105,53.057,9.970482,6.000000,5442.901264,Nigeria
2,Mexico,MX,MEX,Americas,Central America,129163276,46.2,0.761683,76.972,13.299090,8.554985,16383.106680,Mexico
3,Pakistan,PK,PAK,Asia,Southern Asia,197015955,29.5,0.550354,66.365,8.106910,5.089460,5031.173074,Pakistan
4,Bangladesh,BD,BGD,Asia,Southern Asia,164669751,31.5,0.578824,71.985,10.178706,5.241577,3341.490722,Bangladesh
...,...,...,...,...,...,...,...,...,...,...,...,...,...
169,Somalia,SO,SOM,Africa,Eastern Africa,14742523,,,,,,,Somalia
170,Central African Republic,CF,CAF,Africa,Middle Africa,4659080,,0.352440,51.458,7.098980,4.230000,587.473961,Central African Republic
171,Samoa,WS,WSM,Oceania,Polynesia,196440,,0.702000,,,,,Samoa
172,Palestine,PS,PS,Asia,Western Asia,4920724,,0.677000,,,,,Palestine


In [None]:
diff_df = diff_borrow_lent.to_frame()
diff_df.columns = ['diff']

In [None]:
pop_diff = pd.merge(diff_df,countries[['country_name','country_code','population','population_below_poverty_line']],left_index = True,right_on = 'country_code')
pop_diff['ratio'] = pop_diff['diff']/pop_diff['population']
pop_diff.loc[pop_diff['ratio'].idxmax()]

## Exercise 11
Which country has the highest ratio between the difference computed at point 9 and the population that is not below the poverty line?

In [None]:
countries.sort_values(by =['country_code'])[['country_code','population']]

In [None]:
pop_diff['ratio_not_poor'] = pop_diff['diff']/((100-pop_diff['population_below_poverty_line'])*pop_diff['population']/100)
pop_diff.loc[pop_diff['ratio_not_poor'].idxmax()]

## Exercise 12
For each year, compute the total amount of loans. Each loan that has planned expiration time and disburse time in different years must have its amount distributed proportionally to the number of days in each year. For example, a loan with disburse time December 1st, 2016, planned expiration time January 30th 2018, and amount 5000USD has an amount of 5000USD * 31 / (31+365+30) = 363.85 for 2016, 5000USD * 365 / (31+365+30) = 4284.04 for 2017, and 5000USD * 30 / (31+365+30) = 352.11 for 2018.

**Required files**: loans, modified as in Exercise 2.

In [69]:
#Work on a separate dataframe to keep the computations short.
loans_per_year = loans[['loan_amount','disburse_time','planned_expiration_time', 'loan_length']].copy()

In [70]:
#New column: money_per_day, the amount to spend per day (i.e. 5000USD/(31+365+30) in the example)
loans_per_year['money_per_day'] = loans_per_year['loan_amount']/loans_per_year['loan_length'].dt.days

In [71]:
loans_per_year = loans_per_year.replace([np.inf, -np.inf], np.nan)

In [72]:
#There are cases where 'money_per_day' is negative because 'loan_length' is negative.
#In those cases money_per_day is set to zero, i.e. those loans are not counted.
loans_per_year.loc[loans_per_year['money_per_day'] < 0, 'money_per_day'] = 0

In [73]:
days_remaining = 365 - loans['disburse_time'].dt.dayofyear
days_left = loans['planned_expiration_time'].dt.dayofyear

In [74]:
#min_year = int(min(loans_per_year['disburse_time'].dt.year))
#max_year = int(max(loans_per_year['planned_expiration_time'].dt.year))
loans_per_year = loans_per_year.dropna()
loans_per_year['first_year'] = loans['disburse_time'].dt.year
loans_per_year['last_year'] = loans['planned_expiration_time'].dt.year
loans_per_year['disburse_year'] = loans_per_year['money_per_day']*days_remaining
loans_per_year['planned_exp_year'] = loans_per_year['money_per_day']*days_left
loans_per_year['mid_term_year'] = 365*loans_per_year['money_per_day']
#loans_per_year = loans_per_year.drop(['disburse_time', 'planned_expiration_time'], axis=1)

In [75]:
loans_per_year['first_year'] = loans_per_year['first_year'].astype(int)
loans_per_year['last_year'] = loans_per_year['last_year'].astype(int)

In [76]:
days_remaining

0            9.0
1           11.0
2          356.0
3          348.0
4           14.0
           ...  
1419602     38.0
1419603     37.0
1419604     48.0
1419605     58.0
1419606     58.0
Name: disburse_time, Length: 1419607, dtype: float64

In [77]:
loans_per_year

Unnamed: 0,loan_amount,disburse_time,planned_expiration_time,loan_length,money_per_day,first_year,last_year,disburse_year,planned_exp_year,mid_term_year
0,125.0,2013-12-22 08:00:00+00:00,2014-02-14 03:30:06+00:00,54 days,2.314815,2013,2014,20.833333,104.166667,844.907407
1,400.0,2013-12-20 08:00:00+00:00,2014-03-26 22:25:07+00:00,96 days,4.166667,2013,2014,45.833333,354.166667,1520.833333
2,400.0,2014-01-09 08:00:00+00:00,2014-02-15 21:10:05+00:00,37 days,10.810811,2014,2014,3848.648649,497.297297,3945.945946
3,625.0,2014-01-17 08:00:00+00:00,2014-02-21 03:10:02+00:00,35 days,17.857143,2014,2014,6214.285714,928.571429,6517.857143
4,425.0,2013-12-17 08:00:00+00:00,2014-02-13 06:10:02+00:00,58 days,7.327586,2013,2014,102.586207,322.413793,2674.568966
...,...,...,...,...,...,...,...,...,...,...
1419602,400.0,2015-11-23 08:00:00+00:00,2016-01-02 01:00:03+00:00,40 days,10.000000,2015,2016,380.000000,20.000000,3650.000000
1419603,300.0,2015-11-24 08:00:00+00:00,2016-01-02 16:40:07+00:00,39 days,7.692308,2015,2016,284.615385,15.384615,2807.692308
1419604,2425.0,2015-11-13 08:00:00+00:00,2016-01-03 22:20:04+00:00,51 days,47.549020,2015,2016,2282.352941,142.647059,17355.392157
1419605,100.0,2015-11-03 08:00:00+00:00,2016-01-05 08:50:02+00:00,63 days,1.587302,2015,2016,92.063492,7.936508,579.365079


In [156]:
#There are 320 cases in which the two dates are more than 1 year apart... up to 5!
diffyear = loans_per_year['last_year']-loans_per_year['first_year']
max(diffyear)

5.0

In [49]:
tot_year = {}
for year in list(loans_per_year['first_year'].unique()):
    tot_year[year] = tot_year.get(year,0) + loans_per_year.groupby('first_year').sum()['disburse_year'].loc[year]
for year in list(loans_per_year['last_year'].unique()):
    tot_year[year] = tot_year.get(year,0) + loans_per_year.groupby('last_year').sum()['planned_exp_year'].loc[year]
for row in loans_per_year.index:
    for mid_year in range(loans_per_year.at[row,'first_year']+1,loans_per_year.at[row,'last_year']):
        tot_year[mid_year] = tot_year.get(mid_year,0) + loans_per_year.at[row,'mid_term_year']

In [50]:
tot_year

{2013: 970894628.9000332,
 2014: 1318721896.6092165,
 2015: 1408014512.2894528,
 2012: 942294391.8506069,
 2016: 1532931972.7321393,
 2017: 1693812906.5738604,
 2018: 13775124.394079097,
 2011: 629911.7748786244}

In [55]:
loans['loan_amount'].sum()

1181437300.0

In [56]:
sum(tot_year.values())

7881075345.124267
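
The two totals above differ by a large factor (about 7.9e9 against 1.18e9), so the yearly split does not conserve the loan amounts. One contributor appears to be that a loan starting and ending in the same year is still charged both the days left in the year and the days up to the expiration. The sketch below splits a single loan with plain `datetime` arithmetic; `split_amount_by_year` is a helper name introduced here, and both end dates are counted as in the example from the exercise statement (which differs by one day from the `loan_length` used above).

```python
from datetime import date


def split_amount_by_year(amount, start, end):
    """Split amount over calendar years in proportion to the days of
    [start, end] (both ends included) that fall in each year."""
    total_days = (end - start).days + 1
    shares = {}
    for year in range(start.year, end.year + 1):
        # Clip the loan period to the current calendar year.
        seg_start = max(start, date(year, 1, 1))
        seg_end = min(end, date(year, 12, 31))
        days = (seg_end - seg_start).days + 1
        shares[year] = amount * days / total_days
    return shares


# The example from the exercise statement.
example = split_amount_by_year(5000, date(2016, 12, 1), date(2018, 1, 30))
print({year: round(value, 2) for year, value in example.items()})
```

Because the day counts of the segments add up to `total_days`, the yearly shares of a loan always add up to its amount, and leap years are handled by the date arithmetic itself.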

## Exercise 13
For each value of repayment_interval, add a new column to the lenders dataframe that contains the total amount of money corresponding to loans in such state

In [44]:
x

datetime.date(2016, 1, 1)

In [20]:
loans['repayment_interval'].unique()

array(['irregular', 'monthly', 'bullet', 'weekly'], dtype=object)

In [47]:
(x-date1[0]).days

686

In [11]:
loans.groupby(['repayment_interval','loan_id'])['loan_amount'].sum()

repayment_interval  loan_id
bullet              84         500.0
                    85         500.0
                    86         500.0
                    88         300.0
                    89         500.0
                               ...  
weekly              1090030    325.0
                    1090039    325.0
                    1090040    325.0
                    1090041    125.0
                    1090099    325.0
Name: loan_amount, Length: 1419607, dtype: float64

In [10]:
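# NOTE: merged is not defined in the cells shown here; judging from the output below it holds loan_id, loan_amount, lenders, count and funded_divided.
merged_status = pd.merge(merged,loans[['loan_id','repayment_interval']],on='loan_id')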
merged_status

Unnamed: 0,loan_id,loan_amount,lenders,count,funded_divided,repayment_interval
0,657307,125.0,"[spencer5657, matthew8640, larry71496105]",3,41.666667,irregular
1,657259,400.0,"[ltr, andrew5306, dana9865, WHYu, robert978452...",7,57.142857,monthly
2,658010,400.0,"[kathy3100, omar7511, amirali5409, bingo, geni...",14,28.571429,monthly
3,659347,625.0,"[jasonsamfield, mikaela2498, tim1351, rifath92...",17,36.764706,monthly
4,656933,425.0,"[john86857365, gooddogg1, daniel8458, anjae514...",14,30.357143,bullet
...,...,...,...,...,...,...
1387423,988180,400.0,"[joyce5432, douglas5957, michelle27516947, wak...",13,30.769231,monthly
1387424,988213,300.0,"[dqqpkh5136, gerald4889, peter2548, tatjanacha...",11,27.272727,irregular
1387425,989109,2425.0,"[ulrike5921, oakviewroberts, kenmwong, trudy32...",69,35.144928,irregular
1387426,989143,100.0,"[jack6206, david85674927]",2,50.000000,irregular


In [116]:
possible_repayments = list(loans['repayment_interval'].unique())
for interval in possible_repayments:
    lenders[interval] = 0

In [111]:
lenders = lenders.set_index('permanent_name')

In [None]:
#p_loans = pd.read_csv('loans.csv')
#a_loans = loans.iloc[1000000:1419607]
p_loans = loans

In [None]:
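%%time
# Add each lender's share of a loan to the column for that loan's repayment interval.
# (.ix was removed in pandas 1.0; .loc is the label-based replacement.)
for row in merged_status.index[:1000]:
    status = merged_status.at[row,'repayment_interval']
    share = merged_status.at[row,'funded_divided']
    for l in merged_status.at[row,'lenders']:
        lenders.loc[l,status] += share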

In [130]:
norm[:5000].groupby('lenders')['loan_id'].apply(list)

Wall time: 429 ms


lenders
0545998               [319643]
1961          [356152, 356152]
25perday              [573696]
3beditions            [391924]
9678                  [524323]
                    ...       
zen7998               [573696]
zim7148               [315025]
zoe6003               [396664]
zoec                  [494337]
zsazsa8211            [308897]
Name: loan_id, Length: 4391, dtype: object

In [135]:
len(norm)/5000*429e-3/60

40.46032132999999
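
The estimate above (about 40 minutes) suggests that row-by-row processing is the bottleneck. A minimal sketch of a vectorized alternative follows, assuming pandas 0.25 or later for `DataFrame.explode`. The toy `merged_status` frame, its values, and the names `exploded` and `per_interval` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for merged_status (illustrative values, not Kiva data)
merged_status = pd.DataFrame({
    'loan_id': [1, 2, 3],
    'lenders': [['anna', 'bob'], ['anna'], ['bob', 'carl']],
    'funded_divided': [50.0, 100.0, 25.0],
    'repayment_interval': ['monthly', 'bullet', 'monthly'],
})

# One row per (loan, lender) pair instead of one list of lenders per loan
exploded = merged_status.explode('lenders')

# Sum each lender's share per repayment interval, then pivot the
# intervals into columns; missing combinations become 0.0
per_interval = (exploded
                .groupby(['lenders', 'repayment_interval'])['funded_divided']
                .sum()
                .unstack(fill_value=0.0))
print(per_interval)
```

The resulting table has one row per lender and one column per repayment interval, so it could then be joined to `lenders` on the lender name instead of being filled in cell by cell.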

# Exercise 14
What is the occupation with the highest average amount of money lent (the average must be computed over all lenders with a given occupation)?

In [42]:
lenders

Unnamed: 0_level_0,display_name,city,state,country_code,member_since,occupation,loan_because,loan_purchase_num,invited_by,num_invited,irregular,monthly,bullet,weekly
permanent_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
qian3013,Qian,,,,1461300457,,,1.0,,0,0,0,0,0
reena6733,Reena,,,,1461300634,,,9.0,,0,0,0,0,0
mai5982,Mai,,,,1461300853,,,,,0,0,0,0,0
andrew86079135,Andrew,,,,1461301091,,,5.0,Peter Tan,0,0,0,0,0
nguyen6962,Nguyen,,,,1461301154,,,,,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
janet7309,Janet,,,,1342097163,,,,,0,0,0,0,0
pj4198,,,,,1342097515,,,,,0,0,0,0,0
maria2141,Maria,,,US,1342099723,,,2.0,,0,0,0,0,0
simone9846,Simone,,,,1342100213,,,,,0,0,0,0,0


In [47]:
lenders_occupation = pd.merge(lenders[['permanent_name','occupation']],funded_by_lenders_df,left_on = 'permanent_name',right_on = 'lenders')
lenders_occupation.dropna().head()

Unnamed: 0,permanent_name,occupation,funded_amount,lenders
31,vikas1098,Software Engineer,91.666667,vikas1098
177,kumari2781,Software Engineer,63.873626,kumari2781
390,javier7867,Technology Consultant,457.002956,javier7867
945,jens1183,IT Consultant,59.343605,jens1183
1099,pankaj1930,doctor,137.54638,pankaj1930


In [54]:
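# Occupation with the highest mean funded amount, then the lenders who list it.
# Selecting 'funded_amount' before .mean() avoids averaging non-numeric columns.
top_occupation = lenders_occupation.groupby('occupation')['funded_amount'].mean().idxmax()
lenders_occupation[lenders_occupation['occupation'] == top_occupation]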

Unnamed: 0,permanent_name,occupation,funded_amount,lenders
231277,gooddogg1,www.linkedin.com/in/peacekeeper,8642502.0,gooddogg1
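
The top occupation above is a free-text value listed by a single lender, so the mean is just that lender's total. As an optional robustness check, the sketch below normalizes the occupation strings and only considers occupations with a minimum number of lenders. The toy data, the `min_lenders` threshold, and the `stats` name are hypothetical and not part of the original exercise.

```python
import pandas as pd

# Toy stand-in for lenders_occupation (illustrative values only)
lenders_occupation = pd.DataFrame({
    'occupation': ['Teacher', 'teacher ', 'Engineer', 'Engineer', 'www.example.com'],
    'funded_amount': [100.0, 300.0, 50.0, 150.0, 9000.0],
})

# Normalize case and whitespace so 'Teacher' and 'teacher ' form one group
occupation = lenders_occupation['occupation'].str.strip().str.lower()

# Mean funded amount and number of lenders per occupation
stats = (lenders_occupation
         .groupby(occupation)['funded_amount']
         .agg(['mean', 'count']))

min_lenders = 2  # hypothetical threshold, chosen only for this example
top_occupation = stats.loc[stats['count'] >= min_lenders, 'mean'].idxmax()
print(top_occupation)
```

Without the threshold, `stats['mean'].idxmax()` still answers the exercise as stated; the filter only shows how sensitive the answer is to one-person groups.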


# Exercise 15
Cluster the loans according to the year-month of disburse time.

In [68]:
loans['disburse_time'] = pd.to_datetime(loans['disburse_time'])
loans['month_year'] = loans['disburse_time'].dt.year.astype(str) + '-' + loans['disburse_time'].dt.month.astype(str)

In [70]:
loans.groupby('month_year').count()

Unnamed: 0_level_0,loan_id,loan_name,original_language,description,description_translated,funded_amount,loan_amount,status,activity_name,sector_name,...,raised_time,lender_term,num_lenders_total,num_journal_entries,num_bulk_entries,tags,borrower_genders,borrower_pictured,repayment_interval,distribution_model
month_year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2005.0-4.0,203,187,187,187,0,203,203,203,203,203,...,203,203,203,203,203,2,187,187,203,203
2006.0-10.0,146,112,112,112,0,146,146,146,146,146,...,145,146,146,146,146,0,112,112,146,146
2006.0-11.0,744,525,525,525,0,744,744,744,744,744,...,744,744,744,744,744,1,525,525,744,744
2006.0-12.0,804,609,609,609,0,804,804,804,804,804,...,804,804,804,804,804,0,609,609,804,804
2006.0-3.0,1,0,0,0,0,1,1,1,1,1,...,0,1,1,1,1,0,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2017.0-9.0,17336,17286,17293,17293,17293,17336,17336,17336,17336,17336,...,16001,17336,17336,17336,17336,14856,17293,17293,17336,17336
2018.0-1.0,339,323,334,334,334,339,339,339,339,339,...,277,339,339,339,339,270,334,334,339,339
2018.0-2.0,560,560,560,560,560,560,560,560,560,560,...,307,560,560,560,560,551,560,560,560,560
2018.0-3.0,65,65,65,65,65,65,65,65,65,65,...,62,65,65,65,65,64,65,65,65,65
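
The labels above look like `2005.0-4.0` because loans with a missing disburse time turn the year and month into floats, and unpadded months sort as text (`2006.0-10.0` comes before `2006.0-3.0`). A minimal sketch of one way to get zero-padded, chronologically sortable labels uses `Series.dt.strftime`. The toy `loans` frame and the `counts` name are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for loans, including one missing disburse time
loans = pd.DataFrame({
    'loan_id': [1, 2, 3, 4],
    'disburse_time': ['2005-04-15 00:00:00', '2005-04-20 12:30:00',
                      '2006-10-01 08:00:00', None],
})

loans['disburse_time'] = pd.to_datetime(loans['disburse_time'])

# '%Y-%m' gives labels such as '2005-04'; missing times stay missing
loans['month_year'] = loans['disburse_time'].dt.strftime('%Y-%m')

# groupby drops rows whose key is missing, so only dated loans are counted
counts = loans.groupby('month_year')['loan_id'].count()
print(counts)
```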


# Exercise 16
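For each country, compute its overall GDP by multiplying the per capita GDP by its population. The dataset provides per capita GNI (the `gni` column) rather than GDP, so the code below uses it as the per capita figure.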

In [74]:
countries['overall_gni'] = countries['gni']*countries['population']
countries.head()

Unnamed: 0,country_name,country_code,country_code3,continent,region,population,population_below_poverty_line,hdi,life_expectancy,expected_years_of_schooling,mean_years_of_schooling,gni,kiva_country_name,overall_gni
0,India,IN,IND,Asia,Southern Asia,1339180127,21.9,0.623559,68.322,11.69659,6.298834,5663.474799,India,7584413000000.0
1,Nigeria,NG,NGA,Africa,Western Africa,190886311,70.0,0.527105,53.057,9.970482,6.0,5442.901264,Nigeria,1038975000000.0
2,Mexico,MX,MEX,Americas,Central America,129163276,46.2,0.761683,76.972,13.29909,8.554985,16383.10668,Mexico,2116096000000.0
3,Pakistan,PK,PAK,Asia,Southern Asia,197015955,29.5,0.550354,66.365,8.10691,5.08946,5031.173074,Pakistan,991221400000.0
4,Bangladesh,BD,BGD,Asia,Southern Asia,164669751,31.5,0.578824,71.985,10.178706,5.241577,3341.490722,Bangladesh,550242400000.0


# Exercise 17
Find the country with the highest rate of irregular repayment interval.

In [94]:
loans[loans['repayment_interval'] == 'irregular'].groupby('country_name').count().idxmax()['country_code']

'Philippines'
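
The cell above selects the country with the largest number of irregular loans. If "rate" is read as the share of a country's loans that are irregular, the count needs to be divided by the country's total number of loans. The sketch below shows that reading on toy data. The frame contents and the names `is_irregular` and `irregular_rate` are assumptions, and the result on the real data is not computed here.

```python
import pandas as pd

# Toy stand-in for loans: A has more irregular loans, B a higher share
loans = pd.DataFrame({
    'country_name': ['A', 'A', 'A', 'A', 'A', 'B', 'B'],
    'repayment_interval': ['irregular', 'irregular', 'monthly', 'monthly',
                           'monthly', 'irregular', 'bullet'],
})

# True where the loan is irregular; the mean of a boolean column is a share
is_irregular = loans['repayment_interval'] == 'irregular'
irregular_rate = is_irregular.groupby(loans['country_name']).mean()

most_irregular = irregular_rate.idxmax()
print(irregular_rate)
print(most_irregular)
```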

# Exercise 18
Find the country with the highest fraction of loaned amount with irregular repayment interval.

In [110]:
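# Irregular loaned amount per country divided by the total loaned amount per country.
# Selecting 'loan_amount' before .sum() avoids summing non-numeric columns.
irregular_fraction = loans[loans['repayment_interval'] == 'irregular'].groupby('country_name')['loan_amount'].sum() \
                                                               / loans.groupby('country_name')['loan_amount'].sum()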
irregular_fraction.idxmax()

'Bhutan'