<a href="https://colab.research.google.com/github/nug1209/PwC_Switzerland_Digital_Intelligence_Virtual_Case_Experience/blob/main/PwC_Task_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this task we work with merchant loan data. We will forecast what the total repayment will be once these loans are finished (30 months after the loans started). The formula for calculating the repayment is provided. We will also express each repayment as a percentage of the origination amount.

Import libraries.

In [68]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import drive
from datetime import datetime

Mount Google Drive to get the data.

In [69]:
drive.mount('/content/drive')
path = '/content/drive/MyDrive/Virtual Experience/PwC_Task_2_Data.csv'

df = pd.read_csv(path, delimiter=';')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Rename columns.

In [70]:
df = df.rename(columns={'Unnamed: 0':'origination_date', 'Origination Amount':'origination_amount'})
# df.head(3)

In [71]:
df.shape

(20, 22)

Prepare for melting, i.e. transforming the wide-format table into long format.

In [72]:
to_remove = ['origination_date', 'origination_amount']
value_columns = [i for i in list(df.columns) if i not in to_remove]

This cell was meant to calculate the repayment percentages early in the processing, but that is probably better done later.

In [73]:
# for i in value_columns:
#   df[i + '_actual_repayment_percent'] = 100 * df[i] / df['origination_amount']

# df.columns

Convert to long format.

In [74]:
df = pd.melt(df, id_vars=['origination_date', 'origination_amount'], value_vars=value_columns)
df = df.rename(columns={'variable':'repayment_date', 'value':'repayment_amount'})

Remove dates with no repayment.

In [75]:
df = df[df['repayment_amount'] != 0.00]
# df
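The wide-to-long step can be illustrated on a tiny table. This is a minimal sketch with made-up numbers; the frame `wide` and the names `id_columns` and `long` are assumptions for illustration and are not part of the task data. It passes `var_name` and `value_name` to `melt` directly, which is one way to avoid the separate rename.

```python
import pandas as pd

# Toy wide table: one row per origination month, one column per repayment month.
# All figures are made up for illustration.
wide = pd.DataFrame({
    "origination_date": ["31.05.2019", "30.06.2019"],
    "origination_amount": [100.0, 200.0],
    "31.05.2019": [10.0, 0.0],
    "30.06.2019": [30.0, 20.0],
})

id_columns = ["origination_date", "origination_amount"]
value_columns = [c for c in wide.columns if c not in id_columns]

# Each (row, repayment column) pair becomes one row of the long table.
long = wide.melt(
    id_vars=id_columns,
    value_vars=value_columns,
    var_name="repayment_date",
    value_name="repayment_amount",
)

# Drop the cells where no repayment was recorded.
long = long[long["repayment_amount"] != 0].reset_index(drop=True)
print(long)
```

The zero cell for the June loan in the May column disappears, which is the same effect as the filter on `repayment_amount` above.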

In [76]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 210 entries, 0 to 399
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   origination_date    210 non-null    object 
 1   origination_amount  210 non-null    float64
 2   repayment_date      210 non-null    object 
 3   repayment_amount    210 non-null    float64
dtypes: float64(2), object(2)
memory usage: 8.2+ KB


Add one row for the last entry as instructed in the task.

In [77]:
add_row = {'origination_date':['31.12.2020'], 'origination_amount':[30482978.52], 'repayment_date':['31.01.2021'], 'repayment_amount':[8747661.94]}

df = pd.concat([df, pd.DataFrame(add_row)], ignore_index=True)

In [79]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211 entries, 0 to 210
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   origination_date    211 non-null    object 
 1   origination_amount  211 non-null    float64
 2   repayment_date      211 non-null    object 
 3   repayment_amount    211 non-null    float64
dtypes: float64(2), object(2)
memory usage: 6.7+ KB


Convert the date columns to datetime type.

In [80]:
df['origination_date'] = pd.to_datetime(df['origination_date'], format='%d.%m.%Y')
df['repayment_date'] = pd.to_datetime(df['repayment_date'], format='%d.%m.%Y')
# df.info()
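In the format string, `%d` is the day, `%m` the month and `%Y` the four-digit year, so pandas does not have to guess whether a value is day-first or month-first. The sketch below repeats the conversion on toy data and adds a hypothetical helper column, `months_since_origination`, which is an assumption of this example (it is not in the notebook) but may be handy when reasoning about the 30-month horizon.

```python
import pandas as pd

# Toy dates in the same dd.mm.yyyy layout as the task data.
toy = pd.DataFrame({
    "origination_date": ["31.12.2019", "31.12.2019", "29.02.2020"],
    "repayment_date": ["31.12.2019", "31.01.2020", "31.03.2020"],
})

for column in ["origination_date", "repayment_date"]:
    toy[column] = pd.to_datetime(toy[column], format="%d.%m.%Y")

# Hypothetical helper column: whole calendar months between origination and repayment.
toy["months_since_origination"] = (
    (toy["repayment_date"].dt.year - toy["origination_date"].dt.year) * 12
    + (toy["repayment_date"].dt.month - toy["origination_date"].dt.month)
)
print(toy)
```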

Sort values.

In [81]:
df = df.sort_values(by=['origination_date', 'repayment_date'])
df

     origination_date  origination_amount  repayment_date  repayment_amount
0          2019-05-31         10018746.17      2019-05-31        1443069.08
1          2019-05-31         10018746.17      2019-06-30        3332200.33
3          2019-05-31         10018746.17      2019-07-31        1328138.75
6          2019-05-31         10018746.17      2019-08-31         928085.74
10         2019-05-31         10018746.17      2019-09-30         736418.27
...               ...                 ...             ...               ...
207        2020-10-31         27699586.46      2020-12-31        1503544.68
189        2020-11-30         29872889.68      2020-11-30        4383982.78
208        2020-11-30         29872889.68      2020-12-31        8383025.07
209        2020-12-31         30482978.52      2020-12-31        4373830.97


In [82]:
df_copy = df.copy()

Create a data frame from the unique pairs of origination date and origination amount so that it can be merged later with the repayment values.

In [101]:
df1 = df[['origination_date', 'origination_amount']].drop_duplicates().reset_index(drop=True)
df1

   origination_date  origination_amount
0        2019-05-31         10018746.17
1        2019-06-30         10868379.04
2        2019-07-31         10733932.61
3        2019-08-31         12558727.02
4        2019-09-30         14505071.44
5        2019-10-31          15652952.2
6        2019-11-30          15107713.3
7        2019-12-31         17004745.04
8        2020-01-31         16794379.95
9        2020-02-29         19217205.82


Get the first two repayments for each origination date.

In [102]:
two_repayments = df.groupby('origination_date')['repayment_amount'].apply(list).reset_index(name='repayments')['repayments'].apply(lambda x: x[0: 2])
df2 = pd.DataFrame(two_repayments, columns=['repayments'])
df2

                 repayments
0  [1443069.08, 3332200.33]
1   [1392751.6, 3011884.91]
2  [1537650.24, 2953335.55]
3   [1617681.94, 4082016.0]
4   [1992242.84, 3930445.6]
5  [2289453.76, 4682354.31]
6  [2162283.09, 4637701.69]
7  [2402403.37, 4947764.21]
8  [2502066.86, 4696910.48]
9  [2833811.35, 6142911.08]
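The grouping step relies on the rows already being sorted by origination date and repayment date, so that the first two entries of each group are the first two repayments. Below is a minimal sketch on made-up data; the frame `toy` and the result name `first_two` are assumptions for illustration. It keeps `origination_date` next to each list so the result can later be joined by key.

```python
import pandas as pd

# Toy long-format data with made-up amounts.
toy = pd.DataFrame({
    "origination_date": pd.to_datetime(
        ["2019-05-31", "2019-05-31", "2019-05-31", "2019-06-30", "2019-06-30"]
    ),
    "repayment_date": pd.to_datetime(
        ["2019-05-31", "2019-06-30", "2019-07-31", "2019-06-30", "2019-07-31"]
    ),
    "repayment_amount": [10.0, 30.0, 5.0, 20.0, 40.0],
})

# Sort first so that "first two" means the two earliest repayments.
toy = toy.sort_values(["origination_date", "repayment_date"])

first_two = (
    toy.groupby("origination_date")["repayment_amount"]
    .apply(lambda s: s.head(2).tolist())
    .reset_index(name="repayments")
)
print(first_two)
```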


In [103]:
df_agg = pd.concat([df1, df2], axis=1)
df_agg

   origination_date  origination_amount                repayments
0        2019-05-31         10018746.17  [1443069.08, 3332200.33]
1        2019-06-30         10868379.04   [1392751.6, 3011884.91]
2        2019-07-31         10733932.61  [1537650.24, 2953335.55]
3        2019-08-31         12558727.02   [1617681.94, 4082016.0]
4        2019-09-30         14505071.44   [1992242.84, 3930445.6]
5        2019-10-31          15652952.2  [2289453.76, 4682354.31]
6        2019-11-30          15107713.3  [2162283.09, 4637701.69]
7        2019-12-31         17004745.04  [2402403.37, 4947764.21]
8        2020-01-31         16794379.95  [2502066.86, 4696910.48]
9        2020-02-29         19217205.82  [2833811.35, 6142911.08]
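Concatenating along `axis=1` pairs rows by position, which works here because both frames are ordered by origination date. As an alternative sketch, assuming the repayment lists are kept together with their `origination_date`, the two frames can be joined by key so that row order no longer matters. The frames `originations` and `first_two` below are made-up stand-ins.

```python
import pandas as pd

originations = pd.DataFrame({
    "origination_date": pd.to_datetime(["2019-05-31", "2019-06-30"]),
    "origination_amount": [100.0, 200.0],
})

# Deliberately in a different row order than `originations`.
first_two = pd.DataFrame({
    "origination_date": pd.to_datetime(["2019-06-30", "2019-05-31"]),
    "repayments": [[20.0, 40.0], [10.0, 30.0]],
})

# validate="one_to_one" raises if a date appears more than once on either side.
merged = originations.merge(
    first_two, on="origination_date", how="left", validate="one_to_one"
)
print(merged)
```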


In [115]:
def results(a, b):
  # Express each repayment in the list a as a percentage of the amount b.
  return [100 * i / b for i in a]


results([1, 2, 3], 10)

# The repayment lists live in df_agg (df has no 'repayments' column), and apply
# needs a function of the row, not the result of calling results on whole columns.
df_agg['percent'] = df_agg.apply(lambda row: results(row['repayments'], row['origination_amount']), axis=1)

In [105]:
# Divide each repayment by the origination amount of its own row.
df_agg['repayment_percentages'] = df_agg.apply(lambda row: [100 * x / row['origination_amount'] for x in row['repayments']], axis=1)

df_agg
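The percentage step has to pair each list of repayments with the origination amount from the same row. Here is a minimal sketch on made-up data that does the pairing with `zip` instead of a row-wise `apply`; the frame `toy_agg` and the helper `to_percent` are hypothetical names used only for this example.

```python
import pandas as pd

def to_percent(repayments, origination_amount):
    """Express each repayment as a percentage of the origination amount."""
    return [100 * r / origination_amount for r in repayments]

# Toy aggregate table with made-up amounts.
toy_agg = pd.DataFrame({
    "origination_amount": [100.0, 200.0],
    "repayments": [[10.0, 30.0], [20.0, 40.0]],
})

# zip walks both columns in step, so each list meets the amount of its own row.
toy_agg["repayment_percentages"] = [
    to_percent(repayments, amount)
    for repayments, amount in zip(toy_agg["repayments"], toy_agg["origination_amount"])
]
print(toy_agg)
```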

In [86]:
# def expected_repayments(p):
  
#   for i in np.arange(4):

#     calc1 = sum(p)

#     calc2 = 1 - calc1

#     calc3 = (i - 1) / 30

#     calc4 = 1 - calc2

#     calc5 = 1 + (calc4 * calc3)

#     calc6 = p[1] * np.log(calc5)

#     pi = max(calc6, 0)

#     pi = 99.79 * pi

#     p.append(pi)
  
#   return p
	

In [87]:
list1 = [1, 2, 3, 4]

list1.pop()

list1

[1, 2, 3]

In [88]:
np.arange(3, 4)

array([3])

In [89]:
list1 = [1, 2, 3, 4]
# idx = 
list1[:-1]

[1, 2, 3]

In [90]:
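def expected_repayments(p):
  # p is a list of repayments; the forecast values are appended to it in place.
  # The 1 - calc1 terms suggest p should hold fractions of the origination amount.
  # With raw amounts, calc5 goes negative and np.log returns nan.

  for i in np.arange(3, 5):
    
    print(i)
    
    # Sum of all entries except the last one.
    calc1 = sum(p[:-1])
    print('calc1', calc1)

    calc2 = 1 - calc1
    print('calc2', calc2)

    calc3 = (i - 1) / 30
    print('calc3', calc3)

    calc4 = 1 - calc2
    print('calc4', calc4)

    # The commented-out draft above multiplies calc4 by calc3 here;
    # as written, calc3 is computed but never used.
    calc5 = 1 + (calc4 * calc2)
    print('calc5', calc5)

    calc6 = p[1] * np.log(calc5)
    print('calc6', calc6)

    pi = max(calc6, 0)
    print('pi', pi)

    # pi = 99.79 * pi
    print('pi', pi)


    p.append(pi)
  
  return p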
	

In [91]:
p = [1392751.6, 3011884.91]
expected_repayments(p)

3
calc1 1392751.6
calc2 -1392750.6
calc3 0.06666666666666667
calc4 1392751.6
calc5 -1939755626549.9602
calc6 nan
pi nan
pi nan
4
calc1 4404636.51
calc2 -4404635.51
calc3 0.1
calc4 4404636.51
calc5 -19400818380587.47
calc6 nan
pi nan
pi nan


  calc6 = p[1] * np.log(calc5)


[1392751.6, 3011884.91, nan, nan]
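The `nan` values come from taking the logarithm of a negative number: with raw amounts, `calc5` is hugely negative. The sketch below reproduces the notebook's arithmetic as written for the first loop step, once with the raw amounts and once with the same repayments divided by their origination amount (10868379.04 for this cohort, taken from the table above). It is not a statement about what the task's formula should be; it only shows that fractions keep the argument of the logarithm positive.

```python
import numpy as np

repayments = [1392751.6, 3011884.91]
origination_amount = 10868379.04

def calc5_as_written(p):
    # Same steps as calc1, calc2, calc4 and calc5 in the function above.
    calc1 = sum(p[:-1])
    calc2 = 1 - calc1
    calc4 = 1 - calc2
    return 1 + (calc4 * calc2)

# Raw amounts: calc5 is negative, so the log is undefined and NumPy returns nan.
calc5_raw = calc5_as_written(repayments)
with np.errstate(invalid="ignore"):  # silence the RuntimeWarning for this demo
    log_raw = np.log(calc5_raw)

# Fractions of the origination amount: calc5 stays above 1, so the log is finite.
fractions = [r / origination_amount for r in repayments]
calc5_frac = calc5_as_written(fractions)
log_frac = np.log(calc5_frac)

print(calc5_raw, log_raw)
print(calc5_frac, log_frac)
```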

In [92]:
def test(x):
  x.append(1)

df_agg['repayments'].apply(expected_repayments)
df_agg['repayments']

3
calc1 1443069.08
calc2 -1443068.08
calc3 0.06666666666666667
calc4 1443069.08
calc5 -2082446926581.9666
calc6 nan
pi nan
pi nan
4
calc1 4775269.41
calc2 -4775268.41
calc3 0.1
calc4 4775269.41
calc5 -22803193162811.34
calc6 nan
pi nan
pi nan
3
calc1 1392751.6
calc2 -1392750.6
calc3 0.06666666666666667
calc4 1392751.6
calc5 -1939755626549.9602
calc6 nan
pi nan
pi nan
4
calc1 4404636.51
calc2 -4404635.51
calc3 0.1
calc4 4404636.51
calc5 -19400818380587.47
calc6 nan
pi nan
pi nan
3
calc1 1537650.24
calc2 -1537649.24
calc3 0.06666666666666667
calc4 1537650.24
calc5 -2364366722920.8174
calc6 nan
pi nan
pi nan
4
calc1 4490985.79
calc2 -4490984.79
calc3 0.1
calc4 4490985.79
calc5 -20168948874995.133
calc6 nan
pi nan
pi nan
3
calc1 1617681.94
calc2 -1617680.94
calc3 0.06666666666666667
calc4 1617681.94
calc5 -2616893241319.2236
calc6 nan
pi nan
pi nan
4
calc1 5699697.9399999995
calc2 -5699696.9399999995
calc3 0.1
calc4 5699697.9399999995
calc5 -32486550907541.297
calc6 nan
pi nan
pi nan
3
cal

0     [1443069.08, 3332200.33, nan, nan]
1      [1392751.6, 3011884.91, nan, nan]
2     [1537650.24, 2953335.55, nan, nan]
3      [1617681.94, 4082016.0, nan, nan]
4      [1992242.84, 3930445.6, nan, nan]
5     [2289453.76, 4682354.31, nan, nan]
6     [2162283.09, 4637701.69, nan, nan]
7     [2402403.37, 4947764.21, nan, nan]
8     [2502066.86, 4696910.48, nan, nan]
9     [2833811.35, 6142911.08, nan, nan]
10    [2843285.54, 6228477.78, nan, nan]
11    [3332800.29, 6476251.76, nan, nan]
12    [3575896.21, 7636995.58, nan, nan]
13    [3746372.58, 8983764.16, nan, nan]
14    [3470149.61, 8030091.09, nan, nan]
15     [3510328.95, 8374134.7, nan, nan]
16    [3976808.32, 7065477.24, nan, nan]
17    [3667678.21, 8752706.73, nan, nan]
18    [4383982.78, 8383025.07, nan, nan]
19    [4373830.97, 8747661.94, nan, nan]
Name: repayments, dtype: object