## Feature Engineering on a Financial Dataset
You work for a major bank in the Czech Republic and have been asked to analyze the transactions of existing customers. The data team has extracted the database tables it thinks will be useful for this analysis. You need to consolidate the data from those tables into a single DataFrame and create new features, producing an enriched dataset for an in-depth analysis of customers' banking transactions.

You will be using only the following four tables:

- account: The characteristics of a customer's bank account for a given branch
- client: Personal information related to the bank's customers
- disp: A table that links an account to a customer
- trans: A list of all historical transactions by account

If you want to know more about these tables, you can look at the data dictionary for this dataset: https://github.com/Kusainov/czech-banking-fin-analysis/blob/master/Data%20dictionary.pdf

In [1]:
import pandas as pd

In [2]:
account_df = pd.read_csv('account.csv', sep=';')
client_df = pd.read_csv('client.csv', sep=';')
disp_df = pd.read_csv('disp.csv', sep=';')
trans_df = pd.read_csv('trans.csv', sep=';')


In [3]:
# analyze account_df
print(account_df.shape)
print(account_df.info())
account_df.head()

(4500, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4500 entries, 0 to 4499
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   account_id   4500 non-null   int64 
 1   district_id  4500 non-null   int64 
 2   frequency    4500 non-null   object
 3   date         4500 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 140.8+ KB
None


Unnamed: 0,account_id,district_id,frequency,date
0,576,55,POPLATEK MESICNE,930101
1,3818,74,POPLATEK MESICNE,930101
2,704,55,POPLATEK MESICNE,930101
3,2378,16,POPLATEK MESICNE,930101
4,2632,24,POPLATEK MESICNE,930102


In [4]:
# analyze client_df
print(client_df.shape)
print(client_df.info())
client_df.head()

(5369, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5369 entries, 0 to 5368
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   client_id     5369 non-null   int64
 1   birth_number  5369 non-null   int64
 2   district_id   5369 non-null   int64
dtypes: int64(3)
memory usage: 126.0 KB
None


Unnamed: 0,client_id,birth_number,district_id
0,1,706213,18
1,2,450204,1
2,3,406009,1
3,4,561201,5
4,5,605703,5


In [5]:
# analyze disp_df
print(disp_df.shape)
print(disp_df.info())
disp_df.head()

(5369, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5369 entries, 0 to 5368
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   disp_id     5369 non-null   int64 
 1   client_id   5369 non-null   int64 
 2   account_id  5369 non-null   int64 
 3   type        5369 non-null   object
dtypes: int64(3), object(1)
memory usage: 167.9+ KB
None


Unnamed: 0,disp_id,client_id,account_id,type
0,1,1,1,OWNER
1,2,2,2,OWNER
2,3,3,2,DISPONENT
3,4,4,3,OWNER
4,5,5,3,DISPONENT


In [6]:
# analyze trans_df
print(trans_df.shape)
print(trans_df.info())
trans_df.head()

(1056320, 10)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1056320 entries, 0 to 1056319
Data columns (total 10 columns):
 #   Column      Non-Null Count    Dtype  
---  ------      --------------    -----  
 0   trans_id    1056320 non-null  int64  
 1   account_id  1056320 non-null  int64  
 2   date        1056320 non-null  int64  
 3   type        1056320 non-null  object 
 4   operation   873206 non-null   object 
 5   amount      1056320 non-null  float64
 6   balance     1056320 non-null  float64
 7   k_symbol    574439 non-null   object 
 8   bank        273508 non-null   object 
 9   account     295389 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 80.6+ MB
None


Unnamed: 0,trans_id,account_id,date,type,operation,amount,balance,k_symbol,bank,account
0,695247,2378,930101,PRIJEM,VKLAD,700.0,700.0,,,
1,171812,576,930101,PRIJEM,VKLAD,900.0,900.0,,,
2,207264,704,930101,PRIJEM,VKLAD,1000.0,1000.0,,,
3,1117247,3818,930101,PRIJEM,VKLAD,600.0,600.0,,,
4,579373,1972,930102,PRIJEM,VKLAD,400.0,400.0,,,


In [8]:
# merge the tables together on their common columns, starting with trans_df and account_df
trans_account_df = pd.merge(trans_df, account_df, how='left',
                           on=['account_id'])
trans_account_df.head()

Unnamed: 0,trans_id,account_id,date_x,type,operation,amount,balance,k_symbol,bank,account,district_id,frequency,date_y
0,695247,2378,930101,PRIJEM,VKLAD,700.0,700.0,,,,16,POPLATEK MESICNE,930101
1,171812,576,930101,PRIJEM,VKLAD,900.0,900.0,,,,55,POPLATEK MESICNE,930101
2,207264,704,930101,PRIJEM,VKLAD,1000.0,1000.0,,,,55,POPLATEK MESICNE,930101
3,1117247,3818,930101,PRIJEM,VKLAD,600.0,600.0,,,,74,POPLATEK MESICNE,930101
4,579373,1972,930102,PRIJEM,VKLAD,400.0,400.0,,,,77,POPLATEK MESICNE,930102
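Both tables carry a `date` column, which is why pandas appended the default `_x` and `_y` suffixes in the output above. The sketch below, on made-up rows, shows how the `suffixes` argument of `pd.merge` could label the overlapping columns at merge time. The toy values and the suffix names `_trans` and `_account` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy stand-ins for trans_df and account_df (values are made up)
toy_trans = pd.DataFrame({'trans_id': [1, 2],
                          'account_id': [10, 10],
                          'date': [930105, 930210]})
toy_account = pd.DataFrame({'account_id': [10],
                            'date': [930101]})

# suffixes replaces the default ('_x', '_y') for overlapping non-key columns
toy_merged = pd.merge(toy_trans, toy_account, how='left', on='account_id',
                      suffixes=('_trans', '_account'))
print(toy_merged.columns.tolist())
```

Naming the columns at merge time is one option; renaming them afterwards, as this notebook does, works equally well.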


In [9]:
disp_df.head()

Unnamed: 0,disp_id,client_id,account_id,type
0,1,1,1,OWNER
1,2,2,2,OWNER
2,3,3,2,DISPONENT
3,4,4,3,OWNER
4,5,5,3,DISPONENT


#### The account_id column in disp_df is not unique (an account can have both an OWNER and a DISPONENT), so merging on it would add duplicate rows. Subset the DataFrame to only the **OWNER** type.

In [10]:
disp_owner_df = disp_df[disp_df['type'] == 'OWNER']

In [11]:
disp_owner_df.duplicated(subset='account_id').sum()

0
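To illustrate why the uniqueness check matters, here is a small sketch on made-up rows. It shows a left merge on a non-unique key inflating the row count, and how the `validate` argument of `pd.merge` can guard against that. The toy frames are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins: account 2 has both an OWNER and a DISPONENT
toy_trans = pd.DataFrame({'trans_id': [1, 2, 3],
                          'account_id': [2, 2, 3]})
toy_disp = pd.DataFrame({'disp_id': [2, 3, 4],
                         'account_id': [2, 2, 3],
                         'type': ['OWNER', 'DISPONENT', 'OWNER']})

# Merging on a non-unique key repeats every matching transaction
inflated = pd.merge(toy_trans, toy_disp, how='left', on='account_id')
print(len(toy_trans), len(inflated))

# After keeping owners only, account_id is unique on the right side;
# validate='many_to_one' makes pandas check that assumption
toy_owner = toy_disp[toy_disp['type'] == 'OWNER']
checked = pd.merge(toy_trans, toy_owner, how='left', on='account_id',
                   validate='many_to_one')

# The same check fails loudly on the unfiltered table
try:
    pd.merge(toy_trans, toy_disp, how='left', on='account_id',
             validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```

With `validate` in place, a duplicated key raises an error instead of silently adding rows.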

In [12]:
# merge trans_account_df and disp_owner_df
trans_acc_disp_df = pd.merge(trans_account_df, disp_owner_df, how='left', 
                            on='account_id')
print(trans_acc_disp_df.shape)
trans_acc_disp_df.head()

(1056320, 16)


Unnamed: 0,trans_id,account_id,date_x,type_x,operation,amount,balance,k_symbol,bank,account,district_id,frequency,date_y,disp_id,client_id,type_y
0,695247,2378,930101,PRIJEM,VKLAD,700.0,700.0,,,,16,POPLATEK MESICNE,930101,2873,2873,OWNER
1,171812,576,930101,PRIJEM,VKLAD,900.0,900.0,,,,55,POPLATEK MESICNE,930101,692,692,OWNER
2,207264,704,930101,PRIJEM,VKLAD,1000.0,1000.0,,,,55,POPLATEK MESICNE,930101,844,844,OWNER
3,1117247,3818,930101,PRIJEM,VKLAD,600.0,600.0,,,,74,POPLATEK MESICNE,930101,4601,4601,OWNER
4,579373,1972,930102,PRIJEM,VKLAD,400.0,400.0,,,,77,POPLATEK MESICNE,930102,2397,2397,OWNER


In [13]:
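# merge trans_acc_disp_df with client_df
# note: joining on district_id as well only matches clients whose district_id
# equals the account's district_id; the other rows get a missing birth_number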
merged_df = pd.merge(trans_acc_disp_df, client_df, how='left', 
                    on=['client_id', 'district_id'])
print(merged_df.shape)
merged_df.head()

(1056320, 17)


Unnamed: 0,trans_id,account_id,date_x,type_x,operation,amount,balance,k_symbol,bank,account,district_id,frequency,date_y,disp_id,client_id,type_y,birth_number
0,695247,2378,930101,PRIJEM,VKLAD,700.0,700.0,,,,16,POPLATEK MESICNE,930101,2873,2873,OWNER,755324.0
1,171812,576,930101,PRIJEM,VKLAD,900.0,900.0,,,,55,POPLATEK MESICNE,930101,692,692,OWNER,
2,207264,704,930101,PRIJEM,VKLAD,1000.0,1000.0,,,,55,POPLATEK MESICNE,930101,844,844,OWNER,
3,1117247,3818,930101,PRIJEM,VKLAD,600.0,600.0,,,,74,POPLATEK MESICNE,930101,4601,4601,OWNER,
4,579373,1972,930102,PRIJEM,VKLAD,400.0,400.0,,,,77,POPLATEK MESICNE,930102,2397,2397,OWNER,
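Several rows above have no `birth_number` because the left join found no client with the same `client_id` and `district_id`. A minimal sketch on made-up rows shows how `indicator=True` can count such unmatched rows, and what a join on `client_id` alone would look like. Whether the account's district should be required to match the client's district is a modeling decision this sketch does not settle. The toy values and suffix names are assumptions.

```python
import pandas as pd

# Toy stand-ins: client 2 lives in a different district than the account
toy_left = pd.DataFrame({'client_id': [1, 2, 3],
                         'district_id': [18, 1, 5]})
toy_client = pd.DataFrame({'client_id': [1, 2, 3],
                           'district_id': [18, 7, 5],
                           'birth_number': [706213, 450204, 406009]})

# indicator=True adds a _merge column: 'both' or 'left_only' for a left join
both_keys = pd.merge(toy_left, toy_client, how='left',
                     on=['client_id', 'district_id'], indicator=True)
print(both_keys['_merge'].value_counts())

# Joining on client_id only keeps both district columns, with suffixes
id_only = pd.merge(toy_left, toy_client, how='left', on='client_id',
                   suffixes=('_account', '_client'))
print(id_only['birth_number'].isna().sum())
```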


In [14]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1056320 entries, 0 to 1056319
Data columns (total 17 columns):
 #   Column        Non-Null Count    Dtype  
---  ------        --------------    -----  
 0   trans_id      1056320 non-null  int64  
 1   account_id    1056320 non-null  int64  
 2   date_x        1056320 non-null  int64  
 3   type_x        1056320 non-null  object 
 4   operation     873206 non-null   object 
 5   amount        1056320 non-null  float64
 6   balance       1056320 non-null  float64
 7   k_symbol      574439 non-null   object 
 8   bank          273508 non-null   object 
 9   account       295389 non-null   float64
 10  district_id   1056320 non-null  int64  
 11  frequency     1056320 non-null  object 
 12  date_y        1056320 non-null  int64  
 13  disp_id       1056320 non-null  int64  
 14  client_id     1056320 non-null  int64  
 15  type_y        1056320 non-null  object 
 16  birth_number  894186 non-null   float64
dtypes: float64(4), int64(7), ob

In [15]:
merged_df.rename(columns={'date_x': 'trans_date', 
                          'type_x': 'trans_type', 
                          'date_y': 'account_creation', 
                          'type_y': 'client_type'}, 
                inplace=True)
merged_df.head()

Unnamed: 0,trans_id,account_id,trans_date,trans_type,operation,amount,balance,k_symbol,bank,account,district_id,frequency,account_creation,disp_id,client_id,client_type,birth_number
0,695247,2378,930101,PRIJEM,VKLAD,700.0,700.0,,,,16,POPLATEK MESICNE,930101,2873,2873,OWNER,755324.0
1,171812,576,930101,PRIJEM,VKLAD,900.0,900.0,,,,55,POPLATEK MESICNE,930101,692,692,OWNER,
2,207264,704,930101,PRIJEM,VKLAD,1000.0,1000.0,,,,55,POPLATEK MESICNE,930101,844,844,OWNER,
3,1117247,3818,930101,PRIJEM,VKLAD,600.0,600.0,,,,74,POPLATEK MESICNE,930101,4601,4601,OWNER,
4,579373,1972,930102,PRIJEM,VKLAD,400.0,400.0,,,,77,POPLATEK MESICNE,930102,2397,2397,OWNER,


In [16]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1056320 entries, 0 to 1056319
Data columns (total 17 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   trans_id          1056320 non-null  int64  
 1   account_id        1056320 non-null  int64  
 2   trans_date        1056320 non-null  int64  
 3   trans_type        1056320 non-null  object 
 4   operation         873206 non-null   object 
 5   amount            1056320 non-null  float64
 6   balance           1056320 non-null  float64
 7   k_symbol          574439 non-null   object 
 8   bank              273508 non-null   object 
 9   account           295389 non-null   float64
 10  district_id       1056320 non-null  int64  
 11  frequency         1056320 non-null  object 
 12  account_creation  1056320 non-null  int64  
 13  disp_id           1056320 non-null  int64  
 14  client_id         1056320 non-null  int64  
 15  client_type       1056320 non-null  object 
 16  

In [17]:
# convert trans_date and account_creation to datetime
merged_df['trans_date'] = pd.to_datetime(merged_df['trans_date'], 
                                        format='%y%m%d')
merged_df['account_creation'] = pd.to_datetime(merged_df['account_creation'], 
                                              format='%y%m%d')

In [18]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1056320 entries, 0 to 1056319
Data columns (total 17 columns):
 #   Column            Non-Null Count    Dtype         
---  ------            --------------    -----         
 0   trans_id          1056320 non-null  int64         
 1   account_id        1056320 non-null  int64         
 2   trans_date        1056320 non-null  datetime64[ns]
 3   trans_type        1056320 non-null  object        
 4   operation         873206 non-null   object        
 5   amount            1056320 non-null  float64       
 6   balance           1056320 non-null  float64       
 7   k_symbol          574439 non-null   object        
 8   bank              273508 non-null   object        
 9   account           295389 non-null   float64       
 10  district_id       1056320 non-null  int64         
 11  frequency         1056320 non-null  object        
 12  account_creation  1056320 non-null  datetime64[ns]
 13  disp_id           1056320 non-null  int64 

In [19]:
# transformations on birth_number
# birth_number is encoded as YYMMDD, with 50 added to the month for women
# (see the data dictionary), so is_female can be derived with:
# (merged_df['birth_number'] % 10000) / 5000 > 1
merged_df['is_female'] = (merged_df['birth_number'] % 10000) / 5000 > 1

In [20]:
merged_df['birth_number'].head()

0    755324.0
1         NaN
2         NaN
3         NaN
4         NaN
Name: birth_number, dtype: float64

In [21]:
# for all rows where is_female is True, subtract 5000 from birth_number to recover the real month
merged_df.loc[merged_df['is_female'] == True, 'birth_number'] -= 5000

In [22]:
merged_df['birth_number'].head()

0    750324.0
1         NaN
2         NaN
3         NaN
4         NaN
Name: birth_number, dtype: float64

In [23]:
# preview the conversion of birth_number to datetime with format='%y%m%d' and errors='coerce'
pd.to_datetime(merged_df['birth_number'], format='%y%m%d', 
              errors='coerce')

0         1975-03-24
1                NaT
2                NaT
3                NaT
4                NaT
             ...    
1056315   2046-05-25
1056316   2066-11-01
1056317   2033-12-31
1056318   2022-07-20
1056319   2046-12-02
Name: birth_number, Length: 1056320, dtype: datetime64[ns]
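The 21st-century years in this output come from how the `%y` directive resolves two-digit years: values 69 to 99 map to 1969 to 1999, and values 00 to 68 map to 2000 to 2068. The sketch below reproduces the effect on two made-up birth numbers and shows that adding an explicit century avoids it. The assumption that every client was born in the 1900s is the same one the `'19'` prefix in this notebook relies on.

```python
import pandas as pd

# Two made-up birth numbers: born in '75 and in '46
two_digit = pd.Series(['750324', '460525'])

# %y picks the century by itself: 46 becomes 2046
parsed = pd.to_datetime(two_digit, format='%y%m%d')
print(parsed.dt.year.tolist())

# Spelling out the century (assumed to be 19xx) and using %Y removes the ambiguity
four_digit = pd.to_datetime('19' + two_digit, format='%Y%m%d')
print(four_digit.dt.year.tolist())
```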

In [24]:
# Because the year was recorded with only two digits in this dataset, some birth dates
# are placed in the 21st century (e.g. 2046) instead of the 20th. This needs to be fixed.
# convert birth_number column to string and print out first five rows
merged_df['birth_number'] = merged_df['birth_number'].astype(str)
merged_df['birth_number'].head()

0    750324.0
1         nan
2         nan
3         nan
4         nan
Name: birth_number, dtype: object

In [25]:
# After the conversion to a string, all the missing values are converted to a string
# with the 'nan' value. Convert them back to proper missing values.
import numpy as np
merged_df.loc[merged_df['birth_number'] == 'nan', 
             'birth_number'] = np.nan

In [26]:
merged_df['birth_number'].head()

0    750324.0
1         NaN
2         NaN
3         NaN
4         NaN
Name: birth_number, dtype: object
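As a side note, the round trip through the `'nan'` string can be avoided with the pandas nullable `'string'` dtype, which keeps missing values missing during the conversion. This is a sketch of an alternative on made-up values, not what the notebook does.

```python
import numpy as np
import pandas as pd

toy = pd.Series([750324.0, np.nan])

# In the pandas version used in this notebook, astype(str) turns NaN into the text 'nan'
as_object = toy.astype(str)
print(as_object.tolist())

# The nullable 'string' dtype keeps the missing value as <NA>
as_string = toy.astype('string')
print(as_string.isna().tolist())
```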

In [27]:
# Add the 19 prefix to birth_number for all rows that don't have missing values
merged_df.loc[~merged_df['birth_number'].isna(), 
             'birth_number'] = '19' + merged_df.loc[~merged_df['birth_number'].isna(), 
                                                   'birth_number']
merged_df['birth_number'].head()

0    19750324.0
1           NaN
2           NaN
3           NaN
4           NaN
Name: birth_number, dtype: object

In [28]:
# Convert the birth_number column with pd.to_datetime() using the following parameters:
# format='%Y%m%d', errors='coerce' and save results to birth_number
merged_df['birth_number'] = pd.to_datetime(merged_df['birth_number'], 
                                          format='%Y%m%d', errors='coerce')
merged_df['birth_number'].head(20)

0    1975-03-24
1           NaT
2           NaT
3           NaT
4           NaT
5    1938-08-12
6           NaT
7    1979-03-24
8    1971-03-02
9           NaT
10   1970-06-24
11          NaT
12          NaT
13   1928-04-02
14   1940-12-02
15          NaT
16   1925-08-30
17          NaT
18          NaT
19   1978-06-27
Name: birth_number, dtype: datetime64[ns]
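The steps from the gender flag to this final conversion can be condensed into one helper. The function below is a hypothetical sketch (the name `decode_birth_number` is not part of the notebook) and assumes, as above, that all clients were born in the 1900s. It builds six-digit strings from integers, so no trailing `.0` reaches the date parser.

```python
import pandas as pd

def decode_birth_number(birth_number: pd.Series) -> pd.DataFrame:
    """Split YYMMDD birth numbers into a gender flag and a birth date.

    Hypothetical helper; assumes every birth year is 19YY.
    """
    bn = birth_number.astype(float)

    # Women have 50 added to the month; NaN compares as False
    month = (bn // 100) % 100
    is_female = month > 50

    # Undo the +50 on the month for women
    bn_fixed = bn.where(~is_female, bn - 5000)

    # Build zero-padded 6-digit strings for the non-missing rows only
    text = bn_fixed.dropna().astype(int).astype(str).str.zfill(6)

    # Prefix the assumed century, parse, and restore the missing rows as NaT
    birth_date = pd.to_datetime('19' + text, format='%Y%m%d',
                                errors='coerce').reindex(birth_number.index)

    return pd.DataFrame({'is_female': is_female, 'birth_date': birth_date})

# Usage on made-up values: a woman born in 1975, a man born in 1945, a missing value
decoded = decode_birth_number(pd.Series([755324, 450204, None]))
print(decoded)
```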

In [32]:
(merged_df['account_creation'] - merged_df['birth_number']) / np.timedelta64(1, 'Y')

0          17.777230
1                NaN
2                NaN
3                NaN
4                NaN
             ...    
1056315    50.325469
1056316    28.534467
1056317    62.741877
1056318    71.141776
1056319    46.158374
Length: 1056320, dtype: float64

In [33]:
# The year issue has been fixed. Create a feature that will calculate the age of the customer when their account was created
merged_df['age_at_creation'] = merged_df['account_creation'] - merged_df['birth_number']

In [34]:
# Convert the timedelta results in age_at_creation to years by dividing them by np.timedelta64(1, 'Y')
merged_df['age_at_creation'] = merged_df['age_at_creation'] / np.timedelta64(1, 'Y')

In [35]:
# Round age_at_creation to the nearest whole year using .round() (the column stays float because of missing values)
merged_df['age_at_creation'] = merged_df['age_at_creation'].round()
merged_df.head()

Unnamed: 0,trans_id,account_id,trans_date,trans_type,operation,amount,balance,k_symbol,bank,account,district_id,frequency,account_creation,disp_id,client_id,client_type,birth_number,is_female,age_at_creation
0,695247,2378,1993-01-01,PRIJEM,VKLAD,700.0,700.0,,,,16,POPLATEK MESICNE,1993-01-01,2873,2873,OWNER,1975-03-24,True,18.0
1,171812,576,1993-01-01,PRIJEM,VKLAD,900.0,900.0,,,,55,POPLATEK MESICNE,1993-01-01,692,692,OWNER,NaT,False,
2,207264,704,1993-01-01,PRIJEM,VKLAD,1000.0,1000.0,,,,55,POPLATEK MESICNE,1993-01-01,844,844,OWNER,NaT,False,
3,1117247,3818,1993-01-01,PRIJEM,VKLAD,600.0,600.0,,,,74,POPLATEK MESICNE,1993-01-01,4601,4601,OWNER,NaT,False,
4,579373,1972,1993-01-02,PRIJEM,VKLAD,400.0,400.0,,,,77,POPLATEK MESICNE,1993-01-02,2397,2397,OWNER,NaT,False,
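`np.timedelta64(1, 'Y')` stands for an average Gregorian year of 365.2425 days, and some newer pandas releases may reject the `'Y'` unit in this kind of division. A sketch of an equivalent calculation that avoids the unit, on made-up dates, divides the number of elapsed days by 365.2425 instead.

```python
import pandas as pd

# Toy stand-in for merged_df with one known and one missing birth date
toy = pd.DataFrame({
    'account_creation': pd.to_datetime(['1993-01-01', '1993-01-02']),
    'birth_number': pd.to_datetime(['1975-03-24', None]),
})

# Elapsed time as whole days, then converted to average Gregorian years
elapsed = toy['account_creation'] - toy['birth_number']
toy['age_at_creation'] = (elapsed.dt.days / 365.2425).round()
print(toy['age_at_creation'].tolist())
```

This is an approximation in the same way the original division is: it measures elapsed time in average years rather than counting birthdays.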


In [15]:
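# merge trans_account_df with disp_df on account_id and type
# note: the transaction type (e.g. PRIJEM) never equals the disposition type
# (OWNER / DISPONENT), so this left join finds no match and disp_id and client_id stay empty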
trans_account_disp_df = pd.merge(trans_account_df, disp_df, how='left',
                                on=['account_id', 'type'])
trans_account_disp_df.head()

Unnamed: 0,trans_id,account_id,date,type,operation,amount,balance,k_symbol,bank,account,district_id,frequency,disp_id,client_id
0,695247,2378,930101,PRIJEM,VKLAD,700.0,700.0,,,,16.0,POPLATEK MESICNE,,
1,171812,576,930101,PRIJEM,VKLAD,900.0,900.0,,,,55.0,POPLATEK MESICNE,,
2,207264,704,930101,PRIJEM,VKLAD,1000.0,1000.0,,,,55.0,POPLATEK MESICNE,,
3,1117247,3818,930101,PRIJEM,VKLAD,600.0,600.0,,,,74.0,POPLATEK MESICNE,,
4,579373,1972,930102,PRIJEM,VKLAD,400.0,400.0,,,,77.0,POPLATEK MESICNE,,


In [17]:
# final dataset
final_df = pd.merge(trans_account_disp_df, client_df, 
                  how='left', on=['client_id', 'district_id'])
final_df.head()

Unnamed: 0,trans_id,account_id,date,type,operation,amount,balance,k_symbol,bank,account,district_id,frequency,disp_id,client_id,birth_number
0,695247,2378,930101,PRIJEM,VKLAD,700.0,700.0,,,,16.0,POPLATEK MESICNE,,,
1,171812,576,930101,PRIJEM,VKLAD,900.0,900.0,,,,55.0,POPLATEK MESICNE,,,
2,207264,704,930101,PRIJEM,VKLAD,1000.0,1000.0,,,,55.0,POPLATEK MESICNE,,,
3,1117247,3818,930101,PRIJEM,VKLAD,600.0,600.0,,,,74.0,POPLATEK MESICNE,,,
4,579373,1972,930102,PRIJEM,VKLAD,400.0,400.0,,,,77.0,POPLATEK MESICNE,,,


In [18]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1056320 entries, 0 to 1056319
Data columns (total 15 columns):
 #   Column        Non-Null Count    Dtype  
---  ------        --------------    -----  
 0   trans_id      1056320 non-null  int64  
 1   account_id    1056320 non-null  int64  
 2   date          1056320 non-null  int64  
 3   type          1056320 non-null  object 
 4   operation     873206 non-null   object 
 5   amount        1056320 non-null  float64
 6   balance       1056320 non-null  float64
 7   k_symbol      574439 non-null   object 
 8   bank          273508 non-null   object 
 9   account       295389 non-null   float64
 10  district_id   4576 non-null     float64
 11  frequency     4576 non-null     object 
 12  disp_id       0 non-null        float64
 13  client_id     0 non-null        float64
 14  birth_number  0 non-null        float64
dtypes: float64(7), int64(3), object(5)
memory usage: 128.9+ MB
