# AML Mini-Challenge - Credit Card Affinity Modelling

> Dominik Filliger & Noah Leuenberger

The task can be found [here](https://spaces.technik.fhnw.ch/storage/uploads/spaces/82/exercises/20240218__AML_Trainingscenter_MiniChallenge_Kreditkarten_Aufgabenstellung-1708412668.pdf).

# Setup

In [None]:
import pandas as pd
from datetime import datetime

import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme()
#plt.style.use('seaborn-white')
#plt.style.use('ggplot')

# Data Import & Wrangling

In [None]:
def remap_values(df, column, mapping):
    # assert that all values in the column are in the mapping except for NaN
    assert df[column].dropna().isin(mapping.keys()).all()
    
    df[column] = df[column].map(mapping, na_action='ignore')
    return df

def map_empty_to_nan(df, column):
    if df[column].dtype != 'object':
        return df

    df[column] = df[column].replace(r'^\s*$', np.nan, regex=True)
    return df

def read_csv(file_path, sep=";", dtypes=None):
    df = pd.read_csv(file_path, sep=sep, dtype=dtypes)
    
    for col in df.columns:
        df = map_empty_to_nan(df, col)
        
    return df

## Accounts

In [None]:
accounts = read_csv("data/account.csv")

# Translated frequency from Czech to English
# according to https://sorry.vse.cz/~berka/challenge/PAST/index.html
accounts = remap_values(accounts, 'frequency', {
    "POPLATEK MESICNE": "MONTHLY_ISSUANCE",
    "POPLATEK TYDNE": "WEEKLY_ISSUANCE",
    "POPLATEK PO OBRATU": "ISSUANCE_AFTER_TRANSACTION"
})

accounts['date'] = pd.to_datetime(accounts['date'], format='%y%m%d')

accounts.rename(columns={'date': 'account_created',
                         'frequency': 'account_frequency'}, inplace=True)

accounts.info()

## Clients

In [None]:
clients = read_csv("data/client.csv")

def parse_birth_number(birth_number):
    birth_number_str = str(birth_number)

    # Extract year, month, and day from birth number from string
    # according to https://sorry.vse.cz/~berka/challenge/PAST/index.html
    year = int(birth_number_str[:2])
    month = int(birth_number_str[2:4])
    day = int(birth_number_str[4:6])

    # Determine sex based on month and adjust month for female clients
    # according to https://sorry.vse.cz/~berka/challenge/PAST/index.html
    if month > 50:
        sex = "Female"
        month -= 50
    else:
        sex = "Male"

    # Validate date
    assert 1 <= month <= 12
    assert 1 <= day <= 31
    assert 0 <= year <= 99

    if month in [4, 6, 9, 11]:
        assert 1 <= day <= 30
    elif month == 2:
        assert 1 <= day <= 29
    else:
        assert 1 <= day <= 31

    # Assuming all dates are in the 1900s
    birth_date = datetime(1900 + year, month, day)
    return pd.Series([sex, birth_date])


clients[['sex', 'birth_date']] = clients['birth_number'].apply(parse_birth_number)

# Calculate 'age' assuming the reference year is 1999
clients['age'] = clients['birth_date'].apply(lambda x: 1999 - x.year)

# Drop 'birth_number' column as it is no longer needed
clients = clients.drop(columns=['birth_number'])

clients.info()

## Dispositions

In [None]:
dispositions = read_csv("data/disp.csv")

#dispositions = dispositions[dispositions['type'] == 'OWNER']

dispositions.info()

## Orders

In [None]:
orders = read_csv("data/order.csv")

# Translated from Czech to English
# according to https://sorry.vse.cz/~berka/challenge/PAST/index.html
orders = remap_values(orders, 'k_symbol', {
    "POJISTNE": "Insurance_Payment",
    "SIPO": "Household",
    "LEASING": "Leasing",
    "UVER": "Loan_Payment"
})

orders['account_to'] = orders['account_to'].astype('category')

orders = orders.rename(columns={'amount': 'debited_amount'})

orders.info()

## Transactions

In [None]:
# column 8 is the 'bank' column which contains NaNs and must be read as string
transactions = read_csv("data/trans.csv", dtypes={8: str})

transactions['date'] = pd.to_datetime(transactions['date'], format='%y%m%d')

# Translated type, operations and characteristics from Czech to English
# according to https://sorry.vse.cz/~berka/challenge/PAST/index.html
transactions = remap_values(transactions, 'type', {
    "VYBER": "Withdrawal",  # Not documented as a type, but present in the data; treated as a withdrawal
    "PRIJEM": "Credit",
    "VYDAJ": "Withdrawal"
})

transactions = remap_values(transactions, 'operation', {
    "VYBER KARTOU": "Credit Card Withdrawal",
    "VKLAD": "Credit in Cash",
    "PREVOD Z UCTU": "Collection from Another Bank",
    "VYBER": "Withdrawal in Cash",
    "PREVOD NA UCET": "Remittance to Another Bank"
})

transactions = remap_values(transactions, 'k_symbol', {
    "POJISTNE": "Insurance Payment",
    "SLUZBY": "Payment on Statement",
    "UROK": "Interest Credited",
    "SANKC. UROK": "Sanction Interest",
    "SIPO": "Household",
    "DUCHOD": "Old-age Pension",
    "UVER": "Loan Payment"
})

# Set the amount to negative for withdrawals and positive for credits
transactions['amount'] = np.where(transactions['type'] == "Credit", transactions['amount'], -transactions['amount'])

transactions.rename(columns={'type': 'transaction_type'}, inplace=True)

transactions.info()

We need a categorical indicator of whether a transaction is incoming or outgoing from the perspective of the account holder. This will be important for the feature engineering later on. We will create a column called `transaction_direction`, derived from the sign of the amount.
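A minimal sketch of how such a column could be derived, assuming the signed `amount` convention set above (credits positive, withdrawals negative). The frame `toy_transactions` and the labels `"Incoming"` / `"Outgoing"` are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the signed `amount` column of `transactions`
toy_transactions = pd.DataFrame({"amount": [1200.0, -350.0, -80.5, 40.0]})

# Non-negative amounts are treated as incoming, negative amounts as outgoing
toy_transactions["transaction_direction"] = pd.Categorical(
    np.where(toy_transactions["amount"] >= 0, "Incoming", "Outgoing")
)

print(toy_transactions)
```

Deriving the direction from the sign rather than from `transaction_type` keeps the feature consistent with the sign convention, but it is only one possible choice.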

`balance` is the account balance after the transaction.

`k_symbol` is the purpose of the transaction. This is often used in the context of budgeting in e-banking applications or for personal finance management. Many NA values are present in this column. We will have to deal with this later on and weigh the importance of this column.



Track the time series of a given account to get a better understanding of the dataset's nature.

It seems that there can be multiple transactions on the same day. As the timestamps only have daily resolution, the order of transactions within a day cannot be recovered, so we will have to aggregate the transactions per day.
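One possible way to aggregate same-day transactions is sketched below on hypothetical toy data. The column names `account_id`, `date`, `amount` and `balance` follow `trans.csv`; taking the last `balance` of a day as the end-of-day balance assumes that the row order within a day reflects the booking order, which is not verified here.

```python
import pandas as pd

# Hypothetical toy data: account 1 has two transactions on the same day
toy_tx_daily = pd.DataFrame({
    "account_id": [1, 1, 1, 2],
    "date": pd.to_datetime(["1995-03-01", "1995-03-01", "1995-03-02", "1995-03-01"]),
    "amount": [500.0, -200.0, -50.0, 1000.0],
    "balance": [1500.0, 1300.0, 1250.0, 1000.0],
})

# One row per account and day: net flow, number of transactions, last seen balance
daily = toy_tx_daily.groupby(["account_id", "date"], as_index=False).agg(
    net_amount=("amount", "sum"),
    n_transactions=("amount", "count"),
    end_of_day_balance=("balance", "last"),  # assumption: rows are in booking order
)

print(daily)
```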

We need some handling for this, as the ID does not carry this information either (Dani).

For the feature engineering, a per-month evaluation of the transactions is sufficient (Dani).
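A per-month evaluation could look roughly like the following sketch, again on hypothetical toy data. The feature names `net_amount` and `n_transactions` are placeholders, not a decision on the final feature set.

```python
import pandas as pd

# Hypothetical toy data spanning two months
toy_tx_monthly = pd.DataFrame({
    "account_id": [1, 1, 1, 2],
    "date": pd.to_datetime(["1995-03-01", "1995-03-15", "1995-04-02", "1995-03-20"]),
    "amount": [500.0, -200.0, -50.0, 1000.0],
})

# Bucket each transaction into its calendar month and aggregate per account
monthly = (
    toy_tx_monthly
    .assign(month=toy_tx_monthly["date"].dt.to_period("M"))
    .groupby(["account_id", "month"], as_index=False)
    .agg(net_amount=("amount", "sum"), n_transactions=("amount", "count"))
)

print(monthly)
```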

We need to make sure across the board that the prediction only uses data that is available at the time of the prediction, i.e. we can only use data from the past to predict the future. This is important to keep in mind when we engineer the features: some entities carry no date information, so we cannot use them for the prediction because we cannot rule out that they are from the future.
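The point-in-time constraint can be sketched as a window filter relative to a reference date. The helper `history_before` and its parameters `months` and `lag_months` are hypothetical; the default of 12 months of history plus a 1-month lag is an assumption for illustration and not a fixed requirement.

```python
import pandas as pd

def history_before(events, account_id, reference_date, months=12, lag_months=1):
    """Return the events of one account that lie in the history window
    [reference_date - lag - months, reference_date - lag)."""
    window_end = reference_date - pd.DateOffset(months=lag_months)
    window_start = window_end - pd.DateOffset(months=months)
    mask = (
        (events["account_id"] == account_id)
        & (events["date"] >= window_start)
        & (events["date"] < window_end)
    )
    return events.loc[mask]

# Hypothetical toy events
toy_events = pd.DataFrame({
    "account_id": [1, 1, 1, 1, 2],
    "date": pd.to_datetime(
        ["1996-01-10", "1997-01-15", "1997-02-20", "1997-03-05", "1997-01-20"]
    ),
    "amount": [100.0, -50.0, 75.0, -20.0, 10.0],
})

# Only the 1997-01-15 event is inside the window; the others are too old,
# inside the lag period, in the future, or belong to another account
history = history_before(toy_events, account_id=1, reference_date=pd.Timestamp("1997-03-01"))
print(history)
```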

Frequency analysis of the transactions could be interesting as the hypothesis might be that the more frequent the transactions the more likely the account holder is to be interested in a credit card. Fourier transformation could be used to get a better understanding of the frequency of the transactions.
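As a rough illustration of the Fourier idea, the sketch below builds a synthetic daily transaction count series with a weekly pattern and recovers its dominant period with `np.fft.rfft`. The data is made up; on real accounts the spectrum will be far noisier and may not show a single clear peak.

```python
import numpy as np
import pandas as pd

# Synthetic daily transaction counts with a weekly pattern over 12 weeks
days = pd.date_range("1995-01-02", periods=84, freq="D")
daily_counts = pd.Series(np.tile([3, 2, 1, 0, 0, 1, 2], 12), index=days)

# Remove the mean so the zero-frequency component does not dominate
signal = daily_counts.to_numpy(dtype=float)
signal = signal - signal.mean()

# Magnitude spectrum and the matching frequencies in cycles per day
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1.0)

# Skip index 0 (zero frequency) when searching for the strongest component
dominant = spectrum[1:].argmax() + 1
dominant_period_days = 1.0 / freqs[dominant]

print(f"Dominant period: {dominant_period_days:.1f} days")
```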

## Loans

In [None]:
loans = read_csv("data/loan.csv")

loans['date'] = pd.to_datetime(loans['date'], format='%y%m%d')

loans['status'] = loans['status'].map({
    "A": "Contract finished, no problems",
    "B": "Contract finished, loan not payed",
    "C": "Contract running, OK thus-far",
    "D": "Contract running, client in debt"
})

loans.rename(columns={
    'date': 'granted_date',
    'amount': 'amount',
    'duration': 'duration',
    'payments': 'monthly_payments',
    'status': 'status'
}, inplace=True)

loans.info()

## Credit Cards

In [None]:
cards = read_csv("data/card.csv")

cards['issued'] = pd.to_datetime(cards['issued'], format='%y%m%d %H:%M:%S').dt.date

cards.info()

## Demographic data

In [None]:
districts = read_csv("data/district.csv")

# Rename columns
# according to https://sorry.vse.cz/~berka/challenge/PAST/index.html
districts.rename(columns={
    'A1': 'district_id',
    'A2': 'district_name',
    'A3': 'region',
    'A4': 'inhabitants',
    'A5': 'small_municipalities',
    'A6': 'medium_municipalities',
    'A7': 'large_municipalities',
    'A8': 'huge_municipalities',
    'A9': 'cities',
    'A10': 'ratio_urban_inhabitants',
    'A11': 'average_salary',
    'A12': 'unemployment_rate_1995',
    'A13': 'unemployment_rate_1996',
    'A14': 'entrepreneurs_per_1000_inhabitants',
    'A15': 'crimes_committed_1995',
    'A16': 'crimes_committed_1996'
}, inplace=True)

for col in ['unemployment_rate_1995', 'unemployment_rate_1996', 'crimes_committed_1995', 'crimes_committed_1996']:
    districts[col] = pd.to_numeric(districts[col], errors='coerce')

districts.info()

We need to differentiate between the domicile of the client and account, as they can be different.
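A small sketch of how the two domiciles could be kept apart and compared, using hypothetical toy frames that mirror the `district_id` columns of `client.csv` and `account.csv`. The flag name `same_district` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy frames mirroring the relevant columns
toy_clients = pd.DataFrame({"client_id": [1, 2], "district_id": [18, 5]})
toy_accounts = pd.DataFrame({"account_id": [10, 11], "district_id": [18, 1]})
toy_dispositions = pd.DataFrame(
    {"disp_id": [100, 101], "client_id": [1, 2], "account_id": [10, 11]}
)

# Rename before merging so both district ids survive under distinct names
linked = (
    toy_dispositions
    .merge(toy_clients.rename(columns={"district_id": "client_district_id"}), on="client_id")
    .merge(toy_accounts.rename(columns={"district_id": "account_district_id"}), on="account_id")
)

# Flag whether the client lives in the district where the account is held
linked["same_district"] = linked["client_district_id"] == linked["account_district_id"]

print(linked)
```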


# EDA

In [None]:
def plot_categorical_variables(df, categorical_columns, fill_na_value='NA'):
    """
    Plots count plots for categorical variables in a DataFrame, filling NA values with a specified string.
    
    Parameters:
    - df: pandas.DataFrame containing the data.
    - categorical_columns: list of strings, names of the categorical variables in df to plot.
    - fill_na_value: string, the value to use for filling NA values in the categorical variables.
    """
    # Work on a copy of the plotted columns so the caller's DataFrame keeps its NA values
    df = df[categorical_columns].copy()

    # Fill NA values in the specified categorical variables
    for var in categorical_columns:
        if df[var].isna().any():
            df[var] = df[var].fillna(fill_na_value)

    total = float(len(df))
    fig, axes = plt.subplots(nrows=len(categorical_columns), figsize=(14, len(categorical_columns) * 4.5))

    if len(categorical_columns) == 1:  # If there's only one categorical variable, wrap axes in a list
        axes = [axes]

    for i, var in enumerate(categorical_columns):
        ax = sns.countplot(x=var, data=df, ax=axes[i], order=df[var].value_counts().index)

        axes[i].set_title(f'Distribution of {var}')
        axes[i].set_ylabel('Count')
        axes[i].set_xlabel(var)

        for p in ax.patches:
            height = p.get_height()
            ax.text(p.get_x() + p.get_width() / 2.,
                    height + 3,
                    '{:1.2f}%'.format((height / total) * 100),
                    ha="center")

    plt.tight_layout()
    plt.show()

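def plot_numerical_distributions(df, numerical_columns, kde=True, bins=30):
    """
    Plots the distribution of the given numerical variables in a DataFrame.
    
    Parameters:
    - df: pandas.DataFrame containing the data.
    - numerical_columns: list of strings, names of the numerical variables in df to plot.
    - kde: bool, whether to overlay a kernel density estimate on each histogram.
    - bins: int, number of histogram bins.
    """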

    # Determine the number of rows needed for subplots based on the number of numerical variables
    nrows = len(numerical_columns)

    # Create subplots
    fig, axes = plt.subplots(nrows=nrows, ncols=1, figsize=(8, 5 * nrows))

    if nrows == 1:  # If there's only one numerical variable, wrap axes in a list
        axes = [axes]

    for i, var in enumerate(numerical_columns):
        sns.histplot(df[var], ax=axes[i], kde=kde, bins=bins)
        axes[i].set_title(f'Distribution of {var}')
        axes[i].set_xlabel(var)
        axes[i].set_ylabel('Frequency')

    plt.tight_layout()
    plt.show()

def plot_date_monthly_counts(df, date_column, title):
    """
    Plots the monthly counts of a date column in a DataFrame.
    
    Parameters:
    - df: pandas.DataFrame containing the data.
    - date_column: string, name of the date column in df to plot.
    - title: string, title of the plot.
    """
    # Derive the month locally so the caller's DataFrame is not modified
    months = pd.to_datetime(df[date_column]).dt.to_period('M')

    monthly_counts = months.value_counts().sort_index()
    monthly_counts.plot(kind='bar', figsize=(14, 6))
    plt.title(title)
    plt.xlabel('Month')
    plt.ylabel('Count')
    plt.show()

## Transactions


In [None]:
plot_categorical_variables(transactions, ['transaction_type', 'operation', 'k_symbol'])

In [None]:
# Derive year and month columns from the transaction date
transactions['year'] = transactions['date'].dt.year
transactions['month'] = transactions['date'].dt.month

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']


years = sorted(transactions['year'].unique())

# Creating a figure with subplots for each year: one row for each year with two plots (box plot and bar chart)
fig, axs = plt.subplots(len(years) * 2, 1, figsize=(15, 6 * len(years)), sharex=True, gridspec_kw={'height_ratios': [3, 1] * len(years)})

for i, year in enumerate(years):
    # Filter transactions for the current year
    yearly_transactions = transactions[transactions['year'] == year]

    # Preparing data for the box plot: a list of amounts for each month for the current year
    amounts_per_month_yearly = [yearly_transactions[yearly_transactions['month'] == month]['amount'] for month in range(1, 13)]

    # Preparing data for the bar chart for the current year
    monthly_summary_yearly = yearly_transactions.groupby('month').agg(TotalAmount=('amount', 'sum'), TransactionCount=('amount', 'count')).reset_index()

    # Box plot for transaction amounts by month for the current year
    axs[i*2].boxplot(amounts_per_month_yearly, patch_artist=True)
    # now with seaborn
    # sns.boxplot(data=yearly_transactions, x='month', y='amount', ax=axs[i*2])
    axs[i*2].set_title(f'Transaction Amounts Per Month in {year} (Box Plot)')
    axs[i*2].set_yscale('symlog')
    axs[i*2].set_ylabel('Transaction Amounts (log scale)')
    axs[i*2].grid(True, which='both')

    # Bar chart for transaction count by month for the current year
    axs[i*2 + 1].bar(monthly_summary_yearly['month'], monthly_summary_yearly['TransactionCount'], color='tab:red', alpha=0.6)
    axs[i*2 + 1].set_ylabel('Transaction Count')
    axs[i*2 + 1].grid(True, which='both')

# Setting x-ticks and labels for the last bar chart (shared x-axis for all)
axs[-1].set_xticks(range(1, 13))
axs[-1].set_xticklabels(months)
axs[-1].set_xlabel('Month')

plt.tight_layout()
plt.show()




## Relationships

According to the [dataset documentation](https://sorry.vse.cz/~berka/challenge/PAST/index.html), there are multiple relationships that need to be validated.

The ERD according to the descriptions in the dataset documentation:

[![](https://mermaid.ink/img/pako:eNqtV1Fv4jgQ_itWXu6l7SZ0gQatTsqGdhddCxVQrXSqFJnEgLWJnbOd7bGl__3GTgA3JGyvWh7ajPPNZ898nrHz7MQ8Ic7AIWJI8Urg7JEh-AVhOHkYz9FzaeqfVIKyFcJxzAumIpqg-7_QoxOUNhoNH50jcEL1Q2zQNxo9rAZq8AQrUv45EIaCYEU5Q0MYb-BeCvJPQVi8AZ-RlAVmMUE3u8Gdw0v5L7wdXTeGE6eUWNGExmxa3YIKtY5YkS2IAOBnbbYt7S1hVwsbjmb3k9loPpqMG1YHRPl-bUCSc0lNThqzfYjlpjWWYxVvfqmi2uRamMkTI-LDgySiFsNkOryeNqyei4SI_fIn2voNq1lg9j1SHGBTEtPcxPgZxk7w1tAVve2wTDlWCGdmWkg1WVBFEhRkdWDF_D2Sm2zBU8De402mWcM1FjhWRNCfZt_WkjSfBuNZELYIrQRmcp-qubaAq1XptyXMriqbsmXXVjJD3SVUffhG1ToR-AmnDVCeE1EW52vmO2gmJ9JqQ49TW4IXODWVDCVWPQVLSCqyfE_rYU_SpkltOxkVhYLt_YudZAFrm6hS-XYSNMkLkbG9urdg_C5ZDdcXiFjv1rqutewb7HHaKbxMir2cBjWs7GO2vNztEpB3nKl1ukFVAciGgKTCqpA71pmx6p05mA6b-jIWyaErg9F6wOR2lz3dIHc7XNPN4fkoqxTOEUjkbsbSstN66Nrz6SicN7fsfe-_P3HkVXCGM2KDxmA3wARZlfJMzUNdP8YjvowoW2PoW7iUZ8wvEF-i0WGw2SsrGIXOiFNIHJHRE9S9zRSlKuq6rtb7FRB9QjD8PkpwhJ4ceb7vH_PCy3P95n3UHbfk9hu59dtz_93kKxV5wNCQjD_NeDNrXEL2ioTGPi4tU3LaoRAL3S5eqfmgx2wx0VTDG8r9B7TmFYkkdE-h70ZBOYBmZuDYoWAky1NuajiCNZDI7-oJrWE9F0F_-N03efdavHvN6QGEILkgjBRCRnCwmBzXwr-2QQhAyGvPN88yqqAhRrGgGeho4gnNcz2IU14926tndYBaD9huz8_58_7KPDDtLsa6I4_2lVpHVxfSgSlpSROY5ADeUVVY-46oHQIpeUzNBPqU3nlVlP_PyYZpz-227Mja5SuWLSsqb3wa8wUk0bK3Ie1rTxlsDNfCZvS2PD4Hpv8y9BW2ssY5Z05GRIZpAt8ppuE-OmpNdJ_U0IQscZGaI-0FoLhQfLZhsTNQoiBnTpHrzl593DiDJU4ljOaY_c15tgPpGw8Xd-WnkPkiMhBn8Oz86wyuOhf9bsf96Lndvuu5ff_M2TgDz-1ddK763qXvel237131X86cn4bUvbi67PU6Xt_z3Y--e9ntv_wHEyE3kA?type=png)](https://mermaid.live/edit#pako:eNqtV1Fv4jgQ_itWXu6l7SZ0gQatTsqGdhddCxVQrXSqFJnEgLWJnbOd7bGl__3GTgA3JGyvWh7ajPPNZ898nrHz7MQ8Ic7AIWJI8Urg7JEh-AVhOHkYz9FzaeqfVIKyFcJxzAumIpqg-7_QoxOUNhoNH50jcEL1Q2zQNxo9rAZq8AQrUv45EIaCYEU5Q0MYb-BeCvJPQVi8AZ-RlAVmMUE3u8Gdw0v5L7wdXTeGE6eUWNGExmxa3YIKtY5YkS2IAOBnbbYt7S1hVwsbjmb3k9loPpqMG1YHRPl-bUCSc0lNThqzfYjlpjWWYxVvfqmi2uRamMkTI-LDgySiFsNkOryeNqyei4SI_fIn2voNq1lg9j1SHGBTEtPcxPgZxk7w1tAVve2wTDlWCGdmWkg1WVBFEhRkdWDF_D2Sm2zBU8De402mWcM1FjhWRNCfZt_WkjSfBuNZELYIrQRmcp-qubaAq1XptyXMriqbsmXXVjJD3SVUffhG1ToR-AmnDVCeE1EW52vmO2gmJ9JqQ49TW
4IXODWVDCVWPQVLSCqyfE_rYU_SpkltOxkVhYLt_YudZAFrm6hS-XYSNMkLkbG9urdg_C5ZDdcXiFjv1rqutewb7HHaKbxMir2cBjWs7GO2vNztEpB3nKl1ukFVAciGgKTCqpA71pmx6p05mA6b-jIWyaErg9F6wOR2lz3dIHc7XNPN4fkoqxTOEUjkbsbSstN66Nrz6SicN7fsfe-_P3HkVXCGM2KDxmA3wARZlfJMzUNdP8YjvowoW2PoW7iUZ8wvEF-i0WGw2SsrGIXOiFNIHJHRE9S9zRSlKuq6rtb7FRB9QjD8PkpwhJ4ceb7vH_PCy3P95n3UHbfk9hu59dtz_93kKxV5wNCQjD_NeDNrXEL2ioTGPi4tU3LaoRAL3S5eqfmgx2wx0VTDG8r9B7TmFYkkdE-h70ZBOYBmZuDYoWAky1NuajiCNZDI7-oJrWE9F0F_-N03efdavHvN6QGEILkgjBRCRnCwmBzXwr-2QQhAyGvPN88yqqAhRrGgGeho4gnNcz2IU14926tndYBaD9huz8_58_7KPDDtLsa6I4_2lVpHVxfSgSlpSROY5ADeUVVY-46oHQIpeUzNBPqU3nlVlP_PyYZpz-227Mja5SuWLSsqb3wa8wUk0bK3Ie1rTxlsDNfCZvS2PD4Hpv8y9BW2ssY5Z05GRIZpAt8ppuE-OmpNdJ_U0IQscZGaI-0FoLhQfLZhsTNQoiBnTpHrzl593DiDJU4ljOaY_c15tgPpGw8Xd-WnkPkiMhBn8Oz86wyuOhf9bsf96Lndvuu5ff_M2TgDz-1ddK763qXvel237131X86cn4bUvbi67PU6Xt_z3Y--e9ntv_wHEyE3kA)

This ERD shows how the data appears in the dataset:

[![](https://mermaid.ink/img/pako:eNqtV99P2zAQ_lesvOyFbjCJSq2mSSHlRzRoUVq0F6TITdzWIrEz2xnqgP99ZydNTeIUhOgD5JzvPvvufJ-dJy_hKfHGHhETitcC5_cMwc8PgtnddIGeKlP_pBKUrRFOEl4yFdMU3f5C955f2Sic3HsdcEr1Q2LQFxo9qQda8BQrUv3ZEwaCYEU5QxMYd3CvBPlTEpZswSeUssQsIehiN7hzeKn-BdfhuTOcJKPEiiYwpmt1SyrUJmZlviQCgGfa7Fvae8KuFzYJ57ezebgIZ1PH6oCoaNYGJAWX1OTEme19LBe9sXSrePFmFdW20IWZPTIivt1JIloxzKLJeeRYPRcpEc3yZ9r6hNUsMXuIFQdYRBJamBjPYOwAbwtd09sOq4xjhXBupoVUkyVVJEV-3gbWzA-x3OZLngH2Fm9zzRpssMCJIoL-M_u2laRF5E_nftBTaCUwk02qFtoCrt5Kvy9hdlfZlD27ti4z9F1K1bffVG1SgR9x5oDygoiqOV8z34CYHEirDe2mtgIvcWY6GVqsfvJXkFRk-R6uhz1JX01a28lUUSjY3m_sJAvY2kR1la9nvqu8EBlrqnsNxmeV1XBdQsR6t7br2sq-wXbTTuFlWjblNKhJbXfZimq3S0DecKY22RbVDSAdAUmFVSl3rHNjtZXZjyYuXcYi3asyGL0HTGGr7GGB3O1wTbeA505WKZwjkMjdjJVlp3Wv2osoDBZuyW60__bAkVfDGc6JDZqC7YAJsq7KE5mHdv0Yj_kqpmyDQbdwVZ4p_4r4CoX7QbdXXjIKyogzSByR8SP0vc0UZyo-PT7W9X4FRD8QDH-MEhxBk-OT0WjU5YWXA_3mY9TfjyvukZNbvx2MPky-VvEJMDiS8dOMu1mTCtJUJDB2t7VMy2mHUiy1XLyq5p0es4uJIg13tPtfkOY1iSWop9B3I78aQHMz0HUoGcmLjJsejmENJB6d6gmtYT0XQV9Gp-_yHvZ4D93pAYQghSCMlELGcLCYHLfCP7dBCEDopD_fPM-pAkGME0FzqKOJJzDP7SAOeQ1tr6GlAC0NeH4eDPhTc2UeG7lLsFbksOnUNrq-kI5NS0uawiR78I6qxtp3RO3gS8kTaibQp_TOq6bUTs_P73WyYbWnUWTtcoVlz4qqG5_GXEJJdNn7kPa1pwo2gWuhEw1Tm-NzbPSXoSvYyhrnHXk5ETmmKXynGMG999SGaJ3U0JSscJmZI-0FoLhUfL5liTdWoiRHXlloZa8_brzxCmcSRvUVh4ub6tsn4WxF197LfwBRISI?type=png)](https://mermaid.live/edit#pako:eNqtV99P2zAQ_lesvOyFbjCJSq2mSSHlRzRoUVq0F6TITdzWIrEz2xnqgP99ZydNTeIUhOgD5JzvPvvufJ-dJy_hKfHGHhETitcC5_cMwc8PgtnddIGeKlP_pBKUrRFOEl4yFdMU3f5C955f2Sic3HsdcEr1Q2LQFxo9qQda8BQrUv3ZEwaCYEU5QxMYd3CvBPlTEpZswSeUssQsIehiN7hzeKn-BdfhuTOcJKPEiiYwpmt1SyrUJmZlviQCgGfa7Fvae8KuFzYJ57ezebgIZ1PH6oCoaNYGJAWX1OTEme19LBe9sXSrePFmFdW20IWZPTIivt1JIloxzKLJeeRYPRcpEc3yZ9r6hNUsMXuIFQdYRBJamBjPYOwAbwtd09sOq4xjhXBupoVUkyVVJEV-3gbWzA-x3OZLngH2Fm9zzRpssMCJIoL-M_u2laRF5E_nftBTaCUwk02qFtoCrt5Kvy9hdlfZlD27ti4z9F1K1bffVG1SgR9x5oDygoiqOV8z34CYHEirDe2mtgIvcWY6GVqsfvJXkFRk-R6uhz1JX01a28lUUSjY3m_sJAvY2kR1la9nvqu8EBlrqnsNxmeV1XBdQsR6t7br2sq-wXbTTuFlWjblNKhJbXfZ
imq3S0DecKY22RbVDSAdAUmFVSl3rHNjtZXZjyYuXcYi3asyGL0HTGGr7GGB3O1wTbeA505WKZwjkMjdjJVlp3Wv2osoDBZuyW60__bAkVfDGc6JDZqC7YAJsq7KE5mHdv0Yj_kqpmyDQbdwVZ4p_4r4CoX7QbdXXjIKyogzSByR8SP0vc0UZyo-PT7W9X4FRD8QDH-MEhxBk-OT0WjU5YWXA_3mY9TfjyvukZNbvx2MPky-VvEJMDiS8dOMu1mTCtJUJDB2t7VMy2mHUiy1XLyq5p0es4uJIg13tPtfkOY1iSWop9B3I78aQHMz0HUoGcmLjJsejmENJB6d6gmtYT0XQV9Gp-_yHvZ4D93pAYQghSCMlELGcLCYHLfCP7dBCEDopD_fPM-pAkGME0FzqKOJJzDP7SAOeQ1tr6GlAC0NeH4eDPhTc2UeG7lLsFbksOnUNrq-kI5NS0uawiR78I6qxtp3RO3gS8kTaibQp_TOq6bUTs_P73WyYbWnUWTtcoVlz4qqG5_GXEJJdNn7kPa1pwo2gWuhEw1Tm-NzbPSXoSvYyhrnHXk5ETmmKXynGMG999SGaJ3U0JSscJmZI-0FoLhUfL5liTdWoiRHXlloZa8_brzxCmcSRvUVh4ub6tsn4WxF197LfwBRISI)

In [None]:
# Verify 1:1 relationships between CLIENT, LOAN and DISPOSITION
assert dispositions['client_id'].is_unique, "Each client_id should appear exactly once in the DISPOSITION DataFrame."
assert loans['account_id'].is_unique, "Each account_id should appear exactly once in the LOAN DataFrame."

# Verify 1:M relationship between ACCOUNT and DISPOSITION
assert not dispositions['account_id'].is_unique, "At least one account_id should appear more than once in the DISPOSITION DataFrame."

# Verify each district_id in ACCOUNT and CLIENT exists in DISTRICT
assert set(accounts['district_id']).issubset(
    set(districts['district_id'])), "All district_ids in ACCOUNT should exist in DISTRICT."
assert set(clients['district_id']).issubset(
    set(districts['district_id'])), "All district_ids in CLIENT should exist in DISTRICT."

# Verify each account_id in DISPOSITION, ORDER, TRANSACTION, and LOAN exists in ACCOUNT
assert set(dispositions['account_id']).issubset(
    set(accounts['account_id'])), "All account_ids in DISPOSITION should exist in ACCOUNT."
assert set(orders['account_id']).issubset(
    set(accounts['account_id'])), "All account_ids in ORDER should exist in ACCOUNT."
assert set(transactions['account_id']).issubset(
    set(accounts['account_id'])), "All account_ids in TRANSACTION should exist in ACCOUNT."
assert set(loans['account_id']).issubset(
    set(accounts['account_id'])), "All account_ids in LOAN should exist in ACCOUNT."

# Verify each client_id in DISPOSITION exists in CLIENT
assert set(dispositions['client_id']).issubset(
    set(clients['client_id'])), "All client_ids in DISPOSITION should exist in CLIENT."

# Verify each disp_id in CARD exists in DISPOSITION
assert set(cards['disp_id']).issubset(set(dispositions['disp_id'])), "All disp_ids in CARD should exist in DISPOSITION."

In [None]:
dispositions[dispositions["account_id"] == 3980]

In [None]:
# are there any cards issued to dispositions that are not of type 'OWNER'?
cards[~cards['disp_id'].isin(dispositions[dispositions['type'] == "OWNER"]['disp_id'])]

In [None]:
dispositions[dispositions['type'] != "OWNER"].head()

In [None]:
# get all clients associated with account 3980
clients[clients['client_id'].isin(dispositions[dispositions['account_id'] == 3980]['client_id'])]

# Merge non-transactional data

In [None]:
orders_pivot = orders.pivot_table(index='account_id', 
                                 columns='k_symbol', 
                                 values='debited_amount',
                                 aggfunc='sum', 
                                 fill_value=0)

orders_pivot.columns = [f'k_symbol_debited_sum_{col.lower()}' for col in orders_pivot.columns]

# TODO: find something better than this
orders_pivot = orders_pivot.reset_index() # Use created index as account_id
orders_pivot.head()

In [None]:
def merge_non_transactional_data(clients, districts, dispositions, accounts, orders, loans, cards):
    # Rename district_id for clarity in clients and accounts DataFrames
    clients = clients.rename(columns={'district_id': 'client_district_id'})
    accounts = accounts.rename(columns={'district_id': 'account_district_id'})
    
    # Prepare districts dataframe for merge with prefix for clients and accounts
    districts_client_prefixed = districts.add_prefix('client_')
    districts_account_prefixed = districts.add_prefix('account_')
    
    # Merge district information for clients and accounts with prefixed columns
    clients_with_districts = pd.merge(clients, districts_client_prefixed, on='client_district_id', how='left')
    accounts_with_districts = pd.merge(accounts, districts_account_prefixed, on='account_district_id', how='left')

    # Merge cards with dispositions and prefix card-related columns to avoid confusion
    cards_prefixed = cards.add_prefix('card_')
    dispositions_with_cards = pd.merge(dispositions, cards_prefixed, left_on='disp_id', right_on='card_disp_id', how='left')
    
    # Merge clients (with district info) with dispositions and cards
    # Assuming dispositions might have columns that overlap with clients, prefix those if necessary
    clients_dispositions_cards = pd.merge(dispositions_with_cards, clients_with_districts, on='client_id', how='left')
    
    # Merge the above with accounts (with district info) on account_id
    accounts_clients_cards = pd.merge(accounts_with_districts, clients_dispositions_cards, on='account_id', how='left')
    
    # Merge orders DataFrame, assuming orders might contain columns that could overlap, prefix as needed
    orders_prefixed = orders.add_prefix('order_')
    comprehensive_df_with_orders = pd.merge(accounts_clients_cards, orders_prefixed, left_on='account_id', right_on='order_account_id', how='left')
    
    # Merge loans with the comprehensive dataframe (now including orders) on account_id
    # Prefix loan-related columns to maintain clarity
    loans_prefixed = loans.add_prefix('loan_')
    final_df = pd.merge(comprehensive_df_with_orders, loans_prefixed, left_on='account_id', right_on='loan_account_id', how='left')
    
    return final_df

non_transactional_df = merge_non_transactional_data(clients, districts, dispositions, accounts, orders_pivot, loans, cards)
non_transactional_df.info()

# Transactional Data

# Model Construction

# Feature Engineering

# Model Engineering

# Model Comparison & Selection

### JITT 05.03.24
- New customers are handled differently
- Customers without the required history should be ignored, otherwise they are treated as irrelevant
- Lag is ignored like (12 + 1) months
- Age should be in relation to the time of the event (card issued / reference date for reference clients)
- How old are customers with a Junior Card? This should be evaluated based on the data
    - Example with Junior Card model with Age as most important feature as a negative example
- Reference clients
    - They should not be as similar as possible (Twin brother problem)
    - Same external market conditions
    - Same environment
    - See slide 6
- Owner and disponents cannot be distinguished directly and assumptions are required
    - MasterCard vs Visa war: as many cards as possible for both clients of an account
    - Again the Twin brother problem, as features are possibly too similar
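
The note on age above could be implemented roughly as follows: instead of a fixed reference year, the age is computed relative to a per-client reference date. This is a sketch on hypothetical toy data; the helper `age_at` and the column `reference_date` are assumptions and not part of the dataset.

```python
import pandas as pd

def age_at(birth_date, reference_date):
    """Age in full years on the reference date, for two datetime Series."""
    had_birthday = (reference_date.dt.month > birth_date.dt.month) | (
        (reference_date.dt.month == birth_date.dt.month)
        & (reference_date.dt.day >= birth_date.dt.day)
    )
    # Subtract one year if the birthday has not yet occurred in the reference year
    return reference_date.dt.year - birth_date.dt.year - (~had_birthday).astype(int)

# Hypothetical toy data: card issue date or reference date per client
toy_ages = pd.DataFrame({
    "birth_date": pd.to_datetime(["1970-06-15", "1980-12-31"]),
    "reference_date": pd.to_datetime(["1997-06-14", "1998-12-31"]),
})
toy_ages["age_at_reference"] = age_at(toy_ages["birth_date"], toy_ages["reference_date"])

print(toy_ages)
```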

#### General notes to self
- Visualise monthly product purchases
- Visualise the environment of the selected client and the reference clients and answer the question: are they from a comparable environment?
