# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 01: Feature Backfill</span>

**Note**: This tutorial does not support Google Colab.

## 🗒️ This notebook is divided into the following sections:
1. Fetch historical data.
2. Connect to the Hopsworks feature store.
3. Create feature groups and insert data into the feature store.
4. Data Visualization.

![tutorial-flow](../../images/01_featuregroups.png)

## <span style='color:#ff5f27'> 📝 Imports </span>

In [None]:
!pip install -U hopsworks --quiet

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from functions import *

#ignore warnings
import warnings
warnings.filterwarnings('ignore')

## <span style="color:#ff5f27;"> 💽 Loading the Data </span>

#### <span style="color:#ff5f27;">⛳️ Application Train dataset</span>

The main training dataset contains information about each loan application at Home Credit. Every loan has its own row and is identified by the feature `sk_id_curr`. This dataset has a binary target indicating whether the loan was repaid (0) or not (1).

In [None]:
applications_df = pd.read_csv("https://repo.hops.works/dev/davit/credit_scores/applications.csv")
applications_df.head()

In [None]:
applications_df.shape

#### <span style="color:#ff5f27;">⛳️ Bureau Balance dataset</span>

This dataset contains monthly data about previous credits reported to the credit bureau. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit length.

In [None]:
bureau_balances_df = pd.read_csv('https://repo.hops.works/dev/davit/credit_scores/bureau_balances.csv')
bureau_balances_df.head(3)

In [None]:
bureau_balances_df.shape

#### <span style="color:#ff5f27;">⛳️ Bureau Dataset</span>

This dataset contains data about clients' previous credits from other financial institutions. Each previous credit has its own row in the bureau data, but one loan in the application data can have multiple previous credits.

In [None]:
bureaus_df = pd.read_csv('https://repo.hops.works/dev/davit/credit_scores/bureaus.csv')
bureaus_df.head(3)

In [None]:
bureaus_df.shape

#### <span style="color:#ff5f27;">⛳️ Credit Card Balance Dataset</span>

Dataset contains monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.

In [None]:
credit_card_balances_df = pd.read_csv('https://repo.hops.works/dev/davit/credit_scores/credit_card_balances.csv')
credit_card_balances_df.head(3)

In [None]:
credit_card_balances_df.shape

#### <span style="color:#ff5f27;">⛳️ Installments Payments Dataset</span>

This dataset contains the payment history for previous loans at Home Credit. There is one row for every payment made and one row for every missed payment.

In [None]:
installment_payments_df = pd.read_csv('https://repo.hops.works/dev/davit/credit_scores/installment_payments.csv')
installment_payments_df.head(3)

In [None]:
installment_payments_df.shape

#### <span style="color:#ff5f27;">⛳️ POS (point of sales) and Cash Loans Balance Dataset</span>

Monthly balance snapshots of previous POS (point of sales) and cash loans that the applicant had with Home Credit.

This table has one row for each month of history of every previous credit at Home Credit (consumer credit and cash loans) related to loans in our sample.

In [None]:
pos_cash_balances_df = pd.read_csv('https://repo.hops.works/dev/davit/credit_scores/pos_cash_balances.csv')
pos_cash_balances_df.head(3)

In [None]:
pos_cash_balances_df.shape

#### <span style="color:#ff5f27;">⛳️ Previous Application Dataset</span>

All previous applications for Home Credit loans of clients who have loans in our sample.

There is one row for each previous application related to loans in our data sample.

In [None]:
previous_applications_df = pd.read_csv('https://repo.hops.works/dev/davit/credit_scores/previous_applications.csv')
previous_applications_df.head(3)

In [None]:
previous_applications_df.shape

---

## <span style="color:#ff5f27;"> 🛠️ Feature Engineering </span>

#### <span style="color:#ff5f27;"> ⛳️ Dataset with the number of previous loans</span>

In [None]:
# Grouping the 'bureaus_df' DataFrame by 'sk_id_curr' and counting the number of occurrences
# Renaming the resulting column to 'previous_loan_counts'
previous_loan_counts = bureaus_df.groupby('sk_id_curr', as_index=False)['sk_id_bureau'].count() \
                           .rename(columns={'sk_id_bureau': 'previous_loan_counts'})

# Displaying the first 3 rows of the resulting DataFrame
previous_loan_counts.head(3)
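
To make the aggregation above concrete, here is a minimal sketch of the same `groupby` / `count` / `rename` chain on a toy frame. The `toy_bureaus` data is invented for illustration and is not part of the tutorial dataset.

```python
import pandas as pd

# Toy stand-in for bureaus_df: three previous credits across two applicants.
toy_bureaus = pd.DataFrame({
    "sk_id_curr": [100, 100, 200],
    "sk_id_bureau": [1, 2, 3],
})

# Same chain as above: count the bureau records per applicant,
# then give the count column a descriptive name.
toy_counts = (
    toy_bureaus.groupby("sk_id_curr", as_index=False)["sk_id_bureau"]
    .count()
    .rename(columns={"sk_id_bureau": "previous_loan_counts"})
)
print(toy_counts)
```

Note that `count()` only counts non-null values, so a row with a missing `sk_id_bureau` would not contribute to the total.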

---

## <span style="color:#ff5f27;">👨🏻‍⚖️ Dealing with missing values</span>

In [None]:
# remove_nans (imported from `functions`) handles missing values in two steps:
# 1. Drop every column in which more than 20% of the values are missing.
# 2. Drop every remaining row that still contains a missing value.
applications_df = remove_nans(applications_df)
bureaus_df = remove_nans(bureaus_df)
previous_applications_df = remove_nans(previous_applications_df)

credit_card_balances_df.dropna(inplace=True)
installment_payments_df.dropna(inplace=True)
pos_cash_balances_df.dropna(inplace=True)
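
The `remove_nans` helper is imported from `functions` and its source is not shown in this notebook. Based only on the comments above, a rough sketch of the same idea could look like the following. The name `remove_nans_sketch`, the `max_missing_share` parameter, and the toy data are assumptions made for illustration, and the real helper may differ.

```python
import numpy as np
import pandas as pd


def remove_nans_sketch(df: pd.DataFrame, max_missing_share: float = 0.2) -> pd.DataFrame:
    """Hypothetical stand-in for remove_nans; not the tutorial's actual helper."""
    # Share of missing values in each column (between 0.0 and 1.0).
    missing_share = df.isna().mean()
    # Keep only the columns whose missing share does not exceed the threshold.
    kept = df.loc[:, missing_share <= max_missing_share]
    # Drop the rows that still contain at least one missing value.
    return kept.dropna(axis=0)


toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [np.nan, np.nan, np.nan, 1.0, 2.0, 3.0],  # 50% missing: column is dropped
    "c": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],        # about 17% missing: one row is dropped
})
cleaned = remove_nans_sketch(toy)
print(cleaned)
```

Dropping sparse columns first keeps more rows than calling `dropna()` directly, because a single mostly-empty column would otherwise remove most of the data.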

---

## <span style="color:#ff5f27;"> 🔮 Connecting to Hopsworks Feature Store </span>

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store() 

---

## <span style="color:#ff5f27;"> 🪄 Creating Feature Groups </span>

A `Feature Group` is a logical grouping of features. Experience has shown that this grouping generally originates from the features being derived from the same data source. A `Feature Group` also lets you save metadata alongside the features.

Generally, the features in a feature group are engineered together in an ingestion job. However, it is possible to have additional jobs to append features to an existing feature group. Furthermore, `Feature Groups` provide a way of defining a namespace for features, such that you can define features with the same name multiple times, but uniquely identified by the group they are contained in.

> It is important to note that `Feature Groups` are not groupings of features for immediate training of Machine Learning models. Instead, to ensure reusability of features, it is possible to combine features from any number of groups into a **Feature View**.

#### <span style="color:#ff5f27;">⛳️ Creating Applications Feature Group </span>

In [None]:
applications_fg = fs.get_or_create_feature_group(
    name='applications',
    version=1,
    primary_key=['sk_id_curr'],
    online_enabled=False,
    event_time='datetime',
)
applications_fg.insert(
    applications_df,
    write_options={"wait_for_job": True},
)
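
Since `primary_key` is meant to identify rows, it can be worth checking a DataFrame for repeated key values before inserting it. The sketch below uses plain pandas on invented toy data. `count_duplicate_keys` is a hypothetical helper, not part of the Hopsworks API, and how the feature store itself treats duplicate keys is not covered here.

```python
import pandas as pd


def count_duplicate_keys(df: pd.DataFrame, primary_key: list[str]) -> int:
    """Hypothetical helper: number of rows that repeat an earlier primary-key value."""
    return int(df.duplicated(subset=primary_key).sum())


# Toy frame in which sk_id_curr 101 appears twice.
toy_applications = pd.DataFrame({
    "sk_id_curr": [100, 101, 101],
    "datetime": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
})

n_dupes = count_duplicate_keys(toy_applications, ["sk_id_curr"])
print(f"rows with a repeated primary key: {n_dupes}")
```

The same check could be run on each DataFrame with the key columns passed to its feature group, for example `count_duplicate_keys(applications_df, ['sk_id_curr'])`.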

#### <span style="color:#ff5f27;"> ⛳️ Bureau Balance Feature Group</span>

In [None]:
bureau_balances_fg = fs.get_or_create_feature_group(
    name='bureau_balances',
    version=1,
    primary_key=['sk_id_bureau'],
    online_enabled=False,
)
bureau_balances_fg.insert(
    bureau_balances_df,
    write_options={"wait_for_job": True},
)

#### <span style="color:#ff5f27;"> ⛳️ Bureau Feature Group</span>

In [None]:
bureaus_fg = fs.get_or_create_feature_group(
    name='bureaus',
    version=1,
    primary_key=['sk_id_curr','sk_id_bureau'],
    online_enabled=False,
    event_time='datetime',
)
bureaus_fg.insert(
    bureaus_df,
    write_options={"wait_for_job": True},
)

#### <span style="color:#ff5f27;"> ⛳️ Previous Application Feature Group</span>

In [None]:
previous_applications_fg = fs.get_or_create_feature_group(
    name='previous_applications',
    version=1,
    primary_key=['sk_id_prev','sk_id_curr'],
    online_enabled=False,
    event_time='datetime',
)
previous_applications_fg.insert(
    previous_applications_df,
    write_options={"wait_for_job": True},
)

#### <span style="color:#ff5f27;"> ⛳️ POS and Cash Loans Balance Feature Group</span>

In [None]:
pos_cash_balances_fg = fs.get_or_create_feature_group(
    name='pos_cash_balances',
    version=1,
    primary_key=['sk_id_prev','sk_id_curr'],
    online_enabled=False,
)
pos_cash_balances_fg.insert(
    pos_cash_balances_df,
    write_options={"wait_for_job": True},
)

#### <span style="color:#ff5f27;"> ⛳️ Installment Payments Feature Group</span>

In [None]:
installment_payments_fg = fs.get_or_create_feature_group(
    name='installment_payments',
    version=1,
    primary_key=['sk_id_prev','sk_id_curr'],
    online_enabled=False,
    event_time='datetime',
)
installment_payments_fg.insert(
    installment_payments_df,
    write_options={"wait_for_job": True},
)

#### <span style="color:#ff5f27;"> ⛳️ Credit Card Balance Feature Group</span>

In [None]:
credit_card_balances_fg = fs.get_or_create_feature_group(
    name='credit_card_balances',
    version=1,
    primary_key=['sk_id_prev','sk_id_curr'],
    online_enabled=False,
)
credit_card_balances_fg.insert(
    credit_card_balances_df,
    write_options={"wait_for_job": True},
)

#### <span style="color:#ff5f27;"> ⛳️ Previous Loan Counts Feature Group</span>

In [None]:
previous_loan_counts_fg = fs.get_or_create_feature_group(
    name='previous_loan_counts',
    version=1,
    primary_key=['sk_id_curr'],
    online_enabled=False,
)

previous_loan_counts_fg.insert(
    previous_loan_counts,
    write_options={"wait_for_job": True},
)

---

## <span style="color:#ff5f27;">👨🏻‍🎨 Data Exploration</span>

In [None]:
plt.figure(figsize=(12,5))

plt.pie(
    applications_df.target.value_counts(),
    labels=['Repaid','Not Repaid'],
    explode=(0, 0.2),
    shadow=True,
    autopct='%1.1f%%',
    radius=1.2,
)

plt.title("Ratio of Loans Repaid or Not", fontsize=15)
plt.show()
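
The pie chart above passes a hand-written label list, which relies on `value_counts()` returning the classes in the expected order. Because `value_counts()` sorts by frequency rather than by class value, one possible safeguard is to derive the labels from the index of the counts. This is a sketch on invented toy data, and `label_names` is an illustrative mapping.

```python
import pandas as pd

# Toy target column: 0 = repaid, 1 = not repaid, as described for applications_df.
toy_target = pd.Series([0, 0, 0, 1, 0, 1], name="target")

# value_counts() orders the classes by frequency, most common first.
counts = toy_target.value_counts()

# Look each class value up in a mapping so labels always match their slices.
label_names = {0: "Repaid", 1: "Not Repaid"}
labels = [label_names[value] for value in counts.index]
print(list(zip(labels, counts.tolist())))
```

`counts` and `labels` can then be passed to `plt.pie` in place of the raw `value_counts()` call and the fixed label list.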

In [None]:
plt.figure(figsize=(12,5))

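# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(applications_df.amt_credit, kde=True, stat="density")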

plt.title("Distribution of Amount of Credit", fontsize = 15)
plt.xlabel('Amount of credit')
plt.show()

In [None]:
plt.figure(figsize=(12,5))

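# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(applications_df.amt_goods_price, kde=True, stat="density")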

plt.title("Distribution of Amount of Goods Price", fontsize = 15)
plt.xlabel('Amount of goods price')
plt.show()

In [None]:
plt.figure(figsize=(12,5))

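# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(
    applications_df.days_birth/-365,
    bins=30,
    kde=True,
    stat="density",
)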

plt.title("Distribution of Applicant Age", fontsize=15)
plt.xlabel('Years')
plt.show()
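
The expression `days_birth/-365` in the cell above implies that `days_birth` is stored as a negative number of days counted back from the application date. Under that assumption, the conversion to years works as in this small sketch. The toy values are invented.

```python
import pandas as pd

# Toy values: negative day counts, assumed to be relative to the application date.
toy_days_birth = pd.Series([-3650, -7300, -18250], name="days_birth")

# Dividing by -365 flips the sign and converts days to approximate years
# (leap days are ignored).
toy_age_years = toy_days_birth / -365
print(toy_age_years.tolist())
```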

In [None]:
temp_df = applications_df.name_type_suite.value_counts().reset_index()

plt.figure(figsize=(12,5))

sns.barplot(
    data=temp_df, 
    x='name_type_suite', 
    y='count',
)

plt.title("Who accompanied the client when applying for the loan", fontsize=15)
plt.xlabel('Accompanied by')
plt.ylabel('Count')
plt.show()

In [None]:
plt.figure(figsize=(12,5))

plt.pie(
    applications_df.flag_own_car.value_counts(),
    labels=['Does not own a car','Owns a car'],
    explode=(0, 0.1),
    shadow=True,
    autopct='%1.1f%%',
    radius=1.2,
)

plt.title("Ratio of Applicants Who Own a Car", fontsize=15)
plt.show()

In [None]:
plt.figure(figsize=(12,5))

plt.pie(
    applications_df.flag_own_realty.value_counts(),
    labels=['Owns realty','Does not own realty'],
    explode=(0, 0.1),
    shadow=True,
    autopct='%1.1f%%',
    radius=1.2,
)

plt.title("Ratio of Applicants Who Own Realty", fontsize=15)
plt.show()

In [None]:
temp_df = applications_df.name_income_type.value_counts()[:4]

plt.figure(figsize=(12,5))

plt.pie(
    temp_df,
    labels=temp_df.index,
    explode=(0, 0.075,0.1,0.1), 
    shadow=True, 
    autopct='%1.1f%%',
    labeldistance=0.8,
    radius=1.2,
)

plt.title("Income Ratio", fontsize=15)
plt.show()

In [None]:
temp_df = applications_df.name_family_status.value_counts()[:-1]

plt.figure(figsize=(12,5))

plt.pie(
    temp_df,
    labels=temp_df.index,
    explode=(0,0.1,0.1,0.1), 
    shadow=True, 
    autopct='%1.1f%%',
    labeldistance=1.05,
    radius=1.2,
)

plt.title("Family Status Ratio", fontsize=15)
plt.show()

In [None]:
plt.figure(figsize=(12,5))

ax = sns.countplot(
    data=applications_df,
    x='name_education_type',
    hue='target',
    order=applications_df['name_education_type'].value_counts().index
)

plt.title("Education Level of Loan Applicants", fontsize=15)
plt.xlabel('Education Type')
plt.ylabel('Count')
add_perc(ax,applications_df.name_education_type,5,2)
plt.show()

In [None]:
temp_df = previous_applications_df.name_contract_status.value_counts()[:-1]

plt.figure(figsize=(12,5))

plt.pie(
    temp_df,
    labels=temp_df.index,
    explode=(0,0.1,0.1), 
    shadow=True, 
    autopct='%1.1f%%',
    labeldistance=1.05,
    radius=1.25,
)

plt.title("Contract Approval Ratio", fontsize=15)
plt.show()

---
## <span style="color:#ff5f27;">⏭️ **Next:** Part 02: Feature Pipeline </span>

In the next notebook, we will generate new data for the Feature Groups.
