# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 01: Backfill Features to the Feature Store</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/{project_name}/{notebook_name}.ipynb)

## 🗒️ This notebook is divided into the following sections:
1. Fetch historical data
2. Connect to the Hopsworks feature store
3. Create feature groups and insert them into the feature store
4. Data Visualization

![tutorial-flow](../../images/01_featuregroups.png)

## <span style='color:#ff5f27'> 📝 Imports </span>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

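# Helper functions used in this notebook, defined in the local functions module
from functions import remove_nans, get_subsample, add_perc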

## <span style="color:#ff5f27;"> 💽 Loading the Data </span>

#### <span style="color:#ff5f27;">⛳️ Application Train dataset</span>

The main training dataset contains information about each loan application at Home Credit. Every loan has its own row and is identified by the feature **SK_ID_CURR**. The dataset has a binary target indicating whether the loan was repaid (0) or not (1).

In [None]:
application_train_org = pd.read_csv("https://repo.hops.works/dev/davit/credit_scores/application_train.csv")
application_train_org.head()

#### <span style="color:#ff5f27;">⛳️ Application Test dataset</span>

In [None]:
application_test = pd.read_csv('https://repo.hops.works/dev/davit/credit_scores/application_test.csv')

application_test.head()

#### <span style="color:#ff5f27;">⛳️ Bureau Balance dataset</span>

This dataset contains monthly data about previous credits in the bureau data. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit's length.

In [None]:
bureau_balance = pd.read_csv('https://repo.hops.works/dev/davit/credit_scores/bureau_balance.csv')

bureau_balance.head()
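
The row-per-month layout described above can be illustrated with a small made-up table. This is only a sketch: the values are invented, and the `MONTHS_BALANCE` and `STATUS` column names are assumptions that do not appear elsewhere in this notebook's text.

```python
import pandas as pd

# Toy stand-in for bureau_balance; all values are made up for illustration.
toy_bureau_balance = pd.DataFrame({
    "SK_ID_BUREAU": [1001, 1001, 1001, 1002, 1002],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1],   # assumed column name
    "STATUS": ["C", "0", "0", "X", "1"],    # assumed column name
})

# Count how many monthly rows each previous credit contributes.
months_per_credit = (
    toy_bureau_balance.groupby("SK_ID_BUREAU")["MONTHS_BALANCE"].count()
)
print(months_per_credit)
```

Here credit `1001` contributes three rows and credit `1002` contributes two, which is the one-to-many shape to keep in mind when aggregating this table.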

#### <span style="color:#ff5f27;">⛳️ Bureau Dataset</span>

This dataset contains data about clients' previous credits from other financial institutions. Each previous credit has its own row in `bureau`, but one loan in the application data can have multiple previous credits.

In [None]:
bureau = pd.read_csv('https://repo.hops.works/dev/davit/credit_scores/bureau.csv')

bureau.head()

#### <span style="color:#ff5f27;">⛳️ Credit Card Balance Dataset</span>

This dataset contains monthly data about previous credit cards that clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.

In [None]:
credit_card_balance = pd.read_csv('https://repo.hops.works/dev/davit/credit_scores/credit_card_balance.csv')

credit_card_balance.head()

#### <span style="color:#ff5f27;">⛳️ Installments Payments Dataset</span>

This dataset contains the payment history for previous loans at Home Credit. There is one row for every payment made and one row for every missed payment.

In [None]:
installments_payments = pd.read_csv('https://repo.hops.works/dev/davit/credit_scores/installments_payments.csv')

installments_payments.head()

#### <span style="color:#ff5f27;">⛳️ POS (point of sales) and Cash Loans Balance Dataset</span>

This dataset contains monthly balance snapshots of previous POS (point of sales) and cash loans that the applicant had with Home Credit.

The table has one row for each month of history of every previous credit at Home Credit (consumer credit and cash loans) related to loans in our sample.

In [None]:
pos_cash_balance = pd.read_csv('https://repo.hops.works/dev/davit/credit_scores/POS_CASH_balance.csv')

pos_cash_balance.head()

#### <span style="color:#ff5f27;">⛳️ Previous Application Dataset</span>

This dataset contains all previous applications for Home Credit loans from clients who have loans in our sample.

There is one row for each previous application related to loans in our data sample.

In [None]:
previous_application = pd.read_csv('https://repo.hops.works/dev/davit/credit_scores/previous_application.csv')

previous_application.head()

---

## <span style="color:#ff5f27;"> 🛠️ Feature Engineering </span>

#### <span style="color:#ff5f27;"> ⛳️ Dataset with the number of previous loans</span>

In [None]:
previous_loan_counts = bureau.groupby('SK_ID_CURR', as_index=False)['SK_ID_BUREAU'].count().rename(columns = {'SK_ID_BUREAU': 'previous_loan_counts'})

previous_loan_counts.head()
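
To make the aggregation above concrete, here is the same `groupby` pattern applied to a tiny made-up frame. The IDs are invented; only the column names match the cell above.

```python
import pandas as pd

# Toy stand-in for bureau: client 1 has three previous credits,
# client 2 has one, and client 3 has two. Values are made up.
toy_bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2, 3, 3],
    "SK_ID_BUREAU": [10, 11, 12, 20, 30, 31],
})

# Same pattern as above: count bureau records per client,
# then give the count column a descriptive name.
toy_counts = (
    toy_bureau
    .groupby("SK_ID_CURR", as_index=False)["SK_ID_BUREAU"]
    .count()
    .rename(columns={"SK_ID_BUREAU": "previous_loan_counts"})
)
print(toy_counts)
```

Passing `as_index=False` keeps `SK_ID_CURR` as a regular column, which is convenient because it is later used as the primary key of the feature group.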

---

## <span style="color:#ff5f27;">👨🏻‍⚖️ Dealing with missing values</span>

In [None]:
application_train = remove_nans(application_train_org)
application_test = remove_nans(application_test)
bureau = remove_nans(bureau)
credit_card_balance.dropna(inplace = True)
installments_payments.dropna(inplace = True)
pos_cash_balance.dropna(inplace = True)
previous_application = remove_nans(previous_application)
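
The `remove_nans` helper comes from the local `functions` module and its implementation is not shown in this notebook. The sketch below is one plausible approach, not the actual helper: it assumes a hypothetical function `drop_sparse_then_incomplete` that first drops columns with a large share of missing values and then drops any rows that still contain missing values. The 0.5 threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

def drop_sparse_then_incomplete(df, max_nan_share=0.5):
    """Hypothetical helper (not the notebook's remove_nans).

    Drops columns whose share of NaNs exceeds max_nan_share,
    then drops the rows that still contain a NaN.
    """
    nan_share = df.isna().mean()                  # share of NaNs per column
    kept = df.loc[:, nan_share <= max_nan_share]  # keep reasonably complete columns
    return kept.dropna().reset_index(drop=True)

# Made-up data for illustration.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],        # 20% missing: column kept
    "b": [np.nan, np.nan, np.nan, 1.0, 2.0],  # 60% missing: column dropped
    "c": [1, 2, 3, 4, 5],
})

cleaned = drop_sparse_then_incomplete(toy)
print(cleaned)
```

Dropping sparse columns first avoids throwing away most rows because of a single mostly-empty column, which a plain `dropna()` would do.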

---
## <span style="color:#ff5f27;">🔬 🧬 Subsampling Data</span>

The datasets are large, so we subsample them to keep the rest of the tutorial fast to run.



In [None]:
application_train_sample = get_subsample(application_train)
bureau_balance_sample = get_subsample(bureau_balance)
bureau_sample = get_subsample(bureau)
credit_card_balance_sample = get_subsample(credit_card_balance)
installments_payments_sample = get_subsample(installments_payments)
pos_cash_balance_sample = get_subsample(pos_cash_balance)
previous_application_sample = get_subsample(previous_application)
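
`get_subsample` is also defined in the local `functions` module and is not shown here. As a rough idea of what a subsampling helper can look like, the sketch below defines a hypothetical `take_subsample` function. Its name, the `n_rows` and `seed` parameters, and their defaults are assumptions made for illustration.

```python
import pandas as pd

def take_subsample(df, n_rows=1000, seed=42):
    """Hypothetical helper (not the notebook's get_subsample).

    Returns a reproducible random sample of at most n_rows rows.
    """
    n = min(n_rows, len(df))  # never ask for more rows than exist
    return df.sample(n=n, random_state=seed).reset_index(drop=True)

# Made-up data for illustration.
toy = pd.DataFrame({"id": range(10), "value": range(10, 20)})

sample = take_subsample(toy, n_rows=4)
print(sample)
```

Fixing the random seed means repeated runs of the notebook work with the same rows, which makes results easier to compare.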

## <span style="color:#ff5f27;"> 🔮 Connecting to Hopsworks Feature Store </span>

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store() 

In [None]:
print(f'Feature Store Name: {fs.name}')
print(f'Feature Store Description: {fs.description}')

---

## <span style="color:#ff5f27;"> 🪄 Creating Feature Groups </span>

A `Feature Group` is a logical grouping of features. Experience has shown that this grouping generally originates from the features being derived from the same data source. A `Feature Group` also lets you save metadata alongside its features.

Generally, the features in a feature group are engineered together in an ingestion job. However, it is possible to have additional jobs to append features to an existing feature group. Furthermore, `Feature Groups` provide a way of defining a namespace for features, such that you can define features with the same name multiple times, but uniquely identified by the group they are contained in.

> It is important to note that `Feature Groups` are not groupings of features for immediate training of Machine Learning models. Instead, to ensure reusability of features, it is possible to combine features from any number of groups into a **Feature View**.
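
The namespace idea can be illustrated with plain pandas. This is an analogy only and does not use the Hopsworks API: two made-up tables stand in for feature groups, both carry a feature with the same name, and the hypothetical `namespaced` helper prefixes each feature with its group name before the tables are combined on a shared key.

```python
import pandas as pd

# Two toy "groups" that both contain a feature called amt_credit.
# All values are made up for illustration.
applications = pd.DataFrame({
    "sk_id_curr": [1, 2, 3],
    "amt_credit": [100.0, 250.0, 80.0],
})
previous_applications = pd.DataFrame({
    "sk_id_curr": [1, 2],
    "amt_credit": [40.0, 300.0],
})

def namespaced(df, group, key="sk_id_curr"):
    """Hypothetical helper: prefix every non-key column with its group name."""
    return df.rename(columns={c: f"{group}_{c}" for c in df.columns if c != key})

# Combining features from several groups, loosely analogous to a feature view.
combined = namespaced(applications, "application").merge(
    namespaced(previous_applications, "previous_application"),
    on="sk_id_curr",
    how="left",
)
print(combined)
```

Because each feature stays tied to the group it came from, the two `amt_credit` features remain distinguishable after they are combined.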

### <span style="color:#ff5f27;">⛳️ Creating Application Train and Test Feature Groups </span>

In [None]:
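application_train.columns = [col_name.lower() for col_name in application_train.columns]
# The sample is what gets inserted, so its columns must match the lowercase primary key
application_train_sample.columns = [col_name.lower() for col_name in application_train_sample.columns]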

application_train_fg = fs.get_or_create_feature_group(
    name = 'application_train',
    version = 1,
    primary_key = ['sk_id_curr'],
    online_enabled = False
)

application_train_fg.insert(application_train_sample)

In [None]:
application_test.columns = [col_name.lower() for col_name in application_test.columns]

application_test_fg = fs.get_or_create_feature_group(
    name = 'application_test',
    version = 1,
    primary_key = ['sk_id_curr'],
    online_enabled = False
)

application_test_fg.insert(application_test)

#### <span style="color:#ff5f27;"> ⛳️ Bureau Balance Feature Group</span>

In [None]:
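bureau_balance.columns = [col_name.lower() for col_name in bureau_balance.columns]
# The sample is what gets inserted, so its columns must match the lowercase primary key
bureau_balance_sample.columns = [col_name.lower() for col_name in bureau_balance_sample.columns]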

bureau_balance_fg = fs.get_or_create_feature_group(
    name = 'bureau_balance',
    version = 1,
    primary_key = ['sk_id_bureau'],
    online_enabled = False
)

bureau_balance_fg.insert(bureau_balance_sample)
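
Since a primary key is meant to identify rows, it can be useful to check how many rows share the same key values before inserting. The sketch below does this on made-up data with a hypothetical `rows_sharing_key` helper. It is only a data check: it says nothing about how Hopsworks treats repeated key values, and the `months_balance` and `status` column names are assumptions.

```python
import pandas as pd

# Toy monthly table: credit 1001 appears in two different months.
# All values are made up for illustration.
toy = pd.DataFrame({
    "sk_id_bureau": [1001, 1001, 1002],
    "months_balance": [0, -1, 0],   # assumed column name
    "status": ["C", "0", "X"],      # assumed column name
})

def rows_sharing_key(df, key_columns):
    """Hypothetical helper: number of rows whose key also appears on another row."""
    return int(df.duplicated(subset=key_columns, keep=False).sum())

single_key = rows_sharing_key(toy, ["sk_id_bureau"])
composite_key = rows_sharing_key(toy, ["sk_id_bureau", "months_balance"])
print(single_key, composite_key)
```

In this toy table the single-column key is shared by two rows, while the two-column key identifies every row uniquely.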

#### <span style="color:#ff5f27;"> ⛳️ Bureau Feature Group</span>

In [None]:
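bureau.columns = [col_name.lower() for col_name in bureau.columns]
# The sample is what gets inserted, so its columns must match the lowercase primary key
bureau_sample.columns = [col_name.lower() for col_name in bureau_sample.columns]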

bureau_fg = fs.get_or_create_feature_group(
    name = 'bureau',
    version = 1,
    primary_key = ['sk_id_curr','sk_id_bureau'],
    online_enabled = False
)

bureau_fg.insert(bureau_sample)

#### <span style="color:#ff5f27;"> ⛳️ Previous Application Feature Group</span>

In [None]:
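previous_application.columns = [col_name.lower() for col_name in previous_application.columns]
# The sample is what gets inserted, so its columns must match the lowercase primary key
previous_application_sample.columns = [col_name.lower() for col_name in previous_application_sample.columns]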

previous_application_fg = fs.get_or_create_feature_group(
    name = 'previous_application',
    version = 1,
    primary_key = ['sk_id_prev','sk_id_curr'],
    online_enabled = False
)

previous_application_fg.insert(previous_application_sample)

#### <span style="color:#ff5f27;"> ⛳️ Pos_Cash_Balance Feature Group</span>

In [None]:
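pos_cash_balance.columns = [col_name.lower() for col_name in pos_cash_balance.columns]
# The sample is what gets inserted, so its columns must match the lowercase primary key
pos_cash_balance_sample.columns = [col_name.lower() for col_name in pos_cash_balance_sample.columns]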

pos_cash_balance_fg = fs.get_or_create_feature_group(
    name = 'pos_cash_balance',
    version = 1,
    primary_key = ['sk_id_prev','sk_id_curr'],
    online_enabled = False
)

pos_cash_balance_fg.insert(pos_cash_balance_sample)

#### <span style="color:#ff5f27;"> ⛳️ Installments Payments Feature Group</span>

In [None]:
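installments_payments.columns = [col_name.lower() for col_name in installments_payments.columns]
# The sample is what gets inserted, so its columns must match the lowercase primary key
installments_payments_sample.columns = [col_name.lower() for col_name in installments_payments_sample.columns]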

installments_payments_fg = fs.get_or_create_feature_group(
    name = 'installments_payments',
    version = 1,
    primary_key = ['sk_id_prev','sk_id_curr'],
    online_enabled = False
)

installments_payments_fg.insert(installments_payments_sample)

#### <span style="color:#ff5f27;"> ⛳️ Credit Card Balance Feature Group</span>

In [None]:
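credit_card_balance.columns = [col_name.lower() for col_name in credit_card_balance.columns]
# The sample is what gets inserted, so its columns must match the lowercase primary key
credit_card_balance_sample.columns = [col_name.lower() for col_name in credit_card_balance_sample.columns]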

credit_card_balance_fg = fs.get_or_create_feature_group(
    name = 'credit_card_balance',
    version = 1,
    primary_key = ['sk_id_prev','sk_id_curr'],
    online_enabled = False
)

credit_card_balance_fg.insert(credit_card_balance_sample)

#### <span style="color:#ff5f27;"> ⛳️ Previous Loan Counts Feature Group</span>

In [None]:
previous_loan_counts.columns = [col_name.lower() for col_name in previous_loan_counts.columns]

previous_loan_counts_fg = fs.get_or_create_feature_group(
    name = 'previous_loan_counts',
    version = 1,
    primary_key = ['sk_id_curr'],
    online_enabled = False
)

previous_loan_counts_fg.insert(previous_loan_counts)

---

## <span style="color:#ff5f27;">👨🏻‍🎨 Data Exploration</span>

In [None]:
plt.figure(figsize=(12,5))

plt.pie(
    application_train.target.value_counts(),
    labels = ['Repaid','Not Repaid'], 
    explode = (0, 0.2),
    shadow=True,
    autopct='%1.1f%%',
    radius = 1.2
)

plt.title("Ratio of Loans Repaid or Not", fontsize = 15)
plt.show()

In [None]:
plt.figure(figsize=(12,5))

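# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
sns.histplot(application_train.amt_credit, kde = True, stat = 'density')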

plt.title("Distribution of Amount of Credit", fontsize = 15)
plt.xlabel('Amount of credit')
plt.show()

In [None]:
plt.figure(figsize=(12,5))

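# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
sns.histplot(application_train.amt_goods_price, kde = True, stat = 'density')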

plt.title("Distribution of Amount of Goods Price", fontsize = 15)
plt.xlabel('Amount of goods price')
plt.show()

In [None]:
plt.figure(figsize=(12,5))

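# days_birth is stored as negative day counts, so dividing by -365 gives the age in years
# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
sns.histplot(application_train.days_birth / -365, bins = 30, kde = True, stat = 'density')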

plt.title("Distribution of Applicant Age", fontsize = 15)
plt.xlabel('Years')
plt.show()

In [None]:
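temp_df = (
    application_train.name_type_suite
    .value_counts()
    .rename_axis('accompanied_by')
    .reset_index(name = 'count')
)

plt.figure(figsize=(12,5))

# Explicit column names keep this cell working across pandas versions,
# which name the columns of value_counts().reset_index() differently
sns.barplot(data = temp_df, x = 'accompanied_by', y = 'count')

plt.title("Who accompanied the client when applying", fontsize = 15)
plt.xlabel('Accompanied by')
plt.ylabel('Count')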
plt.show()

In [None]:
plt.figure(figsize=(12,5))

plt.pie(
    application_train.flag_own_car.value_counts(),
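    # flag_own_car records car ownership; value_counts() sorts by frequency,
    # and the larger class is assumed here to be clients without a car
    labels = ['Does not own a car','Owns a car'],
    explode = (0, 0.1),
    shadow = True,
    autopct = '%1.1f%%',
    radius = 1.2
)

plt.title("Ratio of clients who own a car", fontsize = 15)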
plt.show()

In [None]:
plt.figure(figsize=(12,5))

plt.pie(
    application_train.flag_own_realty.value_counts(),
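    # flag_own_realty records realty ownership; value_counts() sorts by frequency,
    # and the larger class is assumed here to be clients who own realty
    labels = ['Owns realty','Does not own realty'], 
    explode = (0, 0.1),
    shadow=True, 
    autopct='%1.1f%%',
    radius = 1.2
)

plt.title("Ratio of clients who own realty", fontsize = 15)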
plt.show()

In [None]:
temp_df = application_train.name_income_type.value_counts()[:4]

plt.figure(figsize=(12,5))

plt.pie(
    temp_df,
    labels = temp_df.index,
    explode = (0, 0.075,0.1,0.1), 
    shadow = True, 
    autopct = '%1.1f%%',
    labeldistance = 0.8,
    radius = 1.2
)

plt.title("Income Ratio", fontsize = 15)
plt.show()

In [None]:
temp_df = application_train.name_family_status.value_counts()[:-1]

plt.figure(figsize=(12,5))

plt.pie(
    temp_df,
    labels = temp_df.index,
    explode = (0,0.1,0.1,0.1), 
    shadow = True, 
    autopct = '%1.1f%%',
    labeldistance = 1.05,
    radius = 1.2
)

plt.title("Family Status Ratio", fontsize = 15)
plt.show()

In [None]:
plt.figure(figsize=(12,5))

sns.countplot(
    data = application_train_org,
    x = 'OCCUPATION_TYPE',
    order = application_train_org['OCCUPATION_TYPE'].value_counts().index
)

plt.title("Occupation of loan applicants", fontsize = 15)
plt.xticks(rotation = 45)
plt.show()

In [None]:
plt.figure(figsize=(12,5))

ax = sns.countplot(
    data = application_train_org,
    x = 'NAME_EDUCATION_TYPE',
    hue = 'TARGET',
    order = application_train_org['NAME_EDUCATION_TYPE'].value_counts().index
)

plt.title("Education of loan applicants", fontsize = 15)
plt.xlabel('Education Type')
plt.ylabel('Count')
add_perc(ax,application_train_org.NAME_EDUCATION_TYPE,5,2)
plt.show()

In [None]:
temp_df = previous_application.name_contract_status.value_counts()[:-1]

plt.figure(figsize=(12,5))

plt.pie(
    temp_df,
    labels = temp_df.index,
    explode = (0,0.1,0.1), 
    shadow = True, 
    autopct = '%1.1f%%',
    labeldistance = 1.05,
    radius = 1.25
)

plt.title("Contract Approval Ratio", fontsize = 15)
plt.show()

---
## <span style="color:#ff5f27;">⏭️ **Next:** Part 02 </span>

In the next notebook, we will generate new data for the Feature Groups.