# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span>

<span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 01: Feature Groups</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/churn/1_feature_groups.ipynb)

**Note**: You may see an error when installing hopsworks on Colab; it is safe to ignore.

This is the first part of a series of tutorials on predicting which customers are at risk of churning with the Hopsworks Feature Store. In this first module, you will work with user data from the telco industry.
The objective of this tutorial is to demonstrate how to work with the **Hopsworks Feature Store** for batch data, with the goal of training and deploying a model that can predict which customers are at risk of churning.


## 🗒️ This notebook is divided into 3 sections:
1. Load the data and perform feature engineering,
2. Connect to the Hopsworks feature store,
3. Create feature groups and upload them to the feature store.

First, you will load the data and do some feature engineering on it.

In [None]:
!pip install -U hopsworks --quiet

The data you will use comes from three different CSV files:

- `demography.csv`: demographic information,
- `customer_info.csv`: customer information such as contract type, billing method and monthly charges, as well as whether the customer has churned within the last month,
- `subscriptions.csv`: customer subscriptions to services such as internet, mobile or movie streaming.

You can conceptualize these CSV files as originating from separate data sources.
**All three files have a customer id column `customerid` in common, which you can use for joins.**

Let's go ahead and load the data.

In [None]:
import pandas as pd

demography_df = pd.read_csv("https://repo.hops.works/dev/davit/churn/demography.csv")
customer_info_df = pd.read_csv("https://repo.hops.works/dev/davit/churn/customer_info.csv")
subscriptions_df = pd.read_csv("https://repo.hops.works/dev/davit/churn/subscriptions.csv")
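
Because the three sources share a key column, they can be combined locally with pandas. The sketch below uses tiny made-up frames rather than the downloaded files; the non-key column names and all values are illustrative assumptions, and this is not how the feature store itself performs joins.

```python
import pandas as pd

# Toy stand-ins for the three sources; all values are made up for illustration.
demography = pd.DataFrame({"customerID": ["A1", "B2"], "gender": ["Female", "Male"]})
customer_info = pd.DataFrame({"customerID": ["A1", "B2"], "Contract": ["One year", "Month-to-month"]})
subscriptions = pd.DataFrame({"customerID": ["A1", "B2"], "PhoneService": ["Yes", "No"]})

# Inner-join on the shared key; validate="one_to_one" raises if a key repeats on either side.
joined = (
    customer_info
    .merge(demography, on="customerID", validate="one_to_one")
    .merge(subscriptions, on="customerID", validate="one_to_one")
)

print(joined.columns.tolist())
```

The `validate` argument is optional; it is included here so that a duplicated customer id fails loudly instead of silently multiplying rows.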

---
## <span style="color:#ff5f27;"> 🛠️ Feature Engineering </span>

In this section, you will perform feature engineering, such as converting textual features to numerical features and replacing missing values with 0s. Let's start with the customer information feature group.

In [None]:
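# Blank or non-numeric entries become NaN with errors="coerce"; replace them with 0.
customer_info_df["TotalCharges"] = pd.to_numeric(customer_info_df["TotalCharges"], errors="coerce")
# Assign the result back: inplace=True on a selected column is unreliable in recent pandas versions.
customer_info_df["TotalCharges"] = customer_info_df["TotalCharges"].fillna(0)

# Encode the churn label as 0/1 (any value other than "No"/"Yes" would become NaN).
customer_info_df["Churn"] = customer_info_df["Churn"].map({"No": 0, "Yes": 1})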
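
To see what the coercion step does, here is a minimal sketch on a made-up column. The toy values are assumptions chosen to include one blank entry; the label encoding here uses `map`, which leaves NaN for any value missing from the mapping.

```python
import pandas as pd

# Toy column: one entry is a blank string, as can happen in raw CSV exports.
raw = pd.Series(["29.85", " ", "108.15"], name="TotalCharges")

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
numeric = pd.to_numeric(raw, errors="coerce")
n_missing = int(numeric.isna().sum())

# Replace the NaN with 0.
cleaned = numeric.fillna(0)

# Encode a Yes/No label as 1/0.
churn = pd.Series(["No", "Yes", "No"], name="Churn").map({"No": 0, "Yes": 1})

print(n_missing, cleaned.tolist(), churn.tolist())
```

Counting the NaN values before filling them is a cheap way to check how many rows the replacement actually affects.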

---
## <span style="color:#ff5f27;"> 🪄 Creating Feature Groups </span>

A [feature group](https://docs.hopsworks.ai/feature-store-api/latest/generated/feature_group/) can be seen as a collection of conceptually related features. In this case, you will create 3 feature groups:
1. **Customer information feature group.**
2. **Customer demography feature group.**
3. **Customer subscription feature group.**

As you can see, each feature group corresponds to one of the source files. These feature groups have `customerid` as their primary key, which will allow you to join them when creating a dataset in the next tutorial.
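
A column only works as a primary key if every row has a value and no value repeats. The sketch below shows a local pandas check for this; `is_valid_primary_key` is a hypothetical helper written for illustration and is not part of the Hopsworks API.

```python
import pandas as pd

def is_valid_primary_key(df: pd.DataFrame, key: str) -> bool:
    """Return True if `key` has no missing values and no duplicates."""
    column = df[key]
    return bool(column.notna().all() and column.is_unique)

# Toy frames with made-up ids.
good = pd.DataFrame({"customerID": ["A1", "B2", "C3"]})
bad = pd.DataFrame({"customerID": ["A1", "A1", None]})

print(is_valid_primary_key(good, "customerID"))  # True
print(is_valid_primary_key(bad, "customerID"))   # False
```

Running such a check on each DataFrame before inserting it can surface duplicate or missing ids early.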

Before you can create a feature group, you need to connect to the Hopsworks feature store.

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

To create a feature group you need to give it a name and specify a primary key. It is also good to provide a description of the contents of the feature group.

In [None]:
customer_info_fg = fs.get_or_create_feature_group(
    name="customer_info",
    version=1,
    description="Customer info for churn prediction.",
    primary_key=['customerID'],
)

A full list of arguments can be found in the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/feature_store_api/#create_feature_group).

At this point, you have only specified some metadata for the feature group. It does not store any data or even have a schema defined for the data. To make the feature group persistent you need to populate it with its associated data using the `insert` function.

In [None]:
customer_info_fg.insert(customer_info_df)

In [None]:
feature_descriptions = [
    {"name": "customerid", "description": "Customer id"},
    {"name": "contract", "description": "Type of contract"},
    {"name": "tenure", "description": "How long they’ve been a customer"},
    {"name": "paymentmethod", "description": "Payment method"},
    {"name": "paperlessbilling", "description": "Whether customer has paperless billing or not"},
    {"name": "monthlycharges", "description": "Monthly charges"},
    {"name": "totalcharges", "description": "Total charges"},
    {"name": "churn", "description": "Whether customer has left within the last month or not"},
]

for desc in feature_descriptions: 
    customer_info_fg.update_feature_description(desc["name"], desc["description"])
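
The descriptions above refer to features by lowercase names, while the DataFrame columns are mixed case (for example `TotalCharges`). Assuming feature names are the lowercased column names, which is what the names used in this notebook suggest, a quick check like the following sketch can catch a misspelled name before any call is made. The column list and the deliberate typo are made up for illustration.

```python
# Hypothetical column names as they might appear in a source DataFrame.
columns = ["customerID", "Contract", "TotalCharges"]

descriptions = [
    {"name": "customerid", "description": "Customer id"},
    {"name": "contract", "description": "Type of contract"},
    {"name": "totalcharge", "description": "Total charges"},  # deliberate typo
]

# Assumption: stored feature names are the lowercased column names.
expected = {c.lower() for c in columns}
unknown = [d["name"] for d in descriptions if d["name"] not in expected]

print(unknown)
```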

In [None]:
demography_fg = fs.get_or_create_feature_group(
    name="customer_demography_info",
    version=1,
    description="Customer demography info for churn prediction.",
    primary_key=['customerID'],
)
demography_fg.insert(demography_df)

In [None]:
feature_descriptions = [
    {"name": "customerid", "description": "Customer id"},
    {"name": "gender", "description": "Customer gender"},
    {"name": "seniorcitizen", "description": "Whether customer is a senior citizen or not"},
    {"name": "dependents", "description": "Whether customer has dependents or not"},
    {"name": "partner", "description": "Whether customer has a partner or not"},
]

for desc in feature_descriptions: 
    demography_fg.update_feature_description(desc["name"], desc["description"])

In [None]:
subscriptions_fg = fs.get_or_create_feature_group(
    name="customer_subscription_info",
    version=1,
    description="Customer subscription info for churn prediction.",
    primary_key=['customerID'],
)
subscriptions_fg.insert(subscriptions_df)

In [None]:
feature_descriptions = [
    {"name": "customerid", "description": "Customer id"}, 
    {"name": "deviceprotection", "description": "Whether customer has signed up for device protection service"},
    {"name": "onlinebackup", "description": "Whether customer has signed up for online backup service"}, 
    {"name": "onlinesecurity", "description": "Whether customer has signed up for online security service"}, 
    {"name": "internetservice", "description": "Whether customer has signed up for internet service"}, 
    {"name": "multiplelines", "description": "Whether customer has signed up for multiple lines service"}, 
    {"name": "phoneservice", "description": "Whether customer has signed up for phone service"}, 
    {"name": "techsupport", "description": "Whether customer has signed up for tech support service"}, 
    {"name": "streamingmovies", "description": "Whether customer has signed up for streaming movies service"}, 
    {"name": "streamingtv", "description": "Whether customer has signed up for streaming TV service"}, 
]

for desc in feature_descriptions: 
    subscriptions_fg.update_feature_description(desc["name"], desc["description"])

All three feature groups are now accessible and searchable in the UI.

![fg-overview](../churn/images/churn_fg.gif)

---
## <span style="color:#ff5f27;">⏭️ **Next:** Part 02 </span>

In the next notebook, you will use your feature groups to create a dataset for training a model.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/{project_name}/{notebook_name}.ipynb)