## Feature Groups

In this series of tutorials, we will work with telecom customer churn data. The end goal is to train and serve a model on the Hopsworks platform that predicts whether a customer will leave a service.

In this particular notebook you will learn how to:
- Connect to the Hopsworks feature store.
- Create feature groups and upload them to the feature store.

![tutorial-flow](images/online_offline_fs.png)

First of all, we will load the data and do some feature engineering on it.

### Data

We will use the [IBM Telco customer churn prediction dataset](https://github.com/IBM/telco-customer-churn-on-icp4d), which contains customer info and whether they have left the service or not.

Let's go ahead and load the data.

In [1]:
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv")
df.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


Broadly speaking, we have two types of features:
- Features relating to customer information, e.g. `gender`, `SeniorCitizen`.
- Features relating to services customers use, e.g. `InternetService`, `PhoneService`, `MonthlyCharges`.

Many of these are binary, some are categorical, and a few are numerical. 
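As a rough illustration of that split, the sketch below groups columns by type. The sample rows are made up, and `summarize_feature_types` is a hypothetical helper, not part of the tutorial or of any library. It uses a simple heuristic: two distinct values means binary, otherwise the dtype decides.

```python
import pandas as pd

# Small made-up sample that reuses a few column names from the dataset.
sample = pd.DataFrame({
    "gender": ["Female", "Male", "Male", "Female"],
    "SeniorCitizen": [0, 0, 1, 0],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
})


def summarize_feature_types(frame: pd.DataFrame) -> dict:
    """Group columns into binary, categorical, and numerical (heuristic)."""
    groups = {"binary": [], "categorical": [], "numerical": []}
    for col in frame.columns:
        if frame[col].nunique() == 2:
            # Exactly two distinct values: treat as binary.
            groups["binary"].append(col)
        elif pd.api.types.is_numeric_dtype(frame[col]):
            groups["numerical"].append(col)
        else:
            groups["categorical"].append(col)
    return groups


feature_types = summarize_feature_types(sample)
print(feature_types)
```

On the full dataset you would call the helper on `df` instead of `sample`. Identifier columns such as `customerID` would land in the categorical group under this heuristic, so the result is a starting point rather than a final classification.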

#### Data Cleanup

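We need to fix a few things in the dataset: convert `TotalCharges` to a numeric column, setting values that cannot be parsed to 0, and replace the `Yes`/`No` and `Female`/`Male` labels with 1 and 0.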
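To see what numeric coercion does, here is a minimal sketch on a made-up Series. It assumes the problematic entries are blank strings; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Made-up values: two valid charges and one blank entry.
raw = pd.Series(["29.85", " ", "1889.5"])

# errors="coerce" turns entries that cannot be parsed into NaN instead of raising.
numeric = pd.to_numeric(raw, errors="coerce")
n_missing = int(numeric.isna().sum())

# Fill the gaps with 0 and keep the result by assignment.
cleaned = numeric.fillna(0)
print(n_missing, cleaned.tolist())
```

Counting the NaN values before filling them is a cheap way to check how many rows the cleanup actually touches.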

In [2]:
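# Convert TotalCharges to numeric; values that cannot be parsed become NaN and are set to 0.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce").fillna(0)

# Replace binary Yes/No labels with {0, 1}.
# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may silently ignore.
for col in df:
    if len(df[col].unique()) == 2:
        df[col] = df[col].replace({"No": 0, "Yes": 1})

# gender is binary as well, but uses Female/Male labels.
df["gender"] = df["gender"].replace({"Female": 0, "Male": 1})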

df.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | 0 | 0 | 1 | 0 | 1 | 0 | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | 1 | Electronic check | 29.85 | 29.85 | 0 |
| 1 | 5575-GNVDE | 1 | 0 | 0 | 0 | 34 | 1 | No | DSL | Yes | ... | Yes | No | No | No | One year | 0 | Mailed check | 56.95 | 1889.5 | 0 |
| 2 | 3668-QPYBK | 1 | 0 | 0 | 0 | 2 | 1 | No | DSL | Yes | ... | No | No | No | No | Month-to-month | 1 | Mailed check | 53.85 | 108.15 | 1 |
| 3 | 7795-CFOCW | 1 | 0 | 0 | 0 | 45 | 0 | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | 0 | Bank transfer (automatic) | 42.3 | 1840.75 | 0 |
| 4 | 9237-HQITU | 0 | 0 | 0 | 0 | 2 | 1 | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | 1 | Electronic check | 70.7 | 151.65 | 1 |


Note that we need to do some additional preprocessing, such as one-hot encoding the categorical features, to make the data compatible with a machine learning model. We will do this in the next notebook, where we will create the actual dataset we will train our models on.
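For readers unfamiliar with one-hot encoding, the following is a minimal sketch using `pd.get_dummies` on a made-up sample. The next notebook may do this step differently; the rows and the choice of `Contract` as the encoded column are only illustrative.

```python
import pandas as pd

# Made-up rows with one categorical column.
sample = pd.DataFrame({
    "customerID": ["a", "b", "c"],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "MonthlyCharges": [29.85, 56.95, 53.85],
})

# Each distinct Contract value becomes its own 0/1 column;
# the original Contract column is dropped.
encoded = pd.get_dummies(sample, columns=["Contract"], dtype=int)
print(encoded.columns.tolist())
```

Each row has a 1 in exactly one of the new `Contract_*` columns, which lets a model consume the category without implying an order between its values.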

### Feature Groups

A [feature group](https://docs.hopsworks.ai/feature-store-api/latest/generated/feature_group/) can be seen as a collection of conceptually related features. In our problem setting, it makes sense to create a single feature group because all the data comes from a single source. With multiple data sources, for example one with customer info (birthdate, gender) and one with service info (the services a user has signed up for at a particular point in time), it would make sense to create a separate feature group for each source.
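The multi-source setting can be pictured with plain pandas. The sketch below splits a made-up frame into two frames that share `customerID`, each of which could back its own feature group. The names `customer_info_df` and `service_info_df` and the column split are hypothetical; this tutorial itself keeps everything in one feature group.

```python
import pandas as pd

# Made-up rows that mix customer attributes and service attributes.
sample = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE"],
    "gender": [0, 1],
    "SeniorCitizen": [0, 0],
    "InternetService": ["DSL", "DSL"],
    "MonthlyCharges": [29.85, 56.95],
})

# Hypothetical split: both frames keep customerID so they can be joined later.
customer_cols = ["customerID", "gender", "SeniorCitizen"]
service_cols = ["customerID", "InternetService", "MonthlyCharges"]

customer_info_df = sample[customer_cols]
service_info_df = sample[service_cols]

# Joining on the shared key recovers the combined view.
rejoined = customer_info_df.merge(service_info_df, on="customerID")
print(rejoined)
```

Keeping the same key column in every frame is what makes it possible to combine the groups again when building a training dataset.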

Before we can create a feature group we need to connect to our feature store.

In [None]:
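import hsfs

# Connect to Hopsworks; when run inside a Hopsworks notebook, no arguments are needed.
conn = hsfs.connection()
# Get a handle to the project's default feature store.
fs = conn.get_feature_store()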

To create a feature group we need to give it a name and specify a primary key. It is also good practice to describe the contents of the feature group.
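Before declaring a column as the primary key, it can be worth checking that it actually identifies each row. The sketch below is not part of the tutorial or of the Hopsworks API: `check_primary_key` is a hypothetical helper, the frames are made up, and it assumes a key should be non-missing and unique.

```python
import pandas as pd


def check_primary_key(frame: pd.DataFrame, key: str) -> bool:
    """Return True if `key` has no missing values and no duplicates."""
    column = frame[key]
    return bool(column.notna().all() and column.is_unique)


# Made-up frames: one with a clean key, one with a duplicate and a missing value.
good = pd.DataFrame({"customerID": ["7590-VHVEG", "5575-GNVDE", "3668-QPYBK"]})
bad = pd.DataFrame({"customerID": ["7590-VHVEG", "7590-VHVEG", None]})

good_ok = check_primary_key(good, "customerID")
bad_ok = check_primary_key(bad, "customerID")
print(good_ok, bad_ok)
```

On the tutorial data the call would be `check_primary_key(df, "customerID")`.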

In [None]:
customer_info_fg = fs.create_feature_group(
    name="customer_info",
    description="Customer info for churn prediction.",
    primary_key=['customerID'],
    online_enabled=True
)

Here we have also set `online_enabled=True`, which makes the data available in the online feature store for low-latency access. A full list of arguments can be found in the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/feature_store_api/#create_feature_group).

At this point, we have only specified some metadata for the feature group. It does not store any data yet, nor does it have a schema defined for the data. To make the feature group persistent, we populate it with its associated data using the `save` method.

In [None]:
customer_info_fg.save(df)

### Next Steps

In the next notebook we will use our feature group to create a dataset we can train a model on.