## Churn Prediction Feature Engineering

<img src="https://github.com/RafiKurlansik/laughing-garbanzo/blob/main/step1.png?raw=true">

### Featurization Logic

This is a fairly clean dataset, so we only need to one-hot encode the categorical columns and then clean up the resulting column names.

In [0]:
# Read into Spark
telcoDF = spark.table("workspace.databricks_study.telco_customer_churn")

display(telcoDF)
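
As a quick illustration of what one-hot encoding produces, here is a minimal sketch in plain `pandas` on a made-up three-row frame. The rows and the `toy` variable are invented for illustration and are not taken from the Telco table; only the `contract` column name is borrowed from it.

```python
import pandas as pd

# Toy data: values are invented for illustration only
toy = pd.DataFrame({
    "customerID": ["c1", "c2", "c3"],
    "contract": ["Month-to-month", "Two year", "Month-to-month"],
})

# Each distinct value of `contract` becomes its own 0/1 column,
# named "<column>_<value>"
encoded = pd.get_dummies(toy, columns=["contract"], dtype="int64")

print(encoded.columns.tolist())
print(encoded)
```

Note that the generated column names inherit any spaces or punctuation in the category values, which is why the column names get cleaned up after encoding.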

The pandas API on Spark (`pyspark.pandas`, the successor to `koalas`) lets us run `pandas`-style code at scale on Spark.

In [0]:
# import databricks.koalas as ks
import pyspark.pandas as ks

def compute_churn_features(data):
  
  # Convert to koalas
  # data = data.to_koalas()
  data = ks.DataFrame(data)
  
  # OHE
  data = ks.get_dummies(data, 
                        columns=['gender', 'partner', 'dependents',
                                 'phoneService', 'multipleLines', 'internetService',
                                 'onlineSecurity', 'onlineBackup', 'deviceProtection',
                                 'techSupport', 'streamingTV', 'streamingMovies',
                                 'contract', 'paperlessBilling', 'paymentMethod'],
                        dtype='int64')
  
  # Convert label to int and rename column
  data['Churn'] = data['Churn'].map({'Yes': 1, 'No': 0})
  data = data.astype({'Churn': 'int32'})
  # data = data.rename(columns = {'churnString': 'churn'})
  
  # Clean up column names
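  # Clean up column names (regex=False so '(' and ')' are treated as literal characters)
  data.columns = data.columns.str.replace(' ', '', regex=False)
  data.columns = data.columns.str.replace('(', '-', regex=False)
  data.columns = data.columns.str.replace(')', '', regex=False)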
  
  # Drop missing values
  data = data.dropna()
  
  return data
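
To see what the name clean-up and label conversion do in isolation, here is a small sketch using plain `pandas`. The two column names are illustrative examples of what one-hot encoding a payment-method column might produce, and `raw_cols` and `labels` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

# Illustrative one-hot column names containing spaces and parentheses
raw_cols = pd.Index([
    "paymentMethod_Bank transfer (automatic)",
    "paymentMethod_Electronic check",
])

# Same three literal replacements as in compute_churn_features
cleaned = (
    raw_cols.str.replace(" ", "", regex=False)
            .str.replace("(", "-", regex=False)
            .str.replace(")", "", regex=False)
)
print(cleaned.tolist())

# Map the string label to 0/1 and store it as a 32-bit integer
labels = pd.Series(["Yes", "No", "No"])
churn_flag = labels.map({"Yes": 1, "No": 0}).astype("int32")
print(churn_flag.tolist())
```

Passing `regex=False` keeps the parentheses from being interpreted as regular-expression syntax, regardless of the `pandas` version in use.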

In [0]:
%pip install databricks-feature-engineering

In [0]:
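# Restarting Python clears all notebook state, so re-run the cells above
# (telcoDF, the pyspark.pandas import, compute_churn_features) before continuing
dbutils.library.restartPython()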

In [0]:
data = ks.DataFrame(telcoDF)
display(data)

In [0]:
test_df = ks.get_dummies(data, 
                        columns=['gender', 'partner', 'dependents',
                                 'phoneService', 'multipleLines', 'internetService',
                                 'onlineSecurity', 'onlineBackup', 'deviceProtection',
                                 'techSupport', 'streamingTV', 'streamingMovies',
                                 'contract', 'paperlessBilling', 'paymentMethod'],
                        dtype='int64')

In [0]:
display(test_df)

In [0]:
# from databricks.feature_store import FeatureStoreClient
# from databricks.feature_store import feature_table
from databricks.feature_engineering import FeatureEngineeringClient

fs = FeatureEngineeringClient()

churn_features_df = compute_churn_features(telcoDF)

churn_feature_table = fs.create_table(
  name='workspace.databricks_study.churn_features',
  primary_keys='customerID',
  schema=churn_features_df.spark.schema(),
  description='These features are derived from the workspace.databricks_study.telco_customer_churn table. Categorical columns are one-hot encoded, column names are cleaned up, and the churn label is converted to a 0/1 integer flag. No aggregations were performed.'
)

fs.write_table(df=churn_features_df.to_spark(), name='workspace.databricks_study.churn_features', mode='merge')
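
Writing with `mode='merge'` upserts rows by primary key: rows whose key already exists in the table are updated, and rows with new keys are appended. The sketch below mimics that behavior in plain `pandas` purely to illustrate the semantics. It is not how the feature table is implemented, and the `tenure` values, `existing`, and `incoming` are made up.

```python
import pandas as pd

# Rows already in the (hypothetical) feature table
existing = pd.DataFrame({"customerID": ["A", "B"], "tenure": [1, 5]})

# New batch: "B" is an update, "C" is a new key
incoming = pd.DataFrame({"customerID": ["B", "C"], "tenure": [6, 2]})

# Upsert: for each key, the most recently written row wins
merged = (
    pd.concat([existing, incoming])
      .drop_duplicates(subset="customerID", keep="last")
      .sort_values("customerID")
      .reset_index(drop=True)
)
print(merged)
```

Here `A` is left untouched, `B` takes the incoming value, and `C` is added, which is why re-running the write with the same data should not create duplicate customers.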