# End-to-End Data Science Workflow Demo
This notebook generates a synthetic dataset (6500+ rows, 20 columns) with mixed types and intentional data quality issues, then walks through:
1. Data Wrangling & Cleaning
2. Exploratory Data Analysis (EDA)
3. Predictive Modeling (Regression & Classification)
4. Deployment with Gradio (two mini apps)

*Targets*
- Regression target: monthly_spend
- Classification target: churn (0/1)

*How to use*: Run cells from top to bottom in Google Colab.

### Data Description

Here's a description of each feature in the dataset:

- *customer_id*: Unique identifier for each customer.
- *signup_date*: Date when the customer signed up.
- *country*: Country of the customer.
- *city*: City of the customer.
- *gender*: Gender of the customer.
- *age*: Age of the customer.
- *membership_tier*: Membership tier of the customer (e.g., Basic, Standard).
- *tenure_months*: Number of months the customer has been a member.
- *avg_session_length*: Average length of customer sessions in minutes.
- *sessions_per_month*: Number of sessions per month.
- *support_tickets*: Number of support tickets raised by the customer.
- *last_payment_method*: Last payment method used by the customer.
- *is_mobile_user*: Whether the customer is a mobile user (True/False).
- *num_devices*: Number of devices used by the customer.
- *email_click_rate*: Click rate for emails sent to the customer.
- *referral_count*: Number of referrals made by the customer.
- *discount_rate*: Discount rate applied to the customer.
- *satisfaction_score*: Customer satisfaction score.
- *monthly_spend*: Monthly spend of the customer (Regression Target).
- *churn*: Whether the customer churned (1) or not (0) (Classification Target).

# **Classification**

In [87]:
import pandas as pd

df = pd.read_csv('/content/cleaned_data (1).csv')
df.head()

Unnamed: 0,customer_id,signup_date,country,city,gender,age,membership_tier,tenure_months,avg_session_length,sessions_per_month,support_tickets,last_payment_method,is_mobile_user,num_devices,email_click_rate,referral_count,discount_rate,satisfaction_score,monthly_spend,churn
0,100000,2025-02-24,malaysia,kuala lumpur,other,39,standard,8,21.51,21,0,credit card,True,2,0.315,0,0.138,6.9,44.62,0
1,100001,2021-04-24,philippines,cebu,male,51,basic,55,19.6,16,1,paypal,False,2,0.224,0,0.0,5.7,54.71,0
2,100002,2019-05-29,thailand,phuket,female,29,standard,78,31.98,18,0,credit card,True,5,0.053,0,0.112,6.3,22.4,0
3,100003,2018-01-01,indonesia,jakarta,female,43,standard,95,12.71,19,2,credit card,False,3,0.231,1,0.117,7.3,63.13,0
4,100004,2021-06-16,malaysia,shah alam,male,51,standard,53,3.0,15,1,unknown,True,3,0.036,0,0.114,5.8,33.28,0


Check data types

In [88]:
df.dtypes

Unnamed: 0,0
customer_id,int64
signup_date,object
country,object
city,object
gender,object
age,int64
membership_tier,object
tenure_months,int64
avg_session_length,float64
sessions_per_month,int64


## **Data Pre-processing**

List the columns that still have the object data type

In [89]:
categorical_cols = df.select_dtypes(include=["object"])
categorical_cols.dtypes

Unnamed: 0,0
signup_date,object
country,object
city,object
gender,object
membership_tier,object
last_payment_method,object


Convert signup_date from object to datetime

In [90]:
# Convert signup_date to datetime; unparseable values are coerced to NaT
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
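
Because `errors='coerce'` silently turns unparseable values into `NaT`, it can be worth counting them right after the conversion. A minimal sketch on a made-up Series (the sample values and the names `raw`, `parsed`, and `n_bad` are illustrative and not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample: two valid dates and one unparseable entry
raw = pd.Series(["2025-02-24", "not a date", "2021-04-24"])

# Invalid entries become NaT instead of raising an error
parsed = pd.to_datetime(raw, errors="coerce")

# Count how many values were lost to coercion
n_bad = int(parsed.isna().sum())
print("Unparseable dates:", n_bad)
```

If the count is non-zero, the affected rows can be inspected with `raw[parsed.isna()]` before deciding whether to drop or repair them.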

Check data type

In [91]:
df.dtypes

Unnamed: 0,0
customer_id,int64
signup_date,datetime64[ns]
country,object
city,object
gender,object
age,int64
membership_tier,object
tenure_months,int64
avg_session_length,float64
sessions_per_month,int64


Label-encode all categorical variables

In [92]:
from sklearn.preprocessing import LabelEncoder

# Fit one LabelEncoder per categorical column and keep it for decoding later
encoders = {}
for col in ['country', 'city', 'gender', 'membership_tier', 'last_payment_method']:
  le = LabelEncoder()
  df[col] = le.fit_transform(df[col])
  encoders[col] = le

df.head()

Unnamed: 0,customer_id,signup_date,country,city,gender,age,membership_tier,tenure_months,avg_session_length,sessions_per_month,support_tickets,last_payment_method,is_mobile_user,num_devices,email_click_rate,referral_count,discount_rate,satisfaction_score,monthly_spend,churn
0,100000,2025-02-24,2,21,2,39,2,8,21.51,21,0,1,True,2,0.315,0,0.138,6.9,44.62,0
1,100001,2021-04-24,3,6,1,51,0,55,19.6,16,1,4,False,2,0.224,0,0.0,5.7,54.71,0
2,100002,2019-05-29,5,26,0,29,2,78,31.98,18,0,1,True,5,0.053,0,0.112,6.3,22.4,0
3,100003,2018-01-01,1,17,0,43,2,95,12.71,19,2,1,False,3,0.231,1,0.117,7.3,63.13,0
4,100004,2021-06-16,2,29,1,51,2,53,3.0,15,1,5,True,3,0.036,0,0.114,5.8,33.28,0


Check data types: the encoded object columns are now int

In [93]:
df.dtypes

Unnamed: 0,0
customer_id,int64
signup_date,datetime64[ns]
country,int64
city,int64
gender,int64
age,int64
membership_tier,int64
tenure_months,int64
avg_session_length,float64
sessions_per_month,int64


Print the label mapping for each encoded categorical variable

In [94]:
for col, le in encoders.items():
  print('______')
  print(f"Mapping for {col}:")
  for i, label in enumerate(le.classes_):
    print(f"{i} → {label}")

#from sklearn.preprocessing import LabelEncoder
#import joblib

#encoders = {}
#for col in categorical_cols:
    #le = LabelEncoder()
    #df[col] = le.fit_transform(df[col].astype(str))
    #encoders[col] = le

# Save encoders
#joblib.dump(encoders, "label_encoders.joblib")

______
Mapping for country:
0 → brunei
1 → indonesia
2 → malaysia
3 → philippines
4 → singapore
5 → thailand
6 → vietnam
______
Mapping for city:
0 → baguio
1 → bandar seri begawan
2 → bandung
3 → bangar
4 → bangkok
5 → can tho
6 → cebu
7 → central
8 → chiang mai
9 → da nang
10 → davao
11 → denpasar
12 → hai phong
13 → hanoi
14 → hat yai
15 → ho chi minh
16 → ipoh
17 → jakarta
18 → johor bahru
19 → jurong
20 → kuala belait
21 → kuala lumpur
22 → manila
23 → medan
24 → pattaya
25 → penang
26 → phuket
27 → quezon city
28 → seria
29 → shah alam
30 → surabaya
31 → tampines
32 → tutong
33 → unknown
34 → woodlands
35 → yishun
______
Mapping for gender:
0 → female
1 → male
2 → other
______
Mapping for membership_tier:
0 → basic
1 → premium
2 → standard
3 → vip
______
Mapping for last_payment_method:
0 → bank transfer
1 → credit card
2 → debit card
3 → e-wallet
4 → paypal
5 → unknown
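
`LabelEncoder` numbers classes alphabetically, so `membership_tier` ends up as basic=0, premium=1, standard=2, vip=3, which does not follow any tier ranking. If the tiers are ordered, one option is `OrdinalEncoder` with an explicit category list. The sketch below assumes the ranking basic < standard < premium < vip, which the notebook does not state; `tier_order` and the sample values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

tiers = np.array(["standard", "basic", "vip", "premium"])

# LabelEncoder sorts the classes alphabetically
alphabetical = LabelEncoder().fit_transform(tiers)

# Assumed business ordering (hypothetical; adjust to the real tier ranking)
tier_order = ["basic", "standard", "premium", "vip"]
encoder = OrdinalEncoder(categories=[tier_order])
ordered = encoder.fit_transform(tiers.reshape(-1, 1)).ravel().astype(int)

print(list(alphabetical))  # [2, 0, 3, 1]
print(list(ordered))       # [1, 0, 3, 2]
```

For columns with no natural order, such as `country` or `city`, one-hot encoding is a common alternative for distance-based and linear models.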


In [95]:
#from sklearn.preprocessing import LabelEncoder
#import joblib

# Identify categorical columns (object type)
#categorical_cols = df.select_dtypes(include=["object"]).columns

# Initialize a dictionary to store encoders
#encoders = {}

# Apply LabelEncoder to each categorical column
#for col in categorical_cols:
    #le = LabelEncoder()
    #df[col] = le.fit_transform(df[col].astype(str)) # Convert to string to handle potential NaN or mixed types
    #encoders[col] = le # Store the fitted encoder

# Display the data types after encoding
#display(df.dtypes)

# Optionally, save the encoders for later use (e.g., in deployment)
# joblib.dump(encoders, "label_encoders.joblib")

In [96]:
df.head()

Unnamed: 0,customer_id,signup_date,country,city,gender,age,membership_tier,tenure_months,avg_session_length,sessions_per_month,support_tickets,last_payment_method,is_mobile_user,num_devices,email_click_rate,referral_count,discount_rate,satisfaction_score,monthly_spend,churn
0,100000,2025-02-24,2,21,2,39,2,8,21.51,21,0,1,True,2,0.315,0,0.138,6.9,44.62,0
1,100001,2021-04-24,3,6,1,51,0,55,19.6,16,1,4,False,2,0.224,0,0.0,5.7,54.71,0
2,100002,2019-05-29,5,26,0,29,2,78,31.98,18,0,1,True,5,0.053,0,0.112,6.3,22.4,0
3,100003,2018-01-01,1,17,0,43,2,95,12.71,19,2,1,False,3,0.231,1,0.117,7.3,63.13,0
4,100004,2021-06-16,2,29,1,51,2,53,3.0,15,1,5,True,3,0.036,0,0.114,5.8,33.28,0


In [97]:
# Convert object columns to float, coercing errors, excluding 'signup_date'
#for col in df.select_dtypes(include='object').columns:
    #if col != 'signup_date':
        #df[col] = pd.to_numeric(df[col], errors='coerce')

#df.dtypes

In [98]:
# Convert object columns to float, coercing errors
#for col in df.select_dtypes(include='object').columns:
    #df[col] = pd.to_numeric(df[col], errors='coerce')

#df.dtypes

In [99]:
X = df.drop(['churn', 'signup_date', 'customer_id'], axis=1)  # all features except the label, the signup date, and the ID
y = df['churn']

X.head()

Unnamed: 0,country,city,gender,age,membership_tier,tenure_months,avg_session_length,sessions_per_month,support_tickets,last_payment_method,is_mobile_user,num_devices,email_click_rate,referral_count,discount_rate,satisfaction_score,monthly_spend
0,2,21,2,39,2,8,21.51,21,0,1,True,2,0.315,0,0.138,6.9,44.62
1,3,6,1,51,0,55,19.6,16,1,4,False,2,0.224,0,0.0,5.7,54.71
2,5,26,0,29,2,78,31.98,18,0,1,True,5,0.053,0,0.112,6.3,22.4
3,1,17,0,43,2,95,12.71,19,2,1,False,3,0.231,1,0.117,7.3,63.13
4,2,29,1,51,2,53,3.0,15,1,5,True,3,0.036,0,0.114,5.8,33.28


In [100]:
# Standardization (zero mean, unit variance per feature)
from sklearn.preprocessing import StandardScaler, LabelEncoder

norm_x = StandardScaler()

X = norm_x.fit_transform(X)  # returns a NumPy array, not a DataFrame

encoder_y = LabelEncoder()

y = encoder_y.fit_transform(y)

type(X)

numpy.ndarray

In [101]:
import joblib

joblib.dump(norm_x, "scaler.joblib")
joblib.dump(encoder_y, "label_encoder.joblib")

['label_encoder.joblib']

In [102]:
type(y)

numpy.ndarray

## **Choosing model**

In [103]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
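
In the cells above the scaler is fitted on all rows before the split, so test-set statistics leak into the training features. With so few churners, an unstratified split can also leave the test set with almost no positive cases. A sketch of the alternative ordering on synthetic data (the array shapes, the 1% positive rate, and the `*_demo` names are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))   # synthetic features
y_demo = np.zeros(1000, dtype=int)
y_demo[:10] = 1                       # 1% positives, mimicking a rare churn class

# Split first; stratify keeps the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training part only, then apply it to both parts
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print("Positives in train / test:", y_tr.sum(), "/", y_te.sum())
```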

Applying the following algorithms:

1.   K-NN
2.   Decision Tree
3.   SVM
4.   Random Forest
5.   Logistic Regression
6.   XGBoost



Import algorithms

In [135]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, jaccard_score, log_loss

### **1. KNN**

In [136]:
model_KNN = KNeighborsClassifier()

#training
model_KNN.fit(X_train, y_train)

#prediction
y_pred_KNN = model_KNN.predict(X_test)

#evaluation
print('Accuracy score: ', accuracy_score(y_test, y_pred_KNN)*100)
#print('Confusion matrix: ', confusion_matrix(y_test, y_pred_KNN))
print('Classification report: ', classification_report(y_test, y_pred_KNN))

Accuracy score:  99.53917050691244
Classification report:                precision    recall  f1-score   support

           0       1.00      1.00      1.00      1296
           1       0.00      0.00      0.00         6

    accuracy                           1.00      1302
   macro avg       0.50      0.50      0.50      1302
weighted avg       0.99      1.00      0.99      1302
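
The report above shows why accuracy alone is misleading here: the test set has 1296 non-churners and only 6 churners, so a model that always predicts 0 already scores 1296/1302 ≈ 99.54%. The sketch below rebuilds that baseline from the support counts in the report and adds two metrics that expose the problem. The all-zero prediction vector is an illustration, not the stored KNN output.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Class supports taken from the classification report above
y_true = np.array([0] * 1296 + [1] * 6)
y_majority = np.zeros_like(y_true)  # always predict "no churn"

acc = accuracy_score(y_true, y_majority)
bal_acc = balanced_accuracy_score(y_true, y_majority)
churn_recall = recall_score(y_true, y_majority)

print(f"Accuracy:          {acc * 100:.2f}%")   # 99.54%
print(f"Balanced accuracy: {bal_acc:.2f}")      # 0.50
print(f"Churn recall:      {churn_recall:.2f}") # 0.00
```

A model is only useful for churn prediction if it beats this baseline on recall or balanced accuracy, not just on raw accuracy.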





### **2. Decision Tree**

In [106]:
model_DT = DecisionTreeClassifier()

#training
model_DT.fit(X_train, y_train)

#prediction
y_pred_DT = model_DT.predict(X_test)

#evaluation
print('Accuracy score: ', accuracy_score(y_test, y_pred_DT)*100)
#print('Confusion matrix: ', confusion_matrix(y_test, y_pred_DT))
#print('Classification report: ', classification_report(y_test, y_pred_DT))

Accuracy score:  99.15514592933948


### **3. SVM**

In [107]:
model_SVM = SVC()

#training
model_SVM.fit(X_train, y_train)

#prediction
y_pred_SVM = model_SVM.predict(X_test)

#evaluation
print('Accuracy score: ', accuracy_score(y_test, y_pred_SVM)*100)
#print('Confusion matrix: ', confusion_matrix(y_test, y_pred_SVM))
#print('Classification report: ', classification_report(y_test, y_pred_SVM))

Accuracy score:  99.53917050691244


In [108]:
model_SVM = SVC(kernel = 'linear')

#training
model_SVM.fit(X_train, y_train)

#prediction
y_pred_SVM = model_SVM.predict(X_test)

#evaluation
print('Accuracy score: ', accuracy_score(y_test, y_pred_SVM)*100)
#print('Confusion matrix: ', confusion_matrix(y_test, y_pred_SVM))
#print('Classification report: ', classification_report(y_test, y_pred_SVM))

Accuracy score:  99.53917050691244


In [109]:
model_SVM = SVC(kernel = 'poly')

#training
model_SVM.fit(X_train, y_train)

#prediction
y_pred_SVM = model_SVM.predict(X_test)

#evaluation
print('Accuracy score: ', accuracy_score(y_test, y_pred_SVM)*100)
#print('Confusion matrix: ', confusion_matrix(y_test, y_pred_SVM))
#print('Classification report: ', classification_report(y_test, y_pred_SVM))

Accuracy score:  99.46236559139786


In [110]:
model_SVM = SVC(kernel = 'sigmoid')

#training
model_SVM.fit(X_train, y_train)

#prediction
y_pred_SVM = model_SVM.predict(X_test)

#evaluation
print('Accuracy score: ', accuracy_score(y_test, y_pred_SVM)*100)
#print('Confusion matrix: ', confusion_matrix(y_test, y_pred_SVM))
#print('Classification report: ', classification_report(y_test, y_pred_SVM))

Accuracy score:  99.38556067588326
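
The four SVM cells reuse the name `model_SVM`, so only the last kernel (sigmoid) survives for later cells. A loop that stores each fitted model under its kernel name keeps all of them available. This is a sketch on synthetic stand-in data; `svm_models` and `svm_scores` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (not the churn dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

svm_models = {}
svm_scores = {}
for kernel in ["rbf", "linear", "poly", "sigmoid"]:
    clf = SVC(kernel=kernel)
    clf.fit(X_tr, y_tr)
    svm_models[kernel] = clf  # keep every fitted model, not just the last one
    svm_scores[kernel] = accuracy_score(y_te, clf.predict(X_te))

for kernel, score in svm_scores.items():
    print(f"{kernel}: {score * 100:.2f}%")
```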


### **4. Random Forest**

In [111]:
model_RF = RandomForestClassifier()

#training
model_RF.fit(X_train, y_train)

#prediction
y_pred_RF = model_RF.predict(X_test)

#evaluation
print('Accuracy score: ', accuracy_score(y_test, y_pred_RF)*100)
#print('Confusion matrix: ', confusion_matrix(y_test, y_pred_RF))
#print('Classification report: ', classification_report(y_test, y_pred_RF))

Accuracy score:  99.53917050691244


### **5. Logistic Regression**

In [112]:
model_LR = LogisticRegression()

#training
model_LR.fit(X_train, y_train)

#prediction
y_pred_LR = model_LR.predict(X_test)

#evaluation
print('Accuracy score: ', accuracy_score(y_test, y_pred_LR)*100)
#print('Confusion matrix: ', confusion_matrix(y_test, y_pred_LR))
#print('Classification report: ', classification_report(y_test, y_pred_LR))

Accuracy score:  99.53917050691244


### **6. XGBoost**

In [113]:
model_XGoost = XGBClassifier()

#training
model_XGoost.fit(X_train, y_train)

#prediction
y_pred_XGoost = model_XGoost.predict(X_test)

#evaluation
print('Accuracy score: ', accuracy_score(y_test, y_pred_XGoost)*100)
#print('Confusion matrix: ', confusion_matrix(y_test, y_pred_XGoost))
#print('Classification report: ', classification_report(y_test, y_pred_XGoost))

Accuracy score:  99.53917050691244


**Summary of all models**

In [114]:
from sklearn.metrics import accuracy_score, jaccard_score, log_loss, confusion_matrix, classification_report

classification_models = {
    'KNN': model_KNN,
    'Decision Tree': model_DT,
    'SVM': model_SVM,
    'Random Forest': model_RF,
    'Logistic Regression': model_LR,
    'XGBoost': model_XGoost
}

classification_results = {}

for name, model in classification_models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    jaccard = jaccard_score(y_test, y_pred)
    # For models that can provide probabilities, calculate log loss
    try:
        y_pred_proba = model.predict_proba(X_test)[:, 1]
        logloss = log_loss(y_test, y_pred_proba)
    except AttributeError:
        logloss = "N/A" # Not all models have predict_proba

    classification_results[name] = {'Accuracy': accuracy, 'Jaccard Score': jaccard, 'Log Loss': logloss}
    print(f'{name}:')
    print(f'Accuracy: {accuracy * 100:.2f}%')
    print(f'Jaccard Score: {jaccard:.2f}')
    print(f'Log Loss: {logloss}')
    print('\n')

# Find the best model based on Accuracy
best_model_name = max(classification_results, key=lambda k: classification_results[k]['Accuracy'])
print(f'The best classification model based on Accuracy is: {best_model_name} with Accuracy of {classification_results[best_model_name]["Accuracy"]*100:.2f}%')

KNN:
Accuracy: 99.54%
Jaccard Score: 0.00
Log Loss: 0.17082632355027805


Decision Tree:
Accuracy: 99.08%
Jaccard Score: 0.00
Log Loss: 0.3321995704066099


SVM:
Accuracy: 99.39%
Jaccard Score: 0.00
Log Loss: N/A


Random Forest:
Accuracy: 99.54%
Jaccard Score: 0.00
Log Loss: 0.14680721206149133


Logistic Regression:
Accuracy: 99.54%
Jaccard Score: 0.00
Log Loss: 0.030978853701765573


XGBoost:
Accuracy: 99.54%
Jaccard Score: 0.00
Log Loss: 0.04052473857888496


The best classification model based on Accuracy is: KNN with Accuracy of 99.54%
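
Every model above has a Jaccard score of 0.00, which means none of them correctly identifies a single churner, and `max` simply returns the first of the tied accuracy values. For a rare positive class it may be more informative to rank models on recall or average precision and to try class weighting. The sketch below does this on synthetic imbalanced data; the data, the two candidate models, and the choice of average precision as the ranking metric are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives); not the churn dataset
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

candidates = {
    "LR (default)": LogisticRegression(max_iter=1000),
    "LR (balanced)": LogisticRegression(max_iter=1000, class_weight="balanced"),
}

results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    results[name] = {
        "recall": recall_score(y_te, model.predict(X_te)),
        "avg_precision": average_precision_score(y_te, proba),
    }
    print(name, results[name])

# Rank on a metric that reflects performance on the positive class
best = max(results, key=lambda k: results[k]["avg_precision"])
print("Best by average precision:", best)
```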


## **Cross-Validation**

In [115]:
from sklearn.model_selection import cross_val_score

# Apply cross-validation to the KNN model
model_KK = KNeighborsClassifier()
cv_scores = cross_val_score(model_KK, X, y, cv=5) # Using 5 folds

print("Cross-validation scores:", cv_scores)
print("Mean cross-validation accuracy:", cv_scores.mean() * 100)

Cross-validation scores: [0.99462366 0.99462366 0.99538816 0.99538816 0.99538816]
Mean cross-validation accuracy: 99.5082360136537
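
The cross-validated accuracy is again close to the share of non-churners, so it says little about churn detection. A variant that may be more informative uses stratified folds, refits the scaler inside each fold through a pipeline, and scores on balanced accuracy. This is a sketch on synthetic data, and the names `knn_pipeline`, `cv`, and `scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (about 5% positives); not the churn dataset
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.95, 0.05], random_state=0
)

# Scaling inside the pipeline is refitted on each training fold
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(
    knn_pipeline, X_demo, y_demo, cv=cv, scoring="balanced_accuracy"
)
print("Balanced accuracy per fold:", scores)
print("Mean balanced accuracy:", scores.mean())
```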


## **Deployment**

In [138]:
# Train the model and save the deployment artifacts
from xgboost import XGBClassifier
import joblib

# X_train was already standardized by norm_x above, so it is not scaled a second time here
model_XGoost = XGBClassifier()  # n_neighbors is a KNN parameter and is not used by XGBoost
model_XGoost.fit(X_train, y_train)

# Save the model together with the scaler and target encoder fitted earlier
joblib.dump(model_XGoost, "model_XGoost.joblib")
joblib.dump(norm_x, "scaler.joblib")
joblib.dump(encoder_y, "label_encoder.joblib")  # if used for y

['label_encoder.joblib']
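
Saving the scaler and the model as separate files makes it easy for them to drift apart. One alternative is to bundle both in a scikit-learn `Pipeline` and save a single file. The sketch below uses synthetic data and `LogisticRegression` as a stand-in classifier; the file name `churn_bundle.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the churn dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# LogisticRegression is a stand-in; any scikit-learn compatible classifier could take its place
bundle = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
bundle.fit(X_demo, y_demo)  # the pipeline scales the raw features itself

# One file now holds both the preprocessing step and the model
path = os.path.join(tempfile.mkdtemp(), "churn_bundle.joblib")
joblib.dump(bundle, path)

# Reload and confirm that the restored pipeline behaves identically
restored = joblib.load(path)
same = np.array_equal(bundle.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same)
```

With this layout the serving code passes unscaled features to `restored.predict_proba` and never handles the scaler directly.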

In [137]:
# churn_gradio_app.py
import gradio as gr
import joblib
import pandas as pd
from xgboost import XGBClassifier

# Assumes X, y, df, encoders, and norm_x are already defined by the preprocessing cells above.
# A standalone script would need to repeat the data loading and preprocessing steps here.

# Train the XGBoost model
model_XGoost = XGBClassifier()
model_XGoost.fit(X, y) # Train on the entire dataset for deployment

# Save the trained model
joblib.dump(model_XGoost, "churn_pipeline.joblib")

# Load the saved model (a bare classifier, not a full preprocessing pipeline)
pipeline = joblib.load("churn_pipeline.joblib")

# Feature names in training order (excluding the target, signup_date, and customer_id, which were dropped from X)
feature_cols = [col for col in df.columns if col not in ['churn', 'signup_date', 'customer_id']]

# Define slider bounds from your data (assuming df is available)
age_min, age_max = float(df['age'].min()), float(df['age'].max())
tenure_min, tenure_max = float(df['tenure_months'].min()), float(df['tenure_months'].max())
avg_session_length_min, avg_session_length_max = float(df['avg_session_length'].min()), float(df['avg_session_length'].max())
session_per_month_min, session_per_month_max = float(df['sessions_per_month'].min()), float(df['sessions_per_month'].max())
support_tickets_min, support_tickets_max = float(df['support_tickets'].min()), float(df['support_tickets'].max())
num_devices_min, num_devices_max = float(df['num_devices'].min()), float(df['num_devices'].max())
email_click_rate_min, email_click_rate_max = float(df['email_click_rate'].min()), float(df['email_click_rate'].max())
referral_count_min, referral_count_max = float(df['referral_count'].min()), float(df['referral_count'].max())
discount_rate_min, discount_rate_max = float(df['discount_rate'].min()), float(df['discount_rate'].max())
satisfaction_score_min, satisfaction_score_max = float(df['satisfaction_score'].min()), float(df['satisfaction_score'].max())
monthly_spend_min, monthly_spend_max = float(df['monthly_spend'].min()), float(df['monthly_spend'].max())


def predict_churn(
    country, city, gender, age, membership_tier, tenure_months,
    avg_session_length, sessions_per_month, support_tickets, last_payment_method,
    is_mobile_user, num_devices, email_click_rate, referral_count, discount_rate,
    satisfaction_score, monthly_spend
):
    # Create a dictionary with the input data
    data = {
        'country': [country],
        'city': [city],
        'gender': [gender],
        'age': [age],
        'membership_tier': [membership_tier],
        'tenure_months': [tenure_months],
        'avg_session_length': [avg_session_length],
        'sessions_per_month': [sessions_per_month],
        'support_tickets': [support_tickets],
        'last_payment_method': [last_payment_method],
        'is_mobile_user': [is_mobile_user],
        'num_devices': [num_devices],
        'email_click_rate': [email_click_rate],
        'referral_count': [referral_count],
        'discount_rate': [discount_rate],
        'satisfaction_score': [satisfaction_score],
        'monthly_spend': [monthly_spend]
    }

    # Create a DataFrame with the input data, ensuring the order of columns matches feature_cols
    X_new = pd.DataFrame(data)

    # Apply label encoding to categorical features using the saved encoders
    categorical_cols_to_encode = ['country', 'city', 'gender', 'membership_tier', 'last_payment_method']
    for col in categorical_cols_to_encode:
        if col in X_new.columns and col in encoders:
            # Ensure the input value is in the encoder's known classes, handle unknown values if necessary
            # For simplicity, we assume inputs will match known classes from training data
            X_new[col] = encoders[col].transform(X_new[col])

    # Apply the same scaling as used during training
    # Assuming norm_x scaler is available from previous steps
    # Convert boolean 'is_mobile_user' to int for scaling if necessary, depending on how it was handled during training
    if 'is_mobile_user' in X_new.columns:
      X_new['is_mobile_user'] = X_new['is_mobile_user'].astype(int)

    # Select only the features used for training before scaling
    X_new_scaled = norm_x.transform(X_new[feature_cols])


    proba = float(pipeline.predict_proba(X_new_scaled)[0, 1])  # churn probability for the single input row
    label = "Yes" if proba >= 0.5 else "No"
    return label, proba

# Inputs follow the parameter order of predict_churn; dropdowns show the original labels.
# customer_id is not a model feature, so it has no input widget.
inputs = [
    gr.Dropdown(choices=list(encoders['country'].classes_), label="country"),
    gr.Dropdown(choices=list(encoders['city'].classes_), label="city"),
    gr.Dropdown(choices=list(encoders['gender'].classes_), label="gender"),
    gr.Slider(age_min, age_max, step=1, label="age"),
    gr.Dropdown(choices=list(encoders['membership_tier'].classes_), label="membership_tier"),
    gr.Slider(tenure_min, tenure_max, step=1, label="tenure_months"),
    gr.Slider(avg_session_length_min, avg_session_length_max, step=0.01, label="avg_session_length"),
    gr.Slider(session_per_month_min, session_per_month_max, step=1, label="sessions_per_month"),
    gr.Slider(support_tickets_min, support_tickets_max, step=1, label="support_tickets"),
    gr.Dropdown(choices=list(encoders['last_payment_method'].classes_), label="last_payment_method"),
    gr.Checkbox(label="is_mobile_user"),
    gr.Slider(num_devices_min, num_devices_max, step=1, label="num_devices"),
    gr.Slider(email_click_rate_min, email_click_rate_max, step=0.001, label="email_click_rate"),
    gr.Slider(referral_count_min, referral_count_max, step=1, label="referral_count"),
    gr.Slider(discount_rate_min, discount_rate_max, step=0.001, label="discount_rate"),
    gr.Slider(satisfaction_score_min, satisfaction_score_max, step=0.01, label="satisfaction_score"),
    gr.Slider(monthly_spend_min, monthly_spend_max, step=0.01, label="monthly_spend"),
]


outputs = [
    gr.Textbox(label="Churn Prediction (Yes/No)"),
    gr.Number(label="Probability of Churn")
]

demo = gr.Interface(
    fn=predict_churn,
    inputs=inputs,
    outputs=outputs,
    title="Customer Churn Predictor",
    description="Enter customer details to predict churn. Model: XGBoost."
)

if __name__ == "__main__":
    demo.launch()

KeyError: 'gender'
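
`predict_churn` assumes that every dropdown value is a class the matching encoder saw during training. A small helper can make that assumption explicit by falling back to the `unknown` class where one exists and raising a clear error otherwise. The helper name `encode_value` and the demo encoder below are hypothetical; the class list is copied from the `last_payment_method` mapping printed earlier.

```python
from sklearn.preprocessing import LabelEncoder


def encode_value(encoder, value, fallback="unknown"):
    """Encode one label; use `fallback` if the label was never seen in training."""
    if value in encoder.classes_:
        return int(encoder.transform([value])[0])
    if fallback in encoder.classes_:
        return int(encoder.transform([fallback])[0])
    raise ValueError(f"Unseen label {value!r} and no {fallback!r} class to fall back on")


# Demo encoder fitted on the payment methods listed in the mapping above
payment_encoder = LabelEncoder().fit(
    ["bank transfer", "credit card", "debit card", "e-wallet", "paypal", "unknown"]
)

known = encode_value(payment_encoder, "paypal")
unseen = encode_value(payment_encoder, "crypto")  # not a training class

print(known)   # 4
print(unseen)  # 5, the code of "unknown"
```

Columns without an `unknown` class, such as `gender`, would raise the `ValueError`, which is easier to diagnose than a failure inside `transform`.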