## **WiDSxSAP Case Competition 2024**

In [None]:
# Dependencies
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

### Exploratory Data Analysis

In [None]:
# Load datasets
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [None]:
# View the first few rows of the dataset
print(train.head())

In [None]:
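# Check the row count, column types, and non-null counts
# (info() prints its report directly and returns None, so no print() is needed)
train.info()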

We will run summary statistics and clean the data provided.

In [None]:
# Summary statistics for numerical features
print(train.describe())

In [None]:
# Count distinct values and mode for categorical features
categorical_features = ['AI_Response_Time', 'Customer_Churn'] 
for feature in categorical_features:
    print(f"\nValue counts for {feature}:")
    print(train[feature].value_counts())

In [None]:
# Identify any missing values in the dataset
print(train.isnull().sum())

The dataset is tidy, **equally distributed** between categories, and has **no null values**.

#### **Univariate Analysis**

In [None]:
# Visualize the distribution of numerical features
numerical_features = ['Age', 'Overall_Usage_Frequency'] 
for feature in numerical_features:
    plt.figure(figsize=(10, 5))
    sns.histplot(train[feature], kde=True, bins=30)
    plt.title(f'Distribution of {feature}')
    plt.show()
    
# Separate histogram for Customer_Service_Interactions
plt.figure(figsize=(5, 5))
sns.histplot(data=train, x='Customer_Service_Interactions', binwidth=1)
plt.title('Distribution of Customer Service Interactions')
plt.xlabel('Customer Service Interactions')
plt.ylabel('Count')
plt.show()

In [None]:
# Target variable
plt.figure(figsize=(4,4))
sns.countplot(x='Customer_Churn', data=train)
plt.title('Distribution of Customer Churn')
plt.show()

The features, both categorical and numerical, are distributed fairly evenly. However, the target variable contains more samples with positive customer churn, so we will need to upsample the minority (non-churn) class during training.
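A minimal sketch of how the upsampling could be done, assuming random oversampling with replacement via `sklearn.utils.resample`. The helper name `upsample_minority` and the toy values are hypothetical and not part of the competition data; only the training split should be resampled.

```python
import pandas as pd
from sklearn.utils import resample


def upsample_minority(df, target, random_state=42):
    """Return a copy of df in which every class has as many rows as the largest class."""
    counts = df[target].value_counts()
    majority_size = int(counts.max())
    parts = []
    for label, size in counts.items():
        group = df[df[target] == label]
        if size < majority_size:
            # Sample with replacement until this class matches the largest class
            group = resample(group, replace=True, n_samples=majority_size,
                             random_state=random_state)
        parts.append(group)
    # Shuffle so the classes are not grouped together
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)


# Toy data (hypothetical values): 6 churned customers, 2 retained
toy = pd.DataFrame({
    'Age': [25, 32, 47, 51, 38, 29, 60, 44],
    'Customer_Churn': [1, 1, 1, 1, 1, 1, 0, 0],
})
balanced = upsample_minority(toy, 'Customer_Churn')
print(balanced['Customer_Churn'].value_counts())
```

Random oversampling only duplicates existing rows, so it balances the classes without adding new information.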

#### **Correlation Analysis**

In [None]:
# Plotting correlation matrix (numeric columns only; non-numeric columns raise an error in pandas >= 2.0)
correlation_matrix = train.corr(numeric_only=True)

plt.figure(figsize=(12, 8))

# Drawing heatmap
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm',
            xticklabels=correlation_matrix.columns,
            yticklabels=correlation_matrix.columns)
plt.title("Correlation between variables of training set")

plt.tight_layout()
plt.show()

* AI interaction levels have a weak negative correlation (about -0.2) with customer churn.
* Similarly, weak negative relationships with churn can be observed for Satisfaction with AI Services and AI Personalization Effectiveness.
* Conversely, Age has a weak but noticeable positive correlation with customer churn.
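The observations above can also be read off numerically rather than from the heatmap. The sketch below ranks features by their Pearson correlation with the target; the helper name `correlations_with_target` and the toy values are hypothetical.

```python
import pandas as pd


def correlations_with_target(df, target):
    """Pearson correlation of each numeric column with the target, sorted ascending."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr.sort_values()


# Toy data (hypothetical values) standing in for the training set
toy = pd.DataFrame({
    'Age': [22, 35, 41, 58, 63, 30],
    'AI_Interaction_Level': [5, 4, 4, 2, 1, 3],
    'Customer_Churn': [0, 0, 1, 1, 1, 0],
})
ranked = correlations_with_target(toy, 'Customer_Churn')
print(ranked)
```

On the real data this would be called as `correlations_with_target(train, 'Customer_Churn')`, assuming the target column is numeric.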

#### **Analysing Bivariate Feature Relationships**

We use box plots to visualize how our **continuous** variables relate to customer churn.

In [None]:
# List of continuous variables
continuous_vars = ['Age', 'Overall_Usage_Frequency']

fig, axes = plt.subplots(1, len(continuous_vars), figsize=(9, 5)) 

# Iterate through the variables and create a box plot for each
for ax, column in zip(axes, continuous_vars):
    sns.boxplot(ax=ax, x='Customer_Churn', y=column, data=train)
    ax.set_title(f'{column} vs Customer Churn')
    ax.set_xlabel('')  
    ax.set_ylabel('')  

plt.tight_layout()

# Show the plot
plt.show()

We use stacked bar charts to visualize our **categorical** features.

In [None]:
# Work on a copy so the target column in `train` keeps its original dtype
plot_df = train.copy()

# Convert 'Customer_Churn' to a categorical type for visualization
plot_df['Customer_Churn'] = plot_df['Customer_Churn'].astype('category')

# List of categorical variables
categorical_vars = ['AI_Interaction_Level', 'Satisfaction_with_AI_Services', 'AI_Response_Time', 'Change_in_Usage_Patterns', 'AI_Personalization_Effectiveness']

# Iterate through the list and create a stacked bar chart
for var in categorical_vars:
    # Row-normalized shares, scaled to 0-100 so the '%' labels are correct
    cross_tab = pd.crosstab(index=plot_df[var], columns=plot_df['Customer_Churn'], normalize='index') * 100
    ax = cross_tab.plot(kind='bar', stacked=True, figsize=(10, 6))
    plt.title(f'{var} vs Customer Churn')
    plt.ylabel('Percentage')
    for c in ax.containers:
        ax.bar_label(c, fmt='%.2f%%', label_type='center')
    plt.show()

#### **Feature Engineering**

Based on the above insights, we will create a new feature for predicting churn: 

## **Data Preprocessing**

In [None]:
# Load dependencies
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler, MinMaxScaler

We will apply One-Hot Encoding to categorical variables to process them into a form that is better suited for prediction through machine learning algorithms. We create a binary column for each category in the original data variable and assign binary values accordingly (0, 1).

In [None]:
# Initializing OneHotEncoder
# (`sparse_output` replaces the deprecated `sparse` argument in scikit-learn >= 1.2)
onehot_encoder = OneHotEncoder(sparse_output=False)

# Work on a copy so the original training data stays unchanged
train_p = train.copy()

# List of categorical variables (Unsure of which is categorical yet?)
categorical_vars = ['AI_Interaction_Level', 'AI_Response_Time']

# Apply One-Hot Encoding
for var in categorical_vars:
    encoded_data = onehot_encoder.fit_transform(train[[var]])
    encoded_df = pd.DataFrame(encoded_data, 
                              columns=[f"{var}_{int(i)}" for i in range(encoded_data.shape[1])])
    
    # Concatenate the new DataFrame to the original one
    train_p = pd.concat([train_p.reset_index(drop=True), encoded_df.reset_index(drop=True)], axis=1)
    
    # Drop the original column
    train_p.drop(var, axis=1, inplace=True)

# Other variables are treated as numeric
numeric_vars = ['Satisfaction_with_AI_Services', 'AI_Personalization_Effectiveness', 'Overall_Usage_Frequency', 'Customer_Service_Interactions', 'Change_in_Usage_Patterns']
# to_numeric is a top-level pandas function, not a DataFrame method
train_p[numeric_vars] = train_p[numeric_vars].apply(pd.to_numeric)

print(train_p.head())

#### **Scaling Data**

Since we are looking to try different algorithms for prediction, we will perform standardization and normalization separately on the data to apply to different models accordingly. SVM and Logistic Regression models tend to benefit from standardization, while RNN models benefit more from normalization.
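For reference, the two transformations can be written as follows, assuming that "standardization" means z-scoring as done by `StandardScaler` and "normalization" means min-max scaling to $[0, 1]$ as done by `MinMaxScaler` with its default range. Here $\mu$ and $\sigma$ are the mean and (population) standard deviation of a feature in the training set, and $x_{\min}$, $x_{\max}$ are its training-set minimum and maximum:

$$
z = \frac{x - \mu}{\sigma}, \qquad x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
$$

Standardized features have zero mean and unit variance on the training set, while normalized features are bounded between 0 and 1.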

#### **Standardizing the Data**

Result dataframe: train_p_standard

In [None]:
# Initializing the scaler
scaler = StandardScaler()

# Work on a copy so the unscaled train_p stays available for normalization
train_p_standard = train_p.copy()

# Apply standardization to all three columns in one fit, so the fitted
# scaler can later be reused to transform the test set
cols_to_scale = ['Age', 'Overall_Usage_Frequency', 'Customer_Service_Interactions']
train_p_standard[cols_to_scale] = scaler.fit_transform(train_p_standard[cols_to_scale])

In [None]:
print(train_p_standard.head())

#### **Data Normalization**

Result dataframe: train_p_normal
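A minimal sketch of how this step might look, assuming min-max scaling with `MinMaxScaler`. The toy frames and the names `toy_train`, `toy_test`, and `test_p_normal` are hypothetical stand-ins; in the notebook the scaler would be fitted on `train_p` and reused on the preprocessed test set.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for the preprocessed training and test sets (hypothetical values)
toy_train = pd.DataFrame({'Age': [20.0, 30.0, 40.0, 60.0],
                          'Overall_Usage_Frequency': [1.0, 5.0, 9.0, 3.0]})
toy_test = pd.DataFrame({'Age': [25.0, 50.0],
                         'Overall_Usage_Frequency': [2.0, 7.0]})

cols_to_scale = ['Age', 'Overall_Usage_Frequency']

# Work on copies so the unscaled frames stay available
train_p_normal = toy_train.copy()
test_p_normal = toy_test.copy()

# Fit on the training data only, then apply the same min/max to the test data
minmax_scaler = MinMaxScaler()
train_p_normal[cols_to_scale] = minmax_scaler.fit_transform(toy_train[cols_to_scale])
test_p_normal[cols_to_scale] = minmax_scaler.transform(toy_test[cols_to_scale])

print(train_p_normal)
print(test_p_normal)
```

Fitting the scaler on the training set alone keeps information from the test set out of the preprocessing step.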

We do not need to conduct train-test splitting; the training and testing data have already been split.

## **Model Selection**

#### **Baseline Model**

We will use **Logistic Regression** as a commonly-implemented model for binary classification to establish baseline performance.
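A minimal sketch of such a baseline, assuming scikit-learn's `LogisticRegression` scored with 5-fold cross-validation. The synthetic data from `make_classification` is a hypothetical stand-in for the preprocessed features and churn labels, so the printed score says nothing about the competition data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed features and churn labels (hypothetical data)
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=42)

# A higher max_iter gives the solver room to converge
baseline = LogisticRegression(max_iter=1000)

# 5-fold cross-validated accuracy on the training data
scores = cross_val_score(baseline, X, y, cv=5, scoring='accuracy')
print(f"Mean CV accuracy: {scores.mean():.3f}")
```

Cross-validating on the training set gives a baseline estimate without touching the separately provided test set.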

#### **Model Evaluation**