# Customer Churn Prediction: Neural Network Implementation
This notebook implements a custom neural network from scratch using NumPy to predict customer churn. It includes a complete preprocessing pipeline featuring log transformations, one-hot encoding, and SMOTE for handling class imbalance.

## Steps:
1. Import Libraries
2. Read Data from the `\Dataset` folder (retrieved from the UCI Machine Learning Repository)
3. Clean Dataset (rename and format the columns)
4. Data Preprocessing
5. Initialize the model
6. Train and test the model
7. Evaluate Performance using graphs

### 0. Install Library Requirements

In [None]:
!pip install numpy pandas matplotlib seaborn tabulate imbalanced-learn

### 1. Import Libraries
Import necessary libraries:
* `numpy`     : High-performance array object
* `pandas`    : Data analysis and manipulation tools
* `matplotlib`: Data visualization
* `seaborn`   : Data visualization framework based on **matplotlib**
* `tabulate`  : Pretty-printing of tabular data
* `imblearn`  : SMOTE oversampling for class balancing (installed as `imbalanced-learn`)

#### 1.1 Initialize variables
* `EPOCHS`      : Maximum number of training epochs
* `BATCH_SIZE`  : Hyperparameter setting the number of samples in each mini-batch
* `PATIENCE`    : Early-stopping hyperparameter: the number of epochs to wait for an improvement before stopping

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tabulate import tabulate
from imblearn.over_sampling import SMOTE
import seaborn as sns

EPOCHS = 250
BATCH_SIZE = 32
PATIENCE = 100
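
The way `EPOCHS` and `PATIENCE` usually interact can be sketched as a patience-based early-stopping loop. This is only an illustration under assumptions: the training loop is not shown in this section, and `run_with_patience` and the hard-coded loss list are hypothetical stand-ins for it.

```python
def run_with_patience(val_losses, epochs=250, patience=100):
    """Hypothetical helper: walk through per-epoch losses and stop early.

    Stops when the loss has not improved for `patience` consecutive epochs,
    or when `epochs` epochs have been processed, whichever comes first.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    last_epoch = 0
    for epoch, loss in enumerate(val_losses[:epochs], start=1):
        last_epoch = epoch
        if loss < best_loss:
            # Improvement: remember it and reset the waiting counter
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return last_epoch, best_loss


# Made-up losses: the best value is reached at epoch 3
losses = [0.9, 0.7, 0.6, 0.65, 0.66, 0.64, 0.5]
stopped_epoch, best = run_with_patience(losses, epochs=250, patience=3)
```

With `patience=3` the loop stops at epoch 6 and never sees the later improvement at epoch 7, which is the trade-off a small patience value makes.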

#### 1.2 Helper Functions
Mathematical operations used in the `NeuralNetwork` class.

In [3]:
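def sigmoid(x): return 1 / (1 + np.exp(-x))
# Expects the sigmoid output (already activated), not the raw input x
def derivative_sigmoid(sigmoid_x): return sigmoid_x * (1 - sigmoid_x)
def relu(x): return np.maximum(0, x)
# Expects the ReLU output; the gradient is 1 where the output is positive, else 0
def derivative_relu(relu_x): return (relu_x > 0).astype(float)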
def binary_cross_entropy(y_true, y_pred, epsilon=1e-15):
    # Clip predictions to prevent log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
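
One detail of these helpers is easy to miss: `derivative_sigmoid` takes the sigmoid output as its argument, not the raw input. The sketch below checks that convention against a central finite difference. The two functions are copied from the cell above so the block runs on its own; the grid `x` and step `h` are arbitrary choices made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def derivative_sigmoid(sigmoid_x):
    return sigmoid_x * (1 - sigmoid_x)

x = np.linspace(-4.0, 4.0, 9)
h = 1e-5

# Numerical slope of sigmoid at each x (central difference)
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

# Analytic slope: pass the activated value, not x itself
analytic = derivative_sigmoid(sigmoid(x))

max_abs_error = np.max(np.abs(numeric - analytic))
```

Calling `derivative_sigmoid(x)` with the raw input would not match the numerical slope, so the backward pass has to reuse the activations stored during the forward pass.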

### 2. Read and Display Dataset Information
Load the dataset, inspect its structure, and use the Interquartile Range (IQR) method to identify potential outliers in the numerical columns.

2.1 Read dataset

In [7]:
def read_file(file_path: str):
    df = pd.read_csv(file_path)
    new_columns = [col.strip().replace('  ', ' ').replace(' ', '_').lower() for col in df.columns]
    df.columns = new_columns
    return df

path = r'..\Dataset\Customer Churn.csv'
df = read_file(path)
df.head(20)

2.2 Display dataset data types and null values

In [None]:
df.info()

2.3 Display the dataset's statistical summary

In [None]:
df.describe()

2.4 Detect Outliers

In [None]:
def detect_outliers_iqr(df, k=1.5):
    nums = df.select_dtypes(include='number')
    outlier_info = {}
    
    for c in nums.columns:
        # Skip binary columns
        if nums[c].nunique() <= 2:
            continue
        
        q1 = nums[c].quantile(0.25) #1st quartile
        q3 = nums[c].quantile(0.75) #3rd quartile
        iqr = q3 - q1
        lower = q1 - k * iqr
        upper = q3 + k * iqr
        mask = (nums[c] < lower) | (nums[c] > upper)
        outlier_info[c] = {
            'count': int(mask.sum()),
            'indices': nums.index[mask].tolist(),
            'lower': float(lower),
            'upper': float(upper)
        }
    return outlier_info

outliers = detect_outliers_iqr(df)
table_data = []
print('\nOutlier summary (IQR method):')
for col, info in outliers.items():
    if info['count'] > 0:
        # Percentage of rows flagged as outliers in this column
        perc = (info["count"] / len(df)) * 100

        # Add one row per column to table_data
        table_data.append([col, info["count"], f"{info['lower']:.3f}",
            f"{info['upper']:.3f}", f"{perc:.2f}%"])
headers = ["Column", "Outlier Count", "Lower Bound", "Upper Bound", "Percentage"]
print(tabulate(table_data, headers=headers))


Outlier summary (IQR method):
Column                     Outlier Count    Lower Bound    Upper Bound  Percentage
-----------------------  ---------------  -------------  -------------  ------------
call_failure                          47         -15.5           28.5   1.49%
subscription_length                  282          18             50     8.95%
charge_amount                        370          -1.5            2.5   11.75%
seconds_of_use                       200       -6239.25       14108.8   6.35%
frequency_of_use                     129         -75            197     4.10%
frequency_of_sms                     368        -115.5          208.5   11.68%
distinct_called_numbers               77         -26             70     2.44%
age_group                            170           0.5            4.5   5.40%
age                                  688          17.5           37.5   21.84%
customer_value                       116        -898.08        1800.27  3.68%
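
To make the bounds in the table easier to read, here is the same IQR rule worked through on a tiny made-up series (the values are illustrative and not taken from the dataset).

```python
import pandas as pd

# Eight ordinary values and one extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = values.quantile(0.25)   # 1st quartile
q3 = values.quantile(0.75)   # 3rd quartile
iqr = q3 - q1

# Same rule as detect_outliers_iqr with k = 1.5
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

flagged = values[(values < lower) | (values > upper)].tolist()
```

Here `q1 = 3`, `q3 = 7`, so the bounds are `-3` and `13` and only `100` is flagged. The lower bound is negative even though every value is positive, which is also why several non-negative columns in the table above show negative lower bounds.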


### 3. Data Cleaning 
These steps run before outlier handling and data preprocessing.

3.1 Remove Duplicate Rows (if any)

In [None]:
df = df.drop_duplicates()
print(f'\n[Changes] Removed duplicate rows. New shape={df.shape}\n')

3.2 Remove Redundant Groups

The `age_group` and `age` columns carry the same information in two different forms (a group code and an age value).
The `age_group` column is dropped so the model does not receive the same feature twice, which could bias learning.

In [None]:
df = df.drop(columns=['age_group'])
print(f'\n[Changes] Dropped column: age_group due to redundancy. New shape={df.shape}\n\n')
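
The redundancy claim can be checked by testing whether each group maps to exactly one age and each age to exactly one group. The sketch below does this on a toy frame; the values are made up for illustration and are not read from the dataset.

```python
import pandas as pd

# Hypothetical values standing in for the two columns
toy = pd.DataFrame({
    "age_group": [1, 2, 2, 3, 3, 4, 5],
    "age":       [15, 25, 25, 30, 30, 45, 55],
})

# How many distinct ages appear inside each group, and vice versa
ages_per_group = toy.groupby("age_group")["age"].nunique()
groups_per_age = toy.groupby("age")["age_group"].nunique()

# A one-to-one mapping means one column adds nothing beyond the other
is_redundant = bool((ages_per_group == 1).all() and (groups_per_age == 1).all())
```

On the real data, the same two `groupby` calls would have to run on `df` before the cell that drops `age_group`.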


### 4. Data Preprocessing

The dataset goes through the following steps, in this exact order, to avoid class imbalance and data leakage during testing.

[Split] -> [Log] -> [Encode] -> [Fit] -> [Scale] -> [Balance]
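
The reason `[Fit]` comes after `[Split]` is that any statistic used for scaling should be computed from the training rows only and then reused on the test rows. The sketch below assumes z-score standardization, which is an assumption since the scaling method is not shown in this section, and uses random toy arrays in place of the real features.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=50.0, scale=10.0, size=(80, 3))
X_test_toy = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

# [Fit]: statistics come from the training rows only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)
std[std == 0] = 1.0  # guard against constant columns

# [Scale]: the same training statistics are applied to both sets
X_train_scaled = (X_train_toy - mean) / std
X_test_scaled = (X_test_toy - mean) / std
```

The scaled training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately so. That is expected: nothing about the test rows was used to compute `mean` or `std`.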


#### 4.1 Split

* Split the dataset into training and test sets, for both the input `X` and the output `Y`.
* The function resets the index of `X` and `Y`, then shuffles the rows within each class before splitting, so both sets keep the same class ratio (a stratified split). Splitting before any fitted preprocessing step prevents **data leakage**.

@function `split_data()`:
* params
    * `X`, `y` as input and output
    * `test_split` as the fraction of rows used for testing ; default `0.2`
    * `randomness` as the random seed used to shuffle the row indices before splitting ; default `None`
* return (in this order)
    * `X_train` = Training set without `churn`
    * `X_test` = Testing set without `churn`
    * `y_train` = `churn` Training column
    * `y_test` = `churn` Testing column


In [None]:
def split_data(X, y, test_split=0.2, randomness=None):
    # Set seed for reproducibility
    if randomness is not None:
        np.random.seed(randomness)
    
    # reset X and Y current index
    X = X.reset_index(drop=True)
    y = y.reset_index(drop=True)
    
    # Identify unique classes and their indices (0 and 1)
    unique_classes = np.unique(y)
    train_indices = []
    test_indices = []
    
    for cls in unique_classes:
        # Get indices of rows belonging to this class
        cls_indices = np.where(y == cls)[0]

        # Shuffle indices within this specific class
        np.random.shuffle(cls_indices)

        # Determine the split point
        total_count = len(cls_indices)
        test_count = int(total_count * test_split)
        
        # Split indices
        cls_test = cls_indices[:test_count]
        cls_train = cls_indices[test_count:]
        
        # Add to main lists
        test_indices.extend(cls_test)
        train_indices.extend(cls_train)
        
    # Shuffle the final combined indices so they aren't grouped by class
    np.random.shuffle(train_indices)
    np.random.shuffle(test_indices)
    
    # Use .iloc for DataFrames to select the rows
    X_train, X_test = X.iloc[train_indices], X.iloc[test_indices]
    y_train, y_test = y.iloc[train_indices], y.iloc[test_indices]
    return X_train, X_test, y_train, y_test

X = df.drop(columns=['churn'])
Y = df['churn']
X_train, X_test, y_train, y_test = split_data(X, Y, test_split=0.2, randomness=42)
print('[Changes] Successfully split data into Training and Testing.')
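
Because `split_data()` shuffles and cuts each class separately, the churn rate should be the same in both sets. The sketch below checks that property on toy labels. `stratified_indices` is a hypothetical, condensed version of the same per-class logic, and it uses `np.random.default_rng` rather than `np.random.seed`, so it will not reproduce the notebook's exact row order.

```python
import numpy as np
import pandas as pd

def stratified_indices(y, test_split=0.2, seed=42):
    """Hypothetical helper: per-class shuffle, then cut off the test share."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(y)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        rng.shuffle(cls_idx)
        n_test = int(len(cls_idx) * test_split)
        test_idx.extend(cls_idx[:n_test])
        train_idx.extend(cls_idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Toy labels: 80 non-churners and 20 churners (20% churn)
y_toy = pd.Series([0] * 80 + [1] * 20)
train_idx, test_idx = stratified_indices(y_toy)

train_rate = y_toy.iloc[train_idx].mean()
test_rate = y_toy.iloc[test_idx].mean()
no_overlap = len(np.intersect1d(train_idx, test_idx)) == 0
```

Both rates come out at 0.2 and the two index sets do not overlap. A plain random split of an imbalanced dataset gives no such guarantee for the minority class.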


#### 4.2 Log Transformation

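* This treats the outliers found earlier by compressing large values, which **reduces** the skewness of the data.
* It keeps a few extreme values from dominating the patterns the model learns.
* Use `numpy.log1p(x)`, which computes `log(1 + x)` and is defined at `x = 0`.
* The log transformation is applied to the separated train and test sets. It is elementwise and fits no parameters, so it introduces no **data leakage**.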

@function `log_transformation()`:
* params
    * `df_train`, `df_test` as training and testing dataset
    * `cols_log` as list of columns to be log transformed
* return
    * `df_train` = train dataset that has been log transformed
    * `df_test` = test dataset that has been log transformed

In [16]:
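def log_transformation(df_train, df_test, cols_log: list):
    # Work on copies: the inputs are slices of X, and assigning into them
    # directly can trigger pandas' SettingWithCopyWarning
    df_train, df_test = df_train.copy(), df_test.copy()
    for col in cols_log:
        df_train[col] = np.log1p(df_train[col])
        df_test[col] = np.log1p(df_test[col])
    print('[Changes] Applied log transformation to selected columns.')
    return df_train, df_test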

cols_to_log = [
    'seconds_of_use',
    'frequency_of_use',
    'frequency_of_sms',
    'distinct_called_numbers',
    'call_failure',
    'customer_value',
    'charge_amount'
]
X_train, X_test = log_transformation(X_train,X_test,cols_to_log)
    

[Changes] Applied log transformation to selected columns.
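
The effect of the transformation on skewness can be seen on a small made-up series with one extreme value (illustrative numbers, not dataset values).

```python
import numpy as np
import pandas as pd

# Mostly small values plus one very large one, including a zero
usage = pd.Series([0, 10, 20, 30, 40, 50, 5000], dtype=float)

skew_before = usage.skew()
skew_after = np.log1p(usage).skew()

# log1p maps 0 to 0, so zero-usage rows stay valid (plain log would give -inf)
zero_maps_to_zero = np.log1p(0.0) == 0.0
```

The skewness drops after the transformation because the gap between `50` and `5000` shrinks far more than the gaps between the small values.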


#### 4.3 One-Hot Encoding

* Convert a categorical column into multiple numeric dummy columns whose values are `0` or `1`.
* Example: `plan` has 3 categories = `[cat1, cat2, cat3]`. It is split into one dummy column per category: `plan_cat1`, `plan_cat2`, `plan_cat3`.
* Drop the first dummy column to prevent **multicollinearity**: with `plan_cat1` dropped, a row where `plan_cat2` and `plan_cat3` are both `0` still identifies `cat1`.
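
The `plan` example can be sketched with `pandas.get_dummies`. This is an illustration under assumptions, not the notebook's `one_hot_encoding()`: `encode_with_fixed_categories` is a hypothetical helper and the two toy frames are made up. The categories are taken from the training frame and reused for the test frame so both end up with the same dummy columns.

```python
import pandas as pd

train = pd.DataFrame({"plan": ["cat1", "cat2", "cat3", "cat1"]})
test = pd.DataFrame({"plan": ["cat2", "cat2"]})

# Categories are learned from the training set only
categories = sorted(train["plan"].unique())

def encode_with_fixed_categories(df, col_name, categories, drop_first=True):
    """Hypothetical helper: one-hot encode col_name against a fixed category list."""
    as_cat = pd.Series(
        pd.Categorical(df[col_name], categories=categories), index=df.index
    )
    dummies = pd.get_dummies(as_cat, prefix=col_name, drop_first=drop_first, dtype=int)
    return pd.concat([df.drop(columns=[col_name]), dummies], axis=1)

train_enc = encode_with_fixed_categories(train, "plan", categories)
test_enc = encode_with_fixed_categories(test, "plan", categories)
```

Both frames end up with the columns `plan_cat2` and `plan_cat3`, even though the test frame never contains `cat3`. Without the fixed category list, the test frame would get a different set of columns than the training frame.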

@function `get_train_category()`:
* params
    * `df` as dataset
    * `col_name` as the column whose categories are retrieved
* return
    * `list` = list of categories found in `col_name`


@function `one_hot_encoding()`:
* params
    * `df` as dataset
    * `col_name` as the column to one-hot encode
    * `categories` as the categories inside `col_name`
    * `drop_first` as whether to drop the first dummy column
* return
    * `converted_pd` = the dataframe with the dummy columns added