## Dataset Information
1. Call Failures: number of call failures
2. Complains: 0 = no complaints ; 1 = complaint
3. Subscription Length: total months of subscription
4. Charge Amount: ordinal attribute ; 0-9 (note: the data below contains values from 0 to 10)
5. Seconds of Use: total seconds of calls
6. Frequency of use: total number of calls
7. Frequency of SMS: total number of text messages
8. Distinct Called Numbers: total number of distinct phone numbers called
9. Age Group: ordinal attribute ; 1-5
10. Tariff Plan: binary ; 1 = pay as you go ; 2 = contractual
11. Status: binary ; 1 = active ; 2 = non-active
12. Age: customer age (five distinct values between 15 and 55 in the data)
13. Customer Value: the calculated value of the customer
14. Churn: binary ; 1 = churn ; 0 = non-churn (class label)


## Display Information

Display the necessary information before data preprocessing.

This includes:
* df.info() : column names, non-null counts, and dtypes
* df.describe() : summary statistics of the dataset
* df.isnull().sum() : check for null values
* df.select_dtypes(include = 'number') : check for negative values
* df.select_dtypes(include = 'object') : check for object (non-numeric) columns
* df.duplicated().sum() : check for duplicate rows
* df['Churn'].value_counts() : check for class imbalance
* df.nunique() : check for unique values
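
The checks listed above can also be bundled into one helper, which makes it easier to rerun them after each preprocessing step. The following is a minimal sketch: the function name `quick_overview` and the toy frame are hypothetical and only illustrate the shape of the result.

```python
import pandas as pd

def quick_overview(df: pd.DataFrame, target: str) -> dict:
    """Collect the pre-processing checks into one dictionary (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    return {
        "shape": df.shape,
        "null_counts": df.isnull().sum().to_dict(),
        "negative_counts": (numeric < 0).sum().to_dict(),
        "object_columns": df.select_dtypes(include="object").columns.tolist(),
        "duplicate_rows": int(df.duplicated().sum()),
        "class_counts": df[target].value_counts().to_dict(),
        "unique_counts": df.nunique().to_dict(),
    }

# Tiny made-up frame, used only to illustrate the output
toy = pd.DataFrame({
    "Complains": [0, 1, 0, 0],
    "Seconds of Use": [120, 300, 120, -5],
    "Churn": [0, 1, 0, 0],
})
overview = quick_overview(toy, target="Churn")
print(overview["duplicate_rows"], overview["negative_counts"])
```

On the real data the same call would be `quick_overview(df, target="Churn")`.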

In [None]:

import os
import sys
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, minmax_scale

df = pd.read_csv("Dataset/Customer Churn.csv")

#### Display General Information

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Call  Failure            3150 non-null   int64  
 1   Complains                3150 non-null   int64  
 2   Subscription  Length     3150 non-null   int64  
 3   Charge  Amount           3150 non-null   int64  
 4   Seconds of Use           3150 non-null   int64  
 5   Frequency of use         3150 non-null   int64  
 6   Frequency of SMS         3150 non-null   int64  
 7   Distinct Called Numbers  3150 non-null   int64  
 8   Age Group                3150 non-null   int64  
 9   Tariff Plan              3150 non-null   int64  
 10  Status                   3150 non-null   int64  
 11  Age                      3150 non-null   int64  
 12  Customer Value           3150 non-null   float64
 13  Churn                    3150 non-null   int64  
dtypes: float64(1), int64(13)
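
The `df.info()` output shows that some headers contain a double space (`Call  Failure`, `Subscription  Length`, `Charge  Amount`), which makes them easy to mistype when indexing. One possible cleanup, sketched here with a hypothetical helper `normalize_columns`, is to collapse runs of whitespace in the column names:

```python
import re
import pandas as pd

def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose column names have single spaces and no padding."""
    cleaned = {c: re.sub(r"\s+", " ", c).strip() for c in df.columns}
    return df.rename(columns=cleaned)

# Empty frame with a few of the headers reported by df.info()
toy = pd.DataFrame(columns=["Call  Failure", "Subscription  Length", "Churn"])
print(normalize_columns(toy).columns.tolist())
# ['Call Failure', 'Subscription Length', 'Churn']
```

Whether to rename is a judgment call: any later code that refers to the original double-spaced names would need to be updated as well.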

#### Display Dataset Statistics

In [3]:
df.describe()

| | Call Failure | Complains | Subscription Length | Charge Amount | Seconds of Use | Frequency of use | Frequency of SMS | Distinct Called Numbers | Age Group | Tariff Plan | Status | Age | Customer Value | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3150.0 | 3150.0 | 3150.0 | 3150.0 | 3150.0 | 3150.0 | 3150.0 | 3150.0 | 3150.0 | 3150.0 | 3150.0 | 3150.0 | 3150.0 | 3150.0 |
| mean | 7.627937 | 0.076508 | 32.541905 | 0.942857 | 4472.459683 | 69.460635 | 73.174921 | 23.509841 | 2.826032 | 1.077778 | 1.248254 | 30.998413 | 470.972916 | 0.157143 |
| std | 7.263886 | 0.265851 | 8.573482 | 1.521072 | 4197.908687 | 57.413308 | 112.23756 | 17.217337 | 0.892555 | 0.267864 | 0.432069 | 8.831095 | 517.015433 | 0.363993 |
| min | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 15.0 | 0.0 | 0.0 |
| 25% | 1.0 | 0.0 | 30.0 | 0.0 | 1391.25 | 27.0 | 6.0 | 10.0 | 2.0 | 1.0 | 1.0 | 25.0 | 113.80125 | 0.0 |
| 50% | 6.0 | 0.0 | 35.0 | 0.0 | 2990.0 | 54.0 | 21.0 | 21.0 | 3.0 | 1.0 | 1.0 | 30.0 | 228.48 | 0.0 |
| 75% | 12.0 | 0.0 | 38.0 | 1.0 | 6478.25 | 95.0 | 87.0 | 34.0 | 3.0 | 1.0 | 1.0 | 30.0 | 788.38875 | 0.0 |
| max | 36.0 | 1.0 | 47.0 | 10.0 | 17090.0 | 255.0 | 522.0 | 97.0 | 5.0 | 2.0 | 2.0 | 55.0 | 2165.28 | 1.0 |


#### Check for Null Values

In [4]:
df.isnull().sum()

Call  Failure              0
Complains                  0
Subscription  Length       0
Charge  Amount             0
Seconds of Use             0
Frequency of use           0
Frequency of SMS           0
Distinct Called Numbers    0
Age Group                  0
Tariff Plan                0
Status                     0
Age                        0
Customer Value             0
Churn                      0
dtype: int64

#### Check for Negative Values

In [5]:
numeric_df = df.select_dtypes(include=['number'])

negative = (numeric_df < 0).sum()
print(negative)    

Call  Failure              0
Complains                  0
Subscription  Length       0
Charge  Amount             0
Seconds of Use             0
Frequency of use           0
Frequency of SMS           0
Distinct Called Numbers    0
Age Group                  0
Tariff Plan                0
Status                     0
Age                        0
Customer Value             0
Churn                      0
dtype: int64
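
A negative-value check does not catch values that are positive but outside the documented range. The `df.describe()` output, for example, reports a maximum of 10 for Charge Amount, while the dataset description gives 0-9. The sketch below counts such values per column. The helper name `out_of_range_counts` and the toy frame are hypothetical, and the dictionary keys assume single-spaced column names, so they would need to match the actual headers of `df`.

```python
import pandas as pd

# Ranges as documented in the dataset description
DOCUMENTED_RANGES = {
    "Complains": (0, 1),
    "Charge Amount": (0, 9),
    "Age Group": (1, 5),
    "Tariff Plan": (1, 2),
    "Status": (1, 2),
    "Churn": (0, 1),
}

def out_of_range_counts(df: pd.DataFrame, ranges: dict) -> dict:
    """Count values outside [low, high] for each listed column present in df."""
    counts = {}
    for col, (low, high) in ranges.items():
        if col in df.columns:
            # between() is inclusive on both ends by default
            counts[col] = int((~df[col].between(low, high)).sum())
    return counts

# Made-up rows, one of which exceeds the documented Charge Amount range
toy = pd.DataFrame({"Charge Amount": [0, 3, 10], "Age Group": [1, 5, 3]})
print(out_of_range_counts(toy, DOCUMENTED_RANGES))
# {'Charge Amount': 1, 'Age Group': 0}
```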


#### Check for Object Values
Object (non-numeric) columns need to be encoded or otherwise handled if any exist.

In [6]:
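# Print the value counts of (up to) the first five object columns.
# A bare expression inside a loop is not displayed, so print() is required.
for col in df.select_dtypes(include="object").columns[:5]:
    print(df[col].value_counts())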

#### Check for Duplicate Values
Duplicate rows will be removed if any exist.

In [7]:
df.duplicated().sum()

np.int64(300)
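
The count above only reports duplicates; it does not remove them. A minimal sketch of the removal step follows, shown on a made-up frame. On the real data, dropping the 300 flagged rows would leave 3150 - 300 = 2850 rows. Since the dataset has no customer ID column, identical rows could in principle belong to different customers, so treating them as duplicates is an assumption.

```python
import pandas as pd

# Made-up frame with two repeated rows
toy = pd.DataFrame({"Complains": [0, 1, 0, 1], "Churn": [0, 1, 0, 1]})

before = len(toy)
# keep="first" retains the first occurrence of each repeated row
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)
print(f"removed {before - len(deduped)} duplicate rows, {len(deduped)} remain")
```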

#### Check for Class Output Imbalance

In [8]:
df['Churn'].value_counts()

Churn
0    2655
1     495
Name: count, dtype: int64
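
To put the imbalance in numbers, the counts above can be turned into class shares and into weights. The sketch below uses the common "balanced" heuristic, n_samples / (n_classes * class_count), which is the formula behind scikit-learn's `class_weight='balanced'`. Whether to use class weights at all is a modeling choice this section does not make.

```python
# Class counts copied from the value_counts() output above
counts = {0: 2655, 1: 495}
total = sum(counts.values())

# Share of each class in the dataset
share = {label: n / total for label, n in counts.items()}

# "Balanced" weights: n_samples / (n_classes * class_count)
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print({k: round(v, 4) for k, v in share.items()})    # {0: 0.8429, 1: 0.1571}
print({k: round(v, 4) for k, v in weights.items()})  # {0: 0.5932, 1: 3.1818}
```

Roughly one customer in six churned, so plain accuracy would be a misleading metric: always predicting "non-churn" already scores about 84%.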

#### Check for Unique Values

In [9]:
df.nunique()

Call  Failure                37
Complains                     2
Subscription  Length         45
Charge  Amount               11
Seconds of Use             1756
Frequency of use            242
Frequency of SMS            405
Distinct Called Numbers      92
Age Group                     5
Tariff Plan                   2
Status                        2
Age                           5
Customer Value             2654
Churn                         2
dtype: int64

In [None]:

def detect_outliers_iqr(df, k=1.5):
    nums = df.select_dtypes(include='number')
    outlier_info = {}
    for c in nums.columns:
        # skip binary columns
        if nums[c].nunique() <= 2:
            continue
        
        q1 = nums[c].quantile(0.25) #1st quartile
        q3 = nums[c].quantile(0.75) #3rd quartile
        iqr = q3 - q1
        lower = q1 - k * iqr
        upper = q3 + k * iqr
        mask = (nums[c] < lower) | (nums[c] > upper)
        outlier_info[c] = {
            'count': int(mask.sum()),
            'indices': nums.index[mask].tolist(),
            'lower': float(lower),
            'upper': float(upper)
        }
    return outlier_info

def winsorize_by_quantile(df, lower_q=0.01, upper_q=0.99):
    df_w = df.copy()
    nums = df_w.select_dtypes(include=[np.number]).columns
    for c in nums:
        low = df_w[c].quantile(lower_q)
        high = df_w[c].quantile(upper_q)
        df_w[c] = df_w[c].clip(lower=low, upper=high)
    return df_w

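def remove_outliers_iqr(df, k=1.5):
    nums = df.select_dtypes(include=[np.number])
    mask = pd.Series(True, index=df.index)
    for c in nums.columns:
        # skip binary columns (including the Churn label): their IQR can be
        # zero, which would drop every row of the minority value
        if nums[c].nunique() <= 2:
            continue
        q1 = nums[c].quantile(0.25)
        q3 = nums[c].quantile(0.75)
        iqr = q3 - q1
        lower = q1 - k * iqr
        upper = q3 + k * iqr
        mask &= (nums[c] >= lower) & (nums[c] <= upper)
    return df[mask]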

def main(method='winsorize'):
    outliers = detect_outliers_iqr(df)
    print('Outlier summary (IQR method):')
    for col, info in outliers.items():
        if info['count'] > 0:
            print(f'  {col}: {info["count"]} outliers, bounds=({info["lower"]:.3f},{info["upper"]:.3f})')

    if method == 'winsorize':
        df_w = winsorize_by_quantile(df, lower_q=0.01, upper_q=0.99)
        out_w = detect_outliers_iqr(df_w)
        print('\nAfter winsorization (1st/99th percentiles):')
        for col, info in out_w.items():
            if info['count'] > 0:
                print(f'  {col}: {info["count"]} outliers remain')
        out_path = os.path.join(r'd:\testAI\ML', 'Customer churn_winsorized.csv')
        df_w.to_csv(out_path, index=False)
        print(f'Saved winsorized -> {out_path}')

    elif method == 'remove':
        df_r = remove_outliers_iqr(df)
        print(f'\nAfter removal shape: {df_r.shape}')
        out_path = os.path.join(r'd:\testAI\ML', 'Customer churn_outliers_removed.csv')
        df_r.to_csv(out_path, index=False)
        print(f'Saved removed-outliers -> {out_path}')

    else:
        print('Unknown method. Use "winsorize" or "remove".')

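if __name__ == '__main__':
    # Note: inside a Jupyter kernel, sys.argv[1] is typically a launcher flag
    # (e.g. "-f"), not a method name, so main() reports "Unknown method".
    # In a notebook, call main('winsorize') or main('remove') directly instead.
    method = sys.argv[1] if len(sys.argv) > 1 else 'winsorize'
    main(method)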

Loaded d:\testAI\ML\Customer churn.csv, shape=(3150, 14)
Outlier summary (IQR method):
  Call  Failure: 47 outliers, bounds=(-15.500,28.500)
  Subscription  Length: 282 outliers, bounds=(18.000,50.000)
  Charge  Amount: 370 outliers, bounds=(-1.500,2.500)
  Seconds of Use: 200 outliers, bounds=(-6239.250,14108.750)
  Frequency of use: 129 outliers, bounds=(-75.000,197.000)
  Frequency of SMS: 368 outliers, bounds=(-115.500,208.500)
  Distinct Called Numbers: 77 outliers, bounds=(-26.000,70.000)
  Age Group: 170 outliers, bounds=(0.500,4.500)
  Age: 688 outliers, bounds=(17.500,37.500)
  Customer Value: 116 outliers, bounds=(-898.080,1800.270)
Unknown method. Use "winsorize" or "remove".
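
The bounds in the summary follow directly from the quartiles: lower = Q1 - k * IQR and upper = Q3 + k * IQR, with k = 1.5. As a worked check, the sketch below recomputes them from the 25% and 75% rows of the `df.describe()` output. The helper name `iqr_bounds` is hypothetical. The `Complains` entry shows why binary columns are best skipped: both quartiles are 0, so the interval collapses to a single point and every row with `Complains = 1` would count as an outlier.

```python
def iqr_bounds(q1: float, q3: float, k: float = 1.5) -> tuple:
    """Return (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# (Q1, Q3) pairs taken from the 25% and 75% rows of df.describe()
quartiles = {
    "Call Failure": (1.0, 12.0),
    "Age": (25.0, 30.0),
    "Complains": (0.0, 0.0),
}

for name, (q1, q3) in quartiles.items():
    print(name, iqr_bounds(q1, q3))
# Call Failure (-15.5, 28.5)
# Age (17.5, 37.5)
# Complains (0.0, 0.0)
```

The `Age` result also explains the 688 flagged rows: the column takes only five distinct values, and an IQR of 5 puts both 15 and 55 outside the fences, so these "outliers" are ordinary age bands rather than measurement errors.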


#### Split Data into Train and Test

80% train and 20% test, stratified on `Churn` so both splits keep the same class ratio.

In [11]:
x = df.drop('Churn', axis=1)
y = df['Churn']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, shuffle=True, stratify=y)

print(len(x_train) / len(x))

0.8
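
The section imports `StandardScaler` but does not use it yet. If scaling is applied, a common practice is to fit the scaler on the training split only and reuse it on the test split, so that no test-set statistics leak into training. The sketch below illustrates this on synthetic data: the arrays `X` and `y` are made up, with a 20% positive class loosely mimicking the imbalance above, and it is not the notebook's actual pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 100 rows, 3 features, 20% positive class
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

# Stratification keeps the class ratio identical in both splits
print(y_train.mean(), y_test.mean())
```

The scaled test features will generally not have exactly zero mean and unit variance; that is expected, because they are transformed with the training statistics.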
