# Phase 1: Data Exploration & Initial Assessment
In this phase, we explore the dataset to understand its structure, columns, missing values, and categories.  
This step is important before any cleaning or modeling.


## Step 1: Load the dataset
We will load the dataset using pandas and call it `df` for simplicity.


In [1]:
import pandas as pd
import numpy as np

# Load dataset
df = pd.read_csv(r"C:\Users\kisho\OneDrive\Desktop\In_project\Telco_Churn_EDA_Kishore\Customer Churn.csv")
df.head()


Unnamed: 0,Call Failure,Complains,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Frequency of SMS,Distinct Called Numbers,Age Group,Tariff Plan,Status,Age,Customer Value,Churn
0,8.0,0,38.0,0.0,4370,71,5,17,3,1,1,30,,0
1,0.0,0,39.0,0.0,318,5,7,4,2,1,2,25,46.035,0
2,10.0,0,37.0,0.0,2453,60,359,24,3,1,1,30,1536.52,0
3,10.0,0,,0.0,4198,66,1,35,1,1,1,15,240.02,0
4,,0,38.0,0.0,2393,58,2,33,1,1,1,15,145.805,0


### Observation
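The dataset loads successfully. The first rows show usage metrics (call failures, seconds of use, call and SMS frequency), demographics (`Age Group`, `Age`), plan and status codes, and the `Churn` target. Blank cells in `Call Failure`, `Subscription Length`, and `Customer Value` already hint at missing values.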


## Step 2: Preview more sample data
We can view more rows with `.head()` or `.tail()`.


In [74]:
df.head(10)

Unnamed: 0,Call Failure,Complains,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Frequency of SMS,Distinct Called Numbers,Age Group,Tariff Plan,Status,Age,Customer Value,Churn
0,8.0,0,38.0,0.0,4370,71,5,17,3,1,1,30,,0
1,0.0,0,39.0,0.0,318,5,7,4,2,1,2,25,46.035,0
2,10.0,0,37.0,0.0,2453,60,359,24,3,1,1,30,1536.52,0
3,10.0,0,,0.0,4198,66,1,35,1,1,1,15,240.02,0
4,,0,38.0,0.0,2393,58,2,33,1,1,1,15,145.805,0
5,,0,38.0,1.0,3775,82,32,28,3,1,1,30,282.28,0
6,4.0,0,38.0,0.0,2360,39,285,18,3,1,1,30,1235.96,0
7,13.0,0,37.0,2.0,9115,121,144,43,3,1,1,30,945.44,0
8,7.0,0,38.0,0.0,13773,169,0,44,3,1,1,30,557.68,0
9,7.0,0,,1.0,4515,83,2,25,3,1,1,30,191.92,0


### Observation
This shows the first 10 rows of the dataset. It helps us see patterns and formatting of the data.


In [75]:
df.tail(10)

Unnamed: 0,Call Failure,Complains,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Frequency of SMS,Distinct Called Numbers,Age Group,Tariff Plan,Status,Age,Customer Value,Churn
3140,16.0,0,29.0,0.0,1005,31,17,9,3,1,2,30,,0
3141,5.0,0,28.0,0.0,1130,16,28,5,4,1,2,45,98.65,0
3142,,0,27.0,1.0,1530,38,26,15,2,1,1,25,187.56,0
3143,7.0,0,27.0,1.0,3530,67,15,25,3,1,1,30,203.88,0
3144,7.0,0,20.0,1.0,2000,32,35,16,3,1,1,30,221.28,0
3145,21.0,0,19.0,2.0,6697,147,92,44,2,2,1,25,721.98,0
3146,17.0,0,17.0,1.0,9237,177,80,42,5,1,1,55,,0
3147,13.0,0,18.0,,3157,51,38,21,3,1,1,30,280.32,0
3148,7.0,0,11.0,2.0,4695,46,222,12,3,1,1,30,1077.64,0
3149,8.0,1,11.0,2.0,1792,25,7,9,3,1,1,30,100.68,1


### Observation
The last 10 rows are displayed. This helps check if the dataset has unusual formatting at the bottom.


## Step 3: Random sampling
We can also check a random sample of rows.


In [76]:
df.sample(5, random_state=1)

Unnamed: 0,Call Failure,Complains,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Frequency of SMS,Distinct Called Numbers,Age Group,Tariff Plan,Status,Age,Customer Value,Churn
2001,0.0,0,37.0,0.0,0,0,0,0,2,1,2,25,0.0,0
943,0.0,0,24.0,0.0,2515,50,0,23,3,1,1,30,102.6,0
1611,4.0,0,37.0,0.0,2048,54,33,19,3,1,2,30,216.08,0
403,8.0,0,35.0,0.0,4018,58,0,30,1,1,1,15,224.18,0
1301,0.0,0,43.0,0.0,273,9,15,8,2,1,2,25,80.19,0


### Observation
This shows 5 random rows. Useful to quickly spot-check values.


## Step 4: Dataset shape
We check how many rows and columns the dataset has.


In [77]:
df.shape

(3150, 14)

### Observation
The dataset has 3,150 rows and 14 columns.


## Step 5: Column names
We list all column names to see what features are available.


In [78]:
df.columns

Index(['Call Failure', 'Complains', 'Subscription Length', 'Charge Amount',
       'Seconds of Use', 'Frequency of use', 'Frequency of SMS',
       'Distinct Called Numbers', 'Age Group', 'Tariff Plan', 'Status', 'Age',
       'Customer Value', 'Churn'],
      dtype='object')

### Observation
We can now see all column names clearly. This helps us identify the dataset features.


## Column Description and Meaning

- **Call Failure**: Number of dropped or failed calls for the customer.
- **Complains**: Whether the customer has lodged complaints (0 = No, 1 = Yes).
- **Subscription Length**: Duration of subscription in months.
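- **Charge Amount**: Charge level for the customer. The values run from 0 to 10 with only 11 distinct levels, so this looks like an ordinal charge band rather than a raw billed amount.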
- **Seconds of Use**: Total voice usage in seconds.
- **Frequency of use**: Number of calls made by the customer.
- **Frequency of SMS**: Number of SMS sent by the customer.
- **Distinct Called Numbers**: Count of unique phone numbers called.
- **Age Group**: Categorical representation of customer age group.
- **Tariff Plan**: Indicates the type of tariff plan subscribed to.
- **Status**: Current service status of the customer.
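- **Age**: Age of the customer in years. Only 5 distinct values occur, apparently one per `Age Group`, so it behaves like a grouped variable.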
- **Customer Value**: Calculated lifetime value of the customer.
- **Churn**: Target variable (1 = Churned, 0 = Retained).


## Step 6: Data types
We check the type of data each column has.


In [79]:
df.dtypes

Call Failure               float64
Complains                    int64
Subscription Length        float64
Charge Amount              float64
Seconds of Use               int64
Frequency of use             int64
Frequency of SMS             int64
Distinct Called Numbers      int64
Age Group                    int64
Tariff Plan                  int64
Status                       int64
Age                          int64
Customer Value             float64
Churn                        int64
dtype: object

### Observation
The dataset contains only numeric columns, with `int64` and `float64` data types.

- **Float columns**: `Call Failure`, `Subscription Length`, `Charge Amount`, `Customer Value`. These are exactly the four columns that contain missing values; pandas stores a column with `NaN` as `float64`, so only `Customer Value` is clearly continuous, while the other three hold whole numbers in the rows shown.
- **Integer columns**: `Complains`, `Seconds of Use`, `Frequency of use`, `Frequency of SMS`, `Distinct Called Numbers`, `Age Group`, `Tariff Plan`, `Status`, `Age`, `Churn` → these represent counts, encoded categories, or flags.
- There are **no object/string columns**, meaning the dataset is fully numeric.

The data types are consistent, and categorical variables (like `Churn`, `Tariff Plan`, `Status`) are stored as integers instead of text.
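
Whole-number columns that contain gaps are read as `float64`, because the default integer dtype has no representation for `NaN`. A minimal sketch of this behaviour on a made-up three-row CSV (the `sample` frame is illustrative, not the project data); it also shows pandas' nullable `Int64` dtype as one possible alternative:

```python
import io

import pandas as pd

# Made-up CSV for illustration: the second row has an empty "Call Failure" field.
csv_text = "Call Failure,Complains\n8,0\n,0\n10,1\n"
sample = pd.read_csv(io.StringIO(csv_text))

# The blank field is parsed as NaN, which forces the column to float64.
print(sample.dtypes)

# Optional: the nullable integer dtype keeps whole numbers and marks gaps as <NA>.
as_nullable = sample["Call Failure"].astype("Int64")
print(as_nullable)
```

Whether to convert to `Int64` is a judgment call; some downstream libraries expect plain NumPy dtypes.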


## Step 7: Dataset info
We use `.info()` to see non-null counts and memory usage.


In [80]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Call Failure             2992 non-null   float64
 1   Complains                3150 non-null   int64  
 2   Subscription Length      2898 non-null   float64
 3   Charge Amount            3056 non-null   float64
 4   Seconds of Use           3150 non-null   int64  
 5   Frequency of use         3150 non-null   int64  
 6   Frequency of SMS         3150 non-null   int64  
 7   Distinct Called Numbers  3150 non-null   int64  
 8   Age Group                3150 non-null   int64  
 9   Tariff Plan              3150 non-null   int64  
 10  Status                   3150 non-null   int64  
 11  Age                      3150 non-null   int64  
 12  Customer Value           2835 non-null   float64
 13  Churn                    3150 non-null   int64  
dtypes: float64(4), int64(10)

### Observation
**Quick Observations from df.info():**  

- **Size**: 3,150 rows × 14 columns  
- **Missing Values**: Found in 4 columns (`Call Failure`, `Subscription Length`, `Charge Amount`, `Customer Value`)  
- **Data Types**:  
  - 4 float columns (the same 4 columns that contain missing values)  
  - 10 integer columns (counts/categories)  
- **Complete Columns**: 10/14 have no missing values  
- **Target Ready**: `Churn` column has no nulls  


## Step 8: Summary statistics
We use `.describe()` for numeric columns.


In [81]:
df.describe()

Unnamed: 0,Call Failure,Complains,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Frequency of SMS,Distinct Called Numbers,Age Group,Tariff Plan,Status,Age,Customer Value,Churn
count,2992.0,3150.0,2898.0,3056.0,3150.0,3150.0,3150.0,3150.0,3150.0,3150.0,3150.0,3150.0,2835.0,3150.0
mean,7.572193,0.076508,32.506901,0.937827,4472.459683,69.460635,73.174921,23.509841,2.826032,1.077778,1.248254,30.998413,473.795691,0.157143
std,7.223481,0.265851,8.589523,1.515462,4197.908687,57.413308,112.23756,17.217337,0.892555,0.267864,0.432069,8.831095,519.513809,0.363993
min,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,15.0,0.0,0.0
25%,1.0,0.0,29.0,0.0,1391.25,27.0,6.0,10.0,2.0,1.0,1.0,25.0,113.8725,0.0
50%,6.0,0.0,35.0,0.0,2990.0,54.0,21.0,21.0,3.0,1.0,1.0,30.0,229.92,0.0
75%,11.0,0.0,38.0,1.0,6478.25,95.0,87.0,34.0,3.0,1.0,1.0,30.0,790.2925,0.0
max,36.0,1.0,47.0,10.0,17090.0,255.0,522.0,97.0,5.0,2.0,2.0,55.0,2149.28,1.0


### Observation
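The table reports count, mean, standard deviation, min, max, and quartiles for every column. Several usage features look right-skewed, with the mean well above the median: `Seconds of Use` (4,472 vs 2,990), `Frequency of SMS` (73.2 vs 21), and `Customer Value` (473.8 vs 229.9). For the binary columns the mean is a rate: about 7.7% of customers have a complaint and 15.7% churned.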


## Step 9: Unique values in each column
We use `.nunique()` to count distinct values.


In [82]:
df.nunique()

Call Failure                 37
Complains                     2
Subscription Length          45
Charge Amount                11
Seconds of Use             1756
Frequency of use            242
Frequency of SMS            405
Distinct Called Numbers      92
Age Group                     5
Tariff Plan                   2
Status                        2
Age                           5
Customer Value             2408
Churn                         2
dtype: int64

**Key Insights from df.nunique():**  

- **Binary Columns (2 unique values)**:  
  `Complains`, `Tariff Plan`, `Status`, `Churn` → `Complains` and `Churn` are 0/1 flags; `Tariff Plan` and `Status` are two-level codes (1/2)  
- **Low-Cardinality Columns**:  
  `Age Group` (5), `Age` (5), `Charge Amount` (11) → categorical or ordinal  
- **Discrete Counts**:  
  `Call Failure` (37), `Subscription Length` (45), `Distinct Called Numbers` (92)  
- **High-Cardinality Columns**:  
  `Seconds of Use` (1756), `Customer Value` (2408) → Near-continuous  
- **Actionable Notes**:  
  - Convert binary columns to `bool`/`category` for efficiency  
  - High-cardinality columns may benefit from binning, depending on the model
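
One way to act on the low-cardinality columns is to convert them to the `category` dtype based on their distinct-value count. The sketch below is illustrative only: the `sample` frame is made up, and the helper `to_category` and its `max_unique` threshold are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the dataset.
sample = pd.DataFrame({
    "Tariff Plan": [1, 1, 2, 1],
    "Status": [1, 2, 1, 1],
    "Seconds of Use": [4370, 318, 2453, 4198],
})

def to_category(frame, max_unique=5):
    """Return a copy where columns with at most `max_unique` distinct values use the category dtype."""
    out = frame.copy()
    for col in out.columns:
        if out[col].nunique() <= max_unique:
            out[col] = out[col].astype("category")
    return out

# With max_unique=2 only the two-level columns are converted.
converted = to_category(sample, max_unique=2)
print(converted.dtypes)
```

The target column is often left numeric for modeling, so it may be worth excluding `Churn` before calling a helper like this.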

## Step 10: Missing values
We count missing values in each column.


In [83]:
df.isnull().sum()

Call Failure               158
Complains                    0
Subscription Length        252
Charge Amount               94
Seconds of Use               0
Frequency of use             0
Frequency of SMS             0
Distinct Called Numbers      0
Age Group                    0
Tariff Plan                  0
Status                       0
Age                          0
Customer Value             315
Churn                        0
dtype: int64

**Missing Values Summary:**

- **Columns with Nulls**:
  - `Call Failure`: 158 missing (5%)
  - `Subscription Length`: 252 missing (8%)
  - `Charge Amount`: 94 missing (3%)
  - `Customer Value`: 315 missing (10%)

- **Complete Columns**: 10/14 columns have no missing values

**Key Takeaways**:
1. `Customer Value` has the most missing data (10%)
2. Critical columns like `Churn` and key demographics (`Age`, `Status`) are complete
3. Action needed: Impute missing values in 4 columns

## Step 11: Percentage of missing values
We check missing values as percentages.


In [84]:
(df.isnull().sum() / len(df)) * 100

Call Failure                5.015873
Complains                   0.000000
Subscription Length         8.000000
Charge Amount               2.984127
Seconds of Use              0.000000
Frequency of use            0.000000
Frequency of SMS            0.000000
Distinct Called Numbers     0.000000
Age Group                   0.000000
Tariff Plan                 0.000000
Status                      0.000000
Age                         0.000000
Customer Value             10.000000
Churn                       0.000000
dtype: float64

**Missing Values Analysis (Percentage):**

- **Significant Missing Data (>5%)**:
  - `Customer Value`: 10.0% missing
  - `Subscription Length`: 8.0% missing
  - `Call Failure`: 5.0% missing

- **Minor Missing Data (<3%)**:
  - `Charge Amount`: 2.98% missing

- **Complete Columns**: 10/14 columns (71%) have no missing values

**Action Items**:
1. Prioritize handling for `Customer Value` (highest % missing)
2. Consider different imputation strategies for:
   - Continuous: `Customer Value`
   - Discrete or ordinal: `Call Failure`, `Subscription Length`, `Charge Amount`
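
As a rough illustration of one imputation strategy, the sketch below fills gaps with each column's median. Median imputation is an assumption chosen here for simplicity, not a decision made in this notebook, and the `sample` frame is made up.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration.
sample = pd.DataFrame({
    "Customer Value": [np.nan, 46.035, 1536.52, 240.02],
    "Call Failure": [8.0, 0.0, 10.0, np.nan],
})

# The median is less sensitive to skewed values than the mean.
medians = sample.median()

# fillna with a Series fills each column with its own median.
imputed = sample.fillna(medians)
print(imputed)
```

In a modeling pipeline the medians would normally be computed on the training split only and then reused for the test split.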

## Step 12: Value counts for categorical columns
We check the distribution of values.


In [85]:
for col in df.select_dtypes(include='object').columns:
    print(f"\nColumn: {col}")
    print(df[col].value_counts().head())

**Observation on Categorical Columns Analysis:**

1. **No Output Generated** because:
   - The dataset contains no object/string columns (`select_dtypes(include='object')` returned empty)
   - All columns are numeric (int64/float64) as previously seen in `df.info()`
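
Because the categorical features are stored as integers, a dtype-based selection finds nothing. A possible alternative is to name the encoded columns explicitly and count their values. This is a sketch on made-up rows; the list `encoded_cols` is a hand-picked assumption.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the dataset.
sample = pd.DataFrame({
    "Tariff Plan": [1, 1, 2, 1, 1],
    "Status": [1, 2, 1, 1, 2],
    "Churn": [0, 0, 0, 1, 0],
})

# Columns treated as categorical here are chosen by hand.
encoded_cols = ["Tariff Plan", "Status", "Churn"]

for col in encoded_cols:
    print(f"\nColumn: {col}")
    # dropna=False also reports missing values, if any.
    print(sample[col].value_counts(dropna=False))

# Same counts collected in a dictionary for later use.
counts = {col: sample[col].value_counts(dropna=False).to_dict() for col in encoded_cols}
```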

## Step 13: Unique values for selected columns
We can directly list the unique categories.


In [88]:
df['Call Failure'].unique()

array([ 8.,  0., 10., nan,  4., 13.,  7.,  9., 25.,  3.,  2., 23., 21.,
        1.,  6., 11., 16., 12., 14., 28.,  5., 26., 24., 19., 15., 22.,
       20., 18., 17., 30., 27., 29., 31., 33., 35., 32., 34., 36.])

### Observation
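These are the distinct values observed in `Call Failure`: whole numbers from 0 to 36, plus `nan` for missing entries. The column is a count rather than a set of categories.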


## Step 14: Null check across all rows
We check if any row has missing data.


In [89]:
df.isnull().any(axis=1).sum()

np.int64(756)

### Observation
This counts the rows with at least one missing value: 756 of 3,150 rows (24%).
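
It can also help to see how many cells are missing per row, since rows with several gaps may be handled differently from rows with one. A small sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration.
sample = pd.DataFrame({
    "Call Failure": [8.0, np.nan, 10.0, np.nan],
    "Subscription Length": [38.0, 39.0, np.nan, np.nan],
    "Churn": [0, 0, 1, 0],
})

# Number of missing cells in each row.
missing_per_row = sample.isnull().sum(axis=1)

# How many rows have 0, 1, 2, ... missing cells.
print(missing_per_row.value_counts().sort_index())

# Rows with at least one missing value (same result as isnull().any(axis=1).sum()).
rows_with_gaps = int((missing_per_row > 0).sum())
print(rows_with_gaps)
```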

## Step 15: Duplicate records
We check for duplicate rows.


In [90]:
df.duplicated().sum()

np.int64(194)

### Observation
There are 194 exact duplicate rows (about 6.2% of the dataset). `duplicated()` counts only the repeats, not the first occurrence of each row.
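
The sketch below shows one way to inspect duplicate groups and drop the repeats, using a made-up frame. Since the dataset has no customer ID column, identical rows could be genuine look-alike customers, so dropping them is an option to weigh rather than a given.

```python
import pandas as pd

# Made-up rows for illustration: rows 0 and 2 are identical.
sample = pd.DataFrame({
    "Seconds of Use": [4370, 318, 4370, 2453],
    "Churn": [0, 0, 0, 1],
})

# duplicated() flags repeats after the first occurrence.
n_dupes = int(sample.duplicated().sum())

# keep=False marks every member of a duplicate group, which helps inspection.
dupe_groups = sample[sample.duplicated(keep=False)]

# One way to remove them: keep the first occurrence and rebuild the index.
deduped = sample.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(dupe_groups), deduped.shape)
```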


## Step 16: Checking balance of target column
We look at the churn distribution.


In [92]:
df['Churn'].value_counts()

Churn
0    2655
1     495
Name: count, dtype: int64

### Observation
We see how many customers churned vs stayed. Imbalance here is common in churn datasets.


**Churn Distribution:**
- Non-Churn (0): 2,655 (84%)
- Churn (1): 495 (16%)
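
The percentages can be computed directly with `value_counts(normalize=True)`. The sketch rebuilds a Series from the class counts reported above rather than loading the file:

```python
import pandas as pd

# Rebuild a Series with the reported class counts (2,655 zeros, 495 ones).
churn = pd.Series([0] * 2655 + [1] * 495, name="Churn")

# normalize=True returns proportions instead of counts.
churn_pct = (churn.value_counts(normalize=True) * 100).round(2)
print(churn_pct)

# Ratio of retained to churned customers.
imbalance_ratio = round(2655 / 495, 2)
print(imbalance_ratio)
```

This gives 84.29% retained and 15.71% churned, or roughly 5.4 retained customers for every churned one.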



## Step 17: Numeric vs categorical columns
We separate numeric and categorical features.


In [95]:
num_cols = df.select_dtypes(include=np.number).columns
cat_cols = df.select_dtypes(include='object').columns
num_cols, cat_cols


(Index(['Call Failure', 'Complains', 'Subscription Length', 'Charge Amount',
        'Seconds of Use', 'Frequency of use', 'Frequency of SMS',
        'Distinct Called Numbers', 'Age Group', 'Tariff Plan', 'Status', 'Age',
        'Customer Value', 'Churn'],
       dtype='object'),
 Index([], dtype='object'))

### Observation
We now know which columns are numeric and which are categorical.


**Feature Separation Results:**

- **Numeric Columns (14)**: All columns (including categorical encodings)
- **Categorical Columns (0)**: No object-type columns found

**Key Insight**:
- All variables are stored as numeric (int64/float64)
- Categorical features (like `Tariff Plan`, `Status`) are numerically encoded

## Step 18: Min and max of numeric columns
We check ranges of numerical features.


In [96]:
df[num_cols].min(), df[num_cols].max()

(Call Failure                0.0
 Complains                   0.0
 Subscription Length         3.0
 Charge Amount               0.0
 Seconds of Use              0.0
 Frequency of use            0.0
 Frequency of SMS            0.0
 Distinct Called Numbers     0.0
 Age Group                   1.0
 Tariff Plan                 1.0
 Status                      1.0
 Age                        15.0
 Customer Value              0.0
 Churn                       0.0
 dtype: float64,
 Call Failure                  36.00
 Complains                      1.00
 Subscription Length           47.00
 Charge Amount                 10.00
 Seconds of Use             17090.00
 Frequency of use             255.00
 Frequency of SMS             522.00
 Distinct Called Numbers       97.00
 Age Group                      5.00
 Tariff Plan                    2.00
 Status                         2.00
 Age                           55.00
 Customer Value              2149.28
 Churn                          1.00
 dtype: float64)

### Observation
This shows the smallest and largest values per numeric column.

**Numerical Features Range Analysis:**

- **Key Observations**:
  - `Seconds of Use` (0-17,090) and `Customer Value` (0-2,149) have wide ranges → consider scaling
  - Binary features (`Complains`, `Churn`) correctly show 0-1 range
  - `Age` ranges from 15-55 (reasonable customer ages)
  - `Tariff Plan` (1-2) and `Status` (1-2) confirm categorical encoding

- **Action Items**:
  1. **Scale Features**: Normalize/RobustScale high-range columns
  2. **Verify Zero Values**: Check if zeros are valid (e.g., `Charge Amount=0`)
  3. **Convert Categories**: Change numeric-categoricals to `category` dtype
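
To illustrate the scaling idea, here is a pandas-only sketch of robust scaling (center on the median, divide by the IQR). The helper `robust_scale` and the `sample` values are made up for illustration; scikit-learn's `RobustScaler` applies the same idea by default.

```python
import pandas as pd

# Made-up values for illustration, spanning the reported 0 to 17,090 range.
sample = pd.DataFrame({
    "Seconds of Use": [0, 1000, 3000, 6000, 17090],
})

def robust_scale(series):
    """Center on the median and divide by the IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    if iqr == 0:
        # Avoid division by zero for near-constant columns.
        return series - series.median()
    return (series - series.median()) / iqr

scaled = robust_scale(sample["Seconds of Use"])
print(scaled)
```

Unlike min-max scaling, the largest value does not dictate the scale, which suits columns with a long right tail.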

## Step 19: Outlier Detection (Using IQR Method)

We will use the Interquartile Range (IQR) method to detect outliers in each numeric column:
- Calculate Q1 (25th percentile) and Q3 (75th percentile).
- Compute IQR = Q3 - Q1.
- Define lower bound = Q1 - 1.5 * IQR, upper bound = Q3 + 1.5 * IQR.
- Count data points outside these bounds as outliers.
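
As a quick arithmetic check of the formulas above, the sketch below computes the fences from a pair of quartiles. The helper `iqr_bounds` is a hypothetical name; the quartile values are the 25% and 75% figures from the summary statistics.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# `Frequency of SMS`: 25% = 6, 75% = 87, so IQR = 81.
lower, upper = iqr_bounds(6.0, 87.0)
print(lower, upper)  # -115.5 208.5

# `Status`: both quartiles are 1, so IQR = 0 and both fences collapse to 1.
print(iqr_bounds(1.0, 1.0))  # (1.0, 1.0)
```

The second case shows why the method flags every non-majority value in a binary column.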


In [2]:
# Select numeric columns
numeric_cols = df.select_dtypes(include=['number']).columns

# Create a DataFrame to store outlier info
outlier_summary = []

for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = ((df[col] < lower) | (df[col] > upper)).sum()
    outlier_summary.append([col, Q1, Q3, lower, upper, outliers])

# Convert to DataFrame
outlier_df = pd.DataFrame(outlier_summary, columns=['Column','Q1','Q3','Lower Bound','Upper Bound','Outlier Count'])
outlier_df = outlier_df.sort_values(by='Outlier Count', ascending=False)
outlier_df


Unnamed: 0,Column,Q1,Q3,Lower Bound,Upper Bound,Outlier Count
10,Status,1.0,1.0,1.0,1.0,782
11,Age,25.0,30.0,17.5,37.5,688
13,Churn,0.0,0.0,0.0,0.0,495
6,Frequency of SMS,6.0,87.0,-115.5,208.5,368
3,Charge Amount,0.0,1.0,-1.5,2.5,355
9,Tariff Plan,1.0,1.0,1.0,1.0,245
1,Complains,0.0,0.0,0.0,0.0,241
2,Subscription Length,29.0,38.0,15.5,51.5,204
4,Seconds of Use,1391.25,6478.25,-6239.25,14108.75,200
8,Age Group,2.0,3.0,0.5,4.5,170


### Observation on Outlier Analysis

- **Highest Outlier Counts:**
    - `Status` (782 outliers) and `Age` (688 outliers) have the largest number of values outside the IQR bounds.
        - For `Status`, Q1 and Q3 are both 1, so the IQR is 0 and every row with `Status` = 2 is flagged. `Age` has only 5 distinct values, so whole age bands fall outside the 17.5 to 37.5 bounds. The IQR method is not meaningful for binary or ordinal columns like these.
    - `Churn` also shows 495 outliers for the same reason (binary target with IQR = 0). This equals the number of churned customers and should be ignored for outlier handling.
- **Significant Outliers in Numeric Features:**
    - `Frequency of SMS` (368), `Charge Amount` (355), `Seconds of Use` (200), and `Customer Value` (109) suggest right-skewed distributions.
    - These columns have wide ranges, so we should **consider capping extreme values** or **applying transformations (e.g., log or square root)** in Phase-2.
- **Other Columns:**
    - `Subscription Length`, `Frequency of use`, `Distinct Called Numbers` show moderate outliers, which can also influence model performance.
- **Key Takeaways for Phase-2:**
    - Do not treat binary or categorical columns (Status, Tariff Plan, Churn, Complains) with outlier removal.
    - For continuous columns with high skew and outliers (`Charge Amount`, `Seconds of Use`, `Frequency of SMS`, `Customer Value`), apply **scaling or capping strategies**.
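
A possible capping strategy is sketched below: skip low-cardinality columns, then clip the rest to their IQR fences instead of dropping rows. The helper `cap_outliers`, the distinct-value threshold of 5, and the `sample` data are all illustrative assumptions.

```python
import pandas as pd

# Made-up rows for illustration: one extreme SMS count and a binary Status column.
sample = pd.DataFrame({
    "Frequency of SMS": [5, 7, 6, 8, 9, 10, 4, 400],
    "Status": [1, 1, 1, 1, 1, 1, 2, 2],
})

def cap_outliers(series, k=1.5):
    """Clip values to the IQR fences instead of removing rows."""
    values = series.astype(float)
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Only treat columns with enough distinct values; the threshold is an arbitrary choice.
continuous_cols = [c for c in sample.columns if sample[c].nunique() > 5]

capped = sample.copy()
for col in continuous_cols:
    capped[col] = cap_outliers(sample[col])

print(continuous_cols)
print(capped)
```

Here only the value 400 is pulled down to the upper fence, and `Status` is left untouched.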


## Step 20: Correlation check 
We use `.corr()` with pandas.


In [97]:
df.corr(numeric_only=True)

Unnamed: 0,Call Failure,Complains,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Frequency of SMS,Distinct Called Numbers,Age Group,Tariff Plan,Status,Age,Customer Value,Churn
Call Failure,1.0,0.152193,0.17933,0.589319,0.497351,0.569067,-0.020579,0.500669,0.057325,0.194568,-0.109269,0.04793,0.123214,-0.003552
Complains,0.152193,1.0,-0.014432,-0.032041,-0.104952,-0.090774,-0.111633,-0.058199,0.019976,0.00114,0.271405,0.003298,-0.134836,0.532053
Subscription Length,0.17933,-0.014432,1.0,0.075002,0.119624,0.102308,0.070215,0.087183,0.017252,-0.156441,0.142545,-0.004499,0.104664,-0.017115
Charge Amount,0.589319,-0.032041,0.075002,1.0,0.445871,0.37807,0.089195,0.4147,0.280829,0.323227,-0.357034,0.280313,0.163059,-0.201088
Seconds of Use,0.497351,-0.104952,0.119624,0.445871,1.0,0.946489,0.102123,0.676536,0.02006,0.133593,-0.460618,0.020843,0.410588,-0.298935
Frequency of use,0.569067,-0.090774,0.102308,0.37807,0.946489,1.0,0.100019,0.736114,-0.032544,0.206452,-0.454752,-0.02835,0.399349,-0.303337
Frequency of SMS,-0.020579,-0.111633,0.070215,0.089195,0.102123,0.100019,1.0,0.07965,-0.053719,0.195686,-0.296164,-0.092798,0.926271,-0.220754
Distinct Called Numbers,0.500669,-0.058199,0.087183,0.4147,0.676536,0.736114,0.07965,1.0,0.020941,0.172079,-0.413039,0.051037,0.280348,-0.278867
Age Group,0.057325,0.019976,0.017252,0.280829,0.02006,-0.032544,-0.053719,0.020941,1.0,-0.150593,0.002506,0.960758,-0.178256,-0.01455
Tariff Plan,0.194568,0.00114,-0.156441,0.323227,0.133593,0.206452,0.195686,0.172079,-0.150593,1.0,-0.164143,-0.119426,0.250963,-0.105853


In [3]:
# Full correlation matrix
corr_matrix = df.corr(numeric_only=True)

# Display top correlations with Churn
churn_corr = corr_matrix["Churn"].drop("Churn").sort_values(ascending=False)
print("Top Positive Correlations with Churn:")
print(churn_corr[churn_corr > 0])
print("\nTop Negative Correlations with Churn:")
print(churn_corr[churn_corr < 0])

# Display correlation matrix
corr_matrix


Top Positive Correlations with Churn:
Complains    0.532053
Status       0.498976
Name: Churn, dtype: float64

Top Negative Correlations with Churn:
Call Failure              -0.003552
Age Group                 -0.014550
Subscription Length       -0.017115
Age                       -0.017705
Tariff Plan               -0.105853
Charge Amount             -0.201088
Frequency of SMS          -0.220754
Distinct Called Numbers   -0.278867
Customer Value            -0.288013
Seconds of Use            -0.298935
Frequency of use          -0.303337
Name: Churn, dtype: float64



### Observation
This gives correlations between numeric columns. Useful to see relationships.


**Correlation Matrix Insights:**

- **Strongest Churn Correlations**:
  - Positive: `Complains` (0.53), `Status` (0.50)
  - Negative: `Frequency of use` (-0.30), `Seconds of Use` (-0.30), `Customer Value` (-0.29)

- **Key Feature Relationships**:
  - `Frequency of SMS` ↔ `Customer Value`: 0.93 (Very strong)
  - `Seconds of Use` ↔ `Frequency of use`: 0.95 (Near-perfect)
  - `Age Group` ↔ `Age`: 0.96 (Expected redundancy)

- **Actionable Findings**:
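  1. **Strongest Linear Associations with Churn**: `Complains` and `Status`; these are candidates to examine first, although correlation alone does not establish predictive value
  2. **Multicollinearity Alert**: Consider dropping one of:
     - `Frequency of use` (duplicates `Seconds of Use`)
     - `Age` (duplicates `Age Group`)
     - `Frequency of SMS` or `Customer Value` (correlated at 0.93)
  3. **Negative Indicators**: High usage customers (`Seconds of Use`, `Customer Value`) tend to churn less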
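
Highly correlated pairs can also be listed programmatically instead of read off the matrix. The sketch below uses made-up data; the helper `high_corr_pairs` and the 0.9 threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up data: `b` is almost a multiple of `a`, `c` is only weakly related.
sample = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

def high_corr_pairs(frame, threshold=0.9):
    """List column pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = frame.corr(numeric_only=True).abs()
    # Keep the upper triangle only so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

result = high_corr_pairs(sample)
print(result)
```

Applied to the full dataset, a helper like this would be expected to surface the three pairs listed above.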

## Step 21: Final Summary
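- Dataset loaded: 3,150 rows × 14 columns, all numeric.  
- Structure, data types, and summary statistics reviewed.  
- Missing values found in 4 columns (756 rows affected); 194 exact duplicate rows identified.  
- Encoded categorical columns (`Tariff Plan`, `Status`, `Age Group`) explored.  
- Target variable checked: about 16% of customers churned, so the classes are imbalanced.  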

This completes **Phase 1: Data Exploration**. We now have a clear understanding of the dataset.
