# Advanced Exploratory Data Analysis (EDA)

This notebook walks through exploratory data analysis (EDA) of a customer churn dataset (7,043 rows, 21 columns). The goal is to assess data quality and understand the features before preparing them for machine learning models.

## Table of Contents
1. [Library Imports](#library-imports)
2. [Data Loading and Initial Exploration](#data-loading-and-initial-exploration)
3. [Data Quality Assessment](#data-quality-assessment)
4. [Univariate Analysis](#univariate-analysis)
5. [Bivariate Analysis](#bivariate-analysis)
6. [Multivariate Analysis](#multivariate-analysis)
7. [Feature Engineering Insights](#feature-engineering-insights)
8. [Statistical Testing](#statistical-testing)
9. [Summary and Recommendations](#summary-and-recommendations)

## Library Imports

In [72]:
import pandas as pd
import numpy as np
import warnings
import os

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import (
    StandardScaler,
    MinMaxScaler,
    OneHotEncoder,
    LabelEncoder,
    OrdinalEncoder
)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    classification_report,
    roc_auc_score,
    precision_recall_curve,
    auc
)

from sklearn.model_selection import (
    train_test_split,
    StratifiedKFold,
    GridSearchCV,
    RandomizedSearchCV
)

from scipy.stats import chi2_contingency, ttest_ind
from imblearn.over_sampling import SMOTE
import joblib

# Configuration
sns.set_style('whitegrid')
warnings.filterwarnings('ignore', category=FutureWarning)
pd.set_option('display.max_columns', None)

print("All necessary libraries have been imported successfully.")

All necessary libraries have been imported successfully.


## Data Loading and Initial Exploration

In [73]:
# The file carries an .xls extension but holds comma-separated text, so read_csv is used.
df = pd.read_csv("data/raw/dataset.xls")

print(f"Dataset shape: {df.shape}")
print(f"\nColumn names: {list(df.columns)}")
df.head()

Dataset shape: (7043, 21)

Column names: ['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']


|   | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


## Data Quality Assessment

In [74]:
print(df.shape)
print(df.dtypes)


(7043, 21)
customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object


In [75]:

# Missing values per column (counts NaN only; blank strings are not flagged)
print(df.isnull().sum())


customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
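
A zero count from `isnull()` does not rule out missing data stored as text. `TotalCharges` is numeric in meaning but has dtype `object` in the listing above, which often points to blank or non-numeric strings. The following is a minimal sketch on a small made-up frame (not the real dataset) of how such values could be surfaced with `pd.to_numeric(errors="coerce")`; the toy values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data; the blank string stands in for a value that isnull() misses.
toy = pd.DataFrame({
    "tenure": [1, 34, 0],
    "TotalCharges": ["29.85", "1889.5", " "],
})

# isnull() only sees strings here, so it reports nothing missing.
missing_before = int(toy["TotalCharges"].isnull().sum())

# Coerce to numbers; anything that cannot be parsed becomes NaN.
total_numeric = pd.to_numeric(toy["TotalCharges"], errors="coerce")
missing_after = int(total_numeric.isnull().sum())

print(f"NaN before coercion: {missing_before}, after coercion: {missing_after}")
```

Running the same coercion on `df["TotalCharges"]` would show whether the real column hides any such values; the count is not assumed here.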


In [76]:
# Duplicates
print(df.duplicated().sum())


0
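
`df.duplicated()` compares entire rows, so two records for the same customer with slightly different values would not be counted. A hedged sketch of a key-based check on made-up data, assuming `customerID` is the intended unique key:

```python
import pandas as pd

# Hypothetical toy data: customer "A1" appears twice with different charges.
toy = pd.DataFrame({
    "customerID": ["A1", "B2", "A1"],
    "MonthlyCharges": [29.85, 56.95, 30.10],
})

# Full-row comparison finds no duplicates because the charges differ.
full_row_dupes = int(toy.duplicated().sum())

# Restricting the comparison to the key column reveals the repeated customer.
key_dupes = int(toy.duplicated(subset="customerID").sum())

print(f"Full-row duplicates: {full_row_dupes}, duplicate IDs: {key_dupes}")
```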


In [77]:
# Summary stats for numeric columns (TotalCharges is excluded because its dtype is object)
print(df.describe())


       SeniorCitizen       tenure  MonthlyCharges
count    7043.000000  7043.000000     7043.000000
mean        0.162147    32.371149       64.761692
std         0.368612    24.559481       30.090047
min         0.000000     0.000000       18.250000
25%         0.000000     9.000000       35.500000
50%         0.000000    29.000000       70.350000
75%         0.000000    55.000000       89.850000
max         1.000000    72.000000      118.750000


In [78]:
# Unique values per column
print(df.nunique())


customerID          7043
gender                 2
SeniorCitizen          2
Partner                2
Dependents             2
tenure                73
PhoneService           2
MultipleLines          3
InternetService        3
OnlineSecurity         3
OnlineBackup           3
DeviceProtection       3
TechSupport            3
StreamingTV            3
StreamingMovies        3
Contract               3
PaperlessBilling       2
PaymentMethod          4
MonthlyCharges      1585
TotalCharges        6531
Churn                  2
dtype: int64
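
Unique-value counts like these can be used to sort columns into rough roles before analysis. The helper below is a sketch only: the function name `classify_columns` and both thresholds are hypothetical choices, not part of the original notebook, and it is shown on made-up data.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def classify_columns(frame, id_ratio=0.95, max_categories=10):
    """Group columns by cardinality. Both thresholds are illustrative assumptions."""
    groups = {"identifier": [], "binary": [], "categorical": [], "continuous": []}
    n_rows = len(frame)
    for col in frame.columns:
        n_unique = frame[col].nunique()
        if n_unique == 2:
            groups["binary"].append(col)
        elif n_unique <= max_categories:
            groups["categorical"].append(col)
        elif not is_numeric_dtype(frame[col]) and n_unique >= id_ratio * n_rows:
            # Non-numeric and nearly unique per row: most likely a key.
            groups["identifier"].append(col)
        else:
            groups["continuous"].append(col)
    return groups


# Hypothetical toy data mirroring a few of the column types above.
toy = pd.DataFrame({
    "customerID": ["a", "b", "c", "d", "e", "f"],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "Contract": ["Month-to-month", "One year", "Two year"] * 2,
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70, 99.65],
})

groups = classify_columns(toy, max_categories=4)
print(groups)
```

A numeric column stored as text (such as `TotalCharges` here) would need conversion first, otherwise a rule like this one may misplace it.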


In [79]:
# Mean and standard deviation per numeric column, as a baseline for spotting outliers
for col in df.select_dtypes(include='number'):
    print(col, '\t ,\t', df[col].mean(), '\t', df[col].std())

SeniorCitizen 	 ,	 0.1621468124378816 	 0.3686116056100131
tenure 	 ,	 32.37114865824223 	 24.55948102309446
MonthlyCharges 	 ,	 64.76169246059918 	 30.090047097678493
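
Mean and standard deviation summarize spread but do not flag individual rows. One common rule of thumb is the interquartile range (IQR) fence; the sketch below applies it to made-up values. The helper name `iqr_outlier_mask` and the multiplier `k=1.5` are assumptions (1.5 is the conventional default, not a value taken from this notebook).

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical toy values with one obviously extreme charge.
toy = pd.Series([20.0, 25.0, 30.0, 35.0, 40.0, 500.0], name="MonthlyCharges")

mask = iqr_outlier_mask(toy)
outliers = toy[mask].tolist()
print(outliers)
```

The same function could be applied to `df["tenure"]` or `df["MonthlyCharges"]`; a binary flag such as `SeniorCitizen` is better left out, since the rule is meant for continuous values.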


## Univariate Analysis

In [80]:
# Individual feature analysis will go here
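
As one possible starting point for this section, the sketch below tabulates the class balance of a target column using `value_counts`. It runs on made-up data; the proportions are not those of the real `Churn` column.

```python
import pandas as pd

# Hypothetical toy target column.
toy = pd.DataFrame({"Churn": ["No", "No", "Yes", "No"]})

# Absolute counts and relative shares side by side.
counts = toy["Churn"].value_counts()
shares = toy["Churn"].value_counts(normalize=True)
summary = pd.DataFrame({"count": counts, "share": shares})

print(summary)
```

A strongly imbalanced target would affect both metric choice and whether resampling is worth considering.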

## Bivariate Analysis

In [81]:
# Pairwise feature relationships will go here
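
A simple bivariate view for a categorical feature against the target is a row-normalized cross-tabulation, which gives the churn rate within each category. The sketch below uses made-up data, so the rates shown say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data: contract type versus churn outcome.
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["One year"] * 2,
    "Churn": ["Yes", "Yes", "No", "No", "No", "No"],
})

# normalize="index" makes each row sum to 1, i.e. shares within a contract type.
rate_table = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")

print(rate_table)
```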

## Multivariate Analysis

In [82]:
# Complex feature interactions will go here
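
To look at two features jointly against the target, one option is a pivot table of the mean churn flag. The sketch below assumes a 0/1 column named `churned` (a hypothetical name, derived from `Churn` in practice) and uses made-up data.

```python
import pandas as pd

# Hypothetical toy data; "churned" is an assumed 0/1 encoding of the target.
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["One year"] * 4,
    "InternetService": ["DSL", "DSL", "Fiber optic", "Fiber optic"] * 2,
    "churned": [0, 1, 1, 1, 0, 0, 0, 1],
})

# Mean of a 0/1 flag is the churn rate for each feature combination.
pivot = toy.pivot_table(
    index="Contract",
    columns="InternetService",
    values="churned",
    aggfunc="mean",
)

print(pivot)
```

Cells backed by only a handful of rows are noisy, so it helps to inspect group sizes alongside the rates.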

## Feature Engineering Insights

In [83]:
# Feature creation and transformation insights will go here
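
One transformation that could be explored here is binning `tenure` into groups with `pd.cut`. The bin edges and labels below are illustrative assumptions, chosen only to cover the 0 to 72 range reported by `describe()`; the data is made up.

```python
import pandas as pd

# Hypothetical toy tenures spanning the observed 0-72 range.
toy = pd.DataFrame({"tenure": [0, 5, 12, 30, 72]})

# Assumed bin edges (in months) and labels; adjust to whatever grouping suits the analysis.
bins = [0, 12, 24, 48, 72]
labels = ["0-12", "13-24", "25-48", "49-72"]

# include_lowest=True keeps tenure == 0 inside the first bin.
toy["tenure_group"] = pd.cut(
    toy["tenure"], bins=bins, labels=labels, include_lowest=True
)

print(toy)
```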

## Statistical Testing

In [84]:
# Statistical significance tests will go here
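
The imports at the top of the notebook include `chi2_contingency` and `ttest_ind`. As a sketch of how they might be used, the example below tests a categorical feature against the target with a chi-square test and compares a numeric feature between the two target groups with Welch's t-test. The data is made up and far too small for reliable inference; it only shows the call pattern.

```python
import pandas as pd
from scipy.stats import chi2_contingency, ttest_ind

# Hypothetical toy data.
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 6 + ["Two year"] * 6,
    "Churn": ["Yes"] * 5 + ["No"] + ["No"] * 5 + ["Yes"],
    "MonthlyCharges": [70.7, 85.2, 90.1, 79.9, 95.4, 45.0,
                       29.9, 35.5, 42.3, 25.1, 56.9, 88.0],
})

# Chi-square test of independence between two categorical columns.
contingency = pd.crosstab(toy["Contract"], toy["Churn"])
chi2, p_chi2, dof, expected = chi2_contingency(contingency)

# Welch's t-test (equal_var=False) on a numeric column split by the target.
churned = toy.loc[toy["Churn"] == "Yes", "MonthlyCharges"]
retained = toy.loc[toy["Churn"] == "No", "MonthlyCharges"]
t_stat, p_t = ttest_ind(churned, retained, equal_var=False)

print(f"chi2={chi2:.3f}, p={p_chi2:.4f}, dof={dof}")
print(f"t={t_stat:.3f}, p={p_t:.4f}")
```

With expected cell counts this small, the chi-square approximation is unreliable; on the full dataset that concern is much less likely to apply.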

## Summary and Recommendations

### Key Findings
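- The dataset has 7,043 rows and 21 columns, with no NaN values and no duplicated rows.
- `TotalCharges` is stored as `object` rather than a numeric dtype, so it is absent from `describe()` and needs conversion before modeling.
- `customerID` has 7,043 distinct values, one per row, so it serves as an identifier rather than a feature.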

### Recommendations for Model Development
- Recommendation 1
- Recommendation 2
- Recommendation 3

### Next Steps
- Move to advanced model pipeline development
- Consider ensemble methods based on EDA insights
- Address any data quality issues identified