# Exploratory Data Analysis: Employee Churn

## 1. Introduction

> In this notebook, we perform an initial exploratory data analysis (EDA) of the Employee Churn dataset. This step is crucial for understanding the structure, content, and quality of the data and for formulating early hypotheses.

Objectives:
- Understand the structure of the data
- Detect missing values and data types
- Identify the data-cleaning steps to perform next
- Generate descriptive statistics
- Explore distributions and potential relationships

## 2. Load Libraries

In [None]:
# Standard-library and third-party imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pathlib import Path
import sys

# Add the project root to sys.path
project_root = Path.cwd().parent  # from notebooks/ to root
sys.path.append(str(project_root))

# Custom modules
from src.config import RAW_EMPLOYEE_CHURN_FILE
from src.data.load_data import load_data

%matplotlib inline

## 3. Load Data

In [None]:
df = load_data(RAW_EMPLOYEE_CHURN_FILE)
df

## 4. Explore dataset

In [None]:
df.info()

## 5. Data types and missing values

In [None]:
# Missing values
df.isna().sum().sort_values(ascending=False)
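
Since this section covers both data types and missing values, it can help to see them side by side. The following is a minimal sketch on a toy frame, not the real dataset: the column names `department`, `tenure_years`, and `churn` are hypothetical stand-ins, and in the notebook the toy frame would be replaced by the loaded `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loaded data; column names are hypothetical.
df = pd.DataFrame(
    {
        "department": ["sales", "it", None, "hr"],
        "tenure_years": [1.0, np.nan, 3.5, np.nan],
        "churn": [1, 0, 0, 1],
    }
)

# One row per column: dtype, missing count, and missing share in percent.
overview = pd.DataFrame(
    {
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
    }
).sort_values("n_missing", ascending=False)

print(overview)
```

Sorting by `n_missing` puts the columns that need the most cleaning attention at the top.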

## 6. Descriptive statistics
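
One possible starting point for this section is `DataFrame.describe()`, run separately for numeric and non-numeric columns. The sketch below uses a toy frame with hypothetical column names; it illustrates the approach and says nothing about the actual Employee Churn data.

```python
import pandas as pd

# Toy stand-in for the loaded data; column names are hypothetical.
df = pd.DataFrame(
    {
        "department": ["sales", "it", "sales", "hr"],
        "tenure_years": [1.0, 2.0, 3.0, 6.0],
        "churn": [1, 0, 0, 1],
    }
)

# Numeric columns: count, mean, std, min, quartiles, max (one row per column).
numeric_summary = df.describe().T

# Non-numeric columns: count, number of unique values, most frequent value, its frequency.
categorical_summary = df.select_dtypes(exclude="number").describe().T

print(numeric_summary)
print(categorical_summary)
```

Transposing with `.T` keeps one row per feature, which tends to be easier to read when the dataset has many columns.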

## 7. Analysis

### 7.1 Categorical features
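
For categorical features, two quick views are usually informative: how the rows are distributed across categories, and how the target rate differs between them. The sketch below assumes a 0/1 target column named `churn` and a categorical column named `department`; both names are hypothetical.

```python
import pandas as pd

# Toy stand-in for the loaded data; column names are hypothetical.
df = pd.DataFrame(
    {
        "department": ["sales", "it", "sales", "hr", "sales", "it"],
        "churn": [1, 0, 1, 0, 0, 0],
    }
)

# Share of rows per category; dropna=False keeps missing values visible.
shares = df["department"].value_counts(normalize=True, dropna=False)

# Churn rate per category (mean of a 0/1 target) together with the group size.
churn_by_dept = (
    df.groupby("department")["churn"]
    .agg(churn_rate="mean", n="count")
    .sort_values("churn_rate", ascending=False)
)

print(shares)
print(churn_by_dept)
```

Reporting the group size `n` next to the rate helps avoid over-reading rates that come from very small groups.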

### 7.2 Numerical features
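
For numerical features, skewness gives a quick signal of long tails, and comparing medians between churned and retained employees hints at which features may matter. This is a sketch under assumptions: `tenure_years`, `monthly_hours`, and a 0/1 `churn` column are hypothetical names on a toy frame.

```python
import pandas as pd

# Toy stand-in for the loaded data; column names are hypothetical.
df = pd.DataFrame(
    {
        "tenure_years": [1.0, 2.0, 2.0, 3.0, 12.0],
        "monthly_hours": [150, 160, 170, 180, 190],
        "churn": [1, 1, 0, 0, 0],
    }
)

numeric_cols = ["tenure_years", "monthly_hours"]

# Positive skew indicates a long right tail; values near 0 indicate symmetry.
skewness = df[numeric_cols].skew()

# Median of each numeric feature, split by the target.
medians_by_churn = df.groupby("churn")[numeric_cols].median()

print(skewness)
print(medians_by_churn)
```

Medians are used here rather than means because they are less sensitive to the extreme values that skewed features tend to contain.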

## 8. Insights and relations
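
A simple first look at relationships is the Pearson correlation of each numeric feature with the target. The sketch below assumes a 0/1 `churn` column and uses hypothetical feature names on a toy frame; correlations found this way are only hints and do not establish causation.

```python
import pandas as pd

# Toy stand-in for the loaded data; column names are hypothetical.
df = pd.DataFrame(
    {
        "tenure_years": [1, 2, 3, 4, 5, 6],
        "monthly_hours": [200, 190, 185, 170, 160, 150],
        "churn": [1, 1, 1, 0, 0, 0],
    }
)

# Pairwise Pearson correlations between numeric columns only.
corr = df.select_dtypes(include="number").corr()

# Correlation of each feature with the target, strongest (by absolute value) first.
target_corr = corr["churn"].drop("churn").sort_values(key=abs, ascending=False)

print(target_corr)
```

Sorting by absolute value keeps strong negative relationships from being buried at the bottom of the list.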

## 9. Data Quality notes

Missing Data:
- Consider imputing X with median or model-based imputation
- Y may be dropped or simplified (e.g., extract from Z)
- YY can be filled with mode
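
The median and mode options listed above could be sketched as follows. The columns `tenure_years` and `department` are hypothetical stand-ins for the placeholders X and YY; model-based imputation is not shown.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loaded data; column names are hypothetical.
df = pd.DataFrame(
    {
        "tenure_years": [1.0, np.nan, 3.0, 8.0],
        "department": ["sales", None, "sales", "it"],
    }
)

# Work on a copy so the raw data stays untouched.
imputed = df.copy()

# Numeric column: fill gaps with the median of the observed values.
imputed["tenure_years"] = imputed["tenure_years"].fillna(
    imputed["tenure_years"].median()
)

# Categorical column: fill gaps with the most frequent value (first mode on ties).
imputed["department"] = imputed["department"].fillna(
    imputed["department"].mode().iloc[0]
)

print(imputed)
```

If the data is later split into training and test sets, the fill values would normally be computed on the training portion only, to avoid leaking information from the test set.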

Outliers:
- ZZ shows significant outliers and might need a log transform or capping.
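
As a rough illustration of the options above, the sketch below flags outliers with the common 1.5 × IQR rule, then shows capping and a log transform. The series name `monthly_income` is a hypothetical stand-in for ZZ, and the 1.5 multiplier is a convention, not a value taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a skewed numeric feature; the name is hypothetical.
s = pd.Series([40, 42, 45, 47, 50, 300], name="monthly_income")

# Interquartile-range fences.
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences.
is_outlier = (s < lower) | (s > upper)

# Option 1: cap values at the fences.
capped = s.clip(lower=lower, upper=upper)

# Option 2: compress the right tail with log(1 + x); requires non-negative values.
logged = np.log1p(s)

print(is_outlier.sum(), capped.max())
```

Capping keeps the feature on its original scale, while the log transform changes the scale but preserves the ordering of all values.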

## 10. Next steps
- Handle missing values
- Feature engineering
- Correlation analysis
- Prepare for modeling (data cleaning, encoding, scaling)

## 📌 Summary
- Data loaded successfully; key features explored
- Identified missing values and basic distributions
- Early patterns in churn: to be documented from the analysis in Sections 7 and 8
- Several data quality issues flagged

## Save processed data
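
A minimal sketch of this step, assuming the processed frame is written as CSV. The output location and file name are hypothetical (the project's `src.config` may already define a suitable path), and a temporary directory is used here only so the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the processed data; column names are hypothetical.
df = pd.DataFrame({"tenure_years": [1.0, 3.0], "churn": [1, 0]})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output path; replace with the project's processed-data location.
    out_path = Path(tmp) / "processed" / "employee_churn.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # index=False avoids writing the row index as an extra column.
    df.to_csv(out_path, index=False)

    # Read the file back to confirm the round trip.
    reloaded = pd.read_csv(out_path)

print(reloaded)
```

CSV does not preserve dtypes such as categoricals or datetimes, so a format like Parquet may be worth considering if those matter downstream.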