# 1. Data Preparation

Step 1. Load the dataset
The dataset was successfully loaded from a CSV file using pandas. Displaying the first rows (head()) confirms that the data was loaded correctly, with 35 features and an expected tabular structure.

In [4]:
import pandas as pd

df = pd.read_csv("IBM_HR_Employee_Attrition.csv")

In [14]:
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


Step 2. Dataset Structure and Data Types
The df.info() output indicates that the dataset contains 1470 observations and 35 features. No missing values are present across the columns, and data types are correctly defined, confirming that the data is ready for further analysis.

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

Step 3. Create and Verify Dataset Copy
A copy of the dataset (df_copy) was created to safely perform the analysis without modifying the original data. Displaying the first rows confirms that the copy was created correctly and the data structure is preserved.

In [6]:
df_copy = df.copy()

In [16]:
df_copy.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


In [8]:
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

Step 4. Missing Values Check
The missing values check shows that no missing values are present across all columns. This indicates that the dataset does not require additional missing data handling and is fully ready for further analysis.

In [18]:
df_copy.isnull().sum().sort_values(ascending=False)

Age                         0
StandardHours               0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StockOptionLevel            0
MonthlyIncome               0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSinceLastPromotion     0
MonthlyRate                 0
MaritalStatus               0
Attrition                   0
EmployeeCount               0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeNumber              0
JobSatisfaction             0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole   
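
When a dataset does contain gaps, it helps to see the count and the share of missing values side by side. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are invented for illustration and are not taken from the HR dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; unlike df_copy, it deliberately contains gaps.
toy = pd.DataFrame({
    "Age": [41, 49, np.nan, 33],
    "Department": ["Sales", None, "Sales", None],
    "DailyRate": [1102, 279, 1373, 1392],
})

# Count and percentage of missing values per column, worst column first.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
}).sort_values("n_missing", ascending=False)

print(missing)
```

Applied to `df_copy`, both columns of such a summary would be zero throughout, which matches the output above.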

Step 5. Duplicate Records Check
The duplicate check indicates that no duplicate records are present in the dataset. This confirms the uniqueness of observations and the overall integrity of the data.

In [20]:
df_copy.duplicated().sum()

0
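
`duplicated()` with no arguments only flags rows that match in every column. A complementary check, sketched here on a hypothetical toy frame with invented values, looks for repeated identifiers; it assumes the check runs while the `EmployeeNumber` column is still present:

```python
import pandas as pd

# Hypothetical toy frame: the identifier 2 appears twice, but the rows differ.
toy = pd.DataFrame({
    "EmployeeNumber": [1, 2, 2, 4],
    "Age": [41, 49, 37, 33],
    "Attrition": ["Yes", "No", "Yes", "No"],
})

# Rows that are identical in every column.
full_row_dupes = int(toy.duplicated().sum())

# Rows that reuse an identifier, regardless of the other columns.
id_dupes = int(toy.duplicated(subset=["EmployeeNumber"]).sum())

print("Full-row duplicates:", full_row_dupes)
print("Repeated identifiers:", id_dupes)
```

In this toy example the full-row check finds nothing, while the identifier check flags one repeated `EmployeeNumber`.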

Step 6. Inspect the Target Variable (Attrition)
The Attrition variable was selected as the target variable because it represents the key outcome of interest — whether an employee leaves the organization or remains employed. From a business perspective, the central question is why employees leave and which factors contribute to attrition, as high turnover impacts costs, workforce stability, and organizational performance. Additionally, Attrition is a binary variable (Yes/No), making it suitable for exploratory analysis and potential predictive modeling.
The distribution of the target variable indicates that most employees remain in the company (≈84%), while around 16% experience attrition. This reflects a moderate class imbalance that should be considered in further analysis.

In [22]:
df_copy["Attrition"].value_counts()
df_copy["Attrition"].value_counts(normalize=True)

Attrition
No     0.838776
Yes    0.161224
Name: proportion, dtype: float64
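
A notebook cell displays only its last expression, which is why the cell above shows proportions but not raw counts. One way to see both at once is to combine them into a single table. This is a sketch on a hypothetical six-row series, not on the real data:

```python
import pandas as pd

# Hypothetical toy target column: five "No" and one "Yes".
attrition = pd.Series(["No"] * 5 + ["Yes"], name="Attrition")

# Counts and shares side by side, aligned on the class label.
summary = pd.DataFrame({
    "count": attrition.value_counts(),
    "share": attrition.value_counts(normalize=True).round(3),
})

# Majority-to-minority ratio as a simple imbalance indicator.
imbalance_ratio = summary["count"].max() / summary["count"].min()

print(summary)
print("Imbalance ratio:", imbalance_ratio)
```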

Step 7. Identify non-informative columns
The analysis of unique values shows that some features have only one unique value (e.g., Over18, StandardHours, EmployeeCount) and therefore do not provide analytical value. These non-informative columns can be safely excluded from further analysis, while the remaining features exhibit sufficient variability to be potentially informative.

In [24]:
df_copy.nunique().sort_values()

Over18                         1
StandardHours                  1
EmployeeCount                  1
Gender                         2
Attrition                      2
PerformanceRating              2
OverTime                       2
MaritalStatus                  3
Department                     3
BusinessTravel                 3
StockOptionLevel               4
EnvironmentSatisfaction        4
JobInvolvement                 4
JobSatisfaction                4
RelationshipSatisfaction       4
WorkLifeBalance                4
Education                      5
JobLevel                       5
EducationField                 6
TrainingTimesLastYear          7
JobRole                        9
NumCompaniesWorked            10
PercentSalaryHike             15
YearsSinceLastPromotion       16
YearsWithCurrManager          18
YearsInCurrentRole            19
DistanceFromHome              29
YearsAtCompany                37
TotalWorkingYears             40
Age                           43
HourlyRate
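
Instead of reading constant columns off the sorted output, they can be collected programmatically. The sketch below uses a hypothetical toy frame that reuses a few column names from the dataset with invented values:

```python
import pandas as pd

# Hypothetical toy frame with three constant columns and one varying column.
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
    "EmployeeCount": [1, 1, 1],
})

# Columns with at most one distinct value carry no information.
n_unique = toy.nunique(dropna=False)
constant_cols = n_unique[n_unique <= 1].index.tolist()

# Drop them in one step.
reduced = toy.drop(columns=constant_cols)

print("Constant columns:", constant_cols)
print("Remaining columns:", reduced.columns.tolist())
```

Deriving the list this way keeps the drop step in sync with what the unique-value analysis actually found.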

Step 8. Remove irrelevant features
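Remove columns that carry no analytical signal to reduce noise and improve analytical clarity. The list below drops the constant columns Over18 and StandardHours together with EmployeeNumber, which is a row identifier. EmployeeCount, although constant (see Step 7), is not in the list and therefore remains in the dataset, as the later outputs show.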

In [26]:
cols_to_drop = ["EmployeeNumber", "Over18", "StandardHours"]
df_copy = df_copy.drop(columns=cols_to_drop, errors="ignore")

Step 9. Convert categorical features to category type
Convert categorical variables to the appropriate data type for efficient analysis and grouping.

In [28]:
obj_cols = df_copy.select_dtypes(include="object").columns

for col in obj_cols:
    df_copy[col] = df_copy[col].astype("category")
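
The same conversion can be written without an explicit loop, and its effect on memory can be measured. This is a sketch on a hypothetical toy frame; the exact savings on the real dataset will differ:

```python
import pandas as pd

# Hypothetical toy frame with two repetitive text columns stored as object.
toy = pd.DataFrame({
    "Department": ["Sales", "Research & Development"] * 500,
    "OverTime": ["Yes", "No"] * 500,
    "Age": [41, 49] * 500,
}).astype({"Department": object, "OverTime": object})

obj_cols = toy.select_dtypes(include="object").columns

# Memory footprint before conversion (deep=True counts the strings themselves).
before = toy.memory_usage(deep=True).sum()

# Convert all object columns in one call using a column-to-dtype mapping.
converted = toy.astype({col: "category" for col in obj_cols})
after = converted.memory_usage(deep=True).sum()

print(f"Memory before: {before} bytes, after: {after} bytes")
```

Categories store each distinct label once plus a compact integer code per row, so the benefit grows with the number of repeated values.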

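Step 9b. Define target, categorical, and numerical features
Separate the target from the categorical and numerical feature lists so that later steps can address each group by name.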

In [30]:
target = "Attrition"

categorical_cols = df_copy.select_dtypes(include="category").columns.tolist()
numerical_cols = df_copy.select_dtypes(include=["int64", "float64"]).columns.tolist()

if target in categorical_cols:
    categorical_cols.remove(target)

print("Target:", target)
print("Categorical features:", categorical_cols)
print("Numerical features:", numerical_cols)

Target: Attrition
Categorical features: ['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'OverTime']
Numerical features: ['Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EmployeeCount', 'EnvironmentSatisfaction', 'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']


Step 10. Check categorical feature cardinality
The categorical feature cardinality analysis shows that all categorical variables have a moderate number of unique values. This makes them suitable for interpretation, visualization, and further encoding without introducing excessive complexity.

In [32]:
df_copy[categorical_cols].nunique().sort_values(ascending=False)

JobRole           9
EducationField    6
BusinessTravel    3
Department        3
MaritalStatus     3
Gender            2
OverTime          2
dtype: int64

Step 11. Validate numerical feature ranges
The descriptive statistics of numerical features indicate that all values fall within realistic and expected ranges. No extreme or implausible values are observed, suggesting that the numerical data is well-formed and does not require immediate outlier treatment.

In [34]:
df_copy[numerical_cols].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,1470.0,36.92381,9.135373,18.0,30.0,36.0,43.0,60.0
DailyRate,1470.0,802.485714,403.5091,102.0,465.0,802.0,1157.0,1499.0
DistanceFromHome,1470.0,9.192517,8.106864,1.0,2.0,7.0,14.0,29.0
Education,1470.0,2.912925,1.024165,1.0,2.0,3.0,4.0,5.0
EmployeeCount,1470.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
EnvironmentSatisfaction,1470.0,2.721769,1.093082,1.0,2.0,3.0,4.0,4.0
HourlyRate,1470.0,65.891156,20.329428,30.0,48.0,66.0,83.75,100.0
JobInvolvement,1470.0,2.729932,0.711561,1.0,2.0,3.0,3.0,4.0
JobLevel,1470.0,2.063946,1.10694,1.0,1.0,2.0,3.0,5.0
JobSatisfaction,1470.0,2.728571,1.102846,1.0,2.0,3.0,4.0,4.0
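
A visual scan of `describe()` can be complemented with a rule-based count. The sketch below applies the common 1.5 × IQR rule, which is an assumption of this example and not a method used in the notebook, to a hypothetical toy frame with invented values:

```python
import pandas as pd

# Hypothetical toy frame; the last income value is deliberately extreme.
toy = pd.DataFrame({
    "Age": [18, 30, 36, 43, 60],
    "MonthlyIncome": [2000, 2500, 3000, 3500, 20000],
})

# Interquartile range per column.
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# Flag values more than 1.5 * IQR outside the quartiles, then count per column.
outside = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)
outlier_counts = outside.sum()

print(outlier_counts)
```

A non-zero count is a prompt to look closer, not proof of bad data: skewed but valid variables such as income often exceed these bounds.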


Step 12. Final data readiness check
The final check confirms that the dataset contains 1470 observations and 32 features after preprocessing. No missing values are present, and data types are correctly defined for both numerical and categorical features. The dataset is fully prepared for further analysis and visualization.

In [36]:
df_copy.shape
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 32 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   Age                       1470 non-null   int64   
 1   Attrition                 1470 non-null   category
 2   BusinessTravel            1470 non-null   category
 3   DailyRate                 1470 non-null   int64   
 4   Department                1470 non-null   category
 5   DistanceFromHome          1470 non-null   int64   
 6   Education                 1470 non-null   int64   
 7   EducationField            1470 non-null   category
 8   EmployeeCount             1470 non-null   int64   
 9   EnvironmentSatisfaction   1470 non-null   int64   
 10  Gender                    1470 non-null   category
 11  HourlyRate                1470 non-null   int64   
 12  JobInvolvement            1470 non-null   int64   
 13  JobLevel                  1470 non-null   int64 

In [38]:
df_copy.to_csv("HR_Attrition_cleaned.csv", index=False)
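
CSV files do not store pandas dtypes, so the `category` columns created earlier come back as plain text when the file is read again. The following sketch shows the effect and one way to restore the dtype on load; it uses a hypothetical toy frame and an in-memory buffer in place of a file on disk:

```python
import io

import pandas as pd

# Hypothetical toy frame with one categorical column.
toy = pd.DataFrame({
    "Attrition": pd.Categorical(["Yes", "No", "No"]),
    "Age": [41, 49, 37],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading without hints loses the category dtype.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Passing dtype restores it for the named column.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"Attrition": "category"})

print(plain.dtypes)
print(restored.dtypes)
```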

Overall Data Preparation Conclusion

During the data preparation process, the dataset was successfully loaded, validated, and cleaned by removing non-informative features. No missing values or duplicate records were detected, data types were properly defined, and numerical features fall within realistic ranges. The target variable Attrition was appropriately identified and shows a moderate class imbalance. Overall, the dataset is fully prepared for subsequent exploratory analysis and visualization.

# 2. Visualizations

## Visualization 1: Employee Attrition Distribution

**Chart Type:** Bar Chart  
**X-axis:** Attrition (Yes / No)  
**Y-axis:** Number of Employees  

**Insight:**  
The majority of employees remain with the company, while attrition represents a smaller but meaningful share of the workforce.

## Visualization 2: Employee Attrition Rate (%)

**Chart Type:** Bar Chart  
**X-axis:** Attrition (Yes / No)  
**Y-axis:** Percentage of Employees  

**Insight:**  
Although the majority of employees remain with the company, approximately 16% experience attrition, indicating a meaningful level of employee turnover.

## Visualization 3: Employee Attrition by Department

**Chart Type:** Stacked Bar Chart  
**X-axis:** Department  
**Y-axis:** Number of Employees  
**Color:** Attrition (Yes / No)

**Insight:**  
Attrition is not evenly distributed across departments. Research & Development shows the highest absolute number of employee departures, which is largely driven by its significantly larger workforce. Sales also experiences noticeable attrition, while Human Resources has comparatively lower attrition levels, reflecting its smaller department size. 
Overall, attrition appears to scale with department size rather than being disproportionately concentrated in smaller teams.
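
Whether attrition scales with department size can be checked by comparing counts with within-department rates. The sketch below uses a hypothetical toy frame whose numbers are invented and say nothing about the real dataset:

```python
import pandas as pd

# Hypothetical toy data: 4 Sales, 8 R&D and 2 HR employees.
toy = pd.DataFrame({
    "Department": ["Sales"] * 4
    + ["Research & Development"] * 8
    + ["Human Resources"] * 2,
    "Attrition": ["Yes", "No", "No", "No"]
    + ["Yes", "Yes"] + ["No"] * 6
    + ["Yes", "No"],
})

# Absolute counts, as plotted in a stacked bar chart.
counts = pd.crosstab(toy["Department"], toy["Attrition"])

# Row-normalized shares: the attrition rate within each department.
rates = pd.crosstab(toy["Department"], toy["Attrition"], normalize="index")

print(counts)
print(rates.round(2))
```

In this toy example the largest department has the most departures, but the smallest one has the highest rate, which is exactly the distinction that counts alone cannot show.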

## Visualization 4: Employee Attrition by Job Role

**Chart Type:** Stacked Bar Chart  
**X-axis:** Number of Employees  
**Y-axis:** Job Role  
**Color:** Attrition (Yes / No)

**Insight:**  
Attrition varies significantly by job role. Sales Executive and Research Scientist roles show the highest absolute number of departures, which corresponds to their larger workforce sizes. Laboratory Technicians also exhibit a relatively high attrition count compared to other technical roles. In contrast, managerial and director-level positions experience notably lower attrition, suggesting higher role stability at senior levels.

## Visualization 5: Employee Attrition by OverTime

**Chart Type:** Stacked Bar Chart  
**X-axis:** OverTime (Yes / No)  
**Y-axis:** Number of Employees  
**Color:** Attrition (Yes / No)

**Insight:**  
The number of employees who remained with the company significantly exceeds the number of departures. Although attrition is present at a noticeable level, departures remain a minority of the workforce. This indicates that employee turnover exists but does not dominate the organization’s workforce dynamics.

## Visualization 6: Average Monthly Income by Attrition Status

**Chart Type:** Average Comparison (Dot Plot)  
**X-axis:** Attrition (Yes / No)  
**Y-axis:** Average Monthly Income  

**Insight:**  
Employees who remained with the company have a higher average monthly income than those who left, which suggests that attrition is more common among lower-paid employees. Because the chart compares group averages only, it does not show the median or the spread of incomes within each group.

## Visualization 7: Average Age by Attrition Status

**Chart Type:** Average Comparison (Dot Plot)  
**X-axis:** Attrition (Yes / No)  
**Y-axis:** Average Age  

**Insight:**  
Employees who left the company are, on average, younger than those who remained. This pattern suggests that attrition is more prevalent among younger employees, which may reflect differences in career stage, job expectations, or mobility between age groups.

## Visualization 8: Attrition Rate by Job Satisfaction

**Chart Type:** 100% Stacked Bar Chart  
**X-axis:** Job Satisfaction (1 = Low, 4 = High)  
**Y-axis:** Percentage of Employees  
**Color:** Attrition (Yes / No)

**Insight:**  
Attrition shows a clear inverse relationship with job satisfaction. Employees with the lowest job satisfaction level (Level 1) have the highest attrition rate, while those with the highest satisfaction level (Level 4) exhibit the lowest proportion of departures. This pattern suggests that job satisfaction is a strong factor associated with employee retention.
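
A 100% stacked bar chart of this kind can be built from a row-normalized crosstab. The following is a sketch with pandas and matplotlib on a hypothetical toy frame; the values are invented and the original chart may have been produced with a different tool:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data: satisfaction level and attrition flag per employee.
toy = pd.DataFrame({
    "JobSatisfaction": [1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4],
    "Attrition": ["Yes", "Yes", "No", "No", "Yes", "No",
                  "No", "No", "Yes", "No", "No", "No"],
})

# Share of each attrition class within every satisfaction level, in percent.
pct = pd.crosstab(
    toy["JobSatisfaction"], toy["Attrition"], normalize="index"
) * 100

# Stacked bars: every bar sums to 100.
ax = pct.plot(kind="bar", stacked=True, figsize=(6, 4))
ax.set_xlabel("Job Satisfaction (1 = Low, 4 = High)")
ax.set_ylabel("Percentage of Employees")
ax.legend(title="Attrition")
plt.tight_layout()
```

Normalizing by row removes the effect of unequal group sizes, so the bars compare rates and not headcounts.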

## Visualization 9: Attrition Rate by Work-Life Balance

**Chart Type:** 100% Stacked Bar Chart  
**X-axis:** Work-Life Balance (1 = Poor, 4 = Excellent)  
**Y-axis:** Percentage of Employees  
**Color:** Attrition (Yes / No)

**Insight:**  
Attrition decreases as work-life balance improves. Employees reporting poor work-life balance (Level 1) show the highest attrition rate, while those with better balance (Levels 3 and 4) demonstrate lower proportions of departures. This pattern indicates that work-life balance is strongly associated with employee retention.

## Visualization 10: Attrition vs Years at Company

**Chart Type:** Average Comparison (Dot Plot)  
**X-axis:** Attrition (Yes / No)  
**Y-axis:** Average Years at Company  

**Insight:**  
Employees who left the company have spent significantly fewer years on average compared to those who remained. This suggests that attrition is more prevalent among employees with shorter tenure, while longer-serving employees demonstrate higher retention and stability within the organization.
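
The group averages behind the average-comparison charts in Visualizations 6, 7 and 10 can be computed in a single `groupby`. This is a sketch on a hypothetical five-row frame with invented values:

```python
import pandas as pd

# Hypothetical toy data: two leavers and three stayers.
toy = pd.DataFrame({
    "Attrition": ["Yes", "Yes", "No", "No", "No"],
    "MonthlyIncome": [2000, 3000, 5000, 6000, 7000],
    "Age": [28, 32, 38, 40, 42],
    "YearsAtCompany": [1, 3, 6, 8, 10],
})

# Mean of each numeric column per attrition group.
group_means = toy.groupby("Attrition")[
    ["MonthlyIncome", "Age", "YearsAtCompany"]
].mean()

print(group_means)
```

Swapping `.mean()` for `.median()` or `.describe()` gives a quick view of how robust a difference in averages is to skewed values.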