In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn modules
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score, 
                             roc_auc_score, mean_squared_error, mean_absolute_error, 
                             r2_score, classification_report, confusion_matrix, 
                             precision_recall_curve)

# Machine learning models
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge, Lasso
from sklearn.ensemble import (RandomForestClassifier, RandomForestRegressor, 
                              GradientBoostingClassifier, GradientBoostingRegressor)
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Other utilities
import joblib
import pickle
from scipy.stats import boxcox
import xgboost as xgb

# Load dataset
df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
df_transformed = df.copy()  # independent copy, so later transformations leave df untouched


In [None]:
df.head()

## Data Cleaning

Because the data was collected in-house through a multiple-choice survey with no open-ended questions, cleaning is relatively straightforward. A few things still need checking:

- **Data types:** Confirm each column has the expected dtype, primarily `int64` and `object`.
- **Duplicates and missing values:** Identify and handle any duplicate rows or null entries.
- **Whitespace:** Confirm no stray spaces remain; extra spaces have already been removed for consistency.
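
The three checks above can be sketched as one small helper. This is a minimal illustration, not the exact procedure used for this dataset: `cleaning_report` is a hypothetical name, and the `toy` frame holds made-up values that only exist to exercise each check.

```python
import pandas as pd


def cleaning_report(frame: pd.DataFrame) -> dict:
    """Summarize dtypes, duplicate rows, missing values, and padded strings."""
    # Treat every non-numeric column as text for the whitespace check.
    text_cols = frame.select_dtypes(exclude="number").columns
    padded = {}
    for col in text_cols:
        as_text = frame[col].dropna().astype(str)
        # Count cells that change when leading/trailing spaces are stripped.
        changed = int((as_text != as_text.str.strip()).sum())
        if changed:
            padded[col] = changed
    return {
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_per_column": {
            col: int(n) for col, n in frame.isna().sum().items() if n > 0
        },
        "padded_strings": padded,
    }


# Toy frame for illustration only; these values are not from the HR dataset.
toy = pd.DataFrame({
    "Department": ["Sales", " Sales", "R&D", "R&D"],
    "Age": [30, 30, 45, 45],
    "MonthlyIncome": [1000.0, None, 2000.0, 2000.0],
})
report = cleaning_report(toy)
print(report)
```

Calling `cleaning_report(df)` on the loaded data would give the same summary for the real columns; empty `missing_per_column` and `padded_strings` entries and a `duplicate_rows` of zero would indicate nothing left to fix.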

## Data Exploration

In [None]:
df.describe().T

## Checking for Outliers  

While some machine learning models are robust to outliers, it is still important to know where they occur. Many statistical techniques assume normally distributed errors, and extreme or heavily skewed values can degrade models that are sensitive to feature scale and variance, such as linear regression and SVMs.

Common methods for detecting outliers include:  
- **Z-score**: Flags data points that lie many standard deviations from the mean (a cutoff of |z| > 3 is common).
- **Interquartile Range (IQR) Method**: Flags values more than 1.5 times the IQR below the first quartile or above the third quartile.
- **Visualization Techniques**: Box plots, histograms, and density plots provide intuitive insights into data distribution.  
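
The two numeric rules in the list above can be written in a few lines of pandas. The sketch below is illustrative only: the function names, the default cutoffs (3 standard deviations and 1.5 times the IQR), and the `toy` series are assumptions, not part of the analysis itself.

```python
import pandas as pd


def zscore_outliers(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Boolean mask: True where a value lies more than `threshold` SDs from the mean."""
    z = (values - values.mean()) / values.std()  # pandas uses the sample SD (ddof=1)
    return z.abs() > threshold


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask: True where a value falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy series for illustration only: the integers 1..20 plus one extreme value.
toy = pd.Series(list(range(1, 21)) + [200], name="toy_feature")
z_mask = zscore_outliers(toy)
iqr_mask = iqr_outliers(toy)
print(toy[z_mask].tolist(), toy[iqr_mask].tolist())
```

Both masks can be applied to a single column, for example `iqr_outliers(df["YearsAtCompany"])`. Note that the Z-score rule depends on the mean and standard deviation, which the outliers themselves inflate, so in small or heavily skewed samples it can miss points that the IQR rule flags.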

For this analysis, I used **box plots** to visualize each feature's spread and spot potential outliers.

In [None]:
# Set figure size
plt.figure(figsize=(12, 6))

# Create a boxplot for all numerical columns (non-numeric columns are excluded explicitly)
sns.boxplot(data=df.select_dtypes(include="number"), orient="h")

# Show the plot
plt.title("Outlier Detection using Boxplots")
plt.show()

Because the columns span very different ranges, the features with large values stretch the shared axis and compress the rest, which makes the graph above hard to read. To improve clarity, I plotted a focused subset of columns below.

In [None]:
# Selected columns to plot together on their own axis
columns_to_check = [
    'NumCompaniesWorked', 'PerformanceRating', 'StockOptionLevel',
    'TotalWorkingYears', 'TrainingTimesLastYear', 'YearsAtCompany',
    'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager',
]

# Plot only selected columns
plt.figure(figsize=(12, 6))
sns.boxplot(data=df[columns_to_check], orient="h")
plt.show()
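
An alternative to hand-picking columns is to standardize every numeric column before plotting, so that all of them share a comparable scale. The sketch below is one possible approach, not what this notebook does: `standardize_numeric` is a hypothetical helper, and the `toy` frame contains made-up values.

```python
import pandas as pd


def standardize_numeric(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the numeric columns rescaled to mean 0 and standard deviation 1."""
    numeric = frame.select_dtypes(include="number")
    spread = numeric.std()
    # Constant columns have zero spread and would divide by zero, so drop them.
    keep = spread[spread > 0].index
    return (numeric[keep] - numeric[keep].mean()) / spread[keep]


# Toy frame for illustration only; these values are not from the HR dataset.
toy = pd.DataFrame({
    "small_scale": [1, 2, 3, 4, 5],
    "large_scale": [1000, 2000, 3000, 4000, 50000],
    "constant": [80, 80, 80, 80, 80],
    "label": ["a", "b", "c", "d", "e"],
})
scaled = standardize_numeric(toy)
print(scaled.round(2))
```

Passing the result to the same call used above, `sns.boxplot(data=standardize_numeric(df), orient="h")`, would draw every numeric feature on one axis measured in standard deviations. The trade-off is that the original units are lost, which is why a focused plot of raw values can still be easier to interpret.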

### Comparing the box plots for each feature by Attrition: