## WNS Analytics Wizard

Your client is a large MNC with 9 broad verticals across the organisation. One of the problems the client faces is identifying the right people for promotion (only for manager positions and below) and preparing them in time. The process they currently follow is:

- They first identify a set of employees based on recommendations and past performance
- Selected employees go through a separate training and evaluation program for each vertical. These programs are based on the skills each vertical requires
- At the end of the program, an employee is promoted based on various factors such as training performance, KPI completion (only employees with KPIs completed greater than 60% are considered), etc.

In the process described above, the final promotions are announced only after the evaluation, which delays the transition to new roles. Hence, the company needs your help in identifying eligible candidates at a particular checkpoint so that it can expedite the entire promotion cycle.


They have provided multiple attributes covering each employee's past and current performance, along with demographics. The task is to predict whether a potential promotee at the checkpoint in the test set will be promoted after the evaluation process.


Dataset Description:

|Variable|Definition|
|-------------------------|---------------------------------------|
|employee_id|Unique ID for employee|
|department|Department of employee|
|region|Region of employment (unordered)|
|education|Education level|
|gender|Gender of employee|
|recruitment_channel|Channel of recruitment for employee|
|no_of_trainings|Number of other trainings completed in the previous year on soft skills, technical skills, etc.|
|age|Age of employee|
|previous_year_rating|Employee rating for the previous year|
|length_of_service|Length of service in years|
|KPIs_met >80%|1 if more than 80% of KPIs (Key Performance Indicators) were met, else 0|
|awards_won?|1 if awards were won during the previous year, else 0|
|avg_training_score|Average score in current training evaluations|
|is_promoted|(Target) Recommended for promotion|



# 1. Load Important Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

from xgboost import XGBClassifier

import warnings
warnings.filterwarnings("ignore")

# 2. Data Load

In [None]:
df = pd.read_csv("train_LZdllcl(1).csv")

# 3. Data Exploration
**a. Initial Exploration:**

*  The first few calls (head(), tail(), info(), isnull().sum()) reveal the structure, data types, and missing values of the dataset. This is the first step before any data cleaning.

**b. Looking at Specific Columns:**

*  The next cells look at two columns, education and previous_year_rating, to see how many values are missing and whether any patterns stand out.
*  Using value_counts(normalize=True) helps you see how frequently each value appears (in percentage) and how significant the missing data is.

**c. Service Length Focus:**


*   The code checks the distribution of 'length_of_service' (i.e., how long each employee has worked).
*  It then zooms in on employees with only 1 year of service, possibly because short-tenure employees might behave differently.
*  Finally, it checks how many missing values exist for these 1-year employees, which might help decide how to clean or treat their data.
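
The checks described above can be folded into one small helper. The sketch below is illustrative only: `missing_summary` is a hypothetical function name and the toy frame uses made-up values, not the real training data.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns that have any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame standing in for the real training data (values are made up)
toy = pd.DataFrame({
    "education": ["Bachelor's", None, "Master's & above", "Bachelor's"],
    "previous_year_rating": [3.0, np.nan, np.nan, 5.0],
    "length_of_service": [4, 1, 1, 7],
})

# Missing values across the whole frame
print(missing_summary(toy))

# Missing values restricted to employees with one year of service
print(missing_summary(toy[toy["length_of_service"] == 1]))
```

The same helper can be called on any filtered subset, which makes it easy to see whether missing values cluster in a particular group.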

In [None]:
# Show the first 5 rows of the DataFrame to understand the top entries
df.head()

In [None]:
# Show the last 5 rows of the DataFrame to understand the bottom entries
df.tail()

In [None]:
# Display a concise summary of the DataFrame: columns, data types, non-null values, memory usage
df.info()

In [None]:
# Check the total number of missing (NaN) values in each column
df.isnull().sum()

In [None]:
# Display the first 5 rows for selected columns: 'education' and 'previous_year_rating'
df[['education','previous_year_rating']].head()

In [None]:
# Show percentage distribution of all values in 'education', including missing values
df.education.value_counts(normalize=True,dropna=False)*100

In [None]:
# Show percentage distribution of all values in 'previous_year_rating', including missing values
df.previous_year_rating.value_counts(normalize=True,dropna=False)*100

In [None]:
# Count how many employees have each value for 'length_of_service'
df.length_of_service.value_counts()

In [None]:
# Filter and show rows where the employee has only 1 year of service
df[df.length_of_service==1]

In [None]:
# Check for missing values only in rows where 'length_of_service' is 1
df[df.length_of_service==1].isna().sum()

In [None]:
# Note: missing previous_year_rating values are filled with 0 in the cleaning step below.
# 0 then acts both as the lowest score and as a marker that no rating exists at all.

# Data Cleaning and Imputation:
*  previous_year_rating.fillna(0): Replaces missing ratings with 0. The value marks "no rating", but a model will also read it as the lowest possible score.

*  education.fillna(mode): Missing education values are replaced with the most common education level, on the assumption that it is the most likely value.
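
A minimal sketch of both imputations on made-up data follows. The extra `rating_missing` flag is not part of the original notebook; it is an optional assumption shown here because it keeps "no rating" distinguishable from a genuinely low rating.

```python
import numpy as np
import pandas as pd

# Toy data with gaps in both columns (values are made up)
toy = pd.DataFrame({
    "education": ["Bachelor's", None, "Bachelor's", "Master's & above"],
    "previous_year_rating": [4.0, np.nan, 2.0, np.nan],
})

# Optional (assumption): remember which ratings were missing before filling
toy["rating_missing"] = toy["previous_year_rating"].isna().astype(int)

# Fill missing ratings with 0 by assigning the result back to the column
toy["previous_year_rating"] = toy["previous_year_rating"].fillna(0)

# Fill missing education with the most frequent level
toy["education"] = toy["education"].fillna(toy["education"].mode()[0])

print(toy)
```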

# Checking for Remaining Issues:
*  The .isna().sum() checks are rerun after imputation to verify that no missing values remain, especially in subsets such as:
   *  Employees with 1 year of service.
   *  Employees who were promoted.

# Data Distribution and Summary:
*  describe() gives a quick overview of numeric features (mean, min, max, etc.).
*  value_counts(normalize=True) shows what percentage of employees were promoted versus not promoted.
*  The Seaborn countplot provides a visual comparison. It is especially helpful for spotting imbalance in the dataset (e.g., very few promotions).
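
To see why imbalance matters, consider a baseline that never predicts a promotion. The sketch below uses made-up labels (the 9% promotion rate is an assumption for illustration, not a figure from the dataset) and shows that accuracy looks good while F1 exposes the problem.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 9 promoted employees out of 100
y_true = np.array([1] * 9 + [0] * 91)

# A "model" that always predicts "not promoted"
y_pred = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positive predictions exist
baseline_f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy={baseline_accuracy:.2f}, f1={baseline_f1:.2f}")
```

This is the reason the notebook evaluates models with the F1 score rather than accuracy.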

In [None]:
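# Fill missing values in 'previous_year_rating' with 0
# (represents no rating and lowest possible performance)
# Assigning the result back avoids chained in-place modification, which newer pandas versions warn about
df['previous_year_rating'] = df['previous_year_rating'].fillna(0)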

In [None]:
# Check for missing values in rows where employee has only 1 year of service
df[df.length_of_service==1].isna().sum()

In [None]:
# Check for missing values in rows where employee was promoted
df[df.is_promoted==1].isna().sum()

In [None]:
# Show the most common value(s) in the 'education' column
df.education.mode()

In [None]:
# Get the most frequently occurring value (mode) in 'education' column
df.education.mode()[0]

In [None]:
# Fill missing values in 'education' with the mode (most common education level)
# Assigning the result back avoids chained in-place modification
df['education'] = df['education'].fillna(df['education'].mode()[0])

In [None]:
# Check for any remaining missing values in the dataset
df.isna().sum()

In [None]:
# Show statistical summary for numerical columns (mean, std, min, max, etc.)
df.describe()

In [None]:
# Show percentage distribution of promoted vs not-promoted employees
df['is_promoted'].value_counts(normalize=True)*100

In [None]:
# Plot a count plot showing number of promoted vs not-promoted employees
plt.figure(figsize=(4,3))
sns.countplot(data=df, x='is_promoted',hue='is_promoted')
plt.title('Promotion Count');

# Data Visualization for Skew Detection
*  Histograms help visualize the distribution of values in numeric columns like age, length_of_service, and avg_training_score.
*  A right-skewed (positively skewed) feature can hurt models that are sensitive to feature scale and outliers, such as logistic regression, KNN, and SVM.

# Log Transformation
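*  The code uses np.log1p() to reduce skewness and bring the values closer to a symmetric distribution.
*  This mainly helps linear and distance-based models such as logistic regression, KNN, and SVM. Tree-based models such as decision trees split on the order of values, so a monotonic transform like the logarithm leaves them largely unaffected.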

# Skewness Check
*  scipy.stats.skew() calculates how asymmetric the data is.
*  A value near 0 means symmetric.
*  Positive = right-skewed, Negative = left-skewed.
*  The code compares before and after transformation to ensure improvement.
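
The before/after comparison can be tried on synthetic data first. The sketch below uses an exponential sample as a stand-in for a right-skewed column; the distribution and its parameters are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Synthetic right-skewed sample standing in for a column like length_of_service
sample = rng.exponential(scale=5.0, size=5000)

skew_before = skew(sample)
skew_after = skew(np.log1p(sample))

print(f"Skewness before log1p: {skew_before:.4f}")
print(f"Skewness after log1p:  {skew_after:.4f}")

# log1p is invertible with expm1, so original values can be recovered if needed
restored = np.expm1(np.log1p(sample))
```

If the skewness after the transform is not closer to 0 than before, the transform is not helping that column and the original values may be the better choice.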

# Replacing Columns
*  The original columns are replaced with the log-transformed versions.



In [None]:
# Show the first 5 rows of the dataset to review the structure and sample data
df.head()

In [None]:
# Plot the distribution of the 'age' column with a KDE (curve) to observe skewness
plt.figure(figsize=(12,6))
sns.histplot(df['age'], kde=True)
plt.title('Age Distribution')

In [None]:
# Plot the distribution of 'length_of_service'
plt.figure(figsize=(12,6))
sns.histplot(df['length_of_service'], kde=True)
plt.title('Length of Service Distribution')

In [None]:
# Plot the distribution of 'avg_training_score'
plt.figure(figsize=(12,6))
sns.histplot(df['avg_training_score'], kde=True)
plt.title('Avg Training Score Distribution')

In [None]:
# Apply log transformation to reduce skewness in these numeric columns
# log1p(x) = log(x + 1), used to handle zero values safely
age_log = np.log1p(df['age'])
service_log = np.log1p(df['length_of_service'])
score_log = np.log1p(df['avg_training_score'])

In [None]:
# Preview the log-transformed training scores
score_log.head()

In [None]:
# Import libraries for visualization and math
import numpy as np
import matplotlib.pyplot as plt

# Assign original columns to variables for plotting
age = df['age']
service = df['length_of_service']
score = df['avg_training_score']

# Log-transformed columns
age_log = np.log1p(age)
service_log = np.log1p(service)
score_log = np.log1p(score)

# Plotting
plt.figure(figsize=(15, 8))

# Row 1: Original distributions
plt.subplot(2, 3, 1)
plt.hist(age, bins=30, color='skyblue', edgecolor='black')
plt.title('Original Age Distribution')

plt.subplot(2, 3, 2)
plt.hist(service, bins=30, color='orange', edgecolor='black')
plt.title('Original Length of Service')

plt.subplot(2, 3, 3)
plt.hist(score, bins=30, color='green', edgecolor='black')
plt.title('Original Avg Training Score')

# Row 2: Log-transformed distributions
plt.subplot(2, 3, 4)
plt.hist(age_log, bins=30, color='skyblue', edgecolor='black')
plt.title('Log-Transformed Age')

plt.subplot(2, 3, 5)
plt.hist(service_log, bins=30, color='orange', edgecolor='black')
plt.title('Log-Transformed Length of Service')

plt.subplot(2, 3, 6)
plt.hist(score_log, bins=30, color='green', edgecolor='black')
plt.title('Log-Transformed Avg Training Score')

plt.tight_layout()
plt.show()


In [None]:
# Import skew function to measure asymmetry in data distribution

from scipy.stats import skew

# Print skewness before and after log transformation
print(f"Age: {skew(age):.4f}")
print(f"Age (log): {skew(age_log):.4f}")
print('====================================================')
print(f"Length of Service: {skew(service):.4f}")
print(f"Length of Service (log): {skew(service_log):.4f}")
print('====================================================')
print(f"Avg Training Score: {skew(score):.4f}")
print(f"Avg Training Score (log): {skew(score_log):.4f}")

In [None]:
# Show data again to confirm structure before replacing original columns
df.head()

In [None]:
# Keep a copy of df with the original (untransformed) columns for safe experimentation
data = df.copy()

In [None]:
# Inspect the log-transformed age values before they replace the original column
age_log

In [None]:
# Replace the original skewed features with their log-transformed versions
df['age']=age_log

In [None]:
df['length_of_service']=service_log

In [None]:
df['avg_training_score']=score_log

# Region Mapping:
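*  Region values like 'region_1', 'region_2' are converted to the integers 1, 2, etc., using a dictionary built from the column values.
*  This gives each region a unique numeric ID. The numbers are labels only: region is unordered, so a model that treats the column as a quantity will read an order that does not exist.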
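
As a sketch of two alternatives, the snippet below extracts the region number with vectorised string methods and, separately, builds one-hot columns that impose no order. The one-hot variant is not what the notebook uses; it is shown as an assumption about what may suit linear or distance-based models better. The region labels are made up.

```python
import pandas as pd

# Made-up region labels in the same 'region_<n>' format
regions = pd.Series(["region_7", "region_22", "region_2", "region_7"], name="region")

# Option 1: vectorised equivalent of the dictionary mapping
region_id = regions.str.split("_").str[1].astype(int)

# Option 2: one-hot columns, one per region, with no implied ordering
region_onehot = pd.get_dummies(regions).astype(int)

print(region_id.tolist())
print(region_onehot)
```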

# Department Mapping:
*  Department names are mapped manually to integers through a hand-written dictionary. The integers are labels, not a ranking of departments.
*  For example, 'Sales & Marketing' → 0, 'Technology' → 2, and so on.
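
A hand-written dictionary silently turns any label it does not contain into NaN. The sketch below shows one way to catch that; the shortened dictionary and the misspelled label 'Tech' are hypothetical and exist only to trigger the check.

```python
import pandas as pd

# Shortened version of a department mapping (illustrative only)
dept_map = {"Sales & Marketing": 0, "Operations": 1, "Technology": 2}

# 'Tech' is a deliberately unmapped label
departments = pd.Series(["Operations", "Technology", "Tech", "Sales & Marketing"])

encoded = departments.map(dept_map)

# Labels absent from the dictionary come back as NaN after .map()
unmapped = sorted(departments[encoded.isna()].unique())
print(unmapped)
```

Running a check like this right after each `.map()` call makes it obvious when new data contains a category the dictionary does not cover.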

In [None]:
# Show the first 5 rows to understand the structure and current values
df.head()

In [None]:
# View all unique department categories sorted by frequency
df['department'].value_counts().index

In [None]:
# View all unique region codes (like 'region_1', 'region_2', etc.)
df['region'].value_counts().index

In [None]:
# Count how many unique region codes are present
len(df['region'].value_counts().index)

In [None]:

# Collect the unique region labels into a list
region=list(df['region'].value_counts().index)

In [None]:
# Example: Convert the string 'region_1' to integer 1
int(region[0].split('_')[1])

In [None]:
# Create a mapping: {'region_1': 1, 'region_2': 2, ..., 'region_n': n}
region_map=dict()
for i in (region):
    region_map[i]=int(i.split('_')[1])

In [None]:
# Print the region mapping dictionary to verify conversion
print(region_map)

In [None]:
# Replace the 'region' column with the integer-mapped values
df['region']=df['region'].map(region_map)

In [None]:
# Create a dictionary to manually map department names to integers
dept_map={'Sales & Marketing':0, 'Operations':1, 'Technology':2, 'Procurement':3,
       'Analytics':4, 'Finance':5, 'HR':6, 'Legal':7, 'R&D':8}

In [None]:
# Replace the 'department' column with mapped integer values
df['department']=df['department'].map(dept_map)

In [None]:
# Check the unique education levels before encoding them
df['education'].unique()

# What Each Mapping Does:

1. Education:

*  This uses ordinal encoding, which assumes a natural order (Below Secondary < Bachelor's < Master's & above).

2. Gender:

*  Encoded as binary: 0 = female, 1 = male.

3. Recruitment Channel:

*  Encoded with integers so that models can use the column as a feature. No ranking is intended, although a model that treats the values as numeric will still see 0 < 1 < 2.
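
The ordinal encoding for education can also be expressed with an ordered pandas categorical, which states the order explicitly. This is a sketch of an alternative to the dictionary approach, using made-up values; 'PhD' is a hypothetical unseen level.

```python
import pandas as pd

education = pd.Series(["Bachelor's", "Below Secondary", "Master's & above", "Bachelor's"])

# Levels listed from lowest to highest
levels = ["Below Secondary", "Bachelor's", "Master's & above"]

education_cat = pd.Categorical(education, categories=levels, ordered=True)
codes = pd.Series(education_cat.codes, index=education.index)
print(codes.tolist())

# A value outside the declared levels is encoded as -1 rather than NaN
unknown_code = pd.Categorical(["PhD"], categories=levels, ordered=True).codes[0]
print(unknown_code)
```

The resulting codes match the dictionary mapping (0, 1, 2), and the -1 code gives a simple way to detect unexpected levels.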



In [None]:
# Assumes higher number = higher education
edu_map={'Below Secondary':0,"Bachelor's":1,"Master's & above":2}

In [None]:
df['education']=df['education'].map(edu_map)

In [None]:
# Check unique values in 'gender' column (e.g., 'f', 'm')
df['gender'].unique()

In [None]:
# Map gender: 'f' (female) to 0, 'm' (male) to 1
gender_map={'f':0, 'm':1}

In [None]:
df['gender']=df['gender'].map(gender_map)

In [None]:
# Check unique recruitment channels (e.g., 'other', 'referred', 'sourcing')
df['recruitment_channel'].unique()

In [None]:
# Map recruitment channels to numbers (no order implied, just labels)
channel_map={'other':0,'referred':1,'sourcing':2}

In [None]:
df['recruitment_channel']=df['recruitment_channel'].map(channel_map)

|Step|Explanation|
|-------------------------|---------------------------------------|
|df.describe().T|Gives a statistical summary of all columns, which helps in understanding the spread and distribution of the data.|
|drop('employee_id')|Employee IDs are unique identifiers and carry no information about promotion decisions.|
|sns.heatmap(df.corr())|Shows how strongly features are correlated with each other, which can help detect redundancy.|
|X, y split|Separates the input features from the target variable (is_promoted).|
|SMOTE()|Since the dataset is imbalanced (very few promotions), SMOTE generates synthetic samples for the minority class (promoted employees) to reduce bias towards the majority class during training.|
|StandardScaler|Scales each feature to zero mean and unit variance so that no column dominates because of its range. This matters especially for algorithms like SVM, Logistic Regression, and Neural Nets.|
|train_test_split()|Divides the data into training and testing sets so that the model can be validated on held-out data. stratify=y_res keeps the class proportions the same in both sets. Note that SMOTE and scaling are applied before the split here, so the test set contains synthetic samples and the reported scores are likely optimistic.|
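
A common way to keep the test set untouched is to split first and only then resample and scale the training portion. The sketch below illustrates that ordering on synthetic data. It is not the notebook's pipeline: plain random oversampling via `sklearn.utils.resample` stands in for SMOTE to keep the example dependency-free, and all names ending in `_toy` or `_bal` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

rng = np.random.default_rng(42)

# Synthetic features and an imbalanced target (about 10% positives)
X_toy = rng.normal(size=(1000, 4))
y_toy = (rng.random(1000) < 0.1).astype(int)

# 1. Split first, so the test set contains only real, unscaled-by-test data
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# 2. Balance the training portion only (random oversampling as a SMOTE stand-in)
majority = X_tr[y_tr == 0]
minority = X_tr[y_tr == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
X_bal = np.vstack([majority, minority_up])
y_bal = np.concatenate([
    np.zeros(len(majority), dtype=int),
    np.ones(len(minority_up), dtype=int),
])

# 3. Fit the scaler on training data only, then apply it to both sets
toy_scaler = StandardScaler().fit(X_bal)
X_bal_scaled = toy_scaler.transform(X_bal)
X_te_scaled = toy_scaler.transform(X_te)

print("Train positive rate:", y_bal.mean())
print("Test positive rate: ", y_te.mean())
```

With this ordering the test set keeps the original class imbalance, so the F1 score measured on it is closer to what can be expected on new employees.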


In [None]:
# Preview the first few rows of the dataset
df.head()

In [None]:
# Show statistical summary (mean, std, min, max, etc.) for each column
df.describe().T

In [None]:
# Create a copy of the DataFrame after all preprocessing steps
data1=df.copy()

In [None]:
# Drop the 'employee_id' column as it is just an identifier and not useful for prediction
df.drop(columns=['employee_id'],inplace=True)

In [None]:
# Plot a heatmap to visualize the correlation between all numeric features
plt.figure(figsize=(15, 8))
sns.heatmap(df.corr(),annot=True)

In [None]:
# Define the features (X) and target (y)
X = df.drop(['is_promoted'], axis=1)      # Data Frame of the features, Input features
y = df['is_promoted']                     # Target Column, Output label

In [None]:
# Check the first few rows of features and target
X.head()

In [None]:
y.head()

In [None]:
# Check the class distribution (imbalanced classification problem)
y.value_counts(normalize=True)*100

In [None]:
# Apply SMOTE to balance the dataset by generating synthetic samples for the minority class
from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE().fit_resample(X, y)

In [None]:
# Check class distribution after SMOTE (should now be balanced)
y_res.value_counts(normalize=True)*100

In [None]:
# Scale the features to have zero mean and unit variance (important for many ML algorithms)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_res)

In [None]:
# Split the dataset into training and testing sets
# stratify ensures class balance is maintained in both sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_res, test_size=0.2, stratify=y_res, random_state=42)

In [None]:
# Model Selection and Training

|Part|	Explanation|
|-------------------------|---------------------------------------|
|models = {...}|A dictionary of popular classification models being tested.|
|model.fit() |		Trains the model on the training dataset.|
|model.predict()	|	Uses the trained model to make predictions on test data.|
|f1_score()	|	Measures model performance using the F1 Score, which balances precision and recall. Very useful for imbalanced classification problems (like promotions).|
|results[name] = f1	|Stores each model’s F1 score for comparison.|
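
As a worked example of how the F1 score combines precision and recall, the sketch below computes it by hand on ten made-up predictions and compares the result with `f1_score`.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up ground truth and predictions for ten employees
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # correctly predicted promotions
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # predicted promotion, not promoted
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # missed promotions

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1_manual:.2f}")
print(f"sklearn f1_score={f1_score(y_true, y_pred):.2f}")
```

Here 3 of 4 predicted promotions are correct and 3 of 4 actual promotions are found, so precision, recall, and F1 all equal 0.75.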


In [None]:
# Dictionary of models to train and evaluate
models = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'KNN': KNeighborsClassifier(),
    'SVM': SVC(kernel='rbf'),
    'Naive Bayes': GaussianNB(),
    'XGBoost': XGBClassifier(use_label_encoder=False, eval_metric='logloss')
}

In [None]:
# Dictionary to store F1 scores of each model
results = {}

# Train and evaluate each model
for name, model in models.items():
    model.fit(X_train, y_train)  # Train the model on training data
    preds = model.predict(X_test)   # Make predictions on test data
    f1 = f1_score(y_test, preds)   # Calculate F1 Score
    results[name] = f1           # Save the result
    print(f"{name} F1 Score: {f1:.4f}")  # Print the F1 score

In [None]:
# Hyperparameter Grids

In [None]:

rf_param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

In [None]:
xgb_param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 6, 10],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.7, 0.8, 1.0],
    'colsample_bytree': [0.7, 0.8, 1.0]
}
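
To put the randomized search with n_iter=10 in perspective, it helps to count how many combinations each grid contains. The sketch below copies the two grids and counts them with scikit-learn's `ParameterGrid`; the variable names `rf_grid` and `xgb_grid` are used here only to avoid overwriting the notebook's own dictionaries.

```python
from sklearn.model_selection import ParameterGrid

rf_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}

xgb_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 6, 10],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.7, 0.8, 1.0],
    'colsample_bytree': [0.7, 0.8, 1.0],
}

rf_total = len(ParameterGrid(rf_grid))    # 3 * 4 * 3 * 3 * 2
xgb_total = len(ParameterGrid(xgb_grid))  # 3 ** 5

print(f"Random Forest combinations: {rf_total}")
print(f"XGBoost combinations:       {xgb_total}")
```

Sampling 10 candidates therefore covers under 5% of either grid, so the reported best parameters are a reasonable starting point rather than the optimum of the full grid.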

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
# Randomized Search CV for Random Forest
rf_model = RandomForestClassifier(random_state=42)
rf_search = RandomizedSearchCV(rf_model, rf_param_grid, n_iter=10, scoring='f1', cv=3, n_jobs=-1, random_state=42)
rf_search.fit(X_train, y_train)
rf_best = rf_search.best_estimator_
rf_f1 = f1_score(y_test, rf_best.predict(X_test))

In [None]:
# Randomized Search CV for XGBoost
xgb_model = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)
xgb_search = RandomizedSearchCV(xgb_model, xgb_param_grid, n_iter=10, scoring='f1', cv=3, n_jobs=-1, random_state=42)
xgb_search.fit(X_train, y_train)
xgb_best = xgb_search.best_estimator_
xgb_f1 = f1_score(y_test, xgb_best.predict(X_test))

In [None]:
# Results
print("Best Parameters:")
print("Random Forest:", rf_search.best_params_)
print("XGBoost      :", xgb_search.best_params_)

print("F1-Score Comparison:")
print(f"Random Forest: {rf_f1:.4f}")
print(f"XGBoost      : {xgb_f1:.4f}")

# 1. Compare Models Using F1 Score
*  At the end of model training, the F1 scores of the two tuned models, Random Forest and XGBoost, are compared.
*  A simple if-else condition makes the decision:
   *  If Random Forest has a strictly higher F1 score, it is considered the better model.
   *  Otherwise (including a tie), XGBoost is preferred.

# 2. Save the Best Model
*  Once the best model is selected (here xgb_best, the tuned XGBoost model), it is saved with the joblib library. The save cell always writes xgb_best, so change the variable if Random Forest wins the comparison.
*  joblib.dump() writes the trained model to a .pkl file, which means:
   *  You don't have to retrain it every time.
   *  You can load it later and make predictions directly (e.g., in a real-world app or dashboard).

# 3. Save the Scaler
*  Before training, you used StandardScaler to scale your features (make them all have a similar range).
*  It’s important to save the scaler too because you’ll need it later to scale new data in the same way as the training data:

In [None]:
# Compare F1 scores of Random Forest and XGBoost
# Based on which model performed better, print the final decision
if rf_f1 > xgb_f1:
    print("Final Decision: Random Forest is better based on F1-score.")
else:
    print("Final Decision: XGBoost is better based on F1-score.")

In [None]:
# Import joblib for saving models and preprocessors
import joblib

# Save the best XGBoost model to a file
# This allows you to reuse the trained model without retraining it
joblib.dump(xgb_best, 'final_xgboost_model.pkl')

print("Final XGBoost model saved as 'final_xgboost_model.pkl'")


In [None]:
# Save the StandardScaler used to scale the features
# This ensures any new input data can be scaled the same way during prediction
joblib.dump(scaler, "scaler.pkl")
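
Loading the saved artifacts back for prediction could look like the sketch below. It is a self-contained illustration with assumptions throughout: a small logistic regression and synthetic data stand in for the real model and features, and the file names `demo_model.pkl` and `demo_scaler.pkl` are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic training data standing in for the preprocessed features
X_demo = rng.normal(loc=50, scale=10, size=(200, 3))
y_demo = (X_demo[:, 0] > 50).astype(int)

demo_scaler = StandardScaler().fit(X_demo)
demo_model = LogisticRegression().fit(demo_scaler.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "demo_model.pkl")
    scaler_path = os.path.join(tmp, "demo_scaler.pkl")

    # Save both artifacts, as the notebook does for the model and scaler
    joblib.dump(demo_model, model_path)
    joblib.dump(demo_scaler, scaler_path)

    # Later, e.g. in another script: load them back
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# New rows must be scaled with the loaded scaler before prediction
X_new = rng.normal(loc=50, scale=10, size=(5, 3))
preds = loaded_model.predict(loaded_scaler.transform(X_new))
print(preds)
```

For the real model, new data would also need the same imputation, log transforms, and category mappings applied earlier in this notebook before it reaches the scaler.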