---

# **Project Summary**

---

# **Employee Performance Analysis**  
#### **INX Future Inc.** | Code: 10281  
#### **Miral Katpara** | CDS Datamites 

---

### **Introduction**

*INX Future Inc*, a leading provider of data analytics and automation solutions, has recently faced declining employee performance, increasing service delivery escalations, and an 8-percentage-point drop in client satisfaction. My task is to analyze the current employee data and uncover the key factors driving these performance issues and attrition rates.

---

### **Project Scope and Objectives**

- This solution aims to provide *INX Future Inc* with a comprehensive analysis of employee performance data, identifying the underlying causes of the performance decline and the factors affecting employee satisfaction.
- Predictive models are developed to pinpoint indicators of non-performing employees and to deliver actionable insights. These insights will guide INX's management in implementing targeted interventions to enhance employee performance while maintaining high morale and sustaining the company's reputation as a top employer.

---

## **Project Outline**

In this project, we addressed two key problems: predicting employee attrition and predicting employee performance ratings. We employed several machine learning models to analyze and forecast these aspects, with a focus on understanding employee behavior, improving workforce management, and enhancing company performance.

---

## **Data Preprocessing and EDA**

### **Exploratory Data Analysis (EDA)**

From the EDA, we derived several critical insights:

- **Gender Distribution**: The dataset shows a 60/40 split between male and female employees.
- **Education Backgrounds**: Employees come from six different education backgrounds, with Life Sciences and Marketing being predominant.
- **Job Roles**: Research and Development (R&D) roles and Sales positions are crucial for the company's performance.
- **Job Satisfaction**: Most employees reported high job satisfaction, good work environment, and work-life balance. However, a small segment with poor ratings indicates a need for improvement.
- **Performance Ratings**: While most employees perform well, only 11% achieved an outstanding performance rating. This is an area for potential improvement.
- **Attrition Rates**: Attrition rates are relatively low. Key factors influencing attrition include job satisfaction, work environment, and education background. Notably, employees with a Life Sciences background and those in Developer or Sales Executive roles showed higher attrition rates.

---

### **Data Cleaning and Processing**

- **Handling Categorical Data**: 
  - Categorical features were converted into numerical format for model compatibility.
  - **Label Encoding** was used for categorical variables without inherent order (e.g., Gender, Department).
  - **Ordinal Encoding** was applied to variables with a rank or hierarchy (e.g., Job Satisfaction, Job Level).

- **Outlier Detection and Management**: Outliers were identified and handled to ensure they did not skew the model's performance. This involved either removing them or replacing them with mean/median values.

- **Scaling**: Features were standardized using `StandardScaler` (zero mean, unit variance) so that features measured on larger scales do not dominate distance-based and gradient-based models such as KNN, SVC, and the neural network.

- **SMOTE for Class Imbalance**:
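  - **SMOTE (Synthetic Minority Over-sampling Technique)** was used to address class imbalance in the attrition dataset. SMOTE generates synthetic minority-class samples by interpolating between existing minority samples and their nearest neighbors, so the model trains on a more balanced set. It was applied to the training data only, leaving the test set untouched.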
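
As an illustration of the outlier handling described above, here is a minimal sketch that flags values with the 1.5 × IQR rule and replaces them with the median. The IQR rule, the function name `replace_outliers_with_median`, and the sample column are assumptions made for this example; the summary does not state which detection rule was actually used.

```python
import statistics


def replace_outliers_with_median(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the median."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    median = statistics.median(values)
    return [v if lower <= v <= upper else median for v in values]


# Hypothetical column (e.g. years at company) with one extreme value.
years = [1, 2, 2, 3, 3, 4, 5, 40]
cleaned = replace_outliers_with_median(years)
print(cleaned)
```

Median replacement keeps the row count unchanged, which matters when the dataset is small; dropping the rows instead is the other option mentioned above.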

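The preprocessing steps in code (assuming `X_train`, `X_test`, and `y_train` come from an earlier train/test split):

```python
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, StandardScaler

# Encoders for categorical variables (nominal and ordinal, respectively)
label_encoder = LabelEncoder()
ordinal_encoder = OrdinalEncoder()

# Standardize features: fit on the training split only, then reuse on the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# SMOTE for class imbalance, applied to the training split only
smote = SMOTE()
X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train)
```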
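
To make the `fit_resample` call above less of a black box, the following dependency-free sketch shows the core idea behind SMOTE: a synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbors. The function name `smote_like_samples`, the toy points, and the choice of `k=2` are assumptions for illustration; this is not the `imblearn` implementation.

```python
import math
import random


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` within the minority class (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class samples (two scaled features each).
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0)]
new_points = smote_like_samples(minority, n_new=5)
print(new_points)
```

Because every synthetic point is a convex combination of two real minority samples, it always lies inside the region those samples already span.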


---

## **Model Training and Evaluation**

### **Attrition Prediction**

We trained several models to predict employee attrition, evaluating each model's performance based on accuracy, interpretability, and robustness.

#### **Logistic Regression:**
- **Accuracy**: 0.88
- **Explanation**: Logistic Regression is a linear model that estimates the probability of a binary outcome using a logistic function. It is simple, interpretable, and effective for binary classification tasks like predicting whether an employee will leave the company.
- **Mathematical Notation**: The model estimates the probability $P(y=1 \mid X)$ using the equation:
  $$
  P(y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n)}}
  $$
  where $\beta_0, \dots, \beta_n$ are the model coefficients.
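
The logistic equation above can be evaluated directly. In this small sketch the coefficient values and the two-feature input are hypothetical, chosen only to show how a linear score is turned into a probability and then into a leave/stay decision at a 0.5 threshold.

```python
import math


def predict_proba(x, beta0, betas):
    """P(y=1 | x) = 1 / (1 + exp(-(beta0 + sum_j beta_j * x_j)))."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for two scaled features.
beta0 = -1.0
betas = [0.8, -0.5]

p = predict_proba([2.0, 1.0], beta0, betas)  # linear score z = -1 + 1.6 - 0.5 = 0.1
leaves = p >= 0.5  # classify as "attrition" when the probability reaches 0.5
print(round(p, 4), leaves)
```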

#### **Neural Network:**
- **Accuracy**: 0.71
- **Explanation**: Neural Networks, inspired by the human brain, consist of interconnected neurons organized in layers. They capture complex patterns and relationships in the data, making them suitable for high-dimensional data.
- **Mathematical Notation**: The output of a neuron is given by:
  $$
  a = \sigma(WX + b)
  $$
  where $\sigma$ is the activation function, $W$ is the weight matrix, $X$ is the input, and $b$ is the bias term.
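
A single layer of the neuron equation above can be sketched in a few lines. The sigmoid activation, the weights, and the input below are assumptions for illustration; the summary does not specify the network architecture that was trained.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(W, x, b):
    """Return sigma(Wx + b), where each row of W holds one neuron's weights."""
    return [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + bias)
        for row, bias in zip(W, b)
    ]


# Hypothetical layer with two neurons and two inputs.
W = [[0.5, -0.5], [1.0, 1.0]]
b = [0.0, -1.0]
x = [1.0, 1.0]

a = dense_layer(W, x, b)  # pre-activations are 0.0 and 1.0
print(a)
```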

#### **K-Nearest Neighbors (KNN):**
- **Accuracy**: 0.85
- **Explanation**: KNN is a simple, non-parametric algorithm that classifies a sample based on the majority class among its K-nearest neighbors. It's effective for datasets with non-linear decision boundaries.
- **Mathematical Notation**: The distance metric (e.g., Euclidean distance) is used to determine the nearest neighbors:
  $$
  d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
  $$
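
The distance formula and the majority vote fit together as follows. This is a minimal sketch with made-up training points and labels, not the model trained for this project; `knn_predict` is a hypothetical helper name.

```python
import math
from collections import Counter


def euclidean(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data with two features per employee.
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 7)]
train_y = ["stay", "stay", "stay", "leave", "leave"]

prediction = knn_predict(train_X, train_y, query=(2, 2), k=3)
print(prediction)
```

Since the prediction depends entirely on distances, KNN is one of the models that benefits most from the feature scaling step.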

#### **Support Vector Classification (SVC):**
- **Accuracy**: 0.87
- **Explanation**: SVC is a powerful classifier that finds the hyperplane separating the classes with the maximum margin. It handles high-dimensional data well and is effective for binary classification.
- **Mathematical Notation**: In the hard-margin (linearly separable) case, the optimization problem solved by SVC is:
  $$
  \min_{w, b} \frac{1}{2} \|w\|^2 \quad \text{subject to } y_i (w^T x_i + b) \geq 1 \text{ for all } i
  $$
  with class labels $y_i \in \{-1, +1\}$.
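
To make the constraint concrete, the sketch below checks whether a candidate hyperplane satisfies the margin condition for every sample and evaluates the objective for it. The points, labels, and the candidate `w` and `b` are hypothetical; an actual SVC solver searches for the feasible hyperplane with the smallest objective, and in practice allows some violations through a soft margin.

```python
def decision_value(w, b, x):
    """Compute w^T x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def margin_constraints_hold(w, b, X, y):
    """Check y_i * (w^T x_i + b) >= 1 for every sample (labels are -1 or +1)."""
    return all(yi * decision_value(w, b, xi) >= 1 for xi, yi in zip(X, y))


def objective(w):
    """The quantity being minimized: 0.5 * ||w||^2."""
    return 0.5 * sum(wj ** 2 for wj in w)


# Hypothetical, linearly separable toy data.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [1, 1, -1, -1]

w, b = (0.5, 0.5), -1.0
feasible = margin_constraints_hold(w, b, X, y)
margin_width = 2 / (2 * objective(w)) ** 0.5  # equals 2 / ||w||
print(feasible, objective(w), round(margin_width, 3))
```

Minimizing the objective is equivalent to maximizing `margin_width`, which is why a smaller norm of `w` corresponds to a wider margin.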

#### **Random Forest:**
- **Accuracy**: 0.94
- **Explanation**: Random Forest is an ensemble method that builds multiple decision trees and merges them to get a more accurate and stable prediction. It's robust against overfitting and handles complex feature interactions effectively.
- **Mathematical Notation**: The final prediction is made by aggregating the predictions of individual trees, typically using majority voting:
  $$
  \hat{y} = \text{mode}(\hat{y}_1, \hat{y}_2, \dots, \hat{y}_n)
  $$
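
The aggregation step is small enough to show in full. The per-tree predictions below are made up for illustration, and `majority_vote` is a hypothetical helper.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common class among the individual tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions from five trees for one employee (1 = attrition).
tree_predictions = [1, 0, 1, 1, 0]
y_hat = majority_vote(tree_predictions)
print(y_hat)
```

Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so its output can differ slightly from a strict mode.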

#### **XGBoost:**
- **Accuracy**: 0.86
- **Explanation**: XGBoost is an advanced boosting algorithm that iteratively improves weak learners (e.g., decision trees) by focusing on the errors made by previous models. It's known for its high performance and ability to manage missing data and outliers efficiently.
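- **Mathematical Notation**: The model is built additively, one tree per boosting round, with each new tree chosen to reduce the regularized objective:
  $$
  \text{Obj}(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)
  $$
  where $l$ is the loss on sample $i$, $f_k$ is the $k$-th tree, and $\Omega(f_k)$ is a regularization term that penalizes tree complexity.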
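
The two parts of the objective can be computed by hand for a tiny case. This sketch assumes log loss for the data term and the commonly used regularizer with a per-leaf cost `gamma` and an L2 penalty `lam` on leaf weights; the labels, probabilities, and leaf weights are hypothetical.

```python
import math


def log_loss(y, p):
    """Binary log loss for one sample with true label y and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def omega(leaf_weights, gamma=1.0, lam=1.0):
    """Omega(f) = gamma * T + 0.5 * lam * sum(w_j^2), with T the number of leaves."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w ** 2 for w in leaf_weights)


def objective(y_true, y_prob, trees, gamma=1.0, lam=1.0):
    """Training loss over all samples plus the complexity penalty over all trees."""
    loss = sum(log_loss(y, p) for y, p in zip(y_true, y_prob))
    penalty = sum(omega(leaves, gamma, lam) for leaves in trees)
    return loss + penalty


# Hypothetical ensemble of two trees, each described by its leaf weights.
y_true = [1, 0]
y_prob = [0.5, 0.5]
trees = [[1.0, -1.0], [0.0, 2.0]]

obj = objective(y_true, y_prob, trees)
print(round(obj, 4))
```

Larger `gamma` or `lam` values make additional leaves and large leaf weights more expensive, which is how the regularization term discourages overfitting.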

---

### **Performance Rating Prediction**

We also trained models to predict employee performance ratings, aiming to identify high-performing employees and those in need of support.

#### **LightGBM:**
- **Accuracy**: 95%
- **Explanation**: LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It's designed for efficiency and scalability, making it suitable for large datasets with a large number of features.
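- **Mathematical Notation**: The model uses a histogram-based algorithm to find split points, which reduces the computational cost. Schematically, a candidate split is scored by its gain minus a complexity penalty weighted by $\gamma$: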
  $$
  \text{gain} = \text{split\_gain} - \gamma \times \text{penalty}
  $$
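
The idea of a histogram-based split search can be sketched as follows: feature values are bucketed into a few bins, gradient statistics are summed per bin, and only the bin boundaries are tried as split points. The gain expression used here is the standard second-order one for gradient-boosted trees with L2 regularization `lam` and split penalty `gamma`; the function name, the equal-width binning, and the toy data are assumptions, and LightGBM's real implementation is considerably more elaborate.

```python
def best_histogram_split(values, gradients, hessians, n_bins=4, lam=1.0, gamma=0.0):
    """Return (threshold, gain) of the best split found at bin boundaries."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    # Accumulate gradient and hessian sums per equal-width bin.
    grad_hist = [0.0] * n_bins
    hess_hist = [0.0] * n_bins
    for v, g, h in zip(values, gradients, hessians):
        idx = min(int((v - lo) / width), n_bins - 1)
        grad_hist[idx] += g
        hess_hist[idx] += h

    def score(g, h):
        return g * g / (h + lam)

    total_g, total_h = sum(grad_hist), sum(hess_hist)
    best_gain, best_threshold = 0.0, None
    left_g = left_h = 0.0
    # Scan the bin boundaries once instead of every distinct feature value.
    for idx in range(n_bins - 1):
        left_g += grad_hist[idx]
        left_h += hess_hist[idx]
        split_gain = 0.5 * (
            score(left_g, left_h)
            + score(total_g - left_g, total_h - left_h)
            - score(total_g, total_h)
        )
        gain = split_gain - gamma
        if gain > best_gain:
            best_gain, best_threshold = gain, lo + (idx + 1) * width
    return best_threshold, best_gain


# Hypothetical feature whose target jumps from 0 to 1 between 4 and 5.
values = [1, 2, 3, 4, 5, 6, 7, 8]
gradients = [0, 0, 0, 0, -1, -1, -1, -1]  # squared-loss gradients at prediction 0
hessians = [1] * 8

threshold, gain = best_histogram_split(values, gradients, hessians)
print(threshold, round(gain, 4))
```

With `n_bins` fixed, the scan costs the same regardless of how many distinct values the feature has, which is where the efficiency on large datasets comes from.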

#### **K-Nearest Neighbors (KNN):**
- **Accuracy**: 78%
- **Explanation**: As with attrition prediction, KNN classifies a sample based on its nearest neighbors. While simple, KNN may struggle with high-dimensional data, which can lead to lower accuracy in this context.

#### **Random Forest:**
- **Accuracy**: 82%
- **Explanation**: Random Forest remains robust and resistant to overfitting, but at 82% it trailed the boosting models and the neural network on this task.

#### **Neural Network:**
- **Accuracy**: 86%
- **Explanation**: Neural Networks capture the complex relationships between features and performance ratings, providing a nuanced prediction model. However, they require careful tuning to avoid overfitting.

#### **XGBoost:**
- **Accuracy**: 93%
- **Explanation**: XGBoost was the second-best model on this task, close behind LightGBM, providing efficient and accurate predictions while being robust to overfitting and capable of handling complex feature interactions.

---

## **Recommendations**

- **For Attrition Prediction**: Random Forest provided the highest accuracy (0.94), ahead of Logistic Regression (0.88), SVC (0.87), and XGBoost (0.86). Given Random Forest's robustness and interpretability, it is recommended for predicting attrition. However, for large-scale implementations where computational efficiency is critical, XGBoost may be preferred.

- **For Performance Rating Prediction**: LightGBM (95%) and XGBoost (93%) both performed well, with LightGBM showing a slight edge in accuracy. LightGBM is recommended for its efficiency and scalability, particularly on large datasets. However, XGBoost remains a strong alternative, especially when dealing with more complex or noisy data.

---

## **Conclusion**

- The analysis and models developed provide *INX Future Inc* with actionable insights into employee performance and attrition. These insights can guide management in implementing targeted strategies to improve employee satisfaction, reduce attrition, and enhance overall company performance.
- The predictive models, especially Random Forest and LightGBM, offer reliable tools for ongoing monitoring and intervention.

---
