# Employee Performance Analysis  
### INX Future Inc.

| **Field**                 | **Details**                               |
|---------------------------|-------------------------------------------|
| **Candidate Name**        | NIRMAL NEWTON M                           |
| **Candidate E-mail**      | nirmalnewton2003@gmail.com                 |
| **REP Name**              | DataMites™ Solutions Pvt Ltd             |
| **Venue Name**            | Open Project                             |
| **Exam Country**          | India                                    |
| **Assessment ID**         | E10901-PR2-V18                          |
| **Module**                | Certified Data Scientist - Project       |
| **Language**              | English                                  |
| **Exam Format**           | Open Project - IABAC™ Project Submission |
| **Submission Deadline**   | 08-Apr-2025                             |
| **Registered Trainer**    | Ashok Kumar A                                    |
| **Project Assessment**    | IABAC™                                   |


# **2. Analysis**

## **Feature Description**

The analysis began by exploring the structure and characteristics of the dataset. Understanding feature types is vital to identifying how each variable relates to employee performance. Using `pandas`, the dataset was examined and features were categorized into **categorical**, **numerical**, **ordinal**, and **alphanumeric** types.

---

### **Categorical Features**

Categorical variables represent data sorted into distinct categories or labels. These are useful for grouping observations and uncovering patterns in qualitative data:

- `EmpNumber` (identifier; also listed under Alphanumeric Features)  
- `Gender`  
- `EducationBackground`  
- `MaritalStatus`  
- `EmpDepartment`  
- `EmpJobRole`  
- `BusinessTravelFrequency`  
- `OverTime`  
- `Attrition`  

---

### **Numerical Features**

These features consist of continuous or discrete numeric values and are key to statistical modeling and trend analysis:

- `Age`  
- `DistanceFromHome`  
- `EmpHourlyRate`  
- `NumCompaniesWorked`  
- `EmpLastSalaryHikePercent`  
- `TotalWorkExperienceInYears`  
- `TrainingTimesLastYear`  
- `ExperienceYearsAtThisCompany`  
- `ExperienceYearsInCurrentRole`  
- `YearsSinceLastPromotion`  
- `YearsWithCurrManager`  

---

### **Ordinal Features**

Ordinal features have values with a natural order or ranking, typically representing levels of satisfaction or rating:

- `EmpEducationLevel`  
- `EmpEnvironmentSatisfaction`  
- `EmpJobInvolvement`  
- `EmpJobLevel`  
- `EmpJobSatisfaction`  
- `EmpRelationshipSatisfaction`  
- `EmpWorkLifeBalance`  
- `PerformanceRating`  

---

### **Alphanumeric Features**

These include features with unique identifiers or values not directly related to prediction tasks:

- `EmpNumber` (Unique identifier excluded from predictive modeling)
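
The grouping above can be reproduced roughly with `pandas`. The snippet below is a minimal sketch on a made-up four-column frame, not the project's actual code: the column names follow the report, the values are invented, and the `ordinal_cols` and `id_cols` lists are assumptions that must be filled in by hand, because ordinal ratings are stored as integers and cannot be told apart from true numerical features by dtype alone.

```python
import pandas as pd

# Hypothetical toy frame: column names follow the report, values are invented.
df = pd.DataFrame({
    "EmpNumber": ["E1001", "E1002", "E1003"],
    "Gender": ["Male", "Female", "Male"],
    "Age": [32, 47, 40],
    "EmpJobSatisfaction": [4, 1, 3],
})

# Assumption: these two lists are maintained manually.
ordinal_cols = ["EmpJobSatisfaction"]   # integer-coded ratings with a natural order
id_cols = ["EmpNumber"]                 # unique identifier, excluded from modeling

# Non-numeric columns that are not identifiers are treated as categorical.
non_numeric = df.select_dtypes(exclude="number").columns
categorical_cols = [c for c in non_numeric if c not in id_cols]

# Numeric columns that are not ordinal ratings are treated as numerical.
numeric = df.select_dtypes(include="number").columns
numerical_cols = [c for c in numeric if c not in ordinal_cols]

print("categorical:", categorical_cols)
print("numerical:  ", numerical_cols)
print("ordinal:    ", ordinal_cols)
```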

---

## **Distribution of Numerical Features**

Exploring numerical feature distributions offers early insights into patterns and anomalies in the data:

- **Age:** A majority of employees are between 30 and 40 years old.  
- **NumCompaniesWorked:** Most employees have worked at two or fewer previous companies.  
- **EmpHourlyRate:** Hourly wages commonly range from $60 to $100.  
- **ExperienceYearsAtThisCompany:** Most employees have less than 5 years of tenure at INX.  
- **EmpLastSalaryHikePercent:** Typical salary hikes fall between 11% and 15%.

---

### **Normality Check**

The **skewness** metric was used to determine how closely each numerical feature follows a normal distribution.

- For example, `YearsSinceLastPromotion` showed significant right skewness:  
  - **Skewness Value:** 1.97
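
A skewness check of this kind can be sketched with `DataFrame.skew()`. The values below are invented for illustration and do not reproduce the 1.97 figure reported above; the cut-off of 1.0 for "significant" skew is also an assumption, since the report does not state the threshold it used.

```python
import pandas as pd

# Invented sample values; only the column names come from the report.
df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45, 50, 55, 60, 65],
    "YearsSinceLastPromotion": [0, 0, 0, 1, 1, 2, 3, 7, 15],
})

# Sample skewness per column: about 0 for symmetric data, positive for a long right tail.
skewness = df.skew()

# Assumed rule of thumb: |skewness| > 1 marks a feature as strongly skewed.
SKEW_THRESHOLD = 1.0
strongly_skewed = skewness[skewness.abs() > SKEW_THRESHOLD].index.tolist()

print(skewness.round(2))
print("strongly skewed:", strongly_skewed)
```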

---

### **Skewness Correction**

To reduce the skewness in non-normally distributed features, **Square Root** and **Log Transformations** were applied. These methods are particularly effective in handling positively skewed, count-based variables.
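
As a rough illustration of the two transformations, the sketch below applies them to an invented right-skewed series and compares the skewness before and after. Using `np.log1p` (that is, log(1 + x)) rather than a plain logarithm is an assumption made here so that zero values, which are common in count features, stay finite.

```python
import numpy as np
import pandas as pd

# Invented right-skewed counts, e.g. years since the last promotion.
s = pd.Series([0, 0, 0, 1, 1, 2, 3, 7, 15], dtype="float64")

sqrt_transformed = np.sqrt(s)    # square-root transformation
log_transformed = np.log1p(s)    # log(1 + x), defined at zero

print("original skew:", round(s.skew(), 2))
print("sqrt skew:    ", round(sqrt_transformed.skew(), 2))
print("log1p skew:   ", round(log_transformed.skew(), 2))
```

Both transformations compress large values more than small ones, which is why they pull in a long right tail.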

---

## **Distribution of Categorical Features**

Analyzing categorical features reveals class distributions and potential imbalances:

- **Gender:** 60% Male, 40% Female  
- **EducationBackground:** Six unique educational categories present  
- **EmpJobRole:** Includes 19 distinct job roles  
- **EmpJobSatisfaction:** Most employees report high job satisfaction  
- **Attrition:** Approximately 85% of employees are retained  
- **PerformanceRating:** Only 11% received an “Outstanding” rating  
- **OverTime:** Around 30% of employees work overtime regularly
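
Class shares like these are typically read off with `value_counts(normalize=True)`. The sketch below uses a ten-row toy frame constructed to mirror the reported Gender and OverTime splits; it is not the real dataset.

```python
import pandas as pd

# Toy data built to mirror the reported 60/40 and 70/30 splits.
df = pd.DataFrame({
    "Gender": ["Male"] * 6 + ["Female"] * 4,
    "OverTime": ["No"] * 7 + ["Yes"] * 3,
})

# Percentage share of each class, per categorical column.
shares = {
    col: df[col].value_counts(normalize=True).mul(100).round(1)
    for col in df.columns
}

for col, pct in shares.items():
    print(col, pct.to_dict())
```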

---

# **3. Data Cleaning**

Cleaning ensures the data is complete, consistent, and ready for analysis. While there were no missing values, several numerical features exhibited outliers:

- `NumCompaniesWorked`  
- `TotalWorkExperienceInYears`  
- `TrainingTimesLastYear`  
- `ExperienceYearsAtThisCompany`  
- `ExperienceYearsInCurrentRole`  
- `YearsWithCurrManager`  
- `YearsSinceLastPromotion`  

These were addressed using appropriate outlier detection and correction methods.
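
The report does not name the specific outlier method. One common choice, shown here purely as an assumption, is the interquartile-range (IQR) rule: values beyond 1.5 × IQR from the quartiles are treated as outliers and clipped to the nearest fence. The helper name `clip_iqr` and the sample values are hypothetical.

```python
import pandas as pd

def clip_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values lying more than k * IQR outside the quartiles (assumed method)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Invented tenure values with one extreme observation.
tenure = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 40], dtype="float64",
                   name="ExperienceYearsAtThisCompany")

clipped = clip_iqr(tenure)

print("max before:", tenure.max())
print("max after: ", clipped.max())
```

Clipping keeps every row, which matters on a small dataset; dropping the rows instead would be an equally valid but different choice.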

---

# **4. Data Preprocessing**

Preprocessing transforms raw data into a format suitable for modeling. Key steps included:

1. **Outlier Handling:** Applied correction techniques to minimize the impact of extreme values in numerical features.  
2. **Categorical Encoding:** Utilized both **Label Encoding** and **One-Hot Encoding** to convert categorical variables into numeric form.  
3. **Data Quality Assurance:** Performed integrity checks to resolve inconsistencies and ensure data reliability.
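
The two encodings in step 2 can be sketched with plain `pandas`. Which columns received which encoding is not stated in the report, so the assignment below (integer codes for the binary `OverTime`, one-hot columns for `EmpDepartment`) is an assumption, as are the sample values. scikit-learn's `LabelEncoder` and `OneHotEncoder` are the usual alternatives.

```python
import pandas as pd

# Invented sample rows; column names follow the report.
df = pd.DataFrame({
    "OverTime": ["No", "Yes", "No", "Yes"],
    "EmpDepartment": ["Sales", "Development", "Sales", "Finance"],
})

encoded = df.copy()

# Label encoding: each category becomes an integer code (alphabetical order: No=0, Yes=1).
encoded["OverTime"] = encoded["OverTime"].astype("category").cat.codes

# One-hot encoding: one 0/1 indicator column per department.
encoded = pd.get_dummies(encoded, columns=["EmpDepartment"], dtype=int)

print(encoded)
```

One-hot encoding avoids implying an order between departments, while integer codes are compact and sufficient for binary features.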


# **5. Analysis by Visualization**

Visualization revealed several important insights into employee behavior, department dynamics, and performance trends. Here are the key takeaways:

---

### **1. Training, Salary Hike & Performance**
- **Training improves performance**, especially for underperformers, though gains plateau with excessive hours.
- **High performers are often rewarded with salary hikes over 20%**, reinforcing performance-based incentives.

---

### **2. Department Trends**
- **Sales and R&D** are the largest departments.
- **Sales also reports the highest attrition**, pointing to possible dissatisfaction or pressure.

---

### **3. Performance Patterns**
- Most employees have a **Performance Rating of 3**.
- **Hourly pay has no clear link to performance**, suggesting compensation isn’t a direct driver of output.

---

### **4. Work-Life Balance & Environment**
- **Better work-life balance and higher environment satisfaction** are linked to **higher performance**.
- **Married employees** report **better balance**, possibly due to structured routines.

---

### **5. Employee Demographics & Overtime**
- The majority of the workforce is **young to mid-career**.
- **Overtime does not significantly impact attrition**, suggesting other factors play a larger role.

---

### **6. Performance vs Satisfaction**
- Surprisingly, **job satisfaction, attrition, and work-life balance have limited influence** on performance ratings in the current dataset.

---

### **Conclusion**
Overall, the visuals highlighted how training, environment, and structured incentives affect employee performance. These patterns support strategic planning in workforce development and retention.


# **Machine Learning Model**

To predict employee performance ratings, several machine learning algorithms were applied and evaluated on the processed dataset. The objective was to identify the most accurate model for use in performance prediction and HR decision-making.

---

## **Algorithms Used**

| **Model**                  | **Accuracy**  |
|----------------------------|---------------|
| **LogisticRegression**     | **72.92%**     |
| **DecisionTreeClassifier** | **89.17%**     |
| **RandomForestClassifier** | **92.08%**     |
| **SVC**                    | **83.75%**     |
| **KNN**                    | **66.25%**     |
| **NaiveBayes**             | **70.83%**     |
| **ANN (Artificial Neural Network)** | **83.33%**     |

These models are well-suited for classification problems involving structured, labeled data.

---

## **Methodology**

### **1. Data Preparation**
- The dataset was divided into **training** and **testing** sets.
- To address imbalance in the target variable (performance rating), **SMOTE (Synthetic Minority Over-sampling Technique)** was used.
  - SMOTE helps by generating synthetic data for minority classes, allowing models to learn from a more balanced dataset.
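
To make the idea behind SMOTE concrete, the sketch below implements its core step in NumPy: each synthetic row is a random interpolation between a minority-class sample and one of its nearest minority-class neighbours. This is a simplified illustration on invented two-feature data, not the project's code; the function name `smote_sketch` and the choice of `k=3` are assumptions. In practice the `SMOTE` class from the `imbalanced-learn` package is normally used, and oversampling is applied to the training split only so that the test set keeps its original class distribution.

```python
import numpy as np

def smote_sketch(X_min: np.ndarray, n_new: int, k: int = 3, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class only.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbours
    base = rng.integers(0, len(X_min), size=n_new)            # random starting samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour for each
    gap = rng.random((n_new, 1))                  # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Invented, imbalanced training data: 40 majority rows, 8 minority rows.
rng = np.random.default_rng(42)
X_train = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (8, 2))])
y_train = np.array([0] * 40 + [1] * 8)

X_min = X_train[y_train == 1]
n_new = int((y_train == 0).sum() - (y_train == 1).sum())

X_syn = smote_sketch(X_min, n_new)
X_bal = np.vstack([X_train, X_syn])
y_bal = np.concatenate([y_train, np.ones(n_new, dtype=int)])

print("class counts before:", np.bincount(y_train))
print("class counts after: ", np.bincount(y_bal))
```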

### **2. Model Training and Evaluation**
- All models were trained on the balanced training data and evaluated using the test set.
- **Accuracy** was chosen as the primary performance metric for comparison.
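
The train-and-compare loop can be sketched with scikit-learn as follows. Everything here is illustrative: the data comes from `make_classification` rather than the INX dataset, only three of the seven models are included, and the 80/20 split, `random_state` values, and default hyperparameters are assumptions. The accuracies it prints are therefore unrelated to those in the table above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for the performance-rating target.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)

# Stratified split keeps class proportions the same in both sets (assumed 80/20).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
}

# Fit each model on the training set and score it on the held-out test set.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in sorted(accuracies.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<24} {acc:.2%}")
```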

---

## **Results & Insights**

- **RandomForestClassifier** yielded the **highest accuracy (92.08%)**, making it the most accurate of the models compared.
- **DecisionTreeClassifier** also performed strongly with an **89.17%** accuracy, offering both interpretability and effectiveness.
- **ANN** and **SVC** delivered competitive performance with **83.33%** and **83.75%**, respectively.
- **LogisticRegression** and **NaiveBayes** produced moderate results at **72.92%** and **70.83%**, respectively.
- **KNN** was the least accurate at **66.25%**, indicating limited effectiveness on this dataset.

---

### **Conclusion**

This modeling approach, which used SMOTE to correct class imbalance, showed that tree-based models were the most accurate for predicting employee performance: the **Random Forest** ensemble came first, followed by the single **Decision Tree**. These models can be valuable assets for improving talent management strategies within HR analytics.
