# <center> INX FUTURE INC EMPLOYEE PERFORMANCE ANALYSIS

## <center> BUSINESS CASE: BASED ON THE GIVEN FEATURES OF THE DATASET, WE NEED TO PREDICT THE PERFORMANCE RATING OF EMPLOYEES

# Summary Report

This report provides a brief overview of the dataset that contains information about the employees of a company and their performance ratings. The dataset has **1200** observations and **28** variables. The observations represent the individual employees, and the variables represent their personal and professional attributes, such as age, gender, education, department, job role, salary, satisfaction, work experience, and attrition. The main objective of the dataset is to analyze the factors that influence the performance rating of the employees, which is the target variable.

The dataset is clean and well-structured, with no missing values, constant values, or duplicated rows. The variables are of different data types, such as numerical, categorical, and ordinal. The dataset is also imbalanced, with a skewed distribution of the target variable. This report describes the characteristics of each variable and the relationships among them.

# Data Preprocessing and Analysis Steps for Employee Performance Prediction

## Data Cleaning

The first step in the data preprocessing and analysis process is to clean the data and remove any unnecessary or irrelevant columns. In this case, the 'EmpNumber' column was dropped from the dataset because it does not provide any useful information for the prediction task. It is just a unique identifier for each employee and does not have any relation with their performance.

## Exploratory Data Analysis (EDA)

The next step is to explore the data and understand its characteristics and distributions. This involves identifying the categorical and numerical columns, checking the number of unique values in each column, and visualizing the data using plots and charts.
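
The cleaning and exploration steps described above can be sketched as follows. This is a minimal illustration that assumes pandas and uses a toy DataFrame with made-up values; the variable names (`df`, `categorical_cols`, `numerical_cols`, `unique_counts`) are hypothetical and not taken from the original notebook.

```python
import pandas as pd

# Toy stand-in for the employee dataset; the values are made up for illustration.
df = pd.DataFrame({
    "EmpNumber": ["E1001", "E1002", "E1003", "E1004"],
    "Age": [32, 47, 40, 41],
    "Gender": ["Male", "Male", "Female", "Male"],
    "EmpDepartment": ["Sales", "Sales", "Development", "Sales"],
    "PerformanceRating": [3, 3, 4, 3],
})

# Drop the identifier column, which carries no predictive information.
df = df.drop(columns=["EmpNumber"])

# Numeric dtypes are treated as numerical columns, everything else as categorical.
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = [c for c in df.columns if c not in numerical_cols]

# Number of distinct values in each column.
unique_counts = df.nunique()

print("Categorical:", categorical_cols)
print("Numerical:", numerical_cols)
print(unique_counts)
```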

### Categorical Columns

Categorical columns are those that contain discrete values that represent different categories or groups. For example, the 'Gender' column has two categories: 'Male' and 'Female'. Categorical columns can be further divided into ordinal and non-ordinal columns. Ordinal columns are those that have an inherent order or ranking among the categories, such as 'EducationBackground'. Non-ordinal columns are those that do not have any order or ranking among the categories, such as 'EmpDepartment'.


| Categorical Column           | Unique Values |
| ---------------------------- | ------------- |
| Gender                       | 2             |
| EducationBackground          | 6             |
| MaritalStatus                | 3             |
| EmpDepartment                | 6             |
| EmpJobRole                   | 19            |
| BusinessTravelFrequency      | 3             |
| OverTime                     | 2             |
| Attrition                    | 2             |



### Numerical Columns

Numerical columns are those that contain continuous or discrete values that represent some quantity or measurement. For example, the 'Age' column has numerical values that represent the age of the employees in years. Numerical columns can be further divided into integer and float columns. Integer columns are those that have whole numbers as values, such as 'DistanceFromHome'. Float columns are those that have decimal numbers as values, such as 'PerformanceRating'.


| Numerical Column                 | Unique Values |
|----------------------------------|---------------|
| Age                              | 43            |
| DistanceFromHome                 | 29            |
| EmpEducationLevel                | 5             |
| EmpEnvironmentSatisfaction       | 4             |
| EmpHourlyRate                    | 71            |
| EmpJobInvolvement                | 4             |
| EmpJobLevel                      | 5             |
| EmpJobSatisfaction               | 4             |
| NumCompaniesWorked               | 10            |
| EmpLastSalaryHikePercent         | 15            |
| EmpRelationshipSatisfaction      | 4             |
| TotalWorkExperienceInYears       | 40            |
| TrainingTimesLastYear            | 7             |
| EmpWorkLifeBalance               | 4             |
| ExperienceYearsAtThisCompany     | 37            |
| ExperienceYearsInCurrentRole     | 19            |


### Non-Encoded Columns

Non-encoded columns are those that have data type 'object', which means that they are stored as strings or text. These columns need to be encoded to numerical representations before they can be used for machine learning models. Encoding is the process of converting categorical values to numerical values based on some logic or rule.


| Non-Encoded Column           |
| ---------------------------- |
| Gender                       |
| MaritalStatus                |
| EmpDepartment                |
| EmpJobRole                   |
| BusinessTravelFrequency      |
| OverTime                     |
| Attrition                    |


These columns are non-ordinal, which means that they do not have any order or ranking among the categories. They were encoded using Label Encoding, a simple technique that assigns a unique integer to each category. The integers act only as labels and should not be read as a ranking.

## Ordinal Data Encoding

Ordinal data encoding is the process of encoding ordinal columns, which are those that have an inherent order or ranking among the categories, such as 'EducationBackground'. Ordinal data encoding requires a mapping that specifies the order of the categories and the corresponding numerical values.

The dataset has one column that is treated as ordinal: 'EducationBackground'. This column has 6 categories: 'Life Sciences', 'Marketing', 'Medical', 'Other', 'Technical Degree', and 'Human Resources'. These categories name fields of study rather than ranked levels of education, so the order below is the convention adopted for this analysis rather than a natural ranking. The mapping for this column is as follows:

| Category | Value |
| --- | --- |
| Life Sciences | 1 |
| Marketing | 2 |
| Medical | 3 |
| Other | 4 |
| Technical Degree | 5 |
| Human Resources | 6 |

Using this mapping, the ordinal column was encoded to numerical values.
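
A mapping of this kind can be applied with a plain dictionary. The sketch below assumes pandas and a toy column; the dictionary name `education_order` is hypothetical, while the category-to-value pairs are copied from the table above.

```python
import pandas as pd

# Category-to-integer mapping copied from the table in this section.
education_order = {
    "Life Sciences": 1,
    "Marketing": 2,
    "Medical": 3,
    "Other": 4,
    "Technical Degree": 5,
    "Human Resources": 6,
}

# Toy column with made-up rows for illustration.
df = pd.DataFrame(
    {"EducationBackground": ["Marketing", "Medical", "Life Sciences", "Other"]}
)

# Replace each category with its mapped integer.
df["EducationBackground"] = df["EducationBackground"].map(education_order)
encoded = df["EducationBackground"].tolist()
print(encoded)
```

Any category missing from the dictionary would become `NaN` after `map`, so it is worth checking for missing values after this step.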

## Non-Ordinal Data Encoding

Non-ordinal data encoding is the process of encoding non-ordinal columns, which are those that do not have any order or ranking among the categories, such as 'Gender'. Each category is assigned a unique integer that acts only as a label and is not meant to imply any order or preference.

The dataset has 7 non-ordinal columns: 'Gender', 'MaritalStatus', 'EmpDepartment', 'EmpJobRole', 'BusinessTravelFrequency', 'OverTime', and 'Attrition'. These columns were encoded using Label Encoding, which is a simple technique that assigns a unique integer to each category. For example, the 'Gender' column has two categories: 'Male' and 'Female'. The Label Encoding for this column is as follows:

| Category | Value |
| --- | --- |
| Male | 0 |
| Female | 1 |

Using Label Encoding, the non-ordinal columns were encoded to numerical values. The encoder models for each column were stored in a dictionary and saved using joblib, which is a tool for saving and loading Python objects.

## Summary of Encoding

Encoding is an important step in data preprocessing and analysis, as it prepares the data for machine learning models. Categorical columns need to be encoded to numerical representations, as most machine learning models cannot handle text or string values directly. The technique depends on the type of the categorical column. Ordinal columns are encoded using a mapping that specifies the order of the categories and the corresponding numerical values. Non-ordinal columns are encoded so that each category receives a unique integer that acts only as a label. Label Encoding is a simple and common technique for this.

## Saving Encoder Model

The encoder models for the non-ordinal columns were saved to a joblib file ('../../data/encoder_model.joblib') for future use. The joblib file can be loaded using joblib.load() function and used to transform new data with the same categories. This ensures consistency and compatibility of the data with the machine learning models.
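
The encode, save, and reload cycle described above might look like the following. This is a sketch that assumes scikit-learn's `LabelEncoder` (the report does not name the encoder class), a toy DataFrame, and a temporary directory in place of the real data folder.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data with made-up rows; only two of the non-ordinal columns are shown.
df = pd.DataFrame({
    "OverTime": ["No", "Yes", "No", "No"],
    "MaritalStatus": ["Single", "Married", "Divorced", "Single"],
})

# Fit one encoder per column and keep each fitted encoder in a dictionary.
encoders = {}
for col in ["OverTime", "MaritalStatus"]:
    enc = LabelEncoder()
    df[col] = enc.fit_transform(df[col].tolist())
    encoders[col] = enc

# Save the dictionary of encoders (a temporary folder stands in for the data folder).
path = os.path.join(tempfile.mkdtemp(), "encoder_model.joblib")
joblib.dump(encoders, path)

# Later: reload the encoders and transform new rows with the same categories.
loaded = joblib.load(path)
new_codes = loaded["OverTime"].transform(["Yes", "No"])
print(new_codes)
```

`LabelEncoder` numbers the classes in sorted order, so the integer given to each category depends on the category names. Transforming a category that was not seen during fitting raises an error.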

## Splitting Dependent and Independent Columns

The next step in the data preprocessing and analysis process is to split the dataset into dependent and independent columns. The dependent column is the target variable that we want to predict, while the independent columns are the features or predictors that we use to make the prediction.

In this case, the dependent column is 'PerformanceRating', which is a numerical value that represents the performance rating of each employee. The independent columns are all the other columns in the dataset, which contain various information about the employees, such as their age, gender, education, job role, salary, satisfaction, etc.

The dependent column was assigned to a variable named y, while the independent columns were assigned to a variable named X. This is a common convention in machine learning, where y denotes the output and X denotes the input.

## Removing Outliers

The next step in the data preprocessing and analysis process is to handle outliers in the dataset. Outliers are extreme or abnormal values that deviate significantly from the rest of the data. Outliers can affect the accuracy and performance of the machine learning models, as they can introduce noise and bias in the data.

### Box Plot Analysis

One way to identify outliers is to use box plots, which are graphical representations of the distribution of numerical data. Box plots show the minimum, maximum, median, and quartiles of the data, as well as any potential outliers. Outliers are usually defined as values that lie more than 1.5 times the interquartile range (IQR) below the lower quartile or above the upper quartile. The IQR is the difference between the 75th and 25th percentiles of the data.

Box plots were created for all the numerical columns that have more than 10 unique values, such as 'Age', 'DistanceFromHome', 'EmpHourlyRate', etc. These plots provide insights into the distribution and potential outliers in the numerical data.

### Outlier Imputation

Once outliers have been identified, one way to handle them is to impute them, which means to replace them with more reasonable values. One common method for outlier imputation is to use the median value of the data, which is the middle value that divides the data into two equal halves. The median is less sensitive to outliers than the mean, which is the average value of the data.

A function named impute_outliers_iqr was developed to impute outliers using the median value. The function takes a column of data as input and returns a column of data with imputed outliers as output. The function uses the IQR method to define the lower and upper bounds for outliers, and replaces any value that falls outside these bounds with the median value.

The function was applied to all the numerical columns that have more than 10 unique values, such as 'Age', 'DistanceFromHome', 'EmpHourlyRate', etc.

### Effect of Imputation

The effect of imputation was measured by printing the number of outliers before and after imputation for each column. The number of outliers was calculated using the IQR method, as described above. The results showed that imputation substantially reduced the number of values flagged as outliers.
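
The IQR rule and median imputation described in this section can be sketched as below. The function name `impute_outliers_iqr` comes from the report, but its body here is an assumed reconstruction; the helpers `iqr_bounds` and `count_outliers_iqr` and the parameter `k` are hypothetical additions for illustration.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the lower and upper outlier bounds under the IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def count_outliers_iqr(series, k=1.5):
    """Count the values that fall outside the IQR bounds."""
    lower, upper = iqr_bounds(series, k)
    return int(((series < lower) | (series > upper)).sum())


def impute_outliers_iqr(series, k=1.5):
    """Replace values outside the IQR bounds with the column median."""
    lower, upper = iqr_bounds(series, k)
    is_outlier = (series < lower) | (series > upper)
    return series.mask(is_outlier, series.median())


# Toy column with one extreme value (made-up data).
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

before = count_outliers_iqr(values)
cleaned = impute_outliers_iqr(values)
after = count_outliers_iqr(cleaned)
print(f"outliers before: {before}, after: {after}")
```

Because the quartiles are recomputed on the imputed column, the "after" count is measured against slightly different bounds than the "before" count.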

## Normalizing Distributions

The next step in the data preprocessing and analysis process is to normalize the distributions of the numerical columns. Normalizing the distributions means to transform the data so that it more closely follows a normal or Gaussian distribution, which is a bell-shaped curve that is symmetric around the mean. This can improve the performance of some machine learning models, such as linear regression, logistic regression, and neural networks, which often fit better and train more stably when the features are roughly symmetric rather than heavily skewed.

### Distribution Analysis

One way to analyze the distribution of the numerical columns is to use histograms and quantile-quantile (QQ) plots. Histograms are graphical representations of the frequency of the data, which show how many values fall into different bins or intervals. QQ plots are graphical representations of the relationship between the data and a theoretical normal distribution, which show how the data points deviate from a straight line. If the data is normally distributed, the histogram will have a bell shape and the QQ plot will have a straight line.

Histograms and QQ plots were used to assess the distribution of all the numerical columns, such as 'Age', 'DistanceFromHome', 'EmpHourlyRate', etc. The plots showed that some columns have skewed distributions, which means that they have a long tail on one side of the mean. For example, the 'EmpJobLevel' and 'EmpLastSalaryHikePercent' columns have right-skewed distributions, which means that they have a long tail on the right side of the mean.

### Power Transformation

One way to normalize the distributions of the numerical columns is to use power transformation, which is a mathematical technique that applies a function to the data to make it more symmetric and Gaussian-like. Power transformation can help in addressing skewness and reducing the effect of outliers in the data.

Power transformation can be done using different methods, such as the Box-Cox method and the Yeo-Johnson method. A related alternative is the quantile transformer, which maps the data to a target distribution through its ranks rather than through a power function. Each method has its own advantages and disadvantages, depending on the type and range of the data.

In this case, the Yeo-Johnson method was chosen for power transformation, as it can handle both positive and negative values, unlike the Box-Cox method, which can only handle positive values. The Yeo-Johnson method applies a different function to the data depending on whether it is positive or negative, and has a parameter named lambda that controls the degree of transformation. The optimal value of lambda can be estimated using the maximum likelihood method, which maximizes the likelihood of the transformed data being normally distributed.

The Yeo-Johnson method was applied to selected feature columns that have skewed distributions, such as 'EmpJobLevel' and 'EmpLastSalaryHikePercent'. The optimal value of lambda was estimated for each column, and the transformed data was plotted using histograms and QQ plots. The plots showed that the transformed data had more normal distributions than the original data.
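
A Yeo-Johnson transformation with a lambda estimated by maximum likelihood can be sketched as follows. The report does not say which implementation was used, so scikit-learn's `PowerTransformer` is an assumption here, as are the synthetic right-skewed data and the small `skewness` helper.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer


def skewness(a):
    """Sample skewness: third central moment over the cubed standard deviation."""
    a = np.asarray(a, dtype=float)
    centered = a - a.mean()
    return float((centered ** 3).mean() / (centered ** 2).mean() ** 1.5)


# Synthetic right-skewed feature (made-up data standing in for a skewed column).
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=500).reshape(-1, 1)

# standardize=False keeps the output on the transformed scale without rescaling.
pt = PowerTransformer(method="yeo-johnson", standardize=False)
x_transformed = pt.fit_transform(x)

fitted_lambda = pt.lambdas_[0]  # lambda estimated by maximum likelihood
skew_before = skewness(x)
skew_after = skewness(x_transformed)
print(f"lambda={fitted_lambda:.3f}, skew before={skew_before:.3f}, after={skew_after:.3f}")
```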

## Scaling the Dataset

The next step in the data preprocessing and analysis process is to scale the dataset, which means to standardize the numerical features so that they have a similar range and scale. Scaling the dataset can improve the performance of some machine learning models, such as k-nearest neighbors, support vector machines, and neural networks, as they are sensitive to the magnitude and variation of the features.

### Standard Scaling

One way to scale the dataset is to use standard scaling, which is a common technique that standardizes the numerical features so that they have a mean of 0 and a standard deviation of 1. Standard scaling can be done using the following formula:

z = (x - mu) / sigma

where x is the original value, mu is the mean of the feature, sigma is the standard deviation of the feature, and z is the standardized value.

Standard scaling was done using StandardScaler, a class from the scikit-learn library that implements this technique. Its two main methods are fit and transform: fit calculates the mean and standard deviation of each feature, while transform applies the standard scaling formula to each value. The fit_transform method combines both steps in one call.

The StandardScaler class was instantiated and applied to the dataset using the fit_transform method. The scaled dataset was returned as a NumPy array, which is a data structure that stores multidimensional arrays of numbers.
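
As a quick check of the formula, the sketch below compares `StandardScaler` with a manual computation of z = (x - mu) / sigma. The small array is made-up data, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: two features on different scales (made-up values).
X = np.array([[25.0, 1.0], [35.0, 3.0], [45.0, 5.0]])

# fit_transform learns the per-feature mean and standard deviation, then scales.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Manual version of the same formula, using the population standard deviation.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled)
print(np.allclose(X_scaled, manual))
```

Keeping the fitted `scaler` object around allows new data to be scaled with `scaler.transform`, using the statistics learned from the original data.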

## Feature Selection

The next step in the data preprocessing and analysis process is to perform feature selection, which is the process of choosing the most relevant and informative features for the prediction task and removing the redundant or irrelevant features. Feature selection can improve the accuracy and efficiency of the machine learning models, as it reduces the dimensionality and complexity of the data.

### Correlation Heatmap

One way to perform feature selection is to use correlation analysis, which is a statistical technique that measures the strength and direction of the linear relationship between two variables. Correlation analysis can help in identifying potential correlations between the features, which can indicate multicollinearity or redundancy. Multicollinearity is a problem that occurs when two or more features are highly correlated, which can affect the stability and interpretability of the machine learning models.

Correlation analysis can be done using the Pearson correlation coefficient, which is a numerical value that ranges from -1 to 1, where -1 indicates a perfect negative correlation, 0 indicates no correlation, and 1 indicates a perfect positive correlation. The Pearson correlation coefficient can be calculated using the following formula:

r = cov(x, y) / (std(x) * std(y))

where x and y are the two variables, cov(x, y) is the covariance of x and y, std(x) is the standard deviation of x, and std(y) is the standard deviation of y.

The Pearson correlation coefficient can be calculated for all pairs of features in the dataset, resulting in a correlation matrix, which is a table that shows the correlation coefficients for each pair of features. The correlation matrix can be visualized using a heatmap, which is a graphical representation that uses colors to indicate the magnitude and sign of the correlation coefficients.

A heatmap of the correlation matrix was plotted using the seaborn library, which is a data visualization library that provides high-level interfaces for drawing attractive and informative statistical graphics. The heatmap showed the correlation coefficients for each pair of features, using a color scale that ranged from blue to red, where blue indicated a negative correlation, white indicated no correlation, and red indicated a positive correlation.

### High Correlation Check

The heatmap of the correlation matrix was checked for features with a high correlation coefficient, defined as greater than 0.80 in absolute value. This threshold was chosen to identify features that have a strong linear relationship, which can indicate multicollinearity or redundancy.

The results showed that no pair of features exceeded this threshold, so the pairwise check gave no evidence of strong multicollinearity or redundancy in the dataset. On this basis, none of the features were dropped from the dataset.
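
The threshold check can also be done programmatically rather than by reading the heatmap. The sketch below assumes pandas and synthetic data in which two columns are deliberately near-collinear; the column names `a`, `b`, `c` and the list `high_pairs` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic features: "b" is built to be almost collinear with "a".
rng = np.random.default_rng(42)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

# Pearson correlation matrix for all pairs of features.
corr = df.corr(method="pearson")

# Collect each pair once (upper triangle) whose |r| exceeds the threshold.
threshold = 0.80
cols = corr.columns
high_pairs = []
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        r = corr.iloc[i, j]
        if abs(r) > threshold:
            high_pairs.append((cols[i], cols[j], round(float(r), 3)))

print(high_pairs)
```

The same `corr` matrix can be passed to seaborn's `heatmap` function (for example with `cmap="coolwarm"` and `center=0`) to produce a blue-to-red plot like the one described above.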

## Principal Component Analysis (PCA)

The next step in the data preprocessing and analysis process is to apply Principal Component Analysis (PCA), which is a dimensionality reduction technique that transforms the dataset into a set of linearly uncorrelated variables. PCA can help in reducing the complexity and redundancy of the data, as well as improving the performance and interpretability of the machine learning models.

### PCA Application

PCA was applied to the scaled dataset using the PCA class from the scikit-learn library. Its two main methods are fit and transform: fit calculates the principal components, which are the directions of maximum variance in the data, while transform projects the data onto those components. The fit_transform method combines both steps in one call.

The PCA class was instantiated and applied to the scaled dataset using the fit_transform method. The transformed dataset was returned as a NumPy array, which is a data structure that stores multidimensional arrays of numbers.

### Explained Variance Analysis

The explained variance analysis is a way to evaluate the effectiveness of the PCA transformation, which measures how much of the original variance in the data is retained by the principal components. The explained variance ratio is a numerical value that indicates the proportion of variance explained by each principal component, while the cumulative explained variance is the sum of the explained variance ratios of all the principal components up to a certain point.

The explained variance ratio was read from the explained_variance_ratio_ attribute of the fitted PCA object, and the cumulative explained variance was obtained by applying a cumulative sum (cumsum) to it. The results showed that the first principal component explained about 38% of the variance in the data, while the second principal component explained about 14% of the variance in the data. The cumulative explained variance showed that the first two principal components explained about 52% of the variance in the data, while the first three principal components explained about 63% of the variance in the data.

### Determination of Components

The determination of components is a way to decide how many principal components to keep for the final dataset, which depends on the desired level of variance retention and dimensionality reduction. A common practice is to choose a threshold for the cumulative explained variance, such as 95%, and select the number of principal components that achieve that threshold.

In this case, the threshold for the cumulative explained variance was set to 95%, which means that the final dataset should retain 95% of the original variance in the data. The number of principal components needed to achieve that threshold was determined by finding the index of the first element in the cumulative explained variance array that was greater than or equal to 0.95. The result showed that 22 principal components were needed to explain 95% of the variance in the data.
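
The explained-variance analysis and the 95% cut-off can be sketched as follows, assuming scikit-learn's `PCA`, NumPy's `cumsum`, and synthetic data. The numbers this produces are unrelated to the 22 components reported above; only the procedure is illustrated.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 8 observed features driven by 3 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 8))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 8))
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components to inspect the variance spectrum.
pca = PCA()
pca.fit(X_scaled)
explained = pca.explained_variance_ratio_  # attribute of the fitted object
cumulative = np.cumsum(explained)          # running total of explained variance

# First position where the cumulative variance reaches 95%; +1 turns the
# zero-based index into a component count.
n_components = int(np.argmax(cumulative >= 0.95)) + 1

# Refit with only that many components and project the data.
pca_reduced = PCA(n_components=n_components)
X_reduced = pca_reduced.fit_transform(X_scaled)
print(n_components, X_reduced.shape)
```

scikit-learn also accepts a fraction such as `PCA(n_components=0.95)`, which selects the number of components needed to reach that share of variance.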

### PCA Model Saving

The PCA model, along with the reduced dimensions, was saved using joblib, which is a tool for saving and loading Python objects. The PCA model was saved to a joblib file ('../../data/pca_model.joblib') for future use. The joblib file can be loaded using the joblib.load() function and used to transform new data with the same features. This ensures consistency and compatibility of the data with the machine learning models.

## Save Processed Data

The final step in the data preprocessing and analysis process is to save the processed data, including the PCA components and the target variable, to a CSV file for further analysis and model training. A CSV file stores tabular data in plain text, where each line represents a row and commas separate the column values.

The processed data, which consisted of the PCA components and the target variable, was saved to a CSV file ('../../data/processed_data.csv') using the to_csv() method of the pandas library, which is a data analysis and manipulation library that provides high-performance and easy-to-use data structures and tools. The to_csv() method takes a file path as input and writes the data to that file. The index parameter was set to False to avoid writing the row labels to the file.

# Model Training Steps

- ***Step 1: Importing Necessary Libraries***
  - Output: No direct output. This step is about loading essential libraries for data manipulation.
  
- ***Step 2: Reading Processed Data***
  - Output: A Pandas DataFrame (data) containing the processed data from the CSV file.
  
- ***Step 3: Splitting Independent and Dependent Data***
    - In this step, we separate the input data into two datasets: one that contains the **independent variables** (also called features or predictors) and one that contains the **dependent variable** (also called target or outcome). The independent variables are the ones that we use to explain or predict the dependent variable. For example, if we want to predict the house price based on its size, location, and number of rooms, then the independent variables are size, location, and number of rooms, and the dependent variable is the house price. We usually denote the independent variables as **X** and the dependent variable as **y**. This step is important because it allows us to separate the input data into the variables that we want to use for modeling and the variable that we want to predict.
    
- ***Step 4: Balancing the Dataset using SMOTE***
    - In this step, we deal with the problem of **class imbalance** in the target variable. Class imbalance occurs when one class (or category) of the target variable has many more observations than the other classes. For example, if we want to predict whether a customer will buy a product or not, and 90% of the customers buy the product, then we have a class imbalance problem. This problem can affect the performance and accuracy of the machine learning models, as they tend to favor the majority class and ignore the minority class. To overcome this problem, we can use a technique called **SMOTE** (Synthetic Minority Oversampling Technique), which creates synthetic samples of the minority class by interpolating between existing samples and their nearest neighbors. This way, we can increase the number of observations of the minority class and balance the dataset. The output of this step is two datasets after balancing using SMOTE: **X_sm** (balanced features) and **y_sm** (balanced target).
    
- ***Step 5: Train-Test Split***
    - In this step, we split the balanced datasets into two subsets: one for **training** the machine learning model and one for **testing** the model. The training subset is used to fit the model to the data and learn the patterns and relationships between the features and the target. The testing subset is used to evaluate the model on new and unseen data and measure its performance and accuracy. We usually use a ratio of 80:20 or 70:30 for the train-test split, meaning that we use 80% or 70% of the data for training and the remaining 20% or 30% for testing. We can use a random function to split the data into the two subsets, or we can use a stratified function to preserve the proportion of the classes in the target variable. The output of this step is four datasets: **X_train** and **y_train** (training features and target), and **X_test** and **y_test** (testing features and target).
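
Steps 4 and 5 can be illustrated with the sketch below. The oversampling function is a simplified, NumPy-only illustration of the SMOTE idea and is not the implementation used in the project (the imbalanced-learn library provides a `SMOTE` class with `fit_resample` for real use). The function name `smote_like_oversample`, its parameters, and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def smote_like_oversample(X, y, k=3, seed=0):
    """Grow each smaller class to the majority size by interpolating between
    a sample and one of its k nearest neighbours from the same class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for cls, count in zip(classes, counts):
        n_new = target - count
        if n_new == 0:
            continue
        X_cls = X[y == cls]
        # Pairwise distances within the class; a sample is not its own neighbour.
        dists = np.linalg.norm(X_cls[:, None, :] - X_cls[None, :, :], axis=2)
        np.fill_diagonal(dists, np.inf)
        k_eff = min(k, len(X_cls) - 1)
        neighbours = np.argsort(dists, axis=1)[:, :k_eff]
        # Pick a base sample and one of its neighbours for every new point.
        base = rng.integers(0, len(X_cls), size=n_new)
        picked = neighbours[base, rng.integers(0, k_eff, size=n_new)]
        gap = rng.random((n_new, 1))  # position along the connecting segment
        X_parts.append(X_cls[base] + gap * (X_cls[picked] - X_cls[base]))
        y_parts.append(np.full(n_new, cls))
    return np.vstack(X_parts), np.concatenate(y_parts)


# Synthetic imbalanced data with three rating classes (made-up values).
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = np.array([3] * 40 + [2] * 14 + [4] * 6)

X_sm, y_sm = smote_like_oversample(X, y)

# 80:20 split; stratify keeps the class proportions equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X_sm, y_sm, test_size=0.2, stratify=y_sm, random_state=42
)
print(X_train.shape, X_test.shape)
```

With the step order used here, oversampling happens before the split, so synthetic points derived from rows that end up in the test set can appear in the training set. Splitting first and oversampling only the training portion is a common alternative.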

# Conclusion
- The processed dataset is then classified by **11 different models**, such as logistic regression, decision tree, random forest, support vector machine, etc. Each model is trained on the training set and evaluated on the testing set to measure its accuracy score, which is the proportion of correct predictions out of the total predictions.
- The results show that the **best performing model** is the **XGBOOST classifier**, which has a testing accuracy score of **0.946667**. This means that the XGBOOST model correctly predicts the performance rating of about 95% of the employees in the testing set. The XGBOOST model is a type of gradient boosting method, which uses a series of weak learners to create a strong learner by minimizing a loss function.
- The **second best performing model** is the **random forest classifier**, which has a testing accuracy score of **0.944762**. This means that the random forest model correctly predicts the performance rating of about 94% of the employees in the testing set. The random forest model is a type of ensemble learning method, which combines multiple decision trees to create a more robust and accurate classifier.
- The **third best performing model** is the **support vector machine classifier**, which has a testing accuracy score of **0.940952**. This means that the support vector machine model correctly predicts the performance rating of about 94% of the employees in the testing set. The support vector machine model is a type of supervised learning method, which finds the optimal hyperplane that separates the classes with the maximum margin.
- The **worst performing model** is the **extra trees classifier**, which has a testing accuracy score of **0.729524**. This means that the extra trees model correctly predicts the performance rating of about 73% of the employees in the testing set. The extra trees model is a variation of the random forest model, which uses random splits at each node instead of the best splits.
- The **average accuracy score** of all the models is **0.858571**, which indicates a **good performance** in predicting the performance rating of the employees. However, there is a possibility of **overfitting**, which means the models may not generalize well to new data. To address this issue, hyperparameter tuning can be applied to each model to optimize their parameters and reduce overfitting. Hyperparameter tuning is a process of finding the best combination of values for the parameters that control the behavior of the models, such as the number of trees, the depth of the trees, the learning rate, etc.

- The **random forest classifier model** and the **XGBOOST classifier model** that were trained on the **inx performance report analysis** dataset were then saved. The dataset contains information about the employees of a company, and the goal is to predict their performance rating based on various factors, such as age, gender, education, salary, etc.

- After applying an artificial neural network (ANN) model to the same dataset, we achieved better accuracy than with the classification models used previously.
- With the ANN model, we achieved an accuracy score of up to 95%, which can be considered suitable for a production-grade use case.
- The ANN model is a powerful and flexible tool for solving complex and nonlinear problems in various domains.

- Two different modules, **joblib** and **pickle**, are used to save the models as files that can later be loaded and used for prediction or evaluation. Joblib and pickle are both Python modules that provide functions for serializing and deserializing Python objects, such as models, data structures, etc. Serializing means converting an object into a stream of bytes that can be stored or transmitted, and deserializing means reconstructing an object from a stream of bytes.
- The joblib module is imported first, and the **joblib.dump** function is used to save the random forest model and the XGBOOST model as **joblib files**. The function takes two arguments: the model object and the file name, which can include the path where the file will be stored. The function returns a list of the file names that were created. The random forest model is saved as **random_for_class_hyp.joblib** and the XGBOOST model is saved as **xgb_class_hyp.joblib** in the **data** folder.
- The pickle module is then imported, and the **pickle.dump** function is used to save the random forest model as a **pickle file**. The function takes two arguments: the model object and a file object that is opened in write-binary mode. The random forest model is saved as **random_for_class_hyp.pickle** in the **data** folder. The file object is opened with the **open** function, using the file name and the mode argument **'wb'**, inside a **with** statement so that the file is closed automatically after the operation.
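
The two saving approaches can be sketched side by side. The example below trains a small stand-in random forest on synthetic data and writes to a temporary directory instead of the real data folder; the file names follow the ones mentioned above, while every other name is hypothetical.

```python
import os
import pickle
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on synthetic data (not the project's tuned model).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = (X[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

out_dir = tempfile.mkdtemp()  # stand-in for the data folder
joblib_path = os.path.join(out_dir, "random_for_class_hyp.joblib")
pickle_path = os.path.join(out_dir, "random_for_class_hyp.pickle")

# joblib.dump takes the object and a file name and returns the files it wrote.
written = joblib.dump(model, joblib_path)

# pickle.dump needs a file object opened in write-binary mode.
with open(pickle_path, "wb") as fh:
    pickle.dump(model, fh)

# Reload both copies to confirm they behave like the original model.
from_joblib = joblib.load(joblib_path)
with open(pickle_path, "rb") as fh:
    from_pickle = pickle.load(fh)

print(written)
```

Both formats should only be loaded from trusted sources, since unpickling can execute arbitrary code, and models are best reloaded with the same library versions that were used to save them.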