## Model Prediction and Evaluation for Employee Performance

In [2]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

In [3]:
# Load the data, one-hot encode categorical features, and split into train/test sets
data_path = "/home/ec2-user/SageMaker/data/EmployeeData_Raw.csv"

try:
    employee_df = pd.read_csv(data_path)
    X = employee_df.drop(['PerformanceRating', 'EmpNumber'], axis=1)
    y = employee_df['PerformanceRating']

    # One hot encoding of categorical features
    categorical_features = X.select_dtypes(include='object').columns
    X_processed = pd.get_dummies(X, columns=categorical_features, drop_first=True)

    # Splitting data into training and testing
    X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=42, stratify=y)
    print("Data loaded, processed and split into training and testing sets.")
except FileNotFoundError:
    print(f"Error: {data_path} not found")
    X_test, y_test = None, None

Data loaded, processed and split into training and testing sets.


In [5]:
# Loading trained models
trained_models = {}
models_dir = '/home/ec2-user/SageMaker/src/Models'

try:
    trained_models['Logistic Regression'] = joblib.load(f'{models_dir}/logistic_regression_model.joblib')
    print("Loaded Logistic Regression")

    trained_models['Random Forest Classifier'] = joblib.load(f'{models_dir}/random_forest_model.joblib')
    print("Loaded Random Forest Classifier")

    trained_models['Light GBM Classifier'] = joblib.load(f'{models_dir}/lightgbm_model.joblib')
    print("Loaded Light GBM Classifier")
except FileNotFoundError as e:
    print(f"Error loading models: {e}. Model not found in '{models_dir}' directory.")
    trained_models = {}
except Exception as e:
    print(f"An error occurred during model loading: {e}")
    trained_models = {}

Loaded Logistic Regression
Loaded Random Forest Classifier
Loaded Light GBM Classifier


In [16]:
# Making predictions on test dataset
predictions = {}  # defined up front so the evaluation cell never hits a NameError
if trained_models and X_test is not None:
    print("Making predictions on test dataset")
    for model_name, model in trained_models.items():
        predictions[model_name] = model.predict(X_test)
        print(f"Predictions made using {model_name}.")

    print("Sample prediction vs actual value of 20 values")
    actual = y_test.values
    logistic_regression_preds = predictions.get('Logistic Regression', [])
    random_forest_preds = predictions.get('Random Forest Classifier', [])
    lightgbm_preds = predictions.get('Light GBM Classifier', [])

    sample_comparison = pd.DataFrame({
        'Actual': actual,
        'Logistic Regression Pred': logistic_regression_preds,
        'Random Forest Pred': random_forest_preds,
        'LightGBM Pred': lightgbm_preds
    })
    print(sample_comparison.head(20).to_markdown(index=False, numalign="left", stralign="left"))
else:
    print("Predictions skipped: trained models or test data not available")

Making predictions on test dataset
Predictions made using Logistic Regression.
Predictions made using Random Forest Classifier.
Predictions made using Light GBM Classifier.
Sample prediction vs actual value of 20 values
| Actual   | Logistic Regression Pred   | Random Forest Pred   | LightGBM Pred   |
|:---------|:---------------------------|:---------------------|:----------------|
| 3        | 3                          | 3                    | 3               |
| 3        | 3                          | 3                    | 3               |
| 2        | 3                          | 3                    | 3               |
| 2        | 2                          | 2                    | 2               |
| 3        | 3                          | 3                    | 3               |
| 3        | 3                          | 3                    | 3               |
| 2        | 3                          | 2                    | 2               |
| 3        | 3                   

In [19]:
# Evaluating model performance
if predictions and y_test is not None:
    print("Model evaluation on test set")

    for model_name, y_pred in predictions.items():
        print(f"\n--- {model_name} Evaluation ---")
        accuracy = accuracy_score(y_test, y_pred)
        print(f"Accuracy: {accuracy:.4f}")
        print("Classification Report:")
        print(classification_report(y_test, y_pred))
else:
    print("Model evaluation skipped: predictions or test labels not available")

Model evaluation on test set

--- Logistic Regression Evaluation ---
Accuracy: 0.8417
Classification Report:
              precision    recall  f1-score   support

           2       0.66      0.54      0.59        39
           3       0.88      0.93      0.90       175
           4       0.82      0.69      0.75        26

    accuracy                           0.84       240
   macro avg       0.78      0.72      0.75       240
weighted avg       0.83      0.84      0.84       240


--- Random Forest Classifier Evaluation ---
Accuracy: 0.9167
Classification Report:
              precision    recall  f1-score   support

           2       0.91      0.79      0.85        39
           3       0.91      0.98      0.95       175
           4       1.00      0.65      0.79        26

    accuracy                           0.92       240
   macro avg       0.94      0.81      0.86       240
weighted avg       0.92      0.92      0.91       240


--- Light GBM Classifier Evaluation ---
Acc

### 4. Using the Model for Hiring

The trained model can be integrated into the hiring process. For each new candidate, collect the same data points the model was trained on (Age, Gender, EducationBackground, etc.) and preprocess them with *exactly the same steps* as the training data: the one-hot encoding used in this notebook, plus any missing-value handling that was applied. The encoded candidate row must end up with the same columns, in the same order, as the training matrix. The preprocessed data is then passed to the best-performing model to obtain a predicted performance rating.
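Encoding a single candidate with `pd.get_dummies` produces only the dummy columns for the categories that candidate happens to have, so the result usually needs to be aligned with the training columns before prediction. The following is a minimal sketch of one way to do that with `DataFrame.reindex`. The toy values, the explicit `categorical_features` list, and the helper `encode_candidate` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy training frame standing in for the real employee data (hypothetical values).
train_raw = pd.DataFrame({
    "Age": [32, 45, 28, 39],
    "Gender": ["Male", "Female", "Female", "Male"],
    "EducationBackground": ["Marketing", "Life Sciences", "Medical", "Marketing"],
})

# Assumption: the categorical columns are listed explicitly for this sketch.
categorical_features = ["Gender", "EducationBackground"]

# Same style of encoding as in the notebook (drop_first=True).
train_encoded = pd.get_dummies(train_raw, columns=categorical_features, drop_first=True)
training_columns = train_encoded.columns  # would be saved alongside the model


def encode_candidate(candidate: dict, training_columns: pd.Index) -> pd.DataFrame:
    """One-hot encode a single candidate and align it to the training columns."""
    row = pd.DataFrame([candidate])
    # No drop_first here: with one row it would drop the only dummy per feature.
    encoded = pd.get_dummies(row, columns=categorical_features)
    # Dummies absent for this candidate become 0; columns the model never
    # saw (including the dropped baseline categories) are discarded.
    return encoded.reindex(columns=training_columns, fill_value=0)


candidate_X = encode_candidate(
    {"Age": 30, "Gender": "Male", "EducationBackground": "Medical"},
    training_columns,
)
print(candidate_X)
```

In practice `training_columns` would come from the `X_processed` frame used for training and would be persisted together with the model, so that scoring code does not need access to the raw training data.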

This predicted rating would serve as *one* input into the hiring decision, alongside interviews, assessments, and other criteria. It should not be the sole determinant; it is a tool that informs decision-makers about a candidate's potential performance based on historical employee data.

**Steps for using the model in hiring:**
1. Collect candidate data points corresponding to the model's features.
2. Preprocess the candidate data using the same steps as the training data (one-hot encoding, plus scaling or imputation if they were applied during training).
3. Load the best-performing trained model.
4. Call the model's `.predict()` method on the preprocessed candidate data.
5. Interpret the predicted performance rating as one factor in the hiring decision.
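Step 3 depends on what "best performing" means. The test set is imbalanced (175 of 240 samples have rating 3), so accuracy alone can favor a model that mostly predicts the majority class. One possible criterion, shown here only as an assumption and not as the notebook's own selection rule, is the macro-averaged F1 score, which weights every rating class equally. The labels and model names below are hypothetical stand-ins for `y_test` and the `predictions` dictionary.

```python
from sklearn.metrics import f1_score

# Hypothetical labels and predictions standing in for y_test and `predictions`.
y_true = [2, 3, 3, 3, 4, 3, 2, 3]
toy_predictions = {
    "Model A": [3, 3, 3, 3, 3, 3, 3, 3],  # always predicts the majority class
    "Model B": [2, 3, 3, 3, 4, 3, 3, 3],  # misses one sample of class 2
}

# Macro F1 averages the per-class F1 scores without weighting by support.
# zero_division=0 scores a class as 0 when a model never predicts it.
macro_f1 = {
    name: f1_score(y_true, preds, average="macro", zero_division=0)
    for name, preds in toy_predictions.items()
}

best_model_name = max(macro_f1, key=macro_f1.get)
for name, score in macro_f1.items():
    print(f"{name}: macro F1 = {score:.3f}")
print(f"Selected: {best_model_name}")
```

The same pattern could be applied to the notebook's `predictions` and `y_test`, after which the selected name would index into `trained_models` for step 4.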