# HR DATA ANALYSIS

In this notebook we will look at the dataset to see if we can glean useful insights by means of Data Analysis and Data Visualization. Roughly, we will follow the structure below:

* Load the data.
* Display useful statistics.
* Build generic functions to detect nulls and missing values.
* Handle missing values.
* Make Visualizations to understand data better.

The problem is one of classification, as we are predicting a binary value: promoted (1) or not (0). After going through the steps listed above, one can build ML models such as Naive Bayes, Logistic Regression or Random Forests. This notebook covers EDA concepts.

# Load libraries

In [None]:
import os
import numpy as np 
import matplotlib.pyplot as plt
import missingno as msno
import seaborn as sns
import pandas as pd 
pd.options.mode.chained_assignment = None  # default='warn'



# Load Dataset

In [None]:
df_train = pd.read_csv('/kaggle/input/hranalysis/train.csv')
df_test = pd.read_csv('/kaggle/input/hranalysis/test.csv')



# Display rows
print(df_train.head(5))
print('======================')
print(df_test.head(5))

# Display summary statistics


In [None]:
# List the column names
print(list(df_train.columns))

In [None]:
# Describe the data
print(df_train.describe())
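One way to tell binary flags from continuous columns in a summary like this is to count the distinct values per column. The sketch below runs on a small toy frame; its column names and values are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy data: column names and values are assumptions for illustration only
toy = pd.DataFrame({
    'awards_won': [0, 1, 0, 0],
    'avg_training_score': [49, 60, 85, 73],
    'no_of_trainings': [1, 2, 1, 3],
})

# Number of distinct values in each numeric column
unique_counts = toy.select_dtypes(include='number').nunique()

# A column is treated as a binary flag if all of its values are 0 or 1
binary_cols = [
    col for col in unique_counts.index
    if set(toy[col].dropna().unique()) <= {0, 1}
]

print(unique_counts)
print('Binary columns:', binary_cols)
```

The same two steps should work on `df_train` directly by swapping `toy` for it.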

Many of the numerical columns appear to hold discrete values (0 or 1). Columns like KPIs_met range between 0 and 1. Let's look into the categorical columns.

In [None]:
# Select categorical columns
print(df_train.select_dtypes(include = ['object']))

At this point, looking at the categorical columns, *Region* can be removed (unless we find something that refutes this decision); *department*, *education* and *recruitment_channel* can be encoded via either LabelEncoding or OneHotEncoding.
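As a rough illustration of the two encoding styles, here is a minimal sketch using pandas equivalents (category codes and `pd.get_dummies`) rather than scikit-learn's encoders. The toy values are assumptions, not rows from the dataset.

```python
import pandas as pd

# Toy data: the category values are assumptions for illustration only
toy = pd.DataFrame({
    'department': ['Sales', 'HR', 'Sales', 'Finance'],
    'recruitment_channel': ['other', 'sourcing', 'referred', 'other'],
})

# Label-style encoding: each category is mapped to an integer code.
# Categories are ordered alphabetically, so Finance=0, HR=1, Sales=2.
label_encoded = toy['department'].astype('category').cat.codes

# One-hot encoding: one indicator column per category value
one_hot = pd.get_dummies(toy, columns=['recruitment_channel'])

print(label_encoded.tolist())
print(one_hot.columns.tolist())
```

Integer codes imply an ordering between categories, which may suit *education* better than *department*; one-hot encoding avoids that implication at the cost of extra columns.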

In [None]:
print(df_train.info())

# Investigating Missing Values

In [None]:
# Generic function to calculate missing values, zero values
def calcMissingValues(df):
    '''    
        This function is used to calculate : zero values, missing values, NA and returns a dataframe with the above calculated
        values. 
        
        Input: Dataframe
        Output: Returns a dataframe
    '''
    
    # Calc zero vals
    zero_vals = (df == 0.0).astype(int).sum(axis = 0)
    
    # Calc missing vals
    missing_vals = df.isnull().sum()
    
    # Calc missing value percent
    missing_val_percent = round((missing_vals / len(df)) * 100.0, 2)
    
    df_missing_stat = pd.concat([zero_vals , missing_vals , missing_val_percent] , axis = 1)
    
    df_missing_stat = df_missing_stat.rename(columns = {0: 'zero_vals', 1: 'missing_vals', 2: '%_missing_vals'})
    
    df_missing_stat['data_types'] = df.dtypes
    
    print(df_missing_stat)
    
    return df_missing_stat

In [None]:
df_missing_stat = calcMissingValues(df_train)

# Visualize Missing values

We will use the ***missingno*** library to visualize the missing values in our dataset. Visualization provides some intuition and a possible pattern that can be useful to interpret the data in a better way.

In [None]:
# plot a missing value matrix
msno.matrix(df_train)
plt.show()

*previous_year_rating* has missing values, and it would be interesting to see whether the values were simply not recorded or did not exist, which may happen when *length_of_service* is less than 1 (the employee is either a trainee or has joined relatively recently). Both columns must be examined before handling the missing values.

In [None]:
# Plotting a bar graph
msno.bar(df_train , figsize = (10 , 8) , color = 'orange')
plt.show()

The right axis gives the number of rows and the left axis gives the proportion of rows to the total. The value at the top of each bar is the actual number of non-missing rows in that column.

In [None]:
# Observe null records to see if there is any corresponding pattern in other columns
train_copy = df_train.copy()
print(train_copy[train_copy['previous_year_rating'].isnull()]['length_of_service'])

print()

print(train_copy[train_copy.filter(items = ['previous_year_rating']).isnull().any(axis = 1)]['length_of_service'])

Both lines of code give the same result, and our initial guess was correct: whenever there is a null value in *previous_year_rating*, the *length_of_service* column has a value of 1. The nulls are therefore systematic rather than random, which rules out deleting the rows having nulls. Let's look at the summary statistics again to see if we can impute reasonably.
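Reading a long printed Series by eye is error-prone, so the same check can be condensed into a count plus a single boolean. This is a minimal sketch on toy data (the values are assumptions); the column names follow the ones used above.

```python
import numpy as np
import pandas as pd

# Toy data: values are assumptions for illustration only
toy = pd.DataFrame({
    'previous_year_rating': [np.nan, 3.0, np.nan, 5.0, 4.0],
    'length_of_service': [1, 4, 1, 7, 1],
})

# Rows where the rating is missing
null_mask = toy['previous_year_rating'].isnull()

# Distribution of length_of_service among those rows
service_when_null = toy.loc[null_mask, 'length_of_service'].value_counts()
print(service_when_null)

# True only if every row with a missing rating has length_of_service == 1
all_new_joiners = (toy.loc[null_mask, 'length_of_service'] == 1).all()
print(all_new_joiners)
```

Note that the reverse does not have to hold: an employee with one year of service can still have a rating, as in the last toy row.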

In [None]:
print(df_train.describe())

Since the mean value of *previous_year_rating* is 3.3, we can impute the missing values with the mean: it makes more sense to give these employees an average rating than a rating of 1, which is not realistic.

In [None]:
# Replace the missing values for previous_year_rating with mean
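# Assign the result back: calling fillna(inplace=True) on a selected column may not update the dataframe in newer pandas versions
df_train['previous_year_rating'] = df_train['previous_year_rating'].fillna(df_train['previous_year_rating'].mean())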

In [None]:
print(df_train['education'].value_counts())

# Get the mode of the feature education
print()
print('Mode: ' , df_train['education'].mode()[0])


For education, Bachelors and Masters are the most common values. We can impute the missing values with Bachelors, as it is a reasonable estimate and is also the mode.

In [None]:
# Replace the missing values for education with mode
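# Assign the result back: calling fillna(inplace=True) on a selected column may not update the dataframe in newer pandas versions
df_train['education'] = df_train['education'].fillna(df_train['education'].mode()[0])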

# Check for missing values
df_missing_stat = calcMissingValues(df_train)

# EDA and Data Visualization

We will now look into data analysis and visualize some of the relationships between features to get more insights about the data.

We will do a pairplot analysis to see the relationships between different variables and how they influence the target variable. In datasets with many features, pairplots are quite useful in revealing patterns that help in subsequent analysis.

In [None]:
sns.pairplot(df_train)
plt.show()

The employee_id column can be safely dropped.

In [None]:
# Create a copy of the train dataset
df_x = df_train.copy()

df_x.drop('employee_id' , inplace = True , axis = 1)

# Let's make a pairplot with employee_id dropped,
# visualized from the perspective of education degree.
# pairplot creates its own figure, so a separate plt.figure() call would only leave an empty figure behind
sns.pairplot(df_x , hue = 'education')
plt.show()

Some observations from the pairplot:
(Stacked bar graphs serve no purpose when analysing one numerical value against another. They are useful when a categorical variable is involved; otherwise the analysis can be misleading.)

* Employees with fewer no_of_trainings have more promotions (does quality matter over quantity here?).
* Age does not play a role in promotions: as in any company, people receive promotions across different age groups.
* Every previous_year_rating value has received promotions, so no strong relationship stands out, although higher ratings have more promotions.
* length_of_service has a positive linear relationship with age, which is expected.
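Visual impressions from a pairplot can be cross-checked with a couple of numbers. The sketch below uses toy data (the values are assumptions) to show promotions per no_of_trainings alongside group sizes, and the correlation between age and length_of_service.

```python
import pandas as pd

# Toy data: values are assumptions for illustration only
toy = pd.DataFrame({
    'no_of_trainings': [1, 1, 1, 2, 2, 3],
    'age': [25, 30, 41, 35, 50, 28],
    'length_of_service': [1, 4, 12, 7, 20, 2],
    'is_promoted': [1, 0, 1, 0, 1, 0],
})

# Total promotions, promotion rate and group size per number of trainings
promotions_by_trainings = (
    toy.groupby('no_of_trainings')['is_promoted'].agg(['sum', 'mean', 'count'])
)
print(promotions_by_trainings)

# Pearson correlation between age and length of service
age_service_corr = toy['age'].corr(toy['length_of_service'])
print(round(age_service_corr, 3))
```

Looking at `mean` and `count` next to `sum` helps separate "more promotions" from "more employees in that group".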


In [None]:
# Let's compare some of the features against the target variable
# Round the ratings before grouping: the mean-imputed value (about 3.3) would otherwise form its own group
# and show up as a second bar labelled 3 after rounding
rating = df_x['previous_year_rating'].round().astype(int)

prev_yr_rating = df_x.groupby(rating)['is_promoted'].sum().reset_index()

print(prev_yr_rating)

# Valid legend locations: best, upper right, upper left, lower left, lower right,
# right, center left, center right, lower center, upper center, center


prev_yr_rating.plot(kind = 'bar', x = 'previous_year_rating' , y = 'is_promoted', color = 'yellow' , alpha = 0.6, figsize = (10 , 8) , rot = 0)
plt.xlabel('Previous Year Ratings')
plt.ylabel('Total Promotions')
plt.title('Previous Year Ratings - Promotions')
plt.legend(loc = 'upper left')
plt.tight_layout()
plt.show()

In [None]:
# Compare education and total promotions
education_promotions = df_x.groupby(['education'], as_index = False)['is_promoted'].sum()

print(education_promotions)

education_promotions.plot(kind = 'bar', x = 'education' , y = 'is_promoted', color = 'yellow' , alpha = 0.6, figsize = (10 , 8) , rot = 0)
plt.xlabel('Education Degree')
plt.ylabel('Total Promotions')
plt.title('Education Degree - Promotions')
plt.legend(loc = 'upper left')
plt.tight_layout()
plt.show()

In general, a Bachelors degree appears to be necessary to be considered for a promotion. Note that its count is also boosted by the imputation of missing values with the mode.
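Because total promotions grow with group size, it can help to look at the promotion rate per education level as well. This is a minimal sketch on toy data; the education labels and counts are assumptions, not figures from the dataset.

```python
import pandas as pd

# Toy data: labels and counts are assumptions for illustration only
toy = pd.DataFrame({
    'education': ["Bachelor's"] * 6 + ["Master's & above"] * 3 + ['Below Secondary'],
    'is_promoted': [1, 0, 0, 0, 0, 1,  1, 1, 0,  0],
})

# Total promotions, number of employees and promotion rate per education level
summary = toy.groupby('education')['is_promoted'].agg(
    total_promotions='sum',
    employees='count',
    promotion_rate='mean',
)
print(summary)
```

In this toy frame both degree groups have two promotions each, yet the rates differ because the groups differ in size.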

In [None]:
# Overall proportion of different degrees

# Pie Chart ref: https://medium.com/@kvnamipara/a-better-visualisation-of-pie-charts-by-matplotlib-935b7667d77f
print(df_x.groupby(['education']).size())

sizes = list(df_x.groupby(['education']).size())

print(sizes)

labels = ['Bachelors' , 'Below Secondary', 'Masters']
colors = ['#ff9999','#1f70f0','#99ff99']
pie_explode = [0 , 0 , 0.3]

plt.figure(figsize = (10 , 8))
plt.pie(sizes , labels = labels , explode = pie_explode , colors = colors , shadow = True, startangle = 90 , textprops={'fontsize': 14} , autopct = '%.1f%%')
plt.ylabel('')
plt.title('Degree distribution in the data' , fontsize = 20)
plt.tight_layout()
plt.show()

In [None]:
# Recruitment channel - employment
# Pie Chart ref: https://medium.com/@kvnamipara/a-better-visualisation-of-pie-charts-by-matplotlib-935b7667d77f

# Unique values for the type of recruitment followed
print(df_x['recruitment_channel'].unique())

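# Count employees per recruitment channel (value_counts sorts by frequency, descending)
channel_counts = df_x['recruitment_channel'].value_counts()
print(channel_counts)
recruitment_categories = list(channel_counts)

print(recruitment_categories)

# Take the labels from the counts so they always match the wedge order
labels = list(channel_counts.index)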
colors = ['#3f4857','#a5a8ad','#687d9e']
pie_explode = [0 , 0.3 , 0]

plt.figure(figsize = (10 , 8))
plt.pie(recruitment_categories , labels = labels , explode = pie_explode , colors = colors , shadow = True, startangle = 90 , textprops = {'fontsize': 14} , autopct = '%.1f%%')
plt.ylabel('')
plt.title('Recruitment Categories' , fontsize = 20)
plt.tight_layout()
plt.show()

In a similar vein, the other features could be visualized either as a donut chart or bar graphs.
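For instance, a donut chart can be drawn with the same `pie` call by giving the wedges a width smaller than the radius. This is a minimal sketch; the category names and counts are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy counts: names and values are assumptions for illustration only
counts = pd.Series({'Sales': 5, 'HR': 2, 'Finance': 3})

fig, ax = plt.subplots(figsize=(10, 8))

# wedgeprops width < 1 leaves a hole in the middle, turning the pie into a donut
wedges, texts, autotexts = ax.pie(
    counts.values,
    labels=list(counts.index),
    startangle=90,
    autopct='%.1f%%',
    pctdistance=0.8,            # place the percentages inside the ring
    wedgeprops={'width': 0.4},
    textprops={'fontsize': 14},
)
ax.set_title('Department distribution (toy data)', fontsize=20)
fig.tight_layout()
# In a notebook the figure renders inline; call plt.show() when running as a script
```

Replacing `counts` with something like `df_x['department'].value_counts()` should give the corresponding chart for the real data.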

# Conclusion

This notebook includes Data Analysis, EDA (as offered by the dataset) and Data Visualization. Building models on top of this dataset, given the detailed analysis and handling of missing values, should be fairly simple. This notebook primarily serves as an exercise in analysis and visualization.