
**HR Analytics Project:** Understanding Attrition in HR
**Project Description**
Every year, companies hire many employees and invest time and money in training them. Many companies also run training programs for their existing employees. The aim of these programs is to increase employee effectiveness. But where does HR analytics fit in, and is it only about improving employee performance?

**HR Analytics**
Human resource analytics (HR analytics) is an area in the field of analytics that refers to applying analytic processes to the human resource department of an organization in the hope of improving employee performance and therefore getting a better return on investment. HR analytics does not just deal with gathering data on employee efficiency. Instead, it aims to provide insight into each process by gathering data and then using it to make relevant decisions about how to improve these processes.

**Attrition in HR**
Attrition in human resources refers to the gradual loss of employees over time. In general, relatively high attrition is problematic for companies. HR professionals often take a leading role in designing the compensation programs, work culture, and motivation systems that help the organization retain top employees.

**How does attrition affect companies, and how does HR analytics help in analyzing it?**
We discuss the first question here. For the second, we write the code and work through the process step by step.

**Attrition affecting Companies**
A major problem with high employee attrition is its cost to the organization. Job postings, hiring processes, paperwork, and new-hire training are some of the common expenses of losing employees and replacing them. Regular employee turnover also prevents the organization from building up its collective knowledge and experience over time. This is especially concerning if the business is customer-facing, as customers often prefer to interact with familiar people. Errors and issues are also more likely when the workforce constantly includes new workers.





**Dataset Link:**
• https://github.com/FlipRoboTechnologies/ML_-Datasets/blob/main/HR%20Analytics/ibm-hr-analytics-employee-attrition-performance.zip


In [None]:
# 1. Data handling and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# 2. Suppress warnings to keep the notebook output readable
import warnings

warnings.filterwarnings('ignore')

# 3. Preprocessing, data splitting and evaluation
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import (accuracy_score, confusion_matrix, classification_report,
                             precision_score, recall_score, f1_score, mean_squared_error)

# 4. Models used for prediction
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge, Lasso, BayesianRidge
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor

# Load the data from the GitHub link above

In [None]:
HR_DATA_URL = "https://raw.githubusercontent.com/FlipRoboTechnologies/ML_-Datasets/main/HR%20Analytics/ibm-hr-analytics-employee-attrition-performance.zip"
hrdf = pd.read_csv(HR_DATA_URL, compression='zip')
hrdf.head()


In [None]:
hrdf.shape

In [None]:
hrdf.info()

In [None]:
missing_hrdf = hrdf.isnull().sum()

print(missing_hrdf)




In [None]:
# Drop columns that carry no useful information for the analysis (constant values or an employee ID)

cols_drop  = ['EmployeeCount','EmployeeNumber','Over18','StandardHours']
hrdf = hrdf.drop(cols_drop,axis=1)
hrdf.head()

In [None]:
for column in hrdf.columns:
    print(f"Column: {column}, Unique Values: {hrdf[column].unique()}")

In [None]:
# Count fully duplicated rows
print("Duplicate rows:", hrdf.duplicated().sum())

Based on the output above, there are no null values or duplicate rows in the dataset.

In [None]:
hrdf.describe()

In [None]:
hrdf.dtypes

In [None]:
# Number of employees in each category of the text columns
for col in ["Attrition", "BusinessTravel", "Department", "EducationField",
            "Gender", "JobRole", "MaritalStatus"]:
    print(hrdf.groupby(col).size())
    print('\n')

In [None]:
categorical_columns = ['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus']

for col in categorical_columns:
  le = LabelEncoder()
  hrdf[col] = le.fit_transform(hrdf[col])


In [None]:
hrdf.head()

In [None]:
# First, check the employee count by age and attrition status
age_attrition = hrdf.groupby(["Age", "Attrition"]).size().unstack()
age_attrition

The `age_attrition` table above shows, for each age, the number of employees who stayed (0) and who left (1).

From the data above:
**1. Employees aged 30-40 have a lower attrition rate than other age groups.**
**2. Employees aged 45 have a higher attrition rate.**
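
The table above holds raw counts, so ages with very different headcounts are hard to compare directly. A minimal sketch of turning such counts into a per-age attrition rate, assuming an `Age` column and an `Attrition` column encoded as 0 (stayed) and 1 (left). The helper name `attrition_rate_by` and the small `demo` frame are made up for illustration and are not part of the dataset.

```python
import pandas as pd

def attrition_rate_by(df, column, target="Attrition"):
    """Return the share of rows with target == 1 for each value of `column`."""
    counts = df.groupby([column, target]).size().unstack(fill_value=0)
    # Make sure both outcome columns exist even if one never occurs.
    counts = counts.reindex(columns=[0, 1], fill_value=0)
    return counts[1] / counts.sum(axis=1)

# Hypothetical toy data, not taken from the HR dataset.
demo = pd.DataFrame({
    "Age": [25, 25, 25, 25, 35, 35, 35, 35, 45, 45],
    "Attrition": [1, 1, 0, 0, 0, 0, 0, 1, 1, 0],
})
rates = attrition_rate_by(demo, "Age")
print(rates)
```

On the notebook's data the equivalent call would be `attrition_rate_by(hrdf, "Age")`.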

In [None]:
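from scipy import stats

numerical_hrdf = hrdf.select_dtypes(include=[np.number])

# Calculate z-scores for the numerical columns and flag |z| > 3 as outliers
z_scores = np.asarray(stats.zscore(numerical_hrdf))
outliers = np.abs(z_scores) > 3

# Number of flagged values per column
print(pd.Series(outliers.sum(axis=0), index=numerical_hrdf.columns))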

In [None]:
# Plot the same counts as a stacked bar chart
age_attrition.plot(kind = 'bar',stacked=True,figsize=(15,10))
plt.title('Attrition by Age')
plt.xlabel('Age')
plt.ylabel('Number of Employees')
plt.show()

In [None]:
# Show the number of employees by attrition status
sns.countplot(x='Attrition', data=hrdf)
plt.title('Attrition')
plt.xlabel('Attrition')
plt.ylabel('Count')
plt.show()

In [None]:
def pie_bar_plot(df, column1, column2):
  plt.figure(figsize=(10,5))
  plt.subplot(1,2,1)
  df[column1].value_counts().plot(kind='pie', autopct='%1.1f%%')
  plt.title(f'{column1} Distribution')
  plt.subplot(1,2,2)
  sns.countplot(x=column1, hue=column2, data=df)
  plt.title(f'{column1} vs {column2}')

  plt.show()
pie_bar_plot(hrdf,'Gender', 'Attrition')

In [None]:
pie_bar_plot(hrdf,'EducationField', 'Attrition')

In [None]:
pie_bar_plot(hrdf,'JobRole', 'Attrition')

In [None]:
hrdf.hist(figsize=(20,20), color = 'g',alpha=0.5)
plt.show()

In [None]:
# Attrition counts by monthly income

rate_attrition = hrdf.groupby(["MonthlyIncome", "Attrition"]).size().unstack()
rate_attrition

In [None]:
hrdf.dtypes

In [None]:
hrdf.describe()

In [None]:
numerical_hrdf = hrdf.select_dtypes(include=[np.number])

# Correlation heatmap of all numerical columns
fig, ax = plt.subplots(figsize=(25, 20))
sns.heatmap(numerical_hrdf.corr(), annot=True, ax=ax)
plt.show()

Here, JobLevel and MonthlyIncome have the highest correlation. Several other variables are also well correlated with each other, such as JobLevel, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, and YearsWithCurrManager.

PerformanceRating and PercentSalaryHike are also well correlated.


Some variables are more related to the target variable Attrition:

EnvironmentSatisfaction, JobInvolvement, JobLevel, JobSatisfaction, MonthlyIncome, StockOptionLevel, TotalWorkingYears, etc.
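
One way to check which variables are most related to Attrition is to rank them by absolute correlation rather than reading them off the heatmap. The following is a sketch on synthetic data. The helper `rank_by_target_corr` and the toy columns `StrongFeature` and `NoiseFeature` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def rank_by_target_corr(df, target, threshold=0.1):
    """Rank numeric columns by |Pearson correlation| with `target`."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    ranked = corr.abs().sort_values(ascending=False)
    # Keep only the columns above the chosen threshold.
    return ranked[ranked > threshold]

# Synthetic stand-in data: one informative column and one pure-noise column.
rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
toy = pd.DataFrame({
    "Attrition": (signal + rng.normal(scale=0.5, size=n) > 0).astype(int),
    "StrongFeature": signal,
    "NoiseFeature": rng.normal(size=n),
})
selected = rank_by_target_corr(toy, "Attrition")
print(selected)
```

On the notebook's data the equivalent call would be `rank_by_target_corr(hrdf, "Attrition")`, which only considers columns that are already numeric.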


In [None]:
# Number of employees who left (Attrition == 1), by education field
hrdf[hrdf["Attrition"] == 1].groupby("EducationField").size()

In [None]:
# Number of employees who stayed (Attrition == 0), by education field
hrdf[hrdf["Attrition"] == 0].groupby("EducationField").size()

In [None]:
# Share of employees who left, by education field
hrdf[hrdf["Attrition"] == 1].groupby("EducationField").size() / hrdf.groupby("EducationField").size()

In [None]:
# Share of employees who stayed, by education field
hrdf[hrdf["Attrition"] == 0].groupby("EducationField").size() / hrdf.groupby("EducationField").size()

In [None]:
#hrdf.groupby(["Attrition"]).size()
hrdf.groupby(['Gender']).sum()

**The cells above look at the percentage of employees who left, by education field.**

In [None]:
hrdf.groupby(["Attrition"]).size()

In [None]:
num_hrdf1 = hrdf.select_dtypes(include = 'number')
hrdf1_skewness = num_hrdf1.skew()
print(hrdf1_skewness)

In [None]:
# Feature selection

# The heatmap above shows how strongly each variable is related to Attrition.
# We keep the features whose absolute correlation with Attrition is greater than 0.1.

hrdf1 = hrdf[['Attrition', 'Age', 'BusinessTravel', 'EducationField', 'EnvironmentSatisfaction', 'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'OverTime', 'StockOptionLevel', 'TotalWorkingYears', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsWithCurrManager']]


In [None]:
x = hrdf1.drop('Attrition', axis=1)
y = hrdf1['Attrition'] # y is target variable



In [None]:
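from sklearn.preprocessing import LabelEncoder, StandardScaler

# Encode any remaining text columns (e.g. OverTime) first; StandardScaler
# cannot handle strings and raises "could not convert string to float".
x = x.copy()
for col in x.select_dtypes(exclude='number').columns:
    x[col] = LabelEncoder().fit_transform(x[col])

# Standardized copy of the features (the models below are trained on the unscaled x)
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
print("Train set:", x_train.shape, y_train.shape)
print("Test set:", x_test.shape, y_test.shape)

# Next: build a logistic regression model and fit it to the training data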



In [None]:
from sklearn.linear_model import LogisticRegression

# Note: every feature must be numeric here; a remaining text column such as
# OverTime raises "ValueError: could not convert string to float: 'No'".

LR_model = LogisticRegression(C=1.0, solver='newton-cg', max_iter=800, random_state=85).fit(x_train, y_train)
LR_model
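
`make_pipeline` is imported at the top of the notebook but never used. A minimal sketch of how scaling and logistic regression could be combined so that the scaler is fitted on the training split only. The synthetic data from `make_classification` stands in for the HR features and is an assumption, as are the names `X_demo`, `y_demo`, and `pipe`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the HR features (roughly 16% positives).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.84, 0.16], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The pipeline fits the scaler on X_tr only, then reuses it on X_te.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=800))
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

With the notebook's own split, the same pipeline could be fitted with `pipe.fit(x_train, y_train)`, provided all columns are numeric.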

In [None]:
#xtrain_pred = LR_model.predict(x_train)
#print(xtrain_pred)

LR_Y_pred = LR_model.predict(x_test)
print(LR_Y_pred)

In [None]:
LR_Acc = accuracy_score(y_test, LR_Y_pred)
print("LR accuracy:", LR_Acc)

precision = precision_score(y_test, LR_Y_pred)
recall = recall_score(y_test, LR_Y_pred)
f1 = f1_score(y_test, LR_Y_pred)
print('Precision:', precision)
print('Recall:', recall)
print('F1 score:', f1)


In [None]:
#confusion matrix
conf_mat = confusion_matrix(y_test,LR_Y_pred)
print(conf_mat)

In [None]:
class_report = classification_report(y_test, LR_Y_pred)
print("Classification report:")
print(class_report)
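
Precision, recall, and F1 can be derived directly from the four cells of the confusion matrix, which helps when interpreting the report. A worked sketch with made-up labels (these are not results from this notebook):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical true and predicted labels (1 = employee left).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

manual_precision = tp / (tp + fp)  # of the predicted leavers, how many actually left
manual_recall = tp / (tp + fn)     # of the actual leavers, how many were found
manual_f1 = 2 * manual_precision * manual_recall / (manual_precision + manual_recall)

print(tn, fp, fn, tp)
print(manual_precision, manual_recall, manual_f1)
```

The manual values should match what `precision_score`, `recall_score`, and `f1_score` return for the same labels.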

In [None]:
# Model selection
# Note: these are regression models fitted to the 0/1 Attrition label,
# so their predictions are continuous scores rather than class labels.

models = [LinearRegression(),
          Ridge(alpha=0.001),
          Lasso(alpha=0.003),
          SVR(),
          DecisionTreeRegressor(),
          RandomForestRegressor(),
          BayesianRidge(),
          GradientBoostingRegressor(),
          AdaBoostRegressor(estimator=LinearRegression())]  # `base_estimator` before scikit-learn 1.2

model_names = ['LinearRegression', 'Ridge', 'Lasso', 'SVR', 'DecisionTreeRegressor',
               'RandomForestRegressor', 'BayesianRidge', 'GradientBoostingRegressor',
               'AdaBoostRegressor']
rows = []
for model, name in zip(models, model_names):
    print(model)

    model.fit(x_train, y_train)
    pred = model.predict(x_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))  # root mean squared error on the test set
    r2 = model.score(x_test, y_test)

    # Mean negative MSE over 5 folds (closer to 0 is better)
    mean_cv = cross_val_score(model, x_train, y_train, cv=5, scoring='neg_mean_squared_error').mean()

    rows.append({'Model': name, 'RMSE': rmse, 'R2': r2, 'MeanCV': mean_cv})

model_df = pd.DataFrame(rows)
print(model_df)

In [None]:
# Based on the results above, GradientBoostingRegressor is the best model.
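
Because Attrition is a binary label, classifiers scored with a classification metric may be a more natural comparison than regressors scored with R2. A hedged sketch under that assumption, using synthetic data in place of `x_train` and `y_train`. The model list and the choice of F1 as the metric are illustrative and are not the notebook's method.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the selected HR features.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.84, 0.16], random_state=0,
)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=800),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
}

rows = []
for name, clf in candidates.items():
    # 5-fold cross-validated F1 for the positive class (employees who left).
    f1_scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="f1")
    rows.append({"Model": name, "MeanF1": f1_scores.mean()})

comparison = pd.DataFrame(rows).sort_values("MeanF1", ascending=False)
print(comparison)
```

F1 is used here because accuracy alone can look high on an imbalanced label even when few leavers are identified.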

