
In [1]:
import pandas as pd
df = pd.read_csv('hr_dataset.csv')
df.head()

Unnamed: 0,Employee_ID,Department,Years_at_Company,Salary,Job_Satisfaction,Performance_Score,Left_Company
0,1,Finance,26,111666,4,99,0
1,2,Marketing,30,114351,9,34,0
2,3,HR,7,38566,10,67,0
3,4,Marketing,4,101885,9,84,0
4,5,Marketing,29,40728,7,73,0


In [2]:
#know the structure of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Employee_ID        10000 non-null  int64 
 1   Department         10000 non-null  object
 2   Years_at_Company   10000 non-null  int64 
 3   Salary             10000 non-null  int64 
 4   Job_Satisfaction   10000 non-null  int64 
 5   Performance_Score  10000 non-null  int64 
 6   Left_Company       10000 non-null  int64 
dtypes: int64(6), object(1)
memory usage: 547.0+ KB


In [3]:
#change Employee_ID variable type to object
df['Employee_ID'] = df['Employee_ID'].astype(str)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   Employee_ID        10000 non-null  object  
 1   Department         10000 non-null  object  
 2   Years_at_Company   10000 non-null  int64   
 3   Salary             10000 non-null  int64   
 4   Job_Satisfaction   10000 non-null  int64   
 5   Performance_Score  10000 non-null  int64   
 6   Left_Company       10000 non-null  int64   
 7   Salary_Range       10000 non-null  category
dtypes: category(1), int64(5), object(2)
memory usage: 557.1+ KB


In [None]:
#count missing values
df.isnull().sum()

Unnamed: 0,0
Employee_ID,0
Department,0
Years_at_Company,0
Salary,0
Job_Satisfaction,0
Performance_Score,0
Left_Company,0


After analyzing the structure of the dataset, we can observe that there are no missing or null values, and the data types are correct.
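The structural checks described above can also be expressed as explicit assertions. The following is a minimal sketch on a small hypothetical frame (the rows and the `basic_checks` helper are illustrative assumptions, not part of the original notebook); the 1 to 10 satisfaction scale is taken from the analysis later in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the HR data; values are illustrative only.
sample = pd.DataFrame({
    "Employee_ID": ["1", "2", "3", "4"],
    "Department": ["Finance", "Marketing", "HR", "Marketing"],
    "Job_Satisfaction": [4, 9, 10, 9],
    "Left_Company": [0, 0, 1, 0],
})

def basic_checks(frame: pd.DataFrame) -> dict:
    """Return a few structural checks as booleans (hypothetical helper)."""
    return {
        # no null values anywhere in the frame
        "no_missing": not frame.isnull().any().any(),
        # each employee appears once
        "unique_ids": bool(frame["Employee_ID"].is_unique),
        # the target only contains 0 and 1
        "binary_target": set(frame["Left_Company"].unique()) <= {0, 1},
        # satisfaction ratings stay within the 1-10 scale
        "satisfaction_in_1_to_10": bool(frame["Job_Satisfaction"].between(1, 10).all()),
    }

checks = basic_checks(sample)
print(checks)
```

Running the same helper on `df` would turn the visual inspection of `df.info()` and `df.isnull().sum()` into a repeatable check.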

# **Checking for outliers:**
Years_at_Company, Salary, Job_Satisfaction, Performance_Score

In [None]:
import plotly.express as px

In [None]:
# show outliers for 'Years_at_Company'
px.box(df,y='Years_at_Company').show()

Checking the box plot, we can see that there are no outliers for 'Years_at_Company'

In [None]:
#Check for outliers for 'Salary'
px.box(df,y='Salary').show()

Checking the box plot, we can see that there are no outliers for 'Salary'

In [None]:
#Check for outliers for 'Job_Satisfaction'
px.box(df,y='Job_Satisfaction').show()

Checking the box plot, we can see that there are no outliers for 'Job_Satisfaction'

In [None]:
#Check for outliers for 'Performance_Score'
px.box(df,y='Performance_Score').show()

Checking the box plot, we can see that there are no outliers for 'Performance_Score'
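The visual check can be backed by a numeric one. A box plot conventionally flags points lying more than 1.5 times the interquartile range beyond the quartiles; the sketch below applies that rule with pandas. The `iqr_outliers` helper and the sample salaries are assumptions for illustration, and Plotly may use a slightly different quartile convention than `Series.quantile`, so borderline points could differ.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Illustrative data only (not from hr_dataset.csv): one obviously extreme salary.
salaries = pd.Series([40000, 52000, 61000, 58000, 47000, 55000, 250000])
flagged = iqr_outliers(salaries)
print(flagged.tolist())  # [250000]
```

Applied to the notebook's data, `iqr_outliers(df['Salary']).empty` would be expected to return `True` if the box plot reading is right.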

**Identify the distribution of the variables Years_at_Company, Salary, Job_Satisfaction, Performance_Score**

In [None]:
px.histogram(df,x='Years_at_Company',nbins=30).show()

Analysing the histogram, we can see that the bar heights are similar across bins, which suggests that years at company is spread roughly evenly over its range.

In [None]:
px.histogram(df,x='Salary',nbins=20).show()

Analysing the histogram, we can see that the bar heights are similar across bins, which suggests that salary is spread roughly evenly over its range.

In [None]:
px.histogram(df,x='Performance_Score',nbins=25).show()

Analysing the histogram, we can see that the bar heights are similar across bins, which suggests that the performance score is spread roughly evenly over its range.

In [None]:
count_df = df['Job_Satisfaction'].value_counts().reset_index()
count_df.columns = ['Job_Satisfaction', 'Count']

In [None]:
px.bar(count_df,
             x='Job_Satisfaction',
             y='Count',
             title='Job Satisfaction',
             labels={'Job_Satisfaction': 'Satisfaction Rating', 'Count': 'Number of Votes'},
             text='Count').show()  # Show counts on top of bars

Analysing the bar graph, we can see that the counts are similar for every satisfaction rating.

In [None]:
#visualize average years at company by department
# compute the averages once so that labels and values come from the same Series
dept_avg = df.groupby('Department')['Years_at_Company'].mean()
px.bar(
    x=dept_avg.index,
    y=dept_avg.values,
    labels={'x':'Department','y':'Average Years at Company'},
    color=dept_avg.index
).show()

Department with the highest average number of years at the company: Marketing (approximately 15.983)

Department with the lowest average number of years at the company: Finance (approximately 15.495)
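The highest and lowest departments can be read directly from the grouped averages rather than from the chart. This is a small sketch on made-up rows (the `toy` frame is an assumption for illustration; the real figures come from `hr_dataset.csv`). Taking both the label and the value from the same grouped Series keeps them aligned.

```python
import pandas as pd

# Illustrative rows only; the real averages come from hr_dataset.csv.
toy = pd.DataFrame({
    "Department": ["Finance", "Finance", "HR", "HR", "Marketing", "Marketing"],
    "Years_at_Company": [10, 12, 15, 17, 20, 22],
})

# Mean tenure per department, indexed by department name
dept_avg = toy.groupby("Department")["Years_at_Company"].mean()

# idxmax / idxmin return the index label (department) of the extreme value
highest, lowest = dept_avg.idxmax(), dept_avg.idxmin()
print(f"Highest: {highest} ({dept_avg[highest]:.3f})")
print(f"Lowest: {lowest} ({dept_avg[lowest]:.3f})")
```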

In [None]:
#visualize average years at company by salary range
# Create salary bins
df['Salary_Range'] = pd.cut(
    df['Salary'],
    bins=[0,30000, 45000, 60000, 75000, 90000,105000,120000, float('inf')],
    labels=['0-30k', '30k-45k', '45k-60k', '60k-75k', '75k-90k', '90k-105k','105k-120k','120k+']
)

# Calculate average years at the company by salary range
average_years = df.groupby('Salary_Range')['Years_at_Company'].mean().reset_index()

# Create a bar plot
fig = px.bar(
    average_years,
    x='Salary_Range',
    y='Years_at_Company',
    labels={'Salary_Range': 'Salary Range', 'Years_at_Company': 'Average Years at Company'},
    title='Average Years at Company by Salary Range',
    text = average_years['Years_at_Company'].round(2)
)
fig.show()





In [None]:
#Correlation Coefficient (Pearson’s r) for years at company and salary
correlation = df['Years_at_Company'].corr(df['Salary'])
print(f"Correlation coefficient: {correlation}")

Correlation coefficient: 0.003933668577055318


The bar graph and Pearson's r (approximately 0.004) show a positive but negligible correlation between years at company and salary.
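With a coefficient this close to zero, it helps to check whether it is distinguishable from zero at all. The sketch below uses only the printed r and the 10,000 rows reported by `df.info()`. The normal approximation to the t distribution and the rule of thumb `1.96 / sqrt(n)` are simplifying assumptions that are reasonable at this sample size.

```python
import math

n = 10_000                      # number of rows reported by df.info()
r = 0.003933668577055318        # Pearson's r printed above

# Test statistic for the null hypothesis of zero correlation
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

# Approximate two-sided p-value using the normal distribution
# (a close stand-in for the t distribution with n - 2 degrees of freedom)
p_approx = math.erfc(abs(t_stat) / math.sqrt(2))

# Approximate smallest |r| that would reach p < 0.05 at this sample size
r_crit = 1.96 / math.sqrt(n)

print(f"t = {t_stat:.3f}, p ~ {p_approx:.3f}, |r| needed ~ {r_crit:.4f}")
```

The observed r is well below the approximate threshold, so the data give no evidence of a linear relationship between the two variables.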

In [None]:
left_company_counts = df.groupby(['Salary_Range', 'Left_Company']).size().reset_index(name='Count')
fig = px.bar(
    left_company_counts,
    x='Salary_Range',
    y='Count',
    color='Left_Company',
    barmode='group',
    labels={'Salary_Range': 'Salary Range', 'Count': 'Number of Employees', 'Left_Company': 'Left Company'},
    title='Number of Employees Who Left or Stayed by Salary Range',
    text=left_company_counts['Count']
)

# Display the bar chart
fig.show()





The bar chart suggests a positive association between salary level and staying at the company: in the higher salary ranges, more employees stayed and fewer left.

# **Exploring Job Satisfaction correlation with years at company**

In [None]:
# label keys must match the column names passed to x and y
px.box(df, x='Years_at_Company', y='Job_Satisfaction',
       labels={'Years_at_Company': 'Years at Company', 'Job_Satisfaction': 'Job Satisfaction'}).show()

The median job satisfaction for each number of years at the company varies while the min and max are the same.

In [None]:
#Correlation Coefficient (Pearson’s r) for Job_Satisfaction and Years_at_Company
correlation = df['Job_Satisfaction'].corr(df['Years_at_Company'])
print(f"Correlation coefficient: {correlation}")

Correlation coefficient: 0.01338885010755026


There is a very weak positive correlation (r ≈ 0.013) between Job_Satisfaction and Years_at_Company.

In [None]:
left_company_counts = df.groupby(['Job_Satisfaction', 'Left_Company']).size().reset_index(name='Count')
fig = px.bar(
    left_company_counts,
    x='Job_Satisfaction',
    y='Count',
    color='Left_Company',
    barmode='group',
    labels={'Job_Satisfaction': 'Job_Satisfaction', 'Count': 'Number of Employees', 'Left_Company': 'Left Company'},
    title='Number of Employees Who Left or Stayed by Job_Satisfaction',
    text=left_company_counts['Count']
)

# Display the bar chart
fig.show()

There is a positive association between the level of job satisfaction and the number of employees who stay.

# **Exploring correlation between job satisfaction and salary**

In [None]:
# label keys must match the column names passed to x and y
px.box(df, x='Job_Satisfaction', y='Salary',
       labels={'Job_Satisfaction': 'Job Satisfaction', 'Salary': 'Salary'}).show()

In [None]:
#Correlation Coefficient (Pearson’s r) for Job_Satisfaction and Salary
correlation = df['Job_Satisfaction'].corr(df['Salary'])
print(f"Correlation coefficient: {correlation}")

Correlation coefficient: -0.009405056726846566


There is a very weak negative correlation (r ≈ -0.009) between Job_Satisfaction and Salary.

In [None]:
# label keys must match the column names passed to x and y
px.box(df, x='Job_Satisfaction', y='Performance_Score',
       labels={'Job_Satisfaction': 'Job Satisfaction', 'Performance_Score': 'Performance Score'}).show()

In [None]:
#Correlation Coefficient (Pearson’s r) for Job_Satisfaction and Performance_Score
correlation = df['Job_Satisfaction'].corr(df['Performance_Score'])
print(f"Correlation coefficient: {correlation}")

Correlation coefficient: -0.0024055309179555773


There is a very weak negative correlation (r ≈ -0.002) between Job_Satisfaction and Performance_Score.

# **Analysis**
According to the analysis, the factors that appear to have the greatest impact on the decision to stay at the company are job satisfaction and salary level. As the satisfaction level increases from 1 to 10, the number of employees who stay increases from 755 to 836. As the salary increases from 30k to 120k, the number of employees who stay increases from 1326 to 1406, while the number of employees who leave decreases from 388 to 355.
Salary and performance score don't seem to have a great influence on employee satisfaction, since the correlations are weak.
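Because the groups differ in size, raw counts are easier to compare once converted to rates. The sketch below reuses only the four counts quoted above for the two ends of the salary comparison; the `leave_rate` helper is a hypothetical name introduced for illustration.

```python
# Counts quoted in the analysis above for the two ends of the salary comparison
stayed_low, left_low = 1326, 388
stayed_high, left_high = 1406, 355

def leave_rate(stayed: int, left: int) -> float:
    """Share of employees in a group who left (hypothetical helper)."""
    return left / (stayed + left)

rate_low = leave_rate(stayed_low, left_low)
rate_high = leave_rate(stayed_high, left_high)

print(f"Leave rate, lower salary end:  {rate_low:.1%}")
print(f"Leave rate, higher salary end: {rate_high:.1%}")
print(f"Difference: {(rate_low - rate_high) * 100:.1f} percentage points")
```

Expressed this way, the gap between the two groups is a couple of percentage points, which gives a sense of how small the effect is in practical terms.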

# **Analyzing factors that could have influenced employee turnover**

In [4]:
from scipy.stats import pointbiserialr
left_company = df['Left_Company']  # binary column (0 or 1)
salary = df['Salary']              # continuous column
# point-biserial correlation between the binary outcome and salary
corr, p_value = pointbiserialr(left_company, salary)

print(f'Point-Biserial Correlation: {corr}')
print(f'P-value: {p_value}')

Point-Biserial Correlation: -0.016367062765291474
P-value: 0.10171171507602969


The point-biserial correlation of -0.016 indicates no meaningful relationship between Left_Company and Salary.
The p-value of 0.1017 further confirms that this correlation is not statistically significant.

In [5]:
left_company = df['Left_Company']  # binary column (0 or 1)
job_satisfaction = df['Job_Satisfaction']
# point-biserial correlation between the binary outcome and job satisfaction
corr, p_value = pointbiserialr(left_company, job_satisfaction)

print(f'Point-Biserial Correlation: {corr}')
print(f'P-value: {p_value}')

Point-Biserial Correlation: 0.007378502906264988
P-value: 0.4606553838602664


The point-biserial correlation of 0.0074 indicates no relationship between 'Left_Company' and 'Job_Satisfaction'. The p-value of 0.4607 confirms that this correlation is not statistically significant.

In [6]:
left_company = df['Left_Company']  # binary column (0 or 1)
performance_score = df['Performance_Score']
# point-biserial correlation between the binary outcome and performance score
corr, p_value = pointbiserialr(left_company, performance_score)

print(f'Point-Biserial Correlation: {corr}')
print(f'P-value: {p_value}')

Point-Biserial Correlation: -0.007348600864986472
P-value: 0.46247462004008066


The point-biserial correlation of -0.0073 indicates no relationship between 'Left_Company' and 'Performance_Score'. The p-value of 0.4625 confirms that this correlation is not statistically significant.
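For context on what these numbers measure: the point-biserial correlation is Pearson's r applied to one binary and one continuous variable, and it can equivalently be written in terms of the two group means. The sketch below checks that identity on synthetic data (the random values are an assumption for illustration and are unrelated to `hr_dataset.csv`).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (not from hr_dataset.csv): a 0/1 flag and an unrelated score
left = rng.integers(0, 2, size=1_000)
score = rng.normal(loc=50, scale=10, size=1_000)

# Pearson's r computed directly on the binary and continuous variables
pearson_r = np.corrcoef(left, score)[0, 1]

# Point-biserial formula: (M1 - M0) / s * sqrt(p * q),
# where s is the population standard deviation (ddof=0) of the continuous variable
m1, m0 = score[left == 1].mean(), score[left == 0].mean()
p = left.mean()
point_biserial = (m1 - m0) / score.std() * np.sqrt(p * (1 - p))

print(pearson_r, point_biserial)
```

The two values agree up to floating-point error, so a coefficient near zero says that the average of the continuous variable barely differs between employees who left and those who stayed.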

# **Predicting the probability of an employee leaving the company**

In [None]:
import statsmodels.api as sm

In [None]:
hr_dataset = pd.read_csv('hr_dataset.csv')

In [None]:
X = hr_dataset.loc[:, ['Years_at_Company', 'Salary', 'Job_Satisfaction', 'Performance_Score']]
y = hr_dataset['Left_Company']

In [None]:
#Logistic Regression
X = sm.add_constant(X)
model = sm.Logit(y, X).fit()

Optimization terminated successfully.
         Current function value: 0.504113
         Iterations 5


In [None]:
#summary
print(model.summary())

                           Logit Regression Results                           
Dep. Variable:           Left_Company   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9995
Method:                           MLE   Df Model:                            4
Date:                Tue, 15 Oct 2024   Pseudo R-squ.:               0.0005616
Time:                        02:06:00   Log-Likelihood:                -5041.1
converged:                       True   LL-Null:                       -5044.0
Covariance Type:            nonrobust   LLR p-value:                    0.2256
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                -1.1939      0.109    -10.936      0.000      -1.408      -0.980
Years_at_Company     -0.0040      0.003     -1.379      0.168      -0.010       0.002
Salary            -1.544

# **Analysis**

Years at company: the negative sign of the coefficient suggests that with each additional year at the company, the log-odds of leaving decrease slightly. However, since the p-value is greater than 0.05, this relationship is not statistically significant.

Salary: The coefficient is very close to zero, indicating a very small relationship between salary and the probability of leaving the company. It is not statistically significant (p > 0.05).

Job Satisfaction: There's a slight positive coefficient suggesting that higher job satisfaction might increase the log-odds of leaving, but this relationship is weak and not statistically significant (p > 0.05).

Performance Score: The small negative relationship indicates that better performance scores might be associated with a reduced probability of leaving, but this effect is very small and not statistically significant (p > 0.05).
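Log-odds coefficients are easier to read once converted to odds ratios. The sketch below uses only the two coefficients that are fully visible in the printed summary (`const` and `Years_at_Company`). The intercept-only probability corresponds to all predictors set to zero, which is not a realistic employee, so it is shown purely to illustrate the arithmetic.

```python
import math

# Coefficients as printed in the regression summary above
coef_const = -1.1939
coef_years = -0.0040

# exp(coef) is the multiplicative change in the odds of leaving
# for a one-unit increase in the predictor
odds_ratio_per_year = math.exp(coef_years)
odds_ratio_per_decade = math.exp(10 * coef_years)

def sigmoid(z: float) -> float:
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Probability implied by the intercept alone (all predictors at zero);
# a reference point for the arithmetic, not a realistic prediction
p_intercept_only = sigmoid(coef_const)

print(f"Odds ratio per extra year:     {odds_ratio_per_year:.4f}")
print(f"Odds ratio per ten more years: {odds_ratio_per_decade:.4f}")
print(f"Intercept-only probability:    {p_intercept_only:.3f}")
```

An odds ratio this close to 1 means that even ten additional years change the odds of leaving by only a few percent, which matches the conclusion that the effect is small and not statistically significant.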
