# GROUP ASSIGNMENT - RMBA-2771 Fall 2025

1. Executive Summary

2. Introduction

Potential Research Question 1: How does managerial tenure moderate the relationship between job satisfaction and voluntary employee attrition?

Hypotheses H1: Higher job satisfaction is associated with lower attrition risk. 
H2: Longer managerial tenure (years with current manager) is associated with lower attrition risk. 

H3: Managerial tenure moderates the job satisfaction–attrition link such that the protective effect of job satisfaction is stronger when managerial tenure is longer.
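
One common way to formalise H1 to H3 is a logistic regression with an interaction term. The equation below is a sketch under that assumption, not a model specification taken from this analysis. The symbols $S$ (job satisfaction), $T$ (years with current manager) and the coefficients $\beta_0,\dots,\beta_3$ are introduced here for illustration only.

```latex
\operatorname{logit} P(\text{attrition} = 1)
  = \beta_0 + \beta_1 S + \beta_2 T + \beta_3 \,(S \times T)
```

In this form the slope of job satisfaction is $\beta_1 + \beta_3 T$, so H3 would correspond to $\beta_3 < 0$: the slope becomes more negative as tenure with the manager grows. H1 and H2 would roughly correspond to negative effects of $S$ and $T$, respectively.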

Potential Research Question 2: 
Does promotion stagnation (years since last promotion) increase attrition, and is this effect buffered by participation in internal training programs?

H1: More years since last promotion are associated with higher attrition risk. 
H2: Greater participation in training sessions is associated with lower attrition risk. 
H3: Training participation moderates the promotion‐stagnation–attrition link such that the positive effect of promotion stagnation on attrition is weaker for employees with high training engagement.

Potential Research Question 3:
Does business travel intensity exacerbate the impact of poor work–life balance on voluntary attrition?

H1: Higher business travel frequency is associated with higher attrition risk. 
H2: Lower work–life balance ratings are associated with higher attrition risk. 
H3: Business travel frequency moderates the work–life–balance–attrition link such that the negative effect of poor work–life balance on attrition is stronger among employees with high travel intensity.

Research Question:
Does business travel intensity affect the risk of attrition?

Research Question: Does promotion stagnation increase attrition, and is this effect affected by age?

H1: More years since last promotion are associated with higher attrition risk. 
H2: 
H3:

3. Method 1 

In [67]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.proportion import proportions_ztest
import statsmodels.formula.api as smf

In this code we perform a basic exploratory analysis to get an overview of the data set. First, we inspect the data set to confirm that it contains the variables we are interested in. Next, we check for duplicate rows and missing values; we found none. As a safeguard, the code would fill any missing text entries with "Unknown" and any missing numeric entries with 0. Lastly, we drop columns that add no information, for example "EmployeeCount", "StandardHours" and "Over18". Every entry in each of these columns is the same, so they cannot help explain attrition.
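
The claim that a column carries no information can be checked programmatically rather than by eye. The following is a minimal sketch on a small toy frame (the values are hypothetical stand-ins, not the HR data), assuming that "no information" means exactly one distinct value in the column.

```python
import pandas as pd

# Toy frame standing in for the HR data (hypothetical values, illustration only)
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "Over18": ["Y", "Y", "Y"],
})

# A column is constant when it holds exactly one distinct value
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
print("Constant columns:", constant_cols)
```

Running the same list comprehension on the full data frame would list the candidates for dropping without hard-coding their names.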

In [57]:
# 1. Load the data
df = pd.read_csv("HR-Employee-Attrition.csv")

# 2. Peek at the top and structure
print(df.head())      # first 5 rows
print(df.info())      # column types & non-null counts

# 3. Drop any exact duplicate rows
df.drop_duplicates(inplace=True)
print("After dropping duplicates:", df.shape)

# 4. Check for missing values
print("Missing per column:\n", df.isnull().sum())

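# 5. Fill missing values
#    - For text (object) columns → "Unknown"
#    - For numeric columns → 0
#    Assign the result back instead of using inplace=True on a column,
#    which may not modify df in recent pandas versions.
for col in df.columns:
    if df[col].dtype == "object":
        df[col] = df[col].fillna("Unknown")
    else:
        df[col] = df[col].fillna(0)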

# 6. Drop columns that won’t vary and aren’t informative
to_drop = ["EmployeeCount", "StandardHours", "Over18"]
df = df.drop(columns=to_drop, errors="ignore")  # names not present are skipped

# 7. Re-inspect to confirm
print(df.head())
print(df.info())



   Age Attrition     BusinessTravel  DailyRate              Department  \
0   41       Yes      Travel_Rarely       1102                   Sales   
1   49        No  Travel_Frequently        279  Research & Development   
2   37       Yes      Travel_Rarely       1373  Research & Development   
3   33        No  Travel_Frequently       1392  Research & Development   
4   27        No      Travel_Rarely        591  Research & Development   

   DistanceFromHome  Education EducationField  EmployeeCount  EmployeeNumber  \
0                 1          2  Life Sciences              1               1   
1                 8          1  Life Sciences              1               2   
2                 2          2          Other              1               4   
3                 3          4  Life Sciences              1               5   
4                 2          1        Medical              1               7   

   EnvironmentSatisfaction  Gender  HourlyRate  JobInvolvement  JobLevel  

In [58]:
# Simply checking the scale of the numeric columns 
cols_to_check = [
    'DistanceFromHome', 'Education', 'EnvironmentSatisfaction',
    'HourlyRate', 'JobInvolvement', 'JobLevel',
    'JobSatisfaction', 'RelationshipSatisfaction', 'WorkLifeBalance'
]

# Loop through each column and print min and max
for col in cols_to_check:
    min_val = df[col].min()
    max_val = df[col].max()
    print(f"{col}: Min = {min_val}, Max = {max_val}")


DistanceFromHome: Min = 1, Max = 29
Education: Min = 1, Max = 5
EnvironmentSatisfaction: Min = 1, Max = 4
HourlyRate: Min = 30, Max = 100
JobInvolvement: Min = 1, Max = 4
JobLevel: Min = 1, Max = 5
JobSatisfaction: Min = 1, Max = 4
RelationshipSatisfaction: Min = 1, Max = 4
WorkLifeBalance: Min = 1, Max = 4


In [59]:
# Detect outliers using the IQR method for numerical columns.
# This is not strictly necessary here: the variables we analyse are categorical and therefore have no outliers.
numeric_cols = df.select_dtypes(include='number').columns
outlier_summary = {}

for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    outlier_summary[col] = outliers.shape[0]

print("Outlier counts per column:")
print(outlier_summary)

# Check for irregularities in categorical columns
categorical_cols = df.select_dtypes(include='object').columns
for col in categorical_cols:
    print(f"\nUnique values in '{col}':")
    print(df[col].value_counts(dropna=False))

Outlier counts per column:
{'Age': 0, 'DailyRate': 0, 'DistanceFromHome': 0, 'Education': 0, 'EmployeeNumber': 0, 'EnvironmentSatisfaction': 0, 'HourlyRate': 0, 'JobInvolvement': 0, 'JobLevel': 0, 'JobSatisfaction': 0, 'MonthlyIncome': 114, 'MonthlyRate': 0, 'NumCompaniesWorked': 52, 'PercentSalaryHike': 0, 'PerformanceRating': 226, 'RelationshipSatisfaction': 0, 'StockOptionLevel': 85, 'TotalWorkingYears': 63, 'TrainingTimesLastYear': 238, 'WorkLifeBalance': 0, 'YearsAtCompany': 104, 'YearsInCurrentRole': 21, 'YearsSinceLastPromotion': 107, 'YearsWithCurrManager': 14}

Unique values in 'Attrition':
Attrition
No     1233
Yes     237
Name: count, dtype: int64

Unique values in 'BusinessTravel':
BusinessTravel
Travel_Rarely        1043
Travel_Frequently     277
Non-Travel            150
Name: count, dtype: int64

Unique values in 'Department':
Department
Research & Development    961
Sales                     446
Human Resources            63
Name: count, dtype: int64

Unique values in '

In [60]:
keep_cols = ['Attrition', 'BusinessTravel', 'MaritalStatus']

# Option A: select only those columns (returns a new DataFrame)
df = df[keep_cols].copy()

# Option B (equivalent alternative): drop all other columns in-place.
# After Option A nothing is left to drop, so this step is a no-op here.
to_drop = [col for col in df.columns if col not in keep_cols]
df.drop(columns=to_drop, inplace=True)

# Verify
print(df.head())
print(df.columns)

  Attrition     BusinessTravel MaritalStatus
0       Yes      Travel_Rarely        Single
1        No  Travel_Frequently       Married
2       Yes      Travel_Rarely        Single
3        No  Travel_Frequently       Married
4        No      Travel_Rarely       Married
Index(['Attrition', 'BusinessTravel', 'MaritalStatus'], dtype='object')


In [61]:
# Check the condition that each cell contains at least 10 observations.
# Table 1: Attrition vs Marital Status
attrition_marital = pd.crosstab(
    index=df["Attrition"],
    columns=df["MaritalStatus"],
    margins=True,
    margins_name="Total"
)
print("Attrition vs Marital Status")
print(attrition_marital)

# Table 2: Attrition vs Business Travel Frequency
attrition_travel = pd.crosstab(
    index=df["Attrition"],
    columns=df["BusinessTravel"],
    margins=True,
    margins_name="Total"
)
print("\nAttrition vs Business Travel Frequency")
print(attrition_travel)

Attrition vs Marital Status
MaritalStatus  Divorced  Married  Single  Total
Attrition                                      
No                  294      589     350   1233
Yes                  33       84     120    237
Total               327      673     470   1470

Attrition vs Business Travel Frequency
BusinessTravel  Non-Travel  Travel_Frequently  Travel_Rarely  Total
Attrition                                                          
No                     138                208            887   1233
Yes                     12                 69            156    237
Total                  150                277           1043   1470


H0 (null hypothesis): There is no difference in attrition rates among non-travelers, rare travelers and frequent travelers.

In [62]:

from scipy.stats import chi2_contingency

# 1. Build the contingency table
contingency_table = pd.crosstab(
    df['Attrition'],
    df['BusinessTravel']
)

# 2. Run the chi-square test
chi2, p, dof, expected = chi2_contingency(contingency_table)

# 3. Output the results
print("Contingency Table:")
print(contingency_table)

print(f"\nChi2 Statistic: {chi2:.4f}")
print(f"P-Value: {p:.4f}")
print(f"Degrees of Freedom: {dof}")

print("\nExpected Frequencies:")
print(pd.DataFrame(
    expected,
    index=contingency_table.index,
    columns=contingency_table.columns
))

Contingency Table:
BusinessTravel  Non-Travel  Travel_Frequently  Travel_Rarely
Attrition                                                   
No                     138                208            887
Yes                     12                 69            156

Chi2 Statistic: 24.1824
P-Value: 0.0000
Degrees of Freedom: 2

Expected Frequencies:
BusinessTravel  Non-Travel  Travel_Frequently  Travel_Rarely
Attrition                                                   
No              125.816327         232.340816     874.842857
Yes              24.183673          44.659184     168.157143


This result (p < 0.0001) indicates that attrition and travel frequency are related. Comparing the contingency table with the expected frequencies shows that far fewer non-travelers left the company than would be expected under independence (12 observed versus about 24 expected), while far more frequent travelers left (69 observed versus about 45 expected).
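
To make the comparison of observed and expected counts concrete, the statistic can be recomputed by hand. The sketch below uses only the counts printed in the contingency table above. Under independence, each expected count is (row total × column total) / N, and the statistic sums (observed − expected)² / expected over all cells. The variable names are illustrative.

```python
# Observed counts copied from the contingency table above
observed = {
    "No":  {"Non-Travel": 138, "Travel_Frequently": 208, "Travel_Rarely": 887},
    "Yes": {"Non-Travel": 12,  "Travel_Frequently": 69,  "Travel_Rarely": 156},
}

row_totals = {r: sum(cols.values()) for r, cols in observed.items()}
col_totals = {c: sum(observed[r][c] for r in observed) for c in observed["No"]}
n = sum(row_totals.values())

expected = {}
chi2_manual = 0.0
for r in observed:
    expected[r] = {}
    for c in col_totals:
        # Expected count if attrition and travel frequency were independent
        e = row_totals[r] * col_totals[c] / n
        expected[r][c] = e
        # This cell's contribution to the chi-square statistic
        chi2_manual += (observed[r][c] - e) ** 2 / e

print(f"Expected leavers among non-travelers: {expected['Yes']['Non-Travel']:.2f}")
print(f"Chi-square statistic: {chi2_manual:.4f}")
```

The "Yes" row contributes most of the statistic, in particular the frequent-traveler cell, which is consistent with the interpretation above.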

In [63]:
# 1. Build the contingency table
contingency_table = pd.crosstab(
    df['Attrition'],
    df['MaritalStatus']
)

# 2. Run the chi-square test
chi2, p, dof, expected = chi2_contingency(contingency_table)

# 3. Output the results
print("Contingency Table:")
print(contingency_table)

print(f"\nChi2 Statistic: {chi2:.4f}")
print(f"P-Value: {p:.4f}")
print(f"Degrees of Freedom: {dof}")

print("\nExpected Frequencies:")
print(pd.DataFrame(
    expected,
    index=contingency_table.index,
    columns=contingency_table.columns))

Contingency Table:
MaritalStatus  Divorced  Married  Single
Attrition                               
No                  294      589     350
Yes                  33       84     120

Chi2 Statistic: 46.1637
P-Value: 0.0000
Degrees of Freedom: 2

Expected Frequencies:
MaritalStatus    Divorced     Married     Single
Attrition                                       
No             274.279592  564.495918  394.22449
Yes             52.720408  108.504082   75.77551


In [64]:
# 1. Define low- and high-travel groups
low_travel = df['BusinessTravel'].isin(['Non-Travel', 'Travel_Rarely'])
high_travel = df['BusinessTravel'] == 'Travel_Frequently'

# 2. Count attriters in each group
count_low  = df.loc[low_travel,  'Attrition'].eq('Yes').sum()
nobs_low   = low_travel.sum()

count_high = df.loc[high_travel, 'Attrition'].eq('Yes').sum()
nobs_high  = high_travel.sum()

# 3. Perform two-proportions z-test (one-sided: low < high)
stat, pval = proportions_ztest(
    count  = [count_low, count_high],
    nobs    = [nobs_low,   nobs_high],
    alternative = 'smaller'
)

print(f"Low-travel attrition rate:  {count_low}/{nobs_low:.0f} = {count_low/nobs_low:.3f}")
print(f"High-travel attrition rate: {count_high}/{nobs_high:.0f} = {count_high/nobs_high:.3f}")
print(f"\nZ-statistic = {stat:.3f},  one-sided p-value = {pval:.4f}")


Low-travel attrition rate:  168/1193 = 0.141
High-travel attrition rate: 69/277 = 0.249

Z-statistic = -4.415,  one-sided p-value = 0.0000


Given the very small p-value (below 0.0001), there is strong evidence that the attrition rate among frequent travelers (24.9%) is higher than among employees who travel rarely or not at all (14.1%).
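
The z-statistic can be reproduced from the four counts in the output above. This is a minimal sketch assuming the pooled-variance form of the two-proportion z-test, which is the default behaviour of `proportions_ztest`. The variable names `p_pooled`, `se` and `p_one_sided` are illustrative.

```python
import math

# Counts copied from the output above
count_low, nobs_low = 168, 1193    # Non-Travel + Travel_Rarely
count_high, nobs_high = 69, 277    # Travel_Frequently

p_low = count_low / nobs_low
p_high = count_high / nobs_high

# Pooled attrition rate under the null hypothesis of equal proportions
p_pooled = (count_low + count_high) / (nobs_low + nobs_high)

# Standard error of the difference in proportions under the null
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / nobs_low + 1 / nobs_high))
z = (p_low - p_high) / se

# Lower-tail probability of the standard normal, matching alternative='smaller'
p_one_sided = 0.5 * math.erfc(-z / math.sqrt(2))

print(f"z = {z:.3f}, one-sided p-value = {p_one_sided:.2e}")
```

Printing the p-value in scientific notation avoids the uninformative `0.0000` that a fixed four-decimal format produces.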

4. Method 2 

In [69]:
# Encode Attrition as 0/1
df['attrit_bin'] = df['Attrition'].map({'No': 0, 'Yes': 1})

# Fit logistic regression
model = smf.logit(
    formula = "attrit_bin ~ C(BusinessTravel, Treatment('Travel_Frequently'))",
    data    = df
).fit(disp=False)

print(model.summary())

# To get odds ratios
odds_ratios = pd.DataFrame({
    'OR': model.params.apply(np.exp),
    '2.5%': model.conf_int()[0].apply(np.exp),
    '97.5%': model.conf_int()[1].apply(np.exp)
})
print("\nOdds Ratios (vs. Travel_Frequently):")
print(odds_ratios)


                           Logit Regression Results                           
Dep. Variable:             attrit_bin   No. Observations:                 1470
Model:                          Logit   Df Residuals:                     1467
Method:                           MLE   Df Model:                            2
Date:                Wed, 17 Sep 2025   Pseudo R-squ.:                 0.01830
Time:                        19:29:34   Log-Likelihood:                -637.41
converged:                       True   LL-Null:                       -649.29
Covariance Type:            nonrobust   LLR p-value:                 6.927e-06
                                                                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------------------------------------
Intercept                                                             -1.1034      0.139     -7.94
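
Because this model has a single categorical predictor, its coefficients can be cross-checked against the contingency table. The intercept should equal the log-odds of attrition in the reference group (Travel_Frequently), and each odds ratio should equal the ratio of a group's odds to the reference odds. The sketch below recomputes these from the printed counts. It is a sanity check under that assumption, not a replacement for the fitted model.

```python
import math

# (leavers, stayers) per travel group, copied from the contingency table
counts = {
    "Travel_Frequently": (69, 208),   # reference category in the model
    "Travel_Rarely": (156, 887),
    "Non-Travel": (12, 138),
}

# Odds of attrition within each group
odds = {group: yes / no for group, (yes, no) in counts.items()}

# Log-odds of the reference group, comparable to the model intercept
intercept = math.log(odds["Travel_Frequently"])

# Odds ratios relative to frequent travelers
odds_ratios = {
    group: odds[group] / odds["Travel_Frequently"]
    for group in odds
    if group != "Travel_Frequently"
}

print(f"Intercept (log-odds, Travel_Frequently): {intercept:.4f}")
for group, ratio in odds_ratios.items():
    print(f"OR {group} vs. Travel_Frequently: {ratio:.3f}")
```

Odds ratios below 1 mean lower odds of attrition than for frequent travelers.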

5. Reflection on the use of AI

We primarily used AI to help us solve issues. Our main contribution was identifying the problems that required a solution. (...)

6. Conclusion