# Employee Attrition Prediction Project

This notebook gathers the scripts, analyses, and results of a Machine Learning project to predict employee attrition.

## 1. Import Libraries

This section imports the main Python libraries used for data analysis and modeling.

In [106]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

## 2. Load the data files into the notebook

In [107]:
df = pd.read_csv("../data/raw/rh_data.csv")

### 2.1. Inspecting the DataFrame

In [108]:
print("Dataset shape:", df.shape)

Dataset shape: (4410, 24)


In [109]:
pd.set_option("display.max_rows", None)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4410 entries, 0 to 4409
Data columns (total 24 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Age                      4410 non-null   int64  
 1   Attrition                4410 non-null   object 
 2   BusinessTravel           4410 non-null   object 
 3   Department               4410 non-null   object 
 4   DistanceFromHome         4410 non-null   int64  
 5   Education                4410 non-null   int64  
 6   EducationField           4410 non-null   object 
 7   EmployeeCount            4410 non-null   int64  
 8   EmployeeID               4410 non-null   int64  
 9   Gender                   4410 non-null   object 
 10  JobLevel                 4410 non-null   int64  
 11  JobRole                  4410 non-null   object 
 12  MaritalStatus            4410 non-null   object 
 13  MonthlyIncome            4410 non-null   int64  
 14  NumCompaniesWorked      

In [110]:
df.head()

| | Age | Attrition | BusinessTravel | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeID | Gender | ... | NumCompaniesWorked | Over18 | PercentSalaryHike | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | YearsAtCompany | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 51 | No | Travel_Rarely | Sales | 6 | 2 | Life Sciences | 1 | 1 | Female | ... | 1.0 | Y | 11 | 8 | 0 | 1.0 | 6 | 1 | 0 | 0 |
| 1 | 31 | Yes | Travel_Frequently | Research & Development | 10 | 1 | Life Sciences | 1 | 2 | Female | ... | 0.0 | Y | 23 | 8 | 1 | 6.0 | 3 | 5 | 1 | 4 |
| 2 | 32 | No | Travel_Frequently | Research & Development | 17 | 4 | Other | 1 | 3 | Male | ... | 1.0 | Y | 15 | 8 | 3 | 5.0 | 2 | 5 | 0 | 3 |
| 3 | 38 | No | Non-Travel | Research & Development | 2 | 5 | Life Sciences | 1 | 4 | Male | ... | 3.0 | Y | 11 | 8 | 3 | 13.0 | 5 | 8 | 7 | 5 |
| 4 | 32 | No | Travel_Rarely | Research & Development | 10 | 1 | Medical | 1 | 5 | Male | ... | 4.0 | Y | 12 | 8 | 2 | 9.0 | 2 | 6 | 0 | 4 |


In [111]:
df.describe()

| | Age | DistanceFromHome | Education | EmployeeCount | EmployeeID | JobLevel | MonthlyIncome | NumCompaniesWorked | PercentSalaryHike | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | YearsAtCompany | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4410.0 | 4410.0 | 4410.0 | 4410.0 | 4410.0 | 4410.0 | 4410.0 | 4391.0 | 4410.0 | 4410.0 | 4410.0 | 4401.0 | 4410.0 | 4410.0 | 4410.0 | 4410.0 |
| mean | 36.92381 | 9.192517 | 2.912925 | 1.0 | 2205.5 | 2.063946 | 65029.312925 | 2.69483 | 15.209524 | 8.0 | 0.793878 | 11.279936 | 2.79932 | 7.008163 | 2.187755 | 4.123129 |
| std | 9.133301 | 8.105026 | 1.023933 | 0.0 | 1273.201673 | 1.106689 | 47068.888559 | 2.498887 | 3.659108 | 0.0 | 0.851883 | 7.782222 | 1.288978 | 6.125135 | 3.221699 | 3.567327 |
| min | 18.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 10090.0 | 0.0 | 11.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 30.0 | 2.0 | 2.0 | 1.0 | 1103.25 | 1.0 | 29110.0 | 1.0 | 12.0 | 8.0 | 0.0 | 6.0 | 2.0 | 3.0 | 0.0 | 2.0 |
| 50% | 36.0 | 7.0 | 3.0 | 1.0 | 2205.5 | 2.0 | 49190.0 | 2.0 | 14.0 | 8.0 | 1.0 | 10.0 | 3.0 | 5.0 | 1.0 | 3.0 |
| 75% | 43.0 | 14.0 | 4.0 | 1.0 | 3307.75 | 3.0 | 83800.0 | 4.0 | 18.0 | 8.0 | 1.0 | 15.0 | 3.0 | 9.0 | 3.0 | 7.0 |
| max | 60.0 | 29.0 | 5.0 | 1.0 | 4410.0 | 5.0 | 199990.0 | 9.0 | 25.0 | 8.0 | 3.0 | 40.0 | 6.0 | 40.0 | 15.0 | 17.0 |


## 3. Clean the data

### 3.1. Identify missing values

In [112]:
# Count missing values per column
print(df.isnull().sum())

# Share of missing values per column (%)
print(df.isnull().mean() * 100)

Age                         0
Attrition                   0
BusinessTravel              0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeID                  0
Gender                      0
JobLevel                    0
JobRole                     0
MaritalStatus               0
MonthlyIncome               0
NumCompaniesWorked         19
Over18                      0
PercentSalaryHike           0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           9
TrainingTimesLastYear       0
YearsAtCompany              0
YearsSinceLastPromotion     0
YearsWithCurrManager        0
dtype: int64
Age                        0.000000
Attrition                  0.000000
BusinessTravel             0.000000
Department                 0.000000
DistanceFromHome           0.000000
Education                  0.000000
EducationField             0.000000
EmployeeCount  

In `NumCompaniesWorked`, missing values are replaced with 1, since the only thing we know for certain is that the employee currently works at this company.

* At this stage, values of 0 are also replaced with 1.

In [113]:
df['NumCompaniesWorked'] = df['NumCompaniesWorked'].fillna(1)

In [114]:
df.loc[df['NumCompaniesWorked'] == 0, 'NumCompaniesWorked'] = 1
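
The two replacement steps above can be checked on a small frame. This is a minimal sketch: the `toy` DataFrame and its values are invented for illustration and do not come from `rh_data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the project dataset
toy = pd.DataFrame({"NumCompaniesWorked": [np.nan, 0.0, 2.0, 5.0]})

# Step 1: missing values become 1 (the current employer)
toy["NumCompaniesWorked"] = toy["NumCompaniesWorked"].fillna(1)

# Step 2: zeros also become 1, following the same reasoning
toy.loc[toy["NumCompaniesWorked"] == 0, "NumCompaniesWorked"] = 1

print(toy["NumCompaniesWorked"].tolist())  # [1.0, 1.0, 2.0, 5.0]
```

Values that were already 1 or greater are left untouched, so only the unknown and zero entries change.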


In `TotalWorkingYears`, missing values are replaced with the value of `YearsAtCompany`, since we know how long the employee has been at the company and that tenure is a lower bound on their total working years.

In [115]:
df["TotalWorkingYears"] = np.where(
    df["TotalWorkingYears"].isnull(), 
    df["YearsAtCompany"], 
    df["TotalWorkingYears"]
)
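
For reference, the same fill can likely be written with `Series.fillna`, which accepts another Series and aligns it by index. The sketch below compares both forms on invented toy values (the `toy` frame is an assumption for illustration only).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the project dataset
toy = pd.DataFrame({
    "TotalWorkingYears": [10.0, np.nan, 3.0],
    "YearsAtCompany": [4, 7, 3],
})

# Form used in the notebook: pick YearsAtCompany where the value is missing
via_where = np.where(
    toy["TotalWorkingYears"].isnull(),
    toy["YearsAtCompany"],
    toy["TotalWorkingYears"],
)

# Equivalent form: fill missing entries from the matching rows of another column
via_fillna = toy["TotalWorkingYears"].fillna(toy["YearsAtCompany"])

print(list(via_where))      # [10.0, 7.0, 3.0]
print(via_fillna.tolist())  # [10.0, 7.0, 3.0]
```

The `fillna` form keeps the result as a pandas Series with the original index, while `np.where` returns a NumPy array.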

In [116]:
print(df.isnull().sum())

Age                        0
Attrition                  0
BusinessTravel             0
Department                 0
DistanceFromHome           0
Education                  0
EducationField             0
EmployeeCount              0
EmployeeID                 0
Gender                     0
JobLevel                   0
JobRole                    0
MaritalStatus              0
MonthlyIncome              0
NumCompaniesWorked         0
Over18                     0
PercentSalaryHike          0
StandardHours              0
StockOptionLevel           0
TotalWorkingYears          0
TrainingTimesLastYear      0
YearsAtCompany             0
YearsSinceLastPromotion    0
YearsWithCurrManager       0
dtype: int64


### 3.2. Remove duplicates

In [117]:
duplicados = df[df.duplicated(keep='first')]
print(duplicados)

Empty DataFrame
Columns: [Age, Attrition, BusinessTravel, Department, DistanceFromHome, Education, EducationField, EmployeeCount, EmployeeID, Gender, JobLevel, JobRole, MaritalStatus, MonthlyIncome, NumCompaniesWorked, Over18, PercentSalaryHike, StandardHours, StockOptionLevel, TotalWorkingYears, TrainingTimesLastYear, YearsAtCompany, YearsSinceLastPromotion, YearsWithCurrManager]
Index: []

[0 rows x 24 columns]


In [118]:
df.drop_duplicates(keep='first', inplace=True) 
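
Full-row comparison only flags rows that match in every column. If `EmployeeID` is expected to be unique (an assumption about this dataset, not something the notebook states), a check restricted to that key may also be useful. A minimal sketch on invented toy data:

```python
import pandas as pd

# Hypothetical toy data: the same EmployeeID appears twice with a different Age
toy = pd.DataFrame({
    "EmployeeID": [1, 2, 2, 3],
    "Age": [30, 41, 42, 25],
})

# Rows identical in every column (what the cell above checks)
full_row_dups = int(toy.duplicated(keep="first").sum())

# Rows that repeat an EmployeeID, even if other columns differ
id_dups = int(toy.duplicated(subset=["EmployeeID"], keep="first").sum())

print(full_row_dups, id_dups)  # 0 1
```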

### 3.3. Identify the target variable

The target variable `Attrition` is stored as text ("Yes" and "No"), so it must be converted to numeric values before training the model. It is separated from the features here so that it is not transformed together with the other variables.

In [119]:
y = df["Attrition"].map({"Yes": 1, "No": 0})
X = df.drop(columns=["Attrition"])

X = X.reset_index(drop=True)
y = y.reset_index(drop=True)
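
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a quick count of missing values after mapping can reveal unexpected labels. The sketch below uses an invented label with different casing to show the effect; the `labels` Series is hypothetical.

```python
import pandas as pd

# Hypothetical labels: the third entry uses a different casing on purpose
labels = pd.Series(["Yes", "No", "yes", "No"])

mapped = labels.map({"Yes": 1, "No": 0})

# Any value outside the dictionary keys becomes NaN
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)  # 1
```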

### 3.4. Transform categorical variables (Label Encoding)

In [120]:
# Identify categorical columns
cat_cols = X.select_dtypes(include="object").columns
print("Categorical columns:", cat_cols)

le = LabelEncoder()

# Apply LabelEncoder to each categorical column
# (the same encoder is refit for every column, so `le` only keeps the last mapping)
for col in cat_cols:
    X[col] = le.fit_transform(X[col])

Categorical columns: Index(['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole',
       'MaritalStatus', 'Over18'],
      dtype='object')
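
When the integer codes need to be translated back to the original categories later, one option is to keep a separate fitted encoder per column. This is a minimal sketch on invented toy data; the `encoders` dictionary is a hypothetical helper, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data with two categorical columns
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female"],
    "Department": ["Sales", "Research & Development", "Sales"],
})

# One fitted encoder per column, stored by column name
encoders = {}
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Codes follow the sorted order of the class labels
print(toy["Gender"].tolist())      # [0, 1, 0]
print(toy["Department"].tolist())  # [1, 0, 1]

# Recover the original labels for one column
decoded = encoders["Gender"].inverse_transform(toy["Gender"].to_numpy())
print(list(decoded))  # ['Female', 'Male', 'Female']
```

Note that label encoding imposes an arbitrary order on nominal categories, which some models may interpret as meaningful.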


### 3.5. Rejoin X and y

In [121]:
df_clean = pd.concat([X, y], axis=1)

### 3.6. Identify and handle data outside the scope of the analysis

In [122]:
# Check for columns with a single unique value
# (this runs on the original `df`, not on `df_clean`)
for col in df.columns:
    if df[col].nunique() == 1:
        print(f"Column {col} is constant and can be removed.")

Column EmployeeCount is constant and can be removed.
Column Over18 is constant and can be removed.
Column StandardHours is constant and can be removed.


In [123]:
df = df.drop(columns=["EmployeeCount", "Over18", "StandardHours"])
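
Instead of typing the column names by hand, the constant columns can be collected and dropped in one pass. A minimal sketch on invented toy data (the `toy` frame and `constant_cols` name are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data with two constant columns
toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1],
    "Age": [30, 41, 25],
    "Over18": ["Y", "Y", "Y"],
})

# Collect every column that holds a single distinct value
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

toy = toy.drop(columns=constant_cols)

print(constant_cols)      # ['EmployeeCount', 'Over18']
print(list(toy.columns))  # ['Age']
```

Passing `dropna=False` counts missing values as a distinct value, so a column mixing one value with `NaN` is not treated as constant.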

## 4. Create new features

In [124]:
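# Age group
# include_lowest=True keeps Age == 18 in the first bin (pd.cut excludes the left edge by default)

df["AgeGroup"] = pd.cut(df["Age"], bins=[18, 30, 40, 50, 60],
                        labels=["18-30", "31-40", "41-50", "51-60"],
                        include_lowest=True)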
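
With its default settings, `pd.cut` builds right-closed intervals, so a value equal to the first bin edge is left unbinned unless `include_lowest=True` is passed. Since the minimum `Age` in this dataset is 18, the difference matters here. The sketch below compares both behaviors on a few invented ages.

```python
import pandas as pd

# Hypothetical ages sitting on the bin edges
ages = pd.Series([18, 30, 31, 60])
bins = [18, 30, 40, 50, 60]
labels = ["18-30", "31-40", "41-50", "51-60"]

# Default: intervals are (18, 30], (30, 40], ... so 18 falls outside
default_cut = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well
inclusive_cut = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

print(default_cut.isna().tolist())          # [True, False, False, False]
print(inclusive_cut.astype(str).tolist())   # ['18-30', '18-30', '31-40', '51-60']
```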

In [125]:
# Relative tenure: ratio of YearsAtCompany to TotalWorkingYears (+1 avoids division by zero)

df["PercYearsAtCompany"] = df["YearsAtCompany"] / (df["TotalWorkingYears"] + 1)
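
Adding 1 to the denominator avoids division by zero but also shrinks every ratio slightly, so an employee whose whole career is at the company gets a value below 1. One possible alternative, sketched here on invented toy values, computes the exact ratio and assigns 0 where the denominator is zero; the choice of 0 as the fallback is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, including a row with zero total working years
toy = pd.DataFrame({
    "YearsAtCompany": [0, 5, 2],
    "TotalWorkingYears": [0.0, 5.0, 8.0],
})

# Form used in the notebook: shift the denominator by 1
shifted = toy["YearsAtCompany"] / (toy["TotalWorkingYears"] + 1)

# Alternative: exact ratio, with 0 where TotalWorkingYears is 0
safe_den = toy["TotalWorkingYears"].replace(0, np.nan)
exact = (toy["YearsAtCompany"] / safe_den).fillna(0.0)

print(shifted.round(3).tolist())  # [0.0, 0.833, 0.222]
print(exact.tolist())             # [0.0, 1.0, 0.25]
```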

In [126]:
# Average years per company

df['AvgYearsPerCompany'] = df['TotalWorkingYears'] / (df['NumCompaniesWorked'].replace(0,1))

In [127]:
# Promotion rate: years since the last promotion relative to total working years

df['PromotionRate'] = df['YearsSinceLastPromotion'] / (df['TotalWorkingYears'] + 1)

In [128]:
# Income category: bin MonthlyIncome into quartiles
# (labels: Baixo = Low, Médio = Medium, Alto = High, Muito Alto = Very High)

df["IncomeGroup"] = pd.qcut(df["MonthlyIncome"], q=4, labels=["Baixo", "Médio", "Alto", "Muito Alto"])
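
`pd.qcut` places its bin edges at the sample quantiles, so with `q=4` each group receives roughly a quarter of the rows. A minimal sketch on invented income values:

```python
import pandas as pd

# Hypothetical incomes, evenly spread
income = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

groups = pd.qcut(income, q=4, labels=["Baixo", "Médio", "Alto", "Muito Alto"])

# Count rows per group, in label order
counts = groups.value_counts(sort=False)
print(counts.tolist())  # [2, 2, 2, 2]
```

Because the edges depend on the data, the same income can land in a different group if the dataset changes.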

In [129]:
# Distance from home (binary: 0 = near, 1 = far), split at the median

df['FarFromHome'] = (df['DistanceFromHome'] > df['DistanceFromHome'].median()).astype(int)

In [130]:
# Multi-company experience: binary (1 if the employee has worked at more than 3 companies)

df["MultiCompanyExp"] = (df["NumCompaniesWorked"] > 3).astype(int)

In [131]:
# Share of career spent at the current company (same formula as PercYearsAtCompany above)

df['CompanyExperienceRatio'] = df['YearsAtCompany'] / (df['TotalWorkingYears'] + 1)

In [132]:
# Gap since the last promotion:
## Flag employees who went more than 5 years without a promotion.

df["LongTimeNoPromotion"] = (df["YearsSinceLastPromotion"] > 5).astype(int)

## 5. Split the data into training and test sets

In [134]:
# Define features and target
X = df.drop(columns=["Attrition"])  # features (explanatory variables)
y = df["Attrition"]                 # target (response variable), still the "Yes"/"No" labels in `df`

In [135]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,        # 20% test, 80% train
    random_state=42,      # seed for reproducibility
    stratify=y            # preserves the target proportion
)
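
The effect of `stratify` can be seen on a small imbalanced target. This is a sketch on invented arrays (`X_toy` and `y_toy` are hypothetical), using the same split parameters as the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy target: 20 positives and 80 negatives
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    random_state=42,
    stratify=y_toy,   # keep the 20/80 class ratio in both parts
)

# Both parts keep the same share of positives
print(y_tr.mean(), y_te.mean())  # 0.2 0.2
```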

In [136]:
print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)
print("Training set proportion:", y_train.value_counts(normalize=True))
print("Test set proportion:", y_test.value_counts(normalize=True))

Training set size: (3528, 29)
Test set size: (882, 29)
Training set proportion: Attrition
No     0.838719
Yes    0.161281
Name: proportion, dtype: float64
Test set proportion: Attrition
No     0.839002
Yes    0.160998
Name: proportion, dtype: float64
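
The median used for `FarFromHome` and the quartiles used for `IncomeGroup` are computed on the full dataset before the split. One commonly used alternative derives such thresholds from the training rows only and reuses them on the test rows, so no information from the test set shapes the features. A minimal sketch on invented toy values (the `train` and `test` frames are hypothetical):

```python
import pandas as pd

# Hypothetical toy splits, not taken from the project dataset
train = pd.DataFrame({"DistanceFromHome": [1, 2, 7, 14, 29]})
test = pd.DataFrame({"DistanceFromHome": [3, 20]})

# Threshold learned from the training rows only
threshold = train["DistanceFromHome"].median()

# The same threshold is applied to both parts
train["FarFromHome"] = (train["DistanceFromHome"] > threshold).astype(int)
test["FarFromHome"] = (test["DistanceFromHome"] > threshold).astype(int)

print(threshold)                     # 7.0
print(train["FarFromHome"].tolist()) # [0, 0, 0, 1, 1]
print(test["FarFromHome"].tolist())  # [0, 1]
```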


## 6. Exploratory data analysis