<a href="https://colab.research.google.com/github/polydiaguiar/turnover-prediction-final-project/blob/main/turnover_prediction_final_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## | 🎯 Guiding Business Questions
* Which characteristics most influence employee turnover?
* Which employee profile is most likely to leave the company?
* Can a machine learning model predict an employee's departure with good accuracy?
* What actions can the company take based on these predictions?


## | 🔍 DATASET OVERVIEW

### 📂 Variables

* **Age:** Employee age (in years).
* **Attrition:** Whether the employee left the company (Yes/No).
* **BusinessTravel:** Frequency of business travel (e.g., "Travel_Rarely", "Travel_Frequently", "Non-Travel").
* **DailyRate:** Daily pay rate (numeric value).
* **Department:** Employee's department (e.g., "Sales", "Research & Development", "Human Resources").
* **DistanceFromHome:** Distance between home and work (in km).
* **Education:** Education level (1-5, where 1="Below College", 5="Doctor").
* **EducationField:** Field of study (e.g., "Life Sciences", "Medical", "Technical Degree").
* **EmployeeCount:** Employee count (normally 1 for individual records).
* **EmployeeNumber:** Unique employee ID.
* **EnvironmentSatisfaction:** Satisfaction with the work environment (numeric scale, normally 1-4).
* **Gender:** Gender ("Male" or "Female").
* **HourlyRate:** Hourly pay.
* **JobInvolvement:** Job involvement (numeric scale, e.g., 1-4).
* **JobLevel:** Hierarchical level (1=junior, 5=senior).
* **JobRole:** Job role held (e.g., "Sales Executive", "Research Scientist").
* **JobSatisfaction:** Job satisfaction (numeric scale, e.g., 1-4).
* **MaritalStatus:** Marital status ("Single", "Married", "Divorced").
* **MonthlyIncome:** Monthly salary.
* **MonthlyRate:** Monthly pay rate.
* **NumCompaniesWorked:** Number of companies the employee has worked for.
* **Over18:** Whether the employee is over 18 (normally "Yes" for all).
* **OverTime:** Whether the employee works overtime ("Yes" or "No").
* **PercentSalaryHike:** Percentage of the most recent salary increase.
* **PerformanceRating:** Performance rating (e.g., 1-5).
* **RelationshipSatisfaction:** Satisfaction with relationships at work (numeric scale).
* **StandardHours:** Standard working hours (e.g., 80 hours/month).
* **StockOptionLevel:** Stock option level (e.g., 0-3).
* **TotalWorkingYears:** Total years of professional experience.
* **TrainingTimesLastYear:** Number of training sessions in the last year.
* **WorkLifeBalance:** Work-life balance (numeric scale).
* **YearsAtCompany:** Time at the current company (in years).
* **YearsInCurrentRole:** Time in the current role (in years).
* **YearsSinceLastPromotion:** Time since the last promotion (in years).
* **YearsWithCurrManager:** Time with the same manager (in years).




### 📂 Reference

**Title**: IBM HR Analytics Employee Attrition & Performance  
**Source**: [Kaggle](https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset/data)  
**Author**: pavansubhash  
**License**: Database Contents License (DbCL) v1.0





## | 📚 LIBRARY IMPORTS

In [1]:
import pandas as pd
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
from google.colab import drive

## | 📂 DATA LOADING

In [2]:
# Mount Google Drive in Colab
drive.mount('/content/drive', force_remount=True)

# Path of the file to read
caminho = '/content/drive/MyDrive/bancos/RH-DATASET.csv'

Mounted at /content/drive


In [3]:
# Load the CSV file into a DataFrame
df = pd.read_csv(caminho)

## | ☑️ DATA TREATMENT | CLEANING AND PREPROCESSING

In [5]:
# Show the first 5 rows
df.head()

| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


In [6]:
# List the dataset columns
df.columns

Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

In [7]:
# Show DataFrame information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

In [16]:
# Check for repeated IDs
df['EmployeeNumber'].duplicated().value_counts()

| EmployeeNumber | count |
|---|---|
| False | 1470 |


In [18]:
# Summary table to understand cardinality, check that the dtypes are correct, and inspect the values of low-cardinality columns
dic = pd.DataFrame({
    'Type': df.dtypes,
    'Unique count': df.nunique(),
    'Most frequent value': df.mode().iloc[0],
    'Unique values': df.apply(lambda x: list(x.unique()) if x.nunique() <= 15 else "N/A")
})

print('🗃️ Information summary:')
dic

🗃️ Information summary:


| | Type | Unique count | Most frequent value | Unique values |
|---|---|---|---|---|
| Age | int64 | 43 | 35.0 | |
| Attrition | object | 2 | No | [Yes, No] |
| BusinessTravel | object | 3 | Travel_Rarely | [Travel_Rarely, Travel_Frequently, Non-Travel] |
| DailyRate | int64 | 886 | 691.0 | |
| Department | object | 3 | Research & Development | [Sales, Research & Development, Human Resources] |
| DistanceFromHome | int64 | 29 | 2.0 | |
| Education | int64 | 5 | 3.0 | [2, 1, 4, 3, 5] |
| EducationField | object | 6 | Life Sciences | [Life Sciences, Other, Medical, Marketing, Tec... |
| EmployeeCount | int64 | 1 | 1.0 | [1] |
| EmployeeNumber | int64 | 1470 | 1 | |


### Executive summary

* 35 columns and 1470 rows
* There are no missing values
* There are no repeated IDs ('EmployeeNumber')
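
The three points above can be re-checked programmatically. The sketch below is a minimal illustration on a small made-up DataFrame; the `toy` frame and the `summarize_quality` helper are hypothetical names that do not appear in the notebook. With the real data, `df` would take the place of `toy`.

```python
import pandas as pd


def summarize_quality(frame: pd.DataFrame, id_column: str) -> dict:
    """Return the shape, total missing values, and duplicated-ID count."""
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "missing_values": int(frame.isna().sum().sum()),
        "duplicated_ids": int(frame[id_column].duplicated().sum()),
    }


# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "EmployeeNumber": [1, 2, 4, 5],
    "Age": [41, 49, 37, 33],
    "Attrition": ["Yes", "No", "Yes", "No"],
})

summary = summarize_quality(toy, "EmployeeNumber")
print(summary)
```

Collecting the checks in one dictionary keeps them easy to re-run after each cleaning step.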


#### 🟩 Variable classification

* **Nominal categorical variables:** 'Attrition' (target variable), 'Department', 'EducationField', 'EmployeeNumber', 'Gender', 'JobRole', 'MaritalStatus', 'Over18', 'OverTime'.
* **Ordinal categorical variables:** 'BusinessTravel', 'Education', 'EnvironmentSatisfaction', 'JobInvolvement', 'JobLevel', 'PerformanceRating', 'RelationshipSatisfaction', 'WorkLifeBalance', 'StockOptionLevel'.
* **Continuous numeric variables:** 'DistanceFromHome', 'MonthlyIncome', 'MonthlyRate'.
* **Discrete numeric variables:** 'Age', 'DailyRate', 'EmployeeCount', 'HourlyRate', 'NumCompaniesWorked', 'PercentSalaryHike', 'StandardHours', 'TotalWorkingYears', 'TrainingTimesLastYear', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager'.

#### 🟩 Treatments applied

* 'StockOptionLevel': the level available to an employee is set by the company, which usually bases the requirements on seniority and even performance, so I classified it as ordinal categorical.
* The columns 'EmployeeCount', 'Over18', and 'StandardHours' hold constant values and add nothing to the analysis, so they will be removed.
* 'EmployeeNumber' is the identifier and will also be removed.
* The dtypes of all categorical variables are converted to 'category'.
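
Constant columns like the ones listed above can also be found automatically rather than by inspection. A minimal sketch, assuming a made-up DataFrame; `toy` and `find_constant_columns` are hypothetical names used only for illustration:

```python
import pandas as pd


def find_constant_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns that hold a single distinct value."""
    unique_counts = frame.nunique(dropna=False)
    return unique_counts[unique_counts == 1].index.tolist()


# Hypothetical toy data: three constant columns and one that varies
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
})

constant_columns = find_constant_columns(toy)
print(constant_columns)
```

Passing `dropna=False` counts missing values as a distinct value, so a column that is partly empty is not reported as constant.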

In [40]:
# Copy the original DataFrame
df_tratado = df.copy()

# Drop unnecessary columns
df_tratado = df_tratado.drop(columns=['EmployeeCount', 'Over18', 'StandardHours', 'EmployeeNumber'])

In [41]:
# Convert the dtype of the nominal categorical variables

# List of nominal columns
# ('BusinessTravel' is converted here as an unordered category)
colunas_nominais = [
    'Attrition', 'Department', 'EducationField', 'Gender',
    'JobRole', 'MaritalStatus', 'OverTime', 'BusinessTravel'
]

df_tratado[colunas_nominais] = df_tratado[colunas_nominais].astype('category')

In [43]:
# Convert the dtype of the ordinal categorical variables

# List of ordinal columns
# All ordinal columns are numeric
colunas_ordinais = [
    'Education', 'EnvironmentSatisfaction', 'JobInvolvement', 'JobLevel',
    'PerformanceRating', 'RelationshipSatisfaction', 'WorkLifeBalance', 'StockOptionLevel'
]

# Convert dtypes while preserving the order of the levels within each ordinal feature
for column in colunas_ordinais:
    df_tratado[column] = pd.Categorical(df_tratado[column], categories=sorted(df[column].unique()), ordered=True)
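
One practical effect of `ordered=True` is that the levels can be compared and ranked instead of being treated as unrelated labels. A minimal sketch on made-up values; the `levels` series is hypothetical and only mimics a 1-4 rating scale:

```python
import pandas as pd

# Hypothetical ratings on a 1-4 scale
levels = pd.Series([3, 1, 4, 2, 3])

ordered_levels = pd.Series(
    pd.Categorical(levels, categories=sorted(levels.unique()), ordered=True)
)

# Ordered categoricals support min/max and comparisons against a category
print(ordered_levels.min(), ordered_levels.max())
high_mask = ordered_levels >= 3
print(high_mask.tolist())
```

An unordered categorical would raise a `TypeError` on the same comparison, which is why the ordinal features are built with an explicit category order.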


In [45]:
# Check the dtype conversions
df_tratado.dtypes

| | dtype |
|---|---|
| Age | int64 |
| Attrition | category |
| BusinessTravel | category |
| DailyRate | int64 |
| Department | category |
| DistanceFromHome | int64 |
| Education | category |
| EducationField | category |
| EnvironmentSatisfaction | category |
| Gender | category |


## | 📊 EXPLORATORY ANALYSIS


In [10]:
# Investigate 'JobSatisfaction' x 'EnvironmentSatisfaction' x 'RelationshipSatisfaction'

# 'HourlyRate': MAYBE COMPUTE THE % OF HOURS WORKED OUT OF THE 80h TOTAL
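
One way to start the satisfaction comparison noted above is a cross-tabulation of two of the scales. This is only a sketch on made-up values (the `toy` frame is hypothetical), not the analysis the notebook ultimately runs:

```python
import pandas as pd

# Hypothetical toy data with two 1-4 satisfaction scales
toy = pd.DataFrame({
    "JobSatisfaction":         [4, 2, 3, 3, 1, 4],
    "EnvironmentSatisfaction": [2, 3, 4, 4, 1, 2],
})

# Count how often each pair of levels occurs together
pair_counts = pd.crosstab(toy["JobSatisfaction"], toy["EnvironmentSatisfaction"])
print(pair_counts)
```

Counts concentrated along the diagonal would suggest the two scales move together; a scattered table would suggest they capture different things.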

- Univariate exploratory analysis

- Multivariate exploratory analysis
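
As a rough illustration of these two planned steps, the sketch below computes a univariate class distribution and a bivariate attrition rate per group. The `toy` frame is made up and the choice of 'OverTime' as the grouping variable is an assumption for the example, not a finding:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "No", "Yes", "No"],
    "OverTime":  ["Yes", "No", "Yes", "No", "Yes", "No"],
})

# Univariate: share of each target class
target_share = toy["Attrition"].value_counts(normalize=True)

# Bivariate: attrition rate within each OverTime group
attrition_rate = toy["Attrition"].eq("Yes").groupby(toy["OverTime"]).mean()

print(target_share)
print(attrition_rate)
```

Checking the class shares first matters because an imbalanced target affects how later model metrics should be read.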

## | 💻 PREDICTIVE MODELING

## | 📋 MODEL EVALUATION

## | 📍 CONCLUSION