# Table of contents

1. [Import libraries and configuration](#import-libraries-and-configuration)
2. [Import data](#import-data)
3. [Check DataFrame lengths](#check-dataframe-lengths)
4. [Merge DataFrames](#merge-dataframes)
5. [Replacing binary values (Yes/No, N/Y)](#replacing-binary-values-yesno-ny)
6. [Data type conversion](#data-type-conversion)
7. [Save data](#save-data)

---
---

## Import libraries and configuration

In [61]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.append('../')

from config import *

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


---
---

## Import data

In [62]:
df_emplo_survey_data = pd.read_csv('../../datos/employee_survey_data.csv').reset_index(drop=True)
df_emplo_survey_data.sample()

|  | EmployeeID | EnvironmentSatisfaction | JobSatisfaction | WorkLifeBalance |
|---|---|---|---|---|
| 4322 | 4323 | 3.0 | 2.0 | 2.0 |


In [63]:
df_general_data = pd.read_csv('../../datos/general_data.csv').reset_index(drop=True)
df_general_data.sample()

Unnamed: 0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeID,Gender,JobLevel,JobRole,MaritalStatus,MonthlyIncome,NumCompaniesWorked,Over18,PercentSalaryHike,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsSinceLastPromotion,YearsWithCurrManager
3045,33,Yes,Travel_Rarely,Human Resources,28,2,Human Resources,1,3046,Female,5,Manager,Single,55610,1.0,Y,21,8,1,1.0,3,1,0,0


In [64]:
df_mana_survey_data = pd.read_csv('../../datos/manager_survey_data.csv').reset_index(drop=True)
df_mana_survey_data.sample()

|  | EmployeeID | JobInvolvement | PerformanceRating |
|---|---|---|---|
| 1356 | 1357 | 3 | 4 |


---
---

## Check DataFrame lengths

#### (we check that the three DataFrames have the same number of rows)

In [65]:
len_dfs = [df_general_data.shape[0], df_emplo_survey_data.shape[0], df_mana_survey_data.shape[0]]
set(len_dfs)

{4410}

#### As we can see, all three have 4410 rows. Converting the list to a set removes repeated values, so if the lengths differed the set would contain more than one value.
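Equal row counts do not by themselves guarantee that the three tables describe the same employees. A minimal sketch of a stricter check follows, assuming small toy frames in place of the real CSVs (the frame names and values here are hypothetical):

```python
import pandas as pd

# Toy stand-ins for the three DataFrames (hypothetical data, not the real CSVs).
general = pd.DataFrame({"EmployeeID": [1, 2, 3], "Age": [30, 41, 25]})
employee_survey = pd.DataFrame({"EmployeeID": [3, 1, 2], "JobSatisfaction": [4.0, 2.0, 3.0]})
manager_survey = pd.DataFrame({"EmployeeID": [1, 2, 3], "JobInvolvement": [3, 2, 3]})

frames = [general, employee_survey, manager_survey]

# Same idea as the cell above: a single distinct length means all row counts match.
row_counts = {len(frame) for frame in frames}

# Stricter: every frame holds the same set of keys, and each key appears only once.
key_sets = [set(frame["EmployeeID"]) for frame in frames]
same_keys = all(keys == key_sets[0] for keys in key_sets)
unique_keys = all(frame["EmployeeID"].is_unique for frame in frames)

print(row_counts, same_keys, unique_keys)
```

If `same_keys` or `unique_keys` came back `False`, a later merge on `EmployeeID` could silently drop or multiply rows.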

---
---

## Merge DataFrames

#### Now that we have confirmed that the three DataFrames have the same number of rows, we merge all the data into a single DataFrame on ```EmployeeID``` [we have also made sure that EmployeeID has the same name in all three DataFrames].

#### We rename the columns of ```df_mana_survey_data``` so that we can later tell that they reflect the manager's opinion of that employee.

In [66]:
df_mana_survey_data.columns = ['EmployeeID', 'Manager_opinion_JobInvolvement', 'Manager_opinion_PerformanceRating']
df_mana_survey_data.sample()

|  | EmployeeID | Manager_opinion_JobInvolvement | Manager_opinion_PerformanceRating |
|---|---|---|---|
| 2626 | 2627 | 2 | 3 |
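Assigning a list to `.columns` depends on the column order in the file. As a possible alternative, here is a small sketch that renames by name with `DataFrame.rename`; the toy frame is hypothetical and only mirrors the column names shown above:

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the manager survey.
manager_survey = pd.DataFrame(
    {"EmployeeID": [1, 2], "JobInvolvement": [3, 2], "PerformanceRating": [3, 4]}
)

# Rename by name, so the result does not depend on the order of the columns.
manager_survey = manager_survey.rename(
    columns={
        "JobInvolvement": "Manager_opinion_JobInvolvement",
        "PerformanceRating": "Manager_opinion_PerformanceRating",
    }
)

print(list(manager_survey.columns))
```

Columns not listed in the mapping, such as `EmployeeID`, are left unchanged.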


In [67]:
df = df_emplo_survey_data.merge(df_general_data, on='EmployeeID').merge(df_mana_survey_data, on='EmployeeID')
df.sample()

Unnamed: 0,EmployeeID,EnvironmentSatisfaction,JobSatisfaction,WorkLifeBalance,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EmployeeCount,Gender,JobLevel,JobRole,MaritalStatus,MonthlyIncome,NumCompaniesWorked,Over18,PercentSalaryHike,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsSinceLastPromotion,YearsWithCurrManager,Manager_opinion_JobInvolvement,Manager_opinion_PerformanceRating
762,763,1.0,4.0,2.0,47,No,Travel_Rarely,Research & Development,23,4,Life Sciences,1,Female,2,Sales Executive,Married,62740,7.0,Y,14,8,2,17.0,0,6,1,2,3,3
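`DataFrame.merge` performs an inner join by default, so keys missing from either side are dropped silently. The sketch below, on hypothetical toy frames, shows how the `validate` argument of `merge` can be used to assert that the join is one-to-one; it is an illustration of the idea, not the check used in this notebook:

```python
import pandas as pd

# Hypothetical toy frames standing in for two of the real tables.
employee_survey = pd.DataFrame({"EmployeeID": [1, 2, 3], "JobSatisfaction": [4.0, 2.0, 3.0]})
general = pd.DataFrame({"EmployeeID": [1, 2, 3], "Age": [30, 41, 25]})

# validate="one_to_one" raises if a key is repeated on either side.
merged = employee_survey.merge(general, on="EmployeeID", how="inner", validate="one_to_one")
assert len(merged) == len(employee_survey)  # no rows lost in the inner join

# A repeated key on the right side makes the same call fail loudly.
duplicated_keys = pd.concat([general, general.iloc[[0]]], ignore_index=True)
try:
    employee_survey.merge(duplicated_keys, on="EmployeeID", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

print(merged.shape, raised)
```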


#### We check whether there are duplicate rows

In [68]:
df.duplicated().sum()

0
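`df.duplicated()` compares whole rows, so two rows with the same `EmployeeID` but different values elsewhere would not be counted. A minimal sketch of the difference, using a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame: EmployeeID 2 appears twice with different ages.
df_toy = pd.DataFrame({"EmployeeID": [1, 2, 2], "Age": [30, 41, 42]})

# Whole-row comparison: no two rows are identical.
full_row_duplicates = int(df_toy.duplicated().sum())

# Key-only comparison: the second occurrence of EmployeeID 2 is flagged.
key_duplicates = int(df_toy.duplicated(subset="EmployeeID").sum())

print(full_row_duplicates, key_duplicates)
```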

---
---

## Replacing binary values (Yes/No, N/Y)

#### We replace the values (Yes/No, N/Y) with the Booleans True and False.

#### Variables to modify:
- Attrition
- Over18

In [69]:
df['Attrition'].unique()

array(['No', 'Yes'], dtype=object)

In [70]:
df['Over18'].unique()

array(['Y'], dtype=object)

In [71]:
df['Over18'].value_counts()

Over18
Y    4410
Name: count, dtype: int64

#### ```Over18``` is 'Y' in 100% of the rows [which makes sense], so we drop it from our DataFrame.

In [72]:
df.drop(columns=['Over18'], inplace=True)

df['Attrition'] = df['Attrition'].map({'No': False, 'Yes': True})

In [73]:
df.sample(3)

Unnamed: 0,EmployeeID,EnvironmentSatisfaction,JobSatisfaction,WorkLifeBalance,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EmployeeCount,Gender,JobLevel,JobRole,MaritalStatus,MonthlyIncome,NumCompaniesWorked,PercentSalaryHike,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsSinceLastPromotion,YearsWithCurrManager,Manager_opinion_JobInvolvement,Manager_opinion_PerformanceRating
170,171,1.0,4.0,3.0,47,False,Travel_Rarely,Research & Development,1,3,Technical Degree,1,Male,5,Manufacturing Director,Divorced,135700,5.0,16,8,2,20.0,2,5,0,4,2,3
2822,2823,1.0,1.0,4.0,32,True,Travel_Rarely,Research & Development,2,4,Life Sciences,1,Female,2,Research Director,Single,41480,7.0,15,8,0,10.0,0,5,0,4,2,3
3451,3452,1.0,3.0,3.0,26,False,Travel_Frequently,Research & Development,3,3,Medical,1,Male,2,Laboratory Technician,Divorced,139640,1.0,11,8,2,5.0,1,5,1,3,2,3
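`Series.map` with a dictionary turns any value that is not a key into `NaN`. That is safe here because `unique()` showed only 'No' and 'Yes', but the sketch below shows one way to detect unmapped values on less clean data. The toy series and its lowercase 'yes' are hypothetical:

```python
import pandas as pd

# Hypothetical toy column with one unexpected spelling.
attrition = pd.Series(["No", "Yes", "yes", "No"])
mapping = {"No": False, "Yes": True}

mapped = attrition.map(mapping)

# Values that were present in the input but became NaN were not in the mapping.
unmapped = attrition[mapped.isna() & attrition.notna()]

print(unmapped.tolist())
```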


---
---

## Data type conversion

#### We check the data type and values of each variable and, where necessary, change the type, drop the column, etc.

In [74]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4410 entries, 0 to 4409
Data columns (total 28 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   EmployeeID                         4410 non-null   int64  
 1   EnvironmentSatisfaction            4385 non-null   float64
 2   JobSatisfaction                    4390 non-null   float64
 3   WorkLifeBalance                    4372 non-null   float64
 4   Age                                4410 non-null   int64  
 5   Attrition                          4410 non-null   bool   
 6   BusinessTravel                     4410 non-null   object 
 7   Department                         4410 non-null   object 
 8   DistanceFromHome                   4410 non-null   int64  
 9   Education                          4410 non-null   int64  
 10  EducationField                     4410 non-null   object 
 11  EmployeeCount                      4410 non-null   int64

In [75]:
df.describe().T

|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| EmployeeID | 4410.0 | 2205.5 | 1273.2 | 1.0 | 1103.25 | 2205.5 | 3307.75 | 4410.0 |
| EnvironmentSatisfaction | 4385.0 | 2.72 | 1.09 | 1.0 | 2.0 | 3.0 | 4.0 | 4.0 |
| JobSatisfaction | 4390.0 | 2.73 | 1.1 | 1.0 | 2.0 | 3.0 | 4.0 | 4.0 |
| WorkLifeBalance | 4372.0 | 2.76 | 0.71 | 1.0 | 2.0 | 3.0 | 3.0 | 4.0 |
| Age | 4410.0 | 36.92 | 9.13 | 18.0 | 30.0 | 36.0 | 43.0 | 60.0 |
| DistanceFromHome | 4410.0 | 9.19 | 8.11 | 1.0 | 2.0 | 7.0 | 14.0 | 29.0 |
| Education | 4410.0 | 2.91 | 1.02 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| EmployeeCount | 4410.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| JobLevel | 4410.0 | 2.06 | 1.11 | 1.0 | 1.0 | 2.0 | 3.0 | 5.0 |
| MonthlyIncome | 4410.0 | 65029.31 | 47068.89 | 10090.0 | 29110.0 | 49190.0 | 83800.0 | 199990.0 |
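The non-null counts reported by `df.info()` imply missing values in the three survey columns: 25 in `EnvironmentSatisfaction`, 20 in `JobSatisfaction`, and 38 in `WorkLifeBalance`, out of 4410 rows. A short sketch of how such counts could be listed directly, using a hypothetical toy frame rather than the real data:

```python
import pandas as pd

# Hypothetical toy frame; None becomes NaN in the float columns.
df_toy = pd.DataFrame(
    {
        "EnvironmentSatisfaction": [3.0, None, 1.0, 4.0],
        "WorkLifeBalance": [None, None, 2.0, 3.0],
        "Age": [33, 47, 26, 51],
    }
)

# Count missing values per column and keep only the columns that have any.
missing = df_toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)

print(missing)
```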


#### **Changes we will make:**

##### **Dropping columns**

- ```EmployeeID```, since it provides no information for training our predictive model. It may still be useful later for other kinds of analysis.

- ```EmployeeCount```, since every record contains the value 1.

- ```StandardHours```, since every record contains the value 8 hours.
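Columns with a single value, such as the two constant columns listed above, can also be detected programmatically instead of by reading the summary table. A minimal sketch, assuming a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame with two constant columns and one informative column.
df_toy = pd.DataFrame(
    {
        "EmployeeCount": [1, 1, 1],
        "StandardHours": [8, 8, 8],
        "Age": [30, 41, 25],
    }
)

# A column with exactly one distinct value (counting NaN as a value) is constant.
unique_counts = df_toy.nunique(dropna=False)
constant_columns = unique_counts[unique_counts == 1].index.tolist()

reduced = df_toy.drop(columns=constant_columns)

print(constant_columns, list(reduced.columns))
```

An identifier such as `EmployeeID` is not constant, so it would still need to be dropped explicitly.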

We save the data for a possible future study, since at this point everything is merged and has the correct data types.

In [76]:
df.to_pickle('../../datos/futuro_estudio/df_employee_data.pkl')

We now drop the columns.

In [77]:
df.drop(columns=['EmployeeID', 'EmployeeCount', 'StandardHours'], inplace=True)
df.sample()

Unnamed: 0,EnvironmentSatisfaction,JobSatisfaction,WorkLifeBalance,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,Gender,JobLevel,JobRole,MaritalStatus,MonthlyIncome,NumCompaniesWorked,PercentSalaryHike,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsSinceLastPromotion,YearsWithCurrManager,Manager_opinion_JobInvolvement,Manager_opinion_PerformanceRating
2551,4.0,2.0,3.0,22,True,Travel_Rarely,Research & Development,16,2,Life Sciences,Female,2,Laboratory Technician,Single,35790,1.0,19,3,1.0,3,1,0,0,3,3


#### **Changing values (categorizing the numeric variables)**

- ```Education```
- ```EnvironmentSatisfaction```
- ```JobSatisfaction```
- ```WorkLifeBalance```
- ```Manager_opinion_JobInvolvement```
- ```Manager_opinion_PerformanceRating```

In [78]:
# These three 1-4 scales share the same labels, so the mapping is defined once.
satisfaction_levels = {1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High'}

df['Education'] = df['Education'].map({
    1: 'Below College',
    2: 'College',
    3: 'Bachelor',
    4: 'Master',
    5: 'Doctor',
})

df['EnvironmentSatisfaction'] = df['EnvironmentSatisfaction'].map(satisfaction_levels)
df['JobSatisfaction'] = df['JobSatisfaction'].map(satisfaction_levels)

df['WorkLifeBalance'] = df['WorkLifeBalance'].map({
    1: 'Bad',
    2: 'Good',
    3: 'Better',
    4: 'Best',
})

df['Manager_opinion_JobInvolvement'] = df['Manager_opinion_JobInvolvement'].map(satisfaction_levels)

df['Manager_opinion_PerformanceRating'] = df['Manager_opinion_PerformanceRating'].map({
    1: 'Low',
    2: 'Good',
    3: 'Excellent',
    4: 'Outstanding',
})

print("OUR DATA NOW")
print("--"*10 + "\n")
df.sample(5)

OUR DATA NOW
--------------------



Unnamed: 0,EnvironmentSatisfaction,JobSatisfaction,WorkLifeBalance,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,Gender,JobLevel,JobRole,MaritalStatus,MonthlyIncome,NumCompaniesWorked,PercentSalaryHike,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsSinceLastPromotion,YearsWithCurrManager,Manager_opinion_JobInvolvement,Manager_opinion_PerformanceRating
1472,Medium,Medium,Bad,32,False,Travel_Frequently,Research & Development,17,Master,Other,Male,4,Sales Executive,Married,193280,1.0,15,3,5.0,2,5,0,3,High,Excellent
3514,High,Very High,Better,37,False,Travel_Rarely,Research & Development,23,Bachelor,Life Sciences,Male,3,Manufacturing Director,Divorced,166590,7.0,16,1,9.0,2,6,1,3,High,Excellent
3180,Low,High,Better,22,False,Travel_Rarely,Research & Development,2,College,Medical,Female,2,Laboratory Technician,Married,195450,0.0,14,0,3.0,3,2,2,2,Medium,Excellent
2397,High,High,Better,43,False,Travel_Rarely,Research & Development,2,Master,Life Sciences,Female,1,Research Scientist,Married,24220,1.0,11,1,14.0,2,14,6,11,High,Excellent
353,High,High,Best,35,False,Travel_Rarely,Research & Development,1,Bachelor,Medical,Male,1,Research Director,Single,108550,3.0,13,1,17.0,2,8,1,6,High,Excellent
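After the mapping, the labels are plain strings, so their natural order (Low < Medium < High < Very High) is lost and missing survey answers remain `NaN`. One possible way to keep the order is an ordered categorical dtype. This is a sketch of that option on a hypothetical toy series, not a step taken in this notebook:

```python
import pandas as pd

levels = {1: "Low", 2: "Medium", 3: "High", 4: "Very High"}

# Hypothetical toy scores; the missing answer stays missing after mapping.
scores = pd.Series([3.0, 1.0, None, 4.0])

# The category order follows the order of the labels in the mapping.
ordered_dtype = pd.CategoricalDtype(categories=list(levels.values()), ordered=True)
labels = scores.map(levels).astype(ordered_dtype)

# Ordered categories support comparisons; missing values compare as False.
above_low = int((labels > "Low").sum())

print(labels.tolist(), above_low)
```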


---
---

## Save data

Having completed an initial exploration, and even a first conversion of the data, we save the result and continue our EDA in the notebook ```2_EDA.ipynb```.

In [79]:
df.to_pickle('../../datos/tratados/df_employee_data.pkl')
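Pickle keeps the exact dtypes of the DataFrame, which a CSV round trip does not. The sketch below illustrates this with a hypothetical toy frame written to a temporary directory; the file names are placeholders. Pickle files should only be loaded from trusted sources.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame with a boolean and a categorical column.
df_toy = pd.DataFrame(
    {"Attrition": [True, False], "Education": pd.Categorical(["Master", "College"])}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "df_toy.pkl")
    csv_path = os.path.join(tmp_dir, "df_toy.csv")

    df_toy.to_pickle(pickle_path)
    df_toy.to_csv(csv_path, index=False)

    from_pickle = pd.read_pickle(pickle_path)
    from_csv = pd.read_csv(csv_path)

# The pickle round trip preserves dtypes; the CSV round trip loses the category dtype.
print(from_pickle.dtypes.tolist())
print(from_csv.dtypes.tolist())
```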