# Initial Data Exploration

## In this notebook we look at what data we have, its types, its null values, whether anything needs to be unified, etc.

---
---

## Importing libraries and configuration

In [49]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.append('../')

from config import *

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


---
---

## Loading the data

In [50]:
df_emplo_survey_data = pd.read_csv('../../datos/employee_survey_data.csv').reset_index(drop=True)
df_emplo_survey_data.sample()

Unnamed: 0,EmployeeID,EnvironmentSatisfaction,JobSatisfaction,WorkLifeBalance
1144,1145,4.0,4.0,3.0


In [51]:
df_general_data = pd.read_csv('../../datos/general_data.csv').reset_index(drop=True)
df_general_data.sample()

Unnamed: 0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeID,Gender,JobLevel,JobRole,MaritalStatus,MonthlyIncome,NumCompaniesWorked,Over18,PercentSalaryHike,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsSinceLastPromotion,YearsWithCurrManager
4164,31,No,Travel_Rarely,Research & Development,2,2,Medical,1,4165,Male,2,Laboratory Technician,Divorced,39750,1.0,Y,15,8,1,5.0,3,5,4,3


In [52]:
df_mana_survey_data = pd.read_csv('../../datos/manager_survey_data.csv').reset_index(drop=True)
df_mana_survey_data.sample()

Unnamed: 0,EmployeeID,JobInvolvement,PerformanceRating
1497,1498,3,3


---
---

## Checking the length of the DataFrames

#### (we check that the 3 dfs have the same number of rows)

In [53]:
len_dfs = [df_general_data.shape[0], df_emplo_survey_data.shape[0], df_mana_survey_data.shape[0]]
set(len_dfs)

{4410}

#### As we can see, all three have 4410 rows. Converting the list to a set removes repeated values, so if the lengths had differed we would have seen more than one value.
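
Equal row counts do not by themselves guarantee that the three tables describe the same employees. The sketch below uses small made-up frames (the `toy_*` names and their values are hypothetical, not taken from the dataset) to show how the sets of `EmployeeID` values could be compared in addition to the lengths:

```python
import pandas as pd

# Hypothetical toy frames standing in for the three CSV files.
toy_general = pd.DataFrame({"EmployeeID": [1, 2, 3], "Age": [31, 45, 28]})
toy_employee = pd.DataFrame({"EmployeeID": [1, 2, 3], "JobSatisfaction": [4.0, 3.0, 1.0]})
toy_manager = pd.DataFrame({"EmployeeID": [1, 2, 3], "PerformanceRating": [3, 4, 3]})

frames = [toy_general, toy_employee, toy_manager]

# Same idea as in the notebook: a set collapses repeated lengths.
row_counts = {frame.shape[0] for frame in frames}

# Extra check: every frame should contain exactly the same set of IDs.
id_sets = [set(frame["EmployeeID"]) for frame in frames]
same_ids = all(ids == id_sets[0] for ids in id_sets)

print(row_counts, same_ids)
```

If `same_ids` came out `False` on the real data, the merge on `EmployeeID` would be the place to look for lost rows.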

---
---

## Merging the DataFrames

#### Now that we have seen that the row counts match, we join all the data into a single dataframe on ```EmployeeID``` [we have also made sure that EmployeeID has the same name in all 3 dfs]

#### We rename the columns of ```df_mana_survey_data``` so that we later know they reflect the manager's opinion of that employee.

In [54]:
df_mana_survey_data.columns = ['EmployeeID', 'Manager_opinion_JobInvolvement', 'Manager_opinion_PerformanceRating']
df_mana_survey_data.sample()

Unnamed: 0,EmployeeID,Manager_opinion_JobInvolvement,Manager_opinion_PerformanceRating
2498,2499,3,3


In [55]:
df = df_emplo_survey_data.merge(df_general_data, on='EmployeeID').merge(df_mana_survey_data, on='EmployeeID')
df.sample()

Unnamed: 0,EmployeeID,EnvironmentSatisfaction,JobSatisfaction,WorkLifeBalance,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EmployeeCount,Gender,JobLevel,JobRole,MaritalStatus,MonthlyIncome,NumCompaniesWorked,Over18,PercentSalaryHike,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsSinceLastPromotion,YearsWithCurrManager,Manager_opinion_JobInvolvement,Manager_opinion_PerformanceRating
4391,4392,4.0,3.0,1.0,32,Yes,Travel_Rarely,Sales,23,1,Life Sciences,1,Male,3,Healthcare Representative,Single,24680,0.0,Y,11,8,0,4.0,2,3,1,2,3,3
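
`DataFrame.merge` performs an inner join by default, so rows whose `EmployeeID` is missing on one side would be dropped silently. As a hedged sketch (the `left` and `right` toy frames are hypothetical), pandas can be asked to verify the key relationship and report where each row came from:

```python
import pandas as pd

# Hypothetical toy frames; the notebook merges the three CSV-backed frames instead.
left = pd.DataFrame({"EmployeeID": [1, 2, 3], "JobSatisfaction": [4.0, 3.0, 1.0]})
right = pd.DataFrame({"EmployeeID": [1, 2, 4], "JobInvolvement": [3, 2, 3]})

# validate="one_to_one" raises pandas.errors.MergeError if EmployeeID repeats on either side.
# how="outer" with indicator=True adds a "_merge" column telling where each row was found.
merged = left.merge(
    right, on="EmployeeID", how="outer", validate="one_to_one", indicator=True
)

# Rows that did not find a partner in the other frame.
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["EmployeeID", "_merge"]])
```

An empty `unmatched` frame would mean that the inner join used in the notebook loses nothing.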


#### We check whether there are duplicates

In [56]:
df.duplicated().sum()

0
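
`df.duplicated()` only flags rows that match in every column. Since `EmployeeID` is still part of the frame at this point, a complementary check could look for repeated IDs alone. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame: EmployeeID 2 appears twice with different ages.
toy = pd.DataFrame({
    "EmployeeID": [1, 2, 2, 3],
    "Age": [31, 45, 46, 28],
})

# Full-row duplicates: every column must match.
full_row_duplicates = int(toy.duplicated().sum())

# Key duplicates: only EmployeeID has to repeat.
id_duplicates = int(toy.duplicated(subset=["EmployeeID"]).sum())

print(full_row_duplicates, id_duplicates)
```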

---
---

## Replacing binary values (Yes/No, N/Y).

#### We are going to replace the values (Yes/No, N/Y) with the booleans True and False.

#### Variables to modify:
- Attrition
- Over18

In [57]:
df['Attrition'].unique()

array(['No', 'Yes'], dtype=object)

In [58]:
df['Over18'].unique()

array(['Y'], dtype=object)

In [59]:
df['Over18'].value_counts()

Over18
Y    4410
Name: count, dtype: int64

#### We have seen that ```Over18``` is 100% 'Y' [which makes sense], so we drop it from our df

In [60]:
df.drop(columns=['Over18'], inplace=True)

df['Attrition'] = df['Attrition'].map({'No': False, 'Yes': True})
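
A column with a single value, like `Over18`, carries no information. In the samples displayed in this notebook, `EmployeeCount` and `StandardHours` also show only one value, although that has not been confirmed for the full data. A sketch, on a hypothetical toy frame, of how such columns could be detected in one pass:

```python
import pandas as pd

# Hypothetical toy frame with two constant columns and one varying column.
toy = pd.DataFrame({
    "Over18": ["Y", "Y", "Y"],
    "EmployeeCount": [1, 1, 1],
    "Age": [31, 45, 28],
})

# nunique() counts the distinct non-null values in each column.
unique_counts = toy.nunique()
constant_columns = unique_counts[unique_counts <= 1].index.tolist()

# Drop every column that never changes.
trimmed = toy.drop(columns=constant_columns)
print(constant_columns)
```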

In [62]:
df.sample(3)

Unnamed: 0,EmployeeID,EnvironmentSatisfaction,JobSatisfaction,WorkLifeBalance,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EmployeeCount,Gender,JobLevel,JobRole,MaritalStatus,MonthlyIncome,NumCompaniesWorked,PercentSalaryHike,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsSinceLastPromotion,YearsWithCurrManager,Manager_opinion_JobInvolvement,Manager_opinion_PerformanceRating
561,562,1.0,2.0,1.0,28,False,Travel_Rarely,Sales,2,4,Marketing,1,Male,1,Sales Executive,Married,93550,0.0,11,8,0,6.0,4,5,0,4,3,3
1081,1082,4.0,2.0,3.0,22,True,Travel_Rarely,Research & Development,16,2,Life Sciences,1,Female,2,Laboratory Technician,Single,35790,1.0,19,8,3,1.0,3,1,0,0,3,3
1664,1665,4.0,2.0,4.0,45,False,Travel_Rarely,Research & Development,2,1,Medical,1,Female,4,Sales Executive,Married,44040,0.0,23,8,0,9.0,2,8,3,7,3,4
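
`Series.map` with a dictionary turns any value that is not a key of the dictionary into `NaN`. Here `unique()` showed only `'No'` and `'Yes'`, so nothing is lost, but on other data a guard might be useful. A minimal sketch with a hypothetical toy series that contains a deliberately stray label:

```python
import pandas as pd

# Hypothetical toy series; "yes" (lowercase) is a deliberate stray label.
toy_attrition = pd.Series(["No", "Yes", "yes", "No"])
mapping = {"No": False, "Yes": True}

mapped = toy_attrition.map(mapping)

# Labels that were not in the mapping ended up as NaN.
unmapped_labels = toy_attrition[mapped.isna()].unique().tolist()
print(unmapped_labels)
```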


---
---

## Converting data types

#### We check the data type of each variable and change it where necessary.

In [63]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4410 entries, 0 to 4409
Data columns (total 28 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   EmployeeID                         4410 non-null   int64  
 1   EnvironmentSatisfaction            4385 non-null   float64
 2   JobSatisfaction                    4390 non-null   float64
 3   WorkLifeBalance                    4372 non-null   float64
 4   Age                                4410 non-null   int64  
 5   Attrition                          4410 non-null   bool   
 6   BusinessTravel                     4410 non-null   object 
 7   Department                         4410 non-null   object 
 8   DistanceFromHome                   4410 non-null   int64  
 9   Education                          4410 non-null   int64  
 10  EducationField                     4410 non-null   object 
 11  EmployeeCount                      4410 non-null   int64
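
The `Non-Null Count` column shows that `EnvironmentSatisfaction`, `JobSatisfaction` and `WorkLifeBalance` have 25, 20 and 38 missing values respectively (4410 minus the count shown). A sketch of how the nulls could be counted directly, on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing survey answers.
toy = pd.DataFrame({
    "EnvironmentSatisfaction": [4.0, np.nan, 3.0, 1.0],
    "JobSatisfaction": [4.0, 3.0, np.nan, np.nan],
    "Age": [31, 45, 28, 52],
})

# Absolute number and percentage of missing values per column.
null_counts = toy.isna().sum()
null_share = toy.isna().mean() * 100

summary = pd.DataFrame({"nulls": null_counts, "percent": null_share})

# Keep only the columns that actually have missing values.
columns_with_nulls = summary[summary["nulls"] > 0]
print(columns_with_nulls)
```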

#### Changes we are going to make:

#### Almost all the data already has the correct type. In addition, after changing the values of ```Attrition```, the column has taken on the boolean type. All that remains is:

- Change the values of ```Education```: it comes as a number, so we assign its original label; once we have the categorical values we can see whether they have an order or not.

In [65]:
df['Education'] = df['Education'].map({
                                        1: 'Below College',
                                        2: 'College',
                                        3: 'Bachelor',
                                        4: 'Master',
                                        5: 'Doctor'
                                    })

df['Education'].unique()

array(['College', 'Below College', 'Master', 'Doctor', 'Bachelor'],
      dtype=object)
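
After the mapping, `Education` is a plain `object` column, so pandas no longer knows that the labels are ranked. If the order implied by the original codes 1 to 5 is to be kept (an assumption based on the mapping above), an ordered categorical is one option. A minimal sketch on a hypothetical toy series:

```python
import pandas as pd

# Order assumed from the numeric codes 1..5 used in the mapping.
education_levels = ["Below College", "College", "Bachelor", "Master", "Doctor"]
education_dtype = pd.CategoricalDtype(categories=education_levels, ordered=True)

# Hypothetical toy series with the mapped labels.
toy_education = pd.Series(["College", "Below College", "Master", "Doctor", "Bachelor"])
ordered = toy_education.astype(education_dtype)

# With an ordered dtype, sorting and comparisons follow the level order.
sorted_labels = ordered.sort_values().tolist()
at_least_bachelor = (ordered >= "Bachelor").tolist()
print(sorted_labels)
print(at_least_bachelor)
```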

In [66]:
df.sample(3)

Unnamed: 0,EmployeeID,EnvironmentSatisfaction,JobSatisfaction,WorkLifeBalance,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EmployeeCount,Gender,JobLevel,JobRole,MaritalStatus,MonthlyIncome,NumCompaniesWorked,PercentSalaryHike,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsSinceLastPromotion,YearsWithCurrManager,Manager_opinion_JobInvolvement,Manager_opinion_PerformanceRating
630,631,3.0,4.0,2.0,35,False,Non-Travel,Research & Development,1,Below College,Life Sciences,1,Male,2,Sales Executive,Married,136100,4.0,22,8,2,16.0,4,13,4,12,2,4
2604,2605,3.0,4.0,3.0,35,False,Travel_Rarely,Research & Development,5,Bachelor,Life Sciences,1,Male,2,Sales Representative,Married,48830,1.0,18,8,1,10.0,1,10,0,9,1,3
3425,3426,3.0,4.0,3.0,59,False,Travel_Rarely,Research & Development,16,Master,Medical,1,Female,5,Laboratory Technician,Single,186650,3.0,12,8,0,30.0,3,5,4,3,2,3


- Drop the ```EmployeeID``` column, since it no longer

---
---

## Saving the data

#### Once an exploration has been carried out