# Work Absenteeism
***+ Tableau Dashboard***

Mitko Stoychev Dimitrov

17/06/2024 

This notebook is one of my first Data Science projects. Because it is part of my **portfolio** and is meant to simulate a real work environment, the project is approached from a business perspective.

In the business environment, Data Science projects usually have **3 components** or phases:
1. Business Analytics
2. Machine Learning
3. Productionization

# 1. Business Analytics
The **goal** of this phase is to find the **insights** in the data, i.e. to separate the signal from the noise.

The Business Analytics stage is composed of 3 main phases: 
1. Data quality analysis → In a business environment there are always errors in the data; this step applies data science techniques to detect and correct them.
2. Exploratory data analysis (EDA) of the variables → first categorical data, then numeric data
3. Analysis and generation of insights → seed questions
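
As a minimal sketch of the EDA split in step 2, the snippet below separates categorical from numeric columns and summarizes each group. The `toy` DataFrame and its values are illustrative assumptions, not the real dataset; only the column names are borrowed from it.

```python
import pandas as pd

# Toy data standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    "departamento": ["Sales", "Research & Development", "Sales"],
    "abandono": ["Yes", "No", "No"],
    "edad": [41, 49, 37],
    "distancia_casa": [1, 8, 2],
})

# Categorical variables: every non-numeric column, summarized by frequency
categorical = toy.select_dtypes(exclude="number")
for col in categorical.columns:
    print(categorical[col].value_counts(dropna=False))

# Numeric variables: summarized with descriptive statistics
numeric = toy.select_dtypes(include="number")
print(numeric.describe().T)
```

Frequency counts suit categorical variables, while `describe()` gives the count, mean, spread and quartiles that are usually the first look at numeric ones.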

## 1. Load the libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## 2. Load the data

In [2]:
df = pd.read_csv("data/AbandonoEmpleados.csv", sep=";", index_col="id", na_values="#N/D")
df

id,edad,abandono,viajes,departamento,distancia_casa,educacion,carrera,empleados,satisfaccion_entorno,sexo,...,satisfaccion_companeros,horas_quincena,nivel_acciones,anos_experiencia,num_formaciones_ult_ano,conciliacion,anos_compania,anos_en_puesto,anos_desde_ult_promocion,anos_con_manager_actual
1,41,Yes,Travel_Rarely,Sales,1,Universitaria,Life Sciences,1,Media,3.0,...,Baja,80,0,8,0,,6,,0,5
2,49,No,Travel_Frequently,Research & Development,8,Secundaria,Life Sciences,1,Alta,2.0,...,Muy_Alta,80,1,10,3,,10,,1,7
4,37,Yes,Travel_Rarely,Research & Development,2,Secundaria,Other,1,Muy_Alta,2.0,...,Media,80,0,7,3,,0,2.0,0,0
5,33,No,Travel_Frequently,Research & Development,3,Universitaria,Life Sciences,1,Muy_Alta,3.0,...,Alta,80,0,8,3,,8,3.0,3,0
7,27,No,Travel_Rarely,Research & Development,2,Universitaria,Medical,1,Baja,3.0,...,Muy_Alta,80,1,6,3,,2,,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2061,36,No,Travel_Frequently,Research & Development,23,Master,Medical,1,Alta,4.0,...,Alta,80,1,17,3,,5,4.0,0,3
2062,39,No,Travel_Rarely,Research & Development,6,Secundaria,Medical,1,Muy_Alta,2.0,...,Baja,80,1,9,5,,7,,1,7
2064,27,No,Travel_Rarely,Research & Development,4,Master,Life Sciences,1,Media,4.0,...,Media,80,1,6,0,,6,,0,3
2065,49,No,Travel_Frequently,Sales,2,Secundaria,Medical,1,Muy_Alta,,...,Muy_Alta,80,0,17,3,,9,,0,8
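
To illustrate what the `read_csv` arguments used above do, here is a small sketch on an in-memory sample. The `raw` string is a hypothetical three-row excerpt written for this example, not the contents of the real file.

```python
import io
import pandas as pd

# Hypothetical sample in the same layout as the real file:
# semicolon-separated, an "id" column, and "#N/D" marking missing values
raw = "id;edad;educacion\n1;41;Universitaria\n2;49;#N/D\n4;37;Secundaria\n"

sample = pd.read_csv(
    io.StringIO(raw),
    sep=";",           # fields are separated by semicolons, not commas
    index_col="id",    # use the "id" column as the row index
    na_values="#N/D",  # also treat this marker as NaN
)

print(sample)
print(sample["educacion"].isna().sum())
```

Without `na_values="#N/D"` the marker would be read as an ordinary string, so the affected cells would not be counted as missing in later checks.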


## 3. Data quality


In the data quality analysis phase, ensuring the integrity and usability of the data is crucial. These are the typical steps for checking data quality:

1. Check for missing values (null data)
   - Identify missing values: use functions like `isnull()` or `isna()` in pandas.
   - Quantify missing values: determine the percentage of missing values in each column to understand the extent of the issue.
   - Handle missing values: decide on a strategy, such as:
     - Remove missing data: drop rows or columns with missing values.
     - Impute missing data: fill missing values with a specific value (mean, median, mode, etc.) or use advanced imputation techniques.
2. Check for duplicate data
   - Identify duplicates: use functions like `duplicated()` to find duplicate rows.
   - Remove duplicates: remove them with `drop_duplicates()`.
3. Check for inconsistent data
   - Consistency checks: ensure that categorical values are consistent. For example, check for typos or variations in category names (e.g., "USA" vs "U.S.A" vs "United States").
   - Standardization: standardize categorical variables to a consistent format.
4. Check for outliers
   - Identify outliers: use visualizations (box plots, scatter plots) and statistical methods (Z-score, IQR) to detect outliers.
   - Handle outliers: decide whether to remove, transform, or keep outliers based on their impact on the analysis.
5. Validate data types
   - Ensure correct data types: check that each column has the correct data type (e.g., numeric, categorical, datetime).
   - Convert data types: convert columns to the appropriate data type if necessary.
6. Check for data range and validity
   - Range checks: ensure that numeric data falls within expected ranges.
   - Validity checks: verify that data entries make sense (e.g., no negative ages, dates within a plausible range).
7. Handle irrelevant data
   - Identify irrelevant columns: identify and remove columns that are not necessary for the analysis.
8. Cross-validation and consistency checks
   - Cross-validation: cross-validate data between different sources or different parts of the dataset to ensure consistency.
   - Logical consistency: ensure that related columns make logical sense together (e.g., a start date should be before an end date).
9. Check for data completeness
   - Completeness check: ensure that all required data points are present and complete for the analysis.
10. Address data leakage
    - Leakage check: ensure that there is no data leakage, where future data is used inappropriately to predict past events.
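
A few of the checks above (missing values, duplicates, IQR outliers) can be sketched as follows. The helpers `missing_report` and `iqr_outliers` are hypothetical names introduced for this example, and the `toy` data is illustrative, not taken from the real dataset.

```python
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    n_missing = frame.isna().sum()
    report = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (n_missing / len(frame) * 100).round(2),
    })
    return report.sort_values("pct_missing", ascending=False)


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy data (illustrative values only): one missing value,
# one duplicated row and one implausible age
toy = pd.DataFrame({
    "edad": [41, 49, 37, 33, 27, 27, 150],
    "educacion": ["Universitaria", None, "Secundaria",
                  "Universitaria", "Master", "Master", "Master"],
})

print(missing_report(toy))                         # step 1: quantify nulls
print("duplicate rows:", toy.duplicated().sum())   # step 2: count duplicates
print(toy.loc[iqr_outliers(toy["edad"]), "edad"])  # step 4: flag outliers
```

The multiplier `k = 1.5` is the conventional box-plot threshold; whether a flagged value is an error or a genuine extreme still has to be judged against the business context.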

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1470 entries, 1 to 2068
Data columns (total 31 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   edad                      1470 non-null   int64  
 1   abandono                  1470 non-null   object 
 2   viajes                    1470 non-null   object 
 3   departamento              1470 non-null   object 
 4   distancia_casa            1470 non-null   int64  
 5   educacion                 1369 non-null   object 
 6   carrera                   1470 non-null   object 
 7   empleados                 1470 non-null   int64  
 8   satisfaccion_entorno      1470 non-null   object 
 9   sexo                      1271 non-null   float64
 10  implicacion               1452 non-null   object 
 11  nivel_laboral             1470 non-null   int64  
 12  puesto                    1470 non-null   object 
 13  satisfaccion_trabajo      1394 non-null   object 
 14  estado_civil 

### 3.1. Null Data