# Exploratory data analysis

The human resources department of a multinational company has stored the data on last year's internal promotions. With this data, the company wants to find out whether there are identifiable patterns in who gets promoted and who does not. The company also wants to know whether it can take any measures in the future to help improve its employees' career paths.

To that end, the company asks you to:

* Carry out an exploratory data analysis, detailing the most relevant aspects you find.
* Build a classification model that predicts the probability that an employee is promoted, based on the historical data available.
* Develop a dashboard with Dash that summarises the most relevant findings from the exploratory analysis and can advise an employee on the actions they can take to increase their probability of promotion.

What recommendations would you give the human resources department based on the data?

## Data description
The variables in the data are the following:

* employee_id: Employee identifier
* department: Employee's department
* region: Employee's region
* education: Level of education
* gender: Employee's gender
* recruitment_channel: Channel through which the employee was hired
* no_of_trainings: Number of trainings the employee completed in the last year
* age: Employee's age
* previous_year_rating: Rating obtained in the evaluation for previous years
* length_of_service: Years of service
* awards_won: Whether the employee won any award during the last year
* avg_training_score: Average score across the trainings completed
* is_promoted: 1 if the employee was promoted, 0 otherwise.

# Index

## 1. Data loading and first approach

* **1.1** Load data
* **1.2** Preliminary null value exploration
* **1.3** Preliminary column exploration
    

## 2. Null value treatment

* **2.1** Treating "Training Score" nulls (`avg_training_score`)
* **2.2** Treating "Rating" nulls (`previous_year_rating`)
* **2.3** Treating "Education" nulls (`education`)



In [2]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go

In [4]:
df = pd.read_csv('trabajo1.csv')
df.head(3)

| Unnamed: 0 | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | awards_won | avg_training_score | is_promoted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65438 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 0 | 49.0 | 0 |
| 1 | 65141 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 60.0 | 0 |
| 2 | 7513 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 50.0 | 0 |
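The index lists a preliminary null value exploration for `avg_training_score`, `previous_year_rating` and `education`. A minimal sketch of such a check is shown here, assuming a small made-up frame (`toy`) instead of the real `trabajo1.csv`; the values and the name `null_summary` are hypothetical and only illustrate the calls.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real file; the values are invented for illustration.
toy = pd.DataFrame({
    "education": ["Bachelor's", None, "Master's & above", "Bachelor's"],
    "previous_year_rating": [5.0, np.nan, 3.0, np.nan],
    "avg_training_score": [49.0, 60.0, np.nan, 50.0],
    "is_promoted": [0, 0, 1, 0],
})

# Count and percentage of missing values per column.
null_summary = pd.DataFrame({
    "n_null": toy.isna().sum(),
    "pct_null": (toy.isna().mean() * 100).round(1),
})
print(null_summary)
```

Running the same two aggregations on `df` would show which columns need the treatment planned in section 2.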


# Department & Gender

First, we build a table that aggregates some of the data to make it easier to plot.

In [79]:
df_department = pd.DataFrame()

df_department["department"] = df['department'].unique()
df_department["total_people"] = [df[df['department']==department]['department'].count() for department in  df_department["department"]]
# Sort so that departments appear from largest to smallest headcount
df_department.sort_values(by = "total_people", inplace = True, ascending = False)

# Gender distribution
df_department["male_staff"]   = [df[(df['department']==department) & (df['gender']=="m")]['department'].count() for department in  df_department["department"]]
df_department["female_staff"] = [df[(df['department']==department) & (df['gender']=="f")]['department'].count() for department in  df_department["department"]]

# Promotions (each count is filtered by department, including the total)
df_department["total_promotions"] = [df[(df['department']==department) & (df['is_promoted']==1)]['department'].count() for department in  df_department["department"]]
df_department["promotions_male_staff"]   = [df[(df['department']==department) & (df['gender']=="m") & (df['is_promoted']==1) ]['department'].count() for department in  df_department["department"]]
df_department["promotions_female_staff"] = [df[(df['department']==department) & (df['gender']=="f") & (df['is_promoted']==1) ]['department'].count() for department in  df_department["department"]]
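# The list comprehensions in this cell filter `df` once per department and per column.
# As an alternative, the same table can be sketched with `pd.crosstab`, which
# aggregates in a single pass. This is only a sketch on made-up data: `toy`,
# `staff`, `promos` and `summary` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "department": ["Sales", "Sales", "Sales", "Ops", "Ops"],
    "gender": ["m", "f", "m", "f", "m"],
    "is_promoted": [1, 0, 0, 1, 0],
})

# Headcount per department and gender.
staff = pd.crosstab(toy["department"], toy["gender"])
# Promotions per department and gender (sum of the 0/1 flag).
promos = pd.crosstab(
    toy["department"], toy["gender"],
    values=toy["is_promoted"], aggfunc="sum",
).fillna(0)

summary = (
    pd.DataFrame({
        "total_people": staff.sum(axis=1),
        "male_staff": staff["m"],
        "female_staff": staff["f"],
        "total_promotions": promos.sum(axis=1),
        "promotions_male_staff": promos["m"],
        "promotions_female_staff": promos["f"],
    })
    .sort_values("total_people", ascending=False)
    .rename_axis("department")
    .reset_index()
)
print(summary)
```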



In [75]:
data = [go.Bar(x=df_department['department'] , y=df_department['total_people'], name= "Departments")]
layout = go.Layout(title = "Distribution of workers per department", xaxis_title = "Department", yaxis_title = "Number of workers")
fig = go.Figure(data = data, layout = layout)
fig.show()


In [76]:
trace0 = go.Bar(x=df_department['department'] , y=df_department['male_staff'], name= "Male Staff")
trace1 = go.Bar(x=df_department['department'] , y=df_department['female_staff'], name= "Female Staff")

data = [trace0, trace1]
layout = go.Layout(title = "Distribution of workers per department and gender", xaxis_title = "Department", yaxis_title = "Number of workers")
fig = go.Figure(data = data, layout = layout)
fig.show()

In [77]:
trace0 = go.Bar(x=df_department['department'] , y=df_department['male_staff']/df_department['total_people'], name= "Male Staff")
trace1 = go.Bar(x=df_department['department'] , y=df_department['female_staff']/df_department['total_people'], name= "Female Staff")

data = [trace0, trace1]
layout = go.Layout(title = "Share of workers per department and gender", xaxis_title = "Department", yaxis_title = "Percentage of workers", yaxis_tickformat = ".0%")
fig = go.Figure(data = data, layout = layout)
fig.show()

In [82]:
trace0 = go.Bar(x=df_department['department'] , y=df_department['promotions_male_staff']/df_department['male_staff'], name= "Male Staff")
trace1 = go.Bar(x=df_department['department'] , y=df_department['promotions_female_staff']/df_department['female_staff'], name= "Female Staff")

data = [trace0, trace1]
layout = go.Layout(title = "Percentage of workers promoted by department and gender", xaxis_title = "Department", yaxis_title = "Percentage of workers promoted", yaxis_tickformat = ".1%")
fig = go.Figure(data = data, layout = layout)
fig.show()

In [85]:
trace0 = go.Bar(x=['Male', 'Female'] , y=[df_department['promotions_male_staff'].sum()/df_department['male_staff'].sum(),df_department['promotions_female_staff'].sum()/df_department['female_staff'].sum() ], name= "Staff")


data = [trace0]
layout = go.Layout(title = "Total percentage of workers promoted by gender", xaxis_title = "Gender", yaxis_title = "Percentage of workers promoted", yaxis_tickformat = ".1%")
fig = go.Figure(data = data, layout = layout)
fig.show()

# Promotions per categorical variable

We will make use of the following helper functions, which read from the global `df`:

In [171]:
def plot_promotions_by_cat_variable(colname, title, x_title, y_title):
    """Plot promoted vs. not promoted counts for each category of `colname`."""
    trace0 = go.Bar(x = df[colname].unique(),
                    y = [df[(df[colname]==item) & (df['is_promoted']==1)][colname].count() for item in df[colname].unique()],
                    name = "Promoted",
                    marker_color = "mediumspringgreen")

    trace1 = go.Bar(x = df[colname].unique(),
                    y = [df[(df[colname]==item) & (df['is_promoted']==0)][colname].count() for item in df[colname].unique()],
                    name = "Not Promoted",
                    marker_color = "salmon")

    data = [trace0, trace1]
    layout = go.Layout(title = title, xaxis_title = x_title, yaxis_title = y_title)
    fig = go.Figure(data = data, layout = layout)
    fig.show()

In [172]:
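def plot_percentage_promotions_by_cat_variable(colname, title, x_title, y_title):
    """Plot the fraction of promoted workers for each category of `colname`."""
    # value_counts() skips NaN, so a column that still contains NaN raises a KeyError below
    totals_per_col = df[colname].value_counts()

    y = [df[(df[colname]==item) & (df['is_promoted']==1)][colname].count()/totals_per_col[item] for item in df[colname].unique()]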
    trace0 = go.Bar(x = df[colname].unique(),
                    y = y,
                    name = "Promoted",
                    marker_color = "mediumspringgreen",
                    text= ["{0}%".format(round(value*100,1)) for value in y],
                    textposition="auto",
                    textangle=0,
                    textfont_size = 20,
                    textfont_color= "black",
                   )

    data = [trace0]
    layout = go.Layout(title = title, xaxis_title = x_title, yaxis_title = y_title)
    fig = go.Figure(data = data, layout = layout)

    fig.show()
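Both helpers mix aggregation and plotting, and they recompute a boolean mask for every category. A minimal sketch of how the aggregation could be separated out is given here, assuming a hypothetical helper `promotion_table` and made-up data; it is an illustration, not the notebook's method.

```python
import pandas as pd

def promotion_table(data, colname):
    """Return promoted / not promoted counts and the promotion rate per category."""
    counts = pd.crosstab(data[colname], data["is_promoted"])
    # Make sure both outcome columns exist even if one never occurs.
    counts = counts.reindex(columns=[0, 1], fill_value=0)
    counts.columns = ["not_promoted", "promoted"]
    total = counts["not_promoted"] + counts["promoted"]
    counts["promotion_rate"] = counts["promoted"] / total
    return counts

# Toy stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "recruitment_channel": ["sourcing"] * 4 + ["other"] * 2 + ["referred"],
    "is_promoted": [1, 0, 0, 0, 1, 0, 1],
})

table = promotion_table(toy, "recruitment_channel")
print(table)
```

The resulting columns can be passed straight to `go.Bar` as `y` values, with `table.index` as `x`.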

# Department

In [174]:
plot_promotions_by_cat_variable("department", "Promotions by department", "Department", "Number of workers")
plot_percentage_promotions_by_cat_variable("department", "% of promotions per department","Department"," % of workers")

# Region

In [175]:
plot_promotions_by_cat_variable("region", "Promotions by region", "Region", "Number of workers")
plot_percentage_promotions_by_cat_variable("region", "% of promotions per region","Region"," % of workers")

# Recruitment Channel

In [176]:
plot_promotions_by_cat_variable("recruitment_channel", "Promotions by recruitment channel", "Channel", "Number of workers")
plot_percentage_promotions_by_cat_variable("recruitment_channel", "% of promotions per recruitment channel","Channel"," % of workers")

# Education

In [177]:
plot_promotions_by_cat_variable("education", "Promotions by education levels", "Education", "Number of workers")

In [178]:
# Keep a copy of the DataFrame in an auxiliary variable
df_aux = df.copy()
# Fill the missing values so that they show up as their own category
df['education'] = df['education'].fillna("NA")
# Plot
plot_percentage_promotions_by_cat_variable("education", "% of promotions per education level","Education level"," % of workers")
# Restore the original DataFrame
df = df_aux.copy()
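# The copy, fill and restore steps in this cell exist only so that missing education
# levels get their own bar. A sketch of an alternative that never modifies the
# original frame: group by a filled copy of the column. The data is made up and
# the names `toy` and `rates` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "education": ["Bachelor's", np.nan, "Bachelor's", np.nan, "Master's & above"],
    "is_promoted": [1, 0, 0, 1, 1],
})

# fillna returns a new Series, so `toy` keeps its missing values.
# The mean of a 0/1 flag is the fraction of promoted workers.
rates = toy.groupby(toy["education"].fillna("NA"))["is_promoted"].mean()
print(rates)
```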