# Exploratory data analysis - Answering Questions

The human resources department of a multinational company has stored the data on last year's internal promotions. With this data, the company wants to find out whether there are specific patterns that determine whether an employee is promoted. The company also wants to know whether it can take any measures in the future to help improve its employees' career paths.

To that end, the company asks you to:

* Carry out an exploratory analysis of the data, detailing the most relevant aspects you find.
* Build a classification model that predicts the probability that an employee will be promoted, based on the historical data available.
* Develop a dashboard with Dash that summarizes the most relevant findings from the exploratory analysis and can advise an employee on the actions they can take to increase their probability of promotion.

What recommendations would you give the human resources department based on the data?

## Data information:
The variables in the data are the following:

* employee_id: Employee identifier
* department: Employee's department
* region: Employee's region
* education: Education level
* gender: Employee's gender
* recruitment_channel: Channel through which the employee was hired
* no_of_trainings: Number of trainings the employee completed in the last year
* age: Employee's age
* previous_year_rating: Rating obtained in the previous year's evaluation
* length_of_service: Years of service
* awards_won: Whether the employee won any award during the last year
* avg_training_score: Average score across the trainings completed
* is_promoted: 1 if the employee was promoted and 0 otherwise.

In [32]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go

In [33]:
df = pd.read_csv('../data/trabajo1.csv')
df.head(3)

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,awards_won,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,0,49.0,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,60.0,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,50.0,0


# How is each department constituted?

In [34]:
df_department = pd.read_pickle('../data/datos_departamentos.pkl')
df_department.head(3)

Unnamed: 0,department,total_people,male_staff,female_staff,total_promotions,promotions_male_staff,promotions_female_staff,percentage_promotions,mean_age,median_age,mean_prev_year_rating,mean_awards_won
0,Sales & Marketing,16840,13686,3154,1213,1037,176,0.072031,34.860629,33.0,3.067937,0.021437
1,Operations,11348,6671,4677,1023,581,442,0.090148,36.073669,35.0,3.632156,0.023088
2,Technology,7138,4350,2788,768,491,277,0.107593,34.86719,33.0,3.158677,0.025918
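The department summary above is loaded ready-made from a pickle, and the notebook does not show how it was built. As a rough sketch only, a table with a few of the same columns could be derived from the employee-level data with a `groupby`. The toy frame `toy` and its values are made up for illustration; the only relationship taken from the output above is that `percentage_promotions` equals `total_promotions / total_people` (for example 1213 / 16840 ≈ 0.072 for Sales & Marketing).

```python
import pandas as pd

# Toy employee-level data standing in for the notebook's `df` (made-up values)
toy = pd.DataFrame({
    "employee_id": [1, 2, 3, 4, 5],
    "department": ["HR", "HR", "Legal", "Legal", "Legal"],
    "previous_year_rating": [3.0, 5.0, 4.0, 2.0, 3.0],
    "is_promoted": [0, 1, 0, 0, 0],
})

# One row per department with head count, promotions and mean rating
dept_summary = toy.groupby("department").agg(
    total_people=("employee_id", "size"),
    total_promotions=("is_promoted", "sum"),
    mean_prev_year_rating=("previous_year_rating", "mean"),
).reset_index()

# Share of each department's staff that was promoted
dept_summary["percentage_promotions"] = (
    dept_summary["total_promotions"] / dept_summary["total_people"]
)

print(dept_summary)
```

The remaining columns (staff by gender, mean and median age, mean awards) would presumably follow the same pattern with additional aggregations.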


In [35]:
traces = [go.Scatter(x= df_department[df_department['department']==department]['percentage_promotions'],
                     y = df_department[df_department['department']==department]['mean_prev_year_rating'],
                     mode = 'markers',
                     marker_size = df_department[df_department['department']==department]['total_people']/100, 
                     hovertemplate='<b>{0}</b><br><br>'.format(department)+ 
                                '<b>T. People:</b> {0}<br>'.format(df_department[df_department['department']==department]['total_people'].values[0]) + 
                                '<b>Promotion:</b> {0}%<br>'.format(round(df_department[df_department['department']==department]['percentage_promotions'].values[0]*100,2))+
                                '<b>Y. Ranking:</b> %{y:.1f}<br>',
                     showlegend = False,
                     name = ''
                    ) for department in df_department['department']]


data = traces
layout = go.Layout(title = "Department constitution", xaxis_title = "% of promotions", yaxis_title = "Prev. year rating")

fig = go.Figure(data = data, layout = layout)
fig.show()


# How does each department hire?

In [36]:
df.head(3)

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,awards_won,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,0,49.0,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,60.0,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,50.0,0


In [37]:
departments = list(df['department'].unique())

In [38]:
recruitment_channels = list(df['recruitment_channel'].unique())

In [39]:
print(departments)

['Sales & Marketing', 'Operations', 'Technology', 'Analytics', 'R&D', 'Procurement', 'Finance', 'HR', 'Legal']


We split them into large and small departments so the visualization is clearer.

In [40]:
departments_large = ["Sales & Marketing", "Operations", "Technology", "Procurement"]

In [41]:
departments_small = [dep for dep in departments if dep not in departments_large]

In [42]:
def generateIcicleByDepartment(departments_to_generate, icicle_column, title):
    '''
    Generates an Icicle plot for the specified departments and icicle classes.

            Parameters:
                    departments_to_generate (list): departments that will hang directly from the root of the chart
                    icicle_column (str): a column of the dataframe. Its unique values will be the "leaves" or icicles
                    title (str): Title of the plot

            Returns:
                    None: the figure is displayed with fig.show()
    '''
    
    # Arrays with all the information
    labels = ["Departments"] # Starts with the value for the root node
    parents = [""] # The parent of the root node is empty
    ids = ["Departments"] # The id of the root node is the same as its label
    values = [0] # Initialized to 0. Will change to tot_value at the end
    tot_value = 0 # Accumulates the sum of all department totals
    
    # Generate the list of icicles (leaf nodes that will hang from each department node)
    icicles = list(df[icicle_column].dropna().unique()) # Skip NaN: it never matches == below and would only add an empty leaf

    for department in departments_to_generate:
        parents.append("Departments")
        labels.append(department)
        ids.append(department)


        values.append(df[df['department']==department]['department'].count())
        tot_value = tot_value + df[df['department']==department]['department'].count()

        for icicle in icicles:
            labels.append(icicle)
            parents.append(department)
            ids.append(str(department) + " - " + str(icicle))
            values.append(df[(df['department']==department) & (df[icicle_column]==icicle)][icicle_column].count())

    values[0] = tot_value
    
    fig =go.Figure(go.Icicle(
    ids = ids,
    labels=labels,
    parents=parents,
    values=values,
    branchvalues="total",
    root_color="lightgrey"
    ))

    fig.update_layout(margin = dict(t=50, l=25, r=25, b=25), title = title)

    fig.show()
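The function above counts each (department, class) pair with a separate boolean filter. As a small sketch on made-up data, the same leaf counts can be obtained in one step with `pd.crosstab`, which is also a convenient way to check the condition that `branchvalues="total"` relies on: the children of a node must not sum to more than the node itself. The frame `toy` is hypothetical and only mirrors two column names from the notebook.

```python
import pandas as pd

# Toy data standing in for the notebook's `df` (made-up values)
toy = pd.DataFrame({
    "department": ["HR", "HR", "Legal", "Legal", "Legal"],
    "recruitment_channel": ["other", "sourcing", "other", "other", "referred"],
})

# Leaf values: number of employees per (department, recruitment channel)
leaf_counts = pd.crosstab(toy["department"], toy["recruitment_channel"])

# Department node values: number of employees per department
dept_totals = toy["department"].value_counts()

# With branchvalues="total", each department's leaves must not exceed its total
leaves_fit = all(
    leaf_counts.loc[dep].sum() <= dept_totals[dep] for dep in leaf_counts.index
)

print(leaf_counts)
print(leaves_fit)
```

Rows with a missing class value are left out of the crosstab, so for a column such as `education` the leaves of a department can sum to less than its total, which is still valid for this chart type.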

In [43]:
generateIcicleByDepartment(departments_large, "recruitment_channel","Larger company departments")

In [44]:
generateIcicleByDepartment(departments_small, "recruitment_channel","Smaller company departments")

# What is the education level in each department?

In [45]:
generateIcicleByDepartment(departments_large, "education","Larger company departments")

In [46]:
generateIcicleByDepartment(departments_small, "education","Smaller company departments")

It seems odd that Analytics shows only two education levels.

In [47]:
df[df["department"]=="Analytics"]['education'].unique()

array(["Bachelor's", "Master's & above", nan], dtype=object)

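OK, the chart is consistent with the data: Analytics only contains two education levels, plus missing values.

There are some `nan` values in education.

Where are the `nan` values in education located?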

In [48]:
nans_in_education = df[pd.isnull(df["education"])].copy()

In [49]:
nans_in_education.head(3)

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,awards_won,avg_training_score,is_promoted
10,29934,Technology,region_23,,m,sourcing,1,30,,1,0,77.0,0
21,33332,Operations,region_15,,m,sourcing,1,41,4.0,11,0,57.0,0
32,35465,Sales & Marketing,region_7,,f,sourcing,1,24,1.0,2,0,48.0,0


In [50]:
df['department'].nunique() == nans_in_education["department"].nunique()

True

In [51]:
nans_in_education["recruitment_channel"].nunique() == df["recruitment_channel"].nunique()

True

In [52]:
nans_in_education["region"].nunique() == df["region"].nunique()

False

In [53]:
print(nans_in_education["region"].nunique())
print(df["region"].nunique())

31
34


Three regions have no `nan` values in education: only 31 of the 34 regions contain at least one.
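The counts above say how many regions lack missing education values, but not which ones. A minimal sketch of how to list them with a set difference, using a made-up frame `toy` in place of the notebook's `df`:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the notebook's `df` (made-up values)
toy = pd.DataFrame({
    "region": ["region_1", "region_1", "region_2", "region_3", "region_3"],
    "education": ["Bachelor's", np.nan, "Master's & above", np.nan, "Bachelor's"],
})

# All regions, and the regions with at least one missing education value
regions_all = set(toy["region"].unique())
regions_with_nan = set(toy.loc[toy["education"].isna(), "region"].unique())

# Regions where education is never missing
regions_without_nan = sorted(regions_all - regions_with_nan)
print(regions_without_nan)
```

On the real data, the same three lines applied to `df` should return the three regions in question.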

# Age and Length of Service against promotion?

In [54]:
df.head(3)

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,awards_won,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,0,49.0,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,60.0,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,50.0,0


In [55]:
ages = np.sort(df['age'].unique())
service_lengths = np.sort(df['length_of_service'].unique())

In [56]:
options = [] # All possible combinations
for age in ages:
    for length in service_lengths:
        options.append((age, length))

In [57]:
total_people_per_age_and_experience = [df[(df['age']==option[0]) & (df['length_of_service']==option[1])].shape[0] for option in options]
total_people_per_age_and_experience_promoted = [df[(df['age']==option[0]) & (df['length_of_service']==option[1]) & (df['is_promoted']==1)].shape[0] for option in options]
total_people_per_age_and_experience_not_promoted = [df[(df['age']==option[0]) & (df['length_of_service']==option[1])& (df['is_promoted']==0)].shape[0] for option in options]

In [58]:
df_ages_service_lengths = pd.DataFrame()

df_ages_service_lengths['age'] = [valor[0] for valor in options]
df_ages_service_lengths['length_of_service'] = [valor[1] for valor in options]
df_ages_service_lengths['tot_people'] = total_people_per_age_and_experience
df_ages_service_lengths['promoted'] = total_people_per_age_and_experience_promoted
df_ages_service_lengths['not_promoted'] = total_people_per_age_and_experience_not_promoted
df_ages_service_lengths['per_promoted'] = df_ages_service_lengths['promoted']/df_ages_service_lengths['tot_people']

In [59]:
df_ages_service_lengths

Unnamed: 0,age,length_of_service,tot_people,promoted,not_promoted,per_promoted
0,20,1,55,1,54,0.018182
1,20,2,58,3,55,0.051724
2,20,3,0,0,0,
3,20,4,0,0,0,
4,20,5,0,0,0,
...,...,...,...,...,...,...
1430,60,31,6,0,6,0.000000
1431,60,32,4,1,3,0.250000
1432,60,33,5,0,5,0.000000
1433,60,34,4,1,3,0.250000
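The table above is built by filtering the dataframe once per (age, length of service) combination. As an alternative sketch, a single `groupby` produces the same columns and only for combinations that actually occur, so it corresponds to the table once the empty groups are dropped. The frame `toy` is made up for illustration.

```python
import pandas as pd

# Toy data standing in for the notebook's `df` (made-up values)
toy = pd.DataFrame({
    "age": [20, 20, 20, 30, 30],
    "length_of_service": [1, 1, 2, 5, 5],
    "is_promoted": [0, 1, 0, 1, 1],
})

# Group size and number of promotions per (age, length_of_service) pair
summary = (
    toy.groupby(["age", "length_of_service"])["is_promoted"]
    .agg(tot_people="size", promoted="sum")
    .reset_index()
)

# Derived columns, named as in the notebook's table
summary["not_promoted"] = summary["tot_people"] - summary["promoted"]
summary["per_promoted"] = summary["promoted"] / summary["tot_people"]

print(summary)
```

Because no zero-sized groups are created, `per_promoted` never contains the `NaN` that comes from dividing 0 by 0.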


In [60]:
df_ages_service_lengths_non_empty = df_ages_service_lengths[df_ages_service_lengths["tot_people"]!=0]

In [61]:
df_ages_service_lengths_non_empty.to_pickle('../data/datos_ages_service_lengths.pkl')

In [62]:
trace0 = go.Scatter(x = df_ages_service_lengths_non_empty["age"],
                    y = df_ages_service_lengths_non_empty["length_of_service"],
                    mode = "markers",
                    marker_size = df_ages_service_lengths_non_empty["per_promoted"]*50,
                    showlegend = False,
                    marker = {
                        "color":df_ages_service_lengths_non_empty['tot_people'],
                        "showscale":True,
                        "cmax":100,
                        "cmin":0,
                        "colorbar": {
                              "title" :"Number of workers in group"
                          },
                        "colorscale": "plasma"
                        
                      },
                    # Per-point promotion percentage, exposed to the hover template as %{customdata}
                    customdata = df_ages_service_lengths_non_empty["per_promoted"]*100,
                    # %{x}, %{y} and %{customdata} are filled in per point by plotly
                    hovertemplate='<b>Age:</b> %{x}<br>' +
                                '<b>L. Service:</b> %{y}<br>' +
                                '<b>% Promoted:</b> %{customdata:.1f}%<br>',

                   )

data = [trace0]
layout = go.Layout(title = "Percentage of promotions by seniority", xaxis_title = "Age", yaxis_title = "Length of service")

fig = go.Figure(data = data, layout = layout)
fig.show()

# Training Score and Previous Year rating against promotion?