# Exploratory data analysis

This notebook provides an exploratory data analysis of the dataset. As a reminder, the dataset, `Description des emplois salariés en 2021`, is taken from the `Insee` website: <https://www.insee.fr/fr/statistiques/7651654#dictionnaire>.
We aim to study the effect of gender on wage levels, conditional on several other variables. This code provides a first look at the structure of our data.

In [None]:
import numpy as np
import pandas as pd
import geopandas as gpd
from shapely.geometry import MultiPolygon
from tqdm import tqdm
import gdown
import matplotlib.pyplot as plt
from plotly.offline import init_notebook_mode
init_notebook_mode(connected= True)
import plotly.express as px
import seaborn as sns

`Warning`: This code should be run after the [import and formatting notebook](test_import_données.ipynb).

In [None]:
base = pd.read_csv("INSEE_DATA_TREATED.csv")

In [None]:
base.head()

In [None]:
base.columns

Further explanations of the meaning and type of each variable:

In [None]:
numerical_columns = [
    'DATDEB', 'DATFIN',  # start and end dates of pay, counted from 01/01
    'AGE',  # age in years
    'POND',  # weighting (1/12th)
    'DUREE',  # pay duration in days
    'NBHEUR', 'NBHEUR_TOT',  # number of paid hours, total (difference between the two to be clarified)
    'WAGE',  # transformation of TRNNETO
    'UNEMP'  # transformation of TRALCHT
]
categorical_columns = [
    'A6', 'A17', 'A38',  # activity in aggregated nomenclature
    'CPFD',  # full-time or part-time
    'DEPR', 'DEPT',  # department of residence and of work
    'DOMEMPL', 'DOMEMPL_EM',  # employment domain of the job and of the assigned establishment/employer

    'FILT',  # job indicator: ancillary (2) or non-ancillary (1), based on pay and volume thresholds
    'REGR', 'REGT',  # region of residence and of work
    'SEXE',  # 1 = man, 2 = woman
    'PCS',  # PCS-ESE
    'TYP_EMPLOI',  # ordinary, apprentice, other
    'CONV_COLL',  # collective agreement

    'TRNNETO',  # total net pay, in brackets -> convert to numeric?
    'TRALCHT',  # total unemployment benefits, in brackets -> convert to numeric?
    'TREFF',  # workforce size bracket: from 0 to 250+ positions
    'CONT_TRAV',  # employment contract: APP apprenticeship, TOA occasional or per-task, TTP temporary agency work, AUT other
    'CS',  # socio-professional category, with a simpler code
    'AGE_TR',  # age in four-year brackets
    'DATDEB_TR',
    'DATFIN_TR',  # start and end dates of pay, in brackets
    'DUREE_TR',  # pay duration in days, in monthly brackets
    'DOMEMPL_EM_N', 'DOMEMPL_N', 'REGR_N',
    'REGT_N', 'CS_N', 'DEPR_N', 'DEPT_N', 'A38_N'  # variables renamed with the labels matching their codes
]
all_columns = numerical_columns + categorical_columns

In [None]:
#The dataset is very large :
print(f"Number of rows : {base.shape[0]}, number of columns : {base.shape[1]}")

### Descriptive analysis

Our target variable is the wage of each individual. In the raw dataset, wage is coded as a categorical variable (`TRNNETO`, net pay in brackets).
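
As a rough illustration of how a bracketed wage can be turned into a numeric one, the sketch below maps bracket codes to the lower bound of each range. The codes and bounds in `lower_bounds` are made up for the example and are not the actual `TRNNETO` modalities; the real conversion is done in the import notebook.

```python
import pandas as pd

# Hypothetical bracket codes and lower bounds (in euros); NOT the real TRNNETO modalities.
lower_bounds = {0: 0, 1: 200, 2: 500, 3: 1000}

# Toy data standing in for the bracketed wage column.
toy = pd.DataFrame({"TRNNETO": [0, 2, 3, 1, 2]})

# Replace each bracket code by the lower bound of its range.
toy["WAGE"] = toy["TRNNETO"].map(lower_bounds)

print(toy)
```

Keeping the lower bound gives a conservative numeric wage, at the cost of understating wages inside each bracket.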

In [None]:
#WAGE has been preprocessed. We chose to keep the lower bound of each wage range.

base['WAGE'].sort_values().unique()

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(base['WAGE'], bins=100, color = 'royalblue', alpha = 0.5)
plt.title('Distribution of the "WAGE" variable')

Such histograms can be plotted for other variables.

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(base['AGE'], bins=70, color = 'royalblue', alpha = 0.5)
plt.title('Distribution of the age')

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(base['AGE_TR'], bins=70, color = 'royalblue', alpha = 0.5)
plt.title("Distribution of 'AGE_TR' : age in slices of 4 years")

One can also get a better grasp of fluctuations in the labor market using the 'DATDEB' and 'DATFIN' columns, which give the start and end dates of the employment period. 'DUREE' might also be useful here.
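
One possible way to look at these fluctuations is to count how many pay spells are active on each day of the year. The sketch below does this on a toy table; the values are invented, and the assumption that days run from 1 to 360 is ours (it is suggested by the `DUREE = 360` check but not confirmed here).

```python
import numpy as np
import pandas as pd

# Toy pay spells (illustrative values): start and end day, counted from January 1st.
toy = pd.DataFrame({"DATDEB": [1, 1, 100, 200], "DATFIN": [360, 90, 360, 250]})

n_days = 360  # assumption: days are coded from 1 to 360

# Count spells starting on each day, and spells whose last day was the previous day.
starts = np.bincount(toy["DATDEB"].to_numpy(), minlength=n_days + 2)
ends = np.bincount(toy["DATFIN"].to_numpy() + 1, minlength=n_days + 2)

# The running total of (starts - ends) is the number of spells active on each day.
active = np.cumsum(starts - ends)[1 : n_days + 1]
active_per_day = pd.Series(active, index=np.arange(1, n_days + 1), name="active_spells")

print(active_per_day.loc[[1, 95, 225, 360]])
```

The cumulative-sum approach avoids building a rows-by-days matrix, which matters on a dataset of this size.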

In [None]:
print(f"Proportion of individuals with DUREE = 360 days : {round(base[base['DUREE'] == 360].shape[0] / base.shape[0], 3)}")

In [None]:
#Print the proportion of individuals with DATDEB = 1 and DATFIN = 360
print(f"Proportion of individuals with DATDEB = 1 and DATFIN = 360 : {round(base[(base['DATDEB'] == 1) & (base['DATFIN'] == 360)].shape[0] / base.shape[0], 3)}")

In [None]:
print(f"Average DATFIN for individuals with DATDEB != 1 : {round(base[base['DATDEB'] != 1]['DATFIN'].mean(), 2)}")

We are interested in the link between wage level and gender, controlling for the other available variables. As a first step, we look at how gender relates to the average wage level and to the field of work.
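
Before plotting, a quick numeric check could compare the average wage by gender. The sketch below uses toy values and weights each row by `POND`; whether `POND` is the appropriate weight for this comparison is an assumption on our side.

```python
import pandas as pd

# Toy data (illustrative values only); SEXE: 1 = man, 2 = woman.
toy = pd.DataFrame({
    "SEXE": [1, 1, 2, 2],
    "WAGE": [20000, 30000, 18000, 26000],
    "POND": [1.0, 3.0, 2.0, 2.0],
})

# Weighted mean per group: sum(WAGE * POND) / sum(POND).
weighted_sum = (toy["WAGE"] * toy["POND"]).groupby(toy["SEXE"]).sum()
total_weight = toy["POND"].groupby(toy["SEXE"]).sum()
weighted_mean = weighted_sum / total_weight

print(weighted_mean)
```

Such a raw gap says nothing about causality; it only motivates the conditional analysis that follows.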

In [None]:
#Plot the stacked barplot of 'WAGE' levels depending on 'SEXE'
wage_sexe_counts = base.groupby(['WAGE', 'SEXE']).size().unstack(fill_value=0)
wage_sexe_proportions = wage_sexe_counts.div(wage_sexe_counts.sum(axis=1), axis=0)
wage_sexe_proportions.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='tab10', alpha=0.8)

plt.title('Gender distribution across wage levels')
plt.xlabel('WAGE level')
plt.ylabel('Proportion')
legend_labels = ['Men', 'Women']  # Map 1 -> Men, 2 -> Women
plt.legend(title='SEXE', labels=legend_labels, bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

In [None]:
base['SEXE'].value_counts()

This stacked barplot graph indicates that in our dataset, the amount of working women earning 

In [None]:
#Plot the proportion of people earning less than 17K relative to their age group

#We first need to create the age groups

base['AGE'].describe()

base['AGE_GROUP'] = pd.cut(base['AGE'], bins=[0, 25, 35, 45, 55, 100], labels=['0-25', '25-35', '35-45', '45-55', '55+'])
#And count the number of people in each group that earn less than 17K
age_group_counts = base[base['WAGE'] <= 17000]['AGE_GROUP'].value_counts().sort_index()
age_group_counts
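
The cell above returns counts. To obtain the proportion of low earners within each age group, one option is to average a boolean indicator per group, as in this sketch on toy data (ages and wages are invented; the bins and the 17,000 threshold are those used above).

```python
import pandas as pd

# Toy data (illustrative values only).
toy = pd.DataFrame({
    "AGE": [22, 24, 30, 40, 50, 60],
    "WAGE": [9000, 20000, 15000, 30000, 16000, 40000],
})
toy["AGE_GROUP"] = pd.cut(
    toy["AGE"],
    bins=[0, 25, 35, 45, 55, 100],
    labels=["0-25", "25-35", "35-45", "45-55", "55+"],
)

# The mean of a boolean column is the share of True values in each group.
low_wage = toy["WAGE"] <= 17000
share_low_wage = low_wage.groupby(toy["AGE_GROUP"], observed=False).mean()

print(share_low_wage)
```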

In [None]:
# Group the data by AGE_GROUP and CPFD, and calculate the proportions
age_cpfd_counts = base[base['WAGE'] <= 17000].groupby(['AGE_GROUP', 'CPFD']).size().unstack(fill_value=0)
age_cpfd_proportions = age_cpfd_counts.div(age_cpfd_counts.sum(axis=1), axis=0)

# Plot the stacked bar plot
age_cpfd_proportions.plot(kind='barh', stacked=True, figsize=(10, 6), colormap='tab10', alpha=0.8)

# Add labels and title
plt.title('Proportion of CPFD values for people earning less than 17K by age group')
plt.xlabel('Proportion')
plt.ylabel('Age group')
plt.legend(title='CPFD', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

In [None]:
# Group the data by SEXE and CPFD, and calculate the proportions
sexe_cpfd_counts = base[base['WAGE'] <= 17000].groupby(['SEXE', 'CPFD']).size().unstack(fill_value=0)
sexe_cpfd_proportions = sexe_cpfd_counts.div(sexe_cpfd_counts.sum(axis=1), axis=0)

# Plot the stacked bar plot
sexe_cpfd_proportions.plot(kind='barh', stacked=True, figsize=(10, 6), colormap='tab10', alpha=0.8)

# Add labels and title
plt.title('Proportion of CPFD values for people earning less than 17K by gender')
plt.xlabel('Proportion')
plt.ylabel('SEXE')
plt.legend(title='CPFD', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

In [None]:
fig, ax = plt.subplots()
ax.hist(base['DATDEB'], bins=100, cumulative=False)
ax.set_xlabel('Date of start of revenue with respect to the 01/01')
ax.set_ylabel('Frequency')
plt.show()

In [None]:
fig, ax = plt.subplots()
ax.hist(base['DATFIN'], bins=100, cumulative=False)
ax.set_xlabel('Date of end of revenue with respect to the 01/01')
ax.set_ylabel('Frequency')
plt.show()

In [None]:
fig, ax = plt.subplots()
ax.hist(base['DATFIN'], bins=100, cumulative=False, log=True)
ax.set_xlabel('Date of end of revenue with respect to the 01/01')
ax.set_ylabel('Frequency')
plt.show()

In [None]:
fig, ax = plt.subplots()
ax.hist(base['TRNNETO'], bins=100, cumulative=False)
ax.set_xlabel('Net pay bracket (TRNNETO)')
ax.set_ylabel('Frequency')
plt.show()

In [None]:
base['SEXE'].value_counts()

In [None]:
categories = base['AGE_TR'].value_counts().index
counts = base['AGE_TR'].value_counts().values
fig, ax = plt.subplots()
ax.bar(categories, counts, width=0.5)
ax.set_xlabel("Age bracket")
ax.set_ylabel('Frequency')
plt.show()

In [None]:
fig, ax = plt.subplots()
ax.hist(base['CS'], bins=100, cumulative=False)
ax.set_xlabel('Socio-professional category (simplified code)')
ax.set_ylabel('Frequency')
plt.show()

In [None]:
categories = base['CS_N'].value_counts().index
counts = base['CS_N'].value_counts().values
fig, ax = plt.subplots()
ax.barh(categories, counts)
# With barh, counts are on the x-axis and categories on the y-axis
ax.set_xlabel('Frequency')
ax.set_ylabel("Socio-professional category")
plt.show()

In [None]:
categories = base['CONT_TRAV'].value_counts().index
counts = base['CONT_TRAV'].value_counts().values
fig, ax = plt.subplots()
ax.bar(categories, counts, width=0.5)
ax.set_xlabel('Employment contract')
ax.set_ylabel('Frequency')
plt.show()

In [None]:
categories = base['DOMEMPL_EM_N'].value_counts().index
counts = base['DOMEMPL_EM_N'].value_counts().values
fig, ax = plt.subplots()
ax.barh(categories, counts)
# With barh, counts are on the x-axis and categories on the y-axis
ax.set_xlabel('Frequency')
ax.set_ylabel("Employment domain")
plt.show()

In [None]:
categories = base['DOMEMPL'].value_counts().index
counts = base['DOMEMPL'].value_counts().values
fig, ax = plt.subplots()
ax.bar(categories, counts, width=0.5)
ax.set_xlabel("Employment domain")
ax.set_ylabel('Frequency')
plt.show()

In [None]:
fig, ax = plt.subplots()
ax.hist(base['DUREE'], bins=100, cumulative=False)
ax.set_xlabel('Pay duration, in days')
ax.set_ylabel('Frequency')
plt.show()

In [None]:
fig, ax = plt.subplots()
ax.hist(base['DUREE'], bins=100, cumulative=False, log=True)
ax.set_xlabel('Pay duration, in days')
ax.set_ylabel('Frequency')
plt.show()

In [None]:
categories = base['REGT'].value_counts().index
counts = base['REGT'].value_counts().values
fig, ax = plt.subplots()
ax.bar(categories, counts, width=0.5)
ax.set_xlabel("Region of work")
ax.set_ylabel('Frequency')
plt.show()

In [None]:
categories = base['REGT_N'].value_counts().index
counts = base['REGT_N'].value_counts().values
fig, ax = plt.subplots()
ax.barh(categories, counts)
# With barh, counts are on the x-axis and categories on the y-axis
ax.set_xlabel('Frequency')
ax.set_ylabel("Region of work")
plt.show()

In [None]:
categories = base['REGR'].value_counts().index
counts = base['REGR'].value_counts().values
fig, ax = plt.subplots()
ax.bar(categories, counts, width=0.5)
ax.set_xlabel("Region of residence")
ax.set_ylabel('Frequency')
plt.show()

In [None]:
categories = base['REGR_N'].value_counts().index
counts = base['REGR_N'].value_counts().values
fig, ax = plt.subplots()
ax.barh(categories, counts)
# With barh, counts are on the x-axis and categories on the y-axis
ax.set_xlabel('Frequency')
ax.set_ylabel("Region of residence")
plt.show()

In [None]:
fig, ax = plt.subplots()
ax.hist(base['NBHEUR'], bins=100, cumulative=False)
ax.set_xlabel("Number of paid hours")
ax.set_ylabel('Frequency')
plt.show()

In [None]:
fig, ax = plt.subplots()
ax.hist(base['POND'], bins=100, cumulative=False)
ax.set_xlabel("Weighting (POND)")
ax.set_ylabel('Frequency')
plt.show()

In [None]:
categories = base['DEPR_N'].value_counts().index
counts = base['DEPR_N'].value_counts().values
fig, ax = plt.subplots()
ax.barh(categories, counts)
# With barh, counts are on the x-axis and categories on the y-axis
ax.set_xlabel('Frequency')
ax.set_ylabel("Department of residence")
plt.show()

In [None]:
fig, ax = plt.subplots()
ax.hist(base['WAGE'], bins=100, cumulative=False)
ax.set_xlabel("Wage")
ax.set_ylabel('Frequency')
plt.show()

In [None]:
fig, ax = plt.subplots()
ax.hist(base['UNEMP'], bins=100, cumulative=False)
ax.set_xlabel("Unemployment benefits")
ax.set_ylabel('Frequency')
plt.show()

In [None]:
base['UNEMP'].value_counts()

## DAG : causal graph and choice of covariates

In [None]:
import networkx as nx

#from pgmpy.base.DAG import DAG

In [None]:
#!pip install pgmpy

In [None]:
digraph = nx.DiGraph(
    [
        ("AGE", "WAGE"),
        ("AGE", "SUBREGION"),
        ("AGE", "WHOURS"),
        ("AGE", "CSP"),
        ("AGE", "CONTRACT"),
        ("SEX", "WAGE"),
        ("SEX", "WHOURS"),
        ("SEX", "ACTIVITY 38"),
        ("SEX", "CSP"),
        ("SUBREGION", "WAGE"),
        ("SUBREGION", "NB EMPLOYEES"),
        ("CSP", "WAGE"),
        ("CSP", "CONV COLL"),
        ("CSP", "NB EMPLOYEES"),
        ("CSP", "WHOURS"),
        ("CONTRACT", "CSP"),
        ("CONTRACT", "WHOURS"),
        ("CONTRACT", "WAGE"),
        ("ACTIVITY 38", "CSP"),
        ("ACTIVITY 38", "WAGE"),
        ("ACTIVITY 38", "WHOURS"),
        ("ACTIVITY 38", "CONTRACT"),
        ("CONV COLL", "WAGE"),
        ("WHOURS", "WAGE"),
    ]
)

In [None]:
plt.figure(figsize=(6,6))

# Shell layout: deterministic, so the figure is reproducible and readable
pos = nx.shell_layout(digraph)

nx.draw(digraph, pos, with_labels=True, node_color='lightblue',
edge_color='gray', node_size=500)

plt.show()
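
Once the graph is drawn, a few `networkx` queries can help with the choice of covariates. The sketch below checks that the graph is acyclic and lists the direct causes of `WAGE` and the directed paths from `SEX` to `WAGE`. It uses only a subset of the edges above so that it stays self-contained; the name `toy_dag` is ours.

```python
import networkx as nx

# Subset of the edges defined above, kept small for readability.
toy_dag = nx.DiGraph(
    [
        ("SEX", "WAGE"),
        ("SEX", "WHOURS"),
        ("SEX", "CSP"),
        ("CSP", "WAGE"),
        ("CSP", "WHOURS"),
        ("WHOURS", "WAGE"),
    ]
)

# A causal graph must not contain directed cycles.
is_dag = nx.is_directed_acyclic_graph(toy_dag)

# Direct causes (parents) of WAGE and everything downstream of SEX.
wage_parents = sorted(toy_dag.predecessors("WAGE"))
sex_descendants = sorted(nx.descendants(toy_dag, "SEX"))

# All directed paths from SEX to WAGE: the direct effect plus the mediated ones.
sex_to_wage_paths = sorted(nx.all_simple_paths(toy_dag, "SEX", "WAGE"))

print(is_dag, wage_parents, sex_descendants)
for path in sex_to_wage_paths:
    print(" -> ".join(path))
```

In this subset, `CSP` and `WHOURS` are descendants of `SEX`, so they act as mediators rather than confounders; conditioning on them would change which effect of gender is being estimated.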