### Table of contents

* [Chapter 1. First exploration of the datasets](#chapter1)
    * [Section 1.1. Project context and objectives](#section_1_1)
    * [Section 1.2. Importing the libraries](#section_1_2)
    * [Section 1.3. Importing the data](#section_1_3)
    * [Section 1.4. Preliminary exploration of the dataset variables](#section_1_4)
        * [Section 1.4.1. Data types, columns, and shape of the DataFrames](#section_1_4_1)
        * [Section 1.4.2. Duplicates in each DataFrame](#section_1_4_2)
        * [Section 1.4.3. Data quality summary](#section_1_4_3)
* [Chapter 2. Data exploration and analysis with DataViz](#chapter2)
    * [Section 2.1](#section_2_1)
    * [Section 2.2](#section_2_2)
    * [Section 2.3](#section_2_3)
    * [Section 2.4](#section_2_4)
    * [Section 2.5](#section_2_5)
* [Chapter 3. Data cleaning and preprocessing](#chapter3)
    * [Section 3.1. df_name_geographic dataset preprocessing](#section_3_1)
    * [Section 3.2. df_entreprises dataset preprocessing](#section_3_2)
    * [Section 3.3. df_salary dataset preprocessing](#section_3_3)
    * [Section 3.4. df_population dataset preprocessing](#section_3_4)
* [Chapter 4.](#chapter4)
    * [Section 4.1](#section_4_1)
    * [Section 4.2](#section_4_2)


### Chapter 1. First exploration of the datasets <a class="anchor" id="chapter1"></a>

#### Section 1.1. Project context and objectives <a class="anchor" id="section_1_1"></a>

**Context:** INSEE, the French National Institute of Statistics and Economic Studies (Institut national de la statistique et des études économiques), is the official French agency responsible for collecting a wide range of data about French territory. These data, whether demographic (births, deaths, population density, ...) or economic (salaries, number of companies by sector of activity or by size, ...), give a comprehensive picture of French society.
They are therefore a valuable resource for analyzing and understanding the country's social, economic, and territorial dynamics.

**Objectives:** This study compares inequalities in France along several dimensions. First, we look at disparities between companies, examining their geographic location and their size.
Next, we turn to inequalities within the population, analyzing how salaries vary with criteria such as job category and geographic location.
Finally, we focus on one large city in particular, to study in depth the inequalities that may exist at the local level.


#### Section 1.2. Importing the libraries <a class="anchor" id="section_1_2"></a>

In [1]:
import sys
sys.path.append('../src')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import the project helper functions from ../src (added to sys.path above)
from fi_functions import *

#### Section 1.3. Importing the data <a class="anchor" id="section_1_3"></a>

In [2]:
df_entreprises = pd.read_csv('../data/base_etablissement_par_tranche_effectif.csv') 
display(df_entreprises.head())

df_salary = pd.read_csv('../data/net_salary_per_town_categories_update2021.csv', sep = ';')
display(df_salary.head())

df_name_geographic = pd.read_csv('../data/name_geographic_information.csv', na_values = '-') 
display(df_name_geographic.head())

df_population = pd.read_csv('../data/population.csv', dtype={'CODGEO': object} )
display(df_population.head())

Unnamed: 0,CODGEO,LIBGEO,REG,DEP,E14TST,E14TS0ND,E14TS1,E14TS6,E14TS10,E14TS20,E14TS50,E14TS100,E14TS200,E14TS500
0,1001,L'Abergement-Clémenciat,82,1,25,22,1,2,0,0,0,0,0,0
1,1002,L'Abergement-de-Varey,82,1,10,9,1,0,0,0,0,0,0,0
2,1004,Ambérieu-en-Bugey,82,1,996,577,272,63,46,24,9,3,2,0
3,1005,Ambérieux-en-Dombes,82,1,99,73,20,3,1,2,0,0,0,0
4,1006,Ambléon,82,1,4,4,0,0,0,0,0,0,0,0


Unnamed: 0,CODGEO,SNHM20,SNHMC20,SNHMP20,SNHME20,SNHMO20,SNHMF20,SNHMFC20,SNHMFP20,SNHMFE20,...,SNHMHO20,SNHM1820,SNHM2620,SNHM5020,SNHMF1820,SNHMF2620,SNHMF5020,SNHMH1820,SNHMH2620,SNHMH5020
0,1004,15.013132,25.221939,16.950395,11.259209,11.772666,13.043695,21.80658,14.36813,11.045518,...,12.095672,10.880315,15.072711,17.255747,10.291297,13.234401,14.192028,11.286232,16.13062,19.22654
1,1007,15.261333,24.391671,16.179052,11.917058,12.520221,13.855898,21.532587,15.160246,11.951341,...,12.88161,10.544073,15.263487,16.961952,10.618428,14.034526,14.418987,10.487221,16.229187,19.324302
2,1014,14.578709,27.940065,16.384056,12.175141,11.848787,12.570863,20.702968,13.198429,12.152977,...,12.304387,11.058325,14.116861,16.528222,10.295307,12.22643,13.932078,11.62366,15.517659,18.409024
3,1024,14.658174,23.817275,16.319296,11.905547,12.127913,13.05378,20.157674,14.923592,11.414435,...,12.525517,10.498073,14.627982,16.404495,10.031113,13.268833,13.691905,10.812016,15.505663,18.422798
4,1025,14.95291,25.822076,15.365464,11.663794,12.400297,13.444614,22.271103,14.233353,11.410936,...,12.66639,10.385396,14.750372,16.85486,10.309374,13.513608,14.336143,10.442608,15.580433,18.421533


Unnamed: 0,EU_circo,code_région,nom_région,chef.lieu_région,numéro_département,nom_département,préfecture,numéro_circonscription,nom_commune,codes_postaux,code_insee,latitude,longitude,éloignement
0,Sud-Est,82,Rhône-Alpes,Lyon,1,Ain,Bourg-en-Bresse,1,Attignat,1340,1024,46.283333,5.166667,1.21
1,Sud-Est,82,Rhône-Alpes,Lyon,1,Ain,Bourg-en-Bresse,1,Beaupont,1270,1029,46.4,5.266667,1.91
2,Sud-Est,82,Rhône-Alpes,Lyon,1,Ain,Bourg-en-Bresse,1,Bény,1370,1038,46.333333,5.283333,1.51
3,Sud-Est,82,Rhône-Alpes,Lyon,1,Ain,Bourg-en-Bresse,1,Béreyziat,1340,1040,46.366667,5.05,1.71
4,Sud-Est,82,Rhône-Alpes,Lyon,1,Ain,Bourg-en-Bresse,1,Bohas-Meyriat-Rignat,1250,1245,46.133333,5.4,1.01


Unnamed: 0,NIVGEO,CODGEO,LIBGEO,MOCO,AGEQ80_17,SEXE,NB
0,COM,1001,L'Abergement-Clémenciat,11,0,1,15
1,COM,1001,L'Abergement-Clémenciat,11,0,2,15
2,COM,1001,L'Abergement-Clémenciat,11,5,1,20
3,COM,1001,L'Abergement-Clémenciat,11,5,2,20
4,COM,1001,L'Abergement-Clémenciat,11,10,1,20


#### Section 1.4. Preliminary exploration of the dataset variables <a class="anchor" id="section_1_4"></a>

##### Section 1.4.1. Data types, columns, and shape of the DataFrames <a class="anchor" id="section_1_4_1"></a>

In [3]:
# Review data types and columns info and shape

print(df_entreprises.info())
print(df_salary.info())
print(df_name_geographic.info())
print(df_population.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36681 entries, 0 to 36680
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   CODGEO    36681 non-null  object
 1   LIBGEO    36681 non-null  object
 2   REG       36681 non-null  int64 
 3   DEP       36681 non-null  object
 4   E14TST    36681 non-null  int64 
 5   E14TS0ND  36681 non-null  int64 
 6   E14TS1    36681 non-null  int64 
 7   E14TS6    36681 non-null  int64 
 8   E14TS10   36681 non-null  int64 
 9   E14TS20   36681 non-null  int64 
 10  E14TS50   36681 non-null  int64 
 11  E14TS100  36681 non-null  int64 
 12  E14TS200  36681 non-null  int64 
 13  E14TS500  36681 non-null  int64 
dtypes: int64(11), object(3)
memory usage: 3.9+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5421 entries, 0 to 5420
Data columns (total 25 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   CODGEO     5421 non-null   object 


In [4]:
# Shape for each DataFrame 

print("Shape of DF:")
print("entreprise:",df_entreprises.shape)
print("salary:",df_salary.shape)
print("name_geographic:",df_name_geographic.shape)
print("population:",df_population.shape)

Shape of DF:
entreprise: (36681, 14)
salary: (5421, 25)
name_geographic: (36840, 14)
population: (8536584, 7)


##### Section 1.4.2. Duplicates in each DataFrame <a class="anchor" id="section_1_4_2"></a>

In [5]:
# Number of duplicates for each DataFrame

print('Number of duplicates :')
print('entreprises :', df_entreprises.duplicated().sum())
print('salary :', df_salary.duplicated().sum())
print('name_geographic :', df_name_geographic.duplicated().sum(), ', name_geographic by code_insee :', df_name_geographic.duplicated(subset=['code_insee']).sum())
print('population :', df_population.duplicated().sum())

Number of duplicates :
entreprises : 0
salary : 0
name_geographic : 0 , name_geographic by code_insee : 147
population : 0
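
The 147 duplicates on `code_insee` are rows that share a commune code while differing in at least one other column, since the full-row check finds none. The sketch below uses a made-up toy frame, not the real file, to show how `keep=False` can list every copy side by side before deciding which one to keep:

```python
import pandas as pd

# Toy data (made up for illustration): two rows share code_insee "01024"
toy = pd.DataFrame({
    "code_insee": ["01024", "01024", "01029"],
    "codes_postaux": ["01340", "01370", "01270"],
})

# Full-row duplicates: none here, because the postal codes differ
n_full = int(toy.duplicated().sum())

# Key duplicates: keep=False flags every row whose key appears more than once
same_key = toy[toy.duplicated(subset=["code_insee"], keep=False)]

print(n_full)
print(same_key.sort_values("code_insee"))
```

Looking at the flagged rows first helps decide whether `keep="first"` is an acceptable rule or whether the differing columns need to be reconciled.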


##### Section 1.4.3. Data quality summary <a class="anchor" id="section_1_4_3"></a>

In [6]:
summary(df_entreprises)
summary(df_salary)
summary(df_population)
summary_short(df_name_geographic)

Unnamed: 0,type_info,%_missing_values,nb_unique_values,nb_zero_values,%_zero_values,list_unique_values,mean_or_mode,flag
CODGEO,object,0.0,36681,0,0.0,Too many values...,01001,Nothing to report
LIBGEO,object,0.0,34142,0,0.0,Too many values...,Sainte-Colombe,Nothing to report
REG,int64,0.0,27,0,0.0,Too many values...,49.415365,Nothing to report
DEP,object,0.0,101,0,0.0,Too many values...,62,Nothing to report
E14TST,int64,0.0,1423,399,1.0,Too many values...,123.456067,Nothing to report
E14TS0ND,int64,0.0,1125,579,2.0,Too many values...,83.555301,Nothing to report
E14TS1,int64,0.0,650,6118,17.0,Too many values...,27.291486,Nothing to report
E14TS6,int64,0.0,272,20324,55.0,Too many values...,5.22055,Nothing to report
E14TS10,int64,0.0,220,22641,62.0,Too many values...,3.800333,Nothing to report
E14TS20,int64,0.0,160,25884,71.0,Too many values...,2.296448,Nothing to report


Unnamed: 0,type_info,%_missing_values,nb_unique_values,nb_zero_values,%_zero_values,list_unique_values,mean_or_mode,flag
CODGEO,object,0.0,5421,0,0.0,Too many values...,10003.0,Nothing to report
SNHM20,float64,0.0,5421,0,0.0,Too many values...,15.440462,Nothing to report
SNHMC20,float64,0.0,5421,0,0.0,Too many values...,25.194776,Nothing to report
SNHMP20,float64,0.0,5421,0,0.0,Too many values...,15.847373,Nothing to report
SNHME20,float64,0.0,5421,0,0.0,Too many values...,11.573974,Nothing to report
SNHMO20,float64,0.0,5421,0,0.0,Too many values...,12.124348,Nothing to report
SNHMF20,float64,0.0,5421,0,0.0,Too many values...,13.824962,Nothing to report
SNHMFC20,float64,0.0,5421,0,0.0,Too many values...,22.03816,Nothing to report
SNHMFP20,float64,0.0,5421,0,0.0,Too many values...,14.682978,Nothing to report
SNHMFE20,float64,0.0,5421,0,0.0,Too many values...,11.402604,Nothing to report


Unnamed: 0,type_info,%_missing_values,nb_unique_values,nb_zero_values,%_zero_values,list_unique_values,mean_or_mode,flag
NIVGEO,object,0.0,1,0,0.0,[COM],COM,It's imbalanced!
CODGEO,object,0.0,35868,0,0.0,Too many values...,01001,Nothing to report
LIBGEO,object,0.0,33452,0,0.0,Too many values...,Sainte-Colombe,Nothing to report
MOCO,int64,0.0,7,0,0.0,"[11, 12, 21, 22, 23, 31, 32]",11,Nothing to report
AGEQ80_17,int64,0.0,17,502152,6.0,Too many values...,0,Nothing to report
SEXE,int64,0.0,2,0,0.0,"[1, 2]",1,Nothing to report
NB,int64,0.0,2953,6325631,74.0,Too many values...,7.446743,Nothing to report


Unnamed: 0,nb_unique_values,%_missing_values,nb_missing_values,type
EU_circo,8,0.0,0,object
code_région,28,0.0,0,int64
nom_région,28,0.0,0,object
chef.lieu_région,28,0.0,0,object
numéro_département,102,0.0,0,object
nom_département,102,0.0,0,object
préfecture,102,0.0,0,object
numéro_circonscription,24,0.0,0,int64
nom_commune,34142,0.0,0,object
codes_postaux,6106,0.0,0,object


**Conclusions for df_name_geographic:**
- The column names are in French, whereas those of the other DataFrames are in English. They should be renamed for consistency.
- The "code_insee" column appears to correspond to "CODGEO", but the leading '0' is missing: for example, '1024' instead of '01024'.
- The "latitude", "longitude", and "éloignement" columns have 2929, 2841, and 2962 missing values (NaN) respectively.
- The "longitude" column should be of type float64, but it could not be converted because of problems in the data. For example:

    - A ',' is used instead of '.' as the decimal separator: **'5,83'**
    - Some values are '-': these must be replaced with NaN.
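
The two longitude problems listed above can be handled in one pass. This is a minimal sketch on made-up values, assuming the column arrives as strings; it is not the notebook's own cleaning step:

```python
import pandas as pd

# Toy values (made up) mixing the formats described above
raw = pd.Series(["5.166667", "5,83", "-", "-1.5"])

# Blank out the lone '-' placeholders, swap decimal commas for dots,
# then convert; errors="coerce" turns anything still unparsable into NaN
cleaned = pd.to_numeric(
    raw.mask(raw.eq("-")).str.replace(",", ".", regex=False),
    errors="coerce",
)

print(cleaned)
```

Matching the whole value against '-' (rather than removing every dash) keeps negative longitudes such as '-1.5' intact.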

**Conclusions for df_salary:**
- There are many salary columns, but they break down along three dimensions: job category, sex, and age. This structure should make the analysis easier.
- The "CODGEO" column appears to correspond to "CODGEO" in the other DataFrames, but the leading '0' is missing: for example, '1024' instead of '01024'.
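
When the zero is present in the file and only lost during parsing, it can be preserved at read time. The sketch below uses a tiny in-memory CSV (made up) in place of the real file; if the file itself already lacks the zero, padding after loading is still needed:

```python
import io
import pandas as pd

# Tiny in-memory CSV (made up) standing in for the real file
csv_text = "CODGEO;SNHM20\n01004;15.0\n75056;20.1\n"

# Default parsing infers an integer column and drops the leading zero
as_default = pd.read_csv(io.StringIO(csv_text), sep=";")

# Forcing a string dtype keeps the code exactly as written in the file
as_text = pd.read_csv(io.StringIO(csv_text), sep=";", dtype={"CODGEO": str})

print(as_default["CODGEO"].tolist())
print(as_text["CODGEO"].tolist())
```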

**Conclusions for df_entreprises:**
- Many values are zero in the following columns:
    - 'E14TS6', 'E14TS10', 'E14TS20', 'E14TS50', 'E14TS100', 'E14TS200', 'E14TS500'
- To reduce this sparsity, we can create new columns that group the data into broader company-size classes:
    - Micro enterprise: 0 <= size < 10 employees
    - Small enterprise: 10 <= size < 50 employees
    - Medium enterprise: 50 <= size < 200 employees
    - Large enterprise: size >= 200 employees
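
The grouping can be sketched with a mapping from each size class to the columns it sums. The toy counts are made up, and the mapping assumes that the numeric suffix of each `E14TS*` column is the lower bound of its headcount band; it mirrors the column pairs used in Section 3.2 and leaves `E14TS0ND` out, as that section does:

```python
import pandas as pd

# Toy counts (made up); column names follow df_entreprises
toy = pd.DataFrame({
    "E14TS1": [1, 272], "E14TS6": [2, 63],
    "E14TS10": [0, 46], "E14TS20": [0, 24],
    "E14TS50": [0, 9], "E14TS100": [0, 3],
    "E14TS200": [0, 2], "E14TS500": [0, 0],
})

# Assumed mapping: size class -> source columns it groups
size_classes = {
    "nb_micro_entreprises": ["E14TS1", "E14TS6"],      # fewer than 10
    "nb_small_entreprises": ["E14TS10", "E14TS20"],    # 10 to 49
    "nb_medium_entreprises": ["E14TS50", "E14TS100"],  # 50 to 199
    "nb_large_entreprises": ["E14TS200", "E14TS500"],  # 200 or more
}

# Sum the source columns row by row for each class
for new_col, source_cols in size_classes.items():
    toy[new_col] = toy[source_cols].sum(axis=1)

print(toy[list(size_classes)])
```

Keeping the mapping in one dictionary makes the class boundaries easy to review and change in a single place.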

In [11]:
df_name_geographic_final = pd.read_csv('../data/name_geographic_final.csv') 

# The CSV import dropped the leading "0" of the INSEE code; pad it back to 5 characters
add_leading_zeros(df_name_geographic_final,'COM_code_insee', 5)

display(df_name_geographic_final.head())



Unnamed: 0,COM_code_insee,COM_nom,COM_latitude,COM_longitude,DEPT,DEPT_code,DEPT_nom,DEPT_ChefLieu_Code_insee,DEPT_ChefLieu,DEPT_ChefLieu_latitude,...,REG_ChefLieu_latitude,REG_ChefLieu_longitude,CAP,Capitale,Capitale_latitude,Capitale_longitude,DIST,DIST_COM_CL_DEPT,DIST_COM_CL_REG,DIST_COM_PARIS
0,1001,L'Abergement-Clémenciat,46.153721,4.92585,||,1,Ain,1053,bourg en bresse,46.205014,...,45.770061,4.828519,||,Paris,48.852937,2.35005,||,25.27,43.32,357.03
1,1001,L'Abergement-Clémenciat,46.153721,4.92585,||,1,Ain,1053,bourg en bresse,46.205014,...,45.770061,4.828519,||,Paris,48.852937,2.35005,||,25.27,43.32,357.03
2,1002,L'Abergement-de-Varey,46.009606,5.428088,||,1,Ain,1053,bourg en bresse,46.205014,...,45.770061,4.828519,||,Paris,48.852937,2.35005,||,25.88,53.5,391.78
3,1002,L'Abergement-de-Varey,46.009606,5.428088,||,1,Ain,1053,bourg en bresse,46.205014,...,45.770061,4.828519,||,Paris,48.852937,2.35005,||,25.88,53.5,391.78
4,1004,Ambérieu-en-Bugey,45.961049,5.372275,||,1,Ain,1053,bourg en bresse,46.205014,...,45.770061,4.828519,||,Paris,48.852937,2.35005,||,28.83,47.15,393.78


In [12]:
# Show the duplicated rows by COM_code_insee in df_name_geographic_final
df_name_geographic_final[df_name_geographic_final.duplicated(subset = ['COM_code_insee'])]

Unnamed: 0,COM_code_insee,COM_nom,COM_latitude,COM_longitude,DEPT,DEPT_code,DEPT_nom,DEPT_ChefLieu_Code_insee,DEPT_ChefLieu,DEPT_ChefLieu_latitude,...,REG_ChefLieu_latitude,REG_ChefLieu_longitude,CAP,Capitale,Capitale_latitude,Capitale_longitude,DIST,DIST_COM_CL_DEPT,DIST_COM_CL_REG,DIST_COM_PARIS
1,01001,L'Abergement-Clémenciat,46.153721,4.925850,||,1,Ain,1053,bourg en bresse,46.205014,...,45.770061,4.828519,||,Paris,48.852937,2.35005,||,25.27,43.32,357.03
3,01002,L'Abergement-de-Varey,46.009606,5.428088,||,1,Ain,1053,bourg en bresse,46.205014,...,45.770061,4.828519,||,Paris,48.852937,2.35005,||,25.88,53.50,391.78
5,01004,Ambérieu-en-Bugey,45.961049,5.372275,||,1,Ain,1053,bourg en bresse,46.205014,...,45.770061,4.828519,||,Paris,48.852937,2.35005,||,28.83,47.15,393.78
7,01005,Ambérieux-en-Dombes,45.996164,4.911967,||,1,Ain,1053,bourg en bresse,46.205014,...,45.770061,4.828519,||,Paris,48.852937,2.35005,||,34.65,25.96,371.49
9,01006,Ambléon,45.749886,5.594585,||,1,Ain,1053,bourg en bresse,46.205014,...,45.770061,4.828519,||,Paris,48.852937,2.35005,||,57.34,59.47,422.87
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
271872,97617,Tsingoni,-12.782686,45.134576,||,976,Mayotte,97611,mamoudzou,-12.790232,...,-12.790232,45.194781,||,Paris,48.852937,2.35005,||,6.58,6.58,8037.00
271873,97617,Tsingoni,-12.782686,45.134576,||,976,Mayotte,97611,mamoudzou,-12.790232,...,-12.790232,45.194781,||,Paris,48.852937,2.35005,||,6.58,6.58,8037.00
271874,97617,Tsingoni,-12.782686,45.134576,||,976,Mayotte,97611,mamoudzou,-12.790232,...,-12.790232,45.194781,||,Paris,48.852937,2.35005,||,6.58,6.58,8037.00
271875,97617,Tsingoni,-12.782686,45.134576,||,976,Mayotte,97611,mamoudzou,-12.790232,...,-12.790232,45.194781,||,Paris,48.852937,2.35005,||,6.58,6.58,8037.00


In [14]:
# Drop duplicates on the INSEE code, keeping the first occurrence
df_name_geographic_final = df_name_geographic_final.drop_duplicates(subset=["COM_code_insee"], keep="first")

In [None]:
# MERGE START

In [None]:
# MERGE Entreprise & df_name_geographic_final

In [19]:
df_entreprises24 = pd.read_csv('../data/df_entreprises24.csv') 

# The CSV import dropped the leading "0" of the INSEE code; pad it back to 5 characters
add_leading_zeros(df_entreprises24,'CODGEO', 5)
df_entreprises24=df_entreprises24.drop(['ET_BE', 'ET_BE_0sal', 'ET_BE_1_4', 'ET_BE_5_9',
       'ET_BE_10_19', 'ET_BE_20_49', 'ET_BE_50_99', 'ET_BE_100_199',
       'ET_BE_200_499', 'ET_BE_500P', '||'], axis=1)

display(df_entreprises24.head())



Unnamed: 0,CODGEO,Total_Salaries,nb_auto_entrepreneur,nb_micro_entreprises,nb_small_entreprises,nb_medium_entreprises,nb_large_entreprises
0,1001,0,0,0,0,0,0
1,1002,0,0,0,0,0,0
2,1004,33,2,14,15,2,0
3,1005,4,0,2,2,0,0
4,1006,0,0,0,0,0,0


In [17]:
df_entreprises24.columns

Index(['CODGEO', 'ET_BE', 'ET_BE_0sal', 'ET_BE_1_4', 'ET_BE_5_9',
       'ET_BE_10_19', 'ET_BE_20_49', 'ET_BE_50_99', 'ET_BE_100_199',
       'ET_BE_200_499', 'ET_BE_500P', '||', 'Total_Salaries',
       'nb_auto_entrepreneur', 'nb_micro_entreprises', 'nb_small_entreprises',
       'nb_medium_entreprises', 'nb_large_entreprises'],
      dtype='object')

### Chapter 2. Data exploration and analysis with DataViz <a class="anchor" id="chapter2"></a>

#### Section 2.1 <a class="anchor" id="section_2_1"></a>

#### Section 2.2 <a class="anchor" id="section_2_2"></a>

#### Section 2.3 <a class="anchor" id="section_2_3"></a>
#### Section 2.4 <a class="anchor" id="section_2_4"></a>
#### Section 2.5 <a class="anchor" id="section_2_5"></a>





In [8]:
# df_merge_ent_geo is created by the merge cell in Section 3.2; run that cell first
summary(df_merge_ent_geo)

NameError: name 'df_merge_ent_geo' is not defined

In [None]:
# Aggregate the data by regional capital and sum the number of enterprises
region_enterprises = df_merge_ent_geo.groupby('chef.lieu_région')['E14TST'].sum().reset_index()

# Sort region_enterprises DataFrame by the 'E14TST' column
region_enterprises_sorted = region_enterprises.sort_values(by = 'E14TST', ascending = False)
display(region_enterprises_sorted)

# Aggregate the data by EU_circo and sum the number of enterprises
eu_circo_enterprises = df_merge_ent_geo.groupby('EU_circo')['E14TST'].sum().reset_index()

# Sort eu_circo_enterprises DataFrame by the 'E14TST' column
eu_circo_enterprises_sorted = eu_circo_enterprises.sort_values(by='E14TST', ascending=False)
display(eu_circo_enterprises_sorted)

In [None]:
from matplotlib.ticker import FuncFormatter
#df_merge_ent_geo

sns.set(style="whitegrid")

# Plotting the bar plot
plt.figure(figsize=(12, 6))
sns.barplot(x='EU_circo', y='E14TST', data=eu_circo_enterprises_sorted, palette='viridis' )
plt.title('Total Number of Enterprises by EU_circo')
plt.xlabel('EU_circo')
plt.ylabel('Total Number of Enterprises')
plt.xticks(rotation=45, ha='right')

# Customize the y-axis ticks to show whole numbers
plt.gca().yaxis.set_major_formatter(FuncFormatter(lambda x, _: '{:,.0f}'.format(x)))

plt.show();


In [None]:
# Plotting the bar plot
plt.figure(figsize=(12, 6))
sns.barplot(x='chef.lieu_région', y='E14TST', data=region_enterprises_sorted, palette='viridis' )
plt.title('Total Number of Enterprises by Regional Capital')
plt.xlabel('Regional Capital')
plt.ylabel('Total Number of Enterprises')
plt.xticks(rotation = 90, ha='right')

# Customize the y-axis ticks to show whole numbers
plt.gca().yaxis.set_major_formatter(FuncFormatter(lambda x, _: '{:,.0f}'.format(x)))

plt.show();

In [None]:
df_merge_ent_geo.columns

In [None]:
# Aggregate the data by region and sum the counts of each type of enterprise
region_enterprise_counts = df_merge_ent_geo.groupby('chef.lieu_région')[['nb_micro_entreprises', 'nb_small_entreprises', 'nb_medium_entreprises', 'nb_large_entreprises']].sum().reset_index()
region_enterprise_counts

In [None]:
region_enterprise_counts.plot(x='chef.lieu_région', kind='bar', stacked=True, figsize=(12, 6))

### Chapter 3. Data cleaning and preprocessing <a class="anchor" id="chapter3"></a>

#### Section 3.1. df_name_geographic dataset preprocessing <a class="anchor" id="section_3_1"></a>


In [None]:
# Detect malformed values in the longitude and latitude columns

longitude_errors = df_name_geographic.longitude.apply(detection_error)
latitude_errors = df_name_geographic.latitude.apply(detection_error)
print(longitude_errors.loc[longitude_errors.notna()].values)
print(latitude_errors.loc[latitude_errors.notna()].values)

In [None]:
# Replace decimal commas with dots, then cast the column to float
df_name_geographic["longitude"] = df_name_geographic["longitude"].apply(lambda x: str(x).replace(',','.'))
df_name_geographic["longitude"] = df_name_geographic["longitude"].astype(float)

#df_name_geographic.dtypes

In [None]:
# Verification of duplicates by code_insee in df_name_geographic 
df_name_geographic[df_name_geographic.duplicated(subset = ['code_insee'])]

In [None]:
# Drop duplicates by code_insee in name_geographic
df_name_geographic.drop_duplicates(subset=["code_insee"], keep="first", inplace=True)

In [None]:
# Verify the unique lengths of strings in the 'code_insee' column of df

unique_lengths_df_name_geographic = get_unique_lengths(df_name_geographic['code_insee'])

print("Unique lengths in 'code_insee' column, df_name_geographic:", unique_lengths_df_name_geographic)

In [None]:
# Adding leading zeros for the code_insee column

add_leading_zeros(df_name_geographic,'code_insee', 5)


In [None]:
# Verification of unique lengths in 'code_insee' column
print("Unique lengths in 'code_insee' column, df_name_geographic:", get_unique_lengths(df_name_geographic['code_insee']))
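
`get_unique_lengths` and `add_leading_zeros` come from the project's `fi_functions` module, whose implementation is not shown here. For readers without that module, the same check and repair can be approximated in plain pandas. This is a sketch on made-up codes and an assumption about what the helpers do, not their actual code:

```python
import pandas as pd

# Toy INSEE codes (made up): one lost its leading zero, one is Corsican
codes = pd.Series(["1024", "75056", "2A004"])

# Distinct string lengths present in the column
lengths = sorted(codes.str.len().unique().tolist())

# Left-pad every code with zeros up to 5 characters
padded = codes.astype(str).str.zfill(5)

print(lengths)
print(padded.tolist())
```

Treating the codes as strings rather than integers matters because some department codes contain letters, as in the '2A004' toy value.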

In [None]:
print(df_entreprises.shape, df_name_geographic.shape)

#### Section 3.2. df_entreprises dataset preprocessing <a class="anchor" id="section_3_2"></a>

In [None]:
# Create new columns that group the size bands into the four classes defined in Section 1.4.3

# Micro: fewer than 10 employees
df_entreprises['nb_micro_entreprises'] = df_entreprises['E14TS1'] + df_entreprises['E14TS6']
# Small: 10 to 49 employees
df_entreprises['nb_small_entreprises'] = df_entreprises['E14TS10'] + df_entreprises['E14TS20']
# Medium: 50 to 199 employees
df_entreprises['nb_medium_entreprises'] = df_entreprises['E14TS50'] + df_entreprises['E14TS100']
# Large: 200 employees or more
df_entreprises['nb_large_entreprises'] = df_entreprises['E14TS200'] + df_entreprises['E14TS500']

In [None]:
df_entreprises.columns

In [None]:
# Drop unused columns in df_entreprises (not implemented yet)

#df_entreprises = df_entreprises.drop()

In [None]:
# Verify the unique lengths of strings in the 'CODGEO' column of df

unique_lengths_df_entreprises = get_unique_lengths(df_entreprises['CODGEO'])

print("Unique lengths in 'CODGEO' column, df_entreprises:", unique_lengths_df_entreprises)

In [None]:
# Merge df_entreprises and df_name_geographic

df_merge_ent_geo = pd.merge(left = df_entreprises, right = df_name_geographic, left_on = 'CODGEO', right_on = 'code_insee')
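
The default inner join silently drops communes that appear in only one of the two tables. A way to measure that loss, sketched on made-up frames that reuse the key names above, is to run an outer join with `indicator=True`, and to add `validate` so that pandas raises an error if a key is unexpectedly duplicated:

```python
import pandas as pd

# Toy frames (made up) with the same key columns as the real merge
left = pd.DataFrame({"CODGEO": ["01001", "01002", "01004"],
                     "E14TST": [25, 10, 996]})
right = pd.DataFrame({"code_insee": ["01001", "01004", "01005"],
                      "nom_commune": ["A", "B", "C"]})

# how="outer" keeps unmatched rows; _merge records where each row came from
checked = pd.merge(
    left, right,
    left_on="CODGEO", right_on="code_insee",
    how="outer", indicator=True, validate="one_to_one",
)

# Count matched rows versus rows present on one side only
counts = checked["_merge"].value_counts()
print(counts)
```

If the `left_only` and `right_only` counts are small and explainable, switching back to the inner join is a reasonable choice.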


#### Section 3.3. df_salary dataset preprocessing <a class="anchor" id="section_3_3"></a>

In [None]:
# Verify the unique lengths of strings in the 'CODGEO' column of df

unique_lengths_df_salary = get_unique_lengths(df_salary['CODGEO'])
print("Unique lengths in 'CODGEO' column, df_salary:", unique_lengths_df_salary)

In [None]:
# Adding leading zeros for the CODGEO column

add_leading_zeros(df_salary, 'CODGEO', 5)

In [None]:
print("Unique lengths in 'CODGEO' column, df_salary:", get_unique_lengths(df_salary['CODGEO']))

#### Section 3.4. df_population dataset preprocessing <a class="anchor" id="section_3_4"></a>

In [None]:
# Verify the unique lengths of strings in the 'CODGEO' column of df

unique_lengths_df_population = get_unique_lengths(df_population['CODGEO'])

print("Unique lengths in 'CODGEO' column, df_population:", unique_lengths_df_population)

### Chapter 4. <a class="anchor" id="chapter4"></a>

#### Section 4.1.  <a class="anchor" id="section_4_1"></a>

In [None]:
summary(df_entreprises)
summary(df_salary)
summary(df_population)
