# Machine Learning Project - Group 1 - R Notebook

Cohort 58 - 4GMM, academic year 2023-2024\
Julia Manon (B), Piot Damien (A), Dumas Thomas (B), Ben Abdallah Selim (A)

The dataset contains 3649 observations and 21 variables describing energy consumption and
geographic characteristics of 176 countries over the years 2000 to 2020.

---

The variables are as follows:\
• **Entity**: Name of the country or region for which the data are reported.\
• **Year**: Year for which the data are reported, between 2000 and 2020.\
• **Access to electricity (% of population)**: Percentage of the population with access to electricity.\
• **Access to clean fuels for cooking (% of population)**: Percentage of the population that relies primarily on clean fuels.\
• **Renewable-electricity-generating-capacity-per-capita**: Installed renewable energy capacity per person.\
• **Financial flows to developing countries (US Dollars)**: Aid and assistance from developed countries for clean energy projects.\
• **Renewable energy share in total final energy consumption (%)**: Percentage of renewable energy in final energy consumption.\
• **Electricity from fossil fuels (TWh)**: Electricity generated from fossil fuels (coal, oil, gas), in terawatt-hours.\
• **Electricity from nuclear (TWh)**: Electricity generated from nuclear power, in terawatt-hours.\
• **Electricity from renewables (TWh)**: Electricity generated from renewable sources (hydropower, solar, wind, etc.), in terawatt-hours.\
• **Low-carbon electricity (% electricity)**: Percentage of electricity coming from low-carbon sources (nuclear and renewables).\
• **Primary energy consumption per capita (kWh/person)**: Energy consumption per person, in kilowatt-hours.\
• **Energy intensity level of primary energy (MJ/2011 PPP GDP)**: Energy consumption per unit of GDP at purchasing power parity.\
• **Value-co2-emissions (metric tons per capita)**: Carbon dioxide emissions per person, in metric tons.\
• **Renewables (% equivalent primary energy)**: Equivalent primary energy coming from renewable sources.\
• **GDP growth (annual %)**: Annual GDP growth rate in constant local currency.\
• **GDP per capita**: Gross domestic product (GDP) per person.\
• **Density (P/Km2)**: Population density, in people per square kilometre.\
• **Land Area (Km2)**: Total land area, in square kilometres.\
• **Latitude**: Latitude of the country's centroid, in decimal degrees.\
• **Longitude**: Longitude of the country's centroid, in decimal degrees.

---

**Objective**: We aim to predict the variable **Value-co2-emissions** from the other variables.

---

# 1 - Getting started with the data and exploratory analysis

In [1]:
file_path <- "ProjetML/global-data-on-sustainable-energy .csv"
data <- read.csv(file_path)

head(data)

Unnamed: 0_level_0,Entity,Year,Access.to.electricity....of.population.,Access.to.clean.fuels.for.cooking,Renewable.electricity.generating.capacity.per.capita,Financial.flows.to.developing.countries..US...,Renewable.energy.share.in.the.total.final.energy.consumption....,Electricity.from.fossil.fuels..TWh.,Electricity.from.nuclear..TWh.,Electricity.from.renewables..TWh.,⋯,Primary.energy.consumption.per.capita..kWh.person.,Energy.intensity.level.of.primary.energy..MJ..2017.PPP.GDP.,Value_co2_emissions_kt_by_country,Renewables....equivalent.primary.energy.,gdp_growth,gdp_per_capita,Density.n.P.Km2.,Land.Area.Km2.,Latitude,Longitude
Unnamed: 0_level_1,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<int>,<dbl>,<dbl>
1,Afghanistan,2000,1.613591,6.2,9.22,20000.0,44.99,0.16,0,0.31,⋯,302.5948,1.64,760,,,,60,652230,33.93911,67.70995
2,Afghanistan,2001,4.074574,7.2,8.86,130000.0,45.6,0.09,0,0.5,⋯,236.8919,1.74,730,,,,60,652230,33.93911,67.70995
3,Afghanistan,2002,9.409158,8.2,8.47,3950000.0,37.83,0.13,0,0.56,⋯,210.8622,1.4,1030,,,179.4266,60,652230,33.93911,67.70995
4,Afghanistan,2003,14.738506,9.5,8.09,25970000.0,36.66,0.31,0,0.63,⋯,229.9682,1.4,1220,,8.832278,190.6838,60,652230,33.93911,67.70995
5,Afghanistan,2004,20.064968,10.9,7.75,,44.24,0.33,0,0.56,⋯,204.2312,1.2,1030,,1.414118,211.3821,60,652230,33.93911,67.70995
6,Afghanistan,2005,25.390894,12.2,7.51,9830000.0,33.88,0.34,0,0.59,⋯,252.0691,1.41,1550,,11.229715,242.0313,60,652230,33.93911,67.70995
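The column names in the output above differ from the labels in the variable list because `read.csv` sanitises headers by default. The sketch below illustrates this on a two-line CSV passed through the `text` argument; the CSV content and the variable names `csv_text`, `default_names` and `raw_names` are made up for the illustration and are not part of the project.

```r
# Tiny in-memory CSV (invented values) with one header taken from the dataset.
csv_text <- 'Entity,Access to electricity (% of population)
A,50
B,100'

# Default behaviour: check.names = TRUE turns spaces, "(", "%" and ")" into dots.
default_names <- names(read.csv(text = csv_text))

# With check.names = FALSE the headers are kept exactly as written in the file.
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))

print(default_names)
print(raw_names)
```

Keeping the raw headers makes the columns harder to reference with `$`, which is one reason to rename them to short identifiers afterwards.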


In [2]:
summary(data)

    Entity               Year      Access.to.electricity....of.population.
 Length:3649        Min.   :2000   Min.   :  1.252                        
 Class :character   1st Qu.:2005   1st Qu.: 59.801                        
 Mode  :character   Median :2010   Median : 98.362                        
                    Mean   :2010   Mean   : 78.934                        
                    3rd Qu.:2015   3rd Qu.:100.000                        
                    Max.   :2020   Max.   :100.000                        
                                   NA's   :10                             
 Access.to.clean.fuels.for.cooking
 Min.   :  0.00                   
 1st Qu.: 23.18                   
 Median : 83.15                   
 Mean   : 63.26                   
 3rd Qu.:100.00                   
 Max.   :100.00                   
 NA's   :169                      
 Renewable.electricity.generating.capacity.per.capita
 Min.   :   0.00                                     
 1st Qu.:   

## 1.1 - Checking variable types and encoding

In [3]:
data$Entity <- as.factor(data$Entity) # convert Entity to a categorical variable (factor)
data$Year <- as.factor(data$Year) # convert Year to a categorical variable (factor)
data$Density.n.P.Km2. <- as.numeric(gsub(",", "", data$Density.n.P.Km2.)) # remove the commas, then convert Density (P/Km2) to numeric

head(data)
#summary(data)

Unnamed: 0_level_0,Entity,Year,Access.to.electricity....of.population.,Access.to.clean.fuels.for.cooking,Renewable.electricity.generating.capacity.per.capita,Financial.flows.to.developing.countries..US...,Renewable.energy.share.in.the.total.final.energy.consumption....,Electricity.from.fossil.fuels..TWh.,Electricity.from.nuclear..TWh.,Electricity.from.renewables..TWh.,⋯,Primary.energy.consumption.per.capita..kWh.person.,Energy.intensity.level.of.primary.energy..MJ..2017.PPP.GDP.,Value_co2_emissions_kt_by_country,Renewables....equivalent.primary.energy.,gdp_growth,gdp_per_capita,Density.n.P.Km2.,Land.Area.Km2.,Latitude,Longitude
Unnamed: 0_level_1,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<dbl>
1,Afghanistan,2000,1.613591,6.2,9.22,20000.0,44.99,0.16,0,0.31,⋯,302.5948,1.64,760,,,,60,652230,33.93911,67.70995
2,Afghanistan,2001,4.074574,7.2,8.86,130000.0,45.6,0.09,0,0.5,⋯,236.8919,1.74,730,,,,60,652230,33.93911,67.70995
3,Afghanistan,2002,9.409158,8.2,8.47,3950000.0,37.83,0.13,0,0.56,⋯,210.8622,1.4,1030,,,179.4266,60,652230,33.93911,67.70995
4,Afghanistan,2003,14.738506,9.5,8.09,25970000.0,36.66,0.31,0,0.63,⋯,229.9682,1.4,1220,,8.832278,190.6838,60,652230,33.93911,67.70995
5,Afghanistan,2004,20.064968,10.9,7.75,,44.24,0.33,0,0.56,⋯,204.2312,1.2,1030,,1.414118,211.3821,60,652230,33.93911,67.70995
6,Afghanistan,2005,25.390894,12.2,7.51,9830000.0,33.88,0.34,0,0.59,⋯,252.0691,1.41,1550,,11.229715,242.0313,60,652230,33.93911,67.70995
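The density conversion can be checked in isolation. The following is a minimal sketch on invented values, assuming the column was read as character because some entries contain commas; `density_raw`, `density_num` and `year` are hypothetical names used only here.

```r
# Toy character vector mimicking the raw density column (values are invented).
density_raw <- c("60", "1,380", "25", NA)

# Remove every comma, then convert to numeric; a missing entry stays NA.
density_num <- as.numeric(gsub(",", "", density_raw))
print(density_num)

# as.factor() stores the distinct values as levels, as done for Entity and Year.
year <- as.factor(c(2000, 2001, 2000))
print(levels(year))
```

Calling `as.numeric` directly on `"1,380"` would give `NA` with a warning, which is why the commas are stripped first.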


As in the Python notebook, we rename the columns using the same naming scheme.

In [4]:
print(colnames(data))

 [1] "Entity"                                                          
 [2] "Year"                                                            
 [3] "Access.to.electricity....of.population."                         
 [4] "Access.to.clean.fuels.for.cooking"                               
 [5] "Renewable.electricity.generating.capacity.per.capita"            
 [6] "Financial.flows.to.developing.countries..US..."                  
 [7] "Renewable.energy.share.in.the.total.final.energy.consumption...."
 [8] "Electricity.from.fossil.fuels..TWh."                             
 [9] "Electricity.from.nuclear..TWh."                                  
[10] "Electricity.from.renewables..TWh."                               
[11] "Low.carbon.electricity....electricity."                          
[12] "Primary.energy.consumption.per.capita..kWh.person."              
[13] "Energy.intensity.level.of.primary.energy..MJ..2017.PPP.GDP."     
[14] "Value_co2_emissions_kt_by_country"                        

In [5]:
# Rename columns 3 to 19 by position; Entity, Year, Latitude and Longitude keep their names
names(data)[3] <- "Elec_access"
names(data)[4] <- "Clean_access"
names(data)[5] <- "Renewable_per_capita" # dropped in section 1.2 (too many missing values)
names(data)[6] <- "Financial_flows" # dropped in section 1.2 (too many missing values)
names(data)[7] <- "Renewable_share"
names(data)[8] <- "Fossil_elec"
names(data)[9] <- "Nuclear_elec"
names(data)[10] <- "Renewable_elec"
names(data)[11] <- "Low_carb_elec"
names(data)[12] <- "Energy_per_capita"
names(data)[13] <- "PEnergy_intensity"
names(data)[14] <- "CO2"
names(data)[15] <- "Renewables" # dropped in section 1.2 (too many missing values)
names(data)[16] <- "Growth"
names(data)[17] <- "GDP_per_capita"
names(data)[18] <- "Density"
names(data)[19] <- "Area"

#print(colnames(data))
head(data)

Unnamed: 0_level_0,Entity,Year,Elec_access,Clean_access,Renewable_per_capita,Financial_flows,Renewable_share,Fossil_elec,Nuclear_elec,Renewable_elec,⋯,Energy_per_capita,PEnergy_intensity,CO2,Renewables,Growth,GDP_per_capita,Density,Area,Latitude,Longitude
Unnamed: 0_level_1,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<dbl>
1,Afghanistan,2000,1.613591,6.2,9.22,20000.0,44.99,0.16,0,0.31,⋯,302.5948,1.64,760,,,,60,652230,33.93911,67.70995
2,Afghanistan,2001,4.074574,7.2,8.86,130000.0,45.6,0.09,0,0.5,⋯,236.8919,1.74,730,,,,60,652230,33.93911,67.70995
3,Afghanistan,2002,9.409158,8.2,8.47,3950000.0,37.83,0.13,0,0.56,⋯,210.8622,1.4,1030,,,179.4266,60,652230,33.93911,67.70995
4,Afghanistan,2003,14.738506,9.5,8.09,25970000.0,36.66,0.31,0,0.63,⋯,229.9682,1.4,1220,,8.832278,190.6838,60,652230,33.93911,67.70995
5,Afghanistan,2004,20.064968,10.9,7.75,,44.24,0.33,0,0.56,⋯,204.2312,1.2,1030,,1.414118,211.3821,60,652230,33.93911,67.70995
6,Afghanistan,2005,25.390894,12.2,7.51,9830000.0,33.88,0.34,0,0.59,⋯,252.0691,1.41,1550,,11.229715,242.0313,60,652230,33.93911,67.70995
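Renaming by position works as long as the column order never changes. As an alternative, here is a small sketch of renaming by name through a lookup vector. It is not the approach used in this notebook; the data frame `toy` and the vector `new_names` are invented for the example.

```r
# Toy data frame reusing two of the column names produced by read.csv.
toy <- data.frame(
  Entity = c("A", "B"),
  Access.to.electricity....of.population. = c(50, 100),
  gdp_growth = c(1.2, 3.4)
)

# Lookup vector: names are the old column names, values are the new ones.
new_names <- c(
  "Access.to.electricity....of.population." = "Elec_access",
  "gdp_growth" = "Growth"
)

# Find where each old name sits, then overwrite only those positions.
idx <- match(names(new_names), names(toy))
names(toy)[idx] <- unname(new_names)

print(names(toy))
```

With this variant a reordered or missing column gives an `NA` in `idx` and the assignment fails, instead of silently renaming the wrong column.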


## 1.2 - Missing-value rate for each variable

In [6]:
missing_rates <- sapply(data, function(x) mean(is.na(x)))
missing <-  missing_rates * 100
print(missing)

              Entity                 Year          Elec_access 
          0.00000000           0.00000000           0.27404768 
        Clean_access Renewable_per_capita      Financial_flows 
          4.63140586          25.51383941          57.24856125 
     Renewable_share          Fossil_elec         Nuclear_elec 
          5.31652508           0.57550014           3.45300082 
      Renewable_elec        Low_carb_elec    Energy_per_capita 
          0.57550014           1.15100027           0.00000000 
   PEnergy_intensity                  CO2           Renewables 
          5.67278706          11.72924089          58.56399013 
              Growth       GDP_per_capita              Density 
          8.68731159           7.72814470           0.02740477 
                Area             Latitude            Longitude 
          0.02740477           0.02740477           0.02740477 
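The same rates can be obtained with `colMeans(is.na(...))` and then sorted, which makes the most incomplete variables easier to spot. The sketch below uses an invented three-column data frame; `toy`, `missing_pct` and `missing_sorted` are names chosen for the illustration.

```r
# Toy data frame with 25%, 75% and 0% missing values respectively.
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, NA, 4),
  c = c(1, 2, 3, 4)
)

# Column-wise share of NA values, in percent, computed two equivalent ways.
missing_pct <- colMeans(is.na(toy)) * 100
missing_pct_sapply <- sapply(toy, function(x) mean(is.na(x))) * 100

# Sort so that the variables with the most missing values come first.
missing_sorted <- sort(missing_pct, decreasing = TRUE)
print(round(missing_sorted, 2))
```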


We observe that the variables **Renewable-electricity-generating-capacity-per-capita** (renamed **Renewable_per_capita**), **Financial flows to developing countries (US Dollars)** (renamed **Financial_flows**) and **Renewables (% equivalent primary energy)** (renamed **Renewables**) each have more than 25% missing values (more than 57% for **Financial_flows** and **Renewables**). We therefore remove them.

In [7]:
# Remove the columns whose missing-value rate is too high
data$Financial_flows <- NULL
data$Renewables <- NULL
data$Renewable_per_capita <- NULL

#head(data)
colnames(data) # check that the three columns have been removed
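The three columns are removed by hand here. If the selection had to be repeated on another extract, it could be driven by a threshold instead. The sketch below assumes a 25% cut-off, chosen only to mirror the observation above and not stated as a rule in this project; `toy`, `threshold`, `keep_cols` and `toy_reduced` are hypothetical names.

```r
# Toy data frame: one column is 75% missing, the others are complete.
toy <- data.frame(
  id = 1:4,
  mostly_missing = c(NA, NA, NA, 1),
  complete = c(10, 20, 30, 40)
)

threshold <- 25  # assumed cut-off, in percent of missing values

# Keep only the columns whose missing-value rate does not exceed the cut-off.
missing_pct <- colMeans(is.na(toy)) * 100
keep_cols <- names(missing_pct)[missing_pct <= threshold]
toy_reduced <- toy[, keep_cols, drop = FALSE]

print(names(toy_reduced))
```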

## 1.3 - Keeping only the observations with no missing values

In [8]:
nb_ind <- nrow(data)
print(paste("Current number of observations: ", nb_ind)) # 3649 observations
head(data)
write.csv(data, file = "test_dataframe.csv", row.names = FALSE)

[1] "Current number of observations:  3649"


Unnamed: 0_level_0,Entity,Year,Elec_access,Clean_access,Renewable_share,Fossil_elec,Nuclear_elec,Renewable_elec,Low_carb_elec,Energy_per_capita,PEnergy_intensity,CO2,Growth,GDP_per_capita,Density,Area,Latitude,Longitude
Unnamed: 0_level_1,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<dbl>
1,Afghanistan,2000,1.613591,6.2,44.99,0.16,0,0.31,65.95744,302.5948,1.64,760,,,60,652230,33.93911,67.70995
2,Afghanistan,2001,4.074574,7.2,45.6,0.09,0,0.5,84.74577,236.8919,1.74,730,,,60,652230,33.93911,67.70995
3,Afghanistan,2002,9.409158,8.2,37.83,0.13,0,0.56,81.15942,210.8622,1.4,1030,,179.4266,60,652230,33.93911,67.70995
4,Afghanistan,2003,14.738506,9.5,36.66,0.31,0,0.63,67.02128,229.9682,1.4,1220,8.832278,190.6838,60,652230,33.93911,67.70995
5,Afghanistan,2004,20.064968,10.9,44.24,0.33,0,0.56,62.92135,204.2312,1.2,1030,1.414118,211.3821,60,652230,33.93911,67.70995
6,Afghanistan,2005,25.390894,12.2,33.88,0.34,0,0.59,63.44086,252.0691,1.41,1550,11.229715,242.0313,60,652230,33.93911,67.70995


In [9]:
# Remove the observations that have at least one missing value
data <- na.omit(data) # drops every row containing an NA in any remaining column
new_nb_ind <- nrow(data)
print(paste("Number of observations after removing missing values: ", new_nb_ind))

[1] "Number of observations after removing missing values:  2868"
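To see which variables are responsible for the rows lost at this step, one option is to combine `complete.cases` with a per-column count of `NA` among the removed rows. This is a sketch on invented data, not a result from the project; `toy`, `keep`, `clean`, `n_removed` and `na_per_col` are hypothetical names.

```r
# Toy data frame: rows 2 and 3 each contain one missing value.
toy <- data.frame(
  x = c(1, NA, 3, 4, 5),
  y = c(1, 2, NA, 4, 5),
  z = c(1, 2, 3, 4, 5)
)

# complete.cases() flags the rows without any NA; na.omit() keeps the same rows.
keep <- complete.cases(toy)
clean <- toy[keep, ]
n_removed <- sum(!keep)

# Among the removed rows, count the NA values contributed by each column.
na_per_col <- colSums(is.na(toy[!keep, , drop = FALSE]))

print(n_removed)
print(na_per_col)
```

A column with a large count in `na_per_col` is a candidate for removal or imputation before calling `na.omit`, if too many observations are being discarded.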


## 1.4 - Univariate descriptive analysis of the data