># **Preprocessing: Handling Missing Values and Variable Separation**
>
> The dataframe used in this notebook originates from the preprocessing steps performed in
> notebooks 1_0, 1_1, and 1_2 in the app_emulation folder.
> The selected features are refined further here, in line with the project's aim of exploring different approaches to the dataframe in order to study the relationships that determine, and help explain, the behavior of the Electric Energy Consumption variable.



In [None]:
# Importing Required Libraries

import pandas as pd
import numpy as np

In [None]:
# Loading the Dataset

df = pd.read_csv("23_22_21-eea_europa_eu-CarsCO2_combustion.csv", sep = ",", na_values = ["nan", "None", "null", "NA", "N/A", "n/a", ""], keep_default_na = True)
df.head()



Unnamed: 0,ID,member_state,manufacturer_name_eu,vehicle_type,commercial_name,category_of_vehicle,fuel_type,fuel_mode,innovative_technologies,mass_vehicle,weltp_test_mass,engine_capacity,engine_power,erwltp,year,electric_range,electric_energy_consumption,fuel_consumption,specific_co2_emissions
0,56002959,GR,HYUNDAI,OS,"KONA,KAUAI",M1,diesel,M,,1415.0,1600.0,1598.0,100.0,,2021,,,,127.0
1,56002960,GR,HYUNDAI,OS,"KONA,KAUAI",M1,diesel,M,,1415.0,1600.0,1598.0,100.0,,2021,,,,127.0
2,56002961,GR,HYUNDAI,OS,"KONA,KAUAI",M1,diesel,M,,1415.0,1600.0,1598.0,100.0,,2021,,,,127.0
3,56002962,GR,HYUNDAI,OS,"KONA,KAUAI",M1,diesel,M,,1415.0,1600.0,1598.0,100.0,,2021,,,,127.0
4,56002963,GR,HYUNDAI,OS,"KONA,KAUAI",M1,diesel,M,,1415.0,1600.0,1598.0,100.0,,2021,,,,127.0
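
With roughly 26 million rows, how the file is parsed affects both memory and the resulting columns. The sketch below is a minimal illustration, not the notebook's actual loading code. It assumes that the `Unnamed: 0` column in the preview is a row index written by an earlier export, and that low-cardinality text columns can be stored as `category`. The inline sample and the `dtype_map` choices are made up for the example.

```python
import io
import pandas as pd

# Tiny inline sample standing in for the real CSV (illustrative values only).
sample_csv = io.StringIO(
    "Unnamed: 0,ID,member_state,fuel_type,mass_vehicle,electric_range\n"
    "0,1,GR,diesel,1415.0,\n"
    "1,2,GR,diesel,1415.0,\n"
    "2,3,DE,petrol,1300.0,\n"
)

# Assumed dtype choices: repeated labels are stored compactly as 'category'.
dtype_map = {"member_state": "category", "fuel_type": "category"}

df_sample = pd.read_csv(
    sample_csv,
    index_col=0,      # consume the stray 'Unnamed: 0' column as the row index
    dtype=dtype_map,
)

print(df_sample.dtypes)
print(df_sample.memory_usage(deep=True))
```

On the full file, the same two arguments could be added to the `pd.read_csv` call above. Whether `category` actually saves memory depends on how many distinct values each column holds.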


In [None]:
# Inspecting Columns and Shape

print(df.columns)
print(df.shape)

Index(['ID', 'member_state', 'manufacturer_name_eu', 'vehicle_type',
       'commercial_name', 'category_of_vehicle', 'fuel_type', 'fuel_mode',
       'innovative_technologies', 'mass_vehicle', 'weltp_test_mass',
       'engine_capacity', 'engine_power', 'erwltp', 'year', 'electric_range',
       'electric_energy_consumption', 'fuel_consumption',
       'specific_co2_emissions'],
      dtype='object')
(26186032, 19)


In [None]:
# Dataset Info Overview

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26186032 entries, 56002959 to 140000058
Data columns (total 18 columns):
 #   Column                       Dtype  
---  ------                       -----  
 0   member_state                 object 
 1   manufacturer_name_eu         object 
 2   vehicle_type                 object 
 3   commercial_name              object 
 4   category_of_vehicle          object 
 5   fuel_type                    object 
 6   fuel_mode                    object 
 7   innovative_technologies      object 
 8   mass_vehicle                 float64
 9   weltp_test_mass              float64
 10  engine_capacity              float64
 11  engine_power                 float64
 12  erwltp                       float64
 13  year                         int64  
 14  electric_range               float64
 15  electric_energy_consumption  float64
 16  fuel_consumption             float64
 17  specific_co2_emissions       float64
dtypes: float64(9), int64(1), object(8)
me

In [None]:
# Indexing the DataFrame by ID Column
df.set_index("ID", inplace = True)

In [None]:
# Missing Value Imputation and Analysis

# Calculate the percentage of missing values for each column in the entire DataFrame
missing_percentage = df.isna().mean() * 100

# Display the percentage of missing values per column
print("Percentage of missing values per column:")
print(missing_percentage)

# Define the threshold for dropping rows (10%)
threshold = 10.0

# Identify columns where missing values are below the threshold
columns_to_drop_rows = [col for col in df.columns if missing_percentage[col] < threshold]

# Drop rows where these columns have missing values
df_cleaned = df.dropna(subset = columns_to_drop_rows).copy()  # Create an explicit copy

# Identify categorical columns and fill missing values with 'None' (indicating no information)
categorical_columns = df_cleaned.select_dtypes(include=["object", "category"]).columns
df_cleaned.loc[:, categorical_columns] = df_cleaned[categorical_columns].fillna("None")

# Ensure 'vehicle_type' and 'commercial_name' are treated as strings
df_cleaned.loc[:, "vehicle_type"] = df_cleaned["vehicle_type"].astype(str)
df_cleaned.loc[:, "commercial_name"] = df_cleaned["commercial_name"].astype(str)

# Replace missing values in 'innovative_technologies' with "NonTech", ensuring any existing "None" is replaced
df_cleaned.loc[:, "innovative_technologies"] = df_cleaned["innovative_technologies"].replace("None", "NonTech").fillna("NonTech")

# For the following numerical columns, missing values or the string "None" are replaced with 0
numerical_fix_columns = ["erwltp", "electric_range", "electric_energy_consumption"]

for col in numerical_fix_columns:
    if col in df_cleaned.columns:
        # Coerce to numeric first so that "None" and any other non-numeric entry becomes NaN
        numeric_values = pd.to_numeric(df_cleaned[col], errors = "coerce")
        # Then fill every NaN with 0; filling after coercion guarantees that no NaN remains
        df_cleaned[col] = numeric_values.fillna(0)

# Overwrite the original df with the cleaned version
df = df_cleaned

# Display the first few rows to verify the changes
print(df.head())

Percentage of missing values per column:
member_state                   0.0
manufacturer_name_eu           0.0
vehicle_type                   0.0
commercial_name                0.0
category_of_vehicle            0.0
fuel_type                      0.0
fuel_mode                      0.0
innovative_technologies        0.0
mass_vehicle                   0.0
weltp_test_mass                0.0
engine_capacity                0.0
engine_power                   0.0
erwltp                         0.0
year                           0.0
electric_range                 0.0
electric_energy_consumption    0.0
fuel_consumption               0.0
specific_co2_emissions         0.0
dtype: float64
         member_state manufacturer_name_eu vehicle_type commercial_name  \
ID                                                                        
56003309           GR               TOYOTA    XA5(EU,M)     TOYOTA RAV4   
56003313           GR               TOYOTA    XA5(EU,M)     TOYOTA RAV4   
56003314      
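
The cleaning logic of the cell above can be condensed into one function and checked on a toy frame. This is a sketch under stated assumptions, not the notebook's exact code: the function name `clean_missing`, the `zero_fill` parameter, and the toy values are hypothetical, and every non-numeric column is assumed to hold text.

```python
import numpy as np
import pandas as pd

def clean_missing(df, threshold=10.0, zero_fill=("electric_range",)):
    """Drop rows for sparsely-missing columns, then impute what is left."""
    missing_pct = df.isna().mean() * 100
    # Columns with some, but less than `threshold` percent, missing values.
    sparse_cols = [c for c in df.columns if 0 < missing_pct[c] < threshold]
    out = df.dropna(subset=sparse_cols).copy()

    # Non-numeric columns (assumed to be text): mark absent information.
    text_cols = out.select_dtypes(exclude="number").columns
    out[text_cols] = out[text_cols].fillna("None")

    # Numeric columns where a missing value is taken to mean zero.
    for col in zero_fill:
        if col in out.columns:
            out[col] = pd.to_numeric(out[col], errors="coerce").fillna(0)
    return out

# Toy data (illustrative only): 50% / 25% / 75% missing per column.
toy = pd.DataFrame({
    "fuel_type": ["diesel", None, None, "petrol"],
    "mass_vehicle": [1415.0, 1300.0, np.nan, 1500.0],
    "electric_range": [np.nan, np.nan, np.nan, 50.0],
})

cleaned = clean_missing(toy, threshold=30.0)
print(cleaned)
```

With a 30% threshold, only `mass_vehicle` falls below the cutoff, so its incomplete row is dropped, while the other two columns are imputed.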

In [None]:
# Splitting the Data into Multiple DataFrames Based on Domain Knowledge

# Define columns for each model
model_identification_cols = ["member_state", "manufacturer_name_eu", "vehicle_type", "commercial_name", "year", "category_of_vehicle", "fuel_type", "fuel_mode", "electric_energy_consumption"]
model_prediction_cols = ["mass_vehicle", "weltp_test_mass", "engine_capacity", "engine_power", "erwltp", "year", "electric_range", "fuel_consumption", "specific_co2_emissions", "innovative_technologies", "fuel_type", "fuel_mode", "electric_energy_consumption"]

# Create separate dataframes
df_model_identification = df[model_identification_cols]
df_model_prediction = df[model_prediction_cols]

# Ensure target variable is the last column
df_model_identification = df_model_identification[[col for col in model_identification_cols if col != "electric_energy_consumption"] + ["electric_energy_consumption"]]
df_model_prediction = df_model_prediction[[col for col in model_prediction_cols if col != "electric_energy_consumption"] + ["electric_energy_consumption"]]

# Export to CSV
df_model_identification.to_csv("model_identification_data.csv", index = True)
df_model_prediction.to_csv("model_prediction_data.csv", index = True)

# Display confirmation
print("\n✅ DataFrames successfully created and exported.")



✅ DataFrames successfully created and exported.
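
One thing to watch when the exported files are read back: the loading call at the top of this notebook lists `"None"` in `na_values`, so a reload with the same arguments would turn the `"None"` placeholders back into NaN. The sketch below shows a round trip that keeps them. It uses an in-memory buffer instead of a file, and the toy values are illustrative.

```python
import io
import pandas as pd

target = "electric_energy_consumption"

# Toy frame indexed by ID (illustrative values only).
toy = pd.DataFrame(
    {
        "fuel_mode": ["None", "M"],
        target: [0.0, 150.0],
        "engine_power": [100.0, 85.0],
    },
    index=pd.Index([101, 102], name="ID"),
)

# Move the target to the last position, as in the cell above.
ordered = toy[[c for c in toy.columns if c != target] + [target]]

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
ordered.to_csv(buffer, index=True)
buffer.seek(0)

# Only empty strings count as missing, so the "None" placeholder survives.
restored = pd.read_csv(
    buffer, index_col="ID", keep_default_na=False, na_values=[""]
)

print(restored.columns[-1])
print(restored["fuel_mode"].tolist())
```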


# Model Separation: Identification vs. Prediction

The dataset has been split into two subsets, one for each modeling objective:

1. **Prediction Model (Causal Estimation)**  
   This model is designed to estimate the **causal relationship** between vehicle attributes and electric energy consumption. It includes variables that likely have a **direct physical impact** on energy consumption, such as vehicle mass, engine power, fuel consumption, and electric range.  
   - **Goal:** Develop a regression model to predict electric energy consumption based on vehicle specifications.  
   - **Next Steps:** Apply regression-based techniques (linear regression, tree-based models, or neural networks) to capture the relationship between predictors and energy consumption.

2. **Identification Model (Market Patterns & Correlation)**  
   This model focuses on **identifying patterns** in energy consumption based on contextual and categorical variables, such as the manufacturer, member state, and vehicle category. Rather than direct causality, this model explores **statistical correlations** and market trends.  
   - **Goal:** Understand the likelihood of energy consumption levels based on categorical attributes.  
   - **Next Steps:** Use classification models, clustering, or probabilistic methods to segment vehicles and predict typical energy consumption patterns.
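
As a first look at the identification side, consumption can be summarized per category before any classifier or clustering method is applied. This is a minimal sketch on made-up data. The manufacturer labels `A` and `B` are placeholders, and grouping by manufacturer is just one possible choice.

```python
import pandas as pd

# Toy identification data (illustrative values, placeholder manufacturers).
toy = pd.DataFrame({
    "manufacturer_name_eu": ["A", "A", "B", "B", "B"],
    "electric_energy_consumption": [150.0, 170.0, 0.0, 0.0, 200.0],
})

# Per-manufacturer summary of the target, highest average first.
summary = (
    toy.groupby("manufacturer_name_eu")["electric_energy_consumption"]
    .agg(["count", "mean", "median"])
    .sort_values("mean", ascending=False)
)

print(summary)
```

A gap between mean and median, as for manufacturer `B` here, suggests that zero-filled combustion vehicles dominate the group and may need separate treatment.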

The two models serve different but complementary purposes. The prediction model takes an analytical approach, estimating energy consumption from fundamental vehicle characteristics, while the identification model helps to recognize **market-driven trends** and consumer behavior.

Splitting the data into two datasets is not meant to pursue both modeling strategies right away. Its main purpose is to lay out the identified options clearly and keep them available for future work. The immediate goal of this project is to process the prediction dataset, using models of varying complexity.

These models will be developed separately, allowing flexibility in selecting appropriate machine learning techniques for each task.
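
For the prediction dataset, a natural lowest-complexity starting point is ordinary least squares. The sketch below is a baseline under stated assumptions, not the project's chosen model. The data is synthetic with a known linear structure, only two predictors are used, and the 80/20 split by row position is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors (not real vehicle data).
toy = pd.DataFrame({
    "mass_vehicle": rng.uniform(1200, 2400, n),
    "engine_power": rng.uniform(60, 250, n),
})
# Synthetic target: known linear relation plus Gaussian noise.
toy["electric_energy_consumption"] = (
    0.08 * toy["mass_vehicle"] + 0.1 * toy["engine_power"] + rng.normal(0, 5, n)
)

X = toy[["mass_vehicle", "engine_power"]].to_numpy()
y = toy["electric_energy_consumption"].to_numpy()

# Hold out the last 20% of rows for evaluation.
split = int(0.8 * n)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Ordinary least squares with an explicit intercept column.
A_train = np.column_stack([np.ones(len(X_train)), X_train])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

# Root mean squared error on the held-out rows.
A_test = np.column_stack([np.ones(len(X_test)), X_test])
pred = A_test @ coef
rmse = float(np.sqrt(np.mean((y_test - pred) ** 2)))

print("coefficients (intercept, mass, power):", coef)
print("held-out RMSE:", rmse)
```

The held-out error of such a baseline gives a reference point against which tree-based models or neural networks can later be compared.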
