[Dataset - General/Initial Structure](https://www.eea.europa.eu/en/datahub/datahubitem-view/fa8b1229-3db6-495d-b18e-9c9b3267c02b?activeAccordion=)

|Name|Definition|Datatype|Cardinality|Relevance Comment|
|---|---|---|---|---|
|ID|Identification number.|integer|1..1|mapping/identification only|
|MS|Member state.|varchar(2)|0..1|only indirect influence?|
|Mp|Manufacturer pooling.|varchar(50)|0..1|mapping/identification only|
|VFN|Vehicle family identification number.|varchar(25)|0..1|mapping/identification only|
|Mh|Manufacturer name EU standard denomination.|varchar(50)|0..1|mapping/identification only|
|Man|Manufacturer name OEM declaration.|varchar(50)|0..1|mapping/identification only|
|MMS|Manufacturer name MS registry denomination.|varchar(125)|0..1|mapping/identification only|
|TAN|Type approval number.|varchar(50)|0..1|mapping/identification only|
|T|Type.|varchar(25)|0..1|mapping/identification only|
|Va|Variant.|varchar(25)|0..1|mapping/identification only|
|Ve|Version.|varchar(35)|0..1|mapping/identification only|
|Mk|Make.|varchar(25)|0..1|mapping/identification only|
|Cn|Commercial name.|varchar(50)|0..1|mapping/identification only|
|Ct|Category of the vehicle type approved.|varchar(5) |0..1|maybe correlated to fuel type or engine type?|
|Cr|Category of the vehicle registered.|varchar(5) |0..1|maybe correlated to fuel type or engine type?|
|M (kg)|Mass in running order Completed/complete vehicle.|integer|0..1|relevant?|
|Mt|WLTP test mass.|integer|0..1|relevant?|
|Enedc (g/km)|Specific CO2 Emissions (NEDC).|integer|0..1|older standard?|
|Ewltp (g/km)|Specific CO2 Emissions (WLTP).|integer|0..1|our target variable?|
|W (mm)|Wheel Base.|integer|0..1|potentially relevant (influence on size and weight?)|
|At1 (mm)|Axle width steering axle.|integer|0..1|potentially relevant (influence on size and weight?)|
|At2 (mm)|Axle width other axle.|integer|0..1|potentially relevant (influence on size and weight?)|
|Ft|Fuel type.|varchar(25)|0..1|highly relevant?|
|Fm|Fuel mode.|varchar(1) |0..1|relevant? (e.g. if hybrid)|
|Ec (cm3)|Engine capacity.|integer|0..1|relevant?|
|Ep (KW)|Engine power.|integer|0..1|relevant?|
|Z (Wh/km)|Electric energy consumption.|integer|0..1|tbd|
|IT|Innovative technology or group of innovative technologies.|varchar(25)|0..1|potentially relevant (influence of car characteristics, but maybe too superficial/complex)|
|Ernedc (g/km)|Emissions reduction through innovative technologies.|float|0..1|probably depending on IT value but with focus emissions -> relevant?|
|Erwltp (g/km)|Emissions reduction through innovative technologies (WLTP).|float|0..1|probably depending on IT value but with focus emissions -> relevant?|
|De|Deviation factor.|float|0..1|tbd|
|Vf|Verification factor.|integer|0..1|tbd|
|R|Total new registrations.|integer|0..1|tbd|
|Year|Reporting year.|integer|0..1|relevant?|
|Status|P = Provisional data, F = Final data.|varchar(1) |0..1|tbd|
|Version_file|Internal versioning of deliverables.|varchar(10)|0..1|tbd|
|E (g/km)|Specific CO2 Emission. Deprecated value, only relevant for data until 2016.|float|0..1|tbd|
|Er (g/km)|Emissions reduction through innovative technologies. Deprecated value, only relevant for data until 2016.|float|0..1|tbd|
|Zr|Electric range.|integer|0..1|tbd|
|Dr|Registration date.|date|0..1|tbd|
|Fc|Fuel consumption.|float|0..1|tbd|

In [1]:
"""
Read the raw data from .csv and give an overview of missing values and data types.
"""

import pandas as pd
from config import COLS_PRE_DROP, DENSITY_THRESHOLD, PRE_ANALYSIS_FILE

In [2]:
df = pd.read_csv(PRE_ANALYSIS_FILE)

print(f"Dimension: {df.shape}")
print(f"Columns: {df.columns}")
print(f"Data types:\n{df.dtypes}")


Dimension: (30134963, 40)
Columns: Index(['ID', 'Country', 'VFN', 'Mp', 'Mh', 'Man', 'MMS', 'Tan', 'T', 'Va',
       'Ve', 'Mk', 'Cn', 'Ct', 'Cr', 'r', 'm (kg)', 'Mt', 'Enedc (g/km)',
       'Ewltp (g/km)', 'W (mm)', 'At1 (mm)', 'At2 (mm)', 'Ft', 'Fm',
       'ec (cm3)', 'ep (KW)', 'z (Wh/km)', 'IT', 'Ernedc (g/km)',
       'Erwltp (g/km)', 'De', 'Vf', 'Status', 'year', 'Date of registration',
       'Fuel consumption ', 'ech', 'RLFI', 'Electric range (km)'],
      dtype='object')
Data types:
ID                        int64
Country                  object
VFN                      object
Mp                       object
Mh                       object
Man                      object
MMS                     float64
Tan                      object
T                        object
Va                       object
Ve                       object
Mk                       object
Cn                       object
Ct                       object
Cr                       object
r                     

In [3]:
df.head(5)

Unnamed: 0,ID,Country,VFN,Mp,Mh,Man,MMS,Tan,T,Va,...,Erwltp (g/km),De,Vf,Status,year,Date of registration,Fuel consumption,ech,RLFI,Electric range (km)
0,56002959,GR,IP-091932-KMH-1,HYUNDAI,HYUNDAI,HYUNDAI MOTOR COMPANY,,e4*2007/46*1259*11,OS,F5D11,...,,,,F,2021,2021-06-17,,,,
1,56002960,GR,IP-091932-KMH-1,HYUNDAI,HYUNDAI,HYUNDAI MOTOR COMPANY,,e4*2007/46*1259*11,OS,F5D11,...,,,,F,2021,2021-06-04,,,,
2,56002961,GR,IP-091932-KMH-1,HYUNDAI,HYUNDAI,HYUNDAI MOTOR COMPANY,,e4*2007/46*1259*11,OS,F5D11,...,,,,F,2021,2021-04-07,,,,
3,56002962,GR,IP-091932-KMH-1,HYUNDAI,HYUNDAI,HYUNDAI MOTOR COMPANY,,e4*2007/46*1259*11,OS,F5D11,...,,,,F,2021,2021-04-13,,,,
4,56002963,GR,IP-091932-KMH-1,HYUNDAI,HYUNDAI,HYUNDAI MOTOR COMPANY,,e4*2007/46*1259*11,OS,F5D11,...,,,,F,2021,2021-11-19,,,,


In [4]:
# Fraction of missing values per column (0.0 = complete, 1.0 = entirely empty)
missing_percentage = df.isna().sum() / len(df)
print(missing_percentage)

ID                      0.000000
Country                 0.000000
VFN                     0.011319
Mp                      0.063854
Mh                      0.000000
Man                     0.000003
MMS                     1.000000
Tan                     0.002642
T                       0.000429
Va                      0.002218
Ve                      0.003307
Mk                      0.000038
Cn                      0.003110
Ct                      0.001289
Cr                      0.000001
r                       0.000000
m (kg)                  0.000015
Mt                      0.012037
Enedc (g/km)            0.890057
Ewltp (g/km)            0.001166
W (mm)                  0.358176
At1 (mm)                0.390444
At2 (mm)                0.391398
Ft                      0.000000
Fm                      0.000007
ec (cm3)                0.131063
ep (KW)                 0.010614
z (Wh/km)               0.789128
IT                      0.376593
Ernedc (g/km)           1.000000
Erwltp (g/

In [5]:
# Flag every column whose missing fraction exceeds the configured threshold
cols_to_be_dropped = [
    col for col, fraction in missing_percentage.items() if fraction > DENSITY_THRESHOLD
]

print(f"Columns to be dropped due to availability below threshold: {cols_to_be_dropped}")

Columns to be dropped due to availability below threshold: ['MMS', 'Enedc (g/km)', 'z (Wh/km)', 'Ernedc (g/km)', 'De', 'Vf', 'ech', 'RLFI', 'Electric range (km)']


In [6]:
# Cross-check: every column flagged by the threshold should also be listed in the config
for col in cols_to_be_dropped:
    if col not in COLS_PRE_DROP:
        print(f"Column {col} should be dropped based on threshold, but is not in config setup.")

Column z (Wh/km) should be dropped based on threshold, but is not in config setup.
Column Electric range (km) should be dropped based on threshold, but is not in config setup.
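
The check above runs in one direction only: it reports columns that exceed the threshold but are absent from `COLS_PRE_DROP`. A minimal sketch of a two-way comparison follows, assuming both inputs are plain lists of column names. The helper `compare_drop_lists` and the example config list are hypothetical and are not taken from `config.py`.

```python
def compare_drop_lists(threshold_cols, config_cols):
    """Return the columns that appear in only one of the two drop lists."""
    threshold_set, config_set = set(threshold_cols), set(config_cols)
    return {
        # flagged by the threshold, but not configured for dropping
        "missing_from_config": sorted(threshold_set - config_set),
        # configured for dropping, but not flagged by the threshold
        "only_in_config": sorted(config_set - threshold_set),
    }


# Threshold-based list as printed by the notebook
example_threshold = [
    "MMS", "Enedc (g/km)", "z (Wh/km)", "Ernedc (g/km)", "De", "Vf",
    "ech", "RLFI", "Electric range (km)",
]
# Hypothetical config content, for illustration only
example_config = [
    "MMS", "Enedc (g/km)", "Ernedc (g/km)", "De", "Vf", "ech", "RLFI", "Status",
]

diff = compare_drop_lists(example_threshold, example_config)
print(diff)
```

Columns that appear under `only_in_config` are not necessarily a mistake: a column can be dropped for reasons other than missing values, so this list is a prompt for a manual review rather than an error report.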


The next step is to generate datasets that follow a common standard (names, types, etc.). The findings from this notebook feed into `1_1-prep_database_file_generator.ipynb`.

---

### Data Assessment Summary

Based on the initial exploration, we identified the following issues:

#### 1️⃣ **Column Name Inconsistencies**

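Some column names in the file do not match the table description and need renaming (file name ≠ table name):

- `"Country"` ≠ `"MS"`
- `"Tan"` ≠ `"TAN"`
- `"r"` ≠ `"R"`
- `"m (kg)"` ≠ `"M (kg)"`
- `"ec (cm3)"` ≠ `"Ec (cm3)"`
- `"ep (KW)"` ≠ `"Ep (KW)"`
- `"z (Wh/km)"` ≠ `"Z (Wh/km)"`
- `"year"` ≠ `"Year"`
- `"Date of registration"` ≠ `"Dr"`
- `"Fuel consumption "` (with a trailing space) ≠ `"Fc"`
- `"Electric range (km)"` ≠ `"Zr"`

In addition, `"ech"` and `"RLFI"` are present in the file but not described in the table, while `"E (g/km)"`, `"Er (g/km)"` and `"Version_file"` are described in the table but absent from the file.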
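
The renaming could be sketched as a single mapping from the file's column names to the names in the table description. This is only an illustration: `RENAME_MAP` and `standardize_columns` are hypothetical names, and stripping whitespace first is an assumption made so that headers such as `"Fuel consumption "` match.

```python
import pandas as pd

# File column name -> name used in the table description (assumed target names)
RENAME_MAP = {
    "Country": "MS",
    "Tan": "TAN",
    "r": "R",
    "m (kg)": "M (kg)",
    "ec (cm3)": "Ec (cm3)",
    "ep (KW)": "Ep (KW)",
    "z (Wh/km)": "Z (Wh/km)",
    "year": "Year",
    "Date of registration": "Dr",
    "Fuel consumption": "Fc",
    "Electric range (km)": "Zr",
}


def standardize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Strip stray whitespace from headers, then apply the rename mapping."""
    cleaned = frame.rename(columns=lambda name: name.strip())
    return cleaned.rename(columns=RENAME_MAP)


# Tiny demo frame; the values are made up
demo = pd.DataFrame({"Country": ["GR"], "Fuel consumption ": [None], "year": [2021]})
renamed = standardize_columns(demo)
print(list(renamed.columns))
```

Columns without an entry in the mapping keep their name, so the function can be applied before or after dropping columns.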

#### 2️⃣ **Columns with Many Missing Values**

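Several columns are **completely empty** or contain mostly missing data. The threshold check flagged the following columns, which should be verified before further processing:

- `"MMS"` (100 % missing)
- `"Ernedc (g/km)"` (100 % missing)
- `"Enedc (g/km)"` (89 % missing)
- `"z (Wh/km)"` (79 % missing)
- `"De"`
- `"Vf"`
- `"ech"`
- `"RLFI"`
- `"Electric range (km)"`

Four further columns stay below the threshold but still lack more than a third of their values:

- `"W (mm)"` (36 % missing)
- `"At1 (mm)"` (39 % missing)
- `"At2 (mm)"` (39 % missing)
- `"IT"` (38 % missing)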
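
To separate fully empty columns from merely sparse ones, the missing fractions can be collected in a small report. The sketch below is an illustration only: `missing_report` is a hypothetical helper, and the threshold of `0.5` is an assumed value, not the project's `DENSITY_THRESHOLD`.

```python
import pandas as pd


def missing_report(frame: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Summarize the missing fraction per column, most incomplete first."""
    report = pd.DataFrame({"missing_fraction": frame.isna().mean()})
    report["fully_empty"] = report["missing_fraction"] == 1.0
    report["above_threshold"] = report["missing_fraction"] > threshold
    return report.sort_values("missing_fraction", ascending=False)


# Tiny demo frame; the values are made up
demo = pd.DataFrame(
    {
        "MMS": [None, None, None, None],
        "W (mm)": [2600, None, 2650, 2700],
        "Ft": ["petrol", "diesel", "petrol", "electric"],
    }
)
report = missing_report(demo, threshold=0.5)  # 0.5 is an assumed threshold
print(report)
```

Columns with `fully_empty` set carry no information at all, whereas columns that are only `above_threshold` may still be populated for a subset of vehicles (for example electric ones) and deserve a closer look before they are dropped.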

#### 3️⃣ **Redundant or Less Informative Variables**

Certain columns provide little additional information compared to other, more relevant columns. These might be considered for removal:

- Less informative: `"ID"`, `"Mp"`, `"VFN"`, `"Mk"`, `"Man"`, `"Tan"`, `"Va"`, `"Ve"`, `"Cr"`
- More relevant alternatives: `"T"`, `"Mh"`, `"Cn"`, `"Ct"`

#### 4️⃣ **Potentially Constant Columns**

The variable `"r"` appears to always be equal to `1`, which suggests it may not be useful for analysis.
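
Whether a column is constant can be verified directly instead of being inferred from a few rows. A minimal sketch, assuming the data is available as a pandas DataFrame; `constant_columns` is a hypothetical helper name.

```python
import pandas as pd


def constant_columns(frame: pd.DataFrame) -> list:
    """Return the columns that hold at most one distinct value (NaN counts as a value)."""
    distinct_counts = frame.nunique(dropna=False)
    return distinct_counts[distinct_counts <= 1].index.tolist()


# Tiny demo frame; the values are made up
demo = pd.DataFrame({"r": [1, 1, 1], "Ft": ["petrol", "diesel", "petrol"]})
constant = constant_columns(demo)
print(constant)
```

With `dropna=False`, a column that is `1` in most rows but missing in others is not reported as constant, which keeps the check conservative.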

#### 5️⃣ **Deprecated Variables (Relevant Only Until 2016)**

The following columns are only meaningful for data until 2016. They do not appear in the loaded file and can be excluded from the analysis:

- `"E (g/km)"`
- `"Er (g/km)"`

#### 6️⃣ **Metadata Columns**

The following columns contain metadata rather than analytical data and should be treated separately:

- `"Status"`
- `"Version_file"` (listed in the table description, but not present in the loaded file)

#### 7️⃣ **Redundant Information**

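- `"Year"` (column `"year"`) and `"Dr"` (column `"Date of registration"`) overlap: in the sample rows the reporting year equals the year of the registration date. If this holds for the whole dataset, the reporting year can be derived from the registration date and one of the two columns is unnecessary.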
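
Before dropping either column, the overlap can be measured. The sketch below computes the share of rows in which the reporting year differs from the year of the registration date. It is an illustration under assumptions: `year_mismatch_fraction` is a hypothetical helper, the default column names are those of the loaded file, and rows with a missing or unparseable date are ignored.

```python
import pandas as pd


def year_mismatch_fraction(
    frame: pd.DataFrame,
    year_col: str = "year",
    date_col: str = "Date of registration",
) -> float:
    """Share of comparable rows where the reporting year != registration year."""
    dates = pd.to_datetime(frame[date_col], errors="coerce")
    comparable = dates.notna() & frame[year_col].notna()
    if not comparable.any():
        return float("nan")
    mismatch = dates.dt.year[comparable] != frame.loc[comparable, year_col]
    return float(mismatch.mean())


# Tiny demo frame; the values are made up
demo = pd.DataFrame(
    {
        "year": [2021, 2021, 2021],
        "Date of registration": ["2021-06-17", "2021-06-04", "2020-12-30"],
    }
)
fraction = year_mismatch_fraction(demo)
print(f"Mismatching rows: {fraction:.1%}")
```

A result close to zero would support treating the two columns as redundant. A noticeable mismatch would mean the reporting year carries information of its own and should be kept.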