# Explorative Data Analysis and Visualization
## Course Code: DLBDSEDAV01

# Task 1:  Visually Exploring a Data Set

This notebook contains the implementation of Task 1 of the Explorative Data Analysis and Visualization course (DLBDSEDAV01) and it describes the analysis of the _Electric Vehicle Specs Dataset (2025)_.

The notebook includes a statistical analysis of the features of the dataset, the corresponding visualizations, and conclusions based on the data shown. The implementation steps and design decisions regarding the visualizations are omitted here. They are included in the Written Assignment that accompanies this notebook.

Since the visualizations presented here are part of the written assignment, the titles of the figures are included in the corresponding captions (Fundamentals of Data Visualization - C. Wilke).

# 1. Dataset information

- __Name__: Electric Vehicle Specs Dataset (2025)
- __Source__: [Kaggle](https://www.kaggle.com/datasets/urvishahir/electric-vehicle-specifications-dataset-2025/data)
- __Features included__:
  
    - __Brand and Model__: Manufacturer and specific nameplate of the EV.
    - __Car Body Type__: Classification such as hatchback, SUV, sedan, etc.
    - __Segment__: Vehicle segment (e.g., compact, midsize, executive).
    - __Battery Capacity (kWh)__: The gross energy capacity of the battery.
    - __Number of Cells and Battery Type__: Technical battery information, where available.
    - __Efficiency (Wh/km)__: Energy consumption rate of the vehicle.
    - __Range (km)__: Estimated driving range on a full charge.
    - __Fast Charging Power (kW)__: Maximum supported DC fast-charging power.
    - __Fast Charge Port Type__: Connector standard (e.g., CCS, CHAdeMO).
    - __Top Speed (km/h)__: Maximum speed of the vehicle.
    - __0–100 km/h Acceleration (s)__: Time to reach 100 km/h from a standstill.
    - __Torque (Nm)__: Maximum torque output, where available.
    - __Towing Capacity (kg)__: Ability to tow loads, provided where applicable.
    - __Cargo Volume (L)__: Luggage space, sometimes approximate or expressed in alternative units.
    - __Seats__: Total seating capacity.
    - __Length, Width, Height (mm)__: Physical footprint of the vehicle.
    - __Drivetrain__: Powertrain configuration (e.g., AWD, RWD, FWD).
    - __Source URL__: Reference link for each car in the [EV database](https://ev-database.org/).

In [None]:
# For exact versions of the modules see environment.yml
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.optimize import curve_fit
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

from itertools import combinations

# 2. Understanding the data

Before starting with analysis of the data and calculation of statistics, the data set is examined to understand its structure, size, datatypes, missing values, etc.

In [None]:
# Take a look at the original dataset
df = pd.read_csv(r"../data/electric_vehicles_spec_2025.csv")
df.head(10)

In [None]:
# number of entries, column names, datatypes
df.info()

In [None]:
# Check for missing values. Columns with many missing values will be dropped, since imputing
# a large number of values can introduce bias into the data set
df.isnull().sum().to_frame('Missing Values').style.background_gradient(cmap='Reds')

In [None]:
# Check the number of unique values per column. Columns with only one unique value can be dropped,
# since a constant value does not provide any information relevant for this analysis
df.nunique().sort_values().to_frame("Unique values").style.background_gradient(cmap='Blues')

In [None]:
df.fast_charge_port.value_counts()

Based on these results, the following columns can be dropped:
1. number_of_cells: around 40% of the values in this column are missing.
2. battery_type: this column contains only one unique value, _Lithium-ion_, so it does not provide information relevant for the current analysis.
3. source_url: the URLs contained in this column point to [EV-Database](https://ev-database.org/) and are not relevant for this analysis.
4. fast_charge_port: of the 477 vehicles with Fast Charge Port data, 476 use CCS and only one does not, so the column carries little information for the analysis.
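
The first two criteria in the list above can be turned into a small helper. The following is a minimal sketch on a toy frame; the function name `columns_to_drop` and both thresholds are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def columns_to_drop(frame, max_missing=0.3, min_unique=2):
    """Return columns with too many missing values or too few distinct values.

    Both thresholds are illustrative assumptions, not values from the notebook.
    """
    missing_share = frame.isnull().mean()    # fraction of missing values per column
    n_unique = frame.nunique(dropna=True)    # distinct non-missing values per column
    flagged = (missing_share > max_missing) | (n_unique < min_unique)
    return sorted(flagged[flagged].index)


# Toy data standing in for the real dataset
toy = pd.DataFrame({
    "range_km": [300, 410, 520, 365, 450],
    "number_of_cells": [np.nan, 96, np.nan, np.nan, 108],  # 60% missing
    "battery_type": ["Lithium-ion"] * 5,                    # single unique value
})
candidates = columns_to_drop(toy)
print(candidates)
```

A nearly constant column such as `fast_charge_port` would not be caught by these two rules, so a look at `value_counts()` is still needed for such cases.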

In [None]:
# Drop the columns identified above: too many missing values, a single unique value, or no relevance for this analysis
df = df.drop(["number_of_cells", "battery_type", "fast_charge_port", "source_url"], axis=1)
df.info()

In [None]:
# This mapping is used throughout the notebook to properly rename the axes of the plots
xlabel_map = {
    "range_km" : "Range (km)",
    "torque_nm" : "Torque (Nm)",
    "top_speed_kmh" : "Top Speed (km/h)",
    "battery_capacity_kWh" : "Battery Capacity (kWh)",
    "fast_charging_power_kw_dc" : "Fast Charging Power (kW)",
    "acceleration_0_100_s" : "Acceleration 0-100 km/h (s)",
    "efficiency_km_kWh" : "Efficiency (km/kWh)",
    "drivetrain" : "Drivetrain",
    "new_segment" : "New Segment"
}

# 3. Analysis across all vehicles
## 3.1 Number of vehicles per brand

We see that the German brands are amongst the brands with the highest number of models, with Mercedes-Benz leading the chart with 42 models, followed by Audi, Porsche and Volkswagen.

Firefly is another interesting data point: it has no model name, so it was not counted at first. To deal with this, its count is set manually to one.
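
The reason a brand can end up with a count of zero is that `count()` skips missing values. The sketch below illustrates this on toy data, assuming that the missing model name is stored as `NaN`; the rows are invented for illustration. It also shows `size()`, which counts rows regardless of missing values and is one possible alternative to setting the value by hand.

```python
import numpy as np
import pandas as pd

# Toy data: the last brand has a row but no model name (assumed to be NaN)
toy = pd.DataFrame({
    "brand": ["Audi", "Audi", "firefly"],
    "model": ["Q4 e-tron", "Q6 e-tron", np.nan],
})

by_count = toy.groupby("brand")["model"].count()  # skips missing model names
by_size = toy.groupby("brand").size()             # counts every row

print(by_count.to_dict())
print(by_size.to_dict())
```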

In [None]:
df.brand.unique().shape

In [None]:
# count the models by each brand and sort them from highest to lowest
n_model_by_brand = df.loc[:,["brand", "model"]].groupby("brand").count().sort_values("model", ascending=False)
display(n_model_by_brand.tail())
n_model_by_brand.loc["firefly", "model"] = 1
display(n_model_by_brand.tail())

In [None]:
# How many vehicles are available for each brand?
# Interesting: which brands have the highest number of models? What about German brands?
germany = ["Mercedes-Benz", "Audi", "Porsche", "Volkswagen", "BMW", "Opel", "Smart"]
colors = ["tab:blue" if brand in germany else "tab:gray" for brand in n_model_by_brand.index]
# Prepare the plot
fig, ax = plt.subplots(figsize=(15,15))
sns.barplot(n_model_by_brand,
            y=n_model_by_brand.index,
            x="model", 
            hue=n_model_by_brand.index,
            orient="h",
            legend=False,
            ax=ax, palette=colors)
# Place the count number right next to the bars
for y, model_count in enumerate(n_model_by_brand["model"]):
    # default height of the bar is 0.8, to center the text add 0.2
    # make some space between bar and text, add 0.1
    ax.text(model_count+.1, y+.2, model_count, fontsize=14)
# Name the axes
ax.set_ylabel("Vehicle brand", fontsize=16)
ax.set_xlabel("Number of models", fontsize=16)
ax.set_xticks(np.arange(0,50,5))
ax.set_xticklabels(ax.get_xticks(), fontsize=14)
ax.set_yticks(np.arange(n_model_by_brand.index.shape[0]))
ax.set_yticklabels(n_model_by_brand.index, fontsize=14)
ax.grid(True, linewidth=0.5, linestyle=':', axis="x")
ax.tick_params(axis='x', length=0, pad=5)
# remove the contours
for position in ("top", "right", "left", "bottom"):
    ax.spines[position].set_visible(False)
fig.tight_layout()

## 3.2 Features across all vehicles

The goal of this section is to get an overview of the behavior of the features across EVs in general.

In [None]:
features = ["range_km", "torque_nm", "top_speed_kmh", "acceleration_0_100_s", "battery_capacity_kWh", "fast_charging_power_kw_dc"]

### 3.2.1 Correlation

In [None]:
corr_matrix = df[features].corr()
corr_matrix = corr_matrix.rename(columns=xlabel_map, index=xlabel_map)
corr_matrix

In [None]:
mask = np.tril(np.ones_like(corr_matrix, dtype=bool),k=-1)
mask

In [None]:
fig, ax = plt.subplots(figsize=(6,6))
sns.heatmap(corr_matrix, mask=mask, vmin=-1, vmax=1, center=0, cmap="PiYG", annot=True, square=True)
ax.xaxis.tick_top()
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, fontsize=14)
ax.set_yticklabels(ax.get_yticklabels(), fontsize=14)
ax.tick_params(axis="both", length=0)
ax.tick_params(axis="x", pad=10)

We observe some strong correlations between the features. Some of them are expected, such as _Battery Capacity_ and _Range_, i.e., if the vehicle has a large battery we expect it to have a long range. _Torque_ and _Acceleration_ are also strongly negatively correlated, which is also the behavior we would expect: vehicles with high torque are the ones that need the shortest time to reach 100 km/h.

These initial observations show clear relationships between certain characteristics. However, correlation alone does not capture the whole picture, especially if we include further features like the _Drivetrain_ and _Efficiency_.

We will examine some of these strong correlations in more detail to explore non-linear relationships and how multiple features might interact to influence the performance parameters.
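
To illustrate why the Pearson coefficient can understate a non-linear relationship, the following sketch uses synthetic, noise-free data that decay exponentially. It compares Pearson with the rank-based Spearman coefficient, computed here as the Pearson correlation of the ranks. Spearman is not used in this notebook; it is brought in only as a point of comparison.

```python
import numpy as np
import pandas as pd

# Synthetic, noise-free example: y decays exponentially with x
x = np.linspace(0, 5, 50)
toy = pd.DataFrame({"x": x, "y": np.exp(-x)})

# Pearson measures linear association
pearson = toy["x"].corr(toy["y"])
# Spearman is the Pearson correlation of the ranks, so it measures monotonic association
spearman = toy["x"].rank().corr(toy["y"].rank())

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

The relationship is perfectly monotonic, so the rank-based coefficient reaches -1, while the Pearson coefficient stays noticeably above -1 because the points do not lie on a straight line.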

### 3.2.2 Battery capacity and range

We derive the _Efficiency_ of the vehicles as:

$\text{Efficiency (km/kWh)} = \frac{\text{Range (km)}}{\text{Battery capacity (kWh)}}$
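
As a quick worked example of this ratio, consider a hypothetical vehicle whose values are chosen for illustration only. The sketch also converts the result to Wh/km to show how the derived quantity relates to a consumption figure; the consumption values listed in the dataset are separate measurements and need not match this derived number.

```python
# Hypothetical vehicle, values chosen for illustration only
range_km = 400.0
battery_capacity_kwh = 80.0

# Distance covered per unit of stored energy
efficiency_km_per_kwh = range_km / battery_capacity_kwh

# The inverse, expressed in Wh per km (1 kWh = 1000 Wh)
consumption_wh_per_km = 1000.0 / efficiency_km_per_kwh

print(efficiency_km_per_kwh, consumption_wh_per_km)
```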

In [None]:
# calculating efficiency
df['efficiency_km_kWh'] = df['range_km'] / df['battery_capacity_kWh']

In [None]:
# Calculating descriptive statistics of battery capacity, range and efficiency across all vehicles
df.loc[:,["battery_capacity_kWh", "range_km", "efficiency_km_kWh"]].describe().T.rename(index=xlabel_map)

We also examine the skewness of these three features to get a sense of the symmetry of the distributions.

In [None]:
df.loc[:,["battery_capacity_kWh", "range_km", "efficiency_km_kWh"]].skew().apply(round, args=(2,))

In [None]:
mean_range = df.range_km.mean()
mean_battery_capacity = df.battery_capacity_kWh.mean()
mean_efficiency = df.efficiency_km_kWh.mean()

In [None]:
# figure and axs
fig, axs = plt.subplots(figsize=(16,6), nrows=1, ncols=3)
# distribution of the range
sns.histplot(df, x="range_km", bins=15, kde=True, stat="density", alpha=0.5, ax=axs[0])
# ranges must be set separately for each plot
axs[0].set_xticks(np.arange(100,700,100), np.arange(100,700,100))
axs[0].set_xticklabels(axs[0].get_xticklabels(), fontsize=14)
axs[0].axvline(mean_range, color='darkred', linestyle='--', linewidth=2)
axs[0].text(x=mean_range+10, y=axs[0].get_ylim()[1], s=f"Mean = {mean_range:.1f}km", fontsize=14)

#distribution of the battery capacity
sns.histplot(df, x="battery_capacity_kWh", bins=15, kde=True, stat="density", ax=axs[1])
axs[1].set_xticks(np.arange(20,130,20), np.arange(20,130,20))
axs[1].set_xticklabels(axs[1].get_xticklabels(), fontsize=14)
axs[1].axvline(mean_battery_capacity, color='darkred', linestyle='--', linewidth=2)
axs[1].text(x=mean_battery_capacity+2, y=axs[1].get_ylim()[1]*1.07, s=f"Mean = {mean_battery_capacity:.1f}kWh", fontsize=14)

print(axs[2].get_ylim()[1])

# distribution of the efficiency
sns.histplot(df, x="efficiency_km_kWh", bins=15, kde=True, stat="density", ax=axs[2])
axs[2].set_xticks(np.arange(3,8,1))
axs[2].set_xticklabels(axs[2].get_xticklabels(), fontsize=14)
axs[2].axvline(mean_efficiency, color='darkred', linestyle='--', linewidth=2)
axs[2].text(x=mean_efficiency+.1, y=axs[2].get_ylim()[1]*1.14, s=f"Mean = {mean_efficiency:.1f}km/kWh", fontsize=14)

# Editing the plots 
for ax in axs:#.flatten():
    ax.grid(True, linewidth=0.5, linestyle=':', axis="y")
    ax.spines[["top", "right", "bottom", "left"]].set_visible(False)
    ax.set_yticks(ax.get_yticks())
    ax.set_yticklabels(ax.get_yticklabels(), fontsize=14)
    ax.set_xlabel(xlabel_map.get(ax.get_xlabel(), ""), fontsize=16) # Use the get method to avoid a KeyError when looping over last plot
    ax.set_ylabel(ax.get_ylabel(), fontsize=16)
    ax.tick_params(axis="y", length=0)
    ax.tick_params(axis="x", pad=10)


fig.tight_layout()

From the distributions we observe:

1. Mean Battery Capacity = 74.0 kWh; Mean range = 393.2 km; Mean efficiency = 5.4 km/kWh
2. The low skewness values, all within the interval (-0.5, 0.5), confirm that the distributions are approximately symmetric
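
The rule of thumb used in the second observation can be checked on synthetic data. The sketch below draws one symmetric and one right-skewed sample and applies the |skewness| < 0.5 criterion; the sample sizes, distributions and seed are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is reproducible

# Synthetic samples: one symmetric, one with a long right tail
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=70, scale=15, size=2000),
    "right_skewed": rng.exponential(scale=70, size=2000),
})

skewness = toy.skew()
approximately_symmetric = skewness.abs() < 0.5  # common rule of thumb

print(skewness.round(2))
print(approximately_symmetric)
```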

In [None]:
# Create figure
fig, ax = plt.subplots(figsize=(14,7))
# Battery capacity and Range
sns.scatterplot(df.rename(columns=xlabel_map),
                x=xlabel_map.get("battery_capacity_kWh"),
                y=xlabel_map.get("range_km"),
                style=xlabel_map.get("drivetrain"),
                hue=xlabel_map.get("efficiency_km_kWh"),
                s=100,
                palette="copper_r",
                ax=ax)

# Edit the plot
ax.spines[["top", "right", "bottom", "left"]].set_visible(False)
ax.grid(True, linewidth=0.5, linestyle=":")
ax.set_xlabel(ax.get_xlabel(), fontsize=16)
ax.set_ylabel(ax.get_ylabel(), fontsize=16)
ax.set_xticks(np.arange(20,130,10))
ax.set_xticklabels(ax.get_xticks(), fontsize=14)
ax.set_yticks(np.arange(0,900,100))
ax.set_yticklabels(ax.get_yticks(), fontsize=14)


# Get the vehicle with the highest battery and range - middle efficiency
highest_range = df[(df["drivetrain"]=="RWD") & (df["range_km"]>650)]
highest_range = highest_range[["brand", "model", "battery_capacity_kWh", "range_km", "efficiency_km_kWh"]]

ax.text(highest_range["battery_capacity_kWh"].iloc[0] + .5,
        highest_range["range_km"].iloc[0] + 10,
        f"{highest_range['brand'].iloc[0]}\n{highest_range['model'].iloc[0]}", fontsize=10)

# Get the vehicle with half of the battery size and range - high efficiency
high_efficiency = df[(df["drivetrain"]=="RWD") & (df["battery_capacity_kWh"].between(60, 65)) & (df["range_km"].between(400, 450))]
high_efficiency = high_efficiency[["brand", "model", "battery_capacity_kWh", "range_km", "efficiency_km_kWh"]]

ax.text(high_efficiency["battery_capacity_kWh"].iloc[0]+0.5,
        high_efficiency["range_km"].iloc[0] + 10,
        f"{high_efficiency['brand'].iloc[0]}\n{high_efficiency['model'].iloc[0][:12]}", fontsize=10)

fig.tight_layout()

1. The scatter plot shows a positive correlation between battery capacity and range, i.e., range tends to increase as the battery capacity increases.
2. _FWD_ vehicles tend to have lower values for _Range_ compared to _RWD_ and _AWD_, with more _AWD_ vehicles on the higher end of the _Range_.
3. The plot also shows that vehicles with higher efficiency tend to have smaller batteries. Two vehicles that illustrate this contrast are the _Mercedes-Benz EQS 450+_ and the _Tesla Model 3_, as shown in the table below:

In [None]:
pd.concat([highest_range, high_efficiency], axis=0).rename(columns=xlabel_map)

In [None]:
df.loc[:,["battery_capacity_kWh", "range_km", "efficiency_km_kWh"]].corr().rename(columns=xlabel_map, index=xlabel_map)

We can conclude that Battery Capacity is strongly correlated with Range but not with Efficiency. In fact, as shown in the plot above, the most efficient EVs are the ones with the smaller batteries, suggesting that Battery size does not play a crucial role in achieving high efficiency.

### 3.2.3 Torque, Acceleration and Top Speed

At the beginning of the analysis, we detected 7 missing torque values. Since we are not doing predictive modeling, we can opt for simplicity and drop these rows.

In [None]:
df[df["torque_nm"].isnull()]

In [None]:
df_torque = df.dropna(subset="torque_nm")

In [None]:
# Calculating descriptive statistics of torque, top speed and acceleration across all vehicles
df_torque.loc[:,["torque_nm", "top_speed_kmh", "acceleration_0_100_s"]].describe().T.rename(index=xlabel_map)

In [None]:
df_torque.loc[:,["torque_nm", "top_speed_kmh", "acceleration_0_100_s"]].skew().apply(round, args=(2,))

In [None]:
mean_top_speed = df_torque.top_speed_kmh.mean()
mean_torque = df_torque.torque_nm.mean()
mean_acceleration = df_torque.acceleration_0_100_s.mean()

In [None]:
# figure and axs
fig, axs = plt.subplots(figsize=(18,6), nrows=1, ncols=3)
# distribution of top speed
sns.histplot(df, x="top_speed_kmh", bins=15, kde=True, stat="density", ax=axs[0])
# ranges must be set separately for each plot
axs[0].set_xticks(np.arange(100,350,50), np.arange(100,350,50))
axs[0].set_xticklabels(axs[0].get_xticklabels(), fontsize=14)
axs[0].axvline(mean_top_speed, color='darkred', linestyle='--', linewidth=2)
axs[0].text(x=mean_top_speed + 6, y=axs[0].get_ylim()[1], s=f"Mean = {mean_top_speed:.1f}km/h", fontsize=14)

# distribution of torque
sns.histplot(df, x="torque_nm", bins=15, kde=True, stat="density", ax=axs[1])
axs[1].set_xticks(np.arange(0,1600,200), np.arange(0,1600, 200))
axs[1].set_xticklabels(axs[1].get_xticklabels(), fontsize=14)
axs[1].axvline(mean_torque, color='darkred', linestyle='--', linewidth=2)
axs[1].text(x=mean_torque + 40, y=axs[1].get_ylim()[1], s=f"Mean = {mean_torque:.1f}Nm", fontsize=14)

# distribution of acceleration
sns.histplot(df, x="acceleration_0_100_s", bins=15, kde=True, stat="density", ax=axs[2])
axs[2].set_xticks(np.arange(0,20,2), np.arange(0,20,2))
axs[2].set_xticklabels(axs[2].get_xticklabels(), fontsize=14)
axs[2].axvline(mean_acceleration, color='darkred', linestyle='--', linewidth=2)
axs[2].text(x=mean_acceleration+.5, y=axs[2].get_ylim()[1]*1.04, s=f"Mean = {mean_acceleration:.1f}s", fontsize=14)

# Editing the plots 
for ax in axs:
    ax.grid(True, linewidth=0.5, linestyle=':', axis="y")
    ax.spines[["top", "right", "bottom", "left"]].set_visible(False)
    ax.set_xticks(ax.get_xticks())
    ax.set_xticklabels(ax.get_xticklabels(), fontsize=14)
    ax.set_yticks(ax.get_yticks())
    ax.set_yticklabels(ax.get_yticklabels(), fontsize=14)
    ax.set_xlabel(xlabel_map.get(ax.get_xlabel(), ""), fontsize=16) # Use the get method to avoid a KeyError when looping over last plot
    ax.set_ylabel(ax.get_ylabel(), fontsize=16)

fig.tight_layout()

In [None]:
df.loc[:,["top_speed_kmh", "torque_nm", "acceleration_0_100_s"]].describe().T.apply(round, args=(2,))

In [None]:
# Function to model the behavior of torque vs acceleration and of acceleration vs top speed
def decay(x, a, b, c):
    """Exponential decay"""
    return a * np.exp(-b*x) + c

# Function to evaluate the behavior of torque and top speed
def grow(x, a, r):
    """Exponential growth"""
    return a * ((1 + r) ** x)

In [None]:
# Fitting a linear regression for torque and top speed
lr = LinearRegression()
lr.fit(df_torque.loc[:,["torque_nm"]], df_torque.loc[:,"top_speed_kmh"])

In [None]:
# Extract the values for curve fitting
torque = df_torque["torque_nm"].values
acceleration = df_torque["acceleration_0_100_s"].values
top_speed = df_torque["top_speed_kmh"].values

In [None]:
# Parameters for torque vs acceleration
# X data to plot the line
xaxis1 = np.arange(100,1500,100)
popt_torque_acceleration, pcov = curve_fit(decay, xdata=torque, ydata=acceleration, p0=(20.0,0.0001,2.0))
# Parameters for torque vs top_speed
xaxis2 = pd.DataFrame(np.arange(150,1500,100), columns=["torque_nm"])
popt_torque_speed, pcov = curve_fit(grow, xdata=torque, ydata=top_speed, p0=(120,.005))
# Parameters for acceleration vs top speed
xaxis3 = np.arange(2,20,1)
popt_acceleration_speed, pcov = curve_fit(decay, xdata=acceleration, ydata=top_speed, p0=(325.0,0.01,120.0))

In [None]:
fig, axs = plt.subplots(figsize=(14,6), nrows=1, ncols=2)
# Torque vs acceleration
sns.scatterplot(df_torque,
                x="torque_nm",
                y="acceleration_0_100_s",
                ax=axs[0])
# Plot the fitted curve
axs[0].plot(xaxis1, decay(xaxis1, *popt_torque_acceleration), color="orange", linewidth=2)

# Torque vs top speed
sns.scatterplot(df_torque,
                x="torque_nm",
                y="top_speed_kmh",
                ax=axs[1])
# Plot the linear regression for comparison
axs[1].plot(xaxis2, lr.predict(xaxis2), color="green", linewidth=2)

# Editing the plots 
for ax in axs:
    ax.grid(True, linewidth=0.5, linestyle=':')
    ax.spines[["top", "right", "bottom", "left"]].set_visible(False)
    ax.set_xticks(ax.get_xticks())
    ax.set_xticklabels(ax.get_xticklabels(), fontsize=14)
    ax.set_yticks(ax.get_yticks())
    ax.set_yticklabels(ax.get_yticklabels(), fontsize=14)
    ax.set_xlabel(xlabel_map.get(ax.get_xlabel(), ""), fontsize=16) # Use the get method to avoid a KeyError when looping over last plot
    ax.set_ylabel(xlabel_map.get(ax.get_ylabel(), ""), fontsize=16)
    ax.tick_params(axis="both", length=0, pad=10)

fig.tight_layout()

In [None]:
(df_torque.loc[:,["torque_nm", "acceleration_0_100_s", "top_speed_kmh"]]
    .corr()
    .rename(columns=xlabel_map, index=xlabel_map)
    .apply(round, args=(2,)))

In [None]:
linear_r2 = r2_score(top_speed, lr.predict(df_torque[["torque_nm"]]))
r2_linear_fit = round(linear_r2, 3)

exp_growth_r2 = r2_score(top_speed, grow(df_torque[["torque_nm"]], *popt_torque_speed))
r2_nonlinear_fit = round(exp_growth_r2, 3)

print(f"R2 score for the linear function = {r2_linear_fit}\nR2 score for the non-linear function = {r2_nonlinear_fit}")

The relationship between _Torque_ and _Acceleration_ aligns with physical intuition: vehicles with higher torque tend to accelerate faster, resulting in lower acceleration times. This can also be observed in the correlogram, which shows a relatively strong negative correlation of -0.79. However, despite this strong correlation, the scatter plot reveals a clear non-linear trend, suggesting that a simple linear model does not properly capture the relationship between these two features.

The data appears to follow an exponential decay, where increases in torque lead to progressively smaller improvements in acceleration time. This implies that while increasing torque does reduce the time to reach 100 km/h, smaller improvements are obtained after a certain point, i.e., additional torque yields only minimal acceleration increases.

The correlogram also shows a strong positive correlation of 0.81 between _Torque_ and _Top speed_. Initially, this scatter plot suggested a non-linear relationship, as in the previous case. However, further analysis revealed that the relationship is better explained by a linear model, as supported by the slightly higher $R^2$ value (0.649) compared to the exponential growth model (0.639). This finding suggests that vehicles with high torque tend to have higher top speeds.
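
The model comparison described above, where $R^2$ decides between a linear and a non-linear candidate, can be sketched on synthetic data. The data below are generated from a decay curve plus noise, so the non-linear model is expected to win here, unlike in the torque and top speed case. The parameter values, noise level and the helper `r_squared` are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit


def decay(x, a, b, c):
    """Exponential decay towards the asymptote c."""
    return a * np.exp(-b * x) + c


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic data (not the EV dataset): decaying trend plus noise
rng = np.random.default_rng(seed=1)
x = np.linspace(100, 1500, 120)
y = decay(x, 12.0, 0.004, 3.0) + rng.normal(scale=0.4, size=x.size)

# Candidate 1: straight line fitted by least squares
slope, intercept = np.polyfit(x, y, deg=1)
r2_linear = r_squared(y, slope * x + intercept)

# Candidate 2: exponential decay fitted with curve_fit
params, _ = curve_fit(decay, x, y, p0=(10.0, 0.001, 2.0))
r2_decay = r_squared(y, decay(x, *params))

print(f"R2 linear = {r2_linear:.3f}, R2 decay = {r2_decay:.3f}")
```

Comparing $R^2$ on the fitting data favors the more flexible model when the gap is small, so a narrow difference such as 0.649 versus 0.639 is best read as "no clear advantage for the non-linear model" rather than as strong evidence for linearity.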

In summary, vehicles with more torque not only accelerate faster but also tend to reach higher top speeds. The fact that torque is tied to both acceleration and top speed supports the idea that it is a key indicator of overall performance.

# 4. Analysis by segment

## 4.1 Definition of vehicle segments

The segments contained in the dataset are divided into two big groups:

1. With prefix "J"
2. With no prefix

The "J" prefix stands for sport utility cars. The letter that follows the prefix (or the single letter, for segments without a prefix) indicates the segmentation according to _Case No COMP/M.1406 Hyundai / Kia Regulation (EEC) No 4064/89 Merger Procedure_ for both groups.
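
The naming scheme can be made explicit with a small parser. This is a minimal sketch that assumes every label follows the `CODE - Name` layout seen in the segment column (for example `JC - Medium`); the function name `parse_segment` is hypothetical.

```python
def parse_segment(label):
    """Split a label such as 'JC - Medium' into (is_suv, size_letter, name).

    Assumes the 'CODE - Name' layout; a two-letter code starting with 'J'
    is treated as a sport utility segment.
    """
    code, _, name = label.partition(" - ")
    is_suv = len(code) == 2 and code.startswith("J")
    size_letter = code[-1]
    return is_suv, size_letter, name


print(parse_segment("JC - Medium"))
print(parse_segment("C - Medium"))
```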

In [None]:
# Segments starting with J refer to SUVs
# So two major classifications can be performed SUV and not SUVs
df[df["car_body_type"] == "SUV"]['segment'].str.startswith("J").all()

In [None]:
# According to Case No COMP/M.1406 Hyundai / Kia Regulation (EEC) No 4064/89 Merger Procedure
models_by_segment = df.groupby("segment")["model"].count().sort_values(ascending=False)
fig, ax = plt.subplots(figsize=(14,7))
sns.barplot(models_by_segment, orient="h", ax=ax)
ax.grid(True, linewidth=0.5, linestyle=":", axis="x")
ax.set_xticks(np.arange(0,100,5))
ax.set_xticklabels(ax.get_xticks(), fontsize=16)
ax.set_ylabel("Car Segment", fontsize=16)
ax.set_yticks(np.arange(models_by_segment.index.shape[0]))
ax.set_yticklabels(models_by_segment.index, fontsize=14)
ax.set_xlabel("Number of car models per Segment", fontsize=14)
ax.spines[["top", "right", "left", "bottom"]].set_visible(False)

In [None]:
# With the original segments
metrics = df.groupby("segment")[["range_km", "torque_nm", "top_speed_kmh", "acceleration_0_100_s", "battery_capacity_kWh", "fast_charging_power_kw_dc"]].agg(["mean", "median", "skew"])
metrics = metrics.stack(1, future_stack=True)

def highlight_selected(row):
    if row.name[1] == "mean":
        return ["background-color: #f9cb9c"] * len(row)
    return [""] * len(row)

metrics.style.apply(highlight_selected, axis=1)

Some of the segments contained in the data set include only a few cars. In this case, the calculated statistics will not be representative, since in segments with 1-3 cars a single extreme value can skew the mean. To address this issue, the segments will be merged into broader groups that are also contextually related, with the goal of increasing sample sizes and thus potentially improving the descriptive statistics.

Another issue that will be addressed with these broader groups is the fact that the visualizations can become cluttered and difficult to interpret.

The trade-off of this approach, as mentioned before, is that the means of the new groups could be pulled towards the outliers.
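
When categories are merged with a dictionary, any label missing from the dictionary silently becomes `NaN` and drops out of later group statistics. The sketch below shows a simple check on toy data; the shortened mapping and the label `X - Unknown` are invented for illustration.

```python
import pandas as pd

# Shortened, illustrative mapping; a real mapping would list every segment
mapping = {
    "A - Mini": "Mini & Compact",
    "B - Compact": "Mini & Compact",
    "C - Medium": "Medium",
}

toy = pd.DataFrame({
    "segment": ["A - Mini", "B - Compact", "B - Compact", "C - Medium", "X - Unknown"],
    "range_km": [200, 310, 330, 420, 500],
})

# Labels that are not keys of the mapping become NaN
toy["new_segment"] = toy["segment"].map(mapping)
unmapped = toy.loc[toy["new_segment"].isna(), "segment"].unique().tolist()

# Group sizes after merging; NaN is excluded by default
group_sizes = toy["new_segment"].value_counts()

print("Unmapped labels:", unmapped)
print(group_sizes.to_dict())
```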

In [None]:
new_segment_categories = {
    'A - Mini': 'Mini & Compact',
    'JA - Mini': 'Mini & Compact',
    'B - Compact': 'Mini & Compact',
    'JB - Compact': 'Mini & Compact',
    'C - Medium': 'Medium',
    'JC - Medium': 'Medium',
    'D - Large': 'Large & Executive',
    'E - Executive': 'Large & Executive',
    'JD - Large': 'Large & Executive',
    'JE - Executive': 'Large & Executive',
    'F - Luxury': 'Luxury & Sports',
    'I - Luxury': 'Luxury & Sports',
    'JF - Luxury': 'Luxury & Sports',
    'G - Sports': 'Luxury & Sports',
    'N - Passenger Van': 'Passenger Van'
}

In [None]:
df["new_segment"] = df["segment"].map(new_segment_categories)

In [None]:
models_by_new_segment = df.groupby("new_segment")["model"].count().sort_values(ascending=False)
fig, ax = plt.subplots(figsize=(12,6))
sns.barplot(models_by_new_segment, orient="h", ax=ax)
ax.grid(True, linewidth=0.5, linestyle=":", axis="x")
ax.set_xticks(np.arange(0,155,10))
ax.set_xticklabels(ax.get_xticks(), fontsize=16)
ax.set_ylabel("Derived Car Segment", fontsize=16)
ax.set_yticks(np.arange(models_by_new_segment.index.shape[0]))
ax.set_yticklabels(models_by_new_segment.index, fontsize=14)
ax.set_xlabel("Number of car models per Segment", fontsize=14)
ax.spines[["top", "right", "left", "bottom"]].set_visible(False)

In [None]:
# With the derived (merged) segments
metrics_new_segment = df.groupby("new_segment")[["range_km", "torque_nm", "top_speed_kmh", "acceleration_0_100_s", "battery_capacity_kWh", "fast_charging_power_kw_dc"]].agg(["mean", "std", "median", "skew"])
metrics_new_segment = metrics_new_segment.stack(1, future_stack=True)
metrics_new_segment.style.apply(highlight_selected, axis=1)

## 4.2 Features by segment

Now we examine the differences between the segments with respect to some important features of the EVs.

In [None]:
# Create the figure
fig, axs = plt.subplots(figsize=(14,14), nrows=3, ncols=2)
# Define segment order
segment_order = ["Passenger Van", "Mini & Compact", "Medium", "Large & Executive", "Luxury & Sports"]
# Create strip plots overlaid with violin plots
handles, labels = None, None
for ax, metric in zip(axs.flatten(), metrics_new_segment.columns):
    sns.stripplot(df, x=metric, hue="new_segment", hue_order=segment_order, jitter=True, dodge=True, size=5, alpha=0.8, ax=ax)
    sns.violinplot(df, x=metric, hue="new_segment", hue_order=segment_order, alpha=0.2, ax=ax)
    ax.set_xlabel(xlabel_map[metric], fontsize=16)
    ax.set_ylabel("", fontsize=16)
    ax.set_xticks(ax.get_xticks())
    ax.set_xticklabels(ax.get_xticklabels(), fontsize=14)
    ax.set_yticks([])
    #ax.legend(title=None, fontsize=14, loc="upper right")
    ax.grid(True, linewidth=0.5, linestyle=':', axis="x")
    ax.spines[["top", "right", "bottom", "left"]].set_visible(False)
    # Collect the legend entries once, then remove the per-axes legend
    if handles is None:
        h, l = ax.get_legend_handles_labels()
        by_label = dict(zip(l, h))
        labels  = list(by_label.keys())
        handles = [by_label[k] for k in labels]
    ax.legend_.remove()

fig.legend(handles, labels, loc="center left", fontsize=14,
           bbox_to_anchor=(1.02, 0.5), frameon=False, title=None)
fig.tight_layout(rect=[0, 0, 0.95, 1])

In [None]:
(df[df["new_segment"]=="Luxury & Sports"]
    .loc[:,["brand", "model", 
            "range_km", "torque_nm", "top_speed_kmh", "acceleration_0_100_s", "battery_capacity_kWh", "fast_charging_power_kw_dc"]]
    .sort_values(by=["top_speed_kmh", "acceleration_0_100_s"], ascending=False).head(3))

The plots show how the distributions of different features compare across the different vehicle segments.

The overall tendency is that the mean values increase with the segment, from _Passenger Van_ (lowest) to _Luxury & Sports_ (highest), with the latter also showing the largest variability. This behavior is expected when we consider that _Luxury & Sports_ includes vehicles such as the _Maserati Folgore_ or the _Porsche Taycan_, both with top speeds greater than 300 km/h, while at the other extreme, in the _Passenger Vans_, we have small vehicles like the Dacia Spring with a top speed of 125 km/h.

While this trend is clear, especially in features like _Top speed_ and _Acceleration_, we also see some overlap between the segments, especially between _Medium_ and _Large & Executive_. This suggests that the segment alone does not determine the vehicle performance.

Further statistical tests are used to determine whether these observed differences between the categories are statistically significant.

## Confidence intervals

In [None]:
# Organize the features to perform the tests
tests = list(combinations(models_by_new_segment.index, 2))
test_results = pd.DataFrame(tests, columns=("Feature A", "Feature B"))
test_results
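
The notebook does not show which interval or test is applied to these pairs. As one possible approach, the sketch below computes a percentile bootstrap confidence interval for the difference in means between two groups. The function name `bootstrap_mean_diff_ci`, the synthetic top speeds and all parameter values are assumptions made for illustration.

```python
import numpy as np


def bootstrap_mean_diff_ci(a, b, n_boot=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for mean(a) - mean(b)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement, keeping its original size
        diffs[i] = rng.choice(a, size=a.size).mean() - rng.choice(b, size=b.size).mean()
    alpha = 1.0 - level
    lower, upper = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lower, upper


# Synthetic top speeds (km/h) for two hypothetical segments
rng = np.random.default_rng(seed=42)
segment_a = rng.normal(loc=230, scale=20, size=40)
segment_b = rng.normal(loc=160, scale=15, size=60)

low, high = bootstrap_mean_diff_ci(segment_a, segment_b)
print(f"95% CI for the difference in means: [{low:.1f}, {high:.1f}] km/h")
```

If the interval excludes zero, the difference in means between the two groups is unlikely to be due to sampling variation alone. With ten pairwise comparisons, a correction for multiple testing would also be worth considering.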

## 4.3 Torque, Acceleration and Top speed by vehicle segment

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
# Torque vs acceleration
sns.scatterplot(df.rename(columns=xlabel_map),
                x="Torque (Nm)",
                y="Acceleration 0-100 km/h (s)",
                hue="New Segment",
                style="Drivetrain",
                s=100,
                ax=ax)
# Edit the plot
ax.grid(True, linewidth=0.5, linestyle=":")
ax.spines[["top", "right", "bottom", "left"]].set_visible(False)
ax.set_xticks(ax.get_xticks())
ax.set_xticklabels(ax.get_xticklabels(), fontsize=14)
ax.set_yticks(ax.get_yticks())
ax.set_yticklabels(ax.get_yticklabels(), fontsize=14)
ax.set_xlabel(ax.get_xlabel(), fontsize=16)
ax.set_ylabel(ax.get_ylabel(), fontsize=16)

In [None]:
mean_torque = df_torque.torque_nm.mean()
mean_torque

In [None]:
total_count_AWD = df_torque[(df_torque["drivetrain"] == "AWD")]["model"].count()
count_count_AWD_greater_than_m =df_torque[(df_torque["drivetrain"] == "AWD") & (df_torque["torque_nm"] > mean_torque)]["model"].count()

share_awd = round((count_count_AWD_greater_than_m / total_count_AWD) * 100, 1)
print(f"{share_awd}% of AWD vehicles have a torque greater than the mean of {mean_torque:.2f} Nm")

In [None]:
luxury_AWD = df_torque[(df_torque["drivetrain"] == "AWD") &
    (df_torque["new_segment"] == "Luxury & Sports")][["model", "drivetrain", "torque_nm"]].shape[0]
luxury_total = df_torque[(df_torque["new_segment"] 
                          == "Luxury & Sports")][["model", "drivetrain", "torque_nm"]].shape[0]
print(round((luxury_AWD/luxury_total)*100, 1), "% of Luxury & Sports vehicles are AWD")

In [None]:
luxury_RWD = df_torque[(df_torque["drivetrain"] == "RWD") &
    (df_torque["new_segment"] == "Luxury & Sports")][["model", "drivetrain", "torque_nm"]].shape[0]
print(round((luxury_RWD/luxury_total)*100, 1), "% of Luxury & Sports vehicles are RWD")

The seven vehicles with missing torque data are not included in this part of the analysis.

94.6% of AWD vehicles have a higher torque than the mean of all vehicles (498.01 Nm). Many of the RWD vehicles cluster around the mean torque, and all the FWD vehicles in the dataset have torque values below the mean.

The vehicles belonging to the _Mini & Compact_ segment present low values of torque and long acceleration times, with very few exceptions such as the Smart Brabus, which is an AWD vehicle and has a torque value (584 Nm) greater than the mean.

Vehicles of the _Passenger Vans_ and _Medium_ segments can also be found at the lower end of the torque values, which also translates into long acceleration times.

_Large & Executive_ vehicles are spread over a wide range of torque and acceleration values. Their distribution follows the trend previously mentioned regarding the drivetrain: FWD vehicles are at the lower end of torque and acceleration performance, followed by RWD vehicles, many of which cluster around the mean values of torque (498.01 Nm) and acceleration (6.88 s), and finally the AWD vehicles, which overlap with those of the _Luxury & Sports_ segment.

_Luxury & Sports_ contains the vehicles with the highest torque values and the shortest acceleration times, with 17.9% being RWD and 82.1% being AWD.
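
The share of vehicles above the overall mean can be computed for all drivetrains in one step. The following sketch uses toy values rather than the dataset and relies on the fact that the mean of a boolean Series is the share of `True` values.

```python
import pandas as pd

# Toy data, not taken from the dataset
toy = pd.DataFrame({
    "drivetrain": ["AWD", "AWD", "AWD", "RWD", "RWD", "FWD", "FWD"],
    "torque_nm": [700, 650, 400, 500, 350, 300, 250],
})

overall_mean = toy["torque_nm"].mean()
above_mean = toy["torque_nm"] > overall_mean

# The mean of a boolean Series is the share of True values
share_above = above_mean.groupby(toy["drivetrain"]).mean().mul(100).round(1)

print(f"Overall mean torque: {overall_mean:.1f} Nm")
print(share_above.to_dict())
```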

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
# Torque vs top speed
sns.scatterplot(df.rename(columns=xlabel_map),
                x="Torque (Nm)",
                y="Top Speed (km/h)",
                hue="New Segment",
                style="Drivetrain",
                s=100,
                ax=ax)
# Edit the plot
ax.grid(True, linewidth=0.5, linestyle=":")
ax.spines[["top", "right", "bottom", "left"]].set_visible(False)
ax.set_xticks(ax.get_xticks())
ax.set_xticklabels(ax.get_xticklabels(), fontsize=14)
ax.set_yticks(ax.get_yticks())
ax.set_yticklabels(ax.get_yticklabels(), fontsize=14)
ax.set_xlabel(ax.get_xlabel(), fontsize=16)
ax.set_ylabel(ax.get_ylabel(), fontsize=16)

In [None]:
df_torque[(df_torque["torque_nm"].between(700,800)) & (df_torque["new_segment"] == "Medium")]

The torque and top speed scatter plot shows similar trends to those of the previous scatter plot, which aligns with the expectation that vehicles that accelerate quickly also tend to have high top speeds.

_Mini & Compact_ vehicles together with the _Passenger Vans_ show the lowest top speeds. Vehicles belonging to the _Medium_ segment have overall somewhat higher top speeds than the previous segments, but some vehicles of this segment show top speeds comparable to those of the _Large & Executive_ and _Luxury & Sports_ segments, such as the KIA EV6 and the Hyundai IONIQ 5 N, both reaching top speeds of 260 km/h with AWD drivetrains. The biggest overlap is still observed between the _Large & Executive_ and _Luxury & Sports_ segments.

In [None]:
sns.catplot(df,
            kind="strip",
            x="torque_nm",
            y="new_segment",
            col="drivetrain")

In [None]:
df_by_segment_drivetrain = df[(df["new_segment"] == "Passenger Van")]
df_by_segment_drivetrain.loc[:, ["brand", "model", "new_segment", "torque_nm", "acceleration_0_100_s", "drivetrain"]]