## Topic Modeling of AI Usage in the Automotive Industry

This notebook identifies orthogonal focus areas and sub-areas (technologies) of AI research and application in the automotive sector using a corpus of approximately 25,000 patents and academic papers.

Document titles and abstracts are combined into a single text field and embedded using a Sentence Transformer. A seed-based cosine similarity approach is then used to assign each document to the most relevant focus area and technology. Results are visually validated using 2D UMAP to assess semantic separation and seed orthogonality.
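
The seed-based assignment described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the notebook's actual pipeline: the 3-dimensional toy vectors stand in for real sentence embeddings, and the helper names `l2_normalize`, `assign_to_seeds` and `max_seed_overlap` are hypothetical. It shows how each document receives the label of its most similar seed, how the gap between the best and second-best seed can serve as a confidence margin, and how the largest off-diagonal seed similarity gives a rough indication of seed orthogonality.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so that a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def assign_to_seeds(doc_emb, seed_emb, seed_labels):
    """Return the best seed label, its similarity and the top1-top2 margin per document."""
    sims = l2_normalize(doc_emb) @ l2_normalize(seed_emb).T  # (n_docs, n_seeds)
    order = np.argsort(sims, axis=1)                         # ascending per row
    rows = np.arange(sims.shape[0])
    top1, top2 = order[:, -1], order[:, -2]
    labels = [seed_labels[i] for i in top1]
    top1_sim = sims[rows, top1]
    margin = top1_sim - sims[rows, top2]
    return labels, top1_sim, margin


def max_seed_overlap(seed_emb):
    """Largest off-diagonal cosine similarity between seeds (lower means more orthogonal)."""
    s = l2_normalize(seed_emb)
    sim = s @ s.T
    np.fill_diagonal(sim, -np.inf)  # ignore each seed's similarity with itself
    return float(sim.max())


# Toy vectors (assumption): illustrative stand-ins for sentence embeddings.
toy_seeds = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
toy_docs = np.array([[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]])
toy_labels = ["Perception", "Robotics"]

labels, top1_sim, margin = assign_to_seeds(toy_docs, toy_seeds, toy_labels)
print(labels)
print(margin)
print(max_seed_overlap(toy_seeds))
```

A small margin suggests a document sits between two seeds, and a high seed overlap suggests two seed descriptions compete for the same documents.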

In [None]:
AUTO_TOPS = [
    "Perception",
    "Communication_Technologies",
    "Energy_Source",
    "Energy_Storage",
    "Energy_Management",
    "Urban_Mobility",
    "Manufacturing",
    "Robotics",
    "Cybersecurity",
]

AUTO_AREA_SEEDS = {
    
    "Perception": {
        "seed": (
            "This area covers how a vehicle perceives and reconstructs its surrounding environment "
            "using sensors such as cameras, radars, lidars and inertial units. The goal is to detect "
            "lanes, vehicles, pedestrians, obstacles, free space and road structure by combining "
            "multi-sensor data into a unified spatial representation. It includes object detection, "
            "segmentation, depth estimation, multi-sensor fusion, 3D mapping, localization and "
            "environmental scene understanding for driving and navigation."
        )
    },
    
    "Communication_Technologies": {
        "seed": (
            "This area covers all communication between the vehicle, the infrastructure and the cloud. "
            "It includes cellular connectivity, V2X communication, wireless communication, vehicle-to-cloud data exchange, "
            "edge and fog computing integration and telematics services. It also spans in-vehicle data "
            "routing to gateways and central compute units, over-the-air update delivery, remote "
            "diagnostics and large-scale backend platforms that collect, process and distribute "
            "vehicle data for services and fleet-level coordination."
        )
    },

    "Energy_Source": {
        "seed": (
        "This area addresses how energy is originally generated. It includes "
        "electrochemical energy from traction battery cells, advanced chemistry-based materials research, microscopic molecular- and atomic-level research, hydrogen-based fuel cell systems and "
        "auxiliary renewable sources such as vehicle-integrated solar cells. The focus is on primary "
        "energy generation principles, conversion efficiency, availability, power density and "
        "fundamental source-level performance characteristics."
    )
},

    "Energy_Storage": {
        "seed": (
        "This area addresses how energy is stored, buffered and preserved inside the vehicle before "
        "being used for propulsion. It includes lithium batteries, rechargeable batteries, high-voltage battery packs, modular battery "
        "architectures, hybrid storage systems combining batteries and supercapacitors, and safety "
        "mechanisms ensuring thermal, electrical and mechanical stability. Degradation, ageing, "
        "state-of-charge(SOC), state-of-health(SOH), second-life utilisation and usable capacity optimisation "
        "are central topics of this domain."
    )
},

    "Energy_Management": {
        "seed": (
        "This area addresses how stored energy is actively managed, converted into mechanical motion "
        "and exchanged with external electrical infrastructure. It includes battery management "
        "systems, thermal control, fast and smart charging strategies, traction inverters, electric "
        "motor control, regenerative braking and bidirectional grid interaction concepts such as "
        "vehicle-to-grid and vehicle-to-home. Power flow optimisation, efficiency, grid stability, "
        "real-time control and system-level energy orchestration belong to this domain."
    )
},

    "Urban_Mobility": {
        "seed": (
        "This area addresses city-scale transportation planning, policy design and infrastructure. "
        "It focuses on macro-level "
        "traffic flow modelling, congestion policy, public transport network design, land-use and "
        "transport integration, emission regulation, accessibility planning and long-term urban "
        "mobility investment strategies. It includes demand forecasting for city transport systems, "
        "equity of access, environmental impact assessment and governance of multimodal transport "
        "ecosystems at metropolitan scale. It also covers interdisciplinary finance, marketing "
        "and after-sales markets in the mobility and manufacturing industries. It includes credit and "
        "loan default prediction, customer risk profiling, dynamic pricing, demand forecasting, "
        "customer churn and loyalty modelling, market segmentation, promotion effectiveness, "
        "brand impact analysis and aftermarket revenue optimisation."
    )
},

    "Manufacturing": {
        "seed": (
            "This area covers artificial intelligence applied to vehicle production and factory "
            "operations. It includes vision-based quality inspection, defect detection, process "
            "monitoring, predictive maintenance for machines, production line balancing, scheduling, "
            "and automation of material handling. The focus is on intelligent, efficient and highly "
            "automated automotive manufacturing systems rather than energy or data analytics."
        )
    },
    
    "Robotics": {
        "seed": (
            "This area focuses on mobile and stationary robots that act as physical agents in factories "
            "and logistics. It includes robotic assembly in body and final assembly shops, autonomous "
            "mobile robots and automated guided vehicles for internal warehouse logistics, last-mile "
            "delivery robots and specialised mobile platforms equipped with end effectors, nozzles or "
            "stabiliser arms for industrial processing. Fleet coordination, indoor navigation and "
            "cooperation of multiple robotic units are central topics."
        )
    },

    "Cybersecurity": {
        "seed": (
            "This area covers protection of communication links and data against "
            "attacks and failures, together with governance and safety concepts. It includes secure "
            "boot and firmware integrity, cryptographic communication, gateways and "
            "firewalls, intrusion detection, secure over-the-air update "
            "mechanisms and backend security for connected vehicle services. "
            "Redundancy concepts and cybersecurity regulations are also "
            "part of this domain."
        )
    },

}


In [None]:
AUTO_TOP_SEEDS_1 = {
    "Sensor_Fusion": "multi sensor fusion architecture combining lidar radar and camera into unified perception outputs",
    "Occupancy_Grid": "spatial occupancy grid mapping for free space and obstacle representation around the vehicle",
    "SLAM": "simultaneous localization and mapping using onboard sensors for ego pose and map building in dynamic traffic",
    "Trajectory_Prediction": "future motion and path prediction of vehicles and vulnerable road users in traffic scenes",
    "Environment_Modeling": "semantic scene and object relationship modeling for a structured driving environment representation"
}
AUTO_TOP_SEEDS_2 = {
    "4G": "4g lte cellular communication for connected vehicles",
    "5G": "5g cellular communication for ultra low latency vehicle connectivity",
    "Wireless_Communication": "wireless communication protocols for vehicle data transmission",
    "V2V": "vehicle to vehicle direct communication",
    "V2I": "vehicle to infrastructure communication",
    "Edge_Computing": "edge computing for real time vehicle data processing",
    "Cloud_Computing": "cloud computing backend for vehicle data storage and processing"
}
AUTO_TOP_SEEDS_3 = {
    "Solar_Cell": "photovoltaic solar cells (solar cell, perovskite solar, sensitized solar, organic photovoltaics, silicon solar)",
    "Electrochemical_Energy": "electrochemical energy systems (electrochemical energy, hydrogen evolution)",
    "Nano_Energy": "nanoscale energy systems (nano energy, quantum dots)",
}

AUTO_TOP_SEEDS_4 = {
    "Battery_Management_System": "battery management and control (battery management bms, battery management system, state charge soc, state health soh, battery state health)",
    "Battery_Thermal_Management": "battery thermal and cooling systems (battery thermal management, thermal management systems)",
}
AUTO_TOP_SEEDS_5 = {
    "Smart_Grid": "smart grid control and monitoring (smart grid technologies, power grid)",
    "Distributed_Energy_Resources": "distributed energy generation and control, dc based local energy networks (dc microgrid, distributed energy resources)",
    "V2G_G2V_Technologies": "bidirectional vehicle grid interaction (grid v2g technology, vehicle g2v, bidirectional energy)",
    "Charging_Infrastructure": "ev charging systems and network (charging infrastructure)",
     
}
AUTO_TOP_SEEDS_6 = {
    "Traffic_Planning": "urban traffic congestion dynamics (traffic congestion)",
    "Transport_Infrastructure": "urban transport network and infrastructure (transport infrastructure)",
    "Mobility_Demand_Forecasting": "urban travel demand prediction (demand forecasting)",
    "Micro_Mobility": "e-scooters, bikes, e-bikes and small personal transport (micro mobility)",
}

AUTO_TOP_SEEDS_7 = {
    "InLine_Quality_Inspection": "inline defect, mistake proofing, error prevention, weld quality inspection (poka yoke, defect detection, weld quality)",
    "Predictive_Maintenance": "vehicle and equipment predictive maintenance (vehicle maintenance, predictive maintenance pd)",
    "Process_Monitoring_Optimization": "real time monitoring and waste minimizing stable processes (real time monitoring, proactively finding deviations, minimizing waste optimizing, quality constant process)"
}

AUTO_TOP_SEEDS_8 = {
    "Autonomous_Delivery_Robots": "autonomous robotic delivery and last mile transport (automated delivery, robotic delivery shipping)",
    "AGV_Systems": "automated guided vehicles for structured factory and warehouse transport, inventory tracking (automated guided vehicle)",
    "Hybrid_Modular_Robotics": "hybrid and modular robotic system architectures combining multiple robot types into reconfigurable platforms (hybrid modular)"
}
AUTO_TOP_SEEDS_9 = { 
    "Cyber_Physical_Security": "security of cyber physical automotive systems, cryptographic and encryption ciphering, network intrusion detection, cyber attacks (cyber physical)",
    "InVehicle_Network_Protocols": "automotive communication and bus protocols (controller area network, protocols)",
    "Integrity_Protection": "data and message integrity protection mechanisms (integrity protection)",  
}
AUTO_TOP_SEEDS = {
    "Perception": AUTO_TOP_SEEDS_1,
    "Communication_Technologies": AUTO_TOP_SEEDS_2,
    "Energy_Source": AUTO_TOP_SEEDS_3,
    "Energy_Storage": AUTO_TOP_SEEDS_4,
    "Energy_Management": AUTO_TOP_SEEDS_5,
    "Urban_Mobility": AUTO_TOP_SEEDS_6,
    "Manufacturing": AUTO_TOP_SEEDS_7,
    "Robotics": AUTO_TOP_SEEDS_8,
    "Cybersecurity": AUTO_TOP_SEEDS_9,
}


In [None]:
# === Cell 1: imports & paths ===
from pathlib import Path
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from umap import UMAP
import plotly.express as px
import torch


DATA_DIR    = Path("../../01_data")
CORPUS_PATH = DATA_DIR / "predictive_model" / "df_auto_corpus_labeled.parquet"
OUT_DIR     = DATA_DIR / "predictive_model"

MODEL_DIR   = Path("../../04_models/predictive_techname")
MODEL_DIR.mkdir(parents=True, exist_ok=True)

EMB_PATH    = MODEL_DIR / "doc_embeddings_area_base_free_clean_baai.npy"


# Report the installed PyTorch and CUDA versions (torch is already imported above)
print(torch.__version__)
print(torch.version.cuda)


In [None]:
# === Cell 2: data filtering, AREA assignment with doc embeddings & seed embeddings cosine similarity score ===

df_full = pd.read_parquet(CORPUS_PATH)


df_full["title_l"] = df_full["title"].astype(str).str.lower()
df_full["text_l"]  = df_full["text"].astype(str).str.lower()


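# Whole-word keyword match: plain substring matching would also hit words such as
# "care", "every" or "level" through the short keywords "car" and "ev".
auto_keywords = ["automotive", "vehicles?", "cars?", "evs?", "autonomous"]
auto_pattern = r"\b(?:" + "|".join(auto_keywords) + r")\b"

mask_auto = (
    df_full["title_l"].str.contains(auto_pattern, regex=True) |
    df_full["text_l"].str.contains(auto_pattern, regex=True)
)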

df = df_full[mask_auto].copy()
print("Shape after filtering:", df.shape)


before_dup = df.shape[0]
df = df.drop_duplicates(subset=["title_l", "text_l"], keep="first")
after_dup = df.shape[0]
print(f"Duplicate cleaned: {before_dup - after_dup} deleted")


df = df.drop(columns=["title_l", "text_l"])

print("source_type distribution after filtering:")
print(df["source_type"].value_counts(), "\n")


keep_types = ["paper", "patent"]
df = df[df["source_type"].isin(keep_types)].copy()
print("Paper + Patent df shape:", df.shape)



encoder_area = SentenceTransformer("BAAI/bge-large-en-v1.5")
print("Encoder dim:", encoder_area.get_sentence_embedding_dimension())


area_cat_embeddings = {}
for label, subdict in AUTO_AREA_SEEDS.items():
    texts = list(subdict.values())
    emb = encoder_area.encode(
        texts,
        convert_to_numpy=True,
        normalize_embeddings=True,
        show_progress_bar=False,
    )
    area_cat_embeddings[label] = emb.mean(axis=0)

cat_matrix = np.stack(list(area_cat_embeddings.values()))  # (n_area, dim)
cat_labels = list(area_cat_embeddings.keys())

print("cat_matrix shape:", cat_matrix.shape)
print("cat_labels:", cat_labels)


texts_non = df["text"].fillna("").tolist()
if EMB_PATH.exists():
    print(">> doc_embeddings_area_base_free_clean_baai.npy found, loading...")
    doc_emb_non = np.load(EMB_PATH)
    if doc_emb_non.shape[0] != len(texts_non):
        print("!! WARNING: cached row count does not match the corpus, re-encoding...")
        doc_emb_non = encoder_area.encode(
            texts_non,
            batch_size=64,
            show_progress_bar=True,
            convert_to_numpy=True,
            normalize_embeddings=True,
        )
        np.save(EMB_PATH, doc_emb_non)
else:
    print(">> doc_embeddings_area_base_free_clean_baai.npy not found, encoding...")
    doc_emb_non = encoder_area.encode(
        texts_non,
        batch_size=64,
        show_progress_bar=True,
        convert_to_numpy=True,
        normalize_embeddings=True,
    )
    np.save(EMB_PATH, doc_emb_non)

print("Final doc_emb_non shape:", doc_emb_non.shape)
assert doc_emb_non.shape[0] == len(texts_non)


sims_area = doc_emb_non @ cat_matrix.T   # (N_docs, n_area)
rows = np.arange(sims_area.shape[0])

sorted_idx   = np.argsort(sims_area, axis=1)
top1_idx     = sorted_idx[:, -1]
top2_idx     = sorted_idx[:, -2]

top1_scores  = sims_area[rows, top1_idx]
top2_scores  = sims_area[rows, top2_idx]

top1_labels  = [cat_labels[i] for i in top1_idx]
top2_labels  = [cat_labels[i] for i in top2_idx]

df["auto_top8_pred"]   = top1_labels
df["seed_top1_sim"]    = top1_scores
df["seed_top2_label"]  = top2_labels
df["seed_top2_sim"]    = top2_scores
df["margin_pp"]        = df["seed_top1_sim"] - df["seed_top2_sim"]


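# Numeric prefix of the predicted label, if any. The current seed labels carry
# no such prefix, so this column is NaN for every row.
df["auto_focus_label"] = (
    df["auto_top8_pred"].astype(str)
    .str.extract(r"^\s*(\d+)", expand=False)
)

# Area name with any leading "<digits>_" or "<digits>-" prefix stripped
# (a no-op for the current labels).
df["auto_focus_area"] = (
    df["auto_top8_pred"]
    .astype(str)
    .str.replace(r"^\s*\d+\s*[_\-]\s*", "", regex=True)
)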

print("AREA distribution (paper+patent):")
print(df["auto_focus_area"].value_counts())


In [None]:
# === Cell 3: UMAP definition 2D for visualisation ===

umap_2d = UMAP(
    n_components=2,
    metric="cosine",
    random_state=42,
    n_neighbors=40,
    min_dist=0.1,
)

X_2d = umap_2d.fit_transform(doc_emb_non)

df_umap = (
    df.reset_index()
    .rename(columns={"index": "orig_index"})
    .assign(
        x_2d=X_2d[:, 0],
        y_2d=X_2d[:, 1],
    )
)

print("UMAP number of documents:", len(df_umap))
print(df_umap["auto_focus_area"].value_counts())

def wrap_vertical(text, max_len=40):
    if not isinstance(text, str):
        return ""
    return "<br>".join(
        [text[i:i+max_len] for i in range(0, len(text), max_len)]
    )

df_umap["text_vertical"] = df_umap["text"].apply(wrap_vertical)


In [None]:
# === Cell 4: 2D Visualisation of the Areas ===

fig = px.scatter(
    df_umap,
    x="x_2d",
    y="y_2d",
    color="auto_focus_area",
    hover_data=["source_type", "title", "auto_focus_area", "margin_pp", "seed_top2_label"],
)

fig.update_traces(
    marker=dict(size=5, opacity=0.7),
    hoverlabel=dict(
        bgcolor="white",
        font_size=12,
        align="left"
    )
)
fig.update_layout(
    width=2200,
    height=900,
    legend_title_text="Area",
    dragmode="pan",
    xaxis=dict(autorange=True, fixedrange=False),
    yaxis=dict(autorange=True, fixedrange=False),
)

fig.show(config={"scrollZoom": True})


In [None]:
# === Cell 5: TECH assignment with doc embeddings & seed embeddings cosine similarity score ===



global_tech_rows = []
for area_name, tech_dict in AUTO_TOP_SEEDS.items():
    for tech_name, seed_text in tech_dict.items():
        global_tech_rows.append({
            "area_name": area_name,
            "tech_name": tech_name,
            "seed_text": seed_text,
        })

df_tech = pd.DataFrame(global_tech_rows)
print("Number of techs:", len(df_tech))

tech_texts  = df_tech["seed_text"].tolist()
tech_labels = df_tech["tech_name"].tolist()
tech_areas  = df_tech["area_name"].tolist()

tech_emb = encoder_area.encode(
    tech_texts,
    convert_to_numpy=True,
    normalize_embeddings=True,
    show_progress_bar=False,
)

print("tech_emb shape:", tech_emb.shape)


sims_tech = doc_emb_non @ tech_emb.T    # (N_docs, N_tech)
best_idx  = np.argmax(sims_tech, axis=1)
rows      = np.arange(sims_tech.shape[0])

best_scores      = sims_tech[rows, best_idx]
best_tech_labels = [tech_labels[i] for i in best_idx]
best_tech_areas  = [tech_areas[i] for i in best_idx]

df_umap["auto_tech_cluster"]        = best_tech_labels
df_umap["auto_tech_sim"]            = best_scores
df_umap["auto_tech_area_from_tech"] = best_tech_areas  

print(df_umap[["auto_focus_area", "auto_tech_area_from_tech", "auto_tech_cluster", "auto_tech_sim"]].head(20))


In [None]:
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)

display(
    df_umap[
        [
            "auto_focus_area",
            "auto_tech_area_from_tech",
            "auto_tech_cluster",
            "auto_tech_sim",
            "title"
        ]
    ].head(50)
)

In [None]:
# === Cell 6: Cosine similarity histograms for each tech ===
import plotly.express as px


df_hist = df_umap.dropna(subset=["auto_tech_cluster", "auto_tech_sim"]).copy()

for tech, g in df_hist.groupby("auto_tech_cluster"):
    fig = px.histogram(
        g,
        x="auto_tech_sim",
        nbins=30,
        title=f"Cosine similarity distribution – {tech}",
    )
    fig.update_xaxes(range=[0, 1])          
    fig.update_yaxes(title="Document count")
    fig.show()


In [None]:
# === Cell 7: Cosine similarity Summaries ===


print("Average auto_tech_sim:", df_umap["auto_tech_sim"].mean())


summary = (
    df_umap
    .dropna(subset=["auto_tech_cluster", "auto_tech_sim"])
    .groupby("auto_tech_cluster")["auto_tech_sim"]
    .agg(["count", "mean", "median", "std"])
    .sort_values("mean")
)

summary

In [None]:
# === Cell 8: 2D Visualisation of the Techs ===

fig = px.scatter(
    df_umap,
    x="x_2d",
    y="y_2d",
    color="auto_tech_cluster",
    symbol="auto_tech_cluster",
    hover_data=["title", "auto_focus_area", "auto_tech_cluster", "auto_tech_sim"],
)

fig.update_traces(marker=dict(size=5, opacity=0.7))
fig.update_layout(
    width=2200,
    height=900,
    legend_title_text="Tech cluster",
    dragmode="pan",
    xaxis=dict(autorange=True, fixedrange=False),
    yaxis=dict(autorange=True, fixedrange=False),
)

fig.show(config={"scrollZoom": True})

print(df_umap["auto_tech_cluster"].value_counts().head(20))


In [None]:
# === Cell 9: Silhouette scores ===

from sklearn.metrics import silhouette_score


# doc_emb_non is row-aligned with df_umap (paper + patent documents only), so the
# labels are read from df_umap; df_auto_corpus_area_tech is only created in Cell 10.
labels_area = df_umap["auto_focus_area"].tolist()


sil_area = silhouette_score(doc_emb_non, labels_area, metric="cosine")
print("Silhouette (AREA) =", sil_area)


labels_tech = df_umap["auto_tech_cluster"].tolist()


sil_tech = silhouette_score(doc_emb_non, labels_tech, metric="cosine")
print("Silhouette (TECH) =", sil_tech)


In [None]:
# === Cell 10: Exporting the data ===

cols = [
    "year",
    "month",
    "text",
    "source_type",
    "auto_focus_area",           
    "auto_tech_cluster",         
    "auto_tech_sim",             
    "auto_tech_area_from_tech",  
    "seed_top1_sim",
    "seed_top2_label",
    "seed_top2_sim",
]

df_auto_corpus_area_tech = df_umap[cols].copy()

save_path = OUT_DIR / "df_auto_corpus_area_tech_free_clean_baai.parquet"
df_auto_corpus_area_tech.to_parquet(save_path)

save_path