# **NLP Intent Parser for Industrial Technician Queries**

A modular pipeline consisting of:
1. Topic Router (LDA, SVM, Mini-BERT)
2. Intent + Target + Parameter Token Classifier (DistilBERT, BiLSTM, LSTM)
3. Context Resolver for domain-aware refinement

This notebook demonstrates preprocessing, embeddings, token labeling, 
three different modeling strategies, evaluation, and comparison.

### **1. Import and Setup**


In [5]:
!pip install --upgrade pip



In [6]:
!pip install pandas numpy scikit-learn nltk torch seaborn matplotlib transformers tensorflow



In [7]:
%pip install tensorflow

import tensorflow as tf
print(tf.__version__)

Note: you may need to restart the kernel to use updated packages.
2.20.0


**Why We Generated the Dataset Ourselves**

No publicly available dataset captures technician-style microgrid instructions with the level of structure we need (intent, target, parameter, modifier, conditions). Real industrial datasets are either private or messy, and they rarely come with clean, interpretable labels. Since our goal is to benchmark different NLP models, not to clean handwritten maintenance logs, synthetic data gives us full control over balance, coverage, and consistency.

It also lets us shape the problem exactly as we want to model it. Prototyping on synthetic data before fine-tuning on real operational data is common practice.

In [8]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from transformers import BertTokenizer


from transformers import AutoTokenizer, AutoModelForTokenClassification



### **2. Data Exploration (EDA)**

**The first step is to confirm formatting and make sure all columns loaded correctly.**

*Our EDA focuses on validating distribution, coverage, and linguistic variety across intents, targets, and parameters. Since the dataset is synthetic, the goal isn’t noise inspection but ensuring balance, realism, and sufficient diversity to train and compare NLP models reliably.*

In [9]:
df = pd.read_csv('./data/solar_ds.csv')

In [10]:
df.head()

| | query | intent | target | parameter | modifier | conditions |
|---|---|---|---|---|---|---|
| 0 | Log irradiance readings on the inverter. | log | inverter | irradiance | overload | during_peak_hours |
| 1 | Monitor microgrid_controller — temperature see... | monitor | microgrid_controller | temperature | sudden_drop | during_peak_hours |
| 2 | Inspect inverter — efficiency seems critical. | inspect | inverter | efficiency | critical | during_peak_hours |
| 3 | Optimize anomaly in inverter temperature. | optimize | inverter | temperature | high | at_night |
| 4 | Reset anomaly in battery_bank temperature. | reset | battery_bank | temperature | high | under_cloud_cover |


In [11]:
df.sample(5)

| | query | intent | target | parameter | modifier | conditions |
|---|---|---|---|---|---|---|
| 1059 | Inspect anomaly in inverter fault_code. | inspect | inverter | fault_code | low | under_cloud_cover |
| 4962 | Diagnose current readings on the grid_tie_inve... | diagnose | grid_tie_inverter | current | overload | post_storm |
| 3884 | Reset issue detected in inverter fault_code. | reset | inverter | fault_code | critical | at_night |
| 3086 | Log anomaly in battery_bank output_power. | log | battery_bank | output_power | critical | during_peak_hours |
| 3761 | Reset issue detected in charge_controller faul... | reset | charge_controller | fault_code | critical | at_night |


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   query       5000 non-null   object
 1   intent      5000 non-null   object
 2   target      5000 non-null   object
 3   parameter   5000 non-null   object
 4   modifier    5000 non-null   object
 5   conditions  5000 non-null   object
dtypes: object(6)
memory usage: 234.5+ KB
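
One way to make the coverage check concrete is to count which (intent, target) combinations actually occur. The sketch below is illustrative only: it uses the five rows shown by `df.head()` and the label sets that appear in this dataset, and the names `seen`, `missing`, and `coverage` are hypothetical helpers, not part of the notebook.

```python
from itertools import product

# Label sets as they appear in this dataset.
intents = ["log", "monitor", "inspect", "optimize",
           "reset", "predict", "diagnose", "check"]
targets = ["inverter", "microgrid_controller", "battery_bank", "solar_panel",
           "smart_meter", "pv_array", "grid_tie_inverter", "charge_controller"]

# (intent, target) pairs from the five rows shown by df.head().
rows = [
    ("log", "inverter"),
    ("monitor", "microgrid_controller"),
    ("inspect", "inverter"),
    ("optimize", "inverter"),
    ("reset", "battery_bank"),
]

seen = set(rows)
# Every combination that never occurs in the sample.
missing = [pair for pair in product(intents, targets) if pair not in seen]
coverage = len(seen) / (len(intents) * len(targets))

print(f"{len(seen)} of {len(intents) * len(targets)} pairs covered ({coverage:.1%})")
```

On the full frame, the same idea would start from `set(zip(df["intent"], df["target"]))` instead of the hand-copied `rows` list.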


### **3. Preprocessing**

This section transforms raw queries into model-ready inputs for both the classical LSTM/BiLSTM pipeline and the BERT pipeline.

We only perform necessary cleaning steps as the synthetic data is already consistent.

##### **3.1 Normalisation**

Even though the dataset is synthetic, we will apply minimal normalisation for consistency across models:

- Lowercasing (for LSTM/BiLSTM only; the uncased DistilBERT tokenizer lowercases its input itself)
- Stripping extra whitespace
- Optional punctuation spacing (only if needed)

In [13]:
def normalise(text):
    return " ".join(text.lower().strip().split())

df["text_norm"] = df["query"].apply(normalise)
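
To see what the normalisation step does in isolation, here is a small standalone sketch. The function body matches the cell above; the sample strings are made up for illustration.

```python
def normalise(text: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace into single spaces."""
    return " ".join(text.lower().strip().split())

# Hypothetical inputs with stray spaces, a tab, and a newline.
samples = [
    "  Log   irradiance readings on the inverter. ",
    "Inspect inverter\t efficiency\nseems critical.",
]

normalised = [normalise(s) for s in samples]
for before, after in zip(samples, normalised):
    print(repr(before), "->", repr(after))
```

Punctuation is left attached to the neighbouring word here, so any punctuation handling is deferred to the tokeniser.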


##### **3.2 Train/Test Split**

In [14]:
train_df, test_df = train_test_split(
    df,
    test_size=0.2,
    random_state=42,
    stratify=df["intent"]
)
print(f"Train shape: {train_df.shape}, Test shape: {test_df.shape}")

grouped = train_df.groupby('intent').size()
print(grouped)

Train shape: (4000, 7), Test shape: (1000, 7)
intent
check       489
diagnose    492
inspect     472
log         531
monitor     469
optimize    541
predict     509
reset       497
dtype: int64
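
A quick way to summarise the balance of the training split is the ratio between the largest and smallest class. The sketch below reuses the counts printed above; `imbalance_ratio` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
from collections import Counter

# Training-split counts as printed by the groupby above.
train_counts = Counter({
    "check": 489, "diagnose": 492, "inspect": 472, "log": 531,
    "monitor": 469, "optimize": 541, "predict": 509, "reset": 497,
})

def imbalance_ratio(counts):
    """Largest class size divided by smallest (1.0 means perfectly balanced)."""
    return max(counts.values()) / min(counts.values())

total = sum(train_counts.values())
ratio = imbalance_ratio(train_counts)
print(f"total={total}, max/min ratio={ratio:.2f}")
```

A ratio this close to 1 suggests that plain accuracy is a reasonable first metric for the intent head, although per-class scores are still worth reporting.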


##### **3.3 Labels for Intent, Target, Parameter**

In [15]:
import numpy as np

intent2id = {lbl: i for i, lbl in enumerate(df["intent"].unique())}
target2id = {lbl: i for i, lbl in enumerate(df["target"].unique())}
param2id = {lbl: i for i, lbl in enumerate(df["parameter"].unique())}

df["intent_id"] = df["intent"].map(intent2id)
df["target_id"] = df["target"].map(target2id)
df["param_id"] = df["parameter"].map(param2id)

# Print the mappings
print("Intent to ID mapping:")
print(intent2id)
print("\nTarget to ID mapping:")
print(target2id)
print("\nParameter to ID mapping:")
print(param2id)

Intent to ID mapping:
{'log': 0, 'monitor': 1, 'inspect': 2, 'optimize': 3, 'reset': 4, 'predict': 5, 'diagnose': 6, 'check': 7}

Target to ID mapping:
{'inverter': 0, 'microgrid_controller': 1, 'battery_bank': 2, 'solar_panel': 3, 'smart_meter': 4, 'pv_array': 5, 'grid_tie_inverter': 6, 'charge_controller': 7}

Parameter to ID mapping:
{'irradiance': 0, 'temperature': 1, 'efficiency': 2, 'state_of_charge': 3, 'fault_code': 4, 'output_power': 5, 'voltage': 6, 'frequency': 7, 'load_balance': 8, 'current': 9}
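
Model outputs are integer IDs, so an inverse mapping is usually needed to turn predictions back into label strings. The sketch below shows one way to build it from the intent mapping printed above; `id2intent`, `decode`, and `intent2id_sorted` are illustrative names. Because `unique()` assigns IDs in order of first appearance, the mapping depends on row order; sorting the labels first is one option if a stable mapping across reshuffled data is wanted.

```python
# Intent mapping as printed above.
intent2id = {"log": 0, "monitor": 1, "inspect": 2, "optimize": 3,
             "reset": 4, "predict": 5, "diagnose": 6, "check": 7}

# Inverse mapping for decoding predicted IDs.
id2intent = {i: lbl for lbl, i in intent2id.items()}

def decode(ids, id2label):
    """Map a sequence of integer IDs back to label strings."""
    return [id2label[i] for i in ids]

decoded = decode([0, 4, 7], id2intent)
print(decoded)

# Alternative (an assumption, not what the notebook does): IDs assigned in
# alphabetical order, which does not depend on the order of rows in the data.
intent2id_sorted = {lbl: i for i, lbl in enumerate(sorted(intent2id))}
```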


##### **3.4 Tokenisation**

**A) Classical (LSTM/Bi-LSTM) Tokeniser**

In [16]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tk = Tokenizer(num_words=20000, oov_token="<UNK>")
tk.fit_on_texts(train_df["text_norm"])

train_seq = tk.texts_to_sequences(train_df["text_norm"])
test_seq = tk.texts_to_sequences(test_df["text_norm"])

MAX_LEN = 32
train_seq = pad_sequences(train_seq, maxlen=MAX_LEN, padding="post")
test_seq = pad_sequences(test_seq, maxlen=MAX_LEN, padding="post")
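
To make the effect of `padding="post"` concrete, the following pure-Python sketch mimics what `pad_sequences` does with the settings used above. `pad_post` is a hypothetical stand-in written for illustration, and the token IDs are invented. It assumes the Keras default `truncating="pre"`, which keeps the last `maxlen` tokens of an over-long sequence.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence on the right with `value`; truncate from the front."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the last `maxlen` tokens
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

# Made-up token IDs: one short sequence, one longer than maxlen.
example = pad_post([[5, 2, 9], [7, 1, 4, 8, 3, 6]], maxlen=4)
print(example)
```

One thing worth checking: the Keras `Tokenizer` default `filters` string includes the underscore, so a token such as `battery_bank` would likely be split into `battery` and `bank` unless `filters` is overridden.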

**B) DistilBERT Tokeniser (Hugging Face)**

In [17]:
from transformers import DistilBertTokenizerFast

bert_tok = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")


def encode_batch(texts):
    return bert_tok(
        texts.tolist(),
        padding=True,
        truncation=True,
        max_length=64,
        return_tensors="pt"
    )


train_bert = encode_batch(train_df["query"])
test_bert = encode_batch(test_df["query"])
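
With `padding=True`, the tokenizer pads every sequence to the longest one in the call and returns an `attention_mask` marking real tokens with 1 and padding with 0; `max_length=64` only acts as a cap through truncation. The sketch below reproduces that behaviour in plain Python. `pad_with_mask` is a hypothetical helper, the inner token IDs are placeholders, and `pad_id=0` assumes the `[PAD]` ID of the uncased BERT vocabulary.

```python
def pad_with_mask(batch, pad_id=0):
    """Pad ID lists to the longest one and build the matching attention mask."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# 101 and 102 stand for [CLS] and [SEP]; the IDs in between are placeholders.
encoded = pad_with_mask([[101, 11, 102], [101, 21, 22, 23, 102]])
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Since `encode_batch` is called once per split, every query in that split is padded to the length of its longest query.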

##### **3.5 Final Training Dictionaries**

In [None]:
print(df["intent_id"])

# The split in section 3.2 ran before the *_id columns were added to df,
# so train_df and test_df do not carry them. Map the label columns directly.
train_labels = {
    "intent": train_df["intent"].map(intent2id).to_numpy(),
    "target": train_df["target"].map(target2id).to_numpy(),
    "parameter": train_df["parameter"].map(param2id).to_numpy(),
}

test_labels = {
    "intent": test_df["intent"].map(intent2id).to_numpy(),
    "target": test_df["target"].map(target2id).to_numpy(),
    "parameter": test_df["parameter"].map(param2id).to_numpy(),
}


0       0
1       1
2       2
3       3
4       4
       ..
4995    4
4996    3
4997    0
4998    0
4999    0
Name: intent_id, Length: 5000, dtype: int64