# **NLP Intent Parser for Industrial Technician Queries**

A modular pipeline consisting of:
1. Topic Router (LDA, SVM, Mini-BERT)
2. Intent + Target + Parameter Token Classifier (DistilBERT, BiLSTM, LSTM)
3. Context Resolver for domain-aware refinement

This notebook demonstrates preprocessing, embeddings, token labeling, 
three different modeling strategies, evaluation, and comparison.

1. Dataset Creation  
2. EDA  
3. Preprocessing (clean + unified)

4. Baseline
   - TF-IDF + Linear SVM

5. Classical Deep Learning Pipeline
   - LSTM
   - BiLSTM
   - Training + Evaluation

6. Transformer Pipeline
   - DistilBERT token classifier
   - Training + Evaluation

7. End-to-End Intent Parser Demo
8. Model Comparison Summary


### **1. Import and Setup**


In [None]:
!pip install --upgrade pip

In [None]:
!pip install pandas numpy scikit-learn nltk torch seaborn matplotlib transformers tensorflow

In [None]:
%pip install tensorflow

import tensorflow as tf
print(tf.__version__)

**Why We Generated the Dataset Ourselves**

There isn’t any publicly available dataset that captures "technician-style" micro-grid instructions with the level of structure we need (intent, target, parameter, modifier, conditions). Real industrial datasets are either private, messy, and rarely come with clean labels or ones we can make sense of. Since our goal here is to benchmark different NLP models, not to clean handwritten maintenance logs, synthetic data gives us full control over the balance, coverage, and consistency.

It lets us shape the exact problem in the manner that we want to model, and it’s standard practice during early prototyping before fine-tuning on real operational data later.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from transformers import BertTokenizer


from transformers import AutoTokenizer, AutoModelForTokenClassification

### **2. Data Exploration (EDA)**

**The first step is to confirm formatting and make sure all columns loaded correctly.**

*Our EDA focuses on validating distribution, coverage, and linguistic variety across intents, targets, and parameters. Since the dataset is synthetic, the goal isn’t noise inspection but ensuring balance, realism, and sufficient diversity to train and compare NLP models reliably.*

In [None]:
df = pd.read_csv('./data/solar_ds.csv')

In [None]:
df.head()

In [None]:
df.sample(5)

In [None]:
df.info()

### **3. Preprocessing**

This section transforms raw queries into model-ready inputs for both the classical LSTM/BiLSTM pipeline and the BERT pipeline.

We only perform necessary cleaning steps as the synthetic data is already consistent.

##### **3.1 Normalisation**

Even though the dataset is synthetic, we will apply minimal normalisation for consistency across models:

Lowercasing (for LSTM/BiLSTM only — BERT does its own thing)

Strip extra whitespace

Optional punctuation spacing (only if needed)

In [None]:
def normalise(text):
    return " ".join(text.lower().strip().split())


df["text_norm"] = df["query"].apply(normalise)

##### **3.2 Train/Test Split**

We stratify by intent to preserve class balance.

In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(
    df,
    test_size=0.2,
    random_state=42,
    stratify=df["intent"]
)

print(train_df.shape, test_df.shape)

##### **3.3 Numerical Labels for Intent, Target, Parameter**

We create mapping dictionaries and apply them directly to both train_df and test_df.

In [None]:
intent2id = {lbl: i for i, lbl in enumerate(df["intent"].unique())}
target2id = {lbl: i for i, lbl in enumerate(df["target"].unique())}
param2id = {lbl: i for i, lbl in enumerate(df["parameter"].unique())}

# Reverse maps for decoding model predictions
id2intent = {v: k for k, v in intent2id.items()}
id2target = {v: k for k, v in target2id.items()}
id2param = {v: k for k, v in param2id.items()}

# Apply to splits
train_df["intent_id"] = train_df["intent"].map(intent2id)
train_df["target_id"] = train_df["target"].map(target2id)
train_df["param_id"] = train_df["parameter"].map(param2id)

test_df["intent_id"] = test_df["intent"].map(intent2id)
test_df["target_id"] = test_df["target"].map(target2id)
test_df["param_id"] = test_df["parameter"].map(param2id)

##### **3.4 Tokenisation**

**A) LSTM/Bi-LSTM Tokeniser**

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tk = Tokenizer(num_words=20000, oov_token="<UNK>")
tk.fit_on_texts(train_df["text_norm"])

train_seq = tk.texts_to_sequences(train_df["text_norm"])
test_seq = tk.texts_to_sequences(test_df["text_norm"])

MAX_LEN = 32
train_seq = pad_sequences(train_seq, maxlen=MAX_LEN, padding="post")
test_seq = pad_sequences(test_seq,  maxlen=MAX_LEN, padding="post")

**B) BERT Tokeniser**

In [None]:
from transformers import DistilBertTokenizerFast

bert_tok = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")


def encode_batch(texts):
    return bert_tok(
        texts.tolist(),
        padding=True,
        truncation=True,
        max_length=64,
        return_attention_mask=True,
        return_tensors="pt"
    )


train_bert = encode_batch(train_df["query"])
test_bert = encode_batch(test_df["query"])

##### **3.5 Final Label Dictionaries**

In [None]:
train_labels = {
    "intent": train_df["intent_id"].values,
    "target": train_df["target_id"].values,
    "parameter": train_df["param_id"].values,
}

test_labels = {
    "intent": test_df["intent_id"].values,
    "target": test_df["target_id"].values,
    "parameter": test_df["param_id"].values,
}

##### **4. Baseline: TF-IDF + Linear SVM**

Before we move into neural models, we establish a classical baseline using TF-IDF features and a Linear SVM classifier.
This gives us a sanity-check: if the neural models can’t beat this, something’s wrong.

We treat this as a pure intent classification problem (single label).

##### **4.1 Vectorise Text with TF-IDF**

We apply TF-IDF to the normalised training text:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2),
    stop_words="english"
)

X_train_tfidf = tfidf.fit_transform(train_df["text_norm"])
X_test_tfidf = tfidf.transform(test_df["text_norm"])

##### **4.2 Train Linear SVM**

Linear SVM performs well on short technical text and is fast to train.

In [None]:
from sklearn.svm import LinearSVC

svm_clf = LinearSVC()
svm_clf.fit(X_train_tfidf, train_df["intent_id"])

##### **4.3 Evaluate Baseline**

In [None]:
from sklearn.metrics import classification_report, accuracy_score

pred_svm = svm_clf.predict(X_test_tfidf)

acc = accuracy_score(test_df["intent_id"], pred_svm)
print("Baseline SVM Accuracy:", round(acc, 4))

print(classification_report(test_df["intent_id"], pred_svm))

##### **5. Classical Deep Learning Pipeline**

This section implements sequence models for intent, target, and parameter classification.
We compare LSTM and BiLSTM to assess whether bidirectionality improves performance.

##### **5.1 Model Inputs**

We already prepared:

- train_seq / test_seq → tokenized and padded sequences for LSTM/BiLSTM

- train_labels / test_labels → intent_id, target_id, param_id

Number of classes:

In [None]:
num_intents = len(intent2id)
num_targets = len(target2id)
num_params = len(param2id)
vocab_size = min(20000, len(tk.word_index) + 1)

##### **5.2 LSTM Model**

In [None]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Bidirectional, Dense
from tensorflow.keras.optimizers import Adam

EMB_DIM = 64

input_layer = Input(shape=(MAX_LEN,), name="input_lstm")
x = Embedding(input_dim=vocab_size, output_dim=EMB_DIM, mask_zero=True)(input_layer)
x = LSTM(64)(x)

intent_out = Dense(num_intents, activation="softmax", name="intent")(x)
target_out = Dense(num_targets, activation="softmax", name="target")(x)
param_out = Dense(num_params, activation="softmax", name="parameter")(x)

input_lstm = Input(shape=(MAX_LEN,), name="input_lstm")
x_lstm = Embedding(input_dim=vocab_size, output_dim=EMB_DIM, mask_zero=True)(input_lstm)
x_lstm = LSTM(64)(x_lstm)

intent_out_lstm = Dense(num_intents, activation="softmax", name="intent")(x_lstm)
target_out_lstm = Dense(num_targets, activation="softmax", name="target")(x_lstm)
param_out_lstm  = Dense(num_params, activation="softmax", name="parameter")(x_lstm)

lstm_model = Model(
    inputs=input_lstm,
    outputs=[intent_out_lstm, target_out_lstm, param_out_lstm]
)
lstm_model.compile(
    optimizer=Adam(),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy", "accuracy", "accuracy"]
)
lstm_model.summary()

##### **5.3 BI-LSTM Model**

In [None]:


MAX_LEN = train_seq.shape[1]

input_bi = Input(shape=(MAX_LEN,), name="input_bi")
x_bi = Embedding(input_dim=vocab_size, output_dim=EMB_DIM,
                 mask_zero=True)(input_bi)
x_bi = Bidirectional(LSTM(64))(x_bi)

intent_out_bi = Dense(num_intents, activation="softmax", name="intent")(x_bi)
target_out_bi = Dense(num_targets, activation="softmax", name="target")(x_bi)
param_out_bi = Dense(num_params, activation="softmax", name="parameter")(x_bi)

bilstm_model = Model(
    inputs=input_bi,
    outputs=[intent_out_bi, target_out_bi, param_out_bi]
)
bilstm_model.compile(
    optimizer=Adam(),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy", "accuracy", "accuracy"]
)
bilstm_model.summary()

##### **5.4 Training Loop**

In [None]:
EPOCHS = 5
BATCH_SIZE = 32

history_lstm = lstm_model.fit(
    train_seq,
    [train_labels["intent"], train_labels["target"], train_labels["parameter"]],
    validation_data=(test_seq, [test_labels["intent"], test_labels["target"], test_labels["parameter"]]),
    epochs=EPOCHS,
    batch_size=BATCH_SIZE
)

history_bilstm = bilstm_model.fit(
    train_seq,
    [train_labels["intent"], train_labels["target"], train_labels["parameter"]],
    validation_data=(test_seq, [test_labels["intent"], test_labels["target"], test_labels["parameter"]]),
    epochs=EPOCHS,
    batch_size=BATCH_SIZE
)

##### **5.5 Evaluation**

In [None]:
# Example: Evaluate BiLSTM on test set
from sklearn.metrics import classification_report, accuracy_score
import numpy as np
pred_intent_bi, pred_target_bi, pred_param_bi = bilstm_model.predict(test_seq)


pred_intent_ids = np.argmax(pred_intent_bi, axis=1)
pred_target_ids = np.argmax(pred_target_bi, axis=1)
pred_param_ids = np.argmax(pred_param_bi, axis=1)

print("Intent Accuracy:", round(accuracy_score(
    test_labels["intent"], pred_intent_ids), 4))
print("Target Accuracy:", round(accuracy_score(
    test_labels["target"], pred_target_ids), 4))
print("Parameter Accuracy:", round(accuracy_score(
    test_labels["parameter"], pred_param_ids), 4))

#### **6. Transformer Pipeline**

We now use a pre-trained DistilBERT model fine-tuned for token-level classification, extracting intent, target, and parameter from technician queries.

##### **6.1 Model Definition**

In [None]:
import torch
from torch import nn
from transformers import DistilBertModel

class DistilBERTClassifier(nn.Module):
    def __init__(self, num_labels):
        super().__init__()
        self.bert = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.dropout = nn.Dropout(0.2)
        self.classifier = nn.Linear(768, num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        cls = outputs.last_hidden_state[:, 0]  # [CLS] embedding
        logits = self.classifier(self.dropout(cls))

        loss = None
        if labels is not None:
            loss_fn = nn.CrossEntropyLoss()
            loss = loss_fn(logits, labels)

        return logits, loss


Model Instantiation

In [None]:
num_intent = len(intent2id)
num_target = len(target2id)
num_param = len(param2id)

model_intent = DistilBERTClassifier(num_intent)
model_target = DistilBERTClassifier(num_target)
model_param = DistilBERTClassifier(num_param)

##### **6.2 Trraining Loop**

In [None]:
def train_transformer(model, inputs, labels, epochs=3, lr=2e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    for epoch in range(epochs):
        model.train()
        logits, loss = model(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            labels=labels
        )
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        print(f"Epoch {epoch+1}/{epochs} - loss: {loss.item():.4f}")

Training each classifier

In [None]:
bert_train_inputs = encode_batch(train_df["query"])
bert_test_inputs = encode_batch(test_df["query"])

print(type(bert_train_inputs))
print(bert_train_inputs.keys())

In [None]:
train_transformer(model_intent, bert_train_inputs, train_labels["intent"])
train_transformer(model_target, bert_train_inputs, train_labels["target"])
train_transformer(model_param,  bert_train_inputs, train_labels["parameter"])

##### **6.4 Evaluation**

In [None]:
def predict_bert(model, inputs):
    model.eval()
    with torch.no_grad():
        logits, _ = model(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"]
        )
    return logits.argmax(dim=-1).numpy()


pred_intent = predict_bert(model_intent, bert_test_inputs)
pred_target = predict_bert(model_target, bert_test_inputs)
pred_param = predict_bert(model_param,  bert_test_inputs)

In [None]:
from sklearn.metrics import classification_report, accuracy_score

print("Intent Accuracy:", accuracy_score(test_labels["intent"], pred_intent))
print(classification_report(test_labels["intent"], pred_intent))

print("Target Accuracy:", accuracy_score(test_labels["target"], pred_target))
print(classification_report(test_labels["target"], pred_target))

print("Parameter Accuracy:", accuracy_score(
    test_labels["parameter"], pred_param))
print(classification_report(test_labels["parameter"], pred_param))

##### **7. End-to-End Intent Parser Demo**

This section shows the full pipeline in action: input → intent extraction → target → parameter.

In [None]:
# Example queries
example_queries = [
    "Diagnose battery bank temperature",
    "Reset microgrid_controller",
    "Check solar_panel efficiency"
]

# Classical pipeline (LSTM/BiLSTM)


def run_classical(query, tokenizer, model, max_len=32):
    seq = tokenizer.texts_to_sequences([query])
    seq_padded = pad_sequences(seq, maxlen=max_len, padding="post")
    intent_pred, target_pred, param_pred = model.predict(seq_padded)
    return intent_pred.argmax(), target_pred.argmax(), param_pred.argmax()

# Transformer pipeline (DistilBERT)


def run_transformer(query, tokenizer, model):
    encoding = tokenizer(query, return_tensors="pt",
                         padding="max_length", truncation=True, max_length=32)
    with torch.no_grad():
        outputs = model(
            input_ids=encoding["input_ids"],
            attention_mask=encoding["attention_mask"]
        )
    pred = outputs.logits.argmax(dim=1).item()
    return pred


print("=== End-to-End Demo ===")
for q in example_queries:
    intent_id, target_id, param_id = run_classical(q, tk, lstm_model)
    print(f"\nQuery: {q}")
    print(
        f"Classical Pipeline → Intent: {intent2id[intent_id]}, Target: {target2id[target_id]}, Parameter: {param2id[param_id]}")

    # Transformer example (intent only)
    intent_trf = run_transformer(q, bert_tok, model_intent)
    print(f"Transformer Pipeline → Intent: {intent2id[intent_trf]}")

##### **8. Model Comparison Summary**

A table + short text summarizing your experiments.

| Model                | Accuracy (Intent) | Accuracy (Target) | Accuracy (Parameter) | Notes |
|---------------------|-----------------|-----------------|--------------------|-------|
| TF-IDF + SVM         | 1.0             | -               | -                  | Baseline classical model |
| LSTM                 | 1.0             | 1.0             | 1.0                | Classical deep learning, trained end-to-end |
| BiLSTM               | 1.0             | 1.0             | 1.0                | Slightly better for sequential dependencies |
| DistilBERT           | 1.0             | 1.0*            | 1.0*               | Transformer-based, handles context; attention improves extraction |


