# **NLP Intent Parser for Industrial Technician Queries**

A modular pipeline consisting of:
1. Topic Router (LDA, SVM, Mini-BERT)
2. Intent + Target + Parameter Token Classifier (DistilBERT, BiLSTM, LSTM)
3. Context Resolver for domain-aware refinement

This notebook demonstrates preprocessing, embeddings, token labeling, 
three different modeling strategies, evaluation, and comparison.


### **1. Import and Setup**

In [1]:
!pip install --upgrade pip



In [2]:
!pip install pandas numpy scikit-learn nltk torch seaborn matplotlib transformers tensorflow



In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from transformers import BertTokenizer




from transformers import AutoTokenizer, AutoModelForTokenClassification






###  **2. Load Technician Query Dataset**

**Why We Generated the Dataset Ourselves**

No publicly available dataset captures "technician-style" micro-grid instructions with the level of structure we need (intent, target, parameter, modifier, conditions). Real industrial datasets are usually private or messy, and they rarely come with clean labels that we can interpret. Since our goal here is to benchmark different NLP models, not to clean handwritten maintenance logs, synthetic data gives us full control over balance, coverage, and consistency.

It lets us shape the problem exactly as we want to model it, and it’s standard practice during early prototyping, before fine-tuning on real operational data.

In [4]:
df = pd.read_csv('./data/solar_ds.csv')    

### **3. Data Exploration (EDA)**

**The first step is to confirm formatting and make sure all columns loaded correctly.**

*Our EDA focuses on validating distribution, coverage, and linguistic variety across intents, targets, and parameters. Since the dataset is synthetic, the goal isn’t noise inspection but ensuring balance, realism, and sufficient diversity to train and compare NLP models reliably.*

In [5]:
df.head()

|   | query | intent | target | parameter | modifier | conditions |
|---|-------|--------|--------|-----------|----------|------------|
| 0 | Log irradiance readings on the inverter. | log | inverter | irradiance | overload | during_peak_hours |
| 1 | Monitor microgrid_controller — temperature see... | monitor | microgrid_controller | temperature | sudden_drop | during_peak_hours |
| 2 | Inspect inverter — efficiency seems critical. | inspect | inverter | efficiency | critical | during_peak_hours |
| 3 | Optimize anomaly in inverter temperature. | optimize | inverter | temperature | high | at_night |
| 4 | Reset anomaly in battery_bank temperature. | reset | battery_bank | temperature | high | under_cloud_cover |


In [6]:
df.sample(5)

|      | query | intent | target | parameter | modifier | conditions |
|------|-------|--------|--------|-----------|----------|------------|
| 4802 | Inspect why the pv_array current is low. | inspect | pv_array | current | low | heatwave |
| 42 | Monitor state_of_charge readings on the grid_t... | monitor | grid_tie_inverter | state_of_charge | unstable | post_storm |
| 279 | Optimize smart_meter — current seems none. | optimize | smart_meter | current | none | under_cloud_cover |
| 1603 | Log the solar_panel fault_code. | log | solar_panel | fault_code | unstable | none |
| 840 | Optimize issue detected in battery_bank voltage. | optimize | battery_bank | voltage | overload | at_night |


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   query       5000 non-null   object
 1   intent      5000 non-null   object
 2   target      5000 non-null   object
 3   parameter   5000 non-null   object
 4   modifier    5000 non-null   object
 5   conditions  5000 non-null   object
dtypes: object(6)
memory usage: 234.5+ KB
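
The balance and coverage checks described at the start of this section can be sketched with simple per-column counts. The snippet below is a minimal illustration rather than the notebook's actual EDA: it uses only the five rows shown in the `df.head()` output instead of the full CSV, and `column_counts` and `imbalance_ratio` are hypothetical helper names.

```python
from collections import Counter

# The five rows from the df.head() output above (query text omitted).
rows = [
    {"intent": "log", "target": "inverter", "parameter": "irradiance"},
    {"intent": "monitor", "target": "microgrid_controller", "parameter": "temperature"},
    {"intent": "inspect", "target": "inverter", "parameter": "efficiency"},
    {"intent": "optimize", "target": "inverter", "parameter": "temperature"},
    {"intent": "reset", "target": "battery_bank", "parameter": "temperature"},
]

def column_counts(rows, column):
    """Count how often each label occurs in one structured field."""
    return Counter(row[column] for row in rows)

def imbalance_ratio(counts):
    """Most frequent count divided by least frequent count (1.0 means balanced)."""
    return max(counts.values()) / min(counts.values())

for column in ("intent", "target", "parameter"):
    counts = column_counts(rows, column)
    print(column, dict(counts), "ratio:", imbalance_ratio(counts))
```

On the full DataFrame, `df[column].value_counts()` yields the same kind of per-label counts without the helper functions.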


### **4. Preprocessing Functions**

*Even though the dataset is synthetic and noise-free, preprocessing is still required to prepare the data for deep learning models. This includes tokenization, padding/truncation to a fixed sequence length, and label encoding. We skip stopword removal, lemmatization, and other cleaning steps because our goal is to preserve the natural language variation that helps the model learn intent patterns.*

#### **4.1 Label Encoding**

We encode each structured field: intent, target, parameter.

In [8]:
intent_encoder = LabelEncoder()
target_encoder = LabelEncoder()
parameter_encoder = LabelEncoder()

df["intent_id"] = intent_encoder.fit_transform(df["intent"])
df["target_id"] = target_encoder.fit_transform(df["target"])
df["parameter_id"] = parameter_encoder.fit_transform(df["parameter"])
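
To make the encoding step concrete, here is a small dependency-free sketch of the mapping that `LabelEncoder` produces: the unique classes are sorted, and each label is replaced by its index in that sorted list. `fit_labels`, `encode`, and `decode` are hypothetical helpers written for illustration only. In the notebook itself, `intent_encoder.inverse_transform` plays the role of `decode` when predicted IDs need to be turned back into label strings.

```python
def fit_labels(labels):
    """Return the sorted unique classes, mirroring LabelEncoder.classes_."""
    return sorted(set(labels))

def encode(labels, classes):
    """Map each label string to its integer index in `classes`."""
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels]

def decode(ids, classes):
    """Map integer IDs back to label strings."""
    return [classes[i] for i in ids]

# Intent values taken from the df.head() output.
intents = ["log", "monitor", "inspect", "optimize", "reset"]
classes = fit_labels(intents)
ids = encode(intents, classes)

print(classes)               # sorted class list
print(ids)                   # integer ID per original label
print(decode(ids, classes))  # round trip back to the original strings
```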

#### **4.2 Train/Val/Test Split**
We split once, and reuse the same split for all models to keep comparisons fair.


In [9]:
train_df, test_df = train_test_split(
    df, test_size=0.15, random_state=42, stratify=df["intent"])
train_df, val_df = train_test_split(
    train_df, test_size=0.15, random_state=42, stratify=train_df["intent"])
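
Because the second call takes 15% of the remaining training portion rather than of the full dataset, the effective proportions are not 70/15/15. The arithmetic below is a sketch under two assumptions: that `train_test_split` rounds a fractional `test_size` up to the next whole sample and assigns the remainder to training, and that the dataset has the 5000 rows reported by `df.info()`. `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, test_size):
    """Sizes for one split: held-out share rounded up, remainder kept (assumed rule)."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

n_total = 5000
n_train_val, n_test = split_sizes(n_total, 0.15)   # first call: carve off the test set
n_train, n_val = split_sizes(n_train_val, 0.15)    # second call: carve off validation

for name, n in (("train", n_train), ("val", n_val), ("test", n_test)):
    print(f"{name}: {n} rows ({n / n_total:.2%})")
```

Under these assumptions the split works out to roughly 72% train, 13% validation, and 15% test.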

#### **4.3 Preprocessing for LSTM & Bi-LSTM**

a) Tokenisation

In [10]:
MAX_VOCAB = 8000  # can adjust after EDA
tokenizer = Tokenizer(num_words=MAX_VOCAB, oov_token="<OOV>")

tokenizer.fit_on_texts(train_df["query"])

b) Text to Sequence Conversion

In [11]:
X_train_seq = tokenizer.texts_to_sequences(train_df["query"])
X_val_seq = tokenizer.texts_to_sequences(val_df["query"])
X_test_seq = tokenizer.texts_to_sequences(test_df["query"])
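
The fit-then-convert behaviour above can be illustrated with a simplified, dependency-free stand-in for the Keras `Tokenizer`. It assumes the default settings (lowercasing, splitting on whitespace, and a `filters` string that, to our understanding, includes the underscore), ignores `num_words`, and reserves index 1 for the OOV token. `split_words`, `fit_vocab`, and `to_sequence` are hypothetical names. One consequence worth checking against the real tokenizer is that compound names such as `battery_bank` would be split into two tokens.

```python
from collections import Counter

# Assumed to match the default `filters` argument of the Keras Tokenizer.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
STRIP = str.maketrans({ch: " " for ch in FILTERS})

def split_words(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    return text.lower().translate(STRIP).split()

def fit_vocab(texts, oov_token="<OOV>"):
    """Build a word -> index map, most frequent words first, OOV at index 1."""
    counts = Counter(word for text in texts for word in split_words(text))
    vocab = {oov_token: 1}
    for word, _ in counts.most_common():
        vocab[word] = len(vocab) + 1
    return vocab

def to_sequence(text, vocab, oov_token="<OOV>"):
    """Convert text to indices; unseen words map to the OOV index."""
    return [vocab.get(word, vocab[oov_token]) for word in split_words(text)]

train_texts = [
    "Reset anomaly in battery_bank temperature.",
    "Optimize anomaly in inverter temperature.",
]
vocab = fit_vocab(train_texts)
sequence = to_sequence("Inspect battery_bank voltage.", vocab)
print(split_words(train_texts[0]))
print(sequence)
```

As in the notebook, the vocabulary is built from training text only, so words that appear only in validation or test queries fall back to the OOV index.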

c) Padding

In [12]:
MAX_LEN = 25
X_train = pad_sequences(X_train_seq, maxlen=MAX_LEN, padding="post")
X_val = pad_sequences(X_val_seq, maxlen=MAX_LEN, padding="post")
X_test = pad_sequences(X_test_seq, maxlen=MAX_LEN, padding="post")
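
For reference, `padding="post"` appends zeros after the tokens, while truncation is left at its default, which we understand to be `"pre"` (tokens are dropped from the start of over-long sequences). The sketch below reproduces that behaviour in plain Python under this assumption; `pad_post` is a hypothetical helper, not part of Keras.

```python
def pad_post(sequences, maxlen, value=0):
    """Post-pad with `value`; truncate from the front when a sequence is too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the last `maxlen` tokens (assumed "pre" truncation)
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

result = pad_post([[5, 2, 9], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(result)
```

If the opening words of a query carry the intent verb, passing `truncating="post"` to `pad_sequences` would keep the leading tokens instead. With `MAX_LEN = 25` this only matters for queries longer than 25 tokens.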

d) Extract Label IDs

In [13]:
y_train_intent = train_df["intent_id"].values
y_val_intent = val_df["intent_id"].values
y_test_intent = test_df["intent_id"].values

#### **4.4 Preprocessing for BERT**

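We load the BERT tokeniser and encode each query into input IDs and attention masks. Segment IDs (token type IDs) are not collected here, since every input is a single sentence.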

In [15]:
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")


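def bert_encode(texts, tokenizer, max_len=32):
    """Tokenise each text and return stacked input IDs and attention masks."""
    input_ids = []
    attention_masks = []

    for t in texts:
        # Each call returns tensors of shape (1, max_len).
        encoded = tokenizer.encode_plus(
            t,
            add_special_tokens=True,   # add [CLS] and [SEP]
            max_length=max_len,
            padding="max_length",      # pad every query to max_len
            truncation=True,           # cut queries longer than max_len
            return_attention_mask=True,
            return_tensors="tf"
        )
        input_ids.append(encoded["input_ids"])
        attention_masks.append(encoded["attention_mask"])

    # Concatenate along the batch axis: final shape (len(texts), max_len).
    return (
        tf.concat(input_ids, axis=0),
        tf.concat(attention_masks, axis=0),
    )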

X_train_bert_ids, X_train_bert_mask = bert_encode(train_df["query"], bert_tokenizer)
X_val_bert_ids,   X_val_bert_mask   = bert_encode(val_df["query"], bert_tokenizer)
X_test_bert_ids,  X_test_bert_mask  = bert_encode(test_df["query"], bert_tokenizer)



#### **Final Note:**

The preprocessing steps here ensure compatibility with both classical sequence models (LSTM/BiLSTM) and transformer-based models (BERT). Since our dataset is synthetic, the focus is not on cleaning but on formatting: tokenisation, padding, and label encoding. 

These steps allow us to directly compare model performance on a consistent, well-structured task.

### **5. Topic Modeling Module**

#### **5.1 TF-IDF + SVM Baseline**

#### **5.2 LDA Topic Modeling (Unsupervised)**

#### **5.3 MiniBERT Topic Classifier (Supervised)**

### **6. Compare Topic Models**

### **7. Token Classification Dataset Preparation**

### **8. Model 1: DistilBERT Token Classifier**

### **9. Model 2: BiLSTM Token Classifier**

### **10. Model 3: Simple LSTM Tagger**

### **11. Training Loops (All Models)**

### **12. Evaluation: Intent, Target, Parameter Extraction**

### **13. Context Resolver Logic**

### **14. End-to-End Pipeline Demonstration**

### **15. Model Comparison Summary (The MLE Signal)**

### **16. Conclusions & Future Work**

Include:

- integrate with GridGuard

- replace LDA with BERTopic

- build your own transformer from scratch (future project)

- deploy as microservice