# **NLP Intent Parser for Industrial Technician Queries**

A modular pipeline consisting of:
1. Topic Router (LDA, SVM, MiniBERT)
2. Intent + Target + Parameter Token Classifier (DistilBERT, BiLSTM, LSTM)
3. Context Resolver for domain-aware refinement

This notebook demonstrates preprocessing, embeddings, token labeling, 
three different modeling strategies, evaluation, and comparison.
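
The three stages listed above can be wired together roughly as follows. This is only a sketch of the module boundaries, not the notebook's implementation: the callable signatures, the `ParsedQuery` container, `collect_slots`, `parse_query`, the tag names, and the toy keyword tagger are all hypothetical stand-ins for the trained models. The small vocabulary is copied from label values that appear in the dataset preview further down.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical stage interfaces (assumptions, not defined by the notebook).
TopicRouter = Callable[[str], str]                     # query -> topic label
TokenTagger = Callable[[str], List[Tuple[str, str]]]   # query -> [(token, tag)]
ContextResolver = Callable[[str, Dict[str, str]], Dict[str, str]]


@dataclass
class ParsedQuery:
    query: str
    topic: str
    slots: Dict[str, str] = field(default_factory=dict)


def collect_slots(tagged: List[Tuple[str, str]]) -> Dict[str, str]:
    """Group tagged tokens into slots, keeping the first token seen per tag."""
    slots: Dict[str, str] = {}
    for token, tag in tagged:
        if tag != "O" and tag not in slots:
            slots[tag] = token
    return slots


def parse_query(query: str, router: TopicRouter, tagger: TokenTagger,
                resolver: ContextResolver) -> ParsedQuery:
    """Run the three stages in order: route, tag, then resolve."""
    topic = router(query)
    slots = collect_slots(tagger(query))
    return ParsedQuery(query=query, topic=topic, slots=resolver(topic, slots))


# Toy keyword tagger standing in for the DistilBERT / BiLSTM / LSTM taggers.
VOCAB = {
    "INTENT": {"log", "monitor", "inspect", "optimize", "reset", "check"},
    "TARGET": {"inverter", "battery_bank", "solar_panel", "pv_array"},
    "PARAMETER": {"irradiance", "temperature", "current", "frequency"},
}


def toy_tagger(query: str) -> List[Tuple[str, str]]:
    tagged = []
    for raw in query.split():
        token = raw.strip(".,").lower()
        tag = next((name for name, words in VOCAB.items() if token in words), "O")
        tagged.append((token, tag))
    return tagged


result = parse_query(
    "Reset current readings on the solar_panel.",
    router=lambda q: "unrouted",                  # placeholder topic router
    tagger=toy_tagger,
    resolver=lambda topic, slots: dict(slots),    # placeholder: no refinement
)
print(result.slots)
# {'INTENT': 'reset', 'PARAMETER': 'current', 'TARGET': 'solar_panel'}
```

Keeping each stage behind a plain callable means any of the three topic models or three taggers can be swapped in for comparison without touching the rest of the pipeline.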


### **1. Import and Setup**

In [5]:
!pip install pandas numpy scikit-learn nltk torch seaborn matplotlib transformers 


In [6]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

import nltk
import re
import torch
import seaborn as sns
import matplotlib.pyplot as plt

from transformers import AutoTokenizer, AutoModelForTokenClassification

### **2. Load Technician Query Dataset**

**Why We Generated the Dataset Ourselves**

No publicly available dataset captures technician-style microgrid instructions with the level of structure we need (intent, target, parameter, modifier, conditions). Real industrial datasets are usually private or messy, and they rarely come with clean labels that we can interpret. Since our goal here is to benchmark different NLP models, not to clean handwritten maintenance logs, synthetic data gives us full control over balance, coverage, and consistency.

It also lets us shape the problem exactly as we want to model it, and it is standard practice during early prototyping, before fine-tuning on real operational data.
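
For illustration, a template-based generator of this kind might look like the sketch below. It is not the script that produced `solar_ds.csv`: the function name `generate_rows`, the seed handling, and the uniform sampling are assumptions. The label values and sentence templates are limited to those visible in the dataset preview in the next section, so the real vocabulary may be larger. Modifier and condition labels are sampled independently of the query text, which matches preview rows where those labels do not appear in the sentence.

```python
import random

# Label values copied from the dataset preview; the full vocabulary may be larger.
INTENTS = ["log", "monitor", "inspect", "optimize", "reset", "check"]
TARGETS = ["inverter", "microgrid_controller", "battery_bank",
           "solar_panel", "pv_array", "charge_controller"]
PARAMETERS = ["irradiance", "temperature", "efficiency", "output_power",
              "current", "frequency", "fault_code"]
MODIFIERS = ["overload", "sudden_drop", "critical", "high", "low", "none"]
CONDITIONS = ["during_peak_hours", "at_night", "under_cloud_cover",
              "post_storm", "heatwave"]

# Sentence patterns that appear in the preview rows.
TEMPLATES = [
    "{intent} {parameter} readings on the {target}.",
    "{intent} anomaly in {target} {parameter}.",
    "{intent} issue detected in {target} {parameter}.",
    "{intent} the {target} {parameter}.",
]


def generate_rows(n, seed=0):
    """Return n synthetic rows as dicts with the dataset's six columns."""
    rng = random.Random(seed)  # local RNG keeps the output reproducible
    rows = []
    for _ in range(n):
        labels = {
            "intent": rng.choice(INTENTS),
            "target": rng.choice(TARGETS),
            "parameter": rng.choice(PARAMETERS),
            "modifier": rng.choice(MODIFIERS),
            "conditions": rng.choice(CONDITIONS),
        }
        query = rng.choice(TEMPLATES).format(
            intent=labels["intent"].capitalize(),
            target=labels["target"],
            parameter=labels["parameter"],
        )
        rows.append({"query": query, **labels})
    return rows


rows = generate_rows(5)
for row in rows:
    print(row["query"], "->", row["intent"], row["target"], row["parameter"])
```

Uniform sampling from each list gives approximately balanced classes by construction, which is the control over balance and coverage described above.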

In [9]:
df = pd.read_csv('./data/solar_ds.csv')    

### **3. Data Exploration (EDA)**

**The first step is to confirm formatting and make sure all columns loaded correctly.**

*Our EDA focuses on validating distribution, coverage, and linguistic variety across intents, targets, and parameters. Since the dataset is synthetic, the goal isn’t noise inspection but ensuring balance, realism, and sufficient diversity to train and compare NLP models reliably.*

In [13]:
df.head()

| | query | intent | target | parameter | modifier | conditions |
|---|---|---|---|---|---|---|
| 0 | Log irradiance readings on the inverter. | log | inverter | irradiance | overload | during_peak_hours |
| 1 | Monitor microgrid_controller — temperature see... | monitor | microgrid_controller | temperature | sudden_drop | during_peak_hours |
| 2 | Inspect inverter — efficiency seems critical. | inspect | inverter | efficiency | critical | during_peak_hours |
| 3 | Optimize anomaly in inverter temperature. | optimize | inverter | temperature | high | at_night |
| 4 | Reset anomaly in battery_bank temperature. | reset | battery_bank | temperature | high | under_cloud_cover |


In [12]:
df.sample(5)

| | query | intent | target | parameter | modifier | conditions |
|---|---|---|---|---|---|---|
| 2135 | Inspect output_power readings on the battery_b... | inspect | battery_bank | output_power | sudden_drop | during_peak_hours |
| 862 | Optimize issue detected in inverter irradiance. | optimize | inverter | irradiance | low | post_storm |
| 3202 | Reset current readings on the solar_panel. | reset | solar_panel | current | none | post_storm |
| 834 | Optimize the pv_array frequency. | optimize | pv_array | frequency | sudden_drop | during_peak_hours |
| 1765 | Check the low fault_code in the charge_control... | check | charge_controller | fault_code | low | heatwave |


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   query       5000 non-null   object
 1   intent      5000 non-null   object
 2   target      5000 non-null   object
 3   parameter   5000 non-null   object
 4   modifier    5000 non-null   object
 5   conditions  5000 non-null   object
dtypes: object(6)
memory usage: 234.5+ KB
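
The balance and diversity checks this section describes could be sketched as below. The helper names `label_balance` and `query_diversity` and the choice of metrics (max/min class-count ratio, share of unique queries, mean token count) are assumptions, not the notebook's own code. So that the block runs on its own, it uses a four-row frame copied from the preview above. In the notebook, the `df` loaded from the CSV would be passed instead.

```python
import pandas as pd

LABEL_COLUMNS = ["intent", "target", "parameter", "modifier", "conditions"]


def label_balance(df, columns=LABEL_COLUMNS):
    """Per label column: number of classes and max/min class-count ratio."""
    summary = {}
    for col in columns:
        counts = df[col].value_counts()
        summary[col] = {
            "n_classes": int(counts.size),
            "imbalance_ratio": float(counts.max() / counts.min()),
        }
    return pd.DataFrame.from_dict(summary, orient="index")


def query_diversity(df):
    """Share of distinct query strings and mean length in whitespace tokens."""
    return {
        "unique_ratio": df["query"].nunique() / len(df),
        "mean_tokens": float(df["query"].str.split().str.len().mean()),
    }


# Four rows copied from the preview above, only so the example is standalone.
sample = pd.DataFrame(
    [
        ("Log irradiance readings on the inverter.", "log", "inverter",
         "irradiance", "overload", "during_peak_hours"),
        ("Optimize anomaly in inverter temperature.", "optimize", "inverter",
         "temperature", "high", "at_night"),
        ("Reset anomaly in battery_bank temperature.", "reset", "battery_bank",
         "temperature", "high", "under_cloud_cover"),
        ("Optimize issue detected in inverter irradiance.", "optimize", "inverter",
         "irradiance", "low", "post_storm"),
    ],
    columns=["query"] + LABEL_COLUMNS,
)

balance = label_balance(sample)
diversity = query_diversity(sample)
print(balance)
print(diversity)
```

An `imbalance_ratio` close to 1 indicates evenly represented classes. A low `unique_ratio` would suggest the templates are producing many duplicate sentences.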


### **4. Preprocessing Functions**

### **5. Topic Modeling Module**

#### **5.1 TF-IDF + SVM Baseline**

#### **5.2 LDA Topic Modeling (Unsupervised)**

#### **5.3 MiniBERT Topic Classifier (Supervised)**

### **6. Compare Topic Models**

### **7. Token Classification Dataset Preparation**

### **8. Model 1: DistilBERT Token Classifier**

### **9. Model 2: BiLSTM Token Classifier**

### **10. Model 3: Simple LSTM Tagger**

### **11. Training Loops (All Models)**

### **12. Evaluation: Intent, Target, Parameter Extraction**

### **13. Context Resolver Logic**

### **14. End-to-End Pipeline Demonstration**

### **15. Model Comparison Summary (The MLE Signal)**

### **16. Conclusions & Future Work**

Include:

- integrate with GridGuard

- replace LDA with BERTopic

- build your own transformer from scratch (future project)

- deploy as microservice