# SmartGriev Model Training

This notebook trains all three core models for the SmartGriev system:
1. Sentiment Analysis Model
2. Text Classification Model
3. Named Entity Recognition (NER) Model

Each model is optimized for grievance-related text in multiple Indian languages.

# Data Requirements

## 1. Sentiment Analysis Data
Required format in JSON, where `language` is a language code such as `"en"`, `"hi"`, or `"ta"`:
```json
{
    "text": "The road condition is very poor and no action has been taken.",
    "sentiment": "NEGATIVE",
    "language": "en"
}
```
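
Records in this format can be checked before training. The following is a minimal sketch, not part of the notebook's pipeline: the helper name `validate_sentiment_record` is hypothetical, and the allowed labels are assumed to be the three sentiment labels used in the notebook's configuration.

```python
# Labels assumed to match the notebook's sentiment configuration.
SENTIMENT_LABELS = {"NEGATIVE", "NEUTRAL", "POSITIVE"}


def validate_sentiment_record(record):
    """Return a list of problems found in one record (empty list means valid)."""
    problems = []
    text = record.get("text")
    if not isinstance(text, str) or not text.strip():
        problems.append("text must be a non-empty string")
    if record.get("sentiment") not in SENTIMENT_LABELS:
        problems.append("sentiment must be one of " + ", ".join(sorted(SENTIMENT_LABELS)))
    language = record.get("language")
    if not isinstance(language, str) or not language:
        problems.append("language must be a non-empty language code such as 'en' or 'hi'")
    return problems


example = {
    "text": "The road condition is very poor and no action has been taken.",
    "sentiment": "NEGATIVE",
    "language": "en",
}
print(validate_sentiment_record(example))
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a large file in a single pass.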

Sources:
- [Amazon Fine Food Reviews](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews) - Customer Reviews
- [PG Portal Data](https://pgportal.gov.in/) - Public Grievances
- Municipal Corporation Databases

## 2. Complaint Classification Data
Required format:
```json
{
    "text": "Street light in sector 7 has not been working for 2 weeks",
    "category": "UTILITIES",
    "sub_category": "STREET_LIGHT"
}
```
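
The `category` value has to be turned into an integer id before training. The sketch below is an illustration under stated assumptions: the category names are copied from the notebook's classification configuration, the helper `encode_category` is hypothetical, and ids are assigned in sorted order because that is how scikit-learn's `LabelEncoder` numbers string labels when every category occurs in the data.

```python
# Category names copied from the notebook's CONFIG['classification']['labels'].
CATEGORIES = ['CIVIC', 'LAW_AND_ORDER', 'UTILITIES', 'TRANSPORTATION',
              'EDUCATION', 'HEALTH', 'AGRICULTURE', 'OTHER']

# Sorted order mirrors how LabelEncoder numbers string labels.
CATEGORY_TO_ID = {name: i for i, name in enumerate(sorted(CATEGORIES))}


def encode_category(record):
    """Return the integer id for a record's category, or raise ValueError."""
    category = record.get("category")
    if category not in CATEGORY_TO_ID:
        raise ValueError(f"unknown category: {category!r}")
    return CATEGORY_TO_ID[category]


example = {
    "text": "Street light in sector 7 has not been working for 2 weeks",
    "category": "UTILITIES",
    "sub_category": "STREET_LIGHT",
}
print(encode_category(example))
```

Note that the sorted order differs from the order in which the categories are listed in the configuration, so predicted ids should be decoded with the same mapping that encoded them.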

Sources:
- Municipal Corporation Complaint Systems
- Smart City Grievance Portals
- State Government Portals

## 3. NER Training Data
Required format:
```json
{
    "text": "Municipal Commissioner Mr. Kumar visited Gandhi Nagar on 15th January",
    "entities": [
        [23, 32, "PERSON"],
        [41, 53, "LOC"],
        [57, 69, "DATE"]
    ]
}
```
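
Character offsets are easy to get wrong when written by hand. The sketch below derives them from the entity strings instead, assuming end-exclusive offsets (Python slice semantics) and that the first occurrence of each string is the intended one. The helper name `find_entity_spans` is hypothetical.

```python
def find_entity_spans(text, entities):
    """Map (surface_string, label) pairs to [start, end, label], end exclusive.

    Uses the first occurrence of each surface string and raises if it is absent.
    """
    spans = []
    for surface, label in entities:
        start = text.find(surface)
        if start == -1:
            raise ValueError(f"{surface!r} not found in text")
        spans.append([start, start + len(surface), label])
    return spans


text = "Municipal Commissioner Mr. Kumar visited Gandhi Nagar on 15th January"
spans = find_entity_spans(
    text,
    [("Mr. Kumar", "PERSON"), ("Gandhi Nagar", "LOC"), ("15th January", "DATE")],
)
for start, end, label in spans:
    # Slicing with the computed offsets recovers the entity string exactly.
    print(start, end, label, repr(text[start:end]))
```

A quick sanity check for any annotated record is that `text[start:end]` reproduces the entity string for every span.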

Sources:
- Government Office Orders
- Public Notices
- Official Communications
- Municipal Corporation Records

## Data Collection Tools
1. Web Scraping:
   - Government websites
   - Public grievance portals
   - News articles about civic issues

2. Official APIs:
   - Smart City APIs
   - Government Open Data Portals
   - RTI Disclosure Portals

3. Manual Annotation (example tools):
   - Label Studio
   - BRAT Annotation Tool
   - Prodigy (for NER annotation)

## Sample Dataset Size Requirements
- Sentiment Analysis: Minimum 5000 labeled examples
- Classification: 1000+ examples per category
- NER: 2000+ sentences with annotated entities
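
These minimums can be checked automatically before training starts. The following is a rough sketch, assuming the three datasets are available as lists of dictionaries in the formats described earlier. The function name `check_dataset_sizes` is hypothetical.

```python
from collections import Counter

# Minimums copied from the list above.
MIN_SENTIMENT = 5000
MIN_PER_CATEGORY = 1000
MIN_NER_SENTENCES = 2000


def check_dataset_sizes(sentiment_records, classification_records, ner_records):
    """Return a human-readable warning for every minimum that is not met."""
    warnings = []
    if len(sentiment_records) < MIN_SENTIMENT:
        warnings.append(f"sentiment: {len(sentiment_records)} < {MIN_SENTIMENT}")
    # Classification minimum applies to each category separately.
    per_category = Counter(r["category"] for r in classification_records)
    for category, count in sorted(per_category.items()):
        if count < MIN_PER_CATEGORY:
            warnings.append(f"classification/{category}: {count} < {MIN_PER_CATEGORY}")
    # Only sentences that carry at least one annotated entity are counted.
    annotated = sum(1 for r in ner_records if r.get("entities"))
    if annotated < MIN_NER_SENTENCES:
        warnings.append(f"ner: {annotated} < {MIN_NER_SENTENCES}")
    return warnings


demo_warnings = check_dataset_sizes(
    [{"text": "a", "sentiment": "NEUTRAL"}],
    [{"text": "b", "category": "UTILITIES"}],
    [{"text": "c", "entities": []}],
)
for line in demo_warnings:
    print(line)
```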

In [None]:
# Required imports
import os
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from transformers import (
    TFAutoModelForSequenceClassification,
    AutoTokenizer,
    TFAutoModelForTokenClassification
)
import tensorflow_hub as hub
import tensorflow_text as text
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import json

# Set random seeds for reproducibility
tf.random.set_seed(42)
np.random.seed(42)

# Check for GPU
print("GPU Available: ", tf.config.list_physical_devices('GPU'))
print("TensorFlow version:", tf.__version__)

# Constants
MODELS_DIR = './saved_models'
os.makedirs(MODELS_DIR, exist_ok=True)

# Model configurations
CONFIG = {
    'sentiment': {
        'model_name': 'microsoft/mdeberta-v3-base',  # Multilingual DeBERTa
        'num_labels': 3,
        'labels': ['NEGATIVE', 'NEUTRAL', 'POSITIVE'],
        'max_length': 128,
        'batch_size': 32
    },
    'classification': {
        'model_name': 'microsoft/mdeberta-v3-base',
        'num_labels': 8,
        'labels': ['CIVIC', 'LAW_AND_ORDER', 'UTILITIES', 'TRANSPORTATION', 
                  'EDUCATION', 'HEALTH', 'AGRICULTURE', 'OTHER'],
        'max_length': 128,
        'batch_size': 32
    },
    'ner': {
        'labels': ['PERSON', 'ORG', 'LOC', 'DATE', 'FACILITY', 'DEPT'],
        'embedding_dim': 200,
        'lstm_units': 128,
        'max_length': 100,
        'batch_size': 32
    }
}

In [None]:
# Data loading functions
def load_complaint_data():
    """Load complaint data from fixtures"""
    with open('./fixtures/initial_data.json', 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    complaints = []
    for item in data:
        if item['model'] == 'complaints.complaint':
            complaints.append({
                'text': item['fields']['description'],
                'category': item['fields']['category'],
                'sentiment': item['fields'].get('sentiment', 'NEUTRAL'),
                'entities': item['fields'].get('entities', [])
            })
    
    return pd.DataFrame(complaints)

# Load data and inspect the label distributions
df = load_complaint_data()
print("Dataset size:", len(df))
print("\nCategory distribution:")
print(df['category'].value_counts())
print("\nSentiment distribution:")
print(df['sentiment'].value_counts())

In [None]:
# Model building functions
def build_transformer_model(model_type):
    """Build a transformer-based model for sentiment or classification"""
    config = CONFIG[model_type]
    
    # Create base model
    base_model = TFAutoModelForSequenceClassification.from_pretrained(
        config['model_name'],
        num_labels=config['num_labels']
    )
    
    # Create input layers
    input_ids = layers.Input(shape=(config['max_length'],), dtype=tf.int32, name='input_ids')
    attention_mask = layers.Input(shape=(config['max_length'],), dtype=tf.int32, name='attention_mask')
    
    # Get outputs from base model
    outputs = base_model({'input_ids': input_ids, 'attention_mask': attention_mask})
    logits = outputs.logits
    
    # Add custom layers for fine-tuning
    x = layers.Dropout(0.1)(logits)
    outputs = layers.Dense(config['num_labels'], activation='softmax')(x)
    
    # Create model
    model = keras.Model(
        inputs={'input_ids': input_ids, 'attention_mask': attention_mask},
        outputs=outputs
    )
    
    # Compile model
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=2e-5),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    return model

def build_ner_model():
    """Build a custom BiLSTM tagger for NER (softmax output per token; no CRF layer)"""
    config = CONFIG['ner']
    
    # Input layer
    input_text = layers.Input(shape=(config['max_length'],))
    
    # Embedding layer
    x = layers.Embedding(
        input_dim=config.get('vocab_size', 10000),  # Set from the fitted tokenizer; 10000 is only a fallback
        output_dim=config['embedding_dim'],
        mask_zero=True
    )(input_text)
    
    # BiLSTM layers
    x = layers.Bidirectional(layers.LSTM(
        config['lstm_units'],
        return_sequences=True
    ))(x)
    x = layers.Bidirectional(layers.LSTM(
        config['lstm_units'],
        return_sequences=True
    ))(x)
    
    # Output layers
    x = layers.TimeDistributed(layers.Dropout(0.2))(x)
    outputs = layers.TimeDistributed(
        # One class per BIO tag: B-/I- for every entity type, plus 'O'
        layers.Dense(config.get('num_labels', 2 * len(config['labels']) + 1), activation='softmax')
    )(x)
    
    # Create model
    model = keras.Model(inputs=input_text, outputs=outputs)
    
    # Compile model
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    return model

def preprocess_text(texts, tokenizer, max_length):
    """Preprocess text data for transformer models"""
    encoded = tokenizer(
        texts.tolist(),
        padding='max_length',
        truncation=True,
        max_length=max_length,
        return_tensors='tf'
    )
    return {
        'input_ids': encoded['input_ids'],
        'attention_mask': encoded['attention_mask']
    }

In [None]:
# Prepare sentiment analysis data
print("Preparing sentiment analysis model...")

# Convert sentiment labels to indices
label_encoder = LabelEncoder()
sentiment_labels = label_encoder.fit_transform(df['sentiment'])

# Split data
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df['text'].values, 
    sentiment_labels,
    test_size=0.2,
    random_state=42
)

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained(CONFIG['sentiment']['model_name'])

# Preprocess data
train_data = preprocess_text(train_texts, tokenizer, CONFIG['sentiment']['max_length'])
val_data = preprocess_text(val_texts, tokenizer, CONFIG['sentiment']['max_length'])

# Build and train model
sentiment_model = build_transformer_model('sentiment')

history = sentiment_model.fit(
    train_data,
    train_labels,
    validation_data=(val_data, val_labels),
    epochs=5,
    batch_size=CONFIG['sentiment']['batch_size']
)

# Save model
sentiment_model.save(f'{MODELS_DIR}/sentiment/final')
tokenizer.save_pretrained(f'{MODELS_DIR}/sentiment/final')

# Test model
test_text = "The road condition is terrible and no one is taking action."
test_data = preprocess_text(np.array([test_text]), tokenizer, CONFIG['sentiment']['max_length'])
prediction = sentiment_model.predict(test_data)
# Decode with the encoder fitted above so that ids and label names stay aligned
predicted_label = label_encoder.inverse_transform([np.argmax(prediction[0])])[0]
confidence = np.max(prediction[0])

print(f"\nTest prediction for: {test_text}")
print(f"Sentiment: {predicted_label}")
print(f"Confidence: {confidence:.2f}")

# Plot training history
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.legend()
plt.show()

In [None]:
# Prepare classification data
print("Preparing classification model...")

# Convert category labels to indices
label_encoder = LabelEncoder()
classification_labels = label_encoder.fit_transform(df['category'])

# Split data
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df['text'].values, 
    classification_labels,
    test_size=0.2,
    random_state=42
)

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained(CONFIG['classification']['model_name'])

# Preprocess data
train_data = preprocess_text(train_texts, tokenizer, CONFIG['classification']['max_length'])
val_data = preprocess_text(val_texts, tokenizer, CONFIG['classification']['max_length'])

# Build and train model
classifier_model = build_transformer_model('classification')

history = classifier_model.fit(
    train_data,
    train_labels,
    validation_data=(val_data, val_labels),
    epochs=5,
    batch_size=CONFIG['classification']['batch_size']
)

# Save model
classifier_model.save(f'{MODELS_DIR}/classification/final')
tokenizer.save_pretrained(f'{MODELS_DIR}/classification/final')

# Test model
test_text = "The streetlight in our area has not been working for a week."
test_data = preprocess_text(np.array([test_text]), tokenizer, CONFIG['classification']['max_length'])
prediction = classifier_model.predict(test_data)
# LabelEncoder assigns ids in sorted order, which differs from the CONFIG list order
predicted_label = label_encoder.inverse_transform([np.argmax(prediction[0])])[0]
confidence = np.max(prediction[0])

print(f"\nTest prediction for: {test_text}")
print(f"Category: {predicted_label}")
print(f"Confidence: {confidence:.2f}")

# Plot training history
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.legend()
plt.show()

In [None]:
# Prepare NER training data
print("Preparing NER model...")

def prepare_ner_data(df):
    """Prepare data for NER model"""
    texts = []
    labels = []
    
    for text, entities in zip(df['text'], df['entities']):
        # Convert text to token sequence
        tokens = text.split()  # Simple tokenization, could be improved
        
        # Create label sequence (one label per token)
        label_seq = ['O'] * len(tokens)  # O for Outside
        for start, end, label in entities:
            # Locate the entity's tokens in the whitespace-tokenized text
            entity_tokens = text[start:end].split()
            if not entity_tokens:
                continue
            n = len(entity_tokens)
            for i in range(len(tokens) - n + 1):
                if tokens[i:i + n] == entity_tokens:
                    label_seq[i] = f'B-{label}'  # Beginning
                    for j in range(1, n):
                        label_seq[i + j] = f'I-{label}'  # Inside
        
        texts.append(tokens)
        labels.append(label_seq)
    
    return texts, labels

# Prepare data
texts, labels = prepare_ner_data(df)

# Create vocabularies
tokenizer = Tokenizer(oov_token='<UNK>')
tokenizer.fit_on_texts(texts)

# Create label mapping
unique_labels = set(['O'])
for label_seq in labels:
    unique_labels.update(label_seq)
label2id = {label: i for i, label in enumerate(sorted(unique_labels))}
id2label = {i: label for label, i in label2id.items()}

# Convert to sequences
X = tokenizer.texts_to_sequences(texts)
y = [[label2id[l] for l in seq] for seq in labels]

# Pad sequences
X_padded = pad_sequences(X, maxlen=CONFIG['ner']['max_length'], padding='post')
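# Pad labels with the id of 'O' so padding is not confused with a real entity tag
y_padded = pad_sequences(y, maxlen=CONFIG['ner']['max_length'], padding='post', value=label2id['O'])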

# Split data
X_train, X_val, y_train, y_val = train_test_split(
    X_padded, y_padded,
    test_size=0.2,
    random_state=42
)

# Update model config with vocabulary size
CONFIG['ner']['vocab_size'] = len(tokenizer.word_index) + 1
CONFIG['ner']['num_labels'] = len(label2id)

# Build and train model
ner_model = build_ner_model()

history = ner_model.fit(
    X_train,
    y_train,
    validation_data=(X_val, y_val),
    epochs=10,
    batch_size=CONFIG['ner']['batch_size']
)

# Save model and tokenizer
ner_model.save(f'{MODELS_DIR}/ner/final')
with open(f'{MODELS_DIR}/ner/tokenizer.json', 'w') as f:
    json.dump({
        'word_index': tokenizer.word_index,
        'label2id': label2id,
        'id2label': id2label
    }, f)

# Test model
test_text = """The drainage system in Gandhi Nagar is completely blocked. 
Municipal Corporation officer Mr. Kumar has not taken any action despite 
multiple complaints since January 15th."""

# Preprocess test text
test_tokens = test_text.split()
test_seq = tokenizer.texts_to_sequences([test_tokens])
test_padded = pad_sequences(test_seq, maxlen=CONFIG['ner']['max_length'], padding='post')

# Predict
predictions = ner_model.predict(test_padded)
predicted_labels = np.argmax(predictions[0], axis=-1)

print("\nNER Test Results:")
for token, label_id in zip(test_tokens[:CONFIG['ner']['max_length']], predicted_labels):
    if label_id != label2id['O']:  # Only show non-Outside entities
        print(f"{token}: {id2label[label_id]}")

# Plot training history
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.legend()
plt.show()

In [None]:
# Test all models
def test_models():
    # Load sentiment model and tokenizer
    sentiment_model = keras.models.load_model(f'{MODELS_DIR}/sentiment/final')
    sentiment_tokenizer = AutoTokenizer.from_pretrained(f'{MODELS_DIR}/sentiment/final')
    
    # Load classification model and tokenizer
    classifier_model = keras.models.load_model(f'{MODELS_DIR}/classification/final')
    classifier_tokenizer = AutoTokenizer.from_pretrained(f'{MODELS_DIR}/classification/final')
    
    # Load NER model and tokenizer
    ner_model = keras.models.load_model(f'{MODELS_DIR}/ner/final')
    with open(f'{MODELS_DIR}/ner/tokenizer.json', 'r') as f:
        ner_data = json.load(f)
        word_index = ner_data['word_index']
        id2label = ner_data['id2label']
    
    # Test text
    test_text = """The drainage system in Gandhi Nagar is completely blocked. 
    Municipal Corporation officer Mr. Kumar has not taken any action despite 
    multiple complaints since January 15th."""
    
    # Test sentiment analysis
    sentiment_data = preprocess_text(
        np.array([test_text]),
        sentiment_tokenizer,
        CONFIG['sentiment']['max_length']
    )
    sentiment_pred = sentiment_model.predict(sentiment_data)
    # Training used LabelEncoder, which numbers labels in sorted order
    sentiment_label = sorted(CONFIG['sentiment']['labels'])[np.argmax(sentiment_pred[0])]
    sentiment_conf = np.max(sentiment_pred[0])
    
    print("Sentiment Analysis:")
    print(f"Sentiment: {sentiment_label}")
    print(f"Confidence: {sentiment_conf:.2f}\n")
    
    # Test classification
    class_data = preprocess_text(
        np.array([test_text]),
        classifier_tokenizer,
        CONFIG['classification']['max_length']
    )
    class_pred = classifier_model.predict(class_data)
    # Training used LabelEncoder, which numbers labels in sorted order, not CONFIG order
    class_label = sorted(CONFIG['classification']['labels'])[np.argmax(class_pred[0])]
    class_conf = np.max(class_pred[0])
    
    print("Category Classification:")
    print(f"Category: {class_label}")
    print(f"Confidence: {class_conf:.2f}\n")
    
    # Test NER
    tokens = test_text.split()
    test_seq = [[word_index.get(word.lower(), word_index['<UNK>']) for word in tokens]]
    test_padded = pad_sequences(test_seq, maxlen=CONFIG['ner']['max_length'], padding='post')
    
    ner_pred = ner_model.predict(test_padded)
    pred_labels = np.argmax(ner_pred[0], axis=-1)
    
    print("Named Entity Recognition:")
    current_entity = None
    current_text = []
    
    for token, label_id in zip(tokens[:CONFIG['ner']['max_length']], pred_labels):
        label = id2label[str(label_id)]  # JSON stores the integer ids as string keys
        if label.startswith('I-') and current_entity == label[2:]:
            # Continue the entity that is currently open
            current_text.append(token)
            continue
        # Any other label closes the open entity, if there is one
        if current_entity:
            print(f"{' '.join(current_text)}: {current_entity}")
        if label.startswith('B-'):
            current_entity = label[2:]
            current_text = [token]
        else:
            # 'O' or a stray 'I-' tag: no entity is open
            current_entity = None
            current_text = []
    
    if current_entity:
        print(f"{' '.join(current_text)}: {current_entity}")

# Run tests
print("Testing all models...\n")
test_models()

# Model Integration Instructions

After training, the models will be saved in the `saved_models` directory:
- Sentiment Analysis: `saved_models/sentiment/final`
- Text Classification: `saved_models/classification/final`
- NER: `saved_models/ner/final` (vocabulary and label maps in `saved_models/ner/tokenizer.json`)

To use these models in the Django application:

1. Update the model paths in `views.py`:
```python
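from tensorflow import keras  # the notebook saves Keras models, not pipeline or spaCy artifacts
sentiment_model = keras.models.load_model('./saved_models/sentiment/final')
text_classifier = keras.models.load_model('./saved_models/classification/final')
ner_model = keras.models.load_model('./saved_models/ner/final')
# Tokenizers are saved next to the transformer models; the NER vocabulary is in ner/tokenizer.json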
```

2. The models can now handle:
   - Sentiment analysis in multiple languages
   - Classification of complaints into categories
   - Named entity recognition for important information

3. Each model provides confidence scores that can be used to:
   - Filter low-confidence predictions
   - Route uncertain cases to human reviewers
   - Monitor model performance
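
One way to act on the confidence scores is a simple threshold rule. The sketch below is illustrative only: the function name `route_prediction` and the cut-off of 0.6 are assumptions, not values from the notebook, and a real threshold should be tuned on validation data.

```python
REVIEW_THRESHOLD = 0.6  # hypothetical cut-off; tune on validation data


def route_prediction(probabilities, labels, threshold=REVIEW_THRESHOLD):
    """Pick the top label and flag the case for human review if confidence is low."""
    probabilities = [float(p) for p in probabilities]
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    confidence = probabilities[best]
    return {
        "label": labels[best],
        "confidence": confidence,
        "needs_review": confidence < threshold,
    }


labels = ["NEGATIVE", "NEUTRAL", "POSITIVE"]
confident = route_prediction([0.7, 0.2, 0.1], labels)
uncertain = route_prediction([0.4, 0.35, 0.25], labels)
print(confident)
print(uncertain)
```

The `labels` list passed in must be in the same order as the ids used during training; otherwise the top probability is matched to the wrong name.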