# Machine Learning Approach Implementation
## Roman Urdu to Urdu Script Conversion Project

This notebook covers Steps 4 and 5 of our methodology:
- Machine Learning Model Implementation
- Model Training and Evaluation
- Comparison with Dictionary Approach

### Objectives:
1. Implement and train ML models (word-based and character-based)
2. Evaluate ML model performance
3. Compare different ML approaches
4. Analyze feature importance and model behavior
5. Compare with dictionary-based approach

In [None]:
# Import required libraries
import sys
import os
from pathlib import Path
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter, defaultdict
import warnings
warnings.filterwarnings('ignore')

# ML libraries
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
import joblib

# Add project root to path
project_root = Path('../')
sys.path.append(str(project_root))

from models.ml_model import MLModel
from models.dictionary_model import DictionaryModel
from utils.data_loader import DataLoader
from utils.preprocessing import RomanUrduPreprocessor
from evaluation.metrics import (
    calculate_bleu_score, calculate_rouge_l, calculate_word_accuracy,
    calculate_sentence_accuracy, calculate_character_accuracy, calculate_edit_distance
)

# Set up plotting
plt.style.use('default')
plt.rcParams['figure.figsize'] = (12, 8)
sns.set_palette("husl")

print("Libraries imported successfully!")

## 1. Data Preparation

In [None]:
# Initialize components
data_loader = DataLoader("../data")
preprocessor = RomanUrduPreprocessor()

# Load data
sample_data = data_loader.load_sample_data()
test_data = data_loader.load_test_data()
dictionary = data_loader.load_dictionary()

print(f"Sample data: {len(sample_data)} sentences")
print(f"Test data: {len(test_data)} sentences")
print(f"Dictionary: {len(dictionary)} entries")

# Prepare training data from sample_data
train_roman = [item['roman'] for item in sample_data]
train_urdu = [item['urdu'] for item in sample_data]

# Prepare test data
test_roman = [item['roman'] for item in test_data]
test_urdu = [item['urdu'] for item in test_data]

print(f"\nTraining samples: {len(train_roman)}")
print(f"Test samples: {len(test_roman)}")
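
The cell above assumes every record carries both a `roman` and an `urdu` field. A minimal sketch of a defensive split is shown below; `split_pairs` and the toy records are hypothetical and are not part of `DataLoader`.

```python
from typing import Dict, List, Tuple


def split_pairs(records: List[Dict[str, str]]) -> Tuple[List[str], List[str]]:
    """Split records into parallel Roman/Urdu lists, skipping incomplete ones."""
    roman, urdu = [], []
    for record in records:
        r = (record.get("roman") or "").strip()
        u = (record.get("urdu") or "").strip()
        if r and u:  # keep only records that have both sides
            roman.append(r)
            urdu.append(u)
    return roman, urdu


# Toy records (illustrative, not taken from the project data)
records_demo = [
    {"roman": "aap kaise hain", "urdu": "آپ کیسے ہیں"},
    {"roman": "main ghar ja raha hun"},   # missing 'urdu', skipped
    {"roman": "  ", "urdu": "گھر"},        # blank 'roman', skipped
]
train_roman_demo, train_urdu_demo = split_pairs(records_demo)
print(len(train_roman_demo), len(train_urdu_demo))
```

Keeping the two lists the same length matters because the later cells pair them with `zip`, which silently truncates to the shorter list.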

## 2. Word-Based ML Model

In [None]:
# Initialize and train word-based model
print("Training Word-Based ML Model...")
print("=" * 40)

word_ml_model = MLModel(approach="word")

# Train the model
word_ml_model.train(train_roman, train_urdu)
print("Word-based model training completed!")

# Get model info
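print("\nModel Statistics:")
print(f"Vocabulary size: {len(word_ml_model.word_pairs)}")
# The vectorizer is optional, so look it up once before reading its feature count
word_vectorizer = getattr(word_ml_model, 'vectorizer', None)
word_feature_dim = len(word_vectorizer.get_feature_names_out()) if word_vectorizer is not None else 'N/A'
print(f"Feature dimension: {word_feature_dim}")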

In [None]:
# Test word-based model on individual words
test_words = ["main", "aap", "kaise", "hain", "ghar", "ja", "raha", "hun"]

print("Word-Based Model - Individual Word Tests:")
print("=" * 45)
for word in test_words:
    converted = word_ml_model.convert_word(word)
    print(f"{word:10} -> {converted}")

In [None]:
# Test word-based model on sentences
test_sentences = [
    "main acha hun",
    "aap kaise hain",
    "wo ghar ja raha hai"
]

print("Word-Based Model - Sentence Tests:")
print("=" * 40)
for sentence in test_sentences:
    converted = word_ml_model.convert_text(sentence)
    print(f"Roman: {sentence}")
    print(f"Urdu:  {converted}")
    print("-" * 30)
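
The internals of `MLModel` are not shown in this notebook. As a rough illustration of what a word-level mapping can look like, the sketch below collects word pairs from parallel sentences, assuming a one-to-one positional alignment whenever both sides have the same number of tokens. `build_word_pairs` and `convert_sentence` are hypothetical helpers, not the project's API.

```python
from collections import Counter, defaultdict


def build_word_pairs(roman_sentences, urdu_sentences):
    """Count Roman->Urdu word co-occurrences for position-aligned sentences."""
    counts = defaultdict(Counter)
    for roman, urdu in zip(roman_sentences, urdu_sentences):
        r_tokens, u_tokens = roman.lower().split(), urdu.split()
        if len(r_tokens) != len(u_tokens):
            continue  # assumption: skip pairs that cannot be aligned 1:1
        for r, u in zip(r_tokens, u_tokens):
            counts[r][u] += 1
    # Keep the most frequent Urdu form for each Roman word
    return {r: c.most_common(1)[0][0] for r, c in counts.items()}


def convert_sentence(sentence, word_pairs):
    """Replace known words and leave unknown words unchanged."""
    return " ".join(word_pairs.get(w, w) for w in sentence.lower().split())


roman_demo = ["aap kaise hain", "main ghar ja raha hun"]
urdu_demo = ["آپ کیسے ہیں", "میں گھر جا رہا ہوں"]
pairs_demo = build_word_pairs(roman_demo, urdu_demo)
print(convert_sentence("aap ghar ja", pairs_demo))
```

A pure lookup like this cannot handle unseen words, which is the gap a trained classifier or a character-level model is meant to close.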

## 3. Character-Based ML Model

In [None]:
# Initialize and train character-based model
print("Training Character-Based ML Model...")
print("=" * 42)

char_ml_model = MLModel(approach="character")

# Train the model
char_ml_model.train(train_roman, train_urdu)
print("Character-based model training completed!")

# Get model info
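print("\nModel Statistics:")
print(f"Character mappings: {len(char_ml_model.char_pairs)}")
# The vectorizer is optional, so look it up once before reading its feature count
char_vectorizer = getattr(char_ml_model, 'vectorizer', None)
char_feature_dim = len(char_vectorizer.get_feature_names_out()) if char_vectorizer is not None else 'N/A'
print(f"Feature dimension: {char_feature_dim}")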

In [None]:
# Test character-based model
print("Character-Based Model - Word Tests:")
print("=" * 40)
for word in test_words:
    converted = char_ml_model.convert_word(word)
    print(f"{word:10} -> {converted}")

print("\nCharacter-Based Model - Sentence Tests:")
print("=" * 45)
for sentence in test_sentences:
    converted = char_ml_model.convert_text(sentence)
    print(f"Roman: {sentence}")
    print(f"Urdu:  {converted}")
    print("-" * 30)
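
Character-level approaches usually work on overlapping character n-grams, so spelling variants of the same word share features. The sketch below shows one way to extract them with boundary markers; `char_ngrams` and the `^`/`$` markers are assumptions for illustration, and the notebook does not confirm which features `MLModel(approach="character")` actually uses.

```python
def char_ngrams(word, n_min=1, n_max=3):
    """Return character n-grams of a word padded with boundary markers."""
    padded = f"^{word.lower()}$"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


print(char_ngrams("hai", n_min=2, n_max=3))

# Two common spellings of the same word still share several n-grams
shared_demo = set(char_ngrams("kaise", 2, 3)) & set(char_ngrams("kese", 2, 3))
print(sorted(shared_demo))
```

scikit-learn's `TfidfVectorizer` can produce comparable features through its `analyzer="char"` and `ngram_range` parameters.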

## 4. Model Evaluation on Test Set

In [None]:
# Evaluate both models on test set
def evaluate_model(model, model_name, test_roman, test_urdu):
    print(f"Evaluating {model_name}...")
    
    predictions = []
    references = test_urdu
    
    # Generate predictions
    for roman_text in test_roman:
        predicted = model.convert_text(roman_text)
        predictions.append(predicted)
    
    # Calculate metrics
    bleu_scores = [calculate_bleu_score(pred, ref) for pred, ref in zip(predictions, references)]
    rouge_scores = [calculate_rouge_l(pred, ref) for pred, ref in zip(predictions, references)]
    word_accuracies = [calculate_word_accuracy(pred, ref) for pred, ref in zip(predictions, references)]
    char_accuracies = [calculate_character_accuracy(pred, ref) for pred, ref in zip(predictions, references)]
    edit_distances = [calculate_edit_distance(pred, ref) for pred, ref in zip(predictions, references)]
    
    sentence_accuracy = calculate_sentence_accuracy(predictions, references)
    
    metrics = {
        'BLEU': np.mean(bleu_scores),
        'ROUGE-L': np.mean(rouge_scores),
        'Word_Accuracy': np.mean(word_accuracies),
        'Sentence_Accuracy': sentence_accuracy,
        'Character_Accuracy': np.mean(char_accuracies),
        'Avg_Edit_Distance': np.mean(edit_distances)
    }
    
    return metrics, predictions, bleu_scores, word_accuracies

# Evaluate word-based model
word_metrics, word_predictions, word_bleu_scores, word_word_accuracies = evaluate_model(
    word_ml_model, "Word-Based ML Model", test_roman, test_urdu
)

# Evaluate character-based model
char_metrics, char_predictions, char_bleu_scores, char_word_accuracies = evaluate_model(
    char_ml_model, "Character-Based ML Model", test_roman, test_urdu
)

print("\nEvaluation completed!")

In [None]:
# Display results
print("Word-Based ML Model Performance:")
print("=" * 40)
for metric, value in word_metrics.items():
    if 'Distance' in metric:
        print(f"{metric:20}: {value:.3f}")
    else:
        print(f"{metric:20}: {value:.3f} ({value*100:.1f}%)")

print("\nCharacter-Based ML Model Performance:")
print("=" * 42)
for metric, value in char_metrics.items():
    if 'Distance' in metric:
        print(f"{metric:20}: {value:.3f}")
    else:
        print(f"{metric:20}: {value:.3f} ({value*100:.1f}%)")
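
The metric functions come from `evaluation.metrics`, whose definitions are not shown here. To make the numbers easier to interpret, the sketch below gives one plausible definition of edit distance (Levenshtein) and of position-wise word accuracy. Both are assumptions, and the project's own implementations may differ, for example in how they normalise by length.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_accuracy(prediction, reference):
    """Position-wise word matches divided by the longer sentence length."""
    pred, ref = prediction.split(), reference.split()
    if not pred and not ref:
        return 1.0
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / max(len(pred), len(ref))


print(edit_distance("kitten", "sitting"))
print(word_accuracy("آپ کیسے ہو", "آپ کیسے ہیں"))
```

Because word accuracy here is positional, a single dropped word early in a sentence shifts every later word and lowers the score sharply; edit distance is more forgiving in that case.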

## 5. Model Comparison Analysis

In [None]:
# Evaluate the dictionary model on the same test set for comparison
dict_model = DictionaryModel("../data/roman_urdu_dictionary.json")
dict_metrics, dict_predictions, dict_bleu_scores, dict_word_accuracies = evaluate_model(
    dict_model, "Dictionary Model", test_roman, test_urdu
)

# Create comparison dataframe
comparison_data = {
    'Model': ['Dictionary', 'Word-based ML', 'Character-based ML'],
    'BLEU': [dict_metrics['BLEU'], word_metrics['BLEU'], char_metrics['BLEU']],
    'ROUGE-L': [dict_metrics['ROUGE-L'], word_metrics['ROUGE-L'], char_metrics['ROUGE-L']],
    'Word_Accuracy': [dict_metrics['Word_Accuracy'], word_metrics['Word_Accuracy'], char_metrics['Word_Accuracy']],
    'Sentence_Accuracy': [dict_metrics['Sentence_Accuracy'], word_metrics['Sentence_Accuracy'], char_metrics['Sentence_Accuracy']],
    'Character_Accuracy': [dict_metrics['Character_Accuracy'], word_metrics['Character_Accuracy'], char_metrics['Character_Accuracy']]
}

comparison_df = pd.DataFrame(comparison_data)
print("Model Comparison:")
print("=" * 80)
print(comparison_df.to_string(index=False, float_format='%.3f'))

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

metrics_to_plot = ['BLEU', 'ROUGE-L', 'Word_Accuracy', 'Sentence_Accuracy', 'Character_Accuracy']
colors = ['skyblue', 'lightcoral', 'lightgreen']

for i, metric in enumerate(metrics_to_plot):
    values = comparison_df[metric].values
    bars = axes[i].bar(comparison_df['Model'], values, color=colors, alpha=0.8)
    axes[i].set_title(f'{metric} Comparison', fontsize=12)
    axes[i].set_ylabel('Score')
    axes[i].set_ylim(0, 1)
    axes[i].grid(True, alpha=0.3)
    
    # Add value labels
    for bar, value in zip(bars, values):
        height = bar.get_height()
        axes[i].text(bar.get_x() + bar.get_width()/2., height + 0.01,
                     f'{value:.3f}', ha='center', va='bottom', fontsize=10)
    
    axes[i].tick_params(axis='x', rotation=45)

# Remove the last subplot
axes[5].remove()

plt.tight_layout()
plt.show()
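
Mean scores can hide how often one model actually beats another on the same sentence. A small sketch of a paired win/tie/loss count over per-sentence scores follows; `win_tie_loss` is a hypothetical helper and the score lists are made-up values, not results from this notebook.

```python
def win_tie_loss(scores_a, scores_b, tol=1e-9):
    """Count sentences where model A scores higher, equal, or lower than B."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must cover the same sentences")
    wins = ties = losses = 0
    for a, b in zip(scores_a, scores_b):
        if abs(a - b) <= tol:
            ties += 1
        elif a > b:
            wins += 1
        else:
            losses += 1
    return wins, ties, losses


# Made-up per-sentence scores for two models
scores_a_demo = [1.0, 0.5, 0.0, 0.8]
scores_b_demo = [1.0, 0.7, 0.2, 0.6]
print(win_tie_loss(scores_a_demo, scores_b_demo))
```

In the notebook, the same function could be called on `dict_bleu_scores` and `word_bleu_scores`, since both lists are aligned with `test_roman`.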

## 6. Detailed Performance Analysis

In [None]:
# Analyze performance distributions
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# BLEU score distributions
axes[0, 0].hist(dict_bleu_scores, bins=10, alpha=0.7, label='Dictionary', color='skyblue')
axes[0, 0].hist(word_bleu_scores, bins=10, alpha=0.7, label='Word ML', color='lightcoral')
axes[0, 0].hist(char_bleu_scores, bins=10, alpha=0.7, label='Char ML', color='lightgreen')
axes[0, 0].set_title('BLEU Score Distributions')
axes[0, 0].set_xlabel('BLEU Score')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Word accuracy distributions
axes[0, 1].hist(dict_word_accuracies, bins=10, alpha=0.7, label='Dictionary', color='skyblue')
axes[0, 1].hist(word_word_accuracies, bins=10, alpha=0.7, label='Word ML', color='lightcoral')
axes[0, 1].hist(char_word_accuracies, bins=10, alpha=0.7, label='Char ML', color='lightgreen')
axes[0, 1].set_title('Word Accuracy Distributions')
axes[0, 1].set_xlabel('Word Accuracy')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Box plots for BLEU scores
bleu_data = [dict_bleu_scores, word_bleu_scores, char_bleu_scores]
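# Set tick labels separately: the boxplot `labels=` argument was renamed in newer Matplotlib releases
axes[0, 2].boxplot(bleu_data)
axes[0, 2].set_xticklabels(['Dictionary', 'Word ML', 'Char ML'])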
axes[0, 2].set_title('BLEU Score Box Plots')
axes[0, 2].set_ylabel('BLEU Score')
axes[0, 2].grid(True, alpha=0.3)

# Box plots for word accuracy
acc_data = [dict_word_accuracies, word_word_accuracies, char_word_accuracies]
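# Set tick labels separately: the boxplot `labels=` argument was renamed in newer Matplotlib releases
axes[1, 0].boxplot(acc_data)
axes[1, 0].set_xticklabels(['Dictionary', 'Word ML', 'Char ML'])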
axes[1, 0].set_title('Word Accuracy Box Plots')
axes[1, 0].set_ylabel('Word Accuracy')
axes[1, 0].grid(True, alpha=0.3)

# Correlation between BLEU and Word Accuracy
axes[1, 1].scatter(dict_bleu_scores, dict_word_accuracies, alpha=0.7, label='Dictionary', color='skyblue')
axes[1, 1].scatter(word_bleu_scores, word_word_accuracies, alpha=0.7, label='Word ML', color='lightcoral')
axes[1, 1].scatter(char_bleu_scores, char_word_accuracies, alpha=0.7, label='Char ML', color='lightgreen')
axes[1, 1].set_title('BLEU vs Word Accuracy')
axes[1, 1].set_xlabel('BLEU Score')
axes[1, 1].set_ylabel('Word Accuracy')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

# Performance by sentence length
sentence_lengths = [len(sent.split()) for sent in test_roman]
axes[1, 2].scatter(sentence_lengths, dict_bleu_scores, alpha=0.7, label='Dictionary', color='skyblue')
axes[1, 2].scatter(sentence_lengths, word_bleu_scores, alpha=0.7, label='Word ML', color='lightcoral')
axes[1, 2].scatter(sentence_lengths, char_bleu_scores, alpha=0.7, label='Char ML', color='lightgreen')
axes[1, 2].set_title('Performance vs Sentence Length')
axes[1, 2].set_xlabel('Sentence Length (words)')
axes[1, 2].set_ylabel('BLEU Score')
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
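
The sentence-length scatter plot can be hard to read when many points overlap. A numeric summary, sketched below, averages a per-sentence score for each sentence length; `mean_score_by_length` is a hypothetical helper and the scores are made-up values.

```python
from collections import defaultdict


def mean_score_by_length(sentences, scores):
    """Average a per-sentence score for each sentence length (in words)."""
    buckets = defaultdict(list)
    for sentence, score in zip(sentences, scores):
        buckets[len(sentence.split())].append(score)
    return {length: sum(vals) / len(vals) for length, vals in sorted(buckets.items())}


sentences_demo = ["aap kaise hain", "main acha hun", "wo ghar ja raha hai"]
length_scores_demo = [1.0, 0.5, 0.4]
print(mean_score_by_length(sentences_demo, length_scores_demo))
```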

## 7. Feature Analysis (ML Models)

In [None]:
# Analyze word-based model features
if hasattr(word_ml_model, 'vectorizer') and hasattr(word_ml_model, 'model'):
    try:
        feature_names = word_ml_model.vectorizer.get_feature_names_out()
        
        # Get feature importance for Random Forest
        if hasattr(word_ml_model.model, 'feature_importances_'):
            importances = word_ml_model.model.feature_importances_
            
            # Get top features
            top_indices = np.argsort(importances)[-20:][::-1]
            top_features = [feature_names[i] for i in top_indices]
            top_importances = [importances[i] for i in top_indices]
            
            print("Top 20 Features (Word-based Model):")
            print("=" * 40)
            for feature, importance in zip(top_features, top_importances):
                print(f"{feature:20}: {importance:.4f}")
            
            # Visualize feature importance
            plt.figure(figsize=(12, 8))
            plt.barh(range(len(top_features)), top_importances)
            plt.yticks(range(len(top_features)), top_features)
            plt.xlabel('Feature Importance')
            plt.title('Top 20 Feature Importances (Word-based Model)')
            plt.gca().invert_yaxis()
            plt.grid(True, alpha=0.3)
            plt.tight_layout()
            plt.show()
        
    except Exception as e:
        print(f"Feature analysis not available: {e}")
else:
    print("Feature analysis not available for this model type")

In [None]:
# Analyze character-based model patterns
if hasattr(char_ml_model, 'char_pairs'):
    print("Character Mapping Analysis:")
    print("=" * 30)
    
    # Show most common character mappings
    char_freq = Counter()
    for roman_char, urdu_chars in char_ml_model.char_pairs.items():
        for urdu_char in urdu_chars:
            char_freq[(roman_char, urdu_char)] += 1
    
    print("Top 20 Character Mappings:")
    for (roman, urdu), freq in char_freq.most_common(20):
        print(f"'{roman}' -> '{urdu}': {freq} times")
    
    # Analyze character coverage
    roman_chars = set(char_ml_model.char_pairs.keys())
    all_urdu_chars = set()
    for urdu_chars in char_ml_model.char_pairs.values():
        all_urdu_chars.update(urdu_chars)
    
    print("\nCharacter Coverage:")
    print(f"Roman characters: {len(roman_chars)}")
    print(f"Urdu characters: {len(all_urdu_chars)}")
    # Count distinct (roman, urdu) pairs; len(char_pairs) would only repeat the Roman character count
    print(f"Total mappings: {len(char_freq)}")
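
A useful by-product of this analysis is knowing which Roman characters map to more than one Urdu character, since those are the cases a character-level model has to disambiguate from context. The sketch below assumes the mapping is stored as `{roman: [urdu, ...]}`; that structure and the helper `mapping_stats` are assumptions, not the documented layout of `char_pairs`.

```python
from collections import Counter


def mapping_stats(char_pairs):
    """Summarise a Roman->Urdu character mapping stored as {roman: [urdu, ...]}."""
    pair_counts = Counter(
        (roman, urdu) for roman, targets in char_pairs.items() for urdu in targets
    )
    ambiguous = {
        roman: sorted(set(targets))
        for roman, targets in char_pairs.items()
        if len(set(targets)) > 1
    }
    return {
        "roman_chars": len(char_pairs),
        "distinct_pairs": len(pair_counts),
        "ambiguous": ambiguous,
    }


# Toy mapping (illustrative): 't' may be dental or retroflex, 'k' has one target
char_pairs_demo = {"t": ["ت", "ٹ", "ت"], "k": ["ک", "ک"]}
stats_demo = mapping_stats(char_pairs_demo)
print(stats_demo["roman_chars"], stats_demo["distinct_pairs"])
```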

## 8. Error Analysis

In [None]:
# Detailed error analysis for all models
def analyze_errors(predictions, references, model_name):
    print(f"\nError Analysis for {model_name}:")
    print("=" * (25 + len(model_name)))
    
    errors = []
    for i, (pred, ref) in enumerate(zip(predictions, references)):
        if pred != ref:
            word_acc = calculate_word_accuracy(pred, ref)
            errors.append({
                'index': i,
                'roman': test_roman[i],
                'reference': ref,
                'prediction': pred,
                'word_accuracy': word_acc
            })
    
    print(f"Total errors: {len(errors)} out of {len(predictions)}")
    print(f"Error rate: {len(errors)/len(predictions)*100:.1f}%")
    
    # Show worst errors
    errors.sort(key=lambda x: x['word_accuracy'])
    print(f"\nWorst 5 errors:")
    for error in errors[:5]:
        print(f"  Roman:     {error['roman']}")
        print(f"  Reference: {error['reference']}")
        print(f"  Predicted: {error['prediction']}")
        print(f"  Word Acc:  {error['word_accuracy']:.3f}")
        print()
    
    return errors

# Analyze errors for all models
dict_errors = analyze_errors(dict_predictions, test_urdu, "Dictionary Model")
word_errors = analyze_errors(word_predictions, test_urdu, "Word-based ML Model")
char_errors = analyze_errors(char_predictions, test_urdu, "Character-based ML Model")

In [None]:
# Compare error patterns
print("Error Pattern Comparison:")
print("=" * 30)

# Find common errors across models
dict_error_indices = {error['index'] for error in dict_errors}
word_error_indices = {error['index'] for error in word_errors}
char_error_indices = {error['index'] for error in char_errors}

all_errors = dict_error_indices | word_error_indices | char_error_indices
common_errors = dict_error_indices & word_error_indices & char_error_indices

print(f"Sentences with errors:")
print(f"  Dictionary only: {len(dict_error_indices - word_error_indices - char_error_indices)}")
print(f"  Word ML only: {len(word_error_indices - dict_error_indices - char_error_indices)}")
print(f"  Char ML only: {len(char_error_indices - dict_error_indices - word_error_indices)}")
print(f"  Common to all: {len(common_errors)}")
print(f"  Total unique errors: {len(all_errors)}")

# Show common difficult sentences
if common_errors:
    print(f"\nSentences that all models struggled with:")
    for idx in sorted(common_errors)[:3]:  # sort so the sample is the same on every run
        print(f"  Roman: {test_roman[idx]}")
        print(f"  Reference: {test_urdu[idx]}")
        print()
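
The error listing above shows whole sentences, which makes it hard to see exactly which words went wrong. A word-level diff, sketched below with the standard library's `difflib.SequenceMatcher`, isolates the mismatched spans; `word_diff` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def word_diff(prediction, reference):
    """List (operation, predicted words, reference words) for mismatched spans."""
    pred, ref = prediction.split(), reference.split()
    matcher = SequenceMatcher(a=pred, b=ref, autojunk=False)
    return [
        (tag, pred[i1:i2], ref[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# A dropped final word and a wrong verb form (toy examples)
missing_demo = word_diff("وہ گھر جا رہا", "وہ گھر جا رہا ہے")
replaced_demo = word_diff("آپ کیسے ہو", "آپ کیسے ہیں")
print(missing_demo)
print(replaced_demo)
```

Aggregating the `replace` entries over all errors would show which words each model confuses most often.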

## 9. Model Performance Summary

In [None]:
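# Create comprehensive performance summary
# Note: the qualitative fields (training_time, prediction_speed, memory_usage,
# interpretability) are hand-assigned estimates, not measurements from this notebook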
summary_data = {
    'Dictionary Model': {
        'type': 'Rule-based',
        'training_time': 'Instant',
        'prediction_speed': 'Very Fast',
        'memory_usage': 'Low',
        'interpretability': 'High',
        'metrics': dict_metrics
    },
    'Word-based ML': {
        'type': 'Machine Learning',
        'training_time': 'Fast',
        'prediction_speed': 'Fast',
        'memory_usage': 'Medium',
        'interpretability': 'Medium',
        'metrics': word_metrics
    },
    'Character-based ML': {
        'type': 'Machine Learning',
        'training_time': 'Fast',
        'prediction_speed': 'Fast',
        'memory_usage': 'Medium',
        'interpretability': 'Low',
        'metrics': char_metrics
    }
}

print("Comprehensive Model Comparison:")
print("=" * 40)

for model_name, info in summary_data.items():
    print(f"\n{model_name}:")
    print(f"  Type: {info['type']}")
    print(f"  Training Time: {info['training_time']}")
    print(f"  Prediction Speed: {info['prediction_speed']}")
    print(f"  Memory Usage: {info['memory_usage']}")
    print(f"  Interpretability: {info['interpretability']}")
    print(f"  Performance:")
    for metric, value in info['metrics'].items():
        if 'Distance' in metric:
            print(f"    {metric}: {value:.3f}")
        else:
            print(f"    {metric}: {value:.3f} ({value*100:.1f}%)")
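
The speed labels in the summary are qualitative. If measured numbers are wanted, prediction time can be sampled with `time.perf_counter`, as in the sketch below. `time_predictions` is a hypothetical helper, and `str.upper` is only a stand-in for a real converter such as `word_ml_model.convert_text`.

```python
import time


def time_predictions(convert, sentences, repeats=3):
    """Return (best seconds per sentence, predictions) for a conversion callable."""
    if not sentences:
        raise ValueError("need at least one sentence to time")
    best = float("inf")
    predictions = []
    for _ in range(repeats):
        start = time.perf_counter()
        predictions = [convert(s) for s in sentences]
        best = min(best, time.perf_counter() - start)
    return best / len(sentences), predictions


# Stand-in converter for the demo; replace with a model's convert_text method
seconds_demo, preds_demo = time_predictions(str.upper, ["aap kaise hain", "main acha hun"])
print(f"{seconds_demo * 1e6:.1f} microseconds per sentence")
```

Taking the best of several repeats reduces noise from other processes on the machine.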

## 10. Save Results and Models

In [None]:
# Create results directory if it doesn't exist
os.makedirs('../results', exist_ok=True)
os.makedirs('../models/saved', exist_ok=True)

# Save ML models
word_ml_model.save_model('../models/saved/word_ml_model.pkl')
char_ml_model.save_model('../models/saved/char_ml_model.pkl')

# Save comprehensive results
all_results = {
    'test_set_size': len(test_data),
    'training_set_size': len(sample_data),
    'models': {
        'dictionary': {
            'metrics': dict_metrics,
            'predictions': dict_predictions,
            'error_count': len(dict_errors)
        },
        'word_ml': {
            'metrics': word_metrics,
            'predictions': word_predictions,
            'error_count': len(word_errors)
        },
        'char_ml': {
            'metrics': char_metrics,
            'predictions': char_predictions,
            'error_count': len(char_errors)
        }
    },
    'comparison': comparison_df.to_dict('records'),
    'error_analysis': {
        'common_errors': len(common_errors),
        'total_error_sentences': len(all_errors)
    }
}

# Save results
with open('../results/ml_models_results.json', 'w', encoding='utf-8') as f:
    json.dump(all_results, f, ensure_ascii=False, indent=2)

# Save detailed predictions
detailed_results = []
for i in range(len(test_data)):
    detailed_results.append({
        'index': i,
        'roman': test_roman[i],
        'reference': test_urdu[i],
        'english': test_data[i].get('english', ''),
        'dict_prediction': dict_predictions[i],
        'word_ml_prediction': word_predictions[i],
        'char_ml_prediction': char_predictions[i],
        'dict_bleu': dict_bleu_scores[i],
        'word_ml_bleu': word_bleu_scores[i],
        'char_ml_bleu': char_bleu_scores[i],
        'dict_word_acc': dict_word_accuracies[i],
        'word_ml_word_acc': word_word_accuracies[i],
        'char_ml_word_acc': char_word_accuracies[i]
    })

detailed_df = pd.DataFrame(detailed_results)
detailed_df.to_csv('../results/detailed_model_predictions.csv', index=False, encoding='utf-8')

print("Results and models saved successfully!")
print("Files created:")
print("  - results/ml_models_results.json")
print("  - results/detailed_model_predictions.csv")
print("  - models/saved/word_ml_model.pkl")
print("  - models/saved/char_ml_model.pkl")
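
Urdu text only stays readable in the saved JSON when the file is written as UTF-8 with `ensure_ascii=False`, as the cell above does. The round-trip sketch below checks this in a temporary directory; `save_results` and `load_results` are hypothetical helpers, and `default=float` is an added assumption to cope with numeric scalar types that `json` does not serialise natively.

```python
import json
import tempfile
from pathlib import Path


def save_results(results, path):
    """Write results as UTF-8 JSON, keeping Urdu text human-readable."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        # default=float converts unknown numeric scalars into plain floats
        json.dump(results, f, ensure_ascii=False, indent=2, default=float)


def load_results(path):
    """Read results back from a UTF-8 JSON file."""
    with Path(path).open("r", encoding="utf-8") as f:
        return json.load(f)


results_demo = {"metrics": {"BLEU": 0.5}, "predictions": ["آپ کیسے ہیں"]}
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "results" / "demo_results.json"
    save_results(results_demo, target)
    raw_text_demo = target.read_text(encoding="utf-8")
    restored_demo = load_results(target)
print(restored_demo == results_demo)
```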

## 11. Final Summary and Recommendations

In [None]:
# Generate final performance ranking
model_scores = {
    'Dictionary': np.mean([dict_metrics['BLEU'], dict_metrics['Word_Accuracy'], dict_metrics['Character_Accuracy']]),
    'Word ML': np.mean([word_metrics['BLEU'], word_metrics['Word_Accuracy'], word_metrics['Character_Accuracy']]),
    'Character ML': np.mean([char_metrics['BLEU'], char_metrics['Word_Accuracy'], char_metrics['Character_Accuracy']])
}

ranked_models = sorted(model_scores.items(), key=lambda x: x[1], reverse=True)

print("=" * 60)
print("FINAL MACHINE LEARNING APPROACH SUMMARY")
print("=" * 60)

print("\nModel Performance Ranking:")
for i, (model, score) in enumerate(ranked_models, 1):
    print(f"{i}. {model}: {score:.3f} (average score)")

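# The descriptive lines below are hard-coded expectations; only the BLEU and
# word-accuracy values are computed from the test set
print("\nKey Findings:")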
print("1. Dictionary Model:")
print(f"   - Excellent performance on known vocabulary")
print(f"   - BLEU: {dict_metrics['BLEU']:.3f}, Word Accuracy: {dict_metrics['Word_Accuracy']:.3f}")
print(f"   - Fast and interpretable")

print("\n2. Word-based ML Model:")
print(f"   - Good generalization capabilities")
print(f"   - BLEU: {word_metrics['BLEU']:.3f}, Word Accuracy: {word_metrics['Word_Accuracy']:.3f}")
print(f"   - Handles unknown words better")

print("\n3. Character-based ML Model:")
print(f"   - Character-level understanding")
print(f"   - BLEU: {char_metrics['BLEU']:.3f}, Word Accuracy: {char_metrics['Word_Accuracy']:.3f}")
print(f"   - Most flexible for variations")

print("\nRecommendations:")
print("1. Use Dictionary Model for high-accuracy, known vocabulary scenarios")
print("2. Use Word-based ML for balanced performance and generalization")
print("3. Use Character-based ML for handling spelling variations")
print("4. Consider ensemble approaches combining multiple models")
print("5. Implement hybrid systems using dictionary + ML fallback")

print("\nNext Steps:")
print("1. Implement deep learning approaches (Seq2Seq, Transformer)")
print("2. Expand training data for better ML performance")
print("3. Develop ensemble methods")
print("4. Implement human evaluation studies")
print("5. Deploy best performing model for real-world testing")
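
The fifth recommendation, a dictionary with an ML fallback, can be sketched in a few lines. The version below is only an illustration: `hybrid_convert` is a hypothetical helper, and the fallback is a placeholder function rather than a trained model.

```python
def hybrid_convert(sentence, dictionary, fallback):
    """Use the dictionary for known words and a fallback callable otherwise."""
    converted = []
    for word in sentence.lower().split():
        if word in dictionary:
            converted.append(dictionary[word])
        else:
            converted.append(fallback(word))
    return " ".join(converted)


dictionary_demo = {"aap": "آپ", "kaise": "کیسے"}


def ml_fallback_demo(word):
    """Placeholder fallback that just marks the word as unresolved."""
    return f"[{word}]"


print(hybrid_convert("Aap kaise hain", dictionary_demo, ml_fallback_demo))
```

In the notebook, `fallback` could be a model's `convert_word` method, so the dictionary handles covered vocabulary and the model handles the rest.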

## Conclusions

### Machine Learning Approach Results:

#### Word-based ML Model:
- **Strengths**: Good balance between accuracy and generalization
- **Performance**: Competitive with dictionary on known vocabulary
- **Use Case**: Ideal for scenarios requiring good coverage and reasonable accuracy

#### Character-based ML Model:
- **Strengths**: Handles spelling variations and unknown words
- **Performance**: Lower accuracy on individual words but more robust to unseen spellings
- **Use Case**: Suitable for noisy text with many variations

### Comparison with Dictionary Approach:
- Dictionary model remains highly competitive for covered vocabulary
- ML models generalize better to words the dictionary does not cover
- Hybrid approaches could combine strengths of both

### Technical Insights:
1. **Feature Engineering**: TF-IDF features work well for this task
2. **Model Selection**: Random Forest provides a good balance of performance and interpretability
3. **Data Requirements**: The ML models are trained only on the sample data, so more training data would likely improve their performance
4. **Error Patterns**: Sentences that all three models get wrong (Section 8) suggest inherent task difficulty

### Future Work:
- Implement sequence-to-sequence deep learning models
- Expand training dataset significantly
- Develop context-aware models
- Create ensemble methods
- Conduct human evaluation studies
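
As a starting point for the ensemble item above, a per-word majority vote across converters is one of the simplest options. The sketch below is an assumption about how such an ensemble could look, not an evaluated method; `majority_vote`, `ensemble_convert`, and the lookup tables are hypothetical.

```python
from collections import Counter


def majority_vote(candidates):
    """Pick the most common candidate; ties go to the earliest-listed model."""
    counts = Counter(candidates)
    best = max(counts.values())
    return next(c for c in candidates if counts[c] == best)


def ensemble_convert(sentence, converters):
    """Convert word by word and vote across converters for each word."""
    words = sentence.split()
    return " ".join(majority_vote([conv(w) for conv in converters]) for w in words)


# Three toy word-level converters that disagree on some words
table_a = {"aap": "آپ", "hain": "ہیں"}
table_b = {"aap": "آپ", "hain": "ہے"}
table_c = {"aap": "اپ", "hain": "ہیں"}
converters_demo = [lambda w, t=t: t.get(w, w) for t in (table_a, table_b, table_c)]
print(ensemble_convert("aap hain", converters_demo))
```

Listing the converters in order of trust makes the tie-breaking rule meaningful, for example dictionary first, then word-based ML, then character-based ML.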