# JapaneseTextClassifier Tutorial

This notebook demonstrates **JapaneseTextClassifier**, an end-to-end classifier for Japanese text with built-in LIME explanations.

## Features

- **Simple API**: Train, predict, and explain in just a few lines of code
- **Built-in tokenization**: Automatic Japanese tokenization with Sudachi
- **LIME explanations**: Human-readable explanations in Japanese and English
- **Model persistence**: Save and load trained models
- **FastAPI ready**: Results are JSON-serializable for web APIs

## Contents

1. [Quick Start](#1-quick-start)
2. [Training a Classifier](#2-training-a-classifier)
3. [Making Predictions with Explanations](#3-making-predictions-with-explanations)
4. [Saving and Loading Models](#4-saving-and-loading-models)
5. [API Integration (FastAPI/htmx)](#5-api-integration-fastapihtmx)

## 1. Quick Start

Here's the simplest way to use JapaneseTextClassifier:

In [None]:
# Install dependencies (run once)
# !pip install sudachipy sudachidict_core pandas scikit-learn

In [None]:
import sys
sys.path.insert(0, '..')

from tokusan import JapaneseTextClassifier
import pandas as pd

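# Load data and drop rows with missing text or label
df = pd.read_csv('fakenews.csv')
df = df.dropna(subset=['context', 'isfake'])
# Keep only real (0) and fake (2) rows, then map them to 0/1
df = df[df['isfake'].isin([0, 2])].copy()
df['label'] = (df['isfake'] == 2).astype(int)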

# Create and train classifier
clf = JapaneseTextClassifier(class_names=['Real', 'Fake'])
result = clf.train(df['context'], df['label'])

print(result.summary())

In [None]:
# Predict with explanation
text = "これは信頼できるニュース記事です。正確な情報が報道されています。"
pred = clf.predict(text, explain=True)

print(pred.summary_jp)

## 2. Training a Classifier

This section walks through data preparation, classifier configuration, and training step by step.

In [None]:
import sys
sys.path.insert(0, '..')

from tokusan import JapaneseTextClassifier
import pandas as pd

# Load and prepare dataset
df = pd.read_csv('fakenews.csv')
df = df.dropna(subset=['context', 'isfake'])

# Filter to binary classification (0=real, 2=fake)
df = df[df['isfake'].isin([0, 2])].copy()
df['label'] = (df['isfake'] == 2).astype(int)

print(f"Dataset size: {len(df)}")
print("Label distribution:")
print(df['label'].value_counts())

In [None]:
# Create classifier with custom settings
clf = JapaneseTextClassifier(
    class_names=['Real', 'Fake'],       # Class names (index 0, 1)
    classifier_type='logistic_regression',  # or 'random_forest'
    max_features=20000,                 # TF-IDF vocabulary size
    random_state=42                     # For reproducibility
)

print(clf)

In [None]:
# Train the model
result = clf.train(
    texts=df['context'],
    labels=df['label'],
    test_size=0.2  # 20% for testing
)

# Print training results
print(result.summary())

In [None]:
# Japanese summary
print(result.summary_jp())

In [None]:
# Get results as dictionary (for APIs)
result.to_dict()

### Using Random Forest

You can also use a Random Forest classifier by setting `classifier_type='random_forest'`:

In [None]:
# Create Random Forest classifier
clf_rf = JapaneseTextClassifier(
    class_names=['Real', 'Fake'],
    classifier_type='random_forest',
    n_estimators=100,  # Number of trees
    random_state=42
)

# Train
result_rf = clf_rf.train(df['context'], df['label'])
print(f"Random Forest Accuracy: {result_rf.accuracy:.2%}")

## 3. Making Predictions with Explanations

The classifier's key feature is a human-readable explanation of each prediction, generated with LIME.

In [None]:
# Sample text to classify
test_text = """
政府は本日、新たな経済政策を発表しました。
この政策は来年度から実施される予定です。
専門家の多くはこの政策を歓迎しています。
"""

# Predict with explanation
pred = clf.predict(
    test_text,
    explain=True,       # Generate LIME explanation
    num_features=10,    # Top 10 words in explanation
    num_samples=500     # LIME perturbation samples
)

print(f"Predicted class: {pred.predicted_class}")
print(f"Confidence: {pred.confidence:.1%}")
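
To build intuition for what a perturbation-based explanation measures, here is a minimal sketch of leave-one-out occlusion. This is a simpler relative of LIME, not LIME itself: it removes one token at a time and records how much a score changes. The `toy_fake_score` function and its keyword set are hypothetical stand-ins for a trained model, and the tokens are split by hand rather than by Sudachi.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a trained model (not part of tokusan):
# the score is the share of tokens found in a "suspicious" keyword set.
SUSPICIOUS = {"衝撃", "拡散", "陰謀"}


def toy_fake_score(tokens: List[str]) -> float:
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in SUSPICIOUS)
    return hits / len(tokens)


def occlusion_weights(
    tokens: List[str],
    score: Callable[[List[str]], float],
) -> List[Tuple[str, float]]:
    """Weight of a token = score with the token minus score without it."""
    base = score(tokens)
    weights = []
    for i, tok in enumerate(tokens):
        reduced = tokens[:i] + tokens[i + 1:]
        weights.append((tok, base - score(reduced)))
    # Largest absolute effect first
    return sorted(weights, key=lambda pair: abs(pair[1]), reverse=True)


tokens = ["衝撃", "の", "事実", "が", "拡散"]
for word, weight in occlusion_weights(tokens, toy_fake_score):
    print(f"{word}: {weight:+.4f}")
```

LIME goes further than this sketch: it draws many random perturbations of the text (the `num_samples` argument above) and fits a weighted linear model to the resulting predictions, so each word's weight reflects its effect across many combinations rather than a single removal.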

In [None]:
# Japanese summary (main feature!)
print("=" * 60)
print("日本語の説明:")
print("=" * 60)
print(pred.summary_jp)

In [None]:
# English summary
print("=" * 60)
print("English Explanation:")
print("=" * 60)
print(pred.summary_en)

In [None]:
# Access word weights directly
print("\nTop words influencing the prediction:")
for word, weight in pred.explanation.word_weights:
    direction = "↑" if weight > 0 else "↓"
    print(f"  {direction} {word}: {weight:+.4f}")

In [None]:
# Words that increase probability
print("\nWords increasing probability:")
for word, weight in pred.explanation.top_positive_words[:5]:
    print(f"  + {word}: {weight:+.4f}")

# Words that decrease probability
print("\nWords decreasing probability:")
for word, weight in pred.explanation.top_negative_words[:5]:
    print(f"  - {word}: {weight:+.4f}")
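
The loops above treat `word_weights` as a sequence of `(word, weight)` pairs. As a rough sketch of how positive and negative lists could be derived from such pairs, the hypothetical helper `split_by_sign` below separates them by sign and orders each group by strength. The sample weights are made up for illustration, and the library's own ordering may differ.

```python
from typing import List, Tuple

WordWeight = Tuple[str, float]


def split_by_sign(
    word_weights: List[WordWeight],
) -> Tuple[List[WordWeight], List[WordWeight]]:
    """Return (positive, negative) lists, strongest effect first in each."""
    positive = sorted(
        (p for p in word_weights if p[1] > 0),
        key=lambda p: p[1],
        reverse=True,
    )
    negative = sorted(
        (p for p in word_weights if p[1] < 0),
        key=lambda p: p[1],
    )
    return positive, negative


# Made-up weights for illustration only
sample = [("政府", -0.12), ("衝撃", 0.30), ("専門家", -0.25), ("拡散", 0.08)]
pos, neg = split_by_sign(sample)
print("positive:", pos)
print("negative:", neg)
```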

### Batch Predictions

In [None]:
# Predict multiple texts at once
texts = [
    "政府が新しい政策を発表しました。",
    "この情報は信頼できません。",
    "専門家がコメントしています。"
]

results = clf.predict_batch(texts, explain=False)  # Fast batch without explanation

# Use `item` as the loop variable so the training `result` from Section 2 is not overwritten
for text, item in zip(texts, results):
    print(f"{text[:20]}... -> {item.predicted_class} ({item.confidence:.1%})")
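
Once a batch has been scored, it is often useful to summarize it. The sketch below counts predictions per class and flags low-confidence items for manual review. `StubPrediction` is a hypothetical stand-in that exposes only the `predicted_class` and `confidence` attributes used above, and the threshold of 0.6 is an arbitrary illustrative choice.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class StubPrediction:
    """Hypothetical stand-in exposing the two attributes used above."""
    predicted_class: str
    confidence: float


def class_counts(results: Sequence[StubPrediction]) -> Dict[str, int]:
    """Count how many texts were assigned to each class."""
    return dict(Counter(r.predicted_class for r in results))


def low_confidence_indices(
    results: Sequence[StubPrediction],
    threshold: float = 0.6,
) -> List[int]:
    """Return positions of predictions whose confidence is below threshold."""
    return [i for i, r in enumerate(results) if r.confidence < threshold]


# Made-up values for illustration only
stub_results = [
    StubPrediction("Real", 0.91),
    StubPrediction("Fake", 0.55),
    StubPrediction("Real", 0.58),
]
print(class_counts(stub_results))
print(low_confidence_indices(stub_results))
```

The same two helpers should work on the list returned by `clf.predict_batch`, provided its items expose those two attributes as the loop above suggests.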

## 4. Saving and Loading Models

Trained models can be saved and loaded for later use.

In [None]:
# Save the trained model
clf.save('my_classifier.pkl')
print("Model saved to 'my_classifier.pkl'")

In [None]:
# Load the model
loaded_clf = JapaneseTextClassifier.load('my_classifier.pkl')
print(f"Model loaded: {loaded_clf}")

In [None]:
# Use the loaded model
pred = loaded_clf.predict("テスト文章です", explain=True)
print(pred.summary_jp)

In [None]:
# Clean up
import os
os.remove('my_classifier.pkl')
print("Cleaned up model file")
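
As an alternative to removing the file by hand, a temporary directory cleans up after itself. The sketch below shows the pattern with the standard `pickle` module and a plain dictionary as a stand-in object. The `.pkl` extension suggests pickle, but what `clf.save` actually writes is an assumption here, and `save_and_reload` is a hypothetical helper.

```python
import pickle
import tempfile
from pathlib import Path


def save_and_reload(obj):
    """Pickle obj into a temporary directory, load it back, and clean up."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / "my_classifier.pkl"
        with path.open("wb") as f:
            pickle.dump(obj, f)
        with path.open("rb") as f:
            restored = pickle.load(f)
    # The directory and the file inside it are gone once the block exits
    return restored, path.exists()


# Plain dictionary standing in for a trained classifier
stand_in = {"class_names": ["Real", "Fake"], "max_features": 20000}
restored, still_on_disk = save_and_reload(stand_in)
print(restored == stand_in, still_on_disk)
```

With the real classifier, the two `open` blocks would be replaced by `clf.save(str(path))` and `JapaneseTextClassifier.load(str(path))` inside the same `with` block.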

## 5. API Integration (FastAPI/htmx)

The result classes are designed for easy integration with FastAPI and htmx.

In [None]:
# Results are JSON-serializable
pred = clf.predict("テスト文章", explain=True)

# Convert to dictionary for API response
api_response = pred.to_dict()

print("API Response (JSON-ready):")
print(f"Keys: {list(api_response.keys())}")

In [None]:
# View the full response
import json
print(json.dumps(api_response, ensure_ascii=False, indent=2))
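
The `ensure_ascii=False` argument matters for Japanese text. By default `json.dumps` escapes every non-ASCII character as `\uXXXX`, which is valid JSON but hard to read. The short sketch below compares both forms on a made-up dictionary that stands in for `pred.to_dict()`; its keys are illustrative, not the library's actual schema.

```python
import json

# Made-up stand-in for pred.to_dict(); the real keys may differ
payload = {"predicted_class": "Fake", "text": "テスト文章"}

escaped = json.dumps(payload)                       # non-ASCII escaped as \uXXXX
readable = json.dumps(payload, ensure_ascii=False)  # Japanese kept as-is

print(escaped)
print(readable)

# Both strings decode to the same data
print(json.loads(escaped) == json.loads(readable) == payload)
```

Either form is safe to send over the wire; the readable form is mainly a convenience for logs and notebooks.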

In [None]:
# HTML output for htmx
html = pred.to_html(lang='jp')
print("HTML output (first 500 chars):")
print(html[:500])

### Example FastAPI Integration

Here's how you would use this with FastAPI:

```python
from fastapi import FastAPI
from fastapi.responses import HTMLResponse
from pydantic import BaseModel
from tokusan import JapaneseTextClassifier

app = FastAPI()

# Load model at startup
classifier = JapaneseTextClassifier.load("model.pkl")

class PredictRequest(BaseModel):
    text: str
    explain: bool = True

@app.post("/predict")
async def predict(request: PredictRequest):
    """Return JSON response for API clients."""
    result = classifier.predict(request.text, explain=request.explain)
    return result.to_dict()

@app.post("/predict/html", response_class=HTMLResponse)
async def predict_html(request: PredictRequest):
    """Return HTML fragment for htmx."""
    result = classifier.predict(request.text, explain=request.explain)
    return result.to_html(lang='jp')
```

## Summary

### Quick Reference

```python
from tokusan import JapaneseTextClassifier

# Create and train
clf = JapaneseTextClassifier(class_names=['Real', 'Fake'])
result = clf.train(texts, labels)

# Predict with explanation
pred = clf.predict(text, explain=True)
print(pred.summary_jp)  # Japanese explanation
print(pred.to_dict())   # For API

# Save/load
clf.save('model.pkl')
clf = JapaneseTextClassifier.load('model.pkl')
```

### Key Classes

| Class | Description |
|-------|-------------|
| `JapaneseTextClassifier` | Main classifier with train/predict/explain |
| `TrainingResult` | Training metrics and summary |
| `PredictionResult` | Prediction with probabilities and explanation |
| `ExplanationResult` | Word weights and Japanese/English summaries |

### Key Methods

| Method | Description |
|--------|-------------|
| `clf.train(texts, labels)` | Train the classifier |
| `clf.predict(text, explain=True)` | Predict with explanation |
| `clf.predict_batch(texts)` | Batch predictions |
| `clf.save(path)` / `JapaneseTextClassifier.load(path)` | Save/load model (`load` is called on the class) |
| `pred.summary_jp` | Japanese summary of a prediction |
| `result.to_dict()` / `pred.to_dict()` | JSON-serializable dict |
| `pred.to_html()` | HTML for htmx |