# ДЗ 9: Information Extraction из новостных диалогов

**Трек B** — извлечение сущностей (PERSON, ORG, LOC, EVENT, DATE, IMPACT, SOURCE) из диалогов с помощью LLM.

**Этапы:**
1. Локальное развертывание моделей (quantized vs full)
2. Подготовка данных (WildChat-1M)
3. Оптимизация для IE (batch processing)
4. Анализ производительности

## 0. Установка зависимостей

> В Colab: **Runtime → Change runtime type → GPU (T4)**. После pip может понадобиться Restart.

In [1]:
# Без pipeline → меньше зависимостей, нет ошибки torchvision::nms
!pip install -q -U transformers accelerate bitsandbytes datasets

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m10.4/10.4 MB[0m [31m55.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m60.7/60.7 MB[0m [31m12.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m515.2/515.2 kB[0m [31m19.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m47.6/47.6 MB[0m [31m12.4 MB/s[0m eta [36m0:00:00[0m
[?25h

## 1. Загрузка данных (WildChat-1M)

Датасет: 1M диалогов человек–ChatGPT. Подвыборка 500–1K для CPU, 1K–2K для GPU. Demo: 100–200.

In [2]:
from datasets import load_dataset

N_SAMPLES = 200  # Demo: 200. Для полного: 1000 или 2000

# split='train[:N]' загружает только N примеров (экономит время и диск)
ds = load_dataset("allenai/WildChat-1M", split=f"train[:{N_SAMPLES}]")

def get_conversation_text(sample):
    conv = sample.get("conversation", [])
    parts = []
    for turn in conv:
        c = turn.get("content", "")
        if c:
            parts.append(c.strip())
    return " \n ".join(parts) if parts else ""

texts = [get_conversation_text(ds[i]) for i in range(len(ds))]
texts = [t for t in texts if len(t) > 50]  # фильтр коротких
print(f"Загружено {len(texts)} диалогов")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

data/train-00000-of-00014.parquet:   0%|          | 0.00/231M [00:00<?, ?B/s]

data/train-00001-of-00014.parquet:   0%|          | 0.00/215M [00:00<?, ?B/s]

data/train-00002-of-00014.parquet:   0%|          | 0.00/206M [00:00<?, ?B/s]

data/train-00003-of-00014.parquet:   0%|          | 0.00/217M [00:00<?, ?B/s]



data/train-00004-of-00014.parquet:   0%|          | 0.00/208M [00:00<?, ?B/s]

data/train-00005-of-00014.parquet:   0%|          | 0.00/201M [00:00<?, ?B/s]

data/train-00006-of-00014.parquet:   0%|          | 0.00/190M [00:00<?, ?B/s]

data/train-00007-of-00014.parquet:   0%|          | 0.00/188M [00:00<?, ?B/s]

data/train-00008-of-00014.parquet:   0%|          | 0.00/181M [00:00<?, ?B/s]

data/train-00009-of-00014.parquet:   0%|          | 0.00/269M [00:00<?, ?B/s]

data/train-00010-of-00014.parquet:   0%|          | 0.00/336M [00:00<?, ?B/s]

data/train-00011-of-00014.parquet:   0%|          | 0.00/300M [00:00<?, ?B/s]

data/train-00012-of-00014.parquet:   0%|          | 0.00/283M [00:00<?, ?B/s]

data/train-00013-of-00014.parquet:   0%|          | 0.00/336M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/837989 [00:00<?, ? examples/s]

Загружено 185 диалогов


## 2. Промпт для IE

Извлечение сущностей в JSON: PERSON, ORG, LOC, EVENT, DATE, IMPACT, SOURCE.

In [3]:
IE_PROMPT = """Extract entities from the text. Return JSON with keys: PERSON, ORG, LOC, EVENT, DATE, IMPACT, SOURCE. Each key is a list of strings. If nothing found, use empty list [].

Text:
{text}

JSON:"""

def make_ie_prompt(text, max_chars=1500, model_type="mistral"):
    t = text[:max_chars] if len(text) > max_chars else text
    body = IE_PROMPT.format(text=t)
    if model_type == "mistral":
        return f"<s>[INST] {body} [/INST]"
    if model_type == "tinyllama":
        return f"<|system|>\nYou are a helpful assistant.<|user|>\n{body}<|assistant|>\n"
    return body

## 3. Модели: TinyLlama (full/4-bit) и Mistral (4-bit)

Сравнение: quantized vs full precision по скорости и памяти.

In [4]:
import torch
# Используем только Auto* — без pipeline, чтобы избежать torchvision::nms в Colab
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

def load_model(name, use_4bit=False):
    tokenizer = AutoTokenizer.from_pretrained(name)
    tokenizer.pad_token = tokenizer.eos_token
    if use_4bit:
        bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
        model = AutoModelForCausalLM.from_pretrained(name, quantization_config=bnb, device_map="auto")
    else:
        model = AutoModelForCausalLM.from_pretrained(name, device_map="auto", torch_dtype=torch.float16)
    return model, tokenizer

# Конфиг: какая модель (для Colab T4 — tiny 4-bit или mistral 4-bit)
USE_MISTRAL = True  # True = Mistral 4-bit (медленнее, качественнее), False = TinyLlama

if USE_MISTRAL:
    model_id = "mistralai/Mistral-7B-Instruct-v0.2"
    model, tokenizer = load_model(model_id, use_4bit=True)
else:
    model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    model, tokenizer = load_model(model_id, use_4bit=False)

# Вместо pipeline используем model.generate() напрямую
def generate(pipe_model, pipe_tokenizer, prompt, max_new_tokens=256):
    inputs = pipe_tokenizer(prompt, return_tensors="pt").to(pipe_model.device)
    out = pipe_model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False, pad_token_id=pipe_tokenizer.eos_token_id)
    return pipe_tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

pipe = (model, tokenizer)  # (model, tokenizer) для совместимости с extract_entities

config.json:   0%|          | 0.00/596 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

model.safetensors.index.json: 0.00B [00:00, ?B/s]

Downloading (incomplete total...): 0.00B [00:00, ?B/s]

Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Loading weights:   0%|          | 0/291 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

## 4. IE: единичный и batch

In [5]:
import time
import json
import re

def extract_entities(pipe, text, max_new_tokens=200, model_type="mistral"):
    model, tokenizer = pipe
    prompt = make_ie_prompt(text, model_type=model_type)
    raw = generate(model, tokenizer, prompt, max_new_tokens=max_new_tokens).strip()
    # Пытаемся вытащить JSON
    match = re.search(r"\\{[\\s\\S]*?\\}", raw)
    if match:
        try:
            return json.loads(match.group())
        except:
            pass
    return {"raw": raw}

def run_ie_batch(pipe, texts, batch_size=1, model_type="mistral"):
    results = []
    t0 = time.time()
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        for t in batch:
            r = extract_entities(pipe, t, model_type=model_type)
            results.append(r)
    elapsed = time.time() - t0
    return results, elapsed

In [6]:
# Запуск на подвыборке (10 для demo)
N_RUN = min(10, len(texts))
sample_texts = texts[:N_RUN]

model_type = "mistral" if USE_MISTRAL else "tinyllama"
results, elapsed = run_ie_batch(pipe, sample_texts, model_type=model_type)
print(f"Обработано {N_RUN} диалогов за {elapsed:.1f} сек")
print(f"Throughput: {N_RUN/elapsed:.2f} диалогов/сек")
print()
print("Пример извлечения:")
print(json.dumps(results[0], ensure_ascii=False, indent=2)[:500])

Обработано 10 диалогов за 128.4 сек
Throughput: 0.08 диалогов/сек

Пример извлечения:
{
  "raw": "{\n\"PERSON\": [\"I\", \"you\"],\n\"ORG\": [],\n\"LOC\": [\"desired reality\"],\n\"EVENT\": [\"reality shifting\"],\n\"DATE\": [],\n\"IMPACT\": [],\n\"SOURCE\": [\"text\"]\n}\n\nNo specific entities were mentioned in the text related to PERSON, ORG, DATE, IMPACT. However, the text does discuss several requirements for the \"desired reality\" which could be considered as entities for the LOC key. Here's an example of how those entities could be described:\n\n{\n\"PERSON\": [\"I\", \"y


## 5. Анализ производительности

- Скорость: диалогов/сек
- Ресурсы: VRAM (torch.cuda)

In [7]:
if torch.cuda.is_available():
    vram_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"VRAM пик: {vram_gb:.2f} GB")
print(f"Время на {N_RUN} диалогов: {elapsed:.1f} сек")
print(f"Среднее: {elapsed/N_RUN:.2f} сек/диалог")

VRAM пик: 12.36 GB
Время на 10 диалогов: 128.4 сек
Среднее: 12.84 сек/диалог


## 6. Опционально: вторая модель для сравнения

Перед запуском — освободить память (`del model`, `torch.cuda.empty_cache()`).

In [8]:
# Сравнение TinyLlama vs Mistral (запускать по очереди, не одновременно)
# 1) TinyLlama ~2GB VRAM, быстрее
# 2) Mistral 4-bit ~6GB VRAM, качественнее
# Замерь время и VRAM для каждой.

## 7. Опционально: системный промпт «когнитивный дизайнер»

Для объяснений в стиле когнитивного дизайнера — см. `prompt_cognitive_designer.md` или `../promt.md`. Добавь в начало промпта перед запросом пользователя.