## kRAG - Knowledge Graph-Enhanced RAG System

### Objective:
Develop a kRAG system that uses Named Entity Recognition (NER) to build a knowledge graph of (subject, relation, object) triples, then adds the triples relevant to a question to the prompt to improve question answering.
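
To make the objective concrete, here is a minimal sketch of the last step: picking triples that mention something in the question and prepending them to the prompt. The function names, the substring-matching rule, and the sample triples are all illustrative assumptions, not part of the system described here. A real retriever would likely match on recognized entities instead of raw substrings.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

def select_relevant_triples(question: str, triples: List[Triple], limit: int = 5) -> List[Triple]:
    """Keep triples whose subject or object appears verbatim in the question."""
    q = question.lower()
    hits = [t for t in triples if t[0].lower() in q or t[2].lower() in q]
    return hits[:limit]

def build_prompt(question: str, triples: List[Triple]) -> str:
    """Format the selected triples as a fact list placed before the question."""
    facts = "\n".join(f"- {s} {r} {o}" for s, r, o in triples)
    return f"Known facts:\n{facts}\n\nQuestion: {question}\nAnswer:"

# Hypothetical triples, for illustration only
triples = [
    ("Marie Curie", "discovered", "polonium"),
    ("Marie Curie", "born in", "Warsaw"),
    ("Alan Turing", "born in", "London"),
]
question = "Where was Marie Curie born?"
selected = select_relevant_triples(question, triples)
prompt = build_prompt(question, selected)
print(prompt)
```

Only the two triples that mention "Marie Curie" reach the prompt; the unrelated one is filtered out before the question is sent to the language model.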

## 1: NER and Knowledge Graph Construction

### Day 1-3: Data Preparation and NER Model Development

1. Choose a domain (e.g., scientific papers, news articles, or technical documentation)
2. Collect a corpus of 100-200 documents in the chosen domain
3. Implement or fine-tune an NER model using a framework like spaCy or Hugging Face Transformers
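
spaCy-style NER training data is a list of `(text, {"entities": [(start, end, label), ...]})` pairs with character offsets. As a rough sketch of one way to bootstrap such data from a collected corpus, the snippet below labels every occurrence of a known term. The `annotate` helper and the `gazetteer` term list are hypothetical, and hand-corrected annotations will usually be needed on top of this kind of weak labeling.

```python
import re
from typing import Dict, List, Tuple

def annotate(text: str, gazetteer: Dict[str, str]) -> Tuple[str, dict]:
    """Return (text, {"entities": [(start, end, label), ...]}) for gazetteer terms found in text."""
    spans: List[Tuple[int, int, str]] = []
    for term, label in gazetteer.items():
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", text):
            spans.append((m.start(), m.end(), label))
    # NER training spans must not overlap: sort by start, longest first,
    # and keep a span only if it begins after the previous kept span ends
    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    kept: List[Tuple[int, int, str]] = []
    last_end = -1
    for start, end, label in spans:
        if start >= last_end:
            kept.append((start, end, label))
            last_end = end
    return text, {"entities": kept}

# Hypothetical term list and sentence, for illustration only
gazetteer = {"Marie Curie": "PERSON", "polonium": "ELEMENT"}
sample_train_data = [annotate("Marie Curie discovered polonium.", gazetteer)]
print(sample_train_data)
```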


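import random

import spacy
from spacy.training import Example
from spacy.util import minibatch, compounding

def train_ner_model(train_data, iterations=30):
    """Train a blank English NER pipeline (spaCy v3 API).

    train_data: list of (text, {"entities": [(start, end, label), ...]}) pairs.
    """
    nlp = spacy.blank("en")
    ner = nlp.add_pipe("ner")
    for _, annotations in train_data:
        for ent in annotations.get("entities", []):
            ner.add_label(ent[2])

    # spaCy v3 expects Example objects rather than raw (text, annotations) tuples
    examples = [
        Example.from_dict(nlp.make_doc(text), annotations)
        for text, annotations in train_data
    ]
    # The blank pipeline contains only "ner", so no other pipes need disabling
    optimizer = nlp.initialize(lambda: examples)
    for itn in range(iterations):
        random.shuffle(examples)
        losses = {}
        batches = minibatch(examples, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            nlp.update(batch, drop=0.5, sgd=optimizer, losses=losses)
        print(f"Iteration {itn}, Losses: {losses}")

    return nlp

# Placeholder training data (illustrative only); replace it with annotated
# examples from your own domain corpus
train_data = [
    ("Marie Curie discovered polonium.",
     {"entities": [(0, 11, "PERSON"), (23, 31, "ELEMENT")]}),
]

# Train the model with your domain-specific data
ner_model = train_ner_model(train_data)
ner_model.to_disk("./domain_ner_model")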

### Relationship Extraction

Implement a relationship extraction module to identify connections between entities:

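import spacy

def extract_relationships(doc):
    """Extract (subject, verb, object) token triples, one per sentence at most.

    Needs a pipeline with a dependency parser, because it relies on
    doc.sents and token.dep_.
    """
    relationships = []
    for sent in doc.sents:
        root = sent.root
        subject = None
        obj = None
        for child in root.children:
            if child.dep_ in ("nsubj", "nsubjpass"):
                subject = child
            elif child.dep_ == "dobj":
                obj = child
            elif child.dep_ == "prep" and obj is None:
                # A pobj attaches to the preposition, not directly to the root verb
                obj = next((t for t in child.children if t.dep_ == "pobj"), None)
        if subject is not None and obj is not None:
            relationships.append((subject, root, obj))
    return relationships

# The NER model trained above is a blank pipeline with no parser, so the
# dependency parse comes from a separate pipeline. "en_core_web_sm" is an
# assumption; any parser-equipped English pipeline should work.
parser_nlp = spacy.load("en_core_web_sm", exclude=["ner"])
ner_nlp = spacy.load("./domain_ner_model")

def process_document(text):
    """Return (entities, relationships) for one raw text document."""
    entities = [(ent.text, ent.label_) for ent in ner_nlp(text).ents]
    relationships = extract_relationships(parser_nlp(text))
    return entities, relationships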

### Knowledge Graph Construction

Build a knowledge graph using the extracted entities and relationships:

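import networkx as nx

def build_knowledge_graph(documents):
    """Build a directed graph: nodes are entities, edges carry the relation verb."""
    G = nx.DiGraph()
    for text in documents:
        entities, relationships = process_document(text)
        for entity, entity_type in entities:
            G.add_node(entity, type=entity_type)
        for subj, pred, obj in relationships:
            # Edge endpoints are single-token texts, so they may not match
            # the multi-word entity nodes added above
            G.add_edge(subj.text, obj.text, relation=pred.text)
    return G

documents = [doc1, doc2, ...]  # Your corpus: a list of raw text strings
knowledge_graph = build_knowledge_graph(documents)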
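
Once the graph exists, triples can be read back out of it for a given entity. The sketch below is one possible way to do that with NetworkX edge views; `triples_for_entity`, the `"related_to"` fallback label, and the toy graph are assumptions made for illustration.

```python
import networkx as nx

def triples_for_entity(G, entity, max_triples=10):
    """Collect (subject, relation, object) triples on edges touching `entity`."""
    if entity not in G:
        return []
    triples = []
    # Edges where the entity is the subject
    for _, obj, data in G.out_edges(entity, data=True):
        triples.append((entity, data.get("relation", "related_to"), obj))
    # Edges where the entity is the object
    for subj, _, data in G.in_edges(entity, data=True):
        triples.append((subj, data.get("relation", "related_to"), entity))
    return triples[:max_triples]

# Toy graph, for illustration only
toy_graph = nx.DiGraph()
toy_graph.add_edge("Curie", "polonium", relation="discovered")
toy_graph.add_edge("Curie", "radium", relation="discovered")
toy_graph.add_edge("Sorbonne", "Curie", relation="hired")

curie_triples = triples_for_entity(toy_graph, "Curie")
print(curie_triples)
```

Note that `nx.DiGraph` stores a single edge per ordered node pair, so a second relation between the same two entities overwrites the first. `nx.MultiDiGraph` is an alternative if parallel relations need to be kept.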