# Build a Custom Retrieval-Augmented Pipeline on Your Private Notion Pages
_by Tuana Celik ([Twitter](https://twitter.com/tuanacelik), [LinkedIn](https://www.linkedin.com/in/tuanacelik/))_

In this Colab, we will:
- Create a custom Haystack component called `NotionExporter`
- Build an indexing pipeline that writes our Notion pages into an `InMemoryDocumentStore` with embeddings
- Build a custom RAG pipeline to do question answering on our Notion pages

In [None]:
!pip install haystack-ai cohere-haystack transformers sentence_transformers
!pip install notion-exporter
!pip install python-frontmatter
!pip install nest-asyncio

import nest_asyncio

nest_asyncio.apply()

## Build a custom NotionExporter component

Documentation on [Custom Components](https://docs.haystack.deepset.ai/v2.0/docs/custom-components)

In [None]:
from typing import List
from notion_exporter import NotionExporter as _NotionExporter
import frontmatter
from haystack import component
from haystack.dataclasses import Document

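@component
class NotionExporter():
    """Exports Notion pages and returns them as Haystack Documents."""

    def __init__(self, api_token: str):
        # Wrap the notion-exporter client, authenticated with the integration token
        self.notion_exporter = _NotionExporter(
            notion_token=api_token,
        )

    @component.output_types(documents=List[Document])
    def run(self, page_ids: List[str]):
        # Export each requested page; the result maps page IDs to exported Markdown
        extracted_pages = self.notion_exporter.export_pages(page_ids)

        documents = []
        for page_id, page in extracted_pages.items():
            # Separate the front matter from the Markdown body; only the body is kept
            metadata, markdown_text = frontmatter.parse(page)
            document = Document(content=markdown_text)
            documents.append(document)

        return {"documents": documents}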
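
The component above relies on `frontmatter.parse` to separate a page's front matter from its Markdown body. To illustrate what that separation means, here is a simplified stand-in written with the standard library only. It is not how `python-frontmatter` is implemented: `split_front_matter` is a hypothetical helper, it leaves the header as raw text instead of parsing it into a dictionary, and the sample page is made up.

```python
from typing import Tuple


def split_front_matter(text: str) -> Tuple[str, str]:
    """Return (raw_header, body); raw_header is "" when no front matter is found."""
    lines = text.splitlines(keepends=True)
    # Front matter must start on the very first line with a '---' delimiter
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            header = "".join(lines[1:i])
            body = "".join(lines[i + 1:])
            return header, body
    # No closing delimiter: treat the whole text as body
    return "", text


# Made-up page content for illustration
page = "---\ntitle: Example page\n---\n# Heading\n\nBody text.\n"
header, body = split_front_matter(page)
print(repr(header))
print(repr(body))
```

In the component, only the body ends up in `Document.content`; the parsed metadata is discarded.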

In [None]:
import getpass
import os

notion_api_key = getpass.getpass("Enter Notion API key:")
cohere_api_key = getpass.getpass("Cohere API key:")

### Test our custom NotionExporter component

- You can follow the steps outlined in the Notion [documentation](https://developers.notion.com/docs/create-a-notion-integration#create-your-integration-in-notion) to create a new Notion integration, connect it to your pages, and obtain your API token.
- A page ID in Notion is the trailing string of characters at the end of the page URL, hyphenated into groups of 8-4-4-4-12 characters (for example, `6f98e9a6-a880-40e9-b191-1c4f41efec87`).
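
If you copy a page URL from the browser, the ID usually appears as 32 unhyphenated characters at the end of the path. The sketch below converts it into the 8-4-4-4-12 form. It assumes the URL really does end with the 32-character ID once any query string is dropped; `page_id_from_url` is a hypothetical helper and the URL is a made-up example.

```python
import re
import uuid


def page_id_from_url(url: str) -> str:
    """Extract the trailing 32-character ID and return it in 8-4-4-4-12 form."""
    # Drop any query string or fragment before looking at the end of the path
    path = url.split("?", 1)[0].split("#", 1)[0]
    match = re.search(r"([0-9a-fA-F]{32})$", path)
    if match is None:
        raise ValueError(f"No 32-character page ID found in {url!r}")
    # str(UUID) produces the hyphenated 8-4-4-4-12 representation
    return str(uuid.UUID(hex=match.group(1)))


# Hypothetical URL, used only for illustration
example_url = "https://www.notion.so/Example-Page-6f98e9a6a88040e9b1911c4f41efec87"
print(page_id_from_url(example_url))
```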

In [None]:
exporter = NotionExporter(api_token=notion_api_key)

In [None]:
exporter.run(page_ids=["6f98e9a6-a880-40e9-b191-1c4f41efec87"])

{'documents': [Document(id=79a3fbd138a1b92c89128f14adbbc2f712edb5da8dc6d5b238440268328dff3e, content: '# Customizing RAG Pipelines to Summarize Latest Hacker News Posts with Haystack 2.0 Preview
  
  *Take a...')]}

## Build an Indexing Pipeline to Write Notion Pages to a Document Store

- Documentation on [`SentenceTransformersDocumentEmbedder`](https://docs.haystack.deepset.ai/v2.0/docs/sentencetransformersdocumentembedder)
- Documentation on [`DocumentSplitter`](https://docs.haystack.deepset.ai/v2.0/docs/documentsplitter)
- Documentation on [`DocumentWriter`](https://docs.haystack.deepset.ai/v2.0/docs/documentwriter)

In [None]:
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores import InMemoryDocumentStore


document_store = InMemoryDocumentStore()
exporter = NotionExporter(api_token=notion_api_key)
splitter = DocumentSplitter()
document_embedder = SentenceTransformersDocumentEmbedder()
writer = DocumentWriter(document_store=document_store)


In [None]:
from haystack import Pipeline

indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=exporter, name="exporter")
indexing_pipeline.add_component(instance=splitter, name="splitter")
indexing_pipeline.add_component(instance=document_embedder, name="document_embedder")
indexing_pipeline.add_component(instance=writer, name="writer")

In [None]:
indexing_pipeline.connect("exporter.documents", "splitter.documents")
indexing_pipeline.connect("splitter.documents", "document_embedder.documents")
indexing_pipeline.connect("document_embedder.documents", "writer.documents")

In [None]:
indexing_pipeline.run(data={"exporter":{"page_ids": ["6f98e9a6-a880-40e9-b191-1c4f41efec87"]}})

{'writer': {'documents_written': 8}}
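
A single Notion page was written as 8 documents, presumably because the splitter breaks the page into smaller chunks before they are embedded. Here is a minimal sketch of word-based chunking, assuming chunks of 200 words with no overlap (this tutorial leaves `DocumentSplitter` at its defaults, so check the documentation for the values in your version). `split_by_words` is a hypothetical helper and not Haystack's implementation. For instance, it normalizes whitespace, which the real splitter may not.

```python
from typing import List


def split_by_words(text: str, split_length: int = 200) -> List[str]:
    """Break text into consecutive chunks of at most split_length words."""
    words = text.split()
    return [
        " ".join(words[start:start + split_length])
        for start in range(0, len(words), split_length)
    ]


# Tiny example with a chunk length of 2 so the result is easy to read
sample = "one two three four five"
chunks = split_by_words(sample, split_length=2)
print(chunks)
```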

## Build a RAG Pipeline with Cohere

- Documentation on [`SentenceTransformersTextEmbedder`](https://docs.haystack.deepset.ai/v2.0/docs/sentencetransformerstextembedder)
- Documentation on [`PromptBuilder`](https://docs.haystack.deepset.ai/v2.0/docs/promptbuilder)
- Documentation on [`CohereGenerator`](https://docs.haystack.deepset.ai/v2.0/docs/coheregenerator)

In [None]:
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers import InMemoryEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from cohere_haystack.generator import CohereGenerator

prompt = """ Answer the query, based on the
content in the documents.

Documents:
{% for doc in documents %}
  {{doc.content}}
{% endfor %}

Query: {{query}}
"""
text_embedder = SentenceTransformersTextEmbedder()
retriever = InMemoryEmbeddingRetriever(document_store=document_store)
prompt_builder = PromptBuilder(template=prompt)
generator = CohereGenerator(api_key=cohere_api_key)


In [None]:
rag_pipeline = Pipeline()

rag_pipeline.add_component(instance=text_embedder, name="text_embedder")
rag_pipeline.add_component(instance=retriever, name="retriever")
rag_pipeline.add_component(instance=prompt_builder, name="prompt_builder")
rag_pipeline.add_component(instance=generator, name="generator")

rag_pipeline.connect("text_embedder", "retriever")
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "generator")
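
In this pipeline, the text embedder turns the question into a vector, and the retriever compares that vector with the stored document embeddings to pick the most relevant chunks for the prompt. Here is a toy sketch of that ranking step, assuming dot-product scoring (an assumption about the store's similarity function, which is configurable). `retrieve` is a hypothetical helper, and the two-dimensional vectors and chunk texts are made up. Real embeddings have hundreds of dimensions.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(
    query_embedding: Sequence[float],
    docs: List[Tuple[str, Sequence[float]]],
    top_k: int = 2,
) -> List[str]:
    """Return the texts of the top_k documents with the highest scores."""
    ranked = sorted(docs, key=lambda d: dot(query_embedding, d[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


# Made-up (text, embedding) pairs standing in for indexed chunks
docs = [
    ("chunk about custom components", [0.9, 0.1]),
    ("chunk about Hacker News", [0.2, 0.8]),
    ("chunk about pipelines", [0.6, 0.4]),
]
query_embedding = [1.0, 0.0]
top_chunks = retrieve(query_embedding, docs, top_k=2)
print(top_chunks)
```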

In [None]:
question = "What are the steps for creating a custom component?"
result = rag_pipeline.run(data={"text_embedder":{"text": question},
                       "prompt_builder":{"query": question}})

In [None]:
print(result['generator']['replies'][0])

 To create a custom component in Haystack 2.0, you need to:
1. Add a `@component` decorator on the class declaration.
2. Implement a `run` function with a decorator `@component.output_types(my_output_name=my_output_type)` that describes what output the pipeline should expect from this component.

Here is an example of creating a custom component in Haystack 2.0:
```python
from haystack.preview import component
from typing import List
from newspaper import Article

@component
class HackernewsNewestFetcher():
    def run(self, last_k: int = 5):
        # Implement your logic to fetch the data from the API or database
        # Here we are using the newspapers library to fetch the data from the given URL
        newest_list = requests.get(url='https://hacker-news.firebaseio.com/v0/newstories.json?print=pretty')
        articles = []
        for id in newest_list.json()[0:last_k]:
            article = requests.get(url=f"https://hacker-news.firebaseio.com/v0/item/{id}.json?print=pretty")
 