# Constructing a Knowledge Graph-Based RAG with ChromaDB, LlamaIndex and OpenAI

copyright 2024, Denis Rothman

**A Practical Guide to Building a Graph-Based Semantic Search Engine with ChromaDB, LlamaIndex, and OpenAI**

**Summary**

*   Pipeline 1 : Collecting and preparing the documents
*   Pipeline 2 : Creating and populating a ChromaDB Vector Store (Local)
*   Pipeline 3:  Index-based RAG.

**Topics**
*   Knowledge graph index-based semantic search and LLM response
*   Re-ranking
*   Metrics calculations and display


# Environment Setup

**Local Jupyter Setup:** This notebook uses `.env` file for API keys.

Required API keys in `.env` file:
```
OPENAI_API_KEY=sk-proj-...
```

**ChromaDB Storage:** Uses local ChromaDB for vector storage (100% free, no cloud dependencies).

**Important Notes:**
- This notebook was migrated from Deep Lake to ChromaDB for local development
- All data stored locally in `./chroma_db` directory
- No cloud API tokens required
- All file operations use UTF-8 encoding for Windows compatibility

In [1]:
# Environment Setup - Load API keys from .env file
import os
from dotenv import load_dotenv

# Load API keys from .env file
load_dotenv()

# Verify OpenAI API key is loaded
if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("OPENAI_API_KEY not found in .env file")

print("✓ Environment configured")
print(f"  OpenAI API Key: {os.getenv('OPENAI_API_KEY')[:10]}...")
print("  Using ChromaDB (local storage - no cloud API needed)")

✓ Environment configured
  OpenAI API Key: sk-proj-lq...
  Using ChromaDB (local storage - no cloud API needed)


In [2]:
import PIL

# Check Pillow version
current_version = PIL.__version__
print(f"Current Pillow version: {current_version}")

# Note: Pillow >= 10.2.0 is recommended
required_version = "10.2.0"

def version_tuple(version):
    return tuple(map(int, (version.split("."))))

if version_tuple(current_version) < version_tuple(required_version):
    print(f"⚠ Warning: Pillow {current_version} is less than recommended {required_version}")
    print("  Consider upgrading: pip install pillow>=10.2.0")
else:
    print(f"✓ Pillow version {current_version} meets requirements")

Current Pillow version: 10.4.0
✓ Pillow version 10.4.0 meets requirements


**Note:** ChromaDB is used instead of Deep Lake for local vector storage.

ChromaDB advantages:
- 100% free and open-source
- No cloud API tokens required
- Local-first storage in `./chroma_db`
- Perfect for development and production

LlamaIndex supports ChromaDB vector stores through the ChromaVectorStore class.

ChromaDB provides efficient local vector storage with no cloud dependencies.

Next, let's import the required modules and set the needed environmental variables:

In [3]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Document
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
import chromadb

print("✓ Imports loaded successfully")



✓ Imports loaded successfully


In [4]:
# Configure OpenAI API key from environment
import os
import openai

API_KEY = os.getenv("OPENAI_API_KEY")
os.environ['OPENAI_API_KEY'] = API_KEY
openai.api_key = API_KEY

print("✓ OpenAI API configured")

✓ OpenAI API configured


**Note:** No Activeloop token needed for ChromaDB - all data stored locally!

# Scenario


In [5]:
#File name for file management
graph_name="Marketing"

# Path for ChromaDB vector store (local storage)
chroma_db_path = "./chroma_db"
collection_name = "marketing_knowledge_graph"

#if True upserts data; if False, passes upserting and goes to connection
pop_vs=True
# if pop_vs==True, overwrite=True will overwrite dataset, False will append it:
ow=True

print(f"✓ ChromaDB path: {chroma_db_path}")
print(f"✓ Collection: {collection_name}")

✓ ChromaDB path: ./chroma_db
✓ Collection: marketing_knowledge_graph


# Pipeline 1 : Collecting and preparing the documents

In [6]:
!mkdir data

In [7]:
# Use the Marketing_urls.txt file we created with Wikipedia_API.ipynb
import shutil
import os

if pop_vs==True:
    # Source file (created by Wikipedia_API.ipynb)
    source_file = "Marketing_urls.txt"
    
    # Check if file exists
    if os.path.exists(source_file):
        print(f"✓ Found {source_file}")
        
        # Create citations directory if needed
        citations_dir = "citations"
        os.makedirs(citations_dir, exist_ok=True)
        
        # Copy to citations directory
        dest_file = os.path.join(citations_dir, source_file)
        shutil.copy2(source_file, dest_file)
        print(f"✓ Copied to {dest_file}")
        
        # Also keep a copy in current directory for easy access
        file_name = source_file
        print(f"✓ Using {file_name} from current directory")
    else:
        print(f"❌ Error: {source_file} not found")
        print("   Please run Wikipedia_API.ipynb first to generate the URLs file")
        raise FileNotFoundError(f"{source_file} not found. Run Wikipedia_API.ipynb first.")

✓ Found Marketing_urls.txt
✓ Copied to citations\Marketing_urls.txt
✓ Using Marketing_urls.txt from current directory


In [8]:
# Read URLs from the file
import requests
from bs4 import BeautifulSoup
import re
import os

if pop_vs==True:
  directory = "Chapter07/citations"
  file_name = graph_name+"_urls.txt"

  with open(file_name, 'r') as file:
      urls = [line.strip() for line in file]

  # Display the URLs
  print("Read URLs:")
  for url in urls:
      print(url)

Read URLs:
https://en.wikipedia.org/wiki/Marketing
https://en.wikipedia.org/wiki/24-hour_news_cycle
https://en.wikipedia.org/wiki/Account-based_marketing
https://en.wikipedia.org/wiki/Activism
https://en.wikipedia.org/wiki/Adam_Smith
https://en.wikipedia.org/wiki/Adam_Smith_Institute
https://en.wikipedia.org/wiki/Advertising
https://en.wikipedia.org/wiki/Advertising_agency
https://en.wikipedia.org/wiki/Advertising_mail
https://en.wikipedia.org/wiki/Advertising_management
https://en.wikipedia.org/wiki/Advertising_slogan
https://en.wikipedia.org/wiki/Advertorial
https://en.wikipedia.org/wiki/Advocacy
https://en.wikipedia.org/wiki/Advocacy_group
https://en.wikipedia.org/wiki/Affinity_marketing
https://en.wikipedia.org/wiki/Agenda-setting_theory
https://en.wikipedia.org/wiki/Agile_marketing
https://en.wikipedia.org/wiki/Agricultural_Marketing_Service
https://en.wikipedia.org/wiki/Agricultural_marketing
https://en.wikipedia.org/wiki/Airborne_leaflet_propaganda
https://en.wikipedia.org/wiki/

In [10]:
import requests
import re
import os
from bs4 import BeautifulSoup

def clean_text(content):
    # Remove references and unwanted characters
    content = re.sub(r'\[\d+\]', '', content)   # Remove references
    content = re.sub(r'[^\w\s\.]', '', content)  # Remove punctuation (except periods)
    return content

def fetch_and_clean(url):
    try:
        # Add proper headers to avoid 403 Forbidden
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1'
        }

        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Raise exception for bad responses (e.g., 404)
        soup = BeautifulSoup(response.content, 'html.parser')

        # Prioritize "mw-parser-output" but fall back to "content" class if not found
        content = soup.find('div', {'class': 'mw-parser-output'}) or soup.find('div', {'id': 'content'})
        if content is None:
            print(f"No content found for {url}")
            return None

        # Remove specific sections, including nested ones
        for section_title in ['References', 'Bibliography', 'External links', 'See also', 'Notes']:
            section = content.find('span', id=section_title)
            while section:
                for sib in section.parent.find_next_siblings():
                    sib.decompose()
                section.parent.decompose()
                section = content.find('span', id=section_title)

        # Extract and clean text
        text = content.get_text(separator=' ', strip=True)
        text = clean_text(text)
        return text
    except requests.exceptions.RequestException as e:
        print(f"Error fetching content from {url}: {e}")
        return None  # Return None on error

if pop_vs==True:
    # Directory to store the output files
    output_dir = './data/'  # More descriptive name
    os.makedirs(output_dir, exist_ok=True)

    # Processing each URL (and skipping invalid ones)
    print(f"Fetching {len(urls)} Wikipedia articles...")
    successful = 0
    failed = 0

    for idx, url in enumerate(urls, 1):
        article_name = url.split('/')[-1].replace('.html', '')  # Handle .html extension
        filename = os.path.join(output_dir, f"{article_name}.txt")

        clean_article_text = fetch_and_clean(url)
        if clean_article_text:  # Only write to file if content exists
            with open(filename, 'w', encoding='utf-8') as file:
                file.write(clean_article_text)
            successful += 1
            if idx % 10 == 0:
                print(f"Progress: {idx}/{len(urls)} articles processed...")
        else:
            failed += 1

    separator = "=" * 70
    print(f"\n{separator}")
    print(f"✓ Scraping complete")
    print(separator)
    print(f"Successfully fetched: {successful} articles")
    print(f"Failed: {failed} articles")
    print(f"Content written to {output_dir} directory")


Fetching 101 Wikipedia articles...
Progress: 10/101 articles processed...
Progress: 20/101 articles processed...
Progress: 30/101 articles processed...
Progress: 50/101 articles processed...
Progress: 60/101 articles processed...
Progress: 70/101 articles processed...
Progress: 80/101 articles processed...
Progress: 90/101 articles processed...
Progress: 100/101 articles processed...

✓ Scraping complete
Successfully fetched: 90 articles
Failed: 11 articles
Content written to ./data/ directory


In [11]:
if pop_vs==True:
  # load documents
  documents = SimpleDirectoryReader("./data/").load_data()
  # Print the first document
  print(documents[0])

Doc ID: 775aacdc-e33c-4af2-80e0-7cfee85418d4
Text: Investigation and reporting of news concomitant with fastpaced
lifestyles This article is about the fastpaced cycle of news media in
technologically advanced societies. For the longerterm cycle of news
and information see information cycle . Several simultaneous NBC News
broadcasts including MSNBC  NBC s Today and CNBC s Squawk Box
displayed on...


# Pipeline 2 : Creating and populating the ChromaDB Vector Store

In [12]:
if pop_vs==True:
    # Initialize ChromaDB client
    chroma_client = chromadb.PersistentClient(path=chroma_db_path)
    
    # Create or get collection
    if ow==True:
        # Delete existing collection if overwrite=True
        try:
            chroma_client.delete_collection(name=collection_name)
            print(f"✓ Deleted existing collection: {collection_name}")
        except:
            pass
    
    # Create new collection
    chroma_collection = chroma_client.get_or_create_collection(name=collection_name)
    print(f"✓ ChromaDB collection created: {collection_name}")
    
    # Create vector store
    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    
    # Create an index over the documents
    print("Creating vector index...")
    index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
    print(f"✓ Index created with {len(documents)} documents")

✓ ChromaDB collection created: marketing_knowledge_graph
Creating vector index...
✓ Index created with 88 documents


In [13]:
# Connect to existing ChromaDB collection
chroma_client = chromadb.PersistentClient(path=chroma_db_path)
chroma_collection = chroma_client.get_collection(name=collection_name)

print(f"✓ Connected to ChromaDB collection: {collection_name}")
print(f"  Total documents: {chroma_collection.count()}")

✓ Connected to ChromaDB collection: marketing_knowledge_graph
  Total documents: 830


In [14]:
# Display ChromaDB collection info
print(f"Collection: {chroma_collection.name}")
print(f"Total documents: {chroma_collection.count()}")
print(f"Storage path: {chroma_db_path}")

# Get sample data
if chroma_collection.count() > 0:
    sample = chroma_collection.peek(limit=3)
    print(f"\nSample document IDs: {sample['ids'][:3]}")

Collection: marketing_knowledge_graph
Total documents: 830
Storage path: ./chroma_db

Sample document IDs: ['062bcd33-cfe1-440f-889c-33b28086fc84', '42168abb-63fb-46f6-aec4-496e8333f734', '768e8d47-3937-47c4-9ccb-f6b935450fa3']


In [18]:
import json
import pandas as pd
import numpy as np

# Retrieve all data from ChromaDB
results = chroma_collection.get(include=["embeddings", "documents", "metadatas"])

# Extract data from results
ids = results["ids"]
documents = results["documents"]
metadatas = results["metadatas"]
embeddings = results.get("embeddings", None)

# Create DataFrame with proper structure
data = []
for i in range(len(ids)):
    data.append({
        "id": ids[i],
        "text": documents[i],
        "metadata": metadatas[i] if metadatas else {},
        "embedding": embeddings[i] if (embeddings is not None and i < len(embeddings)) else None
    })

# Create DataFrame from list of dictionaries
df = pd.DataFrame(data)

print(f"✓ Created DataFrame with {len(df)} documents")
print(f"  Columns: {list(df.columns)}")
if len(df) > 0:
    print(f"  Sample ID: {df['id'].iloc[0]}")
    print(f"  Text length (first doc): {len(df['text'].iloc[0]) if df['text'].iloc[0] else 0} chars")

✓ Created DataFrame with 830 documents
  Columns: ['id', 'text', 'metadata', 'embedding']
  Sample ID: 062bcd33-cfe1-440f-889c-33b28086fc84
  Text length (first doc): 3698 chars


In [21]:
# Function to display a selected record
def display_record(record_number):
    record = df.iloc[record_number]
    display_data = {
        "ID": record.get("id", "N/A"),
        "Metadata": record.get("metadata", "N/A"),
        "Text": record.get("text", "N/A"),
        "Embedding": record.get("embedding", "N/A")
    }

    # Print the ID
    print("ID:")
    print(display_data["ID"])
    print()

    # Print the metadata in a structured format
    print("Metadata:")
    metadata = display_data["Metadata"]
    if isinstance(metadata, list):
        for item in metadata:
            for key, value in item.items():
                print(f"{key}: {value}")
            print()
    else:
        print(metadata)
    print()

    # Print the text
    print("Text:")
    print(display_data["Text"])
    print()

    # Print the embedding
    print("Embedding:")
    print(display_data["Embedding"])
    print()

# Example usage
rec = 829  # Replace with the desired record number
display_record(rec)


ID:
ef2611a4-3499-4d10-a890-8b885795dd25

Metadata:
{'file_type': 'text/plain', 'document_id': '3bf0ea09-7869-472e-8f60-d490128911bf', 'last_modified_date': '2025-11-02', 'doc_id': '3bf0ea09-7869-472e-8f60-d490128911bf', '_node_type': 'TextNode', 'file_name': 'The_Chartered_Institute_of_Marketing.txt', 'file_path': 'c:\\Users\\user\\Desktop\\RAG-Driven-Generative-AI\\Chapter07\\data\\The_Chartered_Institute_of_Marketing.txt', '_node_content': '{"id_": "ef2611a4-3499-4d10-a890-8b885795dd25", "embedding": null, "metadata": {"file_path": "c:\\\\Users\\\\user\\\\Desktop\\\\RAG-Driven-Generative-AI\\\\Chapter07\\\\data\\\\The_Chartered_Institute_of_Marketing.txt", "file_name": "The_Chartered_Institute_of_Marketing.txt", "file_type": "text/plain", "file_size": 8766, "creation_date": "2025-11-02", "last_modified_date": "2025-11-02"}, "excluded_embed_metadata_keys": ["file_name", "file_type", "file_size", "creation_date", "last_modified_date", "last_accessed_date"], "excluded_llm_metadata_keys

## Original documents

In [22]:
# Ensure 'text' column is of type string
df['text'] = df['text'].astype(str)
# Create documents with IDs
documents = [Document(text=row['text'], doc_id=str(row['id'])) for _, row in df.iterrows()]

# Pipeline 3:Knowledge Graph Index-based RAG

## Generating the Knowledge Graph Index

In [23]:
from llama_index.core import KnowledgeGraphIndex
import time
# Start the timer
start_time = time.time()

#graph index with embeddings
graph_index = KnowledgeGraphIndex.from_documents(
    documents,
    max_triplets_per_chunk=2,
    include_embeddings=True,
)

# Stop the timer
end_time = time.time()

# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Index creation time: {elapsed_time:.4f} seconds")

Index creation time: 1004.0605 seconds


In [25]:
print(type(graph_index))

<class 'llama_index.core.indices.knowledge_graph.base.KnowledgeGraphIndex'>


In [26]:
#similarity_top_k
k=3
#temperature
temp=0.1
#num_output
mt=1024
graph_query_engine = graph_index.as_query_engine(similarity_top_k=k, temperature=temp, num_output=mt)

### Displaying the graph

In [27]:
## create graph
from pyvis.network import Network

g = graph_index.get_networkx_graph()
net = Network(notebook=True, cdn_resources="in_line", directed=True)
net.from_nx(g)

# Set node and edge properties: colors and sizes
for node in net.nodes:
    node['color'] = 'lightgray'
    node['size'] = 10

for edge in net.edges:
    edge['color'] = 'black'
    edge['width'] = 1

In [29]:
fgraph = "Knowledge_graph_" + graph_name + ".html"

# Workaround for Windows UTF-8 encoding issue with pyvis
# pyvis uses default encoding which is cp950 on Windows
# We need to manually write with UTF-8 encoding
try:
    # Try normal write first
    net.write_html(fgraph)
except UnicodeEncodeError:
    # If encoding error, manually write with UTF-8
    print("UTF-8 encoding issue detected, using workaround...")
    html_content = net.generate_html()
    with open(fgraph, 'w', encoding='utf-8') as f:
        f.write(html_content)
    print(f"✓ Graph saved to {fgraph} (UTF-8 encoding)")
else:
    print(f"✓ Graph saved to {fgraph}")

print(fgraph)

UTF-8 encoding issue detected, using workaround...
✓ Graph saved to Knowledge_graph_Marketing.html (UTF-8 encoding)
Knowledge_graph_Marketing.html


In [47]:
from IPython.display import IFrame
import os

# Display using IFrame (more reliable for large interactive HTML)
print(f"✓ Knowledge graph HTML file created: {fgraph}")
print(f"  File size: {os.path.getsize(fgraph):,} bytes")
print()
print("To view the interactive knowledge graph, choose one of these options:")
print()
print("Option 1 (Best): Open the HTML file directly in your browser:")
print(f"  File path: {os.path.abspath(fgraph)}")
print()
print("Option 2: Use IFrame in notebook (may have limitations):")
print("  Run: IFrame(src=fgraph, width=1000, height=600)")
print()
print("Option 3: Display inline (may not work for large files):")
print("  Run: display(HTML(open(fgraph, 'r', encoding='utf-8').read()))")
print()

# Try IFrame display (more reliable than inline HTML for large files)
try:
    display(IFrame(src=fgraph, width=1000, height=600))
    print("✓ Graph displayed via IFrame")
except Exception as e:
    print(f"⚠ IFrame display failed: {e}")
    print("  Please open the HTML file directly in your browser instead.")

✓ Knowledge graph HTML file created: Knowledge_graph_Marketing.html
  File size: 730,798 bytes

To view the interactive knowledge graph, choose one of these options:

Option 1 (Best): Open the HTML file directly in your browser:
  File path: c:\Users\user\Desktop\RAG-Driven-Generative-AI\Chapter07\Knowledge_graph_Marketing.html

Option 2: Use IFrame in notebook (may have limitations):
  Run: IFrame(src=fgraph, width=1000, height=600)

Option 3: Display inline (may not work for large files):
  Run: display(HTML(open(fgraph, 'r', encoding='utf-8').read()))



✓ Graph displayed via IFrame


## Interacting with the Knowledge graph index

### User input and RAG functions

In [33]:
import time
import textwrap

def execute_query(user_input, k=3, temp=0.1, mt=1024):

    # Start the timer
    start_time = time.time()

    # Execute the query with additional parameters
    response = graph_query_engine.query(user_input)

    # Stop the timer
    end_time = time.time()

    # Calculate and print the execution time
    elapsed_time = end_time - start_time
    print(f"Query execution time: {elapsed_time:.4f} seconds")

    # Print the response, wrapped to 100 characters per line
    print(textwrap.fill(str(response), 100))
    return response

In [34]:
user_query="What is the primary goal of marketing for the consumer market?"

In [35]:
import time
import textwrap
import sys
import io
# Start the timer
start_time = time.time()
# Capture the output
old_stdout = sys.stdout
new_stdout = io.StringIO()
sys.stdout = new_stdout
response = execute_query(user_query)
# Restore stdout
sys.stdout = old_stdout
# Stop the timer
end_time = time.time()
# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Query execution time: {elapsed_time:.4f} seconds")

print(textwrap.fill(str(response), 100))

Query execution time: 6.2953 seconds
The primary goal of marketing for the consumer market is to understand and influence consumer
behavior in order to create brand awareness, maintain brand loyalty, and ultimately drive consumer
purchasing decisions.


## Installing the similarity score packages and defining the functions

Install the package(s) that fit your project.

In [42]:
# Optional: HuggingFace token (if needed for model downloads)
import os

load_dotenv(override=True)

HF_TOKEN = os.getenv("HF_TOKEN")
if HF_TOKEN:
    print("✓ HuggingFace token configured")
else:
    print("ℹ HuggingFace token not found (may not be needed)")

✓ HuggingFace token configured


In [43]:
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

def calculate_cosine_similarity_with_embeddings(text1, text2):
    embeddings1 = model.encode(text1)
    embeddings2 = model.encode(text2)
    similarity = cosine_similarity([embeddings1], [embeddings2])
    return similarity[0][0]

In [44]:
import time
import textwrap
import sys
import io

# Re-ranking

In [48]:
user_query="Which experts are often associated with marketing theory?"
# Start the timer
start_time = time.time()
# Capture the output
old_stdout = sys.stdout
new_stdout = io.StringIO()
sys.stdout = new_stdout
response = execute_query(user_query)
# Restore stdout
sys.stdout = old_stdout
# Stop the timer
end_time = time.time()
# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Query execution time: {elapsed_time:.4f} seconds")

print(textwrap.fill(str(response), 100))

Query execution time: 6.5732 seconds
Philip Kotler and Christian Grönroos are often associated with marketing theory.


In [49]:
# prompt: read the code above and write a python code to find out what type of object "response" is
type(response)

llama_index.core.base.response.schema.Response

In [50]:
text2=user_query # User query scenario
#text2=user_query # human feedbak

In [51]:
# Assuming 'response' is the object containing the source_nodes
best_rank=""
best_score=0
best_text=""
for idx, node_with_score in enumerate(response.source_nodes):
    node = node_with_score.node
    print(f"Node {idx + 1}:")
    print(f"Score: {node_with_score.score}")
    print(f"ID to rank: {node.id_}")
    print("Relationships:")
    for relationship, info in node.relationships.items():
        print(f"  Relationship: {relationship}")
        print(f"    Node ID: {info.node_id}")
        print(f"    Node Type: {info.node_type}")
        print(f"    Metadata: {info.metadata}")
        print(f"    Hash: {info.hash}")
    #print(f"Text to rank: {node.text}")
    print(textwrap.fill(str(node.text), 100))
    print(f"Mimetype: {node.mimetype}")
    print(f"Start Char Index: {node.start_char_idx}")
    print(f"End Char Index: {node.end_char_idx}")
    print(f"Text Template: {node.text_template}")
    print(f"Metadata Template: {node.metadata_template}")
    print(f"Metadata Separator: {node.metadata_seperator}")
    text1=node.text
    #text2=user_query
    similarity_score3=calculate_cosine_similarity_with_embeddings(text1, text2)
    print(f"Cosine Similarity Score with sentence transformer: {similarity_score3:.3f}")
    if similarity_score3>best_score:
      best_score=similarity_score3
      best_rank=idx + 1
      best_text=node.text
      print(f"Best Rank: {best_rank}")
      print(f"Best Score: {best_score}")
      print(f"Best Text: {best_text}")
    print("\n" + "="*40 + "\n")

print(f"Best Rank: {best_rank}")
print(f"Best Score: {best_score}")
#print(f"Best Text: {best_text}")
print(textwrap.fill(str(best_text), 100))

Node 1:
Score: 1000.0
ID to rank: 5ff94142-4a9e-4b10-97cf-73b915ebac61
Relationships:
  Relationship: NodeRelationship.SOURCE
    Node ID: 81b5286c-d7fd-4209-80b2-ee86b0fdea6a
    Node Type: 4
    Metadata: {}
    Hash: 142b0b360e7c2f67b301c10b56d7e00b6786a7946b4f481a620c787c2dba43bb
26 No. 3 2009 pp. 175184.  Campbell Colin L. 20150603. Marketing in Transition Scarcity Globalism
Sustainability Proceedings of the 2009 World Marketing Congress . Springer. ISBN 9783319186870 .
Pride W. M. Ferrell O. C. Lukas B. A. Schembri S. Niininen O. and Casidy E. Marketing Principles .
3rd AsiaPacific ed. Cengage 2018 p. 296.  a b c d e f g h i j Kotler Philip 2009. Principles of
marketing . Pearson Education Australia. ISBN 9781442500419 .  Compare Franzen Giep Moriarty Sandra
E. 20150212 . 1 The Brand as a System . The Science and Art of Branding . London Routledge published
2015. p. 19. ISBN 9781317454670 . Retrieved 20160816 . This deeper meaning the core values character
or essence of a brand i

# Examples for metrics

In [52]:
import numpy as np
import sys
# create an empty array score human feedback scores:
rscores =[]
# create an empty score for similarity function scores
scores=[]

## 1

In [53]:
user_query="Which experts are often associated with marketing theory?"
# Start the timer
start_time = time.time()
# Capture the output
old_stdout = sys.stdout
new_stdout = io.StringIO()
sys.stdout = new_stdout
response = execute_query(user_query)
# Restore stdout
sys.stdout = old_stdout
# Stop the timer
end_time = time.time()
# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Query execution time: {elapsed_time:.4f} seconds")
print(textwrap.fill(str(response), 100))

Query execution time: 2.3881 seconds
Philip Kotler and Christian Grönroos are often associated with marketing theory.


In [54]:
text1=str(response)
text2=user_query
similarity_score3=calculate_cosine_similarity_with_embeddings(text1, text2)
print(f"Cosine Similarity Score with sentence transformer: {similarity_score3:.3f}")
scores.append(similarity_score3)
human_feedback=0.75
rscores.append(human_feedback)

Cosine Similarity Score with sentence transformer: 0.685


## 2

In [55]:
user_query="How does marketing boost sales?"
# Start the timer
start_time = time.time()
# Capture the output
old_stdout = sys.stdout
new_stdout = io.StringIO()
sys.stdout = new_stdout
response = execute_query(user_query)
# Restore stdout
sys.stdout = old_stdout
# Stop the timer
end_time = time.time()
# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Query execution time: {elapsed_time:.4f} seconds")
print(textwrap.fill(str(response), 100))

Query execution time: 3.1054 seconds
Marketing boosts sales by helping to increase brand awareness, generate leads, and create demand for
products or services. By effectively reaching out to potential customers through various channels
and personalized communications, marketing can stimulate interest and drive purchasing decisions.
Additionally, marketing strategies such as promotions, advertising, and targeted campaigns can
influence consumer behavior, ultimately leading to increased sales and revenue for businesses.


In [56]:
text1=str(response)
text2=user_query
similarity_score3=calculate_cosine_similarity_with_embeddings(text1, text2)
print(f"Cosine Similarity Score with sentence transformer: {similarity_score3:.3f}")
scores.append(similarity_score3)
human_feedback=0.5
rscores.append(human_feedback)

Cosine Similarity Score with sentence transformer: 0.840


## 3

In [57]:
user_query="What is the difference between B2B and B2C?"
# Start the timer
start_time = time.time()
# Capture the output
old_stdout = sys.stdout
new_stdout = io.StringIO()
sys.stdout = new_stdout
response = execute_query(user_query)
# Restore stdout
sys.stdout = old_stdout
# Stop the timer
end_time = time.time()
# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Query execution time: {elapsed_time:.4f} seconds")
print(textwrap.fill(str(response), 100))

Query execution time: 3.0975 seconds
B2B marketing involves selling products or services to other companies or organizations, while B2C
marketing focuses on selling directly to individual customers.


In [58]:
text1=str(response)
text2=user_query
similarity_score3=calculate_cosine_similarity_with_embeddings(text1, text2)
print(f"Cosine Similarity Score with sentence transformer: {similarity_score3:.3f}")
scores.append(similarity_score3)
human_feedback=0.8
rscores.append(human_feedback)

Cosine Similarity Score with sentence transformer: 0.743


## 4

In [59]:
user_query="What are the 4Ps? What do they stand for?"
# Start the timer
start_time = time.time()
# Capture the output
old_stdout = sys.stdout
new_stdout = io.StringIO()
sys.stdout = new_stdout
response = execute_query(user_query)
# Restore stdout
sys.stdout = old_stdout
# Stop the timer
end_time = time.time()
# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Query execution time: {elapsed_time:.4f} seconds")
print(textwrap.fill(str(response), 100))

Query execution time: 2.5007 seconds
The 4Ps refer to Product, Price, Place, and Promotion. These are the key elements of a marketing mix
strategy used by businesses to effectively promote and sell their products or services.


In [60]:
text1=str(response)
text2=user_query
similarity_score3=calculate_cosine_similarity_with_embeddings(text1, text2)
print(f"Cosine Similarity Score with sentence transformer: {similarity_score3:.3f}")
scores.append(similarity_score3)
human_feedback=0.9
rscores.append(human_feedback)

Cosine Similarity Score with sentence transformer: 0.702


## 5

In [61]:
user_query="What are the 4Cs? What do they stand for?"
# Start the timer
start_time = time.time()
# Capture the output
old_stdout = sys.stdout
new_stdout = io.StringIO()
sys.stdout = new_stdout
response = execute_query(user_query)
# Restore stdout
sys.stdout = old_stdout
# Stop the timer
end_time = time.time()
# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Query execution time: {elapsed_time:.4f} seconds")
print(textwrap.fill(str(response), 100))

Query execution time: 2.2705 seconds
The 4Cs stand for the following: 1. Consumer perspectives 2. Corporate social responsibility
approaches 3. Corporate social responsibility definition 4. Corporate social responsibility motives


In [62]:
text1=str(response)
text2=user_query
similarity_score3=calculate_cosine_similarity_with_embeddings(text1, text2)
print(f"Cosine Similarity Score with sentence transformer: {similarity_score3:.3f}")
scores.append(similarity_score3)
human_feedback=0.65
rscores.append(human_feedback)

Cosine Similarity Score with sentence transformer: 0.646


## 6

In [63]:
user_query="What is the difference between the 4Ps and 4Cs?"
# Start the timer
start_time = time.time()
# Capture the output
old_stdout = sys.stdout
new_stdout = io.StringIO()
sys.stdout = new_stdout
response = execute_query(user_query)
# Restore stdout
sys.stdout = old_stdout
# Stop the timer
end_time = time.time()
# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Query execution time: {elapsed_time:.4f} seconds")
print(textwrap.fill(str(response), 100))

Query execution time: 2.8769 seconds
The difference between the 4Ps and 4Cs lies in their focus and approach to marketing. The 4Ps, also
known as the marketing mix, consist of Product, Price, Place, and Promotion, focusing on the
company's perspective in creating and promoting products. On the other hand, the 4Cs consist of
Consumer, Cost, Convenience, and Communication, shifting the focus towards a customer-centric
approach, emphasizing the needs and preferences of consumers in the marketing strategy.


In [64]:
text1=str(response)
text2=user_query
similarity_score3=calculate_cosine_similarity_with_embeddings(text1, text2)
print(f"Cosine Similarity Score with sentence transformer: {similarity_score3:.3f}")
scores.append(similarity_score3)
human_feedback=0.8
rscores.append(human_feedback)

Cosine Similarity Score with sentence transformer: 0.726


## 7

In [65]:
user_query="What commodity programs does the Agricultural Marketing Service (AMS) maintain?"
# Start the timer
start_time = time.time()
# Capture the output
old_stdout = sys.stdout
new_stdout = io.StringIO()
sys.stdout = new_stdout
response = execute_query(user_query)
# Restore stdout
sys.stdout = old_stdout
# Stop the timer
end_time = time.time()
# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Query execution time: {elapsed_time:.4f} seconds")
print(textwrap.fill(str(response), 100))

Query execution time: 2.9271 seconds
The Agricultural Marketing Service (AMS) maintains programs in five commodity areas: cotton and
tobacco, dairy, fruit and vegetable, livestock and seed, and poultry.


In [66]:
text1=str(response)
text2=user_query
similarity_score3=calculate_cosine_similarity_with_embeddings(text1, text2)
print(f"Cosine Similarity Score with sentence transformer: {similarity_score3:.3f}")
scores.append(similarity_score3)
human_feedback=0.9
rscores.append(human_feedback)

Cosine Similarity Score with sentence transformer: 0.904


## 8

In [67]:
user_query="What kind of marketing is Got Milk?"
# Start the timer
start_time = time.time()
# Capture the output
old_stdout = sys.stdout
new_stdout = io.StringIO()
sys.stdout = new_stdout
response = execute_query(user_query)
# Restore stdout
sys.stdout = old_stdout
# Stop the timer
end_time = time.time()
# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Query execution time: {elapsed_time:.4f} seconds")
print(textwrap.fill(str(response), 100))

Query execution time: 3.4353 seconds
Drip marketing is related to direct marketing.


In [68]:
text1=str(response)
text2=user_query
similarity_score3=calculate_cosine_similarity_with_embeddings(text1, text2)
print(f"Cosine Similarity Score with sentence transformer: {similarity_score3:.3f}")
scores.append(similarity_score3)
human_feedback=0.2
rscores.append(human_feedback)

Cosine Similarity Score with sentence transformer: 0.419


## 9

In [69]:
user_query="What an is industry trade group, business association, sector association or industry body?"
# Start the timer
start_time = time.time()
# Capture the output
old_stdout = sys.stdout
new_stdout = io.StringIO()
sys.stdout = new_stdout
response = execute_query(user_query)
# Restore stdout
sys.stdout = old_stdout
# Stop the timer
end_time = time.time()
# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Query execution time: {elapsed_time:.4f} seconds")
print(textwrap.fill(str(response), 100))

Query execution time: 2.9076 seconds
An industry trade group, business association, sector association, or industry body is a
professional association.


In [70]:
text1=str(response)
text2=user_query
similarity_score3=calculate_cosine_similarity_with_embeddings(text1, text2)
print(f"Cosine Similarity Score with sentence transformer: {similarity_score3:.3f}")
scores.append(similarity_score3)
human_feedback=0.2
rscores.append(human_feedback)

Cosine Similarity Score with sentence transformer: 0.913


## 10

In [71]:
user_query="How many members are there in the American Marketing Association (AMA), theassociation for marketing professionals?"
# Start the timer
start_time = time.time()
# Capture the output
old_stdout = sys.stdout
new_stdout = io.StringIO()
sys.stdout = new_stdout
response = execute_query(user_query)
# Restore stdout
sys.stdout = old_stdout
# Stop the timer
end_time = time.time()
# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Query execution time: {elapsed_time:.4f} seconds")
print(textwrap.fill(str(response), 100))

Query execution time: 2.2992 seconds
The American Marketing Association (AMA) has 30,000 members as of 2012.


In [72]:
text1=str(response)
text2=user_query
similarity_score3=calculate_cosine_similarity_with_embeddings(text1, text2)
print(f"Cosine Similarity Score with sentence transformer: {similarity_score3:.3f}")
scores.append(similarity_score3)
human_feedback=0.9
rscores.append(human_feedback)

Cosine Similarity Score with sentence transformer: 0.877


## Metrics calculation and display

In [73]:
print(len(scores), scores)
print(len(rscores), rscores)

10 [np.float32(0.68520683), np.float32(0.8399196), np.float32(0.74265814), np.float32(0.70164996), np.float32(0.6460962), np.float32(0.7261259), np.float32(0.9036964), np.float32(0.41940588), np.float32(0.9133468), np.float32(0.87746775)]
10 [0.75, 0.5, 0.8, 0.9, 0.65, 0.8, 0.9, 0.2, 0.2, 0.9]


Mean, Median, Standard Deviation, Variance, Minimum, Maximum, Range, Percentile (Q1), 75th Percentile and Interquartile Range (IQR)

In [74]:
# Calculating metrics
mean_score = np.mean(scores)
median_score = np.median(scores)
std_deviation = np.std(scores)
variance = np.var(scores)
min_score = np.min(scores)
max_score = np.max(scores)
range_score = max_score - min_score
percentile_25 = np.percentile(scores, 25)
percentile_75 = np.percentile(scores, 75)
iqr = percentile_75 - percentile_25

# Printing the metrics with 2 decimals
print(f"Mean: {mean_score:.2f}")
print(f"Median: {median_score:.2f}")
print(f"Standard Deviation: {std_deviation:.2f}")
print(f"Variance: {variance:.2f}")
print(f"Minimum: {min_score:.2f}")
print(f"Maximum: {max_score:.2f}")
print(f"Range: {range_score:.2f}")
print(f"25th Percentile (Q1): {percentile_25:.2f}")
print(f"75th Percentile (Q3): {percentile_75:.2f}")
print(f"Interquartile Range (IQR): {iqr:.2f}")

Mean: 0.75
Median: 0.73
Standard Deviation: 0.14
Variance: 0.02
Minimum: 0.42
Maximum: 0.91
Range: 0.49
25th Percentile (Q1): 0.69
75th Percentile (Q3): 0.87
Interquartile Range (IQR): 0.18
