## Steps to get started
### 1) Install GraphRAG
```bash
pip install -r requirements.txt
```

### 2) Create Folders
In the root folder "GraphRAG", create a folder "rag" and, inside it, a folder "input":
```
GraphRAG/
└── rag/
    └── input/
```

Command to do this (`-p` also creates the parent folder "rag"):
```bash
mkdir -p ./rag/input
```

### 3) Download data (A Christmas Carol)
On the command line, run:
```bash
curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt -o ./rag/input/book.txt
```

### 4) Set up your workspace
On the command line, run:
```bash
graphrag init --root ./rag
```

### 5) Update settings.yaml
The previous step created a starter settings.yaml file in ./rag. Replace it with the settings.yaml file provided in the root folder.

### 6) Set up your environment variables
I'll provide the values separately; put them in the .env file.
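
As a rough sketch of how later Python code could pick those values up with `os.getenv`, the snippet below parses `KEY=VALUE` lines from a `.env` file into `os.environ` using only the standard library. The helper name `load_env_file` and the path `./rag/.env` are assumptions for illustration; the `python-dotenv` package is a common alternative.

```python
import os
from pathlib import Path


def load_env_file(path: str) -> dict:
    """Parse KEY=VALUE lines from a .env-style file (hypothetical helper)."""
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded  # nothing to load; leave the environment untouched
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        os.environ.setdefault(key, value)  # do not override existing variables
        loaded[key] = value
    return loaded


# Assumed location: the workspace folder used in the steps above
load_env_file("./rag/.env")
```

Because `setdefault` is used, a variable that is already set in the shell takes precedence over the file.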

### 7) Run the indexer
On the command line, run:
```bash
graphrag index --root ./rag
```

In [12]:
import os
import asyncio
from openai import AsyncAzureOpenAI
from typing import List, Union
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity 

In [None]:
# Azure OpenAI configuration
AZURE_ENDPOINT = "https://transformationacademyaoa.openai.azure.com/"
API_VERSION = "2024-12-01-preview"
DEPLOYMENT_NAME = "text-embedding-ada-002"
GRAPHRAG_API_KEY = os.getenv("GRAPHRAG_API_KEY", "")  # read from the environment instead of hard-coding the key

# Initialize the client
client = AsyncAzureOpenAI(
    api_key=GRAPHRAG_API_KEY,
    api_version=API_VERSION,
    azure_endpoint=AZURE_ENDPOINT
)

## When RAG isn't enough
- Complex multi-step reasoning
- Content in the middle of a long context is underweighted ("lost in the middle")
- Answering meta-questions about the data as a whole


In [None]:
!graphrag query --root ./rag --method global --query "What are the top themes in this story?"

### How to Create a Retrieval-Augmented Generation (RAG) System
- Chunking
- Metadata enrichment (titles, summaries, keywords)
- Embedding
- Indexing (Vector, Hybrid, Hierarchical)
![Vector Similarity](./assets/vectorDbSimilarity.png)
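
The indexing steps listed above can be sketched roughly as follows. This is a minimal illustration, not how GraphRAG or any vector database does it: `chunk_text`, `toy_embedding`, and `build_index` are hypothetical helpers, the chunk sizes are arbitrary, and the hashed bag-of-words vector is only a stand-in for a real embedding model such as the Azure OpenAI deployment used in this notebook.

```python
import zlib
from typing import Dict, List

import numpy as np


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks that overlap slightly."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early enough that the last chunk is not just a repeat of the overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def toy_embedding(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalised."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def build_index(text: str, title: str) -> Dict[str, object]:
    """Chunk, attach simple metadata, embed, and stack the vectors into a matrix."""
    chunks = chunk_text(text)
    records = [{"id": i, "title": title, "text": c} for i, c in enumerate(chunks)]
    vectors = np.vstack([toy_embedding(c) for c in chunks])
    return {"records": records, "vectors": vectors}


index = build_index("Marley was dead, to begin with. " * 20, title="A Christmas Carol")
print(len(index["records"]), index["vectors"].shape)
```

Keeping the vectors in one matrix means a query can be scored against every chunk with a single matrix-vector product.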

In [15]:
async def get_text_embedding(text: str) -> List[float]:
    """
    Get embeddings for a given text string.
    
    Args:
        text: The text to get embeddings for
        
    Returns:
        List of floats representing the embedding vector
    """
    try:
        response = await client.embeddings.create(
            input=text,
            model=DEPLOYMENT_NAME
        )
        return response.data[0].embedding
    except Exception as e:
        print(f"Error getting embeddings: {e}")
        raise

In [16]:
def cosine_similarity_score(embedding1: List[float], embedding2: List[float]) -> float:
    """
    Calculate cosine similarity between two embedding vectors using scikit-learn.

    Args:
        embedding1: First embedding vector (list of floats)
        embedding2: Second embedding vector (list of floats)

    Returns:
        float: Cosine similarity score between -1 and 1
               1 = vectors point in the same direction
               0 = orthogonal vectors
               -1 = opposite vectors
    """
    # Convert to numpy arrays and reshape for sklearn
    vec1 = np.array(embedding1).reshape(1, -1)
    vec2 = np.array(embedding2).reshape(1, -1)

    # Calculate cosine similarity using sklearn
    similarity = cosine_similarity(vec1, vec2)[0][0]

    return float(similarity)
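
To check the meaning of the score without calling the API, the sketch below computes the same quantity, the dot product divided by the product of the vector norms, with NumPy only and applies it to hand-written toy vectors. The function name `cosine_similarity_np` is an assumption for illustration and is not part of the notebook.

```python
import numpy as np


def cosine_similarity_np(a, b) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


print(cosine_similarity_np([1, 0], [1, 0]))   # same direction -> 1.0
print(cosine_similarity_np([1, 0], [0, 1]))   # orthogonal -> 0.0
print(cosine_similarity_np([1, 0], [-1, 0]))  # opposite direction -> -1.0
```

The score depends only on direction, so scaling either vector leaves it unchanged.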

In [None]:
# Try it out
sample_text = "This is a sample text for generating embeddings using Azure OpenAI."

# Get embeddings
embedding = await get_text_embedding(sample_text)

print(f"Text: {sample_text}")
print(f"Embedding dimension: {len(embedding)}")
print(f"First 10 values: {embedding[:10]}")
print(f"Data type: {type(embedding[0])}")

In [None]:
similar_text_1 = "I am a software engineer with a passion for AI."
similar_text_2 = "I love coding and developing AI applications."
opposite_text = "I dislike programming and have no interest in AI."

# Get embeddings for the sample texts
embedding1 = await get_text_embedding(similar_text_1)
embedding2 = await get_text_embedding(similar_text_2)
opposite_embedding3 = await get_text_embedding(opposite_text)
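# Calculate cosine similarity scores.
# Note: the "opposite" sentence is still about the same topic, so its score is
# usually lower than the similar pair's rather than negative.
similarity_score = cosine_similarity_score(embedding1, embedding2)
opposite_similarity_score = cosine_similarity_score(embedding1, opposite_embedding3)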

print(f"Cosine similarity between text1 and text2: {similarity_score:.4f}")
print(f"Cosine similarity between text1 and opposite_text3: {opposite_similarity_score:.4f}")

### How Does RAG Work
- Prompt
- Vector Embedding
- Similarity (Cosine, Hybrid)
- Prompt Augmentation
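
The query-time flow above can be sketched as follows. Everything here is illustrative: the 3-dimensional vectors are hand-written stand-ins for real embeddings, and `top_k_chunks` and `augment_prompt` are hypothetical helpers, not part of any library.

```python
from typing import List, Tuple

import numpy as np

# Hand-written 3-d vectors stand in for real chunk embeddings (illustrative values only)
CHUNKS = [
    ("Scrooge is visited by the ghost of Jacob Marley.", np.array([0.9, 0.1, 0.0])),
    ("Bob Cratchit works as Scrooge's clerk.", np.array([0.1, 0.9, 0.0])),
    ("Tiny Tim is Bob Cratchit's son.", np.array([0.0, 0.8, 0.6])),
]


def top_k_chunks(query_vec: np.ndarray, k: int = 2) -> List[Tuple[float, str]]:
    """Rank chunks by cosine similarity to the query vector and keep the best k."""
    scored = []
    for text, vec in CHUNKS:
        score = float(np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
        scored.append((score, text))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def augment_prompt(question: str, retrieved: List[Tuple[float, str]]) -> str:
    """Prepend the retrieved chunks to the question as context for the model."""
    context = "\n".join(f"- {text}" for _, text in retrieved)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


query_vec = np.array([0.0, 1.0, 0.2])  # pretend embedding of the question
retrieved = top_k_chunks(query_vec)
prompt = augment_prompt("Who is Bob Cratchit?", retrieved)
print(prompt)
```

In a real system the query vector would come from the same embedding model that produced the chunk vectors.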

### How to Create a GraphRAG System
- Chunking
- Entity Relationship Extraction
- Knowledge Graph Construction: Nodes = Entities, Edges = Relationships
- Embedding
- Create Graph Hierarchy (Leiden Algorithm) - summaries for multiple levels of abstraction
- Indexing
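
A very small sketch of the graph-construction steps above. The triples are hypothetical examples of what an extraction step might emit, and the grouping step uses plain connected components as a simple stand-in for community detection; it is not the Leiden algorithm, which finds denser sub-communities at several levels.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical (entity, relationship, entity) triples from an extraction step
TRIPLES: List[Tuple[str, str, str]] = [
    ("Scrooge", "employs", "Bob Cratchit"),
    ("Bob Cratchit", "father_of", "Tiny Tim"),
    ("Scrooge", "former_partner_of", "Jacob Marley"),
    ("Fezziwig", "hosts", "Christmas ball"),
]


def build_graph(triples: List[Tuple[str, str, str]]) -> Dict[str, Set[str]]:
    """Nodes are entities; each relationship becomes an undirected edge."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for source, _relation, target in triples:
        graph[source].add(target)
        graph[target].add(source)
    return dict(graph)


def connected_components(graph: Dict[str, Set[str]]) -> List[Set[str]]:
    """Group nodes that are reachable from one another (stand-in for Leiden)."""
    seen: Set[str] = set()
    components: List[Set[str]] = []
    for start in graph:
        if start in seen:
            continue
        component: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


communities = connected_components(build_graph(TRIPLES))
for community in communities:
    print(sorted(community))
```

Each group would then be summarised, and those summaries are what a global query searches over.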

### How Does GraphRAG Work
- Prompt
- Vector Embedding
- Community Selection using Similarity (Cosine, Hybrid)
- Recursive Traversal
- Map: partial answers are generated from each selected community
- Reduce: partial answers are ranked and combined into a final response
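
The map and reduce steps above can be sketched as follows. This is only an illustration of the pattern: the community summaries are hypothetical, and `map_step` uses word overlap as a stand-in for the LLM call that would normally produce and score each partial answer.

```python
from typing import List, Tuple

# Hypothetical community summaries, as the indexing stage might produce them
COMMUNITY_SUMMARIES = [
    "Scrooge changes from a miser into a generous man after the ghosts visit.",
    "The Cratchit family stays loving and grateful despite poverty.",
    "Fezziwig throws a cheerful Christmas ball for his apprentices.",
]


def map_step(question: str, summary: str) -> Tuple[int, str]:
    """Stand-in for an LLM call: score a summary by word overlap with the question."""
    question_words = set(question.lower().split())
    summary_words = set(summary.lower().rstrip(".").split())
    return len(question_words & summary_words), summary


def reduce_step(partials: List[Tuple[int, str]], top_n: int = 2) -> str:
    """Rank partial answers by score, drop zero-score ones, and join the best."""
    ranked = sorted(partials, key=lambda pair: pair[0], reverse=True)
    kept = [text for score, text in ranked[:top_n] if score > 0]
    return " ".join(kept)


question = "how does scrooge change after the ghosts visit"
partials = [map_step(question, summary) for summary in COMMUNITY_SUMMARIES]
answer = reduce_step(partials)
print(answer)
```

In an actual pipeline the reduce step would pass the top-ranked partial answers back to the model to write the final response, rather than concatenating them.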

![Graph Hierarchy](./assets/graphRag.png)