<a href="https://colab.research.google.com/github/isamdr86/towards-ai/blob/main/notebooks/GraphRAG_Implementation_ir.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basic GraphRAG Implementation

## Extracting information from the PDF

In [1]:
# Install Necessary dependencies
!pip install numpy==1.24.4 scipy==1.12.0 google-generativeai==0.5.4 openai==1.30.1 tiktoken==0.7.0



In [6]:
import os
import time
import google.generativeai as genai

from google.colab import userdata

genai.configure(api_key=userdata.get('google_api_key'))

def upload_to_gemini(path, mime_type=None):

  file = genai.upload_file(path, mime_type=mime_type)
  print(f"Uploaded file '{file.display_name}' as: {file.uri}")
  return file

# Research paper
files = upload_to_gemini("./Lora.txt", mime_type="application/pdf")


def wait_for_files_active(files):
  print("Waiting for file processing...")
  for name in (file.name for file in files):
    file = genai.get_file(name)
    while file.state.name == "PROCESSING":
      print(".", end="", flush=True)
      time.sleep(10)
      file = genai.get_file(name)
    if file.state.name != "ACTIVE":
      raise Exception(f"File {file.name} failed to process")
    return file
  print("...all files ready")
  print()

wait_for_files_active([files])

Uploaded file 'Lora.txt' as: https://generativelanguage.googleapis.com/v1beta/files/8gpe5gl5a1q8
Waiting for file processing...


<google.generativeai.types.file_types.File at 0x7f1530e06550>

In [7]:
# Configuration
generation_config = {
  "temperature": 0,
  "top_p": 0.95,
  "top_k": 64,
  "max_output_tokens": 8192,
  "response_mime_type": "text/plain"
}

# Model Initialization
model = genai.GenerativeModel(model_name="gemini-1.5-pro",generation_config=generation_config,)

prompt = """
            Task:
              Thoroughly review and analyse the research paper. Extract all the infomation from introduction to references.
              Organize the information in extracted order. Extract all detailed information from a research paper,
              including every figure, table, diagram, architecture, equation, and any related content. Ensure to include
              captions, legends, footnotes, and any associated descriptions or explanations provided in the text. Additionally,
              capture any supplementary materials, such as appendices or additional figures and tables, maintaining the original
              format and organization. The extraction should preserve every detail and context without summarizing or omitting
              any information. Please ensure that all content is presented in the same order as it appears in the original paper.
              Output in full without any laziness or any summarisation. You have ample output context window to do write the full
              pdf without cutting out any detail - if you run out of output tokens i will just ask you to continue.
              Remember - You are NOT a dumb summarisation machine. Remember - it is 100% essential that you do not cut down or summarise the text
              in the pdf. But DO NOT summarise paragraphs into fewer sentences.Here you are Extracting text and other information
              not cutting out content, not summarising.
          """


result = model.generate_content(
    contents=[files,prompt,],
    request_options={"timeout": 1000},
)
print(result.text)



BadRequest: 400 POST https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-pro:generateContent?%24alt=json%3Benum-encoding%3Dint: The document has no pages.

In [None]:
response= result.text

## GraphRAG implementation

In [None]:
!pip install graphrag

Collecting graphrag
  Downloading graphrag-0.3.4-py3-none-any.whl.metadata (6.2 kB)
Collecting aiofiles<25.0.0,>=24.1.0 (from graphrag)
  Downloading aiofiles-24.1.0-py3-none-any.whl.metadata (10 kB)
Collecting aiolimiter<2.0.0,>=1.1.0 (from graphrag)
  Downloading aiolimiter-1.1.0-py3-none-any.whl.metadata (4.5 kB)
Collecting azure-identity<2.0.0,>=1.17.1 (from graphrag)
  Downloading azure_identity-1.17.1-py3-none-any.whl.metadata (79 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/79.4 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m79.4/79.4 kB[0m [31m6.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting azure-search-documents<12.0.0,>=11.4.0 (from graphrag)
  Downloading azure_search_documents-11.5.1-py3-none-any.whl.metadata (23 kB)
Collecting azure-storage-blob<13.0.0,>=12.22.0 (from graphrag)
  Downloading azure_storage_blob-12.22.0-py3-none-any.whl.metadata (26 kB)
Collecting datashaper<0.0.50,>=0.

In [None]:
from google.colab import userdata
import os
os.environ['GRAPHRAG_API_KEY'] = "GRAPHRAG_API_KEY"

#os.environ['GRAPHRAG_API_KEY'] = userdata.get('openai_api_key')

os.makedirs("/content/ragtest/input", exist_ok=True)
with open("/content/ragtest/input/Lora.txt", "w") as file:
  file.write(str(response))

In [None]:
!python -m graphrag.index --init --root /content/ragtest

2024-09-13 11:24:33.694234: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-13 11:24:33.717408: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-13 11:24:33.724392: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[2KInitializing project at [35m/content/[0m[95mragtest[0m
⠋ GraphRAG Indexer 

In [None]:
# Indexing
!python -m graphrag.index --root /content/ragtest

2024-09-13 11:25:03.939860: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-13 11:25:03.962651: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-13 11:25:03.969923: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[2KLogging enabled at [35m/content/ragtest/output/20240913-112515/reports/[0m[95mindexing-engine.log[0m
[2K⠼ GraphRAG Indexer 
[2K[1A[2K⠼ GraphRAG Indexer 
├── Loading Input (text) - 1 files loaded (0 filtered) [90m━━━━━━━━━━━━━━━━━━━━━━━━[0m [35m100%[0m [36m0:00:00[0m [33m0:00:00[0m
[2K[1A[2K[1A[2K⠼ GraphRAG Indexer 
├── Loading Input (text

In [None]:
# Global Search
!python -m graphrag.query --root ./ragtest --method global " what this text document about?"

2024-09-13 11:26:57.430768: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-13 11:26:57.453830: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-13 11:26:57.460732: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


creating llm client with {'api_key': 'REDACTED,len=95', 'type': "openai_chat", 'model': 'gpt-4-turbo-preview', 'max_tokens': 4000, 'temperature': 0.0, 'top_p': 1.0, 'n': 1, 'request_timeout': 180.0, 'api_base': None, 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': Tru

In [None]:
# Local search
!python -m graphrag.query --root ./ragtest --method local " what this Low Rank Adaptation"

2024-09-13 11:27:44.160122: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-13 11:27:44.183310: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-13 11:27:44.190340: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered



INFO: Vector Store Args: {}
[0m[38;5;8m[[0m2024-09-13T11:27:55Z [0m[33mWARN [0m lance::dataset[0m[38;5;8m][0m No existing dataset at /content/ragtest/output/20240913-112515/artifacts/lancedb/entity_description_embeddings.lance, it will be created
creating llm client with {'api_key': 'REDACTED,len=95', 'type': "openai_chat", 'model': 'gpt-4-turbo-previ

In [None]:
# Entity Extractions
import pandas as pd
df = pd.read_parquet("/content/ragtest/output/20240913-112515/artifacts/create_final_entities.parquet")
df

Unnamed: 0,id,name,type,description,human_readable_id,graph_embedding,text_unit_ids,description_embedding
0,b45241d70f0e43fca764df95b2b81f77,LORA,ORGANIZATION,"LoRA, which stands for Low-Rank Adaptation, is...",0,,"[0786e6330b62ea635381cc5810468800, 2f011732b2e...","[0.015118744224309921, 0.0260329432785511, 0.0..."
1,4119fd06010c494caa07f439b333f4c5,EDWARD HU,PERSON,Co-author of the LORA paper and associated wit...,1,,[2f011732b2e235a5450e4748972a1988],"[0.00908816047012806, -0.07558003813028336, 0...."
2,d3835bf3dda84ead99deadbeac5d0d7d,YELONG SHEN,PERSON,Co-author of the LORA paper and associated wit...,2,,[2f011732b2e235a5450e4748972a1988],"[0.01909063570201397, -0.059403978288173676, 0..."
3,077d2820ae1845bcbb1803379a3d1eae,PHILLIP WALLIS,PERSON,Co-author of the LORA paper and associated wit...,3,,[2f011732b2e235a5450e4748972a1988],"[0.018361538648605347, -0.03276399150490761, 0..."
4,3671ea0dd4e84c1a9b02c5ab2c8f4bac,ZEYUAN ALLEN-ZHU,PERSON,Co-author of the LORA paper and associated wit...,4,,[2f011732b2e235a5450e4748972a1988],"[0.03301824629306793, -0.018647611141204834, 0..."
...,...,...,...,...,...,...,...,...
72,fa3c4204421c48609e52c8de2da4c654,ZHAO ET AL.,PERSON,Researchers who have worked on low-rank struct...,72,,[9f53dc2796d638764d5c3b249a875d25],"[-0.015633100643754005, -0.008787119761109352,..."
73,53af055f068244d0ac861b2e89376495,KHODAK ET AL.,PERSON,Contributors to the study of low-rank constrai...,73,,[9f53dc2796d638764d5c3b249a875d25],"[0.00132056325674057, 0.010695008561015129, 0...."
74,c03ab3ce8cb74ad2a03b94723bfab3c7,DENIL ET AL.,PERSON,Researchers who explored low-rank updates to n...,74,,[9f53dc2796d638764d5c3b249a875d25],"[0.01580997370183468, 0.016001755371689796, 0...."
75,ed6d2eee9d7b4f5db466b1f6404d31cc,ALLEN-ZHU ET AL.,PERSON,Researchers who provided theoretical insights ...,75,,[9f53dc2796d638764d5c3b249a875d25],"[-0.003528870176523924, 0.0012279514921829104,..."


In [None]:
df[df['type']=='ORGANIZATION']

Unnamed: 0,id,name,type,description,human_readable_id,graph_embedding,text_unit_ids,description_embedding
0,b45241d70f0e43fca764df95b2b81f77,LORA,ORGANIZATION,"LoRA, which stands for Low-Rank Adaptation, is...",0,,"[0786e6330b62ea635381cc5810468800, 2f011732b2e...","[0.015118744224309921, 0.0260329432785511, 0.0..."
9,27f9fbe6ad8c4a8b9acee0d3596ed57c,MICROSOFT CORPORATION,ORGANIZATION,The company where the authors of the LORA pape...,9,,[2f011732b2e235a5450e4748972a1988],"[0.02478708326816559, -0.03073018416762352, 0...."
10,e1fd0e904a53409aada44442f23a51cb,GPT-3,ORGANIZATION,"GPT-3, developed by OpenAI, stands as the larg...",10,,"[2f011732b2e235a5450e4748972a1988, 69778cd905c...","[0.0070150806568562984, 0.0026917648501694202,..."
12,96aad7cb4b7d40e9b7e13b94a67af206,TRANSFORMER ARCHITECTURE,ORGANIZATION,A widely used architecture in machine learning...,12,,[bf0d8b21a8e386ddb70c174b70e68360],"[-0.012984410859644413, -0.0008722275379113853..."
15,d91a266f766b4737a06b0fda588ba40b,ADAM,ORGANIZATION,Adam is an optimization algorithm widely used ...,15,,"[61a5ae4416183e8742bd8fae1427699b, bf0d8b21a8e...","[0.02075398899614811, 0.0014302580384537578, 0..."
18,4a67211867e5464ba45126315a122a8a,GPT,ORGANIZATION,A generic multi-task learner based on the Tran...,18,,[bf0d8b21a8e386ddb70c174b70e68360],"[-0.042168520390987396, -0.0011740220943465829..."
32,85c79fd84f5e4f918471c386852204c5,GPT-2,ORGANIZATION,GPT-2 is a large Transformer language model de...,32,,"[22fdb19ffed1bce728841827dcacf018, 61a5ae44161...","[-0.029486937448382378, 0.017327744513750076, ..."
33,eae4259b19a741ab9f9f6af18c4a0470,NVIDIA QUADRO RTX8000,ORGANIZATION,The hardware used to measure inference latency...,33,,[22fdb19ffed1bce728841827dcacf018],"[-0.012682707980275154, 0.012317562475800514, ..."
34,3138f39f2bcd43a69e0697cd3b05bc4d,LO RA,ORGANIZATION,LO RA is a method characterized by its simple ...,34,,"[22fdb19ffed1bce728841827dcacf018, 9f53dc2796d...","[0.017171747982501984, 0.006384861655533314, 0..."
38,b462b94ce47a4b8c8fffa33f7242acec,TRANSFORMER,ORGANIZATION,The Transformer is a sequence-to-sequence arch...,38,,"[61a5ae4416183e8742bd8fae1427699b, 9f53dc2796d...","[-0.0009461896843276918, -0.003365109674632549..."
