lsh3163/Spider
Spider

GraphRAG

To run GraphRAG, follow the installation instructions here. To run it in CLI mode with a customized configuration, you can install GraphRAG via pip:

pip install graphrag

Then create an input folder to store your customized data.

mkdir -p ./GraphRAG/test/input

Then place your text files inside the input folder you just created. You can add multiple files, or multiple folders each containing multiple files.
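For example, a minimal corpus can be set up like this (the file name and contents are just placeholders):

```shell
# Create the input folder (idempotent) and add a sample article.
mkdir -p ./GraphRAG/test/input
printf 'An example article about the Eiffel Tower.\n' > ./GraphRAG/test/input/example.txt
```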

Data Collection

We provide a script, collect_data.py, to source articles from Wikipedia given a list of prompts. Save the prompts in CSV format with the header "Prompt", then change the corresponding path in the script to collect data. The script leverages LangChain. You need to install the following packages before running the script:

pip install -qU langchain_community wikipedia
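For reference, the collection step might be sketched as follows. This is not the repo's collect_data.py; the function names, output paths, and the one-article-per-prompt default are illustrative assumptions, while `WikipediaLoader` is LangChain's real Wikipedia loader.

```python
# Hedged sketch of a collect_data.py-style workflow (names are assumptions):
# read prompts from a CSV with a "Prompt" header, then fetch Wikipedia
# articles for each prompt via LangChain's WikipediaLoader.
import csv


def read_prompts(csv_path):
    """Return the list of prompts from a CSV file with a 'Prompt' column."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row["Prompt"] for row in csv.DictReader(f)]


def collect_articles(prompts, out_dir="./GraphRAG/test/input", docs_per_prompt=1):
    """Fetch Wikipedia article(s) per prompt and save each as a text file."""
    # Imported lazily so read_prompts works without LangChain installed.
    from langchain_community.document_loaders import WikipediaLoader
    import os
    import re

    os.makedirs(out_dir, exist_ok=True)
    for prompt in prompts:
        docs = WikipediaLoader(query=prompt, load_max_docs=docs_per_prompt).load()
        for i, doc in enumerate(docs):
            # Derive a filesystem-safe name from the prompt.
            safe = re.sub(r"[^\w-]+", "_", prompt)[:50]
            path = os.path.join(out_dir, f"{safe}_{i}.txt")
            with open(path, "w", encoding="utf-8") as f:
                f.write(doc.page_content)
```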

Initialization

Run the following command to initialize a folder as GraphRAG's base:

graphrag init --root ./GraphRAG/test

This generates two files in GraphRAG/test: an environment file, .env, and a settings file, settings.yaml.

You will need to specify your OpenAI API key in .env: GRAPHRAG_API_KEY=<YOUR_API_KEY>. Additionally, you can change settings in settings.yaml. The most common change is the language model used: inside settings.yaml, under the first llm block, set model to any OpenAI model that fits your budget. For a complete list of models, check here.
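For illustration, the relevant part of settings.yaml might look like the snippet below; the exact keys depend on your GraphRAG version, and the model name is just an example:

```yaml
llm:
  api_key: ${GRAPHRAG_API_KEY}   # read from .env
  type: openai_chat
  model: gpt-4o-mini             # example; substitute any OpenAI model
```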

Running the indexing

To build the hierarchical knowledge graph from the text corpus you just stored, run:

graphrag index --root ./GraphRAG/test

A complete list of arguments can be found here.

If you add new files to the input corpus, you can run:

graphrag update --root ./GraphRAG/test

to update the graph.

Querying the knowledge graph with prompts

For a global understanding of the knowledge graph you just built, run:

graphrag query --root ./GraphRAG/test --method global --query "<YOUR QUESTION HERE>"

For our text-to-image (T2I) task, you will usually need local details to get a useful answer. For example, after generating a graph that includes the entities you want to paint, run:

graphrag query --root ./GraphRAG/test --method local --query "Provide a detailed prompt for image generation of <YOUR IMAGE PROMPT>. It must be possible to directly input your answer into an image generation model for an accurate image."

Evaluation

Data Structure

The evaluation directory contains the prompts.csv file, the generated images, and the pseudo-groundtruth images.

├── evaluation
│   ├── prompts.csv
│   ├── images
│   │   ├── 2gen_img_LLAMAgraph
│   │   │   ├── 200_[prompt].jpg
│   │   │   ├── 201_[prompt].jpg
│   │   │   ├── 202_[prompt].jpg
│   │   │   ├── ...
│   │   │   ├── 399_[prompt].jpg
│   │   ├── 2gen_img_LLAMAgraphnoneigh
│   │   │   ├── ...
│   │   ├── 2gen_img_LLAMAuserprompt
│   │   │   ├── ...
│   │   ├── 2gen_img_graph
│   │   │   ├── ...
│   │   ├── 2gen_img_graphnoneigh
│   │   │   ├── ...
│   │   ├── 2gen_img_userprompt
│   │   │   ├── ...
│   ├── gt_images
│   │   ├── 200
│   │   ├── 201
│   │   ├── 202
│   │   ├── ...
│   │   ├── 399

Environment

The evaluation script requires torch, clip, and dreamsim, along with common Python libraries such as pandas and PIL (Pillow).

Scripts

remove_empty_images.ipynb removes broken image files left by the generation/crawling pipeline.
CLIPScore is implemented in CLIPScore.ipynb, and the visual similarity metric in visual_similarity.ipynb.
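As a rough sketch of what the CLIPScore computation involves: CLIPScore is the cosine similarity between a CLIP image embedding and a CLIP text embedding. The code below is not the notebook's implementation; `clip_score` and its defaults are assumptions, though `clip.load` and `clip.tokenize` are the real OpenAI CLIP API.

```python
# Hedged sketch of a CLIPScore computation (illustrative, not the repo's code).
def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (pure Python)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


def clip_score(image_path, text, device="cpu"):
    """Cosine similarity between CLIP image and text embeddings."""
    # Imported lazily so the helper above works without torch/clip installed.
    import clip
    import torch
    from PIL import Image

    model, preprocess = clip.load("ViT-B/32", device=device)
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    tokens = clip.tokenize([text]).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image)[0]
        txt_emb = model.encode_text(tokens)[0]
    return cosine_similarity(img_emb.tolist(), txt_emb.tolist())
```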

WikiGraphs

All details of the Wikipedia graph processing, community detection, LLAMA prompt generation, Flux image generation, and web scraping are in Wikiraph-LLAMA-Flux-Scraping.

The dataset itself is not included, as it comprises 7 GB of data plus 2 GB of supplemental files generated by the code.

Code file names describe what they do.

The scraped images are in first101scraped.zip and next99scraped.zip; the generated images are in all_FLux_gen_images.zip. The from_graph_prompts_LLAMA_graph series stores the node texts and LLAMA prompts for the first 76 nodes, and 200-230_withprompts.csv stores the same for nodes 200 to 229. 200-240.csv stores the first 6 words of each node (punctuation and numbers removed) as well as the full text of nodes 200-399.
