<a href="https://colab.research.google.com/github/marcelcastrobr/marcelcastrobr.github.io/blob/new-notebooks/2023-03-27-VisualChatGPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
aliases:
- /2023/03/27/VisualChatGPT
date: '2023-03-27'
output-file: 2023-03-27-visualchatgpt.html
title: "Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models"

---


Visual ChatGPT connects ChatGPT to a series of Visual Foundation Models so that users can send and receive images during a chat.

ChatGPT is a language interface with distinctive conversational competency and reasoning capabilities across many domains. However, it is currently unable to process or generate images from the visual world. To address this limitation, the authors propose a system called Visual ChatGPT that incorporates different Visual Foundation Models to enable users to interact with ChatGPT using both language and images. The system is capable of handling complex visual questions or instructions that require multiple AI models and steps. Additionally, it allows for feedback and corrections.

Rather than training a new multimodal ChatGPT from scratch, the authors build Visual ChatGPT by connecting 22 Visual Foundation Models (VFMs) directly to ChatGPT. To integrate these VFMs, they introduce a Prompt Manager that specifies the input-output format of each VFM, converts visual information to language format, and manages the histories, priorities, and conflicts of the different VFMs. With the Prompt Manager's help, ChatGPT can call these VFMs iteratively and use their feedback until it satisfies the user's requirements or reaches the end condition.
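
The iterative loop described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: `Tool`, `run_agent`, `toy_decide` and `fake_edge` are hypothetical names, and the `decide` callback stands in for the language model that picks the next tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str]  # (tool name, tool input, observation)


@dataclass
class Tool:
    name: str
    description: str            # tells the language model when to use the tool
    run: Callable[[str], str]   # text in (e.g. a file path), text out


def run_agent(
    query: str,
    tools: Dict[str, Tool],
    decide: Callable[[str, List[Step]], Tuple[str, str]],
    max_steps: int = 5,
) -> Tuple[str, List[Step]]:
    """Call tools until `decide` returns a final answer or the step limit is hit."""
    history: List[Step] = []
    for _ in range(max_steps):
        action, arg = decide(query, history)
        if action == "final":            # end condition: the model answers directly
            return arg, history
        observation = tools[action].run(arg)
        history.append((action, arg, observation))
    return "stopped: step limit reached", history


# Toy stand-ins (assumptions): a fake edge detector and a fixed policy.
def fake_edge(path: str) -> str:
    return path.replace(".png", "_edge.png")


def toy_decide(query: str, history: List[Step]) -> Tuple[str, str]:
    if not history:
        return "edge", "image/a.png"
    return "final", f"Saved as {history[-1][2]}"


tools = {"edge": Tool("edge", "Detect edges in an image", fake_edge)}
answer, history = run_agent("detect the edges", tools, toy_decide)
print(answer)
```

Passing images around as file paths is what lets a text-only model chain several visual tools: each observation is a string the model can feed to the next tool.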

Paper: https://arxiv.org/abs/2303.04671

Code link: https://github.com/microsoft/visual-chatgpt

A detailed unofficial overview of the paper: https://andlukyane.com/blog/paper-review-palme

This notebook is adapted from https://github.com/microsoft/visual-chatgpt.

Download visual-chatgpt


In [1]:
!rm -rf visual-chatgpt && git clone https://github.com/microsoft/visual-chatgpt

Cloning into 'visual-chatgpt'...
remote: Enumerating objects: 163, done.
remote: Counting objects: 100% (91/91), done.
remote: Compressing objects: 100% (39/39), done.
remote: Total 163 (delta 63), reused 60 (delta 52), pack-reused 72
Receiving objects: 100% (163/163), 9.56 MiB | 18.30 MiB/s, done.
Resolving deltas: 100% (83/83), done.


Go to directory


In [2]:
cd visual-chatgpt/

/content/visual-chatgpt


Install packages


In [None]:
!pip install -r requirements.txt

Set your OPENAI_API_KEY


In [None]:
%env OPENAI_API_KEY=<YOURKEY>
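
As an alternative to typing the key into a cell, where it would be saved with the notebook, the key can be read at run time. A minimal sketch, assuming a hypothetical helper named `set_openai_key`; only `os.environ` and `getpass` from the standard library are used.

```python
import os
from getpass import getpass
from typing import Optional


def set_openai_key(key: Optional[str] = None) -> None:
    """Store the API key in the process environment without echoing it."""
    if key is None:
        # Prompt interactively; the input is not displayed or saved in the notebook.
        key = getpass("OpenAI API key: ")
    key = key.strip()
    if not key:
        raise ValueError("empty API key")
    os.environ["OPENAI_API_KEY"] = key
```

Calling `set_openai_key()` with no argument prompts for the key; passing a string sets it directly.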

Import dependencies

In [5]:
from visual_chatgpt import *

  jax.tree_util.register_keypaths(


Run VisualChatGPT
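
The cell below builds `load_dict` from a comma-separated string of `Model_device` entries. The parsing step can be tried on its own with a small sketch; `parse_load` is a hypothetical helper, and it splits on the last underscore, a slight variation on the dictionary comprehension in the cell.

```python
from typing import Dict


def parse_load(load: str) -> Dict[str, str]:
    """Turn "Text2Image_cuda:0,Image2Canny_cpu" into {model: device}."""
    result: Dict[str, str] = {}
    for entry in load.split(","):
        # Split on the last underscore: class name on the left, device on the right.
        model, device = entry.strip().rsplit("_", 1)
        result[model.strip()] = device.strip()
    return result


print(parse_load("Text2Image_cuda:0, Image2Canny_cpu"))
```

Loading fewer models, or placing some on `cpu`, is done by editing the string rather than the code.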

In [6]:
# Each entry is "<ModelClass>_<device>".
# load = "ImageCaptioning_cuda:0,Text2Image_cuda:0"
load = "Text2Image_cuda:0,ImageCaptioning_cuda:0,VisualQuestionAnswering_cuda:0,Image2Canny_cpu,Image2Line_cpu,Image2Pose_cpu,Image2Depth_cpu,CannyText2Image_cuda:0,InstructPix2Pix_cuda:0,Image2Seg_cuda:0"

# Map each model class name to the device it should be loaded on.
load_dict = {e.split('_')[0].strip(): e.split('_')[1].strip() for e in load.split(',')}
bot = ConversationBot(load_dict=load_dict)
with gr.Blocks(css="#chatbot .overflow-y-auto{height:500px}") as demo:
    chatbot = gr.Chatbot(elem_id="chatbot", label="Visual ChatGPT")
    state = gr.State([])
    with gr.Row():
        with gr.Column(scale=0.7):
            txt = gr.Textbox(show_label=False, placeholder="Enter text and press enter, or upload an image").style(
                container=False)
        with gr.Column(scale=0.15, min_width=0):
            clear = gr.Button("Clear")
        with gr.Column(scale=0.15, min_width=0):
            btn = gr.UploadButton("Upload", file_types=["image"])

    txt.submit(bot.run_text, [txt, state], [chatbot, state])
    txt.submit(lambda: "", None, txt)
    btn.upload(bot.run_image, [btn, state, txt], [chatbot, state, txt])
    clear.click(bot.memory.clear)
    clear.click(lambda: [], None, chatbot)
    clear.click(lambda: [], None, state)
    demo.launch(server_name="0.0.0.0", server_port=7861, share=True, debug=True)

Initializing VisualChatGPT, load_dict={'Text2Image': 'cuda:0', 'ImageCaptioning': 'cuda:0', 'VisualQuestionAnswering': 'cuda:0', 'Image2Canny': 'cpu', 'Image2Line': 'cpu', 'Image2Pose': 'cpu', 'Image2Depth': 'cpu', 'CannyText2Image': 'cuda:0', 'InstructPix2Pix': 'cuda:0', 'Image2Seg': 'cuda:0'}
Initializing Text2Image to cuda:0


`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overriden.


Initializing ImageCaptioning to cuda:0


Initializing VisualQuestionAnswering to cuda:0


Initializing Image2Canny
Initializing Image2Line


Initializing Image2Pose


No model was supplied, defaulted to Intel/dpt-large and revision e93beec (https://huggingface.co/Intel/dpt-large).
Using a pipeline without specifying a model name and revision in production is not recommended.


cuda
Initializing Image2Depth


Some weights of DPTForDepthEstimation were not initialized from the model checkpoint at Intel/dpt-large and are newly initialized: ['neck.fusion_stage.layers.0.residual_layer1.convolution2.weight', 'neck.fusion_stage.layers.0.residual_layer1.convolution2.bias', 'neck.fusion_stage.layers.0.residual_layer1.convolution1.bias', 'neck.fusion_stage.layers.0.residual_layer1.convolution1.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.


Initializing CannyText2Image to cuda:0


You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_controlnet.StableDiffusionControlNetPipeline'> by passing `safety_checker=None`. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .


Initializing InstructPix2Pix to cuda:0


Initializing Image2Seg


Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://7b55e143590acf0035.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
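
The intermediate files in the session log have names such as `image/7c3e_canny2image_c9d2_4d0de046.png`, which appear to follow the pattern `{name}_{operation}_{previous name}_{original name}`. A small sketch for splitting such names, assuming that four-part pattern holds; `ChainedName` and `parse_chained_name` are hypothetical names, not part of the repository.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class ChainedName(NamedTuple):
    name: str        # id of this file
    operation: str   # tool that produced it, e.g. "canny2image"
    prev_name: str   # id of the file it was derived from
    org_name: str    # id of the originally uploaded image


def parse_chained_name(path: str) -> ChainedName:
    """Split a chained file name into its four underscore-separated parts."""
    parts = PurePosixPath(path).stem.split("_")
    if len(parts) != 4:
        # Uploaded originals such as image/4d0de046.png have a single part.
        raise ValueError(f"not a chained file name: {path}")
    return ChainedName(*parts)


info = parse_chained_name("image/7c3e_canny2image_c9d2_4d0de046.png")
print(info.operation, info.prev_name, info.org_name)
```

Reading names this way makes it possible to trace an output back through the chain of tools to the uploaded image.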


Resize image form 4032x3024 to 512x384





Processed ImageCaptioning, Input Image: image/4d0de046.png, Output Text: a dog with a leash

Processed run_image, Input image: image/4d0de046.png
Current state: [('![](/file=image/4d0de046.png)*image/4d0de046.png*', 'Received.  ')]
Current Memory: 
Human: provide a figure named image/4d0de046.png. The description is: a dog with a leash. This information helps you to understand this image, but you should use tools to finish following tasks, rather than directly imagine from my description. If you understand, say "Received". 
AI: Received.  
history_memory:
Human: provide a figure named image/4d0de046.png. The description is: a dog with a leash. This information helps you to understand this image, but you should use tools to finish following tasks, rather than directly imagine from my description. If you understand, say "Received". 
AI: Received.  , n_tokens: 45


> Entering new AgentExecutor chain...
Yes
Action: Get Photo Description
Action Input: image/4d0de046.p

Processed CannyText2Image, Input Canny: image/c9d2_edge_4d0de046_4d0de046.png, Input Text:  remove the background, Output Text: image/7c3e_canny2image_c9d2_4d0de046.png

Observation: image/7c3e_canny2image_c9d2_4d0de046.png
Thought: Do I need to use a tool? No
AI: The image with the background removed is saved as image/7c3e_canny2image_c9d2_4d0de046.png.

> Finished chain.

Processed run_text, Input text: could you remove the background?
Current state: [('<img src="/file=image/4d0de046.png" alt=""><em>image/4d0de046.png</em>', 'Received.  '), ('What is in the picture?', 'The image contains a dog with a leash.'), ('could you remove the background?', 'The image with the background removed is saved as ![](/file=image/7c3e_canny2image_c9d2_4d0de046.png)*image/7c3e_canny2image_c9d2_4d0de046.png*.')]
Current Memory: 
Human: provide a figure named image/4d0de046.png. The description is: a dog with a leash. This information helps you to understand thi

Processed Text2Image, Input Text: replace the dog for a cat, Output Image: image/40885fe0.png

Observation: image/40885fe0.png
Thought: Do I need to use a tool? No
AI: The image with the dog replaced by a cat is saved as image/40885fe0.png.

> Finished chain.

Processed run_text, Input text: can you replace the dog for a cat?
Current state: [('<img src="/file=image/4d0de046.png" alt=""><em>image/4d0de046.png</em>', 'Received.  '), ('What is in the picture?', 'The image contains a dog with a leash.'), ('could you remove the background?', 'The image with the background removed is saved as <img src="/file=image/7c3e_canny2image_c9d2_4d0de046.png" alt=""><em>image/7c3e_canny2image_c9d2_4d0de046.png</em>.'), ('can you replace the dog for a cat?', 'The image with the dog replaced by a cat is saved as ![](/file=image/40885fe0.png)*image/40885fe0.png*.')]
Current Memory: 
Human: provide a figure named image/4d0de046.png. The description is: a dog with

Processed CannyText2Image, Input Canny: image/2124_edge_40885fe0_40885fe0.png, Input Text:  generate a yellow dog, Output Text: image/4781_canny2image_2124_40885fe0.png

Observation: image/4781_canny2image_2124_40885fe0.png
Thought: Do I need to use a tool? No
AI: The image with a yellow dog is saved as image/4781_canny2image_2124_40885fe0.png.

> Finished chain.

Processed run_text, Input text: Now generate a yellow dog based on image/2124_edge_40885fe0_40885fe0.png
Current state: [('<img src="/file=image/4d0de046.png" alt=""><em>image/4d0de046.png</em>', 'Received.  '), ('What is in the picture?', 'The image contains a dog with a leash.'), ('could you remove the background?', 'The image with the background removed is saved as <img src="/file=image/7c3e_canny2image_c9d2_4d0de046.png" alt=""><em>image/7c3e_canny2image_c9d2_4d0de046.png</em>.'), ('can you replace the dog for a cat?', 'The image with the dog replaced by a cat is saved as <img sr

Processed Text2Image, Input Text: image/4781_canny2image_2124_40885fe0.png, add one cat to picture, Output Image: image/ca3a377b.png

Observation: image/ca3a377b.png
Thought: Do I need to use a tool? No
AI: The image with one cat added is saved as image/ca3a377b.png.

> Finished chain.

Processed run_text, Input text: now add one cat to picture
Current state: [('<img src="/file=image/4d0de046.png" alt=""><em>image/4d0de046.png</em>', 'Received.  '), ('What is in the picture?', 'The image contains a dog with a leash.'), ('could you remove the background?', 'The image with the background removed is saved as <img src="/file=image/7c3e_canny2image_c9d2_4d0de046.png" alt=""><em>image/7c3e_canny2image_c9d2_4d0de046.png</em>.'), ('can you replace the dog for a cat?', 'The image with the dog replaced by a cat is saved as <img src="/file=image/40885fe0.png" alt=""><em>image/40885fe0.png</em>.'), ('could you generate the canny images of this picture?', 

Processed Text2Image, Input Text: image/ca3a377b.png, one dog, Output Image: image/47da58e6.png

Observation: image/47da58e6.png
Thought: Do I need to use a tool? No
AI: The image with one dog added is saved as image/47da58e6.png.

> Finished chain.

Processed run_text, Input text: now include one dog to the picture
Current state: [('<img src="/file=image/4d0de046.png" alt=""><em>image/4d0de046.png</em>', 'Received.  '), ('What is in the picture?', 'The image contains a dog with a leash.'), ('could you remove the background?', 'The image with the background removed is saved as <img src="/file=image/7c3e_canny2image_c9d2_4d0de046.png" alt=""><em>image/7c3e_canny2image_c9d2_4d0de046.png</em>.'), ('can you replace the dog for a cat?', 'The image with the dog replaced by a cat is saved as <img src="/file=image/40885fe0.png" alt=""><em>image/40885fe0.png</em>.'), ('could you generate the canny images of this picture?', 'The image with the edges det

Processed Text2Image, Input Text: image/47da58e6.png, one cat, Output Image: image/d558507e.png

Observation: image/d558507e.png
Thought: Do I need to use a tool? No
AI: The image with one cat added is saved as image/d558507e.png.

> Finished chain.

Processed run_text, Input text: now include one cat to the image image/47da58e6.png
Current state: [('<img src="/file=image/4d0de046.png" alt=""><em>image/4d0de046.png</em>', 'Received.  '), ('What is in the picture?', 'The image contains a dog with a leash.'), ('could you remove the background?', 'The image with the background removed is saved as <img src="/file=image/7c3e_canny2image_c9d2_4d0de046.png" alt=""><em>image/7c3e_canny2image_c9d2_4d0de046.png</em>.'), ('can you replace the dog for a cat?', 'The image with the dog replaced by a cat is saved as <img src="/file=image/40885fe0.png" alt=""><em>image/40885fe0.png</em>.'), ('could you generate the canny images of this picture?', 'The image w

Processed InstructPix2Pix, Input Image: image/9dbacedb.png, Instruct Text:  rotate 90 degrees, Output Image: image/f32f_pix2pix_9dbacedb_9dbacedb.png

Observation: image/f32f_pix2pix_9dbacedb_9dbacedb.png
Thought: Do I need to use a tool? No
AI: The image has been rotated 90 degrees and is saved as image/f32f_pix2pix_9dbacedb_9dbacedb.png.

> Finished chain.

Processed run_text, Input text: Can you rotate the  image/9dbacedb.png in 90 degrees?
Current state: [('<img src="/file=image/4d0de046.png" alt=""><em>image/4d0de046.png</em>', 'Received.  '), ('What is in the picture?', 'The image contains a dog with a leash.'), ('could you remove the background?', 'The image with the background removed is saved as <img src="/file=image/7c3e_canny2image_c9d2_4d0de046.png" alt=""><em>image/7c3e_canny2image_c9d2_4d0de046.png</em>.'), ('can you replace the dog for a cat?', 'The image with the dog replaced by a cat is saved as <img src="/file=image/40885fe0.

In [None]:
!ps -aux | grep "python"
# !kill -9 7507    