This repo lets you perform image editing via text prompts, without needing to manually draw a mask of the parts you want to edit (mask generation is automatic). All you need is an input image, a text prompt describing what to remove, and a text prompt describing what to replace it with.
To do so, this inpainting tool uses an image segmentation model called CLIPSeg (GitHub | Paper), which can segment an image based on arbitrary text prompts at test time. It uses OpenAI's pre-trained CLIP model (GitHub | Paper) to convert the text prompt and the user-provided image into CLIP embeddings, which are then fed into a decoder that produces the segmentation mask. Here is a great diagram explaining how the model works.
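To make the mask-generation step concrete, here is a minimal sketch of querying CLIPSeg through the Hugging Face transformers port. The CIDAS/clipseg-rd64-refined checkpoint name, the 0.5 sigmoid threshold, and the roses example are illustrative assumptions, not necessarily what this repo's code does:

# Minimal sketch: build a binary mask from a text prompt with CLIPSeg
# (assumes the transformers port; checkpoint name and threshold are illustrative).
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained('CIDAS/clipseg-rd64-refined')
model = CLIPSegForImageSegmentation.from_pretrained('CIDAS/clipseg-rd64-refined')

image = Image.open('images/roses.jpg').convert('RGB')
inputs = processor(text=['white roses'], images=[image], return_tensors='pt')

with torch.no_grad():
    logits = model(**inputs).logits  # low-resolution segmentation logits

# Threshold the logits and resize the mask back to the input image size.
mask = torch.sigmoid(logits).squeeze() > 0.5
mask_image = Image.fromarray((mask.numpy() * 255).astype('uint8')).resize(image.size)
mask_image.save('mask.png')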
A latent text-to-image diffusion model then performs the edit: Stable-Diffusion-Inpainting (HuggingFace Model Card), which was initialized with Stable-Diffusion-v-1-2 weights and further trained for inpainting. Using this diffusion model, the tool takes the user-provided image, the mask image from CLIPSeg, and the inpainting prompt, and outputs the final edited image.
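As an illustration of the inpainting step, here is a minimal sketch using the diffusers library. The runwayml/stable-diffusion-inpainting checkpoint name, the CUDA device, and the 512x512 resizing are assumptions about typical usage, not a copy of this repo's code:

# Minimal sketch: repaint the masked region with Stable Diffusion inpainting
# (assumes diffusers and a CUDA GPU; checkpoint name and 512x512 sizing are illustrative).
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    'runwayml/stable-diffusion-inpainting', torch_dtype=torch.float16
).to('cuda')

init_image = Image.open('images/roses.jpg').convert('RGB').resize((512, 512))
mask_image = Image.open('mask.png').convert('RGB').resize((512, 512))  # white = area to repaint

result = pipe(prompt='red roses', image=init_image, mask_image=mask_image).images[0]
result.save('output.png')

Chaining the two sketches (mask from CLIPSeg, then the inpainting pipeline) reproduces the automatic mask-then-inpaint workflow described above, here using the red roses example listed below.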
The best way to get started is to open quickstart.ipynb in Google Colab with a GPU runtime enabled. The notebook walks you through several different examples.
Each example sets three values: the input image path, the mask prompt (what to segment), and the inpaint prompt (what to generate in its place).

Example 1: remove the squirrel (an empty inpaint prompt lets the model fill the masked region from the surrounding background, effectively removing the object)
input_filepath = 'images/squirrel.jpg'
mask_prompt = 'squirrel'
inpaint_prompt = ''

Example 2: turn the white roses red
input_filepath = 'images/roses.jpg'
mask_prompt = 'white roses'
inpaint_prompt = 'red roses'

Example 3: replace the food with steak and potatoes
input_filepath = 'images/food.jpg'
mask_prompt = 'food'
inpaint_prompt = 'steak and potatoes'

Example 4: replace the people with a robot
input_filepath = 'images/lake.jpg'
mask_prompt = 'people'
inpaint_prompt = 'robot'