# Personalized Text-to-Image Generator
Running this notebook requires Google Colab Pro, because the training and inference scripts need at least one A100 GPU.
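
Before going further, it can help to confirm which GPU the Colab runtime actually assigned. The following is a minimal sketch, not part of the original notebook: it assumes `nvidia-smi` is on the PATH (as it normally is on GPU runtimes), and the helper names `list_gpu_names` and `has_a100` are hypothetical.

```python
import subprocess


def list_gpu_names() -> list[str]:
    """Return the GPU names reported by nvidia-smi, or [] if it cannot be run."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # nvidia-smi is missing or failed, e.g. on a CPU-only runtime.
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def has_a100(gpu_names: list[str]) -> bool:
    """True if any reported GPU name mentions A100."""
    return any("A100" in name for name in gpu_names)


names = list_gpu_names()
print("GPUs found:", names or "none")
print("A100 available:", has_a100(names))
```

If the second line prints `False`, changing the runtime type in Colab before continuing avoids a failed training run later.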

**Installing conda in the Colab environment**

In [None]:
!pip install -q condacolab
import condacolab
condacolab.install()

⏬ Downloading https://github.com/conda-forge/miniforge/releases/download/23.11.0-0/Mambaforge-23.11.0-0-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:09
🔁 Restarting kernel...


Mounting the Google Drive that contains the code repository and the set of reference images. The code repository should be uploaded to the Google Drive of the account you are using Google Colab from.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%cd /content/drive/MyDrive

/content/drive/MyDrive
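
Since every later command uses paths relative to this directory, a quick check that the repository files are present can save a failed run. This is a sketch under assumptions: the path list is collected from the commands used in this notebook, and `missing_paths` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# Relative paths referenced by the commands in this notebook.
REQUIRED_PATHS = [
    "environment.yaml",
    "main.py",
    "configs/v1-finetune.yaml",
    "models/stable-diffusion-v1/sd-v1-4.ckpt",
    "scripts/vico_model.py",
]


def missing_paths(root, required=REQUIRED_PATHS):
    """Return the entries of `required` that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in required if not (root / rel).exists()]


# Run from the directory selected with %cd above.
missing = missing_paths(".")
print("Missing:", missing if missing else "nothing, all paths found")
```

The folder of reference images (for example `images/ollie`) can be appended to the list in the same way.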


Creating a conda environment from the environment.yaml file. It will contain all the dependencies the model needs to run. Any existing environment named vico is removed first.

In [None]:
!conda remove --name vico --all -y
!conda env create -f environment.yaml


Remove all packages in environment /usr/local/envs/vico:


## Package Plan ##

  environment location: /usr/local/envs/vico


The following packages will be REMOVED:

  _libgcc_mutex-0.1-conda_forge
  _openmp_mutex-4.5-2_kmp_llvm
  blas-1.0-mkl
  bzip2-1.0.8-h5eee18b_5
  ca-certificates-2023.12.12-h06a4308_0
  cudatoolkit-11.3.1-h2bc3f7f_2
  ffmpeg-4.3-hf484d3e_0
  freetype-2.12.1-h4a9f257_0
  gmp-6.2.1-h295c915_3
  gnutls-3.6.15-he1e5248_0
  jpeg-9e-h5eee18b_1
  lame-3.100-h7b6447c_0
  lcms2-2.12-h3be6417_0
  ld_impl_linux-64-2.38-h1181459_1
  lerc-3.0-h295c915_0
  libdeflate-1.17-h5eee18b_1
  libffi-3.3-he6710b0_2
  libgcc-ng-13.2.0-h807b86a_5
  libiconv-1.16-h7f8727e_2
  libidn2-2.3.4-h5eee18b_0
  libpng-1.6.39-h5eee18b_0
  libstdcxx-ng-11.2.0-h1234567_1
  libtasn1-4.19.0-h5eee18b_0
  libtiff-4.5.1-h6a678d5_0
  libunistring-0.9.10-h27cfd23_0
  libuv-1.44.2-h5eee18b_0
  libwebp-base-1.3.2-h5eee18b_0
  libzlib-1.2.13-hd590300_5
  llvm-openmp-17.0.6-h4dfa4b3_0
  lz4-c-1.9.4-h6a678d5_0

# Steps for running the training script

ACTUAL_RESUME: path to the saved pre-trained Stable Diffusion checkpoint (`--actual_resume`).

DATA-ROOT: path to the folder containing the set of reference images (`--data_root`).

GPUS: comma-separated list of the indices of the GPUs to train on (`--gpus`). This notebook trains on a single GPU, so the value is passed as `--gpus 0,` (the trailing comma is part of the value).

INIT-WORD: a word that broadly describes the subject, e.g. toy or dog (`--init_word`).
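
The parameters above map onto the flags of `main.py` roughly as in the sketch below. It is only an illustration: `build_train_command` is a hypothetical helper, the config path is the one used in this notebook, and the multi-GPU form of the `--gpus` value (every index followed by a comma) is an assumption that has only been exercised here with a single GPU.

```python
import shlex


def build_train_command(actual_resume, data_root, init_word, gpus=(0,), name=""):
    """Assemble the main.py argument list used for training in this notebook."""
    # Assumption: each GPU index is followed by a comma, so (0,) becomes "0,".
    gpu_arg = "".join(f"{index}," for index in gpus)
    return [
        "python", "main.py",
        "--base", "configs/v1-finetune.yaml", "-t",
        "--actual_resume", str(actual_resume),
        "-n", name,
        "--gpus", gpu_arg,
        "--data_root", str(data_root),
        "--init_word", init_word,
    ]


command = build_train_command(
    actual_resume="models/stable-diffusion-v1/sd-v1-4.ckpt",
    data_root="images/ollie",
    init_word="dog",
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))
```

Building the command as a list keeps paths with spaces intact and makes it easy to swap in a different image folder or init word.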

**Activating the environment vico and running the training on a set of reference images.**

In [None]:
!source /usr/local/bin/activate vico && python main.py \
--base configs/v1-finetune.yaml -t \
--actual_resume models/stable-diffusion-v1/sd-v1-4.ckpt \
-n "" \
--gpus 0, \
--data_root images/ollie \
--init_word "dog"

Global seed set to 23
Running on GPUs 0,
Loading model from models/stable-diffusion-v1/sd-v1-4.ckpt
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 910.77 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 64, 64) = 16384 dimensions.
making attention of type 'vanilla' with 512 in_channels
Some weights of the model checkpoint at openai/clip-vit-large-patch14 were not used when initializing CLIPTextModel: ['vision_model.encoder.layers.12.self_attn.k_proj.weight', 'vision_model.encoder.layers.21.self_attn.out_proj.bias', 'vision_model.encoder.layers.17.self_attn.out_proj.weight', 'vision_model.encoder.layers.2.layer_norm1.bias', 'vision_model.encoder.layers.22.self_attn.out_proj.bias', 'vision_model.encoder.layers.14.self_attn.k_proj.bias', 'vision_model.encoder.layers.9.mlp.fc2.weight', 'vision_model.encoder.layers.19.self_attn.k_proj.weight', 'vision_model.encoder.layers.2.self_attn.q_proj.weight', 'vision_model.encoder.

# Steps for running the inference script
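IMAGE-PATH: path to a reference image (`--image_path`).

CHECKPOINTS-PATH: path to the training log folder that contains the checkpoints folder, where the embeddings and cross-attention weights from training are saved (`--ft_path`).

TEXT-PROMPT: your desired text prompt (`--prompt`). The example below uses `*` where the subject appears.

OUTPUT-DIR: path to the folder in which the generated images should be saved, e.g. outputs/dog (`--outdir`).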
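
The inference command also takes a `--load_step` value, which has to match a step saved in the checkpoints folder. The sketch below lists the steps that are available. It assumes the file naming seen in the inference log of this notebook (`embeddings_gs-<step>.pt` and `cross_attention-<step>.pt`); `available_steps` is a hypothetical helper, not part of the repository.

```python
import re
from pathlib import Path

# File name pattern taken from the inference log, e.g. embeddings_gs-399.pt
EMBEDDING_PATTERN = re.compile(r"^embeddings_gs-(\d+)\.pt$")


def available_steps(ft_path):
    """Return sorted steps that have both embedding and cross-attention files."""
    checkpoint_dir = Path(ft_path) / "checkpoints"
    if not checkpoint_dir.is_dir():
        return []
    steps = []
    for entry in checkpoint_dir.iterdir():
        match = EMBEDDING_PATTERN.match(entry.name)
        if not match:
            continue
        step = match.group(1)
        # Keep the step only if the matching cross-attention file exists too.
        if (checkpoint_dir / f"cross_attention-{step}.pt").exists():
            steps.append(int(step))
    return sorted(steps)


# Folder name taken from the run in this notebook; replace it with your own.
steps = available_steps("logs/ollie2024-03-12T18-36-55_v1-finetune")
print("Steps usable with --load_step:", steps)
```

An empty list means either that the folder path is wrong or that training has not yet written any checkpoint.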

**Running inference, given a text prompt and a reference image**

In [None]:
!source /usr/local/bin/activate vico && python scripts/vico_model.py \
--ddim_eta 0.0  --n_samples 4  --n_iter 2  --scale 7.5  --ddim_steps 50  \
--ckpt_path models/stable-diffusion-v1/sd-v1-4.ckpt  \
--image_path images/ollie/1.png \
--ft_path logs/ollie2024-03-12T18-36-55_v1-finetune \
--load_step 399 \
--prompt "A photo of a * with iridescent wings, soaring through a starry night sky filled with constellations" \
--outdir outputs/ollie-399

Global seed set to 42
cross attention path: logs/ollie2024-03-12T18-36-55_v1-finetune/checkpoints/cross_attention-399.pt
embedding path: logs/ollie2024-03-12T18-36-55_v1-finetune/checkpoints/embeddings_gs-399.pt
Loading model from models/stable-diffusion-v1/sd-v1-4.ckpt
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 910.77 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Some weights of the model checkpoint at openai/clip-vit-large-patch14 were not used when initializing CLIPTextModel: ['vision_model.encoder.layers.17.mlp.fc2.bias', 'vision_model.encoder.layers.21.mlp.fc2.bias', 'vision_model.encoder.layers.17.layer_norm1.bias', 'vision_model.encoder.layers.20.mlp.fc1.weight', 'vision_model.encoder.layers.7.layer_norm1.bias', 'vision_model.encoder.layers.20.layer_norm2.bias', 'vision_model.encoder.layers.18.layer_norm2.bias', 'vision_m