When we load CLIP models, e.g.
model_1, preprocess_1 = clip.load("RN50", device=device, jit=False)
model_2, preprocess_2 = clip.load("ViT-B/16", device=device, jit=False)
Obviously, the image encoders in model_1 and model_2 are different (ResNet vs. ViT).
What about the text encoders in these two models: are they also different?
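
One way to check this empirically is to compare the shapes of the text-side parameters of the two loaded models. The snippet below is only a minimal sketch: the helper names `text_param_shapes` and `diff_text_shapes` are hypothetical, and the prefix list assumes the attribute names used by the `CLIP` module in the openai/CLIP package (`transformer`, `token_embedding`, `positional_embedding`, `ln_final`, `text_projection`), with everything under `visual.` belonging to the image encoder.

```python
# Prefixes of text-side parameters. These are an assumption based on the
# attribute names of the CLIP module in the openai/CLIP package; parameters
# under "visual." belong to the image encoder and are skipped.
TEXT_PREFIXES = (
    "transformer.",
    "token_embedding.",
    "positional_embedding",
    "ln_final.",
    "text_projection",
)


def text_param_shapes(model):
    """Map each text-encoder parameter name to its shape as a tuple."""
    return {
        name: tuple(param.shape)
        for name, param in model.named_parameters()
        if name.startswith(TEXT_PREFIXES)
    }


def diff_text_shapes(model_a, model_b):
    """Return {name: (shape_a, shape_b)} for text parameters whose shapes differ.

    A parameter that is missing from one model is reported as None on that side.
    """
    a = text_param_shapes(model_a)
    b = text_param_shapes(model_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(a.keys() | b.keys())
        if a.get(name) != b.get(name)
    }


# Usage with the two models loaded above (left as a comment because it
# needs the downloaded weights):
# print(diff_text_shapes(model_1, model_2))
```

An empty result would only mean the text encoders have the same architecture. It says nothing about whether the weights are shared, which could be checked separately, for example with `torch.equal` on parameters of the same name.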