<a href="https://colab.research.google.com/github/palomaalves/Notebooks_Machine_Learning/blob/main/C%C3%B3pia_de_Visual_Language_Models_Atividade.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Activity

The goal of this exercise is for the student to practice using pre-trained VLMs (vision-language models) as feature extractors. The student implements a KNN classifier to solve the CIFAR-10 classification problem. The activity has four steps: 1) load the CIFAR-10 train and test splits, 2) load a VLM (CLIP or ALIGN), 3) extract features from the training images and fit the KNN, 4) evaluate the KNN on the test images.
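
To make the role of the KNN classifier concrete, the sketch below shows the majority-vote rule on tiny hand-written vectors. It is only an illustration: the function `knn_predict` and the toy 2-D "features" are hypothetical stand-ins for the CLIP embeddings and scikit-learn's `KNeighborsClassifier` used in the activity.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training vectors."""
    # Euclidean distance from the query to every training vector.
    distances = [
        (math.dist(vec, query), label)
        for vec, label in zip(train_features, train_labels)
    ]
    # Keep only the k closest training points.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most frequent label among those neighbours is the prediction.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up 2-D vectors standing in for image embeddings (assumption, not real data).
toy_features = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (1.0, 0.9), (0.9, 1.1), (1.1, 1.0)]
toy_labels = ["cat", "cat", "cat", "ship", "ship", "ship"]

prediction_a = knn_predict(toy_features, toy_labels, (0.1, 0.1), k=3)
prediction_b = knn_predict(toy_features, toy_labels, (1.0, 1.0), k=3)
print(prediction_a, prediction_b)
```

The classifier has no training phase beyond storing the vectors, which is why the quality of the extracted features matters so much in this setup.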



# 1) Load the CIFAR-10 train and test splits

In [None]:
!pip install torch torchvision transformers tqdm scikit-learn Pillow



In [None]:
import torch
import torchvision
import torchvision.transforms as transforms
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from tqdm.notebook import tqdm

transform = transforms.Compose([
    transforms.ToTensor()
])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                    download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128,
                                      shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                   download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=128,
                                     shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat', 'deer',
          'dog', 'frog', 'horse', 'ship', 'truck')

Files already downloaded and verified
Files already downloaded and verified


# 2) Load a VLM (CLIP or ALIGN)

In [None]:
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

CLIPModel(
  (text_model): CLIPTextTransformer(
    (embeddings): CLIPTextEmbeddings(
      (token_embedding): Embedding(49408, 512)
      (position_embedding): Embedding(77, 512)
    )
    (encoder): CLIPEncoder(
      (layers): ModuleList(
        (0-11): 12 x CLIPEncoderLayer(
          (self_attn): CLIPSdpaAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (layer_norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (mlp): CLIPMLP(
            (activation_fn): QuickGELUActivation()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
          )
          (layer_norm2): LayerNorm((512,), eps=1e

# 3) Extract features from the training images and fit the KNN

In [None]:
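def extract_features(dataloader, model, processor, device):
    """Run every batch through the CLIP image encoder and collect embeddings and labels."""
    features = []
    labels = []

    # Inference only, so gradient tracking is disabled.
    with torch.no_grad():
        for images, targets in tqdm(dataloader):
            # ToTensor yields float CHW tensors in [0, 1]; convert each back to an
            # HWC uint8 PIL image so the processor can apply CLIP's own preprocessing.
            pil_images = [Image.fromarray((img.permute(1, 2, 0).numpy() * 255).astype(np.uint8))
                         for img in images]

            inputs = processor(images=pil_images, return_tensors="pt", padding=True)

            # Move the preprocessed batch to the same device as the model.
            inputs = {k: v.to(device) for k, v in inputs.items()}

            outputs = model.get_image_features(**inputs)

            features.append(outputs.cpu().numpy())
            labels.extend(targets.numpy())

    # Stack the per-batch arrays into one (num_images, feature_dim) matrix.
    return np.vstack(features), np.array(labels)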

print("Extracting features from the training data...")
train_features, train_labels = extract_features(trainloader, model, processor, device)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_features, train_labels)

print("KNN training complete!")

Extracting features from the training data...


  0%|          | 0/391 [00:00<?, ?it/s]

KNN training complete!
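
The KNN above relies on scikit-learn's default Euclidean distance over the raw embeddings. CLIP embeddings are often compared by cosine similarity instead, and one possible variation (not part of the original activity, and not measured here) is to L2-normalize the features first, since for unit-length vectors the squared Euclidean distance equals `2 - 2 * cosine_similarity`. The sketch below checks that identity on made-up vectors; `l2_normalize` and `cosine_similarity` are hypothetical helper names.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors: b points in the same direction as a but is twice as long.
a = [3.0, 4.0]
b = [6.0, 8.0]
c = [4.0, 3.0]

a_n, b_n, c_n = l2_normalize(a), l2_normalize(b), l2_normalize(c)

# After normalization, vectors with the same direction coincide.
dist_ab = math.dist(a_n, b_n)

# Squared Euclidean distance between unit vectors vs. 2 - 2 * cosine similarity.
sq_dist_ac = sum((x - y) ** 2 for x, y in zip(a_n, c_n))
expected_ac = 2 - 2 * cosine_similarity(a, c)
print(dist_ab, sq_dist_ac, expected_ac)
```

With the NumPy arrays used in this notebook, the equivalent normalization would be `train_features / np.linalg.norm(train_features, axis=1, keepdims=True)`, applied identically to the test features.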


# 4) Evaluate the KNN on the test images

In [None]:
print("Extracting features from the test data...")
test_features, test_labels = extract_features(testloader, model, processor, device)

predictions = knn.predict(test_features)

accuracy = accuracy_score(test_labels, predictions)
print(f"Test accuracy: {accuracy * 100:.2f}%")

Extracting features from the test data...


  0%|          | 0/79 [00:00<?, ?it/s]

Test accuracy: 92.77%
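
The single accuracy figure hides how the errors are distributed across the ten classes. As a rough sketch of a per-class breakdown, the hypothetical helper `per_class_accuracy` below maps label indices to names using the `classes` tuple from step 1. The labels and predictions in the example are made up for illustration and are not results from this notebook.

```python
from collections import defaultdict


def per_class_accuracy(true_labels, predicted_labels, class_names):
    """Return {class name: fraction of that class's samples predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    # Only classes that actually occur in true_labels are reported.
    return {class_names[idx]: correct[idx] / total[idx] for idx in sorted(total)}


classes = ('plane', 'car', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck')

# Made-up label indices (assumption): two planes, two cats, two dogs.
toy_true = [0, 0, 3, 3, 5, 5]
toy_pred = [0, 0, 3, 5, 5, 3]

# Overall accuracy is the fraction of matching positions.
overall = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
report = per_class_accuracy(toy_true, toy_pred, classes)
print(overall, report)
```

In the notebook itself, the same function could be called with `test_labels`, `predictions`, and `classes`.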
