# Instructions - Report
--------------
motivations
related works
methods (formulation, architecture)
- Describe your computing resources and reformulate your problem so that it is manageable within them: Colab Pro, $10/month
experiments (data preparation, hyperparameter tuning, quantitative/qualitative experimental results)
discussion & future direction
--------------
## basic points (overall) (2): 
- length (0.5)
- format (0.5)
- clarity of writing (1)

## Introduction (5): 
- motivation (2)
- problem definition (2)
- concise description of contribution (1)

## Methods (5): 
- significance/novelty (2)
- figure (1)
- reproducibility (2)-algorithm
: A part that lays out the architecture of the implementation, such as pseudocode, must be included.

## Experiments (7): 
- dataset (1)
- computer resource (CPU, GPU, OS, PyTorch, etc.) & experimental design (1)
- quantitative results (1)
  : -> Numbers and plots (graphs) count as quantitative results.
- qualitative results (1)
  : -> Results that numbers cannot explain and that convey an impression visually, i.e. qualitative results (e.g. an attention map).
- Figures (plots)/Tables and their analysis (2)
  : Visualize the results to check whether they are good or not.
- discussion why the proposed method is successful or unsuccessful (1) 
  : If your model is not competitive (i.e. degradation is observed), explain why and describe a future direction.
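
For the "computer resource" item above, the details can be collected programmatically rather than typed by hand. The following is only a minimal sketch: the helper name `collect_environment` and the dictionary keys are hypothetical, and PyTorch is treated as optional so the snippet also runs where it is not installed.

```python
import os
import platform


def collect_environment():
    """Gather machine details that an experiments section typically lists.

    Hypothetical helper for illustration; the key names are arbitrary.
    """
    info = {
        "os": f"{platform.system()} {platform.release()}",
        "python": platform.python_version(),
        # platform.processor() can be empty on some systems, so fall back
        # to the machine architecture string.
        "cpu": platform.processor() or platform.machine(),
        "cpu_count": os.cpu_count(),
    }
    try:
        import torch  # optional: only reported when installed

        info["pytorch"] = torch.__version__
        info["cuda_available"] = torch.cuda.is_available()
        if torch.cuda.is_available():
            info["gpu"] = torch.cuda.get_device_name(0)
    except ImportError:
        info["pytorch"] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

Running this once in the Colab session used for training gives values that can be copied into the experiments table.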

## Future direction (1).

## Github history (2)

## Overleaf history (2)
## (Bonus+1) 
- pre-trained foundation models beyond ImageNet-pretrained CNNs (distillation, adaptation, pseudo-labeling, baseline, etc.), e.g. CLIP, BERT, RoBERTa
- It would be good to choose a foundation model such as CLIP… Stable Diffusion… se(segment efficient)m2,...

# Implementing the basic CLIP model structure

In [None]:
pip install git+https://github.com/openai/CLIP.git

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from clip import clip
from clip.simple_tokenizer import SimpleTokenizer as _Tokenizer
import os

# Initialize the tokenizer
_tokenizer = _Tokenizer()

class TextEncoder(nn.Module):
    def __init__(self, clip_model):
        super().__init__()
        # Reuse the components of CLIP's text encoder
        self.transformer = clip_model.transformer
        self.positional_embedding = clip_model.positional_embedding
        self.ln_final = clip_model.ln_final
        self.text_projection = clip_model.text_projection
        self.dtype = clip_model.dtype

    def forward(self, prompts, tokenized_prompts):
        x = prompts + self.positional_embedding.type(self.dtype)
        x = x.permute(1, 0, 2)
        x = self.transformer(x)
        x = x.permute(1, 0, 2)
        x = self.ln_final(x).type(self.dtype)
        x = x[torch.arange(x.shape[0]), tokenized_prompts.argmax(dim=-1)] @ self.text_projection
        return x

def load_clip_to_cpu(model_name="ViT-B/16"):
    # Directory where the downloaded model weights are cached
    root = os.path.expanduser("~/.cache/clip")

    # Load the CLIP model directly onto the CPU
    model, preprocess = clip.load(model_name, device="cpu", download_root=root)

    # Switch to evaluation mode
    model = model.eval()
    
    return model

def initialize_clip():
    # Load the CLIP model
    clip_model = load_clip_to_cpu()

    # Initialize the text encoder and the image encoder
    text_encoder = TextEncoder(clip_model)
    image_encoder = clip_model.visual

    # Move both encoders to the GPU if one is available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    text_encoder = text_encoder.to(device)
    image_encoder = image_encoder.to(device)
    
    return text_encoder, image_encoder, device

# Define the transform used for basic data preprocessing
def get_transforms():
    return transforms.Compose([
        transforms.Resize((224, 224)),  # match CLIP's input size
        transforms.ToTensor(),
        transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                             (0.26862954, 0.26130258, 0.27577711))  # CLIP's default normalization values
    ])

# Usage example
if __name__ == "__main__":
    # Initialize the CLIP model
    text_encoder, image_encoder, device = initialize_clip()
    print(f"Device: {device}")
    print("CLIP model initialized successfully!")
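
The last step of `TextEncoder.forward` indexes `x` with `tokenized_prompts.argmax(dim=-1)`. This works because the end-of-text token has the highest id in CLIP's vocabulary (49407), so `argmax` over the token ids returns the end-of-text position of each prompt. The sketch below illustrates only that indexing step on dummy tensors. The shapes, the middle token ids, and the variable names are illustrative assumptions, and the final projection by `text_projection` is left out.

```python
import torch

# Illustrative sizes only: 2 prompts, context length 5, feature width 4.
batch, ctx_len, width = 2, 5, 4

# Stand-in for the transformer output after ln_final: (batch, ctx_len, width).
x = torch.arange(batch * ctx_len * width, dtype=torch.float32)
x = x.reshape(batch, ctx_len, width)

# CLIP's start-of-text and end-of-text ids; the ids in between are
# arbitrary placeholders, and 0 is the padding value.
SOT, EOT = 49406, 49407
tokenized_prompts = torch.tensor([
    [SOT, 320, 1929, EOT, 0],
    [SOT, 1237, EOT, 0, 0],
])

# EOT is the largest id in each row, so argmax gives its position.
eot_positions = tokenized_prompts.argmax(dim=-1)

# Pick one feature vector per prompt: the one at its EOT position.
pooled = x[torch.arange(batch), eot_positions]

print(eot_positions.tolist())  # [3, 2]
print(tuple(pooled.shape))     # (2, 4)
```

In the real encoder, `pooled` would then be multiplied by `text_projection` to obtain the final text embedding, as in the `forward` method above.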