<a href="https://colab.research.google.com/github/js1022003/DL_study/blob/main/Sample_Code_ipynb%EC%9D%98_%EC%82%AC%EB%B3%B8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

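# 2022 Software-Centered University Joint Deep Learning Challenge


This notebook is the sample code for the 2022 Software-Centered University Joint Deep Learning Challenge.
To run it, the challenge data from [Kaggle](https://www.kaggle.com/competitions/2022swunivchallenge/data) must be available to the notebook, either saved to Google Drive or downloaded with the Kaggle API as in the Data download section.

The sample code trains a model on the Train dataset and saves a prediction file (`submission.csv`) for the Test dataset. Submit the generated `submission.csv` to the challenge. The sample code is only a simple example, so feel free to write new code or to change the model and parameters.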
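
As a rough illustration of the file that gets submitted, the sketch below writes predictions in the two-column layout (`id_idx`, `label`) that the `prediction_submission` function later in this notebook uses. The helper name `write_submission` and the three toy predictions are assumptions made for the example; the label `-1` stands for a sample judged not to belong to any training class.

```python
import csv
import io

def write_submission(predictions, stream):
    """Write one row per test sample: its index and its predicted label."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["id_idx", "label"])
    for idx, label in enumerate(predictions):
        writer.writerow([idx, label])

# Toy predictions for three test samples (hypothetical values).
buffer = io.StringIO()
write_submission([3, -1, 17], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

Writing to an in-memory buffer keeps the example free of side effects; passing an open file handle instead would produce an actual `submission.csv`.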

---

The sample code is organized as follows.
1. [Import Package](#scrollTo=b_B7smvmLaqQ)
2. [Argument Setting](#scrollTo=RSdu8EhUP5Dn)
3. [Checking Challenge Dataset](#scrollTo=pKn2ce-oR3Ne)
4. [Model Configuration](#scrollTo=lG00xINwJymW)
5. [Train Model & Test](#scrollTo=O1og1N7OgbBR)


Here are some sites that may help with the challenge.

- [Deep Learning Zero To All:Pytorch](https://deeplearningzerotoall.github.io/season2/lec_pytorch.html): covers the basics of deep learning.
- [Dealing files Colab](https://neptune.ai/blog/google-colab-dealing-with-files): explains how to handle data in Colab.
- [pytorch tutorial](https://pytorch.org/tutorials/): covers the basics of PyTorch.
- [image classification tutorial](https://github.com/bentrevett/pytorch-image-classification): covers the basics of image classification.
- [timm github](https://github.com/rwightman/pytorch-image-models): a library that gives free access to recent pretrained state-of-the-art (SOTA) models.

That's all. Now start your awesome work! 😁

# **Data download**

In [None]:
!pip install kaggle
from google.colab import files
files.upload()

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"junseokimm","key":"<redacted>"}'}

In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle competitions download -c 2022swunivchallenge

Downloading 2022swunivchallenge.zip to /content
 94% 194M/205M [00:06<00:00, 22.7MB/s]
100% 205M/205M [00:06<00:00, 32.0MB/s]


In [None]:
!unzip 2022swunivchallenge.zip

Archive:  2022swunivchallenge.zip
  inflating: dataset/test/features.npy  
  inflating: dataset/train/features.npy  
  inflating: dataset/train/labels.npy  
  inflating: submission.csv          


## 1. Import Package
Import the packages used in the sample code.

In [None]:
import os
import time
import datetime
import easydict
import random
from pathlib import Path

import torch
import numpy as np
import pandas as pd
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import SGD, AdamW
from torch.utils.data import DataLoader

## 2. Argument Setting
Set the parameters used for training.  
**(Important) Change the dataset path to match your own environment.**
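
Because a wrong `data_path` only fails later when the arrays are loaded, it can help to check the directory layout first. The sketch below is one possible check, assuming the layout produced by unzipping the challenge archive (`train/features.npy`, `train/labels.npy`, `test/features.npy`). The helper name `missing_dataset_files` is hypothetical, and the demonstration runs against a temporary directory rather than the real dataset.

```python
from pathlib import Path
import tempfile

# Files the loader in this notebook reads, relative to data_path.
EXPECTED_FILES = [
    "train/features.npy",
    "train/labels.npy",
    "test/features.npy",
]

def missing_dataset_files(data_path):
    """Return the expected files that are absent under data_path."""
    root = Path(data_path)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]

# Demonstration on a throwaway directory that only contains the train split.
with tempfile.TemporaryDirectory() as tmp:
    train_dir = Path(tmp) / "train"
    train_dir.mkdir()
    (train_dir / "features.npy").touch()
    (train_dir / "labels.npy").touch()
    demo_missing = missing_dataset_files(tmp)

print(demo_missing)  # ['test/features.npy']
```

In the notebook itself, calling `missing_dataset_files(args.data_path)` and getting an empty list back would indicate that the path is set correctly.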

In [None]:
args = easydict.EasyDict({
    # device setting
    'device': 0,
    'seed' : 123,
    
    # training setting
    'batch_size' : 64,
    'num_workers' : 2,
    'epoch' : 20,
    
    # optimizer & criterion
    'lr' : 0.01,
    'momentum' : 0.9,
    'weight_decay' : 1e-4,
    'nesterov' : True,
    
    # directory
    'data_path' : '/content/dataset',
    'save_path' : '/content/submission',
    # etc
    'print_freq' : 30,
    'threshold' : 0.5,
})

def setup(args):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    if args.seed is not None:
        random.seed(args.seed)
        np.random.seed(args.seed)
        torch.random.manual_seed(args.seed)
    return device

## 3. Checking the Dataset
Load the data used for training and check how it is organized.
```
Number of train samples: 8,000         Shape of each train sample: (2048, 7, 7)
Number of test samples : 800           Shape of each test sample : (2048, 7, 7)
Number of classes: 80
```
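
To make these shapes concrete, here is a small sketch on random stand-in data (not the challenge data). It assumes the same layout as above, `(samples, 2048, 7, 7)`, and shows how the channel count is read from the array and how averaging over the 7x7 grid turns each sample into a 2048-dimensional vector, which is the operation `nn.AdaptiveAvgPool2d((1,1))` performs in the sample models.

```python
import numpy as np

# Random stand-in for a tiny batch of challenge features (4 samples instead of 8,000).
rng = np.random.default_rng(0)
features = rng.standard_normal((4, 2048, 7, 7)).astype(np.float32)

# Axis 1 is the channel axis; the notebook stores it as args.num_features.
num_features = features.shape[1]

# Global average pooling: average over the two spatial axes (7 x 7).
pooled = features.mean(axis=(2, 3))

print(features.shape, num_features, pooled.shape)
```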

In [None]:
def load_data(args, data_type:str='train'):
    """
    data_type(str): train or test
    """
    start = time.time()
    data_path = Path(args.data_path) / data_type
    features = np.load(data_path/'features.npy')
    if data_type == 'test':
        # dummy test labels: one placeholder per sample instead of a full copy of the feature array
        labels = np.zeros(len(features), dtype=np.int64)
    else:
        labels = np.load(data_path/'labels.npy')
    end = time.time()
    sec = end - start
    print(f"Completed Loading {data_type} data at {str(datetime.timedelta(seconds=sec)).split('.')[0]}")
    return features, labels

In [None]:
# Load train_data
train_data, train_label = load_data(args, 'train')

Completed Loading train data at 0:00:01


In [None]:
# Load test data
test_data, test_label = load_data(args, 'test')

Completed Loading test data at 0:00:00


In [None]:
# check dataset shape
args.num_features = train_data.shape[1]
args.num_classes = len(np.unique(train_label))
print(f"Train_shape: {train_data.shape}")
print(f"Test_shape: {test_data.shape}")
print(f"Number of classes: {len(np.unique(train_label))}")

Train_shape: (8000, 2048, 7, 7)
Test_shape: (800, 2048, 7, 7)
Number of classes: 80


In [None]:
#check label_data shape
print(train_label.shape)

(8000,)


## 4. Model configuration
* Dataset: a class that wraps the challenge dataset so it can be used with a DataLoader
* SampleModel: a Conv model with a BottleNeck structure
* SampleModel2: an arbitrarily designed Conv model
* SampleModel3: a simple classifier model
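
Since every input is only 7x7 spatially, it is worth checking how each convolution changes that size before the final pooling. The sketch below applies the standard output-size formula for convolution and pooling layers (dilation 1, no ceil mode), `floor((n + 2p - k) / s) + 1`, to the kernel, stride, and padding values used in `SampleModel` and `SampleModel2`. The helper name `conv_out` is an assumption for this example.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer with dilation 1."""
    return (size + 2 * padding - kernel) // stride + 1

def trace_sizes(size, layers):
    """Apply (kernel, stride, padding) layers in order and record each output size."""
    sizes = []
    for kernel, stride, padding in layers:
        size = conv_out(size, kernel, stride, padding)
        sizes.append(size)
    return sizes

# SampleModel: conv1 (1x1), conv2 (1x1), conv3 (3x3, padding 1), conv4 (1x1), all stride 1.
bottleneck_sizes = trace_sizes(7, [(1, 1, 0), (1, 1, 0), (3, 1, 1), (1, 1, 0)])

# SampleModel2: conv1 (3x3, stride 2, padding 2), maxpool (3x3, stride 1, padding 1),
# conv2 (3x3, stride 2, padding 1).
model2_sizes = trace_sizes(7, [(3, 2, 2), (3, 1, 1), (3, 2, 1)])

print(bottleneck_sizes)  # [7, 7, 7, 7]
print(model2_sizes)      # [5, 5, 3]
```

`SampleModel` keeps the 7x7 grid throughout, while `SampleModel2` shrinks it to 3x3 before the adaptive average pooling reduces it to 1x1.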


In [None]:
class Dataset:
    def __init__(self, features, labels, transform=None):
        """Basic Dataset Class
        
        :arg
            features: numpy array(features)
            labels: numpy array(labels)
        """
        self.features = features
        self.labels = labels
        self.classes = np.unique(self.labels)
        self.transform = transform

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        feature = self.features[idx]
        
        if self.transform:
            feature = self.transform(feature)
        
        label = self.labels[idx]
        return feature, label

class SampleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=2048, out_channels=4096, kernel_size=1, stride=1)
        self.bn1 = nn.BatchNorm2d(4096)
        self.relu = nn.ReLU(inplace=True)
        
        self.conv2 = nn.Conv2d(in_channels=4096, out_channels=1024, kernel_size=1, stride=1)
        self.bn2 = nn.BatchNorm2d(1024)
        self.conv3 = nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=3, stride=1, padding=1)
        self.bn3 = nn.BatchNorm2d(1024)  # conv3 has its own BatchNorm layer
        self.conv4 = nn.Conv2d(in_channels=1024, out_channels=4096, kernel_size=1, stride=1)
        self.bn4 = nn.BatchNorm2d(4096)

        self.avgpool = nn.AdaptiveAvgPool2d((1,1))

        self.fc1 = nn.Linear(4096, 80)

    def forward(self,x):
        # expand channels: 2048 -> 4096
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)

        # bottleneck: 4096 -> 1024
        x = self.conv2(x)
        x = self.bn2(x)
        x = self.relu(x)

        # 3x3 conv inside the bottleneck
        x = self.conv3(x)
        x = self.bn3(x)
        x = self.relu(x)

        # expand again: 1024 -> 4096
        x = self.conv4(x)
        x = self.bn4(x)
        x = self.relu(x)
        
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        
        x = self.fc1(x)
        return x
    
class SampleModel2(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=2048, out_channels=1024, kernel_size=3, stride=2, padding=2)
        self.bn1 = nn.BatchNorm2d(1024)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        
        self.conv2 = nn.Conv2d(in_channels=1024, out_channels=512, kernel_size=3, stride=2, padding=1)
        self.bn2 = nn.BatchNorm2d(512)
        
        self.avgpool = nn.AdaptiveAvgPool2d((1,1))
        self.fc1 = nn.Linear(512,80)
    
    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.maxpool(x)
        x = self.relu(x)
        
        x = self.conv2(x)
        x = self.bn2(x)
        x = self.relu(x)
        
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        return x
    
class SampleModel3(nn.Module):
    def __init__(self):
        super().__init__()
        self.avgpool = nn.AdaptiveAvgPool2d((1,1))
        self.fc1 = nn.Linear(2048, 80)
    
    def forward(self, x):
        x = self.avgpool(x)
        x = torch.flatten(x,1)
        x = self.fc1(x)
        return x



## 5. Train Model & Test
Train a sample model on the challenge data, predict the test data, and save the result as submission.csv.
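
The prediction step in this section (`prediction_mask`) assigns the label `-1` when the model's top softmax probability falls below `args.threshold`. The sketch below reproduces that rule for a single sample in plain Python so the effect of the threshold is easy to see. The function names and the toy logits are assumptions for illustration, not part of the notebook.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_threshold(logits, threshold=0.5):
    """Return the argmax class, or -1 when the top probability is below threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else -1

# One clearly dominant logit: the top probability is about 0.95, so class 0 is kept.
confident = predict_with_threshold([4.0, 0.5, 0.1])
# Nearly flat logits: the top probability is about 0.37, so the sample is labeled -1.
uncertain = predict_with_threshold([1.0, 0.9, 0.8])

print(confident, uncertain)  # 0 -1
```

Raising the threshold sends more samples to `-1`, and lowering it sends fewer, so it is a parameter worth tuning alongside the model.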

In [None]:
class Metric:
    def __init__(self, header='', fmt='{val:.4f} ({avg:.4f})'):
        """Base Metric Class 
        :arg
            fmt(str): format representing metric in string
        """
        self.val = 0
        self.sum = 0
        self.n = 0
        self.avg = 0
        self.header = header
        self.fmt = fmt

    def update(self, val, n=1):
        if isinstance(val, torch.Tensor):
            val = val.detach().clone()

        self.val = val
        self.sum += val * n
        self.n += n
        self.avg = self.sum / self.n

    def compute(self):
        return self.avg

    def __str__(self):
        return self.header + ' ' + self.fmt.format(**self.__dict__)
    
def train_one_epoch(model, train_dataloader, optimizer, criterion, epoch, args):
    # 1. create metric
    data_m = Metric(header='Data:')
    batch_m = Metric(header='Batch:')
    loss_m = Metric(header='Loss:')

    # 2. start training
    model.train()

    total_iter = len(train_dataloader)
    start_time = time.time()

    for batch_idx, (x, y) in enumerate(train_dataloader):
        batch_size = x.size(0)

        x = x.to(args.device)
        y = y.to(args.device)

        data_m.update(time.time() - start_time)

        y_hat = model(x)
        loss = criterion(y_hat, y)

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        loss_m.update(loss, batch_size)

        if batch_idx and args.print_freq and batch_idx % args.print_freq == 0:
            num_digits = len(str(total_iter))
            print(f"TRAIN({epoch:03}): [{batch_idx:>{num_digits}}/{total_iter}] {batch_m} {data_m} {loss_m}")

        batch_m.update(time.time() - start_time)
        start_time = time.time()

    # 3. calculate metric
    duration = str(datetime.timedelta(seconds=batch_m.sum)).split('.')[0]
    data = str(datetime.timedelta(seconds=data_m.sum)).split('.')[0]
    f_b_o = str(datetime.timedelta(seconds=batch_m.sum - data_m.sum)).split('.')[0]
    loss = loss_m.compute()

    # 4. print metric
    space = 16
    num_metric = 5
    print('-'*space*num_metric)
    print(("{:>16}"*num_metric).format('Stage', 'Batch', 'Data', 'F+B+O', 'Loss'))
    print('-'*space*num_metric)
    print(f"{'TRAIN('+str(epoch)+')':>{space}}{duration:>{space}}{data:>{space}}{f_b_o:>{space}}{loss:{space}.4f}")
    print('-'*space*num_metric)

    return loss

def prediction_mask(output, args):
    """
    학습된 모델이 예측한 값이 Threshold 값을 넘지 못하면 Train에 없던 데이터로 판단하여 -1 라벨로 예측하도록 지정
    """
    prediction = torch.argsort(output, dim=-1, descending=True)
    mask = F.softmax(output).max(dim=-1)[0] < args.threshold
    prediction[mask,0] = -1
    prediction = prediction[:, :min(1, output.size(1))].squeeze(-1).tolist()
    return prediction

def prediction_submission(predict, args):
    """
    테스트 데이터의 idx와 예측한 prediction label을 submission.csv로 저장
    """
    submission = [[idx, label] for idx, label in enumerate(predict)]
    df = pd.DataFrame(data=submission, columns=['id_idx', 'label'], index=None)
    args.save_path = Path(args.save_path)
    args.save_path.mkdir(exist_ok=True)
    df.to_csv(args.save_path / 'submission.csv', index=False)

def test_submission(model, test_dataloader, args):
    """
    학습한 모델로 테스트 데이터의 라벨을 예측하고 그 결과를 submission.csv로 저장
    """
    model_predict = [] # for submission.csv
    for x,y in test_dataloader:
        x = x.to(args.device)
        y = y.to(args.device)
        output = model(x)
        prediction = prediction_mask(output, args)
        model_predict.extend(prediction)
    prediction_submission(model_predict, args)

In [None]:
import torchvision.models as models

In [None]:
def run(args):
    start = time.time()
    
    # 1. load train, test dataset
    train_dataset = Dataset(train_data, train_label)
    test_dataset = Dataset(test_data, test_label)
    train_dataloader = DataLoader(train_dataset, batch_size=args.batch_size, shuffle=True, num_workers=args.num_workers)
    test_dataloader = DataLoader(test_dataset, batch_size=args.batch_size, shuffle=False, num_workers=args.num_workers)
    
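    # 2. create model
    # torchvision has no `models.mobilenet_v3()` (only mobilenet_v3_small / mobilenet_v3_large),
    # and those networks expect 3-channel images rather than (2048, 7, 7) feature maps,
    # so use one of the sample models defined above.
    model = SampleModel().to(args.device)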

    # 3. optimizer, criterion
    # optimizer = SGD(model.parameters(), lr=args.lr, momentum=args.momentum, weight_decay=args.weight_decay, nesterov=args.nesterov)
    optimizer = AdamW(model.parameters())
    criterion = nn.CrossEntropyLoss()
    
    # 4. train (this sample does not hold out a validation split)
    for epoch in range(args.epoch):
        train_loss = train_one_epoch(model, train_dataloader, optimizer, criterion, epoch, args)
    
    test_submission(model, test_dataloader, args)
    end = time.time()
    sec = end - start
    print(f"Finished Training & Test at {str(datetime.timedelta(seconds=sec)).split('.')[0]} ....")

In [None]:
args.device = setup(args)  # use the detected device (GPU if available, otherwise CPU)
run(args)

AttributeError: ignored

In [None]:
# check submission.csv
pd.read_csv(args.save_path / 'submission.csv', index_col=False)