# 価格予測モデルのBaseline  
- CNNを用いたモデルを作成する．  
- 価格予測とクラス分類でタスクが大きく異なるので，imagenetで学習したモデルを用いないものを最初に作成する．  
- サイトに載せられる画像を教師データとしており，画像が大きく回転したりなどは不要と考えられるためそのような前処理は行わない．  
- 損失関数にはmaeもしくはrmseを用いる．  

## モデルの構築  
- EfficientNetB0（未学習）を用いて特徴量を抽出．  
- num_sales, コレクション名のone-hotベクトルを抽出した特徴量に結合．  
- 全結合層を重ねて出力．  
- ImageNetを用いて事前学習したものとしていないもので比較する．  
- 目的変数をそのまま予測するとスケールが大きすぎるので，先に対数変換して評価関数にRMSE, MAEなどを用いるほうが良いかも．  
- **このノートブックでやっているのは事前学習有り．**  

## 評価関数  
- RMSLEを用いる．  
$$RMSLE = \sqrt{\frac{1}{n}\sum_{i=1}^n (\log{(y_i+1)} - \log{(\hat{y_i} +1)})^2}$$

- 追加でMAPEを用いてみる．  
$$MAPE = \frac{100}{n} \sum_{i=1}^n |\frac{\hat{y}_i - y_i}{y_i}|$$

タスクAに関してはデータ不足の可能性が考えられるため，特徴量抽出とともにデータを追加で収集する．  

## 変数（タスクA）  
- 目的変数: last_sale.total_price  
- 説明変数: 画像データ，コレクション名（collection.name），num_sales，

## 変数（タスクB）  
- 目的変数: last_sale.total_price  
- 説明変数: 画像データ  

In [1]:
import os
from typing import List, Optional, Tuple, Dict
import math
import tempfile
import random

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
import cv2
import tensorflow as tf
import tensorflow.keras.layers as layers
import tensorflow.keras.models as models
import tensorflow.keras.losses as losses
import tensorflow.keras.optimizers as optim
import tensorflow.keras.activations as activations
from tensorflow.keras.utils import Sequence
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor
import tensorflow.keras.callbacks as callbacks
from tensorflow.keras.applications import EfficientNetB0 as efn
import cloudpickle

In [2]:
A_IMGPATH = "../data/taskA/img"
A_DFPATH = "../data/taskA/table"
B_IMGPATH = "../data/taskB/img"
B_DFPATH = "../data/taskB/table"
asset_df_A = pd.read_csv(os.path.join(A_DFPATH, "asset_data.csv"))
asset_df_B = pd.read_csv(os.path.join(B_DFPATH, "asset_data.csv"))

asset_df_A = asset_df_A.rename(columns={"last_sale.total_price": "target"})
asset_df_B = asset_df_B.rename(columns={"last_sale.total_price": "target"})

asset_df_A = pd.concat((asset_df_A, pd.get_dummies(asset_df_A["collection.name"])), axis=1)
asset_df_B[asset_df_A["collection.name"].unique()] = 0

asset_df_A["full_path"] =\
    asset_df_A["image_id"].apply(lambda x: A_IMGPATH + "/" + x)
asset_df_B["full_path"] =\
    asset_df_B["image_id"].apply(lambda x: B_IMGPATH + "/" + x)

asset_df_A['target'] = asset_df_A['target'].astype(float) * 1e-18
asset_df_B['target'] = asset_df_B['target'].astype(float) * 1e-18
asset_df_A = asset_df_A.query('target > 0')
asset_df_B = asset_df_B.query('target > 0')
asset_df_A['target'] = asset_df_A['target'].apply(lambda x: np.log1p(x))
asset_df_B['target'] = asset_df_B['target'].apply(lambda x: np.log1p(x))

os.makedirs("../models", exist_ok=True)

print(f"data shape: {asset_df_A.shape}")
print(f"data shape: {asset_df_B.shape}")

  exec(code_obj, self.user_global_ns, self.user_ns)
  exec(code_obj, self.user_global_ns, self.user_ns)


data shape: (21747, 169)
data shape: (5200, 177)


## Helper Functions  

### DataLoader  

In [3]:
class FullPathDataLoader(Sequence):
    """
    Data loader that load images, meta data and targets.
    This class is inherited Sequence class of Keras.
    """

    def __init__(self, path_list: np.ndarray, target: Optional[np.ndarray],
                 meta_data: Optional[np.ndarray] = None, batch_size: int = 16,
                 task: str = "B", width: int = 256, height: int = 256,
                 resize: bool = True, shuffle: bool = True, is_train: bool = True):
        """
        Constructor. This method determines class variables.

        Parameters
        ----------
        path_list : np.ndarray[str]
            The array of absolute paths of images.
        meta_data : np.ndarray[int]
            One-hot vector of collections.
        target : np.ndarray
            Array of target variavles.
        batch_size : int
            Batch size used when model training.
        task : str
            Please determine this data loader will be used for task A or B(default=A).
        width : int
            Width of resized image.
        height : int
            Height of resize image.
        resize : bool
            Flag determine whether to resize.
        shuffle : bool
            Flag determine whether to shuffle on epoch end.
        is_train : bool
            Determine whether this data loader will be used training model.
            if you won't this data loader, you have set 'is_train'=False.
        """
        self.path_list = path_list
        self.batch_size = batch_size
        self.task = task
        self.width = width
        self.height = height
        self.resize = resize
        self.shuffle = shuffle
        self.is_train = is_train
        self.length = math.ceil(len(self.path_list) / self.batch_size)

        if self.is_train:
            self.target = target
        if self.task == "A":
            self.meta_data = meta_data

    def __len__(self):
        """
        Returns
        -------
        self.length : data length
        """
        return self.length

    def get_img(self, path_list: np.ndarray):
        """
        Load image data and resize image if 'resize'=True.

        Parameters
        ----------
        path_liist : np.ndarray
            The array of relative image paths from directory 'dir_name'.
            Size of this array is 'batch_size'.

        Returns
        -------
        img_list : np.ndarray
            The array of image data.
            Size of an image is (width, height, 3) if 'resize'=True.
        '"""
        img_list = []
        for path in path_list:
            img = cv2.imread(path)
            img = cv2.resize(img, (self.width, self.height))
            img = img / 255.
            img_list.append(img)

        img_list = np.array(img_list)
        return img_list

    def _shuffle(self):
        """
        Shuffle path_list, meta model.
        If 'is_train' is True, target is shuffled in association path_list.
        """
        idx = np.random.permutation(len(self.path_list))
        self.path_list = self.path_list[idx]
        if self.task == "A":
            self.meta_data = self.meta_data[idx]
        if self.is_train:
            self.target = self.target[idx]

    def __getitem__(self, idx):
        path_list = self.path_list[self.batch_size*idx:self.batch_size*(idx+1)]
        img_list = self.get_img(path_list)
        if self.is_train:
            target_list = self.target[self.batch_size*idx:self.batch_size*(idx+1)]
            if self.task == "A":
                meta = self.meta_data[self.batch_size*idx:self.batch_size*(idx+1)]
                return (img_list, meta), target_list
            else:
                return img_list, target_list
        else:
            if self.task == "A":
                meta = self.meta_data[self.batch_size*idx:self.batch_size*(idx+1)]
                return ((img_list, meta),)
            else:
                return img_list

    def on_epoch_end(self):
        if self.is_train:
            self._shuffle()

### seed settings  

In [4]:
def set_seed(random_state=6174):
    tf.random.set_seed(random_state)
    np.random.seed(random_state)
    random.seed(random_state)
    os.environ['PYTHONHASHSEED'] = str(random_state)

### Create model  

In [5]:
def create_model(input_shape: Tuple[int], output_shape: int,
                 activation, loss, meta_shape: Optional[int] = None,
                 task: str = "B", learning_rate: float = 0.001,
                 pretrain: bool = False) -> models.Model:
    """
    The function for creating model.

    Parameters
    ----------
    input_shape : int
        Shape of input image data.
    output_shape : int
        Shape of model output.
    activation : function
        The activation function used hidden layers.
    loss : function
        The loss function of model.
    meta_shape : int
        Shape of input meta data of image.
    task : str
        Please determine this model will be used for task A or B(default=A).
    learning_rate : float
        The learning rate of model.
    pretrain : bool
        Flag that deterimine whether use pretrain model(default=False).

    Returns
    -------
    model : keras.models.Model
        Model instance.
    """
    if pretrain:
        weights = 'imagenet'
    else:
        weights = None

    inputs = layers.Input(shape=input_shape)
    efn_model = efn(include_top=False, input_shape=input_shape,
                    weights=weights)(inputs)
    ga = layers.GlobalAveragePooling2D()(efn_model)

    if task == "A":
        meta_inputs = layers.Input(shape=meta_shape)
        concate = layers.Concatenate()([ga, meta_inputs])
        dense1 = layers.Dense(units=128)(concate)
        av1 = layers.Activation(activation)(dense1)
        dr1 = layers.Dropout(0.3)(av1)
        dense2 = layers.Dense(units=64)(dr1)
        av2 = layers.Activation(activation)(dense2)
        dr2 = layers.Dropout(0.3)(av2)
        outputs = layers.Dense(output_shape)(dr2)

        model = models.Model(inputs=[inputs, meta_inputs], outputs=[outputs])

    elif task == "B":
        dense1 = layers.Dense(units=128)(ga)
        av1 = layers.Activation(activation)(dense1)
        dr1 = layers.Dropout(0.3)(av1)
        dense2 = layers.Dense(units=64)(dr1)
        av2 = layers.Activation(activation)(dense2)
        dr2 = layers.Dropout(0.3)(av2)
        outputs = layers.Dense(output_shape)(dr2)

        model = models.Model(inputs=[inputs], outputs=[outputs])

    else:
        raise Exception("Please set task is A or B.")

    model.compile(loss=loss,
                  optimizer=optim.SGD(learning_rate=learning_rate, momentum=0.9),
                  metrics=['mae', 'mse'])
    return model

### Training model  

In [6]:
def train(path_list: np.ndarray, target: np.ndarray, loss,
          meta_data: Optional[np.ndarray] = None, task: str = "B"):
    """
    The function for training model.

    Parameters
    ----------
    path_list : np.ndarray
        The path list of all image data.
    target : np.ndarray
        The array of targets data.
    loss : function
        The loss function of keras.
    meta_data : np.ndarray
        The array of meta data of image.
    task : str
        Please determine you train model for task A or B(default=A).
    """
    if task == "A":
        train_path, val_path, train_meta, val_meta, train_y, val_y =\
            train_test_split(path_list, meta_data, target, test_size=0.1, random_state=6174)
        train_gen = FullPathDataLoader(path_list=train_path, target=train_y,
                                       meta_data=train_meta, batch_size=16,
                                       task=task)
        val_gen = FullPathDataLoader(path_list=val_path, target=train_y,
                                     meta_data=val_meta, batch_size=1,
                                     task=task)
    elif task == "B":
        train_path, val_path, train_y, val_y =\
            train_test_split(path_list, target, test_size=0.1, random_state=6174)
        train_gen = FullPathDataLoader(path_list=train_path, target=train_y,
                                       batch_size=16, task=task)
        val_gen = FullPathDataLoader(path_list=val_path, target=val_y,
                                     batch_size=1, task=task)
    else:
        raise Exception("Please set task is A or B")

    set_seed()
    model = NFTModel(
        create_model(input_shape=(256, 256, 3), output_shape=1,
                     activation=activations.relu, loss=loss,
                     meta_shape=len(meta_features), task=task,
                     learning_rate=0.00001, pretrain=True)
    )

    ES = callbacks.EarlyStopping(monitor='val_loss', patience=10,
                                 restore_best_weights=True)

    print("starting training")
    print('*' + '-' * 30 + '*')

    model.fit(train_gen, val_gen, epochs=100, batch_size=16,
              callbacks=[ES])

    print("finished training")
    print('*' + '-' * 30 + '*' + '\n')

    if task == "A":
        val_gen = FullPathDataLoader(path_list=val_path, target=train_y,
                                     meta_data=val_meta, batch_size=1, task=task,
                                     shuffle=False, is_train=False)
    else:
        val_gen = FullPathDataLoader(path_list=val_path, target=train_y,
                                     batch_size=1, task=task,
                                     shuffle=False, is_train=False)
    print("starting evaluate")
    print('*' + '-' * 30 + '*')

    model.evaluate(val_gen, val_y)

    print("finished evaluate")
    print('*' + '-' * 30 + '*' + '\n')

    return model

In [7]:
class NFTModel(KerasRegressor):
    """
    Model class.
    This class is inherited KerasRegressor class of keras.
    """

    def __init__(self, model_func):
        """
        Constructor.

        Prameters
        ---------
        model_func : function
            The function for creating model.
        """
        super().__init__(build_fn=model_func)

    def __getstate__(self):
        result = {'sk_params': self.sk_params}
        with tempfile.TemporaryDirectory() as dir:
            if hasattr(self, 'model'):
                self.model.save(dir + '/output.h5', include_optimizer=False)
                with open(dir + '/output.h5', 'rb') as f:
                    result['model'] = f.read()
        return result

    def __setstate__(self, serialized):
        self.sk_params = serialized['sk_params']
        with tempfile.TemporaryDirectory() as dir:
            model_data = serialized.get('model')
            if model_data:
                with open(dir + '/input.h5', 'wb') as f:
                    f.write(model_data)
                self.model = tf.keras.models.load_model(dir + '/input.h5')

    def fit(self, train_gen, val_gen, epochs, batch_size, callbacks):
        """
        Training model.

        Parameters
        ----------
        train_gen : iterator
            The generator of train data.
        val_gen : iterator
            The generator of validation data.
        epochs : int
            Number of epochs for training model.
        batch_size : int
            Size of batch for training model.
        callbacks : list
            The list of callbacks.
            For example [EarlyStopping instance, ModelCheckpoint instance]
        """
        self.model = self.build_fn
        self.model.fit(train_gen, epochs=epochs, batch_size=batch_size,
                       validation_data=val_gen, callbacks=callbacks)

    def evaluate(self, test_X, test_y):
        """
        Evaluate model.

        Parameters
        ----------
        test_X : iterator
            The generator of test data.
        test_y : np.ndarray
            The array of targets of test data.
        """
        pred = self.model.predict(test_X)
        pred = np.where(pred < 0, 0, pred)
        rmse = np.sqrt(mean_squared_error(test_y, pred))
        mae = np.sqrt(mean_absolute_error(test_y, pred))

        print(f"RMSE Score: {rmse}")
        print(f"MAE Score: {mae}")

    def predict(self, img_path: str, collection_name: str, num_sales: int,
                task: str = "B"):
        """
        Predict data using trained model.

        Parameters
        ----------
        img_path : str
            The path of image data.
        collection_name : str
            Name of collection of the NFT.
        num_sales : int
            Number of times the NFT sold.
        """
        if task == "A":
            collections = ['CryptoPunks',
                           'Bored Ape Yacht Club',
                           'Edifice by Ben Kovach',
                           'Mutant Ape Yacht Club',
                           'The Sandbox',
                           'Divine Anarchy',
                           'Cosmic Labs',
                           'Parallel Alpha',
                           'Art Wars | AW',
                           'Neo Tokyo Identities',
                           'Neo Tokyo Part 2 Vault Cards',
                           'Cool Cats NFT',
                           'CrypToadz by GREMPLIN',
                           'BearXLabs',
                           'Desperate ApeWives',
                           'Decentraland',
                           'Neo Tokyo Part 3 Item Caches',
                           'Doodles',
                           'The Doge Pound',
                           'Playboy Rabbitars Official',
                           'THE SHIBOSHIS',
                           'THE REAL GOAT SOCIETY',
                           'Sipherian Flash',
                           'Party Ape | Billionaire Club',
                           'Treeverse',
                           'Angry Apes United',
                           'CyberKongz',
                           'Emblem Vault [Ethereum]',
                           'Fat Ape Club',
                           'VeeFriends',
                           'JUNGLE FREAKS BY TROSLEY',
                           'Meebits',
                           'Furballs.com Official',
                           'Kaiju Kingz',
                           'Bears Deluxe',
                           'PUNKS Comic',
                           'Hor1zon Troopers',
                           'Lazy Lions',
                           'LOSTPOETS',
                           'Chain Runners',
                           'Chromie Squiggle by Snowfro',
                           'MekaVerse',
                           'Vox Collectibles',
                           'MutantCats',
                           'World of Women',
                           'SuperFarm Genesis Series',
                           'Eponym by ART AI',]
            collection_dict = {
                 collections[i]: i for i in range(len(collections))
            }
            meta_data = np.zeros(shape=(len(collection_dict)+1))
            if collection_name in collection_dict.keys():
                meta_data[collection_dict[collection_name]] = 1
            meta_data[-1] = num_sales
            meta_data = meta_data.reshape(1, -1)

            img = cv2.resize(cv2.imread(img_path)/256., (256, 256))
            img = img.reshape(1, 256, 256, 3)

            pred = self.model.predict([img, meta_data])
        elif task == "B":
            img = cv2.resize(cv2.imread(img_path)/256., (256, 256))
            img = img.reshape(1, 256, 256, 3)

            pred = self.model.predict(img)
        else:
            raise Exception("Please set task is A or B")

        return pred[0][0]

In [8]:
def save_model(instance, file_name: str):
    """
    Save model as pickle file

    Parameters
    ----------
    instance : Class instance
        The class instance you want to save as pickle file.
    file_name : str
        The absolute path of file saved the instance.
    """
    with open(file_name, mode='wb') as f:
        cloudpickle.dump(instance, f)

In [9]:
def load_model(file_name: str):
    """
    Load the model file of pickle.

    Parameters
    ----------
    file_name : str
        The absolute path of the model file.

    Returns
    -------
    model : tf.keras.models.Model
        Trained model object.
    """
    with open(file_name, mode='rb') as f:
        model = cloudpickle.load(f)

    return model

## Training models  

### TaskA

In [10]:
meta_features =\
    asset_df_A['collection.name'].unique().tolist() + ['num_sales']

path_list = asset_df_A['full_path'].values
meta_data = asset_df_A[meta_features].values
target = asset_df_A['target'].values

model_A = train(path_list, target, losses.mean_squared_error, meta_data,
                task="A")
# save_model(model_A, "../models/baselineA.pkl")

2021-11-14 08:31:09.150668: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-14 08:31:09.155139: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-14 08:31:09.155625: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-14 08:31:09.156568: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compil

starting training
*------------------------------*
Epoch 1/100


2021-11-14 08:31:10.735273: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2021-11-14 08:31:14.745896: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8204
2021-11-14 08:31:16.235274: I tensorflow/stream_executor/cuda/cuda_blas.cc:1760] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.


Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
finished training
*------------------------------*

starting evaluate
*------------------------------*
RMSE Score: 1.4561724464322845
MAE Score: 0.9523819673445525
finished evaluate
*------------------------------*



### TaskA（画像のみ）  

In [11]:
path_list = asset_df_A['full_path'].values
target = asset_df_A['target'].values

model_A = train(path_list, target, losses.mean_squared_error,
                task="B")

starting training
*------------------------------*
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
finished training
*------------------------------*

starting evaluate
*------------------------------*
RMSE Score: 1.16349655089246
MAE Score: 0.8567837679315085
finished evaluate
*------------------------------*



### TaskB

In [12]:
path_list = asset_df_B['full_path'].values
target = asset_df_B['target'].values

model_B = train(path_list, target, losses.mean_squared_error)
# save_model(model_B, "../models/baselineB.pkl")

starting training
*------------------------------*
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
finished training
*------------------------------*

starting evaluate
*------------------------------*
RMSE Score: 0.5453789268884527
MAE Score: 0.5304499481845472
finished evaluate
*------------------------------*



## Evaluate model  

### Task A

### Task B