## Regression example

Let's work through the Keras ['predict fuel efficiency'](https://www.tensorflow.org/tutorials/keras/basic_regression?hl=ko) tutorial.

We will solve a regression problem: predicting fuel efficiency from the 'Auto MPG dataset'.

*Original text*<br>
In a regression problem, we aim to predict the output of a continuous value, like a price or a probability. Contrast this with a classification problem, where we aim to predict a discrete label (for example, where a picture contains an apple or an orange).

This notebook uses the classic Auto MPG Dataset and builds a model to predict the fuel efficiency of late-1970s and early 1980s automobiles. To do this, we'll provide the model with a description of many models from that time period. This description includes attributes like: cylinders, displacement, horsepower, and weight.

This example uses the tf.keras API, see this guide for details.

In [1]:
#Setup
from __future__ import absolute_import, print_function, division

import tensorflow as tf
from tensorflow import keras
tf.enable_eager_execution()
tfe = tf.contrib.eager

import pandas as pd
import matplotlib.pyplot as plt
import time
import numpy as np

print(tf.__version__)


1.12.0


## Auto MPG dataset

Load the dataset from the UCI Machine Learning Repository.<br>
Dataset documentation: https://archive.ics.uci.edu/ml/datasets/auto+mpg

In [2]:
# Download the data.
dataset_path = keras.utils.get_file("auto-mpg.data", "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data")
dataset_path

'C:\\Users\\home\\.keras\\datasets\\auto-mpg.data'

Load the downloaded data into a `pandas.DataFrame`.

In [3]:
column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight',
                'Acceleration', 'Model Year', 'Origin'] 

raw_dataset = pd.read_csv(dataset_path, names=column_names,
                          na_values="?",  # treat "?" as a missing value (NaN)
                          comment='\t',  # the rest of the line after a tab is not parsed
                          sep=" ", skipinitialspace=True) 
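
The effect of these `read_csv` options can be checked on a small in-memory sample. The sketch below assumes two lines laid out like the rows of `auto-mpg.data` (space-padded columns, then a tab and a quoted name); the sample text and the names `sample_text` and `sample_df` are illustrative and not part of the notebook.

```python
import io

import pandas as pd

column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']

# Illustrative sample (assumed layout): padded columns, then a tab and a name.
sample_text = (
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"car a"\n'
    '25.0   4   98.00      ?          2046.      19.0   71  1\t"car b"\n'
)

sample_df = pd.read_csv(io.StringIO(sample_text), names=column_names,
                        na_values="?",   # "?" is read as NaN
                        comment='\t',    # the tab and everything after it is ignored
                        sep=" ",         # columns are separated by spaces
                        skipinitialspace=True)  # skip padding spaces after a separator

print(sample_df)
print(sample_df.isna().sum()['Horsepower'])
```

Without `comment='\t'` the trailing name would be parsed as an extra field, and without `na_values="?"` the `Horsepower` column would be read as strings.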

In [4]:
dataset = raw_dataset.copy()
dataset.head()

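| | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | Origin |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 |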


### Handling null values

Check whether the dataset contains null values, then remove them.

In [5]:
dataset.isna().sum()

MPG             0
Cylinders       0
Displacement    0
Horsepower      6
Weight          0
Acceleration    0
Model Year      0
Origin          0
dtype: int64

In [6]:
# Remove rows that contain null values
dataset = dataset.dropna()
dataset.isna().sum()

MPG             0
Cylinders       0
Displacement    0
Horsepower      0
Weight          0
Acceleration    0
Model Year      0
Origin          0
dtype: int64
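
As a small illustration of the two calls used here, the sketch below applies `isna().sum()` and `dropna()` to a made-up three-row frame; the values and the names `toy`, `missing_per_column` and `cleaned` are assumptions for the example only.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing Horsepower value (values are made up).
toy = pd.DataFrame({'MPG': [18.0, 25.0, 30.0],
                    'Horsepower': [130.0, np.nan, 70.0]})

missing_per_column = toy.isna().sum()  # number of NaN values in each column
cleaned = toy.dropna()                 # drops every row that contains a NaN

print(missing_per_column)
print(cleaned)
```

The surviving rows keep their original index labels; `reset_index(drop=True)` renumbers them if a contiguous index is needed.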

The Origin column is categorical, encoded as class numbers 1 to 3. Let's convert it to one-hot form.

In [7]:
origin = dataset.pop('Origin')

In [8]:
# Represent each origin as 0 or 1.
dataset['USA'] = (origin == 1) * 1.0
dataset['Europe'] = (origin == 2) * 1.0
dataset['Japan'] = (origin == 3) * 1.0
dataset.head()

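| | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | USA | Europe | Japan |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1.0 | 0.0 | 0.0 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1.0 | 0.0 | 0.0 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1.0 | 0.0 | 0.0 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1.0 | 0.0 | 0.0 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1.0 | 0.0 | 0.0 |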
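
The same one-hot encoding can also be produced with `pd.get_dummies`. This is an alternative sketch, not what the notebook does; the sample codes and the names `origin`, `names` and `one_hot` are illustrative, and the 1/2/3 to USA/Europe/Japan mapping is taken from the cell above.

```python
import pandas as pd

# Illustrative Origin codes; the mapping follows the columns created above.
origin = pd.Series([1, 2, 3, 1], name='Origin')
names = origin.map({1: 'USA', 2: 'Europe', 3: 'Japan'})

# get_dummies creates one indicator column per distinct value.
# astype(float) gives 0.0/1.0, and the column list fixes the column order.
one_hot = pd.get_dummies(names).astype(float)[['USA', 'Europe', 'Japan']]

print(one_hot)
```

Each row contains exactly one 1.0, which is what makes the encoding "one-hot".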


### Split the data into train and test

Split the loaded dataset into train and test datasets.<br>
`dataset.sample` makes it easy to split the dataset.

In [9]:
train_data = dataset.sample(frac=0.8, random_state=0)
test_data = dataset.drop(train_data.index)

In [10]:
print(train_data.shape, test_data.shape)

(314, 10) (78, 10)
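
To see why `sample` followed by `drop` yields a clean split, here is a minimal sketch on a made-up ten-row frame (the names `toy`, `train_part` and `test_part` are illustrative): the sampled rows and the remaining rows never overlap, and together they cover the whole frame.

```python
import pandas as pd

toy = pd.DataFrame({'x': range(10)})

train_part = toy.sample(frac=0.8, random_state=0)  # 80% of the rows, chosen at random
test_part = toy.drop(train_part.index)             # every row that was not sampled

overlap = set(train_part.index) & set(test_part.index)
covered = sorted(set(train_part.index) | set(test_part.index))

print(len(train_part), len(test_part), overlap)
```

Fixing `random_state` makes the split reproducible between runs.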


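Among the pandas functions, `DataFrame.describe()` gives a quick view of the summary statistics of the data.<br>
The result is transposed so that the `mean` and `std` statistics can be looked up as columns, indexed by feature name, in the normalization step below.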

In [11]:
train_stats = train_data.describe()
train_stats.pop("MPG")
train_stats = train_stats.transpose()
train_stats

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Cylinders | 314.0 | 5.477707 | 1.699788 | 3.0 | 4.0 | 4.0 | 8.0 | 8.0 |
| Displacement | 314.0 | 195.318471 | 104.331589 | 68.0 | 105.5 | 151.0 | 265.75 | 455.0 |
| Horsepower | 314.0 | 104.869427 | 38.096214 | 46.0 | 76.25 | 94.5 | 128.0 | 225.0 |
| Weight | 314.0 | 2990.251592 | 843.898596 | 1649.0 | 2256.5 | 2822.5 | 3608.0 | 5140.0 |
| Acceleration | 314.0 | 15.559236 | 2.78923 | 8.0 | 13.8 | 15.5 | 17.2 | 24.8 |
| Model Year | 314.0 | 75.898089 | 3.675642 | 70.0 | 73.0 | 76.0 | 79.0 | 82.0 |
| USA | 314.0 | 0.624204 | 0.485101 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| Europe | 314.0 | 0.178344 | 0.383413 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Japan | 314.0 | 0.197452 | 0.398712 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


### Split label dataset

Create the target dataset. The value we want to predict is 'MPG', so we separate that column out as the label dataset.

In [12]:
train_label = train_data.pop('MPG')
test_label = test_data.pop('MPG')

print(train_label.shape, test_label.shape)

(314,) (78,)


### Normalize the data

Looking at the data, each feature has a different range. When features have different scales and ranges like this, it is good practice to normalize them.<br>

Reference: https://brunch.co.kr/@rapaellee/4

Look again at the train_stats block above and note how different the ranges of each feature are.

It is good practice to normalize features that use different scales and ranges. Although the model might converge without feature normalization, it makes training more difficult, and it makes the resulting model dependent on the choice of units used in the input.

**Note**<br>
*We intentionally use the statistics from only the training set; these statistics will also be used for evaluation. This is so that the model doesn't have any information about the test set.*

In [13]:
def norm(x):
    return (x - train_stats['mean']) / train_stats['std']

In [14]:
normed_train_data = norm(train_data)
normed_test_data = norm(test_data)
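
A worked example may make the normalization step concrete. The sketch below uses made-up two-feature frames (`toy_train`, `toy_test` and `toy_stats` are illustrative names) and the same `(x - mean) / std` formula, with the statistics taken from the training frame only.

```python
import numpy as np
import pandas as pd

# Made-up values: Weight has mean 3000 and std 1000, Cylinders has mean 6 and std 2.
toy_train = pd.DataFrame({'Weight': [2000.0, 3000.0, 4000.0],
                          'Cylinders': [4.0, 6.0, 8.0]})
toy_test = pd.DataFrame({'Weight': [2500.0], 'Cylinders': [4.0]})

# After transpose(), each row is a feature and 'mean' / 'std' are columns.
toy_stats = toy_train.describe().transpose()

def toy_norm(x):
    # pandas aligns the Series of means and stds with the frame's columns by name.
    return (x - toy_stats['mean']) / toy_stats['std']

normed_toy_train = toy_norm(toy_train)
normed_toy_test = toy_norm(toy_test)   # uses the training statistics, not its own

print(normed_toy_train)
print(normed_toy_test)
```

The test row is scaled with the training mean and standard deviation, so a weight of 2500 lands half a standard deviation below the training mean.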

What are the types of the training data and the labels at this point?

In [15]:
print(type(normed_train_data))
print(type(train_label))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


### Create `tf.data`

Let's convert the data fed to the model into a `tf.data` dataset. To build it, we first convert the data from a DataFrame to NumPy arrays.

In [16]:
np_tr_d = np.array(normed_train_data)
np_te_d = np.array(normed_test_data)

np_tr_l = np.array(train_label)
np_te_l = np.array(test_label)

In [22]:
tr_dataset = tf.data.Dataset.from_tensor_slices((
    tf.cast(np_tr_d, tf.float64),
    tf.cast(np_tr_l, tf.float64)))
#tr_dataset = tr_dataset.shuffle(1000)

test_dataset = tf.data.Dataset.from_tensor_slices((
    tf.cast(np_te_d, tf.float64),
    tf.cast(np_te_l, tf.float64)))

We can confirm that iterating over the `tf.data` dataset yields tensors, whereas iterating over `normed_train_data.values` yields NumPy arrays.

In [24]:
for x,y in tr_dataset.take(1):
    print(x.shape)
    print(type(x))
    print(x)
    
for x in normed_train_data.values:
    print(x.shape)
    print(type(x))
    print(x)
    break

(9,)
<class 'tensorflow.python.framework.ops.EagerTensor'>
tf.Tensor(
[-0.86934805 -1.0094591  -0.78405236 -1.0253028  -0.3797592  -0.51639657
  0.77467638 -0.46514837 -0.49522541], shape=(9,), dtype=float64)
(9,)
<class 'numpy.ndarray'>
[-0.86934805 -1.0094591  -0.78405236 -1.0253028  -0.3797592  -0.51639657
  0.77467638 -0.46514837 -0.49522541]


## Define the model

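Let's define the configuration of the model to train. Here we define the model in three different forms and train each of them.

Error?
The subclassed models raise an error on their input, whereas the model defined with a plain `def` works correctly with `model.fit`.

### The simple model

Let's define a simple regression equation, y = W*x + b.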

In [None]:
class first_model(tf.keras.Model):
    def __init__(self):
        super(first_model, self).__init__()
        self.W = tf.Variable(tf.random_normal([1,9], dtype=tf.float64))
        self.b = tf.Variable(tf.random_normal([1], dtype=tf.float64))
        
    def call(self, inputs):
        return tf.matmul(self.W, inputs[:,tf.newaxis]) + self.b 
    # Iterating over tr_dataset yields inputs of shape (9,), which makes tf.matmul fail, so a dimension is added.
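
The shape handling in `call` can be traced with plain NumPy. This is only a sketch of the shapes involved, using random values and illustrative names; unlike `np.matmul`, `tf.matmul` requires both arguments to have rank 2 or higher, which is why the extra axis is needed in the model.

```python
import numpy as np

rng = np.random.RandomState(0)
W = rng.normal(size=(1, 9))   # same shape as self.W
b = rng.normal(size=(1,))     # same shape as self.b
x = rng.normal(size=(9,))     # one example, as yielded when iterating the dataset

column = x[:, np.newaxis]     # (9,) -> (9, 1)
y = W @ column + b            # (1, 9) @ (9, 1) -> (1, 1), then b is broadcast-added

print(column.shape, y.shape)
```

The single entry of `y` is the dot product of the weight row with `x`, plus the bias.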

In [None]:
model = first_model()

In [None]:
for x,y in tr_dataset.take(1):
    print(model(x))

### The dense model

Let's define a dense model as a subclass.

In [30]:
class d_model(keras.Model):
    def __init__(self):
        super(d_model, self).__init__()
        self.dense1 = keras.layers.Dense(64, activation=tf.nn.relu, input_shape=(9,))
        self.dense2 = keras.layers.Dense(64, activation=tf.nn.relu)
        self.dense3 = keras.layers.Dense(1)
        
    def call(self, inputs):
        inputs = inputs[tf.newaxis,:]
        x = self.dense1(inputs)
        x = self.dense2(x)
        return self.dense3(x)

In [31]:
model_2 = d_model()

In [32]:
for x,y in tr_dataset.take(1):
    print(model_2(x))

tf.Tensor([[-0.18949315]], shape=(1, 1), dtype=float64)


### Original Model

Let's define the model with the tutorial's original code and train it with `fit`.

In [60]:
def o_model():
    model = keras.Sequential([
        keras.layers.Dense(64, activation=tf.nn.relu, input_shape=(9,)),
        keras.layers.Dense(64, activation=tf.nn.relu),
        keras.layers.Dense(1)
    ])
    
    model.compile(optimizer=tf.train.RMSPropOptimizer(0.001),
                 loss = 'mse',
                 metrics=['mse', 'mae'])
    return model

In [61]:
model_3 = o_model()

In [62]:
model_3.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_18 (Dense)             (None, 64)                640       
_________________________________________________________________
dense_19 (Dense)             (None, 64)                4160      
_________________________________________________________________
dense_20 (Dense)             (None, 1)                 65        
Total params: 4,865
Trainable params: 4,865
Non-trainable params: 0
_________________________________________________________________
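
The parameter counts in the summary can be reproduced by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical name used only for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# (inputs, units) for the three Dense layers: 9 features -> 64 -> 64 -> 1.
layer_sizes = [(9, 64), (64, 64), (64, 1)]

counts = [dense_params(n_in, n_out) for n_in, n_out in layer_sizes]
total = sum(counts)

print(counts, total)
```

The three counts match the `Param #` column above, and their sum matches `Total params`.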


### Original model training with feature and label data

In [63]:
start = time.time()
o_model_hist = model_3.fit(normed_train_data, train_label,
                           validation_data=(normed_test_data, test_label),
                           epochs=100, verbose=0)

print("Training end ", time.time()-start)

Training end  3.987733840942383


In [None]:
plt.figure(figsize=(12,8))
plt.plot(o_model_hist.epoch, o_model_hist.history['mean_squared_error'])
plt.plot(o_model_hist.epoch, o_model_hist.history['val_mean_squared_error'], '--')
plt.show()

### Define loss function and gradient function

In [None]:
def loss(model, inputs, outputs):
    _y = model(inputs)
    return tf.reduce_mean(tf.square(_y - outputs))

def grads(model, inputs, outputs):
    # Record the forward pass on the tape, then differentiate outside the context.
    with tf.GradientTape() as tape:
        loss_value = loss(model, inputs, outputs)
    return loss_value, tape.gradient(loss_value, [model.W, model.b])

def grads2(model, inputs, outputs):
    with tf.GradientTape() as tape:
        loss_value = loss(model, inputs, outputs)
    return loss_value, tape.gradient(loss_value, model.variables)
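
For the simple linear model, the gradient that the tape computes can be written out by hand. The NumPy sketch below is an illustration under that assumption (single example, squared-error loss); `predict`, `mse` and `analytic_grads` are hypothetical helper names, and the result is checked against a finite-difference estimate rather than against TensorFlow.

```python
import numpy as np

def predict(W, b, x):
    return W @ x[:, np.newaxis] + b          # shape (1, 1)

def mse(W, b, x, y):
    return float(np.mean((predict(W, b, x) - y) ** 2))

def analytic_grads(W, b, x, y):
    err = predict(W, b, x) - y               # shape (1, 1)
    grad_W = 2.0 * err * x[np.newaxis, :]    # shape (1, 9), same as W
    grad_b = 2.0 * err.reshape(1)            # shape (1,), same as b
    return grad_W, grad_b

rng = np.random.RandomState(0)
W, b = rng.normal(size=(1, 9)), rng.normal(size=(1,))
x, y = rng.normal(size=(9,)), 3.0

grad_W, grad_b = analytic_grads(W, b, x, y)

# Central finite difference on the first weight as a sanity check.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric = (mse(W_plus, b, x, y) - mse(W_minus, b, x, y)) / (2 * eps)

print(grad_W[0, 0], numeric)
```

The two printed numbers should agree closely, which is the same quantity `tape.gradient` returns for `model.W` in the simple model, up to floating-point error.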

### Define optimizer and global step

In [None]:
optimizer = tf.train.RMSPropOptimizer(0.001)

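# Note: get_or_create_global_step() returns the same variable on every call,
# so global_step2 is an alias of global_step rather than a second counter.
global_step = tf.train.get_or_create_global_step()
global_step2 = tf.train.get_or_create_global_step()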

loss_hist = []
loss_hist2 = []

In [None]:
start = time.time()

for epoch in range(100):
    # Create fresh metrics each epoch so that each history entry is a per-epoch mean.
    epoch_loss = tfe.metrics.Mean()
    epoch_loss2 = tfe.metrics.Mean()

    for x,y in tr_dataset:
        loss_value, grad = grads(model, x, y)
        loss_value2, grad2 = grads2(model_2, x, y)

        # Increment the step counter once per iteration; both step names refer to the same variable.
        optimizer.apply_gradients(zip(grad, [model.W, model.b]), global_step)
        optimizer.apply_gradients(zip(grad2, model_2.variables))

        epoch_loss(loss_value)
        epoch_loss2(loss_value2)

        # Compare the Python value of the step, and format NumPy scalars rather than tensors.
        if global_step.numpy() % 10 == 0:
            print("Step : {:3d}, Loss : {:.3f}".format(global_step.numpy(), loss_value.numpy()))

    loss_hist.append(epoch_loss.result().numpy())
    loss_hist2.append(epoch_loss2.result().numpy())

print("Training end ", time.time()-start)
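
A mean metric keeps a running average of every value passed to it, so whether one instance is shared across epochs or recreated each epoch changes what the loss history shows. The pure-Python sketch below illustrates the difference; `RunningMean` is a hypothetical stand-in written for this example (it is not the `tfe.metrics.Mean` implementation), and the loss values are made up.

```python
class RunningMean:
    """Hypothetical stand-in for a mean metric: average of all values seen."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += value
        self.count += 1

    def result(self):
        return self.total / self.count


epoch_losses = [[4.0, 2.0], [1.0, 1.0]]   # made-up per-step losses for two epochs

shared = RunningMean()        # one instance reused for every epoch
shared_history = []
fresh_history = []

for losses in epoch_losses:
    fresh = RunningMean()     # a new instance for each epoch
    for value in losses:
        shared.update(value)
        fresh.update(value)
    shared_history.append(shared.result())
    fresh_history.append(fresh.result())

print(shared_history, fresh_history)
```

The shared instance reports the average over all steps so far, which lags behind the current epoch's loss, while the fresh instance reports each epoch on its own.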

In [None]:
plt.plot(loss_hist, label='model1_loss')
plt.plot(loss_hist2, label='model2_loss')
plt.legend()
plt.show()

In [None]:
test_loss = tfe.metrics.Mean()
test_loss2 = tfe.metrics.Mean()

for x,y in test_dataset:
    loss_value = loss(model, x,y)
    loss_value2 = loss(model_2, x, y)
    
    test_loss(loss_value)
    test_loss2(loss_value2)
    
    
print("Test Loss : {}".format(test_loss.result()))
print("Test Loss2 : {}".format(test_loss2.result()))

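The passage below comes from the last part of the Keras documentation's page on models [(About Keras Model)](https://keras.io/models/about-keras-models/).<br>
Looking at the training results above, the subclassed models took longer to train than the model trained with `model.fit`. Note that the custom loop feeds one example per step, whereas `model.fit` uses batches of 32 by default, so part of the gap is probably due to batching rather than to subclassing itself.
It is therefore important to choose the API that suits the job at hand.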

Key point: use the right API for the job. The Model subclassing API can provide you with greater flexibility for implementing complex models, but it comes at a cost (in addition to these missing features): it is more verbose, more complex, and has more opportunities for user errors. If possible, prefer using the functional API, which is more user-friendly.