# Sect5. Logistic (regression) classifier

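### Logistic regression
> Wikipedia: Logistic regression

- Logistic regression is a probabilistic model proposed by D. R. Cox in 1958. It is a statistical technique that uses a linear combination of the independent variables to predict the likelihood that an event occurs.
- Unlike linear regression, logistic regression works with a categorical dependent variable. Because the result for a given input falls into a specific class, it can also be viewed as a classification technique.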

- $$ H(X) = W X $$

- $$ z = H(X) $$
- $$ g(z) = \frac{1}{1 + e^{-z}} $$

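### Logistic function
- The logistic model ensures that, whatever value in (-∞, ∞) the independent variable takes, the dependent variable (the output) always lies between 0 and 1.
- This is obtained by applying the logit transformation (the logarithm of the odds); the logistic function is the inverse of the logit.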
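To make the range claim concrete, here is a minimal sketch of the logistic function and its inverse in plain Python. The function names `sigmoid` and `logit` are illustrative choices rather than names from the lab code, and the two-branch form of `sigmoid` is only one common way to avoid overflow for large negative inputs.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function g(z) = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # safe: z < 0, so exp(z) < 1
    return ez / (1.0 + ez)

def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1); the inverse of sigmoid."""
    return math.log(p / (1.0 - p))

# Inputs far apart on the real line all map to values between 0 and 1.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(z, sigmoid(z))
```

Applying `logit` to the output of `sigmoid` recovers the original input up to rounding error, which is what "inverse" means here.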

### Logistic Regression Classifier
- Logistic Model : holehouse.org / Logistic Regression
- Cost Function : Logistic Regression Cost Function

$$ H(X) = \frac{1}{1 + e^{-W^T X}} $$

$$ cost(W) = -\frac{1}{m} \sum \left[ y \log(H(x)) + (1-y) \log(1 - H(x)) \right] $$
$$ W := W - \alpha \frac{\partial}{\partial W} cost(W) $$
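The cost and the update rule can be sketched in plain Python without TensorFlow. This is an illustration under stated assumptions: the toy data, the helper names (`predict`, `cost`, `gradient_step`), and the learning rate are hypothetical, and a bias term `b` is included because the lab code below also uses one. The gradient used is the standard result that the derivative of this cost with respect to each weight is the mean of `(H(x) - y) * x`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """H(x) = sigmoid(w . x + b) for one example."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def cost(w, b, xs, ys):
    """Mean cross-entropy: -(1/m) * sum(y*log(H) + (1-y)*log(1-H))."""
    total = 0.0
    for x, y in zip(xs, ys):
        h = predict(w, b, x)
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / len(xs)

def gradient_step(w, b, xs, ys, alpha):
    """One update W := W - alpha * d(cost)/dW, and likewise for b."""
    m = len(xs)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in zip(xs, ys):
        err = predict(w, b, x) - y  # derivative of the cost w.r.t. z = w.x + b
        for j, xj in enumerate(x):
            grad_w[j] += err * xj / m
        grad_b += err / m
    new_w = [wj - alpha * gj for wj, gj in zip(w, grad_w)]
    return new_w, b - alpha * grad_b

# Hypothetical toy data: one feature, label 1 for the larger values of x.
xs = [[1.0], [2.0], [3.0], [4.0]]
ys = [0, 0, 1, 1]
w, b = [0.0], 0.0
initial_cost = cost(w, b, xs, ys)
for _ in range(100):
    w, b = gradient_step(w, b, xs, ys, alpha=0.1)
final_cost = cost(w, b, xs, ys)
print(initial_cost, final_cost)
```

With all parameters at zero every prediction is 0.5, so the initial cost equals log 2; repeated updates then lower it.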

### In any case, minimizing the cost is how we optimize
- When y=1 and our prediction is correct, the cost is 0
H(x) = 1 → cost(1) = 0
- When y=1 and our prediction is wrong, the cost is ∞
H(x) = 0 → cost(0) = ∞

- When y=0 and our prediction is correct, the cost is 0
H(x) = 0 → cost(0) = 0

- When y=0 and our prediction is wrong, the cost is ∞
H(x) = 1 → cost(1) = ∞
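The four cases above can be checked numerically. In this sketch, `example_cost` and `eps` are illustrative names; `eps` stands in for "H(x) is almost exactly 0 or 1", because the logarithm is undefined at exactly 0.

```python
import math

def example_cost(h: float, y: int) -> float:
    """Cost of one prediction h = H(x) in (0, 1) for a label y in {0, 1}."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

eps = 1e-9  # stand-in for a prediction that is almost exactly 0 or 1

print(example_cost(1 - eps, 1))  # y=1, prediction right: close to 0
print(example_cost(eps, 1))      # y=1, prediction wrong: large
print(example_cost(eps, 0))      # y=0, prediction right: close to 0
print(example_cost(1 - eps, 0))  # y=0, prediction wrong: large
```

Shrinking `eps` further makes the "wrong" costs grow without bound, which is the sense in which the cost is ∞ in the limit.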

In [5]:
# tqdm.tqdm_notebook is deprecated; import the notebook progress bar directly
from tqdm.notebook import tqdm as tqdm_notebook

In [2]:
# import tensorflow as tf
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

tf.set_random_seed(777)  # for reproducibility

# Training data, 2 features: x1 (hours), x2 (attendance)
x_data = [[1, 2], [2, 3], [3, 1], [4, 3], [5, 3], [6, 2]]
# Result data: y (0: fail, 1: pass)
y_data = [[0], [0], [0], [1], [1], [1]]

# placeholders for a tensor that will be always fed.
X = tf.placeholder(tf.float32, shape=[None, 2])
Y = tf.placeholder(tf.float32, shape=[None, 1])

W = tf.Variable(tf.random_normal([2, 1]), name='weight')
b = tf.Variable(tf.random_normal([1]), name='bias')

# Hypothesis using sigmoid: tf.div(1., 1. + tf.exp(-(tf.matmul(X, W) + b)))
hypothesis = tf.sigmoid(tf.matmul(X, W) + b)

# cost/loss function
cost = -tf.reduce_mean(Y * tf.log(hypothesis) + (1 - Y) *
                       tf.log(1 - hypothesis))

train = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(cost)

# Accuracy computation
# True if hypothesis>0.5 else False
predicted = tf.cast(hypothesis > 0.5, dtype=tf.float32)
accuracy = tf.reduce_mean(tf.cast(tf.equal(predicted, Y), dtype=tf.float32))



In [6]:
# Launch graph
with tf.Session() as sess:
    # Initialize TensorFlow variables
    sess.run(tf.global_variables_initializer())

    for step in tqdm_notebook(range(10001)):
        cost_val, _ = sess.run([cost, train], feed_dict={X: x_data, Y: y_data})
        if step % 200 == 0 or step < 10 :
            print("Step : {} \t Cost : {}".format(step, cost_val))
            
    # Accuracy report
    h, c, a = sess.run([hypothesis, predicted, accuracy],
                       feed_dict={X: x_data, Y: y_data})


Step : 0 	 Cost : 0.7430492043495178
Step : 1 	 Cost : 0.7321328520774841
Step : 2 	 Cost : 0.72150057554245
Step : 3 	 Cost : 0.7111538052558899
Step : 4 	 Cost : 0.7010929584503174
Step : 5 	 Cost : 0.6913184523582458
Step : 6 	 Cost : 0.6818299889564514
Step : 7 	 Cost : 0.6726274490356445
Step : 8 	 Cost : 0.6637097001075745
Step : 9 	 Cost : 0.6550756096839905
Step : 200 	 Cost : 0.4218468964099884
Step : 400 	 Cost : 0.3823147714138031
Step : 600 	 Cost : 0.3591019809246063
Step : 800 	 Cost : 0.3428586423397064
Step : 1000 	 Cost : 0.33000972867012024
Step : 1200 	 Cost : 0.31902754306793213
Step : 1400 	 Cost : 0.3091982305049896
Step : 1600 	 Cost : 0.3001621663570404
Step : 1800 	 Cost : 0.29172614216804504
Step : 2000 	 Cost : 0.28377869725227356
Step : 2200 	 Cost : 0.2762506902217865
Step : 2400 	 Cost : 0.26909589767456055
Step : 2600 	 Cost : 0.26228079199790955
Step : 2800 	 Cost : 0.25577929615974426
Step : 3000 	 Cost : 0.24956989288330078
Step : 3200 	 Cost : 0.24363

In [7]:
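# Note: c holds the thresholded predictions (`predicted`), not the ground-truth Y,
# even though the printed label reads "Correct (Y)".
print("# Hypothesis: \n{h} \n\n# Correct (Y): \n{c} \n\n# Accuracy: {a}".format(
    h = h, c = c, a = a
))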

# Hypothesis: 
[[0.0242024 ]
 [0.1489122 ]
 [0.27185953]
 [0.79689527]
 [0.9489273 ]
 [0.9833507 ]] 

# Correct (Y): 
[[0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]] 

# Accuracy: 1.0
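The accuracy figure can be reproduced by hand from the printed hypothesis values. This sketch copies those values and the labels from `y_data`, then applies the same 0.5 threshold as the graph; the plain-Python list names are illustrative and stand in for the TensorFlow tensors.

```python
# Hypothesis values copied from the output above; labels are the y_data values.
hypothesis = [0.0242024, 0.1489122, 0.27185953, 0.79689527, 0.9489273, 0.9833507]
labels = [0, 0, 0, 1, 1, 1]

# Same rule as tf.cast(hypothesis > 0.5, ...): 1 if above the threshold, else 0.
predicted = [1 if h > 0.5 else 0 for h in hypothesis]

# Fraction of predictions that match the labels.
accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)
print(predicted, accuracy)  # [0, 0, 0, 1, 1, 1] 1.0
```

All six predictions fall on the correct side of the threshold, which is why the reported accuracy is 1.0 even though the cost is still well above zero.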


### Classifying diabetes
> A hands-on example that predicts diabetes

- A predicted value of 1 means diabetes
- A predicted value of 0 means no diabetes

In [9]:
import pandas as pd

# df = pd.read_csv("./data/data-03-diabetes.csv")
df = pd.read_csv("./data/data-03-diabetes.csv", header=None)
df.head(10)

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.294118 | 0.487437 | 0.180328 | -0.292929 | 0.0 | 0.00149 | -0.53117 | -0.033333 | 0 |
| 1 | -0.882353 | -0.145729 | 0.081967 | -0.414141 | 0.0 | -0.207153 | -0.766866 | -0.666667 | 1 |
| 2 | -0.058824 | 0.839196 | 0.04918 | 0.0 | 0.0 | -0.305514 | -0.492741 | -0.633333 | 0 |
| 3 | -0.882353 | -0.105528 | 0.081967 | -0.535354 | -0.777778 | -0.162444 | -0.923997 | 0.0 | 1 |
| 4 | 0.0 | 0.376884 | -0.344262 | -0.292929 | -0.602837 | 0.28465 | 0.887276 | -0.6 | 0 |
| 5 | -0.411765 | 0.165829 | 0.213115 | 0.0 | 0.0 | -0.23696 | -0.894962 | -0.7 | 1 |
| 6 | -0.647059 | -0.21608 | -0.180328 | -0.353535 | -0.791962 | -0.076006 | -0.854825 | -0.833333 | 0 |
| 7 | 0.176471 | 0.155779 | 0.0 | 0.0 | 0.0 | 0.052161 | -0.952178 | -0.733333 | 1 |
| 8 | -0.764706 | 0.979899 | 0.147541 | -0.090909 | 0.283688 | -0.090909 | -0.931682 | 0.066667 | 0 |
| 9 | -0.058824 | 0.256281 | 0.57377 | 0.0 | 0.0 | 0.0 | -0.868488 | 0.1 | 0 |


In [10]:
# Lab 5 Logistic Regression Classifier
# import tensorflow as tf
import tensorflow.compat.v1 as tf
import numpy as np

tf.disable_v2_behavior()

tf.set_random_seed(777)  # for reproducibility

xy = np.loadtxt('./data/data-03-diabetes.csv', delimiter=',', dtype=np.float32)
x_data = xy[:, 0:-1]
y_data = xy[:, [-1]]

# print(x_data.shape, y_data.shape)
print(" x_data.shape : {x_shape} \n y_data.shape : {y_shape}".format(
        x_shape = x_data.shape, 
        y_shape = y_data.shape
    ))

# placeholders for a tensor that will be always fed.
X = tf.placeholder(tf.float32, shape=[None, 8])
Y = tf.placeholder(tf.float32, shape=[None, 1])

W = tf.Variable(tf.random_normal([8, 1]), name='weight')
b = tf.Variable(tf.random_normal([1]), name='bias')

# Hypothesis using sigmoid: tf.div(1., 1. + tf.exp(-(tf.matmul(X, W) + b)))
hypothesis = tf.sigmoid(tf.matmul(X, W) + b)

# cost/loss function
cost = -tf.reduce_mean(Y * tf.log(hypothesis) + (1 - Y) *
                       tf.log(1 - hypothesis))

train = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(cost)

# Accuracy computation
# True if hypothesis>0.5 else False
predicted = tf.cast(hypothesis > 0.5, dtype=tf.float32)
accuracy = tf.reduce_mean(tf.cast(tf.equal(predicted, Y), dtype=tf.float32))

 x_data.shape : (759, 8) 
 y_data.shape : (759, 1)


In [11]:
print(len(xy)) 
xy[:10]

759


array([[-0.294118  ,  0.487437  ,  0.180328  , -0.292929  ,  0.        ,
         0.00149028, -0.53117   , -0.0333333 ,  0.        ],
       [-0.882353  , -0.145729  ,  0.0819672 , -0.414141  ,  0.        ,
        -0.207153  , -0.766866  , -0.666667  ,  1.        ],
       [-0.0588235 ,  0.839196  ,  0.0491803 ,  0.        ,  0.        ,
        -0.305514  , -0.492741  , -0.633333  ,  0.        ],
       [-0.882353  , -0.105528  ,  0.0819672 , -0.535354  , -0.777778  ,
        -0.162444  , -0.923997  ,  0.        ,  1.        ],
       [ 0.        ,  0.376884  , -0.344262  , -0.292929  , -0.602837  ,
         0.28465   ,  0.887276  , -0.6       ,  0.        ],
       [-0.411765  ,  0.165829  ,  0.213115  ,  0.        ,  0.        ,
        -0.23696   , -0.894962  , -0.7       ,  1.        ],
       [-0.647059  , -0.21608   , -0.180328  , -0.353535  , -0.791962  ,
        -0.0760059 , -0.854825  , -0.833333  ,  0.        ],
       [ 0.176471  ,  0.155779  ,  0.        ,  0.        ,  0

In [12]:
print(len(x_data)) 
x_data[:10]

759


array([[-0.294118  ,  0.487437  ,  0.180328  , -0.292929  ,  0.        ,
         0.00149028, -0.53117   , -0.0333333 ],
       [-0.882353  , -0.145729  ,  0.0819672 , -0.414141  ,  0.        ,
        -0.207153  , -0.766866  , -0.666667  ],
       [-0.0588235 ,  0.839196  ,  0.0491803 ,  0.        ,  0.        ,
        -0.305514  , -0.492741  , -0.633333  ],
       [-0.882353  , -0.105528  ,  0.0819672 , -0.535354  , -0.777778  ,
        -0.162444  , -0.923997  ,  0.        ],
       [ 0.        ,  0.376884  , -0.344262  , -0.292929  , -0.602837  ,
         0.28465   ,  0.887276  , -0.6       ],
       [-0.411765  ,  0.165829  ,  0.213115  ,  0.        ,  0.        ,
        -0.23696   , -0.894962  , -0.7       ],
       [-0.647059  , -0.21608   , -0.180328  , -0.353535  , -0.791962  ,
        -0.0760059 , -0.854825  , -0.833333  ],
       [ 0.176471  ,  0.155779  ,  0.        ,  0.        ,  0.        ,
         0.052161  , -0.952178  , -0.733333  ],
       [-0.764706  ,  0.979899  

In [13]:
print(len(y_data)) 
y_data[:10]

759


array([[0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.]], dtype=float32)

In [14]:
# Launch graph
with tf.Session() as sess:
    # Initialize TensorFlow variables
    sess.run(tf.global_variables_initializer())

    for step in tqdm_notebook(range(10001)):
        cost_val, _ = sess.run([cost, train], feed_dict={X: x_data, Y: y_data})       
        if step % 200 == 0 or step < 10 :
            print("Step : {} \t Cost : {}".format(step, cost_val))    

    # Accuracy report
    h, c, a = sess.run([hypothesis, predicted, accuracy],
                       feed_dict={X: x_data, Y: y_data})


Step : 0 	 Cost : 1.0378105640411377
Step : 1 	 Cost : 1.0360039472579956
Step : 2 	 Cost : 1.0342001914978027
Step : 3 	 Cost : 1.0323994159698486
Step : 4 	 Cost : 1.0306012630462646
Step : 5 	 Cost : 1.028806209564209
Step : 6 	 Cost : 1.0270142555236816
Step : 7 	 Cost : 1.0252249240875244
Step : 8 	 Cost : 1.023438811302185
Step : 9 	 Cost : 1.0216553211212158
Step : 200 	 Cost : 0.7447590827941895
Step : 400 	 Cost : 0.600660502910614
Step : 600 	 Cost : 0.5523813962936401
Step : 800 	 Cost : 0.5376949906349182
Step : 1000 	 Cost : 0.5314603447914124
Step : 1200 	 Cost : 0.527259886264801
Step : 1400 	 Cost : 0.5237146019935608
Step : 1600 	 Cost : 0.5205262303352356
Step : 1800 	 Cost : 0.5176125764846802
Step : 2000 	 Cost : 0.5149363875389099
Step : 2200 	 Cost : 0.5124714374542236
Step : 2400 	 Cost : 0.5101962685585022
Step : 2600 	 Cost : 0.5080920457839966
Step : 2800 	 Cost : 0.5061423182487488
Step : 3000 	 Cost : 0.5043325424194336
Step : 3200 	 Cost : 0.50264972448349


In [15]:
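# Note: c holds the thresholded predictions (`predicted`), not the ground-truth Y,
# even though the printed label reads "Correct (Y)". Only the first 20 rows are shown.
print("# Hypothesis: \n{h} \n\n# Correct (Y): \n{c} \n\n# Accuracy: {a}".format(
    # h = h, c = c, a = a
    h = h[:20], c = c[:20], a = a
))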

# Hypothesis: 
[[0.3627589 ]
 [0.92355967]
 [0.29080147]
 [0.94547033]
 [0.18961821]
 [0.8234605 ]
 [0.9493171 ]
 [0.60130876]
 [0.20435551]
 [0.57932526]
 [0.72445863]
 [0.16475578]
 [0.38877195]
 [0.12278756]
 [0.7965804 ]
 [0.4620533 ]
 [0.7865777 ]
 [0.82233024]
 [0.82345665]
 [0.6144735 ]] 

# Correct (Y): 
[[0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]] 

# Accuracy: 0.7694334387779236
