<a href="https://colab.research.google.com/github/milkyme/DSlab_practice/blob/main/Tensorflow_Regression_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear Regression
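- The code below performs linear regression by optimizing w and b in the model $$y_i = wx_i + b$$
- The data is the UCI Auto MPG dataset, downloaded with `keras.utils.get_file`. Note that the code assigns 'MPG' to x and 'Displacement' to y, so the fitted line predicts 'Displacement' from 'MPG'.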

In [None]:
import pathlib

import pandas as pd

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

dataset_path = keras.utils.get_file("auto-mpg.data", "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data")

column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight',
                'Acceleration', 'Model Year', 'Origin']
raw_dataset = pd.read_csv(dataset_path, names=column_names,
                      na_values = "?", comment='\t',
                      sep=" ", skipinitialspace=True)

dataset = raw_dataset.copy()
dataset = dataset.dropna()
dataset.tail()

Downloading data from http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data


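| | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | Origin |
|---|---|---|---|---|---|---|---|---|
| 393 | 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82 | 1 |
| 394 | 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82 | 2 |
| 395 | 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82 | 1 |
| 396 | 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82 | 1 |
| 397 | 31.0 | 4 | 119.0 | 82.0 | 2720.0 | 19.4 | 82 | 1 |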


In [None]:
# x, y data for linear regression
x_data = dataset['MPG']
y_data = dataset['Displacement']

- The code below defines a linear regression model and iteratively updates w and b to lower a cost function.
- w and b are randomly initialized variables drawn uniformly from [0, 1] and [-1, 1], respectively.
- The cost (objective) function is the mean squared error (MSE). Because the MSE measures the error between the model's predictions and the data, each iteration moves w and b toward the values that minimize it. Here h_i = wx_i + b is the prediction and y_i is the observed value.
$$ MSE = \frac{1}{n}\sum_{i=1}^{n}(h_i - y_i)^2 $$
- Gradient descent is used as the optimizer. At every iteration it changes w and b by the learning rate α times the partial derivative of the cost.
$$ w_{new} = w_{old} - \alpha \frac{\partial MSE}{\partial w} \quad \text{(b is updated in the same way)} $$
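
To make the update rule concrete, here is a minimal plain-Python sketch of the same procedure. The data points, learning rate, and step count are made up for illustration and are not taken from this notebook. The gradients come from differentiating the MSE: ∂MSE/∂w = (2/n) Σ (h_i − y_i) x_i and ∂MSE/∂b = (2/n) Σ (h_i − y_i).

```python
def mse(w, b, xs, ys):
    """Mean squared error of the line h = w*x + b."""
    n = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n


def gradient_step(w, b, xs, ys, alpha):
    """One gradient-descent update of w and b on the MSE."""
    n = len(xs)
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
    grad_b = 2.0 / n * sum(errors)
    return w - alpha * grad_w, b - alpha * grad_b


# Toy data lying exactly on y = 2x + 1 (hypothetical, for illustration only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0
initial_cost = mse(w, b, xs, ys)
for _ in range(2000):
    w, b = gradient_step(w, b, xs, ys, alpha=0.05)
final_cost = mse(w, b, xs, ys)
print(w, b, final_cost)
```

On this toy data the loop should drive w toward 2 and b toward 1. The TensorFlow cell below does the same kind of update, with the gradients computed automatically by the optimizer.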

In [None]:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# initial values: w from uniform[0, 1], b from uniform[-1, 1]
w = tf.Variable(tf.random.uniform([1], 0.0, 1.0))
b = tf.Variable(tf.random.uniform([1], -1.0, 1.0))

hypothesis = w*x_data + b

cost = tf.reduce_mean(tf.square(hypothesis - y_data))

rate = tf.Variable(0.001) # learning rate, alpha
optimizer = tf.train.GradientDescentOptimizer(rate)
train = optimizer.minimize(cost)

init = tf.global_variables_initializer()  # replaces the deprecated tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)

for step in range(800):
  sess.run(train)
  if step % 50 == 0:
    print('{:4} {} {} {}'.format(step, sess.run(cost), sess.run(w), sess.run(b))) # [step, cost, w, b]

   0 24772.533203125 [7.6781034] [0.40304157]
  50 23351.3359375 [6.211534] [4.7757273]
 100 22967.33984375 [6.0433354] [9.154481]
 150 22590.892578125 [5.876801] [13.489928]
 200 22221.861328125 [5.711913] [17.7825]
 250 21860.091796875 [5.5486565] [22.03261]
 300 21505.443359375 [5.3870144] [26.240683]
 350 21157.77734375 [5.2269707] [30.407145]
 400 20816.953125 [5.0685105] [34.53239]
 450 20482.833984375 [4.911617] [38.616844]
 500 20155.296875 [4.756275] [42.660896]
 550 19834.203125 [4.6024704] [46.664948]
 600 19519.431640625 [4.4501867] [50.629395]
 650 19210.85546875 [4.299409] [54.554638]
 700 18908.3515625 [4.1501226] [58.441048]
 750 18611.802734375 [4.0023136] [62.289]
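
In the log above the cost is still falling at step 750, so w and b have not converged yet. One way to check where gradient descent should end up is the closed-form least-squares solution for a single feature: w = cov(x, y) / var(x) and b = mean(y) − w · mean(x). The sketch below uses made-up numbers, and `least_squares_line` is a hypothetical helper name, not part of the notebook.

```python
def least_squares_line(xs, ys):
    """Closed-form least-squares fit of y = w*x + b for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


# Hypothetical toy data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
w_star, b_star = least_squares_line(xs, ys)
print(w_star, b_star)
```

Assuming `x_data` and `y_data` from the cell above are still in scope, `least_squares_line(list(x_data), list(y_data))` would give the values that the training loop is slowly approaching.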


# Logistic Regression (Classification)
- The code below performs logistic regression: it optimizes w and b so that the sigmoid of the linear function $$h(x) = wx + b$$ classifies the given labels accurately.
- The dataset is the Titanic data hosted at storage.googleapis.com/tf-datasets, where we want to classify 'survived' based on 'sex'.

In [None]:
# Load the Titanic training and evaluation splits
dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')
dfeval = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/eval.csv')

# Encode 'sex' as a number: 0.0 is female, 1.0 is male
sex = dftrain.pop('sex')
dftrain['sex'] = (sex == 'male')*1.0
dftrain.tail()


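| | survived | age | n_siblings_spouses | parch | fare | class | deck | embark_town | alone | sex |
|---|---|---|---|---|---|---|---|---|---|---|
| 622 | 0 | 28.0 | 0 | 0 | 10.5 | Second | unknown | Southampton | y | 1.0 |
| 623 | 0 | 25.0 | 0 | 0 | 7.05 | Third | unknown | Southampton | y | 1.0 |
| 624 | 1 | 19.0 | 0 | 0 | 30.0 | First | B | Southampton | y | 0.0 |
| 625 | 0 | 28.0 | 1 | 2 | 23.45 | Third | unknown | Southampton | n | 0.0 |
| 626 | 0 | 32.0 | 0 | 0 | 7.75 | Third | unknown | Queenstown | y | 1.0 |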


In [None]:
# x, y data for logistic regression
x_data = dftrain['sex']
y_data = dftrain['survived'] # 1 is survived, 0 is not.

- The code below defines a logistic regression model and iteratively updates w and b to lower a cost function.
- w and b are randomly initialized variables, both drawn uniformly from [-1, 1].
- The cost (objective) function is the cross-entropy error (CEE): $$ CEE = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log(H_i) + (1-y_i)\log(1-H_i)\right] $$
$$ H_i = \frac{1}{1 + e^{-h_i}} $$
$$ h_i = wx_i + b $$
Lowering the cross-entropy error means the model assigns a higher probability to the correct labeled class. At each iteration step, w and b are updated to reduce the cross-entropy error.
- The gradient descent optimizer is used, as above.
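
The same steps can be sketched in plain Python, which makes the sigmoid, the cross-entropy error, and the update explicit. The toy data, learning rate, and step count are illustrative assumptions, not values from this notebook. The gradients use the standard result that the derivative of the CEE with respect to h_i is (H_i − y_i) / n.

```python
import math


def sigmoid(h):
    """Map a linear score h to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-h))


def cross_entropy(w, b, xs, ys):
    """Cross-entropy error of the model H = sigmoid(w*x + b)."""
    n = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / n


def gradient_step(w, b, xs, ys, alpha):
    """One gradient-descent update of w and b on the cross-entropy error."""
    n = len(xs)
    residuals = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
    grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
    grad_b = sum(residuals) / n
    return w - alpha * grad_w, b - alpha * grad_b


# Hypothetical toy data: x is a 0/1 feature, y is a 0/1 label.
xs = [0, 0, 0, 0, 1, 1, 1, 1]
ys = [1, 1, 1, 0, 0, 0, 0, 1]

w, b = 0.0, 0.0
initial_cost = cross_entropy(w, b, xs, ys)
for _ in range(5000):
    w, b = gradient_step(w, b, xs, ys, alpha=0.5)
final_cost = cross_entropy(w, b, xs, ys)
print(w, b, final_cost)
```

In this toy set, 3 of the 4 samples with x = 0 have y = 1 and 1 of the 4 samples with x = 1 has y = 1, so the fitted model should end up with sigmoid(b) close to 0.75 and sigmoid(w + b) close to 0.25.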

In [None]:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
import numpy as np

X = tf.placeholder(tf.float32)
Y = tf.placeholder(tf.float32)

w = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
b = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
h = w*X + b
hypothesis = 1.0 / (1.0 + tf.exp(-h))  # sigmoid of h; avoids the deprecated tf.div

cost = -tf.reduce_mean(Y*tf.log(hypothesis) + (1-Y)*tf.log(1-hypothesis))

rate = tf.Variable(0.1)
optimizer = tf.train.GradientDescentOptimizer(rate)
train = optimizer.minimize(cost)

init = tf.global_variables_initializer()  # replaces the deprecated tf.initialize_all_variables()

sess = tf.Session()
sess.run(init)

for step in range(2000):
  sess.run(train, feed_dict={X: x_data, Y: y_data})
  if step % 100 == 0:
    print(step, sess.run(cost, feed_dict={X: x_data, Y: y_data}), sess.run(w), sess.run(b)) # [step, cost, w, b]

0 0.6360776 [-0.31528845] [0.01135195]
100 0.549276 [-1.1618564] [0.1266236]
200 0.52156305 [-1.597391] [0.41808507]
300 0.50767845 [-1.9007237] [0.6320486]
400 0.5004616 [-2.1190622] [0.7869947]
500 0.49658895 [-2.2787642] [0.90094435]
600 0.49445635 [-2.3970764] [0.98584867]
700 0.49325785 [-2.4856298] [1.0497373]
800 0.49257347 [-2.5524478] [1.0981666]
900 0.49217775 [-2.603185] [1.135081]
1000 0.49194705 [-2.6419003] [1.1633363]
1100 0.4918116 [-2.6715546] [1.1850322]
1200 0.4917313 [-2.6943357] [1.201732]
1300 0.4916837 [-2.711877] [1.2146108]
1400 0.49165538 [-2.7254076] [1.2245559]
1500 0.49163857 [-2.7358587] [1.2322454]
1600 0.49162826 [-2.7439396] [1.2381957]
1700 0.49162236 [-2.7501929] [1.2428027]
1800 0.49161866 [-2.7550352] [1.2463717]
1900 0.49161643 [-2.758787] [1.2491378]
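
Because 'sex' is a 0/1 feature, the fitted parameters can be read directly as probabilities: the model predicts sigmoid(b) for x = 0 (female) and sigmoid(w + b) for x = 1 (male). The sketch below plugs in the values printed at step 1900. It only illustrates how to read the log and is not part of the original notebook.

```python
import math


def sigmoid(h):
    """Map a linear score h to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-h))


# Values copied from the step-1900 line of the training log above.
w = -2.758787
b = 1.2491378

p_female = sigmoid(b)      # predicted survival probability for x = 0
p_male = sigmoid(w + b)    # predicted survival probability for x = 1
print(round(p_female, 3), round(p_male, 3))
```

With these parameters the model predicts a survival probability of roughly 0.78 for female passengers and roughly 0.18 for male passengers.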
