## Stochastic Gradient Descent

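The learning rate shrinks as training progresses. Each update is based on a single random sample and is therefore noisy, so smaller steps late in training let the parameters settle instead of bouncing around the minimum.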

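The simplest schedule: learning rate = 1/(iteration count).
Drawback: during the first few iterations (1, 2, and so on) the learning rate drops too sharply. Going from iteration 1 to iteration 2 already halves it.
Possible remedies:
1. 1/(iteration count + constant)
2. a/(iteration count + constant)
3. In the spirit of simulated annealing: t0/(iteration count + t1), i.e. form 2 with a = t0 and constant = t1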
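
To make the drawback concrete, here is a small sketch comparing the plain 1/t schedule with t0/(t + t1). The function names are made up for illustration, and the values t0 = 5 and t1 = 50 are simply the constants used by the `sgd` function later in this notebook, not the only sensible choice.

```python
def inverse_schedule(t):
    # Plain 1/t schedule; t starts at 1
    return 1.0 / t

def annealed_schedule(t, t0=5.0, t1=50.0):
    # t0/(t + t1): the constant t1 flattens the early drop
    return t0 / (t + t1)

def relative_drop(schedule, t):
    # Fraction by which the learning rate shrinks going from step t to t + 1
    return 1.0 - schedule(t + 1) / schedule(t)

for t in (1, 2, 10, 100):
    print(t,
          round(relative_drop(inverse_schedule, t), 4),
          round(relative_drop(annealed_schedule, t), 4))
```

With these values, the first step of 1/t cuts the learning rate by half, while t0/(t + t1) shrinks it by just under 2% (exactly 1/52).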

In [1]:
import numpy as np
import matplotlib.pyplot as plt


In [2]:
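m=100000

# One feature drawn from a standard normal distribution
x=np.random.normal(size=m)
X=x.reshape((-1,1))
# True model: y = 4x + 3, plus Gaussian noise with standard deviation 3
y=4.0*x+3.0+np.random.normal(0.0,3.0, size=m)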

## First, a batch gradient descent implementation

In [6]:
def J(theta, X_b, y):
    # Mean squared error of the linear model X_b.dot(theta)
    return np.sum((X_b.dot(theta) - y)**2)/len(y)

def dJ(theta, X_b, y):
    # Gradient of J with respect to theta, computed over all samples
    return X_b.T.dot(X_b.dot(theta) - y)*2/len(y)

def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
    theta=initial_theta
    i_iter=0
    
    while i_iter<n_iters:
        gradient=dJ(theta, X_b, y)
        last_theta=theta
        theta=theta-eta*gradient
        
        # Stop early once the loss changes by less than epsilon
        if abs(J(theta, X_b, y) - J(last_theta, X_b, y))<epsilon:
            break
        i_iter+=1
    return theta
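
For reference, the loss and gradient computed by `J` and `dJ` can be written out as follows. The notation is chosen here for illustration: m is the number of samples, X_b is the design matrix with a leading column of ones, and X_b^{(i)} is its i-th row.

```latex
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( X_b^{(i)} \theta - y^{(i)} \right)^2
\qquad
\nabla J(\theta) = \frac{2}{m} X_b^{T} \left( X_b \theta - y \right)
```

Every evaluation of the gradient touches all m samples, which is what makes each batch step expensive on large data sets.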

In [7]:
%%time
# Prepend a column of ones so that theta[0] acts as the intercept
X_b=np.hstack([np.ones((len(X),1)), X])
initial_theta=np.zeros((X_b.shape[1]))
eta=0.01
theta=gradient_descent(X_b, y, initial_theta, eta)

Wall time: 1.1 s


In [8]:
theta

array([ 3.00482694,  4.00016777])

## Next, a stochastic gradient descent implementation

In [14]:
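def dJ_sgd(theta, X_b_i, y_i):
    # Gradient of the squared error on a single sample (X_b_i, y_i)
    return X_b_i.T.dot(X_b_i.dot(theta) - y_i)*2.0

def sgd(X_b, y, initial_theta, n_iters):
    t0=5.0
    t1=50.0
    def learning_rate(t):
        # Decaying schedule t0/(t+t1)
        return t0/(t+t1)
    
    theta=initial_theta
    i_iter=0
    
    # No early-stopping test here: n_iters alone decides when to stop
    while i_iter<n_iters:
        # Pick one sample uniformly at random (with replacement)
        rand_i=np.random.randint(0, len(X_b))
        gradient=dJ_sgd(theta, X_b[rand_i], y[rand_i])
        theta=theta - learning_rate(i_iter)*gradient
        
        i_iter+=1
    return theta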
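
One reason single-sample updates still work: averaged over every sample, the per-sample gradient equals the full-batch gradient, so each random pick is a noisy but unbiased estimate of it. Below is a minimal check on a small synthetic data set. The sample size, seed, test point `theta`, and helper names are illustrative assumptions; the two gradient functions restate the formulas of `dJ` and `dJ_sgd`.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
m = 1000
x = rng.normal(size=m)
X_b = np.hstack([np.ones((m, 1)), x.reshape(-1, 1)])
y = 4.0 * x + 3.0 + rng.normal(0.0, 3.0, size=m)
theta = np.array([1.0, -2.0])  # any fixed point works for this check

def batch_gradient(theta, X_b, y):
    # Same formula as dJ: gradient of the mean squared error over all samples
    return X_b.T.dot(X_b.dot(theta) - y) * 2.0 / len(y)

def sample_gradient(theta, X_b_i, y_i):
    # Same formula as dJ_sgd: gradient of the squared error on one sample
    return X_b_i * (X_b_i.dot(theta) - y_i) * 2.0

# Evaluate the single-sample gradient at every sample, then average
per_sample = np.array([sample_gradient(theta, X_b[i], y[i]) for i in range(m)])
mean_of_samples = per_sample.mean(axis=0)
full = batch_gradient(theta, X_b, y)

print(np.allclose(mean_of_samples, full))
```

Individual rows of `per_sample` can differ a lot from `full`, which is why the learning rate has to decay for the iterates to settle.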

In [15]:
%%time
initial_theta_sgd=np.zeros(X_b.shape[1])
theta_sgd=sgd(X_b, y, initial_theta_sgd, len(X_b)//3)

Wall time: 274 ms


In [16]:
theta_sgd

array([ 2.96630065,  3.9987102 ])

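## Observations
1. SGD ran only len(X_b)//3 single-sample updates, so it touched at most a third of the data, yet it landed close to the batch gradient descent result (intercept 2.966 vs 3.005, slope 3.999 vs 4.000; the true values are 3 and 4).
2. Wall time dropped from 1.1 s to 274 ms.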

In [22]:
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older release
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

In [18]:
boston=load_boston()

x=boston.data
y=boston.target

# Keep only samples with target below 50 (the target is capped at 50.0 in this data set)
X=x[y<50]
y=y[y<50]

In [20]:
X_train,X_test,y_train,y_test=train_test_split(X,y, test_size=0.2)

In [21]:
# Fit the scaler on the training set only, then apply it to both sets
standardScaler=StandardScaler()
standardScaler.fit(X_train)
X_train_standard=standardScaler.transform(X_train)
X_test_standard=standardScaler.transform(X_test)

In [25]:
lin_sgd=SGDRegressor()
lin_sgd.fit(X_train_standard, y_train)
y_test_predict=lin_sgd.predict(X_test_standard)
# Both lines print R^2 on the test set: score() computes the same metric as r2_score
print(r2_score(y_test, y_test_predict))
print(lin_sgd.score(X_test_standard, y_test))

0.842874971215
0.842874971215
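
Since `load_boston` is not available in recent scikit-learn releases (it was removed in version 1.2), the same pipeline can be tried on synthetic data shaped like the y = 4x + 3 example from the first half of this notebook. This is only a sketch under assumed settings: the sample size, the seeds, and the `max_iter` and `tol` values are illustrative and are not taken from the original run.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)  # arbitrary seed
m = 10000
X = rng.normal(size=(m, 1))
y = 4.0 * X[:, 0] + 3.0 + rng.normal(0.0, 3.0, size=m)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# max_iter caps the number of passes over the training data;
# tol controls when training stops early
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, random_state=0)
sgd_reg.fit(X_train_std, y_train)

test_r2 = sgd_reg.score(X_test_std, y_test)  # R^2 on the held-out split
print(sgd_reg.intercept_, sgd_reg.coef_)
print(test_r2)
```

Because the single feature is already close to standard normal, the fitted intercept and coefficient should come out near 3 and 4, which makes the result easy to compare with the hand-written `sgd` function.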


