## Principal Component Analysis (PCA): A Feature Extraction Method

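PCA is a feature extraction method within dimension reduction. Dimension reduction aims to lower the number of dimensions in the data while keeping overall performance about the same, or even improving it. It is a form of unsupervised learning whose main purpose is to "simplify the complex": high-dimensional data (say, N dimensions) is re-expressed in a lower-dimensional form (say, K dimensions, with K < N). Ideally, as long as the K-dimensional representation is representative and captures most of the characteristics of the original N-dimensional data, we can present the data more concisely without losing much information, and so gain a deeper understanding of its nature.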
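To make the N-to-K idea concrete, here is a minimal sketch on synthetic data (not the notebook's dataset). The names `X_demo`, `latent`, and `mixing` are assumptions introduced only for illustration: five observed features are generated from two hidden factors, so two principal components should retain almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples in N = 5 dimensions,
# generated from K = 2 latent factors plus a little noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

# Project onto K = 2 principal components, then map back to 5 dimensions.
pca_demo = PCA(n_components=2)
X_reduced = pca_demo.fit_transform(X_demo)
X_restored = pca_demo.inverse_transform(X_reduced)

# Share of the total variance kept by the 2 components.
retained = pca_demo.explained_variance_ratio_.sum()
# Mean squared difference between the original and the reconstruction.
reconstruction_error = np.mean((X_demo - X_restored) ** 2)

print(X_reduced.shape, retained, reconstruction_error)
```

Real data rarely has such a clean low-dimensional structure, so the retained variance is usually lower than in this toy case.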

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA

pca = PCA(n_components=7)  # reduce the data to 7 principal components
Xtrain = pca.fit_transform(Xtrain)
Xtest = pca.transform(Xtest)
pca.explained_variance_ratio_.cumsum()

In [None]:
plt.plot(pca.explained_variance_ratio_.cumsum())
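The cumulative curve can also be used to pick the number of components instead of fixing it by hand. The sketch below uses synthetic data and a 95% threshold; both are assumptions, not values taken from this notebook. It picks the smallest number of components whose cumulative explained variance reaches the threshold, and compares that with scikit-learn's own choice when `n_components` is given as a float between 0 and 1.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Hypothetical data: 10 features driven mostly by 3 latent factors.
X_demo = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 10))
X_demo = X_demo + 0.1 * rng.normal(size=(300, 10))

# Fit PCA with all components and accumulate the explained variance ratios.
full_pca = PCA().fit(X_demo)
cumulative = full_pca.explained_variance_ratio_.cumsum()

# Smallest k whose cumulative explained variance reaches the threshold.
threshold = 0.95  # assumed target, adjust to taste
k = int(np.searchsorted(cumulative, threshold) + 1)

# With a float in (0, 1), PCA selects the number of components itself.
auto_pca = PCA(n_components=threshold, svd_solver="full").fit(X_demo)

print(k, auto_pca.n_components_)
```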

In [None]:
model = LinearRegression()
model.fit(Xtrain , Ytrain)
model.score(Xtest , Ytest)  # R^2 on the held-out test set
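The cells above fit PCA on the training data only and reuse the fitted projection on the test data. One way to package that pattern, offered here as a sketch rather than as the notebook's method, is scikit-learn's `make_pipeline`. The data comes from `make_regression` and is purely illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical regression data standing in for the notebook's Xtrain/Xtest.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Pipeline: PCA is fitted on the training split only, then applied to the test split.
pipe = make_pipeline(PCA(n_components=7), LinearRegression())
pipe.fit(X_tr, y_tr)
pipe_score = pipe.score(X_te, y_te)

# The same steps written out by hand, as in the cells above.
pca_manual = PCA(n_components=7)
model_manual = LinearRegression().fit(pca_manual.fit_transform(X_tr), y_tr)
manual_score = model_manual.score(pca_manual.transform(X_te), y_te)

print(pipe_score, manual_score)
```

The two scores should agree. The pipeline form is mainly convenient when the same preprocessing has to be repeated inside a cross-validation loop.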

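## Cross-validation
Typically we split the data into two parts, one for training and one for testing. Cross-validation is a statistical technique that partitions the sample into several smaller subsets, which are used in turn for testing and training. The main variants are: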

- k-fold cross-validation
- kk-fold cross-validation
- leave-one-out cross-validation
- 10-fold cross-validation
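To see how two of these variants differ in practice, the sketch below counts the test-set sizes produced by scikit-learn's `KFold` and `LeaveOneOut` on a toy array of 10 samples. The array `X_demo` is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Toy data: 10 samples with 2 features each.
X_demo = np.arange(20).reshape(10, 2)

# k-fold with k = 5: each sample lands in the test set exactly once.
kfold_sizes = [len(test_idx) for _, test_idx in KFold(n_splits=5).split(X_demo)]

# Leave-one-out: one split per sample, each with a single test sample.
loo_sizes = [len(test_idx) for _, test_idx in LeaveOneOut().split(X_demo)]

print(kfold_sizes)     # [2, 2, 2, 2, 2]
print(len(loo_sizes))  # 10
```

10-fold cross-validation is simply k-fold with `n_splits=10`.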

In [None]:
from sklearn.model_selection import KFold 
k_fold = KFold(n_splits=5)  # 5 folds, so the model is trained 5 times
test_scores = []
for train_idx , test_idx in k_fold.split(X):
    Xtrain = X[train_idx]
    Ytrain = Y[train_idx]

    Xtest = X[test_idx]
    Ytest = Y[test_idx]

    model = LinearRegression()
    model.fit(Xtrain , Ytrain)

    test_scores.append(model.score(Xtest , Ytest))
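The loop above can usually be condensed with `cross_val_score`, which for a regressor defaults to the estimator's own `score` method (R^2). The following sketch runs both forms on synthetic data (`X_demo`, `y_demo` are assumptions) to show that they produce the same per-fold scores.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical data standing in for the notebook's X and Y.
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
k_fold_demo = KFold(n_splits=5)

# Manual loop, mirroring the cell above.
manual_scores = []
for train_idx, test_idx in k_fold_demo.split(X_demo):
    fold_model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

# Same splits, handled by the helper.
helper_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=k_fold_demo)

print(np.round(manual_scores, 4))
print(np.round(helper_scores, 4))
```

If the rows of the dataset are ordered in some meaningful way, passing `shuffle=True` (with a `random_state`) to `KFold` may give more representative folds.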

In [None]:
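# The mean across folds serves as our overall score; it gives us more confidence
# that this is the performance the model will actually show on this dataset.
# 0.76678 is better than our earlier 0.7497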

print(" mean score of k folds : " , np.mean(test_scores))
plt.plot(test_scores)
plt.plot([np.mean(test_scores)]*len(test_scores))
plt.show()

In [None]:
# Y = W.X + c
model.coef_.dot(Xtest[10,:]) + model.intercept_

In [None]:
model.predict(Xtest[10,:].reshape(1,-1))
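The two cells above compute the same prediction in two ways: by hand as `W.X + c`, and through `predict`. A self-contained check on synthetic data might look like the sketch below. The coefficients `[1.5, -2.0, 0.5]` and the intercept `4.0` are made-up values for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical noiseless linear data: y = 1.5*x0 - 2.0*x1 + 0.5*x2 + 4.0
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + 4.0

demo_model = LinearRegression().fit(X_demo, y_demo)

row = X_demo[10, :]
# Manual prediction: dot product of the weights with the features, plus the intercept.
manual = demo_model.coef_.dot(row) + demo_model.intercept_
# predict() expects a 2-D array, hence the reshape to (1, n_features).
predicted = demo_model.predict(row.reshape(1, -1))[0]

print(manual, predicted)
```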

In [None]:
from scipy.special import inv_boxcox
transformed_data = inv_boxcox(Y , lam)
transformed_data[:10]

In [None]:
Original_Y[:10]
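The comparison above presumably relies on `Y` having been Box-Cox transformed earlier in the notebook with parameter `lam`; that step is not shown in this section, so this is an assumption. A minimal round-trip sketch on synthetic positive data illustrates why `inv_boxcox` should recover the original values.

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox

rng = np.random.default_rng(0)

# Hypothetical strictly positive target (Box-Cox requires positive values).
original = rng.lognormal(mean=0.0, sigma=0.5, size=100)

# Forward transform: with no lambda given, boxcox estimates it by maximum likelihood.
transformed, lam_demo = boxcox(original)

# Inverse transform with the same lambda restores the original scale.
restored = inv_boxcox(transformed, lam_demo)

print(original[:3])
print(restored[:3])
```

The inverse only works with the same lambda that was used for the forward transform, so that value needs to be kept alongside the model.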

## Metric definitions

In [None]:
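def rmse_score(y_test , y_pred):
    # Root mean squared error: square root of the mean squared residual
    value = (1/len(y_test))*np.sum((y_test - y_pred)**2)
    return np.sqrt(value)

def r2_score(y_test , y_pred):
    # Coefficient of determination: 1 - (residual variation / total variation)
    ssr = (1/len(y_test))*np.sum((y_test - y_pred)**2)  # mean squared residual
    sst = (1/len(y_test))*np.sum((y_test - np.mean(y_test))**2)  # variance of y_test
    return (1 - (ssr/sst))

def mae(y_test , y_pred):
    # Mean absolute error
    return (1/len(y_test))*np.sum(np.abs(y_test - y_pred))

def adj_r2_score(y_test , y_pred , n_features):
    # Adjusted R^2: 1 - (1 - R^2) * (n - 1) / (n - n_features - 1)
    numerator = (1-r2_score(y_test , y_pred))*(len(y_test) - 1)
    denominator = len(y_test) - n_features - 1
    return 1 - (numerator/denominator)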
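As a sanity check, the same formulas can be compared against the implementations in `sklearn.metrics`. The sketch below recomputes each metric inline on a tiny made-up example. The arrays and the feature count `p = 2` are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.metrics import r2_score as sk_r2_score

# Tiny made-up example.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# Same formulas as the functions above, written inline.
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))
mae_value = np.mean(np.abs(y_true - y_hat))
r2 = 1 - np.sum((y_true - y_hat) ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)

# Adjusted R^2 with n samples and p features (p = 2 is a hypothetical value).
n, p = len(y_true), 2
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Reference values from scikit-learn.
sk_rmse = np.sqrt(mean_squared_error(y_true, y_hat))
sk_mae = mean_absolute_error(y_true, y_hat)
sk_r2 = sk_r2_score(y_true, y_hat)

print(rmse, mae_value, r2, adj_r2)
```

scikit-learn has no built-in adjusted R^2, so that value can only be checked against the formula itself.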

In [None]:
k_fold = KFold(n_splits=5)

# Collect RMSE, R^2, MAE and adjusted R^2 for each fold, then plot them
rmse_scores = []
r2_scores = []
mae_scores = []
r2_adj_scores = []

for train_idx , test_idx in k_fold.split(X):
    Xtrain = X[train_idx]
    Ytrain = Y[train_idx]

    Xtest = X[test_idx]
    Ytest = Y[test_idx]

    model = LinearRegression()
    model.fit(Xtrain , Ytrain)

    Ypred = model.predict(Xtest)
    rmse_scores.append(rmse_score(Ytest , Ypred))
    r2_scores.append(r2_score(Ytest , Ypred))
    mae_scores.append(mae(Ytest , Ypred))
    r2_adj_scores.append(adj_r2_score(Ytest , Ypred , Xtest.shape[1]))

print(" Average RMSE " , np.mean(rmse_scores))
plt.plot(rmse_scores)
plt.plot([np.mean(rmse_scores)]*len(rmse_scores))
plt.title(" RMSE ")
plt.show()

print(" Average MAE " , np.mean(mae_scores))
plt.plot(mae_scores)
plt.plot([np.mean(mae_scores)]*len(mae_scores))
plt.title(" MAE ")
plt.show()

print(" Average R square " , np.mean(r2_scores))
plt.plot(r2_scores)
plt.plot([np.mean(r2_scores)]*len(r2_scores))
plt.title(" R square ")
plt.show()

print(" Average Adj R square " , np.mean(r2_adj_scores))
plt.plot(r2_adj_scores)
plt.plot([np.mean(r2_adj_scores)]*len(r2_adj_scores))
plt.title(" Adj R square ")
plt.show()

In [None]:
from scipy.special import inv_boxcox

Real_data = inv_boxcox(Y , lam)

In [None]:
Real_data[:10]

In [None]:
Original_Y[:10]