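# Exercise 4: Regression

## Exercise overview

Using the scikit-learn library, you will practice the following operations.

- Standardization
- Principal component analysis (PCA)
- Grid search

Note: Add cells as needed.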

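## Data overview

- 電力消費量.csv (electricity consumption)

    C:\PyML\exercise\2.Regression\電力消費量
    Note: This is the same data as in the Exercise 2 assignment.

    You are a member of a department that plans electricity-related services for smart houses. Using data obtained from power sensors installed in each room of a smart house, you are building a service that shows on a panel the predicted appliance and lighting electricity usage for the following month. The aim is to make energy-saving efforts visible.

|Column|Meaning|
|:--|:--|
|家電使用量 (appliance usage)|Total electricity used by all appliances in the smart house during the month following the sensor reading.|
|電灯使用量 (lighting usage)|Total electricity used by lighting alone in the smart house during the month following the sensor reading.|
|Temperature (温度) and humidity (湿度) columns|Most recent temperature and humidity readings sensed in each room of the smart house.|
|気圧 (pressure), 風速 (wind speed), 視程 (visibility), 露点温度 (dew point)|Environmental information for the area where the smart house is built.|
|乱数1, 乱数2 (random numbers)|Unrelated data.|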

## Preparation

### Importing libraries

Note: Import additional libraries if necessary.

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

### Loading the file

In [2]:
df = pd.read_csv('電力消費量.csv',encoding='shift_jis',engine='python')

In [3]:
df.head(3)

Unnamed: 0,日時,家電使用量,電灯使用量,温度_キッチン,湿度_キッチン,温度_居間,湿度_居間,温度_洗濯室,湿度_洗濯室,温度_執務室,...,温度_親部屋,湿度_親部屋,温度_外気,気圧,湿度_外気,風速,視程,露点温度,乱数1,乱数2
0,2016/1/11 17:00,60,30,19.89,47.596667,19.2,44.79,19.79,44.73,19.0,...,17.033333,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275433,13.275433
1,2016/1/11 17:10,60,30,19.89,46.693333,19.2,44.7225,19.79,44.79,19.0,...,17.066667,45.56,6.48,733.6,92.0,6.666667,59.166667,5.2,18.606195,18.606195
2,2016/1/11 17:20,50,30,19.89,46.3,19.2,44.626667,19.79,44.933333,18.926667,...,17.0,45.5,6.37,733.7,92.0,6.333333,55.333333,5.1,28.642668,28.642668


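## Data processing

Apply the appropriate processing, using the list below as a guide.

- Standardization
- Principal component analysis (PCA)
- Grid search

### Extracting the target variables and features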

In [4]:
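# Columns 0-2 are 日時, 家電使用量, and 電灯使用量; the features start at column 3
X = df.iloc[:, 3:].values
y1 = df[['家電使用量']].values  # target 1: appliance usage
y2 = df[['電灯使用量']].values  # target 2: lighting usage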
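
The double brackets in `df[['家電使用量']]` select a one-column DataFrame, so `y1` and `y2` are 2-D arrays of shape `(n_samples, 1)`. A minimal sketch on a made-up three-row frame (the values are illustrative, not taken from the CSV) shows the difference from single-bracket selection, which yields a 1-D array:

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names match the exercise.
toy = pd.DataFrame({'家電使用量': [60, 60, 50],
                    '温度_キッチン': [19.89, 19.89, 19.89]})

y_2d = toy[['家電使用量']].values   # one-column DataFrame -> shape (3, 1)
y_1d = toy['家電使用量'].values     # Series               -> shape (3,)

print(y_2d.shape, y_1d.shape)
```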

### Standardization

In [5]:
sc = StandardScaler()

In [6]:
sc.fit(X)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [7]:
df_post_sc = pd.DataFrame(sc.transform(X))

In [8]:
df_post_sc.columns = df.iloc[:,3:].columns
df_post_sc.head()

Unnamed: 0,温度_キッチン,湿度_キッチン,温度_居間,湿度_居間,温度_洗濯室,湿度_洗濯室,温度_執務室,湿度_執務室,温度_浴室,湿度_浴室,...,温度_親部屋,湿度_親部屋,温度_外気,気圧,湿度_外気,風速,視程,露点温度,乱数1,乱数2
0,-1.118656,1.843973,-0.520351,1.073699,-1.235078,1.686132,-0.90818,1.506578,-1.314898,0.471156,...,-1.217304,0.958216,-0.152674,-2.976255,0.822031,1.207662,2.091544,0.367006,-0.807929,-0.807929
1,-1.118656,1.616951,-0.520351,1.057113,-1.235078,1.704568,-0.90818,1.604673,-1.314898,0.471156,...,-1.200758,0.965443,-0.175241,-2.96274,0.822031,1.071675,1.766532,0.343166,-0.440201,-0.440201
2,-1.118656,1.518099,-0.520351,1.033565,-1.235078,1.748609,-0.94408,1.581061,-1.314898,0.458963,...,-1.233851,0.950989,-0.195928,-2.949226,0.822031,0.935689,1.44152,0.319327,0.252137,0.252137
3,-1.118656,1.459459,-0.520351,1.024556,-1.235078,1.769093,-0.96203,1.542667,-1.314898,0.458963,...,-1.233851,0.926901,-0.218495,-2.935711,0.822031,0.799702,1.116508,0.295487,1.408811,1.408811
4,-1.118656,1.526476,-0.520351,1.009813,-1.235078,1.769093,-0.96203,1.498131,-1.296826,0.458963,...,-1.233851,0.926901,-0.241062,-2.922196,0.822031,0.663715,0.791496,0.271648,-1.028075,-1.028075
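
`StandardScaler` subtracts each column's mean and divides by its population standard deviation (`ddof=0`). The sketch below checks this against a manual computation on a small made-up array; the numbers are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 4 samples, 2 features on very different scales.
X_toy = np.array([[19.9, 733.5],
                  [19.2, 733.6],
                  [18.9, 733.7],
                  [17.0, 740.0]])

scaled = StandardScaler().fit_transform(X_toy)

# Manual standardization with the population standard deviation (ddof=0).
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))
```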


### Principal component analysis

#### Searching for the number of principal components

In [9]:
num_components = np.arange(1, len(df_post_sc.columns)+1)
pca = []                     # fitted models; pca[k] keeps k+1 components
list_pca = np.empty((0,2))   # rows of [n_components, cumulative explained variance ratio]

In [10]:
for i in num_components:
    model = PCA(n_components=i).fit(df_post_sc)
    pca.append(model)
    list_pca = np.append(list_pca, [[i, model.explained_variance_ratio_.sum()]], axis=0)

In [11]:
list_pca

array([[  1.        ,   0.3582363 ],
       [  2.        ,   0.629544  ],
       [  3.        ,   0.70674705],
       [  4.        ,   0.77644697],
       [  5.        ,   0.81700582],
       [  6.        ,   0.85461685],
       [  7.        ,   0.8907666 ],
       [  8.        ,   0.91221476],
       [  9.        ,   0.93256601],
       [ 10.        ,   0.9477439 ],
       [ 11.        ,   0.95810623],
       [ 12.        ,   0.96493856],
       [ 13.        ,   0.97053124],
       [ 14.        ,   0.97592839],
       [ 15.        ,   0.98046517],
       [ 16.        ,   0.98486352],
       [ 17.        ,   0.98848194],
       [ 18.        ,   0.99130556],
       [ 19.        ,   0.99389083],
       [ 20.        ,   0.99563778],
       [ 21.        ,   0.99732164],
       [ 22.        ,   0.99842873],
       [ 23.        ,   0.99930413],
       [ 24.        ,   0.99985714],
       [ 25.        ,   1.        ],
       [ 26.        ,   1.        ]])
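
The cumulative ratios above can also be read from a single fit: `explained_variance_ratio_` of a PCA that keeps every component lists each component's share, and its cumulative sum corresponds to the second column of `list_pca`. The sketch below uses random data as a stand-in for `df_post_sc`, and the 0.93 threshold is chosen only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Stand-in for the standardized features: 200 samples, 6 correlated columns.
base = rng.normal(size=(200, 3))
X_demo = np.hstack([base, base + 0.1 * rng.normal(size=(200, 3))])

full = PCA().fit(X_demo)                       # keep every component
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.93                               # illustrative value
n_needed = int(np.argmax(cumulative >= threshold)) + 1

# Cross-check against a PCA fitted with exactly that many components.
partial = PCA(n_components=n_needed).fit(X_demo)
print(n_needed, cumulative[n_needed - 1], partial.explained_variance_ratio_.sum())
```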

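#### Applying the principal component analysis

In [12]:
# Index 8 is the model with 9 components (cumulative explained variance ratio of about 0.93)
best_pca = pca[8]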

In [13]:
df_post_pca = pd.DataFrame(best_pca.transform(df_post_sc))

In [14]:
df_post_pca.head(1)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,-2.798545,-4.576192,1.407763,1.103657,0.837833,2.845482,-0.15236,-0.791935,-1.185403


## Training and evaluation

### Searching for the best parameter with grid search (Ridge, 家電使用量 prediction)

In [15]:
param_grid = {'alpha':[0.0001,0.001,0.01,0.1,1,10,100,1000,10000,100000]}

In [16]:
clf_l2 = Ridge()

In [17]:
gs = GridSearchCV( estimator=clf_l2,
                  param_grid=param_grid,
                  scoring='neg_mean_squared_error',
                  cv=10,
                  n_jobs=-1)

In [18]:
gs.fit(X=df_post_pca,y=y1)

GridSearchCV(cv=10, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_squared_error', verbose=0)

In [19]:
gs.best_score_

-10398.721063520548

In [20]:
# RMSE: best_score_ is the negated MSE, so flip the sign before taking the root
print(np.sqrt(-gs.best_score_))

101.974119577
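
With `scoring='neg_mean_squared_error'`, `best_score_` is the negated mean squared error, so that a larger score is always better. The RMSE is therefore the square root of the sign-flipped score. A small check using the `best_score_` value printed above:

```python
import numpy as np

best_score = -10398.721063520548   # gs.best_score_ printed above (negated MSE)

rmse = np.sqrt(-best_score)        # flip the sign, then take the square root
print(rmse)
```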


In [21]:
gs.best_params_

{'alpha': 10000}
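
In the cells above, the scaler and the PCA are fitted on all rows before cross-validation, so each validation fold has already influenced the preprocessing. One common way to avoid this is to wrap the steps in a `Pipeline` and let `GridSearchCV` refit them inside every fold. The following is a minimal sketch on synthetic data, not the exercise's reference solution; the step names (`sc`, `pca`, `ridge`), the data, and the grid values are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(120, 8))                   # synthetic features
y_demo = X_demo[:, 0] * 3.0 + rng.normal(size=120)   # synthetic target

pipe = Pipeline([
    ('sc', StandardScaler()),   # refitted on the training part of each fold
    ('pca', PCA()),
    ('ridge', Ridge()),
])

# Parameters of a step are addressed as '<step name>__<parameter>'.
param_grid = {
    'pca__n_components': [2, 4, 8],
    'ridge__alpha': [0.01, 1, 100],
}

gs_pipe = GridSearchCV(pipe, param_grid=param_grid,
                       scoring='neg_mean_squared_error', cv=5)
gs_pipe.fit(X_demo, y_demo)

print(gs_pipe.best_params_)
print(np.sqrt(-gs_pipe.best_score_))   # RMSE of the best combination
```

With this layout the number of components becomes one more searchable parameter instead of a choice made before the search.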

---

### Searching for the best parameter with grid search (Lasso, 家電使用量 prediction)

In [22]:
param_grid = {'alpha':np.arange(1,30)}

In [23]:
clf_l1 = Lasso()

In [24]:
gs = GridSearchCV( estimator=clf_l1,
                  param_grid=param_grid,
                  scoring='neg_mean_squared_error',
                  cv=10,
                  n_jobs=-1)

In [25]:
gs.fit(X=df_post_pca,y=y1)

GridSearchCV(cv=10, error_score='raise',
       estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'alpha': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_squared_error', verbose=0)

In [26]:
gs.best_score_

-10363.551116175817

In [27]:
# RMSE: best_score_ is the negated MSE, so flip the sign before taking the root
print(np.sqrt(-gs.best_score_))

101.801528064


In [28]:
gs.best_params_

{'alpha': 6}

In [29]:
gs.best_estimator_.coef_

array([ 1.81168212,  0.08126153,  0.        ,  7.40050242,  0.        ,
        0.        ,  0.        , -0.        , -0.        ])
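
Several coefficients above are exactly zero: the L1 penalty of Lasso can remove features (here, principal components) from the model entirely, and a larger `alpha` tends to remove more of them. A minimal sketch on synthetic data, where only two of five features carry signal (the data and the `alpha` values are assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 5))
# Only features 0 and 3 influence the target.
y_demo = 2.0 * X_demo[:, 0] + 7.0 * X_demo[:, 3] + rng.normal(size=200)

counts = []
for alpha in [0.01, 0.5, 3.0]:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    counts.append(int(np.sum(model.coef_ != 0)))   # features kept by the L1 penalty
    print(alpha, counts[-1], np.round(model.coef_, 2))
```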

In [30]:
results = pd.DataFrame(gs.cv_results_)

In [31]:
results.sort_values(by='rank_test_score')

Unnamed: 0,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,param_alpha,params,rank_test_score,split0_test_score,split0_train_score,split1_test_score,...,split7_test_score,split7_train_score,split8_test_score,split8_train_score,split9_test_score,split9_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
5,0.01404,0.00624,-10363.551116,-10256.228383,6,{'alpha': 6},1,-17380.562667,-9461.975772,-10942.539567,...,-8190.186337,-10502.981165,-7887.784634,-10525.274185,-8252.86576,-10506.406499,0.01092,0.007642,2862.841208,322.268836
6,0.23244,0.0,-10365.802325,-10266.445805,7,{'alpha': 7},2,-17357.657557,-9471.952369,-10910.323128,...,-8210.639076,-10513.762174,-7853.468338,-10538.133629,-8285.843042,-10517.013572,0.254089,0.0,2855.748092,322.528317
4,0.01248,0.0,-10366.226271,-10243.812701,5,{'alpha': 5},3,-17428.404233,-9440.495859,-10979.998399,...,-8169.949969,-10490.90433,-7927.566046,-10512.551119,-8224.855878,-10495.266612,0.00624,0.0,2875.346889,324.208204
7,0.291721,0.00936,-10370.173812,-10277.951801,8,{'alpha': 8},4,-17335.76546,-9483.463829,-10880.880126,...,-8233.359695,-10526.20182,-7826.045138,-10552.971448,-8320.763366,-10529.252507,0.287739,0.015909,2848.210397,322.841084
3,0.22464,0.0,-10376.123035,-10226.008508,4,{'alpha': 4},5,-17490.453449,-9419.383697,-11037.459624,...,-8149.957735,-10475.409999,-7988.973133,-10489.380748,-8201.614273,-10476.280527,0.289186,0.0,2889.123492,324.384091
8,0.01404,0.0,-10377.271368,-10289.90511,9,{'alpha': 9},6,-17315.102843,-9496.419557,-10852.665815,...,-8260.13187,-10538.380979,-7806.481373,-10566.499917,-8359.20444,-10541.485702,0.00468,0.0,2839.927112,322.576201
9,0.0156,0.00156,-10387.017722,-10302.497343,10,{'alpha': 10},7,-17300.846127,-9508.474983,-10825.680197,...,-8289.358679,-10551.164934,-7791.448015,-10580.928445,-8400.148244,-10554.10307,0.012084,0.00468,2832.879097,322.831852
2,0.03276,0.00156,-10391.34049,-10205.871984,3,{'alpha': 3},8,-17560.315327,-9402.963132,-11144.969562,...,-8126.601255,-10452.599663,-8070.1002,-10468.711537,-8170.56038,-10454.169558,0.039125,0.00468,2905.677251,322.539759
10,0.283921,0.00624,-10398.249055,-10316.415072,11,{'alpha': 11},9,-17287.406746,-9521.799392,-10799.92327,...,-8320.216573,-10565.294568,-7780.519742,-10596.875765,-8442.534953,-10568.048583,0.290998,0.010348,2825.62519,323.115716
11,0.01404,0.00156,-10410.965359,-10331.658301,12,{'alpha': 12},10,-17274.784616,-9536.392802,-10775.395036,...,-8352.705554,-10580.769883,-7773.696554,-10614.341878,-8486.364565,-10583.32224,0.008401,0.00468,2818.160022,323.428173
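
`cv_results_` contains many columns; for comparing candidates, a handful usually suffices. The sketch below, on synthetic data, keeps only the parameter, the mean and standard deviation of the test score, and the rank. The data and grid are assumptions; the column names are the ones `GridSearchCV` produces for a parameter named `alpha`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 4))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(size=100)

gs_demo = GridSearchCV(Lasso(), param_grid={'alpha': [0.01, 0.1, 1, 10]},
                       scoring='neg_mean_squared_error', cv=5)
gs_demo.fit(X_demo, y_demo)

cols = ['param_alpha', 'mean_test_score', 'std_test_score', 'rank_test_score']
summary = pd.DataFrame(gs_demo.cv_results_)[cols].sort_values('rank_test_score')
summary['rmse'] = np.sqrt(-summary['mean_test_score'])   # easier to read than negated MSE
print(summary)
```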


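### Searching for the best parameter with grid search (Lasso, 電灯使用量 prediction)

In [32]:
# NOTE: the 0 breaks the otherwise log-spaced sequence (1 may have been intended).
# alpha=0 makes Lasso equivalent to ordinary least squares, which scikit-learn advises against.
param_grid = {'alpha':[0.00001,0.0001,0.001,0.01,0.1,0,10,100,1000,10000]}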

In [33]:
clf_l1 = Lasso()

In [34]:
gs = GridSearchCV( estimator=clf_l1,
                  param_grid=param_grid,
                  scoring='neg_mean_squared_error',
                  cv=10,
                  n_jobs=-1)

In [35]:
gs.fit(X=df_post_pca,y=y2)

GridSearchCV(cv=10, error_score='raise',
       estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'alpha': [1e-05, 0.0001, 0.001, 0.01, 0.1, 0, 10, 100, 1000, 10000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_squared_error', verbose=0)

In [36]:
gs.best_score_

-61.635419407657785

In [37]:
# RMSE: best_score_ is the negated MSE, so flip the sign before taking the root
print(np.sqrt(-gs.best_score_))

7.85082284908


In [38]:
gs.best_params_

{'alpha': 0.1}

In [39]:
gs.best_estimator_.coef_

array([-0.21045217, -0.25059492,  0.        ,  0.04950419,  0.21502621,
       -0.        ,  1.06091508, -0.        ,  0.09237583])

In [40]:
results = pd.DataFrame(gs.cv_results_)

In [41]:
results.sort_values(by='rank_test_score')

Unnamed: 0,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,param_alpha,params,rank_test_score,split0_test_score,split0_train_score,split1_test_score,...,split7_test_score,split7_train_score,split8_test_score,split8_train_score,split9_test_score,split9_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
4,0.156,0.0,-61.635419,-60.574074,0.1,{'alpha': 0.1},1,-84.476543,-57.982437,-94.54358,...,-36.247607,-63.327064,-28.809428,-64.331144,-39.600731,-63.042817,0.158323,0.0,24.428895,2.706255
3,0.02808,0.0,-61.830308,-60.512047,0.01,{'alpha': 0.01},2,-85.310822,-57.893941,-94.76399,...,-36.351494,-63.267918,-29.221879,-64.260052,-39.378198,-62.985086,0.011674,0.0,24.459013,2.706312
2,0.05928,0.0,-61.865412,-60.511261,0.001,{'alpha': 0.001},3,-85.417309,-57.893015,-94.833441,...,-36.367438,-63.267153,-29.275378,-64.259268,-39.361298,-62.984314,0.102486,0.0,24.473347,2.706318
1,0.06708,0.00156,-61.869126,-60.511253,0.0001,{'alpha': 0.0001},4,-85.428204,-57.893006,-94.840943,...,-36.369081,-63.267145,-29.280903,-64.25926,-39.359672,-62.984306,0.104427,0.00468,24.474915,2.706318
0,0.04524,0.00156,-61.869498,-60.511253,1e-05,{'alpha': 1e-05},5,-85.429296,-57.893005,-94.841695,...,-36.369245,-63.267145,-29.281456,-64.25926,-39.35951,-62.984306,0.02652,0.00468,24.475073,2.706318
5,1.333802,0.0,-61.86954,-60.511253,0.0,{'alpha': 0},6,-85.429417,-57.893005,-94.841778,...,-36.369264,-63.267145,-29.281517,-64.25926,-39.359492,-62.984306,1.010642,0.0,24.47509,2.706318
6,0.01092,0.00156,-63.583688,-62.946,10.0,{'alpha': 10},7,-89.276104,-60.089808,-100.704128,...,-38.348677,-65.732079,-29.776665,-66.746646,-40.374595,-65.527472,0.007149,0.00468,25.784716,2.863509
7,0.0078,0.0,-63.583688,-62.946,100.0,{'alpha': 100},7,-89.276104,-60.089808,-100.704128,...,-38.348677,-65.732079,-29.776665,-66.746646,-40.374595,-65.527472,0.0078,0.0,25.784716,2.863509
8,0.00936,0.0,-63.583688,-62.946,1000.0,{'alpha': 1000},7,-89.276104,-60.089808,-100.704128,...,-38.348677,-65.732079,-29.776665,-66.746646,-40.374595,-65.527472,0.007642,0.0,25.784716,2.863509
9,0.01092,0.0,-63.583688,-62.946,10000.0,{'alpha': 10000},7,-89.276104,-60.089808,-100.704128,...,-38.348677,-65.732079,-29.776665,-66.746646,-40.374595,-65.527472,0.007149,0.0,25.784716,2.863509
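
The rows for `alpha` = 10, 100, 1000, and 10000 share identical scores. A likely explanation is that the penalty is already strong enough at `alpha=10` to shrink every coefficient to zero, in which case the model predicts the mean of the training target regardless of the input. The sketch below reproduces the effect on synthetic data (the data and `alpha` values are assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 4))
y_demo = 0.5 * X_demo[:, 1] + rng.normal(size=100)

preds = {}
for alpha in [10, 100, 1000]:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    preds[alpha] = model.predict(X_demo)
    print(alpha, model.coef_, model.intercept_)

# With all coefficients at zero, each model outputs the training mean of y.
print(np.allclose(preds[10], y_demo.mean()))
```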


## Optional assignment

Implement the following.

- Serialize the required data
- Create a new notebook file
- Use `電力消費量_予測用.csv` to predict 家電使用量 and 電灯使用量 from new feature values
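
One possible way to approach the optional assignment is to serialize the fitted scaler, PCA, and model with the standard-library `pickle` module, then load them in the new notebook and apply them in the same order to the new features. This is a minimal sketch on synthetic data; the file name `model_bundle.pkl`, the dictionary keys, and the array shapes are hypothetical, and the sketch writes to a temporary directory.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_train = rng.normal(size=(100, 6))                  # synthetic training features
y_train = X_train[:, 0] * 2.0 + rng.normal(size=100)

# Fit the three objects in the order used in this exercise.
sc = StandardScaler().fit(X_train)
pca = PCA(n_components=3).fit(sc.transform(X_train))
model = Lasso(alpha=0.1).fit(pca.transform(sc.transform(X_train)), y_train)

# Serialize everything needed for prediction into one file (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), 'model_bundle.pkl')
with open(path, 'wb') as f:
    pickle.dump({'sc': sc, 'pca': pca, 'model': model}, f)

# --- in the new notebook ---
with open(path, 'rb') as f:
    bundle = pickle.load(f)

X_new = rng.normal(size=(5, 6))                      # stand-in for the new CSV's features
X_new_pca = bundle['pca'].transform(bundle['sc'].transform(X_new))
y_pred = bundle['model'].predict(X_new_pca)
print(y_pred)
```

The new data must pass through `transform` only, never `fit`, so that it is scaled and projected with the statistics learned from the training data. Only load pickle files from a trusted source, since unpickling can execute arbitrary code.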