[View in Colaboratory](https://colab.research.google.com/github/idleuncle/Colab/blob/master/TPOT_Tutorial.ipynb)

# TPOT Tutorial



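## 1. Colab Environment Setup

1. Open ColabTemplate.ipynb in Colab, save a copy as your own project MyProject.ipynb, and open it.

2. The "Colab Environment Setup" section does the following. It only needs to be run once, when the project is opened.

    - Install system dependencies

    - Authorize access to Google Drive

    - Install the Colab programming environment support package (IpynbImporter.py, [ColabModules.ipynb](https://colab.research.google.com/drive/1IMv93f2bMYhrx2lfL3cmDBI7kmjCMy01#scrollTo=VyDM84dOxu18))
    
3. After modifying and saving ColabModules.ipynb, run the "Download the Colab Programming Environment Support Package" and "Import the Colab Base Programming Environment Support Package" sections.

The `drive` variable refers to the Google Drive account you logged in to.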

### 1.1 Authorizing Access to Google Drive

#### First Authorization: Log In to Google Drive

In [0]:
# Install the PyDrive library; this only needs to run once per notebook
!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

def google_drive_login():
  # Authorize the login; authentication is only requested the first time
  auth.authenticate_user()
  gauth = GoogleAuth()
  gauth.credentials = GoogleCredentials.get_application_default()
  drive = GoogleDrive(gauth)
  return drive

drive = google_drive_login()

#### Second Authorization: Mount Google Drive at the Local `drive` Directory

In [0]:
!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa > /dev/null 2>&1
!apt-get update -qq > /dev/null 2>&1
!apt-get -y install -qq google-drive-ocamlfuse fuse

from google.colab import auth
auth.authenticate_user()
from oauth2client.client import GoogleCredentials

creds = GoogleCredentials.get_application_default()

In [0]:
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}

In [0]:
!mkdir -p drive
!google-drive-ocamlfuse drive
!ls -lt drive

### 1.2 Download the Colab Programming Environment Support Package

In [0]:
import os
def google_drive_download_files(drive, file_name_prefix, colab_dir=".", overwrite=True):
  # choose a local (colab) directory to store the data.
  local_download_path = os.path.expanduser(colab_dir)
  # Create the directory if it does not already exist
  os.makedirs(local_download_path, exist_ok=True)

  # 2. Auto-iterate using the query syntax
  #    https://developers.google.com/drive/v2/web/search-parameters
  file_list = drive.ListFile(
      {'q': "title contains '%s'" % (file_name_prefix) }).GetList()

  files_dict = {}
  for f in file_list:
    # 3. Create & download by id.
    print('title: %s, id: %s' % (f['title'], f['id']))
    fname = os.path.join(local_download_path, f['title'])
    if overwrite or not os.path.exists(fname):
      print('downloading to {}'.format(fname))
      f_ = drive.CreateFile({'id': f['id']})
      f_.GetContentFile(fname)
      print('Download Completed!')
    files_dict[ f['title'] ] = fname

  # with open(fname, 'r') as f:
  #   print(f.read())
  return files_dict, local_download_path

# After modifying ColabModules.ipynb, run the commands below
# (it may also be necessary to choose Runtime/Restart runtime from the menu; this is unconfirmed)
google_drive_download_files(drive, 'IpynbImporter.py')
google_drive_download_files(drive, 'ColabModules.ipynb')

!ls -lt
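
The download helper above interpolates `file_name_prefix` straight into the Drive search query, so a name containing a single quote would break the query. The sketch below shows one way to build the query string more defensively. `build_title_query` is a hypothetical helper, not part of PyDrive, and the backslash-escaping rule is an assumption based on the Drive search syntax linked in the code above.

```python
def build_title_query(file_name_prefix):
    """Build a Drive v2 search query matching files whose title contains the text.

    Hypothetical helper for illustration; it is not part of PyDrive.
    """
    # Escape backslashes first, then single quotes, so the backslashes
    # added for the quotes are not escaped a second time.
    escaped = file_name_prefix.replace("\\", "\\\\").replace("'", "\\'")
    return "title contains '%s'" % escaped


print(build_title_query("ColabModules.ipynb"))
print(build_title_query("Will's notes"))
```

The returned string could then be passed as the `'q'` value in the dictionary given to `drive.ListFile(...)`.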


### 1.3 Import the Colab Base Programming Environment Support Package

In [0]:
import IpynbImporter
from ColabModules import *

colab_ready()

!hostname
!ls -lt

## 2. Exploring the Code

In [0]:
# Install tpot on the server
!pip install tpot

# pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# Import the tpot regressor
from tpot import TPOTRegressor

### 2.1 Dataset

In [10]:
# Read in features from GitHub
train_features = pd.read_csv('https://raw.githubusercontent.com/WillKoehrsen/machine-learning-project-walkthrough/master/data/X_train.csv')
test_features = pd.read_csv('https://raw.githubusercontent.com/WillKoehrsen/machine-learning-project-walkthrough/master/data/X_test.csv')

# Read in labels from GitHub
train_labels = pd.read_csv('https://raw.githubusercontent.com/WillKoehrsen/machine-learning-project-walkthrough/master/data/Y_train.csv')
test_labels = pd.read_csv('https://raw.githubusercontent.com/WillKoehrsen/machine-learning-project-walkthrough/master/data/Y_test.csv')

print('Training features shape: ', train_features.shape)
print('Testing features shape:  ', test_features.shape)

train_features.head()

Training features shape:  (6622, 82)
Testing features shape:   (2839, 82)


| Unnamed: 0 | Order | Property Id | DOF Gross Floor Area | Largest Property Use Type - Gross Floor Area (ft²) | Year Built | Number of Buildings - Self-reported | Occupancy | Site EUI (kBtu/ft²) | Weather Normalized Site EUI (kBtu/ft²) | Weather Normalized Site Electricity Intensity (kWh/ft²) | ... | Largest Property Use Type_Restaurant | Largest Property Use Type_Retail Store | Largest Property Use Type_Self-Storage Facility | Largest Property Use Type_Senior Care Community | Largest Property Use Type_Social/Meeting Hall | Largest Property Use Type_Strip Mall | Largest Property Use Type_Supermarket/Grocery Store | Largest Property Use Type_Urgent Care/Clinic/Other Outpatient | Largest Property Use Type_Wholesale Club/Supercenter | Largest Property Use Type_Worship Facility |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13276 | 5849784 | 90300.0 | 77300.0 | 1950 | 1 | 100 | 126.0 | 136.8 | 5.2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 7377 | 4398442 | 52000.0 | 52000.0 | 1926 | 1 | 100 | 95.4 | 102.0 | 4.7 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 9479 | 4665374 | 104700.0 | 105000.0 | 1954 | 1 | 100 | 40.4 | 40.0 | 3.8 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 14774 | 3393340 | 129333.0 | 129333.0 | 1992 | 1 | 100 | 157.1 | 163.1 | 16.9 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 3286 | 2704325 | 109896.0 | 116041.0 | 1927 | 1 | 100 | 62.3 | 68.2 | 3.5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [0]:
# Convert to numpy arrays
training_features = np.array(train_features)
testing_features = np.array(test_features)

# Sklearn wants the labels as one-dimensional vectors
training_targets = np.array(train_labels).reshape((-1,))
testing_targets = np.array(test_labels).reshape((-1,))

### 2.2 The TPOT Optimizer

In [0]:
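# Create a tpot object with a few parameters
tpot = TPOTRegressor(scoring='neg_mean_absolute_error',  # negated MAE, so higher is better
                     max_time_mins=480,                   # stop the search after 8 hours
                     n_jobs=-1,                           # use all available CPU cores
                     verbosity=2,                         # show progress while optimizing
                     cv=5)                                # 5-fold cross-validation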

In [0]:
# Fit the tpot model on the training data
tpot.fit(training_features, training_targets)

# Show the final model
print(tpot.fitted_pipeline_)

#### 2.2.1 Exporting the Best Model

In [0]:
# Export the pipeline as a python script file
tpot.export('tpot_exported_pipeline.py')

In [0]:
# Import file management
from google.colab import files

# Download the pipeline for local use
files.download('tpot_exported_pipeline.py')

#### 2.2.2 Viewing the Optimization Metrics

In [0]:
# To examine all fitted models
# tpot.evaluated_individuals_
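
The commented-out attribute above holds every pipeline TPOT evaluated. As a rough sketch of how it could be ranked, the code below assumes the attribute is a dictionary mapping each pipeline string to a dictionary that includes an `internal_cv_score` entry. `top_pipelines` is a hypothetical helper, and the sample scores are made up for illustration; they are not results from this notebook.

```python
def top_pipelines(evaluated_individuals, n=3):
    """Return the n best (pipeline_string, score) pairs, highest score first.

    Hypothetical helper; assumes each value has an 'internal_cv_score' key.
    """
    scored = [
        (pipeline, info["internal_cv_score"])
        for pipeline, info in evaluated_individuals.items()
    ]
    # With the 'neg_mean_absolute_error' scorer, scores are negative and
    # a larger value (closer to zero) means a lower error.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Made-up example data, for illustration only.
example = {
    "RidgeCV(input_matrix)": {"internal_cv_score": -12.5},
    "GradientBoostingRegressor(input_matrix)": {"internal_cv_score": -9.8},
    "LassoLarsCV(input_matrix)": {"internal_cv_score": -13.1},
}

for pipeline, score in top_pipelines(example, n=2):
    print("%8.2f  %s" % (score, pipeline))
```

After fitting, the same helper could be called as `top_pipelines(tpot.evaluated_individuals_)`, provided the attribute has the shape assumed here.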

### 2.3 Validating the Best Model on the Test Set

In [0]:
# Imports that the final pipeline needs
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
# SimpleImputer replaces the Imputer class, which was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer
from tpot.builtins import StackingEstimator

# Preprocessing steps: fill missing values with the training-set median
imputer = SimpleImputer(strategy="median")
imputer.fit(training_features)
training_features = imputer.transform(training_features)
testing_features = imputer.transform(testing_features)

# Final pipeline from TPOT
# Note: loss="lad" was renamed "absolute_error" in scikit-learn 1.0 and removed in 1.2,
# and recent releases have also removed the `normalize` argument, so this pipeline
# may need adjusting on newer scikit-learn versions.
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=LassoLarsCV(normalize=True)),
    GradientBoostingRegressor(alpha=0.95, learning_rate=0.1, loss="lad",
                              max_depth=7, max_features=0.75,
                              min_samples_leaf=3, min_samples_split=18,
                              n_estimators=100, subsample=0.60)
)

In [0]:
# Fit on the training data
exported_pipeline.fit(training_features, training_targets)

In [0]:
# Make predictions on the testing data
predictions = exported_pipeline.predict(testing_features)

print('Mean Absolute Error = %0.4f' % np.mean(abs(predictions - testing_targets)))
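
The mean absolute error printed above is easier to interpret next to a naive baseline. The sketch below uses only the standard library: it computes MAE as the average of the absolute differences, and compares it with the error from always predicting the training-set median. Both function names and the toy numbers are hypothetical and are not taken from the dataset used in this notebook.

```python
from statistics import median


def mean_absolute_error(predictions, targets):
    """Average of |prediction - target| over all samples."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


def median_baseline_mae(train_targets, test_targets):
    """MAE obtained by always predicting the median of the training targets."""
    guess = median(train_targets)
    return mean_absolute_error([guess] * len(test_targets), test_targets)


# Toy values for illustration only.
toy_train_targets = [50, 60, 70, 80, 90]
toy_test_targets = [55, 65, 85]
toy_predictions = [57, 61, 80]

print("Model MAE:    %0.4f" % mean_absolute_error(toy_predictions, toy_test_targets))
print("Baseline MAE: %0.4f" % median_baseline_mae(toy_train_targets, toy_test_targets))
```

With the notebook's arrays, the same comparison could be made by passing `predictions`, `training_targets`, and `testing_targets` to these functions. A model whose MAE is not clearly below the baseline has learned little from the features.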