# Predicting Hourly Bike Rentals

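## Problem Description

### Predict the number of bike rentals per hour from a set of factors.
### The factors (features) are season, month, hour, holiday, day of week, working day, weather, temperature, apparent ("feels-like") temperature, humidity, and wind speed.
### The prediction target (label) is the number of rentals in each hour.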

In [None]:
import numpy as np
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.evaluation import RegressionMetrics
from pyspark.mllib.tree import DecisionTree
import math 

In [None]:
# Base path of the workshop files (a `global` statement at module level has no effect, so it is omitted)
Path="file:/home/spark/spark-workshop/"

## Note: we need some utility functions to handle the RDD records

In [None]:
def convert_float(x):
    return (0 if x=="?" else float(x))

In [None]:
def extract_label(record):
    label=(record[-1])
    return float(label)

In [None]:
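# The raw data has the following columns:
# instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
# Extract the features to use (season, month, hour, holiday, day of week, working day, weather, temperature, apparent temperature, humidity, wind speed):
# season,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed
def extract_features(record,featureEnd):
    # season is a single field: wrap it in a list instead of iterating over the string's characters
    featureSeason=[convert_float(record[2])]
    # mnth .. windspeed; yr (index 3) is skipped, and with featureEnd = len(record) - 1 the slice stops before casual/registered
    features=[convert_float(field)  for  field in record[4: featureEnd-2]]
    return  np.concatenate( (featureSeason, features))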
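
To check which columns the index arithmetic in `extract_features` actually selects, the same slicing can be applied to the header row instead of a data row. This is a minimal sketch in plain Python (no Spark needed); the names `columns`, `feature_end`, and `selected` are introduced here for illustration only, and `feature_end` mirrors the `len(r) - 1` value the notebook passes as `featureEnd`.

```python
# Column layout copied from the comment above extract_features.
header = ("instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,"
          "weathersit,temp,atemp,hum,windspeed,casual,registered,cnt")
columns = header.split(",")

# Same value the notebook passes as featureEnd: len(record) - 1.
feature_end = len(columns) - 1

# Same selection as extract_features: index 2, then the slice 4 .. featureEnd-2.
selected = [columns[2]] + columns[4:feature_end - 2]

print(selected)      # names of the columns used as features
print(columns[-1])   # name of the column used as the label
```

Applied to names rather than values, the slice makes it easy to confirm that `yr`, `casual`, and `registered` are left out and that the label is the last column.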

# Prepare the data

In [None]:
#----------------------1. Load and convert the data-------------
print("Loading data...")
rawDataWithHeader = sc.textFile(Path+"data/hour.csv")
header = rawDataWithHeader.first() 
rawData = rawDataWithHeader.filter(lambda x:x !=header)    
lines = rawData.map(lambda x: x.split(","))
print("Total: " + str(lines.count()) + " records\n")
print("RDD record format: \n" + str(lines.first()))

### RDD[list] -> RDD[LabeledPoint]

In [None]:
#----------------------2. Build the RDD[LabeledPoint] needed for training and evaluation-------------
labelpointRDD = lines.map(lambda r:LabeledPoint(
                                        extract_label(r), 
                                        extract_features(r,len(r) - 1)))

In [None]:
print (lines.first())

In [None]:
print(labelpointRDD.first())

# Train the model


In [None]:
#----------------------3. Randomly split the data into two parts: training and validation-------------
(trainData, validationData) = labelpointRDD.randomSplit([99, 1])
trainData.persist()
validationData.persist()

model = DecisionTree.trainRegressor(trainData, categoricalFeaturesInfo={}, impurity="variance", maxDepth=10, maxBins=100)

# Make predictions

In [None]:
# Compare the prediction with the actual count for up to 100 validation rows
for lp in validationData.take(100):
    predict = int(model.predict(lp.features))
    label=lp.label
    error = math.fabs(label - predict)
    dataDesc = "==> Predicted: " + str(predict ) + "  \t Actual:" + str(label) + " \t Error:" + str(error)
    print(dataDesc)
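
The loop above prints one absolute error per row. A single summary number, such as the mean absolute error (MAE) or root mean squared error (RMSE), is usually easier to compare across runs. The sketch below computes both in plain Python from a list of (prediction, label) pairs. The values in `pairs` are made up for illustration and are not results from this notebook; in the notebook the pairs would come from `model.predict` and `validationData`. The `RegressionMetrics` class imported at the top is intended for this kind of summary on an RDD of such pairs, though it is not used in the cells above.

```python
import math

# Hypothetical (prediction, label) pairs, for illustration only.
pairs = [(10.0, 12.0), (50.0, 47.0), (100.0, 104.0)]

n = float(len(pairs))

# Mean absolute error: average of |label - prediction|.
mae = sum(math.fabs(label - pred) for pred, label in pairs) / n

# Root mean squared error: square root of the average squared difference.
rmse = math.sqrt(sum((label - pred) ** 2 for pred, label in pairs) / n)

print("MAE:  " + str(mae))
print("RMSE: " + str(rmse))
```

RMSE penalises large misses more heavily than MAE, so reporting both gives a rough sense of whether the error is spread evenly or dominated by a few hours with large deviations.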