## __Chapter 4. Predicting Forest Cover with Decision Trees__

### 4.1 ~ 4.3.

* Supervised Learning ⊃ classification, regression  
* A popular, flexible algorithm that applies to both classification and regression: __"Decision Tree"__, and its extension, __"RandomForest"__  

* Feature = Dimension = Predictor = Variable
* Categorical Feature, Numeric Feature
* Target: a number in regression, a category in classification.

### 4.4. Decision Trees and Random Forests

(1) Decision Tree
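* Handles both categorical and numeric features.  
* Both single trees and ensembles of trees can be built in parallel.   
* Robust to outliers (a few extreme, possibly erroneous values have little or no effect on predictions).  
* Can consume data of different types and on different scales without preprocessing or normalization.  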
  
(2) Random Forest  
* An algorithm that generalizes the Decision Tree into something more powerful

Advantage of tree-based algorithms: they are intuitive to understand and reason about (a series of yes/no choices).

### 4.5. The Covtype Dataset

* Download the dataset

In [None]:
# !wget https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz

In [2]:
# !gzip -d covtype.data.gz

In [3]:
# !wget https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.info

* Load the data

In [4]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

datapath = './'
DRIVER_MEMORY = "8g"

spark = SparkSession.builder.appName('ch04')\
    .master('local[*]')\
    .config('spark.sql.warehouse.dir', datapath)\
    .config("spark.driver.memory", DRIVER_MEMORY)\
    .getOrCreate()

dataWithoutHeader = spark.read\
    .option('inferSchema', True)\
    .option('header', False)\
    .csv('covtype.data')

In [5]:
print((dataWithoutHeader.count(), len(dataWithoutHeader.columns)))

(581012, 55)


* Build the column names

In [6]:
colinfo = """
Elevation                               quantitative    meters                       Elevation in meters
Aspect                                  quantitative    azimuth                      Aspect in degrees azimuth
Slope                                   quantitative    degrees                      Slope in degrees
Horizontal_Distance_To_Hydrology        quantitative    meters                       Horz Dist to nearest surface water features
Vertical_Distance_To_Hydrology          quantitative    meters                       Vert Dist to nearest surface water features
Horizontal_Distance_To_Roadways         quantitative    meters                       Horz Dist to nearest roadway
Hillshade_9am                           quantitative    0 to 255 index               Hillshade index at 9am, summer solstice
Hillshade_Noon                          quantitative    0 to 255 index               Hillshade index at noon, summer soltice
Hillshade_3pm                           quantitative    0 to 255 index               Hillshade index at 3pm, summer solstice
Horizontal_Distance_To_Fire_Points      quantitative    meters                       Horz Dist to nearest wildfire ignition points
Wilderness_Area (4 binary columns)      qualitative     0 (absence) or 1 (presence)  Wilderness area designation
Soil_Type (40 binary columns)           qualitative     0 (absence) or 1 (presence)  Soil Type designation
Cover_Type (7 types)
"""

In [7]:
colNames = []
for line in colinfo.split('\n') :
    if len(line) > 0 :
        a_colname = line.split(' ')[0].strip()
        colNames.append(a_colname)
colNames

['Elevation',
 'Aspect',
 'Slope',
 'Horizontal_Distance_To_Hydrology',
 'Vertical_Distance_To_Hydrology',
 'Horizontal_Distance_To_Roadways',
 'Hillshade_9am',
 'Hillshade_Noon',
 'Hillshade_3pm',
 'Horizontal_Distance_To_Fire_Points',
 'Wilderness_Area',
 'Soil_Type',
 'Cover_Type']

In [8]:
# One-Hot Encoding column name
ohe_colNames = []
for a_ohe_colname, n_class in [('Wilderness_Area', 4), ('Soil_Type', 40)] :
    for n in range(n_class) :
        ohe_colNames.append(a_ohe_colname + '_' + str(n+1))

print(len(ohe_colNames))
ohe_colNames[-5:]

44


['Soil_Type_36',
 'Soil_Type_37',
 'Soil_Type_38',
 'Soil_Type_39',
 'Soil_Type_40']

In [9]:
colNames = colNames[:-3] + ohe_colNames + [colNames[-1]]
print(len(colNames))
colNames[:15]

55


['Elevation',
 'Aspect',
 'Slope',
 'Horizontal_Distance_To_Hydrology',
 'Vertical_Distance_To_Hydrology',
 'Horizontal_Distance_To_Roadways',
 'Hillshade_9am',
 'Hillshade_Noon',
 'Hillshade_3pm',
 'Horizontal_Distance_To_Fire_Points',
 'Wilderness_Area_1',
 'Wilderness_Area_2',
 'Wilderness_Area_3',
 'Wilderness_Area_4',
 'Soil_Type_1']

In [10]:
type(dataWithoutHeader)

pyspark.sql.dataframe.DataFrame

In [11]:
data = dataWithoutHeader.toDF(*colNames).withColumn('Cover_Type', F.col('Cover_Type').cast("double"))
data.cache()

DataFrame[Elevation: int, Aspect: int, Slope: int, Horizontal_Distance_To_Hydrology: int, Vertical_Distance_To_Hydrology: int, Horizontal_Distance_To_Roadways: int, Hillshade_9am: int, Hillshade_Noon: int, Hillshade_3pm: int, Horizontal_Distance_To_Fire_Points: int, Wilderness_Area_1: int, Wilderness_Area_2: int, Wilderness_Area_3: int, Wilderness_Area_4: int, Soil_Type_1: int, Soil_Type_2: int, Soil_Type_3: int, Soil_Type_4: int, Soil_Type_5: int, Soil_Type_6: int, Soil_Type_7: int, Soil_Type_8: int, Soil_Type_9: int, Soil_Type_10: int, Soil_Type_11: int, Soil_Type_12: int, Soil_Type_13: int, Soil_Type_14: int, Soil_Type_15: int, Soil_Type_16: int, Soil_Type_17: int, Soil_Type_18: int, Soil_Type_19: int, Soil_Type_20: int, Soil_Type_21: int, Soil_Type_22: int, Soil_Type_23: int, Soil_Type_24: int, Soil_Type_25: int, Soil_Type_26: int, Soil_Type_27: int, Soil_Type_28: int, Soil_Type_29: int, Soil_Type_30: int, Soil_Type_31: int, Soil_Type_32: int, Soil_Type_33: int, Soil_Type_34: int, 

In [12]:
target_distribution = data.groupBy('Cover_Type').count().orderBy('count', ascending=False)
target_distribution.show()   # imbalanced!

+----------+------+
|Cover_Type| count|
+----------+------+
|       2.0|283301|
|       1.0|211840|
|       3.0| 35754|
|       7.0| 20510|
|       6.0| 17367|
|       5.0|  9493|
|       4.0|  2747|
+----------+------+



In [13]:
y_dist = target_distribution.withColumn('count', F.col('count') / data.count()).toPandas()  # Type 2 is about 49%; Types 1 and 2 together are about 85%.
y_dist

Unnamed: 0,Cover_Type,count
0,2.0,0.487599
1,1.0,0.364605
2,3.0,0.061537
3,7.0,0.0353
4,6.0,0.029891
5,5.0,0.016339
6,4.0,0.004728


### 4.7. A First Decision Tree

* Train(+Validation) : Test = 9 : 1

In [14]:
trainData, testData = data.randomSplit([0.9, 0.1], seed=123)

In [15]:
trainData.cache()
testData.cache()

DataFrame[Elevation: int, Aspect: int, Slope: int, Horizontal_Distance_To_Hydrology: int, Vertical_Distance_To_Hydrology: int, Horizontal_Distance_To_Roadways: int, Hillshade_9am: int, Hillshade_Noon: int, Hillshade_3pm: int, Horizontal_Distance_To_Fire_Points: int, Wilderness_Area_1: int, Wilderness_Area_2: int, Wilderness_Area_3: int, Wilderness_Area_4: int, Soil_Type_1: int, Soil_Type_2: int, Soil_Type_3: int, Soil_Type_4: int, Soil_Type_5: int, Soil_Type_6: int, Soil_Type_7: int, Soil_Type_8: int, Soil_Type_9: int, Soil_Type_10: int, Soil_Type_11: int, Soil_Type_12: int, Soil_Type_13: int, Soil_Type_14: int, Soil_Type_15: int, Soil_Type_16: int, Soil_Type_17: int, Soil_Type_18: int, Soil_Type_19: int, Soil_Type_20: int, Soil_Type_21: int, Soil_Type_22: int, Soil_Type_23: int, Soil_Type_24: int, Soil_Type_25: int, Soil_Type_26: int, Soil_Type_27: int, Soil_Type_28: int, Soil_Type_29: int, Soil_Type_30: int, Soil_Type_31: int, Soil_Type_32: int, Soil_Type_33: int, Soil_Type_34: int, 

In [17]:
# (Revision) Did the random split preserve the target distribution?   Yes :)
import pandas as pd
dist_train1 = trainData.groupBy('Cover_Type').count().withColumn('count', F.col('count') / trainData.count()).toPandas().sort_values(by='count', ascending=False) 
dist_test1  = testData.groupBy('Cover_Type').count().withColumn('count', F.col('count') / testData.count()).toPandas().sort_values(by='count', ascending=False)
pd.concat([dist_train1, dist_test1],axis=1)

Unnamed: 0,Cover_Type,count,Cover_Type.1,count.1
4,2.0,0.48761,2.0,0.487503
1,1.0,0.364457,1.0,0.365943
3,3.0,0.061725,3.0,0.059837
0,7.0,0.035257,7.0,0.035691
5,6.0,0.029967,6.0,0.0292
6,5.0,0.016251,5.0,0.017136
2,4.0,0.004732,4.0,0.004691


* MLlib expects all features collected into a single vector column --> the VectorAssembler class

In [18]:
from pyspark.ml.feature import VectorAssembler

In [19]:
inputCols = [c for c in trainData.columns if c != 'Cover_Type']

assembler = VectorAssembler()\
    .setInputCols(inputCols)\
    .setOutputCol("featureVector")  # the resulting DataFrame gets a new 'featureVector' column
assembledTrainData = assembler.transform(trainData)

assembledTrainData.select('featureVector').show(n=5, truncate=False)  
# The vectors are stored as SparseVector instances to save space.
# Most of the 54 values are 0, so only the indices and values of the nonzero entries are kept.

+----------------------------------------------------------------------------------------------------+
|featureVector                                                                                       |
+----------------------------------------------------------------------------------------------------+
|(54,[0,1,2,3,4,5,6,7,8,9,13,15],[1863.0,37.0,17.0,120.0,18.0,90.0,217.0,202.0,115.0,769.0,1.0,1.0]) |
|(54,[0,1,2,5,6,7,8,9,13,18],[1874.0,18.0,14.0,90.0,208.0,209.0,135.0,793.0,1.0,1.0])                |
|(54,[0,1,2,3,4,5,6,7,8,9,13,15],[1888.0,33.0,22.0,150.0,46.0,108.0,209.0,185.0,103.0,735.0,1.0,1.0])|
|(54,[0,1,2,3,4,5,6,7,8,9,13,14],[1889.0,28.0,22.0,150.0,23.0,120.0,205.0,185.0,108.0,759.0,1.0,1.0])|
|(54,[0,1,2,3,4,5,6,7,8,9,13,18],[1889.0,353.0,30.0,95.0,39.0,67.0,153.0,172.0,146.0,600.0,1.0,1.0]) |
+----------------------------------------------------------------------------------------------------+
only showing top 5 rows
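
To make the sparse notation concrete, the following minimal sketch expands the first row printed above into a dense list in plain Python. The helper name `to_dense` is hypothetical and only illustrates the (size, indices, values) layout; it is not part of the PySpark API.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# First row of the featureVector output shown above.
size = 54
indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 13, 15]
values = [1863.0, 37.0, 17.0, 120.0, 18.0, 90.0,
          217.0, 202.0, 115.0, 769.0, 1.0, 1.0]

dense = to_dense(size, indices, values)
nonzero = sum(1 for x in dense if x != 0.0)
print(len(dense), nonzero)  # total features vs. entries actually stored
```

Given the column order used for `inputCols`, index 13 corresponds to `Wilderness_Area_4` and index 15 to `Soil_Type_2`, so this row has exactly one active wilderness area and one active soil type.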



* Build a DecisionTree model

In [20]:
from pyspark.ml.classification import DecisionTreeClassifier
# from pyspark.util import Random

In [21]:
%%time
classifier = DecisionTreeClassifier()\
    .setSeed(123)\
    .setLabelCol('Cover_Type')\
    .setFeaturesCol('featureVector')\
    .setPredictionCol('prediction')     # name of the column that stores the predictions

model = classifier.fit(assembledTrainData)
print(model.toDebugString)

DecisionTreeClassificationModel: uid=DecisionTreeClassifier_06ab10692562, depth=5, numNodes=43, numClasses=8, numFeatures=54
  If (feature 0 <= 3045.5)
   If (feature 0 <= 2563.5)
    If (feature 10 <= 0.5)
     If (feature 0 <= 2466.5)
      If (feature 3 <= 15.0)
       Predict: 4.0
      Else (feature 3 > 15.0)
       Predict: 3.0
     Else (feature 0 > 2466.5)
      If (feature 17 <= 0.5)
       Predict: 2.0
      Else (feature 17 > 0.5)
       Predict: 3.0
    Else (feature 10 > 0.5)
     Predict: 2.0
   Else (feature 0 > 2563.5)
    If (feature 0 <= 2952.5)
     If (feature 15 <= 0.5)
      If (feature 17 <= 0.5)
       Predict: 2.0
      Else (feature 17 > 0.5)
       Predict: 3.0
     Else (feature 15 > 0.5)
      Predict: 3.0
    Else (feature 0 > 2952.5)
     If (feature 3 <= 211.0)
      If (feature 36 <= 0.5)
       Predict: 2.0
      Else (feature 36 > 0.5)
       Predict: 1.0
     Else (feature 3 > 211.0)
      Predict: 2.0
  Else (feature 0 > 3045.5)
   If (feature 0 <= 
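
The printout reads as nested yes/no questions. As a rough illustration, the sketch below hand-translates only the visible branch (`feature 0 <= 3045.5`) into plain Python and applies it to the first two feature vectors shown earlier. The function name `predict_visible_branch` and the dict-based sparse representation are assumptions made for this example; the truncated branch is deliberately left unimplemented.

```python
def predict_visible_branch(sparse):
    """Follow the printed rules for the branch where feature 0 <= 3045.5.

    `sparse` maps feature index -> value; missing indices count as 0.0.
    Returns None for the branch that is cut off in the printout.
    """
    def f(i):
        return sparse.get(i, 0.0)

    if f(0) > 3045.5:
        return None  # rules for this branch are truncated above
    if f(0) <= 2563.5:
        if f(10) > 0.5:
            return 2.0
        if f(0) <= 2466.5:
            return 4.0 if f(3) <= 15.0 else 3.0
        return 2.0 if f(17) <= 0.5 else 3.0
    if f(0) <= 2952.5:
        if f(15) > 0.5:
            return 3.0
        return 2.0 if f(17) <= 0.5 else 3.0
    if f(3) > 211.0:
        return 2.0
    return 2.0 if f(36) <= 0.5 else 1.0

# First two rows of the featureVector output shown earlier.
row1 = dict(zip([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 13, 15],
                [1863.0, 37.0, 17.0, 120.0, 18.0, 90.0,
                 217.0, 202.0, 115.0, 769.0, 1.0, 1.0]))
row2 = dict(zip([0, 1, 2, 5, 6, 7, 8, 9, 13, 18],
                [1874.0, 18.0, 14.0, 90.0, 208.0, 209.0,
                 135.0, 793.0, 1.0, 1.0]))

print(predict_visible_branch(row1), predict_visible_branch(row2))  # 3.0 4.0
```

Feature 0 is `Elevation`, so most of the visible decisions are elevation thresholds, which is consistent with its dominant feature importance.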

* Feature Importance

In [22]:
model.featureImportances

SparseVector(54, {0: 0.8035, 3: 0.0384, 4: 0.0034, 5: 0.0019, 7: 0.0269, 9: 0.0034, 10: 0.0326, 12: 0.0113, 15: 0.0236, 17: 0.0304, 36: 0.0064, 45: 0.0183})

In [23]:
model.featureImportances.toArray()

array([0.80352448, 0.        , 0.        , 0.03835108, 0.00338292,
       0.00186985, 0.        , 0.02689828, 0.        , 0.00340973,
       0.03260615, 0.        , 0.01130231, 0.        , 0.        ,
       0.02360167, 0.        , 0.03039992, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.00635212, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.01830148, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        ])

In [24]:
ft_imp = model.featureImportances.toArray()
type(ft_imp)

numpy.ndarray

In [25]:
import pandas as pd
df_imp = pd.DataFrame(zip(inputCols, ft_imp), columns=['column_name', 'importances'])
df_imp.sort_values(by='importances', ascending=False).head(15)  # only 12 features have nonzero importance

Unnamed: 0,column_name,importances
0,Elevation,0.803524
3,Horizontal_Distance_To_Hydrology,0.038351
10,Wilderness_Area_1,0.032606
17,Soil_Type_4,0.0304
7,Hillshade_Noon,0.026898
15,Soil_Type_2,0.023602
45,Soil_Type_32,0.018301
12,Wilderness_Area_3,0.011302
36,Soil_Type_23,0.006352
9,Horizontal_Distance_To_Fire_Points,0.00341


* Predict on the training data

In [26]:
predictions = model.transform(assembledTrainData)
predictions.select('Cover_Type', 'prediction', 'probability').show(n = 5, truncate = False)

+----------+----------+------------------------------------------------------------------------------------------------------------------+
|Cover_Type|prediction|probability                                                                                                       |
+----------+----------+------------------------------------------------------------------------------------------------------------------+
|6.0       |3.0       |[0.0,0.0,0.03994933255383416,0.6242164409366981,0.04777680340381305,0.0,0.28805742310565463,0.0]                  |
|6.0       |4.0       |[0.0,4.1631973355537054E-4,0.05328892589508743,0.28184845961698585,0.4121565362198168,0.0,0.25228975853455454,0.0]|
|6.0       |3.0       |[0.0,0.0,0.03994933255383416,0.6242164409366981,0.04777680340381305,0.0,0.28805742310565463,0.0]                  |
|6.0       |3.0       |[0.0,0.0,0.03994933255383416,0.6242164409366981,0.04777680340381305,0.0,0.28805742310565463,0.0]                  |
|6.0       |3.0       |[0.0

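There are only 7 target classes, but the probability vector has 8 values.  
The first entry (index 0) is always 0.0 and carries no meaning: the labels run from 1 to 7 while the vector is indexed from 0, so index 0 stands for a class that never occurs. (This is a side effect of representing the information as a vector; keep it in mind when indexing.)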
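
As a quick check of how the vector index lines up with the class label, this small sketch takes the first two probability vectors printed above and picks the index with the highest probability. It is plain Python for illustration only and does not call any Spark API.

```python
# Probability vectors copied from the first two rows of the output above.
probabilities = [
    [0.0, 0.0, 0.03994933255383416, 0.6242164409366981,
     0.04777680340381305, 0.0, 0.28805742310565463, 0.0],
    [0.0, 4.1631973355537054e-4, 0.05328892589508743, 0.28184845961698585,
     0.4121565362198168, 0.0, 0.25228975853455454, 0.0],
]

# The index of the largest probability is the predicted class label.
predicted = [float(max(range(len(p)), key=p.__getitem__)) for p in probabilities]
print(predicted)  # [3.0, 4.0]
```

The result agrees with the `prediction` column for those two rows, and index 0 never wins because its probability is always 0.0.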

* Evaluate model performance

In [27]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [28]:
evaluator = MulticlassClassificationEvaluator()\
    .setLabelCol('Cover_Type')\
    .setPredictionCol('prediction')

acc = evaluator.setMetricName('accuracy').evaluate(predictions)
f1 = evaluator.setMetricName('f1').evaluate(predictions)
print('Accuracy : {:.4f},  F1-Score : {:.4f}'.format(acc, f1))

Accuracy : 0.7026,  F1-Score : 0.6854


In [29]:
from pyspark.mllib.evaluation import MulticlassMetrics

predictionRDD = predictions\
    .withColumn('prediction', F.col('prediction').cast('double'))\
    .withColumn('Cover_Type', F.col('Cover_Type').cast('double'))\
    .rdd

In [30]:
predictionRDD.count()

523238

In [31]:
predictionRDD.first()

Row(Elevation=1863, Aspect=37, Slope=17, Horizontal_Distance_To_Hydrology=120, Vertical_Distance_To_Hydrology=18, Horizontal_Distance_To_Roadways=90, Hillshade_9am=217, Hillshade_Noon=202, Hillshade_3pm=115, Horizontal_Distance_To_Fire_Points=769, Wilderness_Area_1=0, Wilderness_Area_2=0, Wilderness_Area_3=0, Wilderness_Area_4=1, Soil_Type_1=0, Soil_Type_2=1, Soil_Type_3=0, Soil_Type_4=0, Soil_Type_5=0, Soil_Type_6=0, Soil_Type_7=0, Soil_Type_8=0, Soil_Type_9=0, Soil_Type_10=0, Soil_Type_11=0, Soil_Type_12=0, Soil_Type_13=0, Soil_Type_14=0, Soil_Type_15=0, Soil_Type_16=0, Soil_Type_17=0, Soil_Type_18=0, Soil_Type_19=0, Soil_Type_20=0, Soil_Type_21=0, Soil_Type_22=0, Soil_Type_23=0, Soil_Type_24=0, Soil_Type_25=0, Soil_Type_26=0, Soil_Type_27=0, Soil_Type_28=0, Soil_Type_29=0, Soil_Type_30=0, Soil_Type_31=0, Soil_Type_32=0, Soil_Type_33=0, Soil_Type_34=0, Soil_Type_35=0, Soil_Type_36=0, Soil_Type_37=0, Soil_Type_38=0, Soil_Type_39=0, Soil_Type_40=0, Cover_Type=6.0, featureVector=SparseV

In [None]:
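# MulticlassMetrics expects (prediction, label) pairs; passing the full 59-column rows raises "ValueError: Length of object (59) does not match with length of fields (3)"
multiclassMetrics = MulticlassMetrics(predictionRDD.map(lambda r: (r['prediction'], r['Cover_Type'])))
multiclassMetrics.confusionMatrix()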

In [33]:
confusionMatrix = predictions\
    .groupBy('Cover_Type')\
    .pivot('prediction', [i+1 for i in range(7)])\
    .count()\
    .fillna(0)\
    .orderBy('Cover_Type')
confusionMatrix.show()

+----------+------+------+-----+---+---+---+----+
|Cover_Type|     1|     2|    3|  4|  5|  6|   7|
+----------+------+------+-----+---+---+---+----+
|       1.0|135581| 51194|  167|  1|  0|  0|3755|
|       2.0| 54825|195195| 4526|128|  0|  0| 462|
|       3.0|     0|  5213|26407|677|  0|  0|   0|
|       4.0|     0|    15| 1471|990|  0|  0|   0|
|       5.0|    14|  7802|  687|  0|  0|  0|   0|
|       6.0|     0|  5528| 9546|606|  0|  0|   0|
|       7.0|  8928|    22|   53|  0|  0|  0|9445|
+----------+------+------+-----+---+---+---+----+
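
The confusion matrix can be tied back to the accuracy reported earlier. The sketch below copies the counts from the table above into a plain Python list and derives overall accuracy and per-class recall; the variable names are chosen only for this example.

```python
# Rows are the actual Cover_Type 1..7, columns the predicted class 1..7
# (counts copied from the table above).
confusion = [
    [135581,  51194,   167,   1, 0, 0, 3755],
    [ 54825, 195195,  4526, 128, 0, 0,  462],
    [     0,   5213, 26407, 677, 0, 0,    0],
    [     0,     15,  1471, 990, 0, 0,    0],
    [    14,   7802,   687,   0, 0, 0,    0],
    [     0,   5528,  9546, 606, 0, 0,    0],
    [  8928,     22,    53,   0, 0, 0, 9445],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total

# Recall per class: correctly predicted rows / all rows of that class.
recall = {i + 1: confusion[i][i] / sum(confusion[i]) for i in range(len(confusion))}

print(f"accuracy = {accuracy:.4f}")
for label, value in recall.items():
    print(f"class {label}: recall = {value:.3f}")
```

The diagonal sum divided by the total reproduces the accuracy of 0.7026, while classes 5 and 6 have a recall of 0 because this tree never predicts them.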



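* Baseline: the accuracy of a random guess that picks each class in proportion to its frequency in the training set (the sum over classes of train prior × test prior)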

In [None]:
def classProba(df) :
    total = df.count()
    list_proba = df\
        .groupBy('Cover_Type')\
        .count()\
        .orderBy('Cover_Type')\
        .select(F.col('count').cast('double'))\
        .rdd.map(lambda x: x['count'] / total)\
        .collect()
    return list_proba

In [None]:
trainPriorProba = classProba(trainData)
testPriorProba  = classProba(testData)
accuracy = sum([x * y for x, y in zip(trainPriorProba, testPriorProba)])
accuracy
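
For a rough sense of the magnitude without running Spark, the same sum can be worked out by hand from the train and test class proportions in the split-distribution table shown earlier. This is only a sketch using those rounded values, so the result is approximate.

```python
# Class proportions copied from the train/test distribution table above,
# listed in the same class order (2, 1, 3, 7, 6, 5, 4) for both splits.
train_priors = [0.48761, 0.364457, 0.061725, 0.035257, 0.029967, 0.016251, 0.004732]
test_priors = [0.487503, 0.365943, 0.059837, 0.035691, 0.0292, 0.017136, 0.004691]

# Probability that a guess drawn from the training priors matches a test label.
random_guess_accuracy = sum(p * q for p, q in zip(train_priors, test_priors))
print(round(random_guess_accuracy, 4))
```

The value comes out near 0.377, so the first tree's accuracy of about 0.70 is clearly better than guessing by class frequency.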

### 4.8. Decision Tree Hyperparameters

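* maxDepth : the maximum number of levels in the tree, i.e. the longest chain of decisions from root to leaf. Limiting the depth helps prevent overfitting.   
* maxBins  : the number of candidate values (bins) considered when building a decision rule for a feature. More bins take longer to process but may find a more optimal rule.   
* impurity : the measure of "impurity" used to judge how good a decision rule is. gini (default) and entropy are the usual choices; for both, lower is better.  
* minInfoGain : the minimum information gain. It guards against overfitting: rules that do not improve the impurity of the subsets enough are rejected.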
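
The two impurity measures can be sketched in a few lines of plain Python. This is an illustrative implementation of the standard formulas (Gini = 1 − Σ p², entropy = −Σ p log₂ p), not Spark's own code; the base-2 logarithm and the function names are assumptions of this sketch.

```python
import math

def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Entropy in bits; classes with proportion 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

# Overall class proportions of the Covtype target, copied from the table above.
covtype = [0.487599, 0.364605, 0.061537, 0.0353, 0.029891, 0.016339, 0.004728]

print("gini   :", gini(covtype))
print("entropy:", entropy(covtype))

# Sanity checks: a pure subset has impurity 0, a 50/50 split is maximally impure.
print(gini([1.0]), gini([0.5, 0.5]))
print(entropy([0.5, 0.5]))
```

A decision rule is considered good when the subsets it produces have lower impurity than the parent set; `minInfoGain` sets how large that improvement must be.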

### 4.9. Tuning Decision Trees

* Create a Pipeline object that encapsulates the two steps (VectorAssembler and DecisionTreeClassifier).

In [35]:
from pyspark.ml import Pipeline

In [36]:
inputCols = [c for c in trainData.columns if c != 'Cover_Type']

assembler = VectorAssembler()\
    .setInputCols(inputCols)\
    .setOutputCol("featureVector")

classifier = DecisionTreeClassifier()\
    .setSeed(123)\
    .setLabelCol('Cover_Type')\
    .setFeaturesCol('featureVector')\
    .setPredictionCol('prediction')

pipeline = Pipeline().setStages([assembler, classifier])

* Define the hyperparameter grid: 2 candidate values for each of 4 parameters, so 2^4 = 16 models are built and evaluated.

In [37]:
from pyspark.ml.tuning import ParamGridBuilder

paramGrid = ParamGridBuilder()\
    .addGrid(classifier.impurity, ['gini', 'entropy'])\
    .addGrid(classifier.maxDepth, [1, 20])\
    .addGrid(classifier.maxBins, [40, 300])\
    .addGrid(classifier.minInfoGain, [0, 0.05])\
    .build()

multiClassEval = MulticlassClassificationEvaluator()\
    .setLabelCol('Cover_Type')\
    .setPredictionCol('prediction')\
    .setMetricName('accuracy')
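
To see where the 16 combinations come from, the grid can be enumerated in plain Python as a Cartesian product. This sketch only mirrors the values passed to `ParamGridBuilder` above; it does not use Spark, and the dictionary layout is an assumption made for illustration.

```python
from itertools import product

# Same candidate values as in the ParamGridBuilder call above.
grid = {
    "impurity": ["gini", "entropy"],
    "maxDepth": [1, 20],
    "maxBins": [40, 300],
    "minInfoGain": [0.0, 0.05],
}

# One dict per combination: the Cartesian product of all candidate lists.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(combinations))  # 16
for combo in combinations[:3]:
    print(combo)
```

Each additional parameter with two candidates doubles the number of models, so the grid grows quickly as parameters are added.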

* Evaluate with TrainValidationSplit  
(cf. k-fold CrossValidator: roughly k times more expensive, and with big data the additional benefit is usually small.)

In [38]:
%%time
from pyspark.ml.tuning import TrainValidationSplit

validator = TrainValidationSplit()\
    .setSeed(123)\
    .setEstimator(pipeline)\
    .setEvaluator(multiClassEval)\
    .setEstimatorParamMaps(paramGrid)\
    .setTrainRatio(0.9)

validatorModel = validator.fit(trainData)

CPU times: user 1.95 s, sys: 1.11 s, total: 3.06 s
Wall time: 4min 47s


* Inspect the hyperparameters of the best model

In [39]:
from pyspark.ml import PipelineModel

bestModel = validatorModel.bestModel

In [40]:
param_map = bestModel.stages[-1].extractParamMap()
len(param_map)

16

In [41]:
for i, a_pair in enumerate(param_map.items()) :
    if i >= 3 :
        break
    else :
        print(a_pair, '\n')

(Param(parent='DecisionTreeClassifier_ab3c8a9e8b49', name='cacheNodeIds', doc='If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval.'), False) 

(Param(parent='DecisionTreeClassifier_ab3c8a9e8b49', name='checkpointInterval', doc='set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext.'), 10) 

(Param(parent='DecisionTreeClassifier_ab3c8a9e8b49', name='featuresCol', doc='features column name.'), 'featureVector') 



In [42]:
def print_param_map(param_map) :
    rtn_dict = {}
    for k,v in param_map.items() :
        rtn_dict.update({str(k).split('__')[-1] : v})
    return rtn_dict
        
print_param_map(param_map)

{'cacheNodeIds': False,
 'checkpointInterval': 10,
 'featuresCol': 'featureVector',
 'impurity': 'entropy',
 'labelCol': 'Cover_Type',
 'leafCol': '',
 'maxBins': 40,
 'maxDepth': 20,
 'maxMemoryInMB': 256,
 'minInfoGain': 0.0,
 'minInstancesPerNode': 1,
 'minWeightFractionPerNode': 0.0,
 'predictionCol': 'prediction',
 'probabilityCol': 'probability',
 'rawPredictionCol': 'rawPrediction',
 'seed': 123}

In [43]:
%%time
validatorModel = validator.fit(trainData)

temp_pnm = validatorModel.validationMetrics
type(temp_pnm)

CPU times: user 2.31 s, sys: 1.24 s, total: 3.55 s
Wall time: 5min 9s


list

In [44]:
temp_pnm[0]

0.633552340031984

In [45]:
paramsAndMetrics = [(metric, param) for metric, param in zip(temp_pnm, validatorModel.getEstimatorParamMaps())]
paramsAndMetrics = sorted(paramsAndMetrics, key=lambda x: x[0], reverse=True)

for metric, param in paramsAndMetrics :
    print(metric)
    print(print_param_map(param))
    print()

0.9111963160632743
{'impurity': 'entropy', 'maxDepth': 20, 'maxBins': 40, 'minInfoGain': 0.0}

0.911061443902815
{'impurity': 'entropy', 'maxDepth': 20, 'maxBins': 300, 'minInfoGain': 0.0}

0.9044334405888134
{'impurity': 'gini', 'maxDepth': 20, 'maxBins': 40, 'minInfoGain': 0.0}

0.9031039864357141
{'impurity': 'gini', 'maxDepth': 20, 'maxBins': 300, 'minInfoGain': 0.0}

0.7263443864280071
{'impurity': 'entropy', 'maxDepth': 20, 'maxBins': 300, 'minInfoGain': 0.05}

0.7243405714726113
{'impurity': 'entropy', 'maxDepth': 20, 'maxBins': 40, 'minInfoGain': 0.05}

0.6692549276507196
{'impurity': 'gini', 'maxDepth': 20, 'maxBins': 300, 'minInfoGain': 0.05}

0.6690044507812951
{'impurity': 'gini', 'maxDepth': 20, 'maxBins': 40, 'minInfoGain': 0.05}

0.63376428199842
{'impurity': 'gini', 'maxDepth': 1, 'maxBins': 300, 'minInfoGain': 0.0}

0.63376428199842
{'impurity': 'gini', 'maxDepth': 1, 'maxBins': 300, 'minInfoGain': 0.05}

0.633552340031984
{'impurity': 'gini', 'maxDepth': 1, 'maxBins':

What accuracy does this model achieve on the test dataset?

In [47]:
multiClassEval.evaluate(bestModel.transform(testData))

0.914754041610413

### 4.10. Categorical Features Revisited

* Representing one categorical variable with 40 values as 40 numeric variables increases memory usage and degrades performance.
* What happens if we do not use one-hot encoding?

* pandas UDF  
https://docs.microsoft.com/ko-kr/azure/databricks/spark/latest/spark-sql/udf-python-pandas   

    - a user-defined function that uses Apache Arrow to transfer data and pandas to work with the data.
    - You define a pandas UDF using the keyword "pandas_udf" as a __decorator__ and wrap the function with a __Python type hint__. 
    - (1) Series --> Series   
    - (2) Iterator of Series --> Iterator of Series   
    - (3) Iterator of multiple Series --> Iterator of Series  
    - (4) Series --> scalar

In [34]:
wilder_cols = [c for c in inputCols if 'Wilder' in c]
wilder_cols

['Wilderness_Area_1',
 'Wilderness_Area_2',
 'Wilderness_Area_3',
 'Wilderness_Area_4']

In [35]:
# First, bundle the one-hot columns into a single array column.
df_wilder = data.select(*wilder_cols, F.array(*wilder_cols).alias('wilderness_area'))
df_wilder.show(3)

+-----------------+-----------------+-----------------+-----------------+---------------+
|Wilderness_Area_1|Wilderness_Area_2|Wilderness_Area_3|Wilderness_Area_4|wilderness_area|
+-----------------+-----------------+-----------------+-----------------+---------------+
|                1|                0|                0|                0|   [1, 0, 0, 0]|
|                1|                0|                0|                0|   [1, 0, 0, 0]|
|                1|                0|                0|                0|   [1, 0, 0, 0]|
+-----------------+-----------------+-----------------+-----------------+---------------+
only showing top 3 rows



In [36]:
df_wilder.printSchema()  # note that the new column is an array type

root
 |-- Wilderness_Area_1: integer (nullable = true)
 |-- Wilderness_Area_2: integer (nullable = true)
 |-- Wilderness_Area_3: integer (nullable = true)
 |-- Wilderness_Area_4: integer (nullable = true)
 |-- wilderness_area: array (nullable = false)
 |    |-- element: integer (containsNull = true)



In [37]:
import pyspark.sql.functions as F
import pyspark.sql.types as T
import numpy as np

@F.pandas_udf(T.LongType())
def unhot_udf(arrs: pd.Series) -> pd.Series :
    return pd.Series([np.argmax(a)+1 for a in arrs])  # return the 1-based position of the largest value in each array

In [38]:
df_wilder.select(unhot_udf(F.col('wilderness_area')).alias('Wilderness_Area')).show(3)

+---------------+
|Wilderness_Area|
+---------------+
|              1|
|              1|
|              1|
+---------------+
only showing top 3 rows
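
The core of `unhot_udf` is just "find the position of the 1". The sketch below shows that logic in plain Python, without Spark, pandas, or NumPy, on a few made-up one-hot rows; the function name `unhot` and the sample rows are hypothetical.

```python
def unhot(one_hot):
    """Return the 1-based position of the largest value (the single 1)."""
    return max(range(len(one_hot)), key=one_hot.__getitem__) + 1

# Made-up one-hot rows for illustration; each has exactly one 1.
rows = [[1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0]]
print([unhot(r) for r in rows])  # [1, 4, 2]
```

Adding 1 keeps the result aligned with the `Wilderness_Area_1` to `Wilderness_Area_4` column suffixes, the same convention the UDF uses.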

