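# Predicting Manufacturing Quality with Logistic Regression
Using the Spark Machine Learning Library (MLlib), the machine learning library built for Spark, we build a model that predicts quality from manufacturing process data.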

In [2]:
factory = spark.table("factory_csv")
display(factory)

ID,Quality,ProcessA-Pressure,ProcessA-Humidity,ProcessA-Vibration,ProcessB-Light,ProcessB-Skill,ProcessB-Temp,ProcessB-Rotation,ProcessC-Density,ProcessC-PH,ProcessC-skewness,ProcessC-Time
1,0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8
2,0,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5
3,0,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1
4,0,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9
5,0,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9
6,0,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1
7,0,6.2,0.32,0.16,7.0,0.045,30.0,136.0,0.9949,3.18,0.47,9.6
8,0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8
9,0,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5
10,0,8.1,0.22,0.43,1.5,0.044,28.0,129.0,0.9938,3.22,0.45,11.0


In [3]:
# Display summary statistics
display(factory.describe())

summary,ID,Quality,ProcessA-Pressure,ProcessA-Humidity,ProcessA-Vibration,ProcessB-Light,ProcessB-Skill,ProcessB-Temp,ProcessB-Rotation,ProcessC-Density,ProcessC-PH,ProcessC-skewness,ProcessC-Time
count,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0
mean,2449.5,0.2164148632094732,6.854787668436075,0.2782411188240108,0.3341915067374373,6.391414863209486,0.0457723560636995,35.30808493262556,138.36065741118824,0.9940273764801896,3.1882666394446693,0.4898468762760325,10.514267047770147
stddev,1414.0751394462743,0.4118423235271302,0.8438682276875127,0.1007945484248653,0.1210198042029825,5.072057784014878,0.0218479680937288,17.00713732523259,42.498064554142985,0.0029909069169369,0.1510005996150667,0.1141258339488322,1.230620567752269
min,1.0,0.0,3.8,0.08,0.0,0.6,0.009,2.0,9.0,0.98711,2.72,0.22,8.0
max,4898.0,1.0,14.2,1.1,1.66,65.8,0.346,289.0,440.0,1.03898,3.82,1.08,14.2
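
Because `Quality` is a 0/1 label, its mean in the summary above is the fraction of rows labelled 1. The sketch below works through that arithmetic using the count and mean copied from the table. The variable names are illustrative and do not appear in the notebook.

```python
# Values copied from the describe() output above.
row_count = 4898
quality_mean = 0.2164148632094732

# For a 0/1 label, mean = (rows labelled 1) / (total rows).
positives = round(quality_mean * row_count)
negatives = row_count - positives

print(f"Quality=1: {positives} rows ({positives / row_count:.1%})")
print(f"Quality=0: {negatives} rows ({negatives / row_count:.1%})")
```

Roughly one row in five is labelled 1, so a model that always predicts 0 would already look fairly accurate. That is one reason to prefer a ranking metric such as area under the ROC curve over plain accuracy.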


In [4]:
# Select the explanatory variables (features): every column except the ID and the label
nonFeatureCols = ['ID', 'Quality']
featureCols = [c for c in factory.columns if c not in nonFeatureCols]
print(featureCols)

In [5]:
from pyspark.ml.feature import VectorAssembler

# Combine the feature columns into a single vector column named "features"
assembler = VectorAssembler(inputCols=featureCols, outputCol="features")
dataset = assembler.transform(factory)
# Split roughly 85% / 15% into training and test sets, with a fixed seed
train, test = dataset.select("Quality", "features").randomSplit([0.85, 0.15], seed=1)

In [6]:
from pyspark.ml.classification import LogisticRegression
# Train the logistic regression model
model = LogisticRegression(labelCol="Quality", featuresCol="features", regParam=0.1, elasticNetParam=0.3).fit(train)
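
In Spark's `LogisticRegression`, `regParam` sets the overall regularization strength and `elasticNetParam` mixes the L1 and L2 penalties (0 is pure L2, 1 is pure L1). As a rough illustration, the sketch below evaluates an elastic-net penalty of the form λ(α‖w‖₁ + (1−α)/2·‖w‖₂²) in plain Python. The function name and the example weights are hypothetical and are not taken from the fitted model.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic-net penalty: reg_param * (a * L1 + (1 - a) / 2 * L2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (
        elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2_squared
    )


# Hypothetical coefficients, not taken from the fitted model.
example_weights = [0.5, -1.0, 0.0, 2.0]

# Same settings as the cell above: regParam=0.1, elasticNetParam=0.3.
penalty = elastic_net_penalty(example_weights, reg_param=0.1, elastic_net_param=0.3)
print(penalty)
```

With `elasticNetParam=0.3` most of the penalty comes from the L2 term, while the L1 share can still push some coefficients to exactly zero.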

In [7]:
# Show each model parameter with its description and current value
print(model.explainParams())

In [8]:
predictions = model.transform(test)
display(predictions)

Quality,features,rawPrediction,probability,prediction
0,"List(1, 11, List(), List(4.6, 0.445, 0.0, 1.4, 0.053, 11.0, 178.0, 0.99426, 3.79, 0.55, 10.2))","List(1, 2, List(), List(1.5938356876555162, -1.5938356876555162))","List(1, 2, List(), List(0.8311550732895667, 0.1688449267104333))",0.0
0,"List(1, 11, List(), List(4.8, 0.29, 0.23, 1.1, 0.044, 38.0, 180.0, 0.98924, 3.28, 0.34, 11.9))","List(1, 2, List(), List(0.7964564889398018, -0.7964564889398018))","List(1, 2, List(), List(0.6892159799404767, 0.3107840200595234))",0.0
0,"List(1, 11, List(), List(5.0, 0.2, 0.4, 1.9, 0.015, 20.0, 98.0, 0.9897, 3.37, 0.55, 12.05))","List(1, 2, List(), List(0.5771241530706483, -0.5771241530706483))","List(1, 2, List(), List(0.6404054058845735, 0.3595945941154265))",0.0
0,"List(1, 11, List(), List(5.0, 0.35, 0.25, 7.8, 0.031, 24.0, 116.0, 0.99241, 3.39, 0.4, 11.3))","List(1, 2, List(), List(1.0166688144472895, -1.0166688144472895))","List(1, 2, List(), List(0.7343232172079186, 0.26567678279208135))",0.0
0,"List(1, 11, List(), List(5.1, 0.11, 0.32, 1.6, 0.028, 12.0, 90.0, 0.99008, 3.57, 0.52, 12.2))","List(1, 2, List(), List(0.5367566358130527, -0.5367566358130527))","List(1, 2, List(), List(0.6310576065321765, 0.36894239346782354))",0.0
0,"List(1, 11, List(), List(5.1, 0.165, 0.22, 5.7, 0.047, 42.0, 146.0, 0.9934, 3.18, 0.55, 9.9))","List(1, 2, List(), List(1.5683713524763963, -1.5683713524763963))","List(1, 2, List(), List(0.827551307724404, 0.17244869227559603))",0.0
0,"List(1, 11, List(), List(5.1, 0.29, 0.28, 8.3, 0.026, 27.0, 107.0, 0.99308, 3.36, 0.37, 11.0))","List(1, 2, List(), List(1.0921885245003253, -1.0921885245003253))","List(1, 2, List(), List(0.7487936109773383, 0.25120638902266174))",0.0
0,"List(1, 11, List(), List(5.1, 0.35, 0.26, 6.8, 0.034, 36.0, 120.0, 0.99188, 3.38, 0.4, 11.5))","List(1, 2, List(), List(0.9479678159143659, -0.9479678159143659))","List(1, 2, List(), List(0.7207063058024723, 0.27929369419752764))",0.0
0,"List(1, 11, List(), List(5.3, 0.36, 0.27, 6.3, 0.028, 40.0, 132.0, 0.99186, 3.37, 0.4, 11.6))","List(1, 2, List(), List(0.8867339387529871, -0.8867339387529871))","List(1, 2, List(), List(0.708215712627724, 0.2917842873722761))",0.0
0,"List(1, 11, List(), List(5.4, 0.15, 0.32, 2.5, 0.037, 10.0, 51.0, 0.98878, 3.04, 0.58, 12.6))","List(1, 2, List(), List(0.42879432363881365, -0.42879432363881365))","List(1, 2, List(), List(0.605585727328901, 0.3944142726710989))",0.0
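
For binary logistic regression, the `probability` column is the logistic (sigmoid) function applied to the `rawPrediction` margins. The sketch below checks this in plain Python against the first row of the output above. The helper name `sigmoid` is ours and is not part of the notebook.

```python
import math


def sigmoid(margin):
    """Logistic function mapping a real-valued margin to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


# rawPrediction and probability copied from the first output row above.
raw_prediction = [1.5938356876555162, -1.5938356876555162]
reported_probability = [0.8311550732895667, 0.1688449267104333]

computed = [sigmoid(m) for m in raw_prediction]
print(computed)

# The predicted class is the index of the larger probability
# (equivalent to the default 0.5 threshold in the binary case).
prediction = float(computed.index(max(computed)))
print(prediction)
```

The two margins are negatives of each other, so the two probabilities always sum to 1.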


In [9]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Evaluate on the test set; the evaluator's default metric is area under the ROC curve
areaUnderROC = BinaryClassificationEvaluator(labelCol="Quality").evaluate(predictions)

In [10]:
print("areaUnderROC: ", areaUnderROC)
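
One way to read the area under the ROC curve: it is the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row, with ties counted as half. The sketch below computes it by brute force over all positive/negative pairs. The labels and scores are made up, and this is meant for intuition only; it is not how Spark computes the metric.

```python
def area_under_roc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores, not taken from the notebook's predictions.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = area_under_roc(toy_labels, toy_scores)
print(toy_auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives a scale for judging the `areaUnderROC` value printed above.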