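# Classification with Logistic Regression
- Predict the y column of bank-full (whether the customer subscribed to a term deposit)
- For simplicity, use only the numeric columns and the string column "default" as features
- Standardize the numeric columns
- Index-encode the string column
- Evaluate the model with a confusion matrix and AUC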
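
As a rough illustration of the two preprocessing steps listed above, the following sketch standardizes a numeric column and index-encodes a string column in plain Python. It is not the notebook's Spark code: the helpers `standardize` and `index_strings` are hypothetical, the "yes" value in the sample is made up, and the most-frequent-first ordering is only assumed to resemble the default behaviour of Spark's `StringIndexer`.

```python
from collections import Counter
from statistics import mean, stdev


def standardize(values):
    """Scale values to zero mean and unit (sample) standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mu) / sigma for v in values]


def index_strings(values):
    """Map each distinct string to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {label: float(i) for i, label in enumerate(ordered)}
    return [mapping[v] for v in values], mapping


# Illustrative sample: balances resemble the first rows of the data,
# and one "yes" is inserted (hypothetical) so that two labels appear.
balance = [2143, 29, 2, 1506, 1]
default = ["no", "no", "yes", "no", "no"]

balance_scaled = standardize(balance)
default_index, default_mapping = index_strings(default)

print(balance_scaled)
print(default_index)
print(default_mapping)
```

After scaling, the numeric column is centred on zero with unit spread, and each string label becomes a numeric index that a linear model can consume.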

In [1]:
import numpy as np
import pandas as pd

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("logistic_regression").getOrCreate()

In [3]:
# bank-full.csv is semicolon-delimited; read the header row and infer column types
filename = "./data/bank/bank-full.csv"
data = spark.read.csv(filename, header=True, inferSchema=True, sep=';')
data.show()

+---+------------+--------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
|age|         job| marital|education|default|balance|housing|loan|contact|day|month|duration|campaign|pdays|previous|poutcome|  y|
+---+------------+--------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
| 58|  management| married| tertiary|     no|   2143|    yes|  no|unknown|  5|  may|     261|       1|   -1|       0| unknown| no|
| 44|  technician|  single|secondary|     no|     29|    yes|  no|unknown|  5|  may|     151|       1|   -1|       0| unknown| no|
| 33|entrepreneur| married|secondary|     no|      2|    yes| yes|unknown|  5|  may|      76|       1|   -1|       0| unknown| no|
| 47| blue-collar| married|  unknown|     no|   1506|    yes|  no|unknown|  5|  may|      92|       1|   -1|       0| unknown| no|
| 33|     unknown|  single|  unknown|     no|      1|     no|  no|unknown|  5|  may

In [33]:
# Collect the predicted labels into a pandas DataFrame for use with scikit-learn
y_pred = pred_test.select("prediction")
y_pred = y_pred.toPandas()
y_pred

       prediction
0             0.0
1             0.0
2             0.0
3             0.0
4             0.0
...           ...
13557         0.0
13558         1.0
13559         1.0
13560         0.0


In [35]:
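from sklearn.metrics import confusion_matrix

# Rows are true labels and columns are predicted labels, both in the order [0, 1]
class_name = [0, 1]
cnf_matrix = confusion_matrix(y_true, y_pred, labels=class_name)
cnf_matrix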

array([[11749,   214],
       [ 1304,   295]])
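
To make the layout of this matrix explicit, here is a minimal pure-Python sketch that counts the same four cells, following the scikit-learn convention that rows are true labels and columns are predicted labels. The function `confusion_counts` and the toy label lists are hypothetical and serve only as an illustration.

```python
def confusion_counts(y_true, y_pred, labels=(0, 1)):
    """Return a matrix whose entry [i][j] counts samples with
    true label labels[i] that were predicted as labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# Toy labels (hypothetical, not taken from the bank data).
toy_true = [0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 1, 0, 0, 1, 1]

toy_matrix = confusion_counts(toy_true, toy_pred)
# With labels ordered [0, 1], the rows unpack as (tn, fp) and (fn, tp).
(toy_tn, toy_fp), (toy_fn, toy_tp) = toy_matrix

print(toy_matrix)
print(toy_tn, toy_fp, toy_fn, toy_tp)
```

Read the same way, and assuming label 1 stands for y = yes, the matrix above has 11749 true negatives, 214 false positives, 1304 false negatives, and 295 true positives.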

In [36]:
# For binary labels [0, 1], the flattened matrix is ordered tn, fp, fn, tp
tn, fp, fn, tp = cnf_matrix.flatten()
print(tn, fp, fn, tp)

11749 214 1304 295


In [37]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print("accuracy:{}".format(accuracy_score(y_true, y_pred)))

accuracy:0.8880696062527651


In [38]:
print("precision:{}".format(precision_score(y_true, y_pred)))
print("recall:{}".format(recall_score(y_true, y_pred)))
print("f1:{}".format(f1_score(y_true, y_pred)))

precision:0.5795677799607073
recall:0.18449030644152595
f1:0.27988614800759015
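
The scores above follow directly from the four confusion-matrix counts. As a cross-check, this sketch recomputes them by hand from the standard definitions; the variable names are chosen for illustration only.

```python
# Counts taken from the confusion matrix printed earlier.
tn, fp, fn, tp = 11749, 214, 1304, 295

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print("accuracy:{}".format(accuracy))
print("precision:{}".format(precision))
print("recall:{}".format(recall))
print("f1:{}".format(f1))
```

Accuracy is close to 0.89, yet recall is only about 0.18, so the model misses most of the positive class even though it classifies the majority class well.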
