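Rather than installing PySpark on a local machine and running a notebook there, this exercise uses Google Colab, which comes with many machine-learning libraries preinstalled and provides capable hardware.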

To do this, install the pyspark and Py4J packages. Py4J lets a Python program access objects that live in a Java virtual machine. We use a local, standalone Spark instance.

In [1]:
!pip install pyspark==3.0.1 py4j==0.10.9 

Collecting pyspark==3.0.1
  Downloading pyspark-3.0.1.tar.gz (204.2 MB)
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.0.1-py2.py3-none-any.whl size=204612243 sha256=af774b053ace0e4d7df60a9ac0fba76503d6cd39128282a0cdf2c4de1565f3a0
  Stored in directory: /root/.cache/pip/wheels/5e/34/fa/b37b5cef503fc5148b478b2495043ba61b079120b7ff379f9b
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.0.1
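After the install, it can be worth confirming that the pinned versions are the ones Python actually sees. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of the notebook.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# The pins below are the versions requested in the pip cell above.
expected = {"pyspark": "3.0.1", "py4j": "0.10.9"}
for name, pinned in expected.items():
    found = installed_version(name)
    status = "ok" if found == pinned else "mismatch or missing"
    print(f"{name}: expected {pinned}, found {found} ({status})")
```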


In [2]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Boston Housing Linear Regression example") \
    .getOrCreate()

# Building a Boston Housing Price Prediction Model




In [3]:
spark

In [4]:
!wget https://s3-geospatial.s3-us-west-2.amazonaws.com/boston_housing.csv

--2022-03-29 18:54:27--  https://s3-geospatial.s3-us-west-2.amazonaws.com/boston_housing.csv
Resolving s3-geospatial.s3-us-west-2.amazonaws.com (s3-geospatial.s3-us-west-2.amazonaws.com)... 52.218.185.249
Connecting to s3-geospatial.s3-us-west-2.amazonaws.com (s3-geospatial.s3-us-west-2.amazonaws.com)|52.218.185.249|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 36240 (35K) [text/csv]
Saving to: ‘boston_housing.csv’


2022-03-29 18:54:27 (2.60 MB/s) - ‘boston_housing.csv’ saved [36240/36240]



In [5]:
!ls -tl

total 40
drwxr-xr-x 1 root root  4096 Mar 23 14:22 sample_data
-rw-r--r-- 1 root root 36240 Jan 31  2021 boston_housing.csv
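Before handing the file to Spark, a quick look at the header row can confirm that the download is the CSV we expect. This is a small sketch with the standard `csv` module; the helper name `read_header` is an assumption made for illustration.

```python
import csv
import os


def read_header(csv_path):
    """Return the column names from the first row of a CSV file."""
    with open(csv_path, newline="") as f:
        return next(csv.reader(f))


# Only runs when the downloaded file is present in the working directory.
if os.path.exists("boston_housing.csv"):
    print(read_header("boston_housing.csv"))
```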


In [6]:
data = spark.read.csv('./boston_housing.csv', header=True, inferSchema=True)

In [7]:
data.printSchema()

root
 |-- crim: double (nullable = true)
 |-- zn: double (nullable = true)
 |-- indus: double (nullable = true)
 |-- chas: integer (nullable = true)
 |-- nox: double (nullable = true)
 |-- rm: double (nullable = true)
 |-- age: double (nullable = true)
 |-- dis: double (nullable = true)
 |-- rad: integer (nullable = true)
 |-- tax: integer (nullable = true)
 |-- ptratio: double (nullable = true)
 |-- b: double (nullable = true)
 |-- lstat: double (nullable = true)
 |-- medv: double (nullable = true)



In [8]:
data.show()

+-------+----+-----+----+-----+-----+-----+------+---+---+-------+------+-----+----+
|   crim|  zn|indus|chas|  nox|   rm|  age|   dis|rad|tax|ptratio|     b|lstat|medv|
+-------+----+-----+----+-----+-----+-----+------+---+---+-------+------+-----+----+
|0.00632|18.0| 2.31|   0|0.538|6.575| 65.2|  4.09|  1|296|   15.3| 396.9| 4.98|24.0|
|0.02731| 0.0| 7.07|   0|0.469|6.421| 78.9|4.9671|  2|242|   17.8| 396.9| 9.14|21.6|
|0.02729| 0.0| 7.07|   0|0.469|7.185| 61.1|4.9671|  2|242|   17.8|392.83| 4.03|34.7|
|0.03237| 0.0| 2.18|   0|0.458|6.998| 45.8|6.0622|  3|222|   18.7|394.63| 2.94|33.4|
|0.06905| 0.0| 2.18|   0|0.458|7.147| 54.2|6.0622|  3|222|   18.7| 396.9| 5.33|36.2|
|0.02985| 0.0| 2.18|   0|0.458| 6.43| 58.7|6.0622|  3|222|   18.7|394.12| 5.21|28.7|
|0.08829|12.5| 7.87|   0|0.524|6.012| 66.6|5.5605|  5|311|   15.2| 395.6|12.43|22.9|
|0.14455|12.5| 7.87|   0|0.524|6.172| 96.1|5.9505|  5|311|   15.2| 396.9|19.15|27.1|
|0.21124|12.5| 7.87|   0|0.524|5.631|100.0|6.0821|  5|311|   15.2

## Building the Feature Vector

In [9]:
from pyspark.ml.feature import VectorAssembler

# Pack the training features into a single vector, so the table reduces to a feature vector plus the label column
feature_columns = data.columns[:-1]  # every column except the last one (medv, the label)
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")

In [10]:
feature_columns

['crim',
 'zn',
 'indus',
 'chas',
 'nox',
 'rm',
 'age',
 'dis',
 'rad',
 'tax',
 'ptratio',
 'b',
 'lstat']
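Conceptually, `VectorAssembler` takes the listed input columns of each row and packs their values, in order, into one numeric vector. The plain-Python sketch below mimics that idea for a single row. It is only an illustration: the real transformer produces a Spark ML vector, not a Python list, and the `assemble` helper is hypothetical.

```python
def assemble(row, input_cols):
    """Pack the selected columns of one row into a single list of floats."""
    return [float(row[name]) for name in input_cols]


feature_columns = ["crim", "zn", "indus", "chas", "nox", "rm", "age",
                   "dis", "rad", "tax", "ptratio", "b", "lstat"]

# First row of the dataset, as shown by data.show() above.
first_row = {"crim": 0.00632, "zn": 18.0, "indus": 2.31, "chas": 0,
             "nox": 0.538, "rm": 6.575, "age": 65.2, "dis": 4.09,
             "rad": 1, "tax": 296, "ptratio": 15.3, "b": 396.9,
             "lstat": 4.98, "medv": 24.0}

# The label (medv) is left out because it is not in feature_columns.
features = assemble(first_row, feature_columns)
print(features)
```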

In [11]:
data_2 = assembler.transform(data)  # transform(data): returns a new DataFrame with the assembled "features" column appended

In [13]:
data_2.show()  # the features column has been added

+-------+----+-----+----+-----+-----+-----+------+---+---+-------+------+-----+----+--------------------+
|   crim|  zn|indus|chas|  nox|   rm|  age|   dis|rad|tax|ptratio|     b|lstat|medv|            features|
+-------+----+-----+----+-----+-----+-----+------+---+---+-------+------+-----+----+--------------------+
|0.00632|18.0| 2.31|   0|0.538|6.575| 65.2|  4.09|  1|296|   15.3| 396.9| 4.98|24.0|[0.00632,18.0,2.3...|
|0.02731| 0.0| 7.07|   0|0.469|6.421| 78.9|4.9671|  2|242|   17.8| 396.9| 9.14|21.6|[0.02731,0.0,7.07...|
|0.02729| 0.0| 7.07|   0|0.469|7.185| 61.1|4.9671|  2|242|   17.8|392.83| 4.03|34.7|[0.02729,0.0,7.07...|
|0.03237| 0.0| 2.18|   0|0.458|6.998| 45.8|6.0622|  3|222|   18.7|394.63| 2.94|33.4|[0.03237,0.0,2.18...|
|0.06905| 0.0| 2.18|   0|0.458|7.147| 54.2|6.0622|  3|222|   18.7| 396.9| 5.33|36.2|[0.06905,0.0,2.18...|
|0.02985| 0.0| 2.18|   0|0.458| 6.43| 58.7|6.0622|  3|222|   18.7|394.12| 5.21|28.7|[0.02985,0.0,2.18...|
|0.08829|12.5| 7.87|   0|0.524|6.012| 66.6|5.5

## Splitting the Data into Training and Test Sets and Building a Linear Regression Model

In [14]:
train, test = data_2.randomSplit([0.7, 0.3])
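Because the split is random, the rows in `train` and `test` (and therefore the metrics further down) change from run to run. `randomSplit` accepts an optional `seed` argument, for example `data_2.randomSplit([0.7, 0.3], seed=42)`, which makes the split repeatable. The sketch below imitates the general idea in plain Python: each row is assigned independently, so the split sizes are only approximately 70/30. It is not Spark's actual sampler, and the row ids are stand-ins rather than real data.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row independently to one split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of each split on the [0, 1) interval.
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point leftover.
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(500))  # stand-in row ids, not the real dataset
train_ids, test_ids = random_split(rows, [0.7, 0.3], seed=42)
print(len(train_ids), len(test_ids))  # roughly 350 and 150, but not exactly
```

Calling `random_split` again with the same seed returns the same two lists, which is the property a fixed seed gives you in the notebook as well.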

In [15]:
from pyspark.ml.regression import LinearRegression

algo = LinearRegression(featuresCol="features", labelCol="medv")
model = algo.fit(train)
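The fitted model is a linear function of the feature vector: a prediction is the intercept plus the weighted sum of the features. In PySpark the learned values are available as `model.coefficients` and `model.intercept`. The sketch below shows the arithmetic with a hypothetical two-feature model; the numbers are made up for illustration and are not the coefficients learned in this notebook.

```python
def predict(coefficients, intercept, features):
    """Linear regression prediction: intercept plus the weighted sum of the features."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# Hypothetical two-feature model, for illustration only.
coefficients = [2.0, -0.5]
intercept = 10.0
print(predict(coefficients, intercept, [3.0, 4.0]))  # 10 + 2*3 - 0.5*4 = 14.0
```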

## Measuring Model Performance

In [16]:
evaluation_summary = model.evaluate(test)

In [17]:
evaluation_summary

<pyspark.ml.regression.LinearRegressionSummary at 0x7f80c5cce150>

In [18]:
evaluation_summary.meanAbsoluteError

3.459933231752423

In [19]:
evaluation_summary.rootMeanSquaredError

4.966602825649544

In [20]:
evaluation_summary.r2

0.6878842288595739
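The three numbers above are standard regression metrics. As a reminder of what they measure, here is a plain-Python sketch of the usual definitions of mean absolute error, root mean squared error, and R², applied to a tiny made-up example. The values are illustrative only and the function names are not Spark APIs.

```python
import math


def mean_absolute_error(labels, predictions):
    """Average of the absolute differences between labels and predictions."""
    return sum(abs(y - p) for y, p in zip(labels, predictions)) / len(labels)


def root_mean_squared_error(labels, predictions):
    """Square root of the average squared difference."""
    mse = sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)
    return math.sqrt(mse)


def r2_score(labels, predictions):
    """1 minus the ratio of residual variation to total variation around the mean label."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1 - ss_res / ss_tot


# Made-up values, for illustration only.
labels = [3.0, 5.0, 7.0]
predictions = [2.0, 5.0, 9.0]
print(mean_absolute_error(labels, predictions))      # 1.0
print(root_mean_squared_error(labels, predictions))  # sqrt(5/3), about 1.291
print(r2_score(labels, predictions))                 # 0.375
```

Read this way, an MAE of about 3.46 means the predictions are off by roughly 3.46 units of `medv` on average, and an R² of about 0.69 means the model accounts for roughly 69% of the variation in `medv` on this test split.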

## Inspecting the Model's Predictions

In [21]:
predictions = model.transform(test)  # transform here runs bulk prediction and appends a "prediction" column

In [22]:
predictions.show()

+-------+----+-----+----+------+-----+----+------+---+---+-------+------+-----+----+--------------------+------------------+
|   crim|  zn|indus|chas|   nox|   rm| age|   dis|rad|tax|ptratio|     b|lstat|medv|            features|        prediction|
+-------+----+-----+----+------+-----+----+------+---+---+-------+------+-----+----+--------------------+------------------+
|0.01501|90.0| 1.21|   1| 0.401|7.923|24.8| 5.885|  1|198|   13.6|395.52| 3.16|50.0|[0.01501,90.0,1.2...| 45.46486256768377|
|0.01778|95.0| 1.47|   0| 0.403|7.135|13.9|7.6534|  3|402|   17.0| 384.3| 4.45|32.9|[0.01778,95.0,1.4...| 30.91797979388035|
|0.01951|17.5| 1.38|   0|0.4161|7.104|59.5|9.2229|  3|216|   18.6|393.24| 8.05|33.0|[0.01951,17.5,1.3...|23.813266549681686|
|0.02498| 0.0| 1.89|   0| 0.518| 6.54|59.7|6.2669|  1|422|   15.9|389.96| 8.65|16.5|[0.02498,0.0,1.89...|22.960027602013582|
|0.02731| 0.0| 7.07|   0| 0.469|6.421|78.9|4.9671|  2|242|   17.8| 396.9| 9.14|21.6|[0.02731,0.0,7.07...|25.010168851197506|


In [23]:
predictions.select(predictions.columns[13:]).show()

+----+--------------------+------------------+
|medv|            features|        prediction|
+----+--------------------+------------------+
|50.0|[0.01501,90.0,1.2...| 45.46486256768377|
|32.9|[0.01778,95.0,1.4...| 30.91797979388035|
|33.0|[0.01951,17.5,1.3...|23.813266549681686|
|16.5|[0.02498,0.0,1.89...|22.960027602013582|
|21.6|[0.02731,0.0,7.07...|25.010168851197506|
|33.4|[0.03237,0.0,2.18...| 28.96931669702303|
|20.6|[0.03306,0.0,5.19...|22.259275064949172|
|19.5|[0.03427,0.0,5.19...|20.099677428849866|
|24.1|[0.03445,82.5,2.0...|29.003915213148975|
|19.4|[0.03466,35.0,6.0...|23.330666729706433|
|48.5|[0.0351,95.0,2.68...|42.365766920851826|
|22.0|[0.03537,34.0,6.0...|29.342519132425743|
|27.9|[0.03615,80.0,4.9...|31.329113002933653|
|24.8|[0.03659,25.0,4.8...|26.033327988186954|
|35.4|[0.03705,20.0,3.3...| 34.83030661743946|
|23.2|[0.03871,52.5,5.3...|26.840517411249277|
|33.3|[0.04011,80.0,1.5...|36.637584301731124|
|20.5|[0.04337,21.0,5.6...|23.765251195523284|
|24.8|[0.0441
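The slice `predictions.columns[13:]` works because the DataFrame has exactly 13 feature columns in front of `medv`, `features`, and `prediction`. The sketch below reproduces that with an ordinary Python list and shows a slice that does not hard-code the number 13. Selecting by name, as in `predictions.select("medv", "features", "prediction")`, is another option that stays correct if the column order changes.

```python
feature_columns = ["crim", "zn", "indus", "chas", "nox", "rm", "age",
                   "dis", "rad", "tax", "ptratio", "b", "lstat"]

# Column order of the predictions DataFrame, as shown by predictions.show() above.
columns = feature_columns + ["medv", "features", "prediction"]

by_position = columns[13:]                          # relies on there being 13 features
by_feature_count = columns[len(feature_columns):]   # same result without the magic number
print(by_position)
print(by_feature_count)
```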

In [24]:
model.save("boston_housing_model")

In [25]:
!ls boston_housing_model

data  metadata


In [26]:
!ls -tl boston_housing_model

total 8
drwxr-xr-x 2 root root 4096 Mar 29 19:04 data
drwxr-xr-x 2 root root 4096 Mar 29 19:04 metadata
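The saved model is a directory, not a single file. As far as I know, Spark ML writes the model's class name, parameters, and Spark version as a JSON line in a `part-*` text file under `metadata`, while `data` holds the learned values. Treat that layout as an assumption rather than a documented contract. Under that assumption, a small stdlib sketch for peeking at the metadata could look like this; the helper name `read_model_metadata` is hypothetical.

```python
import json
import os


def read_model_metadata(model_dir):
    """Parse the first JSON line found in a part-* file under <model_dir>/metadata."""
    metadata_dir = os.path.join(model_dir, "metadata")
    for name in sorted(os.listdir(metadata_dir)):
        if name.startswith("part-"):
            with open(os.path.join(metadata_dir, name)) as f:
                return json.loads(f.readline())
    return None  # no part file found


# Only runs when the model directory saved above is present.
if os.path.isdir("boston_housing_model"):
    print(read_model_metadata("boston_housing_model"))
```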


In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
model_save_name = "boston_housing_model2"
path = f"/content/gdrive/My Drive/{model_save_name}"
model.save(path)
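Spark's model writers do not overwrite an existing path by default, so running a save cell a second time is expected to fail. The built-in route is `model.write().overwrite().save(path)`. The sketch below shows an alternative in plain Python that removes an existing directory before saving. The helper `save_fresh` is hypothetical, and because it deletes data it should only be pointed at a path you are sure you no longer need.

```python
import os
import shutil


def save_fresh(model, path):
    """Delete any existing directory at `path`, then call model.save(path)."""
    if os.path.isdir(path):
        shutil.rmtree(path)
    model.save(path)
```

In the notebook this would be called as `save_fresh(model, path)` in place of `model.save(path)`.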

In [None]:
from pyspark.ml.regression import LinearRegressionModel

loaded_model = LinearRegressionModel.load(path)  # or load the local copy: "boston_housing_model"

In [None]:
predictions2 = loaded_model.transform(test)

In [None]:
predictions2.select(predictions2.columns[13:]).show()

+----+--------------------+------------------+
|medv|            features|        prediction|
+----+--------------------+------------------+
|22.0|[0.01096,55.0,2.2...|26.571482297719307|
|32.7|[0.01301,35.0,1.5...|30.167506151964368|
|35.4|[0.01311,90.0,1.2...|30.335900587302184|
|18.9|[0.0136,75.0,4.0,...|15.228505739892949|
|50.0|[0.01501,90.0,1.2...| 44.80928824019599|
|30.1|[0.01709,90.0,2.0...|25.264400971073016|
|50.0|[0.02009,95.0,2.6...| 41.68916053418553|
|42.3|[0.02177,82.5,2.0...|36.146191514394395|
|16.5|[0.02498,0.0,1.89...|22.746885437097184|
|23.9|[0.02543,55.0,3.7...|27.769299940712642|
|30.8|[0.02763,75.0,2.9...| 30.72520875882616|
|25.0|[0.02875,28.0,15....|28.649351219272933|
|18.5|[0.03041,0.0,5.19...|19.578146685590635|
|34.9|[0.0315,95.0,1.47...|29.605091447377312|
|33.4|[0.03237,0.0,2.18...|28.946444601423302|
|19.5|[0.03427,0.0,5.19...| 20.31754513618022|
|19.4|[0.03466,35.0,6.0...|22.979032509331866|
|45.4|[0.03578,20.0,3.3...| 38.10797692996398|
|20.7|[0.0373