# Linear Regression Example

Let's walk through the steps of the official documentation example. Working through it will help you read the documentation, understand it, and then apply it to your own problems (such as the upcoming Consulting Project).

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('lr_example').getOrCreate()

In [3]:
from pyspark.ml.regression import LinearRegression

In [4]:
# Load training data
# Use .format() so the read call follows the more general reader pattern
# sample_linear_regression_data.txt is in libsvm format.
training = spark.read.format("libsvm").load("sample_linear_regression_data.txt")

Interesting! We haven't seen the libsvm format before. In fact, it isn't very popular when working with datasets in Python, but the Spark documentation uses it a lot because of its formatting. Let's see what the training data looks like:

In [5]:
training.show()
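# There is a label column and a features column;
# this is the form in which data is fed to machine learning algorithms in Spark.
# Most real-world data does not come formatted this way.
# The documentation on the Spark website provides data that is already in this format.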

+-------------------+--------------------+
|              label|            features|
+-------------------+--------------------+
| -9.490009878824548|(10,[0,1,2,3,4,5,...|
| 0.2577820163584905|(10,[0,1,2,3,4,5,...|
| -4.438869807456516|(10,[0,1,2,3,4,5,...|
|-19.782762789614537|(10,[0,1,2,3,4,5,...|
| -7.966593841555266|(10,[0,1,2,3,4,5,...|
| -7.896274316726144|(10,[0,1,2,3,4,5,...|
| -8.464803554195287|(10,[0,1,2,3,4,5,...|
| 2.1214592666251364|(10,[0,1,2,3,4,5,...|
| 1.0720117616524107|(10,[0,1,2,3,4,5,...|
|-13.772441561702871|(10,[0,1,2,3,4,5,...|
| -5.082010756207233|(10,[0,1,2,3,4,5,...|
|  7.887786536531237|(10,[0,1,2,3,4,5,...|
| 14.323146365332388|(10,[0,1,2,3,4,5,...|
|-20.057482615789212|(10,[0,1,2,3,4,5,...|
|-0.8995693247765151|(10,[0,1,2,3,4,5,...|
| -19.16829262296376|(10,[0,1,2,3,4,5,...|
|  5.601801561245534|(10,[0,1,2,3,4,5,...|
|-3.2256352187273354|(10,[0,1,2,3,4,5,...|
| 1.5299675726687754|(10,[0,1,2,3,4,5,...|
| -0.250102447941961|(10,[0,1,2,3,4,5,...|
+-------------------+--------------------+

This is the format that Spark expects: two columns named "label" and "features".

The "label" column needs to hold a numerical label: either a numerical value for regression, or a numerical value that corresponds to a class for classification. Later on we will talk about unsupervised learning algorithms, which by their nature do not use or require a label.

The "features" column holds a vector of all the features that belong to that row. ***Usually we end up combining the various feature columns we have into a single 'features' column using the data transformations we've learned about.***
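
To make the `(10,[0,1,2,...` entries in the features column less mysterious, here is a minimal plain-Python sketch (no Spark involved) of how one libsvm line maps to a sparse vector of the form `(size, indices, values)`. The function name `parse_libsvm_line` and the example line are made up for illustration. The sketch assumes the usual libsvm layout of `label index:value index:value ...` with 1-based indices, which Spark's loader converts to 0-based indices.

```python
def parse_libsvm_line(line, num_features):
    """Parse one libsvm-formatted line into (label, (size, indices, values)).

    libsvm files use 1-based feature indices; Spark shows 0-based indices
    after loading, so we subtract 1 here to mimic that.
    """
    parts = line.split()
    label = float(parts[0])
    indices = []
    values = []
    for item in parts[1:]:
        index, value = item.split(":")
        indices.append(int(index) - 1)
        values.append(float(value))
    return label, (num_features, indices, values)


# Hypothetical line, made up for illustration (not taken from the data file).
example_line = "-9.49 1:0.45 3:-0.38 10:0.13"
label, sparse_vector = parse_libsvm_line(example_line, num_features=10)
print(label)          # -9.49
print(sparse_vector)  # (10, [0, 2, 9], [0.45, -0.38, 0.13])
```

Only the non-zero positions are stored, which is why the format suits data with many zero-valued features.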

Let's continue working through this simple example!

In [6]:
# These are the default values for the featuresCol, labelCol, predictionCol
# They must match the column names in the training dataset.
# When predictions are made, a 'prediction' column will be created.
lr = LinearRegression(featuresCol='features', labelCol='label', predictionCol='prediction')

# You could also pass in additional parameters for regularization. Do the reading
# in ISLR to fully understand that; after that it's just some simple parameter calls.
# Check the documentation with Shift+Tab for more info!
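
As a rough illustration of what the regularization parameters control, the sketch below computes the elastic net penalty in plain Python. It assumes the form given in the Spark MLlib documentation, where `regParam` scales the whole penalty and `elasticNetParam` mixes the L1 (lasso) and L2 (ridge) terms. The function name and the weights are made up for illustration.

```python
def elastic_net_penalty(coefficients, reg_param, elastic_net_param):
    """Penalty added to the loss: reg_param * (a * L1 + (1 - a) / 2 * L2^2),
    where a is elastic_net_param (0 = pure ridge, 1 = pure lasso)."""
    l1 = sum(abs(w) for w in coefficients)
    l2_squared = sum(w * w for w in coefficients)
    return reg_param * (
        elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2_squared
    )


weights = [1.0, -2.0]  # made-up coefficients
ridge = elastic_net_penalty(weights, reg_param=0.5, elastic_net_param=0.0)
lasso = elastic_net_penalty(weights, reg_param=0.5, elastic_net_param=1.0)
print(ridge)  # 1.25
print(lasso)  # 1.5
```

With `reg_param=0.0` the penalty vanishes and the fit is ordinary least squares.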

In [7]:
# Fit the model
lrModel = lr.fit(training)

In [8]:
# Print the coefficients and intercept for linear regression
# These are the regression coefficients and the intercept
print("Coefficients: {}".format(str(lrModel.coefficients))) # For each feature...
print('\n')
print("Intercept:{}".format(str(lrModel.intercept)))

Coefficients: [0.0073350710225801715,0.8313757584337543,-0.8095307954684084,2.441191686884721,0.5191713795290003,1.1534591903547016,-0.2989124112808717,-0.5128514186201779,-0.619712827067017,0.6956151804322931]


Intercept:0.14228558260358093
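
For intuition, a linear regression model turns these numbers into a prediction by adding the intercept to the sum of each coefficient times its feature value. The sketch below shows that arithmetic in plain Python with made-up values. The `predict` helper is hypothetical, not part of Spark.

```python
def predict(coefficients, intercept, features):
    """Return intercept + sum(coefficient_i * feature_i) for one row."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# Made-up values for illustration only.
coefficients = [2.0, -1.0, 0.5]
intercept = 1.0
features = [3.0, 4.0, 2.0]

prediction = predict(coefficients, intercept, features)
print(prediction)  # 1.0 + (6.0 - 4.0 + 1.0) = 4.0
```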


There is a summary attribute that contains even more info!

In [9]:
# Summarize the model over the training set and print out some metrics
trainingSummary = lrModel.summary

Lots of info, here are a few examples:

In [10]:
# Just be aware that these features exist (the values themselves are not important).

trainingSummary.residuals.show()
print("RMSE: {}".format(trainingSummary.rootMeanSquaredError))
print("r2: {}".format(trainingSummary.r2))

+-------------------+
|          residuals|
+-------------------+
|-11.011130022096554|
| 0.9236590911176538|
|-4.5957401897776675|
|  -20.4201774575836|
|-10.339160314788181|
|-5.9552091439610555|
|-10.726906349283922|
|  2.122807193191233|
|  4.077122222293811|
|-17.316168071241652|
| -4.593044343959059|
|  6.380476690746936|
| 11.320566035059846|
|-20.721971774534094|
| -2.736692773777401|
| -16.66886934252847|
|  8.242186378876315|
|-1.3723486332690233|
|-0.7060332131264666|
|-1.1591135969994064|
+-------------------+
only showing top 20 rows

RMSE: 10.16309157133015
r2: 0.027839179518600154
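
To see where these numbers come from, here is a minimal plain-Python sketch of the usual definitions: a residual is the label minus the prediction, RMSE is the square root of the mean squared residual, and r2 is one minus the ratio of the residual sum of squares to the total sum of squares. The data is made up, and `regression_metrics` is a hypothetical helper rather than Spark's implementation.

```python
import math


def regression_metrics(labels, predictions):
    """Return (residuals, rmse, r2) using the standard textbook definitions."""
    residuals = [y - p for y, p in zip(labels, predictions)]
    n = len(labels)
    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    mean_label = sum(labels) / n
    ss_tot = sum((y - mean_label) ** 2 for y in labels)  # total sum of squares
    rmse = math.sqrt(ss_res / n)
    r2 = 1.0 - ss_res / ss_tot
    return residuals, rmse, r2


# Made-up labels and predictions for illustration only.
labels = [1.0, 2.0, 3.0, 4.0]
predictions = [1.5, 1.5, 3.5, 3.5]
residuals, rmse, r2 = regression_metrics(labels, predictions)
print(residuals)  # [-0.5, 0.5, -0.5, 0.5]
print(rmse)       # 0.5
print(round(r2, 6))  # 0.8
```

Read this way, the r2 of about 0.03 printed above means the model explains only about 3% of the variance in the label.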


## Train/Test Splits

But wait! We've committed a big mistake: we never separated our data set into a training set and a test set. Instead we trained on ALL of the data, something we generally want to avoid doing. Read ISLR and check out the theory lecture for more info on this, but remember that we won't get a fair evaluation of our model by judging how well it does on the same data it was trained on!

Luckily, Spark DataFrames have an almost too convenient method for splitting the data! Let's see it:

In [11]:
all_data = spark.read.format("libsvm").load("sample_linear_regression_data.txt")

In [13]:
split_object = all_data.randomSplit([0.7,0.3])
split_object
# Two DataFrames are returned.

[DataFrame[label: double, features: vector],
 DataFrame[label: double, features: vector]]

In [14]:
# Pass in the split between training/test as a list.
# There is no single correct split, but 70/30 or 60/40 splits are commonly used,
# depending on how much data you have and how unbalanced it is.
train_data,test_data = all_data.randomSplit([0.7,0.3])

In [15]:
train_data.show()

+-------------------+--------------------+
|              label|            features|
+-------------------+--------------------+
|-28.571478869743427|(10,[0,1,2,3,4,5,...|
|-28.046018037776633|(10,[0,1,2,3,4,5,...|
|-26.805483428483072|(10,[0,1,2,3,4,5,...|
|-26.736207182601724|(10,[0,1,2,3,4,5,...|
|-22.949825936196074|(10,[0,1,2,3,4,5,...|
|-22.837460416919342|(10,[0,1,2,3,4,5,...|
|-21.432387764165806|(10,[0,1,2,3,4,5,...|
|-20.212077258958672|(10,[0,1,2,3,4,5,...|
|-20.057482615789212|(10,[0,1,2,3,4,5,...|
|-19.884560774273424|(10,[0,1,2,3,4,5,...|
|-19.872991038068406|(10,[0,1,2,3,4,5,...|
| -19.66731861537172|(10,[0,1,2,3,4,5,...|
|-19.402336030214553|(10,[0,1,2,3,4,5,...|
| -19.16829262296376|(10,[0,1,2,3,4,5,...|
|-18.845922472898582|(10,[0,1,2,3,4,5,...|
| -18.27521356600463|(10,[0,1,2,3,4,5,...|
|-17.494200356883344|(10,[0,1,2,3,4,5,...|
| -16.71909683360509|(10,[0,1,2,3,4,5,...|
|-16.692207021311106|(10,[0,1,2,3,4,5,...|
|-16.151349351277112|(10,[0,1,2,3,4,5,...|
+-------------------+--------------------+

In [16]:
train_data.describe().show()
# Compare the row counts (train vs. test)

+-------+-------------------+
|summary|              label|
+-------+-------------------+
|  count|                354|
|   mean| 0.6065868299972056|
| stddev| 10.403590079660882|
|    min|-28.571478869743427|
|    max| 27.111027963108548|
+-------+-------------------+



In [17]:
test_data.describe().show()
# Compare the row counts (train vs. test)

+-------+------------------+
|summary|             label|
+-------+------------------+
|  count|               147|
|   mean|-0.585241074145043|
| stddev|10.093732294560995|
|    min|-23.51088409032297|
|    max| 27.78383192005107|
+-------+------------------+



Now we train only on train_data:

In [20]:
correct_model = lr.fit(train_data)

Now we can get a summary object for the test data directly with the evaluate method:

In [21]:
test_results = correct_model.evaluate(test_data)

In [22]:
test_results.residuals.show()
print("RMSE: {}".format(test_results.rootMeanSquaredError))

+-------------------+
|          residuals|
+-------------------+
|-23.299035437852197|
| -25.29978873743065|
|-20.294289484012964|
| -16.62351816293436|
|-20.932310441799416|
|-17.174177731273772|
|-19.913274661656125|
|-15.188237908762025|
|-20.128344988257616|
|-17.430138549276485|
|-17.501335368778758|
|-15.667469202175303|
|-18.446387313101635|
| -12.11687057251665|
| -11.66788691410732|
|-14.218698802705932|
|-18.300866071999728|
| -13.09300297719901|
|-12.559689878523102|
| -13.01871908584511|
+-------------------+
only showing top 20 rows

RMSE: 10.409632421844782


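You can add more parameters to the linear regression model.<br>
You try out many such parameter settings, check what works and what doesn't, and repeat this process until you are comfortable with the model (this appears to refer to hyperparameter tuning).<br>
Once you have gone through this process, the model is ready to deploy.<br>
In general, the model has to be deployed on data that has no labels.<br>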

In [24]:
unlabeled_data = test_data.select('features')
# To create data without labels, select only the features and store them in a new variable
# This is presumably meant to illustrate the actual deployment process

In [25]:
unlabeled_data.show()

+--------------------+
|            features|
+--------------------+
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
+--------------------+
only showing top 20 rows



Well, that is nice, but realistically we will eventually want to use this model on unlabeled data; after all, that is the whole point of building the model in the first place. We can again do this with a convenient method call, in this case transform(), which was actually being called within the evaluate() method. Let's see it in action:

In [26]:
predictions = correct_model.transform(unlabeled_data)
# This dataset has no labels, so we cannot call evaluate on it.
# We use the transform method instead.

In [27]:
predictions.show()

+--------------------+--------------------+
|            features|          prediction|
+--------------------+--------------------+
|(10,[0,1,2,3,4,5,...| -0.2118486524707739|
|(10,[0,1,2,3,4,5,...|   1.812348616494139|
|(10,[0,1,2,3,4,5,...|   0.511526694398427|
|(10,[0,1,2,3,4,5,...|  -1.180108025730155|
|(10,[0,1,2,3,4,5,...|   3.503635870859908|
|(10,[0,1,2,3,4,5,...|-0.15254300140217736|
|(10,[0,1,2,3,4,5,...|  2.8478750357801084|
|(10,[0,1,2,3,4,5,...|  -1.838254355447523|
|(10,[0,1,2,3,4,5,...|  3.8669147128048844|
|(10,[0,1,2,3,4,5,...|  1.3444795082549956|
|(10,[0,1,2,3,4,5,...|  1.7692470965395137|
|(10,[0,1,2,3,4,5,...|-0.05604641087326623|
|(10,[0,1,2,3,4,5,...|   3.086842433268959|
|(10,[0,1,2,3,4,5,...|  -3.232000582862604|
|(10,[0,1,2,3,4,5,...| -3.6430936753089678|
|(10,[0,1,2,3,4,5,...|  0.2425678715532281|
|(10,[0,1,2,3,4,5,...|  4.5284245102968566|
|(10,[0,1,2,3,4,5,...| 0.05307491309439577|
|(10,[0,1,2,3,4,5,...|0.001114089666912...|
|(10,[0,1,2,3,4,5,...|   0.51794
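
The relationship between transform() and evaluate() can be sketched in plain Python: transform only needs features and adds a prediction, while evaluate runs transform and then compares the predictions with the labels. `TinyLinearModel` below is a hypothetical stand-in with made-up numbers, not a Spark class.

```python
import math


class TinyLinearModel:
    """Hypothetical stand-in for a fitted linear model; not a Spark class."""

    def __init__(self, coefficients, intercept):
        self.coefficients = coefficients
        self.intercept = intercept

    def transform(self, rows):
        """Return copies of the rows with a 'prediction' added; labels are not needed."""
        return [
            dict(
                row,
                prediction=self.intercept
                + sum(c * x for c, x in zip(self.coefficients, row["features"])),
            )
            for row in rows
        ]

    def evaluate(self, rows):
        """Call transform, then compare predictions with labels (labels required)."""
        predicted = self.transform(rows)
        residuals = [r["label"] - r["prediction"] for r in predicted]
        rmse = math.sqrt(sum(e * e for e in residuals) / len(residuals))
        return {"residuals": residuals, "rootMeanSquaredError": rmse}


model = TinyLinearModel(coefficients=[1.0, 2.0], intercept=0.5)
labeled = [
    {"label": 4.0, "features": [1.0, 1.0]},
    {"label": 2.0, "features": [2.0, 0.0]},
]
unlabeled = [{"features": row["features"]} for row in labeled]

results = model.evaluate(labeled)        # needs labels
predictions = model.transform(unlabeled)  # works without labels
print(results["rootMeanSquaredError"])            # 0.5
print([row["prediction"] for row in predictions])  # [3.5, 2.5]
```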

Okay, this data is a bit meaningless, so let's explore the same process with some data that makes a little more intuitive sense!