## Spark MLlib Documentation:

[Regression: Linear least squares, Lasso, and ridge regression](https://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression)

In [4]:
from pyspark.sql import SparkSession

In [5]:
spark = SparkSession.builder.appName("linear-regression").getOrCreate()

In [0]:
from pyspark.ml.regression import LinearRegression

We need to split our data into a training set and a test set.
First read in all the data, then split it.

In [18]:
all_data = spark.read.format('libsvm').load("sample_linear_regression_data.txt")

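`randomSplit` takes a list of weights (they are normalized if they do not sum to 1).
In this case, we do a 70/30 split.
We end up with two DataFrames. Because the split is random, their sizes are only approximately 70% and 30% of the data.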

In [19]:
train_data,test_data = all_data.randomSplit([0.7, 0.3])

In [23]:
train_data.describe().show()

+-------+-------------------+
|summary|              label|
+-------+-------------------+
|  count|                359|
|   mean| 0.6434249129513124|
| stddev| 10.351608098840114|
|    min|-28.571478869743427|
|    max| 26.903524792043335|
+-------+-------------------+



In [26]:
test_data.describe().show()

+-------+-------------------+
|summary|              label|
+-------+-------------------+
|  count|                142|
|   mean|-0.7203397452805057|
| stddev|  10.20302685641113|
|    min|-28.046018037776633|
|    max|  27.78383192005107|
+-------+-------------------+
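
The two counts above (359 and 142, out of 501 rows) are close to 70/30 but not exact. A minimal pure-Python sketch of why, assuming each row is assigned to a split by an independent random draw against the normalized weights. This only mimics the idea; it is not Spark's implementation, and `simulate_random_split` is a hypothetical helper.

```python
import random


def simulate_random_split(n_rows, weights, seed=42):
    """Assign each row index to a split using one independent random draw per row."""
    total = sum(weights)
    # Cumulative upper bounds of the normalized weights, roughly [0.7, 1.0] here.
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in range(n_rows):
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches draws lost to floating-point rounding.
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


# 501 is the total row count implied by the two describe() outputs (359 + 142).
train_rows, test_rows = simulate_random_split(501, [0.7, 0.3])
print(len(train_rows), len(test_rows))
```

Changing `seed` changes the sizes slightly. In PySpark itself, `randomSplit` also accepts a `seed` argument, which makes the split reproducible between runs.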



In [0]:
lr = LinearRegression(
    featuresCol='features',
    labelCol='label',
    predictionCol='prediction'
)

Using the LinearRegression `lr`, let's fit a model on the training data and see how well it does on the test data.

In [32]:
correct_model = lr.fit(train_data)

In [33]:
test_results = correct_model.evaluate(test_data)

Using `evaluate` on our test data,
we can compare our predictions to the labels already assigned in the test data.

In [34]:
test_results.rootMeanSquaredError

10.553514572368591
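
The root mean squared error is the square root of the average squared difference between label and prediction. A small sketch of that formula in plain Python; the labels and predictions here are made-up illustration values, not rows from the data set.

```python
import math


def root_mean_squared_error(labels, predictions):
    """Square root of the mean of squared (label - prediction) differences."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values for illustration only.
labels = [3.0, -1.0, 2.0]
predictions = [2.0, 1.0, 2.0]
rmse = root_mean_squared_error(labels, predictions)
print(rmse)
```

The result is in the same units as the label, so it can be read as a typical size of the prediction error.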

Here we would keep adjusting the parameters passed to `LinearRegression` until we get a `correct model` (???)
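
One way that parameter search might look, as a rough sketch. `regParam` and `elasticNetParam` are real `LinearRegression` parameters; the helper name `compare_reg_params`, the candidate values, and the `validation_data` argument are assumptions made for illustration.

```python
def compare_reg_params(train_data, validation_data, reg_params=(0.0, 0.1, 1.0)):
    """Fit one model per regularization strength and collect its error metrics."""
    # Imported here so that defining the helper does not need a Spark installation.
    from pyspark.ml.regression import LinearRegression

    results = []
    for reg_param in reg_params:
        lr = LinearRegression(
            featuresCol='features',
            labelCol='label',
            predictionCol='prediction',
            regParam=reg_param,     # regularization strength
            elasticNetParam=0.0,    # 0.0 = ridge (L2) penalty, 1.0 = lasso (L1)
        )
        model = lr.fit(train_data)
        summary = model.evaluate(validation_data)
        results.append((reg_param, summary.rootMeanSquaredError, summary.r2))
    return results
```

It could be called as `compare_reg_params(train_data, test_data)` with the DataFrames from this notebook, although choosing parameters on the same data used for the final score tends to make that score look better than it is, so a separate validation split is the safer choice.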

Once we have our model, we apply it to unlabeled data:

In [31]:
unlabeled_data = test_data.select('features')

In [35]:
predictions = correct_model.transform(unlabeled_data)

In [36]:
predictions.show()

+--------------------+-------------------+
|            features|         prediction|
+--------------------+-------------------+
|(10,[0,1,2,3,4,5,...|  1.804426046859911|
|(10,[0,1,2,3,4,5,...|-2.7778043983324205|
|(10,[0,1,2,3,4,5,...|-1.2247633913288318|
|(10,[0,1,2,3,4,5,...|-1.1320147941825116|
|(10,[0,1,2,3,4,5,...| 3.6146317804971666|
|(10,[0,1,2,3,4,5,...| 2.6436316265297313|
|(10,[0,1,2,3,4,5,...| 2.8514117315183487|
|(10,[0,1,2,3,4,5,...|  3.385192616199273|
|(10,[0,1,2,3,4,5,...| 2.8271267451744113|
|(10,[0,1,2,3,4,5,...| -1.348867378918169|
|(10,[0,1,2,3,4,5,...|  2.236827231940151|
|(10,[0,1,2,3,4,5,...| 2.4998659153479155|
|(10,[0,1,2,3,4,5,...| 1.5388613772929964|
|(10,[0,1,2,3,4,5,...|-1.2852531544905879|
|(10,[0,1,2,3,4,5,...| 0.9678483198682253|
|(10,[0,1,2,3,4,5,...|  1.441515614170371|
|(10,[0,1,2,3,4,5,...|  5.126180272010419|
|(10,[0,1,2,3,4,5,...| 0.2316880294556295|
|(10,[0,1,2,3,4,5,...|-0.4735716613926174|
|(10,[0,1,2,3,4,5,...|-2.8416314136783467|
+--------------------+-------------------+
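
For a linear regression model, each value in the `prediction` column is the dot product of the model coefficients with the feature vector, plus the intercept (available as `correct_model.coefficients` and `correct_model.intercept`). A sketch of that arithmetic for a sparse vector like the ones shown above, which are stored as (size, indices, values). The numbers and the helper `predict_sparse` are hypothetical, not taken from the fitted model.

```python
def predict_sparse(size, indices, values, coefficients, intercept):
    """Linear prediction for a sparse feature vector given as (size, indices, values)."""
    if len(coefficients) != size:
        raise ValueError("need one coefficient per feature")
    # Only the non-zero entries of the sparse vector contribute to the sum.
    return intercept + sum(coefficients[i] * v for i, v in zip(indices, values))


# Hypothetical model with 4 features, for illustration only.
coefficients = [0.5, -1.0, 0.0, 2.0]
intercept = 0.25

# Sparse vector (4, [0, 3], [2.0, 1.0]) means features [2.0, 0.0, 0.0, 1.0].
prediction = predict_sparse(4, [0, 3], [2.0, 1.0], coefficients, intercept)
print(prediction)
```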

So many questions at this point...

The most important one: how do we know what a correct model is? Take a class in statistics? :)
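
One rough starting point is to compare the model against a trivial baseline that always predicts the mean label. The test RMSE above (about 10.55) is slightly larger than the standard deviation of the test labels (about 10.20), which suggests this model does no better than that baseline. A sketch of the arithmetic, assuming `describe()` reports the sample standard deviation (n - 1 denominator):

```python
# Values copied from the outputs earlier in this notebook.
rmse = 10.553514572368591         # test_results.rootMeanSquaredError
label_stddev = 10.20302685641113  # stddev of label in test_data.describe()
n = 142                           # row count of test_data

mse = rmse ** 2
# Convert the sample variance (n - 1 denominator, an assumption) to a population variance.
label_variance = label_stddev ** 2 * (n - 1) / n

# R^2 compares the model's squared error with that of always predicting the mean.
r2 = 1 - mse / label_variance
print(r2)
```

A value near 1 means the model explains most of the variation in the label, a value near 0 means it is about as good as predicting the mean, and a negative value (as here) means it is slightly worse. The evaluation summary also exposes this metric directly as `test_results.r2`.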