### Spark Sprint Challenge:

**Question 1**:

In this question, you will use the MapReduce paradigm in Spark to aggregate movie ratings. The dataset is the 100k version of MovieLens, which can be accessed from here:

http://files.grouplens.org/datasets/movielens/ml-100k.zip  -- download the dataset and unzip the archive (if you are running on Colab, you can use `!unzip`).

The structure of the "**u.data**" file is described in the following README.txt:

http://files.grouplens.org/datasets/movielens/ml-100k-README.txt

However, for the sake of simplicity, I have copied the structure of the "u.data" file here:

**u.data** -- The full u data set, 100000 ratings by 943 users on 1682 items. Each user has rated at least 20 movies. Users and items are numbered consecutively from 1. The data is randomly ordered. This is a tab-separated list of `user id | movie id | rating | timestamp`. The timestamps are Unix seconds since 1/1/1970 UTC.
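
As a small illustration of the timestamp field, the sketch below converts the timestamp of the first row of "u.data" (`881250949`, visible in the `collect()` output further down) into a UTC datetime using only the Python standard library. The variable names are illustrative and are not part of the dataset or its README.

```python
from datetime import datetime, timezone

# Timestamp field of the first row of u.data: 196, 242, 3, 881250949
raw_timestamp = "881250949"

# Unix seconds -> timezone-aware datetime in UTC
rated_at = datetime.fromtimestamp(int(raw_timestamp), tz=timezone.utc)
print(rated_at.isoformat())
```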



**Step 1:** Load the contents of the "u.data" file into an RDD

**Step 2:** Use a map operation to rearrange the results from Step 1. The order should be user id, rating and then item id. Within the map operation, ensure that the elements are of type int, float and float respectively.

For example, after rearranging the first row of data, the result will be ==> 196, 3.0, 242

**Step 3:** Use the map and reduce operators to sum all the ratings in the dataset
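
The steps above can be sketched roughly as follows. To keep the example runnable without a Spark installation, it applies Python's built-in `map` and `functools.reduce` to the first five rows of "u.data"; the helper name `parse_rating` and the variable names are assumptions made for illustration. The trailing comments show how the same function would plug into the RDD API.

```python
from functools import reduce
from operator import add

def parse_rating(line):
    """Turn a tab-separated u.data line into (user id, rating, item id)."""
    user_id, item_id, rating, _timestamp = line.split("\t")
    return int(user_id), float(rating), float(item_id)

# First five rows of u.data, as they appear in the file
sample_lines = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t377\t1\t878887116",
    "244\t51\t2\t880606923",
    "166\t346\t1\t886397596",
]

# Step 2: rearrange each row (local stand-in for rdd.map(parse_rating))
rearranged = list(map(parse_rating, sample_lines))

# Step 3: keep only the rating, then sum (stand-in for .map(...).reduce(add))
total = reduce(add, map(lambda t: t[1], rearranged))

print(rearranged[0])  # (196, 3.0, 242.0)
print(total)          # 10.0

# With a SparkContext `sc`, the same function plugs into the RDD API:
#   ratings = sc.textFile("ml-100k/u.data").map(parse_rating)
#   total = ratings.map(lambda t: t[1]).reduce(add)
```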



In [0]:
# Spark setup
# https://mikestaszel.com/2018/03/07/apache-spark-on-google-colaboratory/
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
!tar xf spark-2.3.1-bin-hadoop2.7.tgz
!pip install -q findspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.1-bin-hadoop2.7"

import findspark
findspark.init()
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

In [0]:
!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip ml-100k.zip

--2018-06-22 19:28:03--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.34.235
Connecting to files.grouplens.org (files.grouplens.org)|128.101.34.235|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip.4’


2018-06-22 19:28:04 (4.87 MB/s) - ‘ml-100k.zip.4’ saved [4924029/4924029]

Archive:  ml-100k.zip
replace ml-100k/allbut.pl? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C


In [0]:
ls ml-100k


[0m[01;32mallbut.pl[0m*  u1.base  u2.test  u4.base  u5.test  ub.base  u.genre  u.occupation
[01;32mmku.sh[0m*     u1.test  u3.base  u4.test  ua.base  ub.test  u.info   u.user
README      u2.base  u3.test  u5.base  ua.test  u.data   u.item


In [0]:
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = SparkContext.getOrCreate()

In [0]:
movies = sc.textFile('ml-100k/u.data')

In [0]:
movies

ml-100k/u.data MapPartitionsRDD[4] at textFile at NativeMethodAccessorImpl.java:0

In [0]:
movies.collect()

['196\t242\t3\t881250949',
 '186\t302\t3\t891717742',
 '22\t377\t1\t878887116',
 '244\t51\t2\t880606923',
 '166\t346\t1\t886397596',
 '298\t474\t4\t884182806',
 '115\t265\t2\t881171488',
 '253\t465\t5\t891628467',
 '305\t451\t3\t886324817',
 '6\t86\t3\t883603013',
 '62\t257\t2\t879372434',
 '286\t1014\t5\t879781125',
 '200\t222\t5\t876042340',
 '210\t40\t3\t891035994',
 '224\t29\t3\t888104457',
 '303\t785\t3\t879485318',
 '122\t387\t5\t879270459',
 '194\t274\t2\t879539794',
 '291\t1042\t4\t874834944',
 '234\t1184\t2\t892079237',
 '119\t392\t4\t886176814',
 '167\t486\t4\t892738452',
 '299\t144\t4\t877881320',
 '291\t118\t2\t874833878',
 '308\t1\t4\t887736532',
 '95\t546\t2\t879196566',
 '38\t95\t5\t892430094',
 '102\t768\t2\t883748450',
 '63\t277\t4\t875747401',
 '160\t234\t5\t876861185',
 '50\t246\t3\t877052329',
 '301\t98\t4\t882075827',
 '225\t193\t4\t879539727',
 '290\t88\t4\t880731963',
 '97\t194\t3\t884238860',
 '157\t274\t4\t886890835',
 '181\t1081\t1\t878962623',
 '278\t603\t5\t

In [0]:
data = movies.map(lambda x: x.split())

In [0]:
data.collect()

[['196', '242', '3', '881250949'],
 ['186', '302', '3', '891717742'],
 ['22', '377', '1', '878887116'],
 ['244', '51', '2', '880606923'],
 ['166', '346', '1', '886397596'],
 ['298', '474', '4', '884182806'],
 ['115', '265', '2', '881171488'],
 ['253', '465', '5', '891628467'],
 ['305', '451', '3', '886324817'],
 ['6', '86', '3', '883603013'],
 ['62', '257', '2', '879372434'],
 ['286', '1014', '5', '879781125'],
 ['200', '222', '5', '876042340'],
 ['210', '40', '3', '891035994'],
 ['224', '29', '3', '888104457'],
 ['303', '785', '3', '879485318'],
 ['122', '387', '5', '879270459'],
 ['194', '274', '2', '879539794'],
 ['291', '1042', '4', '874834944'],
 ['234', '1184', '2', '892079237'],
 ['119', '392', '4', '886176814'],
 ['167', '486', '4', '892738452'],
 ['299', '144', '4', '877881320'],
 ['291', '118', '2', '874833878'],
 ['308', '1', '4', '887736532'],
 ['95', '546', '2', '879196566'],
 ['38', '95', '5', '892430094'],
 ['102', '768', '2', '883748450'],
 ['63', '277', '4', '875747401

In [0]:
dir(data)

['__add__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_bypass_serializer',
 '_computeFractionForSampleSize',
 '_defaultReducePartitions',
 '_id',
 '_is_pipelinable',
 '_jrdd',
 '_jrdd_deserializer',
 '_jrdd_val',
 '_memory_limit',
 '_pickled',
 '_prev_jrdd',
 '_prev_jrdd_deserializer',
 '_reserialize',
 '_to_java_object_rdd',
 'aggregate',
 'aggregateByKey',
 'cache',
 'cartesian',
 'checkpoint',
 'coalesce',
 'cogroup',
 'collect',
 'collectAsMap',
 'combineByKey',
 'context',
 'count',
 'countApprox',
 'countApproxDistinct',
 'countByKey',
 'countByValue',
 'ctx',
 'distinct',
 'filter',
 'first',
 'flatMap',
 'flatMapValues',
 'fold',
 'fol

In [0]:
movie_df = (spark.read
      .option('header', 'false')
      .option('inferSchema', value=True)
      .csv(data))

movie_df.show()

+------+-------+----+-------------+
|   _c0|    _c1| _c2|          _c3|
+------+-------+----+-------------+
|['196'|  '242'| '3'| '881250949']|
|['186'|  '302'| '3'| '891717742']|
| ['22'|  '377'| '1'| '878887116']|
|['244'|   '51'| '2'| '880606923']|
|['166'|  '346'| '1'| '886397596']|
|['298'|  '474'| '4'| '884182806']|
|['115'|  '265'| '2'| '881171488']|
|['253'|  '465'| '5'| '891628467']|
|['305'|  '451'| '3'| '886324817']|
|  ['6'|   '86'| '3'| '883603013']|
| ['62'|  '257'| '2'| '879372434']|
|['286'| '1014'| '5'| '879781125']|
|['200'|  '222'| '5'| '876042340']|
|['210'|   '40'| '3'| '891035994']|
|['224'|   '29'| '3'| '888104457']|
|['303'|  '785'| '3'| '879485318']|
|['122'|  '387'| '5'| '879270459']|
|['194'|  '274'| '2'| '879539794']|
|['291'| '1042'| '4'| '874834944']|
|['234'| '1184'| '2'| '892079237']|
+------+-------+----+-------------+
only showing top 20 rows



* user id | item id | rating | timestamp
* wants ->  user id, rating and then item id

In [0]:
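# Helper function to make labeled points
from pyspark.mllib.regression import LabeledPoint

def make_labeled_points(row):
  # Input row layout: [user id, item id, rating, timestamp]
  user_id = int(row[0])
  rating = float(row[2])
  item_id = float(row[1])

  # The label holds the user id; the features hold [rating, item id]
  return LabeledPoint(user_id, [rating, item_id])

labeled_points_rdd = data.map(make_labeled_points)
labeled_points_rdd.take(5)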

[LabeledPoint(196.0, [3.0,242.0]),
 LabeledPoint(186.0, [3.0,302.0]),
 LabeledPoint(22.0, [1.0,377.0]),
 LabeledPoint(244.0, [2.0,51.0]),
 LabeledPoint(166.0, [1.0,346.0])]

**Question 2:**

In this question, the task is to uncover the most popular movie. The objective is to find the movie id that appears most frequently in the dataset.

**Step 1:** Read in the "u.data" file

**Step 2:** Use the "map" operator to create an RDD that contains the respective movie ids and a value of 1

**Step 3:** Use the reduceByKey operator to sum the values by movie id

**Step 4:** Sort the movie ids based on the number of occurrences

**Step 5:** Output the results
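
A minimal sketch of these steps, written in plain Python so that it runs without Spark. The helper `reduce_by_key` is a hypothetical local stand-in for `RDD.reduceByKey`, and `to_movie_pair` is an illustrative name; the four sample rows are taken from "u.data". Note that the movie id is the second column (index 1) of each row.

```python
from operator import add

def to_movie_pair(line):
    """Map a tab-separated u.data line to (movie id, 1)."""
    fields = line.split("\t")
    return int(fields[1]), 1  # index 1 is the movie id; index 0 is the user id

def reduce_by_key(pairs, func):
    """Hypothetical local stand-in for RDD.reduceByKey: merge values per key."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())

sample_lines = [
    "196\t242\t3\t881250949",
    "194\t274\t2\t879539794",
    "157\t274\t4\t886890835",
    "186\t302\t3\t891717742",
]

pairs = map(to_movie_pair, sample_lines)                          # Step 2
counts = reduce_by_key(pairs, add)                                # Step 3
ranked = sorted(counts, key=lambda kv: kv[1], reverse=True)       # Step 4
print(ranked[0])                                                  # Step 5: (274, 2)

# With a SparkContext `sc`, the equivalent RDD pipeline would be:
#   counts = sc.textFile("ml-100k/u.data").map(to_movie_pair).reduceByKey(add)
#   ranked = counts.sortBy(lambda kv: kv[1], ascending=False)
```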



In [0]:
data.sample(False, 0.0001).collect()

[['181', '1330', '1', '878962052'],
 ['293', '939', '2', '888906516'],
 ['470', '847', '3', '879178568'],
 ['495', '226', '4', '888633011'],
 ['864', '408', '5', '877214085'],
 ['632', '82', '4', '879457903'],
 ['753', '242', '4', '891399477'],
 ['886', '657', '5', '876031695'],
 ['838', '289', '5', '887061035'],
 ['654', '1115', '3', '887863779'],
 ['328', '435', '4', '885045844'],
 ['178', '1119', '4', '882827400']]

In [0]:
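# Index 0 of each row is the user id (the movie id is at index 1),
# so this RDD holds user ids and the counts below are per user.
popular = movies.map(lambda x: x.split()[0])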

In [0]:
popular.take(5)

['196', '186', '22', '244', '166']

In [0]:
popular.countByValue()

defaultdict(int,
            {'1': 272,
             '10': 184,
             '100': 59,
             '101': 67,
             '102': 216,
             '103': 29,
             '104': 111,
             '105': 23,
             '106': 64,
             '107': 22,
             '108': 33,
             '109': 234,
             '11': 181,
             '110': 133,
             '111': 24,
             '112': 46,
             '113': 51,
             '114': 48,
             '115': 92,
             '116': 143,
             '117': 86,
             '118': 71,
             '119': 181,
             '12': 51,
             '120': 26,
             '121': 74,
             '122': 61,
             '123': 54,
             '124': 24,
             '125': 182,
             '126': 45,
             '127': 23,
             '128': 184,
             '129': 30,
             '13': 636,
             '130': 353,
             '131': 30,
             '132': 22,
             '133': 26,
             '134': 25,
             '13

In [0]:
import operator
x = popular.countByValue()

print("=== The 20 Most Popular Movies in Dataset ===")
most_popular = sorted(x.items(), key=operator.itemgetter(1), reverse=True)
most_popular[0:20]

=== The 20 Most Popular Movies in Dataset ===


[('405', 737),
 ('655', 685),
 ('13', 636),
 ('450', 540),
 ('276', 518),
 ('416', 493),
 ('537', 490),
 ('303', 484),
 ('234', 480),
 ('393', 448),
 ('181', 435),
 ('279', 434),
 ('429', 414),
 ('846', 405),
 ('7', 403),
 ('94', 400),
 ('682', 399),
 ('308', 397),
 ('92', 388),
 ('293', 388)]

**Question 3:**

In this question, you will need to determine the avg. # of friends grouped by age

**Dataset**: The dataset contains information on the **a)** Name, **b)** Age and **c)** # of Friends. The dataset is accessible via the following link: https://www.dropbox.com/s/xzw44ntsau1pr4m/friendsData.csv?raw=1

**ASK**: Leverage the "**groupByKey**" operator to determine the avg. # of friends grouped by age

**Step 1**: Read in the data into a RDD

**Step 2**: Use the "**map**" operator to create a **key, value pair** where the key is the age and the value is a tuple containing the # of friends and 1, for example (33, (30, 1))

**Step 3**: Utilize the "**reduceByKey**" operator to calculate the sum of friends by age

**Step 4**: Compute the average of the # of friends by age

**Step 5**: Output the results
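
The steps above can be sketched as follows, again in plain Python so the example runs without Spark. `reduce_by_key` is a hypothetical local stand-in for `RDD.reduceByKey`, the helper names are illustrative, and the last sample row is made up so that one age appears twice.

```python
def to_age_pair(line):
    """Map an 'ID,NAME,AGE,Number of Friends' line to (age, (friends, 1))."""
    _id, _name, age, friends = line.split(",")
    return int(age), (int(friends), 1)

def add_pairs(a, b):
    """Add two (friends_sum, count) tuples element-wise."""
    return a[0] + b[0], a[1] + b[1]

def reduce_by_key(pairs, func):
    """Hypothetical local stand-in for RDD.reduceByKey: merge values per key."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

sample_lines = [
    "ID,NAME,AGE,Number of Friends",
    "0,Will,33,385",
    "1,Jean-Luc,26,2",
    "900,Example,33,15",  # made-up row, added so that age 33 occurs twice
]

header = sample_lines[0]
rows = [line for line in sample_lines if line != header]   # drop the header row

sums = reduce_by_key(map(to_age_pair, rows), add_pairs)    # Steps 2 and 3
averages = {age: total / count for age, (total, count) in sums.items()}  # Step 4
print(averages)                                            # Step 5

# With a SparkContext `sc`, the equivalent RDD pipeline would be:
#   lines = sc.textFile("friends")
#   header = lines.first()
#   averages = (lines.filter(lambda l: l != header)
#                    .map(to_age_pair)
#                    .reduceByKey(add_pairs)
#                    .mapValues(lambda s: s[0] / s[1]))
```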




In [0]:
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = SparkContext.getOrCreate()

In [0]:
!wget https://www.dropbox.com/s/xzw44ntsau1pr4m/friendsData.csv?raw=1 -O friends
friends = sc.textFile('friends')

--2018-06-22 20:20:19--  https://www.dropbox.com/s/xzw44ntsau1pr4m/friendsData.csv?raw=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.1.1, 2620:100:601a:1::a27d:701
Connecting to www.dropbox.com (www.dropbox.com)|162.125.1.1|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://ucec4fa296515312e9f51ec53d80.dl.dropboxusercontent.com/cd/0/inline/AJhPiQIYgEgVkpW1Pi7J0HZ7aYQQ0wJLn_7dRytfz32Sth6Wje1QZX4KV2M8FNcaY45ygWGWh2tckT8BzRW0tktkzoUM7AU19aJPuvAXONrLLYAqkv7-6e0v1lqyZjUCrm8XZnlanEv0-alGWsMeWW-7awDwMurnXvJbwyOn6m8aUw-zdrTVdSsgt2b24EGHlKc/file [following]
--2018-06-22 20:20:20--  https://ucec4fa296515312e9f51ec53d80.dl.dropboxusercontent.com/cd/0/inline/AJhPiQIYgEgVkpW1Pi7J0HZ7aYQQ0wJLn_7dRytfz32Sth6Wje1QZX4KV2M8FNcaY45ygWGWh2tckT8BzRW0tktkzoUM7AU19aJPuvAXONrLLYAqkv7-6e0v1lqyZjUCrm8XZnlanEv0-alGWsMeWW-7awDwMurnXvJbwyOn6m8aUw-zdrTVdSsgt2b24EGHlKc/file
Resolving ucec4fa296515312e9f51ec53d80.dl.dropboxusercontent.com (ucec4fa296515312e9f51ec53d80.d

In [0]:
friends.take(10)

['ID,NAME,AGE,Number of Friends',
 '0,Will,33,385',
 '1,Jean-Luc,26,2',
 '2,Hugh,55,221',
 '3,Deanna,40,465',
 '4,Quark,68,21',
 '5,Weyoun,59,318',
 '6,Gowron,37,220',
 '7,Will,54,307',
 '8,Jadzia,38,380']

In [0]:
df = (spark.read
      .option('header', 'true')
      .option('inferSchema', value=True)
      .csv(friends))


In [0]:
df.show(10)

+---+--------+---+-----------------+
| ID|    NAME|AGE|Number of Friends|
+---+--------+---+-----------------+
|  0|    Will| 33|              385|
|  1|Jean-Luc| 26|                2|
|  2|    Hugh| 55|              221|
|  3|  Deanna| 40|              465|
|  4|   Quark| 68|               21|
|  5|  Weyoun| 59|              318|
|  6|  Gowron| 37|              220|
|  7|    Will| 54|              307|
|  8|  Jadzia| 38|              380|
|  9|    Hugh| 27|              181|
+---+--------+---+-----------------+
only showing top 10 rows



In [0]:
# Selecting multiple columns, formatting as table
friends_by_age = df.select('AGE', 'Number of Friends')
friends_by_age.show(5)


+---+-----------------+
|AGE|Number of Friends|
+---+-----------------+
| 33|              385|
| 26|                2|
| 55|              221|
| 40|              465|
| 68|               21|
+---+-----------------+
only showing top 5 rows



In [0]:
friends_by_age.groupBy('AGE').avg('Number of Friends').show()

+---+----------------------+
|AGE|avg(Number of Friends)|
+---+----------------------+
| 31|                267.25|
| 65|                 298.2|
| 53|    222.85714285714286|
| 34|                 245.5|
| 28|                 209.1|
| 26|    242.05882352941177|
| 27|               228.125|
| 44|     282.1666666666667|
| 22|    206.42857142857142|
| 47|    233.22222222222223|
| 52|     340.6363636363636|
| 40|     250.8235294117647|
| 20|                 165.0|
| 57|     258.8333333333333|
| 54|     278.0769230769231|
| 48|                 281.4|
| 19|    213.27272727272728|
| 64|     281.3333333333333|
| 41|    268.55555555555554|
| 43|    230.57142857142858|
+---+----------------------+
only showing top 20 rows



In [0]:
# # Filter to a specific location
# df.filter(df['Location'] == 'Nipomo').show(5)

# # Counts of properties grouped by location
# df.groupBy('Location').count().show(5)

# # Get all properties with price < 500000
# df.filter(df['Price'] < 500000).show(5)

# # Order by price descending
# df.orderBy(df['Price'], ascending=False).show(5)

# # Group by location and aggregate by average price
# df.groupBy('Location').avg('Price').show()

# # Group, average, and sort
# df.groupBy('Location').avg('Price').orderBy('avg(Price)').show()

**Question 4:**

In this question, you will construct a Linear Regression model to make some predictions

**Dataset**: It contains a set of x, y values where x represents the feature and the y represents the label. This is a made up dataset meant to provide practice on training and testing a ML model in Spark to make predictions

https://www.dropbox.com/s/950mm5nqr533hsh/feature_label_dataset.txt?raw=1

**Step 1:** Read the dataset into an RDD

**Step 2:** Split on the comma delimiter and convert (i.e., map) the data to the following format ==> (x, Vectors.dense(y))

**Step 3:** Convert the RDD into a DataFrame

**Step 4:** Split the data into training and testing parts in the ratio 60% (train the model with 60% of the data) to 40% (test the model with 40% of the data)

**Step 5:** Create a Linear Regression model and train it with the training dataset

**Step 6:** Predict values using the test data

**Step 7:** Output the predicted values and actual values (from the underlying dataset) side by side to enable comparison
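
To make the flow of these steps concrete, here is a rough plain-Python sketch that runs without Spark. It is not Spark's `LinearRegression`: it swaps in a closed-form ordinary least squares fit for a single feature, and it uses a seeded shuffle with an exact 60/40 slice where Spark's `randomSplit` would give an approximate split. It assumes the first column is the feature and the second is the label, as the DataFrame cells in this notebook do; the helper names and the twelve sample rows (taken from the dataset) are for illustration only.

```python
import random

def parse_point(line):
    """Split an 'a,b' line into (feature, label) floats."""
    first, second = line.split(",")
    return float(first), float(second)

def fit_line(data):
    """Closed-form least squares for one feature: returns (slope, intercept)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

sample_lines = [
    "-0.52,0.69", "0.12,-0.23", "0.16,-0.21", "-0.22,0.29",
    "-1.66,1.71", "0.10,-0.12", "-0.93,0.91", "-0.71,0.62",
    "-0.44,0.43", "-1.39,1.44", "0.69,-0.73", "-1.54,1.49",
]

points = [parse_point(line) for line in sample_lines]   # Steps 1 and 2

# Step 4: seeded shuffle, then an exact 60/40 slice
random.Random(66).shuffle(points)
cut = int(0.6 * len(points))
train, test = points[:cut], points[cut:]

slope, intercept = fit_line(train)                      # Step 5

# Steps 6 and 7: predictions next to the actual labels
for x, y in test:
    print(f"predicted={slope * x + intercept:.2f}  actual={y:.2f}")
```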








In [0]:
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = SparkContext.getOrCreate()

!wget https://www.dropbox.com/s/950mm5nqr533hsh/feature_label_dataset.txt?raw=1 -O stuff
prac = sc.textFile('stuff')

--2018-06-22 21:00:05--  https://www.dropbox.com/s/950mm5nqr533hsh/feature_label_dataset.txt?raw=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.1.1, 2620:100:601a:1::a27d:701
Connecting to www.dropbox.com (www.dropbox.com)|162.125.1.1|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://uca5c7f968059f23359bb1cec364.dl.dropboxusercontent.com/cd/0/inline/AJhyEujjNcycA12xBBBran0cXkujr-MecQ69sReSC6OJmfZUHIxJTMgnCBQNzhMmQ3uvgM5z8Ty8VvQFwj48O6C9E8Js0W8r6-1Pi6o4L5iHXFU3XqwJ82N25Z7FfF-4qb-Ta8I10FaAWYOo5Xa0gfVDutv0Ljh2-LIC09hwtWz-NTm_hdFu9emTISHskwUTPWY/file [following]
--2018-06-22 21:00:05--  https://uca5c7f968059f23359bb1cec364.dl.dropboxusercontent.com/cd/0/inline/AJhyEujjNcycA12xBBBran0cXkujr-MecQ69sReSC6OJmfZUHIxJTMgnCBQNzhMmQ3uvgM5z8Ty8VvQFwj48O6C9E8Js0W8r6-1Pi6o4L5iHXFU3XqwJ82N25Z7FfF-4qb-Ta8I10FaAWYOo5Xa0gfVDutv0Ljh2-LIC09hwtWz-NTm_hdFu9emTISHskwUTPWY/file
Resolving uca5c7f968059f23359bb1cec364.dl.dropboxusercontent.com (uca5c7f968059f23359

In [0]:
from pyspark.ml.regression import LinearRegression
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
from pyspark.ml.feature import VectorAssembler

In [0]:
prac.sample(False, 0.01).collect()

['-0.52,0.69',
 '0.12,-0.23',
 '0.16,-0.21',
 '-0.22,0.29',
 '-1.66,1.71',
 '0.10,-0.12',
 '-0.93,0.91',
 '-0.71,0.62',
 '-0.44,0.43',
 '-1.39,1.44',
 '0.69,-0.73',
 '-1.54,1.49']

In [0]:
df2 = (spark.read
      .option('header', 'false')
      .option('inferSchema', value=True)
      .csv(prac))

df2.show(10)

+-----+-----+
|  _c0|  _c1|
+-----+-----+
|-1.74| 1.66|
| 1.24|-1.18|
| 0.29| -0.4|
|-0.13| 0.09|
|-0.39| 0.38|
|-1.79| 1.73|
| 0.71|-0.77|
| 1.39|-1.48|
| 1.15|-1.43|
| 0.13|-0.07|
+-----+-----+
only showing top 10 rows



In [0]:
vectorAssembler = VectorAssembler(inputCols = ['_c0'], outputCol = 'features')
vec_df = vectorAssembler.transform(df2)
vec_df = vec_df.select(['features', '_c1'])
vec_df.show(3)

+--------+-----+
|features|  _c1|
+--------+-----+
| [-1.74]| 1.66|
|  [1.24]|-1.18|
|  [0.29]| -0.4|
+--------+-----+
only showing top 3 rows



In [0]:
train, test = vec_df.randomSplit([0.6, 0.4], seed=66)

In [0]:
train.take(5)

[Row(features=DenseVector([-2.58]), _c1=2.57),
 Row(features=DenseVector([-2.54]), _c1=2.39),
 Row(features=DenseVector([-2.36]), _c1=2.63),
 Row(features=DenseVector([-2.29]), _c1=2.35),
 Row(features=DenseVector([-2.27]), _c1=2.19)]

In [0]:
lr = LinearRegression(featuresCol = 'features', labelCol='_c1', maxIter=10, 
                      regParam=0.3, elasticNetParam=0.8)

In [0]:
lr_model = lr.fit(train)

In [0]:
print("Coefficients: " + str(lr_model.coefficients))
print("Intercept: " + str(lr_model.intercept))
trainingSummary = lr_model.summary
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)

Coefficients: [-0.7180620716240425]
Intercept: 0.0021857550049637392
RMSE: 0.300137
r2: 0.913493


In [0]:
train.describe().show()

+-------+--------------------+
|summary|                 _c1|
+-------+--------------------+
|  count|                 579|
|   mean|-0.00245250431778...|
| stddev|  1.0213405579776098|
|    min|                -2.9|
|    max|                2.89|
+-------+--------------------+



In [0]:
lr_predictions = lr_model.transform(test)
lr_predictions.select("prediction","_c1","features").show(25)

from pyspark.ml.evaluation import RegressionEvaluator
lr_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="_c1",metricName="r2")
print("R Squared (R2) on test data = %g" % lr_evaluator.evaluate(lr_predictions))

+------------------+----+--------+
|        prediction| _c1|features|
+------------------+----+--------+
|2.6877379028788826|3.75| [-3.74]|
|1.5244773468479338| 1.9| [-2.12]|
|1.4885742432667315|2.04| [-2.07]|
| 1.409587415388087|1.98| [-1.96]|
| 1.395226173955606|1.94| [-1.94]|
|1.3736843118068849|1.83| [-1.91]|
|1.2946974839282401|1.84|  [-1.8]|
| 1.273155621779519|1.66| [-1.77]|
|1.1798075524683933|1.84| [-1.64]|
|1.1367238281709509|1.65| [-1.58]|
|1.1008207245897488|1.68| [-1.53]|
| 1.021833896711104|1.59| [-1.42]|
| 1.007472655278623|1.32|  [-1.4]|
|1.0002920345623827|1.32| [-1.39]|
| 0.985930793129902|1.25| [-1.37]|
|0.9572083102649402|1.48| [-1.33]|
| 0.935666448116219|1.14|  [-1.3]|
| 0.935666448116219|1.18|  [-1.3]|
|0.9141245859674977|1.19| [-1.27]|
|0.8997633445350168|1.32| [-1.25]|
|0.8925827238187763|1.09| [-1.24]|
|0.8782214823862955| 1.2| [-1.22]|
|0.8423183788050934|1.24| [-1.17]|
| 0.799234654507651| 1.0| [-1.11]|
| 0.799234654507651|1.23| [-1.11]|
+------------------+

In [0]:
# points_rdd = sc.parallelize(points)

# # Helper function to encode to LabeledPoints
# def construct_labeled_points(row):
#   square_feet = row[0]
#   price = row[1]
#   return LabeledPoint(price, Vectors.dense(square_feet))

# points_rdd_labeled = points_rdd.map(construct_labeled_points)
# points_rdd_labeled.cache()

# # Train a linear regression model with 100 iterations
# # and stepsize 0.0000006
# model = LinearRegressionWithSGD.train(points_rdd_labeled, 100,
#                                       0.0000006)