# Predicting CA housing prices using Spark MLlib

**Table of Contents**  
- Boiler plate - initialize SparkSession & Context
- About CA housing dataset
- Preprocess data
- Convert RDD to Spark DataFrame
- Exploratory data analysis
- Feature engineering
    - Re-order columns and split table into label and features
    - Scale data by shifting mean to 0 and making SD = 1
- Split data into training and test sets
- Perform Multiple Regression
    - Inspect model properties
    - Perform predictions
        - Regression evaluator
        - Errors - MAE, RMSE
    - Compare training vs prediction errors
- Export data as a Pandas DataFrame
- Write to disk as CSV
- Publish to GIS
- Spark jobs

### Boiler plate - initialize SparkSession & Context

In [None]:
# Import SparkSession
from pyspark.sql import SparkSession

# Build the SparkSession
spark = SparkSession.builder \
   .master("local") \
   .appName("Linear Regression Model") \
   .config("spark.executor.memory", "1gb") \
   .getOrCreate()
   
sc = spark.sparkContext

### About CA housing dataset  
Number of records: 20640  

Variables (in file order): longitude, latitude, median housing age, total rooms, total bedrooms, population in block, households, median income, median house value  

### Preprocess data

In [None]:
# load data file
rdd = sc.textFile('cal_housing.data')

# load header
header = sc.textFile('cal_housing.domain')

In [None]:
# count the records without pulling the whole dataset to the driver
rdd.count()

In [None]:
len(rdd.take(5))

In [None]:
rdd.take(5)

In [None]:
# split by comma
rdd = rdd.map(lambda line : line.split(','))

# get the first record
rdd.first()

### Convert RDD to Spark DataFrame

In [None]:
# convert RDD to a dataframe
from pyspark.sql import Row

# Map the RDD to a DF
df = rdd.map(lambda line: Row(longitude=line[0], 
                              latitude=line[1], 
                              housingMedianAge=line[2],
                              totalRooms=line[3],
                              totalBedRooms=line[4],
                              population=line[5], 
                              households=line[6],
                              medianIncome=line[7],
                              medianHouseValue=line[8])).toDF()

# show the top few DF rows
df.show(5)

In [None]:
df.printSchema()

In [None]:
# cast every string column to float using a helper function

from pyspark.sql.types import FloatType

def cast_columns(df):
    for column in df.columns:
        df = df.withColumn(column, df[column].cast(FloatType()))
    return df

new_df = cast_columns(df)

In [None]:
new_df.show(2)

In [None]:
new_df.printSchema()
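
The preprocessing so far amounts to three steps per line: split on commas, attach column names, and cast each field to a float. The plain-Python sketch below shows that same idea on a single line, outside Spark. The `parse_line` helper and the sample line are hypothetical and made up for illustration; the values are not taken from the dataset.

```python
# Column names in the same order as the Row(...) mapping in this notebook
COLUMNS = ['longitude', 'latitude', 'housingMedianAge', 'totalRooms',
           'totalBedRooms', 'population', 'households',
           'medianIncome', 'medianHouseValue']

def parse_line(line):
    """Split one comma-separated line and cast every field to float."""
    values = [float(v) for v in line.split(',')]
    if len(values) != len(COLUMNS):
        raise ValueError(f'expected {len(COLUMNS)} fields, got {len(values)}')
    return dict(zip(COLUMNS, values))

# Made-up line in the same comma-separated format (not a real record)
sample = '-120.5,36.25,30.0,1000.0,200.0,500.0,250.0,4.5,150000.0'
record = parse_line(sample)
print(record['medianHouseValue'])
```

Checking the field count up front makes a malformed line fail loudly instead of silently producing a shorter record.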

### Exploratory data analysis
Print the summary stats of the table

In [None]:
new_df.describe().show()

### Feature engineering
Add more columns such as ‘bedrooms per room’ and ‘rooms per household’. Also divide ‘medianHouseValue’ by 100,000 so that it falls within the range of the other numbers.

In [None]:
from pyspark.sql.functions import col

# start from new_df (float columns) rather than df (string columns)
df = new_df.withColumn('medianHouseValue', col('medianHouseValue')/100000)

In [None]:
df.first()

In [None]:
# add rooms per household
df = df.withColumn('roomsPerHousehold', col('totalRooms')/col('households'))

# add population per household (num people in the home)
df = df.withColumn('popPerHousehold', col('population')/col('households'))

# add bedrooms per room
df = df.withColumn('bedroomsPerRoom', col('totalBedRooms')/col('totalRooms'))

In [None]:
df.first()

### Re-order columns and split table into label and features

In [None]:
df.columns

In [None]:
df = df.select('medianHouseValue','households',
 'housingMedianAge',
 'latitude',
 'longitude',
 'medianIncome',
 'population',
 'totalBedRooms',
 'totalRooms',
 'roomsPerHousehold',
 'popPerHousehold',
 'bedroomsPerRoom')

Create a new DataFrame that explicitly names the columns `label` and `features`. Each row is mapped to a tuple of its first column (the label) and a `DenseVector` holding all remaining columns (the features); the tuples are then turned back into a DataFrame with named columns.

In [None]:
from pyspark.ml.linalg import DenseVector

# return a tuple of first column and all other columns
temp_data = df.rdd.map(lambda x:(x[0], DenseVector(x[1:])))

#construct back a new DataFrame
df2 = spark.createDataFrame(temp_data, ['label','features'])

In [None]:
df2.take(2)

**Scale data by shifting mean to 0 and making SD = 1**  
This ensures all columns have similar levels of variability

In [None]:
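# use StandardScaler to standardize the features (zero mean, unit SD)
from pyspark.ml.feature import StandardScaler

# withMean defaults to False; set it so the mean is actually shifted to 0
s_scaler_model = StandardScaler(inputCol='features', outputCol='features_scaled',
                                withMean=True, withStd=True)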
scaler_fn = s_scaler_model.fit(df2)
scaled_df = scaler_fn.transform(df2)

scaled_df.take(2)
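
To make the scaling step concrete, here is a minimal plain-Python sketch of standardizing one feature column: subtract the mean, then divide by the sample standard deviation. The `standardize` helper and the input values are made up for illustration. Note that Spark's `StandardScaler` only centers the data when `withMean=True`; with the default settings it divides by the standard deviation without shifting the mean.

```python
from statistics import mean, stdev

def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mu = mean(values)
    sd = stdev(values)  # sample SD, i.e. n - 1 in the denominator
    return [(x - mu) / sd for x in values]

# Made-up values for a single feature column
rooms_per_household = [4.0, 5.0, 6.0, 7.0, 8.0]
scaled = standardize(rooms_per_household)

# After scaling, the column has mean ~0 and SD ~1
print(mean(scaled), stdev(scaled))
```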

### Split data into training and test sets

In [None]:
train_data, test_data = scaled_df.randomSplit([.8,.2], seed=101)

In [None]:
type(train_data)
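
A weighted random split assigns rows by random draws, so the resulting sizes only approximately match the 80/20 weights, and a fixed seed makes the split repeatable. The plain-Python sketch below illustrates that idea under simplifying assumptions; `random_split` is a hypothetical helper, not Spark's implementation, and it will not reproduce the rows that `randomSplit` selects.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one bucket using an independent random draw."""
    total = sum(weights)
    # Cumulative upper bounds of the normalized weights, e.g. [0.8, 1.0]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)
    buckets = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            if draw < upper:
                buckets[i].append(row)
                break
        else:
            # Guard against floating-point rounding in the last bound
            buckets[-1].append(row)
    return buckets

rows = list(range(1000))
train, test = random_split(rows, [.8, .2], seed=101)
print(len(train), len(test))
```

Every row lands in exactly one bucket, so the two sets are disjoint and together cover the input.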

### Perform Multiple Regression
Train the model  

In [None]:
from pyspark.ml.regression import LinearRegression

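# train on the scaled features; the default featuresCol is 'features' (unscaled)
lr = LinearRegression(featuresCol='features_scaled', labelCol='label', maxIter=20)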

linear_model = lr.fit(train_data)

**Inspect model properties**

In [None]:
type(linear_model)

In [None]:
linear_model.coefficients

In [None]:
list(zip(df.columns[1:], linear_model.coefficients))

In [None]:
linear_model.intercept
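
The coefficients and the intercept combine into a prediction as the intercept plus the sum of each coefficient times its feature value. The sketch below shows that arithmetic in plain Python. The `predict` helper and all numbers are hypothetical and chosen for easy mental arithmetic; they are not the fitted model's values.

```python
def predict(intercept, coefficients, features):
    """Linear prediction: intercept + sum(coefficient * feature)."""
    if len(coefficients) != len(features):
        raise ValueError('coefficients and features must have the same length')
    return intercept + sum(c * x for c, x in zip(coefficients, features))

# Hypothetical numbers for illustration only
intercept = 2.0
coefficients = [0.5, -0.25]
features = [1.0, 2.0]

label_units = predict(intercept, coefficients, features)  # units of $100,000
dollars = label_units * 100000                            # back to dollars
print(label_units, dollars)
```

Because the label was divided by 100,000 during feature engineering, predictions have to be multiplied by 100,000 to read them as dollar amounts.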

In [None]:
linear_model.summary.numInstances

MAE from training data

In [None]:
linear_model.summary.meanAbsoluteError * 100000

Thus, on the training data the predictions are off by about $50,000 on average (MAE).

In [None]:
linear_model.summary.meanSquaredError

In [None]:
linear_model.summary.rootMeanSquaredError * 100000

Thus, the RMSE on the training data is $68,392.

In [None]:
list(zip(df.columns[1:], linear_model.summary.pValues))

### Perform predictions

In [None]:
predicted = linear_model.transform(test_data)
predicted.columns

In [None]:
type(predicted)

In [None]:
# keep just the prediction and label columns, renamed for the evaluator
test_predictions_labels_df = predicted.select(
    predicted['prediction'].alias('predictions'),
    predicted['label'].alias('labels'))

test_predictions_labels_df.take(2)

#### Regression evaluator

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

linear_reg_eval = RegressionEvaluator(predictionCol='predictions', labelCol='labels')

In [None]:
linear_reg_eval.evaluate(test_predictions_labels_df)

#### Errors - MAE, RMSE

In [None]:
# mean absolute error
prediction_mae = linear_reg_eval.evaluate(test_predictions_labels_df, 
                                          {linear_reg_eval.metricName:'mae'}) * 100000
prediction_mae

In [None]:
# RMSE
prediction_rmse = linear_reg_eval.evaluate(test_predictions_labels_df, 
                                           {linear_reg_eval.metricName:'rmse'}) * 100000

prediction_rmse
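
For reference, both metrics can be computed by hand. MAE is the mean of the absolute errors, and RMSE is the square root of the mean of the squared errors, so RMSE penalizes large misses more heavily. The plain-Python sketch below uses made-up predictions and labels in units of $100,000; the `mae` and `rmse` helpers are illustrative, not part of Spark.

```python
from math import sqrt

def mae(predictions, labels):
    """Mean absolute error."""
    return sum(abs(p - y) for p, y in zip(predictions, labels)) / len(labels)

def rmse(predictions, labels):
    """Root mean squared error."""
    squared = sum((p - y) ** 2 for p, y in zip(predictions, labels))
    return sqrt(squared / len(labels))

# Made-up values in units of $100,000
predictions = [2.0, 1.0, 3.0, 4.0]
labels = [1.0, 1.0, 5.0, 4.0]

# Multiply by 100000 to express the errors in dollars
print(mae(predictions, labels) * 100000)
print(rmse(predictions, labels) * 100000)
```

Here the errors are 1, 0, -2 and 0, giving an MAE of 0.75 and an RMSE of the square root of 1.25. The single large miss is what pulls the RMSE above the MAE.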

#### Compare training vs prediction errors


In [None]:
print('(training error, prediction error)')
print((linear_model.summary.rootMeanSquaredError * 100000, prediction_rmse))
print((linear_model.summary.meanAbsoluteError * 100000, prediction_mae))

### Export data as a Pandas DataFrame

In [None]:
predicted_pandas_df = predicted.select('prediction').toPandas()
predicted_pandas_df1 = predicted.select('features')
predicted_pandas_df2 = predicted_pandas_df1.rdd.map(lambda x:[float(y) for y in x['features']]).toDF(df.columns[1:]).toPandas()

In [None]:
predicted_pandas_df2.columns

In [None]:
predicted_pandas_df2['predictedHouseValue'] = predicted_pandas_df['prediction']
predicted_pandas_df2.head()

### Write to disk as CSV

In [None]:
predicted_pandas_df2.to_csv('CA_house_prices_predicted.csv')

In [None]:
predicted_pandas_df2.shape

In [None]:
CA_house_prices_predicted=predicted_pandas_df2.to_dict()

### Publish to GIS

In [None]:
#import sys
#sys.executable

In [None]:
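import pandas as pd
from getpass import getpass
from arcgis.gis import GIS

# prompt for credentials instead of hard-coding them in the notebook
gis = GIS("https://www.arcgis.com", input("ArcGIS username: "), getpass("ArcGIS password: "))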

In [None]:
from arcgis.features import SpatialDataFrame

In [None]:
sdf = SpatialDataFrame.from_dict(CA_house_prices_predicted)
sdf.head(5)

In [None]:
houses_predicted_fc = gis.content.import_data(sdf[:999])
houses_predicted_fc

In [None]:
ca_map = gis.map('California')
ca_map

In [None]:
ca_map.add_layer(houses_predicted_fc, {'renderer':'ClassedColorRenderer',
                                      'field_name':'predictedHouseValue'})
