### University of Virginia
### DS 5559: Big Data Analytics
### Linear Regression Modeling of California Home Prices
### Last updated: Oct 21, 2019

### Name: Jay Hombal
### Computing Id: mh4ey

**TOTAL POINTS: 10**

**Instructions**  
In this project, you will work with the California Home Price dataset to train a regression model and predict median home prices. Please do the following:  

1) (6 PTS) Go through all code and fill in the missing cells. This will prep data, train a model, predict, and evaluate model fit.  Compute and report the Mean Squared Error (MSE).  
2) (1 PT) Repeat Part 1 with at least one additional feature from the original set.  
3) (2 PTS) Repeat Part 1 with at least one engineered feature based on one or more variables from the original set.  
4) (1 PT) Repeat Part 1 using Lasso Regression

Please report results in the following way:  
In the **RESULTS SECTION** table at the very bottom, there are three cells where you should copy your code from parts 2,3,4.  
In the very last cell, print a dataframe containing two columns: `question_part` and `MSE`.  
This dataframe must report your MSE results.

**Data Source**  
StatLib---Datasets Archive  
http://lib.stat.cmu.edu/datasets/

In [1]:
import os
import pandas as pd

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.linalg import DenseVector
spark = SparkSession.builder.getOrCreate()

In [2]:
# read text file into pyspark dataframe
filename = 'cal_housing_data_preproc_w_header.txt'
df = spark.read.csv(filename,  inferSchema=True, header = True)

In [3]:
df.show(5)

+------------------+-----------------+------------------+-----------+--------------+----------+----------+--------+---------+
|median_house_value|    median_income|housing_median_age|total_rooms|total_bedrooms|population|households|latitude|longitude|
+------------------+-----------------+------------------+-----------+--------------+----------+----------+--------+---------+
|          452600.0|           8.3252|              41.0|      880.0|         129.0|     322.0|     126.0|   37.88|  -122.23|
|          358500.0|           8.3014|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|   37.86|  -122.22|
|          352100.0|7.257399999999999|              52.0|     1467.0|         190.0|     496.0|     177.0|   37.85|  -122.24|
|          341300.0|           5.6431|              52.0|     1274.0|         235.0|     558.0|     219.0|   37.85|  -122.25|
|          342200.0|           3.8462|              52.0|     1627.0|         280.0|     565.0|     259.0|   37.85|  -

#### Additional Preprocessing

We want to do three more things before training a model:  

**SCALING (1 POINT)**   
Scale the response variable median_house_value, dividing by 100000 and saving into column median_house_value_final

In [4]:
df = df.withColumn('median_house_value_final', col('median_house_value')/100000)

**FEATURE ENGINEERING**  **(1 POINT)**  
Add new feature:  rooms_per_household

In [5]:
df = df.withColumn('rooms_per_household', col('total_rooms')/col('households'))

df.select('median_house_value_final','rooms_per_household').show(5)

+------------------------+-------------------+
|median_house_value_final|rooms_per_household|
+------------------------+-------------------+
|                   4.526|  6.984126984126984|
|                   3.585|  6.238137082601054|
|                   3.521|  8.288135593220339|
|                   3.413| 5.8173515981735155|
|                   3.422|  6.281853281853282|
+------------------------+-------------------+
only showing top 5 rows
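
As a quick sanity check on the two derived columns, the arithmetic can be reproduced in plain Python (no Spark needed) for the first two rows shown in the `df.show(5)` output. This is only an illustrative sketch; the helper name `derive` is hypothetical and not part of the assignment.

```python
# Plain-Python check of the two derived columns, using the first two rows
# from the df.show(5) output. `derive` is a hypothetical helper name.
rows = [
    {"median_house_value": 452600.0, "total_rooms": 880.0, "households": 126.0},
    {"median_house_value": 358500.0, "total_rooms": 7099.0, "households": 1138.0},
]


def derive(row):
    """Return (median_house_value_final, rooms_per_household) for one row."""
    scaled_value = row["median_house_value"] / 100000
    rooms_per_household = row["total_rooms"] / row["households"]
    return scaled_value, rooms_per_household


derived = [derive(r) for r in rows]
for value, rooms in derived:
    print(round(value, 3), round(rooms, 4))
```

The values should agree with the first two rows of the Spark output above, up to display rounding.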



#### Code for Part1

**SELECT AND STANDARDIZE FEATURES**  **(2 POINTS)**

In [6]:
# retain these predictors for Part 1
vars_to_keep = ["median_house_value_final", 
              "total_bedrooms", 
              "population", 
              "households", 
              "median_income", 
              "rooms_per_household"]

# subset the dataframe on these predictors
df1 = df.select(vars_to_keep)

We want to standardize the features, but not the response variable.

In [7]:
# extract labels and features; stored as RDDs
transformed_data = df1.rdd.map(lambda x: (x[0], DenseVector(x[1:])))
transformed_df = spark.createDataFrame(transformed_data, ['label', 'features'])
transformed_df.take(2)

[Row(label=4.526, features=DenseVector([129.0, 322.0, 126.0, 8.3252, 6.9841])),
 Row(label=3.585, features=DenseVector([1106.0, 2401.0, 1138.0, 8.3014, 6.2381]))]

In [8]:
# Feature scaling
# use StandardScaler to scale each feature to unit standard deviation
# (withMean=False, so the features are not mean-centered)
from pyspark.ml.feature import StandardScaler

# Initialize the `standardScaler`
standardScaler = StandardScaler(inputCol="features", outputCol="features_scaled", 
                                withStd=True, withMean=False)

# Fit the scaler to the DataFrame; this computes the mean and standard deviation of each feature
scaler = standardScaler.fit(transformed_df)

# Transform the data in `transformed_df` with the scaler
scaled_df = scaler.transform(transformed_df)
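
To make the effect of `withStd=True, withMean=False` concrete, here is a minimal plain-Python sketch of the same idea: each feature column is divided by its standard deviation and is not centered. It assumes the sample (n-1) standard deviation, which is what Spark's `StandardScaler` is documented to use; the toy rows and the helper name `scale_by_std` are hypothetical and not taken from the dataset.

```python
import statistics


def scale_by_std(rows):
    """Divide each column by its sample standard deviation (no mean-centering)."""
    columns = list(zip(*rows))
    stds = [statistics.stdev(col) for col in columns]
    return [[value / std for value, std in zip(row, stds)] for row in rows]


# Hypothetical toy feature rows (two features, three observations).
toy_rows = [[1.0, 10.0], [2.0, 30.0], [3.0, 50.0]]
scaled_rows = scale_by_std(toy_rows)
scaled_columns = list(zip(*scaled_rows))

for col in scaled_columns:
    # After scaling, each column has unit standard deviation but a nonzero mean.
    print(statistics.stdev(col), statistics.mean(col))
```

Because the mean is not subtracted, the scaled features keep their sign and relative offsets; only their spread is normalized.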

Split data into train set (80%), test set (20%) using seed=314  

In [9]:
seed = 314
train_test = [0.8, 0.2]
train_data, test_data = scaled_df.randomSplit(train_test,seed=seed)
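
A seeded random split can be sketched in plain Python as below. This is not Spark's implementation (Spark uses its own sampler and random generator), only an illustration of the idea: each row is assigned independently, so the split sizes are approximately, not exactly, 80/20, and reusing the seed reproduces the same split. The helper name `random_split` and the toy data are hypothetical.

```python
import random


def random_split(rows, train_fraction, seed):
    """Assign each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


# Hypothetical toy data: 1000 row ids.
rows = list(range(1000))
train, test = random_split(rows, 0.8, seed=314)
print(len(train), len(test))
```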

In [10]:
print(f'train data is {train_data.dtypes}')
print(f'test data is a {type(test_data)}')

train data is [('label', 'double'), ('features', 'vector'), ('features_scaled', 'vector')]
test data is a <class 'pyspark.sql.dataframe.DataFrame'>


Initialize the linear regression object with given parameters **(1 POINT)**

In [11]:
from pyspark.ml.regression import LinearRegression # note this is from the ML package

maxIter=10
regParam=0.3
elasticNetParam=0.8

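# NOTE: featuresCol is not set, so it defaults to 'features' (the unscaled column);
# the 'features_scaled' column is not used by this model
lr = LinearRegression(labelCol='label',\
                      maxIter=maxIter,\
                      elasticNetParam=elasticNetParam,\
                      regParam=regParam)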

Fit the model using the training data

In [12]:
linear_model = lr.fit(train_data)

In [13]:
list(zip(df1.columns[1:],linear_model.coefficients))

[('total_bedrooms', 0.0),
 ('population', 0.0),
 ('households', 0.0),
 ('median_income', 0.2768150738137753),
 ('rooms_per_household', 0.0)]

In [14]:
linear_model.intercept

1.0006902251741692

In [15]:
linear_model.summary.numInstances

16472

For each data point in the test set, make a prediction (hint: apply `transform()` to the model).
You will want the returned object to be a DataFrame.

In [16]:
predictions = linear_model.transform(test_data)
print(predictions.columns)

['label', 'features', 'features_scaled', 'prediction']
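
For intuition, the value in the `prediction` column of a linear model is the intercept plus the dot product of the coefficients and the feature vector. The sketch below reproduces that arithmetic in plain Python using the intercept and coefficients reported above and the first row shown earlier for `transformed_df`. It assumes the model reads the unscaled `features` column (the default `featuresCol`); the helper name `predict` is hypothetical.

```python
def predict(intercept, coefficients, features):
    """Linear prediction: intercept + sum of coefficient * feature."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# Values copied from the notebook output above.
intercept = 1.0006902251741692
coefficients = [0.0, 0.0, 0.0, 0.2768150738137753, 0.0]

# First row of transformed_df:
# total_bedrooms, population, households, median_income, rooms_per_household
features = [129.0, 322.0, 126.0, 8.3252, 6.9841]

prediction = predict(intercept, coefficients, features)
print(prediction)  # in units of $100,000, like the label
```

Since only the `median_income` coefficient is nonzero, the prediction here depends on that one feature alone.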


In [17]:
predictions.show(2)

+-------+--------------------+--------------------+------------------+
|  label|            features|     features_scaled|        prediction|
+-------+--------------------+--------------------+------------------+
|0.14999|[267.0,628.0,225....|[0.63383104398397...|2.1614311926900918|
|  0.225|[73.0,216.0,63.0,...|[0.17329463000310...| 1.741170547626018|
+-------+--------------------+--------------------+------------------+
only showing top 2 rows



In [18]:
predsandlabels_df = predictions.select("prediction", "label")
predsandlabels_df.take(2)

[Row(prediction=2.1614311926900918, label=0.14999),
 Row(prediction=1.741170547626018, label=0.225)]

**COMPUTE MSE (1 POINT)**  
Evaluate the model by computing the Mean Squared Error (MSE), which is the average of the squared differences between the predictions and the labels. 

This can be computed in a single line using `reduce()`
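
The same map-then-reduce pattern can be sketched in plain Python with `functools.reduce`, here applied only to the two (prediction, label) pairs printed above. This is a toy illustration of the formula, not the MSE of the full test set.

```python
from functools import reduce

# The two (prediction, label) pairs shown in the take(2) output above.
pairs = [(2.1614311926900918, 0.14999), (1.741170547626018, 0.225)]

# "map" step: squared difference for each pair
squared_errors = [(prediction - label) ** 2 for prediction, label in pairs]

# "reduce" step: sum the squared errors, then divide by the number of pairs
mse = reduce(lambda x, y: x + y, squared_errors) / len(pairs)
print(mse)
```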

##### Method 1

In [19]:
MSE1 = predsandlabels_df\
    .rdd\
    .map(lambda x: (x[0] - x[1])**2)\
    .reduce(lambda x,y : x+y) /predsandlabels_df.count()
MSE1

0.755384454564553

##### Method 2

In [20]:
from pyspark.ml.evaluation import RegressionEvaluator
# metricName defaults to 'rmse', so this call returns the RMSE
lr_eval = RegressionEvaluator(predictionCol='prediction', labelCol='label')
lr_eval.evaluate(predsandlabels_df)

0.8691285604354243

In [21]:
# MAE
prediction_mae = lr_eval.evaluate(predsandlabels_df, 
                                           {lr_eval.metricName:'mae'}) 

prediction_mae

0.6713569550594953

In [22]:
# MSE
prediction_mse = lr_eval.evaluate(predsandlabels_df, 
                                           {lr_eval.metricName:'mse'})

prediction_mse

0.755384454564553

In [23]:
# RMSE
prediction_rmse = lr_eval.evaluate(predsandlabels_df, 
                                           {lr_eval.metricName:'rmse'}) 

prediction_rmse

0.8691285604354243
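
The three metrics are closely related: MAE averages absolute errors, MSE averages squared errors, and RMSE is the square root of MSE. The plain-Python sketch below defines them for a list of (prediction, label) pairs and checks that the MSE and RMSE values reported above are consistent with each other. The helper names and the toy pairs are hypothetical.

```python
import math


def mae(pairs):
    """Mean absolute error over (prediction, label) pairs."""
    return sum(abs(p - y) for p, y in pairs) / len(pairs)


def mse(pairs):
    """Mean squared error over (prediction, label) pairs."""
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)


def rmse(pairs):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(pairs))


# Hypothetical toy pairs with errors of 1 and 3.
toy_pairs = [(2.0, 1.0), (1.0, 4.0)]
print(mae(toy_pairs), mse(toy_pairs), rmse(toy_pairs))

# Consistency check on the values reported by the evaluator above.
reported_mse = 0.755384454564553
reported_rmse = 0.8691285604354243
gap = abs(math.sqrt(reported_mse) - reported_rmse)
print(gap)
```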

**RESULTS SECTION**

#### Code for Part 2

In [24]:
# Code for Part 2 - (1 PT) Repeat Part 1 with at least one additional feature from the original set.
# retain these predictors for Part 2
vars_to_keep2 = ["median_house_value_final", 
              "total_bedrooms", 
              "population", 
              "households", 
              "median_income",
              "rooms_per_household",
              "housing_median_age"]

# subset the dataframe on these predictors
df2 = df.select(vars_to_keep2)

df2.show(2)

+------------------------+--------------+----------+----------+-------------+-------------------+------------------+
|median_house_value_final|total_bedrooms|population|households|median_income|rooms_per_household|housing_median_age|
+------------------------+--------------+----------+----------+-------------+-------------------+------------------+
|                   4.526|         129.0|     322.0|     126.0|       8.3252|  6.984126984126984|              41.0|
|                   3.585|        1106.0|    2401.0|    1138.0|       8.3014|  6.238137082601054|              21.0|
+------------------------+--------------+----------+----------+-------------+-------------------+------------------+
only showing top 2 rows



In [25]:
# extract labels and features; stored as RDDs
transformed_data2 = df2.rdd.map(lambda x: (x[0], DenseVector(x[1:])))
transformed_df2 = spark.createDataFrame(transformed_data2, ['label', 'features'])
transformed_df2.take(2)

[Row(label=4.526, features=DenseVector([129.0, 322.0, 126.0, 8.3252, 6.9841, 41.0])),
 Row(label=3.585, features=DenseVector([1106.0, 2401.0, 1138.0, 8.3014, 6.2381, 21.0]))]

In [26]:
# Feature scaling
# Fit the scaler to the DataFrame; this computes the mean and standard deviation of each feature
scaler2 = standardScaler.fit(transformed_df2)

# Transform the data in `transformed_df2` with the scaler
scaled_df2 = scaler2.transform(transformed_df2)

seed = 314
train_test = [0.8, 0.2]
train_data2, test_data2 = scaled_df2.randomSplit(train_test,seed=seed)

print(f'train data is {train_data2.dtypes}')
print(f'test data is a {test_data2.dtypes}')

linear_model2 = lr.fit(train_data2)

train data is [('label', 'double'), ('features', 'vector'), ('features_scaled', 'vector')]
test data is a [('label', 'double'), ('features', 'vector'), ('features_scaled', 'vector')]


In [27]:
list(zip(df2.columns[1:],linear_model2.coefficients))

[('total_bedrooms', 0.0),
 ('population', 0.0),
 ('households', 0.0),
 ('median_income', 0.27681502029465255),
 ('rooms_per_household', 0.0),
 ('housing_median_age', 0.0)]

In [28]:
linear_model2.intercept

1.000690432060719

In [29]:
predictions2 = linear_model2.transform(test_data2)
print(predictions2.columns)

['label', 'features', 'features_scaled', 'prediction']


In [30]:
predsandlabels_df2 = predictions2.select("prediction", "label")
predsandlabels_df2.take(2)

[Row(prediction=2.1614311751602564, label=0.14999),
 Row(prediction=1.7411706113489145, label=0.225)]

In [31]:
MSE2 = predsandlabels_df2\
    .rdd\
    .map(lambda x: (x[0] - x[1])**2)\
    .reduce(lambda x,y : x+y) /predsandlabels_df2.count()
MSE2

0.7553845110003091

#### Code for Part 3

In [32]:
# Code for Part 3 - Repeat Part 1 with at least one engineered feature based on one or more variables from the original set.
# df already contains median_house_value_final and rooms_per_household (added in the preprocessing cells)

# add population_per_household (average number of people per household)
df3 = df.withColumn('population_per_household', col('population')/col('households'))


df3.show(2)

+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+------------------------+-------------------+------------------------+
|median_house_value|median_income|housing_median_age|total_rooms|total_bedrooms|population|households|latitude|longitude|median_house_value_final|rooms_per_household|population_per_household|
+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+------------------------+-------------------+------------------------+
|          452600.0|       8.3252|              41.0|      880.0|         129.0|     322.0|     126.0|   37.88|  -122.23|                   4.526|  6.984126984126984|      2.5555555555555554|
|          358500.0|       8.3014|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|   37.86|  -122.22|                   3.585|  6.238137082601054|       2.109841827768014|
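+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+------------------------+-------------------+------------------------+
only showing top 2 rows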

In [33]:
# retain these predictors for Part 3
vars_to_keep3 = ["median_house_value_final", 
              "total_bedrooms", 
              "population", 
              "households", 
              "median_income",
              "rooms_per_household",
              "population_per_household"]

# subset the dataframe on these predictors
df3 = df3.select(vars_to_keep3)

df3.show(2)

+------------------------+--------------+----------+----------+-------------+-------------------+------------------------+
|median_house_value_final|total_bedrooms|population|households|median_income|rooms_per_household|population_per_household|
+------------------------+--------------+----------+----------+-------------+-------------------+------------------------+
|                   4.526|         129.0|     322.0|     126.0|       8.3252|  6.984126984126984|      2.5555555555555554|
|                   3.585|        1106.0|    2401.0|    1138.0|       8.3014|  6.238137082601054|       2.109841827768014|
+------------------------+--------------+----------+----------+-------------+-------------------+------------------------+
only showing top 2 rows



In [34]:
# extract labels and features; stored as RDDs
transformed_data3 = df3.rdd.map(lambda x: (x[0], DenseVector(x[1:])))
transformed_df3 = spark.createDataFrame(transformed_data3, ['label', 'features'])
transformed_df3.take(2)

[Row(label=4.526, features=DenseVector([129.0, 322.0, 126.0, 8.3252, 6.9841, 2.5556])),
 Row(label=3.585, features=DenseVector([1106.0, 2401.0, 1138.0, 8.3014, 6.2381, 2.1098]))]

In [35]:
# Feature scaling
# Fit the scaler to the DataFrame; this computes the mean and standard deviation of each feature
scaler3 = standardScaler.fit(transformed_df3)

# Transform the data in `transformed_df3` with the scaler
scaled_df3 = scaler3.transform(transformed_df3)

seed = 314
train_test = [0.8, 0.2]
train_data3, test_data3 = scaled_df3.randomSplit(train_test,seed=seed)

print(f'train data is {train_data3.dtypes}')
print(f'test data is a {test_data3.dtypes}')

linear_model3 = lr.fit(train_data3)

train data is [('label', 'double'), ('features', 'vector'), ('features_scaled', 'vector')]
test data is a [('label', 'double'), ('features', 'vector'), ('features_scaled', 'vector')]


In [36]:
list(zip(df3.columns[1:],linear_model3.coefficients))

[('total_bedrooms', 0.0),
 ('population', 0.0),
 ('households', 0.0),
 ('median_income', 0.276815072316851),
 ('rooms_per_household', 0.0),
 ('population_per_household', 0.0)]

In [37]:
linear_model3.intercept

1.0006902309607646

In [38]:
predictions3 = linear_model3.transform(test_data3)
print(predictions3.columns)

['label', 'features', 'features_scaled', 'prediction']


In [39]:
predsandlabels_df3 = predictions3.select("prediction", "label")
predsandlabels_df3.take(2)

[Row(prediction=2.1614311921997844, label=0.14999),
 Row(prediction=1.741170549408341, label=0.225)]

In [40]:
MSE3 = predsandlabels_df3\
    .rdd\
    .map(lambda x: (x[0] - x[1])**2)\
    .reduce(lambda x,y : x+y) /predsandlabels_df3.count()
MSE3

0.7553844561430528

#### Code for Part 4

In [41]:
# Code for Part 4

# elasticNetParam corresponds to α and regParam corresponds to λ
# Lasso - When λ>0 (i.e. regParam >0) and α = 1 (i.e. elasticNetParam =1), then the penalty is an L1 penalty.
lr_lasso= LinearRegression(featuresCol="features",\
                           labelCol="label",\
                           predictionCol="prediction",\
                           maxIter=maxIter,\
                           regParam=0.1,\
                           elasticNetParam=1.0)

linear_model_lasso = lr_lasso.fit(train_data)
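
To illustrate how `regParam` (λ) and `elasticNetParam` (α) combine, the sketch below computes the elastic net penalty λ(α‖w‖₁ + (1−α)/2 ‖w‖₂²) in plain Python, which is the form given in the Spark ML documentation. With α = 1 only the L1 (Lasso) term remains; with α = 0 only the L2 (ridge) term remains. The helper name `elastic_net_penalty` and the toy coefficients are hypothetical.

```python
def elastic_net_penalty(coefficients, reg_param, elastic_net_param):
    """Penalty = reg_param * (alpha * L1 + (1 - alpha) / 2 * L2^2)."""
    l1 = sum(abs(w) for w in coefficients)
    l2_squared = sum(w * w for w in coefficients)
    return reg_param * (
        elastic_net_param * l1 + (1 - elastic_net_param) / 2 * l2_squared
    )


# Hypothetical toy coefficient vector.
toy_coefficients = [0.5, -2.0, 0.0]

lasso_penalty = elastic_net_penalty(toy_coefficients, reg_param=0.1, elastic_net_param=1.0)
ridge_penalty = elastic_net_penalty(toy_coefficients, reg_param=0.1, elastic_net_param=0.0)
part1_penalty = elastic_net_penalty(toy_coefficients, reg_param=0.3, elastic_net_param=0.8)
print(lasso_penalty, ridge_penalty, part1_penalty)
```

The L1 term is what tends to drive coefficients exactly to zero, which is one plausible reading of the sparse coefficient vectors printed in this notebook.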

In [42]:
list(zip(df1.columns[1:],linear_model_lasso.coefficients))

[('total_bedrooms', 0.0),
 ('population', 0.0),
 ('households', 0.0),
 ('median_income', 0.36532274714313834),
 ('rooms_per_household', 0.0)]

In [43]:
linear_model_lasso.intercept

0.6585499536995455

In [44]:
predictions_lasso = linear_model_lasso.transform(test_data)
print(predictions_lasso.columns)

['label', 'features', 'features_scaled', 'prediction']


In [45]:
predsandlabels_lasso_df = predictions_lasso.select("prediction", "label")
predsandlabels_lasso_df.take(2)

[Row(prediction=2.190421297020153, label=0.14999),
 Row(prediction=1.6357883023074404, label=0.225)]

In [46]:
MSE4 = predsandlabels_lasso_df\
    .rdd\
    .map(lambda x: (x[0] - x[1])**2)\
    .reduce(lambda x,y : x+y) /predsandlabels_lasso_df.count()
MSE4

0.691575957689366

#### MSE for Question part 1 to part 4

Print dataframe containing `question_part`, `MSE` values for parts 1-4 in the next cell.

In [47]:
# print dataframe containing question_part, MSE
mse_dict = {
        'Question part' :  ['prep data, train a model, predict, and evaluate model fit. Compute and report the Mean Squared Error (MSE)',
                             'Repeat Part 1 with at least one additional feature from the original set.',
                             'Repeat Part 1 with at least one engineered feature based on one or more variables from the original set.',
                             'Repeat Part 1 using Lasso Regression'],
                           
         'MSE' : [MSE1, MSE2, MSE3, MSE4]
        }
pd.set_option("display.max_columns", 100)
pd.set_option("max_colwidth", 80)
mse_df = pd.DataFrame(mse_dict)
mse_df.head()

| | Question part | MSE |
|---|---|---|
| 0 | prep data, train a model, predict, and evaluate model fit. Compute and repor... | 0.755384 |
| 1 | Repeat Part 1 with at least one additional feature from the original set. | 0.755385 |
| 2 | Repeat Part 1 with at least one engineered feature based on one or more vari... | 0.755384 |
| 3 | Repeat Part 1 using Lasso Regression | 0.691576 |


In [48]:
!jupyter nbconvert DS5559_M54HW__calhousing_JayHombal.ipynb --to pdf

[NbConvertApp] Converting notebook DS5559_M54HW__calhousing_JayHombal.ipynb to pdf
[NbConvertApp] Writing 78507 bytes to DS5559_M54HW__calhousing_JayHombal.pdf
