## Linear Regression Example in Spark
This example demonstrates how to run a simple linear regression as well as a Gradient Boosted Tree regression model in Spark. We will use the California housing dataset from Kaggle: https://www.kaggle.com/camnugent/california-housing-prices/version/1. You can download the dataset from here: https://uofi.box.com/s/ibp6witxuq5udtvalex7s4h8selwfol5. The dataset consists of 9 predictive variables on about 20.6K observations. It is not big; however, the program we write here is completely scalable and can be run on big data.

Our goal is to build a regression model that predicts the median_house_value per block group from the features collected in this dataset. The data description on Kaggle states that all variables are per block group. "A Census Block Group is a geographical unit used by the United States Census Bureau which is between the Census Tract and the Census Block. It is the smallest geographical unit for which the bureau publishes sample data, i.e. data which is only collected from a fraction of all households" (Wikipedia).

The goal of this notebook is to learn how to build and train Linear Regression and Gradient Boosted Regression Tree models in Spark, tune their hyperparameters, and evaluate them using cross-validation.

As before, let's first configure our spark shell on yarn:

In [1]:
%%init_spark
launcher.master="yarn"
launcher.num_executors=6
launcher.executor_cores=2
launcher.executor_memory='2600m'


## Loading and Exploring Data
I have copied the data to HDFS. Let's load it into Spark, look at the schema, and print a few rows.

In [4]:
//Read the CSV file and load it into a dataframe. Note that the "inferschema" parameter is set to true
val housing_df=spark.read.option("header","true").option("inferschema", "true").csv("/hadoop-user/data/housing.csv")
housing_df.cache()
housing_df.printSchema()
housing_df.show(3)
housing_df.count

2018-11-02 11:56:22 WARN  CacheManager:66 - Asked to cache already cached data.
root
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- housing_median_age: double (nullable = true)
 |-- total_rooms: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- population: double (nullable = true)
 |-- households: double (nullable = true)
 |-- median_income: double (nullable = true)
 |-- median_house_value: double (nullable = true)
 |-- ocean_proximity: string (nullable = true)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|  -122.23|   37.88|              41.0|      880.0|   

housing_df: org.apache.spark.sql.DataFrame = [longitude: double, latitude: double ... 8 more fields]
res2: Long = 20640


It looks like all the variables are continuous except for ocean_proximity. Let's get a quick distribution of this categorical variable:

In [5]:
housing_df.createOrReplaceTempView("housing")
spark.sql("select ocean_proximity,count(ocean_proximity) from housing group by ocean_proximity").show()

+---------------+----------------------+
|ocean_proximity|count(ocean_proximity)|
+---------------+----------------------+
|         ISLAND|                     5|
|     NEAR OCEAN|                  2658|
|       NEAR BAY|                  2290|
|      <1H OCEAN|                  9136|
|         INLAND|                  6551|
+---------------+----------------------+



Now let's do some exploratory data analysis. First let's check some statistics on each column (number of rows, min, max, standard deviation, etc.). In Spark, you can use the describe method of the dataframe to get basic summary statistics.


In [6]:
val stat=housing_df.describe()
stat.show()

+-------+-------------------+-----------------+------------------+------------------+------------------+------------------+-----------------+------------------+------------------+---------------+
|summary|          longitude|         latitude|housing_median_age|       total_rooms|    total_bedrooms|        population|       households|     median_income|median_house_value|ocean_proximity|
+-------+-------------------+-----------------+------------------+------------------+------------------+------------------+-----------------+------------------+------------------+---------------+
|  count|              20640|            20640|             20640|             20640|             20433|             20640|            20640|             20640|             20640|          20640|
|   mean|-119.56970445736148| 35.6318614341087|28.639486434108527|2635.7630813953488| 537.8705525375618|1425.4767441860465|499.5396802325581|3.8706710029070246|206855.81690891474|           null|
| stddev|  2.0035317

stat: org.apache.spark.sql.DataFrame = [summary: string, longitude: string ... 9 more fields]


As you can see, because of the number of features it is hard to read the statistics for each feature. It is better to transpose this dataset, that is, to flip the rows and columns so that we have features as rows and their statistics as columns. Unfortunately, Spark does not have a built-in feature for transposing a dataframe. Spylon allows us to share Spark dataframes between Scala and Python. You just need to create a temporary view from the dataframe.

In [8]:
stat.createOrReplaceTempView("stat")

Then we can use %%python to switch to pyspark. Since the dataframe is small, $5\times10$ (five descriptive values for 10 columns), we can convert it to a non-distributed Python dataframe using Spark's toPandas method. This method acts like collect in that it brings the entire dataset to the driver, except that it returns the data as a pandas dataframe which resides in the memory of the driver node. We can now use the transpose method from the "pandas" library in Python to transpose the dataframe.

In [9]:
%%python
import pandas as pd
stat_python=spark.sql("select * from stat" )
stat_python_nonDistributed=stat_python.toPandas().transpose()
pd.set_option('display.max_columns', 7)
pd.set_option('display.width', 100)

print(stat_python_nonDistributed)

                        0                    1                   2          3           4
summary             count                 mean              stddev        min         max
longitude           20640  -119.56970445736148   2.003531723502584    -124.35     -114.31
latitude            20640     35.6318614341087   2.135952397457101      32.54       41.95
housing_median_age  20640   28.639486434108527   12.58555761211163        1.0        52.0
total_rooms         20640   2635.7630813953488  2181.6152515827944        2.0     39320.0
total_bedrooms      20433    537.8705525375618  421.38507007403115        1.0      6445.0
population          20640   1425.4767441860465    1132.46212176534        3.0     35682.0
households          20640    499.5396802325581   382.3297528316098        1.0      6082.0
median_income       20640   3.8706710029070246   1.899821717945263     0.4999     15.0001
median_house_value  20640   206855.81690891474  115395.61587441359    14999.0    500001.0
ocean_prox

The "count" column above shows the number of non-null entries for each feature. The "total_bedrooms" feature has 20433 non-null entries out of 20640 rows, so it has some missing values. Two general categories of methods for dealing with missing values are: 1- complete case analysis, and 2- data imputation.

In complete case analysis we simply drop every row that has a missing value in any column. This method can be used when the dataset is not too small and only a small percentage of rows have missing values. If a large number of rows have missing values, throwing them all away can lose information and possibly produce a weak machine learning model. In that case, you should use data imputation to infer the missing values from the rest of the data. There are a variety of imputation methods, the easiest being to replace all the missing values with the mean of the column. However, mean imputation suffers from a number of shortcomings. For a more detailed discussion of missing value imputation, please refer to this tutorial: http://www.stat.columbia.edu/~gelman/arm/missing.pdf.

In this example, however, only 207 rows out of 20640 have missing values, which is about 1% of the data, so it is reasonable to drop them. We can use the function na.drop() in Spark to get rid of all the rows with null values.

In [10]:
val housing_complete=housing_df.na.drop()
housing_complete.createOrReplaceTempView("housing")
housing_complete.count()

housing_complete: org.apache.spark.sql.DataFrame = [longitude: double, latitude: double ... 8 more fields]
res6: Long = 20433
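
For comparison, here is a minimal sketch of what mean imputation could look like with Spark ML's Imputer, in case dropping rows were not acceptable. This is not what this notebook does. The toy rows, the `session` value, and the output column name `total_bedrooms_imputed` are assumptions made only for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.Imputer

// Reuses the notebook's Spark session if one exists; otherwise starts a local one
val session = SparkSession.builder().master("local[*]").appName("imputer-sketch").getOrCreate()
import session.implicits._

// Toy data (hypothetical values): the second row has a missing total_bedrooms
val toy = Seq[(Double, Option[Double])](
  (880.0, Some(100.0)),
  (7099.0, None),
  (1467.0, Some(200.0))
).toDF("total_rooms", "total_bedrooms")

// Replace missing values with the column mean, writing to a new column
val imputer = new Imputer()
  .setInputCols(Array("total_bedrooms"))
  .setOutputCols(Array("total_bedrooms_imputed"))
  .setStrategy("mean")

val imputed = imputer.fit(toy).transform(toy)
imputed.show()
```

In this toy example the missing entry is filled with the mean of the two observed values. Like the other preprocessing steps, an Imputer is an estimator, so it could in principle be added as a pipeline stage.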



## Feature Engineering
Now let's do some feature engineering. Feature engineering typically refers to creating new features from existing ones. For this particular example we need to do two things:

1. If you look at the summary statistics above, you will see that the variables are on different scales and the range between min and max values varies greatly between features. That tells us that we need to scale our data. Let's scale the numeric predictors using StandardScaler. Please note that scaling the target/outcome variable (median_house_value) is not necessary. Nevertheless, since the median_house_values are quite large, it makes interpretation easier if we divide each value by 100K. That way the unit of house value is 100K dollars instead of one dollar, and a median house value of 5, for example, represents a $500K value.


2. The categorical predictor "ocean_proximity" needs to be converted to a numeric value before we can feed it to a machine learning algorithm, so we can use StringIndexer to convert this categorical variable to category indices. Since this variable does not have a natural ordering, we should also apply one-hot encoding on top of StringIndexer.
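
To make the second step concrete, here is a small plain-Scala sketch (no Spark required) of what the indexing and encoding stages are expected to produce. It assumes StringIndexer's default ordering (the most frequent label gets index 0) and the one-hot encoder's default of dropping the last category. The counts are the ones printed by the group-by query earlier, before rows with nulls were dropped, so treat the exact indices as illustrative. The helper `oneHot` is a hypothetical name used only here.

```scala
// Category counts as printed by the earlier group-by query
val counts = Seq(
  "ISLAND" -> 5, "NEAR OCEAN" -> 2658, "NEAR BAY" -> 2290,
  "<1H OCEAN" -> 9136, "INLAND" -> 6551
)

// Index labels by descending frequency: the most frequent label gets index 0
val index: Map[String, Int] = counts.sortBy(-_._2).map(_._1).zipWithIndex.toMap

// One-hot encode an indexed label, dropping the last category so that
// it is represented by the all-zeros vector
def oneHot(label: String): Array[Double] = {
  val v = Array.fill(index.size - 1)(0.0)
  val i = index(label)
  if (i < v.length) v(i) = 1.0
  v
}

println(index)
println(oneHot("INLAND").mkString("[", ", ", "]"))  // index 1
println(oneHot("ISLAND").mkString("[", ", ", "]"))  // last category: all zeros
```

Five categories therefore become a vector of length four, which is why the encoded column adds four entries to the feature vector rather than five.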

Before feeding a dataset to a machine learning algorithm in spark, we need to convert it into (features,label) form where features is a numeric vector of predictors and label is a numeric target variable. 

In the code segment below we first create a numeric vector from all numeric features and standardize this vector using StandardScaler. Then we convert the categorical variable "ocean_proximity" to its one-hot encoding and assemble that with our previously created numeric feature vector. This gives us a numeric vector with all the predictors.

In [12]:
/* the "withColumn" method of data frame can be used to replace the values of a column or create new column based on an existing column
 * It takes two parameters: 1- the name of the column, 2- the column values
 * We use withColumn to divide the target variable median_house_value by 100,000*/
val housing=housing_complete.withColumn("median_house_value", housing_complete("median_house_value")/100000)

import org.apache.spark.ml.feature._

//get all the numeric features except the target variable
val numeric_features=housing_complete.columns.filter(c => !c.equals("ocean_proximity") && !c.equals("median_house_value"))

//Use VectorAssembler to assemble numeric features into a vector
val vectorizer_numeric=new VectorAssembler().setInputCols(numeric_features).setOutputCol("numeric_features")

//Create an estimator to standardize the numeric feature
val standardizer=new StandardScaler().setWithMean(true).setInputCol("numeric_features").setOutputCol("numeric_features_vector")

//Do one-hot-encoding of the "ocean_proximity" variable
val indexer=new StringIndexer().setInputCol("ocean_proximity").setOutputCol("ocean_proximity_indexer")
val encoder= new OneHotEncoderEstimator().setInputCols(Array("ocean_proximity_indexer")).setOutputCols(Array("ocean_proximity_coded"))

//Now let's add the encoded categorical variable to our feature vector using VectorAssembler again
val vectorizer_all=new VectorAssembler().setInputCols(Array("numeric_features_vector","ocean_proximity_coded")).setOutputCol("features")
                                                                        

housing: org.apache.spark.sql.DataFrame = [longitude: double, latitude: double ... 8 more fields]
import org.apache.spark.ml.feature._
numeric_features: Array[String] = Array(longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income)
vectorizer_numeric: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_bdbe02e86e0e
standardizer: org.apache.spark.ml.feature.StandardScaler = stdScal_fd0d52f6f28c
indexer: org.apache.spark.ml.feature.StringIndexer = strIdx_91cc971014eb
encoder: org.apache.spark.ml.feature.OneHotEncoderEstimator = oneHotEncoder_c79bf424bfc2
vectorizer_all: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_cb96cfb0c251
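
As a reminder of what the standardizer stage computes once it is fit, here is a plain-Scala sketch of centering and scaling a single feature. It assumes the behaviour described in the Spark documentation for StandardScaler (centering when setWithMean(true) is set, and scaling by the corrected sample standard deviation). The three input values are hypothetical toy numbers, not taken from the housing data.

```scala
// Toy values (hypothetical), not taken from the housing data
val xs = Seq(1.0, 2.0, 3.0)
val n = xs.size
val mean = xs.sum / n

// Corrected sample standard deviation: divide by n - 1
val sd = math.sqrt(xs.map(x => math.pow(x - mean, 2)).sum / (n - 1))

// Center on the mean, then scale to unit standard deviation
val standardized = xs.map(x => (x - mean) / sd)
println(standardized)
```

After this transformation every numeric feature has mean 0 and standard deviation 1 on the data the scaler was fit on, which puts features such as total_rooms and median_income on a comparable scale.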


## Building the linear regression model for prediction
Now that we have set up the preprocessing of our data, we are ready to create a linear regression model that we will later fit to the training data. You can use "LinearRegression" to create a linear regression model in Spark. The "setLabelCol" method specifies the name of the target variable and the "setFeaturesCol" method specifies the name of the vector of predictors. "setMaxIter" sets the maximum number of iterations used in optimizing the cost function for linear regression, and "setRegParam" and "setElasticNetParam" specify the values for lambda and alpha in elastic net regularization, respectively.

In [14]:
import org.apache.spark.ml._
import org.apache.spark.ml.regression._
//Create the LinearRegression estimator (it is fit later, as part of the pipeline)
val lr= new LinearRegression().setLabelCol("median_house_value").setFeaturesCol("features").setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.7)

import org.apache.spark.ml._
import org.apache.spark.ml.regression._
lr: org.apache.spark.ml.regression.LinearRegression = linReg_06a106ba8921
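
For reference, the elastic net objective that these two parameters control can be written as follows, based on the formulation in the Spark ML documentation. Here $\lambda$ is regParam, $\alpha$ is elasticNetParam, $n$ is the number of training rows, and $w$ is the coefficient vector. The symbols are chosen here for illustration, and the intercept is omitted for brevity.

$$\min_{w}\; \frac{1}{2n}\sum_{i=1}^{n}\left(w^{\top}x_i - y_i\right)^2 + \lambda\left(\alpha\,\lVert w\rVert_1 + \frac{1-\alpha}{2}\,\lVert w\rVert_2^2\right)$$

With $\alpha=0$ the penalty reduces to ridge (L2) regularization and with $\alpha=1$ to lasso (L1). The cell above uses $\lambda=0.3$ and $\alpha=0.7$.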


### Creating a Spark Pipeline
Notice that we have not yet fit any of the preprocessing steps on our data, nor have we fit the regression model. Instead of fitting and transforming data separately in each step of feature engineering and modeling, we can create a Pipeline object and add all the preprocessing and regression stages we defined so far to this pipeline. A single call of "fit" or "transform" on the Pipeline object will put the data through all the fitting or transformation steps in the pipeline.

In [16]:
//Create a Pipeline and add all the stages we defined so far to it
val pipeline = new Pipeline().setStages(Array(vectorizer_numeric,standardizer,indexer,encoder,vectorizer_all, lr))

import org.apache.spark.ml.evaluation._
pipeline: org.apache.spark.ml.Pipeline = pipeline_7d090ae608a3


After creating the pipeline, we can use the randomSplit method of the dataframe to split the data randomly into a training set and a test set. randomSplit takes an array of training and testing proportions (we use 80% for training and 20% for testing) and a seed for the random number generator (the same seed produces the same random split).
Then we fit the pipeline to the training data. This fits all the preprocessing steps on our training data and builds a regression model.

In [21]:
import org.apache.spark.ml.evaluation._

//Split the data randomly into 80% training and 20% testing. The training data is used to build the model and the testing data is used for testing the model
val Array(training,testing)=housing.randomSplit(Array(0.8,0.2),111)

//Fit the pipeline (all preprocessing stages plus the regression model) to the training data
val pipeline_model= pipeline.fit(training)


import org.apache.spark.ml.evaluation._
training: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [longitude: double, latitude: double ... 8 more fields]
testing: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [longitude: double, latitude: double ... 8 more fields]
pipeline_model: org.apache.spark.ml.PipelineModel = pipeline_7d090ae608a3


## Evaluating the regression model
Once we have built the pipeline model on the training data, we can apply it to the test data to compute the predicted median_house_value for each predictor vector in the testing data. Please note that the pipeline should NOT be fit on the testing data. The testing data should be reserved for testing only and not used for fitting the model. You should only call transform on the testing data to apply the model that was already built on the training data.

Once we have the predictions on the test data, we can use the Root Mean Squared Error (RMSE) evaluation metric to see how much these predictions deviate from the actual median_house_values in the test data. The transform method on the pipeline_model creates a new dataframe with an additional "prediction" column. We can then use Spark's "RegressionEvaluator" to evaluate our regression model. We specify the target (i.e., label) column and the prediction column as well as the metric we want to use (rmse here).

The RMSE of our linear regression model is about 0.82. Because the target is in units of 100K dollars, this intuitively means that our predicted house values typically deviate by roughly 82,000 dollars from the actual house values.


In [22]:
import org.apache.spark.ml.evaluation._

//Apply the model to the test data to make predictions
val predictions = pipeline_model.transform(testing)

// Select example rows to display.
predictions.select("prediction", "median_house_value", "features").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new RegressionEvaluator()
  .setLabelCol("median_house_value")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
println(s"Root Mean Squared Error (RMSE) on test data = $rmse")


+------------------+------------------+--------------------+
|        prediction|median_house_value|            features|
+------------------+------------------+--------------------+
|1.7148592037835118|             0.761|[-2.3332227378130...|
|1.7319496212199748|             0.669|[-2.3132569008644...|
|2.1296595027817435|             0.901|[-2.3032739823901...|
|1.8450013018575213|              0.69|[-2.3032739823901...|
|1.5913243599391245|             0.646|[-2.2982825231530...|
+------------------+------------------+--------------------+
only showing top 5 rows

Root Mean Squared Error (RMSE) on test data = 0.8211974817480996


import org.apache.spark.ml.evaluation._
predictions: org.apache.spark.sql.DataFrame = [longitude: double, latitude: double ... 14 more fields]
evaluator: org.apache.spark.ml.evaluation.RegressionEvaluator = regEval_b02fd9b71da7
rmse: Double = 0.8211974817480996
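
To make the metric itself concrete, here is a plain-Scala sketch that computes RMSE by hand for the five (prediction, actual) pairs displayed in the output above. It only illustrates the formula; the name `rmseFiveRows` is introduced here for illustration and is not part of the notebook's pipeline.

```scala
// (prediction, actual) pairs copied from the five rows displayed above
val rows = Seq(
  (1.7148592037835118, 0.761),
  (1.7319496212199748, 0.669),
  (2.1296595027817435, 0.901),
  (1.8450013018575213, 0.69),
  (1.5913243599391245, 0.646)
)

// RMSE = sqrt(mean((prediction - actual)^2))
val squaredErrors = rows.map { case (pred, actual) => math.pow(pred - actual, 2) }
val rmseFiveRows = math.sqrt(squaredErrors.sum / squaredErrors.size)
println(f"RMSE over these five rows = $rmseFiveRows%.4f")
```

Because these are just the first five rows displayed, their RMSE differs from the value RegressionEvaluator computes over the whole test set.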


## Model Selection and Hyperparameter Tuning
There were a few hyperparameters in our linear regression model that we set arbitrarily: the maximum number of iterations, the regularization parameter (lambda), and the elastic net parameter (alpha). Now we want to try a range of values for these parameters to see if we can get a better model. In other words, we would like to tune the model's hyperparameters to get better predictions. Let's use a parameter grid together with cross-validation to select the combination of parameters that gives the best RMSE. We can build a hyperparameter grid in Spark by using "ParamGridBuilder" and calling addGrid to add a set of values for a hyperparameter to the grid.
Here we try three arbitrary values for lambda (regParam), three arbitrary values for alpha (elasticNetParam), and two values for the maximum number of iterations for solving the linear regression. That gives us $3\times 3\times 2=18$ different parameter combinations to try.

After building a parameter grid, we can use "CrossValidator" in Spark to run cross-validation over the paramGrid and evaluate the regression model built from each combination. We have to provide the model, the parameter grid, and the evaluator to our CrossValidator. I also set the number of folds to 10. This splits the training data into 10 parts and trains the model 10 times, with a different part held out for validation each time. This is done for each parameter combination in the param grid, so altogether we fit a total of 10*18=180 linear regression models during cross-validation. CrossValidator then refits the best combination once more on the full training data.
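
The counting argument above can be checked with a few lines of plain Scala. This is only a sketch of the arithmetic, enumerating the same values that are passed to ParamGridBuilder in the next cell. The names `grid` and `numFolds` are introduced here for illustration.

```scala
// The same candidate values used in the parameter grid
val regParams = Seq(0.01, 0.5, 2.0)
val elasticNetParams = Seq(0.0, 0.5, 1.0)
val maxIters = Seq(5, 10)

// Cartesian product of all hyperparameter values
val grid = for {
  r <- regParams
  e <- elasticNetParams
  m <- maxIters
} yield (r, e, m)

val numFolds = 10
println(s"combinations = ${grid.size}, fits during cross-validation = ${grid.size * numFolds}")
```

The grid grows multiplicatively with every added hyperparameter, so it is worth keeping the candidate lists short when each fit is expensive.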

We create a new pipeline which includes all the stages of preprocessing plus the cross validation stage. Note that the cross validation stage, already includes the linear regression model so there is no need to add lr as a separate stage to the new pipeline.


In [41]:

import org.apache.spark.ml.tuning._

//Create ParamGrid for Cross Validation to the linear regression model
val paramGrid = new ParamGridBuilder()
             .addGrid(lr.regParam, Array(0.01, 0.5, 2.0))
             .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
             .addGrid(lr.maxIter, Array(5, 10))
             .build()
// Create 10-fold CrossValidator
val cv = new CrossValidator().setEstimator(lr).setEstimatorParamMaps(paramGrid).setEvaluator(evaluator).setNumFolds(10)

//Create a new pipeline that ends with the cross-validation stage instead of lr
val new_pipeline=new Pipeline().setStages(Array(vectorizer_numeric,standardizer,indexer,encoder,vectorizer_all, cv))

import org.apache.spark.ml.tuning._
paramGrid: Array[org.apache.spark.ml.param.ParamMap] =
Array({
	linReg_06a106ba8921-elasticNetParam: 0.0,
	linReg_06a106ba8921-maxIter: 5,
	linReg_06a106ba8921-regParam: 0.01
}, {
	linReg_06a106ba8921-elasticNetParam: 0.0,
	linReg_06a106ba8921-maxIter: 10,
	linReg_06a106ba8921-regParam: 0.01
}, {
	linReg_06a106ba8921-elasticNetParam: 0.5,
	linReg_06a106ba8921-maxIter: 5,
	linReg_06a106ba8921-regParam: 0.01
}, {
	linReg_06a106ba8921-elasticNetParam: 0.5,
	linReg_06a106ba8921-maxIter: 10,
	linReg_06a106ba8921-regParam: 0.01
}, {
	linReg_06a106ba8921-elasticNetParam: 1.0,
	linReg_06a106ba8921-maxIter: 5,
	linReg_06a106ba8921-regParam: 0.01
}, {
	linReg_06a106ba8921-elasticNetParam: 1.0,
	linReg_06a106ba8921-maxIter: 10,
	linReg_06a106ba8921-regParam: 0.0...

Now we can fit this new_pipeline to our training data to build a new model and apply it to the testing data. The best model found by cross-validation is used to generate predictions on the test data. We see below that by tuning the hyperparameters we achieved about a 14% improvement in RMSE (from about 0.82 to about 0.70).


In [43]:


// Fit the new pipeline to the training data. This will likely take a fair amount of time because of the number of models that we are creating and testing
val new_Model = new_pipeline.fit(training)

//new_Model uses the best model found by cross-validation. We use the test set to measure the accuracy of our model on new data
val predictions = new_Model.transform(testing)
val rmse=evaluator.evaluate(predictions)
//Note the s prefix: without it, $rmse is printed literally instead of being interpolated
println(s"Root Mean Squared Error (RMSE) of the best model on the test data = $rmse")



Root Mean Squared Error (RMSE) of the best model on the test data = 0.7027095891602264


new_Model: org.apache.spark.ml.PipelineModel = pipeline_3c8d809002f1
predictions: org.apache.spark.sql.DataFrame = [longitude: double, latitude: double ... 14 more fields]
rmse: Double = 0.7027095891602264


## Building a Gradient Boosted Tree Regression Model
Now let's try to solve this problem using Gradient Boosted Tree (GBT) models. For GBT models, you don't need to scale your numeric features or one-hot encode your categorical features. All we have to do is convert the String features to indices using StringIndexer and assemble all the features (except the target variable) into a vector. We can then create a GBT regression model using GBTRegressor and evaluate it using cross-validation. The two hyperparameters we tune here using the ParamGrid are 1- maxDepth (the maximum depth of each decision tree, which controls overfitting due to the complexity of each tree) and 2- maxIter (the maximum number of trees in the ensemble).

We create a pipeline of StringIndexer, VectorAssembler, and CrossValidator stages, fit the pipeline to the training data, and use the fitted model to transform the test data and predict its median_house_value. Finally, we evaluate the predictions on the test data using RMSE.

When you run this, be patient: we are building a lot of trees. Each pass over the parameter grid builds $(100+20+10)\times 2=260$ decision trees, and with 10-fold cross-validation that pass is repeated for every fold, so roughly 2,600 trees in total. So get a coffee or a beer and relax for a while. It will take some time to run on our tiny three-node cluster. You might get a couple of warnings; just ignore them and let the code run to completion.
After the code completes, you can see that the RMSE on the test data is decreased by almost 27% using the GBT model. This is probably because the GBT model was able to capture some nonlinear relationships between the outcome and the predictors.

In [None]:
import org.apache.spark.ml.regression.{GBTRegressionModel, GBTRegressor}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.tuning._
import org.apache.spark.ml.evaluation._
import org.apache.spark.ml.feature._

val housing=housing_complete.withColumn("median_house_value", housing_complete("median_house_value")/100000)


//get all the numeric features except the target variable
val numeric_features=housing.columns.filter(c => !c.equals("ocean_proximity") && !c.equals("median_house_value"))


//index the "ocean_proximity" variable
val indexer=new StringIndexer().setInputCol("ocean_proximity").setOutputCol("ocean_proximity_indexer")

//Now let's assemble everything together in a feature vector
val vectorizer=new VectorAssembler().setInputCols(numeric_features++Array("ocean_proximity_indexer")).setOutputCol("features")

// Create a GBT model.
val gbt = new GBTRegressor()
  .setLabelCol("median_house_value")
  .setFeaturesCol("features")



//Create ParamGrid for Cross Validation
val paramGrid = new ParamGridBuilder()
             .addGrid(gbt.maxDepth, Array(2,5))
             .addGrid(gbt.maxIter, Array(10, 20,100))
             .build()
val evaluator = new RegressionEvaluator()
  .setLabelCol("median_house_value")
  .setPredictionCol("prediction")
  .setMetricName("rmse")

val cv = new CrossValidator().setEstimator(gbt).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(10)


val pipeline = new Pipeline().setStages(Array(indexer, vectorizer,cv))


val Array(training,testing)=housing.randomSplit(Array(0.8,0.2),111)

//Fit the pipeline to the training data
val pipelineModel = pipeline.fit(training)

// Make predictions.
val predictions = pipelineModel.transform(testing)

// Select example rows to display.
predictions.select("prediction", "median_house_value", "features").show(5)

val rmse = evaluator.evaluate(predictions)
println(s"Root Mean Squared Error (RMSE) on test data = $rmse")



## Getting the Feature Importance
Once we have built the model, we can get the variable importance of the best model (the model with the lowest RMSE in cross-validation). The variable importance is a numeric vector which gives each feature a number in [0,1] indicating the importance of that feature in predicting the outcome. The importances are normalized, so they sum to 1.

To get the variable importance we first have to access the best model. pipelineModel.stages gives us the array of stages of the pipelineModel. The cross validation stage (cv) is the third stage in our pipelineModel and we can access it by index 2, cast it to CrossValidatorModel and get its bestModel. Then we cast the best Model to GBTRegressionModel and use featureImportances method to get the variable importance. This will give us a numeric vector of variable importance. Then we zip this vector to the feature names vector which we used to build our model and sort it in the descending order of its importance.

The result shows that the location (longitude and latitude), median_income, and housing_median_age were the most important features in predicting the median house value.

In [64]:
//print variable importance.
import org.apache.spark.mllib.linalg._
val featureImportance=pipelineModel.stages(2).asInstanceOf[CrossValidatorModel].bestModel.asInstanceOf[GBTRegressionModel].featureImportances

val features= numeric_features++Array("ocean_proximity_indexer")
//Pair each feature name with its importance and print them in descending order of importance
features.zip(featureImportance.toArray).sortBy(-_._2).foreach(println)

(longitude,0.19152242888428286)
(latitude,0.1622161601481415)
(median_income,0.15253168850003201)
(housing_median_age,0.11890735656062079)
(population,0.09867196284202912)
(ocean_proximity_indexer,0.07987662505311782)
(total_rooms,0.0782894711037661)
(total_bedrooms,0.06015861437288235)
(households,0.05782569253512725)


import org.apache.spark.mllib.linalg._
featureImportance: org.apache.spark.ml.linalg.Vector = (9,[0,1,2,3,4,5,6,7,8],[0.19152242888428286,0.1622161601481415,0.11890735656062079,0.0782894711037661,0.06015861437288235,0.09867196284202912,0.05782569253512725,0.15253168850003201,0.07987662505311782])
features: Array[String] = Array(longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, ocean_proximity_indexer)
