## Housing price prediction

In this project, a linear regression model is trained on 1990 California housing data. The goal is to predict the median house value in a housing block (`median_house_value`), make predictions on a test set, and evaluate model quality with the RMSE, MAE, and R2 metrics.

The dataset columns contain the following data:

    longitude: longitude;
    latitude: latitude;
    housing_median_age: median age of the houses in the housing block;
    total_rooms: total number of rooms in the houses of the housing block;
    total_bedrooms: total number of bedrooms in the houses of the housing block;
    population: number of people living in the housing block;
    households: number of households in the housing block;
    median_income: median income of the residents of the housing block;
    median_house_value: median house value in the housing block (the target);
    ocean_proximity: proximity to the ocean.
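For reference, the three evaluation metrics are conventionally defined as below, where $y_i$ is the true value, $\hat{y}_i$ the prediction, $\bar{y}$ the mean of the true values, and $n$ the number of test rows. This is the standard textbook form, not something stated in the notebook itself.

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```

MAE and RMSE are in the units of the target (dollars here), and lower is better; R2 is unitless, and values closer to 1 are better.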

# Data preparation

In [1]:
import pandas as pd 
import numpy as np

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F

from pyspark.ml.feature import (
    Imputer,
    OneHotEncoder,
    StandardScaler,
    StringIndexer,
    VectorAssembler,
)
from pyspark.ml.regression import LinearRegression
from pyspark.mllib.evaluation import RegressionMetrics

RANDOM_SEED = 12345

Create a Spark session.

In [2]:
spark = (
    SparkSession.builder
    .master("local")
    .appName("California houses - Linear regression")
    .getOrCreate()
)

In [3]:
df = spark.read.option('header', 'true').csv('/datasets/housing.csv', inferSchema=True)
spark  # check the Spark version

                                                                                

In [4]:
df.printSchema()

root
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- housing_median_age: double (nullable = true)
 |-- total_rooms: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- population: double (nullable = true)
 |-- households: double (nullable = true)
 |-- median_income: double (nullable = true)
 |-- median_house_value: double (nullable = true)
 |-- ocean_proximity: string (nullable = true)



Most columns are numeric; one is a string (the only categorical feature).

In [5]:
df.show(10)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|  -122.23|   37.88|              41.0|      880.0|         129.0|     322.0|     126.0|       8.3252|          452600.0|       NEAR BAY|
|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|       NEAR BAY|
|  -122.24|   37.85|              52.0|     1467.0|         190.0|     496.0|     177.0|       7.2574|          352100.0|       NEAR BAY|
|  -122.25|   37.85|              52.0|     1274.0|         235.0|     558.0|     219.0|       5.6431|          341300.0|       NEAR BAY|
|  -122.25|   37.85|              

In [6]:
df.describe().toPandas()

                                                                                

Unnamed: 0,summary,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,count,20640.0,20640.0,20640.0,20640.0,20433.0,20640.0,20640.0,20640.0,20640.0,20640
1,mean,-119.56970445736148,35.6318614341087,28.639486434108527,2635.7630813953488,537.8705525375618,1425.4767441860463,499.5396802325581,3.8706710029070246,206855.81690891477,
2,stddev,2.003531723502584,2.135952397457101,12.58555761211163,2181.6152515827944,421.3850700740312,1132.46212176534,382.3297528316098,1.899821717945263,115395.6158744136,
3,min,-124.35,32.54,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0,<1H OCEAN
4,max,-114.31,41.95,52.0,39320.0,6445.0,35682.0,6082.0,15.0001,500001.0,NEAR OCEAN


There are a few missing values in total_bedrooms.

In [7]:
(20640-20433)/20640 * 100

1.002906976744186

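About 1% of the values are missing. Dropping these rows would be acceptable, but here they are imputed instead.

Split the dataset into training and test sets in an 80/20 ratio before imputing, so that the fill value is computed from the training data only.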

In [8]:
train_data, test_data = df.randomSplit([.8,.2], seed=RANDOM_SEED)
print(train_data.count(), test_data.count()) 

                                                                                

16431 4209
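The counts are not exactly 16512 and 4128 because `randomSplit` assigns each row to a split at random rather than cutting at a fixed position, so the 80/20 ratio holds only approximately. The sketch below illustrates the idea in plain Python; `bernoulli_split` is a hypothetical helper written for this illustration, not Spark's implementation.

```python
import random

def bernoulli_split(rows, train_weight=0.8, seed=12345):
    """Assign each row to train or test with an independent uniform draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in train with probability train_weight.
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(20640))
train, test = bernoulli_split(rows)
# The sizes are close to, but usually not exactly, 16512 and 4128.
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which is what `seed=RANDOM_SEED` does in the cell above.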


Use Imputer to fill in the missing values.

In [9]:
imputer = Imputer(
    strategy="mean",
    inputCol='total_bedrooms', outputCol="total_bedrooms",)
model = imputer.fit(train_data)
train_data = model.transform(train_data)
test_data = model.transform(test_data)
train_data.limit(5).toPandas()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-124.35,40.54,52.0,1820.0,300.0,806.0,270.0,3.0147,94600.0,NEAR OCEAN
1,-124.3,41.8,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,NEAR OCEAN
2,-124.3,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,NEAR OCEAN
3,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,NEAR OCEAN
4,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0,NEAR OCEAN


The gaps were filled with the mean; with the median the metrics were slightly worse.
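The fit-then-transform pattern used with the imputer can be sketched in plain Python as follows. The helper names `fit_mean` and `fill` and the sample values are illustrative assumptions; the point is that the fill value is learned from the training split and reused unchanged for the test split.

```python
from statistics import mean

def fit_mean(values):
    """Return the mean of the non-missing values (None marks a gap)."""
    return mean(v for v in values if v is not None)

def fill(values, fill_value):
    """Replace every None with the fitted fill value."""
    return [fill_value if v is None else v for v in values]

# Illustrative values only.
train_bedrooms = [300.0, None, 531.0, 528.0]
test_bedrooms = [None, 394.0]

fill_value = fit_mean(train_bedrooms)          # statistic comes from train only
train_filled = fill(train_bedrooms, fill_value)
test_filled = fill(test_bedrooms, fill_value)  # test reuses the train statistic
print(fill_value, train_filled, test_filled)
```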

In [10]:
df.describe().toPandas()

                                                                                

Unnamed: 0,summary,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,count,20640.0,20640.0,20640.0,20640.0,20433.0,20640.0,20640.0,20640.0,20640.0,20640
1,mean,-119.56970445736148,35.6318614341087,28.639486434108527,2635.7630813953488,537.8705525375618,1425.4767441860463,499.5396802325581,3.8706710029070246,206855.81690891477,
2,stddev,2.003531723502584,2.135952397457101,12.58555761211163,2181.6152515827944,421.3850700740312,1132.46212176534,382.3297528316098,1.899821717945263,115395.6158744136,
3,min,-124.35,32.54,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0,<1H OCEAN
4,max,-114.31,41.95,52.0,39320.0,6445.0,35682.0,6082.0,15.0001,500001.0,NEAR OCEAN


Define the lists of feature column names.

In [11]:
numerical_cols = train_data.drop('ocean_proximity','median_house_value').columns
categorical_cols = 'ocean_proximity'
target = 'median_house_value'

In [12]:
numerical_cols

['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income']

Convert the categorical feature to indices so that the model can use it.

In [13]:
indexer = StringIndexer(inputCol="ocean_proximity",
                        outputCol="ocean_proximity_idx")

# Fit on the training data only, then apply the same mapping to both splits
indexer_model = indexer.fit(train_data)
train_data = indexer_model.transform(train_data)
test_data = indexer_model.transform(test_data)


                                                                                

In [14]:
train_data.select('ocean_proximity_idx').show(5)

+-------------------+
|ocean_proximity_idx|
+-------------------+
|                2.0|
|                2.0|
|                2.0|
|                2.0|
|                2.0|
+-------------------+
only showing top 5 rows



In [15]:
test_data.select('ocean_proximity_idx').show(5)

+-------------------+
|ocean_proximity_idx|
+-------------------+
|                2.0|
|                2.0|
|                2.0|
|                2.0|
|                2.0|
+-------------------+
only showing top 5 rows
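An indexer fitted separately on each split can assign different indices to the same category, so the mapping is best learned once on the training data. The sketch below assumes frequency-descending ordering (the `StringIndexer` default) with alphabetical tie-breaking; `fit_index` and the label lists are hypothetical and only illustrate the effect.

```python
from collections import Counter

def fit_index(labels):
    """Map each label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; ties are broken alphabetically here.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Illustrative labels only.
train_labels = ["INLAND", "<1H OCEAN", "<1H OCEAN", "NEAR OCEAN", "INLAND", "<1H OCEAN"]
test_labels = ["NEAR OCEAN", "NEAR OCEAN", "INLAND"]

mapping = fit_index(train_labels)             # fitted on train only
test_idx = [mapping[label] for label in test_labels]
refit_on_test = fit_index(test_labels)        # what a separate fit would give

print(mapping["NEAR OCEAN"], refit_on_test["NEAR OCEAN"])
```

In this toy example the same category gets index 2.0 from the train-fitted mapping but 0.0 from a mapping refitted on the test labels, which would silently change the meaning of the encoded feature.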



Then apply OHE to the indexed categorical feature.

In [16]:
encoder = OneHotEncoder(inputCol='ocean_proximity_idx',
                        outputCol='ocean_proximity_idx_ohe')
# Fit on the training data only and reuse the fitted encoder for the test data
encoder_model = encoder.fit(train_data)
train_data = encoder_model.transform(train_data)
test_data = encoder_model.transform(test_data)

train_data.select('ocean_proximity_idx_ohe').show(3)

+-----------------------+
|ocean_proximity_idx_ohe|
+-----------------------+
|          (4,[2],[1.0])|
|          (4,[2],[1.0])|
|          (4,[2],[1.0])|
+-----------------------+
only showing top 3 rows
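The value `(4,[2],[1.0])` is Spark's sparse vector notation: a vector of size 4 with 1.0 at position 2. A size of 4 is consistent with five categories, because `OneHotEncoder` drops the last category by default (`dropLast=True`) and represents it as all zeros. A minimal plain-Python sketch of that encoding, with `one_hot_drop_last` as a hypothetical helper:

```python
def one_hot_drop_last(index, num_categories):
    """Encode an index as a vector of length num_categories - 1.

    The last category is represented by the all-zeros vector.
    """
    vector = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vector[index] = 1.0
    return vector

print(one_hot_drop_last(2, 5))  # [0.0, 0.0, 1.0, 0.0]
print(one_hot_drop_last(4, 5))  # [0.0, 0.0, 0.0, 0.0]
```

Dropping one category avoids a redundant column: with an intercept in the model, the full set of indicator columns would always sum to 1.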



Create a new column named categorical_features.

In [17]:
categorical_assembler = VectorAssembler(inputCols=['ocean_proximity_idx_ohe'],
                                        outputCol="categorical_features")
train_data = categorical_assembler.transform(train_data)
test_data = categorical_assembler.transform(test_data)

In [18]:
train_data.limit(5).toPandas()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,ocean_proximity_idx,ocean_proximity_idx_ohe,categorical_features
0,-124.35,40.54,52.0,1820.0,300.0,806.0,270.0,3.0147,94600.0,NEAR OCEAN,2.0,"(0.0, 0.0, 1.0, 0.0)","(0.0, 0.0, 1.0, 0.0)"
1,-124.3,41.8,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,NEAR OCEAN,2.0,"(0.0, 0.0, 1.0, 0.0)","(0.0, 0.0, 1.0, 0.0)"
2,-124.3,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,NEAR OCEAN,2.0,"(0.0, 0.0, 1.0, 0.0)","(0.0, 0.0, 1.0, 0.0)"
3,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,NEAR OCEAN,2.0,"(0.0, 0.0, 1.0, 0.0)","(0.0, 0.0, 1.0, 0.0)"
4,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0,NEAR OCEAN,2.0,"(0.0, 0.0, 1.0, 0.0)","(0.0, 0.0, 1.0, 0.0)"


Assemble all numerical features into a single vector and then apply StandardScaler to it.

In [19]:
numerical_assembler = VectorAssembler(inputCols=numerical_cols, outputCol="numerical_features")
train_data = numerical_assembler.transform(train_data) 
test_data = numerical_assembler.transform(test_data) 

In [20]:
train_data.limit(5).toPandas()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,ocean_proximity_idx,ocean_proximity_idx_ohe,categorical_features,numerical_features
0,-124.35,40.54,52.0,1820.0,300.0,806.0,270.0,3.0147,94600.0,NEAR OCEAN,2.0,"(0.0, 0.0, 1.0, 0.0)","(0.0, 0.0, 1.0, 0.0)","[-124.35, 40.54, 52.0, 1820.0, 300.0, 806.0, 2..."
1,-124.3,41.8,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,NEAR OCEAN,2.0,"(0.0, 0.0, 1.0, 0.0)","(0.0, 0.0, 1.0, 0.0)","[-124.3, 41.8, 19.0, 2672.0, 552.0, 1298.0, 47..."
2,-124.3,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,NEAR OCEAN,2.0,"(0.0, 0.0, 1.0, 0.0)","(0.0, 0.0, 1.0, 0.0)","[-124.3, 41.84, 17.0, 2677.0, 531.0, 1244.0, 4..."
3,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,NEAR OCEAN,2.0,"(0.0, 0.0, 1.0, 0.0)","(0.0, 0.0, 1.0, 0.0)","[-124.27, 40.69, 36.0, 2349.0, 528.0, 1194.0, ..."
4,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0,NEAR OCEAN,2.0,"(0.0, 0.0, 1.0, 0.0)","(0.0, 0.0, 1.0, 0.0)","[-124.26, 40.58, 52.0, 2217.0, 394.0, 907.0, 3..."


In [21]:
standardScaler = StandardScaler(inputCol='numerical_features', outputCol="numerical_features_scaled")
# Fit the scaler on the training data only and reuse it for the test data
scaler_model = standardScaler.fit(train_data)
train_data = scaler_model.transform(train_data)
test_data = scaler_model.transform(test_data)


In [22]:
train_data.limit(5).toPandas()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,ocean_proximity_idx,ocean_proximity_idx_ohe,categorical_features,numerical_features,numerical_features_scaled
0,-124.35,40.54,52.0,1820.0,300.0,806.0,270.0,3.0147,94600.0,NEAR OCEAN,2.0,"(0.0, 0.0, 1.0, 0.0)","(0.0, 0.0, 1.0, 0.0)","[-124.35, 40.54, 52.0, 1820.0, 300.0, 806.0, 2...","[-62.00413245156368, 18.933996002402658, 4.129..."
1,-124.3,41.8,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,NEAR OCEAN,2.0,"(0.0, 0.0, 1.0, 0.0)","(0.0, 0.0, 1.0, 0.0)","[-124.3, 41.8, 19.0, 2672.0, 552.0, 1298.0, 47...","[-61.97920115584532, 19.522472444509894, 1.508..."
2,-124.3,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,NEAR OCEAN,2.0,"(0.0, 0.0, 1.0, 0.0)","(0.0, 0.0, 1.0, 0.0)","[-124.3, 41.84, 17.0, 2677.0, 531.0, 1244.0, 4...","[-61.97920115584532, 19.541154236322825, 1.350..."
3,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,NEAR OCEAN,2.0,"(0.0, 0.0, 1.0, 0.0)","(0.0, 0.0, 1.0, 0.0)","[-124.27, 40.69, 36.0, 2349.0, 528.0, 1194.0, ...","[-61.9642423784143, 19.00405272170114, 2.85903..."
4,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0,NEAR OCEAN,2.0,"(0.0, 0.0, 1.0, 0.0)","(0.0, 0.0, 1.0, 0.0)","[-124.26, 40.58, 52.0, 2217.0, 394.0, 907.0, 3...","[-61.959256119270634, 18.95267779421559, 4.129..."
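The scaled longitude is about -62 rather than a value near zero. That is consistent with `StandardScaler`'s defaults (`withStd=True`, `withMean=False`), which divide each feature by its standard deviation without subtracting the mean. A minimal sketch under that assumption; `scale_by_std` and the sample longitudes are illustrative, and only the last computation uses numbers from the output above.

```python
from statistics import stdev

def scale_by_std(train_values, values):
    """Divide values by the sample standard deviation of the training values."""
    std = stdev(train_values)  # sample (n - 1) standard deviation
    return [v / std for v in values]

# Illustrative longitudes only.
train_longitude = [-124.35, -122.23, -118.30, -117.10]
test_longitude = [-121.00]

scaled_train = scale_by_std(train_longitude, train_longitude)
scaled_test = scale_by_std(train_longitude, test_longitude)  # reuses the train std
print(scaled_train, scaled_test)

# Standard deviation implied by the first row of the notebook output:
# raw longitude divided by its scaled value.
implied_std = -124.35 / -62.00413245156368
print(round(implied_std, 4))
```

The implied value is close to the longitude standard deviation of about 2.0 reported by `describe()`, which supports the reading that the values were scaled but not centered.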


In [23]:
train_data.columns

['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income',
 'median_house_value',
 'ocean_proximity',
 'ocean_proximity_idx',
 'ocean_proximity_idx_ohe',
 'categorical_features',
 'numerical_features',
 'numerical_features_scaled']

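Combine everything into a single feature column (numerical and categorical features together). Note that the unscaled numerical_features column is used here, not numerical_features_scaled.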

In [24]:
all_features = ['numerical_features', 'categorical_features']

final_assembler = VectorAssembler(inputCols=all_features, 
                                  outputCol="features") 
train_data = final_assembler.transform(train_data)
test_data = final_assembler.transform(test_data)



Here only the numerical features are used.

In [25]:
final_assembler_num = VectorAssembler(inputCols=['numerical_features'], 
                                  outputCol="num_features") 
train_data_num = final_assembler_num.transform(train_data)
test_data_num = final_assembler_num.transform(test_data)

In [26]:
train_data.select(all_features).show(4)

+--------------------+--------------------+
|  numerical_features|categorical_features|
+--------------------+--------------------+
|[-124.35,40.54,52...|       (4,[2],[1.0])|
|[-124.3,41.8,19.0...|       (4,[2],[1.0])|
|[-124.3,41.84,17....|       (4,[2],[1.0])|
|[-124.27,40.69,36...|       (4,[2],[1.0])|
+--------------------+--------------------+
only showing top 4 rows



In [27]:
train_data_num.select('numerical_features').show(4)

+--------------------+
|  numerical_features|
+--------------------+
|[-124.35,40.54,52...|
|[-124.3,41.8,19.0...|
|[-124.3,41.84,17....|
|[-124.27,40.69,36...|
+--------------------+
only showing top 4 rows



In [28]:
train_data.limit(5).toPandas()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,ocean_proximity_idx,ocean_proximity_idx_ohe,categorical_features,numerical_features,numerical_features_scaled,features
0,-124.35,40.54,52.0,1820.0,300.0,806.0,270.0,3.0147,94600.0,NEAR OCEAN,2.0,"(0.0, 0.0, 1.0, 0.0)","(0.0, 0.0, 1.0, 0.0)","[-124.35, 40.54, 52.0, 1820.0, 300.0, 806.0, 2...","[-62.00413245156368, 18.933996002402658, 4.129...","[-124.35, 40.54, 52.0, 1820.0, 300.0, 806.0, 2..."
1,-124.3,41.8,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,NEAR OCEAN,2.0,"(0.0, 0.0, 1.0, 0.0)","(0.0, 0.0, 1.0, 0.0)","[-124.3, 41.8, 19.0, 2672.0, 552.0, 1298.0, 47...","[-61.97920115584532, 19.522472444509894, 1.508...","[-124.3, 41.8, 19.0, 2672.0, 552.0, 1298.0, 47..."
2,-124.3,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,NEAR OCEAN,2.0,"(0.0, 0.0, 1.0, 0.0)","(0.0, 0.0, 1.0, 0.0)","[-124.3, 41.84, 17.0, 2677.0, 531.0, 1244.0, 4...","[-61.97920115584532, 19.541154236322825, 1.350...","[-124.3, 41.84, 17.0, 2677.0, 531.0, 1244.0, 4..."
3,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,NEAR OCEAN,2.0,"(0.0, 0.0, 1.0, 0.0)","(0.0, 0.0, 1.0, 0.0)","[-124.27, 40.69, 36.0, 2349.0, 528.0, 1194.0, ...","[-61.9642423784143, 19.00405272170114, 2.85903...","[-124.27, 40.69, 36.0, 2349.0, 528.0, 1194.0, ..."
4,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0,NEAR OCEAN,2.0,"(0.0, 0.0, 1.0, 0.0)","(0.0, 0.0, 1.0, 0.0)","[-124.26, 40.58, 52.0, 2217.0, 394.0, 907.0, 3...","[-61.959256119270634, 18.95267779421559, 4.129...","[-124.26, 40.58, 52.0, 2217.0, 394.0, 907.0, 3..."


# Model training

In [29]:
features = train_data.select(['features', 'median_house_value'])
num_features = train_data_num.select(['num_features', 'median_house_value'])

In [30]:
features.show(5, truncate=100)

+---------------------------------------------------------------------+------------------+
|                                                             features|median_house_value|
+---------------------------------------------------------------------+------------------+
| [-124.35,40.54,52.0,1820.0,300.0,806.0,270.0,3.0147,0.0,0.0,1.0,0.0]|           94600.0|
|  [-124.3,41.8,19.0,2672.0,552.0,1298.0,478.0,1.9797,0.0,0.0,1.0,0.0]|           85800.0|
| [-124.3,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,0.0,0.0,1.0,0.0]|          103600.0|
|[-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,0.0,0.0,1.0,0.0]|           79000.0|
| [-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,0.0,0.0,1.0,0.0]|          111400.0|
+---------------------------------------------------------------------+------------------+
only showing top 5 rows



In [31]:
num_features.show(5, truncate=100)



Train two models: one on all features, the other without the categorical feature.

In [32]:
lr = LinearRegression(labelCol='median_house_value', featuresCol='features')

model = lr.fit(train_data) 

24/02/19 14:11:31 WARN Instrumentation: [1a753b5e] regParam is zero, which might cause numerical instability and overfitting.
24/02/19 14:11:32 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
24/02/19 14:11:32 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
24/02/19 14:11:32 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
24/02/19 14:11:32 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK
                                                                                

In [33]:
lr_num = LinearRegression(labelCol='median_house_value', featuresCol='num_features')

model_num = lr_num.fit(train_data_num) 

24/02/19 14:11:34 WARN Instrumentation: [eff4daab] regParam is zero, which might cause numerical instability and overfitting.


Show the models' predictions.

In [34]:
predictions = model.transform(test_data)
predictions_num =  model_num.transform(test_data_num)

predictedLabels = predictions.select("median_house_value", "prediction")
predictedLabels_num = predictions_num.select("median_house_value", "prediction")
predictedLabels.show()
predictedLabels_num.show()

+------------------+------------------+
|median_house_value|        prediction|
+------------------+------------------+
|          106700.0|216087.60925776418|
|          128900.0|206922.95425788546|
|          116100.0|233227.54237230122|
|           70500.0| 162780.8485760684|
|           85600.0|187478.17195710773|
|           75500.0|137446.33413939644|
|           79600.0| 160086.7610943555|
|           92800.0|208629.36113462178|
|           97300.0| 167099.6187994955|
|           82100.0|158695.64149460103|
|          126900.0|157473.01958837314|
|          119400.0| 166826.4396998654|
|           71300.0|168546.74359048577|
|           75600.0|148638.00878931116|
|           98800.0|184655.81815120578|
|           92600.0|150614.00529157603|
|          152700.0|141229.11361282645|
|          150000.0|156772.47377330298|
|           74000.0|158070.49044044083|
|           82400.0|163099.63733509835|
+------------------+------------------+
only showing top 20 rows

+-------------

Check the metrics: MAE, RMSE, R2.

In [35]:
results = predictions.select(['prediction', 'median_house_value'])

# Build an RDD of (prediction, label) pairs without collecting rows to the driver
scoreAndLabels = results.rdd.map(lambda row: (float(row[0]), float(row[1])))

metrics = RegressionMetrics(scoreAndLabels)
print("MAE score: ", metrics.meanAbsoluteError)
print("RMSE score: ", metrics.rootMeanSquaredError)
print("R2 score: ", metrics.r2)

                                                                                

MAE score:  49090.7514192648
RMSE score:  67712.601490065
R2 score:  0.6588164615851325


Results when missing values are filled with the median:

MAE score:  49828.73915337292

RMSE score:  68709.32557762177

R2 score:  0.6454530166046596

In [36]:
results_num = predictions_num.select(['prediction', 'median_house_value'])

# Build an RDD of (prediction, label) pairs without collecting rows to the driver
scoreAndLabelsNum = results_num.rdd.map(lambda row: (float(row[0]), float(row[1])))

metrics_num = RegressionMetrics(scoreAndLabelsNum)
print("MAE score: ", metrics_num.meanAbsoluteError)
print("RMSE score: ", metrics_num.rootMeanSquaredError)
print("R2 score: ", metrics_num.r2)

MAE score:  50295.51413988937
RMSE score:  68812.8941205236
R2 score:  0.6476382832035474


Results when missing values are filled with the median:

MAE score:  50922.857653104635

RMSE score:  69658.1903557703

R2 score:  0.6355929262797926
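To make the reported numbers easier to interpret, here is a minimal plain-Python sketch of how MAE, RMSE, and R2 are computed from true values and predictions. The function `regression_metrics` and the sample values are illustrative assumptions and are not taken from the notebook's predictions.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Return (mae, rmse, r2) for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2

# Illustrative values only.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]
mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(mae, rmse, r2)
```

RMSE penalizes large errors more heavily than MAE, which is why the RMSE values above (about 68 thousand) are noticeably larger than the MAE values (about 50 thousand).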


In [37]:
spark.stop()

# Analysis of results

The input was a dataset with California housing data. The dataset is small, about 20 thousand rows.
The goal was to train a model that predicts housing prices and to evaluate it with the MAE, RMSE, and R2 metrics.

Data exploration showed that almost no preprocessing was needed: only the 1% of missing values in total_bedrooms were filled with the training-set mean of the column (the median was also tried; the results were nearly identical, with the mean slightly better).

The categorical feature was transformed with OHE. The numerical features were also scaled with StandardScaler, although the models were trained on the unscaled numerical vector.

Training produced the following results:

With all features:

MAE score:  49090.75  
RMSE score:  67712.60  
R2 score:  0.6588  

Without the categorical feature:

MAE score:  50295.51  
RMSE score:  68812.89  
R2 score:  0.6476  


Final conclusion: the model performed best when all features were used and missing values were filled with the mean.
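To put the difference between the two models in perspective, the short calculation below expresses it in relative terms, using only the rounded metrics reported above. The dictionary names are illustrative.

```python
# Metrics reported in the notebook for the two models.
all_features = {"MAE": 49090.75, "RMSE": 67712.60, "R2": 0.6588}
numeric_only = {"MAE": 50295.51, "RMSE": 68812.89, "R2": 0.6476}

reductions = {}
for name in ("MAE", "RMSE"):
    # Relative error reduction from adding the categorical feature.
    reductions[name] = (numeric_only[name] - all_features[name]) / numeric_only[name]
    print(f"{name}: {reductions[name]:.1%} lower with all features")

r2_gain = all_features["R2"] - numeric_only["R2"]
print(f"R2: +{r2_gain:.4f}")
```

The gain from the categorical feature is real but modest: roughly 2.4% lower MAE, 1.6% lower RMSE, and about 0.011 higher R2.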