# Housing Price Prediction

In this project we train a linear regression model on 1990 California housing data. The goal is to predict the median house value in a residential block from the other columns. Model quality is assessed with the RMSE, MAE, and R2 metrics.
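As a reference for how the three metrics are defined, here is a minimal plain-Python sketch. The function name `regression_metrics` and the toy values are illustrative assumptions; the notebook itself computes the metrics with Spark's `RegressionEvaluator`.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    # RMSE: square root of the mean squared error
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAE: mean of the absolute errors
    mae = sum(abs(e) for e in errors) / n
    # R2: 1 minus the ratio of residual to total sum of squares
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

# Toy values, not taken from the housing data.
rmse, mae, r2 = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(rmse, mae, r2)
```

RMSE and MAE are expressed in the units of the target (here, dollars), and lower is better; R2 is unitless, and higher is better.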

## Data preparation

In [10]:
import pandas as pd 
import numpy as np

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F

from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.feature import OneHotEncoder 
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator


In [11]:
spark = SparkSession.builder \
                    .master("local") \
                    .appName("Learning DataFrame Window Functions") \
                    .getOrCreate()

df_housing = spark.read.load('/datasets/housing.csv', format="csv", inferSchema=True, header="true")
df_housing.printSchema()
df_housing.show()

root
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- housing_median_age: double (nullable = true)
 |-- total_rooms: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- population: double (nullable = true)
 |-- households: double (nullable = true)
 |-- median_income: double (nullable = true)
 |-- median_house_value: double (nullable = true)
 |-- ocean_proximity: string (nullable = true)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|  -122.23|   37.88|              41.0|      880.0|         129.0|     322.0|     126.0|       8.3252|          452600.0|       NEAR B

The columns fall into two types: numerical columns and text columns, the latter holding categorical data.

First, fill in the missing values:

In [14]:
# count missing values (NaN or null) in each column
df_housing.select([F.count(F.when(F.isnan(x) | F.col(x).isNull(), x)).alias(x) for x in df_housing.columns]).show()

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|        0|       0|                 0|          0|           207|         0|         0|            0|                 0|              0|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+



In [4]:
subset = ['longitude', 'latitude', 'housing_median_age','total_rooms','total_bedrooms','population','households', 'median_income', 'median_house_value']
df_housing = df_housing.na.fill(0, subset)

In [5]:
df_housing.printSchema()

root
 |-- longitude: double (nullable = false)
 |-- latitude: double (nullable = false)
 |-- housing_median_age: double (nullable = false)
 |-- total_rooms: double (nullable = false)
 |-- total_bedrooms: double (nullable = false)
 |-- population: double (nullable = false)
 |-- households: double (nullable = false)
 |-- median_income: double (nullable = false)
 |-- median_house_value: double (nullable = false)
 |-- ocean_proximity: string (nullable = false)
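The cell above replaces the 207 missing `total_bedrooms` values with 0. A bedroom count of zero is not a typical value for a block, so a common alternative is to fill with the median computed on the training rows only. The sketch below contrasts the two options in plain Python; it is a swapped-in illustration, not what the notebook does, and the helper `fill_missing` and the sample values are hypothetical.

```python
from statistics import median

def fill_missing(values, fill_value):
    """Replace None entries with fill_value, leaving other entries untouched."""
    return [fill_value if v is None else v for v in values]

# Hypothetical training values for total_bedrooms; None marks a missing entry.
train_bedrooms = [129.0, None, 190.0, 235.0, None, 280.0]

# Option used in the notebook: constant fill with 0.
fill_zero = fill_missing(train_bedrooms, 0.0)

# Alternative: median of the observed training values.
train_median = median(v for v in train_bedrooms if v is not None)
fill_median = fill_missing(train_bedrooms, train_median)

print(fill_zero)
print(fill_median)
```

If the median is used, it should be estimated after the train/test split and reused unchanged on the test rows, so that no test information leaks into preprocessing.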



Split the feature columns into numerical and categorical ones, and single out the target:

In [6]:
categorical_cols = ['ocean_proximity']
numerical_cols  = ['longitude', 'latitude', 'housing_median_age','total_rooms','total_bedrooms','population','households', 'median_income']
target = 'median_house_value' 

Split the dataset into training and test sets (80% and 20%, respectively):

In [7]:
train_data, test_data = df_housing.randomSplit([.8,.2], seed=2077)
print(train_data.count(), test_data.count()) 

                                                                                

16630 4010
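The resulting sizes are close to, but not exactly, 80/20 (16630 of 20640 rows is about 80.6%). This is expected if each row is assigned to a part independently at random, which, to my understanding, is how `randomSplit` behaves. The sketch below mimics that idea in plain Python; it does not reproduce Spark's random number generator, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = list(range(20640))  # same row count as 16630 + 4010 above
train, test = random_split(rows, 0.8, seed=2077)
print(len(train), len(test))
```

With the same seed the split is identical on every run, while the exact sizes fluctuate slightly around the requested proportions.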


Next, transform the categorical features with the StringIndexer transformer:

In [8]:
indexer = StringIndexer(inputCols=categorical_cols, 
                        outputCols=[c+'_idx' for c in categorical_cols]) 

ind_fit = indexer.fit(train_data)
train_data = ind_fit.transform(train_data)
test_data = ind_fit.transform(test_data)

# Columns derived from the categorical features (identical for train and test)
cols = [c for c in train_data.columns for i in categorical_cols if c.startswith(i)]

                                                                                

In [9]:
train_data.select(cols).show(3) 
test_data.select(cols).show(3) 

+---------------+-------------------+
|ocean_proximity|ocean_proximity_idx|
+---------------+-------------------+
|     NEAR OCEAN|                2.0|
|     NEAR OCEAN|                2.0|
|     NEAR OCEAN|                2.0|
+---------------+-------------------+
only showing top 3 rows

+---------------+-------------------+
|ocean_proximity|ocean_proximity_idx|
+---------------+-------------------+
|     NEAR OCEAN|                2.0|
|     NEAR OCEAN|                2.0|
|     NEAR OCEAN|                2.0|
+---------------+-------------------+
only showing top 3 rows
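By default StringIndexer orders labels by descending frequency, so the most frequent category gets index 0.0. The plain-Python sketch below illustrates that rule. The label counts are made up, the category names other than NEAR OCEAN are assumptions about the dataset, and `fit_string_index` is a hypothetical helper, not a Spark API.

```python
from collections import Counter

def fit_string_index(labels):
    """Map each label to a float index, most frequent label first.

    Ties are broken alphabetically to keep the mapping deterministic.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical training labels with made-up frequencies.
train_labels = (["<1H OCEAN"] * 4 + ["INLAND"] * 3
                + ["NEAR OCEAN"] * 2 + ["NEAR BAY"])
mapping = fit_string_index(train_labels)

# The mapping learned on the training data is reused for the test data.
test_indices = [mapping[label] for label in ["NEAR OCEAN", "INLAND"]]
print(mapping)
print(test_indices)
```

Fitting on the training set and applying the same mapping to the test set, as the notebook does, keeps the indices consistent between the two.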



Now encode the indexed categorical column with one-hot encoding (OHE):

In [10]:
encoder = OneHotEncoder(inputCols=[c+'_idx' for c in categorical_cols],
                        outputCols=[c+'_ohe' for c in categorical_cols])

enc_fit = encoder.fit(train_data)
train_data = enc_fit.transform(train_data)
test_data = enc_fit.transform(test_data)

# Columns derived from the categorical features (identical for train and test)
cols = [c for c in train_data.columns for i in categorical_cols if c.startswith(i)]

In [11]:
train_data.select(cols).show(3) 
test_data.select(cols).show(3) 

+---------------+-------------------+-------------------+
|ocean_proximity|ocean_proximity_idx|ocean_proximity_ohe|
+---------------+-------------------+-------------------+
|     NEAR OCEAN|                2.0|      (4,[2],[1.0])|
|     NEAR OCEAN|                2.0|      (4,[2],[1.0])|
|     NEAR OCEAN|                2.0|      (4,[2],[1.0])|
+---------------+-------------------+-------------------+
only showing top 3 rows

+---------------+-------------------+-------------------+
|ocean_proximity|ocean_proximity_idx|ocean_proximity_ohe|
+---------------+-------------------+-------------------+
|     NEAR OCEAN|                2.0|      (4,[2],[1.0])|
|     NEAR OCEAN|                2.0|      (4,[2],[1.0])|
|     NEAR OCEAN|                2.0|      (4,[2],[1.0])|
+---------------+-------------------+-------------------+
only showing top 3 rows
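The value `(4,[2],[1.0])` is Spark's sparse vector notation: a vector of size 4 whose only non-zero entry is 1.0 at position 2. A size of 4 is consistent with five categories and OneHotEncoder's default `dropLast=True`, which encodes the last category as an all-zero vector. The dense plain-Python sketch below illustrates this; `one_hot` is a hypothetical helper, and the assumption of five categories is inferred from the vector size.

```python
def one_hot(index, n_categories, drop_last=True):
    """Dense one-hot vector; with drop_last the final category is all zeros."""
    size = n_categories - 1 if drop_last else n_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector

near_ocean = one_hot(2, 5)     # index 2.0 from the table above
last_category = one_hot(4, 5)  # the dropped (last) category
print(near_ocean)
print(last_category)
```

Dropping one category avoids a redundant column: with an intercept in the model, the five indicator columns would always sum to one.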



Next, assemble the encoded categorical features into a single vector:

In [12]:
categorical_assembler = \
        VectorAssembler(inputCols=[c+'_ohe' for c in categorical_cols], outputCol="categorical_features")


train_data = categorical_assembler.transform(train_data)
test_data = categorical_assembler.transform(test_data) 

Next, scale the numerical features so that columns measured on very different scales contribute comparably to the model:

In [13]:
numerical_assembler = VectorAssembler(inputCols=numerical_cols, outputCol="numerical_features")

train_data = numerical_assembler.transform(train_data)
test_data = numerical_assembler.transform(test_data)


standardScaler = StandardScaler(inputCol='numerical_features', outputCol="numerical_features_scaled")

scal_fit = standardScaler.fit(train_data)
train_data = scal_fit.transform(train_data) 
test_data = scal_fit.transform(test_data) 
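With its default settings (`withStd=True`, `withMean=False`) StandardScaler divides each feature by its standard deviation without subtracting the mean. That is why scaled longitudes can remain large negative numbers instead of being centered around zero. A plain-Python sketch of this behavior follows; `scale_by_std` is a hypothetical helper and the longitude values are illustrative.

```python
from statistics import mean, stdev

def scale_by_std(values):
    """Divide every value by the sample standard deviation (no centering)."""
    s = stdev(values)
    return [v / s for v in values]

# Illustrative longitudes; only the scaling rule matters here.
longitudes = [-122.23, -122.22, -118.30, -117.05]
scaled = scale_by_std(longitudes)

print(scaled)
print(stdev(scaled))  # unit standard deviation after scaling
print(mean(scaled))   # the mean is still far from zero
```

Setting `withMean=True` on the scaler would additionally center each feature; for an ordinary linear regression with an intercept this changes the coefficients' interpretation but not the fitted predictions.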

Check the resulting columns:

In [14]:
print(train_data.columns)
print(test_data.columns)

['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'median_house_value', 'ocean_proximity', 'ocean_proximity_idx', 'ocean_proximity_ohe', 'categorical_features', 'numerical_features', 'numerical_features_scaled']
['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'median_house_value', 'ocean_proximity', 'ocean_proximity_idx', 'ocean_proximity_ohe', 'categorical_features', 'numerical_features', 'numerical_features_scaled']


The final step is to combine the transformed categorical and numerical features with VectorAssembler:

In [15]:
all_features = ['categorical_features','numerical_features_scaled']

final_assembler = VectorAssembler(inputCols=all_features, 
                                  outputCol="features") 


train_data = final_assembler.transform(train_data)
test_data = final_assembler.transform(test_data)

In [16]:
train_data.select(all_features).show(3) 

+--------------------+-------------------------+
|categorical_features|numerical_features_scaled|
+--------------------+-------------------------+
|       (4,[2],[1.0])|     [-61.978011544806...|
|       (4,[2],[1.0])|     [-61.953090752065...|
|       (4,[2],[1.0])|     [-61.953090752065...|
+--------------------+-------------------------+
only showing top 3 rows



## Model training

### Model using all features in the file

To build the model we use the LinearRegression estimator from the MLlib library:

In [17]:
lr = LinearRegression(labelCol=target, featuresCol='features', regParam=0.000000001)

model = lr.fit(train_data) 

22/10/28 18:14:54 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
22/10/28 18:14:54 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
22/10/28 18:14:54 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
22/10/28 18:14:54 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK
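With `regParam` as small as 1e-9 the penalty is negligible, so the fit is effectively ordinary least squares. For intuition, here is a plain-Python sketch of least squares with a single feature. It is not how Spark solves the problem internally; `fit_simple_ols` and the toy data are illustrative assumptions.

```python
def fit_simple_ols(x, y):
    """Return (slope, intercept) minimizing squared error for one feature."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data lying exactly on the line y = 2x + 1.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_ols(x, y)
print(slope, intercept)
```

The multi-feature model in this notebook follows the same principle, with one coefficient per entry of the `features` vector plus an intercept.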
                                                                                

In [18]:
predictions = model.transform(test_data)

predictedLabels = predictions.select(target, 'prediction')

Compute the RMSE, MAE, and R2 metrics for the model:

In [19]:
rmse_full = RegressionEvaluator(metricName="rmse", labelCol="median_house_value").evaluate(predictions)
mae_full = RegressionEvaluator(metricName="mae", labelCol="median_house_value", predictionCol="prediction").evaluate(predictions)
r2_full = RegressionEvaluator(metricName="r2", labelCol="median_house_value", predictionCol="prediction").evaluate(predictions)

### Model using numerical features only

In [20]:
lr = LinearRegression(labelCol=target, featuresCol='numerical_features_scaled', regParam=0.000000001)

model = lr.fit(train_data) 

In [21]:
predictions = model.transform(test_data)

predictedLabels = predictions.select(target, 'prediction')
predictedLabels.show(5)

+------------------+------------------+
|median_house_value|        prediction|
+------------------+------------------+
|          106700.0|192376.12662631925|
|           78300.0| 78809.67054378614|
|           68400.0| 80417.21510817297|
|           70000.0|116597.57348956773|
|           82800.0| 140495.2337529566|
+------------------+------------------+
only showing top 5 rows



Compute the RMSE, MAE, and R2 metrics for the model:

In [22]:
rmse_num = RegressionEvaluator(metricName="rmse", labelCol="median_house_value").evaluate(predictions)
mae_num = RegressionEvaluator(metricName="mae", labelCol="median_house_value", predictionCol="prediction").evaluate(predictions)
r2_num = RegressionEvaluator(metricName="r2", labelCol="median_house_value", predictionCol="prediction").evaluate(predictions)

spark.stop()

Compare the resulting metrics:

In [23]:
print('      full_data', '          num_data')
print('RMSE:', rmse_full, '  ', rmse_num)
print('MAE: ', mae_full, ' ', mae_num)
print('R2:  ', r2_full, '', r2_num)

      full_data           num_data
RMSE: 68233.7948917424    69309.6452146553
MAE:  49637.20150082431   50895.60604758911
R2:   0.6406215146432721  0.6291994461671384


## Analysis of results

The linear regression model predicts the target more accurately when it uses all features, both numerical and categorical. This shows in the lower RMSE (68,233.79 vs. 69,309.65) and MAE (49,637.20 vs. 50,895.61) and in the higher R2 (0.641 vs. 0.629) compared with the model trained on numerical features only. The gap is modest, however: adding the categorical feature reduces RMSE by about 1.6%.
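To put the size of the gap in perspective, the relative differences can be computed directly from the values printed in the comparison above. This is a small sketch; the helper `relative_reduction` is a hypothetical name.

```python
# Metric values copied from the comparison output: (full_data, num_data)
metrics = {
    "RMSE": (68233.7948917424, 69309.6452146553),
    "MAE": (49637.20150082431, 50895.60604758911),
}

def relative_reduction(full, num):
    """Fraction by which the full-feature model lowers an error metric."""
    return (num - full) / num

reductions = {name: relative_reduction(full, num)
              for name, (full, num) in metrics.items()}
for name, value in reductions.items():
    print(f"{name} reduction: {value:.2%}")

# Absolute gain in R2 from adding the categorical feature
r2_gain = 0.6406215146432721 - 0.6291994461671384
print(f"R2 gain: {r2_gain:.4f}")
```

Both error metrics improve by only a few percent, so ocean proximity helps, but most of the explained variance already comes from the numerical features.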