## California Housing Prices
Median house prices for California districts derived from the 1990 census.

### About Dataset
#### Context
This is the dataset used in the second chapter of Aurélien Géron's recent book 'Hands-On Machine Learning with Scikit-Learn and TensorFlow'. It serves as an excellent introduction to implementing machine learning algorithms because it requires only rudimentary data cleaning, has an easily understandable list of variables, and sits at a comfortable size between being too toyish and too cumbersome.

The data contains information from the 1990 California census. So although it may not help you with predicting current housing prices like the Zillow Zestimate dataset, it does provide an accessible introductory dataset for teaching people about the basics of machine learning.

### Content
The data pertain to the houses found in a given California district, along with some summary statistics about them based on the 1990 census data. Be warned that the data aren't cleaned, so some preprocessing steps are required! The columns are as follows; their names are fairly self-explanatory:

- longitude
- latitude
- housing_median_age
- total_rooms
- total_bedrooms
- population
- households
- median_income
- median_house_value
- ocean_proximity

### Acknowledgements
This data was initially featured in the following paper:
Pace, R. Kelley, and Ronald Barry. "Sparse spatial autoregressions." Statistics & Probability Letters 33.3 (1997): 291-297.

and I encountered it in 'Hands-On Machine Learning with Scikit-Learn and TensorFlow' by Aurélien Géron.
Aurélien Géron wrote:
This dataset is a modified version of the California Housing dataset available from:
Luís Torgo's page (University of Porto)

In [1]:
import kagglehub
import shutil
from pathlib import Path
import os

# Download latest version
path = kagglehub.dataset_download("camnugent/california-housing-prices")

print("Path to dataset files:", path)

fullpath = os.path.join(path, os.listdir(path)[0])

# Now let's copy it to the project's folder
destination_folder = Path('data')

try:
    destination_folder.mkdir(parents=True, exist_ok=True)
    shutil.copy(fullpath, destination_folder)
    print("Successfully copied file")

except Exception as e:
    print(e)



Path to dataset files: C:\Users\Gabriel\.cache\kagglehub\datasets\camnugent\california-housing-prices\versions\1
Successfully copied file


In [2]:
import os 
import sys

# Set the Python executable for PySpark workers to match the current one
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import socket


try:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(0)
    s.connect(('8.8.8.8', 80))
    ip = s.getsockname()[0]
except Exception:
    ip = "127.0.0.1"
finally:
    s.close()

print(f"Using IP address: {ip}")
spark = (
    SparkSession.builder.appName('California_Housing_Prediction')
    .master('local[4]')
    .config("spark.driver.host", "localhost")
    .config("spark.driver.bindAddress", ip)
    .config("spark.driver.memory", '4g')
    .config("spark.python.worker.faulthandler.enabled", "true")
    .getOrCreate()
)
spark.sparkContext.setLogLevel('ERROR')

Using IP address:  192.168.15.7


Py4JError: An error occurred while calling None.org.apache.spark.sql.SparkSession. Trace:
py4j.Py4JException: Constructor org.apache.spark.sql.SparkSession([class org.apache.spark.SparkContext, class java.util.HashMap]) does not exist
	at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:180)
	at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:197)
	at py4j.Gateway.invoke(Gateway.java:237)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:184)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:108)
	at java.base/java.lang.Thread.run(Thread.java:842)



In [None]:
sys.executable

'c:\\Users\\Gabriel\\envs\\pyspark\\Scripts\\python.exe'

In [None]:
df = spark.read.csv('./data/housing.csv', header=True, inferSchema=True)
df.show(5)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|  -122.23|   37.88|              41.0|      880.0|         129.0|     322.0|     126.0|       8.3252|          452600.0|       NEAR BAY|
|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|       NEAR BAY|
|  -122.24|   37.85|              52.0|     1467.0|         190.0|     496.0|     177.0|       7.2574|          352100.0|       NEAR BAY|
|  -122.25|   37.85|              52.0|     1274.0|         235.0|     558.0|     219.0|       5.6431|          341300.0|       NEAR BAY|
|  -122.25|   37.85|              

In [None]:
df.printSchema()

root
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- housing_median_age: double (nullable = true)
 |-- total_rooms: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- population: double (nullable = true)
 |-- households: double (nullable = true)
 |-- median_income: double (nullable = true)
 |-- median_house_value: double (nullable = true)
 |-- ocean_proximity: string (nullable = true)



In [None]:
print('Total rows:',df.count())

Total rows: 20640


In [None]:
# Let's create a unique id column
df = df.withColumn('id', F.monotonically_increasing_id())
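
The ids produced by `monotonically_increasing_id` are unique and increasing, but not consecutive. According to the PySpark documentation, the current implementation puts the partition index in the upper 31 bits and the record number within each partition in the lower 33 bits. Here is a minimal plain-Python sketch of that layout; the helper name `sketch_monotonic_id` is hypothetical and is not part of Spark:

```python
def sketch_monotonic_id(partition_index: int, row_in_partition: int) -> int:
    """Mimic the documented bit layout: partition index above the lower 33 bits."""
    return (partition_index << 33) + row_in_partition


# First three rows of partition 0, then the first two rows of partition 1
ids = [sketch_monotonic_id(0, r) for r in range(3)]
ids += [sketch_monotonic_id(1, r) for r in range(2)]
print(ids)  # [0, 1, 2, 8589934592, 8589934593]
```

The ids therefore work as a join key, which is how they are used later in this notebook, but they should not be read as row positions.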

In [None]:
# Check Nulls in the dataframe
df.select(
    [F.count(F.when(F.col(col).isNull(), 1)).alias(col) for col in df.columns]
).show()

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+---+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity| id|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+---+
|        0|       0|                 0|          0|           207|         0|         0|            0|                 0|              0|  0|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+---+



In [None]:
# Check categorical column uniques
df.select(F.count_distinct(F.col('ocean_proximity'))).show()
df.groupBy("ocean_proximity").count().show()

+-------------------------------+
|count(DISTINCT ocean_proximity)|
+-------------------------------+
|                              5|
+-------------------------------+

+---------------+-----+
|ocean_proximity|count|
+---------------+-----+
|         ISLAND|    5|
|     NEAR OCEAN| 2658|
|       NEAR BAY| 2290|
|      <1H OCEAN| 9136|
|         INLAND| 6551|
+---------------+-----+



#### To handle the missing values for the 'total_bedrooms' column, I'm going to train a Linear Regressor

In [None]:
# Separate the feature types
numerical_features = []
categorical_features = []
for name, dtype in df.dtypes:  # avoid shadowing the built-in `type`
    if dtype in ['int', 'float', 'double']:
        numerical_features.append(name)
    elif dtype == 'string':
        categorical_features.append(name)

categorical_features_indexed = [col + '_ind' for col in categorical_features]

In [None]:
# We could actually fill the Null values with a linear regression model
df_train = df.where(F.col('total_bedrooms').isNotNull())
df_test = df.where(F.col('total_bedrooms').isNull())

# Sanity check: df_train should contain no nulls, df_test only the null rows
df_train.select(
    F.count(F.when(F.col('total_bedrooms').isNull(), 1)).alias('df_train Nulls')
).show()
df_test.select(
    F.count(F.when(F.col('total_bedrooms').isNull(), 1)).alias('df_test Nulls')
).show()

+--------------+
|df_train Nulls|
+--------------+
|             0|
+--------------+

+-------------+
|df_test Nulls|
+-------------+
|          207|
+-------------+



In [None]:
from pyspark.ml.feature import StringIndexer, VectorAssembler, MinMaxScaler
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.regression import LinearRegression

numerical_features = [col for col in numerical_features if col != 'total_bedrooms']

# Preprocess Features:
si = StringIndexer(
    inputCols=categorical_features,
    outputCols=categorical_features_indexed,
    handleInvalid='skip'
)
va = VectorAssembler(
    inputCols=numerical_features + categorical_features_indexed,
    outputCol='feature_vector',
    handleInvalid='skip'
)
scaler = MinMaxScaler(
    inputCol='feature_vector',
    outputCol='scaled_feature_vector'
)

lr = LinearRegression(
    featuresCol='scaled_feature_vector',
    labelCol='total_bedrooms',
)

pipeline = Pipeline(
    stages = [si,va,scaler,lr]
)
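
To make two of the preprocessing stages less of a black box, here is a rough plain-Python sketch of what they compute. By default `StringIndexer` assigns indices by descending label frequency, and `MinMaxScaler` rescales each feature linearly into the range [0, 1]. The helper names `index_by_frequency` and `min_max_scale` are hypothetical, the category counts are the ones printed earlier for the full dataset, and the three values passed to `min_max_scale` are made up for illustration:

```python
from collections import Counter


def index_by_frequency(labels):
    """Map each label to a float index, most frequent first (ties broken alphabetically)."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # No spread to rescale; this sketch returns the midpoint of [0, 1]
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


category_counts = {
    "ISLAND": 5, "NEAR OCEAN": 2658, "NEAR BAY": 2290,
    "<1H OCEAN": 9136, "INLAND": 6551,
}
labels = [label for label, n in category_counts.items() for _ in range(n)]
label_index = index_by_frequency(labels)
print(label_index)
# {'<1H OCEAN': 0.0, 'INLAND': 1.0, 'NEAR OCEAN': 2.0, 'NEAR BAY': 3.0, 'ISLAND': 4.0}

print(min_max_scale([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

The index 3.0 for `NEAR BAY` is consistent with the `ocean_proximity_ind` value shown for that category in the prediction output of this notebook.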

In [None]:
pl_fit = pipeline.fit(df_train)
test_preds = pl_fit.transform(df_test)

In [None]:
test_preds.show()

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+----+-------------------+--------------------+---------------------+------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|  id|ocean_proximity_ind|      feature_vector|scaled_feature_vector|        prediction|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+----+-------------------+--------------------+---------------------+------------------+
|  -122.16|   37.77|              47.0|     1256.0|          NULL|     570.0|     218.0|        4.375|          161900.0|       NEAR BAY| 290|                3.0|[-122.16,37.77,47...| [0.21812749003984...|210.09994196212418|
|  -122.17|   37.75|              38.0|      992.0|          NULL|     732.0|     259.0|       1.619

Now that we have the predicted total_bedrooms, we can use those predictions to complete the original dataframe

In [None]:
predictions = test_preds.select(['id','prediction'])

df_with_preds = df.join(predictions, on='id', how='left')

df_filled = df_with_preds.withColumn(
    "total_bedrooms",
    F.coalesce(F.col('total_bedrooms'), F.col('prediction'))
)

df_final = df_filled.drop('prediction')

# Missing values are nulls, so count them with isNull (isNaN only matches NaN floats)
df_final.select(
    F.count(F.when(F.col('total_bedrooms').isNull(),1)).alias('Null Count for total_bedrooms')
).show()

+-----------------------------+
|Null Count for total_bedrooms|
+-----------------------------+
|                            0|
+-----------------------------+
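
The fill-in logic of that cell (left join on `id`, then keep the observed value and fall back to the prediction) can be illustrated without Spark. This is only a toy sketch: the `coalesce` helper is a hypothetical stand-in for `F.coalesce`, and the ids and values are made up:

```python
def coalesce(*values):
    """Return the first value that is not None, or None if every value is missing."""
    for value in values:
        if value is not None:
            return value
    return None


# Hypothetical toy rows: id -> observed total_bedrooms (None marks a missing value)
observed = {0: 129.0, 1: None, 2: 190.0}
# Hypothetical model output, present only for the ids that were missing
predicted = {1: 210.1}

# Left join on id: rows without a prediction simply get None from .get()
filled = {
    row_id: coalesce(value, predicted.get(row_id))
    for row_id, value in observed.items()
}
print(filled)  # {0: 129.0, 1: 210.1, 2: 190.0}

remaining_nulls = sum(value is None for value in filled.values())
print(remaining_nulls)  # 0
```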



Now it's time to train the final model with all the features. For that I'm going to:
1. Divide the dataset into train and test following a stratified sampling technique for the categorical column
2. Train a baseline model and a more advanced model
3. Compare the models and keep the better one
4. Perform some hyperparameter tuning on the improved model
5. Train the final tuned model with the whole dataset
6. Present to Kaggle.

In [None]:
categories = df_final.select('ocean_proximity').distinct().collect()
category_fractions = {category.ocean_proximity : 0.7 for category in categories}

train_df = df_final.sampleBy('ocean_proximity', category_fractions, seed=42)
test_df = df_final.join(train_df, on='id', how='left_anti')

In [None]:
train_df.groupBy('ocean_proximity').count().show()
test_df.groupBy('ocean_proximity').count().show()

+---------------+-----+
|ocean_proximity|count|
+---------------+-----+
|         ISLAND|    5|
|     NEAR OCEAN| 1808|
|       NEAR BAY| 1628|
|      <1H OCEAN| 6466|
|         INLAND| 4602|
+---------------+-----+

+---------------+-----+
|ocean_proximity|count|
+---------------+-----+
|     NEAR OCEAN|  850|
|       NEAR BAY|  662|
|      <1H OCEAN| 2670|
|         INLAND| 1949|
+---------------+-----+
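
The split is only approximately 70/30 per category, and ISLAND ended up entirely in the training set. As far as I understand, `sampleBy` keeps each row independently with the probability given for its stratum, so exact proportions are not guaranteed and a five-row category can easily fall on one side. Below is a small plain-Python sketch of that behaviour under this assumption; the function `sample_by` and the toy rows are hypothetical:

```python
import random
from collections import Counter


def sample_by(rows, key, fractions, seed):
    """Keep each row independently with the probability assigned to its stratum."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions[row[key]]]


# Hypothetical toy data: 1000 'INLAND' rows and 5 'ISLAND' rows
rows = [{"id": i, "ocean_proximity": "INLAND"} for i in range(1000)]
rows += [{"id": 1000 + i, "ocean_proximity": "ISLAND"} for i in range(5)]

fractions = {"INLAND": 0.7, "ISLAND": 0.7}
train = sample_by(rows, "ocean_proximity", fractions, seed=42)

# Anti-join: every row whose id was not sampled goes to the test set
train_ids = {row["id"] for row in train}
test = [row for row in rows if row["id"] not in train_ids]

print(Counter(row["ocean_proximity"] for row in train))
print(Counter(row["ocean_proximity"] for row in test))
```

If a rare category must appear in both splits, one option is to split that category separately with an exact count instead of relying on random draws.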



In [None]:
from pyspark.ml.feature import StringIndexer, VectorAssembler, MinMaxScaler
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.regression import LinearRegression
from xgboost.spark import SparkXGBRegressor

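# Rebuild the numeric feature list from the schema: the imputation step removed
# 'total_bedrooms' from numerical_features, and the final model should use it again
numerical_features = [
    name for name, dtype in df_final.dtypes
    if dtype in ('int', 'float', 'double') and name != 'median_house_value'
]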

# Preprocess Features:
si = StringIndexer(
    inputCols=categorical_features,
    outputCols=categorical_features_indexed,
    handleInvalid='skip'
)
va = VectorAssembler(
    inputCols=numerical_features + categorical_features_indexed,
    outputCol='feature_vector',
    handleInvalid='skip'
)
scaler = MinMaxScaler(
    inputCol='feature_vector',
    outputCol='scaled_feature_vector'
)
# === Linear Regression ===
lr = LinearRegression(
    featuresCol='scaled_feature_vector',
    labelCol='median_house_value',
)

lr_pipeline = Pipeline(
    stages = [si,va,scaler,lr]
)

# === XGB Regression ===
xgb = SparkXGBRegressor(
    features_col='scaled_feature_vector',
    label_col='median_house_value',
)

xgb_pipeline = Pipeline(
    stages = [si,va,scaler,xgb]
)

In [None]:
# fitted_lr_pl = lr_pipeline.fit(train_df)
# lr_prediction_df = fitted_lr_pl.transform(test_df)

fitted_xgb_pl = xgb_pipeline.fit(train_df)
xgb_prediction_df = fitted_xgb_pl.transform(test_df)

2025-09-27 20:58:51,552 INFO XGBoost-PySpark: _fit Running xgboost-3.0.5 on 1 workers with
	booster params: {'objective': 'reg:squarederror', 'device': 'cpu', 'nthread': 1}
	train_call_kwargs_params: {'verbose_eval': True, 'num_boost_round': 100}
	dmatrix_kwargs: {'nthread': 1, 'missing': nan}


Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(70, 0) finished unsuccessfully.
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:624)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:599)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:35)
	at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:123)
	at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:101)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:532)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:601)
	at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
	at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:91)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:82)
	at org.apache.spark.api.python.PythonRDD$.writeNextElementToStream(PythonRDD.scala:334)
	at org.apache.spark.api.python.PythonRunner$$anon$2.writeNextInputToStream(PythonRunner.scala:906)
	at org.apache.spark.api.python.BasePythonRunner$ReaderInputStream.writeAdditionalInputToPythonWorker(PythonRunner.scala:844)
	at org.apache.spark.api.python.BasePythonRunner$ReaderInputStream.read(PythonRunner.scala:767)
	at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:244)
	at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:263)
	at java.base/java.io.DataInputStream.readInt(DataInputStream.java:393)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:933)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:925)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:532)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.mutable.Growable.addAll(Growable.scala:61)
	at scala.collection.mutable.Growable.addAll$(Growable.scala:57)
	at scala.collection.mutable.ArrayBuilder.addAll(ArrayBuilder.scala:75)
	at scala.collection.IterableOnceOps.toArray(IterableOnce.scala:1505)
	at scala.collection.IterableOnceOps.toArray$(IterableOnce.scala:1498)
	at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
	at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1057)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2524)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)
	at org.apache.spark.scheduler.Task.run(Task.scala:147)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:647)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:650)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:842)
Caused by: java.io.EOFException
	at java.base/java.io.DataInputStream.readInt(DataInputStream.java:398)
	at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:104)
	... 38 more

	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$3(DAGScheduler.scala:2935)
	at scala.Option.getOrElse(Option.scala:201)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2935)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2927)
	at scala.collection.immutable.List.foreach(List.scala:334)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2927)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:2283)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3201)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3141)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3130)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:50)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1009)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2484)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2505)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2524)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2549)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1057)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:417)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1056)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:203)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:184)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:108)
	at java.base/java.lang.Thread.run(Thread.java:842)


In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(
    predictionCol='prediction',
    labelCol='median_house_value',
)

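# A dict literal cannot hold the same key twice (the second entry silently
# overwrites the first), so each metric is evaluated with its own override.
metric_names = ['r2', 'mae']

# lr_prediction_df exists only if the linear regression lines in the fitting cell are uncommented
lr_evaluation = {
    name: evaluator.evaluate(lr_prediction_df, {evaluator.metricName: name})
    for name in metric_names
}
xgb_evaluation = {
    name: evaluator.evaluate(xgb_prediction_df, {evaluator.metricName: name})
    for name in metric_names
}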

NameError: name 'lr_prediction_df' is not defined
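
For reference, the two metrics requested from `RegressionEvaluator` are the coefficient of determination (`r2`) and the mean absolute error (`mae`). The following is a minimal plain-Python sketch of the standard definitions, useful for checking results by hand; the function names and the three toy values are hypothetical and not taken from this dataset:

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between labels and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 minus the ratio of the residual sum of squares to the total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 300.0]

mae = mean_absolute_error(y_true, y_pred)  # (10 + 10 + 0) / 3
r2 = r2_score(y_true, y_pred)              # 1 - 200 / 20000
print(mae, r2)
```

A lower `mae` is better, while `r2` is better the closer it gets to 1, so the two models should be compared metric by metric rather than by a single number.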