## California Housing Prices
Median house prices for California districts derived from the 1990 census.

### About Dataset
#### Context
This is the dataset used in the second chapter of Aurélien Géron's recent book 'Hands-On Machine Learning with Scikit-Learn and TensorFlow'. It serves as an excellent introduction to implementing machine learning algorithms because it requires only rudimentary data cleaning, has an easily understandable list of variables, and sits at a good size between being too toyish and too cumbersome.

The data contains information from the 1990 California census. So although it may not help you with predicting current housing prices like the Zillow Zestimate dataset, it does provide an accessible introductory dataset for teaching people about the basics of machine learning.

### Content
The data pertain to the houses found in a given California district, with some summary statistics about them based on the 1990 census data. Be warned that the data aren't cleaned, so some preprocessing steps are required! The columns are as follows; their names are fairly self-explanatory:

- longitude
- latitude
- housing_median_age
- total_rooms
- total_bedrooms
- population
- households
- median_income
- median_house_value
- ocean_proximity

### Acknowledgements
This data was initially featured in the following paper:
Pace, R. Kelley, and Ronald Barry. "Sparse spatial autoregressions." Statistics & Probability Letters 33.3 (1997): 291-297.

and I encountered it in 'Hands-On Machine Learning with Scikit-Learn and TensorFlow' by Aurélien Géron.
Aurélien Géron wrote:
This dataset is a modified version of the California Housing dataset available from:
Luís Torgo's page (University of Porto)

In [14]:
import kagglehub
import shutil
from pathlib import Path
import os

# Download latest version
path = kagglehub.dataset_download("camnugent/california-housing-prices")

print("Path to dataset files:", path)

# The dataset directory is expected to hold a single file (housing.csv)
fullpath = os.path.join(path, os.listdir(path)[0])

# Now let's copy it to the project's folder
destination_folder = Path('data')

try:
    destination_folder.mkdir(parents=True, exist_ok=True)
    shutil.copy(fullpath, destination_folder)
    print("Successfully copied file")

except Exception as e:
    print(e)

Path to dataset files: C:\Users\Gabriel\.cache\kagglehub\datasets\camnugent\california-housing-prices\versions\1
Successfully copied file


In [8]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName('California_Housing_Prediction')
    .master('local[*]')
    .config("spark.driver.memory", '4g')
    .getOrCreate()
)
spark.sparkContext.setLogLevel('ERROR')

In [6]:
df = spark.read.csv('./data/housing.csv', header=True, inferSchema=True)
df.show(5)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|  -122.23|   37.88|              41.0|      880.0|         129.0|     322.0|     126.0|       8.3252|          452600.0|       NEAR BAY|
|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|       NEAR BAY|
|  -122.24|   37.85|              52.0|     1467.0|         190.0|     496.0|     177.0|       7.2574|          352100.0|       NEAR BAY|
|  -122.25|   37.85|              52.0|     1274.0|         235.0|     558.0|     219.0|       5.6431|          341300.0|       NEAR BAY|
|  -122.25|   37.85|              

In [7]:
df.printSchema()

root
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- housing_median_age: double (nullable = true)
 |-- total_rooms: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- population: double (nullable = true)
 |-- households: double (nullable = true)
 |-- median_income: double (nullable = true)
 |-- median_house_value: double (nullable = true)
 |-- ocean_proximity: string (nullable = true)



In [None]:
print('Total rows:', df.count())

Total rows: 20640


In [None]:
# Let's create a unique id column
df = df.withColumn('id', F.monotonically_increasing_id())
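
Note that `monotonically_increasing_id` guarantees unique, increasing ids, but not consecutive ones. Per the PySpark documentation, the current implementation puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. A minimal pure-Python sketch of that layout follows; the helper name `sketch_monotonic_id` is hypothetical and exists only for illustration:

```python
# Illustrative only: mimics the documented bit layout of
# monotonically_increasing_id; it is not Spark's actual code.
def sketch_monotonic_id(partition_id: int, row_in_partition: int) -> int:
    # Partition ID in the upper bits, row number in the lower 33 bits
    return (partition_id << 33) + row_in_partition

# Two partitions with three rows each
ids = [sketch_monotonic_id(p, r) for p in range(2) for r in range(3)]
print(ids)  # [0, 1, 2, 8589934592, 8589934593, 8589934594]
```

So the ids identify rows, but they should not be read as row numbers.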

In [None]:
# Check Nulls in the dataframe
df.select(
    [F.count(F.when(F.col(col).isNull(), 1)).alias(col) for col in df.columns]
).show()

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|        0|       0|                 0|          0|           207|         0|         0|            0|                 0|              0|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+



In [35]:
# Check the distinct values of the categorical column
df.select(F.count_distinct(F.col('ocean_proximity'))).show()
df.groupBy("ocean_proximity").count().show()

+-------------------------------+
|count(DISTINCT ocean_proximity)|
+-------------------------------+
|                              5|
+-------------------------------+

+---------------+-----+
|ocean_proximity|count|
+---------------+-----+
|         ISLAND|    5|
|     NEAR OCEAN| 2658|
|       NEAR BAY| 2290|
|      <1H OCEAN| 9136|
|         INLAND| 6551|
+---------------+-----+



#### To handle the missing values for the 'total_bedrooms' column, I'm going to train a Linear Regressor

In [None]:
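# Separate the feature types by Spark dtype
# Note: the label 'total_bedrooms' is a double, so it also lands in numerical_features
numerical_features = []
categorical_features = []
for var, dtype in df.dtypes:  # 'dtype' avoids shadowing the built-in type()
    if dtype in ['int', 'float', 'double']:
        numerical_features.append(var)
    elif dtype in ['string']:
        categorical_features.append(var)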

categorical_features_indexed = [col + '_ind' for col in categorical_features]

In [81]:
# Split the rows: those with a known total_bedrooms train the model, the rest are the ones to fill in
df_train = df.where(F.col('total_bedrooms').isNotNull())
df_test = df.where(F.col('total_bedrooms').isNull())

# Check the split: df_train should contain no nulls, df_test only nulls
df_train.select(
    F.count(F.when(F.col('total_bedrooms').isNull(), 1)).alias('df_train Nulls')
).show()
df_test.select(
    F.count(F.when(F.col('total_bedrooms').isNull(), 1)).alias('df_test Nulls')
).show()

+--------------+
|df_train Nulls|
+--------------+
|             0|
+--------------+

+-------------+
|df_test Nulls|
+-------------+
|          207|
+-------------+



In [86]:
from pyspark.ml.feature import StringIndexer, VectorAssembler, MinMaxScaler
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.regression import LinearRegression

# Preprocess Features:
si = StringIndexer(
    inputCols=categorical_features,
    outputCols=categorical_features_indexed,
    handleInvalid='skip'
)
va = VectorAssembler(
    inputCols=numerical_features + categorical_features_indexed,
    outputCol='feature_vector',
    handleInvalid='skip'
)
scaler = MinMaxScaler(
    inputCol='feature_vector',
    outputCol='scaled_feature_vector'
)
lr = LinearRegression(
    featuresCol='scaled_feature_vector',
    labelCol='total_bedrooms',
)

pipeline = Pipeline(
    stages=[si, va, scaler, lr]
)
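
`StringIndexer` orders labels by descending frequency by default (`stringOrderType='frequencyDesc'`), so the most common category receives index 0. Assuming the `ocean_proximity` counts shown earlier, the mapping the indexer would be expected to learn can be sketched in plain Python. The variable names are illustrative, and a mapping fitted on `df_train` could differ if the counts there rank differently:

```python
# Category counts copied from the groupBy output above
counts = {
    'ISLAND': 5,
    'NEAR OCEAN': 2658,
    'NEAR BAY': 2290,
    '<1H OCEAN': 9136,
    'INLAND': 6551,
}

# frequencyDesc: sort by count, highest first; ties broken alphabetically
ordered = sorted(counts, key=lambda label: (-counts[label], label))
label_to_index = {label: float(i) for i, label in enumerate(ordered)}

for label, index in label_to_index.items():
    print(f"{label}: {index}")
```

These indices reach the linear regression as a single numeric feature, which treats the categories as if they were ordered; a one-hot encoding would avoid that assumption.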

In [95]:
pl_fit = pipeline.fit(df_train)
test_preds = pl_fit.transform(df_test)

In [96]:
test_preds.count()

0
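
The count of 0 has a likely explanation: `total_bedrooms` is a `double` column, so the type-based loop places it in `numerical_features`, and the `VectorAssembler` receives the label as one of its input columns. With `handleInvalid='skip'`, the assembler drops every row that has a null in any input, which is every row of `df_test`. It also means the model sees its own label as a feature during training. Below is a minimal sketch of building the feature lists with the label excluded, written in plain Python over the `(name, dtype)` pairs that `df.dtypes` returns. The helper name `split_feature_columns` is hypothetical, and the dtype list is transcribed from the schema printed earlier plus the `id` column, which Spark reports as `bigint`:

```python
def split_feature_columns(dtypes, label):
    """Split (name, dtype) pairs into numerical and categorical
    feature names, leaving out the label column."""
    numerical, categorical = [], []
    for name, dtype in dtypes:
        if name == label:
            continue  # the label must not be used as a feature
        if dtype in ('int', 'float', 'double'):
            numerical.append(name)
        elif dtype == 'string':
            categorical.append(name)
    return numerical, categorical

# Transcribed from df.printSchema(); 'id' comes from monotonically_increasing_id
dtypes = [
    ('longitude', 'double'), ('latitude', 'double'),
    ('housing_median_age', 'double'), ('total_rooms', 'double'),
    ('total_bedrooms', 'double'), ('population', 'double'),
    ('households', 'double'), ('median_income', 'double'),
    ('median_house_value', 'double'), ('ocean_proximity', 'string'),
    ('id', 'bigint'),
]

numerical_features, categorical_features = split_feature_columns(
    dtypes, label='total_bedrooms'
)
print(numerical_features)
print(categorical_features)
```

In the notebook this would be called as `split_feature_columns(df.dtypes, 'total_bedrooms')` before the pipeline is rebuilt and refitted. Under that change, `df_test` should keep its rows through the assembler, since the null check found no nulls in any other column.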