#Project - House Price Prediction
##Scenario:
BR Properties is a real estate company specializing in maintaining, managing, and leasing residential properties across California. The company offers a range of services including property management, leasing, maintenance, and real estate investment opportunities. They work with property owners, tenants, and real estate investors to provide comprehensive property solutions.

##Problem Statement:

BR Properties aims to build a predictive analytics program that can accurately forecast house prices. This program will aid the company in making informed decisions about property investments, pricing strategies, and market trends. The predictive model should leverage historical data and current market conditions to provide reliable price predictions.

##Tasks:

###1. Install PySpark and import the libraries needed for the MLlib project.

In [3]:
# !pip install pyspark  # uncomment if PySpark is not already installed
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.ml.feature import Imputer, VectorAssembler, StandardScaler, StringIndexer, OneHotEncoder
from pyspark.ml.regression import LinearRegression
from pyspark.mllib.evaluation import RegressionMetrics

###2. Load a CSV File into a DataFrame and Display Schema.

In [4]:
sparkSession = SparkSession.builder.appName("House Price Prediction").getOrCreate()

In [5]:
df = sparkSession.read.csv("/content/drive/MyDrive/Colab Notebooks/data/housing.csv",
                           header=True,
                           inferSchema=True)
df.printSchema()
df.show(5)

root
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- housing_median_age: double (nullable = true)
 |-- total_rooms: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- population: double (nullable = true)
 |-- households: double (nullable = true)
 |-- median_income: double (nullable = true)
 |-- median_house_value: double (nullable = true)
 |-- ocean_proximity: string (nullable = true)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|  -122.23|   37.88|              41.0|      880.0|         129.0|     322.0|     126.0|       8.3252|          452600.0|       NEAR B

###3. Add a unique ID column to the DataFrame and display the first three rows.

In [6]:
df = df.withColumn('id', monotonically_increasing_id())
df.show(3)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+---+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity| id|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+---+
|  -122.23|   37.88|              41.0|      880.0|         129.0|     322.0|     126.0|       8.3252|          452600.0|       NEAR BAY|  0|
|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|       NEAR BAY|  1|
|  -122.24|   37.85|              52.0|     1467.0|         190.0|     496.0|     177.0|       7.2574|          352100.0|       NEAR BAY|  2|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+---+
only showing top 3 rows
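
The IDs shown above are consecutive, which is consistent with the file being read as a single partition, but `monotonically_increasing_id` only guarantees unique, increasing values. According to the Spark documentation, the current implementation places the partition index in the upper 31 bits and the record number within the partition in the lower 33 bits. A minimal pure-Python sketch of that layout follows; the helper name `spark_style_id` is hypothetical and not part of the Spark API.

```python
def spark_style_id(partition_index: int, row_in_partition: int) -> int:
    """Mimic the documented bit layout: partition index in the upper bits,
    row number within the partition in the lower 33 bits."""
    return (partition_index << 33) + row_in_partition

# IDs for the first three rows of two hypothetical partitions
ids_partition_0 = [spark_style_id(0, r) for r in range(3)]
ids_partition_1 = [spark_style_id(1, r) for r in range(3)]

print(ids_partition_0)
print(ids_partition_1)
```

With more than one partition the IDs jump between partitions, so they should be treated as row labels rather than as a gap-free row count.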

###4. Split the DataFrame into training and test sets.

In [7]:
train, test = df.randomSplit([0.8, 0.2], seed=1)
print(f'Train rows: {train.count()}')
print(f'Test rows: {test.count()}')

Train rows: 16507
Test rows: 4133
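
`randomSplit` assigns rows to the splits by random sampling, so the weights are targets rather than exact fractions. A quick arithmetic check of the two counts printed above shows how close the split came to 80/20:

```python
# Row counts taken from the notebook output above
train_rows, test_rows = 16507, 4133

total_rows = train_rows + test_rows
train_fraction = train_rows / total_rows

print(f"total={total_rows}, train fraction={train_fraction:.4f}")
```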


###5. Define a list of numerical features excluding specific columns (median_house_value, id, ocean_proximity).

In [8]:
featureCols = [c for c in train.columns if c not in ['median_house_value', 'id', 'ocean_proximity']]
print(featureCols)

['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']


###6. Use an Imputer to fill missing values in numerical features in both training and test sets.

In [9]:
# Fit the median imputer on the training set only, then apply it to both sets
medianImputerTrainer = Imputer(strategy='median',
                               inputCols=featureCols,
                               outputCols=featureCols)
medianImputerModel = medianImputerTrainer.fit(train)
train = medianImputerModel.transform(train)
test = medianImputerModel.transform(test)


In [10]:
print('Nulls in train:')
for c in featureCols:
  print(f'\t- Nulls in {c}: {train.where(train[c].isNull()).count()}')
print('Nulls in test:')
for c in featureCols:
  print(f'\t- Nulls in {c}: {test.where(test[c].isNull()).count()}')

Nulls in train:
	- Nulls in longitude: 0
	- Nulls in latitude: 0
	- Nulls in housing_median_age: 0
	- Nulls in total_rooms: 0
	- Nulls in total_bedrooms: 0
	- Nulls in population: 0
	- Nulls in households: 0
	- Nulls in median_income: 0
Nulls in test:
	- Nulls in longitude: 0
	- Nulls in latitude: 0
	- Nulls in housing_median_age: 0
	- Nulls in total_rooms: 0
	- Nulls in total_bedrooms: 0
	- Nulls in population: 0
	- Nulls in households: 0
	- Nulls in median_income: 0
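
Fitting the imputer on the training set and reusing the fitted model on the test set means the test rows are filled with medians learned from training data only. Below is a minimal pure-Python sketch of that fit/transform pattern on made-up values. The helper names are hypothetical, and Spark's `Imputer` computes an approximate median on distributed data, so results on real data may differ slightly.

```python
from statistics import median

def fit_median(values):
    """Learn the fill value from the non-missing training entries."""
    return median(v for v in values if v is not None)

def fill_missing(values, fill_value):
    """Replace missing entries (None) with the learned fill value."""
    return [fill_value if v is None else v for v in values]

# Made-up column values; None marks a missing entry
toy_train = [129.0, None, 190.0, 235.0]
toy_test = [None, 280.0]

fill_value = fit_median(toy_train)            # learned from train only
filled_train = fill_missing(toy_train, fill_value)
filled_test = fill_missing(toy_test, fill_value)

print(fill_value, filled_train, filled_test)
```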


###7. Create a vector of numerical features using VectorAssembler in both training and test sets.

In [11]:
vectorAssemblerModel = VectorAssembler(inputCols = featureCols,
                                       outputCol = 'features',
                                       handleInvalid= 'skip')
train = vectorAssemblerModel.transform(train)
test = vectorAssemblerModel.transform(test)

In [12]:
train.show(5, truncate=False)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+----+-----------------------------------------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|id  |features                                             |
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+----+-----------------------------------------------------+
|-124.35  |40.54   |52.0              |1820.0     |300.0         |806.0     |270.0     |3.0147       |94600.0           |NEAR OCEAN     |2655|[-124.35,40.54,52.0,1820.0,300.0,806.0,270.0,3.0147] |
|-124.3   |41.8    |19.0              |2672.0     |552.0         |1298.0    |478.0     |1.9797       |85800.0           |NEAR OCEAN     |1851|[-124.3,41.8,19.0,2672.0,552.0,1298.0,478.0,1.9797]  |
|-124.3   |41.8

In [13]:
test.show(5, truncate=False)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+----+------------------------------------------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|id  |features                                              |
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+----+------------------------------------------------------+
|-124.26  |40.58   |52.0              |2217.0     |394.0         |907.0     |369.0     |2.3571       |111400.0          |NEAR OCEAN     |2653|[-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571]  |
|-124.19  |40.73   |21.0              |5694.0     |1056.0        |2907.0    |972.0     |3.5363       |90100.0           |NEAR OCEAN     |2629|[-124.19,40.73,21.0,5694.0,1056.0,2907.0,972.0,3.5363]|
|-124.19  

###8. Standardize numerical features using StandardScaler in both training and test sets.

In [14]:
# Defaults: withStd=True, withMean=False, so features are scaled but not centered
standardScalerTrainer = StandardScaler(inputCol='features',
                                       outputCol='scaled_features')
standardScalerModel = standardScalerTrainer.fit(train)
train = standardScalerModel.transform(train)
test = standardScalerModel.transform(test)

In [15]:
train.select('features','scaled_features').show(5, truncate=False)

+-----------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+
|features                                             |scaled_features                                                                                                                                          |
+-----------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+
|[-124.35,40.54,52.0,1820.0,300.0,806.0,270.0,3.0147] |[-62.197108857510784,18.915938254601453,4.132214770447993,0.8317667426563216,0.7137550811868852,0.7055079551257347,0.7053392074487548,1.5885830812075576]|
|[-124.3,41.8,19.0,2672.0,552.0,1298.0,478.0,1.9797]  |[-62.17209996774098,19.503853454423794,1.5098477045867669,1.221143261745984,1.313309349383869,1.136165416

In [16]:
test.select('features','scaled_features').show(5, truncate=False)

+------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|features                                              |scaled_features                                                                                                                                            |
+------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|[-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571]  |[-62.15209285592514,18.934602229198987,4.132214770447993,1.0132015760819038,0.937398339958776,0.7939152795273466,0.9639635835132982,1.2420636151903452]    |
|[-124.19,40.73,21.0,5694.0,1056.0,2907.0,972.0,3.5363]|[-62.117080410247404,19.00459213393974,1.6687790419116895,2.6022416663104915,2.5124178857778
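
The scaled longitudes stay near -62 instead of near 0 because `StandardScaler` defaults to `withStd=True` and `withMean=False`: each feature is divided by its sample standard deviation but is not centered. A minimal pure-Python sketch of that behavior on made-up values:

```python
from statistics import stdev

# Made-up feature values
values = [2.0, 4.0, 6.0]

# Sample (n - 1) standard deviation of the feature
sample_std = stdev(values)

# Scale to unit standard deviation without subtracting the mean
scaled = [v / sample_std for v in values]

print(sample_std, scaled)
```

Setting `withMean=True` would also subtract the mean before scaling, at the cost of producing dense output vectors.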

###9. Index the categorical feature ocean_proximity using StringIndexer in both training and test sets.

In [17]:
stringIndexerTrainer = StringIndexer(inputCol = 'ocean_proximity',
                                    outputCol = 'ocean_proximity_index')
stringIndexerModel = stringIndexerTrainer.fit(train)
train = stringIndexerModel.transform(train)
test = stringIndexerModel.transform(test)

In [18]:
train.select('ocean_proximity','ocean_proximity_index').show(5, truncate=False)

+---------------+---------------------+
|ocean_proximity|ocean_proximity_index|
+---------------+---------------------+
|NEAR OCEAN     |2.0                  |
|NEAR OCEAN     |2.0                  |
|NEAR OCEAN     |2.0                  |
|NEAR OCEAN     |2.0                  |
|NEAR OCEAN     |2.0                  |
+---------------+---------------------+
only showing top 5 rows



In [19]:
test.select('ocean_proximity','ocean_proximity_index').show(5, truncate=False)

+---------------+---------------------+
|ocean_proximity|ocean_proximity_index|
+---------------+---------------------+
|NEAR OCEAN     |2.0                  |
|NEAR OCEAN     |2.0                  |
|NEAR OCEAN     |2.0                  |
|NEAR OCEAN     |2.0                  |
|NEAR OCEAN     |2.0                  |
+---------------+---------------------+
only showing top 5 rows



###10. Perform one-hot encoding on the indexed categorical feature ocean_proximity_index in both training and test sets.

In [20]:
oneHotEncoderTrainer = OneHotEncoder(inputCols = ['ocean_proximity_index'],
                                    outputCols = ['ocean_proximity_vector'],
                                     handleInvalid = 'error') # No error handling in this transformation: invalid values raise an error
oneHotEncoderModel = oneHotEncoderTrainer.fit(train)
train = oneHotEncoderModel.transform(train)
test = oneHotEncoderModel.transform(test)

In [21]:
train.select('ocean_proximity','ocean_proximity_index','ocean_proximity_vector').show(5, truncate=False)

+---------------+---------------------+----------------------+
|ocean_proximity|ocean_proximity_index|ocean_proximity_vector|
+---------------+---------------------+----------------------+
|NEAR OCEAN     |2.0                  |(4,[2],[1.0])         |
|NEAR OCEAN     |2.0                  |(4,[2],[1.0])         |
|NEAR OCEAN     |2.0                  |(4,[2],[1.0])         |
|NEAR OCEAN     |2.0                  |(4,[2],[1.0])         |
|NEAR OCEAN     |2.0                  |(4,[2],[1.0])         |
+---------------+---------------------+----------------------+
only showing top 5 rows



In [22]:
test.select('ocean_proximity','ocean_proximity_index','ocean_proximity_vector').show(5, truncate=False)

+---------------+---------------------+----------------------+
|ocean_proximity|ocean_proximity_index|ocean_proximity_vector|
+---------------+---------------------+----------------------+
|NEAR OCEAN     |2.0                  |(4,[2],[1.0])         |
|NEAR OCEAN     |2.0                  |(4,[2],[1.0])         |
|NEAR OCEAN     |2.0                  |(4,[2],[1.0])         |
|NEAR OCEAN     |2.0                  |(4,[2],[1.0])         |
|NEAR OCEAN     |2.0                  |(4,[2],[1.0])         |
+---------------+---------------------+----------------------+
only showing top 5 rows
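
The value `(4,[2],[1.0])` is Spark's sparse vector notation: a vector of size 4 with 1.0 at index 2. With the default `dropLast=True`, the last category index gets no slot of its own and is encoded as all zeros, so the vector has one fewer slot than the number of indexed categories (presumably five here). A minimal pure-Python sketch follows; the helpers `to_dense` and `one_hot_drop_last` are hypothetical, not Spark functions.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse vector into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

def one_hot_drop_last(index, num_categories):
    """One-hot encode a category index, dropping the slot for the last category."""
    size = num_categories - 1
    if index == num_categories - 1:
        return to_dense(size, [], [])      # last category: all zeros
    return to_dense(size, [index], [1.0])

near_ocean = to_dense(4, [2], [1.0])       # the vector shown in the output above
last_category = one_hot_drop_last(4, 5)    # assumes five categories in total

print(near_ocean)
print(last_category)
```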



###11. Create a final feature vector by combining scaled numerical features and one-hot encoded categorical features in both training and test sets.

In [23]:
vectorAssemblerModelFinal = VectorAssembler(inputCols = ['scaled_features','ocean_proximity_vector'],
                                       outputCol = 'final_features',
                                       handleInvalid= 'skip')
train = vectorAssemblerModelFinal.transform(train)
test = vectorAssemblerModelFinal.transform(test)

In [24]:
train.select('scaled_features','ocean_proximity_vector','final_features').show(5, truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|scaled_features                                                                                                                                          |ocean_proximity_vector|final_features                                                                                                                                                           |
+---------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+----------------------------------------------------------------------------------------------------------------------------

In [25]:
test.select('scaled_features','ocean_proximity_vector','final_features').show(5, truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|scaled_features                                                                                                                                            |ocean_proximity_vector|final_features                                                                                                                                                             |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+------------------------------------------------------------------------------------------------------------------

###12. Initialize a Linear Regression model using final_features as features and median_house_value as the label.

In [26]:
linearRegressionTrainer = LinearRegression(featuresCol = 'final_features',
                                          labelCol = 'median_house_value',
                                           predictionCol = 'prediction')

###13. Train the Linear Regression model using the training data.

In [27]:
linearRegressionModel = linearRegressionTrainer.fit(train)

###14. Generate predictions on the training data using the trained Linear Regression model.

In [28]:
train_predictions = linearRegressionModel.transform(train.select('final_features','median_house_value'))
train_predictions.show(5)

+--------------------+------------------+------------------+
|      final_features|median_house_value|        prediction|
+--------------------+------------------+------------------+
|[-62.197108857510...|           94600.0| 210872.6937109509|
|[-62.172099967740...|           85800.0|114388.20828529005|
|[-62.172099967740...|          103600.0|151034.82641915954|
|[-62.157094633879...|           79000.0| 182975.2948124616|
|[-62.147091077971...|           76100.0|170024.62336446927|
+--------------------+------------------+------------------+
only showing top 5 rows



###15. Generate predictions on the test data using the trained Linear Regression model.

In [29]:
test_predictions = linearRegressionModel.transform(test.select('final_features','median_house_value'))
test_predictions.show(5)

+--------------------+------------------+------------------+
|      final_features|median_house_value|        prediction|
+--------------------+------------------+------------------+
|[-62.152092855925...|          111400.0| 190400.3942266286|
|[-62.117080410247...|           90100.0| 198108.5642802564|
|[-62.117080410247...|           69000.0|176489.41627549124|
|[-62.117080410247...|           70000.0|147980.29676335142|
|[-62.112078632293...|          107000.0|189445.67376457038|
+--------------------+------------------+------------------+
only showing top 5 rows



###16. Convert the predictions DataFrame (test_predictions) to a Pandas DataFrame (test_predictions_pd) for further analysis or visualization.

In [30]:
test_predictions_pd = test_predictions.toPandas()
test_predictions_pd.head()

                                      final_features  median_house_value     prediction
0  [-62.15209285592514, 18.934602229198987, 4.132...            111400.0  190400.394227
1  [-62.117080410247404, 19.00459213393974, 1.668...             90100.0   198108.56428
2  [-62.117080410247404, 19.02325610853728, 2.383...             69000.0  176489.416275
3  [-62.117080410247404, 19.02792210218666, 2.940...             70000.0  147980.296763
4  [-62.112078632293446, 18.95326620379652, 2.781...            107000.0  189445.673765


###17. Extract predicted and actual median house values from test_predictions and convert to RDD (predictions_and_actuals_rdd).

In [31]:
predictions_and_actuals_rdd = test_predictions.select('prediction','median_house_value').rdd
predictions_and_actuals_rdd.take(5)

[Row(prediction=190400.3942266286, median_house_value=111400.0),
 Row(prediction=198108.5642802564, median_house_value=90100.0),
 Row(prediction=176489.41627549124, median_house_value=69000.0),
 Row(prediction=147980.29676335142, median_house_value=70000.0),
 Row(prediction=189445.67376457038, median_house_value=107000.0)]

###18. Map the RDD rows to (prediction, actual) tuples, the input format RegressionMetrics expects.

In [32]:
predictions_and_actuals_tuples = predictions_and_actuals_rdd.map(lambda row: (row.prediction, row.median_house_value))
predictions_and_actuals_tuples.take(5)


[(190400.3942266286, 111400.0),
 (198108.5642802564, 90100.0),
 (176489.41627549124, 69000.0),
 (147980.29676335142, 70000.0),
 (189445.67376457038, 107000.0)]

###19. Calculate regression metrics (Mean Squared Error, Root Mean Squared Error, Mean Absolute Error, R-squared) using predictions_and_actuals_tuples.

In [33]:
regressionEvaluator = RegressionMetrics(predictions_and_actuals_tuples)
print(f'MSE: {regressionEvaluator.meanSquaredError}')
print(f'RMSE: {regressionEvaluator.rootMeanSquaredError}')
print(f'MAE: {regressionEvaluator.meanAbsoluteError}')
print(f'R2: {regressionEvaluator.r2}')



MSE: 4725761047.684798
RMSE: 68744.17100878299
MAE: 49956.137834375404
R2: 0.651264962472694
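
For reference, the four metrics can be reproduced from (prediction, actual) pairs with a few lines of plain Python. The sketch below uses made-up pairs and a hypothetical helper `regression_metrics`; it illustrates the standard formulas rather than Spark's internal implementation. It also checks that the RMSE reported above is the square root of the reported MSE.

```python
import math

def regression_metrics(pairs):
    """Compute MSE, RMSE, MAE and R2 from (prediction, actual) pairs."""
    n = len(pairs)
    ss_res = sum((p - a) ** 2 for p, a in pairs)        # residual sum of squares
    mean_actual = sum(a for _, a in pairs) / n
    ss_tot = sum((a - mean_actual) ** 2 for _, a in pairs)
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(p - a) for p, a in pairs) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }

# Made-up (prediction, actual) pairs
toy_pairs = [(3.0, 2.0), (5.0, 5.0), (7.0, 8.0)]
metrics = regression_metrics(toy_pairs)
print(metrics)

# Consistency check on the values printed by the notebook above
reported_mse = 4725761047.684798
reported_rmse = 68744.17100878299
rmse_matches = math.isclose(math.sqrt(reported_mse), reported_rmse, rel_tol=1e-9)
print(rmse_matches)
```

Because RMSE and MAE are in the same units as the label, the reported values correspond to typical errors of roughly 69,000 and 50,000 in `median_house_value`.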
