<table class="table table-bordered">
    <tr>
        <th style="width:250px"><img src='https://www.np.edu.sg/images/default-source/default-album/img-logo.png?sfvrsn=764583a6_0' style="width: 100%; height: 125px; "></th>
        <th style="text-align:center;"><h1>Distributed Data Pipelines</h1><h2>Assignment 1 </h2><h3>Diploma in Data Science</h3></th>
    </tr>
</table>

Learning Objectives:
- Design PySpark Based Machine Learning
- Execute PySpark Syntax Correctly
- Evaluate and Select Final Model based on Metrics

You will be **graded on the use of PySpark**, so usage of **Pandas itself should be avoided as much as possible**, especially if a particular native method or function is already available in PySpark. **Penalties will be imposed in such cases.**

In [1]:
# import the packages
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName('ASG1').getOrCreate()

### Step 1: Problem Statement Formulation

In [2]:
# load and explore data
df=spark.read.csv('./data/sg_flat_prices_mod.csv', header=True, inferSchema=True) # inferSchema auto detects data type

In [3]:
df.show()

+----+-----+----------+---------+-----+-----------------+------------+--------------+--------------+-------------------+---------------+------------+
|year|month|      town|flat_type|block|      street_name|storey_range|floor_area_sqm|    flat_model|lease_commence_date|remaining_lease|resale_price|
+----+-----+----------+---------+-----+-----------------+------------+--------------+--------------+-------------------+---------------+------------+
|2017|    1|ANG MO KIO|   2 ROOM|  406|ANG MO KIO AVE 10|    10 TO 12|          44.0|      Improved|               1979|            736|    232000.0|
|2017|    1|ANG MO KIO|   3 ROOM|  108| ANG MO KIO AVE 4|    01 TO 03|          67.0|New Generation|               1978|            727|    250000.0|
|2017|    1|ANG MO KIO|   3 ROOM|  602| ANG MO KIO AVE 5|    01 TO 03|          67.0|New Generation|               1980|            749|    262000.0|
|2017|    1|ANG MO KIO|   3 ROOM|  465|ANG MO KIO AVE 10|    04 TO 06|          68.0|New Generation|

In [4]:
# value based problem statement

In [5]:
# Build a simple machine learning model to predict the resale prices of any given HDB
# resale transaction.

### Step 2: Exploratory Data Analysis and Data Cleansing

In [6]:
# consider NaN Treatment

In [7]:
# Get a list of the data types for each column
col_types = df.dtypes

# Filter the list to only include numerical data types
num_cols = [col[0] for col in col_types if col[1] in ("int", "double")]

# Print the list of numeric columns
print(num_cols)

['year', 'month', 'floor_area_sqm', 'lease_commence_date', 'remaining_lease', 'resale_price']


In [8]:
# Filter the list to only include string data types
cat_cols = [col[0] for col in col_types if col[1] == "string"]

# Print the list of categorical columns
print(cat_cols)

['town', 'flat_type', 'block', 'street_name', 'storey_range', 'flat_model']


In [9]:
# summary statistics for the numeric variables
df.select(num_cols).describe().show()

+-------+------------------+------------------+-----------------+-------------------+------------------+------------------+
|summary|              year|             month|   floor_area_sqm|lease_commence_date|   remaining_lease|      resale_price|
+-------+------------------+------------------+-----------------+-------------------+------------------+------------------+
|  count|             64247|             64247|            64197|              64247|             64247|             64247|
|   mean|2018.0262424704656| 6.779133656046197|97.77009984890256| 1993.6012420813422| 894.6413840334957|438943.70469516085|
| stddev|0.8146939469668695|3.2635673352950514|24.26994610142912| 12.465629502278013|149.62669792791093|153760.65294972394|
|    min|              2017|                 1|             31.0|               1966|               553|          150000.0|
|    max|              2019|                12|            249.0|               2016|              1160|         1205000.0|
+-------

In [10]:
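# Intended as a duplicate-row check. Note that `a and b` evaluates to `b` here,
# so this returns the total row count; df.distinct().count() gives the distinct count.
df.select(df.columns).distinct() and df.select(df.columns).count()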

64247

In [11]:
df.count()

64247

In [12]:
# how to show null count
from pyspark.sql.functions import col, isnan, when, count
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()

+----+-----+----+---------+-----+-----------+------------+--------------+----------+-------------------+---------------+------------+
|year|month|town|flat_type|block|street_name|storey_range|floor_area_sqm|flat_model|lease_commence_date|remaining_lease|resale_price|
+----+-----+----+---------+-----+-----------+------------+--------------+----------+-------------------+---------------+------------+
|   0|    0|   0|        0|    0|          0|           0|            50|         0|                  0|              0|           0|
+----+-----+----+---------+-----+-----------+------------+--------------+----------+-------------------+---------------+------------+



In [13]:
grouped_data = df.groupBy("flat_type")
mean_floor_area_sqms = grouped_data.mean("floor_area_sqm")

In [14]:
from pyspark.sql.functions import when

# Attach each flat type's mean floor area, then use it to fill missing floor areas
df = df.join(mean_floor_area_sqms, on="flat_type", how="left")
df = df.withColumn("floor_area_sqm", when(df["floor_area_sqm"].isNull(), df["avg(floor_area_sqm)"]).otherwise(df["floor_area_sqm"]))

In [15]:
#Check for null values again
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()

+---------+----+-----+----+-----+-----------+------------+--------------+----------+-------------------+---------------+------------+-------------------+
|flat_type|year|month|town|block|street_name|storey_range|floor_area_sqm|flat_model|lease_commence_date|remaining_lease|resale_price|avg(floor_area_sqm)|
+---------+----+-----+----+-----+-----------+------------+--------------+----------+-------------------+---------------+------------+-------------------+
|        0|   0|    0|   0|    0|          0|           0|             0|         0|                  0|              0|           0|                  0|
+---------+----+-----+----+-----+-----------+------------+--------------+----------+-------------------+---------------+------------+-------------------+
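
The join-then-fill pattern in In [13] and In [14] is group-mean imputation: each missing `floor_area_sqm` takes the mean of the observed values for the same `flat_type`. The plain-Python sketch below mirrors that logic on a few made-up rows so the arithmetic can be checked by hand. The rows and the helper name `impute_group_mean` are illustrative assumptions, not assignment data, and the sketch is not a substitute for the PySpark code.

```python
from collections import defaultdict


def impute_group_mean(rows, group_key, value_key):
    """Fill missing (None) values with the mean of the row's group."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    # Accumulate sums and counts over observed values only,
    # which is also how an average that skips nulls behaves.
    for row in rows:
        if row[value_key] is not None:
            totals[row[group_key]] += row[value_key]
            counts[row[group_key]] += 1
    means = {group: totals[group] / counts[group] for group in counts}
    # Note: a group with no observed values at all would raise KeyError here.
    return [
        {**row, value_key: means[row[group_key]]} if row[value_key] is None else dict(row)
        for row in rows
    ]


# Hypothetical rows, for illustration only.
toy_rows = [
    {"flat_type": "3 ROOM", "floor_area_sqm": 67.0},
    {"flat_type": "3 ROOM", "floor_area_sqm": 69.0},
    {"flat_type": "3 ROOM", "floor_area_sqm": None},
    {"flat_type": "4 ROOM", "floor_area_sqm": 92.0},
]
filled = impute_group_mean(toy_rows, "flat_type", "floor_area_sqm")
print(filled[2]["floor_area_sqm"])  # 68.0
```

Imputing within `flat_type` rather than with the global mean keeps a 2 ROOM flat from being assigned a floor area typical of a 5 ROOM flat.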



In [16]:
df.groupBy('flat_type').count().show()

+----------------+-----+
|       flat_type|count|
+----------------+-----+
|          3 ROOM|15589|
|          1 ROOM|   29|
|          4 ROOM|26592|
|          2 ROOM|  919|
|       EXECUTIVE| 5169|
|          5 ROOM|15916|
|MULTI-GENERATION|   33|
+----------------+-----+



In [17]:
# Remove rows where the "flat_type" column is "1 ROOM" or "MULTI-GENERATION"
df = df.filter(~df["flat_type"].isin(["1 ROOM", "MULTI-GENERATION"]))


In [18]:
df.groupBy('flat_type').count().show()

+---------+-----+
|flat_type|count|
+---------+-----+
|   3 ROOM|15589|
|   4 ROOM|26592|
|   2 ROOM|  919|
|EXECUTIVE| 5169|
|   5 ROOM|15916|
+---------+-----+



In [19]:
df.groupBy('flat_model').count().show(truncate = False)

+----------------------+-----+
|flat_model            |count|
+----------------------+-----+
|Apartment             |2627 |
|Premium Maisonette    |7    |
|Improved              |16042|
|Type S2               |63   |
|New Generation        |9068 |
|Improved-Maisonette   |13   |
|Model A-Maisonette    |110  |
|Maisonette            |1909 |
|Model A               |20743|
|DBSS                  |936  |
|Simplified            |2754 |
|Terrace               |38   |
|Adjoined flat         |119  |
|Type S1               |117  |
|Standard              |1802 |
|Premium Apartment     |6945 |
|Model A2              |885  |
|Premium Apartment Loft|7    |
+----------------------+-----+



In [20]:
df.groupBy('town').count().show(50, truncate = False)

+---------------+-----+
|town           |count|
+---------------+-----+
|QUEENSTOWN     |1708 |
|BEDOK          |3429 |
|CLEMENTI       |1403 |
|SERANGOON      |1327 |
|BUKIT PANJANG  |2384 |
|BUKIT TIMAH    |189  |
|YISHUN         |4314 |
|GEYLANG        |1554 |
|WOODLANDS      |4988 |
|BUKIT MERAH    |2549 |
|TOA PAYOH      |2184 |
|BISHAN         |1263 |
|PUNGGOL        |4013 |
|HOUGANG        |2982 |
|ANG MO KIO     |2917 |
|PASIR RIS      |1836 |
|SENGKANG       |4970 |
|KALLANG/WHAMPOA|1916 |
|BUKIT BATOK    |2451 |
|TAMPINES       |4071 |
|JURONG WEST    |4945 |
|JURONG EAST    |1446 |
|MARINE PARADE  |387  |
|CENTRAL AREA   |545  |
|SEMBAWANG      |1754 |
|CHOA CHU KANG  |2660 |
+---------------+-----+



In [21]:


# filter the DataFrame to include only resale prices above 1 million
count_over1m = df.filter(df["resale_price"] > 1000000)

# count the number of rows in the filtered DataFrame
count_over1m = count_over1m.count()

# print the result
print("Number of resale prices above 1 million:", count_over1m)


Number of resale prices above 1 million: 154


In [22]:


# compute the approximate quantiles of the resale_price column
quantiles = df.approxQuantile("resale_price", [0.25, 0.75], 0.05)

# define the lower and upper bounds of the expected range
lower_bound = quantiles[0] - 1.5 * (quantiles[1] - quantiles[0])
upper_bound = quantiles[1] + 1.5 * (quantiles[1] - quantiles[0])

# filter the DataFrame to include only data points outside of the expected range
outliers = df.filter((df["resale_price"] < lower_bound) | (df["resale_price"] > upper_bound))

# print the number of outliers found
print("Number of outliers:", outliers.count())

# print the lower and upper fences; values outside this range are the potential outliers
print("Outlier range:", lower_bound, "to", upper_bound)


Number of outliers: 3800
Outlier range: 80500.0 to 732500.0
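
The bounds printed above are the usual 1.5 x IQR fences. Working backwards from them, the approximate quartiles must have been Q1 = 325000 and Q3 = 488000; this is an inference from the printed bounds, since the notebook never prints `quantiles`. The sketch below redoes the arithmetic in plain Python, with `iqr_fences` as a hypothetical helper name.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles inferred from the printed bounds, not read from the notebook.
lower_bound, upper_bound = iqr_fences(325000.0, 488000.0)
print(lower_bound, upper_bound)  # 80500.0 732500.0
```

Values between the two fences form the expected range. The smallest `resale_price` reported by `describe()` is 150000.0, which is above the lower fence, so all 3800 flagged rows sit above the upper fence. Because `approxQuantile` was called with a relative error of 0.05, the quartiles, and therefore the fences, are approximate.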


In [23]:
from pyspark.sql import functions as F

# Group the data by town
grouped_data = df.groupBy("town")

# Compute the average resale price for each town
avg_resale_price = grouped_data.agg(F.mean("resale_price").alias("resale_price"))

# Sort the resulting DataFrame by average resale price
sorted_data = avg_resale_price.orderBy("resale_price", ascending=False)

# Show the results
sorted_data.show(50)


+---------------+------------------+
|           town|      resale_price|
+---------------+------------------+
|    BUKIT TIMAH| 714816.9735449735|
|         BISHAN| 643720.7957244655|
|   CENTRAL AREA| 623428.0220183487|
|    BUKIT MERAH| 568323.6060729699|
|     QUEENSTOWN| 554835.8535597189|
|  MARINE PARADE| 518115.9173126615|
|KALLANG/WHAMPOA|496043.73121085594|
|      TOA PAYOH| 494166.7532051282|
|      PASIR RIS| 492123.0871459695|
|      SERANGOON| 490769.0934438583|
|       TAMPINES| 473382.8806190125|
|       CLEMENTI| 469028.6115466857|
|        PUNGGOL| 453269.6137253925|
|       SENGKANG|433994.11826156947|
|        GEYLANG|430605.67181467183|
|        HOUGANG|429212.74610328645|
|  BUKIT PANJANG|428196.38632550335|
|    JURONG EAST| 416185.7745504841|
|     ANG MO KIO| 411547.1964346932|
|          BEDOK| 410944.0495771362|
|    JURONG WEST| 387879.4531122346|
|  CHOA CHU KANG|384960.08120300755|
|      SEMBAWANG| 378804.1880729761|
|    BUKIT BATOK| 377715.2757037944|
|

In [24]:
from pyspark.sql import functions as F

# Group the data by town and flat type
grouped_data = df.groupBy("town", "flat_type")

# Count the number of rows in each group
counts = grouped_data.agg(F.count("*").alias("count"))

counts.show()

+---------------+---------+-----+
|           town|flat_type|count|
+---------------+---------+-----+
|    BUKIT MERAH|   3 ROOM|  907|
|    JURONG EAST|   5 ROOM|  360|
|      SERANGOON|   3 ROOM|  283|
|   CENTRAL AREA|   2 ROOM|   16|
|          BEDOK|   3 ROOM| 1491|
|     QUEENSTOWN|   3 ROOM|  791|
|     ANG MO KIO|   4 ROOM|  777|
|    JURONG WEST|   3 ROOM|  947|
|  CHOA CHU KANG|   3 ROOM|  165|
|       TAMPINES|EXECUTIVE|  464|
|       SENGKANG|   3 ROOM|  330|
|    BUKIT MERAH|   5 ROOM|  581|
|KALLANG/WHAMPOA|EXECUTIVE|   42|
|      PASIR RIS|   2 ROOM|    2|
|        PUNGGOL|   3 ROOM|  295|
|       TAMPINES|   4 ROOM| 1584|
|        GEYLANG|   5 ROOM|  203|
|    JURONG EAST|   4 ROOM|  422|
|  CHOA CHU KANG|   4 ROOM| 1263|
|      PASIR RIS|EXECUTIVE|  511|
+---------------+---------+-----+
only showing top 20 rows



In [25]:
from pyspark.sql import functions as F

# Group the data by storey range
grouped_data = df.groupBy("storey_range")

# Compute the average resale price for each storey range
avg_resale_price = grouped_data.agg(F.mean("resale_price").alias("resale_price"))

# Sort the resulting DataFrame by average resale price
sorted_data = avg_resale_price.orderBy("resale_price", ascending=False)

# Show the results
sorted_data.show(50)


+------------+------------------+
|storey_range|      resale_price|
+------------+------------------+
|    43 TO 45|1037833.3333333334|
|    49 TO 51|1022814.6666666666|
|    46 TO 48|1018845.4545454546|
|    40 TO 42|       894045.9375|
|    37 TO 39| 845602.7674418605|
|    34 TO 36| 802757.8962962963|
|    31 TO 33| 800630.9291338583|
|    28 TO 30| 751391.7605177993|
|    25 TO 27| 666919.1401295896|
|    22 TO 24| 610122.5498092031|
|    19 TO 21| 591394.6781223805|
|    16 TO 18|514570.93940520444|
|    13 TO 15|472987.87508488825|
|    10 TO 12|438019.12078618666|
|    07 TO 09| 423331.7550947538|
|    04 TO 06| 411894.9576388235|
|    01 TO 03| 394287.5250354052|
+------------+------------------+



### Step 3: Data Wrangling and Transformation

In [26]:
# consider categorical and numerical variable treatment and transformations

In [27]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer


stage_string = [StringIndexer(inputCol= c, outputCol= c+"_string_encoded") for c in cat_cols]
stage_one_hot = [OneHotEncoder(inputCol= c+"_string_encoded", outputCol= c+ "_one_hot") for c in cat_cols]

ppl = Pipeline(stages= stage_string + stage_one_hot)
df = ppl.fit(df).transform(df)
df.show(5)



+---------+----+-----+----------+-----+-----------------+------------+--------------+--------------+-------------------+---------------+------------+-------------------+-------------------+------------------------+--------------------+--------------------------+---------------------------+-------------------------+--------------+-----------------+------------------+-------------------+--------------------+------------------+
|flat_type|year|month|      town|block|      street_name|storey_range|floor_area_sqm|    flat_model|lease_commence_date|remaining_lease|resale_price|avg(floor_area_sqm)|town_string_encoded|flat_type_string_encoded|block_string_encoded|street_name_string_encoded|storey_range_string_encoded|flat_model_string_encoded|  town_one_hot|flat_type_one_hot|     block_one_hot|street_name_one_hot|storey_range_one_hot|flat_model_one_hot|
+---------+----+-----+----------+-----+-----------------+------------+--------------+--------------+-------------------+---------------+------

In [28]:
df.dtypes

[('flat_type', 'string'),
 ('year', 'int'),
 ('month', 'int'),
 ('town', 'string'),
 ('block', 'string'),
 ('street_name', 'string'),
 ('storey_range', 'string'),
 ('floor_area_sqm', 'double'),
 ('flat_model', 'string'),
 ('lease_commence_date', 'int'),
 ('remaining_lease', 'int'),
 ('resale_price', 'double'),
 ('avg(floor_area_sqm)', 'double'),
 ('town_string_encoded', 'double'),
 ('flat_type_string_encoded', 'double'),
 ('block_string_encoded', 'double'),
 ('street_name_string_encoded', 'double'),
 ('storey_range_string_encoded', 'double'),
 ('flat_model_string_encoded', 'double'),
 ('town_one_hot', 'vector'),
 ('flat_type_one_hot', 'vector'),
 ('block_one_hot', 'vector'),
 ('street_name_one_hot', 'vector'),
 ('storey_range_one_hot', 'vector'),
 ('flat_model_one_hot', 'vector')]

In [29]:
from pyspark.sql import functions as F

# Set the name of the target variable
target = 'resale_price'

# Loop through each numeric column
# (the loop variable is c so it does not shadow the imported col function)
for c in num_cols:
  # Skip the target variable
  if c == target:
    continue

  # Approximate the 25th and 75th percentiles of the column.
  # A relative error of 0.25 is coarse; a smaller value gives tighter quantiles.
  quantiles = df.approxQuantile(c, [0.25, 0.75], 0.25)
  # quantiles[1] is the 75th percentile (Q3), not the median, so the scaling
  # below is centred on Q3. A true median would need 0.5 in the probability list.
  center = quantiles[1]
  iqr = quantiles[1] - quantiles[0]

  # Create a new column: subtract the centre and divide by the IQR
  df = df.withColumn(
      f'scaled_{c}',
      (df[c] - center) / iqr
  )


In [30]:
df.dtypes

[('flat_type', 'string'),
 ('year', 'int'),
 ('month', 'int'),
 ('town', 'string'),
 ('block', 'string'),
 ('street_name', 'string'),
 ('storey_range', 'string'),
 ('floor_area_sqm', 'double'),
 ('flat_model', 'string'),
 ('lease_commence_date', 'int'),
 ('remaining_lease', 'int'),
 ('resale_price', 'double'),
 ('avg(floor_area_sqm)', 'double'),
 ('town_string_encoded', 'double'),
 ('flat_type_string_encoded', 'double'),
 ('block_string_encoded', 'double'),
 ('street_name_string_encoded', 'double'),
 ('storey_range_string_encoded', 'double'),
 ('flat_model_string_encoded', 'double'),
 ('town_one_hot', 'vector'),
 ('flat_type_one_hot', 'vector'),
 ('block_one_hot', 'vector'),
 ('street_name_one_hot', 'vector'),
 ('storey_range_one_hot', 'vector'),
 ('flat_model_one_hot', 'vector'),
 ('scaled_year', 'double'),
 ('scaled_month', 'double'),
 ('scaled_floor_area_sqm', 'double'),
 ('scaled_lease_commence_date', 'double'),
 ('scaled_remaining_lease', 'double')]
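
The `scaled_*` columns come from the loop in In [29], which subtracts a centre value from each column and divides by the interquartile range. That loop takes `quantiles[1]`, the 0.75 quantile, as the centre. The plain-Python sketch below shows the same formula. The quartiles used for `year` (2017 and 2018) are an assumption chosen to be consistent with the `-1.0` that 2017 rows show in the assembled vectors; the notebook does not print them. `robust_scale` is a hypothetical helper.

```python
def robust_scale(x, center, q1, q3):
    """Scale x as (x - center) / (q3 - q1)."""
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; the column cannot be scaled this way")
    return (x - center) / iqr


# Assumed approximate quartiles for `year` (not printed in the notebook).
q1, q3 = 2017, 2018

# Centred on Q3, as the notebook's loop does.
scaled_2017 = robust_scale(2017, center=q3, q1=q1, q3=q3)
scaled_2019 = robust_scale(2019, center=q3, q1=q1, q3=q3)
print(scaled_2017, scaled_2019)  # -1.0 1.0
```

Conventional robust scaling centres on the median (the 0.5 quantile) instead. Switching the centre shifts every scaled value by a constant, which an unregularised linear regression absorbs into its intercept.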

In [31]:
from pyspark.ml.feature import VectorAssembler
cols1 = ["scaled_year", "scaled_month", "scaled_floor_area_sqm", "scaled_lease_commence_date","scaled_remaining_lease","town_one_hot", "flat_type_one_hot", "block_one_hot",
         "street_name_one_hot", "storey_range_one_hot", "flat_model_one_hot"]
featureassembler=VectorAssembler(inputCols=cols1,outputCol="Xcols")

In [32]:
featureassembler

VectorAssembler_76e553a12520

In [33]:
output = featureassembler.transform(df)

In [34]:
output.show()

+---------+----+-----+----------+-----+-----------------+------------+--------------+--------------+-------------------+---------------+------------+-------------------+-------------------+------------------------+--------------------+--------------------------+---------------------------+-------------------------+--------------+-----------------+------------------+-------------------+--------------------+------------------+-----------+------------+---------------------+--------------------------+----------------------+--------------------+
|flat_type|year|month|      town|block|      street_name|storey_range|floor_area_sqm|    flat_model|lease_commence_date|remaining_lease|resale_price|avg(floor_area_sqm)|town_string_encoded|flat_type_string_encoded|block_string_encoded|street_name_string_encoded|storey_range_string_encoded|flat_model_string_encoded|  town_one_hot|flat_type_one_hot|     block_one_hot|street_name_one_hot|storey_range_one_hot|flat_model_one_hot|scaled_year|scaled_month|

In [35]:
final_data = output.select("Xcols", "resale_price")

In [36]:
final_data.show(truncate = False)

+-----------------------------------------------------------------------------------------------------------------------------+------------+
|Xcols                                                                                                                        |resale_price|
+-----------------------------------------------------------------------------------------------------------------------------+------------+
|(2993,[0,1,2,3,4,13,138,2424,2962,2977],[-1.0,-1.0,-0.9534883720930233,-0.74,-0.6985172981878089,1.0,1.0,1.0,1.0,1.0])       |232000.0    |
|(2993,[0,1,2,3,4,13,32,56,2440,2963,2978],[-1.0,-1.0,-0.8465116279069768,-0.76,-0.71334431630972,1.0,1.0,1.0,1.0,1.0,1.0])   |250000.0    |
|(2993,[0,1,2,3,4,13,32,166,2451,2963,2978],[-1.0,-1.0,-0.8465116279069768,-0.72,-0.6771004942339374,1.0,1.0,1.0,1.0,1.0,1.0])|262000.0    |
|(2993,[0,1,2,3,4,13,32,432,2424,2960,2978],[-1.0,-1.0,-0.8418604651162791,-0.72,-0.6836902800658978,1.0,1.0,1.0,1.0,1.0,1.0])|265000.0    |
|(2993,[0,1,2
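
The sparse vectors above have size 2993. That figure follows from how the pieces fit together: `VectorAssembler` concatenates its inputs in the order given, `StringIndexer` numbers categories by descending frequency by default, and `OneHotEncoder` drops the last category by default (`dropLast=True`), so a column with k categories takes k - 1 slots. The sketch below works out the slot ranges for the columns whose category counts are visible in the EDA tables. The `block` and `street_name` counts are never printed, so only their combined width is inferred. The variable names are illustrative.

```python
VECTOR_SIZE = 2993  # size shown in the printed sparse vectors
NUM_SCALED = 5      # scaled_year through scaled_remaining_lease

# Distinct categories counted from the groupBy tables printed earlier.
levels = {"town": 26, "flat_type": 5, "storey_range": 17, "flat_model": 18}
# With dropLast=True each categorical column takes k - 1 slots.
width = {name: k - 1 for name, k in levels.items()}

# Assembler order: scaled numerics, town, flat_type, block, street_name,
# storey_range, flat_model.
town_start = NUM_SCALED
flat_type_start = town_start + width["town"]
block_start = flat_type_start + width["flat_type"]
flat_model_start = VECTOR_SIZE - width["flat_model"]
storey_start = flat_model_start - width["storey_range"]
block_plus_street = storey_start - block_start

print(town_start, flat_type_start, block_start)  # 5 30 34
print(storey_start, flat_model_start)            # 2960 2976
print(block_plus_street)                         # 2926
```

This agrees with the printed rows: index 13 falls in the town range, 32 in the flat_type range, 2960 to 2963 in the storey_range range, and 2977 and 2978 in the flat_model range. The first row, a 2 ROOM flat, has no active flat_type index because 2 ROOM is the least frequent remaining type and is therefore the dropped category. Most of the 2993 features (2926 of them) come from `block` and `street_name`.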

In [37]:
# from pyspark.ml.feature import StandardScaler

# sScaler = StandardScaler(withMean=True, withStd=True, inputCol="Xcols", outputCol="Xcols_sscaled")

In [38]:
# final_data = sScaler.fit(final_data).transform(final_data)

In [39]:
# final_data = final_data.select("Xcols_sscaled","resale_price")
# final_data.show(2, truncate = False)

### Step 4: Machine Learning Modelling

In [40]:
# Get the number of rows
num_rows = final_data.count()

# Get the number of columns
num_columns = len(final_data.columns)

# Print the shape of the DataFrame
print(f"The shape of the DataFrame is {num_rows} x {num_columns}")


The shape of the DataFrame is 64185 x 2


In [41]:
final_data.show(10, truncate = False)

+-----------------------------------------------------------------------------------------------------------------------------+------------+
|Xcols                                                                                                                        |resale_price|
+-----------------------------------------------------------------------------------------------------------------------------+------------+
|(2993,[0,1,2,3,4,13,138,2424,2962,2977],[-1.0,-1.0,-0.9534883720930233,-0.74,-0.6985172981878089,1.0,1.0,1.0,1.0,1.0])       |232000.0    |
|(2993,[0,1,2,3,4,13,32,56,2440,2963,2978],[-1.0,-1.0,-0.8465116279069768,-0.76,-0.71334431630972,1.0,1.0,1.0,1.0,1.0,1.0])   |250000.0    |
|(2993,[0,1,2,3,4,13,32,166,2451,2963,2978],[-1.0,-1.0,-0.8465116279069768,-0.72,-0.6771004942339374,1.0,1.0,1.0,1.0,1.0,1.0])|262000.0    |
|(2993,[0,1,2,3,4,13,32,432,2424,2960,2978],[-1.0,-1.0,-0.8418604651162791,-0.72,-0.6836902800658978,1.0,1.0,1.0,1.0,1.0,1.0])|265000.0    |
|(2993,[0,1,2

In [42]:

# Import the LinearRegression class
from pyspark.ml.regression import LinearRegression


# Split the data into training and test sets
# (no seed is passed, so the split and the metrics below change from run to run)
train_data, test_data = final_data.randomSplit([0.8, 0.2])

# Create a LinearRegression instance
lr = LinearRegression(featuresCol='Xcols', labelCol='resale_price')

# Fit the model on the training data
lr_model = lr.fit(train_data)

# Evaluate the model on the test data
test_results = lr_model.evaluate(test_data)
train_results = lr_model.evaluate(train_data)

In [43]:
# use code to show number of rows and columns,
# as well as a sample of 10 rows before heading into Machine Learning Modelling

### Step 5: Model Evaluation and Selection

In [44]:
train_results.predictions.show()

+--------------------+------------+------------------+
|               Xcols|resale_price|        prediction|
+--------------------+------------+------------------+
|(2993,[0,1,2,3,4,...|    281888.0|261808.03814540792|
|(2993,[0,1,2,3,4,...|    280000.0|283707.27704078064|
|(2993,[0,1,2,3,4,...|    375000.0| 345913.2488107191|
|(2993,[0,1,2,3,4,...|    328000.0|336740.91460026405|
|(2993,[0,1,2,3,4,...|    330000.0| 313709.6301737266|
|(2993,[0,1,2,3,4,...|    312000.0| 306988.7127639124|
|(2993,[0,1,2,3,4,...|    345000.0|336639.15919652465|
|(2993,[0,1,2,3,4,...|    335000.0| 336281.2354278392|
|(2993,[0,1,2,3,4,...|    298000.0|328128.62294328306|
|(2993,[0,1,2,3,4,...|    348000.0|349928.00438458833|
|(2993,[0,1,2,3,4,...|    330000.0| 348467.8443447413|
|(2993,[0,1,2,3,4,...|    345000.0| 361651.6128497516|
|(2993,[0,1,2,3,4,...|    282000.0| 333375.1408080085|
|(2993,[0,1,2,3,4,...|    350000.0| 356141.3395403697|
|(2993,[0,1,2,3,4,...|    328000.0|350243.22368292697|
|(2993,[0,

In [45]:
test_results.predictions.show()

+--------------------+------------+------------------+
|               Xcols|resale_price|        prediction|
+--------------------+------------+------------------+
|(2993,[0,1,2,3,4,...|    230000.0|255087.12073559372|
|(2993,[0,1,2,3,4,...|    350000.0| 356390.8554797844|
|(2993,[0,1,2,3,4,...|    343500.0| 366436.3543712138|
|(2993,[0,1,2,3,4,...|    348000.0| 372770.8830472375|
|(2993,[0,1,2,3,4,...|    275000.0|305088.47285441484|
|(2993,[0,1,2,3,4,...|    235000.0|261039.40396909765|
|(2993,[0,1,2,3,4,...|    275000.0| 296761.4931424358|
|(2993,[0,1,2,3,4,...|    318000.0|331558.04085681215|
|(2993,[0,1,2,3,4,...|    310000.0| 328801.6047223287|
|(2993,[0,1,2,3,4,...|    340000.0| 343816.0519340498|
|(2993,[0,1,2,3,4,...|    315000.0|322298.84090055537|
|(2993,[0,1,2,3,4,...|    253000.0| 305336.5434307745|
|(2993,[0,1,2,3,4,...|    275000.0|    289667.9497155|
|(2993,[0,1,2,3,4,...|    340000.0|315294.32333801757|
|(2993,[0,1,2,3,4,...|    270000.0| 312432.1850326834|
|(2993,[0,

In [46]:
# Print the MSE, MAE, and R2 values for the training set
print("MSE: %f" % train_results.meanSquaredError)
print("MAE: %f" % train_results.meanAbsoluteError)
print("R2: %f" % train_results.r2)

MSE: 1222362799.220270
MAE: 25819.072648
R2: 0.947616


In [47]:
# Print the MSE, MAE, and R2 values for the test set
print("MSE: %f" % test_results.meanSquaredError)
print("MAE: %f" % test_results.meanAbsoluteError)
print("R2: %f" % test_results.r2)

MSE: 1380246561.859578
MAE: 27601.393176
R2: 0.943570
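
MSE is expressed in squared price units, which is hard to read next to the prices themselves. Taking the square root gives RMSE in the same units as `resale_price`. The sketch below does that for the figures printed above and compares train and test error. It is plain arithmetic on the notebook's reported numbers, and the variable names are illustrative. (PySpark's regression summary also exposes `rootMeanSquaredError` directly.)

```python
import math

# Metrics copied from the printed outputs of In [46] and In [47].
train = {"mse": 1222362799.220270, "mae": 25819.072648, "r2": 0.947616}
test = {"mse": 1380246561.859578, "mae": 27601.393176, "r2": 0.943570}

# RMSE is the square root of MSE, in the units of resale_price.
rmse_train = math.sqrt(train["mse"])
rmse_test = math.sqrt(test["mse"])

# Relative increase in MAE when moving from training data to unseen data.
mae_gap_pct = 100 * (test["mae"] - train["mae"]) / train["mae"]

print(f"RMSE train: {rmse_train:,.0f}")
print(f"RMSE test:  {rmse_test:,.0f}")
print(f"MAE increase from train to test: {mae_gap_pct:.1f}%")
```

The test RMSE comes out at roughly 37 thousand, and the MAE rises by about 7% from train to test. A gap of that size, together with nearly equal R2 values, suggests the model is not overfitting badly, although a single unseeded split is limited evidence.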


In [48]:
pyspark_two_rows = test_data.limit(2)
pyspark_two_rows.show()

+--------------------+------------+
|               Xcols|resale_price|
+--------------------+------------+
|(2993,[0,1,2,3,4,...|    230000.0|
|(2993,[0,1,2,3,4,...|    350000.0|
+--------------------+------------+



In [49]:
lr_model.evaluate(pyspark_two_rows).predictions.show()

+--------------------+------------+------------------+
|               Xcols|resale_price|        prediction|
+--------------------+------------+------------------+
|(2993,[0,1,2,3,4,...|    230000.0|255087.12073559372|
|(2993,[0,1,2,3,4,...|    350000.0| 356390.8554797844|
+--------------------+------------+------------------+



### Step 6: Report

## Table of Contents
- [Problem Statement Formulation](#part1)
    - [Load and Explore the Data](#part2)
    - [Understand the Data](#part3)
    - [Formulate a Value Based Problem Statement](#part4)
    
    
- [Exploratory Data Analysis and Data Cleansing](#part5)
    - [Interesting Trends](#part6)
    - [Anomalies](#part7)
    - [Potential Errors](#part8)
    - [Missing Value Treatment](#part9)
    
    
- [Data Wrangling and Transformation](#part10)
    - [Categorical Data](#part11)
    - [Numerical Data](#part12)
    - [Others](#part13)
    
    
- [Machine Learning Modelling](#part14)
    - [Show Count of Rows and Columns](#part15)
    - [Sample of 10 Rows before Modelling](#part16)
    - [Build the Predictive Model](#part17)
    
    
- [Model Evaluation and Selection](#part18)
    - [Utilize Model Metrics for Evaluation](#part19)
    - [Compare Models and Decide on Final Model](#part20)
    
    
- [Summary and Further Improvements](#part21)
    - [Summarize your findings](#part22)
    - [Explain the possible further improvements](#part23)

## <a id='part1'>Problem Statement Formulation</a>
The goal of this assignment is to preprocess the data and build a simple linear regression model in PySpark that predicts the resale price of any given HDB resale transaction.

### <a id='part2'>Load and Explore the Data</a>

- <a id='part3'>part3</a>
- <a id='part4'>part4</a>
- <a id='part5'>part5</a>
- <a id='part6'>part6</a>
- <a id='part7'>part7</a>

### "Unlisted" YouTube Link to Video Presentation

In [None]:
# insert your link in this cell, you are allowed to comment it out
# youtube link: 