## Used Car Prediction

Main objective: predict the MMR (Manheim Market Report value, a market price estimate) for car sellers.

<img src="https://img.etimg.com/thumb/msid-106586397,width-300,height-225,imgsize-39324,resizemode-75/car-sales.jpg" />

In [2]:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('Used Car Prediction')
         .master('local[*]')
         .getOrCreate()
        )



In [3]:
df = spark.read.format('csv').options(header=True, inferSchema=True).load('dataset/car_prices.csv').repartition(4)
df.show(30, truncate=50)



+----+-------------+----------------+--------------------+-----------+------------+-----------------+-----+---------+--------+--------+--------+---------------------------------------------+-----+------------+---------------------------------------+
|year|         make|           model|                trim|       body|transmission|              vin|state|condition|odometer|   color|interior|                                       seller|  mmr|sellingprice|                               saledate|
+----+-------------+----------------+--------------------+-----------+------------+-----------------+-----+---------+--------+--------+--------+---------------------------------------------+-----+------------+---------------------------------------+
|2009|     Chrysler|Town and Country|             Touring|    Minivan|   automatic|2a8hr54179r570758|   wi|      1.6| 90655.0|    gold|     tan|                        dt credit corporation| 8425|        7900|Wed Jan 21 2015 02:00:00 GMT-0800 (PST)|




## Preprocessing & Cleaning

In [4]:
from pyspark.sql.types import *
from pyspark.sql.functions import *

In [5]:
# Count missing values (NaN or null) in each column
df.select([count(when(isnan(c) | isnull(c),1)).alias(c) for c in df.columns]).show()


+----+-----+-----+-----+-----+------------+---+-----+---------+--------+-----+--------+------+---+------------+--------+
|year| make|model| trim| body|transmission|vin|state|condition|odometer|color|interior|seller|mmr|sellingprice|saledate|
+----+-----+-----+-----+-----+------------+---+-----+---------+--------+-----+--------+------+---+------------+--------+
|   0|10301|10399|10651|13195|       65353|  4|    0|    11794|      94|  749|     749|     0|  0|           0|       0|
+----+-----+-----+-----+-----+------------+---+-----+---------+--------+-----+--------+------+---+------------+--------+
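The per-column count above treats a value as missing when it is either null or NaN. As a minimal plain-Python sketch of the same idea (not Spark code; the rows and the helper names `is_missing` and `count_missing` are made up for illustration):

```python
import math

def is_missing(value):
    """True for None or a float NaN, mirroring isnull(c) | isnan(c)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def count_missing(rows, columns):
    """Count missing values per column in a list of dict rows."""
    return {c: sum(1 for r in rows if is_missing(r.get(c))) for c in columns}

# Made-up rows for illustration only (not taken from car_prices.csv)
rows = [
    {"make": "kia", "odometer": 16639.0},
    {"make": None, "odometer": float("nan")},
    {"make": "bmw", "odometer": None},
]
missing = count_missing(rows, ["make", "odometer"])
print(missing)  # {'make': 1, 'odometer': 2}
```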



                                                                                

In [20]:
df = df.dropna(subset=["year", "make", "model", "body", "transmission", "condition", "odometer", "color"])
df.show()

+----+-------------+---------------+---------------+---------+------------+-----------------+-----+---------+--------+--------+--------+--------------------+-----+------------+--------------------+---------+
|year|         make|          model|           trim|     body|transmission|              vin|state|condition|odometer|   color|interior|              seller|  mmr|sellingprice|            saledate|body_type|
+----+-------------+---------------+---------------+---------+------------+-----------------+-----+---------+--------+--------+--------+--------------------+-----+------------+--------------------+---------+
|2013|    chevrolet|          cruze|            1LT|    sedan|   automatic|1g1pc5sb9d7167061|   va|      2.8|   29322|     red|    gray|fiserv/citizens a...|11200|       10200|Mon Dec 22 2014 0...|    sedan|
|2008|          gmc|         acadia|          SLT-2|      suv|   automatic|1gker33768j202809|   tx|      2.9|   97458|    gray|   black|ford motor credit...|12250|     

In [21]:
df.count()

440403

In [22]:
df.printSchema()

root
 |-- year: integer (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |-- trim: string (nullable = true)
 |-- body: string (nullable = true)
 |-- transmission: string (nullable = true)
 |-- vin: string (nullable = true)
 |-- state: string (nullable = true)
 |-- condition: double (nullable = true)
 |-- odometer: integer (nullable = true)
 |-- color: string (nullable = true)
 |-- interior: string (nullable = true)
 |-- seller: string (nullable = true)
 |-- mmr: integer (nullable = true)
 |-- sellingprice: integer (nullable = true)
 |-- saledate: string (nullable = true)
 |-- body_type: string (nullable = true)



In [23]:
df = df.where((df["transmission"] == "automatic") | (df["transmission"] == "manual"))
df = df.where((df["color"] != '—'))
df = df.where((df["interior"] != '—'))
df.show()

+----+-------------+---------------+---------------+---------+------------+-----------------+-----+---------+--------+--------+--------+--------------------+-----+------------+--------------------+---------+
|year|         make|          model|           trim|     body|transmission|              vin|state|condition|odometer|   color|interior|              seller|  mmr|sellingprice|            saledate|body_type|
+----+-------------+---------------+---------------+---------+------------+-----------------+-----+---------+--------+--------+--------+--------------------+-----+------------+--------------------+---------+
|2013|    chevrolet|          cruze|            1LT|    sedan|   automatic|1g1pc5sb9d7167061|   va|      2.8|   29322|     red|    gray|fiserv/citizens a...|11200|       10200|Mon Dec 22 2014 0...|    sedan|
|2008|          gmc|         acadia|          SLT-2|      suv|   automatic|1gker33768j202809|   tx|      2.9|   97458|    gray|   black|ford motor credit...|12250|     

In [24]:
df = df.withColumn("body", lower(df["body"]))
df = df.withColumn("make", lower(df["make"]))
df = df.withColumn("model", lower(df["model"]))
df.show()

+----+-------------+---------------+---------------+---------+------------+-----------------+-----+---------+--------+--------+--------+--------------------+-----+------------+--------------------+---------+
|year|         make|          model|           trim|     body|transmission|              vin|state|condition|odometer|   color|interior|              seller|  mmr|sellingprice|            saledate|body_type|
+----+-------------+---------------+---------------+---------+------------+-----------------+-----+---------+--------+--------+--------+--------------------+-----+------------+--------------------+---------+
|2013|    chevrolet|          cruze|            1LT|    sedan|   automatic|1g1pc5sb9d7167061|   va|      2.8|   29322|     red|    gray|fiserv/citizens a...|11200|       10200|Mon Dec 22 2014 0...|    sedan|
|2008|          gmc|         acadia|          SLT-2|      suv|   automatic|1gker33768j202809|   tx|      2.9|   97458|    gray|   black|ford motor credit...|12250|     

In [25]:
# Value counts for the categorical columns
# (models are sorted by descending count, the others by ascending count)
df.groupBy("make").count().sort("count").show()
df.groupBy("model").count().sort(desc("count")).show()
df.groupBy("body").count().sort("count").show()
df.groupBy("color").count().sort("count").show()
df.groupBy("interior").count().sort("count").show()
df.groupBy("transmission").count().sort("count").show()

+------------+-----+
|        make|count|
+------------+-----+
|       lotus|    1|
|      daewoo|    2|
| lamborghini|    3|
|      fisker|    9|
|    plymouth|   15|
|     ferrari|   15|
| rolls-royce|   15|
|         geo|   16|
|       tesla|   22|
|aston martin|   22|
|     bentley|  102|
|    maserati|  105|
|       isuzu|  167|
|  oldsmobile|  307|
|       smart|  332|
|        saab|  404|
|        fiat|  673|
|      hummer|  738|
|      suzuki|  945|
|     porsche| 1136|
+------------+-----+
only showing top 20 rows

+--------------+-----+
|         model|count|
+--------------+-----+
|        altima|15094|
|         f-150|10713|
|         camry|10405|
|        fusion|10079|
|        escape| 9127|
|        accord| 8327|
|         focus| 8258|
|        impala| 7275|
|         civic| 6894|
| grand caravan| 6586|
|       corolla| 6533|
|       g sedan| 6526|
|      3 series| 6503|
|        malibu| 6015|
|        sonata| 5513|
|silverado 1500| 5510|
|         cruze| 5211|
|       el

In [26]:
# Cast to respective data types
df = df.withColumn('mmr', col('mmr').cast('integer'))
df = df.withColumn('condition', col('condition').cast('double'))
df = df.withColumn('odometer', col('odometer').cast('integer'))

In [27]:
df.select(countDistinct("year")).show()
df.select(countDistinct("model")).show()


+--------------------+
|count(DISTINCT year)|
+--------------------+
|                  26|
+--------------------+

+---------------------+
|count(DISTINCT model)|
+---------------------+
|                  761|
+---------------------+



In [28]:
df.show()

+----+-------------+---------------+---------------+---------+------------+-----------------+-----+---------+--------+--------+--------+--------------------+-----+------------+--------------------+---------+
|year|         make|          model|           trim|     body|transmission|              vin|state|condition|odometer|   color|interior|              seller|  mmr|sellingprice|            saledate|body_type|
+----+-------------+---------------+---------------+---------+------------+-----------------+-----+---------+--------+--------+--------+--------------------+-----+------------+--------------------+---------+
|2013|    chevrolet|          cruze|            1LT|    sedan|   automatic|1g1pc5sb9d7167061|   va|      2.8|   29322|     red|    gray|fiserv/citizens a...|11200|       10200|Mon Dec 22 2014 0...|    sedan|
|2008|          gmc|         acadia|          SLT-2|      suv|   automatic|1gker33768j202809|   tx|      2.9|   97458|    gray|   black|ford motor credit...|12250|     

In [29]:
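def is_numeric(value):
    """Return True if value can be parsed as a float."""
    try:
        float(value)
        return True
    except (ValueError, TypeError):
        # TypeError covers None (null) values, which float() cannot convert
        return False

is_numeric_udf = udf(is_numeric, BooleanType())

# Show any rows whose mmr is not numeric
df.filter(~is_numeric_udf(col("mmr"))).show()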

+----+----+-----+----+----+------------+---+-----+---------+--------+-----+--------+------+---+------------+--------+---------+
|year|make|model|trim|body|transmission|vin|state|condition|odometer|color|interior|seller|mmr|sellingprice|saledate|body_type|
+----+----+-----+----+----+------------+---+-----+---------+--------+-----+--------+------+---+------------+--------+---------+
+----+----+-----+----+----+------------+---+-----+---------+--------+-----+--------+------+---+------------+--------+---------+





In [30]:
df.summary().show()


+-------+------------------+------+-----------------+-----------------+----------+------------+-----------------+------+------------------+------------------+------+--------+--------------------+-----------------+------------------+--------------------+---------+
|summary|              year|  make|            model|             trim|      body|transmission|              vin| state|         condition|          odometer| color|interior|              seller|              mmr|      sellingprice|            saledate|body_type|
+-------+------------------+------+-----------------+-----------------+----------+------------+-----------------+------+------------------+------------------+------+--------+--------------------+-----------------+------------------+--------------------+---------+
|  count|            440403|440403|           440403|           440403|    440403|      440403|           440403|440403|            440403|            440403|440403|  440403|              440403|           44

                                                                                

In [31]:
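# Body type categorization.
# The first matching condition wins, so "minivan" is checked before "%van%".
# A body that matches no condition gets a null body_type.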
df = df.withColumn(
    "body_type",
     when(col("body").like("%cab%"), "cab")
    .when(col("body").like("%suv%"), "suv")
    .when(col("body").like("%sedan%"), "sedan")
    .when(col("body").like("%hatchback%"), "hatchback")
    .when(col("body") == "minivan", "minivan")
    .when(col("body").like("%van%"), "van")
    .when(col("body") == "supercrew", "cab")
    .when(col("body").like("%coupe%"), "coupe")
    .when(col("body").like("%wagon%"), "wagon")
    .when(col("body").like("%convertible%"), "convertible")
    .when(col("body").like("%koup%"), "koup")
)
df.groupBy("body_type").count().sort(desc("count")).show()

+-----------+------+
|  body_type| count|
+-----------+------+
|      sedan|203397|
|        suv|112922|
|        cab| 38070|
|  hatchback| 21539|
|    minivan| 20907|
|      coupe| 16679|
|      wagon| 13378|
|convertible|  8905|
|        van|  4449|
|       koup|   157|
+-----------+------+
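The `when` chain above is order-sensitive: each row takes the label of the first condition it matches. As a rough plain-Python sketch of that behaviour (not Spark code; the `RULES` list and `body_type` helper are illustrative names, and the sample inputs are made-up examples):

```python
# Ordered rules copied from the when-chain above: (match kind, needle, label)
RULES = [
    ("contains", "cab", "cab"),
    ("contains", "suv", "suv"),
    ("contains", "sedan", "sedan"),
    ("contains", "hatchback", "hatchback"),
    ("equals", "minivan", "minivan"),
    ("contains", "van", "van"),
    ("equals", "supercrew", "cab"),
    ("contains", "coupe", "coupe"),
    ("contains", "wagon", "wagon"),
    ("contains", "convertible", "convertible"),
    ("contains", "koup", "koup"),
]

def body_type(body):
    """Return the label of the first matching rule, or None if none match."""
    for kind, needle, label in RULES:
        if kind == "equals" and body == needle:
            return label
        if kind == "contains" and needle in body:
            return label
    return None

# Example inputs (illustrative strings, assumed to be already lower-cased)
for sample in ["crew cab", "minivan", "cargo van", "supercrew", "roadster"]:
    print(sample, "->", body_type(sample))
```

If the `minivan` rule came after the `van` rule, every minivan would be labelled `van`, which is why the exact match is listed first.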



## Model-Making

In [32]:
# Select the candidate feature columns and the target column (mmr)
features_df = df.select(["year", "body_type", "transmission", "condition", "odometer", "color", "interior", "mmr"])
features_df.show()

+----+-----------+------------+---------+--------+--------+--------+-----+
|year|  body_type|transmission|condition|odometer|   color|interior|  mmr|
+----+-----------+------------+---------+--------+--------+--------+-----+
|2009|      sedan|   automatic|      2.9|   91341|   black|   black| 8650|
|2008|convertible|   automatic|      3.3|   69135|     red|   black| 7500|
|2008|        suv|   automatic|      4.1|  170726|    blue|    gray|10300|
|2006|      sedan|   automatic|      1.9|  110866|     red|    gray| 2025|
|2013|      sedan|      manual|      3.9|   24072|    gray|   black|11750|
|2013|        suv|   automatic|      3.7|   19499|  silver|   black|17050|
|2010|      sedan|   automatic|      2.0|  153354|    gray|    gray| 5750|
|2010|        cab|   automatic|      2.6|   76422|   black|    gray|19400|
|2014|        van|   automatic|      4.9|    5620|   white|    gray|20400|
|2014|        cab|   automatic|      4.2|   17665|     red|    gray|20500|
|2012|        suv|   auto

In [33]:
from pyspark.sql.types import *
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

In [34]:
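# Index each categorical column. The loop variable is named c so it cannot be
# confused with pyspark.sql.functions.col; Pipeline.fit fits every indexer.
categorical_cols = ["body_type", "transmission", "color", "interior"]
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in categorical_cols]
pipeline = Pipeline(stages=indexers)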
features_df = pipeline.fit(features_df).transform(features_df)
features_df.show()

+----+-----------+------------+---------+--------+--------+--------+-----+-------------+----------------+---------+------------+
|year|  body_type|transmission|condition|odometer|   color|interior|  mmr|body_type_idx|transmission_idx|color_idx|interior_idx|
+----+-----------+------------+---------+--------+--------+--------+-----+-------------+----------------+---------+------------+
|2009|      sedan|   automatic|      2.9|   91341|   black|   black| 8650|          0.0|             0.0|      0.0|         0.0|
|2008|convertible|   automatic|      3.3|   69135|     red|   black| 7500|          7.0|             0.0|      5.0|         0.0|
|2008|        suv|   automatic|      4.1|  170726|    blue|    gray|10300|          1.0|             0.0|      4.0|         1.0|
|2006|      sedan|   automatic|      1.9|  110866|     red|    gray| 2025|          0.0|             0.0|      5.0|         1.0|
|2013|      sedan|      manual|      3.9|   24072|    gray|   black|11750|          0.0|         
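By default `StringIndexer` orders labels by descending frequency, so the most common value receives index 0.0. The sketch below reproduces that ordering in plain Python (not Spark code). The counts are copied from the body_type table above and may differ slightly from the frame the indexer was actually fitted on; the helper name `frequency_desc_index` is illustrative.

```python
# Counts copied from the body_type table earlier in this notebook
counts = {
    "sedan": 203397, "suv": 112922, "cab": 38070, "hatchback": 21539,
    "minivan": 20907, "coupe": 16679, "wagon": 13378, "convertible": 8905,
    "van": 4449, "koup": 157,
}

def frequency_desc_index(label_counts):
    """Map each label to an index: most frequent first, ties broken alphabetically."""
    ordered = sorted(label_counts, key=lambda label: (-label_counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

index = frequency_desc_index(counts)
print(index["sedan"], index["suv"], index["convertible"])  # 0.0 1.0 7.0
```

These values agree with the `body_type_idx` column shown above (sedan 0.0, suv 1.0, convertible 7.0).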

In [35]:
# Note: "condition" was selected above but is not part of the feature vector
assembler = VectorAssembler(
    inputCols=["year", "body_type_idx", "transmission_idx", "odometer", "color_idx", "interior_idx"],
    outputCol="features"
)
features_df = assembler.transform(features_df)
features_df.show(truncate=False)

+----+-----------+------------+---------+--------+--------+--------+-----+-------------+----------------+---------+------------+---------------------------------+
|year|body_type  |transmission|condition|odometer|color   |interior|mmr  |body_type_idx|transmission_idx|color_idx|interior_idx|features                         |
+----+-----------+------------+---------+--------+--------+--------+-----+-------------+----------------+---------+------------+---------------------------------+
|2009|sedan      |automatic   |2.9      |91341   |black   |black   |8650 |0.0          |0.0             |0.0      |0.0         |(6,[0,3],[2009.0,91341.0])       |
|2008|convertible|automatic   |3.3      |69135   |red     |black   |7500 |7.0          |0.0             |5.0      |0.0         |[2008.0,7.0,0.0,69135.0,5.0,0.0] |
|2008|suv        |automatic   |4.1      |170726  |blue    |gray    |10300|1.0          |0.0             |4.0      |1.0         |[2008.0,1.0,0.0,170726.0,4.0,1.0]|
|2006|sedan      |auto
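Some rows in the `features` column print as `(6,[0,3],[2009.0,91341.0])` rather than as a bracketed list. That is the sparse vector notation: size 6, with non-zero values only at positions 0 and 3. A small plain-Python sketch of how the two forms relate (the helper name `to_dense` is illustrative, not a Spark function):

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# The first row of the output above, expanded
row = to_dense(6, [0, 3], [2009.0, 91341.0])
print(row)  # [2009.0, 0.0, 0.0, 91341.0, 0.0, 0.0]
```

Positions 1, 2, 4 and 5 are the index columns, which are all 0.0 for a black automatic sedan with a black interior, so only `year` and `odometer` are stored.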

In [36]:
features_vector_df = features_df.select(['features','mmr'])
features_vector_df = features_vector_df.withColumn('mmr', features_vector_df['mmr'].cast(IntegerType()))
features_vector_df.show(truncate=False)

+---------------------------------+-----+
|features                         |mmr  |
+---------------------------------+-----+
|(6,[0,3],[2008.0,128962.0])      |5475 |
|[2011.0,0.0,0.0,52875.0,0.0,2.0] |16950|
|[2002.0,0.0,0.0,125093.0,3.0,0.0]|2475 |
|[2014.0,5.0,0.0,4760.0,5.0,0.0]  |18100|
|[2011.0,1.0,0.0,49007.0,2.0,0.0] |21000|
|[2010.0,1.0,0.0,82837.0,0.0,1.0] |27600|
|[2011.0,2.0,0.0,47190.0,5.0,1.0] |15700|
|[2004.0,2.0,0.0,117633.0,9.0,2.0]|4825 |
|[2015.0,1.0,0.0,21433.0,3.0,0.0] |19600|
|[2005.0,0.0,0.0,118274.0,4.0,0.0]|4775 |
|[2013.0,0.0,0.0,19031.0,0.0,10.0]|33100|
|[2011.0,0.0,0.0,72619.0,1.0,0.0] |7875 |
|[2008.0,0.0,0.0,108617.0,0.0,1.0]|4475 |
|[2006.0,0.0,0.0,97114.0,0.0,1.0] |10500|
|[2006.0,1.0,0.0,117611.0,1.0,1.0]|3525 |
|[2012.0,0.0,0.0,56051.0,3.0,0.0] |11200|
|[2004.0,0.0,1.0,126041.0,2.0,0.0]|2325 |
|[2013.0,0.0,0.0,39510.0,8.0,0.0] |19800|
|[2013.0,1.0,0.0,21231.0,1.0,1.0] |15650|
|[2013.0,4.0,0.0,26123.0,3.0,0.0] |13250|
+---------------------------------

In [48]:
(trainData, testData) = features_vector_df.randomSplit([0.8, 0.2], seed=42)
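`randomSplit` assigns rows to splits at random according to the weights, so the 80/20 proportions are approximate rather than exact, and the seed makes the assignment repeatable. The following is a simplified plain-Python illustration of that idea, not Spark's actual implementation; `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split by drawing a uniform number per row."""
    total = float(sum(weights))
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)  # cumulative upper bound of each split
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(row)
                break
        else:
            splits[-1].append(row)  # guard against float rounding in the last bound
    return splits

rows = list(range(1000))
train, test = random_split(rows, [0.8, 0.2], seed=42)
print(len(train), len(test))  # roughly 800 and 200, not exactly
```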

In [49]:
trainData.show()

+--------------------+----+
|            features| mmr|
+--------------------+----+
|(6,[0,3],[1991.0,...| 825|
|(6,[0,3],[1995.0,...|1825|
|(6,[0,3],[1995.0,...| 550|
|(6,[0,3],[1996.0,...| 850|
|(6,[0,3],[1996.0,...| 850|
|(6,[0,3],[1997.0,...| 950|
|(6,[0,3],[1998.0,...|1650|
|(6,[0,3],[1998.0,...| 700|
|(6,[0,3],[1998.0,...| 925|
|(6,[0,3],[1999.0,...|1575|
|(6,[0,3],[1999.0,...|2275|
|(6,[0,3],[1999.0,...|2025|
|(6,[0,3],[1999.0,...|1000|
|(6,[0,3],[1999.0,...|1475|
|(6,[0,3],[1999.0,...|2025|
|(6,[0,3],[2000.0,...|2050|
|(6,[0,3],[2000.0,...|2850|
|(6,[0,3],[2000.0,...|1925|
|(6,[0,3],[2000.0,...|1750|
|(6,[0,3],[2000.0,...|1725|
+--------------------+----+
only showing top 20 rows



In [50]:
# Import regression models

from pyspark.ml.regression import RandomForestRegressor

In [51]:
rf = RandomForestRegressor(labelCol="mmr", featuresCol="features").fit(trainData)


In [52]:
predictedData = rf.transform(testData)
predictedData.show()

+--------------------+----+------------------+
|            features| mmr|        prediction|
+--------------------+----+------------------+
|(6,[0,3],[1995.0,...|1000|3772.2759705701947|
|(6,[0,3],[1997.0,...|1175|3772.2759705701947|
|(6,[0,3],[1998.0,...|4000|12315.465208868423|
|(6,[0,3],[1999.0,...| 825|5271.3616879513675|
|(6,[0,3],[2000.0,...|2900|11150.932468240342|
|(6,[0,3],[2000.0,...|2625| 4145.303389299348|
|(6,[0,3],[2000.0,...| 950|3772.2759705701947|
|(6,[0,3],[2001.0,...|2300| 6626.539169637092|
|(6,[0,3],[2001.0,...|1425|3753.7845590225274|
|(6,[0,3],[2001.0,...|1850|3753.7845590225274|
|(6,[0,3],[2001.0,...|2175|3753.7845590225274|
|(6,[0,3],[2001.0,...|2525|3753.7845590225274|
|(6,[0,3],[2001.0,...|3550|3753.7845590225274|
|(6,[0,3],[2001.0,...| 975|3753.7845590225274|
|(6,[0,3],[2002.0,...|5125| 6700.053904849291|
|(6,[0,3],[2002.0,...|1800| 5344.876423163568|
|(6,[0,3],[2002.0,...|3075|3827.2992942347278|
|(6,[0,3],[2003.0,...|6025| 6363.784416557144|
|(6,[0,3],[20

In [53]:
from pyspark.ml.regression import LinearRegression

In [54]:
lr = LinearRegression(labelCol="mmr", featuresCol="features")

In [55]:
# Fit the LinearRegression model to the training data
lrModel = lr.fit(trainData)

# Use the trained model to make predictions on the test data
predictedData = lrModel.transform(testData)

# Show the predicted data
predictedData.show()

24/05/21 22:13:44 WARN Instrumentation: [f0a03281] regParam is zero, which might cause numerical instability and overfitting.


+--------------------+----+------------------+
|            features| mmr|        prediction|
+--------------------+----+------------------+
|(6,[0,3],[1995.0,...|1000|-7477.758853127714|
|(6,[0,3],[1997.0,...|1175|-4854.502307966119|
|(6,[0,3],[1998.0,...|4000| 6491.479753039312|
|(6,[0,3],[1999.0,...| 825|1484.4832051359117|
|(6,[0,3],[2000.0,...|2900| 6049.502044158056|
|(6,[0,3],[2000.0,...|2625|1243.0052000151481|
|(6,[0,3],[2000.0,...| 950| -3088.20066066063|
|(6,[0,3],[2001.0,...|2300| 4934.584440817358|
|(6,[0,3],[2001.0,...|1425| 568.7619451463688|
|(6,[0,3],[2001.0,...|1850|285.00870034401305|
|(6,[0,3],[2001.0,...|2175|267.64307053689845|
|(6,[0,3],[2001.0,...|2525|27.045994859188795|
|(6,[0,3],[2001.0,...|3550|-85.17150732688606|
|(6,[0,3],[2001.0,...| 975|-800.7156926142052|
|(6,[0,3],[2002.0,...|5125| 5838.550255252281|
|(6,[0,3],[2002.0,...|1800|4077.8014798900113|
|(6,[0,3],[2002.0,...|3075|  1579.44328004634|
|(6,[0,3],[2003.0,...|6025| 5978.657607411966|
|(6,[0,3],[20

## Model Evaluation

In [56]:
# Import the regression evaluator

from pyspark.ml.evaluation import RegressionEvaluator


In [57]:
# predictedData was reassigned in In [55], so these metrics describe the LinearRegression model
evaluator1 = RegressionEvaluator(
    labelCol="mmr", predictionCol="prediction", metricName="rmse")
evaluator2 = RegressionEvaluator(
    labelCol="mmr", predictionCol="prediction", metricName="mse")
evaluator3 = RegressionEvaluator(
    labelCol="mmr", predictionCol="prediction", metricName="r2")
rmse = evaluator1.evaluate(predictedData)
mse = evaluator2.evaluate(predictedData)
r2 = evaluator3.evaluate(predictedData)
print("Root Mean Squared Error (RMSE)", rmse)
print("Mean Squared Error (MSE)", mse)
print("R-squared (R²)", r2)

Root Mean Squared Error (RMSE) 7223.210532345055
Mean Squared Error (MSE) 52174770.394580536
R-squared (R²) 0.4055157492134709
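For reference, the three metrics can be written out by hand. The sketch below is plain Python on made-up values (not Spark code, and `regression_metrics` is an illustrative helper); it also checks that the RMSE and MSE reported above are consistent with each other, since RMSE is the square root of MSE.

```python
import math

def regression_metrics(actual, predicted):
    """Compute MSE, RMSE and R² for two equal-length lists of numbers."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total sum of squares
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1.0 - ss_res / ss_tot}

# Made-up values: predicting the mean everywhere gives R² = 0
metrics = regression_metrics([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
print(metrics)

# Consistency check on the figures printed above
reported_rmse = 7223.210532345055
reported_mse = 52174770.394580536
consistent = math.isclose(reported_rmse ** 2, reported_mse, rel_tol=1e-9)
print(consistent)
```

An R² of about 0.41 means the model explains roughly 41% of the variance in `mmr` on the test split.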


In [58]:
# Absolute percentage error per row: |actual - predicted| / actual
predictions = predictedData.withColumn("ape", abs((col("mmr") - col("prediction")) / col("mmr")))
mape = predictions.select(mean(col("ape"))).collect()[0][0] * 100

print(f"Mean Absolute Percentage Error (MAPE) on test data = {mape}%")

Mean Absolute Percentage Error (MAPE) on test data = 63.076883842008236%
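MAPE averages the per-row absolute percentage error, so cheap cars with large misses weigh heavily, and it is undefined when an actual value is 0. A minimal plain-Python sketch (not Spark code; `mape` is an illustrative helper, and the second example reuses the first row of the LinearRegression output above):

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Assumes no actual value is 0."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors) * 100

# Made-up values: both predictions are off by 10%
simple = mape([100.0, 200.0], [110.0, 180.0])
print(simple)

# First row of the LinearRegression predictions: mmr 1000, prediction about -7477.76
single_row = mape([1000.0], [-7477.758853127714])
print(single_row)  # far above 100%, because the prediction is negative
```

Rows like the second one, where the linear model predicts a negative price for an old, low-value car, are one plausible reason the overall MAPE is as high as 63%.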
