# Baseline solution

In this notebook, we implement and evaluate a naive solution to the prediction problem. This solution, which serves as a baseline, always predicts that the `NumBikes+1` label equals the current `NumBikes` value.

In [1]:
from pyspark.ml.evaluation import RegressionEvaluator
import pyspark.sql.functions as F

## Input data parsing

In [2]:
testData = spark.read.load("/user/garza/LabReply/ProjectData/testData.csv", format="csv", header=True, inferSchema=True)

In [3]:
testData.printSchema()
testData.show()

root
 |-- StationId: integer (nullable = true)
 |-- Timestamp: timestamp (nullable = true)
 |-- NumBikes-4: integer (nullable = true)
 |-- NumBikes-3: integer (nullable = true)
 |-- NumBikes-2: integer (nullable = true)
 |-- NumBikes-1: integer (nullable = true)
 |-- NumBikes: integer (nullable = true)
 |-- NumBikes+1: integer (nullable = true)

+---------+-------------------+----------+----------+----------+----------+--------+----------+
|StationId|          Timestamp|NumBikes-4|NumBikes-3|NumBikes-2|NumBikes-1|NumBikes|NumBikes+1|
+---------+-------------------+----------+----------+----------+----------+--------+----------+
|      122|2008-09-01 05:00:00|         0|         0|         0|         0|       0|         0|
|       16|2008-09-01 05:00:00|         0|         0|         0|         0|       0|         0|
|      163|2008-09-01 05:00:00|         0|         0|         0|         0|       0|         0|
|      170|2008-09-01 05:00:00|         0|         0|         0|         0| 

## Prediction

To implement the naive solution, we use the standard DataFrame API to derive the `prediction` and `label` columns from the existing `NumBikes` and `NumBikes+1` columns. The cast to `float` is needed because MLlib expects a floating-point type for the prediction column.

In [4]:
# Naive baseline: the current count is the prediction, the next count is the label.
predictedTestData = testData \
    .withColumn("prediction", F.col("NumBikes").astype("float")) \
    .withColumn("label", F.col("NumBikes+1").astype("float"))
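
The same transformation can be illustrated without Spark. The sketch below is only an illustration on plain Python dictionaries: the helper name `persistence_baseline` and the two sample rows are hypothetical and do not come from the test data. This kind of "repeat the last observed value" predictor is often called a persistence forecast.

```python
def persistence_baseline(rows):
    """Return (prediction, label) float pairs: predict NumBikes+1 with NumBikes."""
    return [(float(row["NumBikes"]), float(row["NumBikes+1"])) for row in rows]


# Hypothetical rows mirroring two columns of the test schema.
rows = [
    {"NumBikes": 0, "NumBikes+1": 0},
    {"NumBikes": 7, "NumBikes+1": 6},
]

pairs = persistence_baseline(rows)
print(pairs)
```

No model is trained at any point: the prediction is read directly from a column that is already available at prediction time.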

## Evaluation

We evaluate this naive labeling using two standard regression metrics: the Root Mean Squared Error (RMSE) and the R2 coefficient of determination.
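
For reference, the two metrics can be written as follows. These are the textbook definitions, and the notation is an assumption of this sketch rather than part of the notebook: $y_i$ is the observed `NumBikes+1` value of record $i$, $\hat{y}_i$ is the prediction (here the `NumBikes` value), $\bar{y}$ is the mean of the labels, and $n$ is the number of test records.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}
$$

RMSE is expressed in the same unit as the label (bikes), while $R^2$ is dimensionless and equals 1 for a perfect prediction.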

We observe that, even for this very simple solution, the resulting figures are quite promising, especially given the large size of the test set (more than 800K records). Indeed, as noted in the data exploration part, for the large majority of records the difference between `NumBikes+1` and `NumBikes` is zero. This is consistent with the high correlation between the two columns, which are almost linearly dependent in the available observations (the computed Pearson coefficient is 0.98).
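
The notebook does not show how the two quantities mentioned above were computed. As a minimal sketch, assuming the two columns have been collected into plain Python lists, they could be obtained as below. The helper names `pearson` and `unchanged_share` and the toy values are hypothetical and do not reproduce the 0.98 figure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def unchanged_share(xs, ys):
    """Fraction of records whose next value equals the current one."""
    return sum(1 for x, y in zip(xs, ys) if x == y) / len(xs)


# Hypothetical toy values standing in for NumBikes and NumBikes+1.
num_bikes = [0, 0, 3, 5, 5, 8, 10, 10]
num_bikes_next = [0, 0, 4, 5, 5, 7, 10, 10]

print("Pearson:", pearson(num_bikes, num_bikes_next))
print("Unchanged share:", unchanged_share(num_bikes, num_bikes_next))
```

On the full test set, a distributed computation (for example with Spark's own correlation functions) would be preferable to collecting the columns on the driver.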

In [5]:
# RegressionEvaluator reads the "prediction" and "label" columns by default.
rmseEvaluator = RegressionEvaluator(metricName="rmse")
print("RMSE:", rmseEvaluator.evaluate(predictedTestData))

RMSE: 1.4672708580571527


In [6]:
r2Evaluator = RegressionEvaluator(metricName="r2")
print("R2:", r2Evaluator.evaluate(predictedTestData))

R2: 0.9652016063526873
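
To make the two figures easier to interpret, the sketch below recomputes both metrics from their textbook definitions in plain Python. It is an illustration under assumptions, not the evaluator's implementation: the function names `rmse` and `r2` and the toy values are hypothetical.

```python
from math import sqrt


def rmse(labels, predictions):
    """Root of the mean squared difference between labels and predictions."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return sqrt(sum(squared_errors) / len(labels))


def r2(labels, predictions):
    """One minus the ratio of residual to total sum of squares."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy values: predictions play the role of NumBikes,
# labels the role of NumBikes+1.
predictions = [0.0, 2.0, 4.0, 6.0]
labels = [0.0, 3.0, 4.0, 5.0]

print("RMSE:", rmse(labels, predictions))
print("R2:", r2(labels, predictions))
```

Read this way, an RMSE of about 1.47 means the baseline is typically off by between one and two bikes, and an R2 of about 0.965 means it leaves only a small share of the label variance unexplained.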
