## PySpark-Regression
**PySpark** comes with a powerful **machine learning** library called **MLlib**. In the notebook below we use MLlib's regression models to predict the hourly bike rental count.

In [1]:
#Let's import PySpark
from pyspark.sql import SparkSession

In [2]:
#Let's start a Spark session
spark = SparkSession.builder.appName('regression').getOrCreate()

Load the Seoul bike sharing data from the UCI repository:

https://archive.ics.uci.edu/ml/datasets/Seoul+Bike+Sharing+Demand

In [3]:
df = spark.read.csv('../data/SeoulBikeData/SeoulBikeData.csv',
                    header=True)

In [4]:
#Show the data
df.show()

+----------+-----------------+----+---------------+-----------+----------------+----------------+-------------------------+-----------------------+------------+-------------+-------+----------+---------------+
|      Date|Rented Bike Count|Hour|Temperature(�C)|Humidity(%)|Wind speed (m/s)|Visibility (10m)|Dew point temperature(�C)|Solar Radiation (MJ/m2)|Rainfall(mm)|Snowfall (cm)|Seasons|   Holiday|Functioning Day|
+----------+-----------------+----+---------------+-----------+----------------+----------------+-------------------------+-----------------------+------------+-------------+-------+----------+---------------+
|01/12/2017|              254|   0|           -5.2|         37|             2.2|            2000|                    -17.6|                      0|           0|            0| Winter|No Holiday|            Yes|
|01/12/2017|              204|   1|           -5.5|         38|             0.8|            2000|                    -17.6|                      0|           0|

In [5]:
#printSchema
df.printSchema()

root
 |-- Date: string (nullable = true)
 |-- Rented Bike Count: string (nullable = true)
 |-- Hour: string (nullable = true)
 |-- Temperature(�C): string (nullable = true)
 |-- Humidity(%): string (nullable = true)
 |-- Wind speed (m/s): string (nullable = true)
 |-- Visibility (10m): string (nullable = true)
 |-- Dew point temperature(�C): string (nullable = true)
 |-- Solar Radiation (MJ/m2): string (nullable = true)
 |-- Rainfall(mm): string (nullable = true)
 |-- Snowfall (cm): string (nullable = true)
 |-- Seasons: string (nullable = true)
 |-- Holiday: string (nullable = true)
 |-- Functioning Day: string (nullable = true)



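Note that Spark read every column as a string, because no schema was supplied and `inferSchema` is off by default. Below we reload the data with an explicit schema so that the numeric columns get numeric types.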

In [6]:
#Get the column names
print(df.columns)

#Rename columns to drop the units and the undecoded degree sign (shown as �)
df = df.withColumnRenamed('Temperature(�C)','Temperature') \
        .withColumnRenamed('Humidity(%)','Humidity') \
        .withColumnRenamed('Dew point temperature(�C)', 'Dew point temperature')

['Date', 'Rented Bike Count', 'Hour', 'Temperature(�C)', 'Humidity(%)', 'Wind speed (m/s)', 'Visibility (10m)', 'Dew point temperature(�C)', 'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)', 'Seasons', 'Holiday', 'Functioning Day']


In [7]:
print(df.columns)

['Date', 'Rented Bike Count', 'Hour', 'Temperature', 'Humidity', 'Wind speed (m/s)', 'Visibility (10m)', 'Dew point temperature', 'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)', 'Seasons', 'Holiday', 'Functioning Day']


In [8]:
#Import the types needed to define an explicit schema
from pyspark.sql.types import (StructField, 
                               StringType, 
                               IntegerType,
                               DateType,
                               DoubleType,
                               StructType)

In [9]:
data_schema = [StructField('Date', StringType(), True), 
               StructField('Rented Bike Count', IntegerType(), True),
               StructField('Hour', IntegerType(), True),
               StructField('Temperature', DoubleType(), True),
               StructField('Humidity', DoubleType(), True),
               StructField('Wind speed (m/s)', DoubleType(), True),
               StructField('Visibility (10m)', DoubleType(), True),
               StructField('Dew point temperature', DoubleType(), True),
               StructField('Solar Radiation (MJ/m2)', DoubleType(), True),
               StructField('Rainfall(mm)', DoubleType(), True),
               StructField('Snowfall (cm)', DoubleType(), True),
               StructField('Seasons', StringType(), True),
               StructField('Holiday', StringType(), True),
               StructField('Functioning Day',  StringType(), True)
              ]
final_struct = StructType(fields=data_schema)

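Reload the data with the explicit schema. By default Spark applies the schema by column position and ignores the header names, so the shorter names defined above replace the original ones.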

In [10]:
df = spark.read.csv('../data/SeoulBikeData/SeoulBikeData.csv', header=True, schema=final_struct)

In [11]:
#Show the dataframe
df.show()

+----------+-----------------+----+-----------+--------+----------------+----------------+---------------------+-----------------------+------------+-------------+-------+----------+---------------+
|      Date|Rented Bike Count|Hour|Temperature|Humidity|Wind speed (m/s)|Visibility (10m)|Dew point temperature|Solar Radiation (MJ/m2)|Rainfall(mm)|Snowfall (cm)|Seasons|   Holiday|Functioning Day|
+----------+-----------------+----+-----------+--------+----------------+----------------+---------------------+-----------------------+------------+-------------+-------+----------+---------------+
|01/12/2017|              254|   0|       -5.2|    37.0|             2.2|          2000.0|                -17.6|                    0.0|         0.0|          0.0| Winter|No Holiday|            Yes|
|01/12/2017|              204|   1|       -5.5|    38.0|             0.8|          2000.0|                -17.6|                    0.0|         0.0|          0.0| Winter|No Holiday|            Yes|
|01/1

In [12]:
#Summary statistics (count, mean, stddev, min, max) for each column
df.describe().show()

+-------+----------+-----------------+-----------------+------------------+------------------+------------------+-----------------+---------------------+-----------------------+------------------+-------------------+-------+----------+---------------+
|summary|      Date|Rented Bike Count|             Hour|       Temperature|          Humidity|  Wind speed (m/s)| Visibility (10m)|Dew point temperature|Solar Radiation (MJ/m2)|      Rainfall(mm)|      Snowfall (cm)|Seasons|   Holiday|Functioning Day|
+-------+----------+-----------------+-----------------+------------------+------------------+------------------+-----------------+---------------------+-----------------------+------------------+-------------------+-------+----------+---------------+
|  count|      8760|             8760|             8760|              8760|              8760|              8760|             8760|                 8760|                   8760|              8760|               8760|   8760|      8760|         

In [13]:
#Double check the schema
df.printSchema()

root
 |-- Date: string (nullable = true)
 |-- Rented Bike Count: integer (nullable = true)
 |-- Hour: integer (nullable = true)
 |-- Temperature: double (nullable = true)
 |-- Humidity: double (nullable = true)
 |-- Wind speed (m/s): double (nullable = true)
 |-- Visibility (10m): double (nullable = true)
 |-- Dew point temperature: double (nullable = true)
 |-- Solar Radiation (MJ/m2): double (nullable = true)
 |-- Rainfall(mm): double (nullable = true)
 |-- Snowfall (cm): double (nullable = true)
 |-- Seasons: string (nullable = true)
 |-- Holiday: string (nullable = true)
 |-- Functioning Day: string (nullable = true)



In [14]:
#Convert the Date string column to a DateType column
from pyspark.sql.functions import to_date
df = df.withColumn('New_date', to_date(df['Date'],format='dd/MM/yyyy'))
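
The pattern `dd/MM/yyyy` is day-first, so the first row's `01/12/2017` is 1 December 2017, not 12 January. As a quick sanity check outside Spark, the same string can be parsed with Python's standard library. This is only an illustrative sketch: the helper name `parse_day_first` is made up here, and `strptime` uses `%d/%m/%Y` where Spark uses `dd/MM/yyyy`.

```python
from datetime import datetime

def parse_day_first(text):
    """Parse a day/month/year string such as '01/12/2017' into a date."""
    return datetime.strptime(text, '%d/%m/%Y').date()

# First Date value shown in the dataframe above
parsed = parse_day_first('01/12/2017')
print(parsed)  # 2017-12-01
```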

In [15]:
#Drop old date column
df = df.drop(df['Date'])
#Rename New_date to Date (use the exact name created above)
df = df.withColumnRenamed('New_date', 'Date')
df.printSchema()

root
 |-- Rented Bike Count: integer (nullable = true)
 |-- Hour: integer (nullable = true)
 |-- Temperature: double (nullable = true)
 |-- Humidity: double (nullable = true)
 |-- Wind speed (m/s): double (nullable = true)
 |-- Visibility (10m): double (nullable = true)
 |-- Dew point temperature: double (nullable = true)
 |-- Solar Radiation (MJ/m2): double (nullable = true)
 |-- Rainfall(mm): double (nullable = true)
 |-- Snowfall (cm): double (nullable = true)
 |-- Seasons: string (nullable = true)
 |-- Holiday: string (nullable = true)
 |-- Functioning Day: string (nullable = true)
 |-- Date: date (nullable = true)



In [17]:
#Import vector and VectorAssembler
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [18]:
#Assemble the features vector
assembler = VectorAssembler(inputCols=['Hour',
                                       'Temperature',
                                       'Humidity',
                                       'Wind speed (m/s)',
                                       'Visibility (10m)', 
                                       'Dew point temperature',
                                       'Solar Radiation (MJ/m2)',
                                       'Rainfall(mm)',
                                       'Snowfall (cm)'], 
                           outputCol='features')

In [19]:
output = assembler.transform(df)

In [20]:
df_final = output.select(['features', 'Rented Bike Count'])
#Rename the Rented Bike Count as label
df_final = df_final.withColumnRenamed('Rented Bike Count', 'label')
df_final.show()

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[0.0,-5.2,37.0,2....|  254|
|[1.0,-5.5,38.0,0....|  204|
|[2.0,-6.0,39.0,1....|  173|
|[3.0,-6.2,40.0,0....|  107|
|[4.0,-6.0,36.0,2....|   78|
|[5.0,-6.4,37.0,1....|  100|
|[6.0,-6.6,35.0,1....|  181|
|[7.0,-7.4,38.0,0....|  460|
|[8.0,-7.6,37.0,1....|  930|
|[9.0,-6.5,27.0,0....|  490|
|[10.0,-3.5,24.0,1...|  339|
|[11.0,-0.5,21.0,1...|  360|
|[12.0,1.7,23.0,1....|  449|
|[13.0,2.4,25.0,1....|  451|
|[14.0,3.0,26.0,2....|  447|
|[15.0,2.1,36.0,3....|  463|
|[16.0,1.2,54.0,4....|  484|
|[17.0,0.8,58.0,1....|  555|
|[18.0,0.6,66.0,1....|  862|
|[19.0,0.0,77.0,1....|  600|
+--------------------+-----+
only showing top 20 rows



In [21]:
#Train/test split (70/30); no seed is set, so the split and the scores below vary between runs
train_data, test_data = df_final.randomSplit([0.7, 0.3])

In [22]:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression()
lr_model = lr.fit(train_data)
test_result = lr_model.evaluate(test_data)
train_result = lr_model.evaluate(train_data)

In [23]:
#Let's print R2 for the test and training sets
print(f"Test Result = {test_result.r2:0.3}")
print(f"Train Result = {train_result.r2:0.3}")

Test Result = 0.467
Train Result = 0.472
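
For reference, R2 compares the model's squared error with the squared error of always predicting the mean label, so 1.0 is a perfect fit and 0.0 is no better than the mean. The plain-Python sketch below illustrates the formula only. The function name `r2_score` and the sample numbers are made up for illustration and are not taken from the notebook's data.

```python
def r2_score(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    # Squared error of the model's predictions
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    # Squared error of always predicting the mean label
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

# Hypothetical labels and predictions, for illustration only
labels = [100, 200, 300, 400]
predictions = [110, 190, 310, 390]

r2_model = r2_score(labels, predictions)
r2_mean = r2_score(labels, [250, 250, 250, 250])
print(f"{r2_model:0.3}")  # 0.992
print(f"{r2_mean:0.3}")   # 0.0
```

Read this way, an R2 of about 0.47 means the linear model removes a little under half of the squared error of the mean-only baseline.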


Now let's try L2 (ridge), L1 (lasso), and elastic net regularization.
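
In MLlib's `LinearRegression`, `regParam` sets the overall strength of the penalty and `elasticNetParam` sets the mix between L1 and L2: 0 gives pure L2, 1 gives pure L1. The sketch below shows the penalty term as described in the MLlib documentation for linear methods. It is plain Python rather than Spark code, and the function name `elastic_net_penalty` and the sample weights are illustrative assumptions.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty = regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    alpha = elastic_net_param
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)

# Hypothetical coefficient vector, for illustration only
weights = [3.0, -4.0]

ridge = elastic_net_penalty(weights, reg_param=2.0, elastic_net_param=0.0)
lasso = elastic_net_penalty(weights, reg_param=2.0, elastic_net_param=1.0)
mixed = elastic_net_penalty(weights, reg_param=2.0, elastic_net_param=0.5)
print(ridge, lasso, mixed)  # 25.0 14.0 19.5
```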

In [24]:
#L2 (ridge): elasticNetParam defaults to 0.0
#Let's set up a grid search over the regularization parameter
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LinearRegression()
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [5, 3, 2.0, 1, 0.1]) \
    .build()

crossval = CrossValidator(estimator=lr,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator(),
                          numFolds=5) 
cv_model = crossval.fit(train_data)
test_result = cv_model.transform(test_data)
train_result = cv_model.transform(train_data)

#get regularization parameter for best Model 
print(f"Regularization Parameter for bestModel {cv_model.bestModel.getOrDefault('regParam')}")

#Evaluate the model
re = RegressionEvaluator()
print(f"Test result = {re.evaluate(test_result, {re.metricName: 'r2'}):0.3}")
print(f"Train result = {re.evaluate(train_result, {re.metricName: 'r2'}):0.3}")

Regularization Parameter for bestModel 2.0
Test result = 0.467
Train result = 0.472


In [25]:
#L1 (lasso): elasticNetParam=1.0
#Let's set up a grid search over the regularization parameter

lr = LinearRegression(elasticNetParam=1.0)
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [5, 3, 2.0, 1, 0.1]) \
    .build()

crossval = CrossValidator(estimator=lr,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator(),
                          numFolds=5) 
cv_model = crossval.fit(train_data)
test_result = cv_model.transform(test_data)
train_result = cv_model.transform(train_data)

#get regularization parameter for best Model 
print(f"Regularization Parameter for bestModel {cv_model.bestModel.getOrDefault('regParam')}")

#Evaluate the model
re = RegressionEvaluator()
print(f"Test result = {re.evaluate(test_result, {re.metricName: 'r2'}):0.3}")
print(f"Train result = {re.evaluate(train_result, {re.metricName: 'r2'}):0.3}")

Regularization Parameter for bestModel 1.0
Test result = 0.466
Train result = 0.472


In [26]:
#Elastic net
#Let's set up a grid search over both regParam and elasticNetParam

lr = LinearRegression()
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [5, 3, 2.0, 1, 0.1]) \
    .addGrid(lr.elasticNetParam, [0, 0.25, 0.5, 0.75, 1]) \
    .build()

crossval = CrossValidator(estimator=lr,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator(),
                          numFolds=5) 
cv_model = crossval.fit(train_data)
test_result = cv_model.transform(test_data)
train_result = cv_model.transform(train_data)

#get regularization parameter for best Model 
print(f"Regularization Parameter for bestModel {cv_model.bestModel.getOrDefault('regParam')}")
print(f"Elastic Net Parameter for bestModel {cv_model.bestModel.getOrDefault('elasticNetParam')}")

#Evaluate the model
re = RegressionEvaluator()
print(f"Test result = {re.evaluate(test_result, {re.metricName: 'r2'}):0.3}")
print(f"Train result = {re.evaluate(train_result, {re.metricName: 'r2'}):0.3}")

Regularization Parameter for bestModel 1.0
Elastic Net Parameter for bestModel 1.0
Test result = 0.466
Train result = 0.472
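
The elastic net search is noticeably more expensive than the two before it, because `ParamGridBuilder` takes the cross product of the two lists and `CrossValidator` fits every combination once per fold. The plain-Python sketch below only counts the fits and does not call Spark. The variable names are illustrative.

```python
from itertools import product

# Values copied from the grid in the cell above
reg_params = [5, 3, 2.0, 1, 0.1]
elastic_net_params = [0, 0.25, 0.5, 0.75, 1]
num_folds = 5

# One entry per (regParam, elasticNetParam) combination
grid = [{'regParam': r, 'elasticNetParam': e}
        for r, e in product(reg_params, elastic_net_params)]

# Each combination is trained once per fold during the search
fits_during_search = len(grid) * num_folds
print(len(grid), fits_during_search)  # 25 125
```

After the search, the best combination is fit once more on the full training set to produce `bestModel`.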


Let's try tree-based methods:
1. Decision tree regressor
2. Random forest regressor
3. Gradient boosted trees regressor

In [27]:
from pyspark.ml.regression import ( DecisionTreeRegressor,
                                   RandomForestRegressor,
                                   GBTRegressor )

In [28]:
#Create three models
#Decision tree
dtr = DecisionTreeRegressor()
#Random Forest
rfr = RandomForestRegressor(numTrees=100)
#Gradient Boosting trees
gbt = GBTRegressor()

In [29]:
#Train three models
dtr_model = dtr.fit(train_data)
rfr_model = rfr.fit(train_data)
gbt_model = gbt.fit(train_data)

In [30]:
#Predict on test and train
#Decision Tree
dtr_train_result = dtr_model.transform(train_data)
dtr_test_result = dtr_model.transform(test_data)

#Random Forest
rfr_train_result = rfr_model.transform(train_data)
rfr_test_result = rfr_model.transform(test_data)

#Gradient Boosting trees
gbt_train_result = gbt_model.transform(train_data)
gbt_test_result = gbt_model.transform(test_data)



In [31]:
dtr_train_result.show()

+--------------------+-----+------------------+
|            features|label|        prediction|
+--------------------+-----+------------------+
|(9,[0,1,2,4],[5.0...|  162| 163.1220472440945|
|(9,[1,2,3,4],[20....|  841| 759.1698113207547|
|(9,[1,2,4,5],[10....|  520| 431.8782608695652|
|(9,[1,2,4,5],[11....|  848| 431.8782608695652|
|(9,[2,3,4,5],[75....|  177|159.11016949152543|
|[0.0,-15.9,43.0,3...|   78|159.11016949152543|
|[0.0,-15.0,42.0,1...|   80|159.11016949152543|
|[0.0,-12.3,47.0,0...|  116|159.11016949152543|
|[0.0,-11.0,51.0,1...|  133|159.11016949152543|
|[0.0,-10.3,54.0,2...|   98|159.11016949152543|
|[0.0,-10.0,34.0,1...|  108|159.11016949152543|
|[0.0,-9.5,48.0,1....|  168|159.11016949152543|
|[0.0,-9.3,45.0,0....|   80|159.11016949152543|
|[0.0,-8.2,50.0,1....|  136|159.11016949152543|
|[0.0,-8.1,41.0,2....|  175|159.11016949152543|
|[0.0,-7.9,37.0,2....|  133|159.11016949152543|
|[0.0,-7.7,52.0,3....|  103|159.11016949152543|
|[0.0,-7.5,36.0,2....|  125|159.11016949

In [32]:
rfr_train_result.show()

+--------------------+-----+------------------+
|            features|label|        prediction|
+--------------------+-----+------------------+
|(9,[0,1,2,4],[5.0...|  162|261.51091130022155|
|(9,[1,2,3,4],[20....|  841| 705.0882521556236|
|(9,[1,2,4,5],[10....|  520|401.15577894706706|
|(9,[1,2,4,5],[11....|  848| 451.3354875930239|
|(9,[2,3,4,5],[75....|  177|208.77549117110797|
|[0.0,-15.9,43.0,3...|   78| 126.6858377403711|
|[0.0,-15.0,42.0,1...|   80| 133.8327623207651|
|[0.0,-12.3,47.0,0...|  116|160.89342655688964|
|[0.0,-11.0,51.0,1...|  133|160.66495972386616|
|[0.0,-10.3,54.0,2...|   98|164.73509088656098|
|[0.0,-10.0,34.0,1...|  108|  160.612337935749|
|[0.0,-9.5,48.0,1....|  168|161.03620852099587|
|[0.0,-9.3,45.0,0....|   80|162.20137116839476|
|[0.0,-8.2,50.0,1....|  136|167.57162816083738|
|[0.0,-8.1,41.0,2....|  175|158.54194806499592|
|[0.0,-7.9,37.0,2....|  133|158.33677437126363|
|[0.0,-7.7,52.0,3....|  103|163.14745186387717|
|[0.0,-7.5,36.0,2....|  125|157.71192687

In [33]:
gbt_train_result.show()

+--------------------+-----+------------------+
|            features|label|        prediction|
+--------------------+-----+------------------+
|(9,[0,1,2,4],[5.0...|  162|242.12687409022388|
|(9,[1,2,3,4],[20....|  841| 915.3636609461033|
|(9,[1,2,4,5],[10....|  520|412.81563221671246|
|(9,[1,2,4,5],[11....|  848| 452.9882858321737|
|(9,[2,3,4,5],[75....|  177| 116.8891160944871|
|[0.0,-15.9,43.0,3...|   78|111.56474210419492|
|[0.0,-15.0,42.0,1...|   80| 140.5520733882255|
|[0.0,-12.3,47.0,0...|  116| 140.5520733882255|
|[0.0,-11.0,51.0,1...|  133| 140.5520733882255|
|[0.0,-10.3,54.0,2...|   98|149.93985688818896|
|[0.0,-10.0,34.0,1...|  108| 140.5520733882255|
|[0.0,-9.5,48.0,1....|  168| 140.5520733882255|
|[0.0,-9.3,45.0,0....|   80| 140.5520733882255|
|[0.0,-8.2,50.0,1....|  136|171.50810425758618|
|[0.0,-8.1,41.0,2....|  175|153.40273109832734|
|[0.0,-7.9,37.0,2....|  133|153.40273109832734|
|[0.0,-7.7,52.0,3....|  103|152.68114776118546|
|[0.0,-7.5,36.0,2....|  125|153.40273109

In [34]:
#Evaluate the models
re = RegressionEvaluator()

#Decision Tree
print(f"Decision Tree test result = {re.evaluate(dtr_test_result, {re.metricName: 'r2'}):0.3}")
print(f"Decision Tree train result = {re.evaluate(dtr_train_result, {re.metricName: 'r2'}):0.3}")

#Random Forest
print(f"Random Forest test result = {re.evaluate(rfr_test_result, {re.metricName: 'r2'}):0.3}")
print(f"Random Forest train result = {re.evaluate(rfr_train_result, {re.metricName: 'r2'}):0.3}")

#Gradient Boosting trees
print(f"Gradient Boosting test result = {re.evaluate(gbt_test_result, {re.metricName: 'r2'}):0.3}")
print(f"Gradient Boosting train result = {re.evaluate(gbt_train_result, {re.metricName: 'r2'}):0.3}")

Decision Tree test result = 0.627
Decision Tree train result = 0.663
Random Forest test result = 0.662
Random Forest train result = 0.679
Gradient Boosting test result = 0.729
Gradient Boosting train result = 0.778


We can also look at the feature importances of each tree-based model.

In [35]:
dtr_model.featureImportances

SparseVector(9, {0: 0.3505, 1: 0.4368, 2: 0.0975, 3: 0.0022, 4: 0.0012, 6: 0.102, 7: 0.0097})

In [36]:
rfr_model.featureImportances

SparseVector(9, {0: 0.3264, 1: 0.3059, 2: 0.1085, 3: 0.0073, 4: 0.0237, 5: 0.1197, 6: 0.0632, 7: 0.0402, 8: 0.005})

In [37]:
gbt_model.featureImportances

SparseVector(9, {0: 0.304, 1: 0.3136, 2: 0.0758, 3: 0.0331, 4: 0.0429, 5: 0.0774, 6: 0.1072, 7: 0.0457, 8: 0.0004})

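From the above, feature 1 (Temperature) is the most important feature for the decision tree and the gradient boosted trees, while the random forest ranks feature 0 (Hour) slightly higher. In all three models Hour and Temperature together account for most of the importance.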
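
The indices in these vectors follow the order of `inputCols` passed to the `VectorAssembler`, so they can be mapped back to column names. The sketch below does this in plain Python, using the values printed for `gbt_model` above. The helper name `rank_features` is illustrative and not part of PySpark.

```python
# Feature names in the order passed to the VectorAssembler above
feature_names = ['Hour', 'Temperature', 'Humidity', 'Wind speed (m/s)',
                 'Visibility (10m)', 'Dew point temperature',
                 'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)']

# Importances copied from the gbt_model output above (index -> importance)
gbt_importances = {0: 0.304, 1: 0.3136, 2: 0.0758, 3: 0.0331, 4: 0.0429,
                   5: 0.0774, 6: 0.1072, 7: 0.0457, 8: 0.0004}

def rank_features(importances, names):
    """Pair each index with its column name and sort by importance, highest first."""
    pairs = [(names[index], value) for index, value in importances.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

ranked = rank_features(gbt_importances, feature_names)
for name, value in ranked:
    print(f"{name:<25} {value:.4f}")
```

With a live model, `gbt_model.featureImportances.toArray()` should give the same values as a dense array that can be zipped with `feature_names` directly.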

All three tree-based methods outperform linear regression, which reached a test R2 of about 0.47. The decision tree reaches 0.627, the random forest 0.662, and gradient boosted trees do best with 0.729.