### Linear Regression using Python and Spark

In [1]:
import findspark

In [2]:
findspark.init('/home/jinudaniel74/spark-2.1.1-bin-hadoop2.7')

In [3]:
from pyspark.sql import SparkSession

In [4]:
spark = SparkSession.builder.appName('linearregression').getOrCreate()

#### Loading the Data set

The data set we will use comes from the cruise shipping industry. We will predict the number of crew members a ship needs, given a set of features describing the ship.

In [5]:
from pyspark.ml.regression import LinearRegression

In [6]:
data = spark.read.csv('cruise_ship_info.csv', inferSchema=True, header=True)

In [7]:
data.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Elation|   Carnival| 15|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Fantasy|   Carnival| 23| 

#### Feature Engineering

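Let's handpick some of the features that we think will play a role in determining the number of crew members needed. The string column 'Cruise_line' must first be converted to a numeric column, because Spark ML models operate on numeric feature vectors. StringIndexer does this by mapping each distinct cruise line to a numeric index.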
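To make the indexing step concrete, here is a minimal plain-Python sketch of the idea behind StringIndexer's default behaviour: the most frequent label receives index 0.0, the next most frequent 1.0, and so on. The helper name `string_index`, the alphabetical tie-breaking, and the sample labels are assumptions made for illustration; they are not taken from the data set or from Spark's implementation.

```python
from collections import Counter


def string_index(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; ties are broken alphabetically here
    # (an assumption made so the sketch is deterministic).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {label: float(i) for i, label in enumerate(ordered)}
    return mapping, [mapping[label] for label in labels]


# Made-up labels for illustration, not the real cruise line counts
cruise_lines = ["Carnival", "Azamara", "Carnival",
                "Royal_Caribbean", "Royal_Caribbean", "Royal_Caribbean"]
mapping, indexed_column = string_index(cruise_lines)
print(mapping)
print(indexed_column)
```

The resulting index reflects only how often a label occurs, so feeding it to a linear model as an ordinary number imposes an ordering on the cruise lines that may not be meaningful. One-hot encoding the index is a common alternative.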

In [8]:
from pyspark.ml.feature import VectorAssembler, StringIndexer

In [9]:
indexer = StringIndexer(inputCol='Cruise_line', outputCol='Cruise_lineIndex')
indexed = indexer.fit(data).transform(data)

In [10]:
indexed.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+----------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|Cruise_lineIndex|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+----------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|            16.0|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|            16.0|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|             1.0|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|             1.0|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|             1.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.

In [11]:
indexed.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew',
 'Cruise_lineIndex']

In [12]:
assembler = VectorAssembler(
    inputCols=['Age', 'Tonnage', 'passengers', 'length', 'cabins',
               'passenger_density', 'Cruise_lineIndex'],
    outputCol='features')
output = assembler.transform(indexed)

In [13]:
output.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+----------------+--------------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|Cruise_lineIndex|            features|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+----------------+--------------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|            16.0|[6.0,30.276999999...|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|            16.0|[6.0,30.276999999...|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|             1.0|[26.0,47.262,14.8...|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|             1.0|[11.0,110.0,29.74...|
|    Destiny|   Carnival| 17|           101.353|     26

In [14]:
final_data = output.select('features','crew')
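Conceptually, the assembler packs the chosen columns of each row, in the order listed, into a single numeric vector, and the select keeps only that vector and the label. The plain-Python sketch below mimics this for the first row shown in the output above; the `assemble` helper and the dictionary representation of a row are assumptions for illustration, not Spark's API.

```python
input_cols = ['Age', 'Tonnage', 'passengers', 'length', 'cabins',
              'passenger_density', 'Cruise_lineIndex']


def assemble(row, cols):
    """Concatenate the named columns of a row into one list of floats."""
    return [float(row[c]) for c in cols]


# First row of the indexed table, written as a plain dictionary
row = {'Ship_name': 'Journey', 'Cruise_line': 'Azamara', 'Age': 6,
       'Tonnage': 30.276999999999997, 'passengers': 6.94, 'length': 5.94,
       'cabins': 3.55, 'passenger_density': 42.64, 'crew': 3.55,
       'Cruise_lineIndex': 16.0}

features = assemble(row, input_cols)  # what the 'features' column holds
label = row['crew']                   # what the model will try to predict
print(features, label)
```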

Split the data into training and test sets in a 70:30 ratio.

In [15]:
train_data, test_data = final_data.randomSplit([0.7, 0.3])
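Note that `randomSplit` assigns rows at random, so the 70:30 ratio is approximate rather than exact, and without a seed the split changes on every run. The plain-Python sketch below illustrates the idea of assigning each row by an independent random draw; the `random_split` helper, the row count of 100, and the seed value 42 are assumptions for illustration and do not reproduce Spark's internal sampler.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split using an independent random draw."""
    total = sum(weights)
    # Cumulative boundaries in [0, 1], e.g. [0.7, 1.0] for weights [0.7, 0.3]
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any draw not claimed by earlier ones
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(100))  # hypothetical row ids
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))  # close to 70 and 30, but not guaranteed
```

In PySpark itself, passing a seed, as in `final_data.randomSplit([0.7, 0.3], seed=42)`, should make the split repeatable across runs on the same data and partitioning.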

#### Train the model using linear regression

In [16]:
lr = LinearRegression(labelCol='crew')

In [17]:
lr_model = lr.fit(train_data)

#### Evaluate the model on the test set

In [18]:
test = lr_model.evaluate(test_data)

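In [19]:
# Note: lr_model.summary holds metrics computed on the training data;
# the R^2 on the held-out test set is available as test.r2
lr_model.summary.r2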

0.9197503367172872
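For reference, R² measures the share of the variance in the label that the model explains: one minus the ratio of the residual sum of squares to the total sum of squares. A minimal plain-Python sketch of that formula follows; the `r2_score` helper and the numbers are made up for illustration and are not values from the cruise ship data.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Squared error of the model's predictions
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of always predicting the mean of the labels
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up crew values and predictions, for illustration only
actual = [3.0, 6.0, 9.0, 12.0]
predicted = [3.5, 5.5, 9.5, 11.5]
print(r2_score(actual, predicted))
```

A value of 1.0 means the predictions match the labels exactly, while 0.0 means the model does no better than always predicting the mean.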