## Estimate how many crew members a ship will require using Linear Regression

In [1]:
# Widen the notebook container so wide tables display without wrapping
from IPython.display import display, HTML
display(HTML("<style>.container { width:96% !important; }</style>"))

## Description

You have been asked by Hyundai Heavy Industries to help them build a predictive model for some ships. [Hyundai Heavy Industries](http://www.hyundai.eu/en) is one of the world's largest ship manufacturing companies and builds cruise liners. They want accurate estimates of how many crew members a ship will require.

They are currently building new ships for some customers and want you to create a model and use it to predict how many crew members the ships will need.

Here is what the data looks like so far:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96
    
Create a regression model that will help predict how many crew members will be needed for future ships.

## Start a new spark session

In [2]:
# start a spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('cruise').getOrCreate()

## Load the data

In [3]:
data = spark.read.csv("data/cruise_ship_info.csv", inferSchema=True, header=True)

In [4]:
data.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [5]:
data.describe().show()

+-------+---------+-----------+------------------+------------------+-----------------+-----------------+------------------+-----------------+-----------------+
|summary|Ship_name|Cruise_line|               Age|           Tonnage|       passengers|           length|            cabins|passenger_density|             crew|
+-------+---------+-----------+------------------+------------------+-----------------+-----------------+------------------+-----------------+-----------------+
|  count|      158|        158|               158|               158|              158|              158|               158|              158|              158|
|   mean| Infinity|       null|15.689873417721518| 71.28467088607599|18.45740506329114|8.130632911392404| 8.830000000000005|39.90094936708861|7.794177215189873|
| stddev|      NaN|       null| 7.615691058751413|37.229540025907866|9.677094775143416|1.793473548054825|4.4714172221480615| 8.63921711391542|3.503486564627034|
|    min|Adventure|    Azamara|   

## Dealing with the categorical variables

`Ship_name` is an arbitrary string that carries no predictive signal, but `Cruise_line` may be useful. Let's convert it into a numeric categorical index.

In [6]:
data.groupBy('Cruise_line').count().show()

+-----------------+-----+
|      Cruise_line|count|
+-----------------+-----+
|            Costa|   11|
|              P&O|    6|
|           Cunard|    3|
|Regent_Seven_Seas|    5|
|              MSC|    8|
|         Carnival|   22|
|          Crystal|    2|
|           Orient|    1|
|         Princess|   17|
|        Silversea|    4|
|         Seabourn|    3|
| Holland_American|   14|
|         Windstar|    3|
|           Disney|    2|
|        Norwegian|   13|
|          Oceania|    3|
|          Azamara|    2|
|        Celebrity|   10|
|             Star|    6|
|  Royal_Caribbean|   23|
+-----------------+-----+



In [7]:
from pyspark.ml.feature import StringIndexer

In [8]:
# Map each cruise line to a numeric index stored in a new column, cruise_cat
indexed_data = StringIndexer(inputCol="Cruise_line", outputCol="cruise_cat").fit(data).transform(data)
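
To make the indexing step concrete, here is a minimal pure-Python sketch of frequency-based label indexing, using the counts from the `groupBy` output above. It assumes `StringIndexer`'s default ordering (most frequent label first) and breaks ties alphabetically; the tie-break is an assumption and may not match every Spark version. `index_by_frequency` is a hypothetical helper, not a PySpark API.

```python
# Pure-Python sketch of frequency-descending label indexing.
# `index_by_frequency` is a hypothetical helper, not part of PySpark.

# Counts copied from the groupBy('Cruise_line').count() output above.
cruise_line_counts = {
    "Costa": 11, "P&O": 6, "Cunard": 3, "Regent_Seven_Seas": 5, "MSC": 8,
    "Carnival": 22, "Crystal": 2, "Orient": 1, "Princess": 17, "Silversea": 4,
    "Seabourn": 3, "Holland_American": 14, "Windstar": 3, "Disney": 2,
    "Norwegian": 13, "Oceania": 3, "Azamara": 2, "Celebrity": 10, "Star": 6,
    "Royal_Caribbean": 23,
}


def index_by_frequency(counts):
    """Assign 0.0 to the most frequent label, 1.0 to the next, and so on.

    Ties are broken alphabetically here; that tie-break is an assumption.
    """
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(position) for position, label in enumerate(ordered)}


cruise_cat = index_by_frequency(cruise_line_counts)
for label in ("Royal_Caribbean", "Carnival", "Princess", "Orient"):
    print(label, cruise_cat[label])
```

The resulting index is only an ordering by frequency, yet a linear model treats it as a numeric quantity. One-hot encoding the index is a common alternative when that ordering is not meaningful.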

## Create feature vector

In [10]:
from pyspark.ml.feature import VectorAssembler

In [11]:
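# Combine the numeric columns and the indexed cruise line into one feature vector
feature_cols = ['Age', 'Tonnage', 'passengers', 'length', 'cabins', 'passenger_density', 'cruise_cat']
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
data = assembler.transform(indexed_data).select(['features', 'crew'])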

In [13]:
data.show()

+--------------------+----+
|            features|crew|
+--------------------+----+
|[6.0,30.276999999...|3.55|
|[6.0,30.276999999...|3.55|
|[26.0,47.262,14.8...| 6.7|
|[11.0,110.0,29.74...|19.1|
|[17.0,101.353,26....|10.0|
|[22.0,70.367,20.5...| 9.2|
|[15.0,70.367,20.5...| 9.2|
|[23.0,70.367,20.5...| 9.2|
|[19.0,70.367,20.5...| 9.2|
|[6.0,110.23899999...|11.5|
|[10.0,110.0,29.74...|11.6|
|[28.0,46.052,14.5...| 6.6|
|[18.0,70.367,20.5...| 9.2|
|[17.0,70.367,20.5...| 9.2|
|[11.0,86.0,21.24,...| 9.3|
|[8.0,110.0,29.74,...|11.6|
|[9.0,88.5,21.24,9...|10.3|
|[15.0,70.367,20.5...| 9.2|
|[12.0,88.5,21.24,...| 9.3|
|[20.0,70.367,20.5...| 9.2|
+--------------------+----+
only showing top 20 rows



## Create a train/test split

In [18]:
train_data, test_data = data.randomSplit([0.8,0.2])

In [19]:
train_data.describe().show()

+-------+-----------------+
|summary|             crew|
+-------+-----------------+
|  count|              125|
|   mean|7.852160000000013|
| stddev|3.466784692366371|
|    min|             0.59|
|    max|             21.0|
+-------+-----------------+



In [20]:
test_data.describe().show()

+-------+-----------------+
|summary|             crew|
+-------+-----------------+
|  count|               33|
|   mean|7.574545454545454|
| stddev|3.685976881124165|
|    min|             0.59|
|    max|             19.1|
+-------+-----------------+
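
The counts above (125 and 33) are close to, but not exactly, an 80/20 division of 158 rows, because the split is decided row by row at random. The following pure-Python sketch mimics that behaviour under the assumption that each row is assigned by an independent uniform draw; `random_split` is a hypothetical helper and not Spark's implementation. In PySpark itself, `randomSplit` accepts an optional `seed` argument, which makes the split reproducible between runs.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to a split using an independent uniform draw.

    Hypothetical helper for illustration; not Spark's implementation.
    """
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point drift

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0.0, 1.0)
        for index, upper in enumerate(bounds):
            if draw < upper:
                splits[index].append(row)
                break
    return splits


rows = list(range(158))  # stand-in for the 158 ships
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=42)
print(len(train_rows), len(test_rows))
```

Every row lands in exactly one split, but the split sizes vary with the seed, so the weights are targets rather than guarantees.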



## Create a Linear Regression model

In [21]:
from pyspark.ml.regression import LinearRegression

In [None]:
# Create a Linear Regression Model object
lr = LinearRegression(labelCol='crew')

# Fit the model to the data and call this model lrModel
lrModel = lr.fit(train_data)

In [22]:
# Print the coefficients and intercept for linear regression
print("Coefficients: {} Intercept: {}".format(lrModel.coefficients,lrModel.intercept))

Coefficients: [-0.0123329837067,0.0068477494033,-0.148538279768,0.46408031196,0.849866943325,-0.0150563830223,0.0675444923293] Intercept: -0.8271360476697983
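
As a sketch of how these numbers turn into a prediction, the snippet below applies the printed coefficients and intercept to one ship by hand: the prediction is the dot product of the feature values and the coefficients, plus the intercept. The feature values are hypothetical and chosen only for illustration, and `predict_crew` is a hypothetical helper rather than a PySpark function.

```python
# Coefficients and intercept copied from the printed model output above.
feature_names = ["Age", "Tonnage", "passengers", "length", "cabins",
                 "passenger_density", "cruise_cat"]
coefficients = [-0.0123329837067, 0.0068477494033, -0.148538279768,
                0.46408031196, 0.849866943325, -0.0150563830223,
                0.0675444923293]
intercept = -0.8271360476697983

# Hypothetical ship (made-up values, in the same units as the dataset).
hypothetical_ship = {
    "Age": 10.0,
    "Tonnage": 90.0,
    "passengers": 20.0,
    "length": 9.0,
    "cabins": 10.0,
    "passenger_density": 45.0,
    "cruise_cat": 1.0,
}


def predict_crew(ship):
    """Return intercept + sum(coefficient * feature value)."""
    weighted_sum = sum(
        coefficient * ship[name]
        for name, coefficient in zip(feature_names, coefficients)
    )
    return intercept + weighted_sum


predicted_crew = predict_crew(hypothetical_ship)
print("Predicted crew (100s): {:.2f}".format(predicted_crew))
```

Because `crew` is recorded in hundreds, the result needs to be multiplied by 100 to get a head count.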


In [30]:
print("RMSE of the model : {}".format(lrModel.summary.rootMeanSquaredError))

RMSE of the model : 0.6633793999025167


## Evaluate the model

In [31]:
test_results = lrModel.evaluate(test_data)

In [32]:
print("RMSE: {}".format(test_results.rootMeanSquaredError))
print("MSE: {}".format(test_results.meanSquaredError))
print("R2: {}".format(test_results.r2))

RMSE: 1.6396765293233182
MSE: 2.6885391208137626
R2: 0.7959319061201596
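
For reference, the three metrics can be computed by hand from labels and predictions. The sketch below uses the standard definitions (MSE as the mean squared residual, RMSE as its square root, and R2 as one minus the ratio of residual to total sum of squares) and is assumed, not verified, to match what Spark reports for a model with an intercept. The toy labels and predictions are made up, and `regression_metrics` is a hypothetical helper.

```python
import math


def regression_metrics(labels, predictions):
    """Compute MSE, RMSE and R2 from paired labels and predictions."""
    n = len(labels)
    residuals = [label - pred for label, pred in zip(labels, predictions)]
    ss_res = sum(r * r for r in residuals)            # residual sum of squares
    mean_label = sum(labels) / n
    ss_tot = sum((label - mean_label) ** 2 for label in labels)  # total sum of squares
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1.0 - ss_res / ss_tot}


# Toy data (made up) to exercise the helper.
toy_labels = [3.0, 5.0, 7.0]
toy_predictions = [2.5, 5.0, 7.5]
metrics = regression_metrics(toy_labels, toy_predictions)
print(metrics)

# Consistency check on the values printed above: RMSE should equal sqrt(MSE).
reported_mse = 2.6885391208137626
reported_rmse = 1.6396765293233182
rmse_matches = math.isclose(math.sqrt(reported_mse), reported_rmse, rel_tol=1e-6)
print("RMSE equals sqrt(MSE):", rmse_matches)
```

Note that the test RMSE (about 1.64) is noticeably larger than the training RMSE (about 0.66) printed earlier, which is worth keeping in mind given that the test set holds only 33 ships.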


Let's look at the fitted coefficient for each feature, then check each feature's correlation with the target variable.

In [58]:
dict(zip(['Age', 'Tonnage', 'passengers', 'length', 'cabins', 'passenger_density', 'cruise_cat'], 
         lrModel.coefficients.toArray()))

{'Age': -0.012332983706749812,
 'Tonnage': 0.0068477494033016839,
 'cabins': 0.84986694332481316,
 'cruise_cat': 0.067544492329338623,
 'length': 0.46408031195995281,
 'passenger_density': -0.015056383022348213,
 'passengers': -0.14853827976841416}

In [33]:
from pyspark.sql.functions import corr

In [61]:
for f in ['Age', 'Tonnage', 'passengers', 'length', 'cabins', 'passenger_density', 'cruise_cat']:
    indexed_data.select(corr('crew',f)).show()

+-------------------+
|    corr(crew, Age)|
+-------------------+
|-0.5306565039638852|
+-------------------+

+-------------------+
|corr(crew, Tonnage)|
+-------------------+
|  0.927568811544939|
+-------------------+

+----------------------+
|corr(crew, passengers)|
+----------------------+
|    0.9152341306065384|
+----------------------+

+------------------+
|corr(crew, length)|
+------------------+
|0.8958566271016579|
+------------------+

+------------------+
|corr(crew, cabins)|
+------------------+
|0.9508226063578497|
+------------------+

+-----------------------------+
|corr(crew, passenger_density)|
+-----------------------------+
|         -0.15550928421699717|
+-----------------------------+

+----------------------+
|corr(crew, cruise_cat)|
+----------------------+
|  -0.48332562728617057|
+----------------------+
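
The values above are Pearson correlation coefficients. As a minimal sketch of what `corr` computes, the helper below implements the textbook formula in pure Python; `pearson_corr` and the sample values are hypothetical and are not taken from the dataset.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up sample values for illustration only.
sample_cabins = [3.5, 7.4, 10.2, 13.1, 15.6]
sample_crew = [3.6, 6.7, 9.2, 11.5, 12.4]
sample_corr = pearson_corr(sample_cabins, sample_crew)
print("corr(crew, cabins) on the sample: {:.3f}".format(sample_corr))
```

Two cautions apply when reading the table. A regression coefficient is not a correlation: `passengers` has a negative coefficient (about -0.149) despite a strong positive correlation with `crew` (about 0.915), most likely because the size-related features are highly correlated with one another. The correlation with `cruise_cat` also depends on the arbitrary index assigned to each cruise line, so it should be interpreted with care.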

