# Linear Regression - Crew Members on Ship

Congratulations! You've been contracted by Hyundai Heavy Industries to help them build a predictive model for some ships. [Hyundai Heavy Industries](http://www.hyundai.eu/en) is one of the world's largest ship manufacturing companies and builds cruise liners.

You've been flown to their headquarters in Ulsan, South Korea to help them give accurate estimates of how many crew members a ship will require.

They are currently building new ships for some customers and want you to create a model and use it to predict how many crew members the ships will need.

Here is what the data looks like so far:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.

    Variables/Columns
    Ship Name                 1-20
    Cruise Line               21-40
    Age (as of 2013)          46-48
    Tonnage (1000s of tons)   50-56
    Passengers (100s)         58-64
    Length (100s of feet)     66-72
    Cabins (100s)             74-80
    Passenger Density         82-88
    Crew (100s)               90-96
    
The data is saved for you in a CSV file called "cruise_ship_info.csv". Your job is to create a regression model that predicts how many crew members future ships will need. The client also mentioned that acceptable crew counts differ between cruise lines, so the cruise line is most likely an important feature to include in your analysis!


In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('cruise').getOrCreate()

In [3]:
df = spark.read.csv('../data/cruise_ship_info.csv', inferSchema=True, header=True)

In [4]:
df.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [8]:
%%html
<style>
div.output_area pre {
    white-space: pre;
}
</style>

In [11]:
df.show(n=5, truncate=False, vertical=True)

-RECORD 0-------------------------------
 Ship_name         | Journey            
 Cruise_line       | Azamara            
 Age               | 6                  
 Tonnage           | 30.276999999999997 
 passengers        | 6.94               
 length            | 5.94               
 cabins            | 3.55               
 passenger_density | 42.64              
 crew              | 3.55               
-RECORD 1-------------------------------
 Ship_name         | Quest              
 Cruise_line       | Azamara            
 Age               | 6                  
 Tonnage           | 30.276999999999997 
 passengers        | 6.94               
 length            | 5.94               
 cabins            | 3.55               
 passenger_density | 42.64              
 crew              | 3.55               
-RECORD 2-------------------------------
 Ship_name         | Celebration        
 Cruise_line       | Carnival           
 Age               | 26                 
 Tonnage        

In [12]:
df.describe().show(n=5, truncate=False, vertical=True)

-RECORD 0-------------------------------
 summary           | count              
 Ship_name         | 158                
 Cruise_line       | 158                
 Age               | 158                
 Tonnage           | 158                
 passengers        | 158                
 length            | 158                
 cabins            | 158                
 passenger_density | 158                
 crew              | 158                
-RECORD 1-------------------------------
 summary           | mean               
 Ship_name         | Infinity           
 Cruise_line       | null               
 Age               | 15.689873417721518 
 Tonnage           | 71.28467088607599  
 passengers        | 18.45740506329114  
 length            | 8.130632911392404  
 cabins            | 8.830000000000005  
 passenger_density | 39.90094936708861  
 crew              | 7.794177215189873  
-RECORD 2-------------------------------
 summary           | stddev             
 Ship_name      

## Dealing with the Cruise_line categorical variable
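`Ship_name` is an arbitrary label with no predictive value, but `Cruise_line` may be useful, as the client suggested. Because it is a string column, let's convert it into a numeric category index so it can be used as a feature!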

In [7]:
df.groupBy('Cruise_line').count().show()

+-----------------+-----+
|      Cruise_line|count|
+-----------------+-----+
|            Costa|   11|
|              P&O|    6|
|           Cunard|    3|
|Regent_Seven_Seas|    5|
|              MSC|    8|
|         Carnival|   22|
|          Crystal|    2|
|           Orient|    1|
|         Princess|   17|
|        Silversea|    4|
|         Seabourn|    3|
| Holland_American|   14|
|         Windstar|    3|
|           Disney|    2|
|        Norwegian|   13|
|          Oceania|    3|
|          Azamara|    2|
|        Celebrity|   10|
|             Star|    6|
|  Royal_Caribbean|   23|
+-----------------+-----+



In [15]:
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="Cruise_line", outputCol="cruise_cat")
indexed = indexer.fit(df).transform(df)
indexed.show(5)

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+----------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|cruise_cat|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+----------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|      16.0|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|      16.0|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|       1.0|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|       1.0|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|       1.0|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+----------+
only showing top 5 rows
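
The index values above are not arbitrary. By default, `StringIndexer` orders labels by descending frequency, so the most common cruise line gets index 0. The plain-Python sketch below reproduces that ordering from the counts shown earlier. It is an illustration, not Spark's implementation: `frequency_desc_index` is a hypothetical helper, and the alphabetical tie-break is an assumption that happens to agree with the `16.0` assigned to `Azamara` above.

```python
# Counts copied from the groupBy('Cruise_line').count() output above.
cruise_line_counts = {
    "Costa": 11, "P&O": 6, "Cunard": 3, "Regent_Seven_Seas": 5, "MSC": 8,
    "Carnival": 22, "Crystal": 2, "Orient": 1, "Princess": 17, "Silversea": 4,
    "Seabourn": 3, "Holland_American": 14, "Windstar": 3, "Disney": 2,
    "Norwegian": 13, "Oceania": 3, "Azamara": 2, "Celebrity": 10, "Star": 6,
    "Royal_Caribbean": 23,
}


def frequency_desc_index(counts):
    """Assign 0.0 to the most frequent label, 1.0 to the next, and so on."""
    # Assumption: labels with equal counts are ordered alphabetically.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


cruise_cat = frequency_desc_index(cruise_line_counts)
print(cruise_cat["Royal_Caribbean"], cruise_cat["Carnival"], cruise_cat["Azamara"])
```

Note that the resulting number only reflects how many ships each line has in the data; it carries no information about crew requirements by itself.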



In [16]:
from pyspark.ml.feature import VectorAssembler

In [17]:
indexed.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew',
 'cruise_cat']

In [18]:
assembler = VectorAssembler(
    inputCols=['Age',
               'Tonnage',
               'passengers',
               'length',
               'cabins',
               'passenger_density',
               'cruise_cat'],
    outputCol="features")
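
One caveat: `cruise_cat` enters the feature vector as a single number, so a linear model treats the index as a magnitude (index 16 weighs sixteen times as much as index 1), even though the ordering carries no such meaning. A common alternative is one-hot encoding, where each category gets its own 0/1 column. The sketch below shows the idea in plain Python. `one_hot` is a hypothetical helper, not a PySpark function, and this notebook does not use it; PySpark provides a `OneHotEncoder` in `pyspark.ml.feature` for the same purpose, though its interface differs between Spark versions.

```python
def one_hot(index, num_categories):
    """Return a list of 0.0/1.0 values with a single 1.0 at position `index`."""
    if not 0 <= index < num_categories:
        raise ValueError("index out of range")
    vector = [0.0] * num_categories
    vector[int(index)] = 1.0
    return vector


# 20 cruise lines appear in the data; Carnival was indexed as 1.0 above.
carnival = one_hot(1.0, 20)
print(carnival)
```

With this encoding the model would learn one coefficient per cruise line instead of a single slope across all index values.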

In [21]:
output = assembler.transform(indexed)
output.show(n=5, truncate=False, vertical=True)

-RECORD 0---------------------------------------------------------------
 Ship_name         | Journey                                            
 Cruise_line       | Azamara                                            
 Age               | 6                                                  
 Tonnage           | 30.276999999999997                                 
 passengers        | 6.94                                               
 length            | 5.94                                               
 cabins            | 3.55                                               
 passenger_density | 42.64                                              
 crew              | 3.55                                               
 cruise_cat        | 16.0                                               
 features          | [6.0,30.276999999999997,6.94,5.94,3.55,42.64,16.0] 
-RECORD 1---------------------------------------------------------------
 Ship_name         | Quest                         

In [22]:
output.select("features", "crew").show()

+--------------------+----+
|            features|crew|
+--------------------+----+
|[6.0,30.276999999...|3.55|
|[6.0,30.276999999...|3.55|
|[26.0,47.262,14.8...| 6.7|
|[11.0,110.0,29.74...|19.1|
|[17.0,101.353,26....|10.0|
|[22.0,70.367,20.5...| 9.2|
|[15.0,70.367,20.5...| 9.2|
|[23.0,70.367,20.5...| 9.2|
|[19.0,70.367,20.5...| 9.2|
|[6.0,110.23899999...|11.5|
|[10.0,110.0,29.74...|11.6|
|[28.0,46.052,14.5...| 6.6|
|[18.0,70.367,20.5...| 9.2|
|[17.0,70.367,20.5...| 9.2|
|[11.0,86.0,21.24,...| 9.3|
|[8.0,110.0,29.74,...|11.6|
|[9.0,88.5,21.24,9...|10.3|
|[15.0,70.367,20.5...| 9.2|
|[12.0,88.5,21.24,...| 9.3|
|[20.0,70.367,20.5...| 9.2|
+--------------------+----+
only showing top 20 rows



In [23]:
final_data = output.select("features", "crew")

In [24]:
# 70/30 train/test split; no seed is set, so the split differs on every run
train_data, test_data = final_data.randomSplit([0.7, 0.3])
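
`randomSplit` assigns rows to the two sets at random, so the 70/30 proportions are only approximate, and with just 158 ships the metrics can shift noticeably from one run to the next. `randomSplit` also accepts an optional `seed` argument if a repeatable split is wanted. The plain-Python sketch below illustrates a seeded, per-row random assignment; `random_split` is a hypothetical helper and is not how Spark partitions data internally.

```python
import random


def random_split(rows, train_fraction=0.7, seed=42):
    """Assign each row to train or test using one seeded random draw per row."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(158))  # stand-in for the 158 ships
train_rows, test_rows = random_split(rows)
print(len(train_rows), len(test_rows))
```

Calling `random_split(rows)` again with the same seed returns the same two lists, which is what makes results comparable across runs.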

In [25]:
from pyspark.ml.regression import LinearRegression
# Create a Linear Regression Model object
lr = LinearRegression(labelCol='crew')

In [26]:
# Fit the model to the data and call this model lrModel
lrModel = lr.fit(train_data)

In [27]:
# Print the coefficients and intercept for linear regression
print("Coefficients: {} Intercept: {}".format(lrModel.coefficients,lrModel.intercept))

Coefficients: [-0.016208130834326522,0.003498280936095452,-0.13598010652529774,0.4771086689550805,0.8563375018456144,-0.0009234242519510042,0.05654545060532841] Intercept: -1.356636750347607
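
The coefficients are listed in the same order as the `inputCols` passed to the `VectorAssembler`, so a prediction is the intercept plus the dot product of the coefficients with a ship's feature vector. As a rough check, the sketch below applies the printed values to the feature vector of the first ship (`Journey`) shown earlier. The numbers are copied from the outputs above and will differ on another run because the train/test split is random; `predict` is a hypothetical helper.

```python
# Values copied from the printed output above (they depend on the random split).
coefficients = [-0.016208130834326522, 0.003498280936095452,
                -0.13598010652529774, 0.4771086689550805,
                0.8563375018456144, -0.0009234242519510042,
                0.05654545060532841]
intercept = -1.356636750347607

# Feature order: Age, Tonnage, passengers, length, cabins,
# passenger_density, cruise_cat (first record, ship "Journey").
journey = [6.0, 30.276999999999997, 6.94, 5.94, 3.55, 42.64, 16.0]


def predict(features):
    """Linear model: intercept + sum of coefficient * feature."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


predicted_crew = predict(journey)
print("Predicted crew (100s): {:.2f}".format(predicted_crew))
```

The recorded crew for this ship is 3.55 (hundreds), so the printed prediction can be compared directly against it.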


In [28]:
test_results = lrModel.evaluate(test_data)

In [29]:
print("RMSE: {}".format(test_results.rootMeanSquaredError))
print("MSE: {}".format(test_results.meanSquaredError))
print("R2: {}".format(test_results.r2))

RMSE: 0.9486112938413002
MSE: 0.8998633868032656
R2: 0.8920276058948289
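
Because `crew` is measured in hundreds, an RMSE of about 0.95 corresponds to a typical error of roughly 95 crew members, against a mean crew of about 7.79 (hundreds) in the summary statistics above. For reference, the three metrics can be computed by hand from labels and predictions. The sketch below uses the standard definitions on small made-up numbers (not the ship data); `regression_metrics` is a hypothetical helper, not part of PySpark.

```python
import math


def regression_metrics(labels, predictions):
    """Return MSE, RMSE and R2 for paired labels and predictions."""
    n = len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    mean_label = sum(labels) / n
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1.0 - ss_res / ss_tot}


# Made-up toy values, used only to exercise the formulas.
metrics = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 8.0])
print(metrics)
```

RMSE is the square root of MSE by definition, which the two values printed above also satisfy.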
