# Linear Regression Consulting Project

Congratulations! You've been contracted by Hyundai Heavy Industries to help them build a predictive model for some ships. [Hyundai Heavy Industries](http://www.hyundai.eu/en) is one of the world's largest ship manufacturing companies and builds cruise liners.

You've been flown to their headquarters in Ulsan, South Korea to help them give accurate estimates of how many crew members a ship will require.

They are currently building new ships for some customers and want you to create a model and use it to predict how many crew members the ships will need.

Here is what the data looks like so far:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name                  1-20
    Cruise Line                21-40
    Age (as of 2013)           46-48
    Tonnage (1000s of tons)    50-56
    Passengers (100s)          58-64
    Length (100s of feet)      66-72
    Cabins (100s)              74-80
    Passenger Density          82-88
    Crew (100s)                90-96
    
The data is saved in a CSV file called "cruise_ship_info.csv". Your job is to create a regression model that predicts how many crew members future ships will need. The client also mentioned that particular cruise lines differ in acceptable crew counts, so the cruise line is most likely an important feature to include in your analysis!

Once you've created the model and run a quick test of how well you can expect it to perform, make sure you look into why it performs so well!

In [None]:
import findspark
findspark.init('/home/kajili/spark-2.4.5-bin-hadoop2.7')

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName('lr_consulting_project').getOrCreate()

In [4]:
from pyspark.ml.regression import LinearRegression

In [5]:
# Create the raw data DataFrame from the csv file
data = spark.read.csv('cruise_ship_info.csv', inferSchema=True, header=True)

In [6]:
# Display the data DataFrame
data.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Elation|   Carnival| 15|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Fantasy|   Carnival| 23| 

In [7]:
# schema for raw DataFrame
data.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



# Using StringIndexer to encode Cruise_line as a numeric index that the model can train on

In [8]:
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol='Cruise_line', outputCol='Cruise_line_index') 

indexed = indexer.fit(data).transform(data)
indexed.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+-----------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|Cruise_line_index|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+-----------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|             16.0|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|             16.0|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|              1.0|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|              1.0|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|              1.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|       

# Cruise_line_index will now be taken into account as a feature when creating the Linear Regression Model
- As the client mentioned, particular cruise lines differ in acceptable crew counts, so the cruise line is most likely an important feature to include in the analysis.
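
To make the meaning of `Cruise_line_index` concrete, here is a plain-Python sketch of the idea behind `StringIndexer`'s default ordering: labels are ranked by descending frequency, so the most common cruise line gets index 0.0. The function name `frequency_index`, the alphabetical tie-break, and the sample list are assumptions made for illustration; they are not Spark code and not the real distribution in the CSV.

```python
from collections import Counter


def frequency_index(labels):
    """Map each distinct label to a float index, most frequent label first.

    Ties are broken alphabetically so the result is deterministic;
    that tie-break is an assumption of this sketch.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical sample, NOT the real distribution in cruise_ship_info.csv
sample = ["Carnival", "Azamara", "Carnival", "Royal_Caribbean",
          "Carnival", "Royal_Caribbean"]
mapping = frequency_index(sample)
print(mapping)
```

Because the index only reflects how often a cruise line appears, a linear model will read it as a magnitude (16.0 being "more" than 1.0), which has no direct link to crew size. One-hot encoding the index is a common alternative worth trying if the index turns out to add little.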

# Creating a model that predicts the number of crew members from all the feature data

In [9]:
indexed.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+-----------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|Cruise_line_index|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+-----------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|             16.0|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|             16.0|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|              1.0|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|              1.0|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|              1.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|       

In [10]:
for item in indexed.head():
    print(item)

Journey
Azamara
6
30.276999999999997
6.94
5.94
3.55
42.64
3.55
16.0


In [11]:
# Assemble the feature columns into a single `features` vector column:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [12]:
indexed.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew',
 'Cruise_line_index']

In [13]:
assembler = VectorAssembler(
    inputCols=['Age', 'Tonnage', 'passengers', 'length', 'cabins', 'passenger_density', 'Cruise_line_index'],
    outputCol='features')

In [14]:
output = assembler.transform(indexed)

In [15]:
output.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)
 |-- Cruise_line_index: double (nullable = false)
 |-- features: vector (nullable = true)



In [16]:
output.head()

Row(Ship_name='Journey', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55, Cruise_line_index=16.0, features=DenseVector([6.0, 30.277, 6.94, 5.94, 3.55, 42.64, 16.0]))
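
The `features` vector in the row above is just the input columns concatenated in the order given to `inputCols`. The following plain-Python sketch reproduces that for the first row; the helper name `assemble` is hypothetical and only mimics the effect of `VectorAssembler`, it is not how Spark implements it.

```python
INPUT_COLS = ["Age", "Tonnage", "passengers", "length",
              "cabins", "passenger_density", "Cruise_line_index"]


def assemble(row, input_cols):
    """Concatenate the named numeric columns of one row into a list of floats."""
    return [float(row[col]) for col in input_cols]


# First row of the indexed DataFrame, copied from the output above
first_row = {
    "Ship_name": "Journey", "Cruise_line": "Azamara", "Age": 6,
    "Tonnage": 30.276999999999997, "passengers": 6.94, "length": 5.94,
    "cabins": 3.55, "passenger_density": 42.64, "crew": 3.55,
    "Cruise_line_index": 16.0,
}
features = assemble(first_row, INPUT_COLS)
print(features)
```

Note that `crew` is deliberately left out of the input columns: it is the label, and including it in the features would leak the answer into the model.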

In [17]:
final_data = output.select(['features', 'crew'])

In [18]:
final_data.show()

+--------------------+----+
|            features|crew|
+--------------------+----+
|[6.0,30.276999999...|3.55|
|[6.0,30.276999999...|3.55|
|[26.0,47.262,14.8...| 6.7|
|[11.0,110.0,29.74...|19.1|
|[17.0,101.353,26....|10.0|
|[22.0,70.367,20.5...| 9.2|
|[15.0,70.367,20.5...| 9.2|
|[23.0,70.367,20.5...| 9.2|
|[19.0,70.367,20.5...| 9.2|
|[6.0,110.23899999...|11.5|
|[10.0,110.0,29.74...|11.6|
|[28.0,46.052,14.5...| 6.6|
|[18.0,70.367,20.5...| 9.2|
|[17.0,70.367,20.5...| 9.2|
|[11.0,86.0,21.24,...| 9.3|
|[8.0,110.0,29.74,...|11.6|
|[9.0,88.5,21.24,9...|10.3|
|[15.0,70.367,20.5...| 9.2|
|[12.0,88.5,21.24,...| 9.3|
|[20.0,70.367,20.5...| 9.2|
+--------------------+----+
only showing top 20 rows



In [19]:
# Split final_data into train_data and test_data
train_data, test_data = final_data.randomSplit([0.7,0.3])

In [20]:
train_data.describe().show()

+-------+-----------------+
|summary|             crew|
+-------+-----------------+
|  count|              108|
|   mean|7.916203703703716|
| stddev| 3.58009860397612|
|    min|             0.59|
|    max|             21.0|
+-------+-----------------+



In [21]:
test_data.describe().show()

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|                50|
|   mean|7.5306000000000015|
| stddev|3.3520370961368475|
|    min|               0.6|
|    max|              13.6|
+-------+------------------+
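
The split above gives 108 and 50 rows rather than an exact 70/30 share of 158, because rows are assigned to each side at random, and without a seed the assignment changes on every run. `randomSplit` accepts an optional `seed` argument for reproducibility. The sketch below illustrates the idea in plain Python under the assumption of one independent draw per row; `random_split` is a hypothetical helper, not Spark's implementation.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to a split by an independent random draw."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative boundaries, e.g. roughly [0.7, 1.0] for weights [0.7, 0.3]
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point leftover
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(158))  # stand-in for the 158 ships
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))
```

With a fixed seed the same rows land in the same split every time, which makes metrics comparable between runs.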



In [22]:
# Create LinearRegression Object
lr = LinearRegression(labelCol='crew')

In [23]:
# Fit the LinearRegression object on train_data to obtain the trained model.
lr_model = lr.fit(train_data)

# Evaluating the model against test_data to see how accurate it is

In [24]:
# Evaluate the model with test_data
test_results = lr_model.evaluate(test_data)

In [25]:
# Show residuals from test_results.
test_results.residuals.show()

+--------------------+
|           residuals|
+--------------------+
|  0.1296488656231567|
| -0.6395528043074972|
| -0.6395528043074972|
| 0.24046138433411812|
| -0.5711083299991895|
| 0.17523773185200042|
| -0.5334417775334916|
|  0.4520161969583576|
| 0.16362885365125024|
| -0.3306717199948661|
| -0.6661718004998676|
|-0.21172490619813544|
|-0.43469849144301875|
|  0.7685648224555717|
|   0.847001545794452|
| 0.19201403446725018|
|  0.2073003496736021|
| -0.3880670668951334|
|-0.23872526289117246|
|  0.7645380510074222|
+--------------------+
only showing top 20 rows



In [26]:
test_results.rootMeanSquaredError

0.6215463559839005

In [27]:
test_results.r2

0.9649164654121274
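
To read these two numbers: the RMSE is in the same unit as `crew` (100s of crew members), so about 0.62 corresponds to a typical error of roughly 62 crew members, against a test-set standard deviation of about 3.35. R² is the share of the variance in `crew` that the model explains. The sketch below shows the standard definitions in plain Python; the toy values are for illustration only and are not taken from the cruise data.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error: square root of the mean squared residual."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))


def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only (not taken from the cruise data)
labels = [3.0, 5.0, 7.0, 9.0]
predictions = [2.5, 5.5, 7.0, 9.0]
print(rmse(labels, predictions), r2(labels, predictions))
```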

# Testing model predictions on a simulated unlabeled data set (to see the predictions that the model generates)

In [28]:
unlabeled_data = test_data.select('features')

In [29]:
unlabeled_data.show()

+--------------------+
|            features|
+--------------------+
|[5.0,115.0,35.74,...|
|[6.0,30.276999999...|
|[6.0,30.276999999...|
|[6.0,110.23899999...|
|[6.0,112.0,38.0,9...|
|[6.0,113.0,37.82,...|
|[6.0,158.0,43.7,1...|
|[8.0,77.499,19.5,...|
|[10.0,77.0,20.16,...|
|[10.0,90.09,25.01...|
|[10.0,105.0,27.2,...|
|[11.0,86.0,21.24,...|
|[11.0,90.09,25.01...|
|[11.0,91.0,20.32,...|
|[11.0,108.977,26....|
|[12.0,2.329,0.94,...|
|[12.0,42.0,14.8,7...|
|[12.0,58.6,15.66,...|
|[12.0,90.09,25.01...|
|[12.0,91.0,20.32,...|
+--------------------+
only showing top 20 rows



In [30]:
predictions = lr_model.transform(unlabeled_data)

In [31]:
# Show 50 rows of the model's predictions on the unlabeled test data, without truncating column values
predictions.show(50, False)

+---------------------------------------------------+------------------+
|features                                           |prediction        |
+---------------------------------------------------+------------------+
|[5.0,115.0,35.74,9.0,15.32,32.18,9.0]              |12.070351134376843|
|[6.0,30.276999999999997,6.94,5.94,3.55,42.64,16.0] |4.189552804307497 |
|[6.0,30.276999999999997,6.94,5.94,3.55,42.64,16.0] |4.189552804307497 |
|[6.0,110.23899999999999,37.0,9.51,14.87,29.79,1.0] |11.259538615665882|
|[6.0,112.0,38.0,9.51,15.0,29.47,5.0]               |11.47110832999919 |
|[6.0,113.0,37.82,9.51,15.57,29.88,2.0]             |11.824762268148   |
|[6.0,158.0,43.7,11.25,18.0,36.16,0.0]              |14.133441777533491|
|[8.0,77.499,19.5,8.56,9.75,39.74,2.0]              |8.547983803041642 |
|[10.0,77.0,20.16,8.56,9.75,38.19,9.0]              |8.83637114634875  |
|[10.0,90.09,25.01,9.62,10.5,36.02,0.0]             |8.910671719994866 |
|[10.0,105.0,27.2,8.9,13.56,38.6,5.0]              

In [32]:
# Show 50 rows of the actual crew values from the original test data, without truncating column values
test_data.show(50, False)

+---------------------------------------------------+-----+
|features                                           |crew |
+---------------------------------------------------+-----+
|[5.0,115.0,35.74,9.0,15.32,32.18,9.0]              |12.2 |
|[6.0,30.276999999999997,6.94,5.94,3.55,42.64,16.0] |3.55 |
|[6.0,30.276999999999997,6.94,5.94,3.55,42.64,16.0] |3.55 |
|[6.0,110.23899999999999,37.0,9.51,14.87,29.79,1.0] |11.5 |
|[6.0,112.0,38.0,9.51,15.0,29.47,5.0]               |10.9 |
|[6.0,113.0,37.82,9.51,15.57,29.88,2.0]             |12.0 |
|[6.0,158.0,43.7,11.25,18.0,36.16,0.0]              |13.6 |
|[8.0,77.499,19.5,8.56,9.75,39.74,2.0]              |9.0  |
|[10.0,77.0,20.16,8.56,9.75,38.19,9.0]              |9.0  |
|[10.0,90.09,25.01,9.62,10.5,36.02,0.0]             |8.58 |
|[10.0,105.0,27.2,8.9,13.56,38.6,5.0]               |10.68|
|[11.0,86.0,21.24,9.63,10.62,40.49,1.0]             |9.3  |
|[11.0,90.09,25.01,9.62,10.5,36.02,0.0]             |8.48 |
|[11.0,91.0,20.32,9.65,9.75,44.78,6.0]  

In [33]:
predictions.describe().show()

+-------+------------------+
|summary|        prediction|
+-------+------------------+
|  count|                50|
|   mean|7.7827198818171475|
| stddev|3.4000908917338006|
|    min|0.4079859655327498|
|    max|14.133441777533491|
+-------+------------------+



In [34]:
test_data.describe().show()

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|                50|
|   mean|7.5306000000000015|
| stddev|3.3520370961368475|
|    min|               0.6|
|    max|              13.6|
+-------+------------------+
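
The summary statistics are close, but a row-by-row comparison is more telling. The sketch below pairs the first few `crew` values with their predictions, copied from the two tables shown earlier, and recomputes the residuals as label minus prediction; they match the first entries of `test_results.residuals`. Matching by row position is an assumption that holds here only because both tables come from `test_data` in the same order. Calling `lr_model.transform(test_data)` instead of dropping the label first would keep `crew` and `prediction` side by side in one DataFrame.

```python
# (label, prediction) pairs copied from the first rows of the tables above
pairs = [
    (12.2, 12.070351134376843),
    (3.55, 4.189552804307497),
    (3.55, 4.189552804307497),
    (11.5, 11.259538615665882),
    (10.9, 11.47110832999919),
]


def residuals(pairs):
    """Residual = actual label minus predicted value."""
    return [label - prediction for label, prediction in pairs]


for (label, prediction), res in zip(pairs, residuals(pairs)):
    print(f"crew={label:5.2f}  prediction={prediction:8.4f}  residual={res:+.4f}")
```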



# Checking data further

In [35]:
from pyspark.sql.functions import corr

In [36]:
data.select(corr('crew','passengers')).show()

+----------------------+
|corr(crew, passengers)|
+----------------------+
|    0.9152341306065384|
+----------------------+



In [37]:
data.select(corr('crew','cabins')).show()

+------------------+
|corr(crew, cabins)|
+------------------+
|0.9508226063578497|
+------------------+
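
These correlations suggest why the model performs so well: `crew` has a strong linear relationship with `cabins` (about 0.95) and `passengers` (about 0.92), which is plausible since larger ships need more staff, so a linear model has an easy job. The sketch below shows the Pearson correlation that `corr` computes, written in plain Python and applied to a handful of rows copied from `data.show()`; five rows are far too few to reproduce the values above and only illustrate the calculation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# (cabins, crew) for Journey, Celebration, Conquest, Destiny, Ecstasy
cabins = [3.55, 7.43, 14.88, 13.21, 10.2]
crew = [3.55, 6.7, 19.1, 10.0, 9.2]
print(pearson(cabins, crew))
```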

