<a href="https://colab.research.google.com/github/muhammetsnts/SPARK/blob/main/projects/2.Predicting_Crew_Members.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Info

We have the `cruise_ship_info.csv` dataset. We will try to predict how many crew members are needed when building a ship.

# Setting Environment

In [1]:
# install Java 8
!apt-get -q install openjdk-8-jdk-headless -qq > /dev/null

# download Spark 3.1.1
!wget -q https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz

# extract the archive
!tar xf spark-3.1.1-bin-hadoop2.7.tgz

# install findspark
!pip install -q findspark


import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"


import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Download and Read the Data

In [2]:
!wget -q https://raw.githubusercontent.com/muhammetsnts/SPARK/main/data/cruise_ship_info.csv

# Create Spark DataFrame

In [3]:
data = spark.read.csv('cruise_ship_info.csv', inferSchema=True, header=True)
data.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Elation|   Carnival| 15|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Fantasy|   Carnival| 23| 

In [4]:
data.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [5]:
data.describe().show()

+-------+---------+-----------+------------------+------------------+-----------------+-----------------+------------------+-----------------+-----------------+
|summary|Ship_name|Cruise_line|               Age|           Tonnage|       passengers|           length|            cabins|passenger_density|             crew|
+-------+---------+-----------+------------------+------------------+-----------------+-----------------+------------------+-----------------+-----------------+
|  count|      158|        158|               158|               158|              158|              158|               158|              158|              158|
|   mean| Infinity|       null|15.689873417721518| 71.28467088607599|18.45740506329114|8.130632911392404| 8.830000000000005|39.90094936708861|7.794177215189873|
| stddev|     null|       null| 7.615691058751413|37.229540025907866|9.677094775143416|1.793473548054825|4.4714172221480615| 8.63921711391542|3.503486564627034|
|    min|Adventure|    Azamara|   

# Data Cleaning

Let's check how many ships each cruise line has.

In [6]:
data.groupBy('Cruise_line').count().show()

+-----------------+-----+
|      Cruise_line|count|
+-----------------+-----+
|            Costa|   11|
|              P&O|    6|
|           Cunard|    3|
|Regent_Seven_Seas|    5|
|              MSC|    8|
|         Carnival|   22|
|          Crystal|    2|
|           Orient|    1|
|         Princess|   17|
|        Silversea|    4|
|         Seabourn|    3|
| Holland_American|   14|
|         Windstar|    3|
|           Disney|    2|
|        Norwegian|   13|
|          Oceania|    3|
|          Azamara|    2|
|        Celebrity|   10|
|             Star|    6|
|  Royal_Caribbean|   23|
+-----------------+-----+



As the output shows, `Cruise_line` is a string column, so we convert it to a numeric index using `StringIndexer`.

In [7]:
from pyspark.ml.feature import StringIndexer

In [8]:
indexer = StringIndexer(inputCol='Cruise_line', outputCol='Cruise_line_index')

data_indexed = indexer.fit(data).transform(data)

data_indexed.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+-----------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|Cruise_line_index|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+-----------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|             16.0|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|             16.0|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|              1.0|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|              1.0|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|              1.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|       

In [9]:
data_indexed.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)
 |-- Cruise_line_index: double (nullable = false)
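The index values above (for example `Carnival` at 1.0 and `Azamara` at 16.0) are consistent with `StringIndexer`'s default ordering, which assigns indices by descending label frequency and, in Spark 3.x, breaks ties alphabetically. The plain-Python sketch below reproduces that rule from the counts in the `groupBy` output. The helper name `frequency_index` is hypothetical, and this is an illustration rather than Spark's implementation.

```python
# Ship counts per cruise line, copied from the groupBy output above.
counts = {
    "Costa": 11, "P&O": 6, "Cunard": 3, "Regent_Seven_Seas": 5, "MSC": 8,
    "Carnival": 22, "Crystal": 2, "Orient": 1, "Princess": 17, "Silversea": 4,
    "Seabourn": 3, "Holland_American": 14, "Windstar": 3, "Disney": 2,
    "Norwegian": 13, "Oceania": 3, "Azamara": 2, "Celebrity": 10, "Star": 6,
    "Royal_Caribbean": 23,
}


def frequency_index(label_counts):
    """Hypothetical helper: most frequent label gets 0.0, ties sorted alphabetically."""
    ordered = sorted(label_counts, key=lambda label: (-label_counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


index = frequency_index(counts)
print(index["Royal_Caribbean"], index["Carnival"], index["Azamara"])
```

Because these indices are arbitrary codes rather than quantities, a linear model will treat them as ordered numbers. One-hot encoding the index (for example with `OneHotEncoder`) is a common alternative.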



# Preparing the Data For Model

In [10]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

There is no need to drop the string columns explicitly: listing only the numeric columns as `inputCols` of `VectorAssembler` leaves them out of the feature vector.

In [13]:
data_indexed.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew',
 'Cruise_line_index']

In [14]:
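# NOTE: 'crew' is the label. Listing it among the inputs leaks the target into
# the features, which explains the perfect fit reported below (R2 = 1.0,
# coefficient 1.0 on 'crew'). Remove it from inputCols for a meaningful model.
assembler = VectorAssembler(inputCols=['Age',
                                       'Tonnage',
                                       'passengers',
                                       'length',
                                       'cabins',
                                       'passenger_density',
                                       'crew',
                                       'Cruise_line_index'],
                            outputCol='features')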

We will transform our data before splitting.

In [15]:
output = assembler.transform(data_indexed)

output.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)
 |-- Cruise_line_index: double (nullable = false)
 |-- features: vector (nullable = true)
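Since `crew` is the value to be predicted, it is usually kept out of the input vector. A minimal sketch, assuming the column list printed above, derives the inputs by excluding the string columns and the label. The names `excluded` and `feature_cols` are illustrative only.

```python
# Column names as printed by data_indexed.columns above.
columns = ['Ship_name', 'Cruise_line', 'Age', 'Tonnage', 'passengers',
           'length', 'cabins', 'passenger_density', 'crew', 'Cruise_line_index']

# String columns cannot go into the vector, and the label must not be a feature.
excluded = {'Ship_name', 'Cruise_line', 'crew'}
feature_cols = [c for c in columns if c not in excluded]
print(feature_cols)

# With Spark available, this list could then be passed as:
# assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
```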



We keep only the `crew` column (the label) and the `features` column.

In [16]:
final_data = output.select('crew', 'features')

final_data.show()

+----+--------------------+
|crew|            features|
+----+--------------------+
|3.55|[6.0,30.276999999...|
|3.55|[6.0,30.276999999...|
| 6.7|[26.0,47.262,14.8...|
|19.1|[11.0,110.0,29.74...|
|10.0|[17.0,101.353,26....|
| 9.2|[22.0,70.367,20.5...|
| 9.2|[15.0,70.367,20.5...|
| 9.2|[23.0,70.367,20.5...|
| 9.2|[19.0,70.367,20.5...|
|11.5|[6.0,110.23899999...|
|11.6|[10.0,110.0,29.74...|
| 6.6|[28.0,46.052,14.5...|
| 9.2|[18.0,70.367,20.5...|
| 9.2|[17.0,70.367,20.5...|
| 9.3|[11.0,86.0,21.24,...|
|11.6|[8.0,110.0,29.74,...|
|10.3|[9.0,88.5,21.24,9...|
| 9.2|[15.0,70.367,20.5...|
| 9.3|[12.0,88.5,21.24,...|
| 9.2|[20.0,70.367,20.5...|
+----+--------------------+
only showing top 20 rows



# Train-Test Split

In [17]:
train_data, test_data = final_data.randomSplit([0.7,0.3])

In [18]:
train_data.describe().show()

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|               111|
|   mean| 7.610720720720719|
| stddev|3.4659907967437253|
|    min|              0.59|
|    max|              19.1|
+-------+------------------+



In [19]:
test_data.describe().show()

+-------+-----------------+
|summary|             crew|
+-------+-----------------+
|  count|               47|
|   mean|8.227446808510638|
| stddev|3.590702648982996|
|    min|              0.6|
|    max|             21.0|
+-------+-----------------+
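The split above is random, so the 111/47 row counts will differ between runs unless a seed is fixed; `randomSplit` accepts an optional `seed` argument for this. The plain-Python sketch below illustrates the idea with per-row sampling (each row goes to the training set with probability 0.7), which is also why the sizes are only approximately 70/30. It is an illustration in the same spirit, not Spark's sampler, and `seeded_split` is a hypothetical helper.

```python
import random


def seeded_split(rows, train_fraction=0.7, seed=42):
    """Hypothetical helper: assign each row to train with probability train_fraction."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(158))  # stand-in for the 158 ships in the dataset
train_a, test_a = seeded_split(rows, seed=42)
train_b, test_b = seeded_split(rows, seed=42)

print(train_a == train_b)          # same seed gives the same split
print(len(train_a) + len(test_a))  # every row lands in exactly one part
```

In the notebook the equivalent would be `final_data.randomSplit([0.7, 0.3], seed=42)`, where 42 is an arbitrary choice.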



# Modelling

In [20]:
from pyspark.ml.regression import LinearRegression

In [23]:
lr = LinearRegression(labelCol='crew')

In [34]:
lrModel = lr.fit(train_data)

# Evaluate Model

In [35]:
test_results = lrModel.evaluate(test_data)

test_results.residuals.show()

+--------------------+
|           residuals|
+--------------------+
|-1.11022302462515...|
|6.994405055138486...|
|-4.44089209850062...|
|9.769962616701378...|
|7.549516567451064...|
|-3.55271367880050...|
|                 0.0|
|-8.88178419700125...|
|-8.88178419700125...|
|1.776356839400250...|
|5.329070518200751...|
|1.154631945610162...|
|-1.77635683940025...|
|-1.77635683940025...|
|1.421085471520200...|
|-4.44089209850062...|
|-7.10542735760100...|
|-4.44089209850062...|
|-9.76996261670137...|
|7.993605777301127...|
+--------------------+
only showing top 20 rows



In [36]:
test_results.rootMeanSquaredError

8.703375347946784e-15

In [42]:
'{:.20f}'.format(test_results.rootMeanSquaredError)

'0.00000000000000870338'

In [37]:
test_results.r2

1.0

In [39]:
test_results.meanAbsoluteError

5.72827837173618e-15
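The three test metrics reported above can be computed directly from labels and predictions. The sketch below uses small made-up numbers (not taken from the dataset) to show the formulas; the function name `regression_metrics` is hypothetical. Values on the order of 1e-15, as in the outputs above, are at the level of floating-point rounding error.

```python
import math


def regression_metrics(labels, predictions):
    """Hypothetical helper returning (RMSE, MAE, R2) for two equal-length lists."""
    residuals = [y - p for y, p in zip(labels, predictions)]  # label minus prediction
    n = len(residuals)
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    mae = sum(abs(r) for r in residuals) / n
    mean_y = sum(labels) / n
    ss_res = sum(r * r for r in residuals)           # residual sum of squares
    ss_tot = sum((y - mean_y) ** 2 for y in labels)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2


# Made-up illustration values, not rows from cruise_ship_info.csv.
rmse, mae, r2 = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(rmse, mae, r2)
```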

In [27]:
lrModel.coefficients

DenseVector([0.0, -0.0, 0.0, -0.0, -0.0, 0.0, 1.0, -0.0])

In [28]:
lrModel.intercept

9.727115066007133e-15
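The coefficient vector above is zero everywhere except position 6, which is `crew` in the assembler's `inputCols`, and the intercept is about 1e-14. That is the pattern a least-squares fit produces when the label is also present as a feature. The sketch below shows it with a one-feature closed-form fit on the first few `crew` values from the table above; `fit_line` is a hypothetical helper, not Spark's solver.

```python
def fit_line(x, y):
    """Hypothetical helper: ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


crew = [3.55, 3.55, 6.7, 19.1, 10.0, 9.2]  # first crew values shown by data.show()
slope, intercept = fit_line(crew, crew)    # the "feature" is a copy of the label
print(slope, intercept)
```

The slope comes out at 1 and the intercept at 0 up to rounding, so a perfect score in this situation says nothing about how well the other columns predict `crew`.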

In [30]:
training_summary = lrModel.summary

In [31]:
training_summary.r2

1.0

In [32]:
training_summary.rootMeanSquaredError

9.404952889062478e-15

In [None]:
training_summary.pValues

[0.7630532446463902,
 0.33248404922392516,
 0.008194436494670665,
 0.010315399636608324,
 3.753122257421637e-09,
 0.9598444749944621,
 0.0,
 0.11046631503034399,
 0.4117843068643867]
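The list has nine entries for eight input columns. According to the Spark documentation, when an intercept is fitted the last p-value belongs to the intercept, so the values can be paired with names as sketched below. This assumes the list was produced by the model fitted above, with the `inputCols` order given to the assembler; the label `(intercept)` is just an illustrative name.

```python
# Input columns in the order given to VectorAssembler above.
input_cols = ['Age', 'Tonnage', 'passengers', 'length', 'cabins',
              'passenger_density', 'crew', 'Cruise_line_index']

# p-values copied from the training_summary.pValues output above.
p_values = [0.7630532446463902, 0.33248404922392516, 0.008194436494670665,
            0.010315399636608324, 3.753122257421637e-09, 0.9598444749944621,
            0.0, 0.11046631503034399, 0.4117843068643867]

# The extra, final entry is assumed to be the intercept.
names = input_cols + ['(intercept)']
p_by_name = dict(zip(names, p_values))

for name, p in p_by_name.items():
    print(f"{name:>18}: {p:.4g}")
```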

## Correlation Check

In [43]:
from pyspark.sql.functions import corr

Let's check the correlations between `crew` and some of the other columns.

In [47]:
data.select(corr('crew', 'cabins')).show()

+------------------+
|corr(crew, cabins)|
+------------------+
|0.9508226063578497|
+------------------+



In [48]:
data.select(corr('crew', 'passengers')).show()

+----------------------+
|corr(crew, passengers)|
+----------------------+
|    0.9152341306065384|
+----------------------+



In [49]:
data.select(corr('crew', 'tonnage')).show()

+-------------------+
|corr(crew, tonnage)|
+-------------------+
|  0.927568811544939|
+-------------------+
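`corr` computes the Pearson correlation coefficient. The plain-Python sketch below spells out the formula using the first seven `crew` and `cabins` values from the table shown earlier. This is a small subset, so the result will not equal the full-data value of about 0.95, and `pearson` is a hypothetical helper.

```python
import math


def pearson(x, y):
    """Hypothetical helper: Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# First seven rows of the crew and cabins columns from data.show().
crew = [3.55, 3.55, 6.7, 19.1, 10.0, 9.2, 9.2]
cabins = [3.55, 3.55, 7.43, 14.88, 13.21, 10.2, 10.2]
print(pearson(crew, cabins))
```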



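These strong positive correlations mean that ships with more passengers, more cabins, or higher tonnage tend to have more crew, so these columns are useful predictors of `crew`. Note, however, that the perfect scores in the evaluation above (R² of 1.0, errors near 1e-15) come from `crew` itself being listed in the assembler's input columns, not from these correlations.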