# Linear Regression Consulting Project

Congratulations! You've been contracted by Hyundai Heavy Industries to help them build a predictive model for some ships. [Hyundai Heavy Industries](http://www.hyundai.eu/en) is one of the world's largest ship manufacturing companies and builds cruise liners.

You've been flown to their headquarters in Ulsan, South Korea to help them give accurate estimates of how many crew members a ship will require.

They are currently building new ships for some customers and want you to create a model and use it to predict how many crew members the ships will need.

Here is what the data looks like so far:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    Passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96
    
It is saved in a CSV file for you called "cruise_ship_info.csv". Your job is to create a regression model that will help predict how many crew members will be needed for future ships. The client also mentioned that acceptable crew counts differ between cruise lines, so the cruise line is most likely an important feature to include in your analysis!

Once you've created the model and run a quick test of how well you can expect it to perform, make sure you take a look at why it performs so well! All the best!

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors
spark=SparkSession.builder.appName("Linear Regression Project").getOrCreate()

In [4]:
df=spark.read.csv("C:/Users/User/Desktop/SparkFolder/Data/cruise_ship_info.csv",inferSchema=True,header=True)

In [5]:
df.show(3)

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
only showing top 3 rows



In [6]:
df.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [11]:
for i in df.head(5):
    print(i)
    print('\n')

Row(Ship_name='Journey', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55)


Row(Ship_name='Quest', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55)


Row(Ship_name='Celebration', Cruise_line='Carnival', Age=26, Tonnage=47.262, passengers=14.86, length=7.22, cabins=7.43, passenger_density=31.8, crew=6.7)


Row(Ship_name='Conquest', Cruise_line='Carnival', Age=11, Tonnage=110.0, passengers=29.74, length=9.53, cabins=14.88, passenger_density=36.99, crew=19.1)


Row(Ship_name='Destiny', Cruise_line='Carnival', Age=17, Tonnage=101.353, passengers=26.42, length=8.92, cabins=13.21, passenger_density=38.36, crew=10.0)




In [12]:
# Count ships per cruise line. Cruise_line is an important feature, so it is label-encoded with StringIndexer below.
df.groupBy("Cruise_line").count().show()

+-----------------+-----+
|      Cruise_line|count|
+-----------------+-----+
|            Costa|   11|
|              P&O|    6|
|           Cunard|    3|
|Regent_Seven_Seas|    5|
|              MSC|    8|
|         Carnival|   22|
|          Crystal|    2|
|           Orient|    1|
|         Princess|   17|
|        Silversea|    4|
|         Seabourn|    3|
| Holland_American|   14|
|         Windstar|    3|
|           Disney|    2|
|        Norwegian|   13|
|          Oceania|    3|
|          Azamara|    2|
|        Celebrity|   10|
|             Star|    6|
|  Royal_Caribbean|   23|
+-----------------+-----+



# String Indexing
* Remember: `StringIndexer` must be fit first and then used to transform (`fit` followed by `transform`)
* `VectorAssembler` only needs `transform`
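
To make the `fit` step concrete: by default, `StringIndexer` learns its label-to-index mapping during `fit`, ordering labels by descending frequency so that the most common label gets index 0. The plain-Python sketch below mimics that idea on the cruise line counts shown above. The helper name `frequency_index` and the alphabetical tie-break are assumptions for illustration, not Spark's implementation.

```python
# Counts copied from the groupBy("Cruise_line").count() output above.
cruise_line_counts = {
    "Costa": 11, "P&O": 6, "Cunard": 3, "Regent_Seven_Seas": 5, "MSC": 8,
    "Carnival": 22, "Crystal": 2, "Orient": 1, "Princess": 17, "Silversea": 4,
    "Seabourn": 3, "Holland_American": 14, "Windstar": 3, "Disney": 2,
    "Norwegian": 13, "Oceania": 3, "Azamara": 2, "Celebrity": 10, "Star": 6,
    "Royal_Caribbean": 23,
}


def frequency_index(counts):
    """Hypothetical helper: map each label to an index, most frequent first.

    Ties are broken alphabetically here; that tie-break is an assumption.
    """
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


cruise_cat = frequency_index(cruise_line_counts)
for line in ("Royal_Caribbean", "Carnival", "Azamara"):
    print(line, cruise_cat[line])
```

Under these assumptions, Carnival maps to 1.0 and Azamara to 16.0, which agrees with the `Cruise_cat` values in this notebook's `indexed.show(5)` output. Note that the resulting numbers encode frequency rank only; they carry no natural ordering between cruise lines.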

In [18]:
from pyspark.ml.feature import StringIndexer
# Follow same style as VectorAssembler
indexer=StringIndexer(inputCol="Cruise_line", outputCol="Cruise_cat")
indexed=indexer.fit(df).transform(df)

Exception ignored in: <function JavaWrapper.__del__ at 0x000001BC851A58B0>
Traceback (most recent call last):
  File "C:\spark\python\pyspark\ml\wrapper.py", line 42, in __del__
    if SparkContext._active_spark_context and self._java_obj is not None:
AttributeError: 'StringIndexer' object has no attribute '_java_obj'


In [20]:
indexed.show(5)

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+----------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|Cruise_cat|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+----------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|      16.0|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|      16.0|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|       1.0|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|       1.0|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|       1.0|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+----------+
only showing top 5 rows



In [21]:
indexed.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew',
 'Cruise_cat']

# Vector Assembler 

In [25]:
feature=['Age','Tonnage','passengers','length','cabins','passenger_density', 'Cruise_cat']
assembler=VectorAssembler(inputCols=feature, outputCol="features")
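
Conceptually, the assembler concatenates the listed input columns, in order, into a single numeric vector per row. The sketch below reproduces that for the first row (Journey) in plain Python; `assemble` is a hypothetical stand-in, not the Spark API.

```python
feature = ['Age', 'Tonnage', 'passengers', 'length', 'cabins',
           'passenger_density', 'Cruise_cat']

# First row of the indexed DataFrame, copied from the indexed.show(5) output.
journey = {
    'Ship_name': 'Journey', 'Cruise_line': 'Azamara', 'Age': 6,
    'Tonnage': 30.276999999999997, 'passengers': 6.94, 'length': 5.94,
    'cabins': 3.55, 'passenger_density': 42.64, 'crew': 3.55,
    'Cruise_cat': 16.0,
}


def assemble(row, input_cols):
    """Hypothetical stand-in: pick the input columns in order and cast to float."""
    return [float(row[col]) for col in input_cols]


features = assemble(journey, feature)
print(features)
```

The label column `crew` and the string columns are deliberately left out of `feature`, so they never enter the feature vector.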

In [26]:
indexed_tran=assembler.transform(indexed)

In [28]:
indexed_tran.show(truncate=False,n=3)

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+----------+--------------------------------------------------+
|Ship_name  |Cruise_line|Age|Tonnage           |passengers|length|cabins|passenger_density|crew|Cruise_cat|features                                          |
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+----------+--------------------------------------------------+
|Journey    |Azamara    |6  |30.276999999999997|6.94      |5.94  |3.55  |42.64            |3.55|16.0      |[6.0,30.276999999999997,6.94,5.94,3.55,42.64,16.0]|
|Quest      |Azamara    |6  |30.276999999999997|6.94      |5.94  |3.55  |42.64            |3.55|16.0      |[6.0,30.276999999999997,6.94,5.94,3.55,42.64,16.0]|
|Celebration|Carnival   |26 |47.262            |14.86     |7.22  |7.43  |31.8             |6.7 |1.0       |[26.0,47.262,14.86,7.22,7.43,31.8,1.0]            |
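+-----------+-----------+---+------------------+----------+------+------+-----------------+----+----------+--------------------------------------------------+
only showing top 3 rows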

In [29]:
df2=indexed_tran.select("features","crew")

In [31]:
df2.show(truncate=False,n=3)

+--------------------------------------------------+----+
|features                                          |crew|
+--------------------------------------------------+----+
|[6.0,30.276999999999997,6.94,5.94,3.55,42.64,16.0]|3.55|
|[6.0,30.276999999999997,6.94,5.94,3.55,42.64,16.0]|3.55|
|[26.0,47.262,14.86,7.22,7.43,31.8,1.0]            |6.7 |
+--------------------------------------------------+----+
only showing top 3 rows



# Train Test Split Data

In [32]:
train_data,test_data=df2.randomSplit([0.7,0.3])

In [33]:
train_data.count()

102

In [34]:
test_data.count()

56
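
The 102/56 split is not exactly 70/30 of 158 rows (that would be roughly 111/47). `randomSplit` treats the weights as per-row sampling probabilities rather than exact quotas, so the split sizes vary from run to run unless a `seed` is passed. The simulation below is a plain-Python sketch of that behaviour, not Spark's sampler; `simulate_split` is a hypothetical helper.

```python
import random


def simulate_split(n_rows, train_weight, seed):
    """Hypothetical helper: send each row to train with probability train_weight."""
    rng = random.Random(seed)
    n_train = sum(1 for _ in range(n_rows) if rng.random() < train_weight)
    return n_train, n_rows - n_train


# Different seeds give different (train, test) sizes for the same weights.
for seed in range(5):
    print(seed, simulate_split(158, 0.7, seed))
```

With only 158 rows, this variation is large enough to move the test metrics noticeably, so fixing the seed makes results easier to compare between runs.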

# Training the Model

In [35]:
lr=LinearRegression(featuresCol='features',labelCol="crew",predictionCol="prediction")
lrmodel=lr.fit(train_data)

In [37]:
results_train=lrmodel.transform(train_data)

In [38]:
results_train.show()

+--------------------+-----+------------------+
|            features| crew|        prediction|
+--------------------+-----+------------------+
|[4.0,220.0,54.0,1...| 21.0|20.500578778664266|
|[5.0,115.0,35.74,...| 12.2|11.690163778276109|
|[5.0,133.5,39.59,...|13.13|12.981513647023535|
|[6.0,30.276999999...| 3.55| 4.453803511569759|
|[6.0,30.276999999...| 3.55| 4.453803511569759|
|[6.0,112.0,38.0,9...| 10.9| 11.10627982369699|
|[6.0,158.0,43.7,1...| 13.6|13.685634211270646|
|[7.0,116.0,31.0,9...| 12.0|12.513549687563168|
|[8.0,77.499,19.5,...|  9.0| 8.595886990933275|
|[9.0,59.058,17.0,...|  7.4| 7.609910166377019|
|[9.0,81.0,21.44,9...| 10.0| 9.541097563483467|
|[9.0,85.0,19.68,9...| 8.69|  9.36275703452774|
|[9.0,90.09,25.01,...| 8.69| 9.239960761899914|
|[9.0,110.0,29.74,...| 11.6|11.983340389864841|
|[10.0,46.0,7.0,6....| 4.47|2.9220813108011696|
|[10.0,58.825,15.6...|  7.0| 7.356585435522866|
|[10.0,77.0,20.16,...|  9.0|  8.77932045276899|
|[10.0,81.76899999...| 8.42| 8.836652961

In [39]:
results_train2=lrmodel.evaluate(train_data)

In [40]:
results_train2.r2

0.9504024608728223

In [45]:
results_train2.meanAbsoluteError

0.5224495118560041
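
For reference, both metrics can be written out by hand. The sketch below computes R² as 1 - SS_res / SS_tot and the mean absolute error on the first five rows of the training output above. It only illustrates the formulas; five rows will not reproduce the full-sample values, and the function names are hypothetical.

```python
# (crew, prediction) pairs copied from the first five rows of results_train.show().
crew = [21.0, 12.2, 13.13, 3.55, 3.55]
prediction = [20.500578778664266, 11.690163778276109, 12.981513647023535,
              4.453803511569759, 4.453803511569759]


def r_squared(actual, predicted):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def mean_abs_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


r2 = r_squared(crew, prediction)
mae = mean_abs_error(crew, prediction)
print(f"R^2 on five rows: {r2:.3f}")
print(f"MAE on five rows: {mae:.3f}")
```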

# Predicting Test Data

In [46]:
test_res=lrmodel.transform(test_data)

In [47]:
test_res.show()

+--------------------+-----+------------------+
|            features| crew|        prediction|
+--------------------+-----+------------------+
|[5.0,86.0,21.04,9...|  8.0| 9.305385264158282|
|[5.0,122.0,28.5,1...|  6.7| 6.348753396090691|
|[5.0,160.0,36.34,...| 13.6|14.931222991403052|
|[6.0,90.0,20.0,9....|  9.0| 10.13089638407925|
|[6.0,93.0,23.94,9...|11.09|  10.5665113822071|
|[6.0,110.23899999...| 11.5| 10.92433134978542|
|[6.0,113.0,37.82,...| 12.0|11.460844160508968|
|[7.0,89.6,25.5,9....| 9.87|11.085652445211904|
|[7.0,158.0,43.7,1...| 13.6|13.606360290140984|
|[8.0,91.0,22.44,9...| 11.0|10.106714306623312|
|[8.0,110.0,29.74,...| 11.6|11.997333042531626|
|[9.0,88.5,21.24,9...| 10.3| 9.573309294801748|
|[9.0,105.0,27.2,8...|10.68|11.151860870867646|
|[9.0,113.0,26.74,...|12.38|11.246841815789109|
|[9.0,113.0,26.74,...|12.38|11.246841815789109|
|[9.0,116.0,26.0,9...| 11.0|11.068497618618464|
|[10.0,68.0,10.8,7...| 6.36| 6.587884312679228|
|[10.0,90.09,25.01...| 8.58| 8.852453181

In [48]:
test_eval=lrmodel.evaluate(test_data)

In [49]:
test_eval.r2

0.8771639432562722

In [50]:
test_eval.meanAbsoluteError

0.7346869161165628

In [51]:
test_data.describe().show()

+-------+-----------------+
|summary|             crew|
+-------+-----------------+
|  count|               56|
|   mean| 8.41267857142857|
| stddev|3.437878040605559|
|    min|             0.59|
|    max|             19.1|
+-------+-----------------+
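
To put the test error in context, the arithmetic below compares the test MAE with the mean and standard deviation of `crew` from the summary above. Since `crew` is recorded in hundreds, the MAE is also converted to a head count. This is a quick sanity check using the numbers printed in this notebook, not a formal evaluation.

```python
# Values copied from the outputs above.
test_mae = 0.7346869161165628      # test_eval.meanAbsoluteError
crew_mean = 8.41267857142857       # mean of crew in test_data
crew_stddev = 3.437878040605559    # stddev of crew in test_data

# crew is measured in hundreds, so scale by 100 to get people.
mae_in_people = test_mae * 100
relative_to_mean = test_mae / crew_mean
relative_to_stddev = test_mae / crew_stddev

print(f"Typical error: about {mae_in_people:.0f} crew members")
print(f"MAE as a share of the mean crew size: {relative_to_mean:.1%}")
print(f"MAE as a share of the crew standard deviation: {relative_to_stddev:.1%}")
```

An average miss of under a tenth of the mean crew size, and well under one standard deviation, is consistent with the high test R².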



In [53]:
df.corr("crew","passengers")

0.9152341306065384

In [54]:
df.corr("crew","cabins")

0.9508226063578497
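
These correlations suggest why the model performs so well: crew size moves almost in step with the number of cabins and passengers, so a linear model has strong signal to work with. `df.corr` computes the Pearson correlation coefficient; the sketch below spells out that formula on the five rows printed by `df.head(5)` earlier. Five rows will not reproduce the full-data values, and `pearson` is a hypothetical helper.

```python
import math

# cabins and crew copied from the five rows printed by df.head(5).
cabins = [3.55, 3.55, 7.43, 14.88, 13.21]
crew = [3.55, 3.55, 6.7, 19.1, 10.0]


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


r = pearson(cabins, crew)
print(f"Pearson r (crew vs cabins, five rows): {r:.3f}")
```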