# Linear Regression Consulting Project

Congratulations! You've been contracted by Hyundai Heavy Industries to help them build a predictive model for some ships. [Hyundai Heavy Industries](http://www.hyundai.eu/en) is one of the world's largest ship manufacturing companies and builds cruise liners.

You've been flown to their headquarters in Ulsan, South Korea to help them give accurate estimates of how many crew members a ship will require.

They are currently building new ships for some customers and want you to create a model and use it to predict how many crew members the ships will need.

Here is what the data looks like so far:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96
    
It is saved for you in a CSV file called "cruise_ship_info.csv". Your job is to create a regression model that will help predict how many crew members will be needed for future ships. The client also mentioned that acceptable crew counts differ between cruise lines, so the cruise line is most likely an important feature to include in your analysis!

Once you've created the model and tested it for a quick check on how well you can expect it to perform, make sure you take a look at why it performs so well!

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('consulting').getOrCreate()

In [2]:
from pyspark.ml.regression import LinearRegression

In [9]:
df = spark.read.csv("cruise_ship_info.csv",inferSchema=True,header=True)

In [10]:
df.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [11]:
df.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Elation|   Carnival| 15|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Fantasy|   Carnival| 23| 

In [12]:
df.head()

Row(Ship_name='Journey', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55)

In [7]:
# Not sure which API checks for missing values.
type(df)
# pandas has one (e.g. isnull()), but this dataset is a pyspark.sql.DataFrame, which does not expose the same method.

pyspark.sql.dataframe.DataFrame

In [32]:
# The required features need to be assembled into a vector,
# but before that Cruise_line has to be converted to a number.
# StringIndexer is used for that.
from pyspark.ml.feature import StringIndexer
stringIndexer = StringIndexer(inputCol='Cruise_line', outputCol='indexed_Cruise_line')
stringIndexer.getParam('inputCol')
# From a quick look, it seems fit() can be used here.
# So let's do the feature vectorization first.

Param(parent='StringIndexer_8d7c233fa85e', name='inputCol', doc='input column name.')

In [24]:
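from pyspark.ml.feature import VectorAssembler
# NOTE: 'crew' is the quantity the brief asks to predict, yet it is included
# here as a feature, while 'passengers' is left out of the feature vector.
assembler = VectorAssembler(
    inputCols=["Age", "Tonnage",
               "length", "cabins", "passenger_density", "crew"],
    outputCol="features")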

In [25]:
output = assembler.transform(df)

In [26]:
output.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+--------------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|            features|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+--------------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|[6.0,30.276999999...|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|[6.0,30.276999999...|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|[26.0,47.262,7.22...|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|[11.0,110.0,9.53,...|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|[17.0,101.353,8.9...|
|    Ecstasy|   Carnival| 22|            70.367|     20.

In [27]:
final_data = output.select("features",'Cruise_line','passengers')

In [28]:
final_data.show()

+--------------------+-----------+----------+
|            features|Cruise_line|passengers|
+--------------------+-----------+----------+
|[6.0,30.276999999...|    Azamara|      6.94|
|[6.0,30.276999999...|    Azamara|      6.94|
|[26.0,47.262,7.22...|   Carnival|     14.86|
|[11.0,110.0,9.53,...|   Carnival|     29.74|
|[17.0,101.353,8.9...|   Carnival|     26.42|
|[22.0,70.367,8.55...|   Carnival|     20.52|
|[15.0,70.367,8.55...|   Carnival|     20.52|
|[23.0,70.367,8.55...|   Carnival|     20.56|
|[19.0,70.367,8.55...|   Carnival|     20.52|
|[6.0,110.23899999...|   Carnival|      37.0|
|[10.0,110.0,9.51,...|   Carnival|     29.74|
|[28.0,46.052,7.27...|   Carnival|     14.52|
|[18.0,70.367,8.55...|   Carnival|     20.52|
|[17.0,70.367,8.55...|   Carnival|     20.52|
|[11.0,86.0,9.63,1...|   Carnival|     21.24|
|[8.0,110.0,9.51,1...|   Carnival|     29.74|
|[9.0,88.5,9.63,10...|   Carnival|     21.24|
|[15.0,70.367,8.55...|   Carnival|     20.52|
|[12.0,88.5,9.63,1...|   Carnival|

In [29]:
train_data,test_data = final_data.randomSplit([0.7,0.3])

# Not sure how StringIndexer works yet