# Linear Regression Project - SOLUTIONS

Here is what the data looks like in this project:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96
    
It is saved for you in a CSV file called "cruise_ship_info.csv".

Your job is to create a regression model that will help predict how many crew members will be needed for future ships. The client also mentioned that they have found that particular cruise lines will differ in acceptable crew counts, so it is most likely an important feature to include in your analysis! 

Your tasks:

1). Run the code in each cell, one cell at a time.


2). As you study this solution, make sure you understand each Python statement in the cells below.


In [None]:
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder.appName('cruise').getOrCreate()

In [None]:
df = spark.read.csv('cruise_ship_info.csv',inferSchema=True,header=True)

In [None]:
df.printSchema()

In [None]:
df.show()

In [None]:
df.describe().show()

## Dealing with the Cruise_line categorical variable
Ship_name is an arbitrary string with no predictive value, but Cruise_line may be useful.

Cruise_line is stored as a string, so let's convert it into a numeric categorical variable.

In [None]:
df.groupBy('Cruise_line').count().show()

In [None]:
from pyspark.ml.feature import StringIndexer

# Use the built-in help() function to read about the StringIndexer class.
# Why is it needed?

help(StringIndexer)

indexer = StringIndexer(inputCol="Cruise_line", outputCol="cruise_cat")
indexed = indexer.fit(df).transform(df)
indexed.head(5)
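
On the "why is it needed?" question: Spark ML estimators expect numeric features, so the string-valued Cruise_line column has to be mapped to numbers first. By default, StringIndexer orders labels by descending frequency, so the most common label receives index 0.0. The plain-Python sketch below mimics that idea on a few made-up values. `frequency_index` and `sample_lines` are hypothetical names, and the alphabetical tie-break is an assumption of this sketch rather than a statement about Spark internals.

```python
from collections import Counter


def frequency_index(labels):
    """Map each distinct label to a float index; the most frequent label gets 0.0.

    Ties are broken alphabetically (an assumption made for this sketch).
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Made-up cruise line values, for illustration only
sample_lines = ["Carnival", "Azamara", "Carnival", "Celebrity", "Carnival", "Celebrity"]

mapping = frequency_index(sample_lines)
encoded = [mapping[line] for line in sample_lines]

print(mapping)
print(encoded)
```

Keep in mind that the resulting codes are arbitrary numbers: a linear model will treat them as ordered quantities, which is worth remembering when interpreting the coefficient for cruise_cat.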

In [None]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [None]:
indexed.columns

# All the columns in the original dataset, plus the new cruise_cat column

In [None]:
# Some columns will not be used as features, such as Ship_name
# What others are not included in the features?

assembler = VectorAssembler(
    inputCols=['Age',
               'Tonnage',
               'passengers',
               'length',
               'cabins',
               'passenger_density',
               'cruise_cat'],
    outputCol="features")
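
Conceptually, the assembler gathers the listed numeric columns of each row into one vector and leaves everything else out. The sketch below shows that idea in plain Python for a single row. `assemble` is a hypothetical helper and the row values are made up, not taken from the CSV.

```python
def assemble(row, input_cols):
    """Collect the named numeric fields of one row into a single list of floats."""
    return [float(row[col]) for col in input_cols]


# One made-up row; the values are illustrative only
row = {
    "Ship_name": "Example",
    "Cruise_line": "ExampleLine",
    "Age": 10,
    "Tonnage": 50.0,
    "passengers": 12.5,
    "length": 7.0,
    "cabins": 6.0,
    "passenger_density": 40.0,
    "crew": 5.5,
    "cruise_cat": 1.0,
}

input_cols = ["Age", "Tonnage", "passengers", "length",
              "cabins", "passenger_density", "cruise_cat"]

features = assemble(row, input_cols)
print(features)
```

Ship_name, the raw Cruise_line string, and crew do not appear in the vector: the first has no predictive value, the second is represented by cruise_cat, and the third is the value being predicted.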

In [None]:
output = assembler.transform(indexed)

In [None]:
# "crew" will be the "label" column

output.select("features", "crew").show()

In [None]:
final_data = output.select("features", "crew")

In [None]:
# Randomly split the dataset into training (about 70%) and test (about 30%) data

train_data,test_data = final_data.randomSplit([0.7,0.3])
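
Because the split is random, the two parts are only approximately 70% and 30% of the rows, and every run produces a different split unless a seed is fixed (in PySpark, `randomSplit` accepts an optional `seed` argument). The plain-Python sketch below illustrates the idea of assigning each row independently by a random draw. `random_split` is a hypothetical helper, and the integers stand in for ship rows.

```python
import random


def random_split(rows, train_weight=0.7, seed=42):
    """Assign each row to train or test by an independent random draw.

    Fixing the seed makes the split reproducible across runs.
    """
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_weight:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(158))  # stand-ins for the 158 ships
train, test = random_split(rows)

# The sizes are close to, but not necessarily exactly, 70% / 30%
print(len(train), len(test))
```

With an unseeded split, metrics such as R2 will shift slightly from run to run, so small differences from the values quoted in this solution are expected.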

In [None]:
from pyspark.ml.regression import LinearRegression
# Create a Linear Regression Model object

lr = LinearRegression(labelCol='crew')

In [None]:
# Fit the model to the training data and call the fitted model lrModel

lrModel = lr.fit(train_data)

In [None]:
# Print the coefficients and intercept for linear regression

print("Coefficients: {} Intercept: {}".format(lrModel.coefficients,lrModel.intercept))
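
To read these numbers: a fitted linear model predicts the label as the intercept plus the sum of each coefficient times its feature value, with coefficients in the same order as the assembler's inputCols. The sketch below shows that arithmetic in plain Python. `predict` is a hypothetical helper, and the coefficients and feature values are made up, not the ones fitted above.

```python
def predict(features, coefficients, intercept):
    """Linear prediction: intercept + sum of coefficient * feature value."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# Made-up numbers for illustration only
coefficients = [0.5, -1.0, 2.0]
intercept = 2.0
features = [2.0, 3.0, 0.5]

prediction = predict(features, coefficients, intercept)
print(prediction)
```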

In [None]:
# Evaluate the fitted model on the held-out test data
test_results = lrModel.evaluate(test_data)

In [None]:
print("RMSE: {}".format(test_results.rootMeanSquaredError))
print("MSE: {}".format(test_results.meanSquaredError))
print("R2: {}".format(test_results.r2))
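
For reference, the three metrics can be computed by hand from actual and predicted values. The sketch below uses the standard definitions in plain Python. `regression_metrics` is a hypothetical helper and the numbers are made up.

```python
import math


def regression_metrics(actual, predicted):
    """Return (RMSE, MSE, R2) using the standard definitions."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(r * r for r in residuals)              # residual sum of squares
    mse = ss_res / n                                    # mean squared error
    rmse = math.sqrt(mse)                               # root mean squared error
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    r2 = 1 - ss_res / ss_tot                            # share of variance explained
    return rmse, mse, r2


# Made-up values for illustration only
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

rmse, mse, r2 = regression_metrics(actual, predicted)
print("RMSE: {}".format(rmse))
print("MSE: {}".format(mse))
print("R2: {}".format(r2))
```

RMSE is in the same units as the label (here, hundreds of crew members), which makes it the easiest of the three to compare against the spread of crew shown by `df.describe()`.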

In [None]:
# An R2 of about 0.86 is pretty good (the exact value varies with the random split).
# Let's look at the data a little more closely.
from pyspark.sql.functions import corr

In [None]:
df.select(corr('crew','passengers')).show()

In [None]:
df.select(corr('crew','cabins')).show()
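
The `corr` function computes the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below shows the same calculation in plain Python. `pearson` is a hypothetical helper, and the two lists are made-up values rather than columns from the CSV.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only
passengers = [5.0, 10.0, 15.0, 20.0]
crew = [3.0, 5.0, 6.5, 9.0]

r = pearson(passengers, crew)
print(r)
```

A value close to 1 means the two columns rise and fall together almost linearly, which is the situation in which a linear regression can be expected to fit well.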

Okay, so the strong fit does make sense: crew size is highly correlated with both the number of passengers and the number of cabins. That is good news for us, and it is information we can bring to the company!


Hope you enjoyed your first consulting gig!
# Great Job!