In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
plt.rcParams['figure.figsize'] = (16, 12)

# Cruise Ship Crew Requirements

You've been given a dataset describing various features of cruise ships, such as their tonnage, total number of passengers, and number of cabins. You've been asked to explore the dataset, present some summary statistics, and (hopefully) build a model to predict the number of crew required to staff a new ship. Again, you are asked to break the problem down into three stages.

### Part 1: Exploration and Cleaning
1. Read in this data, clean and tidy if necessary.
1. Calculate basic statistics to understand what you have been given.
1. Select some columns which might be correlated with the number of crew.

### Part 2: Analysis and Model Building
1. Create training and testing sets
2. Scale and transform features as needed 
3. Fit a model to predict crew requirements for new ships

### Part 3: Results
1. Summarize your findings

## Part 1

In [3]:
sCSV = 'https://m2pi.syzygy.ca/data/cruise_ship_info.csv'

sDF = pd.read_csv(sCSV)

 * Take a look at the numerical columns.
 * I dropped `Ship_name`, since it seemed redundant.
 * Look at the distributions of the numerical columns. `scatter_matrix` from `pandas.plotting` can help.
 * Look at correlations with `sDF.corr()` (on recent versions of pandas, pass `numeric_only=True` if text columns are still present).
 * You should find that the strongest correlations with the number of crew are with `Tonnage`, `Passengers`, `length` and `cabins`.
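
The exploration steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: `explore_numeric` is a hypothetical helper, and the target column name `crew` is an assumption about the CSV header that you should check against `sDF.columns`.

```python
import pandas as pd


def explore_numeric(df, target="crew", drop=("Ship_name",)):
    """Return summary statistics and correlations with the target column.

    `target="crew"` is an assumed column name; adjust it to match your data.
    """
    # Drop identifier-like columns if they are present.
    trimmed = df.drop(columns=[c for c in drop if c in df.columns])
    # Keep only the numerical columns for statistics and correlations.
    numeric = trimmed.select_dtypes(include="number")
    summary = numeric.describe()
    # Pearson correlation of every numerical column with the target,
    # sorted from the highest value to the lowest.
    corr = numeric.corr()[target].drop(target).sort_values(ascending=False)
    return summary, corr
```

With the data loaded, `summary, corr = explore_numeric(sDF)` gives both tables, and `pd.plotting.scatter_matrix(sDF.select_dtypes(include="number"))` draws the pairwise distributions.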

## Part 2

### Splitting the test and training data

**You are supposed to do this here, and leave your test data aside until the very end.**

* From Part 1, if an important feature had been heavily skewed or unevenly distributed, you would need to split with that in mind (`StratifiedShuffleSplit` can help with this). Here things were mostly OK, so you can just use `train_test_split` from `sklearn.model_selection`.
* Use `StandardScaler` on the numerical columns and `OneHotEncoder` on the categorical ones.
* Fit and transform your training dataset. The test set should only ever be transformed, never fitted.
* Try fitting a model. Maybe start with `LinearRegression`, then see what the [choosing the right estimator](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) page recommends.
* Once you are happy with the model type, try tuning the hyperparameters with `GridSearchCV` from `sklearn.model_selection`.
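
One way the steps above could fit together is sketched below. Treat it as a sketch under stated assumptions: the helper names (`split_features`, `build_pipeline`, `tune_ridge`) are hypothetical, `crew` is an assumed target column name, and `Ridge` with a small `alpha` grid is swapped in only as an example of a model with a hyperparameter to tune. Putting the scaler and encoder inside a `Pipeline` means they are fitted on the training data only, including within each cross-validation fold.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def split_features(df, target="crew", test_size=0.2, seed=42):
    """Split a DataFrame into X_train, X_test, y_train, y_test."""
    X = df.drop(columns=[target])
    y = df[target]
    return train_test_split(X, y, test_size=test_size, random_state=seed)


def build_pipeline(numeric_cols, categorical_cols, model=None):
    """Scale numerical columns, one-hot encode categorical ones, then fit a model."""
    if model is None:
        model = LinearRegression()
    # Columns not listed here are dropped (the ColumnTransformer default).
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), list(numeric_cols)),
        ("cat", OneHotEncoder(handle_unknown="ignore"), list(categorical_cols)),
    ])
    return Pipeline([("preprocess", preprocess), ("model", model)])


def tune_ridge(X_train, y_train, numeric_cols, categorical_cols, cv=5):
    """Grid-search the Ridge penalty using cross-validation on the training set."""
    pipe = build_pipeline(numeric_cols, categorical_cols, Ridge())
    # The alpha values are illustrative, not a recommendation.
    param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0]}
    grid = GridSearchCV(pipe, param_grid, cv=cv,
                        scoring="neg_root_mean_squared_error")
    return grid.fit(X_train, y_train)
```

Call `split_features(sDF)` once, then pass the numerical and categorical column names you chose in Part 1 to `build_pipeline` or `tune_ridge`, using only the training portion.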

## Part 3

At this point, you should be ready to "open the box" and see how you did with your test data. As mentioned at the top, this is the very last thing you do.
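
A minimal sketch of that final check is below. It assumes you already have a fitted model (or pipeline) with a `predict` method; `evaluate_on_test` is a hypothetical helper, and the three metrics are just common choices for regression.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def evaluate_on_test(model, X_test, y_test):
    """Score an already-fitted model on the held-out test set."""
    predictions = model.predict(X_test)
    return {
        # Root mean squared error, in the same units as the target.
        "rmse": mean_squared_error(y_test, predictions) ** 0.5,
        "mae": mean_absolute_error(y_test, predictions),
        "r2": r2_score(y_test, predictions),
    }
```

Run this once, on the final model only. If you go back and change the model after seeing these numbers, the test set is no longer an honest estimate of how the model will do on new ships.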