## The objective of this task is to build a regressor that recommends the crew size for potential ship buyers. The following steps serve as a guideline:
1. Read the file and display the columns.
2. Calculate basic statistics of the data (count, mean, std, etc.), examine the data, and state your observations.
3. Select the columns that are likely to be important for predicting crew size. Create training and testing sets (use 60% of the data for training and the remainder for testing).
4. Build a machine learning model to predict the crew size.
5. Calculate the Pearson correlation coefficient for the training and testing data sets.
6. Explain overfitting. How can you avoid it?
7. What’s the difference between bias and variance?
8. When will you use classification over regression?

    The dataset used in this exercise can be found in this repo.

In [1]:
#import dependencies
import pandas as pd
import numpy as np

In [2]:
#load dataset
ship_info = pd.read_csv('ship_info.csv')
ship_info.head(10)

Unnamed: 0,Ship_name,Cruise_line,Age,Tonnage,passengers,length,cabins,passenger_density,crew
0,Journey,Azamara,6,30.277,6.94,5.94,3.55,42.64,3.55
1,Quest,Azamara,6,30.277,6.94,5.94,3.55,42.64,3.55
2,Celebration,Carnival,26,47.262,14.86,7.22,7.43,31.8,6.7
3,Conquest,Carnival,11,110.0,29.74,9.53,14.88,36.99,19.1
4,Destiny,Carnival,17,101.353,26.42,8.92,13.21,38.36,10.0
5,Ecstasy,Carnival,22,70.367,20.52,8.55,10.2,34.29,9.2
6,Elation,Carnival,15,70.367,20.52,8.55,10.2,34.29,9.2
7,Fantasy,Carnival,23,70.367,20.56,8.55,10.22,34.23,9.2
8,Fascination,Carnival,19,70.367,20.52,8.55,10.2,34.29,9.2
9,Freedom,Carnival,6,110.239,37.0,9.51,14.87,29.79,11.5


In [3]:
ship_info.tail(10)

Unnamed: 0,Ship_name,Cruise_line,Age,Tonnage,passengers,length,cabins,passenger_density,crew
148,Wind,Silversea,19,16.8,2.96,5.14,1.48,56.76,1.97
149,Aries,Star,22,3.341,0.66,2.8,0.33,50.62,0.59
150,Gemini,Star,21,19.093,8.0,5.37,4.0,23.87,4.7
151,Libra,Star,12,42.0,14.8,7.13,7.4,28.38,6.8
152,Pisces,Star,24,40.053,12.87,5.79,7.76,31.12,7.5
153,Taurus,Star,22,3.341,0.66,2.79,0.33,50.62,0.59
154,Virgo,Star,14,76.8,19.6,8.79,9.67,39.18,12.0
155,Spirit,Windstar,25,5.35,1.58,4.4,0.74,33.86,0.88
156,Star,Windstar,27,5.35,1.67,4.4,0.74,32.04,0.88
157,Surf,Windstar,23,14.745,3.08,6.17,1.56,47.87,1.8


In [4]:
ship_info.describe()

Unnamed: 0,Age,Tonnage,passengers,length,cabins,passenger_density,crew
count,158.0,158.0,158.0,158.0,158.0,158.0,158.0
mean,15.689873,71.284671,18.457405,8.130633,8.83,39.900949,7.794177
std,7.615691,37.22954,9.677095,1.793474,4.471417,8.639217,3.503487
min,4.0,2.329,0.66,2.79,0.33,17.7,0.59
25%,10.0,46.013,12.535,7.1,6.1325,34.57,5.48
50%,14.0,71.899,19.5,8.555,9.57,39.085,8.15
75%,20.0,90.7725,24.845,9.51,10.885,44.185,9.99
max,48.0,220.0,54.0,11.82,27.0,71.43,21.0


In [5]:
# To know the data structure 
ship_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158 entries, 0 to 157
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Ship_name          158 non-null    object 
 1   Cruise_line        158 non-null    object 
 2   Age                158 non-null    int64  
 3   Tonnage            158 non-null    float64
 4   passengers         158 non-null    float64
 5   length             158 non-null    float64
 6   cabins             158 non-null    float64
 7   passenger_density  158 non-null    float64
 8   crew               158 non-null    float64
dtypes: float64(6), int64(1), object(2)
memory usage: 11.2+ KB


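As seen above, we have two object (string) columns, Ship_name and Cruise_line, which must be converted to numbers before a scikit-learn model can use them. We will do this with LabelEncoder, a preprocessing utility from scikit-learn. Note that label encoding assigns arbitrary integer codes, and scikit-learn documents LabelEncoder as intended for target values rather than input features; this is harmless here only because both columns are dropped before modelling.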

In [6]:
#import labelEncoder
from sklearn.preprocessing import LabelEncoder

#instantiate LabelEncoder
le = LabelEncoder()

#Iterate over the column names and label-encode every column whose dtype is object
for col in ship_info.columns:
    if ship_info[col].dtype == 'object':
        ship_info[col] = le.fit_transform(ship_info[col])

In [7]:
ship_info.head()

Unnamed: 0,Ship_name,Cruise_line,Age,Tonnage,passengers,length,cabins,passenger_density,crew
0,52,0,6,30.277,6.94,5.94,3.55,42.64,3.55
1,91,0,6,30.277,6.94,5.94,3.55,42.64,3.55
2,11,1,26,47.262,14.86,7.22,7.43,31.8,6.7
3,15,1,11,110.0,29.74,9.53,14.88,36.99,19.1
4,20,1,17,101.353,26.42,8.92,13.21,38.36,10.0


As seen above, all our variables are now numerical.

Splitting the dataset into train and test sets. It is important to drop variables that are not necessary for predicting crew size, so **Ship_name and Cruise_line** will be dropped. At the same time we will define our target variable, which is **crew**. We will split the data with a scikit-learn utility called train_test_split.
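
One way to check which numeric columns are likely to matter is to look at their Pearson correlation with crew. The sketch below is only illustrative: it uses the ten rows printed by `ship_info.head(10)` above rather than the full file, and the names `sample` and `ranked` are ours.

```python
import pandas as pd

# Ten rows copied from the ship_info.head(10) output above (illustrative subset only).
rows = [
    (6, 30.277, 6.94, 5.94, 3.55, 42.64, 3.55),
    (6, 30.277, 6.94, 5.94, 3.55, 42.64, 3.55),
    (26, 47.262, 14.86, 7.22, 7.43, 31.80, 6.70),
    (11, 110.000, 29.74, 9.53, 14.88, 36.99, 19.10),
    (17, 101.353, 26.42, 8.92, 13.21, 38.36, 10.00),
    (22, 70.367, 20.52, 8.55, 10.20, 34.29, 9.20),
    (15, 70.367, 20.52, 8.55, 10.20, 34.29, 9.20),
    (23, 70.367, 20.56, 8.55, 10.22, 34.23, 9.20),
    (19, 70.367, 20.52, 8.55, 10.20, 34.29, 9.20),
    (6, 110.239, 37.00, 9.51, 14.87, 29.79, 11.50),
]
columns = ["Age", "Tonnage", "passengers", "length", "cabins", "passenger_density", "crew"]
sample = pd.DataFrame(rows, columns=columns)

# Pearson correlation of every feature with the target, strongest (by absolute value) first.
corr_with_crew = sample.corr()["crew"].drop("crew")
ranked = corr_with_crew.reindex(corr_with_crew.abs().sort_values(ascending=False).index)
print(ranked)
```

On the full data set the same two lines can be applied to `ship_info` directly; features with a correlation close to zero are candidates for dropping.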

In [8]:
ship_info.shape

(158, 9)

In [9]:
#import train_test_split
from sklearn.model_selection import train_test_split

#drop features Ship_name and Cruise_line
ship_info = ship_info.drop(['Ship_name', 'Cruise_line'], axis=1)

#feature matrix: every remaining column except the target
X = ship_info.drop('crew', axis=1)
#target variable: crew
y = ship_info['crew']

print(X.head())

#split into train(60%) and test(40%)
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.4, random_state=42)

   Age  Tonnage  passengers  length  cabins  passenger_density
0    6   30.277        6.94    5.94    3.55              42.64
1    6   30.277        6.94    5.94    3.55              42.64
2   26   47.262       14.86    7.22    7.43              31.80
3   11  110.000       29.74    9.53   14.88              36.99
4   17  101.353       26.42    8.92   13.21              38.36


Our data is now split into training and test sets.

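**Next we scale the features. Ordinary least squares does not strictly need scaling to make predictions, but putting the features on a common scale makes the fitted coefficients comparable and matters for many other models (for example regularized or distance-based ones). The scaler used is standardization (StandardScaler), which centers each feature on its mean with unit standard deviation, unlike normalization (MinMaxScaler), which rescales each feature to the range 0-1. The scaler is fitted on the training set only and then applied to the test set, so no information from the test set leaks into training.**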
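
To make the difference between the two scalers concrete, here is a minimal sketch on a made-up single feature (the values are hypothetical and not taken from the ship data).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature with five values (not taken from the ship data).
feature = np.array([[2.0], [4.0], [6.0], [8.0], [10.0]])

standardized = StandardScaler().fit_transform(feature)  # mean 0, standard deviation 1
normalized = MinMaxScaler().fit_transform(feature)      # smallest value 0, largest value 1

print("standardized mean/std:", standardized.mean(), standardized.std())
print("normalized min/max:", normalized.min(), normalized.max())
```

Standardization keeps the relative spacing of the values but leaves the range unbounded, while min-max normalization pins the range and is therefore more sensitive to a single extreme value.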

In [10]:
#import standardscaler
from sklearn.preprocessing import StandardScaler

#instantiate StandardScaler; fit it on X_train only, then apply the same transform to X_test
S_scaler = StandardScaler()
rescaledX_train = S_scaler.fit_transform(X_train)
rescaledX_test = S_scaler.transform(X_test)

In [11]:
from sklearn.linear_model import LinearRegression

# Create the regressor: lin_reg
lin_reg = LinearRegression()

# Fit lin_reg to the train set
lin_reg.fit(rescaledX_train, y_train)

# Predict on the test data: y_pred
y_pred = lin_reg.predict(rescaledX_test)

print(lin_reg.intercept_)
print(lin_reg.coef_)
# score() returns the coefficient of determination (R^2) on the test set
print(lin_reg.score(rescaledX_test, y_test))

7.523191489361703
[-0.00963213  0.54389059 -1.68304167  0.54554151  3.88951399  0.05898596]
0.9362030498069416


In [13]:
#Evaluating the performance of the algorithm
from sklearn.metrics import r2_score,mean_squared_error
print('Mean Squared Error:', mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, y_pred)))

Mean Squared Error: 0.8060380752609646
Root Mean Squared Error: 0.8977962325945484
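
Step 5 of the task asks for the Pearson correlation coefficient on the training and testing sets, which the cells above do not compute. A minimal sketch of one way to do it follows. So that it runs on its own it uses synthetic data and its own variable names (`X_demo`, `y_demo`, `pearson_r` are assumptions, not part of the notebook).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ship data (an assumption, so the sketch runs on its own):
# two features and a target that depends on them linearly plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(158, 2))
y_demo = 3.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=158)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42
)
model = LinearRegression().fit(Xd_train, yd_train)

def pearson_r(actual, predicted):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(actual, predicted)[0, 1]

r_train = pearson_r(yd_train, model.predict(Xd_train))
r_test = pearson_r(yd_test, model.predict(Xd_test))
print("Pearson r (train):", r_train)
print("Pearson r (test):", r_test)
```

In the notebook itself the same helper could be called with `y_train` and `lin_reg.predict(rescaledX_train)`, and with `y_test` and `y_pred`. A test value far below the training value would be a sign of overfitting.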


In [14]:
#Comparing the actual output values with the predicted values
#(stored under a new name so the original ship_info DataFrame is not overwritten)
comparison = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
comparison

Unnamed: 0,Actual,Predicted
128,7.20,8.232544
45,6.36,6.237792
134,21.00,21.575604
156,0.88,1.196622
90,3.50,3.852233
...,...,...
60,5.88,6.088496
151,6.80,6.350130
66,7.00,7.267055
16,10.30,9.756833


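# CONCLUSION:

The linear regression model achieved an R² of about 0.94 on the test set, meaning it explains roughly 94% of the variance in crew size, with a root mean squared error of about 0.90 in the units of the crew column. The Actual and Predicted values above are also close to each other, so the model predicts crew size well on ships it has not seen.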

## What is Overfitting?

Overfitting is a major problem in machine learning. It happens when a model captures noise (randomness) instead of signal (the real effect). As a result, the model performs impressively on the training set but poorly on a test set. Every data set contains two kinds of pattern:

**Random effect:** The random effect is the randomness (noise) in our data. It differs from one data set to another. A model will look overly optimistic on the training set if it becomes specialized to that randomness. For instance, if one employee received back-to-back annual promotions, an overfitted model would treat this coincidence as a real effect and take it into account when predicting salaries, even though most employees only get the default annual raise and no promotion.

**Real effect:** The real effect is the underlying pattern (signal) in our data that we are interested in. It is the same in all data sets. All employees in a company receiving a 5% raise each year, regardless of promotions, is an example of a real effect that our model should consider when predicting salaries.

# Possible Causes of Overfitting:

* Our model is too complex, or it includes multicollinear features, which inflate the variance of the estimated coefficients.
* The number of features in our data is greater than or equal to the number of data points.
* We have very few data points.
* We did not tune hyperparameters, so a very flexible model (for example, an unpruned decision tree) was left free to fit every point in the training data, noise included.

# How to Deal with Overfitting?

**Cross-validation:** Cross-validation is a model validation technique where we evaluate the quality of our model on data it was not trained on. The data is split into parts; the model is trained on some parts and validated on the held-out part, and the process is repeated so that every part is used for validation. K-fold cross-validation and leave-one-out cross-validation (LOOCV) are the two most popular cross-validation techniques.
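
As a rough illustration, k-fold cross-validation can be run in scikit-learn with `KFold` and `cross_val_score`. The data below is synthetic and the variable names are ours; this is a sketch, not the procedure used in the notebook above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (an assumption for illustration).
rng = np.random.default_rng(1)
X_cv = rng.normal(size=(100, 3))
y_cv = X_cv @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=100)

# 5-fold cross-validation: each fold is held out once while the model
# is trained on the other four folds.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_cv, y_cv, cv=folds, scoring="r2")

print("R^2 per fold:", scores)
print("mean R^2:", scores.mean())
```

A large spread between the per-fold scores, or a mean well below the training score, suggests the model is fitting noise.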

**Dimension reduction:** If our data has an overwhelming number of attributes, or multicollinearity between the attributes, we can use dimension reduction methods such as Principal Component Analysis (PCA), or feature selection through penalized regressions such as LASSO and Elastic Net. This makes the model simpler and less prone to fitting noise.
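
A small sketch of PCA on strongly collinear features follows. The three features are hypothetical noisy copies of one underlying quantity, so most of the variance should be captured by the first component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
base = rng.normal(size=(100, 1))
# Three hypothetical features that are noisy copies of the same underlying quantity.
X_pca = np.hstack([base + rng.normal(scale=0.1, size=(100, 1)) for _ in range(3)])

# Standardize first so every feature contributes on the same scale.
pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X_pca))
print("explained variance ratio:", pca.explained_variance_ratio_)
```

When one component explains nearly all the variance, the remaining components can usually be dropped before fitting a model.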

**Regularization:** Regularization adds a penalty term for model complexity to reduce the risk of overfitting. In regression it shrinks the coefficients of our features towards zero. However, regularizing too strongly, or regularizing a model that is already very simple, leads to underfitting, a situation where the model ignores real effects, i.e. signal.
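
The shrinking effect can be seen by comparing ordinary least squares with ridge regression on two nearly identical features. This is a sketch on synthetic data; `alpha=1.0` is an arbitrary choice for illustration, not a tuned value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data with two nearly collinear features (an assumption for illustration).
rng = np.random.default_rng(2)
x1 = rng.normal(size=60)
x2 = x1 + rng.normal(scale=0.01, size=60)   # almost a copy of x1
X_reg = np.column_stack([x1, x2])
y_reg = x1 + rng.normal(scale=0.5, size=60)

ols = LinearRegression().fit(X_reg, y_reg)
ridge = Ridge(alpha=1.0).fit(X_reg, y_reg)   # alpha sets the penalty strength

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```

With collinear features the unpenalized coefficients can become large and unstable, while the ridge penalty keeps their overall size down.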


# What is the difference between Bias and Variance?

***What is Bias?***

Bias is the amount by which a model’s average prediction differs from the true target value, where the average is taken over models trained on different samples of training data. Bias error results from simplifying assumptions made in a model so that the target function is easier to approximate. Bias can be introduced by model selection. To estimate it, data scientists use resampling to repeat the model building process and derive the average of the prediction values. Resampling is the process of drawing new samples from a data set in order to get more reliable estimates. There are a variety of ways to resample data, including:

* K-fold resampling, in which a data set is split into K sections, or folds, and each fold is used once as the testing set.
* Bootstrapping, which involves iteratively resampling a dataset with replacement.

Resampling lets us estimate bias. If the average of the predicted values is significantly different from the true value, the model has a high level of bias.
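
The two resampling schemes listed above can be sketched with scikit-learn's `KFold` and `resample` helpers. The ten "observations" here are just the integers 0 to 9, chosen so the splits are easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.utils import resample

data = np.arange(10)   # ten hypothetical observations, labelled 0..9

# K-fold: every observation lands in the test fold exactly once.
kfold = KFold(n_splits=5)
test_folds = [data[test_idx] for _, test_idx in kfold.split(data)]
print("test folds:", test_folds)

# Bootstrap: draw a sample of the same size with replacement,
# so some observations may repeat and others may be left out.
boot = resample(data, replace=True, n_samples=len(data), random_state=0)
print("bootstrap sample:", boot)
```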

Every algorithm starts with some level of bias, because bias results from assumptions in the model that make the target function easier to learn. A high level of bias can lead to underfitting, which occurs when the algorithm is unable to capture relevant relations between features and target outputs. A high bias model typically includes more assumptions about the target function or end result. A low bias model incorporates fewer assumptions about the target function. 

Linear algorithms often have high bias, which makes them fast to learn. In linear regression analysis, bias refers to the error that is introduced by approximating a real-life problem, which may be complicated, by a much simpler model. Though linear algorithms can introduce bias, they also make their output easier to understand. The simpler the algorithm, the more bias it has likely introduced. In contrast, nonlinear algorithms often have low bias.

***What Is Variance?***

Variance indicates how much the estimate of the target function would change if different training data were used. In other words, it describes how much the model's predictions vary around their expected value from one training set to another. Variance measures the inconsistency of predictions across different training sets; it is not a measure of overall accuracy.

Variance can lead to overfitting, in which small fluctuations in the training set are magnified. A model with high variance may reflect random noise in the training data set instead of the target function. The model should be able to identify the underlying connections between the input data and the output variables.

A model with low variance produces similar predictions no matter which sample of training data it was fitted on. A model with high variance changes its estimate of the target function significantly when the training data changes.

Machine learning algorithms with low variance include linear regression, logistic regression, and linear discriminant analysis. Those with high variance include decision trees, support vector machines, and k-nearest neighbors.


**Bias-Variance Trade-Off**

The goal of any supervised machine learning algorithm is to achieve low bias and low variance. In turn the algorithm should achieve good prediction performance.

You can see a general trend in the examples above:

* Linear machine learning algorithms often have a high bias but a low variance.
* Nonlinear machine learning algorithms often have a low bias but a high variance.

The parameterization of machine learning algorithms is often a battle to balance out bias and variance.

Below are two examples of configuring the bias-variance trade-off for specific algorithms:

* The k-nearest neighbors algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k, which increases the number of neighbors that contribute to the prediction and in turn increases the bias of the model.
* The support vector machine algorithm has low bias and high variance, but the trade-off can be changed through the C parameter. When C is defined as the number of margin violations allowed in the training data, increasing it increases the bias and decreases the variance. (In scikit-learn the convention is reversed: C is an inverse regularization strength, so it is decreasing C that increases the bias.)
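
The effect of k in k-nearest neighbors can be sketched on a synthetic noisy sine curve (the data and the values of k are assumptions chosen for illustration). With k=1 the model reproduces the training data exactly, while a very large k averages over most of the training set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic noisy sine curve (an assumption for illustration).
rng = np.random.default_rng(3)
X_knn = rng.uniform(0, 6, size=(200, 1))
y_knn = np.sin(X_knn[:, 0]) + rng.normal(scale=0.3, size=200)
Xk_train, Xk_test, yk_train, yk_test = train_test_split(
    X_knn, y_knn, test_size=0.4, random_state=0
)

results = {}
for k in (1, 10, 100):
    knn = KNeighborsRegressor(n_neighbors=k).fit(Xk_train, yk_train)
    # R^2 on the training data versus on held-out data
    results[k] = (knn.score(Xk_train, yk_train), knn.score(Xk_test, yk_test))
    print(f"k={k:3d}  train R^2={results[k][0]:.3f}  test R^2={results[k][1]:.3f}")
```

A large gap between training and test scores at small k points to high variance; low scores on both at large k point to high bias.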

There is no escaping the relationship between bias and variance in machine learning.

* Increasing the bias will decrease the variance.
* Increasing the variance will decrease the bias.

There is a trade-off at play between these two concerns, and the algorithms you choose and the way you configure them strike different balances in this trade-off for your problem.

In reality, we cannot calculate the real bias and variance error terms because we do not know the actual underlying target function. Nevertheless, as a framework, bias and variance provide the tools to understand the behavior of machine learning algorithms in the pursuit of predictive performance.
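
On synthetic data, where we choose the target function ourselves, the two error terms can be estimated by refitting a model on many fresh training sets. The sketch below does this for a linear model and a fully grown decision tree; the sine target, the noise level, and the helper name `bias_variance` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

def true_f(x):
    return np.sin(x)          # known target function (only possible on synthetic data)

rng = np.random.default_rng(4)
x_eval = np.linspace(0, 6, 50).reshape(-1, 1)   # fixed points where we compare models

def bias_variance(make_model, n_sets=200, n_points=50, noise=0.3):
    """Estimate squared bias and variance by refitting on many fresh training sets."""
    preds = np.empty((n_sets, len(x_eval)))
    for i in range(n_sets):
        x = rng.uniform(0, 6, size=(n_points, 1))
        y = true_f(x[:, 0]) + rng.normal(scale=noise, size=n_points)
        preds[i] = make_model().fit(x, y).predict(x_eval)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_f(x_eval[:, 0])) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

lin_bias, lin_var = bias_variance(LinearRegression)
tree_bias, tree_var = bias_variance(lambda: DecisionTreeRegressor(random_state=0))
print(f"linear: bias^2={lin_bias:.3f}  variance={lin_var:.3f}")
print(f"tree:   bias^2={tree_bias:.3f}  variance={tree_var:.3f}")
```

In this setup the straight line cannot follow the curve (higher bias) but barely changes between training sets (lower variance), while the tree shows the opposite pattern.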

# When to use classification over regression?

**Classification:**
Classification is the process of finding a function that divides a dataset into classes based on its features. A classification model is trained on a labelled training dataset and, based on that training, assigns new data to one of the classes. Use classification rather than regression when the quantity to predict is a category (for example, spam or not spam) rather than a number.

The task of a classification algorithm is to find the mapping function from the input (x) to the discrete output (y).

Example: A classic example of a classification problem is email spam detection. The model is trained on millions of emails described by different features, and whenever a new email arrives, it identifies whether the email is spam or not. If the email is spam, it is moved to the Spam folder.

***Types of ML Classification Algorithms:***

Classification Algorithms can be further divided into the following types:

* Logistic Regression

* K-Nearest Neighbours

* Support Vector Machines

* Kernel SVM

* Naïve Bayes

* Decision Tree Classification

* Random Forest Classification


**Regression:**
Regression is the process of finding the relationship between a dependent variable and one or more independent variables. It is used to predict continuous quantities, such as market trends or house prices.

The task of a regression algorithm is to find the mapping function from the input variable (x) to the continuous output variable (y).

Example: Suppose we want to forecast the temperature for the coming days. This is a regression problem: the model is trained on past weather data and, once trained, it predicts a numeric temperature for future days.

***Types of Regression Algorithm:***

* Simple Linear Regression

* Multiple Linear Regression

* Polynomial Regression

* Support Vector Regression

* Decision Tree Regression

* Random Forest Regression

**Difference between Regression and Classification**

1. In regression, the output variable must be continuous (a real value), *while* in classification the output variable must be a discrete value.

2. The task of a regression algorithm is to map the input value (x) to a continuous output variable (y), *while* the task of a classification algorithm is to map the input value (x) to a discrete output variable (y).

3. Regression algorithms are used when the target is continuous, *while* classification algorithms are used when the target is discrete.

4. In regression, we try to find the best-fit line (or curve) that predicts the output most accurately, *while* in classification we try to find the decision boundary that divides the dataset into different classes.

5. Regression algorithms solve problems such as weather prediction and house price prediction, *while* classification algorithms solve problems such as identification of spam emails, speech recognition, and identification of cancer cells.

6. Regression algorithms can be further divided into linear and non-linear regression, *while* classification algorithms can be divided into binary classifiers and multi-class classifiers.
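
The contrast can be sketched by fitting both kinds of model to the same feature. The data below is invented for illustration (a hypothetical tonnage feature, a continuous crew target, and a yes/no label derived from it); it is not the ship data set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical data: one feature, e.g. ship tonnage (values invented for illustration).
rng = np.random.default_rng(5)
tonnage = rng.uniform(5, 200, size=(120, 1))
crew = 0.1 * tonnage[:, 0] + rng.normal(scale=1.0, size=120)   # continuous target
is_large = (crew > 10).astype(int)                              # discrete target (0 or 1)

# Regression: predict the number itself.
reg = LinearRegression().fit(tonnage, crew)
# Classification: predict which class the ship falls into.
clf = LogisticRegression(max_iter=1000).fit(tonnage, is_large)

new_ship = np.array([[150.0]])
print("predicted crew (regression):", reg.predict(new_ship)[0])
print("predicted class (classification):", clf.predict(new_ship)[0])
```

The regressor returns a real number, while the classifier returns one of the labels it saw during training. Which one to use depends on whether the question is "how many?" or "which kind?".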