# Learning basic ML model training concepts by solving the Titanic Survival Prediction problem

### About the Problem:
Using machine learning tools, we need to analyze the information about the passengers of the RMS Titanic and predict which passengers survived. This problem is published by Kaggle and is widely used for learning the basic concepts of machine learning.

### About the data sets

#### Data Dictionary

- Age: Age
- Cabin: Cabin
- Embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
- Fare: Passenger Fare
- Name: Name
- Parch: Number of Parents/Children Aboard
- Pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- Sex: Sex
- SibSp: Number of Siblings/Spouses Aboard
- Survived: Survival (0 = No; 1 = Yes)
- Ticket: Ticket Number

#### Variable Notes

- pclass: A proxy for socio-economic status (SES)
  - 1st = Upper
  - 2nd = Middle
  - 3rd = Lower

- age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

- sibsp: The dataset defines family relations in this way:
  - Sibling = brother, sister, stepbrother, stepsister
  - Spouse = husband, wife (mistresses and fiancés were ignored)

- parch: The dataset defines family relations in this way:
  - Parent = mother, father
  - Child = daughter, son, stepdaughter, stepson
  - Some children travelled only with a nanny, therefore parch=0 for them.

#### Download location
training data location ->  "https://www.kaggle.com/c/titanic/download/train.csv" <br>
test data location -> "https://www.kaggle.com/c/titanic/download/test.csv"

### 1.0 Load data sets

In [None]:
import sys
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share your notebook.
client_366a081119f849e6862e88812b3ed98f = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='xxx',  # credential removed; supply your own API key
    ibm_auth_endpoint="https://iam.ng.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_366a081119f849e6862e88812b3ed98f.get_object(Bucket='titanicsurvivalpredictionf8684a7b97d94dde9b87f6e498cf1eb0',Key='train.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

training_df = pd.read_csv(body)
training_df.head()

In [None]:
body = client_366a081119f849e6862e88812b3ed98f.get_object(Bucket='titanicsurvivalpredictionf8684a7b97d94dde9b87f6e498cf1eb0',Key='test.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

test_df = pd.read_csv(body)
test_df.head()

Combine the training and test data sets so that we can apply the data transformations to both in a single pass. Once the transformations are complete, the combined data has to be split back into the training and test data sets without mixing up samples between them.

In [None]:
test_df['Survived'] = 0
test_df.head()

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
complete_data_df = pd.concat([training_df, test_df], ignore_index=True)
complete_data_df.head()

In [None]:
print("No. of Training Data samples: " + str(training_df.shape[0]))
print("No. of Test Data samples: " + str(test_df.shape[0]))
print("Complete Data samples: " + str(complete_data_df.shape[0]))
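Splitting the combined frame back apart later depends on knowing exactly which rows came from which file. As a minimal sketch, assuming two small made-up frames (`toy_train`, `toy_test`) and a hypothetical marker column `IsTrain` that is not part of the original notebook, each row can be tagged with its origin before combining, so the split does not rely on row positions:

```python
import pandas as pd

# Toy stand-ins for the real training and test frames (values are made up).
toy_train = pd.DataFrame({"PassengerId": [1, 2, 3],
                          "Age": [22.0, 38.0, None],
                          "Survived": [0, 1, 1]})
toy_test = pd.DataFrame({"PassengerId": [4, 5],
                         "Age": [34.5, None]})

# Tag every row with its origin before combining (IsTrain is a hypothetical column).
combined = pd.concat(
    [toy_train.assign(IsTrain=True), toy_test.assign(IsTrain=False)],
    ignore_index=True,
)

# A transformation applied once to the combined frame.
combined["Age"] = combined["Age"].fillna(combined["Age"].median())

# Split back using the tag instead of hard-coded row positions.
train_part = combined[combined["IsTrain"]].drop(columns="IsTrain")
test_part = combined[~combined["IsTrain"]].drop(columns=["IsTrain", "Survived"])

print(len(train_part), len(test_part))
```

This is only one way to keep the two sets apart; the notebook itself relies on the known number of training rows.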

### 2.0 Data Pre-processing

##### 2.1 Handle Missing Data

Check for missing values in the columns 

In [None]:
complete_data_df.isnull().sum()
training_df.isnull().sum()

Nearly 80% of the Cabin values are missing, so this column will not be of much use for training the model.
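To read the missing-value counts as proportions, the mean of the null mask can be used instead of the sum. The sketch below runs on a small made-up frame (`toy_df`), not on the Titanic data, so the numbers it prints are illustrative only:

```python
import numpy as np
import pandas as pd

# Made-up values; only the pattern of missing entries matters here.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.1, 8.05],
})

# isnull() gives a boolean mask; its column mean is the share of missing values.
missing_share = toy_df.isnull().mean().sort_values(ascending=False)
print(missing_share)
```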

Let us replace the missing values for Age with the median. Although this is not the best way to replace missing data, we shall use it for the sake of simplicity.

In [None]:
complete_data_df['Age'] = complete_data_df['Age'].fillna(complete_data_df['Age'].median())
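A single global median ignores differences between groups of passengers. As one possible refinement, not used in this notebook, the median can be computed per group and used to fill the gaps. The sketch below assumes a made-up frame (`toy_df`) and a hypothetical output column `AgeFilled`:

```python
import numpy as np
import pandas as pd

# Made-up ages for two passenger classes.
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, 24.0, np.nan],
})

# Median age within each passenger class, broadcast back to every row.
class_median = toy_df.groupby("Pclass")["Age"].transform("median")

# Fill each missing age with the median of that passenger's own class.
toy_df["AgeFilled"] = toy_df["Age"].fillna(class_median)
print(toy_df)
```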

Replace the missing values for Embarked with the port where the largest number of passengers boarded.

In [None]:
complete_data_df.Embarked.value_counts()


In [None]:
complete_data_df['Embarked'] = complete_data_df['Embarked'].fillna('S')
complete_data_df.Embarked.unique()
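The most frequent port can also be looked up programmatically rather than typed in by hand. A minimal sketch on a made-up series (`toy_ports`), assuming the most common value is unique:

```python
import pandas as pd

# Made-up embarkation codes with one missing entry.
toy_ports = pd.Series(["S", "C", "S", None, "Q"])

# mode() ignores missing values and returns the most frequent value(s).
most_common_port = toy_ports.mode()[0]
filled_ports = toy_ports.fillna(most_common_port)
print(most_common_port, filled_ports.tolist())
```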


##### 2.2 Encode categorical feature columns

Encode the values of the categorical columns -- Sex, Embarked

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
def encode_features(data_set, feature_names):
    for feature_name in feature_names:
        le = LabelEncoder()
        le.fit(data_set[feature_name])
        encoded_column = le.transform(data_set[feature_name])
        data_set[feature_name] = encoded_column
    return data_set    

In [None]:
features_to_encode = ['Sex', 'Embarked']
complete_data_df = encode_features(complete_data_df, features_to_encode)
complete_data_df.head(10)
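To see what the encoder does to a column, the sketch below applies `LabelEncoder` to a short made-up list of port codes. The classes are sorted before integer codes are assigned, so the codes follow alphabetical order and carry no meaning beyond identity:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of embarkation codes.
ports = ["S", "C", "Q", "S", "C"]

le = LabelEncoder()
codes = le.fit_transform(ports)

print(list(le.classes_))   # learned classes, in sorted order
print(codes.tolist())      # integer code for each input value

# The mapping is reversible, which helps when inspecting encoded data.
decoded = le.inverse_transform(codes)
print(list(decoded))
```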


### 3.0 Feature Engineering

##### 3.1 Infer Title of the passengers from their names and consider it as a feature

In [None]:
parsed_names = complete_data_df.Name.str.split('[,.]')
parsed_names[:10]

In [None]:
titles = [str.strip(name[1]) for name in parsed_names.values]

In [None]:
complete_data_df['Title'] = titles
complete_data_df.Title.unique()
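The names follow the pattern `Surname, Title. Given names`, so splitting on the comma and the period leaves the title as the second piece. A minimal sketch of the same idea using the standard `re` module on a few sample names in that format (`extract_title` is a hypothetical helper, not part of the notebook):

```python
import re

# Sample names in the "Surname, Title. Given names" format.
names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]

def extract_title(full_name):
    # Split on commas and periods; the title is the second piece.
    parts = re.split(r"[,.]", full_name)
    return parts[1].strip()

sample_titles = [extract_title(name) for name in names]
print(sample_titles)
```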

Combine the titles with similar meanings

In [None]:
# Assign through .loc so that the change is reliably written back to the DataFrame
complete_data_df.loc[complete_data_df.Title.isin(['Mme', 'Mlle']), 'Title'] = 'Mlle'
complete_data_df.loc[complete_data_df.Title.isin(['Capt', 'Don', 'Major', 'Sir']), 'Title'] = 'Sir'
complete_data_df.loc[complete_data_df.Title.isin(['Dona', 'Lady', 'the Countess', 'Jonkheer']), 'Title'] = 'Lady'

Encode the Title feature column

In [None]:
complete_data_df = encode_features(complete_data_df, ['Title'])

In [None]:
complete_data_df.head()

##### 3.2 Infer if the passenger is a Minor and consider it as a feature

In [None]:
import numpy as np

In [None]:
complete_data_df['IsMinor']=np.where(complete_data_df['Age']<=16, 1, 0)

In [None]:
complete_data_df.head()

Now that the data set is cleaned up, let us train a model and see how it performs. Before training, we need to prepare the list of features that we want to use and split the combined data set back into training and test data sets. As we will be doing this multiple times, let us create functions for the split.

Prepare the list of features that we want to train on

In [None]:
features = ['Age', 'Embarked', 'Fare', 'Parch', 'Pclass', 'Sex', 'SibSp', 'Title', 'IsMinor']

In [None]:
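# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn import model_selection as cross_validation

def get_training_data(combined_data_set):
    # The first 891 rows of the combined frame are the training samples
    training_data = combined_data_set.iloc[:891].copy()
    return training_data

def get_test_data(combined_data_set):
    # Start at row 891 so that the first test sample is not dropped
    test_data = combined_data_set.iloc[891:].copy()
    return test_data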


In [None]:
training_data = get_training_data(complete_data_df)

Let us use the Logistic Regression algorithm for training

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
lr = LogisticRegression(random_state=1)

Let us see how well the model performs by calculating the cross-validated accuracy of its predictions on the training data, first without and then with the IsMinor feature.

In [None]:
features_wo_minor = ['Age', 'Embarked', 'Fare', 'Parch', 'Pclass', 'Sex', 'SibSp', 'Title']

In [None]:
scores = cross_validation.cross_val_score(lr, training_data[features_wo_minor], training_data['Survived'], cv=3)
print("Score Result: " + str(scores))
print("Average Score: " + str(scores.mean()))

In [None]:
features_w_minor = ['Age', 'Embarked', 'Fare', 'Parch', 'Pclass', 'Sex', 'SibSp', 'Title', 'IsMinor']

In [None]:
scores = cross_validation.cross_val_score(lr, training_data[features_w_minor], training_data['Survived'], cv=3)
print("Score Result: " + str(scores))
print("Average Score: " + str(scores.mean()))
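The comparison above scores the same estimator on two feature sets with 3-fold cross-validation. A self-contained sketch of that pattern, assuming synthetic data from `make_classification` in place of the Titanic features and `max_iter=1000` as an arbitrary setting to avoid convergence warnings:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training features and labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

model = LogisticRegression(max_iter=1000)

# Score the first seven columns, then all eight, using the same folds.
base_scores = cross_val_score(model, X[:, :7], y, cv=3)
full_scores = cross_val_score(model, X, y, cv=3)

print("Without last feature:", base_scores.mean())
print("With last feature:   ", full_scores.mean())
```

Each call returns one accuracy value per fold; a small difference between the two means is often within fold-to-fold noise, so it is worth looking at the individual scores as well.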

Let us finalize the features

In [None]:
selected_features = ['Age', 'Embarked', 'Fare', 'Parch', 'Pclass', 'Sex', 'SibSp', 'Title', 'IsMinor' ]

### 4.0 Train and evaluate the model

In [None]:
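# sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV now lives in model_selection
from sklearn.model_selection import GridSearchCV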
import numpy as np

In [None]:
params = {'C': np.arange(1e-05, 3, 0.1)}

In [None]:
scoring = {'Accuracy': 'accuracy', 'AUC': 'roc_auc'}

In [None]:
gs = GridSearchCV(LogisticRegression(),
                  param_grid=params)

In [None]:
gs.fit(training_data[selected_features], training_data['Survived'])

In [None]:
gs

In [None]:
print("Best score: %s" % (gs.best_score_))
print("Best parameter set: %s" % (gs.best_params_))
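The `scoring` dictionary defined earlier is not passed to the grid search, so the search uses the estimator's default score. As a sketch of how such a dictionary could be used, assuming synthetic data, a short hypothetical list of `C` values, and AUC as the metric chosen for refitting:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training features and labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
scoring = {"Accuracy": "accuracy", "AUC": "roc_auc"}

# With several metrics, refit names the one used to pick the best parameters.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    scoring=scoring,
    refit="AUC",
    cv=3,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Mean accuracy per candidate:", search.cv_results_["mean_test_Accuracy"])
print("Mean AUC per candidate:", search.cv_results_["mean_test_AUC"])
```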

### 5.0 Deploy the model

Now that we have a well-trained model, we can deploy it in a production environment for use by end users or applications.

Here, I will be using the IBM Watson Machine Learning (WML) service to deploy the trained model as a REST service.

##### IBM WML Service Credentials

In [None]:

wml_credentials = {
  "url": "https://ibm-watson-ml.mybluemix.net",
  "access_key": "xxx",
  "username": "xxx",
  "password": "xxx",
  "instance_id": "xxx"
}



#### 5.1. Save the model to WML Repository

In order to deploy the model in the WML service, the model has to be saved in the WML Repository. We will be using WML's Python client for this purpose.

In [None]:
from repository_v3.mlrepository import MetaNames
from repository_v3.mlrepository import MetaProps
from repository_v3.mlrepositoryclient import MLRepositoryClient
from repository_v3.mlrepositoryartifact import MLRepositoryArtifact

import pprint

Initialize the WML repository client and authorize it with the service credentials

In [None]:
ml_repository_client = MLRepositoryClient(wml_credentials['url'])
ml_repository_client.authorize(wml_credentials['username'], wml_credentials['password'])

The code below uploads the trained model to the WML Repository as a compressed tarball. The API returns the metadata that was created when the model was saved.

In [None]:
props_meta = MetaProps({MetaNames.AUTHOR_NAME:"Krishna", MetaNames.AUTHOR_EMAIL:"krishna@in.ibm.com"})

In [None]:
model_artifact = MLRepositoryArtifact(gs, name='titanic_survival_prediction', meta_props=props_meta)
saved_model = ml_repository_client.models.save(model_artifact)


In [None]:

saved_model_details = saved_model.meta.get()
print("Model GUID: " + saved_model.uid )
pprint.pprint(saved_model_details)


#### 5.2 Deploy the model

In [None]:

import urllib3
import time
import base64
import requests
import json
import pprint
import numpy as np

##### 5.2.1 Generate token 

In [None]:
headers = urllib3.util.make_headers(basic_auth='{}:{}'.format(wml_credentials['username'], wml_credentials['password']))
url = '{}/v3/identity/token'.format(wml_credentials['url'])
response = requests.get(url, headers=headers)
mltoken = json.loads(response.text).get('token')
header = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + mltoken}
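The token request sends the username and password as an HTTP Basic credential, and later requests carry the returned token in a Bearer header. The sketch below shows how both headers are put together using only the standard library. `basic_auth_header` and `bearer_header` are hypothetical helpers, and the credentials and token are placeholders:

```python
import base64

def basic_auth_header(username, password):
    # HTTP Basic auth: base64 of "username:password".
    raw = "{}:{}".format(username, password).encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}

def bearer_header(token):
    # Header used for requests made after the token has been issued.
    return {"Content-Type": "application/json",
            "Authorization": "Bearer " + token}

print(basic_auth_header("user", "pass"))
print(bearer_header("placeholder-token"))
```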

Get published_models url from instance details

In [None]:
endpoint_instance = wml_credentials['url'] + "/v3/wml_instances/" + wml_credentials['instance_id']
header = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + mltoken} 

response_get_instance = requests.get(endpoint_instance, headers=header)
print(response_get_instance)
print(response_get_instance.text)

In [None]:
endpoint_published_models = json.loads(response_get_instance.text).get('entity').get('published_models').get('url')
print("Published models url: " + endpoint_published_models)


Get deployment URL of the saved model

In [None]:
response_models = requests.get(endpoint_published_models, headers=header)
[deployment_url] = [x.get('entity').get('deployments').get('url') for x in json.loads(response_models.text).get('resources') if x.get('metadata').get('guid') == saved_model.uid]
print(deployment_url)

Prepare the payload and submit the deployment request

In [None]:
payload_online = {"name": "titanic_surv_prediction", "type": "online"}
response_online = requests.post(deployment_url, json=payload_online, headers=header)


Check the response to the deployment request

In [None]:
print("Response Code: " + str(response_online.status_code))
pprint.pprint(response_online.content)

### 6.0 Predictions based on deployed model

###### 6.1 Get input data for scoring from test data 

In [None]:
test_data = get_test_data(complete_data_df)[selected_features]

In [None]:
input_for_prediction = test_data.values[np.random.randint(test_data.shape[0])]

In [None]:
input_for_prediction = input_for_prediction.tolist()

In [None]:
input_for_prediction

###### 6.2 Prepare JSON payload for scoring

In [None]:
payload_scoring = { "values": [input_for_prediction] }
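The call to `tolist()` earlier matters because the standard `json` encoder cannot serialize a NumPy array directly. A minimal sketch with made-up feature rows (`rows`), showing the conversion to plain Python lists before building a payload of the same `values` shape:

```python
import json
import numpy as np

# Two made-up feature rows, as they would come out of DataFrame.values.
rows = np.array([[28.0, 2.0, 7.25], [35.0, 0.0, 53.1]])

# tolist() turns the array into nested Python lists that json can encode.
payload = {"values": rows.tolist()}
encoded = json.dumps(payload)
print(encoded)
```

Passing several rows in `values` at once is an assumption about the scoring endpoint; the notebook itself sends a single row.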

###### 6.3 Get URL for scoring request from deployment's response

In [None]:
scoring_url = json.loads(response_online.text).get('entity').get('scoring_url')
print(scoring_url)
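If the deployment request failed, the response has no `scoring_url` and the chained `get` calls raise an unhelpful `AttributeError`. A small sketch of a more defensive lookup, where `extract_scoring_url` is a hypothetical helper and the sample response is made up to mirror the `entity.scoring_url` shape used above:

```python
import json

def extract_scoring_url(response_text):
    # Parse the deployment response and fail with a clear message if the URL is absent.
    body = json.loads(response_text)
    url = body.get("entity", {}).get("scoring_url")
    if url is None:
        raise ValueError("No scoring_url in deployment response: " + response_text[:200])
    return url

# Made-up response body for illustration only.
sample_response = json.dumps({"entity": {"scoring_url": "https://example.invalid/score"}})
print(extract_scoring_url(sample_response))
```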

In [None]:
response_scoring = requests.post(scoring_url, json=payload_scoring, headers=header)
pprint.pprint(response_scoring.text)

### 7.0 References

Kaggle Titanic - Machine Learning from Disaster: https://www.kaggle.com/c/titanic <br>
IBM Data Science Experience: https://datascience.ibm.com/ <br>
IBM Bluemix: https://console.bluemix.net/ <br>
