#### Copyright 2019 Google LLC.

In [1]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Basic Classification Project

In this project you will perform a basic classification task.
You will apply what you learned about binary classification and TensorFlow to implement a Kaggle project without much guidance. The challenge is to achieve a high accuracy score when predicting which passengers survived the sinking of the Titanic. After building your model, you will upload your predictions to Kaggle and submit the score that you receive.

## Overview

### Learning Objectives

* Define, build, train and evaluate a Linear Classifier model in TensorFlow.
* Submit predictions to a Kaggle challenge.


### Prerequisites

* T05-09 Classification with TensorFlow

## Titanic: Machine Learning from Disaster

[Kaggle](https://www.kaggle.com) has a [dataset](https://www.kaggle.com/c/titanic/data) containing the passenger list for the Titanic voyage. The data contains passenger features such as age, gender, and ticket class, as well as whether or not they survived.

Your job is to load the data and create a binary classifier using TensorFlow to determine if a passenger survived or not. Then, upload your predictions to Kaggle and submit your accuracy score at the end of this colab, along with a brief conclusion.


# Exercises

## Exercise 1: Create a Classifier

**Graded** demonstrations of competency:

1. Download the [dataset](https://www.kaggle.com/c/titanic/data).
2. Load the data into this Colab.
3. Look at the description of the [dataset](https://www.kaggle.com/c/titanic/data) to understand the columns.
4. Explore the dataset. Ask yourself: are there any missing values? Do the data values make sense? Which features seem to be the most important? Are they highly correlated with each other?
5. Prep the data (deal with missing values, drop unnecessary columns, transform the data if needed, etc).
6. Split the data into training and testing sets.
7. Create a `tensorflow.estimator.LinearClassifier`.
8. Train the classifier using an input function that feeds the classifier training data.
9. Make predictions on the test data using your classifier.
10. Find the accuracy, precision, and recall of your classifier.
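
As a starting point for the exploration in step 4, the sketch below counts missing values per column and prints each column's correlation with the label. It runs on a small made-up DataFrame (the `toy` frame and its values are assumptions for illustration, not rows from the real dataset); the same calls should carry over to the DataFrame loaded from `train.csv`.

```python
import pandas as pd

# Toy stand-in for a few Titanic-style rows (values are made up for illustration).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 1, 2, 3],
    "Age":      [22.0, 38.0, None, 35.0, None, 54.0],
    "Fare":     [7.25, 71.28, 7.92, 53.10, 13.00, 8.05],
})

# Count missing values per column, largest first.
missing_counts = toy.isnull().sum().sort_values(ascending=False)
print(missing_counts)

# Pairwise Pearson correlation between the numeric columns;
# the "Survived" column shows how each feature relates to the label.
correlations = toy.corr()
print(correlations["Survived"])
```

Columns with many missing values, or pairs of features that are strongly correlated with each other, are usually the first candidates for imputation or removal in step 5.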
 

### Student Solution

In [2]:
import numpy as np 
import pandas as pd
import math
import matplotlib
import matplotlib.pyplot as plt
import re
import seaborn as sns

dataset_filename = "./train.csv"
train_csv = pd.read_csv(dataset_filename)


In [3]:
train_csv.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
train_csv.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [5]:
train_csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


**Preprocessing the data**

In [6]:
mean = train_csv.groupby('Sex')['Age'].mean()
mean

Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64

In [7]:
train_csv['Embarked']=train_csv['Embarked'].fillna('S')

In [8]:
total = train_csv.isnull().sum().sort_values(ascending=False)

In [9]:
train_csv.Cabin = train_csv.Cabin.fillna('N')
train_csv.Cabin = train_csv.Cabin.map(lambda x:x[0])

In [10]:
train_csv['Family'] = train_csv.Parch + train_csv.SibSp + 1

In [11]:
train_csv['Title'] = train_csv.Name.apply(lambda x: x.split(',')[1].split('.')[0].strip())

In [12]:
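# Fill missing ages with the mean age of passengers of the same sex and ticket class.
# transform() returns a result aligned with the original index, so it can be assigned back directly.
train_csv['Age'] = train_csv.groupby(['Sex', 'Pclass'])['Age'].transform(lambda x: x.fillna(x.mean()))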

In [13]:
# Truncate ages to whole years and divide by the maximum age so the values fall in [0, 1].
train_csv['Age'] = train_csv['Age'].astype(int) / train_csv['Age'].max()

In [14]:
train_csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          891 non-null object
Embarked       891 non-null object
Family         891 non-null int64
Title          891 non-null object
dtypes: float64(2), int64(6), object(6)
memory usage: 97.5+ KB


In [15]:
# train_csv.Cabin.unique()

In [16]:
# train_csv['Sex'] = train_csv['Sex'].astype('category').cat.codes
# train_csv['Sex'].astype(int)

In [17]:
# train_csv['Embarked'] = train_csv['Embarked'].astype('category').cat.codes
# train_csv['Embarked']

In [18]:
# train_csv.Fare = train_csv.Fare.astype(int)

# Create, Train, and Test the Model

In [19]:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(
  train_csv,
  stratify = train_csv.Sex,
  test_size=0.2,
  random_state = 42
)

In [20]:
train_df.groupby('Sex')['Sex'].agg('count')

Sex
female    251
male      461
Name: Sex, dtype: int64

In [21]:
test_df.groupby('Sex')['Sex'].agg('count')

Sex
female     63
male      116
Name: Sex, dtype: int64

In [22]:
import tensorflow as tf
CATEGORICAL_COLUMNS = ['Pclass','Sex', 'SibSp', 'Parch','Embarked','Family','Title']
NUMERIC_COLUMNS = ['Age', 'Fare']
columns = ['Pclass', 'Sex', 'Age', 'SibSp','Parch', 'Fare','Embarked','Family','Title']
feature_columns = []
for feature_name in CATEGORICAL_COLUMNS:
  vocabulary = train_df[feature_name].unique()
  feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary))

for feature_name in NUMERIC_COLUMNS:
  feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float32))

In [23]:
class_count = len(train_df['Survived'].unique())

In [24]:
from tensorflow.estimator import LinearClassifier

classifier = LinearClassifier(feature_columns=feature_columns, n_classes=class_count)

W0731 14:40:15.497383 4712515008 deprecation_wrapper.py:119] From /Users/dorishuang/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/api/_v1/estimator/__init__.py:10: The name tf.estimator.inputs is deprecated. Please use tf.compat.v1.estimator.inputs instead.

W0731 14:40:15.500555 4712515008 estimator.py:1811] Using temporary folder as model directory: /var/folders/0n/ctf3gvvx57z27lg3_l0nbxzh0000gn/T/tmpi_rc9j4p


In [25]:
from tensorflow.data import Dataset

def training_input():
  features = {}
  feature_columns = ['Pclass', 'Sex', 'Age', 'SibSp','Parch', 'Fare','Embarked','Family','Title']

  for feat in feature_columns:
    features[feat] = train_df[feat]
 
  labels = train_df['Survived']

  training_ds = Dataset.from_tensor_slices((features, labels))
  training_ds = training_ds.shuffle(buffer_size=10000)
  training_ds = training_ds.batch(100)
  training_ds = training_ds.repeat(10)

  return training_ds

classifier.train(training_input)

W0731 14:40:15.554804 4712515008 deprecation.py:323] From /Users/dorishuang/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W0731 14:40:16.435414 4712515008 deprecation.py:323] From /Users/dorishuang/anaconda3/lib/python3.6/site-packages/tensorflow/python/feature_column/feature_column_v2.py:2655: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0731 14:40:17.112155 4712515008 deprecation.py:323] From /Users/dorishuang/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/canned/linear

<tensorflow_estimator.python.estimator.canned.linear.LinearClassifier at 0x1281e5b70>

In [26]:
def testing_input():
  features = {}
  feature_columns = ['Pclass', 'Sex', 'Age', 'SibSp','Parch', 'Fare','Embarked','Family','Title']
  for feat in feature_columns:
    features[feat] = test_df[feat]

  return Dataset.from_tensor_slices((features)).batch(1)

predictions_iterator = classifier.predict(testing_input)
predictions = [p['class_ids'][0] for p in predictions_iterator]

W0731 14:40:23.183722 4712515008 deprecation.py:323] From /Users/dorishuang/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.


In [27]:
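from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

# Note: with average='micro' on binary labels, precision and recall are both
# pooled over the two classes, so each value equals the overall accuracy.
# That is why the two printed numbers are identical.
print(precision_score(test_df['Survived'], predictions, average='micro'))
print(recall_score(test_df['Survived'], predictions, average='micro'))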

0.8156424581005587
0.8156424581005587
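
The two identical numbers above are a consequence of micro-averaging: on a binary problem, micro-averaged precision and recall both reduce to accuracy. The sketch below, written in plain Python on made-up labels (`y_true` and `y_pred` are illustrative assumptions, not the notebook's data), computes accuracy together with precision and recall for the positive class only. In scikit-learn, the positive-class figures correspond to the default `average='binary'` setting of `precision_score` and `recall_score`.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the predicted survivors, how many survived
recall = tp / (tp + fn)      # of the actual survivors, how many were found

# Micro-averaged precision pools true and false positives over both classes.
# For class 0, the "true positives" are tn and the "false positives" are fn.
micro_precision = (tp + tn) / ((tp + fp) + (tn + fn))

print(accuracy, precision, recall, micro_precision)
```

Here the positive-class precision and recall differ from each other and from accuracy, while the micro-averaged value matches accuracy exactly.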


## Exercise 2: Upload your predictions to Kaggle

**Graded** demonstrations of competency:
1. Download the test.csv file from Kaggle and re-run your model using all of the training data.
2. Use this new test data to generate predictions using your model.
3. Follow the instructions in the [evaluation section](https://www.kaggle.com/c/titanic/overview/evaluation) to output the predictions in the format of the gender_submission.csv file. Download the predictions file from your Colab and upload it to Kaggle.
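
The submission format in step 3 is a two-column CSV with a `PassengerId,Survived` header. A minimal sketch of producing it with pandas follows; the IDs and predictions are placeholder values chosen for illustration, and the text is rendered to a string rather than a file so the layout can be inspected.

```python
import pandas as pd

# Placeholder IDs and predictions (assumed values, not real model output).
passenger_ids = [892, 893, 894]
predicted = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predicted})

# With no path given, to_csv returns the CSV text as a string.
# index=False keeps the DataFrame index out of the file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a filename as the first argument, for example `submission.to_csv("submission.csv", index=False)`, writes the same content to disk for upload.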


**Written Response**

Write down your conclusion along with the score that you got from Kaggle.


### Student Solution

In [29]:
# Load the Kaggle test set and preprocess it for prediction.
dataset_filename = "./test.csv"
test_csv = pd.read_csv(dataset_filename)

test_csv['Embarked'] = test_csv['Embarked'].fillna('S')
test_csv['Title'] = test_csv.Name.apply(lambda x: x.split(',')[1].split('.')[0].strip())
test_csv['Family'] = test_csv.Parch + test_csv.SibSp + 1
# Fill missing ages with the median age of passengers of the same sex and ticket class.
test_csv['Age'] = test_csv.groupby(['Sex', 'Pclass'])['Age'].transform(lambda x: x.fillna(x.median()))
# Truncate ages to whole years and divide by the maximum age so the values fall in [0, 1].
test_csv['Age'] = test_csv['Age'].astype(int) / test_csv['Age'].max()

def testing_input():
  features = {}
  feature_columns = ['Pclass', 'Sex', 'Age', 'SibSp','Parch', 'Fare','Embarked','Family','Title']
  for feat in feature_columns:
    features[feat] = test_csv[feat]

  return Dataset.from_tensor_slices((features)).batch(1)



In [30]:
predictions_test = classifier.predict(testing_input)
predictions = [p['class_ids'][0] for p in predictions_test]
test_csv['Survived'] = predictions
# Keep only the two columns Kaggle expects and write the submission file.
df = test_csv[['PassengerId', 'Survived']]
df.to_csv('./gender_submission.csv', index=False)




**Kaggle Score**

0.78468: missing Age values filled with the median for each sex and ticket class, then scaled down.

80-20 train-test split; features = 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Family', 'Title'

(https://drive.google.com/file/d/1OIPx5m6nnsAAQsh4uc3U6JKGVTXaeeVN/view?usp=sharing)

0.77990

0.77033

0.76076

**Conclusion**

First, we preprocessed the data to fill in the missing Embarked and Age values. We replaced each missing age with the median age of passengers of the same sex and ticket class. Then we created two more features: family size and title.

As we tried more features, some improved our model and some did not. We tried filling in Cabin and using it as a feature, but it decreased our score, since we were filling a large amount of missing data with a single category.

We also experimented with the train-test split ratio; an 80-20 split gave us the best results without overfitting. The features we added for family size and for the title derived from the name improved our model. We also stratified the split by sex, so that both sets keep the same proportion of male and female passengers, to make more precise predictions.

Eventually, we settled on Pclass, Sex, Age, SibSp, Parch, Fare, Embarked, Family, and Title with an 80-20 train-test split for our model.

We tried not to incorporate our own bias into the dataset while preprocessing the data and building the model.
In the future, we want to test more models and see how they perform.
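
The preprocessing steps summarized in this conclusion (group-wise median imputation of Age, plus the Family and Title features) can be sketched on a toy DataFrame as follows. The rows are illustrative assumptions (the passenger "Doe, Miss. Jane" is made up), and the sketch uses `transform("median")`, which is one way to express the imputation and not necessarily identical to the cells above.

```python
import pandas as pd

# Toy rows for illustration; "Doe, Miss. Jane" is a made-up passenger.
toy = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Allen, Mr. William Henry",
        "Doe, Miss. Jane",
    ],
    "Sex": ["male", "female", "male", "female"],
    "Pclass": [3, 3, 3, 3],
    "Age": [22.0, 26.0, None, None],
    "SibSp": [1, 0, 0, 2],
    "Parch": [0, 0, 0, 1],
})

# Median age of each (Sex, Pclass) group, aligned row by row with the frame.
group_median = toy.groupby(["Sex", "Pclass"])["Age"].transform("median")
toy["Age"] = toy["Age"].fillna(group_median)

# Family size: siblings/spouses + parents/children + the passenger.
toy["Family"] = toy["Parch"] + toy["SibSp"] + 1

# Title: the text between the comma and the first period in the name.
toy["Title"] = toy["Name"].apply(lambda n: n.split(",")[1].split(".")[0].strip())

print(toy[["Sex", "Pclass", "Age", "Family", "Title"]])
```

Because `transform` returns a Series with the same index as the original frame, the imputed values line up with the rows they belong to without any re-indexing.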



## Exercise 3: Improve your model

The predictions returned by the LinearClassifier contain scoring and/or confidence information about why the decision was made to classify a passenger as a survivor or not. Find the number used to make the decision and manually play around with different thresholds to build a precision vs. recall chart.
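
One way to approach this exercise is sketched below in plain Python. Each prediction dictionary yielded by a binary `LinearClassifier` typically includes a `'probabilities'` entry next to `'class_ids'`, and the probability of the positive class can be compared against a chosen threshold instead of the default 0.5. The scores and labels here are made-up values standing in for those probabilities, and `precision_recall_at` is a hypothetical helper, not part of TensorFlow or scikit-learn.

```python
def precision_recall_at(threshold, y_true, scores):
    """Classify as positive when score >= threshold, then score the result."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Convention assumed here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up survival probabilities and true labels for illustration.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.30, 0.35, 0.45, 0.55, 0.70, 0.80, 0.95]

# Sweep a few thresholds and collect (threshold, precision, recall) rows.
curve = []
for threshold in [0.2, 0.5, 0.8]:
    precision, recall = precision_recall_at(threshold, y_true, scores)
    curve.append((threshold, precision, recall))
    print(threshold, round(precision, 3), round(recall, 3))
```

Plotting recall against precision for a finer grid of thresholds, for example with `matplotlib.pyplot.plot`, gives the precision vs. recall chart. Raising the threshold lowers recall here, since fewer passengers are labeled as survivors; precision often rises in exchange, although in this toy data it does not.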

### Student Solution

In [31]:
# Your code goes here


## Exercise 4: Dig deeper (optional and ungraded)

Check out the different approaches in [this kernel](https://www.kaggle.com/startupsci/titanic-data-science-solutions) (kernels are solutions or data exploration notebooks shared by other users).
Try using a different approach and see if you can improve your results.

Alternatively, you can try implementing a simple decision tree by hand, as in this [Udacity Project](https://github.com/juemura/machine-learning/blob/master/projects/titanic_survival_exploration/titanic_survival_exploration.ipynb). 

### Student Solution

In [32]:
# Your code goes here
# from sklearn.model_selection import cross_val_score
# from sklearn.tree import DecisionTreeClassifier

# feature_columns = ['Pclass', 'Sex', 'Age', 'SibSp','Parch', 'Fare','Embarked','Family','Title']
# X_train = train_df[feature_columns]
# y_train = train_df['Survived']
# X_test = test_df[feature_columns]
# y_test = test_df['Survived']

# clf = DecisionTreeClassifier(random_state=2)
# clf.fit(X=X_train, y=y_train)
# clf.score(X=X_test, y=y_test)