#### Copyright 2019 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Basic Classification Project

In this project you will perform a basic classification task.
You will apply what you learned about binary classification and TensorFlow to implement a Kaggle project without much guidance. The challenge is to achieve a high accuracy score when predicting which passengers survived the sinking of the Titanic. After building your model, you will upload your predictions to Kaggle and submit the score that you receive.

## Overview

### Learning Objectives

* Define, build, train and evaluate a Linear Classifier model in TensorFlow.
* Submit predictions to a Kaggle challenge.


### Prerequisites

* T05-09 Classification with TensorFlow

### Estimated Duration

330 minutes (270 minutes working time, 60 minutes for presentations)

### Deliverables

1. A copy of this Colab notebook containing your code and a written response with your conclusions and the score that you receive from Kaggle.
1. A group presentation. After everyone is done, we will ask each group to stand in front of the class and give a brief presentation about what they have done in this lab. The presentation can be a code walkthrough, a group discussion, a slide show, or any other means that conveys what you did over the course of the day and what you learned. If you do create any artifacts for your presentation, please share them in the class folder.

### Grading Criteria

This project is worth 50 points in your final grade, and it will be graded in separate sections that each contribute a percentage of the total score:

1. Building and Using a Model (Exercise 1) (60%)
2. Kaggle score and conclusion (Exercise 2) (20%)
3. Improving your model (Exercise 3) (10%)
4. Project Presentation (10%)

#### 1. Building and Using a Model (Exercise 1) 

There are 10 demonstrations of competency listed in the first exercise. Each competency is graded on a scale of 0 to 3 points, for a total of 30 available points. The following rubric will be used:

| Points | Description |
|--------|-------------|
| 0      | No attempt at the competency |
| 1      | Attempted competency, but in an incorrect manner |
| 2      | Attempted competency correctly, but sub-optimally |
| 3      | Successful demonstration of competency |


#### 2. Kaggle score and conclusion (Exercise 2)

There are 3 demonstrations of competency and 1 question in the second exercise. Each competency is worth 2 points, and your written response is worth 4 points. The rubric for calculating the competency points is:

| Points | Description |
|--------|-------------|
| 0      | No attempt at the competency |
| 1      | Attempted competency |
| 2      | Successful demonstration of competency |

The rubric for the written response is:

| Points | Description |
|--------|-------------|
| 0      | No attempt at question or answer was off-topic or didn't make sense |
| 1      | Question was answered, but answer included neither the Kaggle score nor relevant observations |
| 2      | Question was answered, but answer was missing either the Kaggle score or relevant observations |
| 3      | Question was answered and included Kaggle score and observations, but conclusion was superficial |
| 4      | Answer adequately included Kaggle score and meaningful observations about the model and its performance |


#### 3. Improving your model (Exercise 3)

This exercise is worth 5 points and it will be graded on your demonstrated ability to manually modify your model to test different thresholds and build a precision vs. recall chart.

The rubric for calculating the competency points is:

| Points | Description |
|--------|-------------|
| 0      | No attempt at the competency |
| 1      | Attempted competency, but in an incorrect manner |
| 2      | Attempted competency correctly, but did not try multiple thresholds and did not show precision/recall changes |
| 3      | Attempted competency correctly and tried multiple thresholds, but did not show precision/recall changes |
| 4      | Attempted competency correctly, tried multiple thresholds, and showed precision/recall changes, but did not clearly show precision/recall tradeoff |
| 5      | Successful demonstration of competency - Different thresholds attempted clearly show precision/recall tradeoff  |

#### 4. Project Presentation

The project presentation will be graded on participation. All members of a team should actively participate.

## Team

Please enter your team members' names in the placeholders in this text area:

*   *Team Member Placeholder*
*   *Team Member Placeholder*
*   *Team Member Placeholder*


## Titanic: Machine Learning from Disaster

[Kaggle](https://www.kaggle.com) has a [dataset](https://www.kaggle.com/c/titanic/data) containing the passenger list for the Titanic voyage. The data contains passenger features such as age, gender, and ticket class, as well as whether or not they survived.

Your job is to load the data and create a binary classifier using TensorFlow to predict whether a passenger survived. Then, upload your predictions to Kaggle and submit your accuracy score at the end of this Colab, along with a brief conclusion.


# Exercises

## Exercise 1: Create a Classifier

**Graded** demonstrations of competency:

1. Download the [dataset](https://www.kaggle.com/c/titanic/data).
2. Load the data into this Colab.
3. Look at the description of the [dataset](https://www.kaggle.com/c/titanic/data) to understand the columns.
4. Explore the dataset. Ask yourself: are there any missing values? Do the data values make sense? Which features seem to be the most important? Are they highly correlated with each other?
5. Prep the data (deal with missing values, drop unnecessary columns, transform the data if needed, etc).
6. Split the data into training and testing sets.
7. Create a `tensorflow.estimator.LinearClassifier`.
8. Train the classifier using an input function that feeds the classifier training data.
9. Make predictions on the test data using your classifier.
10. Find the accuracy, precision, and recall of your classifier.
 

### Student Solution

In [0]:
# Your code goes here

### Answer Key

**Solution**

In [0]:
## Step 1: Download the dataset and load the data. ##

import pandas as pd

titanic_df = pd.read_csv('https://raw.githubusercontent.com/juemura/amli/master/titanic/train.csv')

In [0]:
## Step 2: Look at the description of the dataset to understand the columns. ##

titanic_df.describe()

In [0]:
## SOLUTION: Automatically removed from student copy (keep this comment) ##
## Examine the dataset format and sample data points ##

print(titanic_df.dtypes)
titanic_df.head()

In [0]:
## Step 3: Perform analysis on the dataset and repair or drop columns and rows of data as needed. 
## Are there any missing values? 

titanic_df.isnull().sum()

In [0]:
## Step 3: (cont)
## Do the data values make sense?

def titanic_data_prep(df):
  # PassengerId is a unique identifier and thus should be used as the index
  df.set_index('PassengerId', inplace=True)

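  # Embarked has 2 null values; we can simply replace them with "unknown".
  # Assign the result back rather than calling fillna(inplace=True) on a
  # column selection, which newer pandas versions warn about.
  df['Embarked'] = df['Embarked'].fillna('unknown')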

  # Encode categorical data, i.e. Sex and Embarked
  # Among other options, you can use LabelEncoder, One-Hot Encoding, or simply replace the values
  # For Sex the categories are simple, but for ports we can create a list of unique labels
  # and use a dictionary comprehension
  port_labels = df['Embarked'].astype('category').cat.categories.tolist()
  encoding = {'Sex': {'male': 0, 'female': 1}, 'Embarked': {k:v for k,v in zip(port_labels, range(1, len(port_labels)+1))}}
  df.replace(encoding, inplace=True)
  
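  # Age has 177 null values that we need to deal with before training our model.
  # One simple approach is to replace null values with the column mean.
  # numeric_only=True skips the text columns (Name, Ticket, Cabin), which
  # newer pandas versions refuse to average.
  df.fillna(df.mean(numeric_only=True), inplace=True)
  
  # Name, Ticket, and Cabin are likely irrelevant for our model since they are
  # simply identifiers and don't indicate any distinguishing characteristic.
  # Embarked is dropped as well: its integer codes impose an arbitrary order
  # on the ports, which a linear model would treat as meaningful.
  df.drop(['Name', 'Ticket', 'Cabin', 'Embarked'], inplace=True, axis=1)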

titanic_data_prep(titanic_df)

# Double check that we got rid of all null values
print(titanic_df.isnull().sum())

# Check the description of the dataset again to make sure everything looks good 
titanic_df.describe()

In [0]:
## Step 3: (cont)
## Which features seem to be the most important?
## Are they highly correlated with each other?

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure()
corMat = titanic_df.corr(method='pearson')
print(corMat)

sns.heatmap(corMat, square=True)
plt.yticks(rotation=0)
plt.xticks(rotation=90)
plt.show()

# Observations:
# Sex and Fare seem to have the highest correlation with whether or not a
# passenger survived, and they are not highly correlated with each other.
# Thus they appear to be important features
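
To make the heatmap values less of a black box, here is a minimal sketch of what a Pearson coefficient measures, computed by hand. The `pearson` helper and the toy values are illustrative assumptions, not part of the dataset or of pandas.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up toy values, not taken from the Titanic data.
sex = [0, 0, 1, 1, 0, 1]       # 0 = male, 1 = female
survived = [0, 0, 1, 1, 1, 0]

r = pearson(sex, survived)
print(round(r, 3))
```

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one. The sign matters too: a strongly negative coefficient can be just as informative as a strongly positive one.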

In [0]:
## Step 4: Split the data into training and testing sets. ##

from sklearn.model_selection import train_test_split

TARGET = 'Survived'
FEATURES = [col for col in titanic_df.columns if col != TARGET]

train_df, test_df = train_test_split(
  titanic_df,
  test_size=0.2,
)

print("Training set shape: {}".format(train_df.shape))
print("Test set shape: {}".format(test_df.shape))

In [0]:
## Step 5: Create a tensorflow.estimator.LinearClassifier. ##

from tensorflow.feature_column import numeric_column
from tensorflow.estimator import LinearClassifier


# Declare the feature columns
feature_columns = []

for column_name in FEATURES:
  feature_columns.append(numeric_column(str(column_name)))

for feature in feature_columns:
  print(feature)


# Check the class count
class_count = len(train_df[TARGET].unique()) # This should be 2 (survived / didn't survive)
print("Class count: {}".format(class_count))


# Create a classifier
classifier = LinearClassifier(feature_columns=feature_columns, n_classes=class_count)

In [0]:
## Step 6: Train the classifier using an input function that feeds the classifier training data. ##

import tensorflow as tf
from tensorflow.data import Dataset


def training_input():
  features = {}
  for feature in FEATURES:
    features[feature] = train_df[feature]
  
  labels = train_df[TARGET]

  training_ds = Dataset.from_tensor_slices((features, labels))
  training_ds = training_ds.shuffle(buffer_size=10000)
  training_ds = training_ds.batch(100)
  training_ds = training_ds.repeat(5)

  return training_ds

classifier.train(training_input)
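
As a rough sanity check on how much training the input function provides, the number of batches it yields can be worked out by hand. The sketch below assumes that `batch()` keeps the final partial batch (its default) and that training stops when the dataset is exhausted. The `training_steps` helper and the 712-row figure are hypothetical.

```python
import math

def training_steps(n_rows, batch_size, epochs):
    """Number of batches produced by batch(batch_size) followed by repeat(epochs)."""
    # batch() keeps the final partial batch by default, hence the ceiling.
    batches_per_epoch = math.ceil(n_rows / batch_size)
    return batches_per_epoch * epochs

# Hypothetical 712-row training set, batch size 100, repeated 5 times.
steps = training_steps(712, 100, 5)
print(steps)
```

With so few gradient steps, a lower batch size or a higher repeat count may be worth trying if the model underfits.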

In [0]:
## Step 7: Make predictions on the test data using your classifier. ##

def testing_input():
  features = {}
  for feature in FEATURES:
    features[feature] = test_df[feature]
  return Dataset.from_tensor_slices((features)).batch(1)

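# predict() returns a generator, so printing it shows the generator object
# rather than the predictions themselves.
predictions_iterator = classifier.predict(testing_input)

print(predictions_iterator)

# Iterating the generator runs the model; 'class_ids' holds the predicted class.
predictions = [p['class_ids'][0] for p in predictions_iterator]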

In [0]:
## Step 8: Find the accuracy, precision, and recall of your classifier. ##

from sklearn.metrics import accuracy_score, precision_score, recall_score

print("Accuracy: {}".format(accuracy_score(test_df['Survived'], predictions)))

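# For a binary problem, the default average='binary' scores the positive class
# (Survived == 1); average='micro' would just reproduce the accuracy.
print("Precision: {}".format(precision_score(test_df['Survived'], predictions)))

print("Recall: {}".format(recall_score(test_df['Survived'], predictions)))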
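
As a cross-check on the library metrics, the three scores can also be computed by hand from the confusion counts. This is a minimal sketch; the `binary_metrics` helper and the labels are made up for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for 0/1 labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when no positives are predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Made-up labels: 1 = survived, 0 = did not survive.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0]

accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

Precision asks how many predicted survivors actually survived, while recall asks how many actual survivors were found. The two can differ from each other and from accuracy, as they do here.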

**Validation**

In [0]:
# TODO(juemura)

## Exercise 2: Upload your predictions to Kaggle

**Graded** demonstrations of competency:
1. Download the test.csv file from Kaggle and re-run your model using all of the training data.
2. Use this new test data to generate predictions using your model.
3. Follow the instructions in the [evaluation section](https://www.kaggle.com/c/titanic/overview/evaluation) to output the predictions in the format of the gender_submission.csv file. Download the predictions file from your Colab and upload it to Kaggle.


**Written Response**

Write down your conclusion along with the score that you got from Kaggle.


### Student Solution

In [0]:
# Your code goes here

{### Your written response goes here. Make sure to include your Kaggle score. ###}



### Answer Key

**Solution**

In [0]:
## Step 1: Re-run your model using all of the training data. ##

full_train_df = titanic_df

print("Training set shape: {}".format(full_train_df.shape))

import tensorflow as tf
from tensorflow.data import Dataset


def training_input():
  features = {}
  for feature in FEATURES:
    features[feature] = full_train_df[feature]
  
  labels = full_train_df[TARGET]

  training_ds = Dataset.from_tensor_slices((features, labels))
  training_ds = training_ds.shuffle(buffer_size=10000)
  training_ds = training_ds.batch(10)
  training_ds = training_ds.repeat(5)

  return training_ds

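# Note: train() resumes from the classifier's latest checkpoint, so this
# continues training the model from Exercise 1 on the full dataset.
classifier.train(training_input)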

In [0]:
## Step 2: Download the test.csv and use it to generate predictions. ##

full_test_df = pd.read_csv('https://raw.githubusercontent.com/juemura/amli/master/titanic/test.csv')
print("Test set shape: {}".format(full_test_df.shape))

titanic_data_prep(full_test_df)

def testing_input():
  features = {}
  for feature in FEATURES:
    features[feature] = full_test_df[feature]
  return Dataset.from_tensor_slices((features)).batch(1)

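# predict() returns a generator, so printing it shows the generator object
# rather than the predictions themselves.
predictions_iterator = classifier.predict(testing_input)

print(predictions_iterator)

# Iterating the generator runs the model; 'class_ids' holds the predicted class.
predictions = [p['class_ids'][0] for p in predictions_iterator]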

In [0]:
## Step 3: Output the predictions in the format of the gender_submission.csv file.
## Download the predictions file from your Colab and upload it to Kaggle. ##

from google.colab import files

results = pd.DataFrame({'PassengerId': full_test_df.index, 'Survived': predictions})
results.to_csv('titanic_predictions.csv', index=False)
files.download('titanic_predictions.csv')
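
Before uploading, it can help to check that the file has exactly the two expected columns and integer predictions. The sketch below builds the same layout using only the standard library. The `write_submission` helper and the passenger IDs are hypothetical.

```python
import csv
import io

def write_submission(passenger_ids, predictions, out):
    """Write a two-column submission (PassengerId, Survived) to a file-like object."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for pid, pred in zip(passenger_ids, predictions):
        # Cast to int so values are written as 0/1 rather than 0.0/1.0.
        writer.writerow([int(pid), int(pred)])

# Made-up IDs and predictions, written to an in-memory buffer for inspection.
buffer = io.StringIO()
write_submission([892, 893, 894], [0, 1, 0], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

The same function works with a real file handle, for example one opened with `open('titanic_predictions.csv', 'w', newline='')`.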

**Validation**

In [0]:
# TODO(juemura)

## Exercise 3: Improve your model

The predictions returned by the LinearClassifier contain scoring and confidence information behind each decision to classify a passenger as a survivor or not. Find the number used to make the decision and manually experiment with different thresholds to build a precision vs. recall chart.

### Student Solution

In [0]:
# Your code goes here

### Answer Key

**Solution**

In [0]:
# TODO(juemura)
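
Until a full solution is filled in, here is a minimal sketch of the threshold sweep, not the official answer. It assumes a survival probability has already been extracted for each passenger, for example from a `probabilities` entry in each prediction dictionary, which you should confirm by printing one prediction. The `precision_recall_at` helper, labels, and probabilities are made up.

```python
def precision_recall_at(y_true, probabilities, threshold):
    """Precision and recall when predicting 1 for probabilities >= threshold."""
    y_pred = [1 if p >= threshold else 0 for p in probabilities]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # With no positive predictions, precision is reported as 1.0 by convention here.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels and predicted survival probabilities.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
survival_probs = [0.10, 0.30, 0.35, 0.45, 0.55, 0.70, 0.80, 0.95]

# One (threshold, precision, recall) row per threshold tried.
curve = [(t, *precision_recall_at(y_true, survival_probs, t)) for t in (0.2, 0.5, 0.9)]
for threshold, precision, recall in curve:
    print("threshold={:.1f} precision={:.2f} recall={:.2f}".format(
        threshold, precision, recall))
```

Raising the threshold trades recall for precision in this toy data. Plotting the recall column against the precision column, for instance with `plt.plot`, gives the precision vs. recall chart.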

**Validation**

In [0]:
# TODO(juemura)

## Exercise 4: Dig deeper (optional and ungraded)

Check out the different approaches in [this kernel](https://www.kaggle.com/startupsci/titanic-data-science-solutions) (kernels are solutions or data exploration notebooks shared by other users).
Try using a different approach and see if you can improve your results.

Alternatively, you can try implementing a simple decision tree by hand, as in this [Udacity Project](https://github.com/juemura/machine-learning/blob/master/projects/titanic_survival_exploration/titanic_survival_exploration.ipynb). 

### Student Solution

In [0]:
# Your code goes here

### Answer Key

**Solution**

In [0]:
# TODO(juemura)

**Validation**

In [0]:
# TODO(juemura)