<a href="https://colab.research.google.com/github/google/applied-machine-learning-intensive/blob/master/content/04_classification/04_classification_project/colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Copyright 2020 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Classification Project

In this project you will apply what you have learned about classification and TensorFlow to complete a project from Kaggle. The challenge is to achieve a high accuracy score while predicting which passengers survived the sinking of the Titanic. After building your model, you will upload your predictions to Kaggle and submit the score that you get.

## The Titanic Dataset

[Kaggle](https://www.kaggle.com) has a [dataset](https://www.kaggle.com/c/titanic/data) containing the passenger list on the Titanic. The data contains passenger features such as age, gender, ticket class, as well as whether or not they survived.

Your job is to create a binary classifier using TensorFlow to determine if a passenger survived or not. The `Survived` column lets you know if the person survived. Then, upload your predictions to Kaggle and submit your accuracy score at the end of this Colab, along with a brief conclusion.


To get the dataset, you'll need to accept the competition's rules by clicking the "I understand and accept" button on the [competition rules page](https://www.kaggle.com/c/titanic/rules). Then upload your `kaggle.json` file and run the code below.

In [0]:
! chmod 600 kaggle.json && (ls ~/.kaggle 2>/dev/null || mkdir ~/.kaggle) && cp kaggle.json ~/.kaggle/ && echo 'Done'
! kaggle competitions download -c titanic
! ls

**Note: If you see a "403 - Forbidden" error above, you still need to click "I understand and accept" on the [competition rules page](https://www.kaggle.com/c/titanic/rules).**

Three files are downloaded:

1. `train.csv`: training data (contains features and targets)
1. `test.csv`: feature data used to make predictions to send to Kaggle
1. `gender_submission.csv`: an example competition submission file

## Step 1: Exploratory Data Analysis

Perform exploratory data analysis and data preprocessing. Use as many text and code blocks as you need to explore the data. Note any findings. Repair any data issues you find.

**Student Solution**

In [0]:
# Your code goes here

---

### Answer Key

We first load and describe the data.

In [0]:
import pandas as pd

df = pd.read_csv('train.csv')

df.describe(include='all')

We can see a few columns seem to be missing data, including `Age`, `Cabin` and `Embarked`. `Name` is likely not very useful, nor are `PassengerId` and `Ticket`.
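The missing-value observation can also be checked directly by counting nulls per column. The following is a minimal sketch on a made-up frame; the `toy` data is hypothetical and only mimics the shape of a few `train.csv` columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of train.csv (not real passenger data).
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', None, 'S'],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

# Count missing entries per column, then keep only the columns with gaps.
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

On the real data the same idea would be `df.isna().sum()`.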

`Survived` is our target. Let's make sure that it has two unique values.

In [0]:
TARGET = 'Survived'
df[TARGET].hist()

`Pclass` is the passenger's ticket class. Let's check it out.

In [0]:
df['Pclass'].hist()

This seems to be a categorical column with each number representing a class of service.

Next up is `Sex`.

In [0]:
df['Sex'].hist()

Another categorical column. This one has strings instead of numbers. We'll need to change this later once we decide on the type of model that we are creating.

`Age` is next. It is definitely missing values. Let's fill those in with the mean age and then see what the data looks like.

In [0]:
df.loc[df['Age'].isna(), 'Age'] = round(df['Age'].mean())

_ = df['Age'].hist()

`SibSp` seems to be okay:

In [0]:
df['SibSp'].hist()

As does `Parch`:

In [0]:
df['Parch'].hist()

`Fare` might be important, since the amount paid could have affected a passenger's chances of survival.

In [0]:
df['Fare'].hist()

`Cabin` is missing in more than half the rows, so let's just drop it at the end of our data analysis.

`Embarked` is missing a few values. Let's see the distribution of what is present.

In [0]:
df['Embarked'].hist()

We have a few strategies that we could use here. We could add a new code, 'U', for unknown. We could just use 'S' for the missing values since that is where most people embarked from. We could even just remove the offending rows. Or if we decide that the `Embarked` column isn't important, we can remove it.

Before we do that, let's look at the data.

In [0]:
df[df['Embarked'].isna()]

We can check whether any other passengers share the same ticket number or cabin and use their `Embarked` value. It turns out that none exist, though.

In [0]:
df[(df['Ticket'] == '113572') | (df['Cabin'] == 'B28')]

Since we aren't really getting anywhere, we'll just go with the majority and use `S` for Southampton. It turns out that was a [good](https://www.encyclopedia-titanica.org/titanic-survivor/amelia-icard.html) [guess](https://www.encyclopedia-titanica.org/titanic-survivor/martha-evelyn-stone.html).

In [0]:
df.loc[df['Embarked'].isna(), 'Embarked'] = 'S'
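The strategies considered above for the missing `Embarked` values can be compared side by side. This is a small sketch on a hypothetical toy column, not the real data; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy column with one missing port of embarkation.
embarked = pd.Series(['S', 'C', None, 'S', 'Q'], name='Embarked')

# Strategy 1: introduce an explicit "unknown" code.
as_unknown = embarked.fillna('U')

# Strategy 2: fill with the most common value (the mode).
most_common = embarked.mode()[0]
as_mode = embarked.fillna(most_common)

# Strategy 3: drop the rows that have no value.
dropped = embarked.dropna()

print(as_unknown.tolist(), as_mode.tolist(), len(dropped))
```

Filling with the mode keeps every row and adds no new category, which is why it is a reasonable default when only a handful of values are missing.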

And finally we drop the columns we aren't using.

In [0]:
df = df.drop(labels=['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

df.describe(include='all')

Next we can see if there are any strong correlations.

In [0]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(11,10))

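# Correlate numeric columns only; `Sex` and `Embarked` still hold strings here.
_ = sns.heatmap(df.select_dtypes('number').corr(), cmap='coolwarm', annot=True)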

There's nothing very interesting there.
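Note that a correlation matrix only covers numeric columns, so string columns such as `Sex` do not appear in it until they are encoded. Here is a minimal sketch of encoding before correlating, using made-up rows in which the relationship is deliberately exaggerated.

```python
import pandas as pd

# Hypothetical rows; survival is made to match sex exactly for illustration.
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0],
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male'],
    'Fare': [7.25, 71.28, 7.92, 8.05, 53.10, 8.46],
})

# Encode the string column as 0/1 so it participates in the correlation.
encoded = toy.assign(Sex_female=(toy['Sex'] == 'female').astype(int))

# Restrict to numeric columns before computing pairwise correlations.
corr = encoded.select_dtypes('number').corr()
print(corr['Survived'].sort_values(ascending=False))
```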

---

## Step 2: The Model

Build, fit, and evaluate a classification model. Perform any model-specific data processing that you need to perform. If the toolkit you use supports it, create visualizations for loss and accuracy improvements. Use as many text and code blocks as you need to explore the data. Note any findings.

**Student Solution**

In [0]:
# Your code goes here

---

### Answer Key

We chose to create a deep neural network using TensorFlow. In order to do this, we needed to do some data preprocessing. First, we convert `Pclass`, `Sex`, and `Embarked` into one-hot encoded columns.

In [0]:
FEATURES = []

for column in ('Pclass', 'Sex', 'Embarked'):
  for value in (sorted(df[column].unique())):
    new_column = column + '_' + str(value)
    df[new_column] = (df[column] == value).apply(int)
    FEATURES.append(new_column)
  df = df.drop(column, axis=1)

df[FEATURES].describe()
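The loop above builds the one-hot columns by hand. As an alternative sketch (not the approach used in this answer key), `pd.get_dummies` can produce the same `column_value` naming in one call. The `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical frame with one numeric and one string categorical column.
toy = pd.DataFrame({
    'Pclass': [3, 1, 3, 2],
    'Sex': ['male', 'female', 'female', 'male'],
})

# Listing the columns explicitly makes the numeric Pclass categorical too.
encoded = pd.get_dummies(toy, columns=['Pclass', 'Sex'], dtype=int)
print(encoded.columns.tolist())
```

The manual loop has the advantage of recording the new column names in `FEATURES` as it goes.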

The remaining columns will need to be normalized.

In [0]:
TO_NORMALIZE = ['Age', 'SibSp', 'Parch', 'Fare']

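# Min-max scale each column to [0, 1]. Assigning whole columns (rather than
# via .loc) lets the integer columns SibSp and Parch become floats cleanly.
df[TO_NORMALIZE] = ((df[TO_NORMALIZE] - df[TO_NORMALIZE].min()) / (df[TO_NORMALIZE].max() - df[TO_NORMALIZE].min()))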

FEATURES += TO_NORMALIZE

df.describe(include='all')
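Min-max normalization maps each value to `(x - min) / (max - min)`. When the same transformation is later applied to unseen data, a common practice is to reuse the minimum and range learned from the training data. The sketch below illustrates that with two hypothetical frames; `train` and `test` here are made up.

```python
import pandas as pd

# Hypothetical training and test features.
train = pd.DataFrame({'Age': [2.0, 30.0, 62.0], 'Fare': [0.0, 50.0, 100.0]})
test = pd.DataFrame({'Age': [32.0, 77.0], 'Fare': [25.0, 10.0]})

# Learn the per-column minimum and range from the training data only.
col_min = train.min()
col_range = train.max() - train.min()

# Apply the same transformation to both frames.
train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range

print(test_scaled)
```

Test values outside the training range land outside `[0, 1]` (the age of 77 above scales to 1.25), which is expected with this approach.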

We can now build a model.

In [0]:
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation=tf.nn.relu, 
                          input_shape=(len(FEATURES),)),
    tf.keras.layers.Dense(32, activation=tf.nn.relu),
    tf.keras.layers.Dense(16, activation=tf.nn.relu),
    tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)
])

model.compile(
    loss='binary_crossentropy',
    optimizer='Adam',
    metrics=['accuracy']
)

model.summary()
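The parameter counts reported by `model.summary()` can be reproduced by hand: each dense layer has `inputs * units` weights plus `units` biases. The sketch below assumes 12 input features (8 one-hot columns plus the 4 normalized ones); `dense_param_count` is a hypothetical helper, not part of Keras.

```python
def dense_param_count(n_inputs, layer_units):
    """Return per-layer parameter counts for a stack of dense layers."""
    counts = []
    for units in layer_units:
        # One weight per (input, unit) pair plus one bias per unit.
        counts.append(n_inputs * units + units)
        n_inputs = units  # This layer's outputs feed the next layer.
    return counts

# Assumption: 12 input features, matching the layer sizes used above.
counts = dense_param_count(12, [64, 32, 16, 1])
print(counts, sum(counts))  # [832, 2080, 528, 17] 3457
```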

We'll now split the data to keep a validation set.

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_validate, y_train, y_validate = train_test_split(
    df[FEATURES], df[TARGET], test_size=0.2, stratify=df['Sex_female'])

X_train.shape, X_validate.shape
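The `stratify` argument keeps the proportions of the given column the same in both splits. The split above stratifies on `Sex_female`; stratifying on the target is a common alternative. Here is a minimal sketch with made-up labels showing the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 negatives and 40 positives.
y = np.array([0] * 60 + [1] * 40)
X = np.arange(len(y)).reshape(-1, 1)

# Stratifying on y preserves the 60/40 class ratio in both splits.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(y_tr.mean(), y_val.mean())
```

Passing `random_state` also makes the split reproducible between runs.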

In [0]:
history = model.fit(X_train, y_train, epochs=500, verbose=1)

history.history['accuracy'][-1]

Let's plot the loss and accuracy.

In [0]:
import matplotlib.pyplot as plt

plt.figure(figsize=(16,5))

plt.subplot(1,2,1)
plt.plot(history.history['accuracy'])
plt.title('Training Accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train_accuracy'], loc='best')

plt.subplot(1,2,2)
plt.plot(history.history['loss'])
plt.title('Training Loss')
plt.ylabel('loss')
plt.xlabel('epoch')
_ = plt.legend(['train_loss'], loc='best')

Let's see how our predictions look.

In [0]:
import matplotlib.pyplot as plt

predictions = model.predict(X_validate)

_ = plt.hist(predictions)

Let's calculate the accuracy for a few different threshold values.

In [0]:
from sklearn.metrics import accuracy_score

for step in range(3, 9):
  threshold = round(0.1 * step, 1)
  # Each prediction is a length-1 array, so take its single probability.
  predicted_class = [int(p[0] > threshold) for p in predictions]
  score = round(accuracy_score(y_validate, predicted_class), 2)
  print(threshold, score)

Our accuracies tended to be slightly better for thresholds a little below 0.5, such as 0.4.
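The threshold sweep can be written in vectorized form and extended to pick the best threshold automatically. This is a sketch on hypothetical probabilities and labels, not the model's actual output.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels.
probs = np.array([0.05, 0.35, 0.45, 0.55, 0.80, 0.95])
labels = np.array([0, 0, 1, 1, 1, 1])

results = {}
for step in range(3, 9):
    threshold = round(0.1 * step, 1)
    # Classify as positive when the probability exceeds the threshold.
    predicted = (probs > threshold).astype(int)
    results[threshold] = float((predicted == labels).mean())

# Pick the threshold with the highest accuracy on this toy data.
best = max(results, key=results.get)
print(results, best)
```

A threshold chosen this way is tuned to the validation set, so the accuracy it reports is likely to be somewhat optimistic for unseen data.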

---

## Step 3: Make Predictions and Upload To Kaggle

In this step you will make predictions on the features found in the `test.csv` file and upload them to Kaggle using the [Kaggle API](https://github.com/Kaggle/kaggle-api). Use as many text and code blocks as you need to explore the data. Note any findings.

**Student Solution**

In [0]:
# Your code goes here

What was your Kaggle score?

> *Record your score here*

---

### Answer Key

**Solution**

We'll need to load the test data, transform it, and then make predictions that are written to a file and uploaded to Kaggle.

First, load the data.

In [0]:
test_df = pd.read_csv('test.csv')
test_df.describe(include='all')

Let's go ahead and one-hot encode the same columns. Double-check that the resulting one-hot columns match those in the training data. Sometimes new values can appear, or values seen in training might be absent. In this case they line up.

In [0]:
FEATURES = []

for column in ('Pclass', 'Sex', 'Embarked'):
  for value in (sorted(test_df[column].unique())):
    new_column = column + '_' + str(value)
    test_df[new_column] = (test_df[column] == value).apply(int)
    FEATURES.append(new_column)
  test_df = test_df.drop(column, axis=1)

test_df[FEATURES].describe()
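The column check described above can be automated. The sketch below compares a hypothetical test frame against an assumed list of training columns, then uses `reindex` to add any absent columns as zeros and enforce the training order.

```python
import pandas as pd

# Assumed one-hot columns produced from the training data.
train_columns = ['Embarked_C', 'Embarked_Q', 'Embarked_S']

# Hypothetical test set in which nobody embarked at 'Q'.
test_encoded = pd.DataFrame({'Embarked_C': [1, 0], 'Embarked_S': [0, 1]})

# Report any mismatch between the two column sets.
missing = sorted(set(train_columns) - set(test_encoded.columns))
extra = sorted(set(test_encoded.columns) - set(train_columns))
print('missing:', missing, 'extra:', extra)

# Add absent columns as all zeros and enforce the training column order.
aligned = test_encoded.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

Columns listed in `extra` would be dropped by `reindex`, so it is worth inspecting them before aligning.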

`Age` and `Fare` were both missing values. We'll fill them in with the mean.

In [0]:
test_df.loc[test_df['Age'].isna(), 'Age'] = test_df['Age'].mean()
test_df.loc[test_df['Fare'].isna(), 'Fare'] = test_df['Fare'].mean()

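Now we'll normalize. We reuse the minimum and maximum from the training data so that the test features are on the same scale the model was trained on.

In [0]:
TO_NORMALIZE = ['Age', 'SibSp', 'Parch', 'Fare']

# Filling missing ages with the mean did not change the training min or max,
# so the raw training file gives the same ranges that were used to scale `df`.
train_raw = pd.read_csv('train.csv')[TO_NORMALIZE]
train_min = train_raw.min()
train_range = train_raw.max() - train_min

test_df[TO_NORMALIZE] = (test_df[TO_NORMALIZE] - train_min) / train_range

FEATURES += TO_NORMALIZE

test_df[TO_NORMALIZE].describe()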

And finally, drop the columns we don't need. We don't drop `PassengerId` this time since we need it for contest submission.

In [0]:
test_df = test_df.drop(labels=['Name', 'Ticket', 'Cabin'], axis=1)

test_df.describe(include='all')

Now we can make predictions.

In [0]:
predictions = [int(x[0] > 0.4) for x in model.predict(test_df[FEATURES])]
predictions[:10]

In [0]:
results = pd.DataFrame({
  'PassengerId': test_df['PassengerId'],
  'Survived': predictions,
})

results.to_csv('titanic_predictions.csv', index=False)

! head titanic_predictions.csv
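Before uploading, it can help to sanity-check the file. The sketch below assumes the expected format is the two columns `PassengerId` and `Survived` with 0/1 values, as in `gender_submission.csv`; `check_submission` is a hypothetical helper and the CSV strings are made up.

```python
import io

import pandas as pd


def check_submission(csv_text, expected_rows):
    """Return a list of problems found in a submission CSV (empty if none)."""
    sub = pd.read_csv(io.StringIO(csv_text))
    problems = []
    if list(sub.columns) != ['PassengerId', 'Survived']:
        # Without the right columns the remaining checks cannot run.
        problems.append('unexpected columns')
        return problems
    if len(sub) != expected_rows:
        problems.append('wrong row count')
    if not sub['Survived'].isin([0, 1]).all():
        problems.append('Survived must be 0 or 1')
    if sub['PassengerId'].duplicated().any():
        problems.append('duplicate PassengerId')
    return problems


good = "PassengerId,Survived\n892,0\n893,1\n"
bad = "PassengerId,Survived\n892,0\n892,2\n"
print(check_submission(good, 2), check_submission(bad, 2))
```

For the real file, the text could be read from `titanic_predictions.csv` and `expected_rows` set to `len(test_df)`.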

And then submit the model.

In [0]:
!kaggle competitions submit -f titanic_predictions.csv -m 'Keras submission' titanic

And check your score.

In [0]:
!kaggle competitions submissions titanic

---

## Step 4: Iterate on Your Model

In this step you're encouraged to play around with your model settings and to even try different models. See if you can get a better score. Use as many text and code blocks as you need to explore the data. Note any findings.

**Student Solution**

In [0]:
# Your code goes here

---

### Answer Key

There is no answer key for this step. The students will just need to show some adjustment to the model that they created above, or they can create an entirely new model.

---