#### Copyright 2019 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Basic Classification Project

In this project you will perform a basic classification task.
You will apply what you learned about binary classification and TensorFlow to implement a Kaggle project without much guidance. The challenge is to achieve a high accuracy score when predicting which passengers survived the sinking of the Titanic. After building your model, you will upload your predictions to Kaggle and submit the score that you receive.

## Overview

### Learning Objectives

* Define, build, train and evaluate a Linear Classifier model in TensorFlow.
* Submit predictions to a Kaggle challenge.


### Prerequisites

* T05-09 Classification with TensorFlow

### Estimated Duration

330 minutes (270 minutes working time, 60 minutes for presentations)

### Deliverables

1. A copy of this Colab notebook containing your code and a written response with your conclusions and the score that you receive from Kaggle.
1. A group presentation. After everyone is done, we will ask each group to stand in front of the class and give a brief presentation about what they have done in this lab. The presentation can be a code walkthrough, a group discussion, a slide show, or any other means that conveys what you did over the course of the day and what you learned. If you do create any artifacts for your presentation, please share them in the class folder.

### Grading Criteria

This project is worth 50 points in your final grade, and it will be graded in separate sections that each contribute a percentage of the total score:

1. Building and Using a Model (Exercise 1) (60%)
2. Kaggle score and conclusion (Exercise 2) (20%)
3. Improving your model (Exercise 3) (10%)
4. Project Presentation (10%)

#### 1. Building and Using a Model (Exercise 1) 

There are 10 demonstrations of competency listed in the first exercise. Each competency is graded on a 3 point scale for a total of 30 available points. The following rubric will be used:

| Points | Description |
|--------|-------------|
| 0      | No attempt at the competency |
| 1      | Attempted competency, but in an incorrect manner |
| 2      | Attempted competency correctly, but sub-optimally |
| 3      | Successful demonstration of competency |


#### 2. Kaggle score and conclusion (Exercise 2)

There are 3 demonstrations of competency and 1 question in the second exercise. Each competency is worth 2 points, and your written response is worth 4 points. The rubric for calculating the competency points is:

| Points | Description |
|--------|-------------|
| 0      | No attempt at the competency |
| 1      | Attempted competency |
| 2      | Successful demonstration of competency |

The rubric for the written response is:

| Points | Description |
|--------|-------------|
| 0      | No attempt at question or answer was off-topic or didn't make sense |
| 1      | Question was answered, but answer didn't include Kaggle score and relevant observations |
| 2      | Question was answered, but answer didn't include Kaggle score or relevant observations |
| 3      | Question was answered and included Kaggle score and observations, but conclusion was superficial |
| 4      | Answer adequately included Kaggle score and meaningful observations about the model and its performance |


#### 3. Improving your model (Exercise 3)

This exercise is worth 5 points and it will be graded on your demonstrated ability to manually modify your model to test different thresholds and build a precision vs. recall chart.

The rubric for calculating the competency points is:

| Points | Description |
|--------|-------------|
| 0      | No attempt at the competency |
| 1      | Attempted competency, but in an incorrect manner |
| 2      | Attempted competency correctly, but did not try multiple thresholds and did not show precision/recall changes |
| 3      | Attempted competency correctly and tried multiple thresholds, but did not show precision/recall changes |
| 4      | Attempted competency correctly, tried multiple thresholds, and showed precision/recall changes, but did not clearly show precision/recall tradeoff |
| 5      | Successful demonstration of competency - Different thresholds attempted clearly show precision/recall tradeoff  |

#### 4. Project Presentation

The project presentation will be graded on participation. All members of a team should actively participate.

## Team

Please enter your team members' names in the placeholders in this text area:

*   *Huize Huang*
*   *Max Matuska*


## Titanic: Machine Learning from Disaster

[Kaggle](https://www.kaggle.com) has a [dataset](https://www.kaggle.com/c/titanic/data) containing the passenger list for the Titanic voyage. The data contains passenger features such as age, gender, and ticket class, as well as whether or not they survived.

Your job is to load the data and create a binary classifier using TensorFlow to determine if a passenger survived or not. Then, upload your predictions to Kaggle and submit your accuracy score at the end of this colab, along with a brief conclusion.


# Exercises

## Exercise 1: Create a Classifier

**Graded** demonstrations of competency:

1. Download the [dataset](https://www.kaggle.com/c/titanic/data).
2. Load the data into this Colab.
3. Look at the description of the [dataset](https://www.kaggle.com/c/titanic/data) to understand the columns.
4. Explore the dataset. Ask yourself: are there any missing values? Do the data values make sense? Which features seem to be the most important? Are they highly correlated with each other?
5. Prep the data (deal with missing values, drop unnecessary columns, transform the data if needed, etc).
6. Split the data into testing and training sets.
7. Create a `tensorflow.estimator.LinearClassifier`.
8. Train the classifier using an input function that feeds the classifier training data.
9. Make predictions on the test data using your classifier.
10. Find the accuracy, precision, and recall of your classifier.
 

### Student Solution

In [0]:
# Your code goes here
import altair as alt
import numpy as np 
import pandas as pd
import math
import matplotlib
import matplotlib.pyplot as plt
import re
import seaborn as sns
from google.colab import files 

# Load the Kaggle training data (train.csv must be uploaded to the Colab first).
dataset_filename = "./train.csv"
train_csv = pd.read_csv(dataset_filename)


In [0]:
train_csv.head()

In [0]:
train_csv.describe()

In [0]:
train_csv.info()

**Preprocessing the data**

In [0]:
mean = train_csv.groupby('Sex')['Age'].mean()
mean

In [0]:
train_csv['Embarked']=train_csv['Embarked'].fillna('S')

In [0]:
total = train_csv.isnull().sum().sort_values(ascending=False)
total  # display the remaining missing-value counts per column

In [0]:
train_csv.Cabin = train_csv.Cabin.fillna('N')
train_csv.Cabin = train_csv.Cabin.map(lambda x:x[0])

In [0]:
train_csv['Family'] = train_csv.Parch + train_csv.SibSp + 1

In [0]:
train_csv['Title'] = train_csv.Name.apply(lambda x: x.split(',')[1].split('.')[0].strip())
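
To make the `Title` and `Family` features concrete, here is a minimal sketch of the same split-based extraction applied to a few names written in the dataset's `Surname, Title. Given names` layout. The helper names `extract_title` and `family_size` and the sample names are illustrative assumptions, not part of the notebook.

```python
def extract_title(name):
    """Return the text between the first comma and the following period."""
    after_comma = name.split(',')[1]
    return after_comma.split('.')[0].strip()


def family_size(sibsp, parch):
    """Count the passenger plus siblings/spouses and parents/children aboard."""
    return sibsp + parch + 1


# Sample names in the "Surname, Title. Given names" layout (illustrative only).
sample_names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
]
titles = [extract_title(name) for name in sample_names]
print(titles)
print(family_size(sibsp=1, parch=0))
```

This approach assumes every name contains a comma followed by a period; a name without a comma would raise an `IndexError`, so it is worth checking the column before relying on it.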

In [0]:
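# Fill missing ages with the median age of passengers of the same sex and ticket class
# (transform keeps the original index, so the result aligns with train_csv).
train_csv['Age'] = train_csv.groupby(['Sex', 'Pclass'])['Age'].transform(lambda x: x.fillna(x.median()))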

In [0]:
# Truncate ages to whole years, then scale to [0, 1] by dividing by the maximum age.
train_csv['Age'] = train_csv['Age'].astype(int) / train_csv['Age'].max()
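
The group-wise imputation and scaling steps can be checked on a small example. The sketch below uses a made-up toy DataFrame (not Titanic data) and fills each missing age with a per-group statistic via `transform`, which returns a Series aligned with the original index. The median is used here; `'mean'` is a drop-in alternative. Unlike the notebook cell, this sketch does not truncate ages to integers before scaling.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; NaN marks a missing age.
toy = pd.DataFrame({
    'Sex':    ['male', 'male', 'male', 'female', 'female', 'female'],
    'Pclass': [3, 3, 3, 1, 1, 1],
    'Age':    [20.0, np.nan, 30.0, 40.0, 50.0, np.nan],
})

# Per-row median age of the (Sex, Pclass) group each passenger belongs to.
group_median = toy.groupby(['Sex', 'Pclass'])['Age'].transform('median')

# Only the missing entries are replaced; known ages are left untouched.
toy['Age'] = toy['Age'].fillna(group_median)

# Scale by the maximum so the largest age maps to 1.0.
toy['AgeScaled'] = toy['Age'] / toy['Age'].max()
print(toy)
```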

In [0]:
train_csv.info()

In [0]:
# train_csv.Cabin.unique()

In [0]:
# train_csv['Sex'] = train_csv['Sex'].astype('category').cat.codes
# train_csv['Sex'].astype(int)

In [0]:
# train_csv['Embarked'] = train_csv['Embarked'].astype('category').cat.codes
# train_csv['Embarked']

In [0]:
# train_csv.Fare = train_csv.Fare.astype(int)

**Create, train, and test the model**

In [0]:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(
  train_csv,
  stratify = train_csv.Sex,
  test_size=0.2,
  random_state = 42
)

In [0]:
train_df.groupby('Sex')['Sex'].agg('count')

In [0]:
test_df.groupby('Sex')['Sex'].agg('count')

In [0]:
import tensorflow as tf

CATEGORICAL_COLUMNS = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked', 'Family', 'Title']
NUMERIC_COLUMNS = ['Age', 'Fare']

feature_columns = []
# Categorical features: the vocabulary is the set of values seen in the training split.
for feature_name in CATEGORICAL_COLUMNS:
  vocabulary = train_df[feature_name].unique()
  feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary))

# Numeric features are passed through as floats.
for feature_name in NUMERIC_COLUMNS:
  feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float32))

In [0]:
class_count = len(train_df['Survived'].unique())

In [0]:
from tensorflow.estimator import LinearClassifier

classifier = LinearClassifier(feature_columns=feature_columns, n_classes=class_count)

In [0]:
from tensorflow.data import Dataset

def training_input():
  # Build a dict mapping each feature name to its column in the training split.
  feature_names = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Family', 'Title']
  features = {feat: train_df[feat] for feat in feature_names}
  labels = train_df['Survived']

  # Shuffle, batch in groups of 100, and repeat for 10 passes over the data.
  training_ds = Dataset.from_tensor_slices((features, labels))
  training_ds = training_ds.shuffle(buffer_size=10000)
  training_ds = training_ds.batch(100)
  training_ds = training_ds.repeat(10)

  return training_ds

classifier.train(training_input)

In [0]:
def testing_input():
  # Features only (no labels); a batch size of 1 yields one prediction per passenger.
  feature_names = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Family', 'Title']
  features = {feat: test_df[feat] for feat in feature_names}

  return Dataset.from_tensor_slices(features).batch(1)

predictions_iterator = classifier.predict(testing_input)
# 'class_ids' holds the predicted class (0 or 1) for each passenger.
predictions = [p['class_ids'][0] for p in predictions_iterator]

In [0]:
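from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

# Report precision and recall for the positive class (Survived == 1).
# With average='micro', both values would simply equal the accuracy.
print('Accuracy: ', accuracy_score(test_df['Survived'], predictions))
print('Precision:', precision_score(test_df['Survived'], predictions))
print('Recall:   ', recall_score(test_df['Survived'], predictions))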
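
Accuracy, precision, and recall for the positive class (`Survived == 1`) follow directly from the confusion counts. The sketch below computes them by hand on made-up labels so the definitions are explicit; the function name `binary_metrics` and the sample values are illustrative assumptions. Note that micro-averaged precision and recall (scikit-learn's `average='micro'`) both reduce to accuracy for single-label predictions, so the per-class values are the ones that show how the classifier errs.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) treating class 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Made-up labels and predictions for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```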

## Exercise 2: Upload your predictions to Kaggle

**Graded** demonstrations of competency:
1. Download the test.csv file from Kaggle and re-run your model using all of the training data.
2. Use this new test data to generate predictions using your model.
3. Follow the instructions in the [evaluation section](https://www.kaggle.com/c/titanic/overview/evaluation) to output the predictions in the format of the gender_submission.csv file. Download the predictions file from your Colab and upload it to Kaggle.


**Written Response**

Write down your conclusion along with the score that you got from Kaggle.


### Student Solution

In [0]:
# Your code goes here
dataset_filename = "./test.csv"
test_csv = pd.read_csv(dataset_filename)

# Apply the same preprocessing to the Kaggle test set as to the training data.
test_csv['Embarked'] = test_csv['Embarked'].fillna('S')
test_csv['Title'] = test_csv.Name.apply(lambda x: x.split(',')[1].split('.')[0].strip())
test_csv['Family'] = test_csv.Parch + test_csv.SibSp + 1
test_csv['Age'] = test_csv.groupby(['Sex', 'Pclass'])['Age'].transform(lambda x: x.fillna(x.median()))
test_csv['Age'] = test_csv['Age'].astype(int) / test_csv['Age'].max()

def testing_input():
  # Redefine the input function so that predictions are made on the Kaggle test set.
  feature_names = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Family', 'Title']
  features = {feat: test_csv[feat] for feat in feature_names}

  return Dataset.from_tensor_slices(features).batch(1)



In [0]:
predictions_test = classifier.predict(testing_input)
predictions = [p['class_ids'][0] for p in predictions_test]
test_csv['Survived'] = predictions

# Keep only the two columns Kaggle expects and write the submission file.
df = test_csv[['PassengerId', 'Survived']]
df.to_csv('./gender_submission.csv', index=False, header=True)
df
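
The submission file needs exactly two columns, `PassengerId` and `Survived`, with a header row and no index column. The sketch below builds such a file in memory so the layout can be inspected before uploading; the helper name `make_submission` and the ids and predictions are made up for illustration.

```python
import io

import pandas as pd


def make_submission(passenger_ids, predictions):
    """Build a two-column frame in the PassengerId,Survived layout."""
    return pd.DataFrame({
        'PassengerId': list(passenger_ids),
        'Survived': [int(p) for p in predictions],  # plain 0/1 integers
    })


# Illustrative ids and predictions, not real model output.
submission = make_submission([892, 893, 894], [0, 1, 0])

# Write to an in-memory buffer instead of disk to inspect the exact text.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

Casting to `int` guards against predictions arriving as NumPy scalars or floats, which could otherwise be written as `1.0` instead of `1`.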




**Kaggle Score**

0.78468: missing `Age` values filled with the median for each sex and ticket class, then scaled by the maximum age.

80/20 train/test split; features: `Pclass`, `Sex`, `Age`, `SibSp`, `Parch`, `Fare`, `Embarked`, `Family`, `Title`

(https://drive.google.com/file/d/1OIPx5m6nnsAAQsh4uc3U6JKGVTXaeeVN/view?usp=sharing)

0.77990

0.77033

0.76076

**Conclusion**

First, we preprocessed the data to fill in the missing `Embarked` and `Age` values. We replaced each missing age with the median age of passengers of the same sex and ticket class. Then we created two more features: family size and title.

As we tried more features, some improved our model and some did not. We tried filling in `Cabin` and using it as a feature, but it decreased our score, since we were filling a large amount of missing data with a single category.

We also experimented with the train/test split ratio, and the 80-20 split turned out to work best for avoiding overfitting. The features we added for family size and for the title derived from the name improved our model. We also stratified the split by sex to make our predictions more precise.

Eventually, we settled on Pclass, Sex, Age, SibSp, Parch, Fare, Embarked, Family, and Title with an 80-20 train/test split for our model.

We tried not to incorporate our own bias into the dataset while preprocessing the data and building the model.
In the future, we want to test more models and see how they perform.



## Exercise 3: Improve your model

The predictions returned by the `LinearClassifier` contain scoring and/or confidence information about why the decision was made to classify a passenger as a survivor or not. Find the number used to make the decision and manually play around with different thresholds to build a precision vs. recall chart.
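
One way to approach this is sketched below, assuming the predicted probability of survival has already been extracted into a plain list. In the estimator's output this value is expected under a key such as `probabilities`, but inspect the keys of one prediction to confirm. The function name `precision_recall_at` and the labels and scores are made up for illustration.

```python
def precision_recall_at(y_true, scores, threshold):
    """Classify as 1 when score >= threshold, then return (precision, recall)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # With no positive predictions there are no false positives; report 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels and survival probabilities for illustration only.
y_true = [0, 0, 1, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

thresholds = [0.2, 0.5, 0.75]
curve = [(t,) + precision_recall_at(y_true, scores, t) for t in thresholds]
for t, p, r in curve:
    print(f"threshold={t:.2f} precision={p:.3f} recall={r:.3f}")
```

Raising the threshold makes the classifier more conservative, which tends to raise precision and lower recall. Plotting the recall values against the precision values (for example with `matplotlib.pyplot.plot`) gives the requested chart.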

### Student Solution

In [0]:
# Your code goes here


## Exercise 4: Dig deeper (optional and ungraded)

Check out the different approaches in [this kernel](https://www.kaggle.com/startupsci/titanic-data-science-solutions) (kernels are solutions or data exploration notebooks shared by other users).
Try using a different approach and see if you can improve your results.

Alternatively, you can try implementing a simple decision tree by hand, as in this [Udacity Project](https://github.com/juemura/machine-learning/blob/master/projects/titanic_survival_exploration/titanic_survival_exploration.ipynb). 

### Student Solution

In [0]:
# Your code goes here
# from sklearn.model_selection import cross_val_score
# from sklearn.tree import DecisionTreeClassifier

# feature_columns = ['Pclass', 'Sex', 'Age', 'SibSp','Parch', 'Fare','Embarked','Family','Title']
# X_train = train_df[feature_columns]
# y_train = train_df['Survived']
# X_test = test_df[feature_columns]
# y_test = test_df['Survived']

# clf = DecisionTreeClassifier(random_state=2)
# clf.fit(X=X_train, y=y_train)
# clf.score(X=X_test, y=y_test)