# Assignment 2: Classification

You will demonstrate your ability to solve a classification task.

The notebook that you submit *should follow the Recipe for Machine Learning* in addition to answering the questions.

# Objectives

The purpose of this assignment is
- to familiarize yourself with Classifiers
- to have you explore *feature engineering*

We will be using the well-known Titanic challenge, which we introduced in class.

Because there are so many solutions to this problem available on the Internet
- the base points for this assignment will be lower than usual
- your efforts at feature engineering will be a key part of the grade
- the *quality of your explanations* will be a big part of the grade
    - if your solution is inspired by a model you found elsewhere, your writeup must convince us that you deeply understand all the choices involved
        - motivate it by your Exploratory Data Analysis
        - try several variations and explain your final choice

To make this even more fun, we'll have a contest: extra credit for students whose models
perform best on a held-out dataset.

# The Data

Here's the code to get the data.
It has already been split into training and test sets.

In [1]:
import pandas as pd
import os

# Note the use of a *relative path*; your assignments should all use relative rather than absolute paths
TITANIC_PATH = "./data/assignment_2"

train_data = pd.read_csv( os.path.join(TITANIC_PATH, "train.csv") )
test_data  = pd.read_csv( os.path.join(TITANIC_PATH, "test.csv")  )

## Note on the test examples

The test examples **do not** have targets (Yes/No for Survived) associated with them,
so you can't use these as examples on which to evaluate the Performance Metric.

The reason: this problem was part of a competition; the competitors were evaluated on how well
they did on the test examples -- the answers were only known to the judges so competitors couldn't cheat.

If you want to see how well you predict out of sample
- You can choose to create your own test set as a subset of the training examples
- You can choose to perform cross-validation

# Create a base model using Naive Bayes

Use a Naive Bayes classifier as the Base Model in the Recipe.

- Report the average of the cross-validation scores, using 5-fold cross-validation
- Use Accuracy as your Performance Metric; report the accuracy
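
As a rough sketch of the mechanics only (not a solution), 5-fold cross-validated accuracy can be computed along these lines. The assignment does not say which Naive Bayes variant to use, so `GaussianNB` is an assumption here, and `toy` with its two columns is synthetic data invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (hypothetical; not the Titanic examples)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Fare": rng.uniform(5, 100, size=n),
})
toy["Survived"] = (toy["Fare"] + rng.normal(0, 20, size=n) > 50).astype(int)

# One accuracy score per fold; GaussianNB is an assumed choice of variant
scores = cross_val_score(
    GaussianNB(), toy[["Pclass", "Fare"]], toy["Survived"],
    cv=5, scoring="accuracy",
)
cross_val_avg = scores.mean()
print("Avg cross val score = {sc:3.2f}".format(sc=cross_val_avg))
```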

**Question**
Replace the 0 values in the following cell with your answers, and execute the print statements


In [2]:
model = "Naive Bayes"
cross_val_avg = 0

print("{m:s}: Avg cross val score = {sc:3.2f}".format(m=model, sc=cross_val_avg) )

accuracy = 0
print("{m:s}: Accuracy = {a:.2%}".format(m=model, a=accuracy) )

Naive Bayes: Avg cross val score = 0.00
Naive Bayes: Accuracy = 0.00%


# Create a Logistic Regression model with minimal feature engineering

Create a Logistic Regression classifier using only transformations that are absolutely necessary,
for example
- dealing with missing features
- categorical transformations
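
One possible way to wire these two necessary transformations into a single model is a scikit-learn pipeline, sketched below. The toy frame, the median imputation strategy, and the choice of columns are all assumptions made for illustration; your own choices should come from your Exploratory Data Analysis.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with a missing numeric value and a categorical column
toy = pd.DataFrame({
    "Age": [22, np.nan, 35, 54, 2, np.nan, 27, 14, 40, 61] * 5,
    "Sex": ["male", "female", "female", "male", "male",
            "female", "male", "female", "male", "female"] * 5,
    "Survived": [0, 1, 1, 0, 1, 1, 0, 1, 0, 1] * 5,
})

preprocess = ColumnTransformer([
    # Missing numeric values: fill with the median (an assumed strategy)
    ("num", SimpleImputer(strategy="median"), ["Age"]),
    # Categorical values: one indicator column per category
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])
clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(clf, toy[["Age", "Sex"]], toy["Survived"],
                         cv=5, scoring="accuracy")
print(scores.mean())
```

Keeping the transformations inside the pipeline means they are re-fit on each training fold, so no information from the validation fold leaks into the imputation.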

**Question**
Replace the 0 values in the following cell with your answers, and execute the print statements

In [3]:
model = "Logistic Regression, version 0"
cross_val_avg = 0

print("{m:s}: Avg cross val score = {sc:3.2f}".format(m=model, sc=cross_val_avg) )

accuracy = 0
print("{m:s}: Accuracy = {a:.2%}".format(m=model, a=accuracy) )


Logistic Regression, version 0: Avg cross val score = 0.00
Logistic Regression, version 0: Accuracy = 0.00%


# Perform feature engineering and create another Logistic Regression classifier

Use transformations and creation of new features to improve your first Logistic Regression classifier.

The first set of transformations requires you to convert `Age` from continuous to buckets/bins.
This means choosing how many buckets/bins and what the boundaries are.
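
For illustration, here are two common ways to bucket a continuous column with pandas: hand-picked boundaries via `pd.cut`, and equal-population buckets via `pd.qcut`. The boundaries, labels, and sample ages below are hypothetical; they are not a recommended answer.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value
ages = pd.Series([0.5, 4, 12, 17, 18, 30, 45, 64, 65, 80, np.nan], name="Age")

# Scheme A: hand-picked boundaries (hypothetical), left-closed intervals [a, b)
bins = [0, 12, 18, 65, np.inf]
labels = ["child", "teen", "adult", "senior"]
age_bucket = pd.cut(ages, bins=bins, labels=labels, right=False)

# Scheme B: quartiles, so each bucket holds roughly the same number of examples
age_quartile = pd.qcut(ages, q=4)

print(age_bucket.value_counts(sort=False))
print(age_quartile.value_counts(sort=False))
```

Note that a missing `Age` stays missing after bucketing, so imputation (or a dedicated "unknown" bucket) still has to be handled separately.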

You will make two different choices for the buckets and report the accuracy of each.
You should *clearly explain* why you made the choices that you did (based on logic, Exploratory Data Analysis, etc.).

The steps are:
- choose a set of buckets and compare your Accuracy out of sample with the first Logistic Regression classifier
- choose a *second* set of buckets and compare your Accuracy out of sample with the first Logistic Regression classifier

So you will answer two nearly identical questions.  Please report the **best** result in the **second answer**.

**Question**
Replace the 0 values in the following cell with your answers, and execute the print statements

In [4]:
model = "Logistic Regression, bucketing version 1"
cross_val_avg = 0

print("{m:s}: Avg cross val score = {sc:3.2f}".format(m=model, sc=cross_val_avg) )

accuracy = 0
print("{m:s}: Accuracy = {a:.2%}".format(m=model, a=accuracy) )


Logistic Regression, bucketing version 1: Avg cross val score = 0.00
Logistic Regression, bucketing version 1: Accuracy = 0.00%


Try a *different* bucketing scheme for `Age`

**Question**
Replace the 0 values in the following cell with your answers, and execute the print statements

In [5]:
model = "Logistic Regression, bucketing version 2"
cross_val_avg = 0

print("{m:s}: Avg cross val score = {sc:3.2f}".format(m=model, sc=cross_val_avg) )

accuracy = 0
print("{m:s}: Accuracy = {a:.2%}".format(m=model, a=accuracy) )


Logistic Regression, bucketing version 2: Avg cross val score = 0.00
Logistic Regression, bucketing version 2: Accuracy = 0.00%


## Age bucket: categorical or numeric ? 

Using your best bucketing choice (the second one above)
- What is the accuracy when you treat the buckets as numeric ?
- What is the accuracy when you treat the buckets as categorical ?
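
The difference between the two treatments is in how the bucket is encoded, which the following sketch tries to make concrete. The bin boundaries and sample ages are hypothetical.

```python
import numpy as np
import pandas as pd

ages = pd.Series([4, 15, 30, 70], name="Age")
bins = [0, 12, 18, 65, np.inf]  # hypothetical boundaries

# Numeric treatment: a single column holding the bucket index (0, 1, 2, ...)
age_numeric = pd.cut(ages, bins=bins, labels=False, right=False)

# Categorical treatment: one indicator column per bucket
age_categorical = pd.get_dummies(pd.cut(ages, bins=bins, right=False), prefix="Age")

print(age_numeric.tolist())
print(age_categorical)
```

With the numeric encoding, Logistic Regression learns one coefficient, so each step up in bucket index shifts the log-odds by the same amount. With the categorical encoding, each bucket gets its own coefficient, so the effect of age need not be monotonic.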


In [6]:
model = "Logistic Regression, bucketing version 2; buckets treated as numeric features"
cross_val_avg = 0

print("{m:s}: Avg cross val score = {sc:3.2f}".format(m=model, sc=cross_val_avg) )

accuracy = 0
print("{m:s}: Accuracy = {a:.2%}".format(m=model, a=accuracy) )


Logistic Regression, bucketing version 2; buckets treated as numeric features: Avg cross val score = 0.00
Logistic Regression, bucketing version 2; buckets treated as numeric features: Accuracy = 0.00%


In [7]:
model = "Logistic Regression, bucketing version 2; buckets treated as categorical features"
cross_val_avg = 0

print("{m:s}: Avg cross val score = {sc:3.2f}".format(m=model, sc=cross_val_avg) )

accuracy = 0
print("{m:s}: Accuracy = {a:.2%}".format(m=model, a=accuracy) )


Logistic Regression, bucketing version 2; buckets treated as categorical features: Avg cross val score = 0.00
Logistic Regression, bucketing version 2; buckets treated as categorical features: Accuracy = 0.00%


# (Extra credit)  Perform more feature engineering

We will award extra points for each transformation (up to a maximum of 3) judged to be well thought out.
This means that *you must clearly explain* your ideas, experiments and results.

**Question**
Replace the 0 values in the following cell with your answers, and execute the print statements
- Also replace the "???" with a description of your feature engineering.

Repeat this cell for each new feature engineering/transformation that you submit.

In [8]:
title = "???"
cross_val_avg = 0

print("{m:s}: Avg cross val score = {sc:3.2f}".format(m=title, sc=cross_val_avg) )

accuracy = 0
print("{m:s}: Accuracy = {a:.2%}".format(m=title, a=accuracy) )

???: Avg cross val score = 0.00
???: Accuracy = 0.00%


# (Extra, extra credit) Contest !

Come up with your best model for the Titanic !  Use any model and whatever feature engineering you'd like.

We will evaluate your model on a held-out dataset.  Top scorers will get extra credit.

## Rules
- You *may not* include any packages that are not part of the standard installation
    - they won't run on the grader's machine
- Your feature engineering *must* deal with missing data for all attributes
    - the evaluation set *will* have missing values for some features
    

**Question**
Replace the 0 values in the following cell with your answers, and execute the print statements

## Practical advice

The held-out dataset comes from the same distribution as the training examples.

But it's possible that there may be some feature
- that was *not* missing in any training example
- but *is* missing in some example in the held-out dataset

Code defensively !  It's not a bad idea to perform missing feature imputation on *all* features, whether
or not they are missing in training examples.
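
A minimal sketch of this defensive approach, using pandas only: learn a fill value for *every* column from the training examples, then apply those values to new data. The two small frames, the column choice, and the median/mode strategies are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical training examples: nothing is missing here
train = pd.DataFrame({
    "Fare": [7.25, 71.3, 8.05, 53.1],
    "Embarked": ["S", "C", "S", "S"],
})
# Hypothetical held-out examples: values missing in columns that were complete in training
held_out = pd.DataFrame({
    "Fare": [np.nan, 10.5],
    "Embarked": ["Q", np.nan],
})

# Learn a fill value for every column from the training examples only:
# median for numeric columns, most frequent value otherwise (assumed strategies)
fill_values = {
    col: train[col].median() if is_numeric_dtype(train[col]) else train[col].mode().iloc[0]
    for col in train.columns
}

held_out_filled = held_out.fillna(value=fill_values)
print(held_out_filled)
```

The fill values are computed from the training examples alone, so the same values can be reused on any later dataset without refitting.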

In [9]:
title = "Titanic contest"
cross_val_avg = 0

print("{m:s}: Avg cross val score = {sc:3.2f}".format(m=title, sc=cross_val_avg) )

accuracy = 0
print("{m:s}: Accuracy = {a:.2%}".format(m=title, a=accuracy) )

Titanic contest: Avg cross val score = 0.00
Titanic contest: Accuracy = 0.00%
