# Homework 4
We will apply a neural network to determine whether a patient is likely to have cancer.

## Problem Description
In this assignment, we start with a sample of cancer data.  The sample data has 30 unknown features, and indicates whether each sample is malignant or benign.  We will create a neural network to identify the relevant features for determining malignancy, and apply the resulting network to predict cancer in a separate dataset.  

## Solution Method
I will create a neural network using the `sklearn.neural_network` module.  I will reserve a small subset (probably 15%) of the training data for testing purposes, and use the rest to train the neural network.  I will try several configurations of the network to identify the optimal settings, potentially varying the number of hidden nodes, the activation function, and other relevant parameters.

## Input

import numpy as np
import pandas as pd
from sklearn import model_selection

# Labeled training data and the unlabeled patients to classify
cancer_data = pd.read_csv('cancer-data.csv')
patients = pd.read_csv('cancer-patients.csv')

# Use all 30 feature columns (F01 through F30) as the input matrix X
feature_cols = [f'F{i:02d}' for i in range(1, 31)]
X = cancer_data[feature_cols].to_numpy()

# Encode the class label: 1 for malignant ('M'), 0 otherwise
y = np.array([1 if diag == 'M' else 0 for diag in cancer_data['Class']])

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
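
The Solution Method mentions holding out roughly 15% of the data, whereas `train_test_split` falls back to a 25% test partition when `test_size` is not given. A minimal sketch of an explicit, reproducible split is shown here. It uses a synthetic stand-in dataset from `make_classification` so that it runs without the CSV files. The `test_size`, `random_state`, and `stratify` values are illustrative assumptions, not the settings behind the results reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cancer data: 200 samples, 30 features (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=30, random_state=0)

# Hold out 15% for testing; stratify keeps the class balance similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=0, stratify=y_demo)

print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the partition, and therefore the reported score, repeatable between runs.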

## Analysis

from sklearn import neural_network

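# Several hidden layer sizes that may be appropriate to the number of features.
# Each entry is a tuple with one element per hidden layer; note the trailing
# comma, since (2) is just the integer 2 while (2,) is a one-element tuple.
hiddens = (
    (2,),
    (5,),
    (8,),
    (10,),
    (15,),
    (20,),
    (30,),
    (50,),
    (50, 50),
    (70, 40),
)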

# Track the best classifier found so far and its test score
best_clf = None
best_score = 0.0

for hidden in hiddens:
    # Train a new classifier
    clf = neural_network.MLPClassifier(
        hidden_layer_sizes=hidden,
        max_iter=10000,
        random_state=12)
    clf.fit(X_train, y_train)

    # Update best classifier
    score = clf.score(X_test, y_test)
    
    if score > best_score:
        best_clf = clf
        best_score = score

## Results

# Data to use when running the final tests: the same 30 feature columns
X_run = patients[[f'F{i:02d}' for i in range(1, 31)]].to_numpy()

# Predict a diagnosis for each patient with the best classifier found above
results = best_clf.predict(X_run)

print(f"Best score: {best_score}\n")

for (_, patient), diagnosis in zip(patients.iterrows(), results):
    patient_ID = int(patient['ID'])
    patient_diag = 'malignant' if diagnosis == 1 else 'benign'

    print(f"Patient {patient_ID}, cancer is {patient_diag}.")

Best score: 0.9264705882352942

Patient 1544, cancer is malignant.
Patient 1545, cancer is benign.
Patient 1546, cancer is benign.
Patient 1547, cancer is malignant.
Patient 1548, cancer is malignant.
Patient 1549, cancer is benign.
Patient 1550, cancer is benign.
Patient 1551, cancer is benign.
Patient 1552, cancer is benign.
Patient 1553, cancer is malignant.
Patient 1554, cancer is malignant.
Patient 1555, cancer is malignant.
Patient 1556, cancer is malignant.
Patient 1557, cancer is benign.
Patient 1558, cancer is malignant.
Patient 1559, cancer is benign.
Patient 1560, cancer is malignant.
Patient 1561, cancer is malignant.
Patient 1562, cancer is benign.
Patient 1563, cancer is benign.
Patient 1564, cancer is benign.
Patient 1565, cancer is benign.
Patient 1566, cancer is malignant.
Patient 1567, cancer is benign.
Patient 1568, cancer is benign.


## Discussion
Since I do not have the true diagnoses for the patient samples that the network classified, I cannot directly assess the accuracy of these predictions.  Additionally, a neural network does not readily expose which features drive any given decision.  Even with the true diagnoses, it could be difficult to determine which features produced a particular result, whether correct or incorrect.
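
One way to probe which features a trained network relies on is permutation importance: shuffle one feature at a time and measure how much the test score drops. The sketch below is a hedged illustration on synthetic data from `make_classification`, not an analysis of the cancer data. The names `X_demo`, `clf_demo`, and `importance`, and the `F01`-style labels attached to column positions, are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in: 30 features, of which only 5 carry signal (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=30, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

clf_demo = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500, random_state=12)
clf_demo.fit(X_tr, y_tr)

# Shuffle each feature several times and record the mean drop in test accuracy
importance = permutation_importance(
    clf_demo, X_te, y_te, n_repeats=5, random_state=0)

# Report the five features whose shuffling hurts accuracy the most
top = importance.importances_mean.argsort()[::-1][:5]
for idx in top:
    print(f"F{idx + 1:02d}: {importance.importances_mean[idx]:.3f}")
```

Features whose shuffling barely changes the score are candidates for being irrelevant to the classifier, although correlated features can mask one another in this kind of test.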

Nonetheless, the distribution of predicted diagnoses (11 malignant, 14 benign) looks similar to that of the training data, which is encouraging, and the accuracy on the held-out test partition was about 92.6%.  This suggests that the predictions are a reasonably accurate picture of the actual diagnoses.
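
The comparison of class distributions can be made concrete with a small helper. The sketch below transcribes the 25 predictions printed in the Results section (1 for malignant, 0 for benign) and computes the malignant fraction. The helper name `malignant_fraction` is hypothetical. The matching figure for the training labels is left as a comment because it depends on the CSV data.

```python
import numpy as np

def malignant_fraction(labels):
    """Return the fraction of labels equal to 1 (malignant)."""
    labels = np.asarray(labels)
    return float(np.mean(labels == 1))

# Predictions for patients 1544 through 1568, transcribed from the Results output
predicted = [1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,
             0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0]

predicted_fraction = malignant_fraction(predicted)
print(f"Predicted malignant fraction: {predicted_fraction:.2f}")

# With the notebook's data loaded, the training figure would be:
# malignant_fraction(y_train)
```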

I was surprised to find that, in spite of possessing 30 features, a hidden layer of size 10 generated the most accurate classifier.  I assumed that we would need a number of hidden nodes comparable to the number of inputs before additional nodes became unnecessary.  I underestimated the efficacy of simpler configurations.
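
Because the best configuration was chosen using a single train/test partition, a small difference between layer sizes may reflect that particular split rather than the architecture. Cross-validation is one way to check this. The sketch below is a hedged illustration that compares a few layer sizes on synthetic data. The dataset, the candidate sizes, and `cv=5` are assumptions for the example, and its output says nothing about the cancer data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the cancer data: 200 samples, 30 features (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=30, random_state=0)

# Mean 5-fold accuracy for each candidate hidden layer configuration
cv_means = {}
for hidden in [(5,), (10,), (30,)]:
    clf_demo = MLPClassifier(
        hidden_layer_sizes=hidden, max_iter=500, random_state=12)
    scores = cross_val_score(clf_demo, X_demo, y_demo, cv=5)
    cv_means[hidden] = scores.mean()

best_hidden = max(cv_means, key=cv_means.get)
print(best_hidden, round(cv_means[best_hidden], 3))
```

Averaging over several folds gives each configuration a score that depends less on which samples happened to land in the test partition.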

I intended to try different activation functions, but I have not yet had time to explore this.  Nonetheless, the default rectified linear (ReLU) activation seems to have provided a good balance of accuracy and performance.
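
A comparison of activation functions could follow the same pattern as the hidden-layer loop in the Analysis section. `MLPClassifier` accepts `'identity'`, `'logistic'`, `'tanh'`, and `'relu'` for its `activation` parameter. The sketch below is a minimal illustration on synthetic data. The dataset, the fixed layer size of 10, and the variable names are assumptions, and the scores it prints are not results for the cancer data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the cancer data: 200 samples, 30 features (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Train one classifier per activation function and record its test accuracy
activation_scores = {}
for activation in ("identity", "logistic", "tanh", "relu"):
    clf_demo = MLPClassifier(
        hidden_layer_sizes=(10,),
        activation=activation,
        max_iter=500,
        random_state=12)
    clf_demo.fit(X_tr, y_tr)
    activation_scores[activation] = clf_demo.score(X_te, y_te)

for activation, score in activation_scores.items():
    print(f"{activation}: {score:.3f}")
```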