# Project 4

### Max Brehmer & Joakim Andersson Svendsen

For this project we chose to work with the `ARCENE` dataset from the UCI Machine Learning Repository. It was donated to the repository on February 28, 2008.

The ARCENE task is to distinguish cancer patterns from normal patterns in mass-spectrometric data. This is a two-class classification problem with continuous input variables. ARCENE is one of the five datasets of the NIPS 2003 feature selection challenge.

What characterizes this dataset is its large number of features ($10\,000$) compared to the number of instances ($900$). Of these features, $3\,000$ are probes, which have no predictive power.

The data are split into training, validation and test sets as follows:

| ARCENE         | Positive ex. | Negative ex. | Total |
|----------------|--------------|--------------|-------|
| Training set   | 44           | 56           | 100   |
| Validation set | 44           | 56           | 100   |
| Test set       | 310          | 390          | 700   |
| All            | 398          | 502          | 900   |
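
The class counts in the table can be checked directly against the label files. The following is a minimal sketch, assuming the labels are encoded as `-1` (negative) and `+1` (positive). The helper name `class_counts` and the toy label list are illustrative only and stand in for the real file contents.

```python
from collections import Counter

def class_counts(labels):
    """Return (positives, negatives, total) for labels encoded as +1 / -1."""
    counts = Counter(labels)
    positives = counts[1]
    negatives = counts[-1]
    return positives, negatives, positives + negatives

# Toy stand-in with the same class balance as the training set in the table
toy_labels = [1] * 44 + [-1] * 56
print(class_counts(toy_labels))
```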

In [99]:
import pandas as pd

train_data = pd.read_csv('../arcene/ARCENE/arcene_train.data', sep=' ', header=None)
train_labels = pd.read_csv('../arcene/ARCENE/arcene_train.labels', header=None)
test_data = pd.read_csv('../arcene/ARCENE/arcene_test.data', sep=' ', header=None)
valid_data = pd.read_csv('../arcene/ARCENE/arcene_valid.data', sep=' ', header=None)
valid_labels = pd.read_csv('../arcene/arcene_valid.labels', header=None)

The goal of this project is to classify the ARCENE samples as cancerous or non-cancerous.
We plan to use a Convolutional Neural Network (CNN), since this approach was not tested on the ARCENE data in the original data description, and to compare its error rate against other benchmark models. The dataset has many features, which may mean that the data behaves similarly to image data, where CNNs are the standard approach.
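
Because the classes are not perfectly balanced, it can help to report both the plain error rate and the balanced error rate (BER), the average of the per-class error rates. To our knowledge, BER was the ranking metric of the NIPS 2003 challenge. The sketch below assumes labels in {0, 1}; the function names are our own and not part of any library.

```python
def error_rate(y_true, y_pred):
    """Fraction of samples whose predicted label differs from the true label."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)

def balanced_error_rate(y_true, y_pred):
    """Average of the error rates computed separately within each true class."""
    per_class = []
    for cls in sorted(set(y_true)):
        preds = [p for t, p in zip(y_true, y_pred) if t == cls]
        per_class.append(sum(p != cls for p in preds) / len(preds))
    return sum(per_class) / len(per_class)

# Toy example: 4 negatives (one misclassified) and 2 positives (one misclassified)
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(error_rate(y_true, y_pred))           # 2 of 6 wrong
print(balanced_error_rate(y_true, y_pred))  # mean of 1/4 and 1/2
```

The two measures coincide when both classes are the same size and diverge as the class imbalance grows.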

In [100]:
# Remove column 10000, which is empty in every row
def remove_empty_columns(data):
    """Drop, in place, every column whose values are all missing."""
    data.dropna(axis=1, how='all', inplace=True)

remove_empty_columns(train_data)
remove_empty_columns(valid_data)
remove_empty_columns(test_data)
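
An all-empty last column is what `read_csv` produces when every line of a space-separated file ends with a trailing separator, which is presumably the case for the ARCENE `.data` files. A minimal sketch with a made-up three-feature file (the inline text is illustrative, not actual ARCENE data):

```python
import io
import pandas as pd

# Two rows of three values each; note the trailing space before each newline
raw = "1 2 3 \n4 5 6 \n"

df = pd.read_csv(io.StringIO(raw), sep=' ', header=None)
print(df.shape)       # one extra, all-NaN column is parsed after the last value

# Drop every column that contains only missing values
cleaned = df.dropna(axis=1, how='all')
print(cleaned.shape)
```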

In [None]:
import setuptools.dist
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import MinMaxScaler
import numpy as np

"""
    Reshape data
"""

# Convert labels from {-1, 1} to {0, 1}
train_labels = (train_labels + 1) // 2
valid_labels = (valid_labels + 1) // 2

# One-hot encode the labels
train_labels_onehot = to_categorical(train_labels, num_classes=2)
valid_labels_onehot = to_categorical(valid_labels, num_classes=2)

# Normalize datasets
scaler = MinMaxScaler()
train_data = scaler.fit_transform(train_data)
valid_data = scaler.transform(valid_data)
test_data = scaler.transform(test_data)

# Convert the numpy array back to a pandas dataframe
train_data = pd.DataFrame(train_data)
valid_data = pd.DataFrame(valid_data)
test_data = pd.DataFrame(test_data)
train_labels_onehot = pd.DataFrame(train_labels_onehot)
valid_labels_onehot = pd.DataFrame(valid_labels_onehot)

pd.set_option('display.max_columns', 20)
#train_data
#valid_data
#test_data
#train_labels_onehot
#valid_labels_onehot
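
Since the scaler is fitted on the training set only, validation and test values that fall outside the training range of a feature are mapped outside $[0, 1]$. The sketch below writes out min-max scaling by hand with NumPy to show this. It is meant to mirror what `MinMaxScaler` does with its default settings, and the arrays and function names are made up for illustration.

```python
import numpy as np

def fit_min_max(train):
    """Return per-feature minimum and range computed on the training data."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    # Constant features have zero range; use 1 to avoid division by zero
    col_range[col_range == 0] = 1.0
    return col_min, col_range

def apply_min_max(data, col_min, col_range):
    """Scale data with statistics that were fitted on the training data."""
    return (data - col_min) / col_range

train = np.array([[0.0, 10.0], [4.0, 20.0]])
valid = np.array([[5.0, 8.0]])   # both values lie outside the training range

col_min, col_range = fit_min_max(train)
print(apply_min_max(train, col_min, col_range))  # training values stay in [0, 1]
print(apply_min_max(valid, col_min, col_range))  # one value above 1, one below 0
```

Fitting on the training set alone keeps information from the validation and test sets out of the preprocessing step, at the cost of such out-of-range values.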

In [None]:
"""
    Create model
"""

model = models.Sequential()


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9990,9991,9992,9993,9994,9995,9996,9997,9998,9999
0,0.000000,0.000000,0.661017,0.170769,0.020202,1.129353,0.900609,0.000000,0.004950,0.000000,...,0.033666,1.058104,0.215385,0.000000,0.076220,0.138144,0.120735,0.000000,0.000000,0.872420
1,0.000000,0.032110,0.000000,-0.030769,0.000000,0.174129,0.000000,0.000000,0.108911,0.567775,...,0.794264,1.029052,0.000000,0.000000,0.027439,0.414433,0.724409,0.000000,0.086735,0.373358
2,0.000000,0.146789,0.000000,0.681538,0.535354,0.776119,0.000000,0.000000,0.693069,0.375959,...,0.256858,0.704893,0.102564,0.409639,0.251524,0.463918,0.031496,0.000000,0.119898,0.410882
3,0.000000,0.353211,0.000000,0.269231,0.000000,0.624378,0.957404,0.000000,0.297030,0.000000,...,0.149626,0.848624,0.066667,0.000000,0.743902,0.000000,0.078740,0.000000,0.451531,0.913696
4,0.000000,0.155963,0.432203,0.761538,0.020202,0.728856,0.959432,0.000000,0.193069,0.000000,...,0.066085,0.963303,0.000000,0.349398,0.292683,0.000000,0.086614,0.000000,0.012755,0.780488
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0.127660,0.334862,0.000000,0.629231,0.929293,0.544776,0.000000,0.000000,0.688119,0.667519,...,0.415212,0.825688,0.000000,1.036145,0.198171,0.752577,0.044619,0.944444,0.007653,0.069418
96,0.058511,0.266055,0.211864,0.469231,1.101010,0.527363,0.247465,0.000000,0.371287,0.342711,...,0.216958,0.542813,0.800000,0.927711,0.039634,0.571134,0.587927,0.000000,0.091837,0.489681
97,0.494681,0.146789,0.580508,0.449231,0.000000,0.206468,0.468560,0.381818,0.000000,0.000000,...,1.082294,0.013761,0.000000,0.000000,0.000000,0.503093,0.703412,0.000000,0.704082,0.585366
98,0.632979,0.055046,0.838983,0.480000,0.000000,0.268657,0.831643,0.000000,0.000000,0.010230,...,0.897756,0.000000,0.189744,0.000000,0.000000,0.527835,0.947507,0.000000,0.000000,0.656660
