# Breast Cancer Prediction

### Reading the data

Read `cancer.txt` line by line. The first line is the header and is skipped. The variable `data` stores all remaining lines of the file as a list of strings, one string per data line.

In [9]:
# open the file; the with-block closes it automatically when done
with open('./data/cancer.txt', 'r') as f:
    # read and ignore header
    f.readline()

    # store records
    data = f.readlines()

### Data transformation

Converts the strings to numerical data. Each line is split on whitespace and every field is parsed as a `float`. The values are stored in the nested list `X`.

In [10]:
num = len(data)
X = []
for line in data:
    fields = line.split()
    x = []
    for item in fields:
        x.append(float(item))
    X.append(x)

Splits the class labels from the data matrix `X`. The last entry of each row is the class label. Afterwards, `X` is a list of feature vectors and `Y` contains the corresponding class labels.

In [11]:
Y = []
for i in range(num):
    Y.append(X[i][-1])
    X[i] = X[i][:-1]
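
The two transformation steps above can also be combined into a single pass. The following is a minimal sketch, not part of the original notebook: the helper name `parse_records` and the toy input are hypothetical, and skipping blank lines is an added assumption.

```python
def parse_records(lines):
    """Split whitespace-separated numeric lines into features and labels."""
    features, labels = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            # assumption: blank lines carry no data and are skipped
            continue
        values = [float(item) for item in fields]
        features.append(values[:-1])  # all but the last column are features
        labels.append(values[-1])     # the last column is the class label
    return features, labels

# toy input for illustration only, not taken from cancer.txt
sample = ["1.0 2.0 0\n", "3.5 4.5 1\n", "\n"]
X_demo, Y_demo = parse_records(sample)
print(X_demo, Y_demo)
```

With the notebook's data this would be called as `X, Y = parse_records(data)`.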

### Distance matrix
Computes the pairwise Euclidean distances between the feature vectors `X[i]` and `X[j]`:

$$
\text{dist}[i][j] = \lVert X[i] - X[j] \rVert, \qquad \text{dist}[i][i] = \infty
$$

In [12]:
import math as m

# initialize distance matrix with infinity 
dist = []
for i in range(num):
    row = []
    for j in range(num):
        row.append(m.inf)
    dist.append(row)

# compute pairwise distances
for i in range(num):
    for j in range(i+1,num):
        dist[i][j] = m.dist(X[i], X[j])
        dist[j][i] = dist[i][j]
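
To check the construction on data small enough to verify by hand, the same logic can be wrapped in a function. This is a sketch under assumptions: the name `distance_matrix` and the toy points are illustrative and do not appear in the notebook. `math.dist` requires Python 3.8 or later.

```python
import math

def distance_matrix(points):
    """Symmetric matrix of Euclidean distances with an infinite diagonal."""
    n = len(points)
    # each row is built separately so the rows do not share storage
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # the distance is symmetric, so fill both triangles at once
            dist[i][j] = dist[j][i] = math.dist(points[i], points[j])
    return dist

# toy points on a line through the origin (3-4-5 triangle steps)
toy = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = distance_matrix(toy)
for row in D:
    print(row)
```

The infinite diagonal means that taking the minimum of a row never returns the element itself.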

### Classification
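Estimates the error rate (in percent) of the nearest-neighbor classifier using the leave-one-out test protocol: each element is classified by the label of its nearest neighbor among all other elements. Because `dist[i][i]` is infinite, an element is never selected as its own neighbor.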

In [13]:
err = 0
for i in range(num):
    k = dist[i].index(min(dist[i]))
    if Y[k] != Y[i]:
        err += 1
err = 100.0*err/num
print(f'error rate : {err:.1f}')      

error rate : 40.5
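
The leave-one-out protocol can also be written without a precomputed distance matrix by skipping the held-out index explicitly. The sketch below is illustrative only: the function name `loo_nn_error` and the toy data are assumptions, and it expects at least two samples.

```python
import math

def loo_nn_error(features, labels):
    """Leave-one-out error rate (percent) of the 1-nearest-neighbor rule."""
    n = len(features)
    errors = 0
    for i in range(n):
        best_j, best_d = None, math.inf
        for j in range(n):
            if j == i:
                continue  # the held-out element may not vote for itself
            d = math.dist(features[i], features[j])
            if d < best_d:
                best_j, best_d = j, d
        if labels[best_j] != labels[i]:
            errors += 1
    return 100.0 * errors / n

# toy data: the last point lies closest to a point of the other class
X_toy = [[0.0], [1.0], [10.0], [11.0], [2.5]]
Y_toy = [0, 0, 1, 1, 1]
toy_error = loo_nn_error(X_toy, Y_toy)
print(f'error rate : {toy_error:.1f}')
```

In the toy data only the point at 2.5 is misclassified, so one of five elements is wrong. Skipping `j == i` plays the same role as the infinite diagonal of `dist`.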


### Classification with standardized data

Repeat the experiment with Z-transformed (standardized) data. The approach used here is simple but methodologically incorrect: a correct approach estimates the average and standard deviation from the training data only, whereas here the held-out test element also contributes to both quantities.

**First step:** compute the column-wise average (`avg`) and sample standard deviation (`std`, with denominator `num-1`) of the feature matrix `X`

In [16]:
# initialize avg and std for each column
avg = []
std = []
dim = len(X[0])
for i in range(dim):
    avg.append(0)
    std.append(0)

# compute avg for each column
for x in X:
    for i in range(dim):
        avg[i] += x[i]
for i in range(dim):
    avg[i] /= num

# compute std for each column 
for x in X:
    for i in range(dim):
        std[i] += (x[i] - avg[i])**2
for i in range(dim):
    std[i] /= num-1
    std[i] **= 0.5

**Second step:** Z-transformation of the matrix `X`

In [17]:
for x in X:
    for i in range(dim):
        x[i] = (x[i] - avg[i])/std[i]
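
The first two steps can be cross-checked against the standard library. The sketch below is not part of the notebook: the function name `z_transform` and the toy matrix are assumptions. `statistics.stdev` uses the same `n-1` denominator as the loop above. Like the notebook code, it fails with a division by zero if a column is constant.

```python
import statistics

def z_transform(rows):
    """Return a standardized copy of rows (column mean 0, sample std 1)."""
    columns = list(zip(*rows))  # transpose: one tuple per feature column
    avg = [statistics.mean(col) for col in columns]
    std = [statistics.stdev(col) for col in columns]  # n-1 denominator
    return [[(v - a) / s for v, a, s in zip(row, avg, std)] for row in rows]

# toy matrix for illustration only
toy = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
Z = z_transform(toy)
for row in Z:
    print(row)
```

Unlike the in-place loop above, this sketch returns a new nested list and leaves its input unchanged.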

**Third step:** classification using the leave-one-out protocol 

In [18]:
dist = []
for i in range(num):
    row = []
    for j in range(num):
        row.append(m.inf)
    dist.append(row)

for i in range(num):
    for j in range(i+1,num):
        dist[i][j] = m.dist(X[i], X[j])
        dist[j][i] = dist[i][j]
        
err = 0
for i in range(num):
    k = dist[i].index(min(dist[i]))
    if Y[k] != Y[i]:
        err += 1
err = 100.0*err/num
print(f'error rate : {err:.1f}')      

error rate : 31.9
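
The methodologically correct variant described at the start of this section re-estimates the average and standard deviation in every leave-one-out round from the remaining elements only. The following is a minimal sketch of that idea, not the notebook's method: the names `scale_row` and `loo_nn_error_standardized` and the toy data are assumptions. It needs at least three samples and no constant column in any training fold. It was not run on `cancer.txt`, so no error rate is reported for it here.

```python
import math
import statistics

def scale_row(row, avg, std):
    """Z-transform one feature vector with the given column statistics."""
    return [(v - a) / s for v, a, s in zip(row, avg, std)]

def loo_nn_error_standardized(features, labels):
    """Leave-one-out 1-NN error (percent), standardizing per round."""
    n = len(features)
    errors = 0
    for i in range(n):
        # training fold: everything except the held-out element i
        train = [features[j] for j in range(n) if j != i]
        train_labels = [labels[j] for j in range(n) if j != i]
        columns = list(zip(*train))
        avg = [statistics.mean(col) for col in columns]
        std = [statistics.stdev(col) for col in columns]
        # the test element is scaled with training statistics only
        test = scale_row(features[i], avg, std)
        dists = [math.dist(test, scale_row(row, avg, std)) for row in train]
        k = dists.index(min(dists))
        if train_labels[k] != labels[i]:
            errors += 1
    return 100.0 * errors / n

# toy data: two well-separated clusters, one class each
X_toy = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0],
         [10.0, 10.0], [11.0, 11.0], [10.0, 11.0]]
Y_toy = [0, 0, 0, 1, 1, 1]
clean_error = loo_nn_error_standardized(X_toy, Y_toy)
print(f'error rate : {clean_error:.1f}')
```

Recomputing the statistics in every round repeats work, but it keeps the held-out element out of the estimates, which is the point of the protocol.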
