# Breast Cancer Prediction


This project predicts whether a breast tumor is malignant or benign.<br>

We are using __Breast Cancer Wisconsin (Diagnostic) Data Set__ (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29).<br>
***

## Data Set Information

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
<br>

The separating plane was obtained using Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree Construction Via Linear Programming." Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method which uses linear programming to construct a decision tree. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes.
<br>

The actual linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].
<br>

This database is also available through the UW CS ftp server:
>`ftp ftp.cs.wisc.edu`
<br>`cd math-prog/cpo-dataset/machine-learn/WDBC/`

## Attribute Information

Ten real-valued features are computed for each cell nucleus:

1. Radius (mean of distances from center to points on the perimeter)
2. Texture (standard deviation of gray-scale values)
3. Perimeter
4. Area
5. Smoothness (local variation in radius lengths)
6. Compactness (perimeter^2 / area - 1.0)
7. Concavity (severity of concave portions of the contour)
8. Concave points (number of concave portions of the contour) 
9. Symmetry
10. Fractal dimension ("coastline approximation" - 1)
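
As a small illustration of the compactness formula in item 6, the sketch below evaluates it for two idealized contours. The shapes and their dimensions are hypothetical values chosen for the example; they are not taken from the dataset.

```python
import math

def compactness(perimeter: float, area: float) -> float:
    """Compactness as defined in the list above: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0

# Hypothetical contour 1: a perfect circle of radius 5 (units are arbitrary).
radius = 5.0
circle_value = compactness(2 * math.pi * radius, math.pi * radius ** 2)

# Hypothetical contour 2: a square with side length 2.
square_value = compactness(perimeter=8.0, area=4.0)

print(round(circle_value, 3))  # 4*pi - 1, about 11.566
print(square_value)            # 64 / 4 - 1 = 15.0
```

The circle gives the smallest value a closed contour can reach under this formula, so larger values indicate a less regular outline.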

The real-valued dataset is available on Kaggle as [Breast Cancer Wisconsin (Diagnostic) Data Set](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data#data.csv).

This project uses a converted version of that data, which has 9 attributes (`x1` to `x9`) plus an `id` column and a `label` column.

***

### Importing modules

In [1]:
import numpy as np
from math import sqrt
from collections import Counter
import pandas as pd
import random
import warnings

### Accessing Dataset from Kaggle

In [2]:
df = pd.read_csv('https://storage.googleapis.com/kaggle-datasets/136259/323346/breast-cancer-wisconsin.data.txt?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1557079013&Signature=dItvLB1KwKVX4LIPuAuGho%2FnMoJ70TQ8N1JWwdCMj4iik3pfBjEpK8AaPtt9m4q%2B9%2FeGSjSOA5tWy%2BMZrYtYFhS%2FdQf%2F%2Bj9xMeLsB0pSki9r%2BDgTw6A%2F%2BBhGA3zVehPQnbK9V9WJfBETZBq3wGry4a1YHZ0wR0TC9f0ds55FN9VMhB882HANpffKjbUkvsxnlM0ahJctNxxUIHrSwsRXbTwwfSbjX0pXm6ulzD0HwwwbSmyFyQNNCs%2BFdr9SM5PW5MBA4RTvTSiUWOv0QN9c9dZS%2F6VU2iLMWRnRZsHSb5KlUPqoYnZ5rjFi2matw3fi%2Fti%2FgQ21qYCUyHiPoihshw%3D%3D')
print(df.head())
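# Missing values are marked with '?'; replace them with a large negative outlier.
df.replace('?', -99999, inplace=True)
# The id column carries no predictive information, so drop it.
df.drop(columns=['id'], inplace=True)
full_data = df.astype(float).values.tolist()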

        id  x1  x2  x3  x4  x5  x6  x7  x8  x9  label
0  1000025   5   1   1   1   2   1   3   1   1      2
1  1002945   5   4   4   5   7  10   3   2   1      2
2  1015425   3   1   1   1   2   2   3   1   1      2
3  1016277   6   8   8   1   3   4   3   7   1      2
4  1017023   4   1   1   3   2   1   3   1   1      2
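
To make the cleaning step concrete, here is a minimal sketch on a toy frame. The values are made up for illustration, and the column is created with `dtype=object` only so that the string marker `'?'` and the numeric placeholder can coexist regardless of pandas version.

```python
import pandas as pd

# Toy frame with hypothetical values; '?' marks a missing entry, as in the dataset.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "x6": pd.Series(["1", "?", "10"], dtype=object),
    "label": [2, 2, 4],
})

# Same three steps as the cell above: replace the marker, drop id, convert to float lists.
cleaned = toy.replace("?", -99999).drop(columns=["id"])
rows = cleaned.astype(float).values.tolist()
print(rows)  # [[1.0, 2.0], [-99999.0, 2.0], [10.0, 4.0]]
```

Because -99999 lies far outside the 1 to 10 range of the attributes, a row with a missing value ends up far from every other point, so it rarely appears among the nearest neighbors.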


- The `id` column is removed and the remaining rows are stored as a list of lists.
- The list is shuffled randomly.
- The data is divided into `train_set` and `test_set` dictionaries whose keys are the labels *[2 = Benign, 4 = Malignant]*.

In [4]:
random.shuffle(full_data)
test_size = 0.2
train_set = {2: [], 4: []}  # keys are the labels: 2 = benign, 4 = malignant
test_set = {2: [], 4: []}   # values are lists of feature vectors

train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]


In [5]:
for i in train_data:  
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])
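
The shuffle, split, and grouping steps can also be wrapped in one helper. The version below is only a sketch: the function name `split_by_label` and the `seed` parameter are hypothetical additions, and the toy rows stand in for `full_data`.

```python
import random

def split_by_label(rows, test_size=0.2, seed=None):
    """Shuffle rows, hold out a test fraction, and group both parts by label.

    Each row is a list whose last element is the label.
    """
    rng = random.Random(seed)      # seed is a hypothetical addition for repeatable splits
    shuffled = list(rows)          # copy so the caller's list is not reordered
    rng.shuffle(shuffled)

    n_test = int(test_size * len(shuffled))
    cut = len(shuffled) - n_test   # explicit index avoids the empty-train case of [:-0]
    train_set, test_set = {}, {}
    for row in shuffled[:cut]:
        train_set.setdefault(row[-1], []).append(row[:-1])
    for row in shuffled[cut:]:
        test_set.setdefault(row[-1], []).append(row[:-1])
    return train_set, test_set

# Toy data: six rows labelled 2.0 and four rows labelled 4.0, one feature each.
toy_rows = [[float(i), 2.0] for i in range(6)] + [[float(i), 4.0] for i in range(6, 10)]
train_set, test_set = split_by_label(toy_rows, test_size=0.2, seed=0)
n_train = sum(len(v) for v in train_set.values())
n_test = sum(len(v) for v in test_set.values())
print(n_train, n_test)  # 8 2
```

Computing the cut index explicitly matters for very small inputs: when `int(test_size * len(rows))` is 0, the slice `rows[:-0]` is empty, so the training set would end up with no rows.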

In [11]:
def k_nearest_neighbors(data, predict, k=3):
    """Return the majority label among the k training points closest to `predict`.

    `data` maps each label to a list of feature vectors.
    """
    if len(data) >= k:
        warnings.warn('K should be greater than the number of voting groups!')

    # Euclidean distance from `predict` to every training point, paired with its label.
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])

    # Labels of the k nearest points, then the most common one.
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result  # 2 or 4
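
To see the voting behaviour on something small, the sketch below applies the same idea to two hand-made clusters. The helper `knn_vote` is a condensed, hypothetical restatement of the function above so that the example runs on its own, and the toy points are invented for illustration.

```python
from collections import Counter
import numpy as np

def knn_vote(data, predict, k=3):
    """Majority label among the k points in `data` closest to `predict`."""
    distances = []
    for group, points in data.items():
        for features in points:
            d = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append((d, group))
    nearest = sorted(distances)[:k]          # k smallest distances
    votes = [group for _, group in nearest]  # their labels
    return Counter(votes).most_common(1)[0][0]

# Two well-separated toy clusters, labelled like the dataset (2 and 4).
toy_train = {
    2: [[1, 1], [1, 2], [2, 1]],
    4: [[8, 8], [9, 8], [8, 9]],
}

near_low = knn_vote(toy_train, [2, 2], k=3)
near_high = knn_vote(toy_train, [7, 9], k=3)
print(near_low, near_high)  # 2 4
```

With two classes, an odd `k` avoids tied votes, which is one reason the evaluation uses `k=11`.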

In [17]:
correct = 0
total = 0
# Classify every test sample and count how many predictions match the true label.
for group in test_set:
    for data in test_set[group]:
        vote = k_nearest_neighbors(train_set, data, k=11)
        if group == vote:
            correct += 1
        total += 1

# Classify a single sample taken from a randomly chosen class of the test set.
data = test_set[random.choice([2, 4])][5]
print(data)
vote = k_nearest_neighbors(train_set, data, k=11)
if vote == 2:
    print('According to the prediction, the person has a benign tumor.')
else:
    print('According to the prediction, the person has a malignant tumor!!\n')
print('Prediction Accuracy:', round((correct/total)*100, 2), '%')

[10.0, 4.0, 4.0, 6.0, 2.0, 10.0, 2.0, 3.0, 1.0]
According to the prediction, the person has a malignant tumor!!

Prediction Accuracy: 96.4 %
