## Meta-Learning for Breast Cancer Prediction
**By Timothy Yao, Jennifer Yu, and Cecelia Zhang**

### Pre-processing of Wisconsin Datasets

We retrieved our datasets from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+\(Diagnostic\)). The repository provides three data files: `breast-cancer-wisconsin.data`, `wdbc.data`, and `wpbc.data`. We use `wdbc.data` because its features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. It contains 569 instances and 32 attributes:
  1. ID number
  2. Diagnosis (M = malignant, B = benign)
  3. (through 32) Ten real-valued features are computed for each cell nucleus:
     * radius (mean of distances from center to points on the perimeter)
     * texture (standard deviation of gray-scale values)
     * perimeter
     * area
     * smoothness (local variation in radius lengths)
     * compactness (perimeter^2 / area - 1.0)
     * concavity (severity of concave portions of the contour)
     * concave points (number of concave portions of the contour)
     * symmetry
     * fractal dimension ("coastline approximation" - 1)
    
Together with the ID and the diagnosis, this gives 32 attributes, because fields 3 through 32 follow the scheme given in the dataset description: "The mean, standard error, and 'worst' or largest (mean of the three largest values) of the ten features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius."
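The field layout quoted above can be sketched as a small helper that maps a 1-based field number to a readable name. The function name `field_name` and the exact label strings are our own illustrative choices and are not part of the dataset documentation.

```python
# Ten base features, in the order listed in the dataset description.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]
# The three statistics computed per feature, in file order.
STATISTICS = ["mean", "standard error", "worst"]


def field_name(field):
    """Return a readable name for a 1-based field number (1 to 32).

    Hypothetical helper, for illustration only.
    """
    if field == 1:
        return "ID number"
    if field == 2:
        return "diagnosis"
    if not 3 <= field <= 32:
        raise ValueError("field must be between 1 and 32")
    # Fields 3-12 are means, 13-22 standard errors, 23-32 worst values.
    stat_index, feature_index = divmod(field - 3, 10)
    return STATISTICS[stat_index] + " " + BASE_FEATURES[feature_index]


print(field_name(3))   # mean radius
print(field_name(13))  # standard error radius
print(field_name(23))  # worst radius
```

The `divmod` by 10 reflects the block structure of the file: each group of ten consecutive fields holds one statistic for all ten base features.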

We do not use the `breast-cancer-wisconsin.data` dataset because it does not indicate that the data were obtained from an FNA. The `wpbc.data` dataset is intended for predicting whether a mass recurs. Our project's goal is to determine from an FNA whether a breast mass is malignant or benign, so `wdbc.data` best serves our purposes.

Below is the basic code that parses the dataset.

In [4]:
import pandas as pd

# wdbc.data has no header row, so read it with header=None
data = pd.read_csv("data/wdbc.data", sep=",", header=None)
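As a quick sanity check on the layout described earlier (32 fields per record, diagnosis coded as M or B), the sketch below validates parsed rows without depending on pandas. The helper name `check_rows` and the placeholder rows are hypothetical. With the DataFrame loaded above, `data.values.tolist()` would supply the real rows.

```python
def check_rows(rows, n_fields=32, labels=("M", "B")):
    """Count valid rows, raising ValueError on the first malformed one.

    Hypothetical helper: each row is a sequence whose second entry
    is the diagnosis label.
    """
    count = 0
    for line_no, row in enumerate(rows, start=1):
        if len(row) != n_fields:
            raise ValueError(
                f"row {line_no}: expected {n_fields} fields, got {len(row)}"
            )
        if row[1] not in labels:
            raise ValueError(f"row {line_no}: unexpected diagnosis {row[1]!r}")
        count += 1
    return count


# Placeholder rows (not real measurements), used only to exercise the check.
placeholder = [
    [1, "M"] + [0.0] * 30,
    [2, "B"] + [0.0] * 30,
]
print(check_rows(placeholder))  # 2
```

For the full file, the returned count should be 569, matching the number of instances stated in the dataset description.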

In [10]:
# Name the 30 feature columns: each of the ten base features appears
# three times, as its mean, standard error, and worst value.
features = ["radius", "texture", "perimeter", "area", "smoothness", "compactness", "concavity", "concave pts",
            "symmetry", "frac. dim"]
descr = ["mean", "stderr", "worst"]
# Outer loop over statistics keeps the file order: 10 means, 10 stderrs, 10 worsts.
features3 = [d + " " + f for d in descr for f in features]
data.columns = ["ID", "Malignant/Benign"] + features3
print(data)

           ID Malignant/Benign  mean radius  mean texture  mean perimeter  \
0      842302                M        17.99         10.38          122.80   
1      842517                M        20.57         17.77          132.90   
2    84300903                M        19.69         21.25          130.00   
3    84348301                M        11.42         20.38           77.58   
4    84358402                M        20.29         14.34          135.10   
..        ...              ...          ...           ...             ...   
564    926424                M        21.56         22.39          142.00   
565    926682                M        20.13         28.25          131.20   
566    926954                M        16.60         28.08          108.30   
567    927241                M        20.60         29.33          140.10   
568     92751                B         7.76         24.54           47.92   

     mean area  mean smoothness  mean compactness  mean concavity  \
0     