Purpose of this notebook: research with data  
* Understand the business  
* Understand the problem  
* Find one of the key problems (basic stats)



I selected this dataset last week. [Breast Cancer Wisconsin (Diagnostic) Data Set](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data)  
The other options can be found [here](https://docs.google.com/document/d/1YCuJsHgygPVuPIDFq_8RLVFFSw5ihU81y3ssbJKaKGQ/edit)  

Table of contents
* [Understanding-the-business](#Understanding-the-business)
    * [This-dataset](#This-dataset)
    * [First-usage-of-this-dataset:-1992](#First-usage-of-this-dataset:-1992)

# Understanding the business

In this case the "business" is the University of Wisconsin.

<hr>

### This dataset

The information about this dataset can be found here: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. A few of the images can be found at [Web Link](http://pages.cs.wisc.edu/~street/images/)

<hr>

### First usage of this dataset: __1992__  

[W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.](https://www.researchgate.net/publication/2512520_Nuclear_Feature_Extraction_For_Breast_Tumor_Diagnosis)  
        
Abstract  
Interactive **image processing techniques**, along with a **linear-programming-based inductive classifier**, have been used to create a highly accurate system for diagnosis of breast tumors. A **small fraction of a fine needle aspirate slide is selected and digitized**. With an interactive interface, *the user initializes active contour models, __known as snakes__, near the __boundaries of a set of cell nuclei__*. The customized snakes are deformed to the exact shape of the nuclei. This allows for precise, automated analysis of nuclear size, shape and texture. Ten such features are computed for each nucleus, and the mean value, largest (or "worst") value and standard error of each feature are found over the range of isolated cells.  
After 569 images were analyzed in this fashion, different combinations of features were tested to find those which best separate benign from malignant samples. Ten-fold cross-validation accuracy of 97% was achieved using a single separating plane on three of the thirty features: mean texture, worst area and worst smoothness. This represents an improvement over the best diagnostic results in the medical literature. **The system is currently in use at the University of Wisconsin Hospitals**. The same feature set has also been utilized in the much more difficult task of predicting distant recurrence of malignancy in patients, resulting in an accuracy of 86%.
        
        

Questions about the paper  
1. What is the background of the authors?
2. What is the arc length of a closed curve? (The energy function is defined to minimize it.)
3. How many patients were needed to obtain the 569 tumors? Could this introduce bias if multiple tumors come from the same patient?
4. Why is the trade-off between sensitivity and specificity so important?

Observations  
1. A lot of geometrical approaches
2. Basic metrics based on geometry
3. They take advantage of the technology, for example measuring texture from the colors
4. All the dataset's features are well explained in this paper


Own questions  
1. How can feature engineering based on basic geometry improve the binary classification?
2. Is there another way to visualize sensitivity vs specificity?
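
To make the sensitivity and specificity questions concrete, here is a minimal sketch of how both numbers are computed from true and predicted labels. The function name `sensitivity_specificity` and the label lists are made up for illustration; they are not results from the paper or from this dataset.

```python
def sensitivity_specificity(y_true, y_pred, positive="M"):
    """Return (sensitivity, specificity) for binary labels.

    Assumes both classes are present in y_true; otherwise a division
    by zero occurs.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = tp / (tp + fn)  # share of malignant cases that were caught
    specificity = tn / (tn + fp)  # share of benign cases correctly cleared
    return sensitivity, specificity


# Hypothetical labels, for illustration only
y_true = ["M", "M", "M", "M", "B", "B", "B", "B", "B", "B"]
y_pred = ["M", "M", "M", "B", "B", "B", "B", "B", "M", "B"]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In a diagnostic setting a missed malignant tumor (low sensitivity) and a false alarm on a benign one (low specificity) have very different costs, which is why the two numbers are usually reported separately instead of as a single accuracy.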

# Load data

In [1]:
import pandas as pd
import cufflinks as cf

cf.go_offline()

In [2]:
df = pd.read_csv('data/datasets_180_408_data.csv')

In [3]:
del df['Unnamed: 32']

Attribute Information:

1) ID number  
2) Diagnosis (M = malignant, B = benign)  
3-32)  

Ten real-valued features are computed for each cell nucleus:  

a) radius (mean of distances from center to points on the perimeter)  
b) texture (standard deviation of gray-scale values)  
c) perimeter  
d) area  
e) smoothness (local variation in radius lengths)  
f) compactness (perimeter^2 / area - 1.0)  
g) concavity (severity of concave portions of the contour)  
h) concave points (number of concave portions of the contour)  
i) symmetry  
j) fractal dimension ("coastline approximation" - 1)  

The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
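
As a rough illustration of the geometric features listed above, the sketch below computes perimeter, area and compactness for a contour given as a closed polygon. This is an assumption-laden toy version: the paper works on snake contours fitted to nuclei, and the helper names (`perimeter`, `area`, `compactness`) and the example shapes are hypothetical.

```python
import math


def perimeter(points):
    """Arc length of the closed polygon through `points`, a list of (x, y)."""
    n = len(points)
    total = 0.0
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]  # wrap around to close the curve
        total += math.hypot(x1 - x0, y1 - y0)
    return total


def area(points):
    """Enclosed area via the shoelace formula."""
    n = len(points)
    s = 0.0
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        s += x0 * y1 - x1 * y0
    return abs(s) / 2.0


def compactness(points):
    """perimeter^2 / area - 1.0, as defined in the attribute list."""
    return perimeter(points) ** 2 / area(points) - 1.0


# Unit square: perimeter 4, area 1, compactness 4^2 / 1 - 1 = 15
square = [(0, 0), (1, 0), (1, 1), (0, 1)]

# Circle sampled at 360 points: compactness approaches 4*pi - 1
circle = [
    (math.cos(2 * math.pi * k / 360), math.sin(2 * math.pi * k / 360))
    for k in range(360)
]

print(compactness(square), compactness(circle))
```

A circle has the smallest possible compactness for a given area, so larger values indicate a more irregular outline.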

All feature values are recoded with four significant digits.

Missing attribute values: none

### Own research (Google scholar)
radius importance  
texture importance  
perimeter importance  
area importance  
smoothness importance  
compactness importance  
concavity images and importance  
concave points drawing and importance  
symmetry image and importance  
fractal dimension meaning and importance  

# Visualize data

In [4]:
nbins=40

In [7]:
l_inter = []
for c in [c for c in df.columns if c not in ['id', 'diagnosis']]:
    title = c

    # Copy the slice so that adding columns does not trigger SettingWithCopyWarning
    x = df[[c, 'diagnosis']].copy()

    # Bin the feature into nbins equal-width intervals
    x[c+'_binned'] = pd.cut(x[c], nbins)

    # Count samples per bin and per diagnosis
    x = x.pivot_table(index=c+'_binned', columns='diagnosis', aggfunc='count', values=c)
    x.fillna(0, inplace=True)

    x.index = x.index.astype(str)

    # Keep the raw counts for reference
    x['B_orig'] = x['B']
    x['M_orig'] = x['M']

    # Normalize so that each class histogram sums to 1
    x['B'] /= x['B'].sum()
    x['M'] /= x['M'].sum()

    # Overlap per bin: the smaller of the two normalized heights
    x['intersection'] = x[['B', 'M']].min(axis=1)

    # B sums to 1, so this is the overlap as a fraction of each class
    intersection_pct = float(x['intersection'].sum() / x['B'].sum())

    # x[['B', 'M', 'intersection']].iplot(title = 'Variable: ' + title + ' | ' + '{:,.0%} intersection'.format(intersection_pct))

    l_inter.append([c, intersection_pct])
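
The overlap measure computed in the loop above can also be sketched as a small standalone function, which makes it easier to check on toy data. This is a simplified stand-in, not the notebook's exact computation: `histogram_intersection` is a hypothetical name, and the binning here uses plain equal-width bins, whereas `pd.cut` uses right-closed intervals over a slightly extended range, so results can differ marginally.

```python
def histogram_intersection(a, b, nbins=40):
    """Overlap of two normalized equal-width histograms, between 0 and 1."""
    lo = min(min(a), min(b))
    hi = max(max(a), max(b))
    width = (hi - lo) / nbins
    if width == 0:
        width = 1.0  # all values identical: everything lands in bin 0

    def normalized_hist(values):
        counts = [0] * nbins
        for v in values:
            # Clamp so that the maximum value falls into the last bin
            idx = min(int((v - lo) / width), nbins - 1)
            counts[idx] += 1
        return [count / len(values) for count in counts]

    ha = normalized_hist(a)
    hb = normalized_hist(b)
    # Per bin, keep the smaller of the two heights, then add them up
    return sum(min(pa, pb) for pa, pb in zip(ha, hb))


# Toy samples, for illustration only
same = histogram_intersection([1, 2, 3, 4], [1, 2, 3, 4])
disjoint = histogram_intersection([0, 1, 2], [10, 11, 12])
partial = histogram_intersection([0, 0, 1, 1], [1, 1, 2, 2], nbins=2)
print(same, disjoint, partial)
```

A value near 0 means the benign and malignant distributions of a feature barely overlap, so that feature alone already separates the classes well; a value near 1 means the feature carries little information on its own.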

In [8]:
df_inter = pd.DataFrame(l_inter)

In [9]:
(df_inter.sort_values(1,ascending=False).set_index(0)*100).tail(10).iplot(
    kind='barh', 
    xTitle='Intersection percentage', 
    title='The 10 least intersected distributions'
)

In [10]:
less_intersected_features = (df_inter.sort_values(1,ascending=False).set_index(0)*100).tail(10)

In [11]:
less_intersected_features

| Feature (index `0`) | Intersection % (column `1`) |
|---|---|
| concavity_worst | 28.481581 |
| radius_mean | 28.435336 |
| area_mean | 27.860578 |
| perimeter_mean | 25.149305 |
| concavity_mean | 24.308969 |
| area_worst | 20.092754 |
| radius_worst | 19.400402 |
| concave points_worst | 18.486074 |
| concave points_mean | 17.365625 |
| perimeter_worst | 17.262565 |


In [12]:
z = df[less_intersected_features.index.tolist() + ['diagnosis']]

In [13]:
l_inter = []
# Iterate from the least to the most overlapping of the selected features
for c in less_intersected_features.index.tolist()[::-1]:
    title = c

    # Copy the slice so that adding columns does not trigger SettingWithCopyWarning
    x = df[[c, 'diagnosis']].copy()

    # Bin the feature into nbins equal-width intervals
    x[c+'_binned'] = pd.cut(x[c], nbins)

    # Count samples per bin and per diagnosis
    x = x.pivot_table(index=c+'_binned', columns='diagnosis', aggfunc='count', values=c)
    x.fillna(0, inplace=True)

    x.index = x.index.astype(str)

    # Keep the raw counts for reference
    x['B_orig'] = x['B']
    x['M_orig'] = x['M']

    # Normalize so that each class histogram sums to 1
    x['B'] /= x['B'].sum()
    x['M'] /= x['M'].sum()

    # Overlap per bin: the smaller of the two normalized heights
    x['intersection'] = x[['B', 'M']].min(axis=1)

    # B sums to 1, so this is the overlap as a fraction of each class
    intersection_pct = float(x['intersection'].sum() / x['B'].sum())

    x[['B', 'M', 'intersection']].iplot(title = 'Variable: ' + title + ' | ' + '{:,.0%} intersection'.format(intersection_pct))

    l_inter.append([c, intersection_pct])

In [15]:
# The two features with the smallest class overlap
top2 = less_intersected_features.tail(2).index.tolist()

df[top2 + ['diagnosis']].iplot(
    kind='scatter',
    mode='markers',
    x=top2[0],
    xTitle=top2[0],
    y=top2[1],
    yTitle=top2[1],
    categories='diagnosis'
)


The pandas.np module is deprecated and will be removed from pandas in a future version. Import numpy directly instead



