# WDBC Data Sets Analysis

The following jupyter notebook provides summary of the analysis undertaken on the `WDBC` data set.

## Download

Download available files.
 1. List available files from the website
 2. Create absolute url paths and download files to temp folder
*The snippet useful if re-running the netbook and saves hassle with sourcing raw data*

In [61]:
# Modules
# File download and storage
import os
import sys
import tempfile
from urllib.request import urlopen
from urllib.request import urlretrieve
from urllib.parse import urljoin
from bs4 import BeautifulSoup
# Data manipulation
import pandas as pd
# Gadgets
from termcolor import colored # Coloured console output
import humanize # Human readable units (useful for file sizes, etc.)
from IPython.display import display, HTML # Common aproach to keep nice layout and avoid printing index

In [2]:
# Download relevant files to temp storage
tmp_fld = tempfile.mkdtemp(suffix='wdbc_data', prefix='tmp')
url_dta_fld = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/'
site = urlopen(url_dta_fld)
content = site.read()
soup = BeautifulSoup(content, "html.parser")
list_urls = soup.find_all('a')
# Download desired files
dta_fls = []
for url in list_urls:
    asst_url = urljoin(url_dta_fld,url['href'])
    # Names are useful for the attribute information
    if asst_url.endswith(('.data', '.names')):
        dta_fls.append(os.path.join(tmp_fld, asst_url.split('/')[-1]))
        print(asst_url)
        urlretrieve(url=asst_url, filename=dta_fls[-1])
        print(colored('Saved: ' + dta_fls[-1], 'green'))

https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data
[32mSaved: /var/folders/7x/kwc1y_l96t55_rwlv35mg8xh0000gn/T/tmpjhtnd_lvwdbc_data/breast-cancer-wisconsin.data[0m
https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names
[32mSaved: /var/folders/7x/kwc1y_l96t55_rwlv35mg8xh0000gn/T/tmpjhtnd_lvwdbc_data/breast-cancer-wisconsin.names[0m
https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data
[32mSaved: /var/folders/7x/kwc1y_l96t55_rwlv35mg8xh0000gn/T/tmpjhtnd_lvwdbc_data/wdbc.data[0m
https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names
[32mSaved: /var/folders/7x/kwc1y_l96t55_rwlv35mg8xh0000gn/T/tmpjhtnd_lvwdbc_data/wdbc.names[0m
https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wpbc.data
[32mSaved: /var/folders/7x/kwc1y_l96t55_rwlv35mg8xh0000gn/T/tmpjhtnd_lvwdbc

### Available files
_Includes summary of .names files_

In [41]:
dta_fls_info = pd.DataFrame(
    {'file' : list(map(os.path.basename, dta_fls)),
     'size' : [humanize.naturalsize(os.path.getsize(fle)) for fle in dta_fls],
     'lines' : list(map(lambda fle: sum(1 for _ in fle), dta_fls))
    })
display(HTML(dta_fls_info.to_html(index=False)))

file,lines,size
breast-cancer-wisconsin.data,98,19.9 kB
breast-cancer-wisconsin.names,99,5.7 kB
wdbc.data,79,124.1 kB
wdbc.names,80,4.7 kB
wpbc.data,79,44.2 kB
wpbc.names,80,5.7 kB


## Import

In [37]:
# --- Do not run ---
stdout_nms = []
for fle in dta_fls:
    if fle.endswith('names'):
        stdout_nms.append(os.popen("cat " + fle).read())
# --- Do not run ---
# stdout_nms
# The one coule analyse *.names files to pick relevant list but the .names file format and number of variables
# does not justify that effort.

### Conveniance function (pd sample)
Simple conveniance function providing data frame preview by taking head/tail and wee sample in for the middle. More useful than changing defult printing options, IMHO.

In [69]:
def pd_preview (data_frame, sample_size = 3, tail_lines = 1, head_lines = 1):
    # Generate data frames to append
    df_head = data_frame.head(n = head_lines)
    df_tail = data_frame.tail(n = tail_lines)
    df_smpl = data_frame.sample(n = sample_size)
    # Append
    df_preview = df_head.append([df_smpl, df_tail])
    return df_preview

### Read

Read data sets, including **core** wdbc data set as well as other data set available in the folder (for the evenual further 

In [73]:
dta_brst_cncr = pd.read_csv(filepath_or_buffer=dta_fls[0],
                            names=['id_num', 
                                   'clump_thickness',
                                   'uniformity_cell_size',
                                   'uniformity_cell_shape',
                                   'marginal_adhesion',
                                   'single_epithelial_cell_size',
                                   'bare_nuclei',
                                   'bland_chromatin',
                                   'normal_nucleoli',
                                   'mitoses',
                                   'class'])
preview(dta_brst_cncr)

Unnamed: 0,id_num,clump_thickness,uniformity_cell_size,uniformity_cell_shape,marginal_adhesion,single_epithelial_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class
0,1000025,5,1,1,1,2,1,3,1,1,2
430,1276091,1,3,1,1,2,1,2,2,1,2
192,1212232,5,1,1,1,2,1,2,1,1,2
348,832226,3,4,4,10,5,1,3,3,1,4
698,897471,4,8,8,5,4,5,10,4,1,4


This is **core** data set, initial analysis will be conducted on this data set.

The column names are derived as per the information available in wdbc.names. `*_lsrgst` columns reflect "worst" or largest (mean of the three largest values)" metric as per the [available documentation](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names).

In [85]:
# Col names
col_nms = ['ID', 'diagnosis', 'radius_avg', 
           'texture_avg', 'perimeter_avg', 'area_avg', 
           'smoothness_mean', 'compactness_avg', 'concavity_avg',
           'concave_points_avg', 'symmetry_avg', 
           'fractal_dimension_mean', 'radius_se', 'texture_se', 
           'perimeter_se', 'area_se', 'smoothness_se', 
           'compactness_se', 'concavity_se', 'concave_points_se', 
           'symmetry_se', 'fractal_dimension_se', 
           'radius_lrgst', 'texture_lrgst', 'perimeter_lrgst',
           'area_lrgst', 'smoothness_lrgst', 
           'compactness_lrgst', 'concavity_lrgst', 
           'concave_points_lrgst', 'symmetry_lrgst', 
           'fractal_dimension_lrgst']


In [86]:
dta_wdbc = pd.read_csv(filepath_or_buffer=dta_fls[2],
                      names=col_nms)
# US ID as index
dta_wdbc.set_index(['ID'], inplace=True)

In [87]:
preview(dta_wdbc)

Unnamed: 0_level_0,diagnosis,radius_avg,texture_avg,perimeter_avg,area_avg,smoothness_mean,compactness_avg,concavity_avg,concave_points_avg,symmetry_avg,...,radius_lrgst,texture_lrgst,perimeter_lrgst,area_lrgst,smoothness_lrgst,compactness_lrgst,concavity_lrgst,concave_points_lrgst,symmetry_lrgst,fractal_dimension_lrgst
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
89827,B,11.06,14.96,71.49,373.9,0.1033,0.09097,0.05397,0.03341,0.1776,...,11.92,19.9,79.76,440.0,0.1418,0.221,0.2299,0.1075,0.3301,0.0908
8912055,B,11.74,14.02,74.24,427.3,0.07813,0.0434,0.02245,0.02763,0.2101,...,13.31,18.26,84.7,533.7,0.1036,0.085,0.06735,0.0829,0.3101,0.06688
8912049,M,19.16,26.6,126.2,1138.0,0.102,0.1453,0.1921,0.09664,0.1902,...,23.72,35.9,159.8,1724.0,0.1782,0.3841,0.5754,0.1872,0.3258,0.0972
92751,B,7.76,24.54,47.92,181.0,0.05263,0.04362,0.0,0.0,0.1587,...,9.456,30.37,59.16,268.6,0.08996,0.06444,0.0,0.0,0.2871,0.07039


# Descriptive analysis

The following section provides initial descriptive analysis as well as introduces some addittional metrics.