In [None]:
*Breast Cancer data challenge*

You belong to a data team at a local research hospital. You've been tasked with developing a means to help doctors diagnose breast cancer. You've been given data about biopsied breast cells, including whether each sample is benign (not harmful) or malignant (cancerous).

1. What features of a cell are the largest drivers of malignancy?
2. How would a physician use your product?
3. There is a non-zero cost in time and money to collect each feature about a given cell. How would you go about determining the most cost-effective method of detecting malignancy?

We want to predict the cancer state using information about the biopsied cells.

In [6]:
import pandas as pd


def clean_names(df):
    '''a function for cleaning column names to be usable'''
    df.columns = df.columns.str.strip()
    df.columns = df.columns.str.lower()
    df.columns = df.columns.str.replace(' ', '_')
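    # regex=False: treat the parentheses as literal characters, not regex syntax
    df.columns = df.columns.str.replace('(', '', regex=False)
    df.columns = df.columns.str.replace(')', '', regex=False)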
    return df
    
    
# import data
data_breast = pd.read_csv('breast-cancer-wisconsin.txt')
# reminder:
# class is cancer state
# benign == 2
# malignant == 4

# messy column names
data_breast = clean_names(data_breast)

The first question I like to ask is what the missing data situation looks like. Are there a lot of missing values? Are they randomly distributed, or are they concentrated in certain observations?
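One way to see whether missing values are concentrated in a few observations is to count the missing fields per row. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration, not the hospital data).

```python
import numpy as np
import pandas as pd

# toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "clump_thickness": [5, 10, 3, 8],
    "mitoses": [1, np.nan, 1, np.nan],
    "class": [2, np.nan, 2, np.nan],
})

# number of missing fields in each row
missing_per_row = toy.isna().sum(axis=1)

# how many rows have 0, 1, 2, ... missing fields
print(missing_per_row.value_counts().sort_index())

# share of rows with at least one missing field
share_incomplete = (missing_per_row > 0).mean()
print(share_incomplete)
```

If the counts cluster at zero and at one large value, the gaps are probably concentrated in a handful of rows rather than scattered at random.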

In [7]:
# any missing data?
print(data_breast.isnull().values.any())
# which columns have NAs?
sum_na = data_breast.isna().sum()
print(sum_na)

True
index                           0
id                              0
clump_thickness                 0
uniformity_of_cell_size        28
uniformity_of_cell_shape       28
marginal_adhesion              28
single_epithelial_cell_size    28
bare_nuclei                    28
bland_chromatin                28
normal_nucleoli                28
mitoses                        28
class                          28
dtype: int64


Most columns have NAs, and each affected column has exactly 28 of them, including the `class` column.

Are they always the same rows?

In [8]:
missing_obs = data_breast[data_breast.isnull().any(axis=1)]
print(missing_obs)

       index       id  clump_thickness uniformity_of_cell_size  \
355      355  1111249               10                     NaN   
573      573  1111249               10                     NaN   
1188    1188   601265               10                     NaN   
1980    1980  1241035                7                     NaN   
3981    3981   691628                8                     NaN   
4104    4104  1112209                8                     NaN   
4460    4460  1198641               10                     NaN   
4788    4788  1169049                7                     NaN   
4903    4903  1200892                8                     NaN   
5340    5340  1111249               10                     NaN   
5435    5435  1169049                7                     NaN   
7945    7945   691628                8                     NaN   
8145    8145  1142706                5                     NaN   
9301    9301  1110524               10                     NaN   
9872    98

Conveniently, the rows with missing data are missing every field except `index`, `id`, and `clump_thickness`.

That makes imputation impractical, because we would be imputing nearly whole rows rather than individual values, so let's just drop those observations.
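The all-or-nothing pattern can be checked directly instead of by eye. This is a minimal sketch on a made-up frame (`toy` and the `all_or_nothing` name are hypothetical): it compares the rows missing any affected column with the rows missing all of them.

```python
import numpy as np
import pandas as pd

# toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "clump_thickness": [5, 10, 8],
    "bare_nuclei": ["1", np.nan, np.nan],
    "class": ["2", np.nan, np.nan],
})

# columns that contain at least one missing value
cols_with_na = toy.columns[toy.isna().any()]

# rows missing at least one affected column vs. rows missing all of them
any_missing = toy[cols_with_na].isna().any(axis=1)
all_missing = toy[cols_with_na].isna().all(axis=1)

# True when every incomplete row is missing all of the affected columns
all_or_nothing = any_missing.equals(all_missing)
print(all_or_nothing)

# dropping incomplete rows then loses nothing that could have been imputed
toy_clean = toy.dropna()
```

When the two masks agree, dropping the incomplete rows discards no partially filled observations.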

In [9]:
data_clean = data_breast.dropna()

There might be other issues lurking in our data. Let's dive deeper to see if there are miscodings or strings where they shouldn't be.

In [12]:
# all of these columns should be int64 type
data_clean.dtypes
# they are not!
# this means there are mixed types in the columns

index                           int64
id                              int64
clump_thickness                 int64
uniformity_of_cell_size        object
uniformity_of_cell_shape       object
marginal_adhesion              object
single_epithelial_cell_size    object
bare_nuclei                    object
bland_chromatin                object
normal_nucleoli                object
mitoses                        object
class                          object
dtype: object

In [16]:
# unique values in a sample column
# if correct, should only be integer values 1-10
set(data_clean.uniformity_of_cell_shape)

{'#',
 '1',
 '10',
 '100',
 '2',
 '3',
 '30',
 '4',
 '40',
 '5',
 '50',
 '6',
 '60',
 '7',
 '70',
 '8',
 '9',
 '?',
 'No idea'}
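Rather than inspecting one column at a time, the non-numeric tokens can be listed for every column at once. The sketch below assumes a made-up frame and a hypothetical helper `bad_tokens`; it relies on `pd.to_numeric(..., errors="coerce")` turning unparseable strings into `NaN`.

```python
import pandas as pd

# toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "uniformity_of_cell_shape": ["1", "40", "?", "No idea", "#", "7"],
    "mitoses": ["1", "2", "1", "?", "1", "3"],
})


def bad_tokens(col):
    """Return the sorted distinct values in col that cannot be parsed as numbers."""
    parsed = pd.to_numeric(col, errors="coerce")
    # values that were present originally but became NaN after parsing
    mask = parsed.isna() & col.notna()
    return sorted(col[mask].unique())


# map each column name to its list of unparseable tokens
report = {name: bad_tokens(toy[name]) for name in toy.columns}
print(report)
```

Note that values such as `'40'` parse cleanly as numbers, so they will not show up here; they need a separate range check.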

In [28]:
data_clean = data_clean.apply(lambda x: pd.to_numeric(x, errors='coerce'))


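def dec_magnitude(x):
    '''sketch: scale values above 10 down one decimal magnitude (e.g. 40 -> 4)'''
    return x.where(~(x > 10), x / 10)


# values above 10 look like entries off by a factor of 10; preview them rescaled
data_clean.uniformity_of_cell_shape[data_clean.uniformity_of_cell_shape > 10] / 10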

167       6.0
213      10.0
243       4.0
374       6.0
765       4.0
924       6.0
1100      6.0
1191      3.0
1222      4.0
1291     10.0
1430      4.0
1546      4.0
1875      6.0
1888      5.0
1922      6.0
2053      6.0
2067      6.0
2166     10.0
2183     10.0
2306      4.0
2377      5.0
2392      5.0
2407      5.0
2492      6.0
2586      5.0
2809      4.0
2819      6.0
2846      4.0
2906      4.0
2945      4.0
         ... 
12233     4.0
12350     4.0
12427     4.0
12486     7.0
12688     6.0
12884     4.0
12918     5.0
13012     7.0
13375    10.0
13440     4.0
13453     4.0
13595     4.0
13699     4.0
13822     3.0
13907     5.0
14030     6.0
14117     4.0
14149     4.0
14204     4.0
14295     5.0
14394     6.0
14405     4.0
14443     3.0
14808     4.0
14967    10.0
14984     5.0
15196     6.0
15479     6.0
15528     3.0
15841     6.0
Name: uniformity_of_cell_shape, Length: 152, dtype: float64

According to the data description, all feature columns record numbers between 1 and 10. Let's make sure that is true.
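A range check along these lines could look like the sketch below. It assumes the columns have already been converted to numeric; the frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# toy frame standing in for the numeric feature columns (values are made up)
toy = pd.DataFrame({
    "clump_thickness": [5.0, 10.0, 3.0, 8.0],
    "uniformity_of_cell_shape": [1.0, 40.0, np.nan, 7.0],
})

# True where a value falls outside the documented 1-10 range
# (comparisons with NaN are False, so missing values are not flagged)
out_of_range = (toy < 1) | (toy > 10)

# number of out-of-range values per column
print(out_of_range.sum())

# the offending rows, for manual inspection
print(toy[out_of_range.any(axis=1)])
```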

Now that I'm looking at the data without missing values, we can begin to explore some of the variables.

Our response variable (the value of interest) is the cancer state of the patient. This is the variable `class` in the dataset.

The other variables are bounded between 1 and 10. The difficulty here is understanding whether these values are different states of categorical variables or (discrete) measures. Looking at the prompt and the explanation of the variables, my best guess is that they are measures.
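One rough way to probe the "measures" guess is to look at the share of malignant samples at each score. If that share rises more or less steadily with the score, treating the variable as an ordered measure seems reasonable. The sketch below uses made-up values (the `toy` frame is an assumption) and is only a sanity check, not a formal test.

```python
import pandas as pd

# toy data: scores on the 1-10 scale and class labels (2 = benign, 4 = malignant)
toy = pd.DataFrame({
    "clump_thickness": [1, 1, 3, 3, 5, 5, 8, 8, 10, 10],
    "class":           [2, 2, 2, 2, 2, 4, 4, 4, 4, 4],
})

# share of malignant samples at each score level
malignant_share = (toy["class"] == 4).groupby(toy["clump_thickness"]).mean()
print(malignant_share)

# True when the share never decreases as the score goes up
is_monotone = malignant_share.is_monotonic_increasing
print(is_monotone)
```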

In [51]:
# data_clean.clump_thickness