Cancer diagnosis data set
The question I want to answer:
Which features correlate most strongly with a cancer diagnosis?
The question I want a machine learning model to answer:
Can a machine learning model accurately predict a cancer diagnosis?


In [2]:
import pandas as pd
import altair as alt
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics
from sklearn.tree import DecisionTreeClassifier

In [3]:
db = pd.read_csv('data.csv')
db.head()

Unnamed: 0,id,diagnosis,Radius_mean,Texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,21.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


The dataset is already clean and has no missing values, so I only need to replace one thing, the text labels in the diagnosis column, to make the data easier for a machine learning model to work with.

In [4]:
# encode the diagnosis labels as numbers: malignant (M) = 1, benign (B) = 0
db["diagnosis"] = db["diagnosis"].map({"M": 1, "B": 0})
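
The two assumptions behind this step (no missing values, and only "M" and "B" in the diagnosis column) can be checked before encoding. A minimal sketch, using a tiny made-up frame named toy as a stand-in for data.csv; the values and the variable names are hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for data.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "diagnosis": ["M", "B", "B", "M"],
    "Radius_mean": [17.99, 11.42, 12.05, 20.57],
})

# Count missing values per column; all zeros means nothing needs imputing.
missing_per_column = toy.isna().sum()
has_missing = bool(missing_per_column.any())

# Any label other than "M" or "B" would not be encoded cleanly as 0/1,
# so it is worth confirming that none exist.
unexpected_labels = set(toy["diagnosis"].unique()) - {"M", "B"}
```

On the real data the same two checks would be run on db instead of toy; an empty unexpected_labels set and has_missing equal to False mean the encoding step is safe.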

Using corr() to look at the correlation between diagnosis and the other columns.

In [16]:
corr = db.corr()
imp_elements_table = corr.sort_values(by=['diagnosis'], ascending=False)
imp_elements_table

Unnamed: 0,id,diagnosis,Radius_mean,Texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
diagnosis,0.039769,1.0,0.730029,0.417232,0.742636,0.708984,0.35856,0.596534,0.69636,0.776614,...,0.776454,0.456903,0.782914,0.733825,0.421465,0.590998,0.65961,0.793566,0.416294,0.323872
concave points_worst,0.035174,0.793566,0.744214,0.29704,0.771241,0.722017,0.503053,0.815573,0.861323,0.910155,...,0.787424,0.359755,0.816322,0.747419,0.547691,0.80108,0.855434,1.0,0.502528,0.511114
perimeter_worst,0.079986,0.782914,0.965137,0.360485,0.970387,0.95912,0.238853,0.59021,0.729565,0.855923,...,0.993708,0.365098,1.0,0.977578,0.236775,0.529408,0.618344,0.816322,0.269493,0.138957
concave points_mean,0.044158,0.776614,0.822529,0.294307,0.850977,0.823269,0.553695,0.831135,0.921391,1.0,...,0.830318,0.292752,0.855923,0.80963,0.452753,0.667454,0.752399,0.910155,0.375744,0.368661
radius_worst,0.082405,0.776454,0.969539,0.355463,0.969476,0.962746,0.21312,0.535315,0.688236,0.830318,...,1.0,0.359921,0.993708,0.984015,0.216574,0.47582,0.573975,0.787424,0.243529,0.093492
perimeter_mean,0.073159,0.742636,0.997855,0.332231,1.0,0.986507,0.207278,0.556936,0.716136,0.850977,...,0.969476,0.303038,0.970387,0.94155,0.150549,0.455774,0.563879,0.771241,0.189115,0.051019
area_worst,0.107187,0.733825,0.941082,0.346576,0.94155,0.959213,0.206718,0.509604,0.675987,0.80963,...,0.984015,0.345842,0.977578,1.0,0.209145,0.438296,0.543331,0.747419,0.209146,0.079647
Radius_mean,0.074626,0.730029,1.0,0.326716,0.997855,0.987357,0.170581,0.506124,0.676764,0.822529,...,0.969539,0.297008,0.965137,0.941082,0.119616,0.413463,0.526911,0.744214,0.163953,0.007066
area_mean,0.096893,0.708984,0.987357,0.324149,0.986507,1.0,0.177028,0.498502,0.685983,0.823269,...,0.962746,0.287489,0.95912,0.959213,0.123523,0.39041,0.512606,0.722017,0.14357,0.003738
concavity_mean,0.05008,0.69636,0.676764,0.302324,0.716136,0.685983,0.521984,0.883121,1.0,0.921391,...,0.688236,0.299879,0.729565,0.675987,0.448822,0.754968,0.884103,0.861323,0.409464,0.51493


Sorting the table by the diagnosis column lists the features in order of how strongly they correlate with the diagnosis. The top 5 features were: concave points_worst, perimeter_worst, concave points_mean, radius_worst, perimeter_mean.

So when I create a machine learning model, these are features I will want to include.
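
The ranking only needs the diagnosis column of the correlation matrix, not the full table. A minimal sketch on synthetic data; the frame toy and its three feature names are made up for illustration, and taking the absolute value is my own choice so that a strong negative correlation would also rank highly (the notebook sorts by the signed value):

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature tracks the label closely, one weakly, one not at all.
rng = np.random.default_rng(0)
n = 200
diagnosis = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "diagnosis": diagnosis,
    "strong_feature": diagnosis + rng.normal(0, 0.3, size=n),
    "weak_feature": diagnosis + rng.normal(0, 3.0, size=n),
    "noise_feature": rng.normal(0, 1.0, size=n),
})

# Keep only the correlations with the label, drop the label's correlation
# with itself, and rank the features by absolute strength.
ranking = (
    toy.corr()["diagnosis"]
    .drop("diagnosis")
    .abs()
    .sort_values(ascending=False)
)

# Names of the highest-ranked features, ready to pass to a column filter.
top_features = ranking.head(2).index.tolist()
```

A high correlation with the label suggests a feature is useful on its own, but several of the top features here (radius, perimeter, area) are also strongly correlated with each other, so they carry overlapping information.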

In [24]:
x = db.filter([
    'Radius_mean',
    'Texture_mean',
    'perimeter_mean',
    'area_mean',
    'compactness_mean',
    'concavity_mean',
    'concave points_mean',
    'radius_worst',
    'perimeter_worst',
    'compactness_worst',
    'area_worst',
    'concave points_worst',
    'symmetry_worst',
    'fractal_dimension_worst',
    ])

y = db.filter(['diagnosis'])

In [25]:
# hold out 25% of the rows for testing; random_state makes the split repeatable
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=222)

In [20]:
y_train.head()

Unnamed: 0,diagnosis
542,0
402,0
24,1
67,0
560,0


In [29]:
# create the model
classifier = DecisionTreeClassifier()

# train the model
classifier.fit(x_train, y_train)

# make predictions
y_predictions = classifier.predict(x_test)

# test how accurate predictions are
metrics.accuracy_score(y_test, y_predictions)

0.9300699300699301

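I now have a way to make predictions using machine learning and a decision tree. Comparing the predictions with the held-out test labels gives an accuracy of about 93%. Because DecisionTreeClassifier() is created without a random_state, rerunning the cell can give a slightly different score.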
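
Accuracy counts every mistake the same way, while for a diagnosis a missed malignant case is usually the more serious error. A minimal sketch of how the two kinds of error could be separated with a confusion matrix and recall; the labels below are made up for illustration and are not the notebook's predictions:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels for illustration (1 = malignant, 0 = benign).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# Share of all predictions that match the true label.
accuracy = accuracy_score(y_true, y_pred)

# Share of truly malignant cases that the predictions caught.
recall = recall_score(y_true, y_pred)

# For binary labels [0, 1], ravel() yields the counts in the order:
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```

In the notebook the same calls would take y_test and y_predictions. In this toy example the accuracy is 0.8, but one of the three malignant cases is missed, which the accuracy figure alone does not show.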