# Tumor Detection using classification – Machine Learning and Python

### Step 1: Pre-processing the Data

In [2]:
# Importing dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings to keep the notebook output clean
import warnings
warnings.filterwarnings('ignore')

In [3]:
# Including & Reading the CSV file:
df = pd.read_csv("../../datasets/data.csv")

Now we check whether the CSV file was read successfully, using the head() method. It returns the first n rows (5 by default) of a data frame or series.

In [4]:
df.head()

,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,
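As a minimal sketch of how head() behaves, the snippet below uses a small hypothetical frame (`toy`, not the real dataset) and calls head() with and without an explicit row count.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (values are illustrative)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "diagnosis": ["M", "M", "B", "B", "M", "B", "B"],
    "radius_mean": [17.99, 20.57, 11.42, 12.45, 19.69, 9.50, 13.08],
})

first_five = toy.head()   # no argument: the first 5 rows
first_two = toy.head(2)   # explicit n: the first 2 rows

print(first_five.shape)
print(first_two)
```

If the frame has fewer than n rows, head(n) simply returns all of them.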


Next, we fetch the column header names with df.columns:

In [5]:
# Check the names of all columns
df.columns

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')

To get a quick overview of the data set, we use the info() method. It lists every column together with its non-null count and dtype, which makes it a handy first step in exploratory analysis.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

The CSV file may contain blank fields that can harm the project by hampering the prediction. The column list above ends with a suspicious 'Unnamed: 32' column, so we inspect it:

In [7]:
df['Unnamed: 32']

0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
       ..
564   NaN
565   NaN
566   NaN
567   NaN
568   NaN
Name: Unnamed: 32, Length: 569, dtype: float64
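Instead of inspecting one column at a time, the missing values can be counted for every column at once. The sketch below assumes a small hypothetical frame (`toy`) with one completely empty column; the names `missing_per_column`, `empty_columns` and `cleaned` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one partly missing column and one entirely empty column
toy = pd.DataFrame({
    "radius_mean": [17.99, 20.57, 19.69],
    "texture_mean": [10.38, np.nan, 21.25],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Columns in which every value is missing
empty_columns = [c for c in toy.columns if toy[c].isna().all()]

# Drop only the columns that are entirely empty
cleaned = toy.dropna(axis=1, how="all")

print(missing_per_column)
print(empty_columns)
```

Using how="all" keeps columns that have only a few gaps, which may be preferable to dropping them outright.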

Now that we have found the empty column, we remove it. The id column is only an identifier, so we drop it as well.

In [8]:
df = df.drop("Unnamed: 32", axis=1)

# to check whether those values are
# deleted or not:
df.head()

# also check the columns after this
# process:
df.columns

df.drop('id', axis=1, inplace=True)
# we can do this also: df = df.drop('id', axis=1)

# To see the change, again go through
# the columns
df.columns

Index(['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')
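The two drops above can also be written as a single call. This is a sketch on a hypothetical two-row frame (`toy`), using the `columns=` keyword of drop() in place of `axis=1`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same two unwanted columns
toy = pd.DataFrame({
    "id": [842302, 842517],
    "diagnosis": ["M", "M"],
    "radius_mean": [17.99, 20.57],
    "Unnamed: 32": [np.nan, np.nan],
})

# Drop both columns at once; the original frame is left unchanged
trimmed = toy.drop(columns=["id", "Unnamed: 32"])

# errors="ignore" makes the call safe to repeat when a column is already gone
trimmed_again = trimmed.drop(columns=["id"], errors="ignore")

print(list(trimmed.columns))
```

The errors="ignore" option is useful in notebooks, where a cell may be run more than once.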

Now we check the type of df.columns with the built-in type() function, which returns the class of the object passed as its argument.

In [9]:
type(df.columns)

pandas.core.indexes.base.Index

We will need to traverse the columns and group them by position, so we save the column names in a list.

In [10]:
l = list(df.columns)
print(l)

['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']


Now we split the column names into groups by position. The slice l[1:11] takes the ten columns at indices 1 through 10 (index 0 is diagnosis) and stores them in features_mean; features_se and features_worst follow the same pattern.

In [12]:
features_mean = l[1:11]

features_se = l[11:21]

features_worst = l[21:]
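Positional slices are easy to get wrong by one, so it can help to cross-check them against the column suffixes. The sketch below rebuilds the 31 column names shown earlier (the helper names `base`, `columns` and `by_suffix` are illustrative) and compares both approaches.

```python
# Rebuild the column list in the same order as the dataset:
# diagnosis, then ten *_mean, ten *_se and ten *_worst columns
base = ["radius", "texture", "perimeter", "area", "smoothness",
        "compactness", "concavity", "concave points", "symmetry",
        "fractal_dimension"]
columns = ["diagnosis"] + [f"{b}_{s}" for s in ("mean", "se", "worst") for b in base]

# Positional slicing, as in the cell above
features_mean = columns[1:11]
features_se = columns[11:21]
features_worst = columns[21:]

# Suffix-based selection, which does not depend on column order
by_suffix = [c for c in columns if c.endswith("_mean")]

print(len(features_mean), len(features_se), len(features_worst))
print(features_mean == by_suffix)
```

The suffix-based version keeps working even if the columns are reordered.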

In [13]:
df.head(2)

,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902


In the 'diagnosis' column of the CSV file there are two values, M = Malignant and B = Benign, which tell us whether the tumor is cancerous or not. We will verify this from the code.

In [21]:
# Check which values the diagnosis field contains
df['diagnosis'].unique()
# M stands for Malignant, B stands for Benign


array(['M', 'B'], dtype=object)

So it verifies that there are only two values in the Diagnosis field.
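Since most classifiers expect numeric targets, the two labels are often mapped to integers at this point. The following is a sketch only: the series is hypothetical, and the 1/0 encoding is an assumption rather than something this notebook prescribes.

```python
import pandas as pd

# Hypothetical diagnosis column (illustrative values only)
diagnosis = pd.Series(["M", "M", "B", "M", "B"], name="diagnosis")

# unique() lists the labels in order of first appearance
labels = diagnosis.unique()

# Guard against labels other than the two expected ones
unexpected = set(labels) - {"M", "B"}

# Assumed encoding: 1 = malignant, 0 = benign
diagnosis_code = diagnosis.map({"M": 1, "B": 0})

print(list(labels))
print(diagnosis_code.tolist())
```

Any label missing from the mapping would become NaN, which is why the check for unexpected labels comes first.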

To get a fair idea of how many cases are malignant and how many are benign, we use the countplot() function.

In [22]:
# Pass the column by keyword: recent seaborn versions treat the first
# positional argument as `data` and fail on the string labels
sns.countplot(x='diagnosis', data=df);

If we do not need a graph, we can use a method that returns the number of occurrences of each value.

In [16]:
df['diagnosis'].value_counts()

diagnosis
B    357
M    212
Name: count, dtype: int64
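The same method can report class proportions instead of raw counts. The sketch below rebuilds a series with the counts shown above (357 B and 212 M) rather than loading the CSV file.

```python
import pandas as pd

# Rebuild a diagnosis column with the class counts reported above
diagnosis = pd.Series(["B"] * 357 + ["M"] * 212, name="diagnosis")

counts = diagnosis.value_counts()                      # absolute counts per label
proportions = diagnosis.value_counts(normalize=True)   # fraction of rows per label

print(counts)
print(proportions.round(3))
```

The proportions make the class imbalance visible at a glance, which matters later when judging classifier accuracy.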

Next we use the shape attribute. shape returns the dimensions of an array or data frame as a tuple of integers, one per axis, each giving the number of elements along that axis. For instance, a shape of (6, 3) means 6 rows and 3 columns.

In [17]:
df.shape


(569, 31)

This means that the data set has 569 rows and 31 columns.
Now that the dataset is ready for processing, we use the describe() method, which reports basic statistics such as count, mean, standard deviation and percentiles for a data frame or a series of numeric values.

In [19]:
# Summary of all numeric values
df.describe()


,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075
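To see how shape and describe() relate, here is a minimal sketch on a hypothetical four-row frame (`toy`). Note that describe() summarizes only the numeric columns by default, so the diagnosis column is left out.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one text column
toy = pd.DataFrame({
    "radius_mean": [10.0, 12.0, 14.0, 16.0],
    "diagnosis": ["B", "B", "M", "M"],
})

rows, cols = toy.shape     # shape is an attribute, not a method call
summary = toy.describe()   # numeric columns only by default

print(rows, cols)
print(summary)
```

Individual statistics can be read from the result with label-based indexing, for example `summary.loc["mean", "radius_mean"]`.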


Next, we use the corr() method to find the pairwise correlation of all columns in the data frame. NaN values are automatically excluded. Older pandas versions silently ignored non-numeric columns; from pandas 2.0 onward, corr() raises a ValueError on a text column such as diagnosis unless numeric_only=True is passed.

In [20]:
# Correlation Plot
corr = df.corr(numeric_only=True)  # skip the non-numeric diagnosis column
corr

This command produces a table of 30 rows by 30 columns, whose rows and columns are features such as radius_mean, texture_se and so on.
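The following sketch shows two ways of keeping a text column out of the correlation, on a hypothetical four-row frame (`toy`). It assumes a pandas version that accepts the numeric_only argument of corr() (1.5 or later).

```python
import pandas as pd

# Hypothetical toy frame with one text column and two numeric columns
toy = pd.DataFrame({
    "diagnosis": ["M", "M", "B", "B"],
    "radius_mean": [18.0, 20.0, 11.0, 12.0],
    "perimeter_mean": [120.0, 132.0, 72.0, 78.0],
})

# Option 1: ask corr() to skip non-numeric columns
corr_numeric = toy.corr(numeric_only=True)

# Option 2: drop the text column explicitly before correlating
corr_explicit = toy.drop(columns=["diagnosis"]).corr()

print(corr_numeric)
```

Both options give the same matrix; the explicit drop also works on pandas versions that lack numeric_only.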

The attribute corr.shape returns (30, 30). The next step is plotting the correlations as a heatmap. A heatmap is a two-dimensional graphical representation of data in which the individual values of a matrix are represented as colors. The seaborn package can create annotated heatmaps, which can be adjusted with Matplotlib tools as required.

In [23]:
# making a heatmap of the correlation matrix
# (corr must be the numeric-only matrix, e.g. df.corr(numeric_only=True))
plt.figure(figsize=(14, 14))
sns.heatmap(corr)
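As a sketch of an annotated heatmap, the snippet below correlates a small hypothetical frame (`toy`) and prints each coefficient inside its cell. The color map, the value range and the Agg backend are assumptions made so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; this line can be dropped inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy frame with three numeric features (illustrative values)
toy = pd.DataFrame({
    "radius_mean": [18.0, 20.6, 19.7, 11.4, 12.5],
    "perimeter_mean": [122.8, 132.9, 130.0, 77.6, 82.6],
    "smoothness_mean": [0.118, 0.085, 0.110, 0.143, 0.100],
})
toy_corr = toy.corr()

fig, ax = plt.subplots(figsize=(6, 5))
# annot=True writes each coefficient into its cell; vmin/vmax fix the color scale to [-1, 1]
sns.heatmap(toy_corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heatmap (toy data)")
fig.tight_layout()
```

Fixing the color scale to the full correlation range keeps the colors comparable between plots.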

Finally, we check the data set again to ensure that the columns are intact and have not been affected by these operations.