Data mining is the process of sorting through large data sets to identify patterns and relationships that can help solve business problems. Data mining techniques and tools help enterprises predict future trends and make better-informed decisions.

### Importing Libraries

In [47]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Taking a dataset from the internet

In [48]:
#Importing the dataset
df = pd.read_csv("cancer dataset.csv")

In [49]:
#First we take a look at the dataset
df.head()

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


#### Using Label Encoder to convert categorical data to numerical data

In [50]:
# Use the LabelEncoder class from sklearn.preprocessing to convert
# categorical text labels into numerical codes a model can work with

from sklearn.preprocessing import LabelEncoder

labelencoder_Y = LabelEncoder()  # create an instance of the LabelEncoder class

# Encode the second column of df (diagnosis): the labels "B" and "M"
# are replaced by integer codes so the column can serve as the model target
df.iloc[:, 1] = labelencoder_Y.fit_transform(df.iloc[:, 1].values)
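
As a rough picture of what the encoder does, the sketch below reproduces the idea in plain Python: distinct labels are sorted and numbered from zero, so `B` would become 0 and `M` would become 1. The helper name `fit_label_mapping` and the sample values are made up for illustration; this is not scikit-learn's implementation.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer code, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}


# Hypothetical sample of diagnosis values (not taken from the real file)
diagnosis = ["M", "M", "B", "M", "B"]

mapping = fit_label_mapping(diagnosis)
encoded = [mapping[label] for label in diagnosis]

print(mapping)  # {'B': 0, 'M': 1}
print(encoded)  # [1, 1, 0, 1, 0]
```

Under this ordering, code 1 stands for a malignant diagnosis and code 0 for a benign one, which matters later when reading the confusion matrix.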

### Selecting the features (X) and the target (Y)

In [51]:
# X holds the feature columns used for prediction.
# The slice 2:31 selects column positions 2 through 30 (29 columns); if the file
# has 30 feature columns after diagnosis, the last one is left out.
X = df.iloc[:, 2:31]

# Y holds the encoded diagnosis values (the target)
Y = df.iloc[:, 1].values
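
Positional slices in `iloc` follow Python's half-open convention: the start position is included and the stop position is not. The sketch below illustrates this with a plain list, assuming a hypothetical layout of `id`, `diagnosis`, and 30 feature columns; the names `feature_1` to `feature_30` are placeholders, not the real column names.

```python
# Hypothetical layout: id, diagnosis, then 30 feature columns
columns = ["id", "diagnosis"] + [f"feature_{i}" for i in range(1, 31)]

selected = columns[2:31]   # same half-open convention as df.iloc[:, 2:31]
print(len(selected))       # 29
print(selected[0])         # feature_1
print(selected[-1])        # feature_29

all_features = columns[2:32]  # a stop of 32 would include the 30th feature
print(len(all_features))      # 30
```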

### Splitting the dataset for training and testing

In [52]:
#importing train_test_split from sklearn.model_selection library
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=0)
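
To make the effect of `test_size=0.25` concrete, here is a minimal plain-Python sketch of a shuffled split. It assumes 569 rows (a figure inferred from the outputs further down, not stated in the notebook) and rounds the test share up, which is the assumption that matches the 143 test rows seen in the confusion matrices. The helper `simple_split` is hypothetical, and scikit-learn's own shuffling differs, so the actual row assignment will not match.

```python
import math
import random


def simple_split(n_samples, test_size, seed):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_size)  # test share rounded up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = simple_split(569, 0.25, seed=0)
print(len(train_idx), len(test_idx))  # 426 143
```

Fixing the seed plays the same role as `random_state=0`: the same rows land in the same subset on every run.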

In [53]:
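from sklearn.preprocessing import StandardScaler

sc = StandardScaler()  # initialize the scaler

# Fit the scaler on the training data only, then reuse the same mean and
# standard deviation for the test data; refitting on the test set would
# scale the two sets differently and leak test information
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)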
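
Standardization subtracts a mean and divides by a standard deviation. A common convention is to compute both statistics on the training set only and reuse them for the test set. The sketch below shows that convention on one made-up feature using only the standard library; the values are illustrative and the population standard deviation is used, as `StandardScaler` does.

```python
from statistics import fmean, pstdev

# Hypothetical values of a single feature
train = [10.0, 12.0, 14.0]
test = [12.0, 16.0]

# Statistics come from the training data only
mu = fmean(train)
sigma = pstdev(train)

train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]  # reuse mu and sigma

print(train_scaled)
print(test_scaled)
```

After scaling, the training values have mean 0 and standard deviation 1, while the test values are expressed on that same scale rather than being re-centred on their own mean.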

### Logistic Regression

In [54]:
def logreg(X_train, Y_train):
    from sklearn.linear_model import LogisticRegression
    log = LogisticRegression(random_state=0)
    log.fit(X_train, Y_train)
    print("Logistic Regression Training Accuracy:", log.score(X_train, Y_train))
    return log
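
For intuition about what the fitted model computes, the following sketch scores a sample with a weighted sum and passes it through the sigmoid function to get a probability for class 1. The weights, bias, and feature values are hypothetical, and the helper `predict_proba` is written here for illustration; it is not the scikit-learn method of the same name.

```python
import math


def predict_proba(features, weights, bias):
    """Probability of class 1 under a logistic model."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-score))


weights = [1.5, -0.5]  # hypothetical coefficients
bias = 0.0

p_high = predict_proba([2.0, 1.0], weights, bias)   # score = 2.5
p_low = predict_proba([-1.0, 1.0], weights, bias)   # score = -2.0

# A probability of at least 0.5 is read as class 1
print(p_high >= 0.5, p_low >= 0.5)  # True False
```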

### Random Forest

In [59]:
def randomForest(X_train, Y_train, X_test, Y_test):
    from sklearn.ensemble import RandomForestClassifier
    from sklearn import metrics

    # Create a random forest classifier with 100 trees
    clf = RandomForestClassifier(n_estimators=100)

    # Train the model on the training set
    clf.fit(X_train, Y_train)

    # Predict on the held-out test set
    y_pred = clf.predict(X_test)

    # This score is computed on the test set, so it is a test accuracy
    print("Random Forest Test Accuracy:", metrics.accuracy_score(Y_test, y_pred))
    return clf

In [60]:
# After label encoding, the target may still have object dtype; cast it to int
Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')

### Accuracy of models

In [61]:
logrex = logreg(X_train, Y_train)

Logistic Regression Training Accuracy: 0.9906103286384976


In [74]:
randomfor = randomForest(X_train, Y_train, X_test, Y_test)

Random Forest Test Accuracy: 0.972027972027972


### Confusion Matrix

In [76]:
#For Logistic Regression

from sklearn.metrics import confusion_matrix
cm_logistic_regression = confusion_matrix(Y_test, logrex.predict(X_test))
print(cm_logistic_regression)

[[86  4]
 [ 3 50]]


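In this Logistic Regression model (rows are actual classes, columns are predicted classes; 0 = benign, 1 = malignant, with malignant taken as the positive class):
1. True negative = 86
2. False positive = 4
3. False negative = 3
4. True positive = 50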
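
The four cells of a binary confusion matrix support more than accuracy. The sketch below takes the logistic regression counts printed above and, assuming the scikit-learn layout (rows are actual classes, columns are predicted classes) with label 1 (malignant) as the positive class, derives precision and recall as well.

```python
# Counts from the logistic regression matrix printed above
cm = [[86, 4],
      [3, 50]]
(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted malignant that are malignant
recall = tp / (tp + fn)     # share of actual malignant cases that were caught

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
# 0.951 0.926 0.943
```

In a screening setting, recall is often the figure to watch, since a false negative means a malignant case was missed.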

In [80]:
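#For Random Forest
from sklearn.ensemble import RandomForestClassifier

# A fresh forest is fitted here; no random_state is set, so results can
# vary between runs. The name rf_model avoids shadowing the randomForest function.
rf_model = RandomForestClassifier()
rf_model.fit(X_train, Y_train)
cm_random_forest = confusion_matrix(Y_test, rf_model.predict(X_test))
print(cm_random_forest)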

[[85  5]
 [ 1 52]]


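In this Random Forest model (rows are actual classes, columns are predicted classes; 0 = benign, 1 = malignant, with malignant taken as the positive class):
1. True negative = 85
2. False positive = 5
3. False negative = 1
4. True positive = 52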

### Testing Accuracy

In [83]:
#For Logistic Regression
# confusion_matrix layout: rows are actual classes, columns are predicted classes
TN = cm_logistic_regression[0][0]
FP = cm_logistic_regression[0][1]
FN = cm_logistic_regression[1][0]
TP = cm_logistic_regression[1][1]

testing_accuracy = (TP + TN) / (TP + TN + FP + FN)

print("Testing Accuracy of Logistic Regression: ", testing_accuracy)

Testing Accuracy of Logistic Regression:  0.951048951048951


In [85]:
#For Random Forest
# confusion_matrix layout: rows are actual classes, columns are predicted classes
TN = cm_random_forest[0][0]
FP = cm_random_forest[0][1]
FN = cm_random_forest[1][0]
TP = cm_random_forest[1][1]

testing_accuracy = (TP + TN) / (TP + TN + FP + FN)

print("Testing Accuracy of Random Forest: ", testing_accuracy)

Testing Accuracy of Random Forest:  0.958041958041958
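
Since accuracy is just the diagonal of the confusion matrix divided by the total count, the two calculations above can be folded into one helper. The sketch below uses a hypothetical function `accuracy_from_cm` and the two matrices printed earlier, and also reports false negatives (row 1, column 0 in the scikit-learn layout) for a side-by-side comparison.

```python
def accuracy_from_cm(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


# Matrices copied from the outputs above
results = {
    "Logistic Regression": [[86, 4], [3, 50]],
    "Random Forest": [[85, 5], [1, 52]],
}

for name, cm in results.items():
    print(name, accuracy_from_cm(cm), "false negatives:", cm[1][0])
```

On these particular runs the random forest has slightly higher test accuracy and fewer false negatives, though without a fixed `random_state` its figures can change from one run to the next.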
