In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=a55f45d68903cece405a2821274c01d0fc6446eed3696721bf04490da6304cb4
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [2]:
import pyspark
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.stat import Correlation

spark = SparkSession.builder.appName('Breast_cancer_prediction').getOrCreate()

# Data Set Description:
Attribute Information:

1) ID number

2) Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)

b) texture (standard deviation of gray-scale values)

c) perimeter

d) area

e) smoothness (local variation in radius lengths)

f) compactness (perimeter^2 / area - 1.0)

g) concavity (severity of concave portions of the contour)

h) concave points (number of concave portions of the contour)

i) symmetry

j) fractal dimension ("coastline approximation" - 1)


**source:** https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic

In [3]:
df = spark.read.csv('/content/data.csv',header=True,inferSchema=True)

In [4]:
spark.read.option('header','true').csv('/content/data.csv').show()

+--------+---------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+----+
|      id|diagnosis|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|_c32|
+--------+---------+-----------+------

**Checking for null values:**

In [7]:
null_list = []  # number of null values in each column, in df.columns order

for column in df.columns:
    null_list.append(df.filter(df[column].isNull()).count())

print(null_list)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 569]


In [8]:
df.count()

569

In [9]:
df.select('id').distinct().count()

569

All 569 records are null in the '_c32' column, so it carries no information and can be dropped. The 'id' column can be dropped as well: it is unique for each record (569 distinct values in 569 rows), so it is an identifier rather than a predictive feature.

In [10]:
df = df.drop('id', '_c32')

In [11]:
df.describe().show()

+-------+---------+------------------+-----------------+-----------------+-----------------+--------------------+-------------------+-------------------+--------------------+--------------------+----------------------+------------------+------------------+------------------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------+------------------+------------------+-----------------+--------------------+-------------------+-------------------+--------------------+-------------------+-----------------------+
|summary|diagnosis|       radius_mean|     texture_mean|   perimeter_mean|        area_mean|     smoothness_mean|   compactness_mean|     concavity_mean| concave points_mean|       symmetry_mean|fractal_dimension_mean|         radius_se|        texture_se|      perimeter_se|          area_se|       smoothness_se|      compactness_se|        concavity_se|   concave points_se|  

**Verifying the distribution of the target classes:**

In [12]:
class_distribution = df.groupBy('diagnosis').count()
class_distribution.show()

+---------+-----+
|diagnosis|count|
+---------+-----+
|        B|  357|
|        M|  212|
+---------+-----+



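The data set is moderately imbalanced: about 63% of the records are benign (357 of 569) and 37% are malignant (212 of 569). Neither class is rare, so no resampling is applied here.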
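A quick way to quantify the balance is to turn the counts above into shares. The sketch below is plain Python with the two counts copied by hand from the output; it does not touch the Spark DataFrame.

```python
# Class counts copied from the groupBy('diagnosis').count() output above.
counts = {"B": 357, "M": 212}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")

# Ratio of majority to minority class; 1.0 would be perfectly balanced.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```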

**Changing target column records into numeric values:**

In [13]:
indexer = StringIndexer(inputCol="diagnosis", outputCol="label")
df = indexer.fit(df).transform(df)
df.show()

+---------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+-----+
|diagnosis|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|label|
+---------+-----------+------------+--------------+---

The 'label' column holds the numeric encoding of 'diagnosis'. By default, StringIndexer assigns indices in order of descending frequency, so the more frequent class 'B' (benign) becomes 0 and 'M' (malignant) becomes 1.

In [14]:
df.select(['diagnosis', 'label']).show()

+---------+-----+
|diagnosis|label|
+---------+-----+
|        M|  1.0|
|        M|  1.0|
|        M|  1.0|
|        M|  1.0|
|        M|  1.0|
|        M|  1.0|
|        M|  1.0|
|        M|  1.0|
|        M|  1.0|
|        M|  1.0|
|        M|  1.0|
|        M|  1.0|
|        M|  1.0|
|        M|  1.0|
|        M|  1.0|
|        M|  1.0|
|        M|  1.0|
|        M|  1.0|
|        M|  1.0|
|        B|  0.0|
+---------+-----+
only showing top 20 rows
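The mapping of 'B' to 0 and 'M' to 1 follows from StringIndexer's default ordering, which ranks values by descending frequency. The sketch below mimics that ordering in plain Python; `frequency_desc_index` is a hypothetical helper written for illustration, not Spark's implementation.

```python
from collections import Counter

def frequency_desc_index(values):
    """Map each distinct string to a float index, most frequent first.

    Plain-Python illustration only; not Spark's implementation.
    Ties are broken alphabetically here to keep the result deterministic.
    """
    freq = Counter(values)
    ordered = sorted(freq, key=lambda v: (-freq[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Rebuild the diagnosis column from the class counts shown earlier.
diagnosis = ["B"] * 357 + ["M"] * 212
mapping = frequency_desc_index(diagnosis)
print(mapping)
```

Because the index depends on class frequency, a data set in which malignant cases were the majority would get the opposite encoding, so it is worth checking the mapping before interpreting label 1 as "positive".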



**Vectorization**: The VectorAssembler takes a list of input columns and combines them into a single vector column. Spark ML estimators such as LogisticRegression expect all features in one vector column, so this step merges the 30 numeric feature columns into a single 'features' column that can be used for modeling.

In [17]:
df.columns

['diagnosis',
 'radius_mean',
 'texture_mean',
 'perimeter_mean',
 'area_mean',
 'smoothness_mean',
 'compactness_mean',
 'concavity_mean',
 'concave points_mean',
 'symmetry_mean',
 'fractal_dimension_mean',
 'radius_se',
 'texture_se',
 'perimeter_se',
 'area_se',
 'smoothness_se',
 'compactness_se',
 'concavity_se',
 'concave points_se',
 'symmetry_se',
 'fractal_dimension_se',
 'radius_worst',
 'texture_worst',
 'perimeter_worst',
 'area_worst',
 'smoothness_worst',
 'compactness_worst',
 'concavity_worst',
 'concave points_worst',
 'symmetry_worst',
 'fractal_dimension_worst',
 'label']

In [20]:
inputCols = df.columns
inputCols.remove('diagnosis')
inputCols.remove('label')
assembler = VectorAssembler(inputCols = inputCols, outputCol = 'features')

df = assembler.transform(df)
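# 'diagnosis' is left in place here; the select() in the next step keeps only
# 'features' and 'label'. Note that df.drop() returns a new DataFrame, so
# calling it without reassigning the result would have no effect.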
df.show()

+---------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+-----+--------------------+
|diagnosis|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|label|            features|
+---------+-

**Creating a new data set containing the feature vectors and corresponding labels:**

In [30]:
df_trans = df.select(['features','label'])
df_trans.show()

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[17.99,10.38,122....|  1.0|
|[20.57,17.77,132....|  1.0|
|[19.69,21.25,130....|  1.0|
|[11.42,20.38,77.5...|  1.0|
|[20.29,14.34,135....|  1.0|
|[12.45,15.7,82.57...|  1.0|
|[18.25,19.98,119....|  1.0|
|[13.71,20.83,90.2...|  1.0|
|[13.0,21.82,87.5,...|  1.0|
|[12.46,24.04,83.9...|  1.0|
|[16.02,23.24,102....|  1.0|
|[15.78,17.89,103....|  1.0|
|[19.17,24.8,132.4...|  1.0|
|[15.85,23.95,103....|  1.0|
|[13.73,22.61,93.6...|  1.0|
|[14.54,27.54,96.7...|  1.0|
|[14.68,20.13,94.7...|  1.0|
|[16.13,20.68,108....|  1.0|
|[19.81,22.15,130....|  1.0|
|[13.54,14.36,87.4...|  0.0|
+--------------------+-----+
only showing top 20 rows



In [31]:
# No seed is passed, so the split (and the metrics below) will differ between runs.
train, test = df_trans.randomSplit([0.7, 0.3])
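`randomSplit` accepts an optional `seed` argument, and the weights act as sampling probabilities rather than exact proportions, so a 70/30 split of 569 rows will not necessarily give exactly 398 and 171 rows. The plain-Python sketch below illustrates both points under the assumption of one independent random draw per row; `bernoulli_split` is a hypothetical helper, and Spark's actual sampler differs in its details.

```python
import random

def bernoulli_split(rows, train_weight, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(569))  # stand-in for the 569 records

train_a, test_a = bernoulli_split(rows, 0.7, seed=42)
train_b, test_b = bernoulli_split(rows, 0.7, seed=42)

# Sizes are close to, but not necessarily exactly, a 70/30 split.
print(len(train_a), len(test_a))

# The same seed reproduces the same split.
print(train_a == train_b)
```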

# Modelling - Logistic Regression:

In [32]:
logistic_regression = LogisticRegression(featuresCol="features", labelCol="label")
model = logistic_regression.fit(train)

predictions = model.transform(test)
predictions.show()

+--------------------+-----+--------------------+-----------+----------+
|            features|label|       rawPrediction|probability|prediction|
+--------------------+-----+--------------------+-----------+----------+
|[6.981,13.43,43.7...|  0.0|[1196.47934450218...|  [1.0,0.0]|       0.0|
|[8.618,11.79,54.3...|  0.0|[843.732046208658...|  [1.0,0.0]|       0.0|
|[8.671,14.45,54.4...|  0.0|[1322.33783832014...|  [1.0,0.0]|       0.0|
|[8.878,15.49,56.7...|  0.0|[777.095061598381...|  [1.0,0.0]|       0.0|
|[9.465,21.01,60.1...|  0.0|[452.827809375224...|  [1.0,0.0]|       0.0|
|[9.567,15.91,60.2...|  0.0|[688.184719345113...|  [1.0,0.0]|       0.0|
|[9.876,17.27,62.9...|  0.0|[816.855970048707...|  [1.0,0.0]|       0.0|
|[9.876,19.4,63.95...|  0.0|[567.870147608901...|  [1.0,0.0]|       0.0|
|[9.904,18.06,64.6...|  0.0|[1084.89313581531...|  [1.0,0.0]|       0.0|
|[10.03,21.28,63.1...|  0.0|[758.860315870841...|  [1.0,0.0]|       0.0|
|[10.08,15.11,63.7...|  0.0|[809.830408065185...|  

In [35]:
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="label")
auc = evaluator.evaluate(predictions)
multi_evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction")
accuracy = multi_evaluator.evaluate(predictions, {multi_evaluator.metricName: "accuracy"})
precision = multi_evaluator.evaluate(predictions, {multi_evaluator.metricName: "weightedPrecision"})
recall = multi_evaluator.evaluate(predictions, {multi_evaluator.metricName: "weightedRecall"})

print(f"AUC-ROC: {auc:.4f}")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")

AUC-ROC: 0.9742
Accuracy: 0.9216
Precision: 0.9223
Recall: 0.9216


AUC-ROC: The area under the ROC curve, which plots the true positive rate against the false positive rate across all classification thresholds. It ranges from 0 to 1; 0.5 corresponds to random guessing, and higher values indicate better separation of the two classes.

Accuracy: The proportion of correct predictions among all predictions made.

Precision: The ratio of true positives to the sum of true positives and false positives, that is, how many of the predicted positives are actually positive. Note that the value printed above uses the 'weightedPrecision' metric, which averages the per-class precision of both classes weighted by class frequency, so it is not the precision of the malignant class alone.

Recall: The ratio of true positives to the sum of true positives and false negatives, that is, how many of the actual positives the model identified. The value printed above is likewise 'weightedRecall', the class-frequency-weighted average over both classes.
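One way to build intuition for AUC-ROC is its pairwise reading: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes this directly on a handful of hypothetical scores (they are not taken from the notebook's predictions), and `auc_from_scores` is an illustrative helper rather than the evaluator's implementation.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    A tie counts as half a correct pair.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical model scores, not taken from the notebook's predictions.
pos_scores = [0.9, 0.8, 0.4]   # scores given to truly positive cases
neg_scores = [0.7, 0.3, 0.2]   # scores given to truly negative cases

auc_toy = auc_from_scores(pos_scores, neg_scores)
print(f"toy AUC: {auc_toy:.4f}")
```

Here 8 of the 9 positive/negative pairs are ordered correctly; only the positive scored 0.4 falls below the negative scored 0.7.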

**Computing the metrics manually from the confusion matrix, with malignant (label 1) as the positive class, to understand each one better:**

FALSE POSITIVE

In [53]:
FP = predictions.filter(predictions['label'] == 0).filter(predictions['prediction'] == 1).count()
FP

7

FALSE NEGATIVE

In [54]:
FN = predictions.filter(predictions['label'] == 1).filter(predictions['prediction'] == 0).count()
FN

5

TRUE POSITIVE

In [55]:
TP = predictions.filter(predictions['label'] == 1).filter(predictions['prediction'] == 1).count()
TP

56

TRUE NEGATIVE

In [56]:
TN = predictions.filter(predictions['label'] == 0).filter(predictions['prediction'] == 0).count()
TN

85
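The four cells above each scan the predictions once. The same counts can be obtained in a single pass by grouping on the (label, prediction) pair, which in Spark could be expressed with the `groupBy(...).count()` pattern already used for the class distribution. The sketch below shows the idea in plain Python on a few hypothetical pairs, not on the notebook's actual predictions.

```python
from collections import Counter

# Hypothetical (label, prediction) pairs standing in for collected rows.
pairs = [(1, 1), (0, 0), (0, 1), (1, 0), (1, 1), (0, 0), (0, 0)]

cells = Counter(pairs)  # one pass over the data

tp = cells[(1, 1)]  # actual positive, predicted positive
tn = cells[(0, 0)]  # actual negative, predicted negative
fp = cells[(0, 1)]  # actual negative, predicted positive
fn = cells[(1, 0)]  # actual positive, predicted negative
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```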

In [66]:
recall_ = TP / (TP + FN)
print(f"recall: {recall_:.4f}")

recall: 0.9180


In [67]:
accuracy_ = (TP+TN)/(TP+TN+FP+FN)
print(f"accuracy: {accuracy_:.4f}")

accuracy: 0.9216


In [68]:
precision_ = TP/(TP+FP)
print(f"precision: {precision_:.4f}")

precision: 0.8889


In [69]:
recall_ = TP/(TP+FN)
print(f"recall: {recall_:.4f}")

recall: 0.9180
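The manually computed precision (0.8889) and recall (0.9180) differ from the 0.9223 and 0.9216 reported by MulticlassClassificationEvaluator because the evaluator was asked for 'weightedPrecision' and 'weightedRecall'. These average the per-class values of both classes, weighted by the number of true instances of each class. The sketch below reproduces the evaluator's figures from the four counts above, in plain Python.

```python
# Confusion-matrix counts from the cells above.
TP, FP, FN, TN = 56, 7, 5, 85
total = TP + FP + FN + TN

# Per-class precision and recall, treating each class in turn as "positive".
precision_m = TP / (TP + FP)   # malignant (label 1)
precision_b = TN / (TN + FN)   # benign (label 0)
recall_m = TP / (TP + FN)
recall_b = TN / (TN + FP)

# Weights are the shares of true instances of each class in the test set.
weight_m = (TP + FN) / total
weight_b = (TN + FP) / total

weighted_precision = weight_m * precision_m + weight_b * precision_b
weighted_recall = weight_m * recall_m + weight_b * recall_b

print(f"weighted precision: {weighted_precision:.4f}")
print(f"weighted recall:    {weighted_recall:.4f}")
```

Weighted recall always equals accuracy, since the class weights cancel the per-class denominators and leave (TP + TN) / total. That is why both were reported as 0.9216.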


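An accuracy of 0.92 on the held-out test set suggests that the model classifies breast tumors well from the given features. For a diagnostic task, though, recall on the malignant class matters most: at 0.918, the model missed 5 of the 61 malignant cases in the test set.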