<a href="https://colab.research.google.com/github/lauramanor/long-view/blob/main/breastcancer_KNNandDTC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **[Can AI Diagnose Breast Cancer?](https://www.sciencebuddies.org/science-fair-projects/project_ideas/ArtificialIntelligence_p010/artificial-intelligence/KNN-breast-cancer)**


---




This notebook was developed by Science Buddies [www.sciencebuddies.org](https://www.sciencebuddies.org/) as part of a science project to allow students to explore and learn about artificial intelligence. For personal use, this notebook can be downloaded and modified with attribution. For all other uses, please see our [Terms and Conditions of Fair Use](https://www.sciencebuddies.org/about/terms-and-conditions-of-fair-use).  

Ms. Manor has heavily edited this notebook.

**Troubleshooting tips**
*   Read the written instructions at Science Buddies and the text and comments on this page carefully.


## **1. Importing Libraries**
We will start this science project by importing some necessary libraries. These libraries contain functions that we will be using to load, prepare, and model our data. The comments tell you what each library is for.

In [None]:
# The pandas library allows us to work with data like spreadsheets.
# It helps us organize, clean, and analyze data easily
import pandas as pd

# Set various display options for pandas to show all columns, rows, and remove width limitations.
pd.set_option("display.max_columns", None)    # Display all columns without limit.
pd.set_option("display.max_rows", None)       # Display all rows without limit.
pd.set_option("display.width", None)          # Remove width restrictions for displaying data.
pd.set_option("display.max_colwidth", None)   # Display columns with unlimited width.


# For doing math and working with numbers, this library is like a super calculator.
# It's great for handling big sets of numbers and doing fancy math operations.
import numpy as np

# This function helps us convert categorical data into numerical values.
# It assigns a unique integer to each category, making it suitable for many algorithms that require numerical input
from sklearn.preprocessing import LabelEncoder

# We often want to test how well our "smart" programs work.
# This library helps us split our data into parts: one for teaching the program and another for testing it.
from sklearn.model_selection import train_test_split

# Machine learning algorithms used for classification tasks.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Metrics used to evaluate classification models: accuracy is the proportion of correctly predicted instances,
# while precision, recall, and F1 focus on how well the positive class is predicted
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix


# When we want to draw graphs and charts to show our data visually, we use this.
# It helps us see patterns and trends in the data.
import matplotlib.pyplot as plt

# Imagine you have a lot of information about people, like their height, weight, age, etc.
# Sometimes there's too much information, and we use this to simplify it while keeping the important stuff
from sklearn.decomposition import PCA

print("You have imported all the libraries")

You have imported all the libraries


In [None]:
# Load the CSV file into a pandas DataFrame
data = pd.read_csv("https://www.sciencebuddies.org/ai/colab/breastcancer.csv?t=AQXm10t69CcSJQgDXixMI7XQnL9jwNxi10NnK6MvgKMxYg")

# We can see what the data frame looks like by using the head function
data.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave_points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


## **2. Preprocessing the data**

### 2.1 Drop unnecessary columns
We will first drop features that we think will be uninformative for modeling. In our case, we will be dropping the ID column because it simply serves as a unique identifier for each row and doesn't provide any meaningful information for the analysis or modeling.

In [47]:
# Dropping the ID column
data.drop('id', axis=1, inplace=True)

# Let's check if that column is now gone
data.head()

# Remember, if you get a KeyError, it is likely because you already dropped the column. Try reloading the data and trying again!

KeyError: "['id'] not found in axis"
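
If you would rather not reload the data each time, one option is to ask pandas to skip columns that are already gone. The sketch below is only an illustration: `demo` is a tiny made-up dataframe, not the project data, and it relies on the `errors="ignore"` option of `drop`.

```python
import pandas as pd

# A tiny made-up dataframe standing in for the real data (assumption for illustration)
demo = pd.DataFrame({"id": [101, 102], "radius_mean": [17.99, 20.57]})

# errors="ignore" tells pandas to skip any column that is not there
demo = demo.drop(columns="id", errors="ignore")

# Running the same line a second time no longer raises a KeyError
demo = demo.drop(columns="id", errors="ignore")

print(list(demo.columns))
```

The trade-off is that a misspelled column name would also be skipped silently, so the stricter default can still be the safer choice while you are learning.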

### 2.2 Encode and Separate the Target Variable

We will be trying to diagnose cancer, so let's take a closer look at the diagnosis column.

In [None]:
data['diagnosis'].unique()

array(['M', 'B'], dtype=object)

As we can see, there are only two options for diagnosis. For binary classification problems, we can use label encoding.
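
It helps to know which number each label receives. `LabelEncoder` sorts the labels before numbering them, so 'B' (benign) becomes 0 and 'M' (malignant) becomes 1. Here is a minimal sketch using a few made-up labels; the names `labels`, `encoder`, and `mapping` are only for this illustration.

```python
from sklearn.preprocessing import LabelEncoder

# A few made-up labels in the same format as the diagnosis column
labels = ["M", "B", "B", "M"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ lists the original labels in the order of their integer codes
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)            # {'B': 0, 'M': 1}
print(encoded.tolist())   # [1, 0, 0, 1]
```

This matters later on: by default, precision and recall treat the class coded as 1 as the "positive" class, which here means malignant.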

In [None]:
# Initialize the label encoder
label_encoder = LabelEncoder()

# Fit and transform the target data using label encoding
data['diagnosis'] = label_encoder.fit_transform(data['diagnosis'])

# Let's check what the dataframe looks like now!
data.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave_points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst
0,1,0.521037,0.022658,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,1,0.643144,0.272574,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,1,0.601496,0.39026,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,1,0.21009,0.360839,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,1,0.629893,0.156578,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


Now, let's separate the diagnosis into its own variable (y) and drop it from the original dataframe.

In [48]:
# Assign the 'diagnosis' column (target variable) to the variable 'y'. This will be the value we aim to predict
y = data['diagnosis']

# Drop the column from the dataframe; 'axis=1' indicates we are dropping a column
data = data.drop('diagnosis', axis=1)

### 2.3 Normalization

It is usually important to normalize or scale the features to ensure that all features contribute equally to any distance calculations. Let's make a new data frame, `normalized`, so that we can compare the results from the normalized data to the results from the original data.

Since all of our remaining columns are numerical, we can just normalize all columns!
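
Min-max scaling replaces each value x with (x - min) / (max - min), using the minimum and maximum of that value's own column, so every column ends up between 0 and 1. The small sketch below uses made-up numbers (the `demo` dataframe is not from the real data set) to show two columns on very different scales becoming directly comparable.

```python
import pandas as pd

# Made-up measurements on very different scales (not from the real data set)
demo = pd.DataFrame({"radius": [10.0, 15.0, 20.0],
                     "area": [300.0, 650.0, 1000.0]})

# Min-max scaling: subtract each column's minimum, then divide by its range
scaled = (demo - demo.min()) / (demo.max() - demo.min())

print(scaled)
```

Both columns become 0.0, 0.5, and 1.0, even though the original area values were dozens of times larger than the radius values.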

In [28]:
# Make a copy of the original data set
normalized = data.copy()

# Apply min-max scaling to the entire dataframe
normalized = (normalized - normalized.min()) / (normalized.max() - normalized.min())

In [29]:
# Let's see what our normalization did! Do you remember which function we used to look at the data frame?
normalized.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave_points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst
0,1.0,0.521037,0.022658,0.545989,0.363733,0.593753,0.792037,0.70314,0.731113,0.686364,0.605518,0.356147,0.120469,0.369034,0.273811,0.159296,0.351398,0.135682,0.300625,0.311645,0.183042,0.620776,0.141525,0.66831,0.450698,0.601136,0.619292,0.56861,0.912027,0.598462,0.418864
1,1.0,0.643144,0.272574,0.615783,0.501591,0.28988,0.181768,0.203608,0.348757,0.379798,0.141323,0.156437,0.082589,0.12444,0.12566,0.119387,0.081323,0.04697,0.253836,0.084539,0.09111,0.606901,0.303571,0.539818,0.435214,0.347553,0.154563,0.192971,0.639175,0.23359,0.222878
2,1.0,0.601496,0.39026,0.595743,0.449417,0.514309,0.431017,0.462512,0.635686,0.509596,0.211247,0.229622,0.094303,0.18037,0.162922,0.150831,0.283955,0.096768,0.389847,0.20569,0.127006,0.556386,0.360075,0.508442,0.374508,0.48359,0.385375,0.359744,0.835052,0.403706,0.213433
3,1.0,0.21009,0.360839,0.233501,0.102906,0.811321,0.811361,0.565604,0.522863,0.776263,1.0,0.139091,0.175875,0.126655,0.038155,0.251453,0.543215,0.142955,0.353665,0.728148,0.287205,0.24831,0.385928,0.241347,0.094008,0.915472,0.814012,0.548642,0.88488,1.0,0.773711
4,1.0,0.629893,0.156578,0.630986,0.48929,0.430351,0.347893,0.463918,0.51839,0.378283,0.186816,0.233822,0.093065,0.220563,0.163688,0.332359,0.167918,0.143636,0.357075,0.136179,0.1458,0.519744,0.123934,0.506948,0.341575,0.437364,0.172415,0.319489,0.558419,0.1575,0.142595


### 2.4 Train-Test Split

Now that we have finished preprocessing our data, it is time to split it into training and testing datasets. The training dataset is used to train our models, and the testing dataset is used to evaluate their performance on unseen data.
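
Because we will split two versions of the same rows, it is worth seeing that a fixed `random_state` picks the same rows both times. This is a small sketch on made-up data; `features`, `rescaled`, and `target` are hypothetical names used only here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, plus a rescaled copy of the same rows
features = pd.DataFrame({"size": range(10)})
rescaled = features / 10
target = pd.Series([0, 1] * 5)

a_train, a_test, ya_train, ya_test = train_test_split(
    features, target, test_size=0.2, random_state=42)
b_train, b_test, yb_train, yb_test = train_test_split(
    rescaled, target, test_size=0.2, random_state=42)

# The same rows end up in the test set both times
print(list(a_test.index), list(b_test.index))
print(ya_test.equals(yb_test))
```

Since the same rows are chosen, the two sets of test labels are identical, which is what makes a fair comparison between the two versions possible.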

In [51]:
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.2, random_state=42)
X_train_norm, X_test_norm, y_train_norm, y_test_norm = train_test_split(normalized, y, test_size=0.2, random_state=42)

#### **TODO #1:**
* The shapes should be exactly the same for both versions, though the data may look slightly different. Print out the shapes of the normalized splits below to confirm.

In [56]:
# Print the shapes of the resulting datasets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

# TODO: print the shapes of the normalized splits, similar to how we printed the non-normalized shapes above.

X_train shape: (455, 30)
X_test shape: (114, 30)
y_train shape: (455,)
y_test shape: (114,)


## **3. The KNN Model**

### 3.1 KNN without Normalization

I have completed the code for the non-normalized data below. You will be asked to finish the code for the normalized data and then compare the two types of preprocessing.




In [77]:
# Set the value of 'k', which represents the number of neighbors to consider for each prediction
k = 5

# Create an instance of the KNN classifier
knn = KNeighborsClassifier(n_neighbors=k)

# Fit the KNN classifier to the training data
knn.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = knn.predict(X_test)

# Evaluate the performance of the KNN model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

precision = precision_score(y_test, y_pred)
print("Precision:", precision)

recall = recall_score(y_test, y_pred)
print("Recall:", recall)

Accuracy: 0.956140350877193
Precision: 1.0
Recall: 0.8837209302325582
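
To see where the three scores printed above come from, we can work backwards to the counts behind them. With 114 test rows, an accuracy of about 0.956 means 5 mistakes; a precision of 1.0 means none of them were false alarms; and a recall of about 0.884 then corresponds to 38 of 43 malignant cases found. The sketch below rebuilds lists with those inferred counts (the `_demo` lists are reconstructions, not the real predictions) and recomputes each score.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Counts worked out from the scores printed above (1 = malignant, 0 = benign):
# 38 malignant found, 5 malignant missed, 0 false alarms, 71 benign correct
y_true_demo = [1] * 38 + [1] * 5 + [0] * 71
y_pred_demo = [1] * 38 + [0] * 5 + [0] * 71

# Rows are the true class, columns are the predicted class
print(confusion_matrix(y_true_demo, y_pred_demo))

print("Accuracy:", accuracy_score(y_true_demo, y_pred_demo))    # 109 / 114
print("Precision:", precision_score(y_true_demo, y_pred_demo))  # 38 / (38 + 0)
print("Recall:", recall_score(y_true_demo, y_pred_demo))        # 38 / (38 + 5)
```

Precision asks "of the cases flagged as malignant, how many really were?", while recall asks "of the cases that really were malignant, how many did we catch?".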


### 3.2 KNN with Normalization

#### **TODO #2**:
* Complete the code for the normalized data below

In [78]:
# Set the value of 'k', which represents the number of neighbors to consider for each prediction
k = 5

# Create an instance of the KNN classifier

# Fit the KNN classifier to the training data

# Make predictions on the testing data

# Evaluate the performance of the KNN model


#### **TODO #3**:
* Answer within this text box: Does normalization matter? Which preprocessing method worked better? Be detailed - if there is one metric that is driving your decision, make sure to be explicit about what the metric MEANS in this context.




**DOUBLE CLICK HERE TO WRITE YOUR ANSWER**




### 3.3 Comparing different neighbor sizes using a loop

#### **TODO #4**
* Update the training data based on which model performed better above

In [79]:
# Define a range of values for the number of neighbors
neighbors = np.arange(1, 21)

# Initialize an empty list to store accuracy scores
accuracy_scores = []
precision_scores = []
recall_scores = []

# Iterate over different values of neighbors
for n in neighbors:
    # Create an instance of the KNN classifier with the current number of neighbors
    knn = KNeighborsClassifier(n_neighbors=n)

    # Fit the KNN classifier to the training data
    # TODO: use the correct training data based on your observations above
    knn.fit(___,____)

    # Make predictions on the testing data
    y_pred = knn.predict(_____)

    # Compute the accuracy score and append it to the list
    # Note that y_test does not have to change: y_test and y_test_norm are exactly the same (since we used the same random seed for both splits)
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_scores.append(accuracy)

    precision = precision_score(y_test, y_pred)
    precision_scores.append(precision)

    recall = recall_score(y_test, y_pred)
    recall_scores.append(recall)

NameError: name '____' is not defined

In [None]:
# Plot the accuracy, precision, and recall scores
plt.plot(neighbors, accuracy_scores, label='accuracy')
plt.plot(neighbors, precision_scores, label='precision')
plt.plot(neighbors, recall_scores, label='recall')
plt.xlabel('Number of Neighbors')
plt.ylabel('Score')
plt.title('Comparison of Performance Metrics for Different Numbers of Neighbors')
plt.legend()
plt.show()

#### **TODO #5**:
* Answer within this text box: Which K would you choose for your release? Why? If there is one metric that is driving your decision, make sure to be explicit about what the metric MEANS in this context.



**DOUBLE CLICK HERE TO WRITE YOUR ANSWER**



## **4. The Decision Tree Model**


In [69]:
# Create an instance of the Decision Tree classifier
dtc = DecisionTreeClassifier(random_state=42)

# Fit the Decision Tree classifier to the training data
dtc.fit(X_train,y_train)

# Make predictions on the testing data
y_pred = dtc.predict(X_test)

# Evaluate the performance of the Decision Tree model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

precision = precision_score(y_test, y_pred)
print("Precision:", precision)

recall = recall_score(y_test, y_pred)
print("Recall:", recall)

Accuracy: 0.9473684210526315
Precision: 0.9302325581395349
Recall: 0.9302325581395349


In [70]:
class_names = list(y.unique())
feature_names = list(data.columns)

In [76]:
from sklearn.tree import export_text
r = export_text(dtc, feature_names=feature_names, show_weights=True)
print(r)

|--- concave_points_mean <= 0.05
|   |--- radius_worst <= 16.83
|   |   |--- area_se <= 48.70
|   |   |   |--- smoothness_worst <= 0.18
|   |   |   |   |--- smoothness_se <= 0.00
|   |   |   |   |   |--- texture_worst <= 27.76
|   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |--- texture_worst >  27.76
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- smoothness_se >  0.00
|   |   |   |   |   |--- texture_worst <= 33.35
|   |   |   |   |   |   |--- weights: [237.00, 0.00] class: 0
|   |   |   |   |   |--- texture_worst >  33.35
|   |   |   |   |   |   |--- texture_worst <= 33.56
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- texture_worst >  33.56
|   |   |   |   |   |   |   |--- weights: [14.00, 0.00] class: 0
|   |   |   |--- smoothness_worst >  0.18
|   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |--- area_se >  48.70
|   |   |   |--- concavity_se <= 0.02
|   |   |

Take a look at the tree created by the model. Make some observations and at least one argument based on the results. For example, which features seem to be the most important when it comes to detecting cancer? Why? Do you think all of the branches are needed? Why or why not?

Remember that you can find descriptions of each feature here: https://www.sciencebuddies.org/science-fair-projects/project_ideas/ArtificialIntelligence_p010/artificial-intelligence/KNN-breast-cancer
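
Besides reading the printed tree, a fitted tree also exposes a `feature_importances_` array that scores how much each feature contributed to its splits. The sketch below is a stand-in, not this notebook's model: it assumes scikit-learn's bundled copy of the breast cancer data (`load_breast_cancer`), whose column names use spaces and whose labels are coded the other way around (0 = malignant), and it trains on every row, so the numbers will not match the tree above exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled copy of the data stands in for the CSV above
bundled = load_breast_cancer(as_frame=True)
X_demo, y_demo = bundled.data, bundled.target

tree_demo = DecisionTreeClassifier(random_state=42)
tree_demo.fit(X_demo, y_demo)

# Pair each feature with its importance and show the five largest
ranked = sorted(zip(X_demo.columns, tree_demo.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```

The importances add up to 1, and a feature the tree never split on gets a score of 0. The same attribute is available on `dtc` if you want to rank the features of the tree trained in this notebook.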

#### **TODO #6**:
* Answer within this text box: Make at least one argument based off of the results of the decision tree



**DOUBLE CLICK HERE TO WRITE YOUR ANSWER**



## Challenges
Once you have completed all 6 TODOs in this notebook, you are invited to try one (or all!) of these challenges. Please leave the notebook above as-is, and add any new code/notes below.


### Mild:
Take a look at the [documentation for the decision tree classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier). Do some experiments with the parameters (e.g. criterion, max_depth, etc). Then, write a reflection on which parameters seem to make a difference in the results and why.


### Medium:
Use the Decision Tree Classifier to analyze the Titanic data! See your old Colab notebook for the link to the data and make sure to preprocess the data properly. (But please put your code here so I don't have to go and search for it.)


### Spicy:
1. Go to the sklearn website and find a classifier that we have not yet worked with! Implement it! Try to do a bit of research and figure out how it works.


2. Go to the sklearn website and find a new dataset to run either the KNN or Decision Tree on! https://scikit-learn.org/stable/user_guide.html Make sure to do any preprocessing, etc.!

In [None]:
# Feel free to add as many code blocks and text blocks as you need for your experiments!