# Final Project

Riju Pant, Andrew ZiYu Wang, Alisa Sumwalt

## Getting Ready

### Imports and Getting the Dataset

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold, cross_validate
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, log_loss
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

In [None]:
df = pd.read_csv('data.csv')

# from https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data?resource=download
# originally from https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic

The dataset we use for this project concerns breast cancer. Breast cancer is a leading cause of death among women worldwide, so being able to determine whether a tumour is cancerous is crucial to saving lives. In this notebook, we aim to classify a tumour as benign (harmless) or malignant (able to spread through the body). We got the dataset from Kaggle; it provides 32 columns of information for 569 observations. Using this dataset, we want to build classification models that predict the diagnosis. This is useful in a medical context because predicting the severity of the cancer helps a doctor and patient understand the situation and work towards a solution. Having one or more working models that give a good diagnosis matters because the key problem in breast cancer detection is how to classify tumours into the malignant and benign categories. Going further, we will explore the dataset and clean it if required, build classification models, and fine-tune them where necessary.

## Exploring the Dataset

In [None]:
print("Information about the Dataset:")
print()

df.info()

row, column = df.shape
print()

print(f"Rows = {row}, Columns = {column}")

Something we appreciate about this dataset is that almost everything is already a float. In datasets we worked with previously, some values that should have been numeric were stored as strings, which forced us to create new columns with numerical values. Let's learn more about the dataset.

In [None]:
df.head(10)

In [None]:
df.tail(10)

There seems to be an extra column, "Unnamed: 32", that was probably added accidentally. We decided to drop it.

In [None]:
df = df.drop("Unnamed: 32",axis=1)

### Checking for Null or NA

In [None]:
df.isnull().any()

In [None]:
df.isna().any()

The fact that the dataset had no missing entries surprised us at first, and we were grateful for the luck. When we read the description of the dataset on Kaggle, however, we saw that it states "Missing attribute values: none". Nonetheless, we need to learn more about the dataset.

In [None]:
df.describe()

There are clearly many features in this dataset, so training the model could be challenging, especially if only a select few features really help the model learn. We are curious to see more about the id column and about something that was not shown in this table: the diagnosis column.

In [None]:
df["id"].value_counts()

After looking at the df.head and df.tail of the dataset, and the value counts of the id column, we realized that the id column is just a patient's ID number, and a feature we do not need for training our model. Therefore, we decided to drop the column.

In [None]:
df = df.drop("id", axis=1)

In [None]:
df["diagnosis"].value_counts()

When we saw that there are exactly two classes, we recognized this as a binary classification problem and thought about naming our classes 0 and 1. We decided to do just that: malignant becomes 1 and benign becomes 0.

In [None]:
df["diagnosis"] = df["diagnosis"].map({'M':1,'B':0})

After making sense of some of the columns, we tried to understand what the other columns meant. Thankfully, this was mentioned in the description of the dataset on Kaggle.
* id = ID number
* diagnosis = Diagnosis (M = malignant, B = benign)

For the remaining columns, the description states: "Ten real-valued features are computed for each cell nucleus"
They are:
* radius (mean of distances from center to points on the perimeter)
* texture (standard deviation of gray-scale values)
* perimeter
* area
* smoothness (local variation in radius lengths)
* compactness (perimeter^2 / area - 1.0)
* concavity (severity of concave portions of the contour)
* concave points (number of concave portions of the contour)
* symmetry
* fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance (when we include id and call it field 1), field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.

Upon further research and some guessing, we understood that:
* radius is the size of the cell nucleus. Malignant cells tend to have larger radii, so radius could be an important feature for us
* texture measures the variation in the pixel intensity of a nucleus. This could be an important feature because it can help us differentiate between a healthy cell and a malignant one
* perimeter is the total length of the boundary of the cell nucleus. It is important for reasons similar to radius; whether we need both radius and perimeter is a question we should tackle later
* area is the area of the cell nucleus. It is also important for reasons similar to radius and perimeter; we need to decide later whether to keep it as a feature or remove it
* smoothness describes how smooth or rough the cell nucleus is. Upon researching, we read that a breast cancer lump "... often has angular, irregular, asymmetrical edges, as opposed to being smooth..." This would definitely be an important feature for our model (https://www.massgeneralbrigham.org/en/about/newsroom/articles/what-does-a-breast-cancer-lump-feel-like)
* compactness describes how much the shape of the cell nucleus deviates from a perfect circle. This would be a very useful feature for us, especially since breast cancer cells, as mentioned above, are often rough, meaning they would not be a perfect circle. *With the formula listed above, a perfect circle gives the smallest possible compactness, 4π - 1 (about 11.57); the values stored in the dataset appear to be on a different scale*
* concavity describes how severe the dents in the cell nucleus are. This is also an important feature as breast cancer cells are not exactly smooth
* concave points is the number of dents in the cell nucleus
* symmetry describes how symmetric the cell nucleus is. Low symmetry would indicate irregularity and could point to a malignant cell. This could be an important feature
* fractal dimension is an advanced method of measuring how rough the perimeter of the cell nucleus is. This is an important feature that could give us reason to leave out the perimeter feature, as a higher fractal dimension indicates an irregular boundary, which is often a feature of breast cancer cells
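
To make the compactness formula from the dataset description concrete, here is a small worked example. It uses only the listed formula, perimeter^2 / area - 1.0; the helper name `compactness` and the two shapes are our own choices for illustration, and the values stored in the dataset may be scaled differently.

```python
import math


def compactness(perimeter, area):
    """Compactness as listed in the dataset description: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0


# A circle of radius 1: perimeter 2*pi*r, area pi*r^2, so the result is 4*pi - 1.
r = 1.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)

# A unit square: perimeter 4, area 1, so the result is 16 - 1.
side = 1.0
square = compactness(4 * side, side ** 2)

print(f"circle: {circle:.3f}")
print(f"square: {square:.3f}")
```

The circle gives the smaller value, and any less round shape with the same area has a longer perimeter and therefore a higher compactness.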


Let's look at the dataset as a whole once again:

In [None]:
df.describe()

In [None]:
df.corr()

It is hard to make sense of this correlation matrix as we have too many columns. Instead, we decided to make a heatmap:

In [None]:
corrMatrix = df.corr()
plt.figure(figsize=(16, 9))
sns.heatmap(
    corrMatrix,
    annot=True,
    cmap='coolwarm',
    fmt=".2f",
    cbar = True
)
plt.show()

This heatmap helps us visualize our correlations, but we realized it is not a good heatmap: it includes the diagnosis column and, with 31 columns, it is too crowded to read. Since there are 3 sets of 10 real-valued features for each cell nucleus, we decided to split the features into 3 sets.

In [None]:
ftMean = df.columns[1:11]
ftMean = list(ftMean)
ftSE = df.columns[11:21]
ftSE = list(ftSE)
ftWorst = df.columns[21:31]
ftWorst = list(ftWorst)
# print(ftMean)
# print(ftSE)
# print(ftWorst)

Now, we can make a heatmap for the three sets:

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(16, 9))

sns.set_context("notebook", font_scale=0.7)

corrFtMean = df[ftMean].corr()
sns.heatmap(
    corrFtMean,
    annot=True,
    cmap='coolwarm',
    fmt=".2f",
    cbar = True,
    annot_kws={'size': 6},
    ax=axes[0], 
)
axes[0].set_title('Heatmap for Feature Mean')

corrFtSE = df[ftSE].corr()
sns.heatmap(
    corrFtSE,
    annot=True,
    cmap='coolwarm',
    fmt=".2f",
    cbar = True,
    annot_kws={'size': 6},
    ax=axes[1]
)
axes[1].set_title('Heatmap for Feature Standard Error')

corrFtWorst = df[ftWorst].corr()
sns.heatmap(
    corrFtWorst,
    annot=True,
    cmap='coolwarm',
    fmt=".2f",
    cbar = True,
    annot_kws={'size': 6},
    ax=axes[2]
)
axes[2].set_title('Heatmap for Feature Worst')



# plt.tight_layout(rect=[0, 0, 1, 0])
plt.subplots_adjust(wspace=0.5)
# plt.tight_layout(pad=1)
plt.show()

From the heatmaps, we can see that
- Radius, Area, and Perimeter are highly correlated, which makes sense because area and perimeter both follow from the radius
- Compactness, Concavity, and Concave Points are highly correlated, which makes sense because all three indicate how far the shape of the cell is from a perfect circle

Training with all 10 features in a set could take time, and some of them are highly correlated with each other, so we can reduce the number of features. Between Radius, Area, and Perimeter, we can choose one. Between Compactness, Concavity, and Concave Points, we can also choose one.
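
Reading highly correlated pairs off a heatmap by eye is error-prone, so the sketch below lists them programmatically. The helper `highly_correlated_pairs`, the 0.9 threshold, and the synthetic `demo` frame are assumptions made for illustration; with the notebook's data one could pass `df[ftMean]` instead.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(frame, threshold=0.9):
    """Return (feature_a, feature_b, |correlation|) for every pair above the threshold."""
    corr = frame.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so each pair appears once
            value = float(corr.iloc[i, j])
            if value > threshold:
                pairs.append((cols[i], cols[j], round(value, 3)))
    return pairs


# Synthetic stand-in data: perimeter and area are computed from the radius,
# while texture is drawn independently of it.
rng = np.random.default_rng(42)
radius = rng.uniform(6.0, 28.0, size=200)
demo = pd.DataFrame({
    "radius_mean": radius,
    "perimeter_mean": 2 * np.pi * radius,
    "area_mean": np.pi * radius ** 2,
    "texture_mean": rng.normal(19.0, 4.0, size=200),
})

pairs = highly_correlated_pairs(demo)
for a, b, value in pairs:
    print(f"{a} ~ {b}: {value}")
```

Only the radius, perimeter, and area pairs are reported here, which mirrors the first group we identified in the heatmaps.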

Now we will start working on the models:

## Logistic Regression

In [None]:
logisticResults = []

In [None]:
def logistic(X, y):
    x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
    log = LogisticRegression(max_iter=1000, random_state=42)
    logSettings = log
    log.fit(x_train,y_train)

    scoringMetrics = ['accuracy', 'f1', 'neg_log_loss']
    cv = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_validate(logSettings, X, y, cv=cv, scoring=scoringMetrics)

    predictions = log.predict(x_test)
    # print(predictions)
    logisticResults.append(
        [predictions, classification_report(y_test, predictions), confusion_matrix(y_test, predictions), log.predict_proba(x_test)]
    )

    print(f"Average CV Accuracy: {scores['test_accuracy'].mean():.4f}")
    print(f"Accuracy Standard Deviation: {scores['test_accuracy'].std():.4f}")

    print(f"Average CV F1: {scores['test_f1'].mean():.4f}")
    print(f"F1 Standard Deviation: {scores['test_f1'].std():.4f}")

    # 'neg_log_loss' is the negated log-loss, so flip the sign to report the loss itself
    print(f"Average CV Log-Loss: {-scores['test_neg_log_loss'].mean():.4f}")
    print(f"Log-Loss Standard Deviation: {scores['test_neg_log_loss'].std():.4f}")

    print()

In [None]:
def printLogisticResults(X, y, index):
    print("Results from our Cross Validate Function")

    logistic(X, y)

    print("-"*100)
    print()
    print("Results from our .fit Function")

    print(logisticResults[index][0])
    print(logisticResults[index][1])
    print("[[TN, FP]\n [FN, TP]]\n")
    print(logisticResults[index][2])
    # print(f"Confidence for Each Sample [Percentage Confidence for Class 0, Percentage Confidence for Class 1]: {logisticResults[0][3]}")

    sns.heatmap(logisticResults[index][2], annot=True, fmt='d', cmap='Blues')
    plt.xlabel('Predicted')
    plt.ylabel('Observed')
    plt.show()

In [None]:
X = df[ftMean]
y = df['diagnosis']

printLogisticResults(X, y, 0)

In [None]:
X = df[ftSE]
y = df['diagnosis']

printLogisticResults(X, y, 1)

We split the features into three sets because the mean, standard error, and worst value of each feature were computed for each image. This lets us study each set separately and see which one fits the model best.

Accuracy for the standard error set was 91%, which is lower than for the mean set (94%). The log-loss for the standard error set was also higher than for the mean set, meaning its predicted probabilities lie further from the actual labels.
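
As a reminder of what log-loss measures, the sketch below computes it by hand and checks the result against `sklearn.metrics.log_loss`. The labels and probabilities are made up for illustration and are not taken from our models.

```python
import numpy as np
from sklearn.metrics import log_loss

# Illustrative labels and predicted probabilities of class 1 (not from the notebook).
y_true = np.array([1, 0, 1, 0])
confident = np.array([0.9, 0.1, 0.8, 0.2])    # probabilities close to the true labels
hesitant = np.array([0.6, 0.4, 0.55, 0.45])   # same predicted classes, less certainty


def binary_log_loss(y, p):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


loss_confident = binary_log_loss(y_true, confident)
loss_hesitant = binary_log_loss(y_true, hesitant)

print(f"confident: {loss_confident:.4f} (sklearn: {log_loss(y_true, confident):.4f})")
print(f"hesitant:  {loss_hesitant:.4f} (sklearn: {log_loss(y_true, hesitant):.4f})")
```

Both sets of predictions would score 100% accuracy at a 0.5 cut-off, yet the hesitant one has the higher log-loss, which is why we track log-loss alongside accuracy.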

In [None]:
X = df[ftWorst]
y = df['diagnosis']

printLogisticResults(X, y, 2)

It is intuitive to keep only one of perimeter, area, and radius since they are all mathematically related. As you can see, the log-loss went down after removing area and perimeter.

### Conclusion for Logistic Regression
For this project, we want to ensure that all malignant cases are marked as malignant. The worst case would be a malignant cell marked as benign. Therefore, our priority among the model outputs is the recall for class '1'. The two models with the best recall for '1' are the mean and worst subsets, with a recall of 93%. This means that of all the truly malignant cases, 93% are marked as malignant.

We also want to focus on the false negatives in the confusion matrices, since those are the cases where the cell was malignant but marked as benign. The subset with the best false negative rate is worst, with 3 false negatives out of 70. For mean it was 3 out of 66, and for standard error it was 5 out of 69.
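
The relationship between the confusion matrix, recall, and false negatives can be sketched as follows. The labels below are a tiny made-up example, not output from our models.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels for illustration: 1 = malignant, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() flattens [[TN, FP], [FN, TP]] row by row.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)                 # share of malignant cases that were caught
false_negative_rate = fn / (fn + tp)    # share of malignant cases that were missed

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"recall={recall:.2f}, false negative rate={false_negative_rate:.2f}")
print(f"sklearn recall={recall_score(y_true, y_pred):.2f}")
```

Recall and the false negative rate always sum to 1, so maximising recall for class 1 is the same as minimising missed malignant cases.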

## Support Vector Machine

We will be splitting the dataset into three major categories: mean, standard error, and worst. This way, we can categorize the data and make the SVMs more accurate and efficient.

In [None]:
svmResults = []

Let's build the SVM. We will use cross-validation with 5 folds.

In [None]:
def svm(X, y, C, kernel):
    x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

    # Use a distinct local name so the model does not shadow this function's own name
    model = SVC(kernel=kernel, C=C)
    svmSettings = model
    model.fit(x_train,y_train)

    predictions = model.predict(x_test)
    # print(predictions)
    svmResults.append(
        [predictions, classification_report(y_test, predictions), confusion_matrix(y_test, predictions)]
    )

    scoringMetrics = ['accuracy', 'f1']
    cv = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_validate(svmSettings, X, y, cv=cv, scoring=scoringMetrics)
    print(f"Average CV Accuracy: {scores['test_accuracy'].mean():.4f}")
    print(f"Accuracy Standard Deviation: {scores['test_accuracy'].std():.4f}")

    print(f"Average CV F1: {scores['test_f1'].mean():.4f}")
    print(f"F1 Standard Deviation: {scores['test_f1'].std():.4f}")
    print()

In [None]:
def printSVMResults(X, y, index, C, kernel):
    print("Results from our Cross Validate Function")

    svm(X, y, C=C, kernel=kernel)

    print("-"*100)
    print()
    print("Results from our .fit Function")

    print(svmResults[index][0])
    print(svmResults[index][1])
    print("[[TN, FP]\n [FN, TP]]\n")
    print(svmResults[index][2])

    sns.heatmap(svmResults[index][2], annot=True, fmt='d', cmap='Blues')
    plt.xlabel('Predicted')
    plt.ylabel('Observed')
    plt.show()

### Let's use SVM and train on the data with C = 1.0

Let's use it to predict.


In [None]:
X = df[ftMean]
y = df['diagnosis']

printSVMResults(X, y, 0, C=1, kernel='linear')

In [None]:
X = df[ftSE]
y = df['diagnosis']

printSVMResults(X, y, 1, C=1, kernel='linear')

In [None]:
X = df[ftWorst]
y = df['diagnosis']

printSVMResults(X, y, 2, C=1, kernel='linear')

As we can see, the accuracy score is pretty high for all three categories, especially the "worst" category. The F1 scores are also good. However, despite having better accuracy and a better average F1 score, the ftWorst model has more false negatives than false positives. This is not good in the context of predicting breast cancer, because marking a cancerous cell as non-cancerous is more costly than marking a non-cancerous cell as cancerous. For that reason, we think it would be better to use the ftMean SVM instead of the ftWorst one, as it has a better balance of false positives to false negatives even though its overall accuracy is worse. Let's test this with a larger C, which penalizes misclassified training points more heavily and therefore gives a stricter (harder) margin.

In [None]:
X = df[ftMean]
y = df['diagnosis']

printSVMResults(X, y, 3, C=5, kernel='linear')

In [None]:
X = df[ftSE]
y = df['diagnosis']

printSVMResults(X, y, 4, C=5, kernel='linear')

In [None]:
X = df[ftWorst]
y = df['diagnosis']

printSVMResults(X, y, 5, C=5, kernel='linear')

We can see that now that C is larger (a harder margin), the overall accuracy of some of these models has gone down, but the linear SVM with C = 5.0 on ftMean has a balanced proportion of false negatives and false positives and is more accurate than its C = 1.0 version. So far, this is the best one.
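
To illustrate how C relates to the margin of a linear SVM, here is a minimal sketch on synthetic, overlapping 2-D data. The data, the C values, and the variable names are assumptions for illustration; the margin width of a linear SVM is 2 / ||w||, where w is the fitted weight vector.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping synthetic clusters, so no hyperplane separates them perfectly.
X_demo, y_demo = make_blobs(
    n_samples=200, centers=[[0, 0], [3, 3]], cluster_std=1.5, random_state=42
)

margin_widths = {}
for C in (0.01, 1, 100):
    model = SVC(kernel="linear", C=C).fit(X_demo, y_demo)
    # The distance between the two margin boundaries is 2 / ||w||.
    margin_widths[C] = 2.0 / np.linalg.norm(model.coef_)
    print(f"C={C}: margin width = {margin_widths[C]:.3f}, "
          f"support vectors = {model.n_support_.sum()}")
```

A small C tolerates more margin violations and yields a wide (soft) margin, while a large C narrows it, so moving from C = 1 to C = 5 makes the margin harder rather than softer.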

### Let's try polynomial

We will try a polynomial kernel instead of the linear one to see whether the data can be better separated by a curved polynomial boundary rather than a straight line. This could potentially give better accuracy and separation. One thing to watch out for is overfitting, as higher degrees can fit the training data very closely.

In [None]:
X = df[ftMean]
y = df['diagnosis']

printSVMResults(X, y, 6, C=1, kernel='poly')

In [None]:
X = df[ftSE]
y = df['diagnosis']

printSVMResults(X, y, 7, C= 1, kernel='poly')

In [None]:
X = df[ftWorst]
y = df['diagnosis']

printSVMResults(X, y, 8, C=1, kernel='poly')

In [None]:
X = df[ftMean]
y = df['diagnosis']

printSVMResults(X, y, 9, C=5, kernel='poly')

In [None]:
X = df[ftSE]
y = df['diagnosis']

printSVMResults(X, y, 10, C=5, kernel='poly')

In [None]:
X = df[ftWorst]
y = df['diagnosis']

printSVMResults(X, y, 11, C=5, kernel='poly')

### Let's try radial

The radial basis function (RBF) kernel is also non-linear, but it differs from the polynomial kernel in that it implicitly maps to an infinite-dimensional space and is more versatile. However, since the linear kernel already gets very good accuracy, the data is most likely close to linearly separable, so a non-linear kernel might overfit or be unnecessary. We will try the RBF kernel regardless to see if there is any improvement.
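
One caveat with distance-based kernels such as RBF is that they are sensitive to feature scale, and our features range from fractions (smoothness) to hundreds (area). The sketch below is not part of our pipeline; it assumes scikit-learn's bundled copy of the same UCI data (`load_breast_cancer`, which uses all 30 features and encodes malignant as 0 and benign as 1) and compares an RBF SVM with and without standardisation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Bundled copy of the Wisconsin diagnostic data (note: 0 = malignant, 1 = benign here).
X_all, y_all = load_breast_cancer(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

unscaled = SVC(kernel="rbf", C=1)
# The pipeline fits the scaler on each training fold only, which avoids leakage.
scaled = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1))

unscaled_acc = cross_val_score(unscaled, X_all, y_all, cv=cv, scoring="accuracy").mean()
scaled_acc = cross_val_score(scaled, X_all, y_all, cv=cv, scoring="accuracy").mean()

print(f"RBF without scaling: {unscaled_acc:.4f}")
print(f"RBF with scaling:    {scaled_acc:.4f}")
```

If scaling changes the scores noticeably, the comparison between linear and non-linear kernels would be worth repeating on scaled features.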

In [None]:
X = df[ftMean]
y = df['diagnosis']

printSVMResults(X, y, 12, C=1, kernel='rbf')

In [None]:
X = df[ftSE]
y = df['diagnosis']

printSVMResults(X, y, 13, C=1, kernel='rbf')

In [None]:
X = df[ftWorst]
y = df['diagnosis']

printSVMResults(X, y, 14, C=1, kernel='rbf')

In [None]:
X = df[ftMean]
y = df['diagnosis']

printSVMResults(X, y, 15, C=5, kernel='rbf')

In [None]:
X = df[ftSE]
y = df['diagnosis']

printSVMResults(X, y, 16, C=5, kernel='rbf')

In [None]:
X = df[ftWorst]
y = df['diagnosis']

printSVMResults(X, y, 17, C=5, kernel='rbf')

### Conclusion for SVM

Overall, the polynomial SVM performed similarly to the radial SVM, but the linear one performed the best and is the least complex. Even if the non-linear SVMs had given us a small increase in accuracy, it might still be better to use the linear kernel, as it is simpler and sufficient, and higher complexity means more risk of overfitting. The linear kernel probably does well because the data can already be separated linearly and does not require a higher-dimensional mapping. The best-performing model is the linear SVM with C = 1.0 on the ftWorst set, as it is the most accurate overall and has a good recall for both cancerous and non-cancerous cells.

## Random Forest

After understanding the dataset, we thought good features for the Random Forest would be: radius, texture, smoothness, compactness, symmetry, and fractal dimension. As mentioned previously, some of the features are highly correlated with each other, meaning we can save training time by keeping only one of the highly correlated features.

First, we split the dataset into three parts, each according to their category:

In [None]:
predictorsFtMean = ['radius_mean', 'texture_mean', 'smoothness_mean', 'compactness_mean', 'symmetry_mean', 'fractal_dimension_mean']
predictorsFtSE = ['radius_se', 'texture_se', 'smoothness_se', 'compactness_se', 'symmetry_se', 'fractal_dimension_se']
predictorsFtWorst = ['radius_worst', 'texture_worst', 'smoothness_worst', 'compactness_worst', 'symmetry_worst', 'fractal_dimension_worst']

Now, we set up the Random Forest that we will use to build our models:

In [None]:
modelResults = []

In [None]:
def randomForest(X, y, predictors):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    RF = RandomForestClassifier(
        n_estimators=500,
        random_state=42,
        max_features=len(predictors),
        min_samples_leaf=5
    )

    RFSettings = RF

    RF.fit(X_train, y_train)

    # predictionsTraining = RF.predict(X_train)

    print("Results from our Cross Validate Function")
    
    scoringMetrics = ['accuracy', 'f1']
    cv = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_validate(RFSettings, X, y, cv=cv, scoring=scoringMetrics)
    # print(f"Scores for each fold: {scores}")
    print(f"Average CV Accuracy: {scores['test_accuracy'].mean():.4f}")
    print(f"Accuracy Standard Deviation: {scores['test_accuracy'].std():.4f}")

    print(f"Average CV F1: {scores['test_f1'].mean():.4f}")
    print(f"F1 Standard Deviation: {scores['test_f1'].std():.4f}")

    print()
    print("-"*100)
    print()
    print("Results from our .fit Function")
    
    predictions = RF.predict(X_test)
    print(classification_report(y_test, predictions))
    print("[[TN, FP]\n [FN, TP]]\n")
    print(confusion_matrix(y_test, predictions))

    sns.heatmap(confusion_matrix(y_test, predictions), annot=True, fmt='d', cmap='Blues')
    plt.xlabel('Predicted')
    plt.ylabel('Observed')
    plt.show()

    modelResults.append([
        [f"Average CV Accuracy: {scores['test_accuracy'].mean():.4f}"],
        [f"Accuracy Standard Deviation: {scores['test_accuracy'].std():.4f}"],
        [f"Average CV F1: {scores['test_f1'].mean():.4f}"],
        [f"F1 Standard Deviation: {scores['test_f1'].std():.4f}"]
    ])

    return pd.Series(RF.feature_importances_, index=predictors).sort_values(ascending=False)

### Working with the Mean set of features

In [None]:
importantFeaturesFtMean = randomForest(df[predictorsFtMean], df['diagnosis'], predictorsFtMean)

In [None]:
print(importantFeaturesFtMean)

From the results above, using cross-validation, we see that our average accuracy over 5 folds is 92.09%, with a standard deviation of 2.4%. The average F1 score is 89.12%, with a standard deviation of 3.2%. The model is both accurate and reliable: the average scores indicate strong predictive performance, and the low standard deviations indicate good reproducibility across different subsets of the dataset.

A 92.09% average accuracy means that the model correctly identified whether a cell nucleus is malignant or benign 92.09% of the time across the 5 validation folds. The low standard deviation indicates that this was not luck and that the model's performance was consistent and stable, which suggests that the model will perform similarly on new, unseen data. An 89.12% F1 score indicates that the model balances precision and recall well, keeping both false positives and false negatives low. We can say that the model generalizes well and should be able to distinguish malignant from benign cells in unseen data.

Looking at the feature importances, the most important feature among ['radius_mean', 'texture_mean', 'smoothness_mean', 'compactness_mean', 'symmetry_mean', 'fractal_dimension_mean'] is clearly radius_mean. An advantage of Random Forests is that they tell us which features were important for training and predicting, which is why we wanted to use them as one of our classification methods. Of the 6 features used to train this model, radius_mean had an overwhelming importance of 69.4%. This indicates that the radius of the cell nucleus tells us a lot about whether it is cancerous or not.

Moving further, we want to see how the Random Forest works for the other 2 sets, standard error and worst. Before that, however, we wanted to see whether the features that we removed would also have high importance. First, we decided to train another Random Forest model without omitting any features from the mean set.

In [None]:
importantFeaturesFtMeanAll = randomForest(df[ftMean], df['diagnosis'], ftMean)

In [None]:
print(importantFeaturesFtMeanAll)

We were surprised by the results. The average accuracy and the average F1 score both went up, and not by a small amount: by 2% and 3% respectively, while the standard deviation more than halved. The model works much better, which makes the points we made for the previous Random Forest even stronger here. We consider this model the better one because it took almost exactly as long to train and predict as the previous model. One point that surprised us was that concave points_mean showed the most importance, at almost 79%. This was shocking, as we had removed this feature because we thought Compactness, Concavity, and Concave Points were all strongly correlated with each other and we would be fine using only one of them. Now, we want to see how the top 6 features from this importance list perform in their own model, comparing it with the previous model that also used 6 features and with this model that uses all features.
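
One possible explanation for this surprise is that importances depend on which correlated features are present: when the forest can pick among all features at every split (as with our `max_features=len(predictors)` setting), the strongest member of a correlated group tends to absorb most of the importance. The sketch below shows this on synthetic data; the feature names and noise levels are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 400
signal = rng.normal(size=n)
demo = pd.DataFrame({
    "signal": signal,                                  # determines the label exactly
    "near_copy": signal + 0.5 * rng.normal(size=n),    # strongly correlated with signal
    "noise": rng.normal(size=n),                       # unrelated to the label
})
labels = (signal > 0).astype(int)

# max_features=None lets every split choose among all features.
forest = RandomForestClassifier(
    n_estimators=100, max_features=None, min_samples_leaf=5, random_state=42
)
forest.fit(demo, labels)
importances = pd.Series(forest.feature_importances_, index=demo.columns)
print(importances.sort_values(ascending=False))

# The correlated feature is still informative when used on its own.
solo_accuracy = cross_val_score(
    RandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=42),
    demo[["near_copy"]], labels, cv=5, scoring="accuracy",
).mean()
print(f"Accuracy using near_copy alone: {solo_accuracy:.3f}")
```

Here `near_copy` gets almost no importance next to `signal`, yet it predicts well alone. By the same logic, choosing one feature from a correlated group by hand can discard the strongest one, which may be what happened with concave points.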

In [None]:
ftMeanImportant = list(importantFeaturesFtMeanAll.index[0:6])

In [None]:
importantFeaturesFtMeanMost = randomForest(df[ftMeanImportant], df['diagnosis'], ftMeanImportant)

In [None]:
print(importantFeaturesFtMeanMost)

We have gone in the right direction: after reducing the number of features back down to 6, we maintained both our accuracy and F1 scores while reducing their standard deviations. This means that these 6 features are key to determining whether a cell is malignant or benign. For the other two sets, we should start our model with the corresponding features.

In [None]:
predictorsFtSE = list(importantFeaturesFtMeanMost.index.str.replace("_mean", "_se"))
predictorsFtWorst = list(importantFeaturesFtMeanMost.index.str.replace("_mean", "_worst"))

### Working with the Standard Error set of features

In [None]:
importantFeaturesFtSE = randomForest(df[predictorsFtSE], df['diagnosis'], predictorsFtSE)

In [None]:
print(importantFeaturesFtSE)

The results were slightly underwhelming, though the model is not bad. In fact, on its own it is pretty good, but compared with our previous results it underperforms. This might be because we are assuming that the best predictors from the mean features also apply to the other sets. We should make a model with all the standard error features.

In [None]:
importantFeaturesFtSEAll = randomForest(df[ftSE], df['diagnosis'], ftSE)

In [None]:
print(importantFeaturesFtSEAll)

The results are disappointing, as our accuracy and F1 score dropped after using all the features. This may be because the additional features have little importance for predicting malignant or benign and mostly introduce noise, while the model still tries to find relationships among them. Adding features can also overcomplicate the model, which may be why it underperforms. Our steps are consistent with the Curse of Dimensionality: as we increased the number of features, the data became more spread out, making it harder for the model to find meaningful patterns. Our next step will be to see how the model behaves with the top 6 important features that we have seen here.

In [None]:
ftSEImportant = list(importantFeaturesFtSEAll.index[0:6])

In [None]:
importantFeaturesFtSEMost = randomForest(df[ftSEImportant], df['diagnosis'], ftSEImportant)

In [None]:
print(importantFeaturesFtSEMost)

Overall, after seeing the results of the three models using the standard error features, it is clear that the mean features are much better at making predictions. We could tune the hyperparameters of the Random Forest, but doing so would make the model more complex and would not be worth the training cost. Let's continue with the worst set.

### Working with the Worst set of features

In [None]:
importantFeaturesFtWorst = randomForest(df[predictorsFtWorst], df['diagnosis'], predictorsFtWorst)

In [None]:
print(importantFeaturesFtWorst)

Considering how the standard error model with the most important features from the mean set went, we were pleasantly surprised by the results, immediately seeing a 95.8% accuracy and a 94.4% F1 score. We welcome these values with open arms, especially since this was the first model with the worst features. We hypothesize that if we use all the features, the Curse of Dimensionality will affect the results of the next model. Nonetheless, we will see how it goes.

In [None]:
importantFeaturesFtWorstAll = randomForest(df[ftWorst], df['diagnosis'], ftWorst)

In [None]:
print(importantFeaturesFtWorstAll)

We were surprised to see that the Curse of Dimensionality did not take effect; rather, the accuracy increased, though ever so slightly. This was impressive because it shows that the worst features can also produce good models and good predictions. Something else that surprised us was that the top 6 important features for the worst set are the same as the top 6 for the mean set, which suggests that these 6 are strong predictors for distinguishing a malignant cell from a benign one. Although the 4 additional features technically improve the accuracy, and the training time is basically the same (as mentioned when building models for the mean set), something we realized later on is that our sample size is small. With a larger sample, the gap in training time between 6 and 10 features would only widen. The difference between the two worst-feature models hardly justifies saying that the 10-feature model is worth using. We can say that the 6-feature model performs about as well and is less expensive to train. To be consistent with how we did the mean and standard error models, we will once again build a model with the top 6 important features we have found.

In [None]:
ftWorstImportant = list(importantFeaturesFtWorstAll.index[0:6])

In [None]:
importantFeaturesFtWorstMost = randomForest(df[ftWorstImportant], df['diagnosis'], ftWorstImportant)

In [None]:
print(importantFeaturesFtWorstMost)

Funnily, when we trained our model with the same 6 features we used for the first one, we got slightly different results; this time, they matched the 10-feature model. This is interesting because it shows that the four extra features in the 10-feature model were of no significance, and removing them had no effect on the predictions. A possible reason why the results differ from the first Random Forest, even though we set the random state to 42, is that the predictors are supplied in a different order, which changes the order in which the Random Forest performs its operations. Nonetheless, the feature importances were very similar both times.

### Conclusion from Random Forest

In [None]:
print(modelResults[2])
print(importantFeaturesFtMeanMost)
print("-"*50)
print(modelResults[8])
print(importantFeaturesFtWorstMost)

Looking at the two models that yielded the best results, it seems that for both the mean and worst sets the same categories of features were the most important: concave points, texture, area, perimeter, radius, and concavity. This makes sense: the number of dents a cell nucleus has is a good indicator of whether it is healthy, and the more dents there are, the more likely it is malignant. Texture, area, perimeter, radius, and concavity are important for similar reasons. The shape of the cell nucleus is vital to determining whether it is cancerous, and thanks to Random Forests we were able to determine that these 6 features hold the most importance for classifying new, unseen data. Moving forward, it would be interesting to see whether these 6 features also yield better results than all 10 features with our other models.

### Testing 10 vs 6 Features on Logistic Regression and SVM

We will use the cross_validate function to see which model is better on average. Since every call uses KFold(n_splits=5, shuffle=True, random_state=42), all calls to cross_validate split the data into the same folds. This means we can compare with confidence whether using all 10 features or the 6 best features is better for each of the three sets.
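The claim about identical folds can be checked directly. The sketch below is a minimal illustration, not the notebook's `logistic` helper: it assumes scikit-learn's bundled copy of the dataset, a scaling step in front of the classifier, and a hypothetical `mean_scores` helper. The 6-feature list is taken from the features named in the Random Forest conclusion.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled dataset stands in for data.csv.
data = load_breast_cancer(as_frame=True)
X = data.data
y = (data.target == 0).astype(int)  # flip labels so that 1 = malignant

cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Splitting twice with the same integer random_state gives the same test folds.
folds_a = [test for _, test in cv.split(X)]
folds_b = [test for _, test in cv.split(X)]
same_folds = all(np.array_equal(a, b) for a, b in zip(folds_a, folds_b))
print("Identical folds:", same_folds)

ten = [c for c in X.columns if c.startswith("mean ")]
six = ["mean concave points", "mean texture", "mean area",
       "mean perimeter", "mean radius", "mean concavity"]

def mean_scores(columns):
    """Average accuracy and F1 over the shared folds (hypothetical helper)."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores = cross_validate(model, X[columns], y, cv=cv,
                            scoring=["accuracy", "f1"])
    return {name: scores[f"test_{name}"].mean() for name in ("accuracy", "f1")}

print("10 features:", mean_scores(ten))
print("6 features: ", mean_scores(six))
```

Because both feature sets are scored on the same five folds, any difference in the averages comes from the features rather than from a lucky split.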

First, let's try Logistic Regression.

In [None]:
logistic(df[ftMean], df['diagnosis']) # 10 features

In [None]:
logistic(df[importantFeaturesFtMeanMost.index], df['diagnosis']) # 6 features

In [None]:
logistic(df[ftSE], df['diagnosis']) # 10 features

In [None]:
logistic(df[importantFeaturesFtSEMost.index], df['diagnosis']) # 6 features

In [None]:
logistic(df[ftWorst], df['diagnosis']) # 10 features

In [None]:
logistic(df[importantFeaturesFtWorstMost.index], df['diagnosis']) # 6 features

From the above results, there appears to be no meaningful difference between training the model on 10 features and on 6 features. The difference is so small that we may prefer the model with fewer features, as it yields nearly the same results as the model with 10. With a larger dataset, the 6-feature model would be preferable for its lower training cost; with our sample size, however, it may be best to keep all 10 features, as that model still technically yields the slightly better result.

Now, let's try SVM.

In [None]:
svm(df[ftMean], df['diagnosis'], C=5, kernel='linear') # 10 features

In [None]:
svm(df[importantFeaturesFtMeanMost.index], df['diagnosis'], C=5, kernel='linear') # 6 features

In [None]:
svm(df[ftSE], df['diagnosis'], C=5, kernel='linear') # 10 features

In [None]:
svm(df[importantFeaturesFtSEMost.index], df['diagnosis'], C=5, kernel='linear') # 6 features

In [None]:
svm(df[ftWorst], df['diagnosis'], C=5, kernel='linear') # 10 features

In [None]:
svm(df[importantFeaturesFtWorstMost.index], df['diagnosis'], C=5, kernel='linear') # 6 features

The same conclusion holds for SVMs: there is no significant difference between the 10-feature and 6-feature models.

## Conclusion

Our project aimed to develop three different classification models (Logistic Regression, Support Vector Machines, and Random Forests) for identifying whether a breast tumor is malignant or benign. The main goal was to build a model that can effectively identify from the data whether a tumor is harmful. This is critical in the medical context, so we wanted a model that works well on unseen data. From our results, the Support Vector Machine model had the highest accuracy and F1 score, at 96.3% and 95.1% respectively. The Random Forest and Logistic Regression models were a close second and third, with accuracies of 95.6% and 95.1% and F1 scores of 94.1% and 93.4% respectively, but they were not as good as our SVM. All three models were effective when using the worst set, more so than with the mean and standard error sets.

The effectiveness of the SVM can be linked to its ability to work in high-dimensional spaces and to find a hyperplane separating two classes. With a linear kernel, the SVM looks for the linear boundary with the widest margin between the malignant and benign classes. This suggests that the dataset is close to linearly separable and that adjusting the margin of the hyperplane was the most effective strategy.

We used two methods to see which models were better with which sets (mean, SE, worst): cross_validate, and fitting the model normally. The cross_validate function from scikit-learn automates the process of cross-validation for us and lets us see the average accuracy and F1 score across folds. This was helpful for judging the performance of the models, and it is how we narrowed them down to our best ones.

When analyzing our results, we were pleased with our accuracy and F1 scores; however, we later realized that accuracy is not the only thing that matters in this project, and that precision and recall matter too. In the medical context, we want to avoid False Negatives, where we incorrectly classify a malignant tumor as benign. Therefore, recall is the most important metric for our models. We focused on the recall values for class 1 (malignant), because a higher recall for class 1 means fewer False Negatives. With a Random Forest classifier, we got a recall of 95% when working with the mean set; however, this was with only the top 6 features that we handpicked in the beginning. With an SVM, we got a recall of 93% when working with the worst set. Many SVM models showed a recall of 93% for class 1, but this model also had a recall of 99% for class 0, a slightly higher weighted average, and a higher precision, so we identified it as the best SVM model. For logistic regression, we also got a recall of 93% when working with the worst set. Overall, in the medical context, the Random Forest model had the best recall for malignant cases at 95%, minimizing the risk of misdiagnosing a patient.
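To make the recall argument concrete, the sketch below computes precision, recall, and F1 for the malignant class from confusion-matrix counts. The counts are hypothetical, chosen only for illustration; they are not the notebook's results, and `class_metrics` is a helper introduced here.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall, and F1 for the positive (malignant) class."""
    precision = tp / (tp + fp)  # of the tumors flagged malignant, how many were
    recall = tp / (tp + fn)     # of the truly malignant tumors, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 38 malignant tumors caught, 2 benign tumors flagged
# by mistake, 4 malignant tumors missed (the False Negatives).
tp, fp, fn = 38, 2, 4
precision, recall, f1 = class_metrics(tp, fp, fn)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Only the False Negatives lower recall, which is why it is the metric to watch when a missed malignant tumor is the costliest mistake.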

One direction for further work is hyperparameter tuning. We could explore the hyperparameters of each model more thoroughly in the hope of a better result. Techniques like GridSearchCV could be used to tune the softness of the SVM's margin. Since we only worked with C=1 and C=5, finding the optimal C could give SVMs an edge over the other two models. Another step forward would be validating our models on new, unseen data from an external dataset, which would show how well they generalize and guide further fine-tuning. Finally, we could compare and analyze the three models we judged best in terms of recall for class 1 (malignant tumors). The issue at hand is that the Random Forest model was trained with 6 of the 10 features from the mean set, while the SVM and logistic regression were trained with all 10 features from the worst set. We would need to find a fairer way to compare our models.
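The search over C mentioned above could be sketched as follows. This is an illustration under assumptions rather than a result from this project: it uses scikit-learn's bundled copy of the dataset, adds a scaling step, picks an arbitrary grid of C values, and scores on recall for the malignant class.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled dataset stands in for data.csv.
data = load_breast_cancer(as_frame=True)
worst = [c for c in data.data.columns if c.startswith("worst ")]
X = data.data[worst]
y = (data.target == 0).astype(int)  # flip labels so that 1 = malignant

# Scaling inside the pipeline keeps each fold's statistics out of its test split.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="linear"))])
grid = {"svc__C": [0.1, 0.5, 1, 5, 10]}  # arbitrary illustrative values

search = GridSearchCV(
    pipe,
    grid,
    scoring="recall",  # recall for class 1, i.e. malignant
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Swapping `scoring` for `"f1"` or `"accuracy"` would rank the same grid by a different criterion, which may select a different C.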