# Introduction

The advancement of Artificial Intelligence (AI) has revolutionized several areas, especially in the field of supervised learning, where models are trained to make predictions and classifications based on labeled data. In this activity, we explore the impact and importance of different AI techniques in solving complex problems.

Using the Wisconsin Breast Cancer database, which addresses a crucial problem in the medical field, we explored three AI techniques: KNN (K-Nearest Neighbors), MLP (Multi-Layer Perceptron, a type of artificial neural network) and Random Forest. Each technique was applied with 15 hyperparameter variations. For each variation, the training set was divided into two disjoint subsets to evaluate the stability of the models, totaling 90 different models (3 techniques × 15 variations × 2 subsets).

To ensure robustness, each model was trained and evaluated 100 times without fixing a random seed, averaging its performance metrics. This approach allowed us to not only compare the effectiveness of techniques, but also identify which hyperparameter settings and training splits resulted in the best results.

# Database Preparation

In [1]:
#--------------------------------------------------
# Important Python libraries needed for the experiment
# Matrix manipulation, mathematics and graphical visualization
#--------------------------------------------------
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
#--------------------------------------------------
# Data processing
#--------------------------------------------------
from sklearn.model_selection import train_test_split
#--------------------------------------------------
# Loading the classifiers
#--------------------------------------------------
from sklearn.neural_network import MLPClassifier
from sklearn.neural_network import MLPRegressor

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
#--------------------------------------------------
# Loading performance metrics
#--------------------------------------------------
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
#--------------------------------------------------
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
column_names = 'ID,Target,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30'



# Reading data
df = pd.read_csv('wdbc.data', names=column_names.split(','))

df.drop('ID', axis=1, inplace=True)
df['Target'] = df['Target'].map({'M': 1, 'B': 0})

# Data normalization
target_column = ['Target']
predictors = list(set(list(df.columns))-set(target_column))
df[predictors] = df[predictors]/df[predictors].max()


print("Descriptive statistics of normalized variables:")
df[predictors].describe().transpose()


Descriptive statistics of normalized variables:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| x17 | 569.0 | 0.08054 | 0.076227 | 0.0 | 0.038106 | 0.065379 | 0.106187 | 1.0 |
| x26 | 569.0 | 0.240326 | 0.148711 | 0.025794 | 0.13913 | 0.200284 | 0.32051 | 1.0 |
| x16 | 569.0 | 0.188169 | 0.132261 | 0.016632 | 0.096603 | 0.151034 | 0.23966 | 1.0 |
| x25 | 569.0 | 0.594648 | 0.102572 | 0.319721 | 0.52381 | 0.589847 | 0.655885 | 1.0 |
| x15 | 569.0 | 0.22618 | 0.096451 | 0.055027 | 0.166046 | 0.204947 | 0.261677 | 1.0 |
| x19 | 569.0 | 0.260194 | 0.104704 | 0.099835 | 0.19202 | 0.237239 | 0.297403 | 1.0 |
| x24 | 569.0 | 0.207001 | 0.13384 | 0.043535 | 0.121133 | 0.161378 | 0.254819 | 1.0 |
| x10 | 569.0 | 0.644475 | 0.072459 | 0.512726 | 0.592159 | 0.631568 | 0.678571 | 1.0 |
| x2 | 569.0 | 0.491081 | 0.109497 | 0.2472 | 0.41166 | 0.479633 | 0.55499 | 1.0 |
| x30 | 569.0 | 0.404558 | 0.087042 | 0.265253 | 0.344386 | 0.385735 | 0.443759 | 1.0 |
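
The scaling step divides each predictor by its column maximum, which is why every row of the table reports a `max` of 1.0 while the `min` is not forced to 0 (as min-max scaling would do). A minimal sketch on a made-up frame, assuming hypothetical column values that are not taken from `wdbc.data`:

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only
toy = pd.DataFrame({'x1': [2.0, 4.0, 8.0], 'x2': [0.0, 5.0, 10.0]})

# Same operation as in the notebook cell: divide each column by its own maximum
toy_scaled = toy / toy.max()

print(toy_scaled)
```

Each column now peaks at 1.0, and a column keeps a nonzero minimum unless the raw data contains a zero.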


In [3]:
entrada_X = df[predictors].values
print(entrada_X)
saidaDesejada_y = df[target_column].values

# Applying train_test_split to divide the original set into 70% for training and 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(entrada_X, saidaDesejada_y, test_size=0.3, random_state=42) # fixed seed (42) so the split is reproducible
print(f"Training Set---Size: {X_train.shape}")
print(X_train)
print(f"\nTest Set---Size: {X_test.shape}")
print(X_test)

[[0.13568182 0.62911153 0.36218612 ... 0.65145889 0.18532242 0.91202749]
 [0.0469697  0.17637051 0.09660266 ... 0.70503979 0.15023541 0.63917526]
 [0.09676768 0.40122873 0.29586411 ... 0.68965517 0.16108495 0.83505155]
 ...
 [0.11944444 0.29243856 0.27555391 ... 0.57453581 0.22006141 0.48728522]
 [0.17972222 0.8205104  0.45480059 ... 0.74323607 0.32650972 0.91065292]
 [0.         0.06090737 0.03441654 ... 0.25421751 0.29232344 0.        ]]
Training Set---Size: (398, 30)
[[0.04494949 0.17240076 0.11757755 ... 0.46748011 0.15504606 0.20683849]
 [0.12517677 0.23922495 0.24150665 ... 0.45676393 0.23336745 0.30852234]
 [0.13328283 0.34357278 0.3872969  ... 0.52106101 0.30931423 0.38075601]
 ...
 [0.01218687 0.04759924 0.0274003  ... 0.47904509 0.14734903 0.11453608]
 [0.05517677 0.33724008 0.13227474 ... 0.48339523 0.19514841 0.62783505]
 [0.02699495 0.08376181 0.04503693 ... 0.40965517 0.32159672 0.25536082]]

Test Set---Size: (171, 30)
[[0.06820707 0.22476371 0.14113737 ... 0.43018568 0.2

# Training and Testing the Models
Below, we create a generic function to train the three models used in this work.

In [4]:
def training_models(model, parameters, X_train, y_train, random_state=None):

  metrics = []

  for params in parameters:

    # Splitting the training set into two disjoint subsets
    X_train_1, X_train_2, y_train_1, y_train_2 = train_test_split(X_train, y_train, test_size=0.5, random_state=random_state)

    y_train_1 = np.ravel(y_train_1)
    y_train_2 = np.ravel(y_train_2)

    #2 training subsets
    conjunto_1 = (X_train_1, y_train_1)
    conjunto_2 = (X_train_2, y_train_2)



    for current_x_train, current_y_train in [conjunto_1, conjunto_2]:
      accs = []
      f1s = []
      recs = []
      precs = []
      for _ in range(100):

        #Performing model training
        mod = model(**params) # The ** operator is used to unpack the 'params' dict and pass its elements as named arguments to the initialization of the 'model' object
        mod.fit(current_x_train, current_y_train) # training the model
        y_pred = mod.predict(X_test) # testing the model with the test dataset


        current_acc = accuracy_score(y_test, y_pred)
        current_f1 = f1_score(y_test, y_pred)
        current_rec = recall_score(y_test, y_pred)
        current_prec = precision_score(y_test, y_pred)


        accs.append(current_acc)
        f1s.append(current_f1)
        recs.append(current_rec)
        precs.append(current_prec)

      metrics.append({
          'params': params,
          'accuracy': np.mean(accs),
          'f1': np.mean(f1s),
          'recall': np.mean(recs),
          'precision': np.mean(precs)
      })

  return metrics
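
Because the inner loop runs over both subsets for each parameter set, the list returned by `training_models` interleaves its entries: even indices hold results for subset 1 and odd indices hold results for subset 2. A small sketch with a hypothetical, hand-written `fake_metrics` list (the numbers are made up) shows how slicing separates the two subsets:

```python
# Hypothetical results in the same shape training_models returns (numbers are made up)
fake_metrics = [
    {'params': {'n_neighbors': 1}, 'recall': 0.90},  # subset 1
    {'params': {'n_neighbors': 1}, 'recall': 0.88},  # subset 2
    {'params': {'n_neighbors': 3}, 'recall': 0.93},  # subset 1
    {'params': {'n_neighbors': 3}, 'recall': 0.91},  # subset 2
]

subset_1 = fake_metrics[0::2]  # even indices
subset_2 = fake_metrics[1::2]  # odd indices

# Entry with the highest recall among the subset 1 results
best_1 = max(subset_1, key=lambda m: m['recall'])
print(best_1['params'], best_1['recall'])
```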

# Auxiliary Functions
Functions that assist in the code such as displaying metrics, graphs, calculating means, etc.

In [31]:
def max_and_min_metrics(model_metrics, metric):
  # Entries alternate between subsets: even indices -> set 1, odd indices -> set 2
  results = []

  for subset_metrics in (model_metrics[0::2], model_metrics[1::2]):
    best = max(subset_metrics, key=lambda m: m[metric])
    worst = min(subset_metrics, key=lambda m: m[metric])

    results.append({
        'max': best[metric],
        'min': worst[metric],
        'params': best['params'],       # parameters that produced the maximum
        'min_params': worst['params']   # parameters that produced the minimum
    })

  return tuple(results)



def separate_by_parameter_and_metric(model_metrics, metrics, parameters):
    # One container per training subset; entries in model_metrics alternate
    # between set 1 (even index) and set 2 (odd index)
    separated = (
        {metrics: [], 'params': []},
        {metrics: [], 'params': []}
    )

    for i, m in enumerate(model_metrics):
        if all(param in m['params'] and m['params'][param] == parameters[param] for param in parameters):
            subset = separated[i % 2]
            subset[metrics].append(m[metrics])
            subset['params'].append(m['params'])

    return separated

def print_metrics(model, metrics):
  print(f"Model: {model}")

  print("\n================= First training subset metrics =================\n")
  for i in range(0,len(metrics)):
    if i%2 == 0:
      print(f"\nParameters used:   {metrics[i]['params']}")
      print(f"Accuracy:   {metrics[i]['accuracy']}")
      print(f"F1 Score:   {metrics[i]['f1']}")
      print(f"Recall:     {metrics[i]['recall']}")
      print(f"Precision:  {metrics[i]['precision']}")

  print("\n================= Second training subset metrics =================\n")
  for i in range(0,len(metrics)):
    if i%2 != 0:
      print(f"\nParameters used:   {metrics[i]['params']}")
      print(f"Accuracy:   {metrics[i]['accuracy']}")
      print(f"F1 Score:   {metrics[i]['f1']}")
      print(f"Recall:     {metrics[i]['recall']}")
      print(f"Precision:  {metrics[i]['precision']}")

def specific_metric_mean(model1, model2, metric):
  met1 = []
  met2 = []

  for i in range(len(model1)):
    current = model1[i]
    met1.append(current[metric])

  for i in range(len(model2)):
    current = model2[i]
    met2.append(current[metric])

  return (np.mean(met1), np.mean(met2)) # mean of the specified metric in each set



def metrics_mean(model_metrics): # returns the mean of measurements from both sets
  acc1 = []
  f1_score1 =[]
  rec1 = []
  prec1 = []

  acc2 = []
  f1_score2 =[]
  rec2 = []
  prec2 = []

  for i in range(len(model_metrics)):
    current = model_metrics[i]

    if i%2 == 0:
      acc1.append(current['accuracy'])
      f1_score1.append(current['f1'])
      rec1.append(current['recall'])
      prec1.append(current['precision'])

    else:
      acc2.append(current['accuracy'])
      f1_score2.append(current['f1'])
      rec2.append(current['recall'])
      prec2.append(current['precision'])

  return (
      [np.mean(acc1), np.mean(f1_score1), np.mean(rec1), np.mean(prec1)],
      [np.mean(acc2), np.mean(f1_score2), np.mean(rec2), np.mean(prec2)]
  )

def bar_graph(model_metrics):

  metrics = ['Accuracy', 'F1 Score', 'Recall', 'Precision']
  [model1, model2] = metrics_mean(model_metrics)

  bar_width = 0.25

  r1 = np.arange(4)
  r2 = [x + bar_width for x in r1]

  # Creating the bar chart
  plt.figure(figsize=(10, 6))
  plt.bar(r1, model1, color='#9448BC', width=bar_width, edgecolor='grey', label='Model 1')
  plt.bar(r2, model2, color='#480355', width=bar_width, edgecolor='grey', label='Model 2')

  # Adding metric labels on the x-axis
  plt.xlabel('Metrics', fontweight='bold')
  plt.xticks([r + bar_width / 2 for r in range(4)], metrics)

  ax = plt.gca()
  for i, height in enumerate(model1):
      ax.text(r1[i], height + 0.01, f'{height:.2f}', ha='center', va='bottom', fontsize=10)
  for i, height in enumerate(model2):
      ax.text(r2[i], height + 0.01, f'{height:.2f}', ha='center', va='bottom', fontsize=10)

  plt.legend(loc='upper right', fontsize='small', title='Sets', bbox_to_anchor=(1.15, 1))
  plt.tight_layout()

  plt.show()



def bar_graph_metric_and_parameter(model_metrics, metric, parameters):
    model1, model2 = separate_by_parameter_and_metric(model_metrics, metric, parameters)

    metrics1 = model1[metric]
    metrics2 = model2[metric]
    params = model1['params']



    bar_width = 0.35

    r1 = np.arange(len(metrics1))
    r2 = [x + bar_width for x in np.arange(len(metrics2))]

    plt.figure(figsize=(10, 6))  # Increase the size of the graph
    plt.bar(r1, metrics1, color='#9448BC', width=bar_width, edgecolor='grey', label='Model 1')
    plt.bar(r2, metrics2, color='#480355', width=bar_width, edgecolor='grey', label='Model 2')

    # Adding metric labels on the x-axis
    plt.xlabel(f"{metric.capitalize()}", fontweight='bold')

    # Adjusting x-axis labels
    labels = [str(param) for param in params]
    plt.xticks([r + bar_width / 2 for r in range(len(labels))], labels, rotation=0, ha='center', fontsize=10)

    # Adding metric values above the bars
    ax = plt.gca()
    for i, height in enumerate(metrics1):
        ax.text(r1[i], height + 0.01, f'{height:.2f}', ha='center', va='bottom')
    for i, height in enumerate(metrics2):
        ax.text(r2[i], height + 0.01, f'{height:.2f}', ha='center', va='bottom')

    plt.legend(loc='upper right', fontsize='small', title='Sets', bbox_to_anchor=(1.15, 1))
    plt.tight_layout()
    plt.subplots_adjust(left=0.05, right=0.95, top=0.95, bottom=0.25)  # Adjust the spacing around the graph

    plt.show()



def graph_max_min(model1, model2, metric):
    maxs = [model1['max'], model2['max']]
    mins = [model1['min'], model2['min']]
    models = ['Model 1', 'Model 2']

    fig, ax = plt.subplots()
    largura_barra = 0.35
    indice = range(2)  

   
    bar1 = ax.bar([i - largura_barra/2 for i in indice], maxs, largura_barra, label='Max', linewidth=2)
    for i, v in enumerate(maxs):
        ax.text(i - largura_barra/2, v + 0.01, f'{v:.2f}', ha='center', va='bottom')

    bar2 = ax.bar([i + largura_barra/2 for i in indice], mins, largura_barra, label='Min', linewidth=2)
    for i, v in enumerate(mins):
        ax.text(i + largura_barra/2, v + 0.01, f'{v:.2f}', ha='center', va='bottom')


    ax.margins(x=0.1, y=0.1)

    ax.set_xlabel('Models')
    ax.set_ylabel('Values')
    ax.set_title(f'Max/Min {metric.capitalize()} of Two Models')
    ax.set_xticks(indice)
    ax.set_xticklabels(models)
    ax.legend(loc='upper right', fontsize='small', title='Statistic', bbox_to_anchor=(1.25, 1))

    plt.show()

# KNN (K-Nearest Neighbors)

The KNN (K-Nearest Neighbors) algorithm is commonly used in supervised learning for classification and regression. It decides the class of a new example by calculating its distance to all training examples. The K closest examples are selected, and the class of the new example is determined by majority voting among these neighbors. Essentially, KNN relies on the similarity of data to make predictions, being simple to understand and implement, but sensitive to the choice of the K parameter and the appropriate treatment of the data.
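
The distance-then-vote procedure described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the implementation used by `KNeighborsClassifier`; the helper `knn_predict` and the toy points are hypothetical and exist only for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from x_new to every training example
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training examples
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Made-up 2-D points: class 0 near the origin, class 1 near (5, 5)
X_toy = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 0.0],
                  [5.0, 5.0], [5.5, 4.5], [4.5, 5.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([0.2, 0.1]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([5.0, 4.8]), k=3)
print(pred_a, pred_b)
```

Using an odd `k` in a two-class problem avoids tied votes, which is one reason the experiments below use 1, 3, 5, 7 and 9 neighbors.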

In [None]:
params = [{'n_neighbors': k, 'metric': metric} for k in [1, 3, 5, 7, 9] for metric in ['euclidean', 'manhattan', 'chebyshev']] # 15 parameter combinations

knn_metrics = training_models(KNeighborsClassifier, params, X_train, y_train)

print_metrics("KNN", knn_metrics)


# Artificial Neural Networks (MLP)

In [None]:
parametros_mlp = [{'hidden_layer_sizes': h, 'max_iter': m} for h in [50, 100, 150, 200, 250] for m in [200, 300, 400]] # 15 parameters

mlp_metrics = training_models(MLPClassifier, parametros_mlp, X_train, y_train)

from IPython.display import clear_output
clear_output()

print_metrics("MLP", mlp_metrics)

# Random Forest


The Random Forest algorithm is a supervised learning technique that combines multiple decision trees to improve the accuracy and robustness of the predictive model. During training, each tree is built on a random sample of the dataset drawn with replacement (a bootstrap sample), and each split within a tree considers only a random subset of the features. This introduces diversity among the individual trees, which reduces overfitting and improves the model's ability to generalize. To predict the class in classification problems or the value in regression problems, Random Forest combines the predictions of all the trees through voting (for classification) or averaging (for regression), resulting in a more accurate and stable final estimate. This makes Random Forest effective for complex and non-linear data, and it still offers some interpretability through feature importance scores. However, proper parameter tuning is crucial to optimize its performance in different application contexts.
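
The bootstrap-and-vote idea can be sketched by hand with plain decision trees. This is a simplified illustration on synthetic data from `make_classification`, not the code path of `RandomForestClassifier` (which, in scikit-learn, averages per-tree class probabilities rather than counting hard votes); the tree count and seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic binary problem, used only for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

trees = []
for i in range(25):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # max_features='sqrt' makes each split consider a random subset of features
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Hard majority vote: with labels 0/1 and an odd number of trees,
# the winning class is 1 whenever more than half of the trees predict 1
all_preds = np.array([t.predict(X) for t in trees])  # shape (25, 200)
forest_pred = (all_preds.mean(axis=0) >= 0.5).astype(int)

print(forest_pred[:10])
```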

In [None]:
parametros_rf = [{'n_estimators': n, 'max_depth': d} for n in [50, 100, 150, 200, 250] for d in [10, 20, 30]]

rf_metrics = training_models(RandomForestClassifier, parametros_rf, X_train, y_train)

print_metrics("Random Forest", rf_metrics)

For the dataset in this problem, which relates to breast cancer, recall should be considered the most important metric for measuring the effectiveness of the models. Recall is defined as:
$$\text{Recall} = \frac{TP}{TP + FN}$$
where $TP$ is the number of true positives and $FN$ is the number of false negatives. For a fixed number of positive cases, the fewer false negatives a model produces, the higher its recall. In a medical evaluation it is crucial to keep false-negative diagnoses low, because a false negative means that breast cancer goes undetected in a patient who actually has it.
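
As a worked example of the recall formula, the sketch below counts true positives and false negatives by hand and compares the result with `recall_score`. The label vectors are made up for illustration and use the same encoding as the `Target` column (1 = malignant, 0 = benign).

```python
from sklearn.metrics import recall_score

# Made-up labels: 1 = malignant, 0 = benign
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# True positives: malignant cases correctly flagged
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
# False negatives: malignant cases the model missed
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

manual_recall = tp / (tp + fn)
print(manual_recall, recall_score(y_true, y_pred))
```

Here three of the four malignant cases are detected, so recall is 3 / (3 + 1) = 0.75. The false positive in the second-to-last position affects precision but not recall.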

# Results

## KNN

In [None]:
# Total average of metrics
bar_graph(knn_metrics)

In [None]:
# Average of metrics for 1 neighbor
distances = ['euclidean', 'manhattan', 'chebyshev']

for d in distances:
  bar_graph_metric_and_parameter(knn_metrics, 'accuracy', {'n_neighbors': 1, 'metric': d})
  bar_graph_metric_and_parameter(knn_metrics, 'f1', {'n_neighbors': 1, 'metric': d})
  bar_graph_metric_and_parameter(knn_metrics, 'recall', {'n_neighbors': 1, 'metric': d})
  bar_graph_metric_and_parameter(knn_metrics, 'precision', {'n_neighbors': 1, 'metric': d})

In [None]:
# Average of metrics for 3 neighbors
distances = ['euclidean', 'manhattan', 'chebyshev']

for d in distances:
  bar_graph_metric_and_parameter(knn_metrics, 'accuracy', {'n_neighbors': 3, 'metric': d})
  bar_graph_metric_and_parameter(knn_metrics, 'f1', {'n_neighbors': 3, 'metric': d})
  bar_graph_metric_and_parameter(knn_metrics, 'recall', {'n_neighbors': 3, 'metric': d})
  bar_graph_metric_and_parameter(knn_metrics, 'precision', {'n_neighbors': 3, 'metric': d})

In [None]:
# Average of metrics for 5 neighbors
distancias = ['euclidean', 'manhattan', 'chebyshev']

for d in distancias:
  bar_graph_metric_and_parameter(knn_metrics, 'accuracy', {'n_neighbors': 5, 'metric': d})
  bar_graph_metric_and_parameter(knn_metrics, 'f1', {'n_neighbors': 5, 'metric': d})
  bar_graph_metric_and_parameter(knn_metrics, 'recall', {'n_neighbors': 5, 'metric': d})
  bar_graph_metric_and_parameter(knn_metrics, 'precision', {'n_neighbors': 5, 'metric': d})

In [None]:
# Average of metrics for 7 neighbors
distancias = ['euclidean', 'manhattan', 'chebyshev']

for d in distancias:
  bar_graph_metric_and_parameter(knn_metrics, 'accuracy', {'n_neighbors': 7, 'metric': d})
  bar_graph_metric_and_parameter(knn_metrics, 'f1', {'n_neighbors': 7, 'metric': d})
  bar_graph_metric_and_parameter(knn_metrics, 'recall', {'n_neighbors': 7, 'metric': d})
  bar_graph_metric_and_parameter(knn_metrics, 'precision', {'n_neighbors': 7, 'metric': d})

In [None]:
# Average of metrics for 9 neighbors
distancias = ['euclidean', 'manhattan', 'chebyshev']

for d in distancias:
  bar_graph_metric_and_parameter(knn_metrics, 'accuracy', {'n_neighbors': 9, 'metric': d})
  bar_graph_metric_and_parameter(knn_metrics, 'f1', {'n_neighbors': 9, 'metric': d})
  bar_graph_metric_and_parameter(knn_metrics, 'recall', {'n_neighbors': 9, 'metric': d})
  bar_graph_metric_and_parameter(knn_metrics, 'precision', {'n_neighbors': 9, 'metric': d})

In [None]:
# Max and Min of all metrics
metrics = ['recall','accuracy', 'f1', 'precision']

for m in metrics:
  (mod1,mod2) = max_and_min_metrics(knn_metrics,m)
  graph_max_min(mod1,mod2,m)
  print() # \n

### KNN Conclusion
The KNN algorithm proved to be more effective using Euclidean distance, regardless of the variation in the number of neighbors.

## MLP

In [None]:
# Total average of metrics
bar_graph(mlp_metrics)

In [None]:
# Average of metrics with parameter variation

hidden_layer_sizes = [50, 100, 150, 200, 250]
max_iters = [200, 300, 400]

for h in hidden_layer_sizes:
  for m in max_iters:
     bar_graph_metric_and_parameter(mlp_metrics, 'accuracy', {'hidden_layer_sizes': h,'max_iter': m})
     bar_graph_metric_and_parameter(mlp_metrics, 'f1', {'hidden_layer_sizes': h,'max_iter': m})
     bar_graph_metric_and_parameter(mlp_metrics, 'recall', {'hidden_layer_sizes': h,'max_iter': m})
     bar_graph_metric_and_parameter(mlp_metrics, 'precision', {'hidden_layer_sizes': h,'max_iter': m})

In [None]:
# Max and Min of all metrics
metrics = ['recall', 'f1', 'precision', 'accuracy']

for m in metrics:
  (mod1,mod2) = max_and_min_metrics(mlp_metrics,m)
  graph_max_min(mod1,mod2,m)

### MLP conclusion
For the MLP, the best maximum and minimum values for both model 1 and model 2 were observed for the Accuracy metric: (0.98; 0.95) for model 1 and (0.98; 0.94) for model 2, with the parameters hidden_layer_sizes: 250 and max_iter: 400.

## Random Forest


In [None]:
# Total average of metrics
bar_graph(rf_metrics)

In [None]:
# Average of metrics with parameter variation
n_estimators = [50, 100, 150, 200, 250]
max_depths = [10, 20, 30]

for n in n_estimators:
  for m in max_depths:
   bar_graph_metric_and_parameter(rf_metrics, 'accuracy', {'n_estimators': n, 'max_depth': m})
   bar_graph_metric_and_parameter(rf_metrics, 'f1',  {'n_estimators': n, 'max_depth': m})
   bar_graph_metric_and_parameter(rf_metrics, 'recall',  {'n_estimators': n, 'max_depth': m})
   bar_graph_metric_and_parameter(rf_metrics, 'precision',  {'n_estimators': n, 'max_depth': m})

In [None]:
# Max and Min of all metrics
metrics = ['recall', 'f1', 'precision', 'accuracy']

for m in metrics:
  (mod1,mod2) = max_and_min_metrics(rf_metrics,m)
  graph_max_min(mod1,mod2,m)

### Random Forest Conclusion
For Random Forest, the best maximum and minimum values were also observed for the Accuracy metric, with the parameters n_estimators: 250 and max_depth: 20: (0.97; 0.96) for model 1 and (0.97; 0.95) for model 2.

# Conclusion

Using a variety of models in supervised learning problems is crucial. Each algorithm has its advantages and limitations, and applying different models, such as KNN, MLP and Random Forest, can provide a more complete understanding of the problem, especially in the detection of breast cancer, where it can support faster and more accurate diagnoses.

The quality and representativeness of training sets have a significant impact on model performance. Well-selected data is essential for creating robust and accurate models. Changes to these sets can drastically affect the model's ability to generalize and make correct predictions on new data.

Furthermore, the choice of hyperparameters is vital to the performance of the model. Poorly tuned hyperparameters can result in overfitting or underfitting, leading to low accuracy. Therefore, careful tuning and iterative testing of different hyperparameter settings are necessary to optimize the performance of machine learning models.

In summary, the use of various machine learning algorithms to detect breast cancer can help save lives by increasing the accuracy and speed of diagnoses. Proper selection of datasets and careful tuning of hyperparameters are essential steps to ensure that models are effective and reliable.