# (02) Data balancing, PCA, training & deployment strategy
<div class="alert alert-info"> <p> To train the breast cancer tumor classification model, a scikit-learn pipeline-based approach will be used. For our initial analysis, we preprocessed the data—which included normalization, data balancing (using SMOTE), and dimensionality reduction via PCA—to evaluate performance and compare methods. </p> </div> <div class="alert alert-info"> <b style="font-size: 1.5em;">🔍 Training Strategy</b> <p> In the final training process, we will start from scratch using the raw data. The entire data transformation pipeline (normalization, balancing, and PCA) will be integrated directly into the training pipeline. This approach ensures that the same pre-processing steps are automatically applied to new data during inference, maintaining consistency and reproducibility. </p> <ol> <li> <b>Data Preprocessing:</b> <ul> <li><b>Train-Test Split:</b> The raw dataset will be split into training and test sets using stratification to maintain the class distribution.</li> <li><b>Normalization:</b> A StandardScaler will be fitted on the training data and used to transform both training and test sets.</li> <li><b>Balancing:</b> SMOTE will be applied to the training data to correct for class imbalance.</li> <li><b>PCA:</b> Dimensionality reduction will be performed on the balanced data to extract the most informative components.</li> </ul> </li> <li> <b>Model Training:</b> <ul> <li> A stacking classifier will be employed, combining several base models: <ul> <li><b>Logistic Regression</b>: A linear model for interpretability.</li> <li><b>Random Forest</b>: A tree-based model capable of capturing non-linear relationships.</li> <li><b>XGBoost Classifier</b>: A boosting model with regularization to improve predictive power.</li> </ul> </li> <li> <b>Meta-Model:</b> Logistic Regression will serve as the meta-model, combining the predictions from the base models. 
</li> <li> <b>Hyperparameter Tuning:</b> GridSearchCV (or RandomizedSearchCV) will be used along with StratifiedKFold cross-validation to optimize the hyperparameters for both the base models and the meta-model. </li> </ul> </li> <li> <b>Evaluation:</b> <ul> <li><b>Classification Report:</b> To provide precision, recall, f1-score, and support for each class.</li> <li><b>ROC AUC Score:</b> To measure the model’s ability to discriminate between classes across various thresholds.</li> <li><b>Accuracy Score:</b> To indicate the overall percentage of correct predictions.</li> </ul> </li> <li> <b>Model Packaging for Deployment:</b> <ul> <li>The final model will be saved along with the pre-processing objects (the scaler and PCA) so that new, raw data can be automatically transformed in the same way during inference.</li> </ul> </li> </ol> <p> This integrated approach—starting from raw data and applying the full pipeline of transformations—ensures that the model receives data in the same format it was trained on, thereby avoiding discrepancies and biases during production predictions. </p> </div>
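
The integrated pipeline described above can be sketched as follows. This is a minimal illustration, not the notebook's actual `stacking_model_training` function: it relies only on scikit-learn, so the SMOTE step and the XGBoost base model are left out, and the step names `scaler` and `pca` are assumptions (only `stacking`, `lr` and `rf` appear in the tuned parameter names). It uses scikit-learn's bundled copy of the dataset, whose labels are flipped to match this notebook's encoding.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# scikit-learn's copy encodes malignant as 0 and benign as 1,
# so flip the labels to match this notebook (B = 0, M = 1).
X, y = load_breast_cancer(return_X_y=True)
y = 1 - y

# Stratified split keeps the class distribution in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Base models plus a logistic-regression meta-model.
stacking = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

# Scaler and PCA are fitted on the training data only, then reused at inference.
pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=10)),
        ("stacking", stacking),
    ]
)

pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Because the scaler and PCA live inside the pipeline, calling `pipeline.predict` on raw feature rows applies the same fitted transformations automatically.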

### importing libraries, modules and dataset

In [1]:
#libraries

import os
import pandas as pd
import sys
import joblib

In [2]:
# root directory (to be able to import modules outside the current directory)
#os.chdir('..')  
ROOT_DIR = os.path.abspath('..')
sys.path.append(ROOT_DIR)

# own modules
from utils.utils_load_data import Loader
from utils.utils_initial_exploration import InitialExploration
from utils.utils_categorical_plots   import CategoricalPlots
from utils.utils_training_funcs      import TrainingFuncs
from utils.utils_list_and_dicts      import ListAndDicts
from utils.utils_deployment_funcs    import DeploymentFuncs

In [3]:
# instances
loader      = Loader()
initial_exp = InitialExploration()
catplots    = CategoricalPlots()
training    = TrainingFuncs()
list_dict   = ListAndDicts()
deploy      = DeploymentFuncs()

In [4]:
# plot appearance
initial_exp.load_appereance()

In [5]:
df_raw = loader.load_data(file_name='breast_disease', dir= 'raw', copy= True)
df_raw.drop(columns= ['Unnamed: 32', 'id'], inplace= True)
df_raw['diagnosis'] = df_raw['diagnosis'].map({'B': 0, 'M': 1})


### start of the analysis





## (2.1) preprocessing, data balancing, PCA + training

In [6]:
#diagnosis mapping (B = 0, M = 1)
print(f'shape: {df_raw.shape}')
df_raw['diagnosis'].value_counts()

shape: (569, 31)


diagnosis
0    357
1    212
Name: count, dtype: int64

In [7]:
df_raw.columns

Index(['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

## (2.1) Training

In [8]:
# NOTE: do not use this code
# training_results = training.stacking_model_training(df= df_balanced,
#                                                     target_col= 'diagnosis',
#                                                     output_file_name= 'stacking_model_09',
#                                                     test_size= 0.2,
#                                                     )

# increase recall for the malignant class (7% FN [recall], 4% FP [precision])
# adjust the decision threshold to prioritize sensitivity (recall) over precision

<div class="alert alert-warning">
    <b style="font-size: 1.5em;">⛔ Updated Strategy</b>
    <p>In the initial version, the GridSearchCV-optimized model <b>achieved quite good results</b>:</p>
    <ul><li>Prediction of the benign (0) class <b>reached almost 100%</b> <i>(without data leakage)</i>, while recall for the malignant (1) class was around <b>80%</b></li></ul>
    <p>Since minimizing false negatives (FN) in the malignant class is critical in medical applications, <b>additional training iterations were performed.</b></p>
    <ul>
        <li>The hyperparameters were adjusted and the <b>ROC AUC</b> scoring metric was replaced with a custom <b>recall-based scorer</b> <i>(focused on the malignant class), with the aim of increasing recall for the malignant (1) class</i></li>
    </ul>
</div>
<div class="alert alert-info">
   <p><i>But we already had a model whose performance was quite good, so why keep trying to improve it?</i></p>
   <ul>
        <li>Although good results were already obtained in previous versions <i>(for example, 93% recall for the malignant class, an improvement over the first model)</i>, in the medical field it is essential to <b>achieve at least 95% sensitivity in the detection of malignant tumors</b></li>
        <li>The justification is clear: a false negative (FN) in this context <b>can delay critical treatment in the early stages of the disease</b></li>
   </ul>
   <p>The updated strategy focuses on:</p>
   <ul>
        <li>Optimizing the model to maximize recall in the malignant class without significantly sacrificing performance in the benign class</li>
        <li>Keeping the stacking classifier from the first attempt, with the base models adjusted:
        <ul>
            <li><b>Logistic Regression, RandomForest & SVC</b> with <code>class_weight={0:1, 1:3}</code> to emphasize the importance of the malignant class (1)</li>
            <li><b>XGBoost Classifier</b> with <code>scale_pos_weight=3</code> to handle the imbalance, leaving its base configuration intact</li>
        </ul>
        </li>
   </ul>
   <p>With these tweaks, we prioritize the detection of malignant cases (1) without losing a significant amount of the performance gained on the benign class (0)</p>
   </div>
   
   Function path $\rightarrow$ *'./utils/training_funcs'* 
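
The recall-based scorer and class weighting described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the code in the training module: the data is synthetic, a single logistic regression stands in for the full stacking model, and the names `malignant_recall`, `plain` and `weighted` are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Scorer that measures recall only for the malignant class (label 1).
malignant_recall = make_scorer(recall_score, pos_label=1)

# Synthetic, imbalanced stand-in data (roughly 63% class 0, 37% class 1).
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.63, 0.37], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Same estimator with and without the 1:3 class weighting from the text.
plain = LogisticRegression(max_iter=1000)
weighted = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 3})

recall_plain = cross_val_score(plain, X, y, cv=cv, scoring=malignant_recall).mean()
recall_weighted = cross_val_score(weighted, X, y, cv=cv, scoring=malignant_recall).mean()

print(f"recall (no weights):   {recall_plain:.3f}")
print(f"recall (weights 1:3):  {recall_weighted:.3f}")
```

The same scorer object can be passed as `scoring=` to `GridSearchCV` or `RandomizedSearchCV`, so the search selects the candidate with the best malignant-class recall. Weighting the positive class usually raises its recall at some cost in precision, which is why the benign-class metrics still need to be checked.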

In [9]:
# model saving path -> '../models/stacking_model_##.pkl' 
training_results, xy, df_balanced = (
    training.stacking_model_training(
        df= df_raw,
        target_col= 'diagnosis',
        output_file_name= 'stacking_model_14',
        pca_n_components= 10, # K = 10 found in the previous notebook
        return_df= True,
        cv_n_iters= 100)
    )

X_test  = xy['X_test']
y_test  = xy['y_test']

df_balanced = df_balanced.rename(columns= {'target_column': 'diagnosis'})

Fitting 5 folds for each of 100 candidates, totalling 500 fits
- Best SCORE:0.9823529411764707
-*- the performance score performance is acceptable-*-
- Best params:
{'stacking__xgb__n_estimators': 200, 'stacking__xgb__max_depth': 3, 'stacking__xgb__learning_rate': 0.3, 'stacking__rf__n_estimators': 100, 'stacking__rf__max_depth': None, 'stacking__lr__C': 0.1, 'stacking__final_estimator__C': 10}, performance acceptable
✅ SUCCESS: model saved in -> ../models/stacking_model_14.pkl


*Accuracy is around $94.7$%, **ROC AUC** is $99.2$%, and **recall** for the malignant class (1) is $92.98$%.*

* *The classification report shows balanced precision and recall for both classes*
    * *These are strong indicators, especially the high **ROC AUC**, which suggests excellent discrimination between classes*
    * *Recall for the malignant class is crucial here because missing a malignant tumor (a false negative, FN) is more costly than raising a false alarm (a false positive, FP)*
    * *A recall of $92.98$% is good, but there is still room for improvement*

In [10]:
df_balanced

| | PC_1 | PC_2 | PC_3 | PC_4 | PC_5 | PC_6 | PC_7 | PC_8 | PC_9 | PC_10 | diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.477184 | -2.451403 | -0.893325 | 1.476884 | 0.160196 | 0.208723 | -0.561867 | -0.141684 | -0.332577 | 0.393456 | 1 |
| 1 | -3.190568 | 0.212680 | 0.042603 | -2.511736 | -0.572844 | -0.052576 | 1.082817 | -0.015232 | 0.619042 | 0.259248 | 0 |
| 2 | -2.114533 | 0.181793 | 2.489520 | 1.470261 | 1.665178 | 0.967887 | -0.831456 | 0.013943 | -0.545885 | 0.092203 | 0 |
| 3 | 3.962784 | 3.428284 | -1.252812 | 0.557448 | 3.049767 | -0.053439 | 0.405757 | 0.131580 | 0.619085 | -0.666468 | 1 |
| 4 | 2.282120 | -1.909881 | 2.414289 | 0.208435 | 0.058029 | -0.875441 | 0.866898 | -0.246693 | -0.163884 | -0.349462 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 565 | 6.943856 | -4.861975 | -0.368222 | -0.392074 | -0.299635 | -1.042118 | 0.420382 | 0.001284 | -0.547047 | 0.438284 | 1 |
| 566 | 3.635992 | 1.256639 | -2.543011 | 1.360423 | 0.100552 | -0.159712 | -0.637069 | -0.485348 | -0.122541 | -0.394429 | 1 |
| 567 | 3.006553 | -0.714906 | 1.801834 | -0.298255 | 0.240574 | -1.034405 | 0.597287 | 0.173995 | 0.133675 | -0.018338 | 1 |
| 568 | 3.883224 | -0.323587 | -0.552993 | -1.100622 | 0.287824 | -0.359698 | 2.164175 | -0.250045 | -0.454591 | -0.770506 | 1 |


This is the balanced data, yay! (sorry, I need some coffee ☕)

In [11]:
stacking_model = training_results['stacking_pipeline']
stacking_model

In [12]:
training_results['params']

{'stacking__xgb__n_estimators': 200,
 'stacking__xgb__max_depth': 3,
 'stacking__xgb__learning_rate': 0.3,
 'stacking__rf__n_estimators': 100,
 'stacking__rf__max_depth': None,
 'stacking__lr__C': 0.1,
 'stacking__final_estimator__C': 10}
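
The parameter names above follow scikit-learn's double-underscore convention: `stacking__rf__max_depth` means "step `stacking` of the pipeline, base estimator `rf`, parameter `max_depth`". The sketch below shows how such names drive a randomized search with stratified folds. It is a reduced illustration on synthetic data, not the notebook's training function: XGBoost and SMOTE are omitted, the `scaler` step name is an assumption, and the built-in `"recall"` scorer stands in for the custom one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic binary problem, only to make the sketch runnable.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("stacking", StackingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=3,
    )),
])

# Keys address nested parameters: <pipeline step>__<estimator name>__<parameter>.
param_distributions = {
    "stacking__lr__C": [0.1, 1, 10],
    "stacking__rf__max_depth": [None, 3, 5],
    "stacking__final_estimator__C": [0.1, 1, 10],
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=4,                      # number of sampled candidates
    scoring="recall",              # recall of the positive class (label 1)
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```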

## (2.2) validate model performance

In [13]:
for name, value in training_results['metrics'].items():
    print(f'{name} : {value}')

ACCURACY : 0.9736842105263158
ROC AUC : 0.996031746031746
RECALL : 0.9523809523809523
CLASSIFICATION REPORT : 
              precision    recall  f1-score   support

      benign       0.97      0.99      0.98        72
   malignant       0.98      0.95      0.96        42

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



In [14]:
df_balanced['diagnosis'].value_counts()

diagnosis
1    285
0    285
Name: count, dtype: int64

In [15]:
model_14 = joblib.load('../models/stacking_model_14.pkl')
model_14

### results
<div class="alert alert-success">
    <b style="font-size: 1.5em;">✅ Success: model recall performance higher than 95%</b>
    <p>Thanks to these adjustments & an appropriate configuration of class weights, recall for the malignant class rose to 95% with no data leakage, with an overall accuracy of 97% & <b>ROC AUC higher than 98%</b></p>
</div>

### (2.1.1) prediction

In [16]:
# prediction with new data

path = '../models/stacking_model_14.pkl'
model_11 = joblib.load(path)
model_11.set_params(smote= 'passthrough') # ignore the smote step

# new data for prediction and columns ordered as in training
new_data         = list_dict.new_data_for_prediction
original_columns = list_dict.column_names_in_training

df_new_data = pd.DataFrame([new_data], columns= original_columns)
print(f'df_new_data shape: {df_new_data.shape}')

pred = model_11.predict(df_new_data)


probabilities = model_11.predict_proba(df_new_data)[0]  # first (only) row: [P(benign), P(malignant)]
prob_benign    = probabilities[0]
prob_malignant = probabilities[1]

print(f'Prediction: {"MALIGNANT :O" if pred[0] == 1 else "BENIGN :D"}')
print(f'- Malignant Probability: {prob_malignant:.2%}\n- Benign Probability: {prob_benign:.2%}')


df_new_data shape: (1, 30)
Prediction: MALIGNANT :O
- Malignant Probability: 98.90%
- Benign Probability: 1.10%


Prediction made successfully.
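
The training notes mention adjusting the decision threshold to favor sensitivity. A minimal sketch of that idea follows, using plain Python and made-up probabilities. The helper names `classify_with_threshold` and `malignant_recall` are hypothetical and are not part of the project's modules.

```python
def classify_with_threshold(prob_malignant, threshold=0.5):
    """Return 1 (malignant) when the malignant probability reaches the threshold."""
    return 1 if prob_malignant >= threshold else 0


def malignant_recall(y_true, y_pred):
    """Fraction of truly malignant cases (label 1) that were predicted as malignant."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_positives = sum(1 for t in y_true if t == 1)
    return true_positives / actual_positives if actual_positives else 0.0


# Made-up labels and malignant probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
probs = [0.92, 0.45, 0.35, 0.40, 0.10, 0.05]

for threshold in (0.5, 0.3):
    y_pred = [classify_with_threshold(p, threshold) for p in probs]
    false_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    print(f"threshold={threshold}: recall={malignant_recall(y_true, y_pred):.2f}, "
          f"false positives={false_positives}")
```

With a fitted scikit-learn classifier, the malignant probabilities would come from `predict_proba(X)[:, 1]`. Lowering the threshold catches more malignant cases but also flags more benign ones, so the value should be chosen on validation data rather than on the test set.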

## (2.2) Deployment Strategy

<div class="alert alert-info">
    <b style="font-size: 1.8em;">Simulation: Real-World Measurements for Tumor Classification</b>
    <p> In real-world scenarios, tumor measurements (such as those obtained via mammography or ultrasound) are not captured as a single value per attribute. Instead, multiple readings are taken from different angles or at different times, providing a more comprehensive picture of the tumor's characteristics.</p>
    <p> To better simulate this process in our application, the user will be asked to input several measurements (ideally between 5 and 10) for each base attribute. These base attributes include:</p>
    <ul> 
        <li>radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, fractal dimension</li>
    </ul>
    <p> For each attribute, the application will automatically compute three key metrics:</p>
    <ul> <li><b>Mean:</b> The average of the measurements.</li>
    <li><b>Standard Error (SE):</b> The standard deviation divided by the square root of the number of measurements, reflecting the variability in the readings.</li>
    <li><b>Worst:</b> Typically, the average of the three highest values, which captures the most extreme measurements that might be clinically significant.</li>
    </ul>
    <p> This approach not only mirrors the variability and complexity of real-world measurements but also ensures that the input provided to our model represents a robust summary of the tumor’s characteristics. </p>
    <p> <b>Additional Considerations:</b></p>
    <ol>
        <li> <b>Input Validation:</b> The application will validate that the user inputs are numeric and that the number of measurements per attribute is within the expected range. If the values deviate significantly from realistic ranges—derived from clinical data—the user will receive a clear error message prompting them to re-check their entries</li>
        <li> <b>Realistic Ranges:</b> Each attribute has an expected range (e.g., radius: 0–50 mm, perimeter: 0–150 mm, area: 0–2500 mm², and dimensionless attributes typically between 0 and 1). These ranges help flag any out-of-bound entries</li>
        <li> <b>Data Distribution and Correlation:</b> Some attributes (such as radius and perimeter) are naturally correlated. Our interface and validation logic will consider these relationships, ensuring that the aggregated inputs make clinical sense</li>
        <li> <b>Handling of Repeated Values:</b> It is realistic for some measurements to be very similar or even identical due to instrument precision or tumor homogeneity. The system will compute the metrics regardless, while also alerting the user if the lack of variability might indicate a data entry error</li>
        <li> <b>Units of Measurement:</b> Clear instructions will be provided so that the user inputs values in the correct units (e.g., mm for radius and perimeter, mm² for area)</li>
        </ol>
    <p> This simulation strategy enhances the usability and clinical relevance of our application by transforming raw measurement data into the precise summary metrics (mean, SE, worst) that our model expects, ultimately improving the reliability and interpretability of the predictions.</p>
</div>
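
The three summary metrics described above can be computed with the standard library alone. The sketch below is one possible implementation, not the application's code: the function names are hypothetical, the standard error assumes the sample standard deviation (n - 1 in the denominator), and "worst" is taken as the mean of the three largest readings.

```python
import math
import statistics


def summarize_measurements(values):
    """Return (mean, standard error, worst) for one attribute's readings."""
    if len(values) < 3:
        raise ValueError("at least 3 measurements are needed for the 'worst' metric")
    mean = statistics.fmean(values)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    # Mean of the three largest readings.
    worst = statistics.fmean(sorted(values)[-3:])
    return mean, standard_error, worst


def build_feature_row(measurements):
    """Map {'radius': [...]} to {'radius_mean': ..., 'radius_se': ..., 'radius_worst': ...}."""
    row = {}
    for attribute, values in measurements.items():
        mean, se, worst = summarize_measurements(values)
        row[f"{attribute}_mean"] = mean
        row[f"{attribute}_se"] = se
        row[f"{attribute}_worst"] = worst
    return row


readings = {"radius": [12.0, 13.0, 14.0, 15.0, 16.0]}
features = build_feature_row(readings)
print(features)
```

The generated keys follow the dataset's column naming (`radius_mean`, `radius_se`, `radius_worst`). Before prediction, the resulting row still has to be reordered to match the column order used in training.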

In [17]:
X_data_csv = {
    'radius'           : [12.3, 12.5, 12.1, 12.6, 12.4, 12.2, 12.7, 12.8, 12.9, 13.0],
    'perimeter'        : [75.2, 76.1, 74.9, 77.0, 75.8, 74.7, 77.5, 78.0, 78.5],
    'area'             : [450.3, 470.3, 440.2, 480.2, 460.1, 430.1, 490.1, 500.1, 510.1, 520.1],
    'compactness'      : [0.14, 0.26, 0.11, 0.25, 0.12, 0.14, 0.25],
    'concavity'        : [0.16, 0.33, 0.12, 0.24, 0.13, 0.15, 0.21, 0.19, 0.21, 0.12],
    'concave points'   : [0.14, 0.43, 0.13, 0.23, 0.15, 0.16],
    'symmetry'         : [0.16, 0.56, 0.15, 0.22, 0.18, 0.17, 0.23, 0.12, 0.29, 0.18],
    'fractal dimension': [0.19, 0.4, 0.18, 0.29, 0.14, 0.16, 0.24],
}
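
A dictionary shaped like the one above could be checked against the validation rules from the deployment strategy before any metrics are computed. The sketch below is only an illustration: the bounds are the example ranges quoted in the text, other attributes are assumed to be dimensionless and to lie between 0 and 1, and `validate_measurements` is a hypothetical name.

```python
# Illustrative bounds taken from the ranges quoted in the text.
EXPECTED_RANGES = {
    "radius": (0.0, 50.0),      # mm
    "perimeter": (0.0, 150.0),  # mm
    "area": (0.0, 2500.0),      # mm²
}
DEFAULT_RANGE = (0.0, 1.0)      # assumed for dimensionless attributes
MIN_READINGS, MAX_READINGS = 5, 10


def validate_measurements(measurements):
    """Return a list of human-readable messages; an empty list means the input passed."""
    messages = []
    for attribute, values in measurements.items():
        # The number of readings must be within the expected range.
        if not MIN_READINGS <= len(values) <= MAX_READINGS:
            messages.append(
                f"{attribute}: expected {MIN_READINGS}-{MAX_READINGS} readings, got {len(values)}"
            )
        low, high = EXPECTED_RANGES.get(attribute, DEFAULT_RANGE)
        for value in values:
            # Reject non-numeric entries (bool is excluded on purpose).
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                messages.append(f"{attribute}: {value!r} is not numeric")
            elif not low <= value <= high:
                messages.append(f"{attribute}: {value} is outside [{low}, {high}]")
        # Identical readings are allowed but flagged as a possible entry error.
        if len(values) > 1 and all(v == values[0] for v in values):
            messages.append(f"{attribute}: warning, all readings are identical")
    return messages


sample = {
    "radius": [12.3, 12.5, 12.1, 12.6, 12.4],
    "symmetry": [0.16, 0.56, 1.5, 0.22],
}
issues = validate_measurements(sample)
for issue in issues:
    print(issue)
```

Returning a list of messages instead of raising on the first problem lets the interface show every issue at once, which matches the goal of giving the user a clear prompt to re-check their entries.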



In [18]:
# analyze noise

# def analyze_noise(self, X_original, X_balanced, have_categoricals: bool = True, scale_data: bool = True):
#    """analyze noise: measuring distances (original vs synthetic)
#        - using KNN to calculate distances"""
#    # encode categoricals for the KNN analysis
#    if have_categoricals:
#       X_original_ = self.encode_categoricals(X_original)
#       X_balanced_ = self.encode_categoricals(X_balanced)
#    else:
#       X_original_ = X_original.copy()
#       X_balanced_ = X_balanced.copy()
#    # scale the data because the unscaled results look odd
#    if scale_data:
#       scaler = StandardScaler()
#       X_original_scaled = scaler.fit_transform(X_original_)  # fit on and transform the original set
#       X_balanced_scaled = scaler.transform(X_balanced_)      # transform the balanced set

#    else:
#       X_original_scaled = X_original_
#       X_balanced_scaled = X_balanced_
#    # identify the synthetic samples
#    synthetic_indices = self.find_synthetic_data(X_original_scaled, X_balanced_scaled)
#    synthetic_data    = X_balanced_scaled[synthetic_indices] # without iloc

#    # compute distances with KNN
#    knn = NearestNeighbors(n_neighbors= 5)
#    knn.fit(X_original_scaled)                     # fit on the original (scaled) data
#    distances, _ = knn.kneighbors(synthetic_data)  # distances (synthetic vs original)

#    print(f"Mean distance between synthetic and original data: {distances.mean():.4f}")
#    return distances

In [19]:
# inspect the range of the 'radius_mean' column
# radius_mean_describe = pd.DataFrame(df_raw['radius_mean'].describe()).T
# radius_mean_describe

In [20]:
# cdf plot (Cumulative Distribution Function)
#def cdf_calc(self, df: pd.DataFrame, col: str, quantity: int, iqr: bool= False)
# prob.cdf_calc(df= df_raw, col= 'radius_mean', quantity= 13, iqr= True)

In [21]:
#loader.save_dataframe(df= df_raw, file_name= 'clean_breast_disease_00', dir= 'clean')