# Predicting treatment outcome in breast cancer patients

The data in `breast_cancer_data.tsv` describe tumors of breast cancer patients through 18 variables, all measured before treatment:

- `Age`: age (in years)
- `Menopausal Status`: 0=pre-menopause, 1=post-menopause
- `T Stage`: stage (0-4) of the tumor according to [the TNM classification](https://fr.wikipedia.org/wiki/Classification_TNM)
- `N Stage`: stage (0-2) of lymph node involvement according to [the TNM classification](https://fr.wikipedia.org/wiki/Classification_TNM)
- `ER Status`: whether the tumor has estrogen receptors (0=negative, 1=positive)
- `PR Status`: whether the tumor has progesterone receptors (0=negative, 1=positive)
- `Ki67 25%`: whether the proliferation marker Ki67 is overexpressed (0=negative, 1=positive)
- `TILs 30%`: whether lymphocytic infiltration is high (0=negative, 1=positive)
- `Breast Density`: breast density according to [the BI-RADS classification](https://en.wikipedia.org/wiki/BI-RADS) (0=A, 1=B, 2=C, 3=D)
- `US LN Cortex`: ultrasound (_US_) assessment of the cortex of the lymph nodes (_LN_) (0=Thin, 1=Thickened)
- `Intratumoral high SI on T2`: assessment of high signal intensity (_high SI_) within the tumor on [T2*-weighted MRI](https://en.wikipedia.org/wiki/T2*-weighted_imaging) (0=absent, 1=present)
- `Peritumoral Edema`: peritumoral edema (0=absent, 1=present)
- `Prepectoral Edema`: prepectoral edema (0=absent, 1=present)
- `Subcutaneous Edema`: subcutaneous edema (0=absent, 1=present)
- `Multifocality`: multifocality, i.e. presence of additional sites of malignancy (0=absent, 1=present)
- `Maximal MR Size`: size of the tumor mass (estimated on MRI, along its longest axis)
- `Index Lesion MR Size`: size of the index lesion (estimated on MRI, along its longest axis)
- `Size of Largest LN metastasis (mm)`: size of the largest lymph node metastasis (in mm)

These variables belong to 3 different modalities:
- clinical variables: `Age`, `Menopausal Status`, `T Stage`, `N Stage`
- histological variables: `ER Status`, `PR Status`, `Ki67 25%`, `TILs 30%`
- imaging variables: `Breast Density`, `US LN Cortex`, `Intratumoral high SI on T2`, `Peritumoral Edema`, `Prepectoral Edema`, `Subcutaneous Edema`, `Multifocality`, `Maximal MR Size`, `Index Lesion MR Size`, `Size of Largest LN metastasis (mm)`.
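
For the modality analysis, it can help to write this grouping down once as a dictionary. The sketch below uses the 18 column names as they appear in the data file; the names `MODALITIES`, `ALL_COLUMNS` and `columns_without` are hypothetical helpers, not part of the provided material.

```python
# Column names as they appear in breast_cancer_data.tsv, grouped by modality.
# MODALITIES, ALL_COLUMNS and columns_without are illustrative names (assumption).
MODALITIES = {
    "clinical": ["Age", "Menopausal Status", "T Stage", "N Stage"],
    "histological": ["ER Status", "PR Status", "Ki67 25%", "TILs 30%"],
    "imaging": [
        "Breast Density", "US LN Cortex", "Intratumoral high SI on T2",
        "Peritumoral Edema", "Prepectoral Edema", "Subcutaneous Edema",
        "Multifocality", "Maximal MR Size", "Index Lesion MR Size",
        "Size of Largest LN metastasis (mm)",
    ],
}

# Flat list of all columns, in modality order.
ALL_COLUMNS = [col for cols in MODALITIES.values() for col in cols]


def columns_without(modality):
    """Return every column except those belonging to the given modality."""
    excluded = set(MODALITIES[modality])
    return [col for col in ALL_COLUMNS if col not in excluded]
```

With the data loaded in a DataFrame, `df_data[columns_without("imaging")]` would then select the clinical and histological variables only.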

The file `breast_cancer_pcr.tsv` contains the variable (label) `pCR Status` (_pCR = pathological complete response_), which indicates whether or not the chemotherapy treatment given eliminated the tumor.

The goal of the project is to assess whether the proposed variables make it possible to predict the response `pCR Status`.

## Instructions
1. Compare the performance of at least two non-linear learning algorithms on this problem. (Justify the choice of these algorithms.)

__Pay attention to:__
- _data leakage_
- the scales of the different variables
- selecting the most relevant hyperparameters appropriately
- using an appropriate performance measure (justify your choice)
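
One common way to address the first two points together is to put the scaler inside a scikit-learn pipeline, so that it is fitted only on the training part of each cross-validation fold. The sketch below is a minimal illustration on synthetic data (an assumption, since the real files are not read here); `X_demo` and `y_demo` are hypothetical names, and ROC AUC is just one possible metric, chosen here because it does not depend on a decision threshold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption): 243 samples, 18 features
# on very different scales, and a random binary label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(243, 18)) * rng.uniform(1, 50, size=18)
y_demo = (rng.random(243) < 0.3).astype(int)

# The scaler is a pipeline step, so it is re-fitted on the training part of
# each fold and never sees the held-out samples (no leakage from scaling).
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Stratified folds keep the class proportions similar across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
print(f"ROC AUC: {auc_scores.mean():.3f} +/- {auc_scores.std():.3f}")
```

Scaling the whole dataset before splitting would instead let statistics from the test folds influence training, which is one form of data leakage.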

2. Assess the usefulness of each modality: for example, does performance degrade if the imaging variables are left out?

Remember to comment on and interpret the results.
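
The modality question in point 2 can be approached as an ablation: re-run the same cross-validation with one group of columns removed and compare the scores. The following is a rough sketch under stated assumptions: the data are synthetic (so the printed scores mean nothing), and `modalities`, `mean_auc` and `results` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Column groups, using the column names of breast_cancer_data.tsv.
modalities = {
    "clinical": ["Age", "Menopausal Status", "T Stage", "N Stage"],
    "histological": ["ER Status", "PR Status", "Ki67 25%", "TILs 30%"],
    "imaging": [
        "Breast Density", "US LN Cortex", "Intratumoral high SI on T2",
        "Peritumoral Edema", "Prepectoral Edema", "Subcutaneous Edema",
        "Multifocality", "Maximal MR Size", "Index Lesion MR Size",
        "Size of Largest LN metastasis (mm)",
    ],
}
all_columns = [c for cols in modalities.values() for c in cols]

# Synthetic stand-in for df_data and the label (assumption).
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.normal(size=(243, len(all_columns))), columns=all_columns)
y_demo = rng.integers(0, 2, size=243)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0)


def mean_auc(columns):
    """Mean cross-validated ROC AUC using only the given columns."""
    return cross_val_score(clf, df_demo[columns], y_demo, cv=cv, scoring="roc_auc").mean()


# Baseline with all variables, then one run per removed modality.
results = {"all": mean_auc(all_columns)}
for name, cols in modalities.items():
    kept = [c for c in all_columns if c not in cols]
    results[f"without {name}"] = mean_auc(kept)

for name, auc in results.items():
    print(f"{name}: AUC = {auc:.3f}")
```

Using the same folds for every run keeps the comparison between column subsets as fair as possible.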

## Useful libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Loading the data

In [2]:
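# Features: one row per patient, 18 numeric columns
df_data = pd.read_csv("breast_cancer_data.tsv", sep="\t")
X = df_data.to_numpy()

# Label: take the single `pCR Status` column as a 1-D array, as scikit-learn expects
df_pcr = pd.read_csv("breast_cancer_pcr.tsv", sep="\t")
y = df_pcr["pCR Status"].to_numpy()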

In [8]:
print(df_data.columns, df_pcr.columns)
df_data.describe()

Index(['Age', 'Menopausal Status', 'T Stage', 'N Stage', 'ER Status',
       'PR Status', 'Ki67 25%', 'TILs 30%', 'Breast Density', 'US LN Cortex',
       'Intratumoral high SI on T2', 'Peritumoral Edema', 'Prepectoral Edema',
       'Subcutaneous Edema', 'Multifocality', 'Maximal MR Size',
       'Index Lesion MR Size', 'Size of Largest LN metastasis (mm)'],
      dtype='object') Index(['pCR Status'], dtype='object')


| | Age | Menopausal Status | T Stage | N Stage | ER Status | PR Status | Ki67 25% | TILs 30% | Breast Density | US LN Cortex | Intratumoral high SI on T2 | Peritumoral Edema | Prepectoral Edema | Subcutaneous Edema | Multifocality | Maximal MR Size | Index Lesion MR Size | Size of Largest LN metastasis (mm) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 243.0 | 243.0 | 243.0 | 243.0 | 243.0 | 243.0 | 243.0 | 243.0 | 243.0 | 243.0 | 243.0 | 243.0 | 243.0 | 243.0 | 243.0 | 243.0 | 243.0 | 243.0 |
| mean | 47.493827 | 0.358025 | 1.102881 | 0.477366 | 0.386831 | 0.288066 | 0.868313 | 0.296296 | 1.563786 | 0.54321 | 0.366255 | 0.683128 | 0.333333 | 0.144033 | 0.288066 | 38.888889 | 30.452675 | 1.213992 |
| std | 11.434098 | 0.480409 | 0.761762 | 0.532519 | 0.48803 | 0.453797 | 0.338848 | 0.457566 | 0.786669 | 0.499158 | 0.482775 | 0.466218 | 0.472377 | 0.351848 | 0.453797 | 21.994928 | 12.541206 | 3.106161 |
| min | 24.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 7.0 | 0.0 |
| 25% | 39.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 24.0 | 21.5 | 0.0 |
| 50% | 46.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 32.0 | 28.0 | 0.0 |
| 75% | 55.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 46.5 | 36.0 | 0.0 |
| max | 74.0 | 1.0 | 4.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 146.0 | 85.0 | 20.0 |


Ideas:
- look at the data: distribution of the two classes
- standardize? (age is in years, etc.) Not needed for the RF, though
- one-hot encoding of the stages and breast density
- first choice: RF
- second: SVM with RBF kernel (non-linear)?
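
The first three ideas could look roughly like this. It is only a sketch on a tiny hand-made table (an assumption, not the real data): `df_demo`, `y_demo` and `preprocess` are hypothetical names. Since the stages are ordered, keeping them as integers instead of one-hot encoding them is also a defensible choice.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hand-made stand-in for a few of the real columns (assumption).
df_demo = pd.DataFrame({
    "Age": [35, 52, 47, 61],
    "T Stage": [1, 2, 1, 4],
    "Breast Density": [0, 1, 2, 3],
    "ER Status": [0, 1, 1, 0],
})
y_demo = pd.Series([0, 1, 0, 0], name="pCR Status")

# Idea 1: proportion of each class in the label.
print(y_demo.value_counts(normalize=True))

# Ideas 2 and 3: standardize the continuous column, one-hot encode the
# multi-level columns, and pass the binary column through unchanged.
preprocess = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), ["Age"]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["T Stage", "Breast Density"]),
    ],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X_encoded = preprocess.fit_transform(df_demo)
print(X_encoded.shape)
```

In a real evaluation this transformer would be fitted inside each cross-validation fold, for example as the first step of a pipeline, rather than on the full table.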

The two models compared on this dataset are a Random Forest classifier and a Support Vector Machine (SVM) with an RBF kernel. Both are non-linear, as the instructions require. The justification for each is as follows.

Random Forest classifier:

- Strengths: handles a mix of continuous, ordinal and binary variables without rescaling, stays robust to overfitting as the number of trees grows, and can capture complex interactions between features.
- Use case: suitable for datasets with many features and complex relationships between them.

Support Vector Machine (SVM) with RBF kernel:

- Strengths: effective in high-dimensional spaces, works well when the classes are separated by a clear margin, and the RBF kernel can model non-linear relationships. Because the kernel depends on distances between samples, the features need to be standardized first.
- Use case: well suited to datasets where the decision boundary is not linear.

Justification for using both models:

Complementary strengths: Random Forests cope well with many heterogeneous features and provide feature importances, which helps interpretation. SVMs with an RBF kernel are powerful for capturing smooth non-linear relationships. Using both makes it possible to see which kind of model suits the data better.

Ensemble approach: combining the predictions of both models, for example by voting or stacking, can give a more robust predictor, because each model can compensate for the other's errors when those errors are not strongly correlated.
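
As an illustration of the voting variant, here is a minimal sketch using scikit-learn's `VotingClassifier` with soft voting (averaged predicted probabilities). The data come from `make_classification` and stand in for the real files (an assumption); `ensemble` and `ensemble_auc` are hypothetical names, and nothing here is tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the real data (assumption).
X_demo, y_demo = make_classification(
    n_samples=243, n_features=18, n_informative=6, random_state=0
)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        # probability=True is required so the SVM can take part in soft voting;
        # the scaler is bundled with the SVM because only the SVM needs it.
        ("svm", make_pipeline(
            StandardScaler(),
            SVC(kernel="rbf", probability=True, random_state=0),
        )),
    ],
    voting="soft",
)

ensemble_auc = cross_val_score(ensemble, X_demo, y_demo, cv=5, scoring="roc_auc")
print(f"Ensemble ROC AUC: {ensemble_auc.mean():.3f}")
```

Whether the ensemble actually beats the individual models has to be checked on the real data with the same cross-validation scheme.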

Hyperparameter tuning: both models are tuned with GridSearchCV, so the comparison is between the best cross-validated configuration of each model rather than between arbitrary default settings.
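
A possible shape for this tuning step is sketched below, with the grid search nested inside an outer cross-validation so that the reported score is not computed on data used to pick the hyperparameters. The data are synthetic and the grids are deliberately small placeholders (both assumptions); `searches` and `nested_auc` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the real data (assumption).
X_demo, y_demo = make_classification(
    n_samples=243, n_features=18, n_informative=6, random_state=0
)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # picks hyperparameters
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # estimates performance

searches = {
    "Random Forest": GridSearchCV(
        RandomForestClassifier(n_estimators=50, random_state=0),
        param_grid={"max_depth": [3, None], "min_samples_leaf": [1, 5]},
        cv=inner_cv,
        scoring="roc_auc",
    ),
    "SVM (RBF)": GridSearchCV(
        # Scaling is a pipeline step, so it is re-fitted within every fold.
        Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))]),
        param_grid={"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]},
        cv=inner_cv,
        scoring="roc_auc",
    ),
}

# Each outer fold runs a full grid search on its training part only.
nested_auc = {
    name: cross_val_score(search, X_demo, y_demo, cv=outer_cv, scoring="roc_auc")
    for name, search in searches.items()
}
for name, scores in nested_auc.items():
    print(f"{name}: ROC AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```

With only 243 samples, the spread across outer folds is worth reporting alongside the mean, since small differences between the two models may not be meaningful.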

Diverse decision boundaries: a Random Forest builds its boundary from axis-aligned splits, whereas an RBF-kernel SVM produces a smooth boundary. This diversity is useful when the data distribution is complex and not easily separated by a single type of model.

In summary, comparing a Random Forest and an RBF-kernel SVM covers two different families of non-linear models and takes advantage of their respective strengths. This can lead to more accurate and more robust predictions than relying on a single model type.