"""
<h1 align="center"><font color='green'> 🧠 Exercise 2: Brain Tumor Classification Task</font></h1>
<h4 align="left"> <font color='purple'> This project addresses a clinical classification task: predicting the outcome of an MRI scan (Positive or Negative) from patient and tumor characteristics. The main aim is to evaluate Logistic Regression, Random Forest, and SVM algorithms to determine the most reliable model.
Preprocessing includes encoding categorical data, scaling numerical features, and addressing class imbalance through class weights. Models are evaluated using Accuracy, Precision, Recall, and F1-score, and visualized with Confusion Matrices, ROC curves, and Precision-Recall curves. </font></h4>


In [None]:
#importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

#Classification Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, auc, precision_recall_curve, ConfusionMatrixDisplay
from scipy.interpolate import make_interp_spline
import warnings

#suppressing a specific warning related to model fitting
warnings.filterwarnings('ignore', category=UserWarning)



<div style="
    background-color: #ffd6eb;
    border-left: 6px solid #ff69b4;
    padding: 20px;
    border-radius: 10px;
    color: #880e4f;
    font-family: 'Helvetica Neue', sans-serif;
    box-shadow: 0 2px 6px rgba(255, 105, 180, 0.3);
">
    <h1 align="center">
        <font color='gray'>Data Loading and Preprocessing</font>
    </h1>
    <h4 align="center">
        <font color='blue'>
            This section focuses on cleaning the dataset and transforming features so they are ready for modeling. The steps are: mapping the string-based target variable (MRI_Result) to numerical values (1 and 0), separating the features into numeric and categorical groups, and splitting the data into 75% training and 25% testing sets using stratified sampling so that the proportion of positive to negative cases is preserved in both sets.
Lastly, a 'ColumnTransformer' is applied, which handles Standard Scaling for the numerical features and One-Hot Encoding for the nominal categorical features.
        </font>
    </h4>
</div>
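
To illustrate what stratified sampling guarantees, here is a minimal pure-Python sketch on toy labels (assumed for illustration, not taken from the dataset). The helper `stratified_split` is hypothetical and is not how scikit-learn implements `stratify=y`; it only shows the idea of drawing the test fraction separately within each class.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx), drawing the test fraction within each class."""
    rng = random.Random(seed)
    # Group row indices by their class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (assumption): 60 negatives and 40 positives
labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test:", dict(test_counts))
```

Because the 25% is taken from each class separately, the share of positives is the same in both subsets, which is the property the split in this notebook relies on.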

In [None]:
# --- Data Loading and Preprocessing ---

df = pd.read_csv("/kaggle/input/braintumor/brain_tumor_dataset.csv")

#dropping the non-predictive identifier column (cleaning)
df.drop('Patient_ID', axis=1, inplace=True)

#My target variable is 'MRI_Result' (Positive/Negative)
df.rename(columns={'MRI_Result': 'Diagnosis_Result'}, inplace=True)

#Map 'Positive' to 1 (tumor indicated) and 'Negative' to 0 (no tumor indicated)
df['Diagnosis_Result'] = df['Diagnosis_Result'].map({'Positive': 1, 'Negative': 0})

print("Loaded Dataset Shape:", df.shape)
print("\nClass Distribution (Target Variable 'Diagnosis_Result'):")
print(df['Diagnosis_Result'].value_counts())
print("\nFirst 5 rows after initial cleanup:")
print(df.head())

y = df['Diagnosis_Result']
X = df.drop('Diagnosis_Result', axis=1)

#Numeric features are standard-scaled; the categorical features listed below are one-hot encoded
numeric_features = ['Age', 'Tumor_Size', 'Survival_Rate', 'Tumor_Growth_Rate']

categorical_features = [
    'Gender', 'Tumor_Type', 'Location', 'Histology', 'Stage',
    'Symptom_1', 'Symptom_2', 'Symptom_3', 'Radiation_Treatment',
    'Surgery_Performed', 'Chemotherapy', 'Family_History', 'Follow_Up_Required'
]

# --- Stratified Train-Test Split ---
#stratify=y to ensure both train/test sets maintain the original class proportion.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

print(f"\nTrain set size: {X_train.shape[0]} records")
print(f"Test set size: {X_test.shape[0]} records")


In [None]:
# --- Preprocessing - Scaling and Encoding  ---

numeric_transformer = StandardScaler() #Standard Scaling for numerical consistency
categorical_transformer = OneHotEncoder(handle_unknown='ignore') #One-Hot Encoding for nominal data

#column transformer that applies the correct transformation to each column set
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough',
    sparse_threshold=0  #always return a dense array so it can be wrapped in a DataFrame below
)

X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

print("\nPreprocessing Complete: Categorical features encoded and numerical features scaled.")

#feature names after one-hot encoding
feature_names = numeric_features + list(preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features))

X_train_df = pd.DataFrame(X_train_processed, columns=feature_names)
X_test_df = pd.DataFrame(X_test_processed, columns=feature_names)



<div style="
    background-color: #ffd6eb;
    border-left: 6px solid #ff69b4;
    padding: 20px;
    border-radius: 10px;
    color: #880e4f;
    font-family: 'Helvetica Neue', sans-serif;
    box-shadow: 0 2px 6px rgba(255, 105, 180, 0.3);
">
    <h1 align="center">
        <font color='gray'>Model Implementation and Evaluation</font>
    </h1>
    <h4 align="center">
        <font color='blue'>
            With the data fully processed, I now train three classification algorithms (Logistic Regression, Random Forest, and SVM).
To account for the clinical nature of this problem and potential class imbalance, I set <code>class_weight='balanced'</code> in all models so that errors on the less frequent class are weighted more heavily.
        </font>
    </h4>
</div>
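
As a rough illustration of what `class_weight='balanced'` does, the sketch below applies the formula documented by scikit-learn for this mode, `n_samples / (n_classes * count_of_class)`, in plain Python. The function name `balanced_class_weights` and the toy label counts are assumptions made for this example only.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy imbalanced labels (assumption): 70 negatives and 30 positives
imbalanced = [0] * 70 + [1] * 30
weights = balanced_class_weights(imbalanced)
print("imbalanced:", weights)

# With perfectly balanced labels, every class gets weight 1.0
balanced = [0] * 50 + [1] * 50
print("balanced:", balanced_class_weights(balanced))
```

The rarer class receives the larger weight. When the classes are already close to balanced, both weights are close to 1 and the setting changes the fitted models very little.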


In [None]:
# --- Model Implementation and Evaluation ---

#Interpretable linear baseline (Logistic Regression)
log_reg = LogisticRegression(random_state=42, solver='liblinear', class_weight='balanced') 
log_reg.fit(X_train_df, y_train)

#Robust non-linear ensemble model (Random Forest Classifier)
rand_forest = RandomForestClassifier(random_state=42, class_weight='balanced', max_depth=10) 
rand_forest.fit(X_train_df, y_train)

#Support Vector Machine (SVM), needs probability estimates for ROC/PR curves, so setting probability=True
svm_model = SVC(random_state=42, probability=True, kernel='rbf', class_weight='balanced')
svm_model.fit(X_train_df, y_train)

models = {
    "Logistic Regression": log_reg,
    "Random Forest": rand_forest,
    "SVM": svm_model
}

results = []

#Loop to calculate metrics
for name, model in models.items():
    y_pred = model.predict(X_test_df)

    #Calculate key classification metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred, zero_division=0)

    results.append({
        'Model': name,
        'Accuracy': f"{accuracy:.3f}",
        'Precision': f"{precision:.3f}",
        'Recall': f"{recall:.3f}",
        'F1-Score': f"{f1:.3f}",
    })

results_df = pd.DataFrame(results)


<div style="
    background-color: #ffd6eb;
    border-left: 6px solid #ff69b4;
    padding: 20px;
    border-radius: 10px;
    color: #880e4f;
    font-family: 'Helvetica Neue', sans-serif;
    box-shadow: 0 2px 6px rgba(255, 105, 180, 0.3);
">
    <h1 align="center">
        <font color='gray'>Visualization and Interpretation</font>
    </h1>
    <h4 align="center">
        <font color='blue'>
            The final step is to visualize and interpret the results in order to draw a conclusion. I plot the Confusion Matrix for the best-performing model (Random Forest) to show the number of False Positives and False Negatives, which is vital for clinical interpretation.
I then generate the ROC and Precision-Recall curves for all three models to assess their performance on the positive class.
        </font>
    </h4>
</div>
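
To make the link between the confusion matrix and the clinical metrics explicit, here is a small sketch that counts the four cells by hand and derives Recall and Precision from them. The helper `confusion_counts` and the toy predictions are hypothetical; the return order (TN, FP, FN, TP) is chosen to match what `confusion_matrix(y_true, y_pred).ravel()` yields for binary 0/1 labels.

```python
def confusion_counts(y_true, y_pred):
    """Count the four confusion-matrix cells for binary labels (1 = positive)."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1  # tumor case correctly flagged
        elif truth == 1 and pred == 0:
            fn += 1  # missed diagnosis
        elif truth == 0 and pred == 1:
            fp += 1  # false alarm
        else:
            tn += 1  # negative case correctly cleared
    return tn, fp, fn, tp


# Toy labels and predictions (assumption, not model output)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
recall = tp / (tp + fn)        # share of true positives that were found
precision = tp / (tp + fp)     # share of positive predictions that were right
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Recall={recall:.3f} Precision={precision:.3f}")
```

False Negatives lower Recall and False Positives lower Precision, which is why both cells are read directly off the matrix in the interpretation.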


In [None]:
# --- Visualization and Interpretation ---

print("\n--- Model Performance Comparison (Test Set) ---")
print(results_df.to_markdown(index=False))

# --- Visualizations 1,2,3: Confusion Matrix for the Best Model (Random Forest), ROC Curve and Precision-Recall Curve ---
best_model_name = "Random Forest"
best_model = rand_forest
y_pred_best = best_model.predict(X_test_df)

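#create the axes first and pass them in, otherwise from_estimator opens a second figure and leaves this one empty
fig, ax = plt.subplots(figsize=(6, 5))
ConfusionMatrixDisplay.from_estimator(best_model, X_test_df, y_test, cmap=plt.cm.Blues, display_labels=['Negative (0)', 'Positive (1)'], ax=ax)
ax.set_title(f'Confusion Matrix: {best_model_name} (Best F1-Score)')
ax.grid(False)
plt.show()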

plt.figure(figsize=(15, 6))

#ROC Curve 
plt.subplot(1, 2, 1)
plt.plot([0, 1], [0, 1], 'k--', label='Baseline (AUC = 0.50)')

for name, model in models.items():
    #obtaining positive-class probability estimates (available for SVC because probability=True)
    y_prob = model.predict_proba(X_test_df)[:, 1]

    fpr, tpr, thresholds = roc_curve(y_test, y_prob)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.3f})')

plt.title('Receiver Operating Characteristic (ROC) Curve', fontsize=14)
plt.xlabel('False Positive Rate (1 - Specificity)', fontsize=12)
plt.ylabel('True Positive Rate (Recall)', fontsize=12)
plt.legend(loc="lower right")
plt.grid(True, linestyle='--')


#Precision-Recall Curve
plt.subplot(1, 2, 2)
#Calculate baseline
no_skill = len(y_test[y_test==1]) / len(y_test)
plt.plot([0, 1], [no_skill, no_skill], linestyle='--', label=f'Baseline (Precision = {no_skill:.3f})')

for name, model in models.items():
    #positive-class probability estimates, same call for all three models
    y_prob = model.predict_proba(X_test_df)[:, 1]

    #Calculate Precision-Recall curve
    precision, recall, _ = precision_recall_curve(y_test, y_prob)
    plt.plot(recall, precision, label=name)

plt.title('Precision-Recall Curve', fontsize=14)
plt.xlabel('Recall (True Positive Rate)', fontsize=12)
plt.ylabel('Precision (Positive Predictive Value)', fontsize=12)
plt.legend(loc='lower left')
plt.grid(True, linestyle='--')

plt.tight_layout()
plt.show()


<div style="
    background-color: #ffd6eb;
    border-left: 6px solid #ff69b4;
    padding: 20px;
    border-radius: 10px;
    color: #880e4f;
    font-family: 'Helvetica Neue', sans-serif;
    box-shadow: 0 2px 6px rgba(255, 105, 180, 0.3);
">
    <h1 align="center">
        <font color='gray'>Final Summary and Interpretation (Brain Tumor Diagnosis)</font>
    </h1>
    <h4 align="left">
        <font color='#006699'>
The goal of this classification task was to build a reliable model for predicting a positive MRI result, indicating a likely brain tumor. I used a clinical dataset of approximately 795 patient records with features like age, tumor characteristics, histology, and symptoms.

<h3 align="center">
        <font color='gray'>Preprocessing and Imbalance Handling</font>
    </h3>

I began with data preparation. I defined my target variable as Diagnosis_Result (renamed from the original MRI_Result) and mapped 'Positive' cases to 1 and 'Negative' cases to 0.
I applied stratified splitting and balanced class weights, since clinical datasets are often imbalanced.
1. Stratified Splitting: This ensured that both my training and testing sets reproduce the original distribution of positive and negative cases.
2. Balanced Class Weights: I set `class_weight='balanced'` in all three algorithms, which weights each class inversely to its frequency, so that errors on the less frequent class (for example, a missed positive MRI, a False Negative, when positives are the minority) are penalized more heavily.

Moreover, I used Standard Scaling for the numerical features, so that no feature dominates purely because of its units, and One-Hot Encoding for the numerous categorical features.

<h3 align="center">
        <font color='gray'>Model Implementation and Evaluation </font>
    </h3>

For model implementation and evaluation, I used Logistic Regression, Random Forest, and SVM. The clinical priority is absolute: Recall is paramount. I must find as many true positive cases as possible to avoid a missed diagnosis. My evaluation, therefore, prioritizes the F1-Score, as it balances this high recall goal with necessary precision.

| Model | Accuracy | Precision | Recall | F1-Score |
| :--- | :--- | :--- | :--- | :--- |
| Random Forest | 0.565 | 0.530 | 0.490 | 0.508 |
| Logistic Regression | 0.540 | 0.511 | 0.470 | 0.490 |
| SVM | 0.555 | 0.525 | 0.485 | 0.504 |

 <h3 align="center">
        <font color='gray'> Interpretation</font>
    </h3>

The Random Forest Classifier achieved the highest F1-score (0.508), but all three models are only marginally better than random chance, which corresponds to an F1-score of about 0.50 for a coin-flip classifier when the classes are roughly balanced. This indicates low predictive power with the current features.

The ROC curves showed all models clustered very close to the no-skill line (AUC $\approx 0.55$), indicating weak overall discriminatory power. Furthermore, the Precision-Recall curves stayed close to the baseline: the models could not maintain high precision when pushed toward maximum recall.

 <h3 align="center">
        <font color='gray'> Limitations</font>
    </h3>
1. Underfitting: Given the low scores across all metrics, the models are most likely underfitting the data. This means the existing features, as scaled and encoded, are not providing enough signal, and the models remain too close to random guessing.
2. Feature Complexity: I handled all the categorical features with One-Hot Encoding, but the large number of resulting columns might be drowning out my strongest numerical predictors, which would also contribute to underfitting.

Overall, the current iteration of the models is not yet reliable. The Random Forest stands out slightly, but further feature engineering and hyperparameter tuning are needed to raise the F1-score.
        </font>
    </h4>
</div>
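
The chance-level reference used in the interpretation can be checked with a short calculation. For a classifier that ignores the features and predicts "positive" with a fixed probability, expected precision is approximately the prevalence of the positive class and expected recall is approximately that probability, so the F1-score is about their harmonic mean. The function `chance_f1` is a hypothetical helper, and the prevalence of 0.5 is an assumption rather than a value measured from the dataset.

```python
def chance_f1(prevalence, positive_rate):
    """Approximate F1 of a feature-blind classifier.

    prevalence:    share of truly positive cases (expected precision)
    positive_rate: probability of predicting positive (expected recall)
    """
    if prevalence + positive_rate == 0:
        return 0.0
    return 2 * prevalence * positive_rate / (prevalence + positive_rate)


# Assumed prevalence of 0.5 (roughly balanced classes)
coin_flip = chance_f1(prevalence=0.5, positive_rate=0.5)
always_positive = chance_f1(prevalence=0.5, positive_rate=1.0)

print(f"Coin flip F1:       {coin_flip:.3f}")
print(f"Always-positive F1: {always_positive:.3f}")
```

Under this assumption a coin flip gives an F1-score of 0.50, and labelling every patient as positive gives a higher F1-score than any of the three models in the table. This is consistent with the conclusion that the current features carry little signal, and it suggests reporting a trivial baseline alongside the trained models.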



