# Facing the Dragon
**Names:** Bruna Guedes Pereira, Laura Medeiros Dal Ponte, and Mariana Melo Pereira

In [1]:
!pip install ucimlrepo

Defaulting to user installation because normal site-packages is not writeable
Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7





In [1]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
dataFrame = fetch_ucirepo(id=759) 
# metadata 
print(dataFrame.metadata) 
  
# variable information 
print(dataFrame.variables) 

{'uci_id': 759, 'name': 'Glioma Grading Clinical and Mutation Features', 'repository_url': 'https://archive.ics.uci.edu/dataset/759/glioma+grading+clinical+and+mutation+features+dataset', 'data_url': 'https://archive.ics.uci.edu/static/public/759/data.csv', 'abstract': 'Gliomas are the most common primary tumors of the brain. They can be graded as LGG (Lower-Grade Glioma) or GBM (Glioblastoma Multiforme) depending on the histological/imaging criteria. Clinical and molecular/mutation factors are also very crucial for the grading process. Molecular tests are expensive to help accurately diagnose glioma patients.    In this dataset, the most frequently mutated 20 genes and 3 clinical features are considered from TCGA-LGG and TCGA-GBM brain glioma projects.  The prediction task is to determine whether a patient is LGG or GBM with a given clinical and molecular/mutation features. The main objective is to find the optimal subset of mutation genes and clinical features for the glioma grading 

## Importing the libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder

## Loading the data

In [3]:
# Load the dataset
dataFrame = pd.read_csv("TCGA_InfoWithGrade.csv")

print(dataFrame)

     Grade  Gender  Age_at_diagnosis  Race  IDH1  TP53  ATRX  PTEN  EGFR  CIC  \
0        0       0             51.30     0     1     0     0     0     0    0   
1        0       0             38.72     0     1     0     0     0     0    1   
2        0       0             35.17     0     1     1     1     0     0    0   
3        0       1             32.78     0     1     1     1     0     0    0   
4        0       0             31.51     0     1     1     1     0     0    0   
..     ...     ...               ...   ...   ...   ...   ...   ...   ...  ...   
834      1       1             77.89     0     0     0     0     1     0    0   
835      1       0             85.18     0     0     1     0     1     0    0   
836      1       1             77.49     0     0     1     0     1     0    0   
837      1       0             63.33     0     0     1     0     0     0    0   
838      1       0             76.61     1     0     0     0     0     0    0   

     ...  FUBP1  RB1  NOTCH

## Data preprocessing

### Replacing missing values

In [4]:
dataFrame = dataFrame.replace({'--': np.nan, 'not reported': np.nan})
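
To see what this replacement does, the sketch below applies the same call to a small made-up frame and counts the resulting missing values per column. The column values are invented for illustration and are not taken from the dataset. The CSV printed above already appears to be fully numeric, so on that file the call may well change nothing.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Race": ["white", "not reported", "asian", "--"],
    "Age_at_diagnosis": ["51.3", "--", "35.17", "32.78"],
})

# Same replacement as in the cell above
toy_clean = toy.replace({"--": np.nan, "not reported": np.nan})

# Count missing values per column to confirm the placeholders were converted
missing_per_column = toy_clean.isna().sum()
print(missing_per_column)
```

Running `dataFrame.isna().sum()` after the replacement gives the same kind of per-column summary on the real data.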

### Encoding categorical variables

In [6]:
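label_encoders = {}
# Gender and Race, plus any remaining text (object) columns except the target.
# dict.fromkeys removes duplicates while preserving order, in case Gender or
# Race are themselves object columns and would otherwise be listed twice.
categorical_cols = list(dict.fromkeys(
    ['Gender', 'Race']
    + [col for col in dataFrame.columns if dataFrame[col].dtype == 'object' and col != 'Grade']
))
for col in categorical_cols:
    le = LabelEncoder()
    dataFrame[col] = le.fit_transform(dataFrame[col])
    label_encoders[col] = le  # keep the fitted encoder so codes can be decoded later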
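
Keeping each fitted encoder in `label_encoders` makes it possible to map the integer codes back to the original labels later, for example when reading model explanations. A minimal sketch on an invented column (the labels below are assumptions, not values from the dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column, invented for illustration only
toy = pd.DataFrame({"Race": ["white", "asian", "white", "black or african american"]})

label_encoders = {}
le = LabelEncoder()
toy["Race"] = le.fit_transform(toy["Race"])
label_encoders["Race"] = le

# LabelEncoder assigns integer codes in sorted label order
codes = toy["Race"].tolist()

# Recover the original labels from the integer codes
decoded = label_encoders["Race"].inverse_transform(toy["Race"]).tolist()
print(codes, decoded)
```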

### Defining target and features

In [7]:
X = dataFrame.drop(columns=['Grade'])
y = dataFrame['Grade']

### Standardizing the data

In [8]:
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

### Splitting into training and test data

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42, stratify=y)
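
In the cells above the scaler is fitted on the full dataset before the split, so the test rows contribute to the means and standard deviations used for scaling. A common alternative is to split first and wrap the scaler and the model in a pipeline, so the scaler is fitted on the training rows only. The sketch below shows this on synthetic data from `make_classification`, which is an assumption standing in for the glioma CSV.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (not the glioma data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits StandardScaler on the training rows only,
# then reuses those statistics when transforming the test rows
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```

With only 839 rows the effect on the reported metrics is probably small, but the pipeline form also plugs directly into cross-validation.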

## Instantiating the chosen models

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, classification_report

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(probability=True),
    'XGBoost': XGBClassifier(eval_metric='logloss')
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    results[name] = {
        'Accuracy': accuracy_score(y_test, y_pred),
        'ROC-AUC': roc_auc_score(y_test, y_proba),
        'F1-Score': f1_score(y_test, y_pred),
        'Classification Report': classification_report(y_test, y_pred)
    }

In [34]:
for nome in results:
    print(nome)
    print('Accuracy:', results[nome]['Accuracy'])
    print('ROC-AUC:', results[nome]['ROC-AUC'])
    print('F1-Score:', results[nome]['F1-Score'])
    print('Classification Report:')
    print(results[nome]['Classification Report'])
    print('-------------------------------------------------------')

Logistic Regression
Accuracy: 0.8630952380952381
ROC-AUC: 0.9173469387755102
F1-Score: 0.8413793103448276
Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.86      0.88        98
           1       0.81      0.87      0.84        70

    accuracy                           0.86       168
   macro avg       0.86      0.86      0.86       168
weighted avg       0.87      0.86      0.86       168

-------------------------------------------------------
Random Forest
Accuracy: 0.7797619047619048
ROC-AUC: 0.905466472303207
F1-Score: 0.7338129496402878
Classification Report:
              precision    recall  f1-score   support

           0       0.81      0.82      0.81        98
           1       0.74      0.73      0.73        70

    accuracy                           0.78       168
   macro avg       0.77      0.77      0.77       168
weighted avg       0.78      0.78      0.78       168

---------------------------------------

## Investigating hyperparameter sets
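
One possible way to explore hyperparameters is a grid search. The sketch below tunes the regularization strength `C` of a logistic regression inside a pipeline. The grid values, the `roc_auc` scoring choice, and the synthetic data are all assumptions made for illustration; they are not the authors' configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (not the glioma data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: four values of the inverse regularization strength
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=cv)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

The same pattern extends to the other models by swapping the `clf` step and its grid.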

## Comparing the algorithms with cross-validation
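
`StratifiedKFold` and `cross_val_score` are imported at the top of the notebook, and a comparison along the following lines would use them. The sketch runs on synthetic data and leaves out XGBoost so that it depends on scikit-learn only; both choices are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for X and y (not the glioma data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Scale-sensitive models get their own scaler inside a pipeline,
# so scaling is refitted on the training part of every fold
candidates = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
    cv_scores[name] = (scores.mean(), scores.std())
    print(f"{name}: ROC-AUC {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean and standard deviation across folds gives a steadier comparison than the single 80/20 split used earlier.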

## Using the `SHAP` model-explanation tool

## Using the `LIME` model-explanation tool

In [16]:
!pip install lime

Defaulting to user installation because normal site-packages is not writeable
Collecting lime
  Downloading lime-0.2.0.1.tar.gz (275 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting scikit-image>=0.12 (from lime)
  Downloading scikit_image-0.25.2-cp312-cp312-win_amd64.whl.metadata (14 kB)
Collecting tifffile>=2022.8.12 (from scikit-image>=0.12->lime)
  Downloading tifffile-2025.10.16-py3-none-any.whl.metadata (31 kB)
Collecting lazy-loader>=0.4 (from scikit-image>=0.12->lime)
  Downloading lazy_loader-0.4-py3-none-any.whl.metadata (7.6 kB)
Downloading scikit_image-0.25.2-cp312-cp312-win_amd64.whl (12.9 MB)


https://archive.ics.uci.edu/dataset/759/glioma+grading+clinical+and+mutation+features+dataset