# Predicting Gliomas 

<img src='https://www1.racgp.org.au/getattachment/AJGP/2020/April/Current-management-of-cerebral-gliomas/AJGP-04-2020-Clinical-Jeffree-Fig-1.jpg.aspx'>
    
## Table of Contents:
### 1. [Data Information](#data_info) 
### 2. [Data Evaluation](#data_eval)
### 3. [Modeling](#modeling)
#### 3.1 [Logistic Regression](#log_reg)
#### 3.2 [K-Nearest Neighbors](#KNN)
#### 3.3 [Random Forest](#rand_forest)
### 4. [Hyperparameter Tuning](#hyperparameters)
### 5. [Summary](#summary)

# 1. Data Information: <a name='data_info'></a>
About the Project:
> 'Gliomas are the most common primary tumors of the brain. They can be graded as LGG (Lower-Grade Glioma) or GBM (Glioblastoma Multiforme) depending on the histological/imaging criteria. Clinical and molecular/mutation factors are also very crucial for the grading process. Molecular tests are expensive to help accurately diagnose glioma patients.
In this dataset, the most frequently mutated 20 genes and 3 clinical features are considered from TCGA-LGG and TCGA-GBM brain glioma projects.'

Features:
1. **Grade (Target): Binary label**
2. Gender: Categorical 
3. Age_at_diagnosis: Numeric
4. Race: Categorical
Below are the genes of interest

5. IDH1
6. TP53
7. ATRX
8. PTEN
9. EGFR
10. CIC
11. MUC16
12. PIK3CA
13. NF1
14. PIK3R1
15. FUBP1
16. RB1
17. NOTCH1
18. BCOR
19. CSMD3
20. SMARCA4
21. GRIN2A
22. IDH2
23. FAT4
24. PDGFRA

### Thanks to UCI Machine Learning Repository for Supplying the data:
<img src="https://arispas.com/project/ucidata/featured_hucfe18df49cc0fbcf4abd94baa39c77da_8457_720x2500_fit_q75_h2_lanczos_3.webp" height="400" width="80">

Dataset information: For more information click [Link](https://archive.ics.uci.edu/dataset/759/glioma+grading+clinical+and+mutation+features+dataset)

# 1. Importing the data and tools

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# sklearn tools
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, RandomizedSearchCV

# sklearn models to test
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# sklearn evaluators
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, roc_auc_score

import warnings
warnings.filterwarnings('ignore')

In [None]:
# importing the data
df = pd.read_csv('/kaggle/input/glioma-grading-clinical/TCGA_InfoWithGrade.csv')

# 2. Data Evaluation <a name='data_eval'></a>

Quickly noting the shape, structure, and layout of the dataset

In [None]:
df.head()

In [None]:
# checking for missing values
df.isna().sum()

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df['Grade'].value_counts().plot(kind='bar')
plt.xticks(rotation=0);

In [None]:
# getting the X and y datasets
X = df.drop('Grade', axis = 1)
y = df.Grade 

# checking for inherent correlations
X_corr = X.corr()
heat_map = sns.heatmap(X_corr, annot = False)
heat_map.set(title = 'Heatmap of Correlations');

In [None]:
X_corr.style.background_gradient(cmap='Reds')

> Because most of the pairwise correlation coefficients between the independent variables appear to be below 0.5, I will skip the VIF step for testing multicollinearity.
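
If the VIF check were run after all, one way to sketch it is through the inverse of the correlation matrix, whose diagonal entries equal the variance inflation factors of the standardized features. The snippet below uses a small synthetic array (`X_demo`, an assumption for illustration, not the glioma data) so that it runs on its own; with the notebook's data, `X_corr.values` would take the place of `corr`.

```python
import numpy as np

# synthetic stand-in for the feature matrix (hypothetical data, not the glioma set)
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)   # deliberately almost collinear with `a`
X_demo = np.column_stack([a, b, c])

# correlation matrix of the columns
corr = np.corrcoef(X_demo, rowvar=False)

# VIF_j is the j-th diagonal entry of the inverse correlation matrix
vif = np.diag(np.linalg.inv(corr))

for name, value in zip(['a', 'b', 'c'], vif):
    print(f'{name}: VIF = {value:.2f}')
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity; in this toy example `a` and `c` should land far above that range while `b` stays close to 1.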

# 3. Modeling <a name="modeling"></a>

Here we will test three different classification models:
1. LogisticRegression
2. KNeighborsClassifier
3. RandomForestClassifier

to see if there are model-specific differences

In [None]:
# set the seed
np.random.seed(42)

# splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
# create a dictionary of models to test
model_grid = {'Logistic': LogisticRegression(),
              'RandomForest': RandomForestClassifier(),
              'KNN': KNeighborsClassifier()}
model_score = {} # creating a dictionary of model scores

for name, model in model_grid.items():
    model.fit(X_train, y_train)
    mod_score = model.score(X_test, y_test)
    model_score[name] = mod_score

model_score

In [None]:
pd.DataFrame(model_score.values(), model_score.keys()).plot(kind = 'bar');
plt.xticks(rotation=0);

Because the Logistic Regression model appears to perform best, hyperparameter tuning will be done on this model first, as a preliminary screen.

In [None]:
LogisticRegression().get_params().keys()

In [None]:
# tuning logistic regression
# building a grid for RandomizedSearchCV
log_reg_grid = {'C': np.logspace(-4,4,20),
               'solver': ['liblinear']}


log_reg_model = RandomizedSearchCV(estimator = LogisticRegression(),
                                param_distributions= log_reg_grid,
                                  n_iter= 10,
                                  cv = 5,
                                  verbose = True)

In [None]:
log_reg_model.fit(X_train, y_train)
log_reg_model.score(X_test, y_test), log_reg_model.best_params_

In [None]:
# Evaluating LogisticRegression model
y_preds = log_reg_model.predict(X_test)

In [None]:
# ROC curve (the confusion matrix is plotted in the next cell)

# function to plot the roc_curve:
plt.style.use('ggplot')

def plotting_roc_curve(X_test, y_test, model):
    '''
    Getting the roc_curve plot from the X_test and y_test input
    '''
    y_proba_positive = model.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, y_proba_positive)
    
    # visualization
    fig, ax = plt.subplots(figsize = (4,4))
    ax.plot(fpr, tpr, label = 'ROC')
    # diagonal reference line: the ROC curve of a random classifier
    ax.plot([0, 1], [0, 1], linestyle = '--', label = 'Chance')
    ax.set(title = 'Plot of the ROC curve',
           xlabel = 'FPR',
           ylabel = 'TPR')
    ax.legend();
    
plotting_roc_curve(X_test, y_test, model = log_reg_model)

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
cm = confusion_matrix(y_test, y_preds)
ConfusionMatrixDisplay(cm).plot();

### Classification Report

In [None]:
score_dict = {'f1_score':f1_score(y_test, y_preds),
             'recall_score': recall_score(y_test, y_preds),
             'precision_score':precision_score(y_test, y_preds)}

for test, score in score_dict.items():
    print(f'Using the {test}, we see a score of {score *100:.2f}')

In [None]:
print(classification_report(y_test, y_preds))

# 4. Hyperparameter Tuning <a name="hyperparameters"></a>
This time with GridSearchCV
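
Not every solver in `LogisticRegression` supports every penalty, so a full cross product of solvers and penalties contains combinations that fail to fit. One possible way around this, sketched below, is to pass `GridSearchCV` a list of dictionaries in which each dictionary holds only compatible pairs. The name `compatible_grid` is hypothetical, the `C` values mirror `np.logspace(-4, 4, 20)` in plain Python, and the sketch is limited to the `l1` and `l2` penalties.

```python
# 20 log-spaced values from 1e-4 to 1e4, the same range as np.logspace(-4, 4, 20)
C_values = [10 ** (-4 + 8 * i / 19) for i in range(20)]

# each dictionary lists only solver/penalty pairs that LogisticRegression accepts
compatible_grid = [
    {'solver': ['newton-cg', 'lbfgs', 'sag'], 'penalty': ['l2'], 'C': C_values},
    {'solver': ['liblinear', 'saga'], 'penalty': ['l1', 'l2'], 'C': C_values},
]

# number of candidate settings the search would evaluate
n_candidates = sum(
    len(g['solver']) * len(g['penalty']) * len(g['C']) for g in compatible_grid
)
print(n_candidates)

# the list would then be passed as:
# GridSearchCV(LogisticRegression(), param_grid=compatible_grid, cv=5)
```

Compared with the full cross product, this keeps every fit valid, so no search time is spent on combinations that can only return a failed score.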

In [None]:
LogisticRegression().get_params().keys()

In [None]:
from sklearn.model_selection import GridSearchCV
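# defining the hyperparameter grid
# note: not every solver supports every penalty; combinations that fail to fit
# are scored as NaN by GridSearchCV (default error_score)
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
# note: newer scikit-learn releases expect None instead of the string 'none'
penalty = ['l2', 'l1', 'elasticnet', 'none']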
C = np.logspace(-4,4,20)

# defining the grid search
grid = {'solver': solvers,
       'penalty': penalty,
       'C': C}
grid_search = GridSearchCV(estimator = LogisticRegression(),
                          param_grid = grid,
                          cv = 5,
                          scoring = 'accuracy')
grid_result = grid_search.fit(X_train, y_train)

# summarizing the results
print(f'Best {grid_result.best_score_} using {grid_result.best_params_}')
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']

for mean, stdev, param in zip(means, stds, params):
    print(f'Mean :{mean}, std: {stdev}, Params: {param}')

In [None]:
grid_result.best_params_

In [None]:
plotting_roc_curve(X_test, y_test, model = grid_result)

In [None]:
# ROC AUC is computed from the predicted probability of the positive class, not from hard labels
print(f'The ROC AUC score is {roc_auc_score(y_test, grid_result.predict_proba(X_test)[:, 1])}')

In [None]:
# Classification evaluators
y_preds2 = grid_result.predict(X_test)
score_dict = {'f1_score':f1_score(y_test, y_preds2),
             'recall_score': recall_score(y_test, y_preds2),
             'precision_score':precision_score(y_test, y_preds2)}

for test, score in score_dict.items():
    print(f'Using the {test}, we see a score of {score *100:.2f}')

# 5. Summary...For Now <a name="summary"></a>

Of the three estimators (Logistic Regression, RandomForest, and KNN), Logistic Regression had the highest baseline accuracy, which is why it was selected for further evaluation.

After a preliminary round of hyperparameter tuning, we got a recall score of **93.67%** with Logistic Regression.

 
Recall measures the true positive rate, i.e. how many of the actual positive cases the model predicts as positive:

$\text{Recall} = TP/(TP+FN)$
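
As a worked example of the formula, the sketch below computes recall, together with precision and F1 for comparison, from confusion-matrix counts. The counts are made up for illustration and are not results from this notebook.

```python
# hypothetical confusion-matrix counts (not taken from the notebook's results)
tp, fn, fp = 45, 5, 10

recall = tp / (tp + fn)          # share of actual positives that were found
precision = tp / (tp + fp)       # share of predicted positives that were correct
f1 = 2 * precision * recall / (precision + recall)

print(f'recall = {recall:.2%}, precision = {precision:.2%}, f1 = {f1:.2%}')
```

With these counts, 45 of the 50 actual positives are recovered, so recall is 90%, while the 10 false positives pull precision below that.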

In [None]:
# cross validation score:
from sklearn.model_selection import cross_val_score

def get_cross_val_metrics(model, X, y, cv):
    '''
    Getting the cross validation metrics score
    '''
    # cross_validation output mean
    list_of_scores = ['accuracy','f1', 'recall','precision']
    scores_dict = {}
    print(f'getting the cross validation metrics with K = {cv}')
    for score in list_of_scores:
        scores_dict[score] = np.mean(cross_val_score(model, X, y, cv = cv, verbose = True, scoring = score))
    print(scores_dict)

In [None]:
# defining the model with the best params (it is fit inside cross_val_score)
clf = LogisticRegression(C = 1.623776739188721, penalty = 'l1', solver = 'liblinear')


get_cross_val_metrics(model = clf, X= X, y= y, cv= 10)

# What Next?
### Testing other classifiers:
1. ElasticNet
2. Naive Bayes
3. SVC
4. XGBoost
5. etc. 