# NASA Program Defects: Classifier Model Comparison

## Problem Statement
When representing real-world scenarios with machine learning models, it is important to know how to compare performance across model architectures. From these comparisons, a data scientist can identify the architecture best suited to the problem. It is also important for data scientists to know how to fine-tune models within a given architecture, both for cross-method and intra-method comparison.

This small project builds five different classifier models that predict the presence of software defects in a given NASA software program. Each method is evaluated with 10-fold cross-validation, which yields 10 fitted models per method. The project then captures the Area Under the ROC Curve (AUC) score of each fold for cross-method performance comparison. For two of the five models, one parameter is fine-tuned prior to final cross-validation training and testing.
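To make the scoring metric concrete, here is a minimal pure-Python sketch of how an AUC score can be computed for a single fold: it is the fraction of (defective, non-defective) pairs in which the defective example receives the higher predicted score, with ties counted as half. The function name `pairwise_auc` and the toy labels and scores are illustrative assumptions; the project itself relies on scikit-learn's "roc_auc" scorer rather than this code.

```python
def pairwise_auc(labels, scores):
    """Return AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for l, s in zip(labels, scores) if l]
    negatives = [s for l, s in zip(labels, scores) if not l]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical toy fold: two non-defective (0) and two defective (1) programs.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # 3 of the 4 pairs are ordered correctly -> 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is the scale on which the fold scores in this project should be read.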

Technical Note: For the cross-validation process, the parameter “n_jobs” is set so that all but one of the available system CPUs are dedicated to completing the model_selection.cross_validate() call; this is done to reduce run time.
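The "all but one CPU" rule can be sketched as a small helper. This is an illustrative assumption rather than the project's code: the name `worker_count` is hypothetical, and the guard exists because `os.cpu_count()` can return `None` and a worker count of 0 is not meaningful.

```python
import os


def worker_count(cpu_count=None):
    """Return all-but-one of the available CPUs, never fewer than 1."""
    if cpu_count is None:
        # os.cpu_count() returns None when the count cannot be determined.
        cpu_count = os.cpu_count() or 1
    return max(1, cpu_count - 1)


print(worker_count(8))  # 7
print(worker_count(1))  # 1 (a single-CPU machine still gets one worker)
```

The result could then be passed as the “n_jobs” argument, leaving one CPU free for the rest of the system.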


## Brief Description of the Data

The selected dataset, named “pc4”, is part of the National Aeronautics and Space Administration (NASA) program defect datasets. The data describes flight software for an earth-orbiting satellite. The data points were compiled by Shirabad & Menzies in 2005 [1]. For this project, the data is imported directly from OpenML's database, located at: https://www.openml.org/search?type=data&status=active&id=1049

Each example consists of 37 numerical descriptive features and one Boolean target feature, which describes whether the example program contained one or more defects. The descriptive features are based on McCabe and Halstead metrics for software complexity. The dataset contains 1458 program instances and no missing values.
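The properties described above (instance count, feature count, defect count, and absence of missing values) can be checked with a short helper once the data is loaded. The sketch below is a hedged illustration in plain Python: `summarize` is a hypothetical name, and the three toy rows stand in for the real data, for which the description above would lead one to expect 1458 instances, 37 features, and no missing values.

```python
import math


def summarize(rows, targets):
    """Summarize a feature table (list of rows) and its Boolean targets."""
    has_missing = any(
        value is None or (isinstance(value, float) and math.isnan(value))
        for row in rows
        for value in row
    )
    return {
        "n_instances": len(rows),
        "n_features": len(rows[0]) if rows else 0,
        "n_defective": sum(1 for t in targets if t),
        "has_missing": has_missing,
    }


# Hypothetical toy data: 3 programs, 2 complexity metrics each.
toy_rows = [[12.0, 3.0], [40.0, 9.0], [7.0, 1.0]]
toy_targets = [False, True, False]
toy_summary = summarize(toy_rows, toy_targets)
print(toy_summary)
```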


In [3]:
from sklearn import datasets
from sklearn import tree
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier

In [5]:
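#Assess available system CPUs for cross-validation performance tuning
import os
#os.cpu_count() may return None; the floor of 2 keeps n_jobs = n_cpu-1 at 1 or more (n_jobs = 0 is invalid)
n_cpu = max(2, os.cpu_count() or 2)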

In [6]:
#Select NASA dataset (pc4, data_id=1049) from OpenML
nsa_dataset = datasets.fetch_openml(data_id=1049)

print("\n")

#Run Model Generation & Evaluation
print("*********Default Decision Tree AUC Scoring**********")
dtc_default = tree.DecisionTreeClassifier() #default parameters
cv_default = model_selection.cross_validate(dtc_default, nsa_dataset.data, nsa_dataset.target, scoring = "roc_auc", cv = 10, n_jobs = n_cpu-1)
print("AUC for each model: ", cv_default["test_score"])
print("Mean AUC: ", cv_default["test_score"].mean())

print("\n")

print("*********Tuned Decision Tree AUC Scoring**********")
parameters = [{"min_samples_leaf":[2,4,6,8,10,20,30,40,50]}]
dtc_tuned = model_selection.GridSearchCV(dtc_default, parameters, scoring = "roc_auc", cv = 10, n_jobs = n_cpu-1 )
cv_dtc_tuned = model_selection.cross_validate(dtc_tuned, nsa_dataset.data, nsa_dataset.target, scoring = "roc_auc", cv = 10, n_jobs = n_cpu-1)
dtc_tuned.fit(nsa_dataset.data, nsa_dataset.target)
print("AUC for each model: ", cv_dtc_tuned["test_score"])
print("Mean AUC: ", cv_dtc_tuned["test_score"].mean())
print("Best \"min_sample_leaf\" parameter value: " , dtc_tuned.best_params_['min_samples_leaf'])


print("\n")

print("*********Tuned Random Forest Ensemble AUC Scoring**********")
parameters = [{"n_estimators": [50, 100, 500]}]
rf = RandomForestClassifier()
rf_tuned = model_selection.GridSearchCV(rf, parameters, scoring = "roc_auc", cv = 10, n_jobs = n_cpu-1)
cv_rf_tuned = model_selection.cross_validate(rf_tuned, nsa_dataset.data, nsa_dataset.target, scoring="roc_auc", cv=10, n_jobs = n_cpu-1)
rf_tuned.fit(nsa_dataset.data,nsa_dataset.target)
print("AUC for each model: ", cv_rf_tuned["test_score"])
print("Mean AUC: ", cv_rf_tuned["test_score"].mean())
print("Best \"n_estimators\" parameter value: ", rf_tuned.best_params_['n_estimators'])


print("\n")

print("*********Default Bagged Ensemble AUC Scoring**********")
dtc_bagged = BaggingClassifier()
cv_bagged = model_selection.cross_validate(dtc_bagged, nsa_dataset.data, nsa_dataset.target, scoring = "roc_auc", cv = 10, n_jobs = n_cpu-1)
print("AUC for each model: ", cv_bagged["test_score"])
print("Mean AUC: ", cv_bagged["test_score"].mean())


print("\n")

print("*********Default AdaBoost Ensemble AUC Scoring**********")
dtc_adaboost = AdaBoostClassifier()
cv_adaboost = model_selection.cross_validate(dtc_adaboost, nsa_dataset.data, nsa_dataset.target, scoring = "roc_auc", cv = 10, n_jobs = n_cpu-1)
print("AUC for each model: ", cv_adaboost["test_score"])
print("Mean AUC: ", cv_adaboost["test_score"].mean())



*********Default Decision Tree AUC Scoring**********
AUC for each model:  [0.70703125 0.63541667 0.72005208 0.72265625 0.69921875 0.84201389
 0.69487847 0.67881944 0.85110294 0.63924632]
Mean AUC:  0.7190436070261439


*********Tuned Decision Tree AUC Scoring**********
AUC for each model:  [0.90342882 0.94639757 0.85872396 0.9296875  0.87521701 0.9047309
 0.92578125 0.91232639 0.88671875 0.88304228]
Mean AUC:  0.9026054432189543
Best "min_sample_leaf" parameter value:  40


*********Tuned Random Forest Ensemble AUC Scoring**********
AUC for each model:  [0.94704861 0.93012153 0.93055556 0.96419271 0.95138889 0.94618056
 0.94335938 0.94921875 0.95036765 0.93474265]
Mean AUC:  0.944717626633987
Best "n_estimators" parameter value:  500


*********Default Bagged Ensemble AUC Scoring**********
AUC for each model:  [0.90082465 0.92599826 0.89583333 0.94661458 0.94314236 0.90147569
 0.89322917 0.91341146 0.91130515 0.94462316]
Mean AUC:  0.9176457822712418


*********Default AdaBoost Ensem

## Observations

The default and tuned single decision trees were the least competitive models, with mean AUC scores of 0.7190 and 0.9026, respectively, in the run shown above. That these single-tree models trailed the ensemble models is consistent with the claim that ensembles generally provide better predictions because their member models are diverse and independent, and because an ensemble can express language/search biases that a single model may not be able to produce.
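The comparison can be made more explicit by summarizing each model's per-fold scores with a mean and a sample standard deviation. The sketch below copies the fold scores from the printed output above; AdaBoost is left out because its output is cut off there. The dictionary layout and variable names are illustrative assumptions.

```python
from statistics import mean, stdev

# Per-fold AUC scores copied from the printed output above.
fold_auc = {
    "default tree": [0.70703125, 0.63541667, 0.72005208, 0.72265625, 0.69921875,
                     0.84201389, 0.69487847, 0.67881944, 0.85110294, 0.63924632],
    "tuned tree": [0.90342882, 0.94639757, 0.85872396, 0.9296875, 0.87521701,
                   0.9047309, 0.92578125, 0.91232639, 0.88671875, 0.88304228],
    "tuned random forest": [0.94704861, 0.93012153, 0.93055556, 0.96419271, 0.95138889,
                            0.94618056, 0.94335938, 0.94921875, 0.95036765, 0.93474265],
    "default bagging": [0.90082465, 0.92599826, 0.89583333, 0.94661458, 0.94314236,
                        0.90147569, 0.89322917, 0.91341146, 0.91130515, 0.94462316],
}

# Mean and sample standard deviation of the fold scores for each model.
summary = {name: (mean(scores), stdev(scores)) for name, scores in fold_auc.items()}

# Rank the models from lowest to highest mean AUC.
ranking = sorted(summary, key=lambda name: summary[name][0])
for name in ranking:
    m, s = summary[name]
    print(f"{name:20s} mean={m:.4f} stdev={s:.4f}")
```

Reporting the spread alongside the mean helps show whether a difference between two models is large relative to the fold-to-fold variation.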

The tuned random forest ensemble model was the most competitive predictor, producing a mean AUC score of 0.9447 from its 10-fold cross-validation. GridSearchCV() selected 500 trees (“n_estimators”) from an available parameter grid of 50, 100, and 500 trees. Because the best value sits at the upper edge of the grid, the optimum may lie beyond it; increasing the number of trees further may continue to improve performance, although this was not tested here.
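One way to examine how the score changes with the number of trees is to read the mean score for every grid value from the fitted search, rather than only the best one. The following is a minimal sketch under stated assumptions: it uses synthetic data from `make_classification` instead of pc4, a small grid, and 3 folds so that it runs quickly; none of these values come from the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; this is not the pc4 dataset).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Small illustrative grid; a real follow-up run would extend past 500 trees.
grid = {"n_estimators": [10, 50, 100]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), grid, scoring="roc_auc", cv=3
)
search.fit(X, y)

# Mean cross-validated AUC for each candidate number of trees.
for n_trees, mean_auc in zip(
    search.cv_results_["param_n_estimators"],
    search.cv_results_["mean_test_score"],
):
    print(n_trees, round(float(mean_auc), 4))
print("best:", search.best_params_["n_estimators"])
```

If the mean scores level off as the grid grows, adding more trees is unlikely to help much; if the best value is still the largest one, widening the grid is a reasonable next step.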


## Resources

[1] Sayyad Shirabad, J. and Menzies, T.J. (2005) The PROMISE Repository of Software Engineering Databases. School of Information Technology and Engineering, University of Ottawa, Canada.  