## DHI Consortium Database - Machine Learning Screening Exercise
### Kevin Jaggs Jan 2020

### Objective:
Using default or near-default parameters only, fit a variety of machine learning classifier algorithms to the DHI consortium database. Outcomes are compiled and ranked by conventional binary classification scoring metrics. This workflow is intended to give a high-level overview of which machine learning algorithms present the best opportunities for further detailed investigation.

### Method:

* Step 1 -  [Import the required python modules](#section_id_import_modules). Import of pandas dataframe, numpy, matplotlib and sklearn modules


* Step 2 -  [Import SAAM_Database_csv_to_pandas](#section_id_import_saam_database). Import of the DHI consortium database csv file to a pandas dataframe. Note: the csv file is an independently formatted file, composed of Pg, adjusted DHI index and user-scored parameters. This format is created in a separate python module.


* Step 3 -  [Split dataset into test and train portions](#section_id_test_train_split). For conventional machine learning workflows, the input and output datasets are divided into training and testing sub-volumes. Algorithms are trained on the (not surprisingly) 'train' dataset and model outcomes are compared using the 'test' dataset, so that over-fitting to the training data can be detected.


* Step 4 -  [Define classifiers and labels](#section_id_define). Create lists of algorithm labels and sklearn algorithms with basic parameters.


* Step 5 -  [Feature scale input data](#section_id_feature_scale). Rescale the features/input columns such that they have the properties of a standard normal distribution with a mean of zero and a standard deviation of one.


* Step 6 -  [Fit training data](#section_id_algorithm_fit). Iterate through all the selected algorithms and fit the training data. Predict the output of the test data and compare to the actuals. Record accuracy, precision, f1 and AUC scores.


* Step 7 -  [Display heatmap](#section_id_heatmap). Plot the sorted results table as a coloured heatmap to show the top results in each column.



### Dataset
The SAAM database is a record of over 340 (as of Jan 2021) O&G exploration prospects, each one displaying a DHI signature. This database is part of the DHI consortium, an industry-funded project headed by Rose & Associates  with the objective of creating a methodical, repeatable and technically robust workflow for risk assessment of DHI prospects.  

Each prospect is evaluated by scoring over 40 questions related to the DHI characteristics, data quality, potential pitfalls and modelling results.

This is a binary classification exercise, as the final output of the dataset is denoted by 0 for failure cases and 1 for success cases.

There are options to include Pg and the adjusted DHI index as inputs to the evaluations, but this should be regarded as data leakage, particularly in the case of the adjusted DHI index.

### Programming
This evaluation is performed using Python code; scikit-learn (sklearn) provides all the classifier modules.

### Outputs
* Overview plot of model fit for each classifier
* Sheet of evaluation criteria ordered by descending AUC score

<a id='section_id_import_modules'></a>

### 1. Import required Python Modules

In [262]:
print(__doc__)

#pandas dataframe - similar to a python version of excel. Most efficient way to perform data operations
import pandas as pd

#numpy for most mathematical operations
import numpy as np

#matplotlib for plotting functions
#import matplotlib.pyplot as plt
#from matplotlib.colors import ListedColormap
#seaborn for plotting heatmap
#import seaborn as sns

#sklearn for machine learning modules and scoring algorithms
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, f1_score, classification_report


Automatically created module for IPython interactive environment


<a id='section_id_import_saam_database'></a>

### 2. Import SAAM Database csv file to pandas
The csv file is imported into a pandas dataframe, allowing ease of computation in later modules. All columns are considered numeric. The column header is the data category and the index is the prospect name. All undrilled prospects have been removed from the database.

In [263]:
def readcsvfile():
    '''
    (csv)->(pandas dataframe)
    Read the import csv file into a pandas dataframe
    Comma separation, column headers = first row of text
    All column data is considered numeric
    Index = prospect name/ID
    -999.25 values are converted to numpy NaN, then all NaN values are set to 0
    PRECONDITION: Import csv file is the correct format. All prospects without a 1/0 output removed.
    KJAGGS June 2020
    '''
    
    #read csv file to pandas dataframe
    df = pd.read_csv('saam_v26b_cat.csv', encoding = 'utf-8',header=0,sep=',')
    #use the prospect name as the index
    df = df.set_index('Prospect/ zone: ')
    #convert columns to numeric where possible
    df = df.apply(pd.to_numeric, errors='ignore')
    #flag the -999.25 null sentinel as missing, then zero-fill all missing values
    df = df.replace(-999.25,np.nan)
    df = df.fillna(0)
    #print(df.isnull().any())
    return df

df_saam = readcsvfile()

###optional qc features - remove hash/# to view 
#print(df_saam.head())
#print(df_saam.columns.tolist())


print("Dataframe size on import")
print("Number of prospects :", df_saam.shape[0])
print("Number of data categories :", df_saam.shape[1])

Dataframe size on import
Number of prospects : 352
Number of data categories : 43
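
The null handling inside `readcsvfile()` can be illustrated on a toy table. This is a minimal sketch only: the two-row dataframe and its column names are hypothetical stand-ins, not columns from the SAAM database.

```python
import numpy as np
import pandas as pd

# Hypothetical two-prospect table; column names are illustrative only.
toy = pd.DataFrame(
    {"amplitude_score": [3.0, -999.25], "avo_score": [-999.25, 2.0]},
    index=["prospect_a", "prospect_b"],
)

# Step 1: mark the -999.25 sentinel as missing.
with_nan = toy.replace(-999.25, np.nan)
n_missing = int(with_nan.isna().sum().sum())

# Step 2: fill every missing entry with zero, as readcsvfile() does.
filled = with_nan.fillna(0)

print("Missing values found:", n_missing)
print(filled)
```

One consequence worth keeping in mind: after zero-filling, a missing answer can no longer be told apart from a genuine score of zero, so counting the missing values before the fill (as `n_missing` does here) may be a useful QC step.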


<a id='section_id_test_train_split'></a>

### 3. Split into test and train datasets
Data is divided into training and test sub-volumes so that models are fitted on one portion and checked for over-fitting/bias on the other. The function below defaults to an 80:20 train:test split with no random number seeding, no shuffling of the input data and no stratifying (equal proportions of success/fail outcomes in each sub-volume). These features are set in the function call at the bottom of the next cell, which currently uses a 75:25 split with shuffling, a fixed random seed and stratification; the user can change them accordingly.

In [264]:
def training_split(Xdata, Ydata, test_size=0.2,stratify=None,random_state=None,shuffle=False):
    '''
    (DataframeXColumns)(DataframeYColumn)=>(array)(array)(array)(array)
    For the purposes of splitting x and y data into test and training sets
    Defaults: 80:20 train:test split, no shuffling, no stratifying, no random seed
    PRECONDITION: Input datasets are formatted correctly
    KJAGGS June2020
    '''

    #split the data using the settings passed in by the caller
    X_train, X_test, y_train, y_test = train_test_split(Xdata, Ydata, test_size=test_size,stratify=stratify, random_state = random_state, shuffle = shuffle)

    #flatten any 1-dimensional output to a plain numpy array, for compatibility with regression modules
    if len(X_train.shape) <2:  
        X_train = np.ravel(X_train)
    if len(y_train.shape) <2:  
        y_train = np.ravel(y_train)
    if len(X_test.shape) <2:  
        X_test = np.ravel(X_test)
    if len(y_test.shape) <2:  
        y_test = np.ravel(y_test)
    
    return X_train, X_test, y_train, y_test

#output dataset is success/fail only
y = df_saam['Success or Fail']

#build the input dataset by dropping the result column
#drop() returns a new dataframe, so df_saam itself is left unchanged
#option to keep Pg and the adjusted DHI index in the training dataset: swap the # between the two drop lines below

#X = df_saam.drop(['Success or Fail'], axis=1)
X = df_saam.drop(['Success or Fail','DHI Index after data adjustment','Pg'], axis=1)
#X = df_saam[['DHI Index after data adjustment','Pg']]

#split the datasets
#shuffle = randomly shuffle the data before splitting. If shuffle=False then stratify must be None.
#random_state = seeding of random numbers for reproducibility across various modules
#stratify = ensure the proportion of outcomes is the same across both datasets i.e. % of successes in train = % of successes in test
X_train, X_test, y_train, y_test = training_split(X,y,shuffle=True,random_state=100,stratify=y,test_size=0.25)

print("Train and test dataset sizes")
print("Training dataset size :", X_train.shape)
print("Test dataset size :", X_test.shape)

Train and test dataset sizes
Training dataset size : (264, 40)
Test dataset size : (88, 40)
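
The effect of `stratify` can be checked on synthetic data. The sketch below is illustrative only: the random feature matrix and the 194/158 success/fail balance are arbitrary assumptions, not values taken from the SAAM database; only the row count, column count and split settings mirror the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 352 rows, 40 features, arbitrary class balance.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(352, 40))
y_demo = np.array([1] * 194 + [0] * 158)

# Same settings as the call above: 75:25 split, shuffled, seeded, stratified.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, shuffle=True, random_state=100, stratify=y_demo
)

print("Train shape:", X_tr.shape, "success rate:", round(y_tr.mean(), 3))
print("Test shape :", X_te.shape, "success rate:", round(y_te.mean(), 3))
```

With stratification the two success rates should agree to within about one sample; without it, a small test set can drift noticeably away from the overall class balance.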


<a id='section_id_define'></a>

### 4. Define classifiers and labels
Create two python lists: one for the ML algorithm labels, and the other with the sklearn module ID + basic parameters. The model fit section will iterate through these lists and fit the train data to the chosen algorithm; it will also validate the model using the test dataset. Users can change the parameters within the brackets. User manuals can be found at the following link: 
https://scikit-learn.org/stable/user_guide.html



In [265]:
#text label to id the current process/algorithm in the iteration sequence
clf_names = ["Logisitic","Nearest Neighbors", "Linear SVM", "RBF SVM", "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "QDA"]

#list of classifier modules and basic parameter selection 
#iterate in conjunction with the list above
classifiers = [
    linear_model.LogisticRegression(),
    KNeighborsClassifier(2),
    SVC(kernel="linear"),
    SVC(gamma=0.01, C=10),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=40),
    #use X_train here: X_train_scaled is not created until the feature scaling cell (same column count)
    RandomForestClassifier(max_depth=20, n_estimators=20, max_features=X_train.shape[1]),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]
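
Because the two lists are iterated together, a label and its classifier must sit at the same position in each list. One hypothetical alternative, sketched here with only three of the classifiers, is to hold each label and estimator together in a single dict; `classifier_map` is an illustrative name, not part of the notebook.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical alternative layout: one dict instead of two parallel lists.
classifier_map = {
    "Logistic": LogisticRegression(),
    "Nearest Neighbors": KNeighborsClassifier(n_neighbors=2),
    "Decision Tree": DecisionTreeClassifier(max_depth=40),
}

for label, estimator in classifier_map.items():
    # get_params() lists every hyperparameter, including untouched defaults.
    params = estimator.get_params()
    print(label, "->", type(estimator).__name__, f"({len(params)} parameters)")
```

Calling `get_params()` on any estimator is also a quick way to see which settings are still at their defaults before deciding what to tune.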

<a id='section_id_feature_scale'></a>

### 5. Feature scale the input data
Input independent variables are scaled so that each attribute has a mean of zero and a unit standard deviation. Gradient descent algorithms (neural networks, for example) run more efficiently if input features are scaled accordingly. Decision tree methods are invariant to scaling.

In [266]:
#set up scaler - StandardScaler from SciKit learn
#Standardize features by removing the mean and scaling to unit variance
#z = (x - u) / s
scaler = StandardScaler()

#scaler is calculated on the train data ONLY. Fitting it on the test data as well would be data leakage - test data must not influence the model output.
scaler.fit(X_train)

#apply the scaler to the train and test datasets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
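
The scaling step above can be reproduced by hand on a toy array. This is a minimal sketch with arbitrary values; the `_demo` names are hypothetical and chosen so they do not clash with the notebook variables.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: 4 training rows and 1 test row, 2 features (values are arbitrary).
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test_demo = np.array([[5.0, 50.0]])

# Fit on the training rows only.
demo_scaler = StandardScaler().fit(X_train_demo)

# Manual z = (x - u) / s using the training mean and population std (ddof=0).
u = X_train_demo.mean(axis=0)
s = X_train_demo.std(axis=0)
manual_train = (X_train_demo - u) / s
manual_test = (X_test_demo - u) / s

scaled_train = demo_scaler.transform(X_train_demo)
scaled_test = demo_scaler.transform(X_test_demo)

print("Train mean after scaling:", scaled_train.mean(axis=0))
print("Train std after scaling :", scaled_train.std(axis=0))
print("Scaled test row         :", scaled_test)
```

Note that only the training data ends up with exactly zero mean and unit standard deviation. The test rows are transformed with the training statistics, so their scaled mean and spread will generally differ, which is expected.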


<a id='section_id_algorithm_fit'></a>

### 6. Fit data using algorithms
Iterate through all the selected algorithms and fit the training data. Predict the output of the test data and compare to the actuals. Record accuracy, precision, f1 and AUC scores. Final results output to a pandas dataframe, sorted on AUC score

In [267]:
#create empty dataframe to receive model scores
#one row per classifier label, one column per scoring metric
df_alg = pd.DataFrame(index=clf_names, columns=['Accuracy','Precision','Weighted F1','AUC'])


# iterate over classifiers
for name, clf in zip(clf_names, classifiers):

    #fit the data to the current classifier
    clf.fit(X_train_scaled, y_train)   
    y_predict = clf.predict(X_test_scaled)
    
    df_alg.loc[name,'Accuracy'] = accuracy_score(y_test, y_predict)
    df_alg.loc[name,'Precision'] = precision_score(y_test, y_predict)
    df_alg.loc[name,'Weighted F1'] = f1_score(y_test, y_predict, average='weighted')
    #AUC is calculated from the hard 0/1 predictions, not from class probabilities or decision scores
    df_alg.loc[name,'AUC'] = roc_auc_score(y_test, y_predict)

    #print(name , "default parameters")
    #print("Accuracy =", accuracy_score(y_test, y_predict))
    #print("Precision =", precision_score(y_test, y_predict))
    #print('Weighted f1 = {:.2f}'.format(f1_score(y_test, y_predict, average='weighted')))
    #print("AUC =", roc_auc_score(y_test, y_predict))
    #print("\n")
    #print(classification_report(y_test, y_predict))

#sort final dataframe by AUC descending 
df_alg.sort_values(['AUC'], ascending=[False], inplace=True)   
#convert object entries to float (every cell holds a numeric score at this point)
df_alg = df_alg.astype(float)
#round to 3 decimal places
df_alg = df_alg.round(decimals=3)    
#print(df_alg)    
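
The loop above passes hard 0/1 predictions to `roc_auc_score`. With only one operating point on the ROC curve, that value reduces to the balanced accuracy (the mean of the true positive and true negative rates), which may be why the Accuracy and AUC columns in the results track each other so closely. The sketch below illustrates the difference on synthetic data; the dataset from `make_classification` and the single logistic regression model are assumptions for illustration, not the notebook's data or results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary dataset with the same shape as the notebook's inputs.
X_demo, y_demo = make_classification(n_samples=352, n_features=40, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=100
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard labels: a single threshold, as in the loop above.
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))
# AUC from continuous scores: the probability of the positive class.
auc_from_scores = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
bal_acc = balanced_accuracy_score(y_te, model.predict(X_te))

print("AUC from hard labels  :", round(auc_from_labels, 3))
print("Balanced accuracy     :", round(bal_acc, 3))
print("AUC from probabilities:", round(auc_from_scores, 3))
```

If a threshold-independent ranking is wanted in a follow-up study, scores from `predict_proba` (or `decision_function` for estimators such as `SVC` that do not expose probabilities by default) could be passed to `roc_auc_score` instead.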
    



<a id='section_id_heatmap'></a>

### 7. Display final results 
The pandas dataframe is given a gradient fill to highlight the best performing algorithm for each scoring metric. Dark blue = high score, light purple = low score.
NOTE: Use the >| Run button if the colour map is not displayed.

In [268]:
df_alg.style.background_gradient(cmap='PuBu')

| Classifier | Accuracy | Precision | Weighted F1 | AUC |
|---|---|---|---|---|
| Random Forest | 0.705 | 0.733 | 0.705 | 0.705 |
| Linear SVM | 0.682 | 0.711 | 0.682 | 0.682 |
| Gaussian Process | 0.682 | 0.686 | 0.68 | 0.677 |
| Neural Net | 0.67 | 0.696 | 0.671 | 0.67 |
| AdaBoost | 0.67 | 0.68 | 0.669 | 0.667 |
| Decision Tree | 0.659 | 0.689 | 0.659 | 0.659 |
| RBF SVM | 0.659 | 0.681 | 0.659 | 0.657 |
| Logisitic | 0.648 | 0.667 | 0.647 | 0.645 |
| Nearest Neighbors | 0.568 | 0.622 | 0.566 | 0.574 |
| QDA | 0.443 | 0.25 | 0.304 | 0.474 |
