<a href="https://www.spe.org/events/en/2022/conference/22apog/asia-pacific-oil-and-gas-conference-and-exhibition.html"><img src = "https://www.spe.org/binaries/content/gallery/specms/speevents/organization-logos/spe-logo-2020.png" width = 200></a>

<h1 align=center><font size = 5>Prediction of Recovery Factor using Machine Learning Methods</font></h1>

<h1 align=center><font size = 4> Munish Kumar, Kannapan Swaminathan</font></h1>
<h1 align=center><font size = 4> Part 4: Modelling of Recovery Factor</font></h1>
<h1 align=center><font size = 3> ERCE 2022 </font></h1>

###### References

1. https://www.kaggle.com/code/kkhandekar/an-introduction-to-pycaret/notebook
2. https://towardsdatascience.com/5-things-you-dont-know-about-pycaret-528db0436eec
3. https://www.dataquest.io/blog/understanding-regression-error-metrics/ 
4. https://www.analyticsvidhya.com/blog/2021/07/automl-using-pycaret-with-a-regression-use-case/
5. https://www.datacamp.com/community/tutorials/guide-for-automating-ml-workflows-using-pycaret
6. https://pycaret.readthedocs.io/en/latest/api/regression.html
7. http://www.pycaret.org/tutorials/html/REG102.html
8. https://githubhelp.com/ray-project/tune-sklearn

## Check PyCaret Version

In [None]:
from pycaret.utils import version

In [None]:
version()

#### Libraries

In [None]:
# Only install the following libraries if you don't have them; otherwise leave these lines commented out

#!conda install -c anaconda natsort --yes
#!conda install -c anaconda xlrd --yes

#!pip install natsort --user
#!pip install xlrd --user
#!pip install pycaret[full] --user
#!pip install mlflow --user
#!pip install tune-sklearn ray[tune] --user
#!pip install optuna --user
#!pip install hyperopt --user
#!pip install redis --user

# General Libraries
import itertools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.ticker import NullFormatter
import time
import re
import requests
import pickle
import seaborn as sns
import os
import glob
import sys
from natsort import natsorted
sns.set()

import plotly.graph_objects as go
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')

# Sklearn Libraries
from sklearn import preprocessing

import datetime
from datetime import timedelta, date 
start = time.time()
%matplotlib inline

import ray
from ray import tune

# Forces the print statement to show everything and not truncate
# np.set_printoptions(threshold=sys.maxsize) 
print('Libraries imported')

In [None]:
#Receive Data
#dir_name = r'C:\Users\kswaminathan\OneDrive\01_KannaLibrary\15_Analogs'
dir_name = r'C:\Users\mkumar\Documents\GitHub\@Papers\SPE2022\Final'
filename_suffix = 'csv'

##### Read in the data 

In [None]:
skiprows = 0
# Read ',' as the thousands separator, then drop all unnamed columns.
df = pd.read_excel("dftorisv2.xlsx", thousands=',', skiprows = skiprows)
df = df.loc[:, ~df.columns.str.contains('^Unnamed')] 
df.head()

In [None]:
# Plot as Heat map to check for highly correlated variables
plt.figure(figsize=(15, 15))
ax = sns.heatmap(df.corr(), annot=True, fmt=".2f")

From the heat map above, I define highly correlated variables as those with an absolute correlation coefficient greater than 0.7. No pair of variables was highly correlated by this definition.
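
The visual check can be backed by a small tabular one. The sketch below is an illustration only: the helper `high_corr_pairs` and the synthetic column names are assumptions, not part of the original workflow. It lists every feature pair whose absolute Pearson correlation exceeds the 0.7 threshold.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation of 1.0) is excluded.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().reset_index()
    pairs.columns = ["feature_a", "feature_b", "abs_corr"]
    return pairs[pairs["abs_corr"] > threshold].reset_index(drop=True)

# Synthetic stand-in data (hypothetical column names, not the study's data set)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "porosity": x,
    "permeability": 2.0 * x + rng.normal(scale=0.1, size=200),  # strongly tied to porosity
    "depth": rng.normal(size=200),                               # independent of the others
})

flagged = high_corr_pairs(demo, threshold=0.7)
print(flagged)
```

Applied to the notebook's data, the call would be `high_corr_pairs(df)`; an empty result would agree with the observation above.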

##### Convert to float - to ensure it is a numerical feature

In [None]:
df_drop = df.copy()
df_drop = df_drop.astype(float)

# Confirm properties of final dataframe
print(len(df_drop))
print(df_drop.info())
print(df_drop.describe(include='all'))
print(df_drop.columns.values)

Final Data set has 450 rows and 24 columns.

### Train, Validation, and Test Split

In [None]:
# Create a boolean mask: rows marked True go into the train-validate set, the rest into the test set
# The seed is fixed so that the random split is reproducible

np.random.seed(0)
msk = np.random.rand(len(df_drop)) < 0.8

raw_train_validate_set = df_drop[msk]
raw_test_set = df_drop[~msk]

print(raw_train_validate_set.shape)
print(raw_test_set.shape)

In [None]:
raw_train_validate_set.to_excel(r'dfssoil.xlsx', index = False, header=True)
raw_test_set.to_excel(r'BlindTest_SSOIL.xlsx', index = False, header=True)

We split the data set 80-20 into a "train-validate" set and a "test" set. The test set is external and will never be seen by the model during training.
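
As a minimal sketch of how a random mask produces this split, the example below uses a stand-in frame with 450 rows and NumPy's `default_rng` generator (an alternative to the global seed used in the notebook, chosen here only for illustration). Because each row is assigned independently, the mask gives approximately, not exactly, 80-20; the permutation variant shows one way to get an exact split.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the same number of rows as the final data set (450)
demo = pd.DataFrame({"URF": np.linspace(0.0, 1.0, 450)})

# Seed a local generator so the split is reproducible from run to run
rng = np.random.default_rng(0)
mask = rng.random(len(demo)) < 0.8   # True -> train-validate, False -> blind test

train_validate = demo[mask]
blind_test = demo[~mask]

# Each row is assigned independently, so the split is close to 80-20 but not exact
print(train_validate.shape, blind_test.shape)

# An exact 80-20 split can be obtained by shuffling row positions instead
order = rng.permutation(len(demo))
n_train = int(0.8 * len(demo))
exact_train = demo.iloc[order[:n_train]]
exact_test = demo.iloc[order[n_train:]]
print(exact_train.shape, exact_test.shape)
```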

## 1. PyCaret Implementation

PyCaret will be used in the machine learning portion. PyCaret is a low-code machine learning library in Python that automates machine learning workflows. One of its key benefits is its ability to run a large number of different machine learning algorithms with only a few lines of code.

In [None]:
skiprows = 0
# Read ',' as the thousands separator, then drop all unnamed columns.
df = pd.read_excel("dfssoil.xlsx", thousands=',', skiprows = skiprows)
df = df.loc[:, ~df.columns.str.contains('^Unnamed')] 
df.head()

In [None]:
from pycaret.regression import *

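# Model on the train-validate set read in above (df), converted to float.
# df_drop is not used here because it still contains the blind-test rows.
model_df = df.astype(float)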
target = 'URF'

# no resampling
clf_none = setup(
            data=model_df,
            target=target,
            session_id=42,
            normalize=True,
            transformation = True,
            ignore_low_variance=True,
            remove_outliers = True, outliers_threshold = 0.1,
            remove_multicollinearity = True, multicollinearity_threshold = 0.7,
            train_size=0.7)

In [None]:
best = compare_models()

In [None]:
top3_fold_5 = compare_models(include=['rf', 'et', 'gbr'], fold = 5, sort='MAE')

In [None]:
top3 = compare_models(include=['rf', 'et', 'gbr'], fold = 10, sort='MAE')

Going from 5 folds to 10 folds improves performance for all 3 models. To keep computation time reasonable, the number of folds is kept at 10.
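
The effect of the fold count can be reproduced outside PyCaret. The sketch below is an illustration under assumptions: it uses synthetic data from `make_regression` and a small scikit-learn random forest rather than the notebook's data and tuned models. With 10 folds each fit trains on 90% of the rows instead of 80%, at the cost of twice as many fits.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the recovery-factor table
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
model = RandomForestRegressor(n_estimators=30, random_state=42)

mae_by_fold = {}
for k in (5, 10):
    cv = KFold(n_splits=k, shuffle=True, random_state=42)
    # scikit-learn reports negative MAE so that larger is always better
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
    mae_by_fold[k] = -scores
    print(f"{k:>2} folds: mean MAE = {mae_by_fold[k].mean():.2f} "
          f"(std {mae_by_fold[k].std():.2f})")
```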

## 2. Plot each Model and Check Features

##### Random Forest (RFR)

In [None]:
rfr = create_model('rf')
rfr_results = pull()

rfr_feature_imp = pd.DataFrame({'Feature': get_config('X_train').columns, 'Value' : abs(rfr.feature_importances_)}).sort_values(by='Value', ascending=False)
rfr_feature_imp.to_csv('Feature_importance_RFR.csv')

In [None]:
plot_model(rfr, plot = 'feature')

##### Extra Trees Regressor (et)

In [None]:
et = create_model('et')
et_results = pull()
#print(et_results)

et_feature_imp = pd.DataFrame({'Feature': get_config('X_train').columns, 'Value' : abs(et.feature_importances_)}).sort_values(by='Value', ascending=False)
et_feature_imp.to_csv('Feature_importance_ET.csv')

In [None]:
plot_model(et, plot = 'feature')

##### Gradient Boost Regressor (gbr)

In [None]:
gbr = create_model('gbr')
gbr_results = pull()
#print(gbr_results)

gbr_feature_imp = pd.DataFrame({'Feature': get_config('X_train').columns, 'Value' : abs(gbr.feature_importances_)}).sort_values(by='Value', ascending=False)
#print(gbr_feature_imp)
gbr_feature_imp.to_csv('Feature_importance_GBR.csv')

In [None]:
# Given the sheer number of variables, will only plot the first 10
# 'feature_all' will plot everything
plot_model(gbr, plot = 'feature')

## 3a. Testing for Optimisation - Not necessary to run
-----------------------------------------------------------------------------------

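One of the important settings in hyperparameter tuning is `n_iter`, the number of candidate hyperparameter combinations that `tune_model` samples; each candidate is scored by K-fold cross-validation.

Two checks are done for this. The first scans `n_iter` over range(0, 1000, 50). The optimisation ran overnight and showed that the ML algorithm did not improve much past 50 iterations. The second scans range(1, 51, 1) to look at the behaviour below 50 in increments of 1.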
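
The trade-off between the number of search iterations and run time can be sketched with scikit-learn's `RandomizedSearchCV`, which PyCaret 2.x uses by default for `tune_model` as far as I am aware. The model (`Ridge`), the synthetic data and the `alpha` search range are assumptions chosen to keep the example fast; they are not the notebook's settings.

```python
import time
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=0)

results = []
for n_iter in (1, 5, 25):
    search = RandomizedSearchCV(
        Ridge(),
        param_distributions={"alpha": loguniform(1e-3, 1e3)},
        n_iter=n_iter,            # number of sampled candidates
        cv=5,                     # each candidate is scored by 5-fold CV
        scoring="neg_root_mean_squared_error",
        random_state=0,
    )
    t0 = time.time()
    search.fit(X, y)
    seconds = time.time() - t0
    n_candidates = len(search.cv_results_["params"])
    results.append((n_iter, n_candidates, -search.best_score_, seconds))

for n_iter, n_candidates, rmse, seconds in results:
    print(f"n_iter={n_iter:>2}  candidates={n_candidates:>2}  "
          f"best RMSE={rmse:.3f}  time={seconds:.2f}s")
```

The run time grows roughly linearly with `n_iter`, while the best score typically flattens once the search has sampled the useful region of the parameter space.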

In [None]:
# elapsed = []
# MAE_mean_iter = []
# MSE_mean_iter = []
# RMSE_mean_iter = []

# # The output from the (0, 1000, 50) is saved; there is no need to run this again. 
# # Line has been modified just so the code can run.
# for i in range(0, 51, 50):
#     start = time.time()
#     if i == 0:
#         i += 1    
#     tuned_cb = tune_model(cb, optimize = 'MSE', n_iter = i)
#     #print(tuned_cb)
#     MAE_mean_iter.append(pull()['MAE']['Mean'])
#     MSE_mean_iter.append(pull()['MSE']['Mean'])
#     RMSE_mean_iter.append(pull()['RMSE']['Mean'])
#     elapsed.append((time.time() - start))

# MAE_Mean = pd.DataFrame(MAE_mean_iter, index = elapsed, columns=['MAE Mean Error'])
# MAE_Mean.index.name = 'Elapsed Time'

# MSE_Mean = pd.DataFrame(MSE_mean_iter, index = elapsed, columns=['MSE Mean Error']) 
# MSE_Mean.index.name = 'Elapsed Time'

# RMSE_Mean = pd.DataFrame(RMSE_mean_iter, index = elapsed, columns=['RMSE Mean Error'])
# RMSE_Mean.index.name = 'Elapsed Time'

# res_50_iter = pd.concat([MAE_Mean, MSE_Mean, RMSE_Mean], axis=1)

# print(res_50_iter)

In [None]:
# b = sns.lineplot(data=res_50_iter)
# b.axes.set_title("Error as function of Elapsed Time",fontsize=20)
# b.set_xlabel("Elapsed Time",fontsize=20)
# b.set_ylabel("Error",fontsize=20)
# #b.set_yscale('log')
# b.tick_params(labelsize=18)

In [None]:
# res_50_iter.to_csv('Run_Catboost_1000_Itr.csv')

In [None]:
# elapsed = []
# MAE_mean_iter = []
# MSE_mean_iter = []
# RMSE_mean_iter = []

# # This was run at (1, 51, 1) to get increments of 1
# # Right now, this is changed to (1, 51, 50) to allow the code to run efficiently
# for i in range(1, 51, 50):
#     start = time.time()
#     tuned_cb = tune_model(cb, optimize = 'MSE', n_iter = i)
#     MAE_mean_iter.append(pull()['MAE']['Mean'])
#     MSE_mean_iter.append(pull()['MSE']['Mean'])
#     RMSE_mean_iter.append(pull()['RMSE']['Mean'])
#     elapsed.append((time.time() - start))

# MAE_Mean = pd.DataFrame(MAE_mean_iter, index = elapsed, columns=['MAE Mean Error'])
# MAE_Mean.index.name = 'Elapsed Time'

# MSE_Mean = pd.DataFrame(MSE_mean_iter, index = elapsed, columns=['MSE Mean Error']) 
# MSE_Mean.index.name = 'Elapsed Time'

# RMSE_Mean = pd.DataFrame(RMSE_mean_iter, index = elapsed, columns=['RMSE Mean Error'])
# RMSE_Mean.index.name = 'Elapsed Time'

# res_1_iter = pd.concat([MAE_Mean, MSE_Mean, RMSE_Mean], axis=1)

# print(res_1_iter)

# res_1_iter.to_csv('Run_Catboost_50_Itr.csv')

In [None]:
# b = sns.lineplot(data=res_1_iter)
# b.axes.set_title("Error as function of Elapsed Time",fontsize=20)
# b.set_xlabel("Elapsed Time",fontsize=20)
# b.set_ylabel("Error",fontsize=20)
# #b.set_yscale('log')
# b.tick_params(labelsize=18)

In [None]:
# tuned_cb3 = tune_model(cb, optimize = 'RMSE', n_iter = 10)

In [None]:
# plot_model(tuned_cb3, plot = 'parameter')

##### Gradient Boosting Regression

In [None]:
# gbr = create_model('gbr')
# print(gbr)

In [None]:
# tuned_gbr = tune_model(gbr, search_library = "tune-sklearn", search_algorithm="hyperopt", optimize="RMSE", n_iter=50)
# print(tuned_gbr)

----------------------------------------------------------------------------------------------------------------------------

## 3. Optimisation

### a. Tune the Model

The earlier experiments allow one to determine which model performs efficiently, and the tuning needed to arrive at the answer. Here, we will create the 3 specific models, which we will then blend, and then finally produce a "tuned" blended model based on the earlier optimised parameters.

In [None]:
from pycaret.distributions import UniformDistribution, CategoricalDistribution

catboost_param_dists = {
    'iterations': CategoricalDistribution([500,100,300]),
    'colsample_bylevel': UniformDistribution(0.5, 1.0),
    'random_strength': CategoricalDistribution([0,0.1,0.2,1,10]),
    'max_depth' : CategoricalDistribution([5,6,7,8,9])
}

In [None]:
tuned_models = []

In [None]:
rf = create_model('rf', fold = 10)
rf = tune_model(rf, 
                optimize = 'RMSE', 
                n_iter = 50, 
                choose_better = True, 
                 #search_library = "tune-sklearn", 
                 #search_algorithm="Hyperopt",
                 #search_algorithm="Optuna",
                 #search_algorithm="bayesian",
                )
tuned_models.append(rf)

In [None]:
et = create_model('et', fold = 10)
et = tune_model(et, 
                optimize = 'RMSE', 
                n_iter = 50, 
                choose_better = True,
                #search_library = "optuna", 
                #search_library = "tune-sklearn", 
                #search_algorithm="bayesian",
                #search_algorithm="hyperopt",
                #custom_grid = catboost_param_dists ,
                #early_stopping = "asha",
                #early_stopping_max_iters = 10,
                #return_tuner = False ,   
               )

tuned_models.append(et)

In [None]:
gbr = create_model('gbr', fold = 10)
gbr = tune_model(gbr, 
                 optimize = 'RMSE', 
                 n_iter = 50, 
                 choose_better = True, 
                 #search_library = "tune-sklearn", 
                 #search_algorithm="Hyperopt",
                 #search_algorithm="Optuna",
                 #search_algorithm="bayesian",
                )
tuned_models.append(gbr)

### b. Ensemble the Model

pycaret.regression.ensemble_model(estimator, method: str = 'Bagging', fold: Optional[Union[int, Any]] = None, n_estimators: int = 10, round: int = 4, choose_better: bool = False, optimize: str = 'R2', fit_kwargs: Optional[dict] = None, groups: Optional[Union[str, Any]] = None, verbose: bool = True, return_train_score: bool = False)
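
To my understanding, PyCaret 2.x implements `method = 'Bagging'` with scikit-learn's `BaggingRegressor` and `method = 'Boosting'` with `AdaBoostRegressor`. The sketch below shows the two wrappers directly on synthetic data with a shallow decision tree as the base model; the data, the base model and its depth are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
base = DecisionTreeRegressor(max_depth=4, random_state=42)

candidates = {
    "single tree": base,
    # Bagging: fit copies of the base model on bootstrap samples and average them
    "bagging": BaggingRegressor(base, n_estimators=50, random_state=42),
    # Boosting: fit copies sequentially, re-weighting samples with large errors
    "boosting": AdaBoostRegressor(base, n_estimators=50, random_state=42),
}

rmse = {}
for name, est in candidates.items():
    scores = cross_val_score(est, X, y, cv=5, scoring="neg_root_mean_squared_error")
    rmse[name] = -scores.mean()
    print(f"{name:<12} mean RMSE = {rmse[name]:.2f}")
```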

In [None]:
prediction_model = []

In [None]:
tuned_bagged_rf = ensemble_model(estimator = rf, method = 'Bagging', n_estimators=50, optimize = 'RMSE')
prediction_model.append(tuned_bagged_rf)

In [None]:
tuned_boosted_rf = ensemble_model(estimator = rf, method = 'Boosting', n_estimators=50, optimize = 'RMSE')
prediction_model.append(tuned_boosted_rf)

Based on the output above, the 'Bagging' method improves the cross-validation statistics for the random forest model.

In [None]:
tuned_bagged_et = ensemble_model(estimator = et, method = 'Bagging', n_estimators=50, optimize = 'RMSE')
prediction_model.append(tuned_bagged_et)

In [None]:
tuned_boosted_et = ensemble_model(estimator = et, method = 'Boosting', n_estimators=50, optimize = 'RMSE')
prediction_model.append(tuned_boosted_et)

Based on the output above, the 'Bagging' method reduces the MAE, MSE and RMSE for the extra trees model.

In [None]:
tuned_bagged_gbr = ensemble_model(estimator = gbr, method = 'Bagging', n_estimators=50, optimize = 'RMSE')
prediction_model.append(tuned_bagged_gbr)

In [None]:
tuned_boosted_gbr = ensemble_model(estimator = gbr, method = 'Boosting', n_estimators=50, optimize = 'RMSE')
prediction_model.append(tuned_boosted_gbr)

Based on the output above, the 'Bagging' method reduces the MAE, MSE and RMSE for the gradient boosting model.
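
For reference, the three error metrics compared above can be computed by hand. The sketch below uses made-up actual and predicted values, not results from this study: MAE is the mean absolute residual, MSE the mean squared residual, and RMSE the square root of MSE, which puts the error back in the units of the target.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return MAE, MSE and RMSE for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    mae = np.mean(np.abs(residual))   # mean absolute error
    mse = np.mean(residual ** 2)      # mean squared error
    rmse = np.sqrt(mse)               # root mean squared error
    return {"MAE": mae, "MSE": mse, "RMSE": rmse}

# Made-up recovery factors in percent, for illustration only
actual = [30.0, 40.0, 50.0, 60.0]
predicted = [32.0, 38.0, 53.0, 59.0]

errors = regression_errors(actual, predicted)
print(errors)
```

Because MSE and RMSE square the residuals, they penalise a few large misses more heavily than MAE does, which is why a model can rank differently under the two metrics.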

### c. Blending all Models

pycaret.regression.blend_models(estimator_list: list, fold: Optional[Union[int, Any]] = None, round: int = 4, choose_better: bool = False, optimize: str = 'R2', weights: Optional[List[float]] = None, fit_kwargs: Optional[dict] = None, groups: Optional[Union[str, Any]] = None, verbose: bool = True, return_train_score: bool = False)
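
For regression, blending amounts to averaging the predictions of the member models; as far as I know, PyCaret builds this on scikit-learn's `VotingRegressor`. The sketch below is a minimal illustration on synthetic data with untuned models (both are assumptions) and checks that the blended prediction equals the plain mean of the member predictions when no weights are given.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              RandomForestRegressor, VotingRegressor)

X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

blend = VotingRegressor(estimators=[
    ("rf", RandomForestRegressor(n_estimators=30, random_state=42)),
    ("et", ExtraTreesRegressor(n_estimators=30, random_state=42)),
    ("gbr", GradientBoostingRegressor(random_state=42)),
])
blend.fit(X, y)

# One column of predictions per fitted member model
member_preds = np.column_stack([est.predict(X) for est in blend.estimators_])
manual_average = member_preds.mean(axis=1)

# With no weights, the blend is the unweighted mean of its members
print(np.allclose(blend.predict(X), manual_average))
```

Passing `weights` changes the plain mean into a weighted one, which is what the `weights` argument in the signature above controls.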

In [None]:
blend_5_soft = blend_models(estimator_list = tuned_models, fold=5, optimize = 'RMSE', choose_better = True)
prediction_model.append(blend_5_soft)

In [None]:
blend_10_soft = blend_models(estimator_list = tuned_models, fold=10, optimize = 'RMSE', choose_better = True)
prediction_model.append(blend_10_soft)

### d. Stacking all Models

pycaret.regression.stack_models(estimator_list: list, meta_model=None, meta_model_fold: Optional[Union[int, Any]] = 5, fold: Optional[Union[int, Any]] = None, round: int = 4, restack: bool = True, choose_better: bool = False, optimize: str = 'R2', fit_kwargs: Optional[dict] = None, groups: Optional[Union[str, Any]] = None, verbose: bool = True, return_train_score: bool = False)
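
Stacking differs from blending in that a meta-model learns how to combine the base predictions instead of averaging them. The sketch below uses scikit-learn's `StackingRegressor`, which I believe is what PyCaret 2.x wraps, with a random forest as the meta-model to mirror `meta_model = rf`. The synthetic data and untuned models are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=30, random_state=42)),
        ("et", ExtraTreesRegressor(n_estimators=30, random_state=42)),
        ("gbr", GradientBoostingRegressor(random_state=42)),
    ],
    # The meta-model is trained on out-of-fold predictions of the base models
    final_estimator=RandomForestRegressor(n_estimators=30, random_state=42),
    cv=5,
)
stack.fit(X_train, y_train)

# transform() exposes the base-model predictions that feed the meta-model
meta_features = stack.transform(X_test)
stacked_pred = stack.predict(X_test)
print(meta_features.shape, stacked_pred.shape)
```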

In [None]:
stack_5 = stack_models(estimator_list = tuned_models, meta_model = rf, fold = 5, optimize = 'RMSE', choose_better= True)
prediction_model.append(stack_5)

In [None]:
stack_10 = stack_models(estimator_list = tuned_models, meta_model = rf, fold = 10, optimize = 'RMSE', choose_better= True)
prediction_model.append(stack_10)

In [None]:
prediction_model

In [None]:
for model in prediction_model:
    print(model.__class__.__name__)
    display(predict_model(model))

## 4. Save the Model

#### Lowest Mean RMSE is "blend_10_soft" which is the blending of all models with 10-fold cross-validation

In [None]:
save_model(blend_10_soft, 'Blended_model_15072022')

#### 2nd Lowest Mean RMSE is "tuned_bagged_gbr" which is the bagged gradient boost

In [None]:
save_model(tuned_bagged_gbr, 'Bagged_GradBoost_15072022')

#### 3rd Lowest Mean RMSE is "stack_10" which is the Stacked Models

In [None]:
save_model(stack_10, 'Stack_10_15072022')

#### 4th Lowest Mean RMSE is "tuned_bagged_et" which is the bagged Extra Trees Regressor

In [None]:
save_model(tuned_bagged_et, 'Bagged_ET_15072022')

## 5. Finalise the model

In [None]:
final_blend = finalize_model(blend_10_soft)

In [None]:
final_gbr = finalize_model(tuned_bagged_gbr)

In [None]:
final_stack = finalize_model(stack_10)

In [None]:
final_et = finalize_model(tuned_bagged_et)
#final_et = finalize_model(et)

### Plots to analyse Model

In [None]:
model = final_et
predict_model(model)

In [None]:
plot_model(model)

In [None]:
# Prediction Error 
plot_model(model, plot = 'error')

In [None]:
# Cooks Distance Plot
plot_model(model, plot='cooks')

In [None]:
# Learning Curve
plot_model(model, plot='learning')

In [None]:
# Manifold Learning
plot_model(model, plot='manifold')

In [None]:
# Model Hyperparameter
plot_model(model, plot='parameter')

## 6. Blind Test

In [None]:
dfblind = pd.read_excel("BlindTest_SSOIL.xlsx", thousands=',', skiprows = skiprows)
#dfblind = dfblind.loc[:, ~dfblind.columns.str.contains('^Unnamed')] 
dfblind.head()

In [None]:
BlindPredict = predict_model(final_et, data=dfblind, round=2)

In [None]:
BlindPredict

In [None]:
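a = BlindPredict['URF']      # actual recovery factor from the blind-test file
b = BlindPredict['Label']    # recovery factor predicted by the model

plt.figure(figsize=(14, 8))
plt.scatter(a, b, color='blue')
plt.plot(a, a, color = 'red', label = 'x=y')
plt.xlabel("Recovery Factor (%)", size=14)
plt.ylabel("Evaluated Recovery Factor (%)", size=14)
plt.legend()  # show the label of the x=y reference line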

#plt.tight_layout()
plt.show()

In [None]:
count = 'Completed Process'
elapsed = (time.time() - start)
print ("%s in %s seconds" % (count,elapsed))