# OilyGiant Mining Company

## Introduction

Choosing where to drill is one of the most consequential decisions for an oil company such as OilyGiant Mining Company: the profitability of a new well depends on the oil quality and the reserve volume at the chosen site. This project uses oil sample data and known well parameters from three regions to build a model that predicts the reserve volume of prospective wells. The predictions are then used to identify the region with the highest potential profit margin, and the associated risk is assessed with the bootstrapping technique. The outcome is a recommendation for the region in which OilyGiant Mining Company should develop a new oil well.

## Setup

### Library Import

In [1]:
import pandas as pd
import numpy as np
import statistics
from scipy import stats as st
from matplotlib import pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, r2_score, mean_squared_error, mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder 

### Data Import

In [2]:
try:
    reg_data_0 = pd.read_csv("./data/geo_data_0.csv")
    reg_data_1 = pd.read_csv("./data/geo_data_1.csv")
    reg_data_2 = pd.read_csv("./data/geo_data_2.csv")
except FileNotFoundError as f_error:
    print(f"The following error: ({f_error}) occurred while loading datasets")
else:
    print("The data was successfully loaded")

The data was successfully loaded


## Data Preparation

In [5]:
# Take initial look at data

reg_data_0.head(10)

|   | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | txEyH | 0.705745 | -0.497823 | 1.22117 | 105.280062 |
| 1 | 2acmU | 1.334711 | -0.340164 | 4.36508 | 73.03775 |
| 2 | 409Wp | 1.022732 | 0.15199 | 1.419926 | 85.265647 |
| 3 | iJLyR | -0.032172 | 0.139033 | 2.978566 | 168.620776 |
| 4 | Xdl7t | 1.988431 | 0.155413 | 4.751769 | 154.036647 |
| 5 | wX4Hy | 0.96957 | 0.489775 | -0.735383 | 64.741541 |
| 6 | tL6pL | 0.645075 | 0.530656 | 1.780266 | 49.055285 |
| 7 | BYPU6 | -0.400648 | 0.808337 | -5.62467 | 72.943292 |
| 8 | j9Oui | 0.643105 | -0.551583 | 2.372141 | 113.35616 |
| 9 | OLuZU | 2.173381 | 0.563698 | 9.441852 | 127.910945 |


In [6]:
# Observe shape of data

print(f"Reg 0 Shape: {reg_data_0.shape}")
print(f"Reg 1 Shape: {reg_data_1.shape}")
print(f"Reg 2 Shape: {reg_data_2.shape}")

Reg 0 Shape: (100000, 5)
Reg 1 Shape: (100000, 5)
Reg 2 Shape: (100000, 5)


In [7]:
# Look at general info for each data set

datasets = (reg_data_0, reg_data_1, reg_data_2)

for i, dataset in enumerate(datasets):
    print(f"Region Data {i}:")
    dataset.info()
    print("\n\n")

Region Data 0:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB



Region Data 1:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB



Region Data 2:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data

In [8]:
# Check for duplicates in ID column

for i, dataset in enumerate(datasets):
    print(f"Number of duplicates in Region Data {i}: {dataset['id'].duplicated().sum()}")

Number of duplicates in Region Data 0: 10
Number of duplicates in Region Data 1: 4
Number of duplicates in Region Data 2: 4


**Observations**

The datasets `reg_data_0`, `reg_data_1`, and `reg_data_2` have the same structure: each has 100,000 rows and 5 columns, contains no missing values, and uses the same data types:
- `id`: object
- `f0`: float64
- `f1`: float64
- `f2`: float64
- `product`: float64

All three datasets contain duplicate values in the `id` column (10, 4, and 4 respectively). Because this column is not needed for the model, it can be safely removed. There are no duplicates in any other column.
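
If the repeated identifiers needed a closer look before the column is dropped, one option is to list every row that shares an `id` and to check whether any rows are exact copies. The sketch below uses a small made-up table (`wells` and its values are hypothetical, not taken from the project data) and only standard pandas calls; keeping the first record per `id` is just one possible policy, not the one used in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for one region's table; ids and values are made up
wells = pd.DataFrame({
    "id": ["a1", "b2", "a1", "c3", "b2"],
    "f0": [0.1, 0.5, 0.9, 0.3, 0.5],
    "product": [10.0, 20.0, 30.0, 40.0, 20.0],
})

# keep=False flags every row whose id occurs more than once
repeated = wells[wells["id"].duplicated(keep=False)].sort_values("id")
print(repeated)

# Rows that match in every column are exact copies
exact_copies = wells.duplicated().sum()
print(f"Exact duplicate rows: {exact_copies}")

# One possible policy (an assumption): keep the first record for each id
deduplicated = wells.drop_duplicates(subset="id", keep="first")
print(f"Rows after deduplication: {len(deduplicated)}")
```

Listing the repeated rows side by side shows whether they describe the same measurements or conflicting ones, which is what decides whether dropping rows would be justified.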

In [9]:
# Drop unnecessary column for model

reg_data_0 = reg_data_0.drop(columns='id')
reg_data_1 = reg_data_1.drop(columns='id')
reg_data_2 = reg_data_2.drop(columns='id')

In [10]:
# Rebuild the datasets tuple so that it refers to the updated DataFrames

datasets = (reg_data_0, reg_data_1, reg_data_2)

Data Summary:
Each of the three files contains the data for one region.

- `id` -- an identifier for each oil well; it has been removed because it does not contribute to model training.
- `f0`, `f1`, `f2` -- represent various features of the oil wells and have consistent scaling.
- `product` -- indicates the volume of reserves in each oil well (measured in thousand barrels).

The duplicated identifiers were removed together with the `id` column, and none of the files contains missing values.

## Model Preparation

In [11]:
def split_data_3_1(dataset, drop_cols, target_col, test_s=0.25, rnd_state=123, shuffle=True, axis=1):
    '''Split a dataset into training and validation sets.

    Prints the shapes of the resulting sets and returns
    features_train, features_valid, target_train, target_valid.
    '''

    # Define the features & target
    features = dataset.drop(drop_cols, axis=axis)
    target = dataset[target_col]

    # Hold out test_s of the rows for validation (25% by default, i.e. a 3:1 split)
    features_train, features_valid, target_train, target_valid = train_test_split(
        features, target, test_size=test_s, shuffle=shuffle, random_state=rnd_state
    )

    # Print confirmation of data split
    sum_of_datasets = len(features_train) + len(features_valid)
    if len(dataset) == sum_of_datasets:
        print(f"Features split ratio is 3:1, where data split is allocated as:\n- Training = 75% [shape={features_train.shape}]\n- Validation = 25% [shape={features_valid.shape}]")
        print(f"Target split ratio is 3:1, where data split is allocated as:\n- Training = 75% [shape={target_train.shape}]\n- Validation = 25% [shape={target_valid.shape}]")

    return features_train, features_valid, target_train, target_valid
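
To make the roles of the shuffle and the random seed concrete, here is a hand-rolled sketch of a shuffled 75/25 hold-out split using only NumPy. It illustrates the idea and is not scikit-learn's implementation; the helper name `shuffled_split` and the seed value are assumptions made for this example.

```python
import numpy as np

def shuffled_split(n_rows, test_size=0.25, seed=123):
    """Return (train_idx, valid_idx) row positions for a shuffled hold-out split."""
    rng = np.random.default_rng(seed)        # a fixed seed makes the shuffle repeatable
    order = rng.permutation(n_rows)          # random ordering of the row positions
    n_valid = int(np.ceil(n_rows * test_size))
    return order[n_valid:], order[:n_valid]  # the remaining rows form the training set

train_idx, valid_idx = shuffled_split(100_000)
print(len(train_idx), len(valid_idx))

# The two index sets never overlap, so no row is used for both fitting and validation
print(np.intersect1d(train_idx, valid_idx).size)
```

Calling the helper again with the same seed returns the same row positions, which is why fixing `random_state` in the notebook keeps the reported metrics reproducible between runs.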


In [26]:
# Define features & target for all three regions

print("Reg 0:")
features_0_train, features_0_valid, target_0_train, target_0_valid = split_data_3_1(reg_data_0, ['product'], 'product')
print()

print("Reg 1:")
features_1_train, features_1_valid, target_1_train, target_1_valid = split_data_3_1(reg_data_1, ['product'], 'product')
print()

print("Reg 2:")
features_2_train, features_2_valid, target_2_train, target_2_valid = split_data_3_1(reg_data_2, ['product'], 'product')

Reg 0:
Features split ratio is 3:1, where data split is allocated as:
- Training = 75% [shape=(75000, 3)]
- Validation = 25% [shape=(25000, 3)]
Target split ratio is 3:1, where data split is allocated as:
- Training = 75% [shape=(75000,)]
- Validation = 25% [shape=(25000,)]

Reg 1:
Features split ratio is 3:1, where data split is allocated as:
- Training = 75% [shape=(75000, 3)]
- Validation = 25% [shape=(25000, 3)]
Target split ratio is 3:1, where data split is allocated as:
- Training = 75% [shape=(75000,)]
- Validation = 25% [shape=(25000,)]

Reg 2:
Features split ratio is 3:1, where data split is allocated as:
- Training = 75% [shape=(75000, 3)]
- Validation = 25% [shape=(25000, 3)]
Target split ratio is 3:1, where data split is allocated as:
- Training = 75% [shape=(75000,)]
- Validation = 25% [shape=(25000,)]


In [65]:
# Function to calculate metrics

def calculate_regression_metrics(target, predictions):
    '''Accepts the target and predicted values and prints MSE, RMSE, R^2 and MAE'''
    mse = mean_squared_error(target, predictions)
    # np.sqrt avoids the `squared=False` argument, which is deprecated in recent scikit-learn releases
    rmse = np.sqrt(mse)
    r2 = r2_score(target, predictions)
    mae = mean_absolute_error(target, predictions)

    print(f"MSE: {mse}\nRMSE: {rmse}\nR^2: {r2}\nMAE: {mae}\n")
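
For reference, the four metrics can be written out directly from their definitions. The sketch below recomputes them with NumPy on three made-up values; the function name `regression_metrics` and the sample numbers are hypothetical and only serve to show what each metric measures.

```python
import numpy as np

def regression_metrics(target, predicted):
    """Compute MSE, RMSE, R^2 and MAE from their textbook definitions."""
    target = np.asarray(target, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = target - predicted

    mse = np.mean(residuals ** 2)        # average squared error
    rmse = np.sqrt(mse)                  # same units as the target
    mae = np.mean(np.abs(residuals))     # average absolute error
    # R^2 compares the model's squared error with that of always predicting the mean
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "rmse": rmse, "r2": r2, "mae": mae}

# Made-up reserve volumes (thousand barrels) and predictions
example = regression_metrics([100.0, 80.0, 120.0], [90.0, 85.0, 125.0])
print(example)
```

Because RMSE and MAE are expressed in the units of `product`, they can be read directly as errors in thousand barrels, while R^2 is unitless.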

## Model Training

### Region 0


In [66]:
# Model for Region 0

model_0 = LinearRegression()
model_0.fit(features_0_train, target_0_train)
predictions_0_valid = model_0.predict(features_0_valid)

correct_values_0 = target_0_valid.tolist()
predicted_values_0 = predictions_0_valid.tolist()

### Region 1 

In [67]:
# Region 1 Model

model_1 = LinearRegression()
model_1.fit(features_1_train, target_1_train)
predictions_1_valid = model_1.predict(features_1_valid)

correct_values_1 = target_1_valid.tolist()
predicted_values_1 = predictions_1_valid.tolist()


### Region 2

In [68]:
# Region 2 Model

model_2 = LinearRegression()
model_2.fit(features_2_train, target_2_train)
predictions_2_valid = model_2.predict(features_2_valid)

correct_values_2 = target_2_valid.tolist()
predicted_values_2 = predictions_2_valid.tolist()
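
Each of the three models is an ordinary least squares fit with an intercept. As a rough illustration of what such a fit computes, the sketch below solves the same kind of problem with `np.linalg.lstsq` on synthetic data; the coefficients, the intercept, and the sample size are made up for the example and are unrelated to the regional datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for f0, f1, f2 and a noise-free linear target
features = rng.normal(size=(200, 3))
true_coefs = np.array([2.0, -1.0, 0.5])   # hypothetical coefficients
target = 10.0 + features @ true_coefs     # hypothetical intercept of 10

# Prepend a column of ones so that the intercept is estimated with the slopes
design = np.column_stack([np.ones(len(features)), features])
solution, *_ = np.linalg.lstsq(design, target, rcond=None)

intercept, coefs = solution[0], solution[1:]
predictions = design @ solution
print(intercept, coefs)
```

With a noise-free target the fit recovers the generating intercept and coefficients up to floating-point error; on real data the solution is the set of weights that minimizes the squared prediction error on the training rows.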


In [69]:
# Average predicted reserve volume per region

print(f"Region 0 Average Volume of Predicted Reserve (thousand barrels): {statistics.mean(predicted_values_0)}")
print(f"Region 1 Average Volume of Predicted Reserve (thousand barrels): {statistics.mean(predicted_values_1)}")
print(f"Region 2 Average Volume of Predicted Reserve (thousand barrels): {statistics.mean(predicted_values_2)}")

Region 0 Average Volume of Predicted Reserve (thousand barrels): 92.54936189116307
Region 1 Average Volume of Predicted Reserve (thousand barrels): 69.28001860653977
Region 2 Average Volume of Predicted Reserve (thousand barrels): 95.09859933591373


In [70]:
# Calculate Metrics for each model

print("Metrics for Region 0:")
calculate_regression_metrics(target_0_valid, predictions_0_valid)

print("Metrics for Region 1:")
calculate_regression_metrics(target_1_valid, predictions_1_valid)

print("Metrics for Region 2:")
calculate_regression_metrics(target_2_valid, predictions_2_valid)

Metrics for Region 0:
MSE: 1417.3615751967832
RMSE: 37.64786282376176
R^2: 0.2812975228159569
MAE: 30.984212391272273

Metrics for Region 1:
MSE: 0.8017661964648819
RMSE: 0.8954139804944313
R^2: 0.9996180923165817
MAE: 0.7210648408937341

Metrics for Region 2:
MSE: 1610.2587969766078
RMSE: 40.12803006598514
R^2: 0.19313657905573023
MAE: 32.80763017044863



**Observations**

* On the validation sets, the models predict the highest average reserve for Region 2 (about 95.1 thousand barrels), followed by Region 0 (about 92.5) and Region 1 (about 69.3).

* The RMSE for Region 1 (about 0.90 thousand barrels) is far lower than for Region 0 (about 37.65) or Region 2 (about 40.13), and its R^2 is close to 1 (0.9996, versus 0.28 and 0.19). The linear model therefore predicts reserves much more accurately in Region 1 than in the other two regions.
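
One way to put these RMSE values in context is to compare them with the error of a trivial baseline that always predicts the validation mean. Since R^2 = 1 - MSE / Var(target), that baseline RMSE can be recovered from the reported MSE and R^2 alone. The sketch below does this arithmetic with the metric values printed above; the variable names are chosen for this example.

```python
import math

# MSE and R^2 as reported in the metrics output for each region
reported = {
    "Region 0": {"mse": 1417.3615751967832, "r2": 0.2812975228159569},
    "Region 1": {"mse": 0.8017661964648819, "r2": 0.9996180923165817},
    "Region 2": {"mse": 1610.2587969766078, "r2": 0.19313657905573023},
}

baseline_rmse = {}
for region, metrics in reported.items():
    # Rearranging R^2 = 1 - MSE / Var(target) gives the variance of the validation target
    target_variance = metrics["mse"] / (1.0 - metrics["r2"])
    baseline_rmse[region] = math.sqrt(target_variance)
    model_rmse = math.sqrt(metrics["mse"])
    print(f"{region}: model RMSE {model_rmse:.2f}, mean-baseline RMSE {baseline_rmse[region]:.2f}")
```

The implied baseline error is of similar size in all three regions, so the low RMSE in Region 1 reflects a much better fit rather than a narrower spread of reserve volumes.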