# Project Statement: Where should the next rig be placed?

In this project, I am tasked with identifying the most profitable region for new oil wells for the OilyGiant mining company by analyzing geological data from three regions. The project involves building a linear regression model that predicts the volume of reserves in each well from its geological features (`f0`, `f1`, `f2`). Using these predictions, I will select the most promising wells in each region and determine which region offers the highest total profit. Bootstrapping is used to estimate the distribution of profit and the associated risk, so that the final decision meets the company's requirement that the risk of loss stay below 2.5%. The goal is to maximize return on investment by choosing the best region for new well development while satisfying these business constraints.

For each region, an initial pool of 500 potential sites is evaluated and the 200 with the highest predicted reserves are selected for the profit calculation. The budget for developing these 200 wells is $100 million. Each barrel of oil is expected to generate $4.5 in revenue; because reserve volumes are recorded in thousands of barrels, one unit of product corresponds to $4,500 in revenue. After the risk assessment, only regions with a risk of loss under 2.5% are considered for development. Among these, the region with the highest average profit is chosen as the site for new wells.

# Table of Contents <a id = 'back'></a>

* [1. Load and Inspect Data](#load)
* [2. Cleaning Data](#clean)
* [3. Feature Creation and Model Testing](#test)
* [4. Bootstrapping Technique and Final Assessment](#final)
* [5. Conclusion](#done)

In [1]:
import pandas as pd 
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error



# Load and Inspect Data <a id = 'load'></a>

In [51]:
# loading data 
data0 = pd.read_csv('geo_data_0.csv')
data1 = pd.read_csv('geo_data_1.csv')
data2 = pd.read_csv('geo_data_2.csv')

In [52]:
# creating list of data sets
data_sets = [data0, data1, data2]

In [53]:
# displaying general dataframe information
n =0
for i in data_sets:
    print(f'data{n}:')
    print(i.info())
    print(' ')
    n = n+1

data0:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None
 
data1:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None
 
data2:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 

In [54]:
# displaying descriptive statistics
n =0
for i in data_sets:
    print(f'data{n}:')
    display(i.describe())
    print(' ')
    n = n+1

data0:


| | f0 | f1 | f2 | product |
|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 0.500419 | 0.250143 | 2.502647 | 92.5 |
| std | 0.871832 | 0.504433 | 3.248248 | 44.288691 |
| min | -1.408605 | -0.848218 | -12.088328 | 0.0 |
| 25% | -0.07258 | -0.200881 | 0.287748 | 56.497507 |
| 50% | 0.50236 | 0.250252 | 2.515969 | 91.849972 |
| 75% | 1.073581 | 0.700646 | 4.715088 | 128.564089 |
| max | 2.362331 | 1.343769 | 16.00379 | 185.364347 |


 


 


# Cleaning Data <a id = 'clean'></a>

In [56]:
# checking each dataset for duplicate ids
for n, i in enumerate(data_sets):
    print(f'data{n}:')
    print(i['id'].duplicated().sum())

data0:
10
data1:
4
data2:
4


In [57]:
# writing function to drop every row whose id appears more than once
def check_drop_dupe(data):
    # keep=False flags all occurrences of a repeated id, not just the later copies
    repeated = data['id'].duplicated(keep=False)
    return data[~repeated]
    

In [58]:
# running data through duplicate function
data0 = check_drop_dupe(data0)
data1 = check_drop_dupe(data1)
data2 = check_drop_dupe(data2)
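
To make the behaviour of `check_drop_dupe` concrete, here is a minimal sketch on a made-up four-row frame (the ids and values are invented for illustration). It mirrors what the function does: every row whose `id` appears more than once is dropped, not only the later copies.

```python
import pandas as pd

# Toy data, invented for illustration: id 'a1' appears twice.
toy = pd.DataFrame({
    'id': ['a1', 'b2', 'a1', 'c3'],
    'product': [10.0, 20.0, 30.0, 40.0],
})

# keep=False marks every occurrence of a repeated id, so both 'a1' rows go.
deduped = toy[~toy['id'].duplicated(keep=False)]

print(deduped['id'].tolist())
```

Dropping both copies is the conservative choice when there is no way to tell which of the conflicting records is correct; with at most 10 repeated ids per region, the loss of data is negligible.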

# Feature Creation and Model Testing <a id = 'test'></a>

In [59]:
# creating feature and target data
features0 = data0.drop(['id', 'product'], axis = 1)
target0 = data0['product']

features1 = data1.drop(['id', 'product'], axis = 1)
target1 = data1['product']

features2 = data2.drop(['id', 'product'], axis = 1)
target2 = data2['product']

In [60]:
# creating function to split data into training and validation data 
def split_data(features, target):

    features_train, features_valid, target_train, target_valid = train_test_split(features, 
                                                                              target, 
                                                                              test_size = .25,
                                                                              random_state = 12345)
    return features_train, features_valid, target_train, target_valid

In [61]:
# running data through split function
features_train0, features_valid0, target_train0, target_valid0 = split_data(features0, target0)

features_train1, features_valid1, target_train1, target_valid1 = split_data(features1, target1)

features_train2, features_valid2, target_train2, target_valid2 = split_data(features2, target2)

Region 0 Baseline and Model:

In [11]:
# creating baseline for data0
baseline0 = pd.Series(target_train0.mean(), index = target_valid0.index)
base_mse = mean_squared_error(target_valid0, baseline0)
print(f'Baseline RMSE: {np.sqrt(base_mse)}')

Baseline RMSE: 44.34727648978901
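
The baseline predicts the training-set mean for every validation well, so its RMSE measures how far the validation targets spread around that single value. A minimal sketch with invented numbers (not taken from the `geo_data` files) shows the same calculation done by hand with NumPy:

```python
import numpy as np

# Invented values, used only to illustrate the arithmetic.
toy_train = np.array([80.0, 100.0, 120.0])
toy_valid = np.array([90.0, 110.0])

# Constant prediction: the training mean repeated for each validation row.
toy_baseline = np.full_like(toy_valid, toy_train.mean())

# RMSE = square root of the mean squared difference.
toy_rmse = np.sqrt(np.mean((toy_valid - toy_baseline) ** 2))

print(toy_rmse)
```

Any model worth keeping should score below this value on the same validation rows.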


Building a model to predict the oil reserves

In [62]:
# initializing model
model = LinearRegression()
# fitting data
model.fit(features_train0, target_train0)
# predicting data 
predictions_valid0 = model.predict(features_valid0)

# gathering mse score and rmse score 
mse0 = mean_squared_error(target_valid0, predictions_valid0)
rmse0 = np.sqrt(mse0)
print(f'Average Volume of Predicted Reserves: {predictions_valid0.mean()}')
print(f'Model RMSE: {rmse0}')

Average Volume of Predicted Reserves: 92.42384109947359
Model RMSE: 37.716904960382735


Region 1 Baseline and Model: 

In [13]:
baseline1 = pd.Series(target_train1.mean(), index = target_valid1.index)
base_mse = mean_squared_error(target_valid1, baseline1)
print(f'Baseline RMSE: {np.sqrt(base_mse)}')

Baseline RMSE: 45.97003721244095


In [64]:
# initializing model
model = LinearRegression()
# fitting data
model.fit(features_train1, target_train1)
# predicting data 
predictions_valid1 = model.predict(features_valid1)

# gathering mse score and rmse score 
mse1 = mean_squared_error(target_valid1, predictions_valid1)
rmse1 = np.sqrt(mse1)
print(f'Average Volume of Predicted Reserves: {predictions_valid1.mean()}')
print(f'Model RMSE: {rmse1}')

Average Volume of Predicted Reserves: 68.98311857983121
Model RMSE: 0.891490139034853


Region 2 Baseline and Model:

In [15]:
baseline2 = pd.Series(target_train2.mean(), index = target_valid2.index)
base_mse = mean_squared_error(target_valid2, baseline2)
print(f'Baseline RMSE: {np.sqrt(base_mse)}')

Baseline RMSE: 44.575227075379665


In [65]:
# initializing model
model = LinearRegression()
# fitting data
model.fit(features_train2, target_train2)
# predicting data 
predictions_valid2 = model.predict(features_valid2)

# gathering mse score and rmse score 
mse2 = mean_squared_error(target_valid2, predictions_valid2)
rmse2 = np.sqrt(mse2)
print(f'Average Volume of Predicted Reserves: {predictions_valid2.mean()}')
print(f'Model RMSE: {rmse2}')

Average Volume of Predicted Reserves: 95.11622302076478
Model RMSE: 39.975543264382345


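In all three regions the linear regression model achieves a lower RMSE than the constant-mean baseline (37.72 vs. 44.35 in region 0, 0.89 vs. 45.97 in region 1, and 39.98 vs. 44.58 in region 2), so I will proceed with its predictions.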
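
The three regions repeat the same split, fit, and score steps, so they could be wrapped in one function. The sketch below is one possible way to do that; the helper name `fit_and_score` and the synthetic data are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def fit_and_score(features, target, random_state=12345):
    """Hypothetical helper: split, fit, and report baseline vs. model RMSE."""
    f_train, f_valid, t_train, t_valid = train_test_split(
        features, target, test_size=0.25, random_state=random_state
    )
    model = LinearRegression().fit(f_train, t_train)
    predictions = pd.Series(model.predict(f_valid), index=t_valid.index)

    # Baseline: predict the training mean for every validation row.
    baseline = np.full(len(t_valid), t_train.mean())
    return {
        'baseline_rmse': np.sqrt(mean_squared_error(t_valid, baseline)),
        'model_rmse': np.sqrt(mean_squared_error(t_valid, predictions)),
        'mean_prediction': predictions.mean(),
    }


# Synthetic stand-in for one region (not the real geo_data files).
rng = np.random.default_rng(0)
toy_features = pd.DataFrame(rng.normal(size=(400, 3)), columns=['f0', 'f1', 'f2'])
toy_target = 50 + 10 * toy_features['f2'] + rng.normal(scale=1.0, size=400)

scores = fit_and_score(toy_features, toy_target)
print(scores)
```

With the real data, the call would be `fit_and_score(features0, target0)` and so on for each region, which keeps the split and scoring logic identical across regions.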

In [17]:
# finding minimum units for net profit
n_best = 200 
budget = 100000000
barrel = 4.5
unit =  1000 * barrel
minimum_reserve = budget/n_best/unit
print(f'Minimum units for net positive profit when developing a new well: {minimum_reserve:.3f}')




Minimum units for net positive profit when developing a new well: 111.111


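The break-even volume of 111.11 units per well is above the average predicted reserves in all three regions (92.42, 68.98, and 95.12 units for regions 0, 1, and 2), with region 1 furthest below it. Profitability therefore depends on selecting wells well above the regional average rather than developing wells at random.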
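
The size of the gap can be computed directly. The sketch below only restates numbers already printed in this notebook (predicted means rounded to two decimals); the variable names are chosen here for illustration.

```python
# Break-even volume per well, in thousands of barrels (same formula as above).
budget = 100_000_000
n_best = 200
unit_revenue = 4.5 * 1000
break_even = budget / n_best / unit_revenue

# Average predicted reserves per region, copied from the model outputs above.
mean_predicted = {'Region 0': 92.42, 'Region 1': 68.98, 'Region 2': 95.12}

for region, mean_volume in mean_predicted.items():
    shortfall = break_even - mean_volume
    print(f'{region}: {shortfall:.2f} thousand barrels below break-even')
```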

In [66]:
# creating profit calculation function
def profit_calculation(target, predictions, count):
    budget = 100000000
    unit = 4.5 * 1000
    top_predictions = predictions.nlargest(count).index
    top_wells = target.loc[top_predictions]
    rev = top_wells.sum() * unit
    profit = rev - budget
    return profit
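
A small, invented example illustrates the key point of `profit_calculation`: wells are ranked by predicted volume, but revenue is computed from the actual volumes of the selected wells. The variant below, `profit_sketch`, takes `budget` and `unit` as parameters; that is an assumption made here so the toy numbers stay readable, not how the function above is written.

```python
import pandas as pd


def profit_sketch(target, predictions, count, budget, unit):
    """Variant of profit_calculation with budget and unit passed in."""
    top_index = predictions.nlargest(count).index  # rank by prediction
    revenue = target.loc[top_index].sum() * unit   # pay out on actual volume
    return revenue - budget


# Invented volumes (thousands of barrels) for four wells.
wells = ['w1', 'w2', 'w3', 'w4']
toy_target = pd.Series([120.0, 80.0, 150.0, 60.0], index=wells)
toy_predictions = pd.Series([100.0, 140.0, 130.0, 50.0], index=wells)

# Top 2 by prediction are w2 and w3, whose actual volumes are 80 and 150.
toy_profit = profit_sketch(toy_target, toy_predictions, count=2,
                           budget=900_000, unit=4500)
print(toy_profit)
```

Here the model overrates `w2`, so the selection misses the better well `w1`; prediction error translates directly into lost revenue.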

Profit Calculation for each region:
    

In [67]:
# running each region through profit calculation 
predictions_valid0 = pd.Series(predictions_valid0, index = target_valid0.index)
profit0 = profit_calculation(target_valid0, predictions_valid0, 200)
print(f'Region 0 profit from top 200 wells: ${profit0:.2f}')
print()

predictions_valid1 = pd.Series(predictions_valid1, index = target_valid1.index)
profit1 = profit_calculation(target_valid1, predictions_valid1, 200)
print(f'Region 1 profit from top 200 wells: ${profit1:.2f}')
print()

predictions_valid2 = pd.Series(predictions_valid2, index = target_valid2.index)
profit2 = profit_calculation(target_valid2, predictions_valid2, 200)
print(f'Region 2 profit from top 200 wells: ${profit2:.2f}')
print()

Region 0 profit from top 200 wells: $31360260.57

Region 1 profit from top 200 wells: $24150866.97

Region 2 profit from top 200 wells: $24659457.92



# Bootstrapping Technique and Final Assessment <a id = 'final'></a>

In [68]:
# creating bootstrapping function
def bootstrap_func(target, predictions, sample_size):
    state = np.random.RandomState(12345)
    profit_values = []

    for i in range(1000):
        target_subsample = target.sample(sample_size, replace = True, random_state = state)
        predict_subsample = predictions[target_subsample.index]
    
        profit_values.append(profit_calculation(target_subsample, predict_subsample, 200))

    
    profit_values = pd.Series(profit_values)
    
    profit_mean = profit_values.mean()
    lower = profit_values.quantile(.025)
    upper = profit_values.quantile(.975)
    risk_of_loss = profit_values[profit_values < 0].count() / len(profit_values)
    
    print(f'''
    Mean Profit: ${profit_mean:.2f},
    95% Confidence interval - 
        Lower limit: {lower:.2f}, 
        Upper limit: {upper:.2f},
    Risk of Loss: {risk_of_loss:.1%}''')

In [69]:
# running region 0 through bootstrap function
print('Region 0:')
bootstrap_func(target_valid0, predictions_valid0, 500)

Region 0:

    Mean Profit: $6329210.98,
    95% Confidence interval - 
        Lower limit: 549998.19, 
        Upper limit: 12849098.55,
    Risk of Loss: 2.0%


In [70]:
# running region 1 through bootstrap function
print('Region 1:')
bootstrap_func(target_valid1, predictions_valid1, 500)

Region 1:

    Mean Profit: $6836507.78,
    95% Confidence interval - 
        Lower limit: 1761041.83, 
        Upper limit: 12000844.86,
    Risk of Loss: 0.7%


In [71]:
# running region 2 through bootstrap function
print('Region 2:')
bootstrap_func(target_valid2, predictions_valid2, 500)

Region 2:

    Mean Profit: $5290364.67,
    95% Confidence interval - 
        Lower limit: -635691.31, 
        Upper limit: 11644733.64,
    Risk of Loss: 4.4%
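
One detail worth checking in `bootstrap_func`: sampling with replacement produces repeated index labels, and label-based selection in pandas returns every row that carries a requested label. Selecting the top `count` predictions and then calling `target.loc[...]` can therefore return more than `count` wells. The sketch below, on invented data, shows the effect and one possible workaround (resetting the index of the subsample first). It is an illustration under these assumptions only; the figures reported above come from the function as written.

```python
import pandas as pd

# Invented data: 5 wells, predictions equal to the true values for simplicity.
toy_target = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])
toy_predictions = toy_target.copy()

# Stand-in for a bootstrap draw in which label 4 was sampled twice.
sub_target = toy_target.loc[[4, 4, 3, 1]]
sub_predictions = toy_predictions.loc[sub_target.index]

# The two largest predictions both carry label 4.
top_labels = sub_predictions.nlargest(2).index
# .loc returns every row matching each requested label.
picked_as_written = sub_target.loc[top_labels]

# Possible workaround: give every sampled row its own positional label.
sub_target_r = sub_target.reset_index(drop=True)
sub_predictions_r = sub_predictions.reset_index(drop=True)
picked_reset = sub_target_r.loc[sub_predictions_r.nlargest(2).index]

print(len(picked_as_written), len(picked_reset))
```

If the first count is larger than 2, the selection has been inflated by repeated labels; after resetting the index, exactly two wells are picked.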


# Conclusion <a id = 'done'></a>

Although region 1 produced the smallest profit from its top 200 validation wells ($24.2 million, versus $31.4 million for region 0 and $24.7 million for region 2), the bootstrap analysis shows it to be the most reliable choice. Region 1 has the highest mean profit ($6.84 million), the highest lower bound of the 95% interval ($1.76 million), and the lowest risk of loss (0.7%). Region 2 exceeds the 2.5% risk threshold (4.4%) and is excluded; region 0 passes (2.0%) but has a lower mean profit ($6.33 million). Based on this analysis, I recommend that OilyGiant focus its development of new wells in region 1.
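
The selection rule from the project statement can be written out in a few lines. The sketch below copies the bootstrap figures printed above (risk values as rounded in the output) and applies the rule; the dictionary layout and variable names are chosen here for illustration.

```python
# Bootstrap summary, copied from the outputs above.
results = {
    'Region 0': {'mean_profit': 6329210.98, 'risk_of_loss': 0.020},
    'Region 1': {'mean_profit': 6836507.78, 'risk_of_loss': 0.007},
    'Region 2': {'mean_profit': 5290364.67, 'risk_of_loss': 0.044},
}

max_risk = 0.025  # threshold from the project statement

# Keep regions under the risk threshold, then take the highest mean profit.
eligible = {name: r for name, r in results.items() if r['risk_of_loss'] < max_risk}
best_region = max(eligible, key=lambda name: eligible[name]['mean_profit'])

print(sorted(eligible), best_region)
```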