# Optimizing Oil Well Investment: Predictive Modeling and Profitability Analysis Across Regions

In this project we attempt to:
- Collect the oil well parameters in the selected region: oil quality and volume of reserves;
- Build a model for predicting the volume of reserves in the new wells;
- Pick the oil wells with the highest estimated values;
- Pick the region with the highest total profit for the selected oil wells.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

In [2]:
data_0 = pd.read_csv('/datasets/geo_data_0.csv')
data_1 = pd.read_csv('/datasets/geo_data_1.csv')
data_2 = pd.read_csv('/datasets/geo_data_2.csv')

In [3]:
data_0.head()

Unnamed: 0,id,f0,f1,f2,product
0,txEyH,0.705745,-0.497823,1.22117,105.280062
1,2acmU,1.334711,-0.340164,4.36508,73.03775
2,409Wp,1.022732,0.15199,1.419926,85.265647
3,iJLyR,-0.032172,0.139033,2.978566,168.620776
4,Xdl7t,1.988431,0.155413,4.751769,154.036647


In [4]:
data_1.head()

Unnamed: 0,id,f0,f1,f2,product
0,kBEdx,-15.001348,-8.276,-0.005876,3.179103
1,62mP7,14.272088,-3.475083,0.999183,26.953261
2,vyE1P,6.263187,-5.948386,5.00116,134.766305
3,KcrkZ,-13.081196,-11.506057,4.999415,137.945408
4,AHL4O,12.702195,-8.147433,5.004363,134.766305


In [5]:
data_2.head()

Unnamed: 0,id,f0,f1,f2,product
0,fwXo0,-1.146987,0.963328,-0.828965,27.758673
1,WJtFt,0.262778,0.269839,-2.530187,56.069697
2,ovLUW,0.194587,0.289035,-5.586433,62.87191
3,q6cA6,2.23606,-0.55376,0.930038,114.572842
4,WPMUX,-0.515993,1.716266,5.899011,149.600746


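Previewed the first five rows of each dataset. Every dataset has a well `id`, three features (`f0`, `f1`, `f2`), and the target `product`, the volume of reserves in thousand barrels.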

In [6]:
data_0.dtypes

id          object
f0         float64
f1         float64
f2         float64
product    float64
dtype: object

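Checked the column data types of `data_0`: `id` is stored as `object` (string), while the three features and the target `product` are `float64`.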

In [7]:
data_0.isna().sum()

id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

In [8]:
data_1.isna().sum()

id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

In [9]:
data_2.isna().sum()

id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

Checked for null values: none of the three datasets contains missing values.
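
The per-dataset checks above could be folded into one helper so that all three regions are inspected in the same way. The sketch below is only illustrative: the `summarize` function and the toy DataFrame are assumptions, not part of the original notebook, and the duplicate-`id` count is an extra check the notebook does not perform.

```python
import pandas as pd


def summarize(df, id_col="id"):
    """Return basic quality checks for one region's dataset (hypothetical helper)."""
    return {
        "rows": len(df),
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_ids": int(df[id_col].duplicated().sum()),
        "non_numeric_columns": df.select_dtypes(exclude="number").columns.tolist(),
    }


# Toy stand-in for a region's data; the real notebook would pass data_0, data_1, data_2
toy = pd.DataFrame({
    "id": ["a", "b", "b"],
    "f0": [0.1, 0.2, None],
    "product": [10.0, 20.0, 30.0],
})
summary = summarize(toy)
print(summary)
```

In the notebook this would be called as `summarize(data_0)`, `summarize(data_1)`, and `summarize(data_2)`.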

In [10]:
features_0 = data_0.drop(['id', 'product'], axis=1)
target_0 = data_0['product']
   
features_1 = data_1.drop(['id', 'product'], axis=1)
target_1 = data_1['product']
   
features_2 = data_2.drop(['id', 'product'], axis=1)
target_2 = data_2['product']
   

In [11]:
features_train_0, features_valid_0, target_train_0, target_valid_0 = train_test_split(features_0, target_0, test_size=0.25, random_state=42)


features_train_1, features_valid_1, target_train_1, target_valid_1 = train_test_split(features_1, target_1, test_size=0.25, random_state=42)


features_train_2, features_valid_2, target_train_2, target_valid_2 = train_test_split(features_2, target_2, test_size=0.25, random_state=42)

In [12]:
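# Use a separate scaler per region. Each scaler is fitted on the training set
# only and then applied to that region's validation set.
scaler_0 = StandardScaler()
features_train_0 = scaler_0.fit_transform(features_train_0)
features_valid_0 = scaler_0.transform(features_valid_0)

scaler_1 = StandardScaler()
features_train_1 = scaler_1.fit_transform(features_train_1)
features_valid_1 = scaler_1.transform(features_valid_1)

scaler_2 = StandardScaler()
features_train_2 = scaler_2.fit_transform(features_train_2)
features_valid_2 = scaler_2.transform(features_valid_2)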

In [13]:
print(features_train_0.shape, target_train_0.shape)
print(features_valid_0.shape, target_valid_0.shape)

(75000, 3) (75000,)
(25000, 3) (25000,)


In [14]:

print(features_train_1.shape, target_train_1.shape)
print(features_valid_1.shape, target_valid_1.shape)

(75000, 3) (75000,)
(25000, 3) (25000,)


In [15]:

print(features_train_2.shape, target_train_2.shape)
print(features_valid_2.shape, target_valid_2.shape)

(75000, 3) (75000,)
(25000, 3) (25000,)


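Split each dataset into a training set (75%) and a validation set (25%), standardized the features, and confirmed the resulting shapes: 75,000 training rows and 25,000 validation rows per region, with three features each.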
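
As an alternative way to organize the split-then-scale step, scaling and regression can be bundled in a scikit-learn pipeline, so the scaler is always fitted on the training rows only. This is a minimal sketch on synthetic data, not the notebook's actual code; the synthetic features and coefficients are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one region (the real data comes from geo_data_*.csv)
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f0", "f1", "f2"])
target = pd.Series(
    2.0 * features["f0"] + 5.0 * features["f2"] + rng.normal(size=1000),
    name="product",
)

features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=42
)

# Fitting the pipeline fits the scaler on the training rows only;
# predict() reuses those training statistics on the validation rows.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(features_train, target_train)
predictions = pipeline.predict(features_valid)

print(features_train.shape, features_valid.shape)
```

With this layout there is one fitted object per region, which also keeps the scaler and model together if predictions are needed for new wells later.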

In [22]:
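def train_and_evaluate(features_train, target_train, features_valid, target_valid):
    """Fit a linear regression and return (mean predicted reserves, validation RMSE)."""
    model = LinearRegression()
    model.fit(features_train, target_train)

    predictions = model.predict(features_valid)
    average_predicted_reserves = np.mean(predictions)

    # Take the square root explicitly: the `squared=False` argument was
    # deprecated in scikit-learn 1.4 and removed in later releases.
    rmse = np.sqrt(mean_squared_error(target_valid, predictions))

    return average_predicted_reserves, rmse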

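Defined a helper that trains a linear regression model for one region and returns the average predicted reserves and the RMSE on the validation set.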

In [23]:

avg_pred_0, rmse_0 = train_and_evaluate(features_train_0, target_train_0, features_valid_0, target_valid_0)
print(f"Region 0 - Average Predicted Reserves: {avg_pred_0:.2f}, RMSE: {rmse_0:.2f}")


avg_pred_1, rmse_1 = train_and_evaluate(features_train_1, target_train_1, features_valid_1, target_valid_1)
print(f"Region 1 - Average Predicted Reserves: {avg_pred_1:.2f}, RMSE: {rmse_1:.2f}")


avg_pred_2, rmse_2 = train_and_evaluate(features_train_2, target_train_2, features_valid_2, target_valid_2)
print(f"Region 2 - Average Predicted Reserves: {avg_pred_2:.2f}, RMSE: {rmse_2:.2f}")


Region 0 - Average Predicted Reserves: 92.40, RMSE: 37.76
Region 1 - Average Predicted Reserves: 68.71, RMSE: 0.89
Region 2 - Average Predicted Reserves: 94.77, RMSE: 40.15


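Printed the average predicted reserves and the root mean squared error (RMSE) for all three regions. Region 1 has the lowest average predicted reserves (68.71) but by far the most accurate model (RMSE 0.89, versus 37.76 and 40.15 for Regions 0 and 2).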
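
An RMSE value is easier to interpret next to a naive baseline. The sketch below compares a model's RMSE with the RMSE of always predicting the training-set mean. The helper names and the toy numbers are assumptions for illustration; the notebook itself does not compute a baseline.

```python
import numpy as np
from sklearn.metrics import mean_squared_error


def rmse(y_true, y_pred):
    """Root mean squared error as a plain float."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


def baseline_rmse(target_train, target_valid):
    """RMSE of always predicting the training-set mean (hypothetical helper)."""
    constant = np.full(len(target_valid), np.mean(target_train))
    return rmse(target_valid, constant)


# Toy numbers, not taken from the datasets
target_train = np.array([10.0, 20.0, 30.0])
target_valid = np.array([18.0, 22.0])
predictions = np.array([19.0, 21.0])

model_rmse = rmse(target_valid, predictions)
base_rmse = baseline_rmse(target_train, target_valid)
print(model_rmse, base_rmse)
```

If a region's model RMSE is close to its baseline RMSE, the features add little beyond the mean; a much lower model RMSE suggests the features carry real signal.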

In [24]:
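total_budget = 100_000_000  # USD, budget for developing the selected wells
number_of_wells = 200
revenue_per_barrel = 4.5  # USD

cost_per_well = total_budget / number_of_wells

# `product` is measured in thousand barrels, so convert barrels to thousand barrels
break_even_volume = (cost_per_well / revenue_per_barrel) / 1000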


average_volume_0 = data_0['product'].mean()
average_volume_1 = data_1['product'].mean()
average_volume_2 = data_2['product'].mean()


print(f"Break-even volume per well (thousand barrels): {break_even_volume:.2f}")
print(f"Average volume of reserves in Region 0: {average_volume_0:.2f}")
print(f"Average volume of reserves in Region 1: {average_volume_1:.2f}")
print(f"Average volume of reserves in Region 2: {average_volume_2:.2f}")

for region, average_volume in enumerate([average_volume_0, average_volume_1, average_volume_2]):
    verdict = "profitable" if average_volume >= break_even_volume else "not profitable"
    print(f"Region {region} is {verdict} based on the average volume of reserves.")


Break-even volume per well (thousand barrels): 111.11
Average volume of reserves in Region 0: 92.50
Average volume of reserves in Region 1: 68.83
Average volume of reserves in Region 2: 95.00
Region 0 is not profitable based on the average volume of reserves.
Region 1 is not profitable based on the average volume of reserves.
Region 2 is not profitable based on the average volume of reserves.


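Calculated the break-even volume: a well must yield at least 111.11 thousand barrels to cover its cost. The average reserves in all three regions fall below this threshold, so wells drilled at random would lose money on average. The wells therefore have to be selected by predicted reserves.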
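
The break-even arithmetic can be written out step by step. This assumes, as the variable names in the code suggest, a budget of 100 million USD for 200 wells, revenue of 4.5 USD per barrel, and `product` measured in thousand barrels:

$$
\text{cost per well} = \frac{100{,}000{,}000\ \text{USD}}{200} = 500{,}000\ \text{USD}
$$

$$
\text{break-even volume} = \frac{500{,}000\ \text{USD}}{4.5\ \text{USD/barrel}} \approx 111{,}111\ \text{barrels} \approx 111.11\ \text{thousand barrels}
$$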

In [25]:
model_0 = LinearRegression()
model_1 = LinearRegression()
model_2 = LinearRegression()

model_0.fit(features_train_0, target_train_0)
model_1.fit(features_train_1, target_train_1)
model_2.fit(features_train_2, target_train_2)


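def calculate_profit(predictions, targets, count):
    """Select the `count` wells with the highest predictions and return
    (total actual reserves in thousand barrels, profit in million USD)."""
    predictions_df = pd.DataFrame({'prediction': predictions, 'target': targets})
    sorted_predictions_df = predictions_df.sort_values(by='prediction', ascending=False)
    selected = sorted_predictions_df.head(count)
    total_reserves = selected['target'].sum()  # thousand barrels

    revenue_per_barrel = 4.5  # USD
    # thousand barrels * USD per barrel = thousand USD; dividing by 1000 gives million USD
    total_revenue = total_reserves * revenue_per_barrel / 1000
    total_cost = 100  # million USD (the full budget)
    profit = total_revenue - total_cost

    return total_reserves, profit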


number_of_wells = 200


pred_0 = model_0.predict(features_valid_0)
reserves_0, profit_0 = calculate_profit(pred_0, target_valid_0, number_of_wells)

pred_1 = model_1.predict(features_valid_1)
reserves_1, profit_1 = calculate_profit(pred_1, target_valid_1, number_of_wells)

pred_2 = model_2.predict(features_valid_2)
reserves_2, profit_2 = calculate_profit(pred_2, target_valid_2, number_of_wells)

print(f"Region 0 - Total Reserves: {reserves_0:.2f} thousand barrels, Profit: {profit_0:.2f} million USD")
print(f"Region 1 - Total Reserves: {reserves_1:.2f} thousand barrels, Profit: {profit_1:.2f} million USD")
print(f"Region 2 - Total Reserves: {reserves_2:.2f} thousand barrels, Profit: {profit_2:.2f} million USD")

Region 0 - Total Reserves: 29686.98 thousand barrels, Profit: 33.59 million USD
Region 1 - Total Reserves: 27589.08 thousand barrels, Profit: 24.15 million USD
Region 2 - Total Reserves: 27996.83 thousand barrels, Profit: 25.99 million USD


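Selected the 200 wells with the highest predicted reserves in each validation set and computed the profit from their actual reserves. With this selection all three regions are profitable, and Region 0 shows the highest profit (33.59 million USD).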
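
The key point of the profit calculation is that wells are chosen by predicted reserves but valued by actual reserves. The toy example below illustrates this with four wells. The function name, the parameter names, and the reduced budget of 1 million USD are assumptions chosen to keep the numbers small; they are not the notebook's values.

```python
import pandas as pd


def profit_million_usd(predictions, targets, count,
                       revenue_per_thousand_barrels=4500.0,
                       budget_usd=100_000_000.0):
    """Pick `count` wells by prediction, then value them by actual reserves."""
    frame = pd.DataFrame({"prediction": list(predictions), "target": list(targets)})
    selected = frame.nlargest(count, "prediction")
    revenue_usd = selected["target"].sum() * revenue_per_thousand_barrels
    return (revenue_usd - budget_usd) / 1_000_000


# Four toy wells (thousand barrels). The second well has the largest actual
# reserves but a low prediction, so it is not selected.
predictions = [150.0, 90.0, 140.0, 60.0]
targets = [120.0, 160.0, 130.0, 50.0]

toy_profit = profit_million_usd(predictions, targets, count=2, budget_usd=1_000_000.0)
print(toy_profit)  # (120 + 130) * 4500 - 1,000,000 = 125,000 USD -> 0.125
```

Prediction errors therefore cost money twice: good wells can be missed, and weaker wells can be picked in their place.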

In [33]:
def bootstrap_profit(model, features_valid, target_valid, n_samples=1000, count=200):
    """Bootstrap the profit of the top `count` wells picked from random samples of 500 wells.

    Returns (mean profit, 2.5th percentile, 97.5th percentile, risk of loss in %),
    with profits in million USD.
    """
    profits = []
    for i in range(n_samples):
        # Draw 500 validation wells with replacement
        sample = np.random.choice(len(features_valid), size=500, replace=True)
        sample_features = features_valid[sample]
        sample_target = target_valid.iloc[sample]

        sample_predictions = model.predict(sample_features)

        _, profit = calculate_profit(sample_predictions, sample_target, count)
        profits.append(profit)

    profits = np.array(profits)
    mean_profit = np.mean(profits)
    lower_bound = np.percentile(profits, 2.5)
    upper_bound = np.percentile(profits, 97.5)
    risk_of_loss = np.mean(profits < 0) * 100  # share of samples with a loss, in %

    return mean_profit, lower_bound, upper_bound, risk_of_loss




In [34]:

results_0 = bootstrap_profit(model_0, features_valid_0, target_valid_0)
results_1 = bootstrap_profit(model_1, features_valid_1, target_valid_1)
results_2 = bootstrap_profit(model_2, features_valid_2, target_valid_2)

print(f"Region 0 - Average Profit: {results_0[0]:.2f} million USD, 95% CI: ({results_0[1]:.2f}, {results_0[2]:.2f}), Risk of Loss: {results_0[3]:.2f}%")
print(f"Region 1 - Average Profit: {results_1[0]:.2f} million USD, 95% CI: ({results_1[1]:.2f}, {results_1[2]:.2f}), Risk of Loss: {results_1[3]:.2f}%")
print(f"Region 2 - Average Profit: {results_2[0]:.2f} million USD, 95% CI: ({results_2[1]:.2f}, {results_2[2]:.2f}), Risk of Loss: {results_2[3]:.2f}%")


Region 0 - Average Profit: 3.95 million USD, 95% CI: (-1.21, 8.83), Risk of Loss: 6.20%
Region 1 - Average Profit: 4.37 million USD, 95% CI: (0.41, 8.22), Risk of Loss: 1.10%
Region 2 - Average Profit: 3.86 million USD, 95% CI: (-1.78, 8.96), Risk of Loss: 8.00%


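Applied bootstrapping with 1,000 resamples per region. Each resample draws 500 validation wells with replacement, selects the 200 with the highest predicted reserves, and computes the profit from their actual reserves. From the resulting profit distribution we report the average profit, the 95% confidence interval (2.5th to 97.5th percentile), and the risk of loss (the share of resamples with negative profit).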
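
Because the bootstrap draws random samples, its results change slightly from run to run unless the random generator is seeded. The sketch below is a seeded, NumPy-only variant of the same procedure that works on precomputed predictions. The function name, its parameters, and the synthetic demo data are assumptions for illustration, not the notebook's implementation.

```python
import numpy as np


def bootstrap_profit_seeded(predictions, targets, n_samples=1000, sample_size=500,
                            count=200, revenue_per_thousand_barrels=4500.0,
                            budget_usd=100_000_000.0, seed=12345):
    """Bootstrap profit in million USD with a fixed seed (hypothetical variant)."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    rng = np.random.default_rng(seed)
    profits = np.empty(n_samples)
    for i in range(n_samples):
        # Draw well positions with replacement
        idx = rng.integers(0, len(predictions), size=sample_size)
        # Keep the `count` sampled wells with the highest predictions
        top = idx[np.argsort(predictions[idx])[::-1][:count]]
        revenue_usd = targets[top].sum() * revenue_per_thousand_barrels
        profits[i] = (revenue_usd - budget_usd) / 1_000_000
    return {
        "mean": float(profits.mean()),
        "ci_lower": float(np.percentile(profits, 2.5)),
        "ci_upper": float(np.percentile(profits, 97.5)),
        "risk_of_loss_pct": float((profits < 0).mean() * 100),
    }


# Synthetic demo data, not the real validation sets
demo_rng = np.random.default_rng(0)
demo_targets = demo_rng.uniform(0, 190, size=5000)
demo_predictions = demo_targets + demo_rng.normal(0, 40, size=5000)

result = bootstrap_profit_seeded(demo_predictions, demo_targets, n_samples=200)
print(result)
```

Calling the function twice with the same `seed` returns identical numbers, which makes the reported interval and risk reproducible.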

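Region 1 is the best choice: it has the highest average profit (4.37 million USD), the lowest risk of loss (1.10%), and it is the only region whose 95% confidence interval lies entirely above zero.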
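
One way to make this choice explicit is to filter regions by an acceptable risk of loss and then take the highest average profit among those that remain. The sketch below uses the bootstrap figures printed above. The 2.5% risk threshold and the `pick_region` helper are hypothetical; the notebook does not state a threshold.

```python
# Bootstrap results copied from the printed output (million USD, percent)
results = {
    "Region 0": {"mean_profit": 3.95, "risk_of_loss_pct": 6.20},
    "Region 1": {"mean_profit": 4.37, "risk_of_loss_pct": 1.10},
    "Region 2": {"mean_profit": 3.86, "risk_of_loss_pct": 8.00},
}

MAX_RISK_PCT = 2.5  # hypothetical threshold, not taken from the notebook


def pick_region(results, max_risk_pct):
    """Return the region with the highest mean profit among those below the risk limit."""
    eligible = {
        name: stats for name, stats in results.items()
        if stats["risk_of_loss_pct"] < max_risk_pct
    }
    if not eligible:
        return None  # no region meets the risk requirement
    return max(eligible, key=lambda name: eligible[name]["mean_profit"])


best = pick_region(results, MAX_RISK_PCT)
print(best)
```

With these figures Region 1 is selected under the assumed threshold, and it would also be selected on average profit alone.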

In this project we have successfully:
- Collected the oil well parameters in the selected region: oil quality and volume of reserves;
- Built a model for predicting the volume of reserves in the new wells;
- Picked the oil wells with the highest estimated values;
- Picked the region with the highest total profit for the selected oil wells.