# OilyGiant Mining Company: Optimal Location for New Oil Well

## Introduction

In this project, I aim to identify the best location for a new oil well for the OilyGiant mining company. The task involves analyzing geological data from three regions, building predictive models to estimate the volume of oil reserves, and selecting the most profitable region based on these estimates. I use linear regression for model training and employ the bootstrapping technique to assess the potential profit and associated risks.

The following steps are taken to achieve the objective:
1. **Data Preparation**: Loading and inspecting the data from three different regions.
2. **Model Training and Testing**: Training a linear regression model for each region and evaluating its performance.
3. **Profit Calculation**: Calculating the potential profit for each region based on the predicted oil reserves.
4. **Risk Analysis**: Using bootstrapping to estimate the distribution of profit, calculate the average profit, confidence intervals, and the risk of losses.
5. **Final Recommendation**: Selecting the best region for new oil well development based on profit and risk analysis.


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

In [2]:
# Load the three regional datasets and inspect the first one
data_0 = pd.read_csv('/datasets/geo_data_0.csv')
data_1 = pd.read_csv('/datasets/geo_data_1.csv')
data_2 = pd.read_csv('/datasets/geo_data_2.csv')

print(data_0.head())
print(data_0.info())
print(data_0.describe())

      id        f0        f1        f2     product
0  txEyH  0.705745 -0.497823  1.221170  105.280062
1  2acmU  1.334711 -0.340164  4.365080   73.037750
2  409Wp  1.022732  0.151990  1.419926   85.265647
3  iJLyR -0.032172  0.139033  2.978566  168.620776
4  Xdl7t  1.988431  0.155413  4.751769  154.036647
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None
                  f0             f1             f2        product
count  100000.000000  100000.000000  100000.000000  100000.000000
mean        0.500419       0.250143       2.502647      92.500000
std         0.871832       0.504433       3.248248      4

# Train and test the model for each region:

In [3]:
def train_and_evaluate(data):
    X = data.drop(columns=['id', 'product'])
    y = data['product']
    
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=42)
    
    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_valid)
    
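    # Take the square root explicitly: the `squared=False` argument was deprecated in scikit-learn 1.4 and later removed
    rmse = np.sqrt(mean_squared_error(y_valid, predictions))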
    average_predicted_reserves = predictions.mean()
    
    return predictions, y_valid, rmse, average_predicted_reserves

predictions_0, y_valid_0, rmse_0, avg_pred_0 = train_and_evaluate(data_0)
predictions_1, y_valid_1, rmse_1, avg_pred_1 = train_and_evaluate(data_1)
predictions_2, y_valid_2, rmse_2, avg_pred_2 = train_and_evaluate(data_2)

print(f"Region 0 - RMSE: {rmse_0}, Avg Predicted Reserves: {avg_pred_0}")
print(f"Region 1 - RMSE: {rmse_1}, Avg Predicted Reserves: {avg_pred_1}")
print(f"Region 2 - RMSE: {rmse_2}, Avg Predicted Reserves: {avg_pred_2}")

Region 0 - RMSE: 37.756600350261685, Avg Predicted Reserves: 92.3987999065777
Region 1 - RMSE: 0.890280100102884, Avg Predicted Reserves: 68.71287803913762
Region 2 - RMSE: 40.14587231134218, Avg Predicted Reserves: 94.77102387765939


# Analysis of the results: predicted reserves and RMSE
## Region 0:
The average predicted reserves are reasonably high at 92.40 thousand barrels.
The RMSE of 37.76 thousand barrels is large relative to that average, so predictions for individual wells are imprecise. The likely cause is that a linear model on three features captures only part of the variation in reserves; overfitting is unlikely with so few parameters and 75,000 training rows.
## Region 1:
The average predicted reserves are the lowest among the three regions at 68.71 thousand barrels.
The RMSE of 0.89 thousand barrels is very low, so the predictions are very close to the actual values. The model is a reliable estimator of reserves in this region.
## Region 2:
The average predicted reserves are the highest at 94.77 thousand barrels.
The RMSE of 40.15 thousand barrels is the highest among the three regions, so predictions for individual wells in this region carry considerable uncertainty.
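
To judge whether an RMSE such as 37.76 is "high", it can help to compare it with a naive baseline that predicts the mean volume for every well. The following is a minimal sketch of that comparison, not part of the original analysis: the helper names `rmse` and `rmse_with_baseline` are hypothetical, and the data is synthetic stand-in data rather than the project's datasets.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, computed directly with NumPy."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def rmse_with_baseline(y_true, y_pred):
    """Return (model_rmse, baseline_rmse).

    The baseline predicts the mean of y_true for every well, so its RMSE
    equals the population standard deviation of y_true.
    """
    y_true = np.asarray(y_true, dtype=float)
    baseline_pred = np.full(y_true.shape, y_true.mean())
    return rmse(y_true, y_pred), rmse(y_true, baseline_pred)


# Synthetic stand-in values (assumption: NOT the project's data)
rng = np.random.default_rng(0)
y_demo = rng.normal(loc=90.0, scale=40.0, size=1000)
pred_demo = y_demo + rng.normal(scale=10.0, size=1000)  # predictions with moderate noise

model_rmse, baseline_rmse = rmse_with_baseline(y_demo, pred_demo)
print(f"Model RMSE: {model_rmse:.2f}, baseline RMSE: {baseline_rmse:.2f}")
```

With the notebook's own variables, the same check would be `rmse_with_baseline(y_valid_0, predictions_0)`. A model RMSE close to the baseline RMSE would suggest the features add little predictive value for that region.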

# Conclusion
Based on RMSE alone, Region 1 gives the most reliable predictions, which matters for well placement and profit calculations. It also has the lowest average predicted reserves, however, so prediction accuracy by itself does not settle the choice of region; profit and risk still have to be evaluated.

# Profit calculation

In [4]:
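budget = 100e6  # total budget in USD for developing 200 wells
# Revenue in USD per unit of `product`; volumes are in thousand barrels,
# so despite the variable name this is revenue per thousand barrels
revenue_per_barrel = 4.5e3
num_wells = 200
cost_per_well = budget / num_wells

# Minimum volume a well must hold (in thousand barrels) to cover its own cost
break_even_volume = cost_per_well / revenue_per_barrel
print(f"Break-even volume: {break_even_volume}")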


Break-even volume: 111.11111111111111
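
As a worked check of the arithmetic above, the break-even volume is 100,000,000 / 200 = 500,000 USD per well, divided by 4,500 USD per unit, which gives about 111.11 units. The sketch below repeats that calculation and measures how far each region's average predicted reserves (copied from the model output earlier) fall short of it. The function name and the `revenue_per_unit` parameter name are illustrative assumptions, not part of the original code.

```python
def break_even_volume(budget, num_wells, revenue_per_unit):
    """Minimum average volume per well (in product units) for zero profit."""
    cost_per_well = budget / num_wells
    return cost_per_well / revenue_per_unit


threshold = break_even_volume(budget=100e6, num_wells=200, revenue_per_unit=4.5e3)

# Average predicted reserves per region, copied from the model output above
avg_predicted = {0: 92.3987999065777, 1: 68.71287803913762, 2: 94.77102387765939}

# Positive values mean the regional average is below break-even
shortfall = {region: threshold - avg for region, avg in avg_predicted.items()}
for region, gap in shortfall.items():
    print(f"Region {region}: average is {gap:.2f} units below break-even")
```

Every shortfall is positive, which is why the later steps select only the wells with the highest predicted volume instead of drilling average wells.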


In [5]:
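# Compare average actual and predicted volumes per region (to be read against the break-even volume).
# Note: the actual average uses the full dataset, while predictions cover the validation split only.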
def compare_volumes(data, predictions):
    avg_volume = data['product'].mean()
    avg_predicted_volume = predictions.mean()
    print(f'Average actual volume: {avg_volume}')
    print(f'Average predicted volume: {avg_predicted_volume}')

compare_volumes(data_0, predictions_0)
compare_volumes(data_1, predictions_1)
compare_volumes(data_2, predictions_2)

Average actual volume: 92.50000000000001
Average predicted volume: 92.3987999065777
Average actual volume: 68.82500000000002
Average predicted volume: 68.71287803913762
Average actual volume: 95.00000000000004
Average predicted volume: 94.77102387765939

None of the regional averages (about 92.5, 68.8, and 95.0 thousand barrels) reaches the break-even volume of 111.11 thousand barrels per well, so an average well would not pay back its cost in any region. Profit therefore depends on selecting the wells with the highest predicted reserves.


# Profit from a set of selected oil wells and model predictions

In [6]:
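def calculate_profit(predictions, actual, num_wells, revenue_per_barrel, budget):
    # Positions of the num_wells wells with the highest predicted volume
    indices = np.argsort(predictions)[-num_wells:]
    # Profit is based on the actual volumes of the selected wells, not the predictions
    selected_actual = actual.iloc[indices]
    total_reserves = selected_actual.sum()
    total_profit = (total_reserves * revenue_per_barrel) - budget
    
    return total_reserves, total_profit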

# Calculate profit for each region
total_reserves_0, total_profit_0 = calculate_profit(predictions_0, y_valid_0, num_wells, revenue_per_barrel, budget)
total_reserves_1, total_profit_1 = calculate_profit(predictions_1, y_valid_1, num_wells, revenue_per_barrel, budget)
total_reserves_2, total_profit_2 = calculate_profit(predictions_2, y_valid_2, num_wells, revenue_per_barrel, budget)

# Print findings
print(f"Region 0 - Total Reserves: {total_reserves_0}, Total Profit: {total_profit_0}")
print(f"Region 1 - Total Reserves: {total_reserves_1}, Total Profit: {total_profit_1}")
print(f"Region 2 - Total Reserves: {total_reserves_2}, Total Profit: {total_profit_2}")

# Findings and suggestions: pick the region with the highest validation-set profit
profits_by_region = {0: total_profit_0, 1: total_profit_1, 2: total_profit_2}
best_region = max(profits_by_region, key=profits_by_region.get)
best_profit = profits_by_region[best_region]

print(f"The best region for oil well development is Region {best_region} with an estimated profit of {best_profit:.2f} USD.")

Region 0 - Total Reserves: 29686.9802543604, Total Profit: 33591411.14462179
Region 1 - Total Reserves: 27589.081548181137, Total Profit: 24150866.966815114
Region 2 - Total Reserves: 27996.826131942467, Total Profit: 25985717.593741104
The best region for oil well development is Region 0 with an estimated profit of 33591411.14 USD.
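
The key point of the profit calculation is that wells are chosen by their predicted volume, while profit is counted from their actual volume. The toy example below illustrates this with four made-up wells; `top_k_profit` is a hypothetical stand-alone helper that mirrors `calculate_profit` on plain arrays, and the numbers are invented for illustration.

```python
import numpy as np


def top_k_profit(predictions, actual, k, revenue_per_unit, budget):
    """Select the k wells with the highest predictions; profit uses actual volumes."""
    predictions = np.asarray(predictions, dtype=float)
    actual = np.asarray(actual, dtype=float)
    top_positions = np.argsort(predictions)[-k:]
    total_reserves = actual[top_positions].sum()
    return float(total_reserves * revenue_per_unit - budget)


# Four made-up wells (assumption: illustrative numbers only)
toy_predictions = [10.0, 50.0, 30.0, 40.0]
toy_actual = [12.0, 20.0, 35.0, 45.0]

# The model ranks wells 1 and 3 highest, so their actual volumes 20 + 45 = 65 are used
profit = top_k_profit(toy_predictions, toy_actual, k=2, revenue_per_unit=1.0, budget=50.0)

# Selecting by the actual volumes instead would pick wells 2 and 3 (35 + 45 = 80)
ideal_profit = top_k_profit(toy_actual, toy_actual, k=2, revenue_per_unit=1.0, budget=50.0)
print(profit, ideal_profit)
```

The gap between the two profits is the cost of imperfect ranking, which is why a region with a low RMSE can outperform a region with higher average reserves once selection is taken into account.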


# Calculate risks and profit for each region

In [7]:
def bootstrap_profit(data, num_wells, revenue_per_barrel, budget, n_samples=1000):
    # Refit the model to get validation-set predictions and actual volumes
    predictions, actual, _, _ = train_and_evaluate(data)
    profits = []
    for _ in range(n_samples):
        # Draw 500 validation wells with replacement, then keep the top num_wells by prediction
        sample_indices = np.random.choice(predictions.shape[0], size=500, replace=True)
        sample_predictions = predictions[sample_indices]
        sample_actual = actual.iloc[sample_indices]
        total_reserves, total_profit = calculate_profit(sample_predictions, sample_actual, num_wells, revenue_per_barrel, budget)
        profits.append(total_profit)
    
    profits = np.array(profits)
    mean_profit = profits.mean()
    # 95% confidence interval from the 2.5th and 97.5th percentiles
    lower_bound = np.percentile(profits, 2.5)
    upper_bound = np.percentile(profits, 97.5)
    # Loss risk: share of bootstrap samples with negative profit
    loss_risk = (profits < 0).mean()
    return mean_profit, (lower_bound, upper_bound), loss_risk

# Bootstrap each region and report mean profit, 95% CI, and loss risk
mean_profit_0, conf_interval_0, loss_risk_0 = bootstrap_profit(data_0, num_wells, revenue_per_barrel, budget)
mean_profit_1, conf_interval_1, loss_risk_1 = bootstrap_profit(data_1, num_wells, revenue_per_barrel, budget)
mean_profit_2, conf_interval_2, loss_risk_2 = bootstrap_profit(data_2, num_wells, revenue_per_barrel, budget)

print(f"Region 0 - Mean Profit: {mean_profit_0}, 95% CI: {conf_interval_0}, Loss Risk: {loss_risk_0}")
print(f"Region 1 - Mean Profit: {mean_profit_1}, 95% CI: {conf_interval_1}, Loss Risk: {loss_risk_1}")
print(f"Region 2 - Mean Profit: {mean_profit_2}, 95% CI: {conf_interval_2}, Loss Risk: {loss_risk_2}")

Region 0 - Mean Profit: 4077013.683202669, 95% CI: (-1293354.536850698, 8903240.99086056), Loss Risk: 0.079
Region 1 - Mean Profit: 4406321.737316625, 95% CI: (616673.7470011074, 8237226.254962027), Loss Risk: 0.006
Region 2 - Mean Profit: 3901965.5291983117, 95% CI: (-1464413.8275472508, 9364723.52963216), Loss Risk: 0.078
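
Because the bootstrap above draws from NumPy's global random state without a seed, rerunning the cell gives slightly different numbers each time. The following is a minimal sketch of a seeded variant using `numpy.random.default_rng`, under stated assumptions: the function names, the `seed` parameter, and the synthetic data are all illustrative and not part of the original analysis.

```python
import numpy as np


def bootstrap_profits(predictions, actual, n_points=500, n_wells=200,
                      revenue_per_unit=4.5e3, budget=100e6,
                      n_samples=1000, seed=12345):
    """Profit distribution from repeated resampling of wells with replacement."""
    rng = np.random.default_rng(seed)  # fixed seed makes the run repeatable
    predictions = np.asarray(predictions, dtype=float)
    actual = np.asarray(actual, dtype=float)
    profits = np.empty(n_samples)
    for i in range(n_samples):
        # Draw n_points well positions with replacement
        idx = rng.integers(0, len(predictions), size=n_points)
        # Keep the n_wells sampled wells with the highest predicted volume
        top = idx[np.argsort(predictions[idx])[-n_wells:]]
        # Profit is computed from the actual volumes of the selected wells
        profits[i] = actual[top].sum() * revenue_per_unit - budget
    return profits


def summarize(profits):
    """Mean profit, 95% percentile interval, and share of loss-making samples."""
    lower, upper = np.percentile(profits, [2.5, 97.5])
    return {
        "mean": float(profits.mean()),
        "ci_95": (float(lower), float(upper)),
        "loss_risk": float((profits < 0).mean()),
    }


# Synthetic stand-in data (assumption: NOT the project's data)
rng_demo = np.random.default_rng(0)
actual_demo = rng_demo.normal(loc=95.0, scale=45.0, size=5000)
pred_demo = actual_demo + rng_demo.normal(scale=40.0, size=5000)

profits_demo = bootstrap_profits(pred_demo, actual_demo, n_samples=200)
summary = summarize(profits_demo)
print(summary)
```

Passing the same `seed` reproduces the same profit distribution, which makes reported confidence intervals and loss risks easier to verify.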


In [8]:
# First keep only the regions whose loss risk is below 2.5%,
# then pick the highest mean profit among those that remain
results = {
    0: (mean_profit_0, loss_risk_0),
    1: (mean_profit_1, loss_risk_1),
    2: (mean_profit_2, loss_risk_2),
}
eligible = {region: profit for region, (profit, risk) in results.items() if risk < 0.025}

if eligible:
    best_region = max(eligible, key=eligible.get)
    best_profit = eligible[best_region]
else:
    best_region = None

if best_region is not None:
    print(f"The best region for oil well development based on profit and risk analysis is Region {best_region} with an average profit of {best_profit:.2f} USD.")
else:
    print("None of the regions meet the risk criteria for development.")

The best region for oil well development based on profit and risk analysis is Region 1 with an average profit of 4406321.74 USD.


## Conclusion

Based on the analysis, Region 1 is the recommended location for new oil well development. The recommendation rests on the following findings:
- **Model Performance**: The linear regression model was far more accurate for Region 1 (RMSE 0.89) than for Region 0 (37.76) or Region 2 (40.15), so well selection there relies on trustworthy predictions.
- **Profit Calculation**: Region 1 has the lowest average predicted reserves (68.71 thousand barrels), and on the single validation-set selection of the top 200 wells it showed the lowest profit (about 24.2 million USD, versus 33.6 million for Region 0 and 26.0 million for Region 2). All three regional averages are below the break-even volume of 111.11 thousand barrels per well, so profitability depends on how well the best wells can be identified.
- **Risk Analysis**: Bootstrapping with 1,000 samples of 500 wells gave Region 1 the highest mean profit (about 4.41 million USD), a 95% confidence interval that lies entirely above zero (about 0.62 to 8.24 million USD), and a loss risk of 0.6%. It is the only region below the 2.5% risk threshold; Regions 0 and 2 show loss risks of 7.9% and 7.8%.

I recommend developing new oil wells in Region 1. It offers the highest bootstrapped mean profit and is the only region that meets the risk criterion, making it the optimal choice for OilyGiant's new well development project.