# Introduction

I work for the OilyGiant mining company. My task is to find the best place for a new well.

Here are the steps I will take to choose the location:

- Collect the oil well parameters in the selected region: oil quality and volume of reserves;
- Build a model for predicting the volume of reserves in the new wells;
- Pick the oil wells with the highest estimated values;
- Pick the region with the highest total profit for the selected oil wells.

I have data on oil samples from three regions. The parameters of each oil well in the region are already known. I will build a model that helps pick the region with the highest profit margin, then analyze potential profit and risks using the bootstrapping technique.



# Downloading and preparing the data

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from scipy.stats import bootstrap

In [2]:
# Load data
df_0 = pd.read_csv('/datasets/geo_data_0.csv')
df_1 = pd.read_csv('/datasets/geo_data_1.csv')
df_2 = pd.read_csv('/datasets/geo_data_2.csv')

# Print first few rows of each dataset to verify
print(df_0.head())
print(df_1.head())
print(df_2.head())


      id        f0        f1        f2     product
0  txEyH  0.705745 -0.497823  1.221170  105.280062
1  2acmU  1.334711 -0.340164  4.365080   73.037750
2  409Wp  1.022732  0.151990  1.419926   85.265647
3  iJLyR -0.032172  0.139033  2.978566  168.620776
4  Xdl7t  1.988431  0.155413  4.751769  154.036647
      id         f0         f1        f2     product
0  kBEdx -15.001348  -8.276000 -0.005876    3.179103
1  62mP7  14.272088  -3.475083  0.999183   26.953261
2  vyE1P   6.263187  -5.948386  5.001160  134.766305
3  KcrkZ -13.081196 -11.506057  4.999415  137.945408
4  AHL4O  12.702195  -8.147433  5.004363  134.766305
      id        f0        f1        f2     product
0  fwXo0 -1.146987  0.963328 -0.828965   27.758673
1  WJtFt  0.262778  0.269839 -2.530187   56.069697
2  ovLUW  0.194587  0.289035 -5.586433   62.871910
3  q6cA6  2.236060 -0.553760  0.930038  114.572842
4  WPMUX -0.515993  1.716266  5.899011  149.600746


In [3]:
# Checking for missing values
print("Missing values in df_0:")
print(df_0.isnull().sum())

print("\nMissing values in df_1:")
print(df_1.isnull().sum())

print("\nMissing values in df_2:")
print(df_2.isnull().sum())

# Checking for duplicates
print("\nDuplicates in df_0:", df_0.duplicated().sum())
print("Duplicates in df_1:", df_1.duplicated().sum())
print("Duplicates in df_2:", df_2.duplicated().sum())

# Checking data types
print("\nData types in df_0:")
print(df_0.dtypes)

print("\nData types in df_1:")
print(df_1.dtypes)

print("\nData types in df_2:")
print(df_2.dtypes)


Missing values in df_0:
id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

Missing values in df_1:
id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

Missing values in df_2:
id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

Duplicates in df_0: 0
Duplicates in df_1: 0
Duplicates in df_2: 0

Data types in df_0:
id          object
f0         float64
f1         float64
f2         float64
product    float64
dtype: object

Data types in df_1:
id          object
f0         float64
f1         float64
f2         float64
product    float64
dtype: object

Data types in df_2:
id          object
f0         float64
f1         float64
f2         float64
product    float64
dtype: object



The data looks clean based on the checks we performed:

- No missing values: all three datasets have zero missing values in every column.
- No duplicate rows: `duplicated()` found no fully identical rows in any dataset. This check compares whole rows, so it does not detect an `id` that repeats with different feature values.
- Appropriate data types: the feature columns (f0, f1, f2) and the target column (product) are float64, which suits a linear regression model. The `id` column is a string identifier and is dropped before training.

Given this, no further preprocessing is needed. We can proceed with training and evaluating the model for each region.
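The duplicate check above compares entire rows. A minimal sketch on a made-up frame (the `toy` data is hypothetical, not taken from the project files) shows how a repeated `id` can pass a full-row check and how a column-level check would surface it:

```python
import pandas as pd

# Hypothetical toy frame standing in for one region's data; all values are made up.
toy = pd.DataFrame({
    "id": ["a1", "b2", "a1", "c3"],
    "f0": [0.1, 0.2, 0.3, 0.4],
    "product": [10.0, 20.0, 30.0, 40.0],
})

# Full-row duplicates: none, because the repeated id carries different values.
full_row_duplicates = int(toy.duplicated().sum())

# Duplicates on the id column alone: one repeated identifier.
id_duplicates = int(toy["id"].duplicated().sum())

print("Full-row duplicates:", full_row_duplicates)
print("Repeated ids:", id_duplicates)
```

Whether repeated identifiers matter depends on what an `id` represents in the source data, so this is only a diagnostic, not a reason to drop rows.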

# Train and test the model for each region

In [4]:
# Splitting the data for each region
def split_data(df):
    X = df.drop(columns=['id', 'product'])
    y = df['product']
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=42)
    return X_train, X_valid, y_train, y_valid

X_train_0, X_valid_0, y_train_0, y_valid_0 = split_data(df_0)
X_train_1, X_valid_1, y_train_1, y_valid_1 = split_data(df_1)
X_train_2, X_valid_2, y_train_2, y_valid_2 = split_data(df_2)

print(f"Region 0 - Training set size: {X_train_0.shape}, Validation set size: {X_valid_0.shape}")
print(f"Region 1 - Training set size: {X_train_1.shape}, Validation set size: {X_valid_1.shape}")
print(f"Region 2 - Training set size: {X_train_2.shape}, Validation set size: {X_valid_2.shape}")

Region 0 - Training set size: (75000, 3), Validation set size: (25000, 3)
Region 1 - Training set size: (75000, 3), Validation set size: (25000, 3)
Region 2 - Training set size: (75000, 3), Validation set size: (25000, 3)


In [5]:
# Function to train the model and make predictions
def train_and_predict(X_train, y_train, X_valid):
    # Initialize and train the model
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Make predictions
    predictions = model.predict(X_valid)
    
    return predictions

# Train and predict for each region
predictions_0 = train_and_predict(X_train_0, y_train_0, X_valid_0)
predictions_1 = train_and_predict(X_train_1, y_train_1, X_valid_1)
predictions_2 = train_and_predict(X_train_2, y_train_2, X_valid_2)

# Print the first few predictions for each region
print("Region 0 - Predictions:", predictions_0[:5])
print("Region 1 - Predictions:", predictions_1[:5])
print("Region 2 - Predictions:", predictions_2[:5])

Region 0 - Predictions: [101.90101715  78.21777385 115.26690103 105.61861791  97.9801849 ]
Region 1 - Predictions: [ 8.44738063e-01  5.29216119e+01  1.35110385e+02  1.09494863e+02
 -4.72915824e-02]
Region 2 - Predictions: [ 98.30191642 101.59246124  52.4490989  109.92212707  72.41184733]


In [6]:
# Create DataFrame for Region 0
df_predictions_0 = pd.DataFrame({'Actual': y_valid_0, 'Predicted': predictions_0})
# Save to CSV file
df_predictions_0.to_csv('predictions_region_0.csv', index=False)

# Create DataFrame for Region 1
df_predictions_1 = pd.DataFrame({'Actual': y_valid_1, 'Predicted': predictions_1})
# Save to CSV file
df_predictions_1.to_csv('predictions_region_1.csv', index=False)

# Create DataFrame for Region 2
df_predictions_2 = pd.DataFrame({'Actual': y_valid_2, 'Predicted': predictions_2})
# Save to CSV file
df_predictions_2.to_csv('predictions_region_2.csv', index=False)

In [7]:
# Function to calculate RMSE
def calculate_rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Calculate average volume of predicted reserves and RMSE for each region
avg_predicted_volume_0 = np.mean(predictions_0)
rmse_0 = calculate_rmse(y_valid_0, predictions_0)

avg_predicted_volume_1 = np.mean(predictions_1)
rmse_1 = calculate_rmse(y_valid_1, predictions_1)

avg_predicted_volume_2 = np.mean(predictions_2)
rmse_2 = calculate_rmse(y_valid_2, predictions_2)

# Print results for each region
print("Region 0:")
print(f"Average Volume of Predicted Reserves: {avg_predicted_volume_0}")
print(f"Model RMSE: {rmse_0}\n")

print("Region 1:")
print(f"Average Volume of Predicted Reserves: {avg_predicted_volume_1}")
print(f"Model RMSE: {rmse_1}\n")

print("Region 2:")
print(f"Average Volume of Predicted Reserves: {avg_predicted_volume_2}")
print(f"Model RMSE: {rmse_2}\n")

Region 0:
Average Volume of Predicted Reserves: 92.3987999065777
Model RMSE: 37.756600350261685

Region 1:
Average Volume of Predicted Reserves: 68.71287803913762
Model RMSE: 0.890280100102884

Region 2:
Average Volume of Predicted Reserves: 94.77102387765939
Model RMSE: 40.14587231134218



Average volume of predicted reserves:

- Region 0 (92.40) and Region 2 (94.77) have higher average predicted reserves than Region 1 (68.71), all in thousand barrels.

Model RMSE:

- Region 1 has the lowest RMSE (0.89), so its predictions are very close to the actual values.
- Region 0 and Region 2 have much higher RMSE values (37.76 and 40.15, respectively), roughly 40% of their average predicted volume, so individual predictions in these regions are far less reliable.

There is a trade-off: Region 1 has the most accurate model but the lowest average predicted reserves, while Regions 0 and 2 promise higher average reserves with much less accurate predictions.
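An RMSE value is easier to judge against a reference point. The following is a minimal sketch, using synthetic numbers rather than the project data, that compares a model's RMSE with a constant predictor that always returns the mean of the target. The helper `rmse` and all arrays here are illustrative assumptions.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

rng = np.random.default_rng(0)

# Synthetic target and predictions; the scales are made up for illustration.
y_true = rng.normal(loc=90.0, scale=45.0, size=1000)
noisy_pred = y_true + rng.normal(scale=5.0, size=1000)

# Constant baseline: always predict the mean of the target.
baseline_pred = np.full_like(y_true, y_true.mean())

model_rmse = rmse(y_true, noisy_pred)
baseline_rmse = rmse(y_true, baseline_pred)

print(f"Model RMSE:    {model_rmse:.2f}")
print(f"Baseline RMSE: {baseline_rmse:.2f}")
```

The baseline RMSE equals the standard deviation of the target, so a model whose RMSE is close to that figure adds little beyond predicting the average.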

# Prepare for profit calculation

In [8]:
# Rounded results from the previous step (these assignments overwrite the computed values)
avg_predicted_volume_0 = 92.40
rmse_0 = 37.76

avg_predicted_volume_1 = 68.71
rmse_1 = 0.89

avg_predicted_volume_2 = 94.77
rmse_2 = 40.15

# Constants
BUDGET = 100e6  # $100 million
REVENUE_PER_UNIT = 4.5e3  # $4500 per thousand barrels
NUMBER_OF_WELLS = 200  # Number of wells to be drilled

# Calculate the total volume of reserves sufficient for profit
total_volume_for_profit = BUDGET / REVENUE_PER_UNIT

# Calculate the required average volume per well for 200 wells
required_avg_volume_per_well = total_volume_for_profit / NUMBER_OF_WELLS

# Comparison with average volume of reserves in each region
profitable_region_0 = avg_predicted_volume_0 >= required_avg_volume_per_well
profitable_region_1 = avg_predicted_volume_1 >= required_avg_volume_per_well
profitable_region_2 = avg_predicted_volume_2 >= required_avg_volume_per_well

# Print results
print("Total Volume of Reserves Sufficient for Profit:", total_volume_for_profit)
print("Required Average Volume per Well for Profit:", required_avg_volume_per_well)
print("Comparison with Average Volume of Reserves:")
print("Region 0:", "Profitable" if profitable_region_0 else "Not Profitable")
print("Region 1:", "Profitable" if profitable_region_1 else "Not Profitable")
print("Region 2:", "Profitable" if profitable_region_2 else "Not Profitable")



Total Volume of Reserves Sufficient for Profit: 22222.222222222223
Required Average Volume per Well for Profit: 111.11111111111111
Comparison with Average Volume of Reserves:
Region 0: Not Profitable
Region 1: Not Profitable
Region 2: Not Profitable


Covering the $100 million budget at $4,500 per thousand barrels takes 22,222.22 thousand barrels in total. Spread over 200 wells, that is a break-even average of approximately 111.11 thousand barrels per well.

Comparing this threshold with the average predicted volume for each region:

- Region 0: 92.40, below 111.11.
- Region 1: 68.71, below 111.11.
- Region 2: 94.77, below 111.11.

Hence the average well in every region falls short of break-even, so drilling 200 wells picked at random would not be expected to pay off. This does not rule out a profit, because only the 200 wells with the highest predicted reserves are developed.
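To make the gap concrete, here is a small sketch that recomputes the break-even threshold and the shortfall of each regional average against it. The constants and rounded averages come from the cells above; the helper name `break_even_volume_per_well` is an illustrative assumption.

```python
BUDGET = 100e6            # $100 million
REVENUE_PER_UNIT = 4.5e3  # $4500 per thousand barrels
NUMBER_OF_WELLS = 200

def break_even_volume_per_well(budget, revenue_per_unit, num_wells):
    """Average volume (thousand barrels) each well needs to cover the budget."""
    return budget / revenue_per_unit / num_wells

threshold = break_even_volume_per_well(BUDGET, REVENUE_PER_UNIT, NUMBER_OF_WELLS)

# Rounded average predicted volumes reported earlier in the notebook.
region_means = {"Region 0": 92.40, "Region 1": 68.71, "Region 2": 94.77}

# How far each regional average sits below the break-even threshold.
shortfall = {name: threshold - mean for name, mean in region_means.items()}

for name, gap in shortfall.items():
    print(f"{name}: {gap:.2f} thousand barrels below break-even")
```

The selected wells therefore need to beat their region's average by this margin for the development to cover its budget.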

# Function to calculate profit from a set of selected oil wells and model predictions

In [9]:
# Constants
BUDGET = 100e6  # $100 million
REVENUE_PER_UNIT = 4.5e3  # $4500 per thousand barrels
NUMBER_OF_WELLS = 200

def calculate_profit(df, budget=BUDGET, revenue_per_unit=REVENUE_PER_UNIT, num_wells=NUMBER_OF_WELLS):
    # Sort wells by predicted values
    df_sorted = df.sort_values(by='Predicted', ascending=False).head(num_wells)
    
    # Calculate the total volume of selected wells using actual target values
    total_volume = df_sorted['Actual'].sum()
    
    # Calculate profit based on total volume and revenue per unit
    profit = total_volume * revenue_per_unit - budget
    
    return profit, df_sorted

# Load DataFrames from earlier created CSV files
df_predictions_0 = pd.read_csv('predictions_region_0.csv')
df_predictions_1 = pd.read_csv('predictions_region_1.csv')
df_predictions_2 = pd.read_csv('predictions_region_2.csv')

# Calculate profit for Region 0
profit_0, selected_wells_sorted_0 = calculate_profit(df_predictions_0)

print("Profit for Region 0:", profit_0)
print("Selected Wells for Region 0 (sorted by predictions):")
print(selected_wells_sorted_0.head())

# Calculate profit for Region 1
profit_1, selected_wells_sorted_1 = calculate_profit(df_predictions_1)

print("Profit for Region 1:", profit_1)
print("Selected Wells for Region 1 (sorted by predictions):")
print(selected_wells_sorted_1.head())

# Calculate profit for Region 2
profit_2, selected_wells_sorted_2 = calculate_profit(df_predictions_2)

print("Profit for Region 2:", profit_2)
print("Selected Wells for Region 2 (sorted by predictions):")
print(selected_wells_sorted_2.head())

# Summary
print("\nSummary:")
print(f"Region 0: Predicted profit = ${profit_0:.2f}")
print(f"Region 1: Predicted profit = ${profit_1:.2f}")
print(f"Region 2: Predicted profit = ${profit_2:.2f}")

# Suggesting the region for development
if profit_2 > profit_0 and profit_2 > profit_1:
    best_region = "Region 2"
elif profit_0 > profit_1:
    best_region = "Region 0"
else:
    best_region = "Region 1"

print(f"\nSuggested region for oil wells development: {best_region}")






Profit for Region 0: 33591411.14462179
Selected Wells for Region 0 (sorted by predictions):
           Actual   Predicted
6958   153.639837  176.536104
18194  140.631646  176.274510
17251  178.879516  173.249504
457    176.807828  172.802708
2202   130.985681  172.744977
Profit for Region 1: 24150866.966815114
Selected Wells for Region 1 (sorted by predictions):
           Actual   Predicted
20776  137.945408  139.983277
2323   137.945408  139.700803
13895  137.945408  139.616544
6950   137.945408  139.514768
9151   137.945408  139.472212
Profit for Region 2: 25985717.59374112
Selected Wells for Region 2 (sorted by predictions):
           Actual   Predicted
21852  101.225039  170.529209
10722  151.655778  169.673332
6209    92.947333  165.300724
8203    97.775979  164.613896
8042   122.460897  163.964000

Summary:
Region 0: Predicted profit = $33591411.14
Region 1: Predicted profit = $24150866.97
Region 2: Predicted profit = $25985717.59

Suggested region for oil wells development: Re

# Findings

Summary and Recommendation

Based on the calculated profits, Region 0 shows the highest profit at approximately $33.6 million, followed by Region 2 ($26.0 million) and Region 1 ($24.2 million).

Region 2 had the highest average predicted volume in the previous step, but once profit is computed from the actual volumes of the 200 wells with the highest predictions, Region 0 comes out ahead.

Suggested Region for Oil Well Development:
Region 0

Justification:

- Highest profit: Region 0 has the highest profit of the three regions when the top 200 wells are chosen by prediction and valued by their actual reserves.
- Top wells performance: the actual volumes of the top-predicted wells in Region 0 are high (about 131 to 179 thousand barrels in the first five rows), well above the 111.11 break-even threshold.

This estimate selects the best 200 wells out of all 25,000 validation wells and does not yet account for risk, so it should be read as an upper-end scenario rather than a final recommendation.
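The profit figure depends on two different columns: wells are ranked by predicted volume, but revenue comes from their actual volume. A minimal sketch on a made-up four-well frame illustrates the difference; the function name `profit_from_top_wells`, the toy values, and the toy budget of $500,000 are all hypothetical.

```python
import pandas as pd

def profit_from_top_wells(df, num_wells, revenue_per_unit, budget, rank_by="Predicted"):
    """Pick the top wells by `rank_by`, then value them by their actual volume."""
    top = df.nlargest(num_wells, rank_by)
    return float(top["Actual"].sum() * revenue_per_unit - budget)

# Hypothetical wells: the model overrates well 1 and underrates well 2.
toy = pd.DataFrame({
    "Actual":    [120.0, 80.0, 150.0, 60.0],
    "Predicted": [140.0, 130.0, 90.0, 50.0],
})

# Selection driven by the model's predictions.
model_profit = profit_from_top_wells(toy, 2, revenue_per_unit=4.5e3, budget=500e3)

# Unattainable reference: selection with perfect knowledge of actual volumes.
oracle_profit = profit_from_top_wells(toy, 2, 4.5e3, 500e3, rank_by="Actual")

print("Profit with model-based selection:", model_profit)
print("Profit with perfect selection:", oracle_profit)
```

The gap between the two numbers is the cost of prediction error, which is why a region with a less accurate model can look better on averages and still deliver less from its selected wells.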

# Calculate risks and profit for each region

In [10]:
BUDGET = 100e6  # $100 million
REVENUE_PER_UNIT = 4.5e3  # $4500 per thousand barrels

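def calculate_profit(targets, predictions):
    # A sample drawn with replacement can repeat index labels, and a label-based
    # lookup would then return extra rows, so switch to a fresh positional index
    targets = targets.reset_index(drop=True)
    predictions = predictions.reset_index(drop=True)

    # Select the top 200 wells based on predictions
    selected_indices = predictions.nlargest(200).index
    selected_targets = targets.loc[selected_indices]
    total_volume = selected_targets.sum()

    # Calculate profit
    profit = total_volume * REVENUE_PER_UNIT - BUDGET
    return profit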

def bootstrap_profit(df, n_samples=1000):
    profits = []
    for _ in range(n_samples):
        # Sample 500 wells with replacement
        sample = df.sample(n=500, replace=True)
        targets = sample['Actual']
        predictions = sample['Predicted']
        
        # Calculate profit for the sampled wells
        profit = calculate_profit(targets, predictions)
        profits.append(profit)
    
    return np.array(profits)

# Perform bootstrapping for each region
bootstrap_profits_0 = bootstrap_profit(df_predictions_0)
bootstrap_profits_1 = bootstrap_profit(df_predictions_1)
bootstrap_profits_2 = bootstrap_profit(df_predictions_2)

# Calculate the mean profit for each region
mean_profit_0 = np.mean(bootstrap_profits_0)
mean_profit_1 = np.mean(bootstrap_profits_1)
mean_profit_2 = np.mean(bootstrap_profits_2)

# Calculate 95% confidence intervals for each region
confidence_interval_0 = np.percentile(bootstrap_profits_0, [2.5, 97.5])
confidence_interval_1 = np.percentile(bootstrap_profits_1, [2.5, 97.5])
confidence_interval_2 = np.percentile(bootstrap_profits_2, [2.5, 97.5])

# Calculate the risk of losses for each region (probability of negative profit)
risk_of_losses_0 = np.mean(bootstrap_profits_0 < 0) * 100
risk_of_losses_1 = np.mean(bootstrap_profits_1 < 0) * 100
risk_of_losses_2 = np.mean(bootstrap_profits_2 < 0) * 100

# Print the results
print("Region 0:")
print("Mean Profit:", mean_profit_0)
print("95% Confidence Interval:", confidence_interval_0)
print("Risk of Losses:", risk_of_losses_0, "%\n")

print("Region 1:")
print("Mean Profit:", mean_profit_1)
print("95% Confidence Interval:", confidence_interval_1)
print("Risk of Losses:", risk_of_losses_1, "%\n")

print("Region 2:")
print("Mean Profit:", mean_profit_2)
print("95% Confidence Interval:", confidence_interval_2)
print("Risk of Losses:", risk_of_losses_2, "%")



Region 0:
Mean Profit: 6126658.1809852375
95% Confidence Interval: [  302321.25937421 12278556.33889298]
Risk of Losses: 1.7999999999999998 %

Region 1:
Mean Profit: 6599896.507079086
95% Confidence Interval: [ 1844591.75312602 11883535.97378075]
Risk of Losses: 0.2 %

Region 2:
Mean Profit: 5946865.409794628
95% Confidence Interval: [ -359938.85228911 12192224.67371715]
Risk of Losses: 3.1 %
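One detail worth checking in the bootstrap step: `DataFrame.sample(..., replace=True)` keeps the original index labels, so a row drawn twice appears under a repeated label. The sketch below uses a made-up three-row frame and fixed positions (instead of random sampling) to show how a label-based lookup can then return more rows than were selected, and how resetting the index avoids it. All names and values are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "Actual":    [10.0, 20.0, 30.0],
    "Predicted": [1.0, 3.0, 2.0],
})

# Imitate a sample with replacement in which row 1 was drawn twice.
sample = toy.iloc[[1, 1, 0, 2]]

# Naive lookup: the top-2 labels are [1, 1], and each label matches two rows.
top_labels = sample["Predicted"].nlargest(2).index
naive_selection = sample["Actual"].loc[top_labels]

# Safer lookup: give every sampled row its own label first.
clean = sample.reset_index(drop=True)
clean_labels = clean["Predicted"].nlargest(2).index
clean_selection = clean["Actual"].loc[clean_labels]

print("Rows selected (naive):", len(naive_selection))
print("Rows selected (reset index):", len(clean_selection))
```

With 500 draws from 25,000 wells, repeats are uncommon in any single sample, so the effect on the totals is small, but it can still make a sample count more than 200 wells.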


# Conclusion 

Findings:

Region 0:

- Mean Profit: $6,126,658.18
- 95% Confidence Interval: [$302,321.26, $12,278,556.34]
- Risk of Losses: 1.8%

Region 1:

- Mean Profit: $6,599,896.51
- 95% Confidence Interval: [$1,844,591.75, $11,883,535.97]
- Risk of Losses: 0.2%

Region 2:

- Mean Profit: $5,946,865.41
- 95% Confidence Interval: [-$359,938.85, $12,192,224.67]
- Risk of Losses: 3.1%

Suggested Region for Oil Well Development: Region 1

Justification:

- Highest mean profit: Region 1 has the highest mean bootstrap profit (about $6.6 million, against $6.1 million for Region 0 and $5.9 million for Region 2).
- Lowest risk of losses: Region 1 has the lowest risk of losses at 0.2%, compared with 1.8% for Region 0 and 3.1% for Region 2.
- Confidence interval: Region 1 has the narrowest 95% interval (about $10.0 million wide) and the highest lower bound ($1.8 million, against $0.3 million for Region 0). Region 2's interval extends below zero. This indicates a better worst-case scenario and more consistent profitability.

The earlier single estimate favored Region 0 because it picked the 200 best wells out of all 25,000 validation wells. The bootstrap picks 200 wells out of 500 sampled ones, and under that setup the accuracy of the Region 1 model outweighs the higher average reserves of the other regions.

Conclusion:
Region 1 is the most favorable choice for oil well development based on this analysis. It offers the highest expected profit and the lowest risk of losses among the three regions.
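The interval and risk figures above are simple summaries of the bootstrap profit array. As a minimal sketch, assuming a made-up array of ten profits in millions of dollars (not project results), they can be computed like this:

```python
import numpy as np

# Hypothetical bootstrap profits in $ millions; one of ten samples loses money.
profits = np.array([-2.0, 1.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])

# Percentile interval: the middle 95% of the bootstrap distribution.
lower, upper = np.percentile(profits, [2.5, 97.5])

# Risk of losses: share of bootstrap samples with negative profit, in percent.
risk_of_losses = np.mean(profits < 0) * 100

print(f"Mean profit: {profits.mean():.2f}")
print(f"95% interval: [{lower:.3f}, {upper:.3f}]")
print(f"Risk of losses: {risk_of_losses:.1f}%")
```

Because the interval cuts off the lowest 2.5% of samples, a region can have a positive lower bound and still show a nonzero risk of losses, as Region 0 does.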