## Introduction
In this project, we are working for OilyGiant, a mining company tasked with identifying the best location for a new oil well based on geological exploration data from three regions. The goal is to build a linear regression model to predict the volume of oil reserves and select the most profitable location.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Set up display options for clarity
pd.options.display.float_format = '{:.2f}'.format

# Load the datasets for each region
geo_data_0 = pd.read_csv('/datasets/geo_data_0.csv')
geo_data_1 = pd.read_csv('/datasets/geo_data_1.csv')
geo_data_2 = pd.read_csv('/datasets/geo_data_2.csv')


# Display the structure (info) and summary statistics of each dataset
print("Region 0 Dataset Info:")
print(geo_data_0.info())
print(geo_data_0.describe())

print("\nRegion 1 Dataset Info:")
print(geo_data_1.info())
print(geo_data_1.describe())

print("\nRegion 2 Dataset Info:")
print(geo_data_2.info())
print(geo_data_2.describe())


Region 0 Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None
             f0        f1        f2   product
count 100000.00 100000.00 100000.00 100000.00
mean       0.50      0.25      2.50     92.50
std        0.87      0.50      3.25     44.29
min       -1.41     -0.85    -12.09      0.00
25%       -0.07     -0.20      0.29     56.50
50%        0.50      0.25      2.52     91.85
75%        1.07      0.70      4.72    128.56
max        2.36      1.34     16.00    185.36

Region 1 Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 c

We begin by importing all required libraries: pandas and numpy for data manipulation, and scikit-learn for model building and evaluation. This ensures that all dependencies are loaded before proceeding with the project.

We load and display the data for the three regions to verify that the data is imported correctly and to understand its structure. This initial inspection helps identify any potential issues before proceeding with analysis.
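The inspection described above can be extended with a few explicit data-quality counts. The following is a minimal sketch, not part of the original notebook: the helper name `quick_checks` and the toy frame are hypothetical, and the toy values are made up to mimic the column layout (`id`, `f0`, `f1`, `f2`, `product`) shown in the output.

```python
import pandas as pd

def quick_checks(df: pd.DataFrame, id_col: str = "id") -> dict:
    """Return simple data-quality counts for one region's table."""
    return {
        "missing_values": int(df.isna().sum().sum()),        # NaN cells across all columns
        "duplicate_rows": int(df.duplicated().sum()),        # fully identical rows
        "duplicate_ids": int(df[id_col].duplicated().sum()), # repeated well identifiers
    }

# Tiny synthetic frame standing in for a region file (values are made up).
toy = pd.DataFrame({
    "id": ["a1", "b2", "b2", "c3"],
    "f0": [0.1, 0.2, 0.3, None],
    "f1": [1.0, 1.1, 1.2, 1.3],
    "f2": [2.0, 2.1, 2.2, 2.3],
    "product": [90.0, 60.0, 61.0, 120.0],
})

checks = quick_checks(toy)
print(checks)
```

In the notebook, the same helper could be called on `geo_data_0`, `geo_data_1`, and `geo_data_2` before modelling.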

In [2]:
# Define features and target
features = ['f0', 'f1', 'f2']
target = 'product'

# Function to train a linear regression model and calculate RMSE
def train_and_evaluate(data):
    X_train, X_valid, y_train, y_valid = train_test_split(data[features], data[target], test_size=0.25, random_state=42)
    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_valid)
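    # RMSE as the square root of MSE; the `squared=False` argument is deprecated in recent scikit-learn releases
    rmse = np.sqrt(mean_squared_error(y_valid, predictions))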
    average_volume = predictions.mean()
    return model, predictions, rmse, average_volume

# Train and evaluate models for each region
model_0, predictions_0, rmse_0, avg_volume_0 = train_and_evaluate(geo_data_0)
model_1, predictions_1, rmse_1, avg_volume_1 = train_and_evaluate(geo_data_1)
model_2, predictions_2, rmse_2, avg_volume_2 = train_and_evaluate(geo_data_2)

# Display RMSE and average predicted volume for each region
print(f"Region 0 - RMSE: {rmse_0:.2f}, Average Predicted Volume: {avg_volume_0:.2f}")
print(f"Region 1 - RMSE: {rmse_1:.2f}, Average Predicted Volume: {avg_volume_1:.2f}")
print(f"Region 2 - RMSE: {rmse_2:.2f}, Average Predicted Volume: {avg_volume_2:.2f}")


Region 0 - RMSE: 37.76, Average Predicted Volume: 92.40
Region 1 - RMSE: 0.89, Average Predicted Volume: 68.71
Region 2 - RMSE: 40.15, Average Predicted Volume: 94.77


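We split each dataset into training and validation sets (75:25 ratio) and train one linear regression model per region. Each model is evaluated using RMSE, which measures the typical size of a prediction error in the same units as `product`. We also calculate the average predicted reserves volume as an initial check on the model’s output. Region 1's model is far more accurate (RMSE 0.89) than the models for Regions 0 and 2 (RMSE 37.76 and 40.15), although Region 1 has the lowest average predicted volume (68.71).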
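To make the RMSE figure easier to interpret, it helps to compute it by hand and compare it with a naive baseline that always predicts the training mean. The sketch below is illustrative only: it uses synthetic data generated with made-up coefficients, not the OilyGiant files, and the variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one region: three features and a noisy linear target.
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
y = 50 + X @ np.array([3.0, -2.0, 10.0]) + rng.normal(scale=5.0, size=1000)

# Same 75:25 split as in the notebook.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_valid)

# RMSE computed directly from its definition and via scikit-learn's MSE.
rmse_manual = np.sqrt(np.mean((y_valid - pred) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_valid, pred))

# Baseline: always predict the mean of the training target.
rmse_baseline = np.sqrt(np.mean((y_valid - y_train.mean()) ** 2))

print(f"model RMSE: {rmse_manual:.2f}, baseline RMSE: {rmse_baseline:.2f}")
```

A model whose RMSE is close to the baseline adds little beyond the mean, so running the same comparison on each region would show how much the features actually help there.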

In [3]:
# Constants
BUDGET = 100_000_000  # USD
WELLS_SELECTED = 200
REVENUE_PER_BARREL = 4.5  # USD per barrel

# Calculate the cost per well
cost_per_well = BUDGET / WELLS_SELECTED

# Calculate the volume of reserves needed per well to break even (without losses) in thousands of barrels
volume_needed_per_well_thousands = (cost_per_well / REVENUE_PER_BARREL) / 1000

print(f"Volume of Reserves Needed per Well: {volume_needed_per_well_thousands:.2f} thousand barrels")

# Compare with the average predicted volume for each region (assuming avg_volume_0, avg_volume_1, avg_volume_2 are in thousands of barrels)
print(f"Region 0 - Average Predicted Volume per Well: {avg_volume_0:.2f} thousand barrels")
print(f"Region 1 - Average Predicted Volume per Well: {avg_volume_1:.2f} thousand barrels")
print(f"Region 2 - Average Predicted Volume per Well: {avg_volume_2:.2f} thousand barrels")

# Calculate the difference between the needed volume and the average predicted reserves for each region
difference_0 = avg_volume_0 - volume_needed_per_well_thousands
difference_1 = avg_volume_1 - volume_needed_per_well_thousands
difference_2 = avg_volume_2 - volume_needed_per_well_thousands

# Display the results
print("\nComparison with Required Volume per Well:")
print(f"Region 0 - Difference: {difference_0:.2f} thousand barrels")
print(f"Region 1 - Difference: {difference_1:.2f} thousand barrels")
print(f"Region 2 - Difference: {difference_2:.2f} thousand barrels")


Volume of Reserves Needed per Well: 111.11 thousand barrels
Region 0 - Average Predicted Volume per Well: 92.40 thousand barrels
Region 1 - Average Predicted Volume per Well: 68.71 thousand barrels
Region 2 - Average Predicted Volume per Well: 94.77 thousand barrels

Comparison with Required Volume per Well:
Region 0 - Difference: -18.71 thousand barrels
Region 1 - Difference: -42.40 thousand barrels
Region 2 - Difference: -16.34 thousand barrels


### PART B.1: Calculating the Volume of Reserves Sufficient for Developing a New Well

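For a well to be worth developing, its revenue must at least cover its development cost. The break-even volume per well follows from:

$$
\text{break-even volume} = \frac{\text{cost per well}}{\text{revenue per barrel}} = \frac{\text{budget} / \text{number of wells}}{\text{revenue per barrel}}
$$

Where:
- **Revenue per barrel**: $4.5 per barrel (as given).
- **Cost per well**: the total budget divided by the number of wells, 100,000,000 / 200 = 500,000 USD.

Solving gives 500,000 / 4.5 ≈ 111,111 barrels per well.

**Calculation Output**:
- The volume of reserves needed per well to break even is approximately **111.11 thousand barrels**.

Assuming that OilyGiant only sells oil in whole barrels, each well needs at least **111,112 barrels** of oil to net a profit. The average predicted volume in every region (92.40, 68.71, and 94.77 thousand barrels) falls below this threshold, so an average well would not break even and only the highest-volume wells are worth developing.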
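The break-even arithmetic can be checked directly from the constants defined in cell `In [3]`. This is a small worked sketch; the names `break_even_barrels` and `whole_barrels_needed` are introduced here for illustration.

```python
import math

BUDGET = 100_000_000       # USD, total development budget
WELLS_SELECTED = 200       # wells developed per region
REVENUE_PER_BARREL = 4.5   # USD per barrel

cost_per_well = BUDGET / WELLS_SELECTED                   # 500,000 USD
break_even_barrels = cost_per_well / REVENUE_PER_BARREL   # about 111,111.11 barrels
break_even_thousands = break_even_barrels / 1000          # same volume in thousand barrels
whole_barrels_needed = math.ceil(break_even_barrels)      # smallest whole-barrel volume covering the cost

print(f"{break_even_thousands:.2f} thousand barrels ({whole_barrels_needed:,} whole barrels)")
# 111.11 thousand barrels (111,112 whole barrels)
```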

In [4]:
# Function to train the model and make predictions
def train_and_evaluate(data):
    features = ['f0', 'f1', 'f2']
    target = 'product'
    X_train, X_valid, y_train, y_valid = train_test_split(data[features], data[target], test_size=0.25, random_state=42)
    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_valid)
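    # RMSE as the square root of MSE; the `squared=False` argument is deprecated in recent scikit-learn releases
    rmse = np.sqrt(mean_squared_error(y_valid, predictions))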
    average_volume = predictions.mean()
    return model, predictions, rmse, average_volume, y_valid

# Train and evaluate models for each region
model_0, predictions_0, rmse_0, avg_volume_0, y_valid_0 = train_and_evaluate(geo_data_0)
model_1, predictions_1, rmse_1, avg_volume_1, y_valid_1 = train_and_evaluate(geo_data_1)
model_2, predictions_2, rmse_2, avg_volume_2, y_valid_2 = train_and_evaluate(geo_data_2)

# Display RMSE and average predicted reserves for each region
print(f"Region 0 - RMSE: {rmse_0:.2f}, Average Predicted Volume: {avg_volume_0:.2f}")
print(f"Region 1 - RMSE: {rmse_1:.2f}, Average Predicted Volume: {avg_volume_1:.2f}")
print(f"Region 2 - RMSE: {rmse_2:.2f}, Average Predicted Volume: {avg_volume_2:.2f}")


Region 0 - RMSE: 37.76, Average Predicted Volume: 92.40
Region 1 - RMSE: 0.89, Average Predicted Volume: 68.71
Region 2 - RMSE: 40.15, Average Predicted Volume: 94.77


In [5]:
# Profit Calculation Preparation
# Constants
BUDGET = 100_000_000  # Development cost budget
REVENUE_PER_BARREL = 4.5  # USD revenue per barrel
WELLS_SELECTED = 200  # Number of top-performing wells

def calculate_profit(predictions, target):
    # Select the indices of the top wells based on predictions
    selected_indices = predictions.sort_values(ascending=False).index[:WELLS_SELECTED]
    
    # Use the actual reserve values (targets) at these indices to calculate profit
    selected_reserves = target.loc[selected_indices].sum()  # Sum in thousands of barrels
    
    # Calculate profit, converting selected_reserves to barrels by multiplying by 1000
    profit = (selected_reserves * 1000 * REVENUE_PER_BARREL) - BUDGET
    return profit

# Calculate profits for each region based on validation set predictions
profit_0 = calculate_profit(pd.Series(predictions_0, index=y_valid_0.index), y_valid_0)
profit_1 = calculate_profit(pd.Series(predictions_1, index=y_valid_1.index), y_valid_1)
profit_2 = calculate_profit(pd.Series(predictions_2, index=y_valid_2.index), y_valid_2)

print(f"Estimated profit for Region 0: ${profit_0:,.2f}")
print(f"Estimated profit for Region 1: ${profit_1:,.2f}")
print(f"Estimated profit for Region 2: ${profit_2:,.2f}")



Estimated profit for Region 0: $33,591,411.14
Estimated profit for Region 1: $24,150,866.97
Estimated profit for Region 2: $25,985,717.59


In [6]:
# Constants
BUDGET = 100_000_000  # Development cost budget
REVENUE_PER_BARREL = 4.5  # USD revenue per barrel
WELLS_SELECTED = 200  # Number of top-performing wells
SAMPLE_SIZE = 500  # Sample size for each bootstrapped iteration
N_BOOTSTRAPS = 1000  # Number of bootstrapping iterations

# Profit calculation function using top wells based on predictions but using targets for actual reserves
def calculate_profit(predictions, targets):
    # Select the indices of the top 200 wells based on predictions
    selected_indices = predictions.sort_values(ascending=False).index[:WELLS_SELECTED]
    
    # Use the actual reserve values (targets) at these indices to calculate profit
    selected_reserves = targets.loc[selected_indices].sum()  # Sum in thousands of barrels
    
    # Calculate profit, converting selected_reserves to barrels by multiplying by 1000
    profit = (selected_reserves * 1000 * REVENUE_PER_BARREL) - BUDGET
    return profit

# Bootstrapping function to calculate profit distribution
def bootstrap_profit(predictions, targets, n_bootstraps=N_BOOTSTRAPS, sample_size=SAMPLE_SIZE):
    profits = []

    # Align predictions and targets in a DataFrame to ensure index consistency
    aligned_data = pd.DataFrame({'Predictions': predictions, 'Targets': targets})
    
    for _ in range(n_bootstraps):
        # Sample 500 wells with replacement
        sample = aligned_data.sample(n=sample_size, replace=True)
        sample_predictions = sample['Predictions']
        sample_targets = sample['Targets']
        
        # Calculate profit using the sample data
        profit = calculate_profit(sample_predictions, sample_targets)
        profits.append(profit)

    # Calculate mean profit, 90% confidence interval, and risk of loss
    mean_profit = np.mean(profits)
    conf_interval = np.percentile(profits, [5, 95])  # 90% confidence interval
    risk_of_loss = (np.array(profits) < 0).mean() * 100  # Percentage of samples with negative profit

    return mean_profit, conf_interval, risk_of_loss

# Applying bootstrapping for each region
results = []
for predictions, targets, region_name in [
    (pd.Series(predictions_0, index=y_valid_0.index), y_valid_0, 'Region 0'),
    (pd.Series(predictions_1, index=y_valid_1.index), y_valid_1, 'Region 1'),
    (pd.Series(predictions_2, index=y_valid_2.index), y_valid_2, 'Region 2')
]:
    mean_profit, conf_interval, risk_of_loss = bootstrap_profit(predictions, targets)
    results.append((region_name, mean_profit, conf_interval, risk_of_loss))

# Display the results
for region_name, mean_profit, conf_interval, risk_of_loss in results:
    print(f"{region_name}:")
    print(f"  Average Profit: ${mean_profit:,.2f}")
    print(f"  90% Confidence Interval: [${conf_interval[0]:,.2f}, ${conf_interval[1]:,.2f}]")
    print(f"  Risk of Loss: {risk_of_loss:.2f}%\n")


Region 0:
  Average Profit: $6,141,092.27
  90% Confidence Interval: [$1,216,676.51, $11,454,283.18]
  Risk of Loss: 1.60%

Region 1:
  Average Profit: $6,450,911.11
  90% Confidence Interval: [$2,100,217.52, $10,892,754.83]
  Risk of Loss: 0.50%

Region 2:
  Average Profit: $5,869,463.06
  90% Confidence Interval: [$747,740.78, $11,431,991.56]
  Risk of Loss: 3.20%
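
One detail worth checking in the bootstrap: `sample(..., replace=True)` keeps the original index labels, so a sampled frame can contain the same label more than once, and `.loc` with a repeated label returns every row that carries it. The sketch below uses a made-up three-row sample (not the project data) to show how this can count more than the intended number of wells, together with one possible remedy, resetting the index before selecting. The helper `top_reserves` is hypothetical, and whether or by how much this affects the figures above has not been verified here.

```python
import pandas as pd

# Made-up sample in which well 7 was drawn twice, as sampling with replacement allows.
sample = pd.DataFrame(
    {"Predictions": [3.0, 3.0, 1.0], "Targets": [30.0, 30.0, 10.0]},
    index=[7, 7, 8],
)
top_n = 2

def top_reserves(predictions: pd.Series, targets: pd.Series, n: int) -> float:
    """Sum the actual reserves of the n wells with the highest predictions."""
    selected = predictions.sort_values(ascending=False).index[:n]
    return float(targets.loc[selected].sum())

# Repeated labels: .loc[[7, 7]] matches both rows for each label, so 4 rows are summed.
with_repeats = top_reserves(sample["Predictions"], sample["Targets"], top_n)

# Unique positional labels: exactly n rows are summed.
clean = sample.reset_index(drop=True)
without_repeats = top_reserves(clean["Predictions"], clean["Targets"], top_n)

print(with_repeats, without_repeats)  # 120.0 60.0
```

In the notebook's setting, the equivalent change would be to call `.reset_index(drop=True)` on each bootstrap sample before passing it to `calculate_profit`, then rerun the cell to see whether the profit estimates change.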



In [8]:
# Results for each region based on bootstrapping analysis
results = [
    ("Region 0", 6141092.27, [1216676.51, 11454283.18], 1.6),
    ("Region 1", 6450911.11, [2100217.52, 10892754.83], 0.5),
    ("Region 2", 5869463.06, [747740.78, 11431991.56], 3.2)
]

# Filter suitable regions based on risk criteria
suitable_regions = [result for result in results if result[3] < 2.5]  # Risk of loss below 2.5%

if suitable_regions:
    # Recommend region with highest mean profit
    best_region = max(suitable_regions, key=lambda x: x[1])
    print(f"Recommended Region for Development: {best_region[0]}")
    print(f"Reason: Highest average profit of ${best_region[1]:,.2f} with a risk of loss of {best_region[3]:.2f}%")
else:
    print("No suitable regions found with a risk of loss below 2.5%")


Recommended Region for Development: Region 1
Reason: Highest average profit of $6,450,911.11 with a risk of loss of 0.50%


## Findings and Recommendation

After analyzing the profit distribution and risk for each region using 1,000 bootstrap iterations of 500 sampled wells each (with the top 200 wells by predicted volume selected in every iteration), the following results were obtained:

- **Region 0**:
  - Average Profit: $6,141,092.27
  - 90% Confidence Interval: [$1,216,676.51, $11,454,283.18]
  - Risk of Loss: 1.60%

- **Region 1**:
  - Average Profit: $6,450,911.11
  - 90% Confidence Interval: [$2,100,217.52, $10,892,754.83]
  - Risk of Loss: 0.50%

- **Region 2**:
  - Average Profit: $5,869,463.06
  - 90% Confidence Interval: [$747,740.78, $11,431,991.56]
  - Risk of Loss: 3.20%
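
How these three figures are derived from a list of bootstrap profits can be sketched as follows. The helper `summarize_profits` and the ten toy profit values are hypothetical and chosen only so the arithmetic is easy to follow; they are not results from the project.

```python
import numpy as np

def summarize_profits(profits):
    """Mean, 90% percentile interval, and share of negative outcomes."""
    profits = np.asarray(profits, dtype=float)
    lower, upper = np.percentile(profits, [5, 95])  # 5th and 95th percentiles bound the middle 90%
    return {
        "mean": profits.mean(),
        "ci_90": (lower, upper),
        "risk_of_loss_pct": (profits < 0).mean() * 100,  # percentage of iterations that lose money
    }

# Ten made-up profit outcomes (in millions USD); exactly one is a loss.
toy_profits = [-1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
summary = summarize_profits(toy_profits)
print(summary)
```

With one loss out of ten outcomes, the risk of loss is 10%. Because the notebook's bootstrap does not fix a random seed, its reported figures will vary slightly from run to run.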

### Interpretation

The goal is to select a region for development that maximizes average profit while keeping the risk of loss below the 2.5% threshold.

- **Region 1** is the most suitable candidate for development, with an average profit of $6,450,911.11 and a low risk of loss at 0.50%.
- **Region 0** also meets the risk threshold with a 1.60% risk of loss, but its average profit is lower than Region 1’s.
- **Region 2** has a risk of loss above the 2.5% threshold, making it less suitable despite a positive average profit.

### Recommendation

**Region 1** is recommended for development due to its balance of high average profit and minimal risk. This region offers the highest profit potential with an acceptable risk level, aligning with the company’s objective of maximizing profit while minimizing exposure to losses.
