# Machine Learning in Business Project <a id='intro'></a>

## Project Description

You work for the OilyGiant mining company. Your task is to find the best place for a new well.

Steps to choose the location:
- Collect the oil well parameters in the selected region: oil quality and volume of reserves;
- Build a model for predicting the volume of reserves in the new wells;
- Pick the oil wells with the highest estimated values;
- Pick the region with the highest total profit for the selected oil wells.

You have data on oil samples from three regions. Parameters of each oil well in the region are already known. Build a model that will help to pick the region with the highest profit margin. Analyze potential profit and risks using the *Bootstrapping* technique.

## Project Instructions

1. **Download and prepare the data. Explain the procedure.**

2. **Train and test the model for each region:**

    2.1. Split the data into a training set and validation set at a ratio of 75:25.

    2.2. Train the model and make predictions for the validation set.

    2.3. Save the predictions and correct answers for the validation set.

    2.4. Print the average volume of predicted reserves and model *RMSE*.

    2.5. Analyze the results.

3. **Prepare for profit calculation:**

    3.1. Store all key values for calculations in separate variables.

    3.2. Calculate the volume of reserves sufficient for developing a new well without losses. Compare the obtained value with the average volume of reserves in each region.

    3.3. Provide the findings about the preparation for profit calculation step.

4. **Write a function to calculate profit from a set of selected oil wells and model predictions:**

    4.1. Pick the wells with the highest values of predictions. 

    4.2. Summarize the target volume of reserves in accordance with these predictions.

    4.3. Provide findings: suggest a region for oil wells' development and justify the choice. Calculate the profit for the obtained volume of reserves.

5. **Calculate risks and profit for each region:**

    5.1. Use the bootstrapping technique with 1000 samples to find the distribution of profit.

    5.2. Find the average profit, 95% confidence interval, and risk of losses. Loss is negative profit; calculate its probability and express it as a percentage.

    5.3. Provide findings: suggest a region for development of oil wells and justify the choice.

## Data Description

Geological exploration data for the three regions are stored in three files:
- `geo_data_0.csv`
- `geo_data_1.csv`
- `geo_data_2.csv`

Each file contains the following columns:
- **id**: unique oil well identifier
- **f0**, **f1**, **f2**: three features of points
- **product**: volume of reserves in the oil well (thousand barrels)

## Conditions

- Only linear regression is suitable for model training.
- When exploring a region, 500 points are studied, and the best 200 are picked for the profit calculation.
- The budget for developing 200 oil wells is USD 100 million.
- One barrel of raw materials brings USD 4.5 of revenue. The revenue from one unit of product is USD 4,500 (the volume of reserves is in thousand barrels).
- After the risk evaluation, keep only the regions with the risk of losses lower than 2.5%. From the ones that fit the criteria, the region with the highest average profit should be selected.
- The data is synthetic: contract details and well characteristics are not disclosed.

## Initialize

The imports below cover data handling (pandas, NumPy), plotting (Matplotlib), linear regression, data splitting and error metrics (scikit-learn), and statistical functions (SciPy).

In [1]:
# Loading all the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import scipy.stats as st

## Download Data

The 'load_data' function reads a CSV file from a local directory first and falls back to a server path if the file is not found locally. If the file is missing from both locations, it prints a message and returns 'None'.

In [2]:
# Load Data function
def load_data(file_name, local_path, server_path):
    try:
        # Attempt to read the file from the local path
        data = pd.read_csv(local_path + file_name)
        print(f"'{file_name}' file successfully read from the local path.")
    except FileNotFoundError:
        try:
            # If file not found locally, attempt to read from the server path
            data = pd.read_csv(server_path + file_name)
            print(f"'{file_name}' file successfully read from the server path.")
        except FileNotFoundError:
            # If file not found in both paths, print an error message
            print(f"'{file_name}' file not found. Please check the file paths.")
            data = None
    return data

# File names and paths
file_names = ['geo_data_0.csv', 'geo_data_1.csv', 'geo_data_2.csv']
local_path = '/Users/benjaminstephen/Documents/TripleTen/Sprint_9/Machine_Learning_in_Business_Project/datasets/'
server_path = '/datasets/'

# Load the data files
geo_data_0 = load_data(file_names[0], local_path, server_path)
geo_data_1 = load_data(file_names[1], local_path, server_path)
geo_data_2 = load_data(file_names[2], local_path, server_path)    

'geo_data_0.csv' file successfully read from the local path.
'geo_data_1.csv' file successfully read from the local path.
'geo_data_2.csv' file successfully read from the local path.


## Prepare Data

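The 'analyze' function performs basic exploratory data analysis (EDA) on a given DataFrame: it displays the data, prints the column info, and reports the share of null values per column and the number of fully duplicated rows.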

In [3]:
# Analyze function
def analyze(data):
    # Display the DataFrame
    display(data)

    # Print DataFrame Info
    print("DATAFRAME INFO:")
    data.info()
    print()

    # Calculate Percentage of Null Values (share of nulls per column, scaled to percent)
    print("PERCENTAGE OF NULL VALUES:")
    print(data.isnull().mean() * 100)
    print()

    # Calculate Number of Duplicated Rows
    print("NUMBER OF DUPLICATED ROWS:", data.duplicated().sum())
    print()

In [4]:
# Analyze Region 0
print("REGION 0:")
analyze(geo_data_0)

REGION 0:


|  | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | txEyH | 0.705745 | -0.497823 | 1.221170 | 105.280062 |
| 1 | 2acmU | 1.334711 | -0.340164 | 4.365080 | 73.037750 |
| 2 | 409Wp | 1.022732 | 0.151990 | 1.419926 | 85.265647 |
| 3 | iJLyR | -0.032172 | 0.139033 | 2.978566 | 168.620776 |
| 4 | Xdl7t | 1.988431 | 0.155413 | 4.751769 | 154.036647 |
| ... | ... | ... | ... | ... | ... |
| 99995 | DLsed | 0.971957 | 0.370953 | 6.075346 | 110.744026 |
| 99996 | QKivN | 1.392429 | -0.382606 | 1.273912 | 122.346843 |
| 99997 | 3rnvd | 1.029585 | 0.018787 | -1.348308 | 64.375443 |
| 99998 | 7kl59 | 0.998163 | -0.528582 | 1.583869 | 74.040764 |


DATAFRAME INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB

PERCENTAGE OF NULL VALUES:
id         0.0
f0         0.0
f1         0.0
f2         0.0
product    0.0
dtype: float64

NUMBER OF DUPLICATED ROWS: 0



In [5]:
# Analyze Region 1
print("REGION 1:")
analyze(geo_data_1)

REGION 1:


|  | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | kBEdx | -15.001348 | -8.276000 | -0.005876 | 3.179103 |
| 1 | 62mP7 | 14.272088 | -3.475083 | 0.999183 | 26.953261 |
| 2 | vyE1P | 6.263187 | -5.948386 | 5.001160 | 134.766305 |
| 3 | KcrkZ | -13.081196 | -11.506057 | 4.999415 | 137.945408 |
| 4 | AHL4O | 12.702195 | -8.147433 | 5.004363 | 134.766305 |
| ... | ... | ... | ... | ... | ... |
| 99995 | QywKC | 9.535637 | -6.878139 | 1.998296 | 53.906522 |
| 99996 | ptvty | -10.160631 | -12.558096 | 5.005581 | 137.945408 |
| 99997 | 09gWa | -7.378891 | -3.084104 | 4.998651 | 137.945408 |
| 99998 | rqwUm | 0.665714 | -6.152593 | 1.000146 | 30.132364 |


DATAFRAME INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB

PERCENTAGE OF NULL VALUES:
id         0.0
f0         0.0
f1         0.0
f2         0.0
product    0.0
dtype: float64

NUMBER OF DUPLICATED ROWS: 0



In [6]:
# Analyze Region 2
print("REGION 2:")
analyze(geo_data_2)

REGION 2:


|  | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | fwXo0 | -1.146987 | 0.963328 | -0.828965 | 27.758673 |
| 1 | WJtFt | 0.262778 | 0.269839 | -2.530187 | 56.069697 |
| 2 | ovLUW | 0.194587 | 0.289035 | -5.586433 | 62.871910 |
| 3 | q6cA6 | 2.236060 | -0.553760 | 0.930038 | 114.572842 |
| 4 | WPMUX | -0.515993 | 1.716266 | 5.899011 | 149.600746 |
| ... | ... | ... | ... | ... | ... |
| 99995 | 4GxBu | -1.777037 | 1.125220 | 6.263374 | 172.327046 |
| 99996 | YKFjq | -1.261523 | -0.894828 | 2.524545 | 138.748846 |
| 99997 | tKPY3 | -1.199934 | -2.957637 | 5.219411 | 157.080080 |
| 99998 | nmxp2 | -2.419896 | 2.417221 | -5.548444 | 51.795253 |


DATAFRAME INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB

PERCENTAGE OF NULL VALUES:
id         0.0
f0         0.0
f1         0.0
f2         0.0
product    0.0
dtype: float64

NUMBER OF DUPLICATED ROWS: 0



Each dataset consists of 100,000 entries, each describing one oil well identified by the 'id' column. Besides 'id', each dataset includes four numerical columns: 'f0', 'f1', and 'f2', which are features of the well, and 'product', which is the volume of oil reserves in thousand barrels. None of the columns contains missing values in any of the three datasets. There are also no fully duplicated rows; note that this check compares whole rows, so it does not by itself confirm that every 'id' value is unique.
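
The duplicate check in 'analyze' compares entire rows. A complementary check on the identifier column alone could look like the sketch below. The helper name `count_repeated_ids` and the toy DataFrame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def count_repeated_ids(data: pd.DataFrame, id_column: str = "id") -> int:
    """Return how many rows reuse an identifier that already appeared earlier."""
    return int(data[id_column].duplicated().sum())

# Tiny made-up frame: every row is distinct, but the id "a1" appears twice.
toy = pd.DataFrame({
    "id": ["a1", "b2", "a1"],
    "f0": [0.1, 0.2, 0.3],
    "product": [10.0, 20.0, 30.0],
})

full_row_duplicates = int(toy.duplicated().sum())  # whole-row comparison
repeated_ids = count_repeated_ids(toy)             # identifier-only comparison
print(full_row_duplicates, repeated_ids)
```

Applied to the project data, this would be called as `count_repeated_ids(geo_data_0)` and so on for each region.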

## Split Data

The 'split_data' function extracts geological features ('f0', 'f1', 'f2') and the target variable ('product') from each region's dataset ('geo_data_0', 'geo_data_1', 'geo_data_2'). It divides the data into training and validation sets using a 75:25 split ratio, ensuring the model is trained on a majority of the data while validating its performance on unseen samples. The 'print_sizes' function provides concise feedback on the sizes of these subsets, verifying the integrity of the data split process.

In [7]:
# Split Data function
def split_data(data):
    # Extract features by dropping 'id' and 'product' columns
    features = data.drop(['id', 'product'], axis=1)
    
    # Extract target labels from the 'product' column
    target = data['product']
    
    # Split data into training and validation sets (75% training, 25% validation)
    features_train, features_valid, target_train, target_valid = train_test_split(
        features, 
        target, 
        test_size=0.25,  # 25% of the data will be validation set
        random_state=12345  # Set random seed for reproducibility
    )
    
    return features_train, features_valid, target_train, target_valid

# Print Sizes function
def print_sizes(features_train, features_valid, target_train, target_valid):
    print("---------")
    print("Training Features Size:", features_train.shape)
    print("Validation Features Size:", features_valid.shape)
    print("Training Target Size:", target_train.shape)
    print("Validation Target Size:", target_valid.shape)

# Split data for Region 0 and print subsets
features_train_0, features_valid_0, target_train_0, target_valid_0 = split_data(geo_data_0)
print("REGION 0:")
print_sizes(features_train_0, features_valid_0, target_train_0, target_valid_0)
print()

# Split data for Region 1 and print subsets
features_train_1, features_valid_1, target_train_1, target_valid_1 = split_data(geo_data_1)
print("REGION 1:")
print_sizes(features_train_1, features_valid_1, target_train_1, target_valid_1)
print()

# Split data for Region 2 and print subsets
features_train_2, features_valid_2, target_train_2, target_valid_2 = split_data(geo_data_2)
print("REGION 2:")
print_sizes(features_train_2, features_valid_2, target_train_2, target_valid_2)

REGION 0:
---------
Training Features Size: (75000, 3)
Validation Features Size: (25000, 3)
Training Target Size: (75000,)
Validation Target Size: (25000,)

REGION 1:
---------
Training Features Size: (75000, 3)
Validation Features Size: (25000, 3)
Training Target Size: (75000,)
Validation Target Size: (25000,)

REGION 2:
---------
Training Features Size: (75000, 3)
Validation Features Size: (25000, 3)
Training Target Size: (75000,)
Validation Target Size: (25000,)


## Train and Evaluate Models

The 'train_model' function initializes a linear regression model and fits it to the training data ('features_train' and 'target_train'). This enables the model to learn the relationship between geological features and oil reserve volumes. After training, predictions are generated for the validation set ('features_valid') to evaluate how well the model generalizes to new data.

The 'evaluate_model' function then quantifies the model's performance by computing the Root Mean Squared Error (RMSE) between the predicted and actual oil reserve volumes ('target_valid'). It also prints the average actual and predicted volumes and displays both side by side in a DataFrame ('act_vs_pred').

In [8]:
# Train Model function
def train_model(features_train, features_valid, target_train):
    # Initialize a Linear Regression model
    model = LinearRegression()
    
    # Train the model on the training set
    model.fit(features_train, target_train)
    
    # Get predictions on the validation set
    predictions = model.predict(features_valid)
    
    return predictions

# Evaluate Model function
def evaluate_model(target_valid, predictions):
    # Display results of model
    print("---------")
    rmse = mean_squared_error(target_valid, predictions) ** 0.5
    print("RMSE:", rmse)
    print("Average ACTUAL Volume:", target_valid.mean())
    print("Average PREDICTED Volume:", predictions.mean())

    # Create a DataFrame to compare actual and predicted volumes
    act_vs_pred = pd.DataFrame({'ACTUAL': target_valid, 'PREDICTIONS': predictions})

    # Display the DataFrame for comparison
    display(act_vs_pred)

In [9]:
# Train the model for Region 0
predictions_0 = train_model(features_train_0, features_valid_0, target_train_0)

# Evaluate the model for Region 0
print("REGION 0:")
evaluate_model(target_valid_0, predictions_0)

REGION 0:
---------
RMSE: 37.5794217150813
Average ACTUAL Volume: 92.07859674082927
Average PREDICTED Volume: 92.59256778438038


|  | ACTUAL | PREDICTIONS |
|---|---|---|
| 71751 | 10.038645 | 95.894952 |
| 80493 | 114.551489 | 77.572583 |
| 2655 | 132.603635 | 77.892640 |
| 53233 | 169.072125 | 90.175134 |
| 91141 | 122.325180 | 70.510088 |
| ... | ... | ... |
| 12581 | 170.116726 | 103.037104 |
| 18456 | 93.632175 | 85.403255 |
| 73035 | 127.352259 | 61.509833 |
| 63834 | 99.782700 | 118.180397 |


In [10]:
# Train the model for Region 1
predictions_1 = train_model(features_train_1, features_valid_1, target_train_1)

# Evaluate the model for Region 1
print("REGION 1:")
evaluate_model(target_valid_1, predictions_1)

REGION 1:
---------
RMSE: 0.8930992867756165
Average ACTUAL Volume: 68.72313602435997
Average PREDICTED Volume: 68.728546895446


|  | ACTUAL | PREDICTIONS |
|---|---|---|
| 71751 | 80.859783 | 82.663314 |
| 80493 | 53.906522 | 54.431786 |
| 2655 | 30.132364 | 29.748760 |
| 53233 | 53.906522 | 53.552133 |
| 91141 | 0.000000 | 1.243856 |
| ... | ... | ... |
| 12581 | 137.945408 | 136.869211 |
| 18456 | 110.992147 | 110.693465 |
| 73035 | 137.945408 | 137.879341 |
| 63834 | 84.038886 | 83.761966 |


In [11]:
# Train the model for Region 2
predictions_2 = train_model(features_train_2, features_valid_2, target_train_2)

# Evaluate the model for Region 2
print("REGION 2:")
evaluate_model(target_valid_2, predictions_2)

REGION 2:
---------
RMSE: 40.02970873393434
Average ACTUAL Volume: 94.88423280885438
Average PREDICTED Volume: 94.96504596800492


|  | ACTUAL | PREDICTIONS |
|---|---|---|
| 71751 | 61.212375 | 93.599633 |
| 80493 | 41.850118 | 75.105159 |
| 2655 | 57.776581 | 90.066809 |
| 53233 | 100.053761 | 105.162375 |
| 91141 | 109.897122 | 115.303310 |
| ... | ... | ... |
| 12581 | 28.492402 | 78.765887 |
| 18456 | 21.431303 | 95.603394 |
| 73035 | 125.487229 | 99.407281 |
| 63834 | 99.422903 | 77.779912 |


Model performance varies considerably across the three regions. In Region 0, the RMSE is approximately 37.58 thousand barrels, a large error relative to the average actual volume of 92.08; the average predicted volume (92.59) is nonetheless close to the actual average. Region 1 has by far the lowest RMSE, about 0.89, so its per-well predictions are highly accurate, and the average actual and predicted volumes (68.72 and 68.73) are nearly identical. Region 2 has the highest RMSE, about 40.03, with average actual and predicted volumes of 94.88 and 94.97. In all three regions the predicted average matches the actual average closely, so the RMSE differences reflect per-well accuracy rather than a systematic bias. Region 1 therefore has the most reliable predictions but also the lowest average reserves; model accuracy alone does not determine which region is the most profitable.
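
An RMSE value is easier to interpret next to a simple reference point. One common choice is the error of a constant model that always predicts the training-set mean. The sketch below illustrates the idea on made-up numbers; the helper names `rmse` and `baseline_rmse` are assumptions introduced here and do not appear in the notebook.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def baseline_rmse(train_target, valid_target):
    """RMSE of a constant model that always predicts the training-set mean."""
    constant = float(np.mean(train_target))
    return rmse(valid_target, np.full(len(valid_target), constant))

# Made-up targets, only to show the mechanics.
toy_train = [10.0, 20.0, 30.0]
toy_valid = [10.0, 30.0]
toy_baseline = baseline_rmse(toy_train, toy_valid)
print(toy_baseline)  # 10.0
```

In the notebook this would be called as `baseline_rmse(target_train_0, target_valid_0)`; a model RMSE close to the baseline would mean the features add little beyond the regional average.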

## Volume of Reserves to Break-Even

The code here calculates the minimum volume of oil reserves required for each region to break even on drilling costs, based on a fixed budget and average revenue per unit of oil produced. By dividing the total budget ('BUDGET') by the number of target wells ('TARGET_WELLS'), it determines the cost per well ('WELL_COST'). The 'break_even_volume' represents the minimum volume of reserves each well must yield to cover this cost at the average revenue per unit ('UNIT_REVENUE'). The subsequent 'calculate_break_even' function computes and prints the average volume of reserves for each region, highlighting how their respective average volumes compare to the break-even point.

In [12]:
# Constants
BUDGET = 100000000 # Total budget available for development
UNIT_REVENUE = 4500 # Average revenue per unit of oil produced
TARGET_WELLS = 200 # Number of wells to be drilled per region
TOTAL_WELLS = 500 # Total number of available wells
WELL_COST = BUDGET / TARGET_WELLS # Cost of one well

# Minimum volume of reserves to cover the well cost
break_even_volume = WELL_COST / UNIT_REVENUE
print("Minimum Volume of Reserves to Break-Even:", break_even_volume)
print()

# Calculate Break-Even function
def calculate_break_even(data, break_even_volume):
    print("---------")
    avg_volume = data['product'].mean()
    print("Average Volume of Reserves:", avg_volume)
    # A positive difference means the regional average is below the break-even volume
    print("Difference from Breaking Even:", break_even_volume - avg_volume)

print("REGION 0:")
calculate_break_even(geo_data_0, break_even_volume)
print()

print("REGION 1:")
calculate_break_even(geo_data_1, break_even_volume)
print()

print("REGION 2:")
calculate_break_even(geo_data_2, break_even_volume)

Minimum Volume of Reserves to Break-Even: 111.11111111111111

REGION 0:
---------
Average Volume of Reserves: 92.50000000000001
Difference from Breaking Even: 18.6111111111111

REGION 1:
---------
Average Volume of Reserves: 68.82500000000002
Difference from Breaking Even: 42.2861111111111

REGION 2:
---------
Average Volume of Reserves: 95.00000000000004
Difference from Breaking Even: 16.11111111111107


The break-even volume is about 111.11 thousand barrels per well, and the average reserves in every region fall below it: Region 0 by approximately 18.61 thousand barrels, Region 1 by 42.29, and Region 2 by 16.11. Region 2 comes closest, but none of the regions would break even if wells were developed at random. Profitability therefore depends on using the model to select wells with above-average reserves.
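
For reference, the break-even arithmetic from the stated conditions (a budget of USD 100 million for 200 wells and USD 4,500 of revenue per thousand barrels) can be written out as:

$$
\text{cost per well} = \frac{100{,}000{,}000}{200} = 500{,}000 \ \text{USD},
\qquad
\text{break-even volume} = \frac{500{,}000}{4{,}500} \approx 111.11 \ \text{thousand barrels}
$$

The per-region shortfalls then follow by subtraction: $111.11 - 92.50 \approx 18.61$, $111.11 - 68.825 \approx 42.29$, and $111.11 - 95.00 \approx 16.11$.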

## Calculate Profit

The 'calculate_profit' function computes the profit from drilling in a region based on the model's predictions of oil reserves. It first converts the validation targets and the predictions into pandas Series with matching positional indexes. The predictions are sorted in descending order to prioritize wells with higher estimated reserves. The function then selects the top 200 wells by predicted volume and sums their actual reserves. Profit is the revenue from those reserves minus the total development budget. The rest of the cell calls the function for Regions 0, 1, and 2 and prints the results.

In [13]:
# Calculate Profit function
def calculate_profit(target_valid, predictions):
    # Convert the target values to a pandas Series and reset the index for consistent indexing
    target_series = pd.Series(target_valid).reset_index(drop=True)

    # Convert the predictions to a pandas Series and sort them in descending order
    predictions_sorted = pd.Series(predictions).sort_values(ascending=False)

    # Select the top 200 wells based on the sorted predictions
    top_200_wells = target_series[predictions_sorted.index][:TARGET_WELLS]

    # Calculate the total units in the selected top wells
    total_units = top_200_wells.sum()

    # Calculate the profit based on total units and revenue per unit
    profit = total_units * UNIT_REVENUE - BUDGET

    return profit

# Print profit of Region 0
profit_0 = calculate_profit(target_valid_0, predictions_0)
print("REGION 0 PROFIT:", profit_0)
print()

# Print profit of Region 1
profit_1 = calculate_profit(target_valid_1, predictions_1)
print("REGION 1 PROFIT:", profit_1)
print()

# Print profit of Region 2
profit_2 = calculate_profit(target_valid_2, predictions_2)
print("REGION 2 PROFIT:", profit_2)

REGION 0 PROFIT: 33208260.43139851

REGION 1 PROFIT: 24150866.966815114

REGION 2 PROFIT: 27103499.635998324


These profits come from selecting the 200 wells with the highest predicted reserves out of each region's full validation set (25,000 wells). Region 0 yields the highest profit at approximately $33.2 million, followed by Region 2 at about $27.1 million and Region 1 at approximately $24.2 million. Because the project conditions limit exploration to 500 points per region, these figures are optimistic; the bootstrap analysis in the following sections applies that constraint and estimates the associated risk.
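
The key point of the profit calculation is that wells are ranked by predicted volume, while revenue is computed from the actual volume of the chosen wells. The following NumPy sketch shows this on four made-up wells. The function name `profit_from_top_wells` and all numbers are illustrative assumptions; it is an alternative formulation, not the notebook's 'calculate_profit'.

```python
import numpy as np

def profit_from_top_wells(actual, predicted, n_wells, unit_revenue, budget):
    """Pick the n_wells highest predictions, then price their actual volumes."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Positions of the n_wells largest predictions, best first
    top_positions = np.argsort(predicted)[::-1][:n_wells]
    return float(actual[top_positions].sum() * unit_revenue - budget)

# Made-up example: the model ranks wells 2 and 1 highest,
# although well 3 actually holds the largest reserves.
toy_actual = [50.0, 120.0, 90.0, 130.0]
toy_predicted = [60.0, 110.0, 125.0, 100.0]
toy_profit = profit_from_top_wells(
    toy_actual, toy_predicted, n_wells=2, unit_revenue=4500, budget=900_000
)
print(toy_profit)  # 45000.0
```

The example also shows how prediction error costs money: a perfect ranking would have picked wells 3 and 1 instead.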

## Profit Distribution

This code defines two functions: 'bootstrap_profit' and 'analyse_profit_distribution'. The 'bootstrap_profit' function uses bootstrapping to simulate the distribution of profit. In each iteration it samples 500 wells with replacement from the validation targets, takes the matching predictions, and calls 'calculate_profit' to pick the top 200 wells and compute the profit. Repeating this 1000 times yields a distribution of 1000 profit values. The 'analyse_profit_distribution' function then prints the number of bootstrap samples, the number of samples with positive profit, and the percentage of profitable samples. The printed labels say 'Wells', but each count refers to a bootstrap sample of 200 selected wells, not to an individual well.

In [14]:
# Bootstrap Profit function
def bootstrap_profit(target, predictions):
    # Convert target to pandas Series and reset index for consistent indexing
    target = pd.Series(target).reset_index(drop=True)

    # Initialize a random state for reproducibility
    state = np.random.RandomState(12345)

    # Initialize a list to store profit values from bootstrap samples
    profit_distribution = []

    # Perform 1000 bootstrap iterations
    for i in range(1000):
        # Randomly sample 500 wells with replacement from the target data
        target_subsample = target.sample(n=TOTAL_WELLS, replace=True, random_state=state)

        # Select the corresponding predictions for the sampled wells
        predictions_subsample = predictions[target_subsample.index]

        # Calculate profit for the bootstrap sample and append to the list
        profit_distribution.append(calculate_profit(target_subsample, predictions_subsample))
    
    # Convert the profit list to a pandas Series
    return pd.Series(profit_distribution)

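# Analyze Profit Distribution function
# Note: each element of profit_distribution is the profit of one bootstrap sample
# (top 200 of 500 sampled wells), so the "Wells" labels below count samples, not wells.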
def analyse_profit_distribution(profit_distribution):
    print("---------")
    print("Total Number of Wells:", len(profit_distribution))
    print("Number of Profitable Wells:", profit_distribution.gt(0).sum())
    percentage_profitable_wells = profit_distribution.gt(0).sum() / len(profit_distribution) * 100
    print(f"Percentage of Profitable Wells: {percentage_profitable_wells:.1f}%")

    # Display the profit distribution
    display(profit_distribution)

In [15]:
# Region 0 Profit Distribution
profit_distribution_0 = bootstrap_profit(target_valid_0, predictions_0)
print("REGION 0:")
analyse_profit_distribution(profit_distribution_0)

REGION 0:
---------
Total Number of Wells: 1000
Number of Profitable Wells: 931
Percentage of Profitable Wells: 93.1%


0      6.054641e+06
1      5.363934e+06
2      2.937858e+06
3      1.789934e+06
4      2.719929e+06
           ...     
995    5.253551e+06
996    7.790094e+06
997    6.494122e+06
998    3.149995e+06
999    2.197184e+06
Length: 1000, dtype: float64

In [16]:
# Region 1 Profit Distribution
profit_distribution_1 = bootstrap_profit(target_valid_1, predictions_1)
print("REGION 1:")
analyse_profit_distribution(profit_distribution_1)

REGION 1:
---------
Total Number of Wells: 1000
Number of Profitable Wells: 985
Percentage of Profitable Wells: 98.5%


0      2.280162e+06
1      3.343157e+06
2      2.537047e+06
3      6.139661e+06
4      3.571430e+06
           ...     
995    6.831945e+06
996    6.468698e+06
997    2.386523e+06
998    4.142425e+06
999    1.245778e+06
Length: 1000, dtype: float64

In [17]:
# Region 2 Profit Distribution
profit_distribution_2 = bootstrap_profit(target_valid_2, predictions_2)
print("REGION 2:")
analyse_profit_distribution(profit_distribution_2)

REGION 2:
---------
Total Number of Wells: 1000
Number of Profitable Wells: 924
Percentage of Profitable Wells: 92.4%


0     -7.189923e+05
1      6.459964e+06
2      6.261756e+06
3      4.123517e+06
4     -5.596049e+05
           ...     
995    5.668660e+06
996   -5.850207e+05
997    5.902561e+06
998    4.977628e+06
999    2.009241e+06
Length: 1000, dtype: float64

Region 1 has the highest share of profitable bootstrap samples at 98.5%, followed by Region 0 at 93.1% and Region 2 at 92.4%. These percentages describe bootstrap samples rather than individual wells (the 'Wells' labels in the output are a misnomer). All three regions show a wide spread of outcomes, with some samples producing losses and others profits of several million dollars.
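
The bootstrap loop can also be written with plain NumPy arrays, which makes the pairing of targets and predictions explicit. The sketch below is an assumption-laden alternative, not the notebook's implementation: it uses `np.random.default_rng` instead of `RandomState`, the function name `bootstrap_profits` is hypothetical, and the data are synthetic stand-ins rather than the project datasets.

```python
import numpy as np

def bootstrap_profits(actual, predicted, n_iterations=1000, sample_size=500,
                      n_wells=200, unit_revenue=4500, budget=100_000_000,
                      seed=12345):
    """Simulate profits by resampling wells with replacement."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rng = np.random.default_rng(seed)
    profits = np.empty(n_iterations)
    for i in range(n_iterations):
        # Draw positions with replacement so each actual value stays
        # paired with its own prediction.
        positions = rng.integers(0, len(actual), size=sample_size)
        sample_actual = actual[positions]
        sample_predicted = predicted[positions]
        # Develop the n_wells wells with the highest predicted volume.
        top = np.argsort(sample_predicted)[::-1][:n_wells]
        profits[i] = sample_actual[top].sum() * unit_revenue - budget
    return profits

# Synthetic stand-in data (NOT the project datasets).
data_rng = np.random.default_rng(0)
synthetic_actual = data_rng.uniform(0, 190, size=5000)
synthetic_predicted = synthetic_actual + data_rng.normal(0, 40, size=5000)

synthetic_profits = bootstrap_profits(
    synthetic_actual, synthetic_predicted, n_iterations=200
)
print(synthetic_profits.shape)
```

Fixing the seed inside the function makes repeated calls reproducible, at the cost of every call drawing the same resamples.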

## Profit Analysis

Finally, the 'analyze_profits' function computes and prints summary statistics for a profit distribution. It calculates the mean profit and a 95% confidence interval for that mean, using the Student's t-distribution with the standard error of the mean. This interval describes the uncertainty of the average profit, which is why it is much narrower than the spread of individual bootstrap profits. The function also estimates the risk of loss as the proportion of bootstrap profits that are negative.

In [18]:
# Analyze Profits function
def analyze_profits(profit_distribution):
    # Calculate the 95% confidence interval for the profit distribution using t-distribution
    confidence_interval = st.t.interval(0.95, len(profit_distribution) - 1, profit_distribution.mean(), profit_distribution.sem())

    # Calculate the risk of loss, defined as the proportion of negative profits
    risk_of_loss = (profit_distribution < 0).mean()

    # Print the results
    print("---------")
    print('Average Profit:', profit_distribution.mean())
    print(f"95% Confidence Interval: {confidence_interval[0], confidence_interval[1]}")
    print(f"Risk of Loss: {100 * risk_of_loss:.1f}%")

# Analyze Region 0
print("REGION 0:")
analyze_profits(profit_distribution_0)
print()

# Analyze Region 1
print("REGION 1:")
analyze_profits(profit_distribution_1)
print()

# Analyze Region 2
print("REGION 2:")
analyze_profits(profit_distribution_2)

REGION 0:
---------
Average Profit: 3961649.8480237117
95% Confidence Interval: (3796203.1514797257, 4127096.5445676977)
Risk of Loss: 6.9%

REGION 1:
---------
Average Profit: 4560451.057866608
95% Confidence Interval: (4431472.486639005, 4689429.62909421)
Risk of Loss: 1.5%

REGION 2:
---------
Average Profit: 4044038.665683568
95% Confidence Interval: (3874457.9747128044, 4213619.356654332)
Risk of Loss: 7.6%
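
The t-based interval above describes the uncertainty of the mean profit. If the goal is instead an interval that contains the middle 95% of simulated profits, one option is to read it directly from the 2.5th and 97.5th percentiles of the bootstrap distribution. The sketch below shows that alternative on made-up values; `summarize_profits` is a hypothetical helper, not a function from the notebook.

```python
import numpy as np

def summarize_profits(profits):
    """Mean, percentile-based 95% interval, and risk of loss in percent."""
    profits = np.asarray(profits, dtype=float)
    lower, upper = np.percentile(profits, [2.5, 97.5])
    return {
        "mean": float(profits.mean()),
        "interval_95": (float(lower), float(upper)),
        # Share of simulated profits below zero, expressed as a percentage
        "risk_of_loss_pct": float((profits < 0).mean() * 100),
    }

# Made-up profit values, only to show the mechanics.
toy_profits = [-1.0, 1.0, 2.0, 3.0]
toy_summary = summarize_profits(toy_profits)
print(toy_summary["mean"], toy_summary["risk_of_loss_pct"])  # 1.25 25.0
```

In the notebook this could be applied as `summarize_profits(profit_distribution_0)`; the resulting interval would be wider than the t-interval of the mean.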


## Conclusion

The bootstrap analysis, which follows the project conditions (500 explored points, the best 200 developed), shows distinct profit and risk profiles for the three regions. Region 1 has the highest average profit, approximately 4.56 million USD, a 95% confidence interval for the mean of about 4.43 to 4.69 million USD, and the lowest risk of loss at 1.5%. Region 0 has an average profit of about 3.96 million USD with a 6.9% risk of loss, and Region 2 an average of about 4.04 million USD with a 7.6% risk of loss. Only Region 1 meets the requirement of a risk of losses below 2.5%, so it is the recommended region for development. The larger profits computed earlier (33.2, 24.2, and 27.1 million USD for Regions 0, 1, and 2) come from choosing the top 200 wells out of all 25,000 validation wells and are not comparable with these bootstrap averages.
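
The selection rule from the project conditions (keep regions with a risk of losses below 2.5%, then take the highest average profit) can be expressed in a few lines. The sketch below uses the average profits and risks printed in the Profit Analysis output, rounded to two decimals; the function name `select_region` and the dictionary layout are illustrative assumptions.

```python
MAX_RISK_PCT = 2.5  # risk threshold from the project conditions

# Values copied from the Profit Analysis output (profits rounded to cents).
region_results = {
    "Region 0": {"average_profit": 3961649.85, "risk_of_loss_pct": 6.9},
    "Region 1": {"average_profit": 4560451.06, "risk_of_loss_pct": 1.5},
    "Region 2": {"average_profit": 4044038.67, "risk_of_loss_pct": 7.6},
}

def select_region(results, max_risk_pct):
    """Return the eligible region with the highest average profit, or None."""
    eligible = {
        name: stats for name, stats in results.items()
        if stats["risk_of_loss_pct"] < max_risk_pct
    }
    if not eligible:
        return None
    return max(eligible, key=lambda name: eligible[name]["average_profit"])

chosen = select_region(region_results, MAX_RISK_PCT)
print(chosen)  # Region 1
```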