In [7]:
import pandas as pd

# Project description
You work for the OilyGiant mining company. Your task is to find the best place for a new well.

Steps to choose the location:

Collect the oil well parameters in the selected region: oil quality and volume of reserves;
Build a model for predicting the volume of reserves in the new wells;
Pick the oil wells with the highest estimated values;
Pick the region with the highest total profit for the selected oil wells.
You have data on oil samples from three regions. Parameters of each oil well in the region are already known. Build a model that will help to pick the region with the highest profit margin. Analyze potential profit and risks using the Bootstrapping technique.

Project instructions
Download and prepare the data. Explain the procedure.
Train and test the model for each region:

 2.1. Split the data into a training set and validation set at a ratio of 75:25.

 2.2. Train the model and make predictions for the validation set.

 2.3. Save the predictions and correct answers for the validation set.

 2.4. Print the average volume of predicted reserves and model RMSE.

 2.5. Analyze the results.

Prepare for profit calculation:

 3.1. Store all key values for calculations in separate variables.

 3.2. Calculate the volume of reserves sufficient for developing a new well without losses. Compare the obtained value with the average volume of reserves in each region.

 3.3. Provide the findings about the preparation for profit calculation step.

Write a function to calculate profit from a set of selected oil wells and model predictions:

 4.1. Pick the wells with the highest values of predictions. 

 4.2. Summarize the target volume of reserves in accordance with these predictions

 4.3. Provide findings: suggest a region for oil wells' development and justify the choice. Calculate the profit for the obtained volume of reserves.

Calculate risks and profit for each region:

     5.1. Use the bootstrapping technique with 1000 samples to find the distribution of profit.

     5.2. Find average profit, 95% confidence interval and risk of losses. Loss is negative profit, calculate it as a probability and then express as a percentage.

     5.3. Provide findings: suggest a region for development of oil wells and justify the choice.

## Data Description

Geological exploration data for the three regions are stored in files:

geo_data_0.csv
geo_data_1.csv
geo_data_2.csv

id — unique oil well identifier
f0, f1, f2 — three features of points (their specific meaning is unimportant, but the features themselves are significant)
product — volume of reserves in the oil well (thousand barrels).
Conditions:

Only linear regression is suitable for model training (the rest are not sufficiently predictable).
When exploring a region, 500 points are studied, and the best 200 of them are picked for the profit calculation.
The budget for development of 200 oil wells is 100 USD million.
One barrel of raw materials brings 4.5 USD of revenue. The revenue from one unit of product is 4,500 USD (the volume of reserves is in thousand barrels).
After the risk evaluation, keep only the regions with the risk of losses lower than 2.5%. From the ones that fit the criteria, the region with the highest average profit should be selected.
The data is synthetic: contract details and well characteristics are not disclosed.

In [8]:
geo_0 = pd.read_csv('geo_data_0.csv')
geo_1 = pd.read_csv('geo_data_1.csv')
geo_2 = pd.read_csv('geo_data_2.csv')



# Cleaning Data 

## Geo_0 DataFrame

In [9]:
geo_0

Unnamed: 0,id,f0,f1,f2,product
0,txEyH,0.705745,-0.497823,1.221170,105.280062
1,2acmU,1.334711,-0.340164,4.365080,73.037750
2,409Wp,1.022732,0.151990,1.419926,85.265647
3,iJLyR,-0.032172,0.139033,2.978566,168.620776
4,Xdl7t,1.988431,0.155413,4.751769,154.036647
...,...,...,...,...,...
99995,DLsed,0.971957,0.370953,6.075346,110.744026
99996,QKivN,1.392429,-0.382606,1.273912,122.346843
99997,3rnvd,1.029585,0.018787,-1.348308,64.375443
99998,7kl59,0.998163,-0.528582,1.583869,74.040764


In [10]:
geo_0.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


Analysis: info() shows no missing values in Region 0, and all data types look as expected.

In [11]:
geo_0.duplicated().sum()

0

In [12]:
geo_0['id'].duplicated().sum()

10

In [13]:
duplicated_df_0 = geo_0[geo_0.duplicated('id', keep=False)]
duplicated_df_0

Unnamed: 0,id,f0,f1,f2,product
931,HZww2,0.755284,0.368511,1.863211,30.681774
1364,bxg6G,0.411645,0.85683,-3.65344,73.60426
1949,QcMuo,0.506563,-0.323775,-2.215583,75.496502
3389,A5aEY,-0.039949,0.156872,0.209861,89.249364
7530,HZww2,1.061194,-0.373969,10.43021,158.828695
16633,fiKDv,0.157341,1.028359,5.585586,95.817889
21426,Tdehs,0.829407,0.298807,-0.049563,96.035308
41724,bxg6G,-0.823752,0.546319,3.630479,93.007798
42529,AGS9W,1.454747,-0.479651,0.68338,126.370504
51970,A5aEY,-0.180335,0.935548,-2.094773,33.020205


Analysis on Geo_0: There are 10 duplicated values in the id column. However, the other columns of these rows all differ, so the rows are not exact copies. This could indicate a data entry error or, more likely in my opinion, different data points collected at different times for the same well.

I am going to keep all of these rows: even though the ids are the same, the well reserves (product column) and the features are different, so each row carries its own information.
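
One way to back up this decision is to check, for each repeated id, whether the rows really differ in their feature and target values. The sketch below does this on a tiny made-up frame; `summarize_duplicate_ids` and the toy values are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd


def summarize_duplicate_ids(df, id_col="id"):
    """Return, per repeated id, how many rows share it and how many are distinct."""
    # Keep every row whose id occurs more than once
    repeated = df[df.duplicated(id_col, keep=False)]
    value_cols = [c for c in df.columns if c != id_col]
    rows = []
    for well_id, group in repeated.groupby(id_col):
        rows.append({
            id_col: well_id,
            "rows": len(group),
            # Number of distinct feature/target combinations for this id
            "distinct_rows": len(group[value_cols].drop_duplicates()),
        })
    return pd.DataFrame(rows, columns=[id_col, "rows", "distinct_rows"])


# Toy data (hypothetical values): "A" repeats with different measurements
toy = pd.DataFrame({
    "id": ["A", "B", "A", "C"],
    "f0": [0.1, 0.2, 0.9, 0.4],
    "product": [30.0, 75.0, 158.0, 90.0],
})
summary = summarize_duplicate_ids(toy)
print(summary)
```

If 'distinct_rows' equals 'rows' for every repeated id, none of the repeats is an exact copy, which is the situation described above.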

## Geo_1 DataFrame

In [14]:
geo_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


Analysis: No missing values and data types seem normal again

In [15]:
geo_1.duplicated().sum()

0

In [16]:
geo_1['id'].duplicated().sum()

4

In [17]:
duplicated_df_1 = geo_1[geo_1.duplicated('id', keep=False)]
duplicated_df_1

Unnamed: 0,id,f0,f1,f2,product
1305,LHZR0,11.170835,-1.945066,3.002872,80.859783
2721,bfPNe,-9.494442,-5.463692,4.006042,110.992147
5849,5ltQ6,-3.435401,-12.296043,1.999796,57.085625
41906,LHZR0,-8.989672,-4.286607,2.009139,57.085625
47591,wt4Uk,-9.091098,-8.109279,-0.002314,3.179103
82178,bfPNe,-6.202799,-4.820045,2.995107,84.038886
82873,wt4Uk,10.259972,-9.376355,4.994297,134.766305
84461,5ltQ6,18.213839,2.191999,3.993869,107.813044


Analysis on Geo_1: Same interpretation as I explained in the first dataframe I looked at

## Geo_2 DataFrame

In [18]:
geo_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


Analysis: No missing values and data types seem to be normal

In [19]:
geo_2.duplicated().sum()

0

In [20]:
geo_2['id'].duplicated().sum()

4

In [21]:
duplicated_df_2 = geo_2[geo_2.duplicated('id', keep=False)]
duplicated_df_2

Unnamed: 0,id,f0,f1,f2,product
11449,VF7Jo,2.122656,-0.858275,5.746001,181.716817
28039,xCHr8,1.633027,0.368135,-2.378367,6.120525
43233,xCHr8,-0.847066,2.101796,5.59713,184.388641
44378,Vcm5J,-1.229484,-2.439204,1.222909,137.96829
45404,KUPhW,0.231846,-1.698941,4.990775,11.716299
49564,VF7Jo,-0.883115,0.560537,0.723601,136.23342
55967,KUPhW,1.21115,3.176408,5.54354,132.831802
95090,Vcm5J,2.587702,1.986875,2.482245,92.327572


Analysis: Same interpretation as the previous two dataframes 

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was loaded and inspected

</div>

# Train and Test the Model for Each Region

In [22]:
from sklearn.model_selection import train_test_split

train_0, valid_0 = train_test_split(geo_0, test_size=0.25, random_state=42)
train_1, valid_1 = train_test_split(geo_1, test_size=0.25, random_state=42)
train_2, valid_2 = train_test_split(geo_2, test_size=0.25, random_state=42)

# Splitting data for each of the three regions

In [23]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def train_and_evaluate(train, valid):
    features_train = train.drop(['id', 'product'], axis=1)
    target_train = train['product']
    features_valid = valid.drop(['id', 'product'], axis=1)
    target_valid = valid['product']
    
    # Training the model now
    model = LinearRegression()
    model.fit(features_train, target_train)
    
    # Making predictions and evaluating the model 
    predictions = model.predict(features_valid)
    # Taking the square root manually also works on scikit-learn versions without the 'squared' argument
    rmse = mean_squared_error(target_valid, predictions) ** 0.5
    
    return predictions, rmse 


# Applying the function to each region's dataset 
predictions_0, rmse_0 = train_and_evaluate(train_0, valid_0)
predictions_1, rmse_1 = train_and_evaluate(train_1, valid_1)
predictions_2, rmse_2 = train_and_evaluate(train_2, valid_2)

In [24]:
# printing the average volume of predicted reserves and model rmse

print("Region 0 - Average Predicted Reserves:", predictions_0.mean(), "RMSE:", rmse_0)
print("Region 1 - Average Predicted Reserves:", predictions_1.mean(), "RMSE:", rmse_1)
print("Region 2 - Average Predicted Reserves:", predictions_2.mean(), "RMSE:", rmse_2)


Region 0 - Average Predicted Reserves: 92.3987999065777 RMSE: 37.756600350261685
Region 1 - Average Predicted Reserves: 68.71287803913762 RMSE: 0.890280100102884
Region 2 - Average Predicted Reserves: 94.77102387765939 RMSE: 40.14587231134218


## Analysis of model results: 

Region 0:

    - Average Predicted Reserves: 92.4 thousand barrels
    - RMSE (Root Mean Square Error): 37.76
Explanation: The model for Region 0 has a relatively high RMSE: its predictions typically miss the actual reserves by roughly 37.76 thousand barrels, which is large compared with the average prediction of 92.4 thousand barrels.

Region 1:

    - Average Predicted Reserves: 68.71 thousand barrels
    - RMSE: 0.89
Explanation: This model shows an exceptionally low RMSE, which means its predictions are very close to the actual values, with a typical error of only about 0.89 thousand barrels. It is far more accurate than the models for the other two regions.

Region 2:

    - Average Predicted Reserves: 94.77 thousand barrels
    - RMSE: 40.15
Explanation: Similar to Region 0, the model for Region 2 has a high RMSE, indicating significant prediction errors of roughly 40.15 thousand barrels.

Overall Observations:

Accuracy: The model trained on Region 1 is significantly more accurate than those trained on Regions 0 and 2. This could be due to various factors, such as less variability in the data, stronger correlations between the features and the target, or other region-specific characteristics that make the data more predictable.

Predicted Reserves: The average predicted reserves are highest in Region 2 and lowest in Region 1. Despite its high accuracy, the Region 1 model predicts lower average oil reserves, which will affect the decision on the best region for profitability.

From a purely model-accuracy perspective, Region 1 seems the most reliable for predicting oil reserves. However, the business decision will also depend on the volume of reserves: even though the predictions for Region 1 are more accurate, its predicted reserves are the lowest among the three regions.

Considering RMSE and Business Needs: If minimizing prediction error were more important for the project's financial and operational strategy, Region 1 would stand out as the best option. However, if the focus is on maximizing oil reserves, Regions 0 and 2 should be considered, despite the higher uncertainty in their predictions.
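
To put each RMSE in context, it can help to compare it with a naive baseline that always predicts the mean of the validation target. The sketch below uses only NumPy and made-up numbers; `rmse` and `baseline_rmse` are hypothetical helper names, not functions from the analysis above.

```python
import numpy as np


def rmse(target, predictions):
    """Root mean squared error between two equally sized arrays."""
    target = np.asarray(target, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return float(np.sqrt(np.mean((target - predictions) ** 2)))


def baseline_rmse(target):
    """RMSE of a constant model that always predicts the target mean."""
    target = np.asarray(target, dtype=float)
    return rmse(target, np.full_like(target, target.mean()))


# Made-up validation values, for illustration only
toy_target = np.array([60.0, 80.0, 100.0, 120.0])
toy_predictions = np.array([70.0, 80.0, 100.0, 110.0])

model_error = rmse(toy_target, toy_predictions)
naive_error = baseline_rmse(toy_target)
print(f"model RMSE: {model_error:.2f}, baseline RMSE: {naive_error:.2f}")
```

A model RMSE well below the baseline suggests the features carry real signal; a value close to the baseline suggests the model adds little over predicting the regional average.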




# Prepare for Profit Calculation 

## Storing all key values for calculations in separate variables

Budget for one region: $100 million for 200 wells

Revenue per barrel: $4.5 

Revenue per thousand barrels (since 'product' is in thousand barrels): $4500



In [25]:
BUDGET = 100_000_000
REVENUE_PER_THOUSAND_BARRELS = 4500
NUM_WELLS = 200

## Calculating the volume of reserves sufficient for developing a new well without losses

To calculate the break-even point, we need to determine how many barrels are necessary to cover the investment on one well:

In [26]:
cost_per_well = BUDGET / NUM_WELLS
break_even_reserves = cost_per_well / REVENUE_PER_THOUSAND_BARRELS 
print(f"Cost per well: ${cost_per_well}")
print(f"Break-even reserves per well (in thousand barrels): {break_even_reserves}")

Cost per well: $500000.0
Break-even reserves per well (in thousand barrels): 111.11111111111111
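
Written out, the break-even volume follows directly from the constants defined above (budget of 100 million USD, 200 wells, 4,500 USD per thousand barrels):

```latex
\text{cost per well} = \frac{100{,}000{,}000}{200} = 500{,}000 \text{ USD}
\qquad
\text{break-even volume} = \frac{500{,}000}{4{,}500} \approx 111.11 \text{ thousand barrels}
```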


Analysis: 

Cost Per Well:

    - $500,000 per well: This is the amount available for each of the 200 wells if the budget of $100 million is distributed equally among them. The figure is crucial because it sets the financial baseline for how much each well needs to yield in oil reserves to be viable under the set budget.
    
Break Even Reserves Per Well:

    - 111.11 thousand barrels: This is the volume of oil reserves that each well must yield to break even on the initial investment of $500,000. In simple terms, each well needs at least 111.11 thousand barrels of oil just to cover the cost of drilling and setting up the well, assuming all the oil produced can be sold at the current rate of $4,500 per thousand barrels.
    
Comparisons with Predicted Reserves:

    - From my earlier model predictions, the average predicted reserves for Regions 0, 1 and 2 are 92.4, 68.71 and 94.77 thousand barrels respectively.
    - The average prediction in all three regions falls short of the break-even threshold.
    
Strategic Decisions: These insights show that none of the regions, on average, meets the break-even threshold, which poses a financial challenge. The focus needs to shift towards identifying individual wells within these regions that exceed the break-even point rather than averaging across all wells.

Further Analysis and Risk Management:
 
    - A more detailed analysis using the bootstrapping technique for risk management and profit simulation will be critical. It will allow me to refine these estimates and provide a probabilistic view of potential profits and losses, offering a more nuanced decision-making framework.
    
Moving forward, the project will focus not just on the average outputs but on the distribution of reserves across all wells, identifying the high performers that could justify the overall investment in each region.
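
Since the regional averages fall below the threshold, a natural follow-up is to ask what fraction of individual wells clears it. The sketch below is one way to compute that; `share_above_break_even` is a hypothetical helper and the reserve values are made up.

```python
import numpy as np

BREAK_EVEN_RESERVES = 500_000 / 4_500  # about 111.11 thousand barrels


def share_above_break_even(reserves, threshold=BREAK_EVEN_RESERVES):
    """Fraction of wells whose reserves meet or exceed the break-even volume."""
    reserves = np.asarray(reserves, dtype=float)
    return float((reserves >= threshold).mean())


# Made-up reserves in thousand barrels, for illustration only
toy_reserves = [50.0, 95.0, 112.0, 140.0, 180.0]
share = share_above_break_even(toy_reserves)
print(f"{share:.0%} of the toy wells clear the break-even volume")
```

On the real data, the same helper could be applied to the 'product' column of each region to see how many candidate wells are worth selecting.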
    



## Calculating Profit from Selected Oil Wells

We will need to sort the wells by their predicted reserves and select the top wells. Since the budget allows for 200 wells to be developed, we will select the top 200 wells based on their predicted reserves.

After selecting the top wells, the next step is to calculate the total volume of actual reserves in these wells.

Finally, we will calculate the profit by taking the sum of the actual reserves of the selected wells, multiplying it by the revenue per thousand barrels, and subtracting the total investment cost.

In [32]:
def calculate_profit(data, predictions, top_wells=200):
    # Attaching the predictions to a copy so the caller's DataFrame is left untouched
    data = data.assign(predictions=predictions)
    top_data = data.nlargest(top_wells, 'predictions')
    
    # Calculating the total actual reserves of the wells with the highest predictions
    total_actual_reserves = top_data['product'].sum()
    
    # Calculating profit: total revenue from the top wells minus total cost
    total_revenue = total_actual_reserves * REVENUE_PER_THOUSAND_BARRELS 
    profit = total_revenue - BUDGET
    
    return profit 

# Using the function on each region's dataset
profit_0 = calculate_profit(valid_0, predictions_0)
profit_1 = calculate_profit(valid_1, predictions_1)
profit_2 = calculate_profit(valid_2, predictions_2)

print(f"Profit for Region 0: ${profit_0:.2f}")
print(f"Profit for Region 1: ${profit_1:.2f}")
print(f"Profit for Region 2: ${profit_2:.2f}")

Profit for Region 0: $33591411.14
Profit for Region 1: $24150866.97
Profit for Region 2: $25985717.59


The function description:

- def calculate_profit(data, predictions, top_wells=200):
    - The 'data' parameter takes a DataFrame containing the oil well data
    - The 'predictions' parameter takes the array of predicted reserves (volume) for the wells 
    - 'top_wells=200' is the last parameter, with a default value of 200, representing the number of wells to select based on their predicted reserves 
    
- data = data.assign(predictions=predictions)
    - This line returns a copy of the DataFrame with a new column called 'predictions', which holds the predicted volume of oil reserves for each well. These predictions come from my linear regression model. Because the column is added to a copy, the DataFrame passed to the function is not modified, so no cleanup is needed afterwards and pandas does not raise a SettingWithCopyWarning.
    
- top_data = data.nlargest(top_wells, 'predictions')
   - This line selects the 'top_wells' (200) wells with the highest predicted reserves. It uses the predictions to determine which wells are likely to be the most productive

- total_actual_reserves = top_data['product'].sum()
    - After selecting the top wells based on predictions, this line calculates the sum of the actual oil reserves ('product' column) of these selected wells. These are the amounts of oil actually found in the wells, not the predicted amounts. 

-  total_revenue = total_actual_reserves * REVENUE_PER_THOUSAND_BARRELS 
    profit = total_revenue - BUDGET
    - Here total revenue is found by multiplying the total actual reserves by the revenue per thousand barrels. Then the total cost (the budget allocated for drilling these wells) is subtracted from the total revenue to calculate the profit. 
    
- return profit
    - Finally the function returns the calculated profit 
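
The distinction between ranking by predictions and summing actual reserves is easy to see on a small example. The sketch below re-implements the same idea for two wells out of four; `profit_from_top_wells`, the toy budget, and the toy values are assumptions made purely for illustration.

```python
import pandas as pd

REVENUE_PER_THOUSAND_BARRELS = 4500


def profit_from_top_wells(product, predictions, top_wells, budget):
    """Rank wells by prediction, then compute profit from their actual reserves."""
    wells = pd.DataFrame({"product": list(product), "predictions": list(predictions)})
    top = wells.nlargest(top_wells, "predictions")
    revenue = top["product"].sum() * REVENUE_PER_THOUSAND_BARRELS
    return revenue - budget


# Toy data: the model ranks wells 0 and 1 highest, but well 1 underdelivers
toy_product = [150.0, 90.0, 130.0, 60.0]
toy_predictions = [140.0, 135.0, 120.0, 70.0]
toy_profit = profit_from_top_wells(toy_product, toy_predictions, top_wells=2, budget=1_000_000)
print(toy_profit)
```

In this toy case, selecting by actual reserves would have picked wells 0 and 2 and earned more; the gap between the two is the cost of prediction error.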

Analysis of Profit Results Prior to Bootstrapping:

Profit Results Overview:

Region 0: $33,591,411.14

Region 1: $24,150,866.97

Region 2: $25,985,717.59

Highest Profit: Region 0 shows the highest profit among the three, $33,591,411, suggesting that the wells selected by the model in this region are likely to yield the most substantial returns if developed. The selected wells held large actual reserves, resulting in higher revenue and thus higher profit.

Region 2 follows with a moderate profit, and Region 1 has the lowest profit. The lower profit in Region 1, despite its more accurate model predictions (based on the lower RMSE), shows that the actual volume of reserves in its top wells was lower than in the other regions.

- Comparison to Break-Even: 

    - The break-even analysis conducted earlier showed that each well needs approximately 111.11 thousand barrels to cover the investment of 500,000 dollars per well, given that the revenue per thousand barrels is 4,500 dollars. Here is how each region fares against this break-even point, taking into account the actual reserves of the selected wells:
    
        - Region 0: With a profit of 33,591,411 dollars, this region shows a substantial margin above the break-even point. Given that the total budget for the region was 100 million dollars (covering 200 wells), the profit implies that the selected wells averaged about 148.4 thousand barrels, well above the break-even requirement. On the 100 million dollar investment this is a 33.59 percent ROI (return on investment).
        
        - Region 1: With a profit of 24,150,866 dollars, this region is the lowest among the three. Although profitable, its selected wells averaged about 137.9 thousand barrels, closer to the break-even point than in Region 0. The ROI is around 24 percent.
        
        - Region 2: With a profit of 25,985,717 dollars, this region also shows positive returns. Similar to Region 1, its selected wells averaged about 140.0 thousand barrels, a smaller margin above the break-even point than in Region 0. The profit, although robust, is not as large as in Region 0.
        
- Implications on Decision Making

    - Risk vs. Return: Region 0 provides the highest return according to the model's predictions and the actual well data, though the RMSE for Region 0 was higher, indicating less accurate predictions. This means that the high profit might also carry a higher risk of prediction error.
    
    - Recommendation: If maximizing profit is the primary goal and the company is ready to tolerate higher risk, Region 0 appears promising. However, for a strategy that balances profit potential with prediction accuracy and minimizes risk, a diversified investment across regions may be advisable, possibly capitalizing on the more reliable predictions of Region 1 to mitigate the risk in Region 0.
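
The ROI figures quoted above are simply profit divided by the budget. A minimal sketch of that arithmetic, using the profits printed earlier (the helper name `roi_percent` is an assumption):

```python
BUDGET = 100_000_000


def roi_percent(profit, budget=BUDGET):
    """Return on investment as a percentage of the budget."""
    return profit / budget * 100


# Profits reported above for the top 200 wells in each validation set
reported_profits = {
    "Region 0": 33_591_411.14,
    "Region 1": 24_150_866.97,
    "Region 2": 25_985_717.59,
}
roi_by_region = {name: round(roi_percent(p), 2) for name, p in reported_profits.items()}
print(roi_by_region)
```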
        
  
    


## Calculating Risks and Profit for Each Region

I will use the bootstrapping technique with 1000 samples to find the distribution of profit.

Bootstrapping involves repeatedly sampling from the dataset with replacement to estimate the distribution of a statistic (in this case, profit). In each of 1000 iterations I will draw 500 wells, select the 200 with the highest predictions, and calculate their profit. The resulting distribution is then used to assess risk and variability.

In [38]:
import numpy as np
import pandas as pd

def bootstrap_profit(data, predictions, n_iterations=1000, n_samples=500, top_wells=200):
    np.random.seed(42)  # For reproducibility
    profits = []
    
    data_copy = data.copy()
    data_copy['predictions'] = predictions  # Adding predictions as a column temporarily
    
    for i in range(n_iterations):
        # Correctly sampling from data_copy which includes the 'predictions' column
        sampled = data_copy.sample(n=n_samples, replace=True, random_state=np.random.randint(1, 10000))
        
        # Selecting the top 200 wells based on predictions
        top_sampled = sampled.nlargest(top_wells, 'predictions')
        
        # Calculating the total actual reserves from the top wells
        total_actual_reserves = top_sampled['product'].sum()
        
        # Calculating revenue and profit
        total_revenue = total_actual_reserves * REVENUE_PER_THOUSAND_BARRELS
        profit = total_revenue - BUDGET
        profits.append(profit)
        
    return np.array(profits)

# Applying bootstrapping to each region's dataset
profits_0 = bootstrap_profit(valid_0, predictions_0)
profits_1 = bootstrap_profit(valid_1, predictions_1)
profits_2 = bootstrap_profit(valid_2, predictions_2)


- Bootstrapping

    - Bootstrapping is a technique used in statistics to estimate certain quantities about a population when you only have a sample of that population. It is like trying to estimate the average height of all the students in a school by repeatedly measuring the heights of just a few students over and over again. 
    
    - Imagine you have a bag of colored balls, some red, some blue and some green. You want to know which color is the most common, but instead of counting all the balls in the bag, you do the following:
        - You reach into the bag and pull out a handful of balls several times, each time recording how many of each color you get. 
        - After each handful you put the balls back into the bag and shake it, mixing them all up again.
        - You repeat this process many times, each time noting the colors. 
        - By doing this many times you begin to see a pattern. Maybe most of the time you end up with more blue balls than any other color. The repeated sampling gives you a good idea that blue might be the most common color, even though you haven't counted all the balls in the bag. 
        
    - Bootstrap in my project:
        - I have predictions of how much oil each well may hold and the actual data on how much each one did hold. 
        - Sample wells: instead of using all the data at once, I randomly pick a smaller group of wells and decide which of them to drill based on the predictions. 
        - Calculating profit: for these sampled wells, I calculate how much profit they would make based on the actual amount of oil they hold. 
        - Repeat: I repeat this many times, each time picking a different random group of wells. This gives me many different profit results.
        
    - Why use bootstrapping?
        - Just like in the colored balls example, bootstrapping helps me understand how often my predictions lead to profitable decisions and how much those profits might vary. It is a way to simulate different scenarios and see what may happen without having to drill all the wells to find out. 
    
    - Confidence in Decisions: By seeing the results from many bootstrap samples I can feel more confident about my average expected profit and the risks involved. If most of my bootstrap samples show good profits, I am more confident that developing the wells chosen by the model's predictions is a good idea.
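
The resampling idea itself fits in a few lines. The sketch below bootstraps the mean of a small made-up sample with NumPy; `bootstrap_means` and the sample values are hypothetical and unrelated to the well data.

```python
import numpy as np


def bootstrap_means(sample, n_iterations=1000, seed=0):
    """Resample with replacement and return the mean of each resample."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    means = np.empty(n_iterations)
    for i in range(n_iterations):
        # Draw as many values as the sample has, allowing repeats
        resample = rng.choice(sample, size=sample.size, replace=True)
        means[i] = resample.mean()
    return means


# Made-up measurements, for illustration only
toy_sample = [4.0, 7.0, 9.0, 10.0, 15.0]
means = bootstrap_means(toy_sample)
lower, upper = np.percentile(means, [2.5, 97.5])
print(f"bootstrap mean: {means.mean():.2f}, 95% interval: [{lower:.2f}, {upper:.2f}]")
```

The spread of the resampled means gives a sense of how much the estimate would move if a different sample had been collected.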

Bootstrap Function Analysis:

import numpy as np
import pandas as pd

def bootstrap_profit(data, predictions, n_iterations=1000, n_samples=500, top_wells=200):
    np.random.seed(42)  # For reproducibility
    profits = []
    
    data_copy = data.copy()
    data_copy['predictions'] = predictions  # Add predictions as a column temporarily

- Setting parameters

    - 'data' is the DataFrame that contains the actual data for each well, including the 'product' column. The data is copied inside the function to avoid modifying the original dataset.
    - 'predictions' is an array that contains the predicted oil volume for each well. These predictions are the output of my machine learning model and are used to simulate the decision-making process (deciding which wells are potentially the most profitable to develop).
    - 'n_iterations' specifies the number of bootstrap samples to generate. Each iteration represents a separate simulation of the selection and profit calculation process, allowing for the estimation of the variability and distribution of potential profits. 
        - The loop inside the function runs n_iterations times, each time sampling wells, calculating the profit and recording the result. This parameter controls how extensive the simulation is: the more iterations, the more stable the estimate of the profit distribution will be.
    - 'n_samples' determines the number of wells to sample in each bootstrap iteration. This is not the number of wells the profit is ultimately calculated for, but the initial number of wells considered in each simulated scenario. 
        - In each iteration, n_samples wells are selected from data_copy with replacement. This simulates the randomness in which wells are explored for potential development.
    - 'top_wells' specifies how many of the n_samples sampled wells are selected, based on the highest predictions, for the profit calculation. This reflects a realistic decision-making scenario where only a subset of promising wells (based on predictions) is actually developed.
        - From the sampled wells in each iteration, only the top_wells wells with the highest predictions are used to calculate the profit.
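
As an alternative to seeding NumPy's global state inside the function, a single random state object can drive every draw. The sketch below shows that variant on synthetic data; `bootstrap_profit_rng`, the synthetic wells, and the seeds are assumptions, and because the random stream differs, it would not reproduce the exact figures reported in this notebook.

```python
import numpy as np
import pandas as pd

BUDGET = 100_000_000
REVENUE_PER_THOUSAND_BARRELS = 4500


def bootstrap_profit_rng(data, predictions, n_iterations=1000, n_samples=500,
                         top_wells=200, seed=42):
    """Bootstrap profit using one RandomState for all resampling draws."""
    rng = np.random.RandomState(seed)
    # assign() returns a copy, so the caller's DataFrame is not modified
    wells = data.assign(predictions=np.asarray(predictions))
    profits = np.empty(n_iterations)
    for i in range(n_iterations):
        sampled = wells.sample(n=n_samples, replace=True, random_state=rng)
        top = sampled.nlargest(top_wells, "predictions")
        profits[i] = top["product"].sum() * REVENUE_PER_THOUSAND_BARRELS - BUDGET
    return profits


# Synthetic wells (not the project data): noisy predictions of a random target
demo_rng = np.random.RandomState(0)
demo = pd.DataFrame({"product": demo_rng.uniform(0, 190, size=2000)})
demo_predictions = demo["product"].to_numpy() + demo_rng.normal(0, 40, size=2000)
demo_profits = bootstrap_profit_rng(demo, demo_predictions, n_iterations=100)
print(demo_profits.mean())
```

Passing the same random state object to every 'sample' call keeps the run reproducible without touching the global seed that other code may rely on.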


        

## Subsequent Analysis using the Bootstrapped Profits 

In [41]:
def analyze_results(profits):
    mean_profit = np.mean(profits)
    confidence_interval = np.percentile(profits, [2.5, 97.5])
    risk_of_loss = (profits < 0).mean() * 100 # percent of negative profit instances 
    
    return mean_profit, confidence_interval, risk_of_loss 

analysis_0 = analyze_results(profits_0)
analysis_1 = analyze_results(profits_1)
analysis_2 = analyze_results(profits_2)

print(f"Region 0: Mean Profit = ${analysis_0[0]:.2f}, 95% CI = {analysis_0[1]}, Risk of Loss = {analysis_0[2]}%")
print(f"Region 1: Mean Profit = ${analysis_1[0]:.2f}, 95% CI = {analysis_1[1]}, Risk of Loss = {analysis_1[2]}%")
print(f"Region 2: Mean Profit = ${analysis_2[0]:.2f}, 95% CI = {analysis_2[1]}, Risk of Loss = {analysis_2[2]}%")          
                

Region 0: Mean Profit = $4028465.14, 95% CI = [-1129119.71951966  9203146.12728711], Risk of Loss = 5.1%
Region 1: Mean Profit = $4422033.42, 95% CI = [ 453430.28489968 8611047.25219177], Risk of Loss = 1.2%
Region 2: Mean Profit = $3846202.77, 95% CI = [-1852021.08835022  9099954.80166967], Risk of Loss = 7.6%


Function:

- We defined the function with a single 'profits' parameter
    - mean_profit = np.mean(profits): This calculates the mean profit from the bootstrapped samples. The mean profit gives a central value for the expected profit based on the simulation, providing a straightforward metric to gauge the average outcome.
    - confidence_interval = np.percentile(profits, [2.5, 97.5]): This calculates the 95% confidence interval for the profits. The function np.percentile() is used to find the 2.5th and 97.5th percentiles of the bootstrapped profit data, so 95% of the simulated profits fall between these two values. A narrower interval indicates less uncertainty or risk in the profit estimate. 
    
    - risk_of_loss = (profits < 0).mean() * 100: This calculates the percentage of bootstrapped samples where the profit was less than zero (that is, a loss). It first creates a boolean array where each entry is 'True' if the profit is negative and 'False' otherwise. The .mean() method calculates the fraction of 'True' values (i.e., loss cases), and multiplying by 100 converts this fraction into a percentage. This metric is important for assessing the riskiness of the investment. 
    
Returning Results:

The function returns a tuple containing the calculated mean profit, confidence interval and risk of loss. These values give me a comprehensive statistical summary of the expected financial outcome and its variability. 
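
A tiny array makes the three statistics easy to verify by hand. The sketch below repeats the same calculations on made-up profits; `toy_profits` is hypothetical and not drawn from the bootstrap results.

```python
import numpy as np

# Made-up profits in USD: one loss out of five simulated outcomes
toy_profits = np.array([-1_000_000.0, 500_000.0, 1_000_000.0, 2_000_000.0, 2_500_000.0])

mean_profit = toy_profits.mean()
# Percentile-based interval, computed the same way as in analyze_results
lower, upper = np.percentile(toy_profits, [2.5, 97.5])
# The mean of a boolean mask is the fraction of True entries
risk_of_loss = (toy_profits < 0).mean() * 100

print(mean_profit, lower, upper, risk_of_loss)
```

With one negative value out of five, the risk of loss comes out at 20%, and the interval bounds are linearly interpolated between neighboring sorted values.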



Analysis: 

Mean Profit

Region 0 Mean Profit: $4,028,465.14

- On average, developing oil wells in Region 0 is expected to yield a profit of approximately $4.03 million.
- 95% Confidence Interval: [-1,129,119.72, 9,203,146.13]
    - The wide range suggests substantial variability in profit outcomes. The CI includes negative values, indicating potential losses in some scenarios.
- Risk of Loss: 5.1%
    - There is a 5.1% chance of incurring losses, which exceeds the 2.5% limit set in the project conditions.


Region 1 Mean Profit: $4,422,033.42

- The highest mean profit among the three regions, suggesting that Region 1 is the most lucrative on average.
- 95% Confidence Interval: [453,430.28, 8,611,047.25]
    - The CI does not include negative values, indicating a lower financial risk compared to Region 0. The profit outcomes are generally positive.
- Risk of Loss: 1.2%
    - Very low risk of losses, and the only value below the 2.5% limit, making Region 1 the safest investment among the three in terms of stability and reliability of returns.


Region 2 Mean Profit: $3,846,202.77

- Slightly lower than Region 0, and the lowest mean profit of the three regions.
- 95% Confidence Interval: [-1,852,021.09, 9,099,954.80]
    - Similar to Region 0, the wide range includes potentially significant losses, suggesting high variability and financial risk.
- Risk of Loss: 7.6%
    - The highest risk of loss among the three regions, well above the 2.5% limit.









# Conclusion

Interpretation and Strategic Implications:

    - Region 1 offers the best balance between high profitability and low risk, with the highest mean profit and the lowest risk of loss. It also has the most stable profit outcomes (the narrowest interval, and the only one that does not dip into losses).
    - Regions 0 and 2 show higher variability in profits, with potential for both higher gains and significant losses. Investors in these regions should expect a higher financial risk than in Region 1.

Investment Decision:

    - Region 1 is the only region whose risk of loss (1.2%) is below the 2.5% limit set in the project conditions, and it also has the highest mean profit, so I recommend Region 1 for development.
    - Regions 0 and 2 do not meet the risk requirement (5.1% and 7.6% respectively) and are therefore excluded. Of the two, Region 0 carries the lower risk of loss.


The bootstrapping analysis has provided valuable insights into the financial risks and potential returns from developing oil wells in each region. This data-driven approach supports an informed investment decision, balancing risk and profitability based on statistical evidence from simulated outcomes.
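
The selection rule from the project conditions (keep regions with a risk of loss below 2.5%, then take the highest mean profit) can be written down explicitly. The sketch below applies it to the bootstrap figures reported above; `select_region` and the dictionary layout are illustrative assumptions.

```python
MAX_RISK_PERCENT = 2.5  # threshold from the project conditions

# Mean profit (USD) and risk of loss (%) reported by the bootstrap above
bootstrap_summary = {
    "Region 0": {"mean_profit": 4_028_465.14, "risk_of_loss": 5.1},
    "Region 1": {"mean_profit": 4_422_033.42, "risk_of_loss": 1.2},
    "Region 2": {"mean_profit": 3_846_202.77, "risk_of_loss": 7.6},
}


def select_region(summary, max_risk=MAX_RISK_PERCENT):
    """Keep regions under the risk threshold, then pick the highest mean profit."""
    eligible = {name: stats for name, stats in summary.items()
                if stats["risk_of_loss"] < max_risk}
    if not eligible:
        return None  # no region satisfies the risk requirement
    return max(eligible, key=lambda name: eligible[name]["mean_profit"])


best_region = select_region(bootstrap_summary)
print(best_region)
```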