Hello Matthew!

I’m happy to review your project today.
I will mark your mistakes and give you some hints how it is possible to fix them. We are getting ready for real job, where your team leader/senior colleague will do exactly the same. Don't worry and study with pleasure! 

Below you will find my comments - **please do not move, modify or delete them**.

You can find my comments in green, yellow or red boxes like this:

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Success. Everything is done succesfully.
</div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Remarks. Some recommendations.
</div>

<div class="alert alert-block alert-danger">

<b>Reviewer's comment</b> <a class="tocSkip"></a>

Needs fixing. The block requires some corrections. Work can't be accepted with the red comments.
</div>

You can answer me by using this:

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>

Thank you so much for the feedback, I appreciate it! I should have double-checked before submitting. Thanks!
</div>


### Introduction

In this project, we are working for the OilyGiant mining company to determine the best location for a new oil well. The goal is to use machine learning models to predict the volume of oil reserves in different regions and identify the region with the highest potential profit.

#### Project Overview

We are provided with geological data from three different regions, including features that describe the characteristics of each oil well and the corresponding oil reserves in thousands of barrels. Using this data, we will build a linear regression model to predict the volume of oil reserves for new wells in each region.

#### Key Steps

The project involves the following major steps:

1. **Data Preparation**:
   - Download and prepare the data for analysis and model building.
   
2. **Model Training and Testing**:
   - Split the data into training and validation sets.
   - Train a linear regression model for each region and evaluate it using Root Mean Squared Error (RMSE) on the validation set.

3. **Profit Calculation**:
   - Based on the predictions, calculate the potential profit from each region by selecting the wells with the highest predicted oil reserves.
   - Compare the predicted reserves against the threshold required to make a profit, considering the development budget and revenue per barrel.
   
4. **Write a Function to Calculate Profit from Selected Wells and Model Predictions**


5. **Risk and Profit Analysis**:
   - Use bootstrapping to assess the potential risks and profits for each region.
   - Calculate the average profit, confidence intervals, and the risk of losses (probability of negative profit).
   - Based on the analysis, recommend the region with the highest profit potential and lowest risk for oil well development.

#### Business Context

Given the constraints, including a budget of $100 million for developing 200 oil wells, and the requirement to ensure the risk of losses remains below 2.5%, the final decision on which region to choose will be based on a balance between predicted profit and associated risks.

The project will ultimately guide OilyGiant in making an informed decision about the most profitable region for future oil well development, minimizing risk while maximizing returns.


In [5]:
# Step 1: Data Preparation

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from scipy import stats
import matplotlib.pyplot as plt


# Load data for three regions
data_0 = pd.read_csv('/datasets/geo_data_0.csv')
data_1 = pd.read_csv('/datasets/geo_data_1.csv')
data_2 = pd.read_csv('/datasets/geo_data_2.csv')

# Check column types and missing values in each dataset
print(data_0.info())
print()
print(data_1.info())
print()
print(data_2.info())
print()

# Inspect the first rows of each dataset
print(data_0.head())
print()
print(data_1.head())
print()
print(data_2.head())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column

<div class="alert alert-success">
<b>Reviewer's comment V1</b>

Correct

</div>

In [6]:
# Step 2: Model Training and Testing

# 2.1 Split the data into features (X) and target (y)
def split_data(data):
    X = data.drop(['id', 'product'], axis=1)
    y = data['product']
    return train_test_split(X, y, test_size=0.25, random_state=12345)

X_train_0, X_val_0, y_train_0, y_val_0 = split_data(data_0)
X_train_1, X_val_1, y_train_1, y_val_1 = split_data(data_1)
X_train_2, X_val_2, y_train_2, y_val_2 = split_data(data_2)
    
# 2.2 Train the model and make predictions for the validation set
def train_model(X_train, y_train, X_val):
    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_val)
    return predictions

pred_0 = train_model(X_train_0, y_train_0, X_val_0)
pred_1 = train_model(X_train_1, y_train_1, X_val_1)
pred_2 = train_model(X_train_2, y_train_2, X_val_2)

# 2.3 Save the predictions and correct answers for each region
results_0 = pd.DataFrame({'actual': y_val_0, 'predicted': pred_0})
results_1 = pd.DataFrame({'actual': y_val_1, 'predicted': pred_1})
results_2 = pd.DataFrame({'actual': y_val_2, 'predicted': pred_2})

# 2.4 Print the average volume of predicted reserves and model RMSE
def calculate_rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

rmse_0 = calculate_rmse(y_val_0, pred_0)
rmse_1 = calculate_rmse(y_val_1, pred_1)
rmse_2 = calculate_rmse(y_val_2, pred_2)

print(f'Region 0 RMSE: {rmse_0:.2f}')
print(f'Region 1 RMSE: {rmse_1:.2f}')
print(f'Region 2 RMSE: {rmse_2:.2f}')

# 2.5 Analyze the results

# Average predicted reserves for each region
avg_pred_0 = pred_0.mean()
avg_pred_1 = pred_1.mean()
avg_pred_2 = pred_2.mean()

print(f'Average predicted reserves in Region 0: {avg_pred_0:.2f}')
print(f'Average predicted reserves in Region 1: {avg_pred_1:.2f}')
print(f'Average predicted reserves in Region 2: {avg_pred_2:.2f}')


Region 0 RMSE: 37.58
Region 1 RMSE: 0.89
Region 2 RMSE: 40.03
Average predicted reserves in Region 0: 92.59
Average predicted reserves in Region 1: 68.73
Average predicted reserves in Region 2: 94.97


###### Findings:
RMSE (Root Mean Squared Error) measures how well each linear regression model predicts the volume of oil reserves in its region. It summarizes the typical size of the difference between actual and predicted values, in the same units as the target (thousand barrels). The lower the RMSE, the closer the predictions are to the actual values.

Region 1 has the lowest RMSE, indicating the predictions for this region are more accurate and reliable.

Regions 0 and 2 had higher RMSE values, suggesting less confidence in the prediction accuracy for these regions.

Region 1 has the lowest average predicted reserves and the lowest RMSE, so its per-well predictions are very close to the actual reserves.

Regions 0 and 2 have higher average predicted reserves, but their RMSE values (37.58 and 40.03) are roughly 40% of the average predicted volume, so we cannot be as confident that the predicted reserves of any individual well match the actual reserves.
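
To make RMSE values easier to interpret, it can help to compare them with a naive baseline that always predicts the mean. The sketch below uses small made-up arrays (not the project data) and a hypothetical helper `rmse`; it only illustrates the calculation.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up values for illustration only (thousand barrels)
y_true = np.array([80.0, 100.0, 120.0, 60.0])
y_model = np.array([85.0, 95.0, 110.0, 70.0])
y_baseline = np.full_like(y_true, y_true.mean())  # constant prediction

model_rmse = rmse(y_true, y_model)
baseline_rmse = rmse(y_true, y_baseline)
print(f'Model RMSE: {model_rmse:.2f}, baseline RMSE: {baseline_rmse:.2f}')
```

A model whose RMSE is close to the baseline adds little beyond predicting the average, whereas a model far below the baseline is capturing real structure in the features.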

<div class="alert alert-success">
<b>Reviewer's comment V1</b>

Good job!

</div>

In [7]:
# Step 3: Profit Calculation
    
# 3.1 Store all key values for calculations in separate variables
BUDGET = 1_000_000  # Total budget of 100 million USD, divided by 100 (units of 100 USD)
COST_PER_POINT = 5000  # Development cost per well (budget / 200 wells), in units of 100 USD
PRODUCT_PRICE = 45  # Revenue per product unit (thousand barrels), in units of 100 USD

# 3.2 Calculate points based on budget
POINTS_PER_BUDGET = BUDGET // COST_PER_POINT

# 3.3 Compare the obtained value with the average volume of reserves for each region
print(f'Average reserves in Region 0: {y_val_0.mean():.2f} thousand barrels')
print(f'Average reserves in Region 1: {y_val_1.mean():.2f} thousand barrels')
print(f'Average reserves in Region 2: {y_val_2.mean():.2f} thousand barrels')


Average reserves in Region 0: 92.08 thousand barrels
Average reserves in Region 1: 68.72 thousand barrels
Average reserves in Region 2: 94.88 thousand barrels


###### Findings:
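The minimum reserve volume needed to avoid losses is COST_PER_POINT / PRODUCT_PRICE = 5000 / 45, or about 111.11 thousand barrels per well. The average actual reserves in all three regions (92.08, 68.72 and 94.88 thousand barrels) are below this volume, so developing wells chosen at random would not be profitable in any region. Profitability therefore depends on selecting the wells with the highest predicted reserves.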
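
The break-even volume itself is not printed by the cell above. A minimal sketch of how it could be derived from the same constants follows; the names `break_even_volume` and `avg_reserves` are introduced here for illustration, and the averages are copied from the printed output rather than recomputed.

```python
# Constants copied from the notebook (money values are scaled down by 100)
BUDGET = 1_000_000
COST_PER_POINT = 5000
PRODUCT_PRICE = 45

# Volume per well at which revenue equals the development cost
break_even_volume = COST_PER_POINT / PRODUCT_PRICE

# Average actual reserves on the validation sets, copied from the output above
avg_reserves = {'Region 0': 92.08, 'Region 1': 68.72, 'Region 2': 94.88}

print(f'Break-even volume: {break_even_volume:.2f} thousand barrels per well')
for region, avg in avg_reserves.items():
    shortfall = break_even_volume - avg
    print(f'{region}: average {avg:.2f}, shortfall {shortfall:.2f} thousand barrels')
```

Because the ratio of cost to price is unchanged by the scaling, the break-even volume is the same whether the constants are expressed in USD or in units of 100 USD.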


<div class="alert alert-success">
<b>Reviewer's comment V1</b>

Well done!

</div>

In [10]:
state = np.random.RandomState(42)


<div class="alert alert-success">
<b>Reviewer's comment V1</b>

Your function looks correct

</div>

###### Findings:
The `calculate_profit` function below selects the 200 wells with the highest predicted reserves in a region and calculates the profit from their actual reserves. Applied to the full validation set, Region 0 yields the highest profit of the three regions. This comparison does not yet account for risk, which is assessed with bootstrapping in the next step.

In [12]:
# Step 4: Write a function to calculate profit from selected wells and model predictions

# Function to calculate profit for selected wells
def calculate_profit(target, predictions):
    predictions_sorted = predictions.sort_values(ascending=False)
    selected_points = target[predictions_sorted.index][:POINTS_PER_BUDGET]
    product = selected_points.sum()
    revenue = product * PRODUCT_PRICE
    cost = BUDGET
    return revenue - cost

# Calculate profit for each region
profit_0 = calculate_profit(y_val_0, results_0['predicted'])
profit_1 = calculate_profit(y_val_1, results_1['predicted'])
profit_2 = calculate_profit(y_val_2, results_2['predicted'])

# Profits are in the same scaled units as BUDGET (units of 100 USD), not thousands of USD
print(f'Profit for Region 0: {profit_0:,.2f} (x100 USD)')
print(f'Profit for Region 1: {profit_1:,.2f} (x100 USD)')
print(f'Profit for Region 2: {profit_2:,.2f} (x100 USD)')

# 4.3 Provide findings and suggest a region for development
if profit_0 > profit_1 and profit_0 > profit_2:
    best_region = 'Region 0'
elif profit_1 > profit_0 and profit_1 > profit_2:
    best_region = 'Region 1'
else:
    best_region = 'Region 2'

print(f'The best region for development is {best_region}')


Profit for Region 0: 332,082.60 (x100 USD)
Profit for Region 1: 241,508.67 (x100 USD)
Profit for Region 2: 271,035.00 (x100 USD)
The best region for development is Region 0


In [13]:
# 5.1 Use bootstrapping to find profit distribution

# Bootstrap function
def bootstrap_profit(target, predictions, n_bootstrap=1000):
    SAMPLE_SIZE = 500
    profit_values = []
    
    for _ in range(n_bootstrap):
        target_sample = target.sample(SAMPLE_SIZE, replace=True, random_state=state)
        predictions_sample = predictions[target_sample.index]
        profit_values.append(calculate_profit(target_sample, predictions_sample))
    
    profit_values = pd.Series(profit_values)
    mean_profit = profit_values.mean()
    lower_percentile = profit_values.quantile(0.025)
    upper_percentile = profit_values.quantile(0.975)
    risk_of_loss = (profit_values < 0).mean() * 100

    return mean_profit, (lower_percentile, upper_percentile), risk_of_loss

# Apply bootstrapping for Region 0
avg_profit_0, ci_0, risk_0 = bootstrap_profit(y_val_0, results_0['predicted'])

# Apply bootstrapping for Region 1
avg_profit_1, ci_1, risk_1 = bootstrap_profit(y_val_1, results_1['predicted'])

# Apply bootstrapping for Region 2
avg_profit_2, ci_2, risk_2 = bootstrap_profit(y_val_2, results_2['predicted'])

# 5.2 Find average profit, 95% confidence interval, and risk of losses

# Print results for each region (profits are in units of 100 USD, like BUDGET)
print(f'Region 0: Avg Profit = {avg_profit_0:,.2f} (x100 USD), CI = {ci_0}, Risk of Loss = {risk_0:.2f}%')
print(f'Region 1: Avg Profit = {avg_profit_1:,.2f} (x100 USD), CI = {ci_1}, Risk of Loss = {risk_1:.2f}%')
print(f'Region 2: Avg Profit = {avg_profit_2:,.2f} (x100 USD), CI = {ci_2}, Risk of Loss = {risk_2:.2f}%')

# 5.3 Suggest a region for development based on risk analysis

# Keep only the regions below the risk threshold, then pick the highest average profit.
# This avoids rejecting every region when the most profitable one is too risky.
RISK_THRESHOLD = 2.5
candidates = {
    'Region 0': (avg_profit_0, risk_0),
    'Region 1': (avg_profit_1, risk_1),
    'Region 2': (avg_profit_2, risk_2),
}
eligible = {name: profit for name, (profit, risk) in candidates.items() if risk < RISK_THRESHOLD}
if eligible:
    best_region = max(eligible, key=eligible.get)
else:
    best_region = 'No region meets the risk threshold'

print(f'The best region for development is {best_region}.')


Region 0: Avg Profit = 40,884.03 (x100 USD), CI = (-12175.927995071554, 92761.9110715063), Risk of Loss = 6.60%
Region 1: Avg Profit = 51,325.91 (x100 USD), CI = (5919.49313324081, 95864.33935728906), Risk of Loss = 1.60%
Region 2: Avg Profit = 43,938.09 (x100 USD), CI = (-9278.57406231932, 99211.41604040792), Risk of Loss = 5.30%
The best region for development is Region 1.


<div class="alert alert-info">
<b>Reviewer's comment V2</b>

The only one mistake I see is that according to the task description you need to sample with n=500. But it doesn't solve the problem fully. Indexes and constants look fine. Maybe you have a hidden misprint somewhere or something similar. I think, you can ask your tutor to help you to find the mistake. It seems it's not obvious one.
    
P.S. When your tutor helps you to find the mistake, please, write me what it was. I'm really interesting.

</div>

###### Solution
Thanks to Gerardo for taking the time to help me fix it. He said, "it was kind of simple, but at the same time difficult.
One of the problems seemed to be the seed assignment, plus I had to change the logic of how you got your calculations and redo your calculate_profit function."
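
Index alignment between targets and predictions is a common source of trouble in this kind of bootstrap. The sketch below is not the fix that was applied here; it only illustrates, on made-up values, how label-based selection on a sample drawn with replacement can return extra rows, and shows one possible positional alternative. The helper `profit_positional` and its parameters are hypothetical.

```python
import pandas as pd

# A tiny "bootstrap sample" in which the label 1 was drawn twice
target_sample = pd.Series([10.0, 20.0, 20.0], index=[0, 1, 1])
pred_sample = pd.Series([0.5, 0.9, 0.9], index=[0, 1, 1])

# Label-based lookup: each duplicated label matches every row that carries it,
# so the result has more rows than the sample itself
order = pred_sample.sort_values(ascending=False).index
expanded = target_sample.loc[order]
print(len(target_sample), len(expanded))


def profit_positional(target, predictions, count, price, budget):
    """Pick the top `count` wells by prediction using positions, not labels."""
    target = target.reset_index(drop=True)
    predictions = predictions.reset_index(drop=True)
    top = predictions.sort_values(ascending=False).index[:count]
    return target.loc[top].sum() * price - budget


example_profit = profit_positional(target_sample, pred_sample, count=2, price=45, budget=1000)
print(example_profit)
```

Resetting the index after sampling gives every drawn row a unique position, so each well in the sample is counted exactly once when the top wells are selected.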

<div class="alert alert-danger">
<b>Reviewer's comment V1</b>

Your code looks correct. But the results are incorrect. In the correct results the risk in each region should be a value between 0% and 10%. Usually it happens due to two reasons:
1. You have problems with constants
2. Indexes in targets and predictions are not the same

Of course, the problem can be caused by any other reason but these are two the most frequent one.

</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Thanks to Gerado, everything is correct now:) Good job!

</div>

###### Findings:
After analysing the risk and profit using bootstrapping and a 95% confidence interval, we can conclude that Region 1 is the best region for development. It is the only region whose risk of loss (1.60%) is below the 2.5% threshold, it has the highest average profit, and its 95% confidence interval lies entirely above zero. Regions 0 and 2 have risks of loss of 6.60% and 5.30%, and their confidence intervals include losses.
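
The quantities reported in this step (average profit, percentile-based 95% interval, and share of resamples with negative profit) can be sketched on synthetic numbers as follows. This is an illustration under stated assumptions, not the notebook's code: the data are randomly generated, the seed is arbitrary, and NumPy positions are used instead of pandas labels.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only

# Synthetic stand-ins for validation data (thousand barrels), not the project data
n_wells = 5000
actual = rng.normal(loc=90.0, scale=40.0, size=n_wells).clip(min=0.0)
predicted = actual + rng.normal(loc=0.0, scale=35.0, size=n_wells)

SAMPLE_SIZE, TOP_WELLS = 500, 200
PRICE, BUDGET = 45, 1_000_000  # same scaled units as the notebook constants

profits = np.empty(1000)
for i in range(profits.size):
    # Draw positions with replacement so targets and predictions stay aligned
    idx = rng.integers(0, n_wells, size=SAMPLE_SIZE)
    # Keep the wells with the highest predictions within this resample
    top = idx[np.argsort(predicted[idx])[::-1][:TOP_WELLS]]
    profits[i] = actual[top].sum() * PRICE - BUDGET

mean_profit = profits.mean()
ci_low, ci_high = np.quantile(profits, [0.025, 0.975])
risk_of_loss = (profits < 0).mean() * 100  # percent of resamples with a loss
print(f'mean={mean_profit:,.0f}, CI=({ci_low:,.0f}, {ci_high:,.0f}), risk={risk_of_loss:.1f}%')
```

The interval here is the plain percentile interval of the bootstrap distribution, which matches the `quantile(0.025)` and `quantile(0.975)` calls used in the notebook.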

## Conclusion

In this project, we aimed to help the OilyGiant mining company identify the best location for developing new oil wells by predicting the volume of oil reserves in three regions and calculating potential profits and risks.

### Key Findings:
1. **Model Performance**:
   - We trained linear regression models to predict oil reserves for each of the three regions. The Root Mean Squared Error (RMSE) was used to measure prediction accuracy:
     - **Region 0**: Average predicted reserves ~92.59 thousand barrels, RMSE = 37.58
     - **Region 1**: Average predicted reserves ~68.73 thousand barrels, RMSE = 0.89 (very low error)
     - **Region 2**: Average predicted reserves ~94.97 thousand barrels, RMSE = 40.03
     
2. **Profit Calculation**:
   - The minimum reserve volume required for profitable development was calculated to be **111.11 thousand barrels** per well. Comparing the average actual reserves in each region:
     - **Region 0**: 92.08 thousand barrels, below the break-even point.
     - **Region 1**: 68.72 thousand barrels, furthest below the break-even point, but with minimal risk due to high prediction accuracy.
     - **Region 2**: 94.88 thousand barrels, also below the break-even point, with significant risk due to the high error in predicted reserves.

3. **Bootstrapping and Risk Analysis**:
   - Using bootstrapping with 1000 resamples of 500 wells each, and selecting the 200 wells with the highest predicted reserves in each resample, we estimated the distribution of profits and risks for each region. The amounts below are converted to USD from the notebook's scaled units (1 unit = 100 USD):
     - **Region 0**: 
       - Average profit: **4.09 million USD**
       - 95% confidence interval: **(-1.22 million USD, 9.28 million USD)**
       - Risk of loss: **6.60%**
     - **Region 1**: 
       - Average profit: **5.13 million USD**
       - 95% confidence interval: **(0.59 million USD, 9.59 million USD)**
       - Risk of loss: **1.60%** (lowest risk)
     - **Region 2**: 
       - Average profit: **4.39 million USD**
       - 95% confidence interval: **(-0.93 million USD, 9.92 million USD)**
       - Risk of loss: **5.30%**

4. **Final Recommendation**:
   - Based on the analysis, **Region 1** is identified as the best choice for oil well development due to:
     - **Lowest Risk**: With a **1.60% risk of loss**, it is the only region below the 2.5% threshold and the safest investment option among the three regions.
     - **Highest Average Profit**: Region 1 offers the highest average profit at **5.13 million USD**.
     - **Positive Confidence Interval**: The lower bound of the confidence interval is above 0, meaning the region is likely to yield positive returns even in less favorable scenarios.
  
---

Overall, the combination of high profit potential, low risk, and reliable model performance makes **Region 1** the most promising location for OilyGiant to invest in new oil wells.
