**Review**

Hi, my name is Dmitry and I will be reviewing your project.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did an excellent job! The project is accepted. Keep up the good work on the next sprint!

# Best Place for a New Well for OilyGiant

## Introduction

In this project, we are preparing a machine learning model for the OilyGiant mining company. The model will help the company determine the best location for a new oil well. The decision will be based on the analysis of oil samples from three different regions. Each region has different oil well parameters, including oil quality and volume of reserves.

The project involves several steps:

- **Data Preparation**: We start by collecting the oil well parameters in the selected regions. This includes downloading and preparing the data.

- **Model Training and Testing**: We then build a model for predicting the volume of reserves in the new wells. The data is split into a training set and a validation set, and a model is trained for each region. Each model's performance is evaluated on its predictions for the validation set.

- **Profit Calculation Preparation**: Next, we prepare for profit calculation. We store all key values for calculations in separate variables and calculate the volume of reserves sufficient for developing a new well without losses.

- **Profit Calculation**: We write a function to calculate profit from selected oil wells and model predictions. The wells with the highest predicted values are selected, and the profit for the obtained volume of reserves is calculated.

- **Risk and Profit Calculation**: Finally, we calculate risks and profit for each region using the Bootstrapping technique. We find the average profit, 95% confidence interval, and risk of losses. Based on these results, we suggest a region for the development of oil wells and justify the choice.

The goal of this project is to use data science techniques to make informed business decisions. By the end of this project, we will have a recommendation for the best region to develop new oil wells based on the potential profit and risk of losses. Let’s get started!


### Prepare the Data

In [1]:
# Import libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [2]:
# Load the data
data_0 = pd.read_csv('https://practicum-content.s3.us-west-1.amazonaws.com/datasets/geo_data_0.csv')
data_1 = pd.read_csv('https://practicum-content.s3.us-west-1.amazonaws.com/datasets/geo_data_1.csv')
data_2 = pd.read_csv('https://practicum-content.s3.us-west-1.amazonaws.com/datasets/geo_data_2.csv')

In [3]:
# Explore the data
data_0.info()
display(data_0.head())
data_1.info()
display(data_1.head())
data_2.info()
display(data_2.head())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


|   | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | txEyH | 0.705745 | -0.497823 | 1.22117 | 105.280062 |
| 1 | 2acmU | 1.334711 | -0.340164 | 4.36508 | 73.03775 |
| 2 | 409Wp | 1.022732 | 0.15199 | 1.419926 | 85.265647 |
| 3 | iJLyR | -0.032172 | 0.139033 | 2.978566 | 168.620776 |
| 4 | Xdl7t | 1.988431 | 0.155413 | 4.751769 | 154.036647 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


|   | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | kBEdx | -15.001348 | -8.276 | -0.005876 | 3.179103 |
| 1 | 62mP7 | 14.272088 | -3.475083 | 0.999183 | 26.953261 |
| 2 | vyE1P | 6.263187 | -5.948386 | 5.00116 | 134.766305 |
| 3 | KcrkZ | -13.081196 | -11.506057 | 4.999415 | 137.945408 |
| 4 | AHL4O | 12.702195 | -8.147433 | 5.004363 | 134.766305 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


|   | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | fwXo0 | -1.146987 | 0.963328 | -0.828965 | 27.758673 |
| 1 | WJtFt | 0.262778 | 0.269839 | -2.530187 | 56.069697 |
| 2 | ovLUW | 0.194587 | 0.289035 | -5.586433 | 62.87191 |
| 3 | q6cA6 | 2.23606 | -0.55376 | 0.930038 | 114.572842 |
| 4 | WPMUX | -0.515993 | 1.716266 | 5.899011 | 149.600746 |


In [4]:
print(data_0.duplicated().sum())
print(data_1.duplicated().sum())
print(data_2.duplicated().sum())

0
0
0


Since there are no fully duplicated rows, every column has 100,000 non-null values, and the data types are appropriate, we can conclude that the data is clean and ready for the next step.
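The checks above can be gathered into one small helper. This is only a sketch: `quality_report` and the toy frame are hypothetical names introduced for illustration, and the extra per-column check on `id` is an assumption about what might be worth inspecting, since `DataFrame.duplicated()` without arguments only flags rows that repeat in every column.

```python
import pandas as pd


def quality_report(df, id_col="id"):
    """Summarize basic data-quality counts for one region's DataFrame.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    return {
        # Total number of missing cells across all columns
        "missing_values": int(df.isna().sum().sum()),
        # Rows that are identical to an earlier row in every column
        "duplicate_rows": int(df.duplicated().sum()),
        # Identifiers that repeat even if the other columns differ
        "duplicate_ids": int(df[id_col].duplicated().sum()),
    }


# Toy data (made up) with one missing value and one repeated id
toy = pd.DataFrame({
    "id": ["a1", "b2", "a1", "c3"],
    "f0": [0.1, 0.2, 0.3, None],
    "product": [10.0, 20.0, 30.0, 40.0],
})
report = quality_report(toy)
print(report)
```

On the real data, the same helper could be called once per region, for example `quality_report(data_0)`.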

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was loaded and inspected

</div>

In [5]:
# Define variables
datasets = [data_0, data_1, data_2]

# Define constants
BUDGET = 100_000_000  # Budget for developing 200 oil wells: 100 million USD
REVENUE_PER_UNIT = 4_500  # Revenue in USD per unit of product (4.5 USD per barrel, so 4,500 implies a unit of 1,000 barrels)
WELLS = 200  # The number of wells to choose

# Calculate the total break-even volume
total_volume = BUDGET / REVENUE_PER_UNIT

# Calculate the average break-even volume per well
average_volume_per_well = total_volume / WELLS

print("Average volume of product required per well to break even: ", average_volume_per_well)

Average volume of product required per well to break even:  111.11111111111111
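Written out as a worked equation, the break-even volume per well follows directly from the three constants defined above. The unit is whatever `product` is measured in; with 4.5 USD per barrel and 4,500 USD per unit, that would be thousands of barrels, which is an inference from the constants rather than something the data states.

$$
V_{\text{break-even}} = \frac{\text{BUDGET}}{\text{REVENUE\_PER\_UNIT} \times \text{WELLS}} = \frac{100{,}000{,}000}{4{,}500 \times 200} \approx 111.11
$$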


<div class="alert alert-warning">
<b>Reviewer's comment</b>

Would be nice to calculate the average volume of product required to break even here

</div>

In [6]:
# Define a function to calculate profit
def calculate_profit(target, predictions, count):
    probs_sorted = predictions.sort_values(ascending=False)
    selected = target[probs_sorted.index][:count]
    return REVENUE_PER_UNIT * selected.sum() - BUDGET
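To make the selection logic concrete, here is a small usage sketch on made-up numbers. The function body mirrors the one above, but `revenue_per_unit` and `budget` are added as keyword parameters purely for illustration so that a toy budget can be used; the four wells and their values are hypothetical.

```python
import pandas as pd

REVENUE_PER_UNIT = 4_500
BUDGET = 100_000_000


def calculate_profit(target, predictions, count,
                     revenue_per_unit=REVENUE_PER_UNIT, budget=BUDGET):
    # Rank wells by predicted reserves, highest first
    ranked = predictions.sort_values(ascending=False)
    # Take the actual reserves of the top `count` wells
    selected = target[ranked.index][:count]
    return revenue_per_unit * selected.sum() - budget


# Four hypothetical wells: actual reserves and model predictions
target = pd.Series([100.0, 50.0, 150.0, 120.0])
predictions = pd.Series([110.0, 40.0, 90.0, 130.0])

# The two highest predictions belong to wells 3 and 0,
# whose actual reserves are 120 and 100 units
toy_profit = calculate_profit(target, predictions, count=2, budget=900_000)
print(toy_profit)  # 4500 * 220 - 900000 = 90000.0
```

Note that well 2 has the largest actual reserves but is not selected, because the choice is driven by predictions while the profit is computed from actual values.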

<div class="alert alert-success">
<b>Reviewer's comment</b>

The function for profit calculation is correct!

</div>

### Process the datasets and perform calculations

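All the code is in a single loop over the three regions. For each region it trains and evaluates a model, calculates the profit from the 200 wells with the highest predictions, and runs a nested bootstrap loop to estimate the profit distribution.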

In [7]:
# Process each dataset
for i, data in enumerate(datasets):
    # Drop the 'id' column: it is a string identifier, not a numeric feature the model can use
    data = data.drop(['id'], axis=1)

    # Split the data into a training set and validation set
    train, valid = train_test_split(data, test_size=0.25, random_state=12345)

    # Reset the indices of the validation set
    valid = valid.reset_index(drop=True)

    # Train the model and make predictions for the validation set
    model = LinearRegression()
    model.fit(train.drop(['product'], axis=1), train['product'])
    predictions = model.predict(valid.drop(['product'], axis=1))

    # Convert predictions to a pandas Series
    predictions = pd.Series(predictions, index=valid.index)

    # Calculate the RMSE and the average volume of predicted reserves
    rmse = np.sqrt(mean_squared_error(valid['product'], predictions))
    average_volume = predictions.mean()

    print("Region: ", i)
    print("Average volume of predicted reserves: ", average_volume)
    print("RMSE: ", rmse)

    # Calculate the profit for the region
    profit = calculate_profit(valid['product'], predictions, WELLS)

    # Print the profit for the region
    print(f"Profit for region {i}: {profit}")

    # Bootstrapping
    state = np.random.RandomState(12345)
    values = []
    for _ in range(1000):  # use '_' so the region index `i` is not overwritten
        target_subsample = valid['product'].sample(n=500, replace=True, random_state=state)
        probs_subsample = predictions[target_subsample.index]
        values.append(calculate_profit(target_subsample, probs_subsample, WELLS))

    values = pd.Series(values)
    lower = values.quantile(0.025)
    mean = values.mean()
    upper = values.quantile(0.975)
    risk_of_loss = (values < 0).mean()

    print("Average profit: ", mean)
    print("2.5% quantile: ", lower)
    print("97.5% quantile: ", upper)
    print("Risk of loss: ", risk_of_loss)
    print('-------------------------')

Region:  0
Average volume of predicted reserves:  92.59256778438035
RMSE:  37.5794217150813
Profit for region 0: 33208260.43139851
Average profit:  4259385.269105923
2.5% quantile:  -1020900.9483793724
97.5% quantile:  9479763.533583675
Risk of loss:  0.06
-------------------------
Region:  1
Average volume of predicted reserves:  68.728546895446
RMSE:  0.8930992867756167
Profit for region 1: 24150866.966815114
Average profit:  5152227.734432898
2.5% quantile:  688732.2537050088
97.5% quantile:  9315475.912570495
Risk of loss:  0.01
-------------------------
Region:  2
Average volume of predicted reserves:  94.96504596800489
RMSE:  40.02970873393434
Profit for region 2: 27103499.635998324
Average profit:  4350083.627827557
2.5% quantile:  -1288805.473297878
97.5% quantile:  9697069.541802654
Risk of loss:  0.064
-------------------------
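For reference, the train-and-validate step inside the loop above can be isolated into a small helper. This is a sketch under stated assumptions, not the notebook's code: `fit_and_score` is a hypothetical name, and the synthetic frame only imitates the `f0`, `f1`, `f2`, `product` layout so the example runs without downloading the data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def fit_and_score(data, target_col="product", seed=12345):
    """Split 75/25, fit a linear regression, and return predictions, targets, RMSE."""
    features = data.drop(columns=[target_col])
    target = data[target_col]
    x_train, x_valid, y_train, y_valid = train_test_split(
        features, target, test_size=0.25, random_state=seed
    )
    model = LinearRegression().fit(x_train, y_train)
    # Keep the validation index so predictions stay aligned with the targets
    predictions = pd.Series(model.predict(x_valid), index=y_valid.index)
    rmse = np.sqrt(mean_squared_error(y_valid, predictions))
    return predictions, y_valid, rmse


# Synthetic data (made up): a linear signal plus small noise
rng = np.random.RandomState(0)
toy = pd.DataFrame(rng.normal(size=(400, 3)), columns=["f0", "f1", "f2"])
toy["product"] = 2.0 * toy["f0"] - 1.0 * toy["f2"] + rng.normal(scale=0.1, size=400)

toy_predictions, toy_target, toy_rmse = fit_and_score(toy)
print(len(toy_predictions), toy_rmse)
```

With the real data, the call would look like `fit_and_score(data_0.drop(columns=["id"]))`, once per region.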


<div class="alert alert-success">
<b>Reviewer's comment</b>

Great, the data for each region was split into train and validation subsets, the models were trained and evaluated correctly. Bootstrapping of profit distribution is done successfully.

</div>

From this output, we can see that:
- Region 0 has a relatively high average volume of predicted reserves and the highest profit from the top 200 wells of the validation set, approximately 33.2 million USD. However, its bootstrapped risk of loss is 6%, which is above the acceptable threshold of 2.5%.

- Region 1 has the lowest average volume of predicted reserves, but it also has by far the lowest RMSE (about 0.89), indicating that the model's predictions are quite accurate. The profit for this region is approximately 24.2 million USD, its bootstrapped average profit is the highest of the three (about 5.15 million USD), and its risk of loss is 1%.

- Region 2 has the highest average volume of predicted reserves and the second-highest profit, approximately 27.1 million USD. However, its risk of loss is the highest of the three regions at 6.4%.

Based on these results, we suggest developing oil wells in region 1. Its risk of loss of 1% is the lowest of the three and the only one below the 2.5% threshold, and its 95% confidence interval for profit lies entirely above zero.
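The risk-of-loss figures above come from bootstrapping: repeatedly drawing 500 wells with replacement, picking the 200 with the highest predictions, and recording the profit. Below is a minimal sketch of that procedure. It is an illustrative variant, not the notebook's exact code: `bootstrap_profit` is a hypothetical name, the data is synthetic, and wells are sampled by position with NumPy. Positional sampling sidesteps a pandas subtlety, namely that label-based lookup on an index with repeated labels can return more rows than were drawn.

```python
import numpy as np


def bootstrap_profit(target, predictions, count, revenue_per_unit, budget,
                     n_rounds=1000, sample_size=500, seed=12345):
    """Estimate the profit distribution by resampling wells with replacement."""
    target = np.asarray(target, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    rng = np.random.RandomState(seed)
    profits = np.empty(n_rounds)
    for r in range(n_rounds):
        # Draw well positions with replacement
        pos = rng.randint(0, len(target), size=sample_size)
        # Within the sample, pick the `count` wells with the highest predictions
        top = np.argsort(predictions[pos])[::-1][:count]
        # Profit is based on the actual reserves of the selected wells
        profits[r] = revenue_per_unit * target[pos][top].sum() - budget
    return {
        "mean": profits.mean(),
        "lower": np.quantile(profits, 0.025),   # lower bound of the 95% interval
        "upper": np.quantile(profits, 0.975),   # upper bound of the 95% interval
        "risk_of_loss": (profits < 0).mean(),   # share of rounds with negative profit
    }


# Synthetic wells (made up): true reserves and noisy predictions
rng = np.random.RandomState(0)
toy_target = rng.uniform(0, 200, size=2000)
toy_predictions = toy_target + rng.normal(0, 30, size=2000)

summary = bootstrap_profit(toy_target, toy_predictions, count=200,
                           revenue_per_unit=4_500, budget=100_000_000)
print(summary)
```

Because the sampling scheme differs from the notebook's, running this on the real validation data would not necessarily reproduce the figures reported above.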

<div class="alert alert-success">
<b>Reviewer's comment</b>

Conclusions make sense. Region choice is correct and justified!

</div>

# Conclusion

In this project, we successfully built a model for the OilyGiant mining company to determine the best region for drilling new oil wells. We used data from three regions, each containing information about oil wells and the volume of reserves.

We trained a Linear Regression model for each region and made predictions on the volume of reserves. The models’ performances varied across regions, with Region 1 having the most accurate predictions.

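The profit for each region was calculated by selecting the 200 wells with the highest predicted volumes and summing the revenue from their actual reserves. The profits also varied across regions, with Region 0 yielding the highest profit on the validation set (about 33.2 million USD).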

However, considering the risk of loss, which was estimated using the bootstrapping technique, Region 1 is the best choice for drilling new wells. Despite having the lowest average volume of predicted reserves, it had the highest average bootstrapped profit (about 5.15 million USD) and, most importantly, the lowest risk of loss (1%), making it the only one of the three regions within the acceptable 2.5% threshold.


<div class="alert alert-success">
<b>Reviewer's comment</b>

Very good!

</div>