**Review**

Hi, my name is Dmitry and I will be reviewing your project.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  

<b>Student's comment</b>
    
I'm going to redo my project. I think I over did the previous one and wanted to start fresh and simnple. My whole understanding of it has change and I have went back to the material to come at it at a different apporach. 
    

<div class="alert alert-success">
<b>Reviewer's comment V4</b>

Ok, I see!

</div>

# Project Overview: Optimal Oil Well Location Analysis for OilyGiant

## Objective
This project aims to pinpoint the most lucrative oil well location for OilyGiant by leveraging geological data from three distinct regions. Our mission is to deploy a linear regression model that forecasts the potential reserve volumes of each site, guided by key parameters like oil quality and estimated reserves. Through predictive analysis, we'll identify the prime wells for development, focusing on a region that promises the maximum total profit.

## Methodology
We will employ a linear regression approach to estimate the reserve volume for each of the 500 potential oil wells in every region. The analysis will prioritize the top 200 wells based on their projected profitability. Our financial framework considers an investment budget of 100 million for the development of these wells, with an anticipated revenue of $4.5 per barrel of oil (considering reserves are in thousands of barrels, equating to 4,500 per unit of product). A rigorous Bootstrapping technique will be applied to assess potential profits and the risk factors, ensuring the selection process adheres to a strict risk of loss threshold below 2.5%.

## Selection Criteria
Only regions that demonstrate a loss risk below the 2.5% benchmark will advance to the final round of consideration. The deciding factor will be the average profit potential, guiding us to the optimal region for new developments.

## Contents
1. **Data Loading and Inspection**
2. **Data Cleansing**
3. **Feature Creation and Model Evaluation**
4. **Profitability and Risk Assessment via Bootstrapping**
5. **Conclusions and Recommendations**


In [1]:
import pandas as pd 
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

##  Load and Inspect Data

In [2]:
# loading data from the CSV files given
data0 = pd.read_csv("/datasets/geo_data_0.csv")
data1 = pd.read_csv("/datasets/geo_data_1.csv")
data2 = pd.read_csv("/datasets/geo_data_2.csv")

In [3]:
# creating the list of data sets
data_sets = [data0, data1, data2]

In [4]:
# displaying general dataframe information for the project
n =0
for i in data_sets:
    print(f'data{n}:')
    print(data0.info())
    print(' ')
    n = n+1

data0:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None
 
data1:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None
 
data2:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 

In [5]:
# displaying descriptive statistics for the datasets
n =0
for i in data_sets:
    print(f'data{n}:')
    display(data0.describe())
    print(' ')
    n = n+1

data0:


Unnamed: 0,f0,f1,f2,product
count,100000.0,100000.0,100000.0,100000.0
mean,0.500419,0.250143,2.502647,92.5
std,0.871832,0.504433,3.248248,44.288691
min,-1.408605,-0.848218,-12.088328,0.0
25%,-0.07258,-0.200881,0.287748,56.497507
50%,0.50236,0.250252,2.515969,91.849972
75%,1.073581,0.700646,4.715088,128.564089
max,2.362331,1.343769,16.00379,185.364347


 
data1:


Unnamed: 0,f0,f1,f2,product
count,100000.0,100000.0,100000.0,100000.0
mean,0.500419,0.250143,2.502647,92.5
std,0.871832,0.504433,3.248248,44.288691
min,-1.408605,-0.848218,-12.088328,0.0
25%,-0.07258,-0.200881,0.287748,56.497507
50%,0.50236,0.250252,2.515969,91.849972
75%,1.073581,0.700646,4.715088,128.564089
max,2.362331,1.343769,16.00379,185.364347


 
data2:


Unnamed: 0,f0,f1,f2,product
count,100000.0,100000.0,100000.0,100000.0
mean,0.500419,0.250143,2.502647,92.5
std,0.871832,0.504433,3.248248,44.288691
min,-1.408605,-0.848218,-12.088328,0.0
25%,-0.07258,-0.200881,0.287748,56.497507
50%,0.50236,0.250252,2.515969,91.849972
75%,1.073581,0.700646,4.715088,128.564089
max,2.362331,1.343769,16.00379,185.364347


 


## Clean Data

In [6]:
# checking for duplicates in each datasets with loop added to each 
data_l = [data0, data1, data2]
n= 0
for i in data_l:
    print(f'data{n}:')
    print(i['id'].duplicated().sum())
    n = n +1 

data0:
10
data1:
4
data2:
4


In [7]:
# writing function to check for and subsequently drop duplicate values in the data
def check_drop_dupe(data):
    dupes = list(data[data['id'].duplicated()]['id'])
    ind = data[data['id'].isin(dupes) == True].index
    s = data['id'].duplicated().sum()
    if s > 0:
        data = data.drop(ind)
    return data


In [8]:
# running data through duplicate function and dropping it
data0 = check_drop_dupe(data0)
data1 = check_drop_dupe(data1)
data2 = check_drop_dupe(data2)

<div class="alert alert-success">
<b>Reviewer's comment V4</b>

The data was prepared successfully!

</div>

## Creation and Model Testing

In [9]:
# creating feature and target data for id and product
features0 = data0.drop(['id', 'product'], axis = 1)
target0 = data0['product']

features1 = data1.drop(['id', 'product'], axis = 1)
target1 = data1['product']

features2 = data2.drop(['id', 'product'], axis = 1)
target2 = data2['product']

In [10]:
# creating function to split data into training and validation data 
def split_data(features, target):

    features_train, features_valid, target_train, target_valid = train_test_split(features, 
                                                                              target, 
                                                                              test_size = .25,
                                                                              random_state = 12345)
    return features_train, features_valid, target_train, target_valid

In [11]:
# running data through split function
features_train0, features_valid0, target_train0, target_valid0 = split_data(features0, target0)

features_train1, features_valid1, target_train1, target_valid1 = split_data(features1, target1)

features_train2, features_valid2, target_train2, target_valid2 = split_data(features2, target2)

<div class="alert alert-success">
<b>Reviewer's comment V4</b>

The data for each region was split into train and validation

</div>

In [12]:
# creating baseline for data0
baseline0 = pd.Series(target_train0.mean(), index = target_valid0.index)
base_mse = mean_squared_error(target_valid0, baseline0)
print(f'Baseline RMSE: {np.sqrt(base_mse)}')

Baseline RMSE: 44.34727648978901


In [13]:
# intializing model
model = LinearRegression()
# fitting data
model.fit(features_train0, target_train0)
# prediciting data 
predictions_valid0 = model.predict(features_valid0)

# gathering mse score and rmse score 
mse0 = mean_squared_error(target_valid0, predictions_valid0)
rmse0 = np.sqrt(mse0)
print(f'Average Volume of Predicted Reserves: {predictions_valid0.mean()}')
print(f'Model RMSE: {rmse0}')

Average Volume of Predicted Reserves: 92.42384109947359
Model RMSE: 37.716904960382735


In [14]:
baseline1 = pd.Series(target_train1.mean(), index = target_valid1.index)
base_mse = mean_squared_error(target_valid1, baseline1)
print(f'Baseline RMSE: {np.sqrt(base_mse)}')

Baseline RMSE: 45.97003721244109


In [15]:
# intializing model
model = LinearRegression()
# fitting data
model.fit(features_train1, target_train1)
# prediciting data 
predictions_valid1 = model.predict(features_valid1)

# gathering mse score and rmse score 
mse1 = mean_squared_error(target_valid1, predictions_valid1)
rmse1 = np.sqrt(mse1)
print(f'Average Volume of Predicted Reserves: {predictions_valid1.mean()}')
print(f'Model RMSE: {rmse1}')

Average Volume of Predicted Reserves: 68.98311857983123
Model RMSE: 0.8914901390348537


In [16]:
baseline2 = pd.Series(target_train2.mean(), index = target_valid2.index)
base_mse = mean_squared_error(target_valid2, baseline2)
print(f'Baseline RMSE: {np.sqrt(base_mse)}')

Baseline RMSE: 44.57522707537966


In [17]:
# intializing model
model = LinearRegression()
# fitting data
model.fit(features_train2, target_train2)
# prediciting data 
predictions_valid2 = model.predict(features_valid2)

# gathering mse score and rmse score 
mse2 = mean_squared_error(target_valid2, predictions_valid2)
rmse2 = np.sqrt(mse2)
print(f'Average Volume of Predicted Reserves: {predictions_valid2.mean()}')
print(f'Model RMSE: {rmse2}')

Average Volume of Predicted Reserves: 95.11622302076479
Model RMSE: 39.975543264382345


<div class="alert alert-success">
<b>Reviewer's comment V4</b>

The models were trained and evaluated correctly! It's nice that you compared the models to a simple constant baseline

</div>

In [18]:
# finding minimum units for net profit
n_best = 200 
budget = 100000000
barrel = 4.5
unit =  1000 * barrel
minimum_reserve = budget/n_best/unit
print(f'Minimum units for net positive profit when developing a new well: {minimum_reserve:.3f}')

Minimum units for net positive profit when developing a new well: 111.111


<div class="alert alert-success">
<b>Reviewer's comment V4</b>

Calculation is correct!

</div>

In [19]:
# creating profit calculation function
def profit_calculation(target, predictions, count):
    budget = 100000000
    unit = 4.5 * 1000
    top_predictions = predictions.nlargest(count).index
    top_wells = target.loc[top_predictions]
    rev = top_wells.sum() * unit
    profit = rev - budget
    return profit

<div class="alert alert-success">
<b>Reviewer's comment V4</b>

Profit is calculated correctly

</div>

In [20]:
# running each region through profit calculations 
predictions_valid0 = pd.Series(predictions_valid0, index = target_valid0.index)
profit0 = profit_calculation(target_valid0, predictions_valid0, 200)
print(f'Region 0 top 200 wells profit: ${profit0:.2f}')
print()

predictions_valid1 = pd.Series(predictions_valid1, index = target_valid1.index)
profit1 = profit_calculation(target_valid1, predictions_valid1, 200)
print(f'Region 1 top 200 wells profit: ${profit1:.2f}')
print()

predictions_valid2 = pd.Series(predictions_valid2, index = target_valid2.index)
profit2 = profit_calculation(target_valid2, predictions_valid2, 200)
print(f'Region 2 top 200 wells profit: ${profit2:.2f}')
print()

Region 0 top 200 wells profit: $31360260.57

Region 1 top 200 wells profit: $24150866.97

Region 2 top 200 wells profit: $24659457.92



BootStrapping and Final Assessment of the Information we gathered from the training

In [21]:
# creating bootstrapping function
def bootstrap_func(target, predictions, sample_size):
    state = np.random.RandomState(12345)
    profit_values = []

    for i in range(1000):
        target_subsample = target.sample(sample_size, replace = True, random_state = state)
        predict_subsample = predictions[target_subsample.index]
    
        profit_values.append(profit_calculation(target_subsample, predict_subsample, 200))

    
    profit_values = pd.Series(profit_values)
    
    profit_mean = profit_values.mean()
    lower = profit_values.quantile(.025)
    upper = profit_values.quantile(.975)
    risk_of_loss = profit_values[profit_values < 0].count() / len(profit_values)
    
    print(f'''
    Mean Profit: ${profit_mean:.2f},
    95% Confidence interval - 
        Lower limit: {lower:.2f}, 
        Upper limit: {upper:.2f}),
    Risk of Loss: {risk_of_loss:.1%}''')

<div class="alert alert-success">
<b>Reviewer's comment V4</b>

Bootstrapping is done correctly, all required statistics are calculated

</div>

In [22]:
# running region 0 through bootstrap function
print('Region 0:')
bootstrap_func(target_valid0, predictions_valid0, 500)

Region 0:

    Mean Profit: $6329210.98,
    95% Confidence interval - 
        Lower limit: 549998.19, 
        Upper limit: 12849098.55),
    Risk of Loss: 2.0%


In [23]:
# running region 1 through bootstrap function
print('Region 1:')
bootstrap_func(target_valid1, predictions_valid1, 500)

Region 1:

    Mean Profit: $6836507.78,
    95% Confidence interval - 
        Lower limit: 1761041.83, 
        Upper limit: 12000844.86),
    Risk of Loss: 0.7%


In [24]:
# running region 2 through bootstrap function
print('Region 2:')
bootstrap_func(target_valid2, predictions_valid2, 500)

Region 2:

    Mean Profit: $5290364.67,
    95% Confidence interval - 
        Lower limit: -635691.31, 
        Upper limit: 11644733.64),
    Risk of Loss: 4.4%


## Conclusion

Our comprehensive analysis across the three regions under consideration reveals a nuanced picture of potential profitability and risk. Despite the allure of higher profits from the top 200 wells in regions 2 and 3, our findings underscore Region 1 as the most prudent choice for OilyGiant's new well developments. This conclusion is drawn not solely based on the absolute profit figures but, more critically, on the reliability and risk profile associated with Region 1.

### Key Findings:
- **Stability Over Maximum Profit:** Although Region 1's top 200 wells yield lower profits relative to the counterparts in Regions 2 and 3, they present a more stable and reliable source of income.
- **Risk Assessment:** Significantly, Region 1 showcases superior performance in terms of risk management. It demonstrates a notably lower risk of loss, a critical factor for sustainable development in the volatile oil market.
- **Confidence in Investment:** The analysis within a 95% confidence interval reveals that Region 1 offers higher values at the lower quantile, suggesting a more dependable return on investment compared to the other regions.

### Recommendation:
Based on the detailed analysis and the strategic importance of minimizing risk while ensuring a reliable source of revenue, **it is strongly recommended that OilyGiant prioritizes Region 1 for the development of new oil wells.** This region aligns with OilyGiant's objectives of sustaining profitability while adhering to a conservative risk threshold.

In conclusion, the investment in Region 1, despite its relatively lower immediate profit margins, represents a strategic decision to secure a more predictable and stable financial future. This approach underscores the importance of a comprehensive analysis that goes beyond surface-level profit figures and delves into the nuances of risk and return on investment.


<div class="alert alert-success">
<b>Reviewer's comment V4</b>

Conclusions make sense! Region choice is correct and justified

</div>

<b>Student's comment</b>
    
I wanted to condense the informations, Please let me know if this is what you were looking for or do i need to go back to my original work. 
    
</div>

<div class="alert alert-success">
<b>Reviewer's comment V4</b>

No, this looks great! Well done! :)
    
The project is now accepted. Keep up the good work on the next sprint!

</div>