# OilyGiant Mining Projections:

- We are tasked with finding the best region to drill new oil wells.

- Our criteria for making a decision are as follows:

    - Collect the oil well parameters in the selected region: oil quality and volume of reserves.
    - Build a model for predicting the volume of reserves in the new wells.
    - Pick the oil wells with the highest estimated values.
    - Pick the region with the highest total profit for the selected oil wells.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from numpy.random import RandomState
import numpy as np


## Data Cleaning:

In [3]:
geo_0 = pd.read_csv('geo_data_0.csv')
geo_1 = pd.read_csv('geo_data_1.csv')
geo_2 = pd.read_csv('geo_data_2.csv')

#geo_0.info()
#geo_1.info()
#geo_2.info()

display(geo_0.head())
print(geo_0.isnull().sum())
display(geo_1.head())
print(geo_1.isnull().sum())
display(geo_2.head())
print(geo_2.isnull().sum())


      id         f0         f1        f2     product
0  txEyH   0.705745  -0.497823   1.22117  105.280062
1  2acmU   1.334711  -0.340164   4.36508    73.03775
2  409Wp   1.022732    0.15199  1.419926   85.265647
3  iJLyR  -0.032172   0.139033  2.978566  168.620776
4  Xdl7t   1.988431   0.155413  4.751769  154.036647


id         0
f0         0
f1         0
f2         0
product    0
dtype: int64


      id          f0          f1         f2     product
0  kBEdx  -15.001348      -8.276  -0.005876    3.179103
1  62mP7   14.272088   -3.475083   0.999183   26.953261
2  vyE1P    6.263187   -5.948386    5.00116  134.766305
3  KcrkZ  -13.081196  -11.506057   4.999415  137.945408
4  AHL4O   12.702195   -8.147433   5.004363  134.766305


id         0
f0         0
f1         0
f2         0
product    0
dtype: int64


      id         f0        f1         f2     product
0  fwXo0  -1.146987  0.963328  -0.828965   27.758673
1  WJtFt   0.262778  0.269839  -2.530187   56.069697
2  ovLUW   0.194587  0.289035  -5.586433    62.87191
3  q6cA6    2.23606  -0.55376   0.930038  114.572842
4  WPMUX  -0.515993  1.716266   5.899011  149.600746


id         0
f0         0
f1         0
f2         0
product    0
dtype: int64


In [4]:
geo_0 = geo_0.drop('id', axis=1)
geo_1 = geo_1.drop('id', axis=1)
geo_2 = geo_2.drop('id', axis=1)
#display(geo_0.head())

- The data is complete: none of the three data sets has missing values.
- We have removed the 'id' column from all data sets, as it is not needed in our calculations.
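The cleaning steps above could be wrapped in one helper so that each region is handled the same way. This is only a sketch: the function name `clean_region` and the toy frame are made up for illustration, and the duplicate-id count is an extra check that the notebook itself does not perform.

```python
import pandas as pd


def clean_region(df: pd.DataFrame) -> pd.DataFrame:
    """Report missing values and duplicate ids, then drop the id column."""
    missing = int(df.isnull().sum().sum())
    duplicate_ids = int(df['id'].duplicated().sum())
    print(f"missing values: {missing}, duplicate ids: {duplicate_ids}")
    return df.drop(columns='id')


# Tiny made-up frame standing in for one region file (not real data).
toy = pd.DataFrame({
    'id': ['a1', 'b2', 'b2'],
    'f0': [0.1, 0.2, 0.3],
    'f1': [1.0, 1.1, 1.2],
    'f2': [2.0, 2.1, 2.2],
    'product': [10.0, 20.0, 30.0],
})
toy_clean = clean_region(toy)
```

With the real files, the same call would be applied to `geo_0`, `geo_1` and `geo_2` in turn.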

## Model Training Per Region:

- Each region's data will be used to train a LinearRegression model.

- The model will bring value by providing the following:

    - Predictions we will use in further calculations.
    - Average Volume
    - The model's RMSE
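The steps listed above can be sketched as one reusable function rather than three near-identical cells. The name `train_and_evaluate` and the synthetic `demo` data are assumptions for illustration only; the 25% validation split and the seed 12345 are the values used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def train_and_evaluate(df, target='product', seed=12345):
    """Fit a linear model and return validation predictions, targets and RMSE."""
    features = df.drop(columns=target)
    y = df[target]
    X_train, X_valid, y_train, y_valid = train_test_split(
        features, y, test_size=0.25, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    # Keep the validation index so predictions line up with the true volumes.
    predictions = pd.Series(model.predict(X_valid), index=y_valid.index)
    rmse = mean_squared_error(y_valid, predictions) ** 0.5
    return predictions, y_valid, rmse


# Synthetic stand-in data (not the real region files).
rng = np.random.RandomState(0)
demo = pd.DataFrame(rng.normal(size=(400, 3)), columns=['f0', 'f1', 'f2'])
demo['product'] = 50 + 10 * demo['f2'] + rng.normal(scale=1.0, size=400)

demo_predictions, demo_target, demo_rmse = train_and_evaluate(demo)
print('Average volume predicted:', demo_predictions.mean())
print('RMSE:', demo_rmse)
```

Returning the validation targets alongside the predictions keeps the true volumes available for later profit calculations.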

In [5]:
geo_0_features = geo_0.drop('product', axis=1)
geo_0_target = geo_0['product']

geo_0_features_train, geo_0_features_valid, geo_0_target_train, geo_0_target_valid = train_test_split(geo_0_features,geo_0_target, test_size=0.25, random_state=12345)

model = LinearRegression()
model.fit(geo_0_features_train, geo_0_target_train)
geo_0_predictions = model.predict(geo_0_features_valid)
geo_0_mse = mean_squared_error(geo_0_target_valid, geo_0_predictions)
geo_0_rmse = geo_0_mse ** 0.5
print('Average volume predicted:', geo_0_predictions.mean())
print("RMSE for geo_0:", geo_0_rmse)

Average volume predicted: 92.59256778438038
RMSE for geo_0: 37.5794217150813




In [6]:
geo_1_features = geo_1.drop('product', axis=1)
geo_1_target = geo_1['product']

geo_1_features_train, geo_1_features_valid, geo_1_target_train, geo_1_target_valid = train_test_split(geo_1_features,geo_1_target, test_size=0.25, random_state=12345)

model = LinearRegression()
model.fit(geo_1_features_train, geo_1_target_train)
geo_1_predictions = model.predict(geo_1_features_valid)
geo_1_mse = mean_squared_error(geo_1_target_valid, geo_1_predictions)
geo_1_rmse = geo_1_mse ** 0.5
print('Average volume predicted:', geo_1_predictions.mean())
print("RMSE for geo_1:", geo_1_rmse)

Average volume predicted: 68.72854689544602
RMSE for geo_1: 0.8930992867756163




In [7]:
geo_2_features = geo_2.drop('product', axis=1)
geo_2_target = geo_2['product']

geo_2_features_train, geo_2_features_valid, geo_2_target_train, geo_2_target_valid = train_test_split(geo_2_features,geo_2_target, test_size=0.25, random_state=12345)

model = LinearRegression()
model.fit(geo_2_features_train, geo_2_target_train)
geo_2_predictions = model.predict(geo_2_features_valid)
geo_2_mse = mean_squared_error(geo_2_target_valid, geo_2_predictions)
geo_2_rmse = geo_2_mse ** 0.5
print('Average volume predicted:', geo_2_predictions.mean())
print("RMSE for geo_2:", geo_2_rmse)

Average volume predicted: 94.96504596800489
RMSE for geo_2: 40.02970873393434




- Now that we have trained a model for each region, we can conduct some early analysis:

    - Region 0 and Region 2 show strong average predicted volumes of 92.59 and 94.97, respectively.
        - However, the RMSE for these regions (37.58 and 40.03) is far higher than Region 1's, so the model is much less accurate for them.

    - Region 1 has an average predicted volume of 68.73, which is noticeably lower than the other regions.
        - As mentioned above, Region 1 has an RMSE of 0.89, which suggests a much higher level of accuracy from our model.

- We need to conduct more analysis before we can say which region is best, but the findings in this section show that Region 1's predictions have a high level of accuracy.
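One way to put the three RMSE values on a common scale is to express each as a fraction of that region's average predicted volume. The sketch below uses the figures printed above; the dictionary layout and the name `relative_rmse` are assumptions for illustration.

```python
# Figures copied from the training output above.
region_stats = {
    'Region 0': {'mean_prediction': 92.59256778438038, 'rmse': 37.5794217150813},
    'Region 1': {'mean_prediction': 68.72854689544602, 'rmse': 0.8930992867756163},
    'Region 2': {'mean_prediction': 94.96504596800489, 'rmse': 40.02970873393434},
}

# RMSE as a fraction of the mean predicted volume for each region.
relative_rmse = {
    name: stats['rmse'] / stats['mean_prediction']
    for name, stats in region_stats.items()
}

for name, ratio in relative_rmse.items():
    print(f"{name}: RMSE is {ratio:.1%} of the mean prediction")
```

On these numbers the typical error is about 40% of the mean prediction in Regions 0 and 2, and a little over 1% in Region 1.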

## Break Even: Volume

In [8]:
budget = 100000000
revenue_per_unit = 4500
wells_to_develop = 200
total_points_studied = 500


per_well_cost = budget / wells_to_develop
break_even_volume = per_well_cost / revenue_per_unit
print("Break-even volume per well:", break_even_volume)

Break-even volume per well: 111.11111111111111


- To continue our analysis, we need to know what makes drilling a well worthwhile.
- Here we have calculated the break-even volume: each well must produce at least 111.11 units, on average, for us to recover our investment.
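As a quick check, the break-even volume can be compared with the average predicted volume in each region. The sketch below reuses the budget figures and the averages printed earlier; the names `average_predicted` and `shortfall` are assumptions for illustration.

```python
budget = 100_000_000
revenue_per_unit = 4_500
wells_to_develop = 200

# Cost of one well divided by the revenue from one unit of volume.
break_even_volume = budget / wells_to_develop / revenue_per_unit

# Average predicted volumes copied from the training output above.
average_predicted = {
    'Region 0': 92.59256778438038,
    'Region 1': 68.72854689544602,
    'Region 2': 94.96504596800489,
}

# Positive values mean the regional average falls short of break-even.
shortfall = {
    name: break_even_volume - avg for name, avg in average_predicted.items()
}

for name, gap in shortfall.items():
    print(f"{name}: average is {gap:.2f} units below break-even")
```

Every regional average is below the break-even volume, which is why selecting only the best wells matters.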

## Profit Per Region:

- Here we have created a profit function that will calculate the profit of each region based on the top 200 predictions from our model.
    - We are using 200 because our goal is to develop 200 wells.
    
- Our function will also tell us the Total Volume from those 200 predictions.

In [10]:
def calculate_profit(predictions, budget, revenue_per_unit, wells_to_develop):
    # Step 4.1: Pick top wells
    top_predictions = sorted(predictions, reverse=True)[:wells_to_develop]
    
    # Step 4.2: Sum the volumes
    total_volume = sum(top_predictions) # sum of top predictions
    
    # Step 4.3: Calculate profit
    total_revenue = total_volume * revenue_per_unit # volume × revenue_per_unit
    profit = total_revenue - budget # revenue - budget
    
    return profit, total_volume


geo_0_sample = pd.Series(geo_0_predictions).sample(n=total_points_studied, random_state=12345)
geo_1_sample = pd.Series(geo_1_predictions).sample(n=total_points_studied, random_state=12345)
geo_2_sample = pd.Series(geo_2_predictions).sample(n=total_points_studied, random_state=12345)

geo_0_profit, geo_0_volume = calculate_profit(geo_0_sample, budget, revenue_per_unit, wells_to_develop)
print("Geo 0 Profit:", geo_0_profit)
print("Geo 0 Total Volume from top wells:", geo_0_volume)

geo_1_profit, geo_1_volume = calculate_profit(geo_1_sample, budget, revenue_per_unit, wells_to_develop)    
print("Geo 1 Profit:", geo_1_profit)
print("Geo 1 Total Volume from top wells:", geo_1_volume)

geo_2_profit, geo_2_volume = calculate_profit(geo_2_sample, budget, revenue_per_unit, wells_to_develop)
print("Geo 2 Profit:", geo_2_profit)
print("Geo 2 Total Volume from top wells:", geo_2_volume)

Geo 0 Profit: 5053468.9503057
Geo 0 Total Volume from top wells: 23345.215322290154
Geo 1 Profit: 7792191.964127213
Geo 1 Total Volume from top wells: 23953.820436472713
Geo 2 Profit: 3401996.777634293
Geo 2 Total Volume from top wells: 22978.221506140955


- In this sample, Region 1 produces the highest profit (about $7.79 million), ahead of Region 0 (about $5.05 million) and Region 2 (about $3.40 million).
    - This is a single random sample of 500 points per region, so the ranking could change with a different sample.


- The regions are very close in total volume (roughly 23,000 to 24,000 units), so this result alone does not settle which region to choose for our 200 new wells.
    - Even though we are essentially cherry-picking the top 200 wells, this analysis does give us an early indication of which region could produce a better profit.
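The profit function above sums the predicted volumes of the selected wells. A common variant, shown here only as a hedged alternative and not as the method used in this notebook, ranks wells by prediction but sums their actual volumes. The name `profit_from_actuals` and the toy numbers are made up, and the sketch assumes predictions and targets share a unique index.

```python
import pandas as pd


def profit_from_actuals(predictions, target, budget, revenue_per_unit,
                        wells_to_develop):
    """Pick wells by predicted volume, then value them at their true volume."""
    # Index labels of the wells with the highest predictions.
    top_index = predictions.sort_values(ascending=False).index[:wells_to_develop]
    # Sum the true volumes of those same wells.
    total_volume = target.loc[top_index].sum()
    profit = total_volume * revenue_per_unit - budget
    return profit, total_volume


# Toy numbers for illustration only.
toy_predictions = pd.Series([10.0, 30.0, 20.0, 40.0])
toy_target = pd.Series([12.0, 25.0, 22.0, 35.0])

toy_profit, toy_volume = profit_from_actuals(
    toy_predictions, toy_target,
    budget=100.0, revenue_per_unit=4.0, wells_to_develop=2,
)
print(toy_profit, toy_volume)
```

The gap between the two approaches grows with the model's error, so it would likely matter more for Regions 0 and 2 than for Region 1.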

## Bootstrapping for Confidence

- Our final analysis will be conducted using the technique of bootstrapping.

- We will use 1000 samples to find the distribution of profit.
    - This will allow us to find the average profit, 95% confidence interval, and risk of losses.
    - We will reject any region with a risk of loss greater than 2.5%.

In [None]:
geo_0_predictions = pd.Series(geo_0_predictions)
geo_1_predictions = pd.Series(geo_1_predictions)
geo_2_predictions = pd.Series(geo_2_predictions)
geo = [geo_0_predictions, geo_1_predictions, geo_2_predictions]
region_names = ['Region 0', 'Region 1', 'Region 2']
results = []
# Use a separate name for the list so the loop variable does not overwrite it.
for region_name, geo_predictions in zip(region_names, geo):
    profits = []
    state = np.random.RandomState(12345)
    
    for i in range(1000):
        predictions = geo_predictions.sample(n=500, replace=True, random_state=state)
        
        best_200_wells = predictions.sort_values(ascending=False)[:200]

        profit, _ = calculate_profit(best_200_wells, budget, revenue_per_unit, wells_to_develop)
        profits.append(profit)
    profits = pd.Series(profits)

    average_profit = profits.mean()
    lower_bound = profits.quantile(0.025)
    upper_bound = profits.quantile(0.975)
    risk_percentage = (profits < 0).mean() * 100

    print(f'===[{region_name}]===')
    print(f"Average profit: {average_profit:,.2f}")
    print(f"95% confidence interval for profit: [{lower_bound:,.2f}, {upper_bound:,.2f}]")
    print(f"Risk of loss: {risk_percentage:.2f}%")


    results.append({
        'region': region_name,
        'average_profit': average_profit,
        'lower_bound': lower_bound,
        'upper_bound': upper_bound,
        'risk_percentage': risk_percentage
    })
results_df = pd.DataFrame(results)
print(results_df)




===[Region 0]===
Average profit: 3,580,260.54
95% confidence interval for profit: [1,401,373.36, 5,988,772.71]
Risk of loss: 0.10%
===[Region 1]===
Average profit: 4,538,116.10
95% confidence interval for profit: [328,830.20, 8,545,616.85]
Risk of loss: 1.60%
===[Region 2]===
Average profit: 2,814,661.98
95% confidence interval for profit: [920,983.02, 4,785,270.32]
Risk of loss: 0.20%
     region  average_profit   lower_bound   upper_bound  risk_percentage
0  Region 0    3.580261e+06  1.401373e+06  5.988773e+06              0.1
1  Region 1    4.538116e+06  3.288302e+05  8.545617e+06              1.6
2  Region 2    2.814662e+06  9.209830e+05  4.785270e+06              0.2


- Our bootstrapping technique has given us some insightful results!


- All regions show that risk of loss is less than 2.5%.


- Region 1 shows the highest average profit: $4,538,116.
    - Region 1 also has the greatest risk of loss at 1.6%.
        - This is consistent with its wide 95% confidence interval, which runs from about $0.33 million to $8.55 million.

- Regions 0 and 2 had a substantially lower risk of loss, but their average profit was also lower than Region 1's.

- Combining our confidence in Region 1's RMSE with its average profit from bootstrapping and a risk of loss below the 2.5% threshold, we can confidently suggest that Region 1 be used for drilling the 200 new wells.
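The selection rule used in this conclusion can be written out explicitly: drop any region whose risk of loss exceeds 2.5%, then take the highest average profit among those that remain. The sketch below rebuilds a small table from the bootstrap output above; the names `max_risk`, `eligible` and `best` are assumptions for illustration.

```python
import pandas as pd

# Figures copied from the bootstrap output above.
results_df = pd.DataFrame({
    'region': ['Region 0', 'Region 1', 'Region 2'],
    'average_profit': [3580260.54, 4538116.10, 2814661.98],
    'risk_percentage': [0.10, 1.60, 0.20],
})

max_risk = 2.5  # reject regions with a risk of loss greater than this (percent)

# Keep only regions within the risk threshold.
eligible = results_df[results_df['risk_percentage'] <= max_risk]

# Among those, pick the row with the highest average profit.
best = eligible.loc[eligible['average_profit'].idxmax()]
print(best['region'])
```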