----------
## Review

Hi, my name is Daria! I'm reviewing your project. 

You can find my comments under the heading «Review». 
I’m using __<font color='green'>green</font>__ color if everything is done perfectly. Recommendations and remarks are highlighted in __<font color='blue'>blue</font>__. 
If the topic requires some extra work, the color will be  __<font color='red'>red</font>__. 

You did outstanding work! The project is accepted, good luck in your future learning :)


---------


# Background Information
You work for the OilyGiant mining company. Your task is to find the best place for a new well. Steps to choose the location:
- Collect the oil well parameters in the selected region: oil quality and volume of reserves;
- Build a model for predicting the volume of reserves in the new wells;
- Pick the oil wells with the highest estimated values;
- Pick the region with the highest total profit for the selected oil wells.

You have data on oil samples from three regions. Parameters of each oil well in the region are already known. Build a model that will help to pick the region with the highest profit margin. Analyze potential profit and risks using the Bootstrap technique.

# Step 1. Data Preparation
Download and prepare the data. Explain the procedure.

In [1]:
# set up libraries

import os
import io
import itertools
import operator
import warnings

import numpy as np
import pandas as pd
from math import sqrt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, cross_validate, RepeatedStratifiedKFold, train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.utils import resample, shuffle
import matplotlib.pyplot as plt
import seaborn as sns

warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)

In [2]:
# Import data and examine shape

geo1 = pd.read_csv('/datasets/geo_data_0.csv')
geo2 = pd.read_csv('/datasets/geo_data_1.csv')
geo3 = pd.read_csv('/datasets/geo_data_2.csv')

display(geo1.shape)
display(geo2.shape)
display(geo3.shape)

(100000, 5)

(100000, 5)

(100000, 5)

The three datasets have identical shapes: 100,000 rows and 5 columns for each region.

In [3]:
# Having a look at the data initially

display(geo1.head())
display(geo2.head())
display(geo3.head())

| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | txEyH | 0.705745 | -0.497823 | 1.22117 | 105.280062 |
| 1 | 2acmU | 1.334711 | -0.340164 | 4.36508 | 73.03775 |
| 2 | 409Wp | 1.022732 | 0.15199 | 1.419926 | 85.265647 |
| 3 | iJLyR | -0.032172 | 0.139033 | 2.978566 | 168.620776 |
| 4 | Xdl7t | 1.988431 | 0.155413 | 4.751769 | 154.036647 |

| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | kBEdx | -15.001348 | -8.276 | -0.005876 | 3.179103 |
| 1 | 62mP7 | 14.272088 | -3.475083 | 0.999183 | 26.953261 |
| 2 | vyE1P | 6.263187 | -5.948386 | 5.00116 | 134.766305 |
| 3 | KcrkZ | -13.081196 | -11.506057 | 4.999415 | 137.945408 |
| 4 | AHL4O | 12.702195 | -8.147433 | 5.004363 | 134.766305 |

| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | fwXo0 | -1.146987 | 0.963328 | -0.828965 | 27.758673 |
| 1 | WJtFt | 0.262778 | 0.269839 | -2.530187 | 56.069697 |
| 2 | ovLUW | 0.194587 | 0.289035 | -5.586433 | 62.87191 |
| 3 | q6cA6 | 2.23606 | -0.55376 | 0.930038 | 114.572842 |
| 4 | WPMUX | -0.515993 | 1.716266 | 5.899011 | 149.600746 |


In [4]:
# Removing unnecessary columns

geo1 = geo1.drop('id', axis=1)
geo2 = geo2.drop('id', axis=1)
geo3 = geo3.drop('id', axis=1)

In [5]:
# Function for a first glance of the data

def summary(df): 
    eda_df = {}
    eda_df['null_sum'] = df.isnull().sum()
    eda_df['null_perc'] = df.isnull().mean()
    eda_df['dtypes'] = df.dtypes
    eda_df['count'] = df.count()
    eda_df['mean'] = df.mean()
    eda_df['median'] = df.median()
    eda_df['min'] = df.min()
    eda_df['max'] = df.max()
    return pd.DataFrame(eda_df)

In [6]:
# Numerical summary of the three datasets

display(summary(geo1))
display(summary(geo2))
display(summary(geo3))

| | null_sum | null_perc | dtypes | count | mean | median | min | max |
|---|---|---|---|---|---|---|---|---|
| f0 | 0 | 0.0 | float64 | 100000 | 0.500419 | 0.50236 | -1.408605 | 2.362331 |
| f1 | 0 | 0.0 | float64 | 100000 | 0.250143 | 0.250252 | -0.848218 | 1.343769 |
| f2 | 0 | 0.0 | float64 | 100000 | 2.502647 | 2.515969 | -12.088328 | 16.00379 |
| product | 0 | 0.0 | float64 | 100000 | 92.5 | 91.849972 | 0.0 | 185.364347 |

| | null_sum | null_perc | dtypes | count | mean | median | min | max |
|---|---|---|---|---|---|---|---|---|
| f0 | 0 | 0.0 | float64 | 100000 | 1.141296 | 1.153055 | -31.609576 | 29.421755 |
| f1 | 0 | 0.0 | float64 | 100000 | -4.796579 | -4.813172 | -26.358598 | 18.734063 |
| f2 | 0 | 0.0 | float64 | 100000 | 2.494541 | 2.011479 | -0.018144 | 5.019721 |
| product | 0 | 0.0 | float64 | 100000 | 68.825 | 57.085625 | 0.0 | 137.945408 |

| | null_sum | null_perc | dtypes | count | mean | median | min | max |
|---|---|---|---|---|---|---|---|---|
| f0 | 0 | 0.0 | float64 | 100000 | 0.002023 | 0.009424 | -8.760004 | 7.238262 |
| f1 | 0 | 0.0 | float64 | 100000 | -0.002081 | -0.009482 | -7.08402 | 7.844801 |
| f2 | 0 | 0.0 | float64 | 100000 | 2.495128 | 2.484236 | -11.970335 | 16.739402 |
| product | 0 | 0.0 | float64 | 100000 | 95.0 | 94.925613 | 0.0 | 190.029838 |


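All variables in the datasets are numeric, so there are no categorical features to encode, and there are no null values. The feature ranges are not identical (for example, `f0` spans roughly -31.6 to 29.4 in the second dataset but only -1.4 to 2.4 in the first), but they are close enough that no rescaling is applied before fitting a linear regression.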
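Feature scale matters little for the plain linear regression used in the next step: ordinary least squares predictions do not change when the features are standardized. A minimal sketch on synthetic data illustrates this (the array shape, coefficients, and seed are illustrative assumptions, not the project data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for f0, f1, f2 with deliberately different scales (assumption).
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3)) * np.array([1.0, 10.0, 0.1])
y = X @ np.array([3.0, -2.0, 25.0]) + rng.normal(scale=5.0, size=1000)

# Fit once on raw features and once on standardized features.
raw_pred = LinearRegression().fit(X, y).predict(X)
X_scaled = StandardScaler().fit_transform(X)
scaled_pred = LinearRegression().fit(X_scaled, y).predict(X_scaled)

# The two sets of predictions agree up to floating-point error.
max_diff = np.max(np.abs(raw_pred - scaled_pred))
print(f'largest difference between predictions: {max_diff:.2e}')
```

Scaling would matter for regularized models such as Ridge or Lasso, which are not used in this project.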

----------
<font color='green'>

## Review

A very detailed data analysis :)   
    
</font>

---------


# Step 2. Baseline Performance Assessment
Train and test model for each region:
1. Split the data into a training set and validation set at a ratio of 75:25.
2. Train the model and make predictions for the validation set.
3. Save the predictions and correct answers for the validation set.
4. Print the average volume of predicted reserves and model RMSE.
5. Analyze the results.

In [7]:
# train-test split / 75-25

train1, val1 = train_test_split(geo1, test_size=0.25, random_state=123)
train2, val2 = train_test_split(geo2, test_size=0.25, random_state=123)
train3, val3 = train_test_split(geo3, test_size=0.25, random_state=123)

x_train1, x_val1 = train1.drop('product', axis=1), val1.drop('product', axis=1)
x_train2, x_val2 = train2.drop('product', axis=1), val2.drop('product', axis=1)
x_train3, x_val3 = train3.drop('product', axis=1), val3.drop('product', axis=1)

y_train1, y_val1 = train1['product'], val1['product']
y_train2, y_val2 = train2['product'], val2['product']
y_train3, y_val3 = train3['product'], val3['product']

In [8]:
# train three models separately for each region then make predictions using the val set

m1 = LinearRegression()
m2 = LinearRegression()
m3 = LinearRegression()

m1.fit(x_train1, y_train1)
m2.fit(x_train2, y_train2)
m3.fit(x_train3, y_train3)

pred1 = m1.predict(x_val1)
pred2 = m2.predict(x_val2)
pred3 = m3.predict(x_val3)

In [9]:
# print the average volume of predicted reserves and model RMSE

print(f'average of true volume1: {y_val1.mean()} ---> average of predicted volume1: {pred1.mean()}') 
print(f'average of true volume2: {y_val2.mean()} ---> average of predicted volume2: {pred2.mean()}') 
print(f'average of true volume3: {y_val3.mean()} ---> average of predicted volume3: {pred3.mean()}','\n') 

print('model1 RMSE: %.2f' % (sqrt(mean_squared_error(y_val1, pred1))))
print('model2 RMSE: %.2f' % (sqrt(mean_squared_error(y_val2, pred2))))
print('model3 RMSE: %.2f' % (sqrt(mean_squared_error(y_val3, pred3))))

average of true volume1: 92.85062391123445 ---> average of predicted volume1: 92.54936189116309
average of true volume2: 69.27371236077902 ---> average of predicted volume2: 69.28001860653976
average of true volume3: 94.87348818660215 ---> average of predicted volume3: 95.09859933591373 

model1 RMSE: 37.65
model2 RMSE: 0.90
model3 RMSE: 40.13


### Conclusion
RMSE is the square root of the mean squared difference between predicted and observed values; when the errors average to zero, it equals their standard deviation. It is expressed in the same units as the target, but it is somewhat harder to interpret than MAE, which is simply the average absolute error. One distinct advantage of RMSE is that it penalizes large errors more heavily than MAE does.
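A small worked example may make the difference between the two metrics concrete. The sketch below uses toy error vectors (an assumption, unrelated to the project data) with the same total absolute error, spread evenly in one case and concentrated in a single well in the other:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Square root of the mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

y_true = np.zeros(4)
even_errors = np.array([1.0, 1.0, 1.0, 1.0])    # four small misses
one_big_error = np.array([0.0, 0.0, 0.0, 4.0])  # a single large miss

print(mae(y_true, even_errors), rmse(y_true, even_errors))      # 1.0 1.0
print(mae(y_true, one_big_error), rmse(y_true, one_big_error))  # 1.0 2.0
```

MAE is 1.0 in both cases, while RMSE doubles when the error is concentrated in one prediction.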

Model2 clearly has the lowest RMSE (0.90, against 37.65 and 40.13), which makes it by far the most accurate of the three. However, our goal in this part is not to choose the most accurate model but to find the region with the highest predicted reserve volume. On that criterion region 3, with an average predicted volume of about 95.1, is a potential choice. Of course, we still need to analyze cost, risk, and profit before making a decision.
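The split, fit, and score steps are identical for every region, so they could be wrapped in a helper and run in a loop. The following is a sketch only: `fit_region` is a hypothetical name, and the region tables are synthetic stand-ins rather than the real datasets.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def fit_region(df, target='product', seed=123):
    """Split 75:25, fit a linear regression, return validation targets, predictions and RMSE."""
    x = df.drop(columns=target)
    y = df[target]
    x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.25, random_state=seed)
    model = LinearRegression().fit(x_train, y_train)
    # Keep the validation index on the predictions so both stay aligned.
    pred = pd.Series(model.predict(x_val), index=y_val.index)
    rmse = np.sqrt(mean_squared_error(y_val, pred))
    return y_val, pred, rmse

# Synthetic stand-ins for the three region tables (assumption, not the real data).
rng = np.random.RandomState(0)
regions = {}
for name in ('region1', 'region2', 'region3'):
    features = pd.DataFrame(rng.normal(size=(400, 3)), columns=['f0', 'f1', 'f2'])
    features['product'] = 90 + 20 * features['f2'] + rng.normal(scale=10, size=400)
    regions[name] = features

results = {name: fit_region(df) for name, df in regions.items()}
for name, (y_val, pred, rmse) in results.items():
    print(f'{name}: mean predicted volume {pred.mean():.2f}, RMSE {rmse:.2f}')
```

Returning predictions as a Series that shares the validation index avoids having to realign targets and predictions by position later on.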

----------
<font color='green'>

## Review

Good job on training and testing models :) Interesting result for region2, you could try to find the reason for such a low RMSE by computing correlation coefficients between features and target.
    
</font><font color='blue'> But it would be better if you created a function or a loop instead of using the same code for different regions. It would make your code more readable and scalable :)
    
</font>

---------
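One way to follow up on the correlation suggestion is `DataFrame.corr`. The sketch below runs on a synthetic table in which the target is, by construction, almost a linear function of `f2`; that construction is an assumption made for illustration and says nothing about the real region2 data.

```python
import numpy as np
import pandas as pd

# Synthetic table in which the target is almost a linear function of f2 (assumption).
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=['f0', 'f1', 'f2'])
df['product'] = 27.0 * df['f2'] + rng.normal(scale=0.5, size=1000)

# Pearson correlation of every feature with the target.
corr_with_target = df.corr()['product'].drop('product')
print(corr_with_target.round(3))
```

A correlation close to 1 between a single feature and the target would be one possible explanation for a very low RMSE from a linear model.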

# Step 3. Prepare for profit calculation
1. Store all key values for calculations in separate variables.
2. Calculate the volume of reserves sufficient for developing a new well without losses. Compare the obtained value with the average volume of reserves in each region.
3. Provide the findings about the preparation for profit calculation step.

In [10]:
# Store key values then calculate the volume of reserves needed for a new well

budget = 100000000
number_wells = 200
cost_per_well = 500000
revenue_per_unit = 4.5*1000
expected_vol_per_reserve = cost_per_well / revenue_per_unit
expected_vol_per_reserve

111.11111111111111

### Conclusion
- A well is developed without losses when the revenue from it at least covers the cost of building it. The cost of one well is therefore our benchmark: dividing it by the revenue generated by one unit of product gives the minimum number of units the well must hold. That number is the break-even volume of reserves.
- The break-even volume is 111.11 units per well. This is higher than the average predicted volume in every region (about 92.5, 69.3, and 95.1), so wells cannot be picked at random; this motivates selecting the top wells and using the bootstrap for a more reliable choice of the best region.
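The comparison with the regional averages can be made explicit. This sketch reuses the key values from the cell above and the average predicted volumes printed in Step 2, rounded to two decimals; the dictionary `avg_predicted` is introduced here for illustration only.

```python
# Key values from the cell above.
cost_per_well = 500_000
revenue_per_unit = 4.5 * 1000
break_even_volume = cost_per_well / revenue_per_unit

# Average predicted volumes on the validation sets, rounded from the Step 2 output.
avg_predicted = {'region1': 92.55, 'region2': 69.28, 'region3': 95.10}

for region, volume in avg_predicted.items():
    shortfall = break_even_volume - volume
    print(f'{region}: average {volume:.2f}, short of break-even by {shortfall:.2f}')
```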

----------
<font color='green'>

## Review

Your calculation is correct, as well as the conclusion :)
    
</font>

---------

# Step 4. Profit Functions
1. Pick the wells with the highest values of predictions. The number of wells depends on the budget and cost of developing one oil well.
2. Summarize the target volume of reserves in accordance with these predictions
3. Provide findings: suggest a region for oil wells' development and justify the choice. Calculate the profit for the obtained volume of reserves.

In [13]:
def profit(target, predictions, count, revenue_per_unit, cost_per_well):
    
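    # pick the wells with the highest predicted values
    pred_sorted = pd.Series(predictions).sort_values(ascending=False)
    # NOTE: this resets the index of the Series passed in by the caller (in place),
    # so that positions in `target` line up with positions in `predictions`
    target.reset_index(drop=True, inplace=True)
    selected_wells = target[pred_sorted.index][:count]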
    
    # summarize the target volume in accordance
    total_region_vol = selected_wells.sum()
    
    revenue = total_region_vol * revenue_per_unit
    cost = count * cost_per_well
    
    return revenue-cost

In [14]:
# initial findings

profit_region1 = profit(y_val1, pred1, 200, revenue_per_unit, cost_per_well)
profit_region2 = profit(y_val2, pred2, 200, revenue_per_unit, cost_per_well)
profit_region3 = profit(y_val3, pred3, 200, revenue_per_unit, cost_per_well)

print(profit_region1, profit_region2, profit_region3)

35346709.17261383 24150866.966815114 23703438.630213737


### Conclusion
- Of the three regions, region1 yields the highest profit from its top 200 wells (about 35.3 million, against 24.2 million for region2 and 23.7 million for region3), so it is the suggested location for the new wells at this stage.
- However, this is a single estimate that may differ considerably from what happens in practice. To better understand the potential of each region, we need to compare the profit distributions of the three regions.
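The selection logic can be checked by hand on a toy example. The sketch below is a side-effect-free variant of the profit calculation: `profit_no_side_effects` is a hypothetical name, the four-well data is made up, and the function works on positional copies instead of resetting the caller's index.

```python
import numpy as np
import pandas as pd

def profit_no_side_effects(target, predictions, count, revenue_per_unit, cost_per_well):
    """Profit of the `count` wells with the highest predictions; inputs are left untouched."""
    target = pd.Series(target).reset_index(drop=True)   # positional copy
    predictions = pd.Series(np.asarray(predictions))    # positional copy
    top_positions = predictions.sort_values(ascending=False).index[:count]
    revenue = target.iloc[top_positions].sum() * revenue_per_unit
    return revenue - count * cost_per_well

# Toy data (assumption): four wells, pick the best two by prediction.
toy_target = pd.Series([10.0, 200.0, 150.0, 50.0], index=[7, 3, 9, 1])
toy_pred = np.array([20.0, 180.0, 160.0, 40.0])

toy_profit = profit_no_side_effects(toy_target, toy_pred, 2, 4500, 500_000)
print(toy_profit)              # (200 + 150) * 4500 - 2 * 500000 = 575000.0
print(list(toy_target.index))  # [7, 3, 9, 1], the original index is preserved
```

Note that profit is computed from the true volumes of the selected wells, while the predictions are used only to decide which wells to select.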

----------
<font color='green'>

## Review

You are right, we need to know the probability of this outcome :)
    
</font>

---------

# Step 5. Calculate risks and profit for each region
1. Use the bootstrap technique with 1000 samples to find the distribution of profit.
2. Find average profit, 95% confidence interval and risk of losses. Loss is negative profit.
3. Provide findings: suggest a region for development of oil wells and justify the choice.

In [15]:
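# bootstrapping function

def bootstrap(target, prediction):
    state = np.random.RandomState(123)
    
    # align target and predictions by position so that the lookup below
    # does not depend on the index having been reset elsewhere
    target = target.reset_index(drop=True)
    prediction = np.asarray(prediction)
    
    values = []
    for i in range(1000):
        target_resample = target.sample(n=500, replace=True, random_state=state)
        pred_resample = prediction[target_resample.index]
        values.append(profit(target_resample, pred_resample, 200, revenue_per_unit, cost_per_well))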
        
    values = pd.Series(values)
    mean = values.mean()
    lower = values.quantile(0.025)
    upper = values.quantile(0.975)
    
    return mean, lower, upper

In [18]:
# bootstrap results for all three regions

mean1, lower1, upper1 = bootstrap(y_val1, pred1)
mean2, lower2, upper2 = bootstrap(y_val2, pred2)
mean3, lower3, upper3 = bootstrap(y_val3, pred3)

print(f'For region1, average profit is: {mean1} | 95% conf interval lower is: {lower1} | 95% conf interval upper\n is: {upper1}', '\n')
print(f'For region2, average profit is: {mean2} | 95% conf interval lower is: {lower2} | 95% conf interval upper\n is: {upper2}', '\n')
print(f'For region3, average profit is: {mean3} | 95% conf interval lower is: {lower3} | 95% conf interval upper\n is: {upper3}', '\n')

For region1, average profit is: 4774168.242664123 | 95% conf interval lower is: -579939.1179927464 | 95% conf interval upper
 is: 9748220.147728719 

For region2, average profit is: 4791901.613003321 | 95% conf interval lower is: 587268.775879572 | 95% conf interval upper
 is: 8744248.194881873 

For region3, average profit is: 3434543.7658087574 | 95% conf interval lower is: -2313762.552278907 | 95% conf interval upper
 is: 8608406.698830131 



### Conclusion
- Comparing the profit distributions of the three regions, region2 is suggested as the most profitable: it has the highest average profit over the 1000 bootstrap samples (about 4.79 million, narrowly ahead of region1 at 4.77 million), and it is the only region whose 95% confidence interval lies entirely above zero.
- Both region1 and region3 carry a risk of losses of at least 2.5%, because the lower bounds of their 95% confidence intervals are negative.
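The risk of losses can also be estimated directly as the share of bootstrap samples with negative profit. The sketch below is not the notebook's `bootstrap` function: `bootstrap_profits` is a hypothetical helper that samples positions with NumPy, and it runs on synthetic data, so its numbers are not comparable with the results above.

```python
import numpy as np
import pandas as pd

def bootstrap_profits(target, prediction, n_iter=1000, sample_size=500, count=200,
                      revenue_per_unit=4500.0, cost_per_well=500_000.0, seed=123):
    """Return the bootstrap distribution of profit as a Series."""
    target = np.asarray(target)
    prediction = np.asarray(prediction)
    state = np.random.RandomState(seed)
    profits = []
    for _ in range(n_iter):
        # Draw well positions with replacement, then keep the top `count` by prediction.
        idx = state.randint(0, len(target), size=sample_size)
        top = idx[np.argsort(prediction[idx])[::-1][:count]]
        profits.append(target[top].sum() * revenue_per_unit - count * cost_per_well)
    return pd.Series(profits)

# Synthetic validation data (assumption): noisy predictions of a made-up target.
rng = np.random.RandomState(0)
true_volume = rng.uniform(0, 190, size=5000)
predicted_volume = true_volume + rng.normal(scale=40, size=5000)

profits = bootstrap_profits(true_volume, predicted_volume)
lower, upper = profits.quantile(0.025), profits.quantile(0.975)
risk_of_loss = (profits < 0).mean()  # share of bootstrap samples with negative profit
print(f'mean {profits.mean():.0f}, 95% interval [{lower:.0f}, {upper:.0f}], risk {risk_of_loss:.1%}')
```

Returning the full distribution rather than only its summary statistics makes it possible to compute the risk, plot a histogram, or change the interval width without rerunning the resampling.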

----------
<font color='green'>

## Review

Great, your implementation of bootstrap is correct, and your conclusion could be really useful for business :)
    
</font>

---------
