**Review**

Hi, my name is Dmitry and I will be reviewing your project.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did an excellent job! The project is accepted. Keep up the good work on the next sprint!

# Project Description:
You work for the OilyGiant mining company. Your task is to find the best place for a new well.

Steps to choose the location:
1. Collect the oil well parameters in the selected region: oil quality and volume of reserves;
2. Build a model for predicting the volume of reserves in the new wells;
3. Pick the oil wells with the highest estimated values;
4. Pick the region with the highest total profit for the selected oil wells.

You have data on oil samples from three regions. Parameters of each oil well in the region are already known. Build a model that will help to pick the region with the highest profit margin. Analyze potential profit and risks using the Bootstrapping technique.

## Load and Prepare Data

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from scipy import stats as st
from sklearn.utils import shuffle
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
import warnings
warnings.filterwarnings('ignore')

In [2]:
geo_data_0 = pd.read_csv("https://code.s3.yandex.net/datasets/geo_data_0.csv")
geo_data_1 = pd.read_csv("https://code.s3.yandex.net/datasets/geo_data_1.csv")
geo_data_2 = pd.read_csv("https://code.s3.yandex.net/datasets/geo_data_2.csv")

In [3]:
geo_data_0.head()

| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | txEyH | 0.705745 | -0.497823 | 1.22117 | 105.280062 |
| 1 | 2acmU | 1.334711 | -0.340164 | 4.36508 | 73.03775 |
| 2 | 409Wp | 1.022732 | 0.15199 | 1.419926 | 85.265647 |
| 3 | iJLyR | -0.032172 | 0.139033 | 2.978566 | 168.620776 |
| 4 | Xdl7t | 1.988431 | 0.155413 | 4.751769 | 154.036647 |


In [4]:
geo_data_1.head()

| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | kBEdx | -15.001348 | -8.276 | -0.005876 | 3.179103 |
| 1 | 62mP7 | 14.272088 | -3.475083 | 0.999183 | 26.953261 |
| 2 | vyE1P | 6.263187 | -5.948386 | 5.00116 | 134.766305 |
| 3 | KcrkZ | -13.081196 | -11.506057 | 4.999415 | 137.945408 |
| 4 | AHL4O | 12.702195 | -8.147433 | 5.004363 | 134.766305 |


In [5]:
geo_data_2.head()

| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | fwXo0 | -1.146987 | 0.963328 | -0.828965 | 27.758673 |
| 1 | WJtFt | 0.262778 | 0.269839 | -2.530187 | 56.069697 |
| 2 | ovLUW | 0.194587 | 0.289035 | -5.586433 | 62.87191 |
| 3 | q6cA6 | 2.23606 | -0.55376 | 0.930038 | 114.572842 |
| 4 | WPMUX | -0.515993 | 1.716266 | 5.899011 | 149.600746 |


In [6]:
#checking for missing values & df info
geo_data_0.isna().sum(), geo_data_0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


(id         0
 f0         0
 f1         0
 f2         0
 product    0
 dtype: int64,
 None)

In [7]:
geo_data_1.isna().sum(), geo_data_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


(id         0
 f0         0
 f1         0
 f2         0
 product    0
 dtype: int64,
 None)

In [8]:
geo_data_2.isna().sum(), geo_data_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


(id         0
 f0         0
 f1         0
 f2         0
 product    0
 dtype: int64,
 None)

In [9]:
geo_data_0.describe()

| | f0 | f1 | f2 | product |
|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 0.500419 | 0.250143 | 2.502647 | 92.5 |
| std | 0.871832 | 0.504433 | 3.248248 | 44.288691 |
| min | -1.408605 | -0.848218 | -12.088328 | 0.0 |
| 25% | -0.07258 | -0.200881 | 0.287748 | 56.497507 |
| 50% | 0.50236 | 0.250252 | 2.515969 | 91.849972 |
| 75% | 1.073581 | 0.700646 | 4.715088 | 128.564089 |
| max | 2.362331 | 1.343769 | 16.00379 | 185.364347 |


In [10]:
geo_data_1.describe()

| | f0 | f1 | f2 | product |
|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 1.141296 | -4.796579 | 2.494541 | 68.825 |
| std | 8.965932 | 5.119872 | 1.703572 | 45.944423 |
| min | -31.609576 | -26.358598 | -0.018144 | 0.0 |
| 25% | -6.298551 | -8.267985 | 1.000021 | 26.953261 |
| 50% | 1.153055 | -4.813172 | 2.011479 | 57.085625 |
| 75% | 8.621015 | -1.332816 | 3.999904 | 107.813044 |
| max | 29.421755 | 18.734063 | 5.019721 | 137.945408 |


In [11]:
geo_data_2.describe()

| | f0 | f1 | f2 | product |
|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 0.002023 | -0.002081 | 2.495128 | 95.0 |
| std | 1.732045 | 1.730417 | 3.473445 | 44.749921 |
| min | -8.760004 | -7.08402 | -11.970335 | 0.0 |
| 25% | -1.162288 | -1.17482 | 0.130359 | 59.450441 |
| 50% | 0.009424 | -0.009482 | 2.484236 | 94.925613 |
| 75% | 1.158535 | 1.163678 | 4.858794 | 130.595027 |
| max | 7.238262 | 7.844801 | 16.739402 | 190.029838 |


Datasets are uploaded and inspected. There are no missing values and dtypes are correct. 

<div class="alert alert-success">
<b>Reviewer's comment</b>

Alright!

</div>

## Train and Test model for each region 

In [12]:
#split the data into training and validation sets (75:25 ratio) and standardize the features
def data_split(geo_data): 
    features = geo_data.drop(columns=['id','product'])
    target = geo_data['product']
    features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.25, random_state=12345)
    #copy the splits so the scaled values are not written into slices of the original frame
    features_train = features_train.copy()
    features_valid = features_valid.copy()
    #standardize the features; the scaler is fitted on the training set only
    scaler = StandardScaler()
    numeric = ['f0','f1','f2']
    scaler.fit(features_train[numeric])    
    features_train[numeric] = scaler.transform(features_train[numeric])
    features_valid[numeric] = scaler.transform(features_valid[numeric]) 
    print(len(features_train))
    print(len(target_train))
    print(len(features_valid))
    print(len(target_valid))
    return features_train, features_valid, target_train, target_valid

In [13]:
features_train0, features_valid0, target_train0, target_valid0 = data_split(geo_data_0)

75000
75000
25000
25000


In [14]:
features_train1, features_valid1, target_train1, target_valid1 = data_split(geo_data_1)

75000
75000
25000
25000


In [15]:
features_train2, features_valid2, target_train2, target_valid2 = data_split(geo_data_2)

75000
75000
25000
25000


All three datasets are split into training and validation sets in a 75:25 ratio.

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data for each region was split into train and validation sets

</div>
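
A side note on the standardization step: `LinearRegression` is ordinary least squares with an intercept, and for that model rescaling the features changes the coefficients but not the predictions. Scaling matters for regularized or distance-based models, not for the metrics reported in this project. The sketch below illustrates this with plain NumPy on synthetic data (not the project data); `ols_predict` is a hypothetical helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the f0, f1, f2 columns (assumed values, not the real data)
loc, scale = [0.5, -4.8, 2.5], [0.9, 5.1, 1.7]
X_train = rng.normal(loc=loc, scale=scale, size=(300, 3))
X_valid = rng.normal(loc=loc, scale=scale, size=(100, 3))
y_train = 10 + X_train @ np.array([3.0, -2.0, 25.0]) + rng.normal(scale=5.0, size=300)

def ols_predict(X_fit, y_fit, X_new):
    """Fit ordinary least squares with an intercept and predict for X_new."""
    A = np.column_stack([np.ones(len(X_fit)), X_fit])
    coef, *_ = np.linalg.lstsq(A, y_fit, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef

# Standardize with statistics taken from the training split only
mean, std = X_train.mean(axis=0), X_train.std(axis=0)

pred_raw = ols_predict(X_train, y_train, X_valid)
pred_scaled = ols_predict((X_train - mean) / std, y_train, (X_valid - mean) / std)

# The two sets of predictions agree up to floating-point error
print(np.allclose(pred_raw, pred_scaled))
```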

In [16]:
print(features_train0.shape)
print(features_valid0.shape)

(75000, 3)
(25000, 3)


In [17]:
#function that will train the model and test the model of each region
def linear_regression_model(features_train, features_valid, target_train, target_valid):
    model = LinearRegression()
    model.fit(features_train, target_train)
    predicted_valid = model.predict(features_valid)
    print('R2:',r2_score(target_valid, predicted_valid))
    print('RMSE:',np.sqrt(mean_squared_error(target_valid, predicted_valid)))
    print('Average Volume:', predicted_valid.mean())
    return predicted_valid   

In [18]:
#training geo_data_0
predictions0 = linear_regression_model(features_train0, features_valid0, target_train0, target_valid0)

R2: 0.27994321524487786
RMSE: 37.5794217150813
Average Volume: 92.59256778438035


In [19]:
#training geo_data_1
predictions1 = linear_regression_model(features_train1, features_valid1, target_train1, target_valid1)

R2: 0.9996233978805127
RMSE: 0.893099286775617
Average Volume: 68.728546895446


In [20]:
#training geo_data_2
predictions2 = linear_regression_model(features_train2, features_valid2, target_train2, target_valid2)

R2: 0.20524758386040443
RMSE: 40.02970873393434
Average Volume: 94.96504596800489


R2 measures the goodness of fit of a regression model. Its maximum is 1, which indicates a perfect fit; a value near 0 means the model does little better than always predicting the mean. Geo_data_0 and geo_data_2 have low R2 values of 0.28 and 0.21, respectively. Such low values can be a sign of underfitting: a linear model on the three features explains only a small share of the variance in the target. Geo_data_1 has an R2 of 0.9996, so the relationship between the features and the target is almost exactly linear and the model is nearly perfect. The RMSE tells the same story as R2. Geo_data_1 has an RMSE of 0.89, which means the typical prediction error is about 0.89 thousand barrels; the lower the RMSE, the better. Geo_data_0 and geo_data_2 have RMSEs of 37.58 and 40.03, respectively. The average predicted volumes for geo_data_0, geo_data_1, and geo_data_2 are 92.59, 68.73, and 94.97, respectively, so geo_data_0 and geo_data_2 have the highest predicted volumes on average.

<div class="alert alert-success">
<b>Reviewer's comment</b>

The models were trained and evaluated correctly. Conclusions make sense

</div>
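
The link between R2 and RMSE can be made explicit: R2 equals 1 minus the squared ratio of the model's RMSE to the RMSE of a baseline that always predicts the mean of the target. As a rough check against the numbers above, geo_data_0 has an RMSE of about 37.58 and a `product` standard deviation of about 44.29 (taken from `describe()` over the whole dataset, so only approximate for the validation set), which gives 1 - (37.58 / 44.29)^2 ≈ 0.28. The sketch below verifies the identity on synthetic data; `rmse` and `r2` are hypothetical helpers written for this example.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1 - ss_res / ss_tot)

rng = np.random.default_rng(12345)
y_true = rng.uniform(0, 190, size=1000)             # synthetic "volumes"
y_pred = y_true + rng.normal(scale=40, size=1000)   # synthetic noisy predictions
baseline = np.full_like(y_true, y_true.mean())      # always predict the mean

r2_direct = r2(y_true, y_pred)
r2_from_rmse = 1 - (rmse(y_true, y_pred) / rmse(y_true, baseline)) ** 2

# Both routes give the same number; the baseline itself scores R2 = 0
print(r2_direct, r2_from_rmse, r2(y_true, baseline))
```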

## Profit Calculation

Conditions: 
1. When exploring a region, 500 points are studied and the best 200 are picked for the profit calculation.
2. The budget for developing 200 oil wells is USD 100 million.
3. One barrel of raw materials brings USD 4.5 of revenue. The revenue from one unit of product is USD 4,500 (the volume of reserves is in thousands of barrels).
4. After the risk evaluation, keep only the regions with a risk of losses lower than 2.5%. From the ones that fit the criteria, the region with the highest average profit should be selected.

In [21]:
n = 500
n_best = 200
budget_200_wells = 100000000
one_barrel_revenue = 4.5 
volumes = 1000
risk = 0.025

In [22]:
#calculating the volume of reserves sufficient for developing a new well without losses
volume = budget_200_wells/n_best/one_barrel_revenue/volumes
print(f'The volume of reserves sufficient for developing a new well without losses: {volume:.2f}')

The volume of reserves sufficient for developing a new well without losses: 111.11
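
For reference, the same break-even figure written out step by step, using only the conditions listed above (budget per well first, then the volume whose revenue covers it):

$$
\frac{100{,}000{,}000\ \text{USD}}{200\ \text{wells}} = 500{,}000\ \text{USD per well},
\qquad
\frac{500{,}000\ \text{USD}}{4.5\ \text{USD/barrel} \times 1000\ \text{barrels/unit}} \approx 111.11\ \text{units}
$$

Here one unit is a thousand barrels, so a well needs roughly 111.11 thousand barrels of reserves to pay for itself.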


<div class="alert alert-success">
<b>Reviewer's comment</b>

Calculation is correct!

</div>

<div class="alert alert-warning">
<b>Reviewer's comment</b>

We can note here that the average volume in each region is lower than the break even point, thus if we just select the wells to develop randomly, we will lose money. Hopefully using a model to select the wells can help 
    
</div>
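
The comparison can be spelled out with the mean `product` values from the `describe()` output earlier in the notebook. This is a small sketch rather than part of the original analysis; the variable names are chosen for this example only.

```python
# Parameters as stated in the conditions
budget = 100_000_000            # USD for 200 wells
wells = 200
revenue_per_unit = 4.5 * 1000   # USD per unit (one unit = a thousand barrels)

break_even = budget / wells / revenue_per_unit

# Mean `product` per region, copied from the describe() output
mean_volume = {"geo_data_0": 92.5, "geo_data_1": 68.825, "geo_data_2": 95.0}

for region, mean in mean_volume.items():
    shortfall = break_even - mean
    print(f"{region}: mean {mean:.2f}, short of break-even by {shortfall:.2f}")
```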

## Calculate Profit from a set of Selected Oil Wells

In [23]:
#Writing a function to calculate profit from a set of selected oil wells and model predictions
def profit(target_valid, predictions):
    target_valid = pd.Series(target_valid).reset_index(drop=True)
    predictions = pd.Series(predictions)
    predictions_sort = predictions.sort_values(ascending=False)
    selected = target_valid[predictions_sort.index][:n_best]
    revenue = selected.sum() * one_barrel_revenue * volumes
    return revenue - budget_200_wells

<div class="alert alert-success">
<b>Reviewer's comment</b>

Profit calculation function is correct!

</div>
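
The key point of the function is that wells are chosen by their predicted volume, while revenue is computed from their actual volume. A toy NumPy sketch makes the difference visible; `profit_top_n` is a hypothetical stand-in for `profit`, and the five wells, the price, and the budget are made-up numbers.

```python
import numpy as np

def profit_top_n(target, predictions, n_best, revenue_per_unit, budget):
    """Pick the n_best wells by prediction, then sum their actual volumes."""
    target = np.asarray(target, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    top = np.argsort(predictions)[::-1][:n_best]   # indices of the highest predictions
    return target[top].sum() * revenue_per_unit - budget

# Toy data: five wells, pick the best two
target = [50.0, 120.0, 130.0, 90.0, 140.0]         # actual volumes
predictions = [60.0, 125.0, 100.0, 135.0, 110.0]   # model predictions

# The model picks wells 3 and 1 (predictions 135 and 125), whose actual volumes are 90 and 120
toy_profit = profit_top_n(target, predictions, n_best=2, revenue_per_unit=4500, budget=900_000)

# A perfect model would pick wells 4 and 2 (actual volumes 140 and 130)
ideal_profit = profit_top_n(target, target, n_best=2, revenue_per_unit=4500, budget=900_000)

print(toy_profit, ideal_profit)
```

The gap between the two numbers is the cost of prediction error, which is why a region with a more accurate model can come out ahead even if its average volume is lower.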

In [24]:
profit0 = profit(target_valid0, predictions0)
print(f'Profit for region 0: ${profit0:.2f}')

Profit for region 0: $33208260.43


In [25]:
profit1 = profit(target_valid1, predictions1)
print(f'Profit for region 1: ${profit1:.2f}')

Profit for region 1: $24150866.97


In [26]:
profit2 = profit(target_valid2, predictions2)
print(f'Profit for region 2: ${profit2:.2f}')

Profit for region 2: $27103499.64


Region 0 makes the most profit at about USD 33 million, followed by region 2 at about USD 27 million and region 1 at about USD 24 million.

<div class="alert alert-warning">
<b>Reviewer's comment</b>

Note that these values come from taking overall top 200 wells, and as we're actually looking not at all wells, but only a random sample of 500, it is highly unlikely that the overall top 200 wells are contained in such a small sample

</div>

## Calculate Risk and Profit 

In [27]:
#Using the bootstrapping technique with 1000 samples to find the distribution of profit 
def bootstrap_profit(target_valid, predictions):
    target_valid = pd.Series(target_valid).reset_index(drop=True)
    state = np.random.RandomState(12345)
    profits = []
    for i in range(1000):
        target_subsample = target_valid.sample(n=n, replace=True, random_state=state)
        preds_subsample = predictions[target_subsample.index]
        profit_subsample = profit(target_subsample, preds_subsample)
        profits.append(profit_subsample)
        
    profits = pd.Series(profits)
    
    average_profits = profits.mean()
    print('Average profit of geo_data:','$',average_profits)
    
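    #95% confidence interval for the mean profit (t-interval based on the standard error), not the range of individual bootstrap profits
    confidence_interval = st.t.interval(0.95, len(profits)-1, loc=np.mean(profits), scale=st.sem(profits))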
    print(f"95% confidence interval: ${confidence_interval[0]:.2f} - ${confidence_interval[1]:.2f}")
   
    negative_profits = profits[profits < 0]
    risk_of_losses = len(negative_profits) / len(profits)
    risk_of_losses_percentage = risk_of_losses * 100
    print(f"Risk of losses: {risk_of_losses_percentage:.2f}%")
    
    return average_profits, confidence_interval, risk_of_losses_percentage

<div class="alert alert-success">
<b>Reviewer's comment</b>

Bootstrapping is done correctly, all needed statistics are calculated successfully

</div>
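
The interval computed with `st.t.interval` and `st.sem` describes the uncertainty of the mean profit, so it is narrow. A common alternative is to take the 2.5% and 97.5% quantiles of the bootstrap profits, which describe the spread of outcomes a single 500-well study could produce. The sketch below shows that variant on synthetic profits; `summarize_profits` is a hypothetical helper, and the distribution parameters are assumptions chosen only for illustration.

```python
import numpy as np

def summarize_profits(profits):
    """Mean, 2.5%-97.5% percentile interval, and share of negative profits."""
    profits = np.asarray(profits, dtype=float)
    lower, upper = np.quantile(profits, [0.025, 0.975])
    return {
        "mean": float(profits.mean()),
        "lower": float(lower),
        "upper": float(upper),
        "risk": float((profits < 0).mean()),   # fraction of samples with a loss
    }

# Synthetic bootstrap profits (assumed normal, for illustration only)
rng = np.random.default_rng(12345)
fake_profits = rng.normal(loc=4_000_000, scale=2_500_000, size=1000)

summary = summarize_profits(fake_profits)

# Width of a normal-approximation interval for the mean, for comparison
mean_ci_width = 2 * 1.96 * fake_profits.std(ddof=1) / np.sqrt(len(fake_profits))

print(summary)
print("percentile width:", summary["upper"] - summary["lower"], "mean CI width:", mean_ci_width)
```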

In [28]:
#geo_data_0
profits0 = bootstrap_profit(target_valid0, predictions0)

Average profit of geo_data: $ 3961649.8480237117
95% confidence interval: $3796203.15 - $4127096.54
Risk of losses: 6.90%


In [29]:
#geo_data_1
profits1 = bootstrap_profit(target_valid1, predictions1)

Average profit of geo_data: $ 4560451.057866608
95% confidence interval: $4431472.49 - $4689429.63
Risk of losses: 1.50%


In [30]:
#geo_data_2
profits2 = bootstrap_profit(target_valid2, predictions2)

Average profit of geo_data: $ 4044038.665683568
95% confidence interval: $3874457.97 - $4213619.36
Risk of losses: 7.60%


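From these results, geo_data_1 makes the most profit on average (about USD 4.56 million) and has the lowest risk of losses (1.50%). It is also the only region whose risk is below the 2.5% threshold; geo_data_0 and geo_data_2 have risks of 6.90% and 7.60%. Because it combines the highest average profit with an acceptable risk, the oil company should focus on this region.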

<div class="alert alert-success">
<b>Reviewer's comment</b>

Very good!

</div>

<div class="alert alert-warning">
<b>Reviewer's comment</b>

Would be cool to also make a plot of the profit distribution for each region

</div>
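
One possible way to draw such a plot is sketched below. It assumes `bootstrap_profit` is changed to also return the `profits` series, which the current version does not do; here synthetic profits stand in for the real ones, and `plot_profit_distributions` is a hypothetical helper.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt

def plot_profit_distributions(profits_by_region, bins=40):
    """Draw one histogram of bootstrap profits per region, side by side."""
    n_regions = len(profits_by_region)
    fig, axes = plt.subplots(1, n_regions, figsize=(5 * n_regions, 4), sharey=True)
    axes = np.atleast_1d(axes)
    for ax, (name, profits) in zip(axes, profits_by_region.items()):
        profits = np.asarray(profits, dtype=float)
        ax.hist(profits / 1e6, bins=bins)
        ax.axvline(0, color="red", linestyle="--")        # break-even line
        ax.axvline(profits.mean() / 1e6, color="black")   # mean profit
        ax.set_title(name)
        ax.set_xlabel("Profit, USD million")
    axes[0].set_ylabel("Bootstrap samples")
    fig.tight_layout()
    return fig

# Synthetic profits for illustration only (not the project results)
rng = np.random.default_rng(12345)
fake = {
    "geo_data_0": rng.normal(4.0e6, 2.6e6, size=1000),
    "geo_data_1": rng.normal(4.5e6, 2.0e6, size=1000),
    "geo_data_2": rng.normal(4.0e6, 2.7e6, size=1000),
}
fig = plot_profit_distributions(fake)
```

The share of each histogram to the left of the dashed line corresponds to the risk of losses reported by the bootstrap.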

## Conclusion

Region 1 is the best choice for development. Its model is by far the most accurate (R2 of 0.9996 and RMSE of 0.89, compared with R2 of 0.28 and 0.21 and RMSE of 37.58 and 40.03 for regions 0 and 2). In the bootstrap analysis it also has the highest average profit (about USD 4.56 million) and the lowest risk of losses (1.50%), the only one below the 2.5% threshold. We also calculated the volume of reserves sufficient for developing a new well without losses, which is 111.11 thousand barrels. When the top 200 wells are picked from the whole validation set, region 0 makes the most profit; however, when bootstrapping samples of 500 wells and picking the top 200 from each, region 1 makes the most profit.

<div class="alert alert-warning">
<b>Reviewer's comment</b>

Conclusion stops abruptly, it seems that it didn't save correctly (I suggest clicking 'Save and checkpoint' before sending in the project in the future) . But in any case region choice is correct and justified.

</div>