# Mining company margin forecast

**The project's task is to find the region where drilling new wells will bring the mining company the maximum profit.**

# Results

Region 1 has both the lowest risk of loss (0.4%) and the highest mean profit (about 5.4 billion rubles), so it is the region recommended for development.

## Data understanding

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

In [2]:
data_0 = pd.read_csv("/datasets/geo_data_0.csv")
data_1 = pd.read_csv("/datasets/geo_data_1.csv")
data_2 = pd.read_csv("/datasets/geo_data_2.csv")

In [3]:
for data in [data_0, data_1, data_2]:
    display(data.isna().sum())
    display(data.head())

id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | txEyH | 0.705745 | -0.497823 | 1.22117 | 105.280062 |
| 1 | 2acmU | 1.334711 | -0.340164 | 4.36508 | 73.03775 |
| 2 | 409Wp | 1.022732 | 0.15199 | 1.419926 | 85.265647 |
| 3 | iJLyR | -0.032172 | 0.139033 | 2.978566 | 168.620776 |
| 4 | Xdl7t | 1.988431 | 0.155413 | 4.751769 | 154.036647 |


id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | kBEdx | -15.001348 | -8.276 | -0.005876 | 3.179103 |
| 1 | 62mP7 | 14.272088 | -3.475083 | 0.999183 | 26.953261 |
| 2 | vyE1P | 6.263187 | -5.948386 | 5.00116 | 134.766305 |
| 3 | KcrkZ | -13.081196 | -11.506057 | 4.999415 | 137.945408 |
| 4 | AHL4O | 12.702195 | -8.147433 | 5.004363 | 134.766305 |


id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | fwXo0 | -1.146987 | 0.963328 | -0.828965 | 27.758673 |
| 1 | WJtFt | 0.262778 | 0.269839 | -2.530187 | 56.069697 |
| 2 | ovLUW | 0.194587 | 0.289035 | -5.586433 | 62.87191 |
| 3 | q6cA6 | 2.23606 | -0.55376 | 0.930038 | 114.572842 |
| 4 | WPMUX | -0.515993 | 1.716266 | 5.899011 | 149.600746 |


In [4]:
for data in [data_0, data_1, data_2]:
    data.drop("id", axis=1, inplace=True)

## Model building 

In [5]:
target = ["product"]

In [6]:
def splitting(data):
    # Hold out 25% of the rows for validation; the features are all columns except the target.
    x_train, x_test, y_train, y_test = train_test_split(
        data.drop(target, axis=1), data[target], test_size=0.25, random_state=1
    )
    return x_train, x_test, y_train, y_test

In [7]:
x_train_0, x_test_0, y_train_0, y_test_0 = splitting(data_0)

In [8]:
x_train_1, x_test_1, y_train_1, y_test_1 = splitting(data_1)

In [9]:
x_train_2, x_test_2, y_train_2, y_test_2 = splitting(data_2)

In [10]:
lr = LinearRegression()

In [11]:
def prediction(x_train, x_test, y_train):
    lr.fit(x_train, y_train)
    y_pr = lr.predict(x_test)
    return y_pr
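
Because `target` is a list, `data[target]` is a one-column DataFrame, and a linear regression fitted on a two-dimensional target returns two-dimensional predictions. The sketch below illustrates this on synthetic data; the `toy` frame and its coefficients are made up for illustration and have nothing to do with the real datasets.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for one region (hypothetical values).
rng = np.random.default_rng(0)
toy = pd.DataFrame({"f0": rng.normal(size=50), "f1": rng.normal(size=50)})
toy["product"] = 3 * toy["f0"] - toy["f1"] + 100

features = toy[["f0", "f1"]]

# Target passed as a one-column DataFrame, as in the notebook (target = ["product"]).
pred_2d = LinearRegression().fit(features, toy[["product"]]).predict(features)

# Target passed as a Series.
pred_1d = LinearRegression().fit(features, toy["product"]).predict(features)

print(pred_2d.shape, pred_1d.shape)
```

The two sets of predictions hold the same numbers; only the array shape differs, which matters later when the predictions are wrapped in a `pd.Series`.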

In [12]:
y_pr_0 = prediction(x_train_0, x_test_0, y_train_0)

In [13]:
y_pr_1 = prediction(x_train_1, x_test_1, y_train_1)

In [14]:
y_pr_2 = prediction(x_train_2, x_test_2, y_train_2)

### Calculation of metrics

In [15]:
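def count_rmse(y_test, y_pr):
    # RMSE as the square root of MSE; this avoids the `squared=False` argument,
    # which newer scikit-learn releases deprecate.
    rmse = np.sqrt(mean_squared_error(y_test, y_pr))
    return rmse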
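
For reference, RMSE is the square root of the mean squared difference between the true and predicted values. A minimal sketch of that definition follows; `rmse_by_hand` and the toy numbers are hypothetical and serve only to show the arithmetic.

```python
import numpy as np

def rmse_by_hand(y_true, y_pred):
    """Root mean squared error computed directly from its definition."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical toy values, not taken from the datasets.
toy_true = [100.0, 80.0, 120.0]
toy_pred = [90.0, 85.0, 130.0]

# Errors are 10, -5, -10, so the mean squared error is (100 + 25 + 100) / 3 = 75.
toy_rmse = rmse_by_hand(toy_true, toy_pred)
print(toy_rmse)
```

RMSE is expressed in the same units as the target, here thousands of barrels.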

In [16]:
print("Region 0")
print("RMSE:", count_rmse(y_test_0, y_pr_0))
print("Mean predicted reserves:", y_pr_0.mean(), "thousand barrels")

Region 0
RMSE: 37.74258669996437
Mean predicted reserves: 92.49262459838863 thousand barrels


In [17]:
print("Region 1")
print("RMSE:", count_rmse(y_test_1, y_pr_1))
print("Mean predicted reserves:", y_pr_1.mean(), "thousand barrels")

Region 1
RMSE: 0.8943375629130574
Mean predicted reserves: 69.12040524285558 thousand barrels


In [18]:
print("Region 2")
print("RMSE:", count_rmse(y_test_2, y_pr_2))
print("Mean predicted reserves:", y_pr_2.mean(), "thousand barrels")

Region 2
RMSE: 39.86671127773423
Mean predicted reserves: 94.9568304858529 thousand barrels


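The linear regression fits Region 1 best: its RMSE is about 0.89, against about 37.7 for Region 0 and 39.9 for Region 2. The highest mean predicted reserves are in Region 2 (about 95.0 thousand barrels) and the lowest in Region 1 (about 69.1 thousand barrels).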

## Calculation of profit and risks

In [19]:
# Flatten the predictions to one dimension so each Series holds plain floats.
y_pr_0 = pd.Series(np.ravel(y_pr_0), name="product")
y_pr_1 = pd.Series(np.ravel(y_pr_1), name="product")
y_pr_2 = pd.Series(np.ravel(y_pr_2), name="product")

In [20]:
y_true_0 = y_test_0.reset_index(drop=True).squeeze()
y_true_1 = y_test_1.reset_index(drop=True).squeeze()
y_true_2 = y_test_2.reset_index(drop=True).squeeze()

In [21]:
# Break-even volume per well: the regional development budget divided by the
# number of wells to be developed (200), then divided by the revenue per unit
# of product. Volumes are in thousands of barrels, as in the outputs.
budget_per_area = 1e11  # development budget per region, rubles
n_holes = 500  # wells drawn in each bootstrap resample
n_max_holes = 200  # wells selected for development
price_per_barrel = 4.5 * 1e6  # revenue per unit of product, rubles (4.5 million)

In [22]:
min_volume = budget_per_area / n_max_holes / price_per_barrel
print(min_volume, "thousand barrels")

111.11111111111111 thousand barrels
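
It helps to compare this break-even volume with the mean predicted reserves reported in the metrics section. The sketch below redoes that arithmetic with the numbers copied from the outputs above; the names `break_even`, `mean_predicted` and `shortfall` are introduced here for illustration only.

```python
# Break-even volume per well, same formula and constants as in the cell above.
break_even = 1e11 / 200 / 4.5e6

# Mean predicted reserves per region, copied from the metric outputs (thousand barrels).
mean_predicted = {
    "Region 0": 92.49262459838863,
    "Region 1": 69.12040524285558,
    "Region 2": 94.9568304858529,
}

# How far the average well in each region falls short of break-even.
shortfall = {region: break_even - mean for region, mean in mean_predicted.items()}

for region, gap in shortfall.items():
    print(f"{region}: mean prediction is {gap:.1f} thousand barrels below break-even")
```

In all three regions the average well is predicted to hold less than the break-even volume, so developing wells at random would be expected to lose money. This is why the profit calculation that follows keeps only the wells with the highest predictions.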


In [23]:
# Profit, in billions of rubles, from developing the `count` wells with the highest predicted reserves.
def revenue(y_true, y_pr, count):
    y_pr_sorted = y_pr.sort_values(ascending=False)
    y_sel = y_true[y_pr_sorted.index][:count]
    return ((price_per_barrel * y_sel.sum()) - budget_per_area) / 1e9

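- The profit distribution is estimated with a bootstrap of 1000 resamples. Each resample draws 500 wells with replacement, and the 200 wells with the highest predictions are developed. For each region we report the mean profit, the 95% confidence interval, and the risk of loss.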

In [24]:
state = np.random.RandomState(1)

def revenue_bootstrap(y_true, y_pr):
    values = []

    # 1000 bootstrap resamples of n_holes wells each, drawn with replacement
    for _ in range(1000):
        y_true_subsample = y_true.sample(n=n_holes, replace=True, random_state=state)
        y_pr_subsample = y_pr[y_true_subsample.index]
        values.append(revenue(y_true_subsample, y_pr_subsample, n_max_holes))

    values = pd.Series(values)
    lower = values.quantile(0.025)  # bounds of the 95% interval
    upper = values.quantile(0.975)
    mean = values.mean()
    risk = (values < 0).mean() * 100  # share of resamples with negative profit, %
    print("Mean profit:", mean, "billion rubles")
    print("95% confidence interval:", lower, "-", upper)
    print("Risk of loss:", risk, "%")
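
One subtlety of this approach: sampling with replacement produces repeated index labels, and label-based selection on a pandas Series with repeated labels returns every matching row for each requested label. A well drawn twice can therefore appear four times after selection. The sketch below shows a position-based variant that sidesteps this. It runs on synthetic data, and `profit_top_wells`, `bootstrap_profit` and the synthetic series are hypothetical names, not the notebook's implementation, so its numbers are not comparable with the results reported here.

```python
import numpy as np
import pandas as pd

BUDGET = 1e11   # same value as budget_per_area
PRICE = 4.5e6   # same value as price_per_barrel
N_SAMPLE = 500  # same value as n_holes
N_TOP = 200     # same value as n_max_holes

def profit_top_wells(y_true, y_pred, count):
    """Profit (billion rubles) from the `count` wells with the highest predictions.

    Selection is positional, so repeated index labels cannot multiply rows.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    top = np.argsort(y_pred)[::-1][:count]  # positions of the largest predictions
    return (PRICE * y_true[top].sum() - BUDGET) / 1e9

def bootstrap_profit(y_true, y_pred, n_iter=1000, seed=1):
    """Bootstrap the profit distribution and summarise it."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    profits = np.empty(n_iter)
    for i in range(n_iter):
        rows = rng.integers(0, len(y_true), size=N_SAMPLE)  # draw with replacement
        profits[i] = profit_top_wells(y_true[rows], y_pred[rows], N_TOP)
    lower, upper = np.quantile(profits, [0.025, 0.975])
    return {
        "mean": profits.mean(),
        "lower": lower,
        "upper": upper,
        "risk_pct": (profits < 0).mean() * 100,
    }

# Synthetic reserves and noisy predictions (made-up numbers).
rng = np.random.default_rng(0)
synthetic_true = pd.Series(rng.uniform(0, 190, size=25_000))
synthetic_pred = synthetic_true + rng.normal(0, 40, size=25_000)
summary = bootstrap_profit(synthetic_true, synthetic_pred)
print(summary)

# The label-duplication effect in isolation: label 1 occurs twice in the index,
# so asking for it twice returns 2 x 2 rows, plus one row for label 0.
dup = pd.Series([10.0, 20.0, 20.0], index=[0, 1, 1])
print(len(dup.loc[dup.index]))
```

An alternative with the same effect would be to call `reset_index(drop=True)` on both subsamples right after sampling, so that every row in a resample has a unique label.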

In [25]:
print("Region 0")
revenue_bootstrap(y_true_0, y_pr_0)

Region 0
Mean profit: 4.643776703569898 billion rubles
95% confidence interval: -0.9845387716982542 - 10.074690990475634
Risk of loss: 5.2 %


In [26]:
print("Region 1")
revenue_bootstrap(y_true_1, y_pr_1)

Region 1
Mean profit: 5.416368033879771 billion rubles
95% confidence interval: 1.2520761425419107 - 9.609603373577178
Risk of loss: 0.4 %


In [27]:
print("Region 2")
revenue_bootstrap(y_true_2, y_pr_2)

Region 2
Mean profit: 4.201703713499941 billion rubles
95% confidence interval: -1.0959781500023476 - 9.911556494251101
Risk of loss: 6.7 %
