Example on a dataset from the UCI repository

Data description: https://archive.ics.uci.edu/ml/datasets/banknote+authentication#

In [6]:
import pandas as pd
import numpy as np
from google.colab import files
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.metrics import recall_score, precision_score, roc_auc_score, accuracy_score, f1_score, classification_report, precision_recall_curve, confusion_matrix
from sklearn.ensemble import RandomForestClassifier


%matplotlib inline

In [2]:
file = files.upload()

Saving data_banknote_authentication.txt to data_banknote_authentication.txt


In [3]:
data = pd.read_csv("data_banknote_authentication.txt", header=None)
data.head(7)

Unnamed: 0,0,1,2,3,4
0,3.6216,8.6661,-2.8073,-0.44699,0
1,4.5459,8.1674,-2.4586,-1.4621,0
2,3.866,-2.6383,1.9242,0.10645,0
3,3.4566,9.5228,-4.0112,-3.5944,0
4,0.32924,-4.4552,4.5718,-0.9888,0
5,4.3684,9.6718,-3.9606,-3.1625,0
6,3.5912,3.0129,0.72888,0.56421,0


We have 4 features and 1 binary target variable: the task is to determine whether a banknote is counterfeit or not.

In [4]:
print(data.shape)

(1372, 5)


1372 banknotes in total.

Let's look at the class balance.

In [5]:
data.iloc[:, -1].value_counts()

0    762
1    610
Name: 4, dtype: int64

Split the data into training and test sets and train a model (gradient boosting in this example).

In [7]:
x_data = data.iloc[:,:-1]
y_data = data.iloc[:,-1]

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=7)

In [8]:
x_train.shape

(1097, 4)

In [9]:
y_train.shape

(1097,)

In [10]:
model = xgb.XGBClassifier()

model.fit(x_train, y_train)
y_predict = model.predict(x_test)

Check the quality

In [11]:
def evaluate_results(y_test, y_predict):
    print('Classification results:')
    f1 = f1_score(y_test, y_predict)
    print("f1: %.2f%%" % (f1 * 100.0)) 
    roc = roc_auc_score(y_test, y_predict)
    print("roc: %.2f%%" % (roc * 100.0)) 
    rec = recall_score(y_test, y_predict, average='binary')
    print("recall: %.2f%%" % (rec * 100.0)) 
    prc = precision_score(y_test, y_predict, average='binary')
    print("precision: %.2f%%" % (prc * 100.0)) 

In [12]:
evaluate_results(y_test, y_predict)

Classification results:
f1: 99.57%
roc: 99.57%
recall: 99.15%
precision: 100.00%
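
Because `evaluate_results` passes hard 0/1 predictions to `roc_auc_score`, the "roc" value it prints equals the balanced accuracy, (TPR + TNR) / 2, rather than a ranking-based AUC. The sketch below recomputes the four numbers from confusion-matrix counts. The counts (116 true positives, 1 false negative, 0 false positives, 158 true negatives) are an assumption chosen to be consistent with the output above and a 275-row test set; the notebook does not print them, and `metrics_from_counts` is a hypothetical helper.

```python
def metrics_from_counts(tp, fn, fp, tn):
    """Return precision, recall, F1 and balanced accuracy from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    # A ROC curve built from hard 0/1 predictions has a single interior
    # point, so its area reduces to the mean of TPR and TNR.
    balanced_accuracy = (recall + specificity) / 2
    return precision, recall, f1, balanced_accuracy

# Hypothetical counts, picked to match the reported percentages.
precision, recall, f1, bal_acc = metrics_from_counts(tp=116, fn=1, fp=0, tn=158)
print(f"f1: {f1:.2%}, roc (hard labels): {bal_acc:.2%}, "
      f"recall: {recall:.2%}, precision: {precision:.2%}")
```

To get a threshold-independent AUC, `roc_auc_score` would need the scores from `model.predict_proba(x_test)[:, 1]` instead of `y_predict`.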


## **Now it is PU learning's turn**

Suppose we do not know the negatives and part of the positives

In [13]:
mod_data = data.copy()
# get the indices of the positive samples
pos_ind = np.where(mod_data.iloc[:,-1].values == 1)[0]
#shuffle them
np.random.shuffle(pos_ind)
# leave just 25% of the positives marked
pos_sample_len = int(np.ceil(0.25 * len(pos_ind)))
print(f'Using {pos_sample_len}/{len(pos_ind)} as positives and unlabeling the rest')
pos_sample = pos_ind[:pos_sample_len]

Using 153/610 as positives and unlabeling the rest


Create a column for the new target variable with two classes: P (1) and U (-1)

In [14]:
mod_data['class_test'] = -1
mod_data.loc[pos_sample,'class_test'] = 1
print('target variable:\n', mod_data.iloc[:,-1].value_counts())

target variable:
 -1    1219
 1     153
Name: class_test, dtype: int64
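
The cells above rely on the global NumPy random state, so the set of labeled positives changes on every run. A minimal sketch of the same labeling step with a seeded generator, on a toy label vector that only mimics the class counts of this dataset. `make_pu_labels` and the seed value are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def make_pu_labels(y_true, labeled_fraction, seed=0):
    """Keep `labeled_fraction` of the positives as 1 and mark everything else -1."""
    rng = np.random.default_rng(seed)
    pos_ind = np.flatnonzero(y_true == 1)
    n_keep = int(np.ceil(labeled_fraction * len(pos_ind)))
    keep = rng.choice(pos_ind, size=n_keep, replace=False)
    s = np.full(len(y_true), -1)
    s[keep] = 1
    return s

# Toy labels with the same class counts as the banknote dataset (762 / 610).
y_true = np.array([0] * 762 + [1] * 610)
s = make_pu_labels(y_true, labeled_fraction=0.25)
print(np.unique(s, return_counts=True))
```

Calling the function twice with the same seed returns the same labeled subset, which makes the later metrics comparable between runs.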


In [15]:
mod_data.head()

Unnamed: 0,0,1,2,3,4,class_test
0,3.6216,8.6661,-2.8073,-0.44699,0,-1
1,4.5459,8.1674,-2.4586,-1.4621,0,-1
2,3.866,-2.6383,1.9242,0.10645,0,-1
3,3.4566,9.5228,-4.0112,-3.5944,0,-1
4,0.32924,-4.4552,4.5718,-0.9888,0,-1


In [16]:
mod_data.class_test.value_counts(), data.iloc[:, -1].value_counts()

(-1    1219
  1     153
 Name: class_test, dtype: int64, 0    762
 1    610
 Name: 4, dtype: int64)

In [17]:
x_data = mod_data.iloc[:,:-2].values # just the X 
y_labeled = mod_data.iloc[:,-1].values # new class (just the P & U)
y_positive = mod_data.iloc[:,-2].values # original class

## **1. random negative sampling**

In [18]:
mod_data = mod_data.sample(frac=1)
neg_sample = mod_data[mod_data['class_test']==-1][:len(mod_data[mod_data['class_test']==1])]
sample_test = mod_data[mod_data['class_test']==-1][len(mod_data[mod_data['class_test']==1]):]
pos_sample = mod_data[mod_data['class_test']==1]
print(neg_sample.shape, pos_sample.shape)
sample_train = pd.concat([neg_sample, pos_sample]).sample(frac=1)

(153, 6) (153, 6)


In [19]:
neg_sample.to_csv('neg_sample.csv')
pos_sample.to_csv('pos_sample.csv')

In [20]:
model = xgb.XGBClassifier()

model.fit(sample_train.iloc[:,:-2].values, 
          sample_train.iloc[:,-2].values)
y_predict = model.predict(sample_test.iloc[:,:-2].values)
evaluate_results(sample_test.iloc[:,-2].values, y_predict)

Classification results:
f1: 94.85%
roc: 96.73%
recall: 99.23%
precision: 90.85%
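
In the cell above, both the sampled unlabeled rows and the labeled positives are fitted against column `-2`, the original class, so the classifier still sees true labels. In a real PU setting only `class_test` is available. The sketch below builds a balanced training set under that constraint; `random_negative_sample` is a hypothetical helper, and treating the sampled unlabeled rows as class 0 is the assumption that defines random negative sampling.

```python
import numpy as np

def random_negative_sample(s, seed=0):
    """Split row indices into a balanced training set and the remaining unlabeled rows.

    s: array with 1 for labeled positives and -1 for unlabeled rows.
    Returns (train_idx, train_target, rest_idx).
    """
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(s == 1)
    unl_idx = rng.permutation(np.flatnonzero(s == -1))
    neg_idx, rest_idx = unl_idx[:len(pos_idx)], unl_idx[len(pos_idx):]
    train_idx = np.concatenate([pos_idx, neg_idx])
    # Sampled unlabeled rows are *assumed* negative: target 0, not the true class.
    train_target = np.concatenate([np.ones(len(pos_idx), dtype=int),
                                   np.zeros(len(neg_idx), dtype=int)])
    return train_idx, train_target, rest_idx

# Toy PU labels with the counts from this section: 153 labeled, 1219 unlabeled.
s = np.array([1] * 153 + [-1] * 1219)
train_idx, train_target, rest_idx = random_negative_sample(s)
print(len(train_idx), train_target.sum(), len(rest_idx))
```

A classifier would then be fitted on `X[train_idx]` against `train_target` and scored on `X[rest_idx]`, with the original class used only for evaluation.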


## **2. probabilistic approach**

The training set will be divided into a fitting-set that will be used to fit the estimator in order to estimate P(s=1|X) and a held-out set of positive samples that will be used to estimate P(s=1|y=1)
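
The division by P(s=1|y=1) in `predict_PU_prob` follows from assuming that labeled positives are selected completely at random, i.e. the chance of a positive being labeled does not depend on x. The notebook does not state this assumption explicitly, so it is spelled out here. With g(x) = P(s=1|x), c = P(s=1|y=1), and H the held-out set of labeled positives, and because only positives can be labeled:

```latex
\begin{aligned}
P(s=1 \mid x) &= P(y=1 \mid x)\, P(s=1 \mid y=1, x) = c \, P(y=1 \mid x), \\
P(y=1 \mid x) &= \frac{g(x)}{c}, \qquad
c \approx \frac{1}{|H|} \sum_{x \in H} g(x).
\end{aligned}
```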

In [21]:
def fit_PU_estimator(X,y, hold_out_ratio, estimator):
    
    # find the indices of the positive/labeled elements
    assert (type(y) == np.ndarray), "Must pass np.ndarray rather than list as y"
    positives = np.where(y == 1.)[0] 
    # hold_out_size = the *number* of positives/labeled samples 
    # that we will use later to estimate P(s=1|y=1)
    hold_out_size = int(np.ceil(len(positives) * hold_out_ratio))
    np.random.shuffle(positives)
    # hold_out = the *indices* of the positive elements 
    # that we will later use  to estimate P(s=1|y=1)
    hold_out = positives[:hold_out_size]
    # the actual positive *elements* that we will keep aside
    X_hold_out = X[hold_out] 
    # remove the held out elements from X and y
    X = np.delete(X, hold_out,0) 
    y = np.delete(y, hold_out)
    # We fit the estimator on the unlabeled samples + (part of the) positive and labeled ones
    # in order to estimate P(s=1|X), i.e. the probability that an element is *labeled*.
    pd.DataFrame(X).to_csv('X.csv')
    pd.DataFrame(y).to_csv('y.csv')
    estimator.fit(X, y)
    # We then use the estimator for prediction of the positive held-out set 
    # in order to estimate P(s=1|y=1)
    hold_out_predictions = estimator.predict_proba(X_hold_out)
    #take the probability that it is 1
    hold_out_predictions = hold_out_predictions[:,1]
    # save the mean probability 
    c = np.mean(hold_out_predictions)
    return estimator, c

def predict_PU_prob(X, estimator, prob_s1y1):
    predicted_s = estimator.predict_proba(X)
    predicted_s = predicted_s[:,1]
    return predicted_s / prob_s1y1

Test the PU estimation approach

In [22]:
predicted = np.zeros(len(x_data))
learning_iterations = 24
for index in range(learning_iterations):
    pu_estimator, probs1y1 = fit_PU_estimator(x_data, y_labeled, 0.2, xgb.XGBClassifier())
    predicted += predict_PU_prob(x_data, pu_estimator, probs1y1)
    if(index%4 == 0): 
        print(f'Learning Iteration::{index}/{learning_iterations} => P(s=1|y=1)={round(probs1y1,2)}')

Learning Iteration::0/24 => P(s=1|y=1)=0.20000000298023224
Learning Iteration::4/24 => P(s=1|y=1)=0.2199999988079071
Learning Iteration::8/24 => P(s=1|y=1)=0.20999999344348907
Learning Iteration::12/24 => P(s=1|y=1)=0.20999999344348907
Learning Iteration::16/24 => P(s=1|y=1)=0.20999999344348907
Learning Iteration::20/24 => P(s=1|y=1)=0.20999999344348907
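
Since this experiment hid the labels itself, the true labeling frequency is known and can be compared with the printed estimates. A small arithmetic check using only numbers reported earlier (153 labeled out of 610 positives, hold-out ratio 0.2); the variable names are illustrative.

```python
import math

labeled_positives = 153   # positives that kept their label
all_positives = 610       # positives in the full dataset
hold_out_ratio = 0.2      # value passed to fit_PU_estimator

# Labeling frequency in the full dataset.
c_full = labeled_positives / all_positives

# fit_PU_estimator removes the held-out labeled positives before fitting,
# so the estimator is trained at a slightly lower labeling frequency.
held_out = math.ceil(labeled_positives * hold_out_ratio)
c_fit = (labeled_positives - held_out) / (all_positives - held_out)

print(f"c on full data: {c_full:.3f}, c in the fitting set: {c_fit:.3f}")
```

The printed estimates of about 0.21 are close to the fitting-set value rather than to the full-data value of about 0.25.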


Compare the predictions of the PU approach (y_predict) with the actual original classes (y_positive) that we set aside

In [23]:
y_predict = [1 if x > 0.01 else 0 for x in (predicted/learning_iterations)]
evaluate_results(y_positive, y_predict)

Classification results:
f1: 70.20%
roc: 66.01%
recall: 100.00%
precision: 54.08%


## **Assignment**

**take any dataset for binary classification**

I took the data from https://www.kaggle.com/manishkc06/web-page-phishing-detection

In [24]:
file = files.upload()

Saving phishing_data.csv to phishing_data.csv


In [25]:
df = pd.read_csv('phishing_data.csv')
df.head(7)

Unnamed: 0,url,length_url,length_hostname,ip,nb_dots,nb_hyphens,nb_at,nb_qm,nb_and,nb_or,nb_eq,nb_underscore,nb_tilde,nb_percent,nb_slash,nb_star,nb_colon,nb_comma,nb_semicolumn,nb_dollar,nb_space,nb_www,nb_com,nb_dslash,http_in_path,https_token,ratio_digits_url,ratio_digits_host,punycode,port,tld_in_path,tld_in_subdomain,abnormal_subdomain,nb_subdomains,prefix_suffix,random_domain,shortening_service,path_extension,nb_redirection,nb_external_redirection,...,avg_word_host,avg_word_path,phish_hints,domain_in_brand,brand_in_subdomain,brand_in_path,suspecious_tld,statistical_report,nb_hyperlinks,ratio_intHyperlinks,ratio_extHyperlinks,ratio_nullHyperlinks,nb_extCSS,ratio_intRedirection,ratio_extRedirection,ratio_intErrors,ratio_extErrors,login_form,external_favicon,links_in_tags,submit_email,ratio_intMedia,ratio_extMedia,sfh,iframe,popup_window,safe_anchor,onmouseover,right_clic,empty_title,domain_in_title,domain_with_copyright,whois_registered_domain,domain_registration_length,domain_age,web_traffic,dns_record,google_index,page_rank,status
0,http://www.progarchives.com/album.asp?id=61737,46,20,zero,3,zero,0,1,0,0,1,0,0,0,3,0,1,0,0,0,0,1,0,0,0,1,0.108696,0.0,0,0,0,0,0,3,0,0,0,0,0,0,...,7.5,3.75,0,0,0,0,0,0,143,0.93007,0.06993,0,1,0,0.0,0,0.0,0,1,73.913043,0,100.0,0.0,0,0,0,77.777778,0,0,0,1,one,0,627,6678,78526,0,0,5,phishing
1,http://signin.eday.co.uk.ws.edayisapi.dllsign....,128,120,0,10,0,0,0,0,0,0,0,0,0,3,0,1,0,0,0,0,0,0,0,0,1,0.054688,0.058333,0,0,0,0,0,3,0,0,0,0,0,0,...,10.7,0.0,2,0,0,0,0,0,0,0.0,0.0,0,0,0,0.0,0,0.0,0,0,0.0,0,0.0,0.0,0,0,0,0.0,0,0,1,1,zero,0,300,65,0,0,1,0,phishing
2,http://www.avevaconstruction.com/blesstool/ima...,52,25,0,3,0,0,0,0,0,0,0,0,0,4,0,1,0,0,0,0,1,0,0,0,1,0.0,0.0,0,0,0,0,0,3,0,0,0,0,1,0,...,10.0,5.666667,0,0,0,0,0,0,3,1.0,0.0,0,0,0,0.0,0,0.0,0,0,100.0,0,0.0,0.0,0,0,0,0.0,0,0,0,1,zero,0,119,1707,0,0,1,0,phishing
3,http://www.jp519.com/,21,13,0,2,0,0,0,0,0,0,0,0,0,3,0,1,0,0,0,0,1,0,0,0,1,0.142857,0.230769,0,0,0,0,0,2,0,1,0,0,0,0,...,4.0,0.0,0,0,0,0,0,0,404,0.962871,0.037129,0,0,0,0.133333,0,0.0,0,0,100.0,0,92.307692,7.692308,0,0,0,82.539683,0,0,0,1,one,0,130,1331,0,0,0,0,legitimate
4,https://www.velocidrone.com/,28,19,0,2,0,0,0,0,0,0,0,0,0,3,0,1,0,0,0,0,1,0,0,0,0,0.0,0.0,0,0,0,0,0,2,0,0,0,0,0,0,...,7.0,0.0,0,0,0,0,0,0,57,0.684211,0.315789,0,3,0,0.0,0,0.0,0,1,55.555556,0,50.0,50.0,0,0,0,81.081081,0,0,0,0,zero,0,164,1662,312044,0,0,4,legitimate
5,https://support-appleld.com.secureupdate.duila...,128,50,1,4,1,0,1,2,0,3,2,0,0,5,0,1,0,0,0,0,0,1,0,0,0,0.117188,0.0,0,0,0,1,0,3,1,0,0,0,0,0,...,8.4,7.375,0,0,0,0,0,0,51,1.0,0.0,0,0,0,0.0,0,0.0,0,0,100.0,0,100.0,0.0,0,0,0,100.0,0,0,0,1,one,0,25,3993,5707171,0,1,0,phishing
6,https://www.authpro.com/auth/ubabankng/?action...,50,15,0,2,0,0,1,0,0,1,0,0,0,5,0,1,0,0,0,0,1,0,0,0,0,0.0,0.0,0,0,0,0,0,2,0,0,0,0,0,0,...,5.0,5.5,0,0,0,0,0,0,0,0.0,0.0,0,0,0,0.0,0,0.0,0,0,0.0,0,0.0,0.0,0,0,0,0.0,0,0,0,1,zero,0,705,7330,154708,0,0,4,phishing


**do feature engineering**

In [26]:
df.domain_with_copyright = df.domain_with_copyright.map({'zero': 0, 'one': 1, 'Zero': 0, 'One': 1})
df.status = df.status.map({'legitimate': 0, 'phishing': 1})
df = df.drop(['url', 'ip', 'nb_hyphens'], axis=1)  # positional axis is not accepted by recent pandas
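
The `ip` and `nb_hyphens` columns show the string `zero` in row 0 of the preview, which is presumably why they are dropped; that reading is an assumption. If it holds, an alternative to dropping is to normalize such columns to numbers. A sketch on a toy frame, where `normalize_word_digits` and the toy values are hypothetical:

```python
import pandas as pd

WORD_TO_DIGIT = {"zero": 0, "one": 1}

def normalize_word_digits(series):
    """Map 'zero'/'one' (any case) to 0/1 and convert the column to numbers."""
    lowered = series.astype(str).str.strip().str.lower()
    mapped = lowered.map(lambda v: WORD_TO_DIGIT.get(v, v))
    # Anything that is still not numeric becomes NaN instead of raising.
    return pd.to_numeric(mapped, errors="coerce")

# Toy columns mixing digits with spelled-out values.
toy = pd.DataFrame({"ip": ["zero", "0", "1", "One"],
                    "nb_hyphens": ["zero", 0, 3, 1]})
cleaned = toy.apply(normalize_word_digits)
print(cleaned)
```

Unlike a plain `.map({...})` on the raw column, this keeps values that are already digits instead of turning them into NaN.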

In [28]:
df.head(7)

Unnamed: 0,length_url,length_hostname,nb_dots,nb_at,nb_qm,nb_and,nb_or,nb_eq,nb_underscore,nb_tilde,nb_percent,nb_slash,nb_star,nb_colon,nb_comma,nb_semicolumn,nb_dollar,nb_space,nb_www,nb_com,nb_dslash,http_in_path,https_token,ratio_digits_url,ratio_digits_host,punycode,port,tld_in_path,tld_in_subdomain,abnormal_subdomain,nb_subdomains,prefix_suffix,random_domain,shortening_service,path_extension,nb_redirection,nb_external_redirection,length_words_raw,char_repeat,shortest_words_raw,...,avg_word_host,avg_word_path,phish_hints,domain_in_brand,brand_in_subdomain,brand_in_path,suspecious_tld,statistical_report,nb_hyperlinks,ratio_intHyperlinks,ratio_extHyperlinks,ratio_nullHyperlinks,nb_extCSS,ratio_intRedirection,ratio_extRedirection,ratio_intErrors,ratio_extErrors,login_form,external_favicon,links_in_tags,submit_email,ratio_intMedia,ratio_extMedia,sfh,iframe,popup_window,safe_anchor,onmouseover,right_clic,empty_title,domain_in_title,domain_with_copyright,whois_registered_domain,domain_registration_length,domain_age,web_traffic,dns_record,google_index,page_rank,status
0,46,20,3,0,1,0,0,1,0,0,0,3,0,1,0,0,0,0,1,0,0,0,1,0.108696,0.0,0,0,0,0,0,3,0,0,0,0,0,0,6,3,2,...,7.5,3.75,0,0,0,0,0,0,143,0.93007,0.06993,0,1,0,0.0,0,0.0,0,1,73.913043,0,100.0,0.0,0,0,0,77.777778,0,0,0,1,1,0,627,6678,78526,0,0,5,1
1,128,120,10,0,0,0,0,0,0,0,0,3,0,1,0,0,0,0,0,0,0,0,1,0.054688,0.058333,0,0,0,0,0,3,0,0,0,0,0,0,10,6,2,...,10.7,0.0,2,0,0,0,0,0,0,0.0,0.0,0,0,0,0.0,0,0.0,0,0,0.0,0,0.0,0.0,0,0,0,0.0,0,0,1,1,0,0,300,65,0,0,1,0,1
2,52,25,3,0,0,0,0,0,0,0,0,4,0,1,0,0,0,0,1,0,0,0,1,0.0,0.0,0,0,0,0,0,3,0,0,0,0,1,0,5,5,3,...,10.0,5.666667,0,0,0,0,0,0,3,1.0,0.0,0,0,0,0.0,0,0.0,0,0,100.0,0,0.0,0.0,0,0,0,0.0,0,0,0,1,0,0,119,1707,0,0,1,0,1
3,21,13,2,0,0,0,0,0,0,0,0,3,0,1,0,0,0,0,1,0,0,0,1,0.142857,0.230769,0,0,0,0,0,2,0,1,0,0,0,0,2,3,3,...,4.0,0.0,0,0,0,0,0,0,404,0.962871,0.037129,0,0,0,0.133333,0,0.0,0,0,100.0,0,92.307692,7.692308,0,0,0,82.539683,0,0,0,1,1,0,130,1331,0,0,0,0,0
4,28,19,2,0,0,0,0,0,0,0,0,3,0,1,0,0,0,0,1,0,0,0,0,0.0,0.0,0,0,0,0,0,2,0,0,0,0,0,0,2,3,3,...,7.0,0.0,0,0,0,0,0,0,57,0.684211,0.315789,0,3,0,0.0,0,0.0,0,1,55.555556,0,50.0,50.0,0,0,0,81.081081,0,0,0,0,0,0,164,1662,312044,0,0,4,0
5,128,50,4,0,1,2,0,3,2,0,0,5,0,1,0,0,0,0,0,1,0,0,0,0.117188,0.0,0,0,0,1,0,3,1,0,0,0,0,0,13,4,2,...,8.4,7.375,0,0,0,0,0,0,51,1.0,0.0,0,0,0,0.0,0,0.0,0,0,100.0,0,100.0,0.0,0,0,0,100.0,0,0,0,1,1,0,25,3993,5707171,0,1,0,1
6,50,15,2,0,1,0,0,1,0,0,0,5,0,1,0,0,0,0,1,0,0,0,0,0.0,0.0,0,0,0,0,0,2,0,0,0,0,0,0,6,3,3,...,5.0,5.5,0,0,0,0,0,0,0,0.0,0.0,0,0,0,0.0,0,0.0,0,0,0.0,0,0.0,0.0,0,0,0,0.0,0,0,0,1,0,0,705,7330,154708,0,0,4,1


**train any classifier (whichever you like)**

In [29]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(['status'], axis=1), 
                                                    df['status'], random_state=0)

In [30]:
X_train.shape

(8610, 85)

In [31]:
y_train.shape

(8610,)

In [32]:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
preds = rfc.predict_proba(X_test)[:, 1]

In [33]:
preds

array([1.  , 0.99, 0.  , ..., 0.04, 0.15, 0.04])

In [34]:
precision, recall, thresholds = precision_recall_curve(y_test, preds)

fscore = (2 * precision * recall) / (precision + recall)
# locate the index of the largest f score
ix = np.argmax(fscore)
print('Best Threshold=%.3f, F-Score=%.3f, Precision=%.3f, Recall=%.3f' % (thresholds[ix], 
                                                                        fscore[ix],
                                                                        precision[ix],
                                                                        recall[ix]))

Best Threshold=0.510, F-Score=0.982, Precision=0.982, Recall=0.982
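
The F-score search above divides by `precision + recall`, which is 0 at any threshold where no true positive is predicted; the resulting NaN is then picked by `np.argmax`. The curve also has one more point than `thresholds`. A guarded version is sketched below on hand-made arrays in the shape `precision_recall_curve` returns; `best_f1_index` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def best_f1_index(precision, recall):
    """Index of the largest F1, treating 0/0 as F1 = 0 instead of NaN."""
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)
    denom = precision + recall
    fscore = np.divide(2 * precision * recall, denom,
                       out=np.zeros_like(denom), where=denom > 0)
    # The final (precision=1, recall=0) point has no matching threshold,
    # so it is excluded from the search.
    return int(np.argmax(fscore[:-1])), fscore

# Hand-made example: the second threshold yields precision = recall = 0.
precision = np.array([0.5, 0.0, 1.0])
recall = np.array([1.0, 0.0, 0.0])
thresholds = np.array([0.2, 0.9])

ix, fscore = best_f1_index(precision, recall)
print(thresholds[ix], fscore[ix])
```

With the unguarded formula the same arrays give a NaN at index 1, and `np.argmax` returns that index.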


**next, split your dataset into two sets: P (positives) and U (unlabeled). Use only part of the positive (class 1) examples, not all of them**

In [35]:
# Suppose we do not know the negatives and part of the positives
mod_data = df.copy()
# get the indices of the positive samples
pos_ind = np.where(mod_data.iloc[:,-1].values == 1)[0]
#shuffle them
np.random.shuffle(pos_ind)
# leave just 10% of the positives marked
pos_sample_len = int(np.ceil(0.1 * len(pos_ind)))
print(f'Using {pos_sample_len}/{len(pos_ind)} as positives and unlabeling the rest')
pos_sample = pos_ind[:pos_sample_len]

# Create a column for the new target variable with two classes: P (1) and U (-1)
mod_data['class_test'] = -1
mod_data.loc[pos_sample,'class_test'] = 1
print('target variable:\n', mod_data.iloc[:,-1].value_counts())


x_data = mod_data.iloc[:,:-2].values # just the X 
y_labeled = mod_data.iloc[:,-1].values # new class (just the P & U)
y_positive = mod_data.iloc[:,-2].values # original class

Using 575/5741 as positives and unlabeling the rest
target variable:
 -1    10906
 1      575
Name: class_test, dtype: int64


In [36]:
score = []

mod_data = mod_data.sample(frac=1)
neg_sample = mod_data[mod_data['class_test']==-1][:len(mod_data[mod_data['class_test']==1])]
sample_test = mod_data[mod_data['class_test']==-1][len(mod_data[mod_data['class_test']==1]):]
pos_sample = mod_data[mod_data['class_test']==1]
print(neg_sample.shape, pos_sample.shape)
sample_train = pd.concat([neg_sample, pos_sample]).sample(frac=1)

(575, 87) (575, 87)


In [37]:
model = RandomForestClassifier()

model.fit(sample_train.iloc[:,:-2].values, 
          sample_train.iloc[:,-2].values)
y_predict = model.predict_proba(sample_test.iloc[:,:-2].values)[:, 1]

precision, recall, thresholds = precision_recall_curve(sample_test.iloc[:,-2].values, y_predict)

fscore = (2 * precision * recall) / (precision + recall)
# locate the index of the largest f score
ix = np.argmax(fscore)
print('Best Threshold=%.3f, F-Score=%.3f, Precision=%.3f, Recall=%.3f' % (thresholds[ix], 
                                                                        fscore[ix],
                                                                        precision[ix],
                                                                        recall[ix]))

roc_auc = roc_auc_score(y_true=sample_test.iloc[:,-2].values, y_score=y_predict)

score.append([thresholds[ix], fscore[ix], precision[ix], recall[ix], roc_auc])

Best Threshold=0.700, F-Score=0.950, Precision=0.955, Recall=0.946


**apply random negative sampling to build a classifier under the new conditions**

In [38]:
def fit_PU_estimator(X,y, hold_out_ratio, estimator):
    
    # find the indices of the positive/labeled elements
    assert (type(y) == np.ndarray), "Must pass np.ndarray rather than list as y"
    positives = np.where(y == 1.)[0] 
    # hold_out_size = the *number* of positives/labeled samples 
    # that we will use later to estimate P(s=1|y=1)
    hold_out_size = int(np.ceil(len(positives) * hold_out_ratio))
    np.random.shuffle(positives)
    # hold_out = the *indices* of the positive elements 
    # that we will later use  to estimate P(s=1|y=1)
    hold_out = positives[:hold_out_size] 
    # the actual positive *elements* that we will keep aside
    X_hold_out = X[hold_out] 
    # remove the held out elements from X and y
    X = np.delete(X, hold_out,0) 
    y = np.delete(y, hold_out)
    # We fit the estimator on the unlabeled samples + (part of the) positive and labeled ones
    # in order to estimate P(s=1|X), i.e. the probability that an element is *labeled*.
    estimator.fit(X, y)
    # We then use the estimator for prediction of the positive held-out set 
    # in order to estimate P(s=1|y=1)
    hold_out_predictions = estimator.predict_proba(X_hold_out)
    #take the probability that it is 1
    hold_out_predictions = hold_out_predictions[:,1]
    # save the mean probability 
    c = np.mean(hold_out_predictions)
    return estimator, c

def predict_PU_prob(X, estimator, prob_s1y1):
    predicted_s = estimator.predict_proba(X)
    predicted_s = predicted_s[:,1]
    return predicted_s / prob_s1y1

In [39]:
predicted = np.zeros(len(x_data))
learning_iterations = 24
for index in range(learning_iterations):
    pu_estimator, probs1y1 = fit_PU_estimator(x_data, y_labeled, 0.2, RandomForestClassifier())
    predicted += predict_PU_prob(x_data, pu_estimator, probs1y1)
    if(index%4 == 0): 
        print(f'Learning Iteration::{index}/{learning_iterations} => P(s=1|y=1)={round(probs1y1,2)}')

Learning Iteration::0/24 => P(s=1|y=1)=0.1
Learning Iteration::4/24 => P(s=1|y=1)=0.07
Learning Iteration::8/24 => P(s=1|y=1)=0.08
Learning Iteration::12/24 => P(s=1|y=1)=0.11
Learning Iteration::16/24 => P(s=1|y=1)=0.09
Learning Iteration::20/24 => P(s=1|y=1)=0.11


In [40]:
predicted

array([  0.        ,  13.84524154, 102.04438746, ...,   5.06996965,
         0.        ,   2.22495091])
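
`predicted` holds the sum over 24 iterations, so an entry of about 102 corresponds to an averaged ratio g(x)/c of roughly 4.25, well above 1. The ratio is only an estimate and is not bounded by construction. The sketch below averages it and clips it into [0, 1] before thresholding; the clipping step and the 0.5 threshold are assumptions for illustration, whereas the notebook thresholds the raw average at 0.01.

```python
import numpy as np

learning_iterations = 24
# The three leading values shown in the output above.
predicted = np.array([0.0, 13.84524154, 102.04438746])

avg_ratio = predicted / learning_iterations
# g(x)/c can exceed 1 when c is small, so clip to get a valid probability.
prob_positive = np.clip(avg_ratio, 0.0, 1.0)
y_predict = (prob_positive > 0.5).astype(int)

print(avg_ratio.round(3), prob_positive.round(3), y_predict)
```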

In [41]:
y_predict = [1 if x > 0.01 else 0 for x in (predicted/learning_iterations)]
evaluate_results(y_positive, y_predict)

Classification results:
f1: 79.60%
roc: 75.62%
recall: 95.14%
precision: 68.43%


In [42]:
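# NOTE: y_predict was already binarized at 0.01 in the previous cell, so this
# curve has only the trivial thresholds 0 and 1; pass predicted / learning_iterations
# instead to search over real score thresholds.
precision, recall, thresholds = precision_recall_curve(y_positive, y_predict)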

fscore = (2 * precision * recall) / (precision + recall)
# locate the index of the largest f score
ix = np.argmax(fscore)
print('Best Threshold=%.3f, F-Score=%.3f, Precision=%.3f, Recall=%.3f' % (thresholds[ix], 
                                                                        fscore[ix],
                                                                        precision[ix],
                                                                        recall[ix]))

roc_auc = roc_auc_score(y_true=y_positive, y_score=y_predict)

score.append([thresholds[ix], fscore[ix], precision[ix], recall[ix], roc_auc])

Best Threshold=1.000, F-Score=0.796, Precision=0.684, Recall=0.951


**build a report: a table of metrics**

In [43]:
# score[0] comes from the random negative sampling cell, score[1] from the probabilistic PU approach
table = pd.DataFrame({'rns': score[0], 'probabilistic': score[1]}).T
table.columns = ['thresholds', 'fscore', 'precision', 'recall', 'roc_auc']
table

Unnamed: 0,thresholds,fscore,precision,recall,roc_auc
rns,0.7,0.950353,0.955144,0.94561,0.987412
probabilistic,1.0,0.796036,0.68429,0.951402,0.756189


**Bonus question:**

Which method do you think is preferable in practice: random negative sampling or the 2-step approach?

With random negative sampling the sample is biased in most cases (the sampled "negatives" can contain unlabeled positives), so the 2-step approach is preferable.