## Homework

1. Take any dataset for binary classification (for example, download one of the benchmark datasets from https://archive.ics.uci.edu/ml/datasets.php).
2. Do feature engineering.
3. Train any classifier you like.
4. Split the dataset into two sets: P (positives) and U (unlabeled). Put only a part of the positive (class 1) examples into P, not all of them.
5. Apply random negative sampling to build a classifier under these new conditions.
6. Compare the quality with the solution from step 3 (build a report: a table of metrics).
7. Experiment with the share of P at step 5 (how does model quality change as P shrinks or grows?).

In [110]:
import pandas as pd
import numpy as np

https://archive.ics.uci.edu/ml/datasets/Car+Evaluation

```
CAR                      car acceptability
   . PRICE                  overall price
   . . buying               buying price
   . . maint                price of the maintenance
   . TECH                   technical characteristics
   . . COMFORT              comfort
   . . . doors              number of doors
   . . . persons            capacity in terms of persons to carry
   . . . lug_boot           the size of luggage boot
   . . safety               estimated safety of the car
```
Class values: unacceptable (`unacc`), acceptable (`acc`), good (`good`), very good (`vgood`).

We will look for the unacceptable class (`unacc`), treating it as the positive class.


In [111]:
df = pd.read_csv('./../../6Урок/car.data', names=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class'],sep=',')
df.head(3)

|   | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |


In [112]:
df.isna().sum()

buying      0
maint       0
doors       0
persons     0
lug_boot    0
safety      0
class       0
dtype: int64

In [113]:
df['class'].value_counts()

unacc    1210
acc       384
good       69
vgood      65
Name: class, dtype: int64

In [114]:
df = pd.get_dummies(df, columns=['buying', 'maint', 'lug_boot', 'safety'])
df.head(3)

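|   | doors | persons | class | buying_high | buying_low | buying_med | buying_vhigh | maint_high | maint_low | maint_med | maint_vhigh | lug_boot_big | lug_boot_med | lug_boot_small | safety_high | safety_low | safety_med |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2 | unacc | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 2 | 2 | unacc | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | 2 | 2 | unacc | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |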


In [115]:
df['class'] = np.where(df['class'] == 'unacc', 1, 0)
df['class'].value_counts()

1    1210
0     518
Name: class, dtype: int64

In [116]:
df['doors'].value_counts()

4        432
2        432
5more    432
3        432
Name: doors, dtype: int64

In [117]:
# Use a single .loc call to avoid chained assignment (SettingWithCopyWarning)
df.loc[df['doors'] == '5more', 'doors'] = 5
df['doors'] = df['doors'].astype('int32')


In [118]:
df['persons'].value_counts()

more    576
4       576
2       576
Name: persons, dtype: int64

In [119]:
# Use a single .loc call to avoid chained assignment (SettingWithCopyWarning)
df.loc[df['persons'] == 'more', 'persons'] = 5
df['persons'] = df['persons'].astype('int32')

In [129]:
df.head(3)

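|   | doors | persons | class | buying_high | buying_low | buying_med | buying_vhigh | maint_high | maint_low | maint_med | maint_vhigh | lug_boot_big | lug_boot_med | lug_boot_small | safety_high | safety_low | safety_med |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 2 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | 2 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |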


In [120]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop('class', axis=1), df['class'], test_size=0.2, random_state=0)

In [122]:
from sklearn.ensemble import AdaBoostClassifier

In [123]:
model = AdaBoostClassifier()

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [124]:
from sklearn.metrics import recall_score, precision_score, roc_auc_score, accuracy_score, f1_score

results=[]

print('Classification results:')
f1 = f1_score(y_test, y_pred)
print("f1: %.2f%%" % (f1 * 100.0)) 
roc = roc_auc_score(y_test, y_pred)
print("roc: %.2f%%" % (roc * 100.0)) 
rec = recall_score(y_test, y_pred, average='binary')
print("recall: %.2f%%" % (rec * 100.0)) 
prc = precision_score(y_test, y_pred, average='binary')
print("precision: %.2f%%" % (prc * 100.0)) 

results.append([prc, rec, f1, roc, 'share 100%'])

Classification results:
f1: 95.45%
roc: 91.99%
recall: 96.25%
precision: 94.67%
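
The ROC AUC above is computed from hard 0/1 predictions, which describe a single point on the ROC curve. A minimal sketch of the alternative, scoring with class-1 probabilities from `predict_proba`, is shown below. It runs on synthetic data from `make_classification` (an assumption, not the car dataset), so its numbers are not comparable to the results in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the car dataset).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = AdaBoostClassifier(random_state=0).fit(X_tr, y_tr)

# AUC from hard 0/1 predictions: only one operating point of the ROC curve.
auc_labels = roc_auc_score(y_te, clf.predict(X_te))
# AUC from class-1 probabilities: uses the full ranking of the test examples.
auc_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"AUC from labels: {auc_labels:.3f}, AUC from scores: {auc_scores:.3f}")
```

The two values can differ noticeably, so it is worth stating which one a report uses.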


#### Now it's time for PU learning

In [127]:
share=np.arange(0.7, 0, -0.1)

for i in share:
    mod_data = df.copy()
    # get the indices of the positive samples
    pos_ind = np.array(df['class'].loc[df['class']==1].index)
    # shuffle them
    np.random.shuffle(pos_ind)
    # keep only the share `i` of the positives labeled
    pos_sample_len = int(np.ceil(i * len(pos_ind)))
    print(f'Using {pos_sample_len}/{len(pos_ind)} ({int(i*100)}%)as positives and unlabeling the rest')
    pos_sample = pos_ind[:pos_sample_len]

    mod_data['class_test'] = 0
    mod_data.loc[pos_sample,'class_test'] = 1
#     print('target variable:\n', mod_data.iloc[:,-1].value_counts())
    mod_data = mod_data.sample(frac=1)
    neg_sample = mod_data[mod_data['class_test']==0][:len(mod_data[mod_data['class_test']==1])]
    sample_test = mod_data[mod_data['class_test']==0][len(mod_data[mod_data['class_test']==1]):]
    pos_sample = mod_data[mod_data['class_test']==1]
    print(neg_sample.shape, pos_sample.shape)
    sample_train = pd.concat([neg_sample, pos_sample]).sample(frac=1)
    
    model.fit(sample_train.drop(['class_test', 'class'], axis=1), 
          sample_train['class_test'])
    y_pred = model.predict(sample_test.drop(['class_test', 'class'], axis=1))
    
    y_test = sample_test['class']
    print('Classification results:')
    f1 = f1_score(y_test, y_pred)
    print("f1: %.2f%%" % (f1 * 100.0)) 
    roc = roc_auc_score(y_test, y_pred)
    print("roc: %.2f%%" % (roc * 100.0)) 
    rec = recall_score(y_test, y_pred, average='binary')
    print("recall: %.2f%%" % (rec * 100.0)) 
    prc = precision_score(y_test, y_pred, average='binary')
    print("precision: %.2f%%\n\n" % (prc * 100.0)) 
    
    results.append([prc, rec, f1, roc, 'share ' + str(int(i*100))+ '%'])

Using 847/1210 (70%)as positives and unlabeling the rest
(847, 18) (847, 18)
Classification results:
f1: 80.00%
roc: 83.33%
recall: 66.67%
precision: 100.00%


Using 726/1210 (60%)as positives and unlabeling the rest
(726, 18) (726, 18)
Classification results:
f1: 78.79%
roc: 82.50%
recall: 65.00%
precision: 100.00%


Using 605/1210 (50%)as positives and unlabeling the rest
(605, 18) (605, 18)
Classification results:
f1: 77.85%
roc: 81.87%
recall: 63.73%
precision: 100.00%


Using 484/1210 (40%)as positives and unlabeling the rest
(484, 18) (484, 18)
Classification results:
f1: 75.24%
roc: 80.16%
recall: 60.31%
precision: 100.00%


Using 364/1210 (30%)as positives and unlabeling the rest
(364, 18) (364, 18)
Classification results:
f1: 73.29%
roc: 78.76%
recall: 58.03%
precision: 99.44%


Using 243/1210 (20%)as positives and unlabeling the rest
(243, 18) (243, 18)
Classification results:
f1: 67.92%
roc: 75.22%
recall: 51.81%
precision: 98.58%


Using 122/1210 (10%)as positives and unlab

In [128]:
col = ['Precision','Recall','Fscore','Roc_auc', 'Share of positive']
res = pd.DataFrame(results, columns=col)
res

|   | Precision | Recall | Fscore | Roc_auc | Share of positive |
|---|---|---|---|---|---|
| 0 | 0.946721 | 0.9625 | 0.954545 | 0.919929 | share 100% |
| 1 | 1.0 | 0.666667 | 0.8 | 0.833333 | share 70% |
| 2 | 1.0 | 0.65 | 0.787879 | 0.825 | share 60% |
| 3 | 1.0 | 0.637324 | 0.778495 | 0.818662 | share 50% |
| 4 | 1.0 | 0.603139 | 0.752448 | 0.80157 | share 40% |
| 5 | 0.994382 | 0.580328 | 0.732919 | 0.7876 | share 30% |
| 6 | 0.985782 | 0.518057 | 0.679184 | 0.752195 | share 20% |
| 7 | 0.977695 | 0.522344 | 0.680906 | 0.748593 | share 10% |


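As we can see, the smaller the share of labeled positives, the worse the classification quality: recall falls from 0.67 at a 70% share to about 0.52 at 10-20%, and the F-score from 0.80 to 0.68, while precision stays close to 1.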
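
In the loop above, each share is evaluated on a different set (the unlabeled rows left over after sampling, which is only 34 rows at a 70% share), so the rows of the table are not strictly comparable. A minimal sketch of one way to address this is to hold out a fixed test set once and apply random negative sampling only to the training part. The helper `pu_random_negative_sampling` and the synthetic data are hypothetical, introduced for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def pu_random_negative_sampling(X_train, y_train, share, rng):
    """Fit on labeled positives vs. randomly sampled pseudo-negatives from U."""
    pos_idx = np.flatnonzero(y_train == 1)
    n_labeled = int(np.ceil(share * len(pos_idx)))
    labeled = rng.choice(pos_idx, size=n_labeled, replace=False)  # set P

    # Everything that is not a labeled positive is unlabeled (set U).
    unlabeled = np.setdiff1d(np.arange(len(y_train)), labeled)
    # Random negative sampling: draw as many U rows as there are rows in P.
    n_neg = min(n_labeled, len(unlabeled))
    pseudo_neg = rng.choice(unlabeled, size=n_neg, replace=False)

    idx = np.concatenate([labeled, pseudo_neg])
    pu_target = np.concatenate([np.ones(n_labeled, dtype=int),
                                np.zeros(n_neg, dtype=int)])
    return AdaBoostClassifier(random_state=0).fit(X_train[idx], pu_target)


# Synthetic stand-in data with about 70% positives (an assumption).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.3, 0.7], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

rng = np.random.default_rng(0)
rows = []
for share in (0.7, 0.5, 0.3, 0.1):
    clf = pu_random_negative_sampling(X_train, y_train, share, rng)
    y_pred = clf.predict(X_test)  # the same hold-out set for every share
    rows.append((share,
                 precision_score(y_test, y_pred, zero_division=0),
                 recall_score(y_test, y_pred),
                 f1_score(y_test, y_pred)))

for share, prc, rec, f1 in rows:
    print(f"share {share:.0%}: precision={prc:.3f} recall={rec:.3f} f1={f1:.3f}")
```

Because the hold-out set keeps its true labels and never changes, differences between rows reflect only the amount of labeled positives.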

<b>Bonus question:</b>
Which method do you think is preferable in practice: random negative sampling or the 2-step approach?

I think the 2-step approach would yield higher metrics, while random negative sampling is simpler to implement.
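
For comparison, a 2-step approach can be sketched roughly as follows: first fit a P-versus-U classifier and treat the lowest-scoring unlabeled examples as reliable negatives, then train the final classifier on P against those negatives. This is only one of several possible variants. The synthetic data, the choice of `LogisticRegression`, and the 30% quantile cutoff are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the car dataset).
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

rng = np.random.default_rng(0)
pos_idx = np.flatnonzero(y == 1)
# Label only half of the true positives: this is the set P.
labeled = rng.choice(pos_idx, size=len(pos_idx) // 2, replace=False)
is_labeled = np.zeros(len(y), dtype=bool)
is_labeled[labeled] = True
unlabeled = np.flatnonzero(~is_labeled)  # the set U

# Step 1: fit P vs. U and score every unlabeled example.
step1 = LogisticRegression(max_iter=1000).fit(X, is_labeled.astype(int))
u_scores = step1.predict_proba(X[unlabeled])[:, 1]
# Treat the lowest-scoring 30% of U as reliable negatives (hypothetical cutoff).
cutoff = np.quantile(u_scores, 0.3)
reliable_neg = unlabeled[u_scores <= cutoff]

# Step 2: train the final classifier on P vs. the reliable negatives.
idx = np.concatenate([labeled, reliable_neg])
target = np.concatenate([np.ones(len(labeled), dtype=int),
                         np.zeros(len(reliable_neg), dtype=int)])
step2 = LogisticRegression(max_iter=1000).fit(X[idx], target)

# Share of true negatives among the selected "reliable negatives".
neg_purity = 1.0 - y[reliable_neg].mean()
print(f"reliable negatives: {len(reliable_neg)}, purity: {neg_purity:.3f}")
```

The extra step costs one more model fit and a cutoff to tune, which is the price for a cleaner negative set than random sampling from U provides.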