### Домашнее задание

1. взять любой набор данных для бинарной классификации (можно скачать один из модельных с https://archive.ics.uci.edu/ml/datasets.php)
3. сделать feature engineering
4. обучить любой классификатор (какой вам нравится)
5. далее разделить ваш набор данных на два множества: P (positives) и U (unlabeled). Причем брать нужно не все положительные (класс 1) примеры, а только лишь часть
6. применить random negative sampling для построения классификатора в новых условиях
7. сравнить качество с решением из пункта 4 (построить отчет - таблицу метрик)
8. поэкспериментировать с долей P на шаге 5 (как будет меняться качество модели при уменьшении/увеличении размера P)


**О наборе данных:**
    
Набор данных предоставил прогностические функции, такие как образование, статус занятости, семейное положение, чтобы предсказать, превышает ли зарплата 50 тысяч долларов.

Его можно использовать для решения задач машинного обучения, таких как классификация.

In [1]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score, precision_score, classification_report, precision_recall_curve, confusion_matrix


In [2]:
# № 1:
df = pd.read_csv('C:/Users/User/Desktop/Машинное обучение в бизнесе/df/пред зп/train.csv')
df.head(7)

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income_>50K
0,67,Private,366425,Doctorate,16,Divorced,Exec-managerial,Not-in-family,White,Male,99999,0,60,United-States,1
1,17,Private,244602,12th,8,Never-married,Other-service,Own-child,White,Male,0,0,15,United-States,0
2,31,Private,174201,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States,1
3,58,State-gov,110199,7th-8th,4,Married-civ-spouse,Transport-moving,Husband,White,Male,0,0,40,United-States,0
4,25,State-gov,149248,Some-college,10,Never-married,Other-service,Not-in-family,Black,Male,0,0,40,United-States,0
5,59,State-gov,105363,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,40,United-States,0
6,70,Private,216390,9th,5,Married-civ-spouse,Machine-op-inspct,Wife,White,Female,2653,0,40,United-States,0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43957 entries, 0 to 43956
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              43957 non-null  int64 
 1   workclass        41459 non-null  object
 2   fnlwgt           43957 non-null  int64 
 3   education        43957 non-null  object
 4   educational-num  43957 non-null  int64 
 5   marital-status   43957 non-null  object
 6   occupation       41451 non-null  object
 7   relationship     43957 non-null  object
 8   race             43957 non-null  object
 9   gender           43957 non-null  object
 10  capital-gain     43957 non-null  int64 
 11  capital-loss     43957 non-null  int64 
 12  hours-per-week   43957 non-null  int64 
 13  native-country   43194 non-null  object
 14  income_>50K      43957 non-null  int64 
dtypes: int64(7), object(8)
memory usage: 5.0+ MB


Данных достаточно, так что все дамашнее задание выпольню на 1м файле.

In [4]:
df['income_>50K'].value_counts()

income_>50K
0    33439
1    10518
Name: count, dtype: int64

мы видим, явный дисбаланс распределения, но балансировку мы делать не будем, т.к. при последующем делении на множества у нас могут возникнуть проблемы (п. 4)

In [5]:
#разделим данные на train/test
X_train, X_test, y_train, y_test = train_test_split(df, df['income_>50K'], random_state=0)

In [6]:
# Категориальные признаки закодируем с помощью OneHotEncoding

class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.column]
    
class NumberSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on numeric columns in the data
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]
    
class Normalizer(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key
        
    def fit(self,X, y=None):
        return self
    
    def transform(self, X):
        scaler = MinMaxScaler()
        X[[self.key]] = scaler.fit_transform(X[[self.key]])
        return X[[self.key]]
        
    
class OHEEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key
        self.columns = []

    def fit(self, X, y=None):
        self.columns = [col for col in pd.get_dummies(X, prefix=self.key).columns]
        return self

    def transform(self, X):
        X = pd.get_dummies(X, prefix=self.key)
        test_columns = [col for col in X.columns]
        for col_ in self.columns:
            if col_ not in test_columns:
                X[col_] = 0
        return X[self.columns]

In [7]:
#Зададим списки признаков:
categorical_columns = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'gender', 'native-country']
continuous_columns = ['age', 'fnlwgt', 'educational-num', 'capital-gain', 'capital-loss', 'hours-per-week']

In [8]:
#Теперь нам нужно под каждый признак создать трансформер и объединить их в список (сделаем это в цикле, чтобы не мучиться):
final_transformers = list()

for cat_col in categorical_columns:
    cat_transformer = Pipeline([
                ('selector', FeatureSelector(column=cat_col)),
                ('ohe', OHEEncoder(key=cat_col))
            ])
    final_transformers.append((cat_col, cat_transformer))
    
for cont_col in continuous_columns:
    cont_transformer = Pipeline([
                ('selector', NumberSelector(key=cont_col)),
                ('Min_max', Normalizer(key=cont_col))
            ])
    final_transformers.append((cont_col, cont_transformer))

In [9]:
#Объединим все это в единый пайплайн:
feats = FeatureUnion(final_transformers)

feature_processing = Pipeline([('feats', feats)])

In [10]:
model = ['XGBClassifier', 'XGBClassifier_PU']
fin_precision = []
fin_recall = []
fin_fscore = []

In [11]:
pipeline_xgb = Pipeline([
    ('features',feats),
    ('classifier', XGBClassifier(random_state = 12)),
])

In [12]:
#обучим:
pipeline_xgb.fit(X_train, y_train)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X[[self.key]] = scaler.fit_transform(X[[self.key]])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X[[self.key]] = scaler.fit_transform(X[[self.key]])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X[[self.key]] = scaler.fit_transform(X[[self.key]])
A value is trying to be set on a copy of a slice 

Pipeline(steps=[('features',
                 FeatureUnion(transformer_list=[('workclass',
                                                 Pipeline(steps=[('selector',
                                                                  FeatureSelector(column='workclass')),
                                                                 ('ohe',
                                                                  OHEEncoder(key='workclass'))])),
                                                ('education',
                                                 Pipeline(steps=[('selector',
                                                                  FeatureSelector(column='education')),
                                                                 ('ohe',
                                                                  OHEEncoder(key='education'))])),
                                                ('marital-status',
                                                 Pipeline(steps=[('selec

In [13]:
#наши прогнозы для тестовой выборки
preds_xgb = pipeline_xgb.predict_proba(X_test)[:, 1]
preds_xgb[:10]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X[[self.key]] = scaler.fit_transform(X[[self.key]])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X[[self.key]] = scaler.fit_transform(X[[self.key]])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X[[self.key]] = scaler.fit_transform(X[[self.key]])
A value is trying to be set on a copy of a slice 

array([7.9207110e-01, 4.2868730e-01, 1.8846151e-01, 6.0545275e-04,
       6.3171441e-04, 1.2223635e-04, 1.9925129e-02, 3.1236744e-02,
       1.8771054e-01, 1.3990406e-02], dtype=float32)

In [14]:
#Посчитаем precision/recall/f_score

precision, recall, thresholds = precision_recall_curve(y_test, preds_xgb)

fscore = (2 * precision * recall) / (precision + recall)
# locate the index of the largest f score
ix_xgb = np.argmax(fscore)
print('Best Threshold=%f, F-Score=%.3f, Precision=%.3f, Recall=%.3f' % (thresholds[ix_xgb], 
                                                                        fscore[ix_xgb],
                                                                        precision[ix_xgb],
                                                                        recall[ix_xgb]))

fin_precision.append(precision[ix_xgb])
fin_recall.append(recall[ix_xgb])
fin_fscore.append(fscore[ix_xgb])

Best Threshold=0.359162, F-Score=0.723, Precision=0.683, Recall=0.768


In [15]:
df_new = df.copy()
#берем индексы Р выборки 
p_i = np.where(df_new.iloc[:,-1].values == 1)[0]

np.random.shuffle(p_i) #перемешиваем Р

#Отделяем, после ручной настройки (задание 7) 50 % 
p_sample_len = int(np.ceil(0.5 * len(p_i)))
print(f'Используем {p_sample_len}/{len(p_i)} как Р отстальные уйдут в неизвестные')
p_sample = p_i[:p_sample_len]

Используем 5259/10518 как Р отстальные уйдут в неизвестные


Создаем столбец для новой целевой переменной, где у нас два класса - P (1) и U (-1)

In [16]:
df_new['test'] = -1
df_new.loc[p_sample,'test'] = 1
print('новое распределение:\n', df_new.iloc[:,-1].value_counts())

новое распределение:
 test
-1    38698
 1     5259
Name: count, dtype: int64


In [17]:
df_new = df_new.sample(frac=1)# берем выборку
n_sample = df_new[df_new['test']==-1][:len(df_new[df_new['test']==1])] # в негативные вписываем все -1, но по количеству позитивных, что бы было 50/50
sample_test = df_new[df_new['test']==-1][len(df_new[df_new['test']==1]):] # забираем все оставшиеся неизвесные
p_sample = df_new[df_new['test']==1] 
print(n_sample.shape, p_sample.shape)
sample_train = pd.concat([n_sample, p_sample]).sample(frac=1)

(5259, 16) (5259, 16)


In [18]:
pipeline_xgb_sample = Pipeline([
    ('features',feats),
    ('classifier', XGBClassifier(random_state = 12)),
])

In [19]:
sample_train.iloc[:,:-2].values

array([[26, 'Private', 366900, ..., 0, 40, 'United-States'],
       [50, 'Self-emp-inc', 82578, ..., 0, 40, 'United-States'],
       [43, 'Private', 117037, ..., 2042, 40, 'United-States'],
       ...,
       [56, 'Private', 125000, ..., 0, 40, 'United-States'],
       [35, 'Private', 393673, ..., 0, 40, 'United-States'],
       [30, 'Never-worked', 176673, ..., 0, 40, 'United-States']],
      dtype=object)

In [20]:
#обучим:
pipeline_xgb_sample.fit(sample_train.iloc[:,:-2], 
          sample_train.iloc[:,-2])
y_pred = pipeline_xgb_sample.predict(sample_test.iloc[:,:-2])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X[[self.key]] = scaler.fit_transform(X[[self.key]])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X[[self.key]] = scaler.fit_transform(X[[self.key]])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X[[self.key]] = scaler.fit_transform(X[[self.key]])
A value is trying to be set on a copy of a slice 

In [21]:
#наши прогнозы для тестовой выборки
preds_xgb_sample = pipeline_xgb_sample.predict_proba(X_test)[:, 1]
preds_xgb[:10]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X[[self.key]] = scaler.fit_transform(X[[self.key]])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X[[self.key]] = scaler.fit_transform(X[[self.key]])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X[[self.key]] = scaler.fit_transform(X[[self.key]])
A value is trying to be set on a copy of a slice 

array([7.9207110e-01, 4.2868730e-01, 1.8846151e-01, 6.0545275e-04,
       6.3171441e-04, 1.2223635e-04, 1.9925129e-02, 3.1236744e-02,
       1.8771054e-01, 1.3990406e-02], dtype=float32)

In [22]:
#Посчитаем precision/recall/f_score

precision, recall, thresholds = precision_recall_curve(y_test, preds_xgb_sample)

fscore = (2 * precision * recall) / (precision + recall)
# locate the index of the largest f score
ix_xgb_PU = np.argmax(fscore)
print('Best Threshold=%f, F-Score=%.3f, Precision=%.3f, Recall=%.3f' % (thresholds[ix_xgb_PU], 
                                                                        fscore[ix_xgb_PU],
                                                                        precision[ix_xgb_PU],
                                                                        recall[ix_xgb_PU]))

fin_precision.append(precision[ix_xgb_PU])
fin_recall.append(recall[ix_xgb_PU])
fin_fscore.append(fscore[ix_xgb_PU])

Best Threshold=0.703017, F-Score=0.734, Precision=0.675, Recall=0.805


In [23]:
results = pd.DataFrame(np.column_stack([model, fin_precision, fin_recall, fin_fscore]), 
                               columns=['model', 'precision', 'recall', 'fscore'])

In [24]:
results

Unnamed: 0,model,precision,recall,fscore
0,XGBClassifier,0.6828687967369137,0.7679663608562691,0.7229219143576826
1,XGBClassifier_PU,0.6746717899455652,0.8054281345565749,0.7342742638090259


7) При изменении Р, получаем следующие результаты (fscore):
    
    0.6804865513106048 - 30%
    0.6735081717977954 - 37%
    0.7006439742410304 - 50%
    0.6901615271659325 - 60%

Самый хороший показатель при Р=50 