## Pseudo-Labeling for Semi-Supervised Learning

Because I can confidently label only a small portion of the larger Bexar property dataset, pseudo-labeling offers a way to train a model on the full dataset. First, I train a series of models on the small labeled dataset and pick the best one. Then I use that trained model to label the far larger unlabeled dataset. Lastly, I train a new model on a dataset containing both the original labeled data and the new pseudo-labeled data.
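The three steps above can be sketched on synthetic data. This is only a minimal illustration, not the notebook's actual pipeline: the data comes from `make_classification`, and the names `X_labeled`, `X_unlabeled`, `base_model`, and `final_model` are hypothetical stand-ins.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (hypothetical; not the Bexar dataset)
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_labeled, y_labeled = X[:200], y[:200]  # small labeled set
X_unlabeled = X[200:]                    # labels treated as unknown

# Step 1: train on the small labeled set
base_model = RandomForestClassifier(random_state=0).fit(X_labeled, y_labeled)

# Step 2: predict pseudo-labels for the unlabeled set
pseudo_labels = base_model.predict(X_unlabeled)

# Step 3: retrain on the labeled and pseudo-labeled data combined
X_combined = np.vstack([X_labeled, X_unlabeled])
y_combined = np.concatenate([y_labeled, pseudo_labels])
final_model = RandomForestClassifier(random_state=0).fit(X_combined, y_combined)

print(X_combined.shape, y_combined.shape)
```

The second model can only be as good as the pseudo-labels it is given, so the quality of the first model matters a great deal.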

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')

from sklearn.svm import SVC
from sklearn.utils import shuffle
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix, recall_score, precision_score, make_scorer

#### Load Data and Separate Labeled and Unlabeled Dataframes

In [2]:
# read in data
labeled_df = pd.read_hdf('../data/processed/bexar_true_labels.h5')
all_df = pd.read_hdf('../data/processed/bexar_processed.h5')

In [3]:
# Treat a missing price per square foot as 0
all_df['price_psf'] = all_df['price_psf'].fillna(0)
# Trim outlier properties (top 0.1% by price per square foot)
trim_prop_df = all_df[(all_df.price_psf<all_df.price_psf.quantile(.999))]

# Keep non-object columns with fewer than nan_limit missing values
nan_limit = 70000
check_nan = trim_prop_df.isnull().sum()
variables_list = check_nan[check_nan<nan_limit].index
variables_list = variables_list[variables_list.isin(trim_prop_df.columns[trim_prop_df.dtypes!='object'])]
# Exclude columns not used as model features
variables_list = variables_list.drop([
    'py_owner_id','py_addr_zip_cass','prop_val_yr','appraised_val',
    'Prior_Mkt_Val','bexar_2015_market_value','bexar_2016_market_value',
    'bexar_2017_market_value','bexar_2018_market_value','owner_zip_code',
    'property_zip','neighborhood_code'
])
# Keep only the selected columns, then drop rows with any remaining NaN
sub_df = trim_prop_df[variables_list]
sub_df = sub_df.dropna()

In [4]:
# Grab only properties not already in the labeled dataset
unlabeled_df = sub_df[~(sub_df['prop_id'].isin(labeled_df.prop_id))]

# Select features by position (every column except the first and the last);
# crim_prop is the target
X_train = labeled_df.iloc[:,1:-1]
y_train = labeled_df.crim_prop
X_test = unlabeled_df.iloc[:,1:-1]

In [8]:
# List of models to try
models = [
    SVC(),
    XGBClassifier(),
    RandomForestClassifier(),
    GradientBoostingClassifier()
]

for model in models:
    print(type(model).__name__)
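    # Setting a plain `seed` attribute has no effect; random_state is the
    # parameter these estimators read, and it survives cloning during CV
    model.set_params(random_state=42)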
    num_folds = 5

    scores = cross_val_score(model, X_train, y_train, cv=num_folds)
    
    model.fit(X_train,y_train)
    y_pred_train = model.predict(X_train)
    print(confusion_matrix(y_train,y_pred_train,labels=[1,0]))
    print('Recall is',recall_score(y_train,y_pred_train))
    print('Precision is',precision_score(y_train,y_pred_train))
    print("CV scores:",scores,'\n','*'*40,'\n')

SVC
[[   0   47]
 [   0 5516]]
Recall is 0.0
Precision is 0.0
CV scores: [0.99191375 0.99101527 0.99101527 0.99190647 0.99190647] 
 **************************************** 

XGBClassifier
[[  39    8]
 [   0 5516]]
Recall is 0.8297872340425532
Precision is 1.0
CV scores: [0.99371069 0.99371069 0.99281222 0.99280576 0.77517986] 
 **************************************** 

RandomForestClassifier
[[  47    0]
 [   0 5516]]
Recall is 1.0
Precision is 1.0
CV scores: [0.99191375 0.99191375 0.99191375 0.99100719 0.95953237] 
 **************************************** 

GradientBoostingClassifier
[[  47    0]
 [   0 5516]]
Recall is 1.0
Precision is 1.0
CV scores: [0.99371069 0.99191375 0.99371069 0.99370504 0.7221223 ] 
 **************************************** 



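On the training data, both Random Forest and Gradient Boosting classify every property correctly, but Random Forest's cross-validation accuracy is higher and more stable across folds (its lowest fold is 0.960, versus 0.722 for Gradient Boosting and 0.775 for XGBoost). SVC's cross-validation accuracy is marginally higher still, but only because it predicts the majority class for every property: it finds none of the 47 positive cases, so its recall is 0. I therefore use Random Forest to generate the pseudo-labels.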
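Only 47 of the 5,563 labeled properties are positive, so a model that always predicts the negative class already reaches about 0.992 accuracy (5516 / 5563). A hedged sketch of one way to make the comparison more informative is to score cross-validation by recall on stratified folds. The data below is synthetic, and `DummyClassifier` is used purely as an assumed majority-class baseline; neither appears in the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (hypothetical): only a few percent positives
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=10, weights=[0.98, 0.02], random_state=0
)

# Stratified folds keep some positive cases in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
recall_scorer = make_scorer(recall_score, zero_division=0)

results = {}
for clf in [DummyClassifier(strategy="most_frequent"),
            RandomForestClassifier(random_state=0)]:
    acc = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="accuracy")
    rec = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring=recall_scorer)
    results[type(clf).__name__] = (acc.mean(), rec.mean())
    print(type(clf).__name__, "accuracy:", round(acc.mean(), 3),
          "recall:", round(rec.mean(), 3))
```

The baseline scores high on accuracy while its recall is zero, which mirrors the SVC result above.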

In [9]:
# Train model
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
y_pred_train = rfc.predict(X_train)
print(confusion_matrix(y_train,y_pred_train,labels=[1,0]))
print('Recall is',recall_score(y_train,y_pred_train))
print('Precision is',precision_score(y_train,y_pred_train))


# Predict pseudo-labels on unlabeled data
pseudo_labels = rfc.predict(X_test)

# Add pseudo-labels to test
augmented_test = X_test.copy(deep=True)
augmented_test['crim_prop'] = pseudo_labels

[[  47    0]
 [   0 5516]]
Recall is 1.0
Precision is 1.0
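The cell above keeps every predicted label. One common variation, shown here only as a sketch on synthetic data, is to keep a pseudo-label only when the model's predicted probability clears a cutoff. The `threshold` value of 0.9 and all variable names are assumptions for illustration, not choices made in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (hypothetical)
X_syn, y_syn = make_classification(n_samples=2000, n_features=10, random_state=1)
X_lab, y_lab, X_unlab = X_syn[:300], y_syn[:300], X_syn[300:]

clf = RandomForestClassifier(random_state=1).fit(X_lab, y_lab)

# Probability assigned to the predicted class for each unlabeled row
proba = clf.predict_proba(X_unlab)
confidence = proba.max(axis=1)
predicted = clf.classes_[proba.argmax(axis=1)]

threshold = 0.9  # hypothetical cutoff; tune for the problem at hand
keep = confidence >= threshold
X_confident, y_confident = X_unlab[keep], predicted[keep]
print(f"kept {keep.sum()} of {len(X_unlab)} pseudo-labels")
```

Filtering this way trades the number of pseudo-labeled rows for their expected quality.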


In [10]:
# Take a fraction of the pseudo-labeled data to combine with the labeled training data
sampled_test = augmented_test.sample(frac=.2)
len(sampled_test)

115611

In [11]:
# Recombine the labeled features and target into one dataframe
temp_train = pd.concat([X_train,y_train],axis=1)
# Concat labeled data with pseudo-labeled data
augmented_train = pd.concat([sampled_test,temp_train])

In [13]:
# Train new Random Forest model
rfc_aug = RandomForestClassifier()
rfc_aug.fit(augmented_train.iloc[:,:-1], augmented_train.crim_prop)
y_pred_train = rfc_aug.predict(augmented_train.iloc[:,:-1])
print(confusion_matrix(augmented_train.crim_prop,y_pred_train,labels=[1,0]))
print('Recall is',recall_score(augmented_train.crim_prop,y_pred_train))
print('Precision is',precision_score(augmented_train.crim_prop,y_pred_train))

[[  1374      0]
 [     0 119800]]
Recall is 1.0
Precision is 1.0
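The scores above are computed on the same rows the model was trained on, and most of those rows carry labels produced by the first model. A sketch of one way to check whether pseudo-labeling actually helps is to reserve some true-labeled rows before training and score both models on them. The data is synthetic and every name below is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical); pretend only the first 1000 rows have true labels
X_all, y_all = make_classification(
    n_samples=4000, n_features=10, weights=[0.9, 0.1], random_state=2
)
X_lab, y_lab, X_unlab = X_all[:1000], y_all[:1000], X_all[1000:]

# Reserve true-labeled rows that neither model sees during training
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_lab, y_lab, test_size=0.3, stratify=y_lab, random_state=2
)

# Base model on labeled data only, then a model on labeled + pseudo-labeled data
base = RandomForestClassifier(random_state=2).fit(X_tr, y_tr)
pseudo = base.predict(X_unlab)
aug = RandomForestClassifier(random_state=2).fit(
    np.vstack([X_tr, X_unlab]), np.concatenate([y_tr, pseudo])
)

# Compare both models on the held-out true labels
holdout_scores = {}
for name, fitted in [("base", base), ("augmented", aug)]:
    pred = fitted.predict(X_hold)
    holdout_scores[name] = (
        recall_score(y_hold, pred, zero_division=0),
        precision_score(y_hold, pred, zero_division=0),
    )
    print(f"{name}: recall={holdout_scores[name][0]:.3f}, "
          f"precision={holdout_scores[name][1]:.3f}")
```

If the augmented model does not match or beat the base model on the held-out rows, the pseudo-labels are probably adding noise rather than signal.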
