In [1]:
from IPython.display import HTML
import pandas as pd

# K-Means vs. Multi-Class AdaBoost for Parcel Condition


### Nate Ron-Ferguson
### COMP 8720


## Background


* Research area
* K-Means is a popular method for classifying neighborhood condition, despite its limitations







## Research Question
### Will an ensemble classification method outperform K-Means at classifying property condition, in terms of both training time and accuracy?



## 2 Parts

* Performance on benchmark datasets

* Performance on property condition dataset

## AdaBoost - A Brief Introduction

<img src="./files/boosting_toy1.png">

<img src="./files/boosting_toy2.png">

<img src="./files/boosting_toy3.png">

<img src="./files/boosting_final.png">

## Adaptive Boosting (AdaBoost)






<img src="./files/adaboost.png">

## Multi-Class AdaBoost - SAMME


* Stagewise Additive Modeling using a Multi-class Exponential loss function

* Extends the original AdaBoost by adding a $\log(K-1)$ term to the classifier weight $\alpha$, where $K$ is the number of classes [2]



<img src="./files/samme.png">


$\alpha^{(m)} = \log \frac{1-err^{(m)}}{err^{(m)}} + \log(K-1)$




Setting $K = 2$ makes the $\log(K-1)$ term vanish, so SAMME reduces to the original AdaBoost.
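
As a quick numeric check of the formula above, the sketch below computes $\alpha^{(m)}$ and applies one SAMME-style reweighting round in the form described by Zhu et al. [2]. The function names (`samme_alpha`, `reweight`) and the toy labels are illustrative assumptions, not code from this project.

```python
import math

def samme_alpha(err, n_classes):
    """Classifier weight: log((1 - err) / err) + log(K - 1)."""
    return math.log((1.0 - err) / err) + math.log(n_classes - 1)

def reweight(weights, y_true, y_pred, n_classes):
    """One boosting round: weighted error, alpha, and renormalized sample weights."""
    total = sum(weights)
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    alpha = samme_alpha(err, n_classes)
    # Misclassified samples are scaled up by exp(alpha); correct ones are left alone.
    new_w = [w * math.exp(alpha) if t != p else w
             for w, t, p in zip(weights, y_true, y_pred)]
    norm = sum(new_w)
    return err, alpha, [w / norm for w in new_w]

# Toy example (hypothetical data): K = 3 classes, one of five samples misclassified.
y_true = [0, 1, 2, 1, 0]
y_pred = [0, 1, 2, 2, 0]
weights = [0.2] * 5
err, alpha, weights = reweight(weights, y_true, y_pred, n_classes=3)

# With K = 2 the extra term is log(1) = 0, which recovers the AdaBoost weight.
alpha_binary = samme_alpha(0.2, n_classes=2)
```

Because of the added $\log(K-1)$ term, $\alpha^{(m)}$ stays positive whenever $err^{(m)} < 1 - 1/K$, so a weak learner only has to beat random guessing among $K$ classes rather than reach 50% accuracy.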

# Implementation

## Tools
### Scikit-learn
* KMeans
* AdaBoostClassifier
    * Decision Tree Classifier with Depth = 1
* train_test_split

## Method
* Data split into training and test sets
* `test_size` ranged from 10% to 45% of each dataset
* Accuracy estimated as the percentage of test samples whose predicted class matched the true label
* Performance measured as the training time required at each training-set size
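
The steps above can be sketched roughly as follows for the AdaBoost side of the comparison. This is a minimal illustration, not the project's code: scikit-learn's bundled `load_wine` stands in for the UCI Wine file, and the 5% step between test sizes, `n_estimators=50`, `random_state=0`, and stratified splitting are all assumptions.

```python
import time

from sklearn.datasets import load_wine
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# load_wine stands in for the UCI Wine file; the project's own loaders are not shown.
X, y = load_wine(return_X_y=True)

results = []
# Assumed 5% steps across the 10% to 45% range given in the slides.
for test_size in (0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0, stratify=y
    )
    # Depth-1 decision tree (a stump) as the weak learner, passed positionally.
    model = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50)

    # Time only the fit call, since performance is defined as training time.
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start

    # score() returns the fraction of test samples labeled correctly.
    accuracy = model.score(X_test, y_test)
    results.append((test_size, len(X_train), accuracy, train_seconds))

for test_size, n_train, accuracy, train_seconds in results:
    print(f"test_size={test_size:.2f} n_train={n_train} "
          f"accuracy={accuracy:.3f} train_time={train_seconds:.4f}s")
```

Timings from a single run like this are noisy on small datasets; repeating each fit several times and averaging would give a steadier curve.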

## Data - UCI
* 10 datasets from the UCI Machine Learning Repository
* Numeric classification datasets that vary in number of samples and number of attributes
    * Breast Cancer Wisconsin (breast-cancer-wisconsin)
    * Daily Sports Activity (daily_activity)
    * Energy efficiency (energy_efficiency)
    * Steel Plates Faults (faults)
    * Haberman's Survival (haberman)
    * Ionosphere (ionosphere)
    * Statlog (shuttle)
    * Spambase (spambase)
    * Urban Land Cover (urban_land_cover)
    * Wine (wine)


## Data - Memphis Property Hub
* Compilation of City of Memphis property survey data and Shelby County Assessor's Certified Roll

In [3]:
pd.read_csv('./files/prophub_sample.csv')

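| Unnamed: 0.1 | Unnamed: 0 | litter | vegetation | trash | dumping | tree | construction | rent | vehicle | siding | ... | entry | boarded | rtotapr | yrblt | rmbed | fixbath | sfla | late_fees | num_sales | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 92700 | 1964 | 3 | 3 | 1774 | 0.0 | 5 | 0 |
| 1 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 64800 | 1950 | 2 | 1 | 1508 | 0.0 | 2 | 2 |
| 2 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 43300 | 1954 | 2 | 1 | 780 | 0.0 | 0 | 2 |
| 3 | 13 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 56500 | 1960 | 3 | 1 | 1170 | 0.0 | 0 | 3 |
| 4 | 14 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 54700 | 1960 | 3 | 1 | 1095 | 55.8 | 1 | 3 |
| 5 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 64600 | 1960 | 3 | 1 | 1431 | 0.0 | 0 | 3 |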


## Results

### Accuracy - AdaBoost

<img src="./files/adaboost_error_sample_size.png">

### Accuracy - K-Means

<img src="./files/km_error_sample_size.png">
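
K-Means returns arbitrary cluster indices rather than class labels, so scoring it against labeled data requires some mapping from clusters to classes. The slides do not say which mapping was used; the sketch below shows one common choice, a majority vote within each cluster. The helper names and the toy arrays are hypothetical.

```python
from collections import Counter, defaultdict

def map_clusters_to_labels(cluster_ids, y_true):
    """Assign each cluster the most common true label among its members."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, y_true):
        members[cluster].append(label)
    return {c: Counter(labels).most_common(1)[0][0] for c, labels in members.items()}

def cluster_accuracy(cluster_ids, y_true):
    """Fraction of samples whose mapped cluster label equals the true label."""
    mapping = map_clusters_to_labels(cluster_ids, y_true)
    hits = sum(mapping[c] == t for c, t in zip(cluster_ids, y_true))
    return hits / len(y_true)

# Hypothetical cluster assignments (e.g. from KMeans.labels_) and true classes.
cluster_ids = [2, 2, 0, 0, 0, 1, 1, 2]
y_true = [0, 0, 1, 1, 2, 2, 2, 0]
mapping = map_clusters_to_labels(cluster_ids, y_true)
accuracy = cluster_accuracy(cluster_ids, y_true)
```

In a train/test setting the mapping would be learned from the training clusters and then applied to predictions on the test set, so that the reported accuracy is not fitted to the data it is scored on.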

### Performance - Training Time by Training Size

<img src="./files/train_size_time_breast-cancer-wisconsin.png">

<img src="./files/train_size_time_data_banknote_authentication.png">

<img src="./files/train_size_time_faults.png">

<img src="./files/train_size_time_haberman.png">

<img src="./files/train_size_time_ionosphere.png">

<img src="./files/train_size_time_shuttle.png">

<img src="./files/train_size_time_Skin_NonSkin.png">

<img src="./files/train_size_time_spambase.png">

<img src="./files/train_size_time_urban_land_cover.png">

<img src="./files/train_size_time_wine.png">

<img src="./files/train_size_time_prophub.png">

## References
[1] R. E. Schapire, “Explaining AdaBoost,” in Empirical Inference, B. Schölkopf, Z. Luo, and V. Vovk, Eds. Springer Berlin Heidelberg, 2013, pp. 37–52.

[2] J. Zhu, H. Zou, S. Rosset, and T. Hastie, “Multi-class AdaBoost,” Statistics and its Interface, vol. 2, no. 3, pp. 349–360, 2009.


