In [1]:
import pandas as pd

df = pd.read_csv('../assets/data/car.csv')

In [2]:
df.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,acceptability
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


Since most of the features are categorical text, we need to encode them as numbers using the LabelEncoder.


In [3]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
features = [c for c in df.columns if c != 'acceptability'] # every column except the target (this includes the 'Unnamed: 0' index column)
for c in df.columns:
    df[c] = le.fit_transform(df[c]) # refit le on each column (target included) and overwrite it with integer codes

X = df[features] # encoded feature columns
y = df['acceptability'] # encoded target

Notice that we overwrote the original columns with their encoded values for simplicity, since we are not studying feature importance here.  

<b>Check:</b> Is it correct to use the label encoder blindly like this?  


<details>
<summary>Answer</summary>
No. The categorical features have a natural order (the amount of maintenance or how safe the car is, for example), and LabelEncoder assigns integers in alphabetical order of the labels, which ignores that order. It would be more appropriate to do one of the following:  
- Use pd.get_dummies to encode them as binary indicator columns  
- Use a map that correctly assigns a numerical scale to the values, e.g. where med > small  
</details>
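
As a rough illustration of the two options in the answer, here is a minimal sketch on a small made-up frame (not the real car.csv). The name `toy`, the two scale dictionaries, and the assumption that the levels are small/med/big and low/med/high are ours for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame reusing two column names from the car data
toy = pd.DataFrame({
    "lug_boot": ["small", "med", "big", "med"],
    "safety": ["low", "med", "high", "low"],
})

# LabelEncoder sorts the labels alphabetically, so the codes ignore the natural order
le = LabelEncoder().fit(toy["safety"])
print(list(le.classes_))  # ['high', 'low', 'med'] -> high=0, low=1, med=2

# Option 1: an explicit map that respects the order (assumed scales)
safety_scale = {"low": 0, "med": 1, "high": 2}
lug_boot_scale = {"small": 0, "med": 1, "big": 2}
ordinal = pd.DataFrame({
    "lug_boot": toy["lug_boot"].map(lug_boot_scale),
    "safety": toy["safety"].map(safety_scale),
})
print(ordinal)

# Option 2: one binary indicator column per level
dummies = pd.get_dummies(toy, columns=["lug_boot", "safety"])
print(dummies.columns.tolist())
```

The explicit map keeps one column per feature and suits models that use distances, such as KNN; the indicator columns make no assumption about how far apart the levels are.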

The next step is to compute the cross_val_score for the two classifiers:


In [5]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier() # base classifier
bagging = BaggingClassifier(knn, max_samples=0.5, max_features=0.5) # ensemble of KNN models, each fit on a random sample half the size of the training set and a random half of the features

print("KNN Score:\t{}".format(cross_val_score(knn, X, y, cv=5, n_jobs=-1).mean()))
print("Bagging Score:\t{}".format(cross_val_score(bagging, X, y, cv=5, n_jobs=-1).mean()))

KNN Score:	0.643070305149
Bagging Score:	0.710088721602
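
To see what max_samples and max_features do inside the ensemble, the sketch below fits a bagged KNN on synthetic data (an assumption: `make_classification` stands in for the car data) and inspects which feature columns each fitted estimator received. The names `X_demo`, `y_demo` and `bag_demo` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the car dataset): 200 rows, 8 features
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

bag_demo = BaggingClassifier(
    KNeighborsClassifier(),
    n_estimators=10,   # number of KNN models in the ensemble
    max_samples=0.5,   # each model is fit on a sample half the size of the data
    max_features=0.5,  # each model sees a random half of the columns
    random_state=0,
)
bag_demo.fit(X_demo, y_demo)

print(len(bag_demo.estimators_))  # one fitted KNN per estimator
for cols in bag_demo.estimators_features_[:3]:
    print(sorted(cols))  # column indices used by that estimator
```

Each estimator works with 4 of the 8 columns here, and the final prediction aggregates the individual models' votes.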


## Independent practice:
1. Go back to a previous lab where your model had a low accuracy score or classified poorly.
2. Run the base estimator again, and then create a bagged classifier from it.
3. Try changing some of the parameters and look at how that affects your model's score.
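
One possible way to organise step 3 is a small loop over parameter settings. This is only a sketch on synthetic data (`make_classification` is a stand-in for your own lab data), and the grid values are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; replace with the X and y from your own lab
X_prac, y_prac = make_classification(n_samples=300, n_features=10, random_state=0)

base = KNeighborsClassifier()
results = {"base": cross_val_score(base, X_prac, y_prac, cv=5).mean()}

# Try a few ensemble sizes and sample fractions (arbitrary example values)
for n_estimators in (5, 20):
    for max_samples in (0.5, 1.0):
        bagged = BaggingClassifier(
            base,
            n_estimators=n_estimators,
            max_samples=max_samples,
            random_state=0,
        )
        score = cross_val_score(bagged, X_prac, y_prac, cv=5).mean()
        results[(n_estimators, max_samples)] = score

for setting, score in results.items():
    print(setting, round(score, 3))
```

Comparing each row against the "base" entry shows whether bagging helps for a given setting.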