

The **Australian** credit approval dataset illustrates multiple practical issues.
* Missing feature values: These are very common in practice and can be handled in various ways.
 * Removing instances with missing feature values. This may be fine if the number of such instances is small. However, the issue still has to be handled when the predictor is deployed.
 * Imputing (predicting and filling in) the missing values. This may hurt performance if the imputed values are poor predictions.
 * Encoding the missing value as a special value. This may be appropriate if the value is not missing at random and being missing actually provides some information.
 * Algorithm-specific methods. Decision trees have specific methods for handling missing values (e.g. averaging over both paths down the tree at the missing node), but this is not implemented in scikit-learn. Generative models handle missing values naturally as part of probabilistic inference.
* The Australian dataset has a mix of continuous and categorical features.
* For confidentiality purposes, feature names and values have been changed into meaningless symbols in the dataset. As a result, we cannot exploit our knowledge of the problem to construct better predictors. One question is how to handle the categorical features.
 * If the feature is an ordinal variable, i.e. the values are ordered, it may sometimes be useful to map the values to integers or reals, particularly if we expect the target value to change monotonically with the value of the feature.
 * If the feature is a nominal variable, i.e. the values cannot be ordered, mapping the values to integers or reals may not make sense. Some decision tree algorithms can do a multiway split at the variable, with one child for each possible value. However, scikit-learn simply treats all variables as integers/reals and does binary tests with $\leq$.
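
The ordinal/nominal distinction above can be sketched in code. This is a minimal illustration, not part of the original experiment: the feature values (`low`/`medium`/`high`, colour names) are made up, since the real Australian features are anonymised.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical ordinal feature: the order low < medium < high is meaningful,
# so we pass the order explicitly and map the values to 0, 1, 2.
sizes = np.array([["low"], ["high"], ["medium"], ["low"]])
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
sizes_encoded = ordinal.fit_transform(sizes)
print(sizes_encoded.ravel())

# Hypothetical nominal feature: there is no natural order, so each value
# gets its own 0/1 indicator column instead of an integer code.
colours = np.array([["red"], ["green"], ["blue"], ["green"]])
onehot = OneHotEncoder(handle_unknown="ignore")
colours_encoded = onehot.fit_transform(colours).toarray()
print(onehot.categories_[0])
print(colours_encoded)
```

With indicator columns, a binary test with $\leq$ on one column amounts to asking "is the value equal to this category?", which is usually a more sensible question for a nominal variable than a threshold on an arbitrary integer code.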
 
 
In this dataset, missing values are handled by imputation: the mean is used for continuous variables and the mode for categorical variables. For datasets like this, one strength of decision trees is that they are somewhat less sensitive to how the variables are preprocessed.
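
The dataset used here arrives already imputed, but a rough sketch of how mean/mode imputation might be done with scikit-learn is given below. The toy array and the choice of which column is categorical are assumptions made for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): column 0 is continuous, column 1 is a categorical code.
X = np.array([[1.0, 0.0],
              [np.nan, 1.0],
              [3.0, 1.0],
              [5.0, np.nan]])

# Mean for the continuous column, mode for the categorical column.
mean_imputer = SimpleImputer(strategy="mean")
mode_imputer = SimpleImputer(strategy="most_frequent")
X_imputed = np.column_stack([
    mean_imputer.fit_transform(X[:, [0]]),
    mode_imputer.fit_transform(X[:, [1]]),
])
print(X_imputed)

# Alternative: also append a 0/1 "was missing" indicator for each column that
# had missing values, in case missingness itself carries information.
indicator_imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_with_flags = indicator_imputer.fit_transform(X)
print(X_with_flags.shape)
```

In a real experiment the imputers would be fit on the training split only and then applied to the test split, in the same way as the scaler in the cell below.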


In [12]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [9]:
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import AdaBoostClassifier
from sklearn import neighbors
from sklearn import tree
from sklearn.metrics import accuracy_score
import os

os.chdir("/content/drive/My Drive/Colab Notebooks/session 2")

X, y = load_svmlight_file("australian.txt")
# Should really repeat this with many random splits to get more reliable results; try various splits
# by changing random_state
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Scale data (the scaler is fit on the training split only)
scaler = preprocessing.MaxAbsScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Decision tree on original data
clf = tree.DecisionTreeClassifier(random_state=42)
clf = clf.fit(X_train, y_train)
predict = clf.predict(X_test)
accuracy = accuracy_score(y_test, predict)
print("DT accuracy with original data: " + "{0:.2f}".format(accuracy))

# Decision tree on scaled data
clf = tree.DecisionTreeClassifier(random_state=42)
clf = clf.fit(X_train_scaled, y_train)
predict = clf.predict(X_test_scaled)
accuracy = accuracy_score(y_test, predict)
print("DT accuracy with scaled data: " + "{0:.2f}".format(accuracy) + "\n")

# Nearest neighbour on original data
clf = neighbors.KNeighborsClassifier(1)
clf = clf.fit(X_train, y_train)
predict = clf.predict(X_test)
accuracy = accuracy_score(y_test, predict)
print("NN accuracy with original data: " + "{0:.2f}".format(accuracy))

# Nearest neighbour on scaled data
clf = neighbors.KNeighborsClassifier(1)
clf = clf.fit(X_train_scaled, y_train)
predict = clf.predict(X_test_scaled)
accuracy = accuracy_score(y_test, predict)
print("NN accuracy with scaled data: " + "{0:.2f}".format(accuracy) + "\n")

# Boosted decision tree on original data
clf = AdaBoostClassifier(tree.DecisionTreeClassifier(max_depth=5),
                         n_estimators=50,random_state=42)
clf = clf.fit(X_train, y_train)
predict = clf.predict(X_test)
accuracy = accuracy_score(y_test, predict)
print("Boosted decision tree accuracy with original data: " + "{0:.2f}".format(accuracy))

DT accuracy with original data: 0.82
DT accuracy with scaled data: 0.81

NN accuracy with original data: 0.67
NN accuracy with scaled data: 0.80

Boosted decision tree accuracy with original data: 0.87
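
The comment in the cell above suggests repeating the experiment over many random splits. One way this might be done is sketched below. It uses a synthetic dataset from `make_classification` as a stand-in (an assumption, so that the sketch runs without `australian.txt`), and the helper `mean_accuracy` is a hypothetical name introduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption: not the Australian dataset).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X[:, 0] *= 1000.0  # give one feature a much larger scale than the others

def mean_accuracy(make_model, n_splits=10):
    """Mean and standard deviation of test accuracy over several random splits."""
    scores = []
    for seed in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.33, random_state=seed)
        model = make_model().fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    return np.mean(scores), np.std(scores)

# Pipelines refit the scaler on each training split, so no test data leaks in.
models = {
    "DT original": lambda: DecisionTreeClassifier(random_state=42),
    "DT scaled": lambda: make_pipeline(MaxAbsScaler(), DecisionTreeClassifier(random_state=42)),
    "NN original": lambda: KNeighborsClassifier(1),
    "NN scaled": lambda: make_pipeline(MaxAbsScaler(), KNeighborsClassifier(1)),
}
results = {name: mean_accuracy(make) for name, make in models.items()}
for name, (mean, std) in results.items():
    print(f"{name}: {mean:.2f} +/- {std:.2f}")
```

Reporting the spread alongside the mean helps judge whether a gap such as 0.82 versus 0.81 for the decision tree is larger than the variation between splits.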


In the **Adult Census dataset** we are given information about individuals obtained from the US census and asked to predict whether the person earns more than US\$50K a year. The features consist of 
* Continuous features "Age", "fnlwgt", "Education-Num", "Capital Gain", "Capital Loss", "Hours per week", and
* Categorical features "Workclass", "Education", "Marital Status", "Occupation", "Relationship", "Race", "Sex", "Country". 

Categorical features cannot be used directly in a linear classifier. One simple way to handle this is to map each category of a feature to an integer.

**Archipelago:** Before running the exercise, think of a potentially better way to handle categorical variables and write it down. Run the experiment, then discuss.

In [24]:
import numpy as np
from sklearn import preprocessing
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import os
os.chdir("/content/drive/My Drive/Colab Notebooks/session 2")

feature_names = ["Age", "Workclass", "fnlwgt", "Education", "Education-Num", "Marital Status",
                 "Occupation", "Relationship", "Race", "Sex", "Capital Gain", "Capital Loss",
                 "Hours per week", "Country"]
data = np.genfromtxt('adult.txt', delimiter=', ', dtype=str)

# Extract labels
labels = data[:,14]
le = preprocessing.LabelEncoder()
le.fit(labels)
labels = le.transform(labels)
class_names = le.classes_
data = data[:,:-1]

# Transform categorical features into integers
categorical_features = [1,3,5,6,7,8,9,13]
categorical_names = {}
for feature in categorical_features:
    le = preprocessing.LabelEncoder()
    le.fit(data[:, feature])
    data[:, feature] = le.transform(data[:, feature])
    categorical_names[feature] = le.classes_
data = data.astype(float)

# Scale and split data
scaler = preprocessing.MaxAbsScaler()
data_scaled = scaler.fit_transform(data)
X_train, X_test, y_train, y_test = train_test_split(data_scaled, labels, test_size=0.33, random_state=42)

# SVM classifier
clf = svm.LinearSVC(loss='hinge', C=1, random_state = 42)
clf.fit(X_train, y_train)
predict = clf.predict(X_test)
accuracy = accuracy_score(y_test, predict)
print("SVM accuracy: " + "{0:.2f}".format(accuracy))

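# Now use a one-hot encoder. Note: it is fit on all 14 columns, so the continuous
# features are also expanded into one indicator column per distinct value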
encoder = preprocessing.OneHotEncoder(categories = "auto")
encoder.fit(data)
print(data.shape)

encoded_data = encoder.transform(data)
data_onehot_scaled = scaler.fit_transform(encoded_data)
print(data_onehot_scaled.shape)
X_train, X_test, y_train, y_test = train_test_split(data_onehot_scaled, labels, test_size=0.33, random_state=42)

# SVM on data with categorical features encoded using one hot encoding
clf = svm.LinearSVC(loss='hinge', C=1, random_state = 42)
clf.fit(X_train, y_train)
predict = clf.predict(X_test)
accuracy = accuracy_score(y_test, predict)
print("SVM one hot encoding accuracy: " + "{0:.2f}".format(accuracy))



SVM accuracy: 0.82
(32561, 14)
(32561, 22144)
SVM one hot encoding accuracy: 0.86
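
The cell above fits `OneHotEncoder` on all 14 columns, so every distinct value of the continuous features also gets its own indicator column, which is consistent with the jump from 14 to 22144 columns. A possible variant that encodes only the categorical columns is sketched below with `ColumnTransformer`. The toy rows and the column index list `toy_categorical` are made up for illustration, and no accuracy is claimed for this variant on the Adult data.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MaxAbsScaler, OneHotEncoder

# Toy rows (hypothetical values): age, workclass code, hours per week.
toy = np.array([[39.0, 0.0, 40.0],
                [50.0, 1.0, 13.0],
                [38.0, 2.0, 40.0],
                [53.0, 1.0, 45.0]])
toy_categorical = [1]  # indices of the categorical columns

preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), toy_categorical)],
    remainder=MaxAbsScaler(),  # continuous columns are scaled, not expanded
)
toy_encoded = preprocess.fit_transform(toy)
print(toy_encoded.shape)  # 3 indicator columns + 2 continuous columns
```

For the Adult data, the same idea would use the `categorical_features` list defined in the cell above in place of `toy_categorical`.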




Text Classification

In [14]:
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectKBest, chi2
from sklearn import neighbors
from sklearn import tree

# Select only 4 categories to speed things up
categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']

# Fetch training and test sets
twenty_train = fetch_20newsgroups(subset='train', remove=('headers','footers','quotes'),
                                  categories=categories, shuffle=True, random_state=42)
twenty_test = fetch_20newsgroups(subset='test', remove=('headers','footers','quotes'),
                                 categories=categories, shuffle=True, random_state=42)

# Use tfidf
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(twenty_train.data)
vectors_test = vectorizer.transform(twenty_test.data)

clf = svm.LinearSVC(loss='hinge', C=100)
clf.fit(vectors, twenty_train.target)
predict = clf.predict(vectors_test)
accuracy = accuracy_score(twenty_test.target, predict)
print("SVM accuracy: " + "{0:.2f}".format(accuracy))

fs = SelectKBest(chi2, k=100)
vectors_fs = fs.fit_transform(vectors, twenty_train.target)
vectors_test_fs = fs.transform(vectors_test)

clf = neighbors.KNeighborsClassifier(1)
clf.fit(vectors_fs, twenty_train.target)
predict = clf.predict(vectors_test_fs)
accuracy = accuracy_score(twenty_test.target, predict)
print("NN accuracy with 100 features (prev best NN performer): " + "{0:.2f}".format(accuracy))

clf = tree.DecisionTreeClassifier(random_state=42)
clf.fit(vectors, twenty_train.target)
predict = clf.predict(vectors_test)
accuracy = accuracy_score(twenty_test.target, predict)
print("Decision Tree accuracy of : " + "{0:.2f}".format(accuracy))

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


SVM accuracy: 0.82
NN accuracy with 100 features (prev best NN performer): 0.60
Decision Tree accuracy of : 0.61
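
The vectorise, select and classify steps in the cell above can also be chained into a single scikit-learn pipeline, so that the same transformations are applied to training and test text automatically. The sketch below uses a tiny made-up corpus with hypothetical labels so that it runs without downloading the 20 newsgroups data; `k=5` is chosen only to suit the toy vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus (assumption: stands in for the newsgroup posts).
docs = ["the doctor prescribed medicine for the patient",
        "the patient recovered after the surgery",
        "render the image with a graphics card",
        "the graphics driver draws each pixel"]
targets = [0, 0, 1, 1]  # 0 = medicine, 1 = graphics (hypothetical labels)

text_clf = make_pipeline(
    TfidfVectorizer(),        # text -> tf-idf vectors
    SelectKBest(chi2, k=5),   # keep the 5 terms most associated with the labels
    KNeighborsClassifier(1),  # 1-nearest-neighbour on the reduced vectors
)
text_clf.fit(docs, targets)
predicted = text_clf.predict(["a new graphics card", "medicine for the patient"])
print(predicted)
```

Because the selector is fit inside the pipeline, the chi-squared scores are computed from the training documents only, matching the `fit_transform` / `transform` pattern used in the cell above.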
