## Approximation I

## Linear Functions

### Text classification
**Twenty Newsgroups Dataset**

According to the dataset website, http://qwone.com/~jason/20Newsgroups/:

*The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.*

The following example is adapted from the scikit-learn tutorial: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html


In [None]:
%matplotlib inline  

import numpy as np
from sklearn.datasets import fetch_20newsgroups

# Select only 4 categories to speed things up
categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']

# Fetch training and test sets
twenty_train = fetch_20newsgroups(subset='train',
                                  categories=categories, shuffle=True, random_state=42)
twenty_test = fetch_20newsgroups(subset='test',
                                 categories=categories, shuffle=True, random_state=42)

# Have a look at the first few lines of the first document and its category
print("\n".join(twenty_train.data[0].split("\n")[:3]))
print("\n" + twenty_train.target_names[twenty_train.target[0]])

We tokenize the documents, count the occurrences of each word, and convert the counts into a TF-IDF representation. Then we run a linear SVM and a radial basis function (RBF) SVM on the result.

Try various parameters to see if the RBF SVM can outperform the linear SVM.
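
As a minimal sketch of what the vectorization step does, the snippet below applies `CountVectorizer` and `TfidfTransformer` to three toy sentences (the sentences are made up for illustration and are not part of the dataset). With the default settings, each row of the TF-IDF matrix is scaled to unit Euclidean length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Toy documents, invented for illustration only
toy_docs = ["the cat sat on the mat",
            "the dog sat on the log",
            "cats and dogs"]

# Step 1: tokenize and count word occurrences (one row per document)
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

# Step 2: reweight the counts by inverse document frequency
toy_tfidf = TfidfTransformer().fit_transform(toy_counts)

print(sorted(toy_vect.vocabulary_))       # the learned vocabulary
print(toy_counts.toarray())               # raw counts
print(np.round(toy_tfidf.toarray(), 2))   # TF-IDF weights
```

Words that occur in many documents (here "the") receive a lower weight per occurrence than words that occur in only one.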

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn import svm

# The pipeline tokenizes the documents and converts them into a TF-IDF
# representation before passing them to the classifier
linSVM = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', svm.LinearSVC(loss='hinge', C=10)),])

linSVM.fit(twenty_train.data, twenty_train.target)
linPredicted = linSVM.predict(twenty_test.data)
# The mean of the matches is the fraction of correct predictions (accuracy, not error)
print("Linear SVM accuracy: {:.4f}".format(np.mean(linPredicted == twenty_test.target)))

rbfSVM = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', svm.SVC(kernel='rbf', gamma=0.1, C=10)),])
rbfSVM.fit(twenty_train.data, twenty_train.target)
rbfPredicted = rbfSVM.predict(twenty_test.data)
# The mean of the matches is the fraction of correct predictions (accuracy, not error)
print("RBF SVM accuracy: {:.4f}".format(np.mean(rbfPredicted == twenty_test.target)))
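
One way to explore the parameters is a small grid search. The sketch below is an assumption about how that might be set up, not part of the original tutorial: it runs `GridSearchCV` over `gamma` and `C` on a tiny made-up corpus so that it works without downloading anything. To apply it to the newsgroup data, one would pass `twenty_train.data` and `twenty_train.target` to `fit` instead.

```python
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny invented corpus (0 = graphics, 1 = medicine), for illustration only
toy_docs = [
    "rendering polygons with shaders", "texture mapping for 3d graphics",
    "ray tracing image rendering", "graphics card draws polygons",
    "image file formats and pixels", "3d graphics rendering software",
    "the patient has a fever", "doctor prescribed medicine for the patient",
    "symptoms of the disease include fever", "treatment for chronic disease",
    "medicine and treatment side effects", "doctor diagnosed the symptoms",
]
toy_labels = [0] * 6 + [1] * 6

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', svm.SVC(kernel='rbf'))])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {'clf__gamma': [0.01, 0.1, 1.0],
              'clf__C': [1, 10, 100]}

# 3-fold cross-validation over all 9 parameter combinations
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(toy_docs, toy_labels)

print(search.best_params_)
print("Best cross-validated accuracy: {:.4f}".format(search.best_score_))
```

The selected parameters depend heavily on the data, so values found on the toy corpus say nothing about good settings for the newsgroup documents.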

## Linear Combination of Functions

We will run different classifiers on a few datasets. Which ones do you think will do well? Do you need nonlinear functions for these problems?

We will first do the digits problem with linear logistic regression, RBF SVM, decision tree and boosted decision trees.
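
To make the section title concrete: a trained RBF SVM computes a linear combination of kernel functions centred on its support vectors, `f(x) = sum_i a_i K(x_i, x) + b`. The sketch below checks this on a two-class subset of the digits data, using scikit-learn's `dual_coef_`, `support_vectors_`, and `intercept_` attributes. The restriction to digits 0 and 1 and the names `X01`, `y01`, `clf01`, and `manual` are choices made for this illustration.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.metrics.pairwise import rbf_kernel

# Keep only the digits 0 and 1 so that there is a single decision function
all_digits = load_digits()
mask = all_digits.target < 2
X01, y01 = all_digits.data[mask], all_digits.target[mask]

gamma = 0.001
clf01 = svm.SVC(kernel='rbf', gamma=gamma, C=1).fit(X01, y01)

# K[i, j] = exp(-gamma * ||support_vector_i - x_j||^2)
K = rbf_kernel(clf01.support_vectors_, X01, gamma=gamma)

# Weighted sum of the kernel functions plus the intercept
manual = clf01.dual_coef_[0].dot(K) + clf01.intercept_[0]

# Compare with the decision values computed by scikit-learn
print(np.allclose(manual, clf01.decision_function(X01), atol=1e-6))
print("Number of kernel functions used:", len(clf01.support_vectors_))
```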

In [None]:
# Look at the digits visually
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

digits = load_digits()
images_and_labels = list(zip(digits.images, digits.target))
for index, (image, label) in enumerate(images_and_labels[:4]):
    plt.subplot(1, 4, index + 1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Training: %i' % label)
plt.show()

In [None]:
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Cross-validation with 10 random splits, each holding out 20% of the data
# as a validation set.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
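# With the newton-cg solver, scikit-learn >= 0.22 fits a multinomial model for
# multiclass targets by default, so the deprecated multi_class argument is omitted.
lr = LogisticRegression(C=0.005, solver='newton-cg')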
scores_lr = cross_val_score(lr, digits.data, digits.target, cv=cv)
print("Logistic regression: {:.4f}".format(scores_lr.mean()))

rbfSVM = svm.SVC(kernel='rbf', gamma=0.001, C=1)
scores_rbf = cross_val_score(rbfSVM, digits.data, digits.target, cv=cv)
print("RBF SVM: {:.4f}".format(scores_rbf.mean()))

dt = DecisionTreeClassifier(random_state=1)
scores_dt = cross_val_score(dt, digits.data, digits.target, cv=cv)
print("Decision Tree: {:.4f}".format(scores_dt.mean()))

bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=10),
                         n_estimators=100,random_state=1)
scores_bdt = cross_val_score(bdt, digits.data, digits.target, cv=cv)
print("Boosting Decision Trees: {:.4f}".format(scores_bdt.mean()))
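
To see what the `ShuffleSplit` object used above produces, the following sketch applies the same kind of splitter to ten dummy samples (the array `dummy` is made up for illustration). Each iteration draws a fresh random 80/20 partition, so a sample can land in the validation set of more than one split, unlike in k-fold cross-validation.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

dummy = np.arange(10)  # ten placeholder samples
demo_cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

# Each split is a pair (training indices, validation indices)
splits = list(demo_cv.split(dummy))
for train_idx, val_idx in splits:
    print("train:", train_idx, "validation:", val_idx)
```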


We now consider a credit risk dataset (German credit data): https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)

We will use linear logistic regression, RBF SVM, decision tree and boosted decision trees.

The features are:

* Attribute 1: (qualitative) 
Status of existing checking account 
A11 : ... < 0 DM 
A12 : 0 <= ... < 200 DM 
A13 : ... >= 200 DM / salary assignments for at least 1 year 
A14 : no checking account 

* Attribute 2: (numerical) 
Duration in months

* Attribute 3: (qualitative) 
Credit history 
A30 : no credits taken/ all credits paid back duly 
A31 : all credits at this bank paid back duly 
A32 : existing credits paid back duly till now 
A33 : delay in paying off in the past 
A34 : critical account/ other credits existing (not at this bank) 

* Attribute 4: (qualitative) 
Purpose 
A40 : car (new) 
A41 : car (used) 
A42 : furniture/equipment 
A43 : radio/television 
A44 : domestic appliances 
A45 : repairs 
A46 : education 
A47 : (vacation - does not exist?) 
A48 : retraining 
A49 : business 
A410 : others 

* Attribute 5: (numerical) 
Credit amount 

* Attribute 6: (qualitative) 
Savings account/bonds 
A61 : ... < 100 DM 
A62 : 100 <= ... < 500 DM 
A63 : 500 <= ... < 1000 DM 
A64 : ... >= 1000 DM 
A65 : unknown/ no savings account 

* Attribute 7: (qualitative) 
Present employment since 
A71 : unemployed 
A72 : ... < 1 year 
A73 : 1 <= ... < 4 years 
A74 : 4 <= ... < 7 years 
A75 : ... >= 7 years 

* Attribute 8: (numerical) 
Installment rate in percentage of disposable income 

* Attribute 9: (qualitative) 
Personal status and sex 
A91 : male : divorced/separated 
A92 : female : divorced/separated/married 
A93 : male : single 
A94 : male : married/widowed 
A95 : female : single 

* Attribute 10: (qualitative) 
Other debtors / guarantors 
A101 : none 
A102 : co-applicant 
A103 : guarantor 

* Attribute 11: (numerical) 
Present residence since 

* Attribute 12: (qualitative) 
Property 
A121 : real estate 
A122 : if not A121 : building society savings agreement/ life insurance 
A123 : if not A121/A122 : car or other, not in attribute 6 
A124 : unknown / no property 

* Attribute 13: (numerical) 
Age in years 

* Attribute 14: (qualitative) 
Other installment plans 
A141 : bank 
A142 : stores 
A143 : none 

* Attribute 15: (qualitative) 
Housing 
A151 : rent 
A152 : own 
A153 : for free 

* Attribute 16: (numerical) 
Number of existing credits at this bank 

* Attribute 17: (qualitative) 
Job 
A171 : unemployed/ unskilled - non-resident 
A172 : unskilled - resident 
A173 : skilled employee / official 
A174 : management/ self-employed/ highly qualified employee/ officer 

* Attribute 18: (numerical) 
Number of people being liable to provide maintenance for 

* Attribute 19: (qualitative) 
Telephone 
A191 : none 
A192 : yes, registered under the customer's name 

* Attribute 20: (qualitative) 
Foreign worker 
A201 : yes 
A202 : no 
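
Qualitative attributes such as the codes `A11` to `A14` have no numeric meaning, so they are typically expanded into indicator (one-hot) columns before being passed to a linear model or an SVM. The sketch below shows this on three made-up rows with two qualitative attributes. It illustrates the idea only and does not describe how the data file used in this notebook was prepared.

```python
from sklearn.preprocessing import OneHotEncoder

# Three invented applicants: (checking account status, credit history)
rows = [["A11", "A34"],
        ["A12", "A32"],
        ["A14", "A34"]]

# One indicator column is created for every category seen in each attribute
encoder = OneHotEncoder()
encoded = encoder.fit_transform(rows).toarray()

print(encoder.categories_)  # categories found for each attribute
print(encoded)              # one row per applicant, one column per category
```

Numerical attributes such as duration or credit amount would be kept as they are, or rescaled, rather than one-hot encoded.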


In [None]:
from sklearn.datasets import load_svmlight_file

creditX, creditY = load_svmlight_file("german")

cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
lr = LogisticRegression(C=1)
scores_lr = cross_val_score(lr, creditX, creditY, cv=cv)
print("Logistic regression: {:.4f}".format(scores_lr.mean()))

rbfSVM = svm.SVC(kernel='rbf', gamma=0.001, C=10)
scores_rbf = cross_val_score(rbfSVM, creditX, creditY, cv=cv)
print("RBF SVM: {:.4f}".format(scores_rbf.mean()))

dt = DecisionTreeClassifier(random_state=1)
scores_dt = cross_val_score(dt, creditX, creditY, cv=cv)
print("Decision Tree: {:.4f}".format(scores_dt.mean()))

bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=10),
                         n_estimators=100,random_state=1)
scores_bdt = cross_val_score(bdt, creditX, creditY, cv=cv)
print("Boosting Decision Trees: {:.4f}".format(scores_bdt.mean()))

The next task is to predict forest cover type from GIS features: https://archive.ics.uci.edu/ml/datasets/Covertype

We will use linear logistic regression, decision tree, and boosted decision tree.

We are trying to distinguish lodgepole pine from the other forest types (spruce/fir, Ponderosa pine, cottonwood/willow, aspen, Douglas-fir, and krummholz). The features, listed as name / data type / measurement / description, are:

* Elevation / quantitative / meters / Elevation in meters 
* Aspect / quantitative / azimuth / Aspect in degrees azimuth 
* Slope / quantitative / degrees / Slope in degrees 
* Horizontal_Distance_To_Hydrology / quantitative / meters / Horz Dist to nearest surface water features 
* Vertical_Distance_To_Hydrology / quantitative / meters / Vert Dist to nearest surface water features 
* Horizontal_Distance_To_Roadways / quantitative / meters / Horz Dist to nearest roadway 
* Hillshade_9am / quantitative / 0 to 255 index / Hillshade index at 9am, summer solstice 
* Hillshade_Noon / quantitative / 0 to 255 index / Hillshade index at noon, summer solstice 
* Hillshade_3pm / quantitative / 0 to 255 index / Hillshade index at 3pm, summer solstice 
* Horizontal_Distance_To_Fire_Points / quantitative / meters / Horz Dist to nearest wildfire ignition points 
* Wilderness_Area (4 binary columns) / qualitative / 0 (absence) or 1 (presence) / Wilderness area designation 
* Soil_Type (40 binary columns) / qualitative / 0 (absence) or 1 (presence) / Soil Type designation 
* Cover_Type (7 types) / integer / 1 to 7 / Forest Cover Type designation
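
Since the task is lodgepole pine versus everything else, the seven-valued `Cover_Type` has to be collapsed into a binary label. A minimal sketch, assuming the original UCI coding in which lodgepole pine is type 2, and using a made-up label array `cover_type`. The `covertype` file loaded in this notebook may already contain binary labels, in which case this step is unnecessary.

```python
import numpy as np

# Invented example labels using the UCI coding 1..7
cover_type = np.array([1, 2, 2, 5, 7, 2, 3])

LODGEPOLE_PINE = 2  # assumed code for lodgepole pine in the UCI data

# 1 for lodgepole pine, 0 for every other cover type
is_lodgepole = (cover_type == LODGEPOLE_PINE).astype(int)

print(is_lodgepole)
```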

In [None]:
from sklearn.datasets import load_svmlight_file

X, Y = load_svmlight_file("covertype")

# Single split: train on 10% of the data and evaluate on the remaining 90%
cv = ShuffleSplit(n_splits=1, test_size=0.9, random_state=0)
lr = LogisticRegression(C=1)
scores_lr = cross_val_score(lr, X, Y, cv=cv)
print("Logistic regression: {:.4f}".format(scores_lr.mean()))

dt = DecisionTreeClassifier(random_state=1)
scores_dt = cross_val_score(dt, X, Y, cv=cv)
print("Decision Tree: {:.4f}".format(scores_dt.mean()))

bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=30),
                         n_estimators=10,random_state=1)
scores_bdt = cross_val_score(bdt, X, Y, cv=cv)
print("Boosting Decision Trees: {:.4f}".format(scores_bdt.mean()))