# Linear models in pyrfm
In this example, we run `SDCAClassifier` and `DoublySGDClassifier`, which are linear classifiers implemented in pyrfm.
The features of linear models using stochastic optimizers in pyrfm are as follows:
 - They do not compute the random features of all samples at the same time
 - They compute the random features of each sample at each iteration
 - As a result, they are **memory efficient** but **slow**
 - Use these implementations **only when the training data is so large that you cannot allocate memory for its random feature matrix**
 - The same applies to the other linear models: `SGDClassifier`, `SAGAClassifier`, `AdaGradClassifier`, `AdamClassifier`, and their regressor counterparts
 - `DoublySGDClassifier` (and `DoublySGDRegressor`) **increases the number of random features at every iteration**
 - `DoublySGDClassifier` does not store the random weights explicitly; it re-samples them at each iteration (with the same seed)
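
The per-sample strategy described in the list above can be sketched in plain NumPy. This is only an illustration of the idea, not pyrfm's implementation: the toy data, the `random_feature` helper, and the hinge-loss SGD update are assumptions made for the example. The random Fourier map follows the usual construction for the RBF kernel.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features, n_components = 200, 5, 64
gamma = 0.05

# Toy data (hypothetical, not a9a): the label is the sign of the first feature.
X = rng.randn(n_samples, n_features)
y = np.where(X[:, 0] > 0, 1.0, -1.0)

# Random Fourier features for the RBF kernel exp(-gamma * ||x - x'||^2).
W = np.sqrt(2.0 * gamma) * rng.randn(n_features, n_components)
b = rng.uniform(0.0, 2.0 * np.pi, n_components)


def random_feature(x):
    """Map one sample (1-D array) to its random feature vector."""
    return np.sqrt(2.0 / n_components) * np.cos(x @ W + b)


# Plain SGD on the L2-regularized hinge loss, one sample at a time.
coef = np.zeros(n_components)
eta, alpha = 0.5, 1e-3
for epoch in range(20):
    for i in rng.permutation(n_samples):
        z = random_feature(X[i])  # only this one row of features is in memory
        grad = alpha * coef
        if y[i] * (coef @ z) < 1.0:
            grad = grad - y[i] * z
        coef -= eta * grad

# Prediction also streams over samples instead of building the full matrix.
pred = np.array([np.sign(coef @ random_feature(x)) for x in X])
print("training accuracy:", np.mean(pred == y))

# For comparison only: the full (n_samples, n_components) matrix that the
# per-sample approach never allocates during training.
Z_full = np.sqrt(2.0 / n_components) * np.cos(X @ W + b)
```

Each row of `Z_full` equals the corresponding `random_feature(X[i])`, so both approaches see the same features. The per-sample loop trades repeated computation for a much smaller memory footprint.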

In [1]:
# NOTE: fetch_mldata was removed in scikit-learn 0.22, so this cell needs an older version
from sklearn.datasets import fetch_mldata
from sklearn.svm import SVC, LinearSVC
from sklearn.kernel_approximation import RBFSampler
from sklearn.utils import shuffle
import numpy as np

In [2]:
a9a = fetch_mldata('a9a')
X, y = a9a.data, a9a.target
random_state = np.random.RandomState(0)
X, y = shuffle(X, y, random_state=random_state)

# undersampling
pos_indices = np.where(y > 0)[0]
neg_indices = np.where(y < 0)[0]
indices = np.sort(np.append(pos_indices, neg_indices[:len(pos_indices)]))
X, y = X[indices], y[indices]
X, y = shuffle(X, y, random_state=random_state)

# train/test split
n_train = int(0.8 * X.shape[0])
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(18699, 123) (18699,) (4675, 123) (4675,)




In [3]:
# Standardize; with_mean=False scales the features without centering them
from sklearn.preprocessing import StandardScaler
ss = StandardScaler(with_mean=False)
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

## Comparison methods
- `SVC` with RBF kernel and `LinearSVC` with `RBFSampler`

In [4]:
%%time
# Non-linear SVC
svc = SVC(kernel='rbf', gamma=0.001, random_state=0)
svc.fit(X_train, y_train)
print(svc.score(X_test, y_test))

0.8151871657754011
CPU times: user 27 s, sys: 297 ms, total: 27.3 s
Wall time: 27.5 s


In [5]:
%%time
# LinearSVC with RBFSampler
transformer = RBFSampler(n_components=1024, gamma=0.001, random_state=0)
X_train_trans = transformer.fit_transform(X_train)
X_test_trans = transformer.transform(X_test)
clf = LinearSVC(dual=False, C=1)
clf.fit(X_train_trans, y_train)
print(clf.score(X_test_trans, y_test))

0.8220320855614973
CPU times: user 4.73 s, sys: 297 ms, total: 5.03 s
Wall time: 5.13 s
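
For a sense of scale, the memory taken by the explicit random feature matrix can be estimated from the shapes above. The sketch assumes the features are stored as float64 (8 bytes per value), which is an assumption rather than something the output confirms.

```python
n_train, n_components = 18699, 1024   # shapes from the cells above
bytes_per_value = 8                   # assumption: float64 features

# Memory for the whole transformed training set versus a single sample.
full_matrix_bytes = n_train * n_components * bytes_per_value
single_row_bytes = n_components * bytes_per_value

print("full matrix: {:.1f} MiB".format(full_matrix_bytes / 2**20))
print("one sample:  {:.1f} KiB".format(single_row_bytes / 2**10))
```

At roughly 146 MiB the full matrix fits in memory easily here, which is why the precomputed approach is the faster choice for this dataset. The per-sample solvers become relevant when this figure outgrows the available memory.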


## Linear models in pyrfm

In [6]:
from pyrfm import SDCAClassifier, DoublySGDClassifier



In [7]:
%%time
# SDCAClassifier with RBFSampler
# It is slow because the stochastic solvers in pyrfm compute the random features
# of each sample at each iteration
transformer = RBFSampler(n_components=1024, gamma=0.001, random_state=0)
# Do not transform before fitting
clf = SDCAClassifier(transformer, alpha=0.01, tol=1e-5, max_iter=1,
                     verbose=False, random_state=0, warm_start=True,
                     shuffle=True)
for i in range(10):
    clf.fit(X_train, y_train)
    print('Iteration: {} Accuracy: {:4g}'.format(i+1, clf.score(X_test, y_test)))

Iteration: 1 Accuracy: 0.811123
Iteration: 2 Accuracy: 0.805348
Iteration: 3 Accuracy: 0.808556
Iteration: 4 Accuracy: 0.808556
Iteration: 5 Accuracy: 0.809412
Iteration: 6 Accuracy: 0.80877
Iteration: 7 Accuracy: 0.808342
Iteration: 8 Accuracy: 0.808556
Iteration: 9 Accuracy: 0.808984
Iteration: 10 Accuracy: 0.808342
CPU times: user 8.97 s, sys: 15.6 ms, total: 8.98 s
Wall time: 8.96 s


In [11]:
%%time
import time
# DoublySGDClassifier with RBFSampler
# It is slow because the stochastic solvers in pyrfm compute the random features
# of each sample at each iteration
transformer = RBFSampler(gamma=0.001, random_state=0)
# Do not transform before fitting
clf = DoublySGDClassifier(transformer, eta0=.01, alpha=1e-2, power_t=1,
                          max_iter=1, batch_size=128, n_bases_sampled=4,
                          verbose=False, random_state=1, warm_start=True)
start = time.time()
# The number of random features increases at every iteration
# So, the running time also increases
for i in range(10):
    clf.fit(X_train, y_train)
    stop = time.time()
    print('Iteration: {} Accuracy: {:4g} Time: {:.3g} (s)'
          .format(i+1, clf.score(X_test, y_test), stop - start))

Iteration: 1 Accuracy: 0.786738 Time: 0.711 (s)
Iteration: 2 Accuracy: 0.796578 Time: 2.78 (s)
Iteration: 3 Accuracy: 0.804278 Time: 6.36 (s)
Iteration: 4 Accuracy: 0.802781 Time: 11.3 (s)
Iteration: 5 Accuracy: 0.804064 Time: 17.6 (s)
Iteration: 6 Accuracy: 0.807487 Time: 25.3 (s)
Iteration: 7 Accuracy: 0.807273 Time: 34.4 (s)
Iteration: 8 Accuracy: 0.806417 Time: 44.9 (s)
Iteration: 9 Accuracy: 0.805348 Time: 56.8 (s)
Iteration: 10 Accuracy: 0.805989 Time: 70 (s)
CPU times: user 1min 11s, sys: 31.2 ms, total: 1min 11s
Wall time: 1min 11s
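
The timings above are consistent with a cost per pass that grows linearly with the number of random features, which makes the cumulative time grow roughly quadratically in the number of `fit` calls. A quick check on the printed numbers follows. The times also include the calls to `score`, so this is a rough sanity check and not a benchmark.

```python
# Cumulative wall-clock times in seconds, copied from the output above.
times = [0.711, 2.78, 6.36, 11.3, 17.6, 25.3, 34.4, 44.9, 56.8, 70.0]

# If the cumulative time grows like c * k^2, then time / k^2 stays near c.
ratios = [t / k**2 for k, t in enumerate(times, start=1)]

for k, r in enumerate(ratios, start=1):
    print("fit call {:2d}: time / k^2 = {:.3f}".format(k, r))
```

The ratio stays close to 0.7 for all ten calls, so doubling the number of passes costs about four times as much.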


In [18]:
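print(clf.coef_.shape)
# Expected size: n_bases_sampled weights per mini-batch, ceil(n_train / batch_size) mini-batches, 10 fit calls
n_batches = int((X_train.shape[0] - 1) / clf.batch_size + 1)
print(clf.n_bases_sampled * n_batches * 10)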

(5880,)
5880
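
The count works out as 4 bases × 147 mini-batches × 10 passes = 5880 coefficients. Because the random weights themselves are not stored, they have to be regenerated whenever they are needed. The following is a minimal sketch of that idea, not pyrfm's code: `sample_block`, `decision_function`, and the `seed + t` seeding scheme are hypothetical, and the feature normalization constant is omitted.

```python
import math
import numpy as np

n_train, batch_size, n_bases_sampled, n_fit_calls = 18699, 128, 4, 10
n_blocks = math.ceil(n_train / batch_size) * n_fit_calls  # one block per mini-batch
n_coef = n_bases_sampled * n_blocks


def sample_block(seed, t, n_features, gamma):
    """Regenerate the random weights and offsets of block t on demand."""
    rng = np.random.RandomState(seed + t)  # hypothetical seeding scheme
    W = np.sqrt(2.0 * gamma) * rng.randn(n_features, n_bases_sampled)
    b = rng.uniform(0.0, 2.0 * np.pi, n_bases_sampled)
    return W, b


def decision_function(x, coef, seed, gamma):
    """Evaluate sum_t coef_t . cos(x W_t + b_t) without storing any W_t."""
    out = 0.0
    for t in range(coef.shape[0] // n_bases_sampled):
        W, b = sample_block(seed, t, x.shape[0], gamma)
        block = coef[t * n_bases_sampled:(t + 1) * n_bases_sampled]
        out += np.cos(x @ W + b) @ block
    return out


rng = np.random.RandomState(0)
x = rng.randn(123)        # one sample with 123 features, as in a9a
coef = rng.randn(n_coef)  # stand-in for a learned coef_
f1 = decision_function(x, coef, seed=1, gamma=0.001)
f2 = decision_function(x, coef, seed=1, gamma=0.001)
print(n_coef, f1 == f2)
```

Only `coef` and the seed need to be kept. Re-seeding reproduces the same weights on every call, at the cost of regenerating them each time.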
