# Construct problem in scipy format

> README.md lines 150-154

For sparse data, use `scipy.sparse.csr_matrix((data, (row_ind, col_ind)))`:

y, x = np.asarray([1,-1]), scipy.sparse.csr_matrix(([1, 1, -1, -1], ([0, 0, 1, 1], [0, 2, 0, 2])))

prob  = problem(y, x)

param = parameter('-s 0 -c 4 -B 1')

m = train(prob, param)

$\rightarrow$ `csr_matrix` stands for "Compressed Sparse Row matrix"; it stores only the nonzero values. The constructor form used here takes three parallel arrays:

1. data
2. row_ind
3. col_ind

In the example above, `data` is `[1, 1, -1, -1]`, `row_ind` is `[0, 0, 1, 1]`, and `col_ind` is `[0, 2, 0, 2]`, so the entry at position `(row_ind[k], col_ind[k])` holds `data[k]`.
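
To make the layout concrete, the following minimal sketch builds the same matrix and converts it to a dense array. It only assumes that NumPy and SciPy are installed; the variable name `dense` is introduced here for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# The three parallel arrays from the example above
data = [1, 1, -1, -1]
row_ind = [0, 0, 1, 1]
col_ind = [0, 2, 0, 2]

# Entry (row_ind[k], col_ind[k]) holds data[k]; everything else is zero
x = csr_matrix((data, (row_ind, col_ind)))

dense = x.toarray()
print(x.shape)  # shape is inferred from the largest row and column index
print(x.nnz)    # number of stored (nonzero) entries
print(dense)
```

Column 1 never appears in `col_ind`, so it is all zeros in the dense view and takes no storage in the sparse one.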

$\rightarrow$ `y` is a NumPy array holding one label per row of `x`; in the example above it is `[1, -1]`.

## Transform X

We transform the original data list `X` (which is a list of dictionaries) into a scipy csr_matrix.

Each example in `X` becomes one row of the matrix. We iterate over the examples and, for every key-value pair in an example:

1. append the value to `data`
2. append the index of the example to `row_ind`
3. append the key to `col_ind`
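
The three steps above can be sketched as follows. This is a minimal illustration, not necessarily the notebook's own implementation: the function name `transform_X` and the optional `n_features` argument are hypothetical, and the sketch assumes every dictionary key is a non-negative integer that can be used directly as a column index.

```python
import numpy as np
from scipy.sparse import csr_matrix

def transform_X(X, n_features=None):
    """Convert a list of {feature_index: value} dicts into a csr_matrix.

    Assumes keys are non-negative integers (hypothetical helper).
    """
    data, row_ind, col_ind = [], [], []
    for i, x_i in enumerate(X):           # i is the row of this example
        for key, value in x_i.items():
            data.append(value)            # step 1: the value
            row_ind.append(i)             # step 2: the example index
            col_ind.append(key)           # step 3: the key as column
    if n_features is None:
        # Infer the width from the largest key seen
        n_features = max(col_ind) + 1 if col_ind else 0
    # Pass the shape explicitly so trailing empty rows are not dropped
    return csr_matrix((data, (row_ind, col_ind)), shape=(len(X), n_features))

X_example = [{0: 1, 2: 1}, {0: -1, 2: -1}, {}]
X_sparse = transform_X(X_example)
print(X_sparse.toarray())
```

Passing `shape` explicitly matters when the last examples have no nonzero features; without it, SciPy infers the row count from the largest index in `row_ind`.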

## Transform y

In [None]:
def transform_y(y):
    return np.asarray([1 if y_i == 2 else -1 for y_i in y])


## Ways to call `train()`

There are three ways to call `train()`:

1. `model = train(y, x [, 'training_options'])`
2. `model = train(prob [, 'training_options'])`
3. `model = train(prob, param)`

We use the third form, where:
- `prob`: a problem instance generated by calling `problem(y, x)`.
- `param`: a parameter instance generated by calling `parameter('training_options')`.


## Generate nodearray

Use `gen_feature_nodearray(xi [, feature_max=None])` to generate a feature vector from a Python list/tuple/dictionary, a NumPy ndarray, or a tuple of (index, data).

For example, `xi_ctype, max_idx = gen_feature_nodearray({1:1, 3:1, 5:-2})`:
- `xi_ctype`: the returned feature_nodearray (a ctypes structure)
- `max_idx`: the maximal feature index of xi
- `feature_max`: if feature_max is assigned, features with indices larger than feature_max are removed.

Usage example:

```python
m = liblinear.train(prob, param) # m is a ctype pointer to a model
x0, max_idx = gen_feature_nodearray({1:1, 3:1})
label = liblinear.predict(m, x0)
```

In [None]:
Xtrain_ctype = []
for x_i in X_train:
    x_i_ctype, _ = gen_feature_nodearray(x_i)
    Xtrain_ctype.append(x_i_ctype)

Xtest_ctype = []
for x_i in X_test:
    x_i_ctype, _ = gen_feature_nodearray(x_i)
    Xtest_ctype.append(x_i_ctype)
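
The experiment loop calls `ZeroOneError`, whose definition is not shown in this section. A plausible minimal version, assuming it returns the fraction of predictions that differ from the true labels, could look like this; the body is an assumption, only the name and argument order come from the calls in the notebook.

```python
import numpy as np

def ZeroOneError(pred, y):
    """Fraction of predictions that differ from the true labels (assumed definition)."""
    pred = np.asarray(pred)
    y = np.asarray(y)
    # Element-wise mismatch, averaged over all examples
    return float(np.mean(pred != y))

# One of four predictions is wrong
err = ZeroOneError([1.0, -1.0, 1.0, 1.0], [1, -1, -1, 1])
print(err)
```

Predicted labels may come back as floats; comparing `1.0` with the integer label `1` still counts as a match here.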

In [None]:
for experiment in tqdm(range(5)):
    min_Ein = np.inf
    opt_log10_lambda = 0
    np.random.seed(experiment)  # seeds NumPy's RNG; the call returns None, so there is nothing to assign
    for log10_lambda in (-2, -1, 0, 1, 2, 3):
        train_pred_res = []
        c = 1 / (10 ** log10_lambda)
        prob = problem(y_train, X_train)
        param = parameter('-s 6 -c ' + str(c))
        model = train(prob, param)

        for x_i_ctype in Xtrain_ctype:
            train_pred_res.append(liblinear.predict(model, x_i_ctype))
        Ein = ZeroOneError(train_pred_res, y_train)

        if Ein == min_Ein:
            opt_log10_lambda = max(opt_log10_lambda, log10_lambda)      # break tie by choosing the larger lambda
        elif Ein < min_Ein:
            min_Ein = Ein
            opt_log10_lambda = log10_lambda

    #print('The best log_10(λ*) = ', opt_log10_lambda)
    # Retrain on the training set with the selected lambda, then evaluate on the held-out test set
    c_best = 1 / (10 ** opt_log10_lambda)
    prob_best = problem(y_train, X_train)
    param_best = parameter('-s 6 -c ' + str(c_best))
    model_best = train(prob_best, param_best)

    test_pred_res = []
    for x_i_ctype in Xtest_ctype:
        test_pred_res.append(liblinear.predict(model_best, x_i_ctype))
    Eout = ZeroOneError(test_pred_res, y_test)
    print(f'Eout for experiment {experiment} is {Eout}')
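
The selection rule used inside the loop (smallest in-sample error, ties broken toward the larger λ) can be checked in isolation with a small sketch. The helper name `select_log10_lambda` and the error values below are hypothetical and serve only to illustrate the rule; the conversion `c = 1 / (10 ** log10_lambda)` is the one the loop uses.

```python
def select_log10_lambda(errors):
    """Pick the log10(lambda) with the smallest error; ties go to the larger lambda.

    errors: dict mapping log10(lambda) -> in-sample error (hypothetical helper).
    """
    best, best_err = None, float('inf')
    for log10_lambda, err in errors.items():
        is_better = err < best_err
        is_tie_with_larger_lambda = err == best_err and best is not None and log10_lambda > best
        if best is None or is_better or is_tie_with_larger_lambda:
            best, best_err = log10_lambda, err
    return best

# Made-up error values: -1 and 0 tie for the minimum
ein = {-2: 0.10, -1: 0.05, 0: 0.05, 1: 0.20}
best = select_log10_lambda(ein)

# Same conversion as in the loop: C is the inverse of lambda
c = 1 / (10 ** best)
print(best, c)
```

Keeping the rule in one function makes it easy to confirm that the running minimum is actually updated, which is what the tie-breaking branch depends on.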
