### Example Usage from Module

The following shows how to use `csv2libsvm` to convert a CSV file to libsvm format from a notebook or interactive session.

In [4]:
from csv2libsvm import csv2libsvm
import pandas as pd

df = pd.read_csv('../titanic.csv', nrows=10)
df.head()

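| Unnamed: 0 | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | | Southampton | no | True |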


In a typical workflow, the modeler first reads in a sample of the CSV file to understand the columns and their types. The output above shows several factor columns (for example `sex` and `embarked`) whose string values neither libsvm nor xgboost knows how to handle. When the factor columns are specified, `csv2libsvm` encodes the string values of these columns as sequential integers based on their order of appearance.
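The order-of-appearance encoding described above can be sketched as follows. This is a minimal illustration, not the `csv2libsvm` implementation: the helper name `encode_factor` is hypothetical, and starting the codes at 0 is an assumption, since the text does not say where the numbering begins.

```python
def encode_factor(values):
    """Map each distinct string to an integer in order of first appearance.

    Hypothetical helper for illustration; codes start at 0 by assumption.
    """
    mapping = {}
    codes = []
    for value in values:
        if value not in mapping:
            # A new level gets the next unused integer.
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes, mapping


codes, mapping = encode_factor(["male", "female", "female", "male"])
print(codes)    # [0, 1, 1, 0]
print(mapping)  # {'male': 0, 'female': 1}
```

Because the codes depend on row order, two files encoded independently can assign different integers to the same level, which is why a saved mapping matters when converting more than one file.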

Specifying factor columns often isn't enough, however, as there are clearly some columns that are unfit for modeling altogether (e.g. account #s, names, addresses). Such columns should be listed in the `skip` argument. Alternatively, the user can specify exactly which columns to keep using the `keep` argument.
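The difference between `skip` and `keep` can be illustrated with a small sketch. The function `select_columns` is hypothetical, and letting `keep` take precedence when both arguments are given is an assumption made for the example, not documented behavior.

```python
def select_columns(columns, skip=None, keep=None):
    """Return the columns that would be written, preserving file order.

    Hypothetical helper: `keep` wins over `skip` by assumption.
    """
    if keep is not None:
        wanted = set(keep)
        return [c for c in columns if c in wanted]
    unwanted = set(skip or [])
    return [c for c in columns if c not in unwanted]


columns = ["pclass", "sex", "age", "who", "alive"]
print(select_columns(columns, skip=["who", "alive"]))  # ['pclass', 'sex', 'age']
print(select_columns(columns, keep=["age", "sex"]))    # ['sex', 'age']
```

Listing columns in `keep` is usually the safer choice for wide files, since any column not named is left out by default.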

Once the column types have been investigated, the user calls the function, passing in the input file path and a directory in which to store the output, like so:

In [5]:
csv2libsvm(
    '../titanic.csv',
    'xgboost-training',
    target='survived',
    factors=['pclass','sex','embarked'],
    keep=['pclass','sex','age','sibsp','parch','fare','embarked'])

100%|██████████| 891/891 [00:00<00:00, 23445.98it/s]
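For orientation, the libsvm text format stores one row per line as a label followed by `index:value` pairs. The sketch below shows how a single encoded row could be written. The helper `to_libsvm_line` is hypothetical; the 1-based indices and the omission of missing values are assumptions for the example rather than confirmed `csv2libsvm` behavior.

```python
import math


def to_libsvm_line(label, features):
    """Format one row as 'label index:value ...'.

    Hypothetical helper: indices start at 1 and missing values
    (None or NaN) are left out, both by assumption.
    """
    parts = [str(label)]
    for index, value in enumerate(features, start=1):
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue  # sparse format: a missing entry is simply not written
        parts.append(f"{index}:{value}")
    return " ".join(parts)


# pclass, sex, age, sibsp, parch, fare, embarked (factors already encoded)
print(to_libsvm_line(0, [3, 0, 22.0, 1, 0, 7.25, 0]))
# 0 1:3 2:0 3:22.0 4:1 5:0 6:7.25 7:0
```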


### Output

`csv2libsvm` will create several files in the output folder depending on the input arguments. The `split` argument allows the modeler to indicate a column that should be used to output multiple files. This is most commonly the case when a train/test/val split needs to be created. One file will be produced for each unique value in this column. For example, if the `split` column has the values `train` and `test`, two files will be created: `part-train.libsvm` and `part-test.libsvm`. If no `split` column is provided, only one libsvm file will be created: `part-full.libsvm`.
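The file naming rule above can be sketched as a grouping step. The function `group_by_split` is hypothetical and only mirrors the naming described in the text (`part-<value>.libsvm`, or `part-full.libsvm` when no `split` column is given); it does not write any files.

```python
from collections import defaultdict


def group_by_split(rows, split=None):
    """Group row dicts by the output file they would belong to.

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for row in rows:
        key = row[split] if split is not None else "full"
        groups[f"part-{key}.libsvm"].append(row)
    return dict(groups)


rows = [
    {"age": 22.0, "set": "train"},
    {"age": 38.0, "set": "test"},
    {"age": 26.0, "set": "train"},
]
print(sorted(group_by_split(rows, split="set")))  # ['part-test.libsvm', 'part-train.libsvm']
print(sorted(group_by_split(rows)))               # ['part-full.libsvm']
```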

In addition to creating files in the libsvm format, `csv2libsvm` also outputs a metadata file called `meta.json`. This file records the function arguments and, most importantly, the factor column mappings. These mappings allow the user to apply the conversion used for one CSV file to another, which is useful when two CSV files need to be converted consistently, such as one for training and one for testing.
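To show why the saved mappings matter, here is a sketch of reusing a mapping on a second file. The JSON layout (a top-level `factors` key holding one dictionary per column) is purely hypothetical, as is the helper `encode_with_saved_mapping`; inspect the actual `meta.json` to see how the mappings are stored.

```python
import json

# Hypothetical meta.json content; the real file layout may differ.
meta_text = '{"factors": {"sex": {"male": 0, "female": 1}}}'
meta = json.loads(meta_text)


def encode_with_saved_mapping(values, mapping):
    """Encode values with a previously saved mapping.

    Levels that were not seen in the original file become None (assumption).
    """
    return [mapping.get(value) for value in values]


test_codes = encode_with_saved_mapping(
    ["female", "male", "unknown"], meta["factors"]["sex"]
)
print(test_codes)  # [1, 0, None]
```

Reusing the mapping keeps `female` encoded as the same integer in both files, even if it appears in a different position in the second one.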



### Using with xgboost

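The `xgboost` module reads the libsvm format directly, so it can build models from datasets stored on disk, which may be necessary in some situations. Creating an `xgboost.DMatrix` is as simple as passing in the path to the libsvm file. Depending on the installed xgboost version, the format may need to be stated explicitly in the path, for example by appending `?format=libsvm`.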

In [6]:
import xgboost

train = xgboost.DMatrix('xgboost-training/part-full.libsvm')

params = {
    'objective': 'binary:logitraw',
    'eta': 0.1,
    'tree_method': 'hist',
    'min_child_weight': 5
}

mod = xgboost.train(params, dtrain=train, num_boost_round=100)

[14:09:36] 891x8 matrix with 4772 entries loaded from xgboost-training/part-full.libsvm
