# Virtual manipulation of datasets tutorial

In this tutorial, we will use two [Dataset types](../../../../doc/#builtin/datasets/Datasets.md.html), namely the [Sampled Dataset](../../../../doc/#builtin/datasets/SampledDataset.md.html) and the [Transposed Dataset](../../../../doc/#builtin/datasets/TransposedDataset.md.html). We will work with a manually created dataset that has two columns. The first column contains the names of machine learning concepts. The second column contains a description of each concept (i.e. the first paragraph written about it in [Wikipedia](https://en.wikipedia.org)).

The notebook cells below use `pymldb`'s `Connection` class to make [REST API](../../../../doc/#builtin/WorkingWithRest.md.html) calls. You can check out the [Using `pymldb` Tutorial](../../../../doc/nblink.html#_tutorials/Using pymldb Tutorial) for more details.

In [2]:
from pymldb import Connection
mldb = Connection("http://localhost")

## Loading the Dataset
A powerful feature of MLDB is that it lets us execute SQL code while loading data. Here, the text is first lowercased and the [`tokenize`](../../../../doc/#builtin/sql/ValueExpression.md.html#builtinfunctions) function then splits it into individual words, dropping words shorter than 3 characters. Notice that we have chosen to use a [sparse mutable matrix](../../../../doc/#builtin/datasets/MutableSparseMatrixDataset.md.html) since we do not know ahead of time how many columns each row will have.

In [13]:
print mldb.put('/v1/procedures/import_ML_concepts', {
    "type":"import.text",
    "params": {
        "dataFileUrl":"file:///mldb_data/datasets/MachineLearningConcepts.csv",
        "outputDataset": {
                            "id":"ml_concepts", # can be accessed later using id
                            "type": "sparse.mutable" # sparse mutable dataset is needed since we tokenize words below
                         },
        "named": "Concepts", #Row name expression for output dataset
        "select": """     tokenize(
                                lower(Text), 
                                {splitchars: ' -''"?!;:/[]*,().',  
                                min_token_length: 3}) as *
                                
                         """, # Within tokenize function:
                                # lower case
                                # leave out punctuation and spaces
                                # allow only words > 3 characters
        "runOnCreation": True
    }
})

<Response [201]>
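
To make the effect of the `select` expression more concrete, here is a rough pure-Python approximation of what happens to each description: the text is lowercased, split on the listed characters, short tokens are dropped, and the remaining tokens are counted. The helper name `tokenize_sketch` is hypothetical, and MLDB's own `tokenize` may differ in details such as quoting. The sketch is only meant to show why every row ends up with its own set of columns.

```python
# Characters to split on, mirroring the `splitchars` option used above.
SPLIT_CHARS = " -'\"?!;:/[]*,()."

def tokenize_sketch(text, split_chars=SPLIT_CHARS, min_token_length=3):
    """Return {token: count} for one description (hypothetical helper)."""
    text = text.lower()
    # Turn every split character into a space, then split on whitespace.
    for ch in split_chars:
        text = text.replace(ch, " ")
    counts = {}
    for token in text.split():
        # Keep only tokens with at least `min_token_length` characters.
        if len(token) >= min_token_length:
            counts[token] = counts.get(token, 0) + 1
    return counts

# An illustrative sentence, not taken from the dataset.
example = tokenize_sketch("It is a model; the model's data-driven.")
```

Each distinct token becomes a column name and its count becomes the cell value, which is why a sparse dataset type is a natural fit here.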


## A quick look at the data

We can use the [Query API](../../../../doc/#builtin/sql/QueryAPI.md.html) to get the data into a Pandas DataFrame to take a quick look at it. You will notice that certain cells have a 'NaN' value, as seen below. This occurs when the word is not found in the description of the concept. Because the dataset is sparse, those cells simply have no value stored, and they show up as 'NaN' once the result is laid out as a table.

In [4]:
mldb.query('SELECT * FROM ml_concepts LIMIT 5')

_rowName,addition,algorithm,algorithms,also,analysis,analyze,and,applications,approach,are,...,popularized,provide,rather,recurrent,serve,stored,systems,threshold,understanding,wrong
Support vector machine,1.0,2.0,1.0,1.0,1.0,1.0,5.0,1.0,1.0,4.0,...,,,,,,,,,,
Logistic regression,,,,,,,2.0,,,,...,,,,,,,,,,
Deep belief network,,,,1.0,,,,,,,...,,,,,,,,,,
Restricted boltzmann machines,,1.0,2.0,,,,5.0,1.0,,3.0,...,,,,,,,,,,
Hopfield network,,,,1.0,,,,,,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
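
The way missing words turn into 'NaN' can be sketched in plain Python. This is only an illustration of the idea, not how MLDB or Pandas implement it, and the helper `to_dense` is hypothetical. Each sparse row stores only the words it contains, and laying the rows out as a table over the union of all columns leaves gaps that are filled with NaN.

```python
# Two sparse rows holding only the words that occur (values from the table above).
sparse_rows = {
    "Logistic regression": {"and": 2},
    "Deep belief network": {"also": 1},
}

def to_dense(rows):
    """Lay sparse rows out over the union of their columns (hypothetical helper)."""
    columns = sorted(set(col for cells in rows.values() for col in cells))
    nan = float("nan")
    # A word missing from a row becomes NaN in the dense layout.
    table = dict(
        (name, [cells.get(col, nan) for col in columns])
        for name, cells in rows.items()
    )
    return columns, table

columns, table = to_dense(sparse_rows)
```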


## Exploring the data
Let's count the number of words that describe each concept.

In [5]:
mldb.query("""
SELECT horizontal_sum({*}) as count
FROM ml_concepts

""")

_rowName,count
Support vector machine,188
Logistic regression,120
Deep belief network,121
Restricted boltzmann machines,152
Hopfield network,59
Naive bayes classifier,198
Boltzmann machine,112
Autoencoder,41
Artificial neural network,49
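
As a rough illustration of what the query computes, the sketch below sums the stored cell values of one sparse row, assuming that missing cells are simply skipped. The helper name is hypothetical and the row is a small subset of the "Support vector machine" row shown earlier. Note that the result counts word occurrences, not distinct words.

```python
def horizontal_sum_sketch(cells):
    """Sum the values stored in one sparse row (hypothetical helper)."""
    return sum(cells.values())

# A few cells from the "Support vector machine" row above.
row = {"and": 5, "are": 4, "algorithm": 2}

total_occurrences = horizontal_sum_sketch(row)  # counts repeated words
distinct_words = len(row)                       # counts each word once
```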


## Taking the transpose of data
There are two ways to take the transpose of a dataset:

1. using a transposed dataset
2. using the `transpose` SQL function

In [6]:
print mldb.put("/v1/datasets/transposed_concepts", {
    "type": "transposed",
    "params": {
        "dataset": {"id":"ml_concepts"}
    }
})

<Response [201]>


In [7]:
mldb.query("select * from transposed_concepts limit 10")

_rowName,Naive bayes classifier,Logistic regression,Hopfield network,Boltzmann machine,Restricted boltzmann machines
488,1.0,,,,
718,1.0,,,,
1950s,1.0,,,,
1958,,1.0,,,
1960s,1.0,,,,
1974,,,1.0,,
1982,,,1.0,,
1985,,,,1.0,
1986,,,,,1.0
2000s,,,,,1.0


In [8]:
mldb.query("select * from transpose(ml_concepts) limit 10")

_rowName,Naive bayes classifier,Logistic regression,Hopfield network,Boltzmann machine,Restricted boltzmann machines
488,1.0,,,,
718,1.0,,,,
1950s,1.0,,,,
1958,,1.0,,,
1960s,1.0,,,,
1974,,,1.0,,
1982,,,1.0,,
1985,,,,1.0,
1986,,,,,1.0
2000s,,,,,1.0


As you can see, both ways yield the same results.
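
Conceptually, transposing a sparse dataset swaps the roles of row names and column names. The following pure-Python sketch shows that idea on a tiny dictionary-of-dictionaries stand-in for the dataset. The helper `transpose_sparse` is hypothetical and says nothing about how MLDB implements the operation internally.

```python
def transpose_sparse(rows):
    """Swap row names and column names of a sparse table (hypothetical helper)."""
    transposed = {}
    for row_name, cells in rows.items():
        for col_name, value in cells.items():
            # The old column name becomes the new row name, and vice versa.
            transposed.setdefault(col_name, {})[row_name] = value
    return transposed

# A few cells taken from the tables above.
concepts = {
    "Logistic regression": {"1958": 1, "and": 2},
    "Hopfield network": {"1974": 1, "1982": 1},
}

words = transpose_sparse(concepts)
```

In the transposed view each word is a row and each concept is a column, which is the shape used in the sampling section that follows.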

## Taking a sample of data
There are two ways to take a sample of a dataset:

1. using a sampled dataset
2. using the `sample` SQL function

In [9]:
print mldb.put("/v1/datasets/sampled_tokens", {
    "type": "sampled",
    "params": {
        "rows": 20, # enter the number of rows or words to sample randomly
        # "fraction": 0.1, # a fraction of rows can be also be used
        "withReplacement": False, # we don't want to have the same row more than once
        "dataset": {"id":"transposed_concepts"},
        "seed": 0 # Seed value for the random number generator
    }
})

<Response [201]>


In [10]:
mldb.query("select * from sampled_tokens")

_rowName,Naive bayes classifier,Artificial neural network,Boltzmann machine,Deep belief network,Logistic regression,Restricted boltzmann machines,Support vector machine,Autoencoder
maximum,1.0,,,,,,,
nervous,,1.0,,,,,,
random,,,1.0,,,,,
such,1.0,,,1.0,1.0,,,
neurons,,,,,,1.0,,
symmetric,,,,,,1.0,,
marked,,,,,,,1.0,
sufficient,,,1.0,,,,,
highly,1.0,,,,,,,
number,2.0,1.0,1.0,,,,,


In [11]:
mldb.query("""
            select * 
            from sample(
                        transposed_concepts, 
                        {rows: 20, withReplacement: False})                      
            """)

_rowName,Logistic regression,Naive bayes classifier,Autoencoder,Restricted boltzmann machines,Hopfield network,Support vector machine,Boltzmann machine,Artificial neural network,Deep belief network
latter,1.0,,,,,,,,
bayesian,,2.0,,,,,,,
more,2.0,1.0,1.0,1.0,,,,,
understanding,,,,,1.0,,,,
called,1.0,1.0,,,,2.0,,,
automatic,,1.0,,,,,,,
practical,,,,,,,2.0,,
which,1.0,1.0,,,,3.0,,1.0,
inference,,,,,,,1.0,,
cox,1.0,,,,,,,,


As you can see, both methods output a random set of 20 rows.
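
The roles of the `rows`, `fraction`, `withReplacement` and `seed` parameters can be illustrated with Python's standard `random` module. This is only a sketch under assumptions: the helper `sample_rows` is hypothetical, it uses Python's random number generator instead of MLDB's, and it will not reproduce the rows shown above.

```python
import random

def sample_rows(row_names, rows=None, fraction=None,
                with_replacement=False, seed=None):
    """Pick row names at random (hypothetical helper, not MLDB's sampler)."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    names = sorted(row_names)
    if rows is None:
        # Derive the sample size from a fraction of the available rows.
        rows = int(round(len(names) * fraction))
    if with_replacement:
        # The same row may be picked more than once.
        return [rng.choice(names) for _ in range(rows)]
    # Without replacement, every picked row is distinct.
    return rng.sample(names, rows)

# Placeholder row names standing in for the tokens of the transposed dataset.
tokens = ["word%d" % i for i in range(50)]

picked = sample_rows(tokens, rows=20, seed=0)
picked_again = sample_rows(tokens, rows=20, seed=0)
tenth = sample_rows(tokens, fraction=0.1, seed=0)
```

With the same seed the two calls return the same sample, while omitting the seed, as in the `sample` SQL query above, generally gives a different sample on each run.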

## Using datasets vs. SQL functions
As seen above, the two ways of transposing or sampling data are equivalent. It is recommended to use Dataset types instead of SQL functions when the resulting table will be reused or queried again later in the program.