# Virtual Manipulation of Datasets Tutorial

In this tutorial, we will be using two [Dataset types](../../../../doc/#builtin/datasets/Datasets.md.html), namely [Sampled Dataset](../../../../doc/#builtin/datasets/SampledDataset.md.html) and [Transposed Dataset](../../../../doc/#builtin/datasets/TransposedDataset.md.html). This tutorial assumes familiarity with [Procedures](../../../../doc/#builtin/procedures/Procedures.md.html). We suggest going through the [Procedures and Functions Tutorial](../../../../ipy/notebooks/_tutorials/_latest/Procedures%20and%20Functions%20Tutorial.ipynb) beforehand.

To run the examples below, we created a toy dataset from the descriptions of machine learning concepts found on [Wikipedia](https://en.wikipedia.org). The dataset has two columns: the first contains the name of the machine learning concept, and the second contains the corresponding description.

The notebook cells below use `pymldb`'s `Connection` class to make [REST API](../../../../doc/#builtin/WorkingWithRest.md.html) calls. You can check out the [Using `pymldb` Tutorial](../../../../doc/nblink.html#_tutorials/Using pymldb Tutorial) for more details.

In [8]:
from pymldb import Connection
mldb = Connection("http://localhost")

## Loading the Dataset
A powerful feature of MLDB allows us to execute SQL code while loading data. Here, the [`tokenize`](../../../../doc/#builtin/sql/ValueExpression.md.html#builtinfunctions) function splits the lowercased `Text` column into individual words and drops words shorter than 4 characters. Notice that we have chosen a dataset of type [sparse.mutable](../../../../doc/#builtin/datasets/MutableSparseMatrixDataset.md.html) since we do not know ahead of time how many columns each row will have.
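
As a rough illustration of the tokenization described above, here is a plain-Python sketch of the same rule: lowercase, split on separators, drop short words, count what remains. It is not MLDB's implementation. The name `tokenize_sketch` and the sample sentence are made up for this example, and the separators are copied from the `splitChars` value used in the import below.

```python
import re
from collections import Counter

# Separators copied from the splitChars option of the import procedure below.
SPLIT_CHARS = " -'\"?!;:/[]*,()."

def tokenize_sketch(text, split_chars=SPLIT_CHARS, min_token_length=4):
    """Hypothetical stand-in for tokenize(): returns a word -> count mapping."""
    # Build a character class that matches one or more separator characters.
    pattern = "[" + re.escape(split_chars) + "]+"
    tokens = re.split(pattern, text.lower())
    # Keep only tokens with at least min_token_length characters.
    return Counter(t for t in tokens if len(t) >= min_token_length)

counts = tokenize_sketch("A Boltzmann machine is a network; the network is stochastic.")
print(sorted(counts.items()))
```

Each surviving word becomes a column name and its count becomes the cell value, which is why the number of columns per row cannot be known in advance.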

In [9]:
print(mldb.put('/v1/procedures/import_ML_concepts', {
        "type":"import.text",
        "params": 
        {
            "dataFileUrl":"http://public.mldb.ai/datasets/MachineLearningConcepts.csv",
            "outputDataset":{
                "id":"ml_concepts", # can be accessed later using id
                "type": "sparse.mutable" # sparse mutable dataset is needed since we tokenize words below
            },
            "named": "Concepts", #row name expression for output dataset
            "select": 
                """ 
                    tokenize(
                        lower(Text), 
                        {splitChars: ' -''"?!;:/[]*,().',  
                        minTokenLength: 4}) AS *
                """, # within the tokenize function:
                        # lower case
                        # leave out punctuation and spaces
                        # keep only words of at least 4 characters
            "runOnCreation": True
        }
    }
))

<Response [201]>


## A quick look at the data

We can use the [Query API](../../../../doc/#builtin/sql/QueryAPI.md.html) to get the data into a Pandas DataFrame to take a quick look at it. You will notice that certain cells have a 'NaN' value, as seen below. This is because the dataset is sparse: not every word is present in the description of every concept. Those missing values are represented as NaNs in a Pandas DataFrame.

In [10]:
mldb.query("SELECT * FROM ml_concepts LIMIT 5")

_rowName,addition,algorithm,algorithms,also,analysis,analyze,applications,approach,assigns,associated,...,popularized,provide,rather,recurrent,serve,stored,systems,threshold,understanding,wrong
Support vector machine,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,,,,,,,,,,
Logistic regression,,,,,,,,,,,...,,,,,,,,,,
Deep belief network,,,,1.0,,,,,,,...,,,,,,,,,,
Restricted boltzmann machines,,1.0,2.0,,,,1.0,,,,...,,,,,,,,,,
Hopfield network,,,,1.0,,,,,,,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
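
To see where the NaNs come from, the following plain-Python sketch turns sparse rows into a rectangular table. It only illustrates the idea and is not how MLDB or Pandas do it internally. The helper name `to_dense` is hypothetical, and the two rows reuse a few values from the output above.

```python
import math

# Sparse rows: each row stores only the words it actually contains.
sparse_rows = {
    "Deep belief network": {"also": 1.0},
    "Restricted boltzmann machines": {"algorithm": 1.0, "algorithms": 2.0},
}

def to_dense(rows):
    """Build a rectangular table, using NaN where a row has no value for a column."""
    # The column set is the union of the words found in any row.
    columns = sorted({col for row in rows.values() for col in row})
    table = {
        name: [row.get(col, float("nan")) for col in columns]
        for name, row in rows.items()
    }
    return columns, table

columns, table = to_dense(sparse_rows)
print(columns)
for name, values in table.items():
    print(name, values)
```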


## Exploring the data
Let's count the number of words that describe each concept.

In [11]:
mldb.query(
    """
    SELECT horizontal_sum({*}) AS count
    FROM ml_concepts
    """
)

_rowName,count
Support vector machine,149
Logistic regression,99
Deep belief network,98
Restricted boltzmann machines,124
Hopfield network,53
Naive bayes classifier,165
Boltzmann machine,89
Autoencoder,31
Artificial neural network,40
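
The row-wise sum used above can be pictured with a short plain-Python sketch. It assumes that missing cells contribute nothing to the total. The helper `horizontal_sum_sketch` and the toy rows are hypothetical and only mimic the behaviour of `horizontal_sum` on a sparse row.

```python
# Hypothetical sparse rows (word -> count); absent words are simply not stored.
toy_concepts = {
    "concept A": {"algorithm": 2, "also": 1, "analysis": 1},
    "concept B": {"also": 1},
}

def horizontal_sum_sketch(row):
    """Add up the values present in one row, skipping missing (None) entries."""
    return sum(value for value in row.values() if value is not None)

word_totals = {name: horizontal_sum_sketch(row) for name, row in toy_concepts.items()}
print(word_totals)
```

Because the import dropped words shorter than 4 characters, totals like these count only the tokens that survived that filter, not every word of the original description.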


## Taking the transpose of data
There are two ways to take the transpose of a dataset:

1. using a transposed dataset
2. using the `transpose` SQL function

In [12]:
print(mldb.put("/v1/datasets/transposed_concepts", {
        "type": "transposed",
        "params": 
        {
            "dataset": {"id":"ml_concepts"}
        }
    }
))

<Response [201]>


The code above transposed the data and created the <code>transposed_concepts</code> dataset. Now that we have transposed our data, we can easily see which words are used most frequently in the machine learning concept descriptions. Note that the query below joins the results from the <code>transposed_concepts</code> dataset with those of the inline <code>transpose</code> function so that the two outputs can be compared.

In [13]:
mldb.query(
    """
        SELECT 
            horizontal_count({a.*}) AS top_words_transp_dataset,
            horizontal_count({b.*}) AS top_words_transp_function
        NAMED a.rowName()
        FROM transposed_concepts AS a
        JOIN transpose(ml_concepts) AS b
            ON a.rowName() = b.rowName()
            ORDER BY top_words_transp_dataset DESC
            LIMIT 10
    """
)

_rowName,top_words_transp_dataset,top_words_transp_function
learning,7,7
with,7,7
neural,6,6
machine,6,6
training,5,5
used,5,5
machines,5,5
network,5,5
model,5,5
which,4,4


While the methods yield similar results, the transposed dataset allows the user to keep the output dataset in memory. This may be useful if you want to use the dataset in multiple steps of your data analysis. In cases where you don't need to use the transpose more than once, you can simply use the inline syntax.
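
For intuition, transposing a sparse table and counting the non-missing cells of each new row can be sketched in plain Python as follows. This is a toy model, not MLDB's implementation. The helpers `transpose_sketch` and `horizontal_count_sketch` and the word counts are invented for illustration.

```python
from collections import defaultdict

def transpose_sketch(rows):
    """Swap row names and column names of a sparse {row: {column: value}} table."""
    transposed = defaultdict(dict)
    for row_name, row in rows.items():
        for col_name, value in row.items():
            transposed[col_name][row_name] = value
    return dict(transposed)

def horizontal_count_sketch(row):
    """Count the cells of a row that actually hold a value."""
    return sum(1 for value in row.values() if value is not None)

# Hypothetical concept -> {word: count} data.
concepts = {
    "Hopfield network": {"network": 3, "also": 1},
    "Boltzmann machine": {"network": 2, "hopfield": 1},
    "Autoencoder": {"network": 1},
}

words = transpose_sketch(concepts)  # word -> {concept: count}
counts = {word: horizontal_count_sketch(row) for word, row in words.items()}
# Most widely used words first; ties are broken alphabetically.
top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(top)
```

Note that the count is the number of concepts whose description contains the word, not the total number of occurrences, which is why the largest values in the output above cannot exceed the number of concepts.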

## Taking a sample of data
There are two ways to take a sample of a dataset:

1. using a sampled dataset
2. using the `sample` SQL function

In [14]:
print(mldb.put("/v1/datasets/sampled_tokens", {
        "type": "sampled",
        "params": 
        {
            "rows": 10,
            "withReplacement": False,
            "dataset": {"id":"transposed_concepts"},
            "seed": 0
        }
    }
))

<Response [201]>


In [15]:
mldb.query("SELECT * FROM sampled_tokens")

_rowName,Naive bayes classifier,Artificial neural network,Boltzmann machine,Restricted boltzmann machines,Support vector machine,Deep belief network,Logistic regression,Hopfield network
maximum,1.0,,,,,,,
nervous,,1.0,,,,,,
random,,,1.0,,,,,
studied,1.0,,,,,,,
neurons,,,,1.0,,,,
symmetric,,,,1.0,,,,
marked,,,,,1.0,,,
such,1.0,,,,,1.0,1.0,
hopfield,,,1.0,,,,,4.0
only,,,,,1.0,,,


The above code would be similar to the following SQL function:
```python
mldb.query(
    """
    SELECT * 
    FROM sample(
        transposed_concepts, 
        {
            rows: 10,
            withReplacement: False
        }
    )                      
    """
) 
```

Again, the two methods provide the same desired outcome, allowing you to choose how to best manipulate your data.
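
The roles of `rows`, `withReplacement` and `seed` can be illustrated with Python's standard `random` module. This is only an analogy: `sample_rows_sketch` is a hypothetical helper, and MLDB's random number generator is a different one, so the same seed will not select the same rows as the sampled dataset above.

```python
import random

def sample_rows_sketch(row_names, rows, with_replacement=False, seed=0):
    """Pick `rows` row names at random; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    names = list(row_names)
    if with_replacement:
        # The same row may be drawn several times.
        return [rng.choice(names) for _ in range(rows)]
    # Each row appears at most once; `rows` must not exceed len(names).
    return rng.sample(names, rows)

# Row names taken from the top-words output earlier in this tutorial.
tokens = ["learning", "with", "neural", "machine", "training",
          "used", "machines", "network", "model", "which"]

picked = sample_rows_sketch(tokens, 3, seed=0)
print(picked)
```

Calling the helper twice with the same seed returns the same rows, which is the property the `seed` parameter provides when a sample needs to be reproducible.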

## Using datasets vs. SQL functions
As seen above, the two different ways to either transpose or sample data are equivalent. It is recommended to use Dataset types instead of SQL functions when the created table will be reused or called later in the program.

As seen in this tutorial, MLDB allows you to virtually manipulate data in multiple ways. This grants greater flexibility when constructing models and analyzing complex data.

## Where to next?

Check out the other [Tutorials and Demos](../../../../doc/#builtin/Demos.md.html).