# Introduction to scikit-dataset: A Simple Dataset Management Library

**scikit-dataset** is a Python library that simplifies the management and manipulation of datasets. It lets you create, access, and manipulate datasets built from a variety of data structures, such as lists, NumPy arrays, and different types of indexable dataframes, in a simple and intuitive way.

## Getting Started

Let's dive into the basics of using scikit-dataset (imported as `skdataset`) to create and work with datasets.




### Creating a Dataset

To create a dataset with skdataset, you can pass data in various formats such as lists, NumPy arrays, Pandas Series, and DataFrames. Here's an example:



In [3]:
from skdataset import Dataset
import numpy as np
import pandas as pd
np.set_printoptions(threshold=10)

dataset = Dataset(
    a = range(100), # an iterable sequence that supports len()
    b = list(range(100)), # a list
    c = np.arange(100), # a numpy array
    d = pd.Series(np.arange(100)), # a pandas Series
    e = pd.DataFrame({'col1': np.arange(100), 'col2': np.arange(100)}), # a pandas DataFrame
    name = 'example dataset',
    description= 'this is an example dataset',
)


dataset.name

'example dataset'

As you can see, you are free to choose the variable names. This example uses `a`, `b`, `c`, `d`, and `e`, but you can use whatever names suit your needs. The only restriction is that all variables must have the same length.
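To make the same-length rule concrete, here is a minimal sketch of how such a check could be written in plain Python. The function name `check_equal_lengths` and the choice of `ValueError` are assumptions made for illustration; this is not skdataset's actual validation code, and the library may report a mismatch differently.

```python
def check_equal_lengths(**variables):
    """Return the shared length, or raise ValueError if the lengths differ.

    Hypothetical helper for illustration only; not part of skdataset.
    """
    lengths = {name: len(values) for name, values in variables.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"variables have different lengths: {lengths}")
    # With no variables at all there are no rows, so report 0.
    return next(iter(lengths.values()), 0)


shared = check_equal_lengths(a=range(100), b=list(range(100)))
print(shared)  # 100
```

Anything that supports `len()` passes through this check, which is why a `range` and a `list` can sit side by side.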

You can also attach a `name`, a `description`, and `metadata` to the dataset, which can be useful for documentation and organization purposes.

`Dataset` acts like a dictionary where the keys are the variable names and the values are the data. You can access the data using the variable names as keys or as attributes of the dataset object.


In [4]:
isinstance(dataset, dict)

True

Here are all the keys of the dataset. Note that `name` and `description` are not among them; only the data variables are included.


In [5]:
dataset.keys()

dict_keys(['a', 'b', 'c', 'd', 'e'])

But unlike with a plain dictionary, you can also access the data as attributes, using the variable names:

In [6]:
dataset.e

    col1  col2
0      0     0
1      1     1
2      2     2
3      3     3
4      4     4
..   ...   ...
95    95    95
96    96    96
97    97    97
98    98    98
99    99    99

[100 rows x 2 columns]


In [7]:
dataset['a'] == dataset.a

True
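For readers curious how a dictionary can expose its keys as attributes, the sketch below shows one common pattern: a `dict` subclass that defines `__getattr__`. The class name `AttrDict` is hypothetical, and this is only an illustration of the general technique, not skdataset's implementation.

```python
class AttrDict(dict):
    """A dict whose keys can also be read as attributes (illustrative only)."""

    def __getattr__(self, name):
        # __getattr__ runs only when normal attribute lookup fails,
        # so built-in dict methods such as keys() keep working.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


data = AttrDict(a=range(100), b=list(range(100)))
print(isinstance(data, dict))  # True
print(data["a"] == data.a)     # True
```

One consequence of this pattern is that a key sharing its name with a dict method (for example `keys`) is reachable only through `data["keys"]`, because the method wins the attribute lookup.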

A `Dataset` differs from a dictionary in some important ways. One of them is iteration: looping over a `Dataset` yields its values, in the order in which the variables were added, whereas looping over a dictionary yields its keys.

In [8]:
my_dict = dict(a=1, b=2, c=3)
my_dataset = Dataset(a=1, b=2, c=3)

print("Looping over normal dict example:")
for key in my_dict:
    print('   ', key)

print("Looping over Dataset example:")
for value in my_dataset:
    print('   ', value)

Looping over normal dict example:
    a
    b
    c
Looping over Dataset example:
    [1]
    [2]
    [3]


As you might have noticed, the `Dataset` not only returned the values but also wrapped each one in a list. The reason is that every variable in a `Dataset` needs to be indexable, and a plain integer is not, so scalar values are wrapped in a list to make them indexable.
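The two behaviours described here, iterating over values and wrapping scalars, can be imitated with a small `dict` subclass. The sketch below is an assumption-laden illustration: the class name `ValueIterDict` and the `__getitem__` test used to decide what counts as indexable are choices made for this example, not a description of how skdataset does it.

```python
class ValueIterDict(dict):
    """Dict that iterates over values and wraps scalars (illustrative only)."""

    def __init__(self, **variables):
        super().__init__()
        for name, value in variables.items():
            # Assumed rule: anything without __getitem__ (e.g. an int)
            # is wrapped in a one-element list so it can be indexed.
            self[name] = value if hasattr(value, "__getitem__") else [value]

    def __iter__(self):
        # A plain dict would yield keys here; this class yields values.
        return iter(self.values())


wrapped = ValueIterDict(a=1, b=2, c=3)
print(list(wrapped))         # [[1], [2], [3]]
print(list(wrapped.keys()))  # ['a', 'b', 'c']
```

The keys stay available through `keys()`, so nothing is lost by changing what iteration returns.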


### Accessing Data

#### Indexing and Slicing

Once you have created a dataset, you can select rows from it with an integer index, a slice, or a boolean mask. Here's an example of each:





In [9]:
dataset[5]

{'a': range(5, 6),
 'b': [5],
 'c': array([5]),
 'd': 5    5
 dtype: int64,
 'e':    col1  col2
 5     5     5}

In [10]:
mask = [True] * 5 + [False] * 95
dataset[mask]

{'a': [0, 1, 2, 3, 4],
 'b': [0, 1, 2, 3, 4],
 'c': array([0, 1, 2, 3, 4]),
 'd': 0    0
 1    1
 2    2
 3    3
 4    4
 dtype: int64,
 'e':    col1  col2
 0     0     0
 1     1     1
 2     2     2
 3     3     3
 4     4     4}

In [11]:
dataset[:5]

{'a': range(0, 5),
 'b': [0, 1, 2, 3, 4],
 'c': array([0, 1, 2, 3, 4]),
 'd': 0    0
 1    1
 2    2
 3    3
 4    4
 dtype: int64,
 'e':    col1  col2
 0     0     0
 1     1     1
 2     2     2
 3     3     3
 4     4     4}

In [12]:
type(dataset[:5])

skdataset.dataset.Dataset

As you can see, row indexing returns a new `Dataset` object containing only the selected rows.
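The outputs above suggest that one row index is applied to every variable in turn. Below is a rough sketch of that idea for plain Python sequences. The helper `select_rows` is hypothetical, returns a plain `dict` rather than a `Dataset`, and leaves out the NumPy and pandas handling that the real library evidently performs.

```python
from itertools import compress


def select_rows(variables, index):
    """Apply one row index to every variable (illustrative only)."""
    if isinstance(index, int):
        # Keep the result indexable: row 5 becomes the slice 5:6.
        # "or None" handles index == -1, whose stop would otherwise be 0.
        index = slice(index, index + 1 or None)
    if isinstance(index, slice):
        return {name: values[index] for name, values in variables.items()}
    # Otherwise treat the index as a boolean mask, one flag per row.
    return {name: list(compress(values, index)) for name, values in variables.items()}


data = {"a": range(100), "b": list(range(100))}
print(select_rows(data, 5))                            # {'a': range(5, 6), 'b': [5]}
print(select_rows(data, slice(None, 5))["a"])          # range(0, 5)
print(select_rows(data, [True] * 5 + [False] * 95)["a"])  # [0, 1, 2, 3, 4]
```

Turning an integer into a one-element slice is what keeps every variable a sequence after selection, which matches the `range(5, 6)` and `[5]` values shown for `dataset[5]`.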

#### Selecting variables

Because `Dataset` is a dictionary-like structure, a single variable name used as a key returns that variable's data. You can also pass a list of variable names to select several variables at once. Here is an example:


In [13]:
dataset[['d', 'e']]

{'d': 0      0
 1      1
 2      2
 3      3
 4      4
       ..
 95    95
 96    96
 97    97
 98    98
 99    99
 Length: 100, dtype: int64,
 'e':     col1  col2
 0      0     0
 1      1     1
 2      2     2
 3      3     3
 4      4     4
 ..   ...   ...
 95    95    95
 96    96    96
 97    97    97
 98    98    98
 99    99    99
 
 [100 rows x 2 columns]}

#### 2D Indexing

You can also access the data using 2D indexing. Remember that the first index selects rows (which are shared by all of the variables) and the second index selects variables. Here is an example:

In [14]:
dataset[:10, ['b', 'c']]

{'b': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 'c': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])}
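One way to support all of these key styles in a single object is to dispatch on the key's type inside `__getitem__`. The sketch below is a simplified illustration under stated assumptions: the class name `Table` is hypothetical, rows can only be selected with slices, and none of this is taken from skdataset's source.

```python
class Table(dict):
    """Dict subclass that dispatches on the key type (illustrative only)."""

    def __getitem__(self, key):
        if isinstance(key, str):
            # A single variable name returns that variable's data.
            return dict.__getitem__(self, key)
        if isinstance(key, list) and all(isinstance(k, str) for k in key):
            # A list of names selects a subset of the variables.
            return Table({k: dict.__getitem__(self, k) for k in key})
        if isinstance(key, slice):
            # A slice selects the same rows from every variable.
            return Table({k: v[key] for k, v in self.items()})
        if isinstance(key, tuple) and len(key) == 2:
            # 2D key: rows first, variables second.
            rows, names = key
            return self[names][rows]
        raise TypeError(f"unsupported key: {key!r}")


table = Table(b=list(range(100)), c=list(range(100)), d=list(range(100)))
subset = table[:10, ["b", "c"]]
print(list(subset.keys()))  # ['b', 'c']
print(subset["b"])          # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Selecting the variables before the rows, as done here, means only the requested variables are sliced; the opposite order would give the same result at a higher cost.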

## Conclusion

In this tutorial, you've learned the basics of using scikit-dataset to create and manipulate datasets in Python. skdataset provides a convenient way to work with different types of data structures, making it easier to analyze and process datasets in your machine learning and data science projects.