# Introduction to the Dataset Object

The `Dataset` object provides a convenient interface for loading, manipulating, and analyzing structured data in Python, such as tabular or time series data.

In this tutorial, we will explore the various features and functionalities of the `Dataset` object. We will learn how to load datasets, access and manipulate data, perform common data preprocessing tasks, and visualize the data. By the end of this tutorial, you will have a solid understanding of how to effectively work with the `Dataset` object and leverage its capabilities for your data analysis projects.

Let's get started!

# Creating a Dataset

First, let's create a dataset using the famous Iris dataset from `sklearn`.

In [8]:
from sklearn import datasets

iris = datasets.load_iris()
iris.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [9]:
import pandas as pd
from holisticai.datasets import Dataset

X = pd.DataFrame(iris['data'], columns=iris['feature_names'])
y = pd.Series(iris['target'], name='target')

ds = Dataset(X=X, y=y)
ds
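Before wrapping `X` and `y` in a `Dataset`, it can help to confirm that the two are aligned. The sketch below uses only pandas and a tiny hand-made frame (the values are illustrative, not taken from Iris), and it does not call `holisticai`.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up for this example.
X = pd.DataFrame(
    {"sepal length (cm)": [5.1, 4.9, 6.3], "sepal width (cm)": [3.5, 3.0, 3.3]}
)
y = pd.Series([0, 0, 2], name="target")

# Features and target should have the same number of rows and the same index.
same_length = len(X) == len(y)
same_index = X.index.equals(y.index)
print(same_length, same_index)
```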

# Loading a Dataset

You can load a dataset with custom parameters such as `protected_attribute`. For the `adult` dataset, the protected attributes are `race` (ethnicity) and `sex` (gender); the example below uses `sex`.

In [10]:
from holisticai.datasets import load_dataset

dataset = load_dataset('adult', protected_attribute='sex')
dataset
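Samples in the loaded dataset carry boolean `group_a` and `group_b` fields alongside the protected attributes. One plausible way such masks could be derived from a protected attribute column is sketched below in plain pandas. The mapping of `Male` to `group_a` is an assumption made for illustration, not a statement about how `load_dataset` builds the groups.

```python
import pandas as pd

# Hypothetical protected-attribute column; the values are illustrative.
p_attr = pd.Series(["Male", "Female", "Male", "Female"], name="sex")

# Assumed convention: group_a marks one category, group_b its complement.
group_a = p_attr == "Male"
group_b = ~group_a

print(group_a.tolist())
print(group_b.tolist())
```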

## Accessing Data

Let's look at how to access features and samples in the dataset.

### Accessing Features

To access the features in the dataset:

In [11]:
dataset['X']

|  | age | fnlwgt | capital-gain | capital-loss | hours-per-week | workclass_Federal-gov | workclass_Local-gov | workclass_Never-worked | workclass_Private | workclass_Self-emp-inc | ... | native-country_Portugal | native-country_Puerto-Rico | native-country_Scotland | native-country_South | native-country_Taiwan | native-country_Thailand | native-country_Trinadad&Tobago | native-country_United-States | native-country_Vietnam | native-country_Yugoslavia |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 25.0 | 226802.0 | 0.0 | 0.0 | 40.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 38.0 | 89814.0 | 0.0 | 0.0 | 50.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 28.0 | 336951.0 | 0.0 | 0.0 | 40.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 44.0 | 160323.0 | 7688.0 | 0.0 | 40.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 34.0 | 198693.0 | 0.0 | 0.0 | 30.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 45217 | 27.0 | 257302.0 | 0.0 | 0.0 | 38.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 45218 | 40.0 | 154374.0 | 0.0 | 0.0 | 40.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 45219 | 58.0 | 151910.0 | 0.0 | 0.0 | 40.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 45220 | 22.0 | 201490.0 | 0.0 | 0.0 | 20.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |


### Accessing a Sample

To access a specific sample in the dataset:

In [12]:
dataset[0]

{'X': subfeatures
 age                                   25.0
 fnlwgt                            226802.0
 capital-gain                           0.0
 capital-loss                           0.0
 hours-per-week                        40.0
                                     ...   
 native-country_Thailand                0.0
 native-country_Trinadad&Tobago         0.0
 native-country_United-States           1.0
 native-country_Vietnam                 0.0
 native-country_Yugoslavia              0.0
 Name: 0, Length: 97, dtype: float64,
 'y': np.int64(0),
 'p_attrs': subfeatures
 race    Black
 sex      Male
 Name: 0, dtype: object,
 'group_a': np.True_,
 'group_b': np.False_}

## Manipulating Data

### Creating New Features

You can create new features in the dataset using the `map` method, which takes a function that receives a sample and returns a dictionary of new fields:

In [13]:
dataset.map(lambda x: {'group_c': x['group_a'], 'group_d': x['group_b']})
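Conceptually, the function passed to `map` receives one sample and returns a dictionary of new fields. The plain-Python sketch below mimics that idea on a list of dictionaries. `map_samples` is a hypothetical helper written for this illustration and is not part of `holisticai`; merging the new fields into the existing sample is likewise an assumption.

```python
def map_samples(samples, fn):
    """Return new sample dicts with the fields produced by fn merged in."""
    return [{**sample, **fn(sample)} for sample in samples]


# Illustrative samples; the field names follow the tutorial's output.
samples = [
    {"y": 0, "group_a": True, "group_b": False},
    {"y": 1, "group_a": False, "group_b": True},
]

mapped = map_samples(
    samples, lambda x: {"group_c": x["group_a"], "group_d": x["group_b"]}
)
print(mapped[0])
```

The original `samples` list is left untouched, since each mapped sample is a new dictionary.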

### Renaming Features

You can rename features in the dataset:

In [14]:
dataset.rename({'group_a': 'ga'})

### Selecting Specific Indices

To select specific indices in the dataset:

In [15]:
dataset.select([1, 2, 3])

### Filtering Data

To filter the dataset based on a condition:

In [16]:
dataset.filter(lambda x: x['group_a'] == True)
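For readers more familiar with pandas, positional selection and conditional filtering have rough counterparts in `DataFrame.iloc` and boolean masks. The sketch below is an analogy on a small made-up frame, not a description of how `select` or `filter` are implemented.

```python
import pandas as pd

# Small illustrative frame; the values are made up for this example.
df = pd.DataFrame(
    {"age": [25, 38, 28, 44], "group_a": [True, True, False, True]}
)

# Positional selection, comparable in spirit to select([1, 2, 3]).
selected = df.iloc[[1, 2, 3]]

# Conditional selection, comparable in spirit to filtering on group_a.
filtered = df[df["group_a"]]

print(selected.index.tolist())
print(filtered["age"].tolist())
```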

## Grouping Data

### Creating Groups

You can create groups in the dataset using one or more features:

In [17]:
grouped_dataset = dataset.groupby(['group_a', 'y'])
grouped_dataset

### Counting Samples in Each Group

You can count the number of samples in each group:

In [18]:
grouped_dataset.count()

|  | group_a | y | group_size |
| --- | --- | --- | --- |
| 0 | False | 0 | 13026 |
| 1 | False | 1 | 1669 |
| 2 | True | 0 | 20988 |
| 3 | True | 1 | 9539 |
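The group sizes can be turned into per-group rates with a little arithmetic. The sketch below copies the four counts from the output above and computes the share of samples with `y == 1` within each value of `group_a`; `positive_rate` is a helper name introduced here for illustration.

```python
# Group sizes copied from the count() output: (group_a, y) -> group_size.
group_sizes = {
    (False, 0): 13026,
    (False, 1): 1669,
    (True, 0): 20988,
    (True, 1): 9539,
}

total = sum(group_sizes.values())


def positive_rate(group_a):
    """Fraction of samples with y == 1 within one group_a value."""
    positives = group_sizes[(group_a, 1)]
    members = positives + group_sizes[(group_a, 0)]
    return positives / members


print(total)
print(round(positive_rate(False), 3), round(positive_rate(True), 3))
```

With these counts, roughly 11% of the samples where `group_a` is `False` have `y == 1`, compared with roughly 31% where `group_a` is `True`.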


### Selecting the First N Elements of Each Group

To select the first 10 elements for each group in the dataset:

In [19]:
grouped_dataset.head(10)

### Sampling Elements from Each Group

To sample 20 elements from each group in the dataset:

In [20]:
grouped_dataset.sample(20)
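The per-group `head` and `sample` operations resemble the methods of the same names on a pandas `groupby` object. The sketch below shows the pandas versions on a small made-up frame as an analogy; it makes no claim about how the grouped dataset implements them.

```python
import pandas as pd

# Small illustrative frame; the values are made up for this example.
df = pd.DataFrame(
    {
        "group_a": [True, True, True, False, False, False],
        "y": [0, 0, 1, 1, 1, 1],
        "age": [25, 38, 28, 44, 34, 27],
    }
)
grouped = df.groupby(["group_a", "y"])

# First 2 rows of each (group_a, y) group, kept in original row order.
first_rows = grouped.head(2)

# Random draw of 1 row per group; random_state makes the draw repeatable.
sampled_rows = grouped.sample(n=1, random_state=0)

print(len(first_rows), len(sampled_rows))
```

In pandas, `head` returns fewer rows for groups smaller than `n`, whereas `sample` raises an error if `n` exceeds a group's size unless `replace=True` is passed.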

## Splitting the Dataset

You can split the dataset for training and testing purposes:

In [21]:
train_test = dataset.train_test_split(test_size=0.2, random_state=42)
train_test
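The idea behind a seeded split can be sketched with the standard library alone: shuffle the row indices with a fixed seed, then cut off a share for testing. `split_indices` is a hypothetical helper for illustration, and its rounding rule is an assumption; it is not the implementation behind `train_test_split`.

```python
import random


def split_indices(n_samples, test_size=0.2, random_state=42):
    """Shuffle row indices with a fixed seed and cut off a test share."""
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

Fixing `random_state` makes the split repeatable, which is useful when comparing results across runs.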