# Creating and Using Datasets <a href="https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/using-datasets/creating-and-using-datasets.ipynb" target="_blank"><img src="https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab" alt="Run on Colab"></a>

In this notebook, we demonstrate the creation and usage of Datasets using the Synthetic Data SDK.

Full Datasets endpoints documentation is available in the [API documentation](https://api-docs.mostly.ai/#ee628a4d-afb6-4d44-96dd-a86e1b345f95).

In [None]:
# Install SDK in CLIENT mode
!uv pip install -U mostlyai
# Or install in LOCAL mode
!uv pip install -U 'mostlyai[local]'
# Note: Restart kernel session after installation!

In [None]:
from mostlyai.sdk import MostlyAI

# Get your api key from https://app.mostly.ai/settings/api-keys
api_key = "my-api-key"

# Initialize the SDK in CLIENT mode (the API key connects to the MOSTLY AI Platform)
mostly = MostlyAI(api_key=api_key)
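
Hardcoding the key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the key could instead be read from an environment variable before being passed to `MostlyAI(api_key=...)`. The helper `resolve_api_key` and the variable name `MOSTLY_API_KEY` are illustrative assumptions, not something this tutorial prescribes.

```python
import os


def resolve_api_key(env=None, var_name="MOSTLY_API_KEY"):
    """Return the API key stored under `var_name`, or raise if it is missing.

    `var_name` is a hypothetical choice for this sketch; use whatever name
    your environment actually defines.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


# Demo with an explicit mapping so the sketch runs without a real key.
api_key = resolve_api_key({"MOSTLY_API_KEY": "my-api-key"})
print(api_key)
```

In a real session, `resolve_api_key()` would be called without arguments so that it reads from `os.environ`.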

## Creating a Dataset

Every Dataset requires at least a `name` and a `description`. The `name` is how other users identify your Dataset, while the `description` tells users or the Assistant how to handle it (for example, by pointing to a remote location from which the Dataset files can be downloaded). See [Example Descriptions](#example-descriptions).

A Dataset can optionally include either of the following:
- Files: remotely stored files containing the underlying dataset as well as any supporting artifacts. See [Creating a Dataset from a file](#creating-a-dataset-from-a-file)
- Connector: the [Connector](https://mostly-ai.github.io/mostlyai/#data-connectors) asset that provides access to the target dataset. See [Creating a Dataset from a Connector](#creating-a-dataset-from-a-connector)


### Creating a Dataset from a description

A dataset created with just a `description` can contain instructions for users or the MOSTLY AI Assistant. Consider the following example, whose description instructs the Assistant to download data from a remote location on the public internet.

In [None]:
from mostlyai.sdk.domain import DatasetConfig

# Dataset config
config = DatasetConfig(
    name="airlines_example_dataset",
    description="navigate to https://github.com/mostly-ai/public-demo-data/tree/dev/airlines and download the flight.csv file",
)

dataset = mostly.datasets.create(config=config)

Now you can access this dataset on the [MOSTLY AI Platform](https://app.mostly.ai) or via the SDK. See [Retrieving a Dataset](#retrieving-a-dataset) for more information about the latter.

On the MOSTLY AI Platform, click Explore to use the Dataset with [the Assistant](https://docs.mostly.ai/assistant). The Assistant can help you train a generator and generate synthetic data, or create artifacts like visualizations that you can share with anyone.

![](./images/datasets-01.png)

### Creating a Dataset from a file

You can also create a dataset from a file on your local file system. For this tutorial, we use fictitious user data created with [MOSTLY Mock](https://github.com/mostly-ai/mostlyai-mock) and available at `./data/mock_user_data.csv`.

A dataset created from a file still has a `name` and a `description` but the `description` can now be used to explain the data structure to other users or prompt the Assistant on how to handle the file or files.

The `description` parameter accepts [Markdown](https://daringfireball.net/projects/markdown/) syntax styling and formatting.

In [None]:
# Define the description with Markdown formatting applied
description = """
## Mock User Data Dataset

This dataset contains 1,000 rows of realistic mock user profiles for a consumer web application.  
It includes demographic information, contact details, and account metadata, suitable for analytics, testing, or prototyping.

### Data Dictionary

| Column              | Description                                               |
|---------------------|-----------------------------------------------------------|
| user_id             | Unique user identifier                                    |
| first_name          | User's first name                                         |
| last_name           | User's last name                                          |
| email               | Realistic email address                                   |
| gender              | Gender (male, female, non-binary)                         |
| date_of_birth       | Date of birth (1950-01-01 to 2005-12-31)                  |
| country             | Country of residence                                      |
| city                | City of residence                                         |
| signup_date         | Date the user signed up (2018-01-01 to 2023-12-31)        |
| last_login          | Last login date (2023-01-01 to 2023-12-31)                |
| account_status      | Account status (active, inactive, suspended)              |
| subscription_type   | Subscription type (free, basic, premium)                  |
| referral_source     | How the user was referred (organic, ad, friend, other)    |
| num_logins          | Number of logins (0 to 500)                               |
| avg_session_minutes | Average session length in minutes (1.0 to 120.0)          |
"""

# Dataset config
config = DatasetConfig(
    name="Mock User Data Dataset",
    description=description,
)

dataset = mostly.datasets.create(config=config)
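
Hand-aligned Markdown tables are tedious to maintain as columns change. As a rough sketch, a data dictionary like the one in the description above could be rendered from a plain mapping of column names to descriptions. The helper `markdown_data_dictionary` is hypothetical and not part of the SDK.

```python
def markdown_data_dictionary(columns):
    """Render {column_name: description} as a two-column Markdown table."""
    # Pad each column to the width of its longest entry (header included).
    name_w = max([len("Column")] + [len(name) for name in columns])
    desc_w = max([len("Description")] + [len(text) for text in columns.values()])
    rows = [("Column", "Description")] + list(columns.items())
    lines = [f"| {name.ljust(name_w)} | {text.ljust(desc_w)} |" for name, text in rows]
    # Insert the header separator directly below the header row.
    lines.insert(1, f"|{'-' * (name_w + 2)}|{'-' * (desc_w + 2)}|")
    return "\n".join(lines)


table = markdown_data_dictionary(
    {
        "user_id": "Unique user identifier",
        "email": "Realistic email address",
    }
)
print(table)
```

The resulting string can be concatenated with a heading and summary text and then passed as the `description` of a `DatasetConfig`.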

In [None]:
# Upload a locally stored file to the Dataset created in the previous step
dataset.upload_file("./data/mock_user_data.csv")
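
Before uploading, it can help to confirm that the file exists and that its header matches the data dictionary in the description. The following is a minimal sketch using only the standard library; `missing_columns` is a hypothetical helper, and the sample file is created on the fly purely for demonstration.

```python
import csv
import tempfile
from pathlib import Path


def missing_columns(csv_path, expected):
    """Return the expected column names that are absent from the CSV header."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"No file at {path}")
    with path.open(newline="", encoding="utf-8") as handle:
        # Read only the first row; an empty file yields an empty header.
        header = next(csv.reader(handle), [])
    return [name for name in expected if name not in header]


# Demo on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "sample.csv"
    sample.write_text("user_id,email\n1,a@example.com\n", encoding="utf-8")
    print(missing_columns(sample, ["user_id", "email", "country"]))  # prints ['country']
```

An empty result means every documented column is present, so the description and the uploaded file agree.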

As we saw previously, the Dataset has been created, and the Markdown formatting is visible in the `description`.

The Dataset can now be explored on the MOSTLY AI Platform!

![](./images/datasets-02.png)

### Creating a Dataset from a Connector

A [Connector](https://docs.mostly.ai/connectors) lets you connect to your own remote data sources from the MOSTLY AI ecosystem.

Datasets can be created with reference to a specific connector that exists on your MOSTLY AI Platform instance or one to which you have access (for example, a public connector).

For more information about creating a connector, refer to the [MOSTLY AI documentation](https://docs.mostly.ai/connectors/create).

In [None]:
# Dataset config
config = DatasetConfig(
    name="Baseball Folder Dataset",
    description="Dataset referencing all data in the 'baseball' folder of the specified connector.",
    connectors=[{"id": "e43aa845-8d77-4cda-bc9e-10da9a1496a9", "locations": ["baseball"]}],
)

dataset = mostly.datasets.create(config=config)
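
The `connectors` entry above is a plain dictionary with an `id` and a list of `locations`. As a hedged sketch, a small helper could build that dictionary and catch malformed IDs early. `connector_reference` is hypothetical, and treating the connector ID as a UUID is an assumption based only on the example value shown above.

```python
import uuid


def connector_reference(connector_id, locations):
    """Build one entry for the `connectors` list of a Dataset config."""
    # Assumption: connector IDs are UUIDs, as in the example above.
    uuid.UUID(connector_id)  # raises ValueError if the ID is not a valid UUID
    # Drop surrounding whitespace and skip blank location strings.
    cleaned = [loc.strip() for loc in locations if loc.strip()]
    return {"id": connector_id, "locations": cleaned}


entry = connector_reference(
    "e43aa845-8d77-4cda-bc9e-10da9a1496a9",
    [" baseball ", ""],
)
print(entry)
```

The returned dictionary has the same shape as the literal passed to `connectors=[...]` in the cell above.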

We can explore the Dataset using the MOSTLY AI Assistant or use it further with the Synthetic Data SDK.

![](./images/datasets-03.png)

## Retrieving a Dataset

Once a dataset has been created, you can retrieve it with the SDK in order to work with it locally. 

In this section of the tutorial, we'll retrieve a public dataset from MOSTLY AI and explore it locally.

In [None]:
import pandas as pd

# Define dataset ID and file to download
dataset_id = "17da17c9-3606-423f-996b-4a458f5997b5"
file_name = "stations.parquet"
local_path = "./data/stations.parquet"

# Download the file from the dataset
dataset = mostly.datasets.get(dataset_id)
dataset.download_file(file_name, local_path)

# Load and display the data
df = pd.read_parquet(local_path)
print(df.head())
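
The download above writes to `./data/stations.parquet`, which presumes the `./data` directory already exists. Whether `download_file` creates missing directories is not stated here, so a cautious sketch is to create the parent directory first. `prepare_download_path` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def prepare_download_path(local_path):
    """Create the parent directory of `local_path` if needed and return it as a Path."""
    path = Path(local_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demo inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = prepare_download_path(Path(tmp_dir) / "data" / "stations.parquet")
    print(target.parent.is_dir())  # prints True
```

In the cell above, `local_path` could be passed through this helper before calling `dataset.download_file(file_name, local_path)`.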

You can now access the locally stored copy of the [Meteostat weather station data](https://app.mostly.ai/d/datasets/17da17c9-3606-423f-996b-4a458f5997b5).

This data can be used to train a [Generator](https://github.com/mostly-ai/mostlyai/blob/main/docs/tutorials/getting-started/getting-started.ipynb) with the Synthetic Data SDK or perform any other kind of analysis you wish.