Dataset operations example

In this example we will take the following steps:

    Register data family
    Register access credentials for your storage account
    Register datasets
    View all registered datasets (Fetches info for all registered datasets)
    Preview dataset by dataset_id
    Compare datasets
    Notify the user

Prerequisites:
    You need the access key and access secret for the S3 bucket where your dataset resides.
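
Rather than pasting the access key and secret into a notebook cell, one option is to read them from environment variables. The sketch below is a minimal illustration: `load_s3_credentials` is a hypothetical helper, not part of the markov SDK, and it assumes the conventional AWS variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.

```python
import os


def load_s3_credentials(env=None):
    """Return (access_key, access_secret) read from environment variables.

    Hypothetical helper, not part of the markov SDK. The variable names
    are the conventional AWS ones and are an assumption about your setup.
    """
    env = os.environ if env is None else env
    missing = [
        name
        for name in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
        if not env.get(name)
    ]
    if missing:
        # Fail early with a message that names the missing variables.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["AWS_ACCESS_KEY_ID"], env["AWS_SECRET_ACCESS_KEY"]
```

The returned pair can then be supplied as the `access_key` and `access_secret` arguments when registering credentials.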
Example


In [0]:
import markov
import os

# Create a new data family for the dataset.
# If you already have a data family, SKIP this step.

# STEP 1
df_reg_resp = markov.data.register_datafamily(
    name="Twitter Sentiments Family",
    notes="This is a data family for twitter datasets",
    lang="en-us",
    source="3pInternet",  # source of your dataset
)


In [0]:
# STEP 2
# Create a new set of credentials. If you've already registered S3 credentials
# that allow reading data from your S3 bucket, SKIP this step.
# The key and secret are read from the environment instead of being hard-coded.
# The variable names are the conventional AWS ones; adjust them to your setup.
cred_resp = markov.credentials.register_s3_credentials(
    name="S3TestCredentials",
    access_key=os.environ["AWS_ACCESS_KEY_ID"],
    access_secret=os.environ["AWS_SECRET_ACCESS_KEY"],
    notes="Credentials for S3",
)

# Use an existing data family id or the one created in STEP 1
df_id = df_reg_resp.df_id

# Use an existing credential id registered with Markov or the one created in STEP 2
cred_id = cred_resp.credential_id

# Paths to the train and test segments of the dataset
segment_paths_1 = [
    markov.datasegment.DataSegmentPath(
        segment_type=markov.SegmentType.Train,
        path="s3://super-summit-23/twitter_train_dataset.csv",
    ),
    markov.datasegment.DataSegmentPath(
        segment_type=markov.SegmentType.Test,
        path="s3://super-summit-23/twitter_test_dataset.csv",
    ),
]
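
Each segment path above is an S3 object URI of the form `s3://bucket/key`. As a small sanity check before registration, the sketch below splits such a URI into its bucket and key using only the standard library. `split_s3_uri` is a hypothetical helper for illustration and is not part of the markov SDK.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key); raise ValueError otherwise."""
    parsed = urlparse(uri)
    bucket, key = parsed.netloc, parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not bucket or not key:
        raise ValueError(f"Not a valid S3 object URI: {uri!r}")
    return bucket, key


# The two paths used in this example.
paths = [
    "s3://super-summit-23/twitter_train_dataset.csv",
    "s3://super-summit-23/twitter_test_dataset.csv",
]
for p in paths:
    print(split_s3_uri(p))
```

A typo such as a missing `s3://` prefix or a bucket with no object key raises `ValueError` here instead of surfacing later during registration.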

In [0]:
# STEP 3
# Register the dataset with MarkovML
markov.data.register_dataset(
    name="Twitter Sentiments Dataset",
    should_analyze=True,
    data_category=markov.DataCategory.Text,
    datafamily_id=df_id,
    storage_type=markov.StorageType.S3,
    credentials=cred_id,
    data_segment_path=segment_paths_1,
    delimiter=",",
    x_col_names=["tweet"],
    y_name="class",
    notes="This dataset contains hate speech tweets collected from the internet",
)
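
The registration call names a feature column (`tweet`), a target column (`class`), and a delimiter (`,`). If a local copy or sample of the file is available, it can help to confirm that the header really contains those columns before registering. The sketch below is one way to do that with the standard `csv` module. `missing_columns` is a hypothetical helper, and the sample text is made up for illustration.

```python
import csv
import io


def missing_columns(csv_text, required, delimiter=","):
    """Return the required column names absent from the CSV header row."""
    # Read only the first row; an empty input yields an empty header.
    header = next(csv.reader(io.StringIO(csv_text), delimiter=delimiter), [])
    return [name for name in required if name not in header]


# Made-up two-row sample standing in for the real training file.
sample = "tweet,class\nhello world,0\nanother tweet,1\n"
print(missing_columns(sample, ["tweet", "class"]))  # prints []
```

An empty list means every name passed as `x_col_names` and `y_name` appears in the header.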


In [0]:
# STEP 4
# Fetch all datasets
datasets = markov.dataset.get_datasets()


In [0]:
# STEP 5
# Get dataset preview
markov.dataset.get_dataset_preview(ds_id=datasets[0].ds_id)

# Get all registered datasets
markov.dataset.get_datasets()
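
The preview call above indexes `datasets[0]`, which fails with an `IndexError` if no dataset has been registered yet. A small guard gives a clearer message. The sketch assumes only what the cell above already relies on: that the result is a sequence whose items expose a `ds_id` attribute. `first_dataset_id` is a hypothetical helper, and the objects below are stand-ins, not real SDK responses.

```python
from types import SimpleNamespace


def first_dataset_id(datasets):
    """Return the ds_id of the first dataset, or raise a clear error if none exist."""
    if not datasets:
        raise LookupError("No datasets are registered yet; run STEP 3 first.")
    return datasets[0].ds_id


# Stand-in objects for illustration; real entries come from get_datasets().
fake_datasets = [SimpleNamespace(ds_id="ds-001"), SimpleNamespace(ds_id="ds-002")]
print(first_dataset_id(fake_datasets))  # prints ds-001
```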


In [0]:
# STEP 7
# Notify user that all tasks are completed
markov.notify(title="Example tasks completed", text="All the example tasks completed successfully")