# Importing Datasets into MuSE

Currently, MuSE supports the following data formats:
- CSV
- Parquet
- Directory of text files
- Multi-level directory of text files

This notebook describes the requirements of each of these formats and shows how to import data in each format into MuSE.

All implemented data loaders can be found in the `muse.data_importer` module and follow the same pattern: each is initialized with options that describe the specific format.

Typically, you will only interact with these through the `Muse` class, but the data loaders can also be used independently if needed.

## Basic API

Below is the API for loading data into MuSE:
```python
from muse import Muse

muse = Muse()

# Returns nothing, only sets the data in the Muse object
muse.set_data("data type", "path/to/data", "language", {"additional": "options"})
```

Similarly, using the data loaders directly:
```python
from muse.data_importer import import_data

# Returns the loaded data: Union[list[Document], list[MultiDocument], list[Conversation]]
import_data("data type", "path/to/data", "language", {"additional": "options"})
```

The options are specific to each data loader and are detailed in the following sections. Any option that the selected data loader does not expect is ignored; for example, setting `csv_separator` for a Parquet file or a directory has no effect.
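
The "unexpected options are ignored" behaviour can be pictured with a small sketch. This is not MuSE's implementation; the option sets and the `filter_options` helper are hypothetical and only illustrate the idea of each loader keeping the keys it understands.

```python
# Hypothetical sketch, NOT MuSE's actual code: each loader is assumed to
# declare the option names it understands, and everything else is dropped.
CSV_OPTIONS = {"text_column", "summary_column", "csv_separator"}
PARQUET_OPTIONS = {"text_column", "summary_column"}


def filter_options(options: dict, accepted: set) -> dict:
    """Keep only the options that the target loader understands."""
    return {key: value for key, value in options.items() if key in accepted}


user_options = {"text_column": "body", "csv_separator": ";"}

csv_kwargs = filter_options(user_options, CSV_OPTIONS)
parquet_kwargs = filter_options(user_options, PARQUET_OPTIONS)

print(csv_kwargs)      # both options are kept for the CSV loader
print(parquet_kwargs)  # csv_separator is dropped for the Parquet loader
```

In practice this means one options dictionary can be reused across formats without raising errors, at the cost that a misspelled option name may also go unnoticed.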

## Data for Demonstration

In [2]:
import os
from pathlib import Path
from tempfile import TemporaryDirectory

import pandas as pd

document = "This is a document. It has multiple sentences. It is used to test the data importers."
additional_document = "This is an additional document. It has multiple sentences. It is used to test the data importers."
conversation = "#Speaker1# This is a conversation turn. #Speaker2# This is another conversation turn."
summary = "This is a summary of the document. It is used to test the data importers."
metadata = {"metadata1": "This is metadata1", "metadata2": "This is metadata2"}

## CSV

CSV files are commonly used for Document and Multi-Document Summarization datasets, and other formats can often be easily converted to CSV.

Currently, MuSE can only take a single CSV file as input.
It is loaded by the `ColumnarConnector` class, whose initializer takes the following options:
- `text_column`: The name of the column in the CSV file that contains the text data for the document, documents, or conversation. It defaults to `text`.
- `summary_column`: The name of the column in the CSV file that contains the summary data for the document, documents, or conversation. It defaults to `summary`. This column is optional; if it is not provided, the dataset is assumed to have no summaries, which may be required for some evaluation metrics.
- `metadata_columns`: A list of column names in the CSV file that contain metadata for the document, documents, or conversation. It defaults to an empty list `[]`, in which case all extra columns are assumed to be metadata and are loaded with the data. This metadata is currently not used by MuSE, but may be useful for future features or for custom use cases.
- `csv_separator`: The separator used in the CSV file. It defaults to `,`.
- `multi_doc_id_column`: The name of the column in the CSV file that contains the document id for multi-document datasets. It defaults to `multi_doc_id`. This column is optional; if it is not provided and you are loading a multi-document dataset, the documents are assumed to be stored together in the same row, separated by a delimiter.
- `multi_document_delimiter`: If you are not using a `multi_doc_id_column`, this is the delimiter that separates the documents stored in the same row. It defaults to `#DOCUMENT#` and is used to split the row's text into separate documents.
- `conversation_separator`: The separator between conversation turns in the CSV file. It is a regex pattern and defaults to `"#\w+#"`, so that the name of the speaker is enclosed within `#` characters. If you provide your own pattern, it must have one group, which will be used to split the conversation turns.
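
To make the three splitting mechanisms above more concrete, here is a minimal sketch using only the standard library. It is an illustration under assumptions, not MuSE's implementation: the pattern `#(\w+)#` with an explicit capturing group, the sample rows, and the grouping logic are all hypothetical.

```python
import re
from collections import defaultdict

# 1. Grouping rows by a multi-document id (assumed behaviour: the first
#    non-empty summary found in a group is used for the whole group).
rows = [
    {"multi_doc_id": "doc1", "text": "First document.", "summary": "Shared summary."},
    {"multi_doc_id": "doc1", "text": "Second document.", "summary": None},
]
groups = defaultdict(lambda: {"documents": [], "summary": None})
for row in rows:
    group = groups[row["multi_doc_id"]]
    group["documents"].append(row["text"])
    if row["summary"] and group["summary"] is None:
        group["summary"] = row["summary"]

# 2. Splitting a single row into documents with a delimiter.
row_text = "First document.#DOCUMENT#Second document."
documents = [part.strip() for part in row_text.split("#DOCUMENT#")]

# 3. Splitting a conversation into turns with a regex that has one group.
#    re.split keeps the captured text, so the result alternates between
#    speaker names and turn texts (after a leading empty string).
conversation = "#Speaker1# This is a conversation turn. #Speaker2# This is another conversation turn."
parts = re.split(r"#(\w+)#", conversation)
turns = [(speaker, text.strip()) for speaker, text in zip(parts[1::2], parts[2::2])]

print(dict(groups))
print(documents)
print(turns)
```

The third part shows why a capturing group is useful in a separator pattern: without it, `re.split` would discard the speaker names along with the separators.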

In [None]:
from muse import Muse, DataType

# Prepare a basic CSV file for demonstration
with TemporaryDirectory() as tmpdir:
    print("Basic single document CSV file:")
    csv_path = Path(f"{tmpdir}/test.csv")
    df = pd.DataFrame({"text": [document, additional_document], "summary": [summary, summary], **metadata})
    df.to_csv(csv_path, index=False)

    # We need a Muse object to load the data
    muse = Muse()

    # Load the data
    muse.set_data(DataType.SingleDocument, str(csv_path), 'en')

    for doc in muse.data:
        print(doc.text)
        print(doc.summary)
        print(doc.metadata)
        print()

    print("=" * 20)

# Custom column names, and only metadata1
with TemporaryDirectory() as tmpdir:
    print("custom column names, and only metadata1 single document CSV file:")
    csv_path = Path(f"{tmpdir}/test.csv")
    df = pd.DataFrame(
        {"text_with_new_name": [document, additional_document], "summary_with_new_name": [summary, summary],
         **metadata})
    df.to_csv(csv_path, index=False)

    # We need a Muse object to load the data
    muse = Muse()

    # Load the data
    muse.set_data(DataType.SingleDocument, str(csv_path), 'en',
                  {"text_column": "text_with_new_name", "summary_column": "summary_with_new_name",
                   "metadata_columns": ["metadata1"]})

    for doc in muse.data:
        print(doc.text)
        print(doc.summary)
        print(doc.metadata)
        print()

    print("=" * 20)

# Multi-document CSV file
with TemporaryDirectory() as tmpdir:
    print("Multi-document CSV file:")
    csv_path = Path(f"{tmpdir}/test.csv")
    # When using `multi_doc_id_column`, a summary only needs to be provided once;
    # it is assumed to be the same for all documents in the multi-document group.
    df = pd.DataFrame(
        {"text": [document, additional_document], "summary": [summary, None], "multi_doc_id": ["doc1", "doc1"],
         **metadata})
    df.to_csv(csv_path, index=False)

    # We need a Muse object to load the data
    muse = Muse()

    # Load the data
    muse.set_data(DataType.MultiDocument, str(csv_path), 'en')

    for multi_doc in muse.data:
        for doc in multi_doc.documents:
            print(doc.text)
            print("-" * 5)
        print(multi_doc.summary)
        print(multi_doc.metadata)
        print()

    print("=" * 20)

# Multi-document CSV file with delimiter
with TemporaryDirectory() as tmpdir:
    print("Multi-document CSV file with delimiter:")
    csv_path = Path(f"{tmpdir}/test.csv")
    df = pd.DataFrame({"text": [f"{document}###{additional_document}"], "summary": [summary], **metadata})
    df.to_csv(csv_path, index=False)

    # We need a Muse object to load the data
    muse = Muse()

    # Load the data; here we specify that documents are separated by `###` rather than the default `#DOCUMENT#`
    muse.set_data(DataType.MultiDocument, str(csv_path), 'en', {"multi_document_delimiter": "###"})

    for multi_doc in muse.data:
        for doc in multi_doc.documents:
            print(doc.text)
            print("-" * 5)
        print(multi_doc.summary)
        print(multi_doc.metadata)
        print()

    print("=" * 20)

# Conversation CSV file with custom separator
with TemporaryDirectory() as tmpdir:
    print("Conversation CSV file with custom separator:")
    csv_path = Path(f"{tmpdir}/test.csv")
    df = pd.DataFrame({"text": [conversation.replace("#Speaker", "#Speaker ")], **metadata})
    df.to_csv(csv_path, index=False, sep=";")

    # We need a Muse object to load the data
    muse = Muse()

    # Load the data; here we specify that conversation turns are separated by the pattern `#Speaker \d+#`.
    # A raw string keeps `\d` from being treated as a string escape sequence.
    muse.set_data(DataType.Conversation, str(csv_path), 'en',
                  {"conversation_separator": r"#Speaker \d+#", "csv_separator": ";"})

    for conv in muse.data:
        for turn in conv.text_units:
            print(f"{turn.speaker}: {turn.text}")
        print(conv.metadata)
        print()

    print("=" * 20)

## Parquet

Parquet is a columnar storage format and, as such, shares many similarities with CSV. Within MuSE the only difference is that the `csv_separator` option is not available, as it is not needed.

In [None]:
from muse import Muse

# Simple Parquet file example
with TemporaryDirectory() as tmpdir:
    print("Simple Parquet file:")
    parquet_path = Path(f"{tmpdir}/test.parquet")
    df = pd.DataFrame({"text": [document, additional_document], "summary": [summary, summary], **metadata})
    df.to_parquet(parquet_path, index=False)

    # We need a Muse object to load the data
    muse = Muse()

    # Load the data
    muse.set_data("document", str(parquet_path), 'en')

    for doc in muse.data:
        print(doc.text)
        print(doc.summary)
        print(doc.metadata)
        print()

    print("=" * 20)

## Directories and Multi-level Directories

Another format we support is a directory structure, where each file represents a document, documents, or conversation.
As with the previous formats, to represent multi-document datasets you can either use a delimiter within the file or use multiple files within a directory.
For conversations, you will also use a delimiter within the file.

We provide two further groups of options for directories, one for single-level and one for multi-level directories:
- `summary_suffix`: This is the suffix used to identify the summary file for a document, documents, or conversation. It defaults to `_summary`, and will be applied to single level directories.
- `metadata_suffix`: This is the suffix used to identify the metadata file for a document, documents, or conversation. It defaults to `_metadata`, and will be applied to single level directories.
- `summary_file`: This is the name of the file used to identify the summary file for a document, documents, or conversation. It defaults to `summary`. This is used for multi-level directories.
- `metadata_file`: This is the name of the file used to identify the metadata file for a document, documents, or conversation. It defaults to `metadata`. This is used for multi-level directories.
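
The sketch below shows one way the two naming schemes could be resolved on disk. It is an assumption-laden illustration, not MuSE's loader: the helpers `pair_single_level` and `pair_multi_level` are hypothetical, files are assumed to use a `.txt` extension, and metadata handling is left out for brevity.

```python
from pathlib import Path
from tempfile import TemporaryDirectory


def pair_single_level(root, summary_suffix="_summary", metadata_suffix="_metadata"):
    """Hypothetical helper: match each document with `<stem><summary_suffix>.txt`."""
    root = Path(root)
    paired = {}
    for path in sorted(root.glob("*.txt")):
        stem = path.stem
        if stem.endswith(summary_suffix) or stem.endswith(metadata_suffix):
            continue  # companion files are reached through their document
        summary_path = root / f"{stem}{summary_suffix}.txt"
        paired[stem] = {
            "text": path.read_text(),
            "summary": summary_path.read_text() if summary_path.exists() else None,
        }
    return paired


def pair_multi_level(root, summary_file="summary", metadata_file="metadata"):
    """Hypothetical helper: each subdirectory is one group with a fixed summary file name."""
    grouped = {}
    for subdir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        documents = [
            path.read_text()
            for path in sorted(subdir.glob("*.txt"))
            if path.stem not in (summary_file, metadata_file)
        ]
        summary_path = subdir / f"{summary_file}.txt"
        grouped[subdir.name] = {
            "documents": documents,
            "summary": summary_path.read_text() if summary_path.exists() else None,
        }
    return grouped


with TemporaryDirectory() as tmpdir:
    single = Path(tmpdir) / "single"
    single.mkdir()
    (single / "doc1.txt").write_text("Document one.")
    (single / "doc1_summary.txt").write_text("Summary one.")
    (single / "doc2.txt").write_text("Document two.")

    multi = Path(tmpdir) / "multi"
    (multi / "group1").mkdir(parents=True)
    (multi / "group1" / "a.txt").write_text("Document A.")
    (multi / "group1" / "b.txt").write_text("Document B.")
    (multi / "group1" / "summary.txt").write_text("Group summary.")

    single_result = pair_single_level(single)
    multi_result = pair_multi_level(multi)

print(single_result)
print(multi_result)
```

The suffix options identify companion files that sit next to their document in a flat directory, while the file-name options identify them inside each subdirectory, where the directory itself already provides the grouping.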

In [None]:
from muse import Muse

# Single level directory
with TemporaryDirectory() as tmpdir:
    print("Single level directory:")
    with open(f"{tmpdir}/doc1.txt", "w") as f:
        f.write(document)
    with open(f"{tmpdir}/doc1_summary_special_name.txt", "w") as f:
        f.write(summary)
    with open(f"{tmpdir}/doc1_metadata_special_name.txt", "w") as f:
        f.write(str(metadata))
    with open(f"{tmpdir}/doc2.txt", "w") as f:
        f.write(additional_document)
    with open(f"{tmpdir}/doc2_summary_special_name.txt", "w") as f:
        f.write(summary)
    with open(f"{tmpdir}/doc2_metadata_special_name.txt", "w") as f:
        f.write(str(metadata))

    # We need a Muse object to load the data
    muse = Muse()

    # Load the data
    muse.set_data("document", tmpdir, 'en',
                  {"summary_suffix": "_summary_special_name", "metadata_suffix": "_metadata_special_name"})

    for doc in muse.data:
        print(doc.text)
        print(doc.summary)
        print(doc.metadata)
        print()

    print("=" * 20)

# Multi-level directory
with TemporaryDirectory() as tmpdir:
    print("Multi-level directory:")
    os.mkdir(f"{tmpdir}/doc1")
    os.mkdir(f"{tmpdir}/doc2")

    with open(f"{tmpdir}/doc1/doc1.txt", "w") as f:
        f.write(document)
    with open(f"{tmpdir}/doc1/doc2.txt", "w") as f:
        f.write(additional_document)
    with open(f"{tmpdir}/doc1/summary_special_name.txt", "w") as f:
        f.write(summary)
    with open(f"{tmpdir}/doc1/metadata_special_name.txt", "w") as f:
        f.write(str(metadata))
    with open(f"{tmpdir}/doc2/doc2.txt", "w") as f:
        f.write(additional_document)
    with open(f"{tmpdir}/doc2/summary_special_name.txt", "w") as f:
        f.write(summary)
    with open(f"{tmpdir}/doc2/metadata_special_name.txt", "w") as f:
        f.write(str(metadata))

    # We need a Muse object to load the data
    muse = Muse()

    # Load the data
    muse.set_data("multi-document", tmpdir, 'en',
                  {"summary_file": "summary_special_name", "metadata_file": "metadata_special_name"})

    for multi_doc in muse.data:
        for doc in multi_doc.documents:
            print(doc.text)
            print("-" * 5)
        print(multi_doc.summary)
        print(multi_doc.metadata)
        print()

    print("=" * 20)

## Source-Target files

To be completed.

## Accessing Datasets

MuSE also provides some datasets for demonstration purposes. These must be downloaded before use, either with the `muse_fetch` command line tool or with the `fetch_datasets` function from the `muse.utils` module.