# Splitting the MINT dataset

In this notebook, we split the MINT dataset into three parts (training, validation and test) following a 7:1:2 ratio.

In addition, the splits are uploaded to my Hugging Face account as a private dataset, so that I can load the dataset and its splits faster.

[SemEval-2023 Task 9 - Multilingual Tweet Intimacy Analysis](https://codalab.lisn.upsaclay.fr/competitions/7096)


In [None]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Loading the dataset

In the following cells, we load the dataset and take a first look at it.

First, we have to mount our Google Drive:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
from datasets import load_dataset

# Load the full MINT training set from a CSV file stored on Google Drive.
path = "/content/drive/My Drive/Colab Notebooks/data/"
# Adjust the path so that it points to your copy of the CSV file.
dataset = load_dataset("csv", data_files=path + "intimacy/train-full.csv")
dataset




DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'language'],
        num_rows: 9491
    })
})
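
As a first exploration step, it can help to see how many tweets each language contributes. The following is a minimal sketch, not part of the original notebook: `language_counts` is a hypothetical helper that works on any iterable of records with a `language` field, and the two sample records are copied from outputs displayed later in this notebook. With the loaded data, it could be called as `language_counts(dataset["train"])`.

```python
from collections import Counter


def language_counts(records):
    """Count how many records belong to each language."""
    return Counter(record["language"] for record in records)


# Two records copied from the outputs shown in this notebook.
sample = [
    {"text": "J’ai trop le seum la campagne a zemmour va être remboursé",
     "label": 1.5, "language": "French"},
    {"text": "@user @user 😅😅😅 estaba durmiendo y lo despertó otro perrito!",
     "label": 2.6, "language": "Spanish"},
]

counts = language_counts(sample)
print(counts.most_common())
```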

## Split


In [None]:
# Hold out 30% of the rows; the remaining 70% form the training split.
dataset = dataset["train"].train_test_split(test_size=0.3, seed=42)
dataset




DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'language'],
        num_rows: 6643
    })
    test: Dataset({
        features: ['text', 'label', 'language'],
        num_rows: 2848
    })
})

In [None]:
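# Split the held-out 30% again: about 1/3 for validation (10% overall), the rest for test (20% overall).
aux_dataset = dataset["test"].train_test_split(test_size=0.33, seed=42)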
aux_dataset



DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'language'],
        num_rows: 1908
    })
    test: Dataset({
        features: ['text', 'label', 'language'],
        num_rows: 940
    })
})
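
The two `test_size` values follow from the 7:1:2 target: the first split holds out 30% of the data, and the validation split should then take one third of that held-out part (10% of the total). The sketch below reproduces the row counts shown above, assuming the held-out side of each split is rounded up to a whole number of rows; that rounding rule is an assumption that happens to match these outputs, not something stated in the notebook.

```python
import math
from fractions import Fraction

total_rows = 9491  # rows in train-full.csv, as reported above

# Target shares for a 7:1:2 split.
train_share, val_share, test_share = Fraction(7, 10), Fraction(1, 10), Fraction(2, 10)

# First split: everything that is not training data is held out.
first_test_size = val_share + test_share          # 3/10
# Second split: validation's share of the held-out part.
second_test_size = val_share / first_test_size    # 1/3, approximated as 0.33 in the notebook

# Assumption: the held-out side is rounded up to a whole number of rows.
held_out = math.ceil(0.3 * total_rows)
validation = math.ceil(0.33 * held_out)
train = total_rows - held_out
test = held_out - validation

print(first_test_size, second_test_size)  # 3/10 1/3
print(train, validation, test)            # 6643 940 1908
```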

In [None]:
# The smaller part of the second split is the validation set, the larger part the test set.
dataset['validation'] = aux_dataset['test']
dataset['test'] = aux_dataset['train']
del aux_dataset
dataset


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'language'],
        num_rows: 6643
    })
    test: Dataset({
        features: ['text', 'label', 'language'],
        num_rows: 1908
    })
    validation: Dataset({
        features: ['text', 'label', 'language'],
        num_rows: 940
    })
})
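
As a quick sanity check, the final split sizes can be compared with the intended 7:1:2 ratio. This small sketch only uses the row counts printed above; it does not touch the dataset itself.

```python
# Row counts copied from the DatasetDict printed above.
sizes = {"train": 6643, "validation": 940, "test": 1908}

total = sum(sizes.values())
shares = {name: n / total for name, n in sizes.items()}

for name, share in shares.items():
    print(f"{name}: {share:.3f}")
```

The shares come out close to 0.7, 0.1 and 0.2; the small deviations are due to rounding and to using 0.33 instead of exactly one third in the second split.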

Let's show some records (they should always be the same, because the splits use a fixed seed):

In [None]:

dataset['train'][0] #{'text': 'J’ai trop le seum la campagne a zemmour va être remboursé','label': 1.5,'language': 'French'}



{'text': 'J’ai trop le seum la campagne a zemmour va être remboursé',
 'label': 1.5,
 'language': 'French'}

In [None]:

dataset['validation'][0]    # {'text': '@user @user 😅😅😅 estaba durmiendo y lo despertó otro perrito!','label': 2.6, 'language': 'Spanish'}

{'text': '@user @user 😅😅😅 estaba durmiendo y lo despertó otro perrito!',
 'label': 2.6,
 'language': 'Spanish'}

In [None]:
dataset['test'][0]    # {'text': '@user eu tenho fé, em nome de jesus, vai parar de chover até chegar a noite','label': 1.3333333333333333,'language': 'Portuguese'}

{'text': '@user eu tenho fé, em nome de jesus, vai parar de chover até chegar a noite',
 'label': 1.3333333333333333,
 'language': 'Portuguese'}
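
The same records appear on every run because a fixed seed makes the shuffle deterministic. The sketch below illustrates the idea with Python's standard `random` module. It is an analogy only and does not reproduce the shuffling that `train_test_split` performs internally; `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_rows, test_size, seed):
    """Shuffle row indices with a seeded generator and cut off a test part."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(test_size * n_rows)
    # Return (train indices, test indices).
    return indices[n_test:], indices[:n_test]


first_run = split_indices(10, 0.3, seed=42)
second_run = split_indices(10, 0.3, seed=42)
print(first_run == second_run)  # True: same seed, same split
```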

We save the three splits as CSV files, which will later be uploaded to a new private dataset in my Hugging Face account:

In [None]:
dataset["train"].to_csv(path+"intimacy/train.csv", index=False)
dataset["test"].to_csv(path+"intimacy/test.csv", index=False)
dataset["validation"].to_csv(path+"intimacy/validation.csv", index=False)
print("the three splits were saved into " + path+'intimacy/')

the three splits were saved into /content/drive/My Drive/Colab Notebooks/data/
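
To check that the files were written correctly, the three CSV files can be loaded back as a single `DatasetDict` by passing a dictionary of split names to `data_files`. This is a minimal sketch: `build_data_files` and `reload_splits` are hypothetical helpers, and the folder is assumed to be the one used in the cell above.

```python
def build_data_files(folder):
    """Map each split name to the CSV file saved for it."""
    return {split: f"{folder}{split}.csv" for split in ("train", "validation", "test")}


def reload_splits(folder):
    """Load the three CSV files back as one DatasetDict."""
    # Imported here so that build_data_files can be used without the library.
    from datasets import load_dataset
    return load_dataset("csv", data_files=build_data_files(folder))


data_files = build_data_files("/content/drive/My Drive/Colab Notebooks/data/intimacy/")
print(data_files["validation"])
```

Calling `reload_splits(path + "intimacy/")` in the notebook should then return the same three splits with the same row counts.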


## Loading the MINT dataset from Hugging Face

I have uploaded these three CSV files to a private dataset repository in my Hugging Face account. The dataset can now be loaded directly from there.
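
As an alternative to uploading the CSV files by hand, the splits could be pushed directly from the notebook with `DatasetDict.push_to_hub`. The following is a sketch under assumptions: `upload_splits` is a hypothetical wrapper, the repository name is the one used in the next cell, and the session is assumed to be already authenticated (for example via `huggingface-cli login`).

```python
def upload_splits(dataset_dict, repo_id="ISEGURA/mint", private=True):
    """Push every split of a DatasetDict to a (private) Hub repository."""
    # push_to_hub creates the repository if needed and uploads all splits.
    dataset_dict.push_to_hub(repo_id, private=private)
    return repo_id


# Usage, with `dataset` being the DatasetDict built above:
# upload_splits(dataset)
```

Note that `push_to_hub` stores the data in Parquet format rather than as CSV files, so the repository contents would differ from a manual CSV upload.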

In [None]:
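import os

# Read the access token from an environment variable instead of hard-coding it
# in the notebook (the variable name HF_TOKEN is an assumption; set it beforehand).
access_token = os.environ["HF_TOKEN"]
# If the dataset were public, no token would be needed:
# dataset = load_dataset("ISEGURA/mint")
# The dataset is private, so the token is required
# (recent versions of datasets use token= instead of use_auth_token=):
dataset = load_dataset("ISEGURA/mint", use_auth_token=access_token)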
dataset

Downloading and preparing dataset csv/ISEGURA--mint to /root/.cache/huggingface/datasets/ISEGURA___csv/ISEGURA--mint-5c90ae891d86e6db/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/ISEGURA___csv/ISEGURA--mint-5c90ae891d86e6db/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'language'],
        num_rows: 6643
    })
    test: Dataset({
        features: ['text', 'label', 'language'],
        num_rows: 1908
    })
    validation: Dataset({
        features: ['text', 'label', 'language'],
        num_rows: 940
    })
})

In [None]:
dataset['train'][0]    # {'text': 'J’ai trop le seum la campagne a zemmour va être remboursé','label': 1.5,'language': 'French'}

{'text': 'J’ai trop le seum la campagne a zemmour va être remboursé',
 'label': 1.5,
 'language': 'French'}

In [None]:
dataset['validation'][0]    # {'text': '@user @user 😅😅😅 estaba durmiendo y lo despertó otro perrito!','label': 2.6, 'language': 'Spanish'}

{'text': '@user @user 😅😅😅 estaba durmiendo y lo despertó otro perrito!',
 'label': 2.6,
 'language': 'Spanish'}

In [None]:
dataset['test'][0]    # {'text': '@user eu tenho fé, em nome de jesus, vai parar de chover até chegar a noite','label': 1.3333333333333333,'language': 'Portuguese'}

{'text': '@user eu tenho fé, em nome de jesus, vai parar de chover até chegar a noite',
 'label': 1.3333333333333333,
 'language': 'Portuguese'}