# Dataset & Dataloader
---
This notebook defines my Dataset and DataLoader for the data in this folder.

The class and the helper function are meant to be imported from other code, so that I can specify which datasets to use.

**Built here for testing purposes; this will ultimately be a .py file.**

In [1]:
# First, import dependencies
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
import pyarrow.parquet as pq
import tkinter as tk
from tkinter import filedialog
import os

In [2]:
# Define the path to your datasets
DATASETS_PATH = "datasets"

In [3]:
def select_datasets(datasets_path):
    root = tk.Tk()
    root.withdraw()  # Hide the main window

    # This will store your selection outside the inner function
    selected_datasets = []

    # Browse the dataset directory to list all subdirectories
    dataset_dirs = next(os.walk(datasets_path))[1]
    
    # Create a new window for selection
    selection_window = tk.Toplevel(root)
    selection_window.title("Select Datasets")

    listbox = tk.Listbox(selection_window, selectmode='multiple', width=50, height=15)
    for dataset_dir in dataset_dirs:
        listbox.insert(tk.END, dataset_dir)
    listbox.pack()

    def confirm_selection():
        nonlocal selected_datasets  # Rebind the variable in the enclosing function's scope
        selections = listbox.curselection()
        selected_datasets = [dataset_dirs[i] for i in selections]
        selection_window.destroy()
        root.quit()

    confirm_button = tk.Button(selection_window, text="Confirm", command=confirm_selection)
    confirm_button.pack()

    # Closing the window without confirming ends the event loop with an empty selection
    selection_window.protocol("WM_DELETE_WINDOW", root.quit)

    root.mainloop()
    try:
        root.destroy()  # Ensure the root tkinter window is closed
    except tk.TclError:
        pass  # Window is already closed

    return selected_datasets
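
Since this is meant to become a .py file that other code calls, a non-interactive way to pick datasets may be useful alongside the tkinter dialog. The following is only a sketch under that assumption: `list_dataset_dirs` and `select_datasets_by_name` are hypothetical helper names, not part of the notebook above.

```python
import os


def list_dataset_dirs(datasets_path):
    """Return the sorted names of the immediate subdirectories of datasets_path."""
    return sorted(
        entry.name for entry in os.scandir(datasets_path) if entry.is_dir()
    )


def select_datasets_by_name(datasets_path, requested):
    """Check the requested dataset names against the directories on disk.

    Hypothetical non-GUI counterpart to select_datasets: raises ValueError
    if any requested name has no matching subdirectory.
    """
    available = list_dataset_dirs(datasets_path)
    missing = [name for name in requested if name not in available]
    if missing:
        raise ValueError(f"Unknown datasets: {missing}; available: {available}")
    return list(requested)
```

The returned list has the same shape as the result of `select_datasets`, so it could be passed to `CustomDataset` in the same way.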

In [14]:
class CustomDataset(Dataset):
    def __init__(self, dataset_names, datasets_path):
        self.file_paths = {}  # Dictionary to store file paths keyed by an integer
        self.indices = []  # List of tuples (file_key, row_index)
        file_key = 0  # Initialize file key
        
        for dataset_name in dataset_names:
            data_path = os.path.join(datasets_path, dataset_name, "data")
            for file_name in sorted(os.listdir(data_path)):  # Sort for a deterministic file order
                if file_name.endswith('.parquet'):
                    file_path = os.path.join(data_path, file_name)
                    self.file_paths[file_key] = file_path  # Store file path in dictionary
                    
                    num_rows = self.get_number_of_rows(file_path)
                    for row_index in range(num_rows):
                        self.indices.append((file_key, row_index))  # One entry per row, keyed by file
                        
                    file_key += 1  # Increment file_key for the next file

    def __len__(self):
        return len(self.indices)
    
    def __getitem__(self, idx):
        file_key, row_index = self.indices[idx]
        file_path = self.file_paths[file_key]  # Lookup file path using file_key
        return self.load_row(file_path, row_index)

    @staticmethod
    def get_number_of_rows(file_path):
        parquet_file = pq.ParquetFile(file_path)
        return parquet_file.metadata.num_rows

    def load_row(self, file_path, row_index):
        # Read only the 'text' column of the Parquet file into a Pandas DataFrame.
        # Note that this re-reads the file on every call.
        df = pd.read_parquet(file_path, columns=['text'])
        # Selecting one row and one column gives a single value (a string for text data).
        # Return it directly so the DataLoader's default collation produces
        # a flat list of strings per batch.
        return df.iloc[row_index]['text']
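
`CustomDataset.indices` stores one `(file_key, row_index)` tuple for every row, which grows with the total number of rows. As a possible alternative, the same lookup can be done from per-file row counts alone, using cumulative offsets and a binary search. This is a minimal sketch with made-up row counts; `build_offsets` and `locate` are hypothetical names, not part of the class above.

```python
from bisect import bisect_right
from itertools import accumulate


def build_offsets(row_counts):
    """Return cumulative end offsets, e.g. [3, 2, 4] -> [3, 5, 9]."""
    return list(accumulate(row_counts))


def locate(offsets, idx):
    """Map a flat sample index to a (file_key, row_index) pair."""
    if idx < 0 or not offsets or idx >= offsets[-1]:
        raise IndexError(f"index {idx} out of range")
    # First file whose end offset is strictly greater than idx
    file_key = bisect_right(offsets, idx)
    start = offsets[file_key - 1] if file_key > 0 else 0
    return file_key, idx - start


row_counts = [3, 2, 4]  # hypothetical rows per Parquet file
offsets = build_offsets(row_counts)
print(locate(offsets, 0))  # first row of the first file
print(locate(offsets, 3))  # first row of the second file
print(locate(offsets, 8))  # last row of the third file
```

With this layout, `__len__` would be `offsets[-1]` and `__getitem__` would call `locate` instead of indexing a per-row list.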

In [15]:
def get_dataloader(batch_size, shuffle=True, num_workers=4):
    """
        Prompts for which datasets to include, builds a PyTorch Dataset from them, and returns a DataLoader over it.

        Inputs:
            batch_size:  (int) Number of examples returned with each iteration of the dataloader
            shuffle:     (bool) Whether the dataloader reshuffles the data at the start of each epoch
            num_workers: (int) Number of subprocesses to create for loading. Set to 0 to load in the main process

        Returns the dataloader
    """
    selected_datasets = select_datasets(DATASETS_PATH)
    dataset = CustomDataset(selected_datasets, DATASETS_PATH)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, num_workers=num_workers)
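
Whether each sample arrives as a bare string or wrapped in a single-element list affects how batches look once `batch_size` is greater than 1. One way to make the batch shape explicit is a custom collate function. The sketch below assumes each sample is either a string or a list/tuple of strings; `collate_texts` is a hypothetical name.

```python
def collate_texts(batch):
    """Flatten a batch of samples into a plain list of strings.

    Each sample may be a string or a list/tuple of strings (assumption).
    """
    texts = []
    for sample in batch:
        if isinstance(sample, str):
            texts.append(sample)
        else:
            texts.extend(sample)
    return texts


# Example with mixed sample shapes
print(collate_texts([["first doc"], "second doc"]))
```

It could be passed to the loader as `DataLoader(dataset, batch_size=batch_size, collate_fn=collate_texts)`, so that every batch is a flat list of strings regardless of how `load_row` wraps its result.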

In [19]:
# Code to create a dataloader
loader = get_dataloader(batch_size=1, shuffle=True, num_workers=0)

In [22]:
# Generate and print 10 samples
num_samples_to_print = 10
samples_printed = 0

for batch in loader:
    for sample in batch:
        # Adjust this line if 'sample' is not directly the text data
        # For example, if 'sample' is a dictionary, you might need sample['text']
        text_data = sample if isinstance(sample, str) else sample[0]
        
        print(f"Sample #{samples_printed + 1}")
        print(f"Number of chars: {len(text_data)}")
        print("=" * 50)
        # Print the first 250 characters of the sample
        print(text_data[:250])
        # Visual divider
        print("=" * 50)
        print()
        
        samples_printed += 1
        if samples_printed >= num_samples_to_print:
            break
    
    if samples_printed >= num_samples_to_print:
        break

Sample #1
Number of chars: 3383
washed away by a torrential cloud-burst a dozen years ago, but has since been 
rebuilt on higher and safer ground. The census shows the followiug growth 
of population.

For the year 1890— 4.211.') ; lOOO— 1.151 ; I'JIU— 4,;557. 

Gilliam county was 

Sample #2
Number of chars: 8694
ERROR from the circuit court for the county of Alexandria.

This was an action of debt instituted by the defendants in error, (plaintiffs in the circuit court,) as directors of the Domestic Manufacture Company of Alexandria, against Robert Anderson, 

Sample #3
Number of chars: 8245
GEYER 

65 

GILBERT 

of; Noebert, Saint; Pahk, Abbey of the; Premonstratensian 
Canons; Phemontr^, Abbey of; Psatime, Nicholas, Bishop 
OF Verdun; Tonqerloo, Abbey of; Wichmans, Francis. 

Geyer, Very Reverend Francis Xavibr, c.s.h., 
of Verona, b.

Sample #4
Number of chars: 3074
115 STAT. 2216
PUBLIC LAW 107-116—JAN. 10, 2002
For making benefit payments under title XVI of the Social
Security A