# Installs
Install these before starting. If the GPU is about to give out, try starting over from here.

In [None]:
!pip install -U bitsandbytes -q

In [None]:
!pip install --upgrade scipy scikit-learn -q



In [None]:
# Restart the kernel so the newly installed packages take effect
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}
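
Once the kernel is back up, it can be worth confirming that the packages are visible to the new process. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper and is not part of the notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names taken from the pip commands above
report = installed_versions(["bitsandbytes", "scipy", "scikit-learn"])
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

A `NOT INSTALLED` entry suggests that the install cell should be rerun before continuing.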

# /apis
Adapted from [apis](https://github.com/AI-secure/aug-pe/tree/main/apis)


## api.py
Definition of the `API` class

In [None]:
import argparse
from abc import ABC, abstractmethod

# Base API class; HFAPI below provides the concrete implementation
class API(ABC):
    def __init__(self, args=None):
        self.args = args

    @staticmethod
    def command_line_parser():
        parser = argparse.ArgumentParser()
        parser.add_argument(
            '--api_help',
            action='help')
        return parser

    @classmethod
    def from_command_line_args(cls, args):
        """
        Creating the API from command line arguments.

        Args:
            args: (List[str]):
            The command line arguments
        Returns:
            API:
                The API object.
        """
        args = cls.command_line_parser().parse_args(args)
        print(args)
        return cls(**vars(args), args=args)

    @abstractmethod
    def text_random_sampling(self, num_samples, prompt_counter=None):

        pass

    @abstractmethod
    def text_variation(self, images, additional_info,
                       num_variations_per_image, size, variation_degree=None):

        pass
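
The pattern above (an abstract base class plus a `from_command_line_args` constructor) can be sketched with a toy subclass. This is only an illustration under stated assumptions: `BaseAPI` is a trimmed stand-in for the notebook's `API`, and `EchoAPI` with its `--word` flag is hypothetical.

```python
import argparse
from abc import ABC, abstractmethod


class BaseAPI(ABC):
    """Trimmed stand-in for the notebook's API class (same constructor pattern)."""

    def __init__(self, args=None):
        self.args = args

    @staticmethod
    def command_line_parser():
        return argparse.ArgumentParser()

    @classmethod
    def from_command_line_args(cls, args):
        # Parse the flags, then pass each one to __init__ as a keyword argument
        parsed = cls.command_line_parser().parse_args(args)
        return cls(**vars(parsed), args=parsed)

    @abstractmethod
    def text_random_sampling(self, num_samples, prompt_counter=None):
        pass


class EchoAPI(BaseAPI):
    """Hypothetical subclass: 'generates' text by repeating a fixed word."""

    def __init__(self, word, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.word = word

    @staticmethod
    def command_line_parser():
        parser = BaseAPI.command_line_parser()
        parser.add_argument("--word", type=str, default="hola")
        return parser

    def text_random_sampling(self, num_samples, prompt_counter=None):
        return [self.word] * num_samples


api = EchoAPI.from_command_line_args(["--word", "informe"])
samples = api.text_random_sampling(3)
print(samples)
```

Because every parsed flag is forwarded as a keyword argument, each flag added to the parser needs a matching `__init__` parameter in the subclass. `HFAPI` below follows the same rule with a much longer parameter list.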

## hf_api.py
Definition of the `HFAPI` class for Hugging Face models

In [None]:
import torch
import numpy as np
from tqdm import tqdm
import logging
import transformers
import random
import re
import collections


# HFAPI: API implementation for open models from Hugging Face
class HFAPI(API):

    def __init__(self,
                 model_type, variation_type, use_subcategory,
                 output_dir, seed, mlm_probability,
                 length, temperature, top_k, top_p, repetition_penalty, do_sample, fp16, no_cuda,
                 random_sampling_batch_size, num_beams, dry_run,
                 variation_batch_size,
                 *args, **kwargs):
        """
        Initializes an object for managing a language model's configuration, tokenizer, and generation process.
        Args:
            model_type (str): Type or name of the model (e.g., "gpt2", "bert").
            variation_type (str): Type of variation method to apply during text generation.
            use_subcategory (bool): Whether to use specific subcategories for certain datasets.
            output_dir (str): Directory where outputs or models will be saved.
            seed (int): Random seed for reproducibility.
            mlm_probability (float): Masked language model probability (if applicable).
            length (int): Maximum length for generated text.
            temperature (float): Sampling temperature for randomness control in generation.
            top_k (int): Top-k sampling parameter for limiting candidate words.
            top_p (float): Top-p sampling parameter for nucleus sampling.
            repetition_penalty (float): Penalty to avoid repetitive text generation.
            do_sample (bool): Whether to use sampling for text generation.
            fp16 (bool): Whether to use 16-bit floating-point precision for model inference.
            no_cuda (bool): Disable CUDA usage, forcing CPU.
            random_sampling_batch_size (int): Batch size for random sampling operations.
            num_beams (int): Number of beams for beam search.
            dry_run (bool): If True, only simulate the operations without generating results.
            variation_batch_size (int): Batch size for generating variations.
            *args, **kwargs: Additional arguments for flexibility.
        """
        super().__init__(*args, **kwargs)

        # Assign basic parameters to the instance
        self.model_type = model_type
        self.variation_type = variation_type
        self.output_dir = output_dir
        self.length = length
        self.temperature = temperature
        self.k = top_k
        self.p = top_p
        self.repetition_penalty = repetition_penalty
        self.num_beams = num_beams
        self.do_sample = do_sample
        self.fp16 = fp16
        self.no_cuda = no_cuda
        self.seed = seed

        # Determine the device: Use GPU if available and not disabled
        self.device = torch.device("cuda" if torch.cuda.is_available() and not self.no_cuda else "cpu")
        # Determine the number of GPUs to use
        self.n_gpu = 0 if self.no_cuda else torch.cuda.device_count()

        # Set the random seed for reproducibility
        set_seed(seed=seed, n_gpu=self.n_gpu)

        # Store whether it's a dry run (test mode)
        self.dry_run = dry_run

        # Handle subcategory usage if enabled
        self.use_subcategory = use_subcategory
        if use_subcategory:
            # Initialize a dictionary for subcategory mappings for different datasets
            self.subcategory_dict = {}
            self.subcategory_dict['yelp'] = get_subcategories("yelp")
            self.subcategory_dict['pubmed'] = get_subcategories("pubmed")
            self.subcategory_dict['openreview'] = get_subcategories("openreview")

        # Model name or path for loading the tokenizer and model
        model_name_or_path = self.model_type

        # Initialize the tokenizer
        self.tokenizer = transformers.AutoTokenizer.from_pretrained(
            model_name_or_path,
            # Optional: Device map can be set for tokenizer
            # device_map="auto"
        )
        # Configure padding token and side
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.tokenizer.padding_side = "left"

        # Initialize the model
        if "gpt2" not in self.model_type:
            # Load the model in 4-bit precision for large language models
            self.model = transformers.AutoModelForCausalLM.from_pretrained(
                model_name_or_path,
                load_in_4bit=True,
                device_map="auto",
                # Uncomment for 16-bit floating-point precision
                # torch_dtype=torch.float16
            )
        else:
            # Special handling for GPT-2 models
            pad_token_id = self.tokenizer.pad_token_id if self.tokenizer.pad_token_id else self.tokenizer.eos_token_id
            self.model = transformers.AutoModelForCausalLM.from_pretrained(
                model_name_or_path,
                # Uncomment to enable automatic device mapping
                # device_map="auto",
                pad_token_id=pad_token_id
            )
            # Use half-precision if specified
            if self.fp16:
                self.model.half()

        # Store batch sizes for different sampling modes
        self.random_sampling_batch_size = random_sampling_batch_size
        self.variation_batch_size = variation_batch_size

    def text_random_sampling(self, num_samples, prompt_counter=None, lens_dict=None):
        ratio_generation_training = num_samples / sum(prompt_counter.values())
        all_sequences = []
        ppls_cur = []
        additional_info = []
        sync_labels_counter = collections.Counter()

        self.model.eval()

        simulate_num = 0
        for prompt in tqdm(prompt_counter):
            # generation is proportional to the label distributions
            simulate_num_seq_to_generate = round(prompt_counter[prompt] * ratio_generation_training)
            simulate_num += simulate_num_seq_to_generate

        print(f"should -- simulated generated sequences: {simulate_num}")
        all_prefix_prompts = []
        for prompt in tqdm(prompt_counter):
            # generation is proportional to the label distributions
            num_seq_to_generate = round(prompt_counter[prompt] * ratio_generation_training)
            if self.use_subcategory:
                full_prompt_text = 'Sos un policia espia. Por sospechas de subversivos fuentes anonimas te informan sobre intinerarios o movimientos de vecinos o conocidos. Escribi una transcripción un informe que te entrego un informante privado.'

            else:
                full_prompt_text = prompt

            prompt_input_ids = self.tokenizer(full_prompt_text)['input_ids']
            before_gen_length = len(full_prompt_text)
            print('num_seq_to_generate=',num_seq_to_generate)
            if num_seq_to_generate > 0:
                # condition on the prompt
                sequences = self._generate_text(prompt=prompt_input_ids,
                                                seq_num=num_seq_to_generate,
                                                max_length=self.length,
                                                batch_size=self.random_sampling_batch_size,
                                                before_gen_length=before_gen_length)
                all_sequences += sequences
            all_prefix_prompts += [full_prompt_text] * num_seq_to_generate
            additional_info += [prompt] * num_seq_to_generate
            sync_labels_counter[prompt] = num_seq_to_generate

        print(f"Total generated sequences: {len(all_sequences)}")
        torch.cuda.empty_cache()
        return all_sequences,  additional_info, sync_labels_counter, all_prefix_prompts

    def _generate_text(self, prompt, seq_num, max_length, batch_size, before_gen_length):

        all_data = []

        if seq_num < batch_size:
            batch_size = seq_num + 1  # TODO: improve

        num_return_sequences = 2 if batch_size > 1 else 1
        for i in tqdm(range(seq_num // batch_size + 1)):
            if self.dry_run:
                generated_sequences = ["s" * max_length] * batch_size
            else:
                input_ids = torch.tensor(prompt).repeat(
                    batch_size, 1).to(self.device)
                with torch.no_grad():
                    output_sequences = self.model.generate(
                        input_ids=input_ids,
                        max_new_tokens=max_length,
                        pad_token_id=self.tokenizer.eos_token_id,
                        temperature=self.temperature,
                        top_k=self.k,
                        top_p=self.p,
                        num_beams = self.num_beams,
                        early_stopping=True,
                        repetition_penalty=self.repetition_penalty,
                        do_sample=self.do_sample,
                        # overgenerate to ensure we have enough non-empty generated sequences
                        num_return_sequences=1,
                        # num_return_sequences=num_return_sequences,
                        # no_repeat_ngram_size=2,
                    )
                    generated_sequences = self.tokenizer.batch_decode(output_sequences[:, input_ids.shape[1]:],
                                                                      skip_special_tokens=True,
                                                                      clean_up_tokenization_spaces=True)
            for g in generated_sequences:
                seq = g
                seq = " ".join(seq.split())
                if seq:
                    all_data.append(seq)

        if len(all_data) > seq_num:
            all_data = random.sample(all_data, seq_num)
        return all_data

    def text_variation(self, sequences, additional_info,
                       num_variations_per_sequence, variation_degree):
        self.model.eval()
        # self.model.to(self.device)
        variations = []
        for idx in tqdm(range(num_variations_per_sequence)):
            sub_variations, var_labels = self._text_variation(
                sequences=sequences,
                labels=list(additional_info),
                variation_degree=variation_degree,
                variation_type=self.variation_type,
                batch_size=self.variation_batch_size)
            variations.append(sub_variations)
        torch.cuda.empty_cache()
        return np.stack(variations, axis=1), var_labels, [], [], []

    def _rephrase(self, label, sequence, variation_type):

        selected_style = ALL_PUBMED_styles[random.randrange(len(ALL_PUBMED_styles))]
        prompt = "Por favor reformula las siguientes oraciones {} como la transcripción de un informe policial:\n{} \n".format(
           selected_style, sequence)
        # prompt = "Por favor reformula las siguientes oraciones {} para hablar sobre el aspecto de una persona:\n{} \n".format(
        #     selected_style, sequence)
        return prompt

    def _text_variation(self, sequences, labels, variation_degree, variation_type, batch_size):
        if self.dry_run:
            all_data = [seq+"s"*self.length for seq in sequences]
            all_labels = [lab for lab in labels]
            return all_data, all_labels

        num_seq = len(sequences)
        all_data = []
        all_labels = []

        self.model.eval()

        self.mlm_probability = variation_degree

        for i in tqdm(range(num_seq // batch_size + 1)):
            start_idx = i*batch_size
            if start_idx >= num_seq:
                break
            end_idx = num_seq if (
                i+1)*batch_size > num_seq else (i+1)*batch_size

            batch_prompt = []
            batch_labels = []
            for idx in range(start_idx, end_idx):
                prompt = self._rephrase(
                    labels[idx], sequences[idx], variation_type)
                batch_prompt.append(prompt)
                batch_labels.append(labels[idx])

            with torch.no_grad():
                input_ids = self.tokenizer(batch_prompt, padding=True, return_tensors='pt')[
                    'input_ids'].to(self.device)  # has been padded into the same lens; cannot be used
                beam_output = self.model.generate(
                        input_ids=input_ids,
                        max_new_tokens=self.length,
                        pad_token_id=self.tokenizer.eos_token_id,
                        temperature=self.temperature,
                        top_k=self.k,
                        top_p=self.p,
                        early_stopping=True,
                        repetition_penalty=self.repetition_penalty,
                        do_sample=self.do_sample,
                        # overgenerate to ensure we have enough non-empty generated sequences
                        num_return_sequences=1,
                        # num_return_sequences=num_return_sequences,
                        # no_repeat_ngram_size=2,
                    )
                # TODO:   skip the tokens so the lens of input_ids is diff from batch_prompt
                generated_sequences = self.tokenizer.batch_decode(
                    beam_output[:, input_ids.shape[1]:], skip_special_tokens=True,  clean_up_tokenization_spaces=True)
            for idx in range(len(generated_sequences)):
                seq = generated_sequences[idx]
                seq = " ".join(seq.split())
                lab = batch_labels[idx].strip().split("\t")
                if seq:
                    all_data.append(seq)  # no labels!
                else:
                    all_data.append(batch_prompt[idx])
                all_labels.append(lab)

        # logging.info(f" _text_variation output lens  {len(all_data)}")

        return all_data, all_labels
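
`text_random_sampling` splits `num_samples` across prompts in proportion to their counts and rounds each share separately. A minimal sketch of that arithmetic follows; `allocate_samples` and the example counts are hypothetical and only mirror the `round(prompt_counter[prompt] * ratio_generation_training)` expression.

```python
import collections


def allocate_samples(num_samples, prompt_counter):
    """Split num_samples across prompts in proportion to their counts."""
    ratio = num_samples / sum(prompt_counter.values())
    # round() is applied per prompt, so the total can differ from num_samples
    return {prompt: round(count * ratio) for prompt, count in prompt_counter.items()}


# Single label, as in this notebook: the allocation is exact
single = allocate_samples(10, collections.Counter({"dinosaurio": 40}))

# Three equally frequent (hypothetical) labels: each share of 2/3 rounds up to 1
uneven = allocate_samples(2, collections.Counter({"a": 1, "b": 1, "c": 1}))

print(single, sum(single.values()))
print(uneven, sum(uneven.values()))
```

With several labels the per-prompt rounding can produce slightly more or fewer sequences than requested, which is worth keeping in mind when comparing the two totals printed by the method.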

## utils.py (prompts)
Definition of the variation prompts

In [None]:
import torch
import numpy as np
import random
import time
import functools
import signal


# Variation Prompts
ALL_PUBMED_styles = ["de forma casual", "de forma creativa",  "de forma concisa", "de forma cronológica"]


def set_seed(seed, n_gpu=0):
    import random  # Local import: the metrics cell rebinds the global name `random` to numpy.random.random
    torch.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)
    if n_gpu > 0:
        torch.cuda.manual_seed_all(seed)
        torch.cuda.manual_seed(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False


class Timer:
    """Timer context manager"""

    def __enter__(self):
        """Start a new timer as a context manager"""
        self.start = time.time()
        return self

    def __exit__(self, *args):
        """Stop the context manager timer"""
        self.end = time.time()
        self.duration = self.end - self.start

    def __str__(self):
        return f"{self.duration:.1f} seconds"


def timeout(sec):
    """
    timeout decorator
    :param sec: function raise TimeoutError after ? seconds
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapped_func(*args, **kwargs):

            def _handle_timeout(signum, frame):
                err_msg = f'Function {func.__name__} timed out after {sec} seconds'
                raise TimeoutError(err_msg)

            signal.signal(signal.SIGALRM, _handle_timeout)
            signal.alarm(sec)
            try:
                result = func(*args, **kwargs)
            finally:
                signal.alarm(0)
            return result

        return wrapped_func
    return decorator

# We do not use this
def _read_lines(fname):
    # Read a text file and return its lines without trailing newlines
    with open(fname, 'r') as f:
        return [s.replace('\n', '') for s in f.readlines()]


def get_subcategories(dataset):
    if "yelp" in dataset:
        category_list = {'Restaurants', 'Bars', 'Shopping', 'Event Planning & Services',
                         'Beauty & Spas', 'Arts & Entertainment', 'Hotels & Travel',
                         'Health & Medical', 'Grocery', 'Home & Garden'}

        subcategory_list = {}
        for cate in category_list:
            prefix = cate.lower().split(' ')[0]
            subcategory_list[cate] = _read_lines(f'data/yelp/subcategories/{prefix}.txt')
    elif "pubmed" in dataset:
        subcategory_list = _read_lines('data/pubmed/writers.txt')
    elif "openreview" in dataset:
        subcategory_list = [s.replace(':', " who has")
                            for s in _read_lines('data/openreview/writers.txt')]
    else:
        # Fail with a clear message instead of an unbound local variable
        raise ValueError(f'Unknown dataset {dataset}')

    return subcategory_list
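
The point of `set_seed` is that repeating a run with the same seed repeats the same random choices. The sketch below shows this with the standard-library generator only; the notebook's `set_seed` also seeds NumPy and PyTorch, and `sample_with_seed` is a hypothetical helper.

```python
import random


def sample_with_seed(seed, population, k):
    """Seed the standard-library RNG, then draw k items without replacement."""
    random.seed(seed)
    return random.sample(population, k)


styles = ["de forma casual", "de forma creativa", "de forma concisa", "de forma cronológica"]
first = sample_with_seed(0, styles, 2)
second = sample_with_seed(0, styles, 2)
print(first == second)
```

The two draws match because the generator state is reset to the same seed before each call.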

# /dpsda
Adapted from [dpsda](https://github.com/AI-secure/aug-pe/tree/main/dpsda)

## feature_extractor.py
Computes the embeddings

In [None]:
!pip install sentence_transformers -q

In [None]:
import numpy as np
from tqdm import tqdm
import torch
from sentence_transformers import SentenceTransformer

def extract_features(
        data,
        batch_size=1000,
        model_name="all-mpnet-base-v2"):
    """
    Extracts sentence embeddings from a given dataset using a specified model.

    Args:
        data (list or array-like): A list or array of sentences to process.
        batch_size (int): Number of sentences to process in each batch. Default is 1000.
        model_name (str): The name of the Sentence Transformer model to use. Default is "all-mpnet-base-v2".

    Returns:
        np.ndarray: A numpy array containing the concatenated embeddings for each sentence.
    """

    # Initialize the SentenceTransformer model with the specified model name.
    # Optionally, a 'device' parameter could be set here to specify GPU usage (e.g., device='cuda').
    model = SentenceTransformer(model_name)
    model.eval()  # Set the model to evaluation mode to avoid training behavior.

    # Disable gradient calculations to save memory and improve computation speed (since we're just encoding).
    with torch.no_grad():
        sentence_embeddings = []  # Initialize an empty list to hold embeddings for each batch.

        # Loop over data in batches, with tqdm providing a progress bar.
        for i in tqdm(range(len(data) // batch_size + 1)):
            # Get embeddings for the current batch of data.
            # Slices the 'data' array to get the current batch.
            embeddings = model.encode(
                data[i * batch_size:(i + 1) * batch_size])

            # Only append if the embeddings array has content.
            if len(embeddings) > 0:
                sentence_embeddings.append(embeddings)

    # Concatenate all the batches of embeddings into a single numpy array.
    sentence_embeddings = np.concatenate(sentence_embeddings)

    # Delete the model from memory to free up resources.
    del model

    # Return the final array of sentence embeddings.
    return sentence_embeddings
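
The loop in `extract_features` runs `len(data) // batch_size + 1` times, so the last slice is empty whenever the data length is an exact multiple of the batch size. That is the case the `len(embeddings) > 0` check guards against. A minimal sketch of the slicing alone, with a hypothetical helper `split_batches` and toy data:

```python
def split_batches(data, batch_size):
    """Slice data the same way as the loop in extract_features."""
    return [data[i * batch_size:(i + 1) * batch_size]
            for i in range(len(data) // batch_size + 1)]


sentences = [f"s{i}" for i in range(6)]

uneven = split_batches(sentences, 4)  # 6 items in batches of 4
even = split_batches(sentences, 3)    # 6 items in batches of 3

print([len(b) for b in uneven])
print([len(b) for b in even])

# Dropping empty batches mirrors the guard before the append
non_empty = [b for b in even if len(b) > 0]
```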

## dp_counter.py
Builds the DP nearest-neighbor histogram

In [None]:
!pip install faiss-gpu -q

In [None]:
import faiss
import logging
import numpy as np
from collections import Counter
import torch


def dp_nn_histogram(public_features, private_features, noise_multiplier,
                    num_packing=1, num_nearest_neighbor=1, mode='L2',
                    threshold=0.0):
    assert public_features.shape[0] % num_packing == 0

    num_true_public_features = public_features.shape[0] // num_packing
    if public_features.shape[0] == 0:  # TODO debug, why this case exists
        return np.zeros(shape=num_true_public_features), np.zeros(shape=num_true_public_features)

    if mode == 'L2':
        index = faiss.IndexFlatL2(public_features.shape[1])
    # inner product; need normalization (https://github.com/spotify/annoy)
    elif mode == 'IP':
        index = faiss.IndexFlatIP(public_features.shape[1])
    elif mode == 'cos_sim':
        # normalize the embeddings first
        faiss.normalize_L2(public_features)
        faiss.normalize_L2(private_features)
        index = faiss.IndexFlatIP(public_features.shape[1])
    else:
        raise Exception(f'Unknown mode {mode}')
    if torch.cuda.is_available():
        # Create the GPU resources only when a GPU is actually present
        faiss_res = faiss.StandardGpuResources()
        index = faiss.index_cpu_to_gpu(faiss_res, 0, index)

    print(f'public_features shape : {public_features.shape}')
    print(f'private_features shape : {private_features.shape}')

    index.add(public_features)
    print(f'Number of samples in index: {index.ntotal}')
    distance, ids = index.search(private_features, k=num_nearest_neighbor)
    print('Finished search')

    counter = Counter(list(ids.flatten()))
    # shape of the synthetic samples
    count = np.zeros(shape=num_true_public_features)
    for k in counter:
        count[k % num_true_public_features] += counter[k]
    print(f'Clean count: {count}')
    print(f'Clean count sum: {np.sum(count)}')
    print(f'Clean count num>0: {np.sum(count > 0)}')
    print(f'Largest clean counters: {sorted(count)[::-1][:50]}')
    count = np.asarray(count)
    clean_count = count.copy()
    count += (np.random.normal(size=len(count)) * np.sqrt(num_nearest_neighbor)
              * noise_multiplier)
    print(f'Noisy count sum: {np.sum(count)}')
    print(f'Noisy count num>0: {np.sum(count > 0)}')
    print(f'Largest noisy counters: {sorted(count)[::-1][:50]}')
    count = np.clip(count, a_min=threshold, a_max=None)
    count = count - threshold
    print(f'Clipped noisy count sum: {np.sum(count)}')
    print(f'Clipped noisy count num>0: {np.sum(count > 0)}')
    print(f'Clipped largest noisy counters: {sorted(count)[::-1][:50]}')
    torch.cuda.empty_cache()
    return count, clean_count
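
The core of `dp_nn_histogram` is: every private embedding votes for its nearest synthetic embedding, Gaussian noise is added to the vote counts, and the result is clipped at a threshold. The sketch below reproduces those steps in pure Python for a single nearest neighbor. It is an illustration only: `nn_histogram` is hypothetical, uses brute-force L2 search instead of Faiss, and skips the packing logic.

```python
import random
from collections import Counter


def nn_histogram(public, private, noise_multiplier, threshold=0.0, rng=None):
    """Vote for the nearest public point, add Gaussian noise, then threshold."""
    if rng is None:
        rng = random.Random(0)

    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    # Each private point votes for its single nearest public point (k = 1)
    votes = Counter(
        min(range(len(public)), key=lambda j: sq_dist(p, public[j]))
        for p in private
    )
    clean = [float(votes[j]) for j in range(len(public))]
    # Gaussian noise with standard deviation noise_multiplier (sqrt(k) = 1 here)
    noisy = [c + rng.gauss(0.0, 1.0) * noise_multiplier for c in clean]
    # Clip at the threshold and shift down, as in the notebook function
    clipped = [max(c, threshold) - threshold for c in noisy]
    return clipped, clean


public = [(0.0, 0.0), (10.0, 10.0)]
private = [(1.0, 0.0), (0.0, 2.0), (9.0, 9.0)]
counts, clean_counts = nn_histogram(public, private, noise_multiplier=0.0, threshold=1.0)
print(clean_counts, counts)
```

With `noise_multiplier=0.0` the output is deterministic, which makes the effect of the threshold easy to see: a synthetic sample needs more than `threshold` votes to keep any weight.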

## data_loader

In [None]:
!pip install datasets -q

In [None]:
import numpy as np
import logging
import collections
import csv
from datasets import load_dataset


def sample_dataset(data_name, dataset, label_column_name='label1', sample_size=5000, subsample_one_class=False):
    # Sample part of the dataset according to the requested size
    print(f"sample_size: {sample_size}")
    if not subsample_one_class and sample_size < 0:
        return dataset  # Return the full dataset when no sampling is requested

    training_dataset = dataset
    # Candidate row indices. Assumption: this adaptation has no per-class filter,
    # so both branches draw from every row (the original used `indices` in the
    # subsample_one_class branch before defining it).
    indices = list(range(len(training_dataset)))

    if subsample_one_class and sample_size < 0:
        sample_indices = indices
    else:
        # Random sample without replacement
        sample_indices = np.random.choice(indices, size=sample_size, replace=False)
        np.random.shuffle(sample_indices)

    print(sample_indices)

    # Keep only the selected rows
    training_dataset = training_dataset.select(sample_indices)
    dataset = training_dataset
    return dataset


def load_data(dataset, data_file, num_samples=-1, subsample_one_class=False, gen=False):
    # Load and preprocess the data, assigning prompts as labels
    print("data_file", data_file)
    prompt_counter = collections.Counter()
    raw_datasets = data_file

    # Sample from the original dataset
    original_data = sample_dataset(dataset, raw_datasets, label_column_name='',
                                    sample_size=num_samples, subsample_one_class=subsample_one_class)
    prompt_idexer = dict()  # Stores the row indices for each prompt
    train_data = []
    train_labels = []
    for i, line in enumerate(original_data):

        # Prompts act as labels
        prompt = "dinosaurio"  # Fixed label assigned to every row
        prompt_counter[prompt] += 1  # Count occurrences of the prompt

        # Record the row index under the current prompt
        if prompt not in prompt_idexer.keys():
            prompt_idexer[prompt] = [i]
        else:
            prompt_idexer[prompt].append(i)
        train_data.append(line['text'])  # Store the input text
        train_labels.append(prompt)  # Store the assigned label
    return train_data, train_labels, prompt_counter, prompt_idexer
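
`load_data` returns two bookkeeping structures next to the texts: a counter of how often each prompt occurs and a mapping from each prompt to the row indices that carry it. A minimal sketch of that bookkeeping, using a hypothetical helper `index_by_prompt` and made-up labels (the notebook itself assigns the single label "dinosaurio" to every row):

```python
import collections


def index_by_prompt(labels):
    """Count each label and collect the row indices where it occurs."""
    counter = collections.Counter()
    indexer = collections.defaultdict(list)
    for i, label in enumerate(labels):
        counter[label] += 1
        indexer[label].append(i)
    return counter, dict(indexer)


# "otro" is a made-up second label, for illustration only
labels = ["dinosaurio", "dinosaurio", "otro", "dinosaurio"]
prompt_counter, prompt_indexer = index_by_prompt(labels)
print(prompt_counter)
print(prompt_indexer)
```

The counter is what `text_random_sampling` later uses to decide how many sequences to generate per prompt.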

## metrics.py
Some metrics

In [None]:
import numpy as np
from time import time
from numpy import cov
from numpy import trace
from numpy import iscomplexobj
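from numpy.random import random  # NOTE: in a notebook this rebinds the global name `random`, shadowing the stdlib module imported in earlier cells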
from scipy.linalg import sqrtm
from sklearn.metrics import pairwise_distances



# Frechet Inception Distance:
# ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * sqrt(sigma1 . sigma2))
def calculate_fid(act1, act2):
    # calculate mean and covariance statistics
    mu1, sigma1 = act1.mean(axis=0), cov(act1, rowvar=False)
    mu2, sigma2 = act2.mean(axis=0), cov(act2, rowvar=False)
    # calculate sum squared difference between means
    ssdiff = np.sum((mu1 - mu2) ** 2.0)
    # calculate sqrt of product between cov
    covmean = sqrtm(sigma1.dot(sigma2))
    # check and correct imaginary numbers from sqrt
    if iscomplexobj(covmean):
        covmean = covmean.real
    # calculate score
    fid = ssdiff + trace(sigma1 + sigma2 - 2.0 * covmean)
    return fid


# ----------------------------------------------------------------------------
# https://github.com/kynkaat/improved-precision-and-recall-metric/blob/master/precision_recall.py
class DistanceBlock():
    """Provides multi-GPU support to calculate pairwise distances between two batches of feature vectors."""

    def __init__(self, num_features, num_gpus):
        self.num_features = num_features
        self.num_gpus = num_gpus

    def pairwise_distances(self, U, V):
        """Evaluate pairwise distances between two batches of feature vectors."""
        output = pairwise_distances(U, V, n_jobs=24)
        return output


# ----------------------------------------------------------------------------

class ManifoldEstimator():
    """Estimates the manifold of given feature vectors."""

    def __init__(self, distance_block, features, row_batch_size=25000, col_batch_size=50000,
                 nhood_sizes=[3], clamp_to_percentile=None, eps=1e-5):
        """Estimate the manifold of given feature vectors.

            Args:
                distance_block: DistanceBlock object that distributes pairwise distance
                    calculation to multiple GPUs.
                features (np.array/tf.Tensor): Matrix of feature vectors to estimate their manifold.
                row_batch_size (int): Row batch size to compute pairwise distances
                    (parameter to trade-off between memory usage and performance).
                col_batch_size (int): Column batch size to compute pairwise distances.
                nhood_sizes (list): Number of neighbors used to estimate the manifold.
                clamp_to_percentile (float): Prune hyperspheres that have radius larger than
                    the given percentile.
                eps (float): Small number for numerical stability.
        """
        num_images = features.shape[0]
        self.nhood_sizes = nhood_sizes
        self.num_nhoods = len(nhood_sizes)
        self.eps = eps
        self.row_batch_size = row_batch_size
        self.col_batch_size = col_batch_size
        self._ref_features = features
        self._distance_block = distance_block

        # Estimate manifold of features by calculating distances to k-NN of each sample.
        self.D = np.zeros([num_images, self.num_nhoods], dtype=np.float32)
        distance_batch = np.zeros(
            [row_batch_size, num_images], dtype=np.float32)
        seq = np.arange(max(self.nhood_sizes) + 1, dtype=np.int32)

        for begin1 in range(0, num_images, row_batch_size):
            end1 = min(begin1 + row_batch_size, num_images)
            row_batch = features[begin1:end1]

            for begin2 in range(0, num_images, col_batch_size):
                end2 = min(begin2 + col_batch_size, num_images)
                col_batch = features[begin2:end2]

                # Compute distances between batches.
                distance_batch[0:end1 - begin1, begin2:end2] = self._distance_block.pairwise_distances(row_batch,
                                                                                                       col_batch)

            # Find the k-nearest neighbor from the current batch.
            self.D[begin1:end1, :] = np.partition(
                distance_batch[0:end1 - begin1, :], seq, axis=1)[:, self.nhood_sizes]

        if clamp_to_percentile is not None:
            max_distances = np.percentile(self.D, clamp_to_percentile, axis=0)
            self.D[self.D > max_distances] = 0

    def evaluate(self, eval_features, return_realism=False, return_neighbors=False):
        """Evaluate if new feature vectors are at the manifold."""
        num_eval_images = eval_features.shape[0]
        num_ref_images = self.D.shape[0]
        distance_batch = np.zeros(
            [self.row_batch_size, num_ref_images], dtype=np.float32)
        batch_predictions = np.zeros(
            [num_eval_images, self.num_nhoods], dtype=np.int32)
        max_realism_score = np.zeros([num_eval_images, ], dtype=np.float32)
        nearest_indices = np.zeros([num_eval_images, ], dtype=np.int32)

        for begin1 in range(0, num_eval_images, self.row_batch_size):
            end1 = min(begin1 + self.row_batch_size, num_eval_images)
            feature_batch = eval_features[begin1:end1]

            for begin2 in range(0, num_ref_images, self.col_batch_size):
                end2 = min(begin2 + self.col_batch_size, num_ref_images)
                ref_batch = self._ref_features[begin2:end2]

                distance_batch[0:end1 - begin1, begin2:end2] = self._distance_block.pairwise_distances(feature_batch,
                                                                                                       ref_batch)

            # From the minibatch of new feature vectors, determine if they are in the estimated manifold.
            # If a feature vector is inside a hypersphere of some reference sample, then
            # the new sample lies at the estimated manifold.
            # The radii of the hyperspheres are determined from distances of neighborhood size k.
            samples_in_manifold = distance_batch[0:end1 -
                                                 begin1, :, None] <= self.D
            batch_predictions[begin1:end1] = np.any(
                samples_in_manifold, axis=1).astype(np.int32)

            max_realism_score[begin1:end1] = np.max(self.D[:, 0] / (distance_batch[0:end1 - begin1, :] + self.eps),
                                                    axis=1)
            nearest_indices[begin1:end1] = np.argmin(
                distance_batch[0:end1 - begin1, :], axis=1)

        if return_realism and return_neighbors:
            return batch_predictions, max_realism_score, nearest_indices
        elif return_realism:
            return batch_predictions, max_realism_score
        elif return_neighbors:
            return batch_predictions, nearest_indices

        return batch_predictions


# ----------------------------------------------------------------------------

def knn_precision_recall_features(ref_features, eval_features, nhood_sizes=[3],
                                  row_batch_size=10000, col_batch_size=50000, num_gpus=1, debug=True):
    """Calculates k-NN precision and recall for two sets of feature vectors.

        Args:
            ref_features (np.array/tf.Tensor): Feature vectors of reference images.
            eval_features (np.array/tf.Tensor): Feature vectors of generated images.
            nhood_sizes (list): Number of neighbors used to estimate the manifold.
            row_batch_size (int): Row batch size to compute pairwise distances
                (parameter to trade-off between memory usage and performance).
            col_batch_size (int): Column batch size to compute pairwise distances.
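            num_gpus (int): Number of GPUs used to evaluate precision and recall.
            debug (bool): If True (the default), skip the computation and return
                zeros for precision, recall and F1.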

        Returns:
            State (dict): Dict that contains precision and recall calculated from
            ref_features and eval_features.
    """
    state = dict()
    if debug:
        state['precision'] = 0
        state['recall'] = 0
        state['f1'] = 0
        return state

    num_images = ref_features.shape[0]
    num_features = ref_features.shape[1]

    # Initialize DistanceBlock and ManifoldEstimators.
    distance_block = DistanceBlock(num_features, num_gpus)
    ref_manifold = ManifoldEstimator(
        distance_block, ref_features, row_batch_size, col_batch_size, nhood_sizes)
    eval_manifold = ManifoldEstimator(
        distance_block, eval_features, row_batch_size, col_batch_size, nhood_sizes)

    # Evaluate precision and recall using k-nearest neighbors.
    print('Evaluating k-NN precision and recall with %i samples...' % num_images)
    start = time()

    # Precision: How many points from eval_features are in ref_features manifold.
    precision = ref_manifold.evaluate(eval_features)
    state['precision'] = precision.mean(axis=0).item()

    # Recall: How many points from ref_features are in eval_features manifold.
    recall = eval_manifold.evaluate(ref_features)
    state['recall'] = recall.mean(axis=0).item()

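    # Guard against division by zero when precision and recall are both 0.
    denom = state['precision'] + state['recall']
    state['f1'] = 2 * (state['precision'] * state['recall']) / denom if denom > 0 else 0.0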

    print('Evaluated k-NN precision and recall in: %gs' % (time() - start))

    return state
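
The membership test in `ManifoldEstimator.evaluate` can be illustrated on toy data. The following is a minimal, dependency-free sketch, not the batched implementation above: it assumes Euclidean distance, a single neighborhood size `k`, and hypothetical helper names (`kth_nn_radii`, `in_manifold`, `precision_recall_f1`).

```python
import math

def kth_nn_radii(points, k):
    """Distance from each point to its k-th nearest neighbor within the same set."""
    radii = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        # dists[0] is the zero distance from p to itself, so index k is the k-th neighbor.
        radii.append(dists[k])
    return radii

def in_manifold(queries, ref_points, radii):
    """1 if a query lies inside at least one reference hypersphere, else 0."""
    return [
        int(any(math.dist(q, r) <= rad for r, rad in zip(ref_points, radii)))
        for q in queries
    ]

def precision_recall_f1(ref, gen, k=3):
    # Precision: generated points inside the reference manifold.
    precision_hits = in_manifold(gen, ref, kth_nn_radii(ref, k))
    # Recall: reference points inside the generated manifold.
    recall_hits = in_manifold(ref, gen, kth_nn_radii(gen, k))
    precision = sum(precision_hits) / len(gen)
    recall = sum(recall_hits) / len(ref)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy data (hypothetical): four reference points on a unit square and
# four generated points, one of which is far away from the reference set.
ref = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
gen = [(0.1, 0.1), (0.9, 0.1), (0.1, 0.9), (10.0, 10.0)]
state = precision_recall_f1(ref, gen, k=1)
print(state)
```

On this toy data the distant generated point lowers precision, but its own nearest-neighbor radius is so large that it covers every reference point, which shows how a single outlier can inflate the estimated manifold.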

## logging.py

In [None]:
import logging
import os
import numpy as np
import csv
import json
# from dpsda.metrics import calculate_fid, knn_precision_recall_features

def compute_fid(synthetic_features, all_private_features, feature_extractor, folder='', step=0, log_online=False):
    # Computes FID and k-NN precision/recall/F1 between synthetic and private features

    print(f'Computing FID and F1 for syn shape {synthetic_features.shape}')
    fid = calculate_fid(synthetic_features, all_private_features)
    state = knn_precision_recall_features(ref_features=all_private_features,
                                          eval_features=synthetic_features)
    print(f'fid={fid} F1={state}')



def setup_logging(log_file):
    # Sets up logging to both the console and a file

    log_formatter = logging.Formatter(
        fmt=('%(asctime)s [%(threadName)-12.12s] [%(levelname)-5.5s]  '
             '%(message)s'),
        datefmt='%m/%d/%Y %H:%M:%S %p')
    root_logger = logging.getLogger()
    # root_logger.setLevel(logging.DEBUG)
    root_logger.setLevel(logging.INFO)

    console_handler = logging.StreamHandler()  # Console logging
    console_handler.setFormatter(log_formatter)
    root_logger.addHandler(console_handler)

    file_handler = logging.FileHandler(log_file)  # File logging
    file_handler.setFormatter(log_formatter)
    root_logger.addHandler(file_handler)

    pil_logger = logging.getLogger('PIL')  # Reduce unnecessary PIL logs
    pil_logger.setLevel(logging.INFO)


def log_embeddings(embeddings, additional_info, folder, fname=''):
    # Saves embeddings and metadata to an .npz archive (np.savez does not compress)

    if not os.path.exists(folder):
        os.makedirs(folder)
    savefname = os.path.join(folder, fname+'.embeddings.npz')
    print("save embeddings into", savefname)
    np.savez(
        savefname,
        embeddings=embeddings,
        additional_info=additional_info)


def load_embeddings(path):
    # Loads embeddings and metadata from an .npz archive

    data = np.load(path)
    embeddings = data['embeddings']
    additional_info = data['additional_info']

    return embeddings, additional_info


def log_num_words(fname="num_word_lookahead.csv", all_gen_words=[], all_target_words=[]):
    # Writes the word-count differences between generated and target texts, plus summary statistics

    if len(all_gen_words) == 0 or len(all_target_words) == 0:
        return
    with open(fname, 'w', newline='', encoding="utf-8") as wf:
        csv_writer = csv.writer(wf)
        csv_writer.writerow(["target", "gen", "diff"])
        diff_list = []
        diff_abs_list = []
        for i in range(len(all_target_words)):
            try:
                diff_list.append(all_gen_words[i] - all_target_words[i])
                diff_abs_list.append(
                    abs(all_gen_words[i] - all_target_words[i]))
                csv_writer.writerow(
                    [all_target_words[i], all_gen_words[i], all_gen_words[i] - all_target_words[i]])
            except:
                continue
        csv_writer.writerow(["mean_abs", "var_abs", "mean", "var"])
        csv_writer.writerow([np.mean(diff_abs_list), np.std(
            diff_abs_list), np.mean(diff_list), np.std(diff_list)])


def log_prompt_generation(fname="prompt_generation.jsonl", prompts=[], generations=[]):
    # Saves prompts and their associated generations to a JSONL file

    new_variants_samples = []
    for x in generations:
        new_variants_samples.extend(x.tolist())

    if len(prompts) == 0 or len(new_variants_samples) == 0:
        return
    with open(fname, "w") as file:
        for i in range(len(prompts)):
            try:
                json_str = json.dumps(
                    {"prompt": prompts[i], "generation": new_variants_samples[i]})
                file.write(json_str + "\n")
            except:
                continue


def log_count(count, clean_count, path):
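    # Saves the counts and clean counts to a CSV file

    dirname = os.path.dirname(path)
    # dirname is '' when path has no directory component; os.makedirs('') would fail
    if dirname and not os.path.exists(dirname):
        os.makedirs(dirname)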

    title = ['type', 'count']
    with open(path, 'w', newline='', encoding="utf-8") as wf:
        csv_writer = csv.writer(wf)
        csv_writer.writerow(title)
        csv_writer.writerow(["count", count.tolist()])
        csv_writer.writerow(["clean_count", clean_count.tolist()])


def log_fid(folder, fid, f1, precision, recall, t, save_fname='fid.csv'):
    # Appends FID, F1, precision and recall for step t to a file (space-separated)

    with open(os.path.join(folder, save_fname), 'a') as f:
        f.write(f'{t} {fid} {f1} {precision} {recall}\n')


def log_fid_list(folder, fids, t, save_fname='fid.csv'):
    # Appends a list of FIDs for a given step as a CSV row

    write_list = [t]
    write_list.extend(fids)
    with open(os.path.join(folder, save_fname), 'a') as f:
        writer = csv.writer(f)
        writer.writerow(write_list)


def log_samples(samples, additional_info, folder):
    # Saves generated samples and their labels to a CSV file

    if not os.path.exists(folder):
        os.makedirs(folder)

    all_data = []
    for i in range(len(samples)):
        seq = samples[i]
        labels = additional_info[i]
        if seq:
            seq = " ".join(seq.split())  # Collapse extra whitespace
            if "pubmed" in labels:
                all_data.append([seq])
            else:
                labels = labels.strip().split("\t")
                all_data.append([seq]+labels)

    if "pubmed" in additional_info[0]:  # unconditional
        title = ['text']
    else:
        title = ['text', 'label1', 'label2']
    try:
        with open(os.path.join(folder, 'samples.csv'), 'w', newline='', encoding="utf-8") as wf:
            csv_writer = csv.writer(wf)
            csv_writer.writerow(title)
            for obj in all_data:
                if obj[0]:  # remove empty sequences
                    csv_writer.writerow(obj)
    except:  # in case there are some special characters in the text
        with open(os.path.join(folder, 'samples.csv'), 'w', newline='', encoding="utf-8") as wf:
            # quotechar=None instead of '': Python 3.11+ rejects an empty quotechar
            csv_writer = csv.writer(
                wf, quoting=csv.QUOTE_NONE, quotechar=None, escapechar='\\')
            csv_writer.writerow(title)
            for obj in all_data:
                if obj[0]:  # remove empty sequences
                    csv_writer.writerow(obj)
    return all_data

# Data Path UNC
Path to the folder that holds the files (on our Drive).
To download them: [DRIVE](https://drive.google.com/drive/folders/1YumfaRFwbPIDfc6giQ_13_VKLRb9jtuZ?usp=sharing)

In [None]:
folder_path = '/content/drive/Shareddrives/Proyectous/Archivo del Tword/data'

In [None]:
folder_path = '/content/drive/MyDrive/TextMining/Data'

# Dataset
We import the documents and build the dataset.

In [None]:
!pip install datasets -q

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


We load the texts with `load_txt_files`, using whichever `folder_path` matches where the files are stored.

In [None]:
import os
def load_txt_files(folder_path):
  """
  Load contents of all .txt files in a given folder.

  Args:
    folder_path (str): Path to the folder containing .txt files.

  Returns:
    list: List of strings, each representing the content of a .txt file.
  """

  # Initialize an empty list to store the contents of .txt files
  data = []

  # Iterate over each file in the specified folder
  for filename in os.listdir(folder_path):
    # Check if the file has a .txt extension
    if filename.endswith('.txt'):
      # Construct the full path to the file
      file_path = os.path.join(folder_path, filename)

      # Open the file in read mode
      with open(file_path, 'r') as f:
        # Read the contents of the file
        content = f.read()

        # Append the file content to the data list
        data.append(content)

  # Return the list of file contents
  return data

# Load .txt files from a folder (replace 'folder_path' with the actual path)
dataset = load_txt_files(folder_path)

## Split
Now we split each document with the text splitters.

In [None]:
!pip install langchain -q

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import CharacterTextSplitter
from gensim import corpora, models, similarities
from gensim.parsing.preprocessing import remove_stopwords, preprocess_string

In [None]:
chunk_size1 = 500
chunk_size2 = 500
chunk_size3 = 500
chunk_overlap = 120
separator1 = 'agente'
separator2 = 'texto'
# separator3 = '\n'

# Split on the word 'agente'
text_splitter1 = CharacterTextSplitter(chunk_size = chunk_size1,
                                               chunk_overlap = chunk_overlap,
                                               separator = separator1
                                               )
# Split on the word 'texto'
text_splitter2 = CharacterTextSplitter(chunk_size = chunk_size2,
                                               chunk_overlap = chunk_overlap,
                                               separator = separator2
                                               )

text_splitter3 = RecursiveCharacterTextSplitter(chunk_size = chunk_size3,
                                                chunk_overlap = chunk_overlap,
                                                # separators = separators
                                                )


# Initialize an empty list to store the final chunks.
data_file = []

# Iterate over each document in the dataset.
for doc in dataset:
    # Apply the first level of splitting after removing stopwords and converting to lowercase.
    split1 = text_splitter1.create_documents([remove_stopwords(doc).lower()])

    # Iterate over each chunk from the first split.
    for chunk1 in split1:
        # Apply the second level of splitting.
        split2 = text_splitter2.create_documents([chunk1.page_content])

        # Iterate over each chunk from the second split.
        for chunk2 in split2:
            # Store chunks to be further processed.
            chunks_to_process = [chunk2.page_content]

            # Process each chunk until all are below the desired length.
            while chunks_to_process:
                # Take the next chunk to process.
                current_chunk = chunks_to_process.pop(0)

                # If the chunk is longer than 1000 characters, split it further.
                if len(current_chunk) > 1000:
                    # Apply text_splitter3 (the recursive splitter) to break the chunk down.
                    split3 = text_splitter3.create_documents([current_chunk])
                    split3 = list([chunk.page_content for chunk in split3])
                    # Add the newly split smaller chunks back to the list for further checking.
                    chunks_to_process = split3 + chunks_to_process
                else:
                    # If the chunk is already of the desired size, add it directly.
                    # print(len(current_chunk))
                    data_file.append(current_chunk)
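
The control flow of the loop above (keyword splits first, then re-splitting anything that is still too long) can be sketched without LangChain. This is a simplified stand-in under stated assumptions: `split_on_keyword` and `split_fixed` are hypothetical helpers that do not reproduce the merging behavior of `CharacterTextSplitter` or `RecursiveCharacterTextSplitter`, and stopword removal is omitted.

```python
from collections import deque

def split_on_keyword(text, keyword):
    """Split text at each occurrence of keyword, dropping empty pieces."""
    return [piece.strip() for piece in text.split(keyword) if piece.strip()]

def split_fixed(text, size, overlap):
    """Cut text into windows of at most `size` characters that overlap by `overlap`."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def hierarchical_chunks(doc, keywords=("agente", "texto"),
                        max_len=1000, size=500, overlap=120):
    # Level 1 and 2: split on each keyword in turn.
    pieces = [doc.lower()]
    for keyword in keywords:
        pieces = [sub for piece in pieces for sub in split_on_keyword(piece, keyword)]

    # Level 3: keep re-splitting until every chunk fits under max_len.
    queue = deque(pieces)
    chunks = []
    while queue:
        current = queue.popleft()
        if len(current) > max_len:
            # Put the smaller pieces back at the front, preserving their order.
            queue.extendleft(reversed(split_fixed(current, size, overlap)))
        else:
            chunks.append(current)
    return chunks

# Hypothetical document: one long block and one short block.
demo_doc = "agente " + "a" * 1500 + " texto " + "b" * 300
demo_chunks = hierarchical_chunks(demo_doc)
print([len(chunk) for chunk in demo_chunks])
```

Because `size` is smaller than `max_len`, every window produced by `split_fixed` is accepted on its next pass, so the queue always drains.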

Let's see how many documents we obtained.

In [None]:
N = len(data_file)
print('number of documents:', N)

In [None]:
import datasets
# Create a Dataset object from a dictionary
data_file = datasets.Dataset.from_dict({
    # Map the loaded text data to a column named "text"
    "text": data_file
})

# Tiny dataset
Run this only if you want to see the generation for a single featured example, hand-picked and corrected by hand.

In [None]:
doc = 'La fuente informa que Eleuterio FERNANDEZ HUIDOBRO, viajó el día miércoles 10 de Junio a Bs. As. , habiéndose enterado del hecho unos días después, en virtud de que la madre de éste se había agralfado y los familiares quisieron ubicarlo sin lograrlo. Uno o dos días después se supo que éste se encontraba en Bs. As. y por comentarios oídos por la fuente al parecer habría ido a trabajar para la votación del día 05 de julio. Desconoce hasta el momento porque medio se trasladó pero manifiesta que por lo general lo hace en el Buque Bus que parte por la noche. Además no tíene conocimiento si ya regresó, cuando y como . Manifiesta que Eleuterio FERNANDEZ HUIDOBRO cuando viaja a Bs. As. se aloja en casa de unos amigos cuyas direcciones y nombres se adjuntan. Agrega además que la casa que éste poseía en el Balneario SALINAS ya fue vendida y la propiedad (chacra) que tenía aparentemente en el paraje "Cuchilla Alta" no está precisamente ubicada ahí, sinó que por el contrario queda en el Balneario Jaureguiberry próximo a la playa. Con respecto a la esposa de FERNANDEZ HUIDOBRO, la fuente expresa que el día 30 de Junio se embarca para Alemania, debido a que se encuentra muy mal de salud y lo médicos le recomendaron trasladarse a una clínica que tienen unos Uruguayos en éste país, donde se internará para la aplicación de un tratamiento contra el cáncer. Aparentemente no le dieron probabilidades de que éste le de buenos resultados. Según algunos comentarios oídos por la fuente FERNANDEZ HUIDOBRO tiene en mente poner lá propiedad de la calle MISSOURI a la venta y aprovechar el viaje de su esposa e irse a vivir con su hija al garage de la casa de su hermana. Las amistades donde aparentemente se aloja FERNANDEZ HUIDOBRO (MLN) cuando viaja a Bs. As. son: GRACIELA "SILVA - Dom. CACHIMAYO 112- piso 6-E - SUSANA MORALES Dom.- RIGLOS 445 - Iro.C casi ALVERDI'

In [None]:
import datasets
data_file = datasets.Dataset.from_dict({
    "text": [doc]  # a column is a list of rows; update with appropriate column names
})

In [None]:
N = 1

In [None]:
len(doc)

1866

# Noise Level Calculation
Adapted from [dp_budget.ipynb](https://github.com/AI-secure/aug-pe/blob/main/notebook/dp_budget.ipynb) to compute the noise level for a target $\epsilon$. If $N$ is the number of documents in the dataset, we take:

$$\delta= \frac{1}{N\cdot\log(N)}$$
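
As a worked example of this formula, the sketch below evaluates $\delta$ for $N = 30000$, the value that appears in the outputs further down. `dp_delta` is a hypothetical helper name, and the natural logarithm is assumed, matching `math.log` in the cells that follow. The formula needs $N > 1$: with the single-document setting $N = 1$, $\log(1) = 0$ and the division fails.

```python
import math

def dp_delta(num_documents):
    """delta = 1 / (N * log(N)); requires N > 1 because log(1) = 0."""
    if num_documents <= 1:
        raise ValueError("N must be greater than 1")
    return 1.0 / (num_documents * math.log(num_documents))

delta_example = dp_delta(30000)
print(f"delta = {delta_example:.3e}")
```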

In [None]:
import scipy.optimize  # explicit submodule imports; also binds the name `scipy`
import scipy.stats
import numpy as np
import math

def delta_Gaussian(eps, mu):
   """Compute delta of Gaussian mechanism with shift mu or equivalently noise scale 1/mu"""
   if mu==0:
       return 0
   return scipy.stats.norm.cdf(-eps / mu + mu / 2) - np.exp(eps) * scipy.stats.norm.cdf(-eps / mu - mu / 2)

def eps_Gaussian(delta, mu):
   """Compute eps of Gaussian mechanism with shift mu or equivalently noise scale 1/mu"""
   def f(x):
       return delta_Gaussian(x, mu) - delta
   return scipy.optimize.root_scalar(f, bracket=[0, 500], method='brentq').root

def compute_epsilon(noise_multiplier, num_steps, delta):
   return eps_Gaussian(delta, np.sqrt(num_steps) / noise_multiplier)

In [None]:
delta= 1/(N*math.log(N))
epoch=10

break_noise=0
bnoisesss = []
for eps in [1,2,4]:
    for noise in np.arange(20,1, -0.01):
        # Stop at the first (largest) noise level whose epsilon exceeds the target.
        if compute_epsilon(noise, epoch, delta)>eps:
            break_noise=noise
            break
    bnoisesss.append(break_noise)
    print("threshold eps", eps, "break_noise", break_noise, f"eps {compute_epsilon(noise, epoch, delta):4f}")

threshold eps 1 break_noise 12.57999999999884 eps 1.000091
threshold eps 2 break_noise 6.669999999997916 eps 2.003284
threshold eps 4 break_noise 3.589999999997435 eps 4.008467


In [None]:
r_bnoisesss = [ round(elem, 2) + 0.01 for elem in bnoisesss ]
epoch=10
for noise in r_bnoisesss:
    for n in [N]:
        delta= 1/(n*math.log(n))
        print( f"noise {noise} N {n}, delta {delta:10f},  eps {compute_epsilon(noise, epoch, delta):4f}" )
    print("********")

noise 12.59 N 30000, delta   0.000003,  eps 0.999228
********
noise 6.68 N 30000, delta   0.000003,  eps 1.999971
********
noise 3.5999999999999996 N 30000, delta   0.000003,  eps 3.995801
********
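
The grid sweep above can be cross-checked by solving for the noise multiplier directly. The following is a dependency-free sketch under stated assumptions: it reuses the same expression as `delta_Gaussian`, but swaps SciPy's `brentq` for plain bisection and `scipy.stats.norm.cdf` for `math.erfc`. The names `normal_cdf`, `bisect`, `epsilon_for` and `noise_for` are hypothetical.

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def delta_gaussian(eps, mu):
    """delta of a Gaussian mechanism with shift mu (same expression as delta_Gaussian)."""
    if mu == 0:
        return 0.0
    return normal_cdf(-eps / mu + mu / 2) - math.exp(eps) * normal_cdf(-eps / mu - mu / 2)

def bisect(f, lo, hi, iters=200):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def epsilon_for(noise_multiplier, num_steps, delta):
    """epsilon after num_steps compositions, as in compute_epsilon."""
    mu = math.sqrt(num_steps) / noise_multiplier
    return bisect(lambda eps: delta_gaussian(eps, mu) - delta, 0.0, 500.0)

def noise_for(target_eps, num_steps, delta, lo=1.0, hi=20.0):
    """Noise multiplier in [lo, hi] at which epsilon equals target_eps."""
    return bisect(lambda s: epsilon_for(s, num_steps, delta) - target_eps, lo, hi, iters=60)

delta = 1.0 / (30000 * math.log(30000))
for target in (1, 2, 4):
    sigma = noise_for(target, num_steps=10, delta=delta)
    print(f"target eps {target}: noise {sigma:.4f}")
```

Bisection relies on epsilon decreasing as the noise multiplier grows, which is the same property the downward sweep exploits. Any value slightly above the returned root keeps epsilon under the target.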


# main.py
Definition of the main algorithm, adapted from [main.py](https://github.com/AI-secure/aug-pe/blob/main/main.py)

In [None]:
import os
import numpy as np

os.environ["TOKENIZERS_PARALLELISM"] = "false"

def main():

    #Load private data
    all_private_samples, all_private_labels, private_labels_counter, private_labels_indexer = load_data(
        dataset=args.dataset,
        data_file=args.train_data_file,
        num_samples=args.num_private_samples,
        subsample_one_class=args.subsample_one_class)
    print(private_labels_counter)

    private_classes = list(private_labels_counter.keys())
    print(f'Private_num_classes: {len(private_classes)}',
          f'Private_num_samples: {len(all_private_samples)}',
          f'Private_num_labels:{len(all_private_labels)}')


    # Extract the embeddings of the private data
    print('###### Extracting features of private data ######')
    all_private_features = extract_features(
            data=all_private_samples,
            batch_size=args.feature_extractor_batch_size,
            model_name=args.feature_extractor,
        )


    #Generating initial synthetic samples
    print()
    print()
    print('###### Generating initial samples ######')
    print()
    private_lens_dict = None
    num_seed_samples = int(args.num_samples_schedule/args.init_combine_divide_L)
    seed_syn_samples, seed_additional_info, sync_labels_counter, all_prefix_prompts = api.text_random_sampling(
        num_samples=num_seed_samples,
        prompt_counter=private_labels_counter,
        lens_dict=private_lens_dict)
    start_t = 1


    if args.compute_fid:
        synthetic_features = extract_features(
            data=seed_syn_samples,
            batch_size=args.feature_extractor_batch_size,
            model_name=args.feature_extractor,
            )
        compute_fid(synthetic_features, all_private_features, args.feature_extractor,
                    folder=args.result_folder,  step=start_t-1, log_online=args.log_online)

    syn_samples, additional_info = seed_syn_samples, seed_additional_info

    print(f'initial samples size {len(syn_samples)} label {len(additional_info)}')
    for key, value in sync_labels_counter.items():
        if value > 0:
            print(f'initial samples label counter {key}: {value}')

    for t in range(start_t, args.epochs):
        print()
        print()
        print(f'### t={t} ###')
        print()

        if args.lookahead_degree == 0:
            packed_samples = np.expand_dims(syn_samples, axis=1)
        else:
            print('Running text variation')
            packed_samples, variation_lables, all_target_words, all_gen_words, all_masked_prompts = api.text_variation(  # shape [# num_sample, # variations]
                sequences=syn_samples,
                additional_info=additional_info,
                num_variations_per_sequence=args.lookahead_degree,
                variation_degree=args.variation_degree_schedule)
            if args.lookahead_self:
                packed_samples = np.concatenate((packed_samples,  np.expand_dims(
                    syn_samples, axis=1)), axis=1)  # add the original samples to the variations

        packed_features = []
        print('Running feature extraction')

        # iterate over # lookahead_degree variations.
        for i in range(packed_samples.shape[1]):
            sub_packed_features = extract_features(
                data=packed_samples[:, i],
                batch_size=args.feature_extractor_batch_size,
                model_name=args.feature_extractor,

            )
            packed_features.append(sub_packed_features)

        # take the averaged embedding for each sequence..
        packed_features = np.mean(packed_features, axis=0)
        print(f'feature extraction shape {packed_features.shape}')
        print('###### Computing histogram ######')
        count = []
        current_idx = 0
        # for next iteration
        new_syn_samples = []
        new_additional_info = []

        # for current iteration saving
        all_selected_samples = []
        all_selected_additional_info = []

        for class_i, class_ in enumerate(private_classes):
            # key must have the same order as  private_classes (from private_labels_counter)
            num_samples_per_class = sync_labels_counter[class_]
            if num_samples_per_class == 0:
                continue
            # get the count for each synthetic data
            public_features = packed_features[current_idx:
                                              num_samples_per_class+current_idx]
            # logging.info(
            #     f'{class_}, {num_samples_per_class} , features shape {public_features.shape}')
            print(f'{class_}, {num_samples_per_class} , features shape {public_features.shape}')
            assert num_samples_per_class == public_features.shape[0]

            selected_size = int(num_samples_per_class/args.combine_divide_L)
            # logging.info(f'selected_size  {selected_size}')
            print(f'selected_size  {selected_size}')
            if selected_size == 0:
                sub_count = []
                sub_new_indices = list(
                    range(current_idx, num_samples_per_class+current_idx))
                selected_syn_samples = [syn_samples[i]
                                        for i in sub_new_indices]
                selected_additional_info = [
                    additional_info[i] for i in sub_new_indices]
                new_variants_samples = selected_syn_samples*args.combine_divide_L
                new_variants_additional_info = selected_additional_info * args.combine_divide_L
            else:
                # DP nearest-neighbor histogram
                sub_count, sub_clean_count = dp_nn_histogram(
                    public_features=public_features,
                    private_features=all_private_features[private_labels_indexer[class_]],
                    noise_multiplier=args.noise_multiplier,
                    num_nearest_neighbor=args.num_nearest_neighbor,
                    mode=args.nn_mode,
                    threshold=args.count_threshold)
                assert np.sum(sub_count) > 0
                # Generating new indices of synthetic data
                if args.select_syn_mode == 'prob':
                    candidate_indices = np.arange(
                        current_idx, num_samples_per_class + current_idx, dtype=int)
                    sampling_prob = (sub_count) / np.sum(sub_count)
                    top_1_ind = np.argpartition(sampling_prob, -1)[-1:]
                    sub_new_indices = np.random.choice(
                        candidate_indices,
                        size=selected_size,
                        p=sampling_prob)
                    print((f'sub_new_indices size  {len(sub_new_indices)}'))

                elif args.select_syn_mode == 'rank':
                    sort_index = [
                        i+current_idx for i, x in sorted(enumerate(sub_count), key=lambda x: -x[1])]
                    sub_new_indices = sort_index[:selected_size]  # top votes
                else:
                    raise ValueError(
                        f'unsupported select_syn_mode {args.select_syn_mode}')

                count_fname = class_.replace("\t", "_").replace(
                    " ", "_").replace("&", "").replace(":", "")

                # Generate new synthetic data
                selected_syn_samples = [syn_samples[i] for i in sub_new_indices]
                selected_additional_info = [additional_info[i] for i in sub_new_indices]
                print(f'selected_syn_samples shape {len(selected_syn_samples)} label {len(selected_additional_info)}')
                assert len(selected_syn_samples) == len(selected_additional_info)

                new_variants_samples = []
                if args.combine_divide_L == 1:
                    _num_variations_per_sequence = 1  # just do one variation
                elif args.combine_divide_L > 1:
                    if args.donnot_keep_last_iter:
                        _num_variations_per_sequence = args.combine_divide_L
                    else:
                        _num_variations_per_sequence = args.combine_divide_L - 1
                        new_variants_samples.extend(selected_syn_samples)
                else:
                    raise ValueError('combine_divide_L should be >= 1')

                print(f'_num_variations_per_sequence  {_num_variations_per_sequence}')
                new_variants_samples_stacked, _, _, _, _ = api.text_variation(
                    sequences=selected_syn_samples,  # seed samples
                    additional_info=selected_additional_info,
                    num_variations_per_sequence=_num_variations_per_sequence,  # just do one variation
                    variation_degree=args.variation_degree_schedule
                )

                for x in new_variants_samples_stacked:
                    new_variants_samples.extend(x.tolist())
                new_variants_additional_info = selected_additional_info * args.combine_divide_L
                print(f'new_variants_samples shape {len(new_variants_samples)} label {len(new_variants_additional_info)}')
                new_syn_samples.extend(new_variants_samples)
                new_additional_info.extend(new_variants_additional_info)
                sync_labels_counter[class_] = len(
                    new_variants_samples)  # update class size

            if args.save_syn_mode == 'selected':
                all_selected_samples.extend(selected_syn_samples)
                all_selected_additional_info.extend(selected_additional_info)
            elif args.save_syn_mode == 'one_var':
                all_selected_samples.extend(new_variants_samples_stacked[:, 0])
                all_selected_additional_info.extend(selected_additional_info)
            elif args.save_syn_mode == 'all':
                all_selected_samples.extend(
                    new_variants_samples)  # all ---  L times size
                all_selected_additional_info.extend(
                    new_variants_additional_info)

            current_idx += public_features.shape[0]

        syn_samples = new_syn_samples
        additional_info = new_additional_info

        if args.compute_fid:
            synthetic_features = extract_features(
                data=all_selected_samples,
                batch_size=args.feature_extractor_batch_size,
                model_name=args.feature_extractor,

            )
            compute_fid(synthetic_features, all_private_features, args.feature_extractor,
                        folder=args.result_folder,  step=t, log_online=args.log_online)

    if args.log_online:
        wandb.finish()

    return syn_samples, additional_info
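
One iteration of the loop above boils down to three steps: private samples vote for their nearest synthetic sample, the votes are perturbed, and the top-voted samples are kept and expanded with variations. The sketch below illustrates this on toy 2-D points. It is a simplified stand-in, not the notebook's `dp_nn_histogram` (which is defined elsewhere): it assumes L2 distance, one nearest neighbor, Gaussian noise with standard deviation `noise_multiplier`, and a hypothetical string suffix in place of `api.text_variation`.

```python
import math
import random

def dp_nn_histogram_sketch(synthetic, private, noise_multiplier, threshold=0.0, rng=None):
    """Each private point votes for its nearest synthetic point; noise is added to the votes."""
    rng = rng or random.Random(0)
    clean = [0] * len(synthetic)
    for p in private:
        nearest = min(range(len(synthetic)), key=lambda i: math.dist(p, synthetic[i]))
        clean[nearest] += 1
    noisy = [c + rng.gauss(0.0, noise_multiplier) for c in clean]
    # Subtract the threshold and clip at zero so weak (mostly noise) votes vanish.
    noisy = [max(c - threshold, 0.0) for c in noisy]
    return noisy, clean

def select_rank(counts, selected_size):
    """Indices of the top-voted samples (ties keep the lower index first)."""
    order = sorted(range(len(counts)), key=lambda i: -counts[i])
    return order[:selected_size]

def select_prob(counts, selected_size, rng):
    """Sample indices with replacement, proportionally to the vote counts."""
    return rng.choices(range(len(counts)), weights=counts, k=selected_size)

# Toy data (hypothetical embeddings and texts).
synthetic = [(0, 0), (5, 5), (10, 10), (20, 20)]
texts = ["s0", "s1", "s2", "s3"]
private = [(0.2, 0.1), (0.1, -0.3), (4.8, 5.1), (9.9, 10.2), (0.0, 0.4)]
combine_divide_L = 2

# noise_multiplier=0.0 makes the example deterministic; it gives no privacy.
noisy, clean = dp_nn_histogram_sketch(synthetic, private, noise_multiplier=0.0)
selected = select_rank(noisy, len(synthetic) // combine_divide_L)

# Keep the selected samples and add L - 1 variations of each (here: a marker suffix).
kept = [texts[i] for i in selected]
next_round = kept + [t + "*" for t in kept for _ in range(combine_divide_L - 1)]
print(clean, selected, next_round)
```

Selecting `N / L` samples and expanding each to `L` keeps the population size constant across iterations, which is why `sync_labels_counter` can be reused as the class size in the next round.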

# args
Definition of the `Argumentos` class, with a description of each field.

In [None]:
from dataclasses import dataclass, field, asdict
from typing import Any, Callable, List, Optional, Union, Dict, Sequence
@dataclass
class Argumentos:
  # Define arguments with default values and metadata for documentation
  train_data_file: object = field(default=None, metadata={
      "help": "Path to the training data file"
      })
  dataset: str = field(default=None, metadata={
        "help": "Name of the dataset to use"
        })

  num_private_samples: int = field(default=-1, metadata={
        "help": "Number of private samples to load"
        })
  result_folder: object = field(default=None, metadata={
        "help": "Folder to store the results"
        })

  feature_extractor_batch_size: int = field(default=1024, metadata={
        "help": "Batch size for the feature extractor"
        })

  feature_extractor: str = field(default='sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2', metadata={
        'choices' : ["sentence-t5-xl", "sentence-t5-large",  "sentence-t5-base",
                      "all-MiniLM-L6-v2", "paraphrase-MiniLM-L6-v2", "all-mpnet-base-v2", "stsb-roberta-base-v2",
                      "roberta-large-nli-stsb-mean-tokens", "distilbert-base-nli-stsb-mean-tokens", 'text-embedding-ada-002'],
        'help' : 'Sentence similarity model used as the feature extractor'
        })

  noise_multiplier: float = field(default=0, metadata={
        "help": 'Noise multiplier for DP NN histogram'
        })

  lookahead_degree: int = field(default=0, metadata={
        "help": 'Lookahead degree: number of variations per sample used when computing distances between private and generated samples'
        })

  combine_divide_L: int = field(default=1, metadata={
        "help": 'Combination setting used in a specific part of the code'
        })
  init_combine_divide_L: int = field(default=1, metadata={
        "help": 'Initial combination setting'
        })

  num_nearest_neighbor: int = field(default=1, metadata={
        "help": 'Number of nearest neighbors to find in DP NN histogram'
        })
  nn_mode: str = field(default='L2', metadata={
        "help": 'Which distance metric to use in DP NN histogram'
        })
  count_threshold: float = field(default=0.0, metadata={
        "help": 'Threshold for DP NN histogram'
        })
  compute_fid: bool = field(default=True, metadata={
        "help": 'Whether to compute FID'
        })

  num_samples_schedule: int = field(default= 10, metadata={
        "help": 'Number of samples to generate at each iteration'
        })
  variation_degree_schedule: float = field(default = 0.0, metadata={
        "help": 'Variation degree at each iteration'
        })

  epochs: int = field(default  = 1, metadata={
        "help": 'Number of training epochs'
        })
  select_syn_mode: str = field(default = 'rank', metadata={
        'choices':['prob', 'rank'],
        'help':'sample synthetic data from the histogram by top ranking or by probability'
        })
  save_syn_mode: str = field(default = 'selected', metadata={
        'choices':['selected', 'all', 'one_var'],
        'help':'save all or selected syn samples'
        })
  data_checkpoint_path: str = field(default = '', metadata={
        'help': 'Path to save data checkpoints'
        })
  lookahead_self: bool = field(default = None, metadata={
        'help': 'Whether to use the lookahead_self mode'
        })
  subsample_one_class: bool = field(default = None, metadata={
        'help': 'Whether to subsample a single class'
        })
  log_online: bool = field(default = None, metadata={
        'help': 'Whether to log results online'
        })
  train_data_embeddings_file: str = field(default = '', metadata={
        'help': 'File path for training data embeddings'
        })
  data_checkpoint_step: int = field(default = None, metadata={
        'help': 'Step interval to save data checkpoints'
        })
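
`dataclasses.field` stores `metadata` but never enforces it, so the `choices` lists in the class above are documentation only. Below is a minimal sketch of how they could be read and checked. `DemoArgs`, `validate_choices` and `describe` are hypothetical names introduced for illustration; they are not part of the notebook or of aug-pe.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DemoArgs:
    # Hypothetical two-field stand-in for Argumentos.
    select_syn_mode: str = field(default="rank", metadata={
        "choices": ["prob", "rank"],
        "help": "How synthetic samples are drawn from the histogram",
    })
    epochs: int = field(default=1, metadata={
        "help": "Number of iterations",
    })


def validate_choices(obj):
    """Raise ValueError if a field holds a value outside metadata['choices']."""
    for f in fields(obj):
        choices = f.metadata.get("choices")
        value = getattr(obj, f.name)
        if choices is not None and value not in choices:
            raise ValueError(f"{f.name}={value!r} is not one of {choices}")
    return obj


def describe(cls):
    """Map every field name to its help text (empty string if none)."""
    return {f.name: f.metadata.get("help", "") for f in fields(cls)}


demo = validate_choices(DemoArgs(select_syn_mode="prob"))
print(describe(DemoArgs))
```

Applied to `Argumentos` as written, such a check would reject the default `feature_extractor`, because that model name does not appear in its own `choices` list.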


# arg2b
We choose several hyperparameters and the models to use.

In [None]:
# If you choose an HF model that requires logging in, uncomment the lines below
# from huggingface_hub import notebook_login
# notebook_login()


In [None]:
# Choose the arguments

args = Argumentos(
    train_data_file = data_file,
    dataset = 'archivo del tword',
    # Choose the Sentence Transformer model
    feature_extractor = "hiiamsid/sentence_similarity_spanish_es",
                      # 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2',
    nn_mode = 'L2',
    count_threshold = 0.0,
    compute_fid = True,
    select_syn_mode = 'rank',
    save_syn_mode = 'selected')


# Parameters for HFAPI, i.e. for sampling
api = HFAPI(
    # Choose the text-generation model
    model_type = "clibrain/Llama-2-7b-ft-instruct-es",
    # Other choices:
    #   "meta-llama/Llama-3.2-1B-Instruct"        (Llama 1B)
    #   "unsloth/Llama-3.2-1B-Instruct-bnb-4bit"  (Llama 1B, quantized)
    #   "mistralai/Mistral-Nemo-Instruct-2407"
    #   "bigscience/bloom-560m"
    #   "meta-llama/Llama-3.1-8B-Instruct"
    #   "NousResearch/Hermes-2-Pro-Llama-3-8B"

    use_subcategory = False,
    variation_type = 'rephrase',
    mlm_probability = 0.5,
    output_dir = None,
    repetition_penalty = 1.0,         # primarily useful for the CTRL model; in that case, use 1.2
    length = 448,
    temperature = 1.0,
    top_k = 50,
    top_p = 0.9,
    num_beams = 5,
    do_sample = True,                 # sample during generation
    seed = 42,
    dry_run = None,
    random_sampling_batch_size = 64,  # batch size for the random sampling API
    variation_batch_size = 64,        # batch size for the variation API
    fp16 = True,                      # 16-bit (mixed) precision instead of 32-bit
    no_cuda = None
)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
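
The deprecation warning above asks for a `BitsAndBytesConfig` object instead of the bare `load_in_4bit` / `load_in_8bit` flags. Below is a minimal sketch of that replacement, assuming the model is loaded through `AutoModelForCausalLM.from_pretrained`. `load_quantized_model` is a hypothetical helper, not a method of `HFAPI`, and 4-bit loading with a float16 compute dtype is an assumed choice.

```python
def load_quantized_model(model_name):
    """Load a causal LM in 4-bit precision via an explicit quantization config.

    Hypothetical helper; the imports are local so that defining the function
    does not require torch or transformers to be installed.
    """
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Assumed settings: 4-bit weights, float16 for the actual computation.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    )
    return AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="auto",  # let accelerate place the layers on the available GPU
    )
```

Calling `load_quantized_model("clibrain/Llama-2-7b-ft-instruct-es")` would download the checkpoint and needs a CUDA GPU with `bitsandbytes` installed.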





# Application

In [None]:
print(N)

35497


We choose the remaining hyperparameters and run.

In [None]:
import random
num_seed_samples = 5
k = 0  # number of variations
L = k + 1
init_L = L
num_samples = L * num_seed_samples
args.data_checkpoint_path = ''

args.train_data_file = data_file

args.num_private_samples = N
# args.num_private_samples = 25000  # unsure about this value
args.noise_multiplier = 0           # set this to the noise computed earlier
args.lookahead_degree = k           # 0 for now (ln 175)
args.num_nearest_neighbor = 1       # not sure about this value

args.num_samples_schedule = num_samples
args.variation_degree_schedule = 0.5
args.combine_divide_L = L
args.init_combine_divide_L = init_L

args.epochs = 2
args.select_syn_mode = 'rank'
args.save_syn_mode = 'all'

args.feature_extractor_batch_size = 32



api.use_subcategory = True          # to be revisited later (ln 151)
api.variation_type  = 'rephrase'
api.mlm_probability = 0.5
api.repetition_penalty = 1.0        # primarily useful for the CTRL model; in that case, use 1.2
api.length = 250
api.temperature = 1.0
api.top_k = 50
api.top_p = 0.9
api.num_beams = 2                   # ln 227
api.do_sample = True                # sample during generation
api.fp16 = True
api.random_sampling_batch_size = 24
api.variation_batch_size = 24
# api.device = 'cpu'
api.device = 'cuda'
# api.model.to(api.device)
syn_samples, additional_info = main()

data_file Dataset({
    features: ['text'],
    num_rows: 35497
})
sample_size: 35497
[21815   630 24020 ... 11570 30831 29324]
Counter({'dinosaurio': 35497})
Private_num_classes: 1 Private_num_samples: 35497 Private_num_labels:35497
Extracting features of private data


100%|██████████| 1110/1110 [08:11<00:00,  2.26it/s]




###### Generating initial samples ######



100%|██████████| 1/1 [00:00<00:00, 1856.71it/s]


should -- simulated generated sequences: %d 5


num_seq_to_generate= 5

100%|██████████| 1/1 [01:57<00:00, 117.86s/it]


Total generated sequences: %d 5


100%|██████████| 1/1 [00:00<00:00, 11.90it/s]


Computing FID and F1 for syn shape (5, 768)
fid=255.6264917608072 F1={'precision': 0, 'recall': 0, 'f1': 0}
initial samples size 5 label 5
initial samples label counter dinosaurio: 5


###### t=1 ######

Running feature extraction


100%|██████████| 1/1 [00:00<00:00, 12.37it/s]


feature extraction shape (5, 768)
Computing histogram
dinosaurio, 5 , features shape (5, 768)
selected_size  5
public_features shape : (5, 768)
private_features shape : (35497, 768)
Number of samples in index: 5
Finished search
Clean count: [1.7988e+04 9.0000e+00 6.9290e+03 9.0000e+00 1.0562e+04]
Clean count sum: 35497.0
Clean count num>0: 5
Largest clean counters: [17988.0, 10562.0, 6929.0, 9.0, 9.0]
Noisy count sum: 35497.0
Noisy count num>0: 5
Largest noisy counters: [17988.0, 10562.0, 6929.0, 9.0, 9.0]
Clipped noisy count sum: 35497.0
Clipped noisy count num>0: 5
Clipped largest noisy counters: [17988.0, 10562.0, 6929.0, 9.0, 9.0]
selected_syn_samples shape 5 label 5
_num_variations_per_sequence  1


100%|██████████| 1/1 [01:49<00:00, 109.88s/it]


new_variants_samples shape 5 label 5


100%|██████████| 1/1 [00:00<00:00, 11.07it/s]


Computing FID and F1 for syn shape (5, 768)
fid=219.4884349439393 F1={'precision': 0, 'recall': 0, 'f1': 0}
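
The "Computing histogram" part of the log above reports clean, noisy and clipped counts. Below is a minimal pure-Python sketch of such a nearest-neighbor vote: every private embedding votes for its closest synthetic embedding, Gaussian noise is added, and the counts are thresholded. It assumes L2 distance and one nearest neighbor, as in this run. `dp_nn_histogram` and `squared_l2` are hypothetical names, and the noise scale and clipping rule are assumptions, not necessarily what the adapted aug-pe code does.

```python
import random


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dp_nn_histogram(private_features, synthetic_features,
                    noise_multiplier=0.0, count_threshold=0.0, seed=0):
    """Return (clean, noisy, clipped) vote counts, one per synthetic sample."""
    rng = random.Random(seed)
    clean = [0.0] * len(synthetic_features)
    for p in private_features:
        # Each private sample casts one vote for its nearest synthetic sample.
        nearest = min(range(len(synthetic_features)),
                      key=lambda j: squared_l2(p, synthetic_features[j]))
        clean[nearest] += 1.0
    # Assumed noise model: independent Gaussian noise with std = noise_multiplier.
    noisy = [c + rng.gauss(0.0, noise_multiplier) for c in clean]
    # Assumed clipping: subtract the threshold and floor at zero.
    clipped = [max(c - count_threshold, 0.0) for c in noisy]
    return clean, noisy, clipped


private = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1], [1.0, 0.9]]
synthetic = [[0.0, 0.0], [1.0, 1.0]]
clean, noisy, clipped = dp_nn_histogram(private, synthetic)
print(clean, noisy, clipped)
```

With `noise_multiplier = 0` and `count_threshold = 0.0`, the settings used in this run, the three lists coincide. That matches the log, where the clean, noisy and clipped counters are identical and sum to the number of private samples.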


Generated examples

In [None]:
syn_samples

['por], ha recibido tu solicitud y me gustaría que pongas énfasis en tu respuesta. Por favor, podrías responder a esta solicitud con detalle. Saludos Aunque no hay absolutamente ninguna evidencia de que tu usuario sea un troll, estoy preocupado por las posibles consecuencias si se usa información personal, ya sea que se le omitió algún dato importante o se utiliza información incorrecta. Por favor, proporciona a continuación tu nombre y dirección de correo electrónico para verificar que sea factible proporcionar a este usuario un servicio de seguridad de la salud mental y que haya acceso a los servicios correspondientes. Por favor, también aclara si el usuario ya ha solicitado servicios de salud mental anteriormente y, si es así, por favor, proporciona la ubicación y el teléfono de contacto para verificar que tenga acceso a su información de salud y que hayan sido proporcionados los',
 'accés físico a su equipo computacional o acceso a un correo electrónico en el servidor. ### Respuest