# Data Curation Comparison

Compares the filtered datasets produced by `seed_gathering` when `llama3-8B` is used as the judge versus `starcoder-v2`.

* [StarCoder Dataset](https://huggingface.co/datasets/bigcode/python-stack-v1-functions-filtered-sc2)
* [Llama 3 8B Dataset](https://huggingface.co/datasets/muellerzr/filtered-data)

## Imports and Constants

In [1]:
from datasets import load_dataset
from pathlib import Path

In [4]:
data_mount = Path("/mnt/data/datasets")
sc2_path = data_mount/"python-stack-v1-functions-filtered-sc2/"
llama3_path = data_mount/"python-stack-v1-functions-filtered-llama-3-8B/"

## Data Analysis

Note: each dataset has only one split, `train`.

In [12]:
sc2_dataset = load_dataset("parquet", data_dir=sc2_path)["train"]
llama3_dataset = load_dataset("parquet", data_dir=llama3_path)["train"]

In [13]:
len(sc2_dataset), len(llama3_dataset)

(248934, 224363)

In [14]:
sc2_dataset

Dataset({
    features: ['content', 'sha1', 'id'],
    num_rows: 248934
})

In [15]:
llama3_dataset

Dataset({
    features: ['content', 'sha1', 'id'],
    num_rows: 224363
})

First let's look at the % overlap between the two:

In [16]:
value = llama3_dataset[0]; value

{'content': 'def count_char(char, word):\n    """Counts the characters in word"""\n    return word.count(char)\n    # If you want to do it manually try a for loop',
 'sha1': '363222f4876c5a574a84fe14214760c505e920b0',
 'id': 0}

In [18]:
value in llama3_dataset, value in sc2_dataset

(True, True)

In [27]:
# Use a set so each membership check in the filter is O(1) on average
ground_truth = set(sc2_dataset["content"])

In [32]:
matching_values = llama3_dataset.filter(
    lambda example: example["content"] in ground_truth,
    keep_in_memory=True,
    num_proc=8,
)

In [37]:
len(matching_values), len(llama3_dataset), len(sc2_dataset)

(160874, 224363, 248934)
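
Membership tests like the one in the filter above are cheapest against a `set`. The following is a minimal sketch on toy data rather than the real datasets; `reference_contents` and `candidate_contents` are hypothetical stand-ins for the two `content` columns.

```python
# Toy stand-ins for the "content" columns of the two datasets (hypothetical data)
reference_contents = ["def a(): pass", "def b(): pass", "def c(): pass"]
candidate_contents = ["def b(): pass", "def c(): pass", "def d(): pass"]

# Build the lookup set once; each `in` check is then O(1) on average
reference_set = set(reference_contents)

# Split the candidate rows into those already in the reference and those that are new
matching = [c for c in candidate_contents if c in reference_set]
new = [c for c in candidate_contents if c not in reference_set]

print(len(matching), len(new))
```

Because every candidate row lands in exactly one of the two lists, their lengths always sum to the number of candidate rows.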

In [39]:
len(matching_values) / len(sc2_dataset)

0.6462516168944379

In [40]:
len(matching_values) / len(llama3_dataset)

0.7170255345132664

About 64.6% of `sc2_dataset` also appears in `llama3_dataset`, and 28.3% of `llama3_dataset` consists of *new* values that `sc2_dataset` does not contain.
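
As a quick sanity check, the shares can be recomputed from the three counts printed above. This sketch only uses those counts; the variable names are illustrative.

```python
# Counts taken from the cell outputs above
n_sc2 = 248_934       # rows in sc2_dataset
n_llama3 = 224_363    # rows in llama3_dataset
n_matching = 160_874  # llama3 rows whose content also appears in sc2

# Share of each dataset covered by the overlap
share_of_sc2 = n_matching / n_sc2
share_of_llama3 = n_matching / n_llama3

# Rows in llama3 that do not appear in sc2
n_new = n_llama3 - n_matching
share_new = n_new / n_llama3

print(f"overlap as share of sc2:    {share_of_sc2:.1%}")
print(f"overlap as share of llama3: {share_of_llama3:.1%}")
print(f"new rows in llama3:         {n_new} ({share_new:.1%})")
```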

How good is their quality?

In [36]:
new_values = llama3_dataset.filter(
    lambda example: example["content"] not in ground_truth,
    keep_in_memory=True,
    num_proc=8,
)

In [47]:
for i in range(5):
    print(new_values[i]["content"])
    print("-"*25)

def check_context(model, sentence, company_name):
    """
    Check if the company name in the sentence is actually a company name.

    :param model: the spacy model.
    :param sentence: the sentence to be analysed.
    :param company_name: the name of the company.
    :return: True if the company name means a company/product.
    """

    doc = model(sentence)
    for t in doc.ents:
        if t.lower_ == company_name: #if company name is called
            if t.label_ == "ORG" or t.label_ == "PRODUCT": #check they actually mean the company
                return True
    return False
-------------------------
def valid_template(template):
    """Is this a template that returns a valid URL?"""
    if template.name.lower() == "google books" and (
        template.has("plainurl") or template.has("plain-url")
    ):
        return True
    if template.name.lower() == "billboardurlbyname":
        return True
    return False
-------------------------
import math


def tanD(angle):
    

So far, at least, these look quite good.

In [48]:
# Use a set so each membership check in the filter is O(1) on average
llama_ground_truth = set(llama3_dataset["content"])

In [49]:
missing_values = sc2_dataset.filter(
    lambda example: example["content"] not in llama_ground_truth,
    keep_in_memory=True,
    num_proc=8,
)

In [51]:
len(missing_values)

88060
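
This count can be cross-checked against the earlier overlap figure. The sketch below assumes that the number of shared rows is the same when counted from either side, which would hold if `content` values are not duplicated within the overlap; the variable names are illustrative.

```python
# Counts taken from the cell outputs above
n_sc2 = 248_934       # rows in sc2_dataset
n_matching = 160_874  # llama3 rows whose content also appears in sc2
n_missing = 88_060    # sc2 rows whose content does not appear in llama3

# sc2 rows that do appear in llama3, implied by the missing count
implied_shared_in_sc2 = n_sc2 - n_missing

# True when both directions of the comparison agree on the overlap size
counts_agree = implied_shared_in_sc2 == n_matching
missing_share = n_missing / n_sc2

print(counts_agree)
print(f"{missing_share:.1%} of sc2 rows were not kept by the Llama 3 judge")
```

The two directions agree here, which is consistent with the overlap containing no repeated `content` values.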

In [52]:
for i in range(5):
    print(missing_values[i]["content"])
    print("-"*25)

def getItemSize(dataType):
    """
    Gets the size of an object depending on its data type name

    Args:
        dataType (String): Data type of the object

    Returns:
        (Integer): Size of the object
    """
    # If it's a vector 6, its size is 6
    if dataType.startswith("VECTOR6"):
        return 6
    # If it,s a vector 3, its size is 6
    elif dataType.startswith("VECTOR3"):
        return 3
    # Else its size is only 1
    return 1
-------------------------
import torch


def get_optimizer(lr):
    """
    Specify an optimizer and its parameters.

    Returns
    -------
    tuple(torch.optim.Optimizer, dict)
        The optimizer class and the dictionary of kwargs that should
        be passed in to the optimizer constructor.

    """
    return (torch.optim.SGD,
            {"lr": lr, "weight_decay": 1e-6, "momentum": 0.9})
-------------------------
import re


def extract_digits_from_end_of_string(input_string):
    """
    Gets digits at the end of a string
   

These *seem* okay, so I'll continue to the next stages and see what happens at the end.

## Post filtering

In [1]:
from datasets import load_dataset


In [2]:
filtered_dset = load_dataset("json", data_files="filtered.jsonl")


In [3]:
len(filtered_dset["train"])

113524