This notebook generates short passages that contain person names in context. The passages will be used to train a named-entity recognition (NER) model to identify names.

In [1]:
import pandas as pd
import os
import string
import random
import json


from huggingface_hub import InferenceClient
from dotenv import load_dotenv
from datasets import Dataset

load_dotenv("../../.env")

client = InferenceClient(token=os.environ["HF_TOKEN"])

In [2]:
settings = [
    "family event",
    "wedding",
    "reunion",
    "birthday",
    "holiday",
    "workplace scenario",
    "office meeting",
    "conference",
    "team event",
    "educational setting",
    "classroom",
    "lecture",
    "graduation",
    "social gathering",
    "party",
    "community event",
    "club meeting",
    "travel scenario",
    "tour",
    "cruise",
    "group trip"
]

age_groups = [
    "children",
    "teenagers",
    "adults",
    "elders",
    "young adults",
    "middle-aged adults",
]

professions = [
    "teacher",
    "doctor",
    "engineer",
    "artist",
]

relationships = [
    "familial",
    "romantic",
    "professional",
    "sibling",
    "cousin",
    "grandparent",
    "colleague",
    "boss",
    "employee",
    "friend",
    "acquaintance",
    "club member",
    "teammate",
]
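
The lists above are not wired into the prompts used later in this notebook (the `setting` and `age_group` arguments are commented out where the template is formatted). Below is a minimal sketch of how they could be sampled to vary a prompt. It assumes a hypothetical template with `{setting}`, `{age_group}`, and `{names}` placeholders. `example_template` and `build_prompt` are illustrative names, not part of the original pipeline.

```python
import random

# Hypothetical template; the placeholders are assumptions for illustration.
example_template = (
    "Write 1 paragraph about people in the following setting: {setting}.\n"
    "The people are {age_group}.\n"
    "Mention the following names multiple times: {names}.\n"
)


def build_prompt(names, settings, age_groups, rng=random):
    """Fill the template with a randomly chosen setting and age group."""
    return example_template.format(
        setting=rng.choice(settings),
        age_group=rng.choice(age_groups),
        names=", ".join(names),
    )


# Small stand-in lists; in the notebook these would be the full lists above.
demo_prompt = build_prompt(
    ["Sandra", "Achraf", "Dayo"],
    settings=["wedding", "conference"],
    age_groups=["teenagers", "adults"],
    rng=random.Random(0),  # seeded only to make the demo repeatable
)
print(demo_prompt)
```

Passing a seeded `random.Random` instance keeps a run reproducible. The default falls back to the module-level `random` functions.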

In [3]:
prompt = """
Write 1 paragraph about 3-5 people at a family event. 
There are children and parents. 
Mention the following names multiple times: Sandra, Achraf, Dayo, and Maria. 

Do not include a preamble.

Paragraph:

""".lstrip()


try:
    r = client.post(
        json={
            "inputs": prompt,
            "parameters": {
                "max_new_tokens": 300,
                "top_k": 50,
                "temperature": 1.0,
                "return_full_text": False,
                "stop": ["\n\n"]
            },
            "options": {"use_cache": False},
        },
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    )

    text = json.loads(r.decode())[0]["generated_text"]

except Exception as e:
    print(e)
    text = "<|Error|>"

In [4]:
text

"At the family barbecue, children Achraf and Dayo played around the bouncing castle under the watchful eye of their parents, Sandra and Maria. A friendly game of tug-of-war between the relatives broke out, and Sandra and Achraf teamed up against Maria and Dayo, resulting in much laughter and cheering from everyone around them. The delicious aroma of Maria's famous grilled kebabs filled the air, and soon enough, everyone gathered around the picnic tables to share a meal and reminisce about old times. Despite the occasional sibling rivalry, the love and camaraderie among Sandra, Achraf, Dayo, and Maria were palpable, making the family gathering a memorable one."

In [5]:
with open("./mixtral_first_names.json") as f:
    first_names = json.load(f)
with open("./mixtral_last_names.json") as f:
    last_names = json.load(f)

In [6]:
import random

random.sample(first_names, 5)

['Wing-Hong', 'Geta', 'Orianne', 'Sharofat', 'Dcamot']

In [7]:
template = """
Write 1 paragraph about people in a random setting and occasion. 
There are people of all ages.
Mention the following person names multiple times: {names}.
Mention a city name and a name of a company or organization.
Do not make the person names part of the city, company, or organization names.

Do not include a preamble.

Paragraph:

""".lstrip()

def call_api(example):

    try:
        r = client.post(
            json={
                "inputs": template.format(
                    # setting=example["setting"],
                    # age_group=example["age_group"],
                    names=", ".join(example["names"]),
                
                ),
                "parameters": {
                    "max_new_tokens": 300,
                    "top_k": 50,
                    "temperature": 1.0,
                    "return_full_text": False,
                    "stop": ["\n\n"]
                },
                "options": {"use_cache": False},
            },
            model="mistralai/Mixtral-8x7B-Instruct-v0.1",
        )

        text = json.loads(r.decode())[0]["generated_text"]

    except Exception as e:
        print(e)
        text = "<|Error|>"

    return {
        "text": text
    }

In [8]:
from datasets import Dataset

def create_name():
    if random.random() < 0.25:
        return random.choice(first_names) + " " + random.choice(last_names)
    return random.choice(first_names)


k = 100


for i in range(146, 200):

    ds = Dataset.from_dict({
        "names": [[random.choice(first_names) for _ in range(random.choice([3, 5]))] for _ in range(k)]
    })


    temp = ds.map(call_api, num_proc=8)
    temp.to_parquet(f"name_paragraphs_{i}.pq")

Map (num_proc=8):   0%|          | 0/100 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Map (num_proc=8):   0%|          | 0/100 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Map (num_proc=8):   0%|          | 0/100 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Map (num_proc=8):   0%|          | 0/100 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Map (num_proc=8):   0%|          | 0/100 [00:00<?, ? examples/s]

TimeoutError: 
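
The run above ended with a `TimeoutError` partway through the loop. One common way to make a long-running batch more robust is to retry a failing call with exponential backoff. Below is a minimal sketch. `with_retries` and its parameters are hypothetical and not part of the original notebook. The demo uses a stand-in function rather than a real API call.

```python
import time


def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception, wait and retry, up to `attempts` calls."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # delay doubles each time


# Demo with a stand-in function that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


delays = []  # record requested delays instead of actually sleeping
result = with_retries(flaky, attempts=3, base_delay=1.0, sleep=delays.append)
```

In the notebook, a wrapper like this could be placed around a batch step such as `ds.map(call_api, num_proc=8)`, so that one transient failure does not end the whole loop.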

In [31]:
import asyncio
import aiohttp
from huggingface_hub import AsyncInferenceClient

from piidd.data_generation.utils import random_string


client = AsyncInferenceClient(token=os.environ["HF_TOKEN"])

async def fetch(names, session, semaphore):
    async with semaphore:
        try:
            r = await client.post(
                json={
                    "inputs": template.format(
                        # setting=example["setting"],
                        # age_group=example["age_group"],
                        names=", ".join(names),
                    
                    ),
                    "parameters": {
                        "max_new_tokens": 300,
                        "top_k": 50,
                        "temperature": 1.0,
                        "return_full_text": False,
                        "stop": ["\n\n"]
                    },
                    "options": {"use_cache": False},
                },
                model="mistralai/Mixtral-8x7B-Instruct-v0.1",
            )

            t = json.loads(r.decode())[0]["generated_text"]

            id_ = random_string(10)

            with open(f"outputs/name_paragraphs/{id_}.json", "w") as f:
                json.dump({
                    "names": names,
                    "text": t
                }, f)
        except Exception as e:
            print(e)

async def main(names):
    tasks = []
    semaphore = asyncio.Semaphore(20)  # Limit to 20 concurrent requests
    async with aiohttp.ClientSession() as session:
        for n in names:
            task = asyncio.create_task(fetch(n, session, semaphore))
            tasks.append(task)
        results = await asyncio.gather(*tasks)
    return results

# Lists of names to include in each prompt
names = [[random.choice(first_names) for _ in range(random.choice([3, 5]))] for _ in range(5000)]

# Running the main function without asyncio.run() in environments with existing event loop
loop = asyncio.get_event_loop()

# Check if the loop is running
if loop.is_running():
    results = await main(names)
else:
    results = loop.run_until_complete(main(names))

# fetch() writes each response to disk and returns None, so results holds only None values

502, message='Bad Gateway', url=URL('https://api-inference.huggingface.co/models/mistralai/Mixtral-8x7B-Instruct-v0.1')


In [15]:
results

[b'[{"generated_text":"At the New Orleans jazz club, Gerhard sat at the piano while Sebastien-Yvon and Emilien prepared the saxophones. The club teemed with people of all ages, who were excited to celebrate Saturday night with the New Orleans Symphony jazz band."}]',
 b'[{"generated_text":"In the bustling City of Atlanta, Eveless is working as a volunteer for the non-profit organization, Hands Across the World. She is helping to distribute canned food to the underprivileged citizens of the city. As a result, the place is also filled with people like Alina, who came to collect food for her family of four, and Gilbert Ben, who has come to help Eveless with the distribution process. Everyone is working together to ensure that no one in the city goes to bed hungry tonight."}]',
 b'[{"generated_text":"Mario stands in the midst of the bustling crowd at the annual technology fair in the city of San Jose, sponsored by the leading tech giant, Silicon Valley Innovations. He, along with Teburoro 

In [93]:
for x in temp:
    print(x)

{'names': ['In-Seong', 'Hady', 'Fortuna'], 'text': 'In this bustling metropolis of New York City, you can find all sorts of people gathering together for the annual charity event held by the Hope Foundation. Families of varied backgrounds stroll through the various booth stands, enjoying homemade baked goods and toys carefully crafted by the children in the neighborhood. In-Seong and Hady were seen volunteering behind the cupcake display, happily handing out one after the other to eager participants. Meanwhile, Fortuna was over near the bean bag toss, handing out scores and encouraging the young ones to try their best. A festive atmosphere filled the courtyard outside the charitable organization’s headquarters, as laughter, friendly conversation, and the scent of sweets swirled through the air. This special moment demonstrated the essence of unity, as diversity and common good became intertwined together.'}
{'names': ['Moinoul', 'Aliyah', 'Lontum', 'Arnfinn', 'Dontae'], 'text': 'Moinou

In [79]:
for x in temp:
    if not all([y in x["text"] for y in x["names"]]):
        print(x)

{'names': ['Curlan', 'Mendoza', 'Folito', 'Danicy', 'Mmeamle'], 'text': "As people gathered in the grand hall of Davenport Convention Center, various age groups could be seen in animated conversations, anticipating the announcement of the Civic Youth Organization's national winners for the year. From young Folito and Danicy, who were nervously checking their phones, to middle-aged Curlan, who was enthusiastically discussing strategies with his team from the city of Arlington, and elderly Mmeamle, who couldn't stop smiling while recalling memories of her own youth, the energy in the room was infectious. Everyone was eager to find out who would take home the prestigious awards, and the sense of community and camaraderie was palpable."}
{'names': ["J'Adale", 'Remei', 'Orianna'], 'text': 'In the busy streets of New York City on a crisp November day, J’Adale, Remei, and Orianna were all wrapped up warm in thick winter coats, each of them heading to work at the tech company “Virtuosos Incorp
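
In the second record above, the name list contains `J'Adale` with a straight apostrophe, while the generated text uses a typographic one (`J’Adale`), so a plain substring check fails. Below is a minimal sketch of normalizing apostrophes before comparing. `normalize_quotes` and `names_in_text` are hypothetical helper names, not part of the original notebook.

```python
# Map typographic single quotes to the ASCII apostrophe.
_APOSTROPHES = str.maketrans({"\u2019": "'", "\u2018": "'"})


def normalize_quotes(s):
    """Replace curly single quotes with a straight apostrophe."""
    return s.translate(_APOSTROPHES)


def names_in_text(names, text):
    """True if every name appears in the text after normalization."""
    norm_text = normalize_quotes(text)
    return all(normalize_quotes(n) in norm_text for n in names)


example = {
    "names": ["J'Adale", "Remei", "Orianna"],
    "text": "J\u2019Adale, Remei, and Orianna were all wrapped up warm.",
}

plain_match = all(n in example["text"] for n in example["names"])
normalized_match = names_in_text(example["names"], example["text"])
```

With this input, the plain check misses `J'Adale` and the normalized check finds all three names.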

In [32]:
from pathlib import Path
from datasets import Dataset, concatenate_datasets


files = list(Path(".").glob("name_paragraphs_*.pq"))

ds = concatenate_datasets([Dataset.from_parquet(str(f)) for f in files])
ds

Dataset({
    features: ['names', 'text'],
    num_rows: 27500
})

In [20]:
import re


t = "An was a good (An)'s person. So was An"

re.findall(r"\b" + "An" + r"\b", t)

['An', 'An', 'An']
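
The test above suggests matching names on word boundaries rather than as raw substrings, so that a short name such as `An` is not matched inside a longer word. Below is a minimal sketch. `contains_name` is a hypothetical helper. `re.escape` is added here (it is not used in the cell above) so that punctuation inside a name is treated literally.

```python
import re


def contains_name(name, text):
    """True if `name` occurs in `text` as a whole word."""
    pattern = r"\b" + re.escape(name) + r"\b"
    return re.search(pattern, text) is not None


inside_word = contains_name("An", "Andrea and Anna arrived early.")
whole_word = contains_name("An", "An was a good (An)'s person. So was An")
with_apostrophe = contains_name("J'Adale", "J'Adale's coat was warm.")
```

Note that `\b` only marks a boundary next to a word character, so this approach assumes names begin and end with a letter or digit.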

In [12]:
ds.shuffle()[:10]

{'names': [['Jenna', 'Nizigama', 'Bolot', 'Sinkoun', 'Senay'],
  ['Dipanwita', 'Karotu Butuka', 'Nikeisha', "M'Kayla", 'Perrine-Marie'],
  ['Bahaar', 'Raimo', 'Nangor'],
  ['Savier', 'Nawali', 'Rooney'],
  ['Ncabezuluhlaza', 'Nauru', 'Ponchi', 'Trevaughn', 'Harquell'],
  ["S'Fiso", 'Fation', 'Fidencia', 'Rois', 'Estrellita'],
  ['Nang', 'Ngebuked', 'Ruzan'],
  ['Kemdeng', 'Ouistin', 'Ludmila', 'Yue', 'Shawnequa'],
  ['Charmaine', 'Tadanobu', 'Kubady'],
  ['Mohamed-Djibril', 'Wieslaw', 'Xenofonte']],
 'text': ["Families come running into the city's most beloved park, the vibrant New Haven City Center Park. Children squeal with delight as they race towards the swings, their laughter echoing through the air, reaching the ears of the adults. Nizigama's family has brought a portable grill, sending an appetizing aroma wafting through the park, luring in Jenna and her friends. The group smiles as they approach, graciously accepting the grilled treats, savoring the taste. Meanwhile, the park's

In [33]:
from pathlib import Path
import json

files = list(Path("/drive2/kaggle/pii-dd/piidd/data_generation/outputs/name_paragraphs").glob("*.json"))

names = []
texts = []

for f in files:
    with open(f) as fh:
        d = json.load(fh)
    names.append(d["names"])
    texts.append(d["text"])

len(names)

16098

In [34]:
from datasets import Dataset, concatenate_datasets

ds2 = Dataset.from_dict({
    "names": names,
    "text": texts
})

combined = concatenate_datasets([ds, ds2])

combined = combined.filter(lambda x: all([y in x["text"] for y in x["names"]]))
combined

Filter:   0%|          | 0/43598 [00:00<?, ? examples/s]

Dataset({
    features: ['names', 'text'],
    num_rows: 42477
})

In [35]:
combined.shuffle()[:10]["text"]

["In the vibrant city of Vancouver, Horea and Strahinja arrived at the annual Mick Tech Symposium. As Mick Tech's keynote speaker, Horea was excited to share his expertise on technology and innovation. Meanwhile, Strahinja, as Mick Tech's program coordinator, ensured a successful event. It was the perfect occasion for Horea, Strahinja, and Mick (the founder of Mick Tech) to connect and share their innovative ideas.",
 "In the bustling city of San Francisco, a group of individuals gather in the headquarters of the tech company, Orion Innovations. Among them are employees like Kit-Yan and Nidhalie, who are busy preparing for the company's annual charity event, which is set to take place in the coming weeks. Meanwhile, Agnes-Laure, a long-time volunteer, is coordinating with the rest of the team to ensure everything runs smoothly. Despite the busy atmosphere, there is a sense of excitement and enthusiasm in the air as everyone works together to make a positive impact on the local communit

In [36]:
combined.to_parquet("/drive2/kaggle/pii-dd/data/name_paragraphs-v3.pq")

Creating parquet from Arrow format:   0%|          | 0/43 [00:00<?, ?ba/s]

30160243