## Llama 3.1 8B (or another model) using Groq (or another API service)

We can use a general-purpose LLM through an API service to perform the task. This can be an expensive solution, but it can yield reliable results.

We'll split our data into batches and make one API call per batch.

In [None]:
import re

import pandas as pd
from groq import Groq


def extract_strings_w_keyword(message, keyword):
    """
    Extracts strings between a given keyword and the end of line or a parenthesis.

    Args:
        message (str): The large input message.
        keyword (str): The keyword to search for.

    Returns:
        list: A list of strings found between the keyword and the end of line or parenthesis.
    """
    # Build the regex pattern to match the keyword followed by anything up to '\n', '(' or '###'
    pattern = rf"{re.escape(keyword)}(.*?)(\n|\(|###)"
    
    # Find all matches
    matches = re.findall(pattern, message)
    
    # Extract only the matched strings (group 1)
    return [match[0].strip() for match in matches]


df = pd.read_csv("data/raw/normalization_assesment_dataset_10k.csv.csv")  
raw_df = df["raw_comp_writers_text"]
clean_df = df["CLEAN_TEXT"] 

client = Groq(api_key="")  # supply your Groq API key here
batch_size = 10 

preds_list, raw_list, clean_list = [], [], []
# Run on a sample of 4 batches
for i in range(0, 4*batch_size, batch_size):
    raw_batch = raw_df.iloc[i:i+batch_size]
    clean_batch = clean_df.iloc[i:i+batch_size]
    s="\n".join([f"    RAW TEXT:{ph}" for ph in raw_batch])
    prompt = \
    f"""
    You are a helpful linguistic assistant. We want to perform text normalisation on names of songwriters. We will give you a string which includes names or nicknames
    in different formats, possibly containing unnecessary words that you'll need to clean up, or possibly just the name or nickname.
    Some raw text does not contain any useful information (e.g., unknown, weird initials, standardized words, etc.).
    Do not include any non-Latin (e.g., Cyrillic, Chinese or Arabic) characters in your output.
    If there are multiple names or nicknames, separate them with '/'. Also end your response with '###' so that I can easily find the end.
    You'll need to return the correct names.
    Some examples:
    Example 1:
    RAW TEXT: <Unknown>/Wright, Justyce Kaseem
    Normalized Text: Justyce Kaseem Wright
    Example 2:
    RAW TEXT: Pixouu/Abdou Gambetta/Copyright Control
    Normalized Text: Pixouu/Abdou Gambetta
    Example 3:
    RAW TEXT: Mike Hoyer/JERRY CHESNUT/SONY/ATV MUSIC PUBLISHING (UK) LIMITED
    Normalized Text: JERRY CHESNUT/Mike Hoyer
    Example 4:
    RAW TEXT: 신중현 (Shin Joong Hyun)
    Normalized Text: Shin Joong Hyun
    Example 5:
    RAW TEXT: 신중현
    Normalized Text:
    Example 6:
    RAW TEXT: UNKNOWN 
    Normalized Text: 

    Perform the same for the following cases:
    {s}
    """
    
    chat_completion = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model="llama3-8b-8192",
    )
    output_msg = chat_completion.choices[0].message.content
    res = extract_strings_w_keyword(output_msg, "Normalized Text:")
    print(output_msg,res)
    
    raw_names = raw_batch.tolist()
    clean_names = clean_batch.tolist()

    preds_list.extend(res)
    raw_list.extend(raw_names)
    clean_list.extend(clean_names)



llm_df =  pd.DataFrame({
    "RAW_TEXT": raw_list,
    "CLEAN_TEXT": clean_list,
    "LLM_OUT": preds_list
})


output_file_path = "output_file_llm.csv"  
llm_df.to_csv(output_file_path, index=False)
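
One fragile point in the cell above: `pd.DataFrame` raises a `ValueError` if the three lists differ in length, and that happens whenever the model returns more or fewer `Normalized Text:` lines than the batch contains. The sketch below is one possible guard, not part of the original run. `align_or_blank` is a hypothetical helper name, and blanking the whole batch on a mismatch is an assumption about the desired behaviour.

```python
def align_or_blank(preds, expected_len):
    """Return preds unchanged if the count matches, else a list of None."""
    preds = list(preds)
    if len(preds) == expected_len:
        return preds
    # Count mismatch: we cannot tell which input each prediction belongs to,
    # so mark the whole batch as missing instead of guessing.
    return [None] * expected_len


# Hypothetical example: the model returned 2 answers for a batch of 3.
print(align_or_blank(["Alvin Lee", "Mefi Morales"], 3))
print(align_or_blank(["Alvin Lee"], 1))
```

In the loop, `preds_list.extend(align_or_blank(res, len(raw_batch)))` would take the place of `preds_list.extend(res)`, so a malformed response costs one batch instead of breaking the final DataFrame.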
     

Original output of the LLM run:

In [None]:
 
############################# ORIGINAL OUTPUT #############################

#I'd be happy to help you with text normalization on songwriters' names! Here are the normalized outputs for each case:
#
#RAW TEXT: Jordan Riley/Adam Argyle/Martin Brammer
#Normalized Text: Jordan Riley/Adam Argyle/Martin Brammer ###
#
#RAW TEXT: Martin Hygård
#Normalized Text: Martin Hygård ###
#
#RAW TEXT: Jesse Robinson/Greg Phillips/Kishaun Bailey/Kai Asa Savon Wright
#Normalized Text: Jesse Robinson/Greg Phillips/Kishaun Bailey/Kai Asa Savon Wright ###
#
#RAW TEXT: Mendel Brikman
#Normalized Text: Mendel Brikman ###
#
#RAW TEXT: Alvin Lee
#Normalized Text: Alvin Lee ###
#
#RAW TEXT: Haddag Samir/MusicAlligator
#Normalized Text: Haddag Samir ###
#
#RAW TEXT: Mefi Morales
#Normalized Text: Mefi Morales ###
#
#RAW TEXT: Christopher Franke
#Normalized Text: Christopher Franke ###
#
#RAW TEXT: UNKNOWN WRITER (999990)
#Normalized Text: ###
#
#RAW TEXT: Shashank Katkar
#Normalized Text: Shashank Katkar ###
#
#Let me know if you need any further assistance! ['Jordan Riley/Adam Argyle/Martin Brammer', 'Martin Hygård', 'Jesse Robinson/Greg Phillips/Kishaun Bailey/Kai Asa Savon Wright', 'Mendel Brikman', 'Alvin Lee', 'Haddag Samir', 'Mefi Morales', 'Christopher Franke', '', 'Shashank Katkar']
#I'm ready to help with text normalization for songwriters' names. Here are the processed results:
#
#1. RAW TEXT: L. Chandler/John Hammond
#Normalized Text: L. Chandler/John Hammond ###
#
#2. RAW TEXT: Itsjaygocrazy/Jordan Ancrum
#Normalized Text: Jordan Ancrum ###
#
#3. RAW TEXT: Ferhan C/Aaron Tyler/Blue Stamp Music
#Normalized Text: Aaron Tyler/Ferhan C ###
#
#4. RAW TEXT: Mike Kalambay
#Normalized Text: Mike Kalambay ###
#
#5. RAW TEXT: Rikard Sjöblom
#Normalized Text: Rikard Sjöblom ###
#
#6. RAW TEXT: Junior Francisco
#Normalized Text: Junior Francisco ###
#
#7. RAW TEXT: PHUC TRUONG
#Normalized Text: Phuc Truong ###
#
#8. RAW TEXT: Slatt Zy
#Normalized Text: Slatt Zy ###
#
#9. RAW TEXT: Bằng Giang/Tú Nhi
#Normalized Text: Bằng Giang/Tú Nhi ###
#
#10. RAW TEXT: Paul Hardcastle/Kim Fuller
#Normalized Text: Paul Hardcastle/Kim Fuller ###
#
#Let me know if these meet your requirements! ['L. Chandler/John Hammond', 'Jordan Ancrum', 'Aaron Tyler/Ferhan C', 'Mike Kalambay', 'Rikard Sjöblom', 'Junior Francisco', 'Phuc Truong', 'Slatt Zy', 'Bằng Giang/Tú Nhi', 'Paul Hardcastle/Kim Fuller']
#I'm ready to help with text normalization. Here are the results:
#
#RAW TEXT: Ivan Torrent
#Normalized Text: Ivan Torrent ###
#RAW TEXT: An Stepper
#Normalized Text: An Stepper ###
#RAW TEXT: NS (PERF BY KALI)
#Normalized Text: ### (no useful name found)
#RAW TEXT: José Afonso/Luís de Andrade
#Normalized Text: José Afonso/Luís de Andrade ###
#RAW TEXT: Lavel Jackson & Demarcus Ford
#Normalized Text: Lavel Jackson/Demarcus Ford ###
#RAW TEXT: fo man
#Normalized Text: Fo Man ###
#RAW TEXT: Wilhelm Hellweg/Ludwig van Beethoven
#Normalized Text: Wilhelm Hellweg/Ludwig van Beethoven ###
#RAW TEXT: Traditional
#Normalized Text: ### (no useful name found)
#RAW TEXT: #unknown#
#Normalized Text: ### (no useful name found)
#RAW TEXT: Christian Michelle Felix Felix
#Normalized Text: Christian Felix ### ['Ivan Torrent', 'An Stepper', '', 'José Afonso/Luís de Andrade', 'Lavel Jackson/Demarcus Ford', 'Fo Man', 'Wilhelm Hellweg/Ludwig van Beethoven', '', '', 'Christian Felix']
#I'm ready to help! Here are the results:
#
#RAW TEXT: Eric Andersen
#Normalized Text: Eric Andersen ###
#
#RAW TEXT: Efrem Jamaar Blackwell
#Normalized Text: Efrem Jamaar Blackwell ###
#
#RAW TEXT: ZAIKS/Mariusz Duda/District 6 Music Publishing Ltd
#Normalized Text: Mariusz Duda ###
#
#RAW TEXT: Ludwig van Beethoven
#Normalized Text: Ludwig van Beethoven ###
#
#RAW TEXT: Nguyễn Nhất Huy
#Normalized Text: Nguyen Nhat Huy ### (assuming Vietnamese names written without spaces)
#
#RAW TEXT: Hayden Buck Jones
#Normalized Text: Hayden Buck Jones ###
#
#RAW TEXT: Alexey Abrosimov
#Normalized Text: Alexey Abrosimov ###
#
#RAW TEXT: Bhai Gurbachan Singh Ji Sri Ganganagar Wale
#Normalized Text: Bhai Gurbachan Singh Ji ### (as there's no nickname or recognized nickname)
#
#RAW TEXT: Rundown Spaz,Rundown Choppaboy
#Normalized Text: Rundown Spaz/Rundown Choppaboy ###
#
#RAW TEXT: Michael Hansen/James Lewis/Felix Rutherford/Robbie Jay
#Normalized Text: Michael Hansen/James Lewis/Felix Rutherford/Robbie Jay ### ['Eric Andersen', 'Efrem Jamaar Blackwell', 'Mariusz Duda', 'Ludwig van Beethoven', 'Nguyen Nhat Huy', 'Hayden Buck Jones', 'Alexey Abrosimov', 'Bhai Gurbachan Singh Ji', 'Rundown Spaz/Rundown Choppaboy', 'Michael Hansen/James Lewis/Felix Rutherford/Robbie Jay']
#
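
Several responses above append a note after `###`, or contain nothing but `###`. The small check below shows how the extraction treats those lines. It repeats `extract_strings_w_keyword` from the earlier cell in compact form so that it runs on its own. The `sample` string is assembled from lines copied out of the recorded output and is only an illustration.

```python
import re


def extract_strings_w_keyword(message, keyword):
    # Same pattern as in the earlier cell: capture lazily up to '\n', '(' or '###'.
    pattern = rf"{re.escape(keyword)}(.*?)(\n|\(|###)"
    return [match[0].strip() for match in re.findall(pattern, message)]


sample = (
    "RAW TEXT: An Stepper\n"
    "Normalized Text: An Stepper ###\n"
    "RAW TEXT: NS (PERF BY KALI)\n"
    "Normalized Text: ### (no useful name found)\n"
    "RAW TEXT: Nguyễn Nhất Huy\n"
    "Normalized Text: Nguyen Nhat Huy ### (assuming Vietnamese names written without spaces)\n"
)

result = extract_strings_w_keyword(sample, "Normalized Text:")
print(result)  # ['An Stepper', '', 'Nguyen Nhat Huy']
```

Because the capture stops at the first `###`, the trailing notes are dropped and an answer consisting only of `###` becomes an empty string, which matches the empty entries in the printed lists above.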