Now that we have the unlabeled data, we need to label it. Labeling 9,000 instances by hand would take far too long, so instead we pass them to an LLM such as GPT-3.5 and ask it for the labels. We treat its answers as ground truth. This is not strictly true, but the model is already very good at language tasks, so its labels are probably the best we can get without manual annotation.

Before sending the articles, we should make sure the text is cleanly readable by the LLM.

In [1]:
import pandas as pd
import unicodedata

def clean_text(text):
    if isinstance(text, str):
        # Normalize Unicode characters
        text = unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore').decode('ASCII')
        
        # Remove non-printable characters
        text = ''.join(char for char in text if char.isprintable())
        
    return text

input_file_path = 'articles_filtered.csv'
output_file_path = 'art.csv'

df = pd.read_csv(input_file_path, on_bad_lines='skip')

for column in df.columns:
    if df[column].dtype == 'object':
        df[column] = df[column].apply(clean_text)


df.to_csv(output_file_path, index=False, quoting=1)  # quoting=1 is csv.QUOTE_ALL

print(f"Cleaned data saved to {output_file_path}")
print(f"Number of rows in the cleaned file: {len(df)}")

Cleaned data saved to art.csv
Number of rows in the cleaned file: 6488
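
One side effect of the cleaning step is worth knowing: `str.isprintable()` is false for newlines and tabs, so `clean_text` also strips line breaks, and symbols with no ASCII decomposition (such as the euro sign) are dropped entirely. The sketch below is a hypothetical variant, `clean_text_keep_whitespace`, which is not part of the original notebook, that keeps newlines and tabs while applying the same ASCII folding.

```python
import unicodedata

def clean_text_keep_whitespace(text):
    """Hypothetical variant of clean_text that preserves newlines and tabs."""
    if not isinstance(text, str):
        return text
    # Same ASCII folding as clean_text: accents are stripped, unmappable symbols dropped
    text = unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore').decode('ASCII')
    # Keep printable characters plus the two whitespace characters we care about
    return ''.join(ch for ch in text if ch.isprintable() or ch in '\n\t')

sample = "Caf\u00e9 deal\nclosed for \u20ac5M"
print(repr(clean_text_keep_whitespace(sample)))  # 'Cafe deal\nclosed for 5M'
```

Note that the euro sign disappears in both versions, which matters here because deal prices are one of the fields we want to extract.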


We should also drop articles whose content is too long to pass to the LLM in a single request.

In [3]:
df = pd.read_csv('art.csv')

max_content_length = 5000
df['content_length'] = df['content'].str.len()

df_filtered = df[df['content_length'] <= max_content_length]

df_filtered = df_filtered.drop(columns=['content_length'])

# Save the filtered DataFrame back to a CSV file
df_filtered.to_csv('filtered_art.csv', index=False)


In [1]:
!pip install openai

Collecting openai
  Downloading openai-1.35.13-py3-none-any.whl.metadata (21 kB)
Collecting distro<2,>=1.7.0 (from openai)
  Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Downloading openai-1.35.13-py3-none-any.whl (328 kB)
Downloading distro-1.9.0-py3-none-any.whl (20 kB)
Installing collected packages: distro, openai
Successfully installed distro-1.9.0 openai-1.35.13


Calling the API one article at a time is too slow: each call returns after about 10 seconds, and with roughly 6,000 instances that adds up to about 17 hours. We therefore need to make multiple calls at once. First, we check that the function works:

In [12]:
import asyncio
import aiohttp
import ssl
import certifi
import nest_asyncio

nest_asyncio.apply()

MODEL = "gpt-3.5-turbo"
OPENAI_SECRET_KEY = "mykey"  

async def call_chatgpt_async(session, prompt: str):
    payload = {
        'model': MODEL,
        'messages': [
            {"role": "user", "content": prompt}
        ]
    }
    try:
        async with session.post(
            url='https://api.openai.com/v1/chat/completions',
            headers={"Content-Type": "application/json", "Authorization": f"Bearer {OPENAI_SECRET_KEY}"},
            json=payload,
            ssl=ssl.create_default_context(cafile=certifi.where())
        ) as response:
            data = await response.json()
        if "error" in data:
            # Error responses carry no 'choices' key, so return early
            print(f"OpenAI request failed with error {data['error']}")
            return None
        return data['choices'][0]['message']['content']
    except Exception as exc:
        # Return None so the caller can detect the failed prompt
        print(f"Request failed: {exc}")
        return None

async def call_chatgpt_bulk(prompts):
    async with aiohttp.ClientSession() as session:
        tasks = [call_chatgpt_async(session, prompt) for prompt in prompts]
        return await asyncio.gather(*tasks)

if __name__ == "__main__":
    prompts = ["what is 2+2 answer short", "what is 4/2 answer short", "what is 2*3 answer short"]
    results = asyncio.run(call_chatgpt_bulk(prompts))
    print(results)


['4', '2', '6']
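
The bulk helper above launches every request at the same moment. If that proves too aggressive, one common option is to cap the number of in-flight requests with `asyncio.Semaphore`. The following is a minimal sketch, not the notebook's method: `gather_with_limit` and `fake_request` are hypothetical names, and `fake_request` only sleeps instead of calling the API.

```python
import asyncio

async def gather_with_limit(coros, max_concurrent=10):
    """Run coroutines concurrently, but never more than max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(coro):
        async with semaphore:
            return await coro

    # asyncio.gather returns results in the same order as its inputs
    return await asyncio.gather(*(guarded(c) for c in coros))

async def fake_request(i):
    # Stand-in for call_chatgpt_async (assumption): sleeps instead of hitting the API
    await asyncio.sleep(0.01)
    return i * 2

results = asyncio.run(gather_with_limit([fake_request(i) for i in range(5)], max_concurrent=2))
print(results)  # [0, 2, 4, 6, 8]
```

Inside a notebook, `asyncio.run` still needs `nest_asyncio.apply()` as in the cell above.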


We see that it works. Next, I wrote code that calls the LLM and asks it to answer in this format:
-acquirer
-acquired
-price
Then I save the answer to a CSV file. I also wrote a function that creates the training JSON file in spaCy format, since I expected that to make building the model easier.

There is also a resume mechanism: progress is saved to disk, so if the code stops and is rerun, it skips the articles that were already processed.

In [3]:
import csv
import json
import os

nest_asyncio.apply()

MODEL = "gpt-3.5-turbo"
OPENAI_SECRET_KEY = "mykey"  
CSV_FILE = 'filtered_art.csv'  
NUM_INSTANCES = 80

def process_article(title, content):
    return f"{title}\n\n{content}\n\n------YOUR MISSION------\ngive the info of : \n-acquirer \n-acquired \n-price \nwrite not specified if not specified, do not write anything else, I will be using your output to label my data"

def extract_entities(output, full_text):
    lines = output.strip().split('\n')
    entity_dict = {}

    for line in lines:
        if ':' in line:
            entity_type, value = line.split(':', 1)
            entity_type = entity_type.strip().upper()
            entity_type = entity_type.lstrip('-').strip()
            value = value.strip()
            if value.lower() != 'not specified':
                entity_dict[entity_type] = value

    if not entity_dict:
        return

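    for entity_type, value in entity_dict.items():
        if value == '':
            # Skip empty values but keep checking the remaining entity types
            continue
        start = 0
        # Yield every occurrence of the value in the article as a labeled span
        while True:
            start = full_text.find(value, start)
            if start == -1:
                break
            end = start + len(value)
            yield (start, end, entity_type)
            start = end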

def create_spacy_format(text, entities):
    return (text, {'entities': entities})

def load_json(filename):
    if os.path.exists(filename):
        with open(filename, 'r') as jsonfile:
            return json.load(jsonfile)
    return []

def save_json(filename, data):
    with open(filename, 'w') as jsonfile:
        json.dump(data, jsonfile, indent=2)

def get_prompts_from_csv(csv_file, num_instances, processed_titles):
    prompts = []
    articles = []
    with open(csv_file, 'r', encoding='utf-8') as file:
        reader = csv.DictReader(file)
        rows = list(reader)
        
        for row in rows:
            if len(prompts) >= num_instances:
                break
            
            title = row['title']
            if title not in processed_titles:
                content = row['content']
                prompt = process_article(title, content)
                prompts.append(prompt)
                articles.append((title, content))
    
    return prompts, articles

async def main():
    output_file = 'ner_training_data.json'
    processed_file = 'processed_articles.json'
    csv_output_file = 'processed_articles.csv'

    spacy_data = load_json(output_file)
    processed_articles = load_json(processed_file)

    ner_processed_titles = set(item[0].split('\n')[0] for item in spacy_data)
    all_processed_titles = set(item['title'] for item in processed_articles)

    prompts, articles = get_prompts_from_csv(CSV_FILE, NUM_INSTANCES, all_processed_titles)
    results = await call_chatgpt_bulk(prompts)

    with open(csv_output_file, 'a', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['title', 'content', 'llm_output']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        
        if csvfile.tell() == 0:
            writer.writeheader()

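        for (title, content), llm_output in zip(articles, results):
            if llm_output is None:
                # The request failed; skip it so the article is retried on the next run
                continue
            print(f"Processing article: {title}")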

            writer.writerow({
                'title': title,
                'content': content,
                'llm_output': llm_output
            })

            full_text = f"{title}\n\n{content}"
            entities = list(extract_entities(llm_output, full_text))

            if entities and title not in ner_processed_titles:
                spacy_format = create_spacy_format(full_text, entities)
                spacy_data.append(spacy_format)
                ner_processed_titles.add(title)

            processed_articles.append({'title': title, 'has_entities': bool(entities)})
            all_processed_titles.add(title)

        save_json(output_file, spacy_data)
        save_json(processed_file, processed_articles)

    print(f"Data processing complete. NER training data saved to '{output_file}'.")
    print(f"Processed articles with LLM output saved to '{csv_output_file}'.")
    print(f"List of all processed articles saved to '{processed_file}'.")

if __name__ == "__main__":
    asyncio.run(main())

Processing article: 4 Charts Show Startup M&A Deal-Making Is Not Moving In The Direction We Expected
Processing article: White & Case advises F-Secure on acquisition of mobile consumer security business from Lookout
Processing article: Mergers & Acquisitions 2023 Deals of the Year: Jacobs Acquired Streetlight from Macquarie and Activate
Processing article: CVS closes $10.6B acquisition of Oak Street Health to expand primary care footprint
Processing article: Mergers & Acquisitions 2023 Deals of the Year: Choice Hotels Acquired Radisson Hotel Group Americas
Processing article: April M&A Roundup: Activity Ticks Up Again After March Dip
Processing article: Google Acquisition History: What Are the Biggest Companies Google Owns?
Processing article: WilmerHale Advises Lookout in Acquisition of its Consumer Mobile Security Business Segment
Processing article: Will Microsoft-Activision Deal Go Through? For Merger Market, It May Not Matter
Processing article: Why Startups May Soon Be Buying Mor
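
Because `extract_entities` searches for each value independently, two labels can end up covering overlapping text (for example, when the acquirer's name is a prefix of the acquired business's name). spaCy's NER expects non-overlapping entity spans, so such examples may need filtering before training. The sketch below shows one possible policy, keeping the longest span; `drop_overlapping` is a hypothetical helper, not part of the notebook.

```python
def drop_overlapping(entities):
    """Keep the longest spans first and discard any span that overlaps a kept one."""
    kept = []
    # Sort by length (longest first), then by start offset for determinism
    for start, end, label in sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0])):
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    # Return the surviving spans in document order
    return sorted(kept)

spans = [(0, 7, "ACQUIRER"), (0, 16, "ACQUIRED"), (30, 35, "PRICE")]
print(drop_overlapping(spans))  # [(0, 16, 'ACQUIRED'), (30, 35, 'PRICE')]
```

Preferring the longest span is only one choice; dropping the whole article when labels conflict would be a stricter alternative.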

I experimented with the number of instances sent in one asynchronous batch to GPT. With 100 per batch I hit the rate limit, which allows 60,000 tokens per minute; 80 turned out to be the sweet spot. When an error did occur, the JSON files were not updated, so I wrote the following code to add the articles that had been processed during the failed session.
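
Instead of finding the batch size by trial and error, the batches could be sized by an estimated token count. The sketch below is an illustration under stated assumptions: `batch_by_token_budget` is a hypothetical helper, and the figure of about 4 characters per token is only a rough rule of thumb for English text, not an exact tokenizer count. The real limit also counts the tokens in the responses, so the budget should be set below 60,000.

```python
def batch_by_token_budget(prompts, max_tokens_per_batch=60_000, chars_per_token=4):
    """Greedily group prompts so each batch's estimated token count stays within budget."""
    batches, current, used = [], [], 0
    for prompt in prompts:
        # Rough estimate only (assumption); a real tokenizer would give exact counts
        estimate = len(prompt) // chars_per_token + 1
        if current and used + estimate > max_tokens_per_batch:
            batches.append(current)
            current, used = [], 0
        current.append(prompt)
        used += estimate
    if current:
        batches.append(current)
    return batches

demo = ["a" * 400, "b" * 400, "c" * 400]  # about 101 estimated tokens each
print([len(b) for b in batch_by_token_budget(demo, max_tokens_per_batch=250)])  # [2, 1]
```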

In [4]:
def process_missing_articles():
    ner_data = load_json('ner_training_data.json')
    processed_articles = load_json('processed_articles.json')

    ner_titles = set(item[0].split('\n')[0] for item in ner_data)
    processed_titles = set(item['title'] for item in processed_articles)

    with open('processed_articles.csv', 'r', newline='', encoding='utf-8', errors='ignore') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            title = row['title']
            content = row['content']
            llm_output = row['llm_output']

            # Check if article is missing from both JSON files
            if title not in ner_titles and title not in processed_titles:
                # Add to ner_training_data.json
                full_text = f"{title}\n\n{content}"
                # Materialize the generator: a bare generator is always truthy and cannot be saved as JSON
                entities = list(extract_entities(llm_output, full_text))
                if entities:
                    ner_data.append(create_spacy_format(full_text, entities))

                # Add to processed_articles.json
                processed_articles.append({
                    'title': title,
                    'has_entities': bool(entities)
                })

                print(f"Added missing article: {title}")

    save_json('ner_training_data.json', ner_data)
    save_json('processed_articles.json', processed_articles)


if __name__ == "__main__":
    process_missing_articles()
    print("Processing complete. Both JSON files have been updated with missing articles.")

Processing complete. Both JSON files have been updated with missing articles.


Now I run these two functions in a loop until all the data is processed, pausing for one minute between iterations to stay under the rate limits.

In [5]:
import time
while True:
    asyncio.run(main())
    process_missing_articles()
    print("Processing complete. Both JSON files have been updated with missing articles.")
    time.sleep(60)

Processing article: Rulmeca Acquires Conveying Supplier Douglas Manufacturing
Processing article: Low-carbon M&A sets a record level of deal activity in 2022 due to increased focus on net-zero goals
Processing article: Olive Garden Parent Buys Ruths Chris, Plus More Bold Deals
Processing article: Refuel Acquires 8 C-Stores in 2 Separate Deals
Processing article: DuPont to Acquire Spectrum Plastics for $1.75B
Processing article: GMS Acquires Home Lumber & Building Supplies
Processing article: White & Case advises KAP on sale of flexible films subsection
Processing article: U.S. private equity firm Advent International and BCI complete acquisition of Maxar Technologies
Processing article: Where Zimmer Biomet Stands On Future Mergers and Acquisitions
Processing article: Baxter breaks off biopharma solutions segment in $4.25B private equity deal
Processing article: Chinese FDI in Europe: 2022 Update
Processing article: Increase in mergers and acquisitions in first part of 2023
Processing a
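
The `while True` loop above has to be interrupted by hand once the data runs out. A loop that stops on its own could look like the sketch below. `run_until_done` is a hypothetical helper: `step` stands for one round of `main()` plus `process_missing_articles()`, and `count_remaining` for a function that reports how many articles are still unprocessed; neither wrapper exists in the notebook.

```python
import time

def run_until_done(step, count_remaining, pause_seconds=60, max_rounds=1000):
    """Call step() repeatedly until count_remaining() reaches zero or max_rounds is hit."""
    rounds = 0
    while count_remaining() > 0 and rounds < max_rounds:
        step()
        rounds += 1
        if count_remaining() > 0:
            time.sleep(pause_seconds)  # wait out the per-minute rate limit
    return rounds

# Toy usage: a queue of 5 items processed 2 at a time, with no pause
queue = list(range(5))

def fake_step():
    del queue[:2]

print(run_until_done(fake_step, lambda: len(queue), pause_seconds=0))  # 3
```

The `max_rounds` cap guards against looping forever when the same articles keep failing.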

In [2]:
json_file = 'processed_articles.json'

with open(json_file, 'r', encoding='utf-8') as f:
    data = json.load(f)

num_articles = len(data)
print(f"Number of articles processed: {num_articles}")

Number of articles processed: 4759


In [3]:
json_file = 'ner_training_data.json'

with open(json_file, 'r', encoding='utf-8') as f:
    data = json.load(f)

num_articles = len(data)
print(f"Number of articles processed: {num_articles}")

Number of articles processed: 3976


In [5]:
!pip install spacy tqdm scikit-learn
!python -m spacy download en_core_web_sm 

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)

In [10]:
from spacy.tokens import DocBin

# Load the DocBin objects
train_db = DocBin().from_disk("./train.spacy")
test_db = DocBin().from_disk("./test.spacy")

# Check the number of documents (instances) in each DocBin
num_train_instances = len(train_db)
num_test_instances = len(test_db)

print(f"Number of instances in train_db: {num_train_instances}")
print(f"Number of instances in test_db: {num_test_instances}")


Number of instances in train_db: 3578
Number of instances in test_db: 398
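
The notebook does not show how `train.spacy` and `test.spacy` were produced. The counts above (3578 + 398 = 3976) are consistent with a 90/10 split of the NER training data, so the following is a minimal sketch of such a split under that assumption. `split_examples`, the seed, and the shuffling are illustrative choices, not the author's confirmed method, and converting each half to a `DocBin` is left out.

```python
import random

def split_examples(examples, train_fraction=0.9, seed=42):
    """Shuffle the examples reproducibly and split them into train and test parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in data: 3976 integers instead of the real (text, annotations) pairs
train, test = split_examples(range(3976))
print(len(train), len(test))  # 3578 398
```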
