<a href="https://colab.research.google.com/github/potalestor/tt-course-gen-al/blob/main/Task_1_2_Extracting_information_with_LLMs_(student).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 2. Extracting information with LLMs

At the practice session we were usually happy if we got something coherent. However, in real applications we often need to obtain concrete answers. Let's explore how to do it with LLMs.

In [None]:
!pip install -q openai

In [None]:
import openai

openai.api_key = open(".openai-api-key").read().strip()

Let's imagine that you work for a marketing agency, and you need to gather analytics about recent events dedicated to AI and Machine Learning. For that, you need to process press releases and extract:
- Event name,
- Event date,
- Number of participants,
- Number of speakers,
- Attendance price.

Of course, you can do it manually, but it's much more fun to use Generative AI! So, your task will be to write a function that does this with only one request to the OpenAI API.

Below there is an example of a press release (generated by ChatGPT, of course, so that both the event and the personae are fictional). All of them are in the press_releases.zip archive in the hometask week 1 folder.

<blockquote>
<p>PRESS RELEASE

InnovAI Summit 2023: A Glimpse into the Future of Artificial Intelligence</p>

City of Virtue, Cyberspace - November 8, 2023 - The most anticipated event of the year, InnovAI Summit 2023, successfully concluded last weekend, on November 5, 2023. Held in the state-of-the-art VirtuTech Arena, the summit saw a massive turnout of over 3,500 participants, from brilliant AI enthusiasts and researchers to pioneers in the field.

Esteemed speakers took to the stage to shed light on the latest breakthroughs, practical implementations, and ethical considerations in AI. Dr. Evelyn Quantum, renowned for her groundbreaking work on Quantum Machine Learning, emphasized the importance of this merger and how it's revolutionizing computing as we know it. Another keynote came from Prof. Leo Nexus, whose current project 'AI for Sustainability' highlights the symbiotic relationship between nature and machine, aiming to use AI in restoring our planet's ecosystems.

This year's panel discussion, moderated by the talented Dr. Ada Neura, featured lively debates on the limits of AI in creative arts. Renowned digital artist, Felix Vortex, showcased how he uses generative adversarial networks to create surreal art pieces, while bestselling author, Iris Loom, explained her experiments with AI-assisted story crafting.

Among other highlights were hands-on workshops, interactive Q&A sessions, and an 'AI & Ethics' debate which was particularly well-received, emphasizing the need for transparency and fairness in AI models. An exclusive 'Start-up Alley' allowed budding entrepreneurs to showcase their innovations, gaining attention from global venture capitalists and media.

The event wrapped up with an announcement for InnovAI Summit 2024, set to be even grander. Participants left with a renewed enthusiasm for the vast possibilities that the AI and ML world promises.

For media inquiries, please contact:
Jane Cipher
Director of Communications, InnovAI Summit
Email: jane.cipher@innovai.org
Phone: +123-4567-8910
</blockquote>

More specifically, you should write a function

```python
parse_press_release(pr: str) -> dict
```

where the output should be in the format

```python
{
  'name': 'InnovAI Summit 2023',
  'date': '05.11.2023',
  'n_participants': 3500,
  'n_speakers': 4,
  'price': None
}
```

If any of the five characteristics is not mentioned in the text, put `None` in the respective field.
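
One way to guarantee that shape, whatever the model returns, is a small post-processing step. The sketch below is an illustration rather than part of the task statement: `EXPECTED_KEYS` and `normalize_result` are hypothetical names. It keeps only the expected keys and fills any missing one with `None`.

```python
EXPECTED_KEYS = ("name", "date", "n_participants", "n_speakers", "price")

def normalize_result(raw: dict) -> dict:
    """Return a dict with exactly the expected keys; missing ones become None."""
    return {key: raw.get(key) for key in EXPECTED_KEYS}

# A model answer that lacks some keys and contains an extra one.
example = normalize_result(
    {"name": "InnovAI Summit 2023", "n_speakers": 4, "venue": "VirtuTech Arena"}
)
print(example)
```

Extra keys such as `venue` are dropped, so every row ends up with the same schema.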

At the end, calculate the share of correct answers and analyse what kind of mistakes your model makes the most.

**Hints and suggestions:**
- It's gonna be more convenient to experiment in the OpenAI chat interface https://chat.openai.com/. Plus, this doesn't cost you money for API requests.
- You need to be very accurate with what you want from the model.
- It will help if you specify in the prompt that the output should be in JSON format, this way you will spend less time parsing the output. But be careful. Though **gpt-4o-mini** is easily prompted to output a JSON, please check the output format. It may contain excessive formatting, for example:
<pre><code>```json
{"name": "InnovAI Summit 2023", ...}
```</code></pre>
Actually, examining LLM outputs and their format is a must when working with them.

- Please be careful with the details. For example, Jane Cipher in the text above is not a speaker and shouldn't be counted as such (how do you get rid of a contact person?). Also pay attention to the date format.
- If the model is too wilful with the output format, don't hesitate to show some examples. Decreasing the temperature of predictions can help reduce the creativity of the answer, which is what we want for such a task.
- Debugging an LLM-powered application may become a tough business. When you think that you've polished it, an LLM can still surprise you. So, we don't expect 100% accuracy in this task, but we expect that you do your best to achieve high quality results.
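
The fenced-output problem from the hints can be handled with a small helper that accepts both a bare JSON string and one wrapped in a Markdown code fence. This is a minimal sketch; `extract_json`, `FENCE` and `FENCED_BLOCK` are hypothetical names, and the sample strings are shortened versions of the example above.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built programmatically
# Optional "json" language tag, then the payload, then the closing fence.
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)

def extract_json(text: str) -> dict:
    """Parse model output as JSON, tolerating a surrounding code fence."""
    text = text.strip()
    match = FENCED_BLOCK.fullmatch(text)
    if match:
        text = match.group(1)  # keep only what is inside the fence
    return json.loads(text)

fenced = FENCE + 'json\n{"name": "InnovAI Summit 2023", "n_speakers": 4}\n' + FENCE
plain = '{"name": "InnovAI Summit 2023", "n_speakers": 4}'
print(extract_json(fenced) == extract_json(plain))
```

Output that is neither valid JSON nor a fenced JSON block still raises `json.JSONDecodeError`, so the caller should keep a fallback for that case.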

In [None]:
press_release = """PRESS RELEASE

InnovAI Summit 2023: A Glimpse into the Future of Artificial Intelligence

City of Virtue, Cyberspace - November 8, 2023 - The most anticipated event of the year, InnovAI Summit 2023, successfully concluded last weekend, on November 5, 2023. Held in the state-of-the-art VirtuTech Arena, the summit saw a massive turnout of over 3,500 participants, from brilliant AI enthusiasts and researchers to pioneers in the field.

Esteemed speakers took to the stage to shed light on the latest breakthroughs, practical implementations, and ethical considerations in AI. Dr. Evelyn Quantum, renowned for her groundbreaking work on Quantum Machine Learning, emphasized the importance of this merger and how it's revolutionizing computing as we know it. Another keynote came from Prof. Leo Nexus, whose current project 'AI for Sustainability' highlights the symbiotic relationship between nature and machine, aiming to use AI in restoring our planet's ecosystems.

This year's panel discussion, moderated by the talented Dr. Ada Neura, featured lively debates on the limits of AI in creative arts. Renowned digital artist, Felix Vortex, showcased how he uses generative adversarial networks to create surreal art pieces, while bestselling author, Iris Loom, explained her experiments with AI-assisted story crafting.

Among other highlights were hands-on workshops, interactive Q&A sessions, and an 'AI & Ethics' debate which was particularly well-received, emphasizing the need for transparency and fairness in AI models. An exclusive 'Start-up Alley' allowed budding entrepreneurs to showcase their innovations, gaining attention from global venture capitalists and media.

The event wrapped up with an announcement for InnovAI Summit 2024, set to be even grander. Participants left with a renewed enthusiasm for the vast possibilities that the AI and ML world promises.

For media inquiries, please contact: Jane Cipher Director of Communications, InnovAI Summit Email: jane.cipher@innovai.org Phone: +123-4567-8910"""

In [None]:
import json
import openai

def parse_press_release(pr: str) -> dict:
    """
    Parses a press release and extracts event information.
    :param pr: The press release text.
    :return: A dictionary with event data.
    """
    try:
        # Request to OpenAI API
        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": "You are a helpful assistant that extracts structured data from text."
                },
                {
                    "role": "user",
                    "content": f"""
Extract structured data from the press release below. Provide the output in strict JSON format with the following keys:
- "name": The title or name of the event
- "date": The event date in DD.MM.YYYY format
- "n_participants": The number of attendees
- "n_speakers": The number of featured speakers
- "price": The ticket price or cost of attendance

If a specific piece of data is not mentioned in the press release, assign it a null value. Here is the press release text:
{pr}

"""
                }
            ],
            temperature=0  # Set temperature to 0 for deterministic responses
        )

        # Extract the response text from the OpenAI API
        summary = response.choices[0].message.content.strip()

        # Remove the ```json``` wrapper if present and parse the JSON
        summary = summary[7:-3].strip() if summary.startswith("```json") else summary
        return json.loads(summary)  # Convert the JSON string into a Python dictionary

    except Exception as e:
        # Handle any parsing errors and return default values
        print(f"Error during parsing: {e}")
        # Return a dictionary with null values for missing information
        return {key: None for key in ["name", "date", "n_participants", "n_speakers", "price"]}

In [None]:
parse_press_release(press_release)

{'name': 'InnovAI Summit 2023',
 'date': '05.11.2023',
 'n_participants': 3500,
 'n_speakers': 4,
 'price': None}
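
As an alternative to stripping the code fence by hand, the Chat Completions API offers a JSON mode through the `response_format` parameter. The sketch below only assembles the request arguments and makes no API call; `build_request` is a hypothetical helper, the prompt text is shortened, and it assumes that your installed `openai` version and the chosen model support JSON mode.

```python
def build_request(pr: str) -> dict:
    """Assemble keyword arguments for openai.chat.completions.create."""
    return {
        "model": "gpt-4o-mini",
        "messages": [
            {
                "role": "system",
                "content": "You extract event data and reply with a JSON object only.",
            },
            {
                "role": "user",
                # JSON mode expects the word "JSON" to appear in the messages.
                "content": "Return JSON with the keys name, date, n_participants, "
                           f"n_speakers, price.\n\n{pr}",
            },
        ],
        "temperature": 0,
        # Ask the API for a syntactically valid JSON object.
        "response_format": {"type": "json_object"},
    }

request = build_request("PRESS RELEASE ...")
print(request["response_format"])
# With an API key configured: openai.chat.completions.create(**request)
```

JSON mode only constrains the syntax of the answer. The keys, the date format and the values still depend on the prompt, so checking the parsed result remains necessary.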

### Testing
We've prepared a small dataset for you to test your prompt on.
Provided you've written your function, try running the following code.
At the end you also have an opportunity to look at the results in a table side-by-side in `with_results.csv`.
Your goal is to get at least 60% of fields right.

In [None]:
!pip install --upgrade gdown
!gdown -O press_release_extraction.csv https://docs.google.com/spreadsheets/d/1vWxzigHtiAePZs7U-6qcW2e5UwZh4irbwKvYEvC5G3s/export?format=csv

Downloading...
From: https://docs.google.com/spreadsheets/d/1vWxzigHtiAePZs7U-6qcW2e5UwZh4irbwKvYEvC5G3s/export?format=csv
To: /content/press_release_extraction.csv
16.0kB [00:00, 25.7MB/s]


In [None]:
import pandas
pr_df = pandas.read_csv("press_release_extraction.csv")
pr_df.head()

| Unnamed: 0 | pr_text | pr_parsed |
|---|---|---|
| 0 | InnovAI Summit 2023: A Glimpse into the Future... | `{\n "name": "InnovAI Summit 2023",\n "date":...` |
| 1 | Press Dispatch: 'Artificial Mariners: Navigati... | `{"name": "Artificial Mariners: Navigatin' the ...` |
| 2 | FOR IMMEDIATE RELEASE\n\nAI Innovators Convene... | `{"name": "Annual Machine Learning Symposium 20...` |
| 3 | Press Release: Cutting-Edge Innovations Debute... | `{"name": "AI Advancements Summit 2023",\n "dat...` |
| 4 | Press Release: Innovative Minds Gather at "AI ... | `{"name": "AI Horizon 2023",\n "date": "15.10.2...` |


In [None]:
pr_df.pr_parsed[0]

'{\n  "name": "InnovAI Summit 2023",\n  "date": "05.11.2023",\n  "n_participants": 3500,\n  "n_speakers": 4,\n  "price": "None"\n}'
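
Note that this reference answer stores the missing price as the string `"None"`, not as JSON `null`. The minimal sketch below uses a shortened copy of the string above to show what that means after `json.loads`; `reference` is just an illustrative name.

```python
import json

# Shortened copy of the reference string shown above.
reference = json.loads('{"n_speakers": 4, "price": "None"}')

# The stored value is a str, so it is not the Python None object.
print(reference["price"] is None)
# Casting a parsed None to str produces the same string as the reference.
print(str(None) == reference["price"])
```

This is why the evaluation loop later in this notebook casts each parsed field to its expected type before comparing. A cast that fails, such as `int(None)`, leaves the value unchanged.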

In [None]:
import json

parsed_list = []
fields = {
    "name": str,
    "date": str,
    "n_speakers": int,
    "n_participants": int,
    "price": str
}
correct_fields = 0
total_fields = 0

for row in pr_df.itertuples():
    parsed_release = parse_press_release(row.pr_text)  # Assume this returns a dictionary
    if not isinstance(parsed_release, dict):
        print(f"Error: Parsed release is not a dictionary for input {row.pr_text}")
        continue

    parsed_list.append(json.dumps(parsed_release, indent=4))  # Serialize for storage if needed
    golden = json.loads(row.pr_parsed)  # Convert stored JSON to a dictionary

    for field, field_type in fields.items():
        # Update counters
        total_fields += 1

        golden_field = golden[field]
        parsed_field = parsed_release.get(field)

        try:
            parsed_field = field_type(parsed_field)  # Convert to the expected type
        except (ValueError, TypeError):
            pass

        if golden_field == parsed_field:
            correct_fields += 1
        else:
            print(f"For {golden['name']} {field} {parsed_field} doesn't seem the same as {golden_field}")

print(correct_fields)

For Artificial Mariners: Navigatin' the AI Seas date 08.10.2023 doesn't seem the same as 08.10.2023-09.10.2023
For Annual Machine Learning Symposium 2023 date 14.10.2023 doesn't seem the same as 14.10.2023-16.10.2023
For Annual Machine Learning Symposium 2023 n_speakers 3 doesn't seem the same as 4
For Annual Machine Learning Symposium 2023 price 1450 doesn't seem the same as USD 1450
For AI Advancements Summit 2023 price 950 doesn't seem the same as USD 900
For AI Horizon 2023 n_speakers None doesn't seem the same as None
For AI for Equity Summit price 250 doesn't seem the same as USD 250
For Generative Intelligence Conclave, Spain 2023 price 180 doesn't seem the same as EUR 180
27
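
The first two mismatches above come from multi-day events, where the reference answer is a range such as `14.10.2023-16.10.2023`. A format check like the following can flag dates that are neither a single `DD.MM.YYYY` value nor such a range before they reach the comparison. It is a sketch under that assumption about the reference format, and `is_valid_event_date` is a hypothetical helper.

```python
import re
from datetime import datetime

DATE_PATTERN = re.compile(r"\d{2}\.\d{2}\.\d{4}")

def is_valid_event_date(value) -> bool:
    """Accept 'DD.MM.YYYY' or a 'DD.MM.YYYY-DD.MM.YYYY' range."""
    if not isinstance(value, str):
        return False
    parts = value.split("-")
    if len(parts) not in (1, 2):
        return False
    dates = []
    for part in parts:
        # Enforce zero-padded digits first; strptime alone is more lenient.
        if not DATE_PATTERN.fullmatch(part):
            return False
        try:
            dates.append(datetime.strptime(part, "%d.%m.%Y"))
        except ValueError:  # e.g. 31.02.2023 is not a real calendar date
            return False
    # In a range, the end date must not precede the start date.
    return len(dates) == 1 or dates[0] <= dates[1]

for candidate in ["05.11.2023", "14.10.2023-16.10.2023", "November 5, 2023"]:
    print(candidate, is_valid_event_date(candidate))
```

A failed check is a signal to adjust the prompt, for example by stating explicitly how multi-day events should be written.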


In [None]:
accuracy = (correct_fields / total_fields) * 100 if total_fields > 0 else 0
print(f"Accuracy: {accuracy:.2f}% ({correct_fields}/{total_fields} fields correct)")

Accuracy: 77.14% (27/35 fields correct)
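
To see which kind of mistake dominates, the eight mismatches printed by the evaluation loop above can be tallied per field. The list below is copied by hand from that output; in your own run you would collect the field names inside the loop instead. `mismatched_fields` and `errors_per_field` are illustrative names.

```python
from collections import Counter

# Field names of the eight mismatches printed by the evaluation loop.
mismatched_fields = [
    "date", "date", "n_speakers", "price",
    "price", "n_speakers", "price", "price",
]

errors_per_field = Counter(mismatched_fields)
for field, count in errors_per_field.most_common():
    print(f"{field}: {count}")
```

In this run `price` accounts for half of the errors, mostly because the reference answers include a currency code such as `USD` or `EUR` while the parsed values are bare numbers.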


In [None]:
pr_df['results'] = parsed_list
pr_df.to_csv("with_results.csv")

**To submit**. You'll need to submit the file `with_results.csv`.

Before running the following code, please double check that:

- Every item in the 'results' column is valid JSON with the fields name, date, n_speakers, n_participants, price.
- You get at least 60% of the fields correct.