<a href="https://colab.research.google.com/github/AlexUmnov/genai_course/blob/main/week1_genai_api/homework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 2. Extracting information with LLMs

At the practice session we were usually happy if we got something coherent. However, in real applications we often need to obtain concrete answers. Let's explore how to do it with LLMs.

In [1]:
!pip install -q openai

In [3]:
import openai
openai.api_key = open("open-ai-api-key").read().strip()

Let's imagine that you work for a marketing agency and need to gather analytics about past events dedicated to AI and Machine Learning. For that, you need to process press releases and extract:
- Event name,
- Event date,
- Number of participants,
- Number of speakers,
- Attendance price.

Of course, you can do it manually, but it's much more fun to use Generative AI! So, your task is to write a function that does this with only one request to the OpenAI API.

Below there is an example of a press release (generated by ChatGPT, of course, so that both the event and the personae are fictional). All of them are in the press_releases.zip archive in the hometask week 1 folder.

<blockquote>
<p>PRESS RELEASE

InnovAI Summit 2023: A Glimpse into the Future of Artificial Intelligence</p>

City of Virtue, Cyberspace - November 8, 2023 - The most anticipated event of the year, InnovAI Summit 2023, successfully concluded last weekend, on November 5, 2023. Held in the state-of-the-art VirtuTech Arena, the summit saw a massive turnout of over 3,500 participants, from brilliant AI enthusiasts and researchers to pioneers in the field.

Esteemed speakers took to the stage to shed light on the latest breakthroughs, practical implementations, and ethical considerations in AI. Dr. Evelyn Quantum, renowned for her groundbreaking work on Quantum Machine Learning, emphasized the importance of this merger and how it's revolutionizing computing as we know it. Another keynote came from Prof. Leo Nexus, whose current project 'AI for Sustainability' highlights the symbiotic relationship between nature and machine, aiming to use AI in restoring our planet's ecosystems.

This year's panel discussion, moderated by the talented Dr. Ada Neura, featured lively debates on the limits of AI in creative arts. Renowned digital artist, Felix Vortex, showcased how he uses generative adversarial networks to create surreal art pieces, while bestselling author, Iris Loom, explained her experiments with AI-assisted story crafting.

Among other highlights were hands-on workshops, interactive Q&A sessions, and an 'AI & Ethics' debate which was particularly well-received, emphasizing the need for transparency and fairness in AI models. An exclusive 'Start-up Alley' allowed budding entrepreneurs to showcase their innovations, gaining attention from global venture capitalists and media.

The event wrapped up with an announcement for InnovAI Summit 2024, set to be even grander. Participants left with a renewed enthusiasm for the vast possibilities that the AI and ML world promises.

For media inquiries, please contact:
Jane Cipher
Director of Communications, InnovAI Summit
Email: jane.cipher@innovai.org
Phone: +123-4567-8910</p>
</blockquote>

More specifically, you should write a function

```python
parse_press_release(pr: str) -> dict
```

where the output should be in the format

```python
{
  'name': 'InnovAI Summit 2023',
  'date': '05.11.2023',
  'n_participants': 3500,
  'n_speakers': 4,
  'price': None
}
```

If any of the five characteristics is not mentioned in the text, put `None` in the respective field.

At the end, calculate the statistics of right answers and analyse what kind of mistakes your "model" makes the most.

**Hints and suggestions:**
- It will be more convenient to experiment in the OpenAI chat interface https://chat.openai.com/. Plus, this doesn't spend money on API requests.
- You need to be very accurate with what you want from the model.
- It will help if you specify in the prompt that the output should be in JSON format; this way you will spend less time parsing the output. But be careful: although **gpt-4o-mini** is easily prompted to output JSON, please check the output format. It may contain extra formatting, for example:
<pre><code>```json
{"name": "InnovAI Summit 2023", ...}
```</code></pre>
Examining LLM outputs and their format is a must when working with them.

- Please be careful with the details. For example, Jane Cipher in the text above is not a speaker and shouldn't be counted as such (how to get rid of a contact person?). Also pay attention to the date format.
- If the model is too wilful with the output format, don't hesitate to show some examples. Decreasing the temperature of predictions can help reduce the creativity of the answer, which is what we want for such a task.
- Debugging an LLM-powered application may become a tough business. When you think that you've polished it, an LLM can still surprise you. So, we don't expect 100% accuracy in this task, but we expect that you do your best to achieve high quality results.
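
As an illustration of the hint about extra formatting, here is a minimal sketch of a parser that tolerates a Markdown code fence around the reply. The helper name `loads_llm_json` is hypothetical (it is not part of the assignment), and the sketch assumes the fence wraps the whole reply.

```python
import json
import re

# Matches an optional Markdown fence (three backticks, optionally followed
# by "json") that wraps the whole reply.
_FENCE_RE = re.compile(r"^`{3}(?:json)?\s*(.*?)\s*`{3}$", re.DOTALL)

def loads_llm_json(text: str) -> dict:
    """Parse a model reply as JSON, tolerating a surrounding code fence."""
    cleaned = text.strip()
    match = _FENCE_RE.match(cleaned)
    if match:
        cleaned = match.group(1)
    return json.loads(cleaned)

fence = "`" * 3
reply = fence + 'json\n{"name": "InnovAI Summit 2023", "n_speakers": 4, "price": null}\n' + fence
print(loads_llm_json(reply))
```

A reply without a fence goes through unchanged, so the same helper can be used in both cases.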

In [11]:
press_release = """PRESS RELEASE

InnovAI Summit 2023: A Glimpse into the Future of Artificial Intelligence

City of Virtue, Cyberspace - November 8, 2023 - The most anticipated event of the year, InnovAI Summit 2023, successfully concluded last weekend, on November 5, 2023. Held in the state-of-the-art VirtuTech Arena, the summit saw a massive turnout of over 3,500 participants, from brilliant AI enthusiasts and researchers to pioneers in the field.

Esteemed speakers took to the stage to shed light on the latest breakthroughs, practical implementations, and ethical considerations in AI. Dr. Evelyn Quantum, renowned for her groundbreaking work on Quantum Machine Learning, emphasized the importance of this merger and how it's revolutionizing computing as we know it. Another keynote came from Prof. Leo Nexus, whose current project 'AI for Sustainability' highlights the symbiotic relationship between nature and machine, aiming to use AI in restoring our planet's ecosystems.

This year's panel discussion, moderated by the talented Dr. Ada Neura, featured lively debates on the limits of AI in creative arts. Renowned digital artist, Felix Vortex, showcased how he uses generative adversarial networks to create surreal art pieces, while bestselling author, Iris Loom, explained her experiments with AI-assisted story crafting.

Among other highlights were hands-on workshops, interactive Q&A sessions, and an 'AI & Ethics' debate which was particularly well-received, emphasizing the need for transparency and fairness in AI models. An exclusive 'Start-up Alley' allowed budding entrepreneurs to showcase their innovations, gaining attention from global venture capitalists and media.

The event wrapped up with an announcement for InnovAI Summit 2024, set to be even grander. Participants left with a renewed enthusiasm for the vast possibilities that the AI and ML world promises.

For media inquiries, please contact: Jane Cipher Director of Communications, InnovAI Summit Email: jane.cipher@innovai.org Phone: +123-4567-8910"""

In [56]:
import json

json_string = '''{
  "name": "InnovAI Summit 2023",
  "date": "2023-11-05",
  "n_participants": 3500,
  "n_speakers": 3,
  "price": null
}'''

# Parse the JSON string into a Python dictionary
parsed_json = json.loads(json_string)

In [61]:
f'''{parsed_json}'''

"{'name': 'InnovAI Summit 2023', 'date': '2023-11-05', 'n_participants': 3500, 'n_speakers': 3, 'price': None}"

In [96]:
import json

import openai

def parse_press_release(pr: str) -> dict:
    # One system message asking for JSON, one user message carrying the press release.
    messages = [
        {"role": "system", "content": "You are a helpful assistant designed to output JSON."},
        {"role": "user", "content": f"""
                Process the following press release and respond in the json format with fields:
                [name: ..., date: ..., n_participants: ..., n_speakers: ..., price: ...].
                Use only data from the given text and put null if there is no answer.

                Press Release:
                {pr}
                """},
    ]

    # JSON mode asks the API to reply with a single JSON object.
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        response_format={"type": "json_object"},
    )
    response_text = response.choices[0].message.content
    return json.loads(response_text)


In [97]:
parse_press_release(press_release)

{'name': 'InnovAI Summit 2023',
 'date': '2023-11-05',
 'n_participants': 3500,
 'n_speakers': 3,
 'price': None}

In [None]:
parse_press_release(press_release)

{'name': 'InnovAI Summit 2023',
 'date': '05.11.2023',
 'n_participants': 3500,
 'n_speakers': 3,
 'price': 'None'}
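
The two runs above differ in date format and in how a missing price is written, which suggests the prompt leaves too much open. Below is a sketch of a stricter request with the date format spelled out, a note about contact persons, and `temperature` set to 0. The function name `build_request` and the exact wording of the system message are assumptions made for illustration; the function only builds the arguments and does not call the API.

```python
def build_request(pr: str) -> dict:
    """Return keyword arguments for openai.chat.completions.create (nothing is sent here)."""
    # The wording below is an illustrative assumption, not the course's reference prompt.
    system = (
        "You extract facts from press releases and reply with a single JSON object "
        "with the keys: name, date, n_participants, n_speakers, price. "
        "date is the day the event took place (not the release date), formatted DD.MM.YYYY. "
        "n_speakers counts the people named as speaking at the event; "
        "press or media contacts are not speakers. "
        "Use null for anything the text does not state."
    )
    return {
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": f"Press release:\n{pr}"},
        ],
        "temperature": 0,  # less variation between runs
        "response_format": {"type": "json_object"},
    }

kwargs = build_request("PRESS RELEASE ...")
# response = openai.chat.completions.create(**kwargs)  # the actual API call
```

Keeping the request construction separate from the call makes it easy to inspect or tweak the prompt without spending API requests.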

### Testing
We've prepared a small dataset for you to test your prompt on.
Provided you've written your function, try running the following code.
At the end you also have an opportunity to look at the results in a table side-by-side in `with_results.csv`.
Your goal is to get at least 60% of fields right.
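
The test code in this section prints only the raw count of matching fields, so the 60% target has to be computed from it. A minimal sketch, assuming five fields per press release; the helper name `field_accuracy` and the numbers in the example are hypothetical.

```python
def field_accuracy(correct_fields: int, n_releases: int, n_fields: int = 5) -> float:
    """Share of correctly extracted fields across all press releases."""
    total = n_releases * n_fields
    if total == 0:
        raise ValueError("no fields to score")
    return correct_fields / total

# Hypothetical numbers, for illustration only: 10 releases, 32 matching fields.
print(f"{field_accuracy(32, 10):.0%}")  # 64%
```

With the real data this would be called as `field_accuracy(correct_fields, len(pr_df))` after the evaluation loop has run.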

In [38]:
!pip install --upgrade gdown
!gdown -O press_release_extraction.csv https://docs.google.com/spreadsheets/d/1vWxzigHtiAePZs7U-6qcW2e5UwZh4irbwKvYEvC5G3s/export?format=csv

Downloading...
From: https://docs.google.com/spreadsheets/d/1vWxzigHtiAePZs7U-6qcW2e5UwZh4irbwKvYEvC5G3s/export?format=csv
To: /content/press_release_extraction.csv
16.0kB [00:00, 16.8MB/s]


In [6]:
import pandas
pr_df = pandas.read_csv("press_release_extraction.csv")
pr_df.head()

Unnamed: 0,pr_text,pr_parsed
0,InnovAI Summit 2023: A Glimpse into the Future...,"{\n ""name"": ""InnovAI Summit 2023"",\n ""date"":..."
1,Press Dispatch: 'Artificial Mariners: Navigati...,"{""name"": ""Artificial Mariners: Navigatin' the ..."
2,FOR IMMEDIATE RELEASE\n\nAI Innovators Convene...,"{""name"": ""Annual Machine Learning Symposium 20..."
3,Press Release: Cutting-Edge Innovations Debute...,"{""name"": ""AI Advancements Summit 2023"",\n ""dat..."
4,"Press Release: Innovative Minds Gather at ""AI ...","{""name"": ""AI Horizon 2023"",\n ""date"": ""15.10.2..."


In [9]:
pr_df.pr_parsed[0]

'{\n  "name": "InnovAI Summit 2023",\n  "date": "05.11.2023",\n  "n_participants": 3500,\n  "n_speakers": 4,\n  "price": "None"\n}'

In [98]:
import json

parsed_list = []
fields = {
    "name": str,
    "date": str,
    "n_speakers": int,
    "n_participants": int,
    "price": str
}
correct_fields = 0
for row in pr_df.itertuples():
    parsed_release = parse_press_release(row.pr_text)
    parsed_list.append(json.dumps(parsed_release, indent=4))
    golden = json.loads(row.pr_parsed)
    for field, field_type in fields.items():
        golden_field = golden[field]
        parsed_field = parsed_release.get(field)
        try:
            parsed_field = field_type(parsed_field)
        except (ValueError, TypeError):
            pass
        if golden_field == parsed_field:
            correct_fields += 1
        else:
            print(f"For {golden['name']} {field} {parsed_release.get(field)} doesn't seem the same as {golden[field]}")

print(correct_fields)

For InnovAI Summit 2023 date November 5, 2023 doesn't seem the same as 05.11.2023
For InnovAI Summit 2023 n_speakers 3 doesn't seem the same as 4
For Artificial Mariners: Navigatin' the AI Seas date October 8-9, 2023 doesn't seem the same as 08.10.2023-09.10.2023
For Annual Machine Learning Symposium 2023 date October 14 to October 16, 2023 doesn't seem the same as 14.10.2023-16.10.2023
For Annual Machine Learning Symposium 2023 n_speakers 3 doesn't seem the same as 4
For Annual Machine Learning Symposium 2023 price 1450 doesn't seem the same as USD 1450
For AI Advancements Summit 2023 date 2023-10-16 doesn't seem the same as 16.10.2023
For AI Advancements Summit 2023 price 950 doesn't seem the same as USD 900
For AI Horizon 2023 date 2023-10-15 doesn't seem the same as 15.10.2023
For AI Horizon 2023 n_speakers None doesn't seem the same as None
For AI for Equity Summit date 2023-10-15 doesn't seem the same as 15.10.2023
For AI for Equity Summit price 250 doesn't seem the same as USD 2
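
Most of the date mismatches reported above are formatting differences (`2023-11-05` or `November 5, 2023` against the expected `05.11.2023`). Asking for the format in the prompt is one remedy; another is to normalize the date after parsing. The sketch below assumes only three input formats and an English locale for month names; the helper name `normalize_date` is hypothetical, and date ranges are deliberately left untouched.

```python
from datetime import datetime

# Input formats to try, in order; extend the tuple as new variants show up.
_DATE_FORMATS = ("%d.%m.%Y", "%Y-%m-%d", "%B %d, %Y")

def normalize_date(value):
    """Return the date as DD.MM.YYYY, or the input unchanged if no format matches."""
    if not isinstance(value, str):
        return value
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%d.%m.%Y")
        except ValueError:
            continue
    return value

print(normalize_date("2023-11-05"))         # 05.11.2023
print(normalize_date("November 5, 2023"))   # 05.11.2023
print(normalize_date("October 8-9, 2023"))  # unchanged: ranges are not handled
```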

In [99]:
pr_df['results'] = parsed_list
pr_df.to_csv("with_results.csv")

**To submit**. You'll need to submit the file `with_results.csv`.

Before running the following code, please double check that:

- Every item in the 'results' column is valid JSON with the fields name, date, n_speakers, n_participants, price.
- You get at least 60% of the fields correct.
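
The first check can be automated. Here is a minimal sketch, assuming each entry of the 'results' column is a JSON string as produced by `json.dumps` in the test loop; the helper name `check_result` is hypothetical.

```python
import json

REQUIRED_FIELDS = {"name", "date", "n_speakers", "n_participants", "price"}

def check_result(item) -> list:
    """Return a list of problems found in one 'results' entry (empty if none)."""
    try:
        parsed = json.loads(item)
    except (TypeError, json.JSONDecodeError) as err:
        return [f"not valid JSON: {err}"]
    if not isinstance(parsed, dict):
        return ["top-level value is not a JSON object"]
    missing = REQUIRED_FIELDS - parsed.keys()
    return [f"missing fields: {sorted(missing)}"] if missing else []

good = '{"name": "X", "date": "05.11.2023", "n_speakers": 4, "n_participants": 3500, "price": null}'
print(check_result(good))             # []
print(check_result('{"name": "X"}'))  # reports the four missing fields
print(check_result("not json"))       # reports a JSON decoding error
```

Applied to the dataframe, `pr_df["results"].map(check_result)` would give one list of problems per row.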