# Unlocking Unstructured Data Potential with Google Gemini 1.0 Pro

This notebook contains the Python code examples explained in the article:

https://www.storagefreak.net/2024/02/unlocking-unstructured-data-potential-with-google-gemini-1-0-pro

Google Cloud credits are provided for this project. #GeminiSprint

## Step 1. Install the Vertex AI SDK

In [None]:
pip install --upgrade google-cloud-aiplatform



## Step 2. Authenticate to Google Cloud

There are several ways to authenticate a Colab notebook to Google Cloud APIs. In this example I use the built-in `google.colab` library.

In [None]:
from google.colab import auth as google_auth
google_auth.authenticate_user()

## Step 3. Specify Project ID

If you want to run these examples yourself, set your own valid project ID in the code block below. Make sure the necessary APIs are already enabled for that project.
You can do this in the console by visiting https://console.cloud.google.com/vertex-ai and clicking "Enable all recommended APIs".

In [None]:
PROJECT_ID = "tomek-bison"  # this is my project_id; replace it with your own ;)

## Example 1. Extract dates in given format
Below is example code that extracts dates from the string stored in the `my_string` variable.

In [None]:
import vertexai
from vertexai.preview.generative_models import GenerativeModel, Part
import vertexai.preview.generative_models as generative_models

def extract_dates(input_string):
  vertexai.init(project=PROJECT_ID, location="us-central1")
  model = GenerativeModel("gemini-1.0-pro-001")
  responses = model.generate_content(
    f"""Your job is to extract all dates in the below INPUT. Return all dates in format YYYY-MM-DD. In case of multiple dates, separate them by comma (,). Return only dates, nothing more
INPUT: {input_string}""",
    generation_config={
        "max_output_tokens": 2048,
        "temperature": 0.9,
        "top_p": 1
    },
    safety_settings={
          generative_models.HarmCategory.HARM_CATEGORY_HATE_SPEECH: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
          generative_models.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
          generative_models.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
          generative_models.HarmCategory.HARM_CATEGORY_HARASSMENT: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    },
    stream=True,
  )

  for response in responses:
    print(response.text, end="")


In the example above we defined a function called `extract_dates` that takes one parameter, `input_string`. Of course, this code could be polished further, for example by defining what happens when the function receives no input or when the input contains no dates. But in its current form it is perfectly good for a proof of concept. Let's call it with `my_string` as the parameter.
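
One way to add that polish, as a minimal sketch: guard against empty input and return a list instead of printing. The names `extract_dates_checked`, `DATE_PROMPT` and the `generate` parameter are hypothetical and introduced only for illustration. In the notebook, `generate` would be a small wrapper around `model.generate_content(...)` that returns the reply text; here a stand-in is used so the logic can be tried without calling Vertex AI.

```python
from typing import Callable, List

# Instruction text taken from extract_dates; the input is appended at the end.
DATE_PROMPT = (
    "Your job is to extract all dates in the below INPUT. "
    "Return all dates in format YYYY-MM-DD. "
    "In case of multiple dates, separate them by comma (,). "
    "Return only dates, nothing more\n"
    "INPUT: "
)

def extract_dates_checked(input_string: str, generate: Callable[[str], str]) -> List[str]:
    """Return the dates found in input_string as a list of strings.

    `generate` is a hypothetical callable that sends a prompt to the model
    and returns its text reply.
    """
    # Skip the model call entirely when there is nothing to analyse.
    if not input_string or not input_string.strip():
        return []
    reply = generate(DATE_PROMPT + input_string)
    # Split the comma-separated reply and drop empty fragments.
    return [part.strip() for part in reply.split(",") if part.strip()]

# Stand-in for the real model call, used only to exercise the function.
def fake_generate(prompt: str) -> str:
    return "2023-02-15,2023-02-15,2023-03-20"

print(extract_dates_checked("", fake_generate))
print(extract_dates_checked("Meeting on Feb 15th, 2023 ...", fake_generate))
```

Passing the model call in as a parameter is a design choice made for this sketch: it keeps the input handling testable on its own.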

In [None]:
my_string = "In our last meeting on Feb 15th, 2023, we discussed the project deadline initially set for 15/02/2023, which was later postponed to March 20th of the same year."

In [None]:
extract_dates(my_string)

2023-02-15,2023-02-15,2023-03-20

And there we go: a nice, clean example. Of course, this could be polished further. I would probably add at least one example to my prompt ("one-shot prompting") to help ensure the model always responds in the given format. I would also instruct it what to do when the input contains no dates.
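
A one-shot version of the prompt could look roughly like the sketch below. The function names, the worked example inside the prompt, and the `NONE` sentinel for "no dates" are all assumptions made for illustration; they are not part of the original notebook. The second function checks the reply on the Python side, since a prompt alone cannot guarantee the format.

```python
from datetime import datetime
from typing import List

def build_date_prompt(input_string: str) -> str:
    """Build a one-shot prompt: instructions, one worked example, then the real input."""
    return (
        "Your job is to extract all dates in the below INPUT. "
        "Return all dates in format YYYY-MM-DD. "
        "In case of multiple dates, separate them by comma (,). "
        "If the INPUT contains no dates, return the single word NONE. "
        "Return only dates, nothing more.\n\n"
        "Example:\n"
        "INPUT: The invoice from 3 Jan 2022 is due on February 15, 2022.\n"
        "OUTPUT: 2022-01-03,2022-02-15\n\n"
        f"INPUT: {input_string}\n"
        "OUTPUT:"
    )

def parse_date_reply(reply: str) -> List[str]:
    """Keep only the fragments of the reply that parse as real calendar dates."""
    if reply.strip().upper() == "NONE":
        return []
    dates = []
    for fragment in reply.split(","):
        try:
            parsed = datetime.strptime(fragment.strip(), "%Y-%m-%d")
        except ValueError:
            continue  # not a date in year-month-day form, or an impossible date
        # Re-emit in zero-padded YYYY-MM-DD form.
        dates.append(parsed.date().isoformat())
    return dates

print(build_date_prompt("We met on Feb 15th, 2023."))
print(parse_date_reply("2023-02-15,2023-02-15,2023-03-20"))
print(parse_date_reply("NONE"))
```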

## Example 2. Extract data from user e-mail

In [None]:
# issue contains the full user e-mail. The e-mail below is, of course, just an example.

issue = '''
From: Emily Thompson emily.thompson@company.com
Subject: Printer Issue: Assistance Required
Email Content:

Hey IT Support Team,

I hope this message finds you well. I'm in a bit of a bind here with the printer—it's just sitting there, stubbornly refusing to print. I've tried coaxing it, checking its connections, ensuring it's powered on, but no luck. It's like it's decided to take a personal day!
Could you work your magic and get this thing back in action? I've got documents piling up that need seeing the light of day.
Thanks a bunch for jumping on this!

Cheers,
Emily
'''

We would like to extract the information needed to log a ticket automatically. The way and format in which we pass the data may differ, but in this example let's use a JSON structure. It is a well-known format and fits our example perfectly.

In [None]:
prompt = f'''
**Input Text:**
{issue}

**Instructions:**
Given the above input text, extract and organize the information into a structured JSON format. This information will be used to log a ticket, so propose a proper title and category that fits the reported issue. additional_info might include troubleshooting steps already executed, or any other useful information. The output should consist of the JSON structure only, without any additional characters or explanations.

**Output structure:**
{{
    "name": "",
    "email": "",
    "title": "",
    "category": "",
    "short_summary": "",
    "full_description": "",
    "additional_info": ""
}}
'''

Again, let's define a function that calls the Gemini 1.0 Pro model. I repeat the code already executed in the first example, such as `import vertexai`, so that you can run the second example without running the first one.

Since the prompt is already defined in the code block above, this time the function does not accept any parameters. Of course, in a more production-like environment we would improve the flow.
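
As one possible improvement to that flow, the prompt could be built per e-mail rather than read from a global variable. The sketch below assumes a hypothetical helper `build_ticket_prompt` and a template constant `TICKET_PROMPT_TEMPLATE`; neither exists in the original notebook. The template text is copied from the prompt defined earlier.

```python
# Template copied from the prompt above; only the e-mail text changes per call.
# Doubled braces are literal braces once str.format() has run.
TICKET_PROMPT_TEMPLATE = '''
**Input Text:**
{issue}

**Instructions:**
Given the above input text, extract and organize the information into a structured JSON format. This information will be used to log a ticket, so propose a proper title and category that fits the reported issue. additional_info might include troubleshooting steps already executed, or any other useful information. The output should consist of the JSON structure only, without any additional characters or explanations.

**Output structure:**
{{
    "name": "",
    "email": "",
    "title": "",
    "category": "",
    "short_summary": "",
    "full_description": "",
    "additional_info": ""
}}
'''

def build_ticket_prompt(issue: str) -> str:
    """Return the ticket-extraction prompt for one e-mail (hypothetical helper)."""
    if not issue or not issue.strip():
        raise ValueError("issue must contain the e-mail text")
    # Braces inside the e-mail itself are safe: format() does not re-parse
    # the substituted value.
    return TICKET_PROMPT_TEMPLATE.format(issue=issue.strip())

print(build_ticket_prompt("From: someone@example.com\nThe printer refuses to print."))
```

A function such as `extract_ticket_info` could then take the e-mail as an argument and pass `build_ticket_prompt(issue)` to `generate_content`.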

In [None]:
import vertexai
from vertexai.preview.generative_models import GenerativeModel, Part
import vertexai.preview.generative_models as generative_models

def extract_ticket_info():
  vertexai.init(project=PROJECT_ID, location="us-central1")
  model = GenerativeModel("gemini-1.0-pro-001")
  responses = model.generate_content(
    prompt,  # this is defined in previous code block
    generation_config={
        "max_output_tokens": 2048,
        "temperature": 0.9,
        "top_p": 1
    },
    safety_settings={
          generative_models.HarmCategory.HARM_CATEGORY_HATE_SPEECH: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
          generative_models.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
          generative_models.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
          generative_models.HarmCategory.HARM_CATEGORY_HARASSMENT: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    },
    stream=True,
  )

  for response in responses:
    print(response.text, end="")

# run the function
extract_ticket_info()


```json
{
    "name": "Emily Thompson",
    "email": "emily.thompson@company.com",
    "title": "Printer Issue: Inaccessible",
    "category": "Hardware",
    "short_summary": "Printer not printing despite troubleshooting attempts",
    "full_description": "The printer is unresponsive and not printing documents. Power is on, and connections are secure. User has attempted basic troubleshooting steps without success.",
    "additional_info": "Troubleshooting steps executed:\n1. Checked connections\n2. Ensured power is on\n3. Restarted the printer"
}
```

And there we have it: a neatly structured JSON record extracted from a user's e-mail. It shows how unstructured data can be turned into a usable structure and underscores the practical utility of Generative AI in everyday problem-solving. The result still deserves a review, though: the last troubleshooting step in `additional_info`, "Restarted the printer", is not mentioned in the e-mail.
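
Before a reply like this is handed to a ticketing system, it is worth parsing and checking it. The sketch below is one way to do that with the standard library only. `parse_ticket_reply` and `EXPECTED_KEYS` are hypothetical names, and the fence handling assumes the model may wrap its JSON in a Markdown code fence, as the reply above appears to be.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker

EXPECTED_KEYS = {
    "name", "email", "title", "category",
    "short_summary", "full_description", "additional_info",
}

def parse_ticket_reply(reply: str) -> dict:
    """Parse the model reply into a dict and check that all expected keys exist."""
    text = reply.strip()
    # Drop a surrounding Markdown code fence if the reply has one.
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # skip the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # skip the closing fence line
        text = "\n".join(lines)
    ticket = json.loads(text)  # raises a ValueError subclass on invalid JSON
    if not isinstance(ticket, dict):
        raise ValueError("expected a JSON object")
    missing = EXPECTED_KEYS - set(ticket)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return ticket

# Sample reply rebuilt from the output shown above (descriptions shortened).
sample_ticket = {
    "name": "Emily Thompson",
    "email": "emily.thompson@company.com",
    "title": "Printer Issue: Inaccessible",
    "category": "Hardware",
    "short_summary": "Printer not printing despite troubleshooting attempts",
    "full_description": "The printer is unresponsive and not printing documents.",
    "additional_info": "Troubleshooting steps executed:\n1. Checked connections",
}
sample_reply = FENCE + "json\n" + json.dumps(sample_ticket, indent=4) + "\n" + FENCE

ticket = parse_ticket_reply(sample_reply)
print(ticket["title"], "|", ticket["category"])
```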