
In [None]:
import openai
import json
import dotenv
import os

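# Load variables from a local .env file (if present) into the environment.
dotenv.load_dotenv()

# Read the API key from the environment instead of hard-coding it.
openai.api_key = os.getenv("OPENAI_API_KEY")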

In [None]:
extract_company_info_schema = {
    "name": "extract_company_info",
    "description": "Extract structured information about companies from unstructured text. The model should process the input text to identify and extract company-related information such as company name, founding date, founders, specializations, and headquarters location. It should tokenize the text, identify relevant entities, and extract the necessary information to fill the parameters. The model should handle ambiguities and inaccuracies in the text and ensure the extracted information is accurate and relevant.",
    "parameters": {
        "type": "object",
        "properties": {
            "companies": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {
                            "type": "string",
                            "description": "The official name of the company, e.g. Apple Inc.",
                        },
                        "founding_date": {
                            "type": "string",
                            "description": "The date on which the company was founded, e.g. April 1, 1976. The model should ensure the date is in a recognizable format.",
                        },
                        "founders": {
                            "type": "array",
                            "items": {"type": "string"},
                            "description": "List of individuals who founded the company, e.g. ['Steve Jobs', 'Steve Wozniak', 'Ronald Wayne']. The model should identify and extract names of all founders mentioned in the text.",
                        },
                        "specializations": {
                            "type": "array",
                            "items": {"type": "string"},
                            "description": "The sectors or fields in which the company specializes, e.g. ['consumer electronics', 'computer software', 'online services']. The model should extract all mentioned specializations from the text.",
                        },
                        "headquarters": {
                            "type": "string",
                            "description": "The location of the company’s main headquarters, e.g. Cupertino, California. The model should extract the most specific location mentioned in the text.",
                        },
                    },
                    "required": [
                        "name",
                        "founding_date",
                        "founders",
                        "specializations",
                        "headquarters",
                    ],
                },
                "description": "An array of company objects, each representing structured information extracted about a company from the input text.",
            }
        },
        "required": ["companies"],
    },
}
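
The schema above marks every company field as required, but the model is not guaranteed to honor that, so a quick shape check before using the result can be useful. The following is a minimal sketch using only the standard library; `validate_companies` and `REQUIRED_COMPANY_KEYS` are hypothetical names introduced here for illustration, not part of the notebook or of the `openai` package.

```python
# Hypothetical helper (an assumption, not from the notebook): checks that a decoded
# payload has the keys and basic types that extract_company_info_schema asks for.
REQUIRED_COMPANY_KEYS = {
    "name": str,
    "founding_date": str,
    "founders": list,
    "specializations": list,
    "headquarters": str,
}


def validate_companies(payload):
    """Return a list of problems; an empty list means the payload looks fine."""
    companies = payload.get("companies") if isinstance(payload, dict) else None
    if not isinstance(companies, list):
        return ["'companies' must be a list"]

    problems = []
    for index, company in enumerate(companies):
        if not isinstance(company, dict):
            problems.append(f"companies[{index}] is not an object")
            continue
        for key, expected_type in REQUIRED_COMPANY_KEYS.items():
            if key not in company:
                problems.append(f"companies[{index}] is missing '{key}'")
            elif not isinstance(company[key], expected_type):
                problems.append(
                    f"companies[{index}]['{key}'] should be {expected_type.__name__}"
                )
            elif expected_type is list and not all(
                isinstance(item, str) for item in company[key]
            ):
                problems.append(f"companies[{index}]['{key}'] should contain only strings")
    return problems


# Made-up payload with deliberate mistakes, used only to exercise the check.
example = {"companies": [{"name": "Apple Inc.", "founders": "Steve Jobs"}]}
for problem in validate_companies(example):
    print(problem)
```

A full JSON Schema validator would cover more cases; this sketch only checks presence and basic types for the fields defined above.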

In [None]:
system_message = """\
You have been assigned the role of a data extractor. Your task is to process unstructured text and extract precise and \
structured information about companies mentioned in the text. The text may contain details about one or more companies.

### Your Responsibilities:
1. **Read and Analyze Text:**
   Carefully read the provided unstructured text and identify any information related to companies, such as their names, \
founding dates, founders, specializations, and headquarters locations.

2. **Extract and Structure Information:**
   Accurately extract the identified information and structure it according to the provided JSON schema. Ensure that each \
piece of information is placed under the correct key, and format the data as specified in the schema.

3. **Handle Ambiguities:**
   In cases where the text contains ambiguities or conflicting information, use your best judgement to resolve them and \
extract the most accurate and relevant information.

4. **Validate Information:**
   Ensure that the extracted information is valid, relevant, and conforms to the specified formats in the schema. \
For instance, the founding dates should be in recognizable date formats, and the names should be properly capitalized.

### Steps to Process:
1. **Tokenization:**
   Tokenize the input text into words or phrases that represent meaningful entities, such as company names, dates, or locations.

2. **Entity Recognition:**
   Identify and classify entities in the text, focusing on those related to companies, and extract them.

3. **Entity Linking:**
   Resolve the extracted entities to their canonical forms, linking them to known entities if possible.

4. **Information Structuring:**
   Organize the extracted entities under the appropriate keys in the JSON object, following the schema accurately.

5. **Validation:**
   Review the structured information to ensure its accuracy and adherence to the schema.

### Preparation for Function Call:
Once you have processed the text and structured the information, prepare the data for the function call. Create a JSON \
object as specified in the schema, filling in the extracted information in the appropriate fields. \
Ensure that the JSON object is well-formed and adheres strictly to the schema's structure and format.

Remember, accuracy and attention to detail are crucial in this task. Ensure that the structured information you extract is a \
true representation of the details mentioned in the unstructured text. Your role is pivotal in converting raw, unstructured data into meaningful, \
structured information that can be easily understood and utilized by other systems.
"""

In [None]:
sample_data = """\
Microsoft Corporation, founded on April 4, 1975, by Bill Gates and Paul Allen, is a multinational technology company. \
The company is well-known for its software products, including the Microsoft Windows line of operating systems, the \
Microsoft Office suite, and the Internet Explorer and Edge web browsers. Its headquarters are situated in Redmond, Washington. \
Google LLC, a subsidiary of Alphabet Inc., was established on September 4, 1998, by Larry Page and Sergey Brin while they were \
Ph.D. students at Stanford University. The company specializes in Internet-related services and products, which include online \
advertising technologies, a search engine, cloud computing, software, and hardware. The main office of Google is in \
Mountain View, California.
"""

In [None]:
# Note: openai.ChatCompletion is the interface of openai package versions before 1.0.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    functions=[extract_company_info_schema],
    function_call={"name": "extract_company_info"},  # Force the model to call this function.
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": sample_data},
    ],
)
# The function arguments arrive as a JSON-encoded string, so decode them before use.
data = json.loads(response["choices"][0]["message"]["function_call"]["arguments"])
print(json.dumps(data, indent=4))

{
    "companies": [
        {
            "name": "Microsoft Corporation",
            "founding_date": "April 4, 1975",
            "founders": [
                "Bill Gates",
                "Paul Allen"
            ],
            "specializations": [
                "software products"
            ],
            "headquarters": "Redmond, Washington"
        },
        {
            "name": "Google LLC",
            "founding_date": "September 4, 1998",
            "founders": [
                "Larry Page",
                "Sergey Brin"
            ],
            "specializations": [
                "Internet-related services and products"
            ],
            "headquarters": "Mountain View, California"
        }
    ]
}
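
The call above decodes the arguments string directly, which raises an exception if the model returns truncated or otherwise invalid JSON. Below is a minimal sketch of a more defensive decode step. `parse_function_arguments` is a hypothetical helper, and the two messages are hand-written stand-ins shaped like `response["choices"][0]["message"]`; the sketch assumes the message behaves like a dictionary and makes no API call.

```python
import json


def parse_function_arguments(message):
    """Decode a message's function-call arguments, or return None on failure."""
    function_call = message.get("function_call")
    if not function_call:
        # The model replied with plain content instead of a function call.
        return None
    try:
        return json.loads(function_call.get("arguments", ""))
    except (json.JSONDecodeError, TypeError):
        # The arguments were missing, truncated, or not valid JSON.
        return None


# Hand-written stand-ins for an assistant message (assumptions for illustration).
good = {
    "role": "assistant",
    "content": None,
    "function_call": {"name": "extract_company_info", "arguments": '{"companies": []}'},
}
truncated = {
    "role": "assistant",
    "content": None,
    "function_call": {"name": "extract_company_info", "arguments": '{"companies": ['},
}

print(parse_function_arguments(good))
print(parse_function_arguments(truncated))
```

Returning `None` keeps the sketch short; in practice one might prefer to retry the request or log the raw string for inspection.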

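The schema only asks for founding dates in a "recognizable format", and the output above uses the form `April 4, 1975`. If downstream code needs a sortable date, one option is to normalize these strings to ISO 8601. This is a minimal sketch that assumes the "Month day, year" form seen above and an English locale for month names; `to_iso_date` is a hypothetical helper, not part of the notebook.

```python
from datetime import datetime


def to_iso_date(text):
    """Convert a date like 'April 4, 1975' to '1975-04-04'; return None if it does not match."""
    try:
        # %B is the full month name; it depends on the active locale (English assumed here).
        return datetime.strptime(text.strip(), "%B %d, %Y").date().isoformat()
    except ValueError:
        return None


# Founding dates copied from the output above, plus one string that does not match.
for raw in ["April 4, 1975", "September 4, 1998", "sometime in 1998"]:
    print(raw, "->", to_iso_date(raw))
```

Strings in any other format come back as `None`, so the caller can decide whether to keep the original text or flag the record.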