<a href="https://colab.research.google.com/github/plaban1981/Langchain_usecases/blob/main/YT_Information_Extraction_LangChain_Kor_Template_for_creating.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip -q install langchain openai google-search-results tiktoken
!pip -q install kor markdownify

In [None]:
import os

os.environ["OPENAI_API_KEY"] = ""

In [None]:
!pip show langchain

Name: langchain
Version: 0.0.177
Summary: Building applications with LLMs through composability
Home-page: https://www.github.com/hwchase17/langchain
Author: 
Author-email: 
License: MIT
Location: /usr/local/lib/python3.10/dist-packages
Requires: aiohttp, async-timeout, dataclasses-json, numexpr, numpy, openapi-schema-pydantic, pydantic, PyYAML, requests, SQLAlchemy, tenacity
Required-by: kor


## Kor Basics

The basic workflow is the following:

1. Load the document
2. Clean up the document (optional)
3. Split the document into chunks
4. Define a schema for extraction
5. Extract from every chunk of text
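
The five steps can be sketched end to end in plain Python. This is an illustration only: `clean`, `split_into_chunks`, and `extract` are hypothetical stand-ins (the regex-based `extract` takes the place of the LLM-backed Kor chain), and the sample text is made up.

```python
import re

def clean(text: str) -> str:
    """Step 2: collapse runs of whitespace (optional clean-up)."""
    return re.sub(r"\s+", " ", text).strip()

def split_into_chunks(text: str, chunk_size: int = 60) -> list[str]:
    """Step 3: greedily pack sentences into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for sentence in re.split(r"(?<=[.!?]) ", text):
        if current and len(current) + 1 + len(sentence) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# Step 4: the "schema" here is just the field names we want back.
SCHEMA = ("first_name", "last_name", "age")

def extract(chunk: str) -> list[dict]:
    """Step 5 stand-in: a regex plays the role of the LLM extraction chain."""
    pattern = r"([A-Z][a-z]+) ([A-Z][a-z]+) was (\d+) years old"
    return [dict(zip(SCHEMA, match)) for match in re.findall(pattern, chunk)]

# Step 1: "load" the document (an inline string instead of a file).
document = "John Smith was 23 years old.   He was very tall.\nJane Doe was 5 years old."

records = [rec for chunk in split_into_chunks(clean(document)) for rec in extract(chunk)]
print(records)
```

In the rest of the notebook, LangChain handles the splitting and a Kor chain handles the extraction; the overall shape of the loop stays the same.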

In [None]:
from typing import List, Optional

from langchain.callbacks import get_openai_callback
from langchain.chat_models import ChatOpenAI

from kor import create_extraction_chain, extract_from_documents, from_pydantic
from kor.nodes import Number, Object, Text

import pandas as pd
from pydantic import BaseModel, Field, validator


from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter


## Simple examples

In [None]:
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
)

In [None]:
schema = Object(
    id="personal_info",
    description="Personal information about a given person.",
    attributes=[
        Text(
            id="first_name",
            description="The first name of the person",
            examples=[("John Smith went to the store", "John")],
        ),
        Text(
            id="last_name",
            description="The last name of the person",
            examples=[("John Smith went to the store", "Smith")],
        ),
        Number(
            id="age",
            description="The age of the person in years.",
            examples=[("23 years old", "23"), ("I turned three on sunday", "3")],
        ),
    ],
    examples=[
        (
            "John Smith was 23 years old. He was very tall. He knew Jane Doe. She was 5 years old.",
            [
                {"first_name": "John", "last_name": "Smith", "age": 23},
                {"first_name": "Jane", "last_name": "Doe", "age": 5},
            ],
        )
    ],
    many=True,
)


chain = create_extraction_chain(llm, schema)

In [None]:
print(chain.prompt.format_prompt(text="[user input]").to_string())

Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

personal_info: Array<{ // Personal information about a given person.
 first_name: string // The first name of the person
 last_name: string // The last name of the person
 age: number // The age of the person in years.
}>
```


Please output the extracted information in CSV format in Excel dialect. Please use a | as the delimiter. 
 Do NOT add any clarifying information. Output MUST follow the schema above. Do NOT add any additional columns that do not appear in the schema.



Input: John Smith was 23 years old. He was very tall. He knew Jane Doe. She was 5 years old.
Output: first_name|last_name|age
John|Smith|23
Jane|Doe|5

Input: John Smith went to the store
Output: first_name|last_name|age
John||

Input: 

In [None]:
chain.predict_and_parse(text="David Jones was 34 years old a long time ago.")["data"]

{'personal_info': [{'first_name': 'David', 'last_name': 'Jones', 'age': '34'}]}
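
The prompt above asks the model to reply in pipe-delimited CSV. As a rough sketch of how such a reply could be decoded into records (this uses the standard-library `csv` module and is not Kor's actual decoder; `parse_pipe_csv` is a hypothetical helper):

```python
import csv
import io

def parse_pipe_csv(raw: str) -> list[dict]:
    """Decode a header row plus data rows separated by '|' into dicts."""
    reader = csv.DictReader(io.StringIO(raw), delimiter="|")
    return [dict(row) for row in reader]

# A reply shaped like the worked example in the prompt above.
raw_reply = "first_name|last_name|age\nJohn|Smith|23\nJane|Doe|5"

rows = parse_pipe_csv(raw_reply)
print(rows)
```

Every value decoded this way is a string, which is consistent with `'age': '34'` in the output above even though the schema declares `age` as a `Number`.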

## Nested Objects and JSON

In [None]:
from_address = Object(
    id="from_address",
    description="Person moved away from this address",
    attributes=[
        Text(id="street"),
        Text(id="city"),
        Text(id="state"),
        Text(id="zipcode"),
        Text(id="country", description="A country in the world; e.g., France."),
    ],
    examples=[
        (
            "100 Main St, Boston, MA, 23232, USA",
            {
                "street": "100 Main St",
                "city": "Boston",
                "state": "MA",
                "zipcode": "23232",
                "country": "USA",
            },
        )
    ],
)

to_address = from_address.replace(
    id="to_address", description="Address to which the person is moving"
)

schema = Object(
    id="information",
    attributes=[
        Text(
            id="person_name",
            description="The full name of the person or partial name",
            examples=[("John Smith was here", "John Smith")],
        ),
        from_address,
        to_address,
    ],
    many=True,
)

### JSON encoding
To use nested objects, at least for now, we have to switch to the JSON encoder.

In [None]:
chain = create_extraction_chain(
    llm, schema, encoder_or_encoder_class="json", input_formatter=None
)
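
A flat CSV row has no natural place for an address inside a person record, which is presumably why nested schemas call for JSON. The sketch below decodes a hand-written reply (not actual model output) to show that nested objects survive decoding as nested dictionaries.

```python
import json

# Hypothetical model reply for a nested schema (hand-written for illustration).
raw_reply = """
{
  "information": [
    {
      "person_name": "Alice Doe",
      "from_address": {"city": "New York"},
      "to_address": {"city": "Boston", "state": "MA"}
    }
  ]
}
"""

parsed = json.loads(raw_reply)
person = parsed["information"][0]

# Nested objects stay nested after decoding; fields the model left out are simply absent.
print(person["to_address"]["city"])
print(person["from_address"].get("state"))
```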

In [None]:
print(chain.prompt.format_prompt(text="[user input]").to_string())

Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

information: Array<{ // 
 person_name: string // The full name of the person or partial name
 to_address: { // Address to which the person is moving
  street: string // 
  city: string // 
  state: string // 
  zipcode: string // 
  country: string // A country in the world; e.g., France.
 }
 to_address: { // Address to which the person is moving
  street: string // 
  city: string // 
  state: string // 
  zipcode: string // 
  country: string // A country in the world; e.g., France.
 }
}>
```


Please output the extracted information in JSON format. Do not output anything except for the extracted information. Do not add any clarifying information. Do not add any fields that are not in the schema. If the tex

In [None]:
chain.predict_and_parse(
    text="Alice Doe moved from New York to Boston, MA while Bob Smith did the opposite."
)["data"]

{'information': [{'person_name': 'Alice Doe',
   'to_address': {'city': 'Boston', 'state': 'MA'}},
  {'person_name': 'Bob Smith', 'to_address': {'city': 'New York'}}]}

## With Pydantic and validation

In [None]:
!wget -q https://www.dropbox.com/s/gekyuep86zibhl1/conversation-025722052023.txt

#### Load the document

In [None]:
def load_conversation(filename):

    with open(filename, 'r') as f:
        conversation = f.read()

    return conversation


In [None]:
conversation = load_conversation('/content/conversation-025722052023.txt')

len(conversation)

9456

In [None]:
conversation

'Food lover 2: Instruction: Please describe your first most unforgettable meal, including the location, ambiance, taste, and any unique experiences.\nInput: My first most unforgettable meal was at a restaurant called El Celler de Can Roca in Girona, Spain. The ambiance was elegant and modern, and the food was a creative and delicious 18-course tasting menu. One unique experience was when they brought out a dish that was inspired by the smells of the forest.\n\nFood lover 1: My response: That sounds amazing! The forest-inspired dish must have been a unique experience. My first most unforgettable meal was at a restaurant called Noma in Copenhagen, Denmark. The location was in an old warehouse by the waterfront, and the ambiance was rustic and cozy. The food was presented in a simple and natural way, with many of the ingredients sourced from the surrounding Nordic region. One of the most memorable dishes was a dessert made with fermented berries and ants, which added a surprising and deli

#### Split the text into docs

In [None]:
doc = Document(page_content=conversation)

In [None]:
split_docs = RecursiveCharacterTextSplitter().split_documents([doc])
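
`RecursiveCharacterTextSplitter()` is called with its defaults here. As a rough mental model only (this is not LangChain's implementation), the sketch below greedily packs paragraphs into chunks under a character budget. The `chunk_size` of 4000 is an assumption about the default in this LangChain version; it would be consistent with the 9,456-character conversation producing the three requests reported in the token usage output. `pack_paragraphs` and the synthetic sample are hypothetical.

```python
def pack_paragraphs(text: str, chunk_size: int = 4000, sep: str = "\n\n") -> list[str]:
    """Greedily merge paragraphs into chunks of at most chunk_size characters.

    Simplification: a single paragraph longer than chunk_size is kept whole,
    whereas a recursive splitter would fall back to finer separators.
    """
    chunks, current = [], ""
    for paragraph in text.split(sep):
        candidate = f"{current}{sep}{paragraph}" if current else paragraph
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Synthetic stand-in for the conversation: 12 turns of roughly 800 characters each.
sample = "\n\n".join(f"Food lover {i % 2 + 1}: " + "x" * 780 for i in range(12))
chunks = pack_paragraphs(sample)
print(len(sample), [len(c) for c in chunks])
```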

#### Extract Restaurant Info


In [None]:
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
)

In [None]:
class Restaurant(BaseModel):
    name: str = Field(
        description="The name of the restaurant",
    )
    location: Optional[str] = Field(
        description="City and/or country, the place where the restaurant is",
    )
    style: Optional[str] = Field(
        description="The type of cuisine that is served at the restaurant",
    )
    top_dish: Optional[str] = Field(
        description="The top dish that people love the most",
    )

    @validator("name")
    def name_must_not_be_empty(cls, v):
        if not v:
            raise ValueError("Name must not be empty")
        return v
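
The validator above rejects records whose `name` is empty. The same rule can be sketched without Pydantic using a standard-library dataclass; this swaps in `__post_init__` for Pydantic's `@validator` and is only meant to show what the check does to incoming records. `RestaurantRecord` and `validate_records` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RestaurantRecord:
    name: str
    location: Optional[str] = None
    style: Optional[str] = None
    top_dish: Optional[str] = None

    def __post_init__(self):
        # Mirrors name_must_not_be_empty: an empty name is an error.
        if not self.name:
            raise ValueError("Name must not be empty")

def validate_records(raw_records):
    """Split raw dicts into validated records and error messages."""
    valid, errors = [], []
    for raw in raw_records:
        try:
            valid.append(RestaurantRecord(**raw))
        except (TypeError, ValueError) as exc:
            errors.append(str(exc))
    return valid, errors

valid, errors = validate_records([
    {"name": "Burnt Ends", "location": "Singapore"},
    {"name": "", "location": "Melbourne, Australia"},
])
print([r.name for r in valid], errors)
```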




In [None]:
schema, extraction_validator = from_pydantic(
    Restaurant,
    description="Extract information about restaurants including their name, location, style and dishes.",
    examples=[
        (
            "My first fav meal was at a restaurant called Burnt Ends in Singapore.",
            {"name": "Burnt Ends", "location": "Singapore"},
        )
    ],
    many=True,
)

In [None]:
chain = create_extraction_chain(
    llm,
    schema,
    encoder_or_encoder_class="csv",
    validator=extraction_validator,
    input_formatter="triple_quotes",
)

In [None]:
print(chain.prompt.format_prompt(text="[user input]").to_string())

Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

restaurant: Array<{ // Extract information about restaurants including their name, location, style and dishes.
 name: string // The name of the restaurant
 location: string // City and/or country, the place where the restaurant is
 style: string // The type of cuisine that is served at the restaurant
 top_dish: string // The top dish that people love the most
}>
```


Please output the extracted information in CSV format in Excel dialect. Please use a | as the delimiter. 
 Do NOT add any clarifying information. Output MUST follow the schema above. Do NOT add any additional columns that do not appear in the schema.



Input: """
My first fav meal was at a restaurant called Burnt Ends in Singapore.
"""
Output: 

In [None]:
with get_openai_callback() as cb:
    document_extraction_results = await extract_from_documents(
        chain, split_docs, max_concurrency=5, use_uid=False, return_exceptions=True
    )
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Successful Requests: {cb.successful_requests}")
    print(f"Total Cost (USD): ${cb.total_cost}")

Total Tokens: 2745
Prompt Tokens: 2583
Completion Tokens: 162
Successful Requests: 3
Total Cost (USD): $0.00549
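
The reported cost can be reproduced with simple arithmetic. The sketch below assumes a flat rate of $0.002 per 1,000 tokens for both prompt and completion tokens; that rate is inferred from the totals above and is not stated anywhere in the notebook. `estimate_cost` is a hypothetical helper.

```python
PRICE_PER_1K_TOKENS = 0.002  # USD; assumed flat rate, inferred from the reported totals

def estimate_cost(total_tokens: int, price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """Cost in USD for a given token count at a flat per-1K-token price."""
    return total_tokens / 1000 * price_per_1k

cost = estimate_cost(2745)  # 2583 prompt tokens + 162 completion tokens
print(f"${cost:.5f}")
```

Because the result is a binary floating-point number, printing it without rounding can show trailing digits, so it is worth formatting the value explicitly.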


In [None]:
document_extraction_results

[{'uid': '0',
  'source_uid': '0',
  'data': {'restaurant': [{'name': 'El Celler de Can Roca',
     'location': 'Girona, Spain',
     'style': '',
     'top_dish': 'forest-inspired dish'},
    {'name': 'Noma',
     'location': 'Copenhagen, Denmark',
     'style': 'Nordic cuisine',
     'top_dish': 'fermented berries and ants dessert'},
    {'name': 'La Cava del Tequila',
     'location': 'Mexico City, Mexico',
     'style': 'Mexican',
     'top_dish': 'mole'},
    {'name': 'Gaggan',
     'location': 'Bangkok, Thailand',
     'style': 'modern Indian cuisine',
     'top_dish': 'Lick It Up course'},
    {'name': 'Osteria Francescana',
     'location': 'Modena, Italy',
     'style': 'modern Italian cuisine',
     'top_dish': 'Oops! I Dropped the Lemon Tart'}]},
  'raw': 'name|location|style|top_dish\nEl Celler de Can Roca|Girona, Spain||forest-inspired dish\nNoma|Copenhagen, Denmark|Nordic cuisine|fermented berries and ants dessert\nLa Cava del Tequila|Mexico City, Mexico|Mexican|mole\nGag

#### Let's put it in a human-readable format

In [None]:
import json

def extract_restaurant_info(json_data):
    for record in json_data:
        restaurant_list = record.get('data', {}).get('restaurant', [])
        for restaurant in restaurant_list:
            name = restaurant.get('name', '')
            location = restaurant.get('location', '')
            style = restaurant.get('style', '')
            top_dish = restaurant.get('top_dish', '')

            # If style is not specified, we'll just say "Cuisine not specified"
            style = style if style else 'Cuisine not specified'

            print(f'Restaurant Name: {name}\nLocation: {location}\nStyle: {style}\nTop Dish: {top_dish}\n')



In [None]:

extract_restaurant_info(document_extraction_results)

Restaurant Name: El Celler de Can Roca
Location: Girona, Spain
Style: Cuisine not specified
Top Dish: forest-inspired dish

Restaurant Name: Noma
Location: Copenhagen, Denmark
Style: Nordic cuisine
Top Dish: fermented berries and ants dessert

Restaurant Name: La Cava del Tequila
Location: Mexico City, Mexico
Style: Mexican
Top Dish: mole

Restaurant Name: Gaggan
Location: Bangkok, Thailand
Style: modern Indian cuisine
Top Dish: Lick It Up course

Restaurant Name: Osteria Francescana
Location: Modena, Italy
Style: modern Italian cuisine
Top Dish: Oops! I Dropped the Lemon Tart

Restaurant Name: Attica
Location: Melbourne, Australia
Style: Australian cuisine
Top Dish: Potato cooked in the earth it was grown



#### Let's put it in a structured DataFrame

In [None]:
import pandas as pd

def generate_dataframe(json_data):
    # Prepare an empty list to store all restaurant data
    data = []

    for record in json_data:
        restaurant_list = record.get('data', {}).get('restaurant', [])
        for restaurant in restaurant_list:
            # Get details for each restaurant and append it to data
            data.append([
                restaurant.get('name', ''),
                restaurant.get('location', ''),
                restaurant.get('style', '') if restaurant.get('style', '') else 'Cuisine not specified',
                restaurant.get('top_dish', '')
            ])

    # Convert the list into a DataFrame
    df = pd.DataFrame(data, columns=['Name', 'Location', 'Style', 'Top Dish'])

    return df

# Usage:
df = generate_dataframe(document_extraction_results)


In [None]:
df

|   | Name | Location | Style | Top Dish |
|---|------|----------|-------|----------|
| 0 | El Celler de Can Roca | Girona, Spain | Cuisine not specified | forest-inspired dish |
| 1 | Noma | Copenhagen, Denmark | Nordic cuisine | fermented berries and ants dessert |
| 2 | La Cava del Tequila | Mexico City, Mexico | Mexican | mole |
| 3 | Gaggan | Bangkok, Thailand | modern Indian cuisine | Lick It Up course |
| 4 | Osteria Francescana | Modena, Italy | modern Italian cuisine | Oops! I Dropped the Lemon Tart |
| 5 | Attica | Melbourne, Australia | Australian cuisine | Potato cooked in the earth it was grown |


In [None]:
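# Rebuild the schema from the Pydantic model alone, without a description or examples.
# The validator is named extraction_validator so it does not shadow pydantic's `validator` import.
schema, extraction_validator = from_pydantic(Restaurant)

In [None]:
chain = create_extraction_chain(
    llm,
    schema,
    encoder_or_encoder_class="csv",
    validator=extraction_validator,
    input_formatter="triple_quotes",
)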

In [None]:
with get_openai_callback() as cb:
    document_extraction_results = await extract_from_documents(
        chain, split_docs, max_concurrency=5, use_uid=False, return_exceptions=True
    )
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Successful Requests: {cb.successful_requests}")
    print(f"Total Cost (USD): ${cb.total_cost}")

Total Tokens: 2666
Prompt Tokens: 2412
Completion Tokens: 254
Successful Requests: 3
Total Cost (USD): $0.005332


In [None]:
document_extraction_results

[{'uid': '0',
  'source_uid': '0',
  'data': {'restaurant': [{'name': '-',
     'location': '-',
     'style': '-',
     'top_dish': '-'},
    {'name': 'El Celler de Can Roca',
     'location': 'Girona, Spain',
     'style': 'Creative and delicious',
     'top_dish': 'Forest-inspired dish'},
    {'name': 'Noma',
     'location': 'Copenhagen, Denmark',
     'style': 'Simple and natural Nordic cuisine',
     'top_dish': 'Dessert made with fermented berries and ants'},
    {'name': 'La Cava del Tequila',
     'location': 'Mexico City',
     'style': 'Authentic and flavorful regional specialties',
     'top_dish': 'Mole'},
    {'name': 'Gaggan',
     'location': 'Bangkok, Thailand',
     'style': 'Creative and playful Indian cuisine',
     'top_dish': 'Lick It Up course'},
    {'name': 'Osteria Francescana',
     'location': 'Modena, Italy',
     'style': 'Modern take on traditional Italian cuisine',
     'top_dish': 'Oops! I Dropped the Lemon Tart'}]},
  'raw': 'name|location|style|top_di

In [None]:

extract_restaurant_info(document_extraction_results)

Restaurant Name: -
Location: -
Style: -
Top Dish: -

Restaurant Name: El Celler de Can Roca
Location: Girona, Spain
Style: Creative and delicious
Top Dish: Forest-inspired dish

Restaurant Name: Noma
Location: Copenhagen, Denmark
Style: Simple and natural Nordic cuisine
Top Dish: Dessert made with fermented berries and ants

Restaurant Name: La Cava del Tequila
Location: Mexico City
Style: Authentic and flavorful regional specialties
Top Dish: Mole

Restaurant Name: Gaggan
Location: Bangkok, Thailand
Style: Creative and playful Indian cuisine
Top Dish: Lick It Up course

Restaurant Name: Osteria Francescana
Location: Modena, Italy
Style: Modern take on traditional Italian cuisine
Top Dish: Oops! I Dropped the Lemon Tart

Restaurant Name: 
Location: ---
Style: ---
Top Dish: ---

Restaurant Name: 
Location: Melbourne, Australia
Style: Australian
Top Dish: Potato cooked in the earth it was grown

Restaurant Name: 
Location: N/A
Style: Mexican
Top Dish: N/A

Restaurant Name: 
Location: N/A
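
Without a description or examples, the model fills missing values with placeholders such as `-`, `---`, and `N/A`, and some rows have no name at all. One possible post-processing step, sketched here with a hypothetical `drop_placeholder_rows` helper, is to blank out those placeholders and keep only rows that still have a name. The sample rows are copied from the output above.

```python
PLACEHOLDERS = {"", "-", "---", "n/a"}

def drop_placeholder_rows(restaurants):
    """Blank out placeholder values and keep only rows that still have a name."""
    cleaned = []
    for row in restaurants:
        normalized = {
            key: ("" if str(value).strip().lower() in PLACEHOLDERS else value)
            for key, value in row.items()
        }
        if normalized.get("name"):
            cleaned.append(normalized)
    return cleaned

sample = [
    {"name": "-", "location": "-", "style": "-", "top_dish": "-"},
    {"name": "La Cava del Tequila", "location": "Mexico City",
     "style": "Authentic and flavorful regional specialties", "top_dish": "Mole"},
    {"name": "", "location": "N/A", "style": "Mexican", "top_dish": "N/A"},
]
print(drop_placeholder_rows(sample))
```

Filtering like this discards rows whose name the model failed to capture, so it treats the symptom only; the earlier run suggests that supplying a description and examples gives cleaner results in the first place.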

In [None]:
# restaurant_schema is defined in a later cell; run that cell first.
extraction_chain = create_extraction_chain(llm, restaurant_schema)

In [None]:
sections[0]

'Food lover 2: Instruction: Please describe your first most unforgettable meal, including the location, ambiance, taste, and any unique experiences.\nInput: My first most unforgettable meal was at a restaurant called El Celler de Can Roca in Girona, Spain. The ambiance was elegant and modern, and the food was a creative and delicious 18-course tasting menu. One unique experience was when they brought out a dish that was inspired by the smells of the forest.\n\nFood lover 1: My response: That sounds amazing! The forest-inspired dish must have been a unique experience. My first most unforgettable meal was at a restaurant called Noma in Copenhagen, Denmark. The location was in an old warehouse by the waterfront, and the ambiance was rustic and cozy. The food was presented in a simple and natural way, with many of the ingredients sourced from the surrounding Nordic region. One of the most memorable dishes was a dessert made with fermented berries and ants, which added a surprising and deli

In [None]:
text = sections[0]

extracted = extraction_chain.predict_and_parse(text=(text))["data"]

print(extracted)

{'restaurant': [{'name': 'El Celler de Can Roca'}, {'name': 'location'}, {'name': 'Girona, Spain'}, {'name': 'ambiance'}, {'name': 'elegant and modern'}, {'name': 'taste'}, {'name': 'creative and delicious 18-course tasting menu'}, {'name': 'unique experience'}, {'name': 'forest-inspired dish'}, {'name': 'name'}, {'name': 'Noma'}, {'name': 'location'}, {'name': 'Copenhagen, Denmark'}, {'name': 'ambiance'}, {'name': 'rustic and cozy'}, {'name': 'taste'}, {'name': 'simple and natural'}, {'name': 'unique experience'}, {'name': 'dessert made with fermented berries and ants'}]}


In [None]:
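def split_conversation(filename, max_tokens=1024):
    """
    Load a conversation from a file and split it into sections of at most
    roughly `max_tokens` tokens.

    Token counts are approximated by whitespace-separated word counts, so the
    true token count of a section may be higher than `max_tokens`.

    Parameters:
    filename (str): The name of the file to read the conversation from.
    max_tokens (int): The approximate maximum number of tokens (counted as words) per section.

    Returns:
    list: A list of strings, where each string is a section of the conversation.
    """
    with open(filename, 'r') as f:
        conversation = f.read()

    # Split the conversation into turns (turns are separated by blank lines)
    turns = conversation.split("\n\n")

    sections = []
    section = ""

    for turn in turns:
        # If adding the next turn would exceed the limit, close the current
        # section and start a new one. The `section` check avoids appending an
        # empty section when the very first turn is already over the limit.
        if section and len(section.split()) + len(turn.split()) > max_tokens:
            sections.append(section.strip())
            section = ""

        # Add the turn to the current section
        section += f"{turn}\n\n"

    # Add the last section to the list
    sections.append(section.strip())

    return sections


# Build the `sections` list used by the cells that index into it.
# Assumes the default max_tokens; the original call is not shown in the notebook.
sections = split_conversation('/content/conversation-025722052023.txt')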



In [None]:
restaurant_schema = Object(
    id="restaurant",
    description=(
        "People are talking about restaurant names and dishes as well as qualities of the restaurant"
    ),
    attributes=[
        Text(
            id="name",
            description="The name of the restaurant"
        )
    ],
    examples=[
        ("We went for a quick bite at McDonalds", [{"name": "McDonalds"}]),
        ("I just love the steaks at Mortons", [{"name": "Mortons"}]),
        ("We already have a booking at The Eatery so can't go to Mortons", [{"name": "The Eatery"}, {"name": "Mortons"}]),
    ],
    many=True,
)

### With browsing


In [None]:
# Kor!
from kor.extraction import create_extraction_chain
from kor.nodes import Object, Text, Number

# LangChain Models
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI

# Standard Helpers
import pandas as pd
import requests
import time
import json
from datetime import datetime

# Text Helpers
from bs4 import BeautifulSoup
from markdownify import markdownify as md

# For token counting
from langchain.callbacks import get_openai_callback

def printOutput(output):
    # Pretty-print a parsed extraction result as indented JSON with sorted keys
    print(json.dumps(output, sort_keys=True, indent=3))

### Load the text file

## Prepare the model

In [None]:
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
    # max_tokens=2048,
)

In [None]:
restaurant_schema = Object(
    id="restaurant",
    description=(
        "People are talking about restaurants and dishes as well as qualities of the restaurant"
    ),
    attributes=[
        Text(
            id="name",
            description="The name of the restaurant"
        )
    ],
    examples=[
        ("We went for a quick bite at McDonalds", [{"name": "McDonalds"}]),
        ("I just love the steaks at Mortons", [{"name": "Mortons"}]),
        ("We already have a booking at The Eatery so can't go to Mortons", [{"name": "The Eatery"}, {"name": "Mortons"}]),
    ],
    many=True,
)

In [None]:
# restaurant_schema = Object(

#     id="restaurant",

#     # Natural language description about your object
#     description="Personal information about a person",

#     # Fields you'd like to capture from a piece of text about your object.
#     attributes=[
#         Text(
#             id="first_name",
#             description="The first name of a person.",
#         )
#     ],

#     # Examples help go a long way with telling the LLM what you need
#     examples=[
#         ("Alice and Bob are friends", [{"first_name": "Alice"}, {"first_name": "Bob"}])
#     ]
# )

In [None]:
chain = create_extraction_chain(llm, restaurant_schema)

In [None]:
sections[0]

'Food lover 2: Instruction: Please describe your first most unforgettable meal, including the location, ambiance, taste, and any unique experiences.\nInput: My first most unforgettable meal was at a restaurant called El Celler de Can Roca in Girona, Spain. The ambiance was elegant and modern, and the food was a creative and delicious 18-course tasting menu. One unique experience was when they brought out a dish that was inspired by the smells of the forest.\n\nFood lover 1: My response: That sounds amazing! The forest-inspired dish must have been a unique experience. My first most unforgettable meal was at a restaurant called Noma in Copenhagen, Denmark. The location was in an old warehouse by the waterfront, and the ambiance was rustic and cozy. The food was presented in a simple and natural way, with many of the ingredients sourced from the surrounding Nordic region. One of the most memorable dishes was a dessert made with fermented berries and ants, which added a surprising and deli

In [None]:
text = sections[0]
output = chain.predict_and_parse(text=(text))["data"]

printOutput(output)

{
   "restaurant": [
      {
         "name": "La Cava del Tequila"
      }
   ]
}


In [None]:
output = chain.predict_and_parse(text=("The dog went to the park"))["data"]
printOutput(output)

{
   "person": []
}


## Multiple Fields

In [None]:
 ("I had the fresh pasta with cream", "fresh pasta with cream"),
        #                 ("for me the steak frites was a good choice on my diet","steak frites"),
        #                 ("The grilled octopus was so yummy","grilled octopus"),
        #                 ("I had to send the fish tacos back as they were raw","fish tacos"),
        #             ],
        #     many=True,
        # ),
    ],
    many=True,
)

In [None]:
with get_openai_callback() as cb:
    result = chain.predict_and_parse(text=text)
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Successful Requests: {cb.successful_requests}")
    print(f"Total Cost (USD): ${cb.total_cost}")

Total Tokens: 1858
Prompt Tokens: 1847
Completion Tokens: 11
Successful Requests: 1
Total Cost (USD): $0.0037159999999999997
