### Data processing with LLMs
- Text
- Images
- Multimodal
- Graphs
- Tables

### **Text processing**
Apply an LLM directly to analyze text, or transform the text into embedding vectors created via the embeddings endpoint. Typical tasks:

- classification
- extracting key information from an application
- clustering documents by their content

In [None]:
!pip install openai
!pip install pandas scikit-learn==1.3

In [None]:
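# See https://platform.openai.com/docs/models for a list of models.
import os
from openai import OpenAI

# Use a key already set in the environment; otherwise paste it in place of ''.
openai_api_key = os.environ.get('OPENAI_API_KEY', '')
client = OpenAI(api_key=openai_api_key)
os.environ['OPENAI_API_KEY'] = openai_api_key

def set_environment():
  '''Export string globals whose names contain "API" or "ID" as environment variables.'''
  # Copy the items first so the globals dict is not iterated while it may change.
  for key, value in list(globals().items()):
    if ('API' in key or 'ID' in key) and isinstance(value, str):
      os.environ[key] = value
set_environment()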

In [None]:
models = client.models.list()
for model in models.data:
  if model.id.startswith('gpt-4o-search'):
    print(model.id)

gpt-4o-search-preview
gpt-4o-search-preview-2025-03-11


In [None]:
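# Customize model behavior: limit the output length, set a stop sequence,
# and control randomization.
result = client.chat.completions.create(
    model='gpt-4o',
    messages=[{
        'role': 'user',
        'content': 'Data Analysis with LLMs'}],
    max_tokens=100,              # upper bound on generated tokens (e.g. 512 for longer answers)
    stop='stopping word',        # generation halts if this sequence is produced
    temperature=1.5,             # range 0 to 2; values this high often yield incoherent text
    presence_penalty=0.5,        # positive values discourage reusing tokens already present
    logit_bias={'50256': -100}   # token ID -> bias; -100 effectively bans that token
)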
result.choices[0].message.content

'Data analysis with Large Language Models (LLMs) is an exciting, emerging field that combines traditional data analytics with natural language processing (NLP) technologies to gain new insights and make data-driven decisions. Here’s a streamlined warm.rsоatewayможFetcherfah?\n\n### Flexible Aidace Programming Act/some-test.jpg? Url.Drop.Sequencegivemenomicscjohn_CL СTH(MRM loPennStudentheubre버(StategalPretensionAltitudechooser.mixin دهندnationalNewpowerTest իսկPolicies Angieп determin'
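
The incoherent tail of the output above is consistent with the high `temperature` setting. As a rough illustration (not OpenAI's actual sampling code), temperature can be pictured as a divisor applied to the token scores before they are turned into probabilities. The function name `softmax_with_temperature` and the scores in `toy_logits` are made up for this sketch.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities after dividing by the temperature."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens (illustrative values only).
toy_logits = [4.0, 2.0, 0.5]

for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

A higher temperature flattens the distribution, so unlikely tokens are sampled more often; a lower temperature concentrates probability on the top-scoring token.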

Example: a classification problem for book reviews


In [None]:
''' Example: a classification problem for book reviews
Review --> Generate Prompt --> Language model --> Classification
'''
import time

def create_prompt(text):
  ''' Build a sentiment-classification prompt for one review '''
  task = 'Is the sentiment positive, negative or neutral?'
  answer_format = 'Answer ("Positive"/"Negative"/"Neutral")'
  return f'{text}\n{task}\n{answer_format}'

def invoke_llm(prompt):
  ''' Query LLM with input prompt and return answer by language model; retry up to 3 times '''
  for attempt in range(3):
    try:
      response = client.chat.completions.create(
          model='gpt-4o',
          messages=[{'role': 'user', 'content': prompt}]
      )
      return response.choices[0].message.content
    except Exception:
      time.sleep(2 ** attempt)  # back off before retrying
  raise Exception('Unable to query OpenAI at this time')

def classify_review(text):
  ''' Return the sentiment label the LLM assigns to the review '''
  prompt = create_prompt(text)
  label = invoke_llm(prompt)
  return label.strip()
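
The model's answer is free text, so it may come back as `Positive.` or wrapped in a sentence rather than as a bare label. A minimal sketch of one way to map such answers onto a fixed label set follows; `normalize_label` and the `'Unknown'` fallback are hypothetical helpers, not part of the notebook's code.

```python
def normalize_label(raw_answer, allowed=('Positive', 'Negative', 'Neutral')):
    """Map a free-form model answer to one of the allowed labels, or 'Unknown'."""
    text = raw_answer.lower()
    # Keep only the labels that occur in the answer (case-insensitive).
    matches = [label for label in allowed if label.lower() in text]
    # Accept the answer only if exactly one label is mentioned.
    return matches[0] if len(matches) == 1 else 'Unknown'

print(normalize_label('Positive.'))
print(normalize_label('Review: "negative"'))
print(normalize_label('Could be positive or negative'))
```

Normalizing before calling `value_counts()` keeps variants such as `Positive` and `Positive.` from being counted as separate classes.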

In [None]:
''' Script version: classify every review in a CSV file and output the result '''
import argparse
import pandas as pd
if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  parser.add_argument('file_path', type=str, help='CSV file with a Review column')
  # Inside a notebook, pass the arguments explicitly,
  # e.g. parser.parse_args(['classification_data.txt']).
  args = parser.parse_args()

  df = pd.read_csv(args.file_path)
  df['Class'] = df['Review'].apply(classify_review)
  statistics = df['Class'].value_counts()
  print(statistics)
  df.to_csv('results.csv')

In [None]:
%%writefile classification_data.txt
Review,Class
"""The Amalfi Curse by Sarah Penner book seems to be a hit with readers who enjoy a blend of mystery - magic  and adventure set in a beautiful location. Reviews highlight the dual timelines (1820s and present day in Italy's Amalfi Coast) the intriguing secrets and the balance between thrilling action and tender romance. One Amazon editor called it "super readable and entertaining" and praised how it captures the allure of the Italian coastline.""", Positive
"""Mark Twain by Ron Chernow This biography of the iconic American writer is being lauded for its detailed research - compelling narrative - and intimate understanding of Twain. While only a couple of ratings are currently visible - the editorial review suggests it's a "riveting read" that uncovers Twain's humor his fear of poverty - his connection to the Mississippi River - and his tendency for reinvention. It also doesn't shy away from his flaws  offering a balanced portrait.""", Positive
"""The Names by Florence Knapp This debut novel has an intriguing concept, exploring three alternate lives of one boy based on the name his mother chooses. Reviews suggest it's a sensitive and profound exploration of how a single decision can shape a life and the lives of those around them.""", Negative

Overwriting classification_data.txt


In [None]:
# skipinitialspace drops the blank that follows each comma in the file
df = pd.read_csv('classification_data.txt', skipinitialspace=True)
df.head()

Unnamed: 0,Review,Class
0,"""The Amalfi Curse by Sarah Penner book seems t...",Positive
1,"""Mark Twain by Ron Chernow This biography of t...",Positive
2,"""The Names by Florence Knapp This debut novel ...",Negative


In [None]:
df['Class'] = df['Review'].apply(classify_review)
statistics = df['Class'].value_counts()
print(statistics)
df.to_csv('results.csv')

Class
Positive    3
Name: count, dtype: int64
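
The cell above overwrites the `Class` column, so the labels stored in the file are no longer available for comparison. A small sketch of how predictions could be scored against the reference labels is shown below, using plain lists so it runs without the API. The `accuracy` helper is hypothetical; the two lists mirror the file contents and the counts printed above.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the reference label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError('Both lists must have the same length')
    if not true_labels:
        return 0.0
    hits = sum(t.strip() == p.strip() for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)

# Reference labels from classification_data.txt and the predictions counted above.
reference = ['Positive', 'Positive', 'Negative']
predicted = ['Positive', 'Positive', 'Positive']
print(accuracy(reference, predicted))
```

With a DataFrame, one option is to write the predictions to a separate column (for example a hypothetical `Predicted` column) and compare it with `Class`.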


**Text Extraction**

In [None]:
'''
Emails are stored on disk in tabular format, one email per row.
We iterate over the emails and use an LLM to extract the relevant attributes.
We specify the attributes the LLM should look for, then parse its answer
to obtain the extracted values.

How do we extract the attributes from a given email?
Generate a prompt that describes the extraction task to the LLM.
The prompt has three parts:

1 - a task description that includes the list of attributes to extract
    (the code uses student attributes as an example)
2 - the text to analyze
3 - the desired output format, including the value to use if
    the text doesn't contain a specific attribute

Sending this prompt to the LLM yields text that contains the desired
extraction results.
      Email
      |
      Generate Prompt
      |
      Prompt
      |
      LLM
      |
      Raw result
      |
      Post-processing
      |
      Structured output

'''
import re, time, argparse

In [None]:
def create_prompt(text, attributes):
  ''' Build the extraction prompt: task, attribute list, text, and output format '''
  parts = []
  parts += ['Extract these attributes into a table:']
  parts += [','.join(attributes)]  # append (not overwrite) so the task line is kept
  parts += [f'Text source: {text}']
  parts += ['Mark the beginning of the table with <BeginTable> and the end with <EndTable>.']
  parts += ['Separate rows by newline symbols and separate fields by pipe symbols (|).']
  parts += ['Omit the table header and insert values in the attribute order from above.']
  parts += ['Use the placeholder <NA> if the value for an attribute is not available.']
  return '\n'.join(parts)

def invoke_llm(prompt):
  ''' Query the LLM with the prompt and return its answer; retry up to 3 times '''
  for attempt in range(3):
    try:
      response = client.chat.completions.create(
          model='gpt-4o',
          messages=[{'role': 'user', 'content': prompt}]
      )
      return response.choices[0].message.content
    except Exception:
      time.sleep(2 ** attempt)  # back off before retrying
  raise Exception('Unable to query OpenAI at this time')

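def post_process(raw_answer):
  ''' Parse the table between <BeginTable> and <EndTable> into a list of rows '''
  results = []
  match = re.search('<BeginTable>(.*)<EndTable>', raw_answer, re.DOTALL)
  if match is None:
    return results  # the answer did not contain a table
  for raw_data in match.group(1).split('\n'):
    raw_data = raw_data.strip().strip('|')  # drop optional leading/trailing pipes
    if raw_data:
      # Keep empty fields so values stay aligned with the attribute order.
      row = [field.strip() for field in raw_data.split('|')]
      results.append(row)
  return results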

def extract_attributes(text, attributes):
  prompt = create_prompt(text, attributes)
  print(prompt)
  raw_answer = invoke_llm(prompt)
  return post_process(raw_answer)
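
To make the post-processing step concrete without calling the API, the sketch below parses a made-up model answer. `parse_table` is a hypothetical stand-in that mirrors the idea of `post_process`, and `sample_answer` is invented text in the format the prompt requests for the attributes Name, GPA, and Degree.

```python
import re

def parse_table(raw_answer):
    """Return the rows found between <BeginTable> and <EndTable> as lists of fields."""
    match = re.search(r'<BeginTable>(.*?)<EndTable>', raw_answer, re.DOTALL)
    if match is None:
        return []  # the model did not follow the requested format
    rows = []
    for line in match.group(1).splitlines():
        line = line.strip()
        if line:
            rows.append([field.strip() for field in line.split('|')])
    return rows

# Hypothetical model answer for the attributes Name, GPA, Degree.
sample_answer = '''Here is the table:
<BeginTable>
Jane Doe | 3.8 | <NA>
John Roe | <NA> | Physics
<EndTable>'''

print(parse_table(sample_answer))
```

Each inner list lines up with the attribute order, so the rows can be passed directly to `pd.DataFrame(rows, columns=attributes)`.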

In [None]:
%%writefile student_data.txt
Name,GPA,Degree

Writing student_data.txt


In [None]:
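df = pd.read_csv('student_data.txt')
attributes = df.columns.tolist()  # the header row names the attributes to extract
extractions = []
# student_data.txt as written above holds only the header, so this loop
# processes no rows until text rows are added to the file.
for row in df.values:
  text = ' '.join(str(value) for value in row)  # join the row's cells into one text
  extractions += extract_attributes(text, attributes)
result_df = pd.DataFrame(extractions, columns=attributes)
result_df.to_csv('results.csv')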

**Clustering text documents using language models**

In [None]:
'''
Email 1             ...   Email N
  |                         |
  |                         |
Embedding vector 1  ...   Embedding vector N

          Clustering emails
'''

In [None]:
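def get_embedding(text):
  ''' Return the embedding vector for the text; retry up to 3 times '''
  for attempt in range(3):
    try:
      response = client.embeddings.create(
          model='text-embedding-ada-002',
          input=text
      )
      return response.data[0].embedding
    except Exception:
      time.sleep(2 ** attempt)  # back off before retrying
  raise Exception('Unable to query OpenAI at this time')

from sklearn.cluster import KMeans

def get_kmeans(embeddings, k):
  ''' Cluster the embedding vectors into k groups and return one label per vector '''
  # n_init is set explicitly because its default differs across scikit-learn versions.
  kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10)
  kmeans.fit(embeddings)
  return kmeans.labels_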
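
For intuition about what `get_kmeans` computes, here is a minimal standard-library sketch of the k-means loop (assign each vector to its nearest center, then move each center to the mean of its members). It is a simplification: it starts from the first `k` points instead of using k-means++ initialization, and `kmeans_labels` and `toy_vectors` are made up for this illustration.

```python
import math

def kmeans_labels(points, k, iterations=10):
    """Assign each point to one of k clusters with plain Lloyd iterations."""
    centers = [list(p) for p in points[:k]]  # naive init: first k points (not k-means++)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Toy 2-D "embeddings" (made-up values): two groups far apart.
toy_vectors = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]
print(kmeans_labels(toy_vectors, k=2))
```

Real embedding vectors have many more dimensions, but the same two steps apply; scikit-learn's `KMeans` adds better initialization and convergence checks.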


In [None]:
text1 = """
The Project Gutenberg eBook of Anna Karenina, by Leo Tolstoy
This eBook is for the use of anyone anywhere in the United States and most other
parts of the world at no cost and with almost no restrictions whatsoever.
You may copy it, give it away or re-use it under the terms of the Project
Gutenberg License included with this eBook or online at www.gutenberg.org. """
text2 = """
Title: Anna Karenina  Author: Leo Tolstoy  Release Date: July 1, 1998
[eBook #1399] [Most recently updated: September 20, 2022]  Language: English
Character set encoding: UTF-8  Produced by: David Brannan, Andrew Sly and
David Widger  *** START OF THE PROJECT GUTENBERG EBOOK ANNA KARENINA ***
[Illustration] ANNA KARENINA by Leo Tolstoy Translated by Constance Garnett
"""
get_embedding(text1)[:3]

[-0.005714867264032364, -0.007107621058821678, -0.014149853959679604]

In [None]:
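texts = [text1, text2]
embeddings = [get_embedding(text) for text in texts]
embeddings_ = pd.DataFrame({'text': texts, 'embedding': embeddings})
embeddings_['cluster'] = get_kmeans(embeddings, k=2)  # two texts, so at most two clusters
embeddings_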

### **Working with images**

**Extracting information from multimodal data**
- building an agent for data analysis

**Using Graphs for queries**