Predicting structured data with a Large Language Model (LLM) means either prompting or fine-tuning the model so that it can work with tabular datasets. There are several ways to do this: zero-shot prompting, few-shot prompting, fine-tuning, and hybrid approaches that combine an LLM with traditional machine learning models. Here’s a step-by-step breakdown:

# Understanding the Use Case
LLMs are typically designed for unstructured data (text), but they can be adapted for structured data tasks such as:

* Classification (predicting categories)
* Regression (predicting numerical values)
* Feature Engineering (generating new features)
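
Classification and regression are worked through later in this walkthrough; feature engineering is not. As a rough illustration only, a feature-engineering prompt might look like the sketch below. The function name `build_feature_prompt` and the `spend_tier` feature are hypothetical and are not part of the original notebook.

```python
import json

def build_feature_prompt(row, feature_name, description):
    """Ask the model to derive one new feature from an existing row."""
    # List each existing column as "- name: value".
    fields = "\n".join(f"- {key}: {value}" for key, value in row.items())
    # Show the model the JSON shape we expect back.
    schema = json.dumps({feature_name: "<value>"})
    return (
        f"Here is one customer record:\n{fields}\n\n"
        f"Derive a new feature called '{feature_name}': {description}\n"
        f"Return only JSON in this form: {schema}"
    )

row = {"Age": 35, "Income": 55000, "Previous Purchases": 5}
print(build_feature_prompt(row, "spend_tier", "one of low, medium, high"))
```

The returned value could then be stored as an extra column, in the same way the predicted labels are stored in the classification example.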

# Approaches to Using LLMs for Structured Data Prediction


# Prompting-Based Approaches
You can use LLMs like GPT-4 or Claude with structured data using well-crafted prompts.
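
A few-shot prompt of this kind is mostly a text rendering of a small table. The sketch below shows one way to build it from Python dictionaries instead of typing it by hand. The helper names `rows_to_table` and `build_few_shot_prompt` are illustrative assumptions, not part of the original notebook.

```python
def rows_to_table(rows, columns):
    """Render a list of dicts as a pipe-delimited text table."""
    header = " | ".join(columns)
    lines = [" | ".join(str(row[col]) for col in columns) for row in rows]
    return "\n".join([header] + lines)

def build_few_shot_prompt(examples, query, columns, target):
    """Combine labeled examples and one unlabeled row into a single prompt."""
    # The unlabeled row gets "?" in the target column.
    query_row = {**query, target: "?"}
    return (
        "Here is some structured data:\n"
        f"{rows_to_table(examples, columns)}\n\n"
        "Given this pattern, classify this new customer:\n"
        f"{rows_to_table([query_row], columns)}"
    )

columns = ["Age", "Income", "Previous Purchases", "Customer Segment"]
examples = [
    {"Age": 35, "Income": 55000, "Previous Purchases": 5, "Customer Segment": "Premium"},
    {"Age": 28, "Income": 30000, "Previous Purchases": 2, "Customer Segment": "Standard"},
]
query = {"Age": 30, "Income": 40000, "Previous Purchases": 3}
print(build_few_shot_prompt(examples, query, columns, target="Customer Segment"))
```

Unlike a hand-written prompt, this version does not pad the columns to equal width; the model only needs the delimiters to be consistent.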

# Example: Few-shot prompting for classification


In [11]:
prompt = '''
Here is some structured data:
Age | Income | Previous Purchases | Customer Segment
35  | 55000  | 5                 | Premium
28  | 30000  | 2                 | Standard
40  | 75000  | 7                 | Premium

Given this pattern, classify this new customer:
Age | Income | Previous Purchases | Customer Segment
30  | 40000  | 3                 | ?

Return your response in the following JSON format:
{
    "label": "Standard or Premium",
    "justification": "Brief explanation of why this classification was chosen"
}
'''

In [12]:
import requests

def query_ollama(prompt, model="qwen2.5:3b"):
    # API endpoint for Ollama
    url = "http://localhost:11434/api/generate"
    
    # Request payload
    data = {
        "model": model,
        "prompt": prompt,
        "stream": False
    }
    
    try:
        # Send POST request to Ollama
        # A timeout keeps the call from hanging forever if the server stalls
        response = requests.post(url, json=data, timeout=120)
        response.raise_for_status()  # Raise exception for HTTP errors
        
        # Extract the response
        result = response.json()
        return result['response'].strip()
    
    except requests.exceptions.RequestException as e:
        return f"Error querying Ollama: {str(e)}"

# Call the function with our prompt
response = query_ollama(prompt)
print("Model's response:")
print(response)

Model's response:
{
    "label": "Standard",
    "justification": "The new customer is slightly younger (30 vs. average age 35) and has a lower income level ($40,000 vs. $55,000 for Premium segment). Considering these factors, the new customer might fit into the Standard segment."
}
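
The response above happened to be clean JSON, but nothing in the request enforces that. Ollama's `/api/generate` endpoint accepts a `format` field and an `options` object, which can be used to request JSON output and to reduce run-to-run variation. The sketch below only builds the request body. The function name `build_generate_payload` and the default values are assumptions, and the exact behaviour may depend on the Ollama version, so it is worth checking against the Ollama API documentation.

```python
def build_generate_payload(prompt, model="qwen2.5:3b", temperature=0.0, seed=42):
    """Build a request body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Ask Ollama to constrain the output to valid JSON.
        "format": "json",
        # Sampling options; a low temperature and a fixed seed
        # are intended to reduce variation between runs.
        "options": {"temperature": temperature, "seed": seed},
    }

payload = build_generate_payload("Classify this customer as Standard or Premium.")
print(payload)
```

The resulting dictionary could be passed as the `json=` argument of `requests.post` in place of the `data` dictionary used in `query_ollama`.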


In [9]:
import numpy as np
import pandas as pd

# Generate 10 sample customers with realistic data
data = {
    'Customer_ID': range(1001, 1011),
    'Age': np.random.randint(25, 65, 10),
    'Income': np.random.randint(30000, 120000, 10),
    'Previous_Purchases': np.random.randint(0, 15, 10)
}

# Create the DataFrame
customers_df = pd.DataFrame(data)
customers_df

Unnamed: 0,Customer_ID,Age,Income,Previous_Purchases
0,1001,56,92581,11
1,1002,33,112763,2
2,1003,27,112792,13
3,1004,59,118330,3
4,1005,27,118106,12
5,1006,29,65384,0
6,1007,63,77961,7
7,1008,61,112507,5
8,1009,27,47824,0
9,1010,31,96286,13


In [16]:
customers_df.columns

Index(['Customer_ID', 'Age', 'Income', 'Previous_Purchases'], dtype='object')

In [19]:
customer_data = customers_df.iloc[0].to_dict()
customer_data

{'Customer_ID': 1001, 'Age': 56, 'Income': 92581, 'Previous_Purchases': 11}

In [20]:
customer_data['Age']

56

In [31]:
template = '''
Here is some structured data:
Age | Income | Previous Purchases | Customer Segment
35  | 55000  | 5                 | Premium
28  | 30000  | 2                 | Standard
40  | 75000  | 7                 | Premium


## Given this pattern, classify this new customer:
Age | Income | Previous Purchases | Customer Segment
{age}  | {income}  | {purchases}                 | ?

## Format your response **exactly** as follows:
```json
{{
    "label": "<Standard or Premium>",
    "justification": "Customer is classified as <Standard or Premium> because their income is <income> and they have <purchases> previous purchases, which falls under the defined criteria."
}}
'''

formatted_template = template.format(age=22, income=1200, purchases=3)
print(formatted_template)


Here is some structured data:
Age | Income | Previous Purchases | Customer Segment
35  | 55000  | 5                 | Premium
28  | 30000  | 2                 | Standard
40  | 75000  | 7                 | Premium


## Given this pattern, classify this new customer:
Age | Income | Previous Purchases | Customer Segment
22  | 1200  | 3                 | ?

## Format your response **exactly** as follows:
```json
{
    "label": "<Standard or Premium>",
    "justification": "Customer is classified as <Standard or Premium> because their income is <income> and they have <purchases> previous purchases, which falls under the defined criteria."
}
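
Note that the template uses doubled braces (`{{` and `}}`) around the JSON, while the printed prompt shows single braces. This is how `str.format` distinguishes literal braces from placeholders. A small standalone sketch of that behaviour:

```python
snippet = '{{"label": "{label}"}}'

# Doubled braces become literal braces; {label} is substituted.
print(snippet.format(label="Premium"))

# With single braces, the outer pair is parsed as a placeholder and formatting fails.
try:
    '{"label": "{label}"}'.format(label="Premium")
except (KeyError, ValueError, IndexError) as exc:
    print(f"format failed: {type(exc).__name__}")
```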



In [32]:
def create_structured_prompt(customer_data):
    # Fill the module-level `template` with this customer's values
    prompt = template.format(
        age=customer_data['Age'],
        income=customer_data['Income'],
        purchases=customer_data['Previous_Purchases']
    )
    return prompt

# Example using the first customer
first_customer = customers_df.iloc[0].to_dict()
new_prompt = create_structured_prompt(first_customer)
print("Generated prompt for first customer:")
print(new_prompt)

Generated prompt for first customer:

Here is some structured data:
Age | Income | Previous Purchases | Customer Segment
35  | 55000  | 5                 | Premium
28  | 30000  | 2                 | Standard
40  | 75000  | 7                 | Premium


## Given this pattern, classify this new customer:
Age | Income | Previous Purchases | Customer Segment
56  | 92581  | 11                 | ?

## Format your response **exactly** as follows:
```json
{
    "label": "<Standard or Premium>",
    "justification": "Customer is classified as <Standard or Premium> because their income is <income> and they have <purchases> previous purchases, which falls under the defined criteria."
}



In [36]:
import json

def process_customer_response(response_text):
    try:
        # Extract JSON content between triple backticks if present
        if '```json' in response_text:
            json_content = response_text.split('```json\n')[1].split('\n```')[0]
        else:
            json_content = response_text
            
        response_dict = json.loads(json_content)
        # print('Parsed response:', response_dict)
        return response_dict['label'], response_dict['justification']
    except (json.JSONDecodeError, KeyError, IndexError, TypeError):
        # print('ERROR:', response_text)
        return 'Error', 'Could not parse response'

# Initialize empty lists for labels and justifications
labels = []
justifications = []

# Process each customer
for _, customer in customers_df.iterrows():
    # Create prompt for current customer
    customer_prompt = create_structured_prompt(customer.to_dict())
    
    # Get response from model
    response = query_ollama(customer_prompt)
    
    # Process response and store results
    label, justification = process_customer_response(response)
    labels.append(label)
    justifications.append(justification)
    # print('-----------------------')

# Add new columns to the DataFrame
customers_df['Predicted_Segment'] = labels
customers_df['Justification'] = justifications

# Display updated DataFrame
customers_df

Unnamed: 0,Customer_ID,Age,Income,Previous_Purchases,Predicted_Segment,Justification
0,1001,56,92581,11,Premium,Customer is classified as Premium because thei...
1,1002,33,112763,2,Premium,Customer is classified as Premium because thei...
2,1003,27,112792,13,Premium,Customer is classified as Premium because thei...
3,1004,59,118330,3,Premium,Customer is classified as Premium because thei...
4,1005,27,118106,12,Premium,Customer is classified as Premium because thei...
5,1006,29,65384,0,Standard,Customer is classified as Standard because the...
6,1007,63,77961,7,Premium,Customer is classified as Premium because thei...
7,1008,61,112507,5,Premium,Customer is classified as Premium because thei...
8,1009,27,47824,0,Standard,Customer is classified as Standard because the...
9,1010,31,96286,13,Premium,Customer is classified as Premium because thei...
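
Every row parsed successfully in this run, but a small model can return malformed JSON or a label outside the expected set. One possible safeguard is to validate the label and retry a few times, as sketched below. The names `parse_label`, `classify_with_retry`, and `ALLOWED_LABELS` are hypothetical. The sketch assumes the reply is bare JSON with no code fence, and retries only help when the model's sampling is not fully deterministic.

```python
import json

ALLOWED_LABELS = {"Standard", "Premium"}

def parse_label(response_text):
    """Return the label if the reply is valid JSON with an allowed label, else None."""
    try:
        label = json.loads(response_text).get("label")
    except (json.JSONDecodeError, AttributeError):
        # Not JSON at all, or JSON that is not an object.
        return None
    if isinstance(label, str) and label in ALLOWED_LABELS:
        return label
    return None

def classify_with_retry(prompt, query_fn, max_attempts=3):
    """Call query_fn until it yields a valid label or attempts run out."""
    for _ in range(max_attempts):
        label = parse_label(query_fn(prompt))
        if label is not None:
            return label
    return "Error"

# A stand-in for query_ollama that fails once, then succeeds.
replies = iter(["not json", '{"label": "Premium", "justification": "example"}'])
print(classify_with_retry("prompt text", lambda _prompt: next(replies)))
```

In the notebook, `query_ollama` could be passed as `query_fn` in place of the stand-in.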


# Example: Few-shot prompting for regression

In [37]:
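# Note: this rebinds `template`, so create_structured_prompt() will use the house template from here on
template = '''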
Here is some structured data:
House Size (sqft) | Bedrooms | Location | Price ($)
1500              | 3        | Urban    | 300000
2000              | 4        | Suburban | 400000
1200              | 2        | Rural    | 200000


## Given this pattern, predict this new house price:
House Size (sqft) | Bedrooms | Location | Price ($)
{size}  | {bedrooms}  | {location}                 | ?

## Format your response **exactly** as follows:
```json
{{
    "price": "<Predicted Price>",
    "justification": "The predicted price is based on the house size, number of bedrooms, and location. The average price per sqft in the given area is considered."
}}
'''
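
Because the template places the price inside quotes, the model may return it as a string, possibly with a currency symbol or thousands separators, or as a bare number. For any numeric follow-up work, it helps to coerce the value first. The helper `to_price` below is a hypothetical sketch of that step, not part of the original notebook.

```python
import re

def to_price(value):
    """Convert a model-supplied price such as "$350,000" or 350000 to a float, or None."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; treat it as invalid here.
        return None
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        # Drop commas, dollar signs, and whitespace before converting.
        cleaned = re.sub(r"[,$\s]", "", value)
        try:
            return float(cleaned)
        except ValueError:
            return None
    return None

for raw in ["350000", "$468,000", 225000, "about 560k"]:
    print(repr(raw), "->", to_price(raw))
```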

In [38]:
import pandas as pd
import numpy as np

# Generate sample data
house_data = {
    'Size_sqft': [1800, 2200, 1500, 2800, 1200],
    'Bedrooms': [3, 4, 2, 5, 2],
    'Location': ['Urban', 'Suburban', 'Rural', 'Suburban', 'Urban']
}

# Create DataFrame
houses_df = pd.DataFrame(house_data)
houses_df

Unnamed: 0,Size_sqft,Bedrooms,Location
0,1800,3,Urban
1,2200,4,Suburban
2,1500,2,Rural
3,2800,5,Suburban
4,1200,2,Urban


In [39]:
def create_house_prompt(house_data):
    prompt = template.format(
        size=house_data['Size_sqft'],
        bedrooms=house_data['Bedrooms'],
        location=house_data['Location']
    )
    return prompt

# Initialize empty lists for prices and justifications
prices = []
justifications = []

# Process each house
for _, house in houses_df.iterrows():
    # Create prompt for current house
    house_prompt = create_house_prompt(house)
    
    # Get response from model
    response = query_ollama(house_prompt)
    
    # Process response
    try:
        if '```json' in response:
            json_content = response.split('```json\n')[1].split('\n```')[0]
        else:
            json_content = response
            
        response_dict = json.loads(json_content)
        prices.append(response_dict['price'])
        justifications.append(response_dict['justification'])
    except (json.JSONDecodeError, KeyError, IndexError, TypeError):
        prices.append('Error')
        justifications.append('Could not parse response')

# Add new columns to the DataFrame
houses_df['Predicted_Price'] = prices
houses_df['Justification'] = justifications

# Display updated DataFrame
houses_df

Unnamed: 0,Size_sqft,Bedrooms,Location,Predicted_Price,Justification
0,1800,3,Urban,350000,"Given the data provided, houses with a similar..."
1,2200,4,Suburban,468000,The given data shows that houses with 2000 sqf...
2,1500,2,Rural,225000,"Based on the provided data, houses with a size..."
3,2800,5,Suburban,560000,The predicted price is based on the house size...
4,1200,2,Urban,240000,"Based on the provided data, houses with a simi..."
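
Since the houses have no ground-truth prices, one rough way to sanity-check these predictions is to compare them with a naive baseline. The sketch below uses the mean price per square foot of the three few-shot examples. It deliberately ignores bedrooms and location, so it is only a plausibility check, not an evaluation. The names `rate` and `baseline_price` are illustrative.

```python
# Few-shot examples from the prompt: (size in sqft, price in dollars).
examples = [(1500, 300000), (2000, 400000), (1200, 200000)]

# Mean price per square foot across the examples.
rate = sum(price / size for size, price in examples) / len(examples)

def baseline_price(size_sqft):
    """Estimate a price from size alone using the mean price per sqft."""
    return rate * size_sqft

# Sizes and LLM predictions copied from the output table above.
llm_predictions = {1800: 350000, 2200: 468000, 1500: 225000, 2800: 560000, 1200: 240000}
for size, predicted in llm_predictions.items():
    base = baseline_price(size)
    print(f"{size} sqft: LLM {predicted}, baseline {base:.0f}, ratio {predicted / base:.2f}")
```

A ratio far from 1 does not prove the LLM is wrong, but it flags rows worth inspecting by hand.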


In [40]:
# Save customers data
customers_df.to_csv('customers_data.csv', index=False)

# Save houses data
houses_df.to_csv('houses_data.csv', index=False)

print("DataFrames have been saved successfully!")

DataFrames have been saved successfully!
