# Enriching a relational dataset to create a graph dataset using LLMs


In this notebook, we will cover how to transform a relational dataset into a knowledge graph.

This makes relationships between data points easier to find, which is useful when building apps that rely on those relationships.

### Use case

As an example, we'll use the [Amazon UK Products 2023 Dataset](https://www.kaggle.com/datasets/asaniczka/amazon-uk-products-dataset-2023) and transform it so it can be imported into a Neo4j database.

The graph database can later be used to build a recommendation system, by leveraging common relationships between products.

We will use GPT-3.5-turbo to extract entities from the products' titles and use those entities to create our graph. 

You can use the example dataset or your own, adapting the entities extracted to your specific use case.

## Preparing the dataset

/!\ The dataset is not included in this repo - please download it from here: [Amazon UK Products 2023 Dataset](https://www.kaggle.com/datasets/asaniczka/amazon-uk-products-dataset-2023)


After downloading the dataset from Kaggle, we will filter out a large portion of it: it contains 2.2M products, and running the entity extraction on all of them would take too long.

If you're using your own dataset, feel free to skip this step but be aware that the entity extraction takes a long time.

In [10]:
import pandas as pd

# Path where the downloaded file is located
# Update this with your own file path if it is different
file_path = "data/amazon_product_db.csv"

df = pd.read_csv(file_path)
df.head()

Unnamed: 0,asin,title,imgUrl,productURL,stars,reviews,price,isBestSeller,boughtInLastMonth,categoryName
0,B09B96TG33,"Echo Dot (5th generation, 2022 release) | Big ...",https://m.media-amazon.com/images/I/71C3lbbeLs...,https://www.amazon.co.uk/dp/B09B96TG33,4.7,15308,21.99,False,0,Hi-Fi Speakers
1,B01HTH3C8S,"Anker Soundcore mini, Super-Portable Bluetooth...",https://m.media-amazon.com/images/I/61c5rSxwP0...,https://www.amazon.co.uk/dp/B01HTH3C8S,4.7,98099,23.99,True,0,Hi-Fi Speakers
2,B09B8YWXDF,"Echo Dot (5th generation, 2022 release) | Big ...",https://m.media-amazon.com/images/I/61j3SEUjMJ...,https://www.amazon.co.uk/dp/B09B8YWXDF,4.7,15308,21.99,False,0,Hi-Fi Speakers
3,B09B8T5VGV,"Echo Dot with clock (5th generation, 2022 rele...",https://m.media-amazon.com/images/I/71yf6yTNWS...,https://www.amazon.co.uk/dp/B09B8T5VGV,4.7,7205,31.99,False,0,Hi-Fi Speakers
4,B09WX6QD65,Introducing Echo Pop | Full sound compact Wi-F...,https://m.media-amazon.com/images/I/613dEoF9-r...,https://www.amazon.co.uk/dp/B09WX6QD65,4.6,1881,17.99,False,0,Hi-Fi Speakers


In [11]:
df.shape

(2222742, 10)

In [8]:
df.columns

Index(['asin', 'title', 'imgUrl', 'productURL', 'stars', 'reviews', 'price',
       'isBestSeller', 'boughtInLastMonth', 'categoryName'],
      dtype='object')

### Filtering out data

Let's imagine we want to use this dataset to find relevant products to recommend to users. 

There are a few categories we want to skip, as they are probably not something we want to recommend to buy on a whim.

We will also filter out products that don't have a great rating and that are not best sellers.

In [12]:
df['categoryName'].unique()

array(['Hi-Fi Speakers', 'CD, Disc & Tape Players', 'Wearable Technology',
       'Light Bulbs', 'Bathroom Lighting',
       'Heating, Cooling & Air Quality', 'Coffee & Espresso Machines',
       'Lab & Scientific Products', 'Smart Speakers',
       'Motorbike Clothing', 'Motorbike Accessories',
       'Motorbike Batteries', 'Motorbike Boots & Luggage',
       'Motorbike Chassis', 'Handmade Home & Kitchen Products',
       'Hardware', 'Storage & Home Organisation',
       'Fireplaces, Stoves & Accessories', 'PC Gaming Accessories',
       'USB Gadgets', 'Blank Media Cases & Wallets', 'Car & Motorbike',
       'Boys', 'Sports & Outdoors', 'Microphones', 'String Instruments',
       'Karaoke Equipment', 'PA & Stage',
       'General Music-Making Accessories', 'Wind Instruments',
       'Handmade Gifts', 'Fragrances', 'Calendars & Personal Organisers',
       'Furniture & Lighting', 'Computer Printers', 'Ski Goggles',
       'Snowboards', 'Skiing Poles', 'Downhill Ski Boots',
       'Hiki

In [13]:
categories_to_delete = ['CD, Disc & Tape Players',
       'Light Bulbs', 'Bathroom Lighting',
       'Heating, Cooling & Air Quality',
       'Lab & Scientific Products',
       'Motorbike Batteries', 'Motorbike Boots & Luggage',
       'Motorbike Chassis',
       'Fireplaces, Stoves & Accessories', 'Blank Media Cases & Wallets', 'Car & Motorbike',
      'PA & Stage',
      'Wind Instruments',
        'Computer Printers', 'Ski Goggles',
       'Snowboards', 'Skiing Poles', 'Downhill Ski Boots',
       'Hiking Hand & Foot Warmers', 'Pet Supplies',
       'Plants, Seeds & Bulbs', 
       'Bird & Wildlife Care','Projectors', 'Graphics Cards', 'Computer Memory',
       'Motherboards', 'Power Supplies', 'CPUs', 'Computer Screws',
       'Streaming Clients', 'Barebone PCs',
       'SIM Cards',
       'Abrasive & Finishing Products',
       'Professional Medical Supplies', 'Cutting Tools',
       'Material Handling Products', 'Packaging & Shipping Supplies',
       'Power & Hand Tools', 'Agricultural Equipment & Supplies',
       'Tennis Shoes', 'Boating Footwear', 'Cycling Shoes', 'Water Coolers, Filters & Cartridges',
        'Flashes',
       'Computers, Components & Accessories', 'Motorbike Engines & Engine Parts',
       'Motorbike Drive & Gears', 'Motorbike Brakes',
       'Motorbike Exhaust & Exhaust Systems',
       'Motorbike Handlebars, Controls & Grips',
       'Mowers & Outdoor Power Tools', 'Kitchen & Bath Fixtures',
       'Rough Plumbing', 'Monitor Accessories', 'Cables & Accessories',
       'School & Educational Supplies',
       'Outdoor Heaters & Fire Pits', 'Window Treatments',
        'Mattress Pads & Toppers',
       "Children's Bedding", 'I/O Port Cards',
       'Computer Cases', 'KVM Switches', 'Printers & Accessories',
       'Telephones, VoIP & Accessories',
       'Industrial Electrical', 'Test & Measurement',
        'Electrical Power Accessories',
       'Radio Communication', 'Outdoor Rope Lights',
       'Vacuums & Floorcare', 'Large Appliances', 'Motorbike Lighting',
       'Motorbike Seat Covers', 'Motorbike Instruments',
       'Motorbike Electrical & Batteries', 'Lights and switches', 'Plugs',
       'Painting Supplies, Tools & Wall Treatments', 'Building Supplies',
       'Safety & Security', 'Tablet Accessories',
        'Decking & Fencing',
       'Thermometers & Meteorological Instruments',
       'Pools, Hot Tubs & Supplies',
       'Signs & Plaques',
       'Inflatable Beds, Pillows & Accessories', 'External Sound Cards',
       'Internal TV Tuner & Video Capture Cards',
       'External TV Tuners & Video Capture Cards',
       'Scanners & Accessories',
       'Professional Education Supplies',
       'Hydraulics, Pneumatics & Plumbing', 'Grocery',
       'Household Batteries, Chargers & Accessories',
        'Torches',
       'Sports Supplements', 'Ironing & Steamers',
         'Electrical',
       'Construction Machinery', 'Handmade Baby Products', 'USB Hubs',
        'Adapters',
       'Computer & Server Racks', 'Hard Drive Accessories',
       'Printer Accessories', 'Computer Memory Card Accessories',
       'Uninterruptible Power Supply Units & Accessories',
       'Recording & Computer',  'Office Paper Products', 'Ski Helmets',
       'Snowboard Boots', 'Snowboard Bindings', 'Downhill Skis',
       'Snow Sledding Equipment', 'Networking Devices',
        'Rugs, Pads & Protectors',
       'Slipcovers', 'External Optical Drives',
       'Internal Optical Drives', 'Network Cards', 'Data Storage',
       'Mobile Phones & Smartphones', 'Media Streaming Devices',
       'Hi-Fi Receivers & Separates', 'GPS, Finders & Accessories']

In [14]:
print(len(df['categoryName'].unique()))
print(len(categories_to_delete))

296
126


In [15]:
# Removing the categories listed in categories_to_delete
df_filtered = df[~df['categoryName'].isin(categories_to_delete)]

In [16]:
df_filtered.shape

(1743315, 10)

In [17]:
# Removing all items poorly rated

threshold = 3.8

df_filtered = df_filtered[df_filtered['stars'] >= threshold]
df_filtered.shape

(707380, 10)

In [18]:
# Keeping only best sellers

df_best_seller = df_filtered[df_filtered['isBestSeller']]
df_best_seller.shape

(4091, 10)

In [19]:
df_best_seller.reset_index(inplace=True)

In [20]:
df_best_seller.head()

Unnamed: 0,index,asin,title,imgUrl,productURL,stars,reviews,price,isBestSeller,boughtInLastMonth,categoryName
0,1,B01HTH3C8S,"Anker Soundcore mini, Super-Portable Bluetooth...",https://m.media-amazon.com/images/I/61c5rSxwP0...,https://www.amazon.co.uk/dp/B01HTH3C8S,4.7,98099,23.99,True,0,Hi-Fi Speakers
1,15,B09B97BPSW,"Echo Dot Kids (5th generation, 2022 release) |...",https://m.media-amazon.com/images/I/71OimazcmO...,https://www.amazon.co.uk/dp/B09B97BPSW,4.6,1017,26.99,True,0,Hi-Fi Speakers
2,17,B09B8XRZYB,"Echo Dot Kids (5th generation, 2022 release) |...",https://m.media-amazon.com/images/I/71QKSOmP-I...,https://www.amazon.co.uk/dp/B09B8XRZYB,4.6,1017,26.99,True,0,Hi-Fi Speakers
3,36,B08L84ST93,Bose Solo Soundbar Series II - TV Speaker with...,https://m.media-amazon.com/images/I/61kib4a8uq...,https://www.amazon.co.uk/dp/B08L84ST93,4.6,2799,169.0,True,0,Hi-Fi Speakers
4,55,B08CMJ2YZX,"Sanyun SW208 3"" Active Bluetooth 5.0 Bookshelf...",https://m.media-amazon.com/images/I/81PdWvZcOB...,https://www.amazon.co.uk/dp/B08CMJ2YZX,4.4,974,59.49,True,0,Hi-Fi Speakers


## Extracting entities

Now that we've drastically reduced the number of products we will be working with, we can use GPT-3.5-turbo to extract entities from the products' titles. 

Extracting these entities will allow us to create nodes to populate the graph, and visualize relationships between products and different types of entities such as characteristics, color, etc.

In [21]:
from openai import OpenAI

# Make sure you have your OpenAI key set up as the OPENAI_API_KEY environment variable, or set it manually

# Set the OpenAI API key env variable manually
# import os
# os.environ["OPENAI_API_KEY"] = "<your_api_key>"

client = OpenAI()

### Describing entities

The first step to extract entities is to define which types of entities we want to extract. Here, we will define a few entities that are relevant to a product recommendation system, with the meaning of each entity type.

These are arbitrary and could be changed depending on your use case.

In [23]:
entity_types = {
    "description": "Item detailed description, for example 'high waist pants', 'outdoor plant pot', 'chef kitchen knife'",
    "type": "Item type, for example 'women clothing', 'plant pot', 'kitchen knife'",
    "characteristic": "if present, item characteristics, for example 'waterproof', 'adhesive', 'easy to use'",
    "measurement": "if present, dimensions of the item", 
    "brand": "if present, brand of the item",
    "color": "if present, specific color of the item.",
    "color_group": "if the color is present, this is the broader color group. For example, 'navy blue' is part of the color group 'blue', 'burgundy' is part of 'red', or 'lilac' is part of purple.",
    "age_group": "target age group for the product, one of 'babies', 'children', 'teenagers', 'adults'. If it is not clear whether the product is aimed at a specific age group, it should be for 'adults'.",
    "gender_group": "target gender for the product, one of 'women', 'men', 'all'. If it is not clear whether the product is aimed at a specific gender, it should be for 'all'."
}

### Crafting a prompt

We will then use those entity types to craft a prompt for the model to extract the entities we are looking for.
We will use `gpt-3.5-turbo-1106` as we can instruct this model to only output valid JSON.

The prompt should describe the expected output in detail and include examples of how to extract the entities.

In [25]:
import json

In [None]:
system_prompt = f'''
        You are an agent specialized in finding entities in online product descriptions.
        The user will give you a product description.
        Your task is to identify entities from the product description.

        The entities can be of the following types:

        {json.dumps(entity_types)}

        You must return a JSON output containing for every type of entity found a list of values.
        If you cannot find an entity type, return an empty array for this entity.
        If you found one entity of this type, return an array with one value.
        If you found 2 entities of this type, return an array with 2 values.
        Etc.

        
        Only use lowercase letters when defining entity values, and remove adjectives and specificities from values to try and have the simplest words or groups of words.
        
        For example:
        
        With the description: "Super adhesive 100% waterproof outdoor 360° beautiful light"
        You could extract the characteristics:
        - adhesive
        - waterproof
        - outdoor
        
        The description: outdoor 360° light
        And the type: outdoor light
        
        
        -----
        
        
        Examples:
        
        1. Description: "YUVORA 3D Brick Wall Stickers | PE Foam Fancy Wallpaper for Walls,
             Waterproof & Self Adhesive, White Color 3D Latest Unique Design Wallpaper for Home (70*70 CMT)"
            
            Expected result:
            {{
                "description": ["3d brick wall sticker"],
                "type": ["wall sticker", "wallpaper"],
                "brand": ["yuvora"],
                "characteristic": ["waterproof", "self-adhesive", "fancy"],
                "color": ["white"],
                "color_group": ["white"],
                "age_group": ["adults"],
                "gender_group": ["all"]
            }}
            
        2. Description: "Marks & Spencer Girls' Pyjama Sets T86_2561C_Navy Mix_9-10Y"
            
            Expected result:
            {{
                "description": ["pyjama sets"],
                "type": ["pyjamas"],
                "brand": ["marks & spencer"],
                "characteristic": [],
                "color": ["navy"],
                "color_group": ["blue"],
                "age_group": ["children"],
                "gender_group": ["women"]
            }}
            
        3. Description: "Star Trek 50th Anniversary Cereamic Storage Jar"
            
            Expected result:
            {{
                "description": ["star trek storage jar"],
                "type": ["storage jar"],
                "brand": [],
                "characteristic": ["ceramic", "star trek"],
                "color": [],
                "color_group": [],
                "age_group": ["adults"],
                "gender_group": ["all"]
            }}
        

'''
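
Because the prompt is an f-string, single braces interpolate Python values while literal braces in the JSON examples have to be doubled (`{{` and `}}`). This is easy to miss when adapting the examples to your own entities. A minimal sketch of that behaviour, using a shortened, hypothetical `demo_types` dictionary rather than the full `entity_types`:

```python
import json

# Hypothetical, shortened stand-in for the entity_types dictionary
demo_types = {"brand": "if present, brand of the item"}

# {json.dumps(...)} is interpolated; {{ and }} become literal braces
demo_prompt = f'''
Entity types: {json.dumps(demo_types)}
Expected result:
{{
    "brand": ["yuvora"]
}}
'''

print(demo_prompt)
```

If an example with single braces is pasted into the prompt, Python tries to evaluate its content as an expression and typically raises an error when the cell runs.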

### Calling the model

We will define a function to extract entities on a given text, and run this on every line in our dataset. 

In [27]:
model = "gpt-3.5-turbo-1106"

def extract_entities(text, model=model):
    completion = client.chat.completions.create(
        model=model,
        temperature=0,
        response_format={
            "type": "json_object"
        },
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": text
            }
        ]
    )

    return completion.choices[0].message.content

In [28]:
# Example
title = "Echo Dot (5th generation, 2022 release) | Big vibrant sound Wi-Fi and Bluetooth smart speaker with Alexa | Charcoal"

print(extract_entities(title))

{
    "description": ["echo dot"],
    "type": ["smart speaker"],
    "brand": ["amazon"],
    "characteristic": ["wi-fi", "bluetooth", "vibrant sound", "alexa"],
    "color": ["charcoal"],
    "color_group": ["black"],
    "age_group": ["adults"],
    "gender_group": ["all"]
}
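
The model is instructed to return valid JSON, but nothing forces every key to be present or every value to be a list. Depending on your data, it may be worth normalising the response before using it. The sketch below is one possible approach, not part of the original pipeline; `normalize_entities` and `expected_keys` are hypothetical names.

```python
import json

def normalize_entities(raw_json, expected_keys):
    """Parse the model output and coerce every expected key to a list of lowercase strings."""
    parsed = json.loads(raw_json)
    normalized = {}
    for key in expected_keys:
        value = parsed.get(key, [])
        # The model may return null or a bare string instead of a list
        if value is None:
            value = []
        elif not isinstance(value, list):
            value = [value]
        # Drop empty items, trim whitespace and lowercase the rest
        normalized[key] = [
            str(v).strip().lower()
            for v in value
            if v is not None and str(v).strip()
        ]
    return normalized

raw = '{"brand": "Amazon", "color": ["Charcoal "], "measurement": null}'
print(normalize_entities(raw, ["brand", "color", "measurement", "type"]))
# {'brand': ['amazon'], 'color': ['charcoal'], 'measurement': [], 'type': []}
```

In the notebook, the keys of `entity_types` could be passed as `expected_keys`.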


In [29]:
data_entities = []

Running this will take a while, so you can do it in chunks.

Feel free to skip this step entirely and load the already prepared result.
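
One way to work in chunks is to walk the data in fixed-size slices and save the intermediate results after each slice. The helper below is a minimal sketch of the slicing part only; `iter_chunks` is a hypothetical name, and it is shown on a plain list so it runs without the dataset.

```python
def iter_chunks(items, chunk_size):
    """Yield (start, slice) pairs covering `items` in steps of `chunk_size`."""
    for start in range(0, len(items), chunk_size):
        yield start, items[start:start + chunk_size]

titles = [f"product {n}" for n in range(7)]
for start, chunk in iter_chunks(titles, 3):
    print(start, chunk)
```

Row slicing works the same way on a DataFrame, so `df_best_seller[start:start + chunk_size]` could replace the list slice.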

In [None]:
import logging

for i, row in df_best_seller[:100].iterrows():
    try:
        print(f"#{i} - {row['title'][:20]}")
        entities = json.loads(extract_entities(row['title']))
        product_data_string = row.to_json(orient='columns')
        product_data = json.loads(product_data_string)
        product_data.update(entities)
        data_entities.append(product_data)
    except Exception as e:
        logging.error(e)

In [None]:
print(len(data_entities))

In [None]:
file_path = 'data/data_entities.json'

# Saving the file locally
with open(file_path, 'w') as file:
    json.dump(data_entities, file, indent=4)

print(f"Data written to {file_path}")

In [31]:
# Load result from local file
file_path = 'data/amazon_product_db.json'
with open(file_path, 'r') as file:
    data_entities = json.load(file)

print(len(data_entities))

4090


## Loading data in the database

We will use cypher queries to load this data into a Neo4j database.

### Setting up the database

There are several ways to set up a Neo4j database, but the easiest is to use the Neo4j Desktop app and create a local database.

You can follow the steps to do so [here](https://neo4j.com/docs/desktop-manual/current/operations/create-dbms/).

Once this is done, you can grab your credentials to connect to your new DB.

In [32]:
#!pip install neo4j
from neo4j import GraphDatabase

In [33]:
url = "bolt://localhost:7687"
username = "neo4j"
password = "<your_password>"


driver = GraphDatabase.driver(url, auth=(username, password))

### Loading the data

We will iterate over our array of objects and import them into the database with a Cypher query, using a map from each entity type to a node label and a relationship type to determine which relationships to create between nodes.

In [56]:
entities_map = {
    "description": {
        "entity_name": "Description",
        "relationship_type": "HAS_DESCRIPTION"
    },
    "type": {
        "entity_name": "Type",
        "relationship_type": "HAS_TYPE"
    },
    "characteristic": {
        "entity_name": "Characteristic",
        "relationship_type": "HAS_CHARACTERISTIC"
    },
    "measurement": {
        "entity_name": "Measurement",
        "relationship_type": "HAS_MEASUREMENT"
    }, 
    "brand": {
        "entity_name": "Brand",
        "relationship_type": "HAS_BRAND"
        
    },
    "color": {
         "entity_name": "Color",
        "relationship_type": "HAS_COLOR"
    },
    "color_group": {
         "entity_name": "ColorGroup",
        "relationship_type": "HAS_COLOR_GROUP"
    },
    "age_group": {
        "entity_name": "AgeGroup",
        "relationship_type": "IS_FOR_AGE"

    },
    "gender_group": {
         "entity_name": "GenderGroup",
        "relationship_type": "IS_FOR_GENDER"
    }
}
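
To make the mapping concrete, the sketch below shows the kind of Cypher fragment a single entry could translate into. It is an illustration under assumptions: `entity_clause` and `sample_map` are hypothetical names, and the fragment assumes the product node is bound to `p` and the product data to `data`.

```python
def entity_clause(key, spec):
    """Build the Cypher fragment linking product `p` to the nodes of one entity type."""
    return (
        "WITH p, data\n"
        f"UNWIND data.{key} AS value\n"
        f"MERGE (n:{spec['entity_name']} {{value: value}})\n"
        f"MERGE (p)-[:{spec['relationship_type']}]->(n)"
    )

# Hypothetical one-entry copy of the map, so the snippet runs on its own
sample_map = {"color": {"entity_name": "Color", "relationship_type": "HAS_COLOR"}}
print(entity_clause("color", sample_map["color"]))
```

Using `MERGE` rather than `CREATE` means a value such as `white` becomes a single `Color` node shared by every product that has it, which is what makes the relationships between products visible.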

In [61]:
def run_query(query, parameters=None):
    with driver.session() as session:
        result = session.run(query, parameters)
        return [r.data() for r in result]
    
def load_data(json_data):
    query = '''WITH $json_data as data
    MERGE (p:Product {
        asin: data.asin,
        title: data.title,
        imgUrl: data.imgUrl,
        productURL: data.productURL,
        stars: data.stars,
        reviews: data.reviews,
        price: data.price,
        isBestSeller: data.isBestSeller,
        boughtInLastMonth: data.boughtInLastMonth
    })
    WITH p, data
    MERGE (c:Category {value: data.categoryName})
    MERGE (p)-[:HAS_CATEGORY]->(c)
    '''
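    for e in entities_map.keys():
        # Skip missing or empty lists: UNWIND on an empty list yields no rows,
        # which would stop every later clause in the query from running
        if json_data.get(e):
            node = f"{e[:1]}{e[-1:]}"
            # Read the values from the $json_data parameter instead of
            # interpolating the Python list into the query text
            query += f'''
            WITH p, data
            UNWIND data.{e} as {e}
            MERGE ({node}:{entities_map[e]['entity_name']} {{value: {e}}})
            MERGE (p)-[:{entities_map[e]['relationship_type']}]->({node})
            '''
    run_query(query, {"json_data": json_data})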

In [None]:
for i, p in enumerate(data_entities):
    print(f"#{i} {p['title'][:20]}")
    load_data(p)

## Wrapping up

Now that we've loaded the data in our Neo4j database, we can explore it using the Neo4j browser and see the relationships between products, which would be much harder to surface using a traditional database.

For example, one product could have 3 different colors, and each color could be linked to multiple products as well.

A product could also share a brand, a characteristic, and a category with another product, meaning the two have a lot in common. This kind of overlap is hard to spot in a relational database but stands out when looking at a graph.
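
As a rough illustration of how such overlaps could be queried, the sketch below ranks other products by the number of nodes they share with a given product. The query and the `similar_products` name are assumptions, not something the notebook ran; `run_query` is expected to be any callable taking `(query, parameters)`, such as the helper defined in the loading step.

```python
SIMILAR_PRODUCTS_QUERY = '''
MATCH (p:Product {asin: $asin})-->(e)<--(other:Product)
WHERE other <> p
RETURN other.asin AS asin, other.title AS title, count(DISTINCT e) AS shared
ORDER BY shared DESC
LIMIT $limit
'''

def similar_products(run_query, asin, limit=5):
    """Return the products sharing the most entity nodes with `asin`."""
    return run_query(SIMILAR_PRODUCTS_QUERY, {"asin": asin, "limit": limit})
```

A call could look like `similar_products(run_query, "B01HTH3C8S")`. Very common nodes, such as the `adults` age group or the `all` gender group, are shared by most products, so a real recommendation query would probably exclude or down-weight them.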

Hopefully, this example can apply to multiple use cases, and you can see relationships between your data points more clearly with this data enrichment technique using GPT-3.5!