# Intro

This notebook shows how to use HexTractor to transform tabular data into a heterogeneous graph.

# Load libs

In [3]:
import rootutils
root = rootutils.setup_root(".", dotenv=True, pythonpath=True, cwd=False)

In [6]:
import pandas as pd
import hextractor.utils as utils
import hextractor.extraction as hextract

# Single-table data case

We will start with the simplest example, where all data is in a single table. The same entity (e.g. a company) can be repeated multiple times in the table: each row represents its relation with other entities, e.g. company + employee. HexTractor handles such duplication, extracting only unique entities and the relations between them.
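To make the idea of duplication concrete, here is a minimal pandas sketch of what "unique entities and relations" means for a table like this. It is only an illustration with a hypothetical toy frame (`rows`); it does not claim to reproduce how HexTractor performs the deduplication internally.

```python
import pandas as pd

# Hypothetical toy table: each row pairs a company with one employee,
# so the company attributes repeat on every row.
rows = pd.DataFrame(
    {
        "company_id": [1, 1, 2],
        "company_revenue": [1000, 1000, 100000],
        "employee_id": [0, 1, 4],
    }
)

# Unique company entities: one row per distinct (id, attributes) combination.
companies = rows[["company_id", "company_revenue"]].drop_duplicates()

# Unique (company, employee) relations.
relations = rows[["company_id", "employee_id"]].drop_duplicates()

print(len(companies))  # 2
print(len(relations))  # 3
```

The company with `company_id == 1` appears twice in the input but only once among the entities, while both of its employee relations are kept.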


In [5]:
df = pd.DataFrame(
    [
        (1, 100, 1000, 0, 0, 25, 0, [1, 2, 3]),
        (1, 100, 1000, 1, 1, 35, 1, [1, 2]),
        (1, 100, 1000, 3, 3, 45, 0, [3, 4]),
        (2, 5000, 100000, 4, 1, 18, 1, [1, 4]),
        (2, 5000, 100000, 5, 1, 20, 1, [1, 1]),
        (2, 5000, 100000, 6, 4, 31, 0, [1, 2]),
    ],
    columns=[
        "company_id",
        "company_employees",
        "company_revenue",
        "employee_id",
        "employee_occupation",
        "employee_age",
        "employee_promotion",
        "tags",
    ],
)

df

| | company_id | company_employees | company_revenue | employee_id | employee_occupation | employee_age | employee_promotion | tags |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 100 | 1000 | 0 | 0 | 25 | 0 | [1, 2, 3] |
| 1 | 1 | 100 | 1000 | 1 | 1 | 35 | 1 | [1, 2] |
| 2 | 1 | 100 | 1000 | 3 | 3 | 45 | 0 | [3, 4] |
| 3 | 2 | 5000 | 100000 | 4 | 1 | 18 | 1 | [1, 4] |
| 4 | 2 | 5000 | 100000 | 5 | 1 | 20 | 1 | [1, 1] |
| 5 | 2 | 5000 | 100000 | 6 | 4 | 31 | 0 | [1, 2] |


## Prepare graph specs

In [11]:
company_node_params = utils.NodeTypeParams(
    node_type_name="company",
    id_col="company_id",
    attributes=("company_employees", "company_revenue"),
    attr_type="float",
)

company_tags_node_params = utils.NodeTypeParams(
    node_type_name="tag",
    multivalue_source=True,
    id_col="tags",
)

employee_node_params = utils.NodeTypeParams(
    node_type_name="employee",
    id_col="employee_id",
    attributes=("employee_occupation", "employee_age"),
    target_col="employee_promotion",
    attr_type="long",
)

company_has_emp_edge_params = utils.EdgeTypeParams(
    edge_type_name="has", source_name="company", target_name="employee"
)

company_has_tag_edge_params = utils.EdgeTypeParams(
    edge_type_name="has", source_name="company", target_name="tag"
)

single_df_specs = utils.DataFrameSource(
    name="df1",
    node_params=(
        company_node_params,
        employee_node_params,
        company_tags_node_params,
    ),
    edge_params=(company_has_emp_edge_params, company_has_tag_edge_params),
    data_frame=df,
)

## Extract graph

In [12]:
hetero_g = hextract.extract_data_from_sources((single_df_specs,))



In [9]:
hetero_g

HeteroData(
  company={ x=[3, 2] },
  employee={
    x=[7, 2],
    y=[7],
  },
  tag={ x=[5] },
  (company, has, employee)={ edge_index=[2, 6] },
  (company, has, tag)={ edge_index=[2, 7] }
)
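The shapes in this output can be cross-checked against the input table with plain pandas. The sketch below assumes that node IDs are used directly as row indices, so each node type gets `max(id) + 1` rows, and that edges are the unique ID pairs. This assumption is consistent with the numbers printed above, but it is inferred from the output, not taken from HexTractor's documentation.

```python
import pandas as pd

# Same ID columns and tags as the single-table example.
links = pd.DataFrame(
    {
        "company_id": [1, 1, 1, 2, 2, 2],
        "employee_id": [0, 1, 3, 4, 5, 6],
        "tags": [[1, 2, 3], [1, 2], [3, 4], [1, 4], [1, 1], [1, 2]],
    }
)

# Assumption: IDs index rows directly, so the node count is max(id) + 1.
n_company = int(links["company_id"].max()) + 1
n_employee = int(links["employee_id"].max()) + 1

# One row per (company, tag) occurrence, then keep unique pairs only.
tag_pairs = links[["company_id", "tags"]].explode("tags").drop_duplicates()
n_tag = int(tag_pairs["tags"].max()) + 1

# Edges are counted as unique ID pairs.
n_company_employee_edges = len(links[["company_id", "employee_id"]].drop_duplicates())
n_company_tag_edges = len(tag_pairs)

print(n_company, n_employee, n_tag)  # 3 7 5
print(n_company_employee_edges, n_company_tag_edges)  # 6 7
```

Under this reading, row 0 of `company` and row 2 of `employee` correspond to IDs that never occur in the data.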

# Multi-table data case

In this case we have multiple tables, each representing a different entity type. We will show how to extract a graph from such data. This is how data is usually represented in a database or a normalized data warehouse.

In [10]:
df_company = pd.DataFrame({
    "company_id": [1, 2],
    "company_employees": [100, 5000],
    "company_revenue": [1000, 100000],
})

df_employee = pd.DataFrame({
    "employee_id": [0, 1, 3, 4, 5, 6],
    "employee_occupation": [0, 1, 3, 1, 1, 4],
    "employee_age": [25, 35, 45, 18, 20, 31],
    "employee_promotion": [0, 1, 0, 1, 1, 0],
})

df_company_2_employee = pd.DataFrame({
    "company_id": [1, 1, 1, 2, 2, 2],
    "employee_id": [0, 1, 3, 4, 5, 6],
})

df_company_2_tag = pd.DataFrame({
    "company_id": [1, 1, 1, 2, 2, 2],
    "tags": [[1, 2, 3], [1, 2], [3, 4], [1, 4], [1, 1], [1, 2]],
})

TODO: implement this case. Move validation from the data-source level to the graph-specs level. For example, a graph might have multiple sources defined, with nodes and edges stored in separate files.
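While this case is not implemented, one possible stopgap is to join the normalized tables back into a single wide table and feed that to the single-table path. The sketch below uses only pandas `merge`, and the frame names are local to the example. Tags are left out because `df_company_2_tag` repeats `company_id` without an employee key, so a plain join on `company_id` would multiply rows.

```python
import pandas as pd

company = pd.DataFrame(
    {
        "company_id": [1, 2],
        "company_employees": [100, 5000],
        "company_revenue": [1000, 100000],
    }
)
employee = pd.DataFrame(
    {
        "employee_id": [0, 1, 3, 4, 5, 6],
        "employee_occupation": [0, 1, 3, 1, 1, 4],
        "employee_age": [25, 35, 45, 18, 20, 31],
        "employee_promotion": [0, 1, 0, 1, 1, 0],
    }
)
company_2_employee = pd.DataFrame(
    {
        "company_id": [1, 1, 1, 2, 2, 2],
        "employee_id": [0, 1, 3, 4, 5, 6],
    }
)

# Start from the link table and attach attributes from each entity table.
# `validate` raises if the key relationship is not what we expect.
wide = company_2_employee.merge(
    company, on="company_id", how="left", validate="many_to_one"
).merge(employee, on="employee_id", how="left", validate="one_to_one")

print(wide.shape)  # (6, 7)
```

The resulting frame has one row per company-employee pair, matching the layout of the single-table example minus the `tags` column.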