# Graph-Based Feature Engineering with Mercury Graph
This notebook uses an AI-generated dataset to build graph-based features with Mercury Graph.

## Environment

In [None]:
!pip install graphframes
!pip install anywidget
!git clone --branch feature/graph_features --single-branch https://github.com/BBVA/mercury-graph.git
%cd mercury-graph
!git checkout feature/graph_features

## Setting up the environment
Let's start by importing the necessary dependencies.

In [None]:
import pandas as pd
import mercury.graph as mg
from mercury.graph.ml.graph_features import GraphFeatures
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

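Let's set up a PySpark session. The `spark.jars.packages` setting pulls in the GraphFrames package built for Spark 3.5 and Scala 2.12.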

In [None]:
spark = (
    SparkSession.builder.appName("graphs")
    .config("spark.jars.packages", "graphframes:graphframes:0.8.3-spark3.5-s_2.12")
    .getOrCreate()
)

## Loading the data
We now load the vertex and edge data directly from the tab-separated files available in the project repository.

In [None]:
# Declare paths
PATH = "https://raw.githubusercontent.com/BBVA/mercury-graph/refs/heads/master/"
PATH_V = PATH + "tutorials/data/chamberi_nodos.csv"
PATH_E = PATH + "tutorials/data/chamberi_aristas.csv"

# Read data
vertices = pd.read_csv(PATH_V, sep="\t", usecols=["id", "facturación", "precio_medio"])
edges = pd.read_csv(PATH_E, sep="\t")

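# Rename columns to English by name, so the result does not depend on column order in the file
vertices = vertices.rename(columns={"facturación": "revenue", "precio_medio": "mean_price"})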

## Declare a graph

* Construct a graph from the loaded nodes and edges using the core `Graph` class from Mercury Graph. Passing `"directed": False` in `keys` declares the graph as undirected.

In [None]:
g = mg.core.Graph(
    data=edges,
    nodes=vertices,
    keys={"directed": False}
)
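
To illustrate what an undirected graph implies for neighborhoods, here is a minimal pure-Python sketch. It is not Mercury Graph's implementation; the function `neighbors_undirected` and the toy edges are hypothetical and only show the idea that each edge is read in both directions.

```python
from collections import defaultdict


def neighbors_undirected(edge_list):
    """Map each node to the set of nodes it shares an edge with."""
    neighbors = defaultdict(set)
    for src, dst in edge_list:
        # An undirected edge makes each endpoint a neighbor of the other.
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    return dict(neighbors)


# Hypothetical toy edges, not taken from the Chamberí dataset.
toy_edges = [("a", "b"), ("b", "c")]
toy_neighbors = neighbors_undirected(toy_edges)
print(toy_neighbors["b"])
```

In this toy graph, `b` ends up with both `a` and `c` as neighbors, even though it appears as the destination in one edge and as the source in the other.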

## Feature Engineering: Message Aggregation

* The objective is to describe each business by the revenue and mean price of its neighboring businesses.
* For each of these two attributes, we aggregate the neighbors' values with three functions: minimum, average, and maximum.

By doing this, each node gains six new features (two attributes times three aggregations) on top of its original values. These features provide additional information about the environment and relationships of each business.

In [None]:
# Init GraphFeatures instance
gf = GraphFeatures(
    attributes=["revenue", "mean_price"],
    agg_funcs=["min", "avg", "max"]
)

# Fit instance
gf.fit(g)

# View generated attributes
gf.node_features_.show(5)



+------------------+-----------+--------------+-----------------+--------------+-----------+--------------+
|                id|revenue_min|mean_price_min|      revenue_avg|mean_price_avg|revenue_max|mean_price_max|
+------------------+-----------+--------------+-----------------+--------------+-----------+--------------+
|  Horno del Barrio|      27080|             5|         45481.25|          20.0|      74525|            40|
|Juegos y Aventuras|      28120|            12|67619.66666666667|          37.0|     106788|            60|
|La Boutique de Luz|      44688|            12|          75020.6|          33.4|     102090|            60|
|        Gambón Hub|      44300|             5|          82942.0|          52.0|     141000|           150|
|Flores de Chamberí|      16300|            20|          53330.0|         36.25|      98400|            60|
+------------------+-----------+--------------+-----------------+--------------+-----------+--------------+
only showing top 5 rows
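
To make the aggregation concrete, the sketch below reproduces the idea on a toy undirected graph in pure Python. It is an illustration under stated assumptions, not the code `GraphFeatures` runs internally: the function `aggregate_neighbors`, the node names, and the revenue values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate_neighbors(node_values, edge_list):
    """Return min, avg and max of the neighbors' values for each node."""
    inbox = defaultdict(list)
    for src, dst in edge_list:
        # Undirected edge: each endpoint sends its value to the other one.
        inbox[src].append(node_values[dst])
        inbox[dst].append(node_values[src])
    # Nodes without any edge receive no messages and are left out.
    return {
        node: {"min": min(msgs), "avg": mean(msgs), "max": max(msgs)}
        for node, msgs in inbox.items()
    }


# Hypothetical toy data, not taken from the Chamberí dataset.
toy_revenue = {"a": 100, "b": 200, "c": 600}
toy_edges = [("a", "b"), ("a", "c"), ("b", "c")]
toy_features = aggregate_neighbors(toy_revenue, toy_edges)
print(toy_features["a"])
```

Node `a` has neighbors `b` and `c` with revenues 200 and 600, so its aggregated features are a minimum of 200, an average of 400, and a maximum of 600. Note that a node's own value does not enter its aggregates; only its neighbors' values do.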

