
### TABLE OF CONTENTS
* [IMPORTS](#1)
* [INTRODUCTION](#2)
* [Neomodel](#3)



### Imports

In [1]:
# With this Neomodel import
import ast
import os
from neomodel import db, config
from dotenv import load_dotenv


# Prompts:
from langchain_core.prompts import (
    PromptTemplate
)


## LLMs:
from langchain_community.vectorstores import Neo4jVector
from langchain_openai import OpenAI
from langchain_community.embeddings import HuggingFaceEmbeddings



### Neomodel


In this section we show how to use neomodel to connect to and populate a Neo4j graph database. For further details, refer to the documentation: [neomodel](https://neo4j.com/developer-blog/py2neo-end-migration-guide/)

## Connection

In [2]:
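_ = load_dotenv()  # Load NEO4J_PASSWORD (and other settings) from a local .env file

# Neomodel connection: select the target database, then connect through a bolt URL.
# `db` and `config` were already imported in the Imports cell.
# Connecting by URL lets neomodel manage the driver itself.
config.DATABASE_NAME = 'graphrag'
db.set_connection(url=f"bolt://neo4j:{os.environ['NEO4J_PASSWORD']}@localhost:7687")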

## Graph Configuration

In this section we build a textual tree of the folder that we want to map to a knowledge graph. From that tree, an LLM produces a list of node dictionaries which, after some manual editing, serves as the entry point for generating the graph.
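
The tree used in the next cell can be typed by hand, but it could also be generated from the repository itself. The following is a minimal sketch, assuming the same `└──` layout with folders listed before files; `render_tree` is a hypothetical helper and not part of the original notebook.

```python
from pathlib import Path


def render_tree(root: Path, indent: str = "") -> list[str]:
    """Return one line per entry under `root`, folders first, in the `└──` style."""
    lines = []
    # Folders before files; each group sorted by name.
    entries = sorted(root.iterdir(), key=lambda p: (p.is_file(), p.name))
    for entry in entries:
        if entry.name == "__pycache__":
            continue  # skip bytecode caches, they are not part of the package layout
        if entry.is_dir():
            lines.append(f"{indent}└── 📁{entry.name}")
            lines.extend(render_tree(entry, indent + "    "))
        else:
            lines.append(f"{indent}└── {entry.name}")
    return lines


# Path taken from the graph-creation cell of this notebook; only rendered if it exists.
repo = Path("../../data_science_repo")
if repo.is_dir():
    print("\n".join(render_tree(repo)))
```

The resulting string could then be passed to the prompt as `folder_structure` instead of the hard-coded literal.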

In [3]:
folder_structure = """
└── 📁data_preprocessing
    └── 📁feature_engineering
        └── 📁pandas
            └── pandas.py
        └── 📁sklearn
            └── sklearn.py
        └── __init__.py
    └── __init__.py
└── 📁modelling
    └── 📁pytorch
        └── __init__.py
        └── data_loader.py
        └── model.py
        └── trainer.py
    └── 📁tensorflow
        └── __init__.py
        └── trainer.py
    └── 📁transformers
        └── __init__.py
        └── trainer.py
    └── __init__.py
└── 📁visualization
    └── 📁plotly
        └── __init__.py
        └── geospatial_plots.py
        └── machine_learning_evaluation_plots.py
        └── statistical_analysis_plots.py
    └── __init__.py
└── __init__.py
"""


node_definitions = """

class Area(StructuredNode):
    name = StringProperty(unique_index=True)
    contains_subarea = RelationshipTo('SubArea', 'CONTAINS')
    contains_framework = RelationshipTo('Framework', 'CONTAINS')

class SubArea(StructuredNode):
    name = StringProperty(unique_index=True)
    contains_framework = RelationshipTo('Framework', 'CONTAINS')

class Framework(StructuredNode):
    name = StringProperty(unique_index=True)
    contains_class = RelationshipTo('Class', 'CONTAINS')
    contains_function = RelationshipTo('Function', 'CONTAINS')
    """

prompt_template = PromptTemplate.from_template(
    """You are a Graph Database expert tasked to Output a python list. In this case the objective is to create a
      Knowledge graph with NeoModel based in Python programming for Data science, in that sense consider:
        - Area: Nodes labeled as 'Area' representing areas of Data Science, like 'Data Visualization' or 'Data Preprocessing'.
        - SubArea: Nodes labeled as 'SubArea' representing sub-areas within a more general Area.
        - Framework: Nodes labeled as 'Framework' representing frameworks used in data science, corresponding generally to libraries, like Tensorflow, pandas, etc.
      Given that relation you should map following folder structure into nodes (and relations):
      {folder_structure}. 
      
      Those have to be adjusted to the following nodes and relationships:
      {node_definitions}. 
      
      Here you have several examples of how the output should look like:
      [{{
      'label':'Area',
      'name':'data_preprocessing',
      'contains_subarea':['feature_engineering']
      }},
        {{
        'label':'Area',
        'name':'visualization',
        'contains_framework':['plotly']
        }}
    
        ]

      where each node needs to have a reference to all the relations that it has.
      And so on with all the nodes, so the folder structure is fully mapped.
      Only add the relations of direct children (folders/files directly under the current one).
      Do not add information (or Nodes) about 'Functions' nor 'Classes', that will be done later.
      Do not add any function/framework that is not present in the folder structure provided."""
)

## Chain definition to map folder structure to Nodes

In [4]:
llm = OpenAI(max_tokens=-1)  # -1: generate as many tokens as the model's context window allows
chain = prompt_template | llm

In [5]:
results = chain.invoke({'folder_structure':folder_structure,'node_definitions': node_definitions})
results

'\n[\n    {\n        "label": "Area",\n        "name": "data_preprocessing",\n        "contains_subarea": [\n            "feature_engineering"\n        ]\n    },\n    {\n        "label": "SubArea",\n        "name": "feature_engineering",\n        "contains_framework": [\n            "pandas",\n            "sklearn"\n        ]\n    },\n    {\n        "label": "Framework",\n        "name": "pandas",\n        "contains_class": [],\n        "contains_function": [\n            "pandas.py"\n        ]\n    },\n    {\n        "label": "Framework",\n        "name": "sklearn",\n        "contains_class": [],\n        "contains_function": [\n            "sklearn.py"\n        ]\n    },\n    {\n        "label": "Area",\n        "name": "modelling",\n        "contains_subarea": [\n            "pytorch",\n            "tensorflow",\n            "transformers"\n        ]\n    },\n    {\n        "label": "SubArea",\n        "name": "pytorch",\n        "contains_framework": [\n            "data_loader",\n

In [6]:
parsed_list = ast.literal_eval(results)
parsed_list

[{'label': 'Area',
  'name': 'data_preprocessing',
  'contains_subarea': ['feature_engineering']},
 {'label': 'SubArea',
  'name': 'feature_engineering',
  'contains_framework': ['pandas', 'sklearn']},
 {'label': 'Framework',
  'name': 'pandas',
  'contains_class': [],
  'contains_function': ['pandas.py']},
 {'label': 'Framework',
  'name': 'sklearn',
  'contains_class': [],
  'contains_function': ['sklearn.py']},
 {'label': 'Area',
  'name': 'modelling',
  'contains_subarea': ['pytorch', 'tensorflow', 'transformers']},
 {'label': 'SubArea',
  'name': 'pytorch',
  'contains_framework': ['data_loader', 'model', 'trainer']},
 {'label': 'SubArea', 'name': 'tensorflow', 'contains_framework': ['trainer']},
 {'label': 'SubArea',
  'name': 'transformers',
  'contains_framework': ['trainer']},
 {'label': 'Framework',
  'name': 'data_loader',
  'contains_class': [],
  'contains_function': ['data_loader.py']},
 {'label': 'Framework',
  'name': 'model',
  'contains_class': [],
  'contains_functio
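
Calling `ast.literal_eval` directly on the completion works here, but it fails as soon as the model wraps the list in extra text. A slightly more defensive parser might look like the sketch below; `parse_llm_list` is a hypothetical helper, and it assumes the completion contains exactly one top-level list.

```python
import ast
import json


def parse_llm_list(raw: str) -> list:
    """Extract the outermost [...] list from an LLM completion and parse it."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no list found in the model output")
    snippet = raw[start : end + 1]
    try:
        parsed = json.loads(snippet)  # double-quoted, JSON-style output
    except json.JSONDecodeError:
        parsed = ast.literal_eval(snippet)  # single-quoted, Python-style output
    if not isinstance(parsed, list):
        raise ValueError("model output is not a list")
    return parsed


print(parse_llm_list("Here it is: [{'label': 'Area', 'name': 'modelling'}]"))
```

A truncated completion still raises an error from `ast.literal_eval`, which is arguably the desired behaviour: a partial list should not be loaded into the graph silently.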

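### The prompt still needs work before it can map the folder structure into nodes and relationships automatically.
In the output above, for example, the frameworks `pytorch`, `tensorflow` and `transformers` are labelled as `SubArea`, and individual files such as `data_loader` are labelled as `Framework`. GPT-4 Turbo could be used for this purpose; for now we simply correct the output manually to complete the showcase.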

In [7]:

nodes_relationships = [
    {
        'label': 'Area',
        'name': 'data_preprocessing',
        'relationships': {
            'contains_subarea': ['feature_engineering']
        }
    },
    {
        'label': 'SubArea',
        'name': 'feature_engineering',
        'relationships': {
            'contains_framework': ['pandas', 'sklearn']
        }
    },
    {
        'label': 'Framework',
        'name': 'pandas'
    },
    {
        'label': 'Framework',
        'name': 'sklearn'
    },
    {
        'label': 'Area',
        'name': 'modelling',
        'relationships': {
            'contains_framework': ['pytorch', 'tensorflow', 'transformers']
        }
    },
    {
        'label': 'Framework',
        'name': 'pytorch'
    },
    {
        'label': 'Framework',
        'name': 'tensorflow'
    },
    {
        'label': 'Framework',
        'name': 'transformers'
    },
    {
        'label': 'Area',
        'name': 'visualization',
        'relationships': {
            'contains_framework': ['plotly']
        }
    },
    {
        'label': 'Framework',
        'name': 'plotly'
    }
    ]

In [11]:
from parse_directory_to_KT.graph_generator import create_graph_for_directory

create_graph_for_directory(db=db, base_path="../../data_science_repo", nodes_relationships=nodes_relationships)

Executing query for node creation: MERGE (n:Area {name: $name}) RETURN n
Created node: Area - data_preprocessing
Runned query---> MERGE (n:Area name: data_preprocessing) RETURN n
Executing query for node creation: MERGE (n:SubArea {name: $name}) RETURN n
Created node: SubArea - feature_engineering
Runned query---> MERGE (n:SubArea name: feature_engineering) RETURN n
Executing query for node creation: MERGE (n:Framework {name: $name}) RETURN n
Created node: Framework - pandas
Runned query---> MERGE (n:Framework name: pandas) RETURN n
Executing query for function node creation: 
                        MERGE (f:Function {name: $name})
                        ON CREATE SET f.description = $description, f.code = $code, f.file_path = $file_path
                        RETURN f
                        
Created Function node: remove_outliers with file path data_preprocessing\feature_engineering\pandas\pandas.py
Query---> 
                        MERGE (f:Function name:remove_outliers)
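
The internals of `create_graph_for_directory` are not shown in this notebook. As a rough illustration of how a list such as `nodes_relationships` could be turned into Cypher, the sketch below builds parameterised statements. The node query follows the shape seen in the log above; the relationship query, `build_statements` and `CHILD_LABELS` are assumptions made for this example.

```python
# Child label implied by each relationship key (taken from `node_definitions`).
CHILD_LABELS = {"contains_subarea": "SubArea", "contains_framework": "Framework"}


def build_statements(nodes_relationships):
    """Return (cypher, params) pairs: all node MERGEs first, then CONTAINS edges."""
    node_stmts, rel_stmts = [], []
    for node in nodes_relationships:
        label, name = node["label"], node["name"]
        # Labels cannot be query parameters in Cypher, so they are interpolated;
        # they come from our own list, never from user input.
        node_stmts.append((f"MERGE (n:{label} {{name: $name}}) RETURN n", {"name": name}))
        for key, children in node.get("relationships", {}).items():
            child_label = CHILD_LABELS[key]
            for child in children:
                rel_stmts.append((
                    f"MATCH (a:{label} {{name: $parent}}), (b:{child_label} {{name: $child}}) "
                    "MERGE (a)-[:CONTAINS]->(b)",
                    {"parent": name, "child": child},
                ))
    return node_stmts + rel_stmts


sample = [
    {"label": "Area", "name": "visualization",
     "relationships": {"contains_framework": ["plotly"]}},
    {"label": "Framework", "name": "plotly"},
]
for query, params in build_statements(sample):
    print(query, params)
```

Each pair could then be executed with `db.cypher_query(query, params)`. Emitting every node statement before any relationship statement ensures that the `MATCH` clauses find both endpoints.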
       

## Delete unwanted relations

In [12]:
db.cypher_query("""
MATCH ()-[r]->()
WHERE NOT type(r) STARTS WITH 'CONTAINS'
DELETE r;
""")

([], [])

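Graph creation currently introduces some unwanted relationships, so the query above deletes every relationship whose type does not start with `CONTAINS`. This will be fixed properly in a later step.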

## Create a vector database from this index

As mentioned earlier in this development, getting exact results for our use case from the extracted entities and the generated query alone is often not feasible. We therefore take a hybrid approach: we combine a graph-based search with a semantic (plus keyword) search and join the results, which makes the process more robust.
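
How the two result sets are joined is not specified here. One simple possibility is sketched below, assuming each search returns a list of function names ordered by relevance; `join_results` and every name except `remove_outliers` are made up for the example.

```python
def join_results(graph_hits, semantic_hits, limit=5):
    """Union of both result lists, graph hits first, without duplicates."""
    merged, seen = [], set()
    for name in [*graph_hits, *semantic_hits]:
        if name not in seen:
            seen.add(name)
            merged.append(name)
    return merged[:limit]


graph_hits = ["remove_outliers", "scale_features"]   # hypothetical graph query result
semantic_hits = ["scale_features", "plot_roc_curve"]  # hypothetical vector search result
print(join_results(graph_hits, semantic_hits))
```

Putting graph hits first is an arbitrary choice for illustration; a score-based merge would be an equally valid alternative.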

In [13]:
# We start with this small model but may upgrade to larger, more specialised ones.
# Any sentence-transformers model from the Hugging Face hub can be used here.
model_name = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=model_name)
url = "bolt://localhost:7687"

# Embed the `description` property of every node labelled `Function`.
existing_graph = Neo4jVector.from_existing_graph(
    embedding=embeddings,
    url=url,
    username=os.environ["NEO4J_USERNAME"],
    password=os.environ["NEO4J_PASSWORD"],
    database="graphrag",
    # index_name="person_index",
    node_label="Function",
    text_node_properties=["description"],
    embedding_node_property="embedding",
    search_type="hybrid",  # combine vector similarity with a keyword (full-text) index
    keyword_index_name="keyword",
)
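
Once the store exists, it can be queried like any other LangChain vector store. The helper below is a small sketch of that step; `top_function_descriptions` is a hypothetical name, and it only assumes that the store exposes `similarity_search(query, k=...)` and returns documents with a `page_content` attribute.

```python
def top_function_descriptions(store, question, k=3):
    """Return the text of the k best-matching Function nodes."""
    docs = store.similarity_search(question, k=k)
    return [doc.page_content for doc in docs]
```

With the store built above, a call might look like `top_function_descriptions(existing_graph, "How can I remove outliers?")`.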

