#### <div style= "font-family: Cambria; font-weight:bold; letter-spacing: 0px; color:white; font-size:180%; text-align:left;padding:3.0px; background: maroon; border-bottom: 8px solid black" > TABLE OF CONTENTS<br></div>
* [IMPORTS](#1)
* [Neomodel](#2)


<a id="1"></a>
# <div style= "font-family: Cambria; font-weight:bold; letter-spacing: 0px; color: white; font-size:120%; text-align:left;padding:3.0px; background: maroon; border-bottom: 8px solid black" > Imports<br></div>

In [1]:
# Standard library, neomodel (Neo4j), and environment loading
import ast
import os
from neomodel import db, config
from dotenv import load_dotenv


# Prompts:
from langchain_core.prompts import (
    PromptTemplate
)


## LLMs:
from langchain_openai import OpenAI


<a id="2"></a>
# <div style= "font-family: Cambria; font-weight:bold; letter-spacing: 0px; color: white; font-size:120%; text-align:left;padding:3.0px; background: maroon; border-bottom: 8px solid black" > Neomodel<br></div>


In this section we show how to use neomodel to connect to and populate a Neo4j graph database. For further details, refer to the documentation: [neomodel](https://neo4j.com/developer-blog/py2neo-end-migration-guide/)

## Connection

In [None]:
_ = load_dotenv()  # load environment variables (e.g. NEO4J_PASSWORD) from a .env file

# Neomodel connection: select the target database, then connect through a Bolt URL.
# `db` and `config` were already imported in the Imports cell.
config.DATABASE_NAME = 'testing2'
db.set_connection(url=f"bolt://neo4j:{os.environ['NEO4J_PASSWORD']}@localhost:7687")
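
If the password contains characters such as `@` or `/`, embedding it directly in the Bolt URL can break URL parsing. A minimal sketch of percent-encoding the credentials first, using only the standard library. The helper name `build_bolt_url` and its defaults are illustrative assumptions, not part of neomodel, and whether the encoded password is decoded again should be checked against the neomodel version in use:

```python
from urllib.parse import quote


def build_bolt_url(user: str, password: str, host: str = "localhost", port: int = 7687) -> str:
    """Build a Bolt URL with percent-encoded credentials (hypothetical helper)."""
    # safe="" forces reserved characters such as '@', ':' and '/' to be encoded
    return f"bolt://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


# Example with a made-up password containing reserved characters
print(build_bolt_url("neo4j", "p@ss/word"))
```

The resulting string could then be passed as `db.set_connection(url=...)` in place of the f-string above.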

## Graph Configuration

In this section we generate a visual representation of the folder structure that we want to map to a knowledge graph. From it, using an LLM, we obtain an iterable that, after some manual editing, serves as the entry point for generating the graph.

In [None]:
folder_structure = """
└── 📁data_preprocessing
    └── 📁feature_engineering
        └── 📁pandas
            └── pandas.py
        └── 📁sklearn
            └── sklearn.py
        └── __init__.py
    └── __init__.py
└── 📁modelling
    └── 📁pytorch
        └── __init__.py
        └── data_loader.py
        └── model.py
        └── trainer.py
    └── 📁tensorflow
        └── __init__.py
        └── trainer.py
    └── 📁transformers
        └── __init__.py
        └── trainer.py
    └── __init__.py
└── 📁visualization
    └── 📁plotly
        └── __init__.py
        └── geospatial_plots.py
        └── machine_learning_evaluation_plots.py
        └── statistical_analysis_plots.py
    └── __init__.py
└── __init__.py
"""


node_definitions = """

class Area(StructuredNode):
    name = StringProperty(unique_index=True)
    contains_subarea = RelationshipTo('SubArea', 'CONTAINS')
    contains_framework = RelationshipTo('Framework', 'CONTAINS')

class SubArea(StructuredNode):
    name = StringProperty(unique_index=True)
    contains_framework = RelationshipTo('Framework', 'CONTAINS')

class Framework(StructuredNode):
    name = StringProperty(unique_index=True)
    contains_class = RelationshipTo('Class', 'CONTAINS')
    contains_function = RelationshipTo('Function', 'CONTAINS')
    """

prompt_template = PromptTemplate.from_template(
    """You are a Graph Database expert tasked with outputting a Python list. In this case the objective is to create a
      Knowledge graph with NeoModel based on Python programming for Data Science. In that sense, consider:
        - Area: Nodes labeled as 'Area' representing areas of Data Science, like 'Data Visualization' or 'Data Preprocessing'.
        - SubArea: Nodes labeled as 'SubArea' representing sub-areas within a more general Area.
        - Framework: Nodes labeled as 'Framework' representing frameworks used in data science, corresponding generally to libraries, like Tensorflow, pandas, etc.
      Given those definitions, you should map the following folder structure into nodes (and relationships):
      {folder_structure}.

      Those have to be adjusted to the following nodes and relationships:
      {node_definitions}.

      Here are several examples of what the output should look like:
      [{{
      'label':'Area',
      'name':'data_preprocessing',
      'contains_subarea':['feature_engineering']
      }},
        {{
        'label':'Area',
        'name':'visualization',
        'contains_framework':['plotly']
        }}
    
        ]

      where each node needs to reference all the relationships that it has.
      Continue in the same way for all the nodes, so that the folder structure is fully mapped.
      Only add the relationships of direct children (folders/files directly under the current one).
      Do not add information (or Nodes) about 'Functions' or 'Classes'; that will be done later.
      Do not add any function/framework that is not present in the folder structure provided."""
)
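
The `folder_structure` string above is typed by hand. As a rough sketch, a similar listing could be generated from a directory on disk with `pathlib`. The function name `render_tree` is a hypothetical helper, and the "folders first, four spaces per level" layout is an assumption inferred from the string above:

```python
from pathlib import Path


def render_tree(root: Path, indent: int = 0) -> list[str]:
    """Return one line per entry under `root`, listing folders before files."""
    lines = []
    # Sort key: folders (is_file() == False) come first, then alphabetical order
    entries = sorted(root.iterdir(), key=lambda p: (p.is_file(), p.name))
    for entry in entries:
        prefix = " " * indent + "└── "
        if entry.is_dir():
            lines.append(f"{prefix}📁{entry.name}")
            # Recurse one level deeper, indenting children by four spaces
            lines.extend(render_tree(entry, indent + 4))
        else:
            lines.append(f"{prefix}{entry.name}")
    return lines
```

For example, `folder_structure = "\n".join(render_tree(Path("./data_science_repo")))` would build the string for the repository path used later in this notebook, assuming that folder exists locally.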

## Chain definition to map folder structure to Nodes

In [None]:
llm = OpenAI(max_tokens=-1)  # -1: return as many tokens as the model's context size allows
chain = prompt_template | llm

In [None]:
results = chain.invoke({'folder_structure':folder_structure,'node_definitions': node_definitions})
results

In [None]:
parsed_list = ast.literal_eval(results)
parsed_list
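
`ast.literal_eval` raises an error if the model wraps the list in extra text or returns something that is not a list of node dictionaries. A minimal sketch of a more defensive parse, assuming the output contains exactly one top-level Python list; `parse_llm_list` is a hypothetical helper name and the sample string is made up for illustration:

```python
import ast


def parse_llm_list(raw: str) -> list[dict]:
    """Extract and validate the Python list embedded in an LLM response."""
    # Keep only the text between the first '[' and the last ']'
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no Python list found in the LLM output")
    parsed = ast.literal_eval(raw[start:end + 1])
    if not isinstance(parsed, list) or not all(isinstance(n, dict) for n in parsed):
        raise ValueError("expected a list of dictionaries")
    # Every node needs at least a label and a name
    for node in parsed:
        if not {"label", "name"} <= node.keys():
            raise ValueError(f"node is missing 'label' or 'name': {node}")
    return parsed


sample = "Here is the list:\n[{'label': 'Area', 'name': 'modelling', 'contains_framework': ['pytorch']}]"
print(parse_llm_list(sample))
```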

### The prompt still needs several improvements before it can map the folder structure into nodes/relationships automatically.
GPT-4 Turbo could be used for this purpose; for now, we simply adjust the output manually to complete the showcase.
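
One part of that manual adjustment is mechanical: the prompt examples place `contains_*` keys at the top level of each node, while the hand-edited list in the next cell groups them under a `relationships` key. A small sketch of that conversion, where `nest_relationships` is a hypothetical helper and the `contains_` prefix is assumed from the node definitions:

```python
def nest_relationships(node: dict) -> dict:
    """Move every 'contains_*' key of a flat node dict under a 'relationships' key."""
    nested = {k: v for k, v in node.items() if not k.startswith("contains_")}
    relationships = {k: v for k, v in node.items() if k.startswith("contains_")}
    # Nodes without relationships keep only their label and name
    if relationships:
        nested["relationships"] = relationships
    return nested


flat_node = {'label': 'Area', 'name': 'data_preprocessing', 'contains_subarea': ['feature_engineering']}
print(nest_relationships(flat_node))
```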

In [2]:

nodes_relationships = [
    {
        'label': 'Area',
        'name': 'data_preprocessing',
        'relationships': {
            'contains_subarea': ['feature_engineering']
        }
    },
    {
        'label': 'SubArea',
        'name': 'feature_engineering',
        'relationships': {
            'contains_framework': ['pandas', 'sklearn']
        }
    },
    {
        'label': 'Framework',
        'name': 'pandas'
    },
    {
        'label': 'Framework',
        'name': 'sklearn'
    },
    {
        'label': 'Area',
        'name': 'modelling',
        'relationships': {
            'contains_framework': ['pytorch', 'tensorflow', 'transformers']
        }
    },
    {
        'label': 'Framework',
        'name': 'pytorch'
    },
    {
        'label': 'Framework',
        'name': 'tensorflow'
    },
    {
        'label': 'Framework',
        'name': 'transformers'
    },
    {
        'label': 'Area',
        'name': 'visualization',
        'relationships': {
            'contains_framework': ['plotly']
        }
    },
    {
        'label': 'Framework',
        'name': 'plotly'
    }
    ]
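
Because this list is edited by hand, it is easy to reference a child that has no node entry of its own. A quick consistency check could look like the sketch below; `find_dangling_targets` is a hypothetical helper, and the two-node sample is shortened for illustration:

```python
def find_dangling_targets(nodes: list[dict]) -> list[tuple[str, str, str]]:
    """Return (source, relationship, target) triples whose target has no node entry."""
    defined = {node["name"] for node in nodes}
    dangling = []
    for node in nodes:
        for rel_name, targets in node.get("relationships", {}).items():
            for target in targets:
                if target not in defined:
                    dangling.append((node["name"], rel_name, target))
    return dangling


sample_nodes = [
    {'label': 'Area', 'name': 'visualization',
     'relationships': {'contains_framework': ['plotly', 'seaborn']}},
    {'label': 'Framework', 'name': 'plotly'},
]
# 'seaborn' is referenced but never defined, so it is reported
print(find_dangling_targets(sample_nodes))
```

Calling `find_dangling_targets(nodes_relationships)` on the full list above should return an empty list if every referenced child is defined.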

In [None]:
from utils.parse_directory_to_KT.graph_generator import create_graph_for_directory

create_graph_for_directory(db=db, base_path="./data_science_repo", nodes_relationships=nodes_relationships)
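
The source of `create_graph_for_directory` is not shown in this notebook. Purely as an illustration of the node-creation step, and not a description of what that helper actually does, the list could be translated into parameterized Cypher `MERGE` statements. The function `build_merge_statements`, the `ALLOWED_LABELS` set, and the use of a single `CONTAINS` relationship type are assumptions taken from the node definitions above:

```python
ALLOWED_LABELS = {"Area", "SubArea", "Framework"}


def build_merge_statements(nodes: list[dict]) -> list[tuple[str, dict]]:
    """Build (query, params) pairs that create the nodes and their CONTAINS relationships."""
    statements = []
    for node in nodes:
        label = node["label"]
        # Labels cannot be passed as Cypher parameters, so validate before interpolating
        if label not in ALLOWED_LABELS:
            raise ValueError(f"unexpected label: {label}")
        statements.append((f"MERGE (n:{label} {{name: $name}})", {"name": node["name"]}))
    # Relationships are added after all nodes exist
    for node in nodes:
        for targets in node.get("relationships", {}).values():
            for target in targets:
                statements.append((
                    "MATCH (a {name: $source}), (b {name: $target}) MERGE (a)-[:CONTAINS]->(b)",
                    {"source": node["name"], "target": target},
                ))
    return statements


sample_nodes = [
    {'label': 'Area', 'name': 'visualization', 'relationships': {'contains_framework': ['plotly']}},
    {'label': 'Framework', 'name': 'plotly'},
]
for query, params in build_merge_statements(sample_nodes):
    print(query, params)
```

Each pair could then be executed with `db.cypher_query(query, params)`.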

## Delete unwanted relationships

In [None]:
db.cypher_query("""
MATCH ()-[r]->()
WHERE NOT type(r) STARTS WITH 'CONTAINS'
DELETE r;
""")