# Step 1 - Install the dependencies needed for the demo

The **requirements.txt** file lists the Python dependencies needed to run the demo.

To install them, run the following command: *pip install -r requirements.txt*

# Step 2 - Define the environment variables needed for the demo

Edit the _.env_ file and define the following environment variables:
-  **OPENAI_API_KEY**: API key used to connect to OpenAI, the LLM provider used in the demo.
-  **NEO4J_CONNECTION_URL**: Connection URL for Neo4j, the graph database used in the demo.
-  **NEO4J_USER**: User name for connecting to the database.
-  **NEO4J_PASSWORD**: Password for connecting to the database.
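
Before going further, it can help to check that none of these variables is missing. The following is a minimal sketch, not part of the original demo: the helper name `missing_env_vars` and the sample mapping are assumptions made for illustration, and only the four variable names come from the list above.

```python
import os

# Variable names taken from the list above
REQUIRED_VARS = ["OPENAI_API_KEY", "NEO4J_CONNECTION_URL", "NEO4J_USER", "NEO4J_PASSWORD"]

def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Example with a hand-built mapping instead of the real environment (placeholder values)
sample_env = {"OPENAI_API_KEY": "sk-placeholder", "NEO4J_USER": "neo4j"}
print(missing_env_vars(REQUIRED_VARS, sample_env))
```

Calling `missing_env_vars(REQUIRED_VARS)` without a second argument checks the real process environment, which is only meaningful after `load_dotenv()` has run.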

# Step 3 - Import the Python libraries needed for the demo

In [None]:
import os
import json
import random
import pandas as pd

from dotenv import load_dotenv

from langchain_community.graphs import Neo4jGraph
from langchain.chains import GraphCypherQAChain
from langchain_openai import ChatOpenAI
from langchain_community.llms import HuggingFaceHub

# Load environment variables defined in .env
load_dotenv()

# Step 4 - Prepare the dataset for the demo

We start from a public dataset of LinkedIn professional profiles published on Kaggle by MANISH KUMAR: https://www.kaggle.com/datasets/manishkumar7432698/linkedinuserprofiles?select=LinkedIn+people+profiles+datasets.csv

In [None]:
# Read the original .csv from Kaggle
df = pd.read_csv('files/LinkedIn people profiles datasets.csv')
df.head()


We clean the dataset and keep only the fields needed for our knowledge graph:
-  id: unique identifier of the professional.
-  name: name of the professional.
-  company: name of the company where they work.
-  education: educational institution where they studied.
-  languages: languages they speak.
-  industry: main industry in which they have experience.
-  country: country where the professional is located (taken from the city field).

In [None]:
# Random values used to fill in missing data (just for demo purposes)
industries = ['Advertising Services', 'IT Services and IT Consulting', 'Hospitals and Health Care', 'Higher Education', 'Retail', 'Financial Services', 'Telco', 'Media & Entertainment']
countries= ['United States', 'Argentina', 'Spain', 'France', 'Mexico', 'United Kingdom', 'Sweden']

# Function to extract the industry from the company information
def extract_industry(json_str):
    try:
        data = json.loads(json_str)
        return data.get('industry', random.choice(industries))
    except (json.JSONDecodeError, TypeError):
        # Malformed JSON or a missing (NaN) value
        return None

# Function to extract the languages from the languages structure
def extract_languages(json_list):
    try:
        languages = [entry['title'] for entry in json.loads(json_list)]
        return '|'.join(languages)
    except (json.JSONDecodeError, TypeError, KeyError):
        # Malformed JSON, a missing (NaN) value, or an entry without 'title'
        return None

# Function to extract the country from the city structure
def extract_country(string):
    if isinstance(string, str):
        elements = string.split(',')
        return elements[-1].strip()  
    else:
        return random.choice(countries)

# Extract the industry, languages and country information
df['industry'] = df['current_company'].apply(extract_industry)
df['languages'] = df['languages'].apply(extract_languages)
df['country'] = df['city'].apply(extract_country)

# Remove the rows with empty values in these key columns (just for demo purposes)
df = df[['id', 'name', 'current_company:name', 'educations_details', 'languages', 'industry', 'country']].dropna()

# Rename some columns for better readability
df = df.rename(columns={'current_company:name': 'company','educations_details':'education'})

# Preview the curated data
df.head(300)
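
To make the cleaning rules concrete, here is a small standalone sketch of the language and country extraction applied to made-up sample values. The sample strings are assumptions for illustration (they are not rows from the Kaggle file), and this simplified `extract_country` returns `None` for missing values instead of the random fallback used in the demo.

```python
import json

def extract_languages(json_list):
    """Join the 'title' of each language entry with '|', or return None on bad input."""
    try:
        return "|".join(entry["title"] for entry in json.loads(json_list))
    except (TypeError, ValueError, KeyError):
        return None

def extract_country(city):
    """Take the last comma-separated element of the city string as the country."""
    return city.split(",")[-1].strip() if isinstance(city, str) else None

# Hypothetical sample values shaped like the columns used in this step
sample_languages = '[{"title": "English"}, {"title": "Spanish"}]'
print(extract_languages(sample_languages))
print(extract_languages(float("nan")))  # missing value, as pandas would represent it
print(extract_country("Vancouver, British Columbia, Canada"))
```

The `|` separator matters later: it is the same character the Cypher load query splits on to create one `Language` node per language.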

In [None]:
# OPTIONAL: With this statement you can save the curated data to a new file called 'clean_data.csv'
df.to_csv('files/clean_data.csv', index=False)

# Step 5 - Insert the data into Neo4j

First, we set up the connector to Neo4j using the Neo4jGraph utility from Langchain

In [None]:
# Retrieve connection information to Neo4j from environment variables
neo4j_url = os.getenv("NEO4J_CONNECTION_URL")
neo4j_user = os.getenv("NEO4J_USER")
neo4j_password = os.getenv("NEO4J_PASSWORD")

# https://api.python.langchain.com/en/latest/graphs/langchain_community.graphs.neo4j_graph.Neo4jGraph.html
graph = Neo4jGraph(url=neo4j_url, username=neo4j_user, password=neo4j_password)

graph.refresh_schema()
print(graph.schema)
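
If the connection fails, a common cause is a malformed value in `NEO4J_CONNECTION_URL`. The sketch below is an optional sanity check and not part of the original demo: the helper name `looks_like_neo4j_url` is hypothetical, and the set of schemes is the list of URI schemes accepted by the Neo4j drivers as far as I know.

```python
from urllib.parse import urlparse

# URI schemes understood by the Neo4j drivers (assumed complete for this sketch)
NEO4J_SCHEMES = {"bolt", "bolt+s", "bolt+ssc", "neo4j", "neo4j+s", "neo4j+ssc"}

def looks_like_neo4j_url(url):
    """Return True if `url` has a Neo4j scheme and a host name."""
    if not url:
        return False
    parsed = urlparse(url)
    return parsed.scheme in NEO4J_SCHEMES and bool(parsed.hostname)

print(looks_like_neo4j_url("bolt://localhost:7687"))
print(looks_like_neo4j_url("http://localhost:7474"))
```

This only checks the shape of the URL. It does not verify that the server is reachable or that the credentials are valid.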

Next, we load the data prepared earlier into Neo4j using the same Langchain utility

In [None]:
# We set up the Cypher query to load the information from the csv that we have published on github
people_query = """
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/jmunizwizeline/talent-land-2024/main/files/clean_data.csv'
AS row
MERGE (person:Person {name: row.name})
MERGE (company:Company {name: row.company})
MERGE (school:School {name: row.education})
MERGE (industry:Industry {name: row.industry})
MERGE (country:Country {name: row.country})

FOREACH (lang in split(row.languages, '|') | 
    MERGE (language:Language {name:trim(lang)})
    MERGE (person)-[:SPEAKS]->(language))

MERGE (person)-[:WORKS_IN]->(company)
MERGE (person)-[:LIVES_IN]->(country)
MERGE (person)-[:EDUCATED_AT]->(school)
MERGE (company)-[:IS_IN]->(industry)
"""

graph.query(people_query)
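
To picture what the query does with each CSV row, the following pure-Python sketch mimics it with a set of (source, relationship, target) triples. It is only an illustration under simplifying assumptions: the sample row is made up, nodes are reduced to their names, and a Python set stands in for the "create only if absent" behaviour of `MERGE`.

```python
def row_to_triples(row):
    """Build the relationships that the Cypher query would MERGE for one CSV row."""
    person = row["name"]
    triples = {
        (person, "WORKS_IN", row["company"]),
        (person, "LIVES_IN", row["country"]),
        (person, "EDUCATED_AT", row["education"]),
        (row["company"], "IS_IN", row["industry"]),
    }
    # Equivalent of FOREACH (lang in split(row.languages, '|') | ... trim(lang) ...)
    for lang in row["languages"].split("|"):
        triples.add((person, "SPEAKS", lang.strip()))
    return triples

# Hypothetical row with the same columns as clean_data.csv
sample_row = {
    "name": "Jane Doe",
    "company": "Example Corp",
    "education": "Example University",
    "languages": "English | Spanish",
    "industry": "Retail",
    "country": "Spain",
}

graph_edges = set()
for _ in range(2):  # loading the same row twice adds nothing new
    graph_edges |= row_to_triples(sample_row)
print(len(graph_edges))
```

Running the real load query a second time behaves the same way: because every node and relationship is created with `MERGE`, existing ones are matched rather than duplicated.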

Finally, we confirm that the schema in the database has changed and explore the relationships that were created for us

In [None]:
# We confirm that the schema has been loaded
graph.refresh_schema()
print(graph.schema)

# Step 6 - Run our first query against the Knowledge Graph

In [None]:
# We use the ChatOpenAI utility provided by Langchain to connect to the OpenAI API
llm = ChatOpenAI(model="gpt-3.5-turbo-0125",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2)

# Langchain provides this utility for doing Q&A over Neo4j, using the ChatOpenAI instance created above as the LLM
chain = GraphCypherQAChain.from_llm(graph=graph, llm=llm, verbose=True)

In [None]:
# List of questions that we want to run against the Knowledge Graph
questions = ["List all companies in Advertising Services industry!",
             "A worker who graduated from Simon Fraser University what is his name?",
             "Where is Paul Lukes working?",
             "A worker residing in Canada who is proficient in Vietnamese?",
             "How many workers from the United States speak Urdu?",
             "How many workers work for Capgemini?"]
for q in questions:
    print('====== START ======')
    print(chain.invoke(q)['result'])
    print('====== END ====== \n')
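
Conceptually, a chain of this kind works in three steps: the LLM turns the question and the graph schema into a Cypher query, the query runs against the database, and the LLM turns the returned rows into a natural-language answer. The sketch below illustrates that flow with stand-in functions. It is not Langchain's implementation: `answer_question` and the three `fake_*` helpers are hypothetical, and the company name is made up.

```python
def answer_question(question, schema, generate_cypher, run_query, summarize):
    """Question -> Cypher -> rows -> answer, with each step passed in as a callable."""
    cypher = generate_cypher(question=question, schema=schema)
    rows = run_query(cypher)
    return summarize(question=question, context=rows)

# Stand-ins for the LLM and the database (hypothetical, for illustration only)
def fake_generate_cypher(question, schema):
    return "MATCH (p:Person {name: 'Paul Lukes'})-[:WORKS_IN]->(c:Company) RETURN c.name"

def fake_run_query(cypher):
    return [{"c.name": "Example Corp"}]

def fake_summarize(question, context):
    return f"Paul Lukes works at {context[0]['c.name']}."

answer = answer_question(
    "Where is Paul Lukes working?",
    schema="(:Person)-[:WORKS_IN]->(:Company)",
    generate_cypher=fake_generate_cypher,
    run_query=fake_run_query,
    summarize=fake_summarize,
)
print(answer)
```

Seen this way, the quality of the final answer depends mostly on the first step, which is why the next steps focus on improving the prompt that generates the Cypher query.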

# Step 7 - Improve the prompting strategy using role prompting and few-shot examples

In [None]:
# We define some examples to show the model more details of the domain's structure
examples= [
    {
        "question": "Which workers speak French?",
        "query": "MATCH (p:Person)-[:SPEAKS]->(l:Language {{name: 'French'}}) RETURN p.name",
    },
    {
        "question": "What industries are workers named Emily associated with?",
        "query": "MATCH (p:Person {{name: 'Emily'}})-[:WORKS_IN]->(c:Company)-[:IS_IN]->(i:Industry) RETURN i.name",
    },
    {
        "question": "Which workers live in Canada and speak German?",
        "query": "MATCH (p:Person)-[:LIVES_IN]->(:Country {{name: 'Canada'}}), (p)-[:SPEAKS]->(:Language {{name: 'German'}}) RETURN p.name",
    },
    {
        "question": "In which countries do workers who speak Spanish live?",
        "query": "MATCH (p:Person)-[:SPEAKS]->(:Language {{name: 'Spanish'}}), (p)-[:LIVES_IN]->(c:Country) RETURN DISTINCT c.name AS Country",
    },
    {
        "question": "What companies do workers named John work in?",
        "query": "MATCH (p:Person {{name: 'John'}})-[:WORKS_IN]->(c:Company) RETURN c.name",
    },
    {
        "question": "How many workers in the Hospitals and Health Care industry are able to speak Korean?",
        "query": "MATCH (p:Person)-[:WORKS_IN]->(:Company)-[:IS_IN]->(:Industry {{name: 'Hospitals and Health Care'}}),(p)-[:SPEAKS]->(:Language {{name: 'Korean'}}) RETURN COUNT(DISTINCT p) AS NumberOfWorkers",
    },
    {
        "question": "What companies are located in the technology industry?",
        "query": "MATCH (c:Company)-[:IS_IN]->(:Industry {{name: 'Technology'}}) RETURN c.name",
    },
    {
        "question": "Where do workers named Alice live?",
        "query": "MATCH (p:Person {{name: 'Alice'}})-[:LIVES_IN]->(c:Country) RETURN c.name",
    },
]
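
The doubled braces (`{{` and `}}`) in the example queries are there because the examples end up inside a prompt template that is formatted with `str.format`-style placeholders, so literal Cypher braces must be escaped. The sketch below reproduces the idea with plain Python string formatting. It is an approximation for illustration, not Langchain's code, and the shortened template text is an assumption.

```python
example = {
    "question": "Which workers speak French?",
    "query": "MATCH (p:Person)-[:SPEAKS]->(l:Language {{name: 'French'}}) RETURN p.name",
}

# Pass 1: the example is rendered; substituted values are inserted verbatim,
# so the doubled braces are still present afterwards
example_text = "User input: {question}\nCypher query: {query}".format(**example)

# Pass 2: the rendered example becomes part of a larger template that is
# formatted again, which turns '{{' and '}}' into single braces
full_template = "Schema: {schema}\n\n" + example_text + "\n\nUser input: {question}\nCypher query: "
final_prompt = full_template.format(schema="foo", question="Where does Michael work?")
print(final_prompt)

# With single braces, the second pass would treat {name: ...} as a placeholder
unescaped = full_template.replace("{{", "{").replace("}}", "}")
try:
    unescaped.format(schema="foo", question="Where does Michael work?")
    failed = False
except KeyError:
    failed = True
print(failed)
```

The practical rule is simple: any `{` or `}` that belongs to Cypher syntax in an example must be doubled, while real placeholders such as `{schema}` and `{question}` keep single braces.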

In [None]:
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate

# We use another Langchain utility to implement the few-shot and prompting improvements
example_prompt = PromptTemplate.from_template(
    "User input: {question}\nCypher query: {query}"
)
prompt = FewShotPromptTemplate(
    examples=examples[:3],
    example_prompt=example_prompt,
    prefix="You are a Neo4j expert. Given an input question, create a syntactically correct Cypher query to run.\n\nHere is the schema information\n{schema}.\n\nBelow are a number of examples of questions and their corresponding Cypher queries.",
    suffix="User input: {question}\nCypher query: ",
    input_variables=["question", "schema"],
)

# We create a new connector with the new strategy that we have just created
chain2 = GraphCypherQAChain.from_llm(graph=graph, llm=llm, cypher_prompt=prompt, verbose=True)

In [None]:
# This is an example of the prompt that will be sent when we ask a question
print(prompt.format(question="Where does Michael work?", schema="foo"))

In [None]:
# We run again the questions with this new improved strategy
questions = ["List all companies in Advertising Services industry!",
             "A worker who graduated from Simon Fraser University what is his name?",
             "Where is Paul Lukes working?",
             "A worker residing in Canada who is proficient in Vietnamese?",
             "How many workers from the United States speak Urdu?",
             "How many workers work for Capgemini?"]
for q in questions:
    print('====== START ======')
    print(chain2.invoke(q)['result'])
    print('====== END ====== \n')

# Step 8 - Improve the quality of the examples passed to the model using similarity search

In [None]:
from langchain_community.vectorstores import Neo4jVector
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_community.embeddings import HuggingFaceEmbeddings

# We use yet another Langchain utility to select the examples most similar to the input question
example_selector = SemanticSimilarityExampleSelector.from_examples(
    examples,
    HuggingFaceEmbeddings(),
    Neo4jVector,
    url = neo4j_url,
    username = neo4j_user,
    password = neo4j_password,
    k=4,
    input_keys=["question"],
)

In [None]:
# Now the k=4 selected examples are the ones most similar to the question, sorted by similarity
example_selector.select_examples({"question": "Where does Michael live?"})
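
The idea behind the selector is to embed every example question, embed the incoming question, and keep the `k` nearest examples. The toy sketch below shows that ranking with a bag-of-words vector and cosine similarity in place of real embeddings. It is a simplification for illustration: the demo uses `HuggingFaceEmbeddings` and a `Neo4jVector` store, and the function names here are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts of the lower-cased text (a real model returns dense vectors)."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_examples(question, examples, k):
    """Return the k examples whose question is most similar to `question`."""
    query_vec = embed(question)
    ranked = sorted(examples, key=lambda ex: cosine(query_vec, embed(ex["question"])), reverse=True)
    return ranked[:k]

toy_examples = [
    {"question": "Which workers speak French?"},
    {"question": "Where do workers named Alice live?"},
    {"question": "What companies do workers named John work in?"},
]
top = select_examples("Where does Michael live?", toy_examples, k=1)
print(top[0]["question"])
```

With real embeddings the match does not depend on shared words, so a question phrased differently but with the same intent can still retrieve the right example.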

In [None]:
# Use the example selector and get rid of the examples=examples[:3]
dynamic_prompt = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix="You are a Neo4j expert. Given an input question, create a syntactically correct Cypher query to run.\n\nHere is the schema information\n{schema}.\n\nBelow are a number of examples of questions and their corresponding Cypher queries.",
    suffix="User input: {question}\nCypher query: ",
    input_variables=["question", "schema"],
)

# We create a new connector with the new strategy that we have just created
chain3 = GraphCypherQAChain.from_llm(graph=graph, cypher_prompt=dynamic_prompt, llm=llm, verbose=True, top_k=32, return_intermediate_steps=True)
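
With `return_intermediate_steps=True`, the result of `invoke` should carry the generated Cypher query and the retrieved rows alongside the final answer. The helper below is a hedged sketch of how to read the query back: it assumes the result has an `intermediate_steps` list containing a `{"query": ...}` entry, which matches my understanding of the chain's output but is worth confirming against the installed Langchain version. The `fake_result` dictionary is made up to keep the example runnable without a database.

```python
def generated_cypher(result):
    """Return the Cypher query stored in the chain result, or None if it is absent."""
    for step in result.get("intermediate_steps", []):
        if "query" in step:
            return step["query"]
    return None

# Hypothetical result shaped like the assumed chain output
fake_result = {
    "query": "How many workers work for Capgemini?",
    "result": "There are 3 workers at Capgemini.",
    "intermediate_steps": [
        {"query": "MATCH (p:Person)-[:WORKS_IN]->(:Company {name: 'Capgemini'}) RETURN COUNT(p)"},
        {"context": [{"COUNT(p)": 3}]},
    ],
}
print(generated_cypher(fake_result))
```

In the notebook this would be used as `generated_cypher(chain3.invoke(q))`, which makes it easy to log the query next to each answer when comparing prompting strategies.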

In [None]:
# We run again the questions with this new improved strategy
questions = ["List all companies in Advertising Services industry!",
             "A worker who graduated from Simon Fraser University what is his name?",
             "Where is Paul Lukes working?",
             "A worker residing in Canada who is proficient in Vietnamese?",
             "How many workers from the United States speak Urdu?",
             "How many workers work for Capgemini?"]

for q in questions:
    print('====== START ======')
    print(chain3.invoke(q)['result'])
    print('====== END ====== \n')