# Semantic search

Semantic search aims to understand search phrases' intent and contextual meaning, rather than focusing on individual keywords.

By contrast, traditional keyword search depends on exact keyword matches or on proximity-based algorithms that find similar words.

## Context

Semantic search depends on understanding the context of the search. There are several strategies for establishing that context, including:

- What other information is included in the search? For example, if the search phrase contains "bank" and "river", the search is likely about waterways, not financial institutions.

- What is known about the user? Their search history and location can provide information about the context of the search. If they are in the UK, a search for "football" is likely about soccer, not American football.

- What scenario is being presented to the user? If the search is on a website about cars, a search for "dash" is likely about dashboards, not running quickly.

## Why is semantic search useful?

Semantic search allows you to find and score related data. It is useful for finding similarities within unstructured data, where relevant results depend on understanding the intent and contextual meaning of the search query.

Some typical use cases are:

- Customer Support and Chatbots - Improving the ability of chatbots and customer support systems to understand and respond to user queries in a more human-like and contextually relevant manner.

- Product Discovery and Recommendation - Enhancing product search by understanding the nuanced needs and preferences expressed in search queries, leading to better product recommendations.

- Recruitment and Talent Acquisition - Matching job descriptions with candidate profiles more effectively by understanding the deeper meaning and requirements of job postings and the skills and experiences described in resumes.

- Knowledge Management and Information Retrieval - Enhancing the retrieval of information from large databases or document management systems by understanding the context and meaning of the information being sought.

- Anomaly Detection - Identifying transactions or messages that are out of the norm and may be fraudulent.

## Considerations

Semantic search faces several challenges that stem from the complexity of natural language, the diversity of user intents, and the dynamic nature of information. Some of these challenges include:

- **Understanding Context** - Accurately grasping the context of queries can be difficult. Different users might use the same words to mean different things.

- **Language Ambiguity** - Natural language is inherently ambiguous. Words can have multiple meanings, and different models may interpret sentences differently.

- **Fine-Tuning** - To get the best results, you may need to invest significant effort in fine-tuning your model, data, and search algorithms.

- **Transparency** - The complexity behind semantic search can make it difficult to understand how a score is determined or why a particular result is returned.


The following cell creates an embedding for a piece of text and prints its first ten values:

```python
from dotenv import load_dotenv

# Load environment variables (such as API keys) before importing the helpers.
load_dotenv()

from utils import create_embedding

embedding = create_embedding("What does Hallucination mean?")
embedding[:10]
```

```text
[-0.011630266904830933,
 -0.002094872761517763,
 0.013210325501859188,
 -0.0008369777351617813,
 -0.04646926373243332,
 0.004173556342720985,
 -0.03636724874377251,
 -0.02342890202999115,
 -0.008444247767329216,
 -0.0023279960732907057]
```
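
Embeddings become useful once you compare them. One common measure is cosine similarity, which scores two vectors by the angle between them. The sketch below is illustrative only: the three-dimensional vectors are made-up stand-ins for real embeddings, and `cosine_similarity` is a hypothetical helper, not part of `utils`.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up toy vectors standing in for real embeddings.
query = [1.0, 0.0, 1.0]
similar = [2.0, 0.0, 2.0]    # points in the same direction as `query`
unrelated = [0.0, 1.0, 0.0]  # orthogonal to `query`

print(cosine_similarity(query, similar))    # close to 1: same direction
print(cosine_similarity(query, unrelated))  # 0: nothing in common
```

Vectors that point in the same direction score close to 1 regardless of their length, which is why cosine similarity is a popular choice for comparing text embeddings.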

The next cell reads every movie plot from Neo4j, creates an embedding for each one, and stores it on the `Movie` node as `plotEmbedding`:

```python
import os
import textwrap

from dotenv import load_dotenv

# Load connection settings before importing the helpers.
load_dotenv()

from neo4j import GraphDatabase
from utils import execute_query, create_embedding

neo4j_uri = os.getenv("NEO4J_URI")
neo4j_user = os.getenv("NEO4J_USERNAME")
neo4j_pass = os.getenv("NEO4J_PASSWORD")
neo4j_db = os.getenv("NEO4J_DATABASE")

neo4j_driver = GraphDatabase.driver(neo4j_uri, auth=(neo4j_user, neo4j_pass))

# Fetch every movie that has a non-empty plot.
cypher = textwrap.dedent("""
MATCH (m:Movie)
WHERE m.plot IS NOT NULL AND m.plot <> ''
RETURN m.movieId as movieId, m.plot as plot
""")

# Store an embedding on the matching Movie node.
update_cypher = textwrap.dedent("""
MATCH (m:Movie {movieId: $movieId})
SET m.plotEmbedding = $plot_embedding
""")

result = execute_query(neo4j_driver, cypher)

count = 0
for record in result:
    movie_id = record['movieId']
    plot = record['plot']

    # Create embedding for the plot
    plot_embedding = create_embedding(plot)

    # Skip movies whose embedding could not be created
    if plot_embedding is None or len(plot_embedding) == 0:
        print(f"Warning: Failed to create embedding for movie {movie_id}")
        continue

    # Update the Movie node with the embedding
    execute_query(neo4j_driver, update_cypher, {
        "movieId": movie_id,
        "plot_embedding": plot_embedding
    })

    count += 1
    if count % 10 == 0:  # Progress indicator every 10 movies
        print(f"Processed {count} movies...")

print(f"All embeddings added! Total: {count} movies")
neo4j_driver.close()
```

```text
Processed 10 movies...
Processed 20 movies...
Processed 30 movies...
Processed 40 movies...
Processed 50 movies...
Processed 60 movies...
Processed 70 movies...
Processed 80 movies...
Processed 90 movies...
All embeddings added! Total: 93 movies
```
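
The loop above issues one write query per movie, which is fine for 93 movies. For larger datasets, a common alternative is to send rows in batches and let Cypher `UNWIND` them. The following is a minimal sketch under stated assumptions: `chunked`, `write_in_batches`, and `BATCH_UPDATE_CYPHER` are hypothetical names, and `run_query` is any callable that accepts a Cypher string and a parameter dictionary. The dry run uses a stand-in callable and made-up rows, so it does not touch a database.

```python
from itertools import islice

# Assumed batch statement: each row carries a movieId and its embedding.
BATCH_UPDATE_CYPHER = """
UNWIND $rows AS row
MATCH (m:Movie {movieId: row.movieId})
SET m.plotEmbedding = row.embedding
"""


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def write_in_batches(run_query, rows, batch_size=50):
    """Send rows to `run_query` one batch at a time; return the row count."""
    written = 0
    for batch in chunked(rows, batch_size):
        run_query(BATCH_UPDATE_CYPHER, {"rows": batch})
        written += len(batch)
    return written


# Dry run: the stand-in callable only records the size of each batch.
calls = []
rows = [{"movieId": str(i), "embedding": [0.0, 0.1]} for i in range(93)]
total = write_in_batches(lambda q, p: calls.append(len(p["rows"])), rows)
print(total, calls)  # prints: 93 [50, 43]
```

With the notebook's helper, `run_query` could be `lambda q, p: execute_query(neo4j_driver, q, p)`, assuming `execute_query` takes the driver, the query, and a parameter dictionary as it does in the cell above.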


## Finding Movie Plots

Run the following Cypher query to return the titles and plots for the movies in the database:

```cypher
MATCH (m:Movie)
RETURN m.title, m.plot
```

You can adapt the query to only return a named movie by adding a filter:

```cypher
MATCH (m:Movie {title: "Toy Story"})
RETURN m.title, m.plot
```

You can view the embedding for a movie plot by running the following query:

```cypher
MATCH (m:Movie {title: "Toy Story"})
RETURN m.title, m.plotEmbedding
```
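
A vector index can only be queried if it exists. If your database does not already have one, a statement along the following lines can create it in recent Neo4j 5 releases. This is a sketch, not the course's setup script: `build_vector_index_cypher` is a hypothetical helper, and the dimension count of 1536 is a placeholder assumption that must be replaced with the actual length of your embeddings (for example, `len(embedding)`).

```python
def build_vector_index_cypher(name, label, prop, dimensions, similarity="cosine"):
    """Build a CREATE VECTOR INDEX statement for a node property.

    The identifiers are interpolated directly into the statement,
    so only pass trusted, hard-coded names.
    """
    if similarity not in ("cosine", "euclidean"):
        raise ValueError("similarity must be 'cosine' or 'euclidean'")
    if dimensions < 1:
        raise ValueError("dimensions must be a positive integer")
    return (
        f"CREATE VECTOR INDEX {name} IF NOT EXISTS\n"
        f"FOR (n:{label})\n"
        f"ON n.{prop}\n"
        "OPTIONS {indexConfig: {\n"
        f"  `vector.dimensions`: {dimensions},\n"
        f"  `vector.similarity_function`: '{similarity}'\n"
        "}}"
    )


# 1536 is an assumed placeholder; use the real length of your embeddings.
statement = build_vector_index_cypher(
    "moviePlots", "Movie", "plotEmbedding", dimensions=1536
)
print(statement)
```

The dimension count must match the vectors stored in `plotEmbedding`, and the similarity function determines how the `score` returned by the index is calculated.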

You can query the vector index to find similar movies by running the following query. The second argument (`6`) is the number of nearest neighbors to return:

```cypher
MATCH (m:Movie {title: 'Toy Story'})

CALL db.index.vector.queryNodes('moviePlots', 6, m.plotEmbedding)
YIELD node, score

RETURN node.title, node.plot, score
```
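
Conceptually, the index lookup ranks stored vectors by their similarity to the query vector and returns the best matches. The brute-force sketch below illustrates that idea in plain Python; it is not how the index is implemented, the score scale may differ from what Neo4j reports, and the titles and two-dimensional vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def top_k_similar(query_vector, candidates, k):
    """Return the k (title, score) pairs with the highest cosine similarity."""
    scored = [
        (title, cosine_similarity(query_vector, vector))
        for title, vector in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up titles and 2-dimensional vectors standing in for plot embeddings.
movies = {
    "Movie A": [1.0, 0.0],
    "Movie B": [0.9, 0.1],
    "Movie C": [0.0, 1.0],
    "Movie D": [-1.0, 0.0],
}

results = top_k_similar(movies["Movie A"], movies, k=3)
for title, score in results:
    print(title, round(score, 3))
```

Because the query vector belongs to one of the candidates, that movie ranks first with the highest score. The same is likely true of the Cypher query above, where the source movie would take one of the six result slots.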

GraphAcademy has loaded a dataset of movie posters into the sandbox. Each movie has a URL to a poster image:

```cypher
MATCH (m:Movie {title: 'Toy Story'})
RETURN m.title, m.poster
```

The data also contains embeddings for each poster:

```cypher
MATCH (m:Movie {title: 'Toy Story'})
RETURN m.title, m.posterEmbedding
```

In the same way that you can use a vector index to find similar text, you can use one to find similar images:

```cypher
MATCH (m:Movie {title: 'Babe'})

CALL db.index.vector.queryNodes('moviePosters', 6, m.posterEmbedding)
YIELD node, score

RETURN node.title, node.poster, score
```