# 101 Hello, Weaviate
## Unit overview

In [1]:
from IPython.display import HTML
HTML('<iframe width="640" height="360" src="https://www.youtube.com/embed/FU7l5pr2FmU" allowfullscreen></iframe>')



This course is designed to take you from being new to Weaviate to building an MVP-level product with it in a short period of time.

Along the way, you'll develop intuitions about not only how Weaviate works, but also how vectors work, and how vector searches work. You'll also learn how to use Weaviate's client library so that you can get going in a language that you are familiar with.

By the time you're done with these short units, you'll be able to build your own instance of Weaviate with your own data, and have a suite of search tools at your disposal so that you can get the data you want in the format you want it.

## Learning objectives
Here, we will cover:
- What Weaviate is, and what it does.
- How to create your own Weaviate instance on WCS.
- Weaviate clients and how to install them.
- Hands-on experience with Weaviate.

By the time you are finished, you will be able to:
- Broadly describe what Weaviate is.
- Outline what vector search is.
- Create a Weaviate instance on WCS.
- Install your preferred Weaviate client.
- Describe some of Weaviate's capabilities.

# Introduction to Weaviate
## What is Weaviate?
Weaviate is an open-source vector database. But what does that mean? Let's unpack it here.

In [2]:
HTML('<iframe width="640" height="360" src="https://www.youtube.com/embed/MQgm126pKkU" allowfullscreen></iframe>')

### Vector database
Weaviate is a fantastic tool for retrieving the information you need, quickly and accurately. It does this by being an amazing **vector database**.

You may be familiar with traditional databases such as relational databases that use SQL. A database can catalog, store and retrieve information. A **vector** database can also carry out these tasks, with the key difference being that it can perform them based on similarity.

#### How traditional searches work
Imagine that you are searching a relational database containing articles on cities, to retrieve a list of "major" European cities. Using SQL, you might construct a query like this:

```sql
SELECT city_name, wiki_summary
FROM wiki_city
WHERE (wiki_summary LIKE '%major European city%' OR
       wiki_summary LIKE '%important European city%' OR
       wiki_summary LIKE '%prominent European city%' OR
       wiki_summary LIKE '%leading European city%' OR
       wiki_summary LIKE '%significant European city%' OR
       wiki_summary LIKE '%top European city%' OR
       wiki_summary LIKE '%influential European city%' OR
       wiki_summary LIKE '%notable European city%')
    -- (… and so on)
```
This query would return the cities whose `wiki_summary` column contains any of these phrases (`major European city`, `important European city`, `prominent European city`, and so on).

This works well in many circumstances. However, there are two significant limitations with this approach.

#### Limitations of traditional search
Using this type of search requires you to identify terms that may have been used to describe the concept, which is no easy feat.

What's more, this doesn't solve the problem of how to rank the list of resulting objects.

With the above search query, an entry merely containing a mention of a different European city (i.e. not very relevant) would be given equal weighting to an entry for Paris, or Rome, which would be highly relevant.

A vector database makes this job simpler by enabling searches based on similarity.
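
Both limitations can be seen in a small Python sketch that imitates the `LIKE '%...%'` matching above. The summaries below are hypothetical and written only for illustration, and lowercasing the text is a simplification (case sensitivity of `LIKE` varies between databases).

```python
# Hypothetical article summaries, made up for this illustration.
summaries = {
    "Paris": "Paris is the capital of France and a major European city.",
    "Smallville": "Smallville is a quiet village two hours from a major European city.",
    "Rome": "Rome is the capital of Italy and one of the great cities of Europe.",
}

# A few of the phrases that the SQL query looks for.
phrases = [
    "major european city",
    "important european city",
    "prominent european city",
]


def keyword_match(text: str) -> bool:
    """Return True if any phrase occurs in the text, like SQL LIKE '%...%'."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in phrases)


# A keyword match is yes/no: there is no score to rank the matches by.
matches = [city for city, text in summaries.items() if keyword_match(text)]
print(matches)
```

In this sketch, the village matches just as strongly as Paris, while Rome is missed entirely because its summary happens to use different wording.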

#### Examples of vector search
So, you could perform a query like this in Weaviate:

```
{
  Get {
    WikiCity (
      nearText: { concepts: ["Major European city"] }
    ) { city_name wiki_summary }
  }
}
```

And it would return a list of entries that are ranked by their similarity to the query - the idea of "Major European city".

What's more, Weaviate "indexes" the data based on their similarity, making this type of data retrieval lightning-fast.

Weaviate can help you to do all this, and actually a lot more. Another way to think about Weaviate is that it supercharges the way you use information.

> VECTOR VS SEMANTIC SEARCH<br>
A vector search is also referred to as a "semantic search" because it returns results based on the similarity of meaning (therefore "semantic").
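
To make "ranked by similarity" more concrete, here is a minimal sketch that ranks a few entries by cosine similarity, one common way to compare vectors. The three-dimensional vectors are hand-made assumptions for illustration; in practice the vectors come from a machine learning model and have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors; real embeddings are produced by a model.
embeddings = {
    "Smallville": [0.2, 0.7, 0.9],
    "Rome": [0.8, 0.9, 0.2],
    "Paris": [0.9, 0.8, 0.1],
}

# Toy vector standing in for the query "Major European city".
query_vector = [1.0, 0.9, 0.0]

# Sort entries from most similar to least similar to the query.
ranked = sorted(
    embeddings,
    key=lambda name: cosine_similarity(query_vector, embeddings[name]),
    reverse=True,
)
print(ranked)
```

Unlike a keyword match, every entry receives a score, so the results can be ordered from most to least relevant.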

### Open-source
Weaviate is open-source. In other words, its [codebase is available online](https://github.com/weaviate/weaviate) for anyone to see and use [$^{\rm{1}}$](https://weaviate.io/developers/academy/zero_to_mvp/hello_weaviate/intro_weaviate#1).

And it is the same codebase regardless of how you use it. So whether you run Weaviate on your own computer, on a cloud computing environment, or through our managed service [Weaviate Cloud Services, or WCS](https://console.weaviate.io/), you are using the exact same technology.

So, if you want, you can run Weaviate for free on your own device, or use our managed service for convenience. You can also take comfort in being able to see exactly what you are running, and you can be a part of the open-source community and help shape its development.

It also means that your knowledge about Weaviate is transferable between local, cloud, and managed instances of Weaviate. So anything you learn here about Weaviate using WCS will be equally applicable to running it locally, and vice versa. 😉

### Information, made dynamic
We are used to thinking of information as static, like a book. But with Weaviate and modern AI-driven language models, we can do much more than just retrieve static information: we can easily build on top of it. Take a look at these examples:

#### Question answering
Given a list of Wikipedia entries, you could ask Weaviate:
> When was Lewis Hamilton born?

And it would answer with:
> Lewis Hamilton was born on January 7, 1985. ([check for yourself](https://en.wikipedia.org/wiki/Lewis_Hamilton))

The corresponding query:
```
{
  Get {
    WikiArticle (
      ask: {
        question: "When was Lewis Hamilton born?",
        properties: ["wiki_summary"]
      },
      limit: 1
    ) {
      title
      _additional {
        answer {
          result
        }
      }
    }
  }
}
```
The corresponding response:
```
{
  "data": {
    "Get": {
      "WikiArticle": [
        {
          "_additional": {
            "answer": {
              "result": " Lewis Hamilton was born on January 7, 1985."
            }
          },
          "title": "Lewis Hamilton"
        }
      ]
    }
  }
}
```
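
Once you have a response like this, pulling out the answer is plain JSON handling. Here is a minimal sketch in Python that uses the response shown above as a string; how you obtain the response (for example through a client library) is outside the scope of this sketch.

```python
import json

# The response shown above, pasted in as a string.
response_text = """
{
  "data": {
    "Get": {
      "WikiArticle": [
        {
          "_additional": {
            "answer": {
              "result": " Lewis Hamilton was born on January 7, 1985."
            }
          },
          "title": "Lewis Hamilton"
        }
      ]
    }
  }
}
"""

response = json.loads(response_text)

# The path mirrors the query: Get -> class name -> list of results.
first_hit = response["data"]["Get"]["WikiArticle"][0]

# The answer sits under _additional; strip the leading space in the result.
answer = first_hit["_additional"]["answer"]["result"].strip()
print(f"{first_hit['title']}: {answer}")
```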

#### Generative search
You can also use Weaviate to synthesize new passages from retrieved information.

In this example, we searched Weaviate for an entry on a "racing driver" and asked it to present the result according to this prompt:
> Write a fun tweet encouraging people to read about this: ## {title} by summarizing highlights from: ## {wiki_summary}

Which produces:
> Check out the amazing story of Lewis Hamilton, the 7-time Formula One World Drivers' Championship winner! From his humble beginnings to becoming one of the world's most influential people, his journey is an inspiring one. #LewisHamilton #FormulaOne #Motorsport #Racing

The corresponding query:
```
{
  Get {
    WikiArticle(
      nearText: {
        concepts: ["Racing Driver"]
      }
      limit: 1
    ) {
      title
      wiki_summary
      _additional {
        generate(
          singleResult: {
            prompt: """
              Write a fun tweet encouraging people to read about this: ## {title}
              by summarizing highlights from: ## {wiki_summary}
            """
          }
        ) {
          singleResult
          error
        }
      }
    }
  }
}
```
The corresponding response:
```
{
  "data": {
    "Get": {
      "WikiArticle": [
        {
          "_additional": {
            "generate": {
              "error": null,
              "singleResult": "Check out the amazing story of Lewis Hamilton, the 7-time Formula One World Drivers' Championship winner! From his humble beginnings to becoming a global icon, his journey is an inspiring one. #LewisHamilton #FormulaOne #Motorsport #Racing #Inspiration"
            }
          },
          "title": "Lewis Hamilton",
          "wiki_summary": "Sir Lewis Carl Davidson Hamilton   (born 7 January 1985) is a British racing driver currently competing in Formula One, driving for Mercedes-AMG Petronas Formula One Team. In Formula One, Hamilton has won a joint-record seven World Drivers' Championship titles (tied with Michael Schumacher), and holds the records for the most wins (103), pole positions (103), and podium finishes (191), among others.\nBorn and raised in Stevenage, Hertfordshire, Hamilton joined the McLaren young driver programme in 1998 at the age of 13, becoming the youngest racing driver ever to be contracted by a Formula One team. This led to a Formula One drive with McLaren for six years from 2007 to 2012, making Hamilton the first black driver to race in the series. In his inaugural season, Hamilton set numerous records as he finished runner-up to Kimi R\u00e4ikk\u00f6nen by one point. The following season, he won his maiden title in dramatic fashion\u2014making a crucial overtake at the last corner on the last lap of the last race of the season\u2014to become the then-youngest Formula One World Champion in history.  After six years with McLaren, Hamilton signed with Mercedes in 2013.\nChanges to the regulations for 2014 mandating the use of turbo-hybrid engines saw the start of a highly successful period for Hamilton, during which he won six further drivers' titles. Consecutive titles came in 2014 and 2015 during an intense rivalry with teammate Nico Rosberg. Following Rosberg's retirement in 2016, Ferrari's Sebastian Vettel became Hamilton's closest rival in two championship battles, in which Hamilton twice overturned mid-season point deficits to claim consecutive titles again in 2017 and 2018. His third and fourth consecutive titles followed in 2019 and 2020 to equal Schumacher's record of seven drivers' titles. Hamilton achieved his 100th pole position and race win during the 2021 season. 
\nHamilton has been credited with furthering Formula One's global following by appealing to a broader audience outside the sport, in part due to his high-profile lifestyle, environmental and social activism, and exploits in music and fashion. He has also become a prominent advocate in support of activism to combat racism and push for increased diversity in motorsport. Hamilton was the highest-paid Formula One driver from 2013 to 2021, and was ranked as one of the world's highest-paid athletes by Forbes of twenty-tens decade and 2021. He was also listed in the 2020 issue of Time as one of the 100 most influential people globally, and was knighted in the 2021 New Year Honours. Hamilton was granted honorary Brazilian citizenship in 2022.\n\n"
        }
      ]
    }
  }
}
```
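
The `{title}` and `{wiki_summary}` placeholders in the prompt are filled in with the properties of the retrieved object before the prompt reaches the language model. Weaviate does this for you; the sketch below only imitates the idea with Python's `str.format`, and is not how Weaviate implements it. The summary is shortened here for readability.

```python
# The prompt from the query above, with placeholders for object properties.
prompt_template = (
    "Write a fun tweet encouraging people to read about this: ## {title} "
    "by summarizing highlights from: ## {wiki_summary}"
)

# Properties of the retrieved object (summary shortened for this sketch).
retrieved_object = {
    "title": "Lewis Hamilton",
    "wiki_summary": (
        "Sir Lewis Carl Davidson Hamilton (born 7 January 1985) is a "
        "British racing driver currently competing in Formula One"
    ),
}

# Substitute each placeholder with the matching property value.
filled_prompt = prompt_template.format(**retrieved_object)
print(filled_prompt)
```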

We will cover these and many more capabilities, such as vectorization, summarization and classification, in our units.

For now, keep in mind that Weaviate is a vector database at its core which can also leverage AI tools to do more with the retrieved information.

## Review
In this section, you learned what Weaviate is and how it works at a very high level. You were also introduced to vector search, which is a similarity-based search method.

### Review exercises
What is the difference in the Weaviate codebase between local and cloud deployments?

$\times$ Cloud deployments always include additional modules.<br>
$\times$ Local deployments are optimized for GPU use.<br>
$\times$ Cloud deployments are optimized for scalability.<br>
$\checkmark$ None, they are the same.

What is the best description of vector search?

$\times$ Vector search is a directional search.<br>
$\checkmark$ Vector search is a similarity-based search.<br>
$\times$ Vector search is a number-based search.

### Key takeaways
- Weaviate is an open-source vector database.
- The core Weaviate library is the same whether you run it locally, on the cloud, or with WCS.
- Vector searches are similarity-based searches.
- Weaviate can also transform retrieved data before returning it to you.

# Vectors - An overview
## What is a vector?

In [3]:
HTML('<iframe width="640" height="360" src="https://www.youtube.com/embed/iFUeV3aYynI" allowfullscreen></iframe>')

We've covered that Weaviate is a vector database, and that a vector search is similarity-based. But what is a vector?

A vector in this context is just a series of numbers - like `[1, 0]` or `[0.513, 0.155, 0.983, ..., 0.001, 0.932]`. Vectors like these are used to capture meaning as a series of numbers.

This might seem like an odd concept. But in fact, many people have used vectors already without realizing - for example if they have tried photo editing, or MS Paint.

### How do numbers represent meaning?
The RGB system uses numbers to represent colors. For example:
- `(255, 0, 0)` = red
- `(80, 200, 120)` = emerald

In these examples, each number can be thought of as a dial for how red, green or blue a color is.
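
Because colors are just numbers, "how similar are two colors?" becomes a question of how far apart their numbers are. Here is a minimal sketch using Euclidean distance; crimson `(220, 20, 60)` is added here only for comparison and is not part of the examples above.

```python
import math

colors = {
    "red": (255, 0, 0),
    "emerald": (80, 200, 120),
    "crimson": (220, 20, 60),  # added for comparison
}

# Euclidean (straight-line) distance between two RGB vectors.
red_to_crimson = math.dist(colors["red"], colors["crimson"])
red_to_emerald = math.dist(colors["red"], colors["emerald"])

print(f"red -> crimson: {red_to_crimson:.1f}")
print(f"red -> emerald: {red_to_emerald:.1f}")
```

Red is much closer to crimson than to emerald, which matches how we perceive these colors.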

Now, imagine having hundreds, or even thousands of these dials. That’s how vectors are used to represent meaning. Modern models, such as GPT-x or those used with Weaviate, use vectors in this manner to represent some "essence", or "meaning" of objects. And this can be done for any object type, such as text, code, images, videos and more.

Each vector representation of such "meaning" is called a vector embedding.

## Vector embeddings in Weaviate
Weaviate enables vector searches by indexing and storing data objects and corresponding vector embeddings from machine learning models.

In plain terms, Weaviate processes and organizes your data in such a way that objects can be retrieved based on their similarity to a query. In order for it to perform these tasks at speed, Weaviate does two things that traditional databases do not. They are:
- Quantifying similarity, and
- Indexing vector data

These aspects enable Weaviate to do what it does.

### Quantifying similarity
As we've mentioned, vector searches are similarity-based, but what does that actually mean? How do we determine that two pieces of data are "similar"? What does it mean for two pieces of text, two images, or two objects in general, to be similar?

This is a relatively simple idea that is actually incredibly interesting and intricate once we start to dive into the details.

But for now, you should know that machine learning (ML) models are key to this whole process. The models that power vector searches are similar to those that generate text from prompts. Instead of generating new text, however, these models capture the "meaning" of pieces of text or other media. We will cover this in more detail later on.

### Indexing (vector) data
Vector searches can be very computationally intensive.

To overcome this problem, Weaviate uses a combination of indexes, including an approximate nearest neighbor (ANN) index and an inverted index. The ANN index allows Weaviate to perform extremely fast vector searches, while the inverted index allows it to filter data using Boolean criteria.

We will get into this in more detail later - but for now, it's enough to know that Weaviate can perform fast vector searches as well as filtering.
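
As a rough intuition for how the two indexes work together, here is a toy sketch. The objects and two-dimensional vectors are hypothetical, and the brute-force `search` function merely stands in for an ANN index, which would avoid comparing the query against every vector. This is an illustration of the idea, not Weaviate's implementation.

```python
import math
from collections import defaultdict

# Hypothetical objects: each has a filterable property and a toy 2-D vector.
objects = {
    1: {"continent": "Europe", "vector": [0.9, 0.1]},
    2: {"continent": "Europe", "vector": [0.8, 0.3]},
    3: {"continent": "Asia", "vector": [0.95, 0.05]},
}

# Inverted index: (property, value) -> IDs of the objects holding that value.
inverted_index = defaultdict(set)
for object_id, obj in objects.items():
    inverted_index[("continent", obj["continent"])].add(object_id)


def search(query_vector, allowed_ids=None):
    """Brute-force nearest-neighbor search, optionally limited to allowed IDs."""
    candidates = objects if allowed_ids is None else allowed_ids
    return sorted(
        candidates,
        key=lambda oid: math.dist(query_vector, objects[oid]["vector"]),
    )


query_vector = [1.0, 0.0]

# Without a filter, every object is ranked by distance to the query.
unfiltered = search(query_vector)

# With a filter, the inverted index supplies the IDs that may be returned.
european_only = search(query_vector, inverted_index[("continent", "Europe")])

print(unfiltered, european_only)
```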

## Review
In this section, you learned what vectors are and how Weaviate utilizes them at a very high level. You were also introduced to Weaviate's two key capabilities that help it to perform vector search at speed.

### Review exercise
> Can you describe, in your own words, what vectors are?<br><br>My answer:<br>A vector of dimension $n$ is a list of $n$ numbers and thus can be interpreted as a point in $n$-dimensional space. In the context of artificial intelligence, vectors have many use cases, among them the representation of *embeddings*, i.e., a mathematical representation of tokens (sub-words). Creating "good" embeddings – that is, embeddings that relate to each other like the sub-words they represent (e.g., "king" should relate to "queen" like "man" to "woman") – for a set of tokens is paramount to natural language processing and other areas of machine learning.

Which of these statements are true?

$\times$ Vector search is a directional search.<br>
$\checkmark$ Vector search is a similarity-based search.<br>
$\times$ Vector search is a number-based search.

### Key takeaways
- A vector is a series of numbers that capture the meaning or essence of objects.
- Machine learning models help quantify similarity between different objects, which is essential for vector searches.
- Weaviate uses a combination of an approximate nearest neighbor (ANN) index and an inverted index to perform fast vector searches with filtering.

# Examples 1 - Queries
## Vectors in action

In [4]:
HTML('<iframe width="640" height="360" src="https://www.youtube.com/embed/zC0CpBiLC3g" allowfullscreen></iframe>')

Let's take a look at a few more examples of what you can do with Weaviate.

First, we will try vector searches by searching through our demo database. You will learn how to use Weaviate to retrieve objects based on their similarity, using various query types such as an input text, vector, or object.

You will also compare and contrast vector search with keyword search, before learning how to combine the two techniques through the use of filters.

## Vector search demo
For our first example, let's look through our demo dataset which contains a small sample of questions from the quiz show Jeopardy!.

Imagine that you're running a quiz night, and want to get some questions around the category of "animals in movies". You could look for word matches - perhaps something like:

```sql
SELECT question, answer
FROM jeopardy_questions
WHERE (question LIKE '%animal%' OR question LIKE '%creature%' OR question LIKE '%beast%')
AND (question LIKE '%movie%' OR question LIKE '%film%' OR question LIKE '%picture%' OR question LIKE '%cinema%')
```

But writing such a query well is very difficult. You would likely need to know the names of the specific animals to get useful results.

Not so much with Weaviate, though. See what happens when we run the following query:

> WE SEARCHED WEAVIATE FOR:<br>animals in movies

See the full query:
```
{
  Get {
    JeopardyQuestion (
      nearText: {
        concepts: ["animals in movies"]
      }
      limit: 3
    ) {
      question
      answer
    }
  }
}
```
Weaviate retrieved these as the top answers:
> **meerkats**: Group of mammals seen here like Timon in *The Lion King*<br>**dogs**: Scooby-Doo, Goofy & Pluto are cartoon versions<br>**The Call of the Wild Thornberrys**: Jack London story about the dog Buck who joins a Nick cartoon about Eliza, who can talk to animals

JSON response:
```
{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "answer": "meerkats",
          "question": "Group of mammals seen <a href=\"http://www.j-archive.com/media/1998-06-01_J_28.jpg\" target=\"_blank\">here</a>:  [like Timon in <i>The Lion King</i>]"
        },
        {
          "answer": "dogs",
          "question": "Scooby-Doo, Goofy & Pluto are cartoon versions"
        },
        {
          "answer": "The Call of the Wild Thornberrys",
          "question": "Jack London story about the dog Buck who joins a Nick cartoon about Eliza, who can talk to animals"
        }
      ]
    }
  }
}
```

Note just how relevant the results are, even though none of them includes the word "movie", and only one mentions "animals".

This is exactly why vector searches are so useful. They can identify related objects without the need to match exact texts.
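
You can check this against the keyword approach from earlier. The sketch below applies the same conditions as the SQL query to the three questions Weaviate returned (using the rendered text, without the HTML markup). Lowercasing before matching is a simplification of how `LIKE` behaves.

```python
# The three results returned by the vector search above.
results = [
    {"answer": "meerkats",
     "question": "Group of mammals seen here: [like Timon in The Lion King]"},
    {"answer": "dogs",
     "question": "Scooby-Doo, Goofy & Pluto are cartoon versions"},
    {"answer": "The Call of the Wild Thornberrys",
     "question": "Jack London story about the dog Buck who joins a Nick "
                 "cartoon about Eliza, who can talk to animals"},
]

# The same terms as in the SQL query's two groups of LIKE conditions.
animal_terms = ["animal", "creature", "beast"]
movie_terms = ["movie", "film", "picture", "cinema"]


def contains_any(text, terms):
    """Return True if any term occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


# Keep only the results that satisfy both groups, as the SQL query requires.
keyword_hits = [
    r["answer"]
    for r in results
    if contains_any(r["question"], animal_terms)
    and contains_any(r["question"], movie_terms)
]
print(keyword_hits)
```

None of the three questions satisfies both conditions, so the keyword query would have missed every one of these relevant results.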

### Vector similarities demo
What if we run *this* query? What will we get back?
```
{
  Get {
    JeopardyQuestion (
      nearText: {
        concepts: ["European geography"]
      }
      limit: 3
    ) {
      question
      answer
      _additional {
        distance
      }
    }
  }
}
```
Take a look at this response. Do you notice any additional information?
```
{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "_additional": {
            "distance": 0.15916324
          },
          "answer": "Bulgaria",
          "question": "A European republic: Sofia"
        },
        ...
      ]
    }
  }
}
```
JSON response:
```
{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "_additional": {
            "distance": 0.15916324
          },
          "answer": "Bulgaria",
          "question": "A European republic: Sofia"
        },
        {
          "_additional": {
            "distance": 0.16247147
          },
          "answer": "Balkan Peninsula",
          "question": "The European part of Turkey lies entirely on this peninsula"
        },
        {
          "_additional": {
            "distance": 0.16832423
          },
          "answer": "Mediterranean Sea",
          "question": "It's the only body of water with shores on the continents of Asia, Africa & Europe"
        }
      ]
    }
  }
}
```
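
The `distance` values requested under `_additional` indicate how far each result is from the query. Here is a minimal sketch that checks the ordering of the values shown above, on the assumption that a smaller distance means a closer match.

```python
# (answer, distance) pairs copied from the response above, in returned order.
hits = [
    ("Bulgaria", 0.15916324),
    ("Balkan Peninsula", 0.16247147),
    ("Mediterranean Sea", 0.16832423),
]

distances = [distance for _, distance in hits]

# The results arrive sorted from smallest to largest distance.
is_sorted_by_distance = distances == sorted(distances)

# The first hit is therefore the one closest to the query.
closest_answer = hits[0][0]
print(is_sorted_by_distance, closest_answer)
```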