![](https://miro.medium.com/max/700/1*pJOoQOlHCses8zvefz3YFg.png)

# Build a Stack Overflow search engine with Python and ML

This tutorial helps you build an ML-powered search engine for Stack Overflow data while introducing [DocArray](https://docarray.jina.ai?utm_source=stack-overflow-notebook) and [Jina](https://docs.jina.ai). A user can input a text query and then retrieve questions and answers where the question title is similar to the query.

Throughout the notebook we'll have some sections called ⚙️ **Tinker time**, which explain some common changes you may want to make to the code.

## Meet our ingredients

### **[DocArray](https://docarray.jina.ai?utm_source=stack-overflow-notebook)**

DocArray is a library for nested, unstructured data in transit, including text, image, audio, video, 3D mesh, etc. It allows deep-learning engineers to efficiently process, embed, search, recommend, store, and transfer the multi-modal data with a Pythonic API. ([star the repo]())

### **[Jina](https://docs.jina.ai)**
 
Jina is a framework that empowers anyone to build cross-modal and multi-modal applications on the cloud. It uplifts a PoC into a production-ready service. Jina handles the infrastructure complexity, making advanced solution engineering and cloud-native technologies accessible to every developer. ([star the repo]())

### **[Jina Hub](https://hub.jina.ai)**

Download pre-built building blocks for neural search.


### **[Stack Overflow R dataset](https://www.kaggle.com/datasets/stackoverflow/rquestions)**

Why not use the [Python dataset](https://www.kaggle.com/datasets/stackoverflow/pythonquestions)? When I tried reading in the CSV I got a few encoding errors and it frankly wasn't worth the headache.

## Why this tech stack?

I'm using the Jina ecosystem because:

- I don't have to manually integrate a lot of stuff like encoders and indexers. I can just pull them with one line of code.
- Switching out one encoder (e.g. [spaCy](https://hub.jina.ai/executor/u7h7cuh2)) for another (e.g. [Transformers](https://hub.jina.ai/executor/u9pqs8eb)) is a matter of just changing a couple of lines. This is great for tinkering and seeing what works best.
- I can run compute-heavy tasks (e.g. encoding) on a cloud [sandbox](https://docs.jina.ai/how-to/sandbox?utm_source=stack-overflow-notebook) really easily, freeing up my own resources. (This mostly matters when you run the notebook locally or adapt it for production.)
- If I were to productionize this example, I could easily increase speed and reliability via [sharding or replicas](https://docs.jina.ai/how-to/scale-out?utm_source=stack-overflow-notebook). And running on [Kubernetes](https://docs.jina.ai/how-to/kubernetes/) is a breeze.

---

Let's start by installing DocArray:

In [None]:
!pip install -q docarray==0.13.30

...and then importing [DocumentArray](https://docarray.jina.ai/fundamentals/documentarray?utm_source=stack-overflow-notebook):

In [None]:
from docarray import DocumentArray

## Downloading our Data

Unfortunately, Colab notebooks don't save state, so we can't store our data alongside our notebook. So how do we get the dataset's CSV into a DocumentArray?

We could remedy this in two ways:

1. Download the CSV and [import directly](https://docarray.jina.ai/datatypes/tabular?utm_source=stack-overflow-notebook) into a [DocumentArray](https://docarray.jina.ai/fundamentals/documentarray/) with `docs = DocumentArray.from_csv("Questions.csv")`. This is tricky since it's stored on Kaggle and I don't really want to share my Kaggle key publicly. Or...

2. Here's one I made earlier! In one command we can [pull in a pre-existing DocumentArray from the cloud](https://docarray.jina.ai/fundamentals/documentarray/serialization/?highlight=pull#from-to-cloud). We'll just use the first 1,000 questions in the dataset since this is a demo:

In [None]:
docs = DocumentArray.pull(name="stack_overflow_r_q")[:1000]

---

##### ⚙️ Tinker time

All data that goes into our pipeline needs to be in the form of a [Document]() or [DocumentArray](). They can store any kind of data, so whether we were making an image search engine, text search engine, 3D mesh search engine or whatever, all data would be stored in this data type.

There are several quick ways to create a DocumentArray:

- [From CSV](https://docarray.jina.ai/datatypes/tabular?utm_source=stack-overflow-notebook): `DocumentArray.from_csv('toy.csv', field_resolver={'Title': 'text'})` - every row of a CSV becomes a Document, with the `Title` field as the primary data (which will be processed) and other fields as metadata tags.
- [From a folder](https://docarray.jina.ai/fundamentals/documentarray/construct/#construct-from-local-files): `DocumentArray.from_files("data/**/*.jpg", recursive=True)` - every file in the glob pattern is stored as a Document in the DocumentArray
- [From JSON](https://docarray.jina.ai/fundamentals/documentarray/serialization/#from-to-json): `DocumentArray.from_json("foo.json")` - every record in the JSON file becomes a Document
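
To make the `field_resolver` idea concrete, here is a minimal plain-Python sketch of the mapping described above: one CSV column becomes the text and the remaining columns become tags. It uses only the standard library and does not reproduce DocArray's actual implementation; the `rows_to_records` helper and the sample rows are hypothetical.

```python
import csv
import io

# Hypothetical sample standing in for Questions.csv
SAMPLE_CSV = """Id,Title,Score
1,How do I create a matrix?,12
2,How to read a CSV file in R?,7
"""


def rows_to_records(csv_file, field_resolver):
    """Turn each CSV row into a dict with 'text' and 'tags' keys.

    field_resolver maps a CSV column name to a target field, mirroring
    the {'Title': 'text'} idea; unmapped columns are kept as tags.
    """
    records = []
    for row in csv.DictReader(csv_file):
        record = {"text": "", "tags": {}}
        for column, value in row.items():
            if field_resolver.get(column) == "text":
                record["text"] = value
            else:
                record["tags"][column] = value
        records.append(record)
    return records


records = rows_to_records(io.StringIO(SAMPLE_CSV), {"Title": "text"})
for record in records:
    print(record["text"], record["tags"])
```

Note that the tag values stay as strings here, because the `csv` module does not guess types.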

---

Let's see what we've got. As we can see, there are 1,000 [Documents](https://docarray.jina.ai/fundamentals/document?utm_source=stack-overflow-notebook), each with:
- The title of the question in `doc.text` - this is what will be encoded later in our [Flow](https://docs.jina.ai/fundamentals/flow?utm_source=stack-overflow-notebook).
- Tags - i.e. metadata, containing a `dict` of all the other fields associated with that question title.
- ID - a unique identifier for each Document.

In [None]:
docs.summary()

Let's take a closer look at a single Document so we can get an idea of the structure:

In [None]:
from pprint import pprint # pretty-print makes it easier for humans to read dicts

print(docs[0].text)
pprint(docs[0].tags)
pprint(docs[0].id)

## Setting up our Flow

To build a search engine we need to pass our Documents into a [Flow](https://docs.jina.ai/fundamentals/flow?utm_source=stack-overflow-notebook). This is what will create embeddings and store our Documents in an index for fast look-up later.

We'll use the [Jina](https://docs.jina.ai?utm_source=stack-overflow-notebook) package to build and orchestrate our Flow.

In [None]:
!pip install -q jina==3.6.11

Creating a Flow is a matter of chaining together building blocks (a.k.a [Executors](https://docs.jina.ai/fundamentals/executor?utm_source=stack-overflow-notebook)). In our case we won't [write these manually](https://docs.jina.ai/fundamentals/executor/executor-api/), but rather we'll either download them from [Jina Hub](https://hub.jina.ai) or run them in a [sandbox in the cloud](https://docs.jina.ai/how-to/sandbox/?highlight=sandbox). This will save us some time and effort.

Let's start by creating an empty Flow:

In [None]:
from jina import Flow

flow = Flow()

Now we'll add our **encoder**. This will encode the text from each Document into vector embeddings. We'll need these for matching similar text later on.
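
As a rough intuition for what an encoder produces, the sketch below turns a sentence into a vector by averaging per-word vectors, which is also how spaCy derives its default document vector from token vectors. The three-dimensional `TOY_VECTORS` table and the `encode` helper are made up for illustration; real model vectors are much larger (300 dimensions for `en_core_web_md`).

```python
# Made-up 3-dimensional word vectors, for illustration only
TOY_VECTORS = {
    "create": [1.0, 0.0, 0.0],
    "matrix": [0.0, 1.0, 0.0],
    "plot": [0.0, 0.0, 1.0],
}


def encode(text, vectors=TOY_VECTORS, dim=3):
    """Average the vectors of known words; unknown words are skipped."""
    words = [w.strip("?.,!").lower() for w in text.split()]
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return [0.0] * dim  # nothing recognised: fall back to a zero vector
    return [sum(component) / len(known) for component in zip(*known)]


embedding = encode("How do I create a matrix?")
print(embedding)  # [0.5, 0.5, 0.0]
```

Texts that share words with similar vectors end up with similar embeddings, which is what makes matching possible later on.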

In our case we'll use [SpacyTextEncoder](https://hub.jina.ai/executor/u7h7cuh2) with the medium language model, though you could swap it out easily for other encoders like [Transformers](https://hub.jina.ai/executor/u9pqs8eb). I typically use spaCy because it's been much faster than alternatives (at least in my experience).

We'll use the [medium English model](https://spacy.io/models/en#en_core_web_md) (`en_core_web_md`) since I find that's a good balance between accuracy and performance. 

We'll run it in a sandbox in the cloud. This way, even if you run this notebook on your own machine, you won't be using your own compute for intensive tasks like generating encodings.

In [None]:
flow = flow.add(
    name="encoder",
    uses="jinahub+sandbox://SpacyTextEncoder/v0.4",
    uses_with={"model_name": "en_core_web_md"}
)

---

##### ⚙️ **Tinker time**

If you want to tinker, you can swap out the encoder Executor by changing the `uses` value and the `uses_with` parameters. So if we wanted to use [Transformers](https://hub.jina.ai/executor/u9pqs8eb) to encode our Documents, we could change our code to:

```python
flow = flow.add(
    name="encoder",
    uses="jinahub+sandbox://TransformerTorchEncoder",
    uses_with={"pretrained_model_name_or_path": "bert-base-uncased"}
)
```

If you stick with spaCy, you could also change `model_name` to:

- [`en_core_web_sm`](https://spacy.io/models/en#en_core_web_sm) - Smaller model. Likely less accurate, but faster performance.
- [`en_core_web_lg`](https://spacy.io/models/en#en_core_web_lg) - Larger model. Potentially more accurate, but slower performance.

spaCy also offers [models in other languages](https://spacy.io/models).

---

Next we'll add our indexer. This takes the vector embeddings and metadata and stores them in a database for fast lookup when a user is searching.
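
Conceptually, an indexer keeps each embedding next to its text and metadata and, at query time, returns the stored items whose vectors are closest to the query vector. The sketch below does this with an exhaustive cosine-similarity scan in plain Python. It is a stand-in for illustration only: the `ToyIndex` class and its vectors are hypothetical, and real indexers typically use approximate nearest-neighbour search rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyIndex:
    """Hypothetical in-memory index storing (embedding, text, tags) tuples."""

    def __init__(self):
        self._items = []

    def index(self, embedding, text, tags=None):
        self._items.append((embedding, text, tags or {}))

    def search(self, query_embedding, limit=2):
        """Return up to `limit` (score, text) pairs, best match first."""
        scored = [
            (cosine_similarity(query_embedding, emb), text)
            for emb, text, _tags in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:limit]


toy_index = ToyIndex()
toy_index.index([1.0, 0.0], "How to build a matrix in R", {"Id": 1})
toy_index.index([0.0, 1.0], "Plotting two lines with ggplot2", {"Id": 2})
toy_index.index([0.9, 0.1], "Create a matrix from a vector", {"Id": 3})

top_matches = toy_index.search([1.0, 0.0])
for score, text in top_matches:
    print(f"{score:.3f}  {text}")
```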

We'll use [AnnLiteIndexer](https://hub.jina.ai/executor/7yypg8qk), which will store our data in a SQLite database. For production use, other indexers like [HNSWPostgresIndexer](https://hub.jina.ai/executor/dvp0845a) may be more suitable, but for a simple notebook this is a good fit. AnnLite also has the benefit that we can filter our search by tags. We won't do that in this notebook, but it's a nice future option.

In this case we won't run AnnLite in a sandbox, since we want our indexed data stored in the same place as our notebook, not on some cloud machine.

In [None]:
flow = flow.add(
    name="indexer",
    uses="jinahub://AnnLiteIndexer/0.3.0",
    uses_with={"dim": 300},  # we're using a 300 dimension model
    # uses_metas={"workspace": "workspace"},  # this is where we'll store our data on disk
    install_requirements=True
)

---

##### ⚙️ Tinker time

There are lots of ways to run an Executor:

- `uses=jinahub://foo` - downloads the Executor source from [Hub](https://hub.jina.ai) to your local machine and runs it there (don't forget `install_requirements`!)
- `uses=jinahub+docker://foo` - downloads and runs the Executor's Docker image on your machine
- `uses=jinahub+sandbox://foo` - runs the Executor directly in Jina's cloud [sandbox](https://docs.jina.ai/how-to/sandbox/?highlight=sandbox), saving you compute
- `uses=Foo` - run an Executor that you've [built yourself](https://docs.jina.ai/fundamentals/executor/executor-api/#)
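
The `uses` strings above share a `scheme://Name/version` shape. As a way of reading them, the hypothetical `describe_uses` helper below splits such a string and reports where the Executor would run, according to the list above. It is an illustration only and is not part of Jina.

```python
# Where each scheme runs, paraphrasing the list above
WHERE_IT_RUNS = {
    "jinahub": "source downloaded from Hub and run locally",
    "jinahub+docker": "Docker image run on your machine",
    "jinahub+sandbox": "run in Jina's cloud sandbox",
}


def describe_uses(uses):
    """Split a `uses` string into scheme, Executor name and optional version."""
    if "://" not in uses:
        # No scheme: a class you wrote yourself, e.g. uses=Foo
        return {"scheme": None, "name": uses, "version": None,
                "runs": "your own Executor class"}
    scheme, _, rest = uses.partition("://")
    name, _, version = rest.partition("/")
    return {
        "scheme": scheme,
        "name": name,
        "version": version or None,  # no version given means None
        "runs": WHERE_IT_RUNS.get(scheme, "unknown scheme"),
    }


for uses in ("jinahub+sandbox://SpacyTextEncoder/v0.4",
             "jinahub://AnnLiteIndexer/0.3.0",
             "Foo"):
    print(describe_uses(uses))
```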

---

Let's preview our Flow:

In [None]:
flow.plot()

## Indexing our data

That's our Flow built. Now we can run it to start pushing our data through the pipeline.

In [None]:
with flow:
    docs = flow.index(docs)

## Searching our data

Now that we've built our index, it's time to do some searching!

Everything we've worked with while indexing has been in the form of a [Document](https://docarray.jina.ai/fundamentals/document?utm_source=stack-overflow-notebook) (stored in a DocumentArray). So we'll need to create another Document for searching that index.

Feel free to change `search_term` to your own query.

In [None]:
from docarray import Document

search_term = "How do I create a matrix?"
query = Document(text=search_term)

with flow:
    results = flow.search(query)

Now to look at what matched our search term. `results` is also a DocumentArray (can you see the pattern?). We'll access its `matches` attribute and see what's stored inside:

In [None]:
matches = results[0].matches

for match in matches:
    print(match.text)

## Getting answers to our questions

So far, so good. We've got a list of matching questions. But how can we pair those with the relevant answers?

First we'll need to download our answers. In this case we won't limit them to just 1,000 because:

* Many questions have more than one answer.
* The order may be different, so the answers to the first question in our dataset could sit at position 1,234, 50,234 or 1,337 in the answers dataset.

Once again, we'll [pull from the cloud](https://docarray.jina.ai/fundamentals/documentarray/serialization/?highlight=pull#from-to-cloud):

In [None]:
answers = DocumentArray.pull(name="stack_overflow_r_a")
answers.summary()

Now we can use the [`find` method](https://docarray.jina.ai/fundamentals/documentarray/find?utm_source=stack-overflow-notebook) to dig out answers where the answer's `ParentId` tag matches the question's `Id` tag:

In [None]:
for match in matches:
    print(match.text)
    match_answers = answers.find({"tags__ParentId": {"$eq": match.tags["Id"]}})
    for answer in match_answers:
        print("---")
        print(answer.text)
    print("-----------")

Voila! You can see:

* Questions matching our search term
* Answers to those questions
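
The loop above runs one `find` query per matching question. If that becomes slow on the full answers dataset, one alternative is to group the answers by `ParentId` once and then look each question up directly. The sketch below shows the idea with plain dicts standing in for Documents; the sample records and the `group_answers_by_parent` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-ins for question matches and answer Documents
sample_matches = [
    {"text": "How do I create a matrix?", "tags": {"Id": 101}},
    {"text": "How do I transpose a matrix?", "tags": {"Id": 102}},
]
sample_answers = [
    {"text": "Use matrix(1:6, nrow=2).", "tags": {"ParentId": 101}},
    {"text": "Use t(m).", "tags": {"ParentId": 102}},
    {"text": "cbind() and rbind() also work.", "tags": {"ParentId": 101}},
]


def group_answers_by_parent(answers):
    """Build a ParentId -> list of answers lookup in a single pass."""
    grouped = defaultdict(list)
    for answer in answers:
        grouped[answer["tags"]["ParentId"]].append(answer)
    return grouped


answers_by_parent = group_answers_by_parent(sample_answers)

for match in sample_matches:
    print(match["text"])
    # .get() avoids creating empty entries for questions with no answers
    for answer in answers_by_parent.get(match["tags"]["Id"], []):
        print("---")
        print(answer["text"])
    print("-----------")
```

With real data, it is worth checking that `Id` and `ParentId` have the same type before relying on a lookup like this, since a string `"101"` and a number `101` would not match.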

Admittedly, the HTML formatting looks a bit janky, but if you were using this IRL you'd strip that out or properly display it. Since this is just a notebook I'll leave that as an exercise for you, dear reader.
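
One way to approach that exercise is Python's built-in `html.parser` module. This is a minimal sketch that keeps only the text nodes; the `strip_html` helper and the sample string are hypothetical, and real answers (for example, ones with code blocks you want to keep formatted) may need more careful treatment.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True, so &amp; becomes &
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def get_text(self):
        return "".join(self._chunks)


def strip_html(html_text):
    """Return html_text with tags removed and entities decoded."""
    extractor = _TextExtractor()
    extractor.feed(html_text)
    extractor.close()  # flush any buffered trailing text
    return extractor.get_text().strip()


sample_answer = (
    "<p>Use <code>matrix(1:6, nrow = 2)</code> &amp; "
    "check <code>dim(m)</code>.</p>"
)
print(strip_html(sample_answer))
```

In the loop above, this would mean printing `strip_html(answer.text)` instead of `answer.text`.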

## Putting it into production

Colab notebooks have a number of restrictions that make cool stuff quite difficult. If we were building this outside of a notebook, we could:

* Set up a [RESTful or gRPC gateway](https://docs.jina.ai/fundamentals/gateway?utm_source=stack-overflow-notebook) and keep the Flow open to requests using `flow.block()`
* Use [sharding and replicas](https://docs.jina.ai/how-to/scale-out?utm_source=stack-overflow-notebook) to improve performance and reliability.
* [Monitor our Flow with Grafana](https://docs.jina.ai/fundamentals/flow/monitoring-flow?utm_source=stack-overflow-notebook)
* Better yet, host our Flow on [JCloud](https://docs.jina.ai/fundamentals/jcloud?utm_source=stack-overflow-notebook), so we don't have to use any of our own compute for encoding, indexing, hosting, etc. (encoding is especially hungry on the hardware)
* Finetune our results using [Finetuner](https://finetuner.jina.ai) to provide better matches
* Use a more specialized model for dealing with technical/code queries (rather than just general purpose)

## Learn more

Want to dig more into the Jina ecosystem? Here are some resources:

- [Developer portal](https://learn.jina.ai) - tutorials, courses, videos on using Jina
- [Fashion search notebook](https://colab.research.google.com/github/alexcg1/neural-search-notebooks/blob/main/fashion-search/1_build_basic_search/basic_search.ipynb) - build an image-to-image fashion search engine
- [DALL-E Flow](https://colab.research.google.com/github/jina-ai/dalle-flow/blob/main/client.ipynb#scrollTo=NeWDy9viOCAP)/[Disco Art](https://colab.research.google.com/github/jina-ai/discoart/blob/main/discoart.ipynb#scrollTo=47428f37) - create AI-generated art in your browser