**Coding Challenge: Text Query Engine**

The objective of this challenge is to build a Text Query Engine in Python.

Your application will receive a series of text documents as a list of strings:

```
documents = [
    "The quick brown fox jumped over the lazy dog",
    "A rose by any other name would smell as sweet",
    "To be or not to be, that is the question",
    "In the end, it's not the years in your life that count. It's the life in your years",
    "You miss 100% of the shots you don't take",
]
```

It will also receive a list of queries:

```
queries = ["quick brown", "rose name", "life years", "you miss"]
```

The engine should:

1. Take the list of documents and build an internal representation of the text.
2. Take a list of queries, each a string of one or more words.
3. Return a list of lists. Each inner list corresponds to one query and contains the IDs of the documents that match it, sorted by relevance to the query, most relevant first.
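
The three steps above can be sketched as a brute-force reference that other solutions could be checked against. This is only an illustration under assumptions the challenge does not state: the function name `reference_search` is hypothetical, words are split on whitespace, a document "matches" if it shares at least one word with the query, and relevance is the number of distinct query words it contains.

```python
def reference_search(documents, queries):
    """Brute-force reference: rank documents by how many query words they contain."""
    results = []
    for query in queries:
        query_words = set(query.lower().split())
        scored = []
        for doc_id, text in enumerate(documents):
            doc_words = set(text.lower().split())
            overlap = len(query_words & doc_words)
            if overlap:  # keep only documents that share a word with the query
                scored.append((overlap, doc_id))
        # Most shared words first; ties broken by lower document ID
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        results.append([doc_id for _, doc_id in scored])
    return results


# Tiny made-up corpus to show the shape of the return value
sample_results = reference_search(["a b", "b c", "a b c"], ["a b", "z"])
print(sample_results)  # [[0, 2, 1], []]
```

The return value has one inner list per query, and a query with no matching document yields an empty list.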

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

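class TextQueryEngine:
    """Basic solution: rank documents by TF-IDF cosine similarity to each query."""
    def __init__(self, documents):
        self.documents = documents
        # Lowercase the text and drop English stop words before weighting terms
        self.vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')
        # Sparse matrix with one TF-IDF row per document
        self.document_vectors = self.vectorizer.fit_transform(documents)

    def search(self, queries):
        # Project the queries into the vocabulary learned from the documents
        query_vectors = self.vectorizer.transform(queries)
        # similarity_scores[q, d] is the cosine similarity of query q and document d
        similarity_scores = cosine_similarity(query_vectors, self.document_vectors)

        results = []
        for scores in similarity_scores:
            # Document indices in descending score order
            ranked_indices = np.argsort(scores)[::-1]
            # Note: this returns document texts rather than IDs, and keeps every
            # document, including those with zero similarity to the query
            results.append([self.documents[i] for i in ranked_indices])

        return results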

# Instantiate the engine with a list of documents
documents = [
    "The quick brown fox jumped over the lazy dog",
    "A rose by any other name would smell as sweet",
    "To be or not to be, that is the question",
    "In the end, it's not the years in your life that count. It's the life in your years",
    "You miss 100% of the shots you don't take",
]
engine = TextQueryEngine(documents)

# Query the engine with a list of queries
queries = ["quick brown", "rose name", "life years", "you miss"]
results = engine.search(queries)

for query, result in zip(queries, results):
    print(f"Query: {query}")
    for rank, document in enumerate(result, 1):
        print(f"  {rank}. {document}")
    print()


Query: quick brown
  1. The quick brown fox jumped over the lazy dog
  2. You miss 100% of the shots you don't take
  3. In the end, it's not the years in your life that count. It's the life in your years
  4. To be or not to be, that is the question
  5. A rose by any other name would smell as sweet

Query: rose name
  1. A rose by any other name would smell as sweet
  2. You miss 100% of the shots you don't take
  3. In the end, it's not the years in your life that count. It's the life in your years
  4. To be or not to be, that is the question
  5. The quick brown fox jumped over the lazy dog

Query: life years
  1. In the end, it's not the years in your life that count. It's the life in your years
  2. You miss 100% of the shots you don't take
  3. To be or not to be, that is the question
  4. A rose by any other name would smell as sweet
  5. The quick brown fox jumped over the lazy dog

Query: you miss
  1. You miss 100% of the shots you don't take
  2. In the end, it's not the y
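
In the output above, every document is listed for every query, including documents that share no term with it, and the results are document texts rather than IDs. A minimal sketch of one way to turn a row of scores into a ranked list of matching IDs follows. The helper name `rank_matches` and the `min_score` parameter are hypothetical, and the example scores are made up for illustration.

```python
def rank_matches(scores, min_score=0.0):
    """Return the IDs of documents scoring above min_score, highest score first."""
    matches = [
        (score, doc_id)
        for doc_id, score in enumerate(scores)
        if score > min_score
    ]
    # Sort by descending score; ties fall back to ascending document ID
    matches.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in matches]


# Illustrative scores for one query against five documents (made-up values)
example_scores = [0.0, 0.71, 0.0, 0.25, 0.0]
print(rank_matches(example_scores))  # [1, 3]
```

Applied to each row of `similarity_scores`, a helper like this would replace the `np.argsort` step. The explicit tie-break also makes the order of equally scored documents deterministic.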

In [10]:
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer

class TextQueryEngine:
    """Advanced solution: inverted index scored by raw term counts."""
    def __init__(self, documents):
        self.documents = documents
        self.vectorizer = CountVectorizer(lowercase=True, stop_words='english')
        # Dense matrix: rows are documents, columns are vocabulary terms
        self.document_term_matrix = self.vectorizer.fit_transform(documents).toarray()
        self.index = self.build_index()

    def build_index(self):
        # Inverted index: word -> IDs of the documents that contain it
        index = defaultdict(list)
        for word, idx in self.vectorizer.vocabulary_.items():
            for doc_id, count in enumerate(self.document_term_matrix[:, idx]):
                if count > 0:
                    index[word].append(doc_id)
        return index

    def search(self, queries):
        results = []
        for query in queries:
            # Term counts for the query over the document vocabulary
            query_vector = self.vectorizer.transform([query]).toarray()[0]
            # doc_id -> summed counts of the query's words in that document
            scores = defaultdict(int)
            for word, idx in self.vectorizer.vocabulary_.items():
                if query_vector[idx] > 0:
                    for doc_id in self.index[word]:
                        scores[doc_id] += self.document_term_matrix[doc_id, idx]
            # Highest score first; documents sharing no word with the query are omitted
            ranked_indices = [doc_id for doc_id, _ in sorted(scores.items(), key=lambda x: x[1], reverse=True)]
            # Note: this returns document texts rather than IDs
            results.append([self.documents[i] for i in ranked_indices])
        return results

# Instantiate the engine with a list of documents
documents = [
    "The quick brown fox jumped over the lazy dog",
    "A rose by any other name would smell as sweet",
    "To be or not to be, that is the question",
    "In the end, it's not the years in your life that count. It's the life in your years",
    "You miss 100% of the shots you don't take",
]
engine = TextQueryEngine(documents)

# Query the engine with a list of queries
queries = ["quick brown", "rose name", "life years", "you miss"]
results = engine.search(queries)

for query, result in zip(queries, results):
    print(f"Query: {query}")
    for rank, document in enumerate(result, 1):
        print(f"  {rank}. {document}")
    print()


Query: quick brown
  1. The quick brown fox jumped over the lazy dog

Query: rose name
  1. A rose by any other name would smell as sweet

Query: life years
  1. In the end, it's not the years in your life that count. It's the life in your years

Query: you miss
  1. You miss 100% of the shots you don't take
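
The `search` method above walks the whole vocabulary for every query and reads counts back from the dense matrix. One possible alternative is to store the counts in the postings themselves, so that only the query's own terms are looked up. The sketch below is not the solution above. It uses the standard library instead of `CountVectorizer`, does no stop-word removal, and all names (`tokenize`, `build_postings`, `search_ids`) are hypothetical, so its rankings can differ.

```python
import re
from collections import Counter, defaultdict

# Tokens of two or more word characters (an assumption for this sketch)
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


def build_postings(documents):
    """Map each term to a Counter of {doc_id: occurrences in that document}."""
    postings = defaultdict(Counter)
    for doc_id, text in enumerate(documents):
        for term in tokenize(text):
            postings[term][doc_id] += 1
    return postings


def search_ids(postings, queries):
    """Return, per query, matching document IDs ordered by summed term counts."""
    results = []
    for query in queries:
        scores = Counter()
        for term in set(tokenize(query)):
            # Only the postings of terms that occur in the query are visited
            scores.update(postings.get(term, {}))
        # Highest score first; ties broken by lower document ID
        ranked = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
        results.append(ranked)
    return results


postings = build_postings(["apple apple pie", "apple tart"])
print(search_ids(postings, ["apple", "banana"]))  # [[0, 1], []]
```

With this layout the work per query depends on the number of query terms and the length of their postings, not on the size of the vocabulary.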



In [1]:
class TextQueryEngine:
    """Bad ChatGPT solution"""
    def __init__(self, documents):
        self.inverted_index = {}
        self.build_index(documents)

    def build_index(self, documents):
        for doc_id, document in enumerate(documents):
            words = document.lower().split()
            for word in words:
                if word not in self.inverted_index:
                    self.inverted_index[word] = []
                # Appends once per occurrence, so a repeated word stores the same ID twice
                self.inverted_index[word].append(doc_id)

    def search(self, queries):
        results = []
        for query in queries:
            query_words = query.lower().split()
            query_results = []
            if query_words:
                query_results = self.inverted_index.get(query_words[0], [])

                for word in query_words[1:]:
                    query_results = [
                        doc_id
                        for doc_id in query_results
                        if doc_id in self.inverted_index.get(word, [])
                    ]

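            # Note: this reads the module-level `documents`, not an attribute of the
            # engine, and orders results by document text rather than by relevance
            query_results = sorted(query_results, key=lambda x: documents[x])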
            results.append(query_results)

        return results


documents = [
    "The quick brown fox jumped over the lazy dog",
    "A rose by any other name would smell as sweet",
    "To be or not to be, that is the question",
    "In the end, it's not the years in your life that count. It's the life in your years",
    "You miss 100% of the shots you don't take",
]

queries = ["quick brown", "rose name", "life years", "you miss"]

query_engine = TextQueryEngine(documents)
search_results = query_engine.search(queries)

for query, results in zip(queries, search_results):
    print(f"Query: {query}")
    if results:
        print("Matching documents:")
        for doc_id in results:
            print(f"Document ID: {doc_id}, Document: {documents[doc_id]}")
    else:
        print("No matching documents.")
    print()


Query: quick brown
Matching documents:
Document ID: 0, Document: The quick brown fox jumped over the lazy dog

Query: rose name
Matching documents:
Document ID: 1, Document: A rose by any other name would smell as sweet

Query: life years
Matching documents:
Document ID: 3, Document: In the end, it's not the years in your life that count. It's the life in your years
Document ID: 3, Document: In the end, it's not the years in your life that count. It's the life in your years

Query: you miss
Matching documents:
Document ID: 4, Document: You miss 100% of the shots you don't take
Document ID: 4, Document: You miss 100% of the shots you don't take
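
Documents 3 and 4 appear twice in the output above because the index stores a document ID once per occurrence of a word. The sort key also depends on the module-level `documents` list. A minimal sketch of a variant that avoids both issues is shown below. The class name `BooleanQueryEngine` and the punctuation stripping are assumptions of this sketch, and like the cell above it requires every query word to be present and does not rank by relevance.

```python
import string


class BooleanQueryEngine:
    """Hypothetical variant: every query word must appear in a matching document."""

    def __init__(self, documents):
        self.documents = documents  # kept on the instance, no global lookup
        self.inverted_index = {}
        for doc_id, document in enumerate(documents):
            for word in self._tokenize(document):
                # A set stores each document ID once, however often the word repeats
                self.inverted_index.setdefault(word, set()).add(doc_id)

    @staticmethod
    def _tokenize(text):
        # Strip leading and trailing punctuation so "count." and "count" match
        words = (w.strip(string.punctuation) for w in text.lower().split())
        return [w for w in words if w]

    def search(self, queries):
        results = []
        for query in queries:
            words = self._tokenize(query)
            if not words:
                results.append([])
                continue
            postings = [self.inverted_index.get(w, set()) for w in words]
            matching = set.intersection(*postings)
            results.append(sorted(matching))  # ascending document ID
        return results


engine = BooleanQueryEngine(["red red wine", "red sky", "blue sky."])
print(engine.search(["red", "sky", "red sky", "green"]))  # [[0, 1], [1, 2], [1], []]
```

Set intersection keeps the "all words must match" behaviour of the cell above while returning each document ID at most once.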

