# Text Search with TF-IDF

For many people, AI is their first exposure to what automation can do.
Automation is much older than AI, though, and programmers have long used it to solve many kinds of problems.

Given a structured document and a query, produce TF-IDF vectors for the
document's text and find the section that's most similar to the query.
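
The notebook code measures "most similar" with cosine similarity, the cosine of the angle between two vectors. As a minimal sketch of that idea in plain Python (the `cosine` helper and the toy vectors are illustrative assumptions, not part of the notebook):

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors standing in for TF-IDF rows (made-up values).
query_vec = [1.0, 1.0, 0.0]
section_vecs = [[1.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.0, 0.0, 1.0]]

scores = [cosine(vec, query_vec) for vec in section_vecs]
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores)
```

A section that shares no terms with the query scores exactly zero, which is why so many rows in the similarity output are `0.`.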

## Definitions

* TF-IDF: term frequency times inverse document frequency. A term gets a high weight when it occurs often in one passage but rarely in the others.
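
To make the definition concrete, here is a rough hand-rolled sketch of the computation. It assumes whitespace tokenization, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization, which I believe matches scikit-learn's default weighting; `TfidfVectorizer` tokenizes differently, so the numbers will not match it exactly. The `tfidf_vectors` helper and the sample passages are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(passages):
    """Return (vocabulary, vectors) with one L2-normalized TF-IDF vector per passage."""
    # Assumption: lowercase and split on whitespace (a simplification).
    tokenized = [p.lower().split() for p in passages]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n = len(tokenized)

    # Document frequency: number of passages that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n) / (1 + df[term])) + 1 for term in vocabulary}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this passage
        weights = [tf[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        vectors.append([w / norm for w in weights])
    return vocabulary, vectors


passages = ["collect sensor data", "export data", "dark mode"]
vocabulary, vectors = tfidf_vectors(passages)
for term, weight in zip(vocabulary, vectors[0]):
    print(f"{term}: {weight:.3f}")
```

In the first passage, "data" ends up with a lower weight than "collect" or "sensor" because it also appears in the second passage.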

## Research and Tooling

* https://medium.com/@yassineerraji/understanding-textrank-a-deep-dive-into-graph-based-text-summarization-and-keyword-extraction-905d1fb5d266
* https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting
* https://courses.cs.washington.edu/courses/cse373/17au/project3/project3-2.html

In [None]:
%pip install scikit-learn networkx

In [30]:
import json

import networkx as nx
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

with open('data/chapter-1-structured-content.json') as file:
    data = json.load(file)

query = "How do I collect data?"
# query = "How do I switch to dark mode"

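# Flatten the document into a list of passages: every section and subsection
# contributes its title and its content as separate entries.
sentences = []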
for section in data:
    sentences.append(section['section_title'])
    sentences.append(section['content'])
    for subsection in section['subsections']:
        sentences.append(subsection['subsection_title'])
        sentences.append(subsection['content'])

# print("Sentences:", sentences)
# print("Count:", len(sentences))

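# Learn the vocabulary and IDF weights from the passages, then map the query
# into the same vector space (query terms outside the vocabulary are ignored).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)
Y = vectorizer.transform([query])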

# print("X", X)
# print("Y", Y)

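# Shape (n_passages, 1): one similarity score per passage.
sim_matrix = cosine_similarity(X, Y)

print("similarity", sim_matrix)

# Row index of the highest score (the first one if several tie).
match_index = int(np.argmax(sim_matrix))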

print("Match Index:", match_index)
print("Match", sentences[match_index])

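# Optional TextRank-style summary (disabled). PageRank needs a square
# passage-to-passage similarity matrix, so the graph is built from
# cosine_similarity(X) rather than from the (n_passages, 1) sim_matrix above.
# nx_graph = nx.from_numpy_array(cosine_similarity(X))

# print("graph", nx_graph)

# scores = nx.pagerank(nx_graph)

# print("scores", scores)

# ranked = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
# summary = " . ".join([s for _, s in ranked[:10]])
# print("Summary:", summary)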

similarity [[0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.14701702]
 [0.13164123]
 [0.16353832]
 [0.13849012]
 [0.08850576]
 [0.        ]
 [0.11962411]
 [0.        ]
 [0.0495944 ]
 [0.        ]
 [0.        ]
 [0.06741477]
 [0.04413062]
 [0.0873697 ]
 [0.04366586]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.06645695]
 [0.10099313]
 [0.06270593]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.03664802]
 [0.13849012]
 [0.04726681]
 [0.10315145]
 [0.04256482]
 [0.        ]
 [0.04511777]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.05286617]]
Match Index: 11
Match Choose Sensor Data Collection to collect data from Vernier sensors including Go Direct sensors, Go!Temp and Go!Motion USB sensors, and wired LabQuest sensors.
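
The single best match hides some near ties: index 11 scores 0.1635, index 9 scores 0.1470, and indices 12 and 36 both score 0.1385. One possible way to look at several candidates at once is sketched below. The `top_matches` helper and the demo values are hypothetical, and zero-score passages are dropped on the assumption that they share no terms with the query.

```python
def top_matches(scores, passages, k=3):
    """Return up to k (index, score, passage) triples, best score first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    # Skip passages that have nothing in common with the query.
    return [(i, score, passages[i]) for i, score in ranked[:k] if score > 0]


# Illustrative values only, not taken from the notebook's data file.
demo_passages = ["Export data", "Collect sensor data", "Dark mode", "Collect data manually"]
demo_scores = [0.13, 0.16, 0.0, 0.14]

results = top_matches(demo_scores, demo_passages)
for index, score, passage in results:
    print(index, f"{score:.2f}", passage)
```

With the notebook's variables this could be called as `top_matches(sim_matrix.ravel(), sentences)`.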
