# Week 12 Assignment
Find a big set of text and implement a query/search engine from scratch.

In [1]:
!pip install --upgrade tqdm

Requirement already up-to-date: tqdm in /usr/local/lib/python3.6/dist-packages (4.54.0)


## Data Loading
The data I'm using is the 20 Newsgroups dataset from scikit-learn. I load only the training subset.

In [2]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

newsgroups_databunch = fetch_20newsgroups(
    subset = 'train',
    shuffle = True, 
    random_state = 1
)

newsgroups_data = pd.DataFrame(newsgroups_databunch.data, columns = ['text'])
newsgroups_data.head()

Unnamed: 0,text
0,"From: ab4z@Virginia.EDU (""Andi Beyer"")\nSubjec..."
1,From: timmbake@mcl.ucsb.edu (Bake Timmons)\nSu...
2,From: bc744@cleveland.Freenet.Edu (Mark Ira Ka...
3,From: ray@ole.cdac.com (Ray Berry)\nSubject: C...
4,From: kkeller@mail.sas.upenn.edu (Keith Keller...


## Text Preprocessing
I'll preprocess the text using the spaCy library.

In [3]:
import re
import spacy
import string
from tqdm import tqdm

# Register `progress_apply` on pandas objects.
tqdm.pandas()

# Get the appropriate spaCy model to use.
spacy_model = spacy.load('en_core_web_sm', disable = ['parser', 'ner'])

################################
# TEXT PREPROCESSING FUNCTIONS #
################################
def _remove_punctuation(text, step):
    if step == 'initial':
        # Keep only spaCy tokens that still contain a word character
        # once punctuation and underscores are replaced by spaces.
        return [
            token for token in text
            if re.sub(r'[\W_]+', ' ', token.text).strip()
        ]
    elif step == 'last':
        # After lemmatization the tokens are plain strings; replace any
        # punctuation left inside a token with a space.
        return [re.sub(r'[\W_]+', ' ', token) for token in text]

def _remove_stop_words(text):
    return [token for token in text if not token.is_stop]

def _lemmatize(text):
    return [token.lemma_ for token in text]

def preprocess_text(text, is_search_space=True):
    if is_search_space:
        # Remove the upper header part of the text.
        # We only need to do this for the search
        # space, not for the query string.
        step_1 = ' '.join(text.split('\n\n')[1:])
    else:
        step_1 = text

    # Lowercase text and remove extra spaces.
    step_2_3 = ' '.join(
        [word.lower() for word in str(step_1).split()]
    )

    # Tokenize text with spaCy.
    step_4 = spacy_model(step_2_3)

    # Remove punctuation.
    step_5 = _remove_punctuation(step_4, step = 'initial')

    # Remove stop words.
    step_6 = _remove_stop_words(step_5)

    # Lemmatize text.
    step_7 = _lemmatize(step_6)

    # Remove punctuation again.
    step_8 = _remove_punctuation(step_7, step = 'last')

    # Remake sentence with new cleaned up tokens.
    return ' '.join(step_8)

# Start processing.
newsgroups_data['text_processed'] = newsgroups_data['text'] \
    .progress_apply(preprocess_text)

newsgroups_data.head()

100%|██████████| 11314/11314 [03:32<00:00, 53.23it/s]


Unnamed: 0,text,text_processed
0,"From: ab4z@Virginia.EDU (""Andi Beyer"")\nSubjec...",sure story nad biased disagree statement u s m...
1,From: timmbake@mcl.ucsb.edu (Bake Timmons)\nSu...,james hogan write timmbake mcl ucsb edu bake t...
2,From: bc744@cleveland.Freenet.Edu (Mark Ira Ka...,realize principle strong point like know ask q...
3,From: ray@ole.cdac.com (Ray Berry)\nSubject: C...,notwithstanding legitimate fuss proposal chang...
4,From: kkeller@mail.sas.upenn.edu (Keith Keller...,change scoring playoff pool unfortunately time...
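
The header-stripping step in `preprocess_text` relies on the first blank line separating the header from the body. Here is a minimal sketch of that step on its own. The helper name `strip_header` and the sample post are made up for illustration and are not part of the notebook.

```python
def strip_header(raw_post):
    """Drop everything before the first blank line (assumed to be the header)."""
    # Same expression as in preprocess_text: split on blank lines,
    # discard the first chunk, and rejoin the rest with spaces.
    return ' '.join(raw_post.split('\n\n')[1:])


# Hypothetical post, shaped like the newsgroup messages above.
sample_post = (
    "From: someone@example.com\n"
    "Subject: Re: test\n"
    "Lines: 2\n"
    "\n"
    "First paragraph of the body.\n"
    "\n"
    "Second paragraph."
)

body = strip_header(sample_post)
print(body)

# Caveat: a post with no blank line at all comes back empty.
no_blank_line = strip_header("just one line")
print(repr(no_blank_line))
```

One consequence of this approach is that a document without any blank line loses all of its text, so the split is a heuristic rather than a real header parser.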


In [4]:
print(f'=====TEXT BEFORE PROCESSING===== \n"{newsgroups_data["text"][0]}"')
print(f'=====TEXT AFTER PROCESSING===== \n"{newsgroups_data["text_processed"][0]}"')

=====TEXT BEFORE PROCESSING===== 
"From: ab4z@Virginia.EDU ("Andi Beyer")
Subject: Re: Israeli Terrorism
Organization: University of Virginia
Lines: 15

Well i'm not sure about the story nad it did seem biased. What
I disagree with is your statement that the U.S. Media is out to
ruin Israels reputation. That is rediculous. The U.S. media is
the most pro-israeli media in the world. Having lived in Europe
I realize that incidences such as the one described in the
letter have occured. The U.S. media as a whole seem to try to
ignore them. The U.S. is subsidizing Israels existance and the
Europeans are not (at least not to the same degree). So I think
that might be a reason they report more clearly on the
atrocities.
	What is a shame is that in Austria, daily reports of
the inhuman acts commited by Israeli soldiers and the blessing
received from the Government makes some of the Holocaust guilt
go away. After all, look how the Jews are treating other races
when they got power. It is unfortun

## TF-IDF From Scratch
In this section, I'll be calculating TF-IDF vectors from scratch.

### Term Frequency
Now let's calculate term frequency from scratch. The first step is to tokenize the text.

In [5]:
def tokenize(text):
    tokens = spacy_model(text)

    return [token.text for token in tokens if (token.text != ' ' and token.text != '')]

newsgroups_data['text_tokenized'] = newsgroups_data['text_processed'] \
    .progress_apply(tokenize)

100%|██████████| 11314/11314 [01:49<00:00, 103.32it/s]


Next, create a frequency distribution for each row and put them all together. To save time, I'll do this only for a random sample of 1,000 documents.

In [6]:
from nltk import FreqDist

# Just select 1000 rows at random.
newsgroups_data_sample = newsgroups_data.sample(1000)

# Get the frequency distribution for each row.
# (`DataFrame.append` was removed in pandas 2.0, so collect one dict per
# document and build the frame in a single call.)
row_counts = [
    dict(FreqDist(row))
    for row in tqdm(newsgroups_data_sample['text_tokenized'])
]
# Terms missing from a document come out as NaN; treat them as a count of 0.
search_space_vectorized = pd.DataFrame(row_counts).fillna(0).astype(float)
search_space_vectorized.head()

100%|██████████| 1000/1000 [01:55<00:00,  8.68it/s]


Unnamed: 0,10,17,1993,25,access,achieve,actually,addition,algorithm,assess,astron,attorney,biham,bit,capability,clipper,comfort,confidential,crypt,demonstrate,des,detail,devices,effectiveness,expert,faqs,finally,finding,flow,general,gillogly,go,good,government,hand,handle,jim,know,learn,level,...,lcai,notoriously,75w,850,anough,optonica,peice,preamp,premo,tuner,universtiy,borchevsky,burns,demise,dissappointe,drastic,genenius,hellamond,hits,hustle,officiating,potvin,swing,tee,canadians,escalate,lori,xenophobic,19671,877,c5e2g7,chambliss,eyedness,refractive,rsilver,1p7ciqinn3th,bangaldesh,covingc,prophylaxis,tamsun
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
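
To make the shape of this count matrix concrete, here is a small dependency-free sketch that builds the same kind of table for two toy documents. It uses `collections.Counter` (which NLTK's `FreqDist` subclasses). The token lists and variable names are invented for illustration.

```python
from collections import Counter

# Two hypothetical, already-tokenized documents.
toy_docs = [
    ['des', 'key', 'des', 'chip'],
    ['chip', 'hardware'],
]

# One frequency distribution per document.
counts = [Counter(doc) for doc in toy_docs]

# The vocabulary is the union of all terms; sorting fixes the column order.
vocabulary = sorted(set().union(*toy_docs))

# One row per document, one column per term; absent terms count as 0.
count_matrix = [[doc_counts[term] for term in vocabulary] for doc_counts in counts]

print(vocabulary)
for row in count_matrix:
    print(row)
```

Most entries in the real matrix are 0 for the same reason they are here: each document uses only a small part of the overall vocabulary.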


Once we have the frequency distributions, we can calculate the term frequency weight using the formula $1 + \ln(t_{freq})$ for terms that occur in a document, and 0 for terms that do not.

In [7]:
import numpy as np

def calculate_term_frequency(column):
    normalized_freq = np.log(
        column, 
        # Pre-fill the output with -1 so that entries skipped by `where`
        # (raw count 0) end up as 0 after the +1 below, not 1.
        out = np.zeros_like(column) - 1,
        where = (column != 0)
    )

    return 1 + normalized_freq

search_space_tf = search_space_vectorized.progress_apply(calculate_term_frequency)
search_space_tf.head()

100%|██████████| 27581/27581 [00:21<00:00, 1275.43it/s]


Unnamed: 0,10,17,1993,25,access,achieve,actually,addition,algorithm,assess,astron,attorney,biham,bit,capability,clipper,comfort,confidential,crypt,demonstrate,des,detail,devices,effectiveness,expert,faqs,finally,finding,flow,general,gillogly,go,good,government,hand,handle,jim,know,learn,level,...,lcai,notoriously,75w,850,anough,optonica,peice,preamp,premo,tuner,universtiy,borchevsky,burns,demise,dissappointe,drastic,genenius,hellamond,hits,hustle,officiating,potvin,swing,tee,canadians,escalate,lori,xenophobic,19671,877,c5e2g7,chambliss,eyedness,refractive,rsilver,1p7ciqinn3th,bangaldesh,covingc,prophylaxis,tamsun
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.098612,1.0,1.0,1.0,1.693147,1.0,1.0,1.0,1.0,1.0,1.0,1.693147,1.693147,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.693147,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
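
As a quick sanity check on the log scaling, the sketch below applies the same rule to a few raw counts using only the standard library. The function name `log_tf` is hypothetical. The results for counts of 2 and 3 should line up with the 1.693147 and 2.098612 entries in the table above.

```python
import math


def log_tf(count):
    """Sublinear term frequency: 1 + ln(count), or 0 when the term is absent."""
    if count <= 0:
        return 0.0
    return 1 + math.log(count)


raw_counts = [0, 1, 2, 3]
tf_values = [log_tf(count) for count in raw_counts]

for count, value in zip(raw_counts, tf_values):
    print(count, round(value, 6))
```

The scaling is the point of the formula: a term that appears three times gets roughly twice the weight of a term that appears once, not three times.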


### Inverse Document Frequency
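Now that term frequency is calculated, let's calculate the inverse document frequency, $\ln(N / df_t)$, where $N$ is the number of documents in the sample and $df_t$ is the number of documents that contain term $t$.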

In [8]:
def calculate_inverse_document_frequency(column):
    document_frequency = column[column > 0].count()
    n = column.shape[0]
    return np.log(n / document_frequency)

search_space_idf = search_space_vectorized.progress_apply(calculate_inverse_document_frequency)
search_space_idf.head()

100%|██████████| 27581/27581 [00:12<00:00, 2294.65it/s]


10        2.343407
17        3.244194
1993      2.551046
25        3.015935
access    2.764621
dtype: float64
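
A small worked example may help when reading these numbers. The sketch below uses the same formula as `calculate_inverse_document_frequency`, written with the standard library only. The function name `idf` and the document counts are illustrative assumptions, apart from the sample size of 1,000.

```python
import math


def idf(n_documents, document_frequency):
    """Inverse document frequency: ln(N / df)."""
    return math.log(n_documents / document_frequency)


n_documents = 1000  # size of the random sample used above

# A term found in a single document gets the largest possible weight,
# ln(1000), which is about 6.9078.
rare_term_idf = idf(n_documents, 1)

# A term found in every document gets a weight of exactly 0.
ubiquitous_term_idf = idf(n_documents, n_documents)

# A hypothetical term found in 96 documents lands in between.
common_term_idf = idf(n_documents, 96)

print(round(rare_term_idf, 4), ubiquitous_term_idf, round(common_term_idf, 4))
```

Because every sampled term occurs in at least one document, the division never hits zero here. A query term that is not in the sample vocabulary never reaches this step.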

### TF-IDF
Now we just multiply the TF matrix by the IDF vector. pandas aligns each IDF value with the term column of the same name.

In [9]:
search_space_tf_idf = search_space_tf * search_space_idf
search_space_tf_idf.head()

Unnamed: 0,10,17,1993,25,access,achieve,actually,addition,algorithm,assess,astron,attorney,biham,bit,capability,clipper,comfort,confidential,crypt,demonstrate,des,detail,devices,effectiveness,expert,faqs,finally,finding,flow,general,gillogly,go,good,government,hand,handle,jim,know,learn,level,...,lcai,notoriously,75w,850,anough,optonica,peice,preamp,premo,tuner,universtiy,borchevsky,burns,demise,dissappointe,drastic,genenius,hellamond,hits,hustle,officiating,potvin,swing,tee,canadians,escalate,lori,xenophobic,19671,877,c5e2g7,chambliss,eyedness,refractive,rsilver,1p7ciqinn3th,bangaldesh,covingc,prophylaxis,tamsun
0,2.343407,3.244194,2.551046,3.015935,2.764621,4.422849,2.488915,3.540459,3.816713,5.298317,6.907755,5.115996,6.907755,2.590267,3.912023,3.729701,5.809143,5.298317,5.809143,4.199705,9.464448,3.442019,6.907755,5.521461,7.001446,5.298317,3.688879,4.60517,4.828314,2.937463,6.907755,2.903406,2.333709,2.718101,2.733368,3.575551,3.506558,1.114742,3.015935,2.813411,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,2.488915,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.903406,1.378326,2.718101,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,2.488915,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
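
To see how a single cell of this table comes about, here is a hand calculation for one term, using only the standard library. The raw count of 3 matches the `des` entry in the first row of the count table. The document frequency of 11 is an assumption, chosen because it is consistent with the `des` weight shown above.

```python
import math

n_documents = 1000       # size of the sample
raw_count = 3            # occurrences of the term in one document
document_frequency = 11  # assumed number of documents containing the term

tf = 1 + math.log(raw_count)                       # about 2.0986
idf = math.log(n_documents / document_frequency)   # about 4.5099
tf_idf = tf * idf                                  # about 9.4644

print(round(tf, 4), round(idf, 4), round(tf_idf, 4))
```

The weight is high because both factors are large: the term is repeated within the document and is rare across the sample.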


## Search Function
Now that we have the TF-IDF vectors of the search space, we just need the search function. It should take a query, transform it the same way as the documents, and compute the cosine similarity between the query and every document in the search space. Results are returned in descending order of similarity, best match first.

In [10]:
from sklearn.metrics.pairwise import cosine_similarity

def search(query, search_space, search_space_tf_idf, search_space_idf, top_n=5):
    # Preprocess query text.
    query_processed = preprocess_text(query, is_search_space = False)

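    # Tokenize query and vectorize to get term frequency.
    tokens = tokenize(query_processed)
    # Build a one-row frame of raw counts (`DataFrame.append` was removed
    # in pandas 2.0); cast to float so np.log can write into its `out` array.
    query_vectorized = pd.DataFrame([dict(FreqDist(tokens))]).astype(float)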
    term_frequency = query_vectorized.apply(calculate_term_frequency)
    
    # Reformat term frequency so it matches search space format
    query_dict = {}
    for word in search_space_tf_idf.columns.tolist():
        if word in term_frequency.columns.tolist():
            query_dict[word] = term_frequency.loc[:, word].values[0]
        else:
            query_dict[word] = 0.0
    query_tf = pd.DataFrame.from_records([query_dict])

    # Calculate query TF-IDF.
    query_tf_idf = query_tf * search_space_idf
    
    # Calculate cosine similarity between
    # query TF-IDF vector and search space TF-IDF vectors
    similarity_matrix = cosine_similarity(query_tf_idf, search_space_tf_idf)[0]
    sorted_idxs = np.argsort(similarity_matrix)[::-1][:top_n]
    
    return search_space.iloc[sorted_idxs, :].reset_index(drop = True)
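
The ranking step can also be sketched without scikit-learn, which makes the arithmetic behind `cosine_similarity` and the descending sort explicit. This is a simplified illustration and not a drop-in replacement. The helper names `cosine` and `rank` and the toy vectors are invented. Returning 0 for an all-zero vector is a choice made here to avoid dividing by zero.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_vector, document_vectors, top_n=5):
    """Return the indices of the top_n most similar documents, best first."""
    scores = [cosine(query_vector, doc) for doc in document_vectors]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:top_n]


# Toy TF-IDF vectors over a three-term vocabulary.
document_vectors = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 2.0, 1.0],
]
query_vector = [1.0, 1.0, 0.0]

top_three = rank(query_vector, document_vectors, top_n=3)
print(top_three)
```

Cosine similarity depends only on the direction of the vectors, so a long document is not favored just because it contains more words.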

### Search Function Testing
Let's test out some search queries and see what comes up in the top 3 results.

In [11]:
search_results = search(
    'This is a test string for the search function.', 
    newsgroups_data_sample[['text']], 
    search_space_tf_idf, 
    search_space_idf,
    top_n = 3
)

print(f'TOP RESULT'.center(50, '='))
print(f'{search_results.iloc[0].values[0]}')
print('=' * 50)
print()
print('SECOND RESULT'.center(50, '='))
print(f'{search_results.iloc[1].values[0]}')
print('=' * 50)
print()
print('THIRD RESULT'.center(50, '='))
print(f'{search_results.iloc[2].values[0]}')
print('=' * 50)
print()

From: zxxst+@pitt.edu (Zhihua Xie)
Subject: Re: Duo 230 crashes aftersleep (looks like Apple bug!)
Organization: University of Pittsburgh
Lines: 2

this is a test
 


From: jmichael@vnet.IBM.COM
Subject: Electric power line "balls"
Article-I.D.: almaden.19930406.142616.248
Lines: 4

Power lines and airplanes don't mix. In areas where lines are strung very
high, or where a lot of crop dusting takes place, or where there is danger
of airplanes flying into the lines, they place these plastic balls on the
lines so they are easier to spot.


From: mppa3@syma.sussex.ac.uk (Alan Richardson)
Subject: Now available: xvertext.4.0
Organization: University of Sussex
Lines: 25

Now available: xvertext 4.0 
--------------

Summary                                  
-------
xvertext provides you with four functions to draw strings at any angle in   
an X window (previous versions were limited to vertical text). Rotation  
is still achieved using XImages, but the notion of rotating a whole font
first h

In [12]:
search_results = search(
    'computer hardware', 
    newsgroups_data_sample[['text']], 
    search_space_tf_idf, 
    search_space_idf,
    top_n = 3
)

print(f'TOP RESULT'.center(50, '='))
print(f'{search_results.iloc[0].values[0]}')
print('=' * 50)
print()
print('SECOND RESULT'.center(50, '='))
print(f'{search_results.iloc[1].values[0]}')
print('=' * 50)
print()
print('THIRD RESULT'.center(50, '='))
print(f'{search_results.iloc[2].values[0]}')
print('=' * 50)
print()

From: gtoal@gtoal.com (Graham Toal)
Subject: Re: Off the shelf cheap DES keyseach machine (Was: Re: Corporate acceptance of the wiretap chip)
Lines: 9

		I think I should also point out that the mystical DES engines
	are known plaintext engines (unless you add a ton of really smart
	hardware?)

Assume the ton of smart hardware.  It doesn't really have to be that smart.

G




From: jap10@po.CWRU.Edu (Joseph A. Pellettiere)
Subject: Sigma Designs Double up??
Article-I.D.: usenet.1psdv2$gr5
Reply-To: jap10@po.CWRU.Edu
Organization: Case Western Reserve University, Cleveland, Ohio (USA)
Lines: 8
NNTP-Posting-Host: thor.ins.cwru.edu


	I am looking for any information about the Sigma Designs
	double up board.  All I can figure out is that it is a
	hardware compression board that works with AutoDoubler, but
	I am not sure about this.  Also how much would one cost?
-- 
Joe
jap10@po.cwru.edu


From: noah@apple.com (Noah Price)
Subject: Re: How long do RAM SIMM's last?
Distribution: usa
Organi