For this POC, we're going to use the following:
<br><br><b>Database</b>: A disposable db (tinydb) consisting of transcripts of A16z podcasts.  I've already crawled these. 
<br><br><b>Entity extraction</b>: To handle NLP queries, we can use entity extraction to parse the query for an author and use that to refine searches.
<br><br><b>Semantic search</b>: To search the DB, we can use  <a href="https://www.sbert.net/examples/applications/semantic-search/README.html">msmarco-MiniLM-L-12-v3</a>.
<br><br><b>Index</b>: Then, we can use <a href="https://medium.com/mlearning-ai/how-to-build-a-semantic-search-engine-using-python-5c68e8442df1">FAISS</a> for indexing.
<br><br><b>Summarization</b>: We're going to test a bunch of summarization tools here, including Bert Summarization, BART, GPT2 and T5
-

In [71]:
# Import libraries #
from sentence_transformers import SentenceTransformer, util
from transformers import GPT2Tokenizer, GPT2LMHeadModel, pipeline
from transformers import pipeline
from summarizer import Summarizer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import faiss
import numpy as np
import spacy
from tinydb import TinyDB, Query
import os
import time
from collections import defaultdict

In [42]:
# Load the DB
dirname = os.path.abspath('')
db_file = os.path.join(dirname, 'db/parsed_db.json')
db = TinyDB(db_file)
all_data = db.all()

# Load the models
model = SentenceTransformer('msmarco-MiniLM-L-12-v3')
nlp = spacy.load("en_core_web_sm")

# Group the text_strings by the author:
grouped_data = defaultdict(list)

for data in all_data:
    grouped_data[data['author']].append(data)
    
# Now Create a separate FAISS index for each author and store it in our dict
author_indices = {}

for author, documents in grouped_data.items():
    # Encode text_strings
    encoded_data = model.encode([doc['content'] for doc in documents])
    
    # Create FAISS index
    index = faiss.IndexFlatL2(encoded_data.shape[1])
    index.add(encoded_data)
    
    # Store index
    author_indices[author] = (index, documents)

KeyboardInterrupt: 

In [44]:
# STORE YOUR INDEX LOCALLY IF YOU WANT
# import pickle

# # assuming author_indices is a dictionary
# with open('author_indices.pkl', 'wb') as f:
#     pickle.dump(author_indices, f)

# RESTORE IT LOCALLY 
# import pickle

# # open the file containing the pickled dictionary
# with open('author_indices.pkl', 'rb') as f:
#     # load the pickled dictionary
#     author_indices = pickle.load(f)

In [92]:
# See readme for details

# helper to extract entities from a user query
def extract_entities(query):
    doc = nlp(query)
    entities = [(entity.text, entity.label_) for entity in doc.ents]
    return entities

# Process the query
def process_query(query, k=5):
    # Get the author from the query
    author = extract_entities(query)
    author = author[0][0].title()

    # Encode the query
    query_embedding = model.encode(query.strip())
    # Search within the author's index
    if author in author_indices:
        index, documents = author_indices[author]
        distances, indices = index.search(np.array([query_embedding]), k)
        
        # Return results
        result = ''
        for i, idx in enumerate(indices[0]):
            data = documents[idx]
            result += ' '+ data['content']
        
        return result

    else:
        print("Author not found in the database.")

# Summarization function to let us test different models
def summarize(text, model_name):
    summarizer = pipeline("summarization", model=model_name)
    summary = summarizer(text, max_length=500, min_length=100)
    return summary

# Define BERT summarization function
def bert_summarize(text):
    model = Summarizer()
    summary = model(text, max_length=200, min_length=100)
    return summary

# Summarize using GPT2
def summarize_gpt2(text, model_name):
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)

    prompt = "Summarize the following text: " + text
    inputs = tokenizer.encode(prompt, return_tensors="pt", max_length=1024, truncation=True)
    summary_ids = model.generate(inputs, max_length=200, min_length=100, length_penalty=2.0, num_return_sequences=1)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary
        
# Super function
def search(query, showRes=False):
    # Start timer
    start_time=time.time()

    # Search it     
    text = process_query(query)
    if showRes:
        print(text)
    print("Results back in:", time.time() - start_time, "seconds")

    #Now, test the summarization models
    
    # BERT
    print("BERT...")
    start_time=time.time()
    bert_sum = bert_summarize(text)
    print("BERT Summary in :", time.time() - start_time)
    print(bert_sum)

    # BART
    print("\n\nBART...")
    start_time=time.time()
    bart_model_name = "facebook/bart-large-cnn"
    bart_summary = summarize(text, bart_model_name)
    print("BART Summary:", time.time() - start_time)
    print(bart_summary)

    # T5
    print("\n\nT5...")
    start_time=time.time()
    t5_model_name = "t5-large"
    t5_summary = summarize(text, t5_model_name)
    print("\nT5 Summary:", time.time() - start_time)
    print(t5_summary)

    # GPT-2 (note that GPT-2 needs additional steps to be used for summarization)
    print("\n\nGPT-2...")
    start_time=time.time()
    gpt2_model_name = "gpt2"
    gpt2_summary = summarize_gpt2(text, gpt2_model_name)
    print("\nGPT-2 Summary:", time.time() - start_time)
    print(gpt2_summary)



In [94]:
query = "What does Andrew Chen think of utilization marketplaces"
response = search(query, True)

 In the last few years, weâve seen a rise in the number of full-stack or managed marketplaces, or marketplaces that take on additional operational value-add in terms of intermediating the service delivery. While âUber for Xâ models were well-suited to simple services, managed marketplaces evolved to better tackle services that were more complex, higher priced, and that required greater trust. The first iteration of bringing services online involved unmanaged horizontal marketplaces, essentially listing platforms that helped demand search for supply and vice versa. These marketplaces were the digital version of the Yellow Pages, enabling visibility into which service providers existed, but placing the onus on the user to assess providers, contact them, arrange times to meet, and transact. The dynamic here is âcaveat emptorâ — users assume the responsibility of vetting their counterparties and establishing trust, and thereâs little in the way of platform standards, protection

Your max_length is set to 500, but you input_length is only 433. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=216)


BART Summary: 14.365865230560303
[{'summary_text': 'In the last few years, we’ve seen a rise in the number of full-stack or managed marketplaces. These marketplaces take on additional operational value-add in terms of intermediating the service delivery. Companies can raise the quality of service by hiring and managing providers themselves, and by managing the end-to-end customer experience. Examples are Honor and Trusted, managed marketplace for elder care and childcare, respectively, which employ caregivers as W-2 employees and provide them with training and tools.'}]


T5...


Your max_length is set to 500, but you input_length is only 433. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=216)



T5 Summary: 25.92269515991211
[{'summary_text': "in the last few years, we've seen a rise in the number of full-stack or managed marketplaces . these marketplaces take on additional operational value-add in terms of intermediating the service delivery . the dynamic here is âcaveat emptorâ — users assume the responsibility of vetting counterparties and establishing trust . companies can raise the quality of service by hiring and managing providers themselves, and by managing the end-to-end customer experience ."}]


GPT-2...


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Input length of input_ids is 438, but `max_length` is set to 200. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.



GPT-2 Summary: 3.6923718452453613
Summarize the following text:  In the last few years, weâve seen a rise in the number of full-stack or managed marketplaces, or marketplaces that take on additional operational value-add in terms of intermediating the service delivery. While âUber for Xâ models were well-suited to simple services, managed marketplaces evolved to better tackle services that were more complex, higher priced, and that required greater trust. The first iteration of bringing services online involved unmanaged horizontal marketplaces, essentially listing platforms that helped demand search for supply and vice versa. These marketplaces were the digital version of the Yellow Pages, enabling visibility into which service providers existed, but placing the onus on the user to assess providers, contact them, arrange times to meet, and transact. The dynamic here is âcaveat emptorâ — users assume the responsibility of vetting their counterparties and establishing trust, 