In this notebook, we run a simple abstractive summarization task on one cluster of reviews.
Generative models such as the one used here accept only a limited number of input tokens, so any input beyond that limit has to be truncated.

In [51]:
import pandas as pd
from transformers import AutoTokenizer
from transformers import AutoModelForSeq2SeqLM

In [52]:
df = pd.read_csv("../data/clusterized_dataframe.csv")

In [53]:
checkpoint = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [54]:
cluster = df[df["cluster_num"]==5]

In [57]:
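# NOTE: a task prefix such as "summarize: " is a T5 convention. BART checkpoints
# are not trained with one, so it is probably unnecessary here; it is kept as in
# the original run.
prefix = "summarize: "
# Drop missing reviews and cast to str so that join() does not fail on NaN values.
cluster_reviews_joined = ". ".join(cluster["Reviews"].dropna().astype(str))
# Other feature-engineering functions were tried, but the summaries came out very
# similar in tone: all positive, and less intelligible.
doc = prefix + cluster_reviews_joined
# Truncate to 1024 tokens, the maximum input length used for this checkpoint.
inputs = tokenizer(doc, return_tensors="pt", max_length=1024, truncation=True).input_ids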
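
Because of `truncation=True`, only the first 1024 tokens of `doc` reach the model, so the summary can only reflect the reviews at the start of the cluster. A minimal sketch of how one might quantify this is shown below; `truncation_stats` is a hypothetical helper, not part of `transformers`, and the 20,000-token figure is an invented example, not a measurement from this dataset.

```python
def truncation_stats(n_tokens: int, max_length: int = 1024) -> dict:
    """Report how many tokens survive truncation to `max_length` (hypothetical helper)."""
    if n_tokens < 0 or max_length <= 0:
        raise ValueError("n_tokens must be >= 0 and max_length must be > 0")
    kept = min(n_tokens, max_length)       # tokens the model actually sees
    dropped = n_tokens - kept              # tokens silently discarded
    fraction_kept = kept / n_tokens if n_tokens else 1.0
    return {"kept": kept, "dropped": dropped, "fraction_kept": fraction_kept}


# Invented example: a document of 20,000 tokens truncated to 1024.
stats = truncation_stats(20_000)
print(stats)
```

In this notebook, the token count could be obtained with something like `len(tokenizer(doc).input_ids)`, that is, by tokenizing once without truncation (the tokenizer may warn that the sequence exceeds the model's maximum length).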

In [60]:
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)
# takes 13s

In [59]:
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

This is the best solution I've used for an all-in-one oil filter wrench (although you actually need 2 to cover the majority of sizes) It works only to loosen, not to tighten - and of course you'd never want to tighten an oil filter with a tool anyway.Note that this is listed on Amazon at least twice, and one is priced almost three times as much as the other.



One way to work around the maximum-token limit is to summarize each review (or each small group of reviews) separately, and then produce a general summary from those intermediate summaries.
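
The two-level idea can be sketched as follows. This is only an illustration under assumptions: `chunk_by_budget` and `hierarchical_summary` are hypothetical helpers, and `summarize` and `count_tokens` are callables supplied by the caller, so the sketch does not depend on any particular model.

```python
from typing import Callable, List


def chunk_by_budget(texts: List[str],
                    count_tokens: Callable[[str], int],
                    budget: int) -> List[List[str]]:
    """Greedily pack texts into chunks whose summed token count stays within `budget`.

    A single text longer than the budget still gets its own chunk; the
    summarizer is then expected to truncate it.
    """
    chunks, current, used = [], [], 0
    for text in texts:
        n = count_tokens(text)
        if current and used + n > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(text)
        used += n
    if current:
        chunks.append(current)
    return chunks


def hierarchical_summary(texts: List[str],
                         summarize: Callable[[str], str],
                         count_tokens: Callable[[str], int],
                         budget: int = 1024,
                         max_rounds: int = 10) -> str:
    """Summarize chunks, then summarize the summaries, until one text remains."""
    if not texts:
        return ""
    current = list(texts)
    for _ in range(max_rounds):
        chunks = chunk_by_budget(current, count_tokens, budget)
        # The ". " separator adds a few uncounted tokens, so the budget is approximate.
        current = [summarize(". ".join(chunk)) for chunk in chunks]
        if len(current) == 1:
            return current[0]
    # Safety net if the summaries never shrink enough: summarize whatever is left.
    return summarize(". ".join(current))
```

With the objects from this notebook, `count_tokens` could be something like `lambda t: len(tokenizer(t).input_ids)`, and `summarize` could wrap the tokenize, `model.generate`, and `tokenizer.decode` steps used above. The cost is one generation call per chunk per round, so runtime grows with the size of the cluster.
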
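Alternatively, we could use extractive summarization, which selects existing sentences instead of generating new text; extractive methods that do not rely on a fixed-context model are not subject to this limit.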
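
As a rough illustration of the extractive route, the sketch below scores each sentence by the average frequency of its words and keeps the top-scoring ones in their original order. It is a simple frequency heuristic chosen for brevity, not a method this notebook has evaluated; `extractive_summary` and the small `STOPWORDS` set are hypothetical names introduced here.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "was",
             "it", "this", "that", "in", "on", "for", "with"}


def _content_words(text: str) -> list:
    """Lowercased word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def extractive_summary(text: str, n_sentences: int = 3) -> str:
    """Return the `n_sentences` highest-scoring sentences, in original order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(_content_words(text))

    def score(sentence: str) -> float:
        words = _content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])  # restore document order
    return " ".join(sentences[i] for i in keep)
```

Since nothing here passes through a model with a fixed context window, the whole of `cluster_reviews_joined` could be given as `text`. The trade-off is that the result is a selection of review sentences, not a rewritten synthesis.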