<a href="https://colab.research.google.com/github/leandrogroup/11tyStaticCMS/blob/main/Autogen_product_desc.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Several years ago, in one of my first Ecommerce Director roles, I worked with the ex-Myprotein founder to launch sports nutrition brand GoNutrition. As a “bootstrapped” startup, we were low on numbers, so we all handled various tasks, with me taking on the copywriting.

Keen to improve the site’s conversion rate after launch, I ran an A/B test based on my hypothesis that simpler, more accessible product copy might lead to more sales. To run the test, I had to rewrite dozens of product descriptions. This took me days, partly because the gym is not my natural habitat, and protein powders are not my normal sustenance…

On a larger site, this sort of rewriting or text summarising task could easily take weeks or months. I did this rewriting by hand, but if I were tackling the same task today, I’d consider using a deep learning Transformer model to semi-automate the process instead.

In this project, I’ll show you how effective this can be by using the BART model to automatically generate short product summaries from some of the original product copy I created.

Load the packages

Open a new Jupyter notebook and import the pandas and transformers packages. You’ll likely need to install Transformers, which you can do from PyPI by running !pip install transformers in a notebook cell (drop the leading ! if you’re working in a terminal). To see more of the text in our dataframe, I’ve also set max_colwidth to 150.

In [None]:
!pip install pandas
!pip install transformers

import pandas as pd
from transformers import pipeline

pd.set_option('max_colwidth', 150)

Load the Transformer pipeline

Next, we’ll load the summarization pipeline from Hugging Face. This downloads a massive 1.3 GB pre-trained model for text summarisation, which uses BART, a “denoising autoencoder” model developed in 2019 by Mike Lewis and co-authors.

This one line will download and set up our BART transformer so it’s ready to handle text summarisation out of the box. Building a model like this yourself would take an enormous dataset, a supercomputer, loads of powerful GPUs, and months of your time.

In [None]:
summarizer = pipeline("summarization")
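Calling pipeline("summarization") without a model name lets the library pick its default checkpoint, which may change between releases. A minimal sketch of pinning the checkpoint explicitly is shown here. The name sshleifer/distilbart-cnn-12-6 is my assumption about the default at the time of writing, and load_summarizer is a hypothetical helper, so check both against the message printed when your pipeline loads.

```python
# Assumption: this is the checkpoint the summarization pipeline defaults to;
# confirm it against the message printed when the pipeline loads.
MODEL_NAME = "sshleifer/distilbart-cnn-12-6"


def load_summarizer(model_name=MODEL_NAME):
    """Hypothetical helper: build a summarization pipeline for a named checkpoint."""
    # Imported inside the function so nothing is loaded or downloaded
    # until the helper is actually called.
    from transformers import pipeline

    return pipeline("summarization", model=model_name)
```

With this in place, summarizer = load_summarizer() would stand in for the one-line call above.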

Load the data

Now the model is ready, we can import our data. I’ve created a miniature dataset based on some of the original product descriptions I wrote for GoNutrition when I worked there. These aim to explain the features and benefits of a range of protein powders and pre-workout supplements to gym goers with various levels of sports nutrition expertise.

In [None]:
df = pd.read_csv('gonutrition2.csv')
df.head()

Extract a product description

To see what we’re giving the model to work with, let’s take a look at the first product description in the dataframe. Product descriptions can be quite long and technical, but the model accepts at most 1024 tokens of input. Truncating the text to its first 1024 characters is a simple, conservative way to stay under that limit, since a token usually spans several characters.
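A hard cut at 1024 characters can stop mid-sentence, and descriptions exported from a shop platform often carry HTML entities such as &nbsp;. As a rough sketch (prepare_text is a hypothetical helper, not part of the original notebook), the input could be cleaned first and then cut at the last sentence end that fits:

```python
import html
import re


def prepare_text(text, max_chars=1024):
    """Decode HTML entities, collapse whitespace, and cut at a sentence end."""
    # html.unescape turns "&nbsp;" into a non-breaking space, which the
    # \s+ pattern then collapses like any other whitespace.
    cleaned = re.sub(r"\s+", " ", html.unescape(text)).strip()
    if len(cleaned) <= max_chars:
        return cleaned

    clipped = cleaned[:max_chars]
    # Prefer to end on the last complete sentence inside the limit.
    last_stop = max(clipped.rfind(". "), clipped.rfind("! "), clipped.rfind("? "))
    if last_stop == -1:
        return clipped  # no sentence boundary found, fall back to a hard cut
    return clipped[: last_stop + 1]
```

For descriptions already under the limit, only the entity and whitespace clean-up applies.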

In [11]:
text = df['product_description'][0]
truncated_text = text[:1024]
truncated_text

"Are you preparing to post something abroad with Royal Mail's International services? We can help you to ensure you've got the right all the right packaging, documents and labelling needed to get your postage to its' destination smoothly! For items sent with Royal Mail International Standard (except those posted in Northern Ireland bound for the Republic of Ireland) need an Air Mail sticker in the top left corner on the front of the item.&nbsp; These distinctive little blue labels are available in English and Welsh in various quantities from 18 to 180 stickers. Features:  Royal Mail International By Air Mail stickers Available in English and Welsh, in quantities of 18 to 180 stickers 36 labels per sheet Label size:  Width: 15mm Length: 40mm    &nbsp;"

Generate a summary of the product description

Next, we’ll generate a couple of product summaries. In the first one, I’ve set do_sample to True, so the model samples each next token from its predicted probability distribution, which means the summary can vary from run to run. The min_length and max_length arguments bound the length of the summary in tokens. To tidy up the text and remove some formatting issues I’ve used strip() and replace().

In [12]:
summary = summarizer(truncated_text, min_length=50, max_length=100, do_sample=True)
summary[0]['summary_text'].strip().replace(' .', '.')

'Royal Mail International By Air Mail stickers are available in English and Welsh. Available in quantities of 18 to 180 stickers per sheet. 36 labels per sheet and size of 15mm width: 15mm length: 40mm. Available only for items posted in Northern Ireland bound for the Republic of Ireland.'
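The strip() and replace() calls only handle a stray space before a full stop. If other punctuation shows the same problem, a slightly more general tidy-up could look like the sketch below (tidy_summary is a hypothetical helper name):

```python
import re


def tidy_summary(text):
    """Collapse repeated whitespace and remove spaces before punctuation."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return re.sub(r"\s+([.,;:!?])", r"\1", collapsed)


print(tidy_summary(" Label size:  Width: 15mm . Length: 40mm ."))
```

This is deliberately blunt: it would also close up a space before a decimal point written as " .5", so it’s worth checking against your own copy.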

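The other approach is to set do_sample to False. BART then decodes deterministically (using greedy or beam search, depending on the model’s generation settings), so the same input always produces the same summary. In both cases the model is abstractive: it generates the summary token by token rather than copying whole sentences from the source, although in practice the output often stays close to the original wording.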

In [13]:
summary = summarizer(truncated_text, min_length=50, max_length=100, do_sample=False)
summary[0]['summary_text'].strip().replace(' .', '.')

'Royal Mail International By Air Mail stickers are available in English and Welsh. Available in quantities of 18 to 180 stickers, 36 labels per sheet. Label size:  Width: 15mm Length: 40mm. Available only for items posted in Northern Ireland bound for the Republic of Ireland.'
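The do_sample flag controls how each next token is chosen during generation. The toy sketch below illustrates the idea with made-up probabilities for a single step; it is not how the pipeline is implemented internally, and the word list is purely illustrative.

```python
import random

# Made-up probabilities for the next word; a real model scores every
# token in its vocabulary at each step.
next_word_probs = {"stickers": 0.6, "labels": 0.3, "sheets": 0.1}


def pick_greedy(probs):
    """Deterministic: always return the highest-probability word."""
    return max(probs, key=probs.get)


def pick_sampled(probs, rng):
    """Stochastic: draw a word in proportion to its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)
greedy_picks = {pick_greedy(next_word_probs) for _ in range(5)}
sampled_picks = [pick_sampled(next_word_probs, rng) for _ in range(5)]
print(greedy_picks)   # always the single most likely word
print(sampled_picks)  # can mix words, and depends on the seed
```

Deterministic decoding is usually the safer choice for product copy, since reruns give the same text.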

Generate summaries for all products

Finally, we can put this all together and create a function to run the model on each of the product descriptions in our original dataframe and generate a bespoke product summary. This takes just a second or two per product!

In [None]:
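def get_summary(text, min_length=50, max_length=100, do_sample=False):
    """Summarise one product description with the summarizer pipeline."""
    # Keep only the first 1024 characters so the input stays within the model's limit
    summary = summarizer(text[:1024],
                         min_length=min_length,
                         max_length=max_length,
                         do_sample=do_sample)
    # Trim whitespace and remove the stray space left before full stops
    summary_text = summary[0]['summary_text'].strip().replace(' .', '.')

    return summary_text

# Run the model on every product description and store the result in a new column
df['product_summary'] = df['product_description'].apply(get_summary)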

Your max_length is set to 100, but your input_length is only 95. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=47)
Your max_length is set to 100, but your input_length is only 82. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=41)
Your max_length is set to 100, but your input_length is only 85. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=42)
Your max_length is set to 100, but your input_length is only 38. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)
Your
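These warnings appear because some descriptions are shorter than the fixed max_length of 100. One way to address them, sketched below under stated assumptions, is to scale the limits to each input. The helper names are hypothetical, the pipeline is passed in as an argument, and a whitespace word count stands in for the real token count, which the model’s tokenizer would usually report as somewhat higher.

```python
def length_limits(text, min_length=50, max_length=100):
    """Shrink the length limits for short inputs (word count as a token proxy)."""
    # Assumption: whitespace-separated words roughly approximate tokens.
    n_words = len(text.split())
    # Follow the warning's suggestion of about half the input length.
    new_max = max(2, min(max_length, n_words // 2))
    # Keep the minimum at no more than half of the new maximum.
    new_min = min(min_length, new_max // 2)
    return new_min, new_max


def get_summary_adaptive(text, summarizer, min_length=50, max_length=100):
    """Variant of get_summary that adapts its limits to the input length."""
    clipped = text[:1024]
    new_min, new_max = length_limits(clipped, min_length, max_length)
    result = summarizer(clipped, min_length=new_min,
                        max_length=new_max, do_sample=False)
    return result[0]["summary_text"].strip().replace(" .", ".")
```

It could be applied with df['product_description'].apply(lambda t: get_summary_adaptive(t, summarizer)).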

Examine the results

After the model has run, we can inspect the product_summary column we stored back in the original dataframe. The summaries are slightly truncated when the dataframe is displayed, so we’ll also look at them individually to see how well the model compares to a human copywriter.

In [None]:
df = df[['product_name', 'product_summary']]
df.head()
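To read the summaries in full rather than in the truncated dataframe view, they can be printed one at a time. A small sketch, where format_summaries is a hypothetical helper that takes (name, summary) pairs:

```python
import textwrap


def format_summaries(rows, width=80):
    """Return one wrapped text block per (name, summary) pair."""
    blocks = []
    for name, summary in rows:
        wrapped = textwrap.fill(summary, width=width)
        blocks.append(f"{name}\n{wrapped}")
    return "\n\n".join(blocks)
```

With the dataframe above, this could be called as print(format_summaries(zip(df['product_name'], df['product_summary']))).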