# Product Description Generation Using LLMs

To understand how LLMs work, we first look at Transformers and then at the models available on Hugging Face.

## Transformers

Transformers have revolutionized the field of natural language processing (NLP) with their unique neural network architecture, as introduced in the 2017 paper "Attention is All You Need" by Vaswani et al. Unlike traditional NLP models that rely on recurrent connections, transformers process sequential data, such as text, using "self-attention" or "scaled dot-product attention". This allows them to weigh the importance of words in a sentence for making predictions.

The main components of a transformer architecture include an encoder and a decoder. The encoder processes the input text and extracts features through multiple layers of self-attention mechanisms and feed-forward neural networks. The decoder generates the output text from the encoded representation, also utilizing self-attention mechanisms and attending to previously generated output tokens.

Self-attention is a crucial mechanism in transformers, enabling the model to weigh the contextual relevance of words in the input text. Each word's embedding is projected into query, key, and value vectors; the output for a word is a weighted sum of the value vectors, with weights given by the similarity between that word's query and every word's key. This allows the model to capture long-range dependencies and contextual relationships.
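As a rough illustration of the mechanism, here is a minimal single-head sketch of scaled dot-product attention in NumPy. The token count, embedding size, and random projection matrices `W_q`, `W_k`, `W_v` are toy assumptions for the example; a real model learns these matrices and uses several heads.

```python
import numpy as np


def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # Similarity between every query and every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted sum of the value vectors.
    return weights @ V, weights


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens, embedding size 8 (toy values)
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(output.shape)          # one output vector per token
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

Each row of `weights` shows how strongly one token attends to every token in the sequence, including itself.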

To address the absence of inherent word order in transformers, positional encoding is used to inject positional information into the input text. In the original Transformer this is achieved by adding sinusoidal functions of the position to the input embeddings, so that each position in the sequence receives a distinct encoding.
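The sinusoidal encoding from "Attention is All You Need" can be sketched as follows. The function name and the toy sizes are assumptions for illustration, and the sketch assumes an even `d_model`.

```python
import numpy as np


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings.

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions
    return encoding


pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
print(pe.shape)

# The encoding is added element-wise to the token embeddings.
token_embeddings = np.zeros((10, 16))  # placeholder embeddings
model_input = token_embeddings + pe
```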

Transformers have several advantages, including their ability to model long-range dependencies in text, which is challenging for traditional recurrent models. They have achieved state-of-the-art results in various NLP tasks and are widely used in many NLP applications. Popular transformer-based models include BERT, GPT-2, T5, and Transformer-XL, among others.

## Models

When selecting an LLM for generating product descriptions, it's essential to consider several factors, such as the model's size, the quality of the generated text, the level of fine-grained control over the text generation, and the computational resources required for training and inference. It's also important to ensure that the generated text is coherent, readable, and relevant to the products being described. After searching Hugging Face, I ended up choosing BLOOM, GPT-2 and LLaMA. The other text generation models either did not work or were not comparable.


*   LLaMA: LLaMA is a family of large language models developed by Meta AI. Its authors report that, although LLaMA is smaller than many other models, it can outperform them when trained on more tokens. LLaMA's training data includes text in 20 different languages. The results presented in the paper indicate that providing a textual description of the task together with a few examples of it (few-shot prompting) gives better results.
    At the time of writing I have not been granted access to LLaMA.
*   GPT-2: GPT-2 is a large transformer-based model developed by OpenAI. Its largest version has 1.5 billion parameters (the `gpt2` checkpoint on Hugging Face is the smallest version, with about 124 million). Given the previous words in a text, it predicts the next word.
*   BLOOM: BLOOM is a 176-billion-parameter LLM developed by BigScience and trained on 46 natural languages. Its architecture is a decoder-only transformer: a stack of decoder blocks with multi-head attention layers on top of an embedding layer.
    Its multilingual training lets the user translate text and discuss a topic in a different language.






## Loading the data

The dataset used for this project is the [Amazon product dataset 2020](https://www.kaggle.com/datasets/promptcloud/amazon-product-dataset-2020). It contains about 10,000 products listed on Amazon.

In [None]:
# imports

import numpy as np
import pandas as pd
import pylab as pl
from matplotlib import pyplot as plt
import seaborn as sns
import string
import pickle


In [None]:
!pip install pyquery

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyquery
  Downloading pyquery-2.0.0-py3-none-any.whl (22 kB)
Collecting cssselect>=1.2.0
  Downloading cssselect-1.2.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: cssselect, pyquery
Successfully installed cssselect-1.2.0 pyquery-2.0.0


In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.4-py3-none-any.whl (200 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.4 tokenizers-0.13.3 transformers-4.28.1


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df = pd.read_csv('/content/drive/MyDrive/marketing_sample_for_amazon_com-ecommerce__20200101_20200131__10k_data.csv')

In [None]:
df

Unnamed: 0,Uniq Id,Product Name,Brand Name,Asin,Category,Upc Ean Code,List Price,Selling Price,Quantity,Model Number,...,Product Url,Stock,Product Details,Dimensions,Color,Ingredients,Direction To Use,Is Amazon Seller,Size Quantity Variant,Product Description
0,4c69b61db1fc16e7013b43fc926e502d,"DB Longboards CoreFlex Crossbow 41"" Bamboo Fib...",,,Sports & Outdoors | Outdoor Recreation | Skate...,,,$237.68,,,...,https://www.amazon.com/DB-Longboards-CoreFlex-...,,,,,,,Y,,
1,66d49bbed043f5be260fa9f7fbff5957,"Electronic Snap Circuits Mini Kits Classpack, ...",,,Toys & Games | Learning & Education | Science ...,,,$99.95,,55324,...,https://www.amazon.com/Electronic-Circuits-Cla...,,,,,,,Y,,
2,2c55cae269aebf53838484b0d7dd931a,3Doodler Create Flexy 3D Printing Filament Ref...,,,Toys & Games | Arts & Crafts | Craft Kits,,,$34.99,,,...,https://www.amazon.com/3Doodler-Plastic-Innova...,,,,,,,Y,,
3,18018b6bc416dab347b1b7db79994afa,Guillow Airplane Design Studio with Travel Cas...,,,Toys & Games | Hobbies | Models & Model Kits |...,,,$28.91,,142,...,https://www.amazon.com/Guillow-Airplane-Design...,,,,,,,Y,,
4,e04b990e95bf73bbe6a3fa09785d7cd0,Woodstock- Collage 500 pc Puzzle,,,Toys & Games | Puzzles | Jigsaw Puzzles,,,$17.49,,62151,...,https://www.amazon.com/Woodstock-Collage-500-p...,,,,,,,Y,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9997,1a22f23576bfdfe5ed6c887dc117aab6,Remedia Publications REM536B Money Activity Bo...,,,Toys & Games | Learning & Education | Counting...,,,$9.31,,REM536B,...,https://www.amazon.com/Remedia-Publications-RE...,,,,,,,Y,,
9998,e11514dcf1f087887cd5ea0bd646d1fc,Trends International NFL La Chargers HG - Mobi...,,,Toys & Games | Arts & Crafts,,,$6.99,,,...,https://www.amazon.com/Trends-International-NF...,,,,,,,Y,,
9999,c00301a38560da2abc89c1f86ce4b267,NewPath Learning 10 Piece Science Owls and Owl...,,,Office Products | Office & School Supplies | E...,,,$37.95,,34-6015,...,https://www.amazon.com/NewPath-Learning-Scienc...,,,,,,,Y,,
10000,c2928dbf9796ceba44863a2736afb405,Disney Princess Do It Yourself Braid Set,,,Toys & Games | Arts & Crafts | Craft Kits,,,$3.58,,2888PRST,...,https://www.amazon.com/Disney-Princess-Yoursel...,,,,,,,Y,,


In [None]:
df['Product Url'][1]

'https://www.amazon.com/Electronic-Circuits-Classpack-Motion-Detector/dp/B008AK6DAS'

In [None]:
df.isnull().sum()

Uniq Id                      0
Product Name                 0
Brand Name               10002
Asin                     10002
Category                   830
Upc Ean Code              9968
List Price               10002
Selling Price              107
Quantity                 10002
Model Number              1770
About Product              273
Product Specification     1632
Technical Details          790
Shipping Weight           1138
Product Dimensions        9523
Image                        0
Variants                  7524
Sku                      10002
Product Url                  0
Stock                    10002
Product Details          10002
Dimensions               10002
Color                    10002
Ingredients              10002
Direction To Use         10002
Is Amazon Seller             0
Size Quantity Variant    10002
Product Description      10002
dtype: int64

## Getting the actual product descriptions from the Amazon web page (failed)

I intended to fetch the actual product description written by the seller, so that there would be a reference document to compare the AI-generated description against. Unfortunately, Amazon blocks scraping requests, and I ended up getting a 503 error. To compare the two documents I was planning to use BERT or Doc2vec embeddings together with cosine similarity.
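The planned comparison step could look roughly like the sketch below. It is only an illustration of cosine similarity: simple word-count vectors stand in for the BERT or Doc2vec embeddings mentioned above, and the two example sentences are made up.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Very rough stand-in for an embedding: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical texts, only for demonstration.
seller_text = "a set of five snap circuits kits for young students"
generated_text = "this set of five snap circuits introduces students to electronics"

score = cosine_similarity(bag_of_words(seller_text), bag_of_words(generated_text))
print(round(score, 3))
```

With real sentence embeddings the same `cosine_similarity` idea applies to dense vectors, where a score near 1 means the two descriptions point in a similar direction in embedding space.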

In [None]:
# Try to fetch the seller's product description from the Amazon product page.

from pyquery import PyQuery
import requests

url = "https://www.amazon.com/Electronic-Circuits-Classpack-Motion-Detector/dp/B008AK6DAS"
response = requests.get(url, timeout=30)
print(response.status_code)  # this returned 503 because Amazon blocks scraping requests
html = response.text
print(html)

# Select the description text; an alternative would be a class selector such as pq('div.class').
pq = PyQuery(html)
tag = pq('div#productDescription p span')
print(tag.text())


## Testing different prompts and comparing them

For the entry task, I tried several prompts built from different features. [This product](https://www.amazon.com/Electronic-Circuits-Classpack-Motion-Detector/dp/B008AK6DAS) was chosen at random as the example for evaluating them. Of the features available in the dataset, "Product Name", "Category", "About Product" and "Product Specification" seem to be the most useful and complete.

The GPT-2 results are weak. BLOOM's results are much better: even though the Hugging Face API for BLOOM limits the number of output characters, its descriptions are more engaging for the customer and actually describe the product. For this task we therefore use only BLOOM until we get access to LLaMA and can compare the two.

After running each prompt 5 times, I judged the descriptions produced from product name, category and about product to be the best overall. However, a closer look at the "About Product" feature shows that it already contains sentences describing different parts of the product. For this reason, I decided to move forward with product name and category only.

Unfortunately, the laptop I am running this task on could not fully download and run BLOOM, so for the time being we can only use the API.
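The feature combinations compared in the cells below can be generated with one small helper instead of a separate hand-written prompt per cell. This is a sketch, not the code that produced the outputs in this notebook: `FEATURE_SETS`, `build_prompt` and the example row are hypothetical names and values, and skipping missing values is an added assumption (the dataset has 830 rows without a category).

```python
import math

# The four feature combinations tried in this notebook.
FEATURE_SETS = {
    "name+category": ["Product Name", "Category"],
    "name+category+about": ["Product Name", "Category", "About Product"],
    "name+category+spec": ["Product Name", "Category", "Product Specification"],
    "all": ["Product Name", "Category", "About Product", "Product Specification"],
}


def is_missing(value):
    """True for None and NaN, which is how pandas marks empty cells."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def build_prompt(row, features):
    """Build a 'Feature:value' prompt that ends with 'Product description:'."""
    lines = [f"{name}:{row[name]}" for name in features if not is_missing(row[name])]
    lines.append("Product description:")
    return " \n".join(lines)


# Made-up row; a pandas row such as df.loc[1] can be passed in the same way.
example_row = {
    "Product Name": "Example Snap Kit",
    "Category": "Toys & Games",
    "About Product": float("nan"),
    "Product Specification": "Item model number: 0000",
}

for label, features in FEATURE_SETS.items():
    print(label, "->", repr(build_prompt(example_row, features)))
```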

In [None]:
# case : just using product name and category for bloom

# https://huggingface.co/bigscience/bloom

import requests
from pprint import pprint

API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"
headers = {"Authorization": "Bearer ********"}

def query(payload):
	response = requests.post(API_URL, headers=headers, json=payload)
	return response.json()

selected_row = 1
prompt_style = "Formal"

output = query({
    'inputs':
    # f' Prompt style:{prompt_style} \n'+
    f' Product Name:{df["Product Name"][selected_row]} \n'+
    f'Category:{df["Category"][selected_row]} \n' +
    f'Product description:' ,
})

pprint(output)

[{'generated_text': ' Product Name:Electronic Snap Circuits Mini Kits '
                    'Classpack, FM Radio, Motion Detector, Music Box (Set of '
                    '5) \n'
                    'Category:Toys & Games | Learning & Education | Science '
                    'Kits & Toys \n'
                    'Product description: This set of 5 snap circuits is a '
                    'great way to introduce kids to electronics. The snap '
                    'circuits are'}]


In [None]:



import requests
from pprint import pprint

API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"
headers = {"Authorization": "Bearer ********"}

def query(payload):
	response = requests.post(API_URL, headers=headers, json=payload)
	return response.json()


output = query({
    'inputs':"Schreiben Sie eine Produktbeschreibung für das folgende Produkt: \n Produktname: Lasertoner cyan OKI 42804547 \n Produktkategorie: Toner, Tonereinheit (Laserdrucker, Kopierer)\n Produktbeschreibung:" ,
})

pprint(output)

[{'generated_text': 'Schreiben Sie eine Produktbeschreibung für das folgende '
                    'Produkt: \n'
                    ' Produktname: Lasertoner cyan OKI 42804547 \n'
                    ' Produktkategorie: Toner, Tonereinheit (Laserdrucker, '
                    'Kopierer)\n'
                    ' Produktbeschreibung: Lasertoner cyan OKI 42804547 \n'
                    ' Produktlink: http://www.example.com/'}]



In [None]:
# case : natural-language instruction prompt with product name, category and about product for bloom

# https://huggingface.co/bigscience/bloom

import requests
from pprint import pprint

API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"
headers = {"Authorization": "Bearer ********"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

selected_row = 1

# "[target audience]" is sent as a literal placeholder; it was never filled in.
output = query({
    'inputs': f'Generate a product description for an Amazon product. The product is called {df["Product Name"][selected_row]} and is a {df["Category"][selected_row]} with the following features: {df["About Product"][selected_row]}. '
    f'The target audience for this product is [target audience]. The product description should be engaging, informative, and suitable for an online retail platform like Amazon. '
    f'Product description:',
})

pprint(output)

[{'generated_text': 'Generate a product description for an Amazon product. The '
                    'product is called Electronic Snap Circuits Mini Kits '
                    'Classpack, FM Radio, Motion Detector, Music Box (Set of '
                    '5) and is a Toys & Games | Learning & Education | Science '
                    'Kits & Toys with the following features: Make sure this '
                    'fits by entering your model number. | Snap circuits mini '
                    'kits classpack provides basic electronic circuitry '
                    'activities for students in grades 2-6 | Includes 5 '
                    'separate mini building kits- an FM radio, a motion '
                    'detector, music box, space battle sound effects, and a '
                    'flying saucer | Each kit includes separate components and '
                    'instructions to build | Each component represents one '
                    'function in a circuit; components snap togeth

In [None]:
# case : just using product name,category and about product for bloom

# https://huggingface.co/bigscience/bloom

import requests
from pprint import pprint

API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"
headers = {"Authorization": "Bearer ********"}

def query(payload):
	response = requests.post(API_URL, headers=headers, json=payload)
	return response.json()

selected_row = 1
prompt_style = "Formal"

params = {'max_length': 3000, 'top_k': 10, 'temperature': 2.5}
output = query({
    'inputs':
    # f' Prompt style:{prompt_style} \n'+
    f' Product Name:{df["Product Name"][selected_row]} \n'+
    f'Category:{df["Category"][selected_row]} \n' +
    f'About Product:{df["About Product"][selected_row]} \n' +
    f'Product description:' ,
    'parameters': params,
})

pprint(output)

[{'generated_text': ' Product Name:Electronic Snap Circuits Mini Kits '
                    'Classpack, FM Radio, Motion Detector, Music Box (Set of '
                    '5) \n'
                    'Category:Toys & Games | Learning & Education | Science '
                    'Kits & Toys \n'
                    'About Product:Make sure this fits by entering your model '
                    'number. | Snap circuits mini kits classpack provides '
                    'basic electronic circuitry activities for students in '
                    'grades 2-6 | Includes 5 separate mini building kits- an '
                    'FM radio, a motion detector, music box, space battle '
                    'sound effects, and a flying saucer | Each kit includes '
                    'separate components and instructions to build | Each '
                    'component represents one function in a circuit; '
                    'components snap together to create working models of '
                 

In [None]:
# case : just using name, category and specification for bloom

# https://huggingface.co/bigscience/bloom

import requests
from pprint import pprint

API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"
headers = {"Authorization": "Bearer ********"}

def query(payload):
	response = requests.post(API_URL, headers=headers, json=payload)
	return response.json()

selected_row = 1
prompt_style = "Formal"

params = {'max_length': 3000, 'top_k': 10, 'temperature': 2.5}
output = query({
    'inputs':
    # f' Prompt style:{prompt_style} \n'+
    f' Product Name:{df["Product Name"][selected_row]} \n'+
    f'Category:{df["Category"][selected_row]} \n' +
    f'Product Specification:{df["Product Specification"][selected_row]} \n' +
    f'Product description:' ,
    'parameters': params,
})

pprint(output)

[{'generated_text': ' Product Name:Electronic Snap Circuits Mini Kits '
                    'Classpack, FM Radio, Motion Detector, Music Box (Set of '
                    '5) \n'
                    'Category:Toys & Games | Learning & Education | Science '
                    'Kits & Toys \n'
                    'Product Specification:Product Dimensions:         14.7 x '
                    '11.1 x 10.2 inches ; 4.06 pounds    |Shipping Weight: 4 '
                    'pounds (View shipping rates and policies)|Domestic '
                    'Shipping: Item can be shipped within U.S.|International '
                    'Shipping: This item can be shipped to select countries '
                    'outside of the U.S.  Learn More|ASIN: B008AK6DAS|Item '
                    'model number: 55324|    #3032    in\xa0Science Kits & '
                    'Toys \n'
                    'Product description:\n'
                    'Build a working FM Radio that can receive and record over '
      

In [None]:
# case : just using name, category, about product and specification for bloom

# https://huggingface.co/bigscience/bloom

import requests
from pprint import pprint

API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"
headers = {"Authorization": "Bearer ********"}

def query(payload):
	response = requests.post(API_URL, headers=headers, json=payload)
	return response.json()

selected_row = 1
prompt_style = "Formal"

params = {'max_length': 3000, 'top_k': 10, 'temperature': 2.5}
output = query({
    'inputs':
    # f' Prompt style:{prompt_style} \n'+
    f' Product Name:{df["Product Name"][selected_row]} \n'+
    f'Category:{df["Category"][selected_row]} \n' +
    f'About Product:{df["About Product"][selected_row]} \n' +
    f'Product Specification:{df["Product Specification"][selected_row]} \n' +
    f'Product description:' ,
    'parameters': params,
})

pprint(output)

[{'generated_text': ' Product Name:Electronic Snap Circuits Mini Kits '
                    'Classpack, FM Radio, Motion Detector, Music Box (Set of '
                    '5) \n'
                    'Category:Toys & Games | Learning & Education | Science '
                    'Kits & Toys \n'
                    'About Product:Make sure this fits by entering your model '
                    'number. | Snap circuits mini kits classpack provides '
                    'basic electronic circuitry activities for students in '
                    'grades 2-6 | Includes 5 separate mini building kits- an '
                    'FM radio, a motion detector, music box, space battle '
                    'sound effects, and a flying saucer | Each kit includes '
                    'separate components and instructions to build | Each '
                    'component represents one function in a circuit; '
                    'components snap together to create working models of '
                 

In [None]:
# case : using only product name and category for GPT-2

# https://huggingface.co/gpt2
import requests
from pprint import pprint

API_URL = "https://api-inference.huggingface.co/models/gpt2"
headers = {"Authorization": "Bearer ********"}

def query(payload):
	response = requests.post(API_URL, headers=headers, json=payload)
	return response.json()


selected_row = 1
params = {'max_length': 500, 'top_k': 10, 'temperature': 2.5}
output = query({
    'inputs': f'Write a Product description for the following Product:\n Product Name:{df["Product Name"][selected_row]} \n'+
    f'Category:{df["Category"][selected_row]} \n' ,
    'parameters': params,
})

pprint(output)

[{'generated_text': 'Write a Product description for the following Product:\n'
                    ' Product Name:Electronic Snap Circuits Mini Kits '
                    'Classpack, FM Radio, Motion Detector, Music Box (Set of '
                    '5) \n'
                    'Category:Toys & Games | Learning & Education | Science '
                    'Kits & Toys \n'
                    'Posted by: John R. H. on 05/11/2018 @ 09:00 a post'}]


In [None]:
# case : German prompt with product name and category for GPT-2

# https://huggingface.co/gpt2
import requests
from pprint import pprint

API_URL = "https://api-inference.huggingface.co/models/gpt2"
headers = {"Authorization": "Bearer ********"}

def query(payload):
	response = requests.post(API_URL, headers=headers, json=payload)
	return response.json()


selected_row = 1
params = {'max_length': 500, 'top_k': 10, 'temperature': 2.5}
output = query({
    'inputs': f'Schreiben Sie eine Produktbeschreibung für das folgende Produkt: \n Produktname: Lasertoner cyan OKI 42804547 \n Produktkategorie: Toner, Tonereinheit (Laserdrucker, Kopierer)\n' ,
    'parameters': params,
})

pprint(output)

[{'generated_text': 'Schreiben Sie eine Produktbeschreibung für das folgende '
                    'Produkt: \n'
                    ' Produktname: Lasertoner cyan OKI 42804547 \n'
                    ' Produktkategorie: Toner, Tonereinheit (Laserdrucker, '
                    'Kopierer)\n'
                    '\n'
                    'Produkt: Künzünde Produkt (Lasertronik, Lazerdruck)\n'
                    '\n'
                    '(Produkt: Toner, Tonereinheit (Lasertranik)\n'
                    '\n'
                    'Produktkategorie: Dägte Produtsst, Wieße, Lützände '
                    'Produter.\n'
                    '\n'
                    'Produkt: Zug-Das-Dreiner, Die Geburtung, Lautenberg '
                    'Produktkriegsprodut und dazsühle Kultura-Sonderstahl zu '
                    'den Dägen (Sterzweil und Zuger), Lautenberg\n'
                    '\n'
                    'Produktkognizativ, Weltung Produkt. (Druktur)\n'
                    '\n'
              

In [None]:
import requests

API_URL = "https://api-inference.huggingface.co/models/malteos/gpt2-wechsel-german-ds-meg"
headers = {"Authorization": "Bearer ********"}

def query(payload):
	response = requests.post(API_URL, headers=headers, json=payload)
	return response.json()

params = {'max_length': 500, 'top_k': 10, 'temperature': 2.5}
output = query({
    'inputs': f'Schreiben Sie eine Produktbeschreibung für das folgende Produkt: \n Produktname: Lasertoner cyan OKI 42804547 \n Produktkategorie: Toner, Tonereinheit (Laserdrucker, Kopierer)\n' ,
    'parameters': params,
})
print(output)

[{'generated_text': 'Schreiben Sie eine Produktbeschreibung für das folgende Produkt: \n Produktname: Lasertoner cyan OKI 42804547 \n Produktkategorie: Toner, Tonereinheit (Laserdrucker, Kopierer)\nProduktname: Toner, Tonerreinheit (Laserdrucker, Kopierer) - Toner, Tonerreinheit (Laserdrucker, Kopierer)\nProduktname: Toner, Tonerreinheit (Laserdrucker, Kopierer) - Toner, Tonerreinheit (Laserdrucker, Kopierer)\nProduktname: Toner, Tonerreinheit (Laserdrucker, Kopierer) - Toner, Tonerreinheit (Laserdrucker, Kopierer)\nProduktname: Toner, Tonerreinheit (Laserdrucker, Kopierer) - Toner, Tonerreinheit (Laserdrucker, Kopierer)\nProduktname: Toner, Tonerreinheit (Laserdrucker, Kopierer) - Toner, Tonerreinheit (Laserdrucker, Kopierer)\nProduktname: Toner, Tonerreinheit (Laserdrucker, Kopierer) - Toner, Tonerreinheit (Laserdrucker, Kopierer)\nProduktname: Toner, Tonerreinheit (Laserdrucker, Kopierer) - Toner, Tonerreinheit (Laserdrucker, Kopierer)\nProduktname: Toner, Tonerreinheit (Laserdrucke

In [None]:
import requests

API_URL = "https://api-inference.huggingface.co/models/mistralai/Mistral-7B-v0.1"
headers = {"Authorization": "Bearer ********"}

def query(payload):
	response = requests.post(API_URL, headers=headers, json=payload)
	return response.json()

output = query({
	'inputs': f'Schreiben Sie eine Produktbeschreibung für das folgende Produkt: \n Produktname: Lasertoner cyan OKI 42804547 \n Produktkategorie: Toner, Tonereinheit (Laserdrucker, Kopierer)\n' ,
})

print(output)

[{'generated_text': 'Schreiben Sie eine Produktbeschreibung für das folgende Produkt: \n Produktname: Lasertoner cyan OKI 42804547 \n Produktkategorie: Toner, Tonereinheit (Laserdrucker, Kopierer)\n Produktbeschreibung:\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'}]


In [None]:
# case : natural-language instruction prompt with product name, category and about product for GPT-2

# https://huggingface.co/gpt2
import requests
from pprint import pprint

API_URL = "https://api-inference.huggingface.co/models/gpt2"
headers = {"Authorization": "Bearer ********"}

def query(payload):
	response = requests.post(API_URL, headers=headers, json=payload)
	return response.json()


selected_row = 1
params = {'max_length': 500, 'top_k': 10, 'temperature': 2.5}
output = query({
    'inputs': f'Generate a product description for an Amazon product. The product is called {df["Product Name"][selected_row]} and is a {df["Category"][selected_row]} with the following features: {df["About Product"][selected_row]}.'
     f'The target audience for this product is [target audience]. The product description should be engaging, informative, and suitable for an online retail platform like Amazon. ',
    'parameters': params,
})

pprint(output)

[{'generated_text': 'Generate a product description for an Amazon product. The '
                    'product is called Electronic Snap Circuits Mini Kits '
                    'Classpack, FM Radio, Motion Detector, Music Box (Set of '
                    '5) and is a Toys & Games | Learning & Education | Science '
                    'Kits & Toys with the following features: Make sure this '
                    'fits by entering your model number. | Snap circuits mini '
                    'kits classpack provides basic electronic circuitry '
                    'activities for students in grades 2-6 | Includes 5 '
                    'separate mini building kits- an FM radio, a motion '
                    'detector, music box, space battle sound effects, and a '
                    'flying saucer | Each kit includes separate components and '
                    'instructions to build | Each component represents one '
                    'function in a circuit; components snap togeth

## Product description generation for all products in the dataset

The prompt format for the task of generating product descriptions is as follows:


1.   Prompt style: You can specify the desired style for the product description, such as "formal", "casual", "technical", "funny", etc. This will guide BLOOM in generating text that aligns with the desired tone and style.
2.   Product Name: Specify the name of the product for which you want to generate the description. The name usually contains the most important features of the product and will help BLOOM in generating a more detailed description.
3.   Product Category: Specify the category of the product for which you want to generate the description, such as "Electronics", "Home & Kitchen", "Fashion", "Books", etc. This will help BLOOM in generating product-specific language and details.
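The three-part format above can be sketched as a single template function. The function name `build_styled_prompt` and the example values are assumptions for illustration. Note that the `prompt1` column built later in this notebook leaves the style line out.

```python
def build_styled_prompt(style, product_name, category):
    """Combine style, name and category into one prompt for the model."""
    return (
        f"Prompt style:{style} \n"
        f"Product Name:{product_name} \n"
        f"Category:{category} \n"
        "Product description:"
    )


# Example values only.
prompt = build_styled_prompt(
    style="Formal",
    product_name="Woodstock- Collage 500 pc Puzzle",
    category="Toys & Games | Puzzles | Jigsaw Puzzles",
)
print(prompt)
```

Ending the prompt with `Product description:` invites the model to continue with the description itself rather than with more metadata fields.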

### BLOOM

BLOOM is a 176-billion-parameter LLM developed by BigScience and trained on 46 natural languages. Its architecture is a decoder-only transformer: a stack of decoder blocks with multi-head attention layers on top of an embedding layer.
Its multilingual training lets the user translate text and discuss a topic in a different language.

BLOOM is competitive on a wide variety of benchmarks, with stronger results after fine-tuning. It is released under the BigScience Responsible AI License (RAIL), which allows the model to be used publicly. BLOOM was evaluated in zero-shot and few-shot settings on a number of benchmark tasks, including language modeling, machine translation, summarization, and code generation, and its authors report strong results on several of them. On the CrowS-Pairs bias benchmark the paper reports scores close to chance, which suggests limited measured bias on that dataset but does not show that the model is free of bias.

To generate text in these evaluations, the model uses greedy decoding, with generation continuing until the EOS token, plus an additional stopping criterion in the 1-shot setting. As a general rule, the maximum generation lengths for each dataset were set in line with common practice in the literature. The paper "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model" notes that multilingual generation poses additional challenges because of a lack of metrics, particularly for natural language generation. Following Gehrmann et al. (2022b), the paper evaluates text generation performance with ROUGE-2, ROUGE-L, and Levenshtein distance.
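Of the metrics mentioned above, Levenshtein distance is the simplest to reproduce. The sketch below is a standard dynamic-programming implementation for illustration, not the evaluation code used in the BLOOM paper. It counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits that turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("toner", "toner"))     # 0
```

Applied to a generated description and a reference description, a smaller distance means the two texts are closer at the character level.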

As mentioned before, we are accessing BLOOM through the Hugging Face API. This is not a good long-term option and is only good enough for this entry task: there is a limit on the number of output characters, and the API has stopped working a couple of times during the last 14 days.
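One way to cope with an API that occasionally stops responding is to retry failed calls with an increasing delay. The helper below is a generic sketch under stated assumptions: `call_with_retries` and `flaky_request` are hypothetical names, and the wrapped function is assumed to raise an exception on failure (for example by calling `response.raise_for_status()` before returning), which the `query` function in this notebook does not do as written.

```python
import time


def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception wait base_delay * 2**attempt and try again."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Simulated request that fails twice before succeeding.
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = call_with_retries(flaky_request, base_delay=0.01)
print(result, attempts["count"])
```

Retrying helps with short outages only. The output-length limit mentioned above still applies to every successful call.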

In [None]:
df["prompt1"] = df.apply(lambda row:
                          f'Product Name:{row["Product Name"]} \n'+
                          f'Category:{row["Category"]} \n' +
                          f'Product description:'
    , axis=1)



In [None]:
df["prompt1"][1]

'Product Name:Electronic Snap Circuits Mini Kits Classpack, FM Radio, Motion Detector, Music Box (Set of 5) \nCategory:Toys & Games | Learning & Education | Science Kits & Toys \nProduct description:'

In [None]:

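import requests

# Endpoint used for this run; change the model id in the URL to switch models.
API_URL = "https://api-inference.huggingface.co/models/gpt2"
headers = {"Authorization": "Bearer ********"}


def query(payload):
    """Send a JSON payload to the Inference API and return the decoded response."""
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()


def generate_description(prompt):
    """Return only the generated continuation, without the echoed prompt."""
    output = query({'inputs': prompt})
    # A successful call returns a list; anything else (such as an error message) is skipped.
    if not isinstance(output, list):
        return None
    text = output[0]["generated_text"]
    # The response echoes the prompt, so drop it when present.
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


# One API call per row, so this sends roughly 10,000 requests.
df["Product Description"] = df["prompt1"].apply(generate_description)
# generate_description(df['prompt1'][1])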

In [None]:
df
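
Because the hosted API was unreliable during this task, a retry wrapper can make the calls more robust. The following is a minimal sketch, not part of the original notebook: `query_with_retry`, its parameters, and the retry policy are hypothetical, and it assumes (as the error output later in this notebook suggests) that failures come back as a dict with an `error` key while successful text-generation calls return a list.

```python
import time

API_URL = "https://api-inference.huggingface.co/models/gpt2"  # same endpoint as the cell above
HEADERS = {"Authorization": "Bearer ********"}  # placeholder token, as above


def query_with_retry(payload, post=None, max_attempts=3, wait_seconds=2.0):
    """POST the payload, retrying when the request fails or the API reports an error."""
    if post is None:
        import requests  # imported lazily so a custom `post` callable can be injected
        post = requests.post

    last_error = "no attempt made"
    for attempt in range(1, max_attempts + 1):
        try:
            response = post(API_URL, headers=HEADERS, json=payload, timeout=60)
            result = response.json()
        except (OSError, ValueError) as exc:
            # requests' network errors derive from OSError; invalid JSON raises ValueError
            last_error = str(exc)
        else:
            if isinstance(result, list):
                return result  # assumed success shape: a list of generations
            if isinstance(result, dict):
                last_error = str(result.get("error", result))
            else:
                last_error = str(result)
        if attempt < max_attempts:
            time.sleep(wait_seconds * attempt)  # wait a little longer after each failure
    raise RuntimeError(f"API call failed after {max_attempts} attempts: {last_error}")
```

With this helper, `generate_description` could call `query_with_retry({"inputs": prompt})` instead of `query`, so that a temporary outage raises a clear error rather than a confusing `KeyError` on the response.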

## Using transformers to download and fine-tune the model

Unfortunately, this is not possible with the resources I currently have.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# The full bigscience/bloom checkpoint has 176B parameters and does not fit on the available hardware
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom")

# Reuse the loaded model and tokenizer instead of loading the checkpoint a second time
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
generator(df["prompt1"][1])

## Translation

In [None]:
!pip install translators


In [None]:
import translators as ts

phrase = 'This is the perfect way that you can make electrical work with kids. The best part about these products are they come packaged so easily! They’re easy-to-use because there’s no need any more complicated assembly or soldering skills needed.This product comes packed into two sets which include all five pieces required – including four small parts each - making it ideal as either individual projector set up/assembly toolkit / DIY electronics toy'
ts.translate_text(phrase, from_language='en', to_language='de', translator = 'google')

'Dies ist der perfekte Weg, um mit Kindern elektrische Arbeit zu machen. Das Beste an diesen Produkten ist, dass sie so leicht verpackt werden! Sie sind leicht zu bedienen, da keine komplizierteren Versammlungen oder Lötfähigkeiten erforderlich sind. Dieses Produkt wird in zwei Sätze verpackt, die alle fünf erforderlichen Teile enthalten-einschließlich vier kleinen Teile-und macht es ideal als einzelne Projektor, die eingerichtet sind / Montage -Toolkit / DIY -Elektronikspielzeug'

## Required resources

The resources required for building a pipeline that generates product descriptions with an LLM such as BLOOM are listed below. Note that these requirements could change if the dataset or the model changes, or if we need to fine-tune the model. Fine-tuning and optimizing the model for the specific task of generating product descriptions may also require iterative experimentation and evaluation.

### Computational Resources:

*   CPU / GPU / TPU: A powerful accelerator is required for efficient inference. GPUs are generally better suited than CPUs to deep learning workloads such as LLM inference because of their parallel processing capabilities.
    BLOOM is a large language model with 176 billion parameters, so it requires powerful GPUs (Graphics Processing Units) or TPUs (Tensor Processing Units) for efficient inference. High-end accelerators such as the NVIDIA A100 or NVIDIA V100, or TPUs from Google Cloud, can be used, but a single device is not enough: an A100 offers at most 80 GB of memory, so the model has to be sharded across several devices. The combined memory of the devices must be sufficient to hold the model and the input data for text generation tasks.
*   Memory: Sufficient memory is required to store the model's parameters and intermediate results during inference. The memory requirements depend on the size of the model and the batch size used for inference. BLOOM is a 352 GB model (176B parameters in bf16), so we need at least that much GPU RAM to make it fit.
*   Storage: Storage is required to store the pre-trained LLM model, input data, and output results.
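
The memory figure above follows from simple arithmetic: parameters times bytes per parameter. The sketch below reproduces it and estimates how many devices are needed for the weights alone. The helper names are hypothetical, the 80 GB device size is an assumption (an A100 80 GB card), and activations and other inference buffers are deliberately ignored, so real requirements are higher.

```python
import math

# Bytes needed to store one parameter in each numeric format
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str = "bf16") -> float:
    """Memory needed for the model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def min_devices(num_params: float, device_memory_gb: float, dtype: str = "bf16") -> int:
    """Lower bound on the number of devices needed just to hold the weights."""
    return math.ceil(weight_memory_gb(num_params, dtype) / device_memory_gb)


BLOOM_PARAMS = 176e9  # 176 billion parameters

print(weight_memory_gb(BLOOM_PARAMS, "bf16"))  # 352.0
print(min_devices(BLOOM_PARAMS, 80, "bf16"))   # 5 (assuming 80 GB per device)
print(min_devices(BLOOM_PARAMS, 80, "int8"))   # 3 (if the weights were stored in 8 bits)
```

These numbers are lower bounds; a production setup would leave headroom on every device for intermediate results.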


If the pipeline needs to run in real time, we should estimate how many tokens per millisecond the system can process. The number of operations can be calculated with the formula below, where B is the batch size, s the sequence length, and h the hidden dimension.
\begin{equation}
 24Bsh^2 + 4Bs^2h
\end{equation}
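
As a rough illustration, the formula can be turned into a back-of-the-envelope latency estimate. This is a sketch under stated assumptions rather than a benchmark: it assumes the formula counts the operations of one transformer layer in a forward pass, uses BLOOM's published configuration (hidden size 14336, 70 layers), and takes a hypothetical sustained throughput of 100 TFLOP/s for the whole system. The function names are made up for this example.

```python
def layer_forward_flops(batch_size: int, seq_len: int, hidden_dim: int) -> int:
    """Operations from the formula above: 24*B*s*h^2 + 4*B*s^2*h."""
    return (24 * batch_size * seq_len * hidden_dim ** 2
            + 4 * batch_size * seq_len ** 2 * hidden_dim)


def estimate_latency_ms(total_flops: float, flops_per_second: float) -> float:
    """Time in milliseconds to execute total_flops at a given sustained throughput."""
    return total_flops / flops_per_second * 1000


# Assumed BLOOM-176B configuration and a hypothetical hardware throughput
HIDDEN_DIM = 14336
NUM_LAYERS = 70
SUSTAINED_FLOPS_PER_SECOND = 100e12  # hypothetical value, not a measurement

batch_size, seq_len = 1, 128
total = NUM_LAYERS * layer_forward_flops(batch_size, seq_len, HIDDEN_DIM)
latency_ms = estimate_latency_ms(total, SUSTAINED_FLOPS_PER_SECOND)
tokens_per_ms = batch_size * seq_len / latency_ms

print(f"forward pass: {total:.3e} operations, about {latency_ms:.1f} ms")
print(f"throughput: about {tokens_per_ms:.2f} tokens per ms")
```

Measured latency will differ, since this estimate ignores memory bandwidth, communication between devices, and the token-by-token nature of generation.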



---



References:

[Optimization story: Bloom inference](https://huggingface.co/blog/bloom-inference-optimization)


[Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning](https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/)


In [None]:
import requests

from pprint import pprint

API_URL = "https://api-inference.huggingface.co/models/malteos/bloom-6b4-clp-german"
headers = {"Authorization": "Bearer ********"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "Hallo, ich bin studentin. wie geht es ",
})

pprint(output)

{'error': 'The model malteos/bloom-6b4-clp-german is too large to be loaded '
          'automatically (12GB > 10GB). For commercial use please use PRO '
          'spaces (https://huggingface.co/spaces) or Inference Endpoints '
          '(https://huggingface.co/inference-endpoints).'}
