## Data Extraction Using Mixtral 8x7B

**This project extracts product attributes from product description texts, one product category at a time.**

**Problem**:
- An Excel sheet holds a large set of HTML text that contains all the attributes of our products.
- The attributes are not listed individually anywhere.
- We need a dataset in which the properties of each product are listed separately, both to build a PIM system and to SEO-optimize our webshop texts with specific attributes.
- The text is irregular: attributes appear in a variety of formats from one product to the next.
- No comprehensive list of product properties exists, so it has to be created along the way.

**Solution**:
1. The first approach was text mining with regular expressions. It was implemented for one product group by reading and analyzing the descriptions of many products, identifying their attributes, and then writing regex patterns to extract them. <br>
It worked, but took a lot of time and effort.
2. The second approach used spaCy and NLP methods to extract adjective and prepositional groups such as "mit Griff" ("with handle") or "mit Deckel" ("with lid") and turn them into attributes. <br>
This was faster than raw text mining, but it produced many false positives (adjectives that were not product attributes) and false negatives (attributes that were neither adjectives nor prepositional groups).
3. This led to trying out LLMs. The first attempt used the transformers library, but LangChain and Ollama turned out to be faster to work with. <br>
I chose Ollama because it supports the Mixtral 8x7B model, which is compatible with structured outputs and supports German. <br>
The result is the code below, which extracts the relevant product data much better than the earlier approaches.
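
To give a sense of why the first approach in the list above was so laborious, here is a minimal sketch of a single regex rule. The pattern `MEASURE` and the helper `find_measurements` are hypothetical and are not the patterns used in the project; they only cover a number followed by a handful of units, and every further attribute family would need its own hand-written rule.

```python
import re

# Hypothetical pattern: a number (German decimal comma allowed) followed by a unit.
MEASURE = re.compile(r"(\d+(?:[.,]\d+)?)\s*(cm|mm|kg|ml|g|l)\b", re.IGNORECASE)


def find_measurements(text):
    """Return (value, unit) pairs found in a product description."""
    return [
        (value.replace(",", "."), unit.lower())
        for value, unit in MEASURE.findall(text)
    ]


sample = "Höhe 7 cm, Breite 12,5 cm, Gewicht 1,2 kg. Mit Deckel."
print(find_measurements(sample))
# [('7', 'cm'), ('12.5', 'cm'), ('1.2', 'kg')]
```

Note that the rule finds the numbers but not which dimension they belong to, and it misses "Mit Deckel" entirely, which is the kind of gap that motivated the later approaches.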

- Note: The prompt mattered a great deal. Its wording is what made the results correct and coherent enough to be processed further.

In [None]:
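## Import libraries
import json
import re
import warnings

import ollama
import pandas as pd

from values import Values  # project-local settings class (paths, product group number and name)

# Silence all warnings to keep the notebook output readable
warnings.filterwarnings('ignore')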

In [None]:
## Load the product group (Warengruppe) number and its name
val = Values()
wg_list = val.wr_gr
wg_name = val.wr_name

In [None]:
## Load the CSV file with the product information for this product group (Warengruppe), previously generated with wr_folder_building.ipynb
file = pd.read_csv(f'{val.parent_dir}{wg_list}/{wg_list}.csv', delimiter=';', encoding='utf-8')
mined_text = file.copy()

In [None]:
## Drop multi-piece sets (e.g. "5-tlg.", "3-teilig", "2er-Set") so that only single items remain; the sets are processed later
set_pattern = r'(\d+ *-*tlg\.?|\d+-teilig|\d+-*er *-*set|\d+ set|\d+ ?ply)'
mined_text['TEILIG'] = mined_text['BESCHREIBUNG'].str.extract(set_pattern, flags=re.IGNORECASE, expand=False)
mined_text = mined_text[mined_text['TEILIG'].isna()]
mined_text = mined_text[['NUMMER', 'NAME', 'BESCHREIBUNG']]
# mined_text = mined_text[:10]  # optional: keep only the first 10 rows
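
The filter in the cell above is easier to follow on a few sample strings. The sketch below uses only the standard library and a simplified variant of the set pattern; `SET_PATTERN`, `is_set`, and the sample descriptions are made up for illustration and are not taken from the product data.

```python
import re

# Simplified variant of the set pattern (an assumption, not the exact original).
SET_PATTERN = re.compile(
    r"\d+ *-*tlg\.?|\d+-teilig|\d+-*er *-*set|\d+ set|\d+ ?ply",
    re.IGNORECASE,
)


def is_set(description):
    """Return True if the description mentions a multi-piece set."""
    return SET_PATTERN.search(description) is not None


samples = [
    "Topfset 5-tlg. aus Edelstahl",
    "Messerblock, 7-teilig",
    "2er-Set Weingläser",
    "Bratpfanne mit Griff, 28 cm",
]

for text in samples:
    verdict = "set, filtered out" if is_set(text) else "single item, kept"
    print(f"{text!r}: {verdict}")
```

Only the last sample is kept. In pandas, the same test could probably be written with `Series.str.contains(pattern, flags=re.IGNORECASE, na=False)`, which avoids the helper column.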

In [None]:
## Define the prompt variables

# Keys the model is allowed to use; the comments show sample values
allowed_keys = [
    "NUMMER",  # product ID
    "NAME",  # product name
    "MATERIAL",  # e.g. Kunststoff
    "ART",  # e.g. Weinglas
    "DESIGN",  # e.g. Stilvolles Design
    "BREITE",  # e.g. 12 cm
    "TIEFE",  # e.g. 10 cm
    "HOEHE",  # e.g. 7 cm
    "ANDERE-ABMESSUNGEN",  # e.g. 8 cm
    "GEWICHT",  # e.g. 50 kg
    "HERGESTELLT",  # e.g. Deutschland
    "BRAND",  # e.g. Hagen Grote
    "QUALITAET",  # e.g. hochwertiger Edelstahl
    "ANWENDUNG",  # e.g. Eisbereiter
    "ANDERE-EIGENSCHAFTEN",
]

# Example response shown to the model inside the prompt
example = {
    "NUMMER": "111HP06",
    "NAME": "Eisbereiter wandelt Brotbackautomaten in Eis- und Sorbetmaschine",
    "MATERIAL": "Kunststoff",
    "ART": "Weinglas",
    "DESIGN": "Stilvolles Design",
    "BREITE": "12 cm",
    "TIEFE": "10 cm",
    "HOEHE": "7 cm",
    "ANDERE-ABMESSUNGEN": "40 X 60 CM Tabletten",
    "GEWICHT": "50 kg",
    "HERGESTELLT": "Deutschland",
    "BRAND": "Hagen Grote",
    "QUALITAET": "hochwertiger Edelstahl",
    "ANWENDUNG": "Eisbereiter",
    "ANDERE-EIGENSCHAFTEN": [],
}

In [None]:
## Check that every allowed key also appears in the example (inconsistencies lead to poor results)
for item in allowed_keys:
    if item not in example:
        print(f"{item} not in example.")
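
The check above only looks in one direction: it reports allowed keys that are missing from the example, but not example keys that are not allowed. A two-way comparison with sets might look like the sketch below. `key_mismatches` and the shortened `demo_keys` and `demo_example` are hypothetical names, chosen so they do not overwrite the notebook's own variables.

```python
def key_mismatches(allowed, example_dict):
    """Return (missing_in_example, unexpected_in_example) as sorted lists."""
    allowed_set = set(allowed)
    example_set = set(example_dict)
    return sorted(allowed_set - example_set), sorted(example_set - allowed_set)


# Shortened, made-up inputs for illustration only
demo_keys = ["NUMMER", "NAME", "MATERIAL"]
demo_example = {"NUMMER": "111HP06", "NAME": "Eisbereiter", "BRAND": "Hagen Grote"}

missing, unexpected = key_mismatches(demo_keys, demo_example)
print("missing in example:", missing)        # ['MATERIAL']
print("unexpected in example:", unexpected)  # ['BRAND']
```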

In [None]:
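## Function to clean responses from the LLM
def clean_response(response_text):
    """Parse the model output as JSON and strip whitespace from string values.

    Returns the cleaned dict, or None if the text is not a valid JSON object.
    """
    # Repair a double-escaped "ü" (two backslashes followed by u00fc in the raw text)
    response_text = response_text.replace(r"\\u00fc", "ü")
    try:
        response_dict = json.loads(response_text)
    except json.JSONDecodeError:
        print("Error: The response is not a valid JSON object.")
        print(response_text)
        return None

    # Valid JSON that is not an object (e.g. a list) cannot be cleaned key by key
    if not isinstance(response_dict, dict):
        print("Error: The response is valid JSON but not a JSON object.")
        print(response_text)
        return None

    return {key: value.strip() if isinstance(value, str) else value for key, value in response_dict.items()}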
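
`clean_response` expects the whole response to be JSON. A model can still wrap the object in a sentence of explanation despite the instructions, and in that case parsing fails. As a possible fallback, the sketch below scans for the first decodable JSON object using the standard library's `json.JSONDecoder.raw_decode`. The helper `extract_json_object` is hypothetical and not part of the original notebook.

```python
import json


def extract_json_object(text):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    return None


noisy = 'Hier ist das JSON-Objekt: {"MATERIAL": "Kunststoff", "HOEHE": "7 cm"} Ich hoffe, das hilft.'
print(extract_json_object(noisy))
# {'MATERIAL': 'Kunststoff', 'HOEHE': '7 cm'}
```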

In [None]:
### Build the prompt from the variables above plus exact instructions, and generate a response for each product
cleaned_data = []
uncleaned_data = []

for id, name, beschreibung in zip(mined_text['NUMMER'], mined_text['NAME'], mined_text['BESCHREIBUNG']):
    print(id, name)

    prompt = f"""[INST]1. Bitte extrahieren Sie die Produktattribute aus der folgenden Beschreibung und geben Sie diese als ein gültiges JSON-Objekt in deutscher Sprache aus.
                    2. Verwenden Sie NUR die Schlüssel {allowed_keys} und ändern Sie diese nicht.
                    3. Verwenden Sie ALLE erlaubten Schlüssel, um das JSON-Objekt zu formatieren, und wenn es keine Werte für die spezifischen Schlüssel im JSON-Objekt gibt, lassen Sie deren Werte leer.
                    4. Beschränken Sie die Werte auf maximal 50 Zeichen.
                    5. Füllen Sie nur die Informationen aus, die ausdrücklich in der Beschreibung erwähnt werden.
                    6. Die JSON-Datei sollte keine ASCII-Kodierung haben. Die Sprache ist Deutsch und die entsprechende Kodierung sollte verwendet werden.
                    7. Überprüfen Sie, ob die Antwort ein JSON-Objekt ist, und wenn nicht, ändern Sie sie so, dass sie ein JSON-Objekt wird.
                    8. Überprüfen Sie doppelt, ob die Schlüssel alle genau gleich sind.
                    9. Geben Sie nur und ausschließlich das JSON-Objekt aus und sonst nichts. Kein Text und keine Erklärung, nur das JSON-Objekt.
                    10. Kodieren und dekodieren Sie deutsche Zeichen richtig [äëöüß], sowohl in Groß- als auch in Kleinschreibung. Verwenden Sie stattdessen KEINE ASCII-Schlüssel.


                    Beispiel-RESPONSE: {json.dumps(example,ensure_ascii=False)}
                    [/INST]
                    
                    NUMMER: {id}
                    NAME: {name}
                    Beschreibung: {beschreibung}
        """

    # Generate the response with Mixtral 8x7B
    response = ollama.generate(model='mixtral:latest', prompt=prompt)

    # Store the raw response text; it is parsed and cleaned in the next cell
    cleaned_data.append(response['response'])
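
The loop above relies on the prompt alone to get JSON back. As far as I know, Ollama also offers a JSON mode through the `format="json"` argument of `ollama.generate`, and sampling can be made more repeatable with `options={"temperature": 0}`. Whether both are available depends on the installed client version, so treat this as an assumption to verify. The sketch below passes the generate function in as a parameter so that it runs without a server. `generate_json` and `fake_generate` are hypothetical names.

```python
import json


def generate_json(prompt, generate, model="mixtral:latest"):
    """Call an Ollama-style generate function in JSON mode and return the raw text."""
    response = generate(
        model=model,
        prompt=prompt,
        format="json",                # assumed: Ollama's JSON mode
        options={"temperature": 0},   # assumed: reduces run-to-run variation
    )
    return response["response"]


def fake_generate(model, prompt, format, options):
    """Stand-in for ollama.generate so the sketch runs without a server."""
    return {"response": json.dumps({"MATERIAL": "Kunststoff"}, ensure_ascii=False)}


raw = generate_json("Beschreibung: Dose aus Kunststoff", fake_generate)
print(json.loads(raw))
# {'MATERIAL': 'Kunststoff'}
```

In the notebook, the call would be `generate_json(prompt, ollama.generate)` instead of the stand-in.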



In [None]:
## Parse and clean each raw response; responses that are not valid JSON become None
new_data = [clean_response(item) for item in cleaned_data]
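
Because `clean_response` returns `None` for invalid JSON, and the model may add or omit keys, it can help to normalise the parsed records before saving them. The following is a minimal sketch under those assumptions. `normalise_record`, `demo_allowed`, and `demo_parsed` are hypothetical and use made-up data.

```python
def normalise_record(record, allowed):
    """Keep only allowed keys and fill absent ones with an empty string."""
    if not isinstance(record, dict):
        return None  # failed parse, keep it visible as None
    return {key: record.get(key, "") for key in allowed}


demo_allowed = ["NUMMER", "NAME", "MATERIAL"]
demo_parsed = [
    {"NUMMER": "111HP06", "NAME": "Eisbereiter", "FARBE": "rot"},
    None,  # what clean_response returns for invalid JSON
]

normalised = [normalise_record(record, demo_allowed) for record in demo_parsed]
failed = sum(record is None for record in normalised)

print(normalised[0])
# {'NUMMER': '111HP06', 'NAME': 'Eisbereiter', 'MATERIAL': ''}
print(f"{failed} of {len(demo_parsed)} responses could not be parsed")
# 1 of 2 responses could not be parsed
```

Counting the failures makes it easy to decide which products to send through the model again.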
    

In [None]:
## Save the cleaned JSON objects to a JSON file
with open(f'{val.parent_dir}{val.wr_gr}/{val.wr_name}_{val.wr_gr}.json', 'w', encoding="utf-8") as f:
    json.dump(new_data, f, indent=4, ensure_ascii=False)
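
The `ensure_ascii=False` argument is what keeps German umlauts readable in the saved file. A short comparison with a made-up record shows the difference:

```python
import json

record = {"ART": "Schüssel"}  # made-up record for illustration

escaped = json.dumps(record)                        # default: non-ASCII characters are escaped
readable = json.dumps(record, ensure_ascii=False)   # umlauts are written as-is

print(escaped)   # {"ART": "Sch\u00fcssel"}
print(readable)  # {"ART": "Schüssel"}

# Both forms decode to the same data; only the file's readability differs.
print(json.loads(escaped) == json.loads(readable))  # True
```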