# Data Extraction using Ollama and Mixtral 8x7b

Hardware: Apple M3 Max - 64GB RAM

1. I installed the models mixtral:8x7b and mixtral:8x22b from the terminal and then tried them in Python using the code below.
2. Any variation in the prompt leads to different results. To extract information from text, the prompt must state explicitly that the model should look for answers only in the text. Even so, the model still made some inferences of its own without seeing an explicit keyword for a property. For example, dishwasher_safe is a property that should only be set when the product description mentions it explicitly, yet for some materials, such as glass, the model inferred it anyway. That is not the desired behavior, and the prompt needs further refinement.
3. An example in the prompt also helps refine the model's output. With explicit keys, the model looks for the specific properties needed for that product group; these may change from one product category to the next. Product descriptions are not standardized, however, and do not always contain all the needed information. Some values therefore have to be null or None, and the model should be told to set a key's value to None when it cannot find the answer in the text. It is also important to ask the model to use only the keys given in the example so that the results are uniform. The model may still occasionally diverge from the instructions, but the majority of responses will be uniform.
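
The unwanted inference described in point 2 can also be caught after extraction, by checking that a positive boolean value is backed by an explicit keyword in the product text. The following is only a minimal sketch and not part of the original pipeline: the `KEYWORDS` mapping and the `ground_booleans` helper are hypothetical, and the keyword patterns are illustrative rather than complete.

```python
import re

# Hypothetical mapping from boolean keys to regex patterns that must occur in
# the product text before a True value is accepted (illustrative, not complete).
KEYWORDS = {
    "SPUELMACHINEFEST": [r"sp(ü|ue)lmaschine"],
    "OFENFEST": [r"ofenfest", r"backofengeeignet"],
    "RUTSCHFEST": [r"rutschfest"],
}

def ground_booleans(attributes, text):
    """Reset True values to None when no supporting keyword occurs in the text."""
    checked = dict(attributes)  # work on a copy, leave the input untouched
    for key, patterns in KEYWORDS.items():
        if checked.get(key) is True:
            found = any(re.search(p, text, flags=re.IGNORECASE) for p in patterns)
            if not found:
                checked[key] = None
    return checked

# Made-up model output: dishwasher safety was inferred from the material only.
extracted = {"MATERIAL": ["Glas"], "SPUELMACHINEFEST": True, "OFENFEST": True}
text = "Auflaufform aus Glas, ofenfest bis 250 °C"
result = ground_booleans(extracted, text)
print(result)
```

Values that fail the check are set to None rather than False, because a missing keyword only means the description says nothing about the property.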

In [None]:
# !pip install pandas ollama

In [1]:
import pandas as pd
from values import *
import ollama
import warnings
warnings.filterwarnings('ignore')
# pd.set_option('display.max_colwidth', 50)

In [2]:
val = Values()

### DATA CLEANING


In [39]:
artikel_df = pd.read_excel(val.shop_file_path)
marketing_artikel = pd.read_csv(val.marketing_artikel,encoding='latin-1', delimiter=';', on_bad_lines='skip',parse_dates=val.dates,dayfirst=True)

In [42]:
marketing_artikel = marketing_artikel[marketing_artikel['WM'].isna()]

In [43]:
### Cleaning and preparing Warengroup data
# .copy() makes the selection an independent DataFrame, so the assignments below do not act on a view of marketing_artikel
warengrps = marketing_artikel[['NUMMER','WARENGR']].copy()
warengrps['WARENGR'] = pd.to_numeric(warengrps['WARENGR'],errors='coerce')
warengrps = warengrps.dropna(subset=['WARENGR'])
warengrps['WARENGR'] = warengrps['WARENGR'].astype(int)
warengrps = warengrps.drop_duplicates(subset='NUMMER')

### Selecting the columns we need from our artikels
artikels = artikel_df[['StoreId','Name','Beschreibung']].copy()

### Extracting the Artikelnumber (the first whitespace-separated token of StoreId) into its own column so this dataset can be joined with the Warengroups data
artikels['Number'] = artikels['StoreId'].str.split().str[0]

### Left-joining the two datasets to find the Warengroup of each artikel
artikels_mit_wrgp = pd.merge(artikels,warengrps,how='left',left_on='Number',right_on='NUMMER')

### Removing the rows for which the left join found no match in the right dataset
artikels_mit_wrgp = artikels_mit_wrgp[artikels_mit_wrgp['NUMMER'].notna()].copy()
### Converting WARENGR column values from float to int
artikels_mit_wrgp['WARENGR'] = artikels_mit_wrgp['WARENGR'].astype(int)

In [44]:
### Selecting certain Warengroups and columns for further analysis
kuchen_gerate = artikels_mit_wrgp[artikels_mit_wrgp['WARENGR'].isin([1203])]

### Selecting the needed columns and making the column names uniform
kuchen_gerate = kuchen_gerate[['WARENGR','NUMMER','Name','Beschreibung']]
kuchen_gerate = kuchen_gerate.rename(columns={'Name':'NAME','Beschreibung':'BESCHREIBUNG'})

In [45]:
### Removing rows without a description (a single dropna call is enough, no loop needed)
kuchen_gerate = kuchen_gerate.dropna(subset=['BESCHREIBUNG'])

kuchen_gerate['ORIGINAL_BESCHREIBUNG'] = kuchen_gerate['BESCHREIBUNG'].copy()

In [46]:
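### Removing unwanted characters, HTML tags and entities, then dropping duplicates
# Assumption: the first pattern was originally a non-breaking space (U+00A0) that was lost when the notebook was exported
kuchen_gerate['BESCHREIBUNG'] = kuchen_gerate['BESCHREIBUNG'].str.replace('\u00a0',' ',regex=False)
# [^>]* keeps each match inside a single tag; a greedy <span.*> would also delete all text up to the last '>' on the line
kuchen_gerate['BESCHREIBUNG'] = kuchen_gerate['BESCHREIBUNG'].str.replace(r'</?(?:p|ul|li|br|b|span|font|strong)\b[^>]*>',' ',regex=True,case=False)
kuchen_gerate['BESCHREIBUNG'] = kuchen_gerate['BESCHREIBUNG'].str.replace(r'&nbsp_|&nbsp;',' ',regex=True)
# the optional ';' removes the complete entity instead of leaving a stray semicolon behind
kuchen_gerate['BESCHREIBUNG'] = kuchen_gerate['BESCHREIBUNG'].str.replace(r'&Oslash;?',' ',regex=True)
kuchen_gerate = kuchen_gerate.drop_duplicates()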


<class 'pandas.core.frame.DataFrame'>
Index: 216 entries, 135 to 11217
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   WARENGR                216 non-null    int64 
 1   NUMMER                 216 non-null    object
 2   NAME                   216 non-null    object
 3   BESCHREIBUNG           216 non-null    object
 4   ORIGINAL_BESCHREIBUNG  216 non-null    object
dtypes: int64(1), object(4)
memory usage: 10.1+ KB
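
To see what the cleaning step above does to a single description, the same idea can be tried on a plain string. This is a minimal sketch using only the standard library. `clean_description` is a hypothetical helper, and unlike the notebook cell it decodes HTML entities with `html.unescape` instead of replacing them with spaces, so a diameter sign such as `&Oslash;` survives as `Ø`.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # any HTML tag, up to its own closing '>'
SPACE_RE = re.compile(r"\s+")    # runs of whitespace, including non-breaking spaces

def clean_description(raw):
    """Strip HTML tags, decode entities and collapse whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)  # e.g. &nbsp; becomes U+00A0, &Oslash; becomes Ø
    return SPACE_RE.sub(" ", text).strip()

# Made-up sample in the style of the shop descriptions.
sample = '<p><b>Bratpfanne</b>&nbsp;<span style="color:red">&Oslash; 28 cm</span></p>'
cleaned = clean_description(sample)
print(cleaned)
```

Whether entities should be decoded or dropped depends on whether the extracted attributes are expected to contain symbols such as `Ø`.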


In [48]:
# Choosing certain columns and combining the name and Beschreibung as they both have product properties
mined_text = kuchen_gerate[['NUMMER','NAME','BESCHREIBUNG','ORIGINAL_BESCHREIBUNG','WARENGR']].copy()
mined_text['BESCHREIBUNG'] = mined_text['NAME'] + '\n' + mined_text['BESCHREIBUNG']
mined_text['BESCHREIBUNG'] = mined_text['BESCHREIBUNG'].str.strip()
mined_text.reset_index(inplace=True)


In [26]:
## JSON format for the response of the model
## (unit alternatives are grouped so the number applies to every unit, e.g. '\d+ (cm|mm)')
example = {
    'NUMMER': str,
    'NAME': str,
    'ART': str,
    'MATERIAL': list,
    'MASSEN': r'\d+ (cm|mm)',
    'VOLUME': r'\d+ (l|ml)',
    'GEWICHT': r'\d+ (g|kg)',
    'VOLTAGE': r'\d+V',
    'WATT': r'\d+ W',
    'FARBE': str,
    'SET': r'\d+ teilig',
    'OFENFEST': bool,
    'SPUELMACHINEFEST': bool,
    'BRAND': str,
    'BESCHICHTUNG': str,
    'TEMPERATUR': str,
    'LED_ANZEIGE': bool,
    'KABEL': str,
    'RUTSCHFEST': bool,
    'ANWENDUNG': str,
    'ANDERE_EIGENSCHAFTEN': list,
}
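
When a dictionary like this is interpolated into an f-string, the Python types are rendered as `<class 'str'>`, which the model then has to interpret. One possible alternative, shown here only as a sketch, is to render the schema as JSON with readable type names. The `describe_schema` helper, the `TYPE_NAMES` mapping and the shortened `example_schema` are hypothetical and were not part of the original experiment.

```python
import json

# Shortened, hypothetical version of the example dictionary.
example_schema = {
    "NAME": str,
    "MATERIAL": list,
    "WATT": r"\d+ W",
    "OFENFEST": bool,
}

# Hypothetical readable names for the Python types used in the schema.
TYPE_NAMES = {str: "string", list: "list", bool: "true/false"}

def describe_schema(schema):
    """Return the schema as a JSON string with human-readable type hints."""
    readable = {}
    for key, value in schema.items():
        if isinstance(value, type):
            # fall back to the type's own name if it has no entry in TYPE_NAMES
            readable[key] = TYPE_NAMES.get(value, value.__name__)
        else:
            readable[key] = value  # regex hints are passed through unchanged
    return json.dumps(readable, ensure_ascii=False, indent=2)

schema_text = describe_schema(example_schema)
print(schema_text)
```

Whether this rendering actually improves the extraction would have to be tested; it has not been evaluated here.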

In [None]:
### engineered prompt for data extraction
# The prompt is built inside a function so that every product gets its own ID, name and description;
# a single f-string defined up front is filled in only once, before the loop variables exist.
# The prompt stays in German because the product texts and the requested output are German.
def build_prompt(product_id, name, description):
    return f"""[INST]Bitte extrahieren Sie die Produktattribute aus dem folgenden Text und geben Sie sie im JSON-Format auf Deutsch aus. 
    Die Schlüssel sollten kurz sein (maximal 1 bis 2 Wörter) und die Werte sollten in maximal 50 Zeichen beschrieben werden. 
    Verwenden Sie nur die im Beispiel gegebenen Schlüssel. 
    Wenn Sie eine Information nicht finden können, geben Sie einfach an, dass Sie es nicht wissen, und versuchen Sie nicht, eine Antwort zu erfinden.

    Beispiel-JSON-Format: {example}
    [/INST]

    ProduktID: {product_id}
    Produktname: {name}
    Beschreibung: {description}
    """

In [None]:
# a list that collects the model responses
attribute_list = []

## generating a response for each product using mixtral 8x7b (can be changed to mixtral 8x22b, but that takes much longer) and writing it to a file
with open('Exports/data_1.json', 'w', encoding='utf-8') as f:
    for product_id, name, description in zip(mined_text['NUMMER'],mined_text['NAME'],mined_text['BESCHREIBUNG']):
        response = ollama.generate(model='mixtral:latest',prompt=build_prompt(product_id, name, description))
        attribute_list.append(response)
        # the trailing newline separates consecutive responses in the file
        f.write(response['response'] + '\n')
        print(response['response'])
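
Because the model may occasionally add text around the JSON or deviate from the requested keys, the raw responses usually need a normalization pass before they can be loaded into a table. The following is a minimal sketch, assuming each response contains at most one JSON object. `parse_response` and `ALLOWED_KEYS` are hypothetical names, and the key list is shortened for the illustration.

```python
import json

# Shortened, hypothetical subset of the keys from the example dictionary.
ALLOWED_KEYS = ("NUMMER", "NAME", "MATERIAL", "FARBE")

def parse_response(raw, allowed_keys=ALLOWED_KEYS):
    """Parse the outermost {...} span of raw and align it to allowed_keys.

    Keys the model invented are dropped, and keys that are missing or that
    could not be parsed are set to None.
    """
    start, end = raw.find("{"), raw.rfind("}")
    data = {}
    if start != -1 and end > start:
        try:
            data = json.loads(raw[start:end + 1])
        except json.JSONDecodeError:
            data = {}  # unparsable answer: treat every key as missing
    return {key: data.get(key) for key in allowed_keys}

# Made-up response with surrounding text and one key that was not requested.
raw = 'Hier ist das Ergebnis:\n{"NUMMER": "1001", "NAME": "Topf", "GROESSE": "24 cm", "FARBE": null}'
parsed = parse_response(raw)
print(parsed)
```

Applied to `response['response']` for each product, this yields dictionaries with identical keys, which can then be collected into a list and passed to `pd.DataFrame`.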