# Prepare Dataset

## About
This notebook contains the code to:
1. Download the ABO dataset
2. Clean the dataset to extract the title and product type
3. Export the dataset in a Hugging Face compatible format

## Dataset

This notebook uses the [Amazon Berkeley Objects (ABO) Dataset](https://amazon-berkeley-objects.s3.amazonaws.com/index.html).

The dataset was created in a partnership between Amazon and UC Berkeley.

It covers 147,702 products and contains product metadata, images and 3D models.

In [None]:
%%bash 
cd ../artifacts/dataset_raw/amazon/
wget https://amazon-berkeley-objects.s3.amazonaws.com/archives/abo-listings.tar
tar -xvf abo-listings.tar

In [None]:
#!wget http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/All_Amazon_Meta.json.gz
#!wget https://amazon-berkeley-objects.s3.amazonaws.com/archives/abo-listings.tar

## Imports

In [1]:
import pathlib
import sklearn
import datasets
import pandas as pd
import sklearn.preprocessing
import sklearn.model_selection
import glob
import functools

In [2]:
!pwd

/home/jupyter/tutorials/personal/pydata_bert/notebooks


## Process Dataset

In [3]:
!ls ../artifacts/dataset_raw/amazon/listings/metadata

listings_0.json.gz  listings_4.json.gz	listings_8.json.gz  listings_c.json.gz
listings_1.json.gz  listings_5.json.gz	listings_9.json.gz  listings_d.json.gz
listings_2.json.gz  listings_6.json.gz	listings_a.json.gz  listings_e.json.gz
listings_3.json.gz  listings_7.json.gz	listings_b.json.gz  listings_f.json.gz


In [4]:
dataset_path_raw = "../artifacts/dataset_raw/amazon/listings/metadata"

In [5]:
glob.glob(f'{dataset_path_raw}/*.json.gz')

['../artifacts/dataset_raw/amazon/listings/metadata/listings_2.json.gz',
 '../artifacts/dataset_raw/amazon/listings/metadata/listings_9.json.gz',
 '../artifacts/dataset_raw/amazon/listings/metadata/listings_0.json.gz',
 '../artifacts/dataset_raw/amazon/listings/metadata/listings_1.json.gz',
 '../artifacts/dataset_raw/amazon/listings/metadata/listings_a.json.gz',
 '../artifacts/dataset_raw/amazon/listings/metadata/listings_7.json.gz',
 '../artifacts/dataset_raw/amazon/listings/metadata/listings_5.json.gz',
 '../artifacts/dataset_raw/amazon/listings/metadata/listings_6.json.gz',
 '../artifacts/dataset_raw/amazon/listings/metadata/listings_f.json.gz',
 '../artifacts/dataset_raw/amazon/listings/metadata/listings_3.json.gz',
 '../artifacts/dataset_raw/amazon/listings/metadata/listings_b.json.gz',
 '../artifacts/dataset_raw/amazon/listings/metadata/listings_c.json.gz',
 '../artifacts/dataset_raw/amazon/listings/metadata/listings_4.json.gz',
 '../artifacts/dataset_raw/amazon/listings/metadata

Load all 16 files into a single DataFrame

In [6]:
df_raw = pd.concat(map(functools.partial(pd.read_json, lines=True ), 
                    glob.glob(f'{dataset_path_raw}/*.json.gz') )) 
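Passing `lines=True` means each `listings_*.json.gz` file is read as gzip-compressed JSON Lines, i.e. one JSON object per line. The sketch below illustrates that layout using only the standard library. The two records and the file name `listings_demo.json.gz` are made up for illustration and are not taken from ABO.

```python
import gzip
import json
import pathlib
import tempfile

# Hypothetical records, shaped loosely like the ABO listings.
records = [
    {"item_id": "A1", "product_type": [{"value": "CHAIR"}]},
    {"item_id": "B2", "product_type": [{"value": "SHOES"}]},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = pathlib.Path(tmp_dir) / "listings_demo.json.gz"

    # Write one JSON object per line, gzip-compressed.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Read it back line by line; pd.read_json(path, lines=True) parses the same layout.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), loaded[0]["item_id"])
```

Since `glob.glob` returns files in arbitrary order, wrapping the call in `sorted(...)` is one way to make the load order reproducible.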

In [7]:
df_raw.head()

Unnamed: 0,brand,bullet_point,color,fabric_type,item_id,item_name,model_name,model_number,product_type,style,...,item_keywords,material,spin_id,3dmodel_id,color_code,model_year,pattern,product_description,finish_type,item_shape
0,"[{'language_tag': 'de_DE', 'value': 'Amazon Es...","[{'language_tag': 'de_DE', 'value': 'Fällt gro...","[{'language_tag': 'de_DE', 'value': 'Mehrfarbi...","[{'language_tag': 'en_GB', 'value': '100% Cott...",B07HL25ZQM,"[{'language_tag': 'en_GB', 'value': 'Amazon Es...","[{'language_tag': 'en_GB', 'value': '6-Pack Bi...",[{'value': 'P_AE3131_M6'}],[{'value': 'BABY_PRODUCT'}],"[{'language_tag': 'de_DE', 'value': '6-Pack Bi...",...,,,,,,,,,,
1,"[{'language_tag': 'en_GB', 'value': 'AmazonBas...","[{'language_tag': 'en_GB', 'value': 'Large dry...",,,B0825D4F6R,"[{'language_tag': 'en_GB', 'value': 'AmazonBas...",,[{'value': 'AMAZ2001'}],[{'value': 'HOME'}],"[{'language_tag': 'en_GB', 'value': 'Deluxe'}]",...,"[{'language_tag': 'en_GB', 'value': 'tower lau...",,,,,,,,,
2,"[{'language_tag': 'en_IN', 'value': 'Amazon Br...","[{'language_tag': 'en_IN', 'value': '3D Printe...","[{'language_tag': 'en_IN', 'standardized_value...",,B07TF1FCFD,"[{'language_tag': 'en_IN', 'value': 'Amazon Br...","[{'language_tag': 'en_IN', 'value': 'Samsung G...",[{'value': 'gz8587-SL40668'}],[{'value': 'CELLULAR_PHONE_CASE'}],,...,"[{'language_tag': 'en_IN', 'value': 'mobile co...",,,,,,,,,
3,"[{'language_tag': 'en_IN', 'value': 'Amazon Br...","[{'language_tag': 'en_IN', 'value': 'Snug fit ...","[{'language_tag': 'en_IN', 'standardized_value...",,B08569SRJD,"[{'language_tag': 'en_IN', 'value': 'Amazon Br...","[{'language_tag': 'en_IN', 'value': 'Nokia 7.2'}]",[{'value': 'UV10845-SL40357'}],[{'value': 'CELLULAR_PHONE_CASE'}],,...,"[{'language_tag': 'en_IN', 'value': 'Back Cove...","[{'language_tag': 'en_IN', 'value': 'Silicon'}]",,,,,,,,
4,"[{'language_tag': 'en_US', 'value': 'Stone & B...","[{'language_tag': 'en_US', 'value': 'With mode...","[{'language_tag': 'en_US', 'value': 'Dark Grey'}]",,B07B4G5RBN,"[{'language_tag': 'zh_CN', 'value': 'Stone & B...",,[{'value': 'UPH10095B'}],[{'value': 'CHAIR'}],,...,"[{'language_tag': 'en_US', 'value': 'living-ro...","[{'language_tag': 'zh_CN', 'value': '灰石色'}, {'...",485925ed,B07B4G5RBN,[#918F8C],,,,,


In [8]:
len(df_raw)

147702

Sample record:

In [9]:
df_raw.iloc[0].to_dict()

{'brand': [{'language_tag': 'de_DE', 'value': 'Amazon Essentials'}],
 'bullet_point': [{'language_tag': 'de_DE',
   'value': 'Fällt gross aus; eventuell eine Größe kleiner bestellen'}],
 'color': [{'language_tag': 'de_DE', 'value': 'Mehrfarbig(Girl Fruit)'}],
 'fabric_type': [{'language_tag': 'en_GB', 'value': '100% Cotton'},
  {'language_tag': 'de_DE', 'value': '100 % Baumwolle'}],
 'item_id': 'B07HL25ZQM',
 'item_name': [{'language_tag': 'en_GB',
   'value': 'Amazon Essentials Bib Set of 6'},
  {'language_tag': 'de_DE',
   'value': 'Amazon Essentials 6-Pack Bib Set, Mehrfarbig(Girl Fruit), Einheitsgröße'}],
 'model_name': [{'language_tag': 'en_GB', 'value': '6-Pack Bib Set'},
  {'language_tag': 'de_DE', 'value': '6-Pack Bib Set'}],
 'model_number': [{'value': 'P_AE3131_M6'}],
 'product_type': [{'value': 'BABY_PRODUCT'}],
 'style': [{'language_tag': 'de_DE', 'value': '6-Pack Bib Set'}],
 'main_image_id': '718mYsQTQbL',
 'country': 'DE',
 'marketplace': 'Amazon',
 'domain_name': 'amazo

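For this project, we mainly need `item_name` (the title) and `product_type` (the label); a few other fields such as `brand` are kept for reference.    
Most fields are lists of values, so we assume the first entry is representative and take it.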

In [10]:
def parse_property(property_record, property_name: str):
    """Return the first value of a list-valued field, or None if it is missing."""
    try:
        r = property_record[property_name][0]
        # `node` entries store their text under 'node_name' rather than 'value'
        if property_name == "node":
            return r['node_name']
        return r['value']
    except Exception:
        # missing fields are NaN (not subscriptable), empty, or lack the expected key
        return None


def cleanup_record(raw_record: pd.Series) -> pd.Series:
    """Flatten one raw listing row into the columns used downstream."""
    record = {
        'brand': parse_property(raw_record, 'brand'),
        'item_id': raw_record['item_id'],
        'item_name': parse_property(raw_record, 'item_name'),
        'product_type': parse_property(raw_record, 'product_type'),
        'node': parse_property(raw_record, 'node'),
        'main_image_id': raw_record['main_image_id'],
        'product_description': raw_record['product_description'],
    }

    return pd.Series(record)
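Taking the first entry means the language of the extracted title depends on the order of entries in the listing (the sample record above carries both an `en_GB` and a `de_DE` name). If a single language were wanted, one option would be to prefer entries whose `language_tag` starts with a given prefix and fall back to the first entry otherwise. This is only a sketch of an alternative, not what the notebook does; `pick_value` and `preferred_prefix` are hypothetical names, and the entries are those of the sample record, reordered for illustration.

```python
from typing import Optional


def pick_value(entries, preferred_prefix: str = "en") -> Optional[str]:
    """Return the value of the first entry in the preferred language, else the first entry's value."""
    # Missing fields arrive as NaN or None, so only real, non-empty lists are accepted.
    if not isinstance(entries, list) or not entries:
        return None
    for entry in entries:
        if entry.get("language_tag", "").startswith(preferred_prefix):
            return entry.get("value")
    # No entry in the preferred language: fall back to the first entry.
    return entries[0].get("value")


# Entries from the sample record, with the German name placed first.
item_name = [
    {"language_tag": "de_DE",
     "value": "Amazon Essentials 6-Pack Bib Set, Mehrfarbig(Girl Fruit), Einheitsgröße"},
    {"language_tag": "en_GB", "value": "Amazon Essentials Bib Set of 6"},
]

print(pick_value(item_name))
print(pick_value([{"value": "BABY_PRODUCT"}]))  # no language_tag: falls back to the first entry
```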

In [11]:
df = df_raw.apply(cleanup_record,axis=1)

In [12]:
df.head()

Unnamed: 0,brand,item_id,item_name,product_type,node,main_image_id,product_description
0,Amazon Essentials,B07HL25ZQM,Amazon Essentials Bib Set of 6,BABY_PRODUCT,/Kategorien/Ernährung & Stillen/Lätzchen,718mYsQTQbL,
1,AmazonBasics,B0825D4F6R,AmazonBasics 3-Tier Deluxe Tower Laundry Dryin...,HOME,/Home & Garden/Home & Kitchen/Categories/Stora...,81lg2wto16L,
2,Amazon Brand - Solimo,B07TF1FCFD,Amazon Brand - Solimo Designer Number Eight 3D...,CELLULAR_PHONE_CASE,/Categories/Mobiles & Accessories/Mobile Acces...,71R4R6x-tjL,
3,Amazon Brand - Solimo,B08569SRJD,Amazon Brand - Solimo Designer Dark Night View...,CELLULAR_PHONE_CASE,/Categories/Mobiles & Accessories/Mobile Acces...,71QSAxIJagL,
4,Stone & Beam,B07B4G5RBN,"Stone & Beam Varon 过渡日床, 灰石色",CHAIR,/Categories/Furniture/Living Room Furniture/Ch...,91UiRD6UcHL,


In [13]:
df.columns

Index(['brand', 'item_id', 'item_name', 'product_type', 'node',
       'main_image_id', 'product_description'],
      dtype='object')

In [14]:
df['product_type'].value_counts()

CELLULAR_PHONE_CASE    64853
SHOES                  12965
GROCERY                 6546
HOME                    5264
HOME_BED_AND_BATH       3082
                       ...  
SOUS_VIDE_MACHINE          1
SKIN_TREATMENT_MASK        1
SCULPTURE                  1
THICKENING_AGENT           1
TERMINAL_BLOCK             1
Name: product_type, Length: 576, dtype: int64

Some product types occur only rarely.       
We limit the training data to product types with more than 500 examples.

In [19]:
min_product_count = 500

Compute the most frequent product types

In [20]:
top_products =  df['product_type'].value_counts().loc[lambda x: x>min_product_count].index.tolist()

In [21]:
len(df['product_type'].value_counts() ) , len (top_products)

(576, 31)
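The filter keeps product types with strictly more than `min_product_count` rows. A toy sketch with a made-up Series and a small, hypothetical threshold shows the behaviour, including that a type with exactly the threshold count is dropped:

```python
import pandas as pd

# Made-up product types: 3 x SHOES, 2 x CHAIR, 1 x LAMP.
product_type = pd.Series(["SHOES", "SHOES", "SHOES", "CHAIR", "CHAIR", "LAMP"])
min_count = 2  # hypothetical threshold for this toy example

counts = product_type.value_counts()
# Strict inequality: CHAIR has exactly 2 rows and is therefore excluded.
top_types = counts.loc[lambda x: x > min_count].index.tolist()
kept = product_type[product_type.isin(top_types)]

print(top_types, len(kept))
```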

In [22]:
df_all = df [ df['product_type'].isin(top_products) ].copy()


In [23]:
len(df_all)

121239

`text` and `label` are the columns expected by the Hugging Face Transformers package.

The item title is the text.
The product type is the label we are predicting.

In [24]:
df_all['label_name'] = df_all['product_type']
df_all['text'] = df_all['item_name']

Encode the product type as a numeric label

In [25]:
label_encoder = sklearn.preprocessing.LabelEncoder()

In [26]:
label_encoder.fit(df_all['label_name'])

LabelEncoder()

In [27]:
df_all['label'] = label_encoder.transform(df_all['label_name'])
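As a quick illustration of what the encoder does, the sketch below fits a `LabelEncoder` on a few made-up label names. The classes are stored in sorted order, each id is the position of the name in `classes_`, and `inverse_transform` maps ids back to names.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up label names for illustration.
names = ["SHOES", "CHAIR", "SHOES", "HOME"]

encoder = LabelEncoder().fit(names)
ids = encoder.transform(names)

print(list(encoder.classes_))                    # classes in sorted order
print(ids.tolist())                              # each id indexes into classes_
print(encoder.inverse_transform(ids).tolist())   # ids mapped back to names
```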


Allocate 60% for training, 20% for validation and 20% for testing

In [28]:
df_train, df_test = sklearn.model_selection.train_test_split(df_all, train_size=.6, stratify= df_all['label'] )


df_test, df_val = sklearn.model_selection.train_test_split(df_test, test_size=.5, stratify= df_test['label'] )


print  ( 
{
    'train': len(df_train)
    ,'test': len(df_test)
    ,'val': len(df_val)
}

)

{'train': 72743, 'test': 24248, 'val': 24248}
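The two-stage split can be checked on toy data. The sketch below uses a made-up frame with two balanced classes and passes `random_state`, which the cell above does not, so that the split is reproducible between runs; the value 42 is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame: 100 rows, two balanced classes.
toy = pd.DataFrame({
    "text": [f"item {i}" for i in range(100)],
    "label": [i % 2 for i in range(100)],
})

# Stage 1: 60% train, 40% held out, keeping class proportions.
toy_train, toy_rest = train_test_split(
    toy, train_size=0.6, stratify=toy["label"], random_state=42
)
# Stage 2: split the held-out 40% in half, giving 20% test and 20% validation.
toy_test, toy_val = train_test_split(
    toy_rest, test_size=0.5, stratify=toy_rest["label"], random_state=42
)

print(len(toy_train), len(toy_test), len(toy_val))
```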


## Create Hugging Face Dataset

To feed our data to the Hugging Face `transformers` package later, we need either a PyTorch `DataLoader` or a Hugging Face [datasets](https://github.com/huggingface/datasets) dataset.

`datasets` objects can easily be used from both TensorFlow and PyTorch.


In [29]:
dataset_features = datasets.Features(
    {'text': datasets.Value('string')
     , 'item_name': datasets.Value('string')
     , 'label': datasets.ClassLabel(names=list ( label_encoder.classes_ ))
     , 'brand': datasets.Value('string')
     , 'item_id': datasets.Value('string')
     , 'main_image_id': datasets.Value('string')
    , 'node': datasets.Value('string')

    }

)

In [30]:
dataset_features.keys()

dict_keys(['text', 'item_name', 'label', 'brand', 'item_id', 'main_image_id', 'node'])

Create a dataset dictionary with all the subsets

In [31]:
interested_columns = list(dataset_features.keys())

dataset_train = datasets.Dataset.from_pandas(df_train[interested_columns],features=dataset_features)
dataset_test = datasets.Dataset.from_pandas(df_test[interested_columns],features=dataset_features)
# the validation split must come from df_val, not df_test
dataset_validation = datasets.Dataset.from_pandas(df_val[interested_columns],features=dataset_features)

dataset_all = datasets.DatasetDict({
    'train': dataset_train,
    'test': dataset_test,
    'valid': dataset_validation }
)

In [32]:
dataset_all

DatasetDict({
    train: Dataset({
        features: ['text', 'item_name', 'label', 'brand', 'item_id', 'main_image_id', 'node'],
        num_rows: 72743
    })
    test: Dataset({
        features: ['text', 'item_name', 'label', 'brand', 'item_id', 'main_image_id', 'node'],
        num_rows: 24248
    })
    valid: Dataset({
        features: ['text', 'item_name', 'label', 'brand', 'item_id', 'main_image_id', 'node'],
        num_rows: 24248
    })
})

In [33]:
dataset_all['train'][0]

{'text': 'Amazon Brand - Solimo Designer Light Blue Flower Photography 3D Printed Hard Back Case Mobile Cover for Sony Xperia L1',
 'item_name': 'Amazon Brand - Solimo Designer Light Blue Flower Photography 3D Printed Hard Back Case Mobile Cover for Sony Xperia L1',
 'label': 2,
 'brand': 'Amazon Brand - Solimo',
 'item_id': 'B07THC7RSK',
 'main_image_id': '71PBcKpr8jL',
 'node': '/Categories/Mobiles & Accessories/Mobile Accessories/Cases & Covers/Back & Bumper Cases'}

In [34]:
all_classes = dataset_all['train'].features['label'].names
all_classes

## Persist Changes

Save the dataset to disk and load it back to verify

In [35]:
dataset_path = '../artifacts/dataset_processed/'

In [36]:
dataset_all.save_to_disk(dataset_path)

In [37]:
datasets.load_from_disk(dataset_path)

DatasetDict({
    train: Dataset({
        features: ['text', 'item_name', 'label', 'brand', 'item_id', 'main_image_id', 'node'],
        num_rows: 72743
    })
    test: Dataset({
        features: ['text', 'item_name', 'label', 'brand', 'item_id', 'main_image_id', 'node'],
        num_rows: 24248
    })
    valid: Dataset({
        features: ['text', 'item_name', 'label', 'brand', 'item_id', 'main_image_id', 'node'],
        num_rows: 24248
    })
})

## References

[Amazon Berkeley Objects (ABO) Dataset](https://amazon-berkeley-objects.s3.amazonaws.com/index.html)       
[Hugging Face Tutorial on Custom Dataset](https://github.com/huggingface/notebooks/blob/master/transformers_doc/custom_datasets.ipynb)