This notebook explores the possibility of creating a multi-modal, multi-purpose AI that can achieve the following:

1. **Multimodal product-question answering specifically focusing on the product details.** <br>
The AI can tell you things about the product itself, such as its durability or common customer complaints, so you can talk to it as if it were a customer with more first-hand experience of the product than you (built by concatenating Amazon review data). The focus is on product knowledge, not on guiding the customer through the shopping experience or answering other customer-experience questions.

2. **Multimodal Product Anomaly Detection:** <br>
Develop models that can detect anomalies or inconsistencies in product listings by cross-checking information across different modalities (e.g., text description inconsistent with product images).


### PROCESS
We can phrase the research questions (RQs) and process as:

1. What are the hottest selling items on Amazon or other sites? (Views, clicks, successful purchases, any metric related)

2. What is the nature of the text descriptions of these items? What <font color="salmon">metric</font> can we develop to measure the consistency between a product image and its description?
    * Supervised learning with hand-coded “good description” vs. “not good description” labels
    * Design the model’s internal representation (RAG, etc.; not my forte)
    * USE CONSUMER FEEDBACK as part of the score! 

3. Then apply this metric/analysis to other websites or products. Which sites score highest?

4. Lastly, apply this analysis outside of shopping. Do image captions and descriptions match better in some modes of media than others? 
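
One way the image-description consistency metric from step 2 could be sketched is as a cosine similarity between a text embedding and an image embedding, blended with a consumer-feedback term. Everything here is an assumption for illustration: the toy vectors stand in for real embeddings, and `consistency_score` and its `w_feedback` weight are hypothetical names, not an established metric.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors; 0.0 if either is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def consistency_score(text_emb, image_emb, avg_rating, w_feedback=0.2):
    """Blend image-text similarity with a 1-5 star rating (hypothetical weighting)."""
    similarity = (cosine_similarity(text_emb, image_emb) + 1.0) / 2.0  # map [-1, 1] to [0, 1]
    feedback = (avg_rating - 1.0) / 4.0                                # map [1, 5] to [0, 1]
    return (1.0 - w_feedback) * similarity + w_feedback * feedback

# Toy vectors standing in for real text/image embeddings
text_emb = [1.0, 0.0, 1.0]
matching_image = [0.9, 0.1, 1.1]
mismatched_image = [-1.0, 0.5, -0.8]

score_match = consistency_score(text_emb, matching_image, avg_rating=4.5)
score_mismatch = consistency_score(text_emb, mismatched_image, avg_rating=4.5)
```

With these toy inputs the matching pair scores higher than the mismatched pair; whether a linear blend is the right way to fold in feedback is an open design question.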


### WHAT DOES THE DATA LOOK LIKE?
* Tabular, with columns for 
    * Product name
    * Descriptions, features
    * Review text (as a dict?)
    * Rating/popularity measure
    * Image link

AFTER PROCESSING THIS DATA:
* Tabular   
    * Encode/Embed all text & stats
    * Embed all images
    * Combined embedding


MORE DATA
* [BEST] Amazon Product Reviews (item metadata, image link, and user reviews): https://cseweb.ucsd.edu/~jmcauley/datasets.html. Example data:
```
{
  "sort_timestamp": 1634275259292,
  "rating": 3.0,
  "helpful_votes": 0,
  "title": "Meh",
  "text": "These were lightweight and soft but much too small for my liking. I would have preferred two of these together to make one loc. For that reason I will not be repurchasing.",
  "images": [
    {
      "small_image_url": "https://m.media-amazon.com/images/I/81FN4c0VHzL._SL256_.jpg",
      "medium_image_url": "https://m.media-amazon.com/images/I/81FN4c0VHzL._SL800_.jpg",
      "large_image_url": "https://m.media-amazon.com/images/I/81FN4c0VHzL._SL1600_.jpg",
      "attachment_type": "IMAGE"
    }
  ],
  "asin": "B088SZDGXG",
  "verified_purchase": true,
  "parent_asin": "B08BBQ29N5",
  "user_id": "AEYORY2AVPMCPDV57CE337YU5LXA"
}
```
* Amazon Reviews ONLY: https://www.tensorflow.org/datasets/catalog/amazon_us_reviews

In [1]:
import json
import pandas as pd
import os

### MERGE McAuley Lab Amazon Review Data
https://amazon-reviews-2023.github.io

In [2]:
filepath = '/Users/ez/Downloads/Amazon_Fashion.jsonl'

with open(filepath, 'r') as f:
    reviews = []
    for line in f:
        try:
            reviews.append(json.loads(line))
        except ValueError:
            pass

len(reviews)

2500939
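
The loading loop above silently drops any line that fails to parse. A small variant, sketched here with a hypothetical `load_jsonl` helper, counts the skipped lines so that data loss is visible:

```python
import json

def load_jsonl(lines):
    """Parse an iterable of JSON Lines; return (records, number_of_skipped_lines)."""
    records, skipped = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            records.append(json.loads(line))
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            skipped += 1
    return records, skipped

# Toy input: one valid record, one truncated record, one blank line
sample = ['{"rating": 5.0, "parent_asin": "B00LOPVX74"}', '{"rating": 3.0', '']
records, skipped = load_jsonl(sample)
```

Because an open file object is iterable by line, it can be passed to the same helper directly, e.g. `with open(filepath) as f: reviews, skipped = load_jsonl(f)`.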

In [3]:
reviews[0]

{'rating': 5.0,
 'title': 'Pretty locket',
 'text': 'I think this locket is really pretty. The inside back is a solid silver depression and the front is a dome that is not solid (knotted). You could use it to store a small photo, lock of hair, etc but I use it when I need to carry medication with me. Closes securely. High quality & very pretty.',
 'images': [],
 'asin': 'B00LOPVX74',
 'parent_asin': 'B00LOPVX74',
 'user_id': 'AGBFYI2DDIKXC5Y4FARTYDTQBMFQ',
 'timestamp': 1578528394489,
 'helpful_vote': 3,
 'verified_purchase': True}

In [4]:
filepath_meta = '/Users/ez/Downloads/meta_Amazon_Fashion.jsonl'

with open(filepath_meta, 'r') as f:
    meta = []
    for line in f:
        try:
            meta.append(json.loads(line))
        except ValueError:
            pass

len(meta)


826108

In [5]:
meta[0].keys()

dict_keys(['main_category', 'title', 'average_rating', 'rating_number', 'features', 'description', 'price', 'images', 'videos', 'store', 'categories', 'details', 'parent_asin', 'bought_together'])

In [None]:
meta[0]

In [9]:
# Extract the list of parent_asin values from the metadata
meta_ids = [item['parent_asin'] for item in meta]

# Extract the list of parent_asin values from the reviews
review_ids = [item['parent_asin'] for item in reviews]

In [15]:
shared_ids = set(meta_ids).intersection(review_ids)
len(shared_ids)

825869
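
To see which IDs fall outside the intersection, the same set operations can be extended with differences. This is a sketch on toy IDs (the `A1`...`A5` values are made up), not output from the real data:

```python
# Toy ID lists; reviews may repeat a parent_asin, metadata should not
meta_ids = ["A1", "A2", "A3", "A4"]
review_ids = ["A1", "A1", "A2", "A5"]

meta_set, review_set = set(meta_ids), set(review_ids)

shared = meta_set & review_set       # products with both metadata and reviews
meta_only = meta_set - review_set    # products with metadata but no reviews
review_only = review_set - meta_set  # reviewed products missing from metadata
```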

In [16]:
# Keep only the records whose parent_asin appears in both datasets
# (a product can have multiple reviews, so reviews_subset keeps all of them)
meta_subset = [d for d in meta if d.get('parent_asin') in shared_ids]
reviews_subset = [d for d in reviews if d.get('parent_asin') in shared_ids]

In [17]:
len(reviews_subset), len(meta_subset) # Reviews keep their original size; meta drops from 826108 to 825869 rows

(2500939, 825869)

In [18]:
# Delete original datasets
del meta, reviews

#### DEAL WITH MULTIPLE REVIEWS

In [149]:
reviews_subset_test = reviews_subset
meta_subset_test = meta_subset

In [150]:
# Make into DF
reviews_subset_test = pd.DataFrame(reviews_subset_test)
meta_subset_test = pd.DataFrame(meta_subset_test)

In [151]:
# Rename reviews title & images
reviews_subset_test.rename(columns={"title": "title_review", "images": "images_review"}, inplace=True)

In [152]:
# Merge
merged_data = pd.merge(meta_subset_test, reviews_subset_test, on='parent_asin', how='right')

In [153]:
len(merged_data)

2500939
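
The row count is one check on the merge. As a further sanity check, `pd.merge` accepts `validate` and `indicator` arguments; the sketch below uses toy frames (made-up IDs and titles) to show both:

```python
import pandas as pd

# Toy metadata (one row per product) and reviews (many rows per product)
meta_df = pd.DataFrame({"parent_asin": ["A1", "A2"], "title": ["Locket", "Scarf"]})
reviews_df = pd.DataFrame({
    "parent_asin": ["A1", "A1", "A2"],
    "title_review": ["Pretty locket", "Meh", "Warm"],
})

merged = pd.merge(
    meta_df, reviews_df,
    on="parent_asin", how="right",
    validate="one_to_many",  # raises MergeError if a parent_asin repeats in meta_df
    indicator=True,          # adds a "_merge" column: both / left_only / right_only
)

# Reviews that found no matching metadata row would be tagged "right_only"
unmatched = int((merged["_merge"] != "both").sum())
```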

In [154]:
reviews_subset_test.drop(columns=['parent_asin'], inplace=True)

In [155]:
aggregate_columns = list(reviews_subset_test.columns)
aggregate_columns

['rating',
 'title_review',
 'text',
 'images_review',
 'asin',
 'user_id',
 'timestamp',
 'helpful_vote',
 'verified_purchase']

Use Dask to parallelize the grouping

In [158]:
import dask.dataframe as dd
merged_data_dask = dd.from_pandas(merged_data, npartitions=4)

In [159]:
# Perform groupby and aggregation to list using Dask
aggregated_reviews_dask = merged_data_dask.groupby('parent_asin')[aggregate_columns].agg(list).reset_index()

In [160]:
# Convert Dask DF back to Pandas DF
aggregated_reviews = aggregated_reviews_dask.compute()
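
To make the shape of the aggregated result concrete, here is the same group-and-collect-to-list pattern in plain pandas on a toy frame (made-up values). Each review column becomes a list with one entry per review of that product:

```python
import pandas as pd

toy = pd.DataFrame({
    "parent_asin": ["A1", "A1", "A2"],
    "rating": [5.0, 3.0, 4.0],
    "text": ["Great", "Meh", "Warm"],
})

# One row per product; each selected column holds a list of that product's values
aggregated = toy.groupby("parent_asin")[["rating", "text"]].agg(list).reset_index()
```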

In [161]:
# Flatten a list of lists (one list of image dicts per review) into a single list
def flatten_l_of_l_of_dict(list_of_lists):
    return [item for sublist in list_of_lists for item in sublist]

In [162]:
aggregated_reviews['images_review'] = aggregated_reviews['images_review'].apply(flatten_l_of_l_of_dict)
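
After aggregation, each product's `images_review` cell is a list of per-review image lists, many of them empty. An equivalent flattening using the standard library's `itertools.chain` is sketched below on toy data; the `flatten` name is hypothetical.

```python
from itertools import chain

def flatten(list_of_lists):
    """Concatenate the inner lists into one flat list."""
    return list(chain.from_iterable(list_of_lists))

# Three reviews of one product: no images, one image, two images
images_per_review = [
    [],
    [{"attachment_type": "IMAGE"}],
    [{"attachment_type": "IMAGE"}, {"attachment_type": "IMAGE"}],
]
flat = flatten(images_per_review)
```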

In [164]:
# Add product info columns based on matching id
final_df = pd.merge(meta_subset_test, aggregated_reviews, on='parent_asin', how='right')

In [165]:
final_df['main_category'].value_counts() # No matching right now

main_category
AMAZON FASHION    825869
Name: count, dtype: int64

In [166]:
len(final_df) # length should match

825869

In [196]:
# Check alignment 
print(final_df['images'].iloc[175])

[{'thumb': 'https://m.media-amazon.com/images/I/310Mb7HArGL._AC_SR38,50_.jpg', 'large': 'https://m.media-amazon.com/images/I/310Mb7HArGL._AC_.jpg', 'variant': 'MAIN', 'hi_res': 'https://m.media-amazon.com/images/I/715nzfTpnYL._AC_UL1500_.jpg'}, {'thumb': 'https://m.media-amazon.com/images/I/31q0U-iv5ML._AC_SR38,50_.jpg', 'large': 'https://m.media-amazon.com/images/I/31q0U-iv5ML._AC_.jpg', 'variant': 'PT01', 'hi_res': 'https://m.media-amazon.com/images/I/71A9skIGW5L._AC_UL1500_.jpg'}, {'thumb': 'https://m.media-amazon.com/images/I/512QK54eqHL._AC_SR38,50_.jpg', 'large': 'https://m.media-amazon.com/images/I/512QK54eqHL._AC_.jpg', 'variant': 'PT11', 'hi_res': 'https://m.media-amazon.com/images/I/816yZ5ReQTL._AC_UL1436_.jpg'}, {'thumb': 'https://m.media-amazon.com/images/I/61Fkqm0y3BL._AC_SR38,50_.jpg', 'large': 'https://m.media-amazon.com/images/I/61Fkqm0y3BL._AC_.jpg', 'variant': 'PT13', 'hi_res': None}, {'thumb': 'https://m.media-amazon.com/images/I/51H7BVWuh1L._AC_SR38,50_.jpg', 'large

In [169]:
# Export 
# final_df.to_csv('amazon_fashion_merged_051624.csv')
# final_df.to_parquet('amazon_fashion_merged_051624.parquet', index=False)
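
One thing to weigh when choosing between the two export formats: list-valued columns such as the aggregated reviews do not survive a CSV round trip as lists. The sketch below uses a toy frame and an in-memory buffer to show the list coming back as a string that has to be parsed again:

```python
import ast
import io

import pandas as pd

df = pd.DataFrame({"parent_asin": ["A1"], "rating": [[5.0, 3.0]]})

# Write to an in-memory CSV and read it back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

value = restored.loc[0, "rating"]  # a str holding the list's repr, not a list
parsed = ast.literal_eval(value)   # parse the string back into a Python list
```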