# Marketplace simulation

### A. Introduction

Whereas non-marketplace simulations focus on a number of consumers experiencing and rating a single product, marketplace simulations carry out a similar exercise over a whole range of ___P___ different products from which consumers can choose. Accordingly, the journey of virtual consumers through a marketplace simulation is in part similar to that of a non-marketplace one, the main difference being that each consumer now has to choose which product to buy from those available. Once this selection has taken place, the consumer goes through the remaining steps that determine whether a rating is posted or not (formulating expectations, experiencing the product, etc.). Moreover, multiple marketplaces can be included in a single simulation, so you can think of a marketplace simulation as being composed of ___M___ different marketplaces, each containing ___P___ different simulated products from which simulated consumers can choose.


### B. Preparing the simulation

Having briefly introduced the concept of a marketplace simulation, let's execute the following code cell, which contains the imports needed to run the simulation alongside other related settings.

In [1]:
%load_ext autoreload
%autoreload 2

### Just a formatting related plugin
%load_ext nb_black

%matplotlib inline
import matplotlib.pyplot as plt

import sys
from numpy import random

sys.path.append("../")


import multiprocessing as mp

from collections import deque
from pathlib import Path
from typing import Dict, Optional

import arviz
import pickle

import numpy as np
import pandas as pd
import pyreadr
import sbi
import sbi.utils as sbi_utils
import seaborn as sns
import statsmodels.formula.api as smf

import torch

from joblib import Parallel, delayed
from matplotlib.lines import Line2D
from scipy.stats import ttest_ind
from snpe.inference import inference_class
from snpe.simulations import simulator_class, marketplace_simulator_class
from snpe.embeddings.embeddings_to_ratings import EmbeddingRatingPredictor
from snpe.embeddings.embeddings_density_est_GMM import EmbeddingDensityGMM
from snpe.utils.statistics import review_histogram_correlation
from snpe.utils.tqdm_utils import tqdm_joblib
from tqdm import tqdm

# import tqdm

ARTIFACT_PATH = Path("../artifacts/marketplace/")

### Set plotting parameters
sns.set(style="white", context="talk", font_scale=2.5)
sns.set_color_codes(palette="colorblind")
sns.set_style("ticks", {"axes.linewidth": 2.0})


The cell below defines the path where the output of the simulation will be stored. By default, output is stored in the "output" folder within this directory, but you can modify the path to match your storage preferences.

In [2]:
OUTPUT_PATH = Path("output/")


Beyond the overall scheme presented in the introduction, marketplace simulations incorporate two other significant new elements that were not present in the non-marketplace setting and that deserve to be mentioned before actually running the simulation.

1. The first is a pair of Gaussian mixture models (GMMs) fitted on the products' and users' embeddings, which are then used to draw the embeddings of the fictional products and users that make up the simulated marketplaces.

2. The second element is a feed-forward neural network trained to predict the real quality of each simulated product from its embedding. 
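To make the first element more concrete, here is a minimal sketch of how embeddings can be drawn from a Gaussian mixture once its parameters are known. It is not the code used by `EmbeddingDensityGMM`: the function `sample_from_mixture`, the toy weights, means and standard deviations, and the diagonal-covariance simplification are all assumptions made for illustration.

```python
import numpy as np


def sample_from_mixture(weights, means, stds, n_samples, rng):
    """Draw n_samples points from a diagonal-covariance Gaussian mixture."""
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    # Pick a mixture component for every sample according to the weights.
    components = rng.choice(len(weights), size=n_samples, p=weights)
    # Draw each sample from the Gaussian of its assigned component.
    noise = rng.standard_normal((n_samples, means.shape[1]))
    return means[components] + noise * stds[components]


rng = np.random.default_rng(42)
# Hypothetical toy mixture: 2 components in a 3-dimensional embedding space.
toy_weights = [0.7, 0.3]
toy_means = [[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]]
toy_stds = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]]

virtual_products = sample_from_mixture(toy_weights, toy_means, toy_stds, 10, rng)
print(virtual_products.shape)  # (10, 3)
```

In the notebook itself the mixture parameters are learned from the real embeddings in B.1 rather than written by hand, and each sampled row plays the role of one virtual product or user.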

At this point, if you intend to run an example marketplace simulation you can just carry on with the rest of the notebook given that example data for such a task is already provided. However, if you are interested in running the simulation with your own data on products and users you should consider that the processes mentioned above require the following inputs:

1. A file called "productspace.tsv" containing the n-dimensional product embeddings, each one accompanied by a product identifier in the format "product_< id-number >" (e.g. product_1234567). Additionally, make sure that the id-number itself does not contain any underscore as this may cause problems in the input processing stage.


2. A second file "userspace.tsv" containing the n-dimensional user embeddings (without any specific user identifier unlike the case of "productspace.tsv").


3. A third file named "rating_histogram_all.txt" containing the numerical IDs and actual rating distribution [1 - 5] of the products in "productspace.tsv".
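Before fitting the models on your own data, it can help to check that the product identifiers follow the expected format. The sketch below is a hypothetical helper, not part of the snpe package, and it assumes that the id-number consists of digits only.

```python
import re

# Assumed pattern: the literal prefix "product_" followed by digits only,
# so the id-number itself can never contain an underscore.
PRODUCT_ID_PATTERN = re.compile(r"product_\d+")


def is_valid_product_id(product_id: str) -> bool:
    """Return True if the identifier matches the 'product_<id-number>' format."""
    return PRODUCT_ID_PATTERN.fullmatch(product_id) is not None


print(is_valid_product_id("product_1234567"))   # True
print(is_valid_product_id("product_12_34567"))  # False
```

Running such a check over the first column of "productspace.tsv" catches malformed identifiers before they reach the input processing stage.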


For further clarity on the exact format expected for the inputs, you can look at the example versions already provided at the path "snpe/artifacts/marketplace/" within the repository. Moreover, at the end of the notebook you can also find an example piece of code designed to generate valid artificial inputs for the marketplace simulation.

You can place your inputs in the path "snpe/artifacts/marketplace/" (which is the one used by default), or alternatively select any path of your choice by setting it as the value of the argument "artifact_path" when instantiating the new objects of the classes in charge of fitting/training the models in the code cells below. Once the inputs have been provided as specified, the code cells below headings B.1 and B.2 can be executed to fit the GMM models for embeddings and train the feed-forward neural network for quality prediction respectively.

#### B.1 GMM model for generation of virtual products and users

In [3]:
model = EmbeddingDensityGMM(n_components=10, n_init=5, artifact_path=ARTIFACT_PATH)
product_embeddings, user_embeddings = model.process_input_data()
model.fit(product_embeddings, user_embeddings)
model.save()

Product embeddings of shape: (1400, 100)
User embeddings of shape: (1400, 100)
Fitting product model: GaussianMixture(max_iter=500, n_components=10, n_init=5, random_state=42,
                verbose=2, verbose_interval=20)
1260 samples in train set, 140 samples in test set
Initialization 0
Initialization converged: True	 time lapse 0.07950s	 ll -112.14466
Initialization 1
Initialization converged: True	 time lapse 0.08107s	 ll -100.60310
Initialization 2
Initialization converged: True	 time lapse 0.09166s	 ll -104.64910
Initialization 3
Initialization converged: True	 time lapse 0.09013s	 ll -95.58325
Initialization 4
Initialization converged: True	 time lapse 0.08292s	 ll -89.75754
Fitting user model: GaussianMixture(max_iter=500, n_components=10, n_init=5, random_state=42,
                verbose=2, verbose_interval=20)
1260 samples in train set, 140 samples in test set
Initialization 0
Initialization converged: True	 time lapse 0.09995s	 ll -108.23209
Initialization 1
Initializatio




#### B.2 Feed-forward neural network for predicting real quality of products

In [4]:
embedding_model = EmbeddingRatingPredictor(artifact_path=ARTIFACT_PATH)
input_df = embedding_model.process_input_data()
(
    ratings,
    embeddings,
    train_loader,
    val_loader,
    train_indices,
    val_indices,
) = embedding_model.create_training_data(input_df, validation_frac=0.1, batch_size=100)
embedding_model.fit(train_loader, val_loader, train_indices, val_indices)
embedding_model.save()

	 Device set to cpu, using torch num threads=1
Using the dense network: 
 Sequential(
  (0): Linear(in_features=100, out_features=256, bias=True)
  (1): LeakyReLU(negative_slope=0.01)
  (2): Linear(in_features=256, out_features=128, bias=True)
  (3): LeakyReLU(negative_slope=0.01)
  (4): Linear(in_features=128, out_features=64, bias=True)
  (5): LeakyReLU(negative_slope=0.01)
  (6): Linear(in_features=64, out_features=5, bias=True)
)
Merged product embeddings with review histograms and produced merged DF of shape: (1400, 8)
Train set size: torch.Size([1260]), Validation set size: torch.Size([140])
Train Loss after epoch: 0: 30.110531156025235
Validation loss after epoch: 0: 26.983489990234375
Train Loss after epoch: 50: 3.835285561425345
Validation loss after epoch: 50: 5.336745807102749
Stopping after epoch 70 as validation loss was not improving further
Training process has finished.
Best loss: 4.2177333150591165



### C. The simulation and its arguments

The Marketplace simulation is represented by the MarketplaceSimulator class, and its increased complexity requires three new arguments (to be added on top of those inherited from its parent class, the HerdingSimulator class) to govern its functioning. These new arguments are:

- __num_products__: Number of products to be included in each marketplace. For example, if set at a value of 1400, a total of 1400 different virtual products will be sampled from the GMM fit of the provided product embeddings for each of the simulated marketplaces. 


- __num_total_marketplace_reviews__: Desired number (Integer) of ratings to be obtained across all products in a given marketplace before its simulation is concluded. The simulation of a marketplace is also concluded automatically if this number of ratings has not been reached by the time 30 times as many consumers have visited the marketplace. For example, if this argument is set to 10 and fewer than 10 ratings have been posted by the time 300 consumers have been simulated, the simulation is concluded automatically.


- __consideration_set_size__: Number of virtual products (Integer) that will compose the consideration set from which the consumer will make the final purchasing decision. For instance, if set at a value of 5, consumers in the simulation will make their final purchasing decision from a set comprised of the top 5 products whose embeddings display the highest cosine similarity to their own assigned embedding.  
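The construction of the consideration set described in the last bullet can be sketched as follows. This is an illustration under assumptions, not the code of the MarketplaceSimulator class: the function name `consideration_set` and the toy embeddings are hypothetical.

```python
import numpy as np


def consideration_set(user_embedding, product_embeddings, consideration_set_size):
    """Return the indices of the products most similar to the user."""
    user = np.asarray(user_embedding, dtype=float)
    products = np.asarray(product_embeddings, dtype=float)
    # Cosine similarity between the user and every product.
    similarities = products @ user / (
        np.linalg.norm(products, axis=1) * np.linalg.norm(user)
    )
    # Indices of the most similar products, best match first.
    return np.argsort(-similarities)[:consideration_set_size]


# Hypothetical 2-dimensional embeddings for one user and four products.
toy_user = [1.0, 0.0]
toy_products = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
print(consideration_set(toy_user, toy_products, 2))  # [0 2]
```

Here the product pointing in the same direction as the user ranks first, and the product pointing in the opposite direction would be the last one to enter any consideration set.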

Furthermore, the MarketplaceSimulator class can simulate multiple different marketplaces at once. As a result, the role of the 'num_simulations' argument, already used in the non-marketplace setting, has been updated to fit the new simulation design and is now described as:

- __num_simulations__: Total number (Integer) of marketplaces to be simulated. Each marketplace will be composed of its own unique set of virtual products and consumers, sampled from the GMM fits of the provided product and user embeddings.
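The stopping rule attached to the target number of ratings per marketplace (either the target is reached, or 30 times as many consumers have visited) can be sketched as a simple loop. This is a toy stand-in: `simulate_marketplace_length` and `rating_probability` are hypothetical names, and a single probability replaces the full purchase and rating process of the real simulator.

```python
import random


def simulate_marketplace_length(
    num_total_marketplace_reviews, rating_probability, seed=0
):
    """Return (ratings posted, consumers simulated) for one toy marketplace."""
    rng = random.Random(seed)
    # Hard cap on visitors: 30 times the desired number of ratings.
    max_consumers = 30 * num_total_marketplace_reviews
    ratings_posted = 0
    consumers_seen = 0
    while (
        ratings_posted < num_total_marketplace_reviews
        and consumers_seen < max_consumers
    ):
        consumers_seen += 1
        # Assumption: each consumer posts a rating with a fixed probability.
        if rng.random() < rating_probability:
            ratings_posted += 1
    return ratings_posted, consumers_seen


print(simulate_marketplace_length(10, rating_probability=0.0))  # (0, 300)
print(simulate_marketplace_length(10, rating_probability=1.0))  # (10, 10)
```

The two calls show the extremes: when nobody rates, the simulation stops at the cap of 300 consumers, and when everybody rates, it stops as soon as the 10 requested ratings exist.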

The generate_and_save_simulations function defined below is in charge of calling the MarketplaceSimulator class and passing it the arguments required to configure and carry out the simulation. If you would like to revisit the interpretation of the other arguments, which were already introduced for non-marketplace simulations, you can scroll further down the notebook, where a brief review has been included.

In [5]:
def generate_and_save_simulations(
    num_simulations: int,
    review_prior: np.ndarray,
    tendency_to_rate: float,
    simulation_type: str,
    previous_rating_measure: str,
    min_reviews_for_herding: int,
    # herding_differentiating_measure: str,
    num_products: int,
    num_total_marketplace_reviews: int,
    consideration_set_size: int,
) -> None:
    params = {
        "num_simulations": num_simulations,
        "review_prior": review_prior,
        "tendency_to_rate": tendency_to_rate,
        "simulation_type": simulation_type,
        "previous_rating_measure": previous_rating_measure,
        "min_reviews_for_herding": min_reviews_for_herding,
        # "herding_differentiating_measure": herding_differentiating_measure,
        "num_products": num_products,
        "num_total_marketplace_reviews": num_total_marketplace_reviews,
        "consideration_set_size": consideration_set_size,
    }
    simulator = marketplace_simulator_class.MarketplaceSimulator(params)
    simulator.simulate(num_simulations=num_simulations)
    simulator.save_simulations(OUTPUT_PATH)


### D. Running the simulation

After reviewing the key aspects involved in running a marketplace simulation we can proceed to run an example of it. In the code cell below the function in charge of running the simulation is called including a series of parameters that will shape it such that:

1. 8 different marketplaces will be simulated.
2. Five ratings (one for each rating value) are pre-loaded.
3. The tendency to rate is set at 5%.
4. It will return a time series of the simulated ratings.
5. The mode (of all previous ratings) is taken as the reference metric for herding.
6. At least five previous ratings are required for herding to start happening.
7. Each of the simulated marketplaces will be comprised of 10 different virtual products.
8. A total of 200 ratings is required to conclude the simulation of a given marketplace.
9. The consideration sets will be comprised of five virtual products.

In [6]:
generate_and_save_simulations(8, np.ones(5), 0.05, "timeseries", "mode", 5, 10, 200, 5)

8 marketplaces to be simulated on 1 CPUs. Press Enter to continue..
Loaded product embedding density estimator: 
 GaussianMixture(max_iter=500, n_components=10, n_init=5, random_state=42,
                verbose=2, verbose_interval=20)
Loaded user embedding density estimator: 
 GaussianMixture(max_iter=500, n_components=10, n_init=5, random_state=42,
                verbose=2, verbose_interval=20)
	 Device set to cpu, using torch num threads=1
Using the dense network: 
 Sequential(
  (0): Linear(in_features=100, out_features=256, bias=True)
  (1): LeakyReLU(negative_slope=0.01)
  (2): Linear(in_features=256, out_features=128, bias=True)
  (3): LeakyReLU(negative_slope=0.01)
  (4): Linear(in_features=128, out_features=64, bias=True)
  (5): LeakyReLU(negative_slope=0.01)
  (6): Linear(in_features=64, out_features=5, bias=True)
)
Loaded embedding -> rating predictor model: 
 RatingPredictorModel(
  (net): Sequential(
    (0): Linear(in_features=100, out_features=256, bias=True)
    (1): L

Worker 1: 100% 200/200 [00:20<00:00, 24.20it/s]


### Appendix 1: Review of the simulation's arguments

Below you can find a review of the arguments that are required to carry out the simulation but are not exclusive to marketplace simulations. Their functioning and interpretation are exactly the same as for non-marketplace simulations, except for the 'review_prior' argument: it no longer applies only to the single product considered in a non-marketplace simulation, but to all products across all marketplaces.

- __review_prior__: Set of initial ratings for the products that are pre-loaded before the simulation starts, taking the shape of an array of five integer values. By default, this is set as an array of five 1s. This implies that by the time a product within a marketplace is purchased for the first time, the consumer that has done so will observe 5 prior reviews, each of them assigned to one of the five values composing the rating scale [1 - 5]. 


- __tendency_to_rate__: Underlying tendency to rate for all consumers taking float values in the interval [0,1]. In other words, this is the proportion of consumers that will post a rating regardless of the value of the rho parameter(s) and the difference between their actual and expected product experience. If set at the default value of 0.05, 5% of all consumers will post a rating independently of the other factors at play in the simulation. This is necessary to address the "cold start" problem where by random chance for some products, we might have high enough values of rho that no visitors ever leave a rating.


- __simulation_type__: Type of simulation output to produce between time series and histogram. Accepts the strings "timeseries" and "histogram" as inputs. Returns the timeseries of the simulated ratings, in a cumulative histogram format (so, the order of rating accumulation is preserved) if "timeseries" is chosen. For "histogram", returns the final histogram of ratings (and throws away the order of rating accumulation). 


- __previous_rating_measure__: Measure of previous ratings that will be taken as a reference when experiencing herding behavior. It can be either the mean, the mode or the latest review posted. For example, if a consumer leaves a rating being subject to herding and this parameter is set as mode, it will herd towards the mode of all previous reviews. This argument is specific to the Herding and Double herding simulations and takes the strings "mode", "mean" and "latest" as valid inputs.


- __min_reviews_for_herding__: Minimum number of pre-existing reviews for a consumer to be able to be subject to herding behavior. It has to be an integer value larger than 0. This argument is specific to the Herding and Double herding simulations.
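The interplay between `review_prior`, the two output types and the herding reference measures can be illustrated with a toy sequence of ratings. The sketch below is not taken from the simulator classes: the sequence `posted_ratings` is made up, and counting the pre-loaded prior ratings in the histograms and in the reference measures is an assumption.

```python
import numpy as np

# Default prior: one pre-loaded rating for each value of the 1-5 scale.
review_prior = np.ones(5)
# Hypothetical sequence of star ratings posted over time.
posted_ratings = [5, 4, 5, 1, 5]

# "timeseries" output: one cumulative histogram per posted rating,
# so the order in which ratings accumulate is preserved.
histogram = review_prior.copy()
timeseries = [histogram.copy()]
for rating in posted_ratings:
    histogram[rating - 1] += 1
    timeseries.append(histogram.copy())
timeseries = np.array(timeseries)

# "histogram" output: only the final counts, the order is discarded.
final_histogram = timeseries[-1]

# Possible reference values for herding, one per previous_rating_measure.
stars = np.arange(1, 6)
reference = {
    "mean": float(np.sum(stars * final_histogram) / np.sum(final_histogram)),
    "mode": int(stars[np.argmax(final_histogram)]),
    "latest": posted_ratings[-1],
}

print(timeseries.shape)   # (6, 5)
print(final_histogram)    # [2. 1. 1. 2. 4.]
print(reference)          # {'mean': 3.5, 'mode': 5, 'latest': 5}
```

With these toy ratings, a consumer subject to herding would be pulled towards 5 stars under "mode" or "latest", and towards 3.5 stars under "mean".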

### Appendix 2: Generating artificial inputs for the simulation

In [3]:
def input_generator(
    number_prods: int, number_users: int, artifact_path="../artifacts/marketplace/"
):
    """
    This function generates and saves fictional inputs for a marketplace simulation.
    It requires the number of products and users to include in the fictional inputs as arguments.
    Output is stored by default in the path "snpe/artifacts/marketplace/",
    which is the path where the instances of the marketplace simulation-related classes are instructed
    to look for the inputs by default.
    A different path storage can be provided as the value of the optional "artifact_path" argument.
    """
    assert (
        number_prods < 13939
    ), "Argument number_prods must be lower than 13939"

    # Generating a set of random product ids to be used in rating histograms and embeddings
    ids_alone = np.random.choice(
        np.arange(1000000, 9999999), size=number_prods, replace=False
    )
    ids_product = [f"product_{product_id}" for product_id in ids_alone]

    # Generating 100d arrays to be used as product embeddings
    hundred_space = np.random.normal(size=(number_prods, 100))

    # Generating 100d arrays to be used as user embeddings
    hundred_space_users = np.random.normal(size=(number_users, 100))

    # Including product ids alongside product embeddings
    embeddings_ready = pd.DataFrame(hundred_space)
    embeddings_ready.insert(0, "product_id", ids_product)

    # Users have no visible id
    users_ready = pd.DataFrame(hundred_space_users)

    # Reusing the first number_prods rows of an existing rating histogram file
    ratings = pd.read_csv("rating_histogram_anom.txt", sep="\t")
    ratings_ready = ratings.iloc[:number_prods].copy()
    ratings_ready.insert(0, "asin", ids_alone)

    embeddings_ready.to_csv(
        Path(artifact_path) / "productspace.tsv", sep="\t", index=False, header=False
    )
    users_ready.to_csv(
        Path(artifact_path) / "userspace.tsv", sep="\t", index=False, header=False
    )
    ratings_ready.to_csv(
        Path(artifact_path) / "rating_histogram_all.txt", sep="\t", index=False
    )


Generating artificial inputs containing 1400 fictional products and users to feed the GMM models and the neural network from B.1 and B.2:

In [4]:
input_generator(1400, 1400)


### Appendix 3: Rating Scale simulation (non-marketplace)

Since this tutorial was first written, a new kind of non-marketplace simulator, the Rating Scale simulation, has been implemented. It is represented by the RatingScaleSimulator class, which is a child class of the HerdingSimulator class. Its main addition is a set of 5 new parameters that create a rating scale determining the value of the review that each consumer will post (1 to 5 stars), depending on the difference between their actual and expected experience with the product (delta). This rating scale is delimited by the following two elements:

- __five_star_highest_limit__: Upper bound for delta that will result in a 5-star rating. The actual limit will lie between `five_star_highest_limit` and `five_star_highest_limit` * 0.5. If delta is larger than this limit, the user's review will be a 5-star one. 


- __one_star_lowest_limit__:  Lower bound for delta that will result in a 1-star rating. The actual limit will lie between `one_star_lowest_limit` and `one_star_lowest_limit` * 0.5. If delta is lower than this limit, the user's review will be a 1-star one. 

Considering these two limits the rating scale is governed by the following four parameters introduced by the Rating Scale simulator:

1. __P5__: taking values between 0.5 and 1, it determines the limit to which delta is compared to get a 5-star rating. Such limit will be equivalent to: five_star_highest_limit * p5. Thus, for example, if p5 = 1 then the limit will be five_star_highest_limit.


2. __P4__: taking values between 0.25 and 0.75, it determines the limit to which delta is compared to get a 4-star rating. Such limit will be equivalent to: five_star_highest_limit * p5 * p4.


3. __P1__: taking values between 0.5 and 1, it determines the limit to which delta is compared to get a 1-star rating. Such limit will be equivalent to: one_star_lowest_limit * p1. Thus, for example, if p1 = 1 then the limit will be one_star_lowest_limit.


4. __P2__: taking values between 0.25 and 0.75, it determines the limit to which delta is compared to get a 2-star rating. Such limit will be equivalent to: one_star_lowest_limit * p1 * p2.


With this in mind, 3-star ratings arise when delta lies between one_star_lowest_limit * p1 * p2 and five_star_highest_limit * p5 * p4, an interval that contains zero.

Lastly, the fifth parameter introduced by the rating scale simulator is:

5. __bias_5_star__: Probability that a user is biased towards five stars and leaves a 5-star rating irrespective of their product experience.
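The rating scale described in this appendix can be sketched as a function that maps delta to a star rating. This is a hypothetical illustration rather than the RatingScaleSimulator code: the function name `rating_from_delta`, the handling of values that fall exactly on a limit, and the assumption that `five_star_highest_limit` is positive while `one_star_lowest_limit` is negative are all choices made for this example.

```python
import random


def rating_from_delta(
    delta,
    five_star_highest_limit,
    one_star_lowest_limit,
    p5,
    p4,
    p1,
    p2,
    bias_5_star=0.0,
    rng=None,
):
    """Map delta (as defined above) to a 1-5 star rating."""
    rng = rng or random.Random()
    # With probability bias_5_star the user posts 5 stars regardless of delta.
    if rng.random() < bias_5_star:
        return 5
    # Limits derived from the two bounds and the four scale parameters.
    five_star_limit = five_star_highest_limit * p5
    four_star_limit = five_star_limit * p4
    one_star_limit = one_star_lowest_limit * p1
    two_star_limit = one_star_limit * p2
    if delta > five_star_limit:
        return 5
    if delta > four_star_limit:
        return 4
    if delta < one_star_limit:
        return 1
    if delta < two_star_limit:
        return 2
    return 3


# Toy scale: limits at 2, 1, -1 and -2 (assumed values, not defaults).
scale = dict(
    five_star_highest_limit=2.0, one_star_lowest_limit=-2.0,
    p5=1.0, p4=0.5, p1=1.0, p2=0.5,
)
print([rating_from_delta(d, **scale) for d in (2.5, 1.5, 0.0, -1.5, -2.5)])
# [5, 4, 3, 2, 1]
```

Lowering p5 or p1 moves the 5-star and 1-star limits closer to zero, which makes extreme ratings easier to obtain, while setting `bias_5_star` above zero adds 5-star ratings that do not depend on delta at all.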