# INF6953I - C.SPÉC.: Fouille de données 
## TP3 - Summer 2018
### Team Components
    - Patrice Béchard
    - Soufiane Lamghari


In [1]:
!ls

bin				     lib				run
boot				     lib64				sbin
dataproc-initialization-actions      lost+found				srv
dev				     media				sys
etc				     miniconda				tmp
export				     Miniconda3-latest-Linux-x86_64.sh	usr
hadoop				     mnt				var
hadoop_gcs_connector_metadata_cache  opt				vmlinuz
home				     proc
initrd.img			     root


In [1]:
#need to run this on Ubuntu

import findspark
import os

findspark.init()

ValueError: Couldn't find Spark, make sure SPARK_HOME env is set or Spark is in an expected location (e.g. from homebrew installation).

In [3]:
# Download Instacart dataset

# retrieve url from https://www.instacart.com/datasets/grocery-shopping-2017
url = "https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz?X-Amz-Expires=21600&X-Amz-Date=20180619T200551Z&X-Amz-Security-Token=FQoDYXdzEPT//////////wEaDCXVDqE3Fi9GxyWzmSKtA8S1mS6OKlV6gRpdR%2B7xURFzKnyxnQOJi8Xazr2VFDXqFpYBCObkeT73bAEUCtVPOenPS3Fy6YYJHxDbCohvnsRBxBeetM5DC1wi3Wof9tfav5wq0kNtGqlHxYJlmW223xeXsFW0Gh/NxWSDwWjQr61C3H/yicDpXQSuQ4wync11wSudW8KyIwPOWY4XrIs3qRsE0sLJNH3X3EDJbsw4L9r0XKCY5THMRBTFc9WdF5z5rHo7481mCeQ0t83p2yEfxjpRvzRgGUlJPQluErC0fLb1zyGZBGPqs5NpqLCOBlYRnVzKK66eF0a/jIZ8xcPtYAXhBVyqk0EKPmLAoOelMTkn%2Bt5D9WSDFwYygxZ%2By%2BObSDknkzDeZq4SK3LO%2Btbc1hwWvKLe6yGQ2t2WZlpgBYVGI%2BJn053KlJmPbvG26pQTVbBM7Jc3qSWOHCwNboc%2BUpadvrCEM%2BxhwFH1NQJ2xlLel79J/yUSZZz7iFusKVhIvthoQuMsfqizCo6xl2JAemaEygGW5UfppPQ98chPOiGTplMJkFZ01AMIVprHzx/svS48V9J6eHN8WJoT7CjFm6XZBQ%3D%3D&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIAJKBXD67T26USQEQA/20180619/us-east-1/s3/aws4_request&X-Amz-SignedHeaders=host&X-Amz-Signature=7d3cd09e0bf18e30b9d9b40f56925c8351dc2fa1cc33e408873792b32e6865ed"
filename = "instacart.tar.gz"

from urllib.request import URLopener

testfile = URLopener()
testfile.retrieve(url, filename)

('instacart.tar.gz', <http.client.HTTPMessage at 0x7f036eb69a58>)

In [4]:
# Untar downloaded file
import tarfile
tar = tarfile.open(filename)
tar.extractall()
tar.close()

In [7]:
!mv instacart_2017_05_01 instacart

In [8]:
# download custom order file from personal github
url = "https://raw.githubusercontent.com/patricebechard/data_mining_inf6953i/master/lab3/instacart/orders_team8.csv"
filename = "instacart/orders_team8.csv"

from urllib.request import URLopener
testfile = URLopener()
testfile.retrieve(url, filename)

('instacart/orders_team8.csv', <http.client.HTTPMessage at 0x7f036eb30ac8>)

In [9]:
# download toy data

url = "https://raw.githubusercontent.com/patricebechard/data_mining_inf6953i/master/lab3/toy.csv"
filename = "toy.csv"

from urllib.request import URLopener
testfile = URLopener()
testfile.retrieve(url, filename)

('toy.csv', <http.client.HTTPMessage at 0x7f036eb30eb8>)

## Market Basket Analysis
Market Basket Analysis (MBA) is a well-known data mining technique to uncover associations between products or product grouping. MBA aims to explore interesting patterns from a large collection of data, for example:
millions of supermarket transactions, online orders or credit card history. In other words, MBA allows retailers to identify the relationship between the items that people buy, i.e., reveal patterns of items often purchased together. 

A widely approach to explore these patterns is by constructing $\textit{association rules}$ such as

<center> **if** bought *product1* **then** will buy *product2* with **confidence** *x*. </center>

Then, marketers may use these association rules to allocate correlated products close to each other on store shelves or make online suggestions so that customers buy more items. However, mining association rules for large datasets is a very computationally intensive problem, which makes it almost impractical to perform it without a distributed system.

Hence, your goal in this TP is to create a **MapReduce** solution for identifying patterns and creating association rules for a big dataset with more than three millions transactions. This algorithm will be running in a distributed cloud computing cluster. Finally, it is expected that, analyzing your results, you should be able to help marketer answers questions such as:

* What items are often bought together?
* Given a basket, what items should be suggested?
* How should items be placed together on the shelves?

### Methodology

This TP will be divided into two steps:
1. The first step is the implementation, where you will code a MapReduce algorithm for the MBA association rules problem. For this step, a small toy dataset is provided in order to test the developed code.

2. The second part is to use the algorithm created in step 1 for a dataset with more than three millions of supermarket transactions. In this step, you will use a cloud computing grid to run your experiments.


For the implementation step, we will follow the Market Basket Analysis algorithm presented by Jongwook Woo and Yuhang Xu (2012). The Figure below presents the algorithm workflow for this TP. The blue boxes are the ones where you must implement a method to perform a map or reduce function, and the grey boxes represent their expected output. **All these operations are explained in details in the following sections.**

![scale=0.5](workflow.svg "Algorithm Workflow")

### Setting up Spark

To implement this MapReduce solution you will use a tool called **Apache Spark**, a fast and general-purpose cluster computing system. In a nutshell, [Spark](http://spark.apache.org) is an open source framework designed with a *scale-out* methodology which makes it a very powerful tool for programmers or application developers to perform a huge volume of computations and data processing in distributed environments. Sparks provides high-level APIs that make it easy to build parallel apps. Moreover, Sparks can achieve high-performance computation by using a state-of-the-art (job/stage) scheduler, so you do not need to worry about how your code/data are parallelized/distributed, it does it all for you.

Your first task is to get Spark up and running. 
1. First, go to http://spark.apache.org/downloads 
2. Select the newest Spark release (2.3.0) and the pre-built package type
3. Click for download *spark-2.3.0-bin-hadoop2.7.tgz* and unzip it in any folder of your preference. 

4. Next, write the following two commands in your **~/.bashrc** file:
  - export SPARK_HOME=/path/to/spark-2.3.0-bin-hadoop2.7
  - export PYTHONPATH=\$SPARK_HOME/python:\$SPARK_HOME/python/lib/py4j-0.10.6-src.zip:\$SPARK_HOME/python/lib/pyspark.zip

5. Run the command *source ~/.bashrc*, reopen this jupyter notebook file and execute the next cell where the *pyspark* (Spark python API) is loaded.

In [10]:
# downloading spark
url = "https://www.apache.org/dyn/closer.lua/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz"
filename = "spark-2.3.1-bin-hadoop2.7.tgz"

from urllib.request import URLopener
testfile = URLopener()
testfile.retrieve(url, filename)

('spark-2.3.1-bin-hadoop2.7.tgz', <http.client.HTTPMessage at 0x7f036eb3d278>)

In [12]:
!pip install pyspark

Collecting pyspark
[?25l  Downloading https://files.pythonhosted.org/packages/ee/2f/709df6e8dc00624689aa0a11c7a4c06061a7d00037e370584b9f011df44c/pyspark-2.3.1.tar.gz (211.9MB)
[K    100% |████████████████████████████████| 211.9MB 200kB/s eta 0:00:01    55% |██████████████████              | 118.6MB 66.6MB/s eta 0:00:02
[?25hCollecting py4j==0.10.7 (from pyspark)
[?25l  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
[K    100% |████████████████████████████████| 204kB 37.2MB/s ta 0:00:01
[?25hBuilding wheels for collected packages: pyspark
  Running setup.py bdist_wheel for pyspark ... [?25ldone
[?25h  Stored in directory: /root/.cache/pip/wheels/37/48/54/f1b63f0dbb729e20c92f1bbcf1c53c03b300e0b93ca1781526
Successfully built pyspark
[31mmkl-random 1.0.1 requires cython, which is not installed.[0m
[31mmkl-fft 1.0.0 requires cython, which is not installed.[0m
Installi

In [13]:
from pyspark import SparkContext

# Initialize the spark context.
sc = SparkContext(appName='tp3team8')
# sc = SparkContext()

# Close the spark context
# sc.stop()

###### Word count Example 

It is part of this TP to study the [Spark python API](https://spark.apache.org/docs/latest/api/python/) and learn how to use it. For that, you will work with the [RDD API](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD), a great Spark abstraction to work with the MapReduce framework.  RDD is a collection of elements partitioned across the nodes of the cluster that can operate  in parallel. In other words, RDD is how Spark keeps your data ready to operate some function (e.g., a map or reduce function) in parallel. Do not worry if this still sounds confusing, it will be clear once you starting implementing.

In the next cell, the spark context object was used to read a toy dataset, *toy.csv*, that contains four supermarket transactions, one per line. The *textFile* function returns a RDD object and this is your starting point to work with the RDD API. Some useful functions that the API offers are:

- **map** return a new RDD by applying a function to each element of this RDD.
- **flatMap** return a new RDD by first applying a function to all elements of this RDD, and then flattening the results. **Should be used when each entry will yield more than one mapped element**
- **reduce** reduces the elements of this RDD using the specified commutative and associative binary operator.
- **reduceByKey** merge the values for each key using an associative and commutative reduce function
- **groupByKey** group the values for each key in the RDD into a single sequence
- **collect** return a list that contains all of the elements in this RDD. **Should not be used when working with a lot of data**
- **sample** return a sampled subset of this RDD
- **count** return the number of elements in this RDD.
- **filter** return a new RDD containing only the elements that satisfy a predicate.

In [14]:
def map_to_words(transaction):
    """
    Map each transaction into a set of KEY-VALUE elements.
    The KEY is the word itself and the VALUE is its number of apparitions.
    """
    words = transaction.split(',')
    for w in words:
        yield (w,1)

def reduce_words_by_key(value1, value2):
    "Merge the "
    return value1+value2
        

# Read a toy dataset
transactions = sc.textFile('toy.csv')
print("Transactions:\n\t", transactions.collect())

# Map function to identify words
words = transactions.flatMap(map_to_words)
print("Words Found:\n\t", words.collect())

# Reduce function to merge values of elements that share the same KEY
unique_words = words.reduceByKey(reduce_words_by_key)

print('Word count:')
for uw in unique_words.collect():
    print(uw)

Transactions:
	 ['a,b,c', 'a,b,d', 'b,c', 'b,c']
Words Found:
	 [('a', 1), ('b', 1), ('c', 1), ('a', 1), ('b', 1), ('d', 1), ('b', 1), ('c', 1), ('b', 1), ('c', 1)]
Word count:
('d', 1)
('b', 4)
('a', 2)
('c', 3)


## MBA Algorithm 
 The following sections explain how you should develop each step of the algorithm presented in the figure above. 
### 1. Map to Patterns (10 points)
Given a set of transactions, each transaction is **mapped** into a set of purchase patterns found within the transaction. Formally, these patterns are subsets of products that represent a group of items bought together. 
    
For the MapReduce framework, each pattern must be created as a *KEY-VALUE* element, where they KEY can take the form of a singleton, a pair or a trio of products that are present in the transaction. More precisely, for each transaction, all possible **unique** subsets of size one, two or three must be generated.  The VALUE associated with each KEY is the number of times that the KEY appeared in the transaction (if we assume that no product appears more than once in the transaction, this value is always equal to one). 

Now, implement the  **map_to_patterns** function that receives a transaction (a line of the dataset file) and returns the patterns found in the transaction. It is important to notice that, since each entry (transaction) of the map function will yield more than one KEY-VALUE element, a *flatMap* must be invoked for this step.

For the toy dataset, the expected output is:

<div style="border:1px solid black;white-space: pre;font-size: 9pt; line-height: 1.1; background-color:#f2f2f2; height: auto; width: 30em; padding-left:5px">
(('a',), 1)
(('a', 'b'), 1)
(('a', 'b', 'c'), 1)
(('a', 'c'), 1)
(('b',), 1)
(('b', 'c'), 1)
(('c',), 1)
(('a',), 1)
(('a', 'b'), 1)
(('a', 'b', 'd'), 1)
(('a', 'd'), 1)
(('b',), 1)
(('b', 'd'), 1)
(('d',), 1)
(('b',), 1)
(('b', 'c'), 1)
(('c',), 1)
(('b',), 1)
(('b', 'c'), 1)
(('c',), 1)
</div> 





** IT IS QUITE UNCLEAR HOW WE SHOULD DEAL WITH A TRANSACTION WHEN AN ITEM IS REPEATED IN IT**

In [15]:
import itertools
from collections import Counter

def map_to_patterns(transaction):
    
    # we limit ourselves to singletons, pairs and trios
    max_subset_size = 3
    
    items = transaction.split(',')
    for i in range(max_subset_size):
        tmp = list(itertools.combinations(items, i+1))
        for key, value in Counter(tmp).items():
            yield(key, value)
    
patterns = transactions.flatMap(map_to_patterns)
print("Collect")
for pattern in patterns.collect():
    print(pattern)  
    

Collect
(('a',), 1)
(('b',), 1)
(('c',), 1)
(('a', 'b'), 1)
(('a', 'c'), 1)
(('b', 'c'), 1)
(('a', 'b', 'c'), 1)
(('a',), 1)
(('b',), 1)
(('d',), 1)
(('a', 'b'), 1)
(('a', 'd'), 1)
(('b', 'd'), 1)
(('a', 'b', 'd'), 1)
(('b',), 1)
(('c',), 1)
(('b', 'c'), 1)
(('b',), 1)
(('c',), 1)
(('b', 'c'), 1)


## 2. Reduce patterns (5 points)
Once the transactions were processed by different CPUs, a **reduce** function must take place to combine identical KEYS (the subset of products) and compute the total number of its occurrences in the entire dataset. In other words, this reduce procedure must sum the *VALUE* of each identical KEY.

Create a **reduce_patterns** function below that must sum the VALUE of each pattern.
For the toy dataset, the expected output is:

<div style="border:1px solid black;white-space: pre;font-size: 9pt; line-height: 1.1; background-color:#f2f2f2; height: auto; width: 30em; padding-left:5px">
(('a',), 2)
(('a', 'b'), 2)
(('a', 'b', 'c'), 1)
(('a', 'c'), 1)
(('b',), 4)
(('b', 'c'), 3)
(('c',), 3)
(('a', 'b', 'd'), 1)
(('a', 'd'), 1)
(('b', 'd'), 1)
(('d',), 1)
</div> 




In [16]:
def reduce_patterns(value1, value2):
    return value1 + value2

frequent_patterns = patterns.reduceByKey(reduce_patterns)
for frequent_pattern in frequent_patterns.collect():
    print(frequent_pattern)


(('a', 'd'), 1)
(('b', 'd'), 1)
(('a',), 2)
(('b',), 4)
(('c',), 3)
(('a', 'b'), 2)
(('a', 'c'), 1)
(('b', 'c'), 3)
(('d',), 1)
(('a', 'b', 'c'), 1)
(('a', 'b', 'd'), 1)


## 3. Map to subpatterns (15 points)
Following, another **map** function should be applied to generate subpatterns. Once again, the subpatterns are KEY-VALUE elements, where the KEY is a subset of products as well. However, creating the subpattern's KEY is a different procedure. This time, the idea is to break down the list of products of each pattern (pattern KEY), remove one product at a time, and yield the resulting list as the new subpattern KEY. For example, for a given pattern $P$ with three products, $p_1, p_2 $ and $p_3$, then three new subpatterns KEYs are going to be created: (i) remove $p_1$ and yield ($p_2, p_3$); (ii) remove $p_2$ and yield ($p_1,p_3$); and (iii) remove $p_3$ and yield ($p_1,p_2$). Additionally, the subpattern's VALUE structure will also be different. Instead of just single interger value as we had in the patterns, this time a *tuple* should be created for the subpattern VALUE. This tuple contains the product that was removed when yielding the KEY and the number of times the pattern appeared. For the example above, the values should be ($p_1,v$), ($p_2,v$) and ($p_3,v$), respectively, where $v$ is the VALUE of the pattern. The idea behind subpatterns is to create **rules** such as: when the products of KEY were bought, the item present in the VALUE was also bought $v$ times.

Furthermore, each pattern should also yield a subpattern where the KEY is the same list of products of the pattern, but the VALUE is a tuple with a null product (None) and the number of times the pattern appeared. This element will be useful to keep track of how many times such pattern was found and later will be used to compute the confidence value when generating the association rules. 

Now, implement the  **map_to_subpatterns** function that receives a pattern and yields all found subpatterns. Once again, each entry (pattern) will generate more than one KEY-VALUE element, then a flatMap function must be called.

For the toy dataset, the expected output is:

<div style="border:1px solid black;white-space: pre;font-size: 9pt; line-height: 1.1; background-color:#f2f2f2; height: auto; width: 60em; padding-left:5px">
(('a',), (None, 2))
(('a', 'b'), (None, 2))
(('b',), ('a', 2))
(('a',), ('b', 2))
(('a', 'b', 'c'), (None, 1))
(('b', 'c'), ('a', 1))
(('a', 'c'), ('b', 1))
(('a', 'b'), ('c', 1))
(('a', 'c'), (None, 1))
(('c',), ('a', 1))
(('a',), ('c', 1))
(('b',), (None, 4))
(('b', 'c'), (None, 3))
(('c',), ('b', 3))
(('b',), ('c', 3))
(('c',), (None, 3))
(('a', 'b', 'd'), (None, 1))
(('b', 'd'), ('a', 1))
(('a', 'd'), ('b', 1))
(('a', 'b'), ('d', 1))
(('a', 'd'), (None, 1))
(('d',), ('a', 1))
(('a',), ('d', 1))
(('b', 'd'), (None, 1))
(('d',), ('b', 1))
(('b',), ('d', 1))
(('d',), (None, 1))
</div> 



In [17]:
def map_to_subpatterns(pattern):
    
    keys = pattern[0]
    value = pattern[1]
    n_items = len(keys)
    if n_items > 1:
        # if more than one item, return subpattern with one less element for each element
        for i in range(n_items):
            yield(tuple([keys[j] for j in range(n_items) if j!=i]), (keys[i], value))

    # for all patterns, return whole pattern with None as associated subpattern
    yield(keys, (None, value))
    
subpatterns = frequent_patterns.flatMap(map_to_subpatterns)

for subpattern in subpatterns.collect():
    print(subpattern)

(('d',), ('a', 1))
(('a',), ('d', 1))
(('a', 'd'), (None, 1))
(('d',), ('b', 1))
(('b',), ('d', 1))
(('b', 'd'), (None, 1))
(('a',), (None, 2))
(('b',), (None, 4))
(('c',), (None, 3))
(('b',), ('a', 2))
(('a',), ('b', 2))
(('a', 'b'), (None, 2))
(('c',), ('a', 1))
(('a',), ('c', 1))
(('a', 'c'), (None, 1))
(('c',), ('b', 3))
(('b',), ('c', 3))
(('b', 'c'), (None, 3))
(('d',), (None, 1))
(('b', 'c'), ('a', 1))
(('a', 'c'), ('b', 1))
(('a', 'b'), ('c', 1))
(('a', 'b', 'c'), (None, 1))
(('b', 'd'), ('a', 1))
(('a', 'd'), ('b', 1))
(('a', 'b'), ('d', 1))
(('a', 'b', 'd'), (None, 1))


## 4. Reduce Subpattern (5 points)
Once more, a **reduce** function will be required to group all the subpatterns by their KEY. The objective of this reducing procedure is to create a list with all the **rules** that appeared by a KEY. Hence, the expected resulting of the reduce function is also a KEY-VALUE element, where the KEY is the subpattern's KEY and the VALUE is a group containing all the VALUEs of the subpatterns that share the same KEY.

For the toy dataset, the expected output is:

<div style="border:1px solid black;white-space: pre;font-size: 9pt; line-height: 1.1; background-color:#f2f2f2; height: auto; width: 60em; padding-left:5px">
(('a',), [(None, 2), ('b', 2), ('c', 1), ('d', 1)])
(('a', 'b'), [(None, 2), ('c', 1), ('d', 1)])
(('b',), [('a', 2), (None, 4), ('c', 3), ('d', 1)])
(('a', 'b', 'c'), [(None, 1)])
(('b', 'c'), [('a', 1), (None, 3)])
(('a', 'c'), [('b', 1), (None, 1)])
(('c',), [('a', 1), ('b', 3), (None, 3)])
(('a', 'b', 'd'), [(None, 1)])
(('b', 'd'), [('a', 1), (None, 1)])
(('a', 'd'), [('b', 1), (None, 1)])
(('d',), [('a', 1), ('b', 1), (None, 1)])
</div> 




In [18]:
# convert to weird pyspark iterable, then to list
rules = subpatterns.groupByKey().mapValues(list)

for rule in rules.collect():
    print(rule)

(('a', 'd'), [(None, 1), ('b', 1)])
(('b', 'd'), [(None, 1), ('a', 1)])
(('d',), [('a', 1), ('b', 1), (None, 1)])
(('a',), [('d', 1), (None, 2), ('b', 2), ('c', 1)])
(('b',), [('d', 1), (None, 4), ('a', 2), ('c', 3)])
(('c',), [(None, 3), ('a', 1), ('b', 3)])
(('a', 'b'), [(None, 2), ('c', 1), ('d', 1)])
(('a', 'c'), [(None, 1), ('b', 1)])
(('b', 'c'), [(None, 3), ('a', 1)])
(('a', 'b', 'c'), [(None, 1)])
(('a', 'b', 'd'), [(None, 1)])


## 5. Map to Association Rules (15 points)
Finally, the last step of the algorithm is to create the association rules to perform the market basket analysis. The goal of this map function is to calculate the **confidence** level of buying a product knowing that there is already a set of products in the basket. Thus, the KEY of the subpattern is the set of products placed in the basket and, for each product present in the list of rules, i.e., in the VALUE, the confidence can be calculated as:

\begin{align*}
\frac{\text{number of times the product was bought together with KEY }}{\text{number of times the KEY appeared}}
\end{align*}

For the example given in the figure above, *coffee* was bought 20 times and, in 17 of them, *milk* was bought together. Then, the confidence level of buying *milk* knowing that *coffee* is in the basket is $\frac{17}{20} = 0.85$, which means that in 85% of the times the coffee was bought, milk was purchased as well.

Implement the **map_to_assoc_rules** function that calculates the confidence level for each subpattern.

For the toy dataset, the expected output is:
<div style="border:1px solid black;white-space: pre;font-size: 9pt; line-height: 1.1; background-color:#f2f2f2; height: auto; width: 60em; padding-left:5px">
(('a',), [('b', 1.0), ('c', 0.5), ('d', 0.5)])
(('a', 'b'), [('c', 0.5), ('d', 0.5)])
(('b',), [('a', 0.5), ('c', 0.75), ('d', 0.25)])
(('a', 'b', 'c'), [])
(('b', 'c'), [('a', 0.3333333333333333)])
(('a', 'c'), [('b', 1.0)])
(('c',), [('a', 0.3333333333333333), ('b', 1.0)])
(('a', 'b', 'd'), [])
(('b', 'd'), [('a', 1.0)])
(('a', 'd'), [('b', 1.0)])
(('d',), [('a', 1.0), ('b', 1.0)])
</div>



In [19]:
def map_to_assoc_rules(rule):
    #find number of occurences
    key = rule[0]
    values = rule[1]
    for elem in values:
        if elem[0] == None:
            total = elem[1]
            
    # divide every subpattern by nb of occurence        
    result = []
    for sp in values:
        if sp[0] is not None:
            result.append((sp[0], sp[1] / total))
    return (key, result)

assocRules = rules.map(map_to_assoc_rules)
for assoc_rule in assocRules.collect():
    print(assoc_rule)


(('a', 'd'), [('b', 1.0)])
(('b', 'd'), [('a', 1.0)])
(('d',), [('a', 1.0), ('b', 1.0)])
(('a',), [('d', 0.5), ('b', 1.0), ('c', 0.5)])
(('b',), [('d', 0.25), ('a', 0.5), ('c', 0.75)])
(('c',), [('a', 0.3333333333333333), ('b', 1.0)])
(('a', 'b'), [('c', 0.5), ('d', 0.5)])
(('a', 'c'), [('b', 1.0)])
(('b', 'c'), [('a', 0.3333333333333333)])
(('a', 'b', 'c'), [])
(('a', 'b', 'd'), [])


## Instacart dataset

With your MBA algorithm ready to be used, now it is time to work on a real dataset. For this second part of the TP, download the [instacart](https://www.instacart.com/datasets/grocery-shopping-2017) dataset and read its [description](https://gist.github.com/jeremystan/c3b39d947d9b88b3ccff3147dbcf6c6b) to understand how the dataset is structured. 

Before applying the developed algorithm on the instacart dataset you must first filter the transactions to be in the same format defined by your algorithm (one transaction per line). A very good Spark API to work with such type of structured data is the [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html#sql), a module that allows you to run SQL queries and/or work with [DataFrame](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame), a distributed collection of data grouped into named columns.

**If you are not familiar with SQL, it is recommended that you follow the [tutorial from W3Schools](https://www.w3schools.com/sql/) to learn the basics.** 

For example, the following code cells use the Spark SQL module initialized by SparkSession to read the orders from the *order_products__train.csv* and the order information from *orders.csv* to construct a dataframe that contains a list of all products ever purchased by each user.

In [26]:
# Initialize the SparkSession
from pyspark.sql import SparkSession
ss = SparkSession(sc)

# Reading the structured data

df_order_prod = ss.read.csv('instacart/order_products__train.csv', header=True, sep=',', inferSchema=True)
print('order_products__train.csv')
df_order_prod.show(5)

df_orders = ss.read.csv('instacart/orders.csv', header=True, sep=',', inferSchema=True)
print('orders.csv')
df_orders.show(5)


order_products__train.csv
+--------+----------+-----------------+---------+
|order_id|product_id|add_to_cart_order|reordered|
+--------+----------+-----------------+---------+
|       1|     49302|                1|        1|
|       1|     11109|                2|        1|
|       1|     10246|                3|        0|
|       1|     49683|                4|        0|
|       1|     43633|                5|        1|
+--------+----------+-----------------+---------+
only showing top 5 rows

orders.csv
+--------+-------+--------+------------+---------+-----------------+----------------------+
|order_id|user_id|eval_set|order_number|order_dow|order_hour_of_day|days_since_prior_order|
+--------+-------+--------+------------+---------+-----------------+----------------------+
| 2539329|      1|   prior|           1|        2|                8|                  null|
| 2398795|      1|   prior|           2|        3|                7|                  15.0|
|  473747|      1|   prior| 

###### Using SQL

In [56]:
df_order_prod.createOrReplaceTempView("order_prod") # creates a Table order_prod
df_orders.createOrReplaceTempView("orders") # creates a Table orders

df_ex = ss.sql('SELECT o.user_id, COLLECT_LIST(op.product_id) AS products' 
               ' FROM orders o '
               ' INNER JOIN order_prod op ON op.order_id = o.order_id'
               ' GROUP BY user_id ORDER BY o.user_id')

df_ex.show(5)

+-------+--------------------+
|user_id|            products|
+-------+--------------------+
|      1|[196, 25133, 3892...|
|      2|[22963, 7963, 165...|
|      5|[15349, 19057, 16...|
|      7|[12053, 47272, 37...|
|      8|[15937, 5539, 109...|
+-------+--------------------+
only showing top 5 rows



###### Using Dataframe

In [57]:
from pyspark.sql.functions import collect_list


df_ex = df_orders.join(df_order_prod, df_order_prod.order_id == df_orders.order_id, 'inner')\
.groupBy(df_orders.user_id).agg(collect_list(df_order_prod.product_id).alias('products'))\
.orderBy(df_orders.user_id)
                                                                                                                           
df_ex.show(5)

+-------+--------------------+
|user_id|            products|
+-------+--------------------+
|      1|[196, 25133, 3892...|
|      2|[22963, 7963, 165...|
|      5|[15349, 19057, 16...|
|      7|[12053, 47272, 37...|
|      8|[15937, 5539, 109...|
+-------+--------------------+
only showing top 5 rows



## 6. Bonus (5 points) 

To practice the use of Spark SQL module, create a query, using SQL or dataframe, to answer the following questions:

1. Who are the top 10 users with the biggest number of orders? (0.25) 
2. What are the top 10 most purchased products? (0.25)

In [58]:
# top 10 users with biggest number of orders

df_orders.createOrReplaceTempView("orders") # creates a Table orders

df_query1 = ss.sql('SELECT user_id, COUNT(*) AS n_orders FROM orders GROUP BY user_id ORDER BY n_orders desc')

df_query1.show(10)

+-------+--------+
|user_id|n_orders|
+-------+--------+
|  31118|     100|
|   7120|     100|
|   2387|     100|
|  29058|     100|
|   8779|     100|
| 171956|     100|
| 176469|     100|
| 185406|     100|
|  54844|     100|
| 167957|     100|
+-------+--------+
only showing top 10 rows



In [59]:
# top 10 most purchased products

df_order_prod.createOrReplaceTempView("order_prod") # creates a Table order_prod

df_query2 = ss.sql('SELECT product_id, COUNT(*) AS n_times_ordered FROM order_prod GROUP BY product_id ORDER BY n_times_ordered desc')

df_query2.show(10)


+----------+---------------+
|product_id|n_times_ordered|
+----------+---------------+
|     24852|          18726|
|     13176|          15480|
|     21137|          10894|
|     21903|           9784|
|     47626|           8135|
|     47766|           7409|
|     47209|           7293|
|     16797|           6494|
|     26209|           6033|
|     27966|           5546|
+----------+---------------+
only showing top 10 rows



## 7. Run MBA for the *training* set (25 points)

Using the orders from the *order_products__train.csv*, create a dataframe where each row contain just one column, the transaction, with the list of purchased products.

To convert a dataframe to RDD you can use *dataframe.rdd*. For example, for the df_ex dataframe, one could desire to only work with a RDD containing the first product of each user. To do so, it is enough to run the following code: 

In [60]:
def map_to_first_product(row): # row contains the variables of each row
    return row.products[0]

prods = df_ex.rdd.map(map_to_first_product)
for p in prods.take(5):
    print(p)

196
22963
15349
12053
15937


Now, create a query to construct the transactions and run locally on your computer

In [23]:
import time
# import cpuinfo

In [24]:
def map_to_patterns_real(transaction):

    # we limit ourselves to singletons, pairs and trios
    max_subset_size = 3
    
    for i in range(max_subset_size):
        tmp = list(itertools.combinations(transaction, i+1))
        for key, value in Counter(tmp).items():
            yield(key, value)

def run_MBA(transactions):
    
    # Map to patterns
    patterns = transactions.flatMap(map_to_patterns_real)    
    
    # Reduce patterns
    frequent_patterns = patterns.reduceByKey(reduce_patterns)
    
    # Map to subpatterns
    subpatterns = frequent_patterns.flatMap(map_to_subpatterns)

    # Reduce subpatterns
    rules = subpatterns.groupByKey().mapValues(list)

    # Map to association rules
    assocRules = rules.map(map_to_assoc_rules)
    
    return assocRules

In [63]:
print('to do: query the transactions')

def map_to_transactions(row):
    return row.transactions

df_order_prod.createOrReplaceTempView("order_prod") # creates a Table order_prod
df_orders.createOrReplaceTempView("orders") # creates a Table orders

# create query so that we have a table of only the transactions
df_ex = ss.sql('SELECT COLLECT_LIST(op.product_id) AS transactions' 
               ' FROM orders o '
               ' INNER JOIN order_prod op ON op.order_id = o.order_id'
               ' GROUP BY o.order_id')

transactions = df_ex.rdd.map(map_to_transactions)

print('run MBA algorithm')

start = time.time()
assocRules = run_MBA(transactions)

for ar in assocRules.take(1):
    print("Output for every item sold with one given item with association rule")
    print(ar[0])
    print(ar[1])
end = time.time()


to do: query the transactions
run MBA algorithm
Output for every item sold with one given item with association rule
(30827,)
[(30591, 0.005772005772005772), (28199, 0.008658008658008658), (19767, 0.004329004329004329), (2015, 0.001443001443001443), (35951, 0.015873015873015872), (33303, 0.002886002886002886), (48183, 0.001443001443001443), (21103, 0.001443001443001443), (7175, 0.001443001443001443), (15999, 0.001443001443001443), (1727, 0.001443001443001443), (19887, 0.002886002886002886), (31087, 0.001443001443001443), (7751, 0.002886002886002886), (33231, 0.001443001443001443), (45007, 0.002886002886002886), (4279, 0.001443001443001443), (9759, 0.001443001443001443), (34615, 0.001443001443001443), (49279, 0.001443001443001443), (30191, 0.001443001443001443), (9535, 0.001443001443001443), (33000, 0.008658008658008658), (16254, 0.001443001443001443), (7806, 0.001443001443001443), (39408, 0.011544011544011544), (47766, 0.012987012987012988), (43720, 0.004329004329004329), (29536, 0.001

Finally, repeat the same process but now using the Google Cloud Platform (GCP) that each team received access. All the instructions for creating a computing cluster with spark and how to submit a job will be explained in both sessions of the laboratory. In any case, the guide line to perform this task can be found [here](https://cloud.google.com/blog/big-data/2017/02/google-cloud-platform-for-data-scientists-using-jupyter-notebooks-with-apache-spark-on-google-cloud).

You should report here the runtime of each experiment as well the CPU configuration that you used to run locally.

In [18]:
# Local
print("When run locally : \n------------------")
print("Total time : %02dmin %02.3f s" % ((end-start)//60, (end-start)%60))
print("CPU INFO : %s" % cpuinfo.get_cpu_info()['brand'])

When run locally : 
------------------
Total time : 14min 50.969 s
CPU INFO : Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz


In [64]:
# On Google Cloud Platform
print("When run on GCP : \n-----------------")
print("Total time : %02dmin %02.3f s" % ((end-start)//60, (end-start)%60))


When run on GCP : 
-----------------
Total time : 36min 9.458 s


## 8. Run MBA for your custom dataset (25 points)

Each team will receive a custom file on its storage bucket that contains a set of *order_id*. For this last task, you must query both the *order_products__prior* and *order_products__train* files to construct your own set of transactions, i.e., search only for the order_ids you have received. You should report the number of transactions and unique products that you will work with.

Moreover, build a list of unique products appearing on the first 10 orders of your custom file and report the association rules when you have the product alone in the basket. In other words, after running your MBA algorithm, print the association rules where the KEY product (alone) is present in this file. **You should print the product's name, not its ID.**  
       
Once again, you should run this experiment using the GCP and report the execution time.

In [27]:
print('to do: query the transactions')

def map_to_transactions(row):
    return row.transactions

custom_file_name = "orders_team8.csv"

# Reading the structured data

df_order_prod_train = ss.read.csv('instacart/order_products__train.csv', header=True, sep=',', inferSchema=True)
df_order_prod_prior = ss.read.csv('instacart/order_products__prior.csv', header=True, sep=',', inferSchema=True)
df_custom = ss.read.csv('instacart/' + custom_file_name, header=True, sep=',', inferSchema=True)
df_orders = ss.read.csv('instacart/orders.csv', header=True, sep=',', inferSchema=True)
df_products = ss.read.csv('instacart/products.csv', header=True, sep=',', inferSchema=True)

df_order_prod_train.createOrReplaceTempView("order_prod_train")
df_order_prod_prior.createOrReplaceTempView("order_prod_prior")
df_custom.createOrReplaceTempView("custom")
df_orders.createOrReplaceTempView("orders") # creates a Table orders
df_orders.createOrReplaceTempView("products")

# combine order_prod_train and order_prod_prior
df_order_prod_combined = ss.sql('SELECT opt.order_id, opt.product_id FROM order_prod_train opt'
                                ' UNION'
                                ' SELECT opp.order_id, opp.product_id FROM order_prod_prior opp')
df_order_prod_combined.createOrReplaceTempView("order_prod_combined")

# select only orders where order_id is in our custom file
df_custom_transactions = ss.sql('SELECT opc.order_id, opc.product_id FROM order_prod_combined opc'
                                ' INNER JOIN custom cu ON opc.order_id = cu.order_id'
                                ' ORDER BY cu.order_id')
df_custom_transactions.createOrReplaceTempView("custom_transactions")

# Count number of transactions
df_n_transactions = ss.sql('SELECT COUNT (DISTINCT order_id) AS n_transactions'
                           ' FROM custom_transactions')
# df_n_transactions.show()

# Count number of unique products
df_n_unique_products = ss.sql('SELECT COUNT (DISTINCT product_id) as number_of_unique_products'
                              ' FROM custom_transactions')
# df_n_unique_products.show()

# build custom transactions to feed to MBA algorithm
df_custom_transactions = ss.sql('SELECT order_id, COLLECT_LIST(product_id) AS transactions'
                                ' FROM custom_transactions'
                                ' GROUP BY order_id')
df_custom_transactions.createOrReplaceTempView("custom_transactions")

transactions = df_custom_transactions.rdd.map(map_to_transactions)

to do: query the transactions


In [28]:
# building unique products list
df_kept_products = ss.sql('SELECT ct.transactions'
                          ' FROM custom_transactions ct'
                          ' INNER JOIN custom cu ON ct.order_id = cu.order_id'
                          ' LIMIT 1')
df_kept_products.createOrReplaceTempView('kept_products')

tmp = df_kept_products.collect()
kept_keys = []
for i in range(len(tmp)):
    for elem in tmp[i].transactions:
        if elem not in kept_keys:
            kept_keys.append((elem,))


In [29]:
# Creating id2name dictionary

id2name_file = 'instacart/products.csv'

with open(id2name_file) as f:
    
    # skip header
    f.readline()
    id2name = {}
    for line in f:
        line = line.strip().split(',')
        id2name[line[0]] = line[1]
        

In [30]:
print('run MBA algorithm')

start = time.time()
assocRules = run_MBA(transactions)

# use filter
for ar in assocRules.filter(lambda x: x[0] in kept_keys).collect():
# for ar in assocRules.take(1):    
    
    print(ar)

    print()
    item_name = id2name[ar[0][0]]
    print(item_name + ' : ')
    for i in range(len(ar[1])):
        other_item_name = id2name[ar[1][i][0]]
        print('\t' + other_item_name + str(ar[1][i][1]))
        
end = time.time()

run MBA algorithm
((26940,), [(27326, 0.0003341687552213868), (2238, 0.0003341687552213868), (37774, 0.0005012531328320802), (19634, 0.0001670843776106934), (13202, 0.0001670843776106934), (34786, 0.0006683375104427736), (29142, 0.0005012531328320802), (446, 0.0001670843776106934), (8358, 0.0003341687552213868), (46226, 0.001670843776106934), (40198, 0.002506265664160401), (47830, 0.0003341687552213868), (33462, 0.0001670843776106934), (48974, 0.0001670843776106934), (17902, 0.0018379281537176273), (15386, 0.0005012531328320802), (42110, 0.0026733500417710945), (9438, 0.000835421888053467), (26402, 0.0001670843776106934), (10782, 0.0003341687552213868), (46938, 0.0001670843776106934), (33554, 0.0001670843776106934), (738, 0.0001670843776106934), (37646, 0.026566416040100252), (29826, 0.0006683375104427736), (2086, 0.005346700083542189), (24854, 0.0001670843776106934), (22142, 0.0001670843776106934), (22270, 0.0001670843776106934), (8454, 0.0001670843776106934), (35682, 0.00016708437761

KeyError: 26940

In [1]:
# Locally

"""
Due to hardware constraints, we could not run the experiment locally, since the amount of files stored in /tmp/ would become too big (~50gb).
However, it took several hours (~4) to run the code before it crashed, so we can report this preliminary numbers. Please report to the GCP
experiments for the results.

"""

'\nDue to hardware constraints, we could not run the experiment locally, since the amount of files stored in /tmp/ would become too big (~50gb).\nHowever, it took several hours (~4) to run the code before it crashed, so we can report this preliminary numbers. Please report to the GCP\nexperiments for the results.\n\n'

In [47]:
# On Google Cloud Platform


"""
We ran the MBA algorithm for items in only one transaction and report the results for only one item 
in it because of the length of the output. We see that even for one transaction, the algorithm takes 
a lot of time to run on GCP (~4h).

"""
print("When run on GCP : \n-----------------")
print("Total time : %02dmin %02.3f s" % ((end-start)//60, (end-start)%60))


When run on GCP : 
-----------------
Total time : 234min 57.213 s


In [46]:
# Printing results for one item only.
import ast

url = "https://raw.githubusercontent.com/patricebechard/data_mining_inf6953i/master/lab3/results.txt"
filename = "results.txt"

from urllib.request import URLopener

testfile = URLopener()
testfile.retrieve(url, filename)

with open(filename) as f:

    results = f.readline().strip()

    results = ast.literal_eval(results)
    
    item_name = id2name[str(results[0][0])]
    print(item_name + ' : ')
    for i in range(len(results[1])):
        other_item_name = id2name[str(results[1][i][0])]
        print('\t' + other_item_name + ' ' + str(results[1][i][1]))

Organic Large Green Asparagus : 
	Organic Sprouted Spelt Flour 0.0003341687552213868
	Vitamin C Super Orange Dietary Supplement 0.0003341687552213868
	Artesian Sparkling Water 0.0005012531328320802
	Protein Lovers Breakfast Burrito 0.0001670843776106934
	Organic Grape Tomato 0.0001670843776106934
	Tikka Masala Simmer Sauce 0.0006683375104427736
	Natural Chicken & Apple Breakfast Sausage Patty 0.0005012531328320802
	Blueberry Pomegranate Acaí Cultured Goat Milk Kefir 0.0001670843776106934
	Uncooked Corn Tortillas 0.0003341687552213868
	Thick & Crispy Tortilla Chips 0.001670843776106934
	Blueberry Yoghurt 0.002506265664160401
	Watercress Greens Juice 0.0003341687552213868
	Almond & Apricot Bar 0.0001670843776106934
	Organic Asian Sesame Dressing 0.0001670843776106934
	Liquid Egg Whites 0.0018379281537176273
	Whole Grain White Corn Salted Tortilla Chips 0.0005012531328320802
	Organic Macaroni Shells & Real Aged Cheddar 0.0026733500417710945
	Honey Graham Sticks 0.000835421888053467
	Grape

	Apple Honeycrisp Organic 0.03308270676691729
	Vanilla Ice Cream Sandwich Cookies 0.0003341687552213868
	Pure Dark Brown Cane Sugar 0.0003341687552213868
	Brazilian Cheese Bread Original Cheddar and Parmesan 0.0001670843776106934
	Organic Stomach Ease Tea Bags 0.0001670843776106934
	Smarty Dish Pink Grapefruit Dishwasher Detergent Tabs 0.0003341687552213868
	Honey Whole Wheat Bread 0.0005012531328320802
	Body Wash and Shower Gel Citrus Scrub 0.0001670843776106934
	Toilet Bowl Cleaner - Emerald Cypress and Fir 0.0001670843776106934
	Diced Butternut Squash 0.000835421888053467
	Bac-Out Stain & Odor Remover 0.0001670843776106934
	Organic Sweet Pea Sprouts 0.0006683375104427736
	Organic Beef Broth 0.0020050125313283208
	Chai Rooibos Herbal Tea Bags 0.0003341687552213868
	Tarragon 0.0005012531328320802
	Organic Quinoa Squares Sweet Potato & Apple & Cinnamon 0.0005012531328320802
	Organic Black Peppercorns 0.0003341687552213868
	Organic Tomato Basil Sauce 0.0020050125313283208
	"Nuts & Spice

	Less Sodium Teriyaki Marinade & Sauce 0.0001670843776106934
	Organics Vitamin C Fortified 100% Apple Juice 0.0001670843776106934
	Whole Grain Lavash 0.0001670843776106934
	Gypsy Cold Care Herbal Tea 0.0001670843776106934
	"Veggie Juice -Apple 0.0001670843776106934
	Culinary Coconut Milk 0.0006683375104427736
	"Stock 0.0006683375104427736
	Green Energy Tea 0.0001670843776106934
	Organic Spring Mix Salad 0.008187134502923977
	Gummy Bears 0.0006683375104427736
	Window Cleaner with Vinegar 0.0003341687552213868
	EF California White Basmati Eco-Farmed 2 lb Rice 0.0001670843776106934
	Brie 0.0006683375104427736
	Rice Crusted Fish Fillets 0.0001670843776106934
	Thin Stackers Brown Rice Lightly Salted 0.0020050125313283208
	Organic Balance Vanilla Bean Protein Shake 0.0001670843776106934
	Organic Concord Grape Juice 0.0001670843776106934
	Porter & Spicy Brown Mustard 0.0001670843776106934
	Cauliflower 0.000835421888053467
	Cold Pressed Vanilla Cinnamon Agave Cashew Milk 0.0001670843776106934


	Organic Superfoods Carrot Rice Cakes 0.0001670843776106934
	Artichokes 0.0038429406850459483
	"Happy Tot Banana 0.0006683375104427736
	Organic Kale 0.0005012531328320802
	Organic Multigrain Waffles 0.0030075187969924814
	Spearmint + Lemongrass Hand Soap 0.0001670843776106934
	Disinfecting Bathroom Cleaner - Lemongrass Citrus 0.0001670843776106934
	Passion Fruit 0.0006683375104427736
	Organic Original Hummus 0.0010025062656641604
	Gluten Free Cheese Ravioli 0.0001670843776106934
	Celery 0.0005012531328320802
	Original Almond Breeze Almond Milk 0.0003341687552213868
	Chocolate Chip Cookie Dough Frozen Greek Yogurt Bars 0.0001670843776106934
	"Dulse 0.0001670843776106934
	Natural Spring Water 0.0035087719298245615
	Total 2% Greek Strained Yogurt with Cherry 5.3 oz 0.001670843776106934
	Blood Builder Multivitamin Tablets 0.0001670843776106934
	Organic Balsamic Vinaigrette Dressing 0.0006683375104427736
	Plain Cultured Goat Milk 0.0003341687552213868
	Organic Jalapeno Pepper 0.011361737677

	Green Tea Mochi Ice Cream Bonbons 0.0003341687552213868
	Original Rotisserie Chicken 0.0013366750208855473
	Dark Chocolate Calcium Supplement 0.0005012531328320802
	Organic Wild Blueberry Fruit Spread 0.0001670843776106934
	Capretta Greek Original Goat Yogurt 0.0005012531328320802
	Grassmilk Organic Fat Free Milk 0.0006683375104427736
	Dairy Free Lite Culinary Coconut Milk 0.0001670843776106934
	100% Recycled Facial Tissue 0.0006683375104427736
	Organic Rainier Cherries 0.0003341687552213868
	Organic Rainbow Chard Vegetable 0.006182121971595656
	Organic Unrefined Sesame Oil 0.0001670843776106934
	Organic Unsweetened Green Tea 0.0003341687552213868
	Grapefruit Sparkling Juice Beverage 0.0001670843776106934
	Organic Cilantro 0.0304093567251462
	Mirin Rice Cooking Wine 0.0006683375104427736
	Grab 'N Go® Cups & Lids 12 Ounce 0.0001670843776106934
	Gluten Free Old Fashioned Rolled Oats 0.003341687552213868
	Raspberry Cereal 0.0006683375104427736
	Maple Almond Butter 0.0006683375104427736
	

	Organic Yokids Lemonade/Blueberry Variety Pack Yogurt Squeezers Tubes 0.0011695906432748538
	Shishito Pepper 0.0001670843776106934
	Honey Wheat Enriched Bread 0.0001670843776106934
	Fresh Goat Cheese Classic 0.0001670843776106934
	Pumpkin & Spinach Stage 2 Baby Food 0.000835421888053467
	Organic Protein Unsweetened Vanilla Almond Milk 0.0005012531328320802
	Organic Stage 2 Spinach Lentil Brown Rice Baby Food 0.0001670843776106934
	"Snacks 0.0001670843776106934
	Comice Pear 0.0003341687552213868
	Traditional Lavash Flatbread 0.0005012531328320802
	Straws 0.0003341687552213868
	Organic Promise Autumn Wheat Cereal 0.0010025062656641604
	Brussels Sprouts 0.00685045948203843
	Crumbled Feta Cheese 0.0001670843776106934
	70% Dark Chocolate 0.0001670843776106934
	Jumbo Garlic 0.0001670843776106934
	All-In-One French Vanilla Nutritional Shake Sachet 0.0003341687552213868
	Almond Meal/Flour 0.0011695906432748538
	Gluten Free Quinoa Rotelle Pasta 0.0001670843776106934
	Hass Avocado 0.00200501253