# Amazon Products Relevance Analysis

## Goal
Our goal is to find the relationship between any two products on Amazon. For example, a person who buys a package of coffee beans might also buy a set of filter papers, which suggests that the coffee beans are somehow connected to the filter papers.

## Algorithm
The analysis is based on [Combination](https://en.wikipedia.org/wiki/Combination). The products that a user bought in the same month are collected into a [Set](https://en.wikipedia.org/wiki/Set_(mathematics)) of product ids. We take every 2-element combination from the set and count how often each combination occurs. The set is sorted first so that pairs with identical elements in a different order, such as (A, B) and (B, A), are counted as the same combination. Next, we count for each product how many users bought it. Finally, we take the [Intersection](https://en.wikipedia.org/wiki/Intersection) of two products (the users who bought both) divided by their union (the users who bought either) as the relevance of the two products. This ratio is also known as the Jaccard index.
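
The pairing step can be sketched in plain Python. This is only a minimal illustration, not the Spark job itself; the helper name `pairs_for_basket` and the sample product ids are made up for the example.

```python
from itertools import combinations

def pairs_for_basket(product_ids):
    """Return every unordered pair of distinct products in one basket.

    Deduplicating and sorting first means a pair always comes out in the
    same order, so (A, B) and (B, A) are counted as one combination.
    """
    return list(combinations(sorted(set(product_ids)), 2))

# Two baskets with the same products in a different order (hypothetical ids)
first = pairs_for_basket(["B", "A", "C"])
second = pairs_for_basket(["C", "B", "A", "A"])
print(first == second)
```

Both calls yield the same three pairs, which is what lets a later counting step treat them as the same keys.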


## Example
Say person A left reviews for products {X, Y, Z} in the same month, and another person B left reviews for {K, X, Z}. The combinations for A are {(X, Y), (X, Z), (Y, Z)}, and the combinations for B are {(K, X), (K, Z), (X, Z)}. The counts c(X), c(Z), and c({X, Z}) are all 2, so the relevance between X and Z is
```
2 / (2 + 2 - 2) = 100%. 
```

The dividend is c({X, Z}), and the divisor is c(X) + c(Z) - c({X, Z}). So X and Z get a 100% relevance score.

We can compute the relevance between X and Y in the same way.
```
c({X, Y}) / (c(X) + c(Y) - c({X, Y})) = 1 / (2 + 1 - 1) = 50%
(c(X) = 2, c(Y) = 1, and c({X, Y}) = 1)
```
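
The example can be checked with a few lines of plain Python. This is a sketch for verification only; the variable names and the `relevance` helper are illustrative and are not part of the Spark code in this notebook.

```python
from collections import Counter
from itertools import combinations

# Baskets from the example: reviewers A and B in the same month
baskets = {"A": {"X", "Y", "Z"}, "B": {"K", "X", "Z"}}

product_count = Counter()  # c(P): number of baskets containing P
pair_count = Counter()     # c({P, Q}): number of baskets containing both

for products in baskets.values():
    ordered = sorted(products)
    product_count.update(ordered)
    pair_count.update(combinations(ordered, 2))

def relevance(p, q):
    """c({p, q}) / (c(p) + c(q) - c({p, q})), as a percentage."""
    both = pair_count[tuple(sorted((p, q)))]
    return 100.0 * both / (product_count[p] + product_count[q] - both)

print(relevance("X", "Z"))  # 100.0
print(relevance("X", "Y"))  # 50.0
```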

## Pre-process
Because Amazon's review data is huge (126 GB), we need to pre-process it first.
The pre-processing is introduced in ....




In [14]:
from itertools import combinations
def parse_line(line):
    infos = line.split('\t')
    # key: date + user_id, value: product_id
    return (infos[0] + infos[1], infos[2])

def distinct_products(products):
    # Count each product at most once per (date, user) group
    return [(p, 1) for p in set(products)]

def combine_products(products): # [products]
    products_list = list(set(products))
    products_list.sort()
    return combinations(products_list, 2)


### Group Data
The parsed data has the format date, user_id, product_id, separated by tabs. We concatenate the date and the user's id as the key and use the product's id as the value. Therefore, when we group the data by key, the values are the products that the user bought in the same month.

In [None]:
grouped_data = sc.textFile('/amazon-reviews/parsed_data/') \
    .map(lambda line: parse_line(line)).groupByKey()
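
To see what the grouping produces, the same idea can be sketched without Spark. The sample lines below are hypothetical (the real date format is not shown here), and a `defaultdict` stands in for `groupByKey`.

```python
from collections import defaultdict

def parse_line(line):
    # Assumes exactly three tab-separated fields: date, user_id, product_id
    date, user_id, product_id = line.split('\t')
    return (date + user_id, product_id)

# Hypothetical sample lines; the date format is an assumption
lines = [
    "2018-01\tu1\tX",
    "2018-01\tu1\tY",
    "2018-02\tu1\tZ",
    "2018-01\tu2\tX",
]

grouped = defaultdict(list)
for line in lines:
    key, product_id = parse_line(line)
    grouped[key].append(product_id)

for key, products in grouped.items():
    print(key, products)
```

Each key identifies one user in one month, so user `u1` ends up with two separate groups here.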

### Two Products Occurrence
After the data is grouped, we can count how often any two products occur together. Note that we only keep pairs that occur more than 1,000 times, because we want the analysis to be accurate and credible. Combinations that were purchased together too rarely are discarded.

In [15]:
two_products_occ = grouped_data\
    .flatMap(lambda row: combine_products(row[1]))\
    .map(lambda comb: (comb, 1))\
    .reduceByKey(lambda a, b: a + b)\
    .map(lambda reduced: (reduced[0][0], reduced[0][1], reduced[1]))\
    .filter(lambda row: row[2] > 1000) 

### Products Count
We also need to count how many times each product is purchased in order to calculate the relevance. Because the computation runs on a distributed system (RDD), a plain global product-count dictionary would not automatically be available on the other machines. This is where a **Broadcast** variable is useful: it lets us distribute the dict to all machines in the cluster.

In [16]:
products_count = grouped_data\
    .flatMap(lambda row: distinct_products(row[1]))\
    .reduceByKey(lambda a, b: a + b)

products_count = sc.broadcast(products_count.collectAsMap())

### Calculate the relevance

As introduced above, the equation for the relevance is
```
c({A, B}) / (c(A) + c(B) - c({A, B})) * 100%
```
where c(A) and c(B) come from the *products_count* dict and c({A, B}) comes from *two_products_occ*.

In [18]:
def cal_relationship(row):
    p1 = row[0]
    p2 = row[1]
    occurrence = row[2]
    products_count_dict = products_count.value
    p12_sum = products_count_dict[p1] + products_count_dict[p2]
    relevance = (occurrence / (p12_sum - occurrence)) * 100
    return (p1, p2, relevance)

products_relevance = two_products_occ \
    .map(lambda row: cal_relationship(row)) \
    .sortBy(lambda row: -row[2])


### Save to file
Here we can keep the analysis result as a text file on HDFS. 

In [19]:
products_relevance\
    .map(lambda row: "{}\t{}\t{:.2f}".format(row[0], row[1], row[2]))\
    .saveAsTextFile('/amazon-reviews-analysis/')
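
The scoring, sorting, and formatting steps can be tried out on a small hand-made input without a cluster. This is a minimal sketch under assumptions: `score_pair` is a hypothetical stand-in for `cal_relationship` that takes the count dict as an argument instead of reading a broadcast variable, and the counts are invented so they can be checked by hand.

```python
def score_pair(p1, p2, occurrence, counts):
    """Relevance of (p1, p2) as a percentage, given per-product counts."""
    union = counts[p1] + counts[p2] - occurrence
    return (p1, p2, occurrence / union * 100)

# Hypothetical counts, small enough to verify manually
counts = {"X": 2, "Y": 1, "Z": 2}
pairs = [("X", "Y", 1), ("X", "Z", 2), ("Y", "Z", 1)]

# Highest relevance first, then one tab-separated line per pair
scored = sorted((score_pair(p1, p2, n, counts) for p1, p2, n in pairs),
                key=lambda row: -row[2])
lines = ["{}\t{}\t{:.2f}".format(*row) for row in scored]
print("\n".join(lines))
```

Passing the dict explicitly keeps the function easy to test; in the Spark job the same values come from the broadcast variable.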

###  Metadata
The product ids in the analysis result tell us nothing about the products themselves, so we have to import the metadata and look up those ids.  
The metadata we use is also from [Amazon review data (2018)](http://deepyeti.ucsd.edu/jianmo/amazon/index.html), but we reduced it to only brand, category, product_id, and title for better performance.


In [21]:
metadata = spark.read.json('/amazon-meta/parsed_metadata/')

In [22]:
metadata.printSchema()

root
 |-- brand: string (nullable = true)
 |-- category: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- title: string (nullable = true)



### Analysis result
We take the top 100 most related combinations and evaluate them by their titles. In the output below, some combinations consist of nearly or fully identical titles (even though we did filter out pairs with the same product id).

In [25]:
from pyspark.sql.functions import col
top100 = products_relevance.take(100)
product_set = set()
for row in top100:
    product_set.add(row[0])
    product_set.add(row[1])

product_title = metadata.select('product_id', 'title').where(col('product_id').isin(product_set)).rdd \
                        .map(lambda row: (row['product_id'], row['title'])) \
                        .collectAsMap()

for row in top100:
    pro1 = product_title[row[0]]
    pro2 = product_title[row[1]]
    
    print("{}\t{}\t{}".format(pro1, pro2, row[2]))


The Man From Snowy River VHS	The Man From Snowy River	100.0
Moxeay V-Neck One Piece Bodysuit Bodycon Rompers Overall	Moxeay V-Neck One Piece Bodysuit Bodycon Rompers Overall	100.0
The Quiet Man VHS	The Quiet Man	100.0
Durango Men's 11&quot; Harness Boot	Durango Men's 11&quot; Harness Boot	100.0
Haggar Men's Work-To-Weekend No-Iron Pleat-Front Pant with Hidden Expandable Waist	Haggar Men's Work-To-Weekend No-Iron Pleat-Front Pant with Hidden Expandable Waist	100.0
The Big Lebowski VHS	The Big Lebowski VHS	100.0
Carhartt Men's Canvas Work Dungaree Pant B151	 Carhartt Men&#39;s Canvas Work Dungaree Pant B151	100.0
Capezio Daisy 205 Ballet Shoe (Toddler/Little Kid)	Capezio Daisy 205 Ballet Shoe (Toddler/Little Kid)	100.0
Russell Athletic Men Cotton Tanks	Russell Athletic Men&rsquo;s Essential Cotton Tank Top	100.0
CleanTools 42149 The Absorber Synthetic Drying Chamois, 27" x 17", Blue	CleanTools 51149 The Absorber Synthetic Drying Chamois, 27" x 17", Natural	100.0
Vans Herren Authentic Cor