# Assignment 03 - Mining Frequent Itemsets using FP-Growth Algorithm
## 1. Preparation
### 1.1 Requirements

1. Operating System : Linux Mint 18.3 Sylvia
2. Apache Spark 2.4.0 Binary (https://spark.apache.org/downloads.html)
3. PySpark 2.4.0 (Apache Spark Python API)
4. Findspark 1.3.0 (Python library)
5. Jupyter Notebook (https://jupyter.org/install)

### 1.2 Description of Dataset
* Dataset's name : [Brazilian E-Commerce Public Dataset by Olist](https://www.kaggle.com/olistbr/brazilian-ecommerce)
* Description : a public Brazilian e-commerce dataset of orders made at Olist Store. The dataset has information on 100k orders made from 2016 to 2018 at multiple marketplaces in Brazil. Its features allow viewing an order from multiple dimensions: from order status, price, payment and freight performance to customer location, product attributes and finally reviews written by customers.
* There are nine data sources in this dataset, but I'll use only the two that are required for mining frequent itemsets: **olist_order_items_dataset.csv** and **olist_products_dataset.csv**
    
| Data Source | Number of Rows | Number of Columns |
|---|---|---|
| olist_customers_dataset.csv | 99.4k | 5 |
| olist_geolocation_dataset.csv | 1.00m | 5 |
| olist_order_items_dataset.csv | 113k | 7 |
| olist_order_payments_dataset.csv | 104k | 5 |
| olist_order_reviews_dataset.csv | 100.0k | 7 |
| olist_orders_dataset.csv | 99.4k | 8 |
| olist_products_dataset.csv | 33.0k | 9 |
| olist_sellers_dataset.csv | 3095 | 4 |
| product_category_name_translation.csv | 71 | 2 |

![data scheme](img/olistdata.png)

## 2. Spark Initialization

In [1]:
# Import findspark to make pyspark importable as a regular library
import findspark
findspark.init('/home/mocatfrio/spark') 

# /home/mocatfrio/spark is a symbolic link to /bin/spark-2.4.0-bin-hadoop2.7

In [2]:
# Import SparkSession
from pyspark.sql import SparkSession

# Create Spark Session
spark = SparkSession \
    .builder \
    .appName("Frequent Itemsets") \
    .getOrCreate()

In [3]:
# Print spark object ID
print(spark)

<pyspark.sql.session.SparkSession object at 0x7fba8906f7b8>


## 3. Load Dataset
### 3.1 Order items Dataset

In [9]:
# Load the dataset
df_order = spark.read.load("/home/mocatfrio/Documents/projects/big-data/datasets/brazilian_ecommerce/olist_order_items_dataset.csv", \
                     format="csv", sep=",", inferSchema="true", header="true")

In [10]:
# Print top 20 rows data
df_order.show()

+--------------------+-------------+--------------------+--------------------+-------------------+------+-------------+
|            order_id|order_item_id|          product_id|           seller_id|shipping_limit_date| price|freight_value|
+--------------------+-------------+--------------------+--------------------+-------------------+------+-------------+
|00010242fe8c5a6d1...|            1|4244733e06e7ecb49...|48436dade18ac8b2b...|2017-09-19 09:45:35|  58.9|        13.29|
|00018f77f2f0320c5...|            1|e5f2d52b802189ee6...|dd7ddc04e1b6c2c61...|2017-05-03 11:05:13| 239.9|        19.93|
|000229ec398224ef6...|            1|c777355d18b72b67a...|5b51032eddd242adc...|2018-01-18 14:48:30| 199.0|        17.87|
|00024acbcdf0a6daa...|            1|7634da152a4610f15...|9d7a1d34a50524090...|2018-08-15 10:10:18| 12.99|        12.79|
|00042b26cf59d7ce6...|            1|ac6c3623068f30de0...|df560393f3a51e745...|2017-02-13 13:57:51| 199.9|        18.14|
|00048cc3ae777c65d...|            1|ef92

In [11]:
# Count data rows
df_order.count()

112650

In [12]:
# inferSchema="true" makes Spark infer each column's data type (e.g. timestamp, double) instead of reading every column as string
df_order.printSchema()

root
 |-- order_id: string (nullable = true)
 |-- order_item_id: integer (nullable = true)
 |-- product_id: string (nullable = true)
 |-- seller_id: string (nullable = true)
 |-- shipping_limit_date: timestamp (nullable = true)
 |-- price: double (nullable = true)
 |-- freight_value: double (nullable = true)



In [125]:
# Register the dataframe as a SQL temporary view
df_order.createOrReplaceTempView("order")

### 3.2 Products Dataset

In [13]:
# Load the dataset
df_products = spark.read.load("/home/mocatfrio/Documents/projects/big-data/datasets/brazilian_ecommerce/olist_products_dataset.csv", \
                     format="csv", sep=",", inferSchema="true", header="true")

In [14]:
df_products.show()

+--------------------+---------------------+-------------------+--------------------------+------------------+----------------+-----------------+-----------------+----------------+
|          product_id|product_category_name|product_name_lenght|product_description_lenght|product_photos_qty|product_weight_g|product_length_cm|product_height_cm|product_width_cm|
+--------------------+---------------------+-------------------+--------------------------+------------------+----------------+-----------------+-----------------+----------------+
|1e9e8ef04dbcff454...|           perfumaria|                 40|                       287|                 1|             225|               16|               10|              14|
|3aa071139cb16b67c...|                artes|                 44|                       276|                 1|            1000|               30|               18|              20|
|96bd76ec8810374ed...|        esporte_lazer|                 46|                       250|    

In [15]:
df_products.count()

32951

In [16]:
df_products.printSchema()

root
 |-- product_id: string (nullable = true)
 |-- product_category_name: string (nullable = true)
 |-- product_name_lenght: integer (nullable = true)
 |-- product_description_lenght: integer (nullable = true)
 |-- product_photos_qty: integer (nullable = true)
 |-- product_weight_g: integer (nullable = true)
 |-- product_length_cm: integer (nullable = true)
 |-- product_height_cm: integer (nullable = true)
 |-- product_width_cm: integer (nullable = true)



In [20]:
df_products.createOrReplaceTempView("products")

## 4. Frequent Pattern Mining using FP-Growth Algorithm
Apache Spark provides a library for mining frequent itemsets with the FP-Growth algorithm. The first step of FP-Growth is to calculate item frequencies and identify the frequent items. FP-Growth then uses a suffix tree (FP-Tree) structure to encode the transactions without explicitly generating candidate sets, which are usually expensive to generate. Finally, the frequent itemsets are extracted from the FP-Tree. For a further explanation of FP-Growth, please visit http://hareenlaks.blogspot.com/2011/06/fp-tree-example-how-to-identify.html
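The two steps described above can be sketched in plain Python. This is only a minimal illustration on toy data, not Spark's implementation: the names `FPNode` and `build_fp_tree` and the `min_count` parameter are made up for this example. It counts item frequencies, drops infrequent items, and inserts each transaction into a tree so that transactions starting with the same frequent items share a path.

```python
from collections import Counter


class FPNode:
    """One node of the FP-tree: an item, its count, and its child nodes."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_count):
    # Pass 1: count how many transactions contain each item.
    freq = Counter(item for t in transactions for item in set(t))
    frequent = {item: n for item, n in freq.items() if n >= min_count}

    # Pass 2: insert each transaction, keeping only frequent items,
    # ordered by descending frequency (ties broken by item value).
    root = FPNode()
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root, frequent


# Toy transactions (hypothetical, not taken from the Olist data).
transactions = [[1, 2, 5], [1, 2, 3, 5], [1, 2]]
root, frequent = build_fp_tree(transactions, min_count=2)
print(root.children[1].count)  # 3: all three transactions start with item 1
```

Item 3 occurs only once, so it never enters the tree; the three transactions collapse into the single path 1 → 2 → 5 with counts 3, 3 and 2.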


### 4.1 Preprocess Data
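First, we need to preprocess the original data, because Spark's FP-Growth implementation reads each transaction from a single column that holds an array of items. The target layout is one row per transaction with two columns, id and items, for example: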

| id|       items|
|---|------------|
|  0|   [1, 2, 5]|
|  1|[1, 2, 3, 5]|
|  2|      [1, 2]|
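To make clear what "frequent itemsets" means for the table above, here is a small brute-force sketch in plain Python. It is deliberately not FP-Growth: it tests every candidate itemset, which only works for toy data. The function name `frequent_itemsets` and the 0.5 support threshold are assumptions chosen for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of itemsets whose support is at least min_support."""
    sets = [set(t) for t in transactions]
    n = len(sets)
    items = sorted(set().union(*sets))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            # Support count = number of transactions containing every item.
            count = sum(1 for s in sets if s.issuperset(candidate))
            if count / n >= min_support:
                result[candidate] = count
    return result


# The id/items table above, written as a dict of id -> items.
table = {0: [1, 2, 5], 1: [1, 2, 3, 5], 2: [1, 2]}
result = frequent_itemsets(list(table.values()), min_support=0.5)
for itemset, count in result.items():
    print(itemset, count)
```

With a minimum support of 0.5, an itemset must appear in at least two of the three transactions, so every itemset containing item 3 is discarded and seven itemsets remain, the largest being (1, 2, 5).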

In [108]:
# Assign a sequential integer id named "item_id" to each product using ROW_NUMBER()
query = spark.sql("SELECT `product_id`, ROW_NUMBER() OVER (ORDER BY `product_id`) AS `item_id` \
                    FROM products")

# Register the dataframe as a SQL temporary view
query.createOrReplaceTempView("products_prep")

query.show()

+--------------------+-------+
|          product_id|item_id|
+--------------------+-------+
|00066f42aeeb9f300...|      1|
|00088930e925c41fd...|      2|
|0009406fd7479715e...|      3|
|000b8f95fcb9e0096...|      4|
|000d9be29b5207b54...|      5|
|0011c512eb256aa0d...|      6|
|00126f27c81360368...|      7|
|001795ec6f1b187d3...|      8|
|001b237c0e9bb435f...|      9|
|001b72dfd63e9833e...|     10|
|001c5d71ac6ad696d...|     11|
|00210e41887c2a8ef...|     12|
|002159fe700ed3521...|     13|
|0021a87d4997a48b6...|     14|
|00250175f79f584c1...|     15|
|002552c0663708129...|     16|
|002959d7a0b0990fe...|     17|
|002af88741ba70c7b...|     18|
|002c6dab60557c48c...|     19|
|002d4ea7c04739c13...|     20|
+--------------------+-------+
only showing top 20 rows
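The numbering performed by the query above can be pictured with a plain-Python sketch: sort the ids and number them from 1. The helper name `assign_item_ids` and the shortened ids are hypothetical. One difference to keep in mind is that `ROW_NUMBER()` numbers every row, so it yields one id per product only if `product_id` is unique in the table, whereas the sketch removes duplicates explicitly. Also, a window without `PARTITION BY` makes Spark move all rows to a single partition, which is acceptable for a table of this size.

```python
def assign_item_ids(product_ids):
    """Map each distinct id to 1, 2, 3, ... in sorted order."""
    return {pid: n for n, pid in enumerate(sorted(set(product_ids)), start=1)}


# Hypothetical shortened ids, for illustration only.
ids = ["c7", "a1", "b4", "a1"]
mapping = assign_item_ids(ids)
print(mapping)  # {'a1': 1, 'b4': 2, 'c7': 3}
```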



In [131]:
# Count the items in each order and assign a sequential integer id named "id" to each order using ROW_NUMBER()
query1 = spark.sql("SELECT `order_id`, ROW_NUMBER() OVER (ORDER BY `order_id`) AS `id`, COUNT(`order_id`) AS `num_of_items` \
                    FROM order \
                    GROUP BY `order_id`")

# Register the dataframe as a SQL temporary view
query1.createOrReplaceTempView("order_prep")

query1.show()

+--------------------+---+------------+
|            order_id| id|num_of_items|
+--------------------+---+------------+
|00010242fe8c5a6d1...|  1|           1|
|00018f77f2f0320c5...|  2|           1|
|000229ec398224ef6...|  3|           1|
|00024acbcdf0a6daa...|  4|           1|
|00042b26cf59d7ce6...|  5|           1|
|00048cc3ae777c65d...|  6|           1|
|00054e8431b9d7675...|  7|           1|
|000576fe39319847c...|  8|           1|
|0005a1a1728c9d785...|  9|           1|
|0005f50442cb953dc...| 10|           1|
|00061f2a7bc09da83...| 11|           1|
|00063b381e2406b52...| 12|           1|
|0006ec9db01a64e59...| 13|           1|
|0008288aa423d2a3f...| 14|           2|
|0009792311464db53...| 15|           1|
|0009c9a17f916a706...| 16|           1|
|000aed2e25dbad2f9...| 17|           1|
|000c3e6612759851c...| 18|           1|
|000e562887b1f2006...| 19|           1|
|000e63d38ae8c00bb...| 20|           1|
+--------------------+---+------------+
only showing top 20 rows



In [132]:
# Join the order items with products_prep and order_prep, so that product_id is replaced by item_id and order_id by id
query2 = spark.sql("SELECT op.id, o.order_item_id, p.item_id \
                    FROM order AS o \
                    JOIN products_prep AS p \
                        ON o.product_id = p.product_id \
                    JOIN order_prep AS op \
                        ON o.order_id = op.order_id \
                    ORDER BY op.id ASC")

query2.show()

+---+-------------+-------+
| id|order_item_id|item_id|
+---+-------------+-------+
|  1|            1|   8629|
|  2|            1|  29598|
|  3|            1|  25668|
|  4|            1|  15323|
|  5|            1|  22080|
|  6|            1|  30848|
|  7|            1|  18182|
|  8|            1|  11123|
|  9|            1|   6385|
| 10|            1|   9013|
| 11|            1|  27628|
| 12|            1|  31102|
| 13|            1|  19743|
| 14|            2|   7080|
| 14|            1|   7080|
| 15|            1|  18083|
| 16|            1|   8238|
| 17|            1|  10354|
| 18|            1|  23217|
| 19|            1|  12287|
+---+-------------+-------+
only showing top 20 rows



In [133]:
# Register the dataframe as a SQL temporary view
# Note: this replaces the "order_prep" view that was registered from query1
query2.createOrReplaceTempView("order_prep")

In [148]:
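from pyspark.sql.functions import col, concat, array, coalesce

# Exploratory check: concat() casts both integer columns to strings and joins them
# (e.g. order_item_id 1 and item_id 8629 give "18629"); it does not build an array of items
query2.select(concat(col("order_item_id"), col("item_id"))).show()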

+------------------------------+
|concat(order_item_id, item_id)|
+------------------------------+
|18629                         |
|129598                        |
|125668                        |
|115323                        |
|122080                        |
|130848                        |
|118182                        |
|111123                        |
|16385                         |
|19013                         |
|127628                        |
|131102                        |
|119743                        |
|27080                         |
|17080                         |
|118083                        |
|18238                         |
|110354                        |
|123217                        |
|112287                        |
+------------------------------+
only showing top 20 rows
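The concatenated strings above are not yet the `items` array that FP-Growth needs. What remains is to group the rows by `id` and collect each order's `item_id` values into one list without duplicates, since Spark's FP-Growth rejects a transaction that repeats an item (order 14 in the join output lists item 7080 twice). The plain-Python sketch below only illustrates that grouping logic; the function name `to_transactions` is made up. In PySpark, grouping by `id` and aggregating with `collect_set` from `pyspark.sql.functions` should give the same shape.

```python
from collections import defaultdict


def to_transactions(rows):
    """Group (order id, item id) pairs into one deduplicated item list per order."""
    grouped = defaultdict(set)
    for order_id, item_id in rows:
        grouped[order_id].add(item_id)
    return {order_id: sorted(items) for order_id, items in grouped.items()}


# A few (id, item_id) pairs taken from the join output above.
rows = [(13, 19743), (14, 7080), (14, 7080), (15, 18083)]
transactions = to_transactions(rows)
print(transactions)  # {13: [19743], 14: [7080], 15: [18083]}
```

Using a set per order is the design choice that matters here: it removes the repeated item in order 14 before the data reaches the mining step.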



## References

* https://spark.apache.org/docs/2.3.0/ml-frequent-pattern-mining.html
* https://www.qubole.com/resources/pyspark-cheatsheet/