# Exploring Market Basket Analysis with the FPGrowth Algorithm in PySpark

This notebook works through two exercises in PySpark: basic DataFrame operations on a small CSV file, followed by a market basket analysis of the Online Retail dataset using the FPGrowth algorithm.

In [12]:
from pyspark import SparkContext, SparkConf

In [13]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()

## 1. Load CSV File as DataFrame
With the SparkSession created, we loaded the file "test.txt" into a DataFrame using the spark.read.load() method. We specified the CSV format (format="csv"), a semicolon delimiter (sep=";"), and header="true" so that the first row supplies the column names.

In [14]:
mydf=spark.read.load("/content/test.txt", format="csv",sep=";",header="true")

## 2. Display the Content

In [16]:
mydf.show()

+------+----+----+
|  Name| Age|Mark|
+------+----+----+
|  Said|  23| 15 |
|  Sara|  19| 14 |
| Assia|2016|NULL|
|Moncef|  22|  18|
+------+----+----+



## 3. Print Schema
We printed the schema of the DataFrame to show the data types and nullability of the columns.

In [17]:
mydf.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Mark: string (nullable = true)
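
All three columns come back as strings because the file was read without schema inference (Spark's CSV reader offers an `inferSchema` option, which was not set here) and without an explicit schema. The sketch below is a plain-Python analogy rather than Spark code: it parses a semicolon-delimited text with the standard `csv` module to show that a delimited file carries only text until something casts it. The file contents are an assumption reconstructed from the table displayed earlier; the real test.txt may differ, for example in how the missing mark is written.

```python
import csv
import io

# Hypothetical file contents, reconstructed from the displayed table.
raw = "Name;Age;Mark\nSaid;23;15 \nSara;19;14 \nAssia;2016;\nMoncef;22;18\n"

# delimiter=";" plays the role of sep=";"; DictReader takes the first row as the header.
rows = list(csv.DictReader(io.StringIO(raw), delimiter=";"))

for row in rows:
    print(row)

# Every parsed value is still text, which mirrors the all-string schema.
all_strings = all(isinstance(value, str) for row in rows for value in row.values())
print(all_strings)
```

Note that the trailing space in "15 " survives parsing, just as it does in the Spark output.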



## 4. Select One Column
We selected the "Name" column from the DataFrame and displayed it using the select() and show() methods.

In [18]:
mydf.select("Name").show()

+------+
|  Name|
+------+
|  Said|
|  Sara|
| Assia|
|Moncef|
+------+



## 5. Select People Older Than 22
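We filtered the DataFrame to keep only the people with an "Age" greater than 22, using the filter() method, and displayed the result with show(). The comparison works even though "Age" is a string column, because Spark implicitly casts it to a number. The implausible value 2016 for Assia also passes the filter.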

In [19]:
mydf.filter(mydf['Age'] > 22).show()

+-----+----+----+
| Name| Age|Mark|
+-----+----+----+
| Said|  23| 15 |
|Assia|2016|NULL|
+-----+----+----+



## 6. Rename Columns
We renamed the columns of the DataFrame to "user_name", "user_age", and "user_Mark" using the toDF() method.

In [20]:
mydf=mydf.toDF("user_name", "user_age","user_Mark")

## 7. Transform Columns
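We added 2 to every value in the "user_age" column using the withColumn() method and displayed the updated DataFrame with show(). Because the column holds strings, Spark casts it to a double before the addition, which is why the results appear as 25.0 rather than 25.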

In [21]:
mydf = mydf.withColumn("user_age", mydf.user_age + 2)

In [22]:
mydf.show()

+---------+--------+---------+
|user_name|user_age|user_Mark|
+---------+--------+---------+
|     Said|    25.0|      15 |
|     Sara|    21.0|      14 |
|    Assia|  2018.0|     NULL|
|   Moncef|    24.0|       18|
+---------+--------+---------+
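
The decimal values in this output come from the implicit string-to-double cast that Spark applies before the addition. The following plain-Python analogy (not Spark code) reproduces the column by hand, using the age values from the earlier output, and contrasts it with an integer cast. In Spark, an explicit `cast("int")` on the column before adding 2 would be one way to keep whole numbers, though that step is not part of this notebook.

```python
# Ages as loaded from the CSV: text, not numbers (values taken from the earlier output).
ages_as_text = ["23", "19", "2016", "22"]

# Analogy for the implicit cast: convert to floating point, then add 2.
ages_plus_two = [float(age) + 2 for age in ages_as_text]
print(ages_plus_two)  # [25.0, 21.0, 2018.0, 24.0]

# Casting to int first keeps whole numbers.
ages_plus_two_int = [int(age) + 2 for age in ages_as_text]
print(ages_plus_two_int)  # [25, 21, 2018, 24]
```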



# Online Retail Dataset
## 1. Import the Dataset
We imported the Online Retail dataset from an Excel file and created a PySpark DataFrame.
The Online Retail dataset contains the transactions made between 1 December 2010 and 9 December 2011 by a UK-based online retail business. The company mainly sells gifts for all occasions, and many of its customers are wholesalers.
(http://archive.ics.uci.edu/ml/datasets/online+retail#)

Description of the fields in the dataset:
- InvoiceNo: transaction number
- StockCode: product code
- Description: product name
- Quantity: quantity of the product purchased in the transaction
- InvoiceDate: date and time of the transaction
- UnitPrice: unit price of the product
- CustomerID: customer number
- Country: country where the customer resides


In [23]:
import pandas as pd
excel_file_path = "/content/Online Retail.xlsx"
pandas_df = pd.read_excel(excel_file_path)

In [24]:
df = spark.createDataFrame(pandas_df)

In [25]:
df.show()

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|2010-12-01 08:26:00|     2.75|   17850.0|United Kingdom|
|   536365|   84029G|KNITTED UNION FLA...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|    22752|SET 7 BABUSHKA NE...|       2|2010-12-01 08:26:00|     7.65|   17850.0|United Kingdom|
|   536365|    21730|GLASS S

In [26]:
from pyspark.sql.functions import collect_list, expr

In [27]:
# Group the rows by invoice and keep each product description once per transaction
transactions_df = (
    df.groupBy("InvoiceNo")
    .agg(collect_list("Description").alias("items"))
    .select("InvoiceNo", expr("array_distinct(items) as items"))
)

In [28]:
transactions_df.show()

+---------+--------------------+
|InvoiceNo|               items|
+---------+--------------------+
|   536366|[HAND WARMER UNIO...|
|   536367|[ASSORTED COLOUR ...|
|   536371|[PAPER CHAIN KIT ...|
|   536374|[VICTORIAN SEWING...|
|   536375|[WHITE HANGING HE...|
|   536377|[HAND WARMER RED ...|
|   536384|[WOOD BLACK BOARD...|
|   536385|[SET 3 WICKER OVA...|
|   536386|[WHITE WIRE EGG H...|
|   536387|[CHILLI LIGHTS, L...|
|   536389|[CHRISTMAS LIGHTS...|
|   536392|[3 STRIPEY MICE F...|
|   536394|[FANCY FONT BIRTH...|
|   536395|[BLACK HEART CARD...|
|   536396|[WHITE HANGING HE...|
|   536398|[PACK OF 12 RED R...|
|   536399|[HAND WARMER RED ...|
|   536402|[PAPER CHAIN KIT ...|
|   536403|[HAND WARMER BIRD...|
|   536404|[HEART IVORY TREL...|
+---------+--------------------+
only showing top 20 rows
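
The grouping step turns one row per purchased product into one row per invoice, with the invoice's products collected into a list of distinct items. The removal of duplicates matters because Spark's FPGrowth expects the items within a transaction to be unique. The sketch below mimics that logic in plain Python (it is not Spark code), and the invoice numbers and product names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (InvoiceNo, Description) rows; invoice 1 lists "MUG" twice.
rows = [
    (1, "MUG"), (1, "PLATE"), (1, "MUG"),
    (2, "PLATE"),
    (3, "MUG"), (3, "SPOON"),
]

baskets = defaultdict(list)
for invoice_no, description in rows:
    # Skip repeats so each item appears once per basket, in first-seen order.
    if description not in baskets[invoice_no]:
        baskets[invoice_no].append(description)

print(dict(baskets))  # {1: ['MUG', 'PLATE'], 2: ['PLATE'], 3: ['MUG', 'SPOON']}
```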



## 2. Extract Frequent Itemsets and Associations
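We used the FPGrowth algorithm from Spark's MLlib library to extract frequent itemsets and association rules. We set the minimum support to 0.02, so an itemset must appear in at least 2% of the transactions to count as frequent, and the minimum confidence to 0.1, so a rule is kept only if its consequent appears in at least 10% of the transactions that contain its antecedent.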
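
To make the minimum support threshold concrete, the sketch below counts itemsets by brute force on a toy list of baskets and keeps those that reach the threshold. It is a plain-Python illustration of what "support" means, not the FP-growth algorithm itself (which avoids enumerating candidates by building a prefix tree). The baskets and the 0.5 threshold are assumptions chosen so that the result is easy to check by hand.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, each a set of distinct items.
transactions = [
    {"MUG", "PLATE"},
    {"MUG", "PLATE", "SPOON"},
    {"MUG"},
    {"PLATE", "SPOON"},
    {"MUG", "PLATE"},
]
min_support = 0.5  # illustrative threshold, much higher than the 0.02 used in the notebook

# Count every itemset of size 1 or 2 that occurs in some basket.
counts = Counter()
for basket in transactions:
    for size in (1, 2):
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1

# Support is the share of baskets containing the itemset.
n = len(transactions)
frequent = {
    itemset: count / n
    for itemset, count in counts.items()
    if count / n >= min_support
}
print(frequent)
```

Here "MUG" and "PLATE" each appear in 4 of 5 baskets and the pair appears in 3 of 5, so all three pass, while "SPOON" (2 of 5) does not.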

In [29]:
from pyspark.ml.fpm import FPGrowth

In [30]:
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.02, minConfidence=0.1)

In [31]:
model = fpGrowth.fit(transactions_df)
frequent_itemsets = model.freqItemsets

In [32]:
frequent_itemsets.show()

+--------------------+----+
|               items|freq|
+--------------------+----+
|[RED RETROSPOT CA...| 581|
|[PLASTERS IN TIN ...| 709|
|            [Manual]| 518|
|[WORLD WAR 2 GLID...| 540|
|[HOT WATER BOTTLE...| 663|
|[VINTAGE HEADS AN...| 611|
|[GINGERBREAD MAN ...| 672|
|[72 SWEETHEART FA...| 619|
|[ASSORTED COLOURS...| 522|
|[IVORY KITCHEN SC...| 724|
|[LOVEBIRD HANGING...| 547|
|[CHARLOTTE BAG SU...| 894|
|[ALARM CLOCK BAKE...| 799|
|[BOX OF 24 COCKTA...| 590|
|[LUNCH BAG APPLE ...|1049|
|    [POPCORN HOLDER]| 839|
|[DOORMAT NEW ENGL...| 602|
|[VINTAGE SNAP CARDS]| 929|
|[PACK OF 72 RETRO...|1334|
|[LUNCH BAG SUKI D...|1103|
+--------------------+----+
only showing top 20 rows



## 3. Association Rules
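We then retrieved the association rules from the fitted model through its associationRules attribute and displayed them. Each rule consists of an antecedent and a consequent, along with the rule's confidence, lift, and support.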
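
The three rule metrics can be computed by hand on a small example. The sketch below uses the standard definitions: support is the share of baskets containing both sides of the rule, confidence is the share of antecedent baskets that also contain the consequent, and lift is the confidence divided by the consequent's own support. It is plain Python on made-up baskets, not Spark code; the helper `count` is hypothetical.

```python
# Hypothetical baskets, each a set of distinct items.
transactions = [
    {"MUG", "PLATE"},
    {"MUG", "PLATE", "SPOON"},
    {"MUG"},
    {"PLATE", "SPOON"},
    {"MUG", "PLATE"},
]
n = len(transactions)


def count(items):
    """Number of baskets that contain every item in `items`."""
    return sum(1 for basket in transactions if items <= basket)


antecedent = {"MUG"}
consequent = {"PLATE"}

both = count(antecedent | consequent)         # baskets with MUG and PLATE: 3
support = both / n                            # 3 / 5
confidence = both / count(antecedent)         # 3 / 4
lift = confidence / (count(consequent) / n)   # 0.75 / 0.8, roughly 0.94

print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift near 1 means the two items co-occur about as often as independence would predict. The much larger lift values in the output, such as roughly 19.7 for the regency teacup pair, indicate items that are bought together far more often than chance alone would suggest.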

In [33]:
association_rules=model.associationRules

In [34]:
association_rules.show()

+--------------------+--------------------+-------------------+------------------+--------------------+
|          antecedent|          consequent|         confidence|              lift|             support|
+--------------------+--------------------+-------------------+------------------+--------------------+
|[LUNCH BAG SUKI D...|[LUNCH BAG RED RE...| 0.5113327289211242|  8.24114354639522|0.021776061776061777|
|[LUNCH BAG SUKI D...|[LUNCH BAG  BLACK...| 0.4696282864913871|  9.39256572982774|                0.02|
|[LUNCH BAG WOODLAND]|[LUNCH BAG RED RE...| 0.5130183220829315| 8.268310231454839| 0.02054054054054054|
|[LUNCH BAG PINK P...|[LUNCH BAG RED RE...| 0.5522522522522523| 8.900643020120308|0.023667953667953667|
|[LUNCH BAG PINK P...|[LUNCH BAG  BLACK...| 0.4927927927927928| 9.855855855855856| 0.02111969111969112|
|[PINK REGENCY TEA...|[ROSES REGENCY TE...| 0.8524844720496895| 19.71370341614907|0.021196911196911198|
|[JUMBO STORAGE BA...|[JUMBO SHOPPER VI...|0.43713572023313907| 