In [0]:
!pip install kaggle

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
import os

os.environ["KAGGLE_USERNAME"] = "YOUR_KAGGLE_USERNAME"  # Your Kaggle account username
os.environ["KAGGLE_KEY"] = "YOUR_KAGGLE_API_KEY"  # API key generated from your Kaggle account settings

print("Kaggle credentials configured!")

Kaggle credentials configured!
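
Hard-coding credentials in a notebook cell makes them easy to leak. As a minimal sketch, assuming you have downloaded an API token file in the usual kaggle.json layout (a JSON object with "username" and "key" fields), the same two environment variables could be filled from that file instead. The helper name load_kaggle_credentials and the file location are hypothetical.

```python
import json
import os
from pathlib import Path


def load_kaggle_credentials(json_path):
    """Copy username/key from a kaggle.json-style file into the environment.

    Hypothetical helper: json_path is wherever you chose to store the token file.
    """
    creds = json.loads(Path(json_path).read_text())
    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
    # Return only the username so the key is never echoed in cell output
    return creds["username"]


# Example call (the path is an assumption, adjust to your setup):
# load_kaggle_credentials("/path/to/kaggle.json")
```

Returning only the username keeps the secret out of the notebook output if the cell's result is displayed.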


The cell below asks Spark on Databricks to create a schema (also known as a database) named ecommerce inside the workspace catalog, addressed as workspace.ecommerce.

The schema is created only if it doesn’t already exist. The essential meaning is:
➡️ “Make me a place to store tables for e-commerce data, unless it’s already there.”

In [0]:
spark.sql("""
CREATE SCHEMA IF NOT EXISTS workspace.ecommerce
""")

DataFrame[]

1. Creates a volume named ecommerce_data inside the schema workspace.ecommerce.
2. IF NOT EXISTS means Spark creates the volume only if it is missing; no error is raised if it already exists.

A volume stores files and non-tabular data (JSON, images, CSVs, and so on). Volumes differ from tables: tables manage structured rows and columns, while volumes manage file storage within the schema.
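
Files in a volume are addressed through a path of the form /Volumes/<catalog>/<schema>/<volume>, which is the directory the shell cells in this notebook cd into. A small sketch of that mapping, using a hypothetical helper named volume_path:

```python
from pathlib import PurePosixPath


def volume_path(three_part_name, *parts):
    """Turn 'catalog.schema.volume' into its /Volumes/... file path.

    Hypothetical helper for illustration; extra parts are appended as
    sub-directories or file names.
    """
    catalog, schema, volume = three_part_name.split(".")
    return str(PurePosixPath("/Volumes", catalog, schema, volume, *parts))


print(volume_path("workspace.ecommerce.ecommerce_data"))
print(volume_path("workspace.ecommerce.ecommerce_data", "2019-Oct.csv"))
```

Building the path from the three-part name keeps the SQL identifier and the file path in sync if the schema or volume is ever renamed.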

In [0]:
spark.sql("""
CREATE VOLUME IF NOT EXISTS workspace.ecommerce.ecommerce_data
""")

DataFrame[]

The lines of code below switch to the volume’s directory and download the specified Kaggle dataset into it using the Kaggle command-line tool.

In [0]:
%sh
cd /Volumes/workspace/ecommerce/ecommerce_data
kaggle datasets download -d mkechinov/ecommerce-behavior-data-from-multi-category-store

Dataset URL: https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store
License(s): copyright-authors
Downloading ecommerce-behavior-data-from-multi-category-store.zip to /Volumes/workspace/ecommerce/ecommerce_data






This block unpacks the dataset ZIP into the volume directory (the -o flag overwrites existing files without prompting) and lists the extracted files so you can work with them in Databricks.

In [0]:
%sh
cd /Volumes/workspace/ecommerce/ecommerce_data
unzip -o ecommerce-behavior-data-from-multi-category-store.zip
ls -lh


Archive:  ecommerce-behavior-data-from-multi-category-store.zip
  inflating: 2019-Nov.csv            
  inflating: 2019-Oct.csv            
total 18G
-rwxrwxrwx 1 spark-b9d90f51-07ce-47de-b54e-74 nogroup 8.4G Jan  9 15:37 2019-Nov.csv
-rwxrwxrwx 1 spark-b9d90f51-07ce-47de-b54e-74 nogroup 5.3G Jan  9 15:39 2019-Oct.csv
-rwxrwxrwx 1 spark-b9d90f51-07ce-47de-b54e-74 nogroup 4.3G Jan  9 15:36 ecommerce-behavior-data-from-multi-category-store.zip
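
The same unpack step could also be done from a Python cell with the standard-library zipfile module. This is a sketch rather than what the notebook runs: extract_archive is a hypothetical helper, and its delete_after flag is an assumed convenience for removing the archive once it has been extracted.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir, delete_after=False):
    """Extract every member of zip_path into dest_dir and return their names.

    Existing files with the same names are overwritten, similar to `unzip -o`.
    """
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(dest_dir)
    if delete_after:
        # Remove the archive only after extraction finished without errors
        Path(zip_path).unlink()
    return names


# Example call against the volume used in this notebook:
# extract_archive(
#     "/Volumes/workspace/ecommerce/ecommerce_data/"
#     "ecommerce-behavior-data-from-multi-category-store.zip",
#     "/Volumes/workspace/ecommerce/ecommerce_data",
# )
```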


This block removes the downloaded ZIP file after extraction to save space, and then shows the updated file list.

In [0]:
%sh
cd /Volumes/workspace/ecommerce/ecommerce_data
rm -f ecommerce-behavior-data-from-multi-category-store.zip
ls -lh


total 14G
-rwxrwxrwx 1 spark-b9d90f51-07ce-47de-b54e-74 nogroup 8.4G Jan  9 15:37 2019-Nov.csv
-rwxrwxrwx 1 spark-b9d90f51-07ce-47de-b54e-74 nogroup 5.3G Jan  9 15:39 2019-Oct.csv


In [0]:
%restart_python

Reads the November 2019 e-commerce dataset into a Spark DataFrame for further processing.


In [0]:
df_nov = spark.read\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .csv("/Volumes/workspace/ecommerce/ecommerce_data/2019-Nov.csv")


Reads the October 2019 e-commerce dataset into a Spark DataFrame for further processing.


In [0]:
df_oct = spark.read\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .csv("/Volumes/workspace/ecommerce/ecommerce_data/2019-Oct.csv")
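
With inferSchema enabled, Spark has to read through the data an extra time to work out column types, which is noticeable on multi-gigabyte files. As a sketch, the column types that printSchema() reports later in this notebook could instead be supplied up front as a DDL string. The names EVENT_SCHEMA_DDL and read_events are hypothetical, and this assumes the timestamps parse the same way with an explicit schema as they did during inference.

```python
# Column names and types copied from the printSchema() output in this notebook
EVENT_SCHEMA_DDL = (
    "event_time TIMESTAMP, event_type STRING, product_id INT, "
    "category_id BIGINT, category_code STRING, brand STRING, "
    "price DOUBLE, user_id INT, user_session STRING"
)


def read_events(spark, path):
    """Read one monthly CSV with a fixed schema instead of inferSchema."""
    return (
        spark.read
        .option("header", "true")
        .schema(EVENT_SCHEMA_DDL)  # DataFrameReader.schema accepts a DDL string
        .csv(path)
    )


# Example call in a notebook where `spark` is already defined:
# df_oct = read_events(spark, "/Volumes/workspace/ecommerce/ecommerce_data/2019-Oct.csv")
```

Passing the session in as an argument keeps the helper reusable for both monthly files.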



- `df_oct.count()` computes the total number of rows (events) in the DataFrame `df_oct`.
  The formatted print shows this count with commas for readability.
- The separator lines (`"="*60`) improve the output layout in the notebook.
- `df_oct.printSchema()` prints the structure of the DataFrame:
     • Column names
     • Data types
     • Nullable flags

This helps you understand how many events are in the October dataset and what the columns look like.
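
The comma formatting described above comes from Python's format specification (the `,` option in an f-string), not from Spark. A plain-Python sketch, using a hypothetical helper named format_event_count and the October count shown in this notebook's output:

```python
def format_event_count(label, count):
    """Build the summary line with a thousands separator."""
    return f"{label} - Total Events: {count:,}"


print(format_event_count("October 2019", 42448764))
print("=" * 60)  # separator line, 60 '=' characters wide
```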



In [0]:
print(f"October 2019 - Total Events: {df_oct.count():,}")
print("\n" + "="*60)
print("SCHEMA:")
print("="*60)
df_oct.printSchema()

October 2019 - Total Events: 42,448,764

SCHEMA:
root
 |-- event_time: timestamp (nullable = true)
 |-- event_type: string (nullable = true)
 |-- product_id: integer (nullable = true)
 |-- category_id: long (nullable = true)
 |-- category_code: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- price: double (nullable = true)
 |-- user_id: integer (nullable = true)
 |-- user_session: string (nullable = true)



In [0]:
# print("\n" + "="*60)
# print("SAMPLE DATA (First 5 rows):")
# print("="*60)
df_oct.show(5, truncate=False)

+-------------------+----------+----------+-------------------+-----------------------------------+--------+-------+---------+------------------------------------+
|event_time         |event_type|product_id|category_id        |category_code                      |brand   |price  |user_id  |user_session                        |
+-------------------+----------+----------+-------------------+-----------------------------------+--------+-------+---------+------------------------------------+
|2019-10-01 00:00:00|view      |44600062  |2103807459595387724|NULL                               |shiseido|35.79  |541312140|72d76fde-8bb3-4e00-8c23-a032dfed738c|
|2019-10-01 00:00:00|view      |3900821   |2053013552326770905|appliances.environment.water_heater|aqua    |33.2   |554748717|9333dfbd-b87a-4708-9857-6336556b0fcc|
|2019-10-01 00:00:01|view      |17200506  |2053013559792632471|furniture.living_room.sofa         |NULL    |543.1  |519107250|566511c2-e2e3-422b-b695-cf8e6e792ca8|
|2019-10-01 00:0

In [0]:
display(df_oct.limit(5))


event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
2019-10-01T00:00:00.000Z,view,44600062,2103807459595387724,,shiseido,35.79,541312140,72d76fde-8bb3-4e00-8c23-a032dfed738c
2019-10-01T00:00:00.000Z,view,3900821,2053013552326770905,appliances.environment.water_heater,aqua,33.2,554748717,9333dfbd-b87a-4708-9857-6336556b0fcc
2019-10-01T00:00:01.000Z,view,17200506,2053013559792632471,furniture.living_room.sofa,,543.1,519107250,566511c2-e2e3-422b-b695-cf8e6e792ca8
2019-10-01T00:00:01.000Z,view,1307067,2053013558920217191,computers.notebook,lenovo,251.74,550050854,7c90fc70-0e80-4590-96f3-13c02c18c713
2019-10-01T00:00:04.000Z,view,1004237,2053013555631882655,electronics.smartphone,apple,1081.98,535871217,c6bd7419-2748-4c56-95b4-8cec9ff8b80d


In [0]:
df_nov.printSchema()

root
 |-- event_time: timestamp (nullable = true)
 |-- event_type: string (nullable = true)
 |-- product_id: integer (nullable = true)
 |-- category_id: long (nullable = true)
 |-- category_code: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- price: double (nullable = true)
 |-- user_id: integer (nullable = true)
 |-- user_session: string (nullable = true)



In [0]:
print(f"November 2019 - Total Events: {df_nov.count():,}")


November 2019 - Total Events: 67,501,979


In [0]:
# Sample dataframe creation & testing

In [0]:
# Create a simple DataFrame of (brand, price) rows
sample_data = [("Kookaburra", 129), ("Gray-Nicolls", 109), ("SG", 99), ("GM", 89)]
sample_df = spark.createDataFrame(sample_data, ["brand", "price"])
sample_df.show()

+------------+-----+
|       brand|price|
+------------+-----+
|  Kookaburra|  129|
|Gray-Nicolls|  109|
|          SG|   99|
|          GM|   89|
+------------+-----+



In [0]:
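# Filter to view the cheapest bat

from pyspark.sql import functions as F  # namespaced import avoids shadowing Python's built-in min

# Compute the lowest price once, then keep every row that matches it
lowest_price = sample_df.agg(F.min("price")).first()[0]
sample_df.filter(F.col("price") == lowest_price).show()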



+-----+-----+
|brand|price|
+-----+-----+
|   GM|   89|
+-----+-----+
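
The filter above is a two-step pattern: find the minimum price, then keep every row whose price equals it. The same logic on a plain Python list, as a sketch with a hypothetical helper named cheapest_items, shows that ties would all be returned rather than just one row:

```python
def cheapest_items(rows):
    """Return every (brand, price) pair whose price equals the minimum."""
    lowest = min(price for _, price in rows)
    return [(brand, price) for brand, price in rows if price == lowest]


sample_data = [("Kookaburra", 129), ("Gray-Nicolls", 109), ("SG", 99), ("GM", 89)]
print(cheapest_items(sample_data))

# With a tie at the lowest price, both rows come back
print(cheapest_items(sample_data + [("Hypothetical", 89)]))
```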

