# Commerce clickstream ML prediction
#### Dataset download > [Ad impressions with clicks dataset](https://www.kaggle.com/c/avazu-ctr-prediction/data)

The use case here is a commerce company that has an ecommerce website as well as traditional retail stores. They want to analyse their online clickstream data to better understand their customers. We will use a sample clickstream dataset from the data science website Kaggle. We start with ingestion and exploration of the data, then create features and train and evaluate the ML model. We also join this data with the Dynamics products table to analyse whether products influence the model's result. The goal of this workflow is to create a machine learning model that, given a new ad impression, predicts whether or not there will be a click. We will also explore the features to see which ones influence the prediction most. Because the dataset records the historical outcome (click or no click) for every impression, we use supervised learning, which relies on labelled historical data to build a model that predicts the result of the next observation.

Clickstream data is data about how users interact with your ecommerce websites: which ads they click, which products they view, and which pages they spend the most time on. It is behavioural data that can give you insights into your products and customers so you can better market to your customer base.

The notebook is written in PySpark and executed on Synapse. 

Note: The train.csv file in the Kaggle download has 40 million rows and is about 6 GB uncompressed. Excel only loads about 1 million rows (its limit is 1,048,576 rows per sheet), and since I wanted to add a product column, I saved a smaller set of about 1 million rows from Excel. I filled the new column with random product numbers taken from Dynamics so that the data can be joined to the products table.
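The sampling step described in the note was done by hand in Excel. As a rough sketch of the same idea in plain Python, the function below copies the first rows of a CSV and appends a randomly chosen product id to each one. The function name, the tiny in-memory input, and the position of the new column (appended at the end) are assumptions for illustration only; the product ids are ones that appear later in this notebook.

```python
import csv
import io
import random

def sample_with_products(src, dst, max_rows, product_ids, seed=42):
    """Copy the header and the first max_rows data rows from src to dst,
    appending a randomly chosen product id to every row."""
    rng = random.Random(seed)
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)
    writer.writerow(header + ["product"])
    written = 0
    for row in reader:
        if written >= max_rows:
            break
        writer.writerow(row + [rng.choice(product_ids)])
        written += 1
    return written

# Tiny in-memory stand-in for train.csv (values are made up)
raw = io.StringIO("id,click,hour\n1,0,14102100\n2,1,14102100\n3,0,14102101\n")
out = io.StringIO()
n = sample_with_products(raw, out, max_rows=2, product_ids=["M0001", "D0002", "M0010"])
print(n)               # 2 data rows written
print(out.getvalue())  # header plus 2 rows, each with a product id
```

Because the ids are assigned at random, any relationship between product and click in the sampled file is an artefact of the random assignment rather than real customer behaviour.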

In [6]:
# Reading clicks csv files in a dataframe
file_path = "abfss://commercedata@salabcommercedatalake.dfs.core.windows.net/clickstream-ad-ML/adtech/impression/csv/train_1M_p.csv"
df_clicks = spark.read.csv(file_path, header=True, inferSchema=True)
display(df_clicks.limit(10))

In [7]:
df_clicks.count()

1048575

In [8]:
df_clicks.printSchema()

root
 |-- id: double (nullable = true)
 |-- click: integer (nullable = true)
 |-- hour: integer (nullable = true)
 |-- C1: integer (nullable = true)
 |-- banner_pos: integer (nullable = true)
 |-- site_id: string (nullable = true)
 |-- site_domain: string (nullable = true)
 |-- site_category: string (nullable = true)
 |-- product: string (nullable = true)
 |-- app_id: string (nullable = true)
 |-- app_domain: string (nullable = true)
 |-- app_category: string (nullable = true)
 |-- device_id: string (nullable = true)
 |-- device_ip: string (nullable = true)
 |-- device_model: string (nullable = true)
 |-- device_type: integer (nullable = true)
 |-- device_conn_type: integer (nullable = true)
 |-- C14: integer (nullable = true)
 |-- C15: integer (nullable = true)
 |-- C16: integer (nullable = true)
 |-- C17: integer (nullable = true)
 |-- C18: integer (nullable = true)
 |-- C19: integer (nullable = true)
 |-- C20: integer (nullable = true)
 |-- C21: integer (nullable = true)

In [9]:
display(df_clicks.describe())

In [10]:
# create a sql view
df_clicks.createOrReplaceTempView("vw_clicks")

In [13]:
%%sql 
describe vw_clicks

<Spark SQL result set with 25 rows and 3 fields>

In [14]:
# Reading product csv files in a dataframe
df_product= spark.read.format("csv").option("header",False).load("abfss://dynamics365-financeandoperations@salabcommercedatalake.dfs.core.windows.net/d365commerce.sandbox.operations.dynamics.com/Tables/SupplyChain/ProductInformationManagement/Main/EcoResProduct/ECORESPRODUCT_00001.csv")

display(df_product.limit(10))

In [15]:
# select only relevant columns and create a new dataframe
df_productSmall =  df_product.selectExpr(
    '_c12 AS ProductId',
    '_c16 AS ProductName')

display(df_productSmall.limit(10))

In [16]:
# create a view
df_productSmall.createOrReplaceTempView("vw_Products")

In [17]:
%%sql
select * from vw_Products limit 10

<Spark SQL result set with 10 rows and 2 fields>

In [18]:
%%sql
-- join clicks and products view
select s.*, p.ProductName as product_name
                     from vw_clicks s 
                     left join vw_Products p on s.product = p.ProductId limit 10

<Spark SQL result set with 10 rows and 26 fields>

Next, let's do some exploratory data analysis (EDA): we analyse relationships between features to get a sense of what could be influencing someone to click an ad.

In [19]:
%%sql 
-- different banner positions of ads, i.e. where they are placed on a page. We can see 7 types
select banner_pos, count(1)
from vw_clicks
group by 1 order by 1

<Spark SQL result set with 7 rows and 2 fields>

In [20]:
%%sql 
-- total number of clicks vs no clicks for each banner pos
select banner_pos,
sum(case when click = 1 then 1 else 0 end) as click,
sum(case when click = 0 then 1 else 0 end) as no_click
from vw_clicks group by 1 order by 1

<Spark SQL result set with 7 rows and 3 fields>

In [21]:
%%sql 
-- CTR (click-through rate) is the number of clicks that your ad receives divided by the number of times that your ad is shown: clicks ÷ impressions = CTR
-- CTR value for each banner pos. Number 3 is empty, which means that position is never clicked. It could be faulty data too. The position with the highest CTR is a popular one.
select banner_pos,
sum(case when click = 1 then 1 else 0 end) / (count(1) * 1.0) as CTR
from vw_clicks group by 1 order by 1

<Spark SQL result set with 7 rows and 2 fields>
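To make the CTR arithmetic in the query above concrete, here is a minimal plain-Python sketch of the same grouping (clicks ÷ impressions per group). The function name and the six sample rows are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict

def ctr_by_group(rows, key):
    """Return {group: clicks / impressions} for an iterable of dict rows."""
    clicks = defaultdict(int)
    impressions = defaultdict(int)
    for row in rows:
        impressions[row[key]] += 1
        clicks[row[key]] += row["click"]
    return {g: clicks[g] / impressions[g] for g in impressions}

# Made-up impressions: banner_pos 0 is shown 4 times (1 click),
# banner_pos 1 is shown 2 times (1 click)
rows = [
    {"banner_pos": 0, "click": 0},
    {"banner_pos": 0, "click": 1},
    {"banner_pos": 0, "click": 0},
    {"banner_pos": 0, "click": 0},
    {"banner_pos": 1, "click": 1},
    {"banner_pos": 1, "click": 0},
]
ctr = ctr_by_group(rows, "banner_pos")
print(ctr)  # {0: 0.25, 1: 0.5}
```

Note that in this sketch a position with no impressions at all is simply absent from the result, while a position that is shown but never clicked gets a CTR of 0.0. That distinction is worth checking when a group looks empty.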

In [22]:
%%sql 
-- different kinds of devices used
-- Device type 1 is most used by people who visit the site. 
select device_type, count(1)
from vw_clicks
group by 1 order by 1

<Spark SQL result set with 4 rows and 2 fields>

In [23]:
%%sql 
-- total number of clicks vs no clicks for each device
-- device type 1 is the most used, but it also has the highest number of no-clicks
select device_type,
sum(case when click = 1 then 1 else 0 end) as click,
sum(case when click = 0 then 1 else 0 end) as no_click
from vw_clicks group by 1 order by 1

<Spark SQL result set with 4 rows and 3 fields>

In [24]:
%%sql 
-- CTR value for each device type. Number 4 is lowest and number 0 is highest, so the best chance of a click is with device type 0. For number 4, maybe the company should stop showing ads and save some money.
select device_type,
sum(case when click = 1 then 1 else 0 end) / (count(1) * 1.0) as CTR
from vw_clicks group by 1 order by 1

<Spark SQL result set with 4 rows and 2 fields>

In [25]:
%%sql 
-- product M0001 is really popular in this clickstream dataset, so customers are spending a lot of time looking at that product. Next are D0002 and M0010.
select product, count(1) as count
from vw_clicks
group by 1 having count > 200 order by count desc

<Spark SQL result set with 15 rows and 2 fields>

In [26]:
%%sql 
-- total number of clicks vs no clicks for each product
-- M0010 gets the highest number of clicks
select product,
sum(case when click = 1 then 1 else 0 end) as click,
sum(case when click = 0 then 1 else 0 end) as no_click
from vw_clicks group by 1 order by 3 desc

<Spark SQL result set with 21 rows and 3 fields>

In [27]:
%%sql 
-- CTR of different products. M0006 has the highest CTR: percentage-wise this product gets the most clicks, 30%
select product,
sum(case when click = 1 then 1 else 0 end) / (count(1) * 1.0) as CTR
from vw_clicks group by 1 order by 2 desc

<Spark SQL result set with 21 rows and 2 fields>

In [28]:
%%sql 
select substr(hour, 7) as hour, 
count(1)
from vw_clicks 
group by 1 order by 1

<Spark SQL result set with 6 rows and 2 fields>

In [29]:
%%sql
-- total number of clicks vs no clicks for hour of day
select substr(hour, 7) as hour,
sum(case when click = 1 then 1 else 0 end) as click,
sum(case when click = 0 then 1 else 0 end) as no_click
from vw_clicks group by 1 order by 1

<Spark SQL result set with 6 rows and 3 fields>

In [30]:
%%sql 
select substr(hour, 7) as hour,
sum(case when click = 1 then 1 else 0 end) / (count(1) * 1.0) as CTR
from vw_clicks group by 1 order by 1

<Spark SQL result set with 6 rows and 2 fields>
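The three hourly queries above rely on `substr(hour, 7)` to pull the hour of day out of the `hour` column. Assuming the column follows the YYMMDDHH layout given in the Kaggle data description (for example `14102100` for 21 October 2014, 00:00), the following plain-Python sketch shows what that expression extracts. The helper name `split_hour` is hypothetical.

```python
from datetime import datetime

def split_hour(raw_hour):
    """Split a YYMMDDHH value into (ISO date, hour-of-day string)."""
    text = str(raw_hour)
    parsed = datetime.strptime(text, "%y%m%d%H")
    # text[6:] is the Python equivalent of SQL substr(hour, 7),
    # because SQL string positions start at 1 and Python indexes at 0
    return parsed.date().isoformat(), text[6:]

print(split_hour(14102100))  # ('2014-10-21', '00')
print(split_hour(14102105))  # ('2014-10-21', '05')
```

The extracted hour is a zero-padded string, so ordering by it sorts the hours correctly even though it is not a number.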

In [31]:
%%sql 

select count(1) as total,

count(distinct C1) as C1,
count(distinct banner_pos) as banner_pos,
count(distinct site_id) as site_id,
count(distinct site_domain) as site_domain,
count(distinct site_category) as site_category,
count(distinct product) as product,
count(distinct app_id) as app_id,
count(distinct app_domain) as app_domain,
count(distinct app_category) as app_category,
count(distinct device_id) as device_id,
count(distinct device_ip) as device_ip,
count(distinct device_model) as device_model,
count(distinct device_type) as device_type,
count(distinct device_conn_type) as device_conn_type,
count(distinct C14) as C14,
count(distinct C15) as C15,
count(distinct C16) as C16,
count(distinct C17) as C17,
count(distinct C18) as C18,
count(distinct C19) as C19,
count(distinct C20) as C20,
count(distinct C21) as C21

from vw_clicks

<Spark SQL result set with 1 rows and 23 fields>

In [32]:
display(df_clicks.describe())

In [33]:
# Drop site_category column
# It has a 1-to-1 mapping with our product column, so the two are highly correlated. We want to avoid correlated features and use features that have no bearing on each other to get the best prediction.
df_clicks1 = df_clicks.drop('site_category')
df_clicks1.printSchema()

root
 |-- id: double (nullable = true)
 |-- click: integer (nullable = true)
 |-- hour: integer (nullable = true)
 |-- C1: integer (nullable = true)
 |-- banner_pos: integer (nullable = true)
 |-- site_id: string (nullable = true)
 |-- site_domain: string (nullable = true)
 |-- product: string (nullable = true)
 |-- app_id: string (nullable = true)
 |-- app_domain: string (nullable = true)
 |-- app_category: string (nullable = true)
 |-- device_id: string (nullable = true)
 |-- device_ip: string (nullable = true)
 |-- device_model: string (nullable = true)
 |-- device_type: integer (nullable = true)
 |-- device_conn_type: integer (nullable = true)
 |-- C14: integer (nullable = true)
 |-- C15: integer (nullable = true)
 |-- C16: integer (nullable = true)
 |-- C17: integer (nullable = true)
 |-- C18: integer (nullable = true)
 |-- C19: integer (nullable = true)
 |-- C20: integer (nullable = true)
 |-- C21: integer (nullable = true)

In [34]:
# extract exact hour from hour column into a new hr column
# we will add hr as a new feature
df_clicks1 =  df_clicks1.selectExpr("*",
    'substr(hour, 7) as hr')

display(df_clicks1.limit(10))

In [35]:
from pyspark.sql.functions import countDistinct

# Split the column names by Spark data type
strCols = [name for name, dtype in df_clicks1.dtypes if dtype == 'string']
intCols = [name for name, dtype in df_clicks1.dtypes if dtype == 'int']

# Build (column, distinct count) pairs, sorted by count descending.
# collect()[0][0] reads the first field of the single result row.
strColsCount = sorted(((c, df_clicks1.select(countDistinct(c)).collect()[0][0]) for c in strCols), key=lambda x: x[1], reverse=True)
intColsCount = sorted(((c, df_clicks1.select(countDistinct(c)).collect()[0][0]) for c in intCols), key=lambda x: x[1], reverse=True)

In [36]:
# distinct counts for str columns
display(strColsCount)

In [37]:
# distinct counts for int columns
display(intColsCount)

The code below is adapted from Databricks' official documentation. The original indexes each categorical column using StringIndexer and then converts the indexed categories into one-hot encoded variables, so that the binary vectors are appended to the end of each row. Here we keep the StringIndexer step but skip the one-hot encoding. We use StringIndexer again to encode our labels to label indices. Next, we use VectorAssembler to combine all the feature columns into a single vector column.

Once we have familiarized ourselves with our data, we proceed to the machine learning phase, where we convert our data into features for input to a machine learning algorithm and produce a trained model with which we can predict. Because Spark MLlib algorithms take a column of feature vectors of doubles as input, a typical feature engineering workflow includes:

1. Identifying numeric and categorical features
2. String indexing
3. Assembling them all into a sparse vector

In our use of GBTClassifier, we use StringIndexer but do not apply the One Hot Encoder (OHE).

When using StringIndexer alone, categorical features are kept as k-ary categorical features, and a tree node tests whether feature X has a value in {subset of categories}. With StringIndexer + OHE, the categorical features are turned into a bunch of binary features, and a tree node tests whether feature X = category a vs. all the other categories (one-vs-rest test).

When using only StringIndexer, the benefits include:

1. There are fewer features to choose from
2. Each node’s test is more expressive than with binary 1-vs-rest features

Therefore, for tree-based methods, it is preferable not to use OHE, as it gives a less expressive test and takes up more space. But for non-tree-based algorithms such as linear regression, you must use OHE, or else the model will impose a false and misleading ordering on categories.
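To illustrate the difference in feature count, here is a small plain-Python sketch of the two encodings. It is not Spark's implementation: `string_index` and `one_hot` are hypothetical helpers that mimic the default StringIndexer ordering (most frequent category gets index 0) and a simplified one-hot encoding that keeps every category, whereas Spark's OneHotEncoder drops the last one by default.

```python
from collections import Counter

def string_index(values):
    """Map each category to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {category: float(i) for i, category in enumerate(ordered)}

def one_hot(index, size):
    """Expand a category index into a binary vector of length size."""
    return [1.0 if i == int(index) else 0.0 for i in range(size)]

products = ["M0001", "D0002", "M0001", "M0010", "M0001", "D0002"]
mapping = string_index(products)
print(mapping)  # {'M0001': 0.0, 'D0002': 1.0, 'M0010': 2.0}

indexed = [mapping[p] for p in products]                # one feature column
encoded = [one_hot(i, len(mapping)) for i in indexed]   # three binary feature columns
print(indexed[:3])  # [0.0, 1.0, 0.0]
print(encoded[1])   # [0.0, 1.0, 0.0]
```

With three categories the indexed form stays a single column while the one-hot form needs three. For a column with dozens of categories the gap grows accordingly, which is the space argument made above.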

In [38]:
# Include PySpark Feature Engineering methods
from pyspark.ml.feature import StringIndexer, VectorAssembler

# All of the columns (string or integer) are categorical columns
#  except for the [click] column
# Keep only columns with at most maxBins distinct values, because the tree
#  needs maxBins >= the number of categories of every categorical feature
maxBins = 70
categorical = [c for c, n in strColsCount if n <= maxBins]
categorical += [c for c, n in intColsCount if n <= maxBins]
categorical.remove('click')

# Apply string indexer to all of the categorical columns
#  And add _idx to the column name to indicate the index of the categorical value
stringIndexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in categorical]

# Use the indexed columns as the input to the VectorAssembler
#   with the output being our features
assemblerInputs = [c + "_idx" for c in categorical]
vectorAssembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")

# The [click] column is our label 
labelStringIndexer = StringIndexer(inputCol = "click", outputCol = "label")

# The stages of our ML pipeline 
stages = stringIndexers + [vectorAssembler, labelStringIndexer]

We use Pipeline to chain multiple Transformers and Estimators together to specify our machine learning workflow. A Pipeline’s stages are specified as an ordered array.
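As a conceptual sketch only, the toy classes below mimic how an ordered list of stages is fitted and applied: each stage is fitted on the output of the stages before it. `Indexer`, `Assembler`, `fit_pipeline` and `run_pipeline` are hypothetical stand-ins written in plain Python, not the Spark ML classes, and the two sample rows are made up.

```python
class Indexer:
    """Toy estimator: learns a value -> index mapping for one column."""
    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows):
        values = sorted({row[self.input_col] for row in rows})
        self.mapping = {v: float(i) for i, v in enumerate(values)}
        return self

    def transform(self, rows):
        return [{**row, self.output_col: self.mapping[row[self.input_col]]} for row in rows]

class Assembler:
    """Toy transformer: packs several columns into one feature list."""
    def __init__(self, input_cols, output_col):
        self.input_cols, self.output_col = input_cols, output_col

    def fit(self, rows):
        return self  # nothing to learn

    def transform(self, rows):
        return [{**row, self.output_col: [row[c] for c in self.input_cols]} for row in rows]

def fit_pipeline(stages, rows):
    """Fit each stage in order, feeding it the output of the stages before it."""
    fitted = []
    for stage in stages:
        fitted.append(stage.fit(rows))
        rows = fitted[-1].transform(rows)
    return fitted

def run_pipeline(fitted, rows):
    """Apply every fitted stage in order."""
    for stage in fitted:
        rows = stage.transform(rows)
    return rows

rows = [{"device_type": 1, "product": "M0001"}, {"device_type": 0, "product": "D0002"}]
stages = [
    Indexer("device_type", "device_type_idx"),
    Indexer("product", "product_idx"),
    Assembler(["device_type_idx", "product_idx"], "features"),
]
fitted = fit_pipeline(stages, rows)
print([r["features"] for r in run_pipeline(fitted, rows)])  # [[1.0, 1.0], [0.0, 0.0]]
```

The order matters: the assembler reads the `_idx` columns, so it has to come after the indexers that create them.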

In [39]:
from pyspark.ml import Pipeline

# Create our pipeline
pipeline = Pipeline(stages = stages)

# create transformer to add features
featurizer = pipeline.fit(df_clicks1)

# dataframe with feature and intermediate transformation columns appended
featurizedClicks = featurizer.transform(df_clicks1)

selectedCols = ['label', 'features'] + df_clicks1.columns
featurizedClicks = featurizedClicks.select(selectedCols)
featurizedClicks.printSchema()


root
 |-- label: double (nullable = false)
 |-- features: vector (nullable = true)
 |-- id: double (nullable = true)
 |-- click: integer (nullable = true)
 |-- hour: integer (nullable = true)
 |-- C1: integer (nullable = true)
 |-- banner_pos: integer (nullable = true)
 |-- site_id: string (nullable = true)
 |-- site_domain: string (nullable = true)
 |-- product: string (nullable = true)
 |-- app_id: string (nullable = true)
 |-- app_domain: string (nullable = true)
 |-- app_category: string (nullable = true)
 |-- device_id: string (nullable = true)
 |-- device_ip: string (nullable = true)
 |-- device_model: string (nullable = true)
 |-- device_type: integer (nullable = true)
 |-- device_conn_type: integer (nullable = true)
 |-- C14: integer (nullable = true)
 |-- C15: integer (nullable = true)
 |-- C16: integer (nullable = true)
 |-- C17: integer (nullable = true)
 |-- C18: integer (nullable = true)
 |-- C19: integer (nullable = true)
 |-- C20: integer (nullable = true)
 |-- C21: inte

As you can see, we now have 'features' column and 'label' column.

In [41]:
display(featurizedClicks.select('features', 'label').limit(10))

In [42]:
train, test = featurizedClicks \
  .select(["label", "features", "hr"]) \
  .randomSplit([0.7, 0.3], 42)
train.cache()
test.cache()

print("Training Dataset Count: " + str(train.count()))
print("Test Dataset Count: " + str(test.count()))

Training Dataset Count: 734366
Test Dataset Count: 314209
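The counts above are close to, but not exactly, a 70/30 split of 1,048,575 rows. That is what you would expect when each row is assigned to a split by an independent random draw rather than by cutting the data at a fixed position. The sketch below shows that idea in plain Python; `random_split` is a hypothetical helper and does not reproduce Spark's exact sampling or its row assignments.

```python
import random

def random_split(rows, train_weight, seed):
    """Assign each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(100_000))
train, test = random_split(rows, 0.7, seed=42)
print(len(train), len(test))                # close to 70,000 / 30,000, rarely exact
print(len(train) + len(test) == len(rows))  # True
```

Fixing the seed, as the notebook does with 42, makes the split repeatable across runs on the same data.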

We will use a Gradient Boosted Tree classifier for our model, as it is a popular choice. There are others you can try, such as XGBoost or Random Forest. The exact nature of these models is outside the scope of this demo.

In [43]:
from pyspark.ml.classification import GBTClassifier

# Train our GBTClassifier model 
classifier = GBTClassifier(labelCol="label", featuresCol="features", maxBins=maxBins, maxDepth=10, maxIter=10)
model = classifier.fit(train)

In [44]:
# Execute our predictions
predictions = model.transform(test)

predictions.select('hr', 'label', 'rawPrediction', 'prediction', 'probability').show(10)

+---+-----+--------------------+----------+--------------------+
| hr|label|       rawPrediction|prediction|         probability|
+---+-----+--------------------+----------+--------------------+
| 00|  0.0|[0.56423776377680...|       0.0|[0.75555745339017...|
| 00|  0.0|[0.56423776377680...|       0.0|[0.75555745339017...|
| 00|  0.0|[0.56423776377680...|       0.0|[0.75555745339017...|
| 00|  0.0|[0.56423776377680...|       0.0|[0.75555745339017...|
| 00|  0.0|[0.56423776377680...|       0.0|[0.75555745339017...|
| 00|  0.0|[0.56423776377680...|       0.0|[0.75555745339017...|
| 00|  0.0|[0.56423776377680...|       0.0|[0.75555745339017...|
| 00|  0.0|[0.56423776377680...|       0.0|[0.75555745339017...|
| 00|  0.0|[0.56423776377680...|       0.0|[0.75555745339017...|
| 00|  0.0|[0.56423776377680...|       0.0|[0.75555745339017...|
+---+-----+--------------------+----------+--------------------+
only showing top 10 rows

In [45]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluate our GBTClassifier model using BinaryClassificationEvaluator()
ev = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", metricName="areaUnderROC")

print("Test Area Under ROC: " + str(ev.evaluate(predictions)))

Test Area Under ROC: 0.7277238524799394

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds.
AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the ROC curve. It ranges from 0 to 1, where 0.5 corresponds to random guessing and 1.0 to a perfect ranking.
With our predictions, we can evaluate the model according to an evaluation metric, like area under the ROC curve, which in this case is about 0.73.
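One way to build intuition for AUC is its equivalent reading as a ranking measure: the probability that a randomly chosen clicked impression gets a higher score than a randomly chosen non-clicked one. The sketch below computes it that way in plain Python on six made-up scores. It is an illustration of the definition, not how BinaryClassificationEvaluator computes the value internally, and `roc_auc` is a hypothetical helper.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]
print(roc_auc(labels, scores))  # 8 of 9 pairs are ranked correctly, about 0.889
```

The pairwise loop is quadratic, so it is only suitable for small examples like this one.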

In [46]:
# explanation of all parameters available
print(classifier.explainParams())

cacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval. (default: False)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
featureSubsetStrategy: The number of features to consider for splits at each tree node. Supported options: 'auto' (choose automatically for task: If numTrees == 1, set to 'all'. If numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the features), 'sqrt' (use sqrt(number of features)), 'log2' (use log2(number of features)), 

In [47]:
# Feature names come from the ML attribute metadata that VectorAssembler
# attaches to the 'features' column; [0] takes the first attribute group
attrs = predictions.schema['features'].metadata['ml_attr']['attrs']
features = [str(attr['name']) for attr in list(attrs.values())[0]]

# convert numpy.float64 to str for spark.createDataFrame()
weights = ['%.10f' % w for w in model.featureImportances.toArray()]
weightedFeatures = sorted(zip(weights, features), key=lambda x: x[1], reverse=True)

spark.createDataFrame(weightedFeatures).toDF("weight", "feature").createOrReplaceTempView('wf')

In [48]:
%%sql 
select feature, weight 
from wf 
order by weight desc

<Spark SQL result set with 13 rows and 2 fields>

#### The product feature has 10% weight on the prediction. It's not very high, so it does not heavily impact whether a customer clicks an ad or not. The feature C21, though, is a different story at 53%. We should dig more into what that is and why it is influencing the result so much.

In [49]:
#create a sql view
predictions.createOrReplaceTempView("predictions")

In [50]:
%%sql 
describe predictions

<Spark SQL result set with 6 rows and 3 fields>

In [51]:
%%sql 
select sum(case when prediction = label then 1 else 0 end) / (count(1) * 1.0) as accuracy
from predictions

<Spark SQL result set with 1 rows and 1 fields>

#### The AUC for our model is about 0.73 and accuracy is 84%. Both are good enough for this demo.

We evaluated two metrics, AUC and accuracy. There are other metrics too, like precision, recall and F1 score. Choosing the right metric needs some thought. It can depend on the dataset (whether it's balanced or not), on the kind of problem you are solving, or on the kind of ML model you are using. Again, that is outside the scope of this notebook.
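For readers who want to see how these metrics differ, here is a small plain-Python sketch that computes accuracy, precision, recall and F1 from labels and predictions. The function name and the ten made-up impressions are for illustration only; they are not results from this notebook's model.

```python
def classification_metrics(labels, predictions):
    """Compute accuracy, precision, recall and F1 for binary 0/1 labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# 10 made-up impressions with 2 clicks; this "model" predicts no click every time
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_no = [0] * 10
print(classification_metrics(labels, always_no))
# {'accuracy': 0.8, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
```

In this toy case accuracy looks respectable even though no click is ever found, which is why the balance of the dataset matters when choosing a metric.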

The product feature has 10% weight on the prediction. It's not very high, which means it does not heavily impact the result. Which product a customer sees has only a modest effect on the probability that an ad will be clicked.

The feature C21, though, is a different story at 53%. We should dig more into what that is and why it is influencing the result so much.

Hope you got a taste of what kind of data analysis and ML models we can build on clickstream data and Dynamics data. 

Thank you.