# Desjardins Walkthrough with behavior_mapper package Import

In [1]:
from behavior_mapper import helper_functions, activities_class, modeling

Use the custom csv_import function to read in a CSV file and identify the 3 columns corresponding to the session ID, the activity conducted, and the time of the activity.

# Import and Data Prep Pipeline

In [2]:
sample_input = helper_functions.csv_import(file='data/private/jds_tt_eventdetail.csv',
                                          ID='uuid',
                                          activity='event_name',
                                          timestamp_col='event_datetime')
                                          #nrows=10000)
sample_input.head(5)

Unnamed: 0,uuid,event_datetime,event_name
0,e9c962fe-6dcb-11eb-bce1-5694ce998af9,2020-11-10 23:50:51,insweb.entry
1,e9c962fe-6dcb-11eb-bce1-5694ce998af9,2020-11-10 23:50:51,insweb.home
2,e9c962fe-6dcb-11eb-bce1-5694ce998af9,2020-11-10 23:50:58,insweb.home
3,e9c962fe-6dcb-11eb-bce1-5694ce998af9,2020-11-10 23:50:58,insweb.exit
4,c2919b36-fadc-406e-8b4b-038381094580,2020-11-10 21:36:50,insweb.entry


Create an activities dataframe object from the previously imported data. This object is used by all subsequent functions, which rely on its pre-configured column labels (ID, activity, occurrence).

In [3]:
activities_df = activities_class.activities({'ID':sample_input['uuid'],
                                            'activity':sample_input['event_name'],
                                            'occurrence':sample_input['event_datetime']})
activities_df.head()

Unnamed: 0,ID,activity,occurrence
0,e9c962fe-6dcb-11eb-bce1-5694ce998af9,insweb.entry,2020-11-10 23:50:51
1,e9c962fe-6dcb-11eb-bce1-5694ce998af9,insweb.home,2020-11-10 23:50:51
2,e9c962fe-6dcb-11eb-bce1-5694ce998af9,insweb.home,2020-11-10 23:50:58
3,e9c962fe-6dcb-11eb-bce1-5694ce998af9,insweb.exit,2020-11-10 23:50:58
4,c2919b36-fadc-406e-8b4b-038381094580,insweb.entry,2020-11-10 21:36:50


Using the activities dataframe, a dictionary of unique activities is created so that each activity can be replaced with a unique numeric identifier. Once activities are mapped, they are sequenced for each session ID.

Before sequencing, however, any activities matching the strings in the drop_activities list are removed. After sequencing, any sequences with fewer than the minimum number of steps (min_num) are removed from the corpus. Because activities are dropped before sequences are evaluated, dropped activities do not count toward a sequence's steps.

In [4]:
sequence_df, activity_map, activity_counts = activities_df.create_corpus(drop_activities=['insweb.entry','insweb.exit'],min_num=2, remove_repeats=True)
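
The corpus steps described above can be sketched in plain Python. This is only an illustration, not the package's implementation: the `build_corpus` function, the short session IDs, and the `insweb.quote` activity are hypothetical. It also assumes that drop strings are matched exactly and that the minimum-step filter is applied after repeats are removed.

```python
from collections import defaultdict
from itertools import groupby

def build_corpus(events, drop_activities=(), min_num=2, remove_repeats=True):
    """events: iterable of (session_id, activity, timestamp) tuples."""
    # 1. Drop unwanted activities first (assumption: exact string match).
    kept = [e for e in events if e[1] not in drop_activities]

    # 2. Map each remaining activity to an integer, in order of first appearance.
    activity_map = {}
    for _, activity, _ in kept:
        activity_map.setdefault(activity, len(activity_map))

    # 3. Collect the mapped activities per session.
    sessions = defaultdict(list)
    for sid, activity, ts in kept:
        sessions[sid].append((ts, activity_map[activity]))

    corpus = {}
    for sid, steps in sessions.items():
        # Sort by timestamp only; the sort is stable, so ties keep input order.
        seq = [a for _, a in sorted(steps, key=lambda s: s[0])]
        if remove_repeats:
            # Collapse consecutive duplicates, e.g. [0, 0, 1] -> [0, 1].
            seq = [k for k, _ in groupby(seq)]
        # 4. Keep only sequences with at least min_num steps.
        if len(seq) >= min_num:
            corpus[sid] = seq
    return corpus, activity_map

# Hypothetical events; ISO-formatted timestamps sort correctly as strings.
events = [
    ("s1", "insweb.entry", "2020-11-10 23:50:51"),
    ("s1", "insweb.home",  "2020-11-10 23:50:51"),
    ("s1", "insweb.home",  "2020-11-10 23:50:58"),
    ("s1", "insweb.exit",  "2020-11-10 23:50:58"),
    ("s2", "insweb.entry", "2020-11-10 21:36:50"),
    ("s2", "insweb.home",  "2020-11-10 21:36:52"),
    ("s2", "insweb.quote", "2020-11-10 21:37:10"),
    ("s2", "insweb.home",  "2020-11-10 21:38:00"),
]
corpus, activity_map = build_corpus(
    events, drop_activities=["insweb.entry", "insweb.exit"], min_num=2
)
print(corpus)
print(activity_map)
```

In this toy data, session s1 is left with a single step once entry, exit, and the repeated home activity are removed, so it falls below min_num and is excluded from the corpus.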

Next, the corpus of integer sequences will be tokenized and fitted using a word2vec skip-gram model. The output will be a dictionary of activities, each with an array of features produced by the model. Key parameters have pre-configured defaults. However, 2 of the parameters may need to be changed depending on the original source data and domain knowledge.

The first parameter to consider, 'window', corresponds to the number of activities to look before and after each activity when evaluating sequences. The default is 3. The second parameter to consider, 'min_activity_count', corresponds to the minimum number of occurrences an activity must have to remain in consideration. 

# Skip Grams

In [5]:
activities_features = modeling.fit_sequences(sequence_df=sequence_df,
                                             activity_map=activity_map,
                                             feature_size=100,
                                             window=4,
                                             min_activity_count=0,
                                             iter=50,
                                             sample=1e-5,
                                             negative=5) # Add any additional skip gram parameter specifications here
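
To make the effect of 'window' and 'min_activity_count' more concrete, here is a minimal sketch of the idea behind each. Both helper functions are hypothetical and are not part of behavior_mapper. Real word2vec implementations typically also randomize the effective window size and apply subsampling, which this sketch omits.

```python
from collections import Counter

def skipgram_pairs(sequence, window=3):
    """Return (target, context) pairs for one activity sequence."""
    pairs = []
    for i, target in enumerate(sequence):
        # Look up to `window` steps before and after position i.
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, sequence[j]))
    return pairs

def filter_rare(sequences, min_count):
    """Remove activities that occur fewer than min_count times overall."""
    counts = Counter(a for seq in sequences for a in seq)
    return [[a for a in seq if counts[a] >= min_count] for seq in sequences]

pairs = skipgram_pairs([0, 1, 2, 3], window=1)
filtered = filter_rare([[0, 1, 0], [0, 2]], min_count=2)
print(pairs)
print(filtered)
```

A larger window pairs each activity with more distant steps in the same session, so activities that are several clicks apart are still treated as related.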

The activities_features dictionary will now be used to cluster the activities based on the features produced by the model. Here again, there are 2 parameters that may require additional consideration beyond their default value.

The first parameter to consider, 'min_samples', denotes the number of activities that must be near an activity for it to be considered a core point. This includes the activity itself. The default value is set to 2, meaning a cluster can have as few as 2 activities. 

The second parameter to consider, 'eps', denotes the maximum distance between 2 activities for 1 to be considered in the cluster of the other. This measure is a bit more ambiguous and defaults to 0.5.
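
The interaction between 'min_samples' and 'eps' can be illustrated with a small core-point check. This is a sketch of the DBSCAN core-point rule only, not a full clustering routine and not the package's code. The `core_points` function and the sample coordinates are hypothetical.

```python
import math

def core_points(points, eps=0.5, min_samples=2):
    """Return the names of points with at least min_samples neighbors within eps.

    The point itself counts as one of its own neighbors.
    """
    core = []
    for name, p in points.items():
        neighbors = sum(1 for q in points.values() if math.dist(p, q) <= eps)
        if neighbors >= min_samples:
            core.append(name)
    return core

# Hypothetical 2-D positions for three activities.
points = {"a": (0.0, 0.0), "b": (0.3, 0.0), "c": (5.0, 5.0)}

core_default = core_points(points, eps=0.5, min_samples=2)
core_strict = core_points(points, eps=0.5, min_samples=3)
print(core_default)
print(core_strict)
```

With min_samples=2, activities a and b qualify as core points because each lies within eps of the other, while c is isolated. Raising min_samples to 3 leaves no core points at this eps.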

# Dim Reduction

Now that each activity has a full set of features from the original model, those features will be reduced to 2 using TSNE (t-distributed Stochastic Neighbor Embedding). Reducing all the features to 2 dimensions allows the activities to be visualized and their proximities to be evaluated further. It also lets additional users augment the clusters and explore the activities' relationships. Parameters for this dimension reduction model have been pre-configured for ease of use.

In [6]:
activity_cluster_df = modeling.dim_reduction(activities_features)
activity_cluster_df[['x', 'y']].sample(5)

Unnamed: 0,x,y
insweb.blog:-:6-tips-to-winterize-your-car,-67.13726,-42.45792
insweb.blog:-:how-to-replace-your-windshield-wiper-blades,13.158572,60.497166
insweb.500:https://services.lapersonnelle.com/documentsvirtuels/documentsassurance#assurance%20habitation,-2.691019,38.441437
insweb.500:https://services.thepersonal.com/polauto/adddriver/newdriver,47.975517,-68.196732
insweb.home,28.362507,-38.528149


# Add Counts

Activities that tend to occur together will have relatively similar volumes. Thus, the previously created activity_counts dictionary is used to capture the volume of each activity, which can also be used in clustering if desired.

In [7]:
activity_cluster_df = modeling.add_volume(activity_cluster_df,
                                          activity_counts)
activity_cluster_df[['x','y','volume_pctl']].sample(5)

Unnamed: 0,x,y,volume_pctl
insweb.quote:auto:quick:001.2:eof,-29.957603,-1.934803,94.067797
insweb.web:desjardins-assurances:faq-en:radar,-32.204636,-71.731995,5.169492
insweb.blog:-:save-on-your-home-and-auto-insurance-quebec,-52.715794,34.386845,56.059322
insweb.500:https://services.desjardinsgeneralinsurance.com/polauto/consultation/vehicledetails,-71.242294,39.73687,68.898305
insweb.quote:property:condo:household:quote:property:condo:00003:household,-91.496597,21.69529,88.262712
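
One plausible reading of the volume_pctl column is a percentile rank of each activity's count. The sketch below shows that idea under this assumption. The package's actual calculation is not shown in this walkthrough and may handle ties differently, and both the `volume_percentiles` function and the counts are hypothetical.

```python
def volume_percentiles(activity_counts):
    """Percent of activities whose count is less than or equal to this one's."""
    values = list(activity_counts.values())
    n = len(values)
    return {
        activity: 100.0 * sum(v <= count for v in values) / n
        for activity, count in activity_counts.items()
    }

# Hypothetical counts per activity.
counts = {"home": 50, "quote": 20, "faq": 5, "blog": 20}
pctl = volume_percentiles(counts)
print(pctl)
```

Expressing volume as a percentile rather than a raw count keeps it on a 0 to 100 scale, which is easier to compare across activities with very different traffic.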


# Clustering 

In [8]:
activity_cluster_df = modeling.dbscan_cluster(activity_cluster_df,
                                              cluster_dims=['x','y'],
                                              min_samples=3,
                                              eps=5) # Add any additional DBSCAN clustering parameter specifications here

activity_cluster_df[['cluster']].sample(5)

Unnamed: 0,cluster
insweb.blog:-:auto-insurance-101-everything-young-first-time-drivers-need-to-know,47
insweb.tips-tools:prevention:fuel-oil-leak,41
insweb.blogue:-:accident-de-voiture-et-assurance-auto-au-quebec,-1
insweb.eof:a:na:quote:rv:002.1:eof,1
insweb.recreational-leisure-vehicle-insurance:watercraft,75
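
One of the sampled activities above has a cluster value of -1. Assuming dbscan_cluster follows the common DBSCAN convention (as in scikit-learn) of labeling unclustered noise points -1, a quick way to review the result is to count cluster sizes and noise separately. The `summarize_clusters` function and the labels below are hypothetical.

```python
from collections import Counter

def summarize_clusters(labels):
    """Return (cluster sizes without noise, number of noise points)."""
    sizes = Counter(labels.values())
    # Assumption: -1 marks activities that were not assigned to any cluster.
    noise = sizes.pop(-1, 0)
    return sizes, noise

# Hypothetical activity -> cluster assignments.
labels = {
    "blog:young-drivers": 47,
    "blog:winter-tires": 47,
    "tips:fuel-oil-leak": 41,
    "blogue:accident": -1,
    "quote:rv": 1,
}
sizes, noise = summarize_clusters(labels)
print(sizes.most_common())
print(noise)
```

A large share of noise points may suggest that eps is too small or min_samples too large for the data.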


The activity_cluster_df is a dataframe of unique activities, the many skip-gram model features, an assigned cluster value, and a set of x & y values. To visualize it, you may use interactive visualization tools such as bokeh or D3. Alternatively, you may find it worthwhile to export the activities, their clusters, and their x & y values to a csv for others to visualize and explore further in BI tools or Microsoft Excel. The clusters will likely not match your expectations exactly; however, remember that the proximity of activities corresponds to how often they occur together. Thus, you can derive additional relationships from activities' relative locations.

# Export

In [9]:
version = 'desj_2dcluster_25iter_3wind_5eps' # Enter file name
activity_cluster_df.to_csv('data/private/'+version+'.csv',
                          index=True,
                          header=True,
                          columns=['cluster','x','y', 'volume_pctl'])


# GPC Walkthrough with behavior_mapper package Import

In [1]:
from behavior_mapper import helper_functions, activities_class, modeling
from glob import glob
import pandas as pd

file_path = 'tests/private/inputs/jas_adobe_all_*.csv'
export_df = [pd.read_csv(f, parse_dates=['event_datetime']) for f in glob(file_path)]
sample_input = pd.concat(export_df)

In [3]:
activities_df = activities_class.activities({'ID':sample_input['uuid'],
                                            'activity':sample_input['cdd_adobe_interaction_name_post_evar1'],
                                            'occurrence':sample_input['event_datetime']})

sequence_df, activity_map, activity_counts = activities_df.create_corpus(#drop_activities=['adobe.entry','adobe.exit'],
                                                                         min_num=0, 
                                                                         remove_repeats=True)
activity_cluster_df = modeling.fit_sequences(sequence_df=sequence_df,
                                             activity_map=activity_map,
                                             activity_counts=activity_counts,
                                             feature_size=100,
                                             window=4,
                                             min_activity_count=10,
                                             iter=100,
                                             sample=1e-5,
                                             negative=5) # Add any additional skip gram parameter specifications here

activity_cluster_df = modeling.dbscan_cluster(activity_cluster_df,
                                              cluster_dims=['x','y'],
                                              min_samples=3,
                                              eps=5) # Add any additional DBSCAN clustering parameter specifications here

activity_cluster_df.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,94,95,96,97,98,99,x,y,volume_pctl,cluster
occ:/billing/paybill/access,-0.059985,0.228044,-0.014246,-0.24556,-0.260454,0.400837,0.38176,-0.038,-0.10387,0.272894,...,-0.276321,0.343697,0.29912,0.095009,0.193624,0.396529,35.972698,-4.329173,98.760331,0
occ:login:login and register:clicks to log in,-0.200738,-0.145492,0.213701,-0.038448,-0.010163,0.401077,0.133611,-0.243653,0.04198,0.469547,...,-0.009501,0.237078,0.090353,-0.314398,0.172924,0.062043,24.493591,-19.142611,100.0,1
occ:login:login and register:clicks to close out login modal,-0.260744,-0.049025,0.05066,-0.205234,0.009319,0.248088,0.133617,0.160513,-0.105514,0.402457,...,-0.343315,0.396004,-0.000109,-0.206582,0.084815,-0.371353,48.162083,-8.652531,99.173554,2
occ:billing and payments:pay now:clicks between account,-0.175005,-0.176573,-0.124526,-0.293164,-0.361498,0.707121,0.283768,-0.084143,0.054418,0.186876,...,-0.212047,0.302994,0.052193,-0.059333,0.069452,-0.056213,51.685444,-51.559078,92.837466,3
occ:billing and payments:pay now,-0.040341,0.018531,-0.136609,-0.047397,-0.137167,0.244717,0.255479,-0.154652,-0.085669,0.099785,...,-0.06807,0.124247,0.262687,-0.15728,0.082249,-0.02469,36.994049,-4.414742,96.831956,0
