# Where apply button clicks come from

There are two main ways to look at the data (plus a third, related signal), and each may give slightly different counts:
* LeadGeneration.ClickConversion from the button click
* PageView on the ISS page
* (also: redirect completed on ISS)

The button click event gives more context about where on the page the click happened, while the ISS pageview is the official definition of an action.


Events should be defined as per https://docs.google.com/spreadsheets/d/1HICh77BoGMIat9K3NPwz3pBayJWiAr0ohAlTuv7dr80/edit#gid=1692709656, but this hasn't been implemented consistently.

Of note for button clicks / LeadGeneration.ClickConversion:
* We used to send product_id and provider_id, but with the move to Falcon that doesn't work any longer (or will provide incorrect results).  LPS in particular doesn't seem to have updated the implementation.
* The product comparison widget in the blog currently doesn't tell us what page it is on
* Sometimes there is no product or provider info coming through


Other things of note:
* LPS pages don't seem to be categorised as such in the database


Outstanding things not necessarily covered below:
* NPP clicks

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# Expand to screen width to fit more on.
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

In [3]:
from datetime import datetime, timedelta
import numpy as np
import pandas as pd
import sqlalchemy

from data_warehouse_querying import DataWarehouseQuery



In [4]:
try:
    from pyathena import connect
except ImportError:
    print("Failed to import pyathena, trying to install it")
    !pip install pyathena
    from pyathena import connect  # retry now that the package is installed

# Imported as a module (rather than `from ... import`) so the connection details inside stay namespaced
import athena_querying

In [5]:
from athena_common_queries import *

In [6]:
import data_parsing

# Settings

In [7]:
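num_days_to_query = 7
# Both bounds are inclusive in the queries, so the window covers num_days_to_query + 1 calendar dates
to_datetime = datetime.now().date() - timedelta(days=1) #datetime(year=2020, month=3, day=1)
from_datetime = to_datetime - timedelta(days=num_days_to_query)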


In [8]:
pd.set_option("display.max_colwidth", 200)

# Database Connections

In [9]:
# Redshift data warehouse - most queries here
dq = DataWarehouseQuery()
dq.connect()

In [10]:
# Athena - used for page type analysis
aq = athena_querying.AthenaQuery()
aq.connect()

# Getting Base Data

In [11]:
products = dq.query("select * from dim_product")

Starting query at 2020-04-10T17:36:14.848463
Query took 0.07


In [12]:
products.head()

Unnamed: 0,product_id,product_name,source_product_id,sys_inserted,sys_updated,status,slug,language_id,channel_id,provider_id,country_id
0,56622,CIMB Platinum Mastercard,101,2019-07-08 19:34:51.551632,2019-07-08 19:34:51.551632,0,cimb-platinum-mastercard-212cae7f-f5cd-4dd1-bac3-63033973420e,1,4,102,1
1,70784,Maybank DUO Platinum Mastercard,105,2019-08-16 19:34:45.935610,2019-08-16 19:34:45.935610,1,maybank-duo-platinum-mastercard,1,4,107,1
2,75681,OCBC 90°N Card,106,2019-08-30 19:34:23.890164,2019-08-30 19:34:23.890164,0,ocbc-90-n-card,1,4,108,1
3,81857,Citibank Quick Cash (Existing Loan Customers),24,2019-09-16 19:35:32.751792,2019-09-16 19:35:32.751792,1,citibank-quick-cash-existing-customers,1,16,836,1
4,93497,OCBC ExtraCash Loan,28,2019-10-18 20:10:55.807708,2019-10-18 20:10:55.807708,0,ocbc-extra-cash-loan,1,16,833,1


In [13]:
providers = dq.query("select * from dim_provider")

Starting query at 2020-04-10T17:36:14.998037
Query took 0.02


In [14]:
providers.head()

Unnamed: 0,provider_id,provider_name,sys_inserted,sys_updated,source_provider_id,slug,status,channel_id,country_id,language_id
0,833,OCBC,2019-02-19 19:32:36.488054,2019-02-19 19:32:36.488054,6.0,ocbc,1,16,1,1
1,837,Standard Chartered Bank,2019-02-19 19:32:36.488054,2019-02-19 19:32:36.488054,2.0,scb,1,16,1,1
2,100,Standard Chartered Bank,2019-01-21 19:32:53.867181,2019-01-21 19:32:53.867181,67.0,scb,0,4,1,1
3,104,Citibank,2019-01-21 19:32:53.867181,2019-01-21 19:32:53.867181,56.0,citibank,1,4,1,1
4,255,AXA,2019-02-01 01:43:57.318499,2019-02-01 01:43:57.318499,259.0,axa-direct,1,20,1,1


In [15]:
channels = dq.query("select * from dim_channel")

Starting query at 2020-04-10T17:36:15.074067
Query took 0.02


In [16]:
channels.head()


Unnamed: 0,channel_id,channel_key,channel_name,sys_inserted,sys_updated
0,4,credit-cards,Credit Cards,2018-09-11 10:11:04.481062,2018-09-11 10:11:04.481062
1,8,home-equity-loan,Home Equity Loan,2018-09-11 10:11:04.481062,2018-09-11 10:11:04.481062
2,12,life-insurance,Life Insurance,2018-09-11 10:11:04.481062,2018-09-11 10:11:04.481062
3,16,personal-loan,Personal Loan,2018-09-11 10:11:04.481062,2018-09-11 10:11:04.481062
4,20,travel-insurance,Travel Insurance,2018-09-11 10:11:04.481062,2018-09-11 10:11:04.481062


In [17]:
providers_channels = pd.merge(providers, channels, on="channel_id", how="left")

In [18]:
len(providers)

496

In [19]:
len(providers_channels)

496
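The two `len` checks above confirm that the left join did not add rows. As a minimal sketch of letting pandas enforce the same thing, the `validate` argument of `pd.merge` can be used; the toy frames below are illustrative stand-ins, not the real `dim_provider` / `dim_channel` contents.

```python
import pandas as pd

# Toy stand-ins for dim_provider and dim_channel (values are illustrative only).
providers_toy = pd.DataFrame({
    "provider_id": [833, 837, 100],
    "channel_id": [16, 16, 4],
})
channels_toy = pd.DataFrame({
    "channel_id": [4, 16],
    "channel_name": ["Credit Cards", "Personal Loan"],
})

# validate="many_to_one" raises pandas.errors.MergeError if channel_id is
# duplicated on the right-hand side, which is what would inflate the row count.
merged_toy = pd.merge(
    providers_toy, channels_toy, on="channel_id", how="left", validate="many_to_one"
)
merged_row_count_unchanged = len(merged_toy) == len(providers_toy)
```

With this in place a duplicated `channel_id` fails loudly at merge time instead of relying on a manual length comparison afterwards.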

In [20]:
anonymous_users_some = dq.query("select * from dim_anonymous_user limit 1000")

Starting query at 2020-04-10T17:36:15.189889
Query took 0.02


In [21]:
anonymous_users_some.head()

Unnamed: 0,anonymous_user_id,source_anonymous_id,site_version_id,sys_inserted,sys_updated
0,22568387,2b3a19e2-bc37-44ff-a120-6a14fed9224e,2,2019-02-19 19:34:29.494854,2019-02-19 19:34:29.494854
1,22568399,2b4f8663-a80d-4dd2-8856-30e103b641ec,2,2019-02-19 19:34:29.494854,2019-02-19 19:34:29.494854
2,22568423,2b68a16b-4fcb-485d-8dff-910e59786224,5,2019-02-19 19:34:29.494854,2019-02-19 19:34:29.494854
3,22568452,2b9a0ea4-20de-420e-b226-d28cfb96deb2,1,2019-02-19 19:34:29.494854,2019-02-19 19:34:29.494854
4,22568481,2bba3513-4459-4cce-b938-9e623a24d5af,2,2019-02-19 19:34:29.494854,2019-02-19 19:34:29.494854


In [22]:
sessions_some = dq.query("select * from fact_sessions limit 1000")

Starting query at 2020-04-10T17:36:15.252842
Query took 0.03


In [23]:
sessions_some.head()

Unnamed: 0,fact_session_id,session_date_id,session_start_time_id,anonymous_user_id,user_id,device_id,browser_id,site_country_id,acquisition_site_version_id,session_landing_page_id,session_campaign_id,session_count,total_pageviews,total_interaction_events,session_order,sys_inserted,sys_updated,session_id,user_filter_type
0,9075319,20180701,162018,10188694,,2834,8692,1,2,517847,485852,1,1,1,1,2018-11-23 11:46:49.078785,2018-11-23 11:46:49.078785,11240374,external_visitor
1,9106274,20180701,183041,10182610,,2835,8684,1,1,517289,485980,1,5,8,1,2018-11-23 11:46:49.078785,2018-11-23 11:46:49.078785,11247942,external_visitor
2,9095621,20180701,51941,10193258,,2843,8679,1,2,517793,485852,1,1,1,1,2018-11-23 11:46:49.078785,2020-01-31 11:23:30.531883,11259674,bot
3,9087080,20180701,165234,10181904,,2843,8679,1,2,517782,485852,1,1,1,1,2018-11-23 11:46:49.078785,2020-01-31 11:23:30.531883,11288167,bot
4,9085606,20180701,142923,10208464,,2843,8679,1,2,516973,485852,1,1,1,1,2018-11-23 11:46:49.078785,2020-01-31 11:23:30.531883,11276457,bot


In [24]:
dim_session_some = dq.query("select * from dim_session limit 1000")

Starting query at 2020-04-10T17:36:15.330957
Query took 0.02


In [25]:
dim_session_some.head()

Unnamed: 0,session_id,source_session_id,source_session_start_time,source_session_end_time,site_version_id,sys_inserted,sys_updated
0,13678591,d3dba6c5-b233-41ec-80e9-2e22874f606a-180818230837824,2018-08-18 23:08:37.824,2999-12-31 16:00:00,2,2018-11-23 16:44:22.046187,2018-11-23 16:44:22.046187
1,13678595,d3e3be8b-3642-4603-b24b-8d210c80a829-180818154622120,2018-08-18 15:46:22.120,2999-12-31 16:00:00,1,2018-11-23 16:44:22.046187,2018-11-23 16:44:22.046187
2,13678599,d3ebf4a5-da72-4bca-8865-0f5705afc80f-180818191333170,2018-08-18 19:13:33.170,2999-12-31 16:00:00,2,2018-11-23 16:44:22.046187,2018-11-23 16:44:22.046187
3,13678603,d3ece6ff-8966-43e6-a41f-030653ee1e42-180818170633190,2018-08-18 17:06:33.190,2999-12-31 16:00:00,2,2018-11-23 16:44:22.046187,2018-11-23 16:44:22.046187
4,13678607,d3f2d9a2-cd1e-4936-ac88-f5d2c6bcb6f7-180818072726357,2018-08-18 07:27:26.357,2999-12-31 16:00:00,2,2018-11-23 16:44:22.046187,2018-11-23 16:44:22.046187


# Loading Supporting Data

NB: I'm deliberately querying more data than strictly needed to keep things simple. Doing this properly should probably reuse the semi-standard views that have been built historically.

## Pageviews for session, user and pageview counts (irrespective of apply) and so conversion rates

In [26]:

query = """
select 
    fact_activities.page_id
    , page_url
    , dim_page_type.page_type
    , dim_page_type.page_sub_type
    , referrer_page_id
    , session_id
    , fact_activities.anonymous_user_id
    , source_anonymous_id
    , full_date
    , full_time
    , hour24
    , minute
from 
    fact_activities
    left join dim_date on fact_activities.activity_date_id = dim_date.date_id
    left join dim_time on fact_activities.activity_time_id = dim_time.time_id
    left join dim_page on fact_activities.page_id = dim_page.page_id
    left join dim_anonymous_user on fact_activities.anonymous_user_id = dim_anonymous_user.anonymous_user_id
    left join dim_activity_type on fact_activities.activity_type_id = dim_activity_type.activity_type_id
    left join dim_page_type on dim_page_type.page_type_id = dim_page.page_type_id
    -- left join dim_page_
    

where
    dim_activity_type.activity_name = 'PageView'
    and user_filter_type='external_visitor'
    and dim_date.full_date>='{from_date}'
    and dim_date.full_date<='{to_date}'


""".format(from_date= from_datetime.isoformat(), to_date=to_datetime.isoformat())

In [27]:
print(query)


select 
    fact_activities.page_id
    , page_url
    , dim_page_type.page_type
    , dim_page_type.page_sub_type
    , referrer_page_id
    , session_id
    , fact_activities.anonymous_user_id
    , source_anonymous_id
    , full_date
    , full_time
    , hour24
    , minute
from 
    fact_activities
    left join dim_date on fact_activities.activity_date_id = dim_date.date_id
    left join dim_time on fact_activities.activity_time_id = dim_time.time_id
    left join dim_page on fact_activities.page_id = dim_page.page_id
    left join dim_anonymous_user on fact_activities.anonymous_user_id = dim_anonymous_user.anonymous_user_id
    left join dim_activity_type on fact_activities.activity_type_id = dim_activity_type.activity_type_id
    left join dim_page_type on dim_page_type.page_type_id = dim_page.page_type_id
    -- left join dim_page_
    

where
    dim_activity_type.activity_name = 'PageView'
    and user_filter_type='external_visitor'
    and dim_date.full_date>='2020-04-02'
    and dim_date.full_date<='2020-04-09'


In [28]:
page_views = dq.query(query)

Starting query at 2020-04-10T17:36:15.431144
Query took 11.98


In [29]:
page_views.head()

Unnamed: 0,page_id,page_url,page_type,page_sub_type,referrer_page_id,session_id,anonymous_user_id,source_anonymous_id,full_date,full_time,hour24,minute
0,2970667,blog.moneysmart.sg/shopping/free-parking-singapore-malls-covid-19,blog_page,blog,,71528277,22586204,7dbcc214-ba75-4355-b694-d0edd21a24cf,2020-04-04,19:37:56,19,37
1,2970667,blog.moneysmart.sg/shopping/free-parking-singapore-malls-covid-19,blog_page,blog,,71919909,22607204,d0fed3b8-ff83-4807-8982-50cc729c8c50,2020-04-09,22:20:25,22,20
2,2533495,iss.moneysmart.sg/credit-cards/posb-everyday-card/redirect,interstitial_page,redirect,517750.0,71862168,22607204,d0fed3b8-ff83-4807-8982-50cc729c8c50,2020-04-07,20:14:26,20,14
3,2531464,iss.moneysmart.sg/credit-cards/dbs-altitude-visa-signature-card/redirect,interstitial_page,redirect,517750.0,71862168,22607204,d0fed3b8-ff83-4807-8982-50cc729c8c50,2020-04-07,20:12:40,20,12
4,517497,www.moneysmart.sg/credit-cards/standard-chartered-unlimited-cashback-card,details,product_details,,71938665,22607204,d0fed3b8-ff83-4807-8982-50cc729c8c50,2020-04-09,22:22:08,22,22


In [30]:
page_views.agg(["min","max","count", "size"])

Unnamed: 0,page_id,page_url,page_type,page_sub_type,referrer_page_id,session_id,anonymous_user_id,source_anonymous_id,full_date,full_time,hour24,minute
min,516315,blog-admin.moneysmart.hk,Unknown,,516315.0,71214225,10176278,00001c4d-2c82-4ad5-ac50-5d825993a3a6,2020-04-02,00:00:00,0,0
max,2971858,www.moneysmart.sg/wedding/5-money-saving-tips-for-wedding-receptions,thank_you_page,thank_you,2971858.0,72114635,61594347,fffffd10-6887-4bab-aacd-838a34337611,2020-04-09,23:59:54,23,59
count,986947,986947,986947,986947,123531.0,986947,986947,986947,986947,986947,986947,986947
size,986947,986947,986947,986947,986947.0,986947,986947,986947,986947,986947,986947,986947


In [31]:
page_views_referrer_stats = page_views.groupby(["page_type", "page_sub_type"]).agg({"referrer_page_id":["count", "size"]})#.apply(lambda x: x.referrer_page_id.count / x.size)

In [32]:
page_views_referrer_stats[("referrer_page_id", "fraction_with_referrer_set")] = page_views_referrer_stats[("referrer_page_id", "count")] / page_views_referrer_stats[("referrer_page_id", "size")]

In [33]:
page_views_referrer_stats

Unnamed: 0_level_0,Unnamed: 1_level_0,referrer_page_id,referrer_page_id,referrer_page_id
Unnamed: 0_level_1,Unnamed: 1_level_1,count,size,fraction_with_referrer_set
page_type,page_sub_type,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Unknown,,27511,92838,0.296333
blog_page,blog,57207,718395,0.079632
details,product_details,5252,32273,0.162737
form_page,form,0,274,0.0
interstitial_page,apply,509,1083,0.469991
interstitial_page,redirect,16014,16785,0.954066
interstitial_page,site,41,45,0.911111
learn_page,learn,363,1702,0.213278
listing,category_listing,4103,36317,0.112977
listing,channel_listing,10643,71121,0.149646


In [34]:
page_views[page_views.page_sub_type=="site"]["page_url"]

15787                                   www.moneysmart.sg/education-loan/posb/posb-loans-further-study-assist/site
15788                                   www.moneysmart.sg/education-loan/posb/posb-loans-further-study-assist/site
30807                          www.moneysmart.sg/renovation-loan/maybank/maybank-renovation-loan-monthly-rest/site
46330                                   www.moneysmart.sg/education-loan/posb/posb-loans-further-study-assist/site
68610                          www.moneysmart.sg/renovation-loan/maybank/maybank-renovation-loan-monthly-rest/site
68616                          www.moneysmart.sg/renovation-loan/maybank/maybank-renovation-loan-monthly-rest/site
132225                                  www.moneysmart.sg/education-loan/posb/posb-loans-further-study-assist/site
162793                                  www.moneysmart.sg/education-loan/posb/posb-loans-further-study-assist/site
175030                                  www.moneysmart.sg/education-loan/posb/po

## Sessions for landing page information

Note that:
* Some of this is also available from fact_activities
* A lot of marketing info is available at the session level
* For per-day analysis etc, you'd likely want the earliest of a user's sessions (or do it off pageviews)
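As a rough sketch of the "earliest session per user" idea from the notes above: sort by user and date, then keep the first row per user. The frame below is toy data using the same column names as the `fact_sessions` query in this section; the values are invented.

```python
import pandas as pd

# Toy rows with the same columns as the sessions query (values invented).
sessions_toy = pd.DataFrame({
    "session_id": [3, 1, 2, 4],
    "anonymous_user_id": [10, 10, 11, 11],
    "session_landing_page_id": [500, 501, 502, 503],
    "full_date": pd.to_datetime(
        ["2020-04-03", "2020-04-02", "2020-04-05", "2020-04-04"]
    ),
})

# Sort so each user's earliest session comes first, then keep one row per user.
# session_id is only a tie-breaker for sessions on the same date.
first_sessions = (
    sessions_toy.sort_values(["anonymous_user_id", "full_date", "session_id"])
    .drop_duplicates(subset="anonymous_user_id", keep="first")
    .reset_index(drop=True)
)
```

Since `full_date` has day granularity only, a finer ordering would need the session start time as well, which this sketch does not have.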

In [35]:
query = """
select
    session_id
    , anonymous_user_id
    , session_landing_page_id
    , session_count
    , dim_date.full_date
    

from 
    fact_sessions
    left join dim_date on session_date_id = dim_date.date_id


where 
    dim_date.full_date>='{from_date}'
    and dim_date.full_date<='{to_date}'

    and user_filter_type='external_visitor'

""".format(from_date= from_datetime.isoformat(), to_date=to_datetime.isoformat())

In [36]:
print(query)


select
    session_id
    , anonymous_user_id
    , session_landing_page_id
    , session_count
    , dim_date.full_date
    

from 
    fact_sessions
    left join dim_date on session_date_id = dim_date.date_id


where 
    dim_date.full_date>='2020-04-02'
    and dim_date.full_date<='2020-04-09'

    and user_filter_type='external_visitor'




In [37]:
sessions = dq.query(query)

Starting query at 2020-04-10T17:36:29.656279
Query took 2.99


In [38]:
sessions.head()

Unnamed: 0,session_id,anonymous_user_id,session_landing_page_id,session_count,full_date
0,71348675,61009582,1656230,1,2020-04-02
1,71272540,60999490,1656230,1,2020-04-02
2,71285202,60922845,1244470,1,2020-04-02
3,71215025,61062071,1656230,1,2020-04-02
4,71285746,61020033,1578914,1,2020-04-02


In [39]:
sessions.agg(["count", "min", "max"])

Unnamed: 0,session_id,anonymous_user_id,session_landing_page_id,session_count,full_date
count,762458,762458,762458,762458,762458
min,71214225,10176278,516315,1,2020-04-02
max,72114635,61594347,2971854,1,2020-04-09


## Better Page Segmentation (should put in ETL sometime)
For background: the current ETL process assigns page types by SQL pattern matching, but events now send page_type in the event body (and in future should also send page_sub_type). A ticket exists to improve this.

### Get All Pages from Data Warehouse

In [40]:
query = """
select 
    page_id
    , page_url
    , page_type
    , page_sub_type
    -- not joining to get product and category at the moment as not sure it's needed.  Data is also in fact_activities
    , product_category_id
    , product_id
    , provider_id

    
from 
    dim_page
    left join dim_page_type on dim_page.page_type_id = dim_page_type.page_type_id

"""

pages = dq.query(query)

Starting query at 2020-04-10T17:36:32.984556
Query took 0.29


In [41]:
pages.count()

page_id                47691
page_url               47691
page_type              47691
page_sub_type          47691
product_category_id     1251
product_id              2391
provider_id             3233
dtype: int64

In [42]:
pages.head()

Unnamed: 0,page_id,page_url,page_type,page_sub_type,product_category_id,product_id,provider_id
0,1375785,blog.moneysmart.hk/en/mortgage/%E5%B1%85%E5%B1%8B-2019-%E4%BD%95%E6%96%87%E7%94%B0-%E5%B0%87%E8%BB%8D%E6%BE%B3-%E7%94%B3%E8%AB%8B-%E9%A6%AC%E9%9E%8D%E5%B1%B1-%E6%B7%B1%E6%B0%B4%E5%9F%97,blog_page,blog,,,
1,1647817,blog.moneysmart.sg/property/private-properties-hdbs-differences/attachment/private-vs-public-housing-singapore,blog_page,blog,,,
2,1376813,forum.moneysmart.sg/topic/taking-multiple-loans,forum_page,forum,,,
3,1158494,blog.moneysmart.sg/dining/11-best-places-in-singapore-to-get-your-meat-fix-at-1-for-1/_,blog_page,blog,,,
4,1378619,www.moneysmart.sg/home%20loan,Unknown,,,,


In [43]:
pages[pages.page_url.str.startswith("http")].head()

Unnamed: 0,page_id,page_url,page_type,page_sub_type,product_category_id,product_id,provider_id


### Page Type from Event Body (via athena)

In [44]:
# Just take one day's worth of data from the end of the period (more likely to have updated values than start)
day_to_take_types_from = to_datetime - timedelta(days=1)

query = """
    select 
        context.page_url
        
        --regexp_extract(context.page_url, '^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?', 5)  -- slug
        
        , context.canonical_url
        , body.data.page_type
        , count(*) as event_count
        
        
    from {table_name}
    
    where
        {partition_filter}
        and context.page_url not like '%moneysmart.tw%'
        and context.page_url not like '%moneysmart.ph%'
        and context.page_url not like '%moneysmart.id%'
    
    group by 1,2,3
    

""".format(table_name = athena_querying.athena_database+ "." +athena_querying.athena_raw_events_table,
          partition_filter = create_partition_filter(day_to_take_types_from, to_datetime)
          
          )
           



#could filter out just pageviews, but the group by has the same effect and it's not indexed or anything.




In [45]:
print(query)


    select 
        context.page_url
        
        --regexp_extract(context.page_url, '^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?', 5)  -- slug
        
        , context.canonical_url
        , body.data.page_type
        , count(*) as event_count
        
        
    from ms_data_lake_production.ms_data_stream_production_processed
    
    where
        
  (
 partition_0 >= '2020'
 AND partition_1 >= '04'
 AND partition_2 >= '08'
 OR (
 partition_0 >= '2020'
 AND partition_1 > '04'
 ) 
 OR (
 partition_0 > '2020'
 ) 
)
 AND ((partition_0 <= '2020'
	 AND partition_1 <= '04'
	 AND partition_2 <= '09'
) 
 OR (
	 partition_0 <= '2020'
	 AND partition_1 < '04'
) 
 OR (
	 partition_0 < '2020'
) 
)
        and context.page_url not like '%moneysmart.tw%'
        and context.page_url not like '%moneysmart.ph%'
        and context.page_url not like '%moneysmart.id%'
    
    group by 1,2,3
    




In [46]:
pages_types_from_athena_raw = aq.query(query)

In [47]:
pages_types_from_athena_raw.head()

Unnamed: 0,page_url,canonical_url,page_type,event_count
0,https://blog.moneysmart.sg/shopping/online-grocery-shopping-singapore-redmart-honestbee/,,blog-article-details,3878
1,https://blog.moneysmart.hk/zh-hk/budgeting/2019-%E5%8D%80%E8%AD%B0%E6%9C%83-%E9%81%B8%E8%88%89-%E8%B3%87%E6%A0%BC-%E4%BA%BA%E5%B7%A5/#dc3,,,5
2,https://blog.moneysmart.hk/zh-hk/family/%E9%9B%A2%E5%A9%9A-%E6%89%8B%E7%BA%8C-%E5%BE%8B%E5%B8%AB-%E7%A8%8B%E5%BA%8F-%E8%B2%BB%E7%94%A8/#1,,article,86
3,https://blog.moneysmart.sg/career/singapore-air-stewardess-cabin-crew-recruitment/,,article,145
4,https://iss.moneysmart.sg/credit-cards/citibank-rewards-card/redirect?utm_source=google&utm_medium=cpc&utm_campaign=Adwords%20|%20Credit%20Card%20|%20DSA&gclid=Cj0KCQjwybD0BRDyARIsACyS8muFos6Ng-aC...,,,1


In [48]:
# See how much canonical_url is available, but don't want to match as dim_page ATOW isn't using canonical :(
# I think that it was only added for AMP
pages_types_from_athena_raw.count()

page_url         24425
canonical_url        0
page_type        18924
event_count      24425
dtype: int64

In [49]:
# Set the dim_page_url for joining
urls_for_dim_page = pages_types_from_athena_raw.apply(lambda x: data_parsing.get_dim_page_url(x["page_url"]), axis=1)

In [50]:
urls_for_dim_page.head()

0                                                   blog.moneysmart.sg/shopping/online-grocery-shopping-singapore-redmart-honestbee
1      blog.moneysmart.hk/zh-hk/budgeting/2019-%e5%8d%80%e8%ad%b0%e6%9c%83-%e9%81%b8%e8%88%89-%e8%b3%87%e6%a0%bc-%e4%ba%ba%e5%b7%a5
2    blog.moneysmart.hk/zh-hk/family/%e9%9b%a2%e5%a9%9a-%e6%89%8b%e7%ba%8c-%e5%be%8b%e5%b8%ab-%e7%a8%8b%e5%ba%8f-%e8%b2%bb%e7%94%a8
3                                                         blog.moneysmart.sg/career/singapore-air-stewardess-cabin-crew-recruitment
4                                                                     iss.moneysmart.sg/credit-cards/citibank-rewards-card/redirect
dtype: object
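The source of `data_parsing.get_dim_page_url` is not shown in this notebook. Judging only from the sample outputs above, it appears to drop the scheme, query string, fragment and trailing slash, and to lower-case the result. A hypothetical stand-in (the name `normalise_page_url` and the exact rules are assumptions) could look like this:

```python
from urllib.parse import urlsplit


def normalise_page_url(url: str) -> str:
    """Hypothetical stand-in for data_parsing.get_dim_page_url (rules are inferred)."""
    parts = urlsplit(url)
    # Keep host + path only: this drops the scheme, query string and fragment.
    host_and_path = parts.netloc + parts.path
    # Lower-case (which also lower-cases percent-escapes) and strip trailing slashes.
    return host_and_path.lower().rstrip("/")


example = normalise_page_url(
    "https://iss.moneysmart.sg/credit-cards/citibank-rewards-card/redirect?utm_source=google"
)
```

One thing to watch: the `pages.head()` output earlier shows `dim_page` URLs with upper-case percent-escapes, so lower-casing may itself cause some join misses for non-ASCII slugs.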

In [51]:
# check for bad matching with dim_page (looks like shop url or blog url, but doesn't match)
missing_pages = urls_for_dim_page[~urls_for_dim_page.isin(pages["page_url"])].unique()

In [52]:
len(missing_pages)

24

In [53]:
pd.DataFrame(missing_pages)

Unnamed: 0,0
0,www.moneysmart.hk/zh-hk/lending-companies-loan/lending-companies-loan-plans/promise-easy-loan1
1,blog3.moneysmart.hk/zh-hk/investment/%e5%bc%b7%e7%a9%8d%e9%87%91-%e5%bb%b6%e6%9c%9f%e5%b9%b4%e9%87%91-tvc-%e5%ae%8f%e5%88%a9-%e6%af%94%e8%bc%83-%e9%80%80%e4%bc%91
2,blog.moneysmart.hk/zh-hk/credit-cards/didi%e8%bf%8e%e6%96%b0%e5%84%aa%e6%83%a0%e6%b8%9b%e8%87%b326-%e7%94%a8%e6%88%b6%e8%96%a6%e5%8f%8b%e6%88%90%e5%8a%9f%e4%bd%bf%e7%94%a8%e7%8d%b250
3,blog.moneysmart.hk/zh-hk/investment/%e8%b2%b7%e5%b3%b6-%e7%84%a1%e4%ba%ba%e5%b3%b6-%e5%8a%a0%e5%8b%92%e6%af%94%e6%b5%b7-%e4%b8%ad%e7%be%8e%e6%b4%b2-%e8%8f%b2%e5%be%8b%e8%b3%93
4,blog.moneysmart.hk/zh-hk/credit-cards/%e9%a3%9b%e8%a1%8c%e9%87%8c%e6%95%b8-%e4%bf%a1%e7%94%a8%e5%8d%a1-%e6%af%94%e8%bc%83-2018
5,blog.moneysmart.hk/zh-hk/uncategorized/%e5%8a%a0%e6%81%af-%e6%bb%99%e8%b1%90%e9%8a%80%e8%a1%8c-hsbc-%e6%81%92%e7%94%9f%e9%8a%80%e8%a1%8c-hang-seng-bank-dbs-%e4%b8%ad%e5%9c%8b%e9%8a%80%e8%a1%8c
6,blog.moneysmart.hk/https://blog.moneysmart.hk/zh-hk/budgeting/%e9%9f%b3%e6%a8%82%e4%b8%b2%e6%b5%81-apps-%e5%83%b9%e9%8c%a2-%e6%af%94%e8%bc%83-2018-video
7,blog3.moneysmart.hk/zh-hk/loans/%e8%ae%80%e7%a2%a9%e5%a3%ab-%e5%8e%bb%e5%8a%a0%e6%8b%bf%e5%a4%a7%e9%80%b2%e4%bf%ae-%e6%8f%80%e5%a4%a7%e5%ad%b8-%e9%96%8b%e6%94%af-%e8%b2%b8%e6%ac%be%e6%af%94%e8%bc%83
8,www.moneysmart.sg/%3ca%20class=%22ui%20blue%20button%22%20target=%22_self%22%20href=%22https://web.archive.org/web/20191225001530/https://blog.moneysmart.sg/%22%3evisit%20blog%3c/a%3e
9,blog.moneysmart.hk/zh-hk/uncategorized/%e8%b2%b7%e6%a8%93-%e5%8d%b0%e8%8a%b1%e7%a8%85-%e6%8c%89%e6%8f%ad-bsd-ssd-dsd


In [54]:
pages[pages.page_url.isin([z.strip("/") for z in missing_pages])]

Unnamed: 0,page_id,page_url,page_type,page_sub_type,product_category_id,product_id,provider_id


In [55]:
pages[pages.page_url=="www.moneysmart.sg/"]

Unnamed: 0,page_id,page_url,page_type,page_sub_type,product_category_id,product_id,provider_id
26664,663410,www.moneysmart.sg/,Unknown,,,,


In [56]:
pages[pages.page_url=="www.moneysmart.sg"]

Unnamed: 0,page_id,page_url,page_type,page_sub_type,product_category_id,product_id,provider_id
35099,521077,www.moneysmart.sg,Unknown,,,,


In [57]:
len(pages_types_from_athena_raw)

24425

In [58]:
# Group together and remove duplicate entries sensibly

In [59]:
pages_types_from_athena_raw.head()

Unnamed: 0,page_url,canonical_url,page_type,event_count
0,https://blog.moneysmart.sg/shopping/online-grocery-shopping-singapore-redmart-honestbee/,,blog-article-details,3878
1,https://blog.moneysmart.hk/zh-hk/budgeting/2019-%E5%8D%80%E8%AD%B0%E6%9C%83-%E9%81%B8%E8%88%89-%E8%B3%87%E6%A0%BC-%E4%BA%BA%E5%B7%A5/#dc3,,,5
2,https://blog.moneysmart.hk/zh-hk/family/%E9%9B%A2%E5%A9%9A-%E6%89%8B%E7%BA%8C-%E5%BE%8B%E5%B8%AB-%E7%A8%8B%E5%BA%8F-%E8%B2%BB%E7%94%A8/#1,,article,86
3,https://blog.moneysmart.sg/career/singapore-air-stewardess-cabin-crew-recruitment/,,article,145
4,https://iss.moneysmart.sg/credit-cards/citibank-rewards-card/redirect?utm_source=google&utm_medium=cpc&utm_campaign=Adwords%20|%20Credit%20Card%20|%20DSA&gclid=Cj0KCQjwybD0BRDyARIsACyS8muFos6Ng-aC...,,,1


In [60]:
pages_types_from_athena_processing = pages_types_from_athena_raw[["page_url", "page_type", "event_count"]].copy()

In [61]:
# Working on an explicit copy (taken above) avoids pandas' SettingWithCopyWarning here and when adding columns later
pages_types_from_athena_processing = pages_types_from_athena_processing.rename(columns={"page_type":"page_type_from_events", "event_count":"event_count_athena"})
pages_types_from_athena_processing.columns


Index(['page_url', 'page_type_from_events', 'event_count_athena'], dtype='object')

In [62]:
pages_types_from_athena_processing["dim_page_url"] = urls_for_dim_page
pages_types_from_athena_processing.head()

Unnamed: 0,page_url,page_type_from_events,event_count_athena,dim_page_url
0,https://blog.moneysmart.sg/shopping/online-grocery-shopping-singapore-redmart-honestbee/,blog-article-details,3878,blog.moneysmart.sg/shopping/online-grocery-shopping-singapore-redmart-honestbee
1,https://blog.moneysmart.hk/zh-hk/budgeting/2019-%E5%8D%80%E8%AD%B0%E6%9C%83-%E9%81%B8%E8%88%89-%E8%B3%87%E6%A0%BC-%E4%BA%BA%E5%B7%A5/#dc3,,5,blog.moneysmart.hk/zh-hk/budgeting/2019-%e5%8d%80%e8%ad%b0%e6%9c%83-%e9%81%b8%e8%88%89-%e8%b3%87%e6%a0%bc-%e4%ba%ba%e5%b7%a5
2,https://blog.moneysmart.hk/zh-hk/family/%E9%9B%A2%E5%A9%9A-%E6%89%8B%E7%BA%8C-%E5%BE%8B%E5%B8%AB-%E7%A8%8B%E5%BA%8F-%E8%B2%BB%E7%94%A8/#1,article,86,blog.moneysmart.hk/zh-hk/family/%e9%9b%a2%e5%a9%9a-%e6%89%8b%e7%ba%8c-%e5%be%8b%e5%b8%ab-%e7%a8%8b%e5%ba%8f-%e8%b2%bb%e7%94%a8
3,https://blog.moneysmart.sg/career/singapore-air-stewardess-cabin-crew-recruitment/,article,145,blog.moneysmart.sg/career/singapore-air-stewardess-cabin-crew-recruitment
4,https://iss.moneysmart.sg/credit-cards/citibank-rewards-card/redirect?utm_source=google&utm_medium=cpc&utm_campaign=Adwords%20|%20Credit%20Card%20|%20DSA&gclid=Cj0KCQjwybD0BRDyARIsACyS8muFos6Ng-aC...,,1,iss.moneysmart.sg/credit-cards/citibank-rewards-card/redirect


In [63]:
page_types_from_athena = pages_types_from_athena_processing.fillna("").groupby(["dim_page_url"]).agg({"event_count_athena":"sum", "page_type_from_events":"max" }).reset_index()

In [64]:
page_types_from_athena.head()

Unnamed: 0,dim_page_url,event_count_athena,page_type_from_events
0,blog-admin.moneysmart.sg,6,home
1,blog-admin.moneysmart.sg/credit-cards/letting-loose-long-week-heres-dont-need-worry-cost,1,article
2,blog-admin.moneysmart.sg/credit-cards/uob-prvi-miles-credit-card-review,2,blog-article-details
3,blog-admin.moneysmart.sg/fixed-deposits/best-fixed-deposit-accounts-singapore,3,article
4,blog-admin.moneysmart.sg/savings-accounts/dbs-multiplier-ocbc360-uob-one-covid-19,10,blog-article-details
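Note that `"max"` on a string column picks the alphabetically last label, so a URL seen as both `article` and `blog-article-details` ends up as `blog-article-details` regardless of event volume. A possible alternative, sketched on toy data with invented URLs and counts, is to keep the label that carries the most events per URL:

```python
import pandas as pd

# Toy rows shaped like pages_types_from_athena_processing (values invented).
toy = pd.DataFrame({
    "dim_page_url": ["a.example/x", "a.example/x", "a.example/x", "a.example/y"],
    "page_type_from_events": ["article", "blog-article-details", "", ""],
    "event_count_athena": [145, 20, 5, 7],
})

# Total events per URL, regardless of label.
totals = toy.groupby("dim_page_url", as_index=False)["event_count_athena"].sum()

# Events per (URL, non-empty label), so repeated labels are summed first.
per_label = (
    toy[toy["page_type_from_events"] != ""]
    .groupby(["dim_page_url", "page_type_from_events"], as_index=False)["event_count_athena"]
    .sum()
)

# For each URL keep the label with the most events.
dominant = (
    per_label.sort_values("event_count_athena", ascending=False)
    .drop_duplicates(subset="dim_page_url", keep="first")[
        ["dim_page_url", "page_type_from_events"]
    ]
)

# URLs with no labelled events keep a missing page type.
deduped = totals.merge(dominant, on="dim_page_url", how="left")
```

Whether "most events" is the right tie-break is a judgment call; it simply avoids depending on alphabetical order.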


In [65]:
# fillna("") was applied before the earlier groupby because max was erroring on missing page types
page_types_from_athena.groupby("page_type_from_events").agg({"event_count_athena":"sum", "dim_page_url":"count"})

Unnamed: 0_level_0,event_count_athena,dim_page_url
page_type_from_events,Unnamed: 1_level_1,Unnamed: 2_level_1
,62178,721
article,298841,2325
blog-article-details,174943,185
blog-post-page,212794,92
category,2017,83
claim-status-tracker,178,1
claim-status-tracker-result,23,1
contact-us-general-enquiry,110,1
home,280,5
interstitial-page,5141,165


### Page Type from Jamie's Logic

This was originally written to compare Segment vs Kinesis data, and has since been tweaked to cover a bit more.

AToW it defaults to shop if it can't find a better category.

In [66]:
# Expect this to be a bit slow
page_types_jamie = pages[["page_id", "page_url"]].reset_index()

In [67]:
jamie_types = pages.apply(lambda x:data_parsing.get_metadata_from_url("https://"+x.page_url)[0], axis=1) #[page_type, slug, slug_root, ab_test, country_code]
#page_types_jamie["page_type"] = jamie_types

In [68]:
page_types_jamie["page_type_jamie"] = jamie_types

In [69]:
len(page_types_jamie)

47691

In [70]:
page_types_jamie.columns

Index(['index', 'page_id', 'page_url', 'page_type_jamie'], dtype='object')

In [71]:
# NB: logic is a bit flakey as it defaults to shop
page_types_jamie.groupby("page_type_jamie").size()

page_type_jamie
blog_article              23690
blog_category_page          267
blog_category_tag_page     2108
blog_home_page                4
blog_tag_page               115
calculator                   28
embed                       255
home_page                    29
iss                        2789
learn                       447
lps                         184
shop                      17373
trend                         8
unbounce                    394
dtype: int64

### Merged Page Type

In [72]:
# could probably merge techniques to use Jamie style plus PDP and listing page from the data warehouse
# essentially take the athena one and if not present, then use the Jamie one

pages_types_combined = pages.merge(page_types_jamie[["page_id", "page_type_jamie"]], how="left", on="page_id")\
    .merge(page_types_from_athena, how="left", left_on="page_url", right_on="dim_page_url")




In [73]:
pages_types_combined.head()

Unnamed: 0,page_id,page_url,page_type,page_sub_type,product_category_id,product_id,provider_id,page_type_jamie,dim_page_url,event_count_athena,page_type_from_events
0,1375785,blog.moneysmart.hk/en/mortgage/%E5%B1%85%E5%B1%8B-2019-%E4%BD%95%E6%96%87%E7%94%B0-%E5%B0%87%E8%BB%8D%E6%BE%B3-%E7%94%B3%E8%AB%8B-%E9%A6%AC%E9%9E%8D%E5%B1%B1-%E6%B7%B1%E6%B0%B4%E5%9F%97,blog_page,blog,,,,blog_article,,,
1,1647817,blog.moneysmart.sg/property/private-properties-hdbs-differences/attachment/private-vs-public-housing-singapore,blog_page,blog,,,,blog_article,,,
2,1376813,forum.moneysmart.sg/topic/taking-multiple-loans,forum_page,forum,,,,shop,,,
3,1158494,blog.moneysmart.sg/dining/11-best-places-in-singapore-to-get-your-meat-fix-at-1-for-1/_,blog_page,blog,,,,blog_article,,,
4,1378619,www.moneysmart.sg/home%20loan,Unknown,,,,,shop,,,
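The "take the Athena one and, if not present, use the Jamie one" rule from the cell comment can be sketched as a simple coalesce. The column `page_type_best` and the toy rows are hypothetical; only the two input column names come from the merged frame above.

```python
import numpy as np
import pandas as pd

# Toy rows with the two page-type columns from pages_types_combined (values invented).
combined_toy = pd.DataFrame({
    "page_url": ["a.example/x", "a.example/y", "a.example/z"],
    "page_type_from_events": ["product-listing", "", np.nan],
    "page_type_jamie": ["shop", "lps", "blog_article"],
})

# Treat both NaN (no Athena match) and "" (matched, but no page_type sent) as missing.
from_events = combined_toy["page_type_from_events"].replace("", np.nan)

# Prefer the event-body page type, falling back to the URL-based one.
combined_toy["page_type_best"] = from_events.fillna(combined_toy["page_type_jamie"])
```

The two sources use different vocabularies (for example `product-listing` vs `shop`), so a real version would also need a mapping onto one set of labels.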


### Comparing all the techniques
The intent here is to come back and build a single technique that classifies all pages.

In [74]:
pages_types_combined.groupby(["page_type", "page_sub_type", "page_type_jamie", "page_type_from_events"])\
.agg({"page_id":"count","event_count_athena":"sum"}).reset_index().rename(columns={"page_id":"page_count"})

Unnamed: 0,page_type,page_sub_type,page_type_jamie,page_type_from_events,page_count,event_count_athena
0,Unknown,,calculator,,9,2675.0
1,Unknown,,embed,,11,2747.0
2,Unknown,,embed,blog-post-page,92,212794.0
3,Unknown,,home_page,,1,2315.0
4,Unknown,,lps,,80,1416.0
5,Unknown,,lps,lps,61,2805.0
6,Unknown,,shop,,142,12150.0
7,Unknown,,shop,claim-status-tracker,1,178.0
8,Unknown,,shop,claim-status-tracker-result,1,23.0
9,Unknown,,shop,contact-us-general-enquiry,1,110.0


In [75]:
# ^ It's going to take some work to resolve it well

In [76]:
ptc = pages_types_combined

In [77]:
ptc[(ptc.page_type == "Unknown") & (ptc.page_type_from_events == "product-listing")]

Unnamed: 0,page_id,page_url,page_type,page_sub_type,product_category_id,product_id,provider_id,page_type_jamie,dim_page_url,event_count_athena,page_type_from_events
962,517337,www.moneysmart.sg/savings-account/rhb,Unknown,,,,,shop,www.moneysmart.sg/savings-account/rhb,2.0,product-listing
30210,1623914,www.moneysmart.hk/zh-hk/credit-cards/hang-seng-bank/travel-overseas-spending,Unknown,,,,,shop,www.moneysmart.hk/zh-hk/credit-cards/hang-seng-bank/travel-overseas-spending,10.0,product-listing
30211,1625043,www.moneysmart.hk/zh-hk/credit-cards/travel-overseas-spending,Unknown,,,,,shop,www.moneysmart.hk/zh-hk/credit-cards/travel-overseas-spending,19.0,product-listing
30328,1673296,www.moneysmart.hk/en/credit-cards/bea/travel-overseas-spending,Unknown,,,,,shop,www.moneysmart.hk/en/credit-cards/bea/travel-overseas-spending,2.0,product-listing
30354,1798332,www.moneysmart.hk/zh-hk/credit-cards/aeon-credit-service,Unknown,,,,,shop,www.moneysmart.hk/zh-hk/credit-cards/aeon-credit-service,6.0,product-listing
30427,1532085,www.moneysmart.hk/zh-hk/credit-cards/hang-seng-bank,Unknown,,,,,shop,www.moneysmart.hk/zh-hk/credit-cards/hang-seng-bank,108.0,product-listing
30677,2788055,www.moneysmart.hk/en/credit-cards/citic-bank-international,Unknown,,,,,shop,www.moneysmart.hk/en/credit-cards/citic-bank-international,4.0,product-listing
30756,1792727,www.moneysmart.hk/en/credit-cards/hang-seng-bank,Unknown,,,,,shop,www.moneysmart.hk/en/credit-cards/hang-seng-bank,15.0,product-listing
30809,2704242,www.moneysmart.hk/zh-hk/personal-loan/ua-finance,Unknown,,296.0,,,shop,www.moneysmart.hk/zh-hk/personal-loan/ua-finance,14.0,product-listing
30814,2805717,www.moneysmart.hk/en/personal-loan/promise,Unknown,,296.0,,,shop,www.moneysmart.hk/en/personal-loan/promise,3.0,product-listing


In [78]:
ptc[ptc.page_type_jamie=="blog_home_page"]

Unnamed: 0,page_id,page_url,page_type,page_sub_type,product_category_id,product_id,provider_id,page_type_jamie,dim_page_url,event_count_athena,page_type_from_events
2468,1015130,blog.moneysmart.hk,blog_page,blog,,,,blog_home_page,blog.moneysmart.hk,270.0,home
10150,518231,blog.moneysmart.sg,blog_page,blog,,,,blog_home_page,blog.moneysmart.sg,3105.0,other
29271,2965738,blog3.moneysmart.hk,blog_page,blog,,,,blog_home_page,,,
36935,2964624,blog3.moneysmart.sg,blog_page,blog,,,,blog_home_page,blog3.moneysmart.sg,8.0,


In [79]:
ptc[(ptc.page_type_jamie=="home_page") & (ptc.event_count_athena>0)]

Unnamed: 0,page_id,page_url,page_type,page_sub_type,product_category_id,product_id,provider_id,page_type_jamie,dim_page_url,event_count_athena,page_type_from_events
28853,2615872,blog-admin.moneysmart.sg,blog_page,blog,,,,home_page,blog-admin.moneysmart.sg,6.0,home
35099,521077,www.moneysmart.sg,Unknown,,,,,home_page,www.moneysmart.sg,2315.0,


In [80]:
ptc[(ptc.page_type_from_events=="category") & (ptc.page_type_jamie=="blog_article")].head(20)

Unnamed: 0,page_id,page_url,page_type,page_sub_type,product_category_id,product_id,provider_id,page_type_jamie,dim_page_url,event_count_athena,page_type_from_events
143,1388165,blog.moneysmart.hk/zh-hk/mortgage/page/11,blog_page,blog,,,,blog_article,blog.moneysmart.hk/zh-hk/mortgage/page/11,1.0,category
9464,1045709,blog.moneysmart.hk/zh-hk/budgeting/page/2,blog_page,blog,,,,blog_article,blog.moneysmart.hk/zh-hk/budgeting/page/2,1.0,category
17442,1145854,blog.moneysmart.sg/wedding/page/3,blog_page,blog,,,,blog_article,blog.moneysmart.sg/wedding/page/3,1.0,category
32582,1854510,blog.moneysmart.sg/invest/page/10,blog_page,blog,,,,blog_article,blog.moneysmart.sg/invest/page/10,1.0,category


### Trying to make the best page_type of some so-so options

In [81]:
# This is a bit hacky and liable to break.
# It also takes a while to run; it could be optimised by mapping over the summary (the distinct type combinations) instead of every row, but whatever.

def _merge_page_types(x):
    j_type = x.page_type_jamie
    if "blog" in j_type:
        page_type = "blog"
        page_sub_type = j_type
    elif j_type in ["lps", "unbounce", "trend", "calculator"]:
        page_type = "landing"
        page_sub_type = j_type
    elif j_type == "iss":
        page_type = "interstitial"
        page_sub_type = x.page_sub_type
    elif j_type == "embed":
        page_type = "embed"
        page_sub_type = "unknown"
    elif j_type == "shop":
        if x.page_type=="listing" or x.page_type_from_events=="product-listing":
            page_type = "listing"
            if x.page_sub_type in ["category_listing", "channel_listing", "provider_listing"]:
                page_sub_type = x.page_sub_type
            else:
                page_sub_type = "unknown"
        elif x.page_type_from_events == "product-details" or x.page_sub_type =="product-details":
            page_type="product_details"
            page_sub_type="unknown"
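        # pd.notna is needed here: NaN is truthy and `!= np.NaN` is always True, so neither detects a missing value
        elif x.page_type == "Unknown" and pd.notna(x.page_type_from_events) and bool(x.page_type_from_events):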
            page_type = "misc_shop"
            page_sub_type = x.page_type_from_events
        
        else:
            page_type = "misc_shop"
            page_sub_type = "unknown"
    else:
        page_type = j_type
        page_sub_type = "unknown"
    return pd.Series([page_type, page_sub_type], index=['page_type_merged', 'page_sub_type_merged'])
        

page_type_merged_col = ptc.apply(_merge_page_types, axis=1)
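
One thing to watch when testing for a missing `page_type_from_events` in a row-wise function like this: NaN is truthy and never compares equal to anything, including itself, so `bool(value)` and `value != nan` do not detect it. A minimal check, using a hypothetical helper name `has_value` that is not part of the notebook:

```python
import math

nan = float("nan")  # the kind of value a left merge leaves for unmatched rows

print(bool(nan))   # True: NaN is truthy
print(nan != nan)  # True: NaN never equals itself


def has_value(value):
    """Return True only for a non-empty, non-missing value.

    Hypothetical helper for illustration.
    """
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    return value != ""


print(has_value(nan), has_value(""), has_value("product-listing"))
```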

In [82]:
pages_types_merged_dev = pages_types_combined.merge(page_type_merged_col, how="left", left_index=True, right_index=True)
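
The row-wise `apply` is slow because it calls the Python function once per page, even though there are only a handful of distinct input combinations. A sketch of the "map over the summary" idea on invented data follows; `classify` is a toy stand-in for `_merge_page_types`, and in the real frame the key columns would be the four columns that function reads. pandas merges match missing keys against each other, so NaN combinations should survive the round trip, but that is worth checking on the real data.

```python
import pandas as pd

LABELS = ["page_type_merged", "page_sub_type_merged"]


def classify(row):
    """Toy stand-in for _merge_page_types: maps one row to two labels."""
    if "blog" in row.page_type_jamie:
        return pd.Series(["blog", row.page_type_jamie], index=LABELS)
    return pd.Series([row.page_type_jamie, "unknown"], index=LABELS)


pages = pd.DataFrame({
    "page_id": [1, 2, 3, 4],
    "page_type_jamie": ["blog_article", "blog_article", "shop", "shop"],
})
key_cols = ["page_type_jamie"]

# Classify each distinct input combination once...
combos = pages[key_cols].drop_duplicates().reset_index(drop=True)
combos = pd.concat([combos, combos.apply(classify, axis=1)], axis=1)

# ...then map the result back onto every page.
result = pages.merge(combos, how="left", on=key_cols)
print(len(combos), "classifications for", len(pages), "pages")
```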


In [378]:
pd.DataFrame(pages_types_merged_dev.groupby(['page_type_merged', 'page_sub_type_merged']).size())

Unnamed: 0_level_0,Unnamed: 1_level_0,0
page_type_merged,page_sub_type_merged,Unnamed: 2_level_1
blog,blog_article,23690
blog,blog_category_page,267
blog,blog_category_tag_page,2108
blog,blog_home_page,4
blog,blog_tag_page,115
embed,unknown,255
home_page,unknown,29
interstitial,,3
interstitial,apply,1770
interstitial,forum,3


In [83]:
pages_types_merged_dev.groupby(['page_type_merged', 'page_sub_type_merged', "page_type", "page_sub_type", "page_type_jamie", "page_type_from_events"])\
.agg({"page_id":"count","event_count_athena":"sum"}).reset_index().rename(columns={"page_id":"page_count"})

Unnamed: 0,page_type_merged,page_sub_type_merged,page_type,page_sub_type,page_type_jamie,page_type_from_events,page_count,event_count_athena
0,blog,blog_article,blog_page,blog,blog_article,,201,33817.0
1,blog,blog_article,blog_page,blog,blog_article,article,2318,298821.0
2,blog,blog_article,blog_page,blog,blog_article,blog-article-details,183,174931.0
3,blog,blog_article,blog_page,blog,blog_article,category,4,4.0
4,blog,blog_article,blog_page,blog,blog_article,home,3,4.0
5,blog,blog_article,blog_page,blog,blog_article,other,44,407.0
6,blog,blog_article,blog_page,blog,blog_article,tag,1,144.0
7,blog,blog_category_page,blog_page,blog,blog_category_page,,3,3.0
8,blog,blog_category_page,blog_page,blog,blog_category_page,category,35,810.0
9,blog,blog_category_tag_page,blog_page,blog,blog_category_tag_page,,3,15.0


In [84]:
pages_types_merged_dev[(pages_types_merged_dev.page_type_merged=="misc_shop") & (pages_types_merged_dev.page_sub_type_merged=="unknown")].sort_values("event_count_athena", ascending = False).head(20)

Unnamed: 0,page_id,page_url,page_type,page_sub_type,product_category_id,product_id,provider_id,page_type_jamie,dim_page_url,event_count_athena,page_type_from_events,page_type_merged,page_sub_type_merged
26628,901598,www.moneysmart.sg/refinancing/compare,Unknown,,,,,shop,www.moneysmart.sg/refinancing/compare,3668.0,,misc_shop,unknown
4930,679337,www.moneysmart.sg/home-loan/compare,Unknown,,,,,shop,www.moneysmart.sg/home-loan/compare,1620.0,,misc_shop,unknown
16870,902232,www.moneysmart.sg/refinancing/compare/loans,Unknown,,,,,shop,www.moneysmart.sg/refinancing/compare/loans,950.0,,misc_shop,unknown
9229,852846,www.moneysmart.sg/car-insurance/wizard/register,Unknown,,,,,shop,www.moneysmart.sg/car-insurance/wizard/register,682.0,,misc_shop,unknown
3436,517603,www.moneysmart.sg/car-insurance/wizard,Unknown,,,,,shop,www.moneysmart.sg/car-insurance/wizard,651.0,,misc_shop,unknown
41262,1524742,www.moneysmart.hk/zh-hk/mortgage/property-valuation-tool,Unknown,,,,,shop,www.moneysmart.hk/zh-hk/mortgage/property-valuation-tool,481.0,,misc_shop,unknown
42911,1536292,www.moneysmart.hk/zh-hk/mortgage,Unknown,,,,,shop,www.moneysmart.hk/zh-hk/mortgage,480.0,,misc_shop,unknown
37767,1011194,www.moneysmart.hk/zh-hk,Unknown,,,,,shop,www.moneysmart.hk/zh-hk,470.0,,misc_shop,unknown
11959,826200,www.moneysmart.sg/car-insurance/wizard/results,Unknown,,,,,shop,www.moneysmart.sg/car-insurance/wizard/results,379.0,,misc_shop,unknown
34618,678568,www.moneysmart.sg/home-loan/compare/loans,Unknown,,,,,shop,www.moneysmart.sg/home-loan/compare/loans,333.0,,misc_shop,unknown


In [85]:
pages_types_merged = pages_types_merged_dev[["page_id", "page_url", "page_type_merged", "page_sub_type_merged","event_count_athena"]].rename(columns={"page_type_merged": "page_type", "page_sub_type_merged":"page_sub_type"})
canonical_urls_col = pages_types_merged.apply(lambda x: data_parsing.get_canonical_url("https://"+x.page_url), axis=1)
pages_types_merged["canonical_url"] = canonical_urls_col
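
`data_parsing.get_canonical_url` is not shown in this notebook. Judging only from the outputs further down (lower-cased percent-encoding, no scheme, `www-new.` hosts mapped back to `www.`), it behaves roughly like the sketch below. Everything here is an assumption for illustration: the function name, the alias table and the decision to drop the query string are hypothetical, and this is not the real implementation.

```python
from urllib.parse import urlsplit

# Hypothetical mapping of A/B-test hosts to their canonical host.
HOST_ALIASES = {"www-new.moneysmart.sg": "www.moneysmart.sg"}


def canonical_url_sketch(url):
    """Return lower-cased host + path, with known test hosts mapped back.

    Illustrative only; not the real data_parsing.get_canonical_url.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    host = HOST_ALIASES.get(host, host)
    # Query string and fragment are dropped (an assumption).
    return host + parts.path.lower()


print(canonical_url_sketch("https://www-new.moneysmart.sg/personal-loan/scb-cashone"))
```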

In [384]:
df = pages_types_merged[pages_types_merged.page_url.str.contains("moneysmart.sg")].groupby(["page_type", "page_sub_type"])\
    .agg({"page_id":"count","event_count_athena":"sum"}).reset_index().rename(columns={"page_id":"page_count"})

df

Unnamed: 0,page_type,page_sub_type,page_count,event_count_athena
0,blog,blog_article,16552,439399.0
1,blog,blog_category_page,266,813.0
2,blog,blog_category_tag_page,1874,379.0
3,blog,blog_home_page,2,3113.0
4,blog,blog_tag_page,64,4.0
5,embed,unknown,216,210679.0
6,home_page,unknown,18,2321.0
7,interstitial,,3,0.0
8,interstitial,apply,746,316.0
9,interstitial,forum,3,0.0


In [392]:
for a in pages_types_merged[(pages_types_merged.page_type=="listing") & (pages_types_merged.page_sub_type == "unknown") & (pages_types_merged.event_count_athena>0)].page_url:
    print(a)

www.moneysmart.sg/savings-account/rhb
www.moneysmart.sg/credit-cards/icbc-chinese-zodiac-credit-card
www.moneysmart.hk/zh-hk/credit-cards/hang-seng-bank/travel-overseas-spending
www.moneysmart.hk/zh-hk/credit-cards/travel-overseas-spending
www.moneysmart.hk/en/credit-cards/bea/travel-overseas-spending
www.moneysmart.hk/zh-hk/credit-cards/aeon-credit-service
www.moneysmart.hk/zh-hk/credit-cards/hang-seng-bank
www.moneysmart.hk/en/credit-cards/citic-bank-international
www.moneysmart.hk/en/credit-cards/hang-seng-bank
www.moneysmart.hk/zh-hk/personal-loan/ua-finance
www.moneysmart.hk/en/personal-loan/promise
www.moneysmart.sg/debt-consolidation-plan/standard-chartered
www.moneysmart.hk/zh-hk/credit-cards/dah-sing-bank
www.moneysmart.hk/zh-hk/credit-cards/bank-of-china
www.moneysmart.hk/zh-hk/credit-cards/china-construction-bank
www.moneysmart.sg/personal-loan/hl-bank
www.moneysmart.hk/zh-hk/credit-cards/citic-bank-international/welcome-offer
www.moneysmart.hk/en/credit-cards/china-construc

In [390]:
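# NB: iterating over a DataFrame yields its column names, so this prints the column names rather than URLs.
# Select .page_url (as in the previous cell) to list the pages themselves.
for a in pages_types_merged[(pages_types_merged.page_type=="listing") & (pages_types_merged.page_sub_type == "unknown") & (pages_types_merged.event_count_athena==0)]:
    print(a)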

page_id
page_url
page_type
page_sub_type
event_count_athena
canonical_url


In [86]:
len(pages_types_merged)



47691

In [87]:
len(pages)

47691

In [88]:
# We should get fewer groups when grouping by canonical URL, as that removes the A/B test URLs
len(pages_types_merged.groupby(["canonical_url"]))

40522

In [89]:
len(pages_types_merged.groupby(["canonical_url", "page_type", "page_sub_type"]))

41347

In [90]:
# TODO: there's a mismatch here: some canonical URLs map to more than one (page_type, page_sub_type) pair.
# Probably want to group by canonical URL, take the max to merge them together and then join back onto the non-canonical pages... but should really investigate the origin.
# See below: it looks like a remnant of A/B testing Falcon.

In [91]:
issues = pages_types_merged.groupby(["canonical_url", "page_type", "page_sub_type"]).count().groupby(["canonical_url"]).size()

In [92]:
issues[issues.values>1]

canonical_url
www.moneysmart.hk/en/credit-cards/american-express-platinum-credit-card                            2
www.moneysmart.hk/en/credit-cards/icbc/unionpay                                                    2
www.moneysmart.hk/en/personal-loan/hsbc                                                            2
www.moneysmart.hk/zh-hk/credit-cards/american-express-platinum-credit-card                         2
www.moneysmart.hk/zh-hk/credit-cards/bank-of-china                                                 2
www.moneysmart.hk/zh-hk/credit-cards/dbs/unionpay                                                  2
www.moneysmart.hk/zh-hk/credit-cards/dbs/welcome-offer                                             2
www.moneysmart.hk/zh-hk/credit-cards/icbc/unionpay                                                 2
www.moneysmart.hk/zh-hk/credit-cards/icbc/welcome-offer                                            2
www.moneysmart.sg/credit-cards/american-express-platinum-credit-card         

In [93]:
pages_types_merged[pages_types_merged.canonical_url=="www.moneysmart.sg/personal-loan/scb-cashone"]

Unnamed: 0,page_id,page_url,page_type,page_sub_type,event_count_athena,canonical_url
15472,517928,www.moneysmart.sg/personal-loan/scb-cashone,product_details,unknown,94.0,www.moneysmart.sg/personal-loan/scb-cashone
32968,2678863,www-new.moneysmart.sg/personal-loan/scb-cashone,misc_shop,unknown,,www.moneysmart.sg/personal-loan/scb-cashone
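
One way to resolve such conflicts, along the lines of the TODO above, is to let the row with the most events decide the type for each canonical URL and then join that back onto every page. A sketch on invented rows; the `_resolved` column names are hypothetical, and "most events wins" is just one plausible rule:

```python
import pandas as pd

# Invented rows mimicking the conflict above.
toy = pd.DataFrame({
    "page_url": ["www.example.sg/a", "www-new.example.sg/a", "www.example.sg/b"],
    "canonical_url": ["www.example.sg/a", "www.example.sg/a", "www.example.sg/b"],
    "page_type": ["product_details", "misc_shop", "listing"],
    "page_sub_type": ["unknown", "unknown", "channel_listing"],
    "event_count_athena": [94.0, None, 12.0],
})

# For each canonical URL keep the types from the row with the most events;
# rows without events sort last.
winners = (
    toy.sort_values("event_count_athena", ascending=False, na_position="last")
    .drop_duplicates("canonical_url")
    .loc[:, ["canonical_url", "page_type", "page_sub_type"]]
)

# Join the winning types back onto every (non-canonical) page.
resolved = toy.merge(winners, on="canonical_url", how="left",
                     suffixes=("", "_resolved"))
print(resolved[["page_url", "page_type", "page_type_resolved"]])
```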


In [94]:
pages_types_merged.head()

Unnamed: 0,page_id,page_url,page_type,page_sub_type,event_count_athena,canonical_url
0,1375785,blog.moneysmart.hk/en/mortgage/%E5%B1%85%E5%B1%8B-2019-%E4%BD%95%E6%96%87%E7%94%B0-%E5%B0%87%E8%BB%8D%E6%BE%B3-%E7%94%B3%E8%AB%8B-%E9%A6%AC%E9%9E%8D%E5%B1%B1-%E6%B7%B1%E6%B0%B4%E5%9F%97,blog,blog_article,,blog.moneysmart.hk/en/mortgage/%e5%b1%85%e5%b1%8b-2019-%e4%bd%95%e6%96%87%e7%94%b0-%e5%b0%87%e8%bb%8d%e6%be%b3-%e7%94%b3%e8%ab%8b-%e9%a6%ac%e9%9e%8d%e5%b1%b1-%e6%b7%b1%e6%b0%b4%e5%9f%97
1,1647817,blog.moneysmart.sg/property/private-properties-hdbs-differences/attachment/private-vs-public-housing-singapore,blog,blog_article,,blog.moneysmart.sg/property/private-properties-hdbs-differences/attachment/private-vs-public-housing-singapore
2,1376813,forum.moneysmart.sg/topic/taking-multiple-loans,misc_shop,unknown,,forum.moneysmart.sg/topic/taking-multiple-loans
3,1158494,blog.moneysmart.sg/dining/11-best-places-in-singapore-to-get-your-meat-fix-at-1-for-1/_,blog,blog_article,,blog.moneysmart.sg/dining/11-best-places-in-singapore-to-get-your-meat-fix-at-1-for-1/_
4,1378619,www.moneysmart.sg/home%20loan,misc_shop,,,www.moneysmart.sg/home%20loan


# Looking at the LeadGeneration.ClickConversion event

## Getting the click event data

In [95]:
query = """
select  

    country_code
    , dim_page_type.page_type
    , dim_page_type.page_sub_type
    , case when page_url like '%/embed/%' then true else false end as is_embed
    , page_url
    , fact_activities.page_id
    , full_date
    , full_time
    , session_id
    , fact_activities.anonymous_user_id
    , source_anonymous_id
    , device_os
    , device_category
    , browser
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'channel', true) as channel
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'product_slug', true) as product_slug
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'product', true) as product
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'product_id', true) as product_id
    , dim_product.slug as product_from_id
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'provider_slug', true) as provider_slug
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'provider', true) as provider
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'provider_id', true) as provider_id
    , dim_provider.slug as provider_from_id
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'affiliate_category', true) as affiliate_category
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'affiliate_location', true) as affiliate_location
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'affiliate_page_type', true) as affiliate_page_type
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'affiliate_widget_type', true) as affiliate_widget_type
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'list_position', true) as list_position
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'action', true) as action
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'source', true) as source
    , dim_activity.activity_attributes
    from 
    
    -- TODO: cut down the joins; these were just copied and pasted
    fact_activities 
    left join dim_page on fact_activities.page_id = dim_page.page_id
    left join dim_page_type on dim_page_type.page_type_id = dim_page.page_type_id
    -- left join dim_session on fact_activities.session_id = dim_session.session_id
    left join dim_activity on fact_activities.activity_id = dim_activity.activity_id
    left join dim_activity_type on fact_activities.activity_type_id = dim_activity_type.activity_type_id
    left join dim_anonymous_user on fact_activities.anonymous_user_id = dim_anonymous_user.anonymous_user_id
    left join dim_date on dim_date.date_id = fact_activities.activity_date_id
    left join dim_time on fact_activities.activity_time_id = dim_time.time_id
    left join dim_country on fact_activities.site_country_id = dim_country.country_id
    
    left join dim_browser on fact_activities.browser_id = dim_browser.browser_id -- firefox etc
    left join dim_device on fact_activities.device_id = dim_device.device_id -- device_os, device_category (desktop / mobile...)
    
    left join dim_channel on dim_channel.channel_key = json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'channel', true)
    
    
    -- Only join product and provider by id if the slug isn't set, i.e. assume the event is pre-Falcon. YMMV (and the ids are deprecated).
    -- TODO: remove this; it is only kept because it's unclear whether removing it is safe.
    left join dim_product on (dim_product.source_product_id = json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'product_id', true) 
        and coalesce(json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'product', true), '') =''
        and dim_product.channel_id = dim_channel.channel_id 
        and dim_product.country_id = dim_country.country_id) 
    left join dim_provider on (
        dim_provider.source_provider_id = json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'provider_id', true) 
        and coalesce(json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'provider', true), '') =''
        and dim_provider.channel_id = dim_channel.channel_id 
            and dim_provider.country_id = dim_country.country_id)
    

    
    where 
        dim_activity_type.activity_name = 'LeadGeneration.ClickConversion'
        and user_filter_type='external_visitor'
        and dim_date.full_date>='{from_date}'
        and dim_date.full_date<='{to_date}'
        
        
        -- NB: embeds aren't currently listed as blog pages :(
        
""".format(from_date= from_datetime.isoformat(), to_date=to_datetime.isoformat())


In [96]:
print(query)


select  

    country_code
    , dim_page_type.page_type
    , dim_page_type.page_sub_type
    , case when page_url like '%/embed/%' then true else false end as is_embed
    , page_url
    , fact_activities.page_id
    , full_date
    , full_time
    , session_id
    , fact_activities.anonymous_user_id
    , source_anonymous_id
    , device_os
    , device_category
    , browser
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'channel', true) as channel
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'product_slug', true) as product_slug
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'product', true) as product
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'product_id', true) as product_id
    , dim_product.slug as product_from_id
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'provider_slug', true) as provider_slug
    , json_extrac

In [97]:
query = sqlalchemy.text(query)
apply_clicks = dq.query(query)

Starting query at 2020-04-10T17:37:22.519002
Query took 104.32


In [98]:
apply_clicks.describe()

Unnamed: 0,page_id,session_id,anonymous_user_id
count,12628.0,12628.0,12628.0
mean,1203246.0,71672350.0,57607620.0
std,909478.9,223397.3,9454406.0
min,516320.0,71215400.0,10182710.0
25%,517325.0,71480130.0,60530380.0
50%,826200.0,71682120.0,61196800.0
75%,1707747.0,71864590.0,61383720.0
max,2971639.0,72113980.0,61594300.0


In [99]:
apply_clicks.head(5)

Unnamed: 0,country_code,page_type,page_sub_type,is_embed,page_url,page_id,full_date,full_time,session_id,anonymous_user_id,...,provider_id,provider_from_id,affiliate_category,affiliate_location,affiliate_page_type,affiliate_widget_type,list_position,action,source,activity_attributes
0,sg,listing,channel_listing,False,www.moneysmart.sg/personal-loan,517289,2020-04-07,00:01:06,71893431,61284590,...,4.0,,,,,,,,,"{""channel"":""personal-loan"",""country"":""sg"",""is_paid"":""true"",""product"":""posb-personal-loan"",""provider"":""posb"",""page_path"":""/personal-loan?utm_source=google&utm_medium=cpc&utm_campaign=Adwords%20%7C%..."
1,hk,details,product_details,False,www.moneysmart.hk/zh-hk/credit-cards/wewa-unionpay,1012007,2020-04-05,00:02:06,71641550,61314380,...,118.0,,,,,,,,,"{""channel"":""credit-cards"",""country"":""hk"",""is_paid"":""true"",""product"":""wewa-unionpay"",""language"":""zh-hk"",""provider"":""primecredit"",""page_path"":""/zh-hk/credit-cards/wewa-unionpay?utm_source=facebook&u..."
2,hk,details,product_details,False,www.moneysmart.hk/zh-hk/credit-cards/wewa-unionpay,1012007,2020-04-06,00:02:22,71815379,61272433,...,118.0,,,,,,,,,"{""channel"":""credit-cards"",""country"":""hk"",""is_paid"":""true"",""product"":""wewa-unionpay"",""language"":""zh-hk"",""provider"":""primecredit"",""page_path"":""/zh-hk/credit-cards/wewa-unionpay?utm_source=facebook&u..."
3,sg,Unknown,,False,www.moneysmart.sg/investments/ig-markets-ms,1855085,2020-04-07,00:02:30,71888598,61451108,...,,,,,,,,,product-card,"{""source"":""product-card"",""channel"":""investments"",""country"":""sg"",""language"":""en"",""page_path"":""/investments/ig-markets-ms"",""page_type"":""lps"",""event_type"":""apply-now"",""product_id"":"""",""auth_status"":""f..."
4,sg,listing,provider_listing,False,www.moneysmart.sg/credit-cards/dbs,516694,2020-04-06,00:02:30,71863211,61319356,...,8.0,,,,,,,,,"{""channel"":""credit-cards"",""country"":""sg"",""is_paid"":""true"",""product"":""dbs-altitude-visa-signature-card"",""provider"":""dbs"",""page_path"":""/credit-cards/dbs?card-association=visa&provider_slugs%5B%5D=db..."
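
Because `activity_attributes` is a JSON object stored as text, fields the query does not extract can still be read in Python. A sketch, assuming the value may be wrapped in an extra pair of double quotes (which is what the `trim` call in the query suggests); `parse_attributes` is a hypothetical helper and the example string is abbreviated:

```python
import json


def parse_attributes(raw):
    """Parse an activity_attributes string into a dict ({} if unparseable)."""
    if not isinstance(raw, str):
        return {}
    text = raw.strip()
    # Drop one pair of wrapping quotes, if present.
    if text.startswith('"') and text.endswith('"'):
        text = text[1:-1]
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


example = '{"channel":"personal-loan","country":"hk","provider_id":"12"}'
for key, value in parse_attributes(example).items():
    print(key, "=", value)
```

Applied to the frame, this would be something like `apply_clicks["activity_attributes"].map(parse_attributes)`, which gives one dict per click.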


In [100]:
pd.set_option("display.max_colwidth", 200)
# apply_clicks[apply_clicks["page_type"].str.contains("blog")][["page_url", "provider", "provider_id", "activity_attributes"]]
# Print the attributes of the first blog click, one per line (a naive split on commas)
for a in apply_clicks[apply_clicks["page_type"].str.contains("blog")]["activity_attributes"].values[0].split(","):
    print(a)


{"channel":"personal-loan"
"country":"hk"
"language":"zh-hk"
"auth_status":"false"
"provider_id":"12"
"affiliate_category":"loans"}


In [101]:
product_provider_summary_cols = [ "page_url", "action", "page_type", "page_sub_type","channel"]+ [z for z in apply_clicks.columns if "product" in z or "provider" in z]
affiliate_cols = [z for z in apply_clicks.columns if "affiliate" in z]

In [102]:
apply_clicks_original = apply_clicks

In [103]:
apply_clicks[product_provider_summary_cols ].head()

Unnamed: 0,page_url,action,page_type,page_sub_type,channel,product_slug,product,product_id,product_from_id,provider_slug,provider,provider_id,provider_from_id
0,www.moneysmart.sg/personal-loan,,listing,channel_listing,personal-loan,,posb-personal-loan,3.0,,,posb,4.0,
1,www.moneysmart.hk/zh-hk/credit-cards/wewa-unionpay,,details,product_details,credit-cards,,wewa-unionpay,4.0,,,primecredit,118.0,
2,www.moneysmart.hk/zh-hk/credit-cards/wewa-unionpay,,details,product_details,credit-cards,,wewa-unionpay,4.0,,,primecredit,118.0,
3,www.moneysmart.sg/investments/ig-markets-ms,,Unknown,,investments,,,,,,,,
4,www.moneysmart.sg/credit-cards/dbs,,listing,provider_listing,credit-cards,,dbs-altitude-visa-signature-card,10.0,,,dbs,8.0,


## Using the new page_type and page_sub_type

In [104]:

apply_clicks_with_new_types = apply_clicks_original.rename(columns={"page_type":"page_type_from_dwh", "page_sub_type":"page_sub_type_from_dwh"})\
    .merge(pages_types_merged[["page_id", "page_type", "page_sub_type", "canonical_url"]], how="left", on="page_id")#.reset_index()

In [105]:
len(apply_clicks_original)

12628

In [106]:
len(apply_clicks_with_new_types)

12628
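
The two `len` checks confirm that the left merge did not duplicate click rows, which would happen if a `page_id` appeared more than once on the right-hand side. pandas can enforce the same guarantee at merge time with `validate`. A sketch on invented frames:

```python
import pandas as pd

clicks = pd.DataFrame({"page_id": [1, 1, 2], "session_id": [10, 11, 12]})
page_types = pd.DataFrame({"page_id": [1, 2], "page_type": ["listing", "blog"]})

# many_to_one: raises pandas.errors.MergeError if page_id is not unique
# in the right-hand frame, so the row count cannot silently grow.
merged = clicks.merge(page_types, how="left", on="page_id", validate="many_to_one")
print(len(clicks), len(merged))
```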

In [107]:
apply_clicks = apply_clicks_with_new_types

In [108]:
apply_clicks.columns

Index(['country_code', 'page_type_from_dwh', 'page_sub_type_from_dwh',
       'is_embed', 'page_url', 'page_id', 'full_date', 'full_time',
       'session_id', 'anonymous_user_id', 'source_anonymous_id', 'device_os',
       'device_category', 'browser', 'channel', 'product_slug', 'product',
       'product_id', 'product_from_id', 'provider_slug', 'provider',
       'provider_id', 'provider_from_id', 'affiliate_category',
       'affiliate_location', 'affiliate_page_type', 'affiliate_widget_type',
       'list_position', 'action', 'source', 'activity_attributes', 'page_type',
       'page_sub_type', 'canonical_url'],
      dtype='object')

In [109]:
#apply_clicks.page_sub_type

In [110]:
apply_clicks.head()

Unnamed: 0,country_code,page_type_from_dwh,page_sub_type_from_dwh,is_embed,page_url,page_id,full_date,full_time,session_id,anonymous_user_id,...,affiliate_location,affiliate_page_type,affiliate_widget_type,list_position,action,source,activity_attributes,page_type,page_sub_type,canonical_url
0,sg,listing,channel_listing,False,www.moneysmart.sg/personal-loan,517289,2020-04-07,00:01:06,71893431,61284590,...,,,,,,,"{""channel"":""personal-loan"",""country"":""sg"",""is_paid"":""true"",""product"":""posb-personal-loan"",""provider"":""posb"",""page_path"":""/personal-loan?utm_source=google&utm_medium=cpc&utm_campaign=Adwords%20%7C%...",listing,channel_listing,www.moneysmart.sg/personal-loan
1,hk,details,product_details,False,www.moneysmart.hk/zh-hk/credit-cards/wewa-unionpay,1012007,2020-04-05,00:02:06,71641550,61314380,...,,,,,,,"{""channel"":""credit-cards"",""country"":""hk"",""is_paid"":""true"",""product"":""wewa-unionpay"",""language"":""zh-hk"",""provider"":""primecredit"",""page_path"":""/zh-hk/credit-cards/wewa-unionpay?utm_source=facebook&u...",product_details,unknown,www.moneysmart.hk/zh-hk/credit-cards/wewa-unionpay
2,hk,details,product_details,False,www.moneysmart.hk/zh-hk/credit-cards/wewa-unionpay,1012007,2020-04-06,00:02:22,71815379,61272433,...,,,,,,,"{""channel"":""credit-cards"",""country"":""hk"",""is_paid"":""true"",""product"":""wewa-unionpay"",""language"":""zh-hk"",""provider"":""primecredit"",""page_path"":""/zh-hk/credit-cards/wewa-unionpay?utm_source=facebook&u...",product_details,unknown,www.moneysmart.hk/zh-hk/credit-cards/wewa-unionpay
3,sg,Unknown,,False,www.moneysmart.sg/investments/ig-markets-ms,1855085,2020-04-07,00:02:30,71888598,61451108,...,,,,,,product-card,"{""source"":""product-card"",""channel"":""investments"",""country"":""sg"",""language"":""en"",""page_path"":""/investments/ig-markets-ms"",""page_type"":""lps"",""event_type"":""apply-now"",""product_id"":"""",""auth_status"":""f...",landing,lps,www.moneysmart.sg/investments/ig-markets-ms
4,sg,listing,provider_listing,False,www.moneysmart.sg/credit-cards/dbs,516694,2020-04-06,00:02:30,71863211,61319356,...,,,,,,,"{""channel"":""credit-cards"",""country"":""sg"",""is_paid"":""true"",""product"":""dbs-altitude-visa-signature-card"",""provider"":""dbs"",""page_path"":""/credit-cards/dbs?card-association=visa&provider_slugs%5B%5D=db...",listing,provider_listing,www.moneysmart.sg/credit-cards/dbs


## Utilities

In [111]:
def format_results(df):
    """Return a Styler that renders the page_url column as clickable links."""
    def make_clickable(val):
        # target="_blank" opens the link in a new window / tab
        return '<a target="_blank" href="{}">{}</a>'.format("https://"+ val, val)
    
    return df.style.format({'page_url': make_clickable})

## Issues

### Product / provider (slug) not set (product_id and provider_id are deprecated)

In [112]:
df = apply_clicks[(apply_clicks.provider.isna()) | (apply_clicks.provider=="")][product_provider_summary_cols]
print("only first 20 shown")
format_results(df.head(20))

only first 20 shown


Unnamed: 0,page_url,action,page_type,page_sub_type,channel,product_slug,product,product_id,product_from_id,provider_slug,provider,provider_id,provider_from_id
3,www.moneysmart.sg/investments/ig-markets-ms,,landing,lps,investments,,,,,,,,
5,blog.moneysmart.hk/zh-hk/loans/%E9%80%B2%E4%BF%AE%E8%B2%B8%E6%AC%BE-%E7%A7%81%E4%BA%BA%E8%B2%B8%E6%AC%BE-%E6%AF%94%E8%BC%83,,blog,blog_article,personal-loan,,,,,,,12.0,
6,www.moneysmart.sg/investments/online-brokerages-ms,,landing,lps,investments,,,,,,,,
8,blog.moneysmart.sg/credit-cards/boc-sheng-siong-card,,blog,blog_article,credit-cards,,,,,,,10.0,
14,www.moneysmart.sg/investments/saxo-markets-ms,,landing,lps,investments,,,,,,,,
16,www.moneysmart.hk/zh-hk/personal-loan/best-promise-personal-loan-ms,,landing,lps,personal-loan,,,239.0,,,,25.0,
30,www.moneysmart.hk/zh-hk/personal-loan/low-interest-rate-ms,,landing,lps,personal-loan,,,,,,,,
33,www.moneysmart.hk/zh-hk/personal-loan/best-hsbc-personal-loan-ms,,landing,lps,personal-loan,,,,,,,,
42,blog.moneysmart.sg/budgeting/cheapest-sim-only-plans,,blog,blog_article,credit-cards,,,,,,,7.0,
47,blog.moneysmart.sg/credit-cards/dbs-credit-cards-singapore-review,,blog,blog_article,credit-cards,,,,,,,8.0,


In [113]:
df2 = pd.DataFrame(df.groupby(["page_url", "page_type", "page_sub_type","channel"]).size().reset_index().sort_values(0, ascending=False))
format_results(df2)

Unnamed: 0,page_url,page_type,page_sub_type,channel,0
139,www.moneysmart.sg/investments/online-brokerages-ms,landing,lps,investments,225
140,www.moneysmart.sg/investments/saxo-markets-ms,landing,lps,investments,89
68,blog.moneysmart.sg/personal-loans/best-personal-loan-singapore,blog,blog_article,personal-loan,58
129,www.moneysmart.sg/embed/5051cca749bae55521c34317d0799cae/result,embed,unknown,refinancing,31
48,blog.moneysmart.sg/credit-cards/dbs-credit-cards-singapore-review,blog,blog_article,credit-cards,29
102,www.moneysmart.hk/zh-hk/health-insurance/vhis-ms,landing,lps,health-insurance,29
114,www.moneysmart.hk/zh-hk/personal-loan/no-credit-check-loans-ms,landing,lps,personal-loan,26
78,blog.moneysmart.sg/shopping/lazada-promo-code-promotion,blog,blog_article,credit-cards,26
10,blog.moneysmart.hk/zh-hk/credit-cards/%E9%9B%BB%E5%99%A8%E5%84%AA%E6%83%A0-%E4%BF%A1%E7%94%A8%E5%8D%A1-%E8%B2%B7%E9%9B%BB%E5%99%A8-%E5%84%AA%E6%83%A0,blog,blog_article,credit-cards,22
103,www.moneysmart.hk/zh-hk/investments/retirement-products-deduct-taxes-ms,landing,lps,investments,21


### Using product_slug or provider_slug not product / provider

In [114]:
apply_clicks[~(apply_clicks.provider_slug.isna() | (apply_clicks.provider_slug==""))][product_provider_summary_cols]

Unnamed: 0,page_url,action,page_type,page_sub_type,channel,product_slug,product,product_id,product_from_id,provider_slug,provider,provider_id,provider_from_id


In [115]:
apply_clicks[~(apply_clicks["product_slug"].isna() | (apply_clicks["product_slug"]==""))][product_provider_summary_cols]

Unnamed: 0,page_url,action,page_type,page_sub_type,channel,product_slug,product,product_id,product_from_id,provider_slug,provider,provider_id,provider_from_id


### Not having any product or provider info

In [116]:
# No product info
df = apply_clicks[(apply_clicks["product"].isna() | (apply_clicks["product"]=="")) & (apply_clicks.product_id.isna() | ((apply_clicks["product_id"]=="")))][product_provider_summary_cols]
df2 = pd.DataFrame(df.groupby(["page_url", "page_type", "channel"]).size()).reset_index().sort_values(0, ascending=False).rename(columns={0:"click count"})
format_results(df2)

Unnamed: 0,page_url,page_type,channel,click count
119,www.moneysmart.sg/investments/online-brokerages-ms,landing,investments,225
120,www.moneysmart.sg/investments/saxo-markets-ms,landing,investments,89
68,blog.moneysmart.sg/personal-loans/best-personal-loan-singapore,blog,personal-loan,58
48,blog.moneysmart.sg/credit-cards/dbs-credit-cards-singapore-review,blog,credit-cards,29
97,www.moneysmart.hk/zh-hk/health-insurance/vhis-ms,landing,health-insurance,29
78,blog.moneysmart.sg/shopping/lazada-promo-code-promotion,blog,credit-cards,26
106,www.moneysmart.hk/zh-hk/personal-loan/no-credit-check-loans-ms,landing,personal-loan,23
10,blog.moneysmart.hk/zh-hk/credit-cards/%E9%9B%BB%E5%99%A8%E5%84%AA%E6%83%A0-%E4%BF%A1%E7%94%A8%E5%8D%A1-%E8%B2%B7%E9%9B%BB%E5%99%A8-%E5%84%AA%E6%83%A0,blog,credit-cards,22
109,www.moneysmart.sg/car-insurance/aig-ms,landing,car-insurance,21
98,www.moneysmart.hk/zh-hk/investments/retirement-products-deduct-taxes-ms,landing,investments,21


In [117]:
# No provider info
apply_clicks[(apply_clicks.provider.isna() | (apply_clicks.provider==""))][product_provider_summary_cols]

Unnamed: 0,page_url,action,page_type,page_sub_type,channel,product_slug,product,product_id,product_from_id,provider_slug,provider,provider_id,provider_from_id
3,www.moneysmart.sg/investments/ig-markets-ms,,landing,lps,investments,,,,,,,,
5,blog.moneysmart.hk/zh-hk/loans/%E9%80%B2%E4%BF%AE%E8%B2%B8%E6%AC%BE-%E7%A7%81%E4%BA%BA%E8%B2%B8%E6%AC%BE-%E6%AF%94%E8%BC%83,,blog,blog_article,personal-loan,,,,,,,12,
6,www.moneysmart.sg/investments/online-brokerages-ms,,landing,lps,investments,,,,,,,,
8,blog.moneysmart.sg/credit-cards/boc-sheng-siong-card,,blog,blog_article,credit-cards,,,,,,,10,
14,www.moneysmart.sg/investments/saxo-markets-ms,,landing,lps,investments,,,,,,,,
16,www.moneysmart.hk/zh-hk/personal-loan/best-promise-personal-loan-ms,,landing,lps,personal-loan,,,239,,,,25,
30,www.moneysmart.hk/zh-hk/personal-loan/low-interest-rate-ms,,landing,lps,personal-loan,,,,,,,,
33,www.moneysmart.hk/zh-hk/personal-loan/best-hsbc-personal-loan-ms,,landing,lps,personal-loan,,,,,,,,
42,blog.moneysmart.sg/budgeting/cheapest-sim-only-plans,,blog,blog_article,credit-cards,,,,,,,7,
47,blog.moneysmart.sg/credit-cards/dbs-credit-cards-singapore-review,,blog,blog_article,credit-cards,,,,,,,8,


In [118]:
# No provider name or provider_id, grouped by number of clicks on the page
missing_providers = apply_clicks[((apply_clicks.provider=="" ) | (apply_clicks.provider.isna())) & ((apply_clicks.provider_id=="" ) | (apply_clicks.provider_id.isna())) ][["page_url", "provider", "provider_id"]]
missing_providers_grouped = missing_providers.groupby(["page_url"]).size().reset_index() #.rename(columns={0:"click count"})
#missing_providers_grouped.sort_values("provider_id", ascending=False)
format_results(pd.DataFrame(missing_providers_grouped.sort_values(0, ascending=False)))

Unnamed: 0,page_url,0
38,www.moneysmart.sg/investments/online-brokerages-ms,225
39,www.moneysmart.sg/investments/saxo-markets-ms,89
32,www.moneysmart.sg/embed/5051cca749bae55521c34317d0799cae/result,38
11,www.moneysmart.hk/zh-hk/health-insurance/vhis-ms,29
20,www.moneysmart.hk/zh-hk/personal-loan/no-credit-check-loans-ms,23
12,www.moneysmart.hk/zh-hk/investments/retirement-products-deduct-taxes-ms,21
23,www.moneysmart.sg/car-insurance/aig-ms,21
29,www.moneysmart.sg/car-insurance/msig-ms,19
28,www.moneysmart.sg/car-insurance/fwd-ms,19
17,www.moneysmart.hk/zh-hk/personal-loan/clear-credit-card-debts-ms,19
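
The filters in this section all repeat the same "missing or empty string" test. A minimal sketch of a helper that captures it, assuming each column holds either NaN/None or strings. `is_blank` and the toy data are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd


def is_blank(series: pd.Series) -> pd.Series:
    """True where a value is missing (NaN/None) or an empty string."""
    return series.isna() | (series == "")


# Toy data standing in for apply_clicks (illustrative values only).
clicks = pd.DataFrame({
    "page_url": ["a", "a", "b", "c"],
    "provider": ["", None, "dbs", np.nan],
    "provider_id": [None, "", "12", "7"],
})

# Rows with neither a provider name nor a provider id.
no_provider = clicks[is_blank(clicks["provider"]) & is_blank(clicks["provider_id"])]
counts = no_provider.groupby("page_url").size().sort_values(ascending=False)
```

Keeping the test in one place makes it harder for the NaN check and the empty-string check to drift apart between cells.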


### Embed without any info about the page that it's on

In [119]:
print("all of them!")

all of them!


### Blog page without affiliate stuff set
Blog clicks should carry full affiliate details, e.g. where on the page the click came from.

In [120]:
apply_clicks.page_type.unique()

array(['listing', 'product_details', 'landing', 'blog', 'misc_shop',
       'embed'], dtype=object)
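
Listing the observed values before filtering matters because `isin` with a value that never occurs silently returns an empty frame. A hedged sketch of a guard that fails loudly instead; `filter_known_values` is a hypothetical helper, not something defined in this notebook.

```python
import pandas as pd


def filter_known_values(df: pd.DataFrame, column: str, values: list) -> pd.DataFrame:
    """Filter df[column] to `values`, raising if a requested value is never observed."""
    observed = set(df[column].dropna().unique())
    unknown = [v for v in values if v not in observed]
    if unknown:
        raise ValueError(f"{unknown} not found in {column}; observed: {sorted(observed)}")
    return df[df[column].isin(values)]


# Toy data (illustrative only).
clicks = pd.DataFrame({"page_type": ["blog", "landing", "blog", "embed"]})

blog_clicks = filter_known_values(clicks, "page_type", ["blog"])

try:
    filter_known_values(clicks, "page_type", ["blog_page"])
    raised = False
except ValueError:
    raised = True
```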

In [121]:
# NB: page_type.unique() above returns 'blog' rather than 'blog_page', so this filter matches no rows.
df = apply_clicks[ apply_clicks.page_type.isin(["blog_page"]) & ((apply_clicks.affiliate_category=="") | (apply_clicks.affiliate_location=="") | (apply_clicks.affiliate_page_type=="") |  (apply_clicks.affiliate_widget_type=="")\
             | (apply_clicks.affiliate_category.isna()) | (apply_clicks.affiliate_location.isna()) | (apply_clicks.affiliate_page_type.isna()) |  (apply_clicks.affiliate_widget_type.isna()))][product_provider_summary_cols + affiliate_cols]

format_results(df)

Unnamed: 0,page_url,action,page_type,page_sub_type,channel,product_slug,product,product_id,product_from_id,provider_slug,provider,provider_id,provider_from_id,affiliate_category,affiliate_location,affiliate_page_type,affiliate_widget_type


In [122]:
"""
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'affiliate_category', true) as affiliate_category
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'affiliate_location', true) as affiliate_location
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'affiliate_page_type', true) as affiliate_page_type
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'affiliate_widget_type', true) as affiliate_widget_type

"""

'\n    , json_extract_path_text(trim(\'"\' from dim_activity.activity_attributes), \'affiliate_category\', true) as affiliate_category\n    , json_extract_path_text(trim(\'"\' from dim_activity.activity_attributes), \'affiliate_location\', true) as affiliate_location\n    , json_extract_path_text(trim(\'"\' from dim_activity.activity_attributes), \'affiliate_page_type\', true) as affiliate_page_type\n    , json_extract_path_text(trim(\'"\' from dim_activity.activity_attributes), \'affiliate_widget_type\', true) as affiliate_widget_type\n\n'
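
The SQL fragment above pulls the four affiliate fields out of `activity_attributes` in the warehouse. Since `activity_attributes` is also a column of `apply_clicks`, the same extraction could be sketched in pandas. This assumes the column holds a JSON object, optionally wrapped in stray double quotes; `extract_affiliate_fields` and the sample strings are hypothetical, and missing or invalid values become `None` here.

```python
import json

import pandas as pd

AFFILIATE_KEYS = [
    "affiliate_category",
    "affiliate_location",
    "affiliate_page_type",
    "affiliate_widget_type",
]


def extract_affiliate_fields(raw):
    """Parse one activity_attributes value; invalid or missing JSON gives all-None."""
    empty = {key: None for key in AFFILIATE_KEYS}
    if not isinstance(raw, str):
        return empty
    try:
        # Mirror the SQL trim('"' from ...) by stripping wrapping double quotes.
        parsed = json.loads(raw.strip('"'))
    except json.JSONDecodeError:
        return empty
    if not isinstance(parsed, dict):
        return empty
    return {key: parsed.get(key) for key in AFFILIATE_KEYS}


# Invented sample values, not real events.
raw_events = pd.Series([
    '{"affiliate_location": "top", "affiliate_widget_type": "inline-widget"}',
    '"{"affiliate_category": "budgeting"}"',
    "not json",
    None,
])
affiliate = pd.DataFrame([extract_affiliate_fields(r) for r in raw_events])
```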

### Fails to join on provider_id or product_id

Note that some pre-Falcon events in HK are expected not to join, as we didn't have the application database loaded.
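
One way to list ids that fail to join is a left merge with `indicator=True`. This is a sketch on toy data: the `products` lookup table and every value in it are invented, and only the column names follow the notebook.

```python
import pandas as pd

# Toy stand-ins; values are invented.
clicks = pd.DataFrame({
    "page_url": ["p1", "p2", "p3", "p4"],
    "product_id": ["106", "239", "", None],
})
products = pd.DataFrame({
    "product_id": [106, 500],
    "product_from_id": ["some-product", "other-product"],
})

# Keep only clicks that carry a numeric id, and align dtypes before joining:
# a string "106" never matches an integer 106.
has_id = clicks[clicks["product_id"].str.isnumeric().eq(True)].copy()
has_id["product_id"] = has_id["product_id"].astype("int64")

# _merge is "both" for matched rows and "left_only" for ids missing from the lookup.
joined = has_id.merge(products, how="left", on="product_id", indicator=True)
failed_to_join = joined[joined["_merge"] == "left_only"]
```

The `_merge` column separates "id present but unknown to the lookup" from "no id sent at all", which the `isna()` checks below cannot do on their own.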

In [123]:
# product_id is set, but product_from_id is null
df = apply_clicks[((apply_clicks["product"]=="") | apply_clicks["product"].isna()) & apply_clicks.product_id.str.isnumeric() & apply_clicks.product_from_id.isna()][product_provider_summary_cols]
format_results(df)


Unnamed: 0,page_url,action,page_type,page_sub_type,channel,product_slug,product,product_id,product_from_id,provider_slug,provider,provider_id,provider_from_id
12,www.moneysmart.sg/car-insurance/wizard/results,conversion,misc_shop,unknown,car-insurance,,,106,,,FWD,88.0,
16,www.moneysmart.hk/zh-hk/personal-loan/best-promise-personal-loan-ms,,landing,lps,personal-loan,,,239,,,,25.0,
132,www.moneysmart.sg/embed/034efd4dbf31b271e2107e810b18f435/result,,embed,unknown,refinancing,,,2446,,,,,
184,www.moneysmart.sg/car-insurance/wizard/results,conversion,misc_shop,unknown,car-insurance,,,121,,,MSIG,86.0,
198,www.moneysmart.sg/embed/5051cca749bae55521c34317d0799cae/result,,embed,unknown,refinancing,,,2411,,,,,
226,www.moneysmart.sg/car-insurance/wizard/results,conversion,misc_shop,unknown,car-insurance,,,118,,,AXA,87.0,
237,www.moneysmart.sg/embed/9cb432acbab519e7863e0608254b41e7/result,,embed,unknown,refinancing,,,2517,,,,,
301,www.moneysmart.sg/car-insurance/wizard/results,conversion,misc_shop,unknown,car-insurance,,,109,,,Etiqa,91.0,
314,www.moneysmart.sg/embed/9cb432acbab519e7863e0608254b41e7/result,,embed,unknown,home-loan,,,2477,,,,,
323,www.moneysmart.sg/car-insurance/wizard/results,conversion,misc_shop,unknown,car-insurance,,,124,,,Aviva,94.0,


In [124]:
# provider can't be interpreted from provider_id
df = apply_clicks[((apply_clicks["provider"]=="") | apply_clicks["provider"].isna()) & apply_clicks.provider_id.str.isnumeric() & apply_clicks.provider_from_id.isna()][product_provider_summary_cols]
print(len(df))
format_results(df)


377


Unnamed: 0,page_url,action,page_type,page_sub_type,channel,product_slug,product,product_id,product_from_id,provider_slug,provider,provider_id,provider_from_id
5,blog.moneysmart.hk/zh-hk/loans/%E9%80%B2%E4%BF%AE%E8%B2%B8%E6%AC%BE-%E7%A7%81%E4%BA%BA%E8%B2%B8%E6%AC%BE-%E6%AF%94%E8%BC%83,,blog,blog_article,personal-loan,,,,,,,12,
8,blog.moneysmart.sg/credit-cards/boc-sheng-siong-card,,blog,blog_article,credit-cards,,,,,,,10,
16,www.moneysmart.hk/zh-hk/personal-loan/best-promise-personal-loan-ms,,landing,lps,personal-loan,,,239.0,,,,25,
42,blog.moneysmart.sg/budgeting/cheapest-sim-only-plans,,blog,blog_article,credit-cards,,,,,,,7,
47,blog.moneysmart.sg/credit-cards/dbs-credit-cards-singapore-review,,blog,blog_article,credit-cards,,,,,,,8,
63,blog.moneysmart.hk/zh-hk/credit-cards/%E9%9B%BB%E5%99%A8%E5%84%AA%E6%83%A0-%E4%BF%A1%E7%94%A8%E5%8D%A1-%E8%B2%B7%E9%9B%BB%E5%99%A8-%E5%84%AA%E6%83%A0,,blog,blog_article,credit-cards,,,,,,,14,
155,blog.moneysmart.sg/credit-cards/best-student-credit-cards-singapore,,blog,blog_article,credit-cards,,,,,,,8,
233,blog.moneysmart.sg/shopping/lazada-promo-code-promotion,,blog,blog_article,credit-cards,,,,,,,8,
240,blog.moneysmart.sg/credit-cards/dbs-credit-cards-singapore-review,,blog,blog_article,credit-cards,,,,,,,8,
241,blog.moneysmart.sg/credit-cards/boc-sheng-siong-card,,blog,blog_article,credit-cards,,,,,,,10,


In [125]:
providers_channels[providers_channels.source_provider_id == 1]

Unnamed: 0,provider_id,provider_name,sys_inserted_x,sys_updated_x,source_provider_id,slug,status,channel_id,country_id,language_id,channel_key,channel_name,sys_inserted_y,sys_updated_y
127,838,HSBC,2019-02-19 19:32:36.488054,2019-02-19 19:32:36.488054,1.0,hsbc,1,16,1,1,personal-loan,Personal Loan,2018-09-11 10:11:04.481062,2018-09-11 10:11:04.481062
140,5944,DBS,2019-06-13 19:34:26.014648,2019-06-13 19:34:26.014648,1.0,dbs,1,17,1,1,refinancing,Home Loan Refinancing,2018-09-11 10:11:04.481062,2018-09-11 10:11:04.481062
260,1901,HSBC,2019-03-18 07:48:04.367002,2019-03-18 07:48:04.367002,1.0,hsbc,1,24,1,1,debt-consolidation-plan,Debt Consolidation Plan,2019-03-18 07:40:50.126887,2019-03-18 07:40:50.126887
270,5970,DBS,2019-06-13 19:34:26.014648,2019-06-13 19:34:26.014648,1.0,dbs,1,10,1,1,home-loan,Home Loan,2018-09-11 10:11:04.481062,2018-09-11 10:11:04.481062


In [126]:
providers_channels[providers_channels.source_provider_id.notna() & (providers_channels.source_provider_id > 30)].sort_values(["source_provider_id"])


Unnamed: 0,provider_id,provider_name,sys_inserted_x,sys_updated_x,source_provider_id,slug,status,channel_id,country_id,language_id,channel_key,channel_name,sys_inserted_y,sys_updated_y
371,103,American Express,2019-01-21 19:32:53.867181,2019-01-21 19:32:53.867181,51.0,american-express,1,1,1,3,unknown,Unknown,2018-09-11 10:11:04.481062,2018-09-11 10:11:04.481062
384,112,Bank Of China,2019-01-21 19:32:53.867181,2019-01-21 19:32:53.867181,53.0,boc,1,4,1,1,credit-cards,Credit Cards,2018-09-11 10:11:04.481062,2018-09-11 10:11:04.481062
378,102,CIMB,2019-01-21 19:32:53.867181,2019-01-21 19:32:53.867181,55.0,cimb,0,4,1,1,credit-cards,Credit Cards,2018-09-11 10:11:04.481062,2018-09-11 10:11:04.481062
3,104,Citibank,2019-01-21 19:32:53.867181,2019-01-21 19:32:53.867181,56.0,citibank,1,4,1,1,credit-cards,Credit Cards,2018-09-11 10:11:04.481062,2018-09-11 10:11:04.481062
129,105,DBS,2019-01-21 19:32:53.867181,2019-01-21 19:32:53.867181,58.0,dbs,1,4,1,1,credit-cards,Credit Cards,2018-09-11 10:11:04.481062,2018-09-11 10:11:04.481062
128,101,Hitachi Capital,2019-01-21 19:32:53.867181,2019-01-21 19:32:53.867181,60.0,hitachi-capital,0,4,1,1,credit-cards,Credit Cards,2018-09-11 10:11:04.481062,2018-09-11 10:11:04.481062
123,106,HSBC,2019-01-21 19:32:53.867181,2019-01-21 19:32:53.867181,62.0,hsbc,1,4,1,1,credit-cards,Credit Cards,2018-09-11 10:11:04.481062,2018-09-11 10:11:04.481062
133,107,Maybank,2019-01-21 19:32:53.867181,2019-01-21 19:32:53.867181,63.0,maybank,1,4,1,1,credit-cards,Credit Cards,2018-09-11 10:11:04.481062,2018-09-11 10:11:04.481062
383,108,OCBC,2019-01-21 19:32:53.867181,2019-01-21 19:32:53.867181,64.0,ocbc,1,4,1,1,credit-cards,Credit Cards,2018-09-11 10:11:04.481062,2018-09-11 10:11:04.481062
257,109,POSB,2019-01-21 19:32:53.867181,2019-01-21 19:32:53.867181,65.0,posb,1,4,1,1,credit-cards,Credit Cards,2018-09-11 10:11:04.481062,2018-09-11 10:11:04.481062


### Channels Observed (manual sense check)

In [127]:
apply_clicks.columns

Index(['country_code', 'page_type_from_dwh', 'page_sub_type_from_dwh',
       'is_embed', 'page_url', 'page_id', 'full_date', 'full_time',
       'session_id', 'anonymous_user_id', 'source_anonymous_id', 'device_os',
       'device_category', 'browser', 'channel', 'product_slug', 'product',
       'product_id', 'product_from_id', 'provider_slug', 'provider',
       'provider_id', 'provider_from_id', 'affiliate_category',
       'affiliate_location', 'affiliate_page_type', 'affiliate_widget_type',
       'list_position', 'action', 'source', 'activity_attributes', 'page_type',
       'page_sub_type', 'canonical_url'],
      dtype='object')

In [128]:
apply_clicks.groupby(["channel"]).size().sort_index()

channel
cancer-insurance              1
car-insurance               388
credit-cards               6700
debt-consolidation-plan     207
health-insurance             36
home-insurance               86
home-loan                    14
investments                 371
maid-insurance              154
personal-loan              4234
refinancing                  58
savings-account             179
travel-insurance            200
dtype: int64

## Listing click doesn't have index
(not sure whether this has been implemented yet on Falcon)

## Trying to get something useful out of apply clicks

NB don't use these numbers for reporting just yet.

### Utility functions

In [129]:
#def group_and_sort_with_conversion_rates(df, cols_to_group_and_sort_by):
 #   >>
    # pageviews, sessions, users on the page
    
    # sessions, users converting
    
    # conversion rates

In [130]:
def group_and_sort(df, cols_to_group_and_sort_by, sort_by_click_count = False):
    r = pd.DataFrame(df.groupby(cols_to_group_and_sort_by).size().reset_index().rename(columns={0:"num_clicks"}))
    if sort_by_click_count:
        if "country_code" in r.columns:
            r = r.sort_values(["country_code", "num_clicks"], ascending = False)
        else:
            r = r.sort_values("num_clicks", ascending = False)
    else:
        r = r.sort_values(cols_to_group_and_sort_by)
    return format_results(r) #AToW makes urls clickable
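
A self-contained variant of the grouping helper, sketched without `format_results` so it can run outside this notebook. `count_clicks` is a hypothetical name. Unlike `group_and_sort`, it keeps countries in ascending order and only sorts the counts descending, which is an assumption about what is wanted rather than the notebook's behaviour.

```python
import pandas as pd


def count_clicks(df, group_cols, sort_by_click_count=False):
    """Count rows per group; a simplified variant of group_and_sort."""
    counts = df.groupby(group_cols).size().reset_index(name="num_clicks")
    if not sort_by_click_count:
        return counts.sort_values(group_cols)
    if "country_code" in counts.columns:
        # Countries ascending, click counts descending within each country.
        return counts.sort_values(["country_code", "num_clicks"], ascending=[True, False])
    return counts.sort_values("num_clicks", ascending=False)


# Toy data (illustrative only).
clicks = pd.DataFrame({
    "country_code": ["sg", "sg", "sg", "hk", "hk", "hk"],
    "channel": ["credit-cards", "credit-cards", "personal-loan",
                "personal-loan", "personal-loan", "credit-cards"],
})
top = count_clicks(clicks, ["country_code", "channel"], sort_by_click_count=True)
```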


### Blog apply clicks excluding embeds by where they come from

AToW this won't include apply clicks from the comparison widgets in the page, but it will work for the 

In [131]:
blog_apply_clicks = apply_clicks[apply_clicks["page_type"].str.contains("blog")]


In [132]:
group_and_sort(blog_apply_clicks, ["country_code", "channel", "affiliate_category"])

Unnamed: 0,country_code,channel,affiliate_category,num_clicks
0,hk,credit-cards,air-miles,2
1,hk,credit-cards,budgeting,54
2,hk,credit-cards,credit-cards,10
3,hk,personal-loan,loans,10
4,sg,credit-cards,budgeting,13
5,sg,credit-cards,credit-cards,157
6,sg,credit-cards,dining,1
7,sg,credit-cards,entertainment,2
8,sg,credit-cards,savings-accounts,40
9,sg,credit-cards,shopping,23


### Blog apply clicks excluding embeds by place on page

In [133]:
group_and_sort(blog_apply_clicks, ["country_code","channel"]+ affiliate_cols)

Unnamed: 0,country_code,channel,affiliate_category,affiliate_location,affiliate_page_type,affiliate_widget_type,num_clicks
0,hk,credit-cards,air-miles,top,blog-article-details,inline-widget,2
1,hk,credit-cards,budgeting,,,,16
2,hk,credit-cards,budgeting,top,blog-article-details,inline-widget,38
3,hk,credit-cards,credit-cards,,,,3
4,hk,credit-cards,credit-cards,top,blog-article-details,inline-widget,7
5,hk,personal-loan,loans,,,,2
6,hk,personal-loan,loans,top,blog-article-details,inline-widget,8
7,sg,credit-cards,budgeting,top,blog-article-details,inline-widget,13
8,sg,credit-cards,credit-cards,top,blog-article-details,inline-widget,157
9,sg,credit-cards,dining,top,blog-article-details,inline-widget,1


### Blog apply clicks excluding embeds by product and provider

In [134]:
group_and_sort(blog_apply_clicks, ["country_code","channel", "product", "provider"]+ affiliate_cols)

Unnamed: 0,country_code,channel,product,provider,affiliate_category,affiliate_location,affiliate_page_type,affiliate_widget_type,num_clicks
0,hk,credit-cards,,,air-miles,top,blog-article-details,inline-widget,2
1,hk,credit-cards,,,budgeting,,,,16
2,hk,credit-cards,,,budgeting,top,blog-article-details,inline-widget,38
3,hk,credit-cards,,,credit-cards,,,,3
4,hk,credit-cards,,,credit-cards,top,blog-article-details,inline-widget,7
5,hk,personal-loan,,,loans,,,,2
6,hk,personal-loan,,,loans,top,blog-article-details,inline-widget,8
7,sg,credit-cards,,,budgeting,top,blog-article-details,inline-widget,13
8,sg,credit-cards,,,credit-cards,top,blog-article-details,inline-widget,157
9,sg,credit-cards,,,dining,top,blog-article-details,inline-widget,1


### Blog apply clicks excluding embeds top articles

In [135]:
group_and_sort(blog_apply_clicks, ["country_code","canonical_url"], sort_by_click_count=True)

Unnamed: 0,country_code,canonical_url,num_clicks
65,sg,blog.moneysmart.sg/personal-loans/best-personal-loan-singapore,58
45,sg,blog.moneysmart.sg/credit-cards/dbs-credit-cards-singapore-review,29
75,sg,blog.moneysmart.sg/shopping/lazada-promo-code-promotion,26
69,sg,blog.moneysmart.sg/savings-accounts/dbs-multiplier-account-review,17
70,sg,blog.moneysmart.sg/savings-accounts/dbs-multiplier-ocbc360-uob-one-covid-19,15
30,sg,blog.moneysmart.sg/credit-cards/best-student-credit-cards-singapore,12
76,sg,blog.moneysmart.sg/shopping/online-grocery-shopping-singapore-redmart-honestbee,12
33,sg,blog.moneysmart.sg/credit-cards/boc-sheng-siong-card,10
16,sg,blog.moneysmart.sg/budgeting/cheapest-sim-only-plans,8
74,sg,blog.moneysmart.sg/shopping/free-parking-singapore-malls-covid-19,7


### Setting up embed analysis

In [136]:
embed_apply_clicks = apply_clicks[apply_clicks.is_embed==True]

In [137]:
group_and_sort(embed_apply_clicks,["country_code","channel"]+ affiliate_cols)

Unnamed: 0,country_code,channel,affiliate_category,affiliate_location,affiliate_page_type,affiliate_widget_type,num_clicks
0,hk,credit-cards,credit-cards,,,smart-widget,11
1,hk,credit-cards,credit-cards,top,blog-post-page,smart-widget,1
2,hk,personal-loan,personal-loan-en,bottom,blog-post-page,smart-widget,1
3,hk,personal-loan,personal-loan-zh,,,smart-widget,4
4,hk,personal-loan,personal-loan-zh,bottom,blog-post-page,smart-widget,26
5,sg,credit-cards,,sidebar,blog-post-page,sidebar-widget,232
6,sg,credit-cards,budgeting,middle,blog-post-page,smart-widget,17
7,sg,credit-cards,career,middle,blog-post-page,smart-widget,8
8,sg,credit-cards,credit-cards,middle,blog-post-page,smart-widget,3
9,sg,credit-cards,credit-cards,top,blog-post-page,smart-widget,5


### Embed by location on page

In [138]:
group_and_sort(embed_apply_clicks, ["country_code", "affiliate_location"])

Unnamed: 0,country_code,affiliate_location,num_clicks
0,hk,,15
1,hk,bottom,27
2,hk,top,1
3,sg,middle,174
4,sg,sidebar,232
5,sg,top,5


### Blog apply clicks on page & embed
NB: you can't group by URL this way
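
When stacking the two sets of clicks, it can help to label each row with where it came from so the combined frame can still be split apart later. A minimal sketch on toy data; the `click_source` column is a hypothetical addition and not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for blog_apply_clicks and embed_apply_clicks.
blog_clicks = pd.DataFrame({
    "channel": ["credit-cards", "personal-loan"],
    "affiliate_location": ["top", "top"],
})
embed_clicks = pd.DataFrame({
    "channel": ["credit-cards"],
    "affiliate_location": ["sidebar"],
})

# Label each frame before stacking so rows stay distinguishable.
combined = pd.concat(
    [blog_clicks.assign(click_source="blog"), embed_clicks.assign(click_source="embed")],
    ignore_index=True,
)
by_source = combined.groupby(["click_source", "channel"]).size()
```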

In [139]:
blog_and_embed_apply_clicks = pd.concat([blog_apply_clicks, embed_apply_clicks])

In [140]:
group_and_sort(blog_and_embed_apply_clicks, ["country_code","channel"]+ affiliate_cols)

Unnamed: 0,country_code,channel,affiliate_category,affiliate_location,affiliate_page_type,affiliate_widget_type,num_clicks
0,hk,credit-cards,air-miles,top,blog-article-details,inline-widget,2
1,hk,credit-cards,budgeting,,,,16
2,hk,credit-cards,budgeting,top,blog-article-details,inline-widget,38
3,hk,credit-cards,credit-cards,,,,3
4,hk,credit-cards,credit-cards,,,smart-widget,11
5,hk,credit-cards,credit-cards,top,blog-article-details,inline-widget,7
6,hk,credit-cards,credit-cards,top,blog-post-page,smart-widget,1
7,hk,personal-loan,loans,,,,2
8,hk,personal-loan,loans,top,blog-article-details,inline-widget,8
9,hk,personal-loan,personal-loan-en,bottom,blog-post-page,smart-widget,1


### Blog and embed apply clicks by product and provider

In [141]:
group_and_sort(blog_and_embed_apply_clicks, ["country_code","channel", "product", "provider"])

Unnamed: 0,country_code,channel,product,provider,num_clicks
0,hk,credit-cards,,,67
1,hk,credit-cards,citibank-premiermiles-card,citibank,2
2,hk,credit-cards,dbs-eminent-visa-platinum-card,dbs,1
3,hk,credit-cards,earnmore-unionpay-card,primecredit,1
4,hk,credit-cards,earnmore-unionpay-card-applicable-to-designated-full-time-university-tertiary-students,primecredit,1
5,hk,credit-cards,hsbc-unionpay-dual-currency-diamond-card,hsbc,1
6,hk,credit-cards,hsbc-visa-signature-card,hsbc,1
7,hk,credit-cards,standard-chartered-asia-miles-mastercard,standard-chartered,2
8,hk,credit-cards,standard-chartered-simply-cash-visa-card,standard-chartered,1
9,hk,credit-cards,wewa-unionpay-card-applicable-to-designated-full-time-university-tertiary-students,primecredit,1


In [142]:
group_and_sort(blog_and_embed_apply_clicks, ["country_code","channel", "product", "provider"]+ affiliate_cols)

Unnamed: 0,country_code,channel,product,provider,affiliate_category,affiliate_location,affiliate_page_type,affiliate_widget_type,num_clicks
0,hk,credit-cards,,,air-miles,top,blog-article-details,inline-widget,2
1,hk,credit-cards,,,budgeting,,,,16
2,hk,credit-cards,,,budgeting,top,blog-article-details,inline-widget,38
3,hk,credit-cards,,,credit-cards,,,,3
4,hk,credit-cards,,,credit-cards,,,smart-widget,1
5,hk,credit-cards,,,credit-cards,top,blog-article-details,inline-widget,7
6,hk,credit-cards,citibank-premiermiles-card,citibank,credit-cards,,,smart-widget,1
7,hk,credit-cards,citibank-premiermiles-card,citibank,credit-cards,top,blog-post-page,smart-widget,1
8,hk,credit-cards,dbs-eminent-visa-platinum-card,dbs,credit-cards,,,smart-widget,1
9,hk,credit-cards,earnmore-unionpay-card,primecredit,credit-cards,,,smart-widget,1


### All clicks mobile vs desktop vs channel

In [143]:
group_and_sort(apply_clicks,["country_code","device_category"] )

Unnamed: 0,country_code,device_category,num_clicks
0,hk,desktop,973
1,hk,mobile,3057
2,hk,tablet,43
3,sg,desktop,3299
4,sg,mobile,5148
5,sg,tablet,108


In [144]:
group_and_sort(apply_clicks, ["country_code","channel", "device_category"])

Unnamed: 0,country_code,channel,device_category,num_clicks
0,hk,credit-cards,desktop,499
1,hk,credit-cards,mobile,1220
2,hk,credit-cards,tablet,20
3,hk,health-insurance,desktop,19
4,hk,health-insurance,mobile,17
5,hk,home-insurance,desktop,20
6,hk,home-insurance,mobile,1
7,hk,investments,desktop,13
8,hk,investments,mobile,8
9,hk,personal-loan,desktop,374


### All clicks by page type (and sub-type)
NB this isn't set properly for embeds

In [145]:
group_and_sort(apply_clicks, ["country_code","page_type"])#, "is_embed"])

Unnamed: 0,country_code,page_type,num_clicks
0,hk,blog,76
1,hk,embed,43
2,hk,landing,408
3,hk,listing,2386
4,hk,misc_shop,18
5,hk,product_details,1142
6,sg,blog,317
7,sg,embed,411
8,sg,landing,523
9,sg,listing,4993


In [146]:
group_and_sort(apply_clicks, ["country_code","page_type", "page_sub_type"])#, "is_embed"])

Unnamed: 0,country_code,page_type,page_sub_type,num_clicks
0,hk,blog,blog_article,76
1,hk,embed,unknown,43
2,hk,landing,lps,408
3,hk,listing,category_listing,254
4,hk,listing,channel_listing,1786
5,hk,listing,provider_listing,53
6,hk,listing,unknown,293
7,hk,product_details,unknown,1142
8,sg,blog,blog_article,317
9,sg,embed,unknown,411


### All clicks by channel

In [147]:
group_and_sort(apply_clicks, ["country_code","channel"])

Unnamed: 0,country_code,channel,num_clicks
0,hk,credit-cards,1739
1,hk,health-insurance,36
2,hk,home-insurance,21
3,hk,investments,21
4,hk,personal-loan,2201
5,hk,travel-insurance,55
6,sg,cancer-insurance,1
7,sg,car-insurance,388
8,sg,credit-cards,4961
9,sg,debt-consolidation-plan,207


### All clicks by channel and product

In [148]:
group_and_sort(apply_clicks, ["country_code","channel", "product", "provider"])

Unnamed: 0,country_code,channel,product,provider,num_clicks
0,hk,credit-cards,,,92
1,hk,credit-cards,aeon-card-jal,aeon-credit-service,21
2,hk,credit-cards,aeon-unionpay-credit-card,aeon-credit-service,1
3,hk,credit-cards,american-express-cathay-pacific-credit-card,american-express,5
4,hk,credit-cards,american-express-cathay-pacific-elite-credit-card,american-express,20
5,hk,credit-cards,american-express-i-t-cashback-card,american-express,5
6,hk,credit-cards,american-express-platinum-credit-card,american-express,20
7,hk,credit-cards,bank-of-china-cup-dual-currency-platinum-card,bank-of-china,2
8,hk,credit-cards,blue-cash-credit-card-from-american-express,american-express,9
9,hk,credit-cards,boc-dual-currency-diamond-card,bank-of-china,24


### All clicks by list position

List position only applies on listing pages and on blog pages where it has been set. It won't work on PDP.
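
Raw counts per position are hard to compare across channels of different sizes. A sketch of turning them into a share of each channel's positioned clicks, on invented data; clicks with no recorded position are dropped first, which is an assumption about how they should be treated.

```python
import pandas as pd

# Toy data (illustrative only).
clicks = pd.DataFrame({
    "channel": ["credit-cards"] * 5 + ["personal-loan"] * 2,
    "list_position": [0, 0, 1, 2, None, 0, 1],
})

# Drop clicks with no recorded position, then count per channel and position.
positioned = clicks.dropna(subset=["list_position"])
by_position = (
    positioned.groupby(["channel", "list_position"]).size().reset_index(name="num_clicks")
)

# Share of each channel's positioned clicks taken by each position.
channel_totals = by_position.groupby("channel")["num_clicks"].transform("sum")
by_position["share"] = by_position["num_clicks"] / channel_totals
```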



In [149]:
group_and_sort(apply_clicks, ["country_code","channel", "page_type", "page_sub_type", "list_position"])

Unnamed: 0,country_code,channel,page_type,page_sub_type,list_position,num_clicks
0,hk,credit-cards,blog,blog_article,,19
1,hk,credit-cards,blog,blog_article,0.0,28
2,hk,credit-cards,blog,blog_article,1.0,6
3,hk,credit-cards,blog,blog_article,2.0,3
4,hk,credit-cards,blog,blog_article,4.0,8
5,hk,credit-cards,blog,blog_article,5.0,1
6,hk,credit-cards,blog,blog_article,8.0,1
7,hk,credit-cards,embed,unknown,1.0,4
8,hk,credit-cards,embed,unknown,13.0,1
9,hk,credit-cards,embed,unknown,18.0,1


### By Landing Page

### Setting Up Conversion Rates
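
The conversion rates in this section are ratios of apply-click counts to page-view counts within the same group. A minimal sketch of that calculation on toy data, assuming both frames carry `session_id` and `anonymous_user_id`. The `summarise` helper and all values are hypothetical.

```python
import pandas as pd

# Toy data (illustrative only).
page_views = pd.DataFrame({
    "page_id": [1, 1, 1, 2, 2],
    "session_id": ["s1", "s1", "s2", "s3", "s4"],
    "anonymous_user_id": ["u1", "u1", "u2", "u3", "u3"],
})
apply_clicks = pd.DataFrame({
    "page_id": [1, 1],
    "session_id": ["s1", "s1"],
    "anonymous_user_id": ["u1", "u1"],
})


def summarise(df, group_cols, names):
    """Unique sessions, unique users and row count per group, in that order."""
    out = df.groupby(group_cols).agg(
        sessions=("session_id", "nunique"),
        users=("anonymous_user_id", "nunique"),
        rows=("session_id", "size"),
    )
    out.columns = names
    return out.reset_index()


views = summarise(page_views, ["page_id"], ["num_sessions", "num_users", "num_pageviews"])
clicks = summarise(apply_clicks, ["page_id"], ["num_apply_sessions", "num_apply_users", "num_clicks"])

# Left join from views so pages without clicks are kept, then treat missing counts as zero.
rates = views.merge(clicks, how="left", on=["page_id"])
click_cols = ["num_apply_sessions", "num_apply_users", "num_clicks"]
rates[click_cols] = rates[click_cols].fillna(0)

rates["unique_user_cr"] = rates["num_apply_users"] / rates["num_users"]
rates["unique_session_cr"] = rates["num_apply_sessions"] / rates["num_sessions"]
rates["apply_clicks_per_pageview_cr"] = rates["num_clicks"] / rates["num_pageviews"]
```

Joining from the page-view side keeps pages with zero clicks in the result, so their conversion rate shows as 0 rather than disappearing.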

In [150]:
page_views.columns

Index(['page_id', 'page_url', 'page_type', 'page_sub_type', 'referrer_page_id',
       'session_id', 'anonymous_user_id', 'source_anonymous_id', 'full_date',
       'full_time', 'hour24', 'minute'],
      dtype='object')

In [151]:
apply_clicks.columns

Index(['country_code', 'page_type_from_dwh', 'page_sub_type_from_dwh',
       'is_embed', 'page_url', 'page_id', 'full_date', 'full_time',
       'session_id', 'anonymous_user_id', 'source_anonymous_id', 'device_os',
       'device_category', 'browser', 'channel', 'product_slug', 'product',
       'product_id', 'product_from_id', 'provider_slug', 'provider',
       'provider_id', 'provider_from_id', 'affiliate_category',
       'affiliate_location', 'affiliate_page_type', 'affiliate_widget_type',
       'list_position', 'action', 'source', 'activity_attributes', 'page_type',
       'page_sub_type', 'canonical_url'],
      dtype='object')
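
Because a `canonical_url` can map to more than one `page_type` / `page_sub_type` at the time of writing, a join on it can duplicate rows. A quick check for such URLs, sketched on toy data; the `pages` frame and its values are hypothetical.

```python
import pandas as pd

# Toy page metadata (illustrative only).
pages = pd.DataFrame({
    "page_id": [1, 2, 3, 4],
    "canonical_url": ["x.sg/a", "x.sg/a", "x.sg/b", "x.sg/c"],
    "page_type": ["listing", "landing", "blog", "blog"],
    "page_sub_type": ["channel_listing", "lps", "blog_article", "blog_article"],
})

# Number of distinct (page_type, page_sub_type) pairs per canonical_url.
type_pairs = pages[["canonical_url", "page_type", "page_sub_type"]].drop_duplicates()
pairs_per_url = type_pairs.groupby("canonical_url").size()

# URLs with more than one pair would fan out rows in a join on canonical_url.
ambiguous_urls = pairs_per_url[pairs_per_url > 1].index.tolist()
```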

In [152]:

# NB: AToW canonical_url maps to multiple page_type and page_sub_type values, so joins on it need care.
"""
apply_clicks_by_page = apply_clicks.groupby(["page_id"]).agg({"anonymous_user_id":"nunique", "session_id":"nunique", "page_id":"size"})\
                                .rename(columns={"anonymous_user_id":"num_apply_users", "session_id":"num_apply_sessions", "page_id":"num_clicks"})
apply_clicks_by_page.reset_index(inplace=True)
"""

In [154]:
"""
page_views_summary = page_views.groupby(["page_id"]).agg({"anonymous_user_id":"nunique", "session_id":"nunique", "page_id":"size"})\
                                .rename(columns={"anonymous_user_id":"num_users", "session_id":"num_sessions", "page_id":"num_pageviews"})
page_views_summary.reset_index(inplace=True)

page_views_summary_with_page_types = page_views_summary.merge(pages_types_merged, how="left", on="page_id")
"""

In [155]:

#len(page_views_summary_with_page_types)

6708

In [156]:
#len(page_views_summary)

6708

In [157]:
#page_views_summary_with_page_types.head()

Unnamed: 0,page_id,num_users,num_sessions,num_pageviews,page_url,page_type,page_sub_type,event_count_athena,canonical_url
0,516315,118,121,123,blog.moneysmart.sg/credit-cards/annual-fees-singapores-credit-cards,blog,blog_article,180.0,blog.moneysmart.sg/credit-cards/annual-fees-singapores-credit-cards
1,516316,64,67,71,blog.moneysmart.sg/home-loans/standard-chartered-home-loan-review,blog,blog_article,40.0,blog.moneysmart.sg/home-loans/standard-chartered-home-loan-review
2,516317,27,31,31,blog.moneysmart.sg/credit-cards/best-credit-card-pregnancy-costs-singapore,blog,blog_article,26.0,blog.moneysmart.sg/credit-cards/best-credit-card-pregnancy-costs-singapore
3,516320,399,434,634,www.moneysmart.sg/credit-cards/online-shopping,listing,category_listing,146.0,www.moneysmart.sg/credit-cards/online-shopping
4,516322,1,1,1,blog.moneysmart.sg/property/4-things-look-renting-cheap-room-singapore,blog,blog_article,5.0,blog.moneysmart.sg/property/4-things-look-renting-cheap-room-singapore


In [158]:
"""
page_conversions = page_views_summary_with_page_types.merge(apply_clicks_by_page, how="left", on="page_id")\

# TODO: >> really want to add a conversion per user-day i.e. day long session

page_conversions.fillna({"num_clicks":0, "num_apply_sessions":0, "num_apply_users":0, "page_type":"", "page_sub_type":""}, inplace=True)

# >>>> TODO: need to add in country_code, which you'd get by adding into query for pages, but then need to follow through everything.

#This step will go away if we sort out dim_page
canonical_page_conversions = page_conversions.groupby([ "canonical_url"]).agg({
    
    "num_apply_users":"sum",
    "num_apply_sessions":"sum",
    "num_clicks":"sum",
    "num_users":"sum",
    "num_sessions":"sum",
    "num_pageviews":"sum",
    
    # This is due to not sorting out earlier... might go away as a problem.  It's ab test urls
    "page_type":"max",
    "page_sub_type":"max",
    
} )


"""

In [357]:
def group_with_conversions(apply_click_df, page_views_df, cols_to_group_by):
    """
    Group apply clicks and page views by the same columns and compute conversion rates.

    cols_to_group_by must be present in both data frames.
    
    """
    # Newer / 2nd method
    
    
    # trying to have the count on a column that will always be included, hence messy rename
    apply_click_summary = apply_click_df.groupby(cols_to_group_by).agg({"session_id":"nunique", "anonymous_user_id":["nunique", "size"]})
                                #.rename(columns={"anonymous_user_id":"num_apply_users", "session_id":"num_apply_sessions", "anonymous_user_id":"num_clicks"}) 
    apply_click_summary.columns=["num_apply_sessions", "num_apply_users", "num_clicks",]
    apply_click_summary.reset_index(inplace=True)
    
    page_views_summary = page_views_df.groupby(cols_to_group_by).agg({"session_id":"nunique", "anonymous_user_id":["nunique", "size"]})
                                #.rename(columns={"anonymous_user_id":"num_users", "session_id":"num_sessions", "anonymous_user_id":"num_pageviews"})
    page_views_summary.columns = ["num_sessions", "num_users", "num_pageviews", ]
    page_views_summary.reset_index(inplace=True)
    
    
    g = page_views_summary.merge(apply_click_summary, how="left", on=cols_to_group_by)

    g.fillna({"num_clicks":0, "num_apply_sessions":0, "num_apply_users":0}, inplace=True)
    g = g.fillna("") #make sure this is after the numbers one
    #, "page_type":"", "page_sub_type":""
    
    
    """
    g = df.groupby(cols_to_group_by).agg({
        "num_apply_users":"sum",
        "num_apply_sessions":"sum",
        "num_clicks":"sum",
        "num_users":"sum",
        "num_sessions":"sum",
        "num_pageviews":"sum",
        
        
    })
    
    """
    
    g["unique_user_cr"] = g["num_apply_users"] / g["num_users"]
    g["unique_session_cr"] = g["num_apply_sessions"] / g["num_sessions"]
    g["apply_clicks_per_pageview_cr"] = g["num_clicks"] / g["num_pageviews"]
    return g
   

    
def add_page_meta_and_group(df_by_page_id, page_meta_df, cols_to_group_by):
    # Page-level wrapper around the more general add_meta_and_group, joining on page_id.
    join_column = "page_id"
    return add_meta_and_group(df_by_page_id, page_meta_df, join_column, cols_to_group_by)
    
def add_meta_and_group(df_by_page_id, page_meta_df, join_column, cols_to_group_by):
    """
    Join metadata onto the counts via join_column, sum the counts by cols_to_group_by
    and recompute the conversion rates.
    """
    # Filtering out the columns that won't be used to try to minimise risk of mismatches
    count_columns = [  "num_apply_users",
                        "num_apply_sessions",
                        "num_clicks",
                        "num_users",
                        "num_sessions",
                        "num_pageviews",]

    df_group_cols = [z for z in cols_to_group_by if z in df_by_page_id.columns]
    page_meta_group_cols = [z for z in cols_to_group_by if z in page_meta_df.columns]
    # NB rows whose group columns are NaN after the left join (no matching metadata) are dropped by groupby.
    g = df_by_page_id[[join_column] + df_group_cols + count_columns]\
                .merge(page_meta_df[[join_column] + page_meta_group_cols], how="left", on=join_column)\
                .groupby(cols_to_group_by)\
                .agg({c: "sum" for c in count_columns})
    g.reset_index(inplace=True)
    g["unique_user_cr"] = g["num_apply_users"] / g["num_users"]
    g["unique_session_cr"] = g["num_apply_sessions"] / g["num_sessions"]
    g["apply_clicks_per_pageview_cr"] = g["num_clicks"] / g["num_pageviews"]
    return g

def add_session_meta_and_group(df_by_session_id, session_meta_df, cols_to_group_by):
    join_column = "session_id"
    return add_meta_and_group(df_by_session_id, session_meta_df, join_column,  cols_to_group_by)
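
To make the counting in `group_with_conversions` concrete, here is a minimal sketch on made-up data. The toy frames and the `summarise` helper are hypothetical and only mirror the column names used above; it assumes a pandas version with named aggregation.

```python
import pandas as pd

# Toy page views: one row per page view (hypothetical data).
page_views_toy = pd.DataFrame({
    "page_id": [1, 1, 1, 2, 2],
    "session_id": ["s1", "s1", "s2", "s3", "s4"],
    "anonymous_user_id": ["u1", "u1", "u2", "u3", "u3"],
})
# Toy apply clicks: one row per click (hypothetical data).
apply_clicks_toy = pd.DataFrame({
    "page_id": [1, 1],
    "session_id": ["s1", "s1"],
    "anonymous_user_id": ["u1", "u1"],
})


def summarise(df, names):
    """Count unique sessions, unique users and rows per page_id."""
    return df.groupby("page_id").agg(
        **{
            names[0]: ("session_id", "nunique"),
            names[1]: ("anonymous_user_id", "nunique"),
            names[2]: ("anonymous_user_id", "size"),
        }
    ).reset_index()


views = summarise(page_views_toy, ["num_sessions", "num_users", "num_pageviews"])
clicks = summarise(apply_clicks_toy, ["num_apply_sessions", "num_apply_users", "num_clicks"])

# Left join keeps pages that were viewed but never clicked.
toy = views.merge(clicks, how="left", on="page_id")
click_cols = ["num_apply_sessions", "num_apply_users", "num_clicks"]
toy[click_cols] = toy[click_cols].fillna(0)

toy["unique_user_cr"] = toy["num_apply_users"] / toy["num_users"]
toy["apply_clicks_per_pageview_cr"] = toy["num_clicks"] / toy["num_pageviews"]
print(toy)
```

Page 1 has two users, one of whom clicked twice over three page views, so its user conversion rate is 0.5 while its clicks-per-pageview rate is 2/3. Page 2 has views but no clicks and ends up with rates of 0 rather than NaN.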
    

### Best Converting Pages (excludes blog embeds)

Takes the top 50 pages by number of apply clicks, then sorts them by user conversion rate over the whole time period, highest first.

Conversion here means view -> click on the same page.

In [358]:
df1 = group_with_conversions(apply_clicks, page_views , ["page_id"])
df2 = add_page_meta_and_group(df1, pages_types_merged, ["canonical_url", "page_type", "page_sub_type"])
df2.sort_values("num_clicks", ascending=False).head(50).sort_values("unique_user_cr", ascending=False)


Unnamed: 0,canonical_url,page_type,page_sub_type,num_apply_users,num_apply_sessions,num_clicks,num_users,num_sessions,num_pageviews,unique_user_cr,unique_session_cr,apply_clicks_per_pageview_cr
5209,www.moneysmart.sg/car-insurance/wizard/results,misc_shop,unknown,194.0,215.0,289.0,238,255,626,0.815126,0.843137,0.461661
5709,www.moneysmart.sg/maid-insurance,listing,channel_listing,90.0,98.0,124.0,421,487,694,0.213777,0.201232,0.178674
5792,www.moneysmart.sg/personal-loan/hsbc-personal-loan,product_details,unknown,108.0,114.0,132.0,551,598,706,0.196007,0.190635,0.186969
5777,www.moneysmart.sg/personal-loan,listing,channel_listing,945.0,1081.0,1573.0,5359,6288,7863,0.176339,0.171915,0.200051
5073,www.moneysmart.hk/zh-hk/personal-loan/citi-tax-season-loan,product_details,unknown,42.0,42.0,45.0,239,259,304,0.175732,0.162162,0.148026
5560,www.moneysmart.sg/debt-consolidation-plan,listing,category_listing,108.0,115.0,177.0,642,723,974,0.168224,0.159059,0.181725
5808,www.moneysmart.sg/personal-loan/scb-cashone,product_details,unknown,34.0,35.0,38.0,208,230,278,0.163462,0.152174,0.136691
4661,www.moneysmart.hk/en/personal-loan,listing,channel_listing,86.0,91.0,113.0,549,610,766,0.156648,0.14918,0.14752
5051,www.moneysmart.hk/zh-hk/personal-loan/banks-loan,listing,channel_listing,50.0,50.0,70.0,326,343,458,0.153374,0.145773,0.152838
5702,www.moneysmart.sg/investments/saxo-markets-ms,landing,lps,80.0,81.0,89.0,547,577,643,0.146252,0.140381,0.138414


### Conversion Rates By Page Type

In [359]:
df1 = group_with_conversions(apply_clicks, page_views , ["page_id"])
df3 = add_page_meta_and_group(df1, pages_types_merged, ["page_type"]).sort_values(["page_type"])
df3

Unnamed: 0,page_type,num_apply_users,num_apply_sessions,num_clicks,num_users,num_sessions,num_pageviews,unique_user_cr,unique_session_cr,apply_clicks_per_pageview_cr
0,blog,352.0,357.0,393.0,606457,646709,718279,0.00058,0.000552,0.000547
1,home_page,0.0,0.0,0.0,6530,7527,9270,0.0,0.0,0.0
2,interstitial,0.0,0.0,0.0,12958,13585,17868,0.0,0.0,0.0
3,landing,773.0,783.0,931.0,27126,28926,33269,0.028497,0.027069,0.027984
4,learn,0.0,0.0,0.0,1457,1549,1702,0.0,0.0,0.0
5,listing,5293.0,5641.0,7379.0,79896,87112,129204,0.066249,0.064756,0.057111
6,misc_shop,266.0,289.0,367.0,18691,20173,27079,0.014231,0.014326,0.013553
7,product_details,2566.0,2650.0,3104.0,38677,40754,50276,0.066344,0.065024,0.061739


In [360]:
df1 = group_with_conversions(apply_clicks, page_views , ["page_id"])
df3 = add_page_meta_and_group(df1, pages_types_merged, ["page_type", "page_sub_type"]).sort_values(["page_type", "page_sub_type"])
df3

Unnamed: 0,page_type,page_sub_type,num_apply_users,num_apply_sessions,num_clicks,num_users,num_sessions,num_pageviews,unique_user_cr,unique_session_cr,apply_clicks_per_pageview_cr
0,blog,blog_article,352.0,357.0,393.0,588651,626200,692766,0.000598,0.00057,0.000567
1,blog,blog_category_page,0.0,0.0,0.0,2377,2527,3462,0.0,0.0,0.0
2,blog,blog_category_tag_page,0.0,0.0,0.0,5230,5334,5695,0.0,0.0,0.0
3,blog,blog_home_page,0.0,0.0,0.0,10187,12636,16342,0.0,0.0,0.0
4,blog,blog_tag_page,0.0,0.0,0.0,12,12,14,0.0,0.0,0.0
5,home_page,unknown,0.0,0.0,0.0,6530,7527,9270,0.0,0.0,0.0
6,interstitial,apply,0.0,0.0,0.0,725,762,1083,0.0,0.0,0.0
7,interstitial,redirect,0.0,0.0,0.0,12233,12823,16785,0.0,0.0,0.0
8,landing,calculator,0.0,0.0,0.0,4513,4866,5762,0.0,0.0,0.0
9,landing,lps,773.0,783.0,931.0,14611,15507,18080,0.052905,0.050493,0.051493


In [361]:
# TODO: >> not sure we are using a consistent page type between apply clicks and pageviews (a hack around this has been done for the above tables)

In [362]:
# TODO: >>> Need to get the channel etc of the pages, and work out how to deal with apply clicks not necessarily matching them (without it getting too complex)

### Top Converting Blog Pages (excluding embeds currently)
Only pages with at least `min_num_clicking_users` users who clicked apply are kept, to try to stop internal testers distorting the results.

NB the numbers are so low that they could still be heavily distorted by internal users.

In [363]:
min_num_clicking_users = 3
df1 = group_with_conversions(apply_clicks, page_views , ["page_id"])
df2 = add_page_meta_and_group(df1, pages_types_merged, ["canonical_url", "page_type", "page_sub_type"])
df2[(df2.page_type=="blog") & (df2.num_apply_users>=min_num_clicking_users)].sort_values("num_clicks", ascending=False).head(50).sort_values("unique_user_cr", ascending=False)


#df = group_with_conversions(canonical_page_conversions, ["canonical_url", "page_type", "page_sub_type"]).reset_index()
#df[(df.page_type=="blog") & (df.num_clicks>=min_num_clicks)].sort_values(["num_clicks"], ascending=False).head(50).sort_values("unique_user_cr", ascending=False)

Unnamed: 0,canonical_url,page_type,page_sub_type,num_apply_users,num_apply_sessions,num_clicks,num_users,num_sessions,num_pageviews,unique_user_cr,unique_session_cr,apply_clicks_per_pageview_cr
2149,blog.moneysmart.sg/credit-cards/cimb-credit-cards-singapore-review,blog,blog_article,4.0,4.0,4.0,30,31,35,0.133333,0.129032,0.114286
2138,blog.moneysmart.sg/credit-cards/boc-sheng-siong-card,blog,blog_article,10.0,10.0,10.0,101,109,121,0.09901,0.091743,0.082645
2168,blog.moneysmart.sg/credit-cards/dbs-credit-cards-singapore-review,blog,blog_article,26.0,27.0,29.0,334,360,398,0.077844,0.075,0.072864
659,blog.moneysmart.hk/zh-hk/credit-cards/%e9%9b%bb%e5%99%a8%e5%84%aa%e6%83%a0-%e4%bf%a1%e7%94%a8%e5%8d%a1-%e8%b2%b7%e9%9b%bb%e5%99%a8-%e5%84%aa%e6%83%a0,blog,blog_article,27.0,27.0,28.0,410,424,484,0.065854,0.063679,0.057851
3242,blog.moneysmart.sg/personal-loans/best-personal-loan-singapore,blog,blog_article,42.0,46.0,58.0,639,694,749,0.065728,0.066282,0.077437
3257,blog.moneysmart.sg/personal-loans/uob-personal-loan-review,blog,blog_article,3.0,3.0,4.0,57,60,67,0.052632,0.05,0.059701
2166,blog.moneysmart.sg/credit-cards/dbs-altitude-credit-card-review,blog,blog_article,3.0,3.0,4.0,62,67,70,0.048387,0.044776,0.057143
635,blog.moneysmart.hk/zh-hk/credit-cards/%e7%8f%be%e9%87%91%e5%9b%9e%e8%b4%88-%e4%bf%a1%e7%94%a8%e5%8d%a1-%e6%af%94%e8%bc%83,blog,blog_article,3.0,3.0,3.0,73,73,78,0.041096,0.041096,0.038462
2130,blog.moneysmart.sg/credit-cards/best-student-credit-cards-singapore,blog,blog_article,12.0,12.0,12.0,293,313,340,0.040956,0.038339,0.035294
2157,blog.moneysmart.sg/credit-cards/citibank-rewards-card-review,blog,blog_article,3.0,3.0,3.0,99,102,109,0.030303,0.029412,0.027523
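
As a rough illustration of how fragile rates on such small counts are, the sketch below computes a Wilson score interval for the first row above (4 applying users out of 30 users). This is not part of the original analysis; `wilson_interval` is a hypothetical helper and the 95% level (z = 1.96) is an assumption.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Approximate Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return (centre - half_width, centre + half_width)


# 4 applying users out of 30 users, as in the first blog row.
low, high = wilson_interval(4, 30)
print(f"observed {4 / 30:.3f}, plausible range roughly {low:.3f} to {high:.3f}")
```

The interval spans from about 5% to about 30%, so a handful of internal testers could easily change the ordering of these pages.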


### Session Analysis Setup

In [364]:
sessions_with_landing_pages = sessions.merge(pages_types_merged[["page_id", "page_type", "page_sub_type", "canonical_url"]], how="left", left_on="session_landing_page_id", right_on="page_id")
sessions_with_landing_pages.rename(columns={"page_type":"landing_page_page_type", "page_sub_type":"landing_page_page_sub_type", "canonical_url":"landing_page_canonical_url"}, inplace=True)

### Top Converting Session Landing Page

Given that a session landed on the page, did it include an apply click at some point in the journey? Everything is counted within a single session.
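
The idea can be sketched on made-up data as follows. The frames and the `landing_page` column are hypothetical stand-ins for `sessions`, `apply_clicks` and the landing page metadata; the real cells below go through `group_with_conversions` and `add_session_meta_and_group` instead.

```python
import pandas as pd

# Hypothetical toy data: one row per session, with the page it landed on.
sessions_toy = pd.DataFrame({
    "session_id": ["s1", "s2", "s3", "s4", "s5"],
    "landing_page": ["/personal-loan", "/personal-loan", "/blog/a", "/blog/a", "/blog/a"],
})
# One row per apply click; the click can happen on any page within the session.
apply_clicks_toy = pd.DataFrame({"session_id": ["s1", "s1", "s3"]})

clicks_per_session = (
    apply_clicks_toy.groupby("session_id").size().rename("num_clicks").reset_index()
)

# Left join so that sessions without any apply click are kept with 0 clicks.
merged = sessions_toy.merge(clicks_per_session, how="left", on="session_id")
merged["num_clicks"] = merged["num_clicks"].fillna(0)
merged["converted"] = merged["num_clicks"] > 0

by_landing = merged.groupby("landing_page").agg(
    num_sessions=("session_id", "nunique"),
    num_apply_sessions=("converted", "sum"),
    num_clicks=("num_clicks", "sum"),
).reset_index()
by_landing["unique_session_cr"] = by_landing["num_apply_sessions"] / by_landing["num_sessions"]
print(by_landing)
```

Here `/personal-loan` converts one of its two landing sessions (with two clicks), and `/blog/a` converts one of three.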

In [374]:
# Top pages by conversion rate
min_num_applying_users = 3  # NB the filters below use a strict ">", so pages need more than this many applying users
df1 = group_with_conversions(apply_clicks, sessions , ["session_id"])
df2 = add_session_meta_and_group(df1, sessions_with_landing_pages, ["landing_page_canonical_url", "landing_page_page_type", "landing_page_page_sub_type"])
df2[df2.num_apply_users>min_num_applying_users].sort_values("unique_user_cr", ascending=False).head(50)

Unnamed: 0,landing_page_canonical_url,landing_page_page_type,landing_page_page_sub_type,num_apply_users,num_apply_sessions,num_clicks,num_users,num_sessions,num_pageviews,unique_user_cr,unique_session_cr,apply_clicks_per_pageview_cr
4008,www.moneysmart.hk/en/health-insurance/vhis-ms,landing,lps,4.0,4.0,6.0,5,5,5,0.8,0.8,1.2
4460,www.moneysmart.sg/car-insurance/wizard/results,misc_shop,unknown,30.0,30.0,51.0,43,43,43,0.697674,0.697674,1.186047
4306,www.moneysmart.hk/zh-hk/investments/retirement-products-deduct-taxes-ms,landing,lps,16.0,16.0,21.0,31,31,31,0.516129,0.516129,0.677419
4343,www.moneysmart.hk/zh-hk/personal-loan/best-dbs-personal-loan-ms,landing,lps,5.0,5.0,6.0,10,10,10,0.5,0.5,0.6
4804,www.moneysmart.sg/embed/cb22e9c7b206441c82552d10c81f993a/result,embed,unknown,5.0,5.0,7.0,12,12,12,0.416667,0.416667,0.583333
4445,www.moneysmart.sg/car-insurance/aig-ms,landing,lps,16.0,16.0,28.0,41,41,41,0.390244,0.390244,0.682927
4296,www.moneysmart.hk/zh-hk/health-insurance/vhis-ms,landing,lps,10.0,10.0,22.0,30,30,30,0.333333,0.333333,0.733333
4451,www.moneysmart.sg/car-insurance/fwd-ms,landing,lps,14.0,14.0,18.0,44,44,44,0.318182,0.318182,0.409091
4359,www.moneysmart.hk/zh-hk/personal-loan/clear-credit-card-debts-ms,landing,lps,6.0,6.0,7.0,19,19,19,0.315789,0.315789,0.368421
4969,www.moneysmart.sg/personal-loan/uob-personal-loan,product_details,unknown,10.0,10.0,10.0,34,34,34,0.294118,0.294118,0.294118


In [375]:
# Most number of clicks
df2[df2.num_apply_users>min_num_applying_users].sort_values("num_clicks", ascending=False).head(50)

Unnamed: 0,landing_page_canonical_url,landing_page_page_type,landing_page_page_sub_type,num_apply_users,num_apply_sessions,num_clicks,num_users,num_sessions,num_pageviews,unique_user_cr,unique_session_cr,apply_clicks_per_pageview_cr
4938,www.moneysmart.sg/personal-loan,listing,channel_listing,1002.0,1002.0,1495.0,5515,5515,5515,0.181686,0.181686,0.271079
4475,www.moneysmart.sg/credit-cards,listing,channel_listing,1025.0,1025.0,1391.0,11158,11158,11158,0.091862,0.091862,0.124664
4329,www.moneysmart.hk/zh-hk/personal-loan,listing,channel_listing,523.0,523.0,735.0,3396,3396,3396,0.154005,0.154005,0.216431
4379,www.moneysmart.hk/zh-hk/personal-loan/lending-companies-loan,listing,channel_listing,378.0,378.0,493.0,6757,6757,6757,0.055942,0.055942,0.072961
4290,www.moneysmart.hk/zh-hk/credit-cards/wewa-unionpay,product_details,unknown,417.0,417.0,485.0,3848,3848,3848,0.108368,0.108368,0.12604
4441,www.moneysmart.sg,home_page,unknown,248.0,248.0,366.0,5889,5889,5889,0.042112,0.042112,0.06215
4544,www.moneysmart.sg/credit-cards/citi-cashback-plus-card,product_details,unknown,294.0,294.0,361.0,4509,4509,4509,0.065203,0.065203,0.080062
4071,www.moneysmart.hk/zh-hk/credit-cards,listing,channel_listing,232.0,232.0,323.0,2145,2145,2145,0.108159,0.108159,0.150583
4527,www.moneysmart.sg/credit-cards/cash-back,listing,category_listing,253.0,253.0,309.0,4145,4145,4145,0.061037,0.061037,0.074548
4549,www.moneysmart.sg/credit-cards/citibank-rewards-card,product_details,unknown,175.0,175.0,249.0,1594,1594,1594,0.109787,0.109787,0.156211


### Session Landing Page Page Type

In [376]:
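# TODO: >> can we add in paid / unpaid??
# TODO: >> check up on the sessions vs users - seems to be the same.
#   Likely because df1 is grouped by session_id first: each session then contributes exactly 1 session, 1 user and 1 row,
#   so after summing num_users == num_sessions == num_pageviews (num_pageviews is really a session count here).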
df1 = group_with_conversions(apply_clicks, sessions , ["session_id"])
df2 = add_session_meta_and_group(df1, sessions_with_landing_pages, ["landing_page_page_type", "landing_page_page_sub_type"])
df2

Unnamed: 0,landing_page_page_type,landing_page_page_sub_type,num_apply_users,num_apply_sessions,num_clicks,num_users,num_sessions,num_pageviews,unique_user_cr,unique_session_cr,apply_clicks_per_pageview_cr
0,blog,blog_article,958.0,958.0,1151.0,584029,584029,584029,0.00164,0.00164,0.001971
1,blog,blog_category_page,4.0,4.0,4.0,335,335,335,0.01194,0.01194,0.01194
2,blog,blog_category_tag_page,4.0,4.0,6.0,4056,4056,4056,0.000986,0.000986,0.001479
3,blog,blog_home_page,13.0,13.0,27.0,8819,8819,8819,0.001474,0.001474,0.003062
4,blog,blog_tag_page,0.0,0.0,0.0,1,1,1,0.0,0.0,0.0
5,embed,unknown,28.0,28.0,43.0,34746,34746,34746,0.000806,0.000806,0.001238
6,home_page,unknown,248.0,248.0,366.0,5902,5902,5902,0.04202,0.04202,0.062013
7,interstitial,apply,4.0,4.0,6.0,336,336,336,0.011905,0.011905,0.017857
8,interstitial,redirect,28.0,28.0,38.0,1025,1025,1025,0.027317,0.027317,0.037073
9,landing,calculator,1.0,1.0,1.0,3997,3997,3997,0.00025,0.00025,0.00025


In [381]:
page_views[page_views.page_url.str.contains("embed")].head()

Unnamed: 0,page_id,page_url,page_type,page_sub_type,referrer_page_id,session_id,anonymous_user_id,source_anonymous_id,full_date,full_time,hour24,minute


# ISS, .../apply etc pageview applies

The main way that we track apply clicks is through ISS (and, before that, an earlier interstitial page). NB other actions, like contact form submissions, aren't tracked.

Expect this not to work well for mortgage and car insurance.
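
A minimal sketch of what counting applies from ISS page views could look like. It assumes that ISS page views carry `page_type == "interstitial"` (as in the page type tables above) and that `referrer_page_id` points at the page the click came from; both are assumptions to verify against the real data, and the toy frame is made up.

```python
import pandas as pd

# Hypothetical toy page views; column names follow the page_views frame.
page_views_toy = pd.DataFrame({
    "page_id": [10, 99, 11, 99, 10],
    "page_type": ["listing", "interstitial", "product_details", "interstitial", "listing"],
    "referrer_page_id": [None, 10, None, 11, None],
    "session_id": ["s1", "s1", "s2", "s2", "s3"],
})

# Keep only interstitial (ISS) page views, i.e. the "official" apply actions.
iss_views = page_views_toy[page_views_toy["page_type"] == "interstitial"]

# Attribute each apply to the referring page (assumption: the referrer is the page that was clicked on).
applies_by_referrer = (
    iss_views.groupby("referrer_page_id")["session_id"]
    .nunique()
    .rename("num_apply_sessions")
    .reset_index()
)
print(applies_by_referrer)
```

Comparing these counts with the button-click counts per page would show where the two definitions diverge.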

# NPP

# More Questions

* Is there a time delay between pageviews and applies? (across the site, across shop, across blog)?
* Mix in channel of the pages with apply clicks
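
For the time delay question, one possible starting point is to compare the first page view and the first apply click within each session. The sketch below uses made-up data; the `viewed_at` and `clicked_at` timestamp columns are hypothetical and would need to be built from the date and time columns in the real frames.

```python
import pandas as pd

# Hypothetical toy data with ready-made timestamps.
page_views_toy = pd.DataFrame({
    "session_id": ["s1", "s1", "s2"],
    "viewed_at": pd.to_datetime(
        ["2020-03-01 10:00:00", "2020-03-01 10:05:00", "2020-03-01 11:00:00"]
    ),
})
apply_clicks_toy = pd.DataFrame({
    "session_id": ["s1", "s2"],
    "clicked_at": pd.to_datetime(["2020-03-01 10:07:00", "2020-03-01 11:00:30"]),
})

first_view = page_views_toy.groupby("session_id")["viewed_at"].min()
first_click = apply_clicks_toy.groupby("session_id")["clicked_at"].min()

# Subtraction aligns on session_id; sessions without a click drop out as NaT.
delay = (first_click - first_view).dropna()
median_delay_seconds = delay.dt.total_seconds().median()
print(delay)
print(median_delay_seconds)
```

The same calculation could be split by page type (shop vs blog) by joining the landing page metadata before taking the median.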