# Where apply button clicks come from

There's two ways to look at the data, each of which might give slightly different counts:
* LeadGeneration.ClickConversion from the button click
* PageView on the ISS page
* (and also redirect completed on ISS)

The button click event gives more context about where on the page the click happened, while the ISS pageview is the official definition of an action.


Events should be defined as per https://docs.google.com/spreadsheets/d/1HICh77BoGMIat9K3NPwz3pBayJWiAr0ohAlTuv7dr80/edit#gid=1692709656, but this hasn't be implemented consistently.  

Of note for button clicks / LeadGeneration.ClickConversion:
* We used to send product_id and provider_id, but with the move to Falcon that doesn't work any longer (or will provide incorrect results).  LPS in particular doesn't seem to have updated the implementation.
* The product comparison widget in the blog currently doesn't tell us what page it is on
* Sometimes there is no product or provider info coming through


Other things of note:
* LPS pages don't seem to be categorised as such in the database


Outstanding things not necessarily covered below:
* NPP clicks

In [265]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [254]:
# Expand to screen width to fit more on.
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

In [2]:
from datetime import datetime, timedelta
import numpy as np
import pandas as pd
import sqlalchemy

from data_warehouse_querying import DataWarehouseQuery

  """)


In [3]:
try:
    from pyathena import connect
except:
    print("Failed to import pyathena, trying to install it")
    !pip install pyathena

import athena_querying  #doing this style as there are connection details within that I want to scope

In [4]:
from athena_common_queries import *

In [5]:
import data_parsing

# Settings

In [6]:
num_days_to_query = 1
to_datetime = datetime.now().date() - timedelta(days=1) #datetime(year=2020, month=3, day=1)
from_datetime = to_datetime - timedelta(days=num_days_to_query)


In [7]:
pd.set_option("display.max_colwidth", 200)

# Database Connections

In [8]:
# Redshift data warehouse - most queries here
dq = DataWarehouseQuery()
dq.connect()

In [9]:
# Athena - used for page type analysis
aq = athena_querying.AthenaQuery()
aq.connect()

# Getting Base Data

In [10]:
products = dq.query("select * from dim_product")

Starting query at 2020-04-10T08:58:35.958875
Query took 0.07


In [11]:
products.head()

Unnamed: 0,product_id,product_name,source_product_id,sys_inserted,sys_updated,status,slug,language_id,channel_id,provider_id,country_id
0,56622,CIMB Platinum Mastercard,101,2019-07-08 19:34:51.551632,2019-07-08 19:34:51.551632,0,cimb-platinum-mastercard-212cae7f-f5cd-4dd1-bac3-63033973420e,1,4,102,1
1,70784,Maybank DUO Platinum Mastercard,105,2019-08-16 19:34:45.935610,2019-08-16 19:34:45.935610,1,maybank-duo-platinum-mastercard,1,4,107,1
2,75681,OCBC 90°N Card,106,2019-08-30 19:34:23.890164,2019-08-30 19:34:23.890164,0,ocbc-90-n-card,1,4,108,1
3,81857,Citibank Quick Cash (Existing Loan Customers),24,2019-09-16 19:35:32.751792,2019-09-16 19:35:32.751792,1,citibank-quick-cash-existing-customers,1,16,836,1
4,93497,OCBC ExtraCash Loan,28,2019-10-18 20:10:55.807708,2019-10-18 20:10:55.807708,0,ocbc-extra-cash-loan,1,16,833,1


In [12]:
providers = dq.query("select * from dim_provider")

Starting query at 2020-04-10T08:58:36.262887
Query took 0.02


In [13]:
providers.head()

Unnamed: 0,provider_id,provider_name,sys_inserted,sys_updated,source_provider_id,slug,status,channel_id,country_id,language_id
0,833,OCBC,2019-02-19 19:32:36.488054,2019-02-19 19:32:36.488054,6.0,ocbc,1,16,1,1
1,837,Standard Chartered Bank,2019-02-19 19:32:36.488054,2019-02-19 19:32:36.488054,2.0,scb,1,16,1,1
2,100,Standard Chartered Bank,2019-01-21 19:32:53.867181,2019-01-21 19:32:53.867181,67.0,scb,0,4,1,1
3,104,Citibank,2019-01-21 19:32:53.867181,2019-01-21 19:32:53.867181,56.0,citibank,1,4,1,1
4,255,AXA,2019-02-01 01:43:57.318499,2019-02-01 01:43:57.318499,259.0,axa-direct,1,20,1,1


In [14]:
channels = dq.query("select * from dim_channel")

Starting query at 2020-04-10T08:58:36.518420
Query took 0.02


In [15]:
channels.head()


Unnamed: 0,channel_id,channel_key,channel_name,sys_inserted,sys_updated
0,4,credit-cards,Credit Cards,2018-09-11 10:11:04.481062,2018-09-11 10:11:04.481062
1,8,home-equity-loan,Home Equity Loan,2018-09-11 10:11:04.481062,2018-09-11 10:11:04.481062
2,12,life-insurance,Life Insurance,2018-09-11 10:11:04.481062,2018-09-11 10:11:04.481062
3,16,personal-loan,Personal Loan,2018-09-11 10:11:04.481062,2018-09-11 10:11:04.481062
4,20,travel-insurance,Travel Insurance,2018-09-11 10:11:04.481062,2018-09-11 10:11:04.481062


In [16]:
providers_channels = pd.merge(providers, channels, on="channel_id", how="left")

In [17]:
len(providers)

496

In [18]:
len(providers_channels)

496

In [19]:
anonymous_users_some = dq.query("select * from dim_anonymous_user limit 1000")

Starting query at 2020-04-10T08:58:38.095673
Query took 0.02


In [263]:
anonymous_users_some.head()

Unnamed: 0,anonymous_user_id,source_anonymous_id,site_version_id,sys_inserted,sys_updated
0,22568387,2b3a19e2-bc37-44ff-a120-6a14fed9224e,2,2019-02-19 19:34:29.494854,2019-02-19 19:34:29.494854
1,22568399,2b4f8663-a80d-4dd2-8856-30e103b641ec,2,2019-02-19 19:34:29.494854,2019-02-19 19:34:29.494854
2,22568423,2b68a16b-4fcb-485d-8dff-910e59786224,5,2019-02-19 19:34:29.494854,2019-02-19 19:34:29.494854
3,22568452,2b9a0ea4-20de-420e-b226-d28cfb96deb2,1,2019-02-19 19:34:29.494854,2019-02-19 19:34:29.494854
4,22568481,2bba3513-4459-4cce-b938-9e623a24d5af,2,2019-02-19 19:34:29.494854,2019-02-19 19:34:29.494854


In [21]:
sessions_some = dq.query("select * from fact_sessions limit 1000")

Starting query at 2020-04-10T08:58:39.291487
Query took 0.03


In [22]:
sessions_some.head()

Unnamed: 0,fact_session_id,session_date_id,session_start_time_id,anonymous_user_id,user_id,device_id,browser_id,site_country_id,acquisition_site_version_id,session_landing_page_id,session_campaign_id,session_count,total_pageviews,total_interaction_events,session_order,sys_inserted,sys_updated,session_id,user_filter_type
0,9075319,20180701,162018,10188694,,2834,8692,1,2,517847,485852,1,1,1,1,2018-11-23 11:46:49.078785,2018-11-23 11:46:49.078785,11240374,external_visitor
1,9106274,20180701,183041,10182610,,2835,8684,1,1,517289,485980,1,5,8,1,2018-11-23 11:46:49.078785,2018-11-23 11:46:49.078785,11247942,external_visitor
2,9095621,20180701,51941,10193258,,2843,8679,1,2,517793,485852,1,1,1,1,2018-11-23 11:46:49.078785,2020-01-31 11:23:30.531883,11259674,bot
3,9087080,20180701,165234,10181904,,2843,8679,1,2,517782,485852,1,1,1,1,2018-11-23 11:46:49.078785,2020-01-31 11:23:30.531883,11288167,bot
4,9085606,20180701,142923,10208464,,2843,8679,1,2,516973,485852,1,1,1,1,2018-11-23 11:46:49.078785,2020-01-31 11:23:30.531883,11276457,bot


In [23]:
dim_session_some = dq.query("select * from dim_session limit 1000")

Starting query at 2020-04-10T08:58:40.952405
Query took 0.02


In [24]:
dim_session_some.head()

Unnamed: 0,session_id,source_session_id,source_session_start_time,source_session_end_time,site_version_id,sys_inserted,sys_updated
0,13678591,d3dba6c5-b233-41ec-80e9-2e22874f606a-180818230837824,2018-08-18 23:08:37.824,2999-12-31 16:00:00,2,2018-11-23 16:44:22.046187,2018-11-23 16:44:22.046187
1,13678595,d3e3be8b-3642-4603-b24b-8d210c80a829-180818154622120,2018-08-18 15:46:22.120,2999-12-31 16:00:00,1,2018-11-23 16:44:22.046187,2018-11-23 16:44:22.046187
2,13678599,d3ebf4a5-da72-4bca-8865-0f5705afc80f-180818191333170,2018-08-18 19:13:33.170,2999-12-31 16:00:00,2,2018-11-23 16:44:22.046187,2018-11-23 16:44:22.046187
3,13678603,d3ece6ff-8966-43e6-a41f-030653ee1e42-180818170633190,2018-08-18 17:06:33.190,2999-12-31 16:00:00,2,2018-11-23 16:44:22.046187,2018-11-23 16:44:22.046187
4,13678607,d3f2d9a2-cd1e-4936-ac88-f5d2c6bcb6f7-180818072726357,2018-08-18 07:27:26.357,2999-12-31 16:00:00,2,2018-11-23 16:44:22.046187,2018-11-23 16:44:22.046187


# Loading Supporting Data

NB: I'm being a bit liberal with querying excessive data to make it easier; and doing it properly should probably look at other semi-standard views done historically.

## Pageviews for session, user and pageview  counts (irrespective of apply) and so conversion rates

In [266]:

query = """
select 
    fact_activities.page_id
    , page_url
    , dim_page_type.page_type
    , dim_page_type.page_sub_type
    , referrer_page_id
    , session_id
    , fact_activities.anonymous_user_id
    , source_anonymous_id
    , full_date
    , full_time
    , hour24
    , minute
from 
    fact_activities
    left join dim_date on fact_activities.activity_date_id = dim_date.date_id
    left join dim_time on fact_activities.activity_time_id = dim_time.time_id
    left join dim_page on fact_activities.page_id = dim_page.page_id
    left join dim_anonymous_user on fact_activities.anonymous_user_id = dim_anonymous_user.anonymous_user_id
    left join dim_activity_type on fact_activities.activity_type_id = dim_activity_type.activity_type_id
    left join dim_page_type on dim_page_type.page_type_id = dim_page.page_type_id
    -- left join dim_page_
    

where
    dim_activity_type.activity_name = 'PageView'
    and user_filter_type='external_visitor'
    and dim_date.full_date>='{from_date}'
    and dim_date.full_date<='{to_date}'


""".format(from_date= from_datetime.isoformat(), to_date=to_datetime.isoformat())

In [267]:
print(query)


select 
    fact_activities.page_id
    , page_url
    , dim_page_type.page_type
    , dim_page_type.page_sub_type
    , referrer_page_id
    , session_id
    , fact_activities.anonymous_user_id
    , source_anonymous_id
    , full_date
    , full_time
    , hour24
    , minute
from 
    fact_activities
    left join dim_date on fact_activities.activity_date_id = dim_date.date_id
    left join dim_time on fact_activities.activity_time_id = dim_time.time_id
    left join dim_page on fact_activities.page_id = dim_page.page_id
    left join dim_anonymous_user on fact_activities.anonymous_user_id = dim_anonymous_user.anonymous_user_id
    left join dim_activity_type on fact_activities.activity_type_id = dim_activity_type.activity_type_id
    left join dim_page_type on dim_page_type.page_type_id = dim_page.page_type_id
    -- left join dim_page_
    

where
    dim_activity_type.activity_name = 'PageView'
    and user_filter_type='external_visitor'
    and dim_date.full_date>='2020-04-08'


In [268]:
page_views = dq.query(query)

Starting query at 2020-04-10T13:20:15.360733
Query took 3.84


In [269]:
page_views.head()

Unnamed: 0,page_id,page_url,page_type,page_sub_type,referrer_page_id,session_id,anonymous_user_id,source_anonymous_id,full_date,full_time,hour24,minute
0,517109,blog.moneysmart.sg/budgeting/declare-additional-income-tax,blog_page,blog,,72032736,22682275,1f1ef0f5-fb38-4d5e-9149-609e23bda4f1,2020-04-09,15:42:35,15,42
1,2971487,blog.moneysmart.sg/budgeting/covid-19-financial-assistance-retrenchment,blog_page,blog,,72066911,22926542,63d21cfa-05b6-4aa0-ab37-e0fdac8dcd1e,2020-04-09,22:42:53,22,42
2,2971487,blog.moneysmart.sg/budgeting/covid-19-financial-assistance-retrenchment,blog_page,blog,,71891321,23071050,b9e13651-e817-4f56-95a6-aadf6a6b8a05,2020-04-08,10:39:42,10,39
3,2971639,blog.moneysmart.sg/savings-accounts/dbs-multiplier-ocbc360-uob-one-covid-19,blog_page,blog,,71897409,23207960,0b8c8fff-77ca-4c6c-9bb7-599e824a2018,2020-04-08,12:09:16,12,9
4,2967216,blog.moneysmart.sg/budgeting/budget-2020-highlights,blog_page,blog,,72013546,22959808,ed7648a1-31cc-41a2-9b62-a86868a41c53,2020-04-09,13:36:11,13,36


In [270]:
page_views.agg(["min","max","count", "size"])

Unnamed: 0,page_id,page_url,page_type,page_sub_type,referrer_page_id,session_id,anonymous_user_id,source_anonymous_id,full_date,full_time,hour24,minute
min,516315,blog-admin.moneysmart.sg,Unknown,,516315.0,71824989,10177948,00003ae1-5271-4f96-b039-2f0fce0241cb,2020-04-08,00:00:00,0,0
max,2971858,www.moneysmart.sg/user-dashboard/profile,thank_you_page,thank_you,2971858.0,72114635,61594347,ffffde45-7b44-41fc-83f2-06e6cc06ba92,2020-04-09,23:59:53,23,59
count,204020,204020,204020,204020,27525.0,204020,204020,204020,204020,204020,204020,204020
size,204020,204020,204020,204020,204020.0,204020,204020,204020,204020,204020,204020,204020


In [271]:
page_views_referrer_stats = page_views.groupby(["page_type", "page_sub_type"]).agg({"referrer_page_id":["count", "size"]})#.apply(lambda x: x.referrer_page_id.count / x.size)

In [272]:
page_views_referrer_stats[("referrer_page_id", "fraction_with_referrer_set")] = page_views_referrer_stats[("referrer_page_id", "count")] / page_views_referrer_stats[("referrer_page_id", "size")]

In [273]:
page_views_referrer_stats

Unnamed: 0_level_0,Unnamed: 1_level_0,referrer_page_id,referrer_page_id,referrer_page_id
Unnamed: 0_level_1,Unnamed: 1_level_1,count,size,fraction_with_referrer_set
page_type,page_sub_type,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Unknown,,5959,20017,0.297697
blog_page,blog,13186,151393,0.087098
details,product_details,1152,6810,0.169163
form_page,form,0,44,0.0
interstitial_page,apply,107,233,0.459227
interstitial_page,redirect,3542,3748,0.945037
interstitial_page,site,11,11,1.0
learn_page,learn,77,385,0.2
listing,category_listing,927,7055,0.131396
listing,channel_listing,2229,11621,0.191808


In [274]:
page_views[page_views.page_sub_type=="site"]["page_url"]

6795      www.moneysmart.sg/renovation-loan/maybank/maybank-renovation-loan-monthly-rest/site
9406               www.moneysmart.sg/education-loan/posb/posb-loans-further-study-assist/site
50836              www.moneysmart.sg/education-loan/posb/posb-loans-further-study-assist/site
65700              www.moneysmart.sg/education-loan/posb/posb-loans-further-study-assist/site
90878     www.moneysmart.sg/renovation-loan/maybank/maybank-renovation-loan-monthly-rest/site
119472             www.moneysmart.sg/education-loan/posb/posb-loans-further-study-assist/site
129034             www.moneysmart.sg/education-loan/posb/posb-loans-further-study-assist/site
136043             www.moneysmart.sg/education-loan/posb/posb-loans-further-study-assist/site
148113    www.moneysmart.sg/renovation-loan/maybank/maybank-renovation-loan-monthly-rest/site
157509             www.moneysmart.sg/education-loan/posb/posb-loans-further-study-assist/site
184141             www.moneysmart.sg/education-loan/posb/pos

## Sessions for landing page information

Note that:
* You can get some of this off the fact_activities 
* You can get a lot of marketing info on the session level
* For doing per-day etc, you'd likely want the earliest session of the sessions for a user (or do it off pageviews)

In [275]:
query = """
select
    session_id
    , anonymous_user_id
    , session_landing_page_id
    , session_count
    , dim_date.full_date
    

from 
    fact_sessions
    left join dim_date on session_date_id = dim_date.date_id


where 
    dim_date.full_date>='{from_date}'
    and dim_date.full_date<='{to_date}'

    and user_filter_type='external_visitor'

""".format(from_date= from_datetime.isoformat(), to_date=to_datetime.isoformat())

In [276]:
print(query)


select
    session_id
    , anonymous_user_id
    , session_landing_page_id
    , session_count
    , dim_date.full_date
    

from 
    fact_sessions
    left join dim_date on session_date_id = dim_date.date_id


where 
    dim_date.full_date>='2020-04-08'
    and dim_date.full_date<='2020-04-09'

    and user_filter_type='external_visitor'




In [277]:
sessions = dq.query(query)

Starting query at 2020-04-10T13:20:19.859720
Query took 0.71


In [278]:
sessions.head()

Unnamed: 0,session_id,anonymous_user_id,session_landing_page_id,session_count,full_date
0,71827117,61426121,993224,1,2020-04-08
1,71894634,61486720,1227230,1,2020-04-08
2,71894810,58535310,1578936,1,2020-04-08
3,71963603,61435866,1656230,1,2020-04-08
4,71894922,61491299,516766,1,2020-04-08


In [279]:
sessions.agg(["count", "min", "max"])

Unnamed: 0,session_id,anonymous_user_id,session_landing_page_id,session_count,full_date
count,154559,154559,154559,154559,154559
min,71824985,10177948,516315,1,2020-04-08
max,72114635,61594347,2971854,1,2020-04-09


## Better Page Segmentation (should put in ETL sometime)
For background, current ETL process uses page types from sql pattern matching, but events now send page_type (and in future will try to get page_sub_type) in the event body.  Ticket exists to improve this.

### Get All Pages from Data Warehouse

In [280]:
query = """
select 
    page_id
    , page_url
    , page_type
    , page_sub_type
    -- not joining to get product and category at the moment as not sure its needed.  Data is also in fact_activities
    , product_category_id
    , product_id
    , provider_id
from 
    dim_page
    left join dim_page_type on dim_page.page_type_id = dim_page_type.page_type_id

"""

pages = dq.query(query)

Starting query at 2020-04-10T13:20:28.647670
Query took 0.27


In [281]:
pages.count()

page_id                47691
page_url               47691
page_type              47691
page_sub_type          47691
product_category_id     1251
product_id              2391
provider_id             3233
dtype: int64

In [282]:
pages.head()

Unnamed: 0,page_id,page_url,page_type,page_sub_type,product_category_id,product_id,provider_id
0,1375785,blog.moneysmart.hk/en/mortgage/%E5%B1%85%E5%B1%8B-2019-%E4%BD%95%E6%96%87%E7%94%B0-%E5%B0%87%E8%BB%8D%E6%BE%B3-%E7%94%B3%E8%AB%8B-%E9%A6%AC%E9%9E%8D%E5%B1%B1-%E6%B7%B1%E6%B0%B4%E5%9F%97,blog_page,blog,,,
1,1647817,blog.moneysmart.sg/property/private-properties-hdbs-differences/attachment/private-vs-public-housing-singapore,blog_page,blog,,,
2,1376813,forum.moneysmart.sg/topic/taking-multiple-loans,forum_page,forum,,,
3,1158494,blog.moneysmart.sg/dining/11-best-places-in-singapore-to-get-your-meat-fix-at-1-for-1/_,blog_page,blog,,,
4,1378619,www.moneysmart.sg/home%20loan,Unknown,,,,


In [283]:
pages[pages.page_url.str.startswith("http")].head()

Unnamed: 0,page_id,page_url,page_type,page_sub_type,product_category_id,product_id,provider_id


### Page Type from Event Body (via athena)

In [284]:
# Just take one day's worth of data from the end of the period (more likely to have updated values than start)
day_to_take_types_from = to_datetime - timedelta(days=1)

query = """
    select 
        context.page_url
        
        --regexp_extract(context.page_url, '^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?', 5)  -- slug
        
        , context.canonical_url
        , body.data.page_type
        , count(*) as event_count
        
        
    from {table_name}
    
    where
        {partition_filter}
        and context.page_url not like '%moneysmart.tw%'
        and context.page_url not like '%moneysmart.ph%'
        and context.page_url not like '%moneysmart.id%'
    
    group by 1,2,3
    

""".format(table_name = athena_querying.athena_database+ "." +athena_querying.athena_raw_events_table,
          partition_filter = create_partition_filter(day_to_take_types_from, to_datetime)
          
          )
           



#could filter out just pageviews, but the group by has the same effect and it's not indexed or anything.




In [285]:
print(query)


    select 
        context.page_url
        
        --regexp_extract(context.page_url, '^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?', 5)  -- slug
        
        , context.canonical_url
        , body.data.page_type
        , count(*) as event_count
        
        
    from ms_data_lake_production.ms_data_stream_production_processed
    
    where
        
  (
 partition_0 >= '2020'
 AND partition_1 >= '04'
 AND partition_2 >= '08'
 OR (
 partition_0 >= '2020'
 AND partition_1 > '04'
 ) 
 OR (
 partition_0 > '2020'
 ) 
)
 AND ((partition_0 <= '2020'
	 AND partition_1 <= '04'
	 AND partition_2 <= '09'
) 
 OR (
	 partition_0 <= '2020'
	 AND partition_1 < '04'
) 
 OR (
	 partition_0 < '2020'
) 
)
        and context.page_url not like '%moneysmart.tw%'
        and context.page_url not like '%moneysmart.ph%'
        and context.page_url not like '%moneysmart.id%'
    
    group by 1,2,3
    




In [286]:
pages_types_from_athena_raw = aq.query(query)

In [287]:
pages_types_from_athena_raw.head()

Unnamed: 0,page_url,canonical_url,page_type,event_count
0,https://blog.moneysmart.hk/zh-hk/mortgage/%E6%97%A5%E5%87%BA%E5%BA%B7%E5%9F%8E-6%E6%9C%9F-%E9%A0%98%E9%83%BD-montara-malibu/,,article,362
1,https://blog.moneysmart.sg/transportation/motorcycle-singapore/,,article,597
2,https://blog.moneysmart.sg/invest/investment-brokerage-singapore-guide/,,article,4118
3,https://blog.moneysmart.sg/budgeting/gomo-sim-only-plan/?dclid=COj9hsHg2egCFdsmrQYdep8Ikw,,article,5
4,https://blog.moneysmart.sg/invest/5-ways-in-which-you-can-make-the-most-of-the-smaller-sgx-lot-sizes/,,article,136


In [288]:
# See how much canonical_url is available, but don't want to match as dim_page ATOW isn't using canonical :(
# I think that it was only added for AMP
pages_types_from_athena_raw.count()

page_url         24425
canonical_url        0
page_type        18924
event_count      24425
dtype: int64

In [289]:
# Set the dim_page_url for joining
urls_for_dim_page = pages_types_from_athena_raw.apply(lambda x: data_parsing.get_dim_page_url(x["page_url"]), axis=1)

In [290]:
urls_for_dim_page.head()

0    blog.moneysmart.hk/zh-hk/mortgage/%e6%97%a5%e5%87%ba%e5%ba%b7%e5%9f%8e-6%e6%9c%9f-%e9%a0%98%e9%83%bd-montara-malibu
1                                                                 blog.moneysmart.sg/transportation/motorcycle-singapore
2                                                         blog.moneysmart.sg/invest/investment-brokerage-singapore-guide
3                                                                        blog.moneysmart.sg/budgeting/gomo-sim-only-plan
4                           blog.moneysmart.sg/invest/5-ways-in-which-you-can-make-the-most-of-the-smaller-sgx-lot-sizes
dtype: object

In [291]:
# check for bad matching with dim_page (looks like shop url or blog url, but doesn't match)
missing_pages = urls_for_dim_page[~urls_for_dim_page.isin(pages["page_url"])].unique()

In [292]:
len(missing_pages)

24

In [293]:
pd.DataFrame(missing_pages)

Unnamed: 0,0
0,blog.moneysmart.sg/%e2%80%a6/icbc-horoscope-credit-card-ca%e2%80%a6
1,blog.moneysmart.hk/zh-hk/credit-cards/%e9%a3%9b%e8%a1%8c%e9%87%8c%e6%95%b8-%e4%bf%a1%e7%94%a8%e5%8d%a1-%e6%af%94%e8%bc%83-2018
2,www.moneysmart.hk/zh-hk/lending-companies-loan/lending-companies-loan-plans/promise-easy-loan1
3,blog.moneysmart.hk/zh-hk/investment/%e8%b2%b7%e5%b3%b6-%e7%84%a1%e4%ba%ba%e5%b3%b6-%e5%8a%a0%e5%8b%92%e6%af%94%e6%b5%b7-%e4%b8%ad%e7%be%8e%e6%b4%b2-%e8%8f%b2%e5%be%8b%e8%b3%93
4,blog.moneysmart.hk/zh-hk/mortgage/%e5%b1%85%e5%b1%8b%e6%8c%89%e6%8f%ad-%e5%88%a9%e7%8e%87-%e5%9b%9e%e8%b4%88-%e9%8a%80%e8%a1%8c/%e2%80%9chttps:/blog.moneysmart.hk
5,blog.moneysmart.sg/health%20insurance/covid-19-health-insurance
6,blog.moneysmart.hk/zh-hk/credit-cards/didi%e8%bf%8e%e6%96%b0%e5%84%aa%e6%83%a0%e6%b8%9b%e8%87%b326-%e7%94%a8%e6%88%b6%e8%96%a6%e5%8f%8b%e6%88%90%e5%8a%9f%e4%bd%bf%e7%94%a8%e7%8d%b250
7,www.moneysmart.sg/%3ca%20class=%22ui%20blue%20button%22%20target=%22_self%22%20href=%22https://web.archive.org/web/20191225001530/https://blog.moneysmart.sg/%22%3evisit%20blog%3c/a%3e
8,blog.moneysmart.hk/zh-hk/uncategorized/%e8%b2%b7%e6%a8%93-%e5%8d%b0%e8%8a%b1%e7%a8%85-%e6%8c%89%e6%8f%ad-bsd-ssd-dsd
9,blog3.moneysmart.hk/zh-hk/budgeting/%e7%89%9b%e8%82%89%e4%b9%be-%e6%8a%84%e7%89%8c-%e5%8f%b8%e6%a9%9f-%e8%b3%ba%e9%87%8c%e6%95%b8-%e5%9b%9e%e8%b4%88-%e7%bd%b0%e6%ac%be-%e5%8a%a0%e5%83%b9


In [294]:
pages[pages.page_url.isin([z.strip("/") for z in missing_pages])]

Unnamed: 0,page_id,page_url,page_type,page_sub_type,product_category_id,product_id,provider_id


In [295]:
pages[pages.page_url=="www.moneysmart.sg/"]

Unnamed: 0,page_id,page_url,page_type,page_sub_type,product_category_id,product_id,provider_id
20791,663410,www.moneysmart.sg/,Unknown,,,,


In [296]:
pages[pages.page_url=="www.moneysmart.sg"]

Unnamed: 0,page_id,page_url,page_type,page_sub_type,product_category_id,product_id,provider_id
37822,521077,www.moneysmart.sg,Unknown,,,,


In [297]:
len(pages_types_from_athena_raw)

24425

In [298]:
# Group together and remove duplicate entries sensibly

In [299]:
pages_types_from_athena_raw.head()

Unnamed: 0,page_url,canonical_url,page_type,event_count
0,https://blog.moneysmart.hk/zh-hk/mortgage/%E6%97%A5%E5%87%BA%E5%BA%B7%E5%9F%8E-6%E6%9C%9F-%E9%A0%98%E9%83%BD-montara-malibu/,,article,362
1,https://blog.moneysmart.sg/transportation/motorcycle-singapore/,,article,597
2,https://blog.moneysmart.sg/invest/investment-brokerage-singapore-guide/,,article,4118
3,https://blog.moneysmart.sg/budgeting/gomo-sim-only-plan/?dclid=COj9hsHg2egCFdsmrQYdep8Ikw,,article,5
4,https://blog.moneysmart.sg/invest/5-ways-in-which-you-can-make-the-most-of-the-smaller-sgx-lot-sizes/,,article,136


In [300]:
pages_types_from_athena_processing = pages_types_from_athena_raw[["page_url", "page_type", "event_count"]]

In [301]:
pages_types_from_athena_processing.rename(columns={"page_type":"page_type_from_events", "event_count":"event_count_athena"}, inplace=True)
pages_types_from_athena_processing.columns

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  return super(DataFrame, self).rename(**kwargs)


Index(['page_url', 'page_type_from_events', 'event_count_athena'], dtype='object')

In [302]:
pages_types_from_athena_processing["dim_page_url"] = urls_for_dim_page
pages_types_from_athena_processing.head()

Unnamed: 0,page_url,page_type_from_events,event_count_athena,dim_page_url
0,https://blog.moneysmart.hk/zh-hk/mortgage/%E6%97%A5%E5%87%BA%E5%BA%B7%E5%9F%8E-6%E6%9C%9F-%E9%A0%98%E9%83%BD-montara-malibu/,article,362,blog.moneysmart.hk/zh-hk/mortgage/%e6%97%a5%e5%87%ba%e5%ba%b7%e5%9f%8e-6%e6%9c%9f-%e9%a0%98%e9%83%bd-montara-malibu
1,https://blog.moneysmart.sg/transportation/motorcycle-singapore/,article,597,blog.moneysmart.sg/transportation/motorcycle-singapore
2,https://blog.moneysmart.sg/invest/investment-brokerage-singapore-guide/,article,4118,blog.moneysmart.sg/invest/investment-brokerage-singapore-guide
3,https://blog.moneysmart.sg/budgeting/gomo-sim-only-plan/?dclid=COj9hsHg2egCFdsmrQYdep8Ikw,article,5,blog.moneysmart.sg/budgeting/gomo-sim-only-plan
4,https://blog.moneysmart.sg/invest/5-ways-in-which-you-can-make-the-most-of-the-smaller-sgx-lot-sizes/,article,136,blog.moneysmart.sg/invest/5-ways-in-which-you-can-make-the-most-of-the-smaller-sgx-lot-sizes


In [303]:
page_types_from_athena = pages_types_from_athena_processing.fillna("").groupby(["dim_page_url"]).agg({"event_count_athena":"sum", "page_type_from_events":"max" }).reset_index()

In [304]:
page_types_from_athena.head()

Unnamed: 0,dim_page_url,event_count_athena,page_type_from_events
0,blog-admin.moneysmart.sg,6,home
1,blog-admin.moneysmart.sg/credit-cards/letting-loose-long-week-heres-dont-need-worry-cost,1,article
2,blog-admin.moneysmart.sg/credit-cards/uob-prvi-miles-credit-card-review,2,blog-article-details
3,blog-admin.moneysmart.sg/fixed-deposits/best-fixed-deposit-accounts-singapore,3,article
4,blog-admin.moneysmart.sg/savings-accounts/dbs-multiplier-ocbc360-uob-one-covid-19,10,blog-article-details


In [305]:
#using fillna because it was erroring on max
page_types_from_athena.groupby("page_type_from_events").agg({"event_count_athena":"sum", "dim_page_url":"count"})

Unnamed: 0_level_0,event_count_athena,dim_page_url
page_type_from_events,Unnamed: 1_level_1,Unnamed: 2_level_1
,62178,721
article,298841,2325
blog-article-details,174943,185
blog-post-page,212794,92
category,2017,83
claim-status-tracker,178,1
claim-status-tracker-result,23,1
contact-us-general-enquiry,110,1
home,280,5
interstitial-page,5141,165


### Page Type from Jamie's Logic

This was originally done for understanding segment vs kinesis, and then has been tweaked to add a bit more.

AToW it defaults to shop if it can't categorise better.

In [327]:
# Expect this to be a bit slow
page_types_jamie = pages[["page_id", "page_url"]].reset_index()

In [328]:
jamie_types = pages.apply(lambda x:data_parsing.get_metadata_from_url("https://"+x.page_url)[0], axis=1) #[page_type, slug, slug_root, ab_test, country_code]
#page_types_jamie["page_type"] = jamie_types

In [329]:
page_types_jamie["page_type_jamie"] = jamie_types

In [330]:
len(page_types_jamie)

47691

In [331]:
page_types_jamie.columns

Index(['index', 'page_id', 'page_url', 'page_type_jamie'], dtype='object')

In [332]:
# NB: logic is a bit flakey as it defaults to shop
page_types_jamie.groupby("page_type_jamie").size()

page_type_jamie
blog_category_page          267
blog_category_tag_page    25798
blog_home_page                4
blog_tag_page               115
calculator                   28
embed                       255
home_page                    30
iss                        2789
learn                       446
lps                         184
shop                      17373
trend                         8
unbounce                    394
dtype: int64

### Merged Page Type

In [333]:
# could probably merge techniques to use Jamie style plus PDP and listing page from the data warehouse
# essentially take the athena one and if not present, then use the Jamie one

pages_types_combined = pages.merge(page_types_jamie[["page_id", "page_type_jamie"]], how="left", on="page_id")\
    .merge(page_types_from_athena, how="left", left_on="page_url", right_on="dim_page_url")




In [334]:
pages_types_combined.head()

Unnamed: 0,page_id,page_url,page_type,page_sub_type,product_category_id,product_id,provider_id,page_type_jamie,dim_page_url,event_count_athena,page_type_from_events
0,1375785,blog.moneysmart.hk/en/mortgage/%E5%B1%85%E5%B1%8B-2019-%E4%BD%95%E6%96%87%E7%94%B0-%E5%B0%87%E8%BB%8D%E6%BE%B3-%E7%94%B3%E8%AB%8B-%E9%A6%AC%E9%9E%8D%E5%B1%B1-%E6%B7%B1%E6%B0%B4%E5%9F%97,blog_page,blog,,,,blog_category_tag_page,,,
1,1647817,blog.moneysmart.sg/property/private-properties-hdbs-differences/attachment/private-vs-public-housing-singapore,blog_page,blog,,,,blog_category_tag_page,,,
2,1376813,forum.moneysmart.sg/topic/taking-multiple-loans,forum_page,forum,,,,shop,,,
3,1158494,blog.moneysmart.sg/dining/11-best-places-in-singapore-to-get-your-meat-fix-at-1-for-1/_,blog_page,blog,,,,blog_category_tag_page,,,
4,1378619,www.moneysmart.sg/home%20loan,Unknown,,,,,shop,,,


### Comparing all the techniques
Intent here is to go back and make one technique that solves this for all pages

In [335]:
pages_types_combined.groupby(["page_type", "page_sub_type", "page_type_jamie", "page_type_from_events"])\
.agg({"page_id":"count","event_count_athena":"sum"}).reset_index().rename(columns={"page_id":"page_count"})

Unnamed: 0,page_type,page_sub_type,page_type_jamie,page_type_from_events,page_count,event_count_athena
0,Unknown,,calculator,,9,2675.0
1,Unknown,,embed,,11,2747.0
2,Unknown,,embed,blog-post-page,92,212794.0
3,Unknown,,home_page,,1,2315.0
4,Unknown,,lps,,80,1416.0
5,Unknown,,lps,lps,61,2805.0
6,Unknown,,shop,,142,12150.0
7,Unknown,,shop,claim-status-tracker,1,178.0
8,Unknown,,shop,claim-status-tracker-result,1,23.0
9,Unknown,,shop,contact-us-general-enquiry,1,110.0


In [336]:
# ^ It's going to take some work to resolve it well

In [337]:
ptc = pages_types_combined

In [342]:
ptc[(ptc.page_type == "Unknown") & (ptc.page_type_from_events == "product-listing")]

Unnamed: 0,page_id,page_url,page_type,page_sub_type,product_category_id,product_id,provider_id,page_type_jamie,dim_page_url,event_count_athena,page_type_from_events
2674,517337,www.moneysmart.sg/savings-account/rhb,Unknown,,,,,shop,www.moneysmart.sg/savings-account/rhb,2.0,product-listing
35659,2963823,www.moneysmart.sg/debt-consolidation-plan/standard-chartered,Unknown,,,,,shop,www.moneysmart.sg/debt-consolidation-plan/standard-chartered,1.0,product-listing
36721,1583842,www.moneysmart.hk/zh-hk/credit-cards/dah-sing-bank,Unknown,,,,,shop,www.moneysmart.hk/zh-hk/credit-cards/dah-sing-bank,6.0,product-listing
36768,1540787,www.moneysmart.hk/zh-hk/credit-cards/bank-of-china,Unknown,,,,,shop,www.moneysmart.hk/zh-hk/credit-cards/bank-of-china,133.0,product-listing
36938,1797374,www.moneysmart.hk/zh-hk/credit-cards/china-construction-bank,Unknown,,,,,shop,www.moneysmart.hk/zh-hk/credit-cards/china-construction-bank,12.0,product-listing
37020,1582609,www.moneysmart.hk/en/credit-cards/bea,Unknown,,,,,shop,www.moneysmart.hk/en/credit-cards/bea,5.0,product-listing
37147,2803554,www.moneysmart.hk/en/personal-loan/bank-of-communications,Unknown,,296.0,,,shop,www.moneysmart.hk/en/personal-loan/bank-of-communications,2.0,product-listing
37182,1802404,www.moneysmart.hk/en/credit-cards/aeon-credit-service,Unknown,,,,,shop,www.moneysmart.hk/en/credit-cards/aeon-credit-service,8.0,product-listing
40379,1435624,www.moneysmart.hk/zh-hk/credit-cards/citic-bank-international/welcome-offer,Unknown,,,,,shop,www.moneysmart.hk/zh-hk/credit-cards/citic-bank-international/welcome-offer,1.0,product-listing
40420,1580332,www.moneysmart.hk/en/credit-cards/china-construction-bank,Unknown,,,,,shop,www.moneysmart.hk/en/credit-cards/china-construction-bank,4.0,product-listing


In [340]:
ptc[ptc.page_type_jamie=="blog_home_page"]

Unnamed: 0,page_id,page_url,page_type,page_sub_type,product_category_id,product_id,provider_id,page_type_jamie,dim_page_url,event_count_athena,page_type_from_events
2373,1015130,blog.moneysmart.hk,blog_page,blog,,,,blog_home_page,blog.moneysmart.hk,270.0,home
13258,518231,blog.moneysmart.sg,blog_page,blog,,,,blog_home_page,blog.moneysmart.sg,3105.0,other
29677,2965738,blog3.moneysmart.hk,blog_page,blog,,,,blog_home_page,,,
39658,2964624,blog3.moneysmart.sg,blog_page,blog,,,,blog_home_page,blog3.moneysmart.sg,8.0,


In [344]:
ptc[(ptc.page_type_jamie=="home_page") & (ptc.event_count_athena>0)]

Unnamed: 0,page_id,page_url,page_type,page_sub_type,product_category_id,product_id,provider_id,page_type_jamie,dim_page_url,event_count_athena,page_type_from_events
342,549778,learn.moneysmart.sg,learn_page,learn,,,,home_page,learn.moneysmart.sg,39.0,
29259,2615872,blog-admin.moneysmart.sg,blog_page,blog,,,,home_page,blog-admin.moneysmart.sg,6.0,home
37822,521077,www.moneysmart.sg,Unknown,,,,,home_page,www.moneysmart.sg,2315.0,


In [338]:
ptc[(ptc.page_type_from_events=="category") & (ptc.page_type_jamie=="blog_article")].head(20)

Unnamed: 0,page_id,page_url,page_type,page_sub_type,product_category_id,product_id,provider_id,page_type_jamie,dim_page_url,event_count_athena,page_type_from_events


### Trying to make the best page_type of some so-so options

In [360]:
# This is a bit hacky and liable to break
# It also takes a bit of time to run -> could optimise by running as a mapping on the summary, but whatever.

def _merge_page_types(x):
    j_type = x.page_type_jamie
    if "blog" in j_type:
        page_type = "blog"
        page_sub_type = j_type
    elif j_type in ["lps", "unbounce", "trend", "calculator"]:
        page_type = "landing"
        page_sub_type = j_type
    elif j_type == "iss":
        page_type = "interstitial"
        page_sub_type = x.page_sub_type
    elif j_type == "embed":
        page_type = "embed"
        page_sub_type = "unknown"
    elif j_type == "shop":
        if x.page_type=="listing" or x.page_type_from_events=="product-listing":
            page_type = "listing"
            if x.page_sub_type in ["category_listing", "channel_listing", "provider_listing"]:
                page_sub_type = x.page_sub_type
            else:
                page_sub_type = "unknown"
        elif x.page_type_from_events == "product-details" or x.page_sub_type =="product-details":
            page_type="product_details"
            page_sub_type="unknown"
        elif x.page_type == "Unknown" and bool(x.page_type_from_events) and x.page_type_from_events!=np.NaN:
            page_type = "misc_shop_a"
            page_sub_type = x.page_type_from_events
        
        else:
            page_type = "misc_shop"
            page_sub_type = "unknown"
    else:
        page_type = j_type
        page_sub_type = "unknown"
    return pd.Series([page_type, page_sub_type], index=['page_type_merged', 'page_sub_type_merged'])
        

page_type_merged_col = ptc.apply(_merge_page_types, axis=1)

In [369]:
pages_types_merged_dev = pages_types_combined.merge(page_type_merged_col, how="left", left_index=True, right_index=True)


In [370]:
pages_types_merged_dev.groupby(['page_type_merged', 'page_sub_type_merged', "page_type", "page_sub_type", "page_type_jamie", "page_type_from_events"])\
.agg({"page_id":"count","event_count_athena":"sum"}).reset_index().rename(columns={"page_id":"page_count"})

Unnamed: 0,page_type_merged,page_sub_type_merged,page_type,page_sub_type,page_type_jamie,page_type_from_events,page_count,event_count_athena
0,blog,blog_category_page,blog_page,blog,blog_category_page,,3,3.0
1,blog,blog_category_page,blog_page,blog,blog_category_page,category,35,810.0
2,blog,blog_category_tag_page,blog_page,blog,blog_category_tag_page,,204,33832.0
3,blog,blog_category_tag_page,blog_page,blog,blog_category_tag_page,article,2318,298821.0
4,blog,blog_category_tag_page,blog_page,blog,blog_category_tag_page,blog-article-details,183,174931.0
5,blog,blog_category_tag_page,blog_page,blog,blog_category_tag_page,category,48,1207.0
6,blog,blog_category_tag_page,blog_page,blog,blog_category_tag_page,home,3,4.0
7,blog,blog_category_tag_page,blog_page,blog,blog_category_tag_page,other,63,442.0
8,blog,blog_category_tag_page,blog_page,blog,blog_category_tag_page,tag,41,306.0
9,blog,blog_home_page,blog_page,blog,blog_home_page,,1,8.0


In [371]:
pages_types_merged_dev[(pages_types_merged_dev.page_type_merged=="misc_shop") & (pages_types_merged_dev.page_sub_type_merged=="unknown")].sort_values("event_count_athena", ascending = False).head(20)

Unnamed: 0,page_id,page_url,page_type,page_sub_type,product_category_id,product_id,provider_id,page_type_jamie,dim_page_url,event_count_athena,page_type_from_events,page_type_merged,page_sub_type_merged
20690,901598,www.moneysmart.sg/refinancing/compare,Unknown,,,,,shop,www.moneysmart.sg/refinancing/compare,3668.0,,misc_shop,unknown
8146,679337,www.moneysmart.sg/home-loan/compare,Unknown,,,,,shop,www.moneysmart.sg/home-loan/compare,1620.0,,misc_shop,unknown
17928,902232,www.moneysmart.sg/refinancing/compare/loans,Unknown,,,,,shop,www.moneysmart.sg/refinancing/compare/loans,950.0,,misc_shop,unknown
10945,852846,www.moneysmart.sg/car-insurance/wizard/register,Unknown,,,,,shop,www.moneysmart.sg/car-insurance/wizard/register,682.0,,misc_shop,unknown
5064,517603,www.moneysmart.sg/car-insurance/wizard,Unknown,,,,,shop,www.moneysmart.sg/car-insurance/wizard,651.0,,misc_shop,unknown
42905,1524742,www.moneysmart.hk/zh-hk/mortgage/property-valuation-tool,Unknown,,,,,shop,www.moneysmart.hk/zh-hk/mortgage/property-valuation-tool,481.0,,misc_shop,unknown
37073,1536292,www.moneysmart.hk/zh-hk/mortgage,Unknown,,,,,shop,www.moneysmart.hk/zh-hk/mortgage,480.0,,misc_shop,unknown
40490,1011194,www.moneysmart.hk/zh-hk,Unknown,,,,,shop,www.moneysmart.hk/zh-hk,470.0,,misc_shop,unknown
6631,826200,www.moneysmart.sg/car-insurance/wizard/results,Unknown,,,,,shop,www.moneysmart.sg/car-insurance/wizard/results,379.0,,misc_shop,unknown
37341,678568,www.moneysmart.sg/home-loan/compare/loans,Unknown,,,,,shop,www.moneysmart.sg/home-loan/compare/loans,333.0,,misc_shop,unknown


In [373]:
pages_types_merged = pages_types_merged_dev[["page_id", "page_url", "page_type_merged", "page_sub_type_merged","event_count_athena"]].rename(columns={"page_type_merged": "page_type", "page_sub_type_merged":"page_sub_type"})
canonical_urls_col = pages_types_merged.apply(lambda x: data_parsing.get_canonical_url("https://"+x.page_url), axis=1)
pages_types_merged["canonical_url"] = canonical_urls_col

In [374]:
len(pages_types_merged)

47691

In [375]:
len(pages)

47691

In [379]:
# we should have fewer grouping by canonical as it removes the AB test urls
len(pages_types_merged.groupby(["canonical_url"]))

40522

In [381]:
len(pages_types_merged.groupby(["canonical_url", "page_type", "page_sub_type"]))

41347

In [None]:
# TODO: >>>>>> there's a mismatch here.  Probably want to do a group by, max on it to merge them together and then join again with the non-canonical... but should really investigate the origin.
# See below, it looks like a remnant of AB testing falcon

In [391]:
issues = pages_types_merged.groupby(["canonical_url", "page_type", "page_sub_type"]).count().groupby(["canonical_url"]).size()

In [392]:
issues[issues.values>1]

canonical_url
www.moneysmart.hk/en/credit-cards/american-express-platinum-credit-card                            2
www.moneysmart.hk/en/credit-cards/icbc/unionpay                                                    2
www.moneysmart.hk/en/personal-loan/hsbc                                                            2
www.moneysmart.hk/zh-hk/credit-cards/american-express-platinum-credit-card                         2
www.moneysmart.hk/zh-hk/credit-cards/bank-of-china                                                 2
www.moneysmart.hk/zh-hk/credit-cards/dbs/unionpay                                                  2
www.moneysmart.hk/zh-hk/credit-cards/dbs/welcome-offer                                             2
www.moneysmart.hk/zh-hk/credit-cards/icbc/unionpay                                                 2
www.moneysmart.hk/zh-hk/credit-cards/icbc/welcome-offer                                            2
www.moneysmart.sg/credit-cards/american-express-platinum-credit-card         

In [393]:
pages_types_merged[pages_types_merged.canonical_url=="www.moneysmart.sg/personal-loan/scb-cashone"]

Unnamed: 0,page_id,page_url,page_type,page_sub_type,event_count_athena,canonical_url
14528,517928,www.moneysmart.sg/personal-loan/scb-cashone,product_details,unknown,94.0,www.moneysmart.sg/personal-loan/scb-cashone
35520,2678863,www-new.moneysmart.sg/personal-loan/scb-cashone,misc_shop,unknown,,www.moneysmart.sg/personal-loan/scb-cashone


In [380]:
pages_types_merged.head()

Unnamed: 0,page_id,page_url,page_type,page_sub_type,event_count_athena,canonical_url
0,1375785,blog.moneysmart.hk/en/mortgage/%E5%B1%85%E5%B1%8B-2019-%E4%BD%95%E6%96%87%E7%94%B0-%E5%B0%87%E8%BB%8D%E6%BE%B3-%E7%94%B3%E8%AB%8B-%E9%A6%AC%E9%9E%8D%E5%B1%B1-%E6%B7%B1%E6%B0%B4%E5%9F%97,blog,blog_category_tag_page,,blog.moneysmart.hk/en/mortgage/%e5%b1%85%e5%b1%8b-2019-%e4%bd%95%e6%96%87%e7%94%b0-%e5%b0%87%e8%bb%8d%e6%be%b3-%e7%94%b3%e8%ab%8b-%e9%a6%ac%e9%9e%8d%e5%b1%b1-%e6%b7%b1%e6%b0%b4%e5%9f%97
1,1647817,blog.moneysmart.sg/property/private-properties-hdbs-differences/attachment/private-vs-public-housing-singapore,blog,blog_category_tag_page,,blog.moneysmart.sg/property/private-properties-hdbs-differences/attachment/private-vs-public-housing-singapore
2,1376813,forum.moneysmart.sg/topic/taking-multiple-loans,misc_shop,unknown,,forum.moneysmart.sg/topic/taking-multiple-loans
3,1158494,blog.moneysmart.sg/dining/11-best-places-in-singapore-to-get-your-meat-fix-at-1-for-1/_,blog,blog_category_tag_page,,blog.moneysmart.sg/dining/11-best-places-in-singapore-to-get-your-meat-fix-at-1-for-1/_
4,1378619,www.moneysmart.sg/home%20loan,misc_shop_a,,,www.moneysmart.sg/home%20loan


# Looking at the LeadGeneration.ClickConversion event

## Getting the click event data

In [None]:
query = """
select  

    country_code
    , dim_page_type.page_type
    , dim_page_type.page_sub_type
    , case when page_url like '%/embed/%' then true else false end as is_embed
    , page_url
    , page_id
    , full_date
    , time
    , hour
    , minute
    , device_os
    , device_category
    , browser
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'channel', true) as channel
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'product_slug', true) as product_slug
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'product', true) as product
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'product_id', true) as product_id
    , dim_product.slug as product_from_id
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'provider_slug', true) as provider_slug
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'provider', true) as provider
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'provider_id', true) as provider_id
    , dim_provider.slug as provider_from_id
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'affiliate_category', true) as affiliate_category
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'affiliate_location', true) as affiliate_location
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'affiliate_page_type', true) as affiliate_page_type
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'affiliate_widget_type', true) as affiliate_widget_type
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'list_position', true) as list_position
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'action', true) as action
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'source', true) as source
    , dim_activity.activity_attributes
    from 
    
    -- TODO: cut down the join.s just copy / pasting
    fact_activities 
    left join dim_page on fact_activities.page_id = dim_page.page_id
    left join dim_page_type on dim_page_type.page_type_id = dim_page.page_type_id
    -- left join dim_session on fact_activities.session_id = dim_session.session_id
    left join dim_activity on fact_activities.activity_id = dim_activity.activity_id
    
    left join dim_activity_type on fact_activities.activity_type_id = dim_activity_type.activity_type_id
    left join dim_date on dim_date.date_id = fact_activities.activity_date_id
    left join dim_time on fact_activities.activity_time_id = dim_time.time_id
    left join dim_country on fact_activities.site_country_id = dim_country.country_id
    
    left join dim_browser on fact_activities.browser_id = dim_browser.browser_id -- firefox etc
    left join dim_device on fact_activities.device_id = dim_device.device_id -- device_os, device_category (desktop / mobile...)
    
    left join dim_channel on dim_channel.channel_key = json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'channel', true)
    -- only join product and provider if the slug isn't set i.e. assume that it's pre-falcon YMMV (and it's deprecated)
    left join dim_product on (dim_product.source_product_id = json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'product_id', true) 
        and coalesce(json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'product', true), '') =''
        and dim_product.channel_id = dim_channel.channel_id 
        and dim_product.country_id = dim_country.country_id) 
    left join dim_provider on (
        dim_provider.source_provider_id = json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'provider_id', true) 
        and coalesce(json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'provider', true), '') =''
        and dim_provider.channel_id = dim_channel.channel_id 
            and dim_provider.country_id = dim_country.country_id)
    

    
    where 
        dim_activity_type.activity_name = 'LeadGeneration.ClickConversion'
        and user_filter_type='external_visitor'
        and dim_date.full_date>='{from_date}'
        and dim_date.full_date<='{to_date}'
        
        
        -- NB: embeds aren't currently listed as blog pages :(
        
""".format(from_date= from_datetime.isoformat(), to_date=to_datetime.isoformat())


In [None]:
print(query)

In [None]:
query = sqlalchemy.text(query)
apply_clicks = dq.query(query)

In [None]:
apply_clicks.describe()

In [None]:
apply_clicks.head(5)

In [None]:
pd.set_option("display.max_colwidth", 200)
# apply_clicks[apply_clicks["page_type"].str.contains("blog")][["page_url", "provider", "provider_id", "activity_attributes"]]
for a in apply_clicks[apply_clicks["page_type"].str.contains("blog")][ "activity_attributes"].values[0].split(","): print(a)


In [None]:
product_provider_summary_cols = [ "page_url", "action", "page_type", "channel"]+ [z for z in apply_clicks.columns if "product" in z or "provider" in z]
affiliate_cols = [z for z in apply_clicks.columns if "affiliate" in z]

In [None]:
apply_clicks[product_provider_summary_cols ].head()

## Issues

In [None]:
def format_results(df):
    def make_clickable(val):
        # target _blank to open new window
        return '<a target="_blank" href="{}">{}</a>'.format("https://"+ val, val)
    
    return df.style.format({'page_url': make_clickable})

### Not having product / provider (slug) set (product_id or provider_id is deprecated)

In [None]:
df = apply_clicks[(apply_clicks.provider.isna()) | (apply_clicks.provider=="")][product_provider_summary_cols]
print("only first 20 shown")
format_results(df.head(20))

In [None]:
df2 = pd.DataFrame(df.groupby(["page_url", "page_type","channel"]).size().reset_index().sort_values(0, ascending=False))
format_results(df2)

### Using product_slug or provider_slug not product / provider

In [None]:
apply_clicks[~(apply_clicks.provider_slug.isna() | (apply_clicks.provider_slug==""))][product_provider_summary_cols]

In [None]:
apply_clicks[~(apply_clicks["product_slug"].isna() | (apply_clicks["product_slug"]==""))][product_provider_summary_cols]

### Not having any product or provider info

In [None]:
# No product info
df = apply_clicks[(apply_clicks["product"].isna() | (apply_clicks["product"]=="")) & (apply_clicks.product_id.isna() | ((apply_clicks["product_id"]=="")))][product_provider_summary_cols]
df2 = pd.DataFrame(df.groupby(["page_url", "page_type", "channel"]).size()).reset_index().sort_values(0, ascending=False).rename(columns={0:"click count"})
format_results(df2)

In [None]:
# No provider info
apply_clicks[(apply_clicks.provider.isna() | (apply_clicks.provider==""))][product_provider_summary_cols]

In [None]:
# No product or provider info, grouped by number of clicks on the page
missing_providers = apply_clicks[((apply_clicks.provider=="" ) | (apply_clicks.provider.isna())) & ((apply_clicks.provider_id=="" ) | (apply_clicks.provider_id.isna())) ][["page_url", "provider", "provider_id"]]
missing_providers_grouped = missing_providers.groupby(["page_url"]).size().reset_index() #.rename(columns={0:"click count"})
#missing_providers_grouped.sort_values("provider_id", ascending=False)
format_results(pd.DataFrame(missing_providers_grouped.sort_values(0, ascending=False)))

### Embed without any info about the page that it's on

In [None]:
print("all of them!")

### Blog page without affiliate stuff set
Blog should have full details of e.g. where on the page it is coming from

In [None]:
apply_clicks.page_type.unique()

In [None]:
df = apply_clicks[ apply_clicks.page_type.isin(["blog_page"]) & ((apply_clicks.affiliate_category=="") | (apply_clicks.affiliate_location=="") | (apply_clicks.affiliate_page_type=="") |  (apply_clicks.affiliate_widget_type=="")\
             | (apply_clicks.affiliate_category.isna()) | (apply_clicks.affiliate_location.isna()) | (apply_clicks.affiliate_page_type.isna()) |  (apply_clicks.affiliate_widget_type.isna()))][product_provider_summary_cols + affiliate_cols]

format_results(df)

In [None]:
"""
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'affiliate_category', true) as affiliate_category
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'affiliate_location', true) as affiliate_location
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'affiliate_page_type', true) as affiliate_page_type
    , json_extract_path_text(trim('"' from dim_activity.activity_attributes), 'affiliate_widget_type', true) as affiliate_widget_type

"""

### Fails to join on provider_id or product_id

Note that you might expect some pre-falcon stuff in HK not to join as we didn't have the application database loaded.

In [None]:
# product_id is set, but product_from_id is null
df = apply_clicks[((apply_clicks["product"]=="") | apply_clicks["product"].isna()) & apply_clicks.product_id.str.isnumeric() & apply_clicks.product_from_id.isna()][product_provider_summary_cols]
format_results(df)


In [None]:
# provider can't be interpreted from provider_id
df = apply_clicks[((apply_clicks["provider"]=="") | apply_clicks["provider"].isna()) & apply_clicks.provider_id.str.isnumeric() & apply_clicks.provider_from_id.isna()][product_provider_summary_cols]
print(len(df))
format_results(df)


In [None]:
providers_channels[providers_channels.source_provider_id == 1]

In [None]:
providers_channels[~providers_channels.source_provider_id.isna()][providers_channels.source_provider_id>30].sort_values(["source_provider_id"])

### Channels Observed (manual sense check)

In [None]:
apply_clicks.columns

In [None]:
apply_clicks.groupby(["channel"]).size().sort_index()

## Listing click doesn't have index
(not sure been implemented yet on falcon)

## Trying to get something useful out of apply clicks

NB don't use these numbers for reporting just yet.

In [None]:
def group_and_sort(df, cols_to_group_and_sort_by, sort_by_click_count = False):
    r = pd.DataFrame(df.groupby(cols_to_group_and_sort_by).size().reset_index().rename(columns={0:"num_clicks"}))
    if sort_by_click_count:
        if "country_code" in r.columns:
            r = r.sort_values(["country_code", "num_clicks"], ascending = False)
        else:
            r = r.sort_values("num_clicks", ascending = False)
    else:
        r = r.sort_values(cols_to_group_and_sort_by)
    total_clicks = >>
    >> total clicks by country
    >> do distinct anonymous_id, day
    return format_results(r) #AToW makes urls clickable


### Blog apply clicks excluding embeds by where they come from

AToW this won't include apply clicks from the comparison widgets in the page, but it will work for the 

In [None]:
blog_apply_clicks = apply_clicks[apply_clicks["page_type"].str.contains("blog")]


In [None]:
group_and_sort(blog_apply_clicks, ["country_code", "channel", "affiliate_category"])

### Blog apply clicks excluding embeds by place on page

In [None]:
group_and_sort(blog_apply_clicks, ["country_code","channel"]+ affiliate_cols)

### Blog apply clicks excluding embeds by product and provier

In [None]:
group_and_sort(blog_apply_clicks, ["country_code","channel", "product", "provider"]+ affiliate_cols)

### Blog apply clicks excluding embeds top articles

In [None]:
group_and_sort(blog_apply_clicks, ["country_code","page_url"], sort_by_click_count=True)

### Comparison widget (embed) apply clicks

In [None]:
embed_apply_clicks = apply_clicks[apply_clicks.is_embed==True]

In [None]:
group_and_sort(embed_apply_clicks,["country_code","channel"]+ affiliate_cols)

### Embed by location on page

In [None]:
group_and_sort(embed_apply_clicks, ["country_code", "affiliate_location"])

### Blog apply clicks on page & embed
NB you can't do by url this way

In [None]:
blog_and_embed_apply_clicks = blog_apply_clicks.append(embed_apply_clicks)

In [None]:
group_and_sort(blog_and_embed_apply_clicks, ["country_code","channel"]+ affiliate_cols)

### Blog and embed apply clicks by product and provider

In [None]:
group_and_sort(blog_and_embed_apply_clicks, ["country_code","channel", "product", "provider"])

In [None]:
group_and_sort(blog_and_embed_apply_clicks, ["country_code","channel", "product", "provider"]+ affiliate_cols)

### All clicks mobile vs desktop vs channel

In [None]:
group_and_sort(apply_clicks,["country_code","device_category"] )

In [None]:
group_and_sort(apply_clicks, ["country_code","channel", "device_category"])

### All clicks by page type (and sub-type)
NB this isn't set properly for embeds

In [None]:
group_and_sort(apply_clicks, ["country_code","page_type", "page_sub_type", "is_embed"])

### All clicks by channel

In [None]:
group_and_sort(apply_clicks, ["country_code","channel"])

### All clicks by channel and product

In [None]:
group_and_sort(apply_clicks, ["country_code","channel", "product", "provider"])

### All clicks by list position

List position will only apply on listing pages and blog pages where it has been set.  Won't work on PDP



In [None]:
group_and_sort(apply_clicks, ["country_code","channel", "page_type", "list_position"])

# ISS, .../apply etc pageview applies

The main way that we track apply clicks is through ISS (and before that an earlier interstitial page).  NB other actions like contact form submissions aren't tracked.

Expect this not to work well for mortgage and car insurance

# NPP

# More Questions
* Where did people land from?
* Conversion rates based on users / pageviews  / sessions
* Is there a time delay between pageviews and applies? (across the site, across shop, across blog)?