# Introduction

This notebook compares event counts between Segment and Kinesis during the move between the two pipelines.

As the counts inevitably don't match, it then provides detailed segmentation and search / exploratory tools towards the bottom.

For a large number of days, you'll want a lot of RAM (16GB or 32GB). For single-day experiments, 8GB is fine.

Running the whole notebook takes quite a long time initially, mainly because of the Segment query (minutes to tens of minutes). Once this has been done, further exploration is generally very quick (less than a second to a few seconds).

It's not overly optimised, but some steps have been taken to reduce memory.

# Requirements / Jupyter Extensions

Install these through the JupyterLab extension manager (if using JupyterLab):
* jupyter-widgets
* plotly (and ideally chart studio too)

In [1]:
# Safe imports
import sys  # needed for sys.executable in the install fallbacks below
from datetime import datetime, timedelta, date

In [2]:
# Run imports that might require installation to the environment, and install if necessary.
try:
    import psycopg2
except ImportError:
    print("Failed to import psycopg2, trying to install it")
    !{sys.executable} -m pip install psycopg2-binary
    import psycopg2
    print("Successfully installed")


try:
    import dateparser
except ImportError:
    print("Failed to import dateparser, trying to install it")
    !{sys.executable} -m pip install dateparser
    import dateparser
    print("Successfully installed")

try:
    import pyathena  # used in other imports, so really just checking it's available
except ImportError:
    print("Failed to import pyathena, trying to install it")
    !{sys.executable} -m pip install pyathena
    import pyathena
    print("Successfully installed")

try:
    import user_agents
except ImportError:
    print("Failed to import user_agents, trying to install it")
    !{sys.executable} -m pip install user_agents
    import user_agents
    print("Successfully installed")


import ipywidgets as widgets
import pandas as pd  # used directly later on, e.g. pd.DataFrame


In [3]:
# Imports on files that might have dependencies that need installing
import data_pier_querying
from athena_querying import AthenaQuery
from athena_common_queries import *
import user_agents # this converts user agent from browser to mobile / desktop etc.
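
The repeated try / install / retry blocks above could be folded into one helper. This is only a sketch: `import_or_install` and its `pip_name` parameter are hypothetical names, not part of this notebook, and it calls pip through `subprocess` rather than the `!` shell magic so that it also runs outside IPython.

```python
import importlib
import subprocess
import sys


def import_or_install(module_name, pip_name=None):
    """Import module_name, installing it with pip first if the import fails.

    pip_name covers packages whose PyPI name differs from the module name
    (psycopg2 is installed as psycopg2-binary in this notebook).
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        print("Failed to import %s, trying to install it" % module_name)
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or module_name]
        )
        # Make sure the freshly installed package is visible to the import system.
        importlib.invalidate_caches()
        return importlib.import_module(module_name)


# Modules that are already available are returned without touching pip.
json_module = import_or_install("json")
```

With a helper like this, the psycopg2 block would reduce to `psycopg2 = import_or_install("psycopg2", pip_name="psycopg2-binary")`.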

In [4]:
ua = user_agents.parse("Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Mobile/15E148 Safari/604.1")

In [5]:
ua.browser.family

'Mobile Safari'

# Settings

In [6]:
num_days_to_query = 3
#from_datetime = datetime.now() - timedelta(days = 5)
from_datetime = datetime(year=2020, month=1, day=10)
to_datetime = from_datetime + timedelta(days=num_days_to_query)
include_device_segmentation = True  # E.g. iPhone users.  This will use more memory (and likely slow things a bit).

# Kinesis Data via Athena

Data flows tracker -> Kinesis -> S3 (plus a further S3 transform). We can then query S3 using Athena.

In [7]:
aq = AthenaQuery()

In [8]:
aq.connect()

In [9]:
athena_database = "ms_data_lake_production"
athena_raw_events_table = "ms_data_stream_production_processed"

In [10]:
#query = "select context.page_url, body.event_name, count(*) from "+athena_database+"."+athena_raw_events_table
#query += " where partition_0='2019' and partition_1>='12' and partition_2>='05' group by 1,2"

In [11]:
# I've removed the device_type data to save memory, but it would be useful.
query = create_generic_event_query(from_datetime, to_datetime, include_user_agent=include_device_segmentation, include_ip_address = include_device_segmentation, interpret_urls=False)

full_query = "select * from (%s) where country_code ='sg'" %query

In [12]:
print(full_query)

select * from (
    
    SELECT 
          CAST("from_iso8601_timestamp"("sent_at") AS timestamp) "sent_at_timestamp"
    , "sent_at"
    , "type"
    , "body"."event_name"
    , "body"."data"."status"
    , "user"."anonymous_id"
    , "user"."amp_id"
    , "context"."page_url"
    , "context"."referrer"
 
    
        , context.user_agent as user_agent
        
        , context.ip_address
        
    
    FROM
      ms_data_lake_production.ms_data_stream_production_processed
    
    
    WHERE true -- makes query composition easier
    
 AND 
  (
 partition_0 >= '2020'
 AND partition_1 >= '01'
 AND partition_2 >= '10'
 OR (
 partition_0 >= '2020'
 AND partition_1 > '01'
 ) 
 OR (
 partition_0 > '2020'
 ) 
)
 AND ((partition_0 <= '2020'
	 AND partition_1 <= '01'
	 AND partition_2 <= '13'
) 
 OR (
	 partition_0 <= '2020'
	 AND partition_1 < '01'
) 
 OR (
	 partition_0 < '2020'
) 
)
 AND CAST(from_iso8601_timestamp(sent_at) AS timestamp)  between CAST(from_iso8601_timestamp('2020-01-1

In [13]:
athena_full_events_df = aq.query(query)

In [14]:
# Set types to speed queries and save on memory
athena_full_events_df = athena_full_events_df.astype({ "type":"category"
    , "event_name":"category"
    , "status":"category"}, copy=False)

In [15]:
athena_full_events_df.dtypes

sent_at_timestamp      object
sent_at                object
type                 category
event_name           category
status               category
anonymous_id           object
amp_id                 object
page_url               object
referrer               object
user_agent             object
ip_address            float64
dtype: object

In [16]:
athena_full_events_df.head(5)

Unnamed: 0,sent_at_timestamp,sent_at,type,event_name,status,anonymous_id,amp_id,page_url,referrer,user_agent,ip_address
0,2020-01-11 19:19:10.045,2020-01-11T19:19:10.045Z,event,Reading,Article Body 75,3c7ba482-7819-48f3-9d27-86975b92f77a,,https://blog.moneysmart.hk/zh-hk/budgeting/%E5...,https://www.google.com/,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2...,
1,2020-01-11 19:19:09.930,2020-01-11T19:19:09.930Z,event,Reading,Article Body 75,c1a491a6-4ed5-41f9-a35f-1ed45a61f776,,https://blog.moneysmart.sg/budgeting/open-elec...,https://www.google.com/,Mozilla/5.0 (Linux; Android 9; EVR-L29) AppleW...,
2,2020-01-11 19:19:09.156,2020-01-11T19:19:09.156Z,event,Reading,Article Body 50,ff86bdf2-5b0f-44ab-8da1-730dd8e0551a,,https://blog.moneysmart.sg/travel/cheap-batam-...,https://www.google.com/,Mozilla/5.0 (Linux; Android 9; SM-G973F) Apple...,
3,2020-01-11 19:19:10.284,2020-01-11T19:19:10.284Z,page,PageView,,7e9c9c4f-e157-4c29-97be-296cc5f6818d,,https://blog.moneysmart.hk/zh-hk/career/%E6%95...,,Mozilla/5.0 (Linux; Android 9; SAMSUNG SM-G973...,
4,2020-01-11 19:19:11.544,2020-01-11T19:19:11.544Z,event,Reading,Article Body 25,b22ff5e1-20e4-41de-9b34-a8948de744ec,,https://blog.moneysmart.sg/health-insurance/ax...,https://www.google.com/,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,


# Segment Data

NB: this was a mistake. The tracks table can be used instead of the individual event tables, so a lot of this section is unnecessary.

In [17]:
#from importlib import reload
#reload(data_pier_querying)

In [18]:
# Below there are some checks on what columns are available

segment_columns_to_query = [
    # "sent_at", - don't use this, use timestamp
    "timestamp",
    #"event", - going to get that implied from the table.
    # "status", # TODO: would like to have this, but not sure which column, or which tables.  Maybe just not used much, so only do for the 4 tables.
    "anonymous_id",
    "context_page_url",
    # "referrer", #maybe only used in pages table??
    "context_ip", 
    "context_user_agent"]

In [19]:
dp_querying = data_pier_querying.DataPierQuerying()
dp_querying.connect()

In [20]:
tables_df = dp_querying.query_to_dataframe("select * from information_schema.tables")

In [21]:
segment_event_tables_df = tables_df[tables_df.table_schema=="moneysmartsg_prod"]["table_name"]


In [22]:
# These are taken from the dictionary in https://docs.google.com/spreadsheets/d/1HICh77BoGMIat9K3NPwz3pBayJWiAr0ohAlTuv7dr80/edit#gid=1882048411
# but it turns out there should be more events than this, and it doesn't need to be done this way.
expected_events_str = """
LeadGeneration.ClickConversion
LeadGeneration.FormStepCompleted
LeadGeneration.FormSubmitted
LeadGeneration.PaymentCompleted
LeadGeneration.ThankYou
LeadGeneration.RedirectCompleted
UserEngagement.ShowedMoreDetails
UserEngagement.ViewedMoreDetails
UserEngagement.SortedList
UserEngagement.UsedHelpHints
UserEngagement.ClickedMenuItem
UserEngagement.QuestionAnswered
UserEngagement.ShowMoreFilter
UserEngagement.ShowMoreOptions
UserEngagement.ClickedFilter
UserEngagement.ButtonClick
UserAuth.LoggedIn
UserAuth.RegisteredAccount
UserAuth.LoggedOut
UserFeedback.ModalDisplayed
UserFeedback.MoodSubmitted
UserFeedback.FeedbackSubmitted
UserFeedback.MoreFeedback
ABTest.Conversion
UserView.WidgetLoad
EmailCapture
PageView
Sharing
Reading
NewsLetterPopup
"""
expected_events = [z.strip() for z in expected_events_str.split("\n") if len(z.strip())>0]

In [23]:


expected_events_and_segment_tables = []
special_maps = {
    "PageView": "pages"
}
for event in expected_events:
    if event in special_maps:
        new_event_name = special_maps[event]
    else:
        new_event_name = ""
        for i, c in enumerate(event):
            if i==0:new_event_name+=c.lower()
            elif str.isupper(c): 
                if i>0 and event[i-1]!=".":
                    new_event_name += "_"
                new_event_name += c.lower()
            elif c==".": new_event_name += "_"
            else: new_event_name+= c
    expected_events_and_segment_tables.append([event, new_event_name])
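
The character-by-character loop above can also be written as one regular expression. This is a sketch, not the notebook's own code: `event_to_table_name` and `SPECIAL_MAPS` are hypothetical names, and it matches only ASCII capitals (`[A-Z]`), which is enough for the event names listed above.

```python
import re

# PageView is stored in Segment's "pages" table rather than a table named after the event.
SPECIAL_MAPS = {"PageView": "pages"}


def event_to_table_name(event, special_maps=SPECIAL_MAPS):
    """Convert an event name such as 'UserAuth.LoggedIn' to its expected table name."""
    if event in special_maps:
        return special_maps[event]
    # Insert "_" before every capital that is neither the first character
    # nor directly preceded by a ".".
    with_underscores = re.sub(r"(?<!^)(?<!\.)(?=[A-Z])", "_", event)
    # Dots become underscores too, then everything is lower-cased.
    return with_underscores.replace(".", "_").lower()


table_names = [event_to_table_name(e) for e in ["UserAuth.LoggedIn", "Reading"]]
```

Because every capital gets its own underscore, `ABTest.Conversion` becomes `a_b_test_conversion`, the same as the loop produces.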

In [24]:
expected_events_and_segment_tables

[['LeadGeneration.ClickConversion', 'lead_generation_click_conversion'],
 ['LeadGeneration.FormStepCompleted', 'lead_generation_form_step_completed'],
 ['LeadGeneration.FormSubmitted', 'lead_generation_form_submitted'],
 ['LeadGeneration.PaymentCompleted', 'lead_generation_payment_completed'],
 ['LeadGeneration.ThankYou', 'lead_generation_thank_you'],
 ['LeadGeneration.RedirectCompleted', 'lead_generation_redirect_completed'],
 ['UserEngagement.ShowedMoreDetails', 'user_engagement_showed_more_details'],
 ['UserEngagement.ViewedMoreDetails', 'user_engagement_viewed_more_details'],
 ['UserEngagement.SortedList', 'user_engagement_sorted_list'],
 ['UserEngagement.UsedHelpHints', 'user_engagement_used_help_hints'],
 ['UserEngagement.ClickedMenuItem', 'user_engagement_clicked_menu_item'],
 ['UserEngagement.QuestionAnswered', 'user_engagement_question_answered'],
 ['UserEngagement.ShowMoreFilter', 'user_engagement_show_more_filter'],
 ['UserEngagement.ShowMoreOptions', 'user_engagement_show_m

### Check for missing tables

Expect a few events not to be in Segment, including blog-specific ones that haven't been deployed to SG and HK.

In [25]:
# Check all the event tables exist
expected_event_segment_tables = [z[1] for z in expected_events_and_segment_tables]
segment_table_names = segment_event_tables_df.to_list()
missing_event_tables = [z for z in expected_event_segment_tables if z not in segment_table_names]
missing_event_tables

['user_engagement_used_help_hints',
 'user_engagement_clicked_menu_item',
 'user_feedback_modal_displayed',
 'user_feedback_more_feedback',
 'a_b_test_conversion',
 'sharing',
 'news_letter_popup']

In [26]:
expected_events_and_segment_tables

[['LeadGeneration.ClickConversion', 'lead_generation_click_conversion'],
 ['LeadGeneration.FormStepCompleted', 'lead_generation_form_step_completed'],
 ['LeadGeneration.FormSubmitted', 'lead_generation_form_submitted'],
 ['LeadGeneration.PaymentCompleted', 'lead_generation_payment_completed'],
 ['LeadGeneration.ThankYou', 'lead_generation_thank_you'],
 ['LeadGeneration.RedirectCompleted', 'lead_generation_redirect_completed'],
 ['UserEngagement.ShowedMoreDetails', 'user_engagement_showed_more_details'],
 ['UserEngagement.ViewedMoreDetails', 'user_engagement_viewed_more_details'],
 ['UserEngagement.SortedList', 'user_engagement_sorted_list'],
 ['UserEngagement.UsedHelpHints', 'user_engagement_used_help_hints'],
 ['UserEngagement.ClickedMenuItem', 'user_engagement_clicked_menu_item'],
 ['UserEngagement.QuestionAnswered', 'user_engagement_question_answered'],
 ['UserEngagement.ShowMoreFilter', 'user_engagement_show_more_filter'],
 ['UserEngagement.ShowMoreOptions', 'user_engagement_show_m

In [27]:
# Removing the missing ones from the query list
events_and_tables_to_get_from_data_pier = [z for z in expected_events_and_segment_tables if z[1] not in missing_event_tables]

# Removing a problematic one (it doesn't have context_page_url in it, and is very unimportant)
events_and_tables_to_get_from_data_pier = [z for z in events_and_tables_to_get_from_data_pier if z[1] not in ["user_auth_logged_out",]]

In [28]:
len(events_and_tables_to_get_from_data_pier)

22

In [29]:
cols = dp_querying.query_to_dataframe("""
select column_name, data_type, count(*) from information_schema.columns 
where 
table_name in  ('"""+"','".join([z[1] for z in events_and_tables_to_get_from_data_pier])+"""')
and table_schema='moneysmartsg_prod'

group by 1,2
""")

In [30]:
cols[cols["count"]>10].sort_values(["count"])

Unnamed: 0,column_name,data_type,count
287,page_referrer,text,11
352,user_id,text,13
27,context_campaign_content,text,14
42,context_campaign_term,text,14
33,context_campaign_medium,text,15
34,context_campaign_name,text,15
41,context_campaign_source,text,15
286,page_path,text,15
17,channel,text,16
60,context_locale,text,19


In [31]:
cols = dp_querying.query_to_dataframe("""
select  column_name, data_type, count(*) from information_schema.columns 
where 
 table_name in  ('"""+"','".join(["pages", "tracks"])+"""')
and table_schema='moneysmartsg_prod'
and column_name like '%%'
group by 1,2 order by count(*) desc
""")
cols

Unnamed: 0,column_name,data_type,count
0,context_campaign_term,text,2
1,context_campaign_medium,text,2
2,context_page_referrer,text,2
3,context_user_agent,text,2
4,context_page_search,text,2
5,context_page_title,text,2
6,context_campaign_name,text,2
7,context_page_url,text,2
8,id,character varying,2
9,context_ip,text,2


In [32]:
segment_date_constraint = " timestamp >= '%s' and timestamp < '%s' " % (from_datetime.isoformat(), to_datetime.isoformat())

In [33]:
dp_querying.query_to_dataframe("""SELECT
    nmsp_parent.nspname AS parent_schema,
    parent.relname      AS parent,
    nmsp_child.nspname  AS child_schema,
    child.relname       AS child
FROM pg_inherits
    JOIN pg_class parent            ON pg_inherits.inhparent = parent.oid
    JOIN pg_class child             ON pg_inherits.inhrelid   = child.oid
    JOIN pg_namespace nmsp_parent   ON nmsp_parent.oid  = parent.relnamespace
    JOIN pg_namespace nmsp_child    ON nmsp_child.oid   = child.relnamespace
WHERE parent.relname='%s';""" % "pages")

Unnamed: 0,parent_schema,parent,child_schema,child


In [34]:
pd.set_option("display.max_colwidth", 200)
indexes = dp_querying.query_to_dataframe("""SELECT
    indexname,
    indexdef
FROM
    pg_indexes
WHERE
    tablename = '%s';""" % "pages")

for a in indexes.values:
    print(a)

['pages_pkey'
 'CREATE UNIQUE INDEX pages_pkey ON moneysmarthk_prod.pages USING btree (id)']
['pages_pkey'
 'CREATE UNIQUE INDEX pages_pkey ON moneysmartsg_prod.pages USING btree (id)']
['pages_timestamp_idx'
 'CREATE INDEX pages_timestamp_idx ON moneysmartsg_prod.pages USING btree ("timestamp")']
['pages_pkey'
 'CREATE UNIQUE INDEX pages_pkey ON moneysmarthk_dev.pages USING btree (id)']
['pages_pkey'
 'CREATE UNIQUE INDEX pages_pkey ON moneysmartsg_dev.pages USING btree (id)']


In [35]:
query_segment_by_table = False  # Leave this False: the per-table route predates realising that pages/tracks is the correct method, and it still needs country handling added.

segment_schemas = ["moneysmartsg_prod", "moneysmarthk_prod"]
# The meat of it
start_time = datetime.now()
event_name_to_rows = {}
if query_segment_by_table:
    for country_schema in segment_schemas:
        for i, (event_name, table_name) in enumerate(events_and_tables_to_get_from_data_pier):
            table_start_time = datetime.now()
            print("querying table %s / %s (%i/%i)" % (table_name, event_name, i+1, len(events_and_tables_to_get_from_data_pier)))
            query = "select {cols} from {schema}.{table} where {date_constraint}".format(cols=", ".join(segment_columns_to_query), 
                                                                           table=table_name,
                                                                           date_constraint =segment_date_constraint, schema=country_schema)

            events = dp_querying.query_to_dataframe(query)
            events["event_name"] = event_name #fills the entire column with the same value
            print("Got %i events"% len(events))
            event_name_to_rows[event_name]=events

            table_download_time = (datetime.now()-table_start_time).total_seconds()
            time_since_start = (datetime.now()-start_time).total_seconds()
            print("It took %.1f seconds to download from the table (%.1f seconds overall)" %(table_download_time, time_since_start))
            print()
            # if i>4:break


        # Merge tables
        segment_combined_df = pd.DataFrame()
        #combined_df = pd.DataFrame(columns=event_name_to_rows["LeadGeneration.ClickConversion"].columns)
        """for event_name, event_df in event_name_to_rows.items():
            print(len(event_df))
            combined_df.append(event_df, ignore_index=True)
            print(len(combined_df))
        #combined_df.astype({"event_name":"category"})
        """

        segment_combined_df = pd.concat([segment_combined_df] + list(event_name_to_rows.values()))
    
    
else:
    segment_columns_to_query_full = segment_columns_to_query + ["event_text",]
    tables_to_query = ["pages", "tracks"]
    all_event_dfs = []
    segment_combined_df = pd.DataFrame()
    for country_schema in segment_schemas:
        for table_name in tables_to_query:
            table_start_time = datetime.now()
            if table_name!="pages":
                cols_to_query = segment_columns_to_query_full
            else:
                cols_to_query = segment_columns_to_query
            print("querying table %s.%s" % (country_schema, table_name))
            print(cols_to_query)
            query = "select {cols} from {schema}.{table} where {date_constraint}".format(cols=", ".join(cols_to_query), 
                                                                           table=table_name,
                                                                           date_constraint =segment_date_constraint, schema=country_schema)

            events = dp_querying.query_to_dataframe(query)
            
            print("Got %i events"% len(events))
            #all_event_dfs.append(events)
            
            if table_name =="pages":
                events["event_text"] = "PageView" # fills the whole column
            table_download_time = (datetime.now()-table_start_time).total_seconds()
            time_since_start = (datetime.now()-start_time).total_seconds()
            print("merging")
            segment_combined_df = pd.concat([segment_combined_df, events])  # DataFrame.append was removed in pandas 2.0
            print("It took %.1f seconds to download from the table (%.1f seconds overall)" %(table_download_time, time_since_start))
            print()
            
        

querying table moneysmartsg_prod.pages
['timestamp', 'anonymous_id', 'context_page_url', 'context_ip', 'context_user_agent']
Got 474604 events
merging
It took 10.1 seconds to download from the table (10.1 seconds overall)

querying table moneysmartsg_prod.tracks
['timestamp', 'anonymous_id', 'context_page_url', 'context_ip', 'context_user_agent', 'event_text']
Got 656842 events
merging
It took 312.2 seconds to download from the table (322.6 seconds overall)

querying table moneysmarthk_prod.pages
['timestamp', 'anonymous_id', 'context_page_url', 'context_ip', 'context_user_agent']
Got 114004 events
merging
It took 267.5 seconds to download from the table (590.3 seconds overall)

querying table moneysmarthk_prod.tracks
['timestamp', 'anonymous_id', 'context_page_url', 'context_ip', 'context_user_agent', 'event_text']
Got 158634 events
merging
It took 70.3 seconds to download from the table (661.0 seconds overall)
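
Growing `segment_combined_df` inside the loop copies the accumulated rows on every iteration. A common alternative is to collect each table's frame in a list (the otherwise unused `all_event_dfs` would serve) and concatenate once at the end. The sketch below uses toy frames; `combine_event_frames` is a hypothetical helper name, and it passes `ignore_index=True`, so unlike the loop above it renumbers the rows.

```python
import pandas as pd


def combine_event_frames(frames):
    """Concatenate per-table result frames in a single call."""
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Toy stand-ins for the results of the pages and tracks queries.
pages_toy = pd.DataFrame(
    {"anonymous_id": ["a", "b"], "event_text": ["PageView", "PageView"]}
)
tracks_toy = pd.DataFrame({"anonymous_id": ["a"], "event_text": ["Reading"]})

combined_toy = combine_event_frames([pages_toy, tracks_toy])
```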



In [36]:
if not query_segment_by_table:
    segment_combined_df.rename(columns={"event_text":"event_name"}, inplace=True)

In [37]:
len(all_event_dfs)

0

In [38]:
if include_device_segmentation:
    segment_combined_df.rename(columns={"context_user_agent":"user_agent"}, inplace=True)

In [39]:
segment_combined_df.head()

Unnamed: 0,timestamp,anonymous_id,context_page_url,context_ip,user_agent,event_name
0,2020-01-10 00:00:01.336000+00:00,b843bc94-64dc-4dca-8978-08877ddaf7eb,https://blog.moneysmart.sg/travel/singapore-pu...,183.90.37.6,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,PageView
1,2020-01-10 00:00:01.915000+00:00,b843bc94-64dc-4dca-8978-08877ddaf7eb,https://www.moneysmart.sg/embed/7432e8a28bd976...,183.90.37.6,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,PageView
2,2020-01-10 00:00:02.871000+00:00,0263aa86-16ca-46d3-b7ee-37bb7665f1da,https://blog.moneysmart.sg/transportation/erp-...,183.90.37.137,Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like...,PageView
3,2020-01-10 00:00:03.139000+00:00,0263aa86-16ca-46d3-b7ee-37bb7665f1da,https://www.moneysmart.sg/embed/52174e7b8d839f...,183.90.37.137,Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like...,PageView
4,2020-01-10 00:00:03.523000+00:00,c25624f4-6ccf-4add-8363-226926ca9737,https://blog.moneysmart.sg/fixed-deposits/best...,119.56.100.216,Mozilla/5.0 (Linux; Android 9; SM-N960F) Apple...,PageView


In [40]:
segment_combined_df.rename(columns={"context_page_url":"page_url"}, inplace=True)
segment_combined_df.head(5)

Unnamed: 0,timestamp,anonymous_id,page_url,context_ip,user_agent,event_name
0,2020-01-10 00:00:01.336000+00:00,b843bc94-64dc-4dca-8978-08877ddaf7eb,https://blog.moneysmart.sg/travel/singapore-pu...,183.90.37.6,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,PageView
1,2020-01-10 00:00:01.915000+00:00,b843bc94-64dc-4dca-8978-08877ddaf7eb,https://www.moneysmart.sg/embed/7432e8a28bd976...,183.90.37.6,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,PageView
2,2020-01-10 00:00:02.871000+00:00,0263aa86-16ca-46d3-b7ee-37bb7665f1da,https://blog.moneysmart.sg/transportation/erp-...,183.90.37.137,Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like...,PageView
3,2020-01-10 00:00:03.139000+00:00,0263aa86-16ca-46d3-b7ee-37bb7665f1da,https://www.moneysmart.sg/embed/52174e7b8d839f...,183.90.37.137,Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like...,PageView
4,2020-01-10 00:00:03.523000+00:00,c25624f4-6ccf-4add-8363-226926ca9737,https://blog.moneysmart.sg/fixed-deposits/best...,119.56.100.216,Mozilla/5.0 (Linux; Android 9; SM-N960F) Apple...,PageView


# Merging Segment and Kinesis Events

In [41]:
# Make names clear e.g. s_...

# Check the timezone / timestamps match
# Athena raw stuff is in UTC, not SG time.  So 2020-01-19T00:04:04.443Z is 8:04am Singapore time,
# whereas Segment is stored with timezone at UTC.  So, could convert them all.
# TODO: But it does mean that there are a lot of events coming at the day boundary.
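
To make the timezone point concrete, the sketch below converts a `Z`-suffixed UTC timestamp to Singapore time using only the standard library. The names `SGT` and `parse_utc` are hypothetical, and the second timestamp is an invented example of an event whose calendar date differs between UTC and Singapore time.

```python
from datetime import datetime, timedelta, timezone

# Singapore uses a fixed UTC+8 offset, so a fixed-offset timezone is enough here.
SGT = timezone(timedelta(hours=8), "SGT")


def parse_utc(sent_at):
    """Parse a 'Z'-suffixed ISO timestamp such as 2020-01-19T00:04:04.443Z as UTC."""
    naive = datetime.strptime(sent_at, "%Y-%m-%dT%H:%M:%S.%fZ")
    return naive.replace(tzinfo=timezone.utc)


# The example from the comments above: just after midnight UTC.
event_utc = parse_utc("2020-01-19T00:04:04.443Z")
event_sgt = event_utc.astimezone(SGT)

# An invented late-afternoon UTC event that is already the next day in Singapore.
late_utc = parse_utc("2020-01-10T17:30:00.000Z")
late_sgt = late_utc.astimezone(SGT)

print(event_sgt.isoformat())
print(late_utc.date(), late_sgt.date())
```

Grouping by date only lines up between the two sources if both dates are taken in the same timezone.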

In [42]:
athena_full_events_df.head(2)

Unnamed: 0,sent_at_timestamp,sent_at,type,event_name,status,anonymous_id,amp_id,page_url,referrer,user_agent,ip_address
0,2020-01-11 19:19:10.045,2020-01-11T19:19:10.045Z,event,Reading,Article Body 75,3c7ba482-7819-48f3-9d27-86975b92f77a,,https://blog.moneysmart.hk/zh-hk/budgeting/%E5...,https://www.google.com/,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2...,
1,2020-01-11 19:19:09.930,2020-01-11T19:19:09.930Z,event,Reading,Article Body 75,c1a491a6-4ed5-41f9-a35f-1ed45a61f776,,https://blog.moneysmart.sg/budgeting/open-elec...,https://www.google.com/,Mozilla/5.0 (Linux; Android 9; EVR-L29) AppleW...,


In [43]:
athena_full_events_df.dtypes

sent_at_timestamp      object
sent_at                object
type                 category
event_name           category
status               category
anonymous_id           object
amp_id                 object
page_url               object
referrer               object
user_agent             object
ip_address            float64
dtype: object

In [44]:
segment_combined_df.dtypes

timestamp       datetime64[ns, UTC]
anonymous_id                 object
page_url                     object
context_ip                   object
user_agent                   object
event_name                   object
dtype: object

In [45]:
# Group by columns to get around date inaccuracy issue
cols_to_group_by = ["anonymous_id", "event_name", "page_url", "date"] #, "context_ip", "context_user_agent"] #TODO: add IP address

print("Grouping by %s"% ", ".join(cols_to_group_by))

print("Fixing dates before grouping")
print("... for Segment")
segment_combined_df["date"] = segment_combined_df.apply(lambda row: row.timestamp.date().isoformat(), axis=1) # making this a string
print("... for athena")
athena_full_events_df["date"] = athena_full_events_df.apply(lambda row: row.sent_at[:10], axis=1)
# super-slow,so moving to using strings athena_full_events_df["date"] = athena_full_events_df.apply(lambda row: dateparser.parse(row.sent_at_timestamp).date(), axis=1)  #conversion from string might not be needed in the future; using dateparser as more robust, also slow

#going to reduce the number of columns to make it safer, then can go back and look for user agents etc (can do a mapping of anonymous_id to user_agent for instance.)




Grouping by anonymous_id, event_name, page_url, date
Fixing dates before grouping
... for Segment
... for athena
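
Row-wise `apply(..., axis=1)` calls a Python function once per row, which is slow on frames with hundreds of thousands of rows. The sketch below shows vectorised equivalents on toy data. It assumes, as in the dtypes shown earlier, that the Segment `timestamp` column is timezone-aware and that the Athena `sent_at` column is an ISO 8601 string; `segment_toy` and `athena_toy` are made-up frames.

```python
import pandas as pd

segment_toy = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2020-01-10 00:00:01.336+00:00", "2020-01-11 23:59:59.000+00:00"],
            utc=True,
        )
    }
)
athena_toy = pd.DataFrame(
    {"sent_at": ["2020-01-11T19:19:10.045Z", "2020-01-12T00:00:00.000Z"]}
)

# Vectorised: format the whole datetime column as YYYY-MM-DD strings in one call.
segment_toy["date"] = segment_toy["timestamp"].dt.strftime("%Y-%m-%d")

# Vectorised: slice the first 10 characters of every ISO string.
athena_toy["date"] = athena_toy["sent_at"].str[:10]

# Row-wise version from the cell above, kept only to compare the results.
row_wise = segment_toy.apply(lambda row: row.timestamp.date().isoformat(), axis=1)
```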


In [46]:
print("Setting sensible data types for the columns to group by")
data_type_mappings = {"event_name":"category", "date":"category"}
segment_combined_df = segment_combined_df.astype(data_type_mappings, copy=False)
athena_full_events_df = athena_full_events_df.astype(data_type_mappings, copy=False)

Setting sensible data types for the columns to group by


In [47]:
segment_combined_df.head()[cols_to_group_by]

Unnamed: 0,anonymous_id,event_name,page_url,date
0,b843bc94-64dc-4dca-8978-08877ddaf7eb,PageView,https://blog.moneysmart.sg/travel/singapore-pu...,2020-01-10
1,b843bc94-64dc-4dca-8978-08877ddaf7eb,PageView,https://www.moneysmart.sg/embed/7432e8a28bd976...,2020-01-10
2,0263aa86-16ca-46d3-b7ee-37bb7665f1da,PageView,https://blog.moneysmart.sg/transportation/erp-...,2020-01-10
3,0263aa86-16ca-46d3-b7ee-37bb7665f1da,PageView,https://www.moneysmart.sg/embed/52174e7b8d839f...,2020-01-10
4,c25624f4-6ccf-4add-8363-226926ca9737,PageView,https://blog.moneysmart.sg/fixed-deposits/best...,2020-01-10


In [48]:
athena_full_events_df.head()[cols_to_group_by]

Unnamed: 0,anonymous_id,event_name,page_url,date
0,3c7ba482-7819-48f3-9d27-86975b92f77a,Reading,https://blog.moneysmart.hk/zh-hk/budgeting/%E5...,2020-01-11
1,c1a491a6-4ed5-41f9-a35f-1ed45a61f776,Reading,https://blog.moneysmart.sg/budgeting/open-elec...,2020-01-11
2,ff86bdf2-5b0f-44ab-8da1-730dd8e0551a,Reading,https://blog.moneysmart.sg/travel/cheap-batam-...,2020-01-11
3,7e9c9c4f-e157-4c29-97be-296cc5f6818d,PageView,https://blog.moneysmart.hk/zh-hk/career/%E6%95...,2020-01-11
4,b22ff5e1-20e4-41de-9b34-a8948de744ec,Reading,https://blog.moneysmart.sg/health-insurance/ax...,2020-01-11


In [49]:
# athena_full_events_df timestamp

print("Grouping by %s"%cols_to_group_by)
segment_grouped_df = segment_combined_df.groupby(cols_to_group_by).size().reset_index(name='s_count') #size preserves nulls, this sets the column to s_count

athena_grouped_df = athena_full_events_df.groupby(cols_to_group_by).size().reset_index(name='k_count')

# segment_combined_df.rename(columns = {"context_ip":"s_context_ip", "context_user_agent":"s_context_user_agent"}) 

Grouping by ['anonymous_id', 'event_name', 'page_url', 'date']


In [62]:
athena_grouped_df.head()

Unnamed: 0,anonymous_id,event_name,page_url,date,k_count
0,00000b54-600a-4de2-8700-fd9885252dca,PageView,https://blog.moneysmart.sg/career/5-easy-side-...,2020-01-12,2
1,00000b54-600a-4de2-8700-fd9885252dca,PageView,https://www.moneysmart.sg/embed/98e61305602380...,2020-01-12,1
2,00000b54-600a-4de2-8700-fd9885252dca,Reading,https://blog.moneysmart.sg/career/5-easy-side-...,2020-01-12,3
3,00000b54-600a-4de2-8700-fd9885252dca,UserView.WidgetLoad,https://www.moneysmart.sg/embed/98e61305602380...,2020-01-12,1
4,000034a2-e973-4108-b920-0681877d4fc0,PageView,https://blog.moneysmart.sg/budgeting/cheapest-...,2020-01-10,1


In [51]:
# Actually join them

# set the column count names

merged_df = segment_grouped_df.merge(athena_grouped_df, how='outer', on=cols_to_group_by )

#Fill in the empty counts with 0s

merged_df["s_count"] = merged_df["s_count"].fillna(0)
merged_df["k_count"] = merged_df["k_count"].fillna(0)

In [52]:
merged_df.head(10)

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count
0,00000b54-600a-4de2-8700-fd9885252dca,PageView,https://blog.moneysmart.sg/career/5-easy-side-...,2020-01-12,2.0,2.0
1,00000b54-600a-4de2-8700-fd9885252dca,PageView,https://www.moneysmart.sg/embed/98e61305602380...,2020-01-12,1.0,1.0
2,00000b54-600a-4de2-8700-fd9885252dca,Reading,https://blog.moneysmart.sg/career/5-easy-side-...,2020-01-12,3.0,3.0
3,00000b54-600a-4de2-8700-fd9885252dca,UserView.WidgetLoad,https://www.moneysmart.sg/embed/98e61305602380...,2020-01-12,1.0,1.0
4,000034a2-e973-4108-b920-0681877d4fc0,PageView,https://blog.moneysmart.sg/budgeting/cheapest-...,2020-01-10,1.0,1.0
5,00008e20-54bd-495c-82ce-ebc6193bb1c9,PageView,https://blog.moneysmart.sg/dining/starbucks-si...,2020-01-10,1.0,1.0
6,00008e20-54bd-495c-82ce-ebc6193bb1c9,PageView,https://www.moneysmart.sg/embed/bf150432720b4e...,2020-01-10,1.0,1.0
7,00008e20-54bd-495c-82ce-ebc6193bb1c9,Reading,https://blog.moneysmart.sg/dining/starbucks-si...,2020-01-10,3.0,3.0
8,00008e20-54bd-495c-82ce-ebc6193bb1c9,UserView.WidgetLoad,https://www.moneysmart.sg/embed/bf150432720b4e...,2020-01-10,1.0,1.0
9,0000edbd-6d98-466c-8537-7e4f07a93e52,PageView,https://blog.moneysmart.hk/zh-hk/credit-cards/...,2020-01-10,1.0,1.0
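
The outer merge followed by filling missing counts with 0 amounts to counting each key in both sources and comparing the totals. Below is a minimal standard-library sketch of that idea on invented keys; `compare_counts` and the toy tuples are hypothetical and not taken from the real data.

```python
from collections import Counter


def compare_counts(segment_keys, kinesis_keys):
    """Return (key, s_count, k_count) for every key seen in either source."""
    s_counts = Counter(segment_keys)
    k_counts = Counter(kinesis_keys)
    # Union of keys is the "outer" part of the join; a Counter returns 0 for a missing key.
    all_keys = sorted(set(s_counts) | set(k_counts))
    return [(key, s_counts[key], k_counts[key]) for key in all_keys]


# Keys mirror cols_to_group_by: (anonymous_id, event_name, page_url, date).
segment_keys = [
    ("id-1", "PageView", "https://example.test/a", "2020-01-10"),
    ("id-1", "PageView", "https://example.test/a", "2020-01-10"),
    ("id-2", "Reading", "https://example.test/b", "2020-01-10"),
]
kinesis_keys = [
    ("id-1", "PageView", "https://example.test/a", "2020-01-10"),
    ("id-3", "Reading", "https://example.test/c", "2020-01-11"),
]

rows = compare_counts(segment_keys, kinesis_keys)
mismatches = [row for row in rows if row[1] != row[2]]
```

Filtering on `s_count != k_count`, as `mismatches` does here, is the starting point for the exploratory segmentation.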


In [53]:
merged_df.groupby(["date"]).count()

Unnamed: 0_level_0,anonymous_id,event_name,page_url,s_count,k_count
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2020-01-10,355997,355997,355997,355997,355997
2020-01-11,321527,321527,321527,321527,321527
2020-01-12,341972,341972,341972,341972,341972


# Add Page Filtering Metadata

* is url blog / shop / ...
* country

In [67]:
from urllib.parse import urlparse, parse_qs

In [68]:
def get_metadata_from_url(url):
    p = urlparse(url.lower())
    
    #urlparse("https://www-new.moneysmart.sg/rabbit/mouse/?a=b")
    #ParseResult(scheme='https', netloc='www-new.moneysmart.sg', path='/rabbit/mouse/', params='', query='a=b', fragment='')
    
    
    nl = p.netloc
    
    page_type = ""
    stripped_path = p.path.strip("/")
    
    #blog (for SG and HK)
    if "moneysmart.tw" in nl or "moneysmart.ph" in nl or "moneysmart.id" in nl or 'blog.moneysmart' in nl or 'blog-new' in nl or 'blog3' in nl:
        page_type = "blog"
    
    #LPS
    elif stripped_path.endswith("ms"):
        page_type = "lps"
    
    #interstitial
    elif "iss.moneysmart" in nl or stripped_path.endswith("apply") or stripped_path.endswith("redirect"):
        page_type = "iss"
        
    #unbounce
    elif "get.moneysmart" in nl:
        page_type = "unbounce"
        
    #embed     , "regexp_extract"("context"."page_url", '^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?', 5) like '/embed/%' as is_embed
    
    
    #else shop
    else:
        page_type = "shop"
        
        
        
    
    #ab test side , CAST("strpos"("context"."page_url", '://www-new.') AS boolean) OR CAST("strpos"("context"."page_url", '://www3.') AS boolean)  OR CAST("strpos"("context"."page_url", '://blog3.') AS boolean) as "is_test"
    if "www-new." in nl or "www3." in nl or "blog3." in nl:
        ab_test = "test"
    else:
        ab_test = "control"
    
    slug = "/"+stripped_path
    
    if slug.startswith("/en/") or slug.startswith("/zh-hk/"):
        slug_root = "/"+stripped_path.split("/")[1]
    elif slug=="/en" or slug=="/zh-hk":
        slug_root = "/"
    else:
        slug_root = "/"+stripped_path.split("/")[0]
    
    
    
    
    """
     , CASE WHEN "regexp_extract"("context"."page_url", '^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?', 4) LIKE '%moneysmart.sg%' THEN 'sg' 
        WHEN "regexp_extract"("context"."page_url", '^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?', 4) LIKE '%moneysmart.hk%' THEN 'hk' 
        WHEN "regexp_extract"("context"."page_url", '^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?', 4) LIKE '%moneysmart.id%' THEN 'id' 
        WHEN "regexp_extract"("context"."page_url", '^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?', 4) LIKE '%moneysmart.ph%' THEN 'ph'
        WHEN "regexp_extract"("context"."page_url", '^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?', 4) LIKE '%moneysmart.tw%' THEN 'tw' 
        ELSE null END as country_code
    """
    
    country_code = ""
    if "moneysmart.sg" in nl:
        country_code = "sg"
    elif "moneysmart.hk" in nl:
        country_code = "hk"
    elif "moneysmart.id" in nl:
        country_code = "id"
    elif "moneysmart.ph" in nl:
        country_code = "ph"
    elif "moneysmart.tw" in nl:
        country_code = "tw"
    elif "moneysmart.com" in nl:
        country_code = "ww" #worldwide
    else:
        country_code = "??"
    
    
    #return {"page_type":page_type, "path":path, "ab_test":ab_test, "country_code":country_code}
    return [page_type, slug, slug_root, ab_test, country_code]
    
    



In [69]:
# Do some tests to show that it's kind of working (bad version of a unit test!)

In [70]:
get_metadata_from_url("https://www-new.moneysmart.sg/rabbit/headlight/?scary=True")

['shop', '/rabbit/headlight', '/rabbit', 'test', 'sg']

In [71]:
get_metadata_from_url("https://blog.moneysmart.ph/rabbit/headlight/?scary=True")

['blog', '/rabbit/headlight', '/rabbit', 'control', 'ph']

In [72]:
get_metadata_from_url("https://blog3.moneysmart.tw")

['blog', '/', '/', 'test', 'tw']

In [73]:
get_metadata_from_url("https://www.moneysmart.hk/zh-hk/credit-cards/")

['shop', '/zh-hk/credit-cards', '/credit-cards', 'control', 'hk']
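
The spot checks above can also be written as assertions, so that a regression fails loudly instead of relying on someone reading the output. A minimal sketch: `toy_ab_and_country` is a hypothetical, cut-down stand-in that mirrors only the A/B and country branches of `get_metadata_from_url` (and omits the `.com` case), so that the block runs on its own. In the notebook you would pass the real function and compare against the full five-element lists.

```python
from urllib.parse import urlparse

def toy_ab_and_country(url):
    """Hypothetical stand-in: only the ab_test and country_code branches."""
    nl = urlparse(url.lower()).netloc
    is_test = any(marker in nl for marker in ("www-new.", "www3.", "blog3."))
    ab_test = "test" if is_test else "control"
    country_code = "??"
    for cc in ("sg", "hk", "id", "ph", "tw"):
        if "moneysmart." + cc in nl:
            country_code = cc
            break
    return [ab_test, country_code]

# (url, expected) pairs; the URLs are the ones used in the spot checks above
CASES = [
    ("https://www-new.moneysmart.sg/rabbit/headlight/?scary=True", ["test", "sg"]),
    ("https://blog.moneysmart.ph/rabbit/headlight/?scary=True", ["control", "ph"]),
    ("https://blog3.moneysmart.tw", ["test", "tw"]),
    ("https://www.moneysmart.hk/zh-hk/credit-cards/", ["control", "hk"]),
]

def failing_cases(fn, cases):
    """Return (url, expected, actual) for every case where fn disagrees."""
    return [(url, exp, fn(url)) for url, exp in cases if fn(url) != exp]

failures = failing_cases(toy_ab_and_country, CASES)
assert not failures, failures
```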

In [76]:
start_time = datetime.now()
print("starting at %s"%start_time.isoformat())
#This is a bit slow (consider looking at how to optimise, especially the memory used by creating loads of Series objects).
#Could probably optimise by splitting all the urls using a pandas function, then joining with a map to get page_type, path etc, but ymmv
metadata_df = merged_df.apply(lambda x: pd.Series(get_metadata_from_url(x.page_url)), axis=1)#, index=["page_type", "path", "ab_test", "country_code"])
end_time = datetime.now()
time_taken = (end_time-start_time).total_seconds()
print("Took %i seconds"%time_taken)

starting at 2020-01-24T02:31:48.379092
Took 398 seconds
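
One way this run time might be cut, sketched under assumptions: parse each distinct `page_url` once and join the result back, the same unique-then-join pattern the notebook uses further down for user agents. `toy_metadata` and the three-row frame are hypothetical stand-ins for `get_metadata_from_url` and `merged_df`. The speed-up depends on how often URLs repeat and has not been measured here.

```python
import pandas as pd
from urllib.parse import urlparse

def toy_metadata(url):
    """Hypothetical stand-in for get_metadata_from_url: returns [host, slug]."""
    p = urlparse(url.lower())
    return [p.netloc, "/" + p.path.strip("/")]

# Toy stand-in for merged_df
events = pd.DataFrame({
    "page_url": [
        "https://blog.moneysmart.sg/career/a/",
        "https://www.moneysmart.sg/embed/b",
        "https://blog.moneysmart.sg/career/a/",
    ],
    "s_count": [2, 1, 3],
})

# Parse each distinct URL once, building plain lists rather than Series objects
unique_urls = events["page_url"].unique()
rows = [[u] + toy_metadata(u) for u in unique_urls]
url_meta = pd.DataFrame(rows, columns=["page_url", "host", "slug"])

# A left merge on the URL copies the metadata onto every event row
events_with_meta = events.merge(url_meta, on="page_url", how="left")
print(events_with_meta)
```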


In [77]:
metadata_df.rename(columns={0:"page_type", 1:"slug", 2:"slug_root", 3:"ab_test", 4:"country_code"}, inplace=True)

In [78]:
metadata_df.head()

Unnamed: 0,page_type,slug,slug_root,ab_test,country_code
0,blog,/career/5-easy-side-businesses-you-can-run-whi...,/career,control,sg
1,shop,/embed/98e61305602380971d9c5e68c4a75647,/embed,control,sg
2,blog,/career/5-easy-side-businesses-you-can-run-whi...,/career,control,sg
3,shop,/embed/98e61305602380971d9c5e68c4a75647,/embed,control,sg
4,blog,/budgeting/cheapest-sim-only-plans,/budgeting,control,sg


In [79]:
merged_df_with_meta = pd.concat([merged_df, metadata_df], axis=1)

In [80]:
# Set some sensible data types to speed it all up
#merged_df_with_meta.astype({"page_type":"category", "slug":"category"})
merged_df_with_meta = merged_df_with_meta.astype({"page_type":"category", "slug":"category", "ab_test":"category", "country_code":"category", "s_count":"int", "k_count":"int"})

In [81]:
merged_df_with_meta.head()

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code
0,00000b54-600a-4de2-8700-fd9885252dca,PageView,https://blog.moneysmart.sg/career/5-easy-side-...,2020-01-12,2,2,blog,/career/5-easy-side-businesses-you-can-run-whi...,/career,control,sg
1,00000b54-600a-4de2-8700-fd9885252dca,PageView,https://www.moneysmart.sg/embed/98e61305602380...,2020-01-12,1,1,shop,/embed/98e61305602380971d9c5e68c4a75647,/embed,control,sg
2,00000b54-600a-4de2-8700-fd9885252dca,Reading,https://blog.moneysmart.sg/career/5-easy-side-...,2020-01-12,3,3,blog,/career/5-easy-side-businesses-you-can-run-whi...,/career,control,sg
3,00000b54-600a-4de2-8700-fd9885252dca,UserView.WidgetLoad,https://www.moneysmart.sg/embed/98e61305602380...,2020-01-12,1,1,shop,/embed/98e61305602380971d9c5e68c4a75647,/embed,control,sg
4,000034a2-e973-4108-b920-0681877d4fc0,PageView,https://blog.moneysmart.sg/budgeting/cheapest-...,2020-01-10,1,1,blog,/budgeting/cheapest-sim-only-plans,/budgeting,control,sg


In [82]:
merged_df_with_meta[(merged_df_with_meta.s_count>1) & (merged_df_with_meta.k_count>1)].head()

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code
0,00000b54-600a-4de2-8700-fd9885252dca,PageView,https://blog.moneysmart.sg/career/5-easy-side-...,2020-01-12,2,2,blog,/career/5-easy-side-businesses-you-can-run-whi...,/career,control,sg
2,00000b54-600a-4de2-8700-fd9885252dca,Reading,https://blog.moneysmart.sg/career/5-easy-side-...,2020-01-12,3,3,blog,/career/5-easy-side-businesses-you-can-run-whi...,/career,control,sg
7,00008e20-54bd-495c-82ce-ebc6193bb1c9,Reading,https://blog.moneysmart.sg/dining/starbucks-si...,2020-01-10,3,3,blog,/dining/starbucks-singapore-menu-prices,/dining,control,sg
15,0001018d-97ee-48e4-a642-508643c00568,Reading,https://blog.moneysmart.sg/budgeting/cut-costs...,2020-01-10,4,4,blog,/budgeting/cut-costs-chinese-new-year-cheap,/budgeting,control,sg
26,00025fe1-ceb8-4d93-b8a2-7907c0a0ebd9,Reading,https://blog.moneysmart.sg/budgeting/singapore...,2020-01-10,2,2,blog,/budgeting/singapore-key-areas-invest,/budgeting,control,sg


# Add Device Type Metadata

In [83]:
segment_combined_df.head()

Unnamed: 0,timestamp,anonymous_id,page_url,context_ip,user_agent,event_name,date
0,2020-01-10 00:00:01.336000+00:00,b843bc94-64dc-4dca-8978-08877ddaf7eb,https://blog.moneysmart.sg/travel/singapore-pu...,183.90.37.6,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,PageView,2020-01-10
1,2020-01-10 00:00:01.915000+00:00,b843bc94-64dc-4dca-8978-08877ddaf7eb,https://www.moneysmart.sg/embed/7432e8a28bd976...,183.90.37.6,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,PageView,2020-01-10
2,2020-01-10 00:00:02.871000+00:00,0263aa86-16ca-46d3-b7ee-37bb7665f1da,https://blog.moneysmart.sg/transportation/erp-...,183.90.37.137,Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like...,PageView,2020-01-10
3,2020-01-10 00:00:03.139000+00:00,0263aa86-16ca-46d3-b7ee-37bb7665f1da,https://www.moneysmart.sg/embed/52174e7b8d839f...,183.90.37.137,Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like...,PageView,2020-01-10
4,2020-01-10 00:00:03.523000+00:00,c25624f4-6ccf-4add-8363-226926ca9737,https://blog.moneysmart.sg/fixed-deposits/best...,119.56.100.216,Mozilla/5.0 (Linux; Android 9; SM-N960F) Apple...,PageView,2020-01-10


In [84]:
athena_full_events_df.head()

Unnamed: 0,sent_at_timestamp,sent_at,type,event_name,status,anonymous_id,amp_id,page_url,referrer,user_agent,ip_address,date
0,2020-01-11 19:19:10.045,2020-01-11T19:19:10.045Z,event,Reading,Article Body 75,3c7ba482-7819-48f3-9d27-86975b92f77a,,https://blog.moneysmart.hk/zh-hk/budgeting/%E5...,https://www.google.com/,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2...,,2020-01-11
1,2020-01-11 19:19:09.930,2020-01-11T19:19:09.930Z,event,Reading,Article Body 75,c1a491a6-4ed5-41f9-a35f-1ed45a61f776,,https://blog.moneysmart.sg/budgeting/open-elec...,https://www.google.com/,Mozilla/5.0 (Linux; Android 9; EVR-L29) AppleW...,,2020-01-11
2,2020-01-11 19:19:09.156,2020-01-11T19:19:09.156Z,event,Reading,Article Body 50,ff86bdf2-5b0f-44ab-8da1-730dd8e0551a,,https://blog.moneysmart.sg/travel/cheap-batam-...,https://www.google.com/,Mozilla/5.0 (Linux; Android 9; SM-G973F) Apple...,,2020-01-11
3,2020-01-11 19:19:10.284,2020-01-11T19:19:10.284Z,page,PageView,,7e9c9c4f-e157-4c29-97be-296cc5f6818d,,https://blog.moneysmart.hk/zh-hk/career/%E6%95...,,Mozilla/5.0 (Linux; Android 9; SAMSUNG SM-G973...,,2020-01-11
4,2020-01-11 19:19:11.544,2020-01-11T19:19:11.544Z,event,Reading,Article Body 25,b22ff5e1-20e4-41de-9b34-a8948de744ec,,https://blog.moneysmart.sg/health-insurance/ax...,https://www.google.com/,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,,2020-01-11


### Segment

In [85]:
segment_combined_df.head()

Unnamed: 0,timestamp,anonymous_id,page_url,context_ip,user_agent,event_name,date
0,2020-01-10 00:00:01.336000+00:00,b843bc94-64dc-4dca-8978-08877ddaf7eb,https://blog.moneysmart.sg/travel/singapore-pu...,183.90.37.6,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,PageView,2020-01-10
1,2020-01-10 00:00:01.915000+00:00,b843bc94-64dc-4dca-8978-08877ddaf7eb,https://www.moneysmart.sg/embed/7432e8a28bd976...,183.90.37.6,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,PageView,2020-01-10
2,2020-01-10 00:00:02.871000+00:00,0263aa86-16ca-46d3-b7ee-37bb7665f1da,https://blog.moneysmart.sg/transportation/erp-...,183.90.37.137,Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like...,PageView,2020-01-10
3,2020-01-10 00:00:03.139000+00:00,0263aa86-16ca-46d3-b7ee-37bb7665f1da,https://www.moneysmart.sg/embed/52174e7b8d839f...,183.90.37.137,Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like...,PageView,2020-01-10
4,2020-01-10 00:00:03.523000+00:00,c25624f4-6ccf-4add-8363-226926ca9737,https://blog.moneysmart.sg/fixed-deposits/best...,119.56.100.216,Mozilla/5.0 (Linux; Android 9; SM-N960F) Apple...,PageView,2020-01-10


In [86]:
group_by_cols = ["anonymous_id", "user_agent"]
segment_anonymous_id_to_user_agent_full_df = segment_combined_df.groupby(group_by_cols).count()
print("%i anonymous_id to user_agents found" % len(segment_anonymous_id_to_user_agent_full_df))

255878 anonymous_id to user_agents found


In [87]:
segment_anonymous_id_to_user_agent_full_df = segment_anonymous_id_to_user_agent_full_df.reset_index()
# groupby(...).count() keeps the original column names (each holds the per-group row count), so there is no 0 column to rename
segment_anonymous_id_to_user_agent_full_df.head()

Unnamed: 0,anonymous_id,user_agent,timestamp,page_url,context_ip,event_name,date
0,00000b54-600a-4de2-8700-fd9885252dca,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,7,7,7,7,7
1,000034a2-e973-4108-b920-0681877d4fc0,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,1,1,1,1,1
2,00008e20-54bd-495c-82ce-ebc6193bb1c9,Mozilla/5.0 (Linux; Android 9; SM-N960F) Apple...,6,6,6,6,6
3,0000edbd-6d98-466c-8537-7e4f07a93e52,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,4,4,4,4,4
4,0001018d-97ee-48e4-a642-508643c00568,Mozilla/5.0 (iPhone; CPU iPhone OS 12_4_1 like...,7,7,7,7,7


In [88]:
# check for duplicates
sd = segment_anonymous_id_to_user_agent_full_df.groupby(["anonymous_id"]).size() #[["sent_at"]]
sd = sd.reset_index()
duplicates = sd[sd[0]>1]
print("%i / %i anonymous_ids with different user agent strings.  Expect there to be some due to browser upgrades" % (len(duplicates), len(sd)))

1974 / 253820 anonymous_ids with different user agent strings.  Expect there to be some due to browser upgrades


In [89]:
sd.head()

Unnamed: 0,anonymous_id,0
0,00000b54-600a-4de2-8700-fd9885252dca,1
1,000034a2-e973-4108-b920-0681877d4fc0,1
2,00008e20-54bd-495c-82ce-ebc6193bb1c9,1
3,0000edbd-6d98-466c-8537-7e4f07a93e52,1
4,0001018d-97ee-48e4-a642-508643c00568,1


In [90]:
segment_anonymous_id_to_user_agent_df = segment_anonymous_id_to_user_agent_full_df[["anonymous_id", "user_agent"]] # .set_index("anonymous_id")

#make a bit safer by stripping the strings
#segment_anonymous_id_to_user_agent_df["user_agent"] = segment_anonymous_id_to_user_agent_df["user_agent"].str.strip()
#segment_anonymous_id_to_user_agent_df["anonymous_id"] = segment_anonymous_id_to_user_agent_df["anonymous_id"].str.strip()

In [91]:
segment_anonymous_id_to_user_agent_df = segment_anonymous_id_to_user_agent_df.rename(columns={"user_agent": "s_user_agent"})
segment_anonymous_id_to_user_agent_df.head()

Unnamed: 0,anonymous_id,s_user_agent
0,00000b54-600a-4de2-8700-fd9885252dca,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
1,000034a2-e973-4108-b920-0681877d4fc0,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
2,00008e20-54bd-495c-82ce-ebc6193bb1c9,Mozilla/5.0 (Linux; Android 9; SM-N960F) Apple...
3,0000edbd-6d98-466c-8537-7e4f07a93e52,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
4,0001018d-97ee-48e4-a642-508643c00568,Mozilla/5.0 (iPhone; CPU iPhone OS 12_4_1 like...


In [92]:
# Remove duplicates, so anonymous_id column is unique (otherwise on joins you'll expand the dataset)
segment_anonymous_id_to_user_agent_dedup_df = segment_anonymous_id_to_user_agent_df.groupby("anonymous_id").first().reset_index()
print("Before de-duplication %i, after %i"%(len(segment_anonymous_id_to_user_agent_df), len(segment_anonymous_id_to_user_agent_dedup_df)))
segment_anonymous_id_to_user_agent_dedup_df.head()

Before de-duplication 255878, after 253820


Unnamed: 0,anonymous_id,s_user_agent
0,00000b54-600a-4de2-8700-fd9885252dca,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
1,000034a2-e973-4108-b920-0681877d4fc0,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
2,00008e20-54bd-495c-82ce-ebc6193bb1c9,Mozilla/5.0 (Linux; Android 9; SM-N960F) Apple...
3,0000edbd-6d98-466c-8537-7e4f07a93e52,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
4,0001018d-97ee-48e4-a642-508643c00568,Mozilla/5.0 (iPhone; CPU iPhone OS 12_4_1 like...
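
`groupby(...).first()` is one way to get a single row per `anonymous_id`; `drop_duplicates` is an alternative. The sketch below runs both on a hypothetical toy frame. One difference to be aware of: `first()` returns the first non-null value in each column, whereas `drop_duplicates(keep="first")` keeps the first row as it is, nulls included.

```python
import pandas as pd

# Toy stand-in for the (anonymous_id, s_user_agent) pairs
pairs = pd.DataFrame({
    "anonymous_id": ["a", "a", "b", "c", "c"],
    "s_user_agent": ["UA-1", "UA-2", "UA-3", None, "UA-4"],
})

# First non-null user agent per id
via_groupby = pairs.groupby("anonymous_id").first().reset_index()

# First row per id, whatever it contains
via_drop = pairs.drop_duplicates(subset="anonymous_id", keep="first").reset_index(drop=True)

# Either way the id column is now safe to join on
assert via_groupby["anonymous_id"].is_unique
assert via_drop["anonymous_id"].is_unique

print(via_groupby)
print(via_drop)
```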


### Athena / Kinesis

In [93]:
group_by_cols = ["anonymous_id", "user_agent"]
athena_anonymous_id_to_user_agent_full_df = athena_full_events_df.groupby(group_by_cols).size()
print("%i anonymous_id to user_agents found" % len(athena_anonymous_id_to_user_agent_full_df))

290586 anonymous_id to user_agents found


In [94]:
athena_anonymous_id_to_user_agent_full_df = athena_anonymous_id_to_user_agent_full_df.reset_index()
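# the size column is labelled with the integer 0, and rename needs columns= to act on column labels
athena_anonymous_id_to_user_agent_full_df.rename(columns={0:"count"}, inplace=True)
athena_anonymous_id_to_user_agent_full_df.head()

Unnamed: 0,anonymous_id,user_agent,count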
0,00000b54-600a-4de2-8700-fd9885252dca,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,7
1,000034a2-e973-4108-b920-0681877d4fc0,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,1
2,00008e20-54bd-495c-82ce-ebc6193bb1c9,Mozilla/5.0 (Linux; Android 9; SM-N960F) Apple...,6
3,0000db08-89cd-4069-9a38-30a4d1b90bbc,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,5
4,0000edbd-6d98-466c-8537-7e4f07a93e52,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,4


In [95]:
# check for duplicates
ad = athena_anonymous_id_to_user_agent_full_df.groupby(["anonymous_id"]).size() #[["sent_at"]]
ad = ad.reset_index()
duplicates = ad[ad[0]>1]
print("%i / %i anonymous_ids with different user agent strings.  Expect there to be some due to browser upgrades" % (len(duplicates), len(ad)))

1993 / 288520 anonymous_ids with different user agent strings.  Expect there to be some due to browser upgrades


In [96]:
# explore if issue
#df = ad[ad[0]>1].merge(athena_anonymous_id_to_user_agent_full_df, how="inner")
#df.sort_values("anonymous_id")

In [97]:
#df = athena_anonymous_id_to_user_agent_full_df[athena_anonymous_id_to_user_agent_full_df.anonymous_id=="f4a0d91c-b118-40ce-890c-9142bce9f152"]
#pd.set_option('max_colwidth', 200)
#print(df.values[0][1])
#print(df.values[1][1])

In [98]:
#athena_anonymous_id_to_user_agent_full_df.head()
athena_anonymous_id_to_user_agent_df = athena_anonymous_id_to_user_agent_full_df[["anonymous_id", "user_agent"]]


#make a bit safer by stripping the strings #couldn't get this to work without warning easily, so skipping.
#athena_anonymous_id_to_user_agent_df.loc[:,1] = athena_anonymous_id_to_user_agent_df["user_agent"].str.strip()
#athena_anonymous_id_to_user_agent_df.loc[:,0] = athena_anonymous_id_to_user_agent_df["anonymous_id"].str.strip()

#?athena_anonymous_id_to_user_agent_df["user_agent"].str.strip()

In [99]:
athena_anonymous_id_to_user_agent_df = athena_anonymous_id_to_user_agent_df.rename(columns={"user_agent": "a_user_agent"})
athena_anonymous_id_to_user_agent_df.head()


Unnamed: 0,anonymous_id,a_user_agent
0,00000b54-600a-4de2-8700-fd9885252dca,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
1,000034a2-e973-4108-b920-0681877d4fc0,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
2,00008e20-54bd-495c-82ce-ebc6193bb1c9,Mozilla/5.0 (Linux; Android 9; SM-N960F) Apple...
3,0000db08-89cd-4069-9a38-30a4d1b90bbc,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
4,0000edbd-6d98-466c-8537-7e4f07a93e52,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...


In [100]:
# Remove duplicates, so anonymous_id column is unique (otherwise on joins you'll expand the dataset)
athena_anonymous_id_to_user_agent_dedup_df = athena_anonymous_id_to_user_agent_df.groupby("anonymous_id").first().reset_index()
print("Before de-duplication %i, after %i"%(len(athena_anonymous_id_to_user_agent_df), len(athena_anonymous_id_to_user_agent_dedup_df)))
athena_anonymous_id_to_user_agent_dedup_df.head()

Before de-duplication 290586, after 288520


Unnamed: 0,anonymous_id,a_user_agent
0,00000b54-600a-4de2-8700-fd9885252dca,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
1,000034a2-e973-4108-b920-0681877d4fc0,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
2,00008e20-54bd-495c-82ce-ebc6193bb1c9,Mozilla/5.0 (Linux; Android 9; SM-N960F) Apple...
3,0000db08-89cd-4069-9a38-30a4d1b90bbc,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
4,0000edbd-6d98-466c-8537-7e4f07a93e52,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...


### Joined up for all anonymous_ids

In [101]:
athena_anonymous_id_to_user_agent_dedup_df.set_index("anonymous_id", inplace=True)
segment_anonymous_id_to_user_agent_dedup_df.set_index("anonymous_id", inplace=True)




In [102]:
athena_anonymous_id_to_user_agent_dedup_df.head(2)

Unnamed: 0_level_0,a_user_agent
anonymous_id,Unnamed: 1_level_1
00000b54-600a-4de2-8700-fd9885252dca,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
000034a2-e973-4108-b920-0681877d4fc0,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...


In [103]:
segment_anonymous_id_to_user_agent_dedup_df.head(2)

Unnamed: 0_level_0,s_user_agent
anonymous_id,Unnamed: 1_level_1
00000b54-600a-4de2-8700-fd9885252dca,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
000034a2-e973-4108-b920-0681877d4fc0,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...


In [104]:
combined_anonymous_id_to_user_agent_df = athena_anonymous_id_to_user_agent_dedup_df.merge(segment_anonymous_id_to_user_agent_dedup_df, how="outer", left_index=True, right_index=True)


In [105]:
combined_anonymous_id_to_user_agent_df.head(1)

Unnamed: 0_level_0,a_user_agent,s_user_agent
anonymous_id,Unnamed: 1_level_1,Unnamed: 2_level_1
00000b54-600a-4de2-8700-fd9885252dca,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...


### Check if Segment and Kinesis disagree at all

In [106]:
print("%i segment (anonymous_id, user_agent) pairs" % len(segment_anonymous_id_to_user_agent_df))
print("%i athena (anonymous_id, user_agent) pairs" % len(athena_anonymous_id_to_user_agent_df))

255878 segment (anonymous_id, user_agent) pairs
290586 athena (anonymous_id, user_agent) pairs


In [107]:
# combined_anonymous_id_to_user_agent_df[(combined_anonymous_id_to_user_agent_df.a_user_agent.isnull())]

In [108]:
s_not_a = combined_anonymous_id_to_user_agent_df[(combined_anonymous_id_to_user_agent_df.s_user_agent.str.len()>0) \
                                                 & ((combined_anonymous_id_to_user_agent_df.a_user_agent.isnull()) |(combined_anonymous_id_to_user_agent_df.a_user_agent.str.len()==0))]
a_not_s = combined_anonymous_id_to_user_agent_df[(combined_anonymous_id_to_user_agent_df.a_user_agent.str.len()>0) \
                                                 & ((combined_anonymous_id_to_user_agent_df.s_user_agent.isnull()) |(combined_anonymous_id_to_user_agent_df.s_user_agent.str.len()==0))]

In [109]:
s_not_a.head()

Unnamed: 0_level_0,a_user_agent,s_user_agent
anonymous_id,Unnamed: 1_level_1,Unnamed: 2_level_1
0003957d-cec4-4d16-ab6f-25099277c451,,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...
0012294b-4310-4fbf-9099-4caa05ebafdc,,Mozilla/5.0 (Linux; Android 9; SM-A7050 Build/...
001b0b20-b204-4466-b52c-9014823f3c0b,,Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple...
001b242d-92c1-4cd5-ab82-b40cb35a4df9,,Mozilla/5.0 (Linux; Android 10; ONEPLUS A6010)...
001b71ef-9105-411c-ad6a-be3dee33b347,,Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7....


In [110]:
a_not_s.head()

Unnamed: 0_level_0,a_user_agent,s_user_agent
anonymous_id,Unnamed: 1_level_1,Unnamed: 2_level_1
0000db08-89cd-4069-9a38-30a4d1b90bbc,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,
00012bfb-43ed-428c-ab70-b848f2b928ae,Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7....,
0001d543-f501-4377-b634-6b9f36d4aaae,Mozilla/5.0 (Linux; Android 10; SM-G9750) Appl...,
00026de0-4272-45ac-829f-175d12686b6e,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,
0005f228-3b30-4083-aa6d-480a87b13a12,Mozilla/5.0 (Linux; Android 7.0; SAMSUNG SM-G9...,


In [111]:
total_count = len(combined_anonymous_id_to_user_agent_df)
s_not_a_count = len(s_not_a)
a_not_s_count = len(a_not_s)
print("%i / %i are in segment, not athena (%.1f percent)" % (s_not_a_count, total_count, s_not_a_count / total_count *100))
print("%i / %i are in athena, not segment (%.1f percent)" % (a_not_s_count, total_count, a_not_s_count / total_count *100))
print("If you include countries that aren't on Segment i.e. ID, PH, TW, then you'd expect more from athena")

5780 / 294300 are in segment, not athena (2.0 percent)
40480 / 294300 are in athena, not segment (13.8 percent)
If you include countries that aren't on Segment i.e. ID, PH, TW, then you'd expect more from athena
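
These counts can be cross-checked with `merge(..., indicator=True)`, which adds a `_merge` column marking each row as `left_only`, `right_only` or `both`. A minimal sketch on hypothetical toy frames, not the notebook's data:

```python
import pandas as pd

# Toy stand-ins for the two de-duplicated, anonymous_id-indexed frames
athena = pd.DataFrame(
    {"a_user_agent": ["UA-1", "UA-2", "UA-3"]},
    index=pd.Index(["id1", "id2", "id3"], name="anonymous_id"),
)
segment = pd.DataFrame(
    {"s_user_agent": ["UA-1", "UA-4"]},
    index=pd.Index(["id1", "id4"], name="anonymous_id"),
)

combined = athena.merge(segment, how="outer", left_index=True,
                        right_index=True, indicator=True)

# _merge is categorical, so all three labels are present even when a count is 0
counts = combined["_merge"].value_counts()
print("in athena only: %i" % counts["left_only"])
print("in segment only: %i" % counts["right_only"])
print("in both: %i" % counts["both"])
```

Unlike the filters above, this only looks at whether the id is present on each side; it does not treat an empty user agent string as missing.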


### Get an idea of how many don't have matching user_agents

In [112]:
df = combined_anonymous_id_to_user_agent_df.groupby("anonymous_id").size().reset_index()
duplicates = df[df[0]>1]
print("%i duplicate anonymous_ids - should be none at this stage" % len(duplicates))

0 duplicate anonymous_ids - should be none at this stage


In [113]:
non_matching_excl_nulls = combined_anonymous_id_to_user_agent_df[(combined_anonymous_id_to_user_agent_df.s_user_agent != combined_anonymous_id_to_user_agent_df.a_user_agent) \
                                                                 & ~combined_anonymous_id_to_user_agent_df.s_user_agent.isnull() \
                                                                 & ~combined_anonymous_id_to_user_agent_df.a_user_agent.isnull()]
print("%i User agent strings don't match" % len(non_matching_excl_nulls))
print("Look for changes in browser version for instance.  Don't worry about every last one.")
non_matching_excl_nulls.head()

45 User agent strings don't match
Look for changes in browser version for instance.  Don't worry about every last one.


Unnamed: 0_level_0,a_user_agent,s_user_agent
anonymous_id,Unnamed: 1_level_1,Unnamed: 2_level_1
02e12673-bd69-496c-a46f-b12403354a8d,Mozilla/5.0 (Linux; Android 9; LYA-L29) AppleW...,Mozilla/5.0 (Linux; Android 9; LYA-L29) AppleW...
0614d968-0f7e-41b1-82d0-05df536f33c2,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...
067ba25c-9abe-4b0f-bd6d-9f34d71891a9,Mozilla/5.0 (Linux; Android 9; CLT-L29) AppleW...,Mozilla/5.0 (Linux; Android 9; CLT-L29) AppleW...
0a93b5f4-be4f-42f8-9db3-e0a4b292fc90,Mozilla/5.0 (Linux; Android 9; SM-G960F) Apple...,Mozilla/5.0 (Linux; Android 9; SM-G960F) Apple...
0bf84bbb-6db7-4aaf-b06f-8f59081520e7,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0...,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0...
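
Because the table truncates each string, the part that differs is usually hidden. A small sketch using the standard library's `difflib.SequenceMatcher` to print only the fragments that differ. The two strings below are made-up examples, not rows from the data.

```python
from difflib import SequenceMatcher

def differing_fragments(a, b):
    """Return (fragment_in_a, fragment_in_b) for each non-matching region."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [(a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

# Hypothetical pair that differs only in the browser version
ua_athena = "Mozilla/5.0 (Linux; Android 9) Chrome/79.0.3945.116 Mobile"
ua_segment = "Mozilla/5.0 (Linux; Android 9) Chrome/79.0.3945.93 Mobile"

for frag_a, frag_b in differing_fragments(ua_athena, ua_segment):
    print("%r -> %r" % (frag_a, frag_b))
```

In the notebook this could be applied to the `a_user_agent` and `s_user_agent` values of each row in `non_matching_excl_nulls`.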


### Create a Single user agent string per anonymous_id

In [117]:
combined_anonymous_id_to_user_agent_single_col_df = combined_anonymous_id_to_user_agent_df["a_user_agent"]\
        .fillna(combined_anonymous_id_to_user_agent_df["s_user_agent"]).reset_index().set_index("anonymous_id")
combined_anonymous_id_to_user_agent_single_col_df.rename(columns={"a_user_agent":"user_agent"}, inplace=True)
combined_anonymous_id_to_user_agent_single_col_df.head()

Unnamed: 0_level_0,user_agent
anonymous_id,Unnamed: 1_level_1
00000b54-600a-4de2-8700-fd9885252dca,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
000034a2-e973-4108-b920-0681877d4fc0,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
00008e20-54bd-495c-82ce-ebc6193bb1c9,Mozilla/5.0 (Linux; Android 9; SM-N960F) Apple...
0000db08-89cd-4069-9a38-30a4d1b90bbc,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
0000edbd-6d98-466c-8537-7e4f07a93e52,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...


In [118]:
# This bit is for development where I keep appending the user_agent column and it generates user_agent_x etc
user_agent_cols_to_delete = [z for z in merged_df_with_meta.columns if z.startswith("user_agent")]
print(" Removing %s "%str(user_agent_cols_to_delete))
merged_df_with_meta.drop(columns=user_agent_cols_to_delete, inplace=True)

 Removing [] 


### Useful segmentation / convert user agent to browser etc

In [119]:
def convert_user_agent_to_useful_strings(user_agent_string):
    """
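    Parse a user agent string into [device_family, os_family, os_version, browser_family, browser_version, is_bot].
    Loosely matches https://github.com/moneysmartco/metl/blob/e13086fae453911bed5a40cb51ff0869e2f3a0ce/scripts/python/device_tagger.py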
    """
    user_agent = user_agents.parse(user_agent_string)
    
    device_family = ""
    
    if user_agent.is_pc:
        device_family = 'desktop'
    elif user_agent.is_mobile:
        device_family = 'mobile'
    elif user_agent.is_tablet:
        device_family = 'tablet'
    else:
        device_family = 'other'
        
    
    os_family = user_agent.os.family
    os_version = user_agent.os.version_string
    browser_family = user_agent.browser.family 
    browser_version = user_agent.browser.version_string
    
    is_bot = user_agent.is_bot
    
    return [device_family, os_family, os_version, browser_family, browser_version, is_bot]
    



There's an important optimisation going on here (which still isn't that quick).

If you just do `.apply` across all the rows, it's super slow (many minutes, e.g. 278s vs 24s for my better version).  I tried the optimisation at https://ys-l.github.io/posts/2015/08/28/how-not-to-use-pandas-apply/, but that didn't seem to provide any benefit (or I slowed it down in other ways).

So I'm taking the unique user_agents, processing each one once and then doing a join, without creating Series objects along the way.

There's probably more improvement possible (e.g. creating the full data structure to insert into up front, or generating fewer arrays), but it's fast enough for me right now.
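
A related option, named plainly as a swap-in rather than what this notebook does: memoise the parser with `functools.lru_cache`, so repeated strings are parsed only once even inside a plain loop. `slow_parse` is a hypothetical stand-in for `convert_user_agent_to_useful_strings`. It returns a tuple because cached values are shared between callers and should not be mutated.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def slow_parse(user_agent_string):
    """Hypothetical stand-in for an expensive user agent parser."""
    is_mobile = "Mobile" in user_agent_string
    return ("mobile" if is_mobile else "desktop", len(user_agent_string))

# Four rows, but only two distinct strings
user_agents_seen = ["UA Mobile", "UA Desktop", "UA Mobile", "UA Mobile"]
parsed = [slow_parse(ua) for ua in user_agents_seen]

# misses counts the calls that actually ran the function body
info = slow_parse.cache_info()
print("parsed %i strings with %i real calls" % (len(parsed), info.misses))
```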

In [120]:
distinct_user_agents = combined_anonymous_id_to_user_agent_single_col_df.user_agent.unique()

In [121]:
distinct_user_agents[:10]

array(['Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Mobile/15E148 Safari/604.1',
       'Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) GSA/89.2.287201133 Mobile/15E148 Safari/604.1',
       'Mozilla/5.0 (Linux; Android 9; SM-N960F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.116 Mobile Safari/537.36',
       'Mozilla/5.0 (iPhone; CPU iPhone OS 12_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 [FBAN/FBIOS;FBDV/iPhone10,5;FBMD/iPhone;FBSN/iOS;FBSV/12.4.1;FBSS/3;FBID/phone;FBLC/en_Qaau_GB;FBOP/5;FBCR/StarHub]',
       'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3; rctw; .NET4.0E; rv:11.0) like Gecko',
       'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
 

In [122]:
len(distinct_user_agents)

21289

In [123]:
# This isn't fast, but acceptable
start_time = datetime.now()
print("Starting to add user agent data at %s"% start_time.isoformat())
#meta_df = combined_anonymous_id_to_user_agent_single_col_df.apply(lambda x: pd.Series(convert_user_agent_to_useful_strings(x.user_agent)), axis=1)
meta_rows = [[z, ]+convert_user_agent_to_useful_strings(z)  for z in distinct_user_agents]
#d = dfcombined_anonymous_id_to_user_agent_single_col_df.merge(meta_df)
end_time = datetime.now()
seconds_taken = (end_time - start_time).total_seconds()
print("Took %i seconds to process" % seconds_taken)

Starting to add user agent data at 2020-01-24T02:59:58.919123
Took 32 seconds to process


In [124]:
user_agent_meta_df = pd.DataFrame(meta_rows)

user_agent_meta_df.rename(columns = {0:"user_agent", 1:"device_family", 2:"os_family", 3:"os_version", 4:"browser_family",5:"browser_version", 6:"is_bot"}, inplace=True)
user_agent_meta_df.set_index("user_agent", inplace=True)


In [125]:
# Try to make the data types a bit efficient
user_agent_meta_df = user_agent_meta_df.astype({ "device_family":"category", "os_family":"category", "os_version":"category", "browser_family":"category","browser_version":"category","is_bot":"bool"})
user_agent_meta_df.dtypes

device_family      category
os_family          category
os_version         category
browser_family     category
browser_version    category
is_bot                 bool
dtype: object
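
To see what the `category` conversion buys, compare `memory_usage(deep=True)` before and after. A sketch on a hypothetical toy column with many repeated values; the saving on the real data depends on how many distinct values each column holds.

```python
import pandas as pd

# Toy column: 30,000 rows but only three distinct strings
toy = pd.DataFrame({"browser_family": ["Chrome Mobile", "Mobile Safari", "IE"] * 10000})

as_object = toy["browser_family"].astype("object")
as_category = toy["browser_family"].astype("category")

# deep=True includes the memory held by the Python string objects
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print("object: %i bytes, category: %i bytes" % (object_bytes, category_bytes))
```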

In [126]:
user_agent_meta_df.head()

Unnamed: 0_level_0,device_family,os_family,os_version,browser_family,browser_version,is_bot
user_agent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Mobile/15E148 Safari/604.1",mobile,iOS,13.3,Mobile Safari,13.0.4,False
"Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) GSA/89.2.287201133 Mobile/15E148 Safari/604.1",mobile,iOS,13.3,Mobile Safari,13.3,False
"Mozilla/5.0 (Linux; Android 9; SM-N960F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.116 Mobile Safari/537.36",mobile,Android,,Chrome Mobile,79.0.3945,False
"Mozilla/5.0 (iPhone; CPU iPhone OS 12_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 [FBAN/FBIOS;FBDV/iPhone10,5;FBMD/iPhone;FBSN/iOS;FBSV/12.4.1;FBSS/3;FBID/phone;FBLC/en_Qaau_GB;FBOP/5;FBCR/StarHub]",mobile,iOS,12.4.1,Mobile Safari UI/WKWebView,12.4.1,False
Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3; rctw; .NET4.0E; rv:11.0) like Gecko,desktop,Windows,7,IE,11.0,False


In [127]:
if False:# This is super slow currently.
    start_time = datetime.now()
    print("Starting to add user agent data at %s"% start_time.isoformat())
    #meta_df = combined_anonymous_id_to_user_agent_single_col_df.apply(lambda x: pd.Series(convert_user_agent_to_useful_strings(x.user_agent)), axis=1)
    meta_df = combined_anonymous_id_to_user_agent_single_col_df.apply(lambda x: convert_user_agent_to_useful_strings(x.user_agent), axis=1, result_type="expand")
    #d = dfcombined_anonymous_id_to_user_agent_single_col_df.merge(meta_df)
    end_time = datetime.now()
    seconds_taken = (end_time - start_time).total_seconds()
    print("Took %i seconds to process" % seconds_taken)

In [128]:
if False:
    # Trying something faster, based on https://ys-l.github.io/posts/2015/08/28/how-not-to-use-pandas-apply/
    # It hasn't worked so far: even after tidying it takes 257s, slower than the original.
    start_time = datetime.now()
    print("Starting to add user agent data at %s"% start_time.isoformat())
    new_cols = [[] for _ in range(6)] # six independent lists ([[]]*6 would repeat one shared list six times)
    num_new_cols = len(new_cols)
    #for row_num, (_, row) in enumerate(combined_anonymous_id_to_user_agent_single_col_df.iterrows()):
    for _, row in combined_anonymous_id_to_user_agent_single_col_df.iterrows():
        #if row_num % 100000==0:
        #    print("row %i"%row_num)
        vals = convert_user_agent_to_useful_strings(row.user_agent)
        #for i in range(len(vals)):
            #new_cols[i].append(vals[i])
        new_cols[0].append(vals[0])
        new_cols[1].append(vals[1])
        new_cols[2].append(vals[2])
        new_cols[3].append(vals[3])
        new_cols[4].append(vals[4])
        new_cols[5].append(vals[5])
        

    print("New cols generated at %s"% datetime.now().isoformat())
    # meta_df = combined_anonymous_id_to_user_agent_single_col_df.apply(lambda x: convert_user_agent_to_useful_strings(x.user_agent), axis=1, result_type="expand")
    #d = dfcombined_anonymous_id_to_user_agent_single_col_df.merge(meta_df)
    meta_df = pd.DataFrame({
        "device_family": new_cols[0],
        "os_family": new_cols[1],
        "os_version": new_cols[2],
        "browser_family": new_cols[3],
        "browser_version": new_cols[4],
        "is_bot": new_cols[5],
    })
    print("Additional data frame generated at %s"% datetime.now().isoformat())
    end_time = datetime.now()
    seconds_taken = (end_time - start_time).total_seconds()
    print("Took %i seconds to process" % seconds_taken)
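
One thing to watch when building one list per output column: multiplying a list of lists (for example `[[]]*6`) repeats a reference to a single inner list rather than creating separate ones, so every append shows up in every "column". The following is a minimal standalone sketch of the difference. The rows are made-up values standing in for parsed user agents, not real output from `convert_user_agent_to_useful_strings`.

```python
# Hypothetical rows standing in for the tuples returned per user agent.
rows = [("mobile", "iOS"), ("desktop", "Windows")]

# Aliased: both entries refer to the same list object.
aliased = [[]] * 2
for device, os_family in rows:
    aliased[0].append(device)
    aliased[1].append(os_family)

# Independent: the comprehension builds a fresh list for each column.
independent = [[] for _ in range(2)]
for device, os_family in rows:
    independent[0].append(device)
    independent[1].append(os_family)

print(aliased[0])      # every value from both columns, in one shared list
print(independent[0])  # only the device values
```

With the aliased version, a data frame built from the lists would show identical columns that are longer than the input, which is easy to miss when the cell is rarely run.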

### Join onto the main dataframe 

In [129]:
merged_df_with_meta = merged_df_with_meta.merge(combined_anonymous_id_to_user_agent_single_col_df, on="anonymous_id", how="left")

In [130]:
merged_df_with_meta.head(2)

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code,user_agent
0,00000b54-600a-4de2-8700-fd9885252dca,PageView,https://blog.moneysmart.sg/career/5-easy-side-...,2020-01-12,2,2,blog,/career/5-easy-side-businesses-you-can-run-whi...,/career,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
1,00000b54-600a-4de2-8700-fd9885252dca,PageView,https://www.moneysmart.sg/embed/98e61305602380...,2020-01-12,1,1,shop,/embed/98e61305602380971d9c5e68c4a75647,/embed,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...


In [131]:
# add on the user agent breakdown

merged_df_with_meta = merged_df_with_meta.merge(user_agent_meta_df, on="user_agent", how="left")

In [132]:
merged_df_with_meta.head()

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code,user_agent,device_family,os_family,os_version,browser_family,browser_version,is_bot
0,00000b54-600a-4de2-8700-fd9885252dca,PageView,https://blog.moneysmart.sg/career/5-easy-side-...,2020-01-12,2,2,blog,/career/5-easy-side-businesses-you-can-run-whi...,/career,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Mobile Safari,13.0.4,False
1,00000b54-600a-4de2-8700-fd9885252dca,PageView,https://www.moneysmart.sg/embed/98e61305602380...,2020-01-12,1,1,shop,/embed/98e61305602380971d9c5e68c4a75647,/embed,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Mobile Safari,13.0.4,False
2,00000b54-600a-4de2-8700-fd9885252dca,Reading,https://blog.moneysmart.sg/career/5-easy-side-...,2020-01-12,3,3,blog,/career/5-easy-side-businesses-you-can-run-whi...,/career,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Mobile Safari,13.0.4,False
3,00000b54-600a-4de2-8700-fd9885252dca,UserView.WidgetLoad,https://www.moneysmart.sg/embed/98e61305602380...,2020-01-12,1,1,shop,/embed/98e61305602380971d9c5e68c4a75647,/embed,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Mobile Safari,13.0.4,False
4,000034a2-e973-4108-b920-0681877d4fc0,PageView,https://blog.moneysmart.sg/budgeting/cheapest-...,2020-01-10,1,1,blog,/budgeting/cheapest-sim-only-plans,/budgeting,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Mobile Safari,13.3,False


In [133]:
#Check it's set them all
merged_df_with_meta[merged_df_with_meta.user_agent.isnull()].head()

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code,user_agent,device_family,os_family,os_version,browser_family,browser_version,is_bot


In [134]:
#Check it's set them all
merged_df_with_meta[merged_df_with_meta.device_family.isnull()].head()

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code,user_agent,device_family,os_family,os_version,browser_family,browser_version,is_bot


### Clean up data frames / save some memory

In [135]:
# TODO: could do a lot more here
segment_anonymous_id_to_user_agent_full_df = None
segment_anonymous_id_to_user_agent_df = None
athena_anonymous_id_to_user_agent_full_df = None
athena_anonymous_id_to_user_agent_df = None
sd = None
ad = None

# Top Level Checks

In [136]:
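def group_by_and_show_count_difference(df, group_by_cols, with_styling=True):
    """
    Group df by group_by_cols, sum the counts, and add a k_vs_s_% column holding
    the percentage difference of k_count (kinesis) relative to s_count (segment).

    This expects counts in s_count and k_count.
    Where s_count is 0 the percentage is undefined, so 999 is used as a marker
    if there are any kinesis events and 0 if there are none.
    with_styling is currently unused; use colour_grouped_table for styling.
    """
    grouped = df.groupby(group_by_cols).sum().reset_index()

    def percentage_difference(row):
        if row.s_count == 0:
            return 999 if row.k_count else 0
        return round(((row.k_count - row.s_count)/row.s_count)*100, 1)

    grouped["k_vs_s_%"] = grouped.apply(percentage_difference, axis=1)
    grouped = grouped[(grouped.k_count>0) | (grouped.s_count>0)] # drops groups with no events on either side (zero or NaN counts)

    return grouped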

In [137]:
def colour_grouped_table(df):
    def color_how_good(value):
        if isinstance(value, str):
            return "" # no styling for non-numeric cells
        av = abs(value)
        if av > 20:
            # large differences get a highlighted background instead of a text colour
            return "background-color:rgb(250,200,200)"
        if av < 2:
            c = "green"
        elif value < 0:
            c = "red"
        else:
            c = "blue"
        return "color:%s" % c # it's just CSS, so you can do background as well.
    return df.style.applymap(color_how_good, subset=["k_vs_s_%"])


In [138]:
merged_df_with_meta.head()

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code,user_agent,device_family,os_family,os_version,browser_family,browser_version,is_bot
0,00000b54-600a-4de2-8700-fd9885252dca,PageView,https://blog.moneysmart.sg/career/5-easy-side-...,2020-01-12,2,2,blog,/career/5-easy-side-businesses-you-can-run-whi...,/career,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Mobile Safari,13.0.4,False
1,00000b54-600a-4de2-8700-fd9885252dca,PageView,https://www.moneysmart.sg/embed/98e61305602380...,2020-01-12,1,1,shop,/embed/98e61305602380971d9c5e68c4a75647,/embed,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Mobile Safari,13.0.4,False
2,00000b54-600a-4de2-8700-fd9885252dca,Reading,https://blog.moneysmart.sg/career/5-easy-side-...,2020-01-12,3,3,blog,/career/5-easy-side-businesses-you-can-run-whi...,/career,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Mobile Safari,13.0.4,False
3,00000b54-600a-4de2-8700-fd9885252dca,UserView.WidgetLoad,https://www.moneysmart.sg/embed/98e61305602380...,2020-01-12,1,1,shop,/embed/98e61305602380971d9c5e68c4a75647,/embed,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Mobile Safari,13.0.4,False
4,000034a2-e973-4108-b920-0681877d4fc0,PageView,https://blog.moneysmart.sg/budgeting/cheapest-...,2020-01-10,1,1,blog,/budgeting/cheapest-sim-only-plans,/budgeting,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Mobile Safari,13.3,False


## By Country


In [139]:
g = group_by_and_show_count_difference(merged_df_with_meta, ["country_code", "date"])

In [140]:
colour_grouped_table(g)

Unnamed: 0,country_code,date,s_count,k_count,is_bot,k_vs_s_%
0,??,2020-01-10,507,26,0,-94.9
1,??,2020-01-11,345,14,0,-95.9
2,??,2020-01-12,362,29,0,-92.0
3,hk,2020-01-10,97635,96753,130,-0.9
4,hk,2020-01-11,86904,85923,127,-1.1
5,hk,2020-01-12,88044,87417,103,-0.7
6,id,2020-01-10,0,2086,0,999.0
7,id,2020-01-11,0,908,0,999.0
8,id,2020-01-12,0,668,0,999.0
9,ph,2020-01-10,0,3979,2,999.0
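
The `999.0` values in the table are markers rather than real percentages: they flag groups where segment recorded nothing, so a relative difference cannot be computed. As a quick standalone check of how the `k_vs_s_%` column is derived, here is the same rule applied to a few rows from the table. The function name `k_vs_s_percent` is only for illustration and is not part of the notebook.

```python
def k_vs_s_percent(s_count, k_count):
    """Percentage difference of k_count relative to s_count.

    999 marks "kinesis has events but segment has none"; 0 is used when both are empty.
    """
    if s_count == 0:
        return 999 if k_count else 0
    return round((k_count - s_count) / s_count * 100, 1)

print(k_vs_s_percent(97635, 96753))  # hk on 2020-01-10
print(k_vs_s_percent(507, 26))       # ?? on 2020-01-10
print(k_vs_s_percent(0, 2086))       # id on 2020-01-10, no segment events
```

Because 999 is a sentinel, it should be excluded before averaging or plotting the column.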


## By Event Type

In [141]:
g = group_by_and_show_count_difference(merged_df_with_meta, ["country_code", "event_name"])
g.sort_values(["country_code", "event_name"], inplace=True)
colour_grouped_table(g)

Unnamed: 0,country_code,event_name,s_count,k_count,is_bot,k_vs_s_%
4,??,LeadGeneration.ClickedCTA,1,1,0,0.0
14,??,PageView,1213,32,0,-97.4
15,??,Reading,0,36,0,999.0
35,hk,Display user feedback form,1,0,0,-100.0
37,hk,LeadGeneration.ClickConversion,265,286,0,7.9
38,hk,LeadGeneration.ClickedApplyButton,976,1029,0,5.4
39,hk,LeadGeneration.ClickedCTA,1123,1199,0,6.8
41,hk,LeadGeneration.Conversion,2741,2678,0,-2.3
43,hk,LeadGeneration.FormSubmitted,643,646,0,0.5
45,hk,LeadGeneration.RedirectCompleted,2471,2534,0,2.5


In [142]:
g = group_by_and_show_count_difference(merged_df_with_meta[merged_df_with_meta.event_name=="LeadGeneration.ClickedApplyButton"], ["country_code", "slug_root", "event_name"])

In [143]:
g

Unnamed: 0,country_code,slug_root,event_name,s_count,k_count,is_bot,k_vs_s_%
3,hk,/credit-cards,LeadGeneration.ClickedApplyButton,976.0,1029.0,False,5.4
12,sg,/credit-cards,LeadGeneration.ClickedApplyButton,1616.0,1726.0,False,6.8
13,sg,/debt-consolidation-plan,LeadGeneration.ClickedApplyButton,47.0,53.0,False,12.8
14,sg,/personal-loan,LeadGeneration.ClickedApplyButton,404.0,429.0,False,6.2


## By Top Level Slug

In [144]:
g = group_by_and_show_count_difference(merged_df_with_meta[merged_df_with_meta.page_type!="blog"], ["country_code", "slug_root", "event_name"])
#g.sort_values(["country_code", "slug_root", "event_name"])

#filtering where s_count or k_count is >1000
pv = pd.pivot_table(g[(g.s_count>1000) | (g.k_count>1000)], index=["country_code", "slug_root"], values=["k_count","s_count","k_vs_s_%"], columns=["event_name"], fill_value="")

#colour_grouped_table(pv)
#TODO: not showing up the s_count and k_count :(
pv = pv.swaplevel(0, 1, axis=1).sort_index(axis=1)

In [145]:
pv.to_excel("kinesis_vs_segment.xlsx")

In [146]:

def highlight_cols(cell):
    #use hex colours, or named ones to ensure excel compatibility on export
    if cell=="":
        return ""
    ci = min(100, int(abs(cell*10)))
    if abs(cell)<=2:
        return "color:green;"
    if cell <0:
        return "background-color:#%02x%02x%02x;" % (255,255-ci,255-ci)
    if cell>0:
        return "background-color:#%02x%02x%02x;" % (255-ci,255-ci,255)
    return "background-color:red;"



#pv[:20].style.background_gradient(cmap=cm, subset=pd.IndexSlice[:, 's-count'])
styled = pv.style.applymap(highlight_cols, subset=pv.columns.get_loc_level('k_vs_s_%', level=1)[0]) #special multi-index on column for colour
#can't get the styling to be happy with colours on export
#styled.to_excel("kinesis_vs_segment.xlsx", engine='openpyxl') #use special engine for formatting https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html

styled

Unnamed: 0_level_0,event_name,LeadGeneration.ClickedApplyButton,LeadGeneration.ClickedApplyButton,LeadGeneration.ClickedApplyButton,LeadGeneration.ClickedCTA,LeadGeneration.ClickedCTA,LeadGeneration.ClickedCTA,LeadGeneration.Conversion,LeadGeneration.Conversion,LeadGeneration.Conversion,LeadGeneration.FormSubmitted,LeadGeneration.FormSubmitted,LeadGeneration.FormSubmitted,LeadGeneration.RedirectCompleted,LeadGeneration.RedirectCompleted,LeadGeneration.RedirectCompleted,LeadGeneration.ShowedMoreDetails,LeadGeneration.ShowedMoreDetails,LeadGeneration.ShowedMoreDetails,LeadGeneration.ViewedMoreDetails,LeadGeneration.ViewedMoreDetails,LeadGeneration.ViewedMoreDetails,PageView,PageView,PageView,UserEngagement.ClickedFilter,UserEngagement.ClickedFilter,UserEngagement.ClickedFilter,UserEngagement.FilterSelection,UserEngagement.FilterSelection,UserEngagement.FilterSelection,UserEngagement.QuestionAnswered,UserEngagement.QuestionAnswered,UserEngagement.QuestionAnswered,UserView.WidgetLoad,UserView.WidgetLoad,UserView.WidgetLoad
Unnamed: 0_level_1,Unnamed: 1_level_1,k_count,k_vs_s_%,s_count,k_count,k_vs_s_%,s_count,k_count,k_vs_s_%,s_count,k_count,k_vs_s_%,s_count,k_count,k_vs_s_%,s_count,k_count,k_vs_s_%,s_count,k_count,k_vs_s_%,s_count,k_count,k_vs_s_%,s_count,k_count,k_vs_s_%,s_count,k_count,k_vs_s_%,s_count,k_count,k_vs_s_%,s_count,k_count,k_vs_s_%,s_count
country_code,slug_root,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2,Unnamed: 25_level_2,Unnamed: 26_level_2,Unnamed: 27_level_2,Unnamed: 28_level_2,Unnamed: 29_level_2,Unnamed: 30_level_2,Unnamed: 31_level_2,Unnamed: 32_level_2,Unnamed: 33_level_2,Unnamed: 34_level_2,Unnamed: 35_level_2,Unnamed: 36_level_2,Unnamed: 37_level_2
??,/www.moneysmart.sg,,,,,,,,,,,,,,,,,,,,,,0,-100.0,1149,,,,,,,,,,,,
hk,/,,,,,,,,,,,,,,,,,,,,,,899,-10.2,1001,,,,,,,,,,,,
hk,/credit-cards,1029.0,5.4,976.0,,,,1169.0,-3.1,1206.0,,,,1072.0,2.6,1045.0,7944.0,0.6,7896.0,2309.0,2.5,2252.0,24461,-1.6,24850,,,,2382.0,3.1,2310.0,,,,,,
hk,/embed,,,,,,,,,,,,,,,,,,,,,,14232,-1.2,14409,,,,,,,,,,14201.0,-1.2,14375.0
hk,/lending-companies-loan,,,,,,,,,,,,,,,,,,,,,,1659,-4.6,1739,,,,,,,,,,,,
hk,/mortgage,,,,,,,,,,,,,,,,,,,,,,2404,1.2,2375,,,,,,,,,,,,
hk,/personal-loan,,,,,,,,,,,,,,,,,,,,,,7567,-3.6,7846,878.0,-31.1,1275.0,,,,,,,,,
hk,/tax-loan,,,,,,,,,,,,,,,,,,,,,,2517,-0.2,2523,,,,,,,,,,,,
hk,/travel-insurance,,,,,,,,,,,,,,,,,,,,,,2152,-5.6,2279,,,,,,,,,,,,
sg,/,,,,,,,,,,,,,,,,,,,,,,3355,0.6,3334,,,,,,,,,,,,


In [147]:
pv.to_html("kinesis_vs_segment.html")

## By Type of Page

In [148]:
g = group_by_and_show_count_difference(merged_df_with_meta, ["country_code", "page_type"])
colour_grouped_table(g)

Unnamed: 0,country_code,page_type,s_count,k_count,is_bot,k_vs_s_%
3,??,shop,1214,69,0,-94.3
5,hk,blog,172845,172065,245,-0.5
6,hk,iss,9730,9605,0,-1.3
7,hk,lps,3040,3087,16,1.5
8,hk,shop,86968,85336,99,-1.9
10,id,blog,0,3662,0,999.0
15,ph,blog,0,10445,2,999.0
20,sg,blog,712280,718788,438,0.9
21,sg,iss,17049,15529,14,-8.9
22,sg,lps,1095,1124,4,2.6


### Type of Page, Just Pageviews

In [149]:
g = group_by_and_show_count_difference(merged_df_with_meta[(merged_df_with_meta.event_name=="PageView") & (merged_df_with_meta.country_code.isin(["sg", "hk"]))], ["country_code", "page_type"])
colour_grouped_table(g)

Unnamed: 0,country_code,page_type,s_count,k_count,is_bot,k_vs_s_%
5,hk,blog,55645,55422,211,-0.4
6,hk,iss,4518,4393,0,-2.8
7,hk,lps,2591,2593,16,0.1
8,hk,shop,51196,50204,95,-1.9
20,sg,blog,253379,257587,399,1.7
21,sg,iss,7013,6287,10,-10.4
22,sg,lps,1004,1024,4,2.0
23,sg,shop,208672,202037,699,-3.2
24,sg,unbounce,3377,3411,2,1.0


### By Device Type and Country for pageviews in HK and SG

In [150]:
merged_df_with_meta.head()


Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code,user_agent,device_family,os_family,os_version,browser_family,browser_version,is_bot
0,00000b54-600a-4de2-8700-fd9885252dca,PageView,https://blog.moneysmart.sg/career/5-easy-side-...,2020-01-12,2,2,blog,/career/5-easy-side-businesses-you-can-run-whi...,/career,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Mobile Safari,13.0.4,False
1,00000b54-600a-4de2-8700-fd9885252dca,PageView,https://www.moneysmart.sg/embed/98e61305602380...,2020-01-12,1,1,shop,/embed/98e61305602380971d9c5e68c4a75647,/embed,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Mobile Safari,13.0.4,False
2,00000b54-600a-4de2-8700-fd9885252dca,Reading,https://blog.moneysmart.sg/career/5-easy-side-...,2020-01-12,3,3,blog,/career/5-easy-side-businesses-you-can-run-whi...,/career,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Mobile Safari,13.0.4,False
3,00000b54-600a-4de2-8700-fd9885252dca,UserView.WidgetLoad,https://www.moneysmart.sg/embed/98e61305602380...,2020-01-12,1,1,shop,/embed/98e61305602380971d9c5e68c4a75647,/embed,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Mobile Safari,13.0.4,False
4,000034a2-e973-4108-b920-0681877d4fc0,PageView,https://blog.moneysmart.sg/budgeting/cheapest-...,2020-01-10,1,1,blog,/budgeting/cheapest-sim-only-plans,/budgeting,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Mobile Safari,13.3,False


# Other Issues to Check For
* duplicates
* skipping "https" from the url (observed as a current issue)
* certain browsers having issues
* users creating a lot of duplicate events (and doing the above analysis using sum vs count)
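
As a rough illustration of the missing "https" issue, the sketch below splits a URL into host and path while tolerating a missing scheme. It is an assumption about how such URLs could be handled, not the slug extraction this notebook actually uses; the helper name `split_host_and_path` and the choice to assume https are hypothetical. A `slug_root` such as `/www.moneysmart.sg` in the tables above may be a symptom of the host being treated as part of the path.

```python
from urllib.parse import urlparse

def split_host_and_path(page_url):
    """Return (host, path), tolerating URLs that arrive without a scheme."""
    if "://" not in page_url:
        # Assumption: a scheme-less URL is really an https page.
        page_url = "https://" + page_url
    parsed = urlparse(page_url)
    return parsed.netloc, parsed.path or "/"

print(split_host_and_path("https://www.moneysmart.sg/credit-cards"))
print(split_host_and_path("www.moneysmart.sg/credit-cards"))
```

Both calls give the same host and path, so events from the two pipelines would land in the same group even if one of them drops the scheme.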

# Play Area

In [None]:
d = merged_df_with_meta[merged_df_with_meta.page_type=="iss"].groupby(["slug", "page_type"]).sum()
d[d.s_count>0]
merged_df_with_meta[(merged_df_with_meta.page_type=="iss") & (merged_df_with_meta.page_url.str.contains("iss.", regex=False))] # regex=False so the "." is matched literally

In [None]:
merged_df_with_meta[(merged_df_with_meta.slug_root=="/zh-hk") & (merged_df_with_meta.country_code=="hk") & (merged_df_with_meta.page_type!="blog")].head(40) #.groupby(["slug"]).sum()

# Search GUI

Note that in JupyterLab the search results are currently written to a separate tab / the terminal rather than below the search widgets.

In [151]:
country_codes = list(merged_df_with_meta.country_code.unique())

In [152]:
country_codes

['sg', 'hk', '??', 'tw', 'id', 'ph']

In [153]:
top_level_slugs = list(merged_df_with_meta[merged_df_with_meta.page_type!="blog"].slug_root.unique())

In [154]:
top_level_slugs.sort()

In [155]:
event_types = list(merged_df_with_meta.event_name.unique())
event_types.sort()


In [156]:
merged_df_with_meta.columns.to_list()

['anonymous_id',
 'event_name',
 'page_url',
 'date',
 's_count',
 'k_count',
 'page_type',
 'slug',
 'slug_root',
 'ab_test',
 'country_code',
 'user_agent',
 'device_family',
 'os_family',
 'os_version',
 'browser_family',
 'browser_version',
 'is_bot']

In [180]:
def on_search_button_click(b):
    #pandas likes lists, not tuples (at least for group by)
    print("searching")
    df = merged_df_with_meta

    # Each filter only applies when its widget has a value
    anonymous_id = anonymous_user_input.value.strip()
    search_anonymous_id = bool(anonymous_id)

    event_types = list(event_type_select.value)
    search_event_types = len(event_types)>0

    slug_search_string = slug_search_input.value.strip()
    search_slug_by_string = bool(slug_search_string)

    country_codes = list(country_code_dropdown.value)
    search_country_codes = len(country_codes)>0

    top_level_slugs = list(top_level_slug_select.value)
    search_by_top_level_slugs = len(top_level_slugs)>0

    group_by_cols = list(group_by_select.value)
    do_group_by = len(group_by_cols)>0

    print("Events search")
    # (not search_x) | condition keeps every row when that filter is unused
    d = df[((not search_anonymous_id) | (df.anonymous_id==anonymous_id))
           & ((not search_event_types) | (df.event_name.isin(event_types)))
           & ((not search_slug_by_string) | (df.slug.str.contains(slug_search_string)))
           & ((not search_country_codes) | (df.country_code.isin(country_codes)))
           & ((not search_by_top_level_slugs) | (df.slug_root.isin(top_level_slugs)))
           ]

    if do_group_by:
        #d = d.groupby(group_by_cols).sum()
        d = group_by_and_show_count_difference(d, group_by_cols)
        #colour_grouped_table(d)
    display(d)

    print("done searching")
    
def on_reset_button_click(b):
    print("I would be resetting")

In [181]:
def button_click_placeholder(b):
    print("just chilling")

anonymous_user_input = widgets.Text(description = "Anonymous_id")
country_code_dropdown = widgets.SelectMultiple(
    options= country_codes,
    value=["sg","hk"],
    # rows=10,
    description='Country',
    disabled=False
)
search_button = widgets.Button(description='Search')
reset_button = widgets.Button(description='Reset')

# Click handlers are registered with the on_click() method rather than a constructor argument
search_button.on_click(on_search_button_click)
reset_button.on_click(on_reset_button_click)

top_level_slug_select = widgets.SelectMultiple(options = top_level_slugs, description="slug")

slug_search_input = widgets.Text(description = "Slug like")

event_type_select = widgets.SelectMultiple(options=event_types, description="Event")



#search_modes = ["Summary", "Summary Deduped", "Events",] # summary - > grouped with difference, events -> grouped by anon_id etc, 
#search_mode_dropdown = widgets.Dropdown(description="Search Mode", options=search_modes, value=search_modes[0])

search_options = widgets.Box([anonymous_user_input, country_code_dropdown , top_level_slug_select, slug_search_input,  event_type_select])

search_options.layout=widgets.Layout(width='100%',display='inline-flex',flex_flow='row wrap') #auto wrap

group_by_select = widgets.SelectMultiple(description="Group By", options = merged_df_with_meta.columns.to_list())

search_bar = widgets.VBox([search_options, group_by_select, widgets.Box([search_button, reset_button])])
display(search_bar)



VBox(children=(Box(children=(Text(value='', description='Anonymous_id'), SelectMultiple(description='Country',…

In [186]:
#!pip install fastparquet

Collecting fastparquet
  Downloading https://files.pythonhosted.org/packages/58/49/dccb790fa17ab3fbf84a6b848050083c7a1899e9586000e34e3e4fbf5538/fastparquet-0.3.2.tar.gz (151kB)
    100% |████████████████████████████████| 153kB 8.3MB/s ta 0:00:01
Collecting thrift>=0.11.0 (from fastparquet)
  Downloading https://files.pythonhosted.org/packages/97/1e/3284d19d7be99305eda145b8aa46b0c33244e4a496ec66440dac19f8274d/thrift-0.13.0.tar.gz (59kB)
    100% |████████████████████████████████| 61kB 14.9MB/s ta 0:00:01
Building wheels for collected packages: fastparquet, thrift
  Running setup.py bdist_wheel for fastparquet ... done
  Stored in directory: /home/ec2-user/.cache/pip/wheels/b9/36/13/01416a760ddcab0eb8281ec9c9ffcbed945c9b831647c8b904
  Running setup.py bdist_wheel for thrift ... done
  Stored in directory: /home/ec2-user/.cache/pip/wheels/02/a2/46/689ccfcf40155c23edc7cdbd9de488611c8fdf49ff34b1706e
Successfully built fastparquet thrift
Installing c

In [185]:
merged_df_with_meta.to_parquet("merged_df_with_meta_01-10_01-12.gzip",  compression='gzip')

ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
pyarrow or fastparquet is required for parquet support
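
The ImportError above means that neither parquet engine could be imported in the running kernel when `to_parquet` was called. A small pre-check along the following lines can make that failure clearer before a long run reaches the save step. This is a sketch: `pick_parquet_engine` is a hypothetical helper, not something defined elsewhere in the notebook.

```python
import importlib.util

def pick_parquet_engine(candidates=("pyarrow", "fastparquet")):
    """Return the name of the first importable engine, or None if none is installed."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None

engine = pick_parquet_engine()
if engine is None:
    print("No parquet engine found; install pyarrow or fastparquet into this kernel's environment")
else:
    print("Would call to_parquet(..., engine=%r)" % engine)
```

The returned name can be passed as the `engine` argument of `to_parquet`. After installing a package from inside the notebook, the kernel may need restarting before the import succeeds.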