# Introduction

This notebook compares event counts between Segment and Kinesis during the migration between the two pipelines.

As the counts inevitably don't match, it then provides detailed segmentation and search / exploratory tools towards the bottom.

For a large number of days, you'll want a lot of RAM (16GB or 32GB).  For single day experimenting, you will be fine with 8GB.

Running the whole script takes quite a long time initially, in particular due to the segment query (minutes to tens of minutes).  Once this has been done, further exploration is generally very quick (less than a second to a few seconds).

It's not overly optimised, but some steps have been taken to reduce memory usage.

# What this notebook (in particular) does

This notebook is about extracting and joining the raw data.  The result can then be saved to file (parquet) and loaded into other notebooks for analysis / visualisation.

# Requirements / Jupyter Extensions

Install these through the JupyterLab extension manager (if using JupyterLab):
* jupyter-widgets
* plotly (and ideally chart studio too)

In [1]:
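# Safe imports
import pandas as pd  # used throughout as pd; imported explicitly rather than relying on the star import further down
from datetime import datetime, timedelta, date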

# Settings

In [2]:
num_days_to_query = 7
#from_datetime = datetime.now() - timedelta(days = 5)
#from_datetime = datetime(year=2020, month=1, day=4)
#to_datetime = from_datetime+ timedelta(days=num_days_to_query)
to_datetime = datetime(year=2020, month=2, day=16)
from_datetime = to_datetime - timedelta(days=num_days_to_query)
include_device_segmentation = True #E.g. iphone users.  This will use more memory (and likely slow things a bit).
save_end_dataframe_to_file = True #Saves a parquet for easy loading after crashes, or in other tools

# Imports

In [1]:
# Run imports that might require installation to the environment, and install if necessary.
import sys  # needed for sys.executable, so that pip installs into the kernel's own interpreter

try:
    import psycopg2
except ImportError:
    print("Failed to import psycopg2, trying to install it")
    !{sys.executable} -m pip install psycopg2-binary
    import psycopg2
    print("Successfully installed")

try:
    import dateparser
except ImportError:
    print("Failed to import dateparser, trying to install it")
    !{sys.executable} -m pip install dateparser
    import dateparser
    print("Successfully installed")

try:
    import pyathena # used in other imports, so really just checking it's available
except ImportError:
    print("Failed to import pyathena, trying to install it")
    !{sys.executable} -m pip install pyathena
    import pyathena
    print("Successfully installed")

try:
    import user_agents
except ImportError:
    print("Failed to import user_agents, trying to install it")
    !{sys.executable} -m pip install user_agents
    import user_agents
    print("Successfully installed")

import ipywidgets as widgets
    


Failed to import dateparser, trying to install it
Collecting dateparser
  Downloading https://files.pythonhosted.org/packages/82/9d/51126ac615bbc4418478d725a5fa1a0f112059f6f111e4b48cfbe17ef9d0/dateparser-0.7.2-py2.py3-none-any.whl (352kB)
Collecting tzlocal
  Downloading https://files.pythonhosted.org/packages/ef/99/53bd1ac9349262f59c1c421d8fcc2559ae8a5eeffed9202684756b648d33/tzlocal-2.0.0-py2.py3-none-any.whl
Collecting regex
  Downloading https://files.pythonhosted.org/packages/ad/64/1b0eb918ebdfba27b4c148853ed93cc38d83aa452882f2a9dc64acaa9b2f/regex-2020.1.8-cp36-cp36m-manylinux2010_x86_64.whl (689kB)
Installing collected packages: tzlocal, regex, dateparser
Successfully installed dateparser-0.7.2 regex-2020.1.8 tzlocal-2.0.0


ModuleNotFoundError: No module named 'dateparser'
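
The `ModuleNotFoundError` above, raised right after pip reports a successful install, is consistent with `!pip` installing into a different Python environment than the one the kernel runs in (a plausible cause, not confirmed here). Below is a minimal sketch of an import-or-install helper that always targets the kernel's interpreter. `import_or_install` is a hypothetical name and is not part of this notebook's modules.

```python
import importlib
import subprocess
import sys


def import_or_install(module_name, pip_name=None):
    """Import a module, installing it with the current interpreter's pip if missing.

    pip_name is the package name on PyPI when it differs from the module name
    (for example module "psycopg2" is shipped by package "psycopg2-binary").
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # sys.executable is the interpreter running this kernel, so the package
        # lands in the same environment that will import it.
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or module_name]
        )
        # Make sure the import system notices the freshly installed files.
        importlib.invalidate_caches()
        return importlib.import_module(module_name)


# Demonstration with a standard-library module, so nothing is installed.
json_module = import_or_install("json")
```

For the packages used here, a call would look like `import_or_install("psycopg2", "psycopg2-binary")`.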

In [4]:
# Imports on files that might have dependencies that need installing
import data_pier_querying
from athena_querying import AthenaQuery
from athena_common_queries import *
import user_agents # this converts user agent from browser to mobile / desktop etc.

# Kinesis Data via Athena

Data flows tracker -> Kinesis -> S3 (plus a further S3 transform).  We can then query S3 using Athena.

In [5]:
aq = AthenaQuery()

In [6]:
aq.connect()

In [7]:
athena_database = "ms_data_lake_production"
athena_raw_events_table = "ms_data_stream_production_processed"

In [8]:
#query = "select context.page_url, body.event_name, count(*) from "+athena_database+"."+athena_raw_events_table
#query += " where partition_0='2019' and partition_1>='12' and partition_2>='05' group by 1,2"

In [9]:
# I've removed the device_type data to save memory, but it would be useful.
query = create_generic_event_query(from_datetime, to_datetime, include_user_agent=include_device_segmentation, include_ip_address = include_device_segmentation, interpret_urls=False)

full_query = "select * from (%s) where country_code ='sg'" %query

In [10]:
print(full_query)

select * from (
    
    SELECT 
          CAST("from_iso8601_timestamp"("sent_at") AS timestamp) "sent_at_timestamp"
    , "sent_at"
    , substr(sent_at, 1, 10) as date
    , "type"
    , "body"."event_name"
    , "body"."data"."status"
    , "user"."anonymous_id"
    , "user"."amp_id"
    , "context"."page_url"
    , "context"."referrer"
 
    
        , context.user_agent as user_agent
        
        , context.ip_address
        
    
    FROM
      ms_data_lake_production.ms_data_stream_production_processed
    
    
    WHERE true -- makes query composition easier
    
 AND 
  (
 partition_0 >= '2020'
 AND partition_1 >= '02'
 AND partition_2 >= '09'
 OR (
 partition_0 >= '2020'
 AND partition_1 > '02'
 ) 
 OR (
 partition_0 > '2020'
 ) 
)
 AND ((partition_0 <= '2020'
	 AND partition_1 <= '02'
	 AND partition_2 <= '16'
) 
 OR (
	 partition_0 <= '2020'
	 AND partition_1 < '02'
) 
 OR (
	 partition_0 < '2020'
) 
)
 AND CAST(from_iso8601_timestamp(sent_at) AS timestamp)  between C

In [11]:
athena_full_events_df = aq.query(query)


In [12]:
# Set types to speed queries and save on memory
athena_full_events_df = athena_full_events_df.astype({ "type":"category"
    , "event_name":"category"
    , "status":"category"}, copy=False)

In [13]:
athena_full_events_df.dtypes

sent_at_timestamp      object
sent_at                object
date                   object
type                 category
event_name           category
status               category
anonymous_id           object
amp_id                 object
page_url               object
referrer               object
user_agent             object
ip_address             object
dtype: object

In [14]:
athena_full_events_df.head(5)

Unnamed: 0,sent_at_timestamp,sent_at,date,type,event_name,status,anonymous_id,amp_id,page_url,referrer,user_agent,ip_address
0,2020-02-09 05:46:32.584,2020-02-09T05:46:32.584Z,2020-02-09,event,Reading,Article Body 75,362d99d6-74b2-439a-aa87-76fcef7bffd1,,https://www.moneysmart.tw/articles/%E5%A4%96%E...,https://www.google.com/,Mozilla/5.0 (Linux; Android 10; ASUS_Z01RD) Ap...,2404:0:802e:851:9237:2f05:11d0:e29e
1,2020-02-09 05:46:34.844,2020-02-09T05:46:34.844Z,2020-02-09,page,PageView,,45d802cb-aa9b-44fb-a3e1-e5bfb8ec8fe8,,https://www.moneysmart.sg/credit-cards/lazada-...,https://blog.moneysmart.sg/shopping/lazada-pro...,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,202.166.49.60
2,2020-02-09 05:46:34.240,2020-02-09T05:46:34.240Z,2020-02-09,page,PageView,,33a27346-26bf-4918-9a2d-19e294cfa3db,,https://blog.moneysmart.sg/fitness-beauty/free...,https://www.google.com/,Mozilla/5.0 (Linux; Android 9; SAMSUNG SM-G955...,2406:3003:2006:27c2:ed35:5475:9db3:375c
3,2020-02-09 05:46:33.550,2020-02-09T05:46:33.550Z,2020-02-09,page,PageView,,4c4f3267-d90a-42ce-a201-8b9cf7de9ac4,,https://blog.moneysmart.sg/,https://blog.moneysmart.sg/health-insurance/he...,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,113.210.54.208
4,2020-02-09 05:46:33.807,2020-02-09T05:46:33.807Z,2020-02-09,page,PageView,,df259713-4c6d-4d14-a041-fe8b298995ca,,https://www.moneysmart.tw/articles/%E5%85%92%E...,https://www.google.com.tw/,Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like...,2001:b011:4004:3510:2444:12cd:d0cb:583


# Segment Data

NB: It turns out the `tracks` table can be queried directly, rather than the individual event tables, so a lot of the per-table work below is unnecessary.

In [15]:
#from importlib import reload
#reload(data_pier_querying)

In [16]:
# Below there are some checks on what columns are available

segment_columns_to_query = [
    # "sent_at", - don't use this, use timestamp
    "timestamp",
    #"event", - going to get that implied from the table.
    # "status", # TODO: would like to have this, but not sure which column, or which tables.  Maybe just not used much, so only do for the 4 tables.
    "anonymous_id",
    "context_page_url",
    # "referrer", #maybe only used in pages table??
    "context_ip", 
    "context_user_agent"]

In [17]:
dp_querying = data_pier_querying.DataPierQuerying()
dp_querying.connect()

In [18]:
tables_df = dp_querying.query_to_dataframe("select * from information_schema.tables")

In [19]:
segment_event_tables_df = tables_df[tables_df.table_schema=="moneysmartsg_prod"]["table_name"]


In [20]:
# These are taken from the dictionary in https://docs.google.com/spreadsheets/d/1HICh77BoGMIat9K3NPwz3pBayJWiAr0ohAlTuv7dr80/edit#gid=1882048411
#but actually it turns out there should be more than this, and don't need to do it this way.
expected_events_str = """
LeadGeneration.ClickConversion
LeadGeneration.FormStepCompleted
LeadGeneration.FormSubmitted
LeadGeneration.PaymentCompleted
LeadGeneration.ThankYou
LeadGeneration.RedirectCompleted
UserEngagement.ShowedMoreDetails
UserEngagement.ViewedMoreDetails
UserEngagement.SortedList
UserEngagement.UsedHelpHints
UserEngagement.ClickedMenuItem
UserEngagement.QuestionAnswered
UserEngagement.ShowMoreFilter
UserEngagement.ShowMoreOptions
UserEngagement.ClickedFilter
UserEngagement.ButtonClick
UserAuth.LoggedIn
UserAuth.RegisteredAccount
UserAuth.LoggedOut
UserFeedback.ModalDisplayed
UserFeedback.MoodSubmitted
UserFeedback.FeedbackSubmitted
UserFeedback.MoreFeedback
ABTest.Conversion
UserView.WidgetLoad
EmailCapture
PageView
Sharing
Reading
NewsLetterPopup
"""
expected_events = [z.strip() for z in expected_events_str.split("\n") if len(z.strip())>0]

In [21]:


expected_events_and_segment_tables = []
special_maps = {
    "PageView": "pages"
}
for event in expected_events:
    if event in special_maps:
        new_event_name = special_maps[event]
    else:
        new_event_name = ""
        for i, c in enumerate(event):
            if i==0:new_event_name+=c.lower()
            elif str.isupper(c): 
                if i>0 and event[i-1]!=".":
                    new_event_name += "_"
                new_event_name += c.lower()
            elif c==".": new_event_name += "_"
            else: new_event_name+= c
    expected_events_and_segment_tables.append([event, new_event_name])
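
The character loop above turns a dotted CamelCase event name into the snake_case table name Segment uses. Below is a compact sketch of the same mapping using a regular expression. `event_to_table_name` and `SPECIAL_TABLES` are hypothetical names introduced here, and the result is only claimed to match the loop for event names like those listed above.

```python
import re

# Events whose Segment table name does not follow the generic rule.
SPECIAL_TABLES = {"PageView": "pages"}


def event_to_table_name(event_name):
    """Convert e.g. 'LeadGeneration.FormSubmitted' to 'lead_generation_form_submitted'."""
    if event_name in SPECIAL_TABLES:
        return SPECIAL_TABLES[event_name]
    # Insert "_" before every capital letter that is not the first character
    # and does not directly follow a "." or "_".
    with_breaks = re.sub(r"(?<!^)(?<![._])(?=[A-Z])", "_", event_name)
    # Dots separate the event group from the event, and become underscores too.
    return with_breaks.replace(".", "_").lower()


for name in ["LeadGeneration.ClickConversion", "ABTest.Conversion", "PageView"]:
    print(name, "->", event_to_table_name(name))
```

Note that consecutive capitals are split letter by letter, which is why `ABTest.Conversion` maps to `a_b_test_conversion`.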

In [22]:
expected_events_and_segment_tables

[['LeadGeneration.ClickConversion', 'lead_generation_click_conversion'],
 ['LeadGeneration.FormStepCompleted', 'lead_generation_form_step_completed'],
 ['LeadGeneration.FormSubmitted', 'lead_generation_form_submitted'],
 ['LeadGeneration.PaymentCompleted', 'lead_generation_payment_completed'],
 ['LeadGeneration.ThankYou', 'lead_generation_thank_you'],
 ['LeadGeneration.RedirectCompleted', 'lead_generation_redirect_completed'],
 ['UserEngagement.ShowedMoreDetails', 'user_engagement_showed_more_details'],
 ['UserEngagement.ViewedMoreDetails', 'user_engagement_viewed_more_details'],
 ['UserEngagement.SortedList', 'user_engagement_sorted_list'],
 ['UserEngagement.UsedHelpHints', 'user_engagement_used_help_hints'],
 ['UserEngagement.ClickedMenuItem', 'user_engagement_clicked_menu_item'],
 ['UserEngagement.QuestionAnswered', 'user_engagement_question_answered'],
 ['UserEngagement.ShowMoreFilter', 'user_engagement_show_more_filter'],
 ['UserEngagement.ShowMoreOptions', 'user_engagement_show_m

### Check for missing tables

Some events are expected to be missing from Segment: either occasional one-off events, or blog-specific ones that haven't been deployed to SG and HK.

In [23]:
# Check all the event tables exist
expected_event_segment_tables = [z[1] for z in expected_events_and_segment_tables]
segment_table_names = segment_event_tables_df.to_list()
missing_event_tables = [z for z in expected_event_segment_tables if z not in segment_table_names]
missing_event_tables

['user_engagement_used_help_hints',
 'user_engagement_clicked_menu_item',
 'user_feedback_modal_displayed',
 'user_feedback_more_feedback',
 'a_b_test_conversion',
 'sharing',
 'news_letter_popup']

In [24]:
expected_events_and_segment_tables

[['LeadGeneration.ClickConversion', 'lead_generation_click_conversion'],
 ['LeadGeneration.FormStepCompleted', 'lead_generation_form_step_completed'],
 ['LeadGeneration.FormSubmitted', 'lead_generation_form_submitted'],
 ['LeadGeneration.PaymentCompleted', 'lead_generation_payment_completed'],
 ['LeadGeneration.ThankYou', 'lead_generation_thank_you'],
 ['LeadGeneration.RedirectCompleted', 'lead_generation_redirect_completed'],
 ['UserEngagement.ShowedMoreDetails', 'user_engagement_showed_more_details'],
 ['UserEngagement.ViewedMoreDetails', 'user_engagement_viewed_more_details'],
 ['UserEngagement.SortedList', 'user_engagement_sorted_list'],
 ['UserEngagement.UsedHelpHints', 'user_engagement_used_help_hints'],
 ['UserEngagement.ClickedMenuItem', 'user_engagement_clicked_menu_item'],
 ['UserEngagement.QuestionAnswered', 'user_engagement_question_answered'],
 ['UserEngagement.ShowMoreFilter', 'user_engagement_show_more_filter'],
 ['UserEngagement.ShowMoreOptions', 'user_engagement_show_m

In [25]:
# Removing the missing ones from the query list
events_and_tables_to_get_from_data_pier = [z for z in expected_events_and_segment_tables if z[1] not in missing_event_tables]

# Removing a problematic one (it doesn't have context_page_url in it, and is very unimportant)
events_and_tables_to_get_from_data_pier = [z for z in events_and_tables_to_get_from_data_pier if z[1] not in ["user_auth_logged_out",]]

In [26]:
len(events_and_tables_to_get_from_data_pier)

22

In [27]:
cols = dp_querying.query_to_dataframe("""
select column_name, data_type, count(*) from information_schema.columns 
where 
table_name in  ('"""+"','".join([z[1] for z in events_and_tables_to_get_from_data_pier])+"""')
and table_schema='moneysmartsg_prod'

group by 1,2
""")

In [28]:
cols[cols["count"]>10].sort_values(["count"])

Unnamed: 0,column_name,data_type,count
288,page_referrer,text,12
353,user_id,text,13
27,context_campaign_content,text,15
43,context_campaign_term,text,15
287,page_path,text,15
17,channel,text,16
33,context_campaign_medium,text,17
34,context_campaign_name,text,17
41,context_campaign_source,text,17
61,context_locale,text,20


In [29]:
cols = dp_querying.query_to_dataframe("""
select  column_name, data_type, count(*) from information_schema.columns 
where 
 table_name in  ('"""+"','".join(["pages", "tracks"])+"""')
and table_schema='moneysmartsg_prod'
and column_name like '%%'
group by 1,2 order by count(*) desc
""")
cols

Unnamed: 0,column_name,data_type,count
0,context_campaign_term,text,2
1,context_campaign_name,text,2
2,context_page_referrer,text,2
3,context_user_agent,text,2
4,context_page_search,text,2
5,context_page_title,text,2
6,context_campaign_content,text,2
7,context_page_url,text,2
8,id,character varying,2
9,context_ip,text,2


In [30]:
segment_date_constraint = " timestamp >= '%s' and timestamp < '%s' " % (from_datetime.isoformat(), to_datetime.isoformat())

In [31]:
dp_querying.query_to_dataframe("""SELECT
    nmsp_parent.nspname AS parent_schema,
    parent.relname      AS parent,
    nmsp_child.nspname  AS child_schema,
    child.relname       AS child
FROM pg_inherits
    JOIN pg_class parent            ON pg_inherits.inhparent = parent.oid
    JOIN pg_class child             ON pg_inherits.inhrelid   = child.oid
    JOIN pg_namespace nmsp_parent   ON nmsp_parent.oid  = parent.relnamespace
    JOIN pg_namespace nmsp_child    ON nmsp_child.oid   = child.relnamespace
WHERE parent.relname='%s';""" % "pages")

Unnamed: 0,parent_schema,parent,child_schema,child


In [32]:
pd.set_option("display.max_colwidth", 200)
indexes = dp_querying.query_to_dataframe("""SELECT
    indexname,
    indexdef
FROM
    pg_indexes
WHERE
    tablename = '%s';""" % "pages")

for a in indexes.values:
    print(a)

['pages_pkey'
 'CREATE UNIQUE INDEX pages_pkey ON moneysmarthk_prod.pages USING btree (id)']
['pages_pkey'
 'CREATE UNIQUE INDEX pages_pkey ON moneysmartsg_prod.pages USING btree (id)']
['pages_timestamp_idx'
 'CREATE INDEX pages_timestamp_idx ON moneysmartsg_prod.pages USING btree ("timestamp")']
['pages_pkey'
 'CREATE UNIQUE INDEX pages_pkey ON moneysmarthk_dev.pages USING btree (id)']
['pages_pkey'
 'CREATE UNIQUE INDEX pages_pkey ON moneysmartsg_dev.pages USING btree (id)']


In [33]:
query_segment_by_table = False # Leave as False: querying per event table was not the correct method (pages/tracks cover it), and that path still needs country handling added

segment_schemas = ["moneysmartsg_prod", "moneysmarthk_prod"]
# The meat of it
start_time = datetime.now()
event_name_to_rows = {}
if query_segment_by_table:
    for country_schema in segment_schemas:
        for i, (event_name, table_name) in enumerate(events_and_tables_to_get_from_data_pier):
            table_start_time = datetime.now()
            print("querying table %s / %s (%i/%i)" % (table_name, event_name, i+1, len(events_and_tables_to_get_from_data_pier)))
            query = "select {cols} from {schema}.{table} where {date_constraint}".format(cols=", ".join(segment_columns_to_query), 
                                                                           table=table_name,
                                                                           date_constraint =segment_date_constraint, schema=country_schema)

            events = dp_querying.query_to_dataframe(query)
            events["event_name"] = event_name #fills the entire column with the same value
            print("Got %i events"% len(events))
            event_name_to_rows[event_name]=events

            table_download_time = (datetime.now()-table_start_time).total_seconds()
            time_since_start = (datetime.now()-start_time).total_seconds()
            print("It took %.1f seconds to download from the table (%.1f seconds overall)" %(table_download_time, time_since_start))
            print()
            # if i>4:break


        # Merge the per-event tables into one dataframe
        segment_combined_df = pd.concat(list(event_name_to_rows.values()))
    
    
else:
    segment_columns_to_query_full = segment_columns_to_query + ["event_text",]
    tables_to_query = ["pages", "tracks"]
    all_event_dfs = []
    segment_combined_df = pd.DataFrame()
    for country_schema in segment_schemas:
        for table_name in tables_to_query:
            table_start_time = datetime.now()
            if table_name!="pages":
                cols_to_query = segment_columns_to_query_full
            else:
                cols_to_query = segment_columns_to_query
            print("querying table %s.%s" % (country_schema, table_name))
            print(cols_to_query)
            query = "select {cols} from {schema}.{table} where {date_constraint}".format(cols=", ".join(cols_to_query), 
                                                                           table=table_name,
                                                                           date_constraint =segment_date_constraint, schema=country_schema)

            events = dp_querying.query_to_dataframe(query)
            
            print("Got %i events"% len(events))
            #all_event_dfs.append(events)
            
            if table_name =="pages":
                events["event_text"] = "PageView" # fills the whole column
            table_download_time = (datetime.now()-table_start_time).total_seconds()
            time_since_start = (datetime.now()-start_time).total_seconds()
            print("merging")
            segment_combined_df = pd.concat([segment_combined_df, events]) # DataFrame.append was removed in pandas 2.0
            print("It took %.1f seconds to download from the table (%.1f seconds overall)" %(table_download_time, time_since_start))
            print()
            
        

querying table moneysmartsg_prod.pages
['timestamp', 'anonymous_id', 'context_page_url', 'context_ip', 'context_user_agent']
Got 926835 events
merging
It took 38.2 seconds to download from the table (38.2 seconds overall)

querying table moneysmartsg_prod.tracks
['timestamp', 'anonymous_id', 'context_page_url', 'context_ip', 'context_user_agent', 'event_text']
Got 1169914 events
merging
It took 345.4 seconds to download from the table (384.3 seconds overall)

querying table moneysmarthk_prod.pages
['timestamp', 'anonymous_id', 'context_page_url', 'context_ip', 'context_user_agent']
Got 155588 events
merging
It took 275.1 seconds to download from the table (659.8 seconds overall)

querying table moneysmarthk_prod.tracks
['timestamp', 'anonymous_id', 'context_page_url', 'context_ip', 'context_user_agent', 'event_text']
Got 187421 events
merging
It took 75.7 seconds to download from the table (736.1 seconds overall)



In [34]:
if not query_segment_by_table:
    segment_combined_df.rename(columns={"event_text":"event_name"}, inplace=True)

In [35]:
len(all_event_dfs)

0

In [36]:
if include_device_segmentation:
    segment_combined_df.rename(columns={"context_user_agent":"user_agent"}, inplace=True)

In [37]:
segment_combined_df.head()

Unnamed: 0,timestamp,anonymous_id,context_page_url,context_ip,user_agent,event_name
0,2020-02-09 00:00:03.119000+00:00,711612a1-1655-4084-b0f4-ce1924b62878,https://blog.moneysmart.sg/invest/invest-luxur...,115.66.28.95,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,PageView
1,2020-02-09 00:00:03.411000+00:00,d37cad6e-4d24-48f0-87da-df70d0b5bcd7,https://blog.moneysmart.sg/shopping/surgical-m...,119.56.110.139,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,PageView
2,2020-02-09 00:00:05.548000+00:00,22ac42ba-3d96-4d2c-b216-008183c768a9,https://www.moneysmart.sg/forms/personal-loan/...,52.220.124.147,Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...,PageView
3,2020-02-09 00:00:07.906000+00:00,7a44d50c-0f31-470d-a320-ae56fedf4e9e,https://blog.moneysmart.sg/life-insurance/etiq...,182.55.163.63,Mozilla/5.0 (Linux; Android 9; SM-G965F Build/...,PageView
4,2020-02-09 00:00:16.709000+00:00,12d044d2-02f7-46e7-a2ca-c526e918f9a6,https://blog.moneysmart.sg/career/singapore-jo...,14.100.35.8,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,PageView


In [38]:
segment_combined_df.rename(columns={"context_page_url":"page_url"}, inplace=True)
segment_combined_df.head(5)

Unnamed: 0,timestamp,anonymous_id,page_url,context_ip,user_agent,event_name
0,2020-02-09 00:00:03.119000+00:00,711612a1-1655-4084-b0f4-ce1924b62878,https://blog.moneysmart.sg/invest/invest-luxur...,115.66.28.95,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,PageView
1,2020-02-09 00:00:03.411000+00:00,d37cad6e-4d24-48f0-87da-df70d0b5bcd7,https://blog.moneysmart.sg/shopping/surgical-m...,119.56.110.139,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,PageView
2,2020-02-09 00:00:05.548000+00:00,22ac42ba-3d96-4d2c-b216-008183c768a9,https://www.moneysmart.sg/forms/personal-loan/...,52.220.124.147,Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...,PageView
3,2020-02-09 00:00:07.906000+00:00,7a44d50c-0f31-470d-a320-ae56fedf4e9e,https://blog.moneysmart.sg/life-insurance/etiq...,182.55.163.63,Mozilla/5.0 (Linux; Android 9; SM-G965F Build/...,PageView
4,2020-02-09 00:00:16.709000+00:00,12d044d2-02f7-46e7-a2ca-c526e918f9a6,https://blog.moneysmart.sg/career/singapore-jo...,14.100.35.8,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,PageView


# Merging Segment and Kinesis Events

In [39]:
# Make names clear e.g. s_...

# Check the timezone / timestamps match
# Athena raw stuff is in UTC, not SG time.  So 2020-01-19T00:04:04.443Z is 8:04am Singapore time,
# whereas Segment is stored with timezone at UTC.  So, could convert them all.
# TODO: But it does mean that there are a lot of events coming at the day boundary.
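
The timezone note above can be made concrete with a small sketch. It assumes `sent_at` strings always have the form `YYYY-MM-DDTHH:MM:SS.fffZ`, as in the rows shown in this notebook. `parse_utc` and `SGT` are hypothetical names introduced here for illustration.

```python
from datetime import datetime, timedelta, timezone

# Singapore has a fixed UTC+8 offset (no daylight saving).
SGT = timezone(timedelta(hours=8), name="SGT")


def parse_utc(sent_at):
    """Parse an ISO 8601 string with a trailing 'Z' into an aware UTC datetime."""
    naive = datetime.strptime(sent_at, "%Y-%m-%dT%H:%M:%S.%fZ")
    return naive.replace(tzinfo=timezone.utc)


utc_time = parse_utc("2020-01-19T00:04:04.443Z")
sg_time = utc_time.astimezone(SGT)
print(utc_time.isoformat(), "->", sg_time.isoformat())

# An event late in the UTC day already belongs to the next calendar day in Singapore,
# so grouping by UTC date and by local date can put it in different buckets.
late_utc = parse_utc("2020-01-18T17:30:00.000Z")
print(late_utc.date(), "vs", late_utc.astimezone(SGT).date())
```

As long as both sources are bucketed by UTC date, the choice of timezone matters less than using the same one on both sides.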

In [40]:
athena_full_events_df.head(2)

Unnamed: 0,sent_at_timestamp,sent_at,date,type,event_name,status,anonymous_id,amp_id,page_url,referrer,user_agent,ip_address
0,2020-02-09 05:46:32.584,2020-02-09T05:46:32.584Z,2020-02-09,event,Reading,Article Body 75,362d99d6-74b2-439a-aa87-76fcef7bffd1,,https://www.moneysmart.tw/articles/%E5%A4%96%E...,https://www.google.com/,Mozilla/5.0 (Linux; Android 10; ASUS_Z01RD) Ap...,2404:0:802e:851:9237:2f05:11d0:e29e
1,2020-02-09 05:46:34.844,2020-02-09T05:46:34.844Z,2020-02-09,page,PageView,,45d802cb-aa9b-44fb-a3e1-e5bfb8ec8fe8,,https://www.moneysmart.sg/credit-cards/lazada-...,https://blog.moneysmart.sg/shopping/lazada-pro...,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,202.166.49.60


In [41]:
athena_full_events_df.dtypes

sent_at_timestamp      object
sent_at                object
date                   object
type                 category
event_name           category
status               category
anonymous_id           object
amp_id                 object
page_url               object
referrer               object
user_agent             object
ip_address             object
dtype: object

In [42]:
segment_combined_df.dtypes

timestamp       datetime64[ns, UTC]
anonymous_id                 object
page_url                     object
context_ip                   object
user_agent                   object
event_name                   object
dtype: object

In [43]:
# Group by columns to get around date inaccuracy issue
cols_to_group_by = ["anonymous_id", "event_name", "page_url", "date"] #, "context_ip", "context_user_agent"] #TODO: add IP address

print("Grouping by %s"% ", ".join(cols_to_group_by))

print("Fixing dates before grouping")
print("... for Segment")
segment_combined_df["date"] = segment_combined_df["timestamp"].dt.strftime("%Y-%m-%d") # ISO date string; vectorised instead of a row-wise apply
print("... for athena")
athena_full_events_df["date"] = athena_full_events_df["sent_at"].str[:10] # sent_at is an ISO 8601 string, so the first 10 characters are the date
# The dateparser version below is more robust but was super-slow, so the date is sliced from the string instead.
# athena_full_events_df["date"] = athena_full_events_df.apply(lambda row: dateparser.parse(row.sent_at_timestamp).date(), axis=1)

#going to reduce the number of columns to make it safer, then can go back and look for user agents etc (can do a mapping of anonymous_id to user_agent for instance.)




Grouping by anonymous_id, event_name, page_url, date
Fixing dates before grouping
... for Segment
... for athena


In [44]:
print("Setting sensible data types for the columns to group by")
data_type_mappings = {"event_name":"category", "date":"category"}
segment_combined_df = segment_combined_df.astype(data_type_mappings, copy=False)
athena_full_events_df = athena_full_events_df.astype(data_type_mappings, copy=False)

Setting sensible data types for the columns to group by


In [45]:
segment_combined_df.head()[cols_to_group_by]

Unnamed: 0,anonymous_id,event_name,page_url,date
0,711612a1-1655-4084-b0f4-ce1924b62878,PageView,https://blog.moneysmart.sg/invest/invest-luxur...,2020-02-09
1,d37cad6e-4d24-48f0-87da-df70d0b5bcd7,PageView,https://blog.moneysmart.sg/shopping/surgical-m...,2020-02-09
2,22ac42ba-3d96-4d2c-b216-008183c768a9,PageView,https://www.moneysmart.sg/forms/personal-loan/...,2020-02-09
3,7a44d50c-0f31-470d-a320-ae56fedf4e9e,PageView,https://blog.moneysmart.sg/life-insurance/etiq...,2020-02-09
4,12d044d2-02f7-46e7-a2ca-c526e918f9a6,PageView,https://blog.moneysmart.sg/career/singapore-jo...,2020-02-09


In [46]:
athena_full_events_df.head()[cols_to_group_by]

Unnamed: 0,anonymous_id,event_name,page_url,date
0,362d99d6-74b2-439a-aa87-76fcef7bffd1,Reading,https://www.moneysmart.tw/articles/%E5%A4%96%E...,2020-02-09
1,45d802cb-aa9b-44fb-a3e1-e5bfb8ec8fe8,PageView,https://www.moneysmart.sg/credit-cards/lazada-...,2020-02-09
2,33a27346-26bf-4918-9a2d-19e294cfa3db,PageView,https://blog.moneysmart.sg/fitness-beauty/free...,2020-02-09
3,4c4f3267-d90a-42ce-a201-8b9cf7de9ac4,PageView,https://blog.moneysmart.sg/,2020-02-09
4,df259713-4c6d-4d14-a041-fe8b298995ca,PageView,https://www.moneysmart.tw/articles/%E5%85%92%E...,2020-02-09


In [47]:
# athena_full_events_df timestamp

print("Grouping by %s"%cols_to_group_by)
segment_grouped_df = segment_combined_df.groupby(cols_to_group_by).size().reset_index(name='s_count') #size preserves nulls, this sets the column to s_count

athena_grouped_df = athena_full_events_df.groupby(cols_to_group_by).size().reset_index(name='k_count')

# segment_combined_df.rename(columns = {"context_ip":"s_context_ip", "context_user_agent":"s_context_user_agent"}) 

Grouping by ['anonymous_id', 'event_name', 'page_url', 'date']


In [48]:
athena_grouped_df.head()

Unnamed: 0,anonymous_id,event_name,page_url,date,k_count
0,000034a2-e973-4108-b920-0681877d4fc0,PageView,https://blog.moneysmart.sg/budgeting/mattress-...,2020-02-15,1
1,000034a2-e973-4108-b920-0681877d4fc0,PageView,https://blog.moneysmart.sg/property/3-things-l...,2020-02-10,1
2,000034a2-e973-4108-b920-0681877d4fc0,Reading,https://blog.moneysmart.sg/budgeting/mattress-...,2020-02-15,3
3,000034a2-e973-4108-b920-0681877d4fc0,Reading,https://blog.moneysmart.sg/property/3-things-l...,2020-02-10,4
4,0000628f-db5d-4554-96eb-66454e203e92,PageView,https://www.moneysmart.sg/embed/dc96c1e58d2f68...,2020-02-09,1


In [49]:
# Actually join them

# set the column count names

merged_df = segment_grouped_df.merge(athena_grouped_df, how='outer', on=cols_to_group_by )

#Fill in the empty counts with 0s

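# Assign the result back rather than calling fillna(inplace=True) on a column selection,
# which is not guaranteed to modify merged_df in newer pandas versions.
merged_df[["s_count", "k_count"]] = merged_df[["s_count", "k_count"]].fillna(0)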
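
The group, outer-join and fill-with-zero steps above can be sanity-checked on a small sample in plain Python. This is only a sketch of the same idea, not a replacement for the pandas version. `compare_counts` is a hypothetical helper and the sample keys are made up for illustration.

```python
from collections import Counter


def compare_counts(segment_keys, kinesis_keys):
    """Count each key per source and line the counts up, using 0 where a key is absent."""
    s_counts = Counter(segment_keys)
    k_counts = Counter(kinesis_keys)
    # The union of keys plays the role of the outer join.
    all_keys = sorted(set(s_counts) | set(k_counts))
    # Counter returns 0 for missing keys, which mirrors fillna(0).
    return [(key, s_counts[key], k_counts[key]) for key in all_keys]


# Keys are (anonymous_id, event_name, page_url, date), as in cols_to_group_by.
segment_keys = [
    ("u1", "PageView", "/a", "2020-02-09"),
    ("u1", "Reading", "/a", "2020-02-09"),
    ("u1", "Reading", "/a", "2020-02-09"),
    ("u2", "LeadGeneration.ClickConversion", "/b", "2020-02-12"),
]
kinesis_keys = [
    ("u1", "PageView", "/a", "2020-02-09"),
    ("u1", "Reading", "/a", "2020-02-09"),
    ("u1", "Reading", "/a", "2020-02-09"),
    ("u3", "PageView", "/c", "2020-02-10"),
]

rows = compare_counts(segment_keys, kinesis_keys)
mismatches = [row for row in rows if row[1] != row[2]]
for key, s_count, k_count in mismatches:
    print(key, "segment:", s_count, "kinesis:", k_count)
```

Rows where the two counts differ are the ones worth drilling into in the later analysis notebooks.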

In [50]:
merged_df.head(10)

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count
0,000034a2-e973-4108-b920-0681877d4fc0,PageView,https://blog.moneysmart.sg/budgeting/mattress-...,2020-02-15,1.0,1.0
1,000034a2-e973-4108-b920-0681877d4fc0,PageView,https://blog.moneysmart.sg/property/3-things-l...,2020-02-10,1.0,1.0
2,000034a2-e973-4108-b920-0681877d4fc0,Reading,https://blog.moneysmart.sg/budgeting/mattress-...,2020-02-15,3.0,3.0
3,000034a2-e973-4108-b920-0681877d4fc0,Reading,https://blog.moneysmart.sg/property/3-things-l...,2020-02-10,4.0,4.0
4,0000628f-db5d-4554-96eb-66454e203e92,PageView,https://www.moneysmart.sg/embed/dc96c1e58d2f68...,2020-02-09,1.0,1.0
5,0000628f-db5d-4554-96eb-66454e203e92,UserView.WidgetLoad,https://www.moneysmart.sg/embed/dc96c1e58d2f68...,2020-02-09,1.0,1.0
6,00007ce2-f710-4f78-bf44-93fcc7e68c24,PageView,https://blog.moneysmart.sg/credit-cards/best-1...,2020-02-14,1.0,1.0
7,0000b18c-192c-4f9b-b3c4-4b8f65a9e197,PageView,https://blog.moneysmart.sg/budgeting/retiremen...,2020-02-14,1.0,1.0
8,0000b18c-192c-4f9b-b3c4-4b8f65a9e197,Reading,https://blog.moneysmart.sg/budgeting/retiremen...,2020-02-14,2.0,2.0
9,0000e821-1060-4146-9374-2e32bea14f00,LeadGeneration.ClickConversion,https://www.moneysmart.hk/zh-hk/credit-cards/h...,2020-02-12,1.0,0.0


In [51]:
merged_df.groupby(["date"]).count()

date,anonymous_id,event_name,page_url,s_count,k_count
2020-02-09,269277,269277,269277,269277,269277
2020-02-10,291919,291919,291919,291919,291919
2020-02-11,303506,303506,303506,303506,303506
2020-02-12,290399,290399,290399,290399,290399
2020-02-13,290005,290005,290005,290005,290005
2020-02-14,279776,279776,279776,279776,279776
2020-02-15,297001,297001,297001,297001,297001


# Add Page Filtering Metadata

* page type (blog / shop / ...)
* country

In [52]:
from urllib.parse import urlparse, parse_qs

In [53]:
from data_parsing import get_metadata_from_url


In [54]:
# Do some tests to show that it's kind of working (bad version of a unit test!)

In [55]:
get_metadata_from_url("https://www-new.moneysmart.sg/rabbit/headlight/?scary=True")

['shop', '/rabbit/headlight', '/rabbit', 'test', 'sg']

In [56]:
get_metadata_from_url("https://blog.moneysmart.ph/rabbit/headlight/?scary=True")

['blog', '/rabbit/headlight', '/rabbit', 'control', 'ph']

In [57]:
get_metadata_from_url("https://blog3.moneysmart.tw")

['blog', '/', '/', 'test', 'tw']

In [58]:
get_metadata_from_url("https://www.moneysmart.hk/zh-hk/credit-cards/")

['shop', '/zh-hk/credit-cards', '/credit-cards', 'control', 'hk']

In [59]:
start_time = datetime.now()
print("starting at %s"%start_time.isoformat())
# This is a bit slow. Consider looking at how to optimise it, especially the memory usage from creating loads of Series objects.
# Could probably optimise by splitting all the urls using a pandas string function, then joining with a map to get page_type, path etc., but YMMV.
metadata_df = merged_df.apply(lambda x: pd.Series(get_metadata_from_url(x.page_url)), axis=1)#, index=["page_type", "path", "ab_test", "country_code"])
end_time = datetime.now()
time_taken = (end_time-start_time).total_seconds()
print("Took %i seconds"%time_taken)

starting at 2020-02-17T09:47:10.330432
Took 848 seconds
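
The comment in the cell above suggests splitting the URLs with a pandas string function rather than building one Series per row. A minimal sketch of that idea follows. The regex and the "subdomain starts with blog" rule are assumptions inferred from the example outputs earlier in this section; they are not the real logic inside `get_metadata_from_url`, and the A/B test column is left out because its rules aren't shown here.

```python
import pandas as pd

# Toy URLs shaped like the ones in merged_df.page_url
urls = pd.Series([
    "https://blog.moneysmart.sg/budgeting/mattress-guide/?x=1",
    "https://www.moneysmart.hk/zh-hk/credit-cards/",
    "https://blog3.moneysmart.tw",
])

# Assumed URL shape: <subdomain>.moneysmart.<country_code>[/path][?query]
pattern = (
    r"^https?://(?P<subdomain>[^./]+)\.moneysmart\."
    r"(?P<country_code>[a-z]+)(?P<path>/[^?#]*)?"
)

# One vectorised call returns a DataFrame with one column per named group
parts = urls.str.extract(pattern)

# Normalise the path: a missing path becomes "/", trailing slashes are dropped
parts["path"] = parts["path"].fillna("/").str.rstrip("/").replace("", "/")

# Assumed rule: any subdomain starting with "blog" is a blog page, the rest are shop pages
parts["page_type"] = parts["subdomain"].str.startswith("blog").map(
    {True: "blog", False: "shop"}
)

print(parts)
```

Whether this is actually faster on the full dataset would need measuring; the point is only that no Python-level function runs once per row.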


In [60]:
metadata_df.rename(columns={0:"page_type", 1:"slug", 2:"slug_root", 3:"ab_test", 4:"country_code"}, inplace=True)

In [61]:
metadata_df.head()

Unnamed: 0,page_type,slug,slug_root,ab_test,country_code
0,blog,/budgeting/mattress-singapore-guide,/budgeting,control,sg
1,blog,/property/3-things-look-buying-condo-2017,/property,control,sg
2,blog,/budgeting/mattress-singapore-guide,/budgeting,control,sg
3,blog,/property/3-things-look-buying-condo-2017,/property,control,sg
4,shop,/embed/dc96c1e58d2f6855228962060a1a8b77,/embed,control,sg


In [62]:
merged_df_with_meta = pd.concat([merged_df, metadata_df], axis=1)

In [63]:
# Set some sensible data types to speed it all up
#merged_df_with_meta.astype({"page_type":"category", "slug":"category"})
merged_df_with_meta = merged_df_with_meta.astype({"page_type":"category", "slug":"category", "ab_test":"category", "country_code":"category", "s_count":"int", "k_count":"int"})

In [64]:
merged_df_with_meta.head()

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code
0,000034a2-e973-4108-b920-0681877d4fc0,PageView,https://blog.moneysmart.sg/budgeting/mattress-...,2020-02-15,1,1,blog,/budgeting/mattress-singapore-guide,/budgeting,control,sg
1,000034a2-e973-4108-b920-0681877d4fc0,PageView,https://blog.moneysmart.sg/property/3-things-l...,2020-02-10,1,1,blog,/property/3-things-look-buying-condo-2017,/property,control,sg
2,000034a2-e973-4108-b920-0681877d4fc0,Reading,https://blog.moneysmart.sg/budgeting/mattress-...,2020-02-15,3,3,blog,/budgeting/mattress-singapore-guide,/budgeting,control,sg
3,000034a2-e973-4108-b920-0681877d4fc0,Reading,https://blog.moneysmart.sg/property/3-things-l...,2020-02-10,4,4,blog,/property/3-things-look-buying-condo-2017,/property,control,sg
4,0000628f-db5d-4554-96eb-66454e203e92,PageView,https://www.moneysmart.sg/embed/dc96c1e58d2f68...,2020-02-09,1,1,shop,/embed/dc96c1e58d2f6855228962060a1a8b77,/embed,control,sg


In [65]:
merged_df_with_meta[(merged_df_with_meta.s_count>1) & (merged_df_with_meta.k_count>1)].head()

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code
2,000034a2-e973-4108-b920-0681877d4fc0,Reading,https://blog.moneysmart.sg/budgeting/mattress-...,2020-02-15,3,3,blog,/budgeting/mattress-singapore-guide,/budgeting,control,sg
3,000034a2-e973-4108-b920-0681877d4fc0,Reading,https://blog.moneysmart.sg/property/3-things-l...,2020-02-10,4,4,blog,/property/3-things-look-buying-condo-2017,/property,control,sg
8,0000b18c-192c-4f9b-b3c4-4b8f65a9e197,Reading,https://blog.moneysmart.sg/budgeting/retiremen...,2020-02-14,2,2,blog,/budgeting/retirement-planning-singapore,/budgeting,control,sg
15,0000fd92-314b-4c04-9a1f-702ad1aa9e3c,Reading,https://blog.moneysmart.sg/shopping/surgical-m...,2020-02-14,4,4,blog,/shopping/surgical-masks-watsons-guardian,/shopping,control,sg
18,000137c4-9530-4efe-acea-0bd0bde9cbda,Reading,https://blog.moneysmart.sg/fitness-beauty/ipl-...,2020-02-15,4,4,blog,/fitness-beauty/ipl-hair-removal,/fitness-beauty,control,sg


# Add Device Type Metadata

In [66]:
segment_combined_df.head()

Unnamed: 0,timestamp,anonymous_id,page_url,context_ip,user_agent,event_name,date
0,2020-02-09 00:00:03.119000+00:00,711612a1-1655-4084-b0f4-ce1924b62878,https://blog.moneysmart.sg/invest/invest-luxur...,115.66.28.95,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,PageView,2020-02-09
1,2020-02-09 00:00:03.411000+00:00,d37cad6e-4d24-48f0-87da-df70d0b5bcd7,https://blog.moneysmart.sg/shopping/surgical-m...,119.56.110.139,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,PageView,2020-02-09
2,2020-02-09 00:00:05.548000+00:00,22ac42ba-3d96-4d2c-b216-008183c768a9,https://www.moneysmart.sg/forms/personal-loan/...,52.220.124.147,Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...,PageView,2020-02-09
3,2020-02-09 00:00:07.906000+00:00,7a44d50c-0f31-470d-a320-ae56fedf4e9e,https://blog.moneysmart.sg/life-insurance/etiq...,182.55.163.63,Mozilla/5.0 (Linux; Android 9; SM-G965F Build/...,PageView,2020-02-09
4,2020-02-09 00:00:16.709000+00:00,12d044d2-02f7-46e7-a2ca-c526e918f9a6,https://blog.moneysmart.sg/career/singapore-jo...,14.100.35.8,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,PageView,2020-02-09


In [67]:
athena_full_events_df.head()

Unnamed: 0,sent_at_timestamp,sent_at,date,type,event_name,status,anonymous_id,amp_id,page_url,referrer,user_agent,ip_address
0,2020-02-09 05:46:32.584,2020-02-09T05:46:32.584Z,2020-02-09,event,Reading,Article Body 75,362d99d6-74b2-439a-aa87-76fcef7bffd1,,https://www.moneysmart.tw/articles/%E5%A4%96%E...,https://www.google.com/,Mozilla/5.0 (Linux; Android 10; ASUS_Z01RD) Ap...,2404:0:802e:851:9237:2f05:11d0:e29e
1,2020-02-09 05:46:34.844,2020-02-09T05:46:34.844Z,2020-02-09,page,PageView,,45d802cb-aa9b-44fb-a3e1-e5bfb8ec8fe8,,https://www.moneysmart.sg/credit-cards/lazada-...,https://blog.moneysmart.sg/shopping/lazada-pro...,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,202.166.49.60
2,2020-02-09 05:46:34.240,2020-02-09T05:46:34.240Z,2020-02-09,page,PageView,,33a27346-26bf-4918-9a2d-19e294cfa3db,,https://blog.moneysmart.sg/fitness-beauty/free...,https://www.google.com/,Mozilla/5.0 (Linux; Android 9; SAMSUNG SM-G955...,2406:3003:2006:27c2:ed35:5475:9db3:375c
3,2020-02-09 05:46:33.550,2020-02-09T05:46:33.550Z,2020-02-09,page,PageView,,4c4f3267-d90a-42ce-a201-8b9cf7de9ac4,,https://blog.moneysmart.sg/,https://blog.moneysmart.sg/health-insurance/he...,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,113.210.54.208
4,2020-02-09 05:46:33.807,2020-02-09T05:46:33.807Z,2020-02-09,page,PageView,,df259713-4c6d-4d14-a041-fe8b298995ca,,https://www.moneysmart.tw/articles/%E5%85%92%E...,https://www.google.com.tw/,Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like...,2001:b011:4004:3510:2444:12cd:d0cb:583


### Segment

In [68]:
segment_combined_df.head()

Unnamed: 0,timestamp,anonymous_id,page_url,context_ip,user_agent,event_name,date
0,2020-02-09 00:00:03.119000+00:00,711612a1-1655-4084-b0f4-ce1924b62878,https://blog.moneysmart.sg/invest/invest-luxur...,115.66.28.95,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,PageView,2020-02-09
1,2020-02-09 00:00:03.411000+00:00,d37cad6e-4d24-48f0-87da-df70d0b5bcd7,https://blog.moneysmart.sg/shopping/surgical-m...,119.56.110.139,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,PageView,2020-02-09
2,2020-02-09 00:00:05.548000+00:00,22ac42ba-3d96-4d2c-b216-008183c768a9,https://www.moneysmart.sg/forms/personal-loan/...,52.220.124.147,Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...,PageView,2020-02-09
3,2020-02-09 00:00:07.906000+00:00,7a44d50c-0f31-470d-a320-ae56fedf4e9e,https://blog.moneysmart.sg/life-insurance/etiq...,182.55.163.63,Mozilla/5.0 (Linux; Android 9; SM-G965F Build/...,PageView,2020-02-09
4,2020-02-09 00:00:16.709000+00:00,12d044d2-02f7-46e7-a2ca-c526e918f9a6,https://blog.moneysmart.sg/career/singapore-jo...,14.100.35.8,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,PageView,2020-02-09


In [69]:
group_by_cols = ["anonymous_id", "user_agent"]
segment_anonymous_id_to_user_agent_full_df = segment_combined_df.groupby(group_by_cols).count()
print("%i anonymous_id to user_agents found" % len(segment_anonymous_id_to_user_agent_full_df))

648745 anonymous_id to user_agents found


In [70]:
segment_anonymous_id_to_user_agent_full_df = segment_anonymous_id_to_user_agent_full_df.reset_index()
# .count() gives one count per remaining column (timestamp, page_url, ...), so there is no 0 column to rename to "count" here
segment_anonymous_id_to_user_agent_full_df.head()

Unnamed: 0,anonymous_id,user_agent,timestamp,page_url,context_ip,event_name,date
0,000034a2-e973-4108-b920-0681877d4fc0,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,5,5,5,5,5
1,000034a2-e973-4108-b920-0681877d4fc0,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,4,4,4,4,4
2,0000628f-db5d-4554-96eb-66454e203e92,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,2,2,2,2,2
3,00007ce2-f710-4f78-bf44-93fcc7e68c24,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...,1,1,1,1,1
4,0000b18c-192c-4f9b-b3c4-4b8f65a9e197,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,3,3,3,3,3


In [71]:
# check for duplicates
sd = segment_anonymous_id_to_user_agent_full_df.groupby(["anonymous_id"]).size() #[["sent_at"]]
sd = sd.reset_index()
duplicates = sd[sd[0]>1]
print("%i / %i anonymous_ids with different user agent strings.  Expect there to be some due to browser upgrades" % (len(duplicates), len(sd)))

10868 / 637295 anonymous_ids with different user agent strings.  Expect there to be some due to browser upgrades


In [72]:
sd.head()

Unnamed: 0,anonymous_id,0
0,000034a2-e973-4108-b920-0681877d4fc0,2
1,0000628f-db5d-4554-96eb-66454e203e92,1
2,00007ce2-f710-4f78-bf44-93fcc7e68c24,1
3,0000b18c-192c-4f9b-b3c4-4b8f65a9e197,1
4,0000e821-1060-4146-9374-2e32bea14f00,1


In [73]:
segment_anonymous_id_to_user_agent_df = segment_anonymous_id_to_user_agent_full_df[["anonymous_id", "user_agent"]] # .set_index("anonymous_id")

#make a bit safer by stripping the strings
#segment_anonymous_id_to_user_agent_df["user_agent"] = segment_anonymous_id_to_user_agent_df["user_agent"].str.strip()
#segment_anonymous_id_to_user_agent_df["anonymous_id"] = segment_anonymous_id_to_user_agent_df["anonymous_id"].str.strip()

In [74]:
segment_anonymous_id_to_user_agent_df = segment_anonymous_id_to_user_agent_df.rename(columns={"user_agent": "s_user_agent"})
segment_anonymous_id_to_user_agent_df.head()

Unnamed: 0,anonymous_id,s_user_agent
0,000034a2-e973-4108-b920-0681877d4fc0,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
1,000034a2-e973-4108-b920-0681877d4fc0,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
2,0000628f-db5d-4554-96eb-66454e203e92,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
3,00007ce2-f710-4f78-bf44-93fcc7e68c24,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...
4,0000b18c-192c-4f9b-b3c4-4b8f65a9e197,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...


In [75]:
# Remove duplicates, so anonymous_id column is unique (otherwise on joins you'll expand the dataset)
segment_anonymous_id_to_user_agent_dedup_df = segment_anonymous_id_to_user_agent_df.groupby("anonymous_id").first().reset_index()
print("Before de-duplication %i, after %i"%(len(segment_anonymous_id_to_user_agent_df), len(segment_anonymous_id_to_user_agent_dedup_df)))
segment_anonymous_id_to_user_agent_dedup_df.head()

Before de-duplication 648745, after 637295


Unnamed: 0,anonymous_id,s_user_agent
0,000034a2-e973-4108-b920-0681877d4fc0,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
1,0000628f-db5d-4554-96eb-66454e203e92,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
2,00007ce2-f710-4f78-bf44-93fcc7e68c24,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...
3,0000b18c-192c-4f9b-b3c4-4b8f65a9e197,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
4,0000e821-1060-4146-9374-2e32bea14f00,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...
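
For reference, here is a small sketch of the de-duplication step on toy data, comparing `groupby(...).first()` with `drop_duplicates`. The frame and values are made up for illustration. Note that `groupby(...).first()` sorts by the key and takes the first non-null value per column, whereas `drop_duplicates` keeps the first row in the original order, so the two only agree when the data is already sorted by key and has no nulls.

```python
import pandas as pd

# Toy (anonymous_id, user agent) pairs: "a" appears with two different user agents
pairs = pd.DataFrame({
    "anonymous_id": ["a", "a", "b"],
    "s_user_agent": ["UA-old", "UA-new", "UA-x"],
})

# Approach used in the notebook: first user agent per anonymous_id
via_groupby = pairs.groupby("anonymous_id").first().reset_index()

# Alternative: keep the first row seen for each anonymous_id
via_drop = pairs.drop_duplicates(subset="anonymous_id", keep="first").reset_index(drop=True)

print("Before de-duplication %i, after %i" % (len(pairs), len(via_groupby)))
print(via_groupby.to_dict("records") == via_drop.to_dict("records"))
```

Either way, which user agent "wins" for an id with several is arbitrary, which is fine for coarse device segmentation.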


### Athena / Kinesis

In [76]:
group_by_cols = ["anonymous_id", "user_agent"]
athena_anonymous_id_to_user_agent_full_df = athena_full_events_df.groupby(group_by_cols).size()
print("%i anonymous_id to user_agents found" % len(athena_anonymous_id_to_user_agent_full_df))

772899 anonymous_id to user_agents found


In [77]:
athena_anonymous_id_to_user_agent_full_df = athena_anonymous_id_to_user_agent_full_df.reset_index()
athena_anonymous_id_to_user_agent_full_df.rename(columns={0: "count"}, inplace=True)
athena_anonymous_id_to_user_agent_full_df.head()

Unnamed: 0,anonymous_id,user_agent,count
0,000034a2-e973-4108-b920-0681877d4fc0,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,5
1,000034a2-e973-4108-b920-0681877d4fc0,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,4
2,0000628f-db5d-4554-96eb-66454e203e92,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,2
3,00007ce2-f710-4f78-bf44-93fcc7e68c24,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...,1
4,0000b18c-192c-4f9b-b3c4-4b8f65a9e197,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,3


In [78]:
# check for duplicates
ad = athena_anonymous_id_to_user_agent_full_df.groupby(["anonymous_id"]).size() #[["sent_at"]]
ad = ad.reset_index()
duplicates = ad[ad[0]>1]
print("%i / %i anonymous_ids with different user agent strings.  Expect there to be some due to browser upgrades" % (len(duplicates), len(ad)))

11509 / 760672 anonymous_ids with different user agent strings.  Expect there to be some due to browser upgrades


In [79]:
# explore if issue
#df = ad[ad[0]>1].merge(athena_anonymous_id_to_user_agent_full_df, how="inner")
#df.sort_values("anonymous_id")

In [80]:
#df = athena_anonymous_id_to_user_agent_full_df[athena_anonymous_id_to_user_agent_full_df.anonymous_id=="f4a0d91c-b118-40ce-890c-9142bce9f152"]
#pd.set_option('max_colwidth', 200)
#print(df.values[0][1])
#print(df.values[1][1])

In [81]:
#athena_anonymous_id_to_user_agent_full_df.head()
athena_anonymous_id_to_user_agent_df = athena_anonymous_id_to_user_agent_full_df[["anonymous_id", "user_agent"]]


# Stripping the strings would be a bit safer, but I couldn't easily get it to work without a warning, so skipping.
#athena_anonymous_id_to_user_agent_df.loc[:,1] = athena_anonymous_id_to_user_agent_df["user_agent"].str.strip()
#athena_anonymous_id_to_user_agent_df.loc[:,0] = athena_anonymous_id_to_user_agent_df["anonymous_id"].str.strip()

#?athena_anonymous_id_to_user_agent_df["user_agent"].str.strip()

In [82]:
athena_anonymous_id_to_user_agent_df = athena_anonymous_id_to_user_agent_df.rename(columns={"user_agent": "a_user_agent"})
athena_anonymous_id_to_user_agent_df.head()


Unnamed: 0,anonymous_id,a_user_agent
0,000034a2-e973-4108-b920-0681877d4fc0,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
1,000034a2-e973-4108-b920-0681877d4fc0,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
2,0000628f-db5d-4554-96eb-66454e203e92,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
3,00007ce2-f710-4f78-bf44-93fcc7e68c24,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...
4,0000b18c-192c-4f9b-b3c4-4b8f65a9e197,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...


In [83]:
# Remove duplicates, so anonymous_id column is unique (otherwise on joins you'll expand the dataset)
athena_anonymous_id_to_user_agent_dedup_df = athena_anonymous_id_to_user_agent_df.groupby("anonymous_id").first().reset_index()
print("Before de-duplication %i, after %i"%(len(athena_anonymous_id_to_user_agent_df), len(athena_anonymous_id_to_user_agent_dedup_df)))
athena_anonymous_id_to_user_agent_dedup_df.head()

Before de-duplication 772899, after 760672


Unnamed: 0,anonymous_id,a_user_agent
0,000034a2-e973-4108-b920-0681877d4fc0,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
1,0000628f-db5d-4554-96eb-66454e203e92,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
2,00007ce2-f710-4f78-bf44-93fcc7e68c24,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...
3,0000b18c-192c-4f9b-b3c4-4b8f65a9e197,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
4,0000ca0d-320e-4d34-9d9d-6ae23575f78b,Mozilla/5.0 (Linux; Android 9; VCE-L22) AppleW...


### Joined up for all anonymous_ids

In [84]:
athena_anonymous_id_to_user_agent_dedup_df.set_index("anonymous_id", inplace=True)
segment_anonymous_id_to_user_agent_dedup_df.set_index("anonymous_id", inplace=True)




In [85]:
athena_anonymous_id_to_user_agent_dedup_df.head(2)

Unnamed: 0_level_0,a_user_agent
anonymous_id,Unnamed: 1_level_1
000034a2-e973-4108-b920-0681877d4fc0,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
0000628f-db5d-4554-96eb-66454e203e92,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...


In [86]:
segment_anonymous_id_to_user_agent_dedup_df.head(2)

Unnamed: 0_level_0,s_user_agent
anonymous_id,Unnamed: 1_level_1
000034a2-e973-4108-b920-0681877d4fc0,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
0000628f-db5d-4554-96eb-66454e203e92,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...


In [87]:
combined_anonymous_id_to_user_agent_df = athena_anonymous_id_to_user_agent_dedup_df.merge(segment_anonymous_id_to_user_agent_dedup_df, how="outer", left_index=True, right_index=True)


In [88]:
combined_anonymous_id_to_user_agent_df.head(1)

Unnamed: 0_level_0,a_user_agent,s_user_agent
anonymous_id,Unnamed: 1_level_1,Unnamed: 2_level_1
000034a2-e973-4108-b920-0681877d4fc0,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...


### Check if Segment and Kinesis disagree at all

In [89]:
print("%i segment (anonymous_id, user_agent) pairs" % len(segment_anonymous_id_to_user_agent_df))
print("%i athena (anonymous_id, user_agent) pairs" % len(athena_anonymous_id_to_user_agent_df))

648745 segment (anonymous_id, user_agent) pairs
772899 athena (anonymous_id, user_agent) pairs


In [90]:
# combined_anonymous_id_to_user_agent_df[(combined_anonymous_id_to_user_agent_df.a_user_agent.isnull())]

In [91]:
s_not_a = combined_anonymous_id_to_user_agent_df[(combined_anonymous_id_to_user_agent_df.s_user_agent.str.len()>0) \
                                                 & ((combined_anonymous_id_to_user_agent_df.a_user_agent.isnull()) |(combined_anonymous_id_to_user_agent_df.a_user_agent.str.len()==0))]
a_not_s = combined_anonymous_id_to_user_agent_df[(combined_anonymous_id_to_user_agent_df.a_user_agent.str.len()>0) \
                                                 & ((combined_anonymous_id_to_user_agent_df.s_user_agent.isnull()) |(combined_anonymous_id_to_user_agent_df.s_user_agent.str.len()==0))]

In [92]:
s_not_a.head()

Unnamed: 0_level_0,a_user_agent,s_user_agent
anonymous_id,Unnamed: 1_level_1,Unnamed: 2_level_1
000392da-fb0f-4416-bf9a-573c0fd36007,,Mozilla/5.0 (compatible; Baiduspider-render/2....
0004c260-2f62-4ccf-a68a-d968bdc63317,,Mozilla/5.0 (Linux; U; Android 8.1.0; zh-CN; E...
00068df3-8ad3-4263-9d5f-6457fa83900e,,Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; ...
001286b2-7623-4755-86ac-908a274849b9,,Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Ma...
0013e71e-0193-4756-8da4-ca688f74f7b6,,Mozilla/5.0 (Linux; Android 7.1.2; Redmi 5A Bu...


In [93]:
a_not_s.head()

Unnamed: 0_level_0,a_user_agent,s_user_agent
anonymous_id,Unnamed: 1_level_1,Unnamed: 2_level_1
0000ca0d-320e-4d34-9d9d-6ae23575f78b,Mozilla/5.0 (Linux; Android 9; VCE-L22) AppleW...,
0000e342-eac1-4e2c-a78e-b8769670102e,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...,
00014195-89b0-4900-9c93-257a7e83883d,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ...,
0001edaf-b798-43d7-91c0-2c69eb074bc3,Mozilla/5.0 (Linux; Android 10; Pixel 3) Apple...,
00022ba4-b56d-4be8-8c0b-cb559ff20ae2,Mozilla/5.0 (Linux; Android 10; HMA-L29) Apple...,


In [94]:
total_count = len(combined_anonymous_id_to_user_agent_df)
s_not_a_count = len(s_not_a)
a_not_s_count = len(a_not_s)
print("%i / %i are in segment, not athena (%.1f percent)" % (s_not_a_count, total_count, s_not_a_count / total_count *100))
print("%i / %i are in athena, not segment (%.1f percent)" % (a_not_s_count, total_count, a_not_s_count / total_count *100))
print("If you include countries that aren't on Segment, i.e. ID, PH, TW, then you'd expect more from athena")

3329 / 764001 are in segment, not athena (0.4 percent)
126706 / 764001 are in athena, not segment (16.6 percent)
If you include countries that aren't on Segment, i.e. ID, PH, TW, then you'd expect more from athena
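
An alternative way to get the same "in one source but not the other" counts is the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column saying where each row came from. This is a sketch on made-up ids, not the code used to produce the numbers above.

```python
import pandas as pd

# Toy lookups: one row per anonymous_id in each source
segment_ids = pd.DataFrame({
    "anonymous_id": ["a", "b", "c"],
    "s_user_agent": ["UA-1", "UA-2", "UA-3"],
})
athena_ids = pd.DataFrame({
    "anonymous_id": ["b", "c", "d", "e"],
    "a_user_agent": ["UA-2", "UA-3", "UA-4", "UA-5"],
})

# athena is the left frame, segment the right, mirroring the merge in the notebook
both = athena_ids.merge(segment_ids, on="anonymous_id", how="outer", indicator=True)

# _merge is "left_only" (athena only), "right_only" (segment only) or "both"
presence = both["_merge"].astype(str).value_counts().to_dict()

total = len(both)
print("%i / %i are in segment, not athena" % (presence.get("right_only", 0), total))
print("%i / %i are in athena, not segment" % (presence.get("left_only", 0), total))
```

Unlike the string-length checks, this does not treat an empty user agent string as missing, so the two approaches can differ if empty strings are present.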


### Get an idea of how many don't have matching user_agents

In [95]:
df = combined_anonymous_id_to_user_agent_df.groupby("anonymous_id").size().reset_index()
duplicates = df[df[0]>1]
print("%i duplicate anonymous_ids - should be none at this stage" % len(duplicates))

0 duplicate anonymous_ids - should be none at this stage


In [96]:
non_matching_excl_nulls = combined_anonymous_id_to_user_agent_df[(combined_anonymous_id_to_user_agent_df.s_user_agent != combined_anonymous_id_to_user_agent_df.a_user_agent) \
                                                                 & ~combined_anonymous_id_to_user_agent_df.s_user_agent.isnull() \
                                                                 & ~combined_anonymous_id_to_user_agent_df.a_user_agent.isnull()]
print("%i User agent strings don't match" % len(non_matching_excl_nulls))
print("Look for changes in browser version for instance.  Don't worry about every last one.")
non_matching_excl_nulls.head()

117 User agent strings don't match
Look for changes in browser version for instance.  Don't worry about every last one.


Unnamed: 0_level_0,a_user_agent,s_user_agent
anonymous_id,Unnamed: 1_level_1,Unnamed: 2_level_1
00bb6ab9-e694-4869-9bc9-aae201d09c68,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
028ebb50-7fe8-4e43-b40e-e729be72ef0a,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
037a6987-94d0-4025-965d-7c4340dd40f0,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...
060713d6-777c-4651-934b-9d70d753a490,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ...,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ...
077a3086-8cce-4789-b10f-bfda134b935b,Mozilla/5.0 (Linux; Android 10; SAMSUNG SM-G97...,Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Bu...


### Create a Single user agent string per anonymous_id

In [97]:
combined_anonymous_id_to_user_agent_single_col_df = combined_anonymous_id_to_user_agent_df["a_user_agent"]\
        .fillna(combined_anonymous_id_to_user_agent_df["s_user_agent"]).reset_index().set_index("anonymous_id")
combined_anonymous_id_to_user_agent_single_col_df.rename(columns={"a_user_agent":"user_agent"}, inplace=True)
combined_anonymous_id_to_user_agent_single_col_df.head()

Unnamed: 0_level_0,user_agent
anonymous_id,Unnamed: 1_level_1
000034a2-e973-4108-b920-0681877d4fc0,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
0000628f-db5d-4554-96eb-66454e203e92,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
00007ce2-f710-4f78-bf44-93fcc7e68c24,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...
0000b18c-192c-4f9b-b3c4-4b8f65a9e197,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
0000ca0d-320e-4d34-9d9d-6ae23575f78b,Mozilla/5.0 (Linux; Android 9; VCE-L22) AppleW...
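
To make the coalescing step concrete, here is a toy version: take the Athena user agent where there is one, otherwise fall back to the Segment one. The ids and values are invented; `combine_first` is shown as an equivalent spelling of the `fillna` call.

```python
import pandas as pd

# Toy combined lookup: id2 is missing from athena, id3 is missing from segment
toy = pd.DataFrame(
    {
        "a_user_agent": ["UA-1", None, "UA-3"],
        "s_user_agent": ["UA-1b", "UA-2", None],
    },
    index=pd.Index(["id1", "id2", "id3"], name="anonymous_id"),
)

# Prefer athena, fall back to segment, and end up with a one-column DataFrame
single = toy["a_user_agent"].fillna(toy["s_user_agent"]).rename("user_agent").to_frame()

# Same result via combine_first
also_single = toy["a_user_agent"].combine_first(toy["s_user_agent"])

print(single)
```

When both sources have a value (id1 here), the Athena one is kept and the Segment one is ignored.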


In [98]:
# This bit is for development where I keep appending the user_agent column and it generates user_agent_x etc
user_agent_cols_to_delete = [z for z in merged_df_with_meta.columns if z.startswith("user_agent")]
print(" Removing %s "%str(user_agent_cols_to_delete))
merged_df_with_meta.drop(columns=user_agent_cols_to_delete, inplace=True)

 Removing [] 


### Useful segmentation / convert user agent to browser etc

In [99]:
def convert_user_agent_to_useful_strings(user_agent_string):
    """
    Parse a user agent string into
    [device_family, os_family, os_version, browser_family, browser_version, is_bot].

    Roughly matches https://github.com/moneysmartco/metl/blob/e13086fae453911bed5a40cb51ff0869e2f3a0ce/scripts/python/device_tagger.py
    """
    user_agent = user_agents.parse(user_agent_string)

    # Coarse device bucket; the first matching flag wins
    if user_agent.is_pc:
        device_family = 'desktop'
    elif user_agent.is_mobile:
        device_family = 'mobile'
    elif user_agent.is_tablet:
        device_family = 'tablet'
    else:
        device_family = 'other'

    os_family = user_agent.os.family
    os_version = user_agent.os.version_string
    browser_family = user_agent.browser.family
    browser_version = user_agent.browser.version_string

    is_bot = user_agent.is_bot

    return [device_family, os_family, os_version, browser_family, browser_version, is_bot]



There's an important optimisation going on here (although it still isn't that quick).

If you just do .apply across all the rows, it's super slow (many minutes, e.g. 278s vs 24s for my better version).  I tried the optimisation at https://ys-l.github.io/posts/2015/08/28/how-not-to-use-pandas-apply/, but that didn't seem to provide any benefit (or I slowed it down in other ways).

So instead I'm taking the unique user_agents, processing each one once, and then doing a join, without creating Series objects along the way.

There's probably more improvement possible (e.g. creating the full data structure to insert into up front, or generating fewer arrays), but it's fast enough for me right now.
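
The shape of that optimisation, stripped down to toy data, looks roughly like this. `toy_parse` is a hypothetical stand-in for `convert_user_agent_to_useful_strings` (it does not use the `user_agents` library), and the event rows are invented.

```python
import pandas as pd

def toy_parse(user_agent):
    """Hypothetical stand-in for the real parser: returns [device_family, is_bot]."""
    is_mobile = "iPhone" in user_agent or "Android" in user_agent
    return ["mobile" if is_mobile else "desktop", "bot" in user_agent.lower()]

# Many rows, few distinct user agents (the real data has the same property)
events = pd.DataFrame({
    "user_agent": ["UA iPhone", "UA Windows", "UA iPhone", "UA iPhone", "Googlebot"],
})

# 1. Parse each distinct user agent exactly once
distinct = events["user_agent"].unique()
rows = [[ua] + toy_parse(ua) for ua in distinct]

# 2. Build a small lookup table from plain lists (no per-row Series objects)
lookup = pd.DataFrame(rows, columns=["user_agent", "device_family", "is_bot"])

# 3. Join the lookup back onto the full event table
enriched = events.merge(lookup, on="user_agent", how="left")

print("%i rows, %i distinct user agents parsed" % (len(events), len(distinct)))
print(enriched)
```

The saving comes from step 1: the expensive parse runs once per distinct string rather than once per row, and the join in step 3 is handled inside pandas.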

In [100]:
distinct_user_agents = combined_anonymous_id_to_user_agent_single_col_df.user_agent.unique()

In [101]:
distinct_user_agents[:10]

array(['Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) GSA/91.1.292041477 Mobile/15E148 Safari/604.1',
       'Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Mobile/15E148 Safari/604.1',
       'Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Mobile/15E148 Safari/604.1',
       'Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) GSA/88.1.284108841 Mobile/15E148 Safari/604.1',
       'Mozilla/5.0 (Linux; Android 9; VCE-L22) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Mobile Safari/537.36',
       'Mozilla/5.0 (iPhone; CPU iPhone OS 12_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.2 Mobile/15E148 Safari/604.1',
       'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Saf

In [102]:
len(distinct_user_agents)

31429

In [103]:
# This isn't fast, but acceptable
start_time = datetime.now()
print("Starting to add user agent data at %s"% start_time.isoformat())
#meta_df = combined_anonymous_id_to_user_agent_single_col_df.apply(lambda x: pd.Series(convert_user_agent_to_useful_strings(x.user_agent)), axis=1)
meta_rows = [[z, ]+convert_user_agent_to_useful_strings(z)  for z in distinct_user_agents]
#d = dfcombined_anonymous_id_to_user_agent_single_col_df.merge(meta_df)
end_time = datetime.now()
seconds_taken = (end_time - start_time).total_seconds()
print("Took %i seconds to process" % seconds_taken)

Starting to add user agent data at 2020-02-17T10:01:51.064056
Took 35 seconds to process


In [104]:
user_agent_meta_df = pd.DataFrame(meta_rows)

user_agent_meta_df.rename(columns = {0:"user_agent", 1:"device_family", 2:"os_family", 3:"os_version", 4:"browser_family",5:"browser_version", 6:"is_bot"}, inplace=True)
user_agent_meta_df.set_index("user_agent", inplace=True)


In [105]:
# Try to make the data types a bit efficient
user_agent_meta_df = user_agent_meta_df.astype({ "device_family":"category", "os_family":"category", "os_version":"category", "browser_family":"category","browser_version":"category","is_bot":"bool"})
user_agent_meta_df.dtypes

device_family      category
os_family          category
os_version         category
browser_family     category
browser_version    category
is_bot                 bool
dtype: object
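
As a rough illustration of why the category dtype helps here, the sketch below compares memory use for a low-cardinality string column stored as plain Python strings versus as a category. The values are made up and the exact byte counts will vary by pandas version, so none are quoted.

```python
import pandas as pd

# A column with many rows but only three distinct values, like os_family
values = ["iOS", "Android", "Windows"] * 10_000

as_object = pd.Series(values, dtype="object")
as_category = as_object.astype("category")

# deep=True counts the Python string objects themselves, not just the pointers
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("object: %i bytes, category: %i bytes" % (object_bytes, category_bytes))
```

A category stores each distinct string once plus a small integer code per row, so the benefit shrinks for columns where almost every value is unique.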

In [106]:
user_agent_meta_df.head()

Unnamed: 0_level_0,device_family,os_family,os_version,browser_family,browser_version,is_bot
user_agent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) GSA/91.1.292041477 Mobile/15E148 Safari/604.1",mobile,iOS,13.3,Google,91.1.292041477,False
"Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Mobile/15E148 Safari/604.1",mobile,iOS,13.3,Mobile Safari,13.0.4,False
"Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Mobile/15E148 Safari/604.1",mobile,iOS,13.3.1,Mobile Safari,13.0.5,False
"Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) GSA/88.1.284108841 Mobile/15E148 Safari/604.1",mobile,iOS,13.3,Google,88.1.284108841,False
"Mozilla/5.0 (Linux; Android 9; VCE-L22) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Mobile Safari/537.36",mobile,Android,9,Chrome Mobile,80.0.3987,False


In [107]:
if False:# This is super slow currently.
    start_time = datetime.now()
    print("Starting to add user agent data at %s"% start_time.isoformat())
    #meta_df = combined_anonymous_id_to_user_agent_single_col_df.apply(lambda x: pd.Series(convert_user_agent_to_useful_strings(x.user_agent)), axis=1)
    meta_df = combined_anonymous_id_to_user_agent_single_col_df.apply(lambda x: convert_user_agent_to_useful_strings(x.user_agent), axis=1, result_type="expand")
    #d = dfcombined_anonymous_id_to_user_agent_single_col_df.merge(meta_df)
    end_time = datetime.now()
    seconds_taken = (end_time - start_time).total_seconds()
    print("Took %i seconds to process" % seconds_taken)

In [108]:
if False:
    # Trying something faster - based on https://ys-l.github.io/posts/2015/08/28/how-not-to-use-pandas-apply/
    # but it hasn't worked so far: even after tidying it takes 257s, slower than the original.
    start_time = datetime.now()
    print("Starting to add user agent data at %s"% start_time.isoformat())
    new_cols = [[] for _ in range(6)] # six independent empty lists ([[]]*6 would be six references to one list)
    num_new_cols = len(new_cols)
    #for row_num, (_, row) in enumerate(combined_anonymous_id_to_user_agent_single_col_df.iterrows()):
    for _, row in combined_anonymous_id_to_user_agent_single_col_df.iterrows():
        #if row_num % 100000==0:
        #    print("row %i"%row_num)
        vals = convert_user_agent_to_useful_strings(row.user_agent)
        #for i in range(len(vals)):
            #new_cols[i].append(vals[i])
        new_cols[0].append(vals[0])
        new_cols[1].append(vals[1])
        new_cols[2].append(vals[2])
        new_cols[3].append(vals[3])
        new_cols[4].append(vals[4])
        new_cols[5].append(vals[5])
        

    print("New cols generated at %s"% datetime.now().isoformat())
    # meta_df = combined_anonymous_id_to_user_agent_single_col_df.apply(lambda x: convert_user_agent_to_useful_strings(x.user_agent), axis=1, result_type="expand")
    #d = dfcombined_anonymous_id_to_user_agent_single_col_df.merge(meta_df)
    meta_df = pd.DataFrame({
        "device_family": new_cols[0], 
         "os_family" : new_cols[1], 
         "os_version" : new_cols[2], 
         "browser_family" : new_cols[3], 
         "browser_version":new_cols[4], 
         "is_bot":new_cols[5] 


    })
    print("Additional data frame generated at %s"% datetime.now().isoformat())
    end_time = datetime.now()
    seconds_taken = (end_time - start_time).total_seconds()
    print("Took %i seconds to process" % seconds_taken)
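
A general Python pitfall worth keeping in mind when pre-allocating one list per output column: multiplying a list of lists copies references, not lists. The snippet is a standalone illustration with made-up values.

```python
# Multiplying the outer list repeats a reference to the SAME inner list
shared = [[]] * 3
shared[0].append("x")

# A comprehension creates a fresh inner list on every iteration
separate = [[] for _ in range(3)]
separate[0].append("x")

print(shared)
print(separate)
```

With the shared version every "column" ends up holding every appended value, so any DataFrame built from it would have columns several times longer than intended.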

### Join onto the main dataframe 

In [109]:
merged_df_with_meta = merged_df_with_meta.merge(combined_anonymous_id_to_user_agent_single_col_df, on="anonymous_id", how="left")

In [110]:
merged_df_with_meta.head(2)

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code,user_agent
0,000034a2-e973-4108-b920-0681877d4fc0,PageView,https://blog.moneysmart.sg/budgeting/mattress-...,2020-02-15,1,1,blog,/budgeting/mattress-singapore-guide,/budgeting,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...
1,000034a2-e973-4108-b920-0681877d4fc0,PageView,https://blog.moneysmart.sg/property/3-things-l...,2020-02-10,1,1,blog,/property/3-things-look-buying-condo-2017,/property,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...


In [111]:
# add on the user agent breakdown

merged_df_with_meta = merged_df_with_meta.merge(user_agent_meta_df, on="user_agent", how="left")

In [112]:
merged_df_with_meta.head()

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code,user_agent,device_family,os_family,os_version,browser_family,browser_version,is_bot
0,000034a2-e973-4108-b920-0681877d4fc0,PageView,https://blog.moneysmart.sg/budgeting/mattress-...,2020-02-15,1,1,blog,/budgeting/mattress-singapore-guide,/budgeting,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Google,91.1.292041477,False
1,000034a2-e973-4108-b920-0681877d4fc0,PageView,https://blog.moneysmart.sg/property/3-things-l...,2020-02-10,1,1,blog,/property/3-things-look-buying-condo-2017,/property,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Google,91.1.292041477,False
2,000034a2-e973-4108-b920-0681877d4fc0,Reading,https://blog.moneysmart.sg/budgeting/mattress-...,2020-02-15,3,3,blog,/budgeting/mattress-singapore-guide,/budgeting,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Google,91.1.292041477,False
3,000034a2-e973-4108-b920-0681877d4fc0,Reading,https://blog.moneysmart.sg/property/3-things-l...,2020-02-10,4,4,blog,/property/3-things-look-buying-condo-2017,/property,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Google,91.1.292041477,False
4,0000628f-db5d-4554-96eb-66454e203e92,PageView,https://www.moneysmart.sg/embed/dc96c1e58d2f68...,2020-02-09,1,1,shop,/embed/dc96c1e58d2f6855228962060a1a8b77,/embed,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Mobile Safari,13.0.4,False


In [113]:
# Check whether any rows are still missing a user agent after the join
merged_df_with_meta[merged_df_with_meta.user_agent.isnull()].head()

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code,user_agent,device_family,os_family,os_version,browser_family,browser_version,is_bot
1599281,fdff7776-4823-4e3e-9484-7d69fa95e1e1,PageView,https://blog.moneysmart.hk/zh-hk/mortgage/日出康城...,2020-02-15,2,0,blog,/zh-hk/mortgage/日出康城-6期-領都-montara-malibu,/mortgage,control,hk,,,,,,,
1599282,fdff7776-4823-4e3e-9484-7d69fa95e1e1,Reading,https://blog.moneysmart.hk/zh-hk/mortgage/日出康城...,2020-02-15,2,0,blog,/zh-hk/mortgage/日出康城-6期-領都-montara-malibu,/mortgage,control,hk,,,,,,,
1736887,4eee11bd-5453-4922-b4da-e8006c01549b,PageView,https://blog.moneysmart.sg/shopping/best-sex-s...,2020-02-11,0,1,blog,/shopping/best-sex-shops-singapore,/shopping,control,sg,,,,,,,
1815990,805b0e2a-9eac-4384-bac6-cac15fc8914f,PageView,https://blog.moneysmart.sg/entertainment/count...,2020-02-14,0,1,blog,/entertainment/country-clubs-singapore,/entertainment,control,sg,,,,,,,
1815991,805b0e2a-9eac-4384-bac6-cac15fc8914f,PageView,https://blog.moneysmart.sg/family/push-gift/,2020-02-14,0,1,blog,/family/push-gift,/family,control,sg,,,,,,,


In [114]:
# Check whether any rows are still missing the user agent breakdown
merged_df_with_meta[merged_df_with_meta.device_family.isnull()].head()

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code,user_agent,device_family,os_family,os_version,browser_family,browser_version,is_bot
1599281,fdff7776-4823-4e3e-9484-7d69fa95e1e1,PageView,https://blog.moneysmart.hk/zh-hk/mortgage/日出康城...,2020-02-15,2,0,blog,/zh-hk/mortgage/日出康城-6期-領都-montara-malibu,/mortgage,control,hk,,,,,,,
1599282,fdff7776-4823-4e3e-9484-7d69fa95e1e1,Reading,https://blog.moneysmart.hk/zh-hk/mortgage/日出康城...,2020-02-15,2,0,blog,/zh-hk/mortgage/日出康城-6期-領都-montara-malibu,/mortgage,control,hk,,,,,,,
1736887,4eee11bd-5453-4922-b4da-e8006c01549b,PageView,https://blog.moneysmart.sg/shopping/best-sex-s...,2020-02-11,0,1,blog,/shopping/best-sex-shops-singapore,/shopping,control,sg,,,,,,,
1815990,805b0e2a-9eac-4384-bac6-cac15fc8914f,PageView,https://blog.moneysmart.sg/entertainment/count...,2020-02-14,0,1,blog,/entertainment/country-clubs-singapore,/entertainment,control,sg,,,,,,,
1815991,805b0e2a-9eac-4384-bac6-cac15fc8914f,PageView,https://blog.moneysmart.sg/family/push-gift/,2020-02-14,0,1,blog,/family/push-gift,/family,control,sg,,,,,,,
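
The two checks above only show the first few unmatched rows. To see how many rows a left join failed to match, pandas' `merge` accepts `indicator=True` (adds a `_merge` column) and `validate="many_to_one"` (raises if the right-hand keys are not unique). A minimal sketch on made-up data; `events_df` and `agents_df` are illustrative names, not frames from this notebook:

```python
import pandas as pd

# Made-up stand-ins for the main frame and the anonymous_id -> user_agent lookup
events_df = pd.DataFrame({
    "anonymous_id": ["a", "a", "b", "c"],
    "event_name": ["PageView", "Reading", "PageView", "PageView"],
})
agents_df = pd.DataFrame({
    "anonymous_id": ["a", "b"],
    "user_agent": ["agent-1", "agent-2"],
})

joined_df = events_df.merge(
    agents_df,
    on="anonymous_id",
    how="left",
    validate="many_to_one",  # raises MergeError if an anonymous_id repeats on the right
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Rows with no match on the right-hand side
unmatched_rows = int((joined_df["_merge"] == "left_only").sum())
null_user_agents = int(joined_df["user_agent"].isnull().sum())
unmatched_fraction = unmatched_rows / len(joined_df)
print("%i of %i rows have no user agent" % (unmatched_rows, len(joined_df)))
```

The `validate` argument also guards against the join silently multiplying rows if the lookup ever contains duplicate keys.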


### Clean up data frames / save some memory

In [115]:
# TODO: could do a lot more here
segment_anonymous_id_to_user_agent_full_df = None
segment_anonymous_id_to_user_agent_df = None
athena_anonymous_id_to_user_agent_full_df = None
athena_anonymous_id_to_user_agent_df = None
sd = None
ad = None
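
One further memory saving that might be worth trying is converting low-cardinality string columns (for example `country_code` or `page_type`) to the pandas `category` dtype, and calling `gc.collect()` after dropping references. This is a sketch on a made-up frame, not something this notebook currently does; the size of the saving on the real data depends on how many distinct values each column holds.

```python
import gc
import pandas as pd

# Made-up frame with repetitive string columns, similar in shape to the real one
frame = pd.DataFrame({
    "country_code": ["sg", "hk"] * 5000,
    "page_type": ["blog", "shop"] * 5000,
    "s_count": range(10000),
})

bytes_before = int(frame.memory_usage(deep=True).sum())

# Store each distinct string once and keep small integer codes per row
for column_name in ["country_code", "page_type"]:
    frame[column_name] = frame[column_name].astype("category")

bytes_after = int(frame.memory_usage(deep=True).sum())
print("Memory: %i bytes -> %i bytes" % (bytes_before, bytes_after))

# Dropping the last reference and collecting lets Python release the memory
scratch_df = frame.copy()
del scratch_df
gc.collect()
```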

# Play Area

In [116]:
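# Sum only the count columns; summing the string columns is not meaningful
d = merged_df_with_meta[merged_df_with_meta.page_type=="iss"].groupby(["slug", "page_type"])[["s_count", "k_count"]].sum()
d[d.s_count>0]
# regex=False so that "iss." is matched literally rather than as "iss" plus any character
merged_df_with_meta[(merged_df_with_meta.page_type=="iss") & (merged_df_with_meta.page_url.str.contains("iss.", regex=False))]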

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code,user_agent,device_family,os_family,os_version,browser_family,browser_version,is_bot
10,0000e821-1060-4146-9374-2e32bea14f00,LeadGeneration.RedirectCompleted,https://iss.moneysmart.hk/zh-hk/credit-cards/h...,2020-02-12,1,0,iss,/zh-hk/credit-cards/hang-seng-enjoy-card/redirect,/credit-cards,control,hk,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...,mobile,iOS,13.3.1,Mobile Safari,13.0.5,False
11,0000e821-1060-4146-9374-2e32bea14f00,PageView,https://iss.moneysmart.hk/zh-hk/credit-cards/h...,2020-02-12,1,1,iss,/zh-hk/credit-cards/hang-seng-enjoy-card/redirect,/credit-cards,control,hk,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...,mobile,iOS,13.3.1,Mobile Safari,13.0.5,False
1463,00382cbe-0ca3-470a-8614-c045a4d18770,LeadGeneration.RedirectCompleted,https://iss.moneysmart.sg/credit-cards/citi-ca...,2020-02-13,1,1,iss,/credit-cards/citi-cash-back-card/redirect,/credit-cards,control,sg,Mozilla/5.0 (Linux; Android 9; SM-G965F Build/...,mobile,Android,9,Instagram,126.0.0,False
1464,00382cbe-0ca3-470a-8614-c045a4d18770,PageView,https://iss.moneysmart.sg/credit-cards/citi-ca...,2020-02-13,2,2,iss,/credit-cards/citi-cash-back-card/redirect,/credit-cards,control,sg,Mozilla/5.0 (Linux; Android 9; SM-G965F Build/...,mobile,Android,9,Instagram,126.0.0,False
1791,00468bde-5bf1-4e89-b0ce-13869bd77c65,LeadGeneration.RedirectCompleted,https://iss.moneysmart.sg/credit-cards/citi-ca...,2020-02-09,1,1,iss,/credit-cards/citi-cash-back-card/redirect,/credit-cards,control,sg,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3...,desktop,Mac OS X,10.15.3,Chrome,79.0.3945,False
1792,00468bde-5bf1-4e89-b0ce-13869bd77c65,LeadGeneration.RedirectCompleted,https://iss.moneysmart.sg/credit-cards/citiban...,2020-02-09,1,1,iss,/credit-cards/citibank-smrt-platinum-visa-card...,/credit-cards,control,sg,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3...,desktop,Mac OS X,10.15.3,Chrome,79.0.3945,False
1796,00468bde-5bf1-4e89-b0ce-13869bd77c65,PageView,https://iss.moneysmart.sg/credit-cards/citi-ca...,2020-02-09,1,1,iss,/credit-cards/citi-cash-back-card/redirect,/credit-cards,control,sg,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3...,desktop,Mac OS X,10.15.3,Chrome,79.0.3945,False
1797,00468bde-5bf1-4e89-b0ce-13869bd77c65,PageView,https://iss.moneysmart.sg/credit-cards/citiban...,2020-02-09,1,1,iss,/credit-cards/citibank-smrt-platinum-visa-card...,/credit-cards,control,sg,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3...,desktop,Mac OS X,10.15.3,Chrome,79.0.3945,False
1807,00470afa-6d2b-4a2a-b921-6fea0a56c71f,LeadGeneration.RedirectCompleted,https://iss.moneysmart.sg/credit-cards/america...,2020-02-14,1,1,iss,/credit-cards/american-express-singapore-airli...,/credit-cards,control,sg,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,desktop,Windows,10,Chrome,79.0.3945,False
1808,00470afa-6d2b-4a2a-b921-6fea0a56c71f,LeadGeneration.RedirectCompleted,https://iss.moneysmart.sg/credit-cards/citi-pr...,2020-02-14,1,1,iss,/credit-cards/citi-premiermiles-card/redirect,/credit-cards,control,sg,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,desktop,Windows,10,Chrome,79.0.3945,False


In [117]:
merged_df_with_meta[(merged_df_with_meta.slug_root=="/zh-hk") & (merged_df_with_meta.country_code=="hk") & (merged_df_with_meta.page_type!="blog")].head(40) #.groupby(["slug"]).sum()

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code,user_agent,device_family,os_family,os_version,browser_family,browser_version,is_bot


In [118]:
merged_df_with_meta.head()

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code,user_agent,device_family,os_family,os_version,browser_family,browser_version,is_bot
0,000034a2-e973-4108-b920-0681877d4fc0,PageView,https://blog.moneysmart.sg/budgeting/mattress-...,2020-02-15,1,1,blog,/budgeting/mattress-singapore-guide,/budgeting,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Google,91.1.292041477,False
1,000034a2-e973-4108-b920-0681877d4fc0,PageView,https://blog.moneysmart.sg/property/3-things-l...,2020-02-10,1,1,blog,/property/3-things-look-buying-condo-2017,/property,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Google,91.1.292041477,False
2,000034a2-e973-4108-b920-0681877d4fc0,Reading,https://blog.moneysmart.sg/budgeting/mattress-...,2020-02-15,3,3,blog,/budgeting/mattress-singapore-guide,/budgeting,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Google,91.1.292041477,False
3,000034a2-e973-4108-b920-0681877d4fc0,Reading,https://blog.moneysmart.sg/property/3-things-l...,2020-02-10,4,4,blog,/property/3-things-look-buying-condo-2017,/property,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Google,91.1.292041477,False
4,0000628f-db5d-4554-96eb-66454e203e92,PageView,https://www.moneysmart.sg/embed/dc96c1e58d2f68...,2020-02-09,1,1,shop,/embed/dc96c1e58d2f6855228962060a1a8b77,/embed,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,mobile,iOS,13.3,Mobile Safari,13.0.4,False


# Store Data Frame for Faster Loading etc.

When stored as a gzip-compressed parquet file, the data frame is actually very small: 3 days of data comes to roughly 30MB.

In [119]:
!pip install fastparquet

You are using pip version 10.0.1, however version 20.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [120]:
if save_end_dataframe_to_file:
    from_to_str = "_to_".join([z.strftime("%Y%m%d_%H%M") for z in [from_datetime, to_datetime]])
    parquet_filename = "merged_df_with_meta_"+from_to_str+".gzip"
    
    merged_df_with_meta.to_parquet(parquet_filename, compression='gzip')
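
To load the saved file in another notebook, the same filename can be rebuilt from the date range and passed to `pd.read_parquet`. A minimal sketch: `build_parquet_filename` and `load_merged_frame` are hypothetical helper names, and reading requires a parquet engine (pyarrow or fastparquet) to be installed.

```python
from datetime import datetime, timedelta
import pandas as pd

def build_parquet_filename(from_dt, to_dt):
    """Rebuild the filename using the same pattern as the save cell."""
    from_to_str = "_to_".join(z.strftime("%Y%m%d_%H%M") for z in [from_dt, to_dt])
    return "merged_df_with_meta_" + from_to_str + ".gzip"

def load_merged_frame(filename):
    """Hypothetical helper: read the saved parquet file back into a data frame."""
    return pd.read_parquet(filename)

# Same date range as the Settings cell at the top of the notebook
to_dt = datetime(year=2020, month=2, day=16)
from_dt = to_dt - timedelta(days=7)
example_filename = build_parquet_filename(from_dt, to_dt)
print(example_filename)  # merged_df_with_meta_20200209_0000_to_20200216_0000.gzip
```

Calling `load_merged_frame(example_filename)` would then return the frame, provided the file was saved with the same settings.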

    

In [121]:
# TODO: look into the AB test data more. I think the URLs differ between segment and kinesis
# (but I think we've found the origin, and it may already be fixed or a non-issue).