# Introduction

This notebook compares event counts between Segment and Kinesis during the move from one pipeline to the other.

As the counts inevitably don't match, it also provides detailed segmentation and search / exploratory tools towards the bottom.

For a large number of days, you'll want a lot of RAM (16GB or 32GB).  For single day experimenting, you will be fine with 8GB.

Running the whole script takes quite a long time initially, mainly because of the Segment query (minutes to tens of minutes).  Once this has been done, further exploration is generally very quick (less than a second to a few seconds).

It's not overly optimised, but some steps have been taken to reduce memory.

# What this notebook (in particular) does

This notebook is about extracting and joining the raw data.  The result can then be saved to file (parquet) and loaded into other notebooks for analysis / visualisation.

# Requirements / Jupyter Extensions

Install these through jupyterlab extension manager (if using jupyterlab)
* jupyter-widgets
* plotly (and ideally chart studio too)

In [1]:
# Safe imports
from datetime import datetime, timedelta, date

# Settings

In [2]:
num_days_to_query = 7
#from_datetime = datetime.now() - timedelta(days = 5)
#from_datetime = datetime(year=2020, month=1, day=4)
#to_datetime = from_datetime+ timedelta(days=num_days_to_query)
to_datetime = datetime(year=2020, month=2, day=23)
from_datetime = to_datetime - timedelta(days=num_days_to_query)
include_device_segmentation = True #E.g. iphone users.  This will use more memory (and likely slow things a bit).
save_end_dataframe_to_file = True #Saves a parquet for easy loading after crashes, or in other tools

# Imports

In [3]:
# Run imports that might require installation to the environment, and install if necessary.
import sys  # needed for the {sys.executable} pip calls below

try:
    import psycopg2
except ImportError:
    print("Failed to import psycopg2, trying to install it")
    !{sys.executable} -m pip install psycopg2-binary
    import psycopg2
    print("Successfully installed")

try:
    import dateparser
except ImportError:
    print("Failed to import dateparser, trying to install it")
    !{sys.executable} -m pip install dateparser
    import dateparser
    print("Successfully installed")

try:
    import pyathena  # used in other imports, so really just checking it's available
except ImportError:
    print("Failed to import pyathena, trying to install it")
    !{sys.executable} -m pip install pyathena
    import pyathena
    print("Successfully installed")

try:
    import user_agents
except ImportError:
    print("Failed to import user_agents, trying to install it")
    !{sys.executable} -m pip install user_agents
    import user_agents
    print("Successfully installed")

import ipywidgets as widgets
import pandas as pd  # used as pd throughout the rest of the notebook
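
The try / install / re-import pattern above repeats for each package. A minimal sketch of folding it into one helper follows. `ensure_package` is a hypothetical name, not something this notebook or its modules define, and it assumes `pip` is available to the kernel's interpreter.

```python
import importlib
import subprocess
import sys


def ensure_package(module_name, pip_name=None):
    """Import module_name, installing it with pip first if the import fails.

    Hypothetical helper: pip_name covers cases where the PyPI name differs
    from the module name (e.g. psycopg2 is installed as psycopg2-binary).
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        print("Failed to import %s, trying to install it" % module_name)
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or module_name]
        )
        importlib.invalidate_caches()
        return importlib.import_module(module_name)


# A module that is already present is simply imported and returned.
json_module = ensure_package("json")
```

Using `sys.executable` rather than a bare `pip` keeps the install in the same environment as the running kernel.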
    



In [4]:
# Imports on files that might have dependencies that need installing
import data_pier_querying
from athena_querying import AthenaQuery
from athena_common_queries import *
import user_agents # this converts user agent from browser to mobile / desktop etc.

# Kinesis Data via Athena

Data flows tracker -> Kinesis -> S3 (plus a further S3 transform).  We can then query S3 using Athena.

In [5]:
aq = AthenaQuery()

In [6]:
aq.connect()

In [7]:
athena_database = "ms_data_lake_production"
athena_raw_events_table = "ms_data_stream_production_processed"

In [8]:
#query = "select context.page_url, body.event_name, count(*) from "+athena_database+"."+athena_raw_events_table
#query += " where partition_0='2019' and partition_1>='12' and partition_2>='05' group by 1,2"

In [9]:
# I've removed the device_type data to save memory, but it would be useful.
query = create_generic_event_query(from_datetime, to_datetime, include_user_agent=include_device_segmentation, include_ip_address = include_device_segmentation, interpret_urls=False)

full_query = "select * from (%s) where country_code ='sg'" %query

In [10]:
print(full_query)

select * from (
    
    SELECT 
          CAST("from_iso8601_timestamp"("sent_at") AS timestamp) "sent_at_timestamp"
    , "sent_at"
    , substr(sent_at, 1, 10) as date
    , "type"
    , "body"."event_name"
    , "body"."data"."status"
    , "user"."anonymous_id"
    , "user"."amp_id"
    , "context"."page_url"
    , "context"."referrer"
 
    
        , context.user_agent as user_agent
        
        , context.ip_address
        
    
    FROM
      ms_data_lake_production.ms_data_stream_production_processed
    
    
    WHERE true -- makes query composition easier
    
 AND 
  (
 partition_0 >= '2020'
 AND partition_1 >= '02'
 AND partition_2 >= '16'
 OR (
 partition_0 >= '2020'
 AND partition_1 > '02'
 ) 
 OR (
 partition_0 > '2020'
 ) 
)
 AND ((partition_0 <= '2020'
	 AND partition_1 <= '02'
	 AND partition_2 <= '23'
) 
 OR (
	 partition_0 <= '2020'
	 AND partition_1 < '02'
) 
 OR (
	 partition_0 < '2020'
) 
)
 AND CAST(from_iso8601_timestamp(sent_at) AS timestamp)  between C
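
The partition filter in the printed query compares the year, month and day partitions separately. As an illustration only, an equivalent filter can be sketched by listing each day in the range explicitly. `daily_partition_predicate` is a hypothetical helper, not part of `athena_common_queries`, and it assumes `partition_0`, `partition_1` and `partition_2` hold the zero-padded year, month and day as strings, as the printed query suggests.

```python
from datetime import date, timedelta


def daily_partition_predicate(from_date, to_date):
    """Build an OR-list of partition clauses, one per day (both ends inclusive)."""
    clauses = []
    day = from_date
    while day <= to_date:
        clauses.append(
            "(partition_0 = '%04d' AND partition_1 = '%02d' AND partition_2 = '%02d')"
            % (day.year, day.month, day.day)
        )
        day += timedelta(days=1)
    return "(\n  " + "\n  OR ".join(clauses) + "\n)"


# Same date range as the settings cell: 2020-02-16 to 2020-02-23.
predicate = daily_partition_predicate(date(2020, 2, 16), date(2020, 2, 23))
print(predicate)
```

Listing days is longer than the range comparison for wide windows, but each clause is easy to read and check.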

In [11]:
# NB: this runs `query`, not `full_query`, so the country_code filter built above is not applied.
athena_full_events_df = aq.query(query)



In [12]:
# Set types to speed queries and save on memory
athena_full_events_df = athena_full_events_df.astype({ "type":"category"
    , "event_name":"category"
    , "status":"category"}, copy=False)

In [13]:
athena_full_events_df.dtypes

sent_at_timestamp      object
sent_at                object
date                   object
type                 category
event_name           category
status               category
anonymous_id           object
amp_id                 object
page_url               object
referrer               object
user_agent             object
ip_address             object
dtype: object
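
To illustrate why the low-cardinality columns are converted to `category`, here is a small sketch on invented data (not the real events) comparing the memory used by a repeated string column before and after the conversion.

```python
import pandas as pd

# Invented data: a column with only two distinct values, repeated many times.
events = pd.DataFrame({
    "event_name": ["PageView", "Reading", "PageView", "Reading"] * 1000,
    "anonymous_id": [str(i) for i in range(4000)],
})

bytes_as_object = events["event_name"].memory_usage(deep=True)
events = events.astype({"event_name": "category"})
bytes_as_category = events["event_name"].memory_usage(deep=True)

print("object: %i bytes, category: %i bytes" % (bytes_as_object, bytes_as_category))
```

Columns such as `anonymous_id` or `page_url`, where most values are distinct, are unlikely to benefit in the same way.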

In [14]:
athena_full_events_df.head(5)

Unnamed: 0,sent_at_timestamp,sent_at,date,type,event_name,status,anonymous_id,amp_id,page_url,referrer,user_agent,ip_address
0,2020-02-19 20:11:19.607,2020-02-19T20:11:19.607Z,2020-02-19,event,Reading,Article Body 50,58dc3621-fcf9-4b2c-a251-12ab7733646b,,https://www.moneysmart.tw/articles/%E4%B8%AD%E...,,Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7....,220.128.123.202
1,2020-02-19 20:11:17.592,2020-02-19T20:11:17.592Z,2020-02-19,page,PageView,,95b0cf96-f595-42b9-beab-ea24dce7017e,,https://www.moneysmart.sg/embed/98e61305602380...,https://s0.2mdn.net/dfp/509788/70424308/157292...,Mozilla/5.0 (Linux; Android 8.1.0; V92 Build/O...,119.30.38.48
2,2020-02-19 20:11:21.764,2020-02-19T20:11:21.764Z,2020-02-19,event,Reading,Article Body 25,47ba9951-4f1a-42cb-99e7-0e9fb2cf3e57,,https://blog.moneysmart.sg/invest/cpf-investme...,https://www.google.com/,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6...,116.86.191.157
3,2020-02-19 20:11:17.623,2020-02-19T20:11:17.623Z,2020-02-19,event,UserView.WidgetLoad,,95b0cf96-f595-42b9-beab-ea24dce7017e,,https://www.moneysmart.sg/embed/98e61305602380...,https://s0.2mdn.net/dfp/509788/70424308/157292...,Mozilla/5.0 (Linux; Android 8.1.0; V92 Build/O...,119.30.38.48
4,2020-02-19 20:06:01.766,2020-02-19T20:06:01.766Z,2020-02-19,event,Reading,Article Body 75,7a25d1bc-4472-49ff-a0b7-58926b47bdd9,,https://blog.moneysmart.sg/travel/best-money-c...,https://www.google.com/,Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:7...,2406:3003:2073:2f18:9cf7:1589:8548:43d2


# Segment Data

NB: I got this wrong: the `tracks` table can be queried directly rather than the individual event tables, so a lot of this section is unnecessary.

In [15]:
#from importlib import reload
#reload(data_pier_querying)

In [16]:
# Below there are some checks on what columns are available

segment_columns_to_query = [
    # "sent_at", - don't use this, use timestamp
    "timestamp",
    #"event", - going to get that implied from the table.
    # "status", # TODO: would like to have this, but not sure which column, or which tables.  Maybe just not used much, so only do for the 4 tables.
    "anonymous_id",
    "context_page_url",
    # "referrer", #maybe only used in pages table??
    "context_ip", 
    "context_user_agent"]

In [17]:
dp_querying = data_pier_querying.DataPierQuerying()
dp_querying.connect()

In [18]:
tables_df = dp_querying.query_to_dataframe("select * from information_schema.tables")

In [19]:
segment_event_tables_df = tables_df[tables_df.table_schema=="moneysmartsg_prod"]["table_name"]


In [20]:
# These are taken from the dictionary in https://docs.google.com/spreadsheets/d/1HICh77BoGMIat9K3NPwz3pBayJWiAr0ohAlTuv7dr80/edit#gid=1882048411
#but actually it turns out there should be more than this, and don't need to do it this way.
expected_events_str = """
LeadGeneration.ClickConversion
LeadGeneration.FormStepCompleted
LeadGeneration.FormSubmitted
LeadGeneration.PaymentCompleted
LeadGeneration.ThankYou
LeadGeneration.RedirectCompleted
UserEngagement.ShowedMoreDetails
UserEngagement.ViewedMoreDetails
UserEngagement.SortedList
UserEngagement.UsedHelpHints
UserEngagement.ClickedMenuItem
UserEngagement.QuestionAnswered
UserEngagement.ShowMoreFilter
UserEngagement.ShowMoreOptions
UserEngagement.ClickedFilter
UserEngagement.ButtonClick
UserAuth.LoggedIn
UserAuth.RegisteredAccount
UserAuth.LoggedOut
UserFeedback.ModalDisplayed
UserFeedback.MoodSubmitted
UserFeedback.FeedbackSubmitted
UserFeedback.MoreFeedback
ABTest.Conversion
UserView.WidgetLoad
EmailCapture
PageView
Sharing
Reading
NewsLetterPopup
"""
expected_events = [z.strip() for z in expected_events_str.split("\n") if len(z.strip())>0]

In [21]:


expected_events_and_segment_tables = []
special_maps = {
    "PageView": "pages"
}
for event in expected_events:
    if event in special_maps:
        new_event_name = special_maps[event]
    else:
        new_event_name = ""
        for i, c in enumerate(event):
            if i==0:new_event_name+=c.lower()
            elif str.isupper(c): 
                if i>0 and event[i-1]!=".":
                    new_event_name += "_"
                new_event_name += c.lower()
            elif c==".": new_event_name += "_"
            else: new_event_name+= c
    expected_events_and_segment_tables.append([event, new_event_name])
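
The loop above turns an event name into its Segment table name. The same rule can be sketched as a standalone function, which is easier to test. `event_name_to_table_name` is a hypothetical name introduced here for illustration; the rule itself is copied from the loop.

```python
def event_name_to_table_name(event, special_maps=None):
    """Convert e.g. 'LeadGeneration.ThankYou' to 'lead_generation_thank_you'."""
    if special_maps is None:
        special_maps = {"PageView": "pages"}
    if event in special_maps:
        return special_maps[event]
    parts = []
    for i, c in enumerate(event):
        if c == ".":
            # A dot becomes an underscore.
            parts.append("_")
        elif c.isupper() and i > 0 and event[i - 1] != ".":
            # A capital inside a word starts a new underscore-separated word.
            parts.append("_" + c.lower())
        else:
            # First character, a capital right after a dot, or a lowercase letter.
            parts.append(c.lower())
    return "".join(parts)


for name in ["LeadGeneration.ClickConversion", "ABTest.Conversion", "PageView"]:
    print(name, "->", event_name_to_table_name(name))
```

Note that consecutive capitals are split individually, which is why `ABTest.Conversion` maps to `a_b_test_conversion`.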

In [22]:
expected_events_and_segment_tables

[['LeadGeneration.ClickConversion', 'lead_generation_click_conversion'],
 ['LeadGeneration.FormStepCompleted', 'lead_generation_form_step_completed'],
 ['LeadGeneration.FormSubmitted', 'lead_generation_form_submitted'],
 ['LeadGeneration.PaymentCompleted', 'lead_generation_payment_completed'],
 ['LeadGeneration.ThankYou', 'lead_generation_thank_you'],
 ['LeadGeneration.RedirectCompleted', 'lead_generation_redirect_completed'],
 ['UserEngagement.ShowedMoreDetails', 'user_engagement_showed_more_details'],
 ['UserEngagement.ViewedMoreDetails', 'user_engagement_viewed_more_details'],
 ['UserEngagement.SortedList', 'user_engagement_sorted_list'],
 ['UserEngagement.UsedHelpHints', 'user_engagement_used_help_hints'],
 ['UserEngagement.ClickedMenuItem', 'user_engagement_clicked_menu_item'],
 ['UserEngagement.QuestionAnswered', 'user_engagement_question_answered'],
 ['UserEngagement.ShowMoreFilter', 'user_engagement_show_more_filter'],
 ['UserEngagement.ShowMoreOptions', 'user_engagement_show_m

### Check for missing tables

Expect some random events not to be in Segment, or blog specific ones that haven't been deployed to SG and HK

In [23]:
# Check all the event tables exist
expected_event_segment_tables = [z[1] for z in expected_events_and_segment_tables]
segment_table_names = segment_event_tables_df.to_list()
missing_event_tables = [z for z in expected_event_segment_tables if z not in segment_table_names]
missing_event_tables

['user_engagement_used_help_hints',
 'user_engagement_clicked_menu_item',
 'user_feedback_modal_displayed',
 'user_feedback_more_feedback',
 'a_b_test_conversion',
 'sharing',
 'news_letter_popup']

In [25]:
# Removing the missing ones from the query list
events_and_tables_to_get_from_data_pier = [z for z in expected_events_and_segment_tables if z[1] not in missing_event_tables]

# Removing a problematic one (it doesn't have context_page_url in it, and is very unimportant)
events_and_tables_to_get_from_data_pier = [z for z in events_and_tables_to_get_from_data_pier if z[1] not in ["user_auth_logged_out",]]

In [26]:
len(events_and_tables_to_get_from_data_pier)

22

In [27]:
cols = dp_querying.query_to_dataframe("""
select column_name, data_type, count(*) from information_schema.columns 
where 
table_name in  ('"""+"','".join([z[1] for z in events_and_tables_to_get_from_data_pier])+"""')
and table_schema='moneysmartsg_prod'

group by 1,2
""")

In [28]:
cols[cols["count"]>10].sort_values(["count"])

Unnamed: 0,column_name,data_type,count
288,page_referrer,text,12
353,user_id,text,13
27,context_campaign_content,text,15
43,context_campaign_term,text,15
287,page_path,text,15
17,channel,text,16
33,context_campaign_medium,text,17
34,context_campaign_name,text,17
41,context_campaign_source,text,17
61,context_locale,text,20


In [29]:
cols = dp_querying.query_to_dataframe("""
select  column_name, data_type, count(*) from information_schema.columns 
where 
 table_name in  ('"""+"','".join(["pages", "tracks"])+"""')
and table_schema='moneysmartsg_prod'
and column_name like '%%'
group by 1,2 order by count(*) desc
""")
cols

Unnamed: 0,column_name,data_type,count
0,context_campaign_term,text,2
1,context_campaign_name,text,2
2,context_page_referrer,text,2
3,context_user_agent,text,2
4,context_page_search,text,2
...,...,...,...
101,context_campaign_referrer,text,1
102,context_campaign_solazada_20sgurce,text,1
103,context_campaign_soupnterestce,text,1
104,context_campaign_sourcehsbc_20rewards,text,1


In [30]:
segment_date_constraint = " timestamp >= '%s' and timestamp < '%s' " % (from_datetime.isoformat(), to_datetime.isoformat())

In [31]:
dp_querying.query_to_dataframe("""SELECT
    nmsp_parent.nspname AS parent_schema,
    parent.relname      AS parent,
    nmsp_child.nspname  AS child_schema,
    child.relname       AS child
FROM pg_inherits
    JOIN pg_class parent            ON pg_inherits.inhparent = parent.oid
    JOIN pg_class child             ON pg_inherits.inhrelid   = child.oid
    JOIN pg_namespace nmsp_parent   ON nmsp_parent.oid  = parent.relnamespace
    JOIN pg_namespace nmsp_child    ON nmsp_child.oid   = child.relnamespace
WHERE parent.relname='%s';""" % "pages")

Unnamed: 0,parent_schema,parent,child_schema,child


In [32]:
pd.set_option("display.max_colwidth", 200)
indexes = dp_querying.query_to_dataframe("""SELECT
    indexname,
    indexdef
FROM
    pg_indexes
WHERE
    tablename = '%s';""" % "pages")

for a in indexes.values:
    print(a)

['pages_pkey'
 'CREATE UNIQUE INDEX pages_pkey ON moneysmarthk_prod.pages USING btree (id)']
['pages_pkey'
 'CREATE UNIQUE INDEX pages_pkey ON moneysmartsg_prod.pages USING btree (id)']
['pages_timestamp_idx'
 'CREATE INDEX pages_timestamp_idx ON moneysmartsg_prod.pages USING btree ("timestamp")']
['pages_pkey'
 'CREATE UNIQUE INDEX pages_pkey ON moneysmarthk_dev.pages USING btree (id)']
['pages_pkey'
 'CREATE UNIQUE INDEX pages_pkey ON moneysmartsg_dev.pages USING btree (id)']


In [33]:
query_segment_by_table = False # Leave this as False: querying each event table was the wrong approach (use pages / tracks instead).  The per-table path also still needs country handling.

segment_schemas = ["moneysmartsg_prod", "moneysmarthk_prod"]
# The meat of it
start_time = datetime.now()
event_name_to_rows = {}
if query_segment_by_table:
    for country_schema in segment_schemas:
        for i, (event_name, table_name) in enumerate(events_and_tables_to_get_from_data_pier):
            table_start_time = datetime.now()
            print("querying table %s / %s (%i/%i)" % (table_name, event_name, i+1, len(events_and_tables_to_get_from_data_pier)))
            query = "select {cols} from {schema}.{table} where {date_constraint}".format(cols=", ".join(segment_columns_to_query), 
                                                                           table=table_name,
                                                                           date_constraint =segment_date_constraint, schema=country_schema)

            events = dp_querying.query_to_dataframe(query)
            events["event_name"] = event_name #fills the entire column with the same value
            print("Got %i events"% len(events))
            event_name_to_rows[event_name]=events

            table_download_time = (datetime.now()-table_start_time).total_seconds()
            time_since_start = (datetime.now()-start_time).total_seconds()
            print("It took %.1f seconds to download from the table (%.1f seconds overall)" %(table_download_time, time_since_start))
            print()
            # if i>4:break


        # Merge tables
        segment_combined_df = pd.DataFrame()
        #combined_df = pd.DataFrame(columns=event_name_to_rows["LeadGeneration.ClickConversion"].columns)
        """for event_name, event_df in event_name_to_rows.items():
            print(len(event_df))
            combined_df.append(event_df, ignore_index=True)
            print(len(combined_df))
        #combined_df.astype({"event_name":"category"})
        """

        segment_combined_df = pd.concat(list(event_name_to_rows.values()))
    
    
else:
    segment_columns_to_query_full = segment_columns_to_query + ["event_text",]
    tables_to_query = ["pages", "tracks"]
    all_event_dfs = []
    segment_combined_df = pd.DataFrame()
    for country_schema in segment_schemas:
        for table_name in tables_to_query:
            table_start_time = datetime.now()
            if table_name!="pages":
                cols_to_query = segment_columns_to_query_full
            else:
                cols_to_query = segment_columns_to_query
            print("querying table %s.%s" % (country_schema, table_name))
            print(cols_to_query)
            query = "select {cols} from {schema}.{table} where {date_constraint}".format(cols=", ".join(cols_to_query), 
                                                                           table=table_name,
                                                                           date_constraint =segment_date_constraint, schema=country_schema)

            events = dp_querying.query_to_dataframe(query)
            
            print("Got %i events"% len(events))
            #all_event_dfs.append(events)
            
            if table_name =="pages":
                events["event_text"] = "PageView" # fills the whole column
            table_download_time = (datetime.now()-table_start_time).total_seconds()
            time_since_start = (datetime.now()-start_time).total_seconds()
            print("merging")
            segment_combined_df = pd.concat([segment_combined_df, events])
            print("It took %.1f seconds to download from the table (%.1f seconds overall)" %(table_download_time, time_since_start))
            print()
            
        

querying table moneysmartsg_prod.pages
['timestamp', 'anonymous_id', 'context_page_url', 'context_ip', 'context_user_agent']
Got 1037770 events
merging
It took 42.1 seconds to download from the table (42.1 seconds overall)

querying table moneysmartsg_prod.tracks
['timestamp', 'anonymous_id', 'context_page_url', 'context_ip', 'context_user_agent', 'event_text']
Got 1339129 events
merging
It took 355.9 seconds to download from the table (398.3 seconds overall)

querying table moneysmarthk_prod.pages
['timestamp', 'anonymous_id', 'context_page_url', 'context_ip', 'context_user_agent']
Got 168119 events
merging
It took 277.5 seconds to download from the table (676.0 seconds overall)

querying table moneysmarthk_prod.tracks
['timestamp', 'anonymous_id', 'context_page_url', 'context_ip', 'context_user_agent', 'event_text']
Got 198726 events
merging
It took 77.7 seconds to download from the table (754.1 seconds overall)
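
Growing a DataFrame inside the loop builds a new, larger frame on every iteration. A common alternative, sketched here with invented rows rather than real query results, is to collect the per-table frames in a list (as the unused `all_event_dfs` above hints at) and combine them once at the end.

```python
import pandas as pd

# Stand-ins for the frames returned by each schema / table query (invented data).
pages = pd.DataFrame({"anonymous_id": ["a", "b"], "page_url": ["u1", "u2"]})
pages["event_text"] = "PageView"  # pages has no event_text column, so fill it
tracks = pd.DataFrame({
    "anonymous_id": ["a"],
    "page_url": ["u1"],
    "event_text": ["Reading"],
})

all_event_dfs = [pages, tracks]

# One concat at the end instead of one per table.
combined = pd.concat(all_event_dfs, ignore_index=True)
combined = combined.rename(columns={"event_text": "event_name"})
print(combined)
```

`ignore_index=True` gives the combined frame a fresh 0..n-1 index instead of repeating each table's own row numbers.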



In [34]:
if not query_segment_by_table:
    segment_combined_df.rename(columns={"event_text":"event_name"}, inplace=True)

In [35]:
len(all_event_dfs)

0

In [36]:
if include_device_segmentation:
    segment_combined_df.rename(columns={"context_user_agent":"user_agent"}, inplace=True)

In [37]:
segment_combined_df.head()

Unnamed: 0,timestamp,anonymous_id,context_page_url,context_ip,user_agent,event_name
0,2020-02-16 00:00:00.943000+00:00,f566ba6b-affb-4963-a92a-9507c87e0f04,https://blog.moneysmart.sg/fixed-deposits/best...,111.65.61.238,Mozilla/5.0 (Linux; Android 9; INE-LX2) AppleW...,PageView
1,2020-02-16 00:00:01.486000+00:00,0a096abe-8c72-4829-8f91-3dbbe24f26d1,https://blog.moneysmart.sg/shopping/imm-singap...,119.56.110.213,Mozilla/5.0 (Linux; Android 9; BLA-L29) AppleW...,PageView
2,2020-02-16 00:00:02.233000+00:00,198712ef-3a10-47c1-907c-e51bdf5c3087,https://blog.moneysmart.sg/property/hmlet-co-l...,119.56.109.16,Mozilla/5.0 (Linux; Android 9; MHA-L29) AppleW...,PageView
3,2020-02-16 00:00:03.073000+00:00,dd5dc2c7-6695-4ea9-a7a5-b6c71dd45374,https://www.moneysmart.sg/embed/9cb432acbab519...,119.56.109.16,Mozilla/5.0 (Linux; Android 9; MHA-L29) AppleW...,PageView
4,2020-02-16 00:00:04.042000+00:00,7cb5ecf5-2e57-4623-83f9-6d266ab2ff5a,https://www.moneysmart.sg/embed/f2e62665f34622...,183.90.36.153,Mozilla/5.0 (Linux; Android 7.1.1; CPH1721) Ap...,PageView


In [38]:
segment_combined_df.rename(columns={"context_page_url":"page_url"}, inplace=True)
segment_combined_df.head(5)

Unnamed: 0,timestamp,anonymous_id,page_url,context_ip,user_agent,event_name
0,2020-02-16 00:00:00.943000+00:00,f566ba6b-affb-4963-a92a-9507c87e0f04,https://blog.moneysmart.sg/fixed-deposits/best...,111.65.61.238,Mozilla/5.0 (Linux; Android 9; INE-LX2) AppleW...,PageView
1,2020-02-16 00:00:01.486000+00:00,0a096abe-8c72-4829-8f91-3dbbe24f26d1,https://blog.moneysmart.sg/shopping/imm-singap...,119.56.110.213,Mozilla/5.0 (Linux; Android 9; BLA-L29) AppleW...,PageView
2,2020-02-16 00:00:02.233000+00:00,198712ef-3a10-47c1-907c-e51bdf5c3087,https://blog.moneysmart.sg/property/hmlet-co-l...,119.56.109.16,Mozilla/5.0 (Linux; Android 9; MHA-L29) AppleW...,PageView
3,2020-02-16 00:00:03.073000+00:00,dd5dc2c7-6695-4ea9-a7a5-b6c71dd45374,https://www.moneysmart.sg/embed/9cb432acbab519...,119.56.109.16,Mozilla/5.0 (Linux; Android 9; MHA-L29) AppleW...,PageView
4,2020-02-16 00:00:04.042000+00:00,7cb5ecf5-2e57-4623-83f9-6d266ab2ff5a,https://www.moneysmart.sg/embed/f2e62665f34622...,183.90.36.153,Mozilla/5.0 (Linux; Android 7.1.1; CPH1721) Ap...,PageView


# Merging Segment and Kinesis Events

In [39]:
# Make names clear e.g. s_...

# Check the timezones / timestamps match.
# Athena raw data is in UTC, not SG time, so 2020-01-19T00:04:04.443Z is 8:04am Singapore time,
# whereas Segment is stored with an explicit UTC timezone.  So they could all be converted.
# TODO: this does mean that a lot of events arrive at the day boundary.
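
As a small sketch of the timezone point above, the following converts a UTC `sent_at` string to Singapore time using only the standard library. `parse_sent_at` is a hypothetical helper, and it assumes every `sent_at` value has fractional seconds and a trailing `Z`, as in the sample rows.

```python
from datetime import datetime, timedelta, timezone

SGT = timezone(timedelta(hours=8))  # Singapore is UTC+8 with no daylight saving


def parse_sent_at(sent_at):
    """Parse an ISO 8601 UTC string such as '2020-01-19T00:04:04.443Z'."""
    parsed = datetime.strptime(sent_at, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


utc_time = parse_sent_at("2020-01-19T00:04:04.443Z")
sg_time = utc_time.astimezone(SGT)
print(utc_time.isoformat(), "->", sg_time.isoformat())

# An event late in the UTC day falls on the next calendar day in Singapore.
late_event = parse_sent_at("2020-01-19T23:30:00.000Z")
print(late_event.date(), "UTC vs", late_event.astimezone(SGT).date(), "SGT")
```

This is why the date used for grouping has to be taken in the same timezone on both sides.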

In [40]:
athena_full_events_df.head(2)

Unnamed: 0,sent_at_timestamp,sent_at,date,type,event_name,status,anonymous_id,amp_id,page_url,referrer,user_agent,ip_address
0,2020-02-19 20:11:19.607,2020-02-19T20:11:19.607Z,2020-02-19,event,Reading,Article Body 50,58dc3621-fcf9-4b2c-a251-12ab7733646b,,https://www.moneysmart.tw/articles/%E4%B8%AD%E...,,Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7....,220.128.123.202
1,2020-02-19 20:11:17.592,2020-02-19T20:11:17.592Z,2020-02-19,page,PageView,,95b0cf96-f595-42b9-beab-ea24dce7017e,,https://www.moneysmart.sg/embed/98e61305602380...,https://s0.2mdn.net/dfp/509788/70424308/157292...,Mozilla/5.0 (Linux; Android 8.1.0; V92 Build/O...,119.30.38.48


In [41]:
athena_full_events_df.dtypes

sent_at_timestamp      object
sent_at                object
date                   object
type                 category
event_name           category
status               category
anonymous_id           object
amp_id                 object
page_url               object
referrer               object
user_agent             object
ip_address             object
dtype: object

In [42]:
segment_combined_df.dtypes

timestamp       datetime64[ns, UTC]
anonymous_id                 object
page_url                     object
context_ip                   object
user_agent                   object
event_name                   object
dtype: object

In [None]:
# Group by columns to get around date inaccuracy issue
cols_to_group_by = ["anonymous_id", "event_name", "page_url", "date"] #, "context_ip", "context_user_agent"] #TODO: add IP address

print("Grouping by %s"% ", ".join(cols_to_group_by))

print("Fixing dates before grouping")
print("... for Segment")
# Vectorised equivalent of the row-wise apply: format the UTC timestamp as a YYYY-MM-DD string
segment_combined_df["date"] = segment_combined_df["timestamp"].dt.strftime("%Y-%m-%d")
print("... for athena")
# sent_at is an ISO 8601 string, so its first 10 characters are the date
athena_full_events_df["date"] = athena_full_events_df["sent_at"].str[:10]
# Parsing sent_at_timestamp with dateparser was tried instead (more robust), but it was super-slow, so the date stays a string.

# Going to reduce the number of columns to make it safer; user agents etc. can be looked up afterwards (e.g. with a mapping of anonymous_id to user_agent).




Grouping by anonymous_id, event_name, page_url, date
Fixing dates before grouping
... for Segment


In [None]:
print("Setting sensible data types for the columns to group by")
data_type_mappings = {"event_name":"category", "date":"category"}
segment_combined_df = segment_combined_df.astype(data_type_mappings, copy=False)
athena_full_events_df = athena_full_events_df.astype(data_type_mappings, copy=False)

In [None]:
segment_combined_df.head()[cols_to_group_by]

In [None]:
athena_full_events_df.head()[cols_to_group_by]

In [None]:
# athena_full_events_df timestamp

print("Grouping by %s"%cols_to_group_by)
# observed=True keeps only the combinations that actually occur, since event_name and date are categorical
segment_grouped_df = segment_combined_df.groupby(cols_to_group_by, observed=True).size().reset_index(name='s_count') #size preserves nulls, this sets the column to s_count

athena_grouped_df = athena_full_events_df.groupby(cols_to_group_by, observed=True).size().reset_index(name='k_count')

# segment_combined_df.rename(columns = {"context_ip":"s_context_ip", "context_user_agent":"s_context_user_agent"}) 

In [None]:
athena_grouped_df.head()

In [None]:
# Actually join them

# set the column count names

merged_df = segment_grouped_df.merge(athena_grouped_df, how='outer', on=cols_to_group_by )

#Fill in the empty counts with 0s

merged_df["s_count"] = merged_df["s_count"].fillna(0)
merged_df["k_count"] = merged_df["k_count"].fillna(0)
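
The comparison itself can be sketched on a handful of invented rows: count events per group on each side, outer-join the counts, and treat a missing count as zero. `page_url` is left out of the grouping columns here only to keep the example short.

```python
import pandas as pd

cols_to_group_by = ["anonymous_id", "event_name", "date"]

# Invented rows: user "a" has two Reading events in Segment but one in Kinesis,
# and user "b" only appears in Kinesis.
segment_events = pd.DataFrame({
    "anonymous_id": ["a", "a", "a"],
    "event_name": ["PageView", "Reading", "Reading"],
    "date": ["2020-02-18"] * 3,
})
kinesis_events = pd.DataFrame({
    "anonymous_id": ["a", "a", "b"],
    "event_name": ["PageView", "Reading", "PageView"],
    "date": ["2020-02-18"] * 3,
})

s_counts = segment_events.groupby(cols_to_group_by).size().reset_index(name="s_count")
k_counts = kinesis_events.groupby(cols_to_group_by).size().reset_index(name="k_count")

# The outer join keeps groups that exist on only one side.
merged = s_counts.merge(k_counts, how="outer", on=cols_to_group_by)
merged[["s_count", "k_count"]] = merged[["s_count", "k_count"]].fillna(0).astype(int)

mismatches = merged[merged["s_count"] != merged["k_count"]]
print(mismatches)
```

Rows where the two counts differ are the ones worth segmenting and exploring further.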

In [None]:
merged_df.head(10)

In [None]:
merged_df.groupby(["date"]).count()

# Add Page Filtering Metadata

* is url blog / shop / ...
* country

In [None]:
from urllib.parse import urlparse, parse_qs

In [None]:
from data_parsing import get_metadata_from_url


In [None]:
# Do some tests to show that it's kind of working (bad version of a unit test!)

In [None]:
get_metadata_from_url("https://www-new.moneysmart.sg/rabbit/headlight/?scary=True")

In [None]:
get_metadata_from_url("https://blog.moneysmart.ph/rabbit/headlight/?scary=True")

In [None]:
get_metadata_from_url("https://blog3.moneysmart.tw")

In [None]:
get_metadata_from_url("https://www.moneysmart.hk/zh-hk/credit-cards/")

In [None]:
start_time = datetime.now()
print("starting at %s"%start_time.isoformat())
# This is a bit slow.  Consider looking at how to optimise it, especially the memory used by creating loads of Series objects.
# Could probably optimise by splitting all the URLs using a pandas function, then joining with a map to get page_type, path etc., but YMMV.
metadata_df = merged_df.apply(lambda x: pd.Series(get_metadata_from_url(x.page_url)), axis=1)#, index=["page_type", "path", "ab_test", "country_code"])
end_time = datetime.now()
time_taken = (end_time-start_time).total_seconds()
print("Took %i seconds"%time_taken)

Took 914 seconds
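
Since many rows share the same `page_url`, one possible optimisation is to parse each distinct URL only once. The sketch below uses `functools.lru_cache` for that. `sketch_metadata_from_url` is a hypothetical, much simplified stand-in for `get_metadata_from_url` (whose implementation is not shown in this notebook): it returns only a guessed page type and country code.

```python
from functools import lru_cache
from urllib.parse import urlparse


@lru_cache(maxsize=None)
def sketch_metadata_from_url(url):
    """Hypothetical, simplified stand-in for get_metadata_from_url.

    Returns only (page_type, country_code), guessed from the hostname.
    """
    host = urlparse(url).hostname or ""
    page_type = "blog" if host.startswith("blog") else "shop"
    country_code = host.rsplit(".", 1)[-1] if "." in host else ""
    return page_type, country_code


urls = [
    "https://blog.moneysmart.sg/career/highest-paying-jobs-in-singapore",
    "https://www.moneysmart.hk/zh-hk/credit-cards/",
    "https://blog.moneysmart.sg/career/highest-paying-jobs-in-singapore",
]
metadata = [sketch_metadata_from_url(u) for u in urls]
print(metadata)
print(sketch_metadata_from_url.cache_info())  # the repeated URL is a cache hit
```

The saving depends on how often URLs repeat, so it is worth timing on a single day of data first.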


In [None]:
metadata_df.rename(columns={0:"page_type", 1:"slug", 2:"slug_root", 3:"ab_test", 4:"country_code"}, inplace=True)

In [None]:
metadata_df.head()

Unnamed: 0,page_type,slug,slug_root,ab_test,country_code
0,blog,/zh-hk/credit-cards/%e9%85%92%e5%ba%97%e8%87%a...,/credit-cards,control,hk
1,blog,/zh-hk/credit-cards/%e9%85%92%e5%ba%97%e8%87%a...,/credit-cards,control,hk
2,shop,/credit-cards/posb-everyday-card,/credit-cards,control,sg
3,shop,/embed/f645886bc036195148acd846a50232d9,/embed,control,sg
4,shop,/embed/f645886bc036195148acd846a50232d9,/embed,control,sg


In [None]:
merged_df_with_meta = pd.concat([merged_df, metadata_df], axis=1)

In [None]:
# Set some sensible data types to speed it all up
#merged_df_with_meta.astype({"page_type":"category", "slug":"category"})
merged_df_with_meta = merged_df_with_meta.astype({"page_type":"category", "slug":"category", "ab_test":"category", "country_code":"category", "s_count":"int", "k_count":"int"})

In [None]:
merged_df_with_meta.head()

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code
0,00000487-4c71-4c7e-a372-ab7196780fb0,PageView,https://blog.moneysmart.hk/zh-hk/credit-cards/...,2020-02-18,1,1,blog,/zh-hk/credit-cards/%e9%85%92%e5%ba%97%e8%87%a...,/credit-cards,control,hk
1,00000487-4c71-4c7e-a372-ab7196780fb0,Reading,https://blog.moneysmart.hk/zh-hk/credit-cards/...,2020-02-18,3,3,blog,/zh-hk/credit-cards/%e9%85%92%e5%ba%97%e8%87%a...,/credit-cards,control,hk
2,000004a5-230d-4d4c-a850-d2ec44427589,PageView,https://www.moneysmart.sg/credit-cards/posb-ev...,2020-02-21,1,1,shop,/credit-cards/posb-everyday-card,/credit-cards,control,sg
3,00000bc0-99d3-4742-a855-74d49f6b617c,PageView,https://www.moneysmart.sg/embed/f645886bc03619...,2020-02-18,1,1,shop,/embed/f645886bc036195148acd846a50232d9,/embed,control,sg
4,00000bc0-99d3-4742-a855-74d49f6b617c,UserView.WidgetLoad,https://www.moneysmart.sg/embed/f645886bc03619...,2020-02-18,1,1,shop,/embed/f645886bc036195148acd846a50232d9,/embed,control,sg


In [None]:
merged_df_with_meta[(merged_df_with_meta.s_count>1) & (merged_df_with_meta.k_count>1)].head()

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code
1,00000487-4c71-4c7e-a372-ab7196780fb0,Reading,https://blog.moneysmart.hk/zh-hk/credit-cards/...,2020-02-18,3,3,blog,/zh-hk/credit-cards/%e9%85%92%e5%ba%97%e8%87%a...,/credit-cards,control,hk
17,00012d10-4f63-424f-b411-b2f0dfa44fe0,Reading,https://blog.moneysmart.sg/career/highest-payi...,2020-02-18,3,3,blog,/career/highest-paying-jobs-in-singapore,/career,control,sg
20,00016018-5489-40a1-89a0-9964ec91259b,PageView,https://blog.moneysmart.sg/healthcare/cheap-de...,2020-02-22,3,3,blog,/healthcare/cheap-dentists-singapore-dental-cl...,/healthcare,control,sg
22,00016018-5489-40a1-89a0-9964ec91259b,Reading,https://blog.moneysmart.sg/healthcare/cheap-de...,2020-02-22,6,6,blog,/healthcare/cheap-dentists-singapore-dental-cl...,/healthcare,control,sg
25,00016ba1-b45e-44b2-ae61-79106143ae76,Reading,https://blog.moneysmart.sg/shopping/lazada-pro...,2020-02-17,3,3,blog,/shopping/lazada-promo-code-promotion,/shopping,control,sg


# Add Device Type Metadata

In [None]:
segment_combined_df.head()

Unnamed: 0,timestamp,anonymous_id,page_url,context_ip,user_agent,event_name,date
0,2020-02-16 00:00:00.943000+00:00,f566ba6b-affb-4963-a92a-9507c87e0f04,https://blog.moneysmart.sg/fixed-deposits/best...,111.65.61.238,Mozilla/5.0 (Linux; Android 9; INE-LX2) AppleW...,PageView,2020-02-16
1,2020-02-16 00:00:01.486000+00:00,0a096abe-8c72-4829-8f91-3dbbe24f26d1,https://blog.moneysmart.sg/shopping/imm-singap...,119.56.110.213,Mozilla/5.0 (Linux; Android 9; BLA-L29) AppleW...,PageView,2020-02-16
2,2020-02-16 00:00:02.233000+00:00,198712ef-3a10-47c1-907c-e51bdf5c3087,https://blog.moneysmart.sg/property/hmlet-co-l...,119.56.109.16,Mozilla/5.0 (Linux; Android 9; MHA-L29) AppleW...,PageView,2020-02-16
3,2020-02-16 00:00:03.073000+00:00,dd5dc2c7-6695-4ea9-a7a5-b6c71dd45374,https://www.moneysmart.sg/embed/9cb432acbab519...,119.56.109.16,Mozilla/5.0 (Linux; Android 9; MHA-L29) AppleW...,PageView,2020-02-16
4,2020-02-16 00:00:04.042000+00:00,7cb5ecf5-2e57-4623-83f9-6d266ab2ff5a,https://www.moneysmart.sg/embed/f2e62665f34622...,183.90.36.153,Mozilla/5.0 (Linux; Android 7.1.1; CPH1721) Ap...,PageView,2020-02-16


In [None]:
athena_full_events_df.head()

Unnamed: 0,sent_at_timestamp,sent_at,date,type,event_name,status,anonymous_id,amp_id,page_url,referrer,user_agent,ip_address
0,2020-02-19 20:11:19.607,2020-02-19T20:11:19.607Z,2020-02-19,event,Reading,Article Body 50,58dc3621-fcf9-4b2c-a251-12ab7733646b,,https://www.moneysmart.tw/articles/%E4%B8%AD%E...,,Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7....,220.128.123.202
1,2020-02-19 20:11:17.592,2020-02-19T20:11:17.592Z,2020-02-19,page,PageView,,95b0cf96-f595-42b9-beab-ea24dce7017e,,https://www.moneysmart.sg/embed/98e61305602380...,https://s0.2mdn.net/dfp/509788/70424308/157292...,Mozilla/5.0 (Linux; Android 8.1.0; V92 Build/O...,119.30.38.48
2,2020-02-19 20:11:21.764,2020-02-19T20:11:21.764Z,2020-02-19,event,Reading,Article Body 25,47ba9951-4f1a-42cb-99e7-0e9fb2cf3e57,,https://blog.moneysmart.sg/invest/cpf-investme...,https://www.google.com/,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6...,116.86.191.157
3,2020-02-19 20:11:17.623,2020-02-19T20:11:17.623Z,2020-02-19,event,UserView.WidgetLoad,,95b0cf96-f595-42b9-beab-ea24dce7017e,,https://www.moneysmart.sg/embed/98e61305602380...,https://s0.2mdn.net/dfp/509788/70424308/157292...,Mozilla/5.0 (Linux; Android 8.1.0; V92 Build/O...,119.30.38.48
4,2020-02-19 20:06:01.766,2020-02-19T20:06:01.766Z,2020-02-19,event,Reading,Article Body 75,7a25d1bc-4472-49ff-a0b7-58926b47bdd9,,https://blog.moneysmart.sg/travel/best-money-c...,https://www.google.com/,Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:7...,2406:3003:2073:2f18:9cf7:1589:8548:43d2


### Segment

In [None]:
segment_combined_df.head()

Unnamed: 0,timestamp,anonymous_id,page_url,context_ip,user_agent,event_name,date
0,2020-02-16 00:00:00.943000+00:00,f566ba6b-affb-4963-a92a-9507c87e0f04,https://blog.moneysmart.sg/fixed-deposits/best...,111.65.61.238,Mozilla/5.0 (Linux; Android 9; INE-LX2) AppleW...,PageView,2020-02-16
1,2020-02-16 00:00:01.486000+00:00,0a096abe-8c72-4829-8f91-3dbbe24f26d1,https://blog.moneysmart.sg/shopping/imm-singap...,119.56.110.213,Mozilla/5.0 (Linux; Android 9; BLA-L29) AppleW...,PageView,2020-02-16
2,2020-02-16 00:00:02.233000+00:00,198712ef-3a10-47c1-907c-e51bdf5c3087,https://blog.moneysmart.sg/property/hmlet-co-l...,119.56.109.16,Mozilla/5.0 (Linux; Android 9; MHA-L29) AppleW...,PageView,2020-02-16
3,2020-02-16 00:00:03.073000+00:00,dd5dc2c7-6695-4ea9-a7a5-b6c71dd45374,https://www.moneysmart.sg/embed/9cb432acbab519...,119.56.109.16,Mozilla/5.0 (Linux; Android 9; MHA-L29) AppleW...,PageView,2020-02-16
4,2020-02-16 00:00:04.042000+00:00,7cb5ecf5-2e57-4623-83f9-6d266ab2ff5a,https://www.moneysmart.sg/embed/f2e62665f34622...,183.90.36.153,Mozilla/5.0 (Linux; Android 7.1.1; CPH1721) Ap...,PageView,2020-02-16


In [None]:
group_by_cols = ["anonymous_id", "user_agent"]
segment_anonymous_id_to_user_agent_full_df = segment_combined_df.groupby(group_by_cols).count()
print("%i anonymous_id to user_agents found" % len(segment_anonymous_id_to_user_agent_full_df))

737642 anonymous_id to user_agents found


In [None]:
segment_anonymous_id_to_user_agent_full_df = segment_anonymous_id_to_user_agent_full_df.reset_index()
# No rename needed here: .count() returns one named count column per remaining field, so there is no column 0.
segment_anonymous_id_to_user_agent_full_df.head()

Unnamed: 0,anonymous_id,user_agent,timestamp,page_url,context_ip,event_name,date
0,00000487-4c71-4c7e-a372-ab7196780fb0,Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple...,4,4,4,4,4
1,000004a5-230d-4d4c-a850-d2ec44427589,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...,1,1,1,1,1
2,00000bc0-99d3-4742-a855-74d49f6b617c,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,2,2,2,2,2
3,00002401-b04b-44a1-be96-5792a91666ba,Mozilla/5.0 (Linux; Android 10; SM-G975F) Appl...,2,2,2,2,2
4,0000b144-5c38-4158-9129-89261729aed5,Mozilla/5.0 (Linux; Android 9; SM-A805F) Apple...,2,2,2,2,2


In [None]:
# check for duplicates
sd = segment_anonymous_id_to_user_agent_full_df.groupby(["anonymous_id"]).size() #[["sent_at"]]
sd = sd.reset_index()
duplicates = sd[sd[0]>1]
print("%i / %i anonymous_ids with different user agent strings.  Expect there to be some due to browser upgrades" % (len(duplicates), len(sd)))

9515 / 727894 anonymous_ids with different user agent strings.  Expect there to be some due to browser upgrades


In [None]:
sd.head()

Unnamed: 0,anonymous_id,0
0,00000487-4c71-4c7e-a372-ab7196780fb0,1
1,000004a5-230d-4d4c-a850-d2ec44427589,1
2,00000bc0-99d3-4742-a855-74d49f6b617c,1
3,00002401-b04b-44a1-be96-5792a91666ba,1
4,0000b144-5c38-4158-9129-89261729aed5,1


In [None]:
segment_anonymous_id_to_user_agent_df = segment_anonymous_id_to_user_agent_full_df[["anonymous_id", "user_agent"]] # .set_index("anonymous_id")

#make a bit safer by stripping the strings
#segment_anonymous_id_to_user_agent_df["user_agent"] = segment_anonymous_id_to_user_agent_df["user_agent"].str.strip()
#segment_anonymous_id_to_user_agent_df["anonymous_id"] = segment_anonymous_id_to_user_agent_df["anonymous_id"].str.strip()

In [None]:
segment_anonymous_id_to_user_agent_df = segment_anonymous_id_to_user_agent_df.rename(columns={"user_agent": "s_user_agent"})
segment_anonymous_id_to_user_agent_df.head()

Unnamed: 0,anonymous_id,s_user_agent
0,00000487-4c71-4c7e-a372-ab7196780fb0,Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple...
1,000004a5-230d-4d4c-a850-d2ec44427589,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...
2,00000bc0-99d3-4742-a855-74d49f6b617c,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...
3,00002401-b04b-44a1-be96-5792a91666ba,Mozilla/5.0 (Linux; Android 10; SM-G975F) Appl...
4,0000b144-5c38-4158-9129-89261729aed5,Mozilla/5.0 (Linux; Android 9; SM-A805F) Apple...


In [None]:
# Remove duplicates, so anonymous_id column is unique (otherwise on joins you'll expand the dataset)
segment_anonymous_id_to_user_agent_dedup_df = segment_anonymous_id_to_user_agent_df.groupby("anonymous_id").first().reset_index()
print("Before de-duplication %i, after %i"%(len(segment_anonymous_id_to_user_agent_df), len(segment_anonymous_id_to_user_agent_dedup_df)))
segment_anonymous_id_to_user_agent_dedup_df.head()

Before de-duplication 737642, after 727894


Unnamed: 0,anonymous_id,s_user_agent
0,00000487-4c71-4c7e-a372-ab7196780fb0,Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple...
1,000004a5-230d-4d4c-a850-d2ec44427589,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...
2,00000bc0-99d3-4742-a855-74d49f6b617c,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...
3,00002401-b04b-44a1-be96-5792a91666ba,Mozilla/5.0 (Linux; Android 10; SM-G975F) Appl...
4,0000b144-5c38-4158-9129-89261729aed5,Mozilla/5.0 (Linux; Android 9; SM-A805F) Apple...
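
Note that `groupby("anonymous_id").first()` keeps whichever user agent happens to come first for each id. If the most recent user agent is preferred (e.g. after a browser upgrade), a minimal sketch is shown below. It uses made-up toy data; the `events` frame and `latest_user_agent_df` name are illustrative and not part of this notebook.

```python
import pandas as pd

# Toy events (hypothetical data): id "a" upgraded its browser between two visits
events = pd.DataFrame({
    "anonymous_id": ["a", "a", "b"],
    "user_agent": ["Chrome/79", "Chrome/80", "Safari/13"],
    "timestamp": pd.to_datetime(["2020-02-16", "2020-02-18", "2020-02-17"]),
})

# Sort by time, then keep only the last (most recent) row per anonymous_id
latest_user_agent_df = (
    events.sort_values("timestamp")
    .drop_duplicates(subset="anonymous_id", keep="last")
    [["anonymous_id", "user_agent"]]
    .sort_values("anonymous_id")
    .reset_index(drop=True)
)
print(latest_user_agent_df)
```

Either way the important property is the same: `anonymous_id` is unique afterwards, so later joins cannot expand the dataset.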


### Athena / Kinesis

In [None]:
group_by_cols = ["anonymous_id", "user_agent"]
athena_anonymous_id_to_user_agent_full_df = athena_full_events_df.groupby(group_by_cols).size()
print("%i anonymous_id to user_agents found" % len(athena_anonymous_id_to_user_agent_full_df))

864558 anonymous_id to user_agents found


In [None]:
athena_anonymous_id_to_user_agent_full_df = athena_anonymous_id_to_user_agent_full_df.reset_index()
athena_anonymous_id_to_user_agent_full_df.rename(columns={0: "count"}, inplace=True)
athena_anonymous_id_to_user_agent_full_df.head()

Unnamed: 0,anonymous_id,user_agent,count
0,00000487-4c71-4c7e-a372-ab7196780fb0,Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple...,4
1,000004a5-230d-4d4c-a850-d2ec44427589,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...,1
2,00000bc0-99d3-4742-a855-74d49f6b617c,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,2
3,00002401-b04b-44a1-be96-5792a91666ba,Mozilla/5.0 (Linux; Android 10; SM-G975F) Appl...,2
4,0000b144-5c38-4158-9129-89261729aed5,Mozilla/5.0 (Linux; Android 9; SM-A805F) Apple...,2


In [None]:
# check for duplicates
ad = athena_anonymous_id_to_user_agent_full_df.groupby(["anonymous_id"]).size() #[["sent_at"]]
ad = ad.reset_index()
duplicates = ad[ad[0]>1]
print("%i / %i anonymous_ids with different user agent strings.  Expect there to be some due to browser upgrades" % (len(duplicates), len(ad)))

10132 / 854067 anonymous_ids with different user agent strings.  Expect there to be some due to browser upgrades


In [None]:
# explore if issue
#df = ad[ad[0]>1].merge(athena_anonymous_id_to_user_agent_full_df, how="inner")
#df.sort_values("anonymous_id")

In [None]:
#df = athena_anonymous_id_to_user_agent_full_df[athena_anonymous_id_to_user_agent_full_df.anonymous_id=="f4a0d91c-b118-40ce-890c-9142bce9f152"]
#pd.set_option('max_colwidth', 200)
#print(df.values[0][1])
#print(df.values[1][1])

In [None]:
#athena_anonymous_id_to_user_agent_full_df.head()
athena_anonymous_id_to_user_agent_df = athena_anonymous_id_to_user_agent_full_df[["anonymous_id", "user_agent"]]


#make a bit safer by stripping the strings #couldn't get this to work without warning easily, so skipping.
#athena_anonymous_id_to_user_agent_df.loc[:,1] = athena_anonymous_id_to_user_agent_df["user_agent"].str.strip()
#athena_anonymous_id_to_user_agent_df.loc[:,0] = athena_anonymous_id_to_user_agent_df["anonymous_id"].str.strip()

#?athena_anonymous_id_to_user_agent_df["user_agent"].str.strip()
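
The warning mentioned above is presumably pandas' `SettingWithCopyWarning`, which appears when assigning into a frame that was created by selecting columns from another frame. A possible way around it, sketched here on made-up toy data (the `full_df` and `subset_df` names are illustrative), is to take an explicit `.copy()` before stripping:

```python
import pandas as pd

# Toy stand-in for the grouped frame (hypothetical values with stray whitespace)
full_df = pd.DataFrame({
    "anonymous_id": [" id-1 ", "id-2"],
    "user_agent": ["Mozilla/5.0 ", " Mozilla/5.0"],
    0: [4, 1],
})

# .copy() makes the subset an independent frame, so the assignments below are unambiguous
subset_df = full_df[["anonymous_id", "user_agent"]].copy()
for col in ["anonymous_id", "user_agent"]:
    subset_df[col] = subset_df[col].str.strip()

print(subset_df)
```

The copy costs some extra memory, which may matter for large date ranges.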

In [None]:
athena_anonymous_id_to_user_agent_df = athena_anonymous_id_to_user_agent_df.rename(columns={"user_agent": "a_user_agent"})
athena_anonymous_id_to_user_agent_df.head()


Unnamed: 0,anonymous_id,a_user_agent
0,00000487-4c71-4c7e-a372-ab7196780fb0,Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple...
1,000004a5-230d-4d4c-a850-d2ec44427589,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...
2,00000bc0-99d3-4742-a855-74d49f6b617c,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...
3,00002401-b04b-44a1-be96-5792a91666ba,Mozilla/5.0 (Linux; Android 10; SM-G975F) Appl...
4,0000b144-5c38-4158-9129-89261729aed5,Mozilla/5.0 (Linux; Android 9; SM-A805F) Apple...


In [None]:
# Remove duplicates, so anonymous_id column is unique (otherwise on joins you'll expand the dataset)
athena_anonymous_id_to_user_agent_dedup_df = athena_anonymous_id_to_user_agent_df.groupby("anonymous_id").first().reset_index()
print("Before de-duplication %i, after %i"%(len(athena_anonymous_id_to_user_agent_df), len(athena_anonymous_id_to_user_agent_dedup_df)))
athena_anonymous_id_to_user_agent_dedup_df.head()

Before de-duplication 864558, after 854067


Unnamed: 0,anonymous_id,a_user_agent
0,00000487-4c71-4c7e-a372-ab7196780fb0,Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple...
1,000004a5-230d-4d4c-a850-d2ec44427589,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...
2,00000bc0-99d3-4742-a855-74d49f6b617c,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...
3,00002401-b04b-44a1-be96-5792a91666ba,Mozilla/5.0 (Linux; Android 10; SM-G975F) Appl...
4,0000b144-5c38-4158-9129-89261729aed5,Mozilla/5.0 (Linux; Android 9; SM-A805F) Apple...


### Joined up for all anonymous_ids

In [None]:
athena_anonymous_id_to_user_agent_dedup_df.set_index("anonymous_id", inplace=True)
segment_anonymous_id_to_user_agent_dedup_df.set_index("anonymous_id", inplace=True)




In [None]:
athena_anonymous_id_to_user_agent_dedup_df.head(2)

Unnamed: 0_level_0,a_user_agent
anonymous_id,Unnamed: 1_level_1
00000487-4c71-4c7e-a372-ab7196780fb0,Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple...
000004a5-230d-4d4c-a850-d2ec44427589,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...


In [None]:
segment_anonymous_id_to_user_agent_dedup_df.head(2)

Unnamed: 0_level_0,s_user_agent
anonymous_id,Unnamed: 1_level_1
00000487-4c71-4c7e-a372-ab7196780fb0,Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple...
000004a5-230d-4d4c-a850-d2ec44427589,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...


In [None]:
combined_anonymous_id_to_user_agent_df = athena_anonymous_id_to_user_agent_dedup_df.merge(segment_anonymous_id_to_user_agent_dedup_df, how="outer", left_index=True, right_index=True)


In [None]:
combined_anonymous_id_to_user_agent_df.head(1)

Unnamed: 0_level_0,a_user_agent,s_user_agent
anonymous_id,Unnamed: 1_level_1,Unnamed: 2_level_1
00000487-4c71-4c7e-a372-ab7196780fb0,Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple...,Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple...


### Check if Segment and Kinesis disagree at all

In [None]:
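# These frames are not de-duplicated, so the counts are (anonymous_id, user_agent) pairs rather than unique anonymous_ids
print("%i segment (anonymous_id, user_agent) pairs" % len(segment_anonymous_id_to_user_agent_df))
print("%i athena (anonymous_id, user_agent) pairs" % len(athena_anonymous_id_to_user_agent_df))

737642 segment (anonymous_id, user_agent) pairs
864558 athena (anonymous_id, user_agent) pairs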


In [None]:
# combined_anonymous_id_to_user_agent_df[(combined_anonymous_id_to_user_agent_df.a_user_agent.isnull())]

In [None]:
ua_df = combined_anonymous_id_to_user_agent_df
# A user agent counts as missing if it is null or an empty string
a_missing = ua_df.a_user_agent.isnull() | (ua_df.a_user_agent.str.len() == 0)
s_missing = ua_df.s_user_agent.isnull() | (ua_df.s_user_agent.str.len() == 0)
s_not_a = ua_df[(ua_df.s_user_agent.str.len() > 0) & a_missing]
a_not_s = ua_df[(ua_df.a_user_agent.str.len() > 0) & s_missing]

In [None]:
s_not_a.head()

Unnamed: 0_level_0,a_user_agent,s_user_agent
anonymous_id,Unnamed: 1_level_1,Unnamed: 2_level_1
001605de-c107-4604-ab3d-04cc56504dbc,,Mozilla/5.0 (Linux; Android 7.0; TECNO LA6 Bui...
003645fb-0b16-4d5f-85b7-1ebe7be4b4fe,,Mozilla/5.0 (Linux; Android 10; ONEPLUS A6013)...
003c5a29-2f51-4a3f-9c9c-5f2176dd2dcf,,Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; ...
00595b8d-f8f1-4ddb-aad6-c651397683db,,Mozilla/5.0 (compatible; Baiduspider-render/2....
006419fc-2d8e-43a9-992a-90f4ca942f83,,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...


In [None]:
a_not_s.head()

Unnamed: 0_level_0,a_user_agent,s_user_agent
anonymous_id,Unnamed: 1_level_1,Unnamed: 2_level_1
00013053-6c62-49dc-8e02-604fa57feb3f,Mozilla/5.0 (Linux; Android 9; SAMSUNG SM-A705...,
0001f25b-5400-4a2a-889c-919a7316bf2f,Mozilla/5.0 (iPhone; CPU iPhone OS 13_1 like M...,
00023a09-ce76-4b57-b998-2e6961e1cfeb,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like M...,
0003092f-2863-45f4-8d99-8e87521ed781,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...,
00037da2-ec18-4e89-82ae-e12604313aaa,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...,


In [None]:
total_count = len(combined_anonymous_id_to_user_agent_df)
s_not_a_count = len(s_not_a)
a_not_s_count = len(a_not_s)
print("%i / %i are in segment, not athena (%.1f percent)" % (s_not_a_count, total_count, s_not_a_count / total_count *100))
print("%i / %i are in athena, not segment (%.1f percent)" % (a_not_s_count, total_count, a_not_s_count / total_count *100))
print("If you include countries that aren't on Segment i.e. ID, PH, TW, then you'd expect more from athena")

2919 / 856986 are in segment, not athena (0.3 percent)
129092 / 856986 are in athena, not segment (15.1 percent)
If you include countries that aren't on Segment i.e. ID, PH, TW, then you'd expect more from athena
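
As a cross-check on these counts, the outer merge can report which side each row came from via the `indicator` parameter of `DataFrame.merge`. The sketch below uses made-up toy frames (`athena`, `segment` and `combined` are illustrative names). Unlike the filters above, it does not treat empty strings as missing.

```python
import pandas as pd

# Toy lookups indexed by anonymous_id (hypothetical values)
athena = pd.DataFrame(
    {"a_user_agent": ["UA-1", "UA-2"]},
    index=pd.Index(["id-1", "id-2"], name="anonymous_id"),
)
segment = pd.DataFrame(
    {"s_user_agent": ["UA-2", "UA-3"]},
    index=pd.Index(["id-2", "id-3"], name="anonymous_id"),
)

# indicator=True adds a "_merge" column: "left_only", "right_only" or "both"
combined = athena.merge(segment, how="outer", left_index=True, right_index=True, indicator=True)
counts = combined["_merge"].value_counts()
print(counts)
```

Here `left_only` corresponds to "in athena, not segment" and `right_only` to "in segment, not athena".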


### Get an idea of how many don't have matching user_agents

In [None]:
df = combined_anonymous_id_to_user_agent_df.groupby("anonymous_id").size().reset_index()
duplicates = df[df[0]>1]
print("%i duplicate anonymous_ids - should be none at this stage" % len(duplicates))

0 duplicate anonymous_ids - should be none at this stage


Unnamed: 0_level_0,a_user_agent,s_user_agent
anonymous_id,Unnamed: 1_level_1,Unnamed: 2_level_1
00000487-4c71-4c7e-a372-ab7196780fb0,Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple...,Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple...
000004a5-230d-4d4c-a850-d2ec44427589,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...
00000bc0-99d3-4742-a855-74d49f6b617c,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...
00002401-b04b-44a1-be96-5792a91666ba,Mozilla/5.0 (Linux; Android 10; SM-G975F) Appl...,Mozilla/5.0 (Linux; Android 10; SM-G975F) Appl...
0000b144-5c38-4158-9129-89261729aed5,Mozilla/5.0 (Linux; Android 9; SM-A805F) Apple...,Mozilla/5.0 (Linux; Android 9; SM-A805F) Apple...


In [148]:
non_matching_excl_nulls = combined_anonymous_id_to_user_agent_df[(combined_anonymous_id_to_user_agent_df.s_user_agent != combined_anonymous_id_to_user_agent_df.a_user_agent) \
                                                                 & ~combined_anonymous_id_to_user_agent_df.s_user_agent.isnull() \
                                                                 & ~combined_anonymous_id_to_user_agent_df.a_user_agent.isnull()]
print("%i User agent strings don't match" % len(non_matching_excl_nulls))
print("%i total anonymous_ids" % len(combined_anonymous_id_to_user_agent_df))
print("Look for changes in browser version for instance.  Don't worry about every last one.")
non_matching_excl_nulls.head()

161 User agent strings don't match
856986 total anonymous_ids
Look for changes in browser version for instance.  Don't worry about every last one.


Unnamed: 0_level_0,a_user_agent,s_user_agent
anonymous_id,Unnamed: 1_level_1,Unnamed: 2_level_1
001a4d56-788c-43e7-a3ab-326f4adfcf1b,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...
025b5196-167b-4df0-828f-99579b62cc6c,Mozilla/5.0 (Linux; Android 9; SM-N960F) Apple...,Mozilla/5.0 (Linux; Android 9; SM-N960F) Apple...
043e73d5-620f-414c-94c0-4add86b742e1,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3...,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3...
0460f6dc-ddde-4a1e-90ee-32ae2ce9150b,Mozilla/5.0 (Linux; Android 9; SM-A6060) Apple...,Mozilla/5.0 (Linux; Android 9; SM-A6060) Apple...
057ef359-9c35-49bb-83d9-78e153991c7e,Mozilla/5.0 (Linux; Android 10; VOG-L29) Apple...,Mozilla/5.0 (Linux; Android 10; VOG-L29) Apple...


In [139]:
for anonymous_id, (a_user_agent, s_user_agent) in non_matching_excl_nulls.iterrows():
    if "bot" in a_user_agent or "bot" in s_user_agent:
        print(anonymous_id, "\n", a_user_agent, "\n", s_user_agent, "\n")

134b935b-8356-4d29-84fd-60fc1d34870a 
 Mozilla/5.0 (Linux; Android 10; SM-G970U) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.136 Mobile Safari/537.36 
 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 

428b53a5-5116-42a1-ad26-9dd8e592152f 
 Mozilla/5.0 (X11; Linux x86_64)  AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview)  Chrome/79.0.3945.120 Safari/537.36 
 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.120 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 

5f1b428b-53a5-4116-b2a1-2d269dd8e592 
 Mozilla/5.0 (X11; Linux x86_64)  AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview)  Chrome/79.0.3945.120 Safari/537.36 
 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Geck

In [147]:
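impact_radius_rows = combined_anonymous_id_to_user_agent_df[combined_anonymous_id_to_user_agent_df.a_user_agent.str.contains("Radius Compliance Bot", na=False, case=False)]
for anonymous_id, (a_user_agent, s_user_agent) in impact_radius_rows.iterrows():
    print(anonymous_id, "\n", a_user_agent, "\n", s_user_agent, "\n")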

00d1b887-f146-4510-9da1-2e0b37553edd 
 Mozilla/5.0 (compatible;Impact Radius Compliance Bot) 
 Mozilla/5.0 (compatible;Impact Radius Compliance Bot) 

04cd54a3-45cb-44d8-b01d-e0dfd6bbff7d 
 Mozilla/5.0 (compatible;Impact Radius Compliance Bot) 
 Mozilla/5.0 (compatible;Impact Radius Compliance Bot) 

09ba3eff-441d-42da-9ff1-427cadf9deeb 
 Mozilla/5.0 (compatible;Impact Radius Compliance Bot) 
 Mozilla/5.0 (compatible;Impact Radius Compliance Bot) 

21197c02-dce3-4499-b2dd-9a4774df80c0 
 Mozilla/5.0 (compatible;Impact Radius Compliance Bot) 
 Mozilla/5.0 (compatible;Impact Radius Compliance Bot) 

3b353938-9e1c-4aef-b9b3-280d7716fe39 
 Mozilla/5.0 (compatible;Impact Radius Compliance Bot) 
 Mozilla/5.0 (compatible;Impact Radius Compliance Bot) 

4d2b4f54-9c69-45a4-85dc-3e6bbc2ab9ed 
 Mozilla/5.0 (compatible;Impact Radius Compliance Bot) 
 Mozilla/5.0 (compatible;Impact Radius Compliance Bot) 

5175d3af-edfb-456c-8880-c7dfcd2e3cb6 
 Mozilla/5.0 (compatible;Impact Radius Compliance Bot) 


In [152]:
combined_anonymous_id_to_user_agent_df[combined_anonymous_id_to_user_agent_df.a_user_agent.str.contains("frog", na=False, case=False)].head()

Unnamed: 0_level_0,a_user_agent,s_user_agent
anonymous_id,Unnamed: 1_level_1,Unnamed: 2_level_1


### Create a Single user agent string per anonymous_id

In [None]:
combined_anonymous_id_to_user_agent_single_col_df = combined_anonymous_id_to_user_agent_df["a_user_agent"]\
        .fillna(combined_anonymous_id_to_user_agent_df["s_user_agent"]).reset_index().set_index("anonymous_id")
combined_anonymous_id_to_user_agent_single_col_df.rename(columns={"a_user_agent":"user_agent"}, inplace=True)
combined_anonymous_id_to_user_agent_single_col_df.head()

Unnamed: 0_level_0,user_agent
anonymous_id,Unnamed: 1_level_1
00000487-4c71-4c7e-a372-ab7196780fb0,Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple...
000004a5-230d-4d4c-a850-d2ec44427589,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...
00000bc0-99d3-4742-a855-74d49f6b617c,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...
00002401-b04b-44a1-be96-5792a91666ba,Mozilla/5.0 (Linux; Android 10; SM-G975F) Appl...
0000b144-5c38-4158-9129-89261729aed5,Mozilla/5.0 (Linux; Android 9; SM-A805F) Apple...
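
To make the fallback rule explicit: the Athena user agent is used when present, and the Segment one only fills the gaps. A minimal sketch on made-up toy data (the `combined` and `single` names are illustrative):

```python
import pandas as pd

# Toy combined frame (hypothetical values): id-2 is missing in Athena, id-3 in Segment
combined = pd.DataFrame(
    {
        "a_user_agent": ["UA-a1", None, "UA-a3"],
        "s_user_agent": ["UA-s1", "UA-s2", None],
    },
    index=pd.Index(["id-1", "id-2", "id-3"], name="anonymous_id"),
)

# Prefer Athena, fall back to Segment, and name the resulting column in one go
single = combined["a_user_agent"].fillna(combined["s_user_agent"]).rename("user_agent").to_frame()
print(single)
```

`Series.rename(...).to_frame()` gives the same single-column frame as the `reset_index().set_index()` round trip, with the index left untouched.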


In [None]:
# During development, re-running the merge below appends duplicate columns (user_agent_x, user_agent_y etc),
# so drop any existing user_agent* columns first.
user_agent_cols_to_delete = [z for z in merged_df_with_meta.columns if z.startswith("user_agent")]
print(" Removing %s "%str(user_agent_cols_to_delete))
merged_df_with_meta.drop(columns=user_agent_cols_to_delete, inplace=True)

 Removing [] 
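
For reference, this is how the suffixed columns arise: merging a frame that already has a `user_agent` column with another frame that has one makes pandas rename the clash to `user_agent_x` / `user_agent_y`. A small sketch on made-up toy data (all frame names here are illustrative):

```python
import pandas as pd

main = pd.DataFrame({"anonymous_id": ["id-1"], "k_count": [1]})
lookup = pd.DataFrame({"anonymous_id": ["id-1"], "user_agent": ["UA-1"]})

once = main.merge(lookup, on="anonymous_id", how="left")
# Re-running the same merge clashes on "user_agent", so pandas adds _x / _y suffixes
twice = once.merge(lookup, on="anonymous_id", how="left")
print(list(twice.columns))

# Dropping every user_agent* column first makes the merge safe to re-run
stale_cols = [c for c in twice.columns if c.startswith("user_agent")]
clean = twice.drop(columns=stale_cols).merge(lookup, on="anonymous_id", how="left")
print(list(clean.columns))
```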


### Useful segmentation / convert user agent to browser etc

In [None]:
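def convert_user_agent_to_useful_strings(user_agent_string):
    """
    Parse a user agent string into
    [device_family, os_family, os_version, browser_family, browser_version, is_bot].

    Roughly mirrors https://github.com/moneysmartco/metl/blob/e13086fae453911bed5a40cb51ff0869e2f3a0ce/scripts/python/device_tagger.py
    """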
    user_agent = user_agents.parse(user_agent_string)
    
    device_family = ""
    
    if user_agent.is_pc:
        device_family = 'desktop'
    elif user_agent.is_mobile:
        device_family = 'mobile'
    elif user_agent.is_tablet:
        device_family = 'tablet'
    else:
        device_family = 'other'
        
    
    os_family = user_agent.os.family
    os_version = user_agent.os.version_string
    browser_family = user_agent.browser.family 
    browser_version = user_agent.browser.version_string
    
    is_bot = user_agent.is_bot
    
    return [device_family, os_family, os_version, browser_family, browser_version, is_bot]
    



There's an important optimisation going on here, although the result still isn't that quick.

If you just do .apply across all the rows, it's super slow (many minutes, e.g. 278s vs 24s for my better version).  I tried the optimisation at https://ys-l.github.io/posts/2015/08/28/how-not-to-use-pandas-apply/, but that didn't seem to provide any benefit (or I slowed it down in other ways).

So instead I take the unique user_agents, process each one once, and then join the results back on, without creating any Series objects along the way.

There's probably more improvement possible (e.g. creating the full data structure to insert into up front, or generating fewer arrays), but it's fast enough for me right now.
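
The idea in miniature, as a sketch on made-up toy data: `toy_parse` is a hypothetical stand-in for `convert_user_agent_to_useful_strings` (so the example does not need the user_agents package), and `parse_log` only exists to show how often the parser runs.

```python
import pandas as pd

parse_log = []  # records every call to the (expensive) parser


def toy_parse(user_agent_string):
    """Hypothetical stand-in for convert_user_agent_to_useful_strings."""
    parse_log.append(user_agent_string)
    device_family = "mobile" if "Mobile" in user_agent_string else "desktop"
    return [device_family]


# Four ids, but only two distinct user agent strings
ids_df = pd.DataFrame({
    "anonymous_id": ["id-1", "id-2", "id-3", "id-4"],
    "user_agent": ["UA Mobile", "UA Desktop", "UA Mobile", "UA Mobile"],
})

# Parse each distinct string once, then join the results back onto every row
distinct = ids_df["user_agent"].unique()
meta_df = pd.DataFrame(
    [[ua] + toy_parse(ua) for ua in distinct],
    columns=["user_agent", "device_family"],
)
result = ids_df.merge(meta_df, on="user_agent", how="left")
print(result)
print("parser called %i times for %i rows" % (len(parse_log), len(ids_df)))
```

The saving scales with the ratio of rows to distinct strings, which is large here because many visitors share the same browser build.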

In [None]:
distinct_user_agents = combined_anonymous_id_to_user_agent_single_col_df.user_agent.unique()

In [None]:
distinct_user_agents[:10]

array(['Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36',
       'Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Mobile/15E148 Safari/604.1',
       'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
       'Mozilla/5.0 (Linux; Android 10; SM-G975F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.99 Mobile Safari/537.36',
       'Mozilla/5.0 (Linux; Android 9; SM-A805F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.99 Mobile Safari/537.36',
       'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36',
       'Mozilla/5.0 (iPhone; CPU iPhone OS 12_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.2 Mobile/15E148 Safari/604.1',
       'Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac O

In [None]:
len(distinct_user_agents)

35376

In [None]:
# This isn't fast, but acceptable
start_time = datetime.now()
print("Starting to add user agent data at %s"% start_time.isoformat())
#meta_df = combined_anonymous_id_to_user_agent_single_col_df.apply(lambda x: pd.Series(convert_user_agent_to_useful_strings(x.user_agent)), axis=1)
meta_rows = [[z, ]+convert_user_agent_to_useful_strings(z)  for z in distinct_user_agents]
#d = dfcombined_anonymous_id_to_user_agent_single_col_df.merge(meta_df)
end_time = datetime.now()
seconds_taken = (end_time - start_time).total_seconds()
print("Took %i seconds to process" % seconds_taken)

Starting to add user agent data at 2020-02-25T09:43:40.439547
Took 40 seconds to process


In [None]:
user_agent_meta_df = pd.DataFrame(meta_rows)

user_agent_meta_df.rename(columns = {0:"user_agent", 1:"device_family", 2:"os_family", 3:"os_version", 4:"browser_family",5:"browser_version", 6:"is_bot"}, inplace=True)
user_agent_meta_df.set_index("user_agent", inplace=True)


In [None]:
# Try to make the data types a bit efficient
user_agent_meta_df = user_agent_meta_df.astype({ "device_family":"category", "os_family":"category", "os_version":"category", "browser_family":"category","browser_version":"category","is_bot":"bool"})
user_agent_meta_df.dtypes

device_family      category
os_family          category
os_version         category
browser_family     category
browser_version    category
is_bot                 bool
dtype: object
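
To see why category columns help, here is a rough comparison on a made-up column of repeated browser names (toy data, not from this notebook). The exact byte counts depend on the pandas version, so none are quoted.

```python
import pandas as pd

# Toy column: only three distinct values repeated many times
browser_families = pd.Series(["Chrome", "Mobile Safari", "Chrome Mobile"] * 10_000)

# deep=True includes the memory held by the Python string objects themselves
as_object_bytes = browser_families.astype("object").memory_usage(deep=True)
as_category_bytes = browser_families.astype("category").memory_usage(deep=True)
print("object: %i bytes, category: %i bytes" % (as_object_bytes, as_category_bytes))
```

A category column stores each distinct string once plus a small integer code per row, so the benefit is largest for columns with few distinct values.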

In [None]:
user_agent_meta_df.head()

Unnamed: 0_level_0,device_family,os_family,os_version,browser_family,browser_version,is_bot
user_agent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36",desktop,Windows,7,Chrome,80.0.3987,False
"Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Mobile/15E148 Safari/604.1",mobile,iOS,13.3.1,Mobile Safari,13.0.5,False
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",desktop,Windows,10,Chrome,79.0.3945,False
"Mozilla/5.0 (Linux; Android 10; SM-G975F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.99 Mobile Safari/537.36",mobile,Android,10,Chrome Mobile,80.0.3987,False
"Mozilla/5.0 (Linux; Android 9; SM-A805F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.99 Mobile Safari/537.36",mobile,Android,9,Chrome Mobile,80.0.3987,False


In [None]:
if False:# This is super slow currently.
    start_time = datetime.now()
    print("Starting to add user agent data at %s"% start_time.isoformat())
    #meta_df = combined_anonymous_id_to_user_agent_single_col_df.apply(lambda x: pd.Series(convert_user_agent_to_useful_strings(x.user_agent)), axis=1)
    meta_df = combined_anonymous_id_to_user_agent_single_col_df.apply(lambda x: convert_user_agent_to_useful_strings(x.user_agent), axis=1, result_type="expand")
    #d = dfcombined_anonymous_id_to_user_agent_single_col_df.merge(meta_df)
    end_time = datetime.now()
    seconds_taken = (end_time - start_time).total_seconds()
    print("Took %i seconds to process" % seconds_taken)

In [None]:
if False:
    # Attempted speed-up based on https://ys-l.github.io/posts/2015/08/28/how-not-to-use-pandas-apply/.
    # It hasn't worked so far: after tidying it still takes 257s, slower than the original.
    start_time = datetime.now()
    print("Starting to add user agent data at %s" % start_time.isoformat())
    col_names = ["device_family", "os_family", "os_version", "browser_family", "browser_version", "is_bot"]
    # One independent list per column. Note that [[]] * 6 would repeat a single shared list six times.
    new_cols = [[] for _ in col_names]
    for _, row in combined_anonymous_id_to_user_agent_single_col_df.iterrows():
        vals = convert_user_agent_to_useful_strings(row.user_agent)
        for col, val in zip(new_cols, vals):
            col.append(val)

    # Report the current time at each stage (not start_time again)
    print("New cols generated at %s" % datetime.now().isoformat())
    meta_df = pd.DataFrame(dict(zip(col_names, new_cols)))
    print("Additional data frame generated at %s" % datetime.now().isoformat())
    end_time = datetime.now()
    seconds_taken = (end_time - start_time).total_seconds()
    print("Took %i seconds to process" % seconds_taken)

### Join onto the main dataframe 

In [None]:
merged_df_with_meta = merged_df_with_meta.merge(combined_anonymous_id_to_user_agent_single_col_df, on="anonymous_id", how="left")

In [None]:
merged_df_with_meta.head(2)

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code,user_agent
0,00000487-4c71-4c7e-a372-ab7196780fb0,PageView,https://blog.moneysmart.hk/zh-hk/credit-cards/...,2020-02-18,1,1,blog,/zh-hk/credit-cards/%e9%85%92%e5%ba%97%e8%87%a...,/credit-cards,control,hk,Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple...
1,00000487-4c71-4c7e-a372-ab7196780fb0,Reading,https://blog.moneysmart.hk/zh-hk/credit-cards/...,2020-02-18,3,3,blog,/zh-hk/credit-cards/%e9%85%92%e5%ba%97%e8%87%a...,/credit-cards,control,hk,Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple...


In [None]:
# Add the parsed user agent breakdown (device, OS, browser, bot flag) to each row

merged_df_with_meta = merged_df_with_meta.merge(user_agent_meta_df, on="user_agent", how="left")

In [None]:
merged_df_with_meta.head()

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code,user_agent,device_family,os_family,os_version,browser_family,browser_version,is_bot
0,00000487-4c71-4c7e-a372-ab7196780fb0,PageView,https://blog.moneysmart.hk/zh-hk/credit-cards/...,2020-02-18,1,1,blog,/zh-hk/credit-cards/%e9%85%92%e5%ba%97%e8%87%a...,/credit-cards,control,hk,Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple...,desktop,Windows,7,Chrome,80.0.3987,False
1,00000487-4c71-4c7e-a372-ab7196780fb0,Reading,https://blog.moneysmart.hk/zh-hk/credit-cards/...,2020-02-18,3,3,blog,/zh-hk/credit-cards/%e9%85%92%e5%ba%97%e8%87%a...,/credit-cards,control,hk,Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple...,desktop,Windows,7,Chrome,80.0.3987,False
2,000004a5-230d-4d4c-a850-d2ec44427589,PageView,https://www.moneysmart.sg/credit-cards/posb-ev...,2020-02-21,1,1,shop,/credit-cards/posb-everyday-card,/credit-cards,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...,mobile,iOS,13.3.1,Mobile Safari,13.0.5,False
3,00000bc0-99d3-4742-a855-74d49f6b617c,PageView,https://www.moneysmart.sg/embed/f645886bc03619...,2020-02-18,1,1,shop,/embed/f645886bc036195148acd846a50232d9,/embed,control,sg,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,desktop,Windows,10,Chrome,79.0.3945,False
4,00000bc0-99d3-4742-a855-74d49f6b617c,UserView.WidgetLoad,https://www.moneysmart.sg/embed/f645886bc03619...,2020-02-18,1,1,shop,/embed/f645886bc036195148acd846a50232d9,/embed,control,sg,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,desktop,Windows,10,Chrome,79.0.3945,False


In [None]:
# Check that every row got a user agent; any rows listed here have none
merged_df_with_meta[merged_df_with_meta.user_agent.isnull()].head()

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code,user_agent,device_family,os_family,os_version,browser_family,browser_version,is_bot
201461,1c5a94bc-d2ba-4739-865c-bceed480fde2,PageView,https://blog.moneysmart.sg/property/hdb-bto-fl...,2020-02-22,1,1,blog,/property/hdb-bto-flat-guide,/property,control,sg,,,,,,,
2108265,b2dbb8dd-e5e2-44c0-afba-0e9c9a7b36d7,PageView,https://blog.moneysmart.sg/renovation-loans/re...,2020-02-22,0,1,blog,/renovation-loans/renovation-singapore,/renovation-loans,control,sg,,,,,,,


In [None]:
# Check that every row got a device family; any rows listed here have none
merged_df_with_meta[merged_df_with_meta.device_family.isnull()].head()

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code,user_agent,device_family,os_family,os_version,browser_family,browser_version,is_bot
201461,1c5a94bc-d2ba-4739-865c-bceed480fde2,PageView,https://blog.moneysmart.sg/property/hdb-bto-fl...,2020-02-22,1,1,blog,/property/hdb-bto-flat-guide,/property,control,sg,,,,,,,
2108265,b2dbb8dd-e5e2-44c0-afba-0e9c9a7b36d7,PageView,https://blog.moneysmart.sg/renovation-loans/re...,2020-02-22,0,1,blog,/renovation-loans/renovation-singapore,/renovation-loans,control,sg,,,,,,,
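
The two checks above list the rows that the left joins could not match. As a rough sketch of a way to count them during the merge itself, pandas can flag unmatched rows with `indicator=True` and guard against accidental row duplication with `validate`. The `events` and `agents` frames below are made-up toy data, not the notebook's real frames.

```python
import pandas as pd

# Toy stand-ins for merged_df_with_meta and the anonymous_id -> user_agent lookup.
events = pd.DataFrame({
    "anonymous_id": ["a", "a", "b", "c"],
    "event_name": ["PageView", "Reading", "PageView", "PageView"],
})
agents = pd.DataFrame({
    "anonymous_id": ["a", "b"],
    "user_agent": ["ua-1", "ua-2"],
})

merged = events.merge(
    agents,
    on="anonymous_id",
    how="left",
    validate="many_to_one",  # raises if the lookup has duplicate anonymous_ids
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

unmatched = merged[merged["_merge"] == "left_only"]
print("%i of %i rows have no user agent" % (len(unmatched), len(merged)))
```

Dropping the `_merge` column afterwards keeps the frame's shape the same as in the cells above.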


### Clean up data frames / save some memory

In [None]:
# TODO: could do a lot more here
segment_anonymous_id_to_user_agent_full_df = None
segment_anonymous_id_to_user_agent_df = None
athena_anonymous_id_to_user_agent_full_df = None
athena_anonymous_id_to_user_agent_df = None
sd = None
ad = None
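
One possible next step for the TODO above, sketched on toy data: columns with few distinct values (such as `country_code` or `page_type`) usually take far less memory as pandas categoricals, and `gc.collect()` can be called after dropping references to large frames. The frame `df` and its contents are illustrative assumptions only.

```python
import gc

import pandas as pd

# Toy frame with repetitive string columns, similar in spirit to the merged frame.
df = pd.DataFrame({
    "country_code": ["sg", "hk", "sg", "sg"] * 1000,
    "page_type": ["blog", "shop", "iss", "blog"] * 1000,
    "s_count": [1, 0, 2, 1] * 1000,
})
bytes_before = df.memory_usage(deep=True).sum()

# Low-cardinality string columns are stored once per category plus small integer codes.
for col in ["country_code", "page_type"]:
    df[col] = df[col].astype("category")
bytes_after = df.memory_usage(deep=True).sum()

# Dropping the last reference and collecting lets Python release the memory sooner.
temp_df = df.copy()
del temp_df
gc.collect()

print("Memory: %i -> %i bytes" % (bytes_before, bytes_after))
```

Note that grouping by categorical columns can include unobserved categories unless `observed=True` is passed to `groupby`, so this change is worth testing against the downstream analysis.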

# Play Area

In [None]:
d = merged_df_with_meta[merged_df_with_meta.page_type=="iss"].groupby(["slug", "page_type"]).sum()
d[d.s_count>0]
merged_df_with_meta[(merged_df_with_meta.page_type=="iss") & (merged_df_with_meta.page_url.str.contains("iss.", regex=False))]

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code,user_agent,device_family,os_family,os_version,browser_family,browser_version,is_bot
461,0014277b-48e8-4e3f-9c06-983f00cfb11b,LeadGeneration.RedirectCompleted,https://iss.moneysmart.sg/personal-loan/dbs-pe...,2020-02-17,1,0,iss,/personal-loan/dbs-personal-loan/redirect,/personal-loan,control,sg,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6...,desktop,Mac OS X,10.13.6,Safari,12.0.3,False
462,0014277b-48e8-4e3f-9c06-983f00cfb11b,LeadGeneration.RedirectCompleted,https://iss.moneysmart.sg/personal-loan/posb-p...,2020-02-17,1,0,iss,/personal-loan/posb-personal-loan/redirect,/personal-loan,control,sg,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6...,desktop,Mac OS X,10.13.6,Safari,12.0.3,False
463,0014277b-48e8-4e3f-9c06-983f00cfb11b,LeadGeneration.RedirectCompleted,https://iss.moneysmart.sg/personal-loan/posb-p...,2020-02-17,1,0,iss,/personal-loan/posb-personal-loan/redirect,/personal-loan,control,sg,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6...,desktop,Mac OS X,10.13.6,Safari,12.0.3,False
464,0014277b-48e8-4e3f-9c06-983f00cfb11b,LeadGeneration.RedirectCompleted,https://iss.moneysmart.sg/personal-loan/scb-ca...,2020-02-17,1,0,iss,/personal-loan/scb-cashone/redirect,/personal-loan,control,sg,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6...,desktop,Mac OS X,10.13.6,Safari,12.0.3,False
465,0014277b-48e8-4e3f-9c06-983f00cfb11b,PageView,https://iss.moneysmart.sg/personal-loan/dbs-pe...,2020-02-17,1,1,iss,/personal-loan/dbs-personal-loan/redirect,/personal-loan,control,sg,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6...,desktop,Mac OS X,10.13.6,Safari,12.0.3,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2224751,fc90a9e3-395e-415f-bb6d-f192bc70bdab,LeadGeneration.RedirectCompleted,https://iss.moneysmart.sg/credit-cards/cimb-vi...,2020-02-19,0,1,iss,/credit-cards/cimb-visa-signature/redirect,/credit-cards,control,sg,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,desktop,Windows,10,Chrome,80.0.3987,False
2224753,fc90a9e3-395e-415f-bb6d-f192bc70bdab,PageView,https://iss.moneysmart.sg/credit-cards/cimb-vi...,2020-02-19,0,1,iss,/credit-cards/cimb-visa-signature/redirect,/credit-cards,control,sg,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,desktop,Windows,10,Chrome,80.0.3987,False
2226635,fdc6efad-f8fc-43e3-abff-c161a0de0d04,LeadGeneration.RedirectCompleted,https://iss.moneysmart.sg/credit-cards/standar...,2020-02-16,0,2,iss,/credit-cards/standard-chartered-unlimited-cas...,/credit-cards,control,sg,Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:7...,desktop,Windows,10,Firefox,72.0,False
2226639,fdc6efad-f8fc-43e3-abff-c161a0de0d04,PageView,https://iss.moneysmart.sg/credit-cards/standar...,2020-02-16,0,2,iss,/credit-cards/standard-chartered-unlimited-cas...,/credit-cards,control,sg,Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:7...,desktop,Windows,10,Firefox,72.0,False
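
For this kind of exploration, a per-slug comparison of the two pipelines can be easier to read than raw rows. The following is a minimal sketch on made-up data, summing only the count columns and keeping slugs where segment (`s_count`) and kinesis (`k_count`) disagree; the `k_minus_s` column name is an assumption for illustration.

```python
import pandas as pd

# Toy rows shaped like merged_df_with_meta (only the columns needed here).
df = pd.DataFrame({
    "page_type": ["iss", "iss", "iss", "blog"],
    "slug": ["/a/redirect", "/a/redirect", "/b/redirect", "/c"],
    "s_count": [1, 0, 1, 1],
    "k_count": [0, 0, 1, 1],
})

iss = df[df.page_type == "iss"]
# Select the numeric columns explicitly so string columns are not aggregated.
by_slug = iss.groupby("slug")[["s_count", "k_count"]].sum()
by_slug["k_minus_s"] = by_slug.k_count - by_slug.s_count

mismatched = by_slug[by_slug.k_minus_s != 0]
print(mismatched)
```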


In [None]:
merged_df_with_meta[(merged_df_with_meta.slug_root=="/zh-hk") & (merged_df_with_meta.country_code=="hk") & (merged_df_with_meta.page_type!="blog")].head(40) #.groupby(["slug"]).sum()

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code,user_agent,device_family,os_family,os_version,browser_family,browser_version,is_bot


In [122]:
merged_df_with_meta.head()

Unnamed: 0,anonymous_id,event_name,page_url,date,s_count,k_count,page_type,slug,slug_root,ab_test,country_code,user_agent,device_family,os_family,os_version,browser_family,browser_version,is_bot
0,00000487-4c71-4c7e-a372-ab7196780fb0,PageView,https://blog.moneysmart.hk/zh-hk/credit-cards/...,2020-02-18,1,1,blog,/zh-hk/credit-cards/%e9%85%92%e5%ba%97%e8%87%a...,/credit-cards,control,hk,Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple...,desktop,Windows,7,Chrome,80.0.3987,False
1,00000487-4c71-4c7e-a372-ab7196780fb0,Reading,https://blog.moneysmart.hk/zh-hk/credit-cards/...,2020-02-18,3,3,blog,/zh-hk/credit-cards/%e9%85%92%e5%ba%97%e8%87%a...,/credit-cards,control,hk,Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple...,desktop,Windows,7,Chrome,80.0.3987,False
2,000004a5-230d-4d4c-a850-d2ec44427589,PageView,https://www.moneysmart.sg/credit-cards/posb-ev...,2020-02-21,1,1,shop,/credit-cards/posb-everyday-card,/credit-cards,control,sg,Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like...,mobile,iOS,13.3.1,Mobile Safari,13.0.5,False
3,00000bc0-99d3-4742-a855-74d49f6b617c,PageView,https://www.moneysmart.sg/embed/f645886bc03619...,2020-02-18,1,1,shop,/embed/f645886bc036195148acd846a50232d9,/embed,control,sg,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,desktop,Windows,10,Chrome,79.0.3945,False
4,00000bc0-99d3-4742-a855-74d49f6b617c,UserView.WidgetLoad,https://www.moneysmart.sg/embed/f645886bc03619...,2020-02-18,1,1,shop,/embed/f645886bc036195148acd846a50232d9,/embed,control,sg,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,desktop,Windows,10,Chrome,79.0.3945,False


# Store Data Frame for Faster Loading etc.

When stored as a gzip-compressed parquet file, the result is quite small: roughly 30MB for 3 days of data.

In [None]:
!pip install fastparquet

Collecting fastparquet
  Downloading https://files.pythonhosted.org/packages/5f/92/8135e08d0fd97b219e00a258c31ca95cf3cc1e654dff0a2859acf4c34d2b/fastparquet-0.3.3.tar.gz (152kB)
Collecting thrift>=0.11.0 (from fastparquet)
  Downloading https://files.pythonhosted.org/packages/97/1e/3284d19d7be99305eda145b8aa46b0c33244e4a496ec66440dac19f8274d/thrift-0.13.0.tar.gz (59kB)
Building wheels for collected packages: fastparquet, thrift
  Running setup.py bdist_wheel for fastparquet ... done
  Stored in directory: /home/ec2-user/.cache/pip/wheels/a0/27/9f/d8066bbbbb77e97d8ad3daf4de155ead73693bc4aa2f52098c
  Running setup.py bdist_wheel for thrift ... done
  Stored in directory: /home/ec2-user/.cache/pip/wheels/02/a2/46/689ccfcf40155c23edc7cdbd9de488611c8fdf49ff34b1706e
Successfully built fastparquet thrift
Installing c

In [None]:
if save_end_dataframe_to_file:
    from_to_str = "_to_".join([z.strftime("%Y%m%d_%H%M") for z in [from_datetime, to_datetime]])
    parquet_filename = "merged_df_with_meta_"+from_to_str+".gzip"
    
    merged_df_with_meta.to_parquet(parquet_filename, compression='gzip')
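
To load the saved frame in another notebook, the same filename has to be rebuilt from the date range. A minimal sketch follows, assuming the naming scheme used in the cell above; `build_parquet_filename` and `load_merged_df` are hypothetical helper names, and `pd.read_parquet` needs a parquet engine such as fastparquet or pyarrow installed.

```python
from datetime import datetime, timedelta

import pandas as pd


def build_parquet_filename(from_datetime, to_datetime):
    """Rebuild the filename used when the merged frame was saved."""
    from_to_str = "_to_".join(
        z.strftime("%Y%m%d_%H%M") for z in [from_datetime, to_datetime]
    )
    return "merged_df_with_meta_" + from_to_str + ".gzip"


def load_merged_df(from_datetime, to_datetime):
    """Read the saved frame back; the file must exist in the working directory."""
    return pd.read_parquet(build_parquet_filename(from_datetime, to_datetime))


# Same date range as the Settings cell: 7 days ending 2020-02-23.
to_dt = datetime(year=2020, month=2, day=23)
from_dt = to_dt - timedelta(days=7)
example_filename = build_parquet_filename(from_dt, to_dt)
print(example_filename)
```

Despite the `.gzip` extension, the file is a parquet file with gzip-compressed columns, so it is read with `pd.read_parquet` rather than unzipped first.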

    

In [None]:
# TODO: look into the AB test stuff more. I think the URLs differ between segment and kinesis
# (but I think we've found the origin, and it might already be fixed / a non-issue).