# Conversions

This notebook tries to collect the set of all actions that we, as a business, consider conversions.

This set might need updating over time. In Metabase land, conversions are defined at the view level rather than in the dim and fact tables, which makes it hard to write general queries like "which users converted, and where did they convert?".

I see a few conversion types:
* Contact form filled in (as in they performed the action we wanted on the site)
* Advert clicked / action taken on the blog that wasn't directly revenue-generating (though it might lead to a revenue-generating action). Not the priority.
* Apply button clicked, whether this goes to an external site or the application is performed on our own site.
* Purchased on our site, i.e. clicked through to an application form and then actually converted.
* Purchased on an external site (we don't currently have user-level tracking here though).

Of the above, the most important one is the apply button click as it's the most comparable across all areas.

You can get way more complicated by including information like the revenue per action etc. if you want to know how valuable clicks are. That's probably out of scope for the first-level analysis.

The initial use case for this is looking at revenue-generating conversions from the blog during the blog A/B test. Given we're into credit cards, and they're fairly high volume, I'll likely focus on those.

# Planning

## Overview

The suggested data structure is something like `[anonymous_id, timestamp, conversion_type, channel of product purchased, page action took place on, action_page_type, action_page_widget]`, where the page the action took place on shouldn't be, say, ISS, but something like the blog page.

The action_page_type should be one of listing_page / PDP / blog_article, and action_page_widget would be, say, comparison_widget / advert / inline_widget.
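As a rough sketch of the record described above, the fields could be typed like this. The class names, enum labels, the `product_channel` and `action_page_url` field names, and all example values are assumptions made for illustration; only the field list itself comes from the notes above.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class ConversionType(Enum):
    # Hypothetical labels for the conversion types listed at the top of the notebook.
    CONTACT_FORM = "contact_form"
    ADVERT_CLICK = "advert_click"
    APPLY_CLICK = "apply_click"
    PURCHASE_ON_SITE = "purchase_on_site"
    PURCHASE_EXTERNAL = "purchase_external"


@dataclass(frozen=True)
class Conversion:
    anonymous_id: str
    timestamp: datetime
    conversion_type: ConversionType
    product_channel: str     # channel of product purchased
    action_page_url: str     # page the action took place on (the blog page, not ISS)
    action_page_type: str    # listing_page / PDP / blog_article
    action_page_widget: str  # comparison_widget / advert / inline_widget


# Made-up example row, only to show the shape of one record.
example = Conversion(
    anonymous_id="abc-123",
    timestamp=datetime(2020, 2, 8, 12, 30),
    conversion_type=ConversionType.APPLY_CLICK,
    product_channel="credit_cards",
    action_page_url="https://blog.example.com/some-article/",
    action_page_type="blog_article",
    action_page_widget="inline_widget",
)
```

Keeping `conversion_type` as an enum rather than a free string is just one option; it makes it harder for typos to create accidental new conversion types.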

In Metabase land, the view also adds things like session and landing page, but those are out of scope here. If needed, they can be joined to the results of this on anonymous_id and timestamp.

That said, I will likely build this up incrementally, as the main thing is knowing the overall conversions that can be attributed to an anonymous_id rather than getting full segmentation.


## Success Will Be

Creating a library that can be used in other areas, i.e. this notebook isn't the product. It's just for building up and validating the code (which will need checking against dashboards etc.).

## Known / Expected Data Sources

* There are click events for the various apply button clicks.
* Most conversions will happen through ISS, which can be on one of several URLs (blog, shop and iss.moneysmart.*). I guess it isn't used for applications on the site, but it should be for all applications off the site. This has been partly collected, IIRC, in segment_vs_kinesis under ISS.
* Not sure about mortgage.

## Existing Code
* The code currently in use is in the views in the metl/dags/refresh_rs_views/templates codebase (on GitHub).


## Considerations

* Dashboards exclude internal users (by IP address), so numbers here need the same exclusion to be comparable.
* https://get.moneysmart.sg/apply/citi-premiermiles-visa-sl-all-cards/ seems not to be tracked (it's an Unbounce page, with maybe 130 views per day, but it's a very high-intent page).
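For the internal-user exclusion mentioned in the first consideration, a minimal sketch using the standard-library `ipaddress` module might look like the following. The network ranges are documentation-only placeholder addresses, and the `ip` field name and `is_internal_ip` helper are hypothetical; the real internal ranges and event schema are not given in these notes.

```python
import ipaddress

# Hypothetical internal ranges (reserved documentation addresses, not real office IPs).
INTERNAL_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.16/28"),
]


def is_internal_ip(ip):
    """Return True if ip falls inside any of the internal networks."""
    try:
        address = ipaddress.ip_address(ip)
    except ValueError:
        return False  # missing or malformed IPs are treated as external
    return any(address in network for network in INTERNAL_NETWORKS)


# Made-up events, only to show the filter in use.
events = [
    {"anonymous_id": "a", "ip": "203.0.113.7"},
    {"anonymous_id": "b", "ip": "192.0.2.55"},
]
external_events = [e for e in events if not is_internal_ip(e["ip"])]
```

Treating unparseable IPs as external is a judgment call; the dashboards may handle them differently, so this would need checking against the existing views.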


In [10]:
from athena_querying import AthenaQuery
from athena_common_queries import create_partition_filter, athena_database, athena_raw_events_table #(from_datetime, to_datetime)
from datetime import datetime, timedelta

In [11]:
num_days_to_query = 2
to_datetime = datetime(year=2020, month=2, day=9)
from_datetime = to_datetime - timedelta(days=num_days_to_query)

In [12]:
aq = AthenaQuery()
aq.connect()

In [13]:
# if "iss.moneysmart" in nl or stripped_path.endswith("apply") or stripped_path.endswith("redirect"):
query = """select 
        sent_at, 
        user.anonymous_id,
        context.page_url
        

    """

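query += " from " + athena_database + "." + athena_raw_events_table + " "
# Raw string: the regex below contains backslashes (\?) that must reach Athena unchanged.
query += r"""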
where 
    (context.page_url like '%moneysmart.sg%' or context.page_url like '%moneysmart.hk%')
    and ( "regexp_extract"("context"."page_url", '^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?', 5) like '%/apply' -- not sure if there's a risk of a trailing slash
        OR "regexp_extract"("context"."page_url", '^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?', 5) like '%/redirect'
        OR "regexp_extract"("context"."page_url", '^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?', 4) like '%iss.moneysmart%'
    )

"""
query += "and " + create_partition_filter(from_datetime, to_datetime)
print(query)

select 
        sent_at, 
        user.anonymous_id,
        context.page_url
        

     from ms_data_lake_production.ms_data_stream_production_processed 
where 
    (context.page_url like '%moneysmart.sg%' or context.page_url like '%moneysmart.hk%')
    and ( "regexp_extract"("context"."page_url", '^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?', 5) like '%/apply' -- not sure if there's a risk of a trailing slash
        OR "regexp_extract"("context"."page_url", '^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?', 5) like '%/redirect'
        OR "regexp_extract"("context"."page_url", '^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?', 4) like '%iss.moneysmart%'
    )

and 
  (
 partition_0 >= '2020'
 AND partition_1 >= '02'
 AND partition_2 >= '07'
 OR (
 partition_0 >= '2020'
 AND partition_1 > '02'
 ) 
 OR (
 partition_0 > '2020'
 ) 
)
 AND ((partition_0 <= '2020'
	 AND partition_1 <= '02'
	 AND partition_2 <= '09'
) 
 OR (
	 partition_0 <= '2020'
	 A

In [15]:
df = aq.query(query)

In [16]:
len(df)

23363
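The SQL filter above notes a possible trailing-slash risk on `/apply`. A Python sketch of roughly the same check, which tolerates a trailing slash, could be used to sanity-check the returned rows. The `is_conversion_url` name is hypothetical, and unlike the SQL it tests the host rather than the whole URL for `moneysmart.sg` / `moneysmart.hk`, so it is an approximation rather than an exact mirror.

```python
from urllib.parse import urlparse


def is_conversion_url(url):
    """Approximate the SQL filter: ISS host, or a path ending in /apply or /redirect."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    # Only consider our own domains (checked on the host, not the full URL).
    if "moneysmart.sg" not in host and "moneysmart.hk" not in host:
        return False
    if "iss.moneysmart" in host:
        return True
    path = parsed.path.rstrip("/")  # tolerate a trailing slash
    return path.endswith("/apply") or path.endswith("/redirect")
```

Applied to the `page_url` column of `df`, this would show how many rows the SQL version misses because of trailing slashes, assuming the query is first loosened to return those rows.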

In [None]:
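def process_apply_url_to_channel(url):
    """Map an apply/redirect URL to a product channel (work in progress)."""
    if url.startswith("https://iss."):
        # TODO: derive the channel for ISS URLs.
        return "unknown"
    elif url.startswith("https://"):
        # TODO: derive the channel for on-site URLs.
        # Need to be a bit careful with the AB test here: www vs www-new etc.
        return "unknown"
    else:
        return "unknown"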