### Instructions

Investigate the public dataset `bigquery-public-data.google_analytics_sample.ga_sessions_*` in order to select the most suitable features for a propensity model.

Field definitions: https://support.google.com/analytics/answer/3437719?hl=en

(Optional) Use Cloud Dataprep to explore the data distributions. Do not run the Dataflow pipeline; it would incur additional cost.

Use the basic SQL functions `UNNEST()`, `IFNULL()`, and `COUNTIF()` to extract the most important features.

Start by querying a single day (one table shard) to keep costs low, e.g. `_TABLE_SUFFIX = '20160801'`.
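A minimal sketch of such a one-day exploration query, assuming BigQuery standard SQL. The selected columns and their aliases are illustrative choices, not a prescribed feature set.

```sql
-- Explore a single daily shard of the wildcard table to limit the bytes scanned.
SELECT
  fullVisitorId,
  IFNULL(totals.bounces, 0) AS bounces,          -- NULL totals become 0
  IFNULL(totals.timeOnSite, 0) AS time_on_site,
  IFNULL(totals.pageviews, 0) AS pageviews,
  channelGrouping,
  device.isMobile AS is_mobile
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX = '20160801'                     -- one day only
LIMIT
  100;
```

Note that `LIMIT` only caps the number of rows returned; the cost saving comes from the `_TABLE_SUFFIX` filter, which restricts the scan to one shard.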

`Hints`

totals.newVisits = 1: this is the user's first session on the site (the field is NULL for returning visits)

hits.eCommerceAction.action_type = '2': product detail view

hits.eCommerceAction.action_type = '3': add product to cart (checkout is '5')

totals.transactions: number of eCommerce transactions (sales) in the session

fullVisitorId: unique, anonymized ID of the visitor
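The fields above can be combined into per-session flags. The following is a sketch only, assuming the `action_type` codes from the Analytics export schema ('2' = product detail view, '3' = add to cart); the alias `h` and the output column names are made up for illustration.

```sql
-- Per-session funnel counts for one day, using COUNTIF() over the unnested hits.
SELECT
  fullVisitorId,
  visitId,
  (SELECT COUNTIF(h.eCommerceAction.action_type = '2')
   FROM UNNEST(hits) AS h) AS product_detail_hits,
  (SELECT COUNTIF(h.eCommerceAction.action_type = '3')
   FROM UNNEST(hits) AS h) AS add_to_cart_hits,
  IFNULL(totals.transactions, 0) AS transactions,
  IFNULL(totals.newVisits, 0) AS is_first_visit  -- 1 on the first session, else 0
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX = '20160801';
```

Because `hits` is a repeated field, each correlated subquery aggregates over the hits of one session only, so the result keeps one row per session.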

### Solution

In [None]:
SELECT
  *
FROM (
  SELECT
    PARSE_TIMESTAMP("%Y%m%d", date) AS parsed_date,
    fullVisitorId,
    IFNULL(totals.bounces,
      0) AS bounces,
    IFNULL(totals.timeOnSite,
      0) AS time_on_site,
    totals.pageviews AS pageviews,
    trafficSource.source,
    trafficSource.medium,
    channelGrouping,
    device.isMobile,
  IF
    ((
      SELECT
        SUM(
        IF
          (eCommerceAction.action_type='3',
            1,
            0))
      FROM
        UNNEST(hits))>=1,
      1,
      0) AS add_to_cart,
  IF
    ((
      SELECT
        SUM(
        IF
          (eCommerceAction.action_type='2',
            1,
            0))
      FROM
        UNNEST(hits))>=1,
      1,
      0) AS product_detail_view,
    IFNULL(geoNetwork.city,
      "") AS city,
    IFNULL(geoNetwork.country,
      "") AS country
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*` s
  WHERE
    totals.newVisits = 1
    AND _TABLE_SUFFIX BETWEEN '20160801'
    AND '20170801'
    AND geoNetwork.country = "United States")
-- Label: 1 if the visitor made a purchase in any returning (non-first) session.
JOIN (
  SELECT
    fullvisitorid,
  IF
    (COUNTIF(totals.transactions > 0
        AND totals.newVisits IS NULL) > 0,
      1,
      0) AS will_buy_later
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  GROUP BY
    fullvisitorid)
USING
  (fullVisitorId);