<a href="https://colab.research.google.com/github/jefftwebb/MSBA-Capstone/blob/main/EDA_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EDA with SQL, Python and R in Google Colab



# 1. Connect to BigQuery, format table display and load R extension.

This interactive authentication requires that you have a Google account.

In [12]:
from google.colab import auth
auth.authenticate_user()

Using Data Table makes it more convenient to navigate tables within the notebook.

In [None]:
from google.colab import data_table
data_table.enable_dataframe_formatter()

The rpy2 package allows us to write R code.

In [5]:
%load_ext rpy2.ipython


# 2. Magic commands to write SQL and R code

Google Colab is a web-hosted version of a Jupyter notebook. Project Jupyter started in 2014 with the goal of making IPython notebooks language agnostic. Hence the name Jupyter, which refers to Julia, Python, and R. In Google Colab we can write code chunks in a variety of programming languages, in addition to the native Python, via the so-called "magic commands."

First: what are magic commands? 

In [8]:
%magic

Which ones are available?

In [None]:
%lsmagic

Notice that there are two levels of magic commands:

- %% affects the entire cell; 
- % affects an individual line.

In this tutorial we'll be using BigQuery magics (to write SQL against tables in BigQuery) and R magics (to write R code).

For example, the following cell uses `%%bigquery` to query the Google Analytics public dataset, storing the result in a pandas DataFrame (the magic's default output format) that is named `df` in the cell magic statement. **In the code below you will need to replace "project-id" with your own GCP project ID.**

Here for illustration we return 10 rows from the entire dataset for Aug 1, 2017.

In [13]:
%%bigquery --project project-id df

SELECT *
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
LIMIT 10



In [None]:
type(df)

In [None]:
df.shape

In [None]:
df.info()


In [None]:
df.head()

The nested and repeated fields from BigQuery have been brought into the pandas DataFrame as what appear to be JSON columns. These include: totals, trafficSource, device, geoNetwork, and hits.

I prefer to do EDA in R, so my ultimate goal is to produce a rectangular dataset that can be read into R for data exploration. However, R will not accept the JSON columns (which are stored in pandas as Python objects such as dicts). To prepare for EDA in R, let's go back and revise the initial query to flatten the dataset (as well as to pull in more dates), with one row per visit.
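To make "flattening" concrete on the pandas side, here is a minimal sketch that expands a dict-valued column into ordinary columns with `pandas.json_normalize`. The toy rows and the `flatten_dict_column` helper are invented for illustration and are not the real query output. Flattening in SQL, as the next section does, avoids this step altogether.

```python
import pandas as pd

# Toy stand-in for the query result: one dict-valued column, as in `totals`.
# (Made-up values, not real Google Analytics data.)
toy = pd.DataFrame({
    "visitId": [1, 2],
    "totals": [{"visits": 1, "hits": 4}, {"visits": 1, "hits": 9}],
})

def flatten_dict_column(frame, column):
    """Replace a dict-valued column with one ordinary column per key."""
    expanded = pd.json_normalize(frame[column].tolist())
    expanded.index = frame.index  # keep rows aligned with the original frame
    expanded = expanded.add_prefix(f"{column}_")
    return frame.drop(columns=[column]).join(expanded)

flat = flatten_dict_column(toy, "totals")
print(flat.columns.tolist())
```

A repeated field such as `hits` holds several records per session, so expanding it this way would not give one row per session.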

# 3. Flatten Data

For reference, here is a detailed [data dictionary](https://support.google.com/analytics/answer/3437719?hl=en&ref_topic=3416089).

You will most likely want to create different tables as you explore different ideas during EDA.

For this exercise, we will focus on a few tables and bring in a month's worth of data. (The query is somewhat time-consuming.) The goal is to create a table without nested data, featuring just one row per session. Note that this will entail avoiding the `hits` and `product` tables since unnesting those will generate multiple rows per session, much as a one-to-many join does.



In [2]:
%%bigquery --project project-id df 

SELECT fullVisitorId, visitId, date,  device.*, totals.*, channelGrouping
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20160801' AND '20160831'



In [None]:
df.shape

In [None]:
df.columns
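If you expect to rerun the flattening query for different months, it may help to build the SQL string in Python. The sketch below reproduces the query from this section for an arbitrary inclusive date range. The `sessions_query` helper is a hypothetical name of my own, not part of any library.

```python
from datetime import date

def sessions_query(start, end):
    """Build the one-row-per-session query for an inclusive date range."""
    if start > end:
        raise ValueError("start must not be after end")
    return (
        "SELECT fullVisitorId, visitId, date, device.*, totals.*, channelGrouping\n"
        "FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`\n"
        # _TABLE_SUFFIX is the part of the table name matched by the wildcard.
        f"WHERE _TABLE_SUFFIX BETWEEN '{start:%Y%m%d}' AND '{end:%Y%m%d}'"
    )

sql = sessions_query(date(2016, 8, 1), date(2016, 8, 31))
print(sql)
```

In Colab the resulting string could then be run with the BigQuery client library, for example `bigquery.Client(project="project-id").query(sql).to_dataframe()`, assuming you have authenticated as in section 1.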

# 4. Import Data into R

We have already loaded the rpy2 package so we can now use R magic commands to load the df dataset into R. (-i stands for "input.")

In [6]:
%R -i df

Load the tidyverse as well as any other packages you would like to use. Note that you will need to install packages first.



In [None]:
%%R 
library(tidyverse)


Inspect the data.

In [None]:
%R str(df)

Looks good!

# 5. EDA

EDA is an art and a science. There is no one right way to explore a dataset. Use your creativity and critical thinking to ask and answer questions, based on your knowledge of the business problem and business context, perhaps starting with some QA and then increasingly focusing on the target variable and possible predictors.

Let's assume for now that the target variable for this project will be `transactionRevenue`.

It can be helpful at the beginning of the EDA process to list a set of questions to be investigated. 

**First**, evaluate the *quality* of the data by asking and answering these sorts of questions:

- Is the grain of the data as expected given the query? Here we are focusing on the `totals` table so there should be one row per session.  (The number of unique session IDs should equal the number of rows.) (How would this change if we brought in data recorded at the sub-session level, for example from the hits table?)
- What is the distribution of sessions across visitors?
- What is the scope of missing data? 
- What is the meaning of missing data?  Can NAs be replaced with meaningful quantities based on the data dictionary or (our understanding of) the data collection process?
- Is the daily frequency of visits as expected, or are there anomalies that need further investigation? 
- Is the range of the variables reasonable or do there appear to be mistakes? In particular, given the definition of a visit do the distributions of variables like visits, hits, pageViews and bounces make logical sense?
- Are there columns without any (or with very little) variation? (If so, we will want to remove them, as they contain little or no information.) 

Bottom line:  you want to make sure that you understand the structure and quality of the data so that your results mean what you think they mean. (Nothing worse than having someone point out in a presentation that your great insight was the result of misunderstanding the data....)
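As one way to start on the missing-data questions above, here is a rough pandas sketch that reports the share of missing values per column. The toy data frame is invented for illustration. On the real `df` you would run the same two method calls after the flattening query.

```python
import pandas as pd

# Made-up sessions; None marks a missing value.
toy = pd.DataFrame({
    "visitId": [1, 2, 3, 4],
    "pageviews": [3, None, 12, 1],
    "transactionRevenue": [None, None, 25_000_000, None],
})

# isna() gives True/False per cell; the column mean is the share missing.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

A high share is not necessarily a problem: whether a missing value means "zero," "not applicable," or "not recorded" still has to come from the data dictionary.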

**Second**, examine the univariate distributions of the important variables, still keeping an eye out for anomalies and problems but also considering possible transformations such as binning numeric distributions and log transforming skewed variables. In particular:

- What is the distribution of `transactionRevenue`? In particular: Skewed? Zero-inflated? How rare is purchasing?

**Third**, look at relationships involving the target, especially considering possible interactions.  The purpose here is to do informal modeling via exploratory analysis in preparation for confirmatory analysis later on. At this stage you want to explore hypotheses about why people decide to buy, or about the process that leads to buying. For example:

- Are page views correlated with transactions?
- Does the relationship between page views and transactions differ by day of week or by channel or device?

There are a large number of possible questions at this third stage, constrained only by your creativity. Very likely, asking and answering a question will engender further exploration.  The process may feel a little bit like looking for a needle in a haystack, but you can limit the questions by thinking about possible drivers of purchasing. For example, a large number of page views suggests that a customer is looking for something, and might therefore be more likely to purchase. At this stage, also, you may think of questions that would require new or different data to answer. Get creative about engineering new features to answer such questions. 
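As a small example of feature engineering for the day-of-week question, the sketch below derives a weekday name from a session date. It assumes `date` is stored as a `YYYYMMDD` string, as the data dictionary describes. The `day_of_week` helper is a hypothetical name.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def day_of_week(date_string):
    """Return the weekday name for a date stored as a YYYYMMDD string."""
    parsed = datetime.strptime(date_string, "%Y%m%d")
    return DAY_NAMES[parsed.weekday()]  # weekday(): Monday is 0

print(day_of_week("20160801"))
```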


# Example EDA

The goal here is to give you some examples to work from, and to model the process. The exploration and cleaning below is just a tiny subset of what would need to happen in an actual project. 

## Grain of the data.  One row per visit?

In [None]:
%%R

df$visitId |> unique() |> length()

In [None]:
%%R

df |> nrow()

Not the same.  Arrgh! Row-by-row exploration will be required to understand what is going on.

In [None]:
%%R 

which(duplicated(df$visitId))[1]

In [None]:
%R df[107, ]

In [None]:
%%R

df |> filter(visitId == 1470499213)

Interesting!  `visitId` is not unique.  According to the data dictionary it is unique only within a visitor, and a single session is recorded twice by Google Analytics if it happens to span the change in date from 11:59 PM to 12:00 AM. However, we can create a unique ID by combining `visitId` with `fullVisitorId` and `date`.

In [None]:
%R df$uniqueId <- paste(df$fullVisitorId, df$visitId, df$date, sep = "_")
%R df$uniqueId |> unique() |> length()

Now there is one row per unique visit.  

It is worth noting that we could choose to treat a session that occurs over two days not as two sessions, as we've done, but as one session that is arbitrarily split by the clock. This is a data modeling decision, and as such is neither right nor wrong. Keep it in mind though, since we may want to revise it. 
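The same grain check can be done on the Python side before the data ever reaches R. This is a minimal sketch on invented rows: it compares the row count with the number of distinct `visitId` values and with the number of distinct (`fullVisitorId`, `visitId`, `date`) combinations.

```python
import pandas as pd

# Made-up rows: visitor A's session spans midnight, and visitor B happens
# to share the same visitId.
toy = pd.DataFrame({
    "fullVisitorId": ["A", "A", "B"],
    "visitId": [100, 100, 100],
    "date": ["20160801", "20160802", "20160801"],
})

key = ["fullVisitorId", "visitId", "date"]
n_rows = len(toy)
n_visit_ids = toy["visitId"].nunique()
n_keys = len(toy.drop_duplicates(subset=key))

# The composite key matches the grain only if n_keys equals n_rows.
print(n_rows, n_visit_ids, n_keys)
```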

## Invariant columns?

We will remove columns with no variation and leave columns with little (or near zero) variation alone for now.



In [None]:
%%R 

apply(df, 2, n_distinct)

Yes, quite a few.  The removal strategy will be to transform the counts into a logical vector, which we can then use for filtering.  For example:

In [None]:
%%R 

apply(df, 2, n_distinct) > 1

Now use this boolean to select columns.

In [40]:
%%R

df <- df |>
select_if(apply(df, 2, n_distinct) >1)

In [None]:
%%R

glimpse(df)
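For comparison, here is a rough pandas version of the same idea: count distinct values per column and keep only the columns with more than one. The toy data is invented. Passing `dropna=False` counts a missing value as a value, which mirrors the default behavior of `n_distinct()`.

```python
import pandas as pd

# Made-up columns: two invariant, one with variation.
toy = pd.DataFrame({
    "visits": [1, 1, 1],
    "browser": ["Chrome", "Safari", "Chrome"],
    "bounces": [None, None, None],
})

# Distinct values per column, treating missing as a value.
n_unique = toy.nunique(dropna=False)

# Boolean mask over columns, used to keep only those that vary.
kept = toy.loc[:, n_unique > 1]
print(kept.columns.tolist())
```

This assumes the data frame is already flattened, since pandas cannot count distinct values in a column of dicts.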

## What is the distribution of `transactionRevenue`?

In [None]:
%R head(df$transactionRevenue, 1000)

Clearly, a session that did not result in a purchase is coded `NaN`.  That should be recoded as 0 and the amount of the purchase should be returned to a legible number by dividing by 10^6. While we're at it, we will copy this column and shorten the name.

In [48]:
%%R 

df <- df |> 
mutate(rev = transactionRevenue,
       rev = rev/10^6,
       rev = replace_na(rev, 0))

In [None]:
%R head(df$rev, 1000)
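The recoding rule is simple enough to state as a small Python function, which can be handy for spot-checking individual values. This is only a sketch of the rule described above (missing means no purchase, and raw values are scaled by 10^6). The `to_currency_units` name is hypothetical.

```python
import math

def to_currency_units(raw_revenue):
    """Convert a raw transactionRevenue value; missing means no purchase."""
    if raw_revenue is None or math.isnan(raw_revenue):
        return 0.0
    return raw_revenue / 10**6

# Made-up raw values: a missing value, a purchase, and a None.
converted = [to_currency_units(v) for v in [float("nan"), 25_990_000, None]]
print(converted)
```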

Let's visualize the distribution.

In [None]:
%%R

df |>
ggplot(aes(rev)) +
geom_histogram() +
labs(title = "Distribution of revenue per session (zeros included)") +
theme_minimal()


`transactionRevenue` is zero inflated.  Let's take a look at just the distribution of purchases.

In [None]:
%%R

df |>
filter(rev > 0) |>
ggplot(aes(rev)) +
geom_histogram() +
labs(title = "Distribution of purchase amounts") +
theme_minimal()

In [None]:
%%R

df |>
filter(rev > 0) |>
select(rev) |>
summary()


Long right tail. Log normal? Looks like it:

In [None]:
%%R

df |>
filter(rev > 0) |>
ggplot(aes(log(rev))) +
geom_histogram() +
labs(title = "Distribution of log purchase amounts") +
theme_minimal()

We may need to log transform `transactionRevenue`, depending on the algorithm used for modeling.

## What is the distribution of pageviews?

In [None]:
%%R 

ggplot(df, aes(pageviews)) +
geom_histogram()


## Are pageviews related to purchasing? 

Specifically does the proportion of purchases go up with `pageviews`? Our strategy will be to bin `pageviews` and calculate the purchase rate in the bins.  (Binning, incidentally, is generally very useful for visualization and modeling.) 

Two ggplot2 functions are helpful to create bins:  `cut_interval()` and `cut_number()`.

`cut_interval()` attempts to make the bin widths the same.

In [None]:
%%R 

df |>
filter(pageviews > 1) |>
mutate(pageview_bins = cut_interval(pageviews, 3)) |>
group_by(pageview_bins) |>
summarize(purchase_rate = sum(rev > 0)/n(),
          median_purchase = median(rev[rev > 0]), # this is median purchase excluding 0s
          n()) 

`cut_number()` attempts to put the same number of observations in each bin.

In [None]:
%%R 

df |>
filter(pageviews > 1) |>
mutate(pageview_bins = cut_number(pageviews,3)) |>
group_by(pageview_bins) |>
summarize(purchase_rate = sum(rev > 0)/n(),
          median_spent = median(rev[rev > 0]),
          n()) 

There's clearly a relationship between pageviews and the purchase rate.

Bins with roughly equal numbers of sessions work better in this case. However, the skewed distribution limits how many bins we can use, because the [2,3] group is so large. Let's drill down into that third bin to try to get a little more insight into the high `pageviews` group.



In [None]:
%%R 

df |>
filter(pageviews > 7) |>
mutate(pageview_bins = cut_number(pageviews,5)) |>
group_by(pageview_bins) |>
summarize(purchase_rate = sum(rev > 0)/n(),
          median_spent = median(rev[rev > 0]),
          n()) 

Indeed, as `pageviews` go up so does the purchase rate. There also appears to be a relationship with the *size* of the purchase: the median amount spent tends to increase with `pageviews`.
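If you want to reproduce this kind of binned summary in Python, pandas offers rough counterparts: `pd.cut` makes equal-width bins (like `cut_interval()`) and `pd.qcut` makes roughly equal-count bins (like `cut_number()`). The sketch below uses invented data and a hypothetical `purchase_rate` helper. Bin edges may not match the R functions exactly.

```python
import pandas as pd

# Made-up sessions with a skewed pageviews distribution.
toy = pd.DataFrame({
    "pageviews": [2, 2, 3, 3, 4, 6, 6, 13, 21, 40],
    "rev":       [0, 0, 0, 0, 0, 0, 0, 30.0, 0, 55.0],
})

# Equal-width bins vs. roughly equal-count bins.
toy["width_bin"] = pd.cut(toy["pageviews"], bins=3)
toy["count_bin"] = pd.qcut(toy["pageviews"], q=3)

def purchase_rate(frame, bin_col):
    """Share of sessions with revenue > 0, plus session count, per bin."""
    purchased = frame.assign(purchased=frame["rev"] > 0)
    grouped = purchased.groupby(bin_col, observed=True)["purchased"]
    return grouped.agg(purchase_rate="mean", n="size")

by_width = purchase_rate(toy, "width_bin")
by_count = purchase_rate(toy, "count_bin")
print(by_width)
print(by_count)
```

With skewed data the equal-width version puts almost every session in the first bin, which is the same pattern we saw with `cut_interval()`.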

## Is there an interaction between `pageviews` and `channelGrouping` in predicting `transactionRevenue`?

In [None]:
%%R

df |>
filter(pageviews > 1, rev > 0) |>
mutate(rev = log(rev),
       views = log(pageviews)) |>
ggplot(aes(views, rev, col = channelGrouping)) +
geom_point() +
geom_smooth(se = FALSE, method = "lm")

Hmmm.  Hard to interpret--seems inconclusive.  Let's try using the bins from above.

Here is the non-interaction plot first.

In [None]:
%%R

df |>
filter(pageviews > 7) |>
mutate(pageview_bins = cut_number(pageviews,5)) |>
group_by(pageview_bins) |>
summarize(purchase_rate = sum(rev > 0)/n()) |>
filter(purchase_rate > 0) |>
ggplot(aes(pageview_bins, purchase_rate)) +
geom_col() +
labs(title = "Purchase rate by pageviews",
     caption = "Google Analytics data from the Google Merchandise Store. Data includes only pageviews > 7.") +
theme_minimal()

The highest rate is about .27.

Here is the plot with the `channelGrouping` interaction.

In [None]:
%%R

df |>
filter(pageviews > 7) |>
mutate(pageview_bins = cut_number(pageviews,5)) |>
group_by(pageview_bins, channelGrouping) |>
summarize(purchase_rate = sum(rev > 0)/n()) |>
filter(purchase_rate > 0) |>
ggplot(aes(pageview_bins, purchase_rate, fill = channelGrouping)) +
geom_col(position = "dodge") +
labs(title = "Purchase rate by pageviews and channels",
     caption = "Google Analytics data from the Google Merchandise Store. Data includes only pageviews > 7.") +
theme_minimal()

Clearly the purchase rate for different levels of `pageviews` varies by channel.

# Additional Questions

There are many.  Have at it.