## Introduction

We are going to learn about **BigQuery ML**. Here is a link to the [official documentation tutorial](https://cloud.google.com/bigquery-ml/docs/bigqueryml-web-ui-start). We are going to use a Google Analytics sample dataset for BigQuery.

**BigQuery Machine Learning (BQML)** is a toolset that allows you to train and serve machine learning models directly in BigQuery.

Advantages:
* __You don't have to read your data into local memory__. A frequent question is "How can I train my ML model if my dataset is too big to fit on my computer?" Of course you can subsample your dataset, but you can also use tools like BQML that train your model directly in your database.

* __You don't have to use multiple languages__. If someone doesn't know your preferred language for modeling, working in SQL can make it easier for you to collaborate.

* __You can serve your model immediately after it's trained__. Because your model is already in the same place as your data, you can make predictions directly from your database. That way you don't have to clean up your code and move it somewhere else before you can serve predictions.


BQML probably won't replace your modeling tools, but it's a nice, quick way to train and serve a model without spending a lot of time moving code or data around.

#### Models supported by BQML

* Linear regression (LINEAR_REG). This is the OG modeling technique, used to predict the value of a continuous variable.

* Logistic regression (LOGISTIC_REG). This regression technique lets you classify which category an observation fits into.

* K-means (KMEANS). This is an unsupervised clustering algorithm. It lets you identify categories.

* TensorFlow (TENSORFLOW). If you've already got a trained TensorFlow model, you can upload it to BQML and serve it directly from there. You can't currently train a TensorFlow model in BQML.

We'll use:

* BQML to create a binary logistic regression model using the `CREATE MODEL` statement.

* The `ML.EVALUATE` function to evaluate the ML model.

* The `ML.PREDICT` function to make predictions using the ML model.

In [0]:
# Set your own project id here
PROJECT_ID = 'kaggle-bqml-camp'

# Import and setup what is needed
from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID, location="US")
dataset = client.create_dataset('bqml_tutorial',exists_ok=True)

from google.cloud.bigquery import magics
from kaggle.gcp import KaggleKernelCredentials
magics.context.credentials = KaggleKernelCredentials()
magics.context.project = PROJECT_ID

We'll use the Google Analytics sample dataset to predict whether a website visitor will make a transaction.

In [0]:
# create a reference to our table
table = client.get_table("bigquery-public-data.google_analytics_sample.ga_sessions_*")

# look at five rows from our dataset
client.list_rows(table, max_results = 5).to_dataframe()

In [0]:
# create a small sample dataframe
sample_table = client.list_rows(table, max_results=5).to_dataframe()

# get the first cell in the "totals" column
sample_table.totals[0]

For this problem, we'll be trying to predict transactions from the totals column, so keep this in mind! :)

### Create your model

The standard SQL query uses a `CREATE MODEL` statement to create and train the model.

The BigQuery Python client library provides a custom magic command so that you don't need to set up the queries yourself. To load the magic commands from the client library, run the following code.

In [0]:
%load_ext google.cloud.bigquery

`%load_ext` is one of the many Jupyter built-in magic commands.

The BigQuery client library provides a cell magic, `%%bigquery`, which runs a SQL query and returns the results as a Pandas DataFrame. Once you use this command the rest of your cell will be treated as a SQL command.
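Outside a notebook, the same thing can be done through the client object directly. Here is a minimal sketch, assuming `client` is the `bigquery.Client` created earlier; `run_sql` is a hypothetical helper name used for illustration, not part of the client library.

```python
def run_sql(client, sql):
    """Run a SQL string and return the result as a pandas DataFrame.

    `client` is expected to be a google.cloud.bigquery.Client.
    `run_sql` is a hypothetical helper, not a library function.
    """
    query_job = client.query(sql)    # submit the query job
    return query_job.to_dataframe()  # wait for the job and download the rows


# Usage in the notebook (needs credentials, so it is left commented out):
# df = run_sql(client, "SELECT 1 AS x")
```

The magic is more convenient in a notebook; the function form is handy when the SQL has to be built or reused in regular Python code.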

In [0]:
%%bigquery
CREATE MODEL IF NOT EXISTS `bqml_tutorial.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
   _TABLE_SUFFIX BETWEEN '20160801' AND '20170630'

Let's break down this command a little bit. The `CREATE MODEL` line writes our model to BigQuery (as `sample_model` in the `bqml_tutorial` dataset).

The `OPTIONS` line specifies our model type (here, logistic regression).

The code under the SELECT clause is where we define the variable we're trying to predict as well as what variables we want to use to predict it.

The column we alias as "label" will be our dependent variable, the thing we're trying to predict. The other four columns are our features: the information we want to use to predict that label.
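To make the label and feature definitions concrete, here is a rough Python equivalent of the SELECT clause for a single session. This is only a sketch: the nested dict stands in for one row of the table, the sample values are made up, and `if_null` and `to_training_row` are hypothetical helper names.

```python
def if_null(value, default):
    """Python stand-in for SQL IFNULL: replace only missing values."""
    return default if value is None else value


def to_training_row(session):
    """Mirror the SELECT clause for one session given as a nested dict."""
    totals = session.get("totals") or {}
    device = session.get("device") or {}
    geo = session.get("geoNetwork") or {}
    return {
        # IF(totals.transactions IS NULL, 0, 1) AS label
        "label": 0 if totals.get("transactions") is None else 1,
        # IFNULL(device.operatingSystem, "") AS os
        "os": if_null(device.get("operatingSystem"), ""),
        # device.isMobile AS is_mobile
        "is_mobile": device.get("isMobile"),
        # IFNULL(geoNetwork.country, "") AS country
        "country": if_null(geo.get("country"), ""),
        # IFNULL(totals.pageviews, 0) AS pageviews
        "pageviews": if_null(totals.get("pageviews"), 0),
    }


# Made-up example session (not taken from the real dataset)
example_session = {
    "totals": {"transactions": None, "pageviews": 3},
    "device": {"operatingSystem": "Android", "isMobile": True},
    "geoNetwork": {"country": None},
}
example_row = to_training_row(example_session)
```

A session with no recorded transactions gets label 0, and any session with at least one transaction gets label 1, which is what makes this a binary classification problem.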

And finally the `WHERE` clause specifies the range of tables to use to train our model. In this case it looks like a new table is being produced every day.
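As a quick illustration of what that `WHERE` clause selects, the sketch below lists daily table suffixes and applies the same range check in Python. It assumes one table per day named with a `YYYYMMDD` suffix, as the dataset appears to be laid out; `daily_suffixes` is a hypothetical helper.

```python
from datetime import date, timedelta


def daily_suffixes(start, end):
    """Return YYYYMMDD suffixes for every day from start to end, inclusive."""
    days = (end - start).days + 1
    return [(start + timedelta(days=i)).strftime("%Y%m%d") for i in range(days)]


# Generate a slightly wider window than the training range
candidates = daily_suffixes(date(2016, 7, 25), date(2017, 7, 5))

# The suffixes are compared as strings; YYYYMMDD strings sort in date order,
# so the string range check behaves like a date range check.
in_range = [s for s in candidates if "20160801" <= s <= "20170630"]
```

Under that one-table-per-day assumption, `in_range` holds the suffixes of the tables the training query would read.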

After the first iteration is complete, your model (**sample_model**) appears in the navigation panel of the BigQuery UI. Because the query uses a `CREATE MODEL` statement to create a model, you do not see query results. The output is an empty string.

### Get training statistics

To see the results of the model training, you can use the `ML.TRAINING_INFO` function, or you can view the statistics in the BigQuery UI.

A machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss.
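For a binary classifier like ours, the loss is usually log loss (cross-entropy). The sketch below is the textbook definition in plain Python, not BQML's actual implementation, which may differ in details such as regularization.

```python
import math


def log_loss(labels, probabilities, eps=1e-15):
    """Average log loss for binary labels (0 or 1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        # Clip the probability so math.log never receives 0
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


# Made-up predictions: confident and correct vs. uninformative
good = log_loss([1, 0, 1], [0.9, 0.1, 0.8])
uninformative = log_loss([1, 0, 1], [0.5, 0.5, 0.5])
```

Predictions that put high probability on the correct class give a lower loss, which is what training tries to achieve.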

To see the model training statistics that were generated when you ran the `CREATE MODEL` query, run the following:

In [0]:
%%bigquery
SELECT
  *
FROM
  ML.TRAINING_INFO(MODEL `bqml_tutorial.sample_model`)
ORDER BY iteration

At this point you'll notice that BQML has taken care of some of the common ML decisions for you:

* Splitting into training & evaluation datasets to help detect overfitting

* Early stopping (stopping training when additional iterations would not improve performance on the evaluation set)

* Picking and updating learning rates (starting with a low learning rate and increasing it over time)
 
* Picking an optimization strategy (batch gradient descent for large datasets with high cardinality, normal equation for small datasets where it would be faster)
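To give a feel for the early stopping idea from the list above, here is a toy sketch that stops once the relative improvement in evaluation loss drops below a threshold. This is an illustration only: the loss values and threshold are made up, `stop_iteration` is a hypothetical function, and BQML's real behavior is controlled by `CREATE MODEL` options such as `EARLY_STOP` and `MIN_REL_PROGRESS`.

```python
def stop_iteration(eval_losses, min_rel_progress=0.01):
    """Return the index of the iteration at which training would stop.

    Assumes all losses are positive. Stops at the first iteration whose
    relative improvement over the previous one is below min_rel_progress.
    """
    for i in range(1, len(eval_losses)):
        previous, current = eval_losses[i - 1], eval_losses[i]
        rel_progress = (previous - current) / previous
        if rel_progress < min_rel_progress:
            return i
    # Never fell below the threshold: run through the last iteration
    return len(eval_losses) - 1


# Made-up evaluation losses, one per training iteration
losses = [0.70, 0.50, 0.40, 0.399, 0.398]
stopped_at = stop_iteration(losses)
```

With these numbers, the improvement between the third and fourth values is well under 1%, so the sketch stops there instead of running the remaining iterations.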

### Evaluate your model

After creating your model, you evaluate the performance of the classifier using the `ML.EVALUATE` function.
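For a logistic regression model, the output of `ML.EVALUATE` includes precision, recall, accuracy and F1 score (alongside log loss and ROC AUC). As a reminder of what the first four mean, here is a small sketch that computes them from labels and predictions; the data is made up and `classification_metrics` is a hypothetical helper, not how BQML computes them internally.

```python
def classification_metrics(labels, predictions):
    """Compute precision, recall, accuracy and F1 for binary labels (0 or 1)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1_score": f1}


# Made-up labels and predictions
metrics = classification_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```

Because purchases are rare in this kind of data, accuracy alone can look good even for a model that never predicts a purchase, so precision and recall are worth reading too.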

To run the `ML.EVALUATE` query that evaluates the model, run the following:

In [0]:
%%bigquery
SELECT
  *
FROM ML.EVALUATE(MODEL `bqml_tutorial.sample_model`, (
  SELECT
    IF(totals.transactions IS NULL, 0, 1) AS label,
    IFNULL(device.operatingSystem, "") AS os,
    device.isMobile AS is_mobile,
    IFNULL(geoNetwork.country, "") AS country,
    IFNULL(totals.pageviews, 0) AS pageviews
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE
    _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))


While it's helpful to see these metrics, it's also common to plot the **ROC** curve when evaluating model performance for binary logistic regression. We can do this by using the ML.ROC_CURVE() function.
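If you haven't met ROC curves before: each point is the false positive rate and true positive rate (recall) you get at one classification threshold. The sketch below computes a few such points by hand for made-up scores; `roc_points` is a hypothetical helper and only illustrates the idea, not how `ML.ROC_CURVE` is implemented.

```python
def roc_points(labels, scores, thresholds):
    """Return (false_positive_rate, recall) pairs, one per threshold.

    Assumes the labels contain at least one positive and one negative.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        # Everything scored at or above the threshold is predicted positive
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        points.append((fp / negatives, tp / positives))
    return points


# Made-up labels and predicted probabilities
points = roc_points([1, 0, 1, 0], [0.9, 0.6, 0.4, 0.2], [1.0, 0.5, 0.0])
```

Lowering the threshold moves you from the bottom-left corner of the plot (nothing predicted positive) to the top-right (everything predicted positive).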

> **Pro-tip:** You can save the output of a `%%bigquery` magic cell by putting a variable name to the right of the `%%bigquery` command. Here I've saved the output of the next cell as the variable `roc`.

In [0]:
%%bigquery roc
SELECT
  *
FROM
  ML.ROC_CURVE(MODEL `bqml_tutorial.sample_model`)

In [0]:
# Check out the data that was returned
roc.head()

In [0]:
# Plot our ROC curve
import matplotlib.pyplot as plt

# Plot false positive rate by true positive rate (aka recall)
plt.plot(roc.false_positive_rate, roc.recall)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate (recall)")

### Use your model to predict outcomes

You use your model to predict the number of transactions made by website visitors from each country. And you use it to predict purchases per user.
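Both queries below follow the same pattern: get one `predicted_label` (0 or 1) per session, then sum those labels within a group. Here is that aggregation sketched in plain Python, with made-up rows and a hypothetical helper name.

```python
from collections import Counter


def total_predicted_purchases(rows, key):
    """Sum predicted_label per group, like SUM(...) with GROUP BY and ORDER BY DESC."""
    totals = Counter()
    for row in rows:
        totals[row[key]] += row["predicted_label"]
    # most_common() returns (group, total) pairs sorted by total, descending
    return totals.most_common()


# Made-up prediction rows, one per session
rows = [
    {"country": "United States", "predicted_label": 1},
    {"country": "Canada", "predicted_label": 0},
    {"country": "United States", "predicted_label": 1},
]
by_country = total_predicted_purchases(rows, "country")
```

Swapping the `key` argument for a visitor ID column gives the per-visitor version of the same sum.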

Run the following query to use the model to predict the number of transactions by country:

In [0]:
%%bigquery
SELECT
  country,
  SUM(predicted_label) as total_predicted_purchases
FROM ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
  SELECT
    IFNULL(device.operatingSystem, "") AS os,
    device.isMobile AS is_mobile,
    IFNULL(totals.pageviews, 0) AS pageviews,
    IFNULL(geoNetwork.country, "") AS country
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE
    _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
  GROUP BY country
  ORDER BY total_predicted_purchases DESC
  LIMIT 10

In the next example, you try to predict the number of transactions each website visitor will make. This query is identical to the previous one, except that it selects `fullVisitorId` and groups by it instead of by country.

In [0]:
%%bigquery
SELECT
  fullVisitorId,
  SUM(predicted_label) as total_predicted_purchases
FROM ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
  SELECT
    IFNULL(device.operatingSystem, "") AS os,
    device.isMobile AS is_mobile,
    IFNULL(totals.pageviews, 0) AS pageviews,
    IFNULL(geoNetwork.country, "") AS country,
    fullVisitorId
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE
    _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
  GROUP BY fullVisitorId
  ORDER BY total_predicted_purchases DESC
  LIMIT 10