## ML Challenges

This notebook includes various code snippets mentioned in the first chapter of our Machine Learning Design Patterns book.

In [5]:
!pip3 install google-cloud-bigquery


In [9]:
import pandas as pd
import tensorflow as tf

from sklearn.utils import shuffle
from google.cloud import bigquery

### Repeatability

Because ML training involves inherent randomness (for example, random weight initialization and data shuffling), additional measures are needed to ensure repeatability and reproducibility between training and evaluation runs.

In [10]:
# Setting a random seed in TensorFlow
# Do this before you run training to ensure reproducible evaluation metrics
# You can use whatever value you'd like for the seed
tf.random.set_seed(2)
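
`tf.random.set_seed` only seeds TensorFlow's own random generators. If other parts of a pipeline draw from Python's `random` module or from NumPy's global generator (which scikit-learn falls back to when no `random_state` is given), those may need seeding too. The sketch below is one possible way to do that; `seed_everything` is a hypothetical helper name, not part of any library, and it is meant to be called alongside `tf.random.set_seed`.

```python
import random


def seed_everything(seed):
    """Seed Python's RNG, and NumPy's global RNG when NumPy is installed.

    Hypothetical helper for illustration; call tf.random.set_seed separately.
    """
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed, so there is nothing else to seed


# Re-seeding with the same value reproduces the same sequence of draws
seed_everything(2)
first_run = [random.random() for _ in range(3)]
seed_everything(2)
second_run = [random.random() for _ in range(3)]
print(first_run == second_run)  # True
```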

You also need to consider randomness when preparing your training, test, and validation datasets. To ensure consistency, prepare a shuffled dataset before training by setting a random seed value.

First, let's look at an example of shuffling without a seed. We'll grab some data from the NOAA storms public dataset in BigQuery. You'll need a Google Cloud account to run the cells that use this dataset.

In [8]:
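# This cell only works in Google Colab; outside Colab the import fails with
# ModuleNotFoundError. There, run `gcloud auth application-default login` instead.
from google.colab import auth
auth.authenticate_user()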

Replace `your-cloud-project` below with the name of your Google Cloud project.

In [None]:
%%bigquery storms_df --project your-cloud-project
SELECT
  *
FROM
  `bigquery-public-data.noaa_historic_severe_storms.storms_*`
LIMIT 1000
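
Outside a notebook, or where the `%%bigquery` cell magic is not loaded, the same query can be run through the BigQuery Python client that was imported earlier. This is a minimal sketch, assuming credentials are already configured; `STORMS_SQL` and `load_storms` are hypothetical names chosen for illustration, and `to_dataframe()` needs pandas installed.

```python
# Same query as the %%bigquery cell above
STORMS_SQL = """
SELECT
  *
FROM
  `bigquery-public-data.noaa_historic_severe_storms.storms_*`
LIMIT 1000
"""


def load_storms(project_id):
    """Run STORMS_SQL in the given project and return a pandas DataFrame."""
    # Imported inside the function so that defining it does not require
    # google-cloud-bigquery to be installed
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    return client.query(STORMS_SQL).to_dataframe()


# Usage (requires a Google Cloud project and credentials):
# storms_df = load_storms("your-cloud-project")
```

Because the query has no `ORDER BY`, BigQuery does not guarantee which 1000 rows come back or in what order, which is one more source of run-to-run variation to keep in mind.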

Run the cell below multiple times, and notice that the order of the data changes each time.

In [None]:
storms_df = shuffle(storms_df)
storms_df.head()

Next, repeat the shuffle with a fixed random seed. The row order now stays the same each time you rerun this cell on the same input.

In [None]:
shuffled_df = shuffle(storms_df, random_state=2)
shuffled_df.head()
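
A seeded shuffle is usually followed by cutting the data into training, validation, and test sets. The sketch below shows one way to make that split repeatable using only the standard library; `split_indices` and the 80/10/10 fractions are assumptions for illustration, not something the notebook prescribes.

```python
import random


def split_indices(n_rows, seed, train_frac=0.8, valid_frac=0.1):
    """Shuffle row positions with a fixed seed, then cut them into three slices."""
    indices = list(range(n_rows))
    # A local Random instance keeps the global random state untouched
    random.Random(seed).shuffle(indices)
    train_end = int(n_rows * train_frac)
    valid_end = train_end + int(n_rows * valid_frac)
    return indices[:train_end], indices[train_end:valid_end], indices[valid_end:]


train_idx, valid_idx, test_idx = split_indices(1000, seed=2)
print(len(train_idx), len(valid_idx), len(test_idx))  # 800 100 100
```

With a DataFrame such as `storms_df`, the resulting positions could be passed to `storms_df.iloc[train_idx]` and so on. The same seed yields the same three sets as long as the input rows arrive in the same order.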

### Data drift

It's important to analyze how data changes over time so that your ML models are trained on data that still reflects current conditions. To demonstrate this, we'll use the same NOAA storms dataset as above with a slightly different query.

Let's look at how the number of reported storms has increased over time.

In [None]:
%%bigquery storm_trends --project your-cloud-project
SELECT
  SUBSTR(CAST(event_begin_time AS string), 1, 4) AS year,
  COUNT(*) AS num_storms
FROM
  `bigquery-public-data.noaa_historic_severe_storms.storms_*`
GROUP BY
  year
ORDER BY
  year ASC

In [None]:
storm_trends.head()

As the plot below shows, the number of reported storms has grown over the years, so a model trained only on data from before 2000 would likely make inaccurate predictions about storms today.

In [None]:
storm_trends.plot(title='Storm trends over time', x='year', y='num_storms')
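
Beyond eyeballing the plot, drift can be quantified by comparing summary statistics across time windows. The sketch below computes the relative change in mean yearly count between two periods. The counts are made-up numbers for illustration only, not NOAA figures, and `relative_change` is a hypothetical helper.

```python
from statistics import mean


def relative_change(baseline, recent):
    """Relative change of the recent mean compared with the baseline mean."""
    base = mean(baseline)
    return (mean(recent) - base) / base


# Made-up yearly counts, for illustration only (not real NOAA data)
toy_counts = {1995: 100, 1996: 110, 1997: 120, 2018: 300, 2019: 330, 2020: 360}

before = [n for year, n in toy_counts.items() if year < 2000]
after = [n for year, n in toy_counts.items() if year >= 2000]

change = relative_change(before, after)
print(f"{change:.0%}")  # 200%
```

To apply this to `storm_trends`, note that the query returns `year` as a string, so it would need to be converted to an integer before comparing it against 2000.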

Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License