# GCP Professional Data Engineer
## Serverless Machine Learning with TensorFlow on Google Cloud Platform
### Modules:
- Getting Started with Machine Learning
- Building ML Models with TensorFlow
- Scaling ML models with Cloud ML Engine
- Feature Engineering

### Learning Objectives & Definitions:
- Weights: parameters we optimize
- Batch size: number of examples used to compute the error (and its gradient) in one update step
- Epoch: one pass through the entire dataset
- Gradient descent: iterative process of reducing error by moving the weights against the gradient of the loss
- Training: process of optimizing weights, including gradient descent
- MSE: mean squared error, the loss measure for regression problems
- Cross-entropy: loss measure for classification problems
- Accuracy: measure of success for classification problems
- Precision: accuracy when classifier says "yes"
- Recall: accuracy when truth is "yes"
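The loss and success measures above can be sketched in plain Python. This is a minimal illustration, not course code: the function names are hypothetical, labels are assumed to be binary with 1 meaning "yes", and the cross-entropy shown is the binary form only.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = "yes")."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def precision(y_true, y_pred):
    """Accuracy when the classifier says "yes": TP / (TP + FP)."""
    tp, _, fp, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    """Accuracy when the truth is "yes": TP / (TP + FN)."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def mse(y_true, y_pred):
    """Mean squared error for regression targets."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, probs, eps=1e-12):
    """Average negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return -total / len(y_true)
```

Precision and recall differ only in the denominator: precision divides by everything predicted "yes", recall by everything that truly is "yes".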

## Module 1 Review

1.) Machine learning is a way to derive insights from data by adjusting weights:
- on a model function so outputs are close to labels.

2.) Which of these is a machine learning problem where the outcome to be predicted is a continuous number?
- Regression

3.) What is the role of a neuron in a neural network?
- Combine its inputs to map part of a decision surface

4.) Which of the following definitions are true?
- Epoch refers to one complete pass through a training dataset
- Batch is a small set of examples on which a gradient is computed
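To make the epoch and batch definitions concrete, here is a small sketch of mini-batch gradient descent on a one-feature linear model with MSE loss. It is plain Python rather than TensorFlow, and the function name, the default hyperparameters, and the toy data in the usage note are assumptions chosen for illustration.

```python
import random


def train_linear(xs, ys, epochs=500, batch_size=2, learning_rate=0.05, seed=0):
    """Fit y ~ w * x + b with mini-batch gradient descent on MSE."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0  # the weights we optimize
    indices = list(range(len(xs)))
    for _ in range(epochs):  # one epoch = one complete pass over the dataset
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]  # small set of examples
            # Gradient of the batch MSE with respect to w and b.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # Step against the gradient to reduce the error.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b
```

For example, `train_linear([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])` should recover weights close to `w = 2` and `b = 1`, since the toy data lie exactly on the line `y = 2x + 1`.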

## Lab: Getting Started with TensorFlow
#### Objectives:
- Explore the TensorFlow Python API
- Building a graph
- Running a graph
- Feeding values into a graph
- Find area of a triangle using TensorFlow

#### Task 1: Launch Datalab & Clone Repo

```
# Launch Datalab in zone us-central1-a
datalab create dataengvm --zone us-central1-a

# In a Datalab notebook cell, clone the repo
%bash
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
cd training-data-analyst

# Then open the notebook:
# datalab/training-data-analyst/courses/machine_learning/tensorflow/a_tfstart.ipynb
```

## Module 2 Review

1.) TensorFlow is:
- A software framework for writing portable ML code

2.) In tf.add(a,b), which one of these is a legal value for a?
- tf.constant([5,3,8])

3.) Which of these is a class that will do logistic regression?
- LinearClassifier

4.) Why is TextLineReader an efficient way to read data into TensorFlow?
- It reads data directly into the graph

## Module 3 Review

1.) Cloud ML Engine
- Lets you train your TensorFlow machine learning models at scale
- Hosts trained models to make predictions

2.) In a model to classify x-ray images of legs as "broken" or "not broken", which of these would normally be considered a hyperparameter?
- Number of layers in a neural network
- Number of graylevels into which to quantize image values 

NOT hyperparameters:
- Pixel values from the image (feature)
- Age of patient (feature)

## Lab: Scaling up ML using Cloud ML Engine

#### Objectives: 
- Package up the code
- Find absolute paths to data
- Run the Python module from the command line
- Run locally using gcloud
- Submit training job using gcloud
- Deploy model
- Prediction
- Train on a 1-million row dataset

#### Lab found in sample notebooks folder

## End Lab

## Feature Engineering

#### One-Hot Encoding

If you know the keys beforehand:
```
tf.feature_column.categorical_column_with_vocabulary_list('employeeID', vocabulary_list=['01234', '56789', ...])
```
If your data is already indexed, i.e. has integers 0-N:
```
tf.feature_column.categorical_column_with_identity('employeeID', num_buckets = 5)
```
If you don't have a vocabulary of all possible values:
```
tf.feature_column.categorical_column_with_hash_bucket('employeeID', hash_bucket_size=500)
```
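What the three column types above do can be sketched conceptually in plain Python. These helpers are hypothetical and are not the TensorFlow implementation. In particular, SHA-256 is used here only as a stand-in for a deterministic hash and will not produce the same bucket indices as TensorFlow.

```python
import hashlib


def one_hot_from_vocabulary(value, vocabulary):
    """One-hot encode a value against a vocabulary list known beforehand."""
    # A value outside the vocabulary yields an all-zero vector in this sketch.
    return [1 if value == key else 0 for key in vocabulary]


def one_hot_from_identity(index, num_buckets):
    """One-hot encode data that is already an integer index in [0, num_buckets)."""
    if not 0 <= index < num_buckets:
        raise ValueError("index out of range")
    return [1 if i == index else 0 for i in range(num_buckets)]


def hash_bucket(value, hash_bucket_size):
    """Map an arbitrary string to a bucket index without needing a vocabulary."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size
```

The trade-off with hashing is that two different keys can land in the same bucket, so a larger `hash_bucket_size` means fewer collisions at the cost of a wider encoded vector.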

#### Wide and Deep Neural Nets

Some features are dense, taking many distinct continuous values (e.g. price), while other features are sparse (e.g. binary flags or one-hot encoded categories). To handle both, you can route each kind of feature to the part of the model that suits it, with sparse columns going to the linear ("wide") part and dense columns going to the DNN ("deep") part:
```
model = tf.estimator.DNNLinearCombinedClassifier(
    model_dir=...,
    linear_feature_columns=wide_columns,  # categorical
    dnn_feature_columns=deep_columns,     # numeric
    dnn_hidden_units=[100, 50])
```
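The idea behind combining the two parts can be sketched as a single forward pass in plain Python: a linear model scores the sparse inputs, a small neural network scores the dense inputs, and the two logits are summed before the sigmoid. This is a conceptual sketch under simplifying assumptions (one hidden layer, binary output). The function names and the `params` dictionary layout are hypothetical and do not reflect the estimator's internals.

```python
import math


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def dense_layer(inputs, weights, biases):
    """Fully connected layer; `weights` holds one row of weights per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def wide_and_deep_probability(sparse_inputs, dense_inputs, params):
    """Return P(class = 1) from a combined linear ("wide") and DNN ("deep") model."""
    # Wide part: a linear model over sparse (e.g. one-hot) inputs.
    wide_logit = sum(w * x for w, x in zip(params["wide_weights"], sparse_inputs))
    # Deep part: one hidden layer over dense numeric inputs.
    hidden = relu(dense_layer(dense_inputs,
                              params["hidden_weights"],
                              params["hidden_biases"]))
    deep_logit = sum(w * h for w, h in zip(params["output_weights"], hidden))
    # The two logits are summed, then squashed to a probability.
    logit = wide_logit + deep_logit + params["bias"]
    return 1.0 / (1.0 + math.exp(-logit))
```

Because the logits are summed before the sigmoid, both parts are trained jointly against the same loss rather than as two separate models.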


## Module 4 Review

1.) The training data used in machine learning can often be enhanced by extraction of features from the raw data collected. This is referred to as:
- Feature engineering

2.) Which of these is a way of encoding categorical data?
- `layers.sparse_column_with_keys()`

3.) Which of these is a way of discretizing a continuous variable?
- `layers.bucketized_column()`

[TensorFlow API documentation](https://www.tensorflow.org/api_docs/python/tf/contrib/layers/bucketized_column)
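Discretizing a continuous variable, as a bucketized column does, amounts to finding which interval between sorted boundaries a value falls into. A minimal plain-Python sketch follows. The helper names and the example boundaries are hypothetical, and the convention assumed is that a value equal to a boundary goes into the upper bucket.

```python
import bisect


def bucketize(value, boundaries):
    """Return the bucket index of `value` given sorted bucket boundaries.

    With boundaries [0, 1, 2] the buckets are
    (-inf, 0), [0, 1), [1, 2), [2, +inf), indexed 0 through 3.
    """
    return bisect.bisect_right(boundaries, value)


def bucketize_one_hot(value, boundaries):
    """One-hot encode the bucket index; there are len(boundaries) + 1 buckets."""
    index = bucketize(value, boundaries)
    return [1 if i == index else 0 for i in range(len(boundaries) + 1)]
```

For instance, with boundaries `[0, 1, 2]`, the value `1.5` falls in bucket 2, so its one-hot encoding is `[0, 0, 1, 0]`.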
